HYBRID METHODS IN PATTERN RECOGNITION
World Scientific
Series in Machine Perception and Artificial Intelligence - Vol. 47
HYBRID METHODS IN PATTERN RECOGNITION Editors
H Bunke University of Bern, Switzerland
A Kandel University of South Florida, USA
World Scientific
New Jersey • London • Singapore • Hong Kong
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE*
Editors: H. Bunke (Univ. Bern, Switzerland), P. S. P. Wang (Northeastern Univ., USA)

Vol. 34: Advances in Handwriting Recognition (Ed. S.-W. Lee)
Vol. 35: Vision Interface: Real World Applications of Computer Vision (Eds. M. Cheriet and Y.-H. Yang)
Vol. 36: Wavelet Theory and Its Application to Pattern Recognition (Y. Y. Tang, L. H. Yang, J. Liu and H. Ma)
Vol. 37: Image Processing for the Food Industry (E. R. Davies)
Vol. 38: New Approaches to Fuzzy Modeling and Control: Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikäinen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
*For the complete list of titles in this series, please write to the Publisher.
Published by World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
HYBRID METHODS IN PATTERN RECOGNITION
Series in Machine Perception and Artificial Intelligence, Vol. 47
Copyright © 2002 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4832-6
Printed in Singapore.
Dedicated to The Honorable Congressman C. W. Bill Young House of Representatives for his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida
Preface
The discipline of pattern recognition has seen enormous progress since its beginnings more than four decades ago. Over the years various approaches have emerged, based on statistical decision theory, structural matching and parsing, neural networks, fuzzy logic, artificial intelligence, evolutionary computing, and others. Obviously, these approaches are characterized by a high degree of diversity. In order to combine their strengths and avoid their weaknesses, hybrid pattern recognition schemes have been proposed, combining several techniques into a single pattern recognition system. Hybrid methods have been known for a long time, but they have gained new interest only recently. An example is the area of classifier combination, which has attracted enormous attention over the past few years. The contributions included in this volume cover recent advances in hybrid pattern recognition. In the first chapter by H. Ishibuchi and M. Nii, a novel type of neural network architecture is introduced, which can process fuzzy input data. This type of neural net is quite powerful because it can simultaneously deal with different data formats, such as real or fuzzy numbers and intervals, as well as linguistic variables. The following two chapters deal with hybrid systems that aim at the application of neural networks in the domain of structural pattern recognition. In the second chapter by G. Adorni et al., an extension of the classical backpropagation algorithm that can be applied in the graph domain is proposed. This extension allows us to apply multilayer perceptron neural networks not only to feature vectors, but also to patterns represented by means of graphs. A generalization of self-organizing maps from n-dimensional real space to the domain of graphs is proposed in Chap. 3, by S. Günter and H. Bunke. In particular, the problem of finding the optimal number of clusters in a graph clustering task is addressed.
In Chap. 4, A. Bargiela and W. Pedrycz introduce a general framework for clustering through identification of information granules. It is argued that the clusters, or granules, produced by this method are particularly suitable for hybrid systems. The next two chapters describe combinations of neural networks and hidden Markov models. First, in Chap. 5, G. Rigoll reviews a number of possible combination schemes. Most of them originated in the context of speech and handwriting recognition; however, they are applicable to a much wider spectrum of applications. In Chap. 6, by T. Artieres et al., a system for on-line recognition of handwritten words and sentences is investigated. The main building blocks of this system are a hidden Markov model and a neural net. The following three chapters address the emerging field of multiple classifier systems. First, in Chap. 7, T. K. Ho provides a critical survey of the field. She identifies the lessons learned from previous work, points out the remaining problems, and suggests ways to advance the state-of-the-art. Then, in Chap. 8, F. Roli and G. Giacinto describe procedures for the systematic generation of multiple classifiers and their combination. Finally, in Chap. 9, A. Verikas et al. propose an approach to the integration of multiple neural networks into an ensemble. Both the generation of the individual nets and the combination of their outputs are described. In the final three chapters of the book, applications of hybrid methods are presented. In Chap. 10, A. Klose and R. Kruse describe a system for the interpretation of remotely sensed images. This system integrates methods from the fields of neural nets, fuzzy logic, and evolutionary computation. In Chap. 11, D.-W. Jung and R.-H. Park address the problem of fingerprint identification. The authors use a combination of various methods to achieve robust recognition at a high speed. Last but not least, M. Junker et al. describe a system for automatic text categorization. Their system integrates symbolic rule-based learning with subsymbolic learning using support vector machines. Although it is not possible to cover all current activities in hybrid pattern recognition in one book, we believe that the papers included in this volume are a valuable and representative sample of up-to-date work in this emerging and important branch of pattern recognition. We hope that the contributions are valuable and will be useful to many of our colleagues working in the field.
The editors are grateful to all the authors for their cooperation and the timely submission of their manuscripts. Finally, we would like to thank Scott Dick and Adam Schenker of the Computer Science and Engineering Department at the University of South Florida for their assistance and support.
Horst Bunke, Bern, Switzerland Abraham Kandel, Tampa, Florida August 2001
Contents

Preface (H. Bunke and A. Kandel) ... vii

Neuro-Fuzzy Systems
Chapter 1: Fuzzification of Neural Networks for Classification Problems (H. Ishibuchi and M. Nii) ... 1

Neural Networks for Structural Pattern Recognition
Chapter 2: Adaptive Graphic Pattern Recognition: Foundations and Perspectives (G. Adorni, S. Cagnoni and M. Gori) ... 33
Chapter 3: Adaptive Self-Organizing Map in the Graph Domain (S. Günter and H. Bunke) ... 61

Clustering for Hybrid Systems
Chapter 4: From Numbers to Information Granules: A Study in Unsupervised Learning and Feature Analysis (A. Bargiela and W. Pedrycz) ... 75

Combining Neural Networks and Hidden Markov Models
Chapter 5: Combination of Hidden Markov Models and Neural Networks for Hybrid Statistical Pattern Recognition (G. Rigoll) ... 113
Chapter 6: From Character to Sentences: A Hybrid Neuro-Markovian System for On-Line Handwriting Recognition (T. Artieres, P. Gallinari, H. Li, S. Marukatat and B. Dorizzi) ... 145

Multiple Classifier Systems
Chapter 7: Multiple Classifier Combination: Lessons and Next Steps (T. K. Ho) ... 171
Chapter 8: Design of Multiple Classifier Systems (F. Roli and G. Giacinto) ... 199
Chapter 9: Fusing Neural Networks Through Fuzzy Integration (A. Verikas, A. Lipnickas, M. Bacauskiene and K. Malmqvist) ... 227

Applications of Hybrid Systems
Chapter 10: Hybrid Data Mining Methods in Image Processing (A. Klose and R. Kruse) ... 253
Chapter 11: Robust Fingerprint Identification Based on Hybrid Pattern Recognition Methods (D.-W. Jung and R.-H. Park) ... 275
Chapter 12: Text Categorization Using Learned Document Features (M. Junker, A. Abecker and A. Dengel) ... 301
CHAPTER 1
FUZZIFICATION OF NEURAL NETWORKS FOR CLASSIFICATION PROBLEMS
Hisao Ishibuchi
Department of Industrial Engineering, Osaka Prefecture University
1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan
E-mail: [email protected]

Manabu Nii
Department of Computer Engineering, Himeji Institute of Technology
2167 Shosha, Himeji, Hyogo 671-2201, Japan
E-mail: nii@comp.eng.himeji-tech.ac.jp
This chapter explains the handling of linguistic knowledge and fuzzy inputs in multi-layer feedforward neural networks for pattern classification problems. First we show how fuzzy input vectors can be classified by trained neural networks. The input-output relation of each unit is extended to the case of fuzzy inputs using fuzzy arithmetic. That is, fuzzy outputs from neural networks are defined by fuzzy arithmetic. The classification of each fuzzy input vector is performed by a decision rule using the corresponding fuzzy output vector. Next we show how neural networks can be trained from fuzzy training patterns. Our fuzzy training pattern is a pair of a fuzzy input vector and a non-fuzzy class label. We define a cost function to be minimized in the learning process as a distance between a fuzzy output vector and a non-fuzzy target vector. A learning algorithm is derived from the cost function in the same manner as the well-known back-propagation algorithm. Then we show how linguistic rules can be extracted from trained neural networks. Our linguistic rule has linguistic antecedent conditions, a non-fuzzy consequent class, and a certainty grade. We also show how linguistic rules can be utilized in the learning process. That is, linguistic rules are used as training data. Our learning scheme can simultaneously utilize linguistic rules and numerical data in the same framework. Finally we describe the architecture, learning, and application areas of interval-arithmetic-based
neural networks, which can be viewed as a basic form of our fuzzified neural networks.

1. Introduction

Multilayer feedforward neural networks can be fuzzified by extending their inputs, connection weights and/or targets to fuzzy numbers (Buckley and Hayashi 1994). Various learning algorithms have been proposed for adjusting the connection weights of fuzzified neural networks (for example, Hayashi et al. 1993, Krishnamraju et al. 1994, Ishibuchi et al. 1995a, 1995b, Feuring 1996, Teodorescu and Arotaritei 1997, Dunyak and Wunsch 1997, 1999). Fuzzified neural networks have many promising application areas such as fuzzy regression analysis (Dunyak and Wunsch 2000, Ishibuchi and Nii 2001), decision making (Ishibuchi and Nii 2000, Kuo et al. 2001), forecasting (Kuo and Xue 1999), fuzzy rule extraction (Ishibuchi and Nii 1996, Ishibuchi et al. 1997), and learning from fuzzy rules (Ishibuchi et al. 1993, 1994). The approximation ability of fuzzified neural networks was studied by Buckley and Hayashi (1999) and Buckley and Feuring (2000). Perceptron neural networks were fuzzified in Chen and Chang (2000). In this chapter, we illustrate how fuzzified neural networks can be applied to pattern classification problems. We use multilayer feedforward neural networks with fuzzy inputs, non-fuzzy connection weights, and non-fuzzy targets for handling uncertain patterns and linguistic rules such as "If x1 is small and x
Since each linguistic value is specified by a membership function on the real axis ℝ, linguistic values can be handled in the same framework as fuzzy numbers. Next we discuss the learning of neural networks from fuzzy training patterns. Labeled fuzzy patterns are used as training data. That is, each training pattern is a pair of a fuzzy input vector and its class label. In the same manner as the well-known back-propagation algorithm (Rumelhart et al. 1986), a learning algorithm is derived from a cost function defined by a fuzzy output vector and a non-fuzzy target vector. Then we illustrate the linguistic rule extraction from neural networks. Linguistic rules of the following form are extracted from a neural network trained for an n-dimensional pattern classification problem:

Rule R_p: If x1 is a_p1 and ... and xn is a_pn then Class C_p with CF_p,   (1)

where R_p is the label of the p-th rule, x = (x1, ..., xn) is an n-dimensional pattern vector, a_pi is an antecedent linguistic value on the i-th feature, C_p is a consequent class, and CF_p is a certainty grade. The n antecedent linguistic values are presented as an n-dimensional fuzzy input vector to the trained neural network. The consequent class and the certainty grade are specified based on the corresponding fuzzy output vector. We also discuss the learning of neural networks from linguistic rules of the form in (1). In this case, the antecedent linguistic values are used as a fuzzy input vector as in the fuzzy rule extraction. The corresponding target vector is determined by the consequent class. The certainty grade can be used for adjusting the importance of each linguistic rule in the learning process. Finally, we describe interval-arithmetic-based neural networks. Since fuzzy arithmetic is numerically performed on the level sets (i.e., α-cuts) of the fuzzy input vector, interval-arithmetic-based neural networks can be viewed as a basic form of fuzzified neural networks.
We illustrate some applications of interval-arithmetic-based neural networks to pattern classification problems. For example, they can be used for handling incomplete patterns with missing inputs, where each missing input is represented by an interval including its possible values. They can also be used for decreasing the number of inputs required for the classification of new patterns. We show an interval-arithmetic-based approach where each unmeasured input is represented by an interval including its possible values. When human knowledge is represented by intervals such as "If x1 is in [10, 30] and x2 is in [4, 7] then Class 2", interval-arithmetic-based neural networks can be used for incorporating such knowledge into the learning of neural networks.
2. Classification of Fuzzy Patterns by Trained Neural Networks

In this section, we concentrate our attention on the classification of uncertain patterns by trained neural networks. The learning of neural networks from uncertain training patterns is discussed in the next section.

2.1. Classification Task
Let us assume that a standard three-layer feedforward neural network (Rumelhart et al. 1986) has already been trained for an n-dimensional pattern classification problem with c classes. The number of input units is the same as the dimensionality of the pattern classification problem (i.e., n). The number of hidden units, which is denoted by n_H in this chapter, can be arbitrarily specified. The number of output units is the same as the number of classes (i.e., c). Thus our three-layer feedforward neural network has the n x n_H x c structure. When an n-dimensional real vector x_p = (x_p1, ..., x_pn) is presented to our neural network, the input-output relation of each unit is written as follows (Rumelhart et al. 1986):

[Neural Network Architecture]

Input units:  o_pi = x_pi,  i = 1, 2, ..., n,   (2)

Hidden units:  o_pj = f( Σ_{i=1}^n w_ji o_pi + θ_j ),  j = 1, 2, ..., n_H,   (3)

Output units:  o_pk = f( Σ_{j=1}^{n_H} w_kj o_pj + θ_k ),  k = 1, 2, ..., c.   (4)

In this formulation, w is a connection weight and θ is a bias. We use the following sigmoidal activation function for the hidden and output units:

f(x) = 1 / (1 + exp(−x)).   (5)

Normally the input vector x_p is classified by the output unit with the largest output value. This means that we use the following decision rule:

If o_pk < o_pl for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (6)
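As a concrete illustration of Eqs. (2)-(6), the crisp forward pass and the decision rule can be sketched as follows; the 2 x 2 x 2 network and its weights are hypothetical values chosen for this sketch, not taken from the chapter.

```python
import math

def sigmoid(x):
    # Sigmoidal activation f(x) = 1 / (1 + exp(-x)) of Eq. (5)
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_hid, b_hid, W_out, b_out):
    """Three-layer feedforward pass, Eqs. (2)-(4)."""
    # Hidden units: o_pj = f(sum_i w_ji * x_pi + theta_j)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_row, x)) + b)
              for w_row, b in zip(W_hid, b_hid)]
    # Output units: o_pk = f(sum_j w_kj * o_pj + theta_k)
    return [sigmoid(sum(w * h for w, h in zip(w_row, hidden)) + b)
            for w_row, b in zip(W_out, b_out)]

def classify(outputs):
    # Decision rule (6): pick the output unit with the largest value
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Illustrative 2 x 2 x 2 network with hand-picked weights (not from the chapter)
W_hid = [[4.0, 0.0], [0.0, 4.0]]
b_hid = [-8.0, -8.0]
W_out = [[6.0, -6.0], [-6.0, 6.0]]
b_out = [0.0, 0.0]

o = forward([3.0, 1.0], W_hid, b_hid, W_out, b_out)
print(classify(o))   # class index 0
```

Here the class labels are 0-based indices of the output units; the argmax in `classify` is equivalent to rule (6), which assigns Class l exactly when every other output is smaller.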
Fuzzification
of Neural Networks for Classification
Problems
5
Fig. 1 is an example of the classification boundary generated by a trained neural network using this classification rule. Fig. 1 also shows training data used in the learning of the neural network.
Fig. 1. Classification boundary and training patterns.

Fig. 2. Examples of membership functions of "about 2" and "about 5."
Our task in this section is to classify uncertain patterns represented by fuzzy vectors. For example, let us consider the classification of a fuzzy vector x_A = (2, 2) using the trained neural network in Fig. 1. The meaning of each fuzzy number is mathematically specified by a membership function on the real axis ℝ. For example, the fuzzy number 2 may be defined by a triangular membership function as shown in Fig. 2. Roughly speaking, the membership function μ_2(x) of 2 specifies the possible range of 2 on the real axis ℝ. More specifically, the value of μ_2(x) for a specific input x denotes the extent (i.e., membership grade) to which x is compatible with the fuzzy concept "about 2". The membership function μ_2(x) of 2 in Fig. 2 is written as

μ_2(x) = max{0, 1 − |2 − x|}.   (7)

While the fuzzy vector x_A = (2, 2) involves a certain amount of uncertainty, the neural network may be able to classify x_A as Class 1 because x_A is located far from the classification boundary (see Fig. 1). On the other hand, it seems to be difficult for the neural network to classify another fuzzy vector x_B = (5, 5) because x_B is located near the classification boundary (see Fig. 1 for the location of x_B and Fig. 2 for the membership function of 5). In this section, we mathematically formulate these intuitive discussions as a decision rule for fuzzy input vectors.

2.2. Calculation of Fuzzy Outputs
Let x_p = (x_p1, ..., x_pn) be an n-dimensional fuzzy input vector to our neural network. Note that x_pi can be a real number or an interval because they are represented in the same framework as fuzzy numbers. For example, a real number a and an interval A = [a1, a2] are represented by the following membership functions:

μ_a(x) = 1 if x = a, and μ_a(x) = 0 otherwise.   (8)

μ_A(x) = 1 if a1 ≤ x ≤ a2, and μ_A(x) = 0 otherwise.   (9)

Thus the fuzzy input vector x_p = (x_p1, ..., x_pn) can be a mixture of fuzzy numbers, intervals and real numbers such as x_p = (5, [2, 3], 3.48). When the fuzzy input vector x_p = (x_p1, ..., x_pn) is presented to the neural network, the input-output relation of each unit in (2)-(5) is defined by fuzzy arithmetic (Kaufmann and Gupta 1985). For example, the fuzzy output õ_pj from the j-th hidden unit is calculated by extending the input vector in (2)-(3) to the fuzzy vector x_p = (x_p1, ..., x_pn) as

õ_pj = f( Σ_{i=1}^n w_ji x_pi + θ_j ).   (10)
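The uniform treatment of real numbers, intervals, and fuzzy numbers in Eqs. (8)-(10) can be sketched by reducing every input component to an interval at a given α level; the `("tri", center, spread)` encoding of a symmetric triangular fuzzy number is an assumption made for this sketch, not notation from the chapter.

```python
def alpha_cut(value, alpha):
    """Return the alpha-cut [lower, upper] of one input component.

    A real number a has the singleton cut [a, a] (Eq. 8); an interval
    [a1, a2] is its own cut at every alpha (Eq. 9); a symmetric triangular
    fuzzy number (center, spread) shrinks linearly toward its center.
    """
    if isinstance(value, (int, float)):           # real number, Eq. (8)
        return (float(value), float(value))
    kind, *params = value
    if kind == "interval":                        # interval, Eq. (9)
        lo, hi = params
        return (float(lo), float(hi))
    if kind == "tri":                             # triangular fuzzy number
        center, spread = params
        half = spread * (1.0 - alpha)
        return (center - half, center + half)
    raise ValueError("unknown input kind")

# A mixed fuzzy input vector like x_p = (5, [2, 3], 3.48) from the text
x_p = [("tri", 5.0, 1.0), ("interval", 2.0, 3.0), 3.48]
cuts = [alpha_cut(v, 0.5) for v in x_p]
print(cuts)   # [(4.5, 5.5), (2.0, 3.0), (3.48, 3.48)]
```

With spread 1.0, the triangular component reproduces the membership shape of Eq. (7); all three input formats become ordinary closed intervals, which is what makes the interval-arithmetic feedforward calculation below applicable to mixed inputs.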
In the same manner, the fuzzy output õ_pk from the k-th output unit is calculated as

õ_pk = f( Σ_{j=1}^{n_H} w_kj õ_pj + θ_k ).   (11)

As we can see from (10)-(11), the calculation of the fuzzy outputs õ_pj and õ_pk involves the multiplication of fuzzy numbers by real numbers, the addition of fuzzy numbers, and the nonlinear mapping of fuzzy numbers by the sigmoidal activation function f(·). These operations on fuzzy numbers are illustrated in Fig. 3 and Fig. 4.

Fig. 3. Illustration of w · a and a + b.

Numerical calculations of the fuzzy outputs are performed on the level sets (i.e., α-cuts) of the fuzzy input vector using interval arithmetic (Moore 1979). This is because the fuzzy output from each unit cannot be represented in a simple parameterized form due to its nonlinear shape (see the fuzzy output in Fig. 4). In such numerical calculations, each fuzzy input is first discretized into a set of its α-cuts for some α's (e.g., α = 0.2, 0.4, 0.6, 0.8, 1.0). The α-cut of a fuzzy number a is defined as follows (see Fig. 5):

[a]_α = {x | μ_a(x) ≥ α}  for  0 < α ≤ 1.   (12)
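For the triangular membership function of Eq. (7), the α-cut of Eq. (12) has the closed form [a − (1 − α), a + (1 − α)]. A minimal sketch of the discretization of Fig. 6:

```python
def tri_membership(a, x):
    # Triangular membership mu_a(x) = max{0, 1 - |x - a|}, as in Eq. (7)
    return max(0.0, 1.0 - abs(x - a))

def tri_alpha_cut(a, alpha):
    # For this triangular shape, [a]_alpha = [a - (1 - alpha), a + (1 - alpha)]
    return (a - (1.0 - alpha), a + (1.0 - alpha))

# Discretize the fuzzy number "about 2" into five alpha-cuts, as in Fig. 6
levels = [0.2, 0.4, 0.6, 0.8, 1.0]
cuts = {alpha: tri_alpha_cut(2.0, alpha) for alpha in levels}
# alpha-cuts are nested closed intervals: a higher alpha gives a narrower cut,
# and at alpha = 1.0 the cut collapses to the single point a.
print(cuts[1.0])   # (2.0, 2.0)
```

The nesting property is what allows the fuzzy output shape in Figs. 7 and 8 to be reconstructed level by level from interval computations.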
Fig. 4. Illustration of the fuzzy activation function f(x).

Fig. 5. A fuzzy number a and its α-cut.

Fig. 6. Representation of a fuzzy number a by its five α-cuts for α = 0.2, 0.4, 0.6, 0.8, 1.0.
As shown in Fig. 5, α-cuts of any fuzzy numbers are closed intervals. In Fig. 6, a fuzzy number a is discretized into its five α-cuts for α = 0.2, 0.4, 0.6, 0.8, 1.0. The α-cut of the fuzzy output õ_pk from each output unit is calculated using interval arithmetic from the α-cut of the fuzzy input vector x_p. That is, the input-output relations for fuzzy numbers in (10)-(11) are rewritten for α-cuts (i.e., intervals) as follows (see Buckley and Hayashi (1994), Hayashi et al. (1993), and Ishibuchi et al. (1993)):

[õ_pj]_α = f( Σ_{i=1}^n w_ji [x_pi]_α + θ_j ),   (13)

[õ_pk]_α = f( Σ_{j=1}^{n_H} w_kj [õ_pj]_α + θ_k ).   (14)

The fuzzy output õ_pk from each output unit is constructed by calculating its α-cut [õ_pk]_α for various values of α using interval arithmetic. Details of interval arithmetic in the neural network are given in Section 6 of this chapter.

2.3. Classification of Fuzzy Input Vectors

As we have already mentioned, when we try to classify an uncertain pattern by a trained neural network, we first represent the uncertain pattern as a fuzzy vector x_p = (x_p1, ..., x_pn). Then x_p is used as an input vector to the neural network. The corresponding fuzzy output vector õ_p = (õ_p1, ..., õ_pc) from the neural network is calculated using interval arithmetic on the α-cut of x_p. In this subsection, we discuss the classification of the fuzzy input vector x_p using the fuzzy output vector õ_p. By directly extending the decision rule in (6) to the case of the fuzzy output vector õ_p, we have its fuzzy version:

If õ_pk < õ_pl for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (15)

In this decision rule for the fuzzy input vector x_p, we have to define the inequality relation between the fuzzy numbers õ_pk and õ_pl. Since the fuzzy output vector is numerically calculated by interval arithmetic on the α-cut of the fuzzy input vector x_p, we define the inequality õ_pk < õ_pl based on the α-cuts of õ_pk and õ_pl. We use the following inequality relation between the α-cuts of õ_pk and õ_pl:

[õ_pk]_α < [õ_pl]_α  ⇔  [õ_pk]_α^U < [õ_pl]_α^L,   (16)

where the superscripts "U" and "L" denote the upper limit and the lower limit of an α-cut, respectively (see Fig. 5). Using this inequality relation for
a prespecified value of α, we modify the decision rule in (15) for the fuzzy input vector as

If [õ_pk]_α^U < [õ_pl]_α^L for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (17)

Fig. 7 shows the fuzzy output vector õ_B = (õ_B1, õ_B2, õ_B3) corresponding to the fuzzy input vector x_B = (5, 5) in Fig. 1. In Fig. 7, õ_B1 has no fuzziness (i.e., õ_B1 = 0.0). As shown in this figure, x_B can be classified as Class 3 by our decision rule in (17) when the value of α is larger than 0.43. If the value of α is smaller than 0.43, there is no class that satisfies our decision rule for the fuzzy output vector õ_B in Fig. 7. On the other hand, x_A = (2, 2) can be classified as Class 1 by our decision rule for almost all values of α. Fig. 8 shows the corresponding fuzzy output vector õ_A = (õ_A1, õ_A2, õ_A3). As shown in Fig. 8, each element of õ_A can be viewed as õ_A1 ≅ 1.0, õ_A2 ≅ 0.0, and õ_A3 ≅ 0.0. Thus x_A = (2, 2) is classified as Class 1 regardless of the specification of α in (17).

Fig. 7. Fuzzy output vector õ_B = (õ_B1, õ_B2, õ_B3). The shape of each fuzzy output is depicted by calculating its α-cuts for α = 0.01, 0.02, ..., 1.00.

Fig. 8. Fuzzy output vector õ_A = (õ_A1, õ_A2, õ_A3). The shape of each fuzzy output is depicted by calculating its α-cuts for α = 0.01, 0.02, ..., 1.00.
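A sketch of the interval feedforward calculation of Eqs. (13)-(14) and the decision rule (17). Because the sigmoid is monotone increasing, each output bound comes from the bounds of the weighted-sum interval: a positive weight pairs lower with lower, a negative weight pairs lower with upper. The network weights are hypothetical values chosen for illustration, not taken from the chapter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_layer(cuts, W, b):
    """Propagate interval inputs through one layer (Eqs. 13-14)."""
    out = []
    for w_row, bias in zip(W, b):
        lo = hi = bias
        for w, (xl, xu) in zip(w_row, cuts):
            # sign-split pairing of weight and interval bounds
            lo += w * xl if w >= 0 else w * xu
            hi += w * xu if w >= 0 else w * xl
        out.append((sigmoid(lo), sigmoid(hi)))
    return out

def classify_cut(out_cuts):
    """Decision rule (17): Class l wins iff every other upper bound lies
    below l's lower bound; return None when no class dominates (rejection)."""
    for l, (ll, _) in enumerate(out_cuts):
        if all(ou < ll for k, (_, ou) in enumerate(out_cuts) if k != l):
            return l
    return None

# Illustrative 2 x 2 x 2 network (hand-picked weights, not from the chapter)
W_hid, b_hid = [[4.0, 0.0], [0.0, 4.0]], [-8.0, -8.0]
W_out, b_out = [[6.0, -6.0], [-6.0, 6.0]], [0.0, 0.0]

cuts = [(2.8, 3.2), (0.8, 1.2)]   # an alpha-cut of a fuzzy input vector
hidden = interval_layer(cuts, W_hid, b_hid)
print(classify_cut(interval_layer(hidden, W_out, b_out)))   # class index 0
```

A wide input interval inflates the output intervals until no class dominates, which is exactly the rejection behavior the text describes for x_B at small α.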
As shown in Fig. 7 and Fig. 8, x_A = (2, 2) is classifiable by our decision rule for a wide range of α while x_B = (5, 5) is classifiable only for a narrower range of α. This means that the classification of x_A has a high confidence grade while that of x_B has a low confidence grade. The confidence grade of the classification of the fuzzy input vector x_p can be defined by the width of the range of α for which x_p is classifiable. In the case of Fig. 7, x_B = (5, 5) is classified as Class 3 with confidence grade 0.57. On the other hand, x_A = (2, 2) is classified as Class 1 with confidence grade 1.00. Note that the classification of the fuzzy input vector x_p is rejected regardless of the specification of α if x_p is not classifiable for α = 1.0. When x_p is classifiable for α = 1.0, the confidence grade can be efficiently searched for in the unit interval [0, 1] using a simple bisection method.

3. Training of Neural Networks from Fuzzy Training Patterns

3.1. Learning Task

Let us assume that we have m fuzzy training patterns x_p = (x_p1, ..., x_pn), p = 1, 2, ..., m, each of which has already been classified as one of c classes. Note that x_p may be a mixture of linguistic values, fuzzy numbers, intervals and real numbers, which can be represented in the same framework as fuzzy numbers using membership functions. Fuzzy training patterns may be obtained from human experts as linguistic knowledge such as "If x1 is small and x2 is large then Class 3". This fuzzy rule is handled as a fuzzy training pattern (small, large) labeled as Class 3. Fuzzy training patterns may also be obtained from measurements involving uncertain and/or missing values such as (2.45, ?, 4). Each missing value is represented by an interval including its possible values. For example, if the domain of the second feature of the pattern (2.45, ?, 4) is the unit interval [0, 1], this pattern is transformed into the equivalent fuzzy pattern (2.45, [0, 1], 4) with no missing value. Our task in this section is to train the standard feedforward neural network defined by (2)-(5) using the m fuzzy training patterns x_p = (x_p1, ..., x_pn), p = 1, 2, ..., m.

3.2. Learning Algorithm

For training the neural network, each fuzzy training pattern x_p is used as an input vector. The corresponding fuzzy output vector õ_p = (õ_p1, ..., õ_pc) is
calculated using interval arithmetic on the α-cut of x_p. A non-fuzzy target vector t_p = (t_p1, ..., t_pc) for the fuzzy output vector õ_p is defined based on the given classification of the training pattern:

t_pk = 1 if x_p belongs to Class k, and t_pk = 0 otherwise, k = 1, 2, ..., c.   (18)

The neural network is trained to decrease the difference between the non-fuzzy target vector t_p and the fuzzy output vector õ_p. Since the α-cut [õ_pk]_α of each fuzzy output õ_pk is calculated in the feedforward calculation, we first consider the difference between the α-cut [õ_pk]_α and the target t_pk. This is the difference between an interval and a real number. Let us define an error measure between [õ_pk]_α and t_pk as

e_pkα = e_pkα^L + e_pkα^U,   (19)

where e_pkα^L and e_pkα^U are defined using the lower limit [õ_pk]_α^L and the upper limit [õ_pk]_α^U of the α-cut [õ_pk]_α as

e_pkα^L = (t_pk − [õ_pk]_α^L)² / 2,   (20)

e_pkα^U = (t_pk − [õ_pk]_α^U)² / 2.   (21)

That is, e_pkα^L and e_pkα^U are squared errors for the lower limit and the upper limit of the α-cut, respectively. This definition of the error measure e_pkα is illustrated in Fig. 9.

Fig. 9. Illustration of e_pkα^L and e_pkα^U. In this figure, t_pk = 1 is assumed.

Using the error measure e_pkα for the α-cut of the fuzzy output õ_pk from the k-th output unit, we define the error measure e_pα for the α-cut of the fuzzy output vector õ_p as

e_pα = Σ_{k=1}^c e_pkα.   (22)

The connection weights and the biases of the neural network can be adjusted by the gradient descent technique as in the standard back-propagation algorithm. For example, the update rule for the connection weight w_ji can be written as

Δw_ji = −η · ∂e_pα / ∂w_ji,   (23)

where η is a learning rate (i.e., a positive constant). The explicit calculation of the partial derivative in the update rule is shown in Section 6 of this chapter. The learning can be weighted by the value of α as

Δw_ji = −η · α · ∂e_pα / ∂w_ji.   (24)

In this case, the α-cut for a high level has a larger effect on the learning than that for a low level. The neural network is trained on all the m fuzzy training patterns and various values of α. Let s be the number of different values of α (i.e., α_1, α_2, ..., α_s). When incremental learning is used, the algorithm is:

[Learning Algorithm]
Step 1: Initialization of the value of α: Let α be α_1.
Step 2: Initialization of the fuzzy input vector: Let x_p be x_1.
Step 3: Feedforward calculation: Using interval arithmetic, calculate the α-cut of the fuzzy output vector õ_p corresponding to the α-cut of the fuzzy input vector x_p.
Step 4: Adjustment: Adjust the connection weights and biases to decrease the error measure e_pα using the gradient descent technique.
Step 5: Update of the training pattern: If the current x_p is not the last fuzzy pattern x_m, replace x_p with the next fuzzy pattern and return to Step 3.
Step 6: Update of the value of α: If the current α is not the last value α_s, replace α with the next value of α and return to Step 2.
Step 7: Termination test: If a pre-specified stopping condition is satisfied, terminate the learning. Otherwise, return to Step 1.
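The learning algorithm (Steps 1-7) with the α-weighted update of Eq. (24) can be sketched as below. Two simplifications are assumed for this sketch: the analytical partial derivatives derived in Section 6 of the chapter are replaced by finite differences, and the fuzzy inputs are taken to be symmetric triangular numbers given as (center, spread) pairs.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_forward(cuts, layers):
    # Propagate interval inputs through each layer (Eqs. 13-14); a positive
    # weight pairs lower with lower, a negative weight pairs lower with upper.
    for W, b in layers:
        nxt = []
        for w_row, bias in zip(W, b):
            lo = hi = bias
            for w, (xl, xu) in zip(w_row, cuts):
                lo += w * xl if w >= 0 else w * xu
                hi += w * xu if w >= 0 else w * xl
            nxt.append((sigmoid(lo), sigmoid(hi)))
        cuts = nxt
    return cuts

def e_p_alpha(layers, cuts, target):
    # Eqs. (19)-(22): squared errors of both interval bounds, summed over outputs
    out = interval_forward(cuts, layers)
    return sum((t - lo) ** 2 / 2 + (t - hi) ** 2 / 2
               for (lo, hi), t in zip(out, target))

def params_of(layers):
    # Enumerate every adjustable weight and bias as (list, index) pairs
    for W, b in layers:
        for row in W:
            for j in range(len(row)):
                yield row, j
        for j in range(len(b)):
            yield b, j

def train(layers, patterns, alphas, eta=0.25, epochs=50, h=1e-5):
    # Steps 1-7 with the alpha-weighted update of Eq. (24); finite differences
    # stand in for the analytical derivatives of Section 6.
    for _ in range(epochs):
        for alpha in alphas:                      # Steps 1 and 6
            for fuzzy_x, target in patterns:      # Steps 2 and 5
                # alpha-cut of a symmetric triangular input (center, spread)
                cuts = [(c - (1 - alpha) * s, c + (1 - alpha) * s)
                        for c, s in fuzzy_x]
                for vec, j in list(params_of(layers)):   # Step 4
                    vec[j] += h
                    e_plus = e_p_alpha(layers, cuts, target)
                    vec[j] -= 2 * h
                    e_minus = e_p_alpha(layers, cuts, target)
                    vec[j] += h
                    vec[j] -= eta * alpha * (e_plus - e_minus) / (2 * h)

# A tiny 2 x 3 x 2 network with random initial weights (illustrative only)
random.seed(1)
layers = [([[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)],
           [0.0, 0.0, 0.0]),
          ([[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
           [0.0, 0.0])]
patterns = [([(1.0, 1.0), (3.0, 1.0)], (1.0, 0.0)),
            ([(5.0, 1.0), (5.0, 1.0)], (0.0, 1.0))]
train(layers, patterns, [0.2, 0.4, 0.6, 0.8, 1.0])
```

The factor `eta * alpha` in the inner update is the α-weighting of Eq. (24), so high-level cuts pull harder on the weights than low-level ones, as the text describes.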
3.3. Numerical
Examples
Example 1: Let us assume that we have the nine fuzzy patterns in Table 1. The membership function of each fuzzy number is assumed to be triangular as shown in Fig. 2. That is, the membership function of a is given by Ha(x) = max{0, 1 - \x - a\} .
(25)
We trained a three-layer feedforward neural network with two input, five hidden, and three output units. In the learning, we used the weighted update rule in (24) for five values of α (i.e., α = 0.2, 0.4, 0.6, 0.8, 1.0). As in the standard back-propagation algorithm, we added a momentum term to our update rule in (24). The learning rate and the momentum constant were specified as 0.25 and 0.9, respectively. We iterated the learning algorithm 1000 times (i.e., 1000 epochs) for each α-cut of each training pattern.

Table 1. Fuzzy training patterns in Example 1.

  p    xp1   xp2   Class label
  1     1     3    Class 1
  2     2     1    Class 1
  3     3     3    Class 1
  4     1     5    Class 2
  5     3     6    Class 2
  6     5     5    Class 2
  7     4     1    Class 3
  8     6     1    Class 3
  9     6     4    Class 3
Fig. 10. Fuzzy training patterns in Example 1 and the classification boundary by the trained neural network.
Fuzzification of Neural Networks for Classification Problems
Fig. 10 shows the classification boundary obtained by the trained neural network, together with the α-cuts of the nine fuzzy patterns for α = 0.2, 0.4, 0.6, 0.8, 1.0. From this figure, we can see that the neural network was successfully trained using the nine fuzzy training patterns in Table 1.

Example 2: Let us assume that we have the six training patterns in Table 2. They are fuzzy, interval, and real vectors. Using this example, we show that neural networks can be trained even when the training data are a mixture of fuzzy numbers, intervals and real numbers. In the same manner as in the previous example, we trained a neural network with two input, five hidden, and three output units. Fig. 11 shows the classification boundary obtained by the trained neural network together with the six training patterns. From this figure, we can see that our approach can handle fuzzy numbers, intervals and real numbers in the same framework. This is because intervals and real numbers, as well as fuzzy numbers, are represented by membership functions in our approach.

Table 2. Fuzzy training patterns in Example 2.

  p    xp1       xp2       Class label
  1    1         3         Class 1
  2    [1, 6]    [1, 2]    Class 1
  3    3.0       3.0       Class 2
  4    [1, 4]    [5, 6]    Class 2
  5    5         5         Class 3
  6    5         3         Class 3
Fig. 11. Fuzzy training patterns in Example 2 and the classification boundary by the trained neural network.
4. Linguistic Rule Extraction from Neural Networks

4.1. Assumptions
In this section, we show how we can extract linguistic rules of the following type from trained neural networks.

Rule Rp : If x1 is ap1 and ... and xn is apn then Class Cp with CFp ,  (26)

where api is an antecedent linguistic value such as "small" and "large". While neural networks simplified by pruning algorithms are usually used for rule extraction in the literature, we do not assume any special network structure or learning algorithm in this section. Our approach is applicable to arbitrary neural networks with the standard feedforward architecture. In this section, we assume that a standard three-layer feedforward neural network in (2)-(5) has already been trained for an n-dimensional pattern classification problem with c classes. We also assume that a set of linguistic values is given for each feature of the pattern classification problem by human users. For example, weight may be described in some situation by the three linguistic values "light", "middle", and "heavy" as shown in Fig. 12. Of course, different sets of linguistic values should be used in different situations.
Fig. 12. Linguistic values "light", "middle", and "heavy."

4.2. Rule Extraction
In our rule extraction method, the antecedent linguistic values ap1, ..., apn are presented to the trained neural network as the fuzzy input vector xp = ap = (ap1, ..., apn). The determination of the consequent class Cp is the same as the classification of the fuzzy input vector ap discussed in Section 2. The certainty grade CFp is specified as the confidence grade of the classification of ap. The calculation of the confidence grade has already been described in Section 2. In this manner, we can determine the consequent class Cp and certainty grade CFp for the antecedent linguistic values ap1, ..., apn using the trained neural network. The value of CFp can be used to decrease the number of extracted linguistic rules. For example, we can specify a lower bound CFmin for CFp. We extract the corresponding linguistic rule Rp only when CFp is larger than or equal to the lower bound CFmin.
The antecedent part of each linguistic rule is specified as a combination of the given linguistic values. Let Ki be the number of the given linguistic values for the i-th feature (i.e., i-th input). In addition to the Ki linguistic values for the i-th feature, we also use "don't care" as an antecedent linguistic value. Thus the total number of combinations of antecedent linguistic values for the n features is (K1 + 1) × ... × (Kn + 1). When the domain of the i-th feature is an interval Di, "don't care" for the i-th feature can be viewed as the interval Di itself. That is, the membership function of "don't care" for the i-th feature is specified as

μ_don't care(xi) = { 1, if xi ∈ Di,
                   { 0, otherwise.     (27)
To illustrate our rule extraction method, we show a simulation result using the trained neural network in Fig. 1. Let us assume that the pattern space in Fig. 1 is [0, 6] × [0, 6]. This means that [0, 6] is the domain of each feature (i.e., D1 = D2 = [0, 6]). In our approach, we need a set of linguistic values for each feature. We assume that the five linguistic values in Fig. 13
Fig. 13. Five linguistic values "small", "medium small", "medium", "medium large", and "large."
are given for each feature. Thus our task is to extract linguistic rules of the following type from the trained neural network in Fig. 1 using the five linguistic values in Fig. 13.

Rule Rp : If x1 is ap1 and x2 is ap2 then Class Cp with CFp .  (28)
In this task, we have (5 + 1) × (5 + 1) combinations of antecedent linguistic values: (ap1, ap2) = (don't care, don't care), (don't care, small), ..., (large, large). Each of these combinations was presented to the trained neural network as a fuzzy input vector. Then the consequent class and the certainty grade were determined by the corresponding fuzzy output vector. In this manner, we extracted linguistic rules from the trained neural network. The threshold value CFmin was specified as CFmin = 0.01. The extracted linguistic rules are shown in Fig. 14, where each real number in parentheses is the certainty grade of the corresponding linguistic rule. In addition to the 25 linguistic rules in Fig. 14, the following rule was also extracted:

If x1 is large and x2 is don't care then Class 3 with 0.08 .  (29)

Since the "don't care" condition can be omitted, this linguistic rule is simplified as

If x1 is large then Class 3 with 0.08 .  (30)
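The enumeration described above can be sketched as follows. The `classify_fuzzy` callback stands in for presenting one antecedent combination to the trained network as a fuzzy input vector and reading off the consequent class and certainty grade; both names are ours, used only for illustration.

```python
from itertools import product


def extract_rules(linguistic_values, classify_fuzzy, cf_min=0.01):
    """Enumerate all (K_1 + 1) x ... x (K_n + 1) antecedent combinations
    (each feature's values plus "don't care") and keep the rules whose
    certainty grade reaches cf_min.

    linguistic_values: list of label lists, one per feature.
    classify_fuzzy:    maps an antecedent tuple to (consequent_class, cf).
    """
    rules = []
    choices = [values + ["don't care"] for values in linguistic_values]
    for antecedent in product(*choices):
        consequent, cf = classify_fuzzy(antecedent)
        if cf >= cf_min:
            rules.append((antecedent, consequent, cf))
    return rules
```

With five linguistic values per feature and two features, exactly 36 combinations are examined, matching the (5 + 1) × (5 + 1) count above.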
From Fig. 1 and Fig. 14, we can see that linguistic rules near the classification boundary have small certainty grades. On the other hand,
[Figure: a 5 × 5 grid of extracted linguistic rules, each cell showing a consequent class with its certainty grade in parentheses; the grades range from 0.04 to 1.00.]
Fig. 14. Extracted linguistic rules. Each real number in parentheses shows the certainty grade of the corresponding linguistic rule.
Fig. 15. Classification boundary induced by the extracted linguistic rules.
linguistic rules far from the classification boundary have large certainty grades. Fig. 15 shows the classification boundary generated by the extracted linguistic rules. To depict the classification boundary, we used a fuzzy reasoning method based on a single winner rule (Ishibuchi et al. 1992, Ishibuchi et al. 1999b). From the comparison between Fig. 1 and Fig. 15, we can see that the classification boundary induced by the extracted linguistic rules is similar to that induced by the trained neural network.
4.3. Rule Selection

Our rule extraction method examines all combinations of antecedent linguistic values. This leads to an exponential increase in the number of examined linguistic rules relative to the dimensionality of the pattern classification problem. A simple trick for avoiding such an exponential increase is to examine only general linguistic rules with a few antecedent conditions. In other words, only linguistic rules with many "don't care" conditions are examined. Let us define the length of a linguistic rule by the number of its antecedent conditions excluding "don't care" conditions. For example, the length of the linguistic rule in (29) is one. Even when the total number of combinations of antecedent linguistic values is huge, the number of general linguistic rules is not large. General (i.e., short) rules are also preferable to specific (i.e., long) rules from the viewpoint of interpretability. It is usually difficult for human users to intuitively understand the meaning of a long linguistic rule with many antecedent conditions. A small number of relevant rules can be selected from a large number of extracted rules using genetic algorithms (Ishibuchi et al. 1997). Let us
assume that r linguistic rules are extracted from the trained neural network. In this case, any subset of the extracted rules can be represented by a binary string of length r. Genetic algorithms (Holland 1975 and Goldberg 1989) can be used to find a small number of linguistic rules that yield a high classification rate. The fitness value of each subset is defined by the number of linguistic rules and the classification rate on the training patterns.

5. Training of Neural Networks from Linguistic Rules

5.1. Learning Task

Our task in this section is to train neural networks using linguistic rules. As in the previous sections, we use a three-layer feedforward neural network with n input and c output units for an n-dimensional pattern classification problem with c classes. We assume that m linguistic rules Rp, p = 1, 2, ..., m, of the form (26) are given for training the neural network.

5.2. Learning from Linguistic Rules with No Certainty Grades

First let us consider a simpler case where the given linguistic rules have no certainty grades. In this case, the linguistic rules can be written as follows:

Rule Rp : If x1 is ap1 and ... and xn is apn then Class Cp ,  p = 1, 2, ..., m.  (31)

These linguistic rules can be viewed as fuzzy training patterns ap = (ap1, ..., apn), p = 1, 2, ..., m. The target vector tp = (tp1, ..., tpc) for each fuzzy training pattern ap is specified by the corresponding consequent class Cp in the same manner as in Section 3. The neural network can be trained in the same manner as in Section 3 using the m input-target pairs (ap, tp), p = 1, 2, ..., m. Since linguistic values, fuzzy numbers, intervals and real numbers can be handled in the same way by representing them as membership functions, various kinds of information (e.g., linguistic rules, fuzzy patterns, interval patterns, and real patterns) can be simultaneously utilized in the learning of neural networks. This is the main advantage of our approach over standard learning algorithms that can handle only numerical data.
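The conversion of a linguistic rule of the form (31) into an input-target pair can be sketched as follows. We assume here the usual target coding in which the output for the consequent class is 1 and the others are 0; Section 3, where the coding is defined, is not reproduced above, so treat this as an assumption.

```python
def rule_to_training_pair(antecedents, consequent_class, n_classes):
    """Convert one linguistic rule (31) into an (input, target) pair.

    antecedents:      tuple of antecedent linguistic values (a_p1, ..., a_pn);
                      this tuple *is* the fuzzy training pattern a_p.
    consequent_class: consequent class C_p, 1-indexed as in the text.
    n_classes:        number of classes c.
    """
    target = [0.0] * n_classes           # t_p = (t_p1, ..., t_pc)
    target[consequent_class - 1] = 1.0   # assumed one-hot coding for Class C_p
    return antecedents, target
```

For example, the rule "If x1 is small and x2 is large then Class 2" in a three-class problem yields the pair (("small", "large"), [0, 1, 0]).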
5.3. Learning from Linguistic Rules with Certainty Grades
When each of the given linguistic rules Rp, p = 1, 2, ..., m, has the certainty grade CFp, it can be used as the weight (or importance) of Rp in the learning of the neural network. More specifically, we can define the cost function e_pα for the α-cut of the fuzzy input vector ap as

e_pα = CFp · Σ_{k=1}^{c} e_pkα ,  (32)
where e_pkα is defined from the fuzzy input vector ap and the target vector tp in the same manner as in Section 3. To illustrate the effect of having a certainty grade for each linguistic rule on the learning of neural networks, let us assume that we have the following two linguistic rules for a two-dimensional pattern classification problem in the pattern space [0, 6] × [0, 6]:

Rule R1 : If x1 is large then Class 1 with CF1 = 1.0 ,  (33)

Rule R2 : If x2 is large then Class 2 with CF2 = 0.3 ,  (34)
where we use the same membership function for the linguistic value "large" as in the previous section (i.e., as in Fig. 13). These two linguistic rules conflict with each other in the input region compatible with both of them (i.e., the area described as "x1 is large and x2 is large"). We also assume that we have the numerical training data in Fig. 16. We trained a neural network with two input, five hidden, and three output units using these numerical
Fig. 16. Classification boundary by the trained neural network for the case of CF1 = 1.0 and CF2 = 0.3.
Fig. 17. Classification boundary by the trained neural network for the case of CF1 = 0.3 and CF2 = 1.0.
data and the two linguistic rules. Fig. 16 shows the classification boundary generated by the trained neural network after 1000 epochs. Let us focus our attention on the conflicting region with large x1 and large x2. As we can see from Fig. 16, this region is classified as Class 1 (i.e., the consequent class of the stronger rule). For comparison, we also trained the same neural network by specifying the certainty grades as CF1 = 0.3 and CF2 = 1.0. In this case, we obtained the classification boundary in Fig. 17, where the conflicting region is classified as Class 2 (i.e., the consequent class of the stronger rule).

6. Interval-Arithmetic-Based Neural Networks

In this section, we explicitly describe interval arithmetic and a learning algorithm for neural networks with interval input vectors. We also describe the learning from expert knowledge and the classification of incomplete patterns with missing inputs. Such classifications can be utilized for decreasing the number of inputs to be measured (i.e., for decreasing the measurement cost).

6.1. Feedforward Calculation

We first show the feedforward calculation in our three-layer feedforward neural network in (2)-(5) for an interval input vector Xp = (Xp1, ..., Xpn) where Xpi = [x^L_pi, x^U_pi]. If we view the interval input vector Xp as the α-cut of the fuzzy input vector xp, the following descriptions are directly related to the fuzzified neural networks in the previous sections.
The interval output Opi = [o^L_pi, o^U_pi] from each input unit i is the same as the interval input Xpi = [x^L_pi, x^U_pi]. Using interval arithmetic, the interval output Opj = [o^L_pj, o^U_pj] from each hidden unit j is calculated as

o^L_pj = f( Σ_{w_ji ≥ 0} w_ji · o^L_pi + Σ_{w_ji < 0} w_ji · o^U_pi + θj ) ,  (35)

o^U_pj = f( Σ_{w_ji ≥ 0} w_ji · o^U_pi + Σ_{w_ji < 0} w_ji · o^L_pi + θj ) .  (36)
u k] from each hidden In the same manner, the interval output Opk = [OpJpk' o^ k, pk unit k is calculated as
°pk
= /| E
w
kj°pj+
yiufc;)>o °pk
= 'l
E
w
E
fej°TO+6'fe
(37)
w
(38)
u)fcj
\wkj>0
kj°pj+
E
kjOpj+ok
t«fcj<0
The following operations on intervals are used in the above calculations:

A + B = [a^L, a^U] + [b^L, b^U] = [a^L + b^L, a^U + b^U] ,  (39)

w · A = w · [a^L, a^U] = [w · a^L, w · a^U] if w ≥ 0 ;  [w · a^U, w · a^L] if w < 0 ,  (40)

f(X) = f([x^L, x^U]) = [f(x^L), f(x^U)] .  (41)
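The feedforward calculation in (35)-(41) can be sketched compactly as follows, assuming the logistic sigmoid as the activation function f (an increasing function, so (41) applies); the function names are ours.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def interval_dot(weights, intervals, bias):
    """Interval-valued net input (Eqs. (39)-(40) applied to (35)-(38)):
    a positive weight takes the matching bound, a negative weight the
    opposite bound."""
    lo = hi = bias
    for w, (l, u) in zip(weights, intervals):
        if w >= 0:
            lo += w * l
            hi += w * u
        else:
            lo += w * u
            hi += w * l
    return lo, hi


def interval_layer(weight_matrix, biases, intervals):
    """One layer of the interval feedforward pass; since the sigmoid is
    increasing, it maps interval bounds to interval bounds (Eq. (41))."""
    out = []
    for weights, b in zip(weight_matrix, biases):
        lo, hi = interval_dot(weights, intervals, b)
        out.append((sigmoid(lo), sigmoid(hi)))
    return out
```

A degenerate interval [x, x] reproduces the ordinary (point-valued) forward pass, which is how real-valued inputs are handled as intervals with no width later in this section.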
6.2. Derivation of Learning Algorithm
As in Section 3, the target vector tp = (tp1, ..., tpc) is specified from the label (i.e., classification) of the interval input vector Xp. A cost function is defined from the interval output vector Op = (Op1, ..., Opc) and the target vector tp as

e_p = Σ_{k=1}^{c} (t_pk − o^L_pk)² / 2 + Σ_{k=1}^{c} (t_pk − o^U_pk)² / 2 .  (42)
In the gradient descent learning, the connection weights w_kj and w_ji are updated as

w_kj^new = w_kj^old − η · ∂e_p/∂w_kj ,  (43)

w_ji^new = w_ji^old − η · ∂e_p/∂w_ji .  (44)
The biases are also updated in the same manner. The partial derivative ∂e_p/∂w_kj is calculated as follows.

(1) When w_kj ≥ 0,

∂e_p/∂w_kj = −δ^L_pk · o^L_pj − δ^U_pk · o^U_pj ,  (45)

where

δ^L_pk = (t_pk − o^L_pk) · o^L_pk · (1 − o^L_pk) ,  (46)

δ^U_pk = (t_pk − o^U_pk) · o^U_pk · (1 − o^U_pk) .  (47)

(2) When w_kj < 0,

∂e_p/∂w_kj = −δ^L_pk · o^U_pj − δ^U_pk · o^L_pj .  (48)

The partial derivative ∂e_p/∂w_ji is calculated as follows.

(1) When w_ji ≥ 0,

∂e_p/∂w_ji = −( Σ_{w_kj ≥ 0} δ^L_pk · w_kj + Σ_{w_kj < 0} δ^U_pk · w_kj ) · o^L_pj · (1 − o^L_pj) · o^L_pi
            − ( Σ_{w_kj ≥ 0} δ^U_pk · w_kj + Σ_{w_kj < 0} δ^L_pk · w_kj ) · o^U_pj · (1 − o^U_pj) · o^U_pi .  (49)
(2) When w_ji < 0,

∂e_p/∂w_ji = −( Σ_{w_kj ≥ 0} δ^L_pk · w_kj + Σ_{w_kj < 0} δ^U_pk · w_kj ) · o^L_pj · (1 − o^L_pj) · o^U_pi
            − ( Σ_{w_kj ≥ 0} δ^U_pk · w_kj + Σ_{w_kj < 0} δ^L_pk · w_kj ) · o^U_pj · (1 − o^U_pj) · o^L_pi .  (50)
6.3. Learning from Expert Knowledge
Expert knowledge may be represented in various forms such as linguistic rules and prototypes. One simple form is based on intervals. Let us assume that we have the following rules for a two-dimensional pattern classification problem with the pattern space [0, 6] × [0, 6]:

If x1 ≤ 2 then Class 1 ,  (51)

If 3 ≤ x1 and 5 ≤ x2 then Class 2 ,  (52)

If 3 ≤ x1 ≤ 5 and 2 ≤ x2 ≤ 4 then Class 3 .  (53)

Since the domain of each feature (i.e., each input) is the interval [0, 6], these rules are rewritten as

If x1 is in [0, 2] and x2 is in [0, 6] then Class 1 ,  (54)

If x1 is in [3, 6] and x2 is in [5, 6] then Class 2 ,  (55)

If x1 is in [3, 5] and x2 is in [2, 4] then Class 3 .  (56)
Thus these rules are handled as three interval training patterns ([0, 2], [0, 6]), ([3, 6], [5, 6]) and ([3, 5], [2, 4]). Using these interval training patterns and some numerical training data, we trained a neural network with two input, five hidden and three output units by the above learning algorithm. In the experiment, real numbers were handled as a special case of intervals with no width. Our learning algorithm was iterated 1000 times (i.e., 1000 epochs) for each pattern, with an added momentum term. In Fig. 18, we show the classification boundary obtained from the trained neural network, together with the training data used in the learning. From this figure, we can see that all the interval and non-interval training patterns are correctly classified by the trained neural network.
Fig. 18. Classification boundary and training patterns.

6.4. Decreasing the Measurement Cost
As we have already explained in Section 2, incomplete input patterns with missing inputs can be represented as interval vectors where each missing input is replaced with its domain interval. Let us assume that we have an input pattern (?, 2) with a missing input. If we use the trained neural network in Fig. 18, this input pattern is represented by an interval vector X_A = ([0, 6], [2, 2]) since the domain of each input in Fig. 18 is the interval [0, 6]. This interval vector X_A is presented to the trained neural network. The corresponding interval output vector O_A = (O_A1, O_A2, O_A3) is calculated as shown in Fig. 19. The three interval outputs totally overlap with each other. Thus we cannot classify the input pattern (?, 2) from the interval output vector. This corresponds to the fact that we cannot
Fig. 19. Interval output vector corresponding to the input pattern (?, 2).
classify the input pattern (?, 2) in Fig. 18. On the other hand, we can classify another input pattern (1, ?) in Fig. 18. Even when the value of the second input x2 is missing, we can classify the input pattern (1, ?) as Class 1 from Fig. 18. This input pattern is handled as an interval input vector X_B = ([1, 1], [0, 6]). The corresponding interval output vector is calculated as O_B = (O_B1, O_B2, O_B3) ≈ ([1, 1], [0, 0], [0, 0]). Thus we can classify the input vector (1, ?) as Class 1 from the interval output vector. Let us describe the above discussion more formally. For classifying an input vector xp with missing inputs by a trained neural network, first such an input vector is denoted by an interval input vector Xp with no missing inputs. Each missing input is represented by its domain interval. Each of the given (i.e., measured) inputs is represented by the equivalent interval with no width (e.g., Xpi = [xpi, xpi]). Note that the interval input vector Xp includes all the possible values of the missing inputs. That is, xp ∈ Xp always holds for any actual value of each missing input. Then the interval input vector Xp is presented to the trained neural network for calculating the interval output vector Op = (Op1, ..., Opc). We use the following classification rule for examining the classifiability of the interval input vector Xp:

If o^U_pk < o^L_pl for k = 1, 2, ..., c (k ≠ l), then classify Xp as Class l.  (57)

Note that the following relation holds if o^U_pk < o^L_pl holds:

o_pk < o_pl for ∀o_pk ∈ O_pk and ∀o_pl ∈ O_pl .  (58)

Thus any input vector xp in Xp is classified as the same class as Xp when Xp is classifiable by our decision rule. In other words, Xp is classifiable by our decision rule only when the whole input region specified by Xp is classified as the same class by the trained neural network. In this case, we can classify the input pattern with missing inputs before acquiring the actual value of each missing input. The above idea can be used for decreasing the measurement cost for classifying new patterns in the classification phase. In standard applications of neural networks to pattern classification problems, all the input values should be measured for classifying new patterns. As we have described, not all input values are always necessary for the classification purpose. For example, a new pattern with x1 = 1 can be classified as Class 1 in Fig. 18 without measuring the second input x2. This example suggests that we may
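The decision rule (57) can be sketched as follows; it reports a 0-indexed class index, or None when the interval outputs overlap and the pattern is unclassifiable (the function name is ours).

```python
def classify_intervals(outputs):
    """Decision rule (57): assign Class l only when the lower bound of
    output l strictly exceeds the upper bounds of all other outputs;
    otherwise the interval input is unclassifiable.

    outputs: list of (lower, upper) interval outputs, one per class.
    Returns the 0-indexed winning class, or None.
    """
    for l, (lower_l, _) in enumerate(outputs):
        if all(upper_k < lower_l
               for k, (_, upper_k) in enumerate(outputs) if k != l):
            return l
    return None
```

For the pattern (1, ?) above, an output vector close to ([1, 1], [0, 0], [0, 0]) satisfies the rule for the first class, while the totally overlapping outputs of the pattern (?, 2) satisfy it for no class.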
be able to decrease the measurement cost if we first measure the first input x1. The second input is to be measured only when the new pattern is not classifiable by the first input alone (i.e., only when (xp1, ?) is not classifiable). We cannot, however, decrease the measurement cost if we first measure the second input x2 in Fig. 18. From the above discussions, we can see that the measurement order should be appropriately specified for efficiently decreasing the measurement cost. Let us assume that we have m training patterns. We also assume that a neural network has already been trained. In this case, we can determine the measurement order of the n inputs (i.e., n features) by the following procedure.

[Determination of the measurement order]
Step 1: Let Ψ = {x1, ..., xn}, where Ψ is the set of unmeasured inputs.
Step 2: Perform the following procedures (a)-(c) for all combinations of (|Ψ| − 1) inputs from Ψ, where |Ψ| denotes the number of inputs in Ψ.
  (a) Select (|Ψ| − 1) inputs from Ψ.
  (b) Classify the m training patterns using only the selected inputs. The other inputs are handled as missing inputs.
  (c) Calculate the classification rate in (b).
Step 3: Replace Ψ with the set of (|Ψ| − 1) inputs that has the highest classification rate in Step 2. If Ψ includes only a single input, stop the algorithm. Otherwise return to Step 2.
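The procedure above is a greedy backward elimination and can be sketched as follows. The `classification_rate` callback stands in for Step 2(b)-(c), i.e., classifying the m training patterns with the trained network while treating the unselected inputs as missing; the names are ours.

```python
from itertools import combinations


def measurement_order(n_inputs, classification_rate):
    """Greedy backward elimination (Steps 1-3 above).

    classification_rate: maps a frozenset of input indices to the
    classification rate achieved using only those inputs (the rest
    handled as missing, i.e., as domain intervals).
    Returns the input indices in the order they should be measured.
    """
    remaining = frozenset(range(n_inputs))  # Step 1: all inputs unmeasured
    dropped = []                            # inputs removed, least useful first
    while len(remaining) > 1:
        # Step 2: try every subset that removes exactly one input
        best = max((frozenset(c)
                    for c in combinations(sorted(remaining), len(remaining) - 1)),
                   key=classification_rate)
        dropped.append(next(iter(remaining - best)))
        remaining = best                    # Step 3
    dropped.append(next(iter(remaining)))
    # The input surviving longest is the most informative: measure it first.
    return dropped[::-1]
```

Since an input eliminated late is one the classifier could least afford to lose, reversing the elimination order yields a measurement order in which the most informative inputs are measured first.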
7. Conclusion

In this chapter, we described several approaches for applying fuzzified neural networks to pattern classification problems. The basic idea in this chapter is to handle different kinds of information (i.e., real numbers, intervals, fuzzy numbers, and linguistic values) in the same framework using membership functions. Real numbers and intervals are viewed as special cases of fuzzy numbers. As a result, human knowledge and numerical data can be simultaneously utilized in the learning of neural networks. Our basic idea makes it possible to classify a mixed input pattern of real numbers, intervals, fuzzy numbers and linguistic values by neural networks. We also
described interval-arithmetic-based neural networks. Our approach to decreasing the measurement cost can be implemented without causing any deterioration in the classification ability, because the measurement for each new pattern is continued until it becomes classifiable. One important issue, which was not discussed in this chapter, is the increase of fuzziness during the feedforward calculation in fuzzified neural networks (Ishibuchi et al. 1999a). Since the input-output relation is defined by fuzzy arithmetic, fuzzy outputs from output units always include excess fuzziness. Such excess fuzziness has a bad effect on the classification of fuzzy input vectors and on linguistic rule extraction. Interval outputs from output units also include excess width. Such excess width has a bad effect on the classification of interval input vectors and on the decrease of the measurement cost.

References

1. Buckley J. J. and Hayashi Y. "Fuzzy neural networks: A survey", Fuzzy Sets and Systems 66, 1-13 (1994).
2. Buckley J. J. and Hayashi Y. "Can neural nets be universal approximators for fuzzy functions?", Fuzzy Sets and Systems 101, 323-330 (1999).
3. Buckley J. J. and Feuring T. "Universal approximators for fuzzy functions", Fuzzy Sets and Systems 113, 411-415 (2000).
4. Chen J. L. and Chang J. Y. "Fuzzy perceptron neural networks for classifiers with numerical data and linguistic rules as inputs", IEEE Trans. on Fuzzy Systems 8, 730-745 (2000).
5. Dunyak J. and Wunsch D. "A training technique for fuzzy number neural networks", Proc. of 1997 IEEE International Conference on Neural Networks, 533-536 (1997).
6. Dunyak J. and Wunsch D. "Fuzzy number neural networks", Fuzzy Sets and Systems 108, 49-58 (1999).
7. Dunyak J. and Wunsch D. "Fuzzy regression by fuzzy number neural networks", Fuzzy Sets and Systems 112, 371-380 (2000).
8. Feuring T. "Learning in fuzzy neural networks", Proc.
of 1996 IEEE International Conference on Neural Networks, 1061-1066 (1996).
9. Goldberg D. E. Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading (1989).
10. Hayashi Y., Buckley J. J., and Czogala E. "Fuzzy neural network with fuzzy signals and weights", International Journal of Intelligent Systems 8, 527-537 (1993).
11. Holland J. H. Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor (1975).
12. Ishibuchi H. and Tanaka H. "An extension of the BP-algorithm to interval input vectors", Proc. of IEEE International Joint Conference on Neural Networks, 1588-1593 (1991).
13. Ishibuchi H., Nozaki K., and Tanaka H. "Distributed representation of fuzzy rules and its application to pattern classification", Fuzzy Sets and Systems 52, 21-32 (1992).
14. Ishibuchi H., Fujioka R., and Tanaka H. "Neural networks that learn from fuzzy if-then rules", IEEE Transactions on Fuzzy Systems 1, 85-97 (1993).
15. Ishibuchi H., Tanaka H., and Okada H. "Interpolation of fuzzy if-then rules by neural networks", International Journal of Approximate Reasoning 10, 3-27 (1994).
16. Ishibuchi H., Kwon K., and Tanaka H. "A learning algorithm of fuzzy neural networks with triangular fuzzy weights", Fuzzy Sets and Systems 71, 277-293 (1995a).
17. Ishibuchi H., Morioka K., and Turksen I. B. "Learning by fuzzified neural network", International Journal of Approximate Reasoning 13, 327-358 (1995b).
18. Ishibuchi H. and Nii M. "Generating fuzzy if-then rules from trained neural networks: Linguistic analysis of neural networks", Proc. of 1996 IEEE International Conference on Neural Networks, 1133-1138 (1996).
19. Ishibuchi H., Nii M., and Murata T. "Linguistic rule extraction from neural networks and genetic-algorithm-based rule selection", Proc. of 1997 IEEE International Conference on Neural Networks, 2390-2395 (1997).
20. Ishibuchi H. and Nii M. "Minimizing the measurement cost in the classification of new samples by neural-network based classifiers", Proc. of 5th International Conference on Soft Computing and Information/Intelligent Systems (IIZUKA'98), 634-637 (1998).
21. Ishibuchi H., Nii M., and Tanaka K. "Decreasing excess fuzziness in fuzzy outputs from neural networks for linguistic rule extraction", Proc. of International Joint Conference on Neural Networks, CD-ROM Proceedings (1999a).
22. Ishibuchi H., Nakashima T., and Morisawa T.
"Voting in fuzzy rule-based systems for pattern classification problems", Fuzzy Sets and Systems 103, 223-238 (1999b). 23. Ishibuchi H. and Nii M. "Neural networks for soft decision making", Fuzzy Sets and Systems 115, 121-140 (2000). 24. Ishibuchi H. and Nii M. "Fuzzy regression using asymmetric fuzzy coefficients and fuzzified neural networks", Fuzzy Sets and Systems 119, 273-290 (2001). 25. Kaufmann A. and Gupta M. M. Introduction to Fuzzy Arithmetic, Van Nostrand Reinhold, New York (1985). 26. Krishnamraju P. V., Buckley J. J., Reilly K. D., and Hayashi Y. "Genetic learning algorithms for fuzzy neural nets", Proc. of 1994 IEEE International Conference on Fuzzy Systems, 1969-1974 (1994). 27. Kuo R. J. and Xue K. C. "Fuzzy neural networks with application to sales forecasting", Fuzzy Sets and Systems 108, 123-143 (1999).
28. Kuo R. J., Chen C. H., and Hwang Y. C. "An intelligent stock trading decision support system through integration of genetic algorithm based fuzzy neural network and artificial neural network", Fuzzy Sets and Systems 118, 21-45 (2001).
29. Moore R. E. Methods and Applications of Interval Analysis, SIAM Studies in Applied Mathematics, Philadelphia (1979).
30. Rumelhart D. E., McClelland J. L., and the PDP Research Group Parallel Distributed Processing, MIT Press, Cambridge, Massachusetts (1986).
31. Teodorescu H. N. and Arotaritei D. "Analysis of learning algorithms performance for algebraic fuzzy neural networks", Proc. of 1997 International Fuzzy Systems Association World Congress IV, 468-473 (1997).
CHAPTER 2

ADAPTIVE GRAPHIC PATTERN RECOGNITION: FOUNDATIONS AND PERSPECTIVES
Giovanni Adorni and Stefano Cagnoni Dipartimento di Ingegneria dell'Informazione Universita di Parma, Viale delle Scienze, Parma, Italy
Marco Gori Dipartimento di Ingegneria dell'Informazione Universita di Siena, Via Roma, 56, 53100 Siena, Italy In this chapter we propose a new approach to pattern recognition, referred to as adaptive graphic pattern recognition, which lies in between decision-theoretic and structural pattern recognition. In particular we focus on the extension of classic supervised neural network-based approaches to pattern recognition, and show that the classic backpropagation learning scheme can naturally be extended to the case of patterns that are represented by directed ordered acyclic graphs. More general graphs can easily express complex patterns, but we demonstrate that the corresponding extension of classic neural network architectures and learning algorithms is less effective. This extended view of neural networks operating on graphs gives rise to a new wave of connectionist-based techniques. Experimental results on problems of pattern classification and image retrieval clearly indicate the effectiveness of the proposed approach, especially when neither purely structural nor purely subsymbolic representations are appropriate.
1. Introduction

In this chapter we propose a general framework for the development of a novel approach to pattern recognition which is based on learning in graphic domains. The data of these domains simultaneously possess the highly structured representation of classical syntactic and structural approaches,
G. Adorni, S. Cagnoni & M. Gori
and the sub-symbolic capabilities of decision-theoretic models, typical of connectionism and statistics. Preliminary efforts have been made to construct a general framework for these learning schemes.1 In this chapter we focus on the extension of classic neural network-based approaches to pattern recognition, with an emphasis on supervised learning. In particular, we show that the classic backpropagation learning scheme can naturally be extended to the case of patterns that are represented by directed ordered acyclic graphs. Unlike other more general graphical structures, this kind of abstract representation is sometimes a convoluted representation of the original pattern. However, it offers one of the simplest mechanisms for expressing the pattern's structure. More general graphs can easily express complex patterns, but we demonstrate that the corresponding extension of classic neural network architectures and learning algorithms is less effective. In the simplest case, the computation carried out by neural networks on graphs is independent of their nodes. This is in fact a strong hypothesis that dramatically simplifies the learning scheme but, unfortunately, it is not very appropriate to all pattern recognition problems. We briefly review recent attempts to face this problem, as well as other open problems in this area. This extended view of neural networks operating on graphs gives rise to a new wave of connectionist-based techniques for pattern recognition, which is in some sense in between traditional decision-theoretic and structural approaches. Throughout this chapter, this new approach is referred to as adaptive graphic pattern recognition, and its applications to pattern classification and image retrieval tasks are briefly reviewed. The chapter is organized as follows.
In Section 2 we introduce the problem of pattern representation from the two opposite viewpoints of decision-theoretic and structured pattern recognition. Section 3 briefly reviews the most popular approaches to pattern representation involving complex data structures. Section 4 introduces the basic concepts behind neural processing of structured data, while Section 5 proposes the new general framework of graphical pattern recognition. Finally, we draw some conclusions in Section 6 and give a perspective on research in the field.

2. Decision-Theoretic Versus Structural Approaches

In the last three decades, the emphasis in pattern recognition research has swung, pendulum-like, between decision-theoretic and structural approaches.
Adaptive Graphic Pattern Recognition: Foundations and Perspectives
Decision-theoretic methods are essentially based on numerical features that provide a global representation of a pattern via an appropriate preprocessing algorithm. Many different decision-theoretic methods have been developed in the framework of connectionist models, which operate on sub-symbolic pattern representations. On the other hand, syntactic and structural pattern recognition methods (and, additionally, artificial intelligence-based methods) have been developed that emphasize the symbolic nature of patterns. Since their main focus is on expectations that can be derived from previous knowledge of the components detected in the patterns under consideration, such methods are often referred to as "knowledge-based" methods. These different approaches to pattern recognition have given rise to the long-standing debate between traditional AI methods, based on symbols, and computational intelligence methods, which operate on numbers.2 However, both purely decision-theoretic and syntactical/structural approaches are of limited value when applied to many interesting real-world problems, for different reasons. Syntactical and structural methods can model the structure of patterns. However, these methods are not well suited for dealing with patterns corrupted by noise. This limitation was recognized early on, and several approaches have been pursued to incorporate statistical properties into structured approaches. The symbols used for either syntactical or structural approaches have been enriched with attributes, which are in fact vectors of real numbers representing appropriate features of the patterns. These attributes are expected to allow some statistical variability in the patterns under consideration. Error-correction mechanisms have been introduced to deal with either noise or distortions.3 Additionally, symbolic string parsing has been extended using stochastic grammars to model the uncertainty and randomness of the accepted strings.3,4 In these approaches, a probability measure is attached to the productions of the grammar G, and accepted strings can be assigned an attribute representing the probability with which they belong to the class represented by G. Lu and Fu5 combined error-correction parsing and stochastic grammars to attain a better integration with statistical approaches. A comprehensive survey on the incorporation of statistical approaches into syntactical and structural pattern recognition can be found in Ref. 6. Likewise, related approaches that integrate AI-based and decision-theoretic methods can be found in Ref. 7.
On the other hand, either parametric or non-parametric statistical methods can nicely deal with distorted patterns and noise, but they are severely limited in all cases in which the patterns are strongly structured. The feature extraction process in those cases seems to be inherently ill-posed; the features are either global or degenerate to the pixel level. The renewal of interest in artificial neural networks which began in the mid-eighties suggested shifting the emphasis from the complex task of feature selection and extraction to the development of effective architectures and learning algorithms. In principle, neural networks are capable of extracting by themselves optimal features for classification during the learning process. "Learning from examples," which is typical of connectionist models, does not require specific assumptions on the data probability distribution. The field of neural networks, however, has now reached the point where it is necessary to state that neglecting the issue of pattern representation and relying exclusively on learning is neither theoretically nor experimentally justified. There is evidence to claim that complex pattern recognition tasks require architectures with a huge number of parameters to make the loading of the weights effective. This makes generalization to new examples very hard.8 On the other hand, the adoption of architectures with few parameters, which would facilitate generalization, results in very hard optimization problems which are typically populated by many suboptimal local minima. The nature of this problem is partially addressed in the critical analyses on connectionist models by Fodor and Pylyshyn9 and Minsky.10
3. On the Extraction of Structured Representations

As pointed out early on by Wiener, a pattern can often be regarded as an arrangement characterized by the order of the elements of which it is made, rather than by the intrinsic nature of these elements. Hence, the causal, hierarchical, and topological relations between parts of a given pattern yield significant information which seems to be useful in human recognition processes.
As pointed out in the previous section, structured pattern descriptors are usually opposed to feature-based descriptors, and can be regarded as high-level vs. low-level representations, respectively. However, we notice that while representing patterns using a set of features does not require making the structure of the patterns explicit, when using structured
representations one deals with a high level of abstraction and, moreover, can take low-level features into account. If one considers the inherently structured nature of some patterns (e.g. logos, sketches) and the different levels of abstraction at which they can be analyzed, hierarchical pattern representations are a rather natural choice. For the purpose of this chapter, structured representations can be divided into knowledge-based representations, image-partitioning representations, and multi-resolution transforms and representations. In knowledge-based representations, elementary features like edges can be combined to form segments. These can be further combined to compose geometrical shapes that identify objects or object parts. Then, different objects can be related to one another to give a scene its meaning. Such relationships can be of two kinds, the is-a relationship and the part-of relationship, which correspond to abstraction and to the combination of parts into wholes, respectively. The main idea behind image-partitioning representations is to hierarchically subdivide an image until it is decomposed into basic components that present uniform properties. Such "atomic" elements can either be uniform regions or contour segments. Finally, multi-resolution representations of data are based on the observation that, in the real world, objects look different depending on the scale of observation. In the remainder of the section, we will only briefly review image-partitioning representations, which have been mainly adopted so far in the pattern recognition approach proposed in this chapter.^a

3.1. Image-Partitioning Representations
Image-partitioning representations can be positioned half-way between multi-resolution transforms, which translate an image from one domain to another, spreading information over different resolution levels, and most knowledge-based representations, in which an image is hierarchically represented on the basis of a decomposition driven by high-level symbolic processes and logic inferences. Whereas in classical structured approaches the "atomic" parts into which an image is decomposed are geometrical primitives, in partitioning methods the atomic elements are extracted directly by classic image processing techniques. Different partitioning methods are
^a To the best of our knowledge, multi-resolution and knowledge-based representations have not been used so far as graphic inputs to neural networks. However, in principle, they can be utilized in the same manner as image-partitioning representations.
used depending on whether the representation is based on portions of the image or on its edges.

• Region-based representations

Region-based image-partitioning constructs a representation beginning with regions of the image, which are segmented on the basis of predefined uniformity criteria. Binary partition trees provide one of the simplest region-based representations. They are structured representations of the regions that can be obtained from an initial partition: the leaves represent regions that belong to the initial partition, the remaining nodes represent regions that are obtained by appropriately merging regions and, finally, the root node represents the whole image. This kind of representation is closely related to multi-resolution representations. Binary partition trees have been shown to be suitable for several applications, such as information retrieval, segmentation, and filtering.11 However, the representation turns out to be strongly application-dependent, since not all possible mergings of regions belonging to the initial partition are represented in the tree, and their selection depends on the merging criterion. In addition, the choice of the initial partition is arbitrary, and affects the representation significantly. Quad-trees12,13 are amongst the oldest and most widely adopted region-based image-partitioning representations. An extensive survey on quad-tree representations and their applications can be found in Refs. 14 and 15. Octrees16,17 are the three-dimensional extension of the quad-tree. They are built by starting from a cubical volume (pattern) and recursively subdividing this volume into eight congruent disjoint cubes (octants) until a uniformity criterion is satisfied. Several operations can be defined on quad-trees, which are reviewed in Refs. 17 and 18. In particular, for binary images, set-theoretic operations such as union and intersection are quite simple to implement.12
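As a minimal illustration of the quad-tree construction just described (the helper names and the uniformity criterion are illustrative, not from the chapter), the following sketch recursively splits a square binary image into four quadrants until each block is uniform:

```python
def build_quadtree(img):
    """Recursively partition a square binary image (list of lists)
    into a quad-tree. A leaf is ('leaf', value); an internal node is
    ('node', [NW, NE, SW, SE]). Recursion stops when a block is
    uniform, i.e. all its pixels are equal."""
    flat = [p for row in img for p in row]
    if min(flat) == max(flat):              # uniformity criterion
        return ('leaf', flat[0])
    h = len(img) // 2
    nw = [row[:h] for row in img[:h]]
    ne = [row[h:] for row in img[:h]]
    sw = [row[:h] for row in img[h:]]
    se = [row[h:] for row in img[h:]]
    return ('node', [build_quadtree(q) for q in (nw, ne, sw, se)])

# A 4x4 binary image: the left half is 0, the right half is 1.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
tree = build_quadtree(img)
```

For this image the root has four uniform children, one per quadrant; a real implementation would also store the spatial extent of each block and typically bound the recursion depth.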
• Contour-based representations

Contour-based representations use image contours as primitives. Strip trees18-21 are based on a top-down approach to curve approximation. They are binary trees representing a single curve, obtained by repeatedly subdividing the curve into segments. Strip trees are very efficient representations of a curve for applications in computer graphics.22
A chain code23,24 is a vector of integers that samples the orientation of a one-pixel-wide contour at each point. To build the pattern representation, one starts from a point on the contour and looks for an adjacent point in its 3 x 3 neighborhood, using a 3-bit code to represent the relative position of that point. An exponent can be added to each element of the chain, representing how many consecutive pixels share the same direction. This procedure is iterated until the starting pixel is reached again. Chain code representations have the same drawbacks as strip trees, since they require that the image be pre-processed before it can be used. They are also very sensitive to noise (occlusions, impulsive noise), since the chain code can change in length and content with each possible alteration of the contour. Contour tree25 is a general term referring to several hierarchical contour representations. Generally speaking, a contour tree is a data structure that is built according to the following rules:

1. The root corresponds to a closed contour that includes all other contours, if any. Otherwise, the root corresponds to the image border.
2. Each node of the tree corresponds to a closed contour.
3. A contour lying inside another contour is represented as the latter's child.

An example of contour-tree construction from a given pattern is shown in Fig. 1. The contour-tree representation is suitable for many applications in the field of image processing where the representation of concentric iso-surfaces is important.26,27 However, the need to pre-segment the image into closed contours limits the versatility of this representation. To overcome this problem, the closed-contour constraint can be relaxed, and suitable image processing algorithms can be used to extract closed contours (see e.g. Ref. 28). Contour-tree based representations have been primarily used in conjunction with the pattern recognition approach proposed in this chapter.
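A minimal sketch of the 3-bit chain code idea, assuming the contour has already been extracted as a list of adjacent pixels (the direction table follows the common Freeman convention; the names are illustrative, not from the chapter):

```python
# Freeman directions: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE
# (image coordinates: y grows downwards)
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(contour):
    """3-bit Freeman chain code of a closed contour given as a list of
    adjacent (x, y) pixels; the last step closes back to the start."""
    codes = []
    for i in range(len(contour)):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % len(contour)]
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes

# A 2x2 pixel square traversed clockwise.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
codes = chain_code(square)
```

The run-length "exponent" mentioned above would simply compress consecutive repetitions of the same code.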
4. Neural Networks for Structured Pattern Representations

In this section we introduce the basic principles behind adaptive processing of structured information. We focus on supervised learning, extending the backpropagation algorithm for classic multilayer perceptrons to the case of directed acyclic graphs.
Fig. 1. The contour-tree algorithm and the corresponding representation for a company logo. Note that, unlike the typical pre-processing schemes adopted for neural networks, this representation is invariant under rotation.
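The three construction rules above can be sketched as follows; for brevity, contours are reduced to bounding boxes and nesting is tested on boxes, which is only a stand-in for a real point-in-polygon test (all names are illustrative):

```python
def contains(outer, inner):
    """True if the box `inner` lies strictly inside `outer`.
    Boxes are (x0, y0, x1, y1) tuples."""
    (ox0, oy0, ox1, oy1), (ix0, iy0, ix1, iy1) = outer, inner
    return ox0 < ix0 and oy0 < iy0 and ix1 < ox1 and iy1 < oy1

def contour_tree(contours):
    """Build a contour tree from closed contours given as boxes: each
    contour becomes a child of the smallest contour containing it;
    top-level contours hang off a root standing for the image border."""
    def area(c):
        return (c[2] - c[0]) * (c[3] - c[1])
    tree = {c: [] for c in contours}
    tree['border'] = []
    for c in contours:
        parents = [p for p in contours if p != c and contains(p, c)]
        parent = min(parents, key=area) if parents else 'border'
        tree[parent].append(c)
    return tree

outer = (0, 0, 10, 10)
inner = (2, 2, 5, 5)
tree = contour_tree([outer, inner])
```

Here the inner contour is attached as a child of the outer one, which in turn hangs off the image-border root, mirroring rules 1-3 above.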
4.1. Multilayer Perceptrons for Static Representations
Feedforward neural networks29 are directed acyclic graphs whose nodes carry out a forward computation based on any topological sort^b S of the vertices. If we denote by pa[v] the parents of v, then the corresponding neural output is, for each v ∈ S,

x_v = σ( Σ_{z ∈ pa[v]} w_{v,z} · x_z ),

where σ(·) = tanh(·) is the node output function. In the case of multilayer networks, the computational scheme reduces to a pipeline of the layers. Let

^b Topological sorting arises whenever we have a problem involving a partial ordering. For instance, a large glossary containing definitions of technical terms might require topological sorting. We can write w1 ≺ w2 provided that the definition of term w2 depends directly or indirectly on that of term w1. Since we allow no circular definitions, we have a problem of topological sorting; that is, we want to arrange the terms in such a way that no term is used before it has been defined.
C = {(u_a, d_a), u_a ∈ U} be the training set and let M be a feedforward neural network. The degree of "matching" between the desired target d(u) and the response x(u, w) of the neural network can be expressed by

E_M = Σ_{u ∈ U} e(u),    (1)

where e(u) is any metric that yields the distance between d(u) and x(u, w). For instance:

E = (1/2) Σ_{u ∈ U} (x(u, w) − d(u))².    (2)
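The forward computation recalled at the beginning of this section — states evaluated in topological order, with x_v = tanh(Σ_z w_{v,z} x_z) — can be sketched as follows; the graph encoding and all names are illustrative, not from the chapter:

```python
import math

def topological_sort(parents):
    """Naive topological sort of a DAG given as node -> list of parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        for v in parents:
            if v not in placed and all(p in placed for p in parents[v]):
                order.append(v)
                placed.add(v)
    return order

def forward(parents, weights, inputs):
    """Compute x_v = tanh(sum over parents z of w[v,z] * x_z); nodes
    without parents are input nodes and take their value from `inputs`."""
    x = {}
    for v in topological_sort(parents):
        if parents[v]:
            x[v] = math.tanh(sum(weights[(v, z)] * x[z] for z in parents[v]))
        else:
            x[v] = inputs[v]
    return x

# Tiny network: inputs a, b feed hidden node h, which feeds output o.
parents = {'a': [], 'b': [], 'h': ['a', 'b'], 'o': ['h']}
weights = {('h', 'a'): 1.0, ('h', 'b'): -1.0, ('o', 'h'): 2.0}
x = forward(parents, weights, {'a': 0.5, 'b': 0.5})
```

The same scheme covers any feedforward architecture, since a sequence of layers is just one particular topological order of the vertices.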
The optimization of E typically involves adjusting a huge number of parameters. In the case of large optimization problems, the gradient heuristics is commonly used:

dw/dt = −η ∇_w E.    (3)
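In its discrete form, the gradient heuristics of Eq. (3) is the familiar update w ← w − η ∇_w E. A toy sketch on the quadratic error of Eq. (2), with a scalar linear model x(u, w) = w·u standing in for the network (names illustrative, not from the chapter):

```python
def grad_step(w, data, eta=0.1):
    """One gradient descent step on E = 0.5 * sum (x(u,w) - d(u))^2
    for the scalar model x(u, w) = w * u."""
    grad = sum((w * u - d) * u for u, d in data)   # dE/dw
    return w - eta * grad

# Targets follow d(u) = 2u, so the minimum of E is at w = 2.
data = [(1.0, 2.0), (2.0, 4.0)]
w = 0.0
for _ in range(100):
    w = grad_step(w, data)
```

With a non-convex E, as in multilayer networks, the same trajectory can of course stop in a local minimum instead.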
Of course, the trajectory can get stuck in local minima of E(w). The gradient is calculated by Backpropagation, which turns out to be an optimal algorithm in the sense of computational complexity.^c In the last few years, multilayer perceptrons have been extensively used in pattern recognition. Let us consider the typical pattern recognition learning task depicted in Fig. 2. In many applications to pattern recognition, neural-based learning machines simply process de-sampled images so as to provide a more compact, yet sufficiently accurate representation. In so doing, the task of selecting appropriate features is delegated to the learning
^c In the literature there is often confusion in the use of the term "Backpropagation." Many people refer to Backpropagation as the algorithm for performing the gradient descent of the error function, an algorithm based on a forward-backward computation of the gradient. Using a more appropriate meaning of the term, others call Backpropagation simply that special gradient computation scheme. Let m = |w| be the number of weights; then the forward-backward steps take Θ(m). This is obviously also a lower bound on the computational complexity; that is, no other algorithm can perform (asymptotically) the same function more efficiently, since one needs at least to load all the weights to compute the gradient. This was pointed out in Ref. 30. It is worth mentioning that classic numerical gradient computation algorithms based on the evaluation of the function E take O(m²). In many pattern recognition applications where m is a huge number, the Backpropagation algorithm makes optimization-based learning a viable approach, whereas any classic optimization method which does not take the neural network structure into account would get stuck simply in the gradient computation.
Fig. 2. The typical computation carried out by multilayer perceptrons for pattern recognition while choosing a simple pre-processing based on image de-sampling.
process, which is expected to provide feature-based input representations in the first hidden layer. Multilayer neural networks are powerful computational devices, which go well beyond the limitations pointed out by Minsky and Papert for Rosenblatt's perceptrons.10 In spite of this clear advantage, however, in many real-world problems the Backpropagation learning procedure does not yield satisfactory results. Enthusiastic reports of experimental results for some problems and bad failures for others clearly indicate that, for the learning to be effective, a number of architectural solutions and tricks are of crucial importance; at the same time, the learning task itself can be hard to attack. Amongst the different design choices, the effect of the number of parameters seems to be quite clear. Experimental evidence accumulated in many artificial and real-world problems allows us to conclude that when the number of network parameters increases, the Backpropagation learning algorithm has a better chance of finding global minima but, unfortunately, generalisation to new examples is likely to be less effective. Hence, generalisation to new examples and optimal convergence of the learning algorithm look like conjugate variables in quantum mechanics: increasing the number of parameters typically results in better convergence behavior, but the drawback is that generalisation capabilities are likely to decrease, and vice versa.
Adaptive
Graphic Pattern Recognition: Foundations
and Perspectives
43
Fig. 3. The hat cannot be detected robustly by simply inspecting local features at the top of the picture; that would result in a trivial error in the case of policemen with one (or two) raised hand(s). On the other hand, the extraction of a graphical representation simplifies the learning task dramatically.
In pattern recognition, a potential advantage of the scheme depicted in Fig. 2 is that of delegating the feature extraction to the learning process. Unfortunately, there are learning tasks in which this straightforward representation makes the subsequent learning process very hard. Let us consider the artificial learning task depicted in Fig. 3, in which we want to recognise policemen with hats. The artificial example makes it clear that the hat is not always located in the same portion of the picture. The hat cannot be detected robustly by
simply inspecting local features at the top of the picture; that would result in a trivial error in the case of policemen with one (or two) raised hand(s). Hence, in order to learn the concept, we need to incorporate shift invariance. When using multilayer perceptrons, shift invariance and other complex mappings, including scale and rotation invariance, can be implemented with appropriate architectures.^d Unfortunately, the problem is moved to learnability; a huge number of parameters might be required for avoiding local minima in the error function. Consequently, an appropriate generalisation to new examples requires a corresponding growth in the number of examples, which in turn leads to an explosion of the computational cost of the learning.
4.2. Processing Directed Acyclic Graphs
In most interesting problems of pattern analysis and recognition, data are inherently structured. If we consider the policeman in Fig. 3, we immediately conclude that a graphical representation of the pattern is definitely more appropriate than the simple static representation based on the image de-sampling often adopted in conjunction with multilayer perceptrons. Like the data, the model can itself be structured, in the sense that the generic variable x_{i,v} might be independent of q_k^{-1} x_{j,v}, where, following the notation introduced in Ref. 1, q_k^{-1} is the operator which denotes the k-th child of a given node. The structure of independence for some variables represents a form of prior knowledge. For instance, a classic form of independence arises when the connections of any two state variables, x_v and x_w, only take place between components x_{i,v} and x_{i,w} with the same index i. In the case of lists, this assumption means that only local-feedback connections are permitted for the state variables. Likewise, other statements of independence might involve input-state variables and/or state-output variables. An explicit statement of independence can be regarded as a sort of prior knowledge on the transduction that the machine is expected to learn. In general, these statements can also differ from node to node and can be conveniently expressed by a graphical structure that is referred to as a recursive network. An example of a recursive network R is shown on the left side of Fig. 4. In this case, R is simply a graph which states the full dependency of
^d In particular, multilayer neural networks guarantee universal approximation even with one hidden layer, provided that enough units are adopted in the hidden layer.
Fig. 4. Compiling the encoding network from the recursive network and the given data structure by the function ψ_r.
the state variable from the states of the children q_1^{-1} x_v, q_2^{-1} x_v, q_3^{-1} x_v and the label u_v attached to the node. Unlike in Fig. 4, in many real-world problems the knowledge in a recursive network R yields topological constraints that often make it possible to cut the number of trainable parameters significantly. Let us consider a directed ordered graph. For any node v one can identify a set, possibly empty, of ordered children ch[v]. Let x_{ch[v]} be the state associated with the set ch[v] and θ be the vector of learning parameters. The state x_v and the output y_v of each node v follow the equations

x_v = f(x_{ch[v]}, u_v, v, θ),    y_v = g(x_v, u_v, v, θ).    (4)
This is a straightforward extension of classic causal models in system theory. The hypothesis of dealing with directed acyclic graphs turns out to be useful for carrying out a forward computation, and the hypothesis of considering ordered sets of children is used in order to define the position of the parameters in the functions f and g. Alternatively, one can keep essentially the same computational scheme for directed positional acyclic graphs, in which the children of each node are associated with an integer. The difference with respect to directed ordered acyclic graphs is that the latter consider only ordered sets of children, and do not include the case in which the children of a given node are not given in a sequential ordering. For instance, in Fig. 3, the difference between the two patterns is kept in the graphical representation in the case of directed positional acyclic graphs, but is lost in the case of a representation based on directed acyclic graphs. Given
the recursive network R and any DOAG u, we can construct an encoded representation of u on the basis of the independence constraints expressed by R, that is, u_r = ψ_r(R, u). The scheme adopted for compiling u_r is depicted in Fig. 4, while a detailed description of the mathematical process involved is given in Ref. 1. From the encoding network depicted in Fig. 4 we can see a pictorial representation of the computation taking place in the recursive neural network. Each nil pointer, represented by a small box, is associated with a frontier state x_v, which is in fact an initial state used to terminate the recursive equation. The graph plays its own role in the computation, either because of the information attached to its nodes or because of its topology. Any formal description of the computation on the input graph requires sorting the nodes, so as to define for which nodes the state must be computed first. As already pointed out for the computation of the activation of the neurons of a feedforward neural network, the computation can be based on any topological sorting. One can use a data-flow computation model where the state of a given node can only be computed once all the states of its children are known. To some extent, the computation of the output y_v can be regarded as a transduction of the input graph u to an output y with the same skeleton^e as u. These IO-isomorph transductions are the direct generalisation of the classic concept of transduction of lists. When processing graphs, the concept of IO-isomorph transductions can also be extended to the case in which the skeleton of the graph is modified. Because of the kind of problems considered in this chapter, however, this case will not be treated. The classification of DOAGs is in fact the most important IO-isomorph transduction for applications to pattern recognition.
The output of the classification process corresponds to y_s, that is, the output value of the variables attached to the supersource in the encoding network. Basically, when the focus is on classification, we disregard all the outputs y_v of the IO-isomorph transduction apart from the final value y_s of the forward computation. The information attached to the recursive network, however, needs to be integrated with a specific choice of the functions f and g which must be suitable for learning the parameters θ. The connectionist assumption for the functions f and g turns out to be adequate especially to fulfill computational
^e The skeleton of a graph is the structure of the data regardless of the information attached to the nodes.
complexity requirements. The extension to the case of DOAGs is straightforward. Let o be the maximum outdegree of the given directed graph. The dependence of node v on its children ch[v] can be expressed by pointer matrices A_v(k) ∈ ℝ^{n×n}, k = 1, …, o. Likewise, the information attached to the nodes can be propagated by a weight matrix B_v ∈ ℝ^{n×m}. Hence, the first-order connectionist assumption yields

x_v = σ( Σ_{k=1}^{o} A_v(k) · q_k^{-1} x_v + B_v · u_v ).    (5)

The output can be computed by means of y_v = σ(C_v · x_v + D_v · u_v), where C_v ∈ ℝ^{p×n} and D_v ∈ ℝ^{p×m}. Hence the learning parameters can be grouped for each node of the graph in

θ_v = {A_v(1), …, A_v(o), B_v, C_v, D_v}.
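A sketch of the first-order unit of Eq. (5) with randomly initialized matrices; the dimensions n, m, o are illustrative, and nil children are mapped to null frontier states:

```python
import numpy as np

def node_state(x_children, u, A, B):
    """First-order recursive unit of Eq. (5):
    x_v = tanh(sum_k A[k] @ x_child_k + B @ u_v)."""
    s = B @ u
    for k, xc in enumerate(x_children):
        s = s + A[k] @ xc
    return np.tanh(s)

n, m, o = 3, 2, 2                 # state size, label size, max outdegree
rng = np.random.default_rng(0)
A = [rng.standard_normal((n, n)) for _ in range(o)]   # pointer matrices
B = rng.standard_normal((n, m))                       # label matrix

x_nil = np.zeros(n)               # frontier (nil) state
x_v = node_state([x_nil, x_nil], np.array([1.0, -1.0]), A, B)
```

The output map y_v = σ(C_v x_v + D_v u_v) has exactly the same shape with its own matrices C and D.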
The most attractive feature of the connectionist assumptions for f and g is that they provide universal approximation capabilities by means of a graphical structure with units of a few different types (e.g. the sigmoid). A strong consequence of this graphical representation of f and g is that, for any input graph, an encoding neural network can be created which is itself a graph with neurons as nodes. Hence, the connectionist assumption makes it possible to go one step beyond the general independence constraints expressed by means of the concept of recursive network. The encoding neural network u_n associated with Eq. (5) is constructed by replacing each node of the encoding network u_r with the chosen connectionist map; that is, φ_n = ψ_n ∘ ψ_r : u → u_n = φ_n(u) = ψ_n(ψ_r(R, u)).
The construction of the encoding neural network u_n from the encoding network u_r is depicted in Fig. 5. In the particular case of stationary models, the parameters θ_v are independent of the node v. Encoding neural networks turn out to be weighted graphs; that is, there is always a real variable (a weight) attached to the edges. Note that the architectural choice expressed by Eq. (5) can easily be extended so as to express the functions f and g by general feedforward neural architectures. Of course, the composition of directed acyclic graphs (data) with the local node computation based on feedforward neural networks, which are directed acyclic
Fig. 5. The construction of a first-order recursive neural network from the encoding network of Fig. 4. The construction holds under the assumption that the frontier states are null.
graphs, yields in general encoding neural networks which are still acyclic graphs. As a result, the supervised learning of a given set of DOAGs results in the supervision of the corresponding encoding neural networks. Because of the stationarity hypothesis, the parameters are independent of the node and, therefore, the learning of the weights θ can be framed as an optimization problem. We can thus use the Backpropagation algorithm for training. Since the backpropagation of the error takes place on neural networks which encode the structure of the given examples, the corresponding algorithm for the gradient computation is, in this case, referred to as Backpropagation through structure.1,31 This algorithm uses the classical forward and backward steps, the only difference being that the parameters of the different encoding neural networks must be shared.

4.3. Cycles, Non-stationarity, and Beyond
The computation scheme proposed for directed acyclic graphs is a straightforward extension of the case of static data. The hypothesis that the children of each node are ordered is fundamental, and allows us to attach the appropriate set of weights to each child. The assumption that the graph is acyclic yields acyclic encoding neural networks for which the Backpropagation algorithm holds. Finally, the stationarity hypothesis makes it possible to attach the same set of weights to each node. The hypothesis of dealing with ordered graphs can be relaxed in different ways. A straightforward solution is to share the pointer matrices A_k among the children. In so doing, a unique matrix A is used for all the children, which overcomes the problem of defining the position in the computation of the functions f and g. Alternatively, given ch[v] one can consider the set of its permutations P(ch[v]) and calculate the functions f and g by an appropriate sharing of the weights.32 The second solution is more general than the first in terms of computational capabilities, but turns out to be effective only when the outdegree of the graphs is quite small; otherwise, the cardinality of P(ch[v]) explodes. The construction of the encoding neural network gives rise to feedforward neural networks in the case of acyclic graphs. As shown in Fig. 6, for general graphs and directed graphs with cycles, the same construction produces a recurrent neural network. As a result, the computation of each graph cannot be performed by a simple forward step. The feedback loops in the neural network can produce complex dynamics, which do not necessarily correspond to convergence to an equilibrium point. It is worth mentioning that cyclic and undirected pattern representations can be extracted in a more natural way than directed ordered graphs. However, the drawback of this approach is that the corresponding learning process is significantly more expensive.
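The first relaxation above — a unique pointer matrix A shared among all children — can be sketched as follows; since the children's states are simply summed, the resulting state is invariant to any permutation of ch[v] (names illustrative, not from the chapter):

```python
import numpy as np

def shared_pointer_state(x_children, u, A, B):
    """Unordered-children variant of the first-order unit: a single
    pointer matrix A is shared among all children, so
    x_v = tanh(A @ sum_k x_child_k + B @ u_v)
    does not depend on the order of ch[v]."""
    return np.tanh(A @ sum(x_children) + B @ u)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))        # shared pointer matrix
B = rng.standard_normal((3, 2))        # label matrix
u = np.array([0.3, -0.7])
c1, c2 = rng.standard_normal(3), rng.standard_normal(3)

x1 = shared_pointer_state([c1, c2], u, A, B)
x2 = shared_pointer_state([c2, c1], u, A, B)   # permuted children
```

The second relaxation — combining f and g over all permutations in P(ch[v]) — is more expressive, but costs on the order of |ch[v]|! evaluations per node, which is why it only pays off for small outdegrees.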
In general, given a planar graph, one can construct a corresponding DOAG provided that an anchor node is also specified. Unfortunately, in pattern recognition one cannot always rely on the availability of such an anchor; there are cases in which the corresponding graphical extraction is likely not to be very robust.
G. Adorni, S. Cagnoni & M. Gori
Fig. 6. The encoding of cyclic graphs yields cyclic encoding networks, which, in turn, give rise to recurrent neural network architectures.
The proposed models represent a natural extension of the processing of sequences by causal dynamical systems. In pattern recognition, the hypothesis of causality could profitably be removed, since there is no need to carry out an on-line computation at node level. Having homogeneous computations at node level may not be adequate for many pattern recognition problems. This has already been pointed out in Ref. 33, where a simple solution was adopted to account for non-stationarity: the graphs are partitioned into different sets depending on the number of nodes, and are processed separately. A more general computational scheme has been devised in Ref. 34, where a linguistic description of non-stationarity is given, which is used to compile the encoding neural networks.

5. Graphical Pattern Recognition

The term adaptive graphical pattern recognition was first introduced in Ref. 33, but early experiments using this approach were carried out in Ref. 35. Graphs are either in the data or in the computational model. The adopted connectionist models inherit the structure of the data graph and, moreover, they have their own graphical structure that expresses the
Adaptive Graphic Pattern Recognition: Foundations and Perspectives
dependencies on the single variables. Basically, graphical pattern recognition methods integrate domain structure into decision-theoretic models. The structure can be introduced at two different levels. First, we can introduce a bias on the map (e.g. receptive fields). In so doing, the pattern of connectivity in the neural network is driven by the prior knowledge in the application domain. Second, each pattern can be represented by a corresponding graph. As put forward in the previous section, the hypothesis of directed ordered graphs can be profitably exploited to generalize the forward and backward computation of classical feedforward networks. The proposed approach can be pursued in most interesting pattern recognition problems. In this chapter we focus attention on supervised learning schemes, but related extensions have recently been conceived for unsupervised learning.

5.1. Classification
Recursive neural networks seem to be very appropriate for either classification or regression. Basically, the structured input representation is converted to a static representation (the neural activations in the hidden layers), which is subsequently encoded into the required class. This approach shares the advantages and disadvantages of related MLP-based approaches for static data. In particular, the approach is well-suited for complex discrimination problems. The effectiveness of recursive neural networks for pattern classification has been shown in Ref. 36 by massive experimentation on logo recognition. In particular, it has been shown that the network performance is improved by properly filtering the logo image before extracting the data structure. The patterns were represented using trees extracted by an appropriate modification of the contour-tree algorithm. That modification plays a fundamental role in the creation of data structures that enhance the structure of the pattern. The experimental results show that, though in theory the rotation invariance of the contour tree no longer holds, in practice the performance depends only very slightly on the rotation angle. These experimental results indicate that adaptive graphical pattern recognition is appropriate when we need to recognize patterns in the presence of noise, and under rotation and scale invariance. These very promising results suggest that the proposed method nicely bridges decision-theoretic approaches based on numerical features and syntactic and structural approaches.
Network growing and pruning can be successfully used for improving the learning process. It is worth mentioning that recursive neural networks can profitably be used for classification of highly structured inputs, like image document representations by XY-trees. Unfortunately, in this particular kind of application the major limitation turns out to be that the number of classes is fixed in advance, a limitation which is inherited from multilayer networks. Neural networks in structured domains can be used in verification problems, where one wants to establish whether a given pattern belongs to a given class. Unlike pattern classification, one does not know in advance the kind of inputs to be processed. It has been pointed out that sigmoidal multilayered neural networks are not appropriate for this task.37 Consequently, our recursive neural networks are also not appropriate for verification tasks. However, as for multilayer networks, the adoption of radial basis function units suffices to remove this limitation.

5.2. Image Retrieval
The neural networks introduced in this chapter and their related extensions are good candidates for many interesting image retrieval tasks. In particular, the proposed models introduce a new notion of similarity, which is constructed on the basis of user feedback. In most approaches proposed in the literature, queries involve either global or local features, and disregard the pattern structure. The proposed approach makes it possible to retrieve patterns on the basis of a strong involvement of the pattern structure, since the graph topology plays a crucial role in the computation. On the other hand, since the nodes contain a vector of real-valued features, the proposed approach is also able to exploit the sub-symbolic nature of the patterns. Figure 7 shows a possible graphical representation of the images of a given database. The database has been created using an attributed plex grammar as described in Ref. 38. Unlike pattern classification, in which the learning scheme is a straightforward extension of backpropagation for static data, learning the notion of similarity requires the definition of an appropriate target function. For each pair of images, the user provides feedback on how relevant the retrieved image is to the query. Consequently, the learning process consists of adapting the weights so as to incorporate the user feedback. Given any pair of images, the user is asked whether they look similar and is expected to
Fig. 7. Extraction of an appropriate graphical representation from the given images (panels: original image, segmented image, point of view, graph extraction, features insertion, DOAG extraction, extracted graph).
provide a simple Boolean answer. In case the images are not similar (see e.g. Fig. 8), their corresponding points in the hidden layer of the recursive neural network must be moved far apart, whereas in case the images are similar, the corresponding points must be moved close to each other (see e.g. Fig. 9). Let u1 and u2 be the graphical representations of two images for which the user is evaluating the similarity, and let x_s^(1) and x_s^(2) be the corresponding state representations in the hidden layer. The error function for the pair penalizes similar images whose states are far apart and dissimilar images whose states are close:

    E12(θ) = ||x_s^(2) - x_s^(1)||^2,            if u1 and u2 are judged similar and ||x_s^(2) - x_s^(1)|| > d_s,
    E12(θ) = (d_d - ||x_s^(2) - x_s^(1)||)^2,    if u1 and u2 are judged dissimilar and ||x_s^(2) - x_s^(1)|| < d_d,
    E12(θ) = 0,                                  otherwise,
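The pairwise error just described can be sketched as a hinge-style loss on the hidden-layer states. The exact functional form and the threshold names d_s and d_d are assumptions for illustration rather than the authors' exact definition.

```python
import numpy as np

def pair_loss(x1, x2, similar, d_s=0.5, d_d=2.0):
    """Hinge-style pairwise error on the hidden-layer states: a pair
    judged similar is penalized when its states are farther apart than
    d_s, a dissimilar pair when its states are closer than d_d
    (both thresholds are hypothetical names and values)."""
    dist = float(np.linalg.norm(x2 - x1))
    if similar and dist > d_s:
        return dist ** 2             # pull similar pairs together
    if not similar and dist < d_d:
        return (d_d - dist) ** 2     # push dissimilar pairs apart
    return 0.0
```

Summing such terms over the user-labeled pairs yields an error function that can be minimized by gradient descent, in the spirit of backpropagation through structure.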
Fig. 8. The user feedback information ("they do not look similar") is learned by moving the state-based representations in the hidden layer far apart.
Fig. 9. The user feedback information ("but they look similar") is learned by moving the state-based representations in the hidden layer close to each other.
Fig. 10. Typical answer to a query, where we can see the effect of learning.
where d_s and d_d are two thresholds determined experimentally. For any given image, the system proposes the closest patterns in the database. Pairs can properly be extracted and submitted to the learning scheme based on recursive networks, in which the error function is created at run-time as shown above. E12(θ) can be optimized using a gradient-descent scheme whose computation is very much related to backpropagation through structure. A typical example of an answer to a given query is depicted in Fig. 10. A proposal for a more general framework for image retrieval based on graphical representations, along with some experimental results, can be found in Ref. 39.

6. Conclusions

In this chapter we have introduced a new approach to pattern recognition, in which the highly structured pattern representations typical of classical syntactic and structural approaches and the sub-symbolic capabilities of decision-theoretic models, typical of connectionism and statistics, are properly merged. The chapter emphasizes the extension of classic neural network-based approaches to the case in which rich graphical representations are given for the patterns. Statistical approaches to learning can also be extended within the general framework of graphical models (see e.g. Ref. 1). In a sense, the approach described in this chapter, referred to as adaptive graphical pattern recognition, is in between decision-theoretic and
structural pattern recognition, and it is very promising in pattern classification and image retrieval applications. It is worth mentioning that, so far, the application to pattern recognition has been essentially limited to stationary processing of DOAGs. Recent analyses of graphs different from DOAGs, as well as models which allow one to incorporate non-stationarity, have not been applied so far to pattern recognition. On the other hand, we are confident that more appropriate graphical representations of data and non-homogeneous computations at the node level are of crucial importance for further developments in the field.
Acknowledgments

The preliminary ideas behind this chapter emerged at an informal workshop on "Adaptive computation on structured domains", which was held in Siena in March 1997. The content of the chapter is mainly based on the invited talk that Marco Gori gave at SSPR'2000. The authors would like to thank Horst Bunke for fruitful discussions on the links of the proposed approach to structural pattern recognition. Special thanks to Paolo Frasconi, Christoph Goller, Alessandro Sperduti, and Andreas Kuechler, who strongly contributed to the development of the general theory of adaptive computation on structured domains. Finally, Marco Maggini, Markus Hagenbuchner, Michelangelo Diligenti, and Ciro De Mauro provided crucial support for the software development and for experimentation on many different pattern recognition problems.
References

1. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5):768-786, September 1998.
2. J. C. Bezdek. What is computational intelligence? In J. Zurada, editor, Computational Intelligence: Imitating Life, pages 1-12. IEEE Press, 1994.
3. A. V. Aho and T. G. Peterson. A minimum distance error-correcting parser for context-free languages. SIAM Journal of Computing, (4):305-312, 1972.
4. K. S. Fu. Syntactic Pattern Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1982.
5. S. Y. Lu and K. S. Fu. Stochastic error-correction syntax analysis for recognition of noisy patterns. IEEE Transactions on Computers, (26):1268-1276, 1977.
6. W.-H. Tsai. Combining statistical and structural methods. In H. Bunke and A. Sanfeliu, editors, Syntactic and Structural Pattern Recognition: Theory and Applications, chapter 12, pages 349-366. World Scientific, 1990.
7. H. Bunke. Hybrid pattern recognition methods. In H. Bunke and A. Sanfeliu, editors, Syntactic and Structural Pattern Recognition: Theory and Applications, chapter 11, pages 307-347. World Scientific, 1990.
8. E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.
9. J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Connections and Symbols, pages 3-72, 1989. A Cognition Special Issue.
10. M. L. Minsky and S. A. Papert. Perceptrons - Expanded Edition. MIT Press, Cambridge, 1988.
11. P. Salembier and L. Garrido. Binary partition tree as an efficient representation for filtering, segmentation and information retrieval. In IEEE Int. Conference on Image Processing, ICIP'98, volume 2, pages 252-256, Los Alamitos, CA, 1998. IEEE Comp. Soc. Press.
12. G. M. Hunter and K. Steiglitz. Operations on images using quadtrees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:145-153, 1979.
13. C. Dyer, A. Rosenfeld, and H. Samet. Region representation: Boundary codes from quadtrees. Communications of the ACM, 23:171-179, 1980.
14. H. Samet. The quadtree and other related hierarchical data structures. ACM Computer Surveys, 16:187-260, 1984.
15. H. Samet. Spatial data structures. In W. Kim, editor, Modern Database Systems: The Object Model, Interoperability, and Beyond, pages 361-385. Addison Wesley/ACM Press, Reading, MA, 1995.
16. G. M. Hunter. Efficient Computation and Data Structures for Graphics. PhD thesis, Dept. of Electrical Engineering and Computer Science, Princeton University, Princeton, NJ, 1978.
17. D. Meagher. Geometric modelling using octree encoding. Computer Graphics and Image Processing, pages 129-147, 1982.
18. H. Samet and R. E. Webber. Storing a collection of polygons using quadtrees. ACM Transactions on Graphics, 4:182-222, 1985.
19. D. Ballard. Strip trees: a hierarchical representation for map features. In Proc. of the 1979 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pages 278-285, New York, NY, 1979. IEEE.
20. D. Ballard. Strip trees: a hierarchical representation for curves. Communications of the ACM, 24:310-321, 1981.
21. H. Asada and M. Brady. The curvature primal sketch. In M. Caudill and C. Butler, editors, Proc. Workshop on Computer Vision, pages 609-618, Annapolis, MD, 1984.
22. O. Gunther and S. Dominguez. Hierarchical schemes for curve representation. IEEE Computer Graphics & Applications, 13:55-63, 1993.
23. H. Freeman. Computer processing of line-drawing images. ACM Computer Surveys, 6:57-97, 1974.
24. I. Pitas. Digital Image Processing Algorithms. Prentice Hall, Englewood Cliffs, NJ, 1993.
25. H. Freeman. On the encoding of arbitrary geometric configurations. IRE Trans., EC-10:260-268, 1961.
26. I. S. Kweon and T. Kanade. Extracting topographic terrain features from elevation maps. CVGIP: Image Understanding, 59:171-182, 1994.
27. M. van Kreveld, R. van Oostrum, C. Bajaj, V. Pascucci, and D. Schikore. Contour trees and small seed sets for isosurface traversal. In Proc. Thirteenth Annual Symposium on Computational Geometry, pages 212-220, New York, NY, 1997. ACM.
28. R. Jain, R. Kasturi, and B. Schunk. Introduction to Machine Vision. McGraw-Hill, New York, NY, 1995.
29. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, 1986. Reprinted in Ref. 40.
30. M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, January 1992.
31. C. Goller and A. Küchler. Learning task-dependent distributed structure representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pages 347-352, 1996.
32. M. Bianchini, M. Gori, and F. Scarselli. Processing acyclic graphs with recursive neural networks. Technical Report TR-DII-11-2001, Dipartimento di Ingegneria dell'Informazione, Università di Siena, Siena, Italy, 2001.
33. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Graphical pattern recognition. In Proceedings of ICAPR98, pages 425-432, 1998.
34. P. Frasconi, M. Gori, and A. Sperduti. Integration of graphical-based rules with adaptive learning of structured information. In Stefan Wermter and Ron Sun, editors, Hybrid Neural Networks, pages 211-225. Springer Verlag, Vol. 1778, 2000.
35. P. Frasconi, M. Gori, S. Marinai, J. Sheng, G. Soda, and A. Sperduti. Logo recognition by recursive neural networks. In Proceedings of GREC97, pages 144-151, 1998.
36. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Adaptive graphical pattern recognition for the classification of company logos. Pattern Recognition, 34(10):2049-2061, October 2001.
37. M. Gori and F. Scarselli. Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1121-1132, October 1998.
38. H. Bunke, M. Gori, M. Hagenbuchner, C. Irniger, and A. C. Tsoi. Generation of image databases using attributed plex grammars. In Proceedings of GbR2001, pages 200-209, Ischia, Italy, June 2001.
39. C. De Mauro, M. Gori, and M. Maggini. Apex: An adaptive visual information retrieval system. In Proceedings of ICDAR, Seattle, WA, September 2001. To appear.
40. J. A. Anderson and E. Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, 1988.
CHAPTER 3

ADAPTIVE SELF-ORGANIZING MAP IN THE GRAPH DOMAIN
Simon Günter and Horst Bunke
Department of Computer Science, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
E-mail: [email protected], bunke@iam.unibe.ch
A new clustering algorithm, which is an extension of self-organizing map (som) into the domain of graphs, is introduced first. Then two adaptive versions of this algorithm are derived, which are able to find the optimal number of clusters automatically. While one of these adaptive clustering methods needs to be initialized with a number of clusters that is at least as large as the optimal number, the other does not need any prior knowledge about the true number of clusters. The suitability of both adaptive methods is experimentally evaluated. The paper is concluded with a discussion of the applicability of the proposed clustering algorithms, depending on available a priori knowledge about the optimal number of clusters.
1. Introduction

Self-organizing map (som) is a very well established method in the field of neural networks.1 A som can be used to learn the distribution of an unlabeled population of patterns, and approximate it in a lower-dimensional space. Another important application of som is clustering, which is the process of dividing a set of given patterns into groups, or clusters, such that all patterns in the same group are similar to each other, while patterns from different groups are dissimilar. Clustering problems are ubiquitous in pattern recognition. For recent surveys on the most important clustering methods see, for example, Refs. 2 and 3. As a matter of fact, almost all clustering algorithms published in the literature are based on pattern representations in terms of feature vectors. There are only a few works addressing
the clustering of symbolic data structures, in particular graphs.4,5,6 This lack of clustering algorithms for the symbolic domain is an unfortunate restriction, because symbolic data structures have a higher representational power than feature vectors. Many clustering algorithms require the number of clusters to be known beforehand. To overcome this restriction, various cluster validation indices have been proposed.2,3 These indices make it possible to measure the quality of a clustering. However, they suffer from the fact that the underlying clustering algorithm has to be executed multiple times, once for each feasible number of clusters. Therefore, the process of finding the optimal number of clusters automatically by means of cluster validation indices is computationally quite costly. In the present paper two issues will be addressed. First, a graph clustering algorithm that was recently introduced by the authors7,8 will be reviewed. Secondly, an adaptive version of the same algorithm will be described, which is able to find the optimal number of clusters automatically. The paper is organized as follows. In Section 2 the basic concepts and terminology will be introduced. Then the new graph clustering algorithm will be described in Section 3. The following section provides an extension of this algorithm so as to find the optimal number of clusters automatically. Experimental results will be presented in Section 5, and conclusions drawn in Section 6.

2. Preliminaries
In this paper we consider graphs with labeled nodes and edges. Let L_V and L_E denote sets of node and edge labels, respectively. Formally, a graph is a 4-tuple, g = (V, E, μ, ν), where V is the set of nodes, E ⊆ V × V is the set of edges, μ : V → L_V is a function assigning labels to the nodes, and ν : E → L_E is a function assigning labels to the edges. A graph isomorphism from a graph g to a graph g' is a bijective mapping from the nodes of g to the nodes of g' that preserves all labels and the structure of the edges. Graph isomorphism is a useful concept to find out if two patterns are the same, up to invariance properties inherent to the underlying graph representation. Real world objects are usually affected by noise, such that the graph representations of identical objects may not exactly match. Therefore it is necessary to integrate some degree of error tolerance into the graph
matching process. A powerful concept to deal with noise and distorted graphs is error-correcting graph matching using graph edit distance. In its most general form, a graph edit operation is either a deletion, insertion, or substitution (i.e. label change). Edit operations can be applied to nodes as well as to edges. Their purpose is to correct errors and distortions in graphs. Formally, let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. An error-correcting graph matching (ecgm) from g1 to g2 is a bijective function f : V̂1 → V̂2, where V̂1 ⊆ V1 and V̂2 ⊆ V2. We say that node x ∈ V̂1 is substituted by node y ∈ V̂2 if f(x) = y. If μ1(x) = μ2(f(x)) then the substitution is called an identical substitution. Otherwise it is termed a non-identical substitution. Any node from V1 − V̂1 is deleted from g1, and any node from V2 − V̂2 is inserted in g2 under f. The mapping f directly implies an edit operation on each node in g1 and g2, i.e., nodes are substituted, deleted, or inserted, as described above. Additionally, the mapping f indirectly implies substitutions, deletions, and insertions on the edges of g1 and g2. By means of the edit operations implied by an ecgm, differences between two graphs that are due to noise and distortions are modeled. In order to enhance the noise modeling capabilities, a cost is often assigned to each edit operation. The costs are non-negative real numbers. They are application dependent. Typically, the more likely a certain distortion is to occur, the lower is its cost. The cost c(f) of an ecgm f from a graph g1 to a graph g2 is the sum of the costs of the individual edit operations implied by f. An ecgm f from a graph g1 to a graph g2 is optimal if there is no other ecgm from g1 to g2 with a lower cost. The edit distance, d(g1, g2), of two graphs is equal to the cost of an optimal ecgm from g1 to g2, i.e.

    d(g1, g2) = min{ c(f) | f : V̂1 → V̂2 is an ecgm }

In other words, the edit distance is equal to the minimum cost required to transform one graph into the other. For more details on error-correcting graph matching and graph edit distance, including computational procedures, see, for example, Ref. 9.

3. Graph Clustering Using SOM
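The minimization over all ecgms defined in Section 2 can be illustrated with a deliberately tiny brute-force sketch. It handles node labels only (edges are omitted for brevity) and enumerates all partial bijections, so it is exponential and purely didactic; practical computational procedures are the subject of Ref. 9.

```python
from itertools import permutations

def edit_distance(nodes1, nodes2, sub_cost, del_cost, ins_cost):
    """Brute-force node edit distance: pad both node-label lists with
    None so that every partial bijection shows up as a full alignment,
    then minimize the summed cost of the implied substitutions,
    deletions and insertions (edges are ignored for brevity)."""
    a = nodes1 + [None] * len(nodes2)
    b = nodes2 + [None] * len(nodes1)
    best = float("inf")
    for perm in permutations(range(len(b))):
        cost = 0.0
        for i, j in enumerate(perm):
            x, y = a[i], b[j]
            if x is not None and y is not None:
                cost += sub_cost(x, y)   # x is substituted by y
            elif x is not None:
                cost += del_cost(x)      # x is deleted from g1
            elif y is not None:
                cost += ins_cost(y)      # y is inserted in g2
        best = min(best, cost)
    return best
```

For the toy costs sub = |x - y| and del = ins = value, the distance between the node sets {1, 2} and {2} is 1: node 2 is mapped identically and node 1 is deleted.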
In Refs. 7 and 8 a new graph clustering algorithm is introduced. For the purpose of completeness, this algorithm will be briefly reviewed in the present
section. It is derived from the self-organizing map (som), as described in Ref. 1. While the 'classical' som-algorithm is based on vectorial pattern representations, the new clustering algorithm works in the domain of graphs. Hence it can be considered a generalization that includes the original som as a special case. A pseudo code description of the classical som-algorithm is given in Fig. 1. The algorithm can serve two purposes, either clustering, or mapping a high-dimensional pattern space to a lower-dimensional one. In the present paper we focus on its application to clustering. Given a set of patterns, X, the algorithm returns a prototype y_i for each cluster i. The prototypes are sometimes called neurons. The number of clusters, M, is a parameter that must be provided a priori. In the algorithm, first each prototype y_i is randomly initialized (line 4). In the main loop (lines 5-11) one randomly selects an element x ∈ X and determines the neuron y* that is nearest to x. In the inner loop (lines 8, 9) one considers all neurons y that are within a neighborhood N(y*) of y*, including y*, and updates them according to the formula in line 9. The effect of neuron updating is to move neuron y closer to pattern x. The degree by which y is moved towards x is controlled by the parameter γ, which is called the learning rate. It has to be noted that γ is dependent on the distance between y and y*, i.e. the smaller this distance is, the larger is the change on neuron y. After each iteration through the repeat-loop, the learning rate γ is reduced by a small amount, thus facilitating convergence of the algorithm. It can be expected that after

som-algorithm
(1)   input: a set of patterns, X = {x1, ..., xN}
(2)   output: a set of prototypes, Y = {y1, ..., yM}
(3)   begin
(4)     initialize Y = {y1, ..., yM} randomly
(5)     repeat
(6)       select x ∈ X randomly
(7)       find y* such that d(x, y*) = min{d(x, y) | y ∈ Y}
(8)       for all y ∈ N(y*) do
(9)         y = y + γ(x - y)
(10)      reduce learning rate γ
(11)    until termination condition is true
(12)  end

Fig. 1. The som-algorithm.
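The algorithm of Fig. 1 can be rendered for ordinary feature vectors in a few lines. In this sketch the neighborhood N(y*) is reduced to the winner itself, and the initialization, iteration count, and decay schedule are illustrative choices rather than those of Ref. 1.

```python
import random

def som(X, M, iters=2000, gamma=0.5, decay=0.999):
    """Minimal vectorial som (Fig. 1 with the neighborhood N(y*)
    reduced to the winner itself): pick a random pattern, find the
    nearest prototype, move it toward the pattern, shrink gamma."""
    rng = random.Random(0)
    Y = [list(rng.choice(X)) for _ in range(M)]    # random init from data
    for _ in range(iters):
        x = rng.choice(X)
        y = min(Y, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        for i in range(len(y)):                    # y = y + gamma*(x - y)
            y[i] += gamma * (x[i] - y[i])
        gamma *= decay
    return Y

# two well-separated clusters; the prototypes settle near their centers
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
Y = som(X, M=2)
```

Each prototype returned by `som` then acts as a cluster center, exactly as described in the text that follows.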
a sufficient number of iterations the y_i's have moved into areas where many x_j's are concentrated. Hence each y_i can be regarded as a cluster center. The cluster around center y_i consists of exactly those patterns that have y_i as closest neuron. More detail about this algorithm can be found in Ref. 1. In the original version of the som-algorithm all x_j's and y_i's are feature vectors.1 To make the algorithm applicable in the graph domain, two new concepts are needed. First, a graph distance measure has to be provided in order to find the graph y* that is closest to x (see line 7). For this purpose the use of graph edit distance as described in Section 2 has been proposed in Refs. 7 and 8. Secondly, a graph updating procedure implementing line 9 has to be found. Such a procedure has been described in Refs. 7 and 8. It is derived from graph edit distance computation. The result of computing the edit distance, d(g1, g2), of two graphs, g1 and g2, is a sequence of edit operations e1, ..., en that transforms g1 into g2 with minimum cost. If we take a subsequence e_i1, ..., e_ik of those edit operations and apply them to g1, we obtain a new graph, g3. This new graph can be regarded as a version of g1 that has been changed so as to make it more similar to g2, in the same way as y is made more similar to x by means of the operation in line 9 of Fig. 1. Moreover, the cost of the subsequence e_i1, ..., e_ik corresponds to γ, controlling the degree by which g1 is made more similar to g2. With the two new concepts described in the last paragraph, the som-algorithm becomes in fact applicable in the domain of graphs. For further details, including the initialization procedure (line 4), definition and computation of the neighborhood N(y*) (line 8), and termination (line 11), see Refs. 7 and 8.

4. Adaptive Graph Clustering

Many clustering algorithms, including the original and the graph-based version of som, require the number of clusters given as an input parameter.
This is a serious problem as this number is often not known. In Ref. 10 methods are discussed to solve this problem for the traditional version of som by dynamically adding and deleting neurons during the execution of the som procedure. Using similar techniques, we propose two adaptive versions of som for the graph domain that are able to find the optimal number of clusters automatically. Both versions are based on the concept of neuron utility, which will be introduced in Section 4.1. Then in Section 4.2 the complete procedure for adaptive clustering in the graph domain is presented.
4.1. Neuron Utility
The utility of a neuron y is an indicator reflecting how much y contributes to the approximation of the input data, i.e., it shows the contribution of y to minimizing the sum S of squared distances between all inputs and their nearest neuron. Neurons with a high utility are important and should be kept, while a neuron with a low utility can be removed without significantly changing S. Let y1 and y be neurons and x an input pattern. Then the utility of y1 is defined as follows:

    U(y1) = Σ_{x ∈ near(y1)} [ min{d(y, x)² | y ≠ y1} − d(y1, x)² ]

where

    near(y1) = { x | (∀y)(y ≠ y1) ⇒ d(y, x) > d(y1, x) }

In these equations, d(_, _) denotes the graph edit distance. The utility U(y1) of neuron y1 is obtained by computing, for all inputs x ∈ near(y1), the difference between the squared distance of the second nearest neuron to x and the squared edit distance of y1 to x. Set near(y1) consists of all inputs that have y1 as closest neuron. Clearly, if y1 has a small utility then either for each input pattern in its immediate neighborhood there is another neuron close by, or there are no input patterns in its immediate neighborhood at all. In the second case, near(y1) = ∅ and U(y1) = 0. On the other hand, if U(y1) is large, then the removal of y1 would leave all inputs in set near(y1) without a neuron close by, and the sum of squared distances of all inputs to their nearest neuron would significantly increase.

4.2. Finding the Optimal Number of Clusters
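The utility defined in Section 4.1 can be transcribed almost literally. In this sketch the distance d is passed in as a parameter; the simple absolute-difference d used below is a stand-in for the graph edit distance, chosen only to make the example self-contained.

```python
def utilities(neurons, patterns, d):
    """Neuron utility as defined above: for every input x whose nearest
    neuron is y1, accumulate the squared distance to the second-nearest
    neuron minus the squared distance to y1 itself."""
    U = {y: 0.0 for y in neurons}
    for x in patterns:
        ranked = sorted(neurons, key=lambda y: d(y, x))
        y1, y2 = ranked[0], ranked[1]
        U[y1] += d(y2, x) ** 2 - d(y1, x) ** 2
    return U

d = lambda y, x: abs(y - x)   # stand-in for the graph edit distance
U = utilities([0.0, 10.0], [0.0, 1.0, 9.0], d)
```

A neuron that is nearest to no input keeps utility 0 and is thus the first candidate for removal.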
Assume that the number of clusters given to the som algorithm shown in Fig. 1 as input parameter is too small. Then it can be expected that, upon termination of the algorithm, the utility of each neuron is high. On the other hand, we anticipate one or several neurons with a low utility if the number of clusters given a priori is too high. Based on these considerations, the algorithm given in Fig. 1 is modified as follows. After a certain number, N > 1, of runs through the repeat loop (lines 5-11), the utility of each neuron is checked. As this check is computationally expensive, it should not be done often. On the other hand, if the utility check isn't done frequently enough, the algorithm may converge
towards undesired solutions. In the experiments described in Section 5, N was set equal to 200. After the utility of each neuron has been computed, the value u_min of the neuron y_min with the smallest utility is determined. Also the average utility u_av of all neurons is computed. If u_min < u_av/c, where c > 1 is a user-defined constant, the neuron y_min is deleted. Otherwise, if u_min > u_av/c, a new neuron is inserted. Each time the repeat loop (lines 5-11 of Fig. 1) is started with a new number of neurons, the learning rate γ is reset to the value it had when the last utility check was performed, so as to give the new population of neurons more flexibility to adapt to the input patterns. Upon termination of the som algorithm a postprocessing procedure is executed, where the utility of all neurons is checked one more time. Any remaining neuron with a utility smaller than u_av/c is deleted by this postprocessing procedure.
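A single utility check of this adaptive scheme might look as follows. The constant c and the rule for creating a new neuron (spawn) are illustrative assumptions, and a real implementation would place the inserted neuron more carefully; the absolute-difference d stands in for the graph edit distance.

```python
def adapt(neurons, patterns, d, c=2.0, spawn=lambda: 5.0):
    """One utility check of the adaptive scheme: compute each neuron's
    utility, then delete the least useful neuron if its utility falls
    below u_av / c, and otherwise insert a new one (c and the spawn
    rule are illustrative choices)."""
    U = {y: 0.0 for y in neurons}
    for x in patterns:
        ranked = sorted(neurons, key=lambda y: d(y, x))
        U[ranked[0]] += d(ranked[1], x) ** 2 - d(ranked[0], x) ** 2
    u_av = sum(U.values()) / len(neurons)
    y_min = min(U, key=U.get)
    if U[y_min] < u_av / c:
        return [y for y in neurons if y != y_min]   # delete y_min
    return neurons + [spawn()]                       # insert a new neuron

d = lambda y, x: abs(y - x)   # stand-in for the graph edit distance
```

Calling such a check every N iterations, and resetting γ after each change, yields the behavior described in the text.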
It assumes that a number of neurons greater than, or equal to, the optimal number is given in the initialization phase. During execution of the algorithm the number of neurons is never changed. That is, each time a neuron is deleted because its utility is too small, a new neuron is inserted. Upon termination, i.e. after a given number of iterations through the repeat loop (lines 5-11 in Fig. 1) has been executed, the postprocessing procedure described above is executed, removing all neurons with a utility smaller than u_av/c. From now on, the static version of the new graph clustering algorithm described in Section 3 will be called GraSom, while the two adaptive
versions will be called AdaGraSom1 and AdaGraSom2. The first one, AdaGraSom1, is the algorithm that can be initialized with any number of neurons, while AdaGraSom2 needs a value upon initialization that is greater than, or equal to, the correct number of neurons.

5. Experimental Results

The graph clustering algorithm GraSom and its adaptive versions AdaGraSom1 and AdaGraSom2, described in Sections 3 and 4, respectively, were experimentally evaluated. In the experiments, graph representations of capital characters were used. In Fig. 2, 15 characters are shown, each representing a different class. The characters are composed of straight line segments. In the corresponding graphs, each line segment is represented by a node with the coordinates of its endpoints in the image plane as attributes. No edges are included in this kind of graph representation. The edit costs are defined as follows. The cost of deleting or inserting a line segment is proportional to its length, while the cost of substituting one line segment by another corresponds to the difference in length of the two considered segments. This kind of representation will be called R1 in the following.
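The edit costs under representation R1 can be sketched as follows. The proportionality constants (set to 1 below) and the function names are illustrative assumptions; the chapter does not fix them.

```python
import math

def seg_length(seg):
    """Length of a straight line segment given by its two endpoints."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)

def del_cost(seg):
    # Deleting a line segment: cost proportional to its length.
    return seg_length(seg)

def ins_cost(seg):
    # Inserting a line segment: cost proportional to its length.
    return seg_length(seg)

def sub_cost(seg_a, seg_b):
    # Substitution: cost equal to the difference in length.
    return abs(seg_length(seg_a) - seg_length(seg_b))
```

For instance, substituting a segment of length 3 by one of length 4 costs 1, regardless of their orientation, since R1 attaches no cost to direction.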
Fig. 2. 15 characters, each representing a different class.
In addition to representation R1, a second representation, called R2 below, was considered. Under R2, the nodes represent locations where either a line segment ends, or where the end points of two different line segments coincide. The attributes of a node represent its location in the image. There is an edge between two nodes if the corresponding
locations are connected by a line in the image. No edge attributes are used in this representation. The deletion and insertion cost of a node is a constant, while the cost of a node substitution is proportional to the distance between the corresponding points in the image plane. The deletion and insertion of an edge also have constant cost. As there are no edge labels, edge substitutions will never be needed under representation R2. For each of the 15 prototypical characters shown in Fig. 2, ten distorted versions were generated. Examples of distorted A's and E's are shown in Figs. 3 and 4, respectively. The degree of distortion of the other characters is similar to that in Figs. 3 and 4. As a result of the distortion procedure, a sample set of 150 characters was obtained. Although the identity of each sample was known, this information was not used in the experiments described below, i.e., only unlabeled samples were used in the experiments.
Fig. 3. Ten distorted versions of character A.
Fig. 4. Ten distorted versions of character E.
The graph clustering algorithm GraSom described in Section 3 was run on a set of 150 graphs representing the (unlabeled) sample set of characters, with the number of clusters set to 15. As the algorithm is non-deterministic, a total of 10 runs were executed. The cluster centers obtained in one of these
Fig. 5. Cluster centers obtained in one of the experimental runs.
runs are depicted in Fig. 5. Obviously, all cluster centers are correct in the sense that they represent meaningful prototypes of the different character classes. Similar results were obtained in all other runs for both representations R1 and R2, i.e., in none of the runs was an incorrect prototype generated. Also, all of the 150 given input patterns were assigned to their correct cluster center. From these experiments it can be concluded that GraSom is able to produce a meaningful partition of a given set of graphs into clusters and to find an appropriate prototype of each cluster, if the correct number of clusters is known. The purpose of the experiments described below is to analyze the behavior of the adaptive versions, AdaGraSom1 and AdaGraSom2, in case the number of clusters is not known. The same set of data as for the experiments conducted with GraSom was used, i.e., ten unlabeled distorted versions of each of the 15 characters were generated. Then AdaGraSom1 and AdaGraSom2 were executed on these data. Each run was repeated ten times to account for the random nature of these algorithms. A summary of the experimental results is shown in Table 1. AdaGraSom1 was run with the initial number of clusters set equal to 1, 15, and 25, while AdaGraSom2 was run only with this parameter set equal to 25. Both graph representations, R1 and R2, were used. The numbers in Table 1 denote the relative rate of error in the number of clusters produced. This quantity is defined as the number of spurious or missing clusters relative to the correct number, accumulated over all experimental runs. For example, the instance of 0% in the first row and
Table 1. Summary of experimental results, see text (i denotes the number of initial neurons; r denotes the type of graph representation).

            AdaGraSom1                     AdaGraSom2
 r \ i      1         15        25         25
 R1         0%        0%        0%         0.67%
 R2         7.33%     4%        2%         0%
first column of the table means that when AdaGraSom1 was run on graph representation R1 with one initial neuron, it produced 15 clusters, the correct number, in each of the ten runs. By contrast, the number 0.67 in the first row and last column means there was one spurious or missing cluster in one out of ten runs, while in the other nine runs the correct number of 15 clusters was produced (i.e. 0.67% = (1/150) · 100%). Note that the relative rate of error for GraSom is 0%. From Table 1 we conclude that AdaGraSom2 produced good results for both R1 and R2. Actually, there was only one cluster out of 300 either missing or spurious. The behavior of AdaGraSom1 is more sensitive to the type of graph representation used. For R1 perfect results were obtained, but for R2 several errors occurred. The most stable version of AdaGraSom1 is the one where we start with a number of neurons higher than the true value. A second measure for assessing the results produced by AdaGraSom1 and AdaGraSom2 is the average graph edit distance of a pattern to its cluster center, i.e. to its nearest neuron. The smaller this number is, the more compact, i.e. the better, is the corresponding clustering. This quantity is shown in Table 2.

Table 2. Summary of experimental results, see text (i denotes the number of initial neurons; r denotes the type of graph representation).

It was computed only for those cases where the
algorithm produced the correct number of clusters. As none of the runs corresponding to the first column of the second row produced the correct number of clusters, this position is left empty. The two corresponding values obtained under GraSom for R1 and R2 are 0.59 and 0.5, respectively. It is evident from Table 2 that the values obtained under AdaGraSom1 and AdaGraSom2 compare very favorably with those obtained with GraSom. This can be explained by the fact that, due to the variable number of neurons, AdaGraSom1 gives the cluster centers more flexibility to adapt to the actual input pattern population. Similarly, due to the higher number of neurons, cluster centers have a higher chance of migrating to the 'true' positions under AdaGraSom2.

6. Conclusions
Clustering has become a mature discipline that is important to a number of areas, including pattern recognition and related fields. But the clustering of graphs is still widely unexplored. This is an unfortunate restriction because graphs and other symbolic data structures have a much higher representational power than object representations in terms of feature vectors, which prevail in today's clustering algorithms. In the present paper a new graph clustering algorithm, GraSom, is described. It is an extension of the well-known som-algorithm into the domain of graphs. Moreover, two adaptive versions of GraSom, called AdaGraSom1 and AdaGraSom2, are introduced. These algorithms are able to overcome one of the shortcomings of the classical som-algorithm, and of GraSom. While in the som-algorithm and in GraSom the number of clusters is static and needs to be defined beforehand, AdaGraSom1 and AdaGraSom2 can find the optimal number of clusters automatically. Unlike cluster validation indices, which require a clustering algorithm to be repeatedly executed for each possible number of clusters, these algorithms dynamically adapt the number of clusters to the structure of the input data. AdaGraSom1 can be initialized with any number of neurons, while AdaGraSom2 requires an initial number of neurons at least as large as the true number. The applicability of the new graph clustering algorithms has been demonstrated through a number of experiments. From these experiments and studies reported elsewhere (Ref. 11), the following conclusions can be drawn. If the true number of clusters is known, then it is most straightforward to apply GraSom. This algorithm produces good results and is more efficient than
AdaGraSom1 and AdaGraSom2. In case the number of clusters is approximately known, both AdaGraSom1 and AdaGraSom2 qualify for application. If AdaGraSom2 is selected, it has to be started with a number of neurons that is guaranteed not to be below the true number of clusters. Also for AdaGraSom1 it is advisable to run it with an initial number of neurons slightly higher than the expected true value. As a third alternative, GraSom in combination with one or more of the cluster validation indices considered in Ref. 11 can be applied. However, to keep the overhead manageable, this alternative should be chosen only if the range of possible values of the number of clusters is restricted. Finally, if the number of clusters is totally unknown, AdaGraSom1 qualifies for application. In this case it is advisable to run the algorithm multiple times with different numbers of initial neurons and select the best result based on validation indices as discussed in Ref. 11. Alternatively, several runs of AdaGraSom2 can be executed with a variable number of initial neurons; also in this case the final result has to be selected using validation indices (Ref. 11). It can finally be concluded that a variety of different clustering algorithms have become available. These algorithms cover a broad spectrum of potential applications. Future research will consider clustering algorithms other than som, and their adaptation to the domain of graphs.
References

1. T. Kohonen. Self-Organizing Maps, Springer Verlag, 1997.
2. S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1998.
3. A. K. Jain, M. N. Murty and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31, pp. 264-323, 1999.
4. R. Englert and R. Glantz. Towards the clustering of graphs, in Proc. 2nd IAPR-TC-15 Workshop on Graph-based Representations, Austria, Austrian Computer Society, pp. 125-133, 2000.
5. D. S. Seong, H. S. Kim and K. H. Park. Incremental clustering of attributed graphs, IEEE Trans. on Systems, Man and Cybernetics, pp. 1399-1411, 1993.
6. D. Riano and F. Serratosa. Unsupervised synthesis of function described graphs, in Proc. 2nd IAPR-TC-15 Workshop on Graph-based Representations, Austria, Austrian Computer Society, pp. 165-171, 2000.
7. S. Günter. Graph clustering using Kohonen's method, Diploma Thesis, University of Bern, 2000 (in German).
8. S. Günter and H. Bunke. Self-organizing map for clustering in the graph domain; to appear in Pattern Recognition Letters.
9. B. T. Messmer and H. Bunke. A new algorithm for error tolerant subgraph isomorphism. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, pp. 493-505.
10. B. Fritzke. Growing self-organizing networks - history, status quo and perspectives. In E. Oja and S. Kaski, editors, Kohonen Maps, Elsevier, 1999, pp. 131-144.
11. S. Günter and H. Bunke. Validation indices for graph clustering, Proc. 3rd IAPR-TC-15 Workshop on Graph-based Representations in Pattern Recognition, May 23-25, 2001, Ischia (Italy), pp. 229-238.
CHAPTER 4

FROM NUMBERS TO INFORMATION GRANULES: A STUDY IN UNSUPERVISED LEARNING AND FEATURE ANALYSIS

Andrzej Bargiela
The Nottingham Trent University, Burton Street, Nottingham NG1 4BU, England
E-mail:
[email protected]
Witold Pedrycz
University of Alberta, Edmonton, Canada, T6G 2G7
E-mail: [email protected]
and Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland

This chapter focuses on granular clustering: a way of finding structure in heterogeneous data and representing the data in the form of information granules. The main features of the proposed granular clustering approach are: (a) a noninvasive exploration of data carried out under weak assumptions as to the nature of the data, and (b) transparency of the constructed information granules, which assume the form of hyperboxes in the problem space. We introduce a compatibility measure that expresses a degree of "similarity" between two information granules and takes into account both the distance between the granules and their size. We show how to "grow" clusters through a process of merging existing data points that exhibit high values of the compatibility measure. The clustering algorithm is discussed along with a comprehensive validation mechanism for the resulting structures (collections of information granules). We formulate a problem of feature analysis in the setting of information granules and introduce some quantitative measures describing each feature. Numerical experiments use two-dimensional synthetic data and the multivariable Boston housing data.
1. Introduction

Combining patterns into some form of structure is the fundamental underpinning of Pattern Recognition (PR). Any in-depth analysis of patterns leads to optimal and interpretable classifiers. Interestingly, with the increasing heterogeneity of available data (patterns) and the steadily growing complexity of real-life classification problems, one has to look at a uniform and general treatment of various PR scenarios. In one way or another, a need arises for the formulation of classification problems in the language of information granules: conceptual entities that capture the essence of the overall data set while retaining the character of individual patterns. It is worth stressing that information granulation can be seen as a vehicle of abstraction supporting the transition from clouds of numeric data and small information granules to larger and more general information granules (Refs. 2, 3, 5, 12, 13, 16-18). The area of clustering (unsupervised learning), with its long history, has been an important endeavor in PR. Various algorithms for finding structures in data and representing the essence of such structures in terms of prototypes, dendrograms, self-organizing maps (Refs. 8, 9) and the like (Refs. 1, 4) have been used for a long time within the research community and industry. Commonly, if not exclusively, the direct aspect of granulation has not been tackled. The intent of this study is to address this important problem by introducing the idea of granular clustering. The simplest scenario looks like this: we start from a collection of numeric data (points in R^n) and form information granules whose distribution and size reflect the essence of the data and reveal its structure. Forming the clusters (information granules) may be treated as a process of growing information granules.
As the clustering progresses, we expand the clusters, enhancing the descriptive power of the granules while gradually reducing the amount of detail available to us. The information granules of interest in this study are represented as hyperboxes positioned in a high-dimensional data space. The mathematical formalism of interval analysis provides a robust framework for the analysis of the information density of these granular structures. The study's intuitive objective is to match the granularity of the data items used to describe physical systems to the structure of these systems. In this sense, the granulation process attempts to achieve the highest possible generalization while maintaining the character of individual data structures. Hybrid pattern classifiers are used in the context of this study differently from what is commonly encountered in the literature. While most
hybrid systems allude to some sort of neurofuzzy architecture, sometimes augmented by evolutionary mechanisms, in this study we are concerned with hybridization occurring at the level of information granules. In other words, a hybrid system operates on a spectrum of information granules ranging from numeric data through to more general (less specific) information granules. The chapter is organized into 7 sections. In Section 2, we introduce information granules and the rationale behind them. As we are concerned with information granulation carried out in terms of sets (hyperboxes), we also provide all pertinent set notation in this section. The principle of granular clustering is covered in Section 3 and the granular clustering algorithm is presented in Section 4. Feature analysis completed in the framework of information granules is studied in Section 5. Experimental studies are discussed in Section 6 and the conclusions are given in Section 7.

2. Information Granules and Information Granulation

Information granulation is a process of data organization and data comprehension. Interestingly, humans granulate information almost subconsciously. This makes the ensuing cognitive processes so effective and far superior to machine intelligence. Two representative categories of problems in which information granulation plays a prominent role involve the processing of one- and two-dimensional signals. The first case primarily concerns temporal signals; the latter pertains to image processing and image analysis. In the processing, analysis and interpretation of signals, information granules arise as a result of temporal sampling and aggregation. Several samples in the same time window can be represented as a single information granule. In the simplest case, such an interval can be formed by taking the minimal and maximal value of the signal occurring in this window of granulation (refer to Fig. 1).
Some other ways of forming information granules may rely on statistical analysis; one determines a mean or median as a representative of the numeric data points and then builds a confidence interval around it (obviously, the use of this mechanism requires assumptions about the statistical properties of the population contained in the window as well as the numeric representative itself). Similarly, in image processing one combines pixels within some spatial neighborhood. Again, various features of an image can be granulated, such as brightness, texture, color, etc.
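The min/max window granulation just described can be sketched in a few lines (the function name and signature are ours, not from the chapter):

```python
def granulate(signal, window):
    """Turn a numeric time series into interval granules by taking the
    min and max over consecutive windows of `window` samples, as in the
    sampling-based granulation of Fig. 1."""
    granules = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        granules.append((min(chunk), max(chunk)))
    return granules
```

Each resulting (min, max) pair is an interval granule; a window containing a single sample yields a degenerate interval, i.e. a numeric datum.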
Fig. 1. A fragment of a time series and its granulation through sampling (T_s denotes a sampling interval).
Information granulation has been studied in Refs. 2, 3, 10 and 12, in terms of the concept itself, its computational aspects and the resulting structures. While this chapter is written as a self-contained unit, the reader may be interested in a broader discussion of the information granulation issues that can be found in the above publications.

2.1. Set-based Framework of Information Granules: The Language of Hyperboxes

In the overall presentation we adhere to a standard notation. A hyperbox defined in R^n, denoted by B, is fully described by its lower (l^B) and upper (u^B) corner. To use explicit notation, we write B(l^B, u^B), where l^B, u^B ∈ R^n. An evident inclusion relationship holds true; we can express it as l_i^B ≤ u_i^B for i = 1, ..., n, where l^B = [l_1^B, l_2^B, ..., l_n^B] and u^B = [u_1^B, u_2^B, ..., u_n^B]. If l^B = u^B, then the hyperbox reduces to a single point (numeric datum), B(l^B, l^B) = {l^B}. Hyperboxes are elements of a family of sets defined in R^n. More specifically, we state that B ∈ P(R^n), with P(·) being the power set of R^n. The volume of B, denoted by V(B), is viewed as a measure of specificity of the information granule. The point B(l^B, l^B) has the highest specificity. As the volume increases, the specificity of the information granule decreases correspondingly. Computationally, it is advantageous to consider the expression exp(-V(B)), which captures this aspect of granularity and is normalized, i.e. it attains 1 for a numeric datum and tends to zero once the hyperbox starts growing. It is instructive to elaborate on the use of such information granules in the realm of PR (in which we are quite commonly confined to the language
Fig. 2. Expressing data as hyperboxes (a) and ellipsoidal information granules (b); note a reconstruction deficiency caused by the dependency between the features.
of probability and probabilistic granules, articulated as probability functions or probability density functions). Transparency of results is a key factor here. To illustrate this point, Fig. 2 shows two granular constructs. In the first case, a two-dimensional box captures the essence of the data: we may state that the Cartesian product of [a, b] and [c, d], B = [a, b] x [c, d], "covers" the data. Moreover, both features (intervals) maintain their identity. In contrast, ellipsoidal information granules (which can otherwise be quite expressive) do not provide the same transparency as hyperboxes. Obviously one can project the ellipsoid on the corresponding features. Note, however, that the reconstruction (the Cartesian product C = [e, f] x [g, h]) could be quite different from the original granule.
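The hyperbox notation of Section 2.1 translates directly into code. The following sketch (class and method names are ours) records the lower and upper corners l^B and u^B, the volume V(B), and the specificity exp(-V(B)):

```python
import math

class Hyperbox:
    """A hyperbox B(l, u) in R^n as defined in Section 2.1 (a sketch)."""

    def __init__(self, l, u):
        # The inclusion relationship l_i <= u_i must hold in every dimension.
        assert all(li <= ui for li, ui in zip(l, u))
        self.l, self.u = list(l), list(u)

    def is_point(self):
        # B(l, l) = {l}: the hyperbox degenerates to a numeric datum.
        return self.l == self.u

    def volume(self):
        v = 1.0
        for li, ui in zip(self.l, self.u):
            v *= (ui - li)
        return v

    def specificity(self):
        # exp(-V(B)): 1 for a numeric datum, tending to 0 as the box grows.
        return math.exp(-self.volume())
```

A point thus has specificity 1, while any box of positive volume has specificity strictly below 1, matching the normalization described above.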
3. The Principle of Granular Clustering

Before we proceed with the details of the granular clustering technique, it is instructive to discuss the underlying principle, learn how the process proceeds, and concentrate on the interpretation of some results generated by the proposed clustering mechanism. As emphasized in the literature (Refs. 1, 4), the essence of clustering (unsupervised learning) is to discover a structure in data. It is generally true that almost all existing clustering techniques operate on numeric objects (vectors in R^n) and produce representatives (prototypes) that are again entirely numeric. In this sense, their form does not reflect how many data points they represent or what the distribution of these data points is. In the design of the clustering method, we add an
extra dimension of granularity that helps sense the structure in the data as it becomes unveiled during the formation of the clusters. Without loss of generality, we focus our attention on the subspace of R^n and concern ourselves with granular clustering algorithms defined on the unit hyperbox [0, 1]^n. Consequently, as a pre-processing step, we normalize all input data to such a hyperbox. This pre-processing ensures that the granular clustering algorithm has a simpler mathematical formulation while retaining generality for all data in R^n.

3.1. The Design

The approach introduced here differs in many ways from other approaches (Refs. 1, 5, 14, 19-21). The leitmotiv is the following:

An abstraction (no matter whether dealing with numeric or granular elements) is achieved through the condensation of original data elements into granules, whose location and granularity reflect the essence of the structure of the data. The more condensation, the larger the sizes of the information granules that realize this aggregation.

The granular clustering is carried out as the following iterative process:

• Find the two most compatible information granules (where the idea of compatibility guiding this search will be quantified later on) and on this basis build a new granule embracing them. In this way, one condenses the data while reducing the size of the data set.
• Repeat the first step until enough data condensation has been accomplished (here one has to come up with a termination criterion or introduce a sound validation mechanism).

Figure 3 illustrates how the clustering algorithm works. We start from a collection of small information granules (the original data) and grow larger information granules. Noticeably, through their growth they tend to reflect the essential characteristics of the original data. The size of the granules reflects how much of the original data they have incorporated and conveys extra information about its distribution. This approach resembles techniques of aggregative hierarchical clustering. There is a striking difference though: in hierarchical clustering we deal with point-size data and the clusters are sets of homogeneous objects. No conceptually new entities are formed. Here we deal with a heterogeneous
Fig. 3. Several snapshots of cluster growing over the clustering process; observe the small information granules forming at the initial stage (first iteration) that are grouped in some well-confined regions and give rise to three apparent large information granules at the later stage of clustering.
mix of data items and "grow" larger information granules from the smaller granules and/or the individual point-size data. It should be stressed that the nature of an information granule is significantly richer compared to that of point data. It involves additional attributes such as "shape" (u^B − l^B) and "size" (||u^B − l^B||) in addition to the "position" (u^B) attribute that is associated with both granules and point data. By monitoring the attributes of the hyperboxes we can oversee the clustering process more effectively. Essentially, once we find that the attributes of individual boxes indicate that they are incompatible with each other (the notion is explained in Section 4), the process of clustering is terminated. By the same token, this concept should be contrasted with the idea of min-max clustering discussed by Simpson (Refs. 14, 15), as this technique seems to bear some resemblance to the method studied here. The similarity is only superficial though. First, Simpson's method deals with point-size data while we consider data that are represented by either points or hyperboxes in pattern space. Second, the fuzzy membership functions of the information granules proposed by Simpson promote the formation of clusters whose size varies greatly in various dimensions. This is exactly the opposite of what we are trying to promote through the "compatibility measure" (discussed in Section 4). To emphasise the latter point, we present in Fig. 4 a representative of the class of membership functions proposed by Simpson and refer the reader to Fig. 10 for comparison with the functions that have been utilised in the proposed clustering algorithm.
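The two-step iterative process of Section 3.1 can be sketched as follows. This is a simplification of the method, not the authors' implementation: granules are plain (l, u) corner pairs, the stopping rule is a fixed target count rather than the validation mechanism of Section 3.2, and any compatibility measure (such as Eq. (4) of Section 4) can be passed in.

```python
def cluster(granules, compat, target):
    """Grow information granules by repeatedly merging the most
    compatible pair until `target` granules remain (a sketch).
    Each granule is an (l, u) pair of corner vectors."""
    granules = list(granules)
    while len(granules) > target:
        # Step 1: find the two most compatible granules.
        best = max(((i, j) for i in range(len(granules))
                    for j in range(i + 1, len(granules))),
                   key=lambda ij: compat(granules[ij[0]], granules[ij[1]]))
        a = granules.pop(best[1])
        b = granules.pop(best[0])
        # Build the smallest hyperbox embracing both granules.
        l = [min(x, y) for x, y in zip(a[0], b[0])]
        u = [max(x, y) for x, y in zip(a[1], b[1])]
        granules.append((l, u))
        # Step 2: repeat until enough condensation is accomplished.
    return granules
```

With a toy compatibility measure that simply prefers nearby granules, two close points are merged into one box while an outlier is left alone.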
Fig. 4. Simpson's membership function (as presented in Ref. 15) for the hyperbox defined by the min point V = [0.2 0.2] and max point W = [0.4 0.4]. The sensitivity parameter γ is equal to 4.
3.2. Interpretation and Validation of Granular Clustering
In the literature there are a number of cluster validity indexes whose role is to assess the "goodness" of clusters and, as a consequence, to identify the most "plausible" number of clusters. Validity indexes help guide the clustering process by implying what the number of clusters should be. Commonly, their behavior does not lead to clear conclusions though. Even worse, they may generate conflicting suggestions as to the proper number of clusters. In granular clustering we take another position. As the clusters capture the core of the data (and, obviously, this is regarded as an important benefit of the method), our conjecture is that such a core should help establish a sound platform for assessing clusters. When growing the information granules, a criterion worth investigating is the volume of the smallest granule (V_min) that needs to be constructed, at this particular step, in order to cluster two component granules (more specifically, we determine e^(-αV_min); the details will be covered in Section 4.1). If that minimal volume grows quickly, it can be deduced that the compatibility of the component granules is low and the clustering process can be terminated. Again, it is worth emphasizing that the granularity of data adds an extra dimension to any processing. Not only is the location of an information granule essential; its size and shape also play a crucial role in the process of clustering and afterwards during the validation of the clusters.
4. The Computational Aspects of Granular Computing

There are two essential functional elements of granular clustering that need to be described prior to presenting the detailed algorithm. These concern how the distance between two information granules is determined and how we compute an inclusion relation between granules. While the definitions generalize to the multidimensional case, we focus here on a two-dimensional case. Note also that these two concepts work for heterogeneous data, that is, granules and numeric entities in the same feature space.

4.1. Defining Compatibility Between Information Granules
In this section, we discuss how compatibility and inclusion between two information granules are computed. The issue is more complicated than in a numeric case, as these notions are granular and therefore the definitions of compatibility and inclusion should reflect this fact. Consider two information granules A and B. More explicitly, we follow the full notation A(l^A, u^A) and B(l^B, u^B) to point at their location in the space. The compatibility, compat(A, B), involves two components: a distance between A and B, d(A, B), and the size of the information granule that would be formed by merging A and B. The distance d(A, B) between A and B is defined on the basis of the distance between their extreme vertices:

    d(A, B) = (||l^B − l^A|| + ||u^B − u^A||)/2.    (1)
Obviously || · || is a distance defined between two numeric vectors. To make the framework general enough, we treat || · || as an L_p distance, p ≥ 1. By changing the value of "p" we sweep across a spectrum of well-known distances that depend upon the particular value of "p". For instance, p = 1 yields the Hamming distance, L_1. The value p = 2 produces the well-known Euclidean distance, L_2. For p = ∞ we obtain the Tchebyschev distance, L_∞. Once A and B have been combined, giving rise to a new information granule C, the granularity of C can be captured by a volume V(C) computed in the standard fashion

    V(C) = Π_{i=1..n} length_i(C)    (2)

where

    length_i(C) = max(u_i^B, u_i^A) − min(l_i^B, l_i^A)    (3)

for i = 1, 2, ..., n (refer to Fig. 5).
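Expressions (1)-(3) can be written down directly. In this sketch, granules are (l, u) corner pairs and the helper names are ours:

```python
def lp_dist(x, y, p=2):
    """L_p distance between two numeric vectors (p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def d(A, B, p=2):
    """Eq. (1): average of the corner-to-corner distances."""
    (lA, uA), (lB, uB) = A, B
    return (lp_dist(lB, lA, p) + lp_dist(uB, uA, p)) / 2

def volume_of_merge(A, B):
    """Eqs. (2)-(3): volume of the smallest hyperbox embracing A and B."""
    (lA, uA), (lB, uB) = A, B
    v = 1.0
    for i in range(len(lA)):
        v *= max(uA[i], uB[i]) - min(lA[i], lB[i])
    return v
```

For two unit squares [0, 1] x [0, 1] and [2, 3] x [0, 1], the corner distances are both 2, so d(A, B) = 2, while the embracing box [0, 3] x [0, 1] has volume 3.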
Fig. 5. Information granule C as a result of combining A and B.
The two expressions (1)-(2) are the elements of the compatibility measure, compat(A, B), defined as

    compat(A, B) = 1 − d(A, B) e^(−αV(C)).    (4)
The rationale behind this form of the compatibility measure is as follows. In clustering we aggregate the two information granules that are closest together, i.e., whose compatibility measure is highest. In light of this criterion, the candidate granules to be clustered should not only be "close" enough (which is reflected by the distance component) but the resulting granule should also be "compact" (meaning that the size of the granule in every dimension is approximately equal). The second requirement favors such A and B that give rise to a maximum volume for a given d(A, B); in other words, it stipulates the formation of hyperboxes that are as similar to hypercubes as possible. The exponential term in this expression normalizes all values to the unit interval. In particular, the volume of a point produces e^0 = 1. When the volume increases, the exponential function goes to zero. The parameter α balances the two concerns in the compatibility measure and is chosen so as to control the extent to which the volume impacts the compatibility measure. The compactness factor e^{-αV(C)} introduced in the compatibility measure is critical to the overall processing of information granules. By contrast, it is not essential and does not play any role when we cluster point-size data instead of granules. To constrain the values of the compatibility measure to the unit interval, we consider the data to lie in the unit hypercube [0, 1]^n ⊂ R^n (in other words, we normalize the data before computing the value of (4)) and consider a normalized distance assuming values in the unit interval.
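Putting the pieces together, the compatibility measure (4) can be sketched as below; the names are again illustrative and α is the balancing parameter discussed in the text.

```python
import math

# Hedged sketch of Eq. (4): compat(A, B) = 1 - d(A, B) * exp(-alpha * V(C)),
# with a granule stored as a pair (l, u) of corner vectors and the data
# assumed normalized to the unit hypercube.

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def compat(A, B, alpha=1.0):
    (lA, uA), (lB, uB) = A, B
    d = 0.5 * (euclid(lA, lB) + euclid(uA, uB))   # Eq. (1) with p = 2
    lC = [min(a, b) for a, b in zip(lA, lB)]      # merged granule C
    uC = [max(a, b) for a, b in zip(uA, uB)]
    vC = 1.0
    for lo, hi in zip(lC, uC):
        vC *= hi - lo                             # Eq. (2)
    return 1.0 - d * math.exp(-alpha * vC)
```

Two coincident points yield compat = 1; for a fixed distance, the measure is higher when the merged box is closer to a square, which is exactly the preference analyzed below.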
From Numbers to Information Granules
To gain a better insight into what is really accomplished by the above compatibility measure, let us study two points (numeric values) A and B situated in R^2. Furthermore, let A be fixed and located at the origin of the coordinates, while we allow B some flexibility. Here d(A, B) is just the standard Euclidean distance. It becomes obvious that all elements (Bs) located on a circle of a fixed radius exhibit the same distance value. Restrict now the choice of B to this pool. If we connect A with any such B, the resulting volume changes depending upon the location of B. Interestingly, out of all Bs, there are four locations on the circle for which the volume of the resulting granule attains its maximum. This happens when that box (the information granule formed by clustering A and B) is a square. In other words, the compatibility measure attains a maximal value when C is a hypercube.
Fig. 6. The calculations of the compatibility measure; note that there are four possible candidates (Bs) on the circle that maximize this measure.
If we plot the compatibility measure as a function of φ (where φ is the angular position of B), we can easily see that the value of the compatibility measure is modulated by the angle (or, equivalently, the shape of the resulting information granule C); see Fig. 7. More importantly, these graphical considerations shed light on the geometry of the information granules that are preferred by the introduced compatibility measure. This preference reflects a principle that may be termed the principle of balanced information granularity. In a nutshell, in building new information granules, we prefer entities whose granularity is balanced along all dimensions (variables) rather than constructing
Fig. 7. The compatibility measure expressed as a function of φ (the plot is restricted to the first 90°); β = e^{-α/2}.
Fig. 8. Examples of information granules characterized by various degrees of balance of information granularity; note that C1 and C2 are highly unbalanced (they have high levels of information specificity along only one of the dimensions) while C3 is well balanced.
Fig. 9. Identification of Bs leading to the highest value of the compatibility measure calculated with the Euclidean distance (a), Hamming distance (b) and Tchebyschev distance (c); the contours for α > 0 and α = 0 are shown in each case.
granules that are highly unbalanced. A number of selected examples of varying granularity are portrayed in Fig. 8. When we change the distance function to the Hamming (p = 1) or Tchebyschev (p = ∞) distance, we still have a number of Bs to choose from, yet this selection is made from different geometrical figures (a diamond and a square, respectively), Fig. 9. Moving on to the case where both A and B are information granules, the resulting plots visualizing the compatibility measure are collected in Fig. 10.
(a) Two hyperboxes representing information granules in a unit box in R^2.

(b) Compatibility measure with the L2 distance measure.
(c) Compatibility measure with the L1 distance measure.

(d) Compatibility measure with the L∞ distance measure.

Fig. 10. Comparison of compatibility measures obtained with various distance measures. Note the preference that the compatibility measure gives to hyperboxes that are well balanced in all dimensions. This contrasts with the membership function proposed in Ref. 15 and illustrated in Fig. 4.
As the clustering proceeds (refer to Fig. 3), the process of merging progressively less closely associated patterns is reflected in the gradual reduction of the compatibility measure (4). A typical plot of the evolution of the compatibility measure over the complete clustering cycle is shown in Fig. 11. The proximity of patterns that are merged into granules at the early stages of the clustering process is reflected in the relatively small gradient of the compatibility measure curve. In contrast, the large gradient of the curve, at the final stages of clustering, indicates the merging of highly incompatible clusters. The compatibility measure curve therefore provides
Fig. 11. An example of the evolution of the compatibility measure over the full cycle of the clustering process (p is the initial number of patterns); the intersection of the two gradient lines indicates the optimal number of clusters.
a convenient reference for identifying how many clusters are needed to capture the essential characteristics of the input data, while providing the best generalization. The intersection of the two gradient lines (as visualized in Fig. 11) can be used as an approximation to the optimal number of clusters. This number provides a good starting point in the subsequent optimization of the overlap of the identified clusters, as discussed below. Referring to the compatibility index, we can also consider a modified form based on the sum of the sides (edges) of the hyperbox:

compat(A, B) = 1 - d(A, B) e^{-αL(C)}
(5)
with

L(C) = Σ_{i=1}^{n} length_i(C).    (6)
Considering the nature of these indexes, we refer to the first index (4) as volume-driven and to the second (5) as edge-driven. To compare the two forms of the compatibility index, we consider a simple two-dimensional case in which both A and X are numeric. We allow X to move on a unit circle while A is located at the origin of the coordinates (see Fig. 12). In this way the distance is always equal to 1 and the compatibility can be expressed as a function of a single angle φ: for the volume-driven version compat(A, X) = 1 - e^{-sin φ cos φ}, and for the edge-driven version compat(A, X) = 1 - e^{-(sin φ + cos φ)}. The plots of the compatibility measures
Fig. 12. Computing the compatibility measure for A and X, expressed as a function of φ.
Fig. 13. Compatibility measures as a function of φ: (a) volume-driven, (b) edge-driven; φ is constrained to [0, π/2].
are shown in Fig. 13. It becomes obvious that the highest compatibility value is achieved for the same value of the angle, i.e., φ = π/4. The two compatibility measures exhibit a visible difference when we look at their sensitivity, defined as

sens(φ) = ∂compat(A, X)/∂φ.

Figure 14 reveals that the compatibility based on the volume of the information granule has higher sensitivity.
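The two closed forms are easy to check numerically. The sketch below (our own, with the parameter α absorbed into the exponent) scans φ over [0, π/2] and confirms that both variants peak at φ = π/4.

```python
import math

# Volume-driven and edge-driven compatibility for A at the origin and X on
# the unit circle, so that d(A, X) = 1 (see Fig. 12).
def compat_volume(phi):
    return 1.0 - math.exp(-math.sin(phi) * math.cos(phi))

def compat_edge(phi):
    return 1.0 - math.exp(-(math.sin(phi) + math.cos(phi)))

# Scan the first quadrant on a regular grid that contains pi/4 exactly.
phis = [i * (math.pi / 2) / 100 for i in range(101)]
best_vol = max(phis, key=compat_volume)
best_edge = max(phis, key=compat_edge)
# Both maxima sit at phi = pi/4, i.e. a square granule.
```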
Fig. 14. Sensitivity of the compatibility measures as a function of φ: (a) volume-driven, (b) edge-driven; φ is constrained to [0, π/2].
4.2. Expressing Inclusion and Overlap of Information Granules
The inclusion relation expressing the extent to which A is included in B is defined as a ratio of two volumes:

incl(A, B) = V(A ∩ B)/V(A).
(7)
It is clear from the above that the inclusion measure is monotonic and non-commutative, and that it satisfies the boundary conditions incl(A, X) = 1 and incl(A, ∅) = 0, where X and ∅ are the unit hyperbox and the empty set in R^n, respectively. The calculations are straightforward; Fig. 15 enumerates all cases for one-dimensional granules along with the pertinent values of this measure.
Fig. 15. Computing the inclusion for two information granules A and B; the one-dimensional cases range from incl(A, B) = 1 to incl(A, B) = 0.
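A sketch of the inclusion measure (7) for hyperboxes stored as (lower, upper) corner pairs follows; the names are ours, and the guard for a degenerate (zero-volume) A is an implementation choice, not from the chapter.

```python
def box_volume(l, u):
    """Volume of a hyperbox; empty (inverted) intersections clip to zero."""
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]   # intersection, lower corner
    ui = [min(a, b) for a, b in zip(uA, uB)]   # intersection, upper corner
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0
```

The boundary conditions of the text hold: incl(A, X) = 1 for the unit hyperbox X, and incl(A, B) = 0 for disjoint boxes.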
It is worth mentioning that the value of the inclusion measure drops rapidly (at a rate of a^n, where a ∈ (0, 1)) with increasing dimension of the feature space. For example, if there is a 50% overlap (a = 0.5) in each variable of an n-dimensional space, the inclusion level is expressed as 0.5^n. Clearly, the objective of effective information abstraction through clustering of information granules translates into identifying granules for which there is minimum overlap. To encourage the merging of granules that have significant overlap, we calculate the average of the maximum inclusion rates of each granule in every other granule:

overlap(c) = (1/c) Σ_{i=1}^{c} max_{j≠i} incl(A(i), A(j))    (8)
where c is the current number of granules and A(i) and A(j) are the i-th and j-th granules, respectively. However, we must point out that, while the measure (7) is monotonic for any two pairs of granules (i.e., if A ⊂ B and C ⊂ D, then incl(A, C) ≤ incl(B, D)), the change of the number and size of granules during clustering results in various local optima of (8). We illustrate this effect in Fig. 16.
Fig. 16. Progression from 5 to 2 granules involves stage (b), during which granules overlap. This is reflected in overlap(3) > 0 while overlap(5) = 0 and overlap(2) = 0; panel (d) plots the local minima of the overlap between granules against the number of granules.
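Equation (8) can be sketched on top of an incl() of the kind used for Eq. (7); all names below are illustrative.

```python
def box_volume(l, u):
    """Volume of a hyperbox; empty intersections clip to zero."""
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]
    ui = [min(a, b) for a, b in zip(uA, uB)]
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0

def overlap(granules):
    """Eq. (8): average of each granule's maximum inclusion in any other."""
    c = len(granules)
    total = 0.0
    for i, A in enumerate(granules):
        total += max(incl(A, B) for j, B in enumerate(granules) if j != i)
    return total / c
```

Disjoint granules give overlap = 0, mirroring the overlap(5) = 0 and overlap(2) = 0 stages of Fig. 16.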
Because of the local minima of the overlap(.) function, it is important to have a good initial estimate of the target number of clusters as a starting point for the local minimization of this function. Such an estimate is provided by our earlier analysis of the compatibility measure, as discussed in the previous section. Having completed clustering, the quality of data abstraction afforded by the given set of clusters is measured using an independent validation data set. The generality of each cluster is quantified by the sum of the inclusion rates of the validation data items in the respective cluster:

INCL(i) = Σ_{j=1}^{M} incl(V(j), A(i)),   i = 1, ..., c    (9)

where c is the number of clusters, M is the cardinality of the validation data set, V(j) are the validation patterns and A(i) are the clusters. As well as indicating whether a given cluster is representative of a large proportion of the data, the INCL(.) measure can be used to assess how representative the training and the validation data sets are. If the sets are representative, then INCL(.) should correlate closely with the cardinality of the individual clusters.
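The validation-stage measure (9) is then a straightforward sum. The sketch below reuses the incl() style of Eq. (7) and treats validation items as (small) hyperboxes; the names are illustrative.

```python
def box_volume(l, u):
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]
    ui = [min(a, b) for a, b in zip(uA, uB)]
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0

def INCL(cluster, validation):
    """Eq. (9): summed inclusion of the validation granules V(j) in A(i)."""
    return sum(incl(V, cluster) for V in validation)
```

A validation granule entirely inside the cluster contributes 1 to the sum; one that straddles the cluster boundary contributes its included fraction.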
5. The Granular Analysis

The hyperboxes constructed during the design phase are helpful in a thorough analysis of the data set. They shed light on the nature of the data as perceived from the standpoint of information granularity. We discuss two aspects of this analysis. First, we characterize the hyperboxes themselves. Second, we analyze the properties of the variables (features) forming the data space. We should emphasize that the granular analysis follows the clustering phase and does not impact it in any way. To maintain the conciseness of the presentation, we consider that each of the c hyperboxes located in the n-dimensional space is fully described by the vectors of its lower and upper corners (coordinates), B(k) = {l_B(k), u_B(k)}, k = 1, 2, ..., c, where l_B(k) and u_B(k) are vectors of the corresponding coordinates, that is l_B(k) = [l_B1(k), l_B2(k), ..., l_Bn(k)] and u_B(k) = [u_B1(k), u_B2(k), ..., u_Bn(k)].
5.1. Characterization of Hyperboxes
The most evident characterization of the hyperboxes is provided by their volumes, V(B(k)). The computations are straightforward. First, we determine a ratio (normalized length)

norm_length_i(B(k)) = (u_Bi(k) - l_Bi(k)) / range_i(B(k))    (10)

where range_i(B(k)) is the range of the i-th feature (variable). Since the data is normalized to a unit hypercube, range_i(B(k)) = 1 for all i. Second, the volume is taken as the product

V(B(k)) = ∏_{i=1}^{n} norm_length_i(B(k)).    (11)
The volume quantifies the granularity of the hyperboxes. Intuitively, it states how "large" (detailed) the hyperboxes are and how much detail each captures. One can take the average of the volumes of the hyperboxes, which gives a general summary:

(1/c) Σ_{k=1}^{c} V(B(k)).    (12)
If one side of the hyperbox has zero length, then the volume measure returns zero. This occurs because of the multiplicative nature of the volume. To alleviate this problem, we may also introduce an additive measure. A plausible descriptor of a hyperbox reflects its "circumference" and reads as follows:

Σ_{i=1}^{n} norm_length_i(B(k)).    (13)
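For data normalized to the unit hypercube (range_i = 1), the descriptors (10)-(13) reduce to the following sketch; the function names are ours.

```python
def norm_lengths(box):
    """Eq. (10) with range_i = 1: the plain side lengths of the box."""
    l, u = box
    return [hi - lo for lo, hi in zip(l, u)]

def box_volume(box):
    """Eq. (11): multiplicative descriptor."""
    v = 1.0
    for s in norm_lengths(box):
        v *= s
    return v

def avg_volume(boxes):
    """Eq. (12): average volume over the c hyperboxes."""
    return sum(box_volume(b) for b in boxes) / len(boxes)

def circumference(box):
    """Eq. (13): additive descriptor; non-zero even for flat boxes."""
    return sum(norm_lengths(box))
```

A flat box such as ([0, 0], [0.5, 0]) has zero volume but circumference 0.5, which is exactly the failure mode the additive measure (13) addresses.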
5.2. Granular Feature Analysis
The granulation of the data space (and of each feature) provides an interesting insight into the nature of the variables occurring in the problem. In what follows, we describe the variables in terms of their sparsity and discriminative power. These two descriptors are implied by the granular nature of the hyperboxes.
5.2.1. Sparsity

When looking at a certain variable across the hyperboxes, we can visualize how much of the entire range of the variable is occupied by the hyperboxes (i.e., how sparse the boxes are in the given space). Take the i-th feature and calculate the sum of the lengths of the corresponding sides of the hyperboxes

tot_length_i = Σ_{k=1}^{c} length_i(B(k))    (14)

where length_i(B(k)) = u_Bi(k) - l_Bi(k) and i = 1, 2, ..., n. The sparsity, defined in the form

sparsity_i = tot_length_i / range_i,    (15)
assumes values in the unit interval. If sparsity_i is less than 1, this represents a situation where all hyperboxes (more precisely, their i-th coordinates) occupy only a portion of the entire range of the feature. We may say that the variable is "underutilized"; in other words, we witness a highly localized usage of this feature. A sparsity value near 1 means a complete utilization of the variable. Overutilization happens when the sparsity achieves values higher than 1 (in this case some hyperboxes overlap along this variable). The sparsity measure does not capture the entire picture. The situation illustrated in Fig. 17 shows two cases where the distribution of the hyperboxes along the given feature is very different, yet we end up with the same value of the sparsity. This leads us to another index that describes the overlap between the hyperboxes.
Fig. 17. Two different distributions of hyperboxes (i-th feature) producing the same value of the sparsity index; in both cases the sparsity is equal to 0.3.
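A sketch of Eqs. (14)-(15); with normalized data, range_i = 1, and the division is shown explicitly even though it equals 1 here. The name is illustrative.

```python
def sparsity(boxes, i, range_i=1.0):
    """Eqs. (14)-(15): summed i-th side lengths over the feature's range."""
    tot_length = sum(u[i] - l[i] for l, u in boxes)    # Eq. (14)
    return tot_length / range_i                        # Eq. (15)
```

Two very different one-dimensional layouts whose side lengths sum to 0.3 both give sparsity 0.3, matching the value quoted in the caption of Fig. 17 and showing why sparsity alone cannot distinguish them.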
5.2.2. Overlap Index

We define the following index, called the coordinate overlap:

c_overlap_i = (2/(c(c-1))) Σ_{k=1}^{c-1} Σ_{l=k+1}^{c} length_i(I(k) ∩ I(l)) / length_i(I(k) ∪ I(l))    (16)
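One way to sketch a coordinate-overlap index of this kind is shown below; the pairwise intersection-over-union of the i-th sides follows Eq. (16), while the averaging factor 2/(c(c-1)) and the names are our assumptions.

```python
def interval_iou(a, b):
    """length(I(k) ∩ I(l)) / length(I(k) ∪ I(l)) for intervals (lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def c_overlap(intervals):
    """Eq. (16): pairwise-averaged coordinate overlap along one variable."""
    c = len(intervals)
    total = sum(interval_iou(intervals[k], intervals[l])
                for k in range(c) for l in range(k + 1, c))
    return 2.0 * total / (c * (c - 1))
```

Pairwise-disjoint intervals give 0 (a highly discriminative feature); identical intervals give 1.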
for i = 1, 2, ..., n. In this definition, I(k) and I(l) are the intervals (sides) of the hyperboxes for the i-th variable. The higher the value of this index, the more the hyperboxes overlap when projected on the given variable. When the I(k) and I(l) are pairwise disjoint, the overlap is equal to zero. This means that the feature is highly discriminative, as it separates the hyperboxes. The higher the overlap measure, the lower the discriminative power of the feature. Each of the measures leads to a linear ordering of the features. We can easily state which of the features is highly "utilized" and which of them comes with the most significant discriminative properties. To form a comprehensive picture, one can localize each feature in the sparsity-overlap space. By doing this, one can distinguish the variables that are essential to the PR problem. More specifically, we prefer features that exhibit low overlap (as those come with strong discriminative properties) along with low values of sparsity (localized usage of the variable). It should be stressed that these descriptors (sparsity and overlap) emerge as important quantifiers because of the existence of the information granules forming the hyperboxes.

6. Experimental Studies

The series of experiments is aimed at visualizing the most essential features of granular clustering. We consider both a synthetic data set and a real-life data set available on the WWW (Boston housing data).

6.1. Synthetic Data

The synthetic data sets consist of 3 groups of information granules (hyperboxes), A_i ∈ [0, 1] × [0, 1], generated by a random number generator with a uniform distribution. Each group comprises 20 granules dispersed around the pre-defined points c_1 = [0.4, 0.4], c_2 = [0.5, 0.6] and c_3 = [0.8, 0.3]. The dispersion factor σ is varied between 0.08 and 0.15 to establish the sensitivity of the clustering process to the dispersion of the data. The clustering
Fig. 18. Compatibility measure for a single clustering process.
process is governed by the compatibility measure (4), with the distance defined according to the L2 norm and the "compactness" factor α = 0.5. An example of the evolution of the compatibility measure throughout the clustering process is shown in Fig. 18. The intersection of the two asymptotes to the compatibility measure, traced at the beginning and at the end of the clustering process, indicates that 3 clusters (iteration 57) mark a natural 'change-over' point in the behavior of the system. So, the clustering process should terminate with 3 clusters, provided that the degree of overlap of the clusters is also minimized for this number of clusters. The degree of overlap of the clusters was evaluated at each of the 59 iterative steps of the clustering process, according to equation (16), and is depicted in Fig. 19. As expected, the results of the cluster overlap analysis confirm that the test data naturally falls into 3 clusters, since the overlap function assumes a local minimum for 4 or fewer clusters. The quality of data abstraction achieved through clustering is assessed by evaluating the inclusion rate (9) of independently generated data (with the same statistical properties) in the clusters that have been identified. The change of the overall inclusion rate of the validation data throughout the clustering process is illustrated in Fig. 20. It is not surprising to see that the high value of the average inclusion rate for 3 or fewer clusters confirms that 3 clusters capture the essential features of the data while the
Fig. 19. An average degree of overlap of clusters.
Fig. 20. Average inclusion rate for the validation data set.
high value of the compatibility measure confirms that the clusters retain high specificity. Should the number of clusters be reduced to 2 or 1, the inclusion rate of the validation data set would only be improved marginally while there would be a very significant reduction in the specificity of the cluster(s).
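The agglomerative process these experiments follow (merge the most compatible pair, record the compatibility, repeat) can be sketched as below. The function names, the target-count stopping rule and α = 0.5 are ours; the L2 distance and Eq. (4) follow the experimental setup.

```python
import math

# Minimal sketch of granular agglomerative clustering driven by Eq. (4).
# A granule is a pair (l, u) of corner vectors.

def dist(A, B):
    (lA, uA), (lB, uB) = A, B
    d = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 0.5 * (d(lA, lB) + d(uA, uB))                 # Eq. (1), p = 2

def merge(A, B):
    (lA, uA), (lB, uB) = A, B
    return ([min(a, b) for a, b in zip(lA, lB)],
            [max(a, b) for a, b in zip(uA, uB)])

def volume(C):
    l, u = C
    v = 1.0
    for lo, hi in zip(l, u):
        v *= hi - lo
    return v

def compat(A, B, alpha=0.5):
    return 1.0 - dist(A, B) * math.exp(-alpha * volume(merge(A, B)))

def granular_clustering(granules, target, alpha=0.5):
    granules = list(granules)
    trace = []                 # compatibility at each merge (cf. Fig. 11/18)
    while len(granules) > target:
        pairs = [(compat(granules[i], granules[j], alpha), i, j)
                 for i in range(len(granules))
                 for j in range(i + 1, len(granules))]
        best, i, j = max(pairs)
        trace.append(best)
        merged = merge(granules[i], granules[j])
        granules = [g for k, g in enumerate(granules) if k not in (i, j)]
        granules.append(merged)
    return granules, trace
```

Run on two well-separated groups of points, the loop merges within each group first, so the recorded trace mimics the slow initial decline and late drop of the compatibility curve.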
In order to achieve a degree of independence from the statistical characteristics of the random number generator, the evaluation of the inclusion of the validation data sets in the clusters was repeated 100 times for each value of σ ∈ {0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15} and for the number of clusters varying from 1 to 10. A total of 8000 training sets and 8000 validation sets were processed. Figure 21 illustrates how the inclusion measure (7) depends on the data dispersion parameter σ and the number of
Fig. 21. Average inclusion measure evaluated for 8000 training and validation sets.
Fig. 22. 2-D projection of the surface from Figure 21, resulting in a family of curves illustrating the average inclusion rates of the validation data in clusters for the various values of σ.
Fig. 23. Edge-driven compatibility measure.
Fig. 24. Average degree of overlap of clusters throughout the clustering process.
clusters. It is interesting to note that σ has little influence on the value of the inclusion measure. This is a very desirable characteristic of the clustering process, since it suggests that the precise statistical properties of the data sets do not need to be known for the clustering to be effective. It is easy to note, from Figs. 21 and 22, that an inclusion rate of 0.9 or higher is consistently attained with 3 or fewer clusters.
Fig. 25. Average inclusion rate of the validation data in the clusters.
We now assess the progression of the clustering process using the edge-based compatibility measure defined by (5). Figure 23 illustrates a typical evolution of the compatibility measure. As expected, the asymptotic change of character of this function occurs at iteration 57, indicating that there are 3 significant clusters. The results illustrated in Figs. 23-25 are directly comparable to those obtained in the earlier experiments. The asymptotic behaviour of the compatibility measure (Figs. 18 and 23) is nearly identical, and the only noticeable difference in the progression of clustering occurs at the intermediate stages.

6.2. Boston Housing Data
Although for 2-dimensional data sets B ∈ P(R^2) the number of clusters can easily be established by visual inspection, higher-dimensional data presents a significant challenge. We have therefore applied the algorithm to a realistic 14-dimensional data set representing factors affecting house prices in the Boston area (USA). The data set was originally compiled by Harrison and Rubinfeld [6] and is available from the Machine Learning Database at the University of California at Irvine (http://www.ics.uci.edu/~mlearn/MLSummary.html). The data set comprises 506 records.
The 14 attributes of each data record are as follows:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's
6.2.1. Study A

We divided the original set into two sets: the training set, comprising the 253 odd-numbered records, and the validation set, comprising the 253 even-numbered records. It should be noted that, as a pre-processing step, all data has been mapped into a 14-dimensional unit hyperbox. The compatibility measure provided direction for the clustering process, and the evolution of this measure throughout the whole process is presented in Fig. 26. The gradients of the compatibility measure at the beginning and the end of the process indicate that 7 clusters represent a good abstraction of the training data. In the vicinity of 7 clusters, the cluster overlap indicator is minimized for 7 and 8 clusters, as shown in Fig. 27. Of these two possible numbers of clusters we select the smaller one, so as to achieve greater granulation of the original data. The generality of the identified clusters was tested by evaluating the average inclusion of the validation data set (the even-numbered records of the original data set) in the sets of clusters identified in the last 50 steps of the clustering process. This is illustrated in Fig. 28. The value of over 90%, achieved for 7 clusters, indicates a good abstraction of the data.
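The Study A pre-processing (odd/even split of the 506 records and mapping into the unit hyperbox) can be sketched as follows; loading of the actual Boston data is omitted and the helper names are ours.

```python
def unit_normalize(records):
    """Map each of the n features linearly onto [0, 1] (min-max scaling)."""
    n = len(records[0])
    lo = [min(r[i] for r in records) for i in range(n)]
    hi = [max(r[i] for r in records) for i in range(n)]
    return [[(r[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
             for i in range(n)] for r in records]

def odd_even_split(records):
    """Records 1, 3, 5, ... (1-indexed) for training; 2, 4, 6, ... for validation."""
    return records[0::2], records[1::2]
```

Applied to 506 records, the split yields the 253/253 partition used in the study.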
Fig. 26. Compatibility measure of clusters formed from the odd-numbered records in the Boston housing data set. Iteration no. 245 corresponds to 7 clusters.
Fig. 27. Degree of average overlap of clusters in the last 50 out of 252 iterations.
To gain a more detailed insight into the makeup of the 7 clusters, we evaluated the aggregate inclusion measure (9) using the validation set, and compared the results with the cardinality of each cluster. It is clear, from Fig. 29, that 3 of the 7 clusters have significant support in the two data sets, while the other 4 clusters represent data that could be described as significant exceptions. It is interesting to note, however, that the zero inclusion
Fig. 28. Average inclusion measure evaluated for 1 to 50 clusters.
Fig. 29. Cardinality (first bar) and the aggregate inclusion rate (second bar) for each of the 7 clusters.
rates of the validation data in clusters 3, 4 and 7 indicate that the small data sample makes it difficult to perform a proper evaluation of the clusters. The full description of the identified clusters is given in Table 1. The results of our feature analysis are summarized, in terms of sparsity and overlap values, in Table 2. This analysis provides an interesting observation about the discriminatory properties of the variables in the problem. The most dominant ones
Table 1. Description of the 7 clusters (each cell gives l_i/u_i, where l_i is the minimum and u_i the maximum coordinate of the i-th variable of the hyperbox).

Var  Cluster 1          Cluster 2          Cluster 3          Cluster 4          Cluster 5          Cluster 6          Cluster 7
 1   0.0063/0.0063      0.0686/2.7795      1.1265/3.3213      2.0099/2.0099      3.4744/8.9834      2.3783/73.5337     88.9762/88.9762
 2   0/0                0/0                0/0                0/0                0/0                0/0                0/0
 3   0.7399/0.7399      8.1399/27.7400     19.5800/19.5800    19.5800/19.5800    18.1001/18.1001    18.1001/18.1001    18.1001/18.1001
 4   0/0                0/0                1/1                0/0                1/1                0/0                0/0
 5   0.3850/0.3850      0.5200/0.8710      0.8710/0.8710      0.6050/0.6050      0.6310/0.7700      0.5320/0.7700      0.6710/0.6710
 6   4.9730/4.9730      4.9030/6.4580      5.0120/6.1290      7.9290/7.9290      5.8750/8.7800      4.1380/7.0610      6.9680/6.9680
 7   6.0004/6.0004      69.6999/100.0000   88.0004/100.0000   96.2005/96.2005    82.8997/97.4997    41.9002/100.0000   91.8999/91.8999
 8   1.7984/10.7103     1.3459/3.9900      1.3216/1.7494      2.0459/2.0459      1.1296/2.7227      1.1370/3.7240      1.4165/1.4165
 9   1.0000/8.0000      2.0000/4.9999      4.9999/4.9999      4.9999/4.9999      24.0000/24.0000    24.0000/24.0000    24.0000/24.0000
10   192.9998/469.0011  188.0008/711.0000  402.9980/402.9980  402.9980/402.9980  665.9989/665.9989  665.9989/665.9989  665.9989/665.9989
11   12.6000/22.0000    14.7000/21.2000    14.7000/14.7000    14.7000/14.7000    20.2000/20.2000    20.2000/20.2000    20.2000/20.2000
12   288.9906/396.9000  70.8002/396.9000   321.0184/396.9000  369.2980/369.2980  347.8787/395.4287  0.3200/396.9000    396.9000/396.9000
13   1.9199/30.8101     6.4300/29.6801     12.1200/26.8200    3.7000/3.7000      2.9600/17.5999     3.2601/37.9700     17.2099/17.2099
14   12.7000/50.0000    8.1000/24.3000     13.4002/17.0002    50.0000/50.0000    5.0000/17.7998     50.0000/50.0000    10.4000/10.4000
Table 2. Characterisation of the 7 clusters in terms of sparsity and c-overlap for each of the 14 variables (dimensions).

Variable no.   sparsity   c-overlap
 1             0.135      0.1826
 2             0.136      0.7143
 3             0.201      0.2194
 4             0.143      0.3333
 5             0.291      0.1933
 6             0.326      0.3255
 7             0.307      0.3759
 8             0.210      0.3397
 9             0.062      0.2109
10             0.218      0.2381
11             0.241      0.2234
12             0.344      0.4357
13             0.458      0.4399
14             0.426      0.3759
are: crime rate (1), nitric oxide concentration (5), index of accessibility to radial highways (9), and proportion of non-retail business acres (3). In other words, these are the variables that discriminate between the hyperboxes (we stress that these discriminatory aspects were found in the setting of the information granules, rather than classes).

6.2.2. Study B

In order to ascertain whether the selection of records for the training and the validation data sets had significantly influenced the conclusions regarding the number of clusters in the original data set, we repeated the clustering process with the training and validation sets swapped. Again, the compatibility measure directed the clustering process, and the asymptotic evolution of the measure, at the initial and final stages of the process, indicated that 6 data clusters mark a 'change-over' point in the clustering process (Fig. 30). The curve showing the average degree of overlap between the clusters, illustrated in Fig. 31, indicates that a minimum overlap is achieved with 6, 7 and 8 clusters. For ease of comparison with Study A, we select 7 clusters for the validation stage. The average inclusion rate of the
Fig. 30. Compatibility measure of clusters formed from the even-numbered records in the Boston housing data set. Iteration no. 246 corresponds to 6 clusters.
Fig. 31. Degree of average overlap of clusters in the last 50 iterations.
validation data set (the odd-numbered records of the original data set) in the 7 clusters is slightly worse than in the previous case, averaging 86%. This is illustrated in Fig. 32. The reduction of the average inclusion rate in this case suggests that the training and validation sets contain a small number of unique patterns that do not have counterparts in the other set.
Fig. 32. Inclusion measure evaluated for 1 to 50 clusters.
Fig. 33. Cardinality (first bar) and the aggregate inclusion rate (second bar) for each of the 7 clusters.
The result is that, although the distinctiveness of these patterns warrants their inclusion in separate clusters, the cross-comparison of these minority clusters is very limited. This is further verified by the inspection of Fig. 33, which shows that clusters 3, 5 and 7 represent 1, 1 and 2 patterns respectively, and have no corresponding patterns in the validation set. It is also interesting to note that, compared to Study A, there is a greater discrepancy between the cardinality of the clusters and the
inclusion rate. We conclude therefore that the size of the data supports firm conclusions about only 2 clusters, and that the characterization of further clusters requires an order of magnitude larger data sample. The sparsity and c-overlap of the features (variables), given in Table 3, are very similar to those in Study A, meaning that some global properties discovered in the data set have been retained.

Table 3. Characterisation of the 7 clusters in terms of sparsity and c-overlap for each of the 14 variables (dimensions).

Variable no.   sparsity   c-overlap
 1             0.117      0.1414
 2             0.229      0.2667
 3             0.284      0.1432
 4             0.143      0.5238
 5             0.258      0.0985
 6             0.348      0.3391
 7             0.393      0.3144
 8             0.221      0.1674
 9             0.075      0.1769
10             0.155      0.1560
11             0.228      0.0760
12             0.303      0.3276
13             0.443      0.3762
14             0.412      0.3175
7. Conclusions

The study has articulated an alternative view of unsupervised pattern recognition by providing a constructive method of forming information granules that capture the essence of large collections of heterogeneous numeric data. In this sense, the original data are compressed down to a few information granules whose location in the data space and granularity reflect the structure in the data. The approach promotes data-driven problem solving by emphasizing the transparency of the results (hyperboxes). The formation of information granules is guided by two aspects: the distance between information granules, and the size (granularity) of the potential information granule formed through merging two other granules. These two aspects
are encapsulated in the form of the compatibility measure. Moreover, we discussed a number of indexes describing the hyperboxes and expressing relationships between such information granules. We showed how to validate the granular structure. The resulting family of information granules is a concise descriptor of the structure of the data — we may call them a granular signature of the data. Some further extensions of the hyperbox approach may deal with more detailed instruments of information granulation such as fuzzy sets.7,11 It should be stressed that the proposed approach to data analysis is noninvasive, meaning that we have not attempted to formulate specific assumptions about the distribution of the data but rather allow the data to "speak" freely. This is accomplished in two main ways:

• First, the hyperboxes are easily understood by a user, as each dimension (variable) comes as a part of the construct.

• Second, the approach finds relationships that are direction-free, meaning that we do not distinguish between input and output variables (which could be quite restrictive, as we may not know in advance what implies what). Obviously, this feature is common to all clustering methods.

Furthermore, the granulation mechanism puts the variables (features) existing in the problem in a new perspective. The two indexes, sparsity and overlap, are useful in understanding the relevance of the variables, in particular their discriminatory abilities. While the study was concerned with the development of information granules (hyperboxes), there are interesting inquiries into their use in granular modeling. In particular, we are concerned with the fundamental inference problem: given an input datum (an information granule or, in particular, a numeric datum) X defined in a certain subspace of dimension n' of the original space R^n (R^n' ⊂ R^n), and a collection of information granules B = {B(1), B(2), ..., B(c)}, determine the corresponding information granule Y. The current paper provides a basis for this investigation.

Acknowledgments

Support from the Engineering and Physical Sciences Research Council (UK), the Natural Sciences and Engineering Research Council of Canada
(NSERC) and the Alberta Consortium of Software Engineering (ASERC) is gratefully acknowledged.

References

1. A. Baraldi, P. Blonda, A survey of clustering algorithms for pattern recognition, IEEE Trans. on Syst. Man and Cybernetics: Part B, Vol. 29, 6, 1999, pp. 778-785.
2. A. Bargiela, Interval and ellipsoidal uncertainty models, in: W. Pedrycz (ed.), Granular Computing, Springer Verlag, 2001.
3. A. Bargiela, W. Pedrycz, Information granules: Aggregation and interpretation issues, submitted to IEEE Trans. on Syst. Man and Cybernetics.
4. J. C. Bezdek, J. M. Keller, R. Krishnapuram, N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, 1999.
5. B. Gabrys, A. Bargiela, General fuzzy min-max neural network for clustering and classification, IEEE Trans. on Neural Networks, Vol. 11, No. 3, pp. 769-783, 2000.
6. D. Harrison, D. L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ. Economics & Management, Vol. 5, pp. 81-102, 1978.
7. A. Kandel, Fuzzy Mathematical Techniques with Applications, Addison-Wesley, Reading, MA, 1986.
8. T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, 43, 1982, pp. 59-69.
9. T. Kohonen, Self-Organizing Maps, Springer Verlag, Berlin, 1995.
10. W. Pedrycz, Computational Intelligence: An Introduction, CRC Press, Boca Raton, FL, 1997.
11. W. Pedrycz, F. Gomide, An Introduction to Fuzzy Sets, MIT Press, Cambridge, MA, 1998.
12. W. Pedrycz, Fuzzy equalization in the construction of fuzzy sets, Fuzzy Sets and Systems, Vol. 119, 2, 2001, pp. 329-335.
13. W. Pedrycz, M. H. Smith, A. Bargiela, A granular signature of data, Proc. 19th Int. (IEEE) Conf. NAFIPS'2000, Atlanta, July 2000, pp. 69-73.
14. P. K. Simpson, Fuzzy min-max neural networks — Part 1: Classification, IEEE Trans. on Neural Networks, Vol. 3, No. 5, pp. 776-786, September 1992.
15. P. K. Simpson, Fuzzy min-max neural networks — Part 2: Clustering, IEEE Trans. on Neural Networks, Vol. 4, No. 1, pp. 32-45, February 1993.
16. L. A. Zadeh, Fuzzy sets and information granularity, in: M. M. Gupta, R. K. Ragade, R. R. Yager (eds.), Advances in Fuzzy Set Theory and Applications, North Holland, Amsterdam, 1979, pp. 3-18.
17. L. A. Zadeh, Fuzzy logic = Computing with words, IEEE Trans. on Fuzzy Systems, Vol. 4, 2, 1996, pp. 103-111.
18. L. A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 1997, pp. 111-117.
19. M. Meneganti, F. S. Saviello, R. Tagliaferri, Fuzzy neural networks for classification and detection of anomalies, IEEE Trans. on Neural Networks, Vol. 9, 5, 1998, pp. 848-861.
20. A. Joshi, N. Ramakrishman, E. N. Houstis, J. R. Rice, On neurobiological, neuro-fuzzy, machine learning, and statistical pattern recognition techniques, IEEE Trans. on Neural Networks, Vol. 8, 1, 1997, pp. 18-31.
21. L. I. Kuncheva, J. C. Bezdek, Presupervised and post-supervised prototype classifier design, IEEE Trans. on Neural Networks, Vol. 10, 5, 1999, pp. 1142-1152.
CHAPTER 5

COMBINATION OF HIDDEN MARKOV MODELS AND NEURAL NETWORKS FOR HYBRID STATISTICAL PATTERN RECOGNITION

Gerhard Rigoll
Department of Computer Science, Faculty of Electrical Engineering
Gerhard-Mercator-University Duisburg, Germany
E-mail: rigoll@fb9-ti.uni-duisburg.de
In this chapter, several methods for combining Hidden Markov Models (HMMs) and Neural Networks (NNs) for hybrid HMM/NN systems are presented. These systems can be used in a variety of pattern recognition applications that are most successfully handled by statistical recognition methods. Typical applications include speech and handwriting recognition, as well as face identification or similar computer vision problems. Hidden Markov Models as well as Neural Networks are both very powerful pattern recognition paradigms with different advantages and disadvantages, and therefore their combination seems to be an attractive option in order to improve current state-of-the-art pattern recognition systems. It turns out that the combination of HMMs and NNs is a challenging and interesting subject that leads to a variety of different approaches for the realization of hybrid systems. Besides architectural issues, which are concerned with finding the optimal structure for combining both paradigms, the joint estimation of system parameters - so that both paradigms can interact in an optimal way and support each other - is the major problem. The feasibility and success of hybrid systems are confirmed by numerous applications mentioned in this chapter, where hybrid systems have proved their superior performance compared to standard approaches.
1. Introduction

The technique of Hidden Markov Models (HMMs) has emerged as one of the dominant pattern recognition technologies since the late 1970's.9 This is especially the case for time-varying, dynamic patterns, such as in
speech recognition or handwriting recognition. Today, almost every successful speech recognition system - ranging from commercial products for telephone applications to highly advanced and optimized prototypes with vocabularies up to 100,000 words and millions of parameters - is based on HMM technology.16 In handwriting, a similarly increasing dominance of HMM-based systems could be observed during the last decade.13 Neural Networks (NNs) became very popular in the late 1980's,12 but it quickly became clear that pure neural approaches cannot cope with several of the most important requirements for state-of-the-art speech or handwriting recognition systems, especially the requirements of continuous recognition and very large vocabularies. Today, there are almost no successful speech or handwriting recognition systems that are based on pure neural network technology.16 From the early 1990's on, a third type of recognition architecture for dynamic patterns has been developed by a few researchers. This technology is the so-called hybrid technology, where HMMs and NNs are combined to construct a system with improved classification capabilities.25 The reasons for combining HMMs and NNs into hybrid systems are quite obvious. Hidden Markov Models represent probably the most powerful tool for recognition of dynamic, time-varying patterns, where warping in time or space is one of the most important requirements. Neural Networks are superior static pattern classifiers and function approximators, and are furthermore known for their discriminative capabilities. HMMs are mainly trained as non-discriminative classifiers with very time-efficient Maximum-Likelihood-based training methods, whereas the training methods for neural nets are known to be effective but rather time consuming. By combining both powerful paradigms, the superior warping capabilities of HMMs can be combined with the discriminative characteristics of NNs.
Thus, the resulting hybrid classifier will consist of one part that can model the section of the recognition task that requires discriminative training between the classes and another part that derives additional classification parameters through fast ML-based estimation methods. In order to design hybrid HMM/NN systems, two major problems have to be solved. The first problem concerns the architectural aspect, i.e. how the structure of the HMM (e.g. discrete or continuous) should be chosen in order to fit to the NN and what NN paradigm should be used. The second problem is how both components can best be trained together. An analysis of the mathematical foundations of both paradigms, and the mathematical model of the combined
hybrid system, is required. In this chapter, a variety of different hybrid approaches are presented, most of them developed by our research group. They include the classical hybrid HMM/NN approach using neural nets as posterior probability estimators, the combination of discrete HMMs with Maximum Mutual Information Neural Networks, the use of neural nets as nonlinear feature extractors for continuous density HMMs, and a recently developed approach called Tied-Posteriors, related to the popular classical Tied-Mixture HMMs. It will also be shown that in many cases, it is possible to present hybrid HMM/NN systems in the framework of a pure neural implementation, and that this consideration leads to the derivation of relationships between all the different hybrid systems as well as to their non-hybrid counterparts. The aim of this chapter is to present an overview of several different hybrid approaches and to give the reader a brief description of their functionalities. What is considered to be even more important is the establishment of relationships between these different approaches and relationships to the traditional HMM approaches. For most readers, a hybrid system may look and function completely differently from a conventional system based on HMMs. For people having worked on hybrid systems for a long time, the difference between hybrid and traditional systems may have almost vanished. One of the goals of this chapter is to show the reader the close relationships between both technologies, and to determine if hybrid recognition systems are a real alternative to traditional approaches. This chapter demonstrates that all popular statistical pattern recognition techniques are more or less related to each other, but that hybrid approaches are still an interesting alternative to traditional techniques, which will take an important role in future pattern recognition research.

2. Traditional HMM-based Systems

It is (of course) not the goal of this section to present a complete overview of HMM systems. Instead, only the equations that are of interest for the establishment of relationships to hybrid approaches later on are given. It is assumed that most readers are specialists in pattern recognition, and thus may be at least partially familiar with the basic functionalities of HMMs. Therefore, typical issues such as estimation techniques (e.g. the FB-algorithm) or decoding techniques (e.g. the Viterbi search) are not the major subject of this chapter. Figure 1 gives a very compact explanation of the basic functionality
Fig. 1. General structure of a Hidden Markov Model.
of a Hidden Markov Model. It can be seen that the HMM is in fact a stochastic finite state automaton with several states, usually denoted as s_i, i = 1, ..., I, where I is the total number of states. Transitions between states i and j are possible with certain transition probabilities a_ij, which are usually summarized in a state transition matrix A = {a_ij}. In addition to these transition probabilities, there is an emission probability assigned to each state, which yields the probability of an observed feature vector x - denoted as p(x|s_i) - while the HMM is in state s_i. In Fig. 1, it can also be seen that the HMM processes a pattern that is represented as a sequence of feature vectors below the HMM. This pattern, denoted as X, consists of the feature vectors x(k), k = 1, ..., K, where K is the length of the feature vector sequence X. The pattern recognition process basically consists of aligning this feature vector sequence to the HMM states, as can be seen in Fig. 1. This is usually accomplished by the Viterbi algorithm. Application of this algorithm does not only lead to such an alignment, but additionally also to the computation of the observation probability that the observed pattern X has been generated by the underlying Hidden Markov Model M. This probability p(X|M) is computed by the Viterbi algorithm by exploiting the transition and emission probabilities of the model M.
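As an illustration (not code from the chapter), the Viterbi alignment and the accompanying score can be sketched in Python; the 2-state left-to-right model, its transition matrix and the emission probabilities below are invented for the example, and a uniform initial state distribution is assumed:

```python
import numpy as np

def viterbi(log_A, log_emis):
    """Align a feature-vector sequence to HMM states.

    log_A[i, j]   : log transition probability from state i to state j.
    log_emis[k, i]: log p(x(k) | s_i) for frame k and state i.
    Returns (best state path, log-probability of that path).
    """
    K, I = log_emis.shape
    delta = np.full((K, I), -np.inf)    # best log-score ending in state i at frame k
    psi = np.zeros((K, I), dtype=int)   # backpointers
    delta[0] = log_emis[0] + np.log(1.0 / I)  # uniform initial distribution (assumption)
    for k in range(1, K):
        scores = delta[k - 1][:, None] + log_A       # scores for every predecessor
        psi[k] = np.argmax(scores, axis=0)           # best predecessor per state
        delta[k] = scores[psi[k], np.arange(I)] + log_emis[k]
    path = [int(np.argmax(delta[-1]))]
    for k in range(K - 1, 0, -1):                    # trace the backpointers
        path.append(int(psi[k, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Toy 2-state left-to-right model with 4 frames.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
emis = np.array([[0.9, 0.1],    # early frames favour state 1
                 [0.8, 0.2],
                 [0.2, 0.8],    # later frames favour state 2
                 [0.1, 0.9]])
path, logp = viterbi(np.log(A + 1e-300), np.log(emis))
print(path)   # [0, 0, 1, 1]: each frame aligned to a state
```

The returned path is the frame-to-state alignment described above, and the score is the log-probability of the best path, i.e. the Viterbi approximation of p(X|M).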
It turns out that the most important link between HMMs and Neural Networks is the emission probability component. In fact, all hybrid approaches are based on an attempt to replace the HMM output probability distribution by a neural component. This is because the output probability modeling component is clearly the component with the largest impact on the recognition performance of an HMM-based system. Therefore, in the following subsections, the traditional output modeling techniques are briefly reviewed.
2.1. Discrete Hidden Markov Models
The basic functionality of an HMM output modeling component is to calculate the probability of a feature vector x if it is assumed that the underlying HMM is in the state s while x is being generated.18 This probability is denoted as p(x|s). The first approach used in the early generation of HMM-based systems is to quantize the vector x into discrete labels y_n and to use the frequency of those labels as a discrete probability distribution. The quantization is usually performed with a k-means vector quantizer (VQ) that assigns a discrete label y_n to each vector x according to

x → y_n    (1)
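The VQ assignment of Eq. 1, together with the label-frequency estimate of the discrete distribution mentioned above, can be sketched as follows; the 2-D codebook, the training vectors and their assignment to a single state s are all invented for illustration:

```python
import numpy as np

# Hypothetical codebook of N = 3 prototypes (e.g. from k-means) in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])

def quantize(x):
    """Eq. 1: map a continuous vector x to the label y_n of the nearest prototype."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

# Invented training vectors observed while the HMM was in state s; their label
# counts yield the discrete emission distribution p(y_n | s).
train_vectors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]])
counts = np.zeros(len(codebook))
for x in train_vectors:
    counts[quantize(x)] += 1
p_y_given_s = counts / counts.sum()

x_new = np.array([1.0, 0.9])
print(p_y_given_s[quantize(x_new)])   # table lookup: p(x|s) = p(y_n|s)
```

The final line is the fast table-lookup evaluation that the chapter attributes to discrete systems: quantize once, then read the probability from the table.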
Then, the state-conditional probability of this vector is expressed as

p(x|s) = p(y_n|s)    (2)

and the p(y_n|s) are obtained from a lookup table that has been computed from the counts of the various labels assigned to state s during training. It should be noted that the k-means clustering algorithm does not take into account the class labels of the training vectors x, and therefore this information is missing in the feature processor of a discrete system. Another problem is the quantization error resulting from the VQ process. The advantages of discrete systems include the speed of the table lookup procedure for calculating the output probabilities, and the fact that this method does not rely on any a priori assumptions about the distribution of the feature vectors. Figure 2 shows the basic structure of a discrete HMM. The feature vectors resulting from the input signal are processed by a vector quantizer, which could also be implemented in the form of a 1-layer neural network with winner-take-all characteristics in its output layer (e.g. a Kohonen-type neural network), resulting in the firing of the one single neuron that
Fig. 2. Basic structure of a discrete HMM.
has the largest (or smallest, depending on the propagation function) activation. This activation is set to 1.0, whereas all other activations take the value of 0.0. At the bottom of Fig. 2, the discrete emission probabilities for each state are shown as vertical bars. If they are multiplied with the activations of the neural network output layer, the resulting emission probability of the firing neuron (which is equivalent to the resulting VQ prototype) is generated.

2.2. Continuous Density Hidden Markov Models
Contrary to discrete systems, the state-conditional probability is computed for continuous HMMs as a sum of weighted Gaussians according to:

p(x|s) = Σ_n p(n|s) · G(x|n, s)    (3)

where G(x|n, s) is the nth Gaussian for state s, p(n|s) is the weight (or occurrence probability) of the nth Gaussian, and the summation is over all n_s Gaussians that belong to state s. A very popular approach is the creation of one large pool of N Gaussians which are shared by all states s of the HMM system. In this way, the Gaussians become independent of
the state, and are denoted as G(x|n). Each state still keeps its individual weights for the N Gaussians, and if these weights are expressed as p(y_n|s), the state-conditional probability for such a "Tied-Mixture system" can be expressed as

p(x|s) = Σ_{n=1}^{N} p(y_n|s) · G(x|n).    (4)
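Equation 4 can be illustrated with a small numeric sketch; the pool of spherical Gaussians and the state weights below are invented (a real system would use full or diagonal covariances estimated by EM):

```python
import numpy as np

def gaussian(x, mean, var):
    """Spherical Gaussian density G(x|n) in d dimensions (simplifying assumption)."""
    d = len(mean)
    diff = x - mean
    return float(np.exp(-0.5 * diff @ diff / var) / (2 * np.pi * var) ** (d / 2))

# Shared pool of N = 3 Gaussians, independent of the HMM state (tied mixtures).
means = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
variances = np.array([1.0, 0.5, 1.5])

def p_x_given_s(x, weights):
    """Eq. 4: p(x|s) = sum_n p(y_n|s) * G(x|n), with state-specific weights."""
    G = np.array([gaussian(x, m, v) for m, v in zip(means, variances)])
    return float(weights @ G)

weights_s1 = np.array([0.6, 0.3, 0.1])   # invented p(y_n|s1), summing to 1
x = np.array([0.5, 0.5])
print(p_x_given_s(x, weights_s1))
```

Note that only the weight vector changes from state to state; the Gaussian pool is evaluated once per frame and reused by every state, which is the computational appeal of the tied-mixture design.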
This formula has some similarity to Eq. 2 for the discrete case, and it turns out that the Gaussian weights p(y_n|s) can be interpreted as discrete emission probabilities for a codebook of size N, which are smoothed by the Gaussian factors G(x|n) in Eq. 4. Therefore, this approach is also denoted as semi-continuous HMMs. In Eq. 4, both the parameters p(y_n|s) and the parameters of the Gaussians G(x|n) are learned with the EM-algorithm, while extensively exploiting the class information of the training vectors. This is one of the major differences from discrete systems. This improved feature modeling component, and the smoothing capabilities of the continuous probability density functions, made the continuous HMMs the most popular and powerful approach, especially in the area of speech recognition, but also in handwriting recognition and in other applications.

2.3. Neural Implementation of Hidden Markov Models
It is interesting to see that the output modeling component of both the traditional discrete as well as the continuous HMM can be implemented in a neural architecture. A first indication of this fact is already given in Fig. 2, since the discrete HMM displayed in this figure already contains a neural network as one of its major components. Based on this fact, the discrete system shown in Fig. 2 can be further transformed into a pure neural architecture as shown in Fig. 3. In Fig. 3, the feature vectors extracted from the input signal are presented to the input layer of the vector quantizer. As already mentioned in the subsection on discrete HMMs, the architecture of this VQ is identical to a Self-Organizing Map, trained with the k-means algorithm. Again, the hidden layer in Fig. 3 has a winner-take-all characteristic, and the activation of the neuron with the smallest output is set to 1.0, while all other activations are set to 0.0. The output layer size in Fig. 3 is identical to the total number of HMM states. The N weights connecting the N hidden layer neurons to one single neuron in the output layer are identical to the
Fig. 3. Neural implementation of a discrete HMM.
vector of discrete output probabilities p(y_n|s), n = 1, ..., N, for the corresponding HMM state s. If only 1-state HMMs are assumed, the system has an output layer whose size is identical to the number of classes in the HMM system. Consequently, the activations of the output neurons are directly the class-conditional probabilities p(x|class). If HMMs with M states each were used, the M times larger output layer would deliver the state-conditional probabilities p(x|s_i), where s_i is one of the P × M possible states if P is the number of classes in the system. The dotted lines in Fig. 3 in the area of the output layer denote the transition probabilities of the HMM. A good way to think about their functionality is to first imagine that they are not present in Fig. 3, and to first feed all feature vectors to the neural network in Fig. 3, in order to compute all output probabilities for all possible states. These state-conditional probabilities can then be evaluated jointly with the transition probabilities for decoding with the Viterbi algorithm, which finds the optimal state sequence using dynamic programming. Figure 4 shows an equivalent neural structure for a tied-mixture HMM. Note that the difference to the structure in Fig. 3 is not very large. The only major difference is the fact that the hidden layer in Fig. 4 has a nonlinear activation function (Gaussian in this case), and that it does not have a winner-take-all
Fig. 4. Neural implementation of a tied-mixture HMM.
characteristic. Instead, all activations are propagated to the output layer. This results in a typical Radial-Basis-Function (RBF) neural network, and shows that discrete and continuous systems can both be implemented in a very similar neural network architecture.

3. Hybrid Systems

3.1. Neural Networks as Posterior Probability Estimators
Figures 3 and 4 already indicate that the difference between traditional and hybrid system architectures might be smaller than probably anticipated. Although both figures show a typical neural network, the weights of these nets are still trained according to classical Maximum Likelihood principles, which are not used in most neural net paradigms. As already mentioned, the hybrid approach is typically used to replace the output probability modeling component of the HMM. This component is implemented in the neural network layers in Figs. 3 and 4. It has been shown in Ref. 2 that instead of using the RBF structure in Fig. 4 for a continuous modeling of the
output probabilities, it is also possible to use a multilayer perceptron (MLP) for this task. In this case, it is also possible to use the well-known error-backpropagation algorithm to train this MLP. This algorithm is typically used for training neural networks for classification tasks. Alternative but similar training procedures are available in Ref. 10, for example. In this case, the neuron representing the class of the presented input vector is given a target value of 1.0, and all other neurons receive a target value of 0.0. This results in target values δ_ji for the jth output neuron, if the presented vector x belongs to class i. Typically, the activation function denoted as f(x) is chosen to be a sigmoid function for all neurons of the MLP except those in the output layer. In the output layer, the activation of the jth neuron y_j is additionally smoothed by the softmax function o_j(y_j) according to:

o_j(y_j) = e^{y_j/T} / Σ_i e^{y_i/T}    (5)

with the parameter T set to unity. In this case, the error between the actual output values o_j and the target values δ_ji can be expressed as

E = (1/2) Σ_k Σ_j [o_j(x(k)) - δ_ji]²    (6)

and is minimized by the backpropagation algorithm. The derivative of Eq. 6 with respect to a weight w of the neural net leads to:14

∂E/∂w = Σ_k Σ_j R_j(k) · ∂y_j/∂w    (7)

with

R_j(k) = o_j(x(k)) · Σ_l [o_l(x(k)) - δ_li] · [δ_lj - o_l(x(k))].    (8)
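These gradient formulas can be checked numerically; the sketch below uses invented activations and targets, and compares the analytic per-neuron factor R_j (the derivative of the squared error of Eq. 6 taken through the softmax of Eq. 5) against a finite-difference estimate:

```python
import numpy as np

def softmax(y, T=1.0):
    """Eq. 5 with temperature T (set to unity in the chapter)."""
    e = np.exp((y - np.max(y)) / T)   # shifted for numerical stability
    return e / e.sum()

def R(y, delta):
    """Eq. 8: R_j = o_j * sum over neurons of (output - target) * (kronecker - output)."""
    o = softmax(y)
    n = len(y)
    return np.array([o[j] * np.sum((o - delta) * ((np.arange(n) == j) - o))
                     for j in range(n)])

y = np.array([0.3, -1.2, 0.8])       # toy pre-softmax activations for one frame
delta = np.array([0.0, 0.0, 1.0])    # target: the third class
analytic = R(y, delta)

# Finite-difference check of dE/dy_j for E = 0.5 * sum_j (o_j - delta_j)^2 (Eq. 6).
eps = 1e-6
E = lambda z: 0.5 * np.sum((softmax(z) - delta) ** 2)
numeric = np.zeros_like(y)
for j in range(len(y)):
    yp, ym = y.copy(), y.copy()
    yp[j] += eps
    ym[j] -= eps
    numeric[j] = (E(yp) - E(ym)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # agreement to numerical precision
```

The remaining factor ∂y_j/∂w of Eq. 7 is what standard backpropagation supplies layer by layer.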
The expression ∂y_j/∂w in Eq. 7 depends on the internal neural network structure and can be obtained by considering the actual layer of the weight w and the activation function f. It has been shown in Ref. 3 that if the error function in Eq. 6 is minimized in order to obtain a global minimum of this function, the activation of the jth neuron after presentation of the input vector x can be interpreted as the posterior probability of the jth class, i.e.

o_j(x) = p(s_j|x)    (9)
Fig. 5. Hybrid approach based on posterior probability estimation with an MLP.
where s_j indicates the jth category, i.e. either a class of the pattern recognition system or a certain state s of one of the HMMs used for representing a specific class. In the latter case, this state has been assigned to feature vector x by Viterbi alignment. The structure of such a system is shown in Fig. 5. Note that the structure of the system in Fig. 4 is not much different from the structure in Fig. 5, where the RBF neural network is replaced by the MLP. Obviously, in this hybrid system, the approximation of the output probability distributions is performed using a mixture of sigmoids rather than a mixture of Gaussians. The major reason for interpreting Fig. 5 as a hybrid system is the additional use of the typical neural network training procedures for training the MLP weights. In this system, discriminative training objectives are used for the hybrid system rather than Maximum Likelihood objectives. As explained before, this leads to
the generation of posterior probabilities at the MLP output layer, which have to be divided by the prior probabilities of the classes in order to obtain the class-conditional probabilities that are required by the HMMs.2 This is performed in the additional output layer shown in Fig. 5. Although the structural similarity to traditional HMM-based systems can still be established by comparing Figs. 4 and 5, such a hybrid system already has a lot of differences concerning its practical use as a pattern recognition system. These practical differences are briefly summarized in the following points:

• The system is not trained with the Forward-Backward algorithm. Instead, the MLP is trained separately with the backpropagation algorithm as explained previously. Since only 1-state HMMs are normally used, there is no HMM training procedure at all. For recognition, the sequence of feature vectors is presented to the MLP, which generates a sequence of emission probabilities for all the frames. This sequence is then evaluated for decoding using the Viterbi algorithm.

• In most cases, the system is not used in a context-dependent mode. A context-dependent system requires the consideration of several thousand different HMM states for various realizations of the class units in their different possible contexts (e.g. phones in speech recognition or graphemes in handwriting). Such a system would require the training of a very large MLP with an output layer of several thousand neurons, which is impractical. However, this disadvantage is somewhat compensated for by the use of multiple frame inputs, where at least 5-7 adjacent frames are presented as inputs to the MLP. Such an input can be interpreted as context dependency at the frame level, and can be a substitute for context dependency at the unit level.
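The division of the MLP posteriors by the class priors mentioned above can be sketched as follows; the posterior vector for one frame and the priors are invented for the example:

```python
import numpy as np

# Invented MLP outputs for one frame: posterior probabilities p(s_j | x), Eq. 9.
posteriors = np.array([0.70, 0.20, 0.10])

# Invented priors p(s_j), e.g. relative state frequencies from a training alignment.
priors = np.array([0.50, 0.30, 0.20])

# Scaled likelihoods p(x|s_j) / p(x) = p(s_j|x) / p(s_j); the unknown factor p(x)
# is constant within a frame and therefore cancels in Viterbi decoding.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)
```

The resulting values are not probabilities themselves, but they rank the states exactly as the class-conditional probabilities would, which is all the Viterbi comparison needs.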
This hybrid HMM/NN architecture based on posterior probabilities is by far the most popular hybrid approach, and has led to the development of several very successful systems - particularly in the area of speech recognition - especially considering that such a hybrid system based on context-independent units uses only a fraction of the number of parameters of a context-dependent traditional system and has an almost equivalent recognition performance. This approach has been the basis for several other hybrid systems, which all use the basic principle of the neural posterior probability estimation. The most popular one is the replacement of
the MLP by a recurrent neural network26 that avoids the use of the multi-frame input, thus further reducing the number of system parameters. Other extensions of the baseline posterior approach are mainly aimed at the efficient introduction of context dependency to the hybrid system. This can be obtained by the introduction of context classes, for which the context probability can also be estimated with a number of neural networks.11 A more recent method is the use of a hierarchical splitting of context states, where each hierarchy level can be characterized by a neural network estimating the posterior probability of this level.6 The entire probability of a context state can then be expressed as the product of the posteriors of all associated hierarchy levels. In this way, the complexity of the neural system can be kept relatively small despite the large number of context states. Although these systems are clearly different from the traditional system in Fig. 4, it would be an interesting experiment to train the RBF network in Fig. 4 on the estimation of posterior probabilities with error-backpropagation, and to use multiple frame inputs in combination with 1-state HMMs. Since RBF neural nets should in principle be capable of modeling posteriors very well, results similar to the MLP posterior approach could be expected. This demonstrates again that the classical hybrid approach based on posteriors and the traditional tied-mixture approach are different, but not extremely far from each other. Hybrid systems based on posterior probability estimation have demonstrated their capabilities in numerous evaluations using a variety of popular databases, especially in the area of speech recognition. The most popular system that emerged from these activities is the ABBOT system,5 which has obtained the best recognition results, and occasionally even outperformed the best continuous HMM systems on specific tasks.
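A small sketch of the hierarchical factorization of a context-state probability described above (the hierarchy levels, their names, and the per-level posterior values are all invented):

```python
# Each hierarchy level has its own neural estimator; here the posterior that each
# level assigns to the branch leading to one particular context state is invented.
level_posteriors = {
    "broad-class": 0.8,      # hypothetical NN for the top-level split
    "left-context": 0.5,     # hypothetical NN for the next split
    "right-context": 0.25,   # hypothetical NN for the finest split
}

# The context-state posterior is the product over all associated hierarchy levels.
p_context_state = 1.0
for level, p in level_posteriors.items():
    p_context_state *= p

print(p_context_state)   # 0.8 * 0.5 * 0.25 = 0.1
```

Each network only has to discriminate among the few branches of its own level, which is why the overall neural system stays small even for thousands of context states.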
Another interesting outcome of the hybrid posterior probability approach is the fact that posteriors can be used very effectively in the decoding process for speech recognition. In Ref. 19, a decoder has been introduced that is especially designed for use in conjunction with hybrid systems and exploits the posterior probabilities delivered by the acoustic processor during decoding for efficient pruning strategies, where the use of posteriors is much more advantageous than the use of class-dependent probabilities. In Ref. 33, it has also been shown that the availability of posterior probabilities can be directly evaluated, in order to obtain confidence measures that indicate the suitability of a classified acoustic pattern for use in adaptation or retrieval tasks.
G. Rigoll
3.2. Mutual Information Neural Networks
Although it has been shown in the previous section that the classical hybrid systems based on posterior probabilities are still related to traditional HMM systems, it is obvious that the HMM part of the hybrid system does not play a very large role in the system. In fact, it is almost possible to interpret that system as a pure neural system that only uses very simple 1-state HMMs for the sake of a dynamic programming implementation of the decoding phase. As already mentioned, one of the consequences of this is the fact that some of the useful features of HMM-based systems are difficult to implement for these hybrid systems. Examples of this are triphones, crossword triphones, or fast EM-based parameter estimation methods. In the following, an alternative hybrid system architecture is introduced that can be derived mathematically from the theory of traditional HMMs, and therefore retains many of the typical HMM characteristics, while using a neural feature processing module in conjunction with a complex HMM system. One of the interesting characteristics of this alternative hybrid approach is the fact that it can be derived from both the classical discrete as well as the continuous HMM approach. It will be shown that this approach combines the advantages of discrete and continuous HMMs, and can be interpreted as lying in between both architectures. This can be achieved by considering the probabilistic relation between the generation of the discrete prototype y_n (see Eq. 1) and the presentation of the continuous feature vector x. This probabilistic relationship can be expressed using Bayes' law as:

p(y_n|x) · p(x) = p(x|y_n) · p(y_n).   (10)
The probability p(x) can thus be expressed as

p(x) = [p(x|y_n) / p(y_n|x)] · p(y_n).   (11)
This equation establishes a relationship between the probability of the continuous vector x and the probability of the prototype yn. Obviously, the quotient on the right hand side of Eq. 11 must be a probabilistic model of the vector quantization process, describing the probability of generating prototype yn when vector x is presented to the VQ. This VQ process is independent of the state s of the HMM, whereas the probabilities of the vector x and the prototype yn are always expressed as state-dependent probabilities
Combination of Hidden Markov Models and Neural Networks
in HMM systems. Therefore, the following equation must hold:

p(x|s) = [p(x|y_n) / p(y_n|x)] · p(y_n|s)   (12)
and this equation can be further simplified by using Eq. 10 in order to modify the state-independent VQ term in Eq. 12, yielding

p(x|s) = [p(x) / p(y_n)] · p(y_n|s).   (13)
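The difference between the exact discrete model of Eq. 13 and the simplified model of Eq. 2 can be illustrated with a small computation (all probability values invented for the sketch):

```python
# Numerical illustration of Eq. 13 vs. the simplified model of Eq. 2
# (all probability values invented for the sketch).
p_yn_given_s = 0.05   # discrete emission probability p(y_n|s)
p_x = 0.002           # total probability of the continuous vector x
p_yn = 0.04           # total probability of prototype y_n

p_x_given_s_exact = (p_x / p_yn) * p_yn_given_s  # Eq. 13
p_x_given_s_simple = p_yn_given_s                # Eq. 2: assumes p(x) = p(y_n)
```

The correction factor p(x)/p(y_n) is exactly what the simplified discrete model discards when it assumes that no quantization error exists.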
Again, the quotient in this equation can be interpreted as a model of the vector quantization process, thus giving some insightful information about the relationship between the continuous model p(x|s) and the discrete probability distribution p(y_n|s). If the quotient of the total probabilities p(x) and p(y_n) was known, then the state-conditional continuous probability p(x|s) could be exactly predicted from the state-conditional discrete probability p(y_n|s). Therefore, Eq. 13 is a discrete model that enables the calculation of continuous probabilities and therefore lies between the discrete and the continuous HMM approach. It also shows how discrete HMMs are used to approximate continuous systems: setting p(x) = p(y_n) in Eq. 13 leads to the discrete model of Eq. 2. But this means that it is simply assumed that no quantization error exists and that vector x is mapped into a label y_n that is unique for x. Therefore, Eq. 13 can be considered as the exact discrete model that takes the quantization error of the VQ into account, whereas Eq. 2 is the simplified discrete model. So far, the vector quantizer of an HMM has mostly been realized by the k-means algorithm and therefore was always designed in unsupervised mode. The exact VQ model of Eq. 13 can now be used to derive a new training criterion for an improved supervised vector quantizer, which can be implemented through a neural network. Therefore, the resulting system is a hybrid HMM/NN system, consisting of discrete HMMs and a neural network acting as neural vector quantizer. For this purpose, one has to recall that traditional HMMs are usually trained with the Maximum Likelihood (ML) principle, i.e. the parameters summarized in vector θ of the complete system are chosen to maximize the likelihood of the training data:

θ* = argmax_θ { (1/K) Σ_k log p(x(k)|s(k)) }   (14)
where again the state s(k) for every feature vector x(k) is known from a Viterbi alignment. Inserting the exact probabilities delivered by the VQ in Eq. 13 into Eq. 14 yields:

θ* = argmax_θ { (1/K) [ Σ_k log p(x(k)) − Σ_k log p(y_n(k)) + Σ_k log p(y_n(k)|s(k)) ] }.   (15)
The terms in Eq. 15 are expectation values of log probabilities and can also be written as entropies, yielding:

θ* = argmax_θ { −H(X) + H(Y) − H(Y|S) }   (16)
where H(Y) is the entropy of the string of prototypes resulting from quantizing all available training feature vectors x and H(X) is the entropy of the feature stream X. The expression on the right-hand side of Eq. 16 has to be maximized in order to fulfill the ML criterion for the entire pattern recognition system. If the parameters of a neural vector quantizer have to be optimized according to that criterion, it is obvious that these parameters can only affect the terms that contain the label sequence Y. The entropy H(X) cannot be affected by a VQ. Therefore, the remaining criterion is

I(Y, S) = H(Y) − H(Y|S) = H(S) − H(S|Y)   (17)
which is the mutual information between the string of firing neurons Y of a neural vector quantizer and the string S of the state sequence associated with the original stream of feature vectors x presented for training.21 It is possible to train neural networks with arbitrary topology and architecture in order to fulfill this Maximum Mutual Information (MMI) criterion. Such a neural net is then used as a neural codebook replacing the original k-means-clustered codebook in the discrete HMM. Such networks may also be called Maximum Mutual Information Neural Networks.20,22 In order to obtain such a network, we may simply consider a multi-layer perceptron (MLP) equivalent to the type that was introduced in Section 2.1, with a softmax activation function o_j(y_j) in the output layer, as already described in Eq. 5. In Eq. 17, the state sequence string S always remains constant for training with the same data. Therefore maximization of the left-hand side of Eq. 17 is equivalent to the minimization of the entropy
H(S|Y), which can also be rewritten as an expectation over the distribution p(s_i, y_j) in the following manner:

H(S|Y) = −Σ_i Σ_j p(s_i, y_j) · log p(s_i|y_j).   (18)
Using Eq. 18, the problem is reduced to the derivation of the joint probabilities p(s_i, y_j), i.e. the joint probabilities that the jth neuron fires at a discrete time step that has been assigned to the HMM state s_i. This can be calculated by simply summing the activations o_j of the jth neuron y_j over all discrete time steps k to which state s_i has been assigned, i.e.

p(s_i, y_j) = (1/K) Σ_{k=1}^{K} δ_{s(k),s_i} · o_j(k).   (19)
In this case, o_j(k) is the output of the softmax function in Eq. 5 with a very small value of the parameter T in order to generate a crisp winner-take-all behavior. In this way, it is possible to approximate the firing behavior of the neural network while still having a differentiable model for calculating derivatives. The expression δ_{s(k),s_i} is equal to 1.0 if the HMM state s(k) at discrete time step k is equal to the specific state s_i. Through Eqs. 18, 19, and 5, it is now possible to use the chain rule in order to compute the derivatives of the expression H(S|Y) with respect to any weight w of a neural network with arbitrary topology.14 Surprisingly, the result is relatively similar to the ordinary backpropagation algorithm; it consists of the expression in Eq. 7 (which remains unchanged), and Eq. 8 is replaced by14

δ_j = (1/K) · o_j(x(k)) · { log p(s(k)|y_j) − Σ_i log p(s(k)|y_i) · o_i(x(k)) }.   (20)
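The estimation of the joint probabilities of Eq. 19 and the entropy criterion of Eq. 18 can be sketched with toy data (activations and state alignment are random stand-ins):

```python
import numpy as np

# Sketch of Eqs. 18 and 19: estimating the joint p(s_i, y_j) from softmax
# activations and a given state alignment, then the criterion H(S|Y).
# Activations and alignment are random toy data.
rng = np.random.default_rng(1)
K, n_states, n_neurons = 200, 3, 5

logits = rng.normal(size=(K, n_neurons)) / 0.1   # small T -> crisp softmax
o = np.exp(logits - logits.max(axis=1, keepdims=True))
o /= o.sum(axis=1, keepdims=True)                # o_j(k) as in Eq. 5
s = rng.integers(0, n_states, size=K)            # aligned state s(k)

# Eq. 19: p(s_i, y_j) = (1/K) sum_k delta_{s(k), s_i} * o_j(k)
joint = np.stack([o[s == i].sum(axis=0) for i in range(n_states)]) / K

cond = joint / joint.sum(axis=0, keepdims=True)  # p(s_i|y_j)
H_S_given_Y = -(joint * np.log(np.clip(cond, 1e-12, None))).sum()  # Eq. 18
```

Minimizing H(S|Y) with respect to the network weights (via the gradient of Eq. 20) pushes the firing behavior towards high mutual information with the state sequence.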
Equation 7 can be taken over from the classical MLP case and mainly depends on the internal NN topology. A major difference to the MLP posterior approach is the fact that the MMINN does not need the explicit formulation of target values for each neuron in the output layer, but instead develops its firing behavior towards a maximum mutual information value in a self-organizing way. In the case of multiple features, those features are usually quantized in separate codebooks and consequently will
also be quantized by separate neural networks in the hybrid MMINN approach. In this case, the various neural nets generate different neural firing strings denoted as Y_1, Y_2, ..., Y_x. It is possible to optimize each neural codebook separately by minimizing H(S|Y_x) according to Eq. 18. However, it is also possible to evaluate the correlations between the different features by considering the joint information of all neural firing strings and by minimizing the joint entropy H(S|Y_1, Y_2, ..., Y_x), leading to the considerably more complicated equations given in Ref. 15. Such a joint feature optimization is not possible for ordinary multiple codebooks based on the k-means algorithm. The MMINN approach is obviously different from all other traditional or hybrid approaches. However, it is possible to establish interesting relationships to all those other recognition architectures. Figure 6 shows the relationship between the MMINN approach and the well-known traditional HMM techniques. The illustration in the upper part of this figure shows the conventional continuous HMM system, where each HMM state has a fixed number of Gaussians assigned to it. From there, the tied-mixture system can be easily generated by making available a large pool of Gaussians commonly shared by all states. The discrete HMM architecture can be directly derived from the tied-mixture system by replacing the pool of Gaussians by a pool of regions with a discrete firing behavior characterized by a vector
Fig. 6. Relationship between the MMINN and the traditional HMM approaches.
quantizer. This is shown by replacing the smooth Gaussians by the rectangular areas indicating the binary behavior of these units in the lower right part of Fig. 6. From the discrete system, the MMINN approach evolves directly by replacing the k-means VQ by an MMI neural network. This is indicated by the shifted rectangular areas in the lower left part of Fig. 6, showing the modified firing behavior of the neurons in order to fulfill the MMI criterion. It can therefore be seen that the MMINN approach can be considered to lie somewhere in between the continuous and discrete HMM paradigms; the structure of the system is basically identical to that of a discrete system, but now the discrete neural firing areas can be trained using the class information of the training samples, similar to the way the Gaussian parameters (means and variances) are estimated for the continuous HMMs using the EM algorithm. Therefore, this hybrid system can also be considered as a continuous system, where the trained Gaussians are replaced by trained neural firing areas. The relationship to the classical hybrid approach based on posterior probability estimation can be established by comparing Figs. 3 and 5. Figure 3 shows the structure of the hybrid MMINN approach, which is basically identical to the structure of the discrete HMM, except that the input layer in Fig. 3 may be replaced by an arbitrary neural network trained according to the MMI principle. This structure is not very different from the hybrid structure in Fig. 5, and the differences to Fig. 3 are similar to the differences between Figs. 3 and 4. Therefore, it is possible to establish a link from the hybrid MMINN approach through the tied-mixture system to the hybrid posterior probability estimator. Another link between both approaches is given by the similar training algorithms in Eqs. 7, 8, and 20. The MMINN approach has been tested with several popular speech databases.
In this case, it turned out to be very advantageous that the MMINN approach can be combined with HMMs of arbitrary complexity. This has been exploited in Refs. 14 and 21 for speech recognition, where a complex hybrid MMI-connectionist/HMM system for the RM database has been developed, based on context-dependent HMMs using triphones and tree-based state clustering. In Ref. 27, a similar system has been developed for the WSJ database,17 where the use of crossword triphones led to a system with recognition performance close to the best continuous systems available for the WSJ database. Due to the basic discrete structure
of the MMI-connectionist/HMM approach, the decoding time of this system is close to real time. Recently, the MMINN approach has also been tested for very large vocabulary handwriting recognition. In Ref. 24, it is demonstrated that for a 30,000 word handwriting recognition task, the hybrid approach has outperformed competing discrete and continuous density HMM systems. It has also been used in a very demanding handwriting recognition task (involving the recognition of 200,000 words) and clearly improved the recognition rate of a baseline discrete system in that case. An application of hybrid modeling techniques to on-line as well as off-line handwriting recognition has also been extensively investigated in Ref. 4.

3.3. Nonlinear Discriminant Feature Transformation Hybrids
As has been shown in Section 3.1, the most popular hybrid approach based on posterior probability estimation relies on the continuous HMM approach and mainly aims at replacing the Gaussian modeling component by a neural network. In contrast to that approach, the MMINN approach in Section 3.2 is mainly based on the discrete HMM architecture. However, it is also possible to extend the MMINN principle to continuous HMMs in order to obtain a hybrid system that uses a continuous neural modeling technique based on information theory principles. A very obvious and natural way to achieve this is the addition of a neural network to the continuous tied-mixture architecture in Fig. 4. Such a neural network can be placed as an additional hidden layer just in front of the RBF component of the system shown in Fig. 4. It can even be implemented in a recurrent structure, as suggested in Fig. 7. The justification for such an extension can be given as follows. Obviously, one of the major reasons for using hybrid HMM/NN systems is the fact that the discriminative power of neural networks can have a positive influence on the system's recognition performance. Since the tied-mixture system in Fig. 4 can already be considered as a neural network, it would be theoretically possible to train the weights of this RBF network directly. It is however easy to show that this would be directly equivalent to the standard MMI principle for Hidden Markov Models1 and would therefore destroy the Maximum Likelihood (ML) objective on which the tied-mixture system was originally trained. In contrast to that, it has been shown that the MMI training of the neural vector quantizer in Section 3.2 corresponds directly to this ML principle.
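The recurrent feature-transformation front-end placed before the RBF component can be sketched as follows (weights are random stand-ins; in the real system they would be trained with the MMI gradient described below):

```python
import numpy as np

# Sketch of the recurrent feature-transformation front-end suggested
# above: each frame x_k is mapped to x*_k, which also depends on the
# previous transformed frame. Weights are random stand-ins; in the real
# system they would be trained with the MMI gradient.
rng = np.random.default_rng(2)
d, K = 13, 10
W = rng.normal(scale=0.1, size=(d, d))   # input weights (hypothetical)
V = rng.normal(scale=0.1, size=(d, d))   # recurrent weights (hypothetical)

x = rng.normal(size=(K, d))              # input feature vectors
x_star = np.zeros((K, d))                # transformed feature vectors x*
h = np.zeros(d)
for k in range(K):
    h = np.tanh(W @ x[k] + V @ h)
    x_star[k] = h                        # would feed the frozen RBF part
```

The recurrence lets each transformed vector x* carry temporal context without enlarging the input dimensionality of the frozen tied-mixture stage.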
Fig. 7. Nonlinear discriminant feature transformation hybrid: the input feature vectors x are processed by a recurrent Maximum Mutual Information Neural Network into transformed feature vectors x*, which are passed to the RBF network (Gaussian parameters G(x*|n), discrete probabilities p(y_n|s)) that delivers the class-conditional probabilities p(x|s_i).
The tied-mixture system and the recurrent neural network now form one single neural network, where the newly introduced neural net represents the hidden units of this combined neural net. These units can now be trained using the MMI objective function, while the weights of the RBF neural network are frozen. Since this is now a continuous system, the mutual information has to be expressed using the entropy H(X) of the stream X of continuous feature vectors in the following way:

I(X, S) = H(X) − H(X|S)
        = −(1/K) Σ_k log p(x(k)) + (1/K) Σ_k log p(x(k)|s(k))
        = (1/K) Σ_k log [ p(x(k)|s(k)) / p(x(k)) ]
        = (1/K) Σ_k log [ p(x(k)|s(k)) / Σ_i p(x(k)|s_i) · p(s_i) ]   (21)
where the probabilities p(x|s) can be computed from the tied Gaussians by:

p(x|s) = Σ_{n=1}^{N} a_{sn} · (2π)^{−d/2} |C_n|^{−1/2} · exp( −(1/2) (x* − m_n)^T C_n^{−1} (x* − m_n) ).   (22)
In this case, x(k) and s(k) denote the kth training vector and its associated state or class, respectively, while s_i denotes any state of the HMM. The vector x* is the output vector of the additional MMI neural network shown in Fig. 7. A relationship between this vector and the original input vector x can be established via the weights of this additional neural network. Using this relation and Eqs. 21 and 22, a gradient of the mutual information expression in Eq. 21 with respect to the weights of the MMINN can be derived, which can be used for training the weights of this network. This training procedure with the frozen tied-mixture weights will have the effect of generating more discriminative features x* at the output layer of this MMINN. Therefore, the system will increase its discriminative behavior under consideration of the active HMM parameters. Of course, this also leads to an increased mutual information of the overall system. But if in a subsequent step the HMM parameters are retrained with the MMINN weights held constant, the system regains its Maximum Likelihood characteristics, while now using features with improved discriminative characteristics. Such a procedure can be interpreted in at least three different ways: (1) As a hybrid system combining MMI neural nets and continuous HMMs (2) As a discriminant feature extraction method taking the parameters of the baseline HMMs into account (3) As an LDA-like (Linear Discriminant Analysis) feature transformation that incorporates nonlinear and recurrent feature transformations. As with an LDA, this approach also allows multiple frame input and can extract the relevant information out of a high-dimensional feature vector. The advantage of this new hybrid approach is the fact that it can be implemented directly on top of a highly optimized tied-mixture system, which can also be a context-dependent system and already has an extremely good recognition performance.
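The objective of Eq. 21 can be evaluated directly for a given set of transformed features; a minimal sketch with one unit-covariance Gaussian per state, equal state priors and toy data (a trained front-end would drive this value up):

```python
import numpy as np

# Sketch evaluating the objective of Eq. 21 for given transformed
# features x*: one unit-covariance Gaussian per state, equal state
# priors, toy data; a trained front-end would increase this value.
rng = np.random.default_rng(3)
K, d, n_states = 100, 2, 3
means = rng.normal(scale=3.0, size=(n_states, d))
p_s = np.full(n_states, 1.0 / n_states)

x_star = means[rng.integers(0, n_states, K)] + 0.3 * rng.normal(size=(K, d))
s = np.argmin(((x_star[:, None] - means) ** 2).sum(-1), axis=1)  # alignment

lik = np.exp(-0.5 * ((x_star[:, None] - means) ** 2).sum(-1)) / (2 * np.pi)
p_x = lik @ p_s                                    # denominator in Eq. 21
I_XS = np.mean(np.log(lik[np.arange(K), s] / p_x)) # mutual information estimate
```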
Such a system can then be further improved by the neural discriminant feature transformation. In this case, the MMINN can be trained in conjunction with a context-independent tied-mixture system and the trained MMINN can later be successfully used as a discriminant neural feature extractor in a complex context-dependent tied-mixture system. This has been demonstrated in Refs. 23 and 31, where a context-dependent
tied-mixture system for the RM database has been optimized so intensively that it was practically impossible to further improve its performance with any conventional method or LDA. It was, however, possible to further reduce the relative error rate by 10% using the new proposed hybrid approach. In such cases, the improvement will always be relatively small since the baseline system is already very good, but in contrast to most other conventional methods, an improvement is still possible. The relation of this approach to all other traditional and hybrid modeling techniques can easily be seen by the strong relation to the tied-mixture approach and the relation of the tied-mixture approach to the other approaches as outlined in Sections 3.1 and 3.2. Another interesting relationship between this approach and the hybrid posterior probability approach of Refs. 2 and 3 has been established in Ref. 32. It is based on the fact that posterior probabilities can be computed efficiently from tied-mixture systems in the following manner: the posterior probability p(s|x) of a state s can be computed from the class-dependent probability according to Bayes' formula as:
p(s|x) = p(x|s) · p(s) / p(x).   (23)
In this case, the crucial point is the computation of the probability p(x), which can be computed as

p(x) = Σ_s p(s) · p(x|s).   (24)
Unfortunately, this requires a summation over all states, which is infeasible especially for large context-dependent systems with thousands of states. By exploiting the fact that the state-conditional probabilities are modeled by Gaussian mixtures according to Eq. 4, this equation can be inserted into Eq. 24 in order to obtain the following relationship:

p(x) = Σ_s p(s) · p(x|s) = Σ_s p(s) · Σ_{n=1}^{N} p(y_n|s) · G(x|n)
     = Σ_{n=1}^{N} [ Σ_s p(y_n|s) · p(s) ] · G(x|n) = Σ_{n=1}^{N} p(y_n) · G(x|n)   (25)

with

p(y_n) = Σ_s p(y_n|s) · p(s).   (26)
The quantities in Eq. 26 can be easily precomputed from a trained HMM system and stored in a table. Eq. 25 thus shows that the probability p(x) can be obtained in a tied-mixture system simply by another summation over all N Gaussians. Using Eq. 23, it is now easy to make use of posterior probabilities even in traditional tied-mixture systems, and it can be shown that a similar procedure is also possible for discrete systems, including hybrid MMI-connectionist/HMM systems. In Ref. 32 it has been shown how these posteriors are used for efficient decoding, but they can also be used for other purposes, e.g. for confidence measures. Therefore, Eq. 25 establishes another close link to the hybrid posterior probability approach from Section 3.1 and shows that posteriors can also be used effectively in all the other hybrid paradigms presented in Sections 3.2 and 3.3 of this paper.
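The precomputation of Eq. 26 and the cheap evaluation of p(x) via Eq. 25 can be sketched and checked against the exhaustive state sum of Eq. 24 (sizes and values invented):

```python
import numpy as np

# Sketch: precompute p(y_n) (Eq. 26) once, then get p(x) by a single
# sum over N tied Gaussians (Eq. 25) instead of over all S states.
rng = np.random.default_rng(4)
S, N = 1000, 8                                    # many states, few Gaussians
p_s = rng.dirichlet(np.ones(S))                   # state priors p(s)
p_yn_given_s = rng.dirichlet(np.ones(N), size=S)  # tied-mixture weights

p_yn = p_s @ p_yn_given_s                         # Eq. 26, stored in a table

G_x = rng.random(N)                               # Gaussian values G(x|n)
p_x_fast = p_yn @ G_x                             # Eq. 25: sum over N only
p_x_slow = sum(p_s[i] * (p_yn_given_s[i] @ G_x) for i in range(S))  # Eq. 24
```

Both routes give the same p(x), but the fast one costs N multiplications per frame instead of S·N.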
3.4. The Tied-Posterior Approach
One of our most recently developed hybrid approaches is focused on an extension of the "classical" hybrid approach based on the use of posterior probabilities, presented first in Ref. 3 and previously outlined in Section 3.1 of this contribution. The major limitations of that approach have already been mentioned, namely that it is very difficult to use an arbitrary number of HMM states for such a system, and that this also makes the use of context-dependent approaches for this hybrid architecture very difficult. Therefore, the use of context-dependent models needs special algorithms and structures in these hybrid systems (see e.g. Ref. 6). In this section we present an extension of the hybrid posterior probability approach to overcome the problems of the limited structure and the context-dependent modeling described above. The approach presented here is based on tied-mixture HMM technology. If we recall Eq. 4 from Section 2.2, we can express the Gaussian weights p(y_n|s) also as a weighting factor c_in and the Gaussian factors G(x|n) as an occurrence probability p(x|n) of the nth Gaussian, resulting in the following equation for the emission probability p(x|s_i) for the ith state of the HMM7:

p(x|s_i) = Σ_{n=1}^{N} c_in · p(x|n).   (27)
The weighting factors c_in are usually estimated using the Maximum Likelihood (ML) algorithm. In Eq. 27 the sum is computed from the conditional
probabilities p(x|n) multiplied by the factors c_in. These conditional probabilities usually represent a Gaussian pdf. When using a neural network as a probability estimator to replace those Gaussian pdfs, the result will be some kind of hybrid recognition system.8 This replacement transforms a tied-mixture system into a hybrid "tied-posterior" system as proposed in this section. In order to replace the conditional probability used in the HMMs with the posterior probability p(n|x), which is the output of the neural network, a scaled likelihood (as in Ref. 2) can be used. This probability can be expressed using the posterior probabilities p(n|x) and the a priori class probabilities p(n), which can be estimated using the training data. Application of Bayes' law yields:

p(x|n) / p(x) = p(n|x) / p(n).   (28)
Using neural network posteriors as described in Section 3.1 together with Eq. 27 and Eq. 28 leads to the tied-posterior approach, in which the emission probabilities can be computed as:

p(x|s_i) = p(x) · Σ_{n=1}^{N} c_in · [ p(n|x) / p(n) ]   (29)
where the probabilities p(n|x) are now taken from the output of a neural network trained on the generation of posterior probabilities, and the probabilities p(n) are class priors that can be estimated from the training data. The factors c_in link the neural net output to the HMM structure and have to be estimated in the same way as for the traditional tied-mixture case using the Baum-Welch algorithm. This is shown in Fig. 8, which displays the traditional discrete and continuous HMM structures on the left side, and the classical hybrid posterior approach on the top right side. The tied-posterior approach pictured on the lower right is developed by weighting the network output with the weights c_in. The reason for calling this approach "tied-posteriors" is obvious if one considers that this approach can be obtained from the traditional tied-mixture system by replacing the tied Gaussians by the output of the posterior neural network. When, instead of the ML estimates, the c_in are chosen to be

c_in = 0 if i ≠ n   and   c_in = 1 if i = n   (30)
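The tied-posterior emission of Eq. 29 (up to the state-independent factor p(x)) and the special case of Eq. 30 can be sketched with invented numbers:

```python
import numpy as np

# Sketch of the tied-posterior emission of Eq. 29 (up to the common
# factor p(x)) and of the special case Eq. 30; all numbers invented.
posteriors = np.array([0.6, 0.3, 0.1])   # p(n|x) from the neural net
priors     = np.array([0.4, 0.4, 0.2])   # class priors p(n)
C = np.array([[0.7, 0.2, 0.1],           # weights c_in, one row per state
              [0.1, 0.8, 0.1]])

emissions = C @ (posteriors / priors)    # proportional to p(x|s_i), Eq. 29

# Eq. 30: with c_in = 1 only for i = n, the model collapses to the
# standard posterior hybrid (one state per class).
standard = np.eye(3) @ (posteriors / priors)
```

With a full weight matrix C, several HMM states can share the same network outputs while still being distinguishable through their state-dependent weight vectors.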
Fig. 8. Relation of the Tied-Posterior approach to other HMM paradigms (Gaussian mixture model and tied-mixture model on the left, hybrid system with posterior probabilities and tied-posterior system on the right).
the hybrid tied-posterior approach is transformed into a standard posterior hybrid recognition system equivalent to the system considered in Ref. 2. This shows that the standard hybrid system is a special case of the proposed tied-posterior recognizer. When using the weights c_in, one has the advantage that the estimation of the posterior probabilities becomes independent of the HMM structure. This means that, in contrast to the standard hybrid approach where the HMM only consists of a single state, it is now possible to use HMMs with more than one state. The difference between those states is then described by different state-dependent vectors of the weights c_in. This opens the opportunity to build multiple-state HMMs in a hybrid recognizer, as used in most standard HMM systems. The second advantage of this approach is that it is now possible to model context-dependent HMMs without changing the neural net used for the probability estimation. This extension is a straightforward adaptation of the context-dependency used in tied-mixture systems. First, all mono-class weights are copied to newly created context-dependent HMMs which are based on that mono-class. Then these context-dependent posterior weights can be retrained using the Baum-Welch algorithm, or the weights can be clustered, which can be done either with a data-driven or a tree-based clustering procedure. The neural network used as a probability estimator in this approach is the same network as would be used in standard hybrid approaches. This
means that these hybrid systems can be transformed into a tied-posterior recognizer without retraining the neural weights. The approach provides the most efficient way to combine the advantages of posterior-based hybrid recognition technology with the advantages of context-dependent modeling. This context-dependent modeling can be done with established tools, such as tree-based clustering, and can be extended to all levels. Additionally, this approach combines discriminative training techniques for the posterior probabilities with further optimization using Maximum Likelihood methods for the HMM parameters. Compared to a traditional tied-mixture system, the tied-posterior approach can exploit the well-known multi-frame technique much more efficiently. Typically, the use of multiple frames in standard (Gaussian) continuous systems does not lead to considerable improvements, and is thus usually implemented by keeping the original (low) dimensionality of the Gaussians and using an LDA-like procedure to merge the multiple frame dimension into the original dimension of one feature vector. However, it is well known that a multiframe input can be directly applied to hybrid systems by presenting the multiple frames directly to the neural net input, which typically leads to huge gains in system performance. This is also the case for our approach. Finally, another advantage of this approach lies in the fact that the system is very compact (and thus uses relatively few parameters) in the following sense: when multiple features, comprising a large feature vector (consisting of e.g. cepstral, delta cepstral and energy features in the case of speech recognition), are used in a tied-mixture system, there are basically two options. The first one is the separation of the features and the use of multiple tied-mixture codebooks, which, however, does not lead to optimal results.
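The multi-frame input mentioned above amounts to stacking 2p+1 neighbouring frames into one large vector before presenting it to the neural net; a minimal sketch (function name and padding scheme hypothetical):

```python
import numpy as np

# Sketch of the multi-frame input mentioned above: 2p+1 neighbouring
# frames are stacked into one large vector for the neural net input
# (edges padded by repeating the first/last frame; names hypothetical).
def stack_frames(X, p=3):
    K, d = X.shape
    padded = np.vstack([np.repeat(X[:1], p, axis=0), X,
                        np.repeat(X[-1:], p, axis=0)])
    return np.hstack([padded[i:i + K] for i in range(2 * p + 1)])

X = np.random.default_rng(5).normal(size=(100, 13))   # 100 frames, 13 dims
X_multi = stack_frames(X)     # shape (100, 7*13), fed to the MLP input layer
```

Only the network's input layer grows with the context width p; the output layer, and hence the number of HMM weights c_in, stays unchanged.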
More advantageous is the second option, where the feature vectors are left concatenated in order to exploit the correlations between the different features. This, however, implies the usage of very large tied-mixture codebooks (ranging from 500 to more than 1000 prototypes) in order to cover the entire feature space of the concatenated feature vector. This is implicitly done in conventional continuous (i.e. untied-mixture) systems using a huge total number of state-individual Gaussians in conjunction with large concatenated feature vectors. In a tied-mixture system, however, the usage of such large codebooks would lead to many HMM parameters, because the number of tied-mixture weights per state is equal to the tied-mixture codebook size. In our approach, we can afford to have the above-mentioned
large concatenated multiple feature vectors as input to our neural network, and we can still limit the size of our output layer to the number of class units present for the recognition task. This is due to the fact that models in this approach are largely stored in the hidden layer of the MLP, whereas the standard (Gaussian) tied-mixture approach is basically equivalent to an RBF neural network architecture. Experiments in the area of continuous speech recognition described in detail in Refs. 28 and 29 have shown that this approach can outperform most other approaches if only context-independent models are used in all cases, and comes close to the best comparable systems if context-dependent modeling is chosen.

4. Conclusion

The purpose of this chapter was mainly the presentation of the most popular hybrid modeling techniques for pattern recognition, and the provision of a unified framework for these approaches, which includes the traditional HMM techniques. Four major hybrid modeling techniques have been identified: the posterior probability approach using MLPs and recurrent neural networks, the MMI neural network approach using neural vector quantizers, the hybrid approach based on a nonlinear discriminant neural feature extraction method, and the tied-posterior approach. Although the various methods seem to be dissimilar, several interesting relationships between them can be found, resulting in common frameworks concerning architectural viewpoints and sometimes also common training issues. Although it is not easy to obtain better results with hybrid systems, it has been shown that this is possible in specific application areas. Independently of this issue, the question of whether hybrid approaches are real alternatives to traditional HMM approaches can be answered positively. Hybrid systems offer a variety of alternative modeling options that can become important depending on each single application.
This includes the possibility of using fewer parameters and the use of mono-class systems in the case of hybrids based on posterior probabilities or tied-posteriors. It also includes fast decoding algorithms for MMINN hybrids due to their discrete structure, or the option of further improving very sophisticated context-dependent systems using neural discriminant feature transformations. It can be expected that the idea of combining HMMs and neural nets for pattern recognition tasks will be further investigated and perfected, and will lead to even more sophisticated hybrid systems in the future.
Combination of Hidden Markov Models and Neural Networks
141
Acknowledgments

The author would like to thank all his colleagues who have actively contributed to the development of hybrid systems at Gerhard-Mercator-University Duisburg. He is especially grateful for the contributions of Ch. Neukirchen, J. Rottland and D. Willett to the use of hybrid systems in speech recognition, as well as to A. Kosmala and A. Brakensiek for introducing hybrid technology to handwriting recognition. F. Wallhoff has been investigating the use of hybrid pattern recognition techniques for face recognition. J. Stadermann has been mainly working on the tied-posterior approach for speech recognition, which has also been evaluated for handwriting recognition by A. Brakensiek.

References

1. L. Bahl, P. Brown, P. DeSouza, R. Mercer: "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition", Proc. IEEE-ICASSP, Tokyo, 1986, pp. 49-52.
2. H. Bourlard, N. Morgan: Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
3. H. Bourlard, C. J. Wellekens: "Links Between Markov Models and Multilayer Perceptrons", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, Dec. 1990, pp. 1167-1178.
4. A. Brakensiek, A. Kosmala, D. Willett, W. Wang, and G. Rigoll: "Performance Evaluation of a New Hybrid Modeling Technique for Handwriting Recognition Using Identical On-Line and Off-Line Data", 5th Int. Conference on Document Analysis and Recognition (ICDAR), Bangalore, India, 1999, pp. 446-449.
5. G. Cook, D. Kershaw, J. Christie, C. Seymour, S. Waterhouse: "Transcription of Broadcast Television and Radio News: The 1996 Abbot System", Proc. IEEE-ICASSP, Munich, 1997, pp. 723-726.
6. J. Fritsch, M. Finke: "ACID/HNN: Clustering Hierarchies of Neural Networks for Context-Dependent Connectionist Acoustic Modeling", Proc. IEEE-ICASSP, Seattle, 1998, pp. 505-508.
7. X. D. Huang and M. A. Jack: "Semi-Continuous Hidden Markov Models for Speech Signals", Computer Speech and Language, Vol. 3, No. 3, May 1989, pp. 239-251.
8. H. P. Hutter: "Comparison of a New Hybrid Connectionist-SCHMM Approach with Other Hybrid Approaches for Speech Recognition", Proc. ICASSP, 1995, pp. 3311-3314.
9. F. Jelinek: "Speech Recognition by Statistical Methods", Proc. IEEE, Vol. 64, No. 4, April 1976, pp. 532-556.
10. D. Johnson: ftp://ftp.icsi.berkeley.edu/pub/real/davidj/quicknetv0_96.tar.gz
G. Rigoll
11. D. J. Kershaw, M. Hochberg, A. J. Robinson: "Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System", Advances in Neural Information Processing Systems 8 (NIPS'95), 1996, pp. 750-756.
12. R. P. Lippmann: "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, April 1987, pp. 4-22.
13. S.-W. Lee: Advances in Handwriting Recognition, World Scientific Publishing, Series in Machine Perception and Artificial Intelligence, Vol. 34, 1999.
14. Ch. Neukirchen, G. Rigoll: "Advanced Training Methods and New Network Topologies for Hybrid MMI-Connectionist/HMM Speech Recognition Systems", Proc. IEEE-ICASSP, Munich, 1997, pp. 3257-3260.
15. Ch. Neukirchen, D. Willett, S. Eickeler, S. Müller: "Exploiting Acoustic Feature Correlations by Joint Neural Vector Quantizer Design in a Discrete HMM System", Proc. IEEE-ICASSP, Seattle, 1998, pp. 5-8.
16. D. S. Pallett, J. G. Fiscus, W. M. Fisher, J. S. Garofolo, B. A. Lund, M. A. Przybocki: "1993 Benchmark Tests for the ARPA Spoken Language Program", Proc. of the Human Language Technology Workshop, Plainsboro, New Jersey, March 1994, pp. 49-74.
17. D. B. Paul and J. M. Baker: "The Design for the Wall Street Journal-based CSR Corpus", Proc. of the DARPA Speech and Natural Language Workshop, Pacific Grove, CA, 1992, Morgan Kaufmann, pp. 357-362.
18. L. R. Rabiner, B. H. Juang: "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, 1986, pp. 4-16.
19. S. Renals, M. Hochberg: "Efficient Search Using Posterior Phone Probability Estimates", Proc. IEEE-ICASSP, Detroit, 1995, pp. 596-599.
20. G. Rigoll: "Maximum Mutual Information Neural Networks for Hybrid Connectionist-HMM Speech Recognition Systems", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Special Issue on Neural Networks for Speech Processing, January 1994, pp. 175-184.
21. G. Rigoll, Ch. Neukirchen, J. Rottland: "A New Hybrid System Based on MMI-Neural Networks for the RM Speech Recognition Task", Proc. IEEE-ICASSP, Atlanta, 1996, pp. 865-868.
22. G. Rigoll, Ch. Neukirchen: "A New Approach to Hybrid HMM/ANN Speech Recognition Using Mutual Information Neural Networks", Advances in Neural Information Processing Systems 9 (NIPS'96), 1997, pp. 772-778.
23. G. Rigoll, D. Willett: "A NN/HMM Hybrid for Continuous Speech Recognition with a Discriminant Nonlinear Feature Extraction", Proc. IEEE-ICASSP, Seattle, 1998, pp. 9-12.
24. G. Rigoll, A. Kosmala, D. Willett: "An Investigation of Context-Dependent and Hybrid Modeling Techniques for Very Large Vocabulary On-Line Cursive Handwriting Recognition", Proc. 6th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), Taejon, Korea, 1998, pp. 429-438.
25. G. Rigoll: "Hybrid Speech Recognition Systems: A Real Alternative to Traditional Approaches?", Survey Lecture, Proc. International Workshop Speech and Computer (SPECOM'98), St. Petersburg, Russia, 1998, pp. 33-42.
26. A. J. Robinson: "An Application of Recurrent Nets to Phone Probability Estimation", IEEE Trans. on Neural Networks, Vol. 5, No. 2, 1994, pp. 298-305.
27. J. Rottland, Ch. Neukirchen, D. Willett: "Performance of Hybrid MMI-Connectionist/HMM Systems on the WSJ Database", Proc. IEEE-ICASSP, Munich, 1997, pp. 1747-1750.
28. J. Rottland, G. Rigoll: "Tied Posteriors: An Approach for Effective Introduction of Context-Dependency in Hybrid NN/HMM LVCSR", Proc. IEEE-ICASSP, Istanbul, 2000, pp. 1241-1244.
29. J. Stadermann, J. Rottland, G. Rigoll: "Tied Posteriors: A New Hybrid Speech Recognition Technology with Generic Capabilities and High Portability", Proc. ISCA Workshop ASR2000, Paris, 2000, pp. 24-28.
30. J. Wawrzynek, K. Asanovic, B. Kingsbury, D. Johnson, J. Beck, and N. Morgan: "Spert-II: A Vector Microprocessor System", Computer, 29(3):79-86, March 1996.
31. D. Willett, G. Rigoll: "Hybrid NN/HMM-Based Speech Recognition with a Discriminant Neural Feature Extraction", Advances in Neural Information Processing Systems 10 (NIPS'97), 1998, pp. 763-769.
32. D. Willett, Ch. Neukirchen, G. Rigoll: "Efficient Search with Posterior Probability Estimates in HMM-Based Speech Recognition", Proc. IEEE-ICASSP, Seattle, 1998, pp. 821-82.
33. G. Williams, S. Renals: "Confidence Measures for Hybrid HMM/ANN Speech Recognition", Proc. Eurospeech, Rhodes, 1997, pp. 1955-1958.
CHAPTER 6

FROM CHARACTER TO SENTENCES: A HYBRID NEURO-MARKOVIAN SYSTEM FOR ON-LINE HANDWRITING RECOGNITION*

T. Artieres, P. Gallinari, H. Li and S. Marukatat
LIP6, Universite Paris 6
8, rue du Capitaine Scott, 75015 Paris, France
E-mail: {Thierry.Artieres, Patrick.Gallinari, Haifeng.Li, Sanparith.Marukatat}@lip6.fr
B. Dorizzi
EPH, Institut National des Telecommunications
9 rue Charles Fourier, 91011 Evry, France
E-mail: [email protected]

In this chapter, we present a hybrid on-line handwriting recognition system based on Hidden Markov Models (HMMs) and Neural Networks (NNs). The system has been designed to recognize both isolated words and sentences. It is organized into three layers: letter models, word models and sentence models. Letter models are left-right HMMs in which the emission density of each state is approximated by a mixture of predictive multilayer perceptrons (MLPs). These perform local regression on the handwriting signal, and thus allow us to incorporate contextual information. At the word and sentence level, recognition is based on a tree-structured dictionary. The proposed tree search and pruning techniques reduce the search space considerably without losing recognition accuracy. The system's performance is evaluated on part of the UNIPEN international database. This chapter summarizes previous work done in our group on this topic and describes new results concerning the tree search structure for word and sentence recognition and the extension of the system to sentence recognition.
*Part of the work presented here was done under grant number 001B283 with France Telecom R&D. 145
1. Introduction

The development of new interface modalities is greatly accelerating due to the spread of mobile and portable devices (phones, personal digital assistants, electronic books, electronic tablets, etc.). In this framework, electronic pens drawing on tactile screens or tablets offer an attractive alternative to the traditional mouse and keyboard. These systems are equipped with sensors that allow the pen movements to be captured, buffered and analyzed. For the moment, the functionalities of pen readers are still very limited. It is therefore particularly interesting to develop new systems able to recognize natural handwriting, without demanding that the user learn a new writing style. In this work, we present a new system for handwritten sentence recognition, which is a first step towards the interpretation of note taking on an electronic tablet. Our objective is the recognition of words or sentences in a writer-independent framework with large vocabularies. To handle large vocabularies, we have developed Hidden Markov Models (HMMs) that model the input signal at the letter level. We have then built a cascade system where sentence models are composed of word models, which themselves derive from letter models. HMMs have emerged in the last decade as one of the most powerful techniques for building speech or handwriting recognition systems. More recently, hybrid systems have been proposed to overcome some of the limitations of traditional HMM models. Most often, these systems combine HMMs and neural networks (NNs). The latter allow us to explicitly introduce low-level context into the letter models. Different hybrids have already been used in handwriting recognition (HWR).3,11,19,21,34,35 In most systems, NNs compute the posterior probability that a portion of a word belongs to a certain letter class. The HMM then processes likelihoods, which are deduced from the posterior probabilities via Bayes' rule.
Our system is also a hybrid HMM/NN, but in it NNs are used as predictors for modeling the signal dynamics, extending linear auto-regressive HMMs to a non-linear framework. In addition, we have investigated the use of mixtures of predictive NNs to account more efficiently for handwriting variability due to different writing styles (script, cursive, mixed, etc.) and individual writers. Considering the difficulties encountered in the recognition of cursive handwriting, where ambiguities at the letter level are frequent, recognition
at the word level is driven by a dictionary implemented as a tree structure. At the sentence level, a decoding algorithm is used that simultaneously segments the sentence into words and performs word recognition. This algorithm processes a sentence in two steps. In the first step, it builds a word graph representing, in compact form, the most probable word sequences. In the second step, this word graph is processed using a language model to determine the decoded sentence. This chapter is organized as follows. In the first part, we present in detail an isolated word recognition system which is an extension of our older system REMUS.6,7 The improvements concern the preprocessing step, where baseline detection has been added; HMM modeling, where state models have been improved by considering a mixture of predictive multi-layer perceptrons (MLPs); and word recognition, where we have integrated a dictionary-driven search procedure using a tree-organized lexicon. In the second part we show how to perform sentence recognition as an extension of the isolated word recognition system. Segmentation of the sentence into words and word recognition are performed simultaneously. We first introduce the language constraints that are integrated into the system. These constraints improve the global recognition performance. We then describe the decoding algorithm. Finally, results are presented on datasets corresponding to parts of the UNIPEN database.15

2. Preprocessing

The preprocessing step transforms the signal into the most appropriate form for the recognition system. The handwriting signal is captured on a digitizing tablet that samples the pen trajectory at regular time intervals. The raw signal is thus a sequence of points. The preprocessing of this sampled signal includes a succession of processes, some of which are detailed in the following subsections. A smooth interpolating filter first filters the signal.
Then baseline detection is performed, which accomplishes two goals: it permits the normalization of the letter height in the signal and the detection of its general orientation, so that the signal may be unslanted and scaled to a standard size. These operations make the recognition phase more robust. Then a simple detection of diacritical marks is performed; the signal is normalized, unslanted and spatially resampled. Finally, feature extraction is performed at each point of the signal, so that a handwritten signal is finally encoded as a sequence of multi-dimensional vectors (subsequently called frames).
2.1. Baseline Detection, Normalization and Resampling
We use the baseline detection algorithm proposed by Bengio and LeCun,2 which attempts to identify four baselines corresponding to the core of the word and its lower and upper extensions. The four baselines are assumed to be polynomials of degree two. Although this assumption is restrictive and needs to be adapted for long sentences, it provides good results for sentences whose words are well aligned, as is most often the case in the UNIPEN database. Therefore, we use this modeling in our system. The parameters of the four baselines are estimated through an EM (Expectation-Maximization) procedure. An example of baseline detection is given in Fig. 1. Once the baselines have been detected, the signal is deskewed and a scale factor is defined so that the distance between the two center baselines corresponds to a certain constant, say H. Using this scale factor, the signal is spatially resampled so that the distance between successive points of the resampled signal is H/4. An effect of this resampling is to eliminate the effects of writing speed, which is commonly considered too variable to be used efficiently in a recognition system.7
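As an illustration of the resampling step, the following sketch resamples a pen trajectory at equal arc-length intervals of H/4. The function name and the linear-interpolation scheme are our own illustrative choices, not the chapter's implementation.

```python
import numpy as np

def resample_equidistant(points, step):
    """Resample a pen trajectory so consecutive points are `step` apart.

    points: (N, 2) array of (x, y) pen positions after scaling.
    Returns an (M, 2) array with (approximately) equal arc-length spacing.
    """
    points = np.asarray(points, dtype=float)
    # Cumulative arc length along the polyline.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(seg)])
    # Target positions every `step` units of arc length.
    targets = np.arange(0.0, arclen[-1] + 1e-9, step)
    xs = np.interp(targets, arclen, points[:, 0])
    ys = np.interp(targets, arclen, points[:, 1])
    return np.stack([xs, ys], axis=1)

# Toy example: core height H = 20 units, so the resampling step is H/4 = 5.
H = 20.0
traj = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])
res = resample_equidistant(traj, H / 4.0)
```

After resampling, the spacing between successive points no longer depends on how fast the pen moved, which is the speed-normalization effect mentioned above.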
Fig. 1. Baseline detection on a UNIPEN sentence.

2.2. Feature Extraction
Although a handwritten word could be represented only by its corresponding sequence of points obtained after baseline detection and resampling, a richer representation leads to much better experimental results. Thus, at each point of the sequence, features are extracted, resulting in a frame of 15 components. Our feature extraction is similar to the one proposed by Guyon et al.13 We use six temporal and nine spatial features. The temporal features are the displacement along the x-axis, the absolute ordinate, the cosine and sine of the angle between the tangent to the pen trajectory and the x-axis, and the cosine and sine of the angle between the local curvature of the pen trajectory and the x-axis. These two angles are approximated using the coordinates of neighboring points. The spatial features correspond to
the grey levels of a bitmap centered on the point considered. This takes into account more global spatial information, including the presence of diacritical marks.20

3. Hybrid HMM-NN Modeling

In this section we present our hybrid system in more detail. The overall architecture of the system is first described. Then we show how predictive neural networks (PNNs) and mixtures of predictive neural networks may be used instead of the traditional Gaussian mixture probability density estimators.

3.1. HMM System Overview
In our system the handwritten signal is modeled at the letter level. This allows us to handle arbitrarily large vocabularies using a fixed number of models, one for each symbol in the alphabet. In this study we only consider lower-case words and sentences; our systems are thus based on 26 letter models. Each letter model is a left-right (Bakis) HMM, as illustrated in Fig. 2. The first state is the initial state, and the last state is the final state. The only authorized transitions are from one state to itself or to the next state. All transitions are assumed to be equally likely, for two reasons. First, it is well known that transition probabilities do not greatly improve the performance of an HMM system, because these probabilities are much less variable than the emission probabilities. Consequently, the most important point is to determine the allowed and forbidden transitions of the HMM architecture. Second, classical transition probability modeling leads to power law behavior with respect to the duration in a state, which is clearly unrealistic. The utility of such a left-right constrained architecture is that it forces the states to model successive parts of a letter, in the order they appear in the signal. In practice, the number of states of each letter model has been empirically chosen to be seven. Based on these letter models, a word model may be built by concatenating the HMMs corresponding to its letters, the final state of a letter model being connected to the initial state of the next letter's model. In a handwriting signal, especially in cursive handwriting, the drawing of a letter may vary depending on the previous and next letters. This is closely analogous to the coarticulation phenomenon in speech. To improve modeling accuracy
Fig. 2. Bakis architecture of an HMM letter model.
some authors have proposed modeling this phenomenon through ligature models5 (models of the transitions between letters) or context-dependent letter models, following work in speech processing. We did not use such techniques, but we observed that the initial and final states of the letter models automatically focus on such transition parts of the handwriting signal.

3.2. Predictive Neural Networks

In classical HMMs, probability densities are modeled using Gaussian or mixture-of-Gaussian models. These models are simple from a learning and computational point of view, and allow the use of very efficient training algorithms. However, this modeling is based on strong hypotheses, mainly the assumption of independence between successive frames. For several years, our team has been interested in the development of hybrid systems combining the temporal modeling of HMMs and the approximation power of neural networks. We have thus proposed new hybrid HMM/NN systems where emission probability densities are approximated using predictive multi-layer perceptrons, allowing us to relax the independence assumption of classical HMMs. Such systems have been applied to the modeling and classification of speech and handwriting signals.7,8,9,26,27 The main advantage of this approach lies in the natural modeling of the signal dynamics through the use of predictive models. Moreover, since MLPs are universal approximators, these predictive models are much more powerful than linear models.1 We now describe how MLPs may be used as emission probability density estimators. The main idea in using predictive models lies in the assumption that a frame depends on a number of preceding frames. That is, the multi-dimensional signal (the sequence of frames) is an auto-regressive process, where successive frames produced in a state obey the following law:

o_t = F_s(o_{t-1}, ..., o_{t-p}) + ε_t    (1)
where o_t is the t-th frame of the sequence, p is the model order, F_s is the prediction function and ε_t is an independently and identically distributed residual. This hypothesis means that:

p_s(o_t | o_{t-1}, ..., o_{t-p}) = p_ε(ε_t)    (2)

where p_ε denotes the residual probability distribution. If we approximate the prediction function F_s by a neural network, we can then approximate p_s(o_t | o_{t-1}, ..., o_{t-p}) by p_ε(o_t − F_NN(o_{t-1}, ..., o_{t-p})). The predictive neural network (PNN) is an MLP that implements this relationship. Two points remain: the choice of p_ε, and the learning algorithm for the PNN; the latter is discussed in Section 4. For simplicity, in the system at hand we have assumed that ε_t is white noise, although more sophisticated assumptions may be used.1 With this white noise assumption:

−log p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ ½ ||o_t − F_NN(o_{t-1}, ..., o_{t-p})||².    (3)
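To make Equations (1)-(3) concrete, here is a minimal sketch of a predictive network and its frame score for model order p = 1. The architecture sizes follow the chapter's setup (15 inputs, 15 outputs, a small hidden layer), but the random weights are illustrative stand-ins; an actual system would train the MLP as described in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

class PredictiveMLP:
    """Tiny MLP F_s: predicts frame o_t from the previous frame o_{t-1}."""
    def __init__(self, dim=15, hidden=8):
        # Random stand-in weights, NOT a trained model.
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (dim, hidden))
        self.b2 = np.zeros(dim)

    def predict(self, o_prev):
        h = np.tanh(self.W1 @ o_prev + self.b1)
        return self.W2 @ h + self.b2

def frame_score(net, o_t, o_prev):
    """Minus log-likelihood of o_t up to an additive constant, Eq. (3)."""
    resid = o_t - net.predict(o_prev)
    return 0.5 * float(resid @ resid)

net = PredictiveMLP()
o_prev, o_t = rng.normal(size=15), rng.normal(size=15)
s = frame_score(net, o_t, o_prev)
```

Under the white-noise assumption, summing `frame_score` over a state sequence yields the sequence score used by the Viterbi decoding discussed later.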
Using this probabilistic interpretation of the residual, the minus log-likelihood of a sequence of frames may then be approximated by the sum of predictive quadratic residuals.1 In our experiments, we tested different context sizes, together with different context types (past frames, future frames, or mixed past and future frames). Although the various contexts produced noticeable changes in performance, an efficient and economical model is to consider only the preceding frame. In the systems presented in this chapter, the PNNs have 15 input units corresponding to the contextual frames, 15 output units for the approximated next frame, and a limited number of hidden units (5 to 10).

3.3. Mixture of Predictive Neural Networks

To handle the variability of the handwritten signal arising from different writing styles and multiple writers, we used mixture models where the emission probability density associated with a state is assumed to follow:

p_s(o_t | o_{t-1}, ..., o_{t-p}) = Σ_{i=1}^{K} w_{s,i} · p_{s,i}(o_t | o_{t-1}, ..., o_{t-p})    (4)

where the w_{s,i} are the weighting coefficients of the K mixture components. Using K PNNs (F^i_NN)_{i=1,...,K} to approximate the K mixture components as
discussed in the preceding section, we may approximate:

p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ Σ_{i=1}^{K} w_{s,i} · e^{−½ ||o_t − F^i_NN(o_{t-1}, ..., o_{t-p})||²}    (5)

As a simplification, and because this does not lead to decreased performance in our experiments, we replaced the summation in Equation (5) by a maximization and assumed uniform weights. Thus

−log p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ min_i (||o_t − F^i_NN(o_{t-1}, ..., o_{t-p})||²).    (6)
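Equation (6) can be sketched as follows. The K linear "predictors" below are random stand-ins for trained PNNs, used only to show the min-over-components scoring.

```python
import numpy as np

rng = np.random.default_rng(1)

def state_score(predictors, o_t, o_prev):
    """Eq. (6): state score = min over the K predictors of the squared residual."""
    residuals = [o_t - f(o_prev) for f in predictors]
    return min(float(r @ r) for r in residuals)

# Stand-in "networks": K = 3 linear predictors with random weights
# (a real system would use K trained predictive MLPs per state).
K, dim = 3, 15
mats = [rng.normal(0.0, 0.1, (dim, dim)) for _ in range(K)]
predictors = [lambda o, A=A: A @ o for A in mats]

o_prev, o_t = rng.normal(size=dim), rng.normal(size=dim)
score = state_score(predictors, o_t, o_prev)
```

Taking the minimum rather than the weighted sum means each frame is scored by the single mixture component that predicts it best, which matches the winner-take-all training described in Section 4.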
4. Training

The system (for isolated word recognition as for sentence recognition) is trained on isolated words without segmentation labels: the word labels are known, but not their segmentation into letters. The training phase of the HMM system aims at learning the parameters of the system (the weights of the PNNs) as well as the segmentation of words into letters and of letters into states. This is the classical setting in which the HMM framework is interesting. As is usual for HMM systems, training consists of iterating, over the whole training set, two steps corresponding to the EM (Expectation-Maximization) steps. First, words are segmented into letters and states using the current HMM system parameters. Second, the system parameters are re-estimated based on the word segmentations found in the first step. Two main algorithms may be used: the classical Baum-Welch algorithm, or the simpler and less computationally intensive Segmental K-Means algorithm.18 We used the second algorithm, in which the re-estimation of the parameters is performed after the presentation and segmentation of each word of the training set, as detailed in the algorithm below. It is thus a kind of stochastic version of the Segmental K-Means algorithm, which allows faster convergence of the NN weights. This is a well-known advantage of stochastic over batch gradient learning algorithms for NNs. The training algorithm is run until a stopping criterion is satisfied. We use two stopping criteria: the first is a maximum number of iterations; the second is the ratio between the mean quadratic residual at the present iteration and the mean quadratic residual at the preceding iteration. Here is a sketch of the training algorithm:

• Iterate until one of the stopping criteria is satisfied
— For all words w of the training set
  * Build the word model M(w) by concatenation of its letter models.
  * Determine the optimal segmentation into states S*_1^T = s*_1, s*_2, ..., s*_T of the signal O_1^T = o_1, ..., o_T with a Viterbi algorithm:

    S*_1^T = s*_1, s*_2, ..., s*_T = arg max_{S_1^T} p(O_1^T, S_1^T | M(w))    (7)

    Let c_t = o_{t-1}, ..., o_{t-p} be the contextual information for frame o_t. This segmentation associates each pattern (o_t, c_t) with a state of the HMM M(w). A pattern consists of a frame together with its contextual information.
  * Perform re-estimation of the parameters of each state s of the HMM M(w) using all the patterns associated with it. The re-estimation of the parameters in a state s is done in a winner-take-all manner. All PNNs in the mixture of the state s are put in competition. For each pattern associated with s, the best PNN is determined. Then, each PNN in the mixture is re-estimated using the patterns associated with it. This is done by minimizing the quadratic error ||o_t − F_NN(c_t)||² using gradient back-propagation. This quadratic criterion is justified since, as illustrated by Equations (3) and (6), minimizing the quadratic error corresponds to maximizing the conditional probability. Ideally, using this training strategy, a PNN learns to output the conditional expectation of a frame given its preceding frames, and thus implements a time-varying mean process.

The above procedure cannot work efficiently without an appropriate initialization of the NN weights. We therefore initialize the system with the following two-step procedure. In the first step, the patterns (o_t, c_t) of all words in the training set are assigned to a particular state of the 26 letter models. This is done as follows. For each word in the training database, the word model is built, then the sequence of observations is linearly segmented, which means that a constant number of patterns is associated with each state of the letter models of the word. After the segmentation of all the words in the training set, a set of patterns is associated with each state s of each of our 26 letter models. Then, in the second step, for each state, a clustering method (the K-Means algorithm) is run on the set of patterns associated with it to determine K clusters.
Finally, each of the K PNNs in the mixture associated with s is trained independently on one partition of this set.
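The segmentation step of the training loop (Equation (7)) can be sketched as a Viterbi alignment over a left-right model with equally likely transitions. The per-state frame scores below are a hypothetical toy input; in the actual system they would come from the PNN residuals of Equation (6).

```python
import numpy as np

def viterbi_segment(scores):
    """Align T frames to S left-right states (self-loop or advance only).

    scores[t, s] = minus log-likelihood of frame t in state s.
    Returns the state index of each frame on the best path.
    """
    T, S = scores.shape
    INF = np.inf
    D = np.full((T, S), INF)       # cumulated path scores
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = scores[0, 0]         # must start in the initial state
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            move = D[t - 1, s - 1] if s > 0 else INF
            if move < stay:
                D[t, s] = move + scores[t, s]
                back[t, s] = s - 1
            else:
                D[t, s] = stay + scores[t, s]
                back[t, s] = s
    # Backtrack from the final state.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy alignment: 6 frames, 3 states; low scores mark the "true" segmentation.
sc = np.full((6, 3), 5.0)
for t, s in enumerate([0, 0, 1, 1, 2, 2]):
    sc[t, s] = 0.1
seg = viterbi_segment(sc)
```

The resulting per-frame state labels are exactly what the winner-take-all re-estimation step consumes: each (o_t, c_t) pattern is handed to the state the path assigns it to.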
5. Lexicon-Driven Isolated Word Recognition

In the recognition phase, if no lexical constraint is considered, a global HMM is built by connecting the final states of the letter models to the initial states of all letter models. Then, the Viterbi decoding algorithm is used to determine the optimal path in the global HMM, which maximizes the likelihood of the input signal. The solution of the decoding step is thus a string w (a sequence of letters) corresponding to:

w* = arg max_{w ∈ V*} p(w | O_1^T) = arg max_{w ∈ V*} p(O_1^T | w)    (8)

where V* stands for all sequences of letters, and all strings w are equally likely since no lexical constraint is considered. In our system, where PNNs are trained to minimize the prediction error, i.e. the quadratic residuals, the above equation is implemented via:

w* = arg min_{w ∈ V*} [−log p(O_1^T | w)].    (9)

In the following, we will use the term score for the minus log-likelihood. It should be noted that this decoding procedure does not take into account the existence of the decoded string in the language, although this information is essential to achieve good recognition results. To illustrate this point, we observed in our experiments that without such knowledge, a recognizer having a character recognition rate of about 80% recognizes only 10 to 20% of words without any character error (substitution, deletion or insertion). Introducing a lexical constraint consists of decoding only words that respect the constraints of the language considered. This corresponds to recognizing sequences of letters while taking into consideration the a priori probabilities of such sequences in the language:

w* = arg max_{w} (p(O_1^T | w) p(w)).    (10)

There are several ways to incorporate such lexical constraints. Soft constraints can be considered through the use of probabilities on successions of characters (bigrams or trigrams of letters). The use of a dictionary of authorized words allows the introduction of hard constraints. In this case, all words in the dictionary are considered equally likely and all other words are considered impossible. In this work, we used the second approach, which has already been used, for example, in the speech recognition field with very
large dictionaries of 100k words or more. Let D be the dictionary of authorized words. Our decoding algorithm determines the word:

w* = arg max_{w ∈ D} p(O_1^T | w).    (11)
Using a dictionary leads to a dramatic increase in the cost of the decoding algorithm, since all words in the dictionary have to be considered and their probabilities computed. In order to limit the computational cost we used two optimization techniques: the organization of the dictionary in prefix form, and a frame-synchronous beam search strategy. The first optimization consists of organizing the dictionary so as to factorize computations common to different words, namely those sharing the same prefix. In such a lexical tree, a node corresponds to a letter (i.e. a seven-state HMM in our system) arising at a precise place in the prefix of one or more words of D. With this organization, the computation of the probabilities of two words such as 'art' and 'arc' is partially shared, because these two words have the same prefix 'ar' (see Fig. 3). Such a tree organization allows the search effort to be reduced by a factor of about 2 to 3 (depending on the dictionary) compared with a flat dictionary where all the words are scored independently.28
Fig. 3. Tree organization of the dictionary.
However, this search space is still too large to be explored exhaustively. We therefore use a frame-synchronous beam search strategy.22,28 Note that there are other approaches to avoiding exploration of the whole search space. An interesting alternative is to use frame-asynchronous search algorithms
such as A*, which are also used in continuous speech recognition.12,17 However, these algorithms generally involve a forward-backward procedure (the decoding cannot begin before the handwritten signal is complete), which is less appropriate for recognizing arbitrarily long sentences, as we wish to do later in this chapter. We now describe the frame-synchronous beam search. The sequence of frames is processed in time order, and all word hypotheses are propagated in parallel for each incoming frame. Since all hypotheses cover the same portion of the input signal, their scores can be compared directly. This allows a data-driven search to be performed, one which focuses on those hypotheses that are more likely to lead to the best state sequence, instead of an exhaustive search. This is made possible by a list organization of the search space. We introduce the notion of an active node, which is a node of the lexical tree corresponding to one of the current hypotheses.22,29 This list is initialized with the root node of the lexical tree and is updated for each incoming frame using an extension-pruning procedure. In the extension step, all nodes in the active node list and all their successor nodes are processed with a Viterbi step, i.e. new scores are computed based on the cumulated scores of these hypotheses and the local scores for the frame considered. Then a pruning step is performed on all these hypotheses. The best score among all the hypotheses is determined, and all hypotheses whose scores exceed this optimal score by more than a fixed value (called the beam) are removed from further consideration. This leads to a new active node list. This procedure is iterated for each frame of the sequence. The implementation of the beam search allows the search space to be built dynamically.
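The pruning step described above can be sketched as follows, assuming hypotheses are kept as a mapping from active node to cumulated score; the node identifiers and scores below are hypothetical.

```python
def beam_prune(hypotheses, beam):
    """One pruning step of the frame-synchronous beam search.

    hypotheses: dict mapping an active node id to its cumulated score
    (minus log-likelihood, lower is better). All hypotheses whose score
    exceeds the best score by more than `beam` are dropped.
    """
    best = min(hypotheses.values())
    return {node: s for node, s in hypotheses.items() if s - best <= beam}

# Hypothetical active list after a Viterbi extension step.
active = {"a": 10.0, "ar": 10.5, "an": 14.0, "b": 25.0}
pruned = beam_prune(active, beam=5.0)
```

Since all hypotheses cover the same prefix of the frame sequence, this score comparison is fair; widening the beam trades decoding time for a lower risk of pruning the correct word.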
Furthermore, by limiting the number of active nodes to a prespecified number, the computational cost of the search can be made independent of the overall size of the potential search space. An essential property of the search algorithm is that it produces a best word hypothesis together with a list of alternative words whose scores are close to the best score. This is essential in a user-oriented approach, where the system must offer alternatives in case of recognition errors.

6. Sentence Recognition

Now we focus on the decoding of sequences of words, each belonging to a given dictionary. Two main approaches may be employed to extend isolated word recognition to sentence, or more generally text, recognition: a two-step
A Hybrid Neuro-Markovian System for On-Line Handwriting Recognition
system composed of a segmentation module followed by a recognition module, or a one-step system where the two tasks are carried out simultaneously. In the first approach, the sentence signal is segmented into isolated words. Then an isolated word recognizer is run on each of the words, producing a list of word hypotheses for each word in the sentence. Finally, all these results are combined to determine the recognized text, using additional high-level knowledge such as a language model.14,24,37 The second approach consists of segmenting and recognizing words simultaneously.25,38 The main advantage of the first approach is its algorithmic simplicity; its main drawback is the classical weakness of cascade systems, where an error in a module is propagated to the following module and cannot be recovered. The second approach overcomes this error-propagation problem at the price of increased algorithmic complexity. It is clear from this discussion that an efficient decoding algorithm has to be a combination of these two approaches. A pre-segmentation module should be used whenever possible, to segment unambiguously separated parts of the text. To perform this pre-segmentation of a sentence, one can use techniques developed over many years in off-line document processing, which could be improved by taking into account the additional information in the on-line signal.32 Then, an integrated segmentation-recognition module could be used on the identified words or sequences of words. Finally, a postprocessor would exploit all these results, together with higher-level knowledge of the language, to output the decoded text. We are interested here in the development of the integrated segmentation and recognition module, which is, from our point of view, the most difficult part. In the following, we present the extension of our system to sentence recognition.
This extension includes two main features, which we describe successively: the integration of a language model and the extension of the decoding algorithm to sequences of words.

6.1. Language Model
At the sentence level, as at the word level, the sequences of words obey language constraints (grammatical constraints, semantic constraints, etc.). This knowledge must be integrated in the recognition process to achieve good performance.14,31,37 The aim of the decoding process is to determine the number of words N and the sequence of N words w_1, ..., w_N that maximizes the joint probability p(o_1, ..., o_T, w_1, ..., w_N). This probability
may be rewritten as

p(o_1, ..., o_T, w_1, ..., w_N) = p(o_1, ..., o_T | w_1, ..., w_N) p(w_1, ..., w_N)    (12)

where the term p(w_1, ..., w_N) is the prior probability of the sequence of words w_1, ..., w_N, which is independent of the observations and is estimated with high-level knowledge sources, i.e. the language model. This knowledge cannot be modeled with a dictionary of allowed sentences except in very specific applications. Most often, a language model is implemented via m-grams of words. The idea of using m-grams is to simplify the computation of the probabilities p(w_1, ..., w_N) by assuming that the probability of observing a word given its history depends only on the m-1 previous words and is independent of earlier words. The prior probability of a sequence of words may be expressed as

p(w_1, ..., w_N) = p(w_1) \prod_{i=2}^{N} p(w_i | w_1, ..., w_{i-1}).    (13)

Using the m-gram assumption, this probability may be rewritten as

p(w_1, ..., w_N) = p(w_1, ..., w_{m-1}) \prod_{i=m}^{N} p(w_i | w_{i-m+1}, ..., w_{i-1}).    (14)

Such language models (m-gram probabilities) are estimated on large corpora of text. In our system, we use a word-bigram language model, which means that the probability of a word given its history depends only on the preceding word. Thus:

p(w_1, ..., w_N) = p(w_1) \prod_{i=2}^{N} p(w_i | w_{i-1}).    (15)
Several algorithms exist to compute these bigrams from a large corpus. 17 The simplest way to do this is to compute relative frequencies of bigrams of words in a large corpus of texts. However, one must take into account that many bigrams may occur rarely or not at all in the corpus. Thus smoothing and/or discounting techniques must be used. In our present work, we use CMU's statistical language modeling toolkit 4 to estimate the bigrams from the corpus. CMU's toolkit provides several discounting strategies. Following the work from Marti and Bunke, 23 we choose to use the discounting strategy that leads to the lowest perplexity.
6.2. Decoding Algorithm
The extension of the decoding algorithm to a sequence of words is not straightforward. A number of alternatives exist, which differ in their computational cost, their ability to output alternative sentences, the possibility and ease of integrating a language model, etc.28 The potential search space is represented in Fig. 4(a). The main idea of this extension is to use copies of the lexical tree (the dictionary in prefix form) and to allow a transition from any terminal node of a tree (the end of a word) to the root of a new tree copy (the beginning of a new word). The tree in Fig. 4(a) is a prefix tree at the word level, where one can see copies of a lexical tree with three leaves corresponding to the three words in the dictionary, A, B and C. This structure is a direct generalization, at the word level, of the lexical tree at the letter level. Hence, it is clear that the extension to the sentence level entails a huge search space, since we wish to recognize arbitrarily long sentences. An exhaustive search is clearly unrealistic, and computations must be further factorized. Using only one lexical tree copy, with each terminal node connected to the root, would allow the decoding of sentences. However, it would not
Fig. 4. Tree-organized search space for sentence decoding using lexical tree copies. The lexicon includes three words: A, B and C. Overall search space (a) and word predecessor-conditioned organization (b).
easily offer the ability to provide word alternatives in ambiguous regions of the handwritten signal, which is a property that we are looking for. To obtain such a property, a very efficient method has been developed in the speech recognition field, based on the notion of word graphs.30 In this approach, the search is conducted through two successive levels. First, a dynamic programming search is performed to build a word graph, keeping only the best word hypotheses with their beginning and ending times (frame indices). Then, a dynamic programming algorithm is run on this word graph to output the best matching word sequence. The basic idea of a word graph is to represent all word sequences by a graph in which an arc represents a word hypothesis and a node is associated with an ending time. Compared to a classical N-best approach, a word graph is much more efficient, since word hypotheses need only be generated locally, whereas in the N-best method each local alternative requires a whole sentence to be added to the N-best list. In addition, a word graph allows higher-level language modeling to be integrated in a post-processing step. As with isolated word recognition, we use a frame-synchronous beam search with an active-node list to build the search space dynamically. The building of the word graph is integrated into the frame-synchronous search. Whenever the final state of a word (a terminal node of a lexical tree copy) is reached, the corresponding hypothesis (the word, its partial score, its ending time, its predecessor word and their time boundary) is added to the word graph. Following research by Ney et al.,29,28 we implemented a word graph building algorithm based on predecessor-conditioned lexical trees. In such an organization of the search space, there is a copy of the lexical tree for each predecessor word; i.e. there are as many lexical tree copies as there are words in the dictionary.
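The second of the two levels described above, running dynamic programming over the finished word graph, can be sketched as follows. The arc layout (start time, end time, word, score) and the optional bigram penalty are simplifying assumptions made for the example; an exact search would also condition the recursion on the predecessor word, as the word pair approximation suggests.

```python
def best_path(arcs, final_time, lm=None):
    """Find the best-scoring word sequence in a word graph.

    arcs is a list of (start, end, word, score) hypotheses with start < end,
    where score is a negative log-likelihood (lower is better) and the graph
    is assumed to begin at time 0.  lm(prev, word), if given, adds a
    language-model penalty.
    """
    # best[t] = (cumulated score, word sequence) of the best path ending at t.
    best = {0: (0.0, [])}
    # Sorting by end time gives a valid topological order because start < end.
    for start, end, word, score in sorted(arcs, key=lambda a: a[1]):
        if start not in best:
            continue  # no surviving path reaches this arc's start time
        prev_score, seq = best[start]
        s = prev_score + score
        if lm is not None:
            s += lm(seq[-1] if seq else None, word)
        if end not in best or s < best[end][0]:
            best[end] = (s, seq + [word])
    return best.get(final_time)
```

Keeping, for each ending time, the runners-up instead of only the single best entry is what makes alternative word sequences available as a side effect of the same pass.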
Each terminal node of these lexical trees is connected to the root of the lexical tree which has this word as predecessor. The need for predecessor-conditioned lexical trees comes from the formulation of the word pair approximation, which states that, given a word pair and its ending time, the word boundary between the two words is independent of earlier words. Under this assumption, we only need a copy of the lexical tree for each predecessor word to investigate all potential word sequences and keep track of all alternative word sequences. To illustrate the search procedure based on this organization, let us look at Fig. 4(b). The structure shown in this figure corresponds to a three-word dictionary (A, B and C). We can see three lexical trees, which are the
one corresponding to A as previous word, one corresponding to B as previous word and one corresponding to C as previous word. Assume that, when considering the t-th frame in the decoding process, there are two active nodes corresponding to the end of word B in the active-node list (the square and the circle in Fig. 4(b)). One of these nodes corresponds to the decoding of word B following word A (the square); the other corresponds to the decoding of word B following word C (the circle). The two hypotheses corresponding to these two active nodes are added to the word graph, but only the best one is considered further in the search process, i.e. extended to the root node of the lexical tree corresponding to B as previous word (the triangle in Fig. 4(b)). Once the decoding is complete, i.e. when the last frame has been processed, the word graph is pruned to eliminate less probable hypotheses.36 Then a simple Viterbi algorithm is performed on the remaining word graph to find the most likely word sequence. Alternative word sequences may easily be determined from the word graph as a side effect of this Viterbi step. Note that the post-processing of the word graph results in a negligible additional computational cost.

7. Experiments

7.1. Database
All the experiments reported in this study have been performed on the UNIPEN database.15 The aim of the UNIPEN project is to build an international database for the development and evaluation of on-line handwriting recognizers. This database contains isolated letters (in script, cursive or mixed forms), isolated words and sentences, about 5 million characters in all. The handwritten samples come from more than 2200 writers from many different countries. The signals are organized in directories corresponding to the various donors (universities and companies), and the samples have been acquired using various hardware (e.g. tablets with different characteristics). As a consequence, this database includes a large variety of writing styles (script, cursive and mixed) as well as signals of varied quality. The recognition systems for word and sentence recognition are trained on the same training database, composed of 30k isolated words written by 256 writers from various countries. We will present experimental results on multi-writer (MW) and writer-independent (WI) isolated word recognition.
Multi-writer experiments mean that all writers in the test set also appear in the training set. Writer-independent experiments mean that the writers in the test set do not appear in the training set. We built two databases for isolated word recognition of 2k words each, one for multi-writer tests and one for writer-independent tests. For sentence recognition experiments, we built only one test database, since there are not many sentences in UNIPEN. After removing non-English sentences, we collected a database of about 2600 sentences (3 words in length on average), written by 191 writers, half of whom were encountered during training. Compiling all words from the three test databases, we built a basic dictionary for our experiments, composed of 2540 different words. To study the behavior of our systems on larger dictionaries, we added words to this basic dictionary to build dictionaries of 4000, 5000 and 12000 words.

7.2. Performance Criteria
Here we define the evaluation measures used in our experiments. For the isolated word recognition task, we used the recognition rate as well as top-N rates, i.e. the percentage of words for which the correct word is among the N best hypotheses output by the recognizer. For the sentence recognition task, we computed three measures to characterize our systems' performance. The first is the letter-level accuracy (LLA), defined using the string edit distance, at the letter level, between the recognized sentence and the true sentence, both taken as sequences of letters. The second is the word-level accuracy (WLA), defined using an edit distance between the recognized sentence and the true sentence, taken as sequences of words. The third measure, called word graph accuracy (WGA), is based on the edit distance between the true sentence and the word graph: it is computed as the minimal edit distance between the true sentence and the sentences contained in the word graph (i.e. paths through this word graph).

7.3. Isolated Word Recognition
In this section, we provide some experimental results of our systems for isolated word recognition in various conditions. Table 1 shows comparative results for two systems, the first one based on one PNN per state, the
Table 1. Isolated word recognition and top-N rates for two systems using 1 or 3 PNNs per state. Results are given for each system on the two test data sets. MW stands for multi-writer experiments, WI for writer-independent experiments. The size of the dictionary is 2.5k.

#PNN per state   Test data set   1st    1-2    1-3    1-5    1-10   1-30
1                MW              77.0   85.1   87.9   90.0   93.8   96.7
1                WI              73.6   81.7   85.3   88.3   92.1   96.3
3                MW              80.1   88.2   91.0   93.6   95.5   97.5
3                WI              77.9   86.1   89.1   92.0   94.5   97.1
second one based on mixtures of 3 PNNs per state. Recognition results are given for multi-writer and writer-independent experiments with the 2.5k dictionary. For each experiment (i.e. row), the table shows the recognition rate and top-N rates for N = 2, 3, 5, 10 and 30. As expected, the system using mixtures of 3 PNNs outperforms the system with one PNN per state in both the MW and WI experiments. This follows from the fact that using mixtures of 3 PNNs accommodates more variability in the writing signal, and thus increases the modeling power of the system. Comparing MW and WI performances also shows that this system is more robust to signal variability. To study the behavior of the system relative to the size of the vocabulary, we performed experiments with dictionaries of various sizes. The results are compiled in Fig. 5, for the 3-PNN mixture system on the multi-writer and writer-independent tests. One can see that recognition rates remain high in the large-vocabulary experiments (12k words) and that the top-ten performance decreases relatively slowly. It is worth noting that the average recognition time is about 1 second per word on a 500MHz Pentium. It is independent of the size of the vocabulary, since we used a constant maximum number of hypotheses (3000 in our experiments) while performing the frame-synchronous beam search. It is not straightforward to compare these results with other published results, because experimental conditions differ. For example, Hu et al.16 report word recognition results on other parts of the UNIPEN database, achieving about 87.2% for a 2k lexicon and 79.8% with a 12k lexicon. Although these results are superior, it is clear from our experience that this difference may not be significant. Indeed, if we look at recognition rates obtained on parts of our test database each corresponding to a particular
Fig. 5. Recognition rate as a function of the dictionary size. The system is based on mixtures of 3 PNNs per state. The four curves correspond to recognition rates for multi-writer experiments (MW) and writer-independent experiments (WI), as well as top-10 recognition rates for MW and WI.
directory of the UNIPEN database (a directory includes signals coming from a particular donor), a great variance in recognition rates is observed.

7.4. Sentence Recognition
To explore the behavior of our system for sentence recognition, we compared the performance reached with and without our language model based on word bigrams. We computed three bigram sets. The first set (Bigram B1) was computed using the sentences in the test set; it provides an upper bound on how much bigram probabilities may improve recognition. The second set (Bigram B2) was computed on the Susanne corpus.33 This corpus is relatively small, containing about 127,000 word instances covering a vocabulary of 14,000 words, and its sentences are very different from those in our test set. Thus, this corpus does not include much information useful for decoding the sentences in our test set, and using Bigram B2 provides a lower bound on the expected improvements. Finally, we computed a third set (Bigram B3) on the Susanne corpus and our test set together. This gives a more realistic idea of the expected gain from using a bigram language model.
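Perplexity, the criterion used earlier to choose among discounting strategies, quantifies how well a bigram set predicts held-out text. A minimal sketch, assuming a conditional probability function p(w, prev) is available:

```python
import math

def perplexity(p, sentences):
    """Perplexity of a bigram model over a set of sentences.

    p(w, prev) returns the model's conditional probability of w given the
    preceding word; lower perplexity means the model predicts the text
    better.  Start-of-sentence handling is omitted for brevity.
    """
    log_prob, n_words = 0.0, 0
    for sent in sentences:
        for prev, w in zip(sent, sent[1:]):
            log_prob += math.log(p(w, prev))
            n_words += 1
    # Geometric mean of inverse probabilities.
    return math.exp(-log_prob / n_words)
```

Under this measure, a model that assigned every bigram probability 1/k would have perplexity exactly k, which is why perplexity is often read as an effective branching factor.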
Table 2 provides recognition results for our system using 3 PNNs per state. Word-level and letter-level accuracies are shown. First, one may notice that the use of a bigram language model leads to improvements on both criteria with any bigram set. Using such information allows a word-level accuracy of about 82% with the most realistic bigram set (Bigram B3) and 85% with a task-specific language model (Bigram B1).

Table 2. Performance results (letter-level, word-level and word-graph accuracy) for sentence recognition without a language model and with three different bigram sets. The system uses 3 PNNs per state. The lexicon size is 2.5k.

Language model   Letter-level accuracy   Word-level accuracy   Word graph accuracy
without bigram   84.4                    70.2                  93.2
Bigram B1        90.5                    84.7                  93.2
Bigram B2        85.6                    73.2                  93.2
Bigram B3        89.4                    81.9                  93.2
In addition, these results show a high WGA in all cases. This suggests that the word graph contains much information about the potential sequence of words. Furthermore, it means that the word accuracy could be improved by using a higher-level language model. From the computational point of view, the average recognition time is about 15 seconds per sentence on a 500MHz Pentium, or 5 seconds per word on average. This increased cost compared to isolated word recognition comes from the fact that, since the search space is larger, we have to propagate more hypotheses to achieve good recognition performance.

8. Conclusion

In this chapter we have presented an on-line handwritten sentence recognition system based on a hybrid Hidden Markov Model / Neural Network architecture. Our system is a first step towards the interpretation of note taking on electronic tablets. It is thus dedicated to writer-independent, multi-style, large-vocabulary handwriting recognition and has been designed with this aim. A few hybrid systems have been proposed for signal classification tasks such as speech or handwriting recognition. In these systems, neural networks are most
often used as classifiers, to take advantage of their discriminative power. In our approach, NNs (multi-layer perceptrons) are used to overcome some classical limitations of standard Gaussian HMMs: we exploit their approximation ability to model the dynamics of a signal considered as a non-linear auto-regressive process. To do this, NNs are used in a predictive way to estimate emission probability densities, instead of traditional local Gaussian models. NNs are thus embedded in the HMM framework, and the whole system is trained using an approximated maximum likelihood criterion. To handle the variability in the drawing of letters and words due to differences in writing styles and individual idiosyncrasies, the emission probability densities in a state are implemented through mixtures of predictive neural networks. This leads to increased robustness of the system with respect to signal variability. At the word level, the decoding strategy is based on a frame-synchronous beam search algorithm where the lexicon is organized as a tree in prefix form. The search space is built dynamically during the decoding process. By limiting the number of hypotheses to a prespecified threshold, the computational cost may be made independent of the size of the vocabulary, thus allowing the use of large vocabularies. The extension of this decoding scheme to word sequences has been designed so that it efficiently produces alternative solutions to the best decoded sentence. The algorithm uses a search space organized by word predecessor conditioning. This organization is derived from the word pair approximation proposed in the speech recognition literature. The search space is explored using a frame-synchronous beam-search strategy, which includes the construction of a word graph. This word graph is then processed using a dynamic programming algorithm in which higher-level knowledge sources may be integrated.
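The predictive use of NNs summarized above can be illustrated schematically. In this sketch the emission density of a state is a Gaussian centred on the prediction of that state's network, so the score is driven by the prediction error; the predict function, the context layout and the fixed sigma are illustrative assumptions, not the chapter's exact formulation.

```python
import math

def emission_log_prob(predict, context, frame, sigma=1.0):
    """Emission log-probability of a state from its predictive model.

    predict(context) stands in for the state's predictive neural network:
    it maps the previous frames to a prediction of the current frame.  The
    emission density is an isotropic Gaussian centred on that prediction.
    """
    pred = predict(context)
    err = sum((p - f) ** 2 for p, f in zip(pred, frame))  # squared prediction error
    d = len(frame)
    return -0.5 * err / sigma ** 2 - 0.5 * d * math.log(2 * math.pi * sigma ** 2)
```

With a mixture of predictive networks per state, the state score would combine the log-probabilities of the component predictors (for instance by taking the best one), which is how the mixture accommodates several writing styles.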
We investigated the use of a simple language model based on bigrams of words and showed how it may greatly improve recognition performance at the word and sentence levels. Various experiments have been carried out on the UNIPEN international database, which is a very realistic database. Word and sentence recognition performance under various experimental conditions shows the strong potential of our approach for general on-line text recognition. However, much work remains to be done for note-taking applications. In particular, we have to develop a pre-segmentation module, which segments
unambiguously separated parts of a handwritten page. We must also handle out-of-vocabulary words and integrate higher-level knowledge sources at the text level.

References

1. T. Artières and P. Gallinari. Multi-state predictive neural networks for text-independent speaker recognition. In European Conference on Speech Communication and Technology (EUROSPEECH), volume 1, pages 633-636, Madrid, Spain, September 1995.
2. Y. Bengio and Y. LeCun. Word normalization for online handwritten word recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 409-413, Jerusalem, Israel, 1994.
3. Y. Bengio, Y. LeCun, C. Nohl, and C. Burges. LeRec: a NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(5):1289-1303, 1995.
4. P. R. Clarkson and R. Rosenfeld. Statistical language modeling using the CMU-Cambridge toolkit. In European Conference on Speech Communication and Technology (EUROSPEECH), pages 2707-2710, Rhodes, Greece, September 1997.
5. J. Dolfing. A comparison of ligature and contextual models for hidden Markov model based on-line handwriting recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1073-1076, Seattle, Washington, USA, May 1998.
6. S. Garcia-Salicetti, B. Dorizzi, P. Gallinari, and Z. Wimmer. Maximum mutual information training for an on-line neural predictive handwritten word recognition system. International Journal on Document Analysis and Recognition (IJDAR), 2001 (to appear).
7. S. Garcia-Salicetti, P. Gallinari, B. Dorizzi, A. Mellouk, and D. Fanchon. A hidden Markov model extension of a neural predictive system for on-line character recognition. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 50-53, Montreal, Canada, August 1995.
8. S. Garcia-Salicetti, P. Gallinari, B. Dorizzi, A. Mellouk, and D. Fanchon. A neural predictive approach for on-line cursive script recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 3463-3466, Detroit, Michigan, USA, May 1995.
9. S. Garcia-Salicetti, B. Dorizzi, P. Gallinari, and Z. Wimmer. Adaptive discrimination in an HMM-based neural predictive system for on-line word recognition. In International Conference on Pattern Recognition (ICPR), volume 4, pages 515-519, Vienna, Austria, August 1996.
10. S. Garcia-Salicetti. A neural lexical post-processor for improved neural predictive word recognition. In International Conference on Artificial Neural Networks (ICANN), pages 587-592, Germany, July 1996.
11. M. Gilloux, B. Lemarie, and M. Leroux. A hybrid radial basis function network/hidden Markov model handwritten word recognition system. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 394-397, Montreal, Canada, August 1995.
12. P. S. Gopalakrishnan, L. R. Bahl, and R. L. Mercer. A tree search strategy for large-vocabulary continuous speech recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 572-575, Detroit, Michigan, USA, May 1995.
13. I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-120, February 1991.
14. I. Guyon and F. Pereira. Design of a linguistic postprocessor using variable memory length Markov models. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 454-457, Montreal, Canada, August 1995.
15. I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In International Conference on Pattern Recognition (ICPR), volume 2, pages 29-33, Jerusalem, Israel, October 1994.
16. J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33(1):133-148, January 2000.
17. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.
18. B.-H. Juang and L. R. Rabiner. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(9):1639-1641, September 1990.
19. S. Knerr and E. Augustin. A neural network-hidden Markov model hybrid for cursive word recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 1518-1520, Brisbane, Australia, August 1998.
20. S. Manke, M. Finke, and A. Waibel. Combining bitmaps with dynamic writing information for on-line handwriting recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 596-598, Jerusalem, Israel, October 1994.
21. S. Manke, M. Finke, and A. Waibel. NPen++: a writer-independent, large vocabulary on-line handwriting recognition system. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 403-408, Montreal, Canada, August 1995.
22. S. Manke, M. Finke, and A. Waibel. A fast search technique for large vocabulary on-line handwriting recognition. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 183-188, Essex, England, September 1996.
23. U. Marti and H. Bunke. Unconstrained handwriting recognition: language models, perplexity, and system performance. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 463-468, Taipei, Taiwan, December 1994.
24. U. Marti and H. Bunke. Towards general cursive script recognition. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 379-388, Taejon, Korea, August 1998.
25. U. Marti and H. Bunke. Handwritten sentence recognition. In International Conference on Pattern Recognition (ICPR), volume 3, pages 467-470, Barcelona, Spain, September 2000.
26. A. Mellouk and P. Gallinari. Discriminative training for improved neural prediction systems. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 233-236, Adelaide, South Australia, April 1994.
27. A. Mellouk and P. Gallinari. Global discrimination for neural predictive systems based on an N-best algorithm. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 465-468, Detroit, Michigan, USA, May 1995.
28. H. Ney and X. Aubert. Dynamic programming search strategies: from digit strings to large vocabulary word graphs. In Automatic Speech and Speaker Recognition, chapter 16, pages 385-411. Kluwer Academic Publishers, 1996.
29. H. Ney, D. Mergel, A. Noll, and A. Paeseler. Data driven search organization for continuous speech recognition. IEEE Transactions on Signal Processing, 40(2), February 1992.
30. M. Oerder and H. Ney. Word graphs: an efficient interface between continuous-speech recognition and language understanding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 119-122, Minneapolis, USA, April 1993.
31. R. Plamondon, S. Clergeau, and C. Barriere. Handwritten sentence recognition: from signal to syntax. In International Conference on Pattern Recognition (ICPR), volume 2, pages 117-122, Jerusalem, Israel, October 1994.
32. E. H. Ratzlaff. Inter-line distance estimation and text line extraction for unconstrained online handwriting. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 33-4, Amsterdam, Holland, September 2000.
33. G. Sampson. The SUSANNE Corpus: Documentation. http://www.cogs.susx.ac.uk/users/geoffs/SueDoc.html.
34. M. Schenkel, I. Guyon, and D. Henderson. On-line cursive script recognition using time delay neural networks and hidden Markov models. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 637-640, Adelaide, South Australia, April 1994.
35. G. Seni, N. Nasrabadi, and R. Srihari. An on-line cursive word recognition system. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 404-410, Seattle, WA, USA, June 1994.
36. A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 593-596, Phoenix, Arizona, USA, March 1999.
37. R. Srihari and C. Baltus. Incorporating syntactic constraints in recognizing handwritten sentences. In International Joint Conference on Artificial Intelligence (IJCAI), volume 2, pages 1262-1267, Chambery, France, August 1993.
38. T. Starner, J. Makhoul, R. Schwartz, and G. Chou. On-line cursive handwriting recognition using speech recognition methods. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 125-128, Adelaide, South Australia, April 1994.
CHAPTER 7

MULTIPLE CLASSIFIER COMBINATION: LESSONS AND NEXT STEPS
Tin Kam Ho Bell Laboratories, Lucent Technologies 700 Mountain Avenue, 2C425, Murray Hill, NJ 07974, USA E-mail: [email protected]
During the 1990s many methods were proposed for combining multiple classifiers for a single recognition task. With these methods, the focus of the field shifted from the competition among specific statistical, syntactic, or structural approaches to the integration of all these as potential contributing components in a combined system. Deeper explorations of the combination methods revealed many links back to several fundamental issues of pattern recognition. Amid the excitement and confusion, there is a persistent uncertainty concerning the optimal match between a method and a problem, due to a strong dependence of classifier performance on the data. In this chapter I review several different motivations that have driven this development, summarize lessons learned in the exploration of combination methods, outline the difficulties encountered, and suggest ways to break out of the current plateau.

1. Introduction

During the past decade the method of multiple classifier systems was firmly established as a practical and effective solution for difficult pattern recognition tasks. The idea appeared under many names: hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, cooperative agents, opinion pool, sensor fusion, and more. In some areas it was motivated by an empirical observation that specialized classifiers often excel in different cases and make different errors. In other areas it arose naturally from the application context, such as the need to employ a variety of sensor types, which induces a natural decomposition of the problem. In these cases, the input can be considered as vectors projected to the subspaces of
the full feature space, which have to be matched by different procedures. An example is authentication by fingerprints together with voice, gestures etc., or retrieving a document by string matching on keywords and by matching word count vectors. There were also cases where the motivation was to avoid having to make a meaningful choice of some arbitrary initial condition, such as the initial weights for a neural network. Or, disturbingly, there were arguments that introducing some randomness, no matter where, in classifier training is a good thing to do, just to make a collection containing some differences that, when combined, will automatically perform better than a single element. There are many ways to use more than one classifier in a single recognition problem. A divide-and-conquer approach would isolate the types of input for which each classifier performs well, and direct new inputs of a particular type to the suitable classifier. A sequential approach would use one classifier first, and invoke others only if it fails to yield a decision with sufficient confidence. All these can be said to be multiple classifier strategies, and have been explored to a certain extent. However, motivated by the factors mentioned earlier, the majority of classifier combination research focuses on applying all the available classifiers in parallel to the same input and combining their decisions. Naturally one asks, what is gained and lost in a parallel combination? When is it preferable to those alternative approaches? As the idea prospered, many different combination techniques emerged. It almost feels like we are simply bringing the fundamental pursuit to a different level. Instead of looking for the best set of features and the best classifier, now we look for the best set of classifiers and then the best combination method. One can imagine that very soon we will be looking for the best set of combination methods and then the best way to use them all. 
If we do not take the chance to review the fundamental problems arising from this challenge, we are bound to be driven into such an infinite regress, dragging along more and more complicated combination schemes and theories, and gradually losing sight of the original problem. The trend of parallel combination of many classifiers deviates from, or even follows an opposite philosophy of, the traditional selection approach, where one evaluates the available classifiers against a representative sample, and chooses the best one to use. Here, in essence, one abandons the attempt to optimize an individual classifier, and instead, tries to use all the
available ones in an intelligent way. This is the antithesis of economical design. Introducing needless classifiers would harm more than just efficiency. The agreement of two poor classifiers does not necessarily yield more correct decisions. And very soon one can fall into a situation in which the same training data are used to estimate an increasing and potentially infinite number of classifier parameters, which is not an unfamiliar trap. So, is classifier combination a well-justified, systematic methodology, or is it a desperate attempt to make the most out of imperfect designs? What has been learned in the past decade, what is still missing, and what should we do next? In this chapter, I review the development of the relevant ideas, discuss the lessons and implications, and try to outline some potentially fruitful future directions. After a decade of development, the literature on classifier combination has become huge, especially as one recognizes in retrospect similar attempts under different names in different communities. The review in this chapter is not intended to be comprehensive. Rather, I attempt to point to works exemplifying several main ideas, with emphasis on distinguishing between methods appearing under apparently similar names, and pointing out their deficiencies and difficulties. For a more complete coverage, especially on successful, empirical examples, readers are referred to several recent survey articles, proceedings, and books.14,43,48,54,69 I will start with the historical context most familiar to researchers and practitioners in pattern recognition, and extend into similar attempts in related disciplines. I will focus on the methods first and analyses later, since in most cases discoveries of the methods occurred first, and theoretical analyses followed only after much empirical exploration.
There were notable exceptions to this where theoretical and algorithmic developments were tightly coupled and the two lines of discussion cannot be separated.

2. The Historical Context: Awareness of the Advantages of Multiple Approaches

By the end of the 1980s, the field of pattern recognition had developed into a rich area featuring several active branches. There were the syntactic methods integrating results from formal languages and automata theory. There were the structural approaches focusing on innovative feature representations and matching procedures. Statistical approaches continued to
provide a core depository of methods, and were greatly supplemented by vivid activities using the neural network paradigm or following the machine learning themes. For a practitioner facing the diverse possibilities, selecting among these methods became a major challenge. In the absence of a consistent and dominant winner, each method had to be evaluated in the context of each new problem, within the constraints of many practical concerns. Efforts in designing a recognition system emphasized selecting the best classifier for a given task, supplemented occasionally by divide-and-conquer strategies that invoke different classifiers for different inputs under the guidance of considerations such as confidence of decisions, cost of application, or other specific knowledge on the input. An early example is a two-classifier system that invokes a nearest neighbor classifier only when a linear classifier decides that the input falls in an ambiguous region.11 The complementary advantages of different classification methods were recognized early on. Aside from efficiency concerns, the potential of using different feature representations and decision rules for the same problem was appealing. A review by Bunke9 cited many of the early efforts in incorporating elements of one approach into another, such as combining statistical and syntactic methods through probabilistic and attributed grammars,23 matching structural prototypes by syntactic methods via graph grammars,27 or treating structural matching as a constraint satisfaction process. Influenced by popular ideas from artificial intelligence research, recognition was also seen as a process of inferences about class concepts from concrete examples. The inference steps can be organized as systems of production rules, which allows flexibility in including multiple sources of knowledge and multiple types of evidence. There, the emphasis was on knowledge representation methods and control procedures for the inference steps.
Some attempted to model recognition as constraint satisfaction processes. The awareness of the utility of diverse knowledge sources also led to studies on fusing sensory data of multiple types,58,64 or combining the opinions of multiple experts using consensus theory.3 The idea of mixing different classifiers for the same problem continued to evolve into hierarchical or multi-stage systems, where each stage identifies or rejects a subset of classes, or more sophisticated conditional systems where different classifiers are selectively applied or believed according to different characteristics of the input82 or agreement on decisions.34 In mixture of
experts62 a gating network was added to select among the outputs of a number of expert networks, each of which specializes in particular classes or subclasses by way of unsupervised learning. Whereas each expert network learns the association between an input vector and its correct class, a gating network learns the association between the correctness of the decision of those expert networks and features of the input. Though a number of such proposals made their way into practical systems, most remained interesting possibilities due to a lack of demonstrated, consistent success. Design of each hybrid system was highly specific to the particular application.

3. Reaching a Consensus: Optimization of Combined Decisions

Around the beginning of the 1990s, a more ambitious question emerged: if no classifiers are known to be best for all cases, is it possible to take advantage of the strengths of each method while avoiding their weaknesses? Different from most previous attempts, here the focus is on simultaneous applications of multiple classifiers to the same input, and the objective is to improve overall accuracy. In the early days, whether this could be achieved was by no means obvious. Why would one dilute the strengths of the best classifier with the often poor decisions of the inferior ones? How can one be sure that the correct decisions of the stronger classifiers are preserved when the weaker classifiers join in? Simple approaches such as the majority vote, though known to increase decision reliability, do not necessarily give a higher overall correct rate; if the system rejects a decision for cases lacking agreement by majority, the overall rate of successful recognition is actually lower than that of the best individual. Accuracy improvement becomes possible only when more information is taken from the classifiers.
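The majority rule with rejection just described is simple to state; a minimal sketch (the function and variable names are mine, not the chapter's):

```python
from collections import Counter

def majority_vote(decisions, reject_label=None):
    """Combine hard class decisions by majority vote.

    Returns the class chosen by more than half of the classifiers, or
    `reject_label` when no strict majority exists -- the rejection
    behavior whose cost is discussed above.
    """
    counts = Counter(decisions)
    label, votes = counts.most_common(1)[0]
    if votes * 2 > len(decisions):
        return label
    return reject_label

# Two of three classifiers agree: the decision is accepted.
print(majority_vote(['A', 'A', 'B']))                          # 'A'
# No strict majority: the case is rejected rather than guessed.
print(majority_vote(['A', 'B', 'C'], reject_label='reject'))   # 'reject'
```

Rejected cases count against the recognition rate, which is why the plain rule alone cannot beat the best individual classifier.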
Instead of just a single decision on one of the N classes assumed for the problem, one needs to consider the secondary choices, or have the classifiers score each class with a degree of belief for the respective input. Classifiers operating by different principles do not necessarily produce belief scores on a compatible scale, which presents a problem in decision combination. For example, an attempt to bring syntactic and statistical methods into the same system24 had to rely on an unnatural sum of the incommensurable syntactic distances between the symbol strings and the semantic distance between the associated parameter vectors. A more justified approach was adopted in evidence combination,59 where statistical models and the Bayes rule were used to transform the distance measures computed by the individual classifiers into belief scores for each class, and then the Dempster-Shafer theory of evidence68 was used to combine the scores. Explorations in applying the Bayesian formalism to combine belief scores continued, making use of detailed knowledge of the classifiers' behavior contained in confusion matrices,83 or accuracies associated with every pattern of joint decisions by all classifiers in the collection.41,80 Methods relying on such detailed knowledge of the classifiers' behavior run into difficulties when the number of classes is large, as the number of training samples required to obtain reliable probabilistic estimates becomes prohibitive. A different strategy is to convert the class decisions to a uniform linear scale by reducing all similarity scores to rank orderings of the classes, trading off some precision for robustness. Borda count or logistic regression were used to combine such rank scores, and the latter also helped to identify redundant classifiers in the collection.34 The problem appears simpler in the context where all classifiers are of the same type. If each classifier computes an estimate of the posterior probability of each class using each feature vector, such estimates can be directly combined by trainable linear or logarithmic functions called the opinion pool,2 or by fixed maximum, median, mean, sum, or product rules.47 Classifiers that decide by principles of hypothesis testing can be combined by methods of meta-analysis32 or logistic regression.42 At the extreme, the decisions of individual classifiers, in whatever form, can be considered as inputs to a second-level classifier.57
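The Borda count mentioned above can be sketched in a few lines (the function name and toy rankings are mine):

```python
def borda_count(rankings):
    """Combine rank-ordered class lists by Borda count.

    Each classifier contributes a ranking (best class first); among n
    classes, a class receives n - 1 points for first place, n - 2 for
    second, and so on. The class with the highest total wins.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    return max(scores, key=scores.get)

# 'B' is nobody's first choice but everybody's close second,
# and wins on aggregate rank (scores: A=3, B=4, C=2).
print(borda_count([['A', 'B', 'C'],
                   ['C', 'B', 'A'],
                   ['B', 'A', 'C']]))  # 'B'
```

This illustrates the robustness-for-precision trade: the incommensurable similarity scores never enter the combination, only their orderings do.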
The idea of employing multiple experts specialized for different aspects of a given task is probably as old as the history of human society. But such common wisdom does not necessarily and immediately apply to the context of pattern recognition, where concepts such as disagreement and cooperation take on a concrete meaning. The behavior of the individual classifiers can be mathematically characterized, and with accuracy being an objective measure of effectiveness, the benefits of any proposal can potentially be quantified. Reasons for having multiple sources of knowledge about an input pattern can be various, but whether they should be maintained in separate representations is never obvious. Even less clear is whether one should integrate the separate representations and compare them under a
single metric, or direct them to separate classifiers and defer the integration until after the classifiers have processed them. Regardless of the level where the integration is carried out, details of the integration procedure have to be stated in terms of a concise mathematical function and implemented in a well-defined algorithm. And there are vast differences in the effectiveness of each procedure as observed in the early explorations. From the engineering perspective, the discovery that the combined accuracy of several classifiers applied simultaneously can be made better than each of the individuals came as a surprise from experiments in the late 1980s and early 1990s. Numerous studies followed, confirming former observations or leading to new proposals, and by now combining classifiers has become almost routine. Still, accuracy improvement is not guaranteed, and not all combination schemes are successful. Have we fully understood the successes and failures and all the tradeoffs? In the following I outline some of the attempted analyses. An early result on committee votes is Condorcet's Jury Theorem from the 18th century,10,55 which asserts that, if each individual member of the committee has an independent opinion about the subject to be voted on, the probability of the majority vote being correct increases or decreases monotonically with the size of the committee, and converges to one or zero depending on whether each member has a chance of being correct that is greater or less than 50%. Engvall18 gave a least upper bound for the average accuracy of decisions agreed to by multiple classifiers to a given degree. Srihari72,73 related the reliability of the committee decision to the average individual reliabilities and individual differences, and gave the number of voters yielding an optimal combined reliability for some specific cases. These are purely probabilistic arguments. Mazurov et al.
60 gave a geometrical analysis where the voters and their decisions are modeled as a system of linear inequalities. Conditions on the consistency of the inequalities were derived for the existence of a minimal committee achieving a particular fraction of agreed decisions, and methods were given for constructing such minimal committees. Lam and Suen55 studied expected voting outcomes under different conditions on the committee. Simple voting schemes such as the majority or plurality rules are fixed regardless of the performance of the voters and the conditions of the input. So these studies focused on evaluation of such rules under different patterns of voter performance. The results show, under a specific performance goal,
whether one should use a particular rule or some specific number of voters, but the rules themselves are not trainable. Votes on non-binary choices can be given as a ranking of the choices (classes in the pattern recognition context). In the behavioral sciences, methods for combining such rankings are referred to as social choice functions or group consensus functions. Arrow's impossibility theorem1 generalized an observation by Condorcet on cyclic majorities occurring when there are transitive preferences on the choices among the individual voters, and asserted that there is no decision combination function for preferences among three or more items by two or more individuals that can satisfy all four axioms capturing the intuition of a reasonable combination. Goodman and Markowitz29 suggested that by modifying some of the conditions imposed by Arrow, some acceptable combination functions can be obtained. Fishburn reviewed these in detail19,20 and extended the theory to infinite sets of individuals. Related mathematical theories were also covered by Black.5 To quantify the differences between individual decision makers, Bogart6,7 presented a theory of distances between partial orders. In the social scientific theories of voting and elections, the focus is on obtaining a combined choice that best represents the voters' preferences. There is no notion of absolute correctness of the combined choice. The merit of a candidate in an election is solely determined by some specific characterization of voters' preferences. However, in classification, there is a true class associated with each input that is determined regardless of the combination mechanism. That makes a difference, since the combination function can be trained to optimize some objective accuracy measure. In classification, the prior performance of the voters can also be evaluated against the objective truth, and based on such evaluation the combination function can be tuned.
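The monotone behavior asserted by the Jury Theorem cited earlier is easy to verify numerically; a sketch for binary decisions with an odd committee (the function name is mine):

```python
from math import comb

def majority_correct_prob(n, p):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the correct binary decision (n odd).
    This is the upper tail of a Binomial(n, p) above n/2."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p > 0.5 the majority improves with committee size and tends to 1;
# with p < 0.5 it degrades toward 0 -- the two regimes of the theorem.
for n in (1, 5, 25):
    print(n, round(majority_correct_prob(n, 0.6), 3),
             round(majority_correct_prob(n, 0.4), 3))
```

The independence assumption is the catch: as noted below, classifiers responding to the same input are never independent, so this bound is only a starting point.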
These observations motivated the use of regression functions to combine the rank decisions,34 where merits of individual preferences are evaluated by their contribution to the correctness of decisions in the past. Degenerate forms of such tuned combination functions became various weighted voting schemes that can be applied to binary choices. In the context of pattern recognition, the individual decisions of the voters are never independent. They are intrinsically linked by the fact that they are responses to the same input pattern. The degree of agreement of the individual decisions on the same input case, besides characterizing
the amount of differences among the classifiers, is also a reflection of the relative difficulty of the case. With a well-designed set of classifiers, for an easy case, all or most of the decisions should be the same. Therefore, the pattern of agreement can be used to differentiate the types of input cases.34,35 Tubbs and Alltop75 derived measures of confidence associated with combining multiple ranked lists. More recently, Tumer and Ghosh77 quantified the improvement of classifier accuracy by biases in the ranked lists. Other inferences about rankings were pursued in the area of order statistics.33,61 For combining estimates of posterior probabilities or belief scores represented in a normalized, continuous scale, Bayes decision theory and the Dempster-Shafer theory of evidence68 dominated. Prior performances of individual classifiers can be embedded into the combination function in the form of accuracy estimates conditioned on the individual decisions. Simpler combination rules that do not take into account the classifiers' prior performance were studied by Kittler et al.,47 where a justification was given for the sum rule that chooses the class maximizing the sum of individual estimates of posterior probabilities. The justification stems from the sum rule's relative insensitivity to local estimation errors when compared to the product rule. There are also other methods that essentially treat the belief scores given by the individuals as input features for a classifier at another level. These derive support from the basic principles of classification. In statistics they are known as model-mix methods,56 and are justified by a reduction of the bias of the combined estimator. The combination methods summarized above all assume that the classifiers are given and unchangeable, or consider them already optimized for the task. The design of the combination function occurs after the classifiers are chosen and their training is completed.
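The fixed rules and the sum rule's insensitivity to a single bad estimate can be illustrated as follows (a sketch; the posterior values are invented for illustration):

```python
import numpy as np

def combine_posteriors(posteriors, rule="sum"):
    """Fixed rules for combining per-classifier posterior estimates.

    `posteriors` has shape (n_classifiers, n_classes); each row is one
    classifier's estimate of P(class | x). Returns the winning class
    index under the chosen rule.
    """
    p = np.asarray(posteriors, dtype=float)
    if rule == "sum":
        scores = p.sum(axis=0)
    elif rule == "product":
        scores = p.prod(axis=0)
    elif rule == "max":
        scores = p.max(axis=0)
    else:
        raise ValueError(rule)
    return int(np.argmax(scores))

# Two classifiers favor class 1, but a third gives it a near-zero score.
# The product rule is vetoed by that single low estimate and picks
# class 0; the sum rule is comparatively insensitive and picks class 1.
estimates = [[0.20, 0.80],
             [0.20, 0.80],
             [0.99, 0.01]]
print(combine_posteriors(estimates, "sum"),      # 1
      combine_posteriors(estimates, "product"))  # 0
```

One severe local estimation error flips the product rule's decision but not the sum rule's, which is the essence of the justification given by Kittler et al.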
The hope is to employ a combination function that maps the individual decisions to a final decision which, when evaluated over all inputs, has a better chance of being correct. Thus I refer to these as decision optimization methods. A system using several classifiers may not be able to achieve the highest accuracy for a problem if there are cases for which none of the classifiers' decisions is sufficiently close to being correct. A question then arose: is it possible to systematically create a collection of classifiers for a given problem, so that for each possible input pattern there exists at least one member that can correctly identify its true class? Such a classifier should
dominate the final decision of the combination to guarantee its effect. If this could be done, accuracy could be maximized for any recognition problem. This much more ambitious goal, together with other relatively modest hopes of getting more uniform coverage of the input cases and overcoming certain biases in the training process, motivated the development of another family of techniques that I refer to as coverage optimization methods. There, instead of the classifiers, the decision combination function is assumed to be fixed and unchangeable in form. The strategy is to create a set of classifiers, observing some specific notion of complementariness, such that they can yield a good final decision under the chosen combination function.

4. Forcing Complementariness: Generative Methods and Stochastic Searches

Some classifiers have an inherent stochastic component, or places where arbitrary decisions have to be made. Neural network methods, having many architectural and algorithmic possibilities but less well-understood effects, run into such difficulties more often than other techniques. Many neural networks need initial weights before training begins. Choosing one set of random values means committing to a specific starting point which, in bad cases, could lead to a local optimum. The sequential order in which training samples are presented leads to another line of possibilities. And there are even more subtle decisions like the type of architecture, the number of hidden layers, or the number of units in each layer. The performances of individual networks trained for the same problem can differ by a large amount due to such design choices, and it was observed that they can excel in different areas of the input space.
Such observations led to the idea of combining the network decisions by majority or plurality votes,31,63 where the hope is that the overlap of the correct decisions would dominate the coincidence of errors. Stacked generalization81 introduced another possibility. The role of a classifier is understood to be generalizing class concepts from known examples to the unknown. Assuming each particular generalizer carries a certain bias, the method attempts to deduce such biases by way of cross-validation, a well established statistical procedure that, at each pass, reserves random parts of the training data for model evaluation. Such biases are then modeled and reduced by a second-level classifier trained on the raw decisions. The idea of learning the classifier bias from within the training set was
carried further in the method of bootstrap aggregation, or bagging,8 where each classifier is trained by a bootstrap replicate of the training sample, i.e., sampling the training set to the same size with replacement. The positive effect of bagging is considered to be a result of variance reduction. The pursuit of complementary errors took an explicit form in the method of boosting,21 a sequential procedure where later classifiers are trained by a subsample of the training set, biased towards the errors committed by the collection of earlier classifiers. Later44,71 similar ideas were tried with linear discriminators. Explanations attempted for the effect of the method include distribution of margins and error bounds derived from the VC dimension theory,67 and fitting an additive model by forward stepwise optimization on a criterion similar to binomial log-likelihood.22 However, as we will see shortly, these attempted explanations are incomplete or almost irrelevant, as they fail to address, simultaneously, all the important factors involved in the process: the discriminating power and generalization ability of individual classifiers, and the correlation among them. Despite observed successes in many practical experiments, such training set subsampling methods suffer from a logical paradox. Weakening the individual classifiers by not training on or equally weighing in all available data is said to help avoid overfitting. Boosting simply cannot run on classifiers perfect for the training data, as there are no errors to train additional components. But the design is intended to make the entire system work well on the full training set. So do we want the classifiers to adapt perfectly to the training set or not? If we do, what is the point of deferring this adaptation to the level of decision combination? If not, what is the point of adding more and more components to improve accuracy on the full training set?
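The bootstrap replicate at the heart of bagging can be sketched in a few lines (names are mine):

```python
import random

def bootstrap_replicate(training_set, rng):
    """Sample the training set to its own size, with replacement --
    the replicate that each bagged classifier is trained on."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(1000))
replicate = bootstrap_replicate(data, rng)
# Each replicate omits roughly a (1 - 1/n)^n, or about 36.8%, share of
# the original points, so the bagged classifiers see different data
# even though every replicate has the full training-set size.
print(len(replicate), len(set(replicate)) / len(data))
```

The deliberate omission of about a third of the data in each component is precisely the point the paradox above turns on.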
Why should the full system treat the training set differently from the component classifiers? If the training set is assumed to faithfully represent the unseen cases, why would one believe that by sacrificing training set accuracy one can gain testing set accuracy? If the feature space is small and the training sample is dense, the training set could overlap with the testing set perfectly. In such a case, what good will it do to deliberately sacrifice accuracy on the training set? On the other hand, without involving the generalization performance in the analysis, the argument that the methods can, eventually, do perfectly on the training set is useless; template matching can do the job, so there is no reason to bother with such elaborate
training procedures and sacrifices. Without a thorough understanding of how overfitting is avoided or controlled within the training process, there is no guarantee on the results, and empirical evidence does show that these methods do not always work. Then there is the question of the form of the component classifiers. All these methods are known to work well with decision trees, though the specific way data are split at each internal node matters. The much-used notion of weak classifiers is not well defined. Fully split decision trees are very strong classifiers, though pruned or forced shallow versions have been used with some success. With others, like linear discriminators, things are less clear. If the component classifiers are too weak, given the simple decision combination function of weighted or unweighted averaging, the decisions of many bad classifiers can easily outweigh the good ones, especially in methods like boosting that focus on early errors. And how about mixing in different types of classifiers? Such fundamental issues are in the midst of confusion in several communities. Nevertheless, a rigorous analysis of these issues has been given in Kleinberg's theory and method of stochastic discrimination,49,50 which historically preceded most of the above developments. The analysis uses a set theoretic abstraction to remove all the algorithmic details of classifiers, features, and training procedures. It considers only the classifiers' decision regions in the form of point sets, called weak models, in the feature space. A collection of classifiers is thus just a sample from the power set of the feature space. If the sample satisfies a uniformity condition, i.e., if its coverage is unbiased for any local region of the feature space, then a symmetry is observed between two probabilities (w.r.t. the feature space and w.r.t. the power set, respectively) of the same event that a point of a particular class is covered by a component of the sample.
Discrimination between classes is achieved by requiring some minimum difference in each component's inclusion of points of different classes, which is trivial to satisfy. The symmetry translates such differences across different points in the space to differences among the models on a single point. Accuracy in classification is then governed by the law of large numbers. If the sample of weak models is large, the discriminant function, defined on the coverage of the models on a single point and the class-specific differences within each model, converges to poles distinct by class with diminishing variance. Moreover, it is proved that the combined system maintains the projectability of the single weak models,
i.e., if each model is thick enough with respect to the spatial continuity of the classes, so that estimates of the point inclusion probabilities from the training set are close to the true probabilities, then the combined system would retain the same goodness of the estimate, which translates, by the symmetry, to accuracy in classifying unseen points. Berlind4 analyzed an alternative discriminant for multi-class problems. The theory of stochastic discrimination is complete. It identifies three and only three sufficient conditions for a classifier to achieve maximum accuracy for a problem. Each of the conditions is spelled out in concrete definitions and can be measured in precise terms. Nothing more is needed to reach optimal accuracy. What is good about building the classifier on weak models instead of strong models? Because weak models are easier to obtain, and their smaller capacity renders them less sensitive to sampling errors in small training sets.78,79 Why are many models needed? Because the method relies on the law of large numbers to reduce the variance of the discriminant on each single point. The uniformity condition specifies exactly what kind of correlation is needed among the individual models. Moreover, accuracy is not achieved by intentionally limiting the VC dimension of the complete system; the combination of many weak models can have a very large VC dimension. It is a consequence of the symmetry relating probabilities in the two spaces, and the law of large numbers. It is a structural property of the topology. The theory includes explicitly each of the three elements long believed to be important in pattern recognition: discrimination power, complementary information, and generalization ability. Here the projectability of the weak models is a key element in the proof, and not an implicit side effect.
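A toy illustration, not Kleinberg's full construction (in particular, the uniformity condition is not enforced and the data are invented), of how averaging over many weak, slightly class-enriched models yields a discriminant that separates the classes as the collection grows:

```python
import random

rng = random.Random(1)
# Two 1-D classes: class 0 concentrated near 0.3, class 1 near 0.7.
class0 = [rng.gauss(0.3, 0.1) for _ in range(200)]
class1 = [rng.gauss(0.7, 0.1) for _ in range(200)]

def make_weak_model():
    """A 'weak model' here is just a random interval, kept only if it
    covers a slightly larger fraction of class-1 than class-0 training
    points -- the minimum-difference requirement, trivial to satisfy."""
    while True:
        a, b = rng.uniform(0, 1), rng.uniform(0, 1)
        lo, hi = min(a, b), max(a, b)
        in0 = sum(lo <= x <= hi for x in class0) / len(class0)
        in1 = sum(lo <= x <= hi for x in class1) / len(class1)
        if in1 - in0 > 0.05:
            return lo, hi

models = [make_weak_model() for _ in range(500)]

def discriminant(q):
    # Fraction of weak models covering q; by the law of large numbers
    # it settles higher for class-1-like points than class-0-like ones.
    return sum(lo <= q <= hi for lo, hi in models) / len(models)

print(round(discriminant(0.3), 2), round(discriminant(0.7), 2))
```

Each individual interval is a terrible classifier, yet the average of 500 of them cleanly ranks a class-1 point above a class-0 point; the full theory adds the uniformity and projectability conditions that make this a guarantee rather than a tendency.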
Methods like boosting, when put in this framework, can be understood as heuristics for improving (but not guaranteeing) the uniformity of model coverage; the lack of an explicit treatment of projectability leads to the overfitting observed in some experiments. Bootstrapping focuses only on reducing the variance due to small training samples, and its usefulness diminishes as the training sample grows. Finally, the notion of uniform coverage in stochastic discrimination is rigorously defined, in contrast to vaguer concepts such as diversity, independence, and correlation. Without arguments like these addressing all three aspects, other schemes for introducing randomness into a collection are, at most, counting on a blind hope.
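Bootstrapping's variance-reduction role can be seen in a minimal bagging sketch (an illustrative setup of my own, after Breiman8): a greedy one-threshold learner is unstable on small noisy samples, and averaging its predictions over bootstrap replicates steadies the chosen cut.

```python
import random

random.seed(1)

def make_data(n, noise=0.25):
    # 1-D two-class data with 25% label noise
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def fit_stump(data):
    # greedy single-threshold learner: on noisy samples the chosen
    # cut jumps around, which is what bagging averages away
    def hits(t, s):
        return sum(((s if x > t else 1 - s) == y) for x, y in data)
    return max(((t, s) for t, _ in data for s in (0, 1)),
               key=lambda p: hits(*p))

def bagged_stumps(train, rounds):
    stumps = []
    for _ in range(rounds):
        boot = [random.choice(train) for _ in train]  # bootstrap replicate
        stumps.append(fit_stump(boot))
    return stumps

def vote(stumps, x):
    # majority vote over the bootstrap-trained stumps
    v = sum((s if x > t else 1 - s) for t, s in stumps)
    return int(2 * v >= len(stumps))

train, test = make_data(60), make_data(500)
stumps = bagged_stumps(train, 50)
bag_acc = sum(vote(stumps, x) == y for x, y in test) / len(test)
```

With a large training sample the stump's cut barely moves between replicates, and the averaged vote adds little, which matches the diminishing returns noted above.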
184
T. K. Ho
As a constructive procedure, the method of stochastic discrimination depends on detailed control of the uniformity of model coverage, which is outlined but not fully published in the literature.51 The random subspace method followed these ideas but took a different approach. Instead of obtaining weak discrimination and projectability through simplicity of the model form, and forcing uniformity by sophisticated algorithms, it uses complete, locally pure partitions, as given in fully split decision trees36 or nearest neighbor classifiers,37 to achieve strong discrimination and uniformity, and then explicitly forces different generalization patterns on the component classifiers. This is done by training large-capacity component classifiers, such as nearest neighbors and decision trees, to fully fit the data, but restricting the training of each classifier to a coordinate subspace of the feature space, so that classifications remain invariant in the complement subspace. If there is no ambiguity in the subspaces, the individual classifiers maintain maximum accuracy on the training data, with no cases deliberately sacrificed, and thus the method does not run into the training set paradox.

However, the tension among the three factors persists. There is another difficult tradeoff in how much discriminating power to retain for the component classifiers. Can each one use only a single feature dimension, so as to maximize invariance in the complement dimensions? Also, projection onto coordinate subspaces sets parts of the decision boundaries parallel to the coordinate axes. Augmenting the raw features by simple transformations36 introduces more flexibility, but it may still be insufficient for an arbitrary problem. Optimization of generalization performance will continue to depend on a detailed control of the projections to suit a particular problem.
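A minimal sketch of the subspace idea (the data layout and sizes are my own assumptions, following the construction in36,37): each 1-NN component fully fits the training data but measures distance only in a random coordinate subspace, so its decision is invariant in the complementary dimensions.

```python
import random

random.seed(2)

DIMS = 6  # dims 0 and 1 carry the class; dims 2-5 are pure noise

def make_data(n):
    data = []
    for _ in range(n):
        x = tuple(random.random() for _ in range(DIMS))
        data.append((x, int(x[0] + x[1] > 1.0)))
    return data

def nn_label(train, x, dims):
    # a fully-fit 1-NN classifier restricted to the coordinate
    # subspace `dims`; it keeps maximum accuracy on the training set
    return min(train,
               key=lambda p: sum((p[0][i] - x[i]) ** 2 for i in dims))[1]

def ensemble_label(train, subspaces, x):
    # majority vote over the subspace-restricted components
    votes = sum(nn_label(train, x, dims) for dims in subspaces)
    return int(2 * votes >= len(subspaces))

train, test = make_data(100), make_data(100)
subspaces = [random.sample(range(DIMS), 3) for _ in range(25)]
acc = sum(ensemble_label(train, subspaces, x) == y
          for x, y in test) / len(test)
```

Each component still classifies its own training points perfectly, while the differing invariant dimensions give the components different generalization patterns for the vote to exploit.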
Another interesting idea is to build different classifiers by partitioning the set of classes in different ways.15 This works directly only for a large number of classes, but the idea may be extensible to subclasses within a smaller number of classes. The method trains each component classifier on only part of the desired decision boundary, thereby maintaining some generalization power in the unadapted regions. However, it is unclear whether, or how, discrimination between all pairs of classes can be fully recovered in the combination. There have also been attempts to mix and match all these different ways of generating component classifiers. This, like a few other ideas bearing names similar to those mentioned above, is not backed by any serious theory
and does not have a clear motivation. On the other hand, the complicated tradeoffs involved in the better motivated techniques remain open for investigation. The results obtained using decision trees as component classifiers do not always generalize to other types of classifiers, for example, linear discriminators. And what is the exact role of randomization? Can these methods use deterministic algorithms in selecting or omitting specific training cases or feature dimensions? In these methods, the combination of decisions is usually through simple or weighted averaging of the individual probability estimates. Can these methods work with more sophisticated decision combination mechanisms like those explored for given, fixed classifiers?

5. The Lessons

The possibility, by now well supported by empirical evidence, of going beyond the power of traditional classifiers is exciting. But more questions are raised with the proposal of each new method and its associated experiments. Theoretical explanations are either logically incomplete or stop at a level where the combinatorics defy detailed modeling. Empirical evidence is often specific to particular applications and is seldom systematically documented. So, after more than a decade, where do we stand?

On the choice of methods, we have learned that there are two approaches: decision optimization and coverage optimization. The decision combination methods are at a higher level, since they can also be applied to classifiers obtained by coverage optimization. The choice of a decision combination method is dictated by several factors: what type of output the classifiers can produce, how many classes there are, and whether sufficient training data are available to tune the combination function. Table 1 summarizes the best known decision combination methods under joint consideration of two of these factors. Because of the different contextual requirements, not all methods can be used with all problems.
General performance claims about a particular combination strategy are thus difficult to make. Evaluation of the methods is further complicated by the fact that, by and large, only successful experiments are published, making it difficult to find the limits of a method's applicability. Nevertheless, it is obvious that very little can be tuned in the simple voting schemes, and without extensive training data, this may well
Table 1. Contextual requirements of various decision combination methods.

                         Resolution of belief scores
Trainable     binary or one of N    ranked lists of     prob. estimates or
with data     decisions             classes             continuous scores
--------------------------------------------------------------------------
No            majority or           Borda count         sum, median, or
              plurality vote                            product rules
Yes           weighted vote         logistic            Bayes, Dempster-
                                    regression          Shafer rules
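The untrainable row of Table 1 is simple to state in code. A sketch of the three fixed combiners follows; the function names and data layouts are my own, and the scores need not be calibrated posteriors.

```python
from collections import Counter

def plurality(decisions):
    # decisions: one class label per classifier
    return Counter(decisions).most_common(1)[0][0]

def borda(rankings):
    # rankings: per classifier, a list of classes ordered best first;
    # a class earns (n - 1 - position) points from each ranking
    scores = Counter()
    for r in rankings:
        for pos, cls in enumerate(r):
            scores[cls] += len(r) - 1 - pos
    return max(scores, key=scores.get)

def sum_rule(score_dicts):
    # score_dicts: per classifier, a map from class to belief score;
    # Counter.update adds scores class by class
    totals = Counter()
    for d in score_dicts:
        totals.update(d)
    return max(totals, key=totals.get)
```

For example, `plurality(["a", "b", "a"])` returns `"a"`, while the sum rule can overturn a plurality when one classifier's scores are emphatic; the median and product rules differ from `sum_rule` only in the reduction applied to the per-class scores.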
be about all that can be done. With sophisticated output, such as estimates of posterior probabilities, more elaborate combination-level classifiers can be applied. However, for problems with a large number of classes, the availability of good estimates of the posteriors is a very strong assumption. Without sufficient training data, the estimates given by the individual classifiers are inaccurate, and so are the estimates at the combination level. Applying overly sophisticated combination methods is thus a dangerous game. The rank combination schemes weaken the requirement to the availability of preferences only, which are always there as long as the classifiers compute any numerical measure of similarity. For very large numbers of classes and classifiers of mixed type, this is an interesting middle ground. But the linear scale of the ranks may be too crude an approximation, and to simplify the model, the combinations may have to be restricted to a small number of top ranks. The coverage optimization methods suffer less from these problems, as they depend on more or less similar assumptions about the form of the component classifiers and their decisions.

Still, the relative merits of each method differ from dataset to dataset, and it is difficult to predict how a particular method will behave on a problem before actually constructing the classifier collection. As a result, every claim has to rely on empirical testing, and for each new problem the whole trial-and-error process has to be repeated. Hopes for a better understanding are placed on the development of good theories. But on the theory side, several problems persist. Most theoretical works suffer from a failure to model various details of a classification problem. For instance, general evidence combination methods, as given in the broader context of artificial intelligence research, are relatively weak because the special structure in the decisions of classifiers is not modeled.
There is more knowledge to be exploited here than in, say, the subjective judgements of human experts. Numerous arguments on decision combination use some notion of complementariness among the component classifiers, but a precise definition of this complementariness is seldom given. A very common assumption is that the classifiers' decisions are statistically independent, in the sense that the probability of joint success can be modeled as the product of each individual classifier's probability of success. But this is an imposed, very strong assumption that can be far from the truth. In a recognition system, classifier decisions are intrinsically correlated, as they respond to the same input. The correlation among the classifiers has to be measured from the data; until measurements confirm the assumption of zero correlation, theories that depend on it do not necessarily apply. This fact is well known at the level of feature representations, as anyone who has compared a decision tree to a simplified Bayes classifier assuming feature independence would know, but it is often neglected at the level of classifier decisions.

There is also the dilemma of choosing between a probabilistic view and a geometrical view of classification. Many theories model a classifier's decision as a probabilistic event, and assume that decisions about each test case are unrelated to those about others. However, in most application contexts there is some geometrical continuity in the feature space, such that classifiers' decisions on neighboring points are likely to be similar. Some classifiers, such as decision trees, rely explicitly on this fact to partition the feature space into contiguous regions of the same class. Yet, with the exception of stochastic discrimination and consistent systems of inequalities, the notion of neighborhood almost never enters decision combination theories.
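Such a measurement is straightforward once the classifiers' per-case correctness is recorded. One common choice in the later ensemble literature (not prescribed by this chapter) is Yule's Q statistic, which is zero for statistically independent correctness patterns and approaches plus or minus one for strongly dependent ones:

```python
def yule_q(correct1, correct2):
    # correct1, correct2: aligned booleans saying whether each of two
    # classifiers got each test case right
    pairs = list(zip(correct1, correct2))
    n11 = sum(a and b for a, b in pairs)           # both right
    n00 = sum(not a and not b for a, b in pairs)   # both wrong
    n10 = sum(a and not b for a, b in pairs)       # only the first right
    n01 = sum(not a and b for a, b in pairs)       # only the second right
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else 0.0
```

Two copies of the same classifier give Q = 1; a measured Q near zero, rather than an imposed assumption, is what would license the product-of-successes model discussed above.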
Discussions on the optimal size of component decision trees touch lightly on this, but are never followed through. A precise characterization of the problem geometry will involve descriptors of the fragmentation of the Bayes-optimal decision regions, global or local linear separability, and the convexity and smoothness of boundaries. Many of these depend on the properties of the specific metric with which the classifier operates. Better modeling of the geometrical behavior of classifiers is attempted in some neural network studies, where classifier training is seen as finding a good approximation to the desired decision boundary, but the integration of such models with the probabilistic view is incomplete.
The probabilistic view is based on the need to study a problem by random sampling, owing to the difficulty of obtaining complete data coverage. There are then issues of sampling density and sample representativeness, which are intimately related to a classifier's generalization ability4,52 and in turn to its accuracy. If the training sample densely covers all the relevant regions of the feature space, many classifiers, as long as they are trainable and expandable in capacity, will work perfectly; the competition among methods then becomes mostly about representation and efficiency. For example, a decision tree may be preferable to nearest neighbors because it is more efficient. So the difficulty of classification lies mostly with sparse samples, and for this reason all theories depending on assumptions of infinite sample size are useless. Those relying on a vague definition of the representativeness of the training sample are not much better, as quite typically such "representativeness" is not even parameterized by the sample size relative to the size of the underlying problem.

The differences due to sample size can be vast. Consider a space where each point is randomly labeled as one of two classes. Whereas a dense sample may reveal the randomness to some extent, a two-point sample may suggest that the problem is linearly separable. In other, less radical problems, the sampling density affects the exact ways in which isolated subclasses become connected and boundaries constrained, far beyond what can be captured in a collective description by a simple count of points. Such problems can occur regardless of the dimensionality of the feature space, though they are more apparent in high-dimensional spaces, where the decision boundary can vary with a larger degree of freedom. Empirical evidence25,65,66 strongly suggests that a shortage of samples ruins most promises of the classical approaches.
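The two-point example is easy to reproduce (a toy of my own construction): on a problem whose labels are pure coin flips, the best single-threshold rule always explains a two-point sample perfectly, while a dense sample exposes the randomness.

```python
import random

random.seed(3)

def random_problem(n):
    # every label is a fair coin flip: the problem has no structure
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

def best_threshold_acc(data):
    # accuracy of the best single-threshold rule on this sample,
    # searching all cuts at data points plus the two constant rules
    best = 0
    for t in [x for x, _ in data] + [float("inf")]:
        for s in (0, 1):
            hits = sum(((s if x >= t else 1 - s) == y) for x, y in data)
            best = max(best, hits)
    return best / len(data)

tiny = random_problem(2)     # a threshold always "explains" two points
dense = random_problem(400)  # a dense sample reveals the randomness
tiny_acc, dense_acc = best_threshold_acc(tiny), best_threshold_acc(dense)
```

`tiny_acc` is exactly 1.0 for any two points, whereas `dense_acc` stays near one half; nothing about the problem changed, only the sampling density.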
This fact was addressed in many studies on error rate estimation as well.30,46,74 Vapnik's capacity theory78,79 is among the first to directly face the reality of small-sample effects. It provides a link between the interacting effects of classifier geometry and training sample size. But the VC dimension theory is not constructive: it gives only a loose, worst-case bound on the error rate given the geometrical characteristics of the classifier and the sample size. The difficulty in tightening the bounds stems from the distribution-free arguments.12,13 Nevertheless, as we have seen in the theory of stochastic discrimination, by using a different characterization of training set representativeness it is possible to show tighter error bounds without
involving the VC dimension argument. Also, by using specific geometrical models matched to the problem, it is possible to overcome the infamous curse of dimensionality.17

The theory is difficult because these factors interact with one another. With regard to the problem geometry, the classifier geometry, and the sampling and training processes, what exactly do we mean by saying that two or more classifiers are independent? What about related concepts such as correlation, diversity, collinearity, coincidence, and equalization? What do they mean in each context where the decisions are represented as one out of many classes, as permutations of class preferences, or as continuous belief scores that are not necessarily probability estimates? Kleinberg's notion of uniformity offers a rigorous definition under the set-theoretic abstraction; how can it be generalized to other models of classifier decisions? The bias/variance decomposition26 gives another way to relate geometry and probabilities, and has been used in analyzing the decision boundaries of combined systems.76 However, such analyses are often flawed by inadequate assumptions of decision independence. How can one relate local, point-wise measures of classification error and their correlation to collective measures over the entire training set? How do we relate the error rates of individual classifiers to their degree of agreement? If two classifiers agree only on cases that are easy for both of them, can we still tell whether they are independent? How do we know whether the agreement is due to the ease of the cases or because the classifiers simply decide by the same mechanism? Detailed studies of the patterns of correlation among the classifiers are necessary to answer these questions.28,53

As one compares different approaches to combination, and considers combinations of combinations, a few more intriguing questions arise:

• If one defers the final decision and uses the output of the individual classifiers only as scores describing the input, are those scores different in nature from feature measurements on the input? Are there intrinsic differences between the mapping from the input to the representation given by a classifier and that of a feature extractor?

• Is combination a necessity or a convenience? Is complementariness intrinsic to some chosen classifier training process? Or is it just an easy way to derive a desired decision boundary?
• Are there any commonalities among all the combination approaches? If many of them are found to be similarly effective, are they essentially doing the same thing despite superficial differences?

• Does the hierarchy of combinations converge to a limit at which one has exhausted the knowledge contained in the training set, such that no further improvement in accuracy is possible?

6. Precise Characterization of Data Dependences of Performance

Many of the above questions remain open because we do not yet have a scientific understanding of the classifier combination mechanisms. We have not taken a close enough look at what happens in a particular problem. Most empirical studies stop at some collective accuracy claim. But trying a method on 100 published problems and counting how many it wins on does not mean much, because these problems may all be very similar in certain aspects and may not be typical of reality. On the other hand, we will never have a fair sample of realistic problems, because that set is hopelessly vague. So what can we do? If we have a way to characterize problems that is strongly relevant to classifier accuracy, we may hope to find rules relating those characteristics to the behavior of classifiers or systems generated and combined in a specific way. There may be empirical observations that point to opportunities for detailed analysis where theorists can contribute.

Here I am advocating a realist's approach, in which the selection of a classifier or a combination method is guided by the characteristics of the data. And the data characteristics must include the effects of the problem geometry and the sampling density. Statements like "method X is of no help when the training sample is large" are overly simplified. How large is large? An absolute number on the sample size means little without knowledge of the size of a class boundary.
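As one concrete example of such a descriptor (a crude probe in the spirit of the complexity measures in38, not a measure taken from there): the fraction of points whose nearest neighbor carries a different label reflects boundary length and class interleaving, and is cheap to compute.

```python
import random

random.seed(4)

def nearest_enemy_fraction(data):
    # fraction of points whose nearest neighbor has a different label:
    # high values indicate a long or fragmented class boundary
    bad = 0
    for i, (x, y) in enumerate(data):
        j = min((k for k in range(len(data)) if k != i),
                key=lambda k: sum((a - b) ** 2
                                  for a, b in zip(x, data[k][0])))
        if data[j][1] != y:
            bad += 1
    return bad / len(data)

# well separated: the two classes sit in disjoint vertical strips
easy = ([((random.uniform(0.0, 0.4), random.random()), 0) for _ in range(50)]
        + [((random.uniform(0.6, 1.0), random.random()), 1) for _ in range(50)])
# structureless: every label is a fair coin flip
hard = [((random.random(), random.random()), random.randint(0, 1))
        for _ in range(100)]

easy_frac = nearest_enemy_fraction(easy)
hard_frac = nearest_enemy_fraction(hard)
```

The same absolute sample size, 100 points, means very different things for these two problems; descriptors of this kind, rather than raw counts, are what "how large is large" should be answered with.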
We need much more systematic ways to characterize problems. We need a language that describes problems in ways more relevant to the actions of classifiers, i.e., not merely collective descriptors such as the number of classes, samples, and dimensions. We need a better understanding of the geometry and topology of point sets in high-dimensional spaces, of the preservation of such characteristics under feature transformations and sampling processes, and of their interaction with the primitive geometrical
models used in most well-known classifiers. We need to measure or estimate the length and curvature of the class boundaries, the fragmentation of the decision regions in terms of the existence, size, and connectedness of subclasses, and the stability of these characteristics as the sampling density changes. Some recent attempts16,38,39,40,70 are possible starting points. Once we find ways to characterize problems, we can then ask: for what type of problems does a given method work?

We have to accept that a probabilistic flavor will remain whenever we deal with unseen data. This uncertainty is intrinsic; it can be reduced by knowledge of the structural regularity of the problem, but never removed. Given a problem in a fixed feature space, is there a limit on what automatically trainable methods can do? Recall that all such methods are based on particular geometrical primitives, such as convex regions, axis-parallel cuts, rectangular boxes, Gaussian kernels, and piecewise linear boundaries. It remains to be established that such models can fit decision regions of arbitrary shape and degree of connectedness. At what point should we say that it is meaningless to continue training, and that any further improvement in accuracy will come from luck rather than effort? VC dimension theories give us a certain limit, but it may be very remote from what is achievable. By exploiting the structural regularities of the problems and matching them to appropriate classifiers, we should be able to do better than that.

7. Conclusions

I have reviewed several lines of ideas developed over the past decade on methods for combining parallel classifiers. In summary, the main results include:

1. Numerous empirical, quantitative results showing that the accuracy of a combined system can be better than that of each individual classifier.
2. Dozens of practical algorithms for combining the decisions of a small number of classifiers.
3. Several ways to generate a collection of classifiers with complementary strengths.
4. A formal analysis of the relationship between individual classifier accuracy and generalization ability, their mutual agreement, and combined accuracy.
Among the many open questions raised in these studies, those involving the interaction of the geometric and probabilistic characteristics of a problem are the most intriguing. They hold the key to a better understanding and further improvement of the methods. An essential need in this direction is to find a better set of descriptors for the structure of a problem in the feature space, and to describe the behavior of classifiers in corresponding terms. Such descriptors could be used to categorize real-world problems, which would permit the study and prediction of the behavior of various classifiers and combination methods on whole classes of problems. Besides detailed studies of the behavior of each combination technique and its interaction with problem characteristics, several methodological directions are also worth pursuing:

• Ingenious designs of feature extractors and similarity measures that can simplify the class boundary will continue to play an important role in real applications. Systematic searches with the same goal are even more interesting.

• Unsupervised learning will play an increasing role in the context of supervised learning. Clustering methods will be applied more extensively and systematically to better understand the geometry of the class boundaries and their sensitivity to sampling density. Other ways of describing data, such as estimation of the intrinsic and extrinsic dimensionalities of the classes or subclasses, or probabilistic mixture decomposition models, will also be helpful.

• More emphasis should be put on localized (or dynamically selected) classification methods. A blind application of everything to everything will prove inferior to localized methods. Systematic strategies should be developed to fine-tune classifiers to the characteristics of local regions and to associate them with the corresponding input.
• A better understanding is needed of the differences between deterministic and stochastic classifier generation methods. This will require a careful study of the exact role of randomization in various classifier or combination tuning processes, and of the corresponding geometrical effects.

• New methods can come from a merger of the decision optimization and coverage optimization strategies, such that collections of fixed classifiers are enhanced by introducing additional components with enforced complementariness, and coverage optimization methods use more sophisticated decision combination schemes.
By now, classifier combination has become a rich and exciting area with much proven success. This review is biased towards criticisms rather than celebrations of the ideas explored over its brief history. It is my hope that these discussions can call attention to some of the confusion and missing links in the methodology, and point out the more fruitful directions for further research. We can see that some of the recent developments are already moving towards these directions. This is very encouraging.

Acknowledgments

The author wishes to thank George Nagy and the anonymous reviewers for detailed comments on the manuscript, and Eugene Kleinberg for many discussions over the past ten years on the theory of stochastic discrimination, its comparison to other approaches, and perspectives on the fundamental issues in pattern recognition.

References

1. K. J. Arrow, Social Choice and Individual Values, John Wiley & Sons, Inc., New York, 1951; 2nd ed., 1963.
2. J. A. Benediktsson, "Consensus theoretic classification methods," IEEE Transactions on Systems, Man, and Cybernetics, SMC-22, 4, July/August 1992, 688-704.
3. C. Berenstein, L. N. Kanal, D. Lavine, "Consensus and evidence," in E. S. Gelsema, L. N. Kanal (eds.), Pattern Recognition and Artificial Intelligence II, North Holland, 1986, 523-546.
4. R. Berlind, An Alternative Method of Stochastic Discrimination with Applications to Pattern Recognition, Doctoral Dissertation, Department of Mathematics, State University of New York at Buffalo, 1994.
5. D. Black, The Theory of Committees and Elections, Cambridge University Press, London, 1958; 2nd ed., 1963.
6. K. P. Bogart, "Preference structures I: Distances between transitive preference relations," Journal of Mathematical Sociology, 3, 1973, 49-67.
7. K. P. Bogart, "Preference structures II: Distances between asymmetric relations," SIAM Journal of Applied Mathematics, 29, 2, September 1975, 254-262.
8. L. Breiman, "Bagging predictors," Machine Learning, 24, 1996, 123-140.
9. H. Bunke, "Hybrid methods in pattern recognition," in P. A. Devijver, J. Kittler (eds.), Pattern Recognition Theory and Applications, Springer-Verlag, 1987, 367-382.
10. N. C. de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, Imprimerie Royale, Paris, 1785.
11. B. V. Dasarathy, B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, 67, 5, May 1979, 708-713.
12. L. Devroye, "Any discrimination rule can have an arbitrarily bad probability of error for finite sample size," IEEE Transactions on Pattern Analysis and Machine Intelligence, 4, 2, March 1982, 154-157.
13. L. Devroye, "Automatic pattern recognition: a study of the probability of error," IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, 4, July 1988, 530-599.
14. T. G. Dietterich, "Machine-learning research: four current directions," AI Magazine, 18, 4, Winter 1997, 97-135.
15. T. G. Dietterich, G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, 2, 1995, 263-286.
16. R. P. W. Duin, "Compactness and complexity of pattern recognition problems," in C. Perneel (ed.), Proc. Int. Symposium on Pattern Recognition "In Memoriam Pierre Devijver", Royal Military Academy, Brussels, Feb 12, 1999, 124-128.
17. R. P. W. Duin, "Classifiers in almost empty spaces," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 1-7.
18. J. L. Engvall, "A least upper bound for the average classification accuracy of multiple observers," Pattern Recognition, 12, 415-419.
19. P. C. Fishburn, The Theory of Social Choice, Princeton University Press, Princeton, 1972.
20. P. C. Fishburn, Interprofile Conditions and Impossibility, Fundamentals of Pure and Applied Economics 18, Harwood Academic Publishers, 1987.
21. Y. Freund, R. E. Schapire, "Experiments with a new boosting algorithm," Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, July 3-6, 1996, 148-156.
22. J. Friedman, T. Hastie, R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, 28, 2, April 2000, 337-374.
23. K. S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, 1982.
24. K. S. Fu, "A step towards unification of syntactic and statistical pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, March 1983, 200-205.
25. K. Fukunaga, D. L. Kessell, "Estimation of classification error," IEEE Transactions on Computers, 20, 12, December 1971, 1521-1527.
26. S. Geman, E. Bienenstock, R. Doursat, "Neural networks and the bias/variance dilemma," Neural Computation, 4, 1992, 1-58.
27. D. Gernert, "Distance or similarity measures which respect the internal structure of objects," Methods of Operations Research, 43, 1981, 329-335.
28. G. Giacinto, F. Roli, "A theoretical framework for dynamic classifier selection," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 8-11.
29. L. A. Goodman, H. Markowitz, "Social welfare functions based on individual rankings," The American Journal of Sociology, 58, 1952, 257-262.
30. D. J. Hand, "Recent advances in error rate estimation," Pattern Recognition Letters, 4, October 1986, 335-346.
31. L. K. Hansen, P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-12, 10, October 1990, 993-1001.
32. L. V. Hedges, I. Olkin, Statistical Methods for Meta-Analysis, Academic Press, 1985.
33. T. P. Hettmansperger, Statistical Inference Based on Ranks, John Wiley & Sons, 1984.
34. T. K. Ho, J. J. Hull, S. N. Srihari, "Decision combination in multiple classifier systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-16, 1, January 1994, 66-75.
35. T. K. Ho, "Adaptive coordination of multiple classifiers," in J. J. Hull, S. L. Taylor (eds.), Document Analysis Systems II, World Scientific Publishing Co., 1997, 371-384.
36. T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 8, August 1998, 832-844.
37. T. K. Ho, "Nearest neighbors in random subspaces," Proceedings of the Second International Workshop on Statistical Techniques in Pattern Recognition, Sydney, Australia, August 11-13, 1998, 640-648.
38. T. K. Ho, M. Basu, "Measuring the complexity of classification problems," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 43-47.
39. T. K. Ho, "Complexity of classification problems and comparative advantages of combined classifiers," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 97-106.
40. A. Hoekstra, R. P. W. Duin, "On the nonlinearity of pattern classifiers," Proc. of the 13th ICPR, Vienna, August 1996, D271-275.
41. Y. S. Huang, C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-17, 1, January 1995, 90-94.
42. A. K. Jain, S. Prabhakar, S. Chen, "Combining multiple matchers for a high security fingerprint verification system," Pattern Recognition Letters, 20, 1999, 1371-1379.
43. A. K. Jain, R. P. W. Duin, J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-22, 1, January 2000, 4-37.
44. C. Ji, S. Ma, "Combinations of weak classifiers," IEEE Transactions on Neural Networks, 8, 1, January 1997, 32-42.
45. L. Kanal, B. Chandrasekaran, "On dimensionality and sample size in statistical pattern classification," Pattern Recognition, 3, 1971, 225-234.
46. J. Kittler, P. A. Devijver, "Statistical properties of error estimators in performance assessment of recognition systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, 4, 2, March 1982, 215-220.
47. J. Kittler, M. Hatef, R. P. W. Duin, J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20, 3, March 1998, 226-239.
48. J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000.
49. E. M. Kleinberg, "Stochastic discrimination," Annals of Mathematics and Artificial Intelligence, 1, 1990, 207-239.
50. E. M. Kleinberg, "An overtraining-resistant stochastic modeling method for pattern recognition," Annals of Statistics, 24, 6, December 1996, 2319-2349.
51. E. M. Kleinberg, "On the algorithmic implementation of stochastic discrimination," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-22, 5, May 2000, 473-490.
52. E. M. Kleinberg, "A mathematically rigorous foundation for supervised learning," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 67-76.
53. L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, R. P. W. Duin, "Is independence good for combining classifiers?" Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 168-171.
54. L. Lam, "Classifier combinations: implementations and theoretical issues," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 77-86.
55. L. Lam, C. Y. Suen, "Application of majority voting to pattern recognition," IEEE Transactions on Systems, Man, and Cybernetics, SMC-27, 5, September/October 1997, 553-568.
56. M. LeBlanc, R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, 91, 436, December 1996, 1641-1650.
57. D. S. Lee, A Theory of Classifier Combination: The Neural Network Approach, Doctoral Dissertation, Department of Computer Science, State University of New York at Buffalo, 1995.
58. R. C. Luo, "Multisensor integration and fusion in intelligent systems," IEEE Transactions on Systems, Man, and Cybernetics, SMC-19, 5, September/October 1989, 901-931.
59. E. Mandler, J. Schuermann, "Combining the classification results of independent classifiers based on the Dempster/Shafer theory of evidence," in E. S.
Multiple Classifier Combination: Lessons and Next Steps
60.
61. 62.
63. 64.
65.
66.
67.
68. 69. 70.
71.
72. 73. 74. 75.
197
Gelsema, L. N. Kanal, (eds.), Pattern Recognition and Artificial Intelligence, North Holland, 1988, 381-393. V. D. Mazurov, A. I. Krivonogov, V. L. Kazantsev, "Solving of Optimization and Identification Problems by the Committee Methods," Pattern Recognition, 20, 4, 1987, 371-378. R. Meddis, Statistics Using Ranks, A Unified Approach, Basil Blackwell, 1984. S. J. Nowlan, "Competing experts: an experimental investigation of associative mixture models," Technical Report CRG-TR-90-5, Department of Computer Science, University of Toronto, September 1990. D. Partridge, W. B. Yates, "Engineering multiversion neural-net systems," Neural Computation, 8, 4, 1996, 869-893. L. F. Pau, "Fusion of multisensor data in pattern recognition," in J. Kittler, K. S. Fu, and L. F. Pau, (eds.), Pattern Recognition Theory and Applications, Reidel, 1982, 189-201. S. Raudys, V. Pikelis, "On dimensionality, sample size, classification error, and complexity of classification algorithm in pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 3, May 1980, 242-252. S. Raudys, A. K. Jain, "Small sample size effects in statistical pattern recognition: Recommendations for practitioners," IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 3, 1991, 252-264. R. E. Schapire, Y. Freund, P. Bartlett, W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Annals of Statistics, 26, 5, October 1998, 1651-1686. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976. A. J. C. Sharkey, Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, Springer-Verlag, 1999. S. Y. Sohn, "Meta analysis of classification algorithms for pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2 1 , 11, 1999, 1137-1144. M. Skurichina, R. P. W. Duin, "Boosting in linear discriminant analysis," in J. Kittler, F. 
Roli, (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 190-199. S. N. Srihari, "Reliability analysis of majority vote systems," Information Sciences, 26, 1982, 243-256. S. N. Srihari, "Reliability analysis of biased majority-vote systems," IEEE Transactions on Reliability, R-31, 1, April 1982, 117-118. G. T. Toussaint, "Bibliography on estimation of misclassification," IEEE Transactions on Information Theory, 20, 4, July 1974, 472-479. J. D. Tubbs, W. O. Alltop, "Measures of confidence associated with combining classification results," IEEE Transactions on Systems, Man, and Cybernetics, SMC-21, 3, May/June 1991, 690-692.
198
T. K. Ho
76. K. Turner, J. Ghosh, "Analysis of decision boundaries in linearly combined neural classifiers," Pattern Recognition, 29, 1996, 341-348. 77. K. Turner, J. Ghosh, "Linear and order statistics combiners for pattern recognition," in A. Sharkey, (ed.), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, Springer-Verlag, 1999, 127-162. 78. V, Vapnik, Estimation of Dependences Based on Empirical Data, SpringerVerlag, 1982. 79. V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. 80. K. D. Wernecke, "A coupling procedure for the discrimination of mixed data," Biometrics, 48, June 1992, 497-506. 81. D. H. Wolpert, "Stacked generalization," Neural Networks, 5, 1992, 241-259. 82. K. Woods, W. P. Kegelmeyer Jr., K. Bowyer, "Combination of multiple classifiers using local accuracy estimates," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19, 4, April 1997, 405-410. 83. L. Xu, A. Krzyzak, C. Y. Suen, "Methods of combining multiple classifiers and their applications to handwriting recognition," IEEE Transactions on Systems, Man, and Cybernetics, SMC-22, 3, May/June 1992, 418-435.
CHAPTER 8
DESIGN OF MULTIPLE CLASSIFIER SYSTEMS
Fabio Roli and Giorgio Giacinto
Department of Electrical and Electronic Engineering
University of Cagliari
Piazza d'Armi, 09123, Cagliari, Italy
E-mail: {roli,giacinto}@diee.unica.it
In the field of pattern recognition, multiple classifier systems based on the combination of outputs of a set of different classifiers have been proposed as a method for the development of high performance classification systems. In this chapter, the problem of multiple classifier system design is discussed and the reader is provided with a critical survey of the state of the art. A formulation of the design problem that provides motivations for the different design methods described in the literature is proposed. In particular, such a formulation points out the rationale behind the so-called overproduce and choose design paradigm. Six design methods based on this paradigm are described and compared by experiments with three different data sets. Though these design methods have some interesting features, they do not guarantee an optimal multiple classifier system design for the classification task at hand. Accordingly, the main conclusion in this chapter is that optimal design is still an open problem.
1. Introduction

In the past decade, a number of papers 19,28 have proposed the combination of multiple classifiers for designing high performance pattern classification systems. The rationale behind the growing interest in multiple classifier systems (MCSs) is that the classical approach to designing a pattern recognition system, which focuses on the search for the best individual classifier, has some serious drawbacks. 18 The main drawback is that the best individual classifier for the classification task at hand is very difficult to identify,
unless deep prior knowledge is available for such a task. 3,8 In addition, with a single classifier it is not possible to exploit the complementary discriminatory information that other classifiers may encapsulate. It is worth noting that the motivations in favour of MCSs strongly resemble those of a "hybrid" intelligent system. 15,23 The obvious reason for this is that an MCS can be regarded as a special-purpose hybrid intelligent system. A great number of methods for combining multiple classifiers have been proposed in the past ten years. 19 Several techniques for designing complementary classifiers, which when combined improve classification performance, have also been proposed. 19 An overview of works that have focused on the problem of MCS design is given in Section 2. A broader overview can be found in Refs. 19, 20 and 24. Roughly speaking, an MCS includes both an ensemble of different classification algorithms and a decision function for combining classifier outputs. Therefore, the design of MCSs involves two main phases: the design of the classifier ensemble, and the design of the combination function. Though this formulation of the design problem leads one to think that effective design should address both phases, most design methods described in the literature have focused on only one phase. We discuss the rationale behind this design philosophy in Section 3. In particular, methods that focus on classifier ensemble design 19,24 assume a simple, fixed decision function and aim to generate a set of mutually complementary classifiers that achieve optimal accuracy using that decision function. A common approach to the generation of such classifier ensembles is to use some form of "sampling" technique, 1 such that each classifier is trained on a different subset of the training data.
On the other hand, methods that focus on combination function design assume a given set of carefully designed classifiers and aim to find an optimal combination of decisions from those classifiers. In order to perform such an optimization, a large set of combination functions of increasing complexity, ranging from simple voting rules to "trainable" combination functions, is available to the designer. 5,9,19,20 In spite of the fact that some design methods proved very effective, and that the comparative advantages of different methods have been investigated, 12 no clear guidelines are available for choosing the best design method for the classification task at hand. The designer of an MCS has a toolbox containing a large number of instruments for generating and combining classifiers. A myriad of different MCSs can be designed by coupling
different techniques for creating classifier ensembles with different combination functions. However, the best MCS can only be determined by performance evaluation. Accordingly, to design the most appropriate MCS for the task at hand, some researchers 6,7,22 have proposed the so-called "overproduce and choose" paradigm (also called the "test and select" approach 25 ). The basic idea is to produce an initial large set of "candidate" classifier ensembles, and then to select the ensemble whose classifiers can be combined to achieve optimal accuracy. Typically, constraints and heuristic criteria are used to limit the computational complexity of the "choice" phase (e.g., the performances of a limited number of candidate ensembles are evaluated with a simple combination function such as the majority voting rule 22,25 ). This chapter opens with an overview of works dealing with MCS design (Section 2). The pros and cons of the main design approaches are discussed. In Section 3, a formulation of the problem of MCS design is proposed, which allows us to discuss the rationale behind the design approaches in the literature, as well as the limits of these approaches. In addition, the design paradigm based on the concept of "overproduction and choice" is introduced. Section 4 presents some design methods based on such a paradigm. Two methods proposed by Partridge and Yates 22 (Section 4.1), and other methods developed by the authors (Sections 4.3 and 4.4), are described. The measures of classifier "diversity" used for MCS design are discussed in Section 4.2. The performances of the design methods described in this chapter have been assessed and compared using three different data sets. The results are reported in Section 5. Conclusions are drawn in Section 6.

2. Related Work

As pointed out in the previous section, two main design approaches have been proposed.
Following the definitions given by Ho, 12 we refer to methods that focus on the design of the classifier ensemble as "coverage optimization" methods, because they aim to generate a set of mutually complementary classifiers whose combination optimally "covers" the data set (i.e., the combination will classify the data set optimally). Analogously, we refer to methods based on the other design approach as "decision optimization" methods, because they aim to find an optimal decision function to combine classifier decisions. With regard to coverage optimization methods, it is easy to see that the required mutually complementary classifiers should have a certain degree
of "error diversity"; 24 that is, they should make different classification errors. Ideally, classifiers that exhibit high accuracy and great diversity are required. The degree of error diversity obviously depends on the combination function used. As an example, coincident errors can be tolerated if the majority voting rule is used, provided that the majority is always correct. Several techniques have been proposed to generate such mutually complementary classifiers. In the following, we briefly review the main techniques that manipulate the training set to generate complementary classifiers. An overview of alternative techniques can be found in Refs. 2 and 24. Such techniques try to generate complementary classifiers by training them with different data sets. To this end, the most straightforward way is the use of disjoint training sets obtained by splitting the original training set (this technique is called sampling without replacement). Training set re-sampling is used by two very popular techniques called Bagging and Boosting. 1,4 Bagging trains each classifier with a training set that consists of m patterns drawn randomly with replacement from the original training set of m patterns. While Bagging samples each training pattern with equal probability, Boosting focuses on the training patterns that are most often misclassified. Essentially, a set of weights is maintained over the training set and adaptive re-sampling is performed, such that the weights are increased for those patterns that are misclassified. A quite different sampling technique, called the Random Subspace Method, has recently been proposed by Ho. 11 The feature space is randomly sampled instead of the training data, so that complementary classifiers are obtained that have been trained with different feature sets. It should be noted that coverage optimization methods usually generate large classifier ensembles (e.g., ensembles made up of a hundred classifiers).
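As an illustration, the two sampling ideas above can be sketched as follows. This is a minimal sketch, not the algorithms' reference implementations: the base learner (a nearest-centroid rule), the toy data representation, and all function names are our own placeholders.

```python
import random
from collections import Counter

def bootstrap_sample(train, rng):
    """Bagging: draw m patterns with replacement from a training set of m patterns."""
    return [rng.choice(train) for _ in train]

def random_subspace(n_features, k, rng):
    """Random Subspace Method: randomly sample k of the available feature indices."""
    return rng.sample(range(n_features), k)

class NearestCentroid:
    """A deliberately simple base learner, used only for illustration."""
    def fit(self, data, feats):
        self.feats = feats
        sums, counts = {}, Counter()
        for x, y in data:
            counts[y] += 1
            s = sums.setdefault(y, [0.0] * len(feats))
            for i, f in enumerate(feats):
                s[i] += x[f]
        self.centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}
        return self
    def predict(self, x):
        dist = lambda c: sum((x[f] - c[i]) ** 2 for i, f in enumerate(self.feats))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))

def bagging_ensemble(train, n_classifiers, n_features, rng):
    """Train each classifier on a different bootstrap replicate of the training set."""
    return [NearestCentroid().fit(bootstrap_sample(train, rng), list(range(n_features)))
            for _ in range(n_classifiers)]

def majority_vote(ensemble, x):
    """Combine the crisp decisions of the ensemble by majority voting."""
    return Counter(c.predict(x) for c in ensemble).most_common(1)[0][0]
```

The same `NearestCentroid` could instead be trained on the full data restricted to `random_subspace(n_features, k, rng)`, yielding the Random Subspace variant.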
In practical applications, techniques for selecting the subset of the most diverse classifiers are therefore required. Bagging and Boosting work well for "unstable" classifiers, that is, classifiers whose outputs substantially change in response to small changes in the training data (e.g., decision trees and neural nets). Different techniques should be used for stable algorithms such as the nearest neighbor classifier. The Random Subspace Method is expected to work well when there is a redundancy in the feature set. With regard to decision optimization methods, the MCS designer can search for the optimal combination function within a very large set. This set contains functions designed for classifier fusion or selection. 20,28 Functions
for classifier fusion range from very simple combination rules to sophisticated fusion architectures. Simple fusion functions such as the majority rule, linear combination, etc., require strong assumptions on the degree of error diversity among classifiers (e.g., error independence is assumed). More complex functions, like the so-called Behaviour Knowledge Space, 13 relax these assumptions at the cost of increasing the required size of the training set. Huge training sets are required in order to use the so-called "trainable" fusion strategies, where the outputs of the component classifiers are provided as inputs to another classifier that performs the fusion. Functions for classifier selection are designed to choose, for each pattern, the classifier that is most likely to classify it correctly. 5,9,20 Roughly speaking, each classifier has a domain of superior competence, and is selected when the input pattern falls into this domain. Classifier selection functions do not require strong assumptions. On the other hand, large training sets are required. To sum up, decision optimization methods either require strong assumptions on the degree of error diversity among classifiers, or large training sets. However, it should be noted that coverage optimization methods, too, are effective only if the designer can create an ensemble of classifiers that satisfies strong assumptions on error diversity. Therefore, decision optimization methods should be applied when the designer intends to use a set of highly specialized classifiers that show complex error-correlation patterns. Techniques for removing highly correlated classifiers should be used to simplify the decision optimization task, since the complexity of the required combination function increases with the size of the ensemble. Finally, it should be noted that Kittler 18 has recently proposed a classifier fusion architecture that partially merges the coverage and decision optimization design approaches.
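The simple fixed fusion rules mentioned above, together with the sum and product rules studied by Kittler, can be sketched as follows. This is an illustrative sketch only; the representation of per-classifier posterior estimates as dictionaries, and all function names, are our own assumptions.

```python
from collections import Counter
from functools import reduce

def majority_rule(decisions):
    """Combine crisp class labels by majority voting."""
    return Counter(decisions).most_common(1)[0][0]

def sum_rule(posteriors):
    """Sum rule: pick the class that maximises the sum of the estimated posteriors."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: sum(p[c] for p in posteriors))

def product_rule(posteriors):
    """Product rule: pick the class that maximises the product of the estimated posteriors."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: reduce(lambda acc, p: acc * p[c], posteriors, 1.0))
```

Note that the sum and product rules operate on soft outputs (posterior estimates), while the majority rule needs only crisp decisions.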
In fact, coverage optimization methods can be embodied within such a fusion architecture. With regard to the problem of combination function optimization, Kittler showed hypotheses under which classifier fusion can be effectively performed by the product function or the sum function. Although many design methods have been proposed, none can be claimed to be the best. In 1992, Wolpert 27 described the guidance available in choosing the appropriate design method as a "black art". Very recently, some quantitative criteria have been proposed. 12 However, clear guidelines are not yet available to choose the best design method for the classification task at hand. Therefore, as pointed out in Section 1, some researchers
proposed the overproduce and choose paradigm. Partridge and Yates 22 described design methods based on this paradigm. They introduced some interesting error diversity measures, which can be used to choose a subset of complementary classifiers. However, they did not propose a systematic method to choose such a subset, but only described an experimental investigation of some heuristic techniques. Sharkey et al. 25 proposed an approach that involves testing candidate ensembles on a validation set, and selecting the best performing ensemble. This approach can be applied effectively by limiting the number of candidate ensembles. In a recent work, Kang and Lee 16 also described a design approach based on the assumption that the number of candidate ensembles is small. Giacinto and Roli 6,7 proposed a method that does not need this assumption, because the choice phase is implemented by a classifier "clustering" that selects a subset of complementary classifiers without having to provide an exhaustive enumeration of all possible candidate ensembles. Other works 26,17 also follow the overproduce and choose paradigm.

3. Design of a Multiple Classifier System: Problem Formulation

In Section 1, we pointed out that MCS design involves two main phases: the design of the classifier ensemble, and the design of the combination function. In order to discuss the rationale behind the design methods in the literature and the limits of these methods, we reformulate the problem of MCS design by pointing out a possible analogy with the design of a pattern recognition system based on a single classifier. The design cycle of a single-classifier pattern recognition system can be modeled by three main phases: the feature design phase (including feature extraction and selection), the classifier design phase, and the performance evaluation phase. (It should be noted that this simplified description of the design cycle is strongly biased in view of our discussion.
For a detailed description of the design cycle of a single-classifier pattern recognition system, the reader is referred to Ref. 3.) MCS design involves phases that can be related to the above-mentioned ones. Figure 1 points out this analogy and the relations among the phases. The feature design phase in a single-classifier pattern recognition system can be related to the ensemble design phase in an MCS, because the goal of both phases is to design a set of "information components" used to perform the final classification. This analogy is
Fig. 1. Design cycles of single classifier (Feature Design → Classifier Design → Performance Evaluation) and multiple classifier (Ensemble Design → Combiner Design → Performance Evaluation) systems.
very clear for MCSs that use a classifier as the combination function. For such MCSs, the outputs of the classifier ensemble are themselves a set of features. The analogy between single-classifier design and combination-function design is quite clear. In MCSs, the "combiner" plays the role of the classical single-classifier decision module. The analogy among the performance evaluation phases is straightforward. Finally, it is worth noting that the analogy between the feature design phase, including feature selection, and the ensemble design phase suggests that a phase of classifier "selection" must be included in the MCS design. This is in agreement with the overproduce and choose design paradigm (Section 2). It is well known that in single-classifier pattern recognition systems an ideal feature set greatly simplifies the task of the classifier, while a very powerful classifier can work well even with a feature set of very poor quality. 3 Based on the above analogy, it is easy to see that an ideal classifier ensemble made up of very accurate and diverse classifiers greatly simplifies the task of the combination function. Analogously, a very powerful combiner can work well even with poorly complementary classifiers. The above leads to the rationale behind the two main design approaches proposed in the literature, namely, the coverage and decision optimization methods (Section 2). Coverage optimization methods attempt to design an ideal classifier ensemble. Accordingly, they assume that simple combination functions are used. On the other hand, decision optimization methods attempt to design a powerful combination function capable of handling poorly complementary classifiers. Nevertheless, as discussed in Section 2, the optimality of either coverage or decision optimization methods is not guaranteed. Therefore, as shown in Fig. 1, feedback from later design phases to the earlier ones may be
Fig. 2. MCS design cycle based on the overproduce and choose paradigm (Ensemble Overproduction → Ensemble Choice → Combiner Design → Performance Evaluation).
necessary. Classifier ensembles or combination functions must be redesigned when the output of the performance evaluation phase is not satisfactory. (It should be noted that the design cycle of a single-classifier pattern recognition system contains a similar feedback loop.) Since a large number of techniques are available to generate different classifier sets (Section 2), the ensemble design phase should produce many candidate ensembles. Such overproduction can help to reduce the number of design iterations. On the other hand, classifier selection should be applied to identify the most appropriate classifier subset for the task at hand, and to simplify the subsequent combiner design phase. Based on the above, the MCS design cycle can be reformulated as shown in Fig. 2. The ensemble design phase has been subdivided into the overproduction and the choice phases. It is worth noting that this formulation of the MCS design problem provides an additional motivation for the overproduce and choose design paradigm proposed in the literature.

4. Design Methods Based on the Overproduce and Choose Paradigm

According to the problem formulation proposed in the previous section, the MCS design cycle can be subdivided into the following phases:

1. Ensemble Overproduction
2. Ensemble Choice
3. Combiner Design
4. Performance Evaluation
The overproduction design phase produces a large set of candidate classifier ensembles. To this end, techniques such as Bagging and Boosting that manipulate the training set can be adopted (Section 2). Different classifiers can also be designed using different initializations of the respective learning parameters, different classifier types, or different classifier architectures. In practical applications, variations of the classifier parameters based on designer expertise can provide very effective candidate classifiers. The choice phase selects the subset of classifiers that can be combined to achieve optimal accuracy. It is easy to see that such an optimal subset could be obtained by exhaustive enumeration. Such a performance evaluation should be performed with a given combination function (e.g., the majority voting rule). Unfortunately, if N is the size of the set produced by the overproduction phase, the number of possible subsets is equal to $\sum_{i=1}^{N} \binom{N}{i} = 2^N - 1$. Therefore, different strategies have been proposed to limit the computational complexity of the choice phase (Section 2). Although the choice phase usually assumes a given combination function to evaluate the performances of classifier ensembles, there is a strong interest in techniques that choose effective classifier ensembles without hypothesizing a specific combination rule. This is suggested by the analogy with the feature selection problem, where techniques have been developed to choose the features that best preserve class separability. 3 Accordingly, techniques to evaluate the error diversity of the classifiers that make up an ensemble have been used for classifier selection purposes. We review some of these techniques in Section 4.2. With regard to combiner design, theoretically speaking, the choice of the combination function should take into account the dependency among classifiers.
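For small N, the exhaustive choice phase can be written down directly. The following sketch (our own toy setting, not from the chapter) assumes each classifier is summarised by a pre-computed 0/1 correctness vector on a validation set, and uses majority voting as the fixed combiner; it makes the $2^N - 1$ cost of enumeration explicit.

```python
from itertools import combinations

def majority_correct(subset_outputs):
    """Fraction of validation patterns on which a strict majority of the subset is correct."""
    n_patterns = len(subset_outputs[0])
    hits = sum(1 for j in range(n_patterns)
               if sum(o[j] for o in subset_outputs) * 2 > len(subset_outputs))
    return hits / n_patterns

def exhaustive_choice(outputs):
    """Evaluate all 2^N - 1 non-empty subsets and return the best one found."""
    n = len(outputs)
    best, best_acc = None, -1.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            acc = majority_correct([outputs[i] for i in subset])
            if acc > best_acc:
                best, best_acc = subset, acc
    return best, best_acc
```

Even this tiny routine visits every subset, which is why the heuristic and search-based choice strategies discussed below are needed for realistic N.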
In actual practice, a trial and error procedure is performed, because a clear model of the dependency among classifiers is difficult to obtain. Performance evaluation is performed by assessing the classification accuracy of the selected classifier ensemble, using the combination function designed in the previous phase. Though feedback from the performance evaluation phase to earlier phases may be necessary (Fig. 2), it should be noted that the overproduction phase can reduce the number of design iterations, because a large set of candidates is initially available. In addition, the choice phase can simplify the combiner design, because it selects an ensemble of classifiers that can be combined effectively. In the following, without losing generality, we shall assume that the overproduction and choice design phases are so effective that they allow the
designer to use a fixed combination function (e.g., the majority voting rule). Therefore, the proposed design methods do not include a combiner design phase. For simplicity, and without losing generality, feedback from the performance evaluation phase to the earlier phases will be disregarded. Finally, we shall describe the design methods by focusing on the choice phase, because several well-known techniques are available to implement the overproduction phase (e.g., training data re-sampling, the use of different classifiers, etc.). Accordingly, in the following, we shall assume that a large ensemble C made up of N classifiers was created in the overproduction phase:

C = {c_1, c_2, ..., c_N}.    (1)
The choice phase selects the subset C* of classifiers that can be combined to achieve optimal accuracy.
4.1. Methods Based on Heuristic Rules
Partridge and Yates 22 proposed a few design methods following the overproduce and choose paradigm. In particular, they proposed some techniques that exploit heuristic rules to choose classifier ensembles. One technique can be named "choose the best". It assumes an a priori fixed size M of the "optimal" subset C*. Then, it selects the M classifiers from the set C with the highest classification accuracy in order to create the subset C*. The rationale behind this heuristic choice is that all the classifier subsets possess similar degrees of error diversity. Accordingly, the choice is based only on classifier accuracy. The other choice technique proposed by Partridge and Yates can be named "choose the best in the class". For each classifier "class", it chooses the classifier with the highest accuracy. Therefore, if the initial set C is made up of classifiers belonging to three classifier types (e.g., a multilayer perceptron neural net, a k-nearest neighbors classifier, and a radial basis function neural net), a subset C* made up of three classifiers will be created. With respect to the "choose the best" rule, the "choose the best in the class" rule also acknowledges that classifiers of different types should be more error independent than classifiers of the same type. It should be noted that, thanks to the heuristic rules, the computational complexity of the choice phase can be greatly reduced, because it is not necessary to evaluate different classifier subsets. On the other hand, it is obvious that the general validity of such heuristics is not guaranteed.
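These two heuristic rules can be sketched in a few lines. The representation of each candidate classifier as a (name, type, validation accuracy) record is our own assumption, used only to make the rules concrete.

```python
def choose_the_best(candidates, M):
    """'Choose the best': keep the M individually most accurate classifiers."""
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:M]

def choose_the_best_in_the_class(candidates):
    """'Choose the best in the class': keep, for each classifier type, its most accurate member."""
    best = {}
    for name, ctype, acc in candidates:
        if ctype not in best or acc > best[ctype][2]:
            best[ctype] = (name, ctype, acc)
    return list(best.values())
```

Neither rule inspects error diversity explicitly, which is exactly the limitation noted above.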
4.2. Diversity Measures
As pointed out previously, a few measures of error diversity for classifier ensembles have been proposed. Partridge and Yates 22 presented one named "within-set generalization diversity", or simply GD. This measure is computed as follows:

GD = 1 − p(2 both fail) / p(1 fails)    (2)

where p(2 both fail) indicates the probability that two randomly selected classifiers from the set C will both fail on a randomly selected input, and p(1 fails) indicates the probability that one randomly selected classifier will fail on a randomly selected input. GD takes values in the range [0, 1] and provides a measure of the diversity of the classifiers forming the ensemble. Partridge and Yates also proposed a measure named "between-set generalization diversity", or simply GDB. This measure is designed to evaluate the degree of error diversity between two classifier ensembles A and B:

GDB = 1 − p(1 fails in A and 1 fails in B) / max[p(1 fails in A), p(1 fails in B)]    (3)

where the single failure probabilities are defined as above. GDB takes values in the range [0, 1]. Details on the above diversity measures can be found in Ref. 22. Another diversity measure was proposed by Kuncheva et al. 21 Let X = {X_1, X_2, ..., X_M} be a labeled data set. For each classifier c_i, we can define an M-dimensional output vector O_i = [O_{1,i}, ..., O_{M,i}], such that O_{j,i} = 1 if c_i correctly classifies the pattern X_j, and 0 otherwise. The Q statistic allows us to evaluate the diversity of two classifiers c_i and c_k:

Q_{i,k} = (N^{11} N^{00} − N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})    (4)

where N^{ab} is the number of elements X_j of X for which O_{j,i} = a and O_{j,k} = b (M = N^{00} + N^{01} + N^{10} + N^{11}). Q varies between −1 and 1. Classifiers that tend to classify the same patterns correctly, that is, positively correlated classifiers, will have positive values of Q. Classifiers that make errors on different patterns will have negative values of Q. For statistically independent classifiers, Q_{i,k} = 0. The average Q computed over all possible classifier couples is used to evaluate the diversity of a classifier ensemble.
Giacinto and Roli 7 proposed a simple diversity measure, named "compound diversity", or simply CD, based on the compound error probability for two classifiers c_i and c_j:

CD = 1 − prob(c_i fails, c_j fails).    (5)
As with Q, the average CD computed over all the possible classifier couples is used to evaluate the diversity of a classifier ensemble. Giacinto and Roli also proposed a measure of the degree of error diversity between two classifier ensembles, A and B:

diversity(A, B) = max_{c_i ∈ A, c_j ∈ B} {CD(c_i, c_j)}.    (6)
It should be noted that the above measures, as well as those proposed by Partridge and Yates, are based on similar concepts. Since none of the above measures is demonstrably superior, we have used them all in Sections 4.3 and 4.4, and compared their performances in Section 5.

4.3. Methods Based on Search Algorithms
It is easy to see that search algorithms are the most natural way of implementing the choice phase required by the overproduce and choose design paradigm. Sharkey et al. 25 proposed an exhaustive search algorithm based on the assumption that the number of candidate classifier ensembles is small. In order to avoid the computational burden of exhaustive search, we developed three choice techniques based on search algorithms that were originally developed for feature selection purposes (forward search and backward search 3 ), and for the solution of complex optimization tasks (tabu search 10 ). All these search algorithms use an evaluation function to assess the effectiveness of candidate ensembles. The diversity measures above and the classification accuracy assessed by the majority voting rule have been used as evaluation functions. It should be noted that the following search algorithms avoid exhaustive enumeration, but the selection of the optimal classifier ensemble is not guaranteed.

Forward search

In order to illustrate our search algorithms, let us use a simple example in which the set C created by the overproduction phase is made up of four
Design of Multiple Classifier Systems

Fig. 3. An example of forward search for classifier choice.
classifiers (Fig. 3). The choice phase based on the forward search algorithm starts by creating an ensemble made up of a single classifier (the classifier c
Fig. 4. An example of backward search for classifier choice.
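The greedy forward search illustrated in Fig. 3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `evaluate` stands for any of the evaluation functions mentioned above (a diversity measure or the majority-voting accuracy on a validation set), and the stopping rule shown (stop when no addition improves the evaluation function) is one of several possible criteria. The backward variant is symmetric, starting from the full set and removing one classifier at a time.

```python
def forward_search(n_classifiers, evaluate, max_size=None):
    """Greedy forward selection of ensemble members.

    `evaluate` maps a list of classifier indices to a score to be
    maximized (e.g. a diversity measure or validation accuracy)."""
    remaining = set(range(n_classifiers))
    # start from the singleton ensemble with the best evaluation value
    best = max(remaining, key=lambda i: evaluate([i]))
    ensemble, score = [best], evaluate([best])
    remaining.discard(best)
    while remaining and (max_size is None or len(ensemble) < max_size):
        cand = max(remaining, key=lambda i: evaluate(ensemble + [i]))
        cand_score = evaluate(ensemble + [cand])
        if cand_score <= score:
            break  # no improvement: the search stops (not exhaustive)
        ensemble.append(cand)
        remaining.discard(cand)
        score = cand_score
    return ensemble, score
```

Because only one classifier is added per step, the search explores O(L^2) candidate ensembles instead of the 2^L required by exhaustive enumeration, which is the motivation stated above.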
a non-monotonic behavior, it can be effective to continue the search process even if the evaluation function is decreasing. Tabu search is based on this concept. In addition, it implements both forward and backward search strategies. The search starts from the full classifier set. At each step, new subsets are created by adding or eliminating one classifier. Then, the subset that shows the highest evaluation function value is selected to create new subsets. It should be pointed out that this subset is selected even if its evaluation function value is lower than in the previous step. In order to avoid creating the same subsets in successive search steps (i.e., in order to avoid "cycles" in the search process), a classifier that has been added or eliminated cannot be inserted/deleted again for a certain number of search steps. Different stop criteria can be used. For example, the search can stop after a certain number of steps, and the best subset created during the search process is returned.

4.4. A Method Based on Clustering of Classifiers
We have developed a choice phase approach that identifies an effective subset of classifiers with a limited computational effort. This approach is based on a few hypotheses on the set C created by the overproduction phase (equations 7-9). Let us assume that this set C is made up of the union of M subsets C_i:

C = C_1 ∪ C_2 ∪ ... ∪ C_M (7)

where the subsets C_i meet the following assumptions:

∀ i, j, i ≠ j: C_i ∩ C_j = ∅ (8)
and the classifiers forming the above subsets satisfy the following conditions:

∀ c_l, c_m, c_n, i ≠ j, c_l, c_m ∈ C_i, c_n ∈ C_j:
prob(c_l fails, c_m fails) > prob(c_l fails, c_n fails). (9)
In the above equation, the terms prob(c_l fails, c_m fails) and prob(c_l fails, c_n fails) are the compound error probabilities of the related classifier couples. Equation 9 states that the compound error probability between any two classifiers belonging to the same subset is greater than that between any two classifiers belonging to different subsets. It is easy to see that such a condition provides a useful guide for the choice of MCS members. In fact, according to equation 9, effective members can be extracted from different subsets C_i. It is clear that the more correlated the classifier errors are within the same subset, and the more independent the classifier errors are in different subsets, the more effective this strategy will be. The extent to which our hypotheses (equations 7-9) can be deemed realistic, and the proposed approach effective, is discussed in Ref. 7. Under the hypotheses of equations 7-9, we define a choice phase made up of the following steps:

• identification of the subsets C_i by clustering of classifiers;
• extraction of the classifiers from the above subsets in order to create an effective classifier ensemble C*.

Clustering of classifiers for subset identification

This step is implemented by the hierarchical agglomerative clustering algorithm.14 The classifiers belonging to set C play the role of the "data", and the subsets C_i represent the data "clusters". The compound error probability among couples of classifiers plays the role of the distance measure used in data clustering. In particular, we define a distance measure between two classifiers based on compound error diversity (Section 4.2):

∀ c_s, c_t ∈ C: d(c_s, c_t) = 1 - prob(c_s fails, c_t fails). (10)
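Equations 10 and 11 together amount to a complete-linkage agglomerative clustering of the classifiers, since the inter-cluster distance of equation 11 is exactly the complete-linkage rule. A minimal pure-Python sketch of this step, under the assumption that each classifier is represented by a boolean vector of its failures on the validation set (the function names are ours):

```python
def cluster_classifiers(errors, n_clusters):
    """Agglomerative clustering of classifiers with the complete-linkage
    rule of equation 11 and the pairwise distance of equation 10.
    `errors` is a list of equal-length boolean error vectors."""
    def pair_dist(ei, ej):
        both = sum(a and b for a, b in zip(ei, ej))
        return 1.0 - both / len(ei)  # equation 10, estimated on validation data

    def cluster_dist(a, b):  # equation 11: max distance over classifier pairs
        return max(pair_dist(errors[i], errors[j]) for i in a for j in b)

    clusters = [[i] for i in range(len(errors))]
    while len(clusters) > n_clusters:
        # merge the two clusters at minimum complete-linkage distance
        a, b = min(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]))
        clusters[a].extend(clusters.pop(b))
    return clusters
```

Classifiers whose errors frequently coincide have a small distance and are merged first, which matches the intended grouping described below.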
According to equation 10, the further apart two classifiers are, the fewer the coincident errors between them. Therefore, the above distance measure assigns classifiers that make a large number of coincident errors to the same cluster, while it assigns classifiers that make few coincident errors to
different clusters. It is easy to see that, in order to perform such clustering of classifiers, a distance measure between two clusters of classifiers is also necessary. We defined the distance between any two clusters C_i and C_j as the maximum distance between two classifiers belonging to such clusters:

∀ C_i, C_j, i ≠ j: d(C_i, C_j) = max_{c_s ∈ C_i, c_t ∈ C_j} {d(c_s, c_t)}. (11)

It is worth noting that, to avoid "overfitting" problems, our method computes all the above distance measures with a validation set.

Creation of subset C*

At each iteration of the clustering algorithm, a candidate ensemble C* = {c*_1, c*_2, ..., c*_M} is created by taking one classifier from each of the M clusters. In particular, for each cluster, the classifier with the maximum average distance from all the other clusters is chosen. The distance between a classifier and a cluster is computed according to equation 11. For each candidate ensemble C*, the M classifiers are then combined by majority voting, and the classification accuracy is computed on a validation data set. Finally, the performances of all the ensembles created during the clustering process are compared, and the one with the highest performance is chosen. It is worth noting that N candidate ensembles are created during the clustering process. Therefore, our approach shows low computational complexity. Further details on this design method can be found in Refs. 6 and 7.

5. Experimental Results

5.1. Data Sets
The following data sets were used in our experiments:

• the Feltwell data set, generated from multisensor remote-sensing image data;
• the Phoneme_CR data set and the Satimage_CR data set, contained in the ELENA database.

The Feltwell data set consists of a set of multisensor remote-sensing images related to an agricultural area near the village of Feltwell (UK). A section (250 x 350 pixels) of a scene acquired by an optical sensor (an Airborne Thematic Mapper scanner) and a radar sensor (a NASA/JPL synthetic aperture radar) was used. Our experiments characterized each
pixel by a fifteen-element feature vector containing brightness values in the six optical bands and the nine radar channels. We selected 10944 pixels belonging to five agricultural classes (i.e., sugar beet, stubble, bare soil, potatoes, and carrots), and randomly subdivided them into a training set (5124 pixels), a validation set (528 pixels), and a test set (5238 pixels). We used a small validation set to simulate real cases where validation data are difficult to obtain. A detailed description of this data set can be found in Refs. 8 and 23. The ELENA (Enhanced Learning for Evolutive Neural Architecture) database consists of various data sets designed for testing and benchmarking classification algorithms. We used two of these data sets: the Phoneme_CR data set and the Satimage_CR data set. The Phoneme_CR data set consists of 5404 phonemes belonging to two classes: nasal and oral vowels. Five numerical attributes are used to characterize each vowel. The Satimage_CR data set was generated from Landsat Multi-Spectral-Scanner satellite image data. This data set contains 6435 patterns belonging to six land cover classes. Each pattern is characterised by 36 numerical attributes. For each data set, the data were randomly subdivided into a training set (50% of the whole data set), a validation set (20%), and a test set (30%). Further details on these data sets can be obtained via anonymous ftp at ftp.dice.ucl.ac.be in the directory pub/neuralnets/ELENA/databases.

5.2. Experiments with the Feltwell Data Set
Our experiments were mainly aimed at:

• assessing the performances of the proposed design methods (Sections 4.3 and 4.4);
• comparing our methods with other design methods proposed in the literature (Section 4.1).

To this end, we performed a number of different overproduction phases, thus creating different initial ensembles C (see equation 1). Such sets were created using different classifier types, namely, Multilayer Perceptrons (MLPs), Radial Basis Function (RBF) neural networks, Probabilistic Neural Networks (PNNs), and the k-nearest neighbor classifier (k-nn). For each classifier type, ensembles were created by varying some design parameters (e.g., the network architecture, the initial random weights, the value of the
k parameter for the k-nn classifier, and so on). In the following, we report the results related to three initial sets C, here referred to as sets C1, C2, and C3, generated by distinct overproduction phases:

• set C1 contains fifty MLPs. Five architectures, with one or two hidden layers and different numbers of neurons per layer, were used. For each architecture, ten training phases with different initial weights were performed. All the networks had fifteen input units and five output units, corresponding to the input features and data classes, respectively (see Section 5.1);
• set C2 contains the same MLPs belonging to C1 and fourteen k-nn classifiers. The k-nn classifiers were obtained by varying the value of the k parameter in the following two ranges: (15, 17, 19, 21, 23, 25, 27) and (75, 77, 79, 81, 83, 85, 87);
• set C3 contains thirty MLPs, three k-nn classifiers, three RBF neural networks, and one PNN. For the RBF neural networks, three different architectures were used.

Experiments with set C1

First, we evaluated the performance of the whole set C1, of the best classifier in the ensemble, and of the ensembles designed by the two methods based on heuristic rules (see Section 4.1). These performances are reported in Table 1 in terms of accuracy, rejection rates, and differences between accuracy and rejection values. The sizes of the selected ensembles are also shown. The classifiers were always combined by the majority-voting rule. A pattern was rejected when the classifiers assigning it to the same data class were not the majority. All the values reported in Table 1 refer to the test set. For the method named "choose the best" (indicated with the term "Best" in Table 1), the performances of ensembles of sizes ranging from 3 through 15 were assessed.
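The majority-voting rule with rejection used throughout the experiments can be sketched as follows (an illustrative sketch; the function name and the `None` reject marker are our assumptions): a pattern is accepted only when the most-voted class is supported by a strict majority of the classifiers.

```python
from collections import Counter

def majority_vote(labels, reject=None):
    """Combine the class labels assigned by the ensemble members.

    Returns the winning class, or `reject` when the classifiers
    voting for the most-voted class are not the majority."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner if votes > len(labels) / 2 else reject
```

For example, three classifiers voting (1, 1, 2) yield class 1, while (1, 2, 3) yields a rejection, which is how the rejection rates in the tables below arise.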
The size of the ensemble designed by the method named "choose the best in the class" (indicated with the term "Best-class") is five, because five types of classifiers (namely, five types of net architectures) were used to create the ensemble C1 (Section 4.1). For each ensemble, in order to show the degree of error diversity among the classifiers (equation 2), the value of the Generalisation Diversity measure (GD) is reported. Table 1 shows that the design methods based on heuristic rules can improve on the accuracy of the initial ensemble C1, and on that of the best
Table 1. Performances of the whole set C1, of the best classifier in the ensemble, and of the ensembles designed by the two methods based on heuristic rules, namely, the "choose the best" method (indicated with the term "Best") and the "choose the best in the class" method (indicated with the term "Best-class"). For the method "choose the best", the performances of ensembles of sizes ranging from 3 through 15 were assessed. For each ensemble, the value of the GD measure is also reported. The sizes of the selected ensembles are shown.

Ensemble          Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1     50   89.8357   1.2027     88.6330  0.2948
Best classifier     1   89.2516   0.0000     89.2516  N/A
Best                3   90.2565   0.2673     89.9892  0.2399
Best                5   90.4278   0.4773     89.9505  0.1937
Best                7   90.0134   0.4009     89.6125  0.1801
Best                9   90.0459   0.2673     89.7786  0.1783
Best               11   90.0747   0.3627     89.7120  0.1935
Best               13   89.9732   0.2291     89.7441  0.2008
Best               15   89.9712   0.4391     89.5321  0.2063
Best-class          5   89.9847   0.4964     89.4883  0.2617
Table 2. Performances of the ensembles generated by the design method based on the backward search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method        Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1        50   89.8357   1.2027     88.6330  0.2948
Best classifier        1   89.2516   0.0000     89.2516  N/A
Backward (GD)          3   89.9981   1.6991     88.2990  0.4752
Backward (CD)          3   90.4890   0.8400     89.6490  0.3573
Backward (Accuracy)   45   89.8517   0.8591     88.9926  0.2950
Backward (Q)           3   88.6901   0.7446     87.9455  0.4129
classifier. However, such improvements are slight. It should be noted that these design methods do not provide improvements in terms of error diversity as assessed by the GD measure. This can be explained by observing that such methods select classifiers on the basis of accuracy, and they do not take error diversity explicitly into account. Tables 2, 3, and 4 report results obtained by design methods based on search algorithms (Section 4.3). The classifiers were always combined by the majority-voting rule. A pattern was rejected when the classifiers
Table 3. Performances of the ensembles generated by the design method based on the forward search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported. It should be noted that the search process was carried out starting either from the best classifier or from a randomly selected classifier. The rows from the third to the sixth report results obtained starting from the best classifier. The last four rows show results obtained starting from a randomly selected classifier.

Choice Method        Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1        50   89.8357   1.2027     88.6330  0.2948
Best classifier        1   89.2516   0.0000     89.2516  N/A
Forward (GD)           3   89.5669   1.6991     88.2990  0.4752
Forward (CD)           3   90.0499   0.8400     89.6490  0.3573
Forward (Accuracy)    11   90.3965   0.8591     88.9926  0.2950
Forward (Q)            3   88.2387   0.7446     87.9455  0.4129
Forward (GD)           3   89.0993   1.2218     87.8775  0.3958
Forward (CD)           7   90.2866   0.7446     89.5420  0.3346
Forward (Accuracy)     7   90.2420   0.6109     89.6311  0.2589
Forward (Q)            3   87.0609   0.5536     86.5073  0.3845
Table 4. Performances of the ensembles generated by the design method based on the tabu search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method     Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1     50   89.8357   1.2027     88.6330  0.2948
Best classifier     1   89.2516   0.0000     89.2516  N/A
Tabu (GD)           3   89.9459   1.2600     88.6859  0.4613
Tabu (CD)           3   90.1425   0.8400     89.3025  0.3806
Tabu (Accuracy)     9   90.1156   0.9164     89.1992  0.3416
Tabu (Q)            3   89.8180   1.3746     88.4434  0.4826
assigning it to the same data class were not the majority. It should be noted that these design methods improve error diversity, that is, the ensembles are characterised by GD values higher than the ones reported in Table 1. However, the improvements in accuracy with respect to the initial ensemble C1 and the best classifier are similar to those provided by the methods based on heuristic rules.
Table 5. Performances of the ensembles generated by the design method based on classifier clustering. Different diversity measures were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method    Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1    50   89.8357   1.2027     88.6330  0.2948
Best classifier    1   89.2516   0.0000     89.2516  N/A
Cluster (CD)       7   90.5294   0.8209     89.7085  0.3193
Cluster (Q)       49   89.6592   0.8591     88.8001  0.2962
Cluster (GD)       9   89.6179   1.0691     88.5488  0.3788
Table 5 shows the results obtained by our method based on classifier clustering (Section 4.4). Conclusions similar to those for the design methods based on search algorithms can be drawn. It is worth noting that the performances of the various design methods are slightly better than those of the initial ensemble C1 and of the best classifier, but the differences are small. However, it should be noted that the methods based on search algorithms and classifier clustering improve classifier error diversity.

Experiments with sets C2 and C3

The same experiments previously described for set C1 were performed for sets C2 and C3. For the sake of brevity, for each design method, we report the average performances in terms of accuracy and error diversity values. Table 6 shows the average accuracy values of the different design methods applied to sets C2 and C3. Table 7 shows the average error diversity values of the different design methods applied to sets C2 and C3. With regard to the experiments performed on sets C2 and C3, the performances of the various design methods are close to, or better than, those of the initial ensembles and of the best classifier. Significant improvements were obtained for some experiments performed on set C2.

5.3. Experiments with the Satimage_CR Data Set
As with the Feltwell data set, we performed different overproduction phases, thus creating different initial ensembles C. Such sets were created using different classifier types. For each classifier type, ensembles were created by
Table 6. For each design method, the average percentage accuracy value is reported for the experiment with set C2 and for the experiment with set C3.

Choice Method                  Set C2   Set C3
Initial set                    90.4918  89.4645
Best classifier                90.0916  88.2016
Choose the best                90.1090  91.1097
Choose the best in the class   89.9847  92.0613
Backward                       89.8945  92.3871
Forward from the best          89.9024  93.2471
Forward                        89.7408  91.5023
Tabu                           90.0931  93.5092
Clustering                     89.9013  92.1911
Table 7. For each design method, the average error diversity value is reported for the experiment with set C2 and for the experiment with set C3.

Choice Method                  Set C2  Set C3
Initial set                    0.3170  0.3819
Choose the best                0.1989  0.3279
Choose the best in the class   0.2617  0.4905
Backward                       0.3488  0.5851
Forward from the best          0.3400  0.5917
Forward                        0.3270  0.5969
Tabu                           0.3631  0.6225
Clustering                     0.3410  0.5383
varying some design parameters (e.g., the network architecture, the initial random weights, the value of the k parameter for the k-nn classifier, and so on). In the following, we report the average results relating to three initial sets C, here referred to as sets C1, C2, and C3, generated by distinct overproduction phases:

• set C1 contains ninety-two classifiers: sixty MLPs, twenty-six k-nearest neighbor classifiers, five RBF nets, and one PNN;
• set C2 contains sixty MLPs;
• set C3 contains thirty MLPs, five k-nn classifiers, five RBF neural networks, and one PNN.
Table 8. For each design method, the average percentage accuracy values are reported for the three experiments with sets C1, C2, and C3 applied to the Satimage_CR data set.

Choice Method                  Set C1   Set C2   Set C3
Initial set                    91.1734  91.4815  91.1780
Best classifier                90.4145  88.9637  88.9119
Choose the best                91.0452  91.4399  91.1250
Choose the best in the class   91.8678  91.2382  90.7943
Backward                       91.0552  90.9133  90.9714
Forward from the best          91.5933  91.0161  91.1590
Forward                        91.3593  91.1110  90.9963
Tabu                           91.3552  91.0325  91.1627
Clustering                     91.7811  91.4740  91.9537
Table 9. For each design method, the average error diversity values are reported for the three experiments with sets C1, C2, and C3 applied to the Satimage_CR data set.

Choice Method                  Set C1  Set C2  Set C3
Initial set                    0.3284  0.3218  0.3412
Choose the best                0.2829  0.3266  0.3385
Choose the best in the class   0.3472  0.3143  0.3426
Backward                       0.3715  0.3384  0.3616
Forward from the best          0.3842  0.3431  0.3860
Forward                        0.3957  0.3472  0.3782
Tabu                           0.3949  0.3518  0.3878
Clustering                     0.3834  0.3575  0.3722
Table 8 shows the average accuracy values of the different design methods applied to sets C1, C2, and C3 created for the Satimage_CR data set. Table 9 shows the average error diversity values of the different design methods applied to sets C1, C2, and C3 created for the Satimage_CR data set.
5.4. Experiments with the Phoneme_CR Data Set

The same initial sets described in Section 5.3 were used for the experiments with the Phoneme_CR data set. Table 10 shows the average accuracy values of the different design methods applied to sets C1, C2, and C3. Table 11 shows the average error diversity values of the different design methods applied to sets C1, C2, and C3.
Table 10. For each design method, the average percentage accuracy values are reported for the three experiments with sets C1, C2, and C3 applied to the Phoneme_CR data set.

Choice Method                  Set C1   Set C2   Set C3
Initial set                    85.2195  85.0560  84.2690
Best classifier                87.2918  85.3177  85.8112
Choose the best                87.3887  86.4105  86.7895
Choose the best in the class   86.6132  85.2560  84.7625
Backward                       85.3717  84.6468  84.9553
Forward from the best          86.9756  85.3948  85.7496
Forward                        86.5361  84.9090  85.4180
Tabu                           85.6416  85.3563  85.8344
Clustering                     86.7058  86.0117  85.9192
Table 11. For each design method, the average error diversity values are reported for the three experiments with sets C1, C2, and C3 applied to the Phoneme_CR data set.

Choice Method                  Set C1  Set C2  Set C3
Initial set                    0.3989  0.3995  0.4041
Choose the best                0.3400  0.3363  0.4018
Choose the best in the class   0.4593  0.4143  0.4184
Backward                       0.4600  0.4453  0.4606
Forward from the best          0.4907  0.4270  0.4883
Forward                        0.4939  0.4348  0.4751
Tabu                           0.5005  0.4442  0.4798
Clustering                     0.4550  0.4252  0.4870
6. Discussion and Conclusions

Although final conclusions cannot be drawn on the basis of this limited set of experiments, the following observations can be made:

• the design methods based on the overproduce and choose paradigm allow us to create small ensembles of classifiers whose performances are close to, or higher than, those of the large ensembles C created by the overproduction phases;
• in some experiments, the performances of the selected ensemble were significantly higher than those of the initial ensemble C;
• in some experiments, the performances of the selected ensemble were higher than those of the best classifier in the ensemble C;
• in most experiments, it was possible to generate small classifier ensembles without a significant loss of classification accuracy;
• no choice method is demonstrably superior, because the superiority of one over another depends on the classification task at hand.

On the basis of the above observations, some preliminary conclusions can be drawn:

• the overproduce and choose paradigm does not guarantee optimal MCS design for the classification task at hand. Accordingly, optimal MCS design is still an open issue;
• the main motivation behind the use of the overproduce and choose paradigm is that, at present, clear guidelines for choosing the best design method for the classification task at hand are lacking;
• thanks to this design paradigm, it is possible to exploit the large set of tools developed to generate and combine classifiers. The designer can create a myriad of different MCSs by coupling different techniques to create classifier ensembles with different combination functions. Then, the most appropriate MCS can be selected by performance evaluation. It is worth noting that this approach is commonly used in engineering fields where optimal design methods are not available (e.g., software engineering);
• the overproduce and choose paradigm allows us to create MCSs made up of small sets of classifiers. This is a very important feature for practical applications.

To sum up, in this chapter we discussed the problem of MCS design in order to provide the reader with a critical survey of the state of the art. We also proposed a formulation of the MCS design problem that provides motivations for the different design methods proposed in the literature. In particular, our formulation pointed out the rationale behind the overproduce and choose design paradigm. We described and experimentally assessed different design methods based on this paradigm.
Although these design methods show interesting features, they do not guarantee optimal MCS design for the classification task at hand. Accordingly, the main conclusion of this chapter is that optimal MCS design is still an open issue. In conclusion, it is worth remarking that MCS design shows analogies with the classical pattern recognition system design based on a single classifier. We pointed out some of these analogies in Section 3. Based on this work, it seems to the authors that the problem with both designs is the
current lack of clear guidelines for choosing the best design method for the classification task at hand. In the case of single-classifier systems, a possible solution could be the combination of different classifiers. With MCSs, the solution could be to combine different MCSs. However, we think that the overproduce and choose design paradigm provides a more practical and effective solution, because the combination of different MCSs is computationally expensive. Future work should clearly address the problem of defining guidelines in order to:

• assess when the use of MCSs can improve accuracy compared to the use of individual classifiers;
• choose the best MCS design method for the classification task at hand.

We also believe that, since it is difficult to define clear guidelines for all practical situations, the overproduce and choose design paradigm may also be useful for future MCS designers.

Acknowledgments

This work was supported by the Italian Space Agency, within the framework of the project "Metodologie innovative di integrazione, gestione, analisi di dati da sensori spaziali per l'osservazione della idrosfera, dei fenomeni di precipitazione e del suolo" (Innovative methods of integration, management, and analysis of data from space sensors for the observation of the hydrosphere, rainfall, and soil).
References

1. L. Breiman, "Bagging Predictors", Machine Learning 24, 123-140 (1996).
2. T. G. Dietterich, "Ensemble methods in machine learning", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 1-15 (Springer-Verlag, 2000).
3. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (John Wiley & Sons, 2000).
4. Y. Freund and R. Schapire, "Experiments with a new boosting algorithm", Proc. of the Thirteenth Int. Conf. on Machine Learning, 148-156 (1996).
5. G. Giacinto and F. Roli, "Dynamic Classifier Selection based on Multiple Classifier Behaviour", Pattern Recognition 34, 179-181 (2001).
6. G. Giacinto and F. Roli, "Design of effective neural network ensembles for image classification purposes", Image and Vision Computing Journal 19, 697-705 (2001).
7. G. Giacinto and F. Roli, "An approach to the automatic design of multiple classifier systems", Pattern Recognition Letters 22, 25-33 (2001).
8. G. Giacinto, F. Roli and L. Bruzzone, "Combination of Neural and Statistical Algorithms for Supervised Classification of Remote-Sensing Images", Pattern Recognition Letters 21, 385-397 (2000).
9. G. Giacinto, F. Roli and G. Fumera, "Selection of Image Classifiers", Electronics Letters 36, 420-422 (2000).
10. F. Glover and M. Laguna, Tabu Search (Kluwer Academic Publishers, 1997).
11. T. K. Ho, "The random subspace method for constructing decision forests", IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 832-844 (1998).
12. T. K. Ho, "Complexity of classification problems and comparative advantages of combined classifiers", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 97-106 (Springer-Verlag, 2000).
13. Y. S. Huang and C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals", IEEE Trans. on Pattern Analysis and Machine Intelligence 17, 90-94 (1995).
14. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice Hall, 1988).
15. A. Kandel and G. Langholz, Hybrid Architectures for Intelligent Systems (CRC Press, 1992).
16. H. Kang and S. Lee, "An information-theoretic strategy for constructing multiple classifier systems", Proc. 15th Int. Conference on Pattern Recognition, Barcelona, Spain, 2, 483-486 (2000).
17. J. Kim, K. Seo and K. Chung, "A systematic approach to classifier selection on combining multiple classifiers for handwritten digit recognition", Proc. Int. Conf. on Document Analysis and Recognition, 459-462 (1997).
18. J. Kittler, "A framework for classifier fusion: is it still needed?", in Advances in Pattern Recognition, LNCS 1876, F. J. Ferri, J. M. Inesta, A. Amin and P. Pudil, Eds., 45-56 (Springer-Verlag, 2000).
19. J. Kittler and F. Roli, Eds., Multiple Classifier Systems, LNCS 1857, 404 pp. (Springer-Verlag, 2000).
20. L. I. Kuncheva, "Combinations of multiple classifiers using fuzzy sets", in Fuzzy Classifier Design, 233-267 (Springer-Verlag, 2000).
21. L. I. Kuncheva, C. A. Whitaker, C. A. Shipp and R. P. W. Duin, "Is independence good for combining classifiers?", Proc. of 15th Int. Conference on Pattern Recognition, Barcelona, Spain, 2, 168-171 (2000).
22. D. Partridge and W. B. Yates, "Engineering multiversion neural-net systems", Neural Computation 8, 869-893 (1996).
23. F. Roli, S. B. Serpico and G. Vernazza, "A hybrid system for 2D image recognition", Proceedings of the IEEE, Special Issue on "Signals and Symbols" 84, 1659-1681 (1996).
24. A. J. C. Sharkey, "Multi-Net Systems", in Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, 1-27 (Springer-Verlag, 1999).
25. A. J. C. Sharkey, N. E. Sharkey, U. Gerecke and G. O. Chandroth, "The 'test and select' approach to ensemble combination", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 30-44 (Springer-Verlag, 2000).
26. W. Wang, P. Jones and D. Partridge, "Diversity between neural networks and decision trees for building multiple classifier systems", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 240-249 (Springer-Verlag, 2000).
27. D. H. Wolpert, "Stacked generalisation", Neural Networks 5, 241-259 (1992).
28. L. Xu, A. Krzyzak and C. Y. Suen, "Methods for combining multiple classifiers and their applications to handwriting recognition", IEEE Trans. on Systems, Man, and Cybernetics 22, 418-435 (1992).
CHAPTER 9

FUSING NEURAL NETWORKS THROUGH FUZZY INTEGRATION
A. Verikas Intelligent Systems Laboratory, Halmstad University Box 823, S-301 18, Halmstad, Sweden Department of Applied Electronics, Kaunas University of Technology 3031 Kaunas, Lithuania E-mail: [email protected]
A. Lipnickas and M. Bacauskiene Department of Applied Electronics, Kaunas University of Technology 3031 Kaunas, Lithuania E-mail: [email protected], [email protected]
K. Malmqvist Intelligent Systems Laboratory, Halmstad University Box 823, S-301 18, Halmstad, Sweden E-mail: [email protected]
To improve recognition results, decisions of multiple neural networks can be aggregated into a committee decision. An efficient committee should consist of networks that are not only very accurate, but also diverse in the sense that the network errors occur in different regions of the input space. One issue investigated in this chapter is the effectiveness of the half & half sampling approach in creating accurate neural network committees fused through the Choquet integral. The second issue investigated is the influence of different types of fuzzy measures on the classification accuracy obtained from neural network committees fused by a fuzzy integral. The ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure are investigated. Parameters determining the fuzzy measures are learned from a training data set. Learning the fuzzy measures through minimization
A. Verikas et al.
228
of the classification error rate resulted in a significant improvement of the accuracy of the committees as compared to learning the measures by minimizing the sum-of-squared error.
1. Introduction

It is well known that a combination of many different neural networks can improve classification accuracy. A variety of schemes have been proposed for combining multiple classifiers. The approaches used most often include the majority vote,[2,28,37,62] averaging,[23,45,55,57] weighted averaging,[22,24,27,41,48,60] the Bayesian approach,[37,63] the fuzzy integral,[9,10,17,18,21,43,53,56,60] the Dempster-Shafer theory,[3,12,50,63] the Borda count,[25] aggregation through order statistics,[8,58,59] probabilistic aggregation,[31,32,33] fuzzy templates,[35,36] and aggregation by a neural network.[7,26,29] In this study, we consider networks that use a 1-of-N coding scheme for outputs.

In practice, outputs from multiple neural networks are usually highly correlated. It is therefore desirable to assign weights not only to individual networks, but also to groups of networks, so as to express the correlations between different networks. Aggregation based on fuzzy integrals possesses this valuable property. In such schemes, outputs of different networks are fused into a final decision by a fuzzy integral with respect to a fuzzy measure. However, to utilize this property, we need to construct fuzzy measures that express the actual interaction among networks with respect to classification performance. The fuzzy measures represent the weights on each network and also the weights on each group of networks. Most often, a separate fuzzy measure is defined for each decision class; the number of weights (coefficients) defining the fuzzy measures is therefore Q·2^L, where Q is the number of decision classes and L is the number of networks. Even for moderate values of L, the number of coefficients is very large.

This large number of coefficients makes procedures for learning the fuzzy measures from examples very expensive in time and memory. Moreover, since the fuzzy measures are monotonic set functions, the Q·2^L coefficients must satisfy Q·L·2^(L-1) constraints, so only constrained optimization techniques can be applied. To get around the problem, a simplified version of the fuzzy integral with a single fuzzy measure for all the decision classes is used, or some simplified type of fuzzy measure (such as the λ-fuzzy measure or the
2-order additive fuzzy measure) is applied. To cope with the computation and memory problems, we use one fuzzy measure that is common to all decision classes.

One issue investigated in this chapter is the influence of different types of fuzzy measures on the classification accuracy obtained from neural network committees fused by a fuzzy integral. The ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure are investigated.

Numerous previous works on neural network committees have shown that an efficient committee should consist of networks that are not only very accurate, but also diverse in the sense that the network errors occur in different regions of the input space.[39,47,62] Bootstrapping,[4,14,24,65] Boosting,[1,13,62] and AdaBoosting[16,49,51,52] are the most often used approaches for data sampling when training members of neural network committees. It has recently been shown that half & half bagging with majority voting is capable of creating very accurate committees of decision trees.[6] The second issue investigated in this chapter is the effectiveness of the half & half sampling approach in creating accurate neural network committees fused through the Choquet integral.

2. Half & Half Bagging

It has been demonstrated that the AdaBoost algorithm[16] generates classifiers with a low generalization error.[13,15,52] AdaBoost is, however, a complex algorithm. Breiman has recently proposed a very simple alternative, the so-called half & half bagging approach.[6] When tested on decision trees, the approach was competitive with the AdaBoost algorithm.

The basic idea of half & half bagging is very simple. It is assumed that the training set contains N data points. Suppose that k classifiers have already been constructed. To obtain the next training set, randomly select a data point x and present x to the subset of the k classifiers that did not use x in their training sets.
Use the majority vote of that subset of classifiers to predict the class of x. If x is misclassified, put it in the set MC; otherwise, put it in the set CC. Stop when the sizes of both MC and CC are equal to M, where 2M < N. In Ref. 6, M = N/4 has been used. The next training set is the union of the sets MC and CC. In this chapter, we investigate the effectiveness of the half & half bagging approach in the fuzzy-integral-based fusion of neural networks for classification, and we compare the approach with the bootstrapping technique.
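The sampling round described above can be sketched in Python as follows. This is only an illustration of the scheme, not Breiman's code: the function and variable names are ours, and the committee is assumed, for simplicity, to be a list of (training-set, predict) pairs.

```python
import random
from collections import Counter

def half_and_half_sample(n, labels, committee, m, seed=0):
    """One round of half & half sampling (a sketch of Breiman's scheme).

    n         -- number of training points (indices 0..n-1)
    labels    -- labels[i] is the true class of point i
    committee -- list of (train_set, predict) pairs for the classifiers
                 built so far; predict(i) returns a class label for point i
    m         -- target size of both MC (misclassified) and CC (correct)

    Returns the next training set: the union of MC and CC (size 2m).
    """
    rng = random.Random(seed)
    mc, cc = set(), set()
    while len(mc) < m or len(cc) < m:
        i = rng.randrange(n)
        # Out-of-sample subset: classifiers that did not train on point i.
        votes = [predict(i) for train_set, predict in committee
                 if i not in train_set]
        if not votes:
            continue  # every classifier has seen this point; resample
        # Majority vote of the out-of-sample subset.
        predicted = Counter(votes).most_common(1)[0][0]
        if predicted != labels[i]:
            if len(mc) < m:
                mc.add(i)
        elif len(cc) < m:
            cc.add(i)
    return mc | cc
```

Note that the loop only terminates when at least m misclassified and m correctly classified points exist; a production version would bound the number of draws.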
3. Fuzzy Measure and Fuzzy Integral

Let Z be a non-empty finite set.

Definition 1. A set function g : 2^Z → [0,1] is a fuzzy measure if

1) g(∅) = 0; g(Z) = 1,
2) if A, B ⊆ Z and A ⊆ B, then g(A) ≤ g(B),
3) if A_n ⊆ Z for 1 ≤ n < ∞ and the sequence {A_n} is monotonic in the sense of inclusion, then lim_{n→∞} g(A_n) = g(lim_{n→∞} A_n).

Definition 2. Let g be a fuzzy measure on Z. The discrete Choquet integral of a function h : Z → R+ with respect to g is defined as

    C_g(h(z_1), ..., h(z_L)) = Σ_{i=1}^{L} [h(z_i) − h(z_{i−1})] g(A_i)    (1)

where the indices i have been permuted so that 0 ≤ h(z_1) ≤ ... ≤ h(z_L) ≤ 1, A_i = {z_i, ..., z_L}, h(z_0) = 0, and L is the number of elements in the set Z.[19]

4. The Fuzzy Measures Used

We investigate four types of fuzzy measures, namely the ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure.

The ordinary fuzzy measure (see Definition 1). Since g(∅) = 0 and g(Z) = 1, the ordinary fuzzy measure is defined by 2^L − 2 real coefficients, where L is the number of elements in the set Z.

The λ-fuzzy measure. In general, the ordinary fuzzy measure of a union of two disjoint subsets cannot be directly computed from the ordinary fuzzy measures of the subsets. Sugeno[54] introduced the so-called λ-fuzzy measure, which satisfies the additional property

    g(A ∪ B) = g(A) + g(B) + λ g(A) g(B)    (2)

for all A, B ⊆ Z with A ∩ B = ∅, and for some λ > −1. Let Z = {z_1, z_2, ..., z_L} be a finite set (a set of committee members in our case) and let g^i = g({z_i}). The values g^i are called the densities of the λ-fuzzy measure; the value of λ itself follows from the condition g(Z) = 1. When g is the λ-fuzzy measure, the values g(A_i), with A_i = {z_i, ..., z_L} as in Eq. 1, can be computed recursively:

    g(A_L) = g({z_L}) = g^L,
    g(A_i) = g^i + g(A_{i+1}) + λ g^i g(A_{i+1}),  for 1 ≤ i < L.
The cardinal fuzzy measure. The fuzzy measure can also be chosen so that the measure of a set depends only on the cardinality of the set[46]:

    g(A) = Σ_{j=0}^{i−1} w_{L−j},  ∀A such that |A| = i    (3)

where the w_i are coefficients. If the fuzzy measure is cardinal, the Choquet integral reduces to an ordered weighted averaging (OWA) operator[64] (see Section 5.5).

4.1. The 2-Order Additive Fuzzy Measure
A pseudo-Boolean function is a real-valued function f : {0,1}^L → R. A fuzzy measure can be viewed as a particular case of a pseudo-Boolean function, defined for any A ⊆ Z, such that A is equivalent to a point (z_1, ..., z_L) in {0,1}^L, where z_i = 1 iff z_i ∈ A.[20] It can be shown that any pseudo-Boolean function can be expressed as a multilinear polynomial in L variables

    f(z) = Σ_{T⊆Z} a(T) Π_{i∈T} z_i    (4)

with a(T) ∈ R and z ∈ {0,1}^L. The coefficients a(T), T ⊆ Z, can be interpreted as the Möbius transform of a set function.[42] If a measure is additive and expressed by the coefficients g^i, i = 1, ..., L, then the corresponding pseudo-Boolean function is linear: f(z) = Σ_{i=1}^{L} a_i z_i. Note that g^i = a_i. By extension, the k-order additive fuzzy measure, having a polynomial representation of degree k, can be defined.

Definition 3.[20] A fuzzy measure g defined on Z is said to be k-order additive if its corresponding pseudo-Boolean function is a multilinear polynomial of degree k, i.e. a(T) = 0 for all T such that |T| > k, and there exists at least one T of k elements such that a(T) ≠ 0.

For any K ⊆ Z with |K| ≥ 2, the 2-additive fuzzy measure is defined by

    g(K) = Σ_{i∈K} a_i + Σ_{{i,j}⊆K} a_{ij} = Σ_{{i,j}⊆K} g^{ij} − (|K| − 2) Σ_{i∈K} g^i    (5)

with |K| being the cardinality of K and g^{ij} = g({z_i, z_j}) = a_i + a_j + a_{ij} = g^i + g^j + a_{ij}. The 2-additive fuzzy measure is thus determined by the coefficients g^i and g^{ij}.
To construct the 2-additive fuzzy measure, only L(L + 1)/2 coefficients g^i and g^{ij}, i, j ∈ Z, have to be determined from training data. In order to obtain a monotonic fuzzy measure, the coefficients g^i and g^{ij} must satisfy particular conditions. The monotonicity constraints on the coefficients of the 2-additive fuzzy measure, expressed through g^i and g^{ij}, can be formulated as follows[43]:

    Σ_{j∈K} g^{ij} − Σ_{j∈K} g^j − (|K| − 1) g^i ≥ 0,  ∀i ∈ Z, K ⊆ Z \ {i}    (6)

where |Z| = L. To obtain a fuzzy measure normalized to the interval [0,1], the coefficients g^i and g^{ij} must also satisfy the normalization condition, which can be derived from Eq. 5 for K = Z:

    Σ_{{i,j}⊆Z} g^{ij} − (L − 2) Σ_{i∈Z} g^i = 1.    (7)

4.2. Constructing the 2-Additive Fuzzy Measure
To construct the 2-order additive fuzzy measure, Grabisch proposed the use of the Shapley values and interaction indices,[19] determining the sets of Shapley values and interaction indices manually. Such a manual determination of the indices is, however, not an easy task. In Ref. 43, the 2-additive fuzzy measure is identified by the so-called heuristic HLMS algorithm.

We determine the coefficients of the 2-additive fuzzy measure by minimizing a cost function, and consider two forms of the cost function. The first is the quadratic criterion (Eq. 21), minimized using the quadratic programming technique. The second is the classification error, which we minimize through stochastic optimization.[61] We compute the coefficients of the 2-additive fuzzy measure by using the general formula (Eq. 5), for example, g^{ijl} = g({z_i, z_j, z_l}) = g^i + g^j + g^l + a_{ij} + a_{il} + a_{jl}, and requiring that

    g(Z) = Σ_{i∈Z} g^i + Σ_{{i,j}⊆Z} a_{ij} = 1.    (8)
We ensure the monotonicity of the 2-additive fuzzy measure by imposing the set of constraints given in Eq. 6. In the case of the Möbius representation, the monotonicity and normalization constraints are given by

    a(∅) = 0,
    Σ_{i∈Z} a_i + Σ_{{i,j}⊆Z} a_{ij} = 1,
    a_i ≥ 0,  ∀i ∈ Z,    (9)
    a_i + Σ_{j∈T} a_{ij} ≥ 0,  ∀i ∈ Z, ∀T ⊆ Z \ {i}.
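As an illustration of Eq. 5 and the conditions in Eqs. 6-7, a 2-additive measure given by singleton and pair measures can be evaluated and checked as sketched below. The names are ours, and the enumeration-based monotonicity check is only practical for small L; a real identification would feed these constraints to a quadratic programming solver instead.

```python
from itertools import combinations

def two_additive_g(K, g1, g2):
    """Measure of coalition K under a 2-additive fuzzy measure (Eq. 5).

    g1[i]     -- g({z_i}), singleton measures
    g2[(i,j)] -- g({z_i, z_j}) for i < j, pair measures
    """
    K = sorted(K)
    if len(K) == 0:
        return 0.0
    if len(K) == 1:
        return g1[K[0]]
    pair_sum = sum(g2[p] for p in combinations(K, 2))
    return pair_sum - (len(K) - 2) * sum(g1[i] for i in K)

def is_normalized(L, g1, g2, tol=1e-9):
    """Normalization condition of Eq. 7: g(Z) must equal 1."""
    return abs(two_additive_g(range(L), g1, g2) - 1.0) < tol

def is_monotone(L, g1, g2, tol=1e-9):
    """Monotonicity constraints of Eq. 6, checked by brute-force enumeration."""
    idx = list(range(L))
    for i in idx:
        rest = [j for j in idx if j != i]
        for r in range(1, len(rest) + 1):
            for K in combinations(rest, r):
                pairs = sum(g2[tuple(sorted((i, j)))] for j in K)
                if pairs - sum(g1[j] for j in K) - (len(K) - 1) * g1[i] < -tol:
                    return False
    return True
```

An additive measure (a_{ij} = 0, so g^{ij} = g^i + g^j) passes both checks whenever the singleton measures sum to one.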
5. Aggregation Schemes Used

In our investigations, we used seven aggregation schemes, namely the majority vote, averaging, the median rule, weighted averaging, fusion by the Choquet integral, the linear combination of order statistics (LOS), and aggregation via Zimmermann's compensatory operator. The majority vote rule, averaging, and the median rule were included in the comparisons as simple aggregation rules that do not require any training. The weighted averaging approach serves as the overall reference combination scheme in our comparisons, since the weighted sum is recovered by the Choquet integral with respect to an additive measure. The LOS was included because it can be viewed as a fuzzy integral with a special type of fuzzy measure. Like fuzzy integrals, Zimmermann's compensatory operator is an adaptive fuzzy aggregation operator with trainable aggregation weights, so we have also used this operator in our comparisons. We now briefly describe the aggregation schemes used.
5.1. Simple Aggregation Schemes
Majority vote. The correct class is the one chosen by the most classifiers. If all the classifiers indicate different classes, then the class with the overall maximum output value is selected. Ties can be broken randomly.

Averaging. This approach simply averages the individual classifier outputs. The class yielding the maximum of the averaged output values is chosen as the correct class c:

    c = arg max_{j=1,...,Q} ȳ_j(x),  with ȳ_j(x) = (1/L) Σ_{i=1}^{L} y_{ij}(x)    (10)
where Q is the number of classes, L is the number of classifiers, and y_{ij}(x) represents the jth output of the ith classifier given an input pattern x.

Median rule. It is well known that the median is a robust estimate of the mean.[33] Median aggregation leads to the following rule:

    c = arg max_{j=1,...,Q} MED_{i=1,...,L} ( y_{ij}(x) ).    (11)

5.2. Weighted Averaging
In the first three approaches, each classifier has the same influence on the final decision. Equal weights represent just a single point in the space of possible weights; by exploring other weight combinations we may achieve better performance. For notational convenience, we consider networks with a single output y_i. We denote the desired output of a network by d(x). Thus, the actual output of each network can be written as the desired output plus an error: y_i(x) = d(x) + ε_i(x). The average squared error for the ith network can be written as

    e_i = E[(y_i(x) − d(x))²] = E[ε_i²]    (12)

where E[·] denotes the expectation. A weighted combination of the outputs of a set of L networks (a committee output) can be written as

    y(x) = Σ_{i=1}^{L} w_i y_i(x) = d(x) + Σ_{i=1}^{L} w_i ε_i(x)    (13)

where the weights w_i need to be determined. The error due to the committee can be written as

    e = E[(y(x) − d(x))²] = Σ_{i=1}^{L} Σ_{j=1}^{L} w_i w_j C_{ij}    (14)

where C is the error covariance matrix given by

    C_{ij} = E[(y_i(x_n) − d(x_n))(y_j(x_n) − d(x_n))].    (15)

Optimal values for the weights w_i can be determined by minimizing e. A non-trivial minimum can be found by requiring Σ_{i=1}^{L} w_i = 1.
The solution for w_i is

    w_i = Σ_{j=1}^{L} (C⁻¹)_{ij} / Σ_{k=1}^{L} Σ_{j=1}^{L} (C⁻¹)_{kj}    (16)

where C⁻¹ is the inverse of the error covariance matrix C.
5.3. Aggregation by Zimmermann's Compensatory Operator
Fuzzy integrals are weighted aggregation operators whose weights are defined not only on the individual elements being aggregated, but also on all subsets of them. It is interesting, therefore, to compare aggregation by fuzzy integrals with aggregation by fuzzy operators of another type. We have chosen Zimmermann's compensatory operator for this purpose[66]:

    y(x) = ( Π_{i=1}^{L} y_i(x)^{w_i} )^{1−γ} ( 1 − Π_{i=1}^{L} (1 − y_i(x))^{w_i} )^{γ}    (17)

where Σ_{i=1}^{L} w_i = L, 0 ≤ γ ≤ 1, and y_i(x) ∈ [0,1]. The parameter γ controls the degree of compensation between the union and intersection parts of the operator. The parameter w_i represents the weight associated with the ith network (classifier). Since the operator is continuous and differentiable with respect to γ and the w_i, gradient descent methods can be used to obtain the parameter values that best match the given inputs and the corresponding desired outputs. The constraints on γ and w_i can be eliminated by redefining the parameters as follows[34]: γ = a²/(a² + b²) and w_i = L d_i² / Σ_{k=1}^{L} d_k². Now a, b, and the d_i can be chosen without any constraints.

5.4. Fusion by the Choquet Integral
We assume that the committee members have Q outputs representing Q classes, and that a data point x is to be assigned to one of the classes. The class label c of the data point x is then determined as follows:

    c = arg max_{q=1,...,Q} C_g(q)    (18)

where C_g(q) is the Choquet integral for class q with respect to the fuzzy measure g. The values of the function h(z) appearing in the Choquet integral (see Definition 2) are given by the output values of each member of the committee.
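A sketch of the decision rule in Eq. 18, with the (ordinary) fuzzy measure stored as a dictionary over subsets of committee members. The representation and names are ours; note that with L networks this dictionary has 2^L entries, which is precisely the cost the simplified measures above are meant to avoid.

```python
from itertools import combinations

def choquet(h, g):
    """Choquet integral of h w.r.t. a measure g given as {frozenset: value}."""
    order = sorted(range(len(h)), key=lambda i: h[i])        # ascending h
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        # A_i = {z_(i), ..., z_(L)}: the members with the pos largest supports
        total += (h[i] - prev) * g[frozenset(order[pos:])]
        prev = h[i]
    return total

def fuse(committee_outputs, g):
    """Class label by Eq. 18: argmax over classes of the Choquet integral.

    committee_outputs[i][q] is the output of network i for class q.
    """
    Q = len(committee_outputs[0])
    scores = [choquet([out[q] for out in committee_outputs], g)
              for q in range(Q)]
    return max(range(Q), key=scores.__getitem__)
```

With the cardinal measure g(A) = |A|/L this fusion reduces to plain output averaging.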
5.5. Linear Combination of Order Statistics (LOS)
The discrete Choquet integral can be written as

    C_g = Σ_{i=1}^{L} h(z_i) [g(A_i) − g(A_{i+1})]    (19)

where g(A_{L+1}) = 0. The fuzzy measure can be chosen so that the measure of a set depends only on the cardinality of the set, as shown in Eq. 3. Then g(A_i) does not depend on the ordering h(z_1) ≤ h(z_2) ≤ ... ≤ h(z_L), and the differences can be written as w_i = g(A_i) − g(A_{i+1}). Therefore C_g = Σ_{i=1}^{L} w_i h(z_i).

Let z = (z_1, z_2, ..., z_L) be a vector. The ith order statistic z_(i) of z is the ith smallest element of z, where z_(1) ≤ z_(2) ≤ ... ≤ z_(L). Let w = (w_1, w_2, ..., w_L) be a weight vector constrained so that Σ_{i=1}^{L} w_i = 1 and 0 ≤ w_i ≤ 1, ∀i = 1, 2, ..., L. The linear combination of order statistics of z with the weight vector w is defined as

    LOS(z, w) = Σ_{i=1}^{L} w_i z_(i).    (20)
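Eq. 20 in code (a one-line sketch of ours; the assertion mirrors the weight constraints stated above):

```python
def los(z, w):
    """Linear combination of order statistics, Eq. 20: sum_i w_i * z_(i)."""
    assert abs(sum(w) - 1.0) < 1e-9 and all(0.0 <= wi <= 1.0 for wi in w)
    # Special cases: w = (0,...,0,1) gives the maximum, a single one in the
    # middle position gives the median, and uniform weights give the mean.
    return sum(wi * zi for wi, zi in zip(w, sorted(z)))
```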
Thus, the LOS operator is, in fact, an ordered weighted averaging (OWA) operator,[64] i.e. the Choquet integral with the cardinal fuzzy measure.

6. Learning the Fuzzy Measures

In the first set of experiments, the quadratic programming technique was used to determine the weights of the LOS, the 2-additive fuzzy measure, and the ordinary fuzzy measure. To learn the densities of the λ-fuzzy measure, the random search procedure was utilized,[61] with the quadratic error between the expected and actual output values of the committee as the cost function. We learned the weights of the ordinary fuzzy measure and the 2-additive fuzzy measure by minimizing the quadratic criterion proposed in Ref. 21. The quadratic criterion for two classes with one common fuzzy measure g takes the following form:

    J = Σ_{n=1}^{N_1} [C¹_g(x¹_n) − C²_g(x¹_n) − 1]² + Σ_{n=1}^{N_2} [C²_g(x²_n) − C¹_g(x²_n) − 1]²    (21)
where N_i stands for the number of training samples x^i_1, x^i_2, ..., x^i_{N_i}, i = 1, 2, from the ith class, and C¹_g(x^i_n) and C²_g(x^i_n) are the Choquet integrals for the first and second class, respectively. The learning algorithm for the fuzzy measure can be reduced to a quadratic programming problem.[21] The criterion generalizes to Q classes in the following way:

    J = Σ_{i=1}^{Q} Σ_{n=1}^{N_i} Σ_{j=1, j≠i}^{Q} |ΔC^{ij}_g(x^i_n) − d^{ij}(x^i_n)|²    (22)
where A C ^ ( x j J = C£(xj,) - CjCxjJ, < # ' « ) is the desired value for AC'^xJ,), and Ni is the number of training samples from class i. The optimal weights for the LOS operator have been found by minimizing the following quantity N
Q
EE n—1j=i
' L
(23) .8 = 1
where Σ_{i=1}^{L} w_i = 1 and 0 ≤ w_i ≤ 1, ∀i = 1, 2, ..., L; N is the number of data samples, Q is the number of classes, and z and d stand for the neural network output value and the target value, respectively. This is a quadratic objective function with linear constraints.

Our experiments showed that the generalization (Eq. 22) of the quadratic criterion (Eq. 21) to the multiclass case as a sequence of two-class problems does not work very well in practice. Moreover, the criteria do not reflect the classification error. Therefore, in the second set of experiments, we learned the fuzzy measures by minimizing the classification error rate. Since the error rate is not a differentiable function, we used a random search technique for minimizing the error,[61] chosen for the good convergence properties reported for it. However, any other stochastic optimization algorithm could be applied.

7. Diversity of Networks

One way to measure the diversity of a group of neural networks is to construct κ-error diagrams, as suggested in Ref. 40. The diagrams display the accuracy and diversity of the individual networks. For each pair of networks, the accuracy is measured as the average error rate on the test data set, while the diversity is evaluated by computing the so-called degree-of-agreement statistic κ. Each point in the diagrams corresponds to a pair
of networks and illustrates their diversity and their average accuracy. The κ statistic is computed as

    κ = (θ₁ − θ₂) / (1 − θ₂)    (24)
with θ₁ = Σ_{i=1}^{Q} C_{ii}/N and θ₂ = Σ_{i=1}^{Q} (Σ_{j=1}^{Q} C_{ij}/N)(Σ_{j=1}^{Q} C_{ji}/N), where Q is the number of classes, C is a Q × Q square matrix with C_{ij} containing the number of test data points assigned to class i by the first network and to class j by the second network, and N stands for the total number of test data points. The statistic κ = 1 when two networks agree on every data point, and κ = 0 when the agreement equals that expected by chance. We used the statistic κ to compare the diversities of neural networks obtained through half & half sampling and bootstrapping.

8. Data

The ESPRIT Basic Research Project Number 6891 (ELENA) provides databases and technical reports designed for testing both conventional and neural classifiers.[30] All the databases and technical reports are available via anonymous ftp from ftp.dice.ucl.ac.be in the directory pub/neural-net/ELENA/databases. From the ELENA project we have chosen the artificial data set Clouds and two real data sets, Phoneme and Satimage. Two additional databases have been taken from other sources. One is an artificial database called Ringnorm; the other is the Thyroid database, taken from the PROBEN1 collection, which represents a medical diagnosis task.

The Ringnorm database is a 20-dimensional two-class database.[5] Each class is drawn from a multivariate normal distribution. Breiman reports the theoretical classification error rate to be 1.3%. This database is available from http://www.cs.toronto.edu/~delve/.[11]

The aim of the Thyroid medical database is to diagnose thyroid hyper- or hypo-function. The task is to decide whether the patient's thyroid has overfunction, normal function, or underfunction. It is a 21-dimensional, three-class database containing 7200 examples. The class probabilities are 5.1%, 92.6%, and 2.3%, respectively.

The data sets used are summarized in Table 1. The benchmark (BM) errors presented in Table 1 are taken from the ELENA project and the PROBEN1 database.
Table 1. Summary of the data sets used.

                 Clouds   Phoneme   Satimage   Ringnorm   Thyroid
  # classes         2        2          6          2          3
  # features        2        5          5         20         21
  # samples      5000     5404       6435       5000       7200
  BM Error %     12.3     16.4       11.9        --         1.31
  Bayes %         9.66      --         --        1.30        --

In the ELENA project, the errors presented are the
average errors obtained when using an MLP with two hidden layers of 20 and 10 units, respectively. To solve the Thyroid task, an MLP with two hidden layers of 16 and 8 units, respectively, has been employed.

9. Experimental Testing

All comparisons between the different aggregation schemes presented here have been performed by dividing the data sets into training and testing sets of equal size. We used two sampling techniques when training the neural network committee members, namely half & half bagging and bootstrapping. In all the tests, we trained a set of one-hidden-layer MLPs with 10 sigmoidal hidden units. This architecture was adopted after some experiments, which showed that the network used in the ELENA project was too large. Since we only investigate different aggregation schemes, we have not performed expensive experiments to find the optimal network size for each data set used. Two training techniques have been employed: 1) the Bayesian inference technique[38] to obtain regularized networks and 2) standard backpropagation training with a high number of training epochs (without regularization).

We ran each experiment seven times, and the mean errors and standard deviations presented are calculated from these seven trials. In each trial, the data set used is randomly divided into training and testing parts of the same size. In the half & half bagging approach, the sizes of the data sets MC and CC were set to M = N_learn/4, where N_learn is the size of the learning set. Since one member of a half & half sampled committee was trained on 2M = N_learn/2 data points, the same number of data points was also used to train each member of the bootstrapped committee; the data set of N_learn/2 points was collected by bootstrapping the original training data set.

In the first set of experiments, we obtained the parameters of the aggregation schemes by minimizing the sum-of-squared error, as discussed in
Fig. 1. Classification error as a function of the committee size for the different data sets (curves: BSR, BS, H&HR, H&H).
Section 6. Our first series of experiments investigates the trade-off between the accuracy and diversity of the networks in committees obtained using the different sampling and training techniques. Figure 1 illustrates the test-set classification error of the committees for the different databases as a function of the committee size. Aggregation by the majority vote rule has been used in these experiments. Figures 2-6 present κ-error diagrams illustrating the diversity of the networks for the different data sets.
Fig. 2. κ-error diagrams for the Phoneme data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 3. κ-error diagrams for the Clouds data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 4. κ-error diagrams for the Satimage data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 5. κ-error diagrams for the Thyroid data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 6. κ-error diagrams for the Ringnorm data set using bootstrapped (top) and half & half sampled (bottom) committees.
The following notations are used in the figures: BSR stands for bootstrap sampling with regularized training, BS for bootstrap sampling with training without regularization, H&HR for the half & half sampling approach with regularized training, and H&H for the half & half sampling approach with training without regularization.

After analyzing Figs. 1-6, the following observations can be made. The half & half sampling technique outperforms the bootstrapping approach by creating more accurate neural network committees. As can be seen from the κ-error diagrams, the networks created by bootstrapping form a much tighter cluster than those created by half & half sampling. This is expected, since with the bootstrapping technique each network is trained on a sample drawn from the same distribution. This also explains why half & half sampling outperforms bootstrapping: the lower accuracy of the individual networks produced by the half & half sampling approach is compensated for by their increased diversity. Committees created by the different sampling techniques exhibited approximately the same standard deviation of the classification error. For
both sampling techniques, the regularized committees are, on average, more accurate than the non-regularized ones. However, the accuracy of a committee depends on the trade-off between the accuracy and diversity of its networks. For example, for the Thyroid data set the non-regularized committees are more accurate than the regularized ones. This is not surprising, since, as can be seen from Fig. 5, the slightly lower accuracy of the networks produced by non-regularized learning is compensated for by their noticeably increased diversity.

In the second set of experiments, we investigated the influence of the different types of fuzzy measures on the classification accuracy of committees fused through the Choquet integral. First, the fuzzy measures were learned through minimization of the sum-of-squared error, as explained in Section 6. All the committees were made of 9 members trained with the Bayesian inference technique. The left-hand parts of Tables 2-6 summarize the results of these tests. The following notations are used in the tables: Mean stands for the average test-set classification error in percent, Std is the standard deviation of the error, Best stands for the single neural network with the best average performance, MV stands for the majority vote, AV means averaging, MED stands for the median rule, WA means weighted averaging, LOS means the linear combination of order statistics, ZIM stands for Zimmermann's compensatory operator, λCI stands for the Choquet integral with the λ-fuzzy measure, 2CI means the

Table 2. Performance of the neural network committees for the Phoneme data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      14.58   0.38   15.07   0.86     14.58   0.38   15.07   0.86
  2. MV        12.54   0.33   12.02   0.14     12.54   0.33   12.02   0.14
  3. AV        12.73   0.30   11.92   0.35     12.73   0.30   11.92   0.35
  4. MED       12.54   0.33   12.02   0.14     12.54   0.33   12.02   0.14
  5. WA        12.49   0.35   12.07   0.29     12.49   0.35   12.07   0.29
  6. LOS       12.40   0.38   11.94   0.18     12.19   0.38   11.48   0.21
  7. ZIM       12.68   0.26   12.42   0.21     12.29   0.27   11.78   0.28
  8. λCI       12.54   0.35   12.29   0.46     12.13   0.37   11.23   0.34
  9. 2CI       12.49   0.41   12.29   0.41     12.03   0.37   11.20   0.33
  10. OCI      12.07   0.40   11.37   0.32     11.68   0.43   10.84   0.25
Table 3. Performance of the neural network committees for the Clouds data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      10.58   0.09   10.77   0.18     10.58   0.09   10.77   0.18
  2. MV        10.44   0.14   10.37   0.14     10.44   0.14   10.37   0.14
  3. AV        10.42   0.18   10.50   0.11     10.42   0.18   10.50   0.11
  4. MED       10.44   0.14   10.37   0.13     10.44   0.14   10.37   0.13
  5. WA        10.40   0.13   10.56   0.25     10.40   0.13   10.56   0.25
  6. LOS       10.36   0.15   10.40   0.17     10.20   0.12   10.11   0.09
  7. ZIM       10.45   0.22   10.50   0.16     10.30   0.14   10.21   0.12
  8. λCI       10.38   0.13   10.40   0.12     10.00   0.11    9.87   0.10
  9. 2CI       10.40   0.13   10.41   0.21      9.98   0.11    9.85   0.07
  10. OCI      10.24   0.05   10.40   0.22      9.97   0.09    9.85   0.07
Table 4. Performance of the neural network committees for the Satimage data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      11.88   0.25   11.87   0.21     11.88   0.25   11.87   0.21
  2. MV        11.37   0.19   10.59   0.32     11.37   0.19   10.59   0.32
  3. AV        11.22   0.23   10.59   0.32     11.22   0.23   10.59   0.32
  4. MED       11.33   0.20   10.59   0.36     11.33   0.20   10.59   0.36
  5. WA        11.13   0.21   11.01   0.14     11.13   0.21   11.01   0.14
  6. LOS       11.58   0.24   11.51   0.36     10.93   0.24   10.03   0.22
  7. ZIM       11.10   0.16   11.33   0.15     10.99   0.23   10.11   0.19
  8. λCI       11.13   0.24   11.07   0.16     10.73   0.17    9.80   0.22
  9. 2CI       11.15   0.25   10.96   0.22     10.70   0.18    9.79   0.21
  10. OCI      11.15   0.24   10.64   0.28     10.66   0.24    9.61   0.25
Choquet integral with the 2-additive fuzzy measure, and OCI stands for the Choquet integral with the ordinary fuzzy measure. As can be seen from the tables, except for the Clouds data set, there is an obvious improvement in classification accuracy when combining networks. For the Clouds data set the classification error is quite close to the theoretical limit.
Table 5. Performance of the neural network committees for the Thyroid data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best       1.51   0.06    1.55   0.18      1.51   0.06    1.55   0.18
  2. MV         1.19   0.08    0.79   0.09      1.19   0.08    0.79   0.09
  3. AV         1.28   0.09    0.75   0.08      1.28   0.09    0.75   0.08
  4. MED        1.20   0.09    0.84   0.12      1.20   0.09    0.84   0.12
  5. WA         1.16   0.14    0.89   0.15      1.16   0.14    0.89   0.15
  6. LOS        1.24   0.09    0.77   0.09      1.13   0.08    0.64   0.07
  7. ZIM        1.25   0.13    0.84   0.09      1.08   0.12    0.62   0.09
  8. λCI        1.18   0.15    0.77   0.11      0.98   0.11    0.57   0.10
  9. 2CI        1.20   0.12    0.82   0.30      0.97   0.11    0.56   0.10
  10. OCI       1.05   0.10    0.68   0.09      0.96   0.10    0.56   0.10
Table 6. Performance of the neural network committees for the Ringnorm data set.

                                1.Best  2.MV   3.AV   4.MED  5.WA   6.LOS  7.ZIM  8.λCI  9.2CI  10.OCI
Sum-of-Squared Error Scheme
  Bootstrap     Mean             7.07   2.43   2.15   2.43   2.14   2.32   2.18   2.13   2.14   2.01
                Std              0.31   0.21   0.15   0.21   0.14   0.20   0.11   0.15   0.15   0.19
  Half & Half   Mean             7.01   1.99   1.70   1.99   1.67   1.93   1.85   1.65   1.64   1.58
                Std              0.46   0.12   0.21   0.12   0.22   0.16   0.28   0.22   0.22   0.13
Classification Error Rate Scheme
  Bootstrap     Mean             7.07   2.43   2.15   2.43   2.14   1.97   1.90   1.86   1.85   1.81
                Std              0.31   0.21   0.15   0.21   0.14   0.11   0.11   0.08   0.08   0.10
  Half & Half   Mean             7.01   1.99   1.70   1.99   1.67   1.53   1.44   1.42   1.41   1.40
                Std              0.46   0.12   0.21   0.12   0.22   0.17   0.16   0.17   0.16   0.16
Aggregation by the Choquet integral with the ordinary fuzzy measure, on average, provided the best overall performance. However, this aggregation scheme outperforms the others only slightly, and the improvement is achieved at the expense of high computing time and memory requirements. As can be seen from the tables, on average, the half & half sampling technique creates more accurate committees than the bootstrapping approach. Examining the
tables, we see that there is no significant difference in accuracy between the committees produced by the simple combining techniques, namely the majority vote, averaging and the median rule, and those produced by the trainable ones. This observation indicates that neither learning the fuzzy measures by minimizing the sum-of-squared error nor the generalization (Eq. 22) of the quadratic criterion (Eq. 21) to the multiclass case works very well in practice. Therefore, in the last series of experiments we learned the aggregation weights (the parameters of the fuzzy measures) through minimization of the classification error rate. A random search technique was used for the minimization. 61 Regularized training was used in these experiments. The right-hand parts of Tables 2-6 summarize these results. Comparing the left- and the right-hand parts of Tables 2-6, we find that for all the trainable aggregation schemes, minimizing the classification error improved committee performance. In all the tests performed, the half & half sampled committees significantly outperformed the bootstrapped ones. Tables 2-6 also show that, in spite of the huge difference in the number of parameters defining the λ-fuzzy measure, the 2-additive fuzzy measure and the ordinary fuzzy measure, the three measures provided approximately the same classification accuracy. The LOS and Zimmermann's compensatory operators performed slightly worse. Our results therefore indicate that the λ-fuzzy measure, which requires by far the fewest parameters, would be the best choice.

10. Conclusions

In this chapter, we used the half & half bagging approach for creating neural network committees for classification. The half & half sampling approach was compared with the bootstrapping technique. In all the tests performed, the half & half bagging approach outperformed the bootstrapping technique. On average, half & half bagging created less accurate but more diverse networks than bootstrapping.
However, the lower accuracy of the networks produced by half & half bagging was compensated for by their increased diversity. The regularized committees provided a higher classification accuracy than the unregularized ones. The influence of different types of fuzzy measures on the classification accuracy of neural network committees fused through the Choquet integral has also been investigated. Four types of fuzzy measures, namely the ordinary fuzzy measure, the λ-fuzzy measure, the 2-additive fuzzy measure,
and the cardinal fuzzy measure have been tested. The aggregation schemes based on the Choquet integral provided the best performance. However, we found no significant difference in the classification accuracy of committees fused through the Choquet integral with respect to the λ-fuzzy measure, the 2-additive fuzzy measure and the ordinary fuzzy measure. These results indicate that the very simple λ-fuzzy measure would be the best choice. Learning the fuzzy measures through minimization of the classification error rate resulted in a significant improvement in the accuracy of the committees as compared to learning the measures by minimizing the sum-of-squared error.
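The half & half sampling scheme used throughout the chapter can be sketched as follows. This follows the idea of Breiman's half & half bagging (Ref. 6): half of each training sample is drawn from cases the current committee misclassifies and half from cases it classifies correctly. The function names and the guard on the number of draws are our assumptions for illustration.

```python
import random

def half_and_half_sample(X, y, committee_predict, n_sample, rng=None,
                         max_draws=100_000):
    """Sketch of half & half sampling: draw cases at random, keep
    misclassified ones until half the sample is full, and correctly
    classified ones until the other half is full."""
    rng = rng or random.Random()
    half = n_sample // 2
    miss, hit = [], []
    for _ in range(max_draws):          # guard, e.g. against a perfect committee
        if len(miss) >= half and len(hit) >= half:
            break
        i = rng.randrange(len(X))
        if committee_predict(X[i]) != y[i]:
            if len(miss) < half:
                miss.append(i)
        elif len(hit) < half:
            hit.append(i)
    chosen = miss + hit
    return [X[i] for i in chosen], [y[i] for i in chosen]
```

Each new committee member is then trained on such a sample, so later networks concentrate on the hard boundary points, which is one plausible source of the diversity observed above.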
References

1. R. Avnimelech and N. Intrator, Boosting regression estimators. Neural Computation 11, 499-520 (1999).
2. R. Battiti and A. M. Colla, Democracy in neural nets: Voting schemes for classification. Neural Networks 7(4), 691-707 (1994).
3. I. Bloch, Some aspects of Dempster-Shafer evidence theory for classification of multi-modality medical images taking partial volume effect into account. Pattern Recognition Letters 17(8), 905-919 (1996).
4. L. Breiman, Bagging predictors. Technical report 421 (Statistics Department, University of California, Berkeley, 1994).
5. L. Breiman, Bias, variance and arcing classifiers. Technical report 460 (Statistics Department, University of California, Berkeley, 1996).
6. L. Breiman, Half & half bagging and hard boundary points. Technical report 534 (Statistics Department, University of California, Berkeley, 1998), <www.stat.berkley.edu/users/breiman>.
7. M. Ceccarelli and A. Petrosino, Multi-feature adaptive classifiers for SAR image segmentation. Neurocomputing 14, 345-363 (1997).
8. W. Chen, P. D. Gader and H. Shi, Improved dynamic programming-based handwritten word recognition using optimal order statistics. In Proceedings of the International Conference "Statistical and Stochastic Methods in Image Processing II", 246-256 (San Diego, 1997).
9. Z. Chi, H. Yan and T. D. Pham, Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition (World Scientific Publishing, 1996).
10. S. B. Cho and J. H. Kim, Combining multiple neural networks by fuzzy integral for robust classification. IEEE Trans. Systems, Man, and Cybernetics 25(2), 380-384 (1995).
11. Delve,
12. T. Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Trans. Systems, Man, and Cybernetics B 25(5), 804-813 (1995).
Fusing Neural Networks Through Fuzzy Integration
13. H. Drucker and C. Cortes, Boosting decision trees. In Advances in Neural Information Processing Systems 8, ed. D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, 479-485 (MIT Press, 1996).
14. B. Efron and R. Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, London, 1993).
15. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, 148-156 (1996).
16. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119-139 (1997).
17. P. D. Gader, M. A. Mohamed and J. M. Keller, Fusion of handwritten word classifiers. Pattern Recognition Letters 17, 577-584 (1996).
18. M. Grabisch, Fuzzy integral in multicriteria decision making. Fuzzy Sets and Systems 69, 279-298 (1995).
19. M. Grabisch, The representation of importance and interaction of features by fuzzy measures. Pattern Recognition Letters 17, 567-575 (1996).
20. M. Grabisch, k-order additive discrete fuzzy measures and their representation. Fuzzy Sets and Systems 92, 167-189 (1997).
21. M. Grabisch and J.-M. Nicolas, Classification by fuzzy integral: Performance and tests. Fuzzy Sets and Systems 65, 255-271 (1994).
22. S. Hashem, Optimal linear combinations of neural networks. Neural Networks 10(4), 599-614 (1997).
23. J. B. Hampshire and A. H. Waibel, The meta-pi network: Building distributed representations for robust multisource pattern recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 14(7), 751-769 (1992).
24. T. Heskes, Balancing between bagging and bumping. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 466-472 (MIT Press, 1997).
25. T. K. Ho, J. J. Hull and S. N. Srihari, Decision combination in multiple classifier systems. IEEE Trans. Pattern Analysis and Machine Intelligence 16(1), 66-75 (1994).
26. Y. S. Huang and C. Y. Suen, A method of combining multiple classifiers - a neural network approach. In Proceedings of the 12th International Conference on Pattern Recognition, ATi-A7h (Jerusalem, Israel, 1994).
27. R. A. Jacobs, Methods for combining experts' probability assessments. Neural Computation 7(5), 867-888 (1995).
28. C. Ji and S. Ma, Combined weak classifiers. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 494-500 (MIT Press, 1997).
29. M. I. Jordan and L. Xu, Convergence results of the EM approach to mixtures of experts architectures. Neural Networks 8, 1409-1431 (1995).
30. C. Jutten, A. Guerin-Dugue, C. Aviles-Cruz, J. L. Voz and D. Van Cappel, ESPRIT basic research project number 6891 ELENA (1995).
31. H.-J. Kang, K. Kim and J. H. Kim, Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Pattern Recognition Letters 18, 515-523 (1997).
32. J. Kittler, A. Hojjatoleslami and T. Windeatt, Strategies for combining classifiers employing shared and distinct pattern representations. Pattern Recognition Letters 18, 1373-1377 (1997).
33. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, On combining classifiers. IEEE Trans. Pattern Analysis and Machine Intelligence 20(3), 226-239 (1998).
34. R. Krishnapuram and J. Lee, Fuzzy-set-based hierarchical networks for information fusion in computer vision. Neural Networks 4, 335-350 (1992).
35. L. I. Kuncheva and L. C. Jain, Designing classifier fusion systems by genetic algorithms. IEEE Trans. Evolutionary Computation 4(4), 327-336 (2000).
36. L. I. Kuncheva, J. C. Bezdek and R. P. W. Duin, Decision templates for multiple classifier fusion. Pattern Recognition 34(2), 299-314 (2001).
37. L. Lam and C. Y. Suen, Optimal combination of pattern classifiers. Pattern Recognition Letters 16, 945-954 (1995).
38. D. J. MacKay, Bayesian interpolation. Neural Computation 4, 415-447 (1992).
39. R. Maclin and J. W. Shavlik, Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (1995).
40. D. Margineantu and T. G. Dietterich, Pruning adaptive boosting. In Proceedings of the Fourteenth International Conference on Machine Learning, 211-218 (Morgan Kaufmann, San Francisco, 1997).
41. C. J. Merz and M. J. Pazzani, Combining neural network regression estimates with regularized linear weights. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 564-570 (MIT Press, 1997).
42. R. Mesiar, Generalizations of k-order additive discrete fuzzy measures. Fuzzy Sets and Systems 102, 423-428 (1999).
43. L. Mikenina and H.-J. Zimmermann, Improved feature selection and classification by the 2-additive fuzzy measure. Fuzzy Sets and Systems 107, 197-218 (1999).
44. A. R. Mirhosseini, H. Yan, K.-M. Lam and T. Pham, Human face recognition: An evidence aggregation approach. Computer Vision and Image Understanding 71(2), 213-230 (1998).
45. P. W. Munro and B. Parmanto, Competition among networks improves committee performance. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 592-598 (MIT Press, 1997).
46. T. Murofushi and M. Sugeno, Some quantities represented by the Choquet integral. Fuzzy Sets and Systems 56, 229-235 (1993).
47. D. W. Opitz and J. W. Shavlik, Generating accurate and diverse members of a neural-network ensemble. In Advances in Neural Information Processing Systems 8, ed. D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, 535-541 (MIT Press, 1996).
48. M. P. Perrone and L. N. Cooper, When networks disagree: Ensemble method for neural networks. In Neural Networks for Speech and Image Processing, ed. R. J. Mammone (Chapman-Hall, 1993).
49. G. Ratsch, T. Onoda and K. R. Muller, Regularizing AdaBoost. In Advances in Neural Information Processing Systems 11, ed. M. S. Kearns, S. A. Solla and D. A. Cohn, 564-570 (MIT Press, 1999).
50. G. Rogova, Combining the results of several neural network classifiers. Neural Networks 7(5), 777-781 (1994).
51. R. E. Schapire, Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning, 313-321 (Morgan Kaufmann, San Francisco, 1997).
52. H. Schwenk and Y. Bengio, Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems 10, ed. M. I. Jordan, M. S. Kearns and S. A. Solla, 647-653 (MIT Press, 1998).
53. H. Shi, P. D. Gader and W. Chen, Fuzzy integral filters: Properties and parallel implementation. Real-Time Imaging 4(4), 233-241 (1998).
54. M. Sugeno, Fuzzy measures and fuzzy integrals: A survey. In Automata and Decision Making, 89-102 (North Holland, Amsterdam, 1977).
55. M. Taniguchi and V. Tresp, Averaging regularized estimators. Neural Computation 9, 1163-1178 (1997).
56. H. Tahani and J. M. Keller, Information fusion in computer vision using the fuzzy integral. IEEE Trans. Systems, Man and Cybernetics 20(3), 733-741 (1990).
57. K. Tumer and J. Ghosh, Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29(2), 341-348 (1996).
58. K. Tumer and J. Ghosh, Classifier combining through trimmed means and order statistics. In Proceedings of the International Joint Conference on Neural Networks (Anchorage, Alaska, 1998).
59. K. Tumer and J. Ghosh, Linear and order statistics combiners for pattern classification. In Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, ed. A. J. C. Sharkey, 127-162 (Springer-Verlag, 1999).
60. A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene and A. Gelzinis, Soft combination of neural classifiers: A comparative study. Pattern Recognition Letters 20, 429-444 (1999).
61. A. Verikas and A. Gelzinis, Training neural networks by stochastic optimisation. Neurocomputing 30, 153-172 (2000).
62. S. Waterhouse and G. Cook, Ensemble methods for phoneme classification. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 800-806 (MIT Press, 1997).
63. L. Xu, A. Krzyzak and C. Y. Suen, Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man, and Cybernetics 22(3), 418-435 (1992).
64. R. R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Trans. Systems, Man, and Cybernetics 18, 183-190 (1988).
65. J. Zhang, Developing robust non-linear models through bootstrap aggregated neural networks. Neurocomputing 25, 93-113 (1999).
66. H.-J. Zimmermann and P. Zysno, Decisions and evaluations by hierarchical aggregation of information. Fuzzy Sets and Systems 10(3), 243-260 (1984).
CHAPTER 10

HYBRID DATA MINING METHODS IN IMAGE PROCESSING
Aljoscha Klose and Rudolf Kruse Dept. of Knowledge Processing and Language Engineering Otto-von-Guericke-University of Magdeburg D-39106 Magdeburg, Germany E-mail: [email protected]
In many applications, the analysis of images is a valuable source of information. However, advances in technology lead to increasing amounts of image data that cannot be analyzed manually and that make an automatic analysis more and more important. The automatic extraction of knowledge from images is often a difficult and complex process. We present applications from remote sensing, where we use hybrid Data Mining methods to analyze the machine vision process. The gained knowledge is exploited to improve the image processing algorithms.
1. Introduction

The advances in computers and sensor technology have led to the collection of huge amounts of data by companies and scientific or governmental institutions. However, even though data is available, it is often difficult to extract knowledge and useful patterns from it. In response to these problems, "Data Mining" has emerged as a new area of research over the last several years. Data Mining — also known as "Knowledge Discovery in Databases" — is an interdisciplinary approach to the exploration and exploitation of information that is hidden in the archives of tables, images, sounds or texts. One explicit aim of data mining is the extraction of patterns that can be understood by humans. The extracted patterns can then help a user to better understand the problem domain. To accomplish this goal, we
need techniques that can extract knowledge from large, dynamic, multirelational, multi-media information sources, and which close the semantic gap between structured data and human notions and concepts. In other words, we must be able to translate computer representations into human notions and concepts, and vice versa. We think that hybrid combinations of the techniques usually referred to as "Soft Computing" are an appropriate means to achieve these aims. In the sections below, we outline the key concepts of the Data Mining techniques used in our work, and the role that images play in Data Mining as heterogeneous sources of information. In our application, we deal with remotely sensed images. We concentrate on approaches towards the understanding and support of image processing algorithms. In our first study, described in Sec. 2, we analyze information gathered during a machine vision process and use the gained knowledge to improve the efficiency of the algorithms. Secs. 3 and 4 aim at a better understanding of the algorithms and the influence of individual parameters. The ultimate goal is to give the user insight into the algorithms and thus support him in applying them.

1.1. Fuzzy Systems
As the goal of fuzzy systems has always been to model human expert knowledge and to produce systems that are easy to understand, we expect fuzzy techniques to play a prominent role in data mining. They are a suitable means to represent similarity, preference, and uncertainty. 1 They can state how similar an object or case is to a prototypical one, they can indicate preferences between suboptimal solutions to a problem, or they can model uncertainty about the true situation, if this situation is described in imprecise terms. For solving practical problems, all of these interpretations are needed, and all have proven useful in applications. In data analysis, fuzzy sets are used in several ways.2 We use fuzzy set methods in two ways: first, as a general and intuitive means to model domain knowledge and problem specific measures, where the extension of classical set theory is most important; secondly, to model linguistic terms to do what Zadeh called Computing with Words.3 We mainly use these linguistic terms in fuzzy if-then rules. The antecedent of such a fuzzy rule consists of fuzzy descriptions of input values, and the consequent defines a — possibly fuzzy — output value for the given input.
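The linguistic terms and fuzzy if-then rules just described can be illustrated with a small sketch. The triangular membership functions, the features (line length and contrast) and all threshold values below are hypothetical, chosen only to show the mechanics of rule evaluation with min as the fuzzy AND.

```python
def tri(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Hypothetical linguistic terms for two image features:
long_line = tri(50.0, 100.0, 150.0)      # "length is long" (pixels)
high_contrast = tri(0.4, 0.7, 1.0)       # "contrast is high" (normalized)

def rule_degree(length, contrast):
    """IF length is long AND contrast is high THEN ... (min = fuzzy AND)."""
    return min(long_line(length), high_contrast(contrast))
```

For a line of length 120 with contrast 0.8 the rule fires to degree 0.6, a graded answer where a crisp threshold would give only yes or no.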
1.2. Neural Networks
Neural networks are often considered data mining methods. However, as the knowledge in neural networks is stored in numeric network connections, they do not provide human-understandable information about the data. Therefore, neural networks are often combined with fuzzy systems. The idea of combining fuzzy systems and neural networks into neuro-fuzzy systems is quite intuitive: we use a fuzzy system to represent knowledge in an interpretable manner, and use the learning ability of neural networks to determine membership values. The drawbacks of both individual approaches — the black box behavior common to neural networks, and the problem of finding suitable membership values for fuzzy systems — can thus be avoided. A neuro-fuzzy system constitutes an interpretable model that can use problem-specific prior knowledge, and which is capable of learning, e.g. of inducing fuzzy rules from sample data. There are, however, other neural network architectures that are better suited for data mining. Since visualization techniques play an important role in understanding high-dimensional data, network architectures like self-organizing maps are more applicable to data mining. Self-organizing maps allow us to create two-dimensional visualizations of high-dimensional datasets. 4

1.3. Evolutionary Algorithms
Evolutionary algorithms are a versatile class of search and optimization techniques. 5 In contrast to neural network learning, they can optimize not only parameters, but also structure. This makes them well suited for complex search spaces. As they perform a stochastic search using only a performance function, they can easily be integrated with fuzzy techniques. An example of such a combination is shown in Sec. 3.

1.4. Images as Information Sources for Data Mining
Images are a natural and rich source of information for humans. In spite of decades of intensive research, machine vision is still far from being competitive with the human visual system. This applies especially to small groups of images. However, there are many successful applications of automated image processing, including areas with large numbers of images and strongly repetitive tasks. These tasks are often beyond the capabilities of humans.
There are several current developments that have led to massive increases in the number of accessible images. The most important are the growth of the World Wide Web, advances in sensor technology, and improved transmission and storage capabilities in remote sensing. The World Wide Web has become an enormous, distributed, heterogeneous and unstructured source of information. While its textual parts can more or less be handled with current search engines, the information contained in the images cannot be searched automatically. Indexing and retrieval of image databases are still in their infancy. Content-based image retrieval, where queries are performed by the color, texture, and shape of image objects and regions, is a current research topic. 6-8 Another area where enormous amounts of data are being gathered is remote sensing. Due to technological advances, analysts working with remotely sensed imagery are today confronted with huge amounts of data. The integration of multi-sensor data in particular calls for support of the observer by automatic image processing algorithms. Some prominent examples of such giant image databases are:

• the classification of (about 5 x 10^8) stellar objects in about 3 terabytes of sky images reported in the POSS-II sky survey project, 9
• the search for volcanos in about 30,000 radar images of the surface of Venus in the Magellan study, 10 or
• the mosaicing of thousands of high resolution satellite images of the Earth into a large continuous image. 11

An essential aim of information mining in large image databases is to direct the user's attention to useful information. For images, traditional data mining techniques will almost always work hand in hand with automatic image processing techniques. In the following we present two examples which differ slightly from that view: we analyze data that were gathered during image processing. In these applications, the aim is to gain insight into the vision process itself and thus help to improve it.

2. Analyzing the Machine Vision Process

The domain of our studies is the "screening" of collections of high resolution aerial images for man-made structures. Applications of the detection of such structures include remote reconnaissance or image registration, where
man-made structures deliver valuable alignment hints. The following analyses have been done in cooperation with FGAN/FOM. 12 In the framework of structural analysis of complex scenes, a blackboard-based production system (BPI) is presented in Ref. 13. The image processing starts by extracting object primitives (i.e. edges) from the image, and then uses production rules to combine them into more and more complex intermediate objects. The algorithm stops when either no more productions apply, or the target high-level object is detected. 14,15 Figure 1a shows an example image with the extracted edge segments. Figure 1b shows the runway detected as a result of applying the production system. An analysis of the process for this image shows that only 20 lines out of about 37,000 are used to construct this stripe (see Fig. 2a). However, the analyzing system examines all of the lines. As the algorithm has a time complexity of at least O(n²), where n is the number of line segments, processing can take quite a while. The idea of our analysis was the following: if we knew which primitive objects are most promising, we could start the analysis with these objects and thus significantly speed up the production process. The approach is to extract features from the image that describe the primitive objects, and to train a classifier to decide which lines can be discarded. We applied the neuro-fuzzy classifier NEFCLASS to this task; we briefly outline NEFCLASS in the next section. We considered a concept hierarchy for the screening of runways and airfields. The rules for this hierarchy state that collinear adjacent (short) lines are concatenated to longer lines. If their length exceeds a certain threshold, they conform to the concept long lines. The productions used to extend long lines can bridge larger gaps caused by taxiways or image distortions. If these long lines meet certain properties they may be used to generate runway lines, which could be part of a runway.
Parallel runway lines within a given distance to each other are considered runway stripes. In a last step, these stripes may be identified as runways. Although we considered only runways, the production rule approach can be used for all kinds of stripe-like objects. It can be adapted to a variety of tasks by varying parameters. The model was intended to be used with image data from different sensors, e.g. visible, infrared, or synthetic aperture radar (SAR) systems.
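The test behind a single production rule of this kind, deciding whether two adjacent line segments are collinear enough to be concatenated, might look as in the following sketch. The angle and gap thresholds are illustrative defaults, not the system's actual tolerance parameters.

```python
import math

def nearly_collinear(l1, l2, max_angle_deg=5.0, max_gap=20.0):
    """Toy production-rule test: two segments may be concatenated when their
    directions agree and the gap between them is small. Segments are
    ((x1, y1), (x2, y2)); thresholds are illustrative only."""
    def angle(seg):
        (x1, y1), (x2, y2) = seg
        # undirected line orientation in [0, 180) degrees
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    da = abs(angle(l1) - angle(l2))
    da = min(da, 180.0 - da)                 # wrap-around of orientations
    gap = math.dist(l1[1], l2[0])            # end of l1 to start of l2
    return da <= max_angle_deg and gap <= max_gap
```

Applying such a test to every pair of segments is exactly what makes the analysis O(n²) in the number of segments, which motivates the line-filtering classifier described next.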
Fig. 1. (a) 37,659 lines extracted from a SAR image by Burns' edge detector (short segments painted in white); (b) runway found by production system.
2.1. NEFCLASS: A Hybrid Neuro-Fuzzy Classifier
NEFCLASS is a hybrid neuro-fuzzy classifier that belongs to the class of structure-oriented approaches. These approaches accept initial fuzzy sets and thus structure the data space as a multidimensional fuzzy grid. A rule base is created by selecting those grid cells that contain data, which can be done in a single pass through the training data. This way of learning fuzzy rules was suggested by Wang and Mendel 16 and extended in the NEFCLASS model. 17 After the rule base of a fuzzy system has been generated, the membership functions must be fine-tuned in order to improve performance. In the NEFCLASS model, the fuzzy sets are modified by simple backpropagation-like heuristics, which are motivated by neural network learning. In the learning phase, constraints are used to ensure that the fuzzy sets still fit their associated linguistic terms after learning. For example, membership functions of adjacent linguistic terms must not change position, and must overlap to a certain degree. 17
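The structure-oriented rule-creation step can be sketched as follows, in the spirit of the Wang and Mendel procedure: each training sample selects, per input dimension, the fuzzy set it belongs to most; the resulting grid cell becomes a rule antecedent, and the sample's class votes for the consequent. This is a simplified illustration, not the actual NEFCLASS implementation.

```python
from collections import Counter, defaultdict

def grid_rules(X, y, partitions):
    """One-pass grid-based rule learning.

    X          : iterable of input tuples
    y          : class labels, aligned with X
    partitions : per dimension, a dict {term_name: membership_function}
    Returns a rule base: {antecedent cell -> majority class consequent}.
    """
    votes = defaultdict(Counter)
    for xs, cls in zip(X, y):
        cell = tuple(
            max(terms, key=lambda t: terms[t](xs[d]))   # best-matching term
            for d, terms in enumerate(partitions)
        )
        votes[cell][cls] += 1
    # keep, per occupied grid cell, the majority class as the rule consequent
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}
```

Only occupied cells yield rules, which is why the rule base stays far smaller than the full fuzzy grid; the membership-tuning and pruning steps described in the text would follow this initialization.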
Fig. 2. (a) The 20 lines used for the construction of the runway (true positives); (b) 3,281 lines classified as positive by NEFCLASS.
The NEFCLASS model has been continuously improved and extended over the last few years, with several implementations for different machine platforms. a Most of these extensions address the specific characteristics and problems of real world data and their analysis.

Symbolic Attributes. Real world data often contains symbolic (class-valued) information. Data mining algorithms that expect numerical attributes usually transform these attributes to artificial metric scales. However, it would be useful to be able to create fuzzy rules directly from data with symbolic attributes. NEFCLASS has been extended to deal with symbolic data by using mixed fuzzy rules, i.e. rules with fuzzy class memberships in antecedents and consequents. However, defining a linguistic term to label a consequent might be difficult, which might in turn lead to a reduction of interpretability. 18

a The most recent version, NEFCLASS-J, implemented in Java, is publicly available from our web site at http://fuzzy.cs.uni-magdeburg.de
Missing Values. Missing values are a common problem in many applications. NEFCLASS has been extended with a rather simple strategy to deal with missing values. 19 In the classification phase a missing value belongs to any fuzzy set with degree 1; thus, we do not have to make any assumptions about its real value. The same method can be used in the learning phase. When we encounter a missing value during rule creation, any fuzzy set can be included in the antecedent for the corresponding variable. Therefore, we create all combinations of fuzzy sets that are possible for the current training pattern. Similarly, the fuzzy sets of missing attributes remain unchanged by the fine-tuning heuristic.

Pruning Techniques. When a rule base is induced from data it often has too many rules to be easily readable, and thus gives little insight into the structure of the data. Therefore, to reduce the rule base, NEFCLASS uses several pruning techniques. 17,20,21 These methods are effective both in reducing the number of rules and in increasing generalization ability.

Learning from Unbalanced Data. In many practical domains the available training data is unbalanced. If the number of cases of each class varies too much, this causes problems for many classifiers, especially if the classes are not well separated. Classifiers tend to predict the majority class and do not take into account the special semantics of the problem. Often, false positives and false negatives are not equally harmful. NEFCLASS can handle such problems by allowing the user to specify the misclassification costs in a matrix. 12 These modifications allow us to use NEFCLASS in domains with even highly unbalanced class frequencies, where many classifiers fail.
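The missing-value treatment in the classification phase can be sketched in a few lines: an unknown input is compatible with every fuzzy set to degree 1, so it never restricts a rule's activation. The helper names below are ours, not NEFCLASS identifiers.

```python
def membership(mu, x):
    """Degree of membership, with the missing-value convention: an unknown
    input (represented here as None) belongs to any fuzzy set with degree 1."""
    return 1.0 if x is None else mu(x)

def rule_activation(antecedent, pattern):
    """Activation of one fuzzy rule: min over the per-input memberships
    (min acting as the fuzzy AND). `antecedent` is one membership function
    per input variable."""
    return min(membership(mu, x) for mu, x in zip(antecedent, pattern))
```

A pattern with a missing attribute is thus classified using only the information that is actually present, instead of being imputed or discarded.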
2.2. Determining Line Importances 12
In our study a set of 17 images depicting 5 different airports was used. Each of the images was analyzed by the production system to detect the runway(s). The lines were labeled as positive if they were used for runway construction, and negative otherwise. Four of the 17 images form the training dataset used to train NEFCLASS. The training set contains 253 runway lines and 31,330 negatives; the dataset is thus highly unbalanced. Experiments showed that the regions next to the lines bear useful information. For each line, a set of statistical (e.g. mean and standard deviation) and textural features (e.g. energy, entropy, etc.) was calculated from the gray values next to that line.
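Plausible versions of such features, computed from the gray values in a region next to a line, are sketched below. The study does not specify its feature definitions in this form, so these standard histogram-based definitions of energy and entropy are illustrative assumptions.

```python
import math

def line_features(gray_values, levels=256):
    """Statistical (mean, std) and textural (energy, entropy) features from
    the gray values of a region next to a line; gray values are assumed to
    be integers in [0, levels)."""
    n = len(gray_values)
    mean = sum(gray_values) / n
    var = sum((g - mean) ** 2 for g in gray_values) / n
    hist = [0] * levels
    for g in gray_values:
        hist[g] += 1
    p = [h / n for h in hist if h]                   # non-empty histogram bins
    energy = sum(pi * pi for pi in p)                # high for uniform regions
    entropy = -sum(pi * math.log2(pi) for pi in p)   # high for varied texture
    return {"mean": mean, "std": math.sqrt(var),
            "energy": energy, "entropy": entropy}
```

Such per-line feature vectors are the inputs from which a classifier like NEFCLASS can learn to separate promising lines from discardable ones.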
The semantics of misclassifications in this task are asymmetric. The positive lines are the minority class and thus easily ignored by a classifier. However, every missed positive can turn out to be very expensive, as it can hinder successful object recognition, whereas misclassifying negative lines just increases processing time. With NEFCLASS this could be represented by specifying asymmetric misclassification costs: the costs of false negatives (i.e. positive lines wrongly classified as negative) were empirically set to 300 times the costs of false positives. After learning, the NEFCLASS pruning techniques were used to reduce the number of rules from over 500 to under 20. The best result was obtained with 16 rules. The lines from the remaining 13 images were used as test data. The quality of the result can be characterized by a detection and a reduction rate. The detection rate is defined as the ratio of correctly detected positives to all positives; the higher this value, the higher the probability of a successful recognition. The average detection rate on the unseen images was 84%, varying from 50% to 100%. The second measure is the reduction rate, defined as the ratio of lines classified as positive to the total number of lines; the lower this value, the shorter the processing time. The average reduction rate on the unseen images was 17%. For most of the test images — even those with lower detection rates — the image analysis was successful, as the missed lines are mainly shorter and less important. Figure 2b shows the lines NEFCLASS classified as positive in the example image, which was one of the unseen images. On this image the number of lines was reduced to one tenth, which, given the quadratic time complexity, means a reduction of processing time by a factor of more than 100.

3. Analyzing Optimal Parameters

The production systems described in Sec. 2 are parameterized.
Typical examples of parameters are thresholds, like the maximum length of a gap between two line segments that may be bridged, or the minimal and maximal distance between two long parallel lines to build a stripe. These parameters are used to adapt the algorithms to varying scenarios, different applications and changing image material. However, this adaptation of the parameters is not always obvious. Some of these parameters depend on the geometric features of the modeled objects. For example, the exact descriptions of an airfield can be taken from maps or construction plans. Thus, we can specify how long or wide runways can be in the real world. Together
A. Klose & R. Kruse
with knowledge about the sensor used, sensor parameters, image resolution and image scaling in a specific image, we can determine the parameter values in the pixel domain. These parameters are called model parameters. Other parameter values cannot be derived from this easily accessible image information. Parameters like the maximum tolerated deviations from collinearity of lines, or parameters of the (low level) edge detection, strongly depend on the image quality (e.g. low contrast, blurred images, low signal-to-noise ratio, partial occlusions). These so-called tolerance parameters may also depend on meta data like sensor type and image resolution. In our case, after fixing all known model parameters to suitable values, there are still more than ten variable parameters for the whole processing chain. However, human experts with experience and profound knowledge about the algorithms can often find reasonable parameter settings by looking at the image, often with just a few tries. Unfortunately, the experts cannot easily explain their intuitive tuning procedure. Therefore, we investigated approaches to support and automate the adaptation of the image processing algorithms to changing requirements. We try to discover relationships between image properties and optimal parameters by using data mining techniques.

To analyze the dependencies, we needed a set of example images with corresponding optimal parameter values. Our first step was to set up a database of sample images with varying quality and from different sensors. When choosing the images we had to make a compromise. On the one hand, we had to cover an adequate variety of images to obtain statistically significant results. On the other hand, we could only process a limited number of images due to time restrictions. Optimization implies many iterations of the structural analysis, which can be quite time-consuming for these high-resolution images.
To stay within tolerable processing time spans we limited the initial database to 50 images (Figure 3 shows two examples); we might extend this database in the future. For all images in our database we manually defined a ground truth, i.e. the exact position of the runways. We defined a measure to assess the results of edge detection, i.e. to assess how well extracted line segments match a runway defined in the ground truth. The structural image analysis was then embedded in an evolutionary optimization loop, and tolerance parameters were determined to maximize the assessment measure.22

Our first suggestion for the problem of finding a suited parameter tuple p ∈ P for a given image i ∈ I was to assume a set of suitable image
Fig. 3. Images from the sample database.
features f1(i), ..., fn(i) ∈ ℝ and a function φ: ℝⁿ → P such that p = φ(f1(i), ..., fn(i)). If we had appropriate image features, finding φ could be understood as a regression or function approximation task, for which a set of standard techniques exists. We experimented with a set of features that were adapted versions of those that proved useful in the line filtering study from Section 2.2. The results are described in Ref. 22. It turned out that the ad hoc choice of image features was not satisfactory. The main problem of this approach is that we do not know which image features contain the relevant information for parameter selection. It may even be that such features do not exist, or are too complex to be calculated in a reasonable amount of time. Therefore, another idea was to group similar images and then determine a commonly suited parameter tuple for each group. This approach would have several advantages over the parameter regression approach. Most importantly, by finding optimal parameters for groups of images, more general parameters are preferred, and thus local optima are avoided. Thus, the parameters will probably be more robust for new images. Determination of
parameters for a new image then means finding the most similar group and using the corresponding parameters. The search for groups of objects is a cluster analysis task. However, from our previous analyses22 we conclude that we should avoid using image features until we know which features reflect relevant information. The key to our solution is to analyze image "behavior." If we find groups of images that behave similarly with respect to different parameters, i.e. commonly yield good results for some parameters and commonly worse results for others, we can analyze what these groups have in common in a second step. This supports the finding of hypotheses for suitable image features. Searching for common behaviors can be done by applying the same set of parameter tuples to all images, and analyzing the results. If we apply a set of parameter tuples to each image, we get a corresponding set of assessment values for each image, forming an assessment vector which characterizes image behavior. Thus, we do not directly use the parameter space for analysis, nor do we need image features. However, an adequate meaning of similar has to be defined. We use fuzzy measures to define an appropriate similarity measure. Fuzzy set theory offers a flexible tool to incorporate our knowledge about the domain into this definition. Many standard cluster analysis algorithms rely on numeric (i.e. real-valued) data and standard metrics like the Euclidean distance. The use of non-standard similarity or distance measures makes other cluster methods necessary. Suitable methods include hierarchical clustering23 or partitioning methods using evolutionary algorithms.24 Hierarchical clustering algorithms are relatively fast, due to their straightforward mechanism, but they tend to run into local optima and thus can deliver unstable results. Therefore, we decided to use evolutionary clustering algorithms.
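Gathering this "behavior" data can be sketched as follows: apply every parameter tuple to every image and record the assessment, yielding one assessment vector per image. `process` and `assess` are hypothetical stand-ins for the structural analysis chain and the ground-truth-based assessment measure:

```python
# Sketch (with invented stand-in functions) of building the assessment
# vectors: one row per image, one column per parameter tuple.

def assessment_matrix(images, truths, param_tuples, process, assess):
    """Row i is the assessment vector of image i over all parameter tuples."""
    return [[assess(process(img, p), gt) for p in param_tuples]
            for img, gt in zip(images, truths)]

# Toy stand-ins for the real analysis chain (purely illustrative):
process = lambda img, p: abs(img - p)          # pretend analysis "result"
assess = lambda res, gt: max(0.0, 1.0 - res)   # pretend quality in [0, 1]

m = assessment_matrix([0.2, 0.9], [None, None], [0.1, 0.5, 1.0], process, assess)
# Row m[i] characterizes how image i "behaves" across the parameter set.
```

In the study this matrix has 50 rows (images) and 495 columns (parameter tuples); rows serve as the image-behavior descriptions analyzed below.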
3.1. Evolutionary Cluster Analysis
Evolutionary algorithms have been established as widely applicable global optimization methods.5,25,26 The mimicking of biological reproduction schemes that underlies evolutionary algorithms has been successfully applied to all kinds of combinatorial problems. Even for many NP-hard problems, for which complete searches for the optimum solution are infeasible, evolutionary algorithms yield near-optimal solutions of high quality. Successful solutions of graph-theoretic problems (e.g. the subset sum problem, the maximum cut problem, the minimum tardy problem, and
equivalent problems) have been presented.27 Non-hierarchical clustering is a related problem. Evolutionary algorithms have been applied to this problem class for a very long time; one of the earliest experiments on this topic goes back to 1978.28 There are, however, many recent results.24,29,30 Evolutionary algorithms are largely problem independent. However, it is necessary to choose an appropriate scheme to encode potential solutions as chromosomes, to define a function to assess the fitness of these chromosomes, and to choose reasonable operators for selection, recombination and mutation.

There are basically two intuitive ways to encode a partitioning of n objects into k clusters:

• Each object o is assigned to a cluster c(o) ∈ {1, ..., k}. A chromosome thus has n genes, and the solution space has a cardinality of kⁿ. Notice that not all solutions are different, and that not all solutions make use of all clusters.

• Each cluster c is represented by a prototype p(c) ∈ P. Each object o is assigned to the cluster with the most similar prototype. The chromosomes thus have k genes, where k is smaller than the number of objects (otherwise clustering becomes trivial). The search space cardinality is |P|ᵏ. If P = {o1, ..., on}, i.e. only the objects themselves are allowed as prototypes, the search space size is nᵏ (which is much smaller than kⁿ in the first encoding). Again, if not constrained explicitly, some of the solutions are identical (i.e. there are only (n choose k) different choices of prototypes).

The fitness function must be defined to assess the quality of chromosomes, i.e. the quality of partitionings. We suggest the use of fuzzy sets to define these measures, as this enables us to incorporate our domain specific knowledge (see Sec. 3.2). We chose tournament selection as our selection operator.31 When we select parents in the mating phase, we randomly choose two chromosomes from the pool and take the fitter of the two.
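A minimal sketch of the prototype encoding, the fitness function, and tournament selection just described; `sim(i, j)` is a hypothetical stand-in for the fuzzy similarity measure defined in Sec. 3.2:

```python
import random

# Sketch of the prototype encoding: a chromosome holds k object indices
# serving as cluster prototypes. Fitness sums, over all objects, the
# similarity to the most similar prototype. `sim` below is invented.

def fitness(chrom, n_objects, similarity):
    return sum(max(similarity(o, p) for p in chrom)
               for o in range(n_objects))

def tournament_select(pool, fitness_values):
    # Randomly pick two chromosomes from the pool, return the fitter one.
    a, b = random.randrange(len(pool)), random.randrange(len(pool))
    return pool[a] if fitness_values[a] >= fitness_values[b] else pool[b]

# Toy example: six 1-d "objects" in two obvious groups, k = 2 prototypes.
objs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sim = lambda i, j: 1.0 / (1.0 + abs(objs[i] - objs[j]))
good = [1, 4]   # one prototype per group
bad = [0, 1]    # both prototypes in the same group
assert fitness(good, len(objs), sim) > fitness(bad, len(objs), sim)
```

The fitness function rewards chromosomes whose prototypes cover all groups, which is exactly what the evolutionary search exploits.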
In comparison to fitness proportional approaches, this algorithm does not depend on the scaling of the fitness function, and is thus more robust. Additionally, it is computationally cheap in comparison to rank-based selection, which implies sorting the chromosomes by fitness. Children are combined from their parents using a two-point crossover of the chromosome strings. Two positions in the chromosomes are randomly chosen. The genes in between these positions are copied from one parent to
the offspring, while the remaining genes are copied from the other parent. This simple operator works fine in our example, as we do not impose any constraints and thus have no need for a repair mechanism. The use of two-point crossover avoids the positional bias of one-point crossover.26 For the mutation operator, we also use a common approach. We change each of the g genes in a chromosome with a probability of 1/g, where changing means setting the gene to a new random value.27 The evolutionary algorithm is randomly initialized with a pool of 400 chromosomes. It then produces new generations of partitions by applying selection, recombination and mutation. The algorithm is stopped when the best fitness does not increase for 80 generations or a limit of 1000 generations is exceeded.

3.2. Analyzing Image Similarities
The fitness function is supposed to measure the quality of the corresponding partitionings. It is common in cluster analysis to maximize the inner cluster homogeneity, i.e. the similarity between the objects of one cluster. Our objects, the images, are described by their assessment vectors. Thus we have to define a similarity on these vectors. We first considered geometrically motivated measures like linear correlation, or the cosine between vectors. However, this raises difficulties. Due to the high dimensionalities of the vectors, correlations are near zero and angles near 90° for most pairs of vectors. Furthermore, these measures do not explicitly differentiate between high and low assessment values. Our goal is a definition of a similarity measure that captures the behavior of the images and groups them accordingly. We believe that it is more appropriate not to expect images to behave identically for all parameters, but to group images that can be processed with an equal subset of parameters. Thus, high assessment values are more important, and must be treated differently from low values. We can think of the assessment vectors as fuzzy sets, i.e. functions a_i: P → [0, 1] that for a given image i ∈ I = {1, ..., n} map parameter tuples p ∈ P to assessment values a_i(p) ∈ [0, 1]. This formally satisfies the definition of a fuzzy set. An interpretation of such a fuzzy set a_i could be: "The set of parameters that are suited to processing image i." We can easily create crisp sets a_i,crisp with the same meaning by applying a threshold to the assessments, and thus having a binary decision "suited/not suited." If we wanted to define a condition on these crisp sets
for a suitable (crisp) prototype a_p,crisp for a subset C ⊆ I of images, we could intuitively demand that

a_p,crisp ⊆ a_i,crisp for all i ∈ C,

i.e. that the prototypical parameters are suited for all images. We use fuzzy set operations to extend the same intuitive idea to our assessment fuzzy sets a_i. We used the following measure of (partial) subsethood:32

s(a, b) = |a ∩ b| / |a|,    (1)

where a and b are fuzzy sets, the intersection ∩ is the minimum function, and the cardinality is defined as |a| = Σ_x a(x). The measure of subsethood tends to assume higher values for fuzzy sets with small cardinalities. However, we want to find prototypes that also perform well. Thus, for our similarity measure, we multiplied the subsethood by the assessment for the best common parameter, i.e. sup_x (a ∩ b)(x). As a fitness function we took the sum over all similarity values between each image and its associated (i.e. most similar) prototype.

For our analysis we took the 10 best parameter tuples for all but one of the 50 images. Since the last image was of very low quality, we took only 5 parameters for it. Thus, we used a total of 495 parameter tuples. The choice of parameters is actually of secondary importance, as long as it ensures adequate coverage of the parameter space. This means that we should have parameters with good and bad assessments for each image. We then used these 495 parameters in the processing of each of the 50 images and calculated the corresponding vector of 495 assessments. We used the second partition encoding scheme, with the possible prototypes chosen from the set of images. We repeatedly ran the evolutionary clustering with values k ∈ {3, ..., 30} for the number of clusters. The fitness value reached its maximum after about one to two hundred generations. For all values of k we found the same partitionings for most of the repeated runs. Thus, the search procedure seems to be quite robust and adequate for the problem. We can observe that the cluster prototypes are rather stable over the number of clusters, i.e. we do not observe many changes in the most suitable prototypes. Figure 4b shows which images are chosen as prototypes and how many images are assigned to them.
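The fuzzy similarity just defined, i.e. the subsethood of Eq. (1) multiplied by the assessment of the best common parameter, can be sketched as follows; the toy assessment vectors are invented:

```python
# Sketch of the similarity between an image's assessment vector and a
# prototype's, treated as fuzzy sets over the parameter tuples.

def subsethood(a, b):
    """Eq. (1): s(a, b) = |a ∩ b| / |a|, with minimum intersection
    and sigma-count cardinality."""
    inter = [min(x, y) for x, y in zip(a, b)]
    return sum(inter) / sum(a)

def similarity(a_img, a_proto):
    """Subsethood weighted by the assessment of the best common parameter."""
    inter = [min(x, y) for x, y in zip(a_img, a_proto)]
    return subsethood(a_img, a_proto) * max(inter)

a_i = [0.9, 0.8, 0.1, 0.0]   # image: works well with the first two parameters
a_p = [1.0, 0.7, 0.2, 0.1]   # prototype: also favors the first two
print(similarity(a_i, a_p))  # ≈ 0.85
```

The weighting factor max(inter) ensures that a prototype is only considered similar if there is at least one parameter tuple that works well for both.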
We see that the three images used as prototypes in the case of three clusters seem to be relatively general, as they are also used for higher numbers of clusters. However, the fitness value increases continuously with an increasing number of clusters
Fig. 4. (a) Fitness value of best partition vs. number of clusters; (b) assignments of images to prototypes.
(see Fig. 4a). As the curve does not contain any marked steps, we conclude that the images do not have a natural partitioning (with respect to our similarity measure) that prefers a certain number of clusters. Actually, this is not surprising and does not mean that the resulting clusters are random, because most of the possibly relevant features like image quality change gradually. Visual inspection of the images in the clusters shows some regularities. Most apparent, many groups contain only images from one sensor type (e.g. only visible band or only SAR). This means that images from different sensors need different parameter values — a result that we actually expected. It confirms that our definition of similarity on the assessment vectors is reasonable. Another observation was more surprising. We expected to get groups of images with different quality (e.g. varying noise, contrast, etc.). However, there were several groups that contained several (also quite different) images of one airfield only. Our assumption is that some of the parameters might still be model dependent and thus could be eliminated from the processing chain. However, this assumption will have to be validated in the future.
4. Analyzing Parameter Similarities

To retrieve more knowledge about the image processing, it might be reasonable to analyze the parameters and their relations. One option is the analysis of parameter similarities. A definition of similarity in the parameter
space does not seem to be a promising approach. As we have no a priori information about the influence of the individual parameters, defining a similarity measure on them would be similar to the use of the uncertain image features in the system identification approach outlined above. Analogous to the analysis of image similarities, we suggest an analysis of the gathered assessment data instead, and define a measure of parameter behavior. The assessments can be seen as a table with the images in the rows and the different parameter settings in the columns. In Sec. 3, we used the rows as descriptions of image behavior. Correspondingly, we now use the columns as descriptions of parameter behavior, i.e. how well a fixed parameter set performs on the different sample images. Thus, we get 495 assessment vectors of 50 dimensions each (i.e. one dimension per image). For clustering we did not use the evolutionary approach, for several reasons. First, the intuitive motivation used for the fuzzy similarity definition above cannot be applied to the parameter assessment vectors. Second, the parameter settings themselves are vectors of (real-valued) individual parameters, which have to be analyzed. As clusters of vectors are less expressive than clusters of images, we used a different approach, namely self-organizing maps (SOMs). We applied SOMs successfully to document retrieval in a recent project, where text document collections were mapped to two dimensions by SOMs and an appropriate similarity measure.33 We will next briefly review the basic ideas of self-organizing maps.

4.1. Self-Organizing Maps4
Self-organizing maps are a special neural network architecture that clusters high-dimensional data vectors according to a similarity measure. The clusters are arranged in a low-dimensional topology that preserves the neighborhood relations of the high-dimensional data. Thus, not only are objects that are assigned to one cluster similar to each other (as in every cluster analysis), but objects of nearby clusters are also expected to be more similar than objects in more distant clusters. Usually, two-dimensional grids of squares or hexagons are used (as in Fig. 5). Although other topologies are possible, two-dimensional maps have the advantage of intuitive visualization and thus good exploration possibilities. Self-organizing maps are used for unsupervised clustering of high-dimensional sample vectors. The network structure has two layers. The
neurons in the input layer correspond to the input dimensions. The output layer (the map) contains as many neurons as clusters needed. All neurons in the input layer are connected with all neurons in the output layer. The weights of the connections between the input and output layers encode positions in the high-dimensional data space. Thus, every unit in the output layer represents a prototype. Before the learning phase of the network, the two-dimensional structure of the output units is fixed and the weights are initialized randomly. During learning, the sample vectors are repeatedly propagated through the network. The weights w_s of the most similar prototype (the winner neuron) are modified such that the prototype moves toward the input vector w_i:

w′_s = w_s + α · (w_i − w_s),    (2)

where α is a learning rate. To preserve the neighborhood relations, prototypes that are close to the winner neuron in the two-dimensional structure are also moved in the same direction. The closeness is defined by a neighborhood function v that monotonically decreases with the distance from the winner neuron. Therefore, the adaptation rule is extended by v:

w′_s = w_s + v(i, s) · α · (w_i − w_s).    (3)
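The update scheme of Eqs. (2)-(3) can be sketched as a minimal runnable example (not the authors' implementation); a Gaussian neighborhood stands in for v, and α is the learning rate:

```python
import math
import random

# Minimal SOM sketch: a 2-d grid of prototypes; the winner (and its grid
# neighbors, weighted by a Gaussian neighborhood v) moves toward each input.

def train_som(data, rows, cols, dim, epochs=50, alpha=0.3, radius=1.0):
    random.seed(0)
    # One prototype (weight vector) per grid node.
    w = {(r, c): [random.random() for _ in range(dim)]
         for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            # Winner: node whose weights are closest to the input.
            s = min(w, key=lambda n: sum((wi - xi) ** 2
                                         for wi, xi in zip(w[n], x)))
            for n in w:
                # Neighborhood weight decreasing with grid distance to winner.
                d = math.dist(n, s)
                v = math.exp(-(d * d) / (2 * radius * radius))
                w[n] = [wi + v * alpha * (xi - wi)
                        for wi, xi in zip(w[n], x)]
    return w

# Toy run: two separated 2-d clusters should map to different grid regions.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=3, dim=2)
```

Because neighbors of the winner are dragged along, nearby grid nodes end up with similar prototypes, which is what preserves the topology.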
By this learning procedure, the structure in the high-dimensional sample data is non-linearly projected to the lower-dimensional topology. After learning, arbitrary vectors (i.e. vectors from the sample set or previously unknown vectors) can be propagated through the network and are mapped to the output units. For further details on self-organizing maps see Ref. 34.

4.2. Visualizing the Parameters
We used a SOM to assign the 495 parameter assessment vectors to a two-dimensional map. Figure 5 shows the results for a map of 10×7 neurons. To visualize the corresponding parameters, we colored the map nodes using the node-wise minimum or maximum (left and right columns of Fig. 5) of a single parameter of the parameter set. In Fig. 5 we show the results for three of the ten parameters (minwidth, minmag, and distS). The white spots are nodes with no assigned parameter sets. If we analyze the resulting maps, we can see that the parameter values are distributed quite differently. Interestingly, for minwidth and
"S
(a)
(b)
(c)
(d)
I (f)
(e)
Fig. 5. Examples of parameter visualizations, (a), (c), and (e) are the minimal and (b), (d), and (f) the maximal values of individual parameters minwidth, minmag and distS per map node.
minmag the minimal and maximal values differ only slightly, i.e. the colorings of the maps in the left and right columns are quite similar. In contrast, for distS most nodes cover widely differing minimum and maximum values, i.e. the values are evenly distributed over the map nodes. Therefore we can assume that distS has little influence on the image processing. (Note that we did not use the parameter values themselves to determine their similarity.)
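The node-wise coloring can be sketched as follows: for every map node, collect the raw values of one parameter over the parameter tuples assigned to that node, and take their minimum and maximum (the data here is invented):

```python
# Sketch of the coloring step: per map node, the min and max of one raw
# parameter over the parameter tuples assigned to that node.

def node_min_max(assignments, param_values):
    """assignments: the map-node id for each parameter tuple;
    param_values: the raw value of one parameter for each tuple."""
    per_node = {}
    for node, value in zip(assignments, param_values):
        per_node.setdefault(node, []).append(value)
    return {n: (min(v), max(v)) for n, v in per_node.items()}

# Toy data: 5 tuples mapped onto 2 nodes. Node 0 sees a narrow value range
# (the parameter matters there); node 1 a wide one (the parameter is
# arbitrary, hinting at little influence, as observed for distS).
colors = node_min_max([0, 0, 1, 1, 1], [2.0, 2.1, 0.5, 9.0, 4.2])
print(colors)  # {0: (2.0, 2.1), 1: (0.5, 9.0)}
```

A node whose min and max nearly coincide constrains that parameter; a node spanning the whole value range does not.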
The maps for the two other parameters both show similar colors in adjacent nodes. This shows that these parameters are more important for the behavior of the processing. However, we have a horizontal split for minwidth and a vertical split for minmag. We can see that we have basically two levels for parameter minwidth, which are about equally frequent. The parameter minmag is mostly rather small. The few greater values are squeezed to the upper border of the map, which indicates a somewhat distinct behavior. This information can be used to remove parameters or to choose likely appropriate combinations of parameter values. We will have to analyze whether the good parameter sets for certain images are restricted to certain areas of the map, which would be helpful for choosing appropriate parameters.

5. Concluding Remarks

Images can be important sources of information. As the saying goes, "A picture is worth a thousand words." Unfortunately this does not scale: a thousand pictures, for instance, are not necessarily worth a million words, especially since human users are easily overwhelmed by too many images. In contrast to numerical data, often only a few images, or even a single one, are likely to bear interesting information. Therefore, in large image archives, one prominent goal of data mining (as a tool to enable users to understand and access their data) is to guide user attention and to reduce the number of images to just the interesting ones. We believe that a variety of techniques will be necessary for this task, and that soft computing techniques will play an important role. We have shown examples to support this process, where techniques from fuzzy sets, neural networks and evolutionary algorithms have been useful tools to model our knowledge about the domain, and to get understandable results. Ultimately, these techniques also allow new insights into the image analysis process.

References

1. D.
Dubois, H. Prade, and R. R. Yager. Information engineering and fuzzy logic. In Proc. 5th IEEE Intl. Conf. on Fuzzy Systems FUZZ-IEEE'96, New Orleans, LA, pp. 1525-1531 (IEEE Press, Piscataway, NJ, 1996).
2. H. Bandemer and W. Näther. Fuzzy Data Analysis: Mathematical and Statistical Methods (Kluwer, Dordrecht, 1992).
3. L. A. Zadeh. Computing with words. IEEE Transactions on Fuzzy Systems, 4:103-111 (1996).
4. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69 (1982).
5. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).
6. C. Faloutsos, R. Barber, M. Flickner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. J. of Intelligent Information Systems, 3:231-262 (1994).
7. D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59-74 (1998).
8. N. Vasconcelos and A. Lippman. A Bayesian framework for semantic content characterization. In Proc. Intl. Conf. Computer Vision and Pattern Recognition CVPR, pp. 566-571 (1998).
9. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Advances in Knowledge Discovery and Data Mining (AAAI Press / MIT Press, Cambridge, MA, 1996).
10. U. Fayyad and P. Smyth, Eds. Image Database Exploration: Progress and Challenges (AAAI Press, Menlo Park, CA, 1993).
11. S. Gibson, O. Kosheleva, L. Longpre, B. Penn, and S. A. Starks. An optimal FFT-based algorithm for mosaicing images, with applications to satellite imaging and web search. In Proc. 5th Joint Conf. on Information Sciences JCIS 2000, pp. 248-251 (2000).
12. A. Klose, R. Kruse, K. Schulz, and U. Thönnessen. Controlling asymmetric errors in neuro-fuzzy classification. In Proc. ACM SAC'00, pp. 505-509 (ACM Press, 2000).
13. U. Stilla, E. Michaelsen, and K. Lütjen. Automatic extraction of buildings from aerial images. In Leberl, Kalliany, and Gruber, Eds., Mapping Buildings, Roads and other Man-Made Structures from Images, Proc. IAPR-TC7 Workshop, Graz, pp. 229-244 (Oldenbourg, München, 1996).
14. R. Scharf, H. Schwan, and U. Thönnessen. Reconnaissance in SAR images. In Proc. of the Eur. Conf. on Synthetic Aperture Radar, Friedrichshafen, pp. 343-346 (VDE-Verlag, Berlin, 1998).
15. H. Schwan, R. Scharf, and U. Thönnessen. Reconnaissance of extended targets in SAR image data. In Serpico, Ed., Image and Signal Processing for Remote Sensing IV, Proc. Eur. Symposium on Remote Sensing, Barcelona, September 21-24, pp. 164-171 (1998).
16. L.-X. Wang and J. M. Mendel. Generating fuzzy rules by learning from examples. IEEE Trans. Syst., Man, Cybern., 22(6):1414-1427 (1992).
17. D. Nauck, F. Klawonn, and R. Kruse. Foundations of Neuro-Fuzzy Systems (Wiley, Chichester, 1997).
18. D. Nauck and R. Kruse. Fuzzy classification rules using categorical and metric variables. In Proc. 6th Int. Workshop on Fuzzy-Neuro Systems FNS'99, pp. 133-144 (Leipziger Universitätsverlag, Leipzig, 1999).
19. D. Nauck, U. Nauck, and R. Kruse. NEFCLASS for JAVA - new learning algorithms. In Proc. 18th Intl. Conf. of the North American Fuzzy Information Processing Society NAFIPS'99, pp. 474-476 (IEEE Press, New York, 1999).
20. A. Klose, A. Nürnberger, and D. Nauck. Some approaches to improve the interpretability of neuro-fuzzy classifiers. In Proc. EUFIT'98, pp. 629-633 (1998).
21. A. Klose and A. Nürnberger. Applying Boolean transformations to fuzzy rule bases. In Proc. EUFIT'99, 6 pages (published on CD-ROM, 1999).
22. A. Klose, R. Kruse, H. Gross, and U. Thönnessen. Tuning on the fly of structural image analysis algorithms using data mining. In Priddy, Keller, and Fogel, Eds., Applications and Science of Computational Intelligence III, Proc. SPIE AeroSense'00, Orlando, FL, pp. 311-321 (SPIE Press, 2000).
23. A. Jain and R. Dubes. Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ, 1988).
24. E. Falkenauer. The grouping genetic algorithms - widening the scope of the GAs. Belgian Journal of Operations Research, Statistics and Computer Science, 33:79-102 (1993).
25. J. H. Holland. Adaptation in Natural and Artificial Systems (The University of Michigan Press, Ann Arbor, MI, 1975).
26. M. Mitchell. An Introduction to Genetic Algorithms (MIT Press, Cambridge, MA, 1998).
27. S. Khuri, Th. Bäck, and J. Heitkötter. An evolutionary approach to combinatorial optimization problems. In Proc. 22nd Annual ACM Computer Science Conf. CSC'94, Phoenix, pp. 66-73 (ACM Press, New York, 1994).
28. V. V. Raghavan and K. Birchard. A clustering strategy based on a formalism of the reproductive process in natural systems. In Proc. 2nd Intl. Conf. of Research and Development in Information Retrieval, pp. 10-22 (1978).
29. D. R. Jones and M. A. Beltramo. Solving partitioning problems with genetic algorithms. In R. Belew and L. Booker, Eds., Proc. 4th Intl. Conf. Genetic Algorithms, pp. 442-449 (Morgan Kaufmann, Los Altos, CA, 1991).
30. T. Van Le. Fuzzy evolutionary programming for image processing. In Proc. Int. Conf. on Intelligent Processing and Manufacturing of Materials, Gold Coast, Australia, pp. 497-503 (1997).
31. D. E. Goldberg and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. In G. Rawlins, Ed., Foundations of Genetic Algorithms, pp. 69-93 (Morgan Kaufmann, 1991).
32. G. J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic (Prentice Hall, Englewood Cliffs, NJ, 1995).
33. A. Klose, A. Nürnberger, R. Kruse, G. Hartmann, and M. Richards. Interactive text retrieval based on document similarities. Physics and Chemistry of the Earth, 25(8):649-654 (2000).
34. T. Kohonen. Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1984).
CHAPTER 11

ROBUST FINGERPRINT IDENTIFICATION BASED ON HYBRID PATTERN RECOGNITION METHODS
D.-W. Jung* and R.-H. Park
Department of Electronic Engineering, Sogang University
C.P.O. Box 1142, Seoul 100-611, Korea
E-mail: [email protected]
For personal identification, fingerprint identification has been much more widely used than other automatic biometric identification methods. For reliable feature extraction we analyze the direction of ridges, and the structural characteristics, such as the ridge line width and interval, are automatically estimated based on the ridge line following algorithm. To eliminate false features outside the region of interest (ROI), we generate adaptive matching boundaries. The concept of fuzzy sets is applied to the extracted features and the quantitative feature values are defined. In the feature matching stage, we introduce a weighted matching score using quantitative feature values to guarantee reliable matching results. Furthermore, a two-step estimation of transformation parameters is employed to reduce the computational complexity. The experimental results show that our system can achieve a fast personal identification with a good performance.
1. Introduction

In the electronically interconnected information society, large amounts of information are easily exchanged through computer networks. As today's society becomes more complex, the security of information becomes more important.1,2 Obtaining positive identification of parties to an exchange of information is a vital element of data security. Thus, various methods for automatic personal identification have been proposed. Biometrics

*He is now with NGT Co., Jayang B/D, 31-8, Munjong-dong, Songpa-Gu, Seoul, Korea (E-mail: reoidom710ngt.co.kr).
is the science of analyzing biological observations, and is the basis of automated methods for recognizing a person based upon physical or behavioral characteristics.3 Biometric approaches use fingerprints, speech patterns, facial features, retinal scans, handwriting, hand veins and hand geometry as recognizable and identifying human traits.4,5 For personal identification, fingerprint identification has been much more widely used than other biometric identification methods. The major reason for the wide usage and popularity of fingerprints as a means of personal identification is that fingerprints are unique and do not change as a person ages.6 Fingerprint identification is one of the most interesting pattern recognition methods for personal identification.7 Fingerprint identification techniques guarantee confidence and stability in personal identification, and thus they have been used in various applications.7,8,9 Fingerprint identification is applied in computer industries such as e-business, networks, groupware, software licensing, and peripherals (mouse and keyboard). Moreover, it is utilized in automated teller machines (ATMs), car ignition systems, locks on safes or doors, and smart cards.

There are several practical problems in fingerprint identification systems. Each time a fingerprint is acquired, location and shape distortion occurs because of the elasticity of the skin.10,11,12 Moreover, high confidence and real-time processing are important factors in automatic fingerprint identification systems (AFIS). To solve these problems, extraction of features13,14 from the fingerprint image and its application to matching have been investigated.15,16

Figure 1 illustrates the system flowchart of the proposed fingerprint identification system. The fingerprint image is first acquired from the inkless fingerprint sensor, and it is verified or stored in the database.
To obtain more reliable feature extraction results, the input fingerprint image is analyzed and the ridge line parameters are estimated. Then, the ridge line following algorithm is applied to extract features. To reflect the characteristics of each fingerprint image, we generate adaptive matching boundaries for eliminating false features outside the region of interest (ROI). We apply the concept of fuzzy sets to the features and evaluate quantitative feature values. If the features are to be registered, they are inserted into the database. Otherwise, they are passed to the matching stage. In the fingerprint matching stage, the estimation of transformation parameters is divided into two steps for fast computation, and the matching score is computed based on fuzzy weighting. Finally
Robust Fingerprint Identification
Fig. 1. Flowchart of the proposed fingerprint identification system: fingerprint image acquisition; fingerprint image analysis and ridge line parameter estimation; ridge line following and feature extraction; generation of adaptive matching boundaries; feature analysis (evaluation of the quantitative feature value using the concept of fuzzy sets); fingerprint matching (two-step estimation of transformation parameters) based on fuzzy weighting; personal authentication.
the degree of matching (indicating the degree of similarity) is computed for all fingerprints in the database, and personal identification is complete. The rest of this chapter is organized as follows. Section 2 explains robust feature extraction methods and the evaluation of quantitative feature values using the concept of fuzzy sets. In Sec. 3, we propose the weighted matching score and a two-step estimation of transformation parameters. Simulation results of the proposed hybrid method applied to real fingerprints are shown in Sec. 4. Finally, conclusions are given in Sec. 5.

2. Feature Extraction from Fingerprint Images

In a fingerprint, the main features are ridges and valleys, which alternate. Within ridge lines there are anomalies such as ridge bifurcations and ridge endings. We can define these anomalies as features. Fingerprint identification is based on the analysis of the features in a fingerprint. Thus, the performance of an AFIS depends on the accuracy of the extracted features. A large number of methods for detecting features in fingerprint images have been proposed. However, they are all essentially similar. Most of them transform gray level fingerprint images into binary images. Next, a thinning process is applied to the binary images. The features are then extracted from the thinned fingerprint images. However, the binary images may lose a lot of feature information and are sensitive to noise. Furthermore, the entire process is computationally intensive. Our feature extraction method is based on the ridge line following algorithm.17 The direction of the ridges and the structural characteristics of the ridge lines, such as the ridge line width and interval, are automatically estimated from the gray scale fingerprint image.

2.1. Feature Information in Fingerprint Images
When fingerprints are used for personal identification, most fingerprint identification systems use features. For user registration, features are extracted from the acquired fingerprint image and stored in the database. For personal identification, the extracted features are compared with the ones registered in the database. Various representations of features can be defined. We represent the feature information in terms of three fields: the location of the feature, the angle of the ridge line tangential to the feature with respect to the horizontal
direction, and the feature type (ending or bifurcation). Figure 2 shows the feature representation depending on the type, where (x0, y0) and θ denote the location of the feature and the angle of the ridge line, respectively.

Fig. 2. Feature representation: (a) ending feature; (b) bifurcation feature.

2.2. Modified Ridge Line Following Algorithm
In our system, the feature extraction algorithm is based on the ridge line following algorithm proposed by Maio and Maltoni.17 They proposed a technique, based on ridge line following, in which the features are extracted directly from gray level images. The basic idea of the ridge line following method is to follow the ridge lines on the gray level image by tracking according to the directional image of the fingerprint. A set of starting points is determined by superimposing a square-meshed grid on the gray level image. For each starting point, the algorithm keeps track of the ridge lines until they terminate or intersect other ridge lines. However, there are several problems in the ridge line following algorithm, so we have modified it for more reliable feature extraction. Figure 3 shows the overall block diagram of the modified ridge line following algorithm. The width of the fingerprint ridge line is estimated from the fingerprint image acquired by the input sensor. The section of the ridge line is searched for the local maximum. Once the location of the local maximum is determined, the local tangential direction of the ridge line is computed at that point.18 At each step, the stopping criteria are tested. If they are satisfied, we extract the feature; otherwise, we proceed with the ridge line following. We add an estimation of the ridge line width, and modify the stopping criteria.
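The following loop is a minimal, runnable sketch of this tracking scheme on a synthetic image. The function name, the parameter values (step size, section half-length, gray-level threshold), and the bright-ridge-on-dark-background convention are illustrative assumptions, and the intersection and excessive-bending stopping criteria of the full algorithm are omitted for brevity.

```python
import numpy as np

def follow_ridge(img, start, theta, mu=3, section_half=4, min_gray=64, max_steps=200):
    """Sketch of gray-scale ridge line following (illustrative parameters).

    From the current point, step `mu` pixels along the tangential direction
    `theta`, sample a short section orthogonal to it, and re-centre on the
    section's gray-level maximum.  Tracing stops on departure from the image
    (the ROI) or on termination of the ridge (an ending feature)."""
    h, w = img.shape
    y, x = start
    path = [(y, x)]
    for _ in range(max_steps):
        # tentative point one tracking step further along the ridge
        yt = y + mu * np.sin(theta)
        xt = x + mu * np.cos(theta)
        # sample the section orthogonal to the tangential direction
        offs = np.arange(-section_half, section_half + 1)
        ys = np.round(yt + offs * np.cos(theta)).astype(int)
        xs = np.round(xt - offs * np.sin(theta)).astype(int)
        if ((ys < 0) | (ys >= h) | (xs < 0) | (xs >= w)).any():
            return path, "roi"        # stopping criterion: left the ROI
        section = img[ys, xs]
        if section.max() < min_gray:
            return path, "ending"     # stopping criterion: ridge terminates
        k = int(np.argmax(section))   # re-centre on the section maximum
        y, x = int(ys[k]), int(xs[k])
        path.append((y, x))
    return path, "max_steps"
```

On a synthetic image containing one bright horizontal ridge, the trace stays on the ridge rows and returns "ending" when the ridge stops, or "roi" when it runs off the image.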
Fig. 3. Block diagram of the modified ridge line following algorithm: fingerprint image → estimation of the ridge line width → sectioning and maximum determination → computation of tangential directions → extracted features.
2.3. Estimation of the Ridge Line Width
In the ridge line following algorithm, the length of the section set and the interval of ridge line tracking are parameters. The optimal parameter values are set by the average width of the ridge lines. We use automatic estimation of the average ridge line width in the ridge line following algorithm to reliably extract the features. The length of the section set is two times the average ridge line width plus one pixel, and its direction is orthogonal to the direction tangential to the ridge line. The interval of ridge line tracking is the distance over which the direction tangential to the ridge line is estimated. There are several steps in estimating the average ridge line width. First, a normalization process is used to adjust the mean and variance of the gray levels in the acquired fingerprint image. To obtain an oriented window, the local orientation19,20,21 is estimated from the input fingerprint image, as shown in Fig. 4. The average widths of ridge lines and valleys are computed by analyzing the gray level patterns within the oriented window. The gray level pattern is very similar to a discrete sinusoidal wave, which has the same period as the ridge lines and valleys within the oriented window.22,23 In Fig. 4, the average ridge line and valley widths are 7.12 and 4.13 pixels, respectively. These values are applied to the ridge line following algorithm for automatic determination of the ridge line parameters.
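As a sketch of this width estimation, the function below measures average ridge and valley widths from a one-dimensional gray-level signature taken orthogonal to the local ridge orientation; the oriented-window construction itself is assumed to have already produced the signature. Treating ridges as the dark runs and splitting at the mean gray level are illustrative assumptions.

```python
import numpy as np

def average_widths(signature, thresh=None):
    """Estimate average ridge and valley widths from a 1-D gray-level
    signature (ridges = dark runs, valleys = bright runs, split at the
    mean gray level; a sketch, not the chapter's exact procedure)."""
    sig = np.asarray(signature, float)
    if thresh is None:
        thresh = sig.mean()
    dark = sig < thresh                                   # ridge pixels
    # indices where the dark/bright state flips, then the run segments
    change = np.flatnonzero(np.diff(dark.astype(int))) + 1
    runs = np.split(dark, change)
    ridge_runs = [len(r) for r in runs if r[0]]           # dark runs
    valley_runs = [len(r) for r in runs if not r[0]]      # bright runs
    ra = float(np.mean(ridge_runs)) if ridge_runs else 0.0
    va = float(np.mean(valley_runs)) if valley_runs else 0.0
    return ra, va
```

The resulting averages are the values that parameterize the ridge line following (section length and tracking interval).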
Fig. 4. Procedure of average ridge and valley width estimation.

2.4. Stopping Criteria
There are four stopping criteria that terminate ridge line following: departure from the ROI, termination, intersection, and excessive bending. The termination criterion is relatively sensitive to noise. Thus, we propose a new termination criterion, which is robust against noise and more reliable for extracting ending features than Maio and Maltoni's method.17 Figure 5 shows the procedure for detecting an ending feature. Figure 5(a) illustrates Maio and Maltoni's ending feature detection method. The point (i_c, j_c) becomes the current point, and a ridge line direction, φ_c,
Fig. 5. Detection of an ending feature: (a) Maio and Maltoni's method; (b) proposed method.
is computed. The section segment, with median point (i_t, j_t) and direction orthogonal to φ_c, is obtained, and the point (i_n, j_n) is chosen from the set of local maxima of the section segment. Then, the points (i_c, j_c) and (i_n, j_n) form the angle θ with respect to the direction φ_c.

2.5. Evaluation of the Quantitative Feature Value Using Fuzzy Sets
In the ridge line following algorithm, some false bifurcation features are generated by false ridge line following. Figure 6 illustrates the problem of false bifurcation features. The dotted lines illustrate the procedures of the ridge line following. A detected ridge line following (1) crosses a ridge line, and false bifurcation features are generated from this incorrect detection. In following (2), the incorrect trace of (1) is regarded as another ridge line,
Fig. 6. Example of false bifurcation feature declaration.
and a false bifurcation feature is extracted. Followings (3), (4), and (5) also intersect the trace of (1), and false bifurcation features are likewise extracted. In Fig. 6, four bifurcation features are incorrectly extracted. The false features lead to a low matching score or a false identification result, because the fingerprint matching process performs matching on the basis of the detected features. As a solution to this problem, we apply the concept of fuzzy sets24 to the features and evaluate quantitative feature values. Fuzzy sets were introduced by Zadeh25 as a new way to represent the vagueness of everyday life. Fuzzy interpretations of data structures are a very natural and intuitively plausible way to formulate and solve various problems. The concept of fuzzy sets can be used to represent input data as an array of membership values, denoting the degree to which the data possess certain properties and providing an estimate of missing information.26 This representation has been used to represent uncertainty in pattern recognition tasks.27,28,29 Figure 7 shows enlarged examples of detecting false features (caused by false ridge line following) in real fingerprint images, as depicted in Fig. 6. False ridge line following occurs because of excessively thin valleys or unclear ridge lines, which result in false features. The distribution of false features tends to be concentrated. Since the features are the critical primitives for representing fingerprint patterns, the reliability of personal identification is reduced if point matching is used with false features. To improve reliability, we apply the concept of fuzzy sets to the extracted features and define quantitative feature values, resulting in a robust fingerprint identification algorithm based on hybrid pattern recognition methods.
Fig. 7. Examples of detecting false features.
Let A = (a, c) denote a fuzzy set, where a represents the center and c signifies the spread or width of the fuzzy set. Let μ_A(x) denote the
Fig. 8. Membership function employed in the proposed method.
membership function of A, which is given by

    μ_A(x) = R((x − a)/c),  c > 0,    (1)

where R(x) is the reference function with the properties: R(x) = R(−x), R(0) = 1, and R is strictly decreasing on [0, ∞). If the reference function is chosen as R(x) = exp(−|x|^p), then the membership function30,31 becomes

    μ_A(x) = exp(−|x − a|^p / c^p),    (2)

where the parameter p determines the shape of the fuzzy membership function. If the parameters a, c, and p in Eq. (2) are set to 0, √30, and 2, respectively, then we obtain the membership function used in our study:

    μ_A(x) = exp(−x²/30),    (3)
which is illustrated in Fig. 8. To evaluate the quantitative feature value, we use Eq. (2). Figure 9 shows the weighted feature value window used to define the feature value, where each feature to be evaluated is centered at pixel (i, j) and the window size is M × M. Figure 9(a) shows the feature distribution (Case A), in which at least one feature exists in the neighborhood within the window. In Fig. 9(b), there is only one feature, centered in the window (Case B). Let d(u, v) denote the Euclidean distance32,33 between the center
Fig. 9. Weighted feature value window: (a) Case A; (b) Case B.
pixel (i, j) and the neighboring pixel (u, v), expressed as

    d(u, v) = √((u − i)² + (v − j)²).    (4)
Let μ_A(i, j) denote the two-dimensional (2-D) membership function of A. Eq. (4) is combined with Eq. (2), and thus the 2-D membership function for evaluating the feature value is given by

    μ_A(i, j) = Σ_{(u,v)≠(i,j)} f(u, v) exp(−d²(u, v)/c²),    (5)

where the sum is taken over the pixels (u, v) of the M × M window centered at (i, j), and

    f(u, v) = 1, if a feature exists at (u, v) within the window,
    f(u, v) = 0, otherwise,    (6)
where M is assumed to be odd. The parameter c determines the window size. The optimum window size is determined by the average widths of ridge lines and valleys in an image. Let WS(n) denote the size of the nth weighted feature value window, corresponding to the nth fingerprint image. Then WS(n) is defined by

    WS(n) = 3·Ra + 2·Va,    (7)

where Ra and Va denote the average widths of ridge lines and valleys, respectively. Assume that the window contains three ridge lines and two valleys. If the detected feature is located at the center of the window, we search for neighboring features around the center within the window
(see Fig. 9(a)). In such a case, most of these features are false features extracted by false ridge line following. To overcome this problem, the neighboring features around the center pixel are considered. Let QFV(n) indicate the quantitative feature value of the nth extracted feature from the fingerprint image. Then we define the proposed feature value by

    QFV(n) = max(1 − μ_A(i, j), 0),  for Case A,
    QFV(n) = 1,                      for Case B.    (8)

Equation (8) assigns a quantitative value to the feature centered in the weighted feature value window, based on a fuzzy determination of whether it is a feature or not. If neighboring features exist within the window (Case A of Fig. 9(a)), the feature value of the center feature of the window is evaluated by subtracting Eq. (5) from 1, where Eq. (5) is calculated only for the neighboring pixels (u, v) in the window, excluding the center pixel. If the calculated quantitative feature value is smaller than 0, it is set to 0. If there are no neighboring features within the window (Case B of Fig. 9(b)), the value of the feature centered at the window is set to 1. Thus, the values of the extracted features in a fingerprint image are mapped into [0, 1].

3. Feature Matching in Fingerprint Images

In Section 2, we described the process of feature extraction using the modified ridge line following algorithm. In this section, we describe a method of personal identification that takes advantage of the extracted features. Feature matching for automatic fingerprint identification is performed by point pattern matching. A large number of point pattern matching methods have been presented in the literature.34,35 In this study, we use the point matching method for fingerprint identification. We generate an adaptive matching boundary to sort out the false features. Quantitative feature values for each feature are used in calculating the matching scores. We also propose a fast feature matching method based on a two-step estimation of transformation parameters.

3.1. Determination of Adaptive Matching Boundary
Features are extracted from fingerprint images using the ridge line following algorithm. However, this method can extract false features near the
Fig. 10. Example of feature extraction result: (a) Acquired fingerprint image, (b) False features caused by the uniform ROI applied to the entire fingerprint image.
boundary of the acquired fingerprint images, resulting in false matching. We analyze the patterns of ridge lines in the acquired fingerprint image, and construct a region of interest (ROI) that excludes these boundary regions. Figure 10 shows an example of ROI determination from the input fingerprint image. Figure 10(a) shows the acquired fingerprint image. If the ROI is the entire fingerprint image, false ending features are produced around the ROI boundary, as shown in Fig. 10(b). In Fig. 10(b), the line (–) marks denote the ridge directions, whereas the box (□) marks represent the feature locations. To reduce false features, we must select desirable features based on the characteristics of each fingerprint image. To determine the ROI in an image, the variation of gray level values in the fingerprint image is considered. Let I(i, j) represent the gray level value at pixel (i, j), and let m and σ² denote the mean and variance of I, respectively. Let N(i, j) signify the normalized gray level value at pixel (i, j), and let m0 and σ0² denote the desired mean and variance of N, respectively. Then the normalized image N(i, j)22 is given by

    N(i, j) = m0 + (σ0/σ)|I(i, j) − m|,  if I(i, j) > m,
    N(i, j) = m0 − (σ0/σ)|I(i, j) − m|,  otherwise.    (9)
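A minimal sketch of this normalization follows; the desired mean and variance values are illustrative placeholders, since the chapter does not state its settings.

```python
import numpy as np

def normalize(img, m0=100.0, var0=100.0):
    """Gray-level normalization of Eq. (9): map the image to a desired
    mean m0 and variance var0 (illustrative values)."""
    img = np.asarray(img, float)
    m, var = img.mean(), img.var()
    dev = np.sqrt(var0 / var) * np.abs(img - m)
    # pixels brighter than the mean move above m0, darker ones below
    return np.where(img > m, m0 + dev, m0 - dev)
```

After this mapping the image has mean m0 and variance var0 regardless of the sensor's original gray-level range, which makes the block-variance ROI test that follows comparable across acquisitions.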
a We divide a 288 x 240 normalized fingerprint image into a number of 16 x 16 nonoverlapping blocks, and compute the standard deviation of each block. Let cr block denote the standard deviation of each block in the
Fig. 11. Example of feature extraction result: (a) Modified fingerprint image using the ROI, (b) Features extracted by the adaptive ROI.
normalized fingerprint image. Then the ROI of the acquired fingerprint image is defined by

    ROI_block = 1, if σ_block > σ_N,
    ROI_block = 0, otherwise,    (10)

where σ_N denotes the standard deviation of the whole normalized fingerprint image. From Eq. (10), the ROI is constructed of the blocks in which the standard deviation is larger than that of the normalized fingerprint image. The variations of gray level are relatively large in the ROI blocks because ridge lines and valleys, whose gray level values differ greatly from each other, coexist. Figure 11(a) shows the construction of the adaptive matching boundary for Fig. 10(a) by excluding pixels outside the ROI. Our method adaptively determines an ROI that is appropriate for a given image. Figure 11(b) shows that the false ending features are eliminated by excluding them from the ROI. There are two advantages in sorting out the false features by using the adaptive matching boundary. First, the feature matching process becomes more reliable. Second, the time required for feature matching is reduced.
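The block-wise ROI selection of Eq. (10) can be sketched as follows; the 16 × 16 block size follows the text, and the function name and boolean-mask representation are illustrative choices.

```python
import numpy as np

def roi_mask(norm_img, block=16):
    """Adaptive ROI of Eq. (10): keep the blocks whose gray-level standard
    deviation exceeds that of the whole normalized image."""
    h, w = norm_img.shape
    sigma_img = norm_img.std()          # std of the whole normalized image
    mask = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            blk = norm_img[by * block:(by + 1) * block,
                           bx * block:(bx + 1) * block]
            mask[by, bx] = blk.std() > sigma_img   # Eq. (10)
    return mask
```

Blocks containing alternating ridges and valleys have high local variance and survive; flat background blocks near the image boundary do not, which is exactly how the false ending features of Fig. 10(b) get excluded.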
3.2. Feature Matching
Feature matching searches for the best match between the features of the query image and those of the images in the fingerprint database. For each feature in the query image, we check whether there is a matched feature in the database fingerprint image. In feature matching, the features in the query image are rotated, translated, and superimposed on the features of the database
images. The best match gives the user's identity. In order to estimate the unknown rotation and translation parameters, Ratha et al. introduced the generalized Hough transform.36 The conventional Hough transform37 can be generalized for point pattern matching. We estimate the transformation parameters by discretizing the parameter space and exhaustively searching the discrete space. Let T_{θ,Δx,Δy} denote the transformation. Then the transform of a query feature located at (x, y) is given by

    T_{θ,Δx,Δy}(x, y) = (x cos θ + y sin θ + Δx, −x sin θ + y cos θ + Δy),    (11)

i.e., a rotation by θ followed by a translation by (Δx, Δy), where θ and (Δx, Δy) are the rotation and translation parameters, respectively. The transformation parameters are discretized into

    θ  ∈ {θ1, θ2, ..., θJ},
    Δx ∈ {Δx1, Δx2, ..., ΔxK},
    Δy ∈ {Δy1, Δy2, ..., ΔyL},    (12)
where J, K, and L denote the total numbers of discretized values of θ, Δx, and Δy, respectively. J·K·L transforms are performed for each image in the database, and the matching score is calculated for each combination of parameter values. Over all combinations of parameter values, the best set of transformation parameters is selected, based on the maximum matching score. Figure 12 illustrates the matching criterion for a query feature and a database feature. A tolerance for matching is needed due to the elasticity of the skin. The '■', '□', and '/' marks denote the database features, query features, and ridge directions, respectively. Figure 12(a) shows two matched
Fig. 12. Examination of paired features: (a) matched features; (b) unmatched features.
features, in which the query feature lies inside the tolerance area of the database feature, and the ridge directions are within some tolerance of each other. Figure 12(b) shows two unmatched features, in which the query feature is outside the tolerance area or the ridge directions are not within a tolerance of each other.

3.3. Weighted Matching Scores
A matching score for fingerprint features establishes a correspondence between the query and all the elements of the database feature set. In our algorithm, the extracted features of the query fingerprint are rotated, translated, and superimposed on the features of a database fingerprint. Those query features satisfying the condition of Fig. 12(a) are regarded as matched features. Other proposed matching algorithms were based on finding the number of paired features between each database and query fingerprint.36,38 The matching score in these algorithms is computed based on the number of paired features. In this paper, we propose a weighted matching score using quantitative feature values. Let MS denote the matching score between a fingerprint in the database and the query fingerprint, given by

    MS = 100 · [Σ_{k=1}^{NQ} g(k)·max(a_k, b_k)] / max(Σ_{i=1}^{NQ} a_i, Σ_{j=1}^{NR} b_j),    (13)

where NQ (NR) signifies the number of query (a certain user's) features, b_k denotes the QFV of the database feature paired with query feature k, and g(k) is the discrimination function representing whether the two features are matched or not, defined by

    g(k) = 1, if the two features are matched,
    g(k) = 0, otherwise,  k = 1, 2, ..., NQ.    (14)
MS is a confidence value, representing how well the query matches a fingerprint in the database. The coefficient a_i represents the QFV of the ith query feature, defined by Eq. (8). The coefficient b_j signifies the QFV of the jth feature of a certain user in the fingerprint database. The denominator of Eq. (13) normalizes the matching score: the maximum of the accumulated QFVs is selected as the normalization term. The numerator of Eq. (13) represents the QFVs of the matched features between the query image and a certain user's image. If the number of paired (matched) features is Np, then Np QFVs
are accumulated, each obtained by selecting the maximum of the query feature QFV and the paired database feature QFV.

3.4. Two-Step Estimation of Transformation Parameters
Estimating the transformation parameters in Eq. (12) is computationally expensive. Since the computational burden is proportional to the number of features registered in the database, it is hard to perform real-time personal identification for large fingerprint databases. To reduce the computational complexity, we propose a fast feature matching method based on a two-step estimation of the transformation parameters. Any parameter estimation algorithm must consider the correspondence between query and database features for all the combinations of transform parameters represented in Eq. (12).36 Rotation and translation of the extracted query features with respect to the registered fingerprints must be handled whenever a query fingerprint is acquired. Since there are wide ranges of rotation and translation parameter values, effective estimation of the transformation parameters is a critical factor in a reliable fingerprint identification system. We employ a two-step estimation of transformation parameters. In the first step, a coarse matching is performed, and matching scores are computed between the query image and all the images registered in the database. In the second step, a fine matching is performed between the query image and the high-scoring database images from the first step. The registered user with the maximum matching score is then identified. Our fingerprint identification system identifies a user based on the matching score. Figure 13 shows the distribution of matching scores for correct and incorrect matching. The dotted line denotes the decision threshold
Fig. 13. Distribution of correct and incorrect matching scores.
MS_th for deciding whether a matching is correct or not; (a) and (b) denote the regions of false rejection (FR) and false acceptance (FA), respectively. The decision criterion should establish a decision boundary that minimizes the FR rate for a prespecified FA rate. The prespecified FA rate of a biometric system is usually very small.5,39,40 Figure 14 shows the flowchart of the proposed feature matching algorithm, using a two-step estimation of transformation parameters for fast
Fig. 14. Flowchart of the proposed fast feature matching algorithm using two-step estimation of transformation parameters.
matching. B(j, k, l) represents the accumulator array in the generalized Hough transform, with j, k, and l denoting the indices of the discretized parameters θ, Δx, and Δy, respectively. The left part of the flowchart illustrates the coarse matching process, in which the matching score is evaluated over a coarse subset of the discretized parameter values. If a certain user's matching score after coarse matching is smaller than one third of the prespecified threshold matching score MS_th, the user is excluded from the fine matching process. The right part of the flowchart represents the fine matching process, in which matching is performed with the remaining registered users over the remaining parameter values. The matching score is evaluated and added to the coarse matching score. Because the features of most registered users are greatly different from the query features, most of the registered users are excluded in the coarse matching process. Let NQ and ND denote the numbers of query features and of all registered database features, respectively, and let J, K, and L signify the total numbers of discretized parameter values of θ, Δx, and Δy. Then the required number of operations is reduced from ND·J·K·L·NQ to about (1/2)·ND·J·K·L·NQ by the two-step estimation of transformation parameters.

4. Experimental Results and Discussions

In our experiments, we capture fingerprint images using an NITGen SecuGen sensor. Figure 15 shows the fingerprint image acquisition sensor used in our experiments. Its resolution is 450 dpi with an LED light source, and it is connected to a PC using the FDA01 Developer's Kit software. We conduct our experiments on a 400 MHz Pentium II using programs written in C. We capture fingerprint images from 50 individuals and obtain five
Fig. 15. Fingerprint scanner used in experiments.
Fig. 16. Examples of good-quality images.

Fig. 17. Examples of poor-quality images.
images for each person, yielding a total of 250 images. The size of these fingerprint images is 288 × 240. The quality of the images varies with the condition of the finger; dryness is one important condition for obtaining a good image. About 80 percent of the obtained images are of good quality, with some examples shown in Fig. 16. About 20 percent of the obtained images are of poor quality, with some examples shown in Fig. 17. Figure 18 shows the overall procedure of the proposed fast feature matching algorithm. When the query image is acquired from the input sensor, the features are extracted from the image and the quantitative feature values are evaluated. After extraction of the query features, the registered database features are retrieved. The transformation parameters for the query features are estimated using all database features, and the weighted matching score is computed for all registered users in the retrieval process. If the maximum matching score is larger than the prespecified threshold, the user with that maximum score is selected as the correct person. Figure 19 illustrates the evaluation of the QFV for the extracted features using the weighted feature value window. The weighted feature value
Fig. 18. Overall procedure of the fingerprint identification: query fingerprint → query features; database fingerprint → database features; matching result.

Fig. 19. Examples of the QFV: (a) concentrated features (QFVs ≈ 0.350 and 0.348); (b) normal extracted features (QFV = 1.000).
window is centered at each extracted feature, and the QFV of the feature is calculated by Eq. (8). Figure 19(a) shows the evaluation of the QFV for concentrated features caused by false ridge line following, which occurs because the intervals of the ridge lines are close to each other. The QFVs take relatively low values. Figure 19(b) shows the QFV for normally extracted features. The bifurcation features are easily recognized, and their QFVs are set to 1.
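The QFV evaluation of Eqs. (4), (5), and (8) can be sketched as a short, runnable function; the window size M = 13 is an illustrative stand-in, c² = 30 follows the membership function of Eq. (3), and features are given simply as coordinate pairs.

```python
import numpy as np

def qfv(features, M=13, c2=30.0):
    """Quantitative feature value of Eq. (8): isolated features score 1
    (Case B); features with neighbours inside the M x M window have their
    value lowered by the summed Gaussian membership of Eq. (5) (Case A)."""
    pts = np.asarray(features, float)
    values = []
    for i in range(len(pts)):
        d = np.linalg.norm(pts - pts[i], axis=1)        # Euclidean distance, Eq. (4)
        # neighbours inside the window, excluding the feature itself
        inside = (d > 0) & (np.abs(pts - pts[i]) <= M // 2).all(axis=1)
        if not inside.any():
            values.append(1.0)                          # Case B
        else:
            mu = float(np.exp(-d[inside] ** 2 / c2).sum())   # Eq. (5)
            values.append(max(1.0 - mu, 0.0))           # Case A, clipped at 0
    return values
```

Clusters of features produced by false ridge line following end up close together, so their summed membership is large and their QFVs drop toward 0, while genuine isolated minutiae keep the full weight of 1.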
For our experiments, two fingerprint images per person are registered in the database, for a total of 100 fingerprint images in the database. The remaining three fingerprint images per person are used as the query images. Figure 20 shows the distributions of correct and incorrect matching scores, over 15,000 combinations of query images and database images. The dotted line illustrates the distribution of the matching score using the method of counting the number of paired features.36,38 The solid line represents the distribution of the matching score obtained by our proposed method. The distributions of incorrect matching scores have similar shapes. However, the distribution of correct matching scores based on the QFV gives higher matching scores than the method of counting the number of paired features. As a result, for incorrect matching, the proposed method gives smaller percentage values. This trend leads to easier selection of a threshold that minimizes the FR rate for the specified FA rate.
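The weighted matching score of Eq. (13) can be sketched as below for features that are already aligned; the rotation/translation search and the ridge-direction tolerance are omitted, and the greedy nearest-pairing and the tolerance value are illustrative assumptions, not the chapter's exact procedure.

```python
def matching_score(query, database, tol=8.0):
    """Weighted matching score of Eq. (13) for aligned features.

    `query` and `database` are lists of ((x, y), qfv) pairs; a query
    feature pairs with an unused database feature when their locations
    fall within the tolerance distance."""
    matched = 0.0
    used = set()
    for (qx, qy), qa in query:
        for j, ((dx, dy), db) in enumerate(database):
            if j in used:
                continue
            if (qx - dx) ** 2 + (qy - dy) ** 2 <= tol ** 2:
                matched += max(qa, db)   # numerator: larger QFV of the pair
                used.add(j)
                break
    # denominator of Eq. (13): the larger accumulated QFV normalizes MS
    norm = max(sum(a for _, a in query), sum(b for _, b in database))
    return 100.0 * (matched / norm) if norm else 0.0
```

Because each pair contributes its QFV rather than a flat count of 1, low-confidence (likely false) features barely move the score, which is the mechanism behind the separation of the two distributions in Fig. 20.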
Fig. 20. Distribution of correct and incorrect matching scores (Method of counting the number of paired features vs. Proposed method).
Table 1 presents the FR rate and correct recognition rate of the method of counting the number of paired features for various MS_th. Table 2 shows the FR rate and correct recognition rate of the proposed method for various MS_th. While a maximum correct recognition rate is required (minimum FA rate)
Table 1. FR rate and identification rate of the method of counting the number of paired features.

    MS_th    FR rate (%)    Identification rate (%)
    18       29.08          99.757
    20       39.04          99.931
    22       49.01          99.993
    24       57.62          100
Table 2. FR rate and identification rate of the proposed method.

    MS_th    FR rate (%)    Identification rate (%)
    18       15.20          99.359
    20       19.59          99.903
    22       27.36          99.938
    24       33.45          100
Table 3. Required average matching time.

    Parameter estimation method    Average matching time (sec.)
    Full search method             2.03
    Two-step estimation            1.12
in the fingerprint identification system, a low FR rate is also desirable. From Tables 1 and 2, the FR rate of the proposed method is much smaller than that of the method of counting the number of paired features at a similar identification rate. This indicates that the proposed fingerprint identification method can effectively differentiate users. This performance improvement is due to the robustness of the QFV. Table 3 shows the average matching time per image. The average matching time of the two-step estimation method is reduced to 55% of that of the full search method. Note that the proposed two-step estimation method does not result in performance degradation, and thus this fast fingerprint identification method can be applied to practical systems with large fingerprint databases. The performance of any fingerprint identification algorithm is sensitive to the quality of the fingerprint images. Most low matching scores or high FR rates are due to the poor quality of the input images. If the quality of the
acquired fingerprint images is enhanced, the performance of our fingerprint identification system can be improved.

5. Conclusions
We have implemented a robust fingerprint identification system using hybrid methods. In the feature extraction stage, the average ridge line width is estimated in the ridge line following algorithm, and a new stopping criterion is proposed. Moreover, a fuzzy technique is applied to the features, and quantitative feature values are evaluated using the weighted feature value window. In the feature matching stage, the patterns of ridge lines in the fingerprint image are analyzed and an adaptive matching boundary is constructed to eliminate false features. The weighted matching score is computed using the quantitative feature value for more reliable matching. A two-step estimation of transformation parameters is also employed to reduce the computation time. Experimental results show that the proposed fingerprint identification approach is effective for personal identification. Further research will focus on improving the system's tolerance for degraded or poor-quality fingerprint images.

References

1. R. Clarke, Human identification in information systems: Management challenges and public policy issues, Info. Technol. People, 7(4):6-37, 1994.
2. S. G. Davies, Touching Big Brother: How biometric technology will fuse flesh and machine, Info. Technol. People, 7(4):60-69, 1994.
3. A. K. Jain, L. Hong, S. Pankanti, and R. Bolle, An identity authentication system using fingerprints, Proceedings of the IEEE, 85(9):1365-1388, Sept. 1997.
4. A. K. Jain, R. M. Bolle, and S. Pankanti, Eds., Biometrics: Personal Identification in a Networked Society, Norwell, MA: Kluwer, 1999.
5. J. G. Daugman, High confidence visual recognition of persons by a test of statistical independence, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-15(11):1148-1161, Nov. 1993.
6. N. K. Ratha, S. Chen, and A. K. Jain, Adaptive flow orientation based feature extraction in fingerprint images, Pattern Recognition, 28(11):1657-1672, Nov. 1995.
7. H. C. Lee and R. E. Gaensslen, Eds., Advances in Fingerprint Technology, New York: Elsevier, 1991.
8. B. Miller, Vital signs of identity, IEEE Spectrum, 31(2):22-30, Feb. 1994.
9. J. Wood, Invariant pattern recognition: A review, Pattern Recognition, 29(1):1-17, Jan. 1996.
10. L. Coetzee and E. C. Botha, Fingerprint recognition in low quality images, Pattern Recognition, 26(10):1441-1460, Oct. 1993.
11. A. R. Rao and K. Balck, Type classification of fingerprints: A syntactic approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(3):223-231, Mar. 1980.
12. C. L. Wilson, G. T. Candela, and C. I. Watson, Neural-network fingerprint classification, J. Artificial Neural Networks, 1(2):203-228, Feb. 1994.
13. B. M. Mehtre, N. N. Murthy, and S. Kapoor, Segmentation of fingerprint images using the directional image, Pattern Recognition, 20(4):429-435, Apr. 1987.
14. L. O'Gorman and J. V. Nickerson, An approach to fingerprint filter design, Pattern Recognition, 22(1):29-38, Jan. 1989.
15. Z. R. Li and D. P. Zhang, A fingerprint recognition system with microcomputer, in Proc. 6th ICPR, Montreal, Canada, 1984, pp. 939-941.
16. D. Maio, D. Maltoni, and S. Rizzi, An efficient approach to online fingerprint verification, in Proc. 8th Int. Symposium on AI, Monterrey, Mexico, Oct. 1995, pp. 132-138.
17. D. Maio and D. Maltoni, Direct gray-scale minutiae detection in fingerprints, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(1):27-40, Jan. 1997.
18. M. J. Donahue and S. I. Rokhlin, On the use of level curves in image analysis, Image Understanding, 57(2):185-203, Feb. 1993.
19. M. Kass and A. Witkin, Analyzing oriented patterns, Comput. Vis. Graph. Image Processing, 37(3):362-385, Mar. 1987.
20. M. Kawagoe and A. Tojo, Fingerprint pattern classification, Pattern Recognition, 17(3):295-303, May 1984.
21. A. R. Rao, A Taxonomy for Texture Description and Identification, Springer-Verlag, New York, 1990.
22. L. Hong, Y. Wan, and A. K. Jain, Fingerprint image enhancement: Algorithm and performance evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20(8):777-789, Aug. 1998.
23. A. K. Jain, S. Prabhakar, L. Hong, and S. Pankanti, Filterbank-based fingerprint matching, IEEE Transactions on Image Processing, IP-9(5):846-859, May 2000.
24. L. A. Zadeh, Fuzzy logic, IEEE Computer, 21(4):83-93, 1988.
25. L. A. Zadeh, Fuzzy sets, Information and Control, 8(3):338-353, 1965.
26. Y. S. Choi and R. Krishnapuram, A robust approach to image enhancement based on fuzzy logic, IEEE Transactions on Image Processing, IP-6(6):808-825, June 1997.
27. T. L. Huntsberger, C. Rangarajan, and S. N. Jayaramamurthy, Representation of uncertainty in computer vision using fuzzy sets, IEEE Transactions on Computers, C-35(2):145-156, Feb. 1986.
28. S. K. Pal, Fuzzy tools for the management of uncertainty in pattern recognition, image analysis, vision and expert systems, Int. J. Syst. Sci., 22(3):511-549, Mar. 1991.
29. S. K. Pal and B. Chakraborty, Fuzzy set theoretic measure for automatic feature evaluation, IEEE Transactions on Systems, Man, and Cybernetics, SMC-16(5):754-760, Sept. 1986.
30. C. C. Lee, Fuzzy logic in control systems: Fuzzy logic controller - Part I & Part II, IEEE Transactions on Systems, Man, and Cybernetics, SMC-20(2):404-435, 1990.
31. K. N. Plataniotis, D. Androutsos, and A. N. Venetsanopoulos, Adaptive fuzzy systems for multichannel signal processing, Proceedings of the IEEE, 87(9):1601-1622, Sept. 1999.
32. M. Barni, V. Cappellini, and A. Mecocci, Fast vector median filter based on Euclidean norm approximation, IEEE Signal Processing Lett., 1(6):92-94, June 1994.
33. J. Chaudhuri, C. A. Murthy, and B. B. Chaudhuri, A modified metric to compare distances, Pattern Recognition, 25(5):667-677, May 1992.
34. A. Ranade and A. Rosenfeld, Point pattern matching by relaxation, Pattern Recognition, 12(4):269-275, Dec. 1980.
35. J. P. P. Starink and E. Backer, Finding point correspondence using simulated annealing, Pattern Recognition, 28(2):231-240, May 1995.
36. N. K. Ratha, K. Karu, S. Chen, and A. K. Jain, A real-time matching system for large fingerprint databases, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-18(8):799-813, Aug. 1996.
37. D. H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition, 13(2):111-122, 1981.
38. A. K. Jain, L. Hong, and R. Bolle, On-line fingerprint verification, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(4):302-314, Apr. 1997.
39. J. G. Daugman and G. O. Williams, A proposed standard for biometric decidability, in Proc. CardTech/SecureTech Conf., Atlanta, GA, 1996, pp. 223-234.
40. L. Hong and A. K. Jain, Integrating faces and fingerprints for personal identification, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20(12):1295-1307, Dec. 1998.
CHAPTER 12

TEXT CATEGORIZATION USING LEARNED DOCUMENT FEATURES
M. Junker, A. Abecker and A. Dengel
German Research Center for Artificial Intelligence (DFKI) GmbH
P. O. Box 2080, 67608 Kaiserslautern, Germany
E-mail: {markus.junker, andreas.abecker, andreas.dengel}@dfki.de

The categorization of texts or text fragments into a given set of classes is a problem with ever increasing practical relevance. In this chapter we first present a symbolic approach for learning expressive document features, so-called complexes. This approach can be used, e.g., for learning features in the form of specific word sequences or substring tests. Such features are employed in two different learning approaches for generating text classifiers. In the first approach, the learning of complexes is embedded in a symbolic rule learner. In the second approach we use symbolically learned complexes to represent documents for a subsymbolic learning approach, namely, the Support Vector Machine (SVM). We compare the results of the purely symbolic rule learning, a purely statistical approach based on the SVM, and the hybrid approach. It is shown experimentally that rule learning as well as the SVM benefit from learning expressive document features in the form of complexes. In particular, best results are obtained by the hybrid approach combining the symbolic learning of document features with the SVM.

1. Introduction

Categorization of texts and text fragments into a given set of content-oriented categories is a job people are often confronted with in the office, e.g.:

• e-mails are sorted into mail folders for storage, or to indicate priority and further processing;
• business letters are classified according to their type (invoice, purchase order, etc.) or to the business process affected in order to dispatch them
accordingly within the company, or to trigger subsequent electronic processing such as automatic extraction of the total amount of an invoice, etc.;
• incoming news about markets, products, or new technologies are categorized according to their topic in order to forward them to interested subscribers of the respective information services;
• newspaper articles, technical documents, scientific information, internal bulletins, etc. are indexed for storage and retrieval in document management systems, libraries, and corporate archives.

It is apparent that automated text or document categorization systems save time and human labor, and ensure a consistent assignment of categories to documents, which is not guaranteed when relying upon human indexers. Today, automatic categorization systems usually employ hand-crafted categorization rules of the form

if text pattern pc occurs in document d then document d belongs to category c

Languages for describing categorization patterns typically allow for boolean combinations of tests for word occurrences. For instance, the pattern (or (and gold jewelry) (and silver jewelry)) can be used to find documents which deal with gold or silver jewelry. More elaborate pattern languages provide, e.g., tests for word sequences or specific word properties (such as inclusion of a given substring). The most prominent example of a rule-based categorization system is the TCS shell, which has been successfully applied to the categorization of business newswire articles.1,2 In the literature a number of other examples of such systems are given.3,4,5,6 In order to reduce human effort for the manual design of the categorization rules, there is a growing interest in learning categorization systems which take categorized example documents as input, and automatically extract category-specific classifiers.
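As an illustration only (this is not the TCS shell's syntax), such boolean word-occurrence patterns can be evaluated in a few lines; the nested-tuple encoding and function name here are our own assumptions:

```python
def matches(pattern, words):
    """Evaluate a boolean word-occurrence pattern against a document's word set."""
    if isinstance(pattern, str):          # leaf: a single word test
        return pattern in words
    op, *args = pattern
    if op == "and":
        return all(matches(a, words) for a in args)
    if op == "or":
        return any(matches(a, words) for a in args)
    if op == "not":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")

doc = set("a ring made of silver jewelry for sale".split())
pattern = ("or", ("and", "gold", "jewelry"), ("and", "silver", "jewelry"))
print(matches(pattern, doc))   # True: the document mentions silver jewelry
```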
Such systems are especially useful in situations where rapid developments in the application domain (e.g., new content categories arising, typical vocabulary for specific topics shifting over time, etc.) or frequent changes of the usage environments (new users, other use of categorization results) require continuous maintenance and evolution of classifiers. The majority of approaches for learning document categories rely on subsymbolic approaches such as Naive Bayes classifiers or the Support
Vector Machine (SVM).7,8,9,10,11,12,13 Until now, only few symbolic learners (i.e., decision tree or rule learning algorithms) have been applied to text categorization.14,15 A few rule learners try to use more elaborated document features such as specific phrases or the occurrence of specific substrings.16,17 In contrast to the manual design of categorization rules, the majority of learning approaches is based on simple document representations borrowed from traditional information retrieval18,19 (cf. Fig. 1): a document is described by a vector. Each position represents one of the set of all words occurring somewhere in the document collection (or the set of all words in the collection considered relevant for classification); the value at a certain position reflects whether (or, how often) a word is contained in the encoded document. The main hypothesis of this chapter is that more effective document classifiers can be built on the basis of richer document representations than those only counting the words within documents. In many other pattern recognition problems, the learning algorithm can exploit sophisticated example representations based on a thorough understanding of the classification problem, which is translated into an (ideally small) set of meaningful features describing a particular example. In text categorization, this is not so simple. At first glance, only the pure words are available for example representation. No meta information or abstract descriptions exist; no information about the context of a word occurrence can be easily kept. There is no way to discern words which are relevant for a classification decision and words which are not. Hence, the majority of learning approaches in the text categorization domain employ the above mentioned vector representations
Fig. 1. Simple feature-value representation of documents.
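The feature-value representation of Fig. 1 can be sketched as follows; here with binary occurrence values (a count-based variant would store word frequencies instead), and the small vocabulary is a toy assumption:

```python
def vectorize(doc_words, vocabulary):
    """One vector position per vocabulary word: 1 if the word occurs in the document."""
    return [1 if w in doc_words else 0 for w in vocabulary]

vocab = ["a", "able", "bank", "create", "creative", "currency"]
doc = set("commerce bank houston said it filed an application with the "
          "controller of the currency in an effort to create the largest".split())
print(vectorize(doc, vocab))   # [0, 0, 1, 1, 0, 1]
```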
for documents, and rely on a linear classification approach. Since there is no easy way to distinguish between the important and unimportant document features, the classifiers let as many features as possible contribute to a categorization decision. In our approach, we use a richer document representation which maintains information about the sequence of words in a document. In order to learn expressive classifiers (similar to hand-crafted categorization rules) which heavily exploit this information, we distinguish two learning phases: (1) In the feature construction phase, we provide a powerful hypothesis language for accurate characterization of document sets using complex document features, called complexes. In Section 2 we present a general framework for this task which is based upon an extensible set of declaratively specified operators for refinement of hypotheses, each of them extending the hypothesis language by a specific syntactic construct (like substring tests, conjunction of tests, etc.). This sophisticated feature construction step turns out to be a learning problem of its own which is subject to the known problems typical for learning with rich hypothesis languages, namely, huge search spaces and overfitting. In order to address the former one, we introduce a strategy language which allows for an easy, declarative specification of control knowledge to guide the search process by heuristics. In Section 4 we demonstrate the use of strategies to show how specific refinement operators can be utilized efficiently. The overfitting problem is dealt with in Section 3 where we instantiate the search procedure of our framework with specific parameters which — in our experimental work — turned out to be well-suited for text categorization. (2) In the classifier learning phase we build classifiers for specific text categories, which employ the complex document features already constructed. 
This two-phase approach opens up new possibilities for hybrid classification algorithms, because the separate feature construction can be coupled as a stand-alone module with other arbitrary learning paradigms. In Section 5 we show how feature construction is embedded as an inner loop into a separate-and-conquer approach for symbolic rule learning. In Section 6 we describe how the Support-Vector-Machine (SVM) can be improved by using our feature construction phase as a preprocessing step to generate the input for the SVM. Both approaches are backed up with extensive experimental evaluation.
2. General Framework

2.1. Document Representation
In our framework, real world documents (usually given as character sequences) are transformed into sequences of words (w1 ... wn), where a word wi is represented as a character sequence:

dexample: ... commerce bank houston said it filed an application with the controller of the currency in an effort to create the largest ...

(with the words numbered consecutively as w41, w42, ..., w61)
There is some freedom in how to separate a character sequence in the original document into a word sequence. For instance, words with a hyphenation mark "-" may be split or not, punctuation marks may be split off or not, words can be transformed into lower case or not, etc. In initial experimental evaluations it turned out that the methods described later were not sensitive to such decisions. It is important to note that the transformation preserves almost all information contained in the original document. This allows us, e.g., to define arbitrary features that rely on word sequences or complex word properties as described in the next section.

2.2. Complexes
Document features can be described by so-called complexes. The term complex is taken from the symbolic rule learning algorithm CN2.20 In CN2 a complex is a conjunction of tests on specific feature values. A complex in our sense is an arbitrary predicate computed based on the document representation. In order to define predicates, a language for complexes C together with an appropriate semantics is provided. This language is based on pattern languages as used in text categorization systems in which rules are written manually, like TCS.1 The semantics of complexes C is defined by a match function which assigns a truth value to all documents D: match : C × D → {0, 1}. If the value assigned is 1, the complex matches (or, as we say, it covers the document). If the value assigned is 0, the complex does not match (respectively, it does
not cover the document). A typical complex is a conjunction of word tests w1 ∧ w2 ∧ ... ∧ wn. It matches for a given document iff all words wi occur in the respective document. Word tests wi can also be replaced by so-called substring tests *wi, wi*, *wi*, which only require that a document contains a word which contains wi at the end, at the beginning, or somewhere in the middle. Word sequence tests of the form w1[s1]w2[s2]...[sn-1]wn (si ∈ ℕ) can be used to test whether the words w1, w2, ..., wn occur in a document in this order. The numbers si indicate the maximal number of words allowed between wi and wi+1.
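The three kinds of tests just described can be sketched as follows; the function names and the list-of-words document layout are our own illustration, not the chapter's implementation:

```python
def word_test(w, doc):
    """Plain word test: w occurs somewhere in the document."""
    return w in doc

def substring_test(pattern, doc):
    """Substring tests *w, w*, *w* over the document's words."""
    core = pattern.strip("*")
    if pattern.startswith("*") and pattern.endswith("*"):
        return any(core in w for w in doc)                  # *w*: anywhere
    if pattern.endswith("*"):
        return any(w.startswith(core) for w in doc)         # w*: at the beginning
    return any(w.endswith(core) for w in doc)               # *w: at the end

def sequence_test(words, gaps, doc):
    """words = [w1, ..., wn]; gaps[i] = max. words allowed between wi and wi+1."""
    def search(i, pos):
        if i == len(words):
            return True
        limit = len(doc) if i == 0 else pos + gaps[i - 1] + 1
        for j in range(pos, min(limit, len(doc))):
            if doc[j] == words[i] and search(i + 1, j + 1):
                return True
        return False
    return search(0, 0)

doc = "it filed an application with the controller of the currency".split()
print(word_test("currency", doc))                          # True
print(substring_test("*controll*", doc))                   # True ("controller")
print(sequence_test(["filed", "application"], [1], doc))   # True: one word between
```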
2.3. Searching for Complexes
Complexes are the basis for expressing powerful document features. Good complexes should cover at least some of the documents in a category with high precision. The precision is defined as the number of documents correctly covered by a complex, divided by the number of all documents covered by the complex. In our approach, complexes are learned for a target category based on positive and negative training documents E = E⊕ ∪ E⊖ for each category. The search for good complexes based on the training examples is done heuristically. For describing heuristics, we have introduced so-called strategy expressions which are described in the following. The basic components for building strategy expressions are refinement operators. Based on a given complex c they generate a set of new complexes c1, c2, ..., cn. Well-defined refinement operators should derive promising candidates for improved complexes based on c and a given document set. A refinement operator is a function v : 2^D × C → 2^C. It is written in the form:

(v)   c ⟶ {c1, c2, ..., cn}
In order to describe the heuristic search, we need some functions on complexes in addition to refinement operators:

• A weighting function w : 2^D × C → ℝ. It defines an order on complexes based on a given document set. Better ranked complexes are believed to yield a better precision.
• The selection function sel_{w,b} : 2^D × 2^C → 2^C. The selection function is used to find the top-ranked complexes of a set of complexes according to
their weight. The parameters w and b provide the weighting function and the number of complexes to be chosen. Applying the weighting function requires a set of evaluation documents which is also given as a parameter.
• Even if a complex is given a good weight it can still be of low precision. A second means for avoiding low precision complexes is provided by a significance measure sigm : 2^D × C × 2^C × C → ℝ. By applying a threshold ϑ, the value sigm(D, c_pre, S, c) is used to decide whether the complex c should be excluded from the search space.

Later, we will introduce the weighting function and significance measure used in our experiments. Using the introduced terms, strategy expressions are defined as follows. Let C be a set of complexes and V be a set of refinement operators. A strategy expression (or, in short: a strategy) s is a function 2^D × 2^C → 2^C. The set S of strategies based on V is given by:

• v ∈ S, if v ∈ V
• s1 ∘ s2 ∈ S, if s1, s2 ∈ S (concatenation)
• s^n ∈ S, if s ∈ S, n ∈ ℕ (n-fold concatenation)
• s+ ∈ S, if s ∈ S (closure of a strategy)
• s1 ∪ s2 ∈ S, if s1, s2 ∈ S (union)
• s|b ∈ S, if s ∈ S, w a weighting function and b ∈ ℕ (restricted refinement); b is also called beam
• [s] ∈ S, if s ∈ S (significance filter)
• (s) ∈ S, if s ∈ S

Let w be a weighting function and 'sigm' be a significance measure. The semantics of strategy expressions is given by a function 2^D × 2^C → 2^C with:

• v(D, C) = ∪_{c ∈ C} v(D, c)
• (s1 ∘ s2)(D, C) = s2(D, C) ∪ s1(s2(D, C))
• s^n(D, C) = (s ∘ s^{n-1})(D, C) if n > 1, and s^1(D, C) = s(D, C)
• (s+)(D, C) = {c | ∃n ∈ ℕ: c ∈ s^n(D, C)}
• (s1 ∪ s2)(D, C) = s1(D, C) ∪ s2(D, C)
• s|b(D, C) = s(sel_{w,b}(D, C))
• [s](D, {c1, c2, ..., cn}) = ∪_i {c' ∈ s(D, {ci}) | sigm(D, ci, s(D, {ci}), c') > ϑ}
• (s)(D, C) = s(D, C)
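A toy sketch of some of these combinators, with complexes modeled as frozensets of words (read as conjunctions of word tests) and a single word-adding refinement operator; all names and the set-based encoding are our own assumptions:

```python
def K(D, C):
    """Toy refinement operator: extend each conjunction by one word from the documents."""
    vocab = set().union(*D)
    return {c | {w} for c in C for w in vocab if w not in c}

def concat(s1, s2):
    """(s1 ∘ s2)(D, C) = s2(D, C) ∪ s1(s2(D, C))."""
    return lambda D, C: s2(D, C) | s1(D, s2(D, C))

def union(s1, s2):
    """(s1 ∪ s2)(D, C) = s1(D, C) ∪ s2(D, C)."""
    return lambda D, C: s1(D, C) | s2(D, C)

def restrict(s, w, b):
    """s|b: apply s only to the b best-weighted complexes."""
    return lambda D, C: s(D, set(sorted(C, key=lambda c: w(D, c))[-b:]))

docs = [frozenset({"gold", "jewelry"}), frozenset({"silver", "coin"})]
start = {frozenset()}                     # the complex `true`
refined = concat(K, K)(docs, start)       # two refinement steps
print(max(len(c) for c in refined))       # 2: conjunctions of two word tests
```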
A complex c' ∈ s(D, {c}) is also called a refinement of c. The concatenation s1 ∘ s2 describes the application of two strategies, one after the other. Here, to every complex returned by the strategy s2, the strategy s1 is applied. By s^n the n-fold concatenation of the same strategy is described. The closure s+ of a strategy allows us to repeat the application of one strategy until no new complex can be obtained by a subsequent application. The union s1 ∪ s2 of the results of separate strategies can be obtained by applying the union operator. Restrictions s|b allow us to restrict further refinements only to the best-ranked complexes with respect to a given weighting function. The significance filter [s] selects those complexes which pass a given significance criterion. The strategies introduced allow us to imitate human strategies when manually creating rules based on sample documents. Starting with a set of complexes (generally the complex true) they can be used to describe complex search spaces.

3. Framework Instantiation: Avoiding Overfitted Complexes

As mentioned earlier, a good complex should cover some part of the documents of a category with high precision. High precision of a complex should not only hold for the example documents but also for unseen documents. When searching for such complexes based on the training samples, there is some risk in tuning complexes too much to these examples. This results in a poor precision on unseen documents. This problem is an example of the well-known overfitting problem addressed in the machine learning literature.21,22,23 In symbolic learning it is very common to address the problem of overfitting by specific weighting functions and significance measures as contained in our definition of strategy expressions. We briefly introduce the instantiation we have chosen to avoid overfitting in the experiments described later.
To simplify the description, we introduce p = |match(c, E⊕)| and n = |match(c, E⊖)| as the number of positive and negative example documents a complex covers.

3.1. Weighting Function
The precision of a complex for unseen documents can be estimated by the maximum likelihood estimate p/(p + n) on the training set. Unfortunately,
this estimate is too optimistic if the complex to be estimated covers only a small number of documents in the sample set. In order to avoid this problem we evaluated a range of weighting functions for learning complexes, of which the m-estimate as described in Ref. 24 turned out to be the best one25:

w(c, E) = (p + m · pr) / (p + n + m),   where pr = |E⊕|/|E| is the prior probability of the target category
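The m-estimate can be sketched numerically as follows (a minimal stand-alone sketch; `prior` corresponds to the fraction of positive documents in the training set):

```python
def m_estimate(p, n, prior, m):
    """m-estimate of precision: pulls small-coverage complexes toward the prior."""
    return (p + m * prior) / (p + n + m)

# A complex covering only 2 positives and 0 negatives is not yet trusted:
print(m_estimate(2, 0, prior=0.1, m=10))    # 0.25, far below p/(p+n) = 1.0
# With growing coverage the weight approaches p/(p+n):
print(m_estimate(200, 0, prior=0.1, m=10))  # ~0.957
```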
Using the m-estimate, complexes which cover only a few documents (i.e., with p + n small) will be weighted with a value close to the prior probability |E⊕|/|E|. With an increasing number of covered documents, the weight approaches p/(p + n). The value m is used to control the influence of the number of covered documents.

3.2. Significance Measure
Overfitting is not completely avoided by using the m-estimate as a weighting function. Often, the application of a strategy to a complex results in a huge number of new complexes. With the increasing number of resulting complexes there is also an increased risk that one of the complexes accidentally has a high weight on the training documents. This insight was the basis for the development of a new significance measure we describe in Ref. 25. In contrast to other significance measures known in rule learning (see, e.g., Ref. 26) for rating a complex c, it also incorporates other complexes c' obtained in the same refinement step from an original complex c_pre. More precisely, for rating the significance of c, it takes into account those complexes c' which cover the same number of documents as c. The new significance measure is called Z-measure and is defined as:

sigm(E, c_pre, S, c) = 1 - (1 - Bin(p + n, w, p))^K,   K = |{c' ∈ S : |match(c', E)| = |match(c, E)|}|

with S = s(E, c_pre) being the local search space obtained by applying the strategy s to c_pre, and

Bin(p + n, w, p) = Σ_{i=p}^{p+n} C(p + n, i) · w^i · (1 - w)^{p+n-i}.
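The binomial tail Bin and the resulting Z-measure can be sketched as follows (w and K as defined above; the function names are our own):

```python
from math import comb

def bin_tail(N, w, p):
    """Bin(N, w, p): probability of covering at least p positives out of N."""
    return sum(comb(N, i) * w**i * (1 - w)**(N - i) for i in range(p, N + 1))

def z_measure(N, w, p, K):
    """1 - (1 - Bin(N, w, p))^K for K complexes with identical coverage:
    the chance that at least one of K equal-coverage siblings reaches
    p positives purely by accident."""
    return 1 - (1 - bin_tail(N, w, p)) ** K

print(round(bin_tail(10, 0.5, 8), 4))   # 0.0547
# With many sibling complexes, an accidental high coverage becomes more likely:
print(z_measure(10, 0.5, 8, 20) > z_measure(10, 0.5, 8, 1))   # True
```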
4. Framework Instantiation: Expressive Document Features

Search strategies are designed to learn expressive document features in the form of complexes. In this section we present three of the complexes which
we have investigated in detail: conjunctions of word tests, word sequence tests, and substring tests.

4.1. Learning Conjunctions of Word Tests
Learning of conjunctions of word tests is based on the refinement operator

(K)   w1 ∧ ... ∧ wn  ⟶  w1 ∧ ... ∧ wn ∧ w,   with n ∈ ℕ0; w is a word in E⊕

It extends a given conjunction of word tests by adding new word tests. For building new word tests only those words are accepted which occur in at least one positive example document. This avoids constructing new conjunctions which cover no positive example document. Based on the refinement operator K, the strategy
([K]|1)+

defines the search space for complexes. Applied to {true}, this strategy first generates word conjunctions with only one word test using K. The significance filter "[...]" then removes all resulting complexes which are not significant, and by using "...|1" the best conjunction is selected from the remaining ones. Due to the operator "...+" the procedure is repeated on the best expression until no further improvement can be obtained.

4.2. Learning Word Sequence Tests
Often, the conjunction of word tests is not satisfactory because the order and distance of words in a document is very substantial for a good complex. By using the order and distance, the following language constructs in particular can be identified: • terms that consist of several words, such as "text categorization" • words in a discriminating context within the phrase, such as "no invoice", "not received", and "sell .... unit" • proper names, such as names of persons, companies, and products These language constructs can be captured by word sequence tests. In order to refine a single word test to a word sequence test, and to refine
a word sequence test by requiring specific words at its left and right, we introduce two refinement operators WSL and WSR:

(WSL(s))   c1 ∧ ... ∧ cn ∧ p  ⟶  c1 ∧ ... ∧ cn ∧ w[s]p
(WSR(s))   c1 ∧ ... ∧ cn ∧ p  ⟶  c1 ∧ ... ∧ cn ∧ p[s]w
with p being a word test or word sequence test and w occurring in E⊕. Based on these refinement operators, the following strategy can be used for learning complexes with word sequence tests.
((([WSL(0) ∪ WSR(0)] ∪ ... ∪ [WSL(smax) ∪ WSR(smax)])|1)+ ∘ [K]|1)+
This strategy first adds the best word test to a given conjunction. This word test is then refined to a word sequence test by repeatedly adding words to the left and right. Different maximum distances of these words to the existing word sequence, varying from 0 to the given parameter smax, are introduced for each extension. Having found the best extension of the word test to a word sequence, the strategy tries to refine the test by adding a regular word test again. This is repeated until no more improvement can be obtained.
4.3. Learning Substring Tests

So-called composite nouns are very common in German, especially in technical domains. While in English a sequence of single words can be used to make up complex terms, in German words are concatenated. Thus, the English term "text categorization system" is translated into a single word in German: "Textkategorisierungssystem". This type of composite noun is a problem if a single word within various composite nouns has to be identified. Properly chosen substring tests of the form *w, w*, *w* are an interesting workaround for this problem. Figure 2 shows by example how substring tests allow us to cluster words with similar meanings. The central problem is how to find good substring tests for a concrete text category. The refinement operator T refines the last word test w in
Fig. 2. Example of merging German words of similar meanings: textkategorisierung and dokumentkategorisierung are covered by the substring test *kategorisierung, and, together with textkategorien, by the substring test *kategori*.
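The substring tests of Fig. 2 can be found mechanically by enumerating, for a composite word, all tests built from words of the positive documents; a sketch with our own helper name and a toy word list:

```python
def substring_candidates(w, positive_words):
    """Candidate substring tests x*, *x, *x* for a word test w."""
    cands = set()
    for x in positive_words:
        if w.startswith(x):
            cands.add(x + "*")
        if w.endswith(x):
            cands.add("*" + x)
        if x in w:
            cands.add("*" + x + "*")
    return cands

cands = substring_candidates("textkategorisierung", {"text", "kategorisierung"})
print(sorted(cands))   # ['*kategorisierung', '*kategorisierung*', '*text*', 'text*']
```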
conjunction with substring tests. For example, a substring test *x* is introduced if the positive sample documents E⊕ contain the word x which is a substring of w.

(T)   c1 ∧ ... ∧ cn ∧ w  ⟶  c1 ∧ ... ∧ cn ∧ t,   with n ∈ ℕ0 and t ∈ {x* | w starts with x, x is a word in E⊕} ∪ {*x | w ends with x, x is a word in E⊕} ∪ {*x* | w contains x, x is a word in E⊕}

In the strategy using this refinement operator, first the best conjunction is refined by adding new word tests. The best b of these word tests according to the weighting function are then refined using T. All complexes obtained by K and T are then tested for their significance:
([K ∪ T|b ∘ K]|1)+

5. Using Complexes for Symbolic Rule Learning

5.1. Learning Symbolic Rules
Complexes are designed to cover some of the documents of a category with high precision. A more complete coverage of the category can be obtained by the pattern c1 ∨ c2 ∨ ... ∨ cn with the semantics match(c1 ∨ c2 ∨ ... ∨ cn, d) = 1 iff ∃i: match(ci, d) = 1. By restricting the interpretation of the pattern
to the first b < n arguments, a high coverage can be traded for a lower precision: match;,(ci V ci V • • • V cn, d) = 1 if match(ci V ci V • • • V q,, d). Using these patterns, we can learn decision rules of the following form: if matchj(ci V C2 V • • • V c„, d) = 1 then assign target category. The search for a pattern is done using a seed-driven separate-andconquer algorithm. The algorithm is similar to the well-known rule learner CN220 but uses a seed for guiding the search for complexes. The inputs of the algorithm are example documents E = E® U E® for the target category, a search heuristic s, a weighting function w, and a significance measure 'sigm' (Fig. 3). The algorithm initializes the document set R by E and the disjunction p by the empty pattern false. Starting with the complex true, it then searches for the best complex Cbest which covers an arbitrarily chosen seed document d from R®. If the best complex is true and not all documents in R® were already tried, a different document is taken as the seed. All documents covered by the best complex Cbest are then removed from R, and the next complex is searched for using a seed from the remaining documents. The loop terminates if the best complex found using all remaining documents in C as a seed is true. In the final step the complexes in the resulting pattern are sorted according to their weight given by w. Input: example documents E = E® U E® for the target category, a strategy expression s, a weighting function w, a significance measure 'sigm' Output: "best" pattern p for the target category R^- E p «— false REPEAT REPEAT d <— random example from R® S = s(E, {true}) U {true} c best <— c G 5, match(c, d) = 1 and Vc/-^cw(J5, c') < w ( £ , c) U N T I L ( ( c b e s t / true) or (all d £ R® considered)) R <- R \ match(c b e s t ,_R e ) V<~V V c b e s t U N T I L (c b e s t = true) sort complexes in p = c\ V • • • V cn with w ( £ , Ci) > w(ECj) for i < j Fig. 3. 
Seed-driven algorithm for finding the best disjunction.
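As an illustration of the b-restricted match, the following Python sketch (our own, not the authors' implementation) reduces complexes to plain conjunctions of word tests; the pattern and document below are hypothetical examples:

```python
# Illustrative sketch: a complex is a set of word tests, a pattern is an
# ordered disjunction of complexes (sorted by decreasing weight), and
# match_b restricts matching to the first b complexes, trading coverage
# for precision.

def match_complex(complex_words, doc_words):
    """A conjunction of word tests matches if every word occurs in the document."""
    return all(w in doc_words for w in complex_words)

def match_b(pattern, doc_words, b):
    """match_b(c1 v ... v cn, d) = 1 if any of the first b complexes matches d."""
    return any(match_complex(c, doc_words) for c in pattern[:b])

# Hypothetical pattern for a category, already sorted by decreasing weight:
pattern = [{"inflation", "statistics"}, {"consumer", "measured"}, {"cost", "living"}]
doc = {"the", "cost", "of", "living", "rose"}

print(match_b(pattern, doc, b=1))  # only the best complex is active: no match
print(match_b(pattern, doc, b=3))  # full disjunction: match
```

Small b keeps only the highest-weight complexes active (high precision, low coverage); raising b toward n adds weaker complexes and increases coverage.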
5.2. Evaluation
For the experimental evaluation we rely on four text collections:

• a collection of 1,004 German technical abstracts in 6 categories (e.g., "computer science and modeling" and "opto-electronics and laser technology")
• a collection of 1,741 OCR'ed German business letters belonging to 5 categories (e.g., "invoice" and "offer")
• a collection of 19,813 English financial news articles belonging to 118 categories (e.g., "earn", "acquisition", and categories corresponding to goods like "crude", "corn", and "rice")^a
• a collection of 20,000 English newsgroup articles belonging to 20 categories (e.g., "alt.atheism", "comp.sys.mac.hardware", and "comp.windows.x")^b

For the evaluation we split each collection 1:1 into a training and a test set. We only incorporated those 97 categories into our experiments for which at least 5 positive example documents existed in the training set. On the test set, we use the effectiveness measures recall and precision, which are widespread in information retrieval.^18 Recall and precision correspond to the characteristic requirements on patterns: they should cover the documents of a category as completely as possible (measured by the recall), and they should cover these documents as correctly as possible (measured by the precision). For each learned pattern p = c1 ∨ ... ∨ cn, a range of recall/precision values was computed by increasing the parameter b in the match function from 1 to n. Averaging over the categories was done at predefined recall points by macro-averaging the interpolated precision (cf. Ref. 18) for all patterns at these recall points. The resulting range of recall/precision points is graphically shown in recall/precision diagrams (here we note recall and precision in percent). Figure 4 compares the results we obtained by learning word sequence tests with those obtained by learning pure conjunctions of word tests. For learning word sequence tests, we used smax ∈ {0, 5, 10}.

a Taken from the Reuters-21578 collection which is available via http://www.research.att.com/~lewis.
b Available via http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html.
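The way raising b trades precision for coverage can be sketched on toy data (the interpolation and macro-averaging of Ref. 18 are omitted here; the document ids and match sets below are invented for illustration):

```python
# Sketch of the per-category evaluation: for a learned pattern, raise b
# from 1 to n and record recall/precision of match_b on a test set.

def recall_precision(predicted, relevant):
    """Recall and precision of a predicted document set against the relevant set."""
    tp = len(predicted & relevant)
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(predicted) if predicted else 1.0
    return recall, precision

# Toy test set: doc id -> True if the document belongs to the category.
labels = {1: True, 2: True, 3: True, 4: False, 5: False}
relevant = {d for d, y in labels.items() if y}

# matches[b] = set of doc ids covered when only the first b complexes are active.
matches = {1: {1}, 2: {1, 2}, 3: {1, 2, 3, 4}}

for b in (1, 2, 3):
    r, p = recall_precision(matches[b], relevant)
    print(f"b={b}: recall={r:.2f} precision={p:.2f}")
```

On this toy data, b = 1 gives perfect precision at low recall, while b = 3 reaches full recall at a lower precision, mirroring the trade-off plotted in the recall/precision diagrams.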
Fig. 4. Introducing word sequence tests with variable maximal distances between words (precision over recall, in percent; curves: word tests only, and word sequences with smax = 0, smax = 5, and smax = 10).
Learning word sequence tests without distances between words (smax = 0) already raises the precision by up to 6% at the low recall end. Admitting distances of at most 5 words (smax = 5) raises the precision by up to 10%. By further increasing smax (smax = 10), no further improvement can be obtained. The reason for this might be that with small distances the words still occur in the same sentence or part of the sentence. This allows the system to learn frequent and important phrases for categorization. With increasing word distances, the number of variations to express the same content grows very rapidly, which makes it hard to learn regularities.

A typical pattern learned using conjunctions of word tests is the one learned for the category cpi (consumer price index) in the Reuters collection:

(statistics ∧ inflation) ∨ (base ∧ prices ∧ year) ∨ (measured ∧ consumer) ∨ (statistics ∧ index ∧ consumer) ∨ inflation ∨ (living ∧ cost) ∨ (clothing ∧ prices) ∨ 1958 ∨ insee ∨ true

Using word sequence tests, the pattern learned for the same category cpi is:

((statistics [5] said [5] the) ∧ inflation) ∨ (> [4] inflation) ∨ ((february [4] statistics) ∧ consumer) ∨ (cost [2] living) ∨ 1958 ∨ (. [0] the [5] inflation) ∨ ((pct [5] statistics) ∧ price) ∨ (, [3] measured) ∨ (year [2] inflation) ∨ (inflation [3] 3 [2] .) ∨ (. [3] , [0] base) ∨ (inflation [5] pct [5] in) ∨ (inflation [5] <) ∨ ((inflation [5] pct) ∧ index) ∨ ((national [5] statistics) ∧ inflation) ∨ (2 [5] inflation) ∨ (clothing [3] and) ∨ (measured [5] index) ∨ insee ∨ true
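Under our reading of the notation, a test like (cost [2] living) requires the words in order with at most two other words in between; the following is a minimal sketch of such a matcher (an illustrative assumption, not the authors' code):

```python
# Sketch of a word sequence test: the test words must occur in order,
# with at most gaps[i] other words between the i-th and (i+1)-th word.

def match_sequence(seq, gaps, doc):
    """seq = [w1, w2, ...]; gaps[i] = max distance between w_i and w_{i+1}."""
    def search(i, pos):
        if i == len(seq):
            return True
        # The first word may occur anywhere; later words must fall in a window.
        lo = pos + 1
        hi = pos + 1 + gaps[i - 1] if i > 0 else len(doc) - 1
        for j in range(lo, min(hi, len(doc) - 1) + 1):
            if doc[j] == seq[i] and search(i + 1, j):
                return True
        return False
    return search(0, -1)

doc = "the cost of living index rose".split()
print(match_sequence(["cost", "living"], [2], doc))  # "cost [2] living": match
print(match_sequence(["cost", "rose"], [2], doc))    # too far apart: no match
```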
Table 1. Effect of using word sequence tests with respect to categories (smax = 5).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        8    5   13   14   15   21   26   22    5     0
  decreases           0    0    2    3    6    5    8    9   15    10
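Our understanding of the p-test of Ref. 12 is a two-proportion z-test on the systems' decisions; the sketch below, including all of its numbers, is an illustrative assumption and not a result from the experiments:

```python
# Hedged sketch of a two-proportion p-test (normal approximation):
# compare the precisions of two systems at a given recall level and
# reject equality at the 5% error probability if |z| > 1.96.

import math

def p_test(correct_a, total_a, correct_b, total_b):
    """z statistic for the difference of two proportions."""
    pa, pb = correct_a / total_a, correct_b / total_b
    p = (correct_a + correct_b) / (total_a + total_b)   # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (pa - pb) / se

z = p_test(90, 100, 75, 100)    # system A: 90% precision, system B: 75%
significant = abs(z) > 1.96     # two-sided test at the 5% level
print(f"z = {z:.2f}, significant: {significant}")
```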
The latter shows an improvement of precision of up to 20% as compared to the pattern which is purely based on conjunctions of word tests. Generally, it turns out that improvements from word sequence tests can only be shown in some categories. Table 1 lists the number of categories in which using word sequence tests (smax = 5) significantly outperformed pure conjunctions of word tests (and vice versa) at different recall levels. For significance testing we used the p-test with an error probability of 5% as described in Ref. 12. The table shows that by using word sequence tests with smax = 5, the precision can be significantly increased in up to 26 categories. This is more than one fourth of all 97 categories we have considered in our experiments. On the other hand, there is some tendency towards worse precision at higher recall levels when working with word sequence tests. We suspect an overfitting effect caused by the increased search space when using word sequence tests as the reason for this. The effect of worse accuracy caused by too extensive search has also been described in the literature.^23

The experiments using learned substring tests were only conducted on the collection of German technical abstracts. Figure 5 shows the averaged results over all 6 categories. It can be seen that incorporating learned substring tests yields a big improvement in precision. This result can be improved even more by increasing the beam b. The following typical pattern was learned without using substring tests for the category "opto-electronics and laser technology" in German technical abstracts:

laser ∨ fasern ∨ lasern ∨ lichtleitfasern ∨ (pm ∧ an) ∨ weidel ∨ laserdioden ∨ gaas ∨ optischen ∨ km ∨ gestellt ∨ optischer ∨ berichtet ∨ nut ∨ wellenlange ∨ zeigen ∨ db ∨ gemessen ∨ beschrieben ∨ herstellung ∨ optische ∨ true

Using substring tests, the pattern for this category is much shorter and the precision has been increased by up to 28%:
Fig. 5. Introducing substring tests with strategies of variable beam (precision over recall, in percent; curves: word tests only, and substring tests with b = 1, b = 10, and b = 100).
Table 2. Effect of using substring tests with respect to categories (b = 1000).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        1    3    2    3    3    4    5    5    2     0
  decreases           0    0    0    0    0    0    0    0    0     0
*laser* ∨ *faser* ∨ *optisch* ∨ *gaas* ∨ *zeigen* ∨ *monomode* ∨ gestellt ∨ *30* ∨ pm ∨ *lasern* ∨ *moden* ∨ optischen ∨ *hergestellt* ∨ true
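A wildcard test such as *laser* can be matched with a simple sketch (the semantics assumed here, a leading or trailing * permitting an arbitrary prefix or suffix, is our reading of the notation, and the word list is invented):

```python
# Sketch of a substring test: "*laser*" matches any word containing
# "laser", so inflected and compound forms are covered by a single test.

def match_substring_test(test, doc_words):
    """True if any word in the document satisfies the wildcard test."""
    pre = test.startswith("*")   # wildcard before the core string
    suf = test.endswith("*")     # wildcard after the core string
    core = test.strip("*")
    for w in doc_words:
        if pre and suf and core in w:
            return True
        if pre and not suf and w.endswith(core):
            return True
        if suf and not pre and w.startswith(core):
            return True
        if not pre and not suf and w == core:
            return True
    return False

words = ["die", "laserstrukturen", "wurden", "gemessen"]
print(match_substring_test("*laser*", words))  # covers the compound noun
print(match_substring_test("laser", words))    # an exact word test fails here
```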
It turns out that the substring test "*laser*" alone replaces tests on more than 15 compound nouns containing the word laser, e.g. "injektionslasers" (Engl. "injection laser"), "laserstrukturen" ("laser structures"), "laserlicht" ("laser light"), "laseremission" ("laser emission"), "laserlichtempfindlichkeit" ("laser light sensitivity"), and "lasermodul" ("laser module"). Table 2 lists the number of categories in which learned substrings improve precision significantly at different recall levels (b = 1000). A significant decrease in precision could not be found in any category.

6. Using Complexes for Learning with the SVM

6.1. Learning Complex Features
Learned complexes can also be used to enrich standard word-based vector representations of documents. Figure 6 shows our algorithm for generating
Input: example documents E = E+ ∪ E− for the target category, a minimum number n of word occurrences, a strategy s, a weighting function w, a significance measure 'sigm'
Output: set of complexes C = {c1, c2, ..., cn} for the target category

  R ← E
  C ← {w : |match(w, E)| ≥ n}
  REPEAT
    REPEAT
      d ← random example from R+
      S ← s({true}, E) ∪ {true}
      c ← c ∈ S with match(c, d) = 1 and ∀c' ≠ c: w(c', E) ≤ w(c, E)
    UNTIL ((c ≠ true) or (all d ∈ R+ considered))
    R ← R \ match(c, R+)
    C ← C ∪ {c}
  UNTIL (c = true)

Fig. 6. Seed-driven algorithm for learning features.
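How the returned complexes extend a word-based representation can be sketched as follows (the vocabulary, the single complex, and the simple conjunction-of-words matcher are all illustrative assumptions):

```python
# Sketch: each learned complex c becomes a boolean feature whose value
# for a document d is match(c, d), appended after the plain word features.

def doc_to_vector(doc_words, vocabulary, complexes):
    """Word features followed by boolean complex features."""
    word_features = [1 if w in doc_words else 0 for w in vocabulary]
    complex_features = [1 if all(w in doc_words for w in c) else 0
                        for c in complexes]
    return word_features + complex_features

vocab = ["laser", "faser", "optisch"]
complexes = [{"laser", "gemessen"}]          # one hypothetical learned complex
doc = {"laser", "gemessen", "wurden"}
print(doc_to_vector(doc, vocab, complexes))  # [1, 0, 0, 1]
```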
document features in the form of complexes specific to a category to be learned. In contrast to the previous algorithm for symbolic rule learning, this one extends a simple word-oriented feature-value representation with more complex features. The inputs of the algorithm are positive and negative example documents E = E+ ∪ E− for the target category, the number n as the minimum number of word occurrences for introducing a word as a feature, and a search strategy s for generating the complex features. The complexes returned by the algorithm are interpreted as features for representing documents in a feature-value learner. A complex c corresponds to a boolean feature whose value for a new document d is given by the match function match(c, d).

6.2. Support Vector Machines
Support Vector Machines (SVM) learn linear decision rules of the form if w · f + b > 0 then assign target category, described by the weight vector w and the threshold b. The vector f is the feature vector representing the document to be categorized. SVMs rely on the structural risk minimization principle.^27 Based on this principle they try
to find the hyperplane with the lowest error probability. Vapnik shows that finding this hyperplane corresponds to solving the following optimization problem:

  minimize:   V(w, b, ξ) = (1/2) w · w + C Σ_{i=1}^{n} ξ_i
  subject to: ∀_{i=1}^{n} : y_i [w · f_i + b] ≥ 1 − ξ_i
              ∀_{i=1}^{n} : ξ_i ≥ 0

In the formula, ξ_i is a slack variable which is at least 1 if the corresponding training sample lies on the wrong side of the hyperplane. The factor C is a parameter for trading off training error vs. model complexity.^28,29 The SVM was chosen as a representative of learning algorithms representing documents by features. SVMs were first applied to text categorization by Joachims^10 and belong to the best-performing learning algorithms known for this problem.

6.3. Evaluation
The algorithm for generating category-specific features was evaluated as follows: as a baseline we chose a word-based document representation using all words occurring in at least 3 sample documents as features. This results in a feature vector (f_1, f_2, ..., f_n). For each category, the performance of this representation was compared with the effectiveness of a representation enriched by learned complexes: (f_1, f_2, ..., f_n, f_{n+1}, ..., f_{n+m}) with f_{n+1}, ..., f_{n+m} corresponding to the learned complexes c ∈ C (m = |C|). The evaluation was done using the SVM^light implementation as provided by Joachims.^c For generating different recall/precision points we varied the parameter b in the decision rule. Figure 7 shows the results with and without learned features corresponding to word sequence tests, in comparison to the purely symbolic approach described in Section 5. It turns out that even without learned features the SVM is superior to the symbolic classifier. We think this is caused by the inherent weakness of purely symbolic formalisms in aggregating weak indications for a category into stronger decisions. Nevertheless, even the effectiveness of the SVM can be improved again by the learned features, by up to more than 5% in precision.
c Available via http://ais.gmd.de/~thorsten/svm_light.
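The experiments above use the SVM^light implementation; as a self-contained stand-in, the sketch below trains a plain perceptron (not an SVM, but likewise a linear rule of the form w · f + b) on synthetic enriched vectors whose last column plays the role of a learned complex feature:

```python
# Stand-in sketch: a mistake-driven perceptron learns a linear decision
# rule over enriched feature vectors. The data is synthetic; columns
# f1..f3 are word features, f4 is a learned complex feature.

def train_perceptron(X, y, epochs=20):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for f, t in zip(X, y):
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            if pred != t:                     # update only on mistakes
                for i, fi in enumerate(f):
                    w[i] += (t - pred) * fi
                b += (t - pred)
    return w, b

X = [[1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 1, 0]]
y = [1, 1, 0, 0]                              # 1 = target category

w, b = train_perceptron(X, y)
score = sum(wi * fi for wi, fi in zip(w, [1, 0, 1, 1])) + b
print("assign category:", score > 0)          # new document with f4 = 1
```

The enriched column lets a linear learner separate documents that share no single decisive word, which is the intended effect of adding learned complexes to the representation.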
Fig. 7. Enriching a word-based feature-value document representation by learned word sequence tests for the SVM (precision over recall; curves: rules, rules with word sequences, SVM, SVM with word sequences).
Fig. 8. Enriching a word-based feature-value document representation by learned substring tests for the SVM (precision over recall; curves: rules, rules with substring tests, SVM, SVM with substring tests).
Figure 8 shows the results when enriching the document representation in the collection of German technical abstracts by learned substring tests. The SVM without learned features is already superior to the symbolic classifier, and the results of the SVM can again be improved considerably (by up to 10% in precision) by adding learned substring tests. Tables 3 and 4 list the number of categories in which word sequence tests and substring tests improved the effectiveness of the SVM. The results
Table 3. Effect of using word sequence tests for the SVM with respect to categories (smax = 5).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        0    2    0    2    5    5   11   14   20    10
  decreases           0    0    0    0    0    0    0    0    1     2
Table 4. Effect of using substring tests for the SVM with respect to categories (b = 1000).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        0    0    0    0    3    3    2    2    2     0
  decreases           0    0    0    0    0    0    0    0    0     0
when using word sequence tests (Table 3) show that using the SVM with the enriched feature set does not decrease the precision at higher recall as much as the symbolic approach does. Using word substrings only led to significantly improved precision in 3 categories (Table 4).

7. Summary

Standard approaches to learning document categorizations represent documents as vectors. The elements of these vectors indicate whether, or how often, a specific word occurs in the encoded document. In this chapter we introduced a conceptual framework for learning complex document features to be used by text categorization algorithms. Based on a document representation which maintains word positions as well as the words themselves, the framework allows the system to construct arbitrary features in texts, named complexes. The search for the best complexes is controlled by a declarative strategy language and can be described as a symbolic approach for learning good features.

For empirically evaluating the advantages achievable using such learned features, we analyzed the effects of incorporating learned document features on two learning text categorization systems which belong to different learning paradigms. First, we embedded our algorithm for searching for optimal features into a CN2-like symbolic learner. Using four test document collections of different domains and languages, we showed that the usage of word
sequence tests and substring tests in complexes improved classifier effectiveness significantly. We then used learned features to enrich the standard word-based vector representation of documents for a statistical classifier, the Support Vector Machine, which belongs to the best known statistical learners for text categorization. It was also shown that the effectiveness of this learning approach can be improved considerably by the learned complexes.

The latter and, in our experiments, better approach is hybrid in the sense that it integrates a symbolic learning approach for constructing features (learning complexes in our framework) and a statistical learning approach (the SVM). It seems to be particularly useful to combine the strength of the symbolic approach — the exploration of the huge search spaces of possibly useful document features — with the strength of statistics-based approaches — the tuning of numerical parameters — to achieve the best-performing classifiers for text categorization.
Acknowledgments

This work has been supported by a grant from The Federal Ministry of Education, Science, Research, and Technology (FKZ 01 IN 902 B 8).

References

1. P. J. Hayes, P. M. Anderson, I. B. Nirenburg, and L. M. Schmandt. TCS: a shell for content-based text categorization. In Proceedings of the 6th Conference on Artificial Intelligence Applications, pages 320-326, Santa Barbara, CA, USA, May 5-9 1990.
2. P. J. Hayes and S. P. Weinstein. Construe-TIS: A System for Content-Based Indexing of a Database of News Stories. In Alain Rappaport and Reid Smith, editors, Innovative Applications of Artificial Intelligence 2, pages 49-64. AAAI Press/MIT Press, 1991.
3. R. M. Tong and D. G. Shapiro. Experimental Investigations of Uncertainty in a Rule-Based System for Information Retrieval. International Journal of Man-Machine Studies, 22:265-282, 1985.
4. L. Gilardoni, P. Prunotto, and G. Rocca. Hierarchical pattern matching for knowledge based news categorization. In RIAO 94 Conference Proceedings, pages 67-81, New York, NY, USA, October 11-13 1994.
5. S. Agne and H.-G. Hein. A Pattern Matcher for OCR Corrupted Documents and its Evaluation. In Proceedings of the IS&T/SPIE 10th International Symposium on Electronic Imaging Science and Technology (Document Recognition V), pages 160-168, San Jose, CA, USA, January 24-30 1998.
6. C. Wenzel and R. Hoch. Text categorization of scanned documents applying a rule-based approach. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR 95), pages 333-346, Las Vegas, NV, USA, April 24-26 1995.
7. D. D. Lewis. Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, 1992.
8. W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 96), pages 307-316, Zurich, Switzerland, August 18-22 1996.
9. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training Algorithms for Linear Text Classifiers. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 96), pages 298-306, Zurich, Switzerland, August 18-22 1996.
10. T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML 98), pages 137-148, Chemnitz, Germany, April 21-24 1998.
11. M. Junker and R. Hoch. An experimental evaluation of OCR text representations for learning document classifiers. International Journal on Document Analysis and Recognition, 1(2):116-122, June 1998.
12. Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 99), pages 42-49, Berkeley, CA, USA, August 15-19 1999.
13. R. E. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
14. R. Tong, A. Winkler, and P. Gage. Classification Trees for Document Routing, a Report on the TREC Experiment. In Proceedings of the First Text REtrieval Conference (TREC-1), pages 209-227, 1993.
15. D. D. Lewis and M. Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Proceedings of the 3rd Symposium on Document Analysis and Information Retrieval (SDAIR 94), pages 81-93, Las Vegas, NV, USA, April 11-13 1994.
16. W. W. Cohen. Learning to Classify English Text with ILP Methods. In Advances in Inductive Logic Programming, pages 124-143. IOS Press, 1996.
17. Markus Junker. Heuristisches Lernen von Regeln für die Textkategorisierung. PhD thesis, University of Kaiserslautern, Germany, 2001.
18. C. van Rijsbergen. Information Retrieval. Butterworth, London, England, 1979.
19. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, 1989.
20. P. Clark and T. Niblett. The CN2 Algorithm. Machine Learning, 3(4):261-283, 1989.
21. D. W. Aha and R. L. Bankert. A comparative evaluation of sequential feature selection algorithms. In Learning from Data - Artificial Intelligence and Statistics, chapter 19, pages 199-205. Springer, 1996.
22. S. W. Norton and H. Hirsh. Learning DNF via probabilistic evidence combination. In Machine Learning: Proceedings of the 10th International Conference (ML 93), pages 220-227, Amherst, MA, USA, June 27-29 1993.
23. J. R. Quinlan and R. M. Cameron-Jones. Oversearching and layered search in empirical learning. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95), pages 1019-1024, Montréal, Canada, 1995.
24. S. Dzeroski. Handling imperfect data in inductive logic programming. In Fourth Scandinavian Conference on Artificial Intelligence, pages 111-125, 1993.
25. M. Junker and A. Dengel. Preventing overfitting in learning text patterns for document categorisation. In Second International Conference on Advances in Pattern Recognition (ICAPR 2001), Rio de Janeiro, Brazil, 2001.
26. J. Fürnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3-54, 1999.
27. V. Vapnik. Statistical Learning Theory. Wiley, 1998.
28. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, pages 121-167, 1998.
29. N. Cristianini and J. Shawe-Taylor. Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
HYBRID METHODS IN PATTERN RECOGNITION The field of pattern recognition has seen enormous progress since its beginnings almost 50 years ago. A large number of different approaches have been proposed. Hybrid methods aim at combining the advantages of different paradigms within a single system. This book presents a collection of articles describing recent progress in this emerging field. It covers topics such as the combination of neural nets with fuzzy systems or hidden Markov models, neural networks for the processing of symbolic data structures, hybrid methods in data mining, the combination of symbolic and subsymbolic learning, and others. Also included is recent work on multiple classifier systems. Furthermore, the book deals with applications in on-line and off-line handwriting recognition, remotely sensed image interpretation, fingerprint identification, and automatic text categorization.
ISBN 981-02-4832-6
www.worldscientific.com