KOHONEN MAPS
Edited by
ERKKI OJA and SAMUEL KASKI
Neural Networks Research Centre, Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
1999
ELSEVIER
AMSTERDAM - LAUSANNE - NEW YORK - OXFORD - SHANNON - SINGAPORE - TOKYO
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
© 1999 Elsevier Science B.V. All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science Rights & Permissions Department, PO Box 800, Oxford OX5 1DX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also contact Rights & Permissions directly through Elsevier's home page (http://www.elsevier.nl), selecting first 'Customer Support', then 'General Information', then 'Permissions Query Form'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 1999

Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.
ISBN: 0 444 50270 X
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Preface: Kohonen Maps
Professor Teuvo Kohonen is known world-wide as a leading pioneer in neurocomputing. His research interests include the theory of self-organization, associative memories, neural networks, and pattern recognition, on which he has published over 200 research papers and four monograph books. His influence on contemporary research is indicated by the more than 3300 recent publications world-wide in which the most central of his ideas, the Self-Organizing Map (SOM), also known as the Kohonen Map, is analyzed and applied to data analysis and pattern recognition problems.

Kohonen's research on the Self-Organizing Map began in early 1981. What was required was an efficient algorithm that would map similar patterns, given as vectors close to each other in the input space, onto contiguous locations in the output space. Numerous experiments were made with the SOM, including the formation of phoneme maps for use in speech recognition. Extensions to supervised learning tasks, the Supervised SOM and Learning Vector Quantization (LVQ) algorithms, brought further improvements.

The SOM algorithm was one of the strong underlying factors in the new popularity of neural networks starting in the early 80's. It is the most widely used neural network learning rule in the class of unsupervised algorithms, and has been implemented in a large number of commercial and public domain neural network software packages. The best sources of details and applications of the SOM are Kohonen's books "Self-Organization and Associative Memory" (Springer, 1984) and "Self-Organizing Maps" (Springer, 1995; 2nd extended edition, 1997).

Recently, Teuvo Kohonen has been working on a new type of feature extraction algorithm, the Adaptive-Subspace SOM (ASSOM), which combines the old Learning Subspace Method and the Self-Organizing Map. He has shown how invariant feature detectors, for example the well-known wavelet filters for digital images and signals, will emerge automatically in the ASSOM. In another sample of his present research, the SOM algorithm is applied to organize large collections of free-form text documents like those available in the Internet. The method is called WEBSOM. In the largest application of the WEBSOM
implemented so far, reported in this book for the first time, about 7 million documents have been organized.

Teuvo Kohonen will retire from his office at the Academy of Finland in July 1999, although he will not retire from his research work. With the decade drawing to an end, during which neural networks in general and the Kohonen Map in particular attained high visibility and much success, it may be time to take a look at the state of the art and the future. Therefore, we decided to organize a high-level workshop in July 1999 on the theory, methodology and applications of the SOM, to celebrate this occasion. Many of the top experts in the field accepted our invitation to participate and submit articles covering their research. The result is contained in this book, expertly compiled and printed by Elsevier.

The 30 chapters of this book cover the current status of SOM theory, such as connections of SOM to clustering, vector quantization, classification, and active learning; the relation of SOM to generative probabilistic models; optimization strategies for SOM; and energy functions and topological ordering. Most of the chapters, however, are focused on applications of the SOM. Data mining and exploratory data analysis is a central topic, applied to large databases of financial data, medical data, free-form text documents, digital images, speech, and process measurements. Other applications covered are robotics, printed circuit board optimization and electronic circuit design, EEG classification, human voice analysis, and spectroscopy. Finally, there are a few chapters on biological models related to the SOM, such as models of cortical maps and spatio-temporal memory.
Acknowledgements. We wish to thank all the people who have made the WSOM'99 workshop and hence this book possible. We are especially grateful to the rest of the organizing committee: Esa Alhoniemi, Johan Himberg, Jukka Iivarinen, Krista Lagus, Markus Peura, Olli Simula, and Juha Vesanto. Finally, we thank the Academy of Finland for financial support.
Espoo, Finland, April 23, 1999
Erkki Oja
Samuel Kaski
Table of contents
Preface: Kohonen Maps ................................................. v
Table of contents ..................................................... vii

Analyzing and representing multidimensional quantitative and qualitative data: Demographic study of the Rhône valley. The domestic consumption of the Canadian families
M. Cottrell, P. Gaubert, P. Letremy, P. Rousset ....................... 1

Value maps: Finding value in markets that are expensive
G. J. Deboeck ......................................................... 15

Data mining and knowledge discovery with emergent Self-Organizing Feature Maps for multivariate time series
A. Ultsch ............................................................. 33

From aggregation operators to soft Learning Vector Quantization and clustering algorithms
N. B. Karayiannis ..................................................... 47

Active learning in Self-Organizing Maps
M. Hasenjäger, H. Ritter, K. Obermayer ................................ 57

Point prototype generation and classifier design
J. C. Bezdek, L. I. Kuncheva .......................................... 71

Self-Organizing Maps on non-Euclidean spaces
H. Ritter ............................................................. 97

Self-Organising Maps for pattern recognition
N. M. Allinson, H. Yin ................................................ 111

Tree structured Self-Organizing Maps
P. Koikkalainen ....................................................... 121

Growing self-organizing networks - history, status quo, and perspectives
B. Fritzke ............................................................ 131

Kohonen Self-Organizing Map with quantized weights
P. Thiran ............................................................. 145

On the optimization of Self-Organizing Maps by genetic algorithms
D. Polani ............................................................. 157

Self organization of a massive text document collection
T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela ... 171

Document classification with Self-Organizing Maps
D. Merkl .............................................................. 183

Navigation in databases using Self-Organising Maps
S. A. Shumsky ......................................................... 197

A SOM-based sensing approach to robotic manipulation tasks
E. Cervera, A. P. del Pobil ........................................... 207

SOM-TSP: An approach to optimize surface component mounting on a printed circuit board
H. Tokutaka, K. Fujimura .............................................. 219

Self-Organising Maps in computer aided design of electronic circuits
A. Hemani, A. Postula ................................................. 231

Modeling self-organization in the visual cortex
R. Miikkulainen, J. A. Bednar, Y. Choe, J. Sirosh ..................... 243

A spatio-temporal memory based on SOMs with activity diffusion
N. R. Euliano, J. C. Principe ......................................... 253

Advances in modeling cortical maps
P. G. Morasso, V. Sanguineti, F. Frisone .............................. 267

Topology preservation in Self-Organizing Maps
T. Villmann ........................................................... 279

Second-order learning in Self-Organizing Maps
R. Der, M. Herrmann ................................................... 293

Energy functions for Self-Organizing Maps
T. Heskes ............................................................. 303

LVQ and single trial EEG classification
G. Pfurtscheller, M. Pregenzer ........................................ 317

Self-Organizing Map in categorization of voice qualities
L. Leinonen ........................................................... 329

Chemometric analyses with Self Organising Feature Maps: A worked example of the analysis of cosmetics using Raman spectroscopy
R. Goodacre, N. Kaderbhai, A. C. McGovern, E. A. Goodacre ............. 335

Self-Organizing Maps for content-based image database retrieval
E. Oja, J. Laaksonen, M. Koskela, S. Brandt ........................... 349

Indexing audio documents by using latent semantic analysis and SOM
M. Kurimo ............................................................. 363

Self-Organizing Map in analysis of large-scale industrial systems
O. Simula, J. Ahola, E. Alhoniemi, J. Himberg, J. Vesanto ............. 375

Keyword index ......................................................... 389
Analyzing and representing multidimensional quantitative and qualitative data: Demographic study of the Rhône valley. The domestic consumption of the Canadian families.

Marie Cottrell, Patrice Gaubert, Patrick Letremy, Patrick Rousset
SAMOS-MATISSE, Université Paris 1
90, rue de Tolbiac, F-75634 Paris Cedex 13, France

1. INTRODUCTION

The SOM algorithm is now extensively used for data mining, representation of multidimensional data and analysis of relations between variables ([1], [2], [5], [9], [11], [12], [13], [15], [16], [17]). Compared with other classification methods, the main characteristic of the SOM classification is the conservation of the topology: after learning, "close" observations are associated to the same class or to "close" classes according to the definition of the neighborhood in the SOM network. This feature allows us to consider the resulting classification as a good starting point for further developments, as shown in what follows. But in fact its capabilities have not been fully exploited so far.

In this chapter, we present some of the techniques that can be derived from the SOM algorithm: the representation of the class contents, the visualization of the distances between classes, a rapid and robust two-level classification based on the quantitative variables, the computation of clustering indicators, and the crossing of the classification with some qualitative variables in order to interpret the classification and give prominence to the most important explanatory variables. See [3], [4], [8], [9] for precise definitions of all these techniques. We also define two original algorithms (KORRESP and KACM) to analyze the relations between qualitative variables.

The paper is organized as follows: in sections 2 and 3 we present the main tools, and in sections 4 and 5 we show real applications in socio-economic fields.

2. THE MAIN TECHNIQUES

Let us give some notations: we consider a set of N observations, where each individual is described by P quantitative real-valued variables and K qualitative variables. The main tool is a Kohonen network, generally a two-dimensional grid with n units, but the method can be used with any topological organization of the network. After learning, each unit i is represented in the space R^P by its weight vector C_i (or code vector). We do not examine here the delicate problem of the learning of the code vectors ([9], [16], [17]), which is supposed to be successfully realized from the N observations restricted to their P quantitative variables.
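The learning of the code vectors is not detailed in this chapter; purely as an illustration, a minimal on-line SOM training loop of the kind referred to above could be sketched as follows (Python/NumPy; the grid size and decay schedules are arbitrary assumptions, not the settings used here).

```python
import numpy as np

def train_som(X, rows=10, cols=10, n_iter=20000, seed=0):
    """Minimal on-line SOM training sketch on a rows x cols grid.

    X: array of shape (N, P) with the quantitative variables.
    Returns one code vector per unit, shape (rows*cols, P).
    """
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # grid coordinates of the units, used by the neighborhood function
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # initialize the code vectors with randomly drawn observations
    code = X[rng.integers(0, len(X), n_units)].astype(float)

    for t in range(n_iter):
        x = X[rng.integers(0, len(X))]                      # random observation
        winner = np.argmin(((code - x) ** 2).sum(axis=1))   # closest code vector
        # learning rate and neighborhood radius shrink over time (arbitrary schedules)
        eps = 0.5 * (0.01 / 0.5) ** (t / n_iter)
        sigma = (max(rows, cols) / 2) * (1.0 / (max(rows, cols) / 2)) ** (t / n_iter)
        d2 = ((grid - grid[winner]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian neighborhood
        code += eps * h[:, None] * (x - code)               # move units toward x
    return code
```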
Classification: After convergence, each observation is classified by a nearest-neighbor method (in R^P): an observation belongs to class i if and only if the code vector C_i is the closest among all the code vectors. The distance in R^P is in general the Euclidean distance, but it can be chosen in another way according to the application.
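A minimal sketch of this classification step, assuming code vectors obtained for instance with the routine above; the Euclidean distance is used here, but an application-specific distance could be substituted.

```python
import numpy as np

def classify(X, code):
    """Assign each observation to the unit whose code vector is closest (Euclidean)."""
    # squared distances of every observation to every code vector, shape (N, n_units)
    d2 = ((X[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)   # class index of each observation
```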
Representation of the contents: The classes are represented according to the chosen topology of the network, along a chain or on a grid, and all the elements can be drawn inside their classes. So it is possible to see how the observations change from a class to its neighbors and to appreciate the homogeneity of the classes.

Distances between classes: To highlight the true inter-class distances, following the method proposed in [6], we represent each unit by an octagon inside the cell of the SOM map. The bigger the octagon (the closer it comes to the border of its cell), the nearer the code vector is to its neighbors. This avoids misleading interpretations and gives an idea of the discrimination of the classes.

Two-level classification: A hierarchical clustering of the n code vectors puts together the most similar SOM classes and provides a second classification into a smaller number of classes. These macro-classes create connected areas in the initial map, so the neighborhood relations between them are kept. This grouping facilitates the interpretation of the contents of the classes.

Crossing with qualitative variable: To interpret the classes according to an explanatory qualitative variable, it is valuable to study the discrete distribution of its modalities in each class. We propose to draw, for example, a frequency pie chart inside each cell. In this way we make clear the continuity of the classes as well as the cutoffs. We can also associate to a SOM class the most frequent modalities of a qualitative variable and in this manner give a better description of the classes.

3. ANALYSIS OF RELATIONS BETWEEN QUALITATIVE VARIABLES

Let us define here two original algorithms to analyze the relations between qualitative variables. The first one is defined only for two qualitative variables. It is called KORRESP and is analogous to the classical Correspondence Analysis. The second one is devoted to the analysis of any finite number of qualitative variables. It is called KACM and is similar to the Multiple Correspondence Analysis. See [3], [4] for previous related papers.

For both algorithms, we consider a sample of individuals and a number K of qualitative variables. Each variable k = 1, 2, ..., K has m_k possible modalities. For each individual and each variable, there is one and only one modality. If M is the total number of modalities, each individual is represented by a row M-vector with values in {0, 1}: there is only one 1 among the first m_1 components, only one 1 between the (m_1+1)-th component and the (m_1+m_2)-th one, and so on.

In the general case, where K > 2, the data are summarized into a Burt table, which is a cross-tabulation table. It is an M x M symmetric matrix composed of K x K blocks, such that the (k, l)-block B_kl (for k < l) is the (m_k x m_l) contingency table which crosses the variable k and the variable l. The block B_kk is a diagonal matrix, whose diagonal entries are the numbers of individuals who have respectively chosen the modalities 1, 2, ..., m_k of variable k. From now on, the Burt table is denoted by B. In the case K = 2, we only need the contingency table T which crosses the two variables. In that case, we set p (resp. q) for m_1 (resp. m_2).
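As an illustration, with the 0/1 disjunctive coding just described gathered in an N x M indicator matrix Z, the Burt table is simply B = Z^T Z. A minimal sketch with hypothetical category data:

```python
import pandas as pd

def burt_table(df):
    """Burt table of a data frame with one qualitative variable per column."""
    # complete disjunctive (one-hot) coding: one 0/1 column per modality
    Z = pd.get_dummies(df).astype(int)
    return Z.T @ Z          # M x M symmetric matrix of co-occurrence counts

# hypothetical example with K = 2 variables
df = pd.DataFrame({"census": ["aug_for", "stable", "dim_for", "stable"],
                   "profession": ["A", "C", "F", "C"]})
print(burt_table(df))
```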
3.1 The KORRESP algorithm

Let K = 2. In the contingency table T, the first qualitative variable has p levels and corresponds to the rows. The second one has q levels and corresponds to the columns. The entry n_ij is the number of individuals categorized by the row i and the column j. From the contingency table, the matrix of relative frequencies (f_ij = n_ij / Σ_ij n_ij) is computed. Then the rows and the columns are normalized in order to have a sum equal to 1. The row profile r(i), 1 ≤ i ≤ p, is the discrete probability distribution of the second variable when the first variable has modality i, and the column profile c(j), 1 ≤ j ≤ q, is the discrete probability distribution of the first variable when the second variable has modality j.

The classical Correspondence Analysis is a simultaneous weighted Principal Component Analysis on the row profiles and on the column profiles. The distance is chosen to be the χ² distance. In the simultaneous representation, related modalities are projected onto neighboring points.

To define the algorithm KORRESP, we build a new data matrix D: to each row profile r(i), we associate the column profile c(j(i)) which maximizes the probability of j given i, and conversely, we associate to each column profile c(j) the row profile r(i(j)) which is the most probable given j. The data matrix D is the ((p+q) x (q+p))-matrix whose first p rows are the vectors (r(i), c(j(i))) and whose last q rows are the vectors (r(i(j)), c(j)). The SOM algorithm is processed on the rows of this data matrix D. Note that the inputs are picked at random alternately among the p first rows and the q last ones, and that the winning unit is computed only on the q first components in the first case, and on the p last ones in the second case, according to the χ² distance. After convergence, each modality of both variables is classified into a Voronoï class. Related modalities are classified into the same class or into neighboring classes. This method gives a very quick, efficient way to analyze the relations between two qualitative variables. See [3] for real-world applications.
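A minimal sketch of the construction of D from a p x q contingency table (a NumPy array n), following the description above; training the SOM on the rows of D with the χ² distance then proceeds as in the standard algorithm and is omitted here.

```python
import numpy as np

def korresp_matrix(n):
    """Build the KORRESP data matrix D from a p x q contingency table n (a sketch)."""
    f = n / n.sum()                            # relative frequencies f_ij
    r = f / f.sum(axis=1, keepdims=True)       # row profiles r(i), each of length q
    c = (f / f.sum(axis=0, keepdims=True)).T   # column profiles c(j), each of length p
    j_of_i = f.argmax(axis=1)                  # most probable column modality given row i
    i_of_j = f.argmax(axis=0)                  # most probable row modality given column j
    top = np.hstack([r, c[j_of_i]])            # p rows: (r(i), c(j(i)))
    bottom = np.hstack([r[i_of_j], c])         # q rows: (r(i(j)), c(j))
    return np.vstack([top, bottom])            # shape (p + q, q + p)
```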
3.2 The KACM algorithm

When there are more than two qualitative variables, the above method no longer works. In that case, the data matrix is the Burt table B. The rows are normalized in order to have a sum equal to 1. At each step, we pick a normalized row at random according to the frequency of the corresponding modality. We define the winning unit according to the χ² distance and update the weight vectors as usual. After convergence, we get an organized classification of all the modalities, where related modalities belong to the same class or to neighboring classes. In that case also, the KACM method provides a very interesting alternative to classical Multiple Correspondence Analysis.

The main advantages of both the KORRESP and KACM methods are their rapidity and their small computing time. While the classical methods have to use several representations with decreasing information in each, ours provide only one map, which is rough but unique and permits a rapid and complete interpretation. See [3], [4] and [7] for the details and financial applications.
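A minimal sketch of KACM as described above, on a one-dimensional chain of units; the χ² weighting below uses the column masses of the Burt table, as in standard correspondence analysis, and the decay schedules are arbitrary assumptions rather than the authors' settings.

```python
import numpy as np

def train_kacm(B, n_units=20, n_iter=5000, seed=0):
    """KACM sketch: SOM on the normalized rows of a Burt table B (M x M array)."""
    rng = np.random.default_rng(seed)
    freq = B.sum(axis=1)                        # frequency of each modality
    rows = B / freq[:, None]                    # normalized rows, each summing to 1
    col_mass = B.sum(axis=0) / B.sum()          # column masses, weights of the chi-2 distance
    p_pick = freq / freq.sum()                  # pick modalities according to their frequency
    code = rows[rng.choice(len(rows), n_units, p=p_pick)].copy()
    chain = np.arange(n_units, dtype=float)     # one-dimensional chain topology

    for t in range(n_iter):
        x = rows[rng.choice(len(rows), p=p_pick)]
        d = (((code - x) ** 2) / col_mass).sum(axis=1)   # chi-2 distance to each unit
        winner = d.argmin()
        eps = 0.5 * (0.01 / 0.5) ** (t / n_iter)
        sigma = (n_units / 2) * (1.0 / (n_units / 2)) ** (t / n_iter)
        h = np.exp(-(chain - chain[winner]) ** 2 / (2 * sigma ** 2))
        code += eps * h[:, None] * (x - code)
    return code
```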
4. DEMOGRAPHIC STUDY OF THE RHONE VALLEY
The data come from the project ARCHEOMEDES, supported by the EU, in collaboration with the laboratory P.A.R.I.S (University Paris 1). We consider 1783 communes in the Rhône valley, in the south of France. This valley is situated on the two banks of the river Rhône. It includes some big cities (Marseille, Avignon, Arles, ...), some small towns, and many rural villages. A large part is situated in medium mountains, in areas that have been very depopulated since the so-called drift from the land. At the same time, in the vicinity of the large or small towns, the communes have attracted a lot of people who work in urban employment. The goal of this study is to understand the relations between the evolution of the population and the professional composition in each commune.

The data include two tables. The first one gives the population numbers at seven censuses (1936, 1954, 1962, 1968, 1975, 1982, 1990). These numbers are normalized by dividing by their sum, to keep the evolution and not the absolute values. The second one contains the current numbers of working population, distributed among six professional categories (farmers, craftsmen, managers, intermediate occupations, clerks, workers). In this second table, the data are transformed into percentages and will be compared with the χ² distance.

The first step consists in defining two classifications of the communes from the two types of data. We use a Kohonen one-dimensional network (a chain) to transform the quantitative variables into qualitative ordered characters. First we classify the communes into 5 classes from the census data, and then into 6 classes from the professional composition data. The first classification into 5 classes is easily interpretable. The classes are arranged according to an evident order: there are the communes with strong increase (aug_for), with medium increase (aug_moy), with relative stability (stable), with medium decrease (dim_moy), and with strong decrease (dim_for). See the code vectors in Fig. 4.1 and the contents of the 5 classes in Fig. 4.2.
Fig. 4.1: The code vectors of the 5 classes (first classification). The curves represent the population evolution over the seven censuses.
Fig. 4.2: The contents of the 5 classes (first classification). In each class, all the commune vectors are drawn in a superposed way.

The 6 classes (A, B, C, D, E, F) provided by the second classification, of the professional composition data, are a little more delicate to interpret. Actually they straightforwardly correspond to an order following the relative importance of the farmer category. Class A contains almost no farmers, while class F consists of communes with a majority of farmers (these are very small villages, but the use of the χ² distance restores their importance). See the code vectors in Fig. 4.3 and the contents of the 6 classes in Fig. 4.4.
Fig. 4.3: The code vectors of the 6 classes A, B, C, D, E, F (second classification). The curves correspond to the (corrected) proportions of farmers, craftsmen, managers, intermediate occupations, clerks, and workers.
Fig. 4.4: The contents of the 6 classes (second classification).

Note that in class A, some communes are very specific: they do not have any farmer, but all their inhabitants belong to one or two categories. From these two classifications, we compute the contingency table, see Table 4.1. A quick glance shows a strong dependence between the row variable (census classes) and the column variable (professional classes).

Table 4.1: Contingency table which crosses the two classifications.

              A     B     C     D     E     F
  aug_for   223    53    26     9     0     0
  aug_moy    85   112    86    34     7     1
  stable     80   100   112   113    54    22
  dim_moy    29    50    55    80    56    57
  dim_for    34    18    35    57    69   126
To analyze this dependence, we use a classical Correspondence Analysis and the KORRESP algorithm. The results are shown in Fig. 4.5 and Table 4.2.
Fig. 4.5: The first projection of the modalities using a Factorial Correspondence Analysis (axes 1 (0.70) and 2 (0.26)).

Table 4.2: The two-dimensional Kohonen map with the results of KORRESP.
Both representations suggest using a one-dimensional Kohonen network (a chain) to implement the KORRESP method. The results are shown in Table 4.3.

Table 4.3: The one-dimensional Kohonen map with the results of KORRESP.
| A, aug_for | B, aug_moy | C, stable | D, dim_moy | E | F, dim_for |
The conclusions are simple. The rural communes where agriculture is dominant are being depopulated, while the urban ones have an increasing population. The relations are very precise: we can note the pairs of modalities (dim_for, F), (dim_moy, D), (stable, C), (aug_moy, B), (aug_for, A). The SOM-inspired method is very quick and efficient, and gives the basic points of the information with only one representation. The classical correspondence method is also useful, but its computation time is longer and it is usually necessary to examine several projections to have a complete analysis, since each axis represents only a percentage of the total information.
5. THE DOMESTIC CONSUMPTION OF THE CANADIAN FAMILIES
The data have been provided by Prof. Simon Langlois from the Université Laval. The purpose of the study is to define homogeneous groups from the point of view of their consumption choices. The interest of such a clustering is at least double. On the one hand, when one has successive surveys that include distinct individuals stemming from the same population, we can build a pseudo-panel, composed of synthetic individuals representative of the groups, which will be comparable from one survey to another. On the other hand, it facilitates the matching of distinct surveys when each one provides different information about samples extracted from the same population. The constitution of groups which are homogeneous for these data allows the linking of all the surveys. For example, it is possible to apply this method to match consumption surveys done for the same period with different samples (each sample being asked about the consumption of half of the nomenclature). The matching is necessary to build complete consumption profiles. One has to notice that this method does not exactly correspond to the methodology proposed by Deaton, 1985 [10], who follows the same cohort from one survey to another by considering only individuals born at the same time. When pseudo-panels are considered, the clusters are usually built by crossing some significant variables. For example, to study the households' consumption modes, the significant variables are the age cohort, the education level and the income distribution. Here, we use the Kohonen algorithm to define the clusters and apply it to the data of two consumption surveys. We also compare the results to those that we obtain from a standard classification.
5.1. The data

We consider two consumption surveys, performed by Statistiques Canada in 1986 and 1990, with about 10,000 households, which were not the same ones from one survey to the other. The consumption structure is known through a 20-function nomenclature, described in Table 5.1.

Table 5.1: Consumption nomenclature.
Alcohol; Food at home; Food away; House costs; Communication; Others; Gifts; Education; Clothes; Housing; Leisure; Lotteries; Furniture; Health; Security; Personal Care; Tobacco; Individual Transport; Collective Transport; Vehicles.

Each household is represented by its consumption structure, expressed as percentages of the total expenditure. The two surveys have been gathered in order to define classes including individuals which belong to one or the other year. So it will be possible to observe the dynamic evolution of the household groups which have similar consumption structures. One can see that for any classification method, the classes contain in almost equal proportion data of the two surveys. It seems that there is no temporal effect on the groups, which simplifies the further analyses.
See in Fig. 5.1 the mean consumption structure for the 1992 survey.
Fig. 5.1: Mean Consumption Profile in 1992.

5.2. The classes

We want to compare two clustering methods:
1) a SOM algorithm using a two-dimensional (8 x 8) grid, which defines 64 classes, whose number is then reduced to 10 macro-classes by using a hierarchical clustering of the 64 code vectors, in order to get an easier interpretation of their contents;
2) a hierarchical classification into 10 classes with the Ward method.

5.3. The SOM classes and the macro-classes

Fig. 5.2 represents the 64 SOM classes with their code vectors and the macro-classes, which differ by their texture. First we note, due to the topological conservation property of the SOM algorithm, that the macro-classes group only neighboring SOM classes. In Fig. 5.3, the distances between the SOM classes are drawn, following the method suggested in [6]. Observe that the grouping into 10 macro-classes respects the distances: the changes of macro-classes generally occur where the distances are larger.
Fig. 5.2: The 64 SOM classes, their code vectors and the macro-classes.
Fig. 5.3: The distances between the 64 code vectors; in each direction, the classes are more distant when there is more white area.
The SOM classes could be analyzed, but it is difficult to keep and characterize 64 types of classes. Conversely, the 10 macro-classes have well separated features. One can observe that:
1. the sizes of the macro-classes are about 600 to 700 households, or about 1200, except one with a little more than 400; this macro-class gathers only 4 SOM classes which have a very special profile (as will be seen below);
2. in all the macro-classes, there are as many 1986 data as 1992 ones, so there is no significant effect of the year of the survey;
3. the mean profiles of the 10 macro-classes are well identified, and are different from the mean profile of the whole population.

Nine types of consumption items are at the origin of the differentiation of the macro-classes:
1. macro-class 5 is dominated by the Housing item (38 %);
2. macro-class 9 is characterized by the importance of the Housing item (26 %) and of Collective Transport;
3. for two macro-classes (1 and 2), the Vehicle purchase makes the difference: while the general mean value for this item is about 5 %, the value is 17 % in macro-class 1, and the other items are reduced in a homothetic way; in macro-class 2, the value is 36 %, and the housing expenditure is small, which corresponds to a large representation of house-owners (71 % instead of 60 % in general);
4. the Food at home (20 %) and the Others items define macro-class 7;
5. in macro-class 3, the Security (insurance) expenditure is double the mean value;
6. macro-class 10 corresponds to a large value of the Gifts item (25 %);
7. the Leisure item defines macro-class 8 (13 %), while Tobacco defines macro-class 4 (12 %) and Education is dominant in macro-class 6 (10 %).

The grouping into 10 macro-classes increases the contrast with respect to the mean consumption profile. All the SOM classes inside a macro-class have more or less the same features, with some specific characteristics.
5.4. Hierarchical clustering

If we consider the 10 classes defined by a hierarchical Ward clustering on the consumption profiles, the results are disappointing. The groups have unequal sizes, the differentiation between groups is more quantitative than qualitative, and poorer in information. For example, 4 groups have more than 1000 or 2000 elements, while the others have about 200 or 400. Among the 4 larger groups, 3 (numbers 1, 4 and 6) have a mean profile similar to the general mean one, with only one component a little larger. Groups 2 and 7 correspond to a high housing expenditure and cannot be clearly set apart, and so on. Actually the correctly spotted groups are the small ones, while the others are not very different from one another, and are similar to the general population. Furthermore, some specific behaviors, in particular those which are distinguished by a relatively large importance of the Security or Education expenditures, do not emerge in this clustering.
So from now on, we continue the analysis by using the SOM classification, followed by the grouping into 10 macro-classes.
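A minimal sketch of this two-level procedure (assuming the 8 x 8 map has already been trained, e.g. with a routine like the one sketched in section 2): the 64 code vectors are grouped into 10 macro-classes with Ward's hierarchical clustering, and each household then inherits the macro-class of its SOM unit.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def macro_classes(code, n_macro=10):
    """Group SOM code vectors (here 64 x P) into macro-classes by Ward clustering."""
    Z = linkage(code, method="ward")                        # hierarchy over the code vectors
    return fcluster(Z, t=n_macro, criterion="maxclust")     # label 1..n_macro per SOM unit

# each household then inherits the macro-class of its SOM class, e.g.
# macro_of_household = macro_classes(code)[som_class_of_household]
```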
5.5. Crossing with qualitative variables

To better understand the factors which determine the macro-classes, and to allow their identification, we use a graphic representation of some qualitative variables that were not present in the classification. For that, we use 4 qualitative variables, which give a socio-demographic description of the households:
1. the first one (Wealth) is a variable with 5 modalities (poor, quasi-poor, middle, quasi-rich, rich); this variable is defined by combining three simple criteria, the income distribution, the total expenditure and the food expenses, according to the age, the education level and the regional origin;
2. the second one (Age) is the age of the head of household, with 6 modalities (less than 30, 30-39, 40-49, 50-59, 60-69, more than 69);
3. the educational level (Education), with 5 levels (primary, secondary, post-secondary without diploma, post-secondary with diploma, university diploma);
4. the tenure status (Tenure Status) (owner or tenant).

For each SOM class, we compute the distribution of the four qualitative variables (which did not participate in the classification), and we represent it as a sector diagram (a pie chart) inside the cell. We observe that there is also a continuity in the variations of the socio-demographic distributions among the SOM classes. But they provide other information, different from the previous one.
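A minimal sketch of how such per-class distributions can be computed (pandas), assuming hypothetical arrays som_class with the SOM class of each household and a qualitative variable such as Wealth; each resulting row can then be drawn as a pie chart inside the corresponding cell.

```python
import pandas as pd

def class_distributions(som_class, qualitative):
    """Distribution of the modalities of a qualitative variable within each SOM class."""
    counts = pd.crosstab(som_class, qualitative)      # table of counts: classes x modalities
    return counts.div(counts.sum(axis=1), axis=0)     # normalize each row to proportions

# hypothetical usage:
# pies = class_distributions(households["som_class"], households["Wealth"])
# pies.loc[k] is the pie drawn inside the cell of SOM class k
```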
Fig. 5.4 : The distribution of the variable Wealth.
Fig. 5.5: The distribution of the variable Age.
Fig. 5.6: The distribution of the variable Education.
Fig. 5.7: The distribution of the variable Tenure Status.

For example, the partitioning of the population according to the poverty-wealth criterion (Wealth) indicates that the classes having a strong proportion of rich or quasi-rich people are rather situated at the extremities of the diagonal from top right to bottom left, the poor and quasi-poor being in the central area. At the same time, the opposition owner-tenant is distributed according to a simple opposition along this diagonal: the first ones are at the bottom left, the second ones at the opposite end. We rediscover a well-known situation of poor people, who can be owners as well as tenants of their lodgings. It is possible to analyze the four graphic representations in this way. Actually, it is the combination of these characteristics that we have to examine to interpret the zones of the grid, as gathered by the classification into 10 classes.
6. Conclusion

The SOM algorithm is therefore a powerful tool to analyze multidimensional data and to help understand their underlying structure. We are now working on the local representation of the contents of a class in relation with the neighboring classes, in order to give an interpretation of the significant and discriminant variables. There is no doubt that the related data mining techniques will have a large development in many scientific fields where one deals with numerous, high-dimensional data.
References
[1] F. Blayo, P. Demartines: Data analysis: How to compare Kohonen neural networks to other techniques? In Proceedings of IWANN'91, Ed. A. Prieto, Lecture Notes in Computer Science, Springer-Verlag, 469-476, 1991.
[2] F. Blayo, P. Demartines: Algorithme de Kohonen: application à l'analyse de données économiques. Bulletin des Schweizerischen Elektrotechnischen Vereins & des Verbandes Schweizerischer Elektrizitätswerke, 83, 5, 23-26, 1992.
[3] M. Cottrell, P. Letremy, E. Roy: Analyzing a contingency table with Kohonen maps: a Factorial Correspondence Analysis, Proc. IWANN'93, J. Cabestany, J. Mary, A. Prieto Eds., Lecture Notes in Computer Science, Springer-Verlag, 305-311, 1993.
[4] M. Cottrell, S. Ibbou: Multiple correspondence analysis of a crosstabulation matrix using the Kohonen algorithm, Proc. ESANN'95, M. Verleysen Ed., Editions D Facto, Bruxelles, 27-32, 1995.
[5] M. Cottrell, B. Girard, Y. Girard, C. Muller, P. Rousset: Daily Electrical Power Curves: Classification and Forecasting Using a Kohonen Map, From Natural to Artificial Neural Computation, Proc. IWANN'95, J. Mira, F. Sandoval Eds., Lecture Notes in Computer Science, Vol. 930, Springer, 1107-1113, 1995.
[6] M. Cottrell, E. de Bodt: A Kohonen Map Representation to Avoid Misleading Interpretations, Proc. ESANN'96, M. Verleysen Ed., Editions D Facto, Bruxelles, 103-110, 1996.
[7] M. Cottrell, E. de Bodt, E. F. Henrion: Understanding the Leasing Decision with the Help of a Kohonen Map. An Empirical Study of the Belgian Market, Proc. ICNN'96 International Conference, Vol. 4, 2027-2032, 1996.
[8] M. Cottrell, P. Rousset: The Kohonen algorithm: A Powerful Tool for Analysing and Representing Multidimensional Quantitative and Qualitative Data, Proc. IWANN'97, 1997.
[9] M. Cottrell, J. C. Fort, G. Pagès: Theoretical aspects of the SOM Algorithm, WSOM'97, Helsinki 1997, Neurocomputing 21, 119-138, 1998.
[10] A. Deaton: Panel data from time series of cross-sections, Journal of Econometrics, 1985.
[11] G. Deboeck, T. Kohonen: Visual Explorations in Finance with Self-Organizing Maps, Springer, 1998.
[12] P. Demartines: Organization measures and representations of Kohonen maps, In: J. Hérault (ed), First IFIP Working Group, 1992.
[13] P. Demartines, J. Hérault: Curvilinear component analysis: a self-organizing neural network for non linear mapping of data sets, IEEE Trans. on Neural Networks, 8, 148-154, 1997.
[14] F. Gardes, P. Gaubert, P. Rousset: Cellulage de données d'enquêtes de consommation par une méthode neuronale, Preprint SAMOS #69, 1996.
[15] S. Kaski: Data Exploration Using Self-Organizing Maps, Acta Polytechnica Scandinavica, 82, 1997.
[16] T. Kohonen: Self-Organization and Associative Memory, (3rd edition 1989), Springer, Berlin, 1984.
[17] T. Kohonen: Self-Organizing Maps, Springer, Berlin, 1995.
Value Maps: Finding Value in Markets that are Expensive

Guido J. Deboeck
1818 Society, World Bank
e-mail: [email protected]
Based on traditional measures of value such as the price-earnings ratio (PE), many stocks are at present either way overvalued or faring very poorly. The rapid expansion of Internet and technology companies has made surfing stock markets much more hazardous. What investors need is a new guide, a more visual way to assess markets. The main idea of this chapter is to demonstrate how maps can be designed for finding value in stock markets. "Value Maps" presents self-organizing maps of vast amounts of data on company and stock performance, most of which is derived from traditional measures for assessing growth and value of companies. This chapter, which extends the applications in [1], provides maps for finding value among the largest companies, the best performing companies, those that have improved the most, and those that based on traditional measures have excellent value attributes. The data was selected from over 7150 stocks listed on the New York Stock Exchange (NYSE), the American Exchange (AMEX) and the NASDAQ, including American Depository Receipts (ADRs) of foreign companies listed on US exchanges. Hence the maps in this chapter provide guidance not only to American companies but also to companies of world class whose stock can easily be bought and sold without brokerage accounts on every continent. The Value Maps in this chapter are models for how company and market information could be treated and represented in the future.
Introduction

Benjamin Graham was among the first to study stock returns by analyzing companies displaying common characteristics. In the early 1930's, he developed what he called the net current asset approach of investing. This called for buying stocks priced at less than 66 percent of the company's liquidity. Ben Graham and David Dodd wrote in 1940 that people who habitually purchase common stocks at more than about 20 times their average earnings are likely to lose considerable money in the long run [2]. Nevertheless the participation in the stock markets has never been higher, at a time when the average stock prices of US companies are close to 33 times the average earnings. In these circumstances, what is value? What stocks have value? Which companies are worth investing in? What markets or sectors are undervalued? Sixty years after Graham and Dodd there is still a lively debate on this. The debate on value has actually increased in intensity in recent years, especially because of the high valuation of American stocks and the emergence and rapid escalation of the prices of Internet stocks.

High valuation is usually attributed to stocks with high price-earnings ratios, a company's stock price divided by its per-share earnings. Compared to historical precedence, the current price-earnings ratio of 33 for all stocks listed on the S&P 500 is very high. Compared to the average PEs of the Internet stocks included in the DOT index (http://www.TheStreet.com), the ISDEX index (http://www.internet.com), or the IIX index (http://quote.yahoo.com/q?s=^IIX), three relatively new indices measuring the performance of Internet stocks, the current PE of S&P 500 stocks is relatively low, especially since many Internet stocks have yet to produce earnings.

So what has value? Some classic definitions of value can be found in Box 1. Interesting background reading on value investing can be found in [3], [4], and [5]. A long-term maverick on value is Warren Buffett. Since he took control of Berkshire Hathaway in 1965, when shares were trading in the $12-15 range, the per-share book value of Berkshire Hathaway stock has grown at a rate of more than 23 percent annually, which is nearly three times the gains in major stock averages. Buffett's approach to value is to seek intrinsic value, which he defines as 'you have to know the business whose stock you are considering buying' [6]. He recognizes that valuing a business is part art and part science, but leaves it to the reader to interpret when a business has value. In Warren's own words, '[a business] has to be selling for less than you think the value of the business is, and it has to be run by honest and able people. If you can buy into a business for less than it's worth today, and you're confident of the management, and you buy into a group of businesses like that, you're going to make money' [6].
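To make the yardsticks concrete, the sketch below computes the traditional value ratios used throughout this chapter (PE, price-to-book, price-to-sales) from per-share figures; the tickers and numbers are invented for illustration only.

```python
import pandas as pd

# hypothetical per-share figures (illustrative numbers, not real quotes)
stocks = pd.DataFrame({
    "price":           [80.0, 40.0, 120.0],
    "eps":             [2.4,  3.1,  1.2],     # earnings per share
    "book_per_share":  [10.0, 18.0, 6.0],
    "sales_per_share": [25.0, 60.0, 12.0],
}, index=["AAA", "BBB", "CCC"])

stocks["PE"] = stocks["price"] / stocks["eps"]              # price-earnings ratio
stocks["PB"] = stocks["price"] / stocks["book_per_share"]   # price-book ratio
stocks["PS"] = stocks["price"] / stocks["sales_per_share"]  # price-sales ratio
print(stocks[["PE", "PB", "PS"]].round(1))
```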
Looking at Buffett's portfolio and the stocks held by Berkshire Hathaway Inc, it is clear that Buffett's portfolio is diversified but that the bulk of his assets is tied up in five companies. In fact, one third of the $36 billion worth of Berkshire Hathaway Inc is in one company, Coca-Cola Co, which has a PE of 44, about one third higher than the average for all S&P 500 stocks. More specific guidance on what has value can be found in James O'Shaughnessy's work [7]. After weighing risks, rewards, and long-term base rates, O'Shaughnessy shows that the best overall strategy of uniting growth and value in a single portfolio over the past 42 years produces returns nearly five times better than the S&P 500. The annual compounded rate of return of the united growth and value strategy achieved 17.1 percent, or 4.91 percent higher than the S&P 500 return of 12.91 percent a year.
Box 1: What is value?

There are several definitions of value:

Fair market value: Fair market value or FMV is whatever someone is willing to pay for a similar asset. FMV reflects the amount of cash a buyer would be willing to pay and a seller willing to accept.

Investment value: A company's investment value is unique to all potential buyers; all buyers possess their own rate of return requirements for an asset because each investor has established a unique minimum rate of return.

Book value: a standard value measure based on a company's net worth on an accounting basis.

Liquidation value: what an enterprise could fetch if all assets were sold, all receivables collected and all outstanding bills and debts paid.

Intrinsic value: represents what someone would conclude a business is worth after taking an analysis of the company's financial position. The real worth of a company.
O'Shaughnessy suggests that:
(i) all stocks and large stocks with high PE ratios do substantially worse than the market (page 70);
(ii) companies with the 50 lowest PE ratios from among the large stocks do much better than all others, and the three lowest deciles by PE substantially outperform all large stocks;
(iii) over the long term, the market clearly rewards low price-book (PB) ratios; yet the data shows that for 20 years the 50 largest stocks with high price-book ratios did better than all stocks; a high PB ratio is one of the hallmarks of a growth stock (page 99);
(iv) a low price-to-sales (PS) ratio beats the market more than any other value ratio and did so more consistently, in terms of both the 50-stock portfolios and the decile analysis (page 134);
(v) value strategies work when applied both to large stocks and to the universe of common stocks, and they did so at least 88% of the time over all rolling 10-year periods (page 164);
(vi) multifactor models, i.e. using several factors, dramatically enhance returns;
(vii) in all likelihood, adding relative strength to a value portfolio dramatically increases performance because it picks stocks when investors recognize the bargains and begin buying again (page 269).
In sum, Buffett suggests a highly subjective way of assessing value; O'Shaughnessy works through a lot of statistical evidence to suggest a united value and growth investment. Both approaches leave the average investor, who does not have Buffett's skills and time to assess businesses, or who may not have a PhD in statistics to do an in-depth rigorous analysis, in a real quandary. To facilitate investment decisions by the average investor, we demonstrate an alternative approach which is easier and provides a visual way of finding value in expensive markets without requiring elaborate statistical analyses. The proposed approach is based on self-organizing maps, which provide two-dimensional representations of vast quantities of data.
Methodology and Data

Self-organizing maps (SOM) belong to a general class of neural network methods, which are non-linear regression techniques that can be applied to find relationships between inputs and outputs or to organize data so as to disclose so far unknown patterns or structures. As this approach has been demonstrated to be highly relevant in many financial, economic and marketing applications [1] and is the subject of this conference, we refer the novice reader to the literature for further explanation of the details of this approach [8], [1].

The data used for this study was derived from Morningstar™, which publishes monthly data on over 7150 stocks listed on the NYSE, AMEX and NASDAQ exchanges. Morningstar's Principia Pro™ was used to select the stocks for each of the maps in this paper. Principia Pro's key features include filtering, custom-tailored reporting, detailed individual-stock summary pages, graphic displays, and portfolio monitoring. Principia Pro does not allow data mining based on the principles of self-organization. Hence we used Principia Pro to select data for constructing self-organizing maps. The maps shown here were obtained by using Viscovery® (from Eudaptics Software GmbH in Austria), an impressive tool representing state-of-the-art SOM capability, according to Brian O'Rourke in Financial Engineering News, March 1999. More information regarding Viscovery can be found at http://www.eudaptics.com. A demo copy of Viscovery can be downloaded from the same website.

The maps in this article seek to identify the best companies based on how well companies treat their shareholders. The main yardstick used for creating Value Maps is the total return to stockholders. The total return to shareholders includes changes in share prices, reinvestment of dividends, rights and warrants offerings, and cash equivalents such as stocks received in spin-offs. Returns are also adjusted for stock splits, stock dividends and re-capitalization. The total return to shareholders that companies provide is the one true measure important to investors. It is the gauge against which investment managers and institutional investors are measured, and it should be the measure used to judge the performance of corporate executives.

The maps shown in the next section can be used by investors to see how stocks in their portfolio measure up, how to spot new investment opportunities, and how to adjust their portfolio to meet the objectives they have chosen. Corporate managers can use the maps to see how their companies stack up against the competition.
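The maps themselves were produced with the commercial tool named above, whose internals are not described here; purely as an illustration of the kind of preprocessing involved (the next section notes that log transformations were applied to all inputs and that all inputs received equal priority), a sketch with hypothetical column names might look as follows.

```python
import numpy as np
import pandas as pd

def preprocess(df, columns):
    """Log-transform and standardize the selected inputs (a sketch, not Viscovery's pipeline)."""
    X = df[columns].astype(float)
    # signed log transform compresses the wide ranges of returns, capitalizations and ratios
    X = np.sign(X) * np.log1p(np.abs(X))
    # standardize so that every input enters the map with comparable weight
    return (X - X.mean()) / X.std()

# hypothetical column names for returns, capitalization, value ratios and relative strength
cols = ["ret_3m", "ret_1y", "ret_3y", "ret_5y", "market_cap",
        "pe", "pb", "ps", "pcf", "rel_strength"]
# X = preprocess(stocks, cols).to_numpy()   # the SOM is then trained on the rows of X
```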
Main Findings

The main findings are presented as follows: first we analyzed value among the largest companies, next the best performing companies, then among companies that have improved the most, and finally those that based on traditional measures have excellent value attributes.
1. The largest 100 companies

The one hundred largest companies on the NYSE have the most visibility. Microsoft, IBM, GE, Wal-Mart, Exxon, American Express, Coca-Cola, AT&T, Ford and many others are known all around the globe. How to differentiate between them? Applying SOM we found some interesting differences in value.

The one hundred largest companies listed on the NYSE were obtained by sorting 7159 companies by market capitalization, i.e. the current stock-market value of a company's equity, in millions. Market capitalization is calculated by multiplying the current share price by the number of shares outstanding as of the most recently completed fiscal quarter. Market capitalization is often used as an indicator of a company's size. Stocks with market caps of less than $1 billion are often referred to as small-cap stocks, while market caps of more than $5 billion generally denote large-cap stocks. Based on data as of 12-30-98, Microsoft (MSFT) with a market capitalization of $345 billion is the largest company; Anheuser-Busch Companies (BUD), which produces Budweiser, with a capitalization of $31.3 billion ranked 100th.

The main inputs for this analysis included (i) the total return produced over three months, one year, three years, and five years; (ii) the percentage rank of these returns in each industry; (iii) the market capitalization in millions; (iv) value measures including the price-earnings ratio, price-book ratio, price-sales ratio and price-cash flow ratio; and (v) the relative strength of the current stock price as compared to the stock price over the past 52 weeks. Equal priority was applied to all inputs; log transformations were applied to all. The initial map size was set to 2000, yielding maps of 54 by 35 nodes. Because of the small number of records, the map tension was set to 3, which encourages interpolation between records.

The SOM shown in Figure 1 shows six clusters. Summary statistics on each of these clusters are provided in Table 1.
Figure 1: Self-organizing map of the largest 100 companies (in terms of market capitalization) listed on the New York Stock Exchange, which shows: 1. a main cluster including IBM, Microsoft, Intel, and many others; 2. cluster two (with CMB, ING, FRE, AXA), financial institutions that are leaders in performance; 3. cluster three (with TYC, XRX, TWX, WCOM), companies that are rich in valuations; 4. cluster four (with SUNW, TXN, GPS, BBV), which are high fliers; 5. cluster five, which includes BAC and C, which are banks, and RD and SC, which are oil companies that provide attractive investment opportunities; and 6. cluster six with Nokia and Dell, which have grown very fast in the past years but have also become very expensive.
About 73% of all companies formed one cluster including IBM, GM, Motorola (MOT) and many others. The average market capitalization in this cluster is $90 billion. The average PE ratio for this group is 37 (or slightly higher than the average for the market as a whole); the average PB ratio is about 10. Five other clusters show the more interesting information.

Cluster two includes among others Chase Manhattan Bank (CMB), ING Group (ING), Morgan Stanley Dean Witter & Co (MWD), Freddie Mac (FRE), and AXA ADR (AXA). This group can be labeled recent leaders in performance with relatively low cap, low PEs, PBs and PSs. The average market capitalization of the companies in this group is $48 billion. Their PE of 24 is 25% less than the current average PE of the market as a whole; the average price-book value is 3.3. Over the past year these companies produced an average return of 48%; over the past three months their return was 53%.

Cluster three includes Tyco International (TYC), Xerox (XRX), Time Warner (TWX) and MCI WorldCom (WCOM). This cluster is similar to cluster two in recent return achievements; however, stocks in this group are much more expensive, with exceptionally high PEs and price-book values twice as high as those of cluster two. These make this group less attractive than the previous one. We call this group the very rich group.
Table 1" Summary Statistics on Biggest 100 companies
Clusters >> Matching records (%)
Cl
C2
C5
C3
C4
C6
C0
73
5
4
4
3
2
9
Tot Ret 3 Months Mean 25.0 Tot Ret 1 Year Mean 44.7 Tot Ret 3 Year Mean 34.8 Tot Ret 5 Year Mean 30.0 Market Cap Mean 90,688 PE Current Mean 37.3 Price to Book Mean 9.9 Price to Sales Mean 4.3 Price to Cash Flow Mean 24.7 Relative Strength Mean 12.5
53.0 48.9 41.9 34.9 48,923 24.4 3.3 1.6 7.3 16.0
12.2 -6.7 23.0 24.5 95,308 20.0 3.2 12.3 280.3 -28.0
41.2 92.0 52.8 36.1 72,643 355.4 6.4 6.2 73.2 50.8
64.5 114.8 63.0 50.5 32,832 45.6 11.8 3.7 26.0 69.3
32.6 247.5 154.4 117.7 82,670 68.5 33.1 5.6 47.4 174.5
40.6 108.8 53.1 44.9 48,396 125.9 14.0 6.7 56.3 64.0
Notes: c 1: F, XON, CHV, BP, DT, TI, TEF, FON, BUD, A, NW, BCS, MOB, AN, AXP, BTY, DEO, FNM, EIRICY,ATI, E, MOT, FTE, BLS, ONE, MC, UN, OIRCL, HWP, MCD, Aft, FTU, CPQ, VVMT, PEP, GTE, INTC, T, HMC, HD, MIRK, BEL, AHP, SBC, TOYOY, ABT, BMY, JNJ, MO, GLX, SBH, NTT, DIS, PG, LLY, PFE, MSFT, MDT, G, LU, CSCO, VOD, SGP, WLA, KO, ABBBY DCX GIVI,IBM, UL, BIRK.B,AIG, GE C 2: CMB MVVD,ING, FIRE,AXA; C 3: TYC XIRX,TVVX,WCOM; C 4: TXN, SUNW, GPS, BBV C 5: C, BAC, IRD SC; C 6: NOK.A, DELL; C 0: MBK, AEG, EMC, DD, BA, AOL, SAP, ALL
Cluster four contains Texas Instruments (TXN), Sun Microsystems (SUNW), Gap (GPS) and BBV. These had the highest returns over the past year (114% on average). They also had the largest returns in the last three months (64% on average) and relatively low capitalization ($32 billion on average). However, they had PEs well above the market, actually one third higher than the market (45), and PBs four times those of the companies in cluster two. These can perhaps best be considered the high fliers.
Cluster five can be labeled the underperformers. It includes Bank of America (BAC), Royal Dutch Petroleum (RD) and Shell Transport (SC), which produced negative returns over the past year and only a 12% return over the past three months. The average capitalization in this group is high ($95 billion). Average PEs of 20 and PBs of 3.2 make this group, however, very attractive for future investments. Finally, cluster six shows Nokia and Dell, two companies with high returns over the past three years (154% average) and the past year (247% average), but which have in the process become very expensive in terms of PEs and PBs. In sum, a SOM map of the one hundred largest companies listed on the NYSE defines six clusters; among these six there are the recent performance leaders, the very rich, the high fliers, the underperformers and a large group in the middle. Nine companies stand out with attractive valuations. Among them are BAC, CMB, MWD, ING, FRE and AXA, which are financial institutions, and RD and SC, which are oil companies. At the time of this writing (March 1999) banks and oil companies have started to accelerate and have produced significant advances in recent weeks. The average return on the stocks selected via the SOM was 7.3% in 2.5 months. Exceptionally good performance included Chase Manhattan Bank (CMB), which increased 25% over 2.5 months (quoted at $59.66 on 12-31-98 and $74.61 on 3-12-99).
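The preprocessing described at the start of this section (equal priority for all inputs and log transformations of all of them) can be illustrated with a small sketch. The snippet below is only a rough illustration under our own assumptions: the shift applied before the logarithm and the standardization step are not specified in the chapter, and the actual maps were produced with dedicated SOM software.

```python
import numpy as np

def preprocess(X):
    """Illustrative preprocessing of the value-map inputs.

    X : (n_companies, n_indicators) array of raw inputs (returns, ratios, cap, ...).
    Returns a log-transformed, standardized matrix suitable for SOM training.
    """
    X = np.asarray(X, dtype=float)
    # shift each column so its minimum is slightly above zero, then log-transform
    # (returns and relative strength can be negative; the shift is our assumption)
    shifted = X - X.min(axis=0, keepdims=True) + 1.0
    logged = np.log(shifted)
    # equal priority: zero mean and unit variance per indicator
    return (logged - logged.mean(axis=0)) / logged.std(axis=0)
```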
2. Best Performing Companies

In the previous section we started from size, using the market capitalization as the main initial selection criterion for picking companies to map performance and valuations. In this section we ignore size and start with the one hundred best performing companies. We chose the annualized return over the past three years as the main criterion for selecting the one hundred best performing companies. Three years is a reasonable time period because longer periods, five or ten years, often include economic regime changes. Data pre-processing included the same log transformations as described above. Also, map parameters were chosen to be consistent with those used earlier. The SOM map shown in Figure 2 shows five clusters. As before, the outliers provide the most interesting information. For example, in the bottom left corner of the map in Figure 2 we find HIST and AOL. HIST is the symbol of Gallery of History, which markets historical documents; AOL is America Online, which provides consumer on-line computer services. Both of these companies have experienced phenomenal growth over the past three years: their average annualized return over the past three years was 120%. Their valuations are, however, also record high. Hence they may be considered the least attractive among the best performing one hundred. At the bottom right of the map in Figure 2 we find a cluster that includes Federal Agricultural Mortgages (FAMCK), Mediware Information Services (MEDW), Bank of Commerce
Figure 2: Self-organizing map of 100 companies that over the past three years produced the most value for shareholders. In the bottom right we find FAMCK and BCOM, which are financial institutions, and MEDW and TSRI, which are service companies that may provide the best investment opportunities in this group.
(BCOM) and TSRI (which provides computer programming services on a contract basis). While the average market capitalization of this group was small (only $120 million), the valuation numbers speak for themselves: PEs of 21 and PBs of 3. The average return in the last three years for this group was 102%; over the past year they regressed by 29%, but in the last three months of 1998 they accelerated by 41%. How did they do out of sample? Bank of Commerce (BCOM) is up 40%, and Federal Agricultural Mortgages (FAMCK) increased by 19% in the first quarter of 1999; MEDW and TSRI both decreased by about 30%. Hence we only find confirmation of the earlier findings in regard to banking and financial institutions.
3. Companies who improved

If neither size nor best return over the past few years are considered relevant for selecting stocks to produce value maps, then maybe the rate at which companies are changing or improving in producing value for shareholders should be used as the primary criterion. For this section we started from computing the difference between the most recent annual return and the annualized return over the past three years. Based on this, the top twenty companies that improved the most over the past three years are shown in Table 2. This acceleration in shareholder value was then used to sort companies and obtain the top one hundred. The SOM map of these one hundred companies that produced the most improvement in shareholder value is shown in Figure 3. The map in Figure 3 shows four clusters, of which the ones in the top right and top left corner provide the most interesting selections. The cluster in the top right corner includes Carver Bancorp (CNY) and JMC Group (JMCG), which went from zero to negative total return. Similar to the main cluster in the center of the map, a large number of companies in this group have gone from zero to a small negative return over the past year and then back to positive returns in the past three months. The companies in the top left corner show, however, substantial improvements in return. In cluster 2 (top left corner) we find LML Payment Systems (LMLAF), KeraVision (KERA), ACTV (IATV), Research Frontiers (REFR) and Cypress Biosciences (CYPB). Right next to it we find TeleWest Communications (TWSTY). Companies in both of these clusters have gone from 3 to 5% annualized return over three years to 130 to 140% return over the past year. The former group averaged returns of 85% in the last three months of 1998, while TWSTY produced a 25% return. Out-of-sample results show that TWSTY increased by 61%, from $28.25 on 12-31-98 to $45.75 on 3-12-99.
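The improvement measure used above, the difference between the most recent annual return and the annualized return over the past three years, is simple to compute. The sketch below is a hedged illustration; the column names and the use of pandas are our own assumptions and are not part of the original analysis.

```python
import pandas as pd

def top_improvers(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Rank companies by improvement = recent 1-year return minus annualized 3-year return."""
    out = df.copy()
    out["improvement"] = out["tot_ret_1y"] - out["tot_ret_3y_annualized"]
    # keep the n companies with the largest acceleration in shareholder value
    return out.sort_values("improvement", ascending=False).head(n)
```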
4. Highest Valuations or Lowest PEs, PBs, PSs

Traditional measures of valuation are based on price earnings ratios, price book values, and price cash flow and price sales ratios. Hence we searched the database for companies with the lowest price earnings ratios, i.e. PEs less than 15, that have price book ratios less than 1 and price sales ratios less than 0.5. This yielded 376 records, of which several had missing
data. Fortunately the SOM allows records with missing data, and hence all 376 records were used for the roadmap on companies with the best valuations.

Table 2: Companies that improved the most in shareholder value over the past three years.

Company Name                              Ticker   Exchange   Sector                 Improvement 1-3 year
Grand Union                               GUCO     NNM        Retail                 458.7
Track Data                                TRAC     NNM        Services               418.7
LML Payment Systems                       LMLAF    NASQ       Services               308.8
Business Objects SA ADR                   BOBJY    NNM        Services               202.9
Cellular Communications of Puerto Rico    CLRP     NNM        Services               197.1
Catalyst International                    CLYS     NNM        Technology             193.9
Osicom Technologies                       FIBR     NNM        Technology             162.7
Tops Appliance City                       TOPS     NNM        Retail                 160.6
ACTV                                      IATV     NASQ       Technology             134.0
TeleWest Communications PLC ADR           TWSTY    NNM        Services               125.2
Books-A-Million                           BAMM     NNM        Retail                 123.3
KeraVision                                KERA     NNM        Health                 119.9
Cypress Bioscience                        CYPB     NASQ       Health                 106.4
Ceres Group                               CERG     NNM        Financial               97.6
Invivo                                    SAFE     NNM        Health                  81.8
PharmChem Laboratories                    PCHM     NNM        Health                  75.2
Vaughn Communication                      VGHN     NNM        Services                69.1
Kirin Brewery ADR                         KNBWY    NASQ       Consumer Staple         66.5
Washington Homes                          WHI      NYSE       Industrial Cyclicals    59.8
Elscint                                   ELT      NYSE       Technology              52.0
Figure 4 shows 11 component planes obtained via a SOM on 376 records of companies with the best valuations. The top four planes show the distribution of total returns over the past three months, one year, three years and five years (left to right). The next set, on the second row, shows the distribution of the price earnings, price book, price sales and price cash flow ratios (left to right). Finally, on the bottom row are the distributions of the market capitalization, the relative strength, and the rate of improvement over the past three years among all 376 companies. In a color printout of these planes the lowest values are in blue and the highest values are in red; green and yellow areas indicate values closer to the lowest or highest for a particular component, respectively. From Figure 4 we can visually derive that value provided to shareholders, as measured by total annualized returns over three and five years, correlates highly, especially among the companies with the highest valuations. Total returns achieved over the past three months and in the past one year are, however, not highly correlated with returns over three and five years. Interestingly enough, the returns over three months and over the past year do not entirely overlap. The second row of planes shows some overlap between price earnings and price book values; however, the distributions of the price sales ratios and the price cash flow ratios are substantially different. On all four planes we find companies with very low valuation ratios in the bottom right corner.
Figure 3: Self-organizing map of 100 companies that have most improved in terms of generating return for shareholders over the past three years. The most-improved indicator was computed for each company by subtracting the annualized return over three years from the total return in the past year.
Figure 4: Component planes of 376 companies with the lowest price earnings ratios, price book ratios and price sales ratios. The component planes included are, from left to right: total return in the last 3 months, one year, three years and five years; on the second row, the price earnings ratio, the price book ratio, the price sales ratio and the price cash flow ratio; on the third row, the market capitalization, the relative strength of stock prices and the distribution of companies that improved the most over the past three years.
The companies with the highest market capitalization among these 376 can be found in the upper left corner of the bottom left plane. The companies with the best relative strength, i.e. whose price is closest to their 52-week low, are on the right side of the middle plane on the bottom row. Companies that have been improving the most can be found in the bottom left corner of the bottom right plane. From the above we can visually discover, without analysis of statistics, which is the most interesting selection among companies with the best valuations. By selecting the darkest area in the bottom right plane, i.e. the left segment of the bottom right plane, we can filter those companies that improved the most over the past three years from among all those that have the best valuations. This filtering produces a short list of companies that provide interesting investment opportunities. Table 3 shows the results of this filtering process; it shows the top ten obtained by filtering the 376 companies with the best valuations on the basis of the most improvement over the past few years.

Table 3: SOM-selected top ten companies with excellent valuations that have improved a lot over the past three years.
Company                 Ticker   Exch.   Sector       Ret 3M   Ret 1Y   Ret 3Y   Ret 5Y   MktCap     PE     PB     PS    PCF   Rel.Str   Improvement
TFC Enterprises         TFCE             Fin.           -7.1     73.2    -33.8    -34.0     18.5    7.7    0.5    0.5    2.5        37         107.0
Bangor Hydro-Electric   BGR      NYSE    Utility        31.4    107.0      6.1     -1.8     94.4   14.9    0.7    0.5    2.6        63         100.9
American Health Prop    AHEPZ    NNM                      -4    208.1    129.6       NA      3.7    2.4    0.3    0.5    0.6       -59          78.4
Isle of Capris Casinos  ISLE     NNM     Services       54.8     62.8    -13.4    -24.3     93.5   11.3    1.0    0.2    1.4        29          76.2
Sound Advice            SUND     NNM     Retail         33.3    116.6     42.4    -12.2     12.1    8.6    0.8    0.1    2.5        71          74.2
OroAmerica              OROA     NNM     Cons. Dur.     11.2     95.0     26.5     -8.0     62.9    8.3    0.9    0.4    4.6        54          68.5
Network Six             NWSS     NASQ    Techn.         17.3     32.6    -32.6    -40.7      2.9    4.8    0.8    0.3    1.6         5          65.3
Washington Homes        WHI      NYSE    Industrial     16.0     62.0      2.2     -9.4     46.7   10.7    0.8    0.2    6.3        28          59.8
Astea International     ATEA     NNM     Techn.        -15.6    -11.4    -58.0       NA     22.9    0.9    0.6    0.4   19.2       -30          46.6
Cronos Group            CRNSF    NNM     Services       21.4     27.5    -18.4       NA     54.9    5.3    0.6    0.3     NA         1          45.9
Average Top 10                                          15.3     82.3      6.2    -18.7     36.3    7.7   0.74   0.35    4.6        23          76.1

(Columns: total return over 3 months, 1 year, 3 years and 5 years; market capitalization; price earnings, price book, price sales and price cash flow ratios; relative strength; improvement over the past three years.)
A few points deserve highlighting:
1. Among the top ten we find companies listed on every exchange: NYSE, NNM (NASDAQ National Market) and NASDAQ (small-cap market);
2. The companies listed belong to various sectors: finance, retail, consumer durables, industrial cyclicals, services, utilities; hence good value can be found in almost any sector (internet stocks do not appear here because the Morningstar data we used contained very few internet companies as of 12-30-98 and had yet to recognize internet as a separate sector);
3. While the average annualized return of the top ten companies was negative over three and five years, in the last year the average return of the top ten improved to 82%; in the last three months of 1998 the average return was 15% in one quarter;
4. The average price earnings ratio of these most improving best-valuation companies is 7.7; the average price book value is 0.74 and the average price sales ratio is 0.35;
5. Most of these companies have over the past three years improved by 45 to 107%, meaning that several have doubled in value to shareholders.

The out-of-sample performance of each of these top ten stocks is shown in Table 4. Table 4 shows the prices of each of these stocks on 12-31-1998 and on 3-12-1999 (the time of this writing). It also shows the number of shares that could have been bought on 12-31-1998 for a $10,000 investment in each stock. The last column in Table 4 shows the net return of investing $10,000 in each stock. This portfolio of the ten most improving companies, selected from among the 376 companies with the best valuations, would have produced in the first two and a half months of 1999 a total return of 8.4%, or 2.8% more than the S&P 500. Annualized, this is close to 40%, or an added value of 13% over and above the S&P 500.

Table 4: Out-of-sample performance, 12-31-98 to 3-12-99
Company                   Price 12-31-98   Price 3-12-99   Difference   % Return   Shares per $10,000   Net Return (US$)
TFC Enterprises                     1.63            2.44         0.81      49.7%                 6135              4,969
Bangor Hydro-Electric              12.81           13.00         0.19       1.5%                  781                148
American Health Prop                1.88            1.38        -0.50     -26.6%                 5319            (2,660)
Isle of Capris Casinos              3.97            4.38         0.41      10.3%                 2519              1,033
Sound Advice                        3.25            2.38        -0.87     -26.8%                 3077            (2,677)
OroAmerica                          9.88            8.75        -1.13     -11.4%                 1012            (1,144)
Network Six                         4.06            4.38         0.32       7.9%                 2463                788
Washington Homes                    5.88            6.63         0.75      12.8%                 1701              1,276
Astea International                 1.69            3.31         1.62      95.9%                 5917              9,586
Cronos Group                        6.38            4.50        -1.88     -29.5%                 1567            (2,947)
Portfolio                                                                   8.4%                Total              8,373
S&P 500                          1229.23         1297.68        68.45       5.6%                                    5,569
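The arithmetic behind Table 4 can be reproduced directly: a $10,000 position is sized by the 12-31-98 price, and the net return is the price change times the number of shares. The snippet below is a minimal illustration using the TFC Enterprises row of Table 4.

```python
# Out-of-sample arithmetic for one holding (TFC Enterprises, prices from Table 4).
investment = 10_000
price_start, price_end = 1.63, 2.44

shares = round(investment / price_start)              # 6135 shares
net_return = (price_end - price_start) * shares       # about $4,969
pct_change = (price_end - price_start) / price_start  # about 49.7%

print(shares, round(net_return), round(100 * pct_change, 1))
```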
Conclusions

The maps in this chapter demonstrate how value can be discovered visually, even in expensive markets, through the self-organization of vast amounts of company and stock market data. We have identified interesting investment opportunities based on how well companies treat their shareholders. The total return to shareholders that companies provide is the one true measure important to investors. It is the gauge against which investment managers and institutional investors are measured, and it should be the measure used to judge the performance of corporate executives. We looked at the one hundred largest companies and found some banking and oil companies; we looked at the best performing companies and those that have made the most progress and again found some banks and financial institutions with attractive potential. We looked at 376 companies with the best valuations and identified ten that have most improved and are outperforming the S&P 500 in the first quarter of 1999. The self-organizing maps in this chapter provide a visual way to find interesting investment opportunities. The best opportunities that were identified are companies that have good valuations (based on classic valuation criteria), that may have had negative returns over the past 3 to 5 years, but that recently changed from negative to positive returns. Most of them are accelerating very fast in producing value for shareholders. The attractiveness of this visual way of identifying value in markets is that it reduces the subjectivity embedded in the Warren Buffett approach for assessing which companies have value, and it reduces the need for the elaborate statistical analyses on which the O'Shaughnessy approach was based. The maps presented here are models for the representation of vast quantities of financial and economic data; they demonstrate the speed with which multi-dimensional data can be synthesized to produce meaningful results for making investment decisions. Being able to discern patterns in data fast is particularly important to electronic day traders who, in volatile markets, are constantly picking stocks for very short holding periods, usually not exceeding one day. Of course electronic day traders, especially those betting on Internet stocks, have a saying that 'those who are discussing value are out of the market; those who are in the market discuss price'!
References
[1] Guido Deboeck & Teuvo Kohonen, Visual Explorations in Finance with Self-Organizing Maps, Springer-Verlag, 1998, 250 pp.
[2] Benjamin Graham, David Dodd, Security Analysis, reprint of 1934 ed., New York, McGraw-Hill, 1997, p. 493
[3] Timothy P. Vick, Wall Street on Sale: How to Beat the Market as a Value Investor, McGraw-Hill, 1999, 289 pp.
[4] Anthony M. Gallea & William Patalon III, Contrarian Investing: Buy and Sell When Others Won't and Make Money Doing It, Prentice Hall, 1998.
[5] Peter Lynch, Beating the Street, Simon & Schuster, New York, 1993, 318 pp.
[6] Janet Lowe, Warren Buffett Speaks: Wit and Wisdom from the World's Greatest Investor, John Wiley & Sons, New York, 1997.
[7] James O'Shaughnessy, What Works on Wall Street: A Guide to the Best Performing Investment Strategies of All Times, McGraw-Hill, New York, 1998 revised edition, 366 pp.
[8] Teuvo Kohonen, Self-Organizing Maps, Springer-Verlag, 2nd edition, 1997, 426 pp.
Acknowledgments

The author wants to thank Professor Teuvo Kohonen for educating him about self-organizing maps, an approach which makes the representation of multi-dimensional financial data a lot more effective. He also wishes to thank Professors Erkki Oja and Samuel Kaski for assembling this important collection of papers on self-organizing maps. He is grateful to Dr Gerhard Kranner, CEO of Eudaptics Software, and Johannes Sixt for all the support they have provided in making the creation of Value Maps so easy.
Data Mining and Knowledge Discovery with Emergent Self-Organizing Feature Maps for Multivariate Time Series

A. Ultsch
Philipps-University of Marburg, Department of Computer Science, Hans-Meerwein-Str., 35032 Marburg, Germany

Self-Organizing Feature Maps, when used appropriately, can exhibit emergent phenomena. SOFM with only few neurons limit this ability; therefore Emergent Feature Maps need to have thousands of neurons. The structures of Emergent Feature Maps can be visualized using U-Matrix methods. U-Matrices lead to the construction of self-organizing classifiers possessing the ability to classify new datapoints. This subsymbolic knowledge can be converted to a symbolic form which is understandable for humans. All these steps were combined into a system for Neuronal Data Mining. This system has been applied successfully for Knowledge Discovery in multivariate time series.

1. Introduction

Data Mining aims to discover so far unknown knowledge in large datasets. The most important step thereby is the transition from subsymbolic to symbolic knowledge. Self-Organizing Feature Maps are very helpful in this task. If appropriately used, they exhibit the ability of emergence, i.e. using the cooperation of many neurons, Emergent Feature Maps are able to build structures on a new, higher level. The U-Matrix method visualizes these structures, corresponding to structures of the high-dimensional input space that otherwise would be invisible. A knowledge conversion algorithm transforms the recognized structures into a symbolic description of the relevant properties of the dataset. In chapter two we briefly introduce our approach to Data Mining and Knowledge Discovery, and chapter three clarifies the use of Self-Organizing Feature Maps for Data Mining. Chapter four describes how Feature Maps must be used in order to obtain emergence. Chapters five and six describe those steps of Data Mining where Feature Maps can be used. Chapter seven is a description of our system, the so-called Neuronal Data Mine, which uses Emergent Feature Maps and Knowledge Conversion. In chapter eight an important application area, Knowledge Discovery in multivariate time series, is described. Chapter nine gives first results of an application of this system.
2. Data Mining and Knowledge Discovery

Since the use of the term Data Mining is quite diverse, we give here a short definition in order to specify our approach to Data Mining and Knowledge Discovery. A more detailed description can be found in [Ultsch 99a]. We define Data Mining as the inspection of a large dataset with the aim of Knowledge Discovery. Knowledge Discovery is the discovery of new knowledge, i.e. knowledge that is unknown in this form so far. This knowledge has to be represented symbolically and should be understandable for human beings as well as useful in knowledge-based systems. The central issue of Data Mining is the transition from data to knowledge. Symbolically represented knowledge, as sought by Data Mining, is a representation of facts in a formal language such that an interpreter with competence to process symbols can utilize this knowledge [Ultsch 87]. In particular, human beings must be able to read, understand and evaluate this knowledge. The knowledge should also be usable by knowledge-based systems.
Figure 1: Steps of Data Mining

The knowledge should be useful for analysis, diagnosis, simulation and/or prognosis of the process which generated the dataset. We call the transition from data, respectively an unfit knowledge representation, to useful symbolic knowledge Knowledge Conversion [Ultsch 98]. Data Mining can be done in the following steps (see Figure 1 for an overview):
- inspection of the dataset
- clustering
- construction of classifiers
- knowledge conversion and
- validation

Unfortunately it has to be stated that in many commercial Data Mining tools there is no Knowledge Conversion [Gaul 98]. The terms Data Mining and Knowledge Discovery are often used in those systems in an inflationary way for statistical tools enhanced with a fancy visualization interface [Woods, Kyral 98]. The difference between exploratory statistical analysis and Data Mining lies in the aim which is sought. Data Mining aims at Knowledge Discovery.

3. SOFM for Data Mining

Figure 1 shows the different steps in Data Mining in order to discover knowledge. Statistical techniques are commonly used for the inspection of the data and also for their validation. Self-Organizing Feature Maps can be used for classification and the construction of classifiers. Particularly well suited for these tasks are Emergent Feature Maps, as described in the next chapter. Classifiers constructed with the use of a Self-Organizing Feature Map do, however, not possess a symbolic representation of knowledge. They can be said to contain subsymbolic knowledge. In the step Knowledge Conversion the extraction of knowledge from Self-Organizing Feature Maps is performed. One method to extract knowledge from Self-Organizing Feature Maps is the so-called sig* algorithm, which will be briefly described in chapter six.

4. Emergent vs. Non-emergent Feature Maps

Self-Organizing Feature Maps were developed by Teuvo Kohonen in 1982 [Kohonen 82] and should, to our understanding, exhibit the following interesting and non-trivial property: the ability of emergence through self-organization. Self-organization means the ability of a biological or technical system to adapt its internal structure to structures sensed in the input of the system. This adaptation should be performed in such a way that, first, no intervention from the environment is necessary (unsupervised learning) and, second, the internal structure of the self-organizing system represents features of the input data that are relevant to the system. A biological example of self-organization is the learning of languages by children. This process can be done by every child at a very early age for different languages and in quite different cultures. Emergence means the ability of a system to produce a phenomenon on a new, higher level. This change of level is termed in physics a "mode" or "phase" change. It is produced by the cooperation of many elementary processes. Emergence happens in natural systems as well as in technical systems. Examples of natural emergent systems are: cloud streets, the Brusselator, the BZ reaction, certain slime molds, etc.
Fig. 2: Hexagonal convection cells on a uniformly heated copper plate [Haken 71]

Even crowds of human beings may produce emergent phenomena. An example is the so-called "La Ola" wave in ballgame stadiums. Participating human beings function as the elementary processes who, by cooperation, produce a large-scale wave by rising from their places and throwing their arms up in the air. This wave can be observed on a macroscopic scale and could, for example, be described in terms of wavelength, velocity and repetition rate. Important technical systems that are able to show emergence are in particular laser and maser. In those technical systems billions of atoms (elementary processes) produce a coherent radiation beam. Although Kohonen's Self-Organizing Feature Maps are able to exhibit emergence, they are often used such that emergence is impossible. For emergence it is absolutely necessary that a huge number of elementary processes cooperate. A new level or niveau can only be observed when elementary processes are disregarded and only the overall structures, i.e. structures formed by the cooperation of many elementary processes, are considered. In typical applications of Kohonen's Self-Organizing Feature Maps the number of neurons in the feature maps is too small to show emergence. A typical example, which is representative for many others, is taken from [Reutterer 99]. The dataset describes consumers of household goods. Each household is described by a nine-dimensional vector of real numbers. Self-Organizing Feature Maps are used to gain some insight into the structure and segmentation of the market for the household goods. A Self-Organizing Feature Map with three by three, i.e. nine, neurons has been used [Reutterer 99]. Using Kohonen's learning scheme, each of the nine neurons represents the bestmatch of several input data. Each neuron is considered to be a cluster.
Common to all non-emergent ways to use Kohonen's feature maps is that the number of neurons is roughly equal to the number of clusters expected to be found in the dataset. A single neuron is typically regarded as a cluster, i.e. all data whose bestmatches fall on this neuron are members of this cluster. It seemed for some time that this type of Kohonen's feature maps performs clustering in a way that is similar to a statistical clustering algorithm called k-means [Ultsch 95]. An absolutely necessary condition for emergence is the cooperation of many elementary processes. Emergence is therefore only expected to happen in Self-Organizing Feature Maps with a large number of neurons. Such feature maps, which we call Emergent Feature Maps, typically have at least some thousands if not tens of thousands of neurons. In particular the number of neurons may be much bigger than the number of datapoints in the input data. Consequently most of the neurons of Emergent Feature Maps will represent very few input points, if any. Clusters are detected on Emergent Feature Maps not by regarding single neurons but by regarding the overall structure of the whole feature map. This can be done by using U-Matrix methods [Ultsch 94]. With Emergent Feature Maps we could show that Self-Organizing Feature Maps are different from and often superior to classical clustering algorithms [Ultsch 95]. A canonical example where this can be seen is a dataset consisting of two different subsets. These subsets are taken from two well separated toroids that are interlinked like a chain, as can be seen in Figure 3. Using an Emergent Feature Map of dimension 64 by 64 = 4096 neurons, the two separate links of the chain could easily be distinguished. In contrast to this, many statistical algorithms, in particular the k-means algorithm, were unable to produce a correct classification. We think that the task of Data Mining, i.e. the seeking of new knowledge, calls for Emergent Feature Maps. The property of emergence, i.e. the appearance of new structures on a different abstraction level, coincides well with the idea of discovering new knowledge.
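The chain-link example mentioned above (two well separated, interlinked toroids) is easy to generate synthetically. The sketch below is our own minimal construction of such a dataset; the radii, noise level and ring placement are assumptions and are not taken from the original experiments.

```python
import numpy as np

def chainlink(n_per_ring=500, radius=1.0, noise=0.05, seed=0):
    """Two interlocked noisy rings ('chain links') in 3-D with class labels 0/1."""
    rng = np.random.default_rng(seed)
    t1 = rng.uniform(0, 2 * np.pi, n_per_ring)
    t2 = rng.uniform(0, 2 * np.pi, n_per_ring)
    # ring 1 lies in the x-y plane; ring 2 lies in the x-z plane,
    # shifted along x so that the two rings interlock like a chain
    ring1 = np.stack([radius * np.cos(t1), radius * np.sin(t1),
                      np.zeros(n_per_ring)], axis=1)
    ring2 = np.stack([radius * np.cos(t2) + radius,
                      np.zeros(n_per_ring), radius * np.sin(t2)], axis=1)
    X = np.vstack([ring1, ring2]) + noise * rng.standard_normal((2 * n_per_ring, 3))
    y = np.repeat([0, 1], n_per_ring)
    return X, y
```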
Fig. 3: Chainlink Dataset

5. Construction of Classifiers

When Emergent Feature Maps with a sufficiently large number of neurons are trained with high-dimensional input data, these datapoints distribute sparsely on the feature map. Regarding the positions of the bestmatches, i.e. those neurons whose weights are most similar to a given input point, gives no hint of any structure in the input dataset. In the following picture a three-dimensional dataset consisting of a thousand points was projected on a 64 by 64 Emergent Feature Map. The topology of the feature map is toroid, i.e. the borders of the map are cyclically connected. The positions of the bestmatches exhibit no structure in the input data. In order that structures of the input data can emerge we use the so-called U-Matrix method. The simplest of these methods is to sum up the distances between a neuron's weights and those of its immediate neighbours. This sum of distances to its neighbours is displayed as elevation at the position of each neuron. The elevation values of the neurons produce a three-dimensional landscape, the so-called U-Matrix (a minimal computational sketch is given after the following list). U-Matrices have the following properties:
- Bestmatches that are neighbours in the high-dimensional input-data space lie in a common valley.
- If there are gaps in the distribution of input points, hills can be seen on the U-Matrix. The elevation of the hills is proportional to the gap distance in the input space.
- The principal property of Self-Organizing Feature Maps, i.e. conserving the overall topology of the input space, is inherited by the U-Matrix. Neighbouring data in the input space can also be found at neighbouring places on the U-Matrix.
- Topological relations between the clusters are also represented on the two-dimensional layout of the neurons.
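As referenced above, a minimal computational sketch of the simplest U-Matrix variant, summing the distances of each neuron's weight vector to those of its immediate neighbours, could look as follows. It assumes a rectangular, non-toroidal map with 4-neighbourhoods; the maps used in this chapter are much larger and may use toroidal or hexagonal topologies, which would need adapted neighbourhoods.

```python
import numpy as np

def u_matrix(weights):
    """Compute a simple U-Matrix.

    weights : array of shape (rows, cols, dim) holding one weight vector per neuron.
    Returns an array of shape (rows, cols) with the summed distances to the
    4-neighbours of each neuron ('elevation' of the U-Matrix landscape).
    """
    rows, cols, _ = weights.shape
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    U[r, c] += np.linalg.norm(weights[r, c] - weights[rr, cc])
    return U
```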
Fig. 4: A U-Matrix of the data

With U-Matrix methods, emergence in Kohonen maps has been observed for many different applications, for example: medical diagnosis, economics, environmental science, industrial process control, meteorology, etc. The cluster structure of the input dataset is detected using a U-Matrix. Clusters in the input data can be detected in the U-Matrix as valleys surrounded by hills with more or less elevation; i.e. clusters can be detected, for example, by raising a virtual water level up to a point where the water floods a valley on the U-Matrix. Regarding a U-Matrix, the user can indeed grasp the high-dimensional structure of the data. Neurons that lie in a common valley are subsumed to a cluster. Regions of a feature map that have high elevations in a U-Matrix are not identified with a cluster. Neurons that lie in a valley but are not bestmatches are interpolations of the input data. This makes it possible to cluster data with Emergent Feature Maps. This approach has been extensively tested over the last years and for many different applications. It can be shown that this method gives a very good picture of the high-dimensional and otherwise invisible structure of the data. In many applications meanings for clusters could be detected. Emergent Feature Maps can easily be used to construct classifiers. If the U-Matrix has been separated into valleys corresponding to clusters and hills corresponding to gaps in the data, then an input datapoint can be easily classified by looking at the bestmatch of this datapoint. If the point's bestmatch lies inside a cluster region on the U-Matrix, the input datapoint is added to that cluster. If the bestmatch lies on a hill in the U-Matrix, no classification of this point can be assigned. This is in particular the case if the dataset possesses new features, i.e. aspects that were not included in the data learned so far. With this approach, for example, outliers and erroneous data are easily detected.
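The bestmatch-based classification just described can be sketched as follows. The cluster labelling of the map positions (valleys versus hills) is assumed to have been derived from the U-Matrix beforehand; encoding hills as -1 is our own convention for this illustration.

```python
import numpy as np

def classify(x, weights, neuron_cluster):
    """Classify a new datapoint via its bestmatch neuron.

    x              : input vector of shape (dim,)
    weights        : SOM weights of shape (rows, cols, dim)
    neuron_cluster : (rows, cols) integer labels per neuron; -1 marks U-Matrix hills
    Returns the cluster label, or None if the bestmatch lies on a hill (rejection).
    """
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    best = np.argmin(((flat - x) ** 2).sum(axis=1))  # index of the bestmatch neuron
    r, c = divmod(best, cols)
    label = neuron_cluster[r, c]
    return None if label == -1 else int(label)
```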
6. Knowledge Conversion

The classifiers constructed with Emergent Feature Maps and the U-Matrix described in the last chapter possess the "knowledge" to classify new data. This knowledge is, however, not symbolic. Neither a reason why a particular datapoint belongs to a particular cluster, nor why a given datapoint cannot be classified, can be given. What is necessary at this point is to convert this type of knowledge to a symbolic form. We have developed an algorithm called sig* in order to perform this Knowledge Conversion [Ultsch 94]. As input sig* takes the classifier as described in the last chapter. A symbolic description of all the weights of the neurons belonging to a particular cluster is constructed. Sig* generates descriptions using decision rules. These decision rules contain as premises conditions on the input data and as conclusions the decision for a particular cluster. Clusters are described by two different types of rules. There are so-called characterization rules which describe the main characteristics of a cluster. Secondly, there are rules which describe the difference between a particular cluster and neighbouring clusters. The different steps of this Knowledge Conversion can be described as follows:
- Selection of components of the high-dimensional input data that are most significant for the characterization of a cluster
- Construction of appropriate conditions for the main properties of a cluster
- Composition of the conditions in order to produce a high-level significant description.

To realize the first step, sig* uses a measure of significance for each component of the high-dimensional input data with respect to a cluster. The algorithm uses only very few conditions if the clusters can be easily described. If the clusters are more difficult to describe, sig* uses more conditions, and more rules describing the differences are generated in order to specify the borders of a cluster. For the representation of conditions, sig* uses interval descriptions for characterization rules and splitting conditions for the differentiation rules. The conditions can be combined using "and", "or" or a majority vote. It could be shown that for known classifications sig* reproduces 80 to 90+ % of the classification ability of an Emergent Feature Map. A rough illustrative sketch of interval-based characterization rules is given after the module list in the next section.

7. The Neuronal Data Mine NDM

The methods presented in the previous chapters have been developed and refined over the last years and combined into a tool for Data Mining and Knowledge Discovery called Neuronal Data Mine [Ultsch 99a]. This tool contains the following modules:
- Statistics (Inspection of the Data)
- Emergent Feature Maps (Clustering)
- U-Matrix (Construction of Classifiers)
- sig* (Knowledge Conversion)
- Validation
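As announced in the previous section, the following is a loose, simplified illustration of interval-based characterization rules in the spirit of sig*; it is not the published algorithm, and the significance measure and percentile bounds used here are our own simplifications.

```python
import numpy as np

def characterization_rules(X, labels, top_k=2, low=10, high=90):
    """Toy interval rules: for each cluster, pick the components whose cluster mean
    deviates most from the global mean (in global standard deviations) and describe
    them by percentile intervals of the cluster's values."""
    rules = {}
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    for c in np.unique(labels):
        Xc = X[labels == c]
        significance = np.abs(Xc.mean(axis=0) - mu) / sd   # crude per-component score
        picked = np.argsort(significance)[::-1][:top_k]
        rules[int(c)] = [
            (int(j),
             float(np.percentile(Xc[:, j], low)),
             float(np.percentile(Xc[:, j], high)))
            for j in picked
        ]
    return rules  # {cluster: [(feature index, lower bound, upper bound), ...]}
```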
Fig. 5: Screenshot of the user interface of the NDM

8. The NeuroDataMine for Knowledge Discovery in Time Series

One of the latest and most fascinating applications of NDM is Data Mining in multivariate time series. The key for this application is a suitable knowledge representation for temporal structures (see chapter 2). With the definition of unification-based temporal grammars (UTG) this key issue has been solved [Ultsch 99b]. UTGs belong to the class of definite clause grammars. UTGs describe temporal phenomena by a hierarchy of semiotic description levels. Each semiotic description level consists of a symbol (syntax), a description (semantics) and an explanation useful in the context of a particular application (pragmatics). Temporal phenomena are described on each level using temporal phases and temporal operations. The latter are called connexes. As phases we identified Primitive Patterns, Successions, Events, Sequences and Temporal Patterns. Primitive Patterns represent the different elementary conditions of the process described by the multivariate time series. Successions model the duration, Events the simultaneity of temporal phases. Sequences are used to formulate repetitions. Temporal Patterns finally condense variations in Sequences to a common abstract description of important temporal patterns. The phases described above can be combined using connexes for duration, simultaneity and temporal sequence. These temporal operations are designed to be flexible with regard to time. The connexes do not require a coincidence of events in a mathematical sense. They allow a certain flexibility, i.e. events that are sufficiently close in time are considered to be simultaneous. This is necessary since the multivariate time series stem from natural or technical processes that always have a certain variation in time. A special fuzzy representation was used to represent this flexibility [Ultsch 99b]. This approach leads to only three temporal operations for the representation of temporal features. In other representation formalisms, for example Allen's, many more temporal operations are necessary [Allen 84]. Emergent feature maps are used for Temporal Data Mining in the following steps:
- description of the elementary conditions of the process (Primitive Patterns)
- description of the duration of phases (Successions)
- description of simultaneity (Events)
- detection of start and end points of temporal patterns (Sequences resp. Temporal Patterns).
Fig. 6: Temporal Data Mining
9. Application: Sleep Related Breathing Disorders

In a first application the Temporal Data Mine has been used for a medical problem, the so-called Sleep Related Breathing Disorders (SRBD) [Penzel et al. 91]. Humans who suffer from SRBD experience the stopping of breathing during certain periods of sleep. The stopping periods are critical if they last at least 10 seconds and occur more than 40 times per hour [Penzel et al. 91]. The multivariate time series considered were: EEG, EMG, EOG, EKG, airflow, thorax and abdomen movements, and saturation of blood with oxygen.
Fig. 7: Multivariate time series for SRBD

For those multivariate time series, two different types of U-Matrix, called "air" and "move", were generated [Guimarães/Ultsch 99]. The U-Matrix "air" focuses on all aspects of the time series related to airflow. The U-Matrix "move" concentrates on aspects of movements of thorax and abdomen. In the "air" U-Matrix six elementary states were identified.
These elementary states (clusters) are considered elementary primitive temporal elements and termed Primitive Patterns. In the U-Matrix "move" nine Primitive Patterns could be identified. The temporal sequence of the Primitive Patterns is represented as paths on the U-Matrices. Using temporal knowledge conversion, six Events and five different Temporal Patterns could be found in the time series. The knowledge was formulated in UTG notation. All semiotic description levels of the UTG (see last chapter) have been presented to an expert in order to evaluate the plausibility of the phases and the descriptions. This showed that all the events found represented important medical properties. In particular the events could be related to physiological stages like, for example, "obstructive snoring" or "hyperpnoe". Four of the five Temporal Patterns that were discovered were very well known to the expert and could be assigned a medical meaning. One of the Temporal Patterns was a newly discovered pattern inherent in some type of human sleep. This gave a hint on a potential new way to look at certain types of sleeping disorders. In the following picture an example of a Temporal Pattern and the corresponding multivariate time series is depicted.
Fig. 8: Temporal Knowledge

10. Conclusion

In Data Mining the first step after the inspection of a dataset is the identification of clusters. SOFM with only few neurons implicitly limit the number of clusters to be found in the data. With such feature maps only a very crude insight into the input data can be gained, if at all. In feature maps possessing thousands of neurons, U-Matrix methods can be used to detect emergence. U-Matrices visualize structures in the data by considering the cooperation of many neurons. The structures seen give insights into the otherwise invisible high-dimensional data space. It can be shown that emergent feature maps are superior to other clustering methods, particularly to k-means [Ultsch 95]. The most important step of Data Mining is Knowledge Conversion, i.e. the transition from a subsymbolic to a symbolic representation of knowledge. Emergent feature maps provide an excellent starting point for Knowledge Conversion. Other classifiers such as
decision trees focus on the efficiency of the discrimination between clusters. Declarative rules, extracted from U-Matrices using sig*, provide an exact description of significant properties of clusters. The methods described above could be used to analyze multivariate time series. Unification-based temporal grammars (UTG) have been developed as a tool to represent symbolic knowledge for Temporal Data Mining. The approach has been successfully tested for a medical problem regarding sleep disorders.

11. References
[Allen 84] Allen, J.: Towards a General Theory of Action and Time, Artificial Intelligence 23, 1984, pp. 123-154
[Gaul 98] Gaul, W. G.: Classification and Positioning of Data Mining Tools, Herausforderungen der Informationsgesellschaft an Datenanalyse und Wissensverarbeitung, 22. Jahrestagung Gesellschaft für Klassifikation
[Guimarães/Ultsch 99] Guimarães, G., Ultsch, A.: A Method for Temporal Knowledge Conversion, to appear.
[Guimarães/Ultsch 96] Guimarães, G., Ultsch, A.: A Symbolic Representation for Patterns in Time Series Using Definitive Clause Grammars, 20th Annual Conference of the Society for Classification, Freiburg, 6th-8th March 1996, pp. 105-111
[Kohonen 82] Kohonen, T.: Self-Organized Formation of Topologically Correct Feature Maps, Biological Cybernetics, Vol. 43, pp. 59-69, 1982
[Penzel et al. 91] Penzel, P., Stephan, K., Kubicki, S., Herrmann, W. M.: Integrated Sleep Analysis with Emphasis on Automatic Methods, In: R. Degen, E. A. Rodin (Eds.) Epilepsy, Sleep and Sleep Deprivation, 2nd ed. (Epilepsy Res. Suppl. 2), Elsevier Science Publishers, 1991, pp. 177-204
[Reutterer 98] Reutterer, T.: Panel Data Based Competitive Market Structure and Segmentation Analysis using Self-Organizing Feature Maps, Proc. Annual Conf. of the Society for Classification, pp. 92, Dresden, 1998
[Ultsch 99a] Ultsch, A.: Data Mining und Knowledge Discovery mit Neuronalen Netzen, Technical Report, Department of Computer Science, University of Marburg, Hans-Meerwein-Str., 35032 Marburg
[Ultsch 99b] Ultsch, A.: Unifikationsbasierte Temporale Grammatiken für Data Mining und Knowledge Discovery in multivariaten Zeitreihen, Technical Report, Department of Computer Science, University of Marburg, March 1999
[Ultsch 98] Ultsch, A.: The Integration of Connectionist Models with Knowledge-based Systems: Hybrid Systems, Proceedings of the 11th IEEE SMC 98 International Conference on Systems, Men and Cybernetics, 11-14 October 1998, San Diego
[Ultsch 95] Ultsch, A.: Self-Organizing Neural Networks Perform Different from Statistical k-means Clustering, Gesellschaft für Klassifikation, Basel, 8th-10th March, 1995
[Ultsch 94] Ultsch, A.: The Integration of Neural Networks with Symbolic Knowledge Processing, in Diday et al. "New Approaches in Classification and Data Analysis", pp. 445-454, Springer Verlag, 1994
[Ultsch 87] Ultsch, A.: Control for Knowledge-based Information Retrieval, Verlag der Fachvereine, Zürich, 1987
[Woods/Kyral 97] Woods, E., Kyral, E.: Data Mining, Ovum Evaluates, Catalunya, Spain, 1997
From Aggregation Operators to Soft Learning Vector Quantization and Clustering Algorithms

Nicolaos B. Karayiannis

Department of Electrical and Computer Engineering, University of Houston, Houston, Texas 77204-4793, USA
This paper presents an axiomatic approach for developing soft learning vector quantization and clustering algorithms based on aggregation operators. The development of such algorithms is based on a subset of admissible aggregation operators that lead to competitive learning vector quantization models. Two broad families of algorithms are developed as special cases of the proposed formulation.
1. INTRODUCTION

Consider the set $\mathcal{X} \subset \mathbb{R}^n$ which is formed by $M$ feature vectors from an $n$-dimensional Euclidean space, that is, $\mathcal{X} = \{x_1, x_2, \ldots, x_M\}$, $x_i \in \mathbb{R}^n$, $1 \le i \le M$. Clustering is the process of partitioning the $M$ feature vectors into $c < M$ clusters, which are represented by the prototypes $v_j \in \mathcal{V}$, $j \in \mathcal{N}_c = \{1, 2, \ldots, c\}$. Vector quantization can be seen as a mapping from an $n$-dimensional Euclidean space $\mathbb{R}^n$ into the finite set $\mathcal{V} = \{v_1, v_2, \ldots, v_c\} \subset \mathbb{R}^n$, also referred to as the codebook. Codebook design can be performed by clustering algorithms, which are typically developed by solving a constrained minimization problem using alternating optimization. These clustering techniques include the crisp c-means [1], fuzzy c-means [1], generalized fuzzy c-means [2], and entropy-constrained fuzzy clustering algorithms [3]. Recent developments in neural network architectures resulted in learning vector quantization (LVQ) algorithms [4-11]. Learning vector quantization is the name used in this paper for unsupervised learning algorithms associated with a competitive neural network. Batch fuzzy learning vector quantization (FLVQ) algorithms were introduced by Tsao et al. [5]. The update equations for FLVQ involve the membership functions of the fuzzy c-means (FCM) algorithm, which are used to determine the strength of attraction between each prototype and the input vectors. Karayiannis and Bezdek [9] developed a broad family of batch LVQ algorithms that can be implemented as the FCM or FLVQ algorithms. The minimization problem considered in this derivation is actually a reformulation of the problem of determining fuzzy c-partitions that was solved by the FCM algorithm [12]. This paper presents an axiomatic approach to soft learning vector quantization and clustering based on aggregation operators.
2. REFORMULATION FUNCTIONS BASED ON AGGREGATION OPERATORS

Clustering algorithms are typically developed to solve a constrained minimization problem which involves two sets of unknowns, namely the membership functions that assign feature vectors to clusters and the prototypes. The solution of this problem is often determined using alternating optimization [1]. Reformulation is the process of reducing an objective function treated by alternating optimization to a function that involves only one set of unknowns, namely the prototypes [9,12]. The function resulting from this process is referred to as the reformulation function. A broad family of batch LVQ algorithms can be derived by minimizing [9]

$$R_p = \frac{1}{M} \sum_{i=1}^{M} D_p\!\left(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}\right), \qquad (1)$$

where $D_p(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c})$ is the generalized mean of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$, defined as

$$D_p\!\left(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}\right) = \left(\frac{1}{c} \sum_{\ell=1}^{c} \left(\|x_i - v_\ell\|^2\right)^{p}\right)^{\frac{1}{p}}, \qquad (2)$$

with $p \in \mathbb{R} - \{0\}$. The function (1) was produced by reformulating the problem of determining fuzzy c-partitions that was solved by the FCM algorithm. The reformulation of the FCM algorithm essentially established a link between batch clustering and learning vector quantization [9-11].
2.1. Aggregation Operators

The reformulation function (1) is formed by averaging the generalized means of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$ over all feature vectors $x_i \in \mathcal{X}$. The generalized mean is perhaps the most widely known and used aggregation operator. Nevertheless, the search for soft LVQ and clustering algorithms can naturally be extended to a broad variety of aggregation operators selected according to the axiomatic requirements that follow. A multivariable function $h(a_1, a_2, \ldots, a_c)$ is an aggregation operator on its arguments if it satisfies the following axiomatic requirements:

Axiom A1: The function $h(\cdot)$ is continuous and differentiable everywhere.

Axiom A2: The function $h(a_1, a_2, \ldots, a_c)$ is idempotent, that is, $h(a, a, \ldots, a) = a$.

Axiom A3: The function $h(\cdot)$ is monotonically nondecreasing in all its arguments, that is, $h(a_1, a_2, \ldots, a_c) \le h(b_1, b_2, \ldots, b_c)$ for any pair of $c$-tuples $[a_1, a_2, \ldots, a_c]$ and $[b_1, b_2, \ldots, b_c]$ such that $a_i, b_i \in (0, \infty)$ and $a_i \le b_i$, $\forall i \in \mathcal{N}_c$.

Soft LVQ and clustering algorithms were developed by using gradient descent to minimize the reformulation function (1). This function can also be written as

$$R = \frac{1}{M} \sum_{i=1}^{M} h\!\left(\|x_i - v_1\|^2, \|x_i - v_2\|^2, \ldots, \|x_i - v_c\|^2\right), \qquad (3)$$

where the aggregation operator $h(\cdot)$ is the generalized mean of $\{d_\ell\}_{\ell \in \mathcal{N}_c}$, with $d_\ell = \|x_i - v_\ell\|^2$. This is an indication that the search for soft LVQ and clustering algorithms can be extended to reformulation functions of the form (3), where $h(\cdot)$ is an aggregation operator in accordance with the axiomatic requirements A1-A3.
2.2. Update Equations

Suppose the development of LVQ algorithms is attempted by using gradient descent to minimize the function (3). The gradient $\nabla_{v_j} R = \partial R / \partial v_j$ of $R$ with respect to the prototype $v_j$ can be determined as

$$\nabla_{v_j} R = -\frac{2}{M} \sum_{i=1}^{M} \alpha_{ij}\,(x_i - v_j), \qquad (4)$$

where $\{\alpha_{ij}\}$ are the competition functions, defined as

$$\alpha_{ij} = \frac{\partial}{\partial\!\left(\|x_i - v_j\|^2\right)}\, h\!\left(\|x_i - v_1\|^2, \|x_i - v_2\|^2, \ldots, \|x_i - v_c\|^2\right). \qquad (5)$$

The update equation for the prototypes can be obtained according to the gradient descent method as

$$\Delta v_j = -\tilde{\eta}_j\, \nabla_{v_j} R = \eta_j \sum_{i=1}^{M} \alpha_{ij}\,(x_i - v_j), \qquad (6)$$

where $\eta_j = \frac{2}{M}\,\tilde{\eta}_j$ is the learning rate for the prototype $v_j$ and the competition functions $\{\alpha_{ij}\}$ are determined in terms of $h(\cdot)$ according to (5). The LVQ algorithms derived above can be implemented iteratively. Let $\{v_{j,\nu-1}\}_{j \in \mathcal{N}_c}$ be the set of prototypes obtained after the $(\nu-1)$th iteration. According to the update equation (6), a new set of prototypes $\{v_{j,\nu}\}_{j \in \mathcal{N}_c}$ can be obtained according to

$$v_{j,\nu} = v_{j,\nu-1} + \eta_{j,\nu} \sum_{i=1}^{M} \alpha_{ij,\nu}\,(x_i - v_{j,\nu-1}), \quad j \in \mathcal{N}_c. \qquad (7)$$

If $\eta_{j,\nu} = \left(\sum_{i=1}^{M} \alpha_{ij,\nu}\right)^{-1}$, then $\{v_{j,\nu-1}\}_{j \in \mathcal{N}_c}$ do not affect the computation of $\{v_{j,\nu}\}_{j \in \mathcal{N}_c}$, which are obtained only in terms of the feature vectors $x_i \in \mathcal{X}$. The algorithms derived above can be implemented as clustering or batch LVQ algorithms [10,11].
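A minimal numerical sketch of the batch implementation (7) is given below, using the generalized mean (2) as the aggregation operator so that the competition functions follow from (5). The function name, the random initialization, the fixed number of iterations and the small constant added to the distances are our own assumptions, not part of the original formulation.

```python
import numpy as np

def soft_lvq_batch(X, c, p=-1.0, n_iter=50, seed=0):
    """Batch soft LVQ/clustering sketch based on the reformulation function (1)-(3).

    X : (M, n) array of feature vectors; c : number of prototypes;
    p : exponent of the generalized mean (p = 1/(1-m) gives FCM-style updates).
    """
    rng = np.random.default_rng(seed)
    M, n = X.shape
    V = X[rng.choice(M, size=c, replace=False)].copy()        # initial prototypes

    for _ in range(n_iter):
        # squared Euclidean distances d_ij = ||x_i - v_j||^2, shape (M, c)
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # competition functions: alpha_ij = dh/d(d_ij) for the generalized mean
        # h(d_i1, ..., d_ic) = ((1/c) * sum_l d_il**p) ** (1/p)
        s = (d ** p).mean(axis=1, keepdims=True)               # (1/c) sum_l d_il**p
        alpha = (1.0 / c) * d ** (p - 1.0) * s ** (1.0 / p - 1.0)
        # update (7) with eta_j = (sum_i alpha_ij)**(-1): weighted means of the data
        V = (alpha.T @ X) / alpha.sum(axis=0)[:, None]
    return V
```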
2.3. Admissible Reformulation Functions Based on Aggregation Operators

The search for admissible reformulation functions is based on the properties of the competition functions $\{\alpha_{ij}\}$, which regulate the competition between the prototypes $\{v_j\}_{j \in \mathcal{N}_c}$ for each feature vector $x_i \in \mathcal{X}$. The following three axioms describe the properties of admissible competition functions:

Axiom R1: If $c = 1$, then $\alpha_{i1} = 1$, $1 \le i \le M$.

Axiom R2: $\alpha_{ij} \ge 0$, $1 \le i \le M$; $1 \le j \le c$.

Axiom R3: If $\|x_i - v_p\|^2 > \|x_i - v_q\|^2 > 0$, then $\alpha_{ip} < \alpha_{iq}$, $1 \le p, q \le c$, and $p \ne q$.

Axiom R1 indicates that there is actually no competition in the trivial case where all feature vectors $x_i \in \mathcal{X}$ are represented by a single prototype. Thus, the single prototype is equally attracted by all feature vectors $x_i \in \mathcal{X}$. Axiom R2 implies that all feature vectors $x_i \in \mathcal{X}$ compete to attract all prototypes $\{v_j\}_{j \in \mathcal{N}_c}$. Axiom R3 implies that a prototype $v_q$ that is closer in the Euclidean distance sense to the feature vector $x_i$ than another prototype $v_p$ is attracted more strongly by this feature vector.

Minimization of a function $R$ defined in terms of an aggregation operator $h(\cdot)$ in (3) does not necessarily lead to competitive LVQ algorithms satisfying the three axiomatic requirements R1-R3. The function $R$ can be made an admissible reformulation function by imposing additional conditions on the aggregation operator $h(\cdot)$. This can be accomplished by utilizing the Axioms R1-R3, which lead to the admissibility conditions for aggregation operators summarized by the following theorem:

Theorem 1: Let $\mathcal{X} = \{x_1, x_2, \ldots, x_M\} \subset \mathbb{R}^n$ be a finite set of feature vectors which are represented by the set of $c < M$ prototypes $\mathcal{V} = \{v_1, v_2, \ldots, v_c\} \subset \mathbb{R}^n$. Consider the function $R$ defined in terms of the multivariable function $h(\cdot)$ in (3). Then, $R$ is an admissible reformulation function in accordance with the axiomatic requirements R1-R3 if:
1. $h(\cdot)$ is a continuous and differentiable everywhere function,
2. $h(\cdot)$ is an idempotent function, i.e., $h(a, a, \ldots, a) = a$,
3. $h(\cdot)$ is monotonically nondecreasing in all its arguments, i.e., $\partial h(a_1, a_2, \ldots, a_c)/\partial a_j \ge 0$, $\forall j \in \mathcal{N}_c$, and
4. $h(\cdot)$ satisfies the condition $\partial h(a_1, a_2, \ldots, a_c)/\partial a_p < \partial h(a_1, a_2, \ldots, a_c)/\partial a_q$, for $a_p > a_q > 0$, $\forall p, q \in \mathcal{N}_c$ and $p \ne q$.

The first three conditions of Theorem 1 indicate that a multivariable function $h(\cdot)$ can be used to construct reformulation functions of the form (3) if it is an aggregation operator in accordance with the axiomatic requirements A1-A3. Nevertheless, Theorem 1 indicates that not all aggregation operators lead to admissible reformulation functions. The subset of aggregation operators that can be used to construct admissible reformulation functions of the form (3) are those satisfying the fourth condition of Theorem 1.

3. REFORMULATION FUNCTIONS BASED ON MEAN-TYPE AGGREGATION OPERATORS
with f ( z ) = x 1-m and g(z) = f - x ( x ) = z ~j-~--~. Any function of the form ( 8 ) i s an admissible reformulation function if f(.) and g(.) satisfy the conditions summarized by the following theorem [11]:
51 Theorem 2: Let A" = { x I , x 2 , . . . , X M } C ]~n be a finite set of feature vectors which are represented by the set of c < M prototypes 12 = {Vl,V2,...,vc} C IRn. Consider the function R defined in (3), with
l=1
Then, R is an admissible reformulation with the axiomatic requirements R1-R3 everywhere functions satisfying f ( g ( x ) ) creasing (increasing) functions of x E (decreasing) function of x E (0, oc).
function of the first (second) kind in accordance if f(.) and g(.) are continuous and differentiable = x, f ( x ) and g(x) are both monotonically de(0, c~), and g ' ( x ) i s a monotonically increasing
A broad variety of reformulation functions can be constructed using functions g(·) of the form g(x) = (g_0(x))^{1/(1-m)}, m ≠ 1, where g_0(x) is called the generator function. The following theorem summarizes the conditions that must be satisfied by admissible generator functions [11]:

Theorem 3: Consider the function R defined in terms of the aggregation operator (9) in (3). Suppose g(·) is defined in terms of the generator function g_0(·) that is continuous on (0, ∞) as g(x) = (g_0(x))^{1/(1-m)}, m ≠ 1, and let

r_0(x) = (m/(m-1)) (g_0'(x))^2 - g_0(x) g_0''(x).   (10)

The generator function g_0(x) leads to an admissible reformulation function R if:
• g_0(x) > 0, ∀x ∈ (0, ∞), g_0'(x) > 0, ∀x ∈ (0, ∞), and r_0(x) > 0, ∀x ∈ (0, ∞), or
• g_0(x) > 0, ∀x ∈ (0, ∞), g_0'(x) < 0, ∀x ∈ (0, ∞), and r_0(x) < 0, ∀x ∈ (0, ∞).
If g_0'(x) > 0, ∀x ∈ (0, ∞), and m > 1 (m < 1), then R is a reformulation function of the first (second) kind. If g_0'(x) < 0, ∀x ∈ (0, ∞), and m > 1 (m < 1), then R is a reformulation function of the second (first) kind.

3.1. Competition and Membership Functions

Consider the soft LVQ and clustering algorithms produced by minimizing a reformulation function of the form (3), where the aggregation operator h(·) is defined in (9). Using the definition of h(·),

∂h(a_1, a_2, ..., a_c)/∂a_j = (1/c) g'(a_j) f'( (1/c) Σ_{l=1}^{c} g(a_l) ).   (11)
Since the constant 1/c can be incorporated in the learning rate, the competition functions {α_ij} can be obtained from (5) using (11) as

α_ij = g'(||x_i - v_j||^2) f'(S_i),  where  S_i = (1/c) Σ_{l=1}^{c} g(||x_i - v_l||^2).   (12)
Suppose g(·) is formed in terms of an admissible generator function g_0(·) as g(x) = (g_0(x))^{1/(1-m)}, m ≠ 1. In this case, the competition functions {α_ij} can be obtained as

α_ij = θ_ij ( (1/c) Σ_{l=1}^{c} ( g_0(||x_i - v_l||^2) / g_0(||x_i - v_j||^2) )^{1/(1-m)} )^{-m},   (13)

where

θ_ij = g_0'(||x_i - v_j||^2) / g_0'( g_0^{-1}( S_i^{1-m} ) ),  with  S_i = (1/c) Σ_{l=1}^{c} ( g_0(||x_i - v_l||^2) )^{1/(1-m)}.   (14)

It can easily be verified that {α_ij} and {θ_ij} satisfy the condition

(1/c) Σ_{j=1}^{c} (α_ij / θ_ij)^{1/m} = 1,  1 ≤ i ≤ M.   (15)
This condition can be used to determine the constraints imposed by the generator function on the resulting c-partition by relating the competition functions {α_ij} with the corresponding membership functions {u_ij}. Fuzzy LVQ and clustering algorithms can be obtained as special cases of the proposed formulation if the corresponding generator functions produce fuzzy c-partitions. A generator function produces fuzzy c-partitions if the membership functions {u_ij} determined in terms of the corresponding competition functions {α_ij} and {θ_ij} satisfy the condition

Σ_{j=1}^{c} u_ij = 1,  1 ≤ i ≤ M.   (16)
If the condition (16) is not satisfied by the membership functions {u_ij} formed in terms of {α_ij} and {θ_ij}, then the proposed formulation produces soft LVQ and clustering algorithms which include fuzzy LVQ and clustering algorithms as special cases.

3.2. Soft LVQ and Clustering Algorithms Based on Nonlinear Generator Functions

Consider the nonlinear generator function g_0(x) = x^q, q ≠ 0. For q > 0, g_0(x) = x^q is an increasing generator function. For q = 1, g_0(x) = x^q reduces to the linear generator function g_0(x) = x that produces the FCM algorithm. If m > 1, then g_0(x) = x^q generates an admissible reformulation function of the first kind for all q > 0. For m < 1, g_0(x) = x^q generates an admissible reformulation function of the second kind only if 0 < q < 1 - m. If q = 1, then g_0(x) = x^q generates an admissible reformulation function of the first kind for m > 1. This is the range of values of m associated with the FCM algorithm. For q = 1, g_0(x) = x^q generates an admissible reformulation function of the second kind if m < 0.

Consider the family of algorithms produced by the generator function g_0(x) = x^q, q > 0. For this generator function, {θ_ij} can be obtained from (14) as

θ_ij = ( (1/c) Σ_{l=1}^{c} ( ||x_i - v_l||^2 / ||x_i - v_j||^2 )^{q/(1-m)} )^{(1-m)(1-q)/q}.   (17)
The competition functions {α_ij} can be obtained from (13) as

α_ij = θ_ij ( (1/c) Σ_{l=1}^{c} ( ||x_i - v_l||^2 / ||x_i - v_j||^2 )^{q/(1-m)} )^{-m},   (18)

where {θ_ij} are defined in (17). If q = 1, then θ_ij = 1, ∀i, j, and

α_ij = ( (1/c) Σ_{l=1}^{c} ( ||x_i - v_l||^2 / ||x_i - v_j||^2 )^{1/(1-m)} )^{-m}.   (19)
This is the form of the competition functions of the FLVQ algorithm. In this case, α_ij = (c u_ij)^m, where

u_ij = ( Σ_{l=1}^{c} ( ||x_i - v_j||^2 / ||x_i - v_l||^2 )^{1/(m-1)} )^{-1}   (20)

are the membership functions of the FCM algorithm. For q ≠ 1, the membership functions {u_ij} can be obtained from the competition functions {α_ij} as u_ij = (α_ij)^{1/m}/c. Using (18),

u_ij = (1/c) (θ_ij)^{1/m} ( (1/c) Σ_{l=1}^{c} ( ||x_i - v_l||^2 / ||x_i - v_j||^2 )^{q/(1-m)} )^{-1}.   (21)
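As an illustration of how the quantities in (17), (18) and (21) can be evaluated in practice, the following Python/NumPy sketch (ours, not part of the original chapter; all function and variable names are hypothetical) computes θ_ij, α_ij and u_ij for the generator function g_0(x) = x^q. Setting q = 1 recovers the FCM memberships (20), for which the fuzzy c-partition constraint (16) holds.

    import numpy as np

    def soft_lvq_quantities(X, V, m=2.0, q=1.0):
        # Competition and membership functions for the generator g0(x) = x**q.
        # X: (M, n) feature vectors x_i; V: (c, n) prototypes v_j.
        # Returns theta, alpha, u of shape (M, c), cf. Eqs. (17), (18), (21).
        c = V.shape[0]
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1) + 1e-12   # ||x_i - v_j||^2
        # S[i, j] = (1/c) * sum_l (d_il / d_ij)^(q/(1-m))
        S = ((d[:, None, :] / d[:, :, None]) ** (q / (1.0 - m))).mean(axis=2)
        theta = S ** ((1.0 - m) * (1.0 - q) / q)     # Eq. (17) as reconstructed above
        alpha = theta * S ** (-m)                    # Eq. (18)
        u = theta ** (1.0 / m) / (c * S)             # Eq. (21): u_ij = (alpha_ij)^(1/m) / c
        return theta, alpha, u

    # q = 1: theta = 1 and u reduces to the FCM memberships (20), so each row of u sums to 1
    X = np.random.rand(100, 2)
    V = np.random.rand(4, 2)
    _, _, u = soft_lvq_quantities(X, V, m=2.0, q=1.0)
    print(np.allclose(u.sum(axis=1), 1.0))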
The membership functions (21) resulting from the generator function g_0(x) = x^q with q ≠ 1 do not satisfy the constraint (16) that is necessary for fuzzy c-partitions. The c-partitions obtained by relaxing the constraint (16) are called soft c-partitions, and include fuzzy c-partitions as a special case. The performance of the LVQ and clustering algorithms corresponding to the generator function g_0(x) = x^q, q > 0, depends on both parameters m ∈ (1, ∞) and q ∈ (0, ∞). The algorithms generated by g_0(x) = x^q produce asymptotically crisp c-partitions for fixed values of m > 1 as q → ∞ and for fixed values of q > 0 as m → 1+. The partitions produced by the algorithms become increasingly soft for fixed values of m > 1 as q → 0 and for fixed values of q > 0 as m → ∞.

4. REFORMULATION FUNCTIONS BASED ON ORDERED WEIGHTED AGGREGATION OPERATORS
A broad family of soft LVQ and clustering algorithms can also be developed by minimizing reformulation functions constructed in terms of ordered weighted aggregation operators [10]. Consider the function

R = (1/M) Σ_{i=1}^{M} f( Σ_{l=1}^{c} w_l g(||x_i - v_[l]||^2) ),   (22)

where {||x_i - v_[l]||^2}_{l∈N_c} are obtained by ordering {||x_i - v_l||^2}_{l∈N_c} in ascending order, that is, ||x_i - v_[1]||^2 ≤ ||x_i - v_[2]||^2 ≤ ... ≤ ||x_i - v_[c]||^2, and the weights {w_l}_{l∈N_c} satisfy w_l ∈ [0, 1], ∀l ∈ N_c, and Σ_{l=1}^{c} w_l = 1. The reformulation function (1) can be obtained
from (22) if g(x) = x^p, f(x) = g^{-1}(x) = x^{1/p}, and w_l = 1/c, ∀l ∈ N_c. Any function of the form (22) is an admissible reformulation function if the functions f(·) and g(·) and the weights {w_l}_{l∈N_c} satisfy the conditions summarized by the following theorem:

Theorem 4: Let X = {x_1, x_2, ..., x_M} ⊂ ℝ^n be a finite set of feature vectors which are represented by the set of c < M prototypes V = {v_1, v_2, ..., v_c} ⊂ ℝ^n. Consider the function R defined in (3), with

h(a_1, a_2, ..., a_c) = f( Σ_{l=1}^{c} w_l g(a_[l]) ),   (23)
where {a_[l]}_{l∈N_c} are the arguments of h(·) ordered in ascending order, that is, a_[1] ≤ a_[2] ≤ ... ≤ a_[c], and the weights {w_l}_{l∈N_c} satisfy w_l ∈ [0, 1], ∀l ∈ N_c, and Σ_{l=1}^{c} w_l = 1. Then, R is an admissible reformulation function of the first (second) kind in accordance with the axiomatic requirements R1-R3 if f(·) and g(·) are continuous and differentiable everywhere functions satisfying f(g(x)) = x, f(x) and g(x) are both monotonically decreasing (increasing) functions of x ∈ (0, ∞), g'(x) is a monotonically increasing (decreasing) function of x ∈ (0, ∞), and w_1 ≥ w_2 ≥ ... ≥ w_c.

The development of soft LVQ and clustering algorithms can be accomplished by considering aggregation operators of the form (23) corresponding to g(x) = x^p and f(x) = x^{1/p}. For g(x) = x^p, g'(x) = p x^{p-1} and g''(x) = p(p-1) x^{p-2}. The functions g(x) = x^p and f(x) = x^{1/p} are both monotonically decreasing for all x ∈ (0, ∞) if p < 0. Theorem 4 requires that g'(x) be a monotonically increasing function of x ∈ (0, ∞), which is true if p(p-1) > 0. For p < 0, this last inequality is valid if p < 1. Thus, the function R corresponding to g(x) = x^p is an admissible reformulation function of the first kind if p ∈ (-∞, 0). The functions g(x) = x^p and f(x) = x^{1/p} are both monotonically increasing for all x ∈ (0, ∞) if p > 0. In this case, Theorem 4 requires that g'(x) be a monotonically decreasing function for all x ∈ (0, ∞), which is true if p(p-1) < 0. For p > 0, this last inequality is valid if p < 1. Thus, the function R corresponding to g(x) = x^p is an admissible reformulation function of the second kind if p ∈ (0, 1).

Consider the aggregation operator resulting from (23) if g(x) = x^p, which implies that f(x) = x^{1/p}. In this case, the reformulation function defined in (3) takes the form

R = (1/M) Σ_{i=1}^{M} D_p(w, {||x_i - v_l||^2}_{l∈N_c}),   (24)

with

D_p(w, {||x_i - v_l||^2}_{l∈N_c}) = ( Σ_{l=1}^{c} w_l (||x_i - v_[l]||^2)^p )^{1/p},   (25)

with p ∈ ℝ - {0}. For a given weight vector w = [w_1 w_2 ... w_c]^T such that w_l ∈ [0, 1], ∀l ∈ N_c, and Σ_{l=1}^{c} w_l = 1, (25) is the ordered weighted generalized mean of {||x_i - v_l||^2}_{l∈N_c}. If p = 1, then the ordered weighted generalized mean (25) reduces to the ordered weighted mean. For w = [1 0 ... 0]^T, D_p(w, {||x_i - v_l||^2}_{l∈N_c}) = min_{l∈N_c} {||x_i - v_l||^2}. For w = [0 0 ... 1]^T, D_p(w, {||x_i - v_l||^2}_{l∈N_c}) = max_{l∈N_c} {||x_i - v_l||^2}. If w = [w_1 w_2 ... w_c]^T with w_l = 1/c, ∀l ∈ N_c, then (25) coincides with the generalized mean or unweighted p-norm of {||x_i - v_l||^2}_{l∈N_c}, defined in (2).
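To make the role of the weight vector w in (25) concrete, the following short sketch (our own illustration; the helper name owgm is hypothetical) evaluates the ordered weighted generalized mean and its limiting cases.

    import numpy as np

    def owgm(w, d, p):
        # Ordered weighted generalized mean D_p(w, {d_l}) of Eq. (25).
        # w: (c,) weights with w_l in [0, 1] and sum(w) = 1; d: (c,) squared distances; p != 0.
        d_sorted = np.sort(d)                     # d_[1] <= d_[2] <= ... <= d_[c]
        return np.dot(w, d_sorted ** p) ** (1.0 / p)

    d = np.array([4.0, 1.0, 9.0, 2.0])
    c = d.size
    print(owgm(np.eye(c)[0], d, p=2.0))           # w = [1, 0, ..., 0]^T  -> min_l d_l = 1.0
    print(owgm(np.eye(c)[-1], d, p=2.0))          # w = [0, ..., 0, 1]^T  -> max_l d_l = 9.0
    print(owgm(np.full(c, 1.0 / c), d, p=2.0))    # w_l = 1/c             -> unweighted p-norm of (2)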
4.1. Ordered Weighted LVQ Algorithms

Consider the LVQ and clustering algorithms produced by minimizing a reformulation function of the form (3), with the aggregation operator h(·) defined in (23). From the definition of h(·),

∂h(a_1, a_2, ..., a_c)/∂a_[j] = (∂/∂a_[j]) f( Σ_{l=1}^{c} w_l g(a_[l]) ) = w_j g'(a_[j]) f'( Σ_{l=1}^{c} w_l g(a_[l]) ).   (26)
The update equation for each v_[j] can be obtained from (5) using (26) as

Δv_[j] = η_j Σ_{i=1}^{M} w_j α_{i[j]} (x_i - v_[j]),   (27)

where {η_j} are the learning rates and {α_{i[j]}} are the competition functions, defined as

α_{i[j]} = g'(||x_i - v_[j]||^2) f'(S_i),   (28)

where S_i = Σ_{l=1}^{c} w_l g(||x_i - v_[l]||^2). If p = 1/(1-m), the function g(x) = x^{1/(1-m)} leads to admissible reformulation functions of the first kind if m ∈ (1, ∞). In this case, the competition functions {α_{i[j]}} can be obtained from (28) as

α_{i[j]} = ( Σ_{l=1}^{c} w_l ( ||x_i - v_[l]||^2 / ||x_i - v_[j]||^2 )^{1/(1-m)} )^{-m}.   (29)
The effect of ordering the squared Euclidean distances {||x_i - v_j||^2}_{j∈N_c} is carried in the update equation (27) for each prototype v_[j] by the corresponding weight w_j, which can be incorporated in the learning rate. In such a case, the update equation for each prototype is independent of the ordering of {||x_i - v_j||^2}_{j∈N_c} and takes the form (6), with

α_ij = ( Σ_{l=1}^{c} w_l ( ||x_i - v_[l]||^2 / ||x_i - v_j||^2 )^{1/(1-m)} )^{-m}.   (30)

If w_l = 1/c, ∀l ∈ N_c, then (30) reduces to the competition functions (19) of the FLVQ algorithm, which were produced by minimizing the reformulation function (1).
4.2. Ordered Weighted Clustering Algorithms

For w_l = 1/c, ∀l ∈ N_c, (30) gives the competition functions (19), which can be written in terms of the membership functions (20) of the FCM algorithm as α_ij = (c u_ij)^m. This indicates that the competition functions (30) of ordered weighted LVQ algorithms also correspond to a set of membership functions {u_ij}, obtained according to α_ij = (c u_ij)^m as

u_ij = (1/c) ( Σ_{l=1}^{c} w_l ( ||x_i - v_[l]||^2 / ||x_i - v_j||^2 )^{1/(1-m)} )^{-1}.   (31)
If w = [1 0 ... 0]^T, then (31) becomes

u_ij = (1/c) ( min_{l∈N_c} {||x_i - v_l||^2} / ||x_i - v_j||^2 )^{1/(m-1)}.   (32)

This is the form of the membership functions of the Minimum FCM algorithm [2]. If w = [w_1 w_2 ... w_c]^T with w_l = 1/c, ∀l ∈ N_c, then (31) takes the form of the membership functions (20) of the FCM algorithm [1].
5. CONCLUSIONS

This paper proposed a general framework for developing soft LVQ and clustering algorithms by using gradient descent to minimize a reformulation function based on admissible aggregation operators. This approach establishes a link between competitive LVQ models and operators developed over the years to perform aggregation on fuzzy sets. For mean-type aggregation operators, the development of LVQ and clustering algorithms reduces to the selection of admissible generator functions. This paper studied the properties of soft LVQ and clustering algorithms derived using nonlinear generator functions. Another family of soft LVQ and clustering algorithms was developed by minimizing admissible reformulation functions based on ordered weighted aggregation operators. In addition to its use in the development of soft LVQ and clustering algorithms, the proposed formulation can also provide the basis for exploring the structure of the data by identifying outliers in the feature set. A major study is currently under way, which aims at the evaluation of a broad variety of soft LVQ and clustering algorithms on the segmentation of magnetic resonance images of the brain.

REFERENCES
1. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.
2. N. B. Karayiannis, Proc. Fifth Int. Conf. Fuzzy Syst., New Orleans, LA, September 8-11, 1996, 1036.
3. N. B. Karayiannis, J. Intel. Fuzzy Syst. 5 (1997) 103.
4. T. Kohonen, Self-Organization and Associative Memory, 3rd Edition, Springer-Verlag, Berlin, 1989.
5. E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Pattern Recognition 27 (1994) 757.
6. I. Pitas, C. Kotropoulos, N. Nikolaidis, R. Yang, and M. Gabbouj, IEEE Trans. Image Proc. 5 (1996) 1048.
7. N. B. Karayiannis, IEEE Trans. Neural Networks 8 (1997) 505.
8. N. B. Karayiannis, SPIE Proc. 3030 (1997) 2.
9. N. B. Karayiannis and J. C. Bezdek, IEEE Trans. Fuzzy Syst. 5 (1997) 622.
10. N. B. Karayiannis, Proc. 1998 Int. Conf. Fuzzy Syst., Anchorage, AK, May 4-9, 1998, 1388.
11. N. B. Karayiannis, Proc. 1998 Int. Conf. Fuzzy Syst., Anchorage, AK, May 4-9, 1998, 1441.
12. R. J. Hathaway and J. C. Bezdek, IEEE Trans. Fuzzy Syst. 3 (1995) 241.
Active Learning in Self-Organizing Maps

M. Hasenjäger^a, H. Ritter^a, and K. Obermayer^b

^a Technische Fakultät, Universität Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany
^b Dept. of Computer Science, Technical University of Berlin, FR 2-1, Franklinstr. 28/29, D-10587 Berlin, Germany

The self-organizing map (SOM) was originally proposed by T. Kohonen in 1982 on biological grounds and has since then become a widespread tool for exploratory data analysis. Although introduced as a heuristic, SOMs have been related to statistical methods in recent years, which led to a theoretical foundation in terms of cost functions as well as to extensions to the analysis of pairwise data, in particular of dissimilarity data. In our contribution, we first relate SOMs to probabilistic autoencoders, re-derive the SOM version for dissimilarity data, and review part of the above-mentioned work. Then we turn our attention to the fact that dissimilarity-based algorithms scale with O(D^2), where D denotes the number of data items, and may therefore become impractical for real-world datasets. We find that the majority of the elements of a dissimilarity matrix are redundant and that a sparse matrix with more than 80% missing values suffices to learn a SOM representation of low cost. We then describe a strategy for selecting the most informative dissimilarities for a given set of objects. We suggest selecting (and measuring) only those elements whose knowledge maximizes the expected reduction in the SOM cost function. We find that active data selection is computationally expensive, but may reduce the number of necessary dissimilarities by more than a factor of two compared to a random selection strategy. This makes active data selection a viable alternative when the cost of actually measuring dissimilarities between data objects becomes high.

1. INTRODUCTION

The self-organizing map (SOM) was first described by T. Kohonen in 1982 [1] as a biologically inspired method to generate useful representations of data objects. In the subsequent decade his method has been developed into a widely used tool for exploratory data analysis [2]. The method is simple and intuitive, which has made it popular, and it generates a representation of the data that preserves the important relational properties of the whole dataset while still being amenable to visual inspection. The SOM combines the two standard paradigms of unsupervised data analysis [3,4], namely the grouping of data by similarity (clustering) and the extraction of explanatory variables (projection methods). Given a representation of data objects in terms of feature vectors which live in a Euclidean feature space, standard SOMs perform a mapping from
the continuous input space to a discrete set of "neurons" (clustering) which are arranged in a lattice. Similar data objects are assigned to the same or to neighboring neurons in a way that the lattice coordinates (labels) of the neurons often correspond to the relevant combinations of features describing the data. The final representation of the data by the lattice of neurons can be viewed as an "elastic net" of points that is fitted to some curved data manifold in input space, providing a non-linear projection of the data. SOMs have long been plagued by their origin as a heuristic method (with respect to its property to generate neighborhood-preserving maps), and there have been several efforts to put the SOM on firm theoretical grounds (e.g. [5-10]). By pointing out the relation between topographic clustering methods and source-channel coding problems, Luttrell [11] and later Graepel et al. [12,13] derived cost functions for topographic clustering methods, and advocated the use of deterministic annealing strategies (cf. [14]) for their optimization. The tasks of grouping data, of finding the relevant explanatory variables, and of embedding the data in some low dimensional "feature space" for the purpose of visualization are not restricted to data that are described via feature vectors and Euclidean distance measures. On the contrary, there is an even stronger need for grouping and for embedding methods if relations between data objects are only defined pairwise, for example via a table of mutual dissimilarities obtained by actually measuring "distances" between objects. Pairwise data are less intuitive and there are fewer tools available for their analysis. Based on ideas put forward by [11,15], Graepel and Obermayer [16] have recently generalized the SOM method to perform clustering on dissimilarity data and to generate a representation which can be viewed as a non-linear extension to metric multidimensional scaling [17]. If the dissimilarities are given by the Euclidean distances between feature vectors assigned to the objects, the new method reduces to the standard SOM. The analysis of dissimilarity data, however, faces a serious scaling problem. The number of dissimilarity values grows quadratically with the number of data objects, and for a set of 100 data objects there are already 10,000 dissimilarities to be measured! Data acquisition and data processing may become quite demanding tasks, and the analysis may become infeasible even for a moderate number of data objects. Luckily it turns out that dissimilarity matrices are highly redundant if there is structure in the data to be discovered, a fact that is well known from multidimensional scaling [18,19]. Just consider for example a matrix of distances between European cities. Because the cities are located on the earth's surface, three distances per city are in general sufficient to recover their relative locations, all the other distances being redundant. These considerations suggest that the scaling problem for dissimilarity-based algorithms can be overcome by providing strategies (i) for the treatment of missing data, and (ii) for the selection of only those dissimilarity values that carry the relevant information, i.e. for active data selection. Active data selection (e.g. [20-24]) and the missing data problem (e.g. [25,26]) have been well investigated in the past for problems of supervised learning, but applications to unsupervised learning have so far been scarce (but see [27]).
In our contribution we investigate SOMs for the analysis of dissimilarity data and we specifically address the open problems of missing data and of active data selection. For this purpose, we first review Graepel and Obermayer's [16] work on pairwise topographic clustering in section 2, and we re-derive the generalized SOM utilizing the relationship between clustering and probabilistic autoencoders.
Figure 1. Left: Sketch of a neural network autoencoder with three hidden layers. The target representation is formed in the bottleneck layer. Right: A probabilistic autoencoder with three hidden representations.
In section 3 we then turn to the handling of missing data and to the problem of active data selection. Following [27], we replace every missing distance between two data objects by an average value calculated from the distances to other objects from the same cluster. As a strategy for active data selection we then suggest to include only those dissimilarity values for learning for which we expect the maximum reduction in the clustering cost. In section 4 we provide the results of numerical simulations. We show the performance of the SOM for different percentages of missing data, and we compare the performance of active selection strategies with the performance obtained by selecting new dissimilarity values at random. Section 5, finally, summarizes our conclusions.

2. TOPOGRAPHIC MAPPING OF DISSIMILARITY DATA
2.1. Cost Functions

Fig. 1 (left) shows the sketch of a feedforward neural network architecture called an autoencoder. Autoencoders are typically used to find a representation of the data which makes the regularities in a given dataset explicit. In order to achieve this goal, the input data is first passed through a limited capacity bottleneck and then reconstructed from the internal representation. The bottleneck enforces a representation using only a few explanatory variables, while the constraint of reliable reconstruction at the output layer ensures that the extracted explanatory variables are indeed relevant. For linear connectionist neurons and a Euclidean distance measure between the data objects, the autoencoder performs a projection of the data into a subspace which is spanned by the first principal components [28]. If the representation in the bottleneck layer is constrained to be sparse, with exactly one neuron active for a given data object, that same architecture generates a clustering solution as we will see below. Let us next consider an autoencoder for which the transformations between representations
are probabilistic (Fig. 1 (right)). In an encoding stage, data objects i are mapped to representatives r and s in the bottleneck layer with probabilities P1(r|i) and P2(s|r). In the decoding stage, the data objects i' are reconstructed from the internal representation via the probabilistic "decoders" P̂2(r'|s) and P̂1(i'|r'). Object i and reconstruction i' need not coincide, because the transformations between the representations are probabilistic and because the representation in the bottleneck may be lossy. We now introduce a dissimilarity measure δ(i, i') which describes the degree of dissimilarity between any two objects i and i'. We can think of δ being derived from a distance measure d(x, x') between associated feature vectors x and x' or being directly measured in an experiment. For any given autoencoder and for any given dissimilarity measure, we then obtain for the average dissimilarity E

E = Σ_{i,i'} Σ_{r,s,r'} P0(i) P1(r|i) P2(s|r) P̂2(r'|s) P̂1(i'|r') δ(i, i'),   (1)
where P0(i) denotes the probability that a data object i is contained in a given dataset. If the reconstruction of an object i' from a representation s is done in an optimal way, the encoders and the decoders are related by Bayes' theorem,

P̂1(i'|r') = P1(r'|i') P0(i') / P1(r'),   P̂2(r'|s) = P2(s|r') P1(r') / P2(s),   (2)

so that the average dissimilarity E can be expressed solely in terms of the encoding probabilities. By inserting Eqs. (2) in Eq. (1) we obtain

E = Σ_{i,i'} Σ_{r,s,r'} P0(i) P1(r|i) P2(s|r) ( P2(s|r') / P2(s) ) P1(r'|i') P0(i') δ(i, i'),   (3)

where P2(s) is given by

P2(s) = Σ_{r,i} P2(s|r) P1(r|i) P0(i).   (4)
The quantity E in Eq. (3) is the cost which is associated with a particular representation s. It serves as a quality measure for the representation: the lower the cost E, the better the representation s captures the regularities underlying the data. Now we specialize Eq. (3) to describe a representation r of size N of D data objects i as it is obtained via topographic clustering methods. We enforce a sparse representation via the ansatz P1(r|i) = m_ir, i = 1...D, r = 1...N, where the assignment variables m_ir are equal to one if the data object i is assigned to neuron r and zero otherwise. The assignment variables form the entries of an assignment matrix M and are constrained by Σ_r m_ir = 1 ∀i to ensure sparseness, i.e. a unique assignment of objects to neurons. Topography is ensured via the second encoding stage, whose probabilities P2(s|r) can be interpreted as permutation probabilities between neurons. In the following, we assume that these probabilities are given and we treat them as elements of a neighborhood matrix H = (h_rs)_{r,s=1...N} ∈ [0, 1]^{N×N} with h_rs = P2(s|r). Note that the rows of H are normalized: Σ_s h_rs = 1 ∀r.
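As a small numerical illustration of Eqs. (3) and (4) (ours, not from the paper; all names are hypothetical), the sketch below evaluates the average dissimilarity E for a sparse encoder P1(r|i) = m_ir and a given neighborhood matrix H with h_rs = P2(s|r).

    import numpy as np

    def autoencoder_cost(M_assign, H, delta, P0=None):
        # Average dissimilarity E of Eq. (3) with P1(r|i) = m_ir and P2(s|r) = h_rs.
        # M_assign: (D, N) assignment matrix (rows sum to 1); H: (N, N) row-normalized;
        # delta: (D, D) dissimilarities; P0: (D,) prior over objects (uniform if None).
        D = M_assign.shape[0]
        P0 = np.full(D, 1.0 / D) if P0 is None else P0
        Ps_given_i = M_assign @ H                              # P2(s|i) = sum_r m_ir h_rs
        P2 = P0 @ Ps_given_i                                   # Eq. (4)
        W = (Ps_given_i * P0[:, None]) / (P2[None, :] + 1e-12) # P0(i) P2(s|i) / P2(s)
        coupling = W @ (Ps_given_i * P0[:, None]).T            # effective coupling between i and i'
        return float((coupling * delta).sum())                 # Eq. (3)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))
    delta = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    M_assign = np.eye(3)[rng.integers(0, 3, size=6)]           # 6 objects crisply assigned to 3 neurons
    print(autoencoder_cost(M_assign, np.eye(3), delta))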
If we additionally denote the dissimilarity between two objects i and j by d_ij, Eq. (3) reads

E^TMP = (1/2) Σ_{i,j=1, i≠j}^{D} Σ_{r,s,t=1}^{N} ( m_ir h_rs m_jt h_ts / Σ_{k=1}^{D} Σ_{u=1}^{N} m_ku h_us ) d_ij   (5)

       = (1/2) Σ_{i=1}^{D} Σ_{s=1}^{N} m̃_is d̃_is,   (6)

where the factor 1/2 was introduced for computational convenience and where we have introduced the abbreviations

d̃_is = Σ_{j=1}^{D} m̃_js d_ij / Σ_{k=1}^{D} m̃_ks   and   m̃_is = Σ_{u=1}^{N} m_iu h_us   (7)
for the average dissimilarity of object i to the objects assigned to neuron s and the "effective" assignment variables m̃_is. E^TMP is our cost function for the clustering and the topographic mapping of data objects based on their pairwise dissimilarities (TMP). The optimal clustering solutions are given by the assignment matrices M for which Eq. (6) is minimal, for a given neighborhood matrix H. These solutions are consistent with the intuitively plausible idea that the "correct" groups of data objects are found if the average dissimilarities between objects that are assigned to the same neurons are minimal. Let us briefly relate this approach to previous work. If the dissimilarities d_ij are given by Euclidean distances between feature vectors, E^TMP is equivalent to the cost function of topographic vector quantization (TVQ) [29]. If there is no coupling between neurons, i.e. if h_rs = δ_rs, E^TMP reduces to the cost function for pairwise clustering [15].

2.2. Optimization

In order to avoid local minima, optimization of E^TMP w.r.t. the assignment variables is performed using mean field annealing [30]. Introducing noise in the assignment variables leads to the Gibbs distribution
P_β(M) = (1/Z_β) exp(-β E^TMP)   (8)

for representations, where the sum in the partition function Z_β = Σ_{m_ir} exp(-β E^TMP) is carried out over all admissible assignment matrices M. For a given value of the inverse computational temperature β, which governs the level of noise, the probability of assigning a data object i to a neuron r is then given by the expectation value of m_ir w.r.t. the probability distribution P_β. Unfortunately, these expectation values cannot be calculated analytically, because E^TMP is not linear in the assignment variables m_ir. The usual solution to this problem is to approximate the Gibbs distribution Eq. (8) by the factorizing distribution

Q_β(M) = (1/Z_Q) exp( -β Σ_{k=1}^{D} Σ_{r=1}^{N} e_kr m_kr )   (9)
and to estimate the quantities e_kr, the so-called partial assignment costs or mean fields, by minimizing the Kullback-Leibler divergence between P_β and Q_β. As detailed in [16], the optimal mean fields e_kr are given by

e_kr = Σ_{s=1}^{N} h_rs ( ⟨d̃_ks⟩ - (1/2) Σ_{j=1}^{D} ⟨m̃_js⟩ ⟨d̃_js⟩ / Σ_{l=1}^{D} ⟨m̃_ls⟩ ),   ∀k, r,   (10)

where

⟨m_kr⟩ = exp(-β e_kr) / Σ_{s=1}^{N} exp(-β e_ks),   ∀k, r,   (11)

are the expectation values of the assignment variables w.r.t. Q_β and

⟨d̃_ks⟩ = Σ_{j=1}^{D} ⟨m̃_js⟩ d_kj / Σ_{l=1}^{D} ⟨m̃_ls⟩,   ⟨m̃_js⟩ = Σ_{u=1}^{N} ⟨m_ju⟩ h_us,   (12)
are the average effective distances between a data object k and the data objects assigned to neuron s. The approximation of P_β by Q_β is called the mean-field approximation and implicitly assumes that on average the assignments of data items to neurons are independent in the sense that ⟨m_ir m_jr⟩ = ⟨m_ir⟩⟨m_jr⟩, where ⟨·⟩ denotes the expectation value. The self-consistent equations (11) and (10) can be solved by fixed point iteration at any value of the temperature parameter β, β > 0. In particular, it is possible to employ deterministic annealing in β, thus finding the unique minimum of the mean field approximation to Eq. (8) at low values of β, which is then tracked with increasing β, until at a sufficiently high value of β the solution is expected to correspond to a good minimum of the original cost function Eq. (6). For SOM applications, annealing in the computational temperature should generally be preferred over annealing the range of the neighborhood function. It is robust, and it allows the use of h_rs solely for encoding permutation probabilities between neurons (cf. [12,31]). Note that for a Euclidean distance measure the standard SOM as described by T. Kohonen in 1982 is obtained from TMP, Eqs. (10) and (11), by omitting the convolution with h_rs in Eq. (10) (and considering an on-line update for β → ∞) (cf. [12]). Therefore we define a SOM approximation to TMP by

e_kr = ⟨d̃_kr⟩ - (1/2) Σ_{j=1}^{D} ⟨m̃_jr⟩ ⟨d̃_jr⟩ / Σ_{l=1}^{D} ⟨m̃_lr⟩,   ∀k, r.   (13)
3. ACTIVE DATA SELECTION

It is known since the early days of multidimensional scaling that a large fraction of the dissimilarity data is practically redundant. As Kruskal noted, only 25% of the possible judgements is a sufficient portion of judgements, provided that they are "properly distributed in the matrix of all possible judgements" ([18], p. 166). These facts apply similarly to TMP, so it seems strongly advisable to eliminate redundancy
in the data in order to increase the efficiency of TMP by working with incomplete data matrices, whose entries are carefully selected on the basis of their expected informativeness w.r.t. the clustering solution. This strategy poses two problems: first, we have to be able to handle the missing data problem, and second, we have to develop a criterion for data selection.

Let us begin with the treatment of missing data. We note that the data enter the TMP algorithm only in the evaluation of the mean fields (Eqs. (10) or (13)) via the average dissimilarities ⟨d̃_ks⟩ (cf. Eq. (12)). We may thus replace a missing value d_kj in Eq. (12) by the average value ⟨d̃_ks⟩ that can be calculated from the available dissimilarities of object k to the objects in cluster s. This amounts to restricting the sums in Eq. (12) to the known values.

Next we derive a criterion for the selection of new dissimilarities. An appropriate framework for this kind of problem is given by Bayesian decision theory (cf. [32,33]), which deals with the problem of how to act in an environment when only partial information is available. The missing information is compensated for by defining a prior probability distribution over the states of the environment, and the usefulness of a particular act is measured by a suitable utility function. The optimum act based on the current knowledge is then given by the act with the maximum expected utility w.r.t. the posterior distribution over the states of the environment.

Let us now apply this framework to the problem at hand. An appropriate starting point for the definition of a suitable utility function is provided by the clustering cost function E^TMP. We should measure that missing dissimilarity whose knowledge is expected to lead to the maximum reduction of the clustering cost. If the objects are uniquely assigned to neurons, this proposition implies that the knowledge of a new dissimilarity value should lead to a change in the assignments of objects to neurons. Let us rewrite E^TMP as

E^TMP = Σ_{i=1, i≠j}^{D} Σ_{s=1}^{N} m̃_is d̃_is + Σ_{s=1}^{N} h_as d̃_js,   (14)

omitting the factor 1/2 and assuming that the object j is assigned to neuron a. Moreover, in accordance with the assumptions of the mean field approximation, we assume that the first term in Eq. (14) is approximately constant w.r.t. changes in the assignment of object j on the basis of new information. We may then express the change in E^TMP that results if object j were assigned to cluster r instead of to cluster a as

ΔE^TMP = Σ_{s=1}^{N} h_as d̃_js^new - Σ_{s=1}^{N} h_rs d̃_js^new.   (15)

If object j is always assigned to the neuron that represents those objects that are least dissimilar to object j, we obtain

ΔE^TMP = Σ_{s=1}^{N} h_as d̃_js^new - min_t Σ_{s=1}^{N} h_ts d̃_js^new.   (16)
Algorithm 1: Active Topographic Mapping of Pairwise Dissimilarity Data
1. Find a clustering solution given the known dissimilarity values.
2. Choose a set of objects k, k ∈ {1...D} - one object from each cluster - which will be considered as possible participants in a query.
3. Evaluate the integrals in Eq. (17) via Monte Carlo methods and find the pair (i, r) from the above set for which v(d_ij, j ∈ r) is maximal.
4. Select an object j from cluster r, for which d_ij is not known yet, and measure the dissimilarity value d_ij.
5. Either find a new clustering solution and go to step 2, or go to step 2 immediately.
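A minimal sketch of the two ingredients discussed in this section (our own code, hypothetical names): the average dissimilarities of Eq. (12) with the sums restricted to the known entries of the dissimilarity matrix, and the inexpensive weak-coupling form of the reduction (16), in which h_rs is replaced by the identity.

    import numpy as np

    def masked_average_dissimilarities(d, known, m_eff):
        # <d~_ks> of Eq. (12) with the sums restricted to known dissimilarities.
        # d: (D, D) dissimilarities (arbitrary where unknown); known: (D, D) boolean mask;
        # m_eff: (D, N) effective assignments <m~_js>.
        w = known.astype(float)
        num = (w * d) @ m_eff                  # sum over known j of <m~_js> d_kj
        den = w @ m_eff                        # corresponding normalization per (k, s)
        return num / np.maximum(den, 1e-12)

    def delta_E_weak_coupling(d_eff_row, current):
        # Approximate reduction of Eq. (16) for one object in the weak-coupling limit
        # (h_rs replaced by the identity): d~_ja - min_t d~_jt.
        return d_eff_row[current] - d_eff_row.min()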
Anyhow, since we do not have complete knowledge of the data, we do not know the true values of d̃_js^old and d̃_js^new, and we need to estimate these and hence the expected value of ΔE^TMP after a query. We assume that the dissimilarities resulting from a comparison of object i with objects assigned to neuron s are normally distributed with mean d̃_is and variance σ_is^2. The posterior distributions are then given by Student's t distributions (cf. [32]), whose parameters can be derived from the empirical means d̃_is, cf. Eq. (7), restricted to the available data values, and the corresponding empirical variances σ̂_is^2. Averaging Eq. (16) over the posterior distributions we obtain
v(d_ij, j ∈ r) = ∫ ⋯ ∫ ( Σ_{s=1}^{N} h_as d̃'_is - min_t Σ_{s=1}^{N} h_ts d̃'_is ) Π_{s=1, s≠r}^{N} f_{ν_is}( d̃'_is | d̃_is, σ̂_is^2 / n_is ) f_{ν_ir}( d̃'_ir | d̃_ir, σ̂_ir^2 / n*_ir ) d d̃'_i1 ⋯ d d̃'_iN,   (17)
where a denotes the neuron to which object i is assigned before the new dissimilarity value is known, n_is is the effective number of available dissimilarity values between object i and objects from cluster s, n*_ir = (n_ir + 1) n_ir, and ν_is = n_is - 1 is the degree of freedom of the Student's t distribution. v(d_ij, j ∈ r) is often called the expected value of sample information; the goal is to choose that dissimilarity value for which v(d_ij, j ∈ r) is maximal. The expectation value in Eq. (17) must, unfortunately, be evaluated numerically. An efficient method for the approximate evaluation of such high-dimensional integrals is provided by Monte Carlo integration, i.e. a random sample of the variable (Σ_s h_as d̃'_is - min_t Σ_s h_ts d̃'_is) is generated according to the probability density given in Eq. (17), and v(d_ij, j ∈ r) is approximated by the arithmetic mean of this sample. Since for computational reasons one cannot afford to test all missing dissimilarity values for their expected informativeness, we randomly select one object per cluster and we evaluate v(d_ij, j ∈ r) only for those objects. After deciding for one object i and one cluster r, we measure the dissimilarity d_ij between i and a randomly chosen object j from r. Our algorithm for
Figure 2. Dataset no. 1. Left: 500 data points generated from a 1D noisy spiral in 3D space (for parameters see text). Right: Dissimilarity matrix, calculated from the squared Euclidean distances between the data points. Bright and dark pixels indicate small and large distances. The dissimilarity matrix serves as input for TMP and pairwise SOM.
active data selection is summarized in Algorithm 1. If there is no coupling between the neurons, i.e. h_rs = δ_rs, Eq. (17) is equivalent to the query algorithm proposed by [27]. If we replace (Σ_s h_as d̃'_is - min_t Σ_s h_ts d̃'_is) in Eq. (17) with (d̃'_ia - min_t d̃'_it), we arrive at a computationally less expensive approximation of the query criterion that yields satisfactory results if the coupling between the neurons is weak.
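The Monte Carlo estimate can be sketched as follows (our own illustration under the stated distributional assumptions; the scale parameters and all names are our assumptions, not the authors' implementation): samples of the average dissimilarities are drawn from the Student's t posteriors and the reduction of Eq. (16) is averaged over them.

    import numpy as np

    def expected_sample_value(d_eff_i, var_i, n_i, H, current, candidate,
                              n_samples=2000, seed=0):
        # Monte Carlo estimate of the expected value of sample information for object i
        # and candidate cluster r, in the spirit of Eq. (17).
        # d_eff_i: (N,) empirical means <d~_is>; var_i: (N,) empirical variances;
        # n_i: (N,) numbers of measured dissimilarities; current: index a of the present neuron.
        rng = np.random.default_rng(seed)
        dof = np.maximum(n_i - 1, 1)                      # nu_is = n_is - 1
        scale = np.sqrt(var_i / np.maximum(n_i, 1))       # assumed posterior scale
        scale[candidate] /= np.sqrt(n_i[candidate] + 1)   # cluster r receives one extra value (assumption)
        t = rng.standard_t(dof, size=(n_samples, len(d_eff_i)))
        d_new = d_eff_i + scale * t                       # samples of d~'_is
        costs = d_new @ H.T                               # sum_s h_ts d~'_is for every neuron t
        gain = costs[:, current] - costs.min(axis=1)      # reduction of Eq. (16)
        return float(gain.mean())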
4. RESULTS

In this section, we present numerical results to evaluate the performance of the algorithms presented above. We use two different toy data sets: a noisy spiral in 3D (cf. [16] and Fig. 2) with and without a pronounced cluster structure. Dataset 1 consists of D = 500 data points which were generated according to x_i = (sin θ_i + η_x, cos θ_i + η_y, θ_i/π + η_z), where θ_i = 4πi/(D-1), i = 0...(D-1), and η = (η_x, η_y, η_z) is zero mean Gaussian noise with standard deviation σ_n = 0.3. The corresponding dissimilarity matrix is calculated from the squared Euclidean distances between these points, d_ij = |x_i - x_j|^2/2. Dataset 2 was created by equidistantly placing 10 Gaussian distributed clusters of 50 data points each along the spiral. The standard deviation of the Gaussian distributions was σ_n = 0.4. For data representation we choose a 1D SOM with 10 neurons and a Gaussian neighborhood function (width σ_h). Optimization was performed using an exponential annealing schedule β_{t+1} = 1.1 β_t, with β_0 = 0.1 and a final value of β_f = 95.56. The convergence criterion for the fixed point iteration was given by |e_kr^new - e_kr^old| < 10^{-7}, ∀k, r.

Let us first consider the complete dissimilarity matrix. Fig. 3 (left) shows a plot of the average assignment error ⟨E^TMP⟩,
Figure 3. Left: Average assignment error ⟨E^TMP⟩ according to Eq. (18) as a function of the inverse temperature β for dataset 1. Solid and dotted lines correspond to the exact TMP update (Eq. (10)) and the SOM approximation (Eq. (13)). Results for two different widths σ_h of the Gaussian neighborhood function are shown. Right: Final assignment matrix for σ_h = 0.5 and TMP update. The data points are numbered according to the order of their generation along the spiral. Bright and dark correspond to low and high probabilities of assignment.
⟨E^TMP⟩ = (1/2) Σ_{i,j=1, i≠j}^{D} Σ_{r,s,t=1}^{N} ⟨m_ir⟩ h_rs ⟨m_jt⟩ h_ts d_ij / Σ_{k=1}^{D} Σ_{u=1}^{N} ⟨m_ku⟩ h_us   (18)
as a function of the inverse temperature β for dataset 1 (cf. Fig. 2). Two different widths of the neighborhood function were used for the exact TMP update as well as for the SOM approximation. For low β all data objects are assigned to all clusters with equal probability. At β ≈ 0.66 the average assignment error ⟨E^TMP⟩ begins to drop, indicating the first phase transition in the clustering process, a phenomenon that is well known from soft central clustering [34,35,12]. The analytical values for the critical temperatures (for details see [16]) are β_c^TMP = 0.666 (for TMP) and β_c^SOM = 0.659 (for the SOM approximation) for σ_h = 0.5, and β_c^TMP = 0.690 and β_c^SOM = 0.668 for σ_h = 0.75. The differences in ⟨E^TMP⟩ between the TMP and the SOM learning rules increase with the width of the neighborhood function, but remain small. At intermediate values of β the average assignment costs of SOM learning are even below the values for TMP. Fig. 3 (right) shows the final assignment matrix (β = 95.56) for TMP. Its diagonal structure shows that TMP successfully represents the 1D manifold on which the data live and which was hidden in the dissimilarity matrix. Assignment errors stem from the noise in the data and occur only between neighboring neurons.

Fig. 4 shows the ratio ⟨E^TMP⟩_relative between the average assignment errors for clustering solutions obtained with the incomplete and the full dissimilarity matrix as a function of the percentage of missing dissimilarities. Incomplete data matrices were created by selecting dissimilarities at random from the complete data matrix. ⟨E^TMP⟩ was then calculated by inserting the assignments that resulted from running TMP on the incomplete (and the
Figure 4. The ratio ⟨E^TMP⟩_relative between the average assignment errors for clustering solutions obtained with the incomplete and the full dissimilarity matrix is plotted as a function of the percentage of missing dissimilarities. Left: Plot of the ratio for dataset 1, for various widths σ_h of the neighborhood function. Right: Plot of the ratio for dataset 2, for different widths σ_n of the Gaussian clusters and for σ_h = 0.5. The incomplete dissimilarity matrices were generated by randomly selecting a set of values. Each data point represents an average over 10 simulations for different incomplete data matrices with a similar percentage of missing data. The error bars indicate the standard deviations about the mean.
full) data matrices into Eq. (18). For the calculation of ⟨E^TMP⟩ all dissimilarities were used in both cases. The results show that TMP is quite robust w.r.t. missing data. Fig. 4 (left) shows for dataset 1 that the degradation in performance depends on the coupling between the neurons: the stronger the coupling, the more robust is the grouping. Fig. 4 (right) shows for dataset 2 that the performance of the algorithm depends on the overlap between the clusters in the data. The result is intuitively appealing: the clearer the separation between the clusters, the less data are necessary to identify this structure.

Fig. 5 and Fig. 6 show a comparison of the results obtained using the random and two active strategies for data selection. The random "strategy" consists of selecting new dissimilarities at random from the set of missing values. The two strategies for active data selection are based on the query criterion given by Eq. (17) (active I) and the approximation to this query criterion that results from replacing h_rs in Eq. (17) with δ_rs, ∀r, s (active II). In both cases 12,500 data items corresponding to 5% of the complete dissimilarity matrix are chosen at random to generate an initial clustering solution. From Figs. 5 and 6 it can be seen that during an initial phase, when there is not enough information available for good queries, as well as in the final phase, when additional dissimilarities are almost redundant anyway, the active and random data selection strategies perform quite similarly. It is in the intermediate stage that the advantage of active clustering becomes apparent. For the exact query criterion (active I), the number of necessary dissimilarities decreases with increasing width of the neighborhood function. For stronger coupling, the difference between the exact query criterion (active I) and its approximation
Figure 5. Comparison between the random selection of dissimilarities and two active data selection strategies for dataset 1. The average assignment error is plotted as a function of the percentage of available dissimilarities. The width of the neighborhood function is σ_h = 0.5 (left) and σ_h = 0.75 (right). The curve denoted by active I shows the results obtained from the full query criterion from Eq. (17), while curve active II shows the results from approximating Eq. (17) with h_rs = δ_rs, ∀r, s. Results are averages over 5 runs; error bars indicate standard deviations about the mean. For parameters see text in Sec. 4. The clustering solution was updated after every 1000 queries.
Figure 6. Comparison between the random selection of dissimilarities and two active data selection strategies for dataset 2. The average assignment error is plotted as a function of the percentage of available dissimilarities. The width of the Gaussian clusters was given by σ_n = 0.4; the width of the neighborhood function was σ_h = 0.5 (left) and σ_h = 0.75 (right). Other parameters as in Fig. 5.
(active II) also becomes more evident. Although the performance of "active II" is superior to the random strategy in all cases, it performs best in the low-coupling limit, for which it becomes exact.

5. CONCLUSION

In our contribution, we investigated a method for the topographic mapping of pairwise dissimilarity data (TMP) that was recently proposed by [16]. This algorithm extends previous approaches for central data and robustly combines the grouping of dissimilarity data with an embedding for the purpose of visualization. We showed that a large portion of the dissimilarity matrix is practically redundant w.r.t. the goal of grouping the data. We exploited this redundancy by generalizing TMP to the processing of incomplete data matrices, and suggested to carefully select the entries of these sparse matrices in a way that exploits the knowledge that is already available from the incomplete data. Our query criterion is based on Bayesian decision theory and aims at selecting the data items of maximum expected sample value, a quantity that is measured by the expected change in the clustering cost function. Our results show that judiciously selected data can indeed lead to a better performance compared to randomly selected data. The extent of this improvement depends on the properties of the data set (e.g. its intrinsic dimensionality) as well as on those of the clustering algorithm (e.g. the strength with which topography is imposed).

However, there are three points that deserve further research. The first point concerns the problem of missing data, which up to now is solved in an ad hoc manner. A more principled approach would combine the estimation of the missing dissimilarity values with the estimation of the average assignment variables and the mean fields. The second point is that active query selection as presented above is a computationally costly procedure, because the integration in Eq. (17) has to be performed numerically for each object that is considered as a potential candidate for a query. Although the computational burden can be alleviated by omitting the convolution with the neighborhood function in the utility function for weakly coupled neurons, less expensive query criteria are needed. As a last point we note that the query criterion, Eq. (17), is valid only for binary assignment variables, i.e. in the limit β → ∞. An extension to probabilistic assignments is desirable, in particular if β is used as a regularizer to avoid overfitting [36].

ACKNOWLEDGEMENTS

This work was funded in part by a DFG scholarship (GK 256) to M. Hasenjäger and by the Technical University of Berlin via grant FIP 13/41. We are grateful to T. Graepel for stimulating discussions.

REFERENCES
1. T. Kohonen, Biol. Cybern. 43 (1982) 59. 2. S. Kaski, J. Kangas, and T. Kohonen, Neur. Computing Surveys 1 (1998) 102. 3. T. Kohonen, Self-Organizing Maps, Springer, Berlin, 1995.
4. H. Ritter, T. Martinetz, and K. Schulten, Neural Networks, Addison-Wesley, Reading, MA, 1992.
5. M. Cottrell and J. C. Fort, Biol. Cybern. 53 (1986) 405.
6. H. Ritter and K. Schulten, Biol. Cybern. 60 (1988) 59.
7. H. Ritter and K. Schulten, Biol. Cybern. 54 (1986) 99.
8. E. Erwin, K. Obermayer, and K. Schulten, Biol. Cybern. 67 (1992) 47.
9. E. Erwin, K. Obermayer, and K. Schulten, Biol. Cybern. 67 (1992) 35.
10. H.-U. Bauer, M. Herrmann, and T. Villmann, submitted to Neur. Networks, 1997. Available at http://www.chaos.gwdg.de.
11. S. P. Luttrell, Neur. Computation 6 (1994) 767.
12. T. Graepel, M. Burger, and K. Obermayer, Phys. Rev. E 56 (1997) 3876.
13. T. Graepel, M. Burger, and K. Obermayer, Neurocomputing 20 (1998) 173.
14. J. Buhmann, in Learning in Graphical Models, Kluwer, Boston, 1998.
15. T. Hofmann and J. Buhmann, IEEE Trans. PAMI 19 (1997) 1.
16. T. Graepel and K. Obermayer, Neur. Computation 11 (1999) 139.
17. I. Borg and P. Groenen, Modern Multidimensional Scaling, Springer, New York, 1997.
18. J. B. Kruskal, Psychometrika 29 (1964) 1.
19. I. Spence, in Proximity and Preference, pp. 29-46, Univ. of Minnesota Pr., Minneapolis, 1982.
20. D. J. C. MacKay, Neur. Computation 4 (1992) 590.
21. P. Sollich, Phys. Rev. E 49 (1994) 4637.
22. D. A. Cohn, Z. Ghahramani, and M. I. Jordan, J. Artificial Intelligence Research 4 (1996) 129.
23. Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, Machine Learning 28 (1997) 133.
24. M. Hasenjäger and H. Ritter, Neur. Processing Letters 7 (1998) 107.
25. Z. Ghahramani and M. I. Jordan, Technical Report A.I. Memo No. 1509, MIT, 1994. Available at ftp://ftp.cs.toronto.edu/pub/zoubin/review.ps.Z.
26. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1986.
27. T. Hofmann and J. M. Buhmann, NIPS 10 (1998) 528.
28. P. Baldi and K. Hornik, Neur. Networks 2 (1989) 53.
29. S. P. Luttrell, IEEE Trans. Neur. Networks 2 (1991) 427.
30. L. K. Saul and M. I. Jordan, NIPS 8 (1996) 486.
31. M. Burger, T. Graepel, and K. Obermayer, NIPS 10 (1998) 430.
32. H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory, MIT Pr., Cambridge, MA, 1961.
33. J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer, New York, 1985.
34. K. Rose, E. Gurewitz, and G. C. Fox, IEEE Trans. Inform. Theory 38 (1992) 1249.
35. J. Buhmann and H. Kühnel, IEEE Trans. Inform. Theory 39 (1993) 1133.
36. J. M. Buhmann, Technical Report IAI-TR-98-3, Universität Bonn, 1998. Available at http://www-dbv.informatik.uni-bonn.de/abstracts/buhmann.TR98.html.
POINT PROTOTYPE GENERATION AND CLASSIFIER DESIGN

James C. Bezdek^{1,a} and Ludmila I. Kuncheva^b

^a Computer Science Department, University of West Florida, Pensacola, FL 32514, USA, [email protected]
^b School of Mathematics, University of Wales, LL57 1UT Bangor, UK, [email protected]

ABSTRACT

We consider point prototype construction for nearest prototype classifier design. The distinctions between pre- and post-supervised learning, and also between selection and extraction of point prototypes, are discussed. Numerical examples based on the Iris data are given to contrast and compare various models. Our calculations suggest that: (i) pre-supervision yields better nearest prototype classifiers than post-supervision, independent of the type of prototypes chosen; (ii) selection is (arguably) better than extraction for finding point prototypes for classification; (iii) among post-supervised designs, sequential (local) methods such as vector quantization may produce better prototypes than batch (global) methods such as fuzzy c-means when the number of prototypes is larger than the number of labeled classes; and (iv) among post-supervised designs, self-organizing feature maps produce classifiers that are intermediate between those based on local and global prototype updating.

KEY WORDS: Nearest prototypes, self-organizing feature maps, classifier design, pre-supervision, post-supervision

1. INTRODUCTION
Object data are represented as X = {x_1, ..., x_n} ⊂ ℝ^p, a set of n feature vectors in feature space ℝ^p. The j-th object is a physical entity such as a fish, guitar, motorcycle, cigar, etc. Column vector x_j is the numerical representation of object j and x_kj is the k-th feature or attribute value associated with it. Features can be continuously or discretely valued in ℝ. One of the most basic structures in pattern recognition is the label vector. There are four types of class labels: crisp, fuzzy, probabilistic and possibilistic. Letting integer c denote the number of classes, 1 < c < n, define three sets of label vectors in ℝ^c:
1 Research supported by ONR grant 00014-96-1-0642.
N_pc = {y ∈ ℝ^c : y_i ∈ [0, 1] ∀i, y_i > 0 ∃i} = [0, 1]^c - {0}   (1)

N_fc = {y ∈ N_pc : Σ_{i=1}^{c} y_i = 1}   (2)

N_hc = {y ∈ N_fc : y_i ∈ {0, 1} ∀i} = {e_1, e_2, ..., e_c}   (3)
In (1), 0 is the zero vector in ℝ^c. N_hc is the canonical (unit vector) basis of Euclidean c-space, so e_i = (0, ..., 1, ..., 0)^T, the i-th vertex of N_hc, is the crisp label for class i, 1 ≤ i ≤ c. N_fc contains the fuzzy (and probabilistic) label vectors; N_pc contains possibilistic label vectors. Note that N_hc ⊂ N_fc ⊂ N_pc. A c-partition of X is a c × n matrix U = [u_ik]. There are three sets of nondegenerate c-partitions, whose columns {U_k} correspond to the three types of label vectors:
M_pcn = {U ∈ ℝ^{cn} : U_k ∈ N_pc ∀k; 0 < Σ_{k=1}^{n} u_ik ∀i}   (4)

M_fcn = {U ∈ M_pcn : U_k ∈ N_fc ∀k}   (5)

M_hcn = {U ∈ M_fcn : U_k ∈ N_hc ∀k}   (6)
Equations (4), (5) and (6) define, respectively, the possibilistic, fuzzy (or probabilistic, if in a statistical context), and crisp c-partitions of X, with M_hcn ⊂ M_fcn ⊂ M_pcn. Crisp partitions have an equivalent set-theoretic characterization: {X_1, ..., X_c} partitions X when X_i ∩ X_j = ∅ ∀ i ≠ j and X = ∪_i X_i.
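The column and row constraints that separate the three sets of c-partitions can be verified mechanically. The following sketch (ours; the function name is hypothetical) classifies a candidate matrix U according to Eqs. (4)-(6).

    import numpy as np

    def partition_type(U, tol=1e-9):
        # Classify a c x n matrix U as a crisp, fuzzy, or possibilistic c-partition
        # of X (Eqs. (4)-(6)), or report that it is degenerate.
        in_range = np.all((U >= -tol) & (U <= 1 + tol))
        nonzero_cols = np.all(U.max(axis=0) > tol)          # every column lies in N_pc
        nonempty_rows = np.all(U.sum(axis=1) > tol)         # 0 < sum_k u_ik for all i
        if not (in_range and nonzero_cols and nonempty_rows):
            return "degenerate"
        if np.allclose(U.sum(axis=0), 1.0, atol=tol):       # columns in N_fc
            if np.all((U < tol) | (U > 1 - tol)):           # columns in N_hc
                return "crisp"        # U in M_hcn
            return "fuzzy"            # U in M_fcn
        return "possibilistic"        # U in M_pcn

    U_crisp = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)
    U_fuzzy = np.array([[0.7, 0.2, 0.5], [0.3, 0.8, 0.5]])
    print(partition_type(U_crisp), partition_type(U_fuzzy))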
Clustering algorithms deliver c-partitions U of X, and often estimate other parameters too [1]. The most common parameters besides U that are associated with clustering are sets of vectors we denote by V = {v_1, v_2, ..., v_c} ⊂ ℝ^p. The vector v_i is interpreted as a point prototype (centroid, cluster center, signature, exemplar, template, codevector, etc.) for the points associated with cluster i. The common denominator in all point prototype generation schemes is a mathematical definition of how well prototype v_i represents the i-th cluster. Any measure of similarity on ℝ^p can be used. The usual choice is distance (dissimilarity); the most convenient is the squared inner product A-induced distance ||x||_A^2 = x^T A x, and the most popular is the squared Euclidean distance (A = I_p, the identity on ℝ^p). Local methods attempt to optimize some (possibly unknown) function of the c squared distances {||x_k - v_i||^2 : 1 ≤ i ≤ c} by updating one or more of the v_i's as a result of looking at just one input datum x_k. Global methods usually seek extrema of some function of {||x_k - v_i||^2 : 1 ≤ i ≤ c; 1 ≤ k ≤ n}, i.e., all cn squared distances. Don't confuse local and global prototype generation methods with the local and global extrema of some objective function found by a particular method.
A classifier is any function $D: \Re^p \to N_{pc}$. The value $y = D(z)$ is the label vector for z in $\Re^p$. D is a crisp classifier if $D[\Re^p] = N_{hc}$; otherwise, the classifier is fuzzy, possibilistic or probabilistic, which for convenience we lump together as soft classifiers. When part or all of X is used to "train" D (find its parameters), the training and test data are denoted by $X_{tr}$ and $X_{te}$. We denote the test or generalization error rate of D when trained on $X_{tr}$ and tested with crisply labeled $X_{te}$ as $E_D(X_{te}|X_{tr})$ = (# mistakes committed by D / $|X_{te}|$). Some authors [1] call this the apparent error rate of D to emphasize that $E_D(X_{te}|X_{tr})$ is an empirical estimate of the true but unknown error rate that could be obtained if the parameters of D were known instead of estimated. If the same data, $X = X_{tr} = X_{te}$, are used for training and testing, $E_D(X|X)$ is called the (apparent) resubstitution error rate. A classifier that achieves $E_D(X|X) = 0$ is said to be consistent, following Hart's use of this term for data subsets used with the 1-nn classifier [2]. When the training data labels are used during the creation of point prototypes, the method is pre-supervised learning; when the training labels are used to label the prototypes after they are found, the method is post-supervised learning. Once point prototypes are found (and possibly relabeled to agree most usefully with the data if the training data have physical labels), they can be used to define a crisp nearest prototype (1-np) classifier $D_{V,E,\delta}$. Let V be a set of c crisply labeled point prototypes, one per class, ordered so that $e_i$ is the crisp label for $v_i$, $1 \le i \le c$; let $\delta$ be any distance measure on $\Re^p$, and let $E = \{e_i : i = 1, \dots, c\} = N_{hc}$. The crisp nearest prototype (1-np) classifier $D_{V,E,\delta}$ is defined, for $z \in \Re^p$, as

Decide $z \in i \ \Leftrightarrow\ D_{V,E,\delta}(z) = e_i \ \Leftrightarrow\ \delta(z, v_i) \le \delta(z, v_j)\ \forall\, j \ne i$.    (7)
Ties in (7) are arbitrarily resolved. Kuncheva and Bezdek [3] develop a framework in which to describe a generalized nearest prototype classifier (GNPC), which includes (7) as a special case of a paradigm that encompasses at least the five classifier families shown in Figure 1. The GNPC provides a common framework for crisp and soft nearest prototype classifier design, as well as the aggregation of evidence for labeling unseen (test) inputs by a large number of reasoning mechanisms. The cluster/relabel family in Figure 1 is post-supervised, as is the (sub)family of
SOFM models discussed here. The other three families in Figure 1 are pre-supervised.
Figure 1. Five families of classifier designs are GNPCs

When a single prototype is not sufficient to represent one or more of the classes accurately enough, several prototypes for each class are needed. Let $(V_c, E_{\hat{c}}) = \{(v_j, e_{i(j)}) : j = 1, \dots, c;\ i(j) \in \{1, \dots, \hat{c}\}\}$. Here X has $\hat{c}$ labeled classes, $\hat{c} \le c$, $V_c$ is a set of c crisply labeled prototypes, with more than one per class for at least one class if $\hat{c} < c$, $e_{i(j)}$ labels $v_j$ as class i, and $\delta$ is any distance measure on $\Re^p$. The crisp 1-nearest multiple prototype (1-nmp) classifier $D_{V_c,E_{\hat{c}},\delta}$ is defined, for $z \in \Re^p$, as

Decide $z \in i \ \Leftrightarrow\ D_{V_c,E_{\hat{c}},\delta}(z) = e_{i(j)} \ \Leftrightarrow\ \delta(z, v_j) \le \delta(z, v_s)\ \forall\, s \ne j$.    (8)
Ties in (8) are arbitrarily resolved. When $c = \hat{c}$, equation (8) reduces to (7). For reasons to be given below, we think that SOFMs (including vector quantization (VQ) and some soft variants of it) do not provide particularly good estimates of clusters in X - i.e., of $U \in M_{hcn}$. On the other hand, if there are enough training data, algorithms of this kind often generate multiple prototypes which provide fairly accurate representations of substructure in the input data.

2. SELECTION vs. EXTRACTION
Classifier performance depends importantly on the quality of $X_{tr}$. If $X_{tr}$ is large enough and/or its substructure is well delineated, we expect classifiers trained with it to yield small error rates. On the other hand, when the training data are large in dimension p and/or number of samples n, good classifiers such as the k-nearest neighbor rule [4] can require too much storage and CPU time for efficient deployment. To circumvent time and storage problems caused by very large data sets, as well as to improve the efficiency of supervision by $X_{tr}$, we try to reduce $X_{tr}$ while approximately preserving the shape of the decision boundaries set up by training D with it. Two common schemes for this are selection and replacement. Selection (S) means: find a proper subset $\hat{X}_{tr} \subset X_{tr}$. Replacement (R) means: use a transformation $\mathcal{F}: \Re^p \mapsto \Re^p$ to find $\hat{X}_{tr} = \mathcal{F}[X_{tr}]$. Point prototypes produced by these two schemes will be called S- and R-prototypes. Figure 2 depicts both methods.
Figure 2. Selection (S) and extraction (R) of prototypes from $X_{tr}$

2.1 Selection methods
Selection is the special case of replacement that is shown in the upper portion of Figure 2. Selection techniques that produce $E_D(X_{tr}|\hat{X}_{tr}) = 0$ are called condensation, and are a special case of the more general set of methods called editing, for which $E_D(X_{tr}|\hat{X}_{tr}) \ge 0$. Many authors have studied ways to condense [Hart, 2] or edit [Wilson, 5] training data. Basically, we want to select a subset $\hat{X}_{tr} \subset X_{tr}$ that preserves the shape (or skeleton) of the original data. Devijver and Kittler developed an asymptotically optimal editing technique (MULTIEDIT) which is well summarized in [4]. Dasarathy [6] developed a condensation technique he called the MCS (minimal consistent subset) method. Kuncheva and Bezdek [7] studied the use of genetic algorithms and random search for both condensation and editing. When the classifier D has the form (7) and the prototypes V are found by selection, D is called the nearest neighbor (1-nn) rule. The well known k-nn rule is obtained from (8) by extending the search there to the nearest k neighbors, which in our terminology are S-prototypes.

2.2 Extraction methods
Replacements are almost always labeled point prototypes (such as V from one of the c-means clustering models). Roughly speaking, there are four major approaches to the extraction of R-prototypes: (i) sequential models such as the leader algorithm, and sequential hard c-means (SHCM, Hartigan, [8]); (ii) batch clustering models such as hard, fuzzy and possibilistic c-means (HCM, FCM, PCM, Bezdek et al. [1]) and the generalized Lloyd algorithm [9]; (iii) network-related models such as adaptive resonance theory (ART, [10]) and self-organizing feature maps (SOFMs) and their generalizations [11]; and (iv) statistical models such as mixture decomposition [12]. All of these families have both crisp and soft realizations, some of which are discussed below.

The lower view in Figure 2 illustrates replacement of $X_{tr}$ by (multiple) labeled point prototypes V for two classes (□ and ○). It is also possible to replace the data in Figure 2 with non-point prototypes such as rings, lines, hyperquadric surfaces, etc., leading to more sophisticated classifiers that can match prototypical shapes to objects having similar representations. See [1] for more discussion on algorithms capable of doing this.

Chang [13] gave one of the earliest methods for the extraction of R-prototypes. Chang's algorithm features sequential updating based on a criterion that has a graph-theoretic flavor. Bezdek et al. [14] proposed a modification of Chang's algorithm that they called the modified Chang algorithm (MCA). Hamamoto et al. [23] gave a number of bootstrap methods for the generation of R-prototypes that are in some ways quite similar to the Chang and MCA methods. These three methods belong to the class of pre-supervised nearest multiple prototype classifier designs. There are many other ad hoc and hybrid replacement schemes - for example, Yager and Filev's [15] so-called "mountain clustering method", which is driven by a global objective function that finds pre-specified, fixed R-prototypes that are among a set of candidates which are uniformly sprinkled throughout the hyperbox spanned by the input data. One way to find prototypes for 1-np classifiers is to run any prototype generating clustering algorithm on the entire labeled training set $X_{tr}$ and simply ignore the labels during training. Using the knowledge that there are $c = \hat{c}$ labeled subsets in the training data enables you to specify the known
number of classes, so the result is (presumably) one prototype per class. Why do this if you have labeled data? Ignoring the labels during clustering may enable you to discover geometrically better prototypes than the labeled sample means for the classes, because geometric properties of the data (which are not necessarily captured by the labeled samples) can sometimes drive the model towards a more useful set of prototypes than the subsample means for 1-np designs. A second major approach to 1-np designs is to generate the prototypes with a competitive learning model such as the SOFM. Both of these approaches require relabeling the prototypes after they are found, and hence, are post-supervised methods.

One of the simplest approaches to multiple R-prototype generation when crisply labeled data are available is to run any clustering algorithm that generates prototypes on $X_{tr,i}$, the training data for the i-th class, one class at a time. This generates, say, $c_i$ prototypes for class i which are already labeled by $e_i$ for pre-supervised 1-nmp classifier design. Another possibility is to run any point prototype generating clustering algorithm on all of $X_{tr}$, ignoring the given physical labels, but with values for $c > \hat{c}$; i.e., that are greater than the given number of class labels. This introduces the necessity for cluster validation [1], but with labeled test data and a well defined performance objective, this is less of a problem than it is in true exploratory data analysis. This yields post-supervised R-prototypes for 1-nmp classifier design.

3. UNSUPERVISED COMPETITIVE LEARNING NETWORKS
The primary goal for unsupervised competitive learning (UCL) models is to portray the input data by a much smaller number of prototypes that are good representatives of structure in the data. Identification of clusters is implicit, but not active, in the pursuit of this goal by UCL algorithms. Moreover, prototypes produced this way may not be particularly effective as a basis for prototype classifiers, because prototypes that are good for classifier design are not necessarily the same (even in form) as those that are used for other purposes. For example, prototypes good for compression, transmission and reconstitution of images may be quite poor as representatives of classes for pixel classification in the same image.

3.1 The UCL, SOFM and VQ models
Sequential UCL models update estimates of one or more of the $\{v_i\}$ at each of n input events during pass t (one iteration is usually taken as one pass through X, without replacement). Upon presentation of an $x_k$ from X, the general form of the update equation is:

$v_{i,t} = v_{i,t-1} + \alpha_{ik,t}(x_k - v_{i,t-1}), \quad i = 1, \dots, c;\ t = 1, \dots, T$    (9)
In (9), $\{\alpha_{ik,t}\}$ is the learning rate distribution over the c prototypes for input $x_k$ during iterate t. When $x_k$ is submitted to a UCL network, distances are computed between it and each $v_j$. The output nodes "compete", one or more winner (i.e., minimum distance) nodes are found, and finally, the winning node and possibly other prototypes are then updated. There are at least four cases:

(i) Only $v_i$ is updated (winner take all; VQ, SHCM, e.g.)
(ii) Only one $v_j$ is updated (some vector takes all; ART1, e.g.)
(iii) Some $v_j$'s are updated (elite updates; SOFMs, e.g.)
(iv) Every $v_j$ is updated (all nodes are updated; GLVQ-F, e.g.)
The prototypes that get updated (the update neighborhood) depend on the model chosen, and the update neighborhood can be embedded in the definition of the learning rates for a particular model. A template that can be used for many UCL models is given in Table 1. Notice that the data are processed as if they are unlabeled, even if they are actually labeled.

Table 1
A general unsupervised competitive learning (UCL) point prototype generation algorithm

Store:  (un)labeled object data $X_{tr} = \{x_1, x_2, \dots, x_n\} \subset \Re^p$;
        $U_{tr} \in M_{hcn}$ = labels of vectors in $X_{tr}$
Pick:   # of nodes: $1 < c < n$
        initial prototypes: $V_0 \in \Re^{cp}$
        distance measure: $\|x_k - v_{i,t}\|_A$
        termination criterion $\varepsilon$ and max. # of iterations T
        termination norm: $E_t = \|V_t - V_{t-1}\|$
        ⊳ special choices for a particular model
Do:     t ← 1; $E_0$ = high value
        DO WHILE ($t \le T$ and $E_{t-1} > \varepsilon$)
          For k = 1 to n:  $x \in X$, $x_k \leftarrow x$, $X \leftarrow X - \{x_k\}$
            get distances $\{\|x_k - v_{i,t-1}\|_A : 1 \le i \le c\}$    (10a)
            get learning rates $\{\alpha_{ik,t} : 1 \le i \le c\}$  ⊳    (10b)
            $v_{i,t} = v_{i,t-1} + \alpha_{ik,t}(x_k - v_{i,t-1})$    (10c)
          Next k
          t ← t + 1
        END WHILE
        $V \leftarrow V_{t-1}$
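To make the template concrete, the following numpy sketch (our illustration, not code from this chapter) implements the loop of Table 1 with the Euclidean norm standing in for the A-induced distance of (10a); the learning-rate distribution of (10b) is passed in as a function, which is exactly where a particular model makes its special choices. All names and defaults are assumptions.

import numpy as np

def ucl(X, V0, rate_fn, T=100, eps=1e-3, seed=None):
    """Sketch of the generic UCL template of Table 1.

    X       : (n, p) array of (un)labeled object data
    V0      : (c, p) array of initial prototypes
    rate_fn : callable(distances, t) -> (c,) learning rates alpha_{ik,t}
    T       : maximum number of iterations (passes through X)
    eps     : termination threshold on E_t = ||V_t - V_{t-1}||_1
    """
    rng = np.random.default_rng(seed)
    V = np.array(V0, dtype=float)
    for t in range(1, T + 1):
        V_prev = V.copy()
        for k in rng.permutation(len(X)):            # one pass through X without replacement
            d = np.linalg.norm(X[k] - V, axis=1)     # (10a): distances to all c prototypes
            alpha = rate_fn(d, t)                    # (10b): model-specific learning rates
            V += alpha[:, None] * (X[k] - V)         # (10c): update rule (9)
        E_t = np.abs(V - V_prev).sum()               # termination norm
        if E_t <= eps:
            break
    return V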
UCL as we have written it can be used to implement a model due to Kohonen [11] called the self-organizing feature map (SOFM). In the SOFM each prototype $v_{j,t} \in \Re^p$ is associated with a display node, say $d_{j,t} \in \Re^q$. Usually q = 1 or 2, but the display "space" could have more dimensions, and it is not really a space, but a lattice L of integers (addresses) in $\Re^q$. The display lattice is in 1-1 correspondence with the indices associated with the prototype vectors in $\Re^p$. In the SOFM the winning vector $v_{i,t}$ that best matches an input vector $x_k$ is found. Next, an update neighborhood $\mathcal{N}(d_{i,t}) \subset L$ centered at $d_{i,t}$ is defined in L, and the winner display cell's neighbors are located in L. This means that you must define what a neighborhood is in L, and this in turn requires two concepts - shape and size. For linear arrays, the shape of $\mathcal{N}(d_{i,t})$ is usually adjacent indices to the left and right of L out to a specified radius; for two dimensional display sets, it could mean the 4-connected neighbors of $d_{i,t}$ diagonally or parallel to the axes of L, or the 8-connected neighbors of $d_{i,t}$ that surround it in L, etc. Along with the shape of $\mathcal{N}(d_{i,t})$ there must be a concept of size, usually defined through its "radius". Finally, $v_{i,t}$ and other prototype vectors in the inverse image $[\mathcal{N}(d_{i,t})]^{-1}$ of the spatial neighborhood $\mathcal{N}(d_{i,t})$ are updated using (10c).

Computation of the learning rates $\{\alpha_{ik,t}\}$ in (10b) is not specific in Table 1. Different models require choosing various parameters (⊳ special choices in the "pick" block of the table), and almost all of them compute quantities which are functions of the distances in (10a). Generally - but not very often - $\alpha$ is a function of i, k and t, but in some models it is fixed for all k's during each pass through X, and then we write $\alpha_{i,t}$. Most frequently, $\alpha$ is fixed for both i and k, depending only on t; in this case we write $\alpha_t$. Infrequently, only one pass is made through X, in which case we write $\alpha_{i,k}$. The sign of $\alpha$ determines whether the update in (9) moves $v_{i,t-1}$ towards (attraction) or away from $x_k$ (repulsion). Most competitive learning models use only positive learning rates, but there are algorithms that use negative learning rates for vectors whose display lattice associates are far from the winner cell in $\mathcal{N}(d_{i,t})$ (e.g., the so called "Mexican hat function" discussed in [11]). If manipulation of $\mathcal{N}(d_{i,t})$ is included in the definition of the $\{\alpha_{ik,t}\}$ in (10b), Table 1 specifies the basic SOFM design. The usual way to operate the SOFM is to decrease both the values of the learning rates and the size of the update neighborhood over time.
When the update neighborhood is reduced to the winner alone ($v_{i,t} = [\mathcal{N}(d_{i,t})]^{-1}$), SOFM becomes vector quantization (VQ), a particular instance of the UCL model. The relationship between the manipulation of V, $\{\alpha_{ik,t}\}$ and $\mathcal{N}(d_{i,t})$ can be a pretty difficult concept to grasp; please refer to Kohonen [11] for amplification.
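Continuing the sketch above, one way to fold the SOFM update neighborhood into the learning rates of (10b) is shown below for a linear (1-D) display lattice whose cell indices coincide with the prototype indices. The linear decay schedules and all constants are illustrative assumptions only; once the radius reaches zero the function returns a winner-take-all distribution, i.e., the VQ case.

import numpy as np

def sofm_rate_1d(distances, t, T=100, alpha0=0.4, r0=3):
    """SOFM-style learning rates for a linear display lattice (sketch).

    The winner is the prototype closest to the input; every prototype whose
    lattice index lies within the current radius of the winner's index gets
    the current learning rate, and all others get zero (cf. (10b) in Table 1).
    """
    c = len(distances)
    winner = int(np.argmin(distances))
    alpha_t = alpha0 * (T - t) / T          # shrinking learning rate (assumed schedule)
    r_t = int(round(r0 * (T - t) / T))      # shrinking neighborhood radius (assumed schedule)
    lattice = np.arange(c)                  # display lattice = prototype indices
    return np.where(np.abs(lattice - winner) <= r_t, alpha_t, 0.0)

Passing sofm_rate_1d to the ucl sketch of Table 1 realizes a very stripped-down SOFM within that template.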
3.2 Relabeling R-prototypes for post-supervised np classification

To use V for 1-np or 1-nmp post-supervised classifier design, the labels $U_{tr}$ that are supplied with $X_{tr}$ are used after the training phase is completed to assign a physical label to each algorithmically labeled prototype. A crisp partition of $X_{tr}$ is needed before V can be relabeled with $U_{tr}$. If V is obtained with point prototype clustering, you will have a c-partition U(V) related to V, and U(V) can be made crisp (for example, by maximum membership thresholding, [1]). Since UCL models neither need nor generate a partition of $X_{tr}$ during training, when V is obtained with a UCL model you need to create a crisp partition before V can be relabeled. In this case the usual way to generate a partition of $X_{tr}$ with the UCL output V is to create the crisp 1-np partition U(V) of X as follows:

$U_{ik}(V) = \begin{cases} 1; & \|x_k - v_i\| < \|x_k - v_j\| \ \ \forall\, j \ne i \\ 0; & \text{otherwise (resolve ties arbitrarily)} \end{cases}$    (11)
Equation (11) does not guarantee that U(V) partitions $X_{tr}$ into $\hat{c}$ clusters, i.e., that each of the $\hat{c}$ classes known to exist in $X_{tr}$ is represented by at least one of the c prototypes. When a prototype cannot acquire one of the physical labels during relabeling, we call it an inactive prototype. This happens more often than you might expect (Example 4.2 has such a case). The creation of U(V) from V as in (11) often leads to semantic confusion. UCL and its relatives are not clustering algorithms. For example, Yager and Filev's mountain clustering method [15], which does not produce clusters without using an equation such as (11) after termination of the training phase, is incorrectly called a clustering algorithm. More precisely, it is a prototype generation algorithm whose terminal prototypes can be used to find crisp clusters. (In fact, it is easy to realize the mountain clustering method as a special case of UCL for the right choices at ⊳ in Table 1.) This terminology - that UCL is a clustering method - is fairly pervasive, and is always confusing. Since UCL models are not explicitly designed to find good partitions, clusters built "after the fact" by approaches such as (11) may or may not be satisfactory in the sense of partitioning X for substructure. Forewarned, don't be surprised if a UCL model produces unsatisfactory clusters in unlabeled data - that's not its job. Now we are ready to relabel the c $v_i$'s. Again, there are various ways to do this, and the method we choose here is based on maximizing the expected probability of correct classification for the samples at hand. Specifically, we
let $n = |X_{tr}|$; $n_j = |X_j| = \sum_{k=1}^{n} U_{jk}(V)$ for $j = 1, \dots, c$, where $X_j$ is the crisp cluster in $X_{tr}$ determined by the j-th row of U(V); and $n_{ij}$ = # of points in $X_j$ that have physical label i. Then, for $j = 1, \dots, c$, we relabel $v_j$ as $v_s$ if

$v_s \leftarrow v_j \ \Leftrightarrow\ n_{sj} = \max_{1 \le i \le \hat{c}}\{n_{ij}\}$    (12)
This relabeling scheme differs from the relabeling method given in [1] which was used for some of the results given in [7, 14]. However, that method and (12) coincide whenever the labeled samples are equally probable, which is the case for Iris. There are of course many, many ways to use the labels in $U_{tr}$ in the context of pre-supervised or simply supervised classifier design. Models of this kind include CL models such as LVQ1-LVQ3 [11], k-nn models [4, 6], Bayesian classifiers [12], neural networks [24], etc.
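The two steps of this subsection - the crisp 1-np partition (11) and the relabeling rule (12) - can be sketched in a few lines of numpy (our illustration; the integer label encoding and all names are assumptions):

import numpy as np

def relabel_prototypes(X, labels, V):
    """Post-supervised relabeling of UCL prototypes, sketch of (11)-(12).

    X      : (n, p) training data
    labels : (n,) integer array of physical labels
    V      : (c, p) terminal prototypes from a UCL model
    Returns one class label per prototype; -1 marks an inactive prototype
    whose crisp cluster in (11) is empty.
    """
    c = len(V)
    # (11): crisp nearest-prototype partition of X induced by V
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)      # (n, c) distances
    nearest = d.argmin(axis=1)                                     # ties resolved arbitrarily
    proto_labels = np.full(c, -1)
    for j in range(c):
        members = labels[nearest == j]                             # crisp cluster X_j
        if members.size:                                           # (12): most frequent
            vals, counts = np.unique(members, return_counts=True)  # physical label wins
            proto_labels[j] = vals[counts.argmax()]
    return proto_labels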
3.3 Unsupervised Learning Vector Quantization (VQ)

The oldest model we are familiar with that can be properly identified as a UCL model is probably sequential hard c-means (SHCM, [16]). The update rule of SHCM is structurally identical to the more recent and popular VQ designs. The learning rate distribution for VQ that is usually used in equation (10b) is:

$\alpha_{jk,t}^{VQ} = \begin{cases} \alpha_t; & j = \arg\min_{1 \le s \le c}\{\|x_k - v_{s,t-1}\|\} \\ 0; & \text{otherwise} \end{cases}$    (13)
Equation (13) shows that this form of VQ is a winner take all strategy - that is, the update neighborhood $[\mathcal{N}(d_{i,t})]^{-1} = \{v_{i,t}\}$. In (13) the learning rate $\alpha_t$ is usually: (i) independent of i and k; (ii) initialized to some value in (0, 1); and (iii) decreased nonlinearly with t, often in proportion to (1/t). See [11] for conditions on the learning rates that guarantee the convergence of the VQ iterate sequence to a limit point.
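As a concrete instance of (13), the following learning-rate function (a sketch, with $\alpha_t$ assumed proportional to 1/t) plugs directly into the ucl template sketched in Section 3.1, so that ucl(X, V0, vq_rate) realizes VQ as the winner-take-all case of Table 1:

import numpy as np

def vq_rate(distances, t, alpha0=0.4):
    """Winner-take-all learning rates of (13): only the closest prototype
    is moved; the 1/t decay of alpha_t is one common choice, assumed here."""
    alpha = np.zeros(len(distances))
    alpha[int(np.argmin(distances))] = alpha0 / t
    return alpha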
3.4 Generalized learning vector quantization - Fuzzy

SHCM and VQ place all of their emphasis on the winning prototype. However, structural information due to input $x_k$ is carried by all c of the distances $\{\|x_k - v_i\|\}$. Many authors have suggested modifications to winner-take-all models that update some or all of the prototypes during each updating epoch. Here we discuss one (fuzzy) instance of this type - viz. generalized learning vector quantization - fuzzy (GLVQ-F, [17]), which is based on minimizing the functional (for m > 1)

$J_{GLVQ\text{-}F}(x_k; V) = \sum_{r=1}^{c} u_r \|x_k - v_r\|^2 = \sum_{r=1}^{c}\left[\sum_{j=1}^{c}\left(\frac{\|x_k - v_r\|^2}{\|x_k - v_j\|^2}\right)^{1/(m-1)}\right]^{-1}\|x_k - v_r\|^2$    (14)
In (14) the vector $u = (u_1, u_2, \dots, u_c)^T \in N_{fc}$ is a fuzzy label vector whose entries take the form of the necessary condition for updates in fuzzy c-means (FCM, [1]). The real number m > 1 in (14) is the same fuzziness parameter that appears in FCM. The value of m affects the quality of representation by the terminal prototypes and also controls the speed of termination of the GLVQ-F algorithm, which is just steepest descent applied to $J_{GLVQ\text{-}F}$. The GLVQ-F update rule for the prototypes V at iterate t in the special (and simple) case m = 2 uses the following learning rate distribution:

$\alpha_{ik,t}^{GLVQ\text{-}F(m=2)} = 2c\,\alpha_t\left[\sum_{r=1}^{c}\frac{\|x_k - v_{i,t-1}\|^2}{\|x_k - v_{r,t-1}\|^2}\right]^{-2} = 2c\,\alpha_t\,u_{ik,t-1}^2$    (15)
Equation (15) has the same singularity condition as FCM in its denominator. When no $\|x_k - v_{r,t-1}\| = 0$, (15) produces a learning rate for each value of i, so all c prototypes are updated at each input. As in (13), $\alpha_t$ in (15) - now one factor of the learning rates $\{\alpha_{ik,t}\}$ - is usually chosen proportional to 1/t, and the constant (2c) is absorbed in it without loss. Limiting properties of GLVQ-F are: (i) as m approaches infinity, all c prototypes receive equal updates and the $v_i$'s all converge to the grand mean of the data; whereas (ii) as m approaches 1 from above, only the winner is updated, and GLVQ-F reverts to VQ. Finally, we mention that the winning prototype in GLVQ-F for m = 2 receives the largest (fraction) of $\alpha_{ik,t}$ at iterate t; and that other prototypes receive a share that is inversely proportional to their distance from the input. GLVQ-F learning rates satisfy the additional constraint $\sum_{i=1}^{c} \alpha_{ik,t} \le 1$.
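For m = 2, the learning-rate distribution (15) can be sketched as follows (our illustration: the constant 2c is absorbed into $\alpha_t$, as the text suggests, and the FCM-type singularity is handled by giving a coincident prototype the whole membership):

import numpy as np

def glvqf_rate(distances, t, alpha0=0.4):
    """GLVQ-F (m = 2) learning rates, sketch of (15).

    Each prototype receives alpha_t times the square of its FCM-style
    membership for the current input, so every prototype moves a little
    and the winner moves the most; the rates sum to at most alpha_t.
    """
    d2 = distances ** 2
    alpha_t = alpha0 / t                       # assumed ~1/t decay, with 2c absorbed
    if np.any(d2 == 0):                        # singularity: input coincides with a prototype
        u = (d2 == 0).astype(float) / np.count_nonzero(d2 == 0)
    else:
        u = 1.0 / (d2 * np.sum(1.0 / d2))      # u_i = [ sum_r d_i^2 / d_r^2 ]^{-1}
    return alpha_t * u ** 2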
4. MULTIPLE PROTOTYPE CLASSIFICATION IN IRIS
This section combines parts of examples discussed in [7, 14] with some new results. Specifically, we have added material here on the use of the SOFM to get R-prototypes, and Hamamoto et al.'s [23] bootstrapping method for generating R-prototypes. The only data set we use is Anderson's well known 4-dimensional Iris data [18, 25]. Iris contains 50 labeled 4-vectors in each of three physical classes that correspond to three subspecies of Iris flowers. In our notation then, n = 150 and p = 4, so Iris is a very small and very overworked data set. So overworked, that there are now three or four versions of Iris floating around - but that's a story best told elsewhere [25]. Figure 3 scatterplots the third and fourth features of Iris and the 2D subsample means for each of the three classes. Class 1 is well separated from classes 2 and 3 in these two dimensions. Classes 2 and 3 show some overlap in the central area of the figure, and this region contains the vectors that are usually mislabeled by nearest prototype designs. The dashed boundaries indicate the physically labeled 2D cluster boundaries. Thus, $\hat{c} = 3$ in the terminology of equation (8).
Figure 3. The Iris data: {(feature 3, feature 4)} = Iris34

The resubstitution error rate for the pre-supervised 1-np design that uses the $\hat{c} = 3$ class means $\bar{V}$ as single prototypes is 11 errors in 150 submissions using the Euclidean norm $\delta_2$, i.e., $E_{D_{\bar{V},E,\delta_2}}$(Iris | Iris) = 7.33%. Post-supervised 1-np designs for Iris that seek $\hat{c} = 3$ R-prototypes report resubstitution error rates ranging from 5 to 20; variations are due to different methods producing different sets of V's, as well as different relabeling schemes.
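The pre-supervised benchmark just quoted - one labeled sample mean per class, scored by resubstitution - is simple enough to sketch directly (our illustration; loading the Iris data itself is left out, and all names are assumed):

import numpy as np

def class_means_1np(Xtr, ytr):
    """Pre-supervised 1-np design (sketch): one prototype per class,
    the labeled sample mean, returned together with its class label."""
    classes = np.unique(ytr)
    V = np.array([Xtr[ytr == cls].mean(axis=0) for cls in classes])
    return V, classes

def resubstitution_error(X, y, V, proto_labels):
    """E_D(X|X) for the crisp 1-np/1-nmp rules (7)-(8): the fraction of
    points whose nearest prototype carries the wrong physical label."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    pred = proto_labels[d.argmin(axis=1)]
    return np.mean(pred != y)

With the $\hat{c} = 3$ class means, this is the kind of computation behind the 7.33% resubstitution figure quoted above.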
4.1 1-nmp classifier designs for Iris based on VQ and GLVQ-F [14]

Initial prototypes for VQ and GLVQ-F in this first example were computed by sampling the diagonal of the line segment connecting the minimal and maximal points in the data:

Minimum of feature j: $m_j = \min_k\{x_{jk}\},\ j = 1, 2, \dots, p$    (16a)
Maximum of feature j: $M_j = \max_k\{x_{jk}\},\ j = 1, 2, \dots, p$    (16b)
$v_{i,0} = m + \left(\dfrac{i-1}{c-1}\right)(M - m),\ i = 1, 2, \dots, c$    (16c)
Thus, $v_{1,0} = m = (m_1, \dots, m_p)^T$; $v_{c,0} = M = (M_1, \dots, M_p)^T$; and the remaining (c−2) initial prototypes are uniformly distributed along the line segment connecting these two vectors. The Euclidean norm was used in (10a), and the number of prototypes generated ranged from $c = \hat{c} = 3$ to c = 30. The termination threshold $\varepsilon$ had one of the three values: $\varepsilon$ = 0.1, 0.01 and 0.001. The primary termination criterion that was compared to $\varepsilon$ was the 1-norm between successive estimates of the c prototypes, i.e., $E_t = \|V_t - V_{t-1}\|_1 = \sum_{j=1}^{c}\sum_{r=1}^{p}|v_{jr,t} - v_{jr,t-1}|$; if this failed to stop an algorithm, secondary termination occurred at the iterate limit T = 1000. The initial learning rate was $\alpha_0 = 0.4$ and $\alpha$ was decreased linearly, $\alpha_t = \alpha_0((T-t)/T)$, for both algorithms. For the results displayed, (15) was used for GLVQ-F. Samples were drawn randomly from $X_{tr}$ = Iris without replacement. One iteration corresponds to one pass through Iris. Each algorithm was run 5 times for each case discussed to see how different input sequences affected the terminal prototypes. For the less stringent termination criteria ($\varepsilon$ = 0.1 and 0.01), different terminal prototypes were sometimes obtained on different runs. For $\varepsilon$ = 0.001, this effect was nearly (but not always) eliminated. Most of the runs using $\varepsilon$ = 0.001 were completed in less than 300 passes through Iris.

Table 2 exhibits the terminal R-prototypes and post-supervised 1-nmp resubstitution error rates found by VQ and GLVQ-F for c = 6. Each of the three physical clusters is represented by two prototypes by both VQ and GLVQ-F, and the overall error rate produced by these two classifiers is 9.33% - 14 mistakes: not really much better than any post-supervised design at $\hat{c} = 3$, and not as good as the supervised sample means design, which commits 11 resubstitution mistakes.
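Equation (16) amounts to a one-liner; a minimal sketch, assuming c ≥ 2 and 0-based indexing:

import numpy as np

def diagonal_init(X, c):
    """Initial prototypes sampled along the diagonal of the data hyperbox,
    as in (16): v_{1,0} = m, v_{c,0} = M, and the rest spread uniformly
    on the line segment connecting them (sketch)."""
    m = X.min(axis=0)                      # (16a): feature-wise minima
    M = X.max(axis=0)                      # (16b): feature-wise maxima
    i = np.arange(c).reshape(-1, 1)
    return m + (i / (c - 1)) * (M - m)     # (16c)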
Table 2
Typical R-prototypes and 1-nmp resubstitution error rates for c = 6 prototypes (Iris data)

VQ labels   VQ prototypes              GLVQ-F labels   GLVQ-F (m=2) prototypes
1           4.69  3.12  1.39  0.20     1               4.75  3.15  1.43  0.20
1           5.23  3.65  1.50  0.28     1               5.24  3.69  1.50  0.27
2           5.52  2.61  3.90  1.20     2               5.60  2.65  4.04  1.24
2           6.21  2.84  4.75  1.57     2               6.18  2.87  4.73  1.56
3           6.53  3.06  5.49  2.18     3               6.54  3.05  5.47  2.11
3           7.47  3.12  6.31  2.02     3               7.44  3.07  6.27  2.05
            Error rate = 9.33 %                        Error rate = 9.33 %
The third and fourth features of the prototypes in Table 2 are plotted in Figure 4 against a background created by roughly estimating the convex hull of each physical class in these two dimensions by eye. Some of the prototypes are hard to see because their coordinates are very close in these two dimensions.
Figure 4. Terminal R-prototypes in Iris34 at c = 6

The VQ and GLVQ-F prototypes that seem to lie on the boundary between classes 2 and 3 in Figure 4 are highlighted by enclosing them with the
jagged star. These prototypes incur most of the misclassifications committed by the VQ and GLVQ-F 1-nmp classifiers. Table 3 lists the same information as Table 2 for typical runs made at c = 7. The error rate drops to 2.66% for both designs. The seventh R-prototype is not "added" to the previous six; rather, seven new R-prototypes are found by each algorithm. Note that VQ and GLVQ-F continue to use 2 prototypes for each of classes 1 and 2, and add a third representative for class 3 at c = 7. Neither VQ nor GLVQ-F prototypes represent the data efficiently, because only one prototype is needed to represent the 50 class 1 points with no resubstitution errors. This point is brought out in [14], where the so-called "dog-rabbit (DR)" prototype generation algorithm is used to achieve this more desirable representation of Iris Setosa.

Table 3
Typical R-prototypes and 1-nmp resubstitution error rates for c = 7 prototypes (Iris data)

VQ labels   VQ prototypes              GLVQ-F labels   GLVQ-F (m=2) prototypes
1           4.68  3.11  1.39  0.20     1               4.74  3.15  1.43  0.20
1           5.23  3.65  1.50  0.28     1               5.24  3.69  1.50  0.27
2           5.53  2.62  3.93  1.21     2               5.57  2.61  3.96  1.21
2           6.42  2.89  4.59  1.43     2               6.26  2.92  4.54  1.43
3           6.57  3.09  5.52  2.18     3               6.62  3.09  5.56  2.16
3           7.47  3.12  6.31  2.02     3               7.50  3.05  6.35  2.06
3           5.99  2.75  5.02  1.79     3               6.04  2.79  4.95  1.76
            Error rate = 2.66 %                        Error rate = 3.33 %
Figure 5 shows that the crucial "boundary" prototypes from VQ and GLVQ-F in the c = 6 case have roughly "divided" into two sets of new prototypes, enclosed again by the jagged star. These two pairs of prototypes have moved away from the apparent boundary of the lower left part of the convex hull of class 3. Both new pairs move further into the convex hulls of their respective classes, enabling their 1-nmp classifiers to achieve a lower error rate. When these two UCL algorithms are instructed to seek c = 8 R-prototypes, the resubstitution error rate for both post-supervised designs typically remains at 2.66%, and at c = 9 the results are quite similar. This suggests that the replacement of Iris with 7 or 8 prototypes found by either VQ or GLVQ-F results in a 1-nmp design that is quite superior (as measured by the resubstitution error rate) to the labeled 1-np design based on the $\hat{c} = 3$ subsample means $\bar{V}$. It is reasonable to assume that this trend would also hold for apparent error rates computed with test data reserved from Iris - i.e., that the 1-nmp designs would generalize better than classifiers based on 1-np designs - reasonable, but certainly not guaranteed.
Figure 5. Terminal R-prototypes in Iris34 at c = 7
4.2 A comparison of 16 nearest prototype classifiers [7, 14, new]

The Iris data can be fairly well represented in the sense of minimal post-supervised resubstitution error by 7 or 8 labeled R-prototypes. Increasing c past c = 7 has little effect on the best case results. How few R- or S-prototypes are needed by the 1-nmp rule to achieve good results? At the other extreme, how many are needed to achieve consistency - zero resubstitution errors? Any labeled data set X offers you a tradeoff between the variables $E_D(X|X)$ and c, and we think that discovering the best pair $(c, E_D(X|X))$ will always depend on the data set at hand - that is, no theory will emerge that will tell us how to do this generally. Figure 6 plots in $(c, E_D(X|X))$ space the best case outputs of 16 pre- and post-supervised classifiers based on different prototype generator models. (c-Means in Figure 6 represents the outputs of three algorithms: hard, fuzzy and possibilistic c-means.) Algorithms indicated in Figure 6 which have not been discussed here are: random search and genetic algorithms (RS, GA, [7]); dog-rabbit (DR, [14]); modified fuzzy c-means (MFCM, [19]); fuzzy learning vector quantization (FLVQ, [20, 21]); and the soft competition scheme (SCS, [21, 22]). Please refer to the indicated references for discussions of these models. We will discuss Hamamoto et al.'s [23] bootstrap (BS) algorithm shortly.
Figure 6. R- and S-prototype classifiers compared by $E_D$(Iris | Iris) for 16 methods, best case results

Figure 6 lets us speculate on some important points about the tradeoff between $E_D(X|X)$ and c. We emphasize that our conjectures are based on computations with just one very small data set. The best classifiers in Figure 6 are ones with coordinates closest to the origin. The vertical dashed line at
c = 3 (presumably) corresponds to the smallest possible number of "good" point prototypes for nearest prototype classification; this is the 1-np case, $c = \hat{c} = 3$. The best three (Pareto optimal) models lie along the heavy line at the bottom of the graph. Specifically, the best 1-np model is a set of 3 S-prototypes selected from Iris with either RS or GA that results in $E_{D,RS}$(Iris | Iris) = $E_{D,GA}$(Iris | Iris) = 2. At c = 5 we find a replacement model, the bootstrap method discussed in [23], which results in 1 error, $E_{D,BS}$(Iris | Iris) = 1. The third Pareto optimal model is also a replacement model. The modified Chang algorithm produces c = 11 R-prototypes that enable the corresponding 1-nmp classifier to achieve consistency ($E_{D,MCA}$(Iris | Iris) = 0). There are 4 models between the lower "best results" line and the new SOFM best cases line in Figure 6: DR, VQ and GLVQ-F, which are all local (sequentially updated) models, and the supervised, non-iteratively computed sample means $\bar{V}$. Four global (batch) models, HCM, FCM, PCM and FLVQ, and one local sequential (SCS) model lie above the SOFM curve. The general trend of the tradeoff between $E_D$(Iris | Iris) and c for c > 3 is mirrored by the line plotted through the 5 SOFM points in Figure 6: $E_D(X|X)$ decreases as c increases, and conversely. However, there is apparently a limit for each of the non-consistent models (which is data dependent of course) beyond which further increases in c do not generally yield a further decrease in the resubstitution error rate [14].

The result on the Iris data shown in Figure 6 using Hamamoto et al.'s pre-supervised bootstrapping technique [23] is new, so we give a short discussion of it. Four bootstrapping techniques are discussed in [23], all of which are pre-supervised, R-prototype extraction models. The specific method used to produce the R-prototypes that are listed in Table 4, which yield the point (5, 1) in Figure 6, is called method III in [23]. Here's how it works for the Iris data. For class i with crisply labeled samples $X_i$, $i = 1, \dots, \hat{c} = 3$: pick the number of R-prototypes desired, say $np_i$; we used (1, 2, 2) for classes (1, 2, 3). Randomly choose $np_i$ points in X, find the k-nearest neighbors to each of these $np_i$ points, average the k neighbors of each of the $np_i$ selected points, and replace each selected point with its k-nn centroid. We used k = 7 in our calculations, which were repeated 10,000 times; the results shown in Figure 6 and Table 4 are the best case results.
The set shown in Table 4 is a pretty nice result: a simple, fast, and unashamedly random way to get pre-supervised R-prototypes that provide a low resubstitution error rate. We have not found a combination of parameters yet that enables Hamamoto et al.'s method to achieve consistency - that is, that produces no resubstitution errors.
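A sketch of one draw of this bootstrap is given below (our reading of method III; we assume that both the random selection and the k-nearest-neighbour search are confined to the class being processed, and that the neighbour set may include the selected point itself - neither detail is completely pinned down by the description above):

import numpy as np

def bootstrap_prototypes(X, labels, n_per_class, k=7, seed=None):
    """One draw of a Hamamoto-style bootstrap (sketch).

    For each class, n_per_class[i] points are chosen at random and each is
    replaced by the centroid of its k nearest neighbours within the class;
    the result is one labeled set of R-prototypes.  n_per_class is ordered
    like np.unique(labels).
    """
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for cls, n_p in zip(np.unique(labels), n_per_class):
        Xi = X[labels == cls]
        for idx in rng.choice(len(Xi), size=n_p, replace=False):
            d = np.linalg.norm(Xi - Xi[idx], axis=1)
            knn = np.argsort(d)[:k]                 # k nearest neighbours (incl. the point itself)
            protos.append(Xi[knn].mean(axis=0))     # replace the point by its k-nn centroid
            proto_labels.append(cls)
    return np.array(protos), np.array(proto_labels)

Repeating the draw many times and keeping the prototype set with the lowest resubstitution error mimics the 10,000-trial, best-case protocol described above.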
Table 4
Five R-prototypes produced by Hamamoto et al.'s pre-supervised bootstrap model that give $E_D$(Iris | Iris) = 1

Labels   Bootstrap prototypes
1        5.16  3.64  1.47  0.24
2        6.59  2.99  4.53  1.43
2        6.03  3.09  4.54  1.54
3        6.76  3.16  5.67  2.23
3        5.84  2.81  4.99  1.89
For the new SOFM experiments reported here, we used the SOFM software available in MATLAB's Neural Networks Toolbox. The termination threshold T was set to 250. Unlike T in Table 1, this is an absolute termination criterion: MATLAB's SOFM terminates when t = 250 = T. The variable t in (17) counts the number of inputs drawn randomly, with replacement, from X (t is not the number of passes through X without replacement, as in Table 1). The SOFM learning rate distribution had the (somewhat bizarre) form

$\alpha_t = \frac{1}{5}\left(\frac{1}{t+4} + \frac{4}{1000\,T}\right); \quad \alpha_0 = 1; \quad t = 1, 2, \dots, T.$    (17)
T h e r a d i u s r t (and t h u s , t h e n u m b e r of p r o t o t y p e s u p d a t e d at a p a r t i c u l a r input) w a s d e c r e a s e d a c c o r d i n g to t h e f o r m u l a
T-10t } r t = m a x 1,M o T
,
t = 1, 2 . . . . . T,
(18)
We created linear or rectangular neighborhoods in the display lattice L by specifying the maximum possible initializing "radius" $M_0$ for each grid setup as shown in Table 5. The update neighborhood reduces to the winner after $t_{VQ}$ inputs (shown in the last column of Table 5), so in these experiments SOFM quickly reduced to VQ. For values of $r_t$ larger than 1, the cells whose prototype associates get updated depend on two things: whether the display grid is linear or rectangular, and how distances are computed on the grid. All of our experiments used the Manhattan (1-norm distances) grid pattern. We made 10 sets of runs using 10 random permutations of Iris. A number of different grid patterns were tried, dependent to some extent on the number of prototypes chosen. Thus, for c = 19, the only possible arrangement of the display lattice is linear, 1 x 19, but for c = 18, we can use a 1 x 18 linear array, as well as 3 x 6 or 2 x 9 rectangular arrays for the lattice L. We tried all possible grid combinations for c = 3 up to c = 30, but found that in many
instances quite a few of the cells (and their corresponding prototypes in 4-space) were inactive. R-prototypes associated with inactive cells will never be relabeled by (12), and thus represent "ghost" clusters.

Table 5
Best case post-supervised 1-nmp resubstitution error rates using SOFM R-prototypes from the Iris data

# active prototypes   # errors   # passive prototypes   grid setup   $M_0$   $t_{VQ}$
12                    3          0                      3 x 4        5       15
9                     4          0                      3 x 3        4       13
7                     5          1                      2 x 4        4       13
5                     11         0                      1 x 5        4       13
3                     12         0                      1 x 3        3       1
The best case (in 10 tries) results of running the SOFM as just described are shown as 5 points in Figure 6 and in the first two columns of Table 5. The grid configurations that gave the best case results are also listed in Table 5, and you can see there that in the 5 errors using 7 prototypes case, there were 8 prototypes available, but only 7 were active. You might think that T = 250 is just too few inputs to get stable prototypes. We did too! So, we set T = 250,000 for the 2 x 4 grid setup and ran it this long - twice - just to find out if this was the case. Nope! - still 5 errors.

Figure 7 differs from Figure 6 only in the meaning of the symbols attached to the coordinates of each of the 16 classifier designs. The implicit variable underlying the points in Figure 6 is R- vs. S-prototypes. In Figure 7, the corresponding underlying variable is pre- vs. post-supervision. It is hard not to be immediately struck by the fact that ALL of the best designs in Figure 7 are pre-supervised. This fits our intuitive expectations about the best use of labels pretty well. If you have labels for the training data, why not let them guide you to the prototypes? If it weren't for the fact that there are papers that illustrate the utility of post-supervised designs in a variety of settings, it would be hard to argue that they should receive much consideration as good ways to design classifiers. The authors of [1, Section 4.1] have this to say: "Nearest prototype classifiers are simple, effective and cool. However, you got to pay your dues if you want to use (them). That is, you have to generate the prototypes, and you know that don't come easy!" Yep. And too, they just don't want to waste all those prototype generators they love to invent! All kidding aside, Figure 7 is pretty compelling, but again, let's remind you that this figure is based on just the real Iris data. We will be testing all these models on other, more interesting data sets soon, to see if pre-supervision reliably beats the pants off post-supervision. (We are betting it will.)
Figure 7. Pre- and post-supervision compared by $E_D$(Iris | Iris) for 16 methods, best case results

5. CONCLUSIONS: SHORT FORM
Here are our conclusions, opinions and conjectures about nearest prototype classifier design. We remind you again that anything supported by computations on just one (puny) data set such as Iris must be viewed with, most generously, a very wary eye.
1. Pre-supervision finds more useful prototypes than post-supervision for nearest prototype classifier designs.
2. Sequential (local) methods may produce better R-prototypes for post-supervised 1-nmp classifier design than batch (global) methods do.
3. SOFM, an intermediate model (in terms of update strategies) between local models (such as VQ and SHCM) and global models (such as the c-means algorithms), yields intermediate post-supervised 1-nmp results.
4. There is a clear tradeoff between minimal c and minimal error rate, and this is a data dependent issue.
5. Multiple prototype classifiers will usually produce lower error rates than single prototype designs.

6. CONCLUSIONS: LONG FORM
Conclusion 1 is our strongest assertion (i.e., the one in which we have the highest confidence). Figure 7 supports this conclusion pretty well. If you let your eye include the pre-supervised sample means design at $\hat{c} = 3$ in Figure 7, it is easy to see that the best case pre-supervised designs can be used to partition this scatterplot into pre- vs. post-supervision designs (indicated on Figure 7 by the shaded area). From this it is fairly easy to assert that pre-supervised designs will almost always provide better resubstitution error rates than post-supervised ones will (almost, because this is just one data set, and not all possible methods have been tried). And from this you might be tempted - as we are - to conclude the same thing about generalization error rates ($E_D(X_{te}|X_{tr})$).

Conclusion 2 states that local replacement methods such as VQ, DR and GLVQ-F may produce better point R-prototypes for 1-nmp classifier design than batch methods such as the c-means models and FLVQ. The single exception to this in Figure 6 is the SCS model, which has sequential updating but batch-like performance. However, all of the best-case post-supervised 1-np designs in Figure 6 except SCS were batch models. We offer a conjecture about the efficacy of using sequential versus batch models to generate prototypes for the 1-nmp classifier. Sequential updating of the prototypes in UCL models such as VQ and GLVQ-F encourages "localized" prototypes which are able, when there is more than one per class, to position themselves better with respect to subclusters that may be present within the same class. This leads us to conjecture that batch algorithms are at their best when used to erect 1-np designs; and that sequential models are more effective for 1-nmp classifiers. Moreover, this probably holds for small versus large values of c. When c is small relative to n (e.g., c = 5 regions in an image with n = 65,536 pixel vectors), batch models probably produce more effective prototypes, because they take a global look at the data before deciding what to do; but if c = 256 (e.g., when using a UCL algorithm for, say, image compression), sequential updating may hold an advantage, as it localizes the update neighborhood, and that objective is more in line with sequential models.
Conclusion 3 is really just an observation based on the perceived envelopes suggested by Figure 6, which suggests that we might regard the SOFM as a transition model that lies "between" the local and global methods used to generate prototypes in this second example. This is intuitively plausible, since the SOFM starts out updating more than one (but usually not all c) R-prototypes, and often is terminated at its local realization (VQ). With the results in on only one very small and well behaved data set, the jury is still out, so we won't overlay a "three zones" plot (local, intermediate, global) on Figure 6. (But we want to.)

Conclusion 4 is intuitively plausible, and is supported by the graphs in Figures 6 and 7. There is clearly a tradeoff between the minimum error rate (desired) and the minimum number of point prototypes that will produce it. For the runs represented in Figure 6, we can get zero resubstitution error if we are willing to use 11 extracted or 12 selected prototypes. On the other hand, if we really want a minimal set of prototypes, the best we can do (for these runs anyway) is to use 3 selected prototypes (GA or RS) that commit two resubstitution errors. We doubt that any theory will support a methodical way to make this tradeoff. You will just have to make the runs with the data you have. Notice, however, that none of the consistent models are competitive learning designs; and all are pre-supervised.

Conclusion 5, like Conclusion 4, is your intuitive expectation, and we simply observe that our calculations bear out your intuition. Do 1-nmp designs really work better than 1-np schemes? Sure. There is little doubt that unless the data are really simple, with compact, well-separated clusters, regions of overlap such as those depicted in Figures 4 and 5 will cause difficulty (and errors) for single prototype designs.

We conclude with a remark about all nearest prototype classifier designs. Readers familiar with standard multilayered perceptrons (MLP) know that it's pretty easy to find a three layer MLP that is consistent on the Iris data, and this design generalizes pretty well too (see [1] for a detailed discussion and example). So, why bother with prototype classifiers at all? Well, there are times when it is nice to have the prototypes around to answer questions such as: "what are the characteristics of the prototypical Honduran figurado?" or, "how can I characterize the benchmark Mexican Maduro Torpedo?" This sort of question cannot usually be answered with the information available from network classifiers (although they can probably identify one of these cigars pretty well).
And perhaps most importantly, prototype classifiers give some of us lots of job security - always a significant consideration in Academe!

7. REFERENCES
1. J. C. Bezdek, J. M. Keller, R. Krishnapuram and N. R. Pal (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, Norwell, MA.
2. P. E. Hart (1968). The condensed nearest neighbor rule, IEEE Trans. IT, 14, 515-516.
3. L. I. Kuncheva and J. C. Bezdek (1999). An integrated framework for generalized nearest prototype classifier design, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(5), 437-457.
4. P. Devijver and J. Kittler (1982). Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ.
5. D. L. Wilson (1972). Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. SMC, 2, 408-421.
6. B. V. Dasarathy (1994). Minimal consistent subset (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. SMC, 24, 511-517.
7. L. I. Kuncheva and J. C. Bezdek (1998). Nearest prototype classification: clustering, genetic algorithms or random search?, IEEE Trans. Syst. Man and Cyberns., C28(1), 160-164.
8. J. Hartigan (1975). Clustering Algorithms, Wiley, NY.
9. A. Gersho and R. Gray (1992). Vector Quantization and Signal Compression, Kluwer, Boston.
10. G. A. Carpenter (1989). Neural network models for pattern recognition and associative memory, Neural Networks, 243-257.
11. T. Kohonen (1989). Self-Organization and Associative Memory, 3rd Edition, Springer-Verlag, Berlin.
12. D. Titterington, A. Smith and U. Makov (1985). Statistical Analysis of Finite Mixture Distributions, Wiley, NY.
13. C. L. Chang (1974). Finding prototypes for nearest neighbor classification, IEEE Trans. Computers, 23(11), 1179-1184.
14. J. C. Bezdek, T. Reichherzer, G. S. Lim and Y. A. Attikiouzel (1998). Multiple prototype classifier design, IEEE Trans. SMC, C28(1), 67-79.
15. R. R. Yager and D. P. Filev (1994). Approximate clustering by the mountain method, IEEE Trans. SMC, 24(8), 1279-1283.
16. J. MacQueen (1967). Some methods for classification and analysis of multivariate observations, Proc. Berkeley Symp. Math. Stat. and Prob., 1, eds. L. M. LeCam and J. Neyman, Univ. of California Press, Berkeley, 281-297.
17. N. B. Karayiannis, J. C. Bezdek, N. R. Pal, R. J. Hathaway and P. I. Pai (1996). Repairs to GLVQ: A new family of competitive learning schemes, IEEE Trans. Neural Networks, 7(5), 1062-1071.
18. E. Anderson (1935). The Irises of the Gaspe peninsula, Bull. Amer. Iris Soc., 59, 2-5.
19. J. Yen and C. W. Chang (1994). A multi-prototype fuzzy c-means algorithm, Proc. European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, 539-543.
20. N. B. Karayiannis and J. C. Bezdek (1997). An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering, IEEE Trans. Fuzzy Syst., 5(4), 622-628.
21. J. C. Bezdek and N. R. Pal (1995). Two soft relatives of learning vector quantization, Neural Networks, 8(5), 729-743.
22. E. Yair, K. Zeger and A. Gersho (1992). Competitive learning and soft competition for vector quantizer design, IEEE Trans. SP, 40(2), 294-309.
23. Y. Hamamoto, S. Uchimura and S. Tomita (1997). A bootstrap technique for nearest neighbor classifier design, IEEE Trans. PAMI, 19(1), 73-79.
24. J. Zurada (1992). Introduction to Artificial Neural Systems, West Publ. Co., St. Paul, MN.
25. J. C. Bezdek, J. M. Keller, R. Krishnapuram, L. I. Kuncheva and N. R. Pal (1999). Will the real Iris data please stand up?, IEEE Trans. Fuzzy Systems, 7(3), in press.
Self-Organizing Maps on non-euclidean Spaces

Helge Ritter
Faculty of Technology, Bielefeld University, D-33501 Bielefeld, Germany

We propose a new type of Self-Organizing Map (SOM) that is based on discretizations of curved, non-euclidean spaces. As an introductory example, we briefly discuss "spherical SOMs" on tesselations of the sphere for the display of directional data. We then describe the construction of "hyperbolic SOMs", using regular tesselations of the hyperbolic plane, which is a non-euclidean space characterized by constant negative gaussian curvature. The approach is motivated by the recent observation that the geometry of hyperbolic spaces possesses very favourable properties for the mapping of hierarchical data. We conclude with some initial simulation results illustrating some properties of the hyperbolic SOM and discuss a number of issues for future research.

1. INTRODUCTION

The Self-Organizing Map, as introduced by Kohonen more than a decade ago, has stimulated an enormous body of work in a broad range of applied and theoretical fields, including pattern recognition, brain theory, biological modeling, mathematics, signal processing, data mining and many more [8]. Much of this impressive success is owed to the combination of elegant simplicity in the SOM's algorithmic formulation with a high ability to produce useful answers for a wide variety of applied data processing tasks, and even to provide a good model of important aspects of structure formation processes in neural systems.

While the applications of the SOM are extremely wide-spread, the majority of uses still follow the original motivation of the SOM: to create dimension-reduced "feature maps" for various uses, most prominently perhaps for the purpose of data visualization. The suitability of the SOM for this task has been analyzed in great detail and linked to earlier approaches, such as PCA, multidimensional scaling, principal surfaces and the like. At the same time, this property of the SOM has stimulated the design of a number of related algorithms, including hierarchical variants of the SOM [10], the ASSOM [9] introducing the idea of mapping onto subspaces, the PSOM [21], the GTM [1] and the stochastic SSOM for mapping of proximity data [5], to name just a few. While these algorithms often differ from the SOM in many details, they share with it the idea of using a deformable template to translate data similarities into spatial relationships. While the way the template is implemented (e.g., as a discrete lattice or as a continuous, parametrized manifold) can affect important details of the generated mappings, a much larger role is played by its topological structure.
So far, the overwhelming majority of SOM approaches have taken it for granted to use (some subregion of) a flat space as their data model and, motivated by its convenience for visualization, have favored the (suitably discretized) euclidean plane as their chief "canvas" for the generated mappings, although a number of works has also considered higher-dimensional euclidean lattices to represent, e.g., the configuration manifold of a robot arm [16], and there have been a few approaches to use different lattice topologies, such as hierarchical trees [10] and subsequent work, or hypercubic lattices [2], to adapt the SOM to particular requirements in the data.

2. NON-EUCLIDEAN SOMS AND THEIR MOTIVATION
However, even if our thinking is deeply entrenched with the "obvious" structure of euclidean space, this is not necessarily honored by the multitude of data occurring around us. An obvious example is already given by directional data: in this case, the natural data model is the surface S² of a sphere; this is a non-euclidean space characterized by uniform positive curvature, and it is a well-known result that there is no singularity-free continuous one-to-one mapping of this space onto a whatsoever shaped piece of flat ℝ² (which regularly bothers us when we have to describe points on the sphere by a pair of coordinates, which always is a hidden attempt to identify S² with a part of ℝ²). Therefore, many directional data may much more naturally admit a projection onto a SOM with a spherical topology (which in addition to its special symmetry also does not suffer from a boundary). However, to the knowledge of the present author, such attempts have not appeared in the literature so far.¹ Part of the reason for this may be the somewhat more difficult handling of a "spherical lattice", and in Sec. 3 we briefly comment on this point.

Another interesting type of data are hierarchical structures. The capability of the 2d "euclidean" SOM to map even an amazing amount of the semantic features of language [17], which gave rise to the ambitious WEBSOM project [6], is remarkable, but it also raises the question to what extent a "flat" ℝ² (or a higher-dimensional euclidean ℝ^D) provides sufficient freedom to map the neighborhood of an item from a more complex "information space" (such as language) into spatial relationships in the chosen euclidean projection space. The obvious limiting factor is the rather restricted neighborhood that "fits" around a point on a 2d lattice. Recently, it has been observed that another type of non-euclidean spaces, the hyperbolic spaces that are characterized by uniform negative curvature, are very well suited to overcome this limitation [11], since their geometry is such that the size of a neighborhood around a point increases exponentially with its radius r (while in a D-dimensional euclidean space the growth follows the much slower power law r^D). This exponential scaling behavior fits very nicely with the scaling behavior within hierarchical, tree-like structures, where the number of items r steps away from the root grows as b^r, where b is the (average) branching factor. This interesting property of hyperbolic spaces has been exploited for creating novel displays of large hierarchical structures that are more accessible to visual inspection than in previous approaches [12].

¹ To avoid boundary effects, periodic boundary conditions for 2d-lattices are a well-known and frequently used technique; the resulting topology, however, is that of a torus, which is of zero curvature and topologically different from a sphere.
Therefore, it appears very promising to use hyperbolic spaces also in conjunction with the SOM, and in the present contribution we want to present some initial results along these lines. The resulting hyperbolic SOMs are based on a tesselation of the hyperbolic plane (or some higher-dimensional hyperbolic space), and their lattice neighborhood reflects the hyperbolic distance metric that is responsible for the non-intuitive properties of hyperbolic spaces. Together with the spherical SOM, they provide an important special case of "non-euclidean" SOMs, which are the generalization of the SOM to non-euclidean spaces. By making this generalization we obtain an important new degree of freedom for choosing our "template" such that it may better support the desired translation of data similarities into spatial neighborhood relationships. Specifically, for hyperbolic SOMs, we argue that the nature of the chosen "space template" is particularly adequate for mapping high-dimensional and hierarchical structures.

Since the notion of non-euclidean spaces may be unfamiliar to many readers, we below first make a few remarks on spherical SOMs which may help to form an intuitive idea of the general concept. Then we give a brief account of some basic properties of hyperbolic spaces that are exploited for hyperbolic SOMs, in particular regular tesselations of the hyperbolic plane. Finally, we provide some hints on the actual implementation of hyperbolic SOMs, after which we conclude with some computer experiments and a discussion.
3. SPHERICAL SOMS
The surface of the sphere provides us with the simplest model of a non-euclidean, curved space. It is curved, since no finite patch of it can be mapped onto ℝ² without distorting local distances. Like in the plane, the distance between two points on a sphere is given by the shortest path connecting the two points, but this path is now an arc, since it is confined to run within the surface. As a result, no two points are separated by more than the maximal distance of π, and the area in a circular neighborhood of radius r around a point grows slower than the usual πr² law in the plane. Since we are quite familiar with the sphere, we find these properties not too surprising, but we see the striking differences as compared to the geometry on flat ℝ².

Figure 1: Recursive triangle split.

By virtue of our geometric picture of the sphere we easily can imagine that the sphere will be very suitable to create mappings of data with an underlying directional structure. If we want to create a topology-preserving mapping for such data, we should provide a spherical SOM, i.e., we must use a node lattice whose topology mimics that of a sphere, with a corresponding distance measure between nodes. Unfortunately, there exist only five regular tesselations of the sphere (the five platonic bodies tetrahedron, octahedron, cube, pentagondodecahedron and icosahedron), which are all rather coarse. However, if we are willing to make a slight departure from perfect uniformity, we can recursively subdivide the triangles of the icosahedron (at each step subdividing each triangle into four smaller ones as indicated in Fig. 1 and "lifting" the newly introduced midpoints to the surface of the bounding sphere), leading to spherical triangle tesselations with arbitrarily fine resolution (these are only "almost" regular, since
the vertices of the original icosahedron and the newly created vertices are surrounded by 5 and 6 triangles, respectively). At first sight, it may seem cumbersome to evaluate the required distance function on the icosahedron-based tesselations. However, it is fairly easy to compute, along with the recursive tesselation, the radial unit normals n_r towards the newly created lattice nodes r. Then, the required distance is simply given by d_{r,r'} = arccos(n_r · n_r'), which we can plug into the standard SOM adaptation rule. For lack of space, we will not go into further details about spherical SOMs and instead focus now on their less intuitive counterparts, the hyperbolic SOMs.
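As an illustration of this construction, the following minimal sketch (added here for clarity, written in Python with NumPy as an assumed environment; it is not code from the original paper) performs one recursive subdivision step and evaluates the node distance d_{r,r'} = arccos(n_r · n_r') between two unit normals:

import numpy as np

def subdivide(triangles):
    """One refinement step: split each spherical triangle into four,
    lifting the new edge midpoints back onto the unit sphere."""
    refined = []
    for a, b, c in triangles:
        ab = (a + b) / np.linalg.norm(a + b)   # lifted midpoints
        bc = (b + c) / np.linalg.norm(b + c)
        ca = (c + a) / np.linalg.norm(c + a)
        refined += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return refined

def node_distance(n_r, n_rp):
    """Geodesic (arc) distance between two lattice nodes given their
    radial unit normals: d = arccos(n_r . n_r')."""
    return np.arccos(np.clip(np.dot(n_r, n_rp), -1.0, 1.0))

Starting from the twenty icosahedron triangles and calling subdivide repeatedly yields the "almost regular" spherical lattices described above.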
4. HYPERBOLIC SPACES
While the sphere is characterized by a constant positive gaussian curvature (i.e., the surface "bends" into the same perpendicular direction as one moves along two orthogonal directions), there are also surfaces that possess negative gaussian curvature. The counterpart of the sphere is achieved by requiring a constant negative curvature everywhere, and the resulting space is known as the hyperbolic plane H² (with analogous generalizations to higher dimensions) [3,20].² Unfortunately, it is no longer possible to embed H² in a (distance-preserving) way in our euclidean ℝ³, but we still can do so with sufficiently small "patches" of H². The embedded patches then resemble the shape of a "saddle", i.e., the negative curvature shows up as a local bending into opposite normal directions as we move on orthogonal lines along the patch. This may make it intuitively plausible that on such surfaces the area (and also the circumference) of a circular neighborhood around a point now grows faster than in the uncurved case. The geometry of H² is a standard topic in Riemannian geometry (see, e.g., [19,15]), and the relationships for the area A and the circumference C of a circle of radius r are given by

A = 4π sinh²(r/2)    (1)
C = 2π sinh(r)       (2)
These formulae exhibit the highly remarkable property that both quantities grow exponentially with the radius r (whereas in the limit r → 0 the curvature becomes insignificant and we recover the familiar laws for flat ℝ²). It is this property that was observed in [11] to make hyperbolic spaces extremely useful for accommodating hierarchical structures: their neighborhoods are in a sense "much larger" than in the non-curved euclidean (or in the even "smaller" positively curved) spaces (it also makes intuitively understandable why no finite-dimensional euclidean space allows an (isometric) embedding: it only has space for a power-law growth). To use this potential for the SOM, we must solve two problems: (i) we must find suitable discretization lattices on H² to which we can "attach" the SOM prototype vectors; (ii) after having constructed the SOM, we must somehow project the (hyperbolic!) lattice into "flat space" in order to be able to inspect the generated maps (since we believe that it will remain difficult to buy hyperbolic computer screens within the foreseeable future).

² We gloss over many mathematical details here; the mathematical treatment of curvature is based on the Riemann curvature tensor and requires the formalism of differential geometry, see, e.g., [15] for an accessible introduction.
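To give a concrete feeling for this growth, here is a short worked comparison (added for illustration) of the hyperbolic area (1) with the flat-space value πr² for a few radii:

r = 2 :  A = 4π sinh²(1)   ≈ 17.4       vs  πr² ≈ 12.6
r = 5 :  A = 4π sinh²(2.5) ≈ 460        vs  πr² ≈ 78.5
r = 10:  A = 4π sinh²(5)   ≈ 6.9 · 10⁴  vs  πr² ≈ 314

Already at r = 10 the hyperbolic circle offers more than two hundred times the area of its euclidean counterpart, which is exactly the extra "room" that the hyperbolic SOM lattice will exploit.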
Fortunately, both problems have already been solved more than a century ago by the mathematicians exploring the properties of non-euclidean spaces (the historic references are [4,7]; for newer accounts see, e.g., [13,3]), and the next two subsections summarize some of the results that we will need.

4.1. Projections of hyperbolic spaces

While the structure of hyperbolic spaces makes it impossible to find an isometric (i.e., distance-preserving) embedding into a euclidean space, there are several ways to construct very useful embeddings under weaker conditions (in the following, we will restrict the discussion to embeddings of the hyperbolic plane H²). The closest analogy to the sphere embedding occurs if we sacrifice the euclidean structure of the embedding space and instead endow it with a Minkowski metric [14]. In such a space, the squared distance d² between two points (x, y, u) and (x', y', u') is given by
d² = (x − x')² + (y − y')² − (u − u')²    (3)
i.e., it ceases to be positive definite. Still, this is a space with zero curvature, and its somewhat strange distance measure allows us to construct an isometric embedding of the hyperbolic plane, given by
x = sinh(ρ) cos(φ)    (4)
y = sinh(ρ) sin(φ)    (5)
u = cosh(ρ)           (6)
where (ρ, φ) are polar coordinates on the hyperbolic plane (note the close analogy of (4)-(6) with the formulas for the embedding of a sphere by means of spherical polar coordinates in ℝ³!). Under this embedding, the hyperbolic plane appears as the surface M swept out by rotating the curve u² = 1 + x² + y² about the u-axis.³

From this embedding, we can construct two further ones, the so-called Klein model and the Poincaré model (the latter will be used to visualize hyperbolic SOMs below). Both achieve a projection of the infinite H² into the unit disk, however, at the price of distorting distances. The Klein model is obtained by projecting the points of M onto the plane u = 1 along rays passing through the origin O (see Fig. 2). Obviously, this projects all points of M into the "flat" unit disk x² + y² < 1 of ℝ² (e.g., A ↦ B). The Poincaré model results if we add two further steps: first a perpendicular projection of the Klein model (e.g., a point B) onto the ("northern") surface of the unit sphere centered at the origin (point C), and then a stereographic projection of the "northern" hemisphere onto the unit circle about the origin in the ground plane u = 0 (point D).

Figure 2: Construction steps underlying Klein and Poincaré models of the hyperbolic space H².

It turns out that the resulting projection of H² has a number of pleasant properties, among them the preservation of angles and the mapping of shortest paths onto circular arcs belonging to circles that intersect the unit disk at right angles. Distances in the original H² are strongly distorted in its Poincaré (and also in the Klein) image (cf. Eq. (10)), however, in a rather useful way: the mapping exhibits a strong "fisheye" effect. The neighborhood of the H² origin is mapped almost faithfully (up to a linear shrinkage factor of 2), while more distant regions become increasingly "squeezed". Since asymptotically the radial distances and the circumference both grow according to the same exponential law, the squeezing is "conformal", i.e., (sufficiently small) shapes painted onto H² are not deformed, only their size shrinks with increasing distance from the origin. By translating the original H²,⁴ the fisheye-fovea can be moved to any other part of H², allowing one to selectively zoom in on interesting portions of a map painted on H² while still keeping a coarser view of its surrounding context.

³ The alert reader may notice the absence of the previously described local saddle structure; this is a consequence of the use of a Minkowski metric for the embedding space, which is not completely compatible with our "euclidean" expectations.

⁴ Readers with a physics background will observe that the translations of H² correspond to a subgroup of the Lorentz group for the chosen Minkowski embedding space; in the Poincaré model such mappings can be most elegantly described by identifying each point (x, y) with the complex number z = x + iy. Then an H² translation is described by the "disk automorphism" z' = (z − a)/(1 − ā z), where a denotes the point that becomes translated into the origin.
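The chain of projections just described is easy to follow numerically. The sketch below (an added illustration in Python/NumPy, not code from the original paper; function names are my own) maps a point given in hyperbolic polar coordinates (ρ, φ) through the Minkowski embedding (4)-(6), the Klein projection, and the Poincaré projection:

import numpy as np

def minkowski_embedding(rho, phi):
    """Embed a point of H2 into Minkowski space, Eqs. (4)-(6)."""
    x = np.sinh(rho) * np.cos(phi)
    y = np.sinh(rho) * np.sin(phi)
    u = np.cosh(rho)
    return x, y, u

def klein_point(rho, phi):
    """Central projection onto the plane u = 1 (Klein model)."""
    x, y, u = minkowski_embedding(rho, phi)
    return x / u, y / u                                  # lies in the unit disk

def poincare_point(rho, phi):
    """Lift the Klein point onto the northern unit hemisphere and project
    it stereographically onto the ground plane u = 0 (Poincare model)."""
    kx, ky = klein_point(rho, phi)
    kz = np.sqrt(max(0.0, 1.0 - kx * kx - ky * ky))      # hemisphere height
    return kx / (1.0 + kz), ky / (1.0 + kz)

For a point at hyperbolic distance ρ from the origin the Poincaré radius comes out as tanh(ρ/2), which is one way of seeing the "fisheye" squeezing described above.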
4.2. Tesselations of the hyperbolic plane

To complete the set-up for a hyperbolic SOM we still need an equivalent of a regular grid in the hyperbolic plane. Fortunately, the tesselation of hyperbolic spaces is a well developed subject in the mathematics literature, and we can borrow the following results: while the choices for tesselations with congruent polygons on the sphere and even in the plane such that each grid point is surrounded by the same number n of neighbors are severely limited (the only possible values for n being 3, 4, 5 on the sphere, and 3, 4, 6 in the plane), there is an infinite set of choices for the hyperbolic plane. In the following, we will restrict ourselves to lattices consisting of equilateral triangles only. In this case, there is for each n ≥ 7 a regular tesselation such that each vertex is surrounded by n congruent equilateral triangles. Fig. 3 shows two example tesselations (for the minimal value of n = 7 and for n = 10), using the Poincaré model for their visualization. While in Fig. 3 these tesselations appear non-uniform, this is only due to the fisheye effect of the Poincaré projection. In the original H², each triangle has the same size, and this can be checked by re-projecting any distant part of the tesselation into the center of the Poincaré disk, after which it looks identical (up to a possible rotation) to the center of Fig. 3.

Figure 3. Regular triangle tesselations of the hyperbolic plane, projected into the unit disk using the Poincaré mapping. The leftmost tesselation shows the case where the minimal number (n = 7) of equilateral triangles meet at each vertex and is best suited for the hyperbolic SOM, since tesselations for larger values of n (right: n = 10) lead to bigger triangles. In the Poincaré projection, only sides passing through the origin appear straight; all other sides appear as circular arcs, although in the original space all triangles are congruent.

As already evident from inspection of Fig. 3, the resulting lattices share with the underlying space H² the property that the number of vertices that fall within a given distance r of a node increases (asymptotically) exponentially with r. Thus, they correctly mimic the essential property of providing "more neighborhood space" to each node than tesselations derived from a euclidean space can do.

One way to generate these tesselations algorithmically is by repeated application of a suitable set of generators of their symmetry group to a (suitably sized, cf. below) "starting triangle". For the here considered case of a uniform mesh of equilateral triangles, a suitable set of generators are the hyperbolic rotations around (the "peripheral" subset of) the already generated triangle vertices. These are special cases of (orientation-preserving) isometries on H², and it can be shown that in the Poincaré model any such isometry can be described as a complex mapping

T(φ, ζ): z ↦ e^{iφ} (z − ζ)/(1 − ζ̄ z)    (7)
where φ is an angle and ζ ∈ ℂ is some complex number with |ζ| < 1. In particular, a hyperbolic rotation D(α, z₀) = T(φ, ζ) of angle α about a point z₀ can be decomposed into two translations and a rotation about the origin,

D(α, z₀) = T(0, −z₀) ∘ T(α, 0) ∘ T(0, z₀)    (8)
from which φ and ζ can be obtained analytically. For a triangle tesselation with vertex order n (i.e., α = 2π/n), we would apply these hyperbolic rotations to an equilateral starting triangle with equal corner angles α. In a hyperbolic space, specifying the angles of a triangle automatically fixes its size: unlike the euclidean case, the angles no longer sum up to π; instead, the sum is always less, and the "deficit" equals the area of the triangle. Therefore, the smallest triangles are obtained for the minimal value of n = 7, which leads to the "finest" tesselation and will be our choice for the simulations reported below. Since the resulting lattice structure is very different from a rectangular array, there is no straightforward indexing scheme, and an efficient implementation has to store pointers to the topological neighbors of each newly generated vertex in order to allow iterating the SOM update step over the lattice neighborhood of each winner node.
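The generator approach can be sketched in a few lines of code. The fragment below (an added illustration in Python, not the authors' implementation; the function names and the usage comment are hypothetical) implements the disk automorphism used in (7) and the composed hyperbolic rotation (8), from which new lattice vertices are produced:

import cmath
import math

def translate(z, a):
    """Disk automorphism moving the point a to the origin."""
    return (z - a) / (1 - a.conjugate() * z)

def rotate_about_origin(z, alpha):
    """A hyperbolic rotation about the disk center is an ordinary rotation."""
    return cmath.exp(1j * alpha) * z

def rotate_about(z, alpha, z0):
    """Hyperbolic rotation D(alpha, z0), decomposed as in Eq. (8):
    translate z0 to the origin, rotate, then translate back."""
    return translate(rotate_about_origin(translate(z, z0), alpha), -z0)

# Example (vertex order n = 7, alpha = 2*pi/7): a new lattice vertex is
# obtained by rotating an already known neighbour about a peripheral vertex;
# duplicates must be detected and merged while growing the lattice ring by ring.
# new_vertex = rotate_about(known_neighbour, 2 * math.pi / 7, peripheral_vertex)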
5. HYPERBOLIC SOM ALGORITHM
We have now all ingredients required for a "hyperbolic SOM". It employs a (finite) patch of a hyperbolic lattice, e.g., the regular triangle tesselation with vertex order n = 7. Using the construction scheme sketched in the previous section, we can organize the nodes of such a lattice as "rings" around an origin node (i.e., it is simplest to build approximately "circular" lattices). The number of nodes of such a lattice grows very rapidly (asymptotically exponentially) with the chosen lattice radius R (its number of rings). For instance, for n = 7, Table 1 shows the total number N_R of nodes of the resulting regular hyperbolic lattices with different radii ranging from R = 1 to R = 10. Each lattice node r carries a prototype vector w_r ∈ ℝ^D from some D-dimensional feature space (if we wish to make any non-standard assumptions about the metric structure of this space, we would build this into the distance metric that is used for determining the best-match node). The SOM is then formed in the usual way, e.g., in on-line mode by repeatedly determining the winner node s and adjusting all nodes r ∈ N(s, t) in a radial lattice neighborhood N(s, t) around s according to the familiar rule
Δw_r = η h_rs (x − w_r)    (9)

where x denotes the current input vector and h_rs = exp(−d²(r, s)/2σ²). However, since we now work on a hyperbolic lattice, we have to determine both the neighborhood N(s, t) and the (squared) node distance d²(r, s) according to the natural metric that is inherited by the hyperbolic lattice.
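In code, the adaptation step (9) looks exactly like a standard on-line SOM step once the pairwise lattice distances have been precomputed from the hyperbolic metric (see Eq. (10) below). The following sketch is an added illustration in Python/NumPy, not the authors' implementation; the function name and array layout are assumptions:

import numpy as np

def hsom_step(weights, lattice_dist, x, eta, sigma):
    """One on-line update, Eq. (9): find the winner s, then pull every node r
    towards the input x with strength h_rs = exp(-d^2(r, s) / (2 sigma^2)).

    weights      : (N, D) array of prototype vectors w_r
    lattice_dist : (N, N) array of hyperbolic node distances d(r, s)
    """
    s = int(np.argmin(np.linalg.norm(weights - x, axis=1)))    # winner node
    h = np.exp(-lattice_dist[:, s] ** 2 / (2.0 * sigma ** 2))  # neighbourhood
    weights += eta * h[:, None] * (x - weights)
    return s

In practice the update is restricted to the neighborhood N(s, t), i.e. to nodes within a small multiple of σ, rather than touching all N nodes.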
R      1    2    3    4    5     6     7     8      9      10
N_R    8   29   85  232  617  1625  4264  11173  29261  76616
Table 1: node numbers N_R of hyperbolic triangle lattices with vertex order 7 for different numbers R of "node rings" around the origin.

The simplest way to do this is to keep with each node r a complex number z_r to identify its position in the hyperbolic space (it is convenient to use the Poincaré model; with correspondingly changed distance formulas any of the other models would do as well). The node distance is then given (using the Poincaré model, see, e.g., [19]) as
d(r, s) = 2 arctanh | (z_r − z_s) / (1 − z̄_s z_r) |    (10)
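Expressed in code, Eq. (10) is essentially a one-liner on complex node coordinates (again an added sketch in Python, assuming the positions z_r produced by the lattice construction above):

import math

def poincare_distance(z_r, z_s):
    """Hyperbolic distance between two lattice nodes, Eq. (10);
    z_r and z_s are complex coordinates inside the unit disk."""
    return 2.0 * math.atanh(abs(z_r - z_s) / abs(1.0 - z_s.conjugate() * z_r))

Precomputing this quantity for all node pairs yields the distance matrix used in the earlier update sketch.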
The neighborhood N(s, t) can be defined as the subset of nodes within a certain graph distance (which is chosen as a small multiple of the neighborhood radius σ) around s. In our current implementation, we use a recursive node traversal scheme which is more
Figure 4. Unfolding of a 2d hyperbolic SOM lattice in a uniform data density inside the 3-dimensional unit sphere. (a) (left): early ordering phase; (b) (center): emergence of local "saddle structure"; (c) (right): final configuration: the node-rich periphery of the hyperbolic lattice allows the realization of a highly contorted "space-filling" surface to approximate the 3d topology of the spherical data distribution.
costly than a scheme optimized for a particular lattice type; however, it still is reasonably fast and has the virtue that it can be used for arbitrary lattice topologies.

Like the standard SOM, the hyperbolic SOM can also become trapped in topological defects. Therefore, it is also important here to control the neighborhood radius σ(t) from an initially large to a final small value (we use a simple exponential decrease, σ(t) = σ(0) (σ(T_max)/σ(0))^(t/T_max)). In addition, we found two measures very effective to support the ordering process further: (i) the "conscience mechanism" introduced by DeSieno [18], which equalizes the winning rates of the nodes by adding, for each node, to the computation of its distance to the input a "distance penalty" which is raised whenever a node becomes winner and which is lowered otherwise; and (ii) "radial learning", which initially lets only the subset of nodes closer than a distance R to the origin participate in the winner determination, where R gradually increases from 0 to the final, maximal value during training. While this is also useful for the euclidean SOM, it is particularly effective in the hyperbolic case, since it ensures a proper centering on the data distribution before the node-rich periphery with its many degrees of freedom gets involved.
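The two training heuristics just mentioned are easy to add to the update sketch given earlier. The fragment below is illustrative only; the penalty constants and their update rule are assumptions in the spirit of the description above and of [18], not values from the paper:

import numpy as np

def sigma_schedule(t, t_max, sigma_start, sigma_end):
    """Exponential decrease of the neighbourhood radius sigma(t)."""
    return sigma_start * (sigma_end / sigma_start) ** (t / t_max)

def biased_winner(weights, x, penalty):
    """Conscience mechanism: the winner minimizes distance plus a per-node
    penalty that is raised for the current winner and lowered otherwise."""
    d = np.linalg.norm(weights - x, axis=1) + penalty
    s = int(np.argmin(d))
    penalty *= 0.999          # slowly lower the penalty of all nodes (assumed rate)
    penalty[s] += 0.05        # raise it for the current winner (assumed amount)
    return s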
6. EXPERIMENTS

In the following, we give some examples of hyperbolic SOMs ("HSOMs") generated on simple data sets and compare the resulting maps with those of a standard 2d SOM. As a first data set we use a uniform point distribution in the unit sphere of ℝ³. This is not yet a high-dimensional data set, but in three dimensions we still can follow the unfolding of the map using perspective drawings of the sequence of "virtual nets" in their 3d embedding space. Figs. 4a-c show three development stages of this process for a HSOM with R = 4 (232 nodes). The pictures give some insight into how the HSOM takes advantage of its "larger
Figure 5. Projection of 5d-sphere data on a euclidean SOM (a) (left) and on a hyperbolic SOM (b) (center). The right diagram (c) shows that the hyperbolic lattice allows one to obtain a detailed, magnified view of the border region while still keeping the central region (now upper right) well visible (compare sizes of edges emanating from the '1'-node). Note that circular arcs in the hyperbolic image (cf. Fig. 3) have been simplified to straight lines.
neighborhood" to map tile 3d space onto its 2d hyperbolic lattice. Initially, with a(t) still large (Fig. 4a, left), the map contracts and centers itself in the data distribution in the usual way, as familiar from the standard SOM. Then, as a(t) gets lower (Fig. 4b, center), the map starts to "feel" the dimension conflict (which here still is rather mild and only 3d vs. 2d). This is accompanied by developing protrusions to better fill the higherdimensional volume with the 2d lattice. Now the advantage of the hyperbolic 2d-lattice over its euclidean cousin becomes apparent: due to its hyperbolic nature the growth of the number of its nodes with radial distance r can much better cope with the required r 3 law of the data distribution (and, due to the exponential growth law, this would remain so for data of any dimensionality D)! This makes its periphery much more flexible to fold into the missing space dimensions (of course, this still can be done only by way of making the boundary produce a "meandering" space filling curve). The final result is visible in Fig 4c (right), which is reminiscent of a 2d-saddle whose (ld-) periphery has become deformed to follow a space filling curve in the (2d) surface of a sphere. This behavior becomes even more pronounced as we proceed towards higher data space dimensionalities. We then can no longer visualize the corresponding "virtual" nets, but we still can show the distribution of the data points on the 2d SOM lattice. Fig.5 compares the results for the same experiment, but now with points distributed with uniform density in tlle 5-dimensional unit ball IIr-]l < 1 of/R 5. To compare the result for the hyperbolic SOM with that of a "flat" euclidean SOM, we need some comparable displays of the underlying, very different lattice topologies. For the hyperbolic SOM, we use the Poincar~ projection into the unit disk (as already familiar from Fig.3). To create a roughly comparable display for the "flat" SOM (for which we used a hexagonal lattice with size adjusted for about the same number of nodes as were used for the HSOM), we performed a central projection of its lattice hexagon onto a hemisphere that we then view from above. To visualize at least one dimension of the underlying data distribution, we use the radial coordinate
Figure 6. Projections obtained for a synthetic, hierarchically clustered data set in 5d space. (a) (left): euclidean SOM; (b) (center) and (c) (right): two different views of the hyperbolic SOM (again with connection arcs simplified to straight lines).
and label in both images each node by the (scaled) radial distance ‖w_r‖ of its prototype vector w_r (which, for a continuous distribution, represents the best-match data point) from the origin. Fig. 5a (left) depicts the resulting map for the standard SOM, while the central diagram shows the HSOM. The difference is striking: while the euclidean SOM has primarily mapped the "superficial" structure of the sphere (all radius values are similar and close to the maximal value), the HSOM gives a much clearer impression of the radial structure, arranging spherical shells of increasing radius into concentric rings around its origin. In Fig. 5c (right) we have moved the "focus" into the border region of the hyperbolic SOM to illustrate the exponential "scale compression" of the hyperbolic lattice: we can now see a magnified view of a rather small part of the total circumference of the lattice, but still the central portion is sufficiently near (only a few links in the hyperbolic lattice) to be kept recognizably within view in the upper right portion of the display.

Figs. 6a-c illustrate the capability to display the structure of hierarchically clustered data. In this case, we have created an artificial data set with gaussian clusters centered at the node points of a "random tree" in ℝ⁵. Each tree node had a fixed branching factor of M = 5, and the directions of the branches were chosen randomly, while the branching lengths and the cluster diameters were diminished by a constant factor of f = 0.5. The HSOM training was performed by randomly sampling the nodes of this tree (we used L = 3 levels, yielding 31 clusters, identified by labels 1, 11 ... 15, 111 ... 155). Again we compare a standard SOM (hexagonal lattice, leftmost picture (a)) with a hyperbolic SOM (middle and right pictures (b) and (c)), using the projections explained previously. Apparently, the periphery of the HSOM (middle) provides a somewhat better resolution of the cluster structure of the data. In addition, parts of the tree that are closer to the root (the more "general items") tend to become mapped centrally (although they only occupy relatively few nodes, since they are comparably small in number), with the branches of the tree arranged towards the periphery. Fig. 6c (right) finally provides a zoomed view of a subregion in the periphery of the HSOM. We clearly see that the globally visible
arrangement of the hierarchical structure is continued here at a smaller scale, providing a spatial layout of the "13"- and the "15"-subtrees adjacent to the '1'-root node.
7. DISCUSSION

Much of the enormous success of the SOM is based on its remarkable ability to translate high-dimensional similarity relationships among data items that in their raw form may be rather inaccessible to us into a format that we are highly familiar with: into spatial neighborhoods within a low-dimensional space. So far, we have been accustomed to use only euclidean spaces for that purpose. But mathematicians have prepared everything that we need to extend the realm of SOMs into non-euclidean spaces. With this contribution, we hope to have demonstrated that this may offer more than just an exotic exercise: besides the more obvious uses of spherical SOMs, the newly proposed hyperbolic SOMs are characterized by very interesting scaling properties of their lattice neighborhood that make them promising candidates to better capture the structure of very high-dimensional or hierarchical data. We have shown how regular tesselations of the hyperbolic plane H² can be used to construct a SOM that projects onto discretized H², and we have presented initial simulation results demonstrating that this "hyperbolic SOM" shares with the standard SOM the robust capability to unfold into an ordered structure. We have compared its properties with the standard SOM and found that the exponential node growth towards its periphery favors the formation of mappings that are structured differently than for the euclidean SOM. Our initial simulations indicate that the faster increasing hyperbolic neighborhood can indeed facilitate the construction of space-filling map configurations that underlie the dimension reduction in SOMs, and that hyperbolic SOMs can visualize hierarchical data, benefitting at the same time from techniques introduced in previous work on the exploitation of hyperbolic spaces for creating good displays of tree-like structures.

So far, this is only a very modest beginning, and many challenging questions for hyperbolic SOMs are still open. The next steps will require studying the properties of the HSOM for larger, real world data sets. Here, interesting candidates should be data resulting from various branching processes (which includes textual data), since they are usually described by an exponentially diverging flow in their state space, which might fit better onto a hyperbolic space than onto the euclidean plane. From a mathematical point of view, it will be necessary to revisit many of the questions that have been addressed for the euclidean SOM in the past, in particular the computation of density laws. An important point in this regard is the study of edge effects, which are much stronger for hyperbolic SOMs, since they are by construction such that from any point there is always a (logarithmically) short path leading to the border. Of course, there will also be the ordering issue, but there is little reason to assume simpler answers when the underlying lattice is an H² tesselation (this might be different, however, for the spherical SOMs; in this case, the closedness of the surface may provide a strong ordering drive). Finally, the introduction of a hyperbolic map space for SOMs suggests similar generalizations for related types of mapping algorithms, an issue which we intend to address in future work.
REFERENCES

1. Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
2. D. S. Bradburn. Reducing transmission error effects using a self-organizing network. In Proc. of the IJCNN89, volume II, pages 531-538, San Diego, CA, 1989.
3. H. S. M. Coxeter. Non-Euclidean Geometry. Univ. of Toronto Press, Toronto, 1957.
4. R. Fricke and F. Klein. Vorlesungen über die Theorie der automorphen Funktionen, volume 1. Teubner, Leipzig, 1897. Reprinted by Johnson Reprint, New York, 1965.
5. Thore Graepel and Klaus Obermayer. A stochastic self-organizing map for proximity data. Neural Computation, 11(1):139-155, 1999.
6. Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen. WEBSOM - self-organizing maps of document collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pages 310-315. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997.
7. F. Klein and R. Fricke. Vorlesungen über die Theorie der elliptischen Modulfunktionen. Teubner, Leipzig, 1890. Reprinted by Johnson Reprint, New York, 1965.
8. T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences. Springer, second edition, 1997.
9. Teuvo Kohonen, Samuel Kaski, and Harri Lappalainen. Self-organized formation of various invariant-feature filters in the adaptive-subspace SOM. Neural Computation, 9(6):1321-1344, 1997.
10. Pasi Koikkalainen and Erkki Oja. Self-organizing hierarchical feature maps. In Proc. of the IJCNN 1990, volume II, pages 279-285, 1990.
11. John Lamping and Ramana Rao. Laying out and visualizing large trees using a hyperbolic space. In Proceedings of UIST'94, pages 13-14, 1994.
12. John Lamping, Ramana Rao, and Peter Pirolli. A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, Denver, May 1995. ACM.
13. W. Magnus. Noneuclidean Tesselations and Their Groups. Academic Press, 1974.
14. Charles W. Misner, J. A. Wheeler, and Kip S. Thorne. Gravitation. Freeman, 1973.
15. Frank Morgan. Riemannian Geometry: A Beginner's Guide. Jones and Bartlett Publishers, Boston, London, 1993.
16. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-organizing Maps. Addison-Wesley Verlag, 1992.
17. Helge Ritter and Teuvo Kohonen. Self-organizing semantic maps. Biol. Cybern., 61:241-254, 1989.
18. D. DeSieno. Adding a conscience to competitive learning. In Proc. of the ICNN88, volume I, pages 117-124, San Diego, CA, 1988.
19. Karl Strubecker. Differentialgeometrie III: Theorie der Flächenkrümmung. Walter de Gruyter & Co, Berlin, 1969.
20. J. A. Thorpe. Elementary Topics in Differential Geometry. Springer-Verlag, New York, Heidelberg, Berlin, 1979.
21. Jörg Walter and Helge Ritter. Rapid learning with parametrized self-organizing maps. Neurocomputing, pages 131-153, 1996.
Self-Organising Maps for Pattern Recognition

N. M. Allinson and H. Yin
Department of Electrical Engineering and Electronics, UMIST, PO Box 88, Manchester M60 1QD, United Kingdom

Self-organisation is a fundamental pattern recognition process, in which intrinsic inter- and intra-pattern relationships are learnt without the presence of a potentially biased external influence. In this paper, we present and review the statistical and optimal properties of this important learning model from some pattern recognition aspects through a number of examples and extended applications. The SOM is optimal for vector quantisation. Its topographical ordering provides a mapping with enhanced fault and noise tolerant abilities. It is also applicable to many other applications such as dimensionality reduction and classification. A Bayesian SOM has been devised to extend the mapping as an optimal estimator for probability density functions. This enhanced SOM further explains the functional role of the neighbourhood in the mapping process and reveals that the neighbourhood function form imposes an underlying distribution prototype on the neurons.

1. INTRODUCTION

Kohonen's self-organising map (SOM) is an abstract mathematical model of topographic mapping from the (visual) sensory to the cerebral cortex [1, 2]. Modelling and analysing the mapping are important to understanding how the brain perceives, encodes, recognises, and processes the patterns it receives and thus, if somewhat indirectly, is beneficial to machine-based pattern recognition. The algorithm has found a large number of applications in many fields such as pattern classification, image compression, face recognition, time-series modelling and prediction, and data mining [3]. There has been a large effort in studying this learning model and its properties (cf. [2]). Although the algorithm has a simple computational form, formal analysis of it and the associated learning processes and mathematical properties is not easily tractable. Some important issues still remain unanswered.

Self-organisation, or more specifically the ordering process, has been studied in some depth, however predominantly in the one-dimensional case. The conclusions have been difficult, if not impossible, to extend to higher dimensions. We have considered the general proof of convergence as two separate issues - ordering and convergence. We have extended and applied some statistical concepts to the analysis of the SOM, and obtained a comprehensive proof of quantitative convergence [4]. The analysis shows that the SOM is potentially optimal for vector quantisation. It also shows the diminishing quantitative effect of initial states, and the asymptotic and diminishing Gaussian distribution of the features. The ordering process has been studied using the channel-noise-resilient coding principle and using Luttrell's hierarchical VQ theory [5]. Based on these
principles only can a meaningful comprehension of ordering be defined. From our understanding, the meaning of ordering should be in the sense of fault-tolerance, and fast and robust pattern association.

We have also applied the SOM algorithm, or combined it with other methods, to several applications including face recognition, financial market prediction, image segmentation, image and video compression, and density modelling. Some of them are reviewed in the following sections.

In pattern recognition, an accurate estimate of probability density is often desirable. Single parametric densities have a very limited capability in representing real-world problems. A mixture distribution has been found to be a powerful approach [6, 7]. It provides a trade-off between limited parametric approaches and computationally intensive nonparametric ones. The expectation-maximisation (EM) method can be used to solve the maximum likelihood (ML) solution for mixture parameters (e.g. [8]). Most EM-based algorithms are pseudo-likelihood based, which often over-emphasises individual data points, and so can easily lead to overfitting. In addition, due to its deterministic gradient descent and "batch" operation nature, the EM algorithm has a high possibility of being trapped in local optima and being slow to converge [9-11]. Some modifications such as the stochastic EM algorithm, maximum penalised likelihood method, and ensemble averaging method have since been proposed to overcome these problems [10, 12]. We have recently proposed a Bayesian SOM for solving Gaussian mixture problems, and have shown additional advantages (e.g. relaxing the local optimum problem and providing faster convergence) over the EM algorithm [11, 13]. Its generalised form [14] shows that the algorithm combines naturally the criterion of minimising the Kullback-Leibler information metric [9] and the stochastic approximation method [10] into a SOM learning form. As the Kullback-Leibler metric is the expectation of the log likelihood, it has a more natural generalisation ability than the raw sample likelihood criterion.

2. PATTERN DIMENSIONALITY
The goal of pattern recognition is to classify objects into a number of categories by either supervised or self-organised schemes [17]. In supervised learning, an objective function can often be well defined, and learning is to minimise or maximise the function by some means, whilst in a self-organising system such an objective function may not be available or is not apparent. The learning is a self-ruled pattern organisation process whose aim is to discover some intrinsic inter- and/or intra-pattern relationships. Patterns are of any possible forms and of any possible dimensions. However, as physical brains are limited in their connective architectures, the recognition process often involves a dimensional reduction or transformation step, in which high dimensional patterns are captured and represented by low dimensional features. This step has two main aims - to reduce the representation dimensionality with minimum information loss, or to distinguish ambiguous patterns with greater ease.

The SOM is a dimensional reduction mapping in the sense that it quantises and re-represents a high dimensional space on a discrete map of a low dimension. It uses a low dimensional grid to capture the structure of a high dimensional space. The map is randomised initially, as this implies that no prior will be imposed on the mapping. The unique use of the neighbourhood function, which simulates the lateral excitation and inhibition amongst post-synaptic cells in the cortex, makes the mapping topographical, i.e., like patterns will be mapped onto nearby nodes of the map and vice versa. A simple example of this
dimensionality reduction is illustrated in Figure 1, where the input samples are randomly positioned patches from the image of concentric circles. The resulting ordered map displays clear local ordering and some global ordering, but we are attempting to order data with three degrees of freedom on a two-dimensional map. If the input image is changed to a checkerboard, then the result (Figure 2) shows full global ordering, as the inherent input and output dimensions are the same.
Figure 1. (a) Training image of concentric circles, (b) Trained 16x16 SOM.
3. TEXTURE SEGMENTATION
The SOM has been used in a hierarchical structure, together with a Markov random field (MRF) model, for the unsupervised segmentation of textured images [18]. The MRF is used as a measure of homogeneous texture features from a randomly placed local region window on the image. Such features are noisy and poorly known. They are input to a first SOM layer, which learns to classify and filter them. The second, local-voting layer - a simplified SOM - produces an estimate of the texture type or label for the region. The hierarchical network learns to progressively estimate the texture models, and to classify the various textured regions
of similar types. Random positioning of the local window at each iteration ensures that the consecutive inputs to the SOMs are uncorrelated. The size of the window is large at the beginning, to capture patch-like texture homogeneities, and shrinks with time to reduce the estimation noise of the parameters at texture boundaries. The weights of the neurons in the first layer will converge to the MRF model parameters of the various texture types, whilst the weights of the second layer will be the prototypes of these types, i.e. the segmented image. The computational form of the entire algorithm is simple and efficient. Theoretical analysis shows that the algorithm will converge to the maximum likelihood segmentation. Figs. 3 and 4 show two typical examples of such applications. In Figure 3, further processing, namely minimum mean-square-error cross validation, has been used to verify the correct number of texture types in the image, when this is unknown. In Figure 4, the number of texture types was subjectively assumed to be four. Interestingly, the algorithm has segmented the image into four meaningful categories - trees, grass, buildings, and roads.
Figure 3. (a) Composite textured image, (b) Segmentation result.
Figure 4. (a) An aerial image, (b) Segmentation result.
4. IMAGE COMPRESSION
In our proof of convergence, we have shown that the SOM is an optimal vector quantiser (VQ), as it satisfies the two necessary conditions for VQ (the Voronoi partition and centroid conditions) [4]. That is, the mapping quantises the continuous input space optimally in the mean-square-error sense. The input space can then be represented by a limited number of reference vectors (code vectors). The use of the neighbourhood function makes the SOM superior to common VQs in two major aspects. The first is that the SOM is better at overcoming the under- or over-utilisation problem: neighbourhood learning introduces a natural "conscience" to the learning process, in contrast to other enforced (such as frequency-sensitive) competitive learning algorithms. The second is that the SOM will produce a map (codebook) with some ordering among the code vectors, and this gives the map an ability to provide noise tolerance for input or retrieval patterns. An example is given in Figure 5, in which (b) shows that the 16x16 codebook produced by the SOM has a very distinct ordering among the code vectors. The code vectors are of 4x4 blocks. In our experiments, we have found that SOMs generally perform better than other VQs, especially in situations where local optima are present.

The robustness of the SOM has been further improved by introducing a constraint on the learning extent of a neuron based on the input space variance it covers [19]. The algorithm was intended to achieve globally optimal VQ by limiting and unifying the distortions from all nodes to approximately equal amounts - the asymptotic property of the globally optimal VQ (i.e., for a smooth underlying probability density and a large number of code vectors, all regions in an optimal Voronoi partition have the same within-region variance). The constraint is applied to the scope of the neighbourhood function, so that a node covering a large region (thus having a large variance) has a large neighbourhood. Our results show that the resulting quantisation error is smaller.
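As a sketch of how such a SOM codebook is used for compression (an added illustration in Python/NumPy, not the authors' code; the 4x4 block size and 256-entry codebook follow the example of Figure 5), each image block is simply replaced by the index of its nearest code vector:

import numpy as np

def encode_blocks(image, codebook, block=4):
    """Vector-quantise a grayscale image: split it into block x block patches
    and store, for each patch, the index of the nearest code vector."""
    h, w = image.shape
    indices = []
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            patch = image[i:i + block, j:j + block].reshape(-1)
            d = np.linalg.norm(codebook - patch, axis=1)   # codebook: (256, 16)
            indices.append(int(np.argmin(d)))
    return indices   # decoding replaces each index by its code vector

Because the SOM codebook is ordered, a transmission error in an index tends to select a neighbouring, and hence similar, code vector - the noise-tolerance property referred to above.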
Figure 5. (a) Quantised Lena image, (b) The SOM codebook (map).
A more recent application of rate-constrained SOMs, where the learning process is modified to minimise the VQ distortion subject to an entropy approximation constraint, is for low bit rate video coding. Figure 6 shows a sequence of frames from a QCIF (176x144 pixels) conferencing video system operating at 28.8 kbps and 10 fps. The system employs multiresolution discrete wavelet transformations, and the SOMs provide additional quantisation for each of the transformed levels.
Figure 6. Video sequence frames at 28.8 kbps and 10 fps.
5. DENSITY ESTIMATION
Mixture models have been widely employed in pattern classification applications to approximate probability density functions (PDFs). In a mixture distribution, each sample observation from an input space is assigned to one of several component sub-PDFs, which in turn are represented by a common single PDF profile such as the Gaussian or Cauchy. The optimal approach is the ML criterion. However, in practice this will be a pseudo joint-likelihood, as independence of the observations has to be assumed. This ML approach can only be solved numerically. The EM algorithm is an iterative ML procedure for parameter estimation under incomplete data or missing data circumstances [20]. Many problems can be viewed as instances of such situations. For example, in the unsupervised learning of a mixture distribution model, the input samples are incomplete; the missing data are the
component labels or indicator functions for each sample. By using the EM procedure, the marginal, or incomplete-data, likelihood is obtained by the average or expectation of the complete-data likelihood with respect to the missing data under the current parameter estimates (E-step); then the new parameter estimates are obtained by maximising the marginal likelihood (M-step). The EM algorithm has been shown to be an iterative gradient ascent algorithm, in which the likelihood function exhibits no decrease after each iteration. The EM method has been applied to the estimation of Gaussian mixtures [8]. It is an extended and generalised k-means algorithm with a consideration of the component distributions and priors, and so will generally result in much improved clustering compared with the k-means algorithm. It has been shown [4, 21] that the EM algorithm is a variable metric gradient ascent algorithm with first-order convergence. The EM algorithm provides a feasible solution for the PDF estimation problem, but it converges slowly, especially when the mixture components are not well separated, and its computational costs are usually high [9].

Based on the mixture distribution model, we have proposed a Bayesian SOM for the estimation of Gaussian mixtures. In this algorithm, each neuron represents a sub-PDF profile, with weights representing the kernel's position, variance and mixing parameter. Its output reflects the contribution from this neuron to the joint PDF. When an input is presented to the network, the outputs from all neurons indicate how well each of them represents the input, and the neuron with the highest kernel output wins. This changes the Euclidean-distance winning mechanism to a maximum posterior probability rule. The learning rule is similar to that of the SOM, but with the neighbourhood function replaced by a Gaussian posterior function (estimated online). This function serves as a Bayesian inference to each neuron's learning from the input; it is fairly flat at the beginning of training and gradually sharpens. The number of nodes, however, need not be known a priori, as long as there is a sufficient number to represent the underlying components in the mixture. Only the significant ones will remain. Such a number can be an objective factor in the learning.

The Bayesian SOM is further related to a criterion of minimising a relative entropy, or Kullback-Leibler information metric, and can be generalised to the self-organised mixture network (SOMN) for general mixtures [14]. This criterion is equivalent to the ML criterion, but with the entropy measure arguably more suitable than the L2 measure for density estimation or unsupervised learning [22, 23]. The algorithm has been applied in various situations. The results indicate that it outperforms the EM algorithm. It has also been compared with some algorithms recently proposed by Ormoneit and Tresp [10]. In their algorithms, various methods such as averaging, maximum penalised likelihood and Bayesian estimation have been used to improve the EM-based estimation. The example data below were similar to those used in [10]. In total, 200 sample observations were generated according to the method given in [10] for each of two pattern classes. Half was used for training and the other half for testing. The scatter of the entire data is shown in Figure 7. Each class-conditional distribution was estimated by a 10x10 SOMN of isotropic Gaussians. Typical results, i.e., estimated densities, are given in Figure 8.
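To make the flavour of this learning rule concrete, the fragment below sketches one on-line update of a mixture of isotropic Gaussians in which the posterior responsibility of each node plays the role that the neighbourhood function plays in the standard SOM. It is an illustrative sketch only (Python/NumPy); the exact update equations of the Bayesian SOM and SOMN in [11, 13, 14] differ in their details.

import numpy as np

def somn_step(x, means, variances, mixing, eta):
    """One stochastic update of an isotropic Gaussian mixture, with the
    online-estimated posterior acting as the learning weight per node."""
    d2 = np.sum((means - x) ** 2, axis=1)
    dim = means.shape[1]
    dens = np.exp(-0.5 * d2 / variances) / (2 * np.pi * variances) ** (dim / 2)
    post = mixing * dens
    post /= post.sum()                                  # posterior responsibilities
    means += eta * post[:, None] * (x - means)          # move kernels towards x
    variances += eta * post * (d2 / dim - variances)    # adapt kernel widths
    mixing += eta * (post - mixing)                     # adapt mixing weights
    return int(np.argmax(post))                         # "winner" = max posterior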
Based on these probability distributions, an optimal classifier can then be constructed according to the Bayes rule (with class priors in proportion to the percentages of the class samples, 0.5 for each class in this example). The classification rates are 85% and 83% for training and testing respectively, a slight improvement over those reported in [10].
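A minimal sketch of that decision rule (again an added illustration; class_density stands for the two estimated class-conditional densities and is a hypothetical helper, here a sequence of callables):

def bayes_classify(x, class_density, priors=(0.5, 0.5)):
    """Assign x to the class with the largest prior-weighted density p(x|c)P(c)."""
    scores = [p * class_density[c](x) for c, p in enumerate(priors)]
    return scores.index(max(scores))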
Figure 7. Data scatters (two pattern classes).
Figure 8. Estimated mixtures by two SOMNs.

Figure 9 shows another application of the SOMN, fitting an x-ray rocking curve profile, i.e., an x-ray diffraction peak as a function of photon energy (dashed line). Though in reality a spectrum, the experimental sample (200 data points in total) was resampled randomly to provide an effective density histogram for network training (19,512 points were generated). The network, initially assigned five Cauchy components as the true number of peaks was assumed unknown, has successfully learnt the two main peaks. The estimated density (solid line) is shown after only five epochs. The resulting weights of the five components are listed in Table 1. Two components (nodes 4 and 5) are active; the others are too small to contribute meaningfully.
Figure 9. Mixture estimation using a Cauchy SOMN for an x-ray rocking curve profile.

Table 1. Results for individual nodes in the mixture of Figure 9.

Node    Mixing weight    Mean       Variance (sigma)
1       6.42x10^-322     819.69     20.10
2       6.37x10^-322     1302.90    20.10
3       3.82x10^-33      30.27      119.03
4       0.3711           -667.79    10.29
5       0.6289           -357.93    67.03
The structure and results of the Bayesian SOM or SOMN also reveal an important implication for the neighbourhood function used in the standard SOM: the mapping is inferring a mixture density structure. The form of the neighbourhood function implies the underlying prototype distribution of the neurons; it imposes a prior, or subjective factor, on the input distribution that the map is inferring. That is, if a Gaussian neighbourhood function is used, the SOM is bound to infer a mixture of Gaussian distributions from the input. However, as Gaussian mixtures are universal approximators, the widely applied Gaussian neighbourhood functions seem justified.

6. SUMMARY
Some relationships between the biologically inspired SOM and statistical pattern learning methods, such as stochastic approximation, maximum likelihood and entropy-based learning, have been investigated. Some extensions of the SOM have thus been proposed, either to improve statistical pattern recognition methods or to demonstrate further aspects of the mathematical properties of the map. The mathematical role of the neighbourhood function, as a mixture distribution interpreter, has been unveiled. This should provide an important basis for ordering analysis.
REFERENCES
1. T. Kohonen, Biological Cybernetics, Vol. 43, (1982) 59.
2. T. Kohonen, Self-Organising Maps, Springer, Berlin, 1997.
3. Proceedings of WSOM'97, Helsinki University of Technology, 1997.
4. H. Yin and N.M. Allinson, Neural Computation, Vol. 7, (1995) 1178.
5. S.P. Luttrell, IEEE Trans. Neural Networks, Vol. 1, (1990) 229.
6. D.M. Titterington, A.F.M. Smith, and U.E. Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley: New York, 1985.
7. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley: New York, 1973.
8. L. Xu and M.I. Jordan, in Proc. of World Congress on Neural Networks (II), (1993) 431.
9. R.A. Redner and H.F. Walker, SIAM Review, Vol. 26, (1984) 195.
10. D. Ormoneit and V. Tresp, IEEE Trans. Neural Networks, Vol. 9, (1998) 639.
11. H. Yin and N.M. Allinson, in Proc. of WSOM'97, (1997) 118.
12. Y. Delignon, A. Marzouki, and W. Pieczynski, IEEE Trans. Image Processing, Vol. 6, (1997) 1364.
13. H. Yin and N.M. Allinson, Electronics Letters, Vol. 33, (1997) 304.
14. H. Yin and N.M. Allinson, in Proc. of IEEE Int. Conf. Neural Networks, (1998) 2277.
15. S. Kullback and R.A. Leibler, Annals of Math. Statistics, Vol. 22, (1951) 79.
16. H. Robbins and S. Monro, The Annals of Math. Statistics, Vol. 22, (1951) 400.
17. N.M. Allinson, in Theory and Appl. of Neural Networks, J.G. Taylor and C.L.T. Mannion (eds.), Springer-Verlag, London, (1990) 110.
18. H. Yin and N.M. Allinson, Electronics Letters, Vol. 30, (1994) 1842.
19. H. Yin and N.M. Allinson, in Proc. Int. Conf. Neural Info. Process., (1996) 80.
20. A.P. Dempster, N.M. Laird, and D.B. Rubin, J. of the Royal Statistical Society, Series B, Vol. 39, (1977) 1.
21. L. Xu and M.I. Jordan, Neural Computation, Vol. 8, (1996) 129.
22. M. Benaim and L. Tomasini, in Proc. Int. Conf. Artificial Neural Networks, (1991) 391.
23. H. White, Neural Computation, Vol. 1, (1989) 425.
Kohonen Maps. E. Oja and S. Kaski, editors. (c) 1999 Elsevier Science B.V. All rights reserved.
Tree Structured Self-Organizing Maps
Pasi Koikkalainen*
University of Jyväskylä, Department of Mathematical Information Technology, P.O. Box 35, 40351 Jyväskylä, Finland

The tree structured self-organizing map (TS-SOM) was originally intended as a fast implementation of the self-organizing map (SOM). This paper explains it as a constructive smoother for a class of dimension reduction problems. There is a well-known relation between self-organizing maps and principal curves. Unfortunately, in most presentations it is derived by simple reasoning, avoiding the mathematical statement of the problem, which is essential for understanding how efficient SOM implementations can be constructed. In this presentation the self-organizing map (SOM) is derived as a numerical solution of a generic model in a continuous domain, which differs from the ad hoc way of constructing neural networks. It is then shown that many variations of the SOM are just different numerical implementations, including generative models. This leads to the derivation of the TS-SOM algorithm.

1. Introduction

The tree structured self-organizing map (TS-SOM) [9],[10] is a multiresolution representation of several self-organizing maps (SOMs). It uses discretization as a smoother to implement the shrinking neighborhood (gradually decreasing smoothing) of Kohonen's original algorithm [7]. This results in a sequence of latent data representations with growing flexibility. Before explaining the TS-SOM in detail, the theoretical basis of its components is explained in Sections 1 and 2.
1.1. Dimension reduction and latent spaces
Let X(ω) ∈ D_X and V(ω) ∈ D_V be random vectors and p_X(x), p_V(v) the corresponding densities, where x ∈ R^m and v ∈ R^n. In dimension reduction one tries to choose a projection from a given higher-dimensional data domain D_X ⊂ R^m to some new lower-dimensional domain (latent space) D_V ⊂ R^n such that maximal information (in some sense) of the density p_X(x) is saved in the process. Thus, one can consider the density p_V(v) on the domain D_V as an approximation to p_X(x). The relations (projections) between X and V are defined via functions v'(x) : D_X → D_V and x'(v) : D_V → D_X.

*This work has been sponsored by the Academy of Finland (grant 37190/96) and the Technology Development Center (TEKES projects Stella, Daemon).
Without restrictions the problem is not well defined. It is therefore usual that the density p_V(v) or the functions v'(x) and x'(v) (or both) are restricted in some way that is characteristic of the method, and special techniques are used to overcome the remaining singularities. In general we may choose to constrain:
(a) the shape of p_V(v); and/or
(b) the dimension of D_V; and/or
(c) the flexibility of v'(x) and x'(v).

In addition, a criterion for the optimal approximation under the given constraints is required. The most natural choice would be to minimize the information loss between p_X(x) and p_V(v), which unfortunately might be difficult to apply in practical cases when p_X(x) is unknown. The selection of criteria and constraints depends on the nature of the problem. In some cases p_X(x) is fully known and the task is to simplify it, or to find the best fit to a given p_V(v). When we have only observations X(ω), the role of p_V(v) is to estimate the most important characteristics of an unknown real-world p_X(x).

1.2. Example - PCA

To illustrate the idea, principal component analysis restricts the shape of p_V(v) by assuming a multivariate Normal distribution N(x̄, Σ) as an approximation to p_X(x), where we know that
\bar{x} = \int x\, p_X(x)\, dx = E[\,x\,],

and

\Sigma = \int (x - \bar{x})(x - \bar{x})^{T}\, p_X(x)\, dx = \mathrm{Cov}[\,x\,].
The shape of the density (pdf) p_V(v) is a hyperellipsoid that is fully defined by its principal axes v. These axes define a linear projection v'(x) = P(x - x̄), where the columns of the m x m matrix P are the eigenvectors of Σ, and x'(v) = P^{-1}v + x̄. For the cases when the data occupy fewer than m dimensions, the eigenvectors corresponding to non-zero eigenvalues can be solved using special techniques such as the singular value decomposition. When p_X(x) is not known, the parameters x̄ and Σ are obtained by minimizing the information loss on the linear projections D_X → D_V → D_X. This can be expressed as an optimization problem, max over x̄ and Σ of the likelihood
\mathcal{L} = \prod_{k=1}^{N} \frac{1}{Z} \exp\Big\{ -\tfrac{1}{2}\, (x(k) - \bar{x})^{T} \Sigma^{-1} (x(k) - \bar{x}) \Big\}, \qquad (1)
where k = 1, 2, ..., N indexes the samples and Z is a normalization constant.

2. Self-organizing maps, principal surfaces and smoothers

The derivation of the tree structured self-organizing map (TS-SOM) [9],[10] requires some knowledge about discretization and smoothing in the context of the self-organizing map (SOM) [7],[8]. It is generally known that the SOM algorithm is closely related to principal surfaces, as discussed in [11] and [12].
In principal curves and surfaces [5],[6] we are interested in finding a surface (or a curve) x'(v) where D_V is restricted, but the shape of the pdf p_V(v) is (in general) not. The mapping v'(x) : D_X → D_V is defined such that points x ∈ R^m are projected to the closest point v on D_V, and the surface x'(v) is restricted to go through the middle of p_X(x | v). This is expressed more formally as
x'(v) = R\Big( \int x\, p_X(x \mid v = v'(x))\, dx,\; v \Big) = R\big(\, E[\, x \mid v = v'(x)\, ],\; v \big), \qquad (2)
where R(x''(v), v) is an additional regulator, typically a set of identical independent smoothers for the x_i with respect to v, to ensure uniqueness of the solution. The projection v'(x) can be expressed, using the Euclidean metric, as the minimization

v'(x) = \arg\min_{v''} \| x'(v'') - x \|, \qquad (3)

where v'' ∈ D_V.
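A minimal sketch of the projection (3) for a curve or surface represented by a finite set of sample points x'(v) (the array names are ours; the surface is assumed to be given as discrete samples):

```python
import numpy as np

def project_to_curve(x, surface_points, latent_coords):
    """Return v'(x): the latent coordinate whose surface point x'(v) is closest to x.

    surface_points : (K, m) samples x'(v_k) of the curve/surface
    latent_coords  : (K, n) the corresponding latent positions v_k
    """
    dists = np.linalg.norm(surface_points - x, axis=1)   # ||x'(v_k) - x|| for all k
    return latent_coords[np.argmin(dists)]
```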
It is relatively easy to imagine that the design of the regulator R(x''_i(v), v) plays a major role in numeric solutions. The smoothness of x'(v) must be related to the complexity of the data set, which can be understood in several ways, depending on the application. Also, smoothness alone is not sufficient to guarantee a unique solution, as pointed out in [3], and it is difficult to find clear criteria for the task when the conditional pdf p_X(x | v'(x)) is estimated from data. Principal curves and self-organizing maps are usually implemented such that smoothing decreases gradually until the desired solution is obtained. This has been found experimentally to give satisfactory results, but optimal criteria for such procedures are hard to find.

2.1. Derivation of Kohonen's algorithm
It is possible to generalize self-organizing maps and principal curves as kernel-smoothed surfaces of the type

x'(v) = \Big[ \int H_a(v - z) \int x\, p_X(x \mid z = v'(x))\, dx\, dz \Big] \Big/ \Big[ \int H_a(v - z)\, dz \Big], \qquad (4)
where H(u) is a kernel function and the width a defines the amount of smoothing. If H(v) is a Gaussian, then equation (4) is another form (a fundamental solution) of the physically important Poisson equation

-k \nabla^{2} x'(v) = \int x\, p_X(x \mid v = v'(x))\, dx, \qquad (5)
where k is a diffusion constant that determines the amount of smoothing. With an appropriate selection of the smoothing parameters these equations could be used as objective functions for the SOM. This is also a basis to understand the TS-SOM, as explained later.
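On a discrete set of latent points the smoothing in (4) reduces to a kernel-weighted average of conditional means, as the following sketch (our own notation, Gaussian kernel assumed) illustrates:

```python
import numpy as np

def kernel_smoothed_surface(latent, cond_means, width):
    """Discrete analogue of eq. (4): smooth the conditional means E[x | v = v_k]
    over the latent points with a Gaussian kernel of the given width."""
    smoothed = np.empty_like(cond_means)
    for i, v in enumerate(latent):
        w = np.exp(-0.5 * np.sum((latent - v)**2, axis=1) / width**2)  # H_a(v - z)
        smoothed[i] = (w[:, None] * cond_means).sum(axis=0) / w.sum()
    return smoothed
```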
2.2. Discretization

For historical reasons, self-organizing maps are usually given in a form where D_V is defined as an n-dimensional discrete lattice of nodes (neurons)
\mathcal{D}_L = \{\, v \in \mathbb{R}^{n} : v = v_Q = Qh,\; Q = [q_1, q_2, \ldots, q_n]^{T},\; q_j \in \{1, 2, \ldots, p\} \,\}, \qquad (6)
where h is the discretization step and p is the number of nodes per dimension. A further simplification can be made by discretizing the projection D_X → D_L from x ∈ R^m to the closest lattice point (node) v_{b(x)} ∈ D_L, where

b(x) = \arg\min_{j} \| x'(v_j) - x \|. \qquad (7)
Discretization replaces integrals with sums over lattice points. The resulting discrete surface can be expressed with a Nadaraya-Watson type of kernel estimator (see [12]). Using the notation h_{i,j} = H((v_i - v_j)/a) / \sum_k H((v_i - v_k)/a) for the neighborhood function and w_i = x'(v_i) for the weight vector of neuron i, we get
w_i = \sum_{k} h_{i,k} \int x\, p_X(x \mid b(x) = k)\, dx = \sum_{k} h_{i,k}\, E[\, x \mid b(x) = k\, ]. \qquad (8)
There are many alternatives for solving equation (8) from data when p_X(x) is not known. Batch algorithms can be derived from the likelihood function, which leads to Kohonen's original stochastic algorithm through the recursive least squares method [13]. The same result also comes directly from equation (8), which gives the updating function g(w(n), x(n)) for the Robbins-Monro iteration

w(n+1) = w(n) + \alpha(n)\, g(w(n), x(n)), \qquad (9)

where \alpha(n) is a decreasing sequence of positive real numbers such that \lim_{n \to \infty} \alpha(n) = 0, \sum_{n=1}^{\infty} \alpha(n) = \infty and \sum_{n=1}^{\infty} \alpha(n)^{2} < \infty. To find g(w(n), x(n)) one can construct a regression function f(w_i) = 0 that is the expectation of g, under the conditions f(w') < 0 when w' > w_i and f(w') > 0 when w' < w_i, as follows:

f(w_i) = \sum_{k} h_{i,k}\, E[\, x \mid b(x) = k\, ] - w_i = E[\, h_{i,b(x)}\,(x - w_i) \mid w_i\, ] = E[\, g \mid w_i\, ]. \qquad (10)

Applying g from equation (10) in equation (9) gives Kohonen's stochastic adaptation rule for the self-organizing map

w_i(n+1) = \begin{cases} w_i(n) + \gamma(n)\,(x(n) - w_i(n)) & \text{for } b(x(n)) \in N_i \\ w_i(n) & \text{otherwise,} \end{cases} \qquad (11)

where the neighborhood N_i = \{\, j \mid h_{i,j} \neq 0 \,\} and \gamma(n) = \alpha(n)\, h_{i,b(x(n))}.
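A compact sketch of the stochastic rule (11), vectorized over all neurons (variable names are ours; a Gaussian neighborhood on the lattice is assumed):

```python
import numpy as np

def som_update(w, x, lattice, alpha, sigma):
    """One stochastic SOM step: find the best-matching unit and move its
    lattice neighbourhood towards the input x."""
    bmu = np.argmin(np.linalg.norm(w - x, axis=1))        # b(x), cf. eq. (7)
    lat_d2 = np.sum((lattice - lattice[bmu])**2, axis=1)  # squared lattice distances
    h = np.exp(-0.5 * lat_d2 / sigma**2)                  # neighbourhood h_{i,b(x)}
    w += (alpha * h)[:, None] * (x - w)                   # cf. eq. (11)
    return w
```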
2.3. Generative approaches

Principal surfaces and self-organizing maps approximate the pdf p_X(x), x ∈ D_X, indirectly by the smoothed expectation E[x | v = v'(x)]. Another direction is to require stronger similarity between p_X(x) and its approximation p_V(v). In practice one seeks solutions where D_V and p_V(v) are easier to understand than the original p_X(x) in R^m, i.e. one typically assumes that D_V ⊂ R^2. To meet both requirements, good approximation and dimension reduction, one needs flexible functional forms for the density p_X(x | v), such as mixtures of Gaussians [14], that can be transformed into p_V(v | x) by using the Bayes theorem. In generative models this is done under the assumption that each point v ∈ D_V contributes a generator p̂(x | x', I) to a location x' ∈ D_X through the mapping x'(v), where I is the set of parameters that define the shape of the generator. In addition, the approximation requires prior weightings for the generators, p̂(v, I), which simplifies to p̂(v)p̂(I) if v and I are statistically independent. Now the approximation can be given as
p_X(x) \approx \hat{p}_X(x) = \int\!\!\int \hat{p}(x \mid x'(v), I)\, \hat{p}(v, I)\, dv\, dI. \qquad (12)
The beauty of the generative approach is that objective functions are easy to write in terms of likelihoods, \mathcal{L} = \prod_{k=1}^{N} \hat{p}_X(X(k)). These are usually decomposed into several parametrized density components, which then leads to the estimation of the parameters from data. In full Bayesian estimation one tries to find all densities simultaneously, while most practical implementations use hierarchical approaches where, for the most part, p(v, I) is estimated (or decided) separately. The most typical choice for the generators is to use Gaussian shapes

\hat{p}(x \mid v, I) = \exp\Big\{ -\tfrac{\beta}{2}\, \| x - x'(v) \|^{2} \Big\},

where the shape I is determined by the width 1/β of the generator. In many practical algorithms equation (12) has been solved by using the EM algorithm [2], but there are also other possibilities.

Equation (12) has been applied by Utsugi [15] to generalize self-organizing maps to Gaussian mixture models. In Utsugi's formulation the smoothing prior p̂(v, I) is discretized to a lattice of locations v_j. Similarly to the SOM, these are related to D_X via the positions x'(v_j) = w_j. The smoothing is then done on the lattice by using a discrete Laplacian operator that implements equation (5). As a result, the locations w_j are comparable to Kohonen's original SOM. In another model, called the generative topographic mapping (GTM) [1], equation (12) expresses x'(v) from R^n to R^m via a set of basis functions Φ = {φ_1(v), ..., φ_k(v)}. GTM is similar to the SOM when the basis functions Φ are Dirac delta functions δ(.). Otherwise the mapping x'(v) is defined as a linear projection from all basis functions, x'(v) = WΦ(v), where W is a k x m matrix of weights that relates the basis functions to the real axes of D_X. Although Utsugi's model looks similar to GTM, it has a different motivation. When the prior constrains neighboring centers directly on the lattice D_L, it is easier to imagine that the data at a point x ∈ D_X are generated by Gaussians that are neighbors on D_V. In GTM one works with higher abstractions, and the user must make difficult decisions about the number and shape of the basis functions.
3. The tree structured self-organizing map

The generative approach provides a framework that explains and constrains self-organizing maps. In practice, model complexity makes using and implementing these algorithms difficult. The computing time grows rapidly as the number of data points, the dimension of the inputs x, or the number of lattice points v_i (or basis functions φ_j(v) in GTM) increases. In addition, it might be difficult to control these algorithms in real-world problems.

Tree structures play an important role in computer science and artificial intelligence in reducing the complexity of algorithms. In this role trees have been applied successfully to many probabilistic problems; famous tree algorithms are, for instance, decision trees and image processing pyramids. The TS-SOM, too, is loosely based on multidimensional tree search, introduced in reference [4], where the search over N lattice nodes in equation (7) is reduced to O(log_p N) computational complexity, where p is the number of sons per node.

The TS-SOM is a constructive method that grows from simple solutions to more complex ones. From this perspective it is also related to multigrid methods, which are commonly used in scientific computing to speed up algorithmic convergence; these are optimization techniques that use simple problems to initialize more complex ones. The constructive build-up also provides other computational advantages. When the search is repeatedly executed over the same data, the search on tree level (layer) l is mostly replaced by a lookup table b_{l-1}[X(ω)] that associates sample X(ω) with the closest node on the previously constructed level l - 1. The size of the remaining search set, between a node and the next level, is independent of N, which results in almost O(1) overall search complexity, so the search no longer dominates the computing time of the SOM algorithm. Yet another aspect of the TS-SOM is that it puts more emphasis on the role of discretization, which is briefly discussed in the following.
3.1. About discretization and smoothing

It is well known that smoothing and convolution f * g = ∫ f(x - z) g(z) dz are very closely related, especially when kernel smoothers are considered. The degree of smoothing can be controlled quite well because the spread (or variance) of the resulting function is known to be σ²_{f*g} = σ²_f + σ²_g.

The Gaussian smoother H(u) in equation (4) does not fully explain the smoothing in self-organizing maps; the effect of the discrete lattice D_L must be included in the study. Evidently any function that is represented by discrete points is band-limited in the Fourier domain. In the spatial domain, under certain conditions, we may approximate this as rectangular filtering, sin(πx)/(πx) * f(x), which resembles Kohonen's "Mexican-hat" function and, like the Gaussian smoother, approaches the Dirac delta δ(.) when σ² → 0. Thus discretization works as a smoother for self-organizing maps, although a more careful examination would be needed to find sufficient conditions. This is demonstrated briefly in Figure 3. In practice the smoothing is usually balanced and fixed between the effects of discretization and the Gaussian kernel. For example, when the discretization is two times coarser, the Gaussian also has a two times larger variance on the real axes. This is easily (automatically) obtained by defining the width of the Gaussian as a function of the neighboring lattice points.
3.2. Tree structures and levels

The structure of the TS-SOM is a graph T = (V, E1, E2), where V is the set of nodes (neurons), E1 is the set of hierarchical edges (tree links), and E2 is the set of directed lateral edges (SOM links). For any node i ∈ V there is a subtree T_i such that (V_i ⊂ V, E1_i ⊂ E1, E2_i ⊂ E2), and T_i ∩ T_j = ∅ if (i ∉ V_j and j ∉ V_i). Each tree level is a self-organizing map, where each node divides into 2^n son nodes as depicted in Figure 1, where n is the SOM dimension (D_V ⊂ R^n). To be able to separate the different SOMs, the level index l is added to the previous notations when needed. The tree levels are indexed with l = 0, 1, ..., L - 1, and D_L(l) denotes one n-dimensional SOM lattice on level l. A simple computation shows that level l has (2^n)^l nodes. Furthermore, the following notations are used for node i on layer l: S_i is the set of sons (children on layer l + 1), F_i is the parent node (on layer l - 1), and N_i is the set of the nearest neighbors on the same layer l.
Figure 1. Illustration of TS-SOM structures for dimensions n = 1 and n = 2.
3.3. Derivation of the TS-SOM algorithm

We assume that the SOMs on each tree level are Gaussian mixtures with fixed positions and widths in terms of the lattice points. If each node i on level l contributes a Gaussian generator p̂_l(x | x'(v_i), I), and only nearby generators on the SOM lattice overlap significantly, many useful simplifications can be made. First, the likelihood can be written in terms of the previous levels
\mathcal{L}_l = \prod_{k=1}^{N} \hat{p}_l(X(k)), \quad \mathcal{L}_{l-1} = \prod_{k=1}^{N} \hat{p}_{l-1}(X(k)), \quad \ldots, \quad \mathcal{L}_0 = \prod_{k=1}^{N} \hat{p}_0(X(k)). \qquad (13)
Now one can solve the layers in order 0, 1, ..., L - 1, starting from a SOM with only one node, then solving a SOM with 2^n nodes, and so on. Although the full derivation must be omitted here, it is easy to see how this leads to a simple algorithm. When the previous levels are used as priors, a generator j on level l + 1 can have a role only in a subset of D_X that is in the close neighborhood of its parents ({F_j} ∪ N_{F_j}), which is
Figure 2. Relation of generators in two successive layers.
illustrated in Figure 2. How large this area is depends on the shape of the Gaussians. In practice one can make the following simplifications, where v_{b(X)}^l denotes the closest node on level l for the given input X.
1. Limited search. If v_{b(X)}^l = v_i, then v_{b(X)}^{l+1} = v_r, where r belongs to the sons of i or to the sons of the neighbors of i. That is, if a data sample was closest to node i, then on the next layer it must be closest to one of the sons of i or of the neighbors of i.

2. Limited updating. If v_{b(X)}^{l+1} = v_j, then X must be generated by j or by one of its neighbors r ∈ N_j. As a result, the updating rule for the position of generator j can be expressed as a sum of the conditional means of its neighbors.
The widths of the generators are determined by the lattice, σ_g ≈ γ |x'(v_i) - x'(v_{i+1})|, where γ is a constant for neighborhood weighting. The generator positions w_i ∈ D_X can be computed from data by using the following algorithm.
0. Initially (m(r) denotes the number of samples assigned to node r):
   l' = 0;  w_root = x̄ = (1/m(root)) Σ_k X(k).

1. Initialize a new layer: l = l' + 1;  w_j = w_{F_j} for all j ∈ D_L(l).

2. Find the set of closest samples and their centroids for each node j:
   D_X(j) = { X(k) | j = b^l(X(k)) },   x̄_j = (1/m(j)) Σ_{p ∈ D_X(j)} X(p).

3. Update all j ∈ D_L(l):
   w_j = [ Σ_r h_{j,r} m(r) ]^{-1} Σ_r h_{j,r} m(r) x̄_r,   where r ∈ N_j ∪ {j}.

4. Repeat until converged: if the weights still change appreciably, GOTO 2.

5. Next layer: if there are more layers, set l' = l and GOTO 1.
The weighting h_{j,r} m(r) x̄_r results from the overlapping generators, where the weight h_{j,r} is constant, typically h_{i,i} = 1 and h_{i,j} = 0.3 for j ∈ N_i. Step 2 requires a search for the best node for all data. Since the layers are trained sequentially, the search for the position (equation (7)) on layer l can be boosted by simply storing the best-matching nodes on the previous layer for all data in a lookup table b_{l-1}[X(k)]. The remaining task is then to look for the best candidate for X(k) from a relatively small set. This results in only (2^n)^2 search operations per sample X(k), where n is the dimension of the SOM lattice. Finally, some example organizations on 2- and 3-dimensional pdfs are shown in Figures 3 and 4.
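The level-by-level training above can be summarized in the following sketch. It is a simplification in our own notation: the search uses a plain full scan instead of the tree-restricted lookup table, and the neighbour weight is fixed to the value 0.3 mentioned in the text.

```python
import numpy as np

def train_ts_som_level(data, parent_w, neighbours, h=0.3, n_iter=10):
    """Train one TS-SOM level: children start at their parents' positions and are
    then updated as neighbour-weighted centroids of the samples closest to them.

    parent_w   : (K, m) initial weights copied from the parent nodes
    neighbours : list of index lists, neighbours[j] = lattice neighbours of node j
    """
    w = parent_w.copy()
    for _ in range(n_iter):
        # step 2: assign every sample to its closest node on this level
        b = np.argmin(((data[:, None, :] - w[None, :, :])**2).sum(-1), axis=1)
        counts = np.bincount(b, minlength=len(w))
        sums = np.zeros_like(w)
        np.add.at(sums, b, data)
        means = np.where(counts[:, None] > 0,
                         sums / np.maximum(counts, 1)[:, None], w)
        # step 3: neighbour-weighted centroid update
        new_w = np.empty_like(w)
        for j in range(len(w)):
            idx = [j] + list(neighbours[j])
            wt = np.array([1.0] + [h] * len(neighbours[j])) * counts[idx]
            new_w[j] = (wt[:, None] * means[idx]).sum(0) / max(wt.sum(), 1e-12)
        w = new_w
    return w
```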
Figure 3. Weight vectors of a 1-dimensional ring-topology TS-SOM (levels 2, ..., 9) without Gaussian smoothing (γ = 0). The inputs are sampled from a 2-dimensional uniform density pdf.
REFERENCES
1. C. Bishop, M. Svensen and C. Williams, GTM: The Generative Topographic Mapping, Neural Computation 1 (1998).
2. A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statistical Society Series B, 39, 1-38.
3. T. Duchamp and W. Stuetzle, Extremal Properties of Principal Curves in the Plane, Annals of Statistics 4 (1996), 1511-1520.
4. J.H. Friedman, J.L. Bentley, R.A. Finkel, An Algorithm for Finding Best Matches in Logarithmic Time, ACM Trans. Math. Software 3 (1977), 209-216.
Figure 4. Upper picture: the organization of a 1-dimensional ring-topology TS-SOM (levels 2, 6, 8 and 10) with Gaussian smoothing on a 2-dimensional Gaussian density pdf. Bottom picture: the organization of a 2-dimensional sphere-topology TS-SOM (levels 3, 4, 4, 6) with Gaussian smoothing on a 3-dimensional uniform density pdf.
5. T. Hastie and W. Stuetzle, Principal Curves, JASA 406 (1989), 502-516.
6. M. LeBlanc and R. Tibshirani, Adaptive Principal Surfaces, JASA 425 (1994), 53-64.
7. T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43 (1982), 59-69.
8. T. Kohonen, Self-organizing maps, 2nd ed. (1997), Springer-Verlag.
9. P. Koikkalainen and E. Oja, Self-Organizing Hierarchical Feature Maps, in Proc. IJCNN-90 (1990), IEEE Press, 279-284.
10. P. Koikkalainen, Progress with the Tree-Structured Self-Organizing Map, in Proc. ECAI'94 (1994), A. Cohn (ed.), John Wiley & Sons, 211-215.
11. H. Ritter, T. Martinetz and K. Schulten, Neural Computation and Self-Organizing Maps (1992), Addison-Wesley.
12. F. Mulier and V. Cherkassky, Self-Organization as an Iterative Kernel Smoothing Process, Neural Computation 7 (1995), 1165-1177.
13. L. Ljung and T. Söderström, Theory and Practice of Recursive Identification (1983), MIT Press.
14. D. Titterington, A. Smith and U. Makov, Statistical Analysis of Finite Mixture Distributions (1985), Wiley.
15. A. Utsugi, Hyperparameter selection for self-organizing maps, Neural Computation 3 (1997), 623-635.
Kohonen Maps. E. Oja and S. Kaski, editors. (c) 1999 Elsevier Science B.V. All rights reserved.
Growing self-organizing networks - history, status quo, and perspectives
Bernd Fritzke
Institute for Artificial Intelligence, Computer Science Division, Dresden University of Technology, D-01062 Dresden

An overview of the (rather recent) history of growing self-organizing networks is given. Starting from Kohonen's original work on the self-organizing map, various modifications and new developments are motivated and illustrated. Current applications are presented, as well as possible directions for future research.

1. INTRODUCTION

In this paper I will use a somewhat subjective standpoint, limiting the more detailed explanations to work I have done myself. This is partly due to space limitations, which prevent a complete survey, but it also provides the possibility to better motivate the particular direction the research has taken. Where most appropriate, pointers to the literature are given.

My first contact with self-organizing networks was in 1989, when I stared at the mysterious beauty of the well-known "cactus image" contained in [1] (see Figure 1 for a simulation sequence with a comparable probability distribution, the last sub-figure being very similar to the mentioned illustration). After understanding the details I was surprised that - starting from complete initial disorder - topological order can be achieved by following two simple rules (from [1]): i) Locate the best-matching unit. ii) Increase matching at this unit and its topological neighbors. Moreover, these rules lead to simple and efficient computer implementations. For the first rule only the vector distances
\| \xi - w_c \| \qquad (c \in \mathcal{A}) \qquad (1)
have to be evaluated, the number of which depends linearly on the size of the investigated self-organizing system A. The unit s with minimum distance is the so-called best-matching unit for the current input signal ξ. For the second rule a number of vector additions of the form
w_c := w_c + \varepsilon\, (\xi - w_c) \qquad (c \in N_s) \qquad (2)
Figure 1. A 20 x 20 self-organizing map adapts to a cactus-shaped probability distribution. Initially there is a topological defect ("twist") in the map (b), but the strong neighborhood adaptation at this stage of the simulation is able to correct this rapidly (c).
are performed, whereby N_s is a topological neighborhood around s and ε = ε(d(c, s), t) is a learning rate depending both on the distance d(c, s) between s and the current unit c and on the time (measured in the number of presented input signals). Again, the number of vector additions depends linearly on the size of the self-organizing system A, which makes the whole procedure computationally efficient and feasible also for large systems. Later I learned about the close relation of the self-organizing map to earlier work done by Willshaw and von der Malsburg in the context of explaining retino-topical mappings [2]. Very likely the simpler mathematical description of the self-organizing map was an important factor leading to the widespread use of this method for many kinds of data analysis and learning problems (see [3] for an extensive list), whereas Willshaw and von der Malsburg's related pioneering work is today applied more in other areas, e.g. in the vision domain for face and object recognition and in robotics [4-6]. It should be noted that in the case of the self-organizing map the simple description does not lead to an equally simple mathematical analysis of the resulting system. In particular it has been shown that no energy function exists [7]. The presence of an energy function
Figure 2. A one-dimensional SOM with 2000 units folds into a two-dimensional input space. The input signals are uniformly distributed in the shaded area. One can note that the gap between opposing parts of the chain is everywhere about the same.
is often taken as the guarantee for a well-behaved system. Nevertheless, the SOM has proven to be practically applicable for a huge number of problems, and parameter choice seems not to be a difficult problem in most cases. Often the parameter decay schemes proposed in [8] are used. This means having an adaptation of each unit r according to

\Delta w_r = \varepsilon(t)\, h(t, r, s)\, (\xi - w_r) \qquad (3)

with the neighborhood range defined according to

h(t, r, s) = \exp\Big( -\frac{d_1(r, s)^{2}}{2\sigma^{2}(t)} \Big). \qquad (4)

Thereby, d_1(r, s) is the well-known Manhattan distance on the grid, and the width σ and the learning rate ε are decayed exponentially according to

\sigma(t) = \sigma_i\, (\sigma_f / \sigma_i)^{t/t_{max}} \qquad (5)

and

\varepsilon(t) = \varepsilon_i\, (\varepsilon_f / \varepsilon_i)^{t/t_{max}} \qquad (6)
with suitable initial values σ_i, ε_i and final values σ_f, ε_f; a sketch of this decay scheme is given below. One important property of the SOM method is that it uses a fixed dimensionality for the map. If the input data is higher-dimensional, this leads to the development of dimensionality-reducing mappings: each point in input space is mapped to the position of the closest unit in the map. One can also interpret this as a folding of the low-dimensional map into the higher-dimensional input space (see e.g. Figures 2 and 3). The dimensionality-reducing mappings can be used in various ways to visualize as well as group data. A recent application is WEBSOM [9].
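A small sketch of one adaptation step with the decay schedules (3)-(6); the parameter names and the default values are ours and only illustrative:

```python
import numpy as np

def som_step_with_decay(w, grid, xi, t, t_max,
                        sigma_i=3.0, sigma_f=0.1, eps_i=0.5, eps_f=0.005):
    """One SOM adaptation step using the exponential decay schedules (5)-(6)
    and the Gaussian neighborhood (4) with Manhattan distance on the grid."""
    sigma = sigma_i * (sigma_f / sigma_i) ** (t / t_max)     # eq. (5)
    eps = eps_i * (eps_f / eps_i) ** (t / t_max)             # eq. (6)
    s = np.argmin(np.linalg.norm(w - xi, axis=1))            # best-matching unit
    d1 = np.abs(grid - grid[s]).sum(axis=1)                  # Manhattan distance d_1(r, s)
    h = np.exp(-d1**2 / (2 * sigma**2))                      # eq. (4)
    return w + (eps * h)[:, None] * (xi - w)                 # eq. (3)
```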
2. GROWING CELL STRUCTURES
After having experimented with the SOM for a while I noted two possible problems during application of this method to a particular data set:
Figure 3. A two-dimensional SOM of size 40 x 40 folds into a three-dimensional input space. The input signals are uniformly distributed in the boxed area. One can note that the map is mostly oriented according to the first two principal components of the data and incorporates the smaller (vertical) principal component only by developing small folds orthogonal to the main orientations.
- Specifying a suitable size of the SOM in advance can be difficult.
- The predefined (usually rectangular) structure may not be suitable to represent the given data.

Later I attended a presentation by Jokusch [10] where an incremental variant of the SOM was proposed which grew on a rectangular grid, guided by the input signals of the given data distribution. While this work seemed very interesting, since the "dogma" of a fixed network size was abandoned, it still left some features to be desired. In particular, the resulting structures tended to arborize with many branches, and the limitation that each unit must lie on a rectangular grid led to structures lacking the canonical simplicity which I had in mind. I then came up with the idea of using a "simplex" as the basic building block for a self-organizing network. A simplex is the most simple polyhedron of a given dimensionality k. Accordingly, a two-dimensional simplex is a triangle, a three-dimensional simplex is a tetrahedron, and higher dimensions lead to k-dimensional hypertetrahedrons consisting of k + 1 vertices completely connected with each other. Given enough simplices, every k-dimensional shape can be modeled with arbitrary accuracy (a feature which is used in finite element methods). By constraining the network
Figure 4. A "Growing Cell Structures" network adapts to a circle-shaped probability distribution. Always after 200 presented signals a new unit is inserted by splitting one of the existing edges and re-connecting such that the whole structure again consists only of triangles.
structure to consist of k-dimensional simplices it is possible to ensure that the network is a k-dimensional structure. In particular, if new simplices are inserted by splitting existing ones, it is very simple to ensure this condition. Figure 4 shows the first few insertion steps of the resulting model, which I named "Growing Cell Structures" (GCS) and which was initially presented at the first ICANN conference 1991 in Helsinki [11]. An important question which has to be answered for an incremental model is the following: Where should new units be inserted? I answered this question for the GCS model as well as for later variants of it as follows:
- Define a measurable objective for the network.
- Keep local estimates for each unit indicating how much this unit contributes to the objective.
- Insert new units near those units not contributing much to the objective.
An important pre-condition for this strategy to lead to "good" networks in terms of the given objective is that several units in a particular region of input space can be expected to contribute more to the objective than a single one does. Fortunately, this pre-condition is fulfilled in many problem areas. One of them is vector quantization, where a given data set D is quantized by a small number of reference vectors and the objective is to keep the mean distance between data item ξ and its nearest reference vector w_s small:
\sum_{\xi \in \mathcal{D}} \| \xi - w_s \| \;\rightarrow\; \min! \qquad (7)
Obviously, if one reference vector in a certain area of the input space is unable to quantize the data in its vicinity well, it helps to add more reference vectors there. In this case the corresponding data subset is partitioned among all reference vectors in the area, leading
to a smaller overall quantization error. Thus, for vector quantization the accumulated quantization error of each unit is a good criterion for where to insert new units. Accordingly, on-line error estimates can be obtained by updating the error estimate E_s of the best-matching unit s as follows:
E_s := \lambda\, ( E_s + \| \xi - w_s \|^{2} ) \qquad (8)
Thereby, λ is a decay factor which ensures that more recently acquired error values are weighted more strongly than older ones, and that for units which are no longer the best-matching unit the error term vanishes. If the goal of the network is entropy maximization rather than error minimization, this can be incorporated by using the following update rule for the error estimate:
E_s := \lambda\, ( E_s + 1 ) \qquad (9)
This simple modification leads to high values of E for units which are winners often, and inserting new units near them tends to make the winning frequency of all units similar, thereby maximizing the Shannon entropy.

There are some fundamental properties in which GCS differs from the original SOM. In particular, GCS has
a) dynamic generation of the network structure,
b) constant (and small) adaptation strength, and
c) constant (and small) neighborhood range.
On the other hand, the properties inherited from the SOM are
- fixed network dimensionality,
- neighborhood definition by means of the topological structure.

Further details of the model (see [12] for a complete description) involve local redistribution of error estimates when a new unit is inserted. This re-distribution is done for efficiency reasons only and allows insertion of new units in a fixed rhythm (e.g. always after 200 presented signals). Alternatively one could discard the error estimates after each insertion and use a number of adaptation steps per inserted unit which is proportional to the current network size.

One general strength of the described approach lies in the possibility to choose an application-specific error criterion and use this to steer the growth of the network. In Figure 5 the quantization error is used as the insertion criterion, leading to a problem-dependent structure of the network.

A one-dimensional variant of GCS has been used to compute approximate solutions for the traveling salesman problem, a classical NP-complete optimization problem [13]. In this case the structure was further constrained to be a ring of units (by creating one additional connection). The input signals in this case were the coordinates of cities (or drilling holes in some problems). The ring structure was allowed to grow until each city
Figure 5. A "Growing Cell Structures" network adapts to a cactus-shaped probability distribution. Always after 200 presented input signals a new unit is created near the unit with maximum accumulated quantization error. The structure which develops is two-dimensional by construction and is specific for the presented data.
had a unique nearest unit on the ring. Visiting the cities in the order of their corresponding units on the ring gave the desired round trip. By limiting the search for the best-matching unit to a topological neighborhood on the ring, the whole method had only linear time complexity and gave better solutions than the original SOM method for this problem, while being much faster for nontrivial problem sizes. A related model is "Growing Grid", which follows a similar growth strategy but restricts the network topology to be a rectangular grid. This leads to rectangular self-organizing maps which automatically determine a suitable height/width ratio [14]. The error-driven insertion underlying these models is sketched below.
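A minimal sketch of the error-driven growth common to GCS and its variants. The decay factor, the neighbor choice and the error redistribution are simplified illustrative choices of ours, not the exact rules of the original papers; `edges[q]` is assumed to give the graph neighbours of unit q.

```python
import numpy as np

def accumulate_error(E, s, xi, w, lam=0.995):
    """Eq. (8)-style bookkeeping: decay all error estimates and add the current
    quantization error to the best-matching unit s."""
    E *= lam
    E[s] += np.sum((xi - w[s])**2)
    return E

def insert_unit(w, E, edges):
    """Insert a new unit halfway between the unit q with the largest accumulated
    error and its highest-error neighbor f; the new unit inherits part of their
    error estimates (a common heuristic, simplified here)."""
    q = int(np.argmax(E))
    f = max(edges[q], key=lambda j: E[j])
    w_new = 0.5 * (w[q] + w[f])
    E[q] *= 0.5
    E[f] *= 0.5
    w = np.vstack([w, w_new])
    E = np.append(E, E[q])
    return w, E, (q, f)
```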
3. GROWING NEURAL GAS
In many situations it is not helpful to have a pre-defined and fixed dimensionality for the self-organizing network. Rather it might be useful to let the data itself determine the dimensionality which may even vary from one area of the input space to another.
Already in 1991 (also at the first ICANN meeting) Martinetz proposed the "Neural Gas" model, which he combined with "Competitive Hebbian Learning" to generate network structures having exactly the mentioned properties [15]. The resulting model deviates in two principal ways from the original SOM:
a) The neighborhood of units to be adapted together with the best-matching unit is determined by ordering all units according to their distance in input space from the best-matching unit. The SOM, in contrast, uses the distance on the pre-defined and unchangeable grid.
b) The topological structure is built up dynamically by connecting the best-matching unit for an input signal ξ with the unit second-closest in input space. Moreover, connections are allowed to age and are eventually removed if not refreshed. This is done to make the resulting graph a close approximation of the Delaunay triangulation restricted to the area of input space with non-zero probability density p(ξ).
Apart from these points, several features of the SOM are still present in Martinetz' model:
- constant network size,
- shrinking neighborhood range,
- decaying adaptation strength.
A nearly obvious idea at this point is to combine the error-based growth proposed in the GCS model and the dynamic graph generation of "Competitive Hebbian Learning". I named the resulting model "Growing Neural Gas" (GNG) [16]. Independently, Bruske developed the "Dynamic Cell Structures" along similar lines [17]. In contrast to Martinetz' model, the "Competitive Hebbian Learning" in GNG is actually used to determine the positions of the reference vectors. The reason is that new reference vectors are always inserted in the center of an existing edge (as already proposed for GCS). In Figure 6 a simulation run of the GNG model is shown.
4. UTILITY AS REMOVAL CRITERION
Perhaps surprisingly a central question in the context of growing self-organizing networks is: How to identify candidate units for removal? Why is removal of units needed in the first place? Depending on the objective function of the network and the insertion strategy units may be created in regions of the input space where they are not helpful in achieving the stated objective. For example if the objective is reduction of quantization error, new units in the GNG model are inserted between the unit with maximum accumulated error and one of its neighbor. Occasionally the position of the new unit may be in an area of zero probability density. This unit will not be the winner for any signals and will not contribute at all to the overall objective. Thus, it could be removed without increasing the quantization error at all. In most cases,
Figure 6. A "Growing Neural Gas" network adapts to a cactus-shaped probability distribution. Starting from a single unit, a graph develops which mostly (not strictly, as in GCS) consists of triangles and is extremely well adapted to the given data.
however, the situation is not that simple; in general the units of the network will contribute to different degrees to the given objective. In the case of limited resources (e.g. a given maximum network size) one has to ask whether it is possible to allocate units differently than in the current state of the network. Moreover, the network structure may suggest misleading relations in the data which are only present because some units do not correspond well to the underlying data distribution. For the specific but frequently used objective "reduction of quantization error" I recently proposed Utility as a criterion for removal, with the following informal definition: The Utility U_c of a unit c is the additional quantization error which would occur if this unit were removed. To formalize this for the case of a finite data set D, let us write the mean quantization error occurring for a network A as
E[\mathcal{D}, \mathcal{A}] = \frac{1}{|\mathcal{D}|} \sum_{c \in \mathcal{A}} \sum_{\xi \in R_c} \| \xi - w_c \|^{2} \qquad (10)
whereby R_c is the subset of D for which the unit c, c ∈ A, has the nearest reference vector (i.e. the subset of all elements of D lying in the Voronoi region of w_c). Then the utility of a unit c can be defined as
U(c) = E[\mathcal{D}, \mathcal{A} \setminus \{c\}] - E[\mathcal{D}, \mathcal{A}], \qquad (11)
which is equivalent to
U(c) = \sum_{\xi \in R_c} \| \xi - w_{s_2(\xi)} \|^{2} - \| \xi - w_c \|^{2}, \qquad (12)
with s_2(ξ) being the unit second-closest to ξ. By iteratively moving the unit with minimum utility to a new location one arrives at LBG-U [18], a recently proposed improvement of the classical LBG method for vector quantization in finite data sets. In the case of on-line learning with continuous probability distributions one can compute on-line estimates of the utility U_s for each unit by performing the following update for the best-matching unit:
U_s := \lambda\, \big( U_s + \| \xi - w_{s_2(\xi)} \|^{2} - \| \xi - w_s \|^{2} \big) \qquad (13)
Again, λ is a suitable decay factor, the size of which in this case determines how fast older utility values are forgotten. Removing units with low utility leads, in the case of growing self-organizing networks, to models which are able to track non-stationary probability distributions [19]. In Figure 7 the difference between such a model and the original SOM is illustrated when confronted with non-stationary data.

5. SUPERVISED LEARNING

In growing self-organizing networks the error measure guiding the insertion of new units can be chosen application-dependent. Therefore, it is particularly easy to use these networks in a supervised learning setting, i.e. for pattern classification or for regression problems. The common principle of most approaches proposed so far is to associate with every unit of the network a simple local model which is used to represent the I/O mapping of the data in this area of the pattern space. The error which is accumulated at the units is accordingly the classification or regression error. Since new units can be interpolated from existing ones, it is possible to train the local model while the network grows to its final size. Variations of this approach include growing RBF networks [12,20], growing networks with local linear maps [21] and growing networks with a local on-line interpolation between constant output values [22].

6. APPLICATIONS

In this section a few applications of growing self-organizing networks are listed without further explanations. GCS and GNG have been used for
- Evaluation of high energy physics experiments [23-25]
Figure 7. Two self-organizing networks of size 10 adapting to a one-dimensional nonstationary probability distribution. The vertical dimension is the time which goes from t = 0 (top) to t = 40000 (bottom). The SOM is only initially able to follow the changes in the data distribution, since at early stages the neighborhood range and adaptation strength is still strong. The GNG-U (GNG plus utility-based removal) network on the other hand tracks the distribution well over the complete time period. It is able to do so due to its constant parameters and the iterated removal and subsequent re-insertion.
- Computer graphics (radiosity rendering) [26]
- Self-calibration of the fixation movement of a stereo camera head [27]
- Robust kinematic learning for a redundant robot arm [22].
LBG-U has so far been used for
- Learning of object start positions in video sequences [28].
Figure 8. A modified GNG network (see text) adapts to a bi-modal probability distribution. Although the clusters overlap, they can still be identified by inspecting the strength of the edges in the graph. Areas of low probability density lead to weak edges, so that clusters can be found by determining subgraphs connected by edges stronger than a threshold. Continuously increasing the threshold results in a hierarchy of clusters.
7. OUTLOOK

One promising area where the potential of growing self-organizing networks has not been fully exploited is certainly data mining and knowledge discovery. Clustering huge data sets without knowing the number of clusters in advance is something incremental networks should excel at. Since in the real world clusters do overlap, methods must be developed to identify clusters even in noisy environments. In Figure 8 it is illustrated how this could be done in principle. Here the original GNG method has been enhanced by keeping a "strength" parameter with every edge, which is increased every time this edge connects the winner and the second winner for an input signal. This makes it possible to identify frequently used edges instead of only noting the existence or non-existence of an edge. This information could be used for clustering in noisy environments.

Since many competitive learning methods can be seen as special cases of the EM algorithm [29], another interesting research area might be to use ideas from growing self-organizing networks to develop new incremental variants of EM. Vice versa, one could try to incorporate EM in the self-organization process of growing networks. Since EM can be used for density estimation, there would be immediate applications for such methods in the area of pattern recognition.
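A sketch of the cluster-extraction idea just described: keep only edges whose accumulated strength exceeds a threshold and read off the connected components. The function and the union-find helper are ours, not part of the GNG method itself.

```python
def clusters_by_edge_strength(n_units, edges, strength, threshold):
    """Return the connected components (candidate clusters) of the graph obtained
    by keeping only edges whose accumulated strength exceeds the threshold."""
    parent = list(range(n_units))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for (i, j), s in zip(edges, strength):
        if s > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for u in range(n_units):
        clusters.setdefault(find(u), []).append(u)
    return list(clusters.values())
```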
REFERENCES
1. T. Kohonen. Self-organization and associative memory. Springer, Heidelberg, 1984. 2. D. J. Willshaw and C. von der Malsburg. How patterned neural connections can be set up by self-organization. In Proceedings of the Royal Society London, volume B194, pages 431-445, 1976. 3. Samuel Kaski, Jari Kangas, and Teuvo Kohonen. Bibliography of self-organizing map (SOM) papers: 1981-1997. Neural Computing Surveys, 1(3&4):1-176, 1998. Available in electronic form at http://www.icsi.berkeley.edu/~jagota/NCS/: Vol 1, pp. 102-350. 4. Martin Lades, Jan C. Vorbriiggen, Joachim Buhmann, J. Lange, Christoph vonder Malsburg, Rolf P. Wfirtz, and Wolfgang Konen. Distortion Invariant Object Recognition in the Dynamic Link Architecture. IEEE Transactions on Computers, 42:300311, 1993. 5. Wolfgang Konen, Thomas Maurer, and Christoph von der Malsburg. A Fast Dynamic Link Matching Algorithm for Invariant Pattern Recognition. Neural Networks, 7(6/7):1019-1030, 1994. 6. Mark Becker, Efthimia Kefalea, Eric Ma~l, Christoph von der Malsburg, Mike Pagel, Jochen Triesch, Jan C. Vorbriiggen, Rolf P. Wfirtz, and Stefan Zadel. GripSee: A Gesture-controlled Robot for Object Perception and Manipulation. Autonomous Robots, 1999. Accepted. 7. E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics, 67:47-55, 1992. 8. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, New York, 1992. 9. Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen. WEBSOM--selforganizing maps of document collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pages 310-315. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997. 10. S. Jokusch. A neural network which adapts its structure to a given set of patterns. In R. Eckmiller, G. Hartmann, and G. Hauske, editors, Parallel Processing in Neural Systems and Computers, pages 169-172. Elsevier Science Publishers B.V., 1990. 11. B. Fritzke. Let it g r o w - self-organizing feature maps with problem dependent cell structure. In T. Kohonen, K. M~kisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 403-408. North-Holland, Amsterdam, 1991. 12. B. Fritzke. Growing cell structures - a self-organizing network for unsupervised and supervised learning. Neural Networks, 7(9):1441-1460, 1994. 13. B. Fritzke and P. Wilke. FLEXMAP - A neural network with linear time and space complexity for the traveling salesman problem. In Proc. of IJCNN-91, pages 929-934, Singapore, 1991. 14. B. Fritzke. Growing grid - a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995. 15. T. M. Martinetz and K. J. Schulten. A "neural-gas" network learns topologies. In T. Kohonen, K. Ms O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 397-402. North-Holland, Amsterdam, 1991.
144 16. B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 625-632. MIT Press, Cambridge MA, 1995. 17. J. Bruske and G. Sommer. Dynamic cell structure learns perfectly topology preserving map. Neural Computation, 7(4):845-865, 1995. 18. B. Fritzke. The LBG-U method for vector quantization- an improvement over LBG inspired from neural networks. Neural Processing Letters, 5(1):35-45, 1997. 19. B. Fritzke. A self-organizing network that can follow non-stationary distributions. In ICANN'97: International Conference on Artificial Neural Networks, pages 613-618. Springer, 1997. 20. B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2-5, 1994. 21. B. Fritzke. Incremental learning of local linear mappings. In F. Fogelman and P. Gallinari, editors, ICANN'95: International Conference on Artificial Neural Networks, pages 217-222, Paris, France, 1995. EC2 & Cie. 22. E. Ma~l. A Hierarchical Network for Learning Robust Models of Kinematic Chains. In C. von der Malsburg, W. von Seelen, J.C. Vorbrfiggen, and B. Sendhoff, editors, ICANN'96: International Conference on Artificial Neural Networks, pages 617-622. Springer, 1996. 23. M. Kunze and J. Steffens. Growing Cell Structure and Neural G a s - Incremental Neural Networks. In ~th Artificial Intelligence in High Energy Physics workshop, Pisa, 1995. 24. Rfidiger Berlich, M. Kunze, and J. Steffens. A Comparison between the Performance of Feed Forward Neural Networks and the Supervised Growing Neural Gas Algorithm. In 5th Artificial Intelligence in High Energy Physics workshop, Lausanne, 1996. World Scientific. 25. M. Kunze. Uber die Anwendung Neuronaler Systeme in der Teilchenphysik. Habilitationsschrift, Ruhr-Universit~t Bochum, 1996. 26. C.-A. Bohn. Efficiently representing the radiosity kernel through learning. In Xavier Pueyo and Peter Schr5der, editors, Rendering Techniques '96, pages 123-132, Wien, Austria, 1996. Springer-Verlag Wien. Proceedings of the 7th EUROGRAPHICS workshop on rendering in Porto, Portugal, June 17-19, 1996. 27. M. Pagel, E. Ma~l, and C. v. d. Malsburg. Self Calibration of the Fixation Movement of a Stereo Camera Head. Machine Learning, 31:169-186, 1998. 28. Hartmut S. Loos, Bernd Fritzke, and Christoph vonder Malsburg. Positionsvorhersage von bewegten Objekten in grogformatigen Bildsequenzen. In Proceedings des Workshop Dynamische Perzeption, Juni 18-19, Bielefeld, Germany. Infix Verlag, 1998. 29. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statis. Soc. Ser. B, (39):1-38, 1977.
Kohonen Maps. E. Oja and S. Kaski, editors. (c) 1999 Elsevier Science B.V. All rights reserved.
Kohonen
Self-Organizing Map with quantized weights
P. Thiran ~ * ~Institute for Computer Communications and Applications, Ecole Polytechnique Ffidfirale de Lausanne, CH-1015 Lausanne, Switzerland This paper reviews the convergence properties of the Kohonen Self-Organizing Map when the inputs and the weights are quantized. The motivation behind this work and the impact of the results are both of theoretical nature (the finite state space is easier to handle, and allows to obtain necessary and sufficient conditions on the admissible neighborhood functions for a large class of functions), and of practical nature (Should the quantization weights completely perturb the self-organization property of the map, it will not be possible to implement it on a digital VLSI chip). 1. I n t r o d u c t i o n There are at least three reasons for studying Kohonen's algorithm with quantized weights and inputs: 1. The Kohonen algorithm is now a Markov chain with a finite discrete state space, which is easier to handle in the proofs, even if this simplicity is counter-balanced by the fact that tie winners will have to be dealt with. Nevertheless, in the 1-dim. setting, it will be possible to derive more general conditions for the ordering of the weights for a general input distribution than in the continuous case. 2. The neighborhood functions (too flat or too steep) that were only slowing down the organization of the continuous weights will now make it impossible. This will provide some additional insight for the self-organization process in the continuous case.
.
The digital implementation of the Kohonen algorithm on a chip or a neuro-accelerator [1-3] yields the quantization of all data, sometimes on a very low number of bits in the case of a mixed digital-analog chip [4].
This paper reviews the main results obtained for Kohonen's Self-Organizing Map with a discrete state space, proofs and details can be found in [5,6,4], and is structured as follows: Kohonen's algorithm with quantized weights is first described in Section 2. Most of the paper is focused on the rigorous analysis in the 1-dim. setting, where both the weights and the inputs are scalar. In Section 3, necessary and sufficient conditions for the ordering of scalar quantized weights are established. Section 4 these results are compared with *E-maih
Patrick. Thiran 9
146 the ones obtained when the weights are continuous-valued. qualitatively this analysis for the 2-dim. case.
Finally, Section 5 confirms
2. M o d i f i c a t i o n of t h e w e i g h t s d y n a m i c s in t h e K o h o n e n n e t w o r k The network is made up of N neurons. The output response of each neuron i to an input ( - (~1, ~ 2 , . . . , ~ n ) is given by [[ t ~ - tti [, where tti = (#il, P i 2 , ' " , Pin) is its synaptic weight vector. We consider the Kohonen algorithm expressed in the discretetime formalism. The inputs will always be bounded (Xmi n ~ II ~ II~ _< Xm~x, with [[ t~ I[o~ - maxj [(j[), and so will be the weights. Therefore we can clip the dynamics of the weights to these values without perturbing the learning process. On the other hand, discretization of the dynamics can affect the learning process noticeably, as we will see further. We will investigate the effects of quantization by roundoff. This law of quantization is given by the function Q(.), mapping a real value x to an integer multiple of the quantification step q 9
Q[x]-kq
1 (k-~)q_<x<
if
1 (k+~)q,
with k being an integer. In case of vector quantization, the vector x = gets mapped to
(1)
(xl,x2,... ,xn)
Q[x] = (Q[xl], Q[x2],..., Q[xn]) with Q[.] defined by (1). The quantization step q is given by Xma x - Xmi n
$$q = \frac{x_{\max} - x_{\min}}{L - 1} = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad (2)$$

where b is the number of bits and L is the number of quantization levels. Both the weights and the inputs are quantized with the same law Q[.]. At time t, an input ξ(t + 1) is presented to the network, which then selects the winner w as the neuron whose output response is minimum. This winning unit therefore satisfies the condition

$$\|\xi(t+1) - \mu_w(t)\| = \min_i \|\xi(t+1) - \mu_i(t)\|. \qquad (3)$$
The weights are then updated according to the equation

$$\mu_i(t+1) = Q\bigl[\mu_i(t) + \alpha(t)\Lambda(|w-i|,t)\,(\xi(t+1) - \mu_i(t))\bigr] = \mu_i(t) + Q\bigl[\alpha(t)\Lambda(|w-i|,t)\,(\xi(t+1) - \mu_i(t))\bigr], \qquad (4)$$

where α(t) is the adaptation gain (0 < α(t) < 1) and Λ(|w − i|, t) is a neighborhood function (0 ≤ Λ(|w − i|, t) ≤ 1, with Λ(0, t) = 1). A simple choice for Λ(|w − i|, t) is a rectangularly shaped neighborhood function, i.e.

$$\Lambda(|w-i|, t) = \begin{cases} 1 & \text{if } i \in \mathcal{N}_w(t) \\ 0 & \text{if } i \notin \mathcal{N}_w(t), \end{cases} \qquad (5)$$

where 𝒩_w(t) is the (usually time-dependent) neighborhood set of neuron w, defined as the set of units i for which the value of Λ(|w − i|, t) is nonzero.
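To make the update concrete, here is a minimal sketch in Python of one step of the quantized algorithm (1)-(5). It is not code from the paper; the function and variable names are ours, the winner is computed with the Euclidean norm, and ties are broken by the lowest index as specified in Section 3.1 below.

```python
import numpy as np

def quantize(x, q):
    """Roundoff quantization Q[.] of (1): map x to the nearest integer multiple of q."""
    return q * np.floor(x / q + 0.5)

def quantized_som_step(mu, xi, alpha, lam, q):
    """One update (4) of the quantized Kohonen algorithm.

    mu    : (N, n) array of quantized weight vectors
    xi    : (n,) quantized input vector
    alpha : adaptation gain, 0 < alpha < 1
    lam   : callable, lam(d) gives the neighborhood value for |w - i| = d
    q     : quantization step (2)
    """
    # Winner (3): neuron with the smallest output response; argmin returns the lowest index on ties.
    w = int(np.argmin(np.linalg.norm(xi - mu, axis=1)))
    for i in range(mu.shape[0]):
        # Quantizing only the correction term (second line of (4)) keeps mu on the grid,
        # since the weights are already integer multiples of q.
        mu[i] = mu[i] + quantize(alpha * lam(abs(w - i)) * (xi - mu[i]), q)
    return mu
```

For scalar weights (n = 1) with q = 1/(L − 1), this reproduces the setting analysed in Section 3.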
3. Mathematical analysis in the 1-dim. setting
In the case of scalar quantized weights and inputs, n = 1, and without loss of generality we will assume x_min = 0 and x_max = 1, so that the quantization step is q = 1/(L − 1). The weights and the inputs can then only take the values 0, q, 2q, 3q, ..., (L − 2)q, (L − 1)q = 1. In this section, we want to choose the parameters independently of the input probability distribution, so that the weights become ordered for any input probability distribution. The only assumption that will be made is that both quantization levels 0 and (L − 1)q have a nonzero probability of receiving an input. In the following, such a level will be called an excitable level: level lq, with 0 ≤ l ≤ L − 1, is thus excitable if the probability that ξ = lq is nonzero. The question that we will study is the following: how do we have to choose the parameters of the network so that the weights become ordered with probability 1?

3.1. Tie winners

There still remains an ambiguity in the algorithm with discrete-valued weights. Indeed, what happens if more than one unit satisfies (3)? With continuous-valued weights and inputs, such an event occurs with probability zero [7] and so does not matter, but this probability is definitely nonzero in the case of discrete weights and inputs. Therefore we need to specify how the winner is chosen. To have a unique winner, we choose as a rule that if two neurons satisfy (3), the winner will be the one with the lowest index. This choice is arbitrary, and other choices are of course possible.

3.2. Existence of the desired absorbing classes

With continuous-valued weights, strictly monotone configurations remain absorbing even for a rectangularly shaped neighborhood function. With quantized weights, however, if we pick such a function (for example (5) with 𝒩_w = {w − 1, w, w + 1}), the classes of strictly monotone configurations are no longer absorbing. Consider Figure 1, representing an initially strictly increasing chain of four weights (μ_1(t) < μ_2(t) < μ_3(t) < μ_4(t)). The sequence of inputs ξ(t), ξ(t + 1), ξ(t + 2) and ξ(t + 3) given in Figure 1, with α = 0.33, disorganizes the chain: μ_1(t + 4) < μ_2(t + 4) = μ_3(t + 4) > μ_4(t + 4). The reason why this problem occurs is due to the nonlinearity Q[.] and to the shape of the neighborhood function Λ(|w − i|). With α = 0.33, if μ_w(t) = ξ(t) − q, then
$$\mu_w(t+1) = \mu_w(t) + Q[0.33\, q] = \mu_w(t),$$

while if μ_{w−1}(t) = ξ(t) − 2q = μ_w(t) − q, we have that

$$\mu_{w-1}(t+1) = \mu_{w-1}(t) + Q[0.66\, q] = \mu_{w-1}(t) + q = \mu_w(t) = \mu_w(t+1).$$

So in this case, the weight of the winning unit does not move because it is too close to the input, but its neighbors, being further from the input, are displaced. This effect is therefore responsible for the disorganization of an ordered chain. To fix it, one should reduce the displacement of the neighbors relative to the displacement of the winner, by taking a neighborhood function decreasing with the distance between w and i.
[Figure 1: the four weights μ_1(t) < μ_2(t) < μ_3(t) < μ_4(t) plotted against the neuron index, together with the inputs ξ(t), ..., ξ(t + 3) and the resulting weight configurations at times t, t + 1, t + 2, t + 3 and t + 4.]
Figure 1. Sequence of inputs disorganizing an initially ordered chain, when the neighborhood is rectangular.
For instance, we will see that an acceptable function is

$$\Lambda(j) = \begin{cases} 1 & \text{if } j = 0 \\ \dfrac{1}{2j} & \text{if } 1 \le j \le R \\ 0 & \text{if } j > R, \end{cases} \qquad (6)$$

where j = |w − i|. To establish necessary and sufficient conditions for the existence of the desired absorbing classes, we need to define a new parameter, δ(j), as follows. Suppose that for any j ≥ 0, 1/(2αΛ(j)) is not an integer (such a condition is easy to achieve: the set of values of α and Λ(j) such that 1/(2αΛ(j)) is an integer has measure zero). Then δ(j) is defined as the integer such that

$$\frac{1}{2\alpha\Lambda(j)} - 1 < \delta(j) < \frac{1}{2\alpha\Lambda(j)} \qquad (7)$$

if 1/(2αΛ(j)) < L − 1, and δ(j) = L − 1 if 1/(2αΛ(j)) > L − 1. The meaning of δ(j) is provided by the following lemma, proven in [6].
Lemma 1. If at some time t, μ_{w±j}(t) > ξ(t) + δ(j)q, then

$$\xi(t) + \delta(j)\,q \le \mu_{w\pm j}(t+1) \le \mu_{w\pm j}(t) - q;$$

if ξ(t) − δ(j)q ≤ μ_{w±j}(t) ≤ ξ(t) + δ(j)q, then

$$\mu_{w\pm j}(t+1) = \mu_{w\pm j}(t);$$

and if μ_{w±j}(t) < ξ(t) − δ(j)q, then

$$\mu_{w\pm j}(t) + q \le \mu_{w\pm j}(t+1) \le \xi(t) - \delta(j)\,q.$$

Consequently, δ(j)q is the maximum distance between an input and the weight μ_{w±j} such that this weight cannot move. For example, let us take α = 0.33 and Λ(j) defined by (6), with R = 3. In this case, one can compute that δ(0) = 1, δ(1) = 3, δ(2) = 6, δ(3) = 9 and δ(j) = L − 1 for j ≥ 4. We can now formulate the following theorem, which is proven in [6].

Theorem 1. Every strictly increasing (decreasing) configuration of weights at some time t will remain strictly increasing (decreasing) at time t + 1 if, and only if,

$$\delta(j) < \delta(j+1) \quad \text{whenever} \quad \delta(j+1) < L - 1. \qquad (8)$$
One can easily check that the neighborhood function (6) satisfies (8) for any 0 < α < 1.

3.3. Avoidance of parasitic absorbing classes

If the neighborhood function decreases spatially at a sufficient rate so that (8) is satisfied, Theorem 1 guarantees that strictly ordered configurations are absorbing classes of the Markov chain. Nevertheless, if this decreasing rate is too large, other parasitic, non-ordered configurations may also be absorbing classes. Remember that the weights must self-organize for any input probability distribution, and that we supposed that both levels 0 and (L − 1)q are always excitable. In particular, the weights must self-organize even if all the other quantization levels are not excitable. This input distribution is the worst one in terms of parasitic classes. Indeed, any other distribution has at least three excitable quantization levels: levels 0, (L − 1)q and kq with 1 ≤ k ≤ L − 2. If it has a parasitic absorbing class, it means that any sequence of inputs, in particular the one where inputs are presented only at levels 0 and (L − 1)q, cannot order this configuration of weights. Therefore this configuration is also a parasitic class when the only excitable levels are 0 and (L − 1)q. This "worst" distribution is the one that will be considered in this section. Let us define the notion of inversion: a configuration of weights has an inversion (or a twist) at index i, 1 ≤ i ≤ N − 1, if and only if

$$(\mu_{i+1} - \mu_i)(\mu_i - \mu_{i-1}) < 0 \quad \text{or} \quad \mu_{i+1} = \mu_i.$$

For example, the configuration of Figure 2(a) has an inversion at index (N − 1). There are two necessary conditions, given in the following theorem, in order to avoid parasitic absorbing classes. The first one, (9), applies when there is an inversion at index 2 or (N − 1), as in Figure 2(a); the second one, (10), applies to configurations of the type of Figure 2(b):
[Figure 2: two weight configurations plotted against the neuron index and the quantization level, panels (a) and (b).]
Figure 2. Two kinds of parasitic absorbing classes.
Theorem 2. If there is no other absorbing class than the strictly increasing and decreasing configurations of weights, and if (8) is satisfied, then

$$\delta(1) + \delta(N-2) < L - 1 \qquad (9)$$

and, for all 0 ≤ j ≤ N − 2,

$$\delta(j) + \delta(N-2-j) < L - 1. \qquad (10)$$
As an example, suppose that N = 10, α = 0.33 and that Λ(j) is given by (6), with R = 8. Then (7) yields that δ(0) = 1, δ(j) = 3j for 1 ≤ j ≤ 8 and δ(9) = L − 1. To avoid any parasitic absorbing class, one computes by (9) and (10) that the number of quantization levels L must be at least equal to 29. Simulations on 1000 runs of the algorithm confirm this result: Figure 3 shows the percentage of ordered configurations obtained after 100, 1000 and 10000 iteration steps for different values of L, when the only excitable levels are 0 and (L − 1)q and have an equal probability of being excited:

$$P(\xi = 0) = P(\xi = (L-1)q) = 1/2, \qquad P(\xi = kq) = 0 \quad \text{for } 1 \le k \le L-2. \qquad (11)$$
As computed, one notices that below the minimum value L = 29 one cannot obtain 100% ordered configurations. The percentage of ordered configurations falls off sharply as L decreases (44.6% for L = 28, 8.6% for L = 27). The weights' initial values were randomly chosen between 0 and 1. If we knew a priori that more than two levels were excitable, these two conditions could be made less severe. However, even for a uniform probability distribution of the inputs, there would be parasitic absorbing classes if the number of quantization levels is low and the neighborhood function decreases spatially too fast (i.e. is very concave). This is shown in Figure 4, where the input distribution is now uniform, i.e.,
$$P(\xi = kq) = 1/L \quad \text{for } 0 \le k \le L-1, \qquad (12)$$
all the other parameters of the simulations (neighborhood function, etc.) being unchanged. The number L of quantization levels at which non-ordered absorbing states appear is now only L = 25.
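As a quick numerical check of the conditions above, the following sketch (Python; the helper names are ours, not from the paper) computes δ(j) from (7) for the neighborhood function (6) and searches for the smallest L satisfying the monotonicity condition (8) together with (9) and (10). For N = 10, α = 0.33 and R = 8 it returns the value L = 29 quoted above.

```python
import math

def delta(j, alpha, Lam, L):
    """delta(j) as defined by (7): the largest integer below 1/(2*alpha*Lambda(j)),
    saturated at L-1 when Lambda(j) is zero or the bound exceeds L-1."""
    if Lam(j) == 0:
        return L - 1
    bound = 1.0 / (2.0 * alpha * Lam(j))
    return min(L - 1, math.floor(bound))

def ordering_conditions_hold(N, alpha, Lam, L):
    """Check the conditions of Theorems 1-3: spatial monotonicity of delta, (9) and (10)."""
    d = [delta(j, alpha, Lam, L) for j in range(N)]
    mono = all(d[j] < d[j + 1] for j in range(N - 1))                 # condition (8)
    c9 = d[1] + d[N - 2] < L - 1                                      # condition (9)
    c10 = all(d[j] + d[N - 2 - j] < L - 1 for j in range(N - 1))      # condition (10)
    return mono and c9 and c10

# Example of Section 3.3: N = 10, alpha = 0.33, Lambda given by (6) with R = 8.
N, alpha, R = 10, 0.33, 8
Lam = lambda j: 1.0 if j == 0 else (1.0 / (2 * j) if j <= R else 0.0)

L = 2
while not ordering_conditions_hold(N, alpha, Lam, L):
    L += 1
print("smallest admissible number of quantization levels:", L)   # prints 29
```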
Figure 3. Percentage of ordered configurations obtained after 100, 1000 and 10000 iteration steps in the case of a particular non uniform input distribution.
Note finally that with (9) and (10), (8) becomes

$$\delta(j) < \delta(j+1) \qquad (13)$$

for 0 ≤ j ≤ N − 2. This implies that a local neighborhood function (for which δ(j) < L − 1 only for small values of j) always possesses parasitic absorbing classes when few quantization levels are excitable. Such a distribution can however be viewed as a pathological input probability distribution, and a more realistic probability distribution will allow a local neighborhood function. It remains however an open question what conditions are needed in order to prevent the appearance of parasitic absorbing classes for a particular, given input distribution [8].

3.4. The ordering process

Theorems 1 and 2 state necessary conditions that the parameters must verify. In this section, we will see that these conditions are also sufficient. To this aim, we must first find a particular sequence of inputs, having a nonzero probability, that will order any configuration {μ_i(0)} in a finite time.
Figure 4. Percentage of ordered configurations obtained after 100, 1000 and 10000 iteration steps in the case of a uniform input distribution.
The major difference with the proof for continuous-valued weights and inputs is this particular set of input sequences. Indeed, in [9,7,10], the particular set of input sequences consisted in presenting a sufficient number of inputs "near" a well-chosen weight, to make an inversion disappear locally, and so, step by step, to make all the inversions disappear. In the case of quantized weights and inputs, we cannot make an inversion disappear locally as we want, because at some time some neurons are tied with others and can therefore not be elected as winners at this particular time. We must make the inversions disappear in some sort of "global" way, by presenting a sufficient (one can show that L is sufficient) number of inputs at level (L − 1)q and next at level 0, then back at level (L − 1)q, and so forth, to "stretch" all the weights and eventually remove all the inversions. One can show that such a sequence of inputs indeed orders any configuration of weights [6], so that we can now state the main result, valid for any input probability distribution with two excitable levels 0 and (L − 1)q.

Theorem 3. Any configuration of weights becomes strictly decreasing or increasing in a finite time with probability one if, and only if, the parameters α and Λ of the network satisfy the following inequalities, for 0 ≤ j ≤ N − 2:
$$\delta(j) < \delta(j+1) \le L - 1,$$
$$\delta(1) + \delta(N-2) < L - 1,$$
$$\delta(j) + \delta(N-2-j) < L - 1,$$

where δ(·) is given by (7) from α and Λ.

Proof: (⇒) These conditions were shown to be necessary in the previous section.
(⇐) Theorem 1 states that strictly increasing and decreasing configurations of weights are two absorbing classes of the Markov chain. On the other hand, a finite sequence of inputs having a nonzero probability has been constructed so that any configuration enters one of these two absorbing classes. One then deduces from a property of finite-state Markov chains that the weights become ordered almost surely in a finite time. ∎

Let us return to the example treated and simulated in Section 3.3, with the input distribution (11). Theorem 3 asserts that the minimal number of quantization levels L = 29 that was shown to be necessary in order to avoid non-ordered absorbing classes is also sufficient to guarantee that every configuration of weights will become ordered in a finite time. The simulation results plotted in Figure 3 show indeed that the percentage of ordered configurations eventually reaches 100% as soon as L ≥ 29.
4. Comparison with the Kohonen algorithm with continuous weights

Theorem 3 gives necessary and sufficient conditions guaranteeing the self-organization of a 1-dim. network for any 1-dim. input probability distribution having at least two excitable levels, when the weights and the inputs are quantized. The main difference with continuous-valued weights is that the constraints on the parameters are stricter. We have shown in Theorem 1 that the neighborhood function must be spatially strictly decreasing when the weights are quantized, so that the particular case of a rectangular (or step) neighborhood function is forbidden (the parameters of the network being time-invariant). If the weights are continuous-valued, the use of a spatially decreasing function enhances the speed of convergence, but it is not a necessary condition for the ordering of the weights, since a rectangular neighborhood function is allowed [9,7]. On the other hand, if the neighborhood function decreases spatially too fast (i.e., it is too concave), the weights may be trapped forever in parasitic absorbing states, as shown by Theorem 2. With continuous-valued weights, some concave neighborhood functions yield the appearance of "meta-stable" states, as we have seen above, which slow down considerably the speed of convergence, but these states are no longer absorbing, since there is a nonzero (yet very small) probability to leave them. The weights just remain a long time in these states. Erwin et al. [10] have shown that the most frequent meta-stable states are the ones with a twist at index 2 or (N − 1), like the one of Figure 2(a). Figure 5, where the percentage of ordered configurations is plotted as a function of the iteration step t for the example of Section 3.3 (N = 10, α = 0.33, Λ(j) given by (6), with R = 8) and for the uniform input distribution (12), illustrates well the differences between parasitic non-ordered stable states, meta-stable states and ordered stable states. When L = 27, every configuration of weights reaches an ordered stable state quite fast. When L = 26, every configuration will also eventually reach an ordered state, but often after having transited through a meta-stable state for a very long time, which slows down the ordering process. When L = 25, these meta-stable states become stable, with the consequence that a configuration of weights caught in one of these parasitic stable states can no longer leave it, contrary to what happens when L = 26. Note how sensitive to the number of quantization levels L the convergence may be. Consequently, one can qualitatively say that the conditions improving the convergence of the self-organization process when the weights are continuous-valued become necessary
[Figure 5: percentage of ordered configurations versus the iteration step t (0 to 20000), with one curve for each of L = 25, 26 and 27.]
Figure 5. Percentage of ordered configurations obtained for 25, 26 and 27 quantization levels in the case of a uniform input distribution.
when the weights are discrete-valued.

5. Higher dimensional input and weight spaces

Finally, in the case of higher dimensional input and weight spaces, and with time-varying parameters, a qualitative analysis [4] confirms these mathematical results. In particular, the gain α(t) must be a sufficiently decreasing function of time when the number of bits is low, so that the weights can be well untangled. The neighborhood size is usually also a decreasing function of time. As in the continuous case, the neighborhood size must decrease sufficiently slowly and the initial size must be sufficiently large, so that the mapping preserves the topology of the network. It is preferable to use a neighborhood function decreasing with the distance between the winner and its neighbors rather than the classic rectangular neighborhood function, as in the 1-dim. case. A too flat and wide (e.g., rectangular) neighborhood function must also be avoided because, for small values of the gain α(t) (and since it is decreasing with time, it will eventually become small), the weights of the neurons close to the winner are no longer updated, contrary to the weights further away, as explained in Section 3.2. With a small number of bits, this may even lead to an increase of the error at the end of the convergence process, as shown in [4]. The extension of the mathematical analysis of self-organization to the 2-dim. setting appears to be a very hard problem [11,12], as a workable definition of an ordered configuration is itself difficult to establish. Because of the finite state space of the algorithm with quantized weights, it would perhaps be easier to extend it to the 2-dim. case with time-invariant parameters. Moreover, it is possible to take these theoretical conclusions into account in a hybrid analog and digital VLSI realization [13,4], where an analog network implements the
spatially decreasing neighborhood function, whereas the weights are stored as digital variables. Because of the analog circuitry, the number of quantization levels is quite small (5 to 8 bits), so the quantization effects can become really critical. An analog, spatially decreasing neighborhood function has indeed been successfully implemented by Heim, Hochet and Vittoz [13] and by Peiris [14]. In digital neuro-accelerators, the granularity can be finer (16 to 32 bits), and so the quantization effects are less crucial. However, the parallelism of these hardware systems can be exploited only when the weights are not updated after the presentation of each input but only after the presentation of a group (epoch) of T inputs. The impact of this modification has also been studied, both quantitatively in the 1-dim. setting and qualitatively on speech data [15]. It is again possible to optimize both the hardware and the algorithm to make the best use of the parallelism, while preserving the convergence of the algorithm.

REFERENCES
1. Thierry Cornu, Paolo Ienne, Dagmar Niebur, Patrick Thiran, and Marc A. Viredaz. Design, implementation, and test of a multi-model systolic neural-network accelerator. Scientific Programming, 5(1):47-61, Spring 1996.
2. Marc Viredaz. Design and Analysis of a Systolic Array for Neural Computation. PhD Thesis No. 1264, École Polytechnique Fédérale de Lausanne, Lausanne, 1994.
3. Paolo Ienne Lopez. Programmable VLSI Systolic Processors for Neural Network and Matrix Computations. PhD Thesis, École Polytechnique Fédérale de Lausanne, Lausanne, 1996.
4. Patrick Thiran, Vincent Peiris, Pascal Heim, and Bertrand Hochet. Quantization effects in digitally behaving circuit implementations of Kohonen networks. IEEE Transactions on Neural Networks, 5(3):450-458, 1994.
5. Patrick Thiran. Dynamics and Self-Organization of Locally Coupled Neural Networks. Presses Polytechniques et Universitaires Romandes, Lausanne, Switzerland, 1997.
6. Patrick Thiran and Martin Hasler. Self-organisation of a one-dimensional Kohonen network with quantized weights and inputs. Neural Networks, 7(9):1427-1439, 1994.
7. Catherine Bouton and Gilles Pagès. Self-organization and a.s. convergence of the one-dimensional Kohonen algorithm with non-uniformly distributed stimuli. Stochastic Processes and their Applications, 47:249-274, 1993.
8. Patrick Thiran and Martin Hasler. Quantization effects in Kohonen networks. In Marie Cottrell, editor, Proceedings of Congrès Satellite du Congrès Européen de Mathématique, Paris, 1992.
9. Marie Cottrell and Jean-Claude Fort. Étude d'un algorithme d'auto-organisation. Annales de l'Institut Henri Poincaré, 23(1):1-20, 1987.
10. Edgar Erwin, Klaus Obermayer, and Klaus Schulten. Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67:47-55, 1992.
11. Jean-Claude Fort and Gilles Pagès. About the Kohonen algorithm: Strong or weak self organization? Neural Networks, 9:773-785, 1996.
12. J. Adrian Flanagan. Self-organising neural networks. PhD Thesis, École Polytechnique Fédérale de Lausanne, Lausanne, 1994.
13. P. Heim, B. Hochet, and E. Vittoz. Generation of learning neighbourhood in Kohonen feature maps by means of simple nonlinear network. Electronics Letters, 27(3):275-277, 1991.
14. Vincent Peiris. Mixed Analog Digital VLSI Implementation of a Kohonen Neural Network. PhD Thesis, École Polytechnique Fédérale de Lausanne, Lausanne, 1994.
15. Paolo Ienne, Patrick Thiran, and Nikolaos Vassilas. Modified self-organising feature map algorithms for efficient digital hardware implementation. IEEE Transactions on Neural Networks, 8(2):315-330, 1997.
On the Optimization of Self-Organizing Maps by Genetic Algorithms

Daniel Polani
Institut für Informatik, Johannes Gutenberg-Universität, D-55099 Mainz, Germany
polani@informatik.uni-mainz.de

This paper gives an overview of the current state of research concerning the genetic optimization of Self-Organizing Maps. Philosophy, methods and results as well as future prospects of the approach are discussed.

1. INTRODUCTION
Many of the modern paradigms of computation are motivated by biology. They combine inherent parallelizability and the hope that algorithms useful in the complex environment of living beings will also prove useful for the solution of difficult artificial problems. Neural Networks (NN) and Genetic Algorithms (GA) are such biologically motivated paradigms that have received considerable attention in recent years. While NN models derive their inspiration from the information processing mechanisms in living beings, GAs are a simplified model of the Darwinian concept of evolution. It is expected that a combination of NN and GA is able to yield systems with a high degree of adaptivity and robustness, in analogy to living systems. Much of the work combining both paradigms concentrates on the GA optimization of feedforward networks, which are usually trained with variants of backpropagation. In contrast, there is not much work applying GAs to the Self-Organizing Map (SOM). Backpropagation training is used for supervised learning, where the task is to minimize the deviation between the output of the network and some given training data. Standard backpropagation acts as a gradient descent method and is therefore troubled by local optima. Among other methods devised to enhance backpropagation, some attention has also been devoted to GAs, since they are more robust search algorithms w.r.t. local optima than gradient search. However, SOMs are trained in an unsupervised fashion and there is no "deviation" from training data to minimize. Instead, the dynamics of the network is given a priori [19] and no explicit goal of the training process is formulated. The "self-organization" of a SOM is a stochastic process that develops towards a "topologically organized" state. A natural definition of organization exists only for the one-dimensional case and does not extend to higher dimensions. This lack of a canonical organization measure is a crucial problem for the application of GAs, since these require some selection criterion. It is the purpose of this paper to discuss this and other problems occurring in the context of GA optimization of SOMs.
We will give a review of the concepts of SOM evolution and its research. First we develop a framework for the study of SOM optimization by GAs, where we will dwell on specific problems in more detail and present approaches to solve them. An overview of some relevant studies will then be given. Finally, we will indicate some open questions and point out directions for future research.
2. THE FRAMEWORK

In this section we will set up the framework required for discussing the issues of SOM optimization using GAs. First, we will introduce some notation regarding SOMs. Then, we will briefly review the essentials of GAs and their application to NN optimization. We will present the issue of chromosome encoding and introduce organization measures to be used as fitness functions.

2.1. The Self-Organizing Map: Definitions and Conventions

In the following we give some definitions which will be used throughout the text. Define the SOM as a map w : A → V from a discrete finite set A of (formal) neurons to a convex subset V of ℝ^d, the input space. A is called the output space. Each neuron j is mapped to its weight, which we write w_j instead of w(j). A metric d_A is induced on A by the adjacency structure of an undirected weightless graph, the Kohonen graph, of which A is the vertex set; on V, d_V is chosen as the Euclidean metric. A left-inverse map i*_w : V → A to w is defined by i*_w(x) := argmin_i d_V(x, w_i), where ties are arbitrarily broken¹. In the following we will always write i* instead of i*_w for notational convenience. Given an initial map w(0), the SOM learning rule defines a sequence of maps (w(t))_{t=1,2,...} via

$$\Delta w_j(t) = \varepsilon(t)\, h_t\bigl(i^*(x(t)), j\bigr)\,\bigl(x(t) - w_j(t)\bigr)$$

for all neurons j, with x(t) being the training input, ε(t) the learning rate, and h_t the activation profile at time t.
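As an illustration of these definitions, here is a minimal sketch (Python; all names are ours, not code from the paper). It computes the graph metric d_A by breadth-first search on a Kohonen graph given as an adjacency list, the winner map i* with lowest-index tie-breaking, and one application of the learning rule.

```python
from collections import deque
import numpy as np

def graph_distances(adj, source):
    """d_A(source, .) on the Kohonen graph, by breadth-first search.
    adj: dict mapping each neuron to the list of its neighbours."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def winner(weights, x):
    """i*(x): index of the weight closest to x in d_V; ties broken by lowest index."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

def som_step(weights, adj, x, eps, h):
    """One application of the rule: Delta w_j = eps * h(d_A(i*, j)) * (x - w_j)."""
    i_star = winner(weights, x)
    d = graph_distances(adj, i_star)
    for j in range(len(weights)):
        weights[j] += eps * h(d.get(j, np.inf)) * (x - weights[j])
    return weights
```

For a chain of N neurons, adj = {i: [k for k in (i - 1, i + 1) if 0 <= k < N] for i in range(N)} recovers the usual one-dimensional map.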
2.2. Genetic Algorithms

We will give an outline of the GA principles only at a very general level relevant to our considerations. Consult e.g. [13,23] for details. Essentially, a GA can be considered as a population-based, semi-random directed search. The GA operates on a population of chromosomes, which are test solutions for the problem at hand. The existence of a whole population of test solutions is a central difference between Evolutionary Algorithms and other search paradigms. On the space Ω of chromosomes ω we have two main operators, the mutation and the recombination. The mutation is an operator able to change a chromosome ω into a chromosome ω′. The recombination operator creates one or more new chromosomes from two chromosomes in the population. Apart from its biological motivation, its use is based on the building-block hypothesis, i.e. the assumption that the chromosomes in Ω can be taken apart and recombined to form new chromosomes so as to maintain favorable properties w.r.t. selection [13]. Finally, the GA is directed by the selection operator. During a GA run the selection

¹ We will only consider continuous input distributions. If the input distributions are concentrated on discrete points, or particularly if d_V is not Euclidean, the training process may be significantly affected by the way the ties are broken [26].
[Figure 1: block diagram of the GANN cycle, with mutation, recombination and selection forming the core GA, and the chromosome being transcribed into a raw network that is trained and evaluated.]
Figure 1. Optimization of NN models by GAs
operator alternates with mutation and recombination, selecting a certain portion of the current population for survival and discarding the rest. The selection operator models the "survival of the fittest" paradigm from Darwinian evolution and implements the preferences or goals of the GA designer. We will represent selection by a fitness function f : Ω → ℝ which is to be maximized or minimized depending on the context. The dotted box in Fig. 1 denotes the core GA. From the GA point of view, the problem-specific parts outside the dotted box are completely encoded into f (symbolized by the dashed arrow in Fig. 1).
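As a rough sketch of this cycle (our own illustration, not code from the paper, and assuming a fitness to be maximized), the core GA can be written as a loop over recombination, mutation and truncation selection driven by a fitness function f; the transcribe/train/evaluate chain of Fig. 1 is hidden inside f.

```python
import random

def genetic_algorithm(init_population, fitness, mutate, recombine,
                      generations=100, survivors=0.5):
    """Minimal GA skeleton: the problem-specific parts (transcription of the
    chromosome into a network, SOM training, evaluation by an organization
    measure) are assumed to be wrapped inside the fitness function."""
    population = list(init_population)
    for _ in range(generations):
        # Selection: keep the best fraction of the population ("survival of the fittest").
        population.sort(key=fitness, reverse=True)
        population = population[:max(2, int(len(population) * survivors))]
        # Refill the population with offspring created by recombination and mutation.
        offspring = []
        while len(population) + len(offspring) < len(init_population):
            a, b = random.sample(population, 2)
            offspring.append(mutate(recombine(a, b)))
        population.extend(offspring)
    return max(population, key=fitness)
```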
2.3. Applying Genetic Algorithms to Neural Network Optimization
In the following we will refer to the optimization of NN models by GAs as GANN optimizations for brevity. Figure 1 shows the complete GANN optimization cycle. The GA generates chromosomes via mutation and recombination. A given chromosome w is transcribed, yielding a "raw" network that results from the GA without invoking any network-specific learning mechanisms. The raw network is then subjected to a training process whose dynamics is specified by the given network model. The trained network is evaluated, returning a fitness value which is then used by the GA selection operator. The designer of this procedure has three "degrees of freedom":
Transcription:
how to transform the chromosome into a NN? The transcription procedure may determine the weights of the network, the topology of the network, or the parameters that are to be used in the learning rule.
Evaluation:
how to measure the quality of the trained network? While for supervised learning natural quality measures can be specified, this is not the case for SOMs and there is a wide selection of possible organization measures that can be explored
(Sec. 2.5).
Training: how to train the network? One can use the standard training procedure accompanying the network model (e.g. backpropagation for a feed-forward network or the usual SOM training), whose parameters may be optimized by the GA (Sec. 3.1). Or, the learning rule structure itself may be determined by the chromosome (Sec. 3.2).

2.4. Chromosome Encodings

In Fig. 1 a chromosome contains all the information necessary to generate a fully specified NN model. Selecting an adequate chromosome representation and transcription mechanism is a key ingredient for successful GANN optimization approaches. Depending on the goal of the optimization one might want to optimize the parameters used in the learning rule, the learning rule structure, the weights of the network, the topology of the network or other properties. Chromosomes can contain strings over the alphabet {0,1}, but also real-valued numbers [2], or, in the case of Genetic Programming, also trees (Secs. 3.2 and 3.3.1). The mutation operator can perform a random perturbation of an entry. Recombination generates a new chromosome from two (or more) parents, where different entries may be derived from different parents.

2.5. Organization Measures as Fitness Functions

For supervised learning, say, for a regular feed-forward network, a natural evaluation function is obtained by the mean squared error between network and training output. This evaluation function is used in many GANN optimization tasks involving feed-forward networks (e.g. [22]). Such an approach is not straightforward to generalize to SOMs. Instead, we are faced with a variety of possible fitness functions that can be used. Choosing different fitness functions for optimization will give us some additional insight into their dynamics. One would like to use as fitness function some measure that determines the degree of "organization" of the Self-Organizing Map in any way. We informally suggest two plausible properties such a measure should have [27]: First, the measure should quantify the process of self-organization during training; ideally it should increase or decrease monotonically during the training process (property 1). Second, it should quantify the "topology preservation" of the SOM (property 2). A Liapunov function for the deterministic dynamical system associated with a given SOM [7] would be a good candidate for property 1. For the one-dimensional SOM the number of inversions provides such a measure [8,19], but for higher dimensions only approximations to property 1 exist [4,32]. The second property of topology preservation is even more intricate. As we define a SOM as a map from the discrete set of neurons A to V, it is tempting to use the conventional mathematical notion of topology to define topology preservation. The input space V is typically continuous and the classical notion of topology yields a rich structure. Applied naively, it gives, however, a quite uninteresting structure for A, namely the discrete topology. There have been different approaches to deal with the problem. To define a notion similar to the classical topology, additional requirements have been added to the discrete space A in [12]. The notion defined there does not include a measure to quantify the degree of topology preservation, but is just a predicate that may be fulfilled by a given SOM or not.
However, to be useful as a fitness function for GANN optimization we require a measure that distinguishes between different degrees of topology preservation. The measure from [1] is an early approach to quantify the topology preservation of a SOM. Like in [12], this measure requires an additional structure on A. A smooth version of the measure from [1] is closely related to the curvature of a manifold [26]. One of the simplest measures to use is the mean quantization error μ_Q. While it is not regarded as an organization measure in the strict sense of the word, for simplicity we will still call it so. Our simulations discussed in Sec. 3.3.3 use an estimate for the mean quantization error μ_Q by choosing a large set x_1, ..., x_q ⊂ V of inputs according to the probability distribution on V, giving

$$\mu_Q := \frac{1}{q} \sum_{k=1}^{q} d_V\bigl(w_{i^*(x_k)}, x_k\bigr)^2.$$

μ_Q has been used
as a fitness function several times (e.g. [6,16,25,26,28]). In [9] a closely related measure, an estimate of the mean distance

$$\mu_{\text{Mean dist}} := \frac{1}{q} \sum_{k=1}^{q} d_V\bigl(w_{i^*(x_k)}, x_k\bigr)$$

of a weight to the data points, has been used. Another measure used for the optimizations in [16] is the activation entropy given by μ_Entropy := −Σ_{i∈A} p_i log p_i, where p_j is the probability that the neuron j ∈ A is activated by an input signal. The experiments in [21] use a global compatibility measure similar to the measure from [14], defined via

$$\mu_{\text{Compat}} := \sum_{k=1}^{q} \sum_{j \in A} \exp\bigl(-c \cdot d_A(i^*(x_k), j)^2\bigr)\, d_V(x_k, w_j)^2.$$

For the experiments to be described in Sec. 3.3, we introduce a Hebbian measure μ_H that also depends on the data distribution and is related to the topographic function from [31]. μ_H is based on the Delaunay triangulation of the set {w_i | i ∈ A} induced by the data distribution on V [20]. In practice one chooses according to this distribution a training set x_1, ..., x_q ∈ V. Whenever x_l lies in the 2nd-order Voronoi cell V_{jk} of w_j and w_k (i.e., whenever w_j and w_k are the weights nearest and second nearest to x_l), a Hebbian connection between j and k is created with strength c_{jk} = 1 if it did not exist before or, if it did, its strength is increased by one. Thus, one obtains a graph with vertex set A whose edges have strengths c_{jk} estimating the value of ∫_{V_{jk}} p(x) dx, p being the distribution density of the data. This graph (the Hebb graph) is compared to the original Kohonen graph, determining μ_H via

$$\mu_H := 1 - \frac{\bar{c}\,\bigl|\mathcal{E}_K \setminus \mathcal{E}_H\bigr| + \sum_{(j,k)\in\mathcal{E}_H\setminus\mathcal{E}_K} c_{jk}}{\bar{c}\,\bigl|\mathcal{E}_K\bigr| + \sum_{(j,k)\in\mathcal{E}_H} c_{jk}},$$

where ℰ_K denotes the edge set of the Kohonen graph, ℰ_H the edge set of the Hebb graph, and c̄ := (Σ_{(j,k)∈ℰ_H} c_{jk}) / |ℰ_H| the average strength of the Hebb graph edges. The value of
the deviation is subtracted from 1; thus μ_H gives values just below 1 for good matches, i.e. topology preservation, and small values for distorted embeddings. The Hebbian measure essentially determines the Kohonen edges that do not match Hebbian ones and vice versa (refer to [26,27] for details). Apart from these, there are quite a few other measures and measure classes whose behavior under the GA has not yet been explored [3,10,14,17,30-32].
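The two measures used most heavily below can be estimated in a few lines. The following sketch (Python; the function names and exact normalizations are our own reading of the definitions above, not code from the paper) computes the quantization error estimate and the Hebbian measure from a sample x_1, ..., x_q.

```python
import numpy as np
from collections import defaultdict

def quantization_error(weights, samples):
    """Estimate of mu_Q: mean squared distance of each sample to its winner weight."""
    d = np.linalg.norm(samples[:, None, :] - weights[None, :, :], axis=2)  # (q, N)
    return float(np.mean(np.min(d, axis=1) ** 2))

def hebb_measure(weights, samples, kohonen_edges):
    """Estimate of mu_H from the 2nd-order Voronoi statistics of the sample."""
    d = np.linalg.norm(samples[:, None, :] - weights[None, :, :], axis=2)
    c = defaultdict(int)                      # Hebb graph edge strengths c_jk
    for row in d:
        j, k = np.argsort(row)[:2]            # nearest and second-nearest weight
        c[frozenset((int(j), int(k)))] += 1
    E_H = set(c)                              # Hebb graph edge set
    E_K = {frozenset(e) for e in kohonen_edges}
    c_bar = sum(c.values()) / len(E_H)        # average Hebb edge strength
    num = c_bar * len(E_K - E_H) + sum(c[e] for e in E_H - E_K)
    den = c_bar * len(E_K) + sum(c.values())
    return 1.0 - num / den
```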
3. SIMULATION STUDIES
This section gives an overview of some results obtained in the studies of SOM optimization via GAs. We will first review some approaches and results obtained by general parameter optimization, in particular of learning rules. Then, a short account of the current state of structural learning rule optimization is given. Finally, a brief review of some results about SOM topology optimization is given.

3.1. General Optimization of Self-Organizing Maps

In [16] the GA is used to optimize the parameters of the learning rule and to select between a linear 25 × 1 and a square 5 × 5 network. The parameters for the SOM learning
rule are ~(t) := ~i ( ~ ) ~ , h t ( i , j ) : = exp(--d~(i'j) 2~2(t) ), with a(t) := ai ( ~~ ) ~ , T the length of the training, ~i and c] the initial and final learning rates and ai and a f the initial and final activation widths. All of these parameters are optimized by the GA. The networks are trained with input data coming from a cross shaped data set in R 2. pQ as well as #Entropy are used as a fitness criterion. The main findings of this work are that despite the two-dimensional input space a linear network seems better to maximize entropy, in which case also the final activation width should remain relatively large. On the other hand, for the quantization error to become small, the square topology is preferred by the GA. In [21] the GA evolves learning rule parameters and the initial SOM weights. The learning rule is given by s(t) = c~/(1 + 5. t) and h t ( i , j ) = e x p [ - ( ~ t + 7)dA(i,j)2], where a, ~, 7, 5 are real-valued parameters to optimize. PComp~t is used as selection criterion. The optimization is Lamarckian, i.e. after training the resulting network weights are coded back in the chromosome which is then reintroduced into the population. The GA is used as "global optimizer" to enhance the "local optimization" of the SOM. The network is trained with a L-shaped data set. The results show that this hybrid approach optimizing PCompat yields SOMs with a better value for #Q than those trained with the standard SOM rule, although not being explicitly optimized for #q. 3.2. O p t i m i z a t i o n of L e a r n i n g R u l e s The GA can be applied to the training of the SOM in different levels: it can be limited to the optimization of some numerical parameters of the standard learning rule as in Sec. 3.1, or it can extend to optimize the structure of the learning rule itself. An influential first step to structural optimization of neural network learning rules beyond simple parameter selection has been taken in [5]. The task was to optimize a supervised learning rule for a feed-forward network by a GA. The learning rule was modeled as polynomial of second order, where the variables were the activation values of neurons feeding into and being fed by the connections as well as the training values. The coefficients of this polynomial were modified by the GA. In this experiment successful learning rules evolved, among them the well-known delta rule. In [9] a similar approach is taken to encode a SOM learning rule. A polynomial is formulated where the variables are components of the weights w, the input signals x, the inverse t -1 of the training time t, and quantities y which are obtained by determining certain correlation functions between w and x. As in [5], the coefficients of the polynomial are optimized by the GA.
163 Given #Mean dist as fitness function, the GA in [9] produced learning rules that give better results w.r.t, to PMeandist than does the SOM learning rule. One would expect this, since the learning rules have been specifically evolved to optimize PMean dist. With #Mean dist as fitness function, the evolved SOMs still display a self-organization property similar in spirit to the original SOM rule. Note, however, that, due to the complex structure of the polynomial resulting from the GA search, a simple interpretation of the resulting learning rules is no more possible. A further step towards a even more structural optimization of learning rules is the use of Genetic Programming (GP) methods. A first approach in this direction has been undertaken in [6]. In GP, the GA chromosomes are representing complete algorithms (often formulated as LISP programs, which can be encoded as trees). Such a chromosome, for instance, can represent a learning rule to be optimized. The advantage is, in principle, a virtually unlimited structural freedom and flexibility. However, there is a price to pay. Apart from the required computational resources, it is necessary to provide adequate problem-dependent elementary operators (terminals) that can be used in a learning rule. Furthermore maintaining the balance of GP parameters is delicate, since the search space is vastly larger than in typical GA applications. It may therefore be advisable to introduce problem-dependent recombination and mutation operators which requires additional expertise by the algorithm designer. This currently makes it still difficult to suggest a standard approach for use of GP to optimize SOM. The work from [6] uses a modular decomposition of the learning rule to accelerate evolution and obtains some encouraging results that show that a GP optimization of the learning rule can advance the fitness of the trained networks. These results suggest that a further exploration of GP methods could be worthwhile for the future.
3.3. O p t i m i z a t i o n of N e t w o r k T o p o l o g i e s Most work on SOM optimization by GAs considers either fixed topologies [9,21] or only allows the selection of simple predefined topologies, as [16], where the choice between a linear and square SOM is encoded in the chromosome. One reason seems to be that SOMs are often used to visualize high-dimensional data and that the researchers therefore wish to stick with easily interpretable topologies for this purpose. Nevertheless, as shown in Sec. 3.3.3, topology optimization gives valuable insights into the properties of the SOM training process. Another reason may be that while there is a canonical way to encode learning parameters, this is no longer the case for structural information like network topology.
3.3.1. A p p r o a c h e s to T o p o l o g y E n c o d i n g Most attempts to encode a network topology concentrate on feed-forward networks. In [22] the topology of such a network was defined by a chromosome matrix with 0-1entries, each entry specifying whether the corresponding connection was present or not. For small networks with this encoding the GA finds topologies that perform better than hand-crafted ones. For large networks (as required for SOMs) chromosome size and thus optimization time becomes prohibitive. This problem has been attacked in [18] by a context-free adaption of Lindenmayer systems. Every chromosome codes a set of productions. The chromosome is transcribed by starting with an axiom and applying the productions. Thus, even large networks can be encoded in relatively compact chromo-
164
N
I maxsteps l a01 b01 all bll laolJbnll
Figure 2. Chromosome structure for the SOM transcription rule
somes. In this method, however, one of the problems among others is that network topologies have to be formulated as sub-networks of a network with 2 k, k E N neurons. These deficiencies led to the development of Cellular Encoding (CE) [15]. CE has been motivated by the developmental process in living organisms as well as the Genetic Programming paradigm (Sec. 3.2). It defines a language which is stored in the chromosome string. During transcription, the string is read out by a parallel production process, unfolding the network, each neuron having its own program copy and reading head. CE has been studied for some time now, but to the author's knowledge never been applied to SOM optimization. Its design is primarily aimed at the creation of the input-output architecture of feed-forward networks and not at the more homogeneous structure of the SOMs. Though in principle CE is complete in the sense that any topology can be formulated as CE string, it seems not adequate to code topologies typical for the geometrical structure of SOMs.
3.3.2. T h e S O M Transcription R u l e To study the influence of the network topology on the performance of the SO M learning rule and to be able to perform an effective search through the space of relevant topologies, a specialized SOM transcription rule based on number-theoretic principles has been developed in [25,26,28]. It transcribes a chromosome of a structure as in Fig. 2 into a Kohonen graph. Here we give only an outline of the method and direct the interested reader to the references for details. The chromosomes are partitioned into several cells. The header cells determine the number N of SOM neurons and the maximum number maxsteps of transcription steps to be performed. The rest of the chromosome is interpreted as finite sequence of double-cells of the form (at, bl) each, 1 = 0 . . . n - 1. We assume the SOM neurons being numbered from 0 to N - 1. Starting with double-cell 1 = 0 in the first transcription step, a neuron pointer moves from the current neuron i to neuron (i + az) mod N and connects it with neuron (i + at + bt) mod N, incrementing l by 1. After transcribing double-cell l = n - 1, transcription continues with 1 = 0. The rule stops when it tries to connect two already connected neurons or a neuron with itself, or when the maximum number of transcription steps has been reached. It can be shown that the SOM transcription rule is complete in the sense that any Kohonen graph can be generated for suitably large number of double-cells n. In addition, for smaller n it significantly favores graphs with a higher degree of symmetry and less connections. To study the effects of topology on self-organization it is therefore an advantage to use an input space with a higher degree of symmetry. 3.3.3. S i m u l a t i o n R e s u l t s We wish now to present some studies investigating the conditions required to evolve SOM topologies compatible with with a two-dimensional input space. In the following
165
(a) Short training time
(b) Long training time
(c) Receptive Fig. 3(b)
fields
for
Figure 3. Training in the regular input space
simulation runs only the network topology is optimized, all other parameters are set by hand and fixed for all the runs. The SOM topology is obtained by applying the transcription rule from Sec. 3.3.2 to the given chromosome. The entries from Fig. 2 were coded as binary strings, with 1 _< N _~ 256 and 0 _< maxsteps _~ 4080. The SOM weights were all initialized with random values from [0.45, 0.55] 2 and the SOM was trained with the equidistribution on the unit square. Both the unit square without (regular) and with periodic boundary conditions (toroidal) are used as input space. We give an overview over some important results. For full details of the simulation parameters and results, please generally refer to [26]. The runs we consider first use pQ as fitness. The SOMs are displayed in the standard representation, i.e. we plot the weights in input space and connect two weights whenever the corresponding neurons are adjacent in the Kohonen graph. A first general observation is that for successfully evolved SOMs maxsteps is practically always so large that the connections encoded in the chromosome are fully developed. In other words, the connection potential in the chromosome is always fully exploited and setting maxsteps - ee would not affect the results. Also N is close to its maximum possible value, since this implies a better covering of input space and therefore smaller quantization errors. N cannot always reach its maximum value of N - 256 because of its interaction with the other chromosome components during transcription. First we consider SOMs after only a short training time (300 steps) before evaluation. In particular the activation profile still includes neighbors at distance 3 in A. Typical successful SOMs then look similar to Fig. 3(a). Note that the neurons are still in the process of circular expansion around the initialization region in the center and that many of them have not yet left the center. The next step is to consider longer training times (3000 steps, the activation profile now includes only neighbors at distance I). There are many variations in the resulting SOMs, of which Fig. 3(b) shows one. The network has expanded over the input space and there are no loose connections (for short training times there often are). The network is close to being planar. This effect is due to the much lower symmetry of the input space as compared to the graph generated by the SOM transcription rule. As the symmetry of the input space becomes higher and therefore
166
(a) Best SOM found for quantization error
(b) Receptive fields for SOM from Fig. 4(a)
Figure 4. Training in the toroidal input space
more compatible to the high symmetry topologies generated by the transcription rule, the selection pressure towards planar topologies diminishes. We consider now the toroidal input space. Figure 4(a) shows the best SOM found in 50 GA runs of 3000 generations each. The SOM is not planar and this is quite typical for SOMs generated with the quantization error as fitness. The high connectivity seems to be advantageous to distribute the weights quickly and uniformly in the input space. Figure 4(b) shows the corresponding receptive fields which indeed show a high isotropy and a good quantization of the input space, which is clearly better than in Fig. 3(c). In the next runs we study the resulting topologies with PH as fitness function. If we now try to naively optimize it, the GA indeed finds an optimal solution, e.g. a SOM with only 2 connected neurons and a single Kohonen connection. It is optimal, because the Hebb graph from Sec. 2.5 always has a single connection of full weight which matches perfectly the Kohonen connection. To obtain more meaningful results one has to force the GA to search for large networks by weighting the Hebb measure by the size N of the SOM.
1
(a) SOM resulting from Hebb measure as fitness
(b) Receptive fields for Fig. 5(a)
I
(c) SOM resulting from ~QH as fitness
Figure 5. Optimization of weighted Hebb measure and of the hybrid m e a s u r e PQH
167 Then indeed a large number of neurons is created. The use of #H reduces the deviation from planarity as compared to #Q, but the result still does not show a network with two-dimensional topology as would have been desirable. Instead, it yields consistently a couple of one-dimensional chains (Fig. 5(a)). Further analysis confirms that the Hebbian measure does not single out well SOMs with too low embedding dimensionality. But a glance at the structure of its receptive fields in Fig. 5(b) reveals how a remedy can be found: the receptive fields of the network are far from being isotropic, but instead they are elongated. We know from the results above that optimizing the quantization error counters just that effect. Thus, by combining quantization error and Hebbian measure we obtain a hybrid measure via #QH -- #H/ #2Q. Using this measure, one then obtains consistently planar SOM topologies as in Fig. 5(c) as best results. These investigations show that the GA is a valuable tool to assess the quality of an organization measure. Often the performance of a newly developed organization measure is tested for only very few network topologies; or it is assessed by its performance with realworld data distributions of complicated structures, which make its merits or deficiencies difficult to elicit. Using a GA according to above procedure, such analyses can be made systematic. 4. S U M M A R Y
AND OUTLOOK
The present paper reviewed research about the genetic optimization of of SO Ms. The optimization of learning rule parameters and of initial weights are able to improve network performance. The latter, however, requires chromosome sizes proportional to the size of the SOM and becomes unwieldy for large networks. The optimization of learning rule structures leads to self-organization processes of character similar to the standard learning rule. A particularly strong potential lies in the optimization of SOM topologies. It allows to study global dynamical properties of SOMs and related models as well as to develop tools for their analysis. Though quite some effort has been put into the investigation of evolution of NN models in the last years, SOM evolution research is still at the beginning. One reason for that certainly lies in the lack of a natural organization measure. A few of the existing measures listed Sec. 2.5 have been studied, but use of the other measures is certainly indicated in the future. This will also help to separate measures useful in practice from measures of only theoretical interest. In certain situations there may be a further way to determine a network fitness. If one uses Motoric Maps [29] as control for some agent in a reinforcement learning problem, the values from a reward function could be used as fitness. Hierarchies of SO Ms are sometimes used for classification tasks. A possible application of GAs would be the evolution of those hierarchies as well as the filters used for data preprocessing. Finally, one of the most important open questions from the point of view of pure research as of applications is, how network structures should be encoded in a GA chromosome to attain a "creative" evolution process. This question has not yet been satisfactorily solved and currently it is hard to say how far we are from an answer. In this paper we have mentioned different approaches to code networks, most of them directed only to the creation of feed-forward networks. The SOM transcription rule described in Sec. 3.3.2
168 has been developed for specialized study of certain properties of SOM evolution. Yet, we are still lacking a transcription procedure useful to model the developmental process taking place in living organisms, or at least to provide us with a topology adequate for the problems at hand. A solution might lie in morphogenetic approaches combining genetic and environmental influences to develop a network, which is a current topic of Artificial Life research. Here also a Darwinian selection process during the developmental phase itself could be considered [11]. Another approach might adapt the SANE model [24], where evolution takes place on the individual neuron instead of the network level, to the evolution of SOMs. Self-Organizing Maps have by now established themselves as a basis for the development of useful models for cortical maps, by which they had been motivated and influenced. Implementing a successful developmental process to transcribe a GA chromosome into a working SOM network would be a further key step towards an understanding of the principles guiding natural evolution to produce a mechanism like the brain. Perhaps in future, together with Genetic Algorithms, Self-Organizing Maps will provide a valuable tool to study also this important and intriguing question of modern research. REFERENCES 1. M.D. Alder, R. Togneri, and Y. Attikiouzel. Dimension of the speech space. Proceedings of the IEE-1, 138(3):207-214, June 1991. 2. Thomas Bs Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, New York, 1996. 3. Hans-Ulrich Bauer and Klaus R. Pawelzik. Quantifying the neighbourhood preservation of Self-Organizing Feature Maps. IEEE Trans. on Neural Networks, 3(4):570-579, July 1992. 4. Marco Budinich and John G. Taylor. On the ordering conditions for self-organizing maps. Neural Computation, 7(2):284-289, 1995. 5. David J. Chalmers. The evolution of learning: an experiment in genetic connectionism. In D.S. Touretzky, J.L. Elman, T.J. Sejnowski, and G.E. Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, San Mateo, California, 1990. Morgan Kaufmann. 6. K.G. Char. Constructive learning with genetic programming. In EuroGP-98, 1998. 7. M. Cottrell, J. C. Fort, and G. Pages. Theoretical aspects of the SOM algorithm. In WSOM '97: Workshop on Self-Organizing Maps, pages 246-267, Espoo, Finland, June 1997. 8. M. Cottrell, J.C. Fort, and G. Pages. Two or three things that we know about the Kohonen algorithm. In Proc. ESANN, pages 235-244, Brussels, 1994. 9. Ali Da~dan and Kemal Oflazer. Genetic synthesis of unsupervised learning algorithms. Technical Report CIS 9309, Department of Computer Engineering and Information Science, Bilkent University, 06533 Bilkent, Ankara, Turkey,
10. P. Demartines and F. Blayo. Kohonen self-organizing maps: Is the normalization necessary? Complex Systems, 6(2):105-123, April 1992.
11. G. Edelman. Neural Darwinism: Theory of Neuronal Group Selection. Basic Books, New York, 1987.
12. Jean-Claude Fort and Gilles Pages. About the Kohonen algorithm: Strong or weak self-organization? In Michel Verleysen, editor, ESANN '95 - Proceedings of the 3rd European Symposium on Artificial Neural Networks, pages 9-14, Brussels, 1995. D facto.
13. David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
14. G. J. Goodhill and T. J. Sejnowski. A unifying objective function for topographic mappings. Neural Computation, 9:1291-1304, 1997.
15. Frederic Gruau. Neural Network Synthesis using Cellular Encoding and the Genetic Algorithm. PhD thesis, Ecole Normale Superieure de Lyon, Jan. 1994.
16. Steven Alex Harp and Tariq Samad. Genetic optimization of Self-Organizing Feature Maps. In Proc. IJCNN, volume 1, pages 341-346, 1991.
17. S. Kaski and K. Lagus. Comparing Self-Organizing Maps. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Proceedings of ICANN96, volume 1112 of Lecture Notes in Computer Science, pages 809-814, 1996.
18. H. Kitano. Designing neural networks using genetic algorithms with graph generation systems. Complex Systems, 4:461-476, 1990.
19. Teuvo Kohonen. Self-Organization and Associative Memory, volume 8 of Springer Series in Information Sciences. Springer-Verlag, Berlin, Heidelberg, New York, 3rd edition, May 1989.
20. Thomas Martinetz and Klaus Schulten. Topology representing networks. Neural Networks, 7(2), 1994.
21. M. McInerney and A. Dhawan. Training the Self-Organizing Feature Map using Hybrids of Genetic and Kohonen Methods. In Proc. ICNN'94, Int. Conf. on Neural Networks, pages 641-644, Piscataway, NJ, 1994. IEEE Service Center.
22. G. F. Miller, P. M. Todd, and S. U. Hedge. Designing neural networks using genetic algorithms. In D. Schaffer, editor, Proc. 3rd International Conference on Genetic Algorithms, pages 379-384, 1989.
23. Melanie Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.
24. D. Moriarty and R. Miikkulainen. Forming neural networks through efficient and adaptive co-evolution. Evolutionary Computation, 5:373-399, 1997.
25. D. Polani and T. Uthmann. Adaptation of Kohonen Feature Map Topologies by Genetic Algorithms. In R. Männer and B. Manderick, editors, Parallel Problem Solving from Nature, 2, pages 421-429. Elsevier Science Publishers B.V., September 28-30 1992.
26. Daniel Polani. Adaption der Topologie von Kohonen-Karten durch Genetische Algorithmen, volume 143 of Dissertationen zur Künstlichen Intelligenz. Infix, June 1996. (In German).
27. Daniel Polani. Organization Measures for Self-Organizing Maps. In Teuvo Kohonen, editor, Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), pages 280-285. Helsinki University of Technology, June 1997.
28. Daniel Polani and Thomas Uthmann. Training Kohonen Feature Maps in different Topologies: an Analysis using Genetic Algorithms. In Proc. 5th International Conference on Genetic Algorithms, pages 326-333, 1993.
29. Helge Ritter, Thomas Martinetz, and Klaus Schulten. Neuronale Netze. Addison-Wesley, 1994.
30. J. W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18:401-409, 1969.
31. Thomas Villmann. Topologieerhaltung in selbstorganisierenden neuronalen Merkmalskarten. PhD thesis, Universität Leipzig, 1996.
32. Stéphane Zrehen and Francois Blayo. A geometric organization measure for Kohonen's map. In Proc. of Neuro-Nîmes, pages 603-610, 1992.
Kohonen Maps. E. Oja and S. Kaski, editors
© 1999 Elsevier Science B.V. All rights reserved
Self organization of a massive text document collection

Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, and Antti Saarela

Helsinki University of Technology, Neural Networks Research Centre, P.O. Box 2200, FIN-02015 Helsinki, Finland

When the SOM is applied to the mapping of documents, one can represent them statistically by their weighted word frequency histograms, or by some reduced representations of the histograms, that can be regarded as data vectors. We have made such a SOM of about seven million documents, viz. of all of the patent abstracts in the world that have been written in English and are available in electronic form. The map consists of about one million models (nodes). Keywords or key texts can be used to search for the most relevant documents first. New effective coding and computational schemes of the mapping are described.

1. INTRODUCTION

In the vast majority of SOM applications, the input data consists of measurements, signal values, or statistical descriptors that form high-dimensional real feature vectors. It is also possible to form similarity graphs of text documents by the SOM principle, when models that describe collections of words in the documents are used. The models can simply be weighted histograms of the words regarded as real vectors, but usually some dimensionality reduction of the very high-dimensional histograms is carried out, as we shall see below. A document organization, searching and browsing system called the WEBSOM (actually its newest version WEBSOM2) is described in this paper. The original WEBSOM [1] was a two-level SOM architecture, but we later simplified it as described in this paper; at the same time we introduced several speed-up methods to the computation and increased the document map size by an order of magnitude.

2. STATISTICAL MODELS OF DOCUMENTS

2.1. The primitive vector space model
In the basic vector space model [2] the stored documents are represented as real vectors in which each component corresponds to the frequency of occurrence of a particular word in the document: the weighted word histogram can be viewed as a model or document vector. For the weighting of a word according to its significance, one can use the Shannon entropy over the document classes, or the inverse of the number of documents in which the word occurs ("inverse document frequency"). The main problem of the vector space
model is the large vocabulary in any sizable collection of free-text documents, which means a huge dimensionality of the model vectors.

2.2. Latent semantic indexing (LSI)
In an attempt to reduce the dimensionality of the document vectors, one often first forms a matrix in which each column corresponds to the word histogram of a document, with one column for each document. After that the factors of the space spanned by the column vectors are computed by a method called the singular-value decomposition (SVD), and the factors that have the least influence on the matrix are omitted. The document vector formed from the histogram of the remaining factors then has a much smaller dimensionality. This method is called latent semantic indexing (LSI) [3].
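As a rough illustration of the LSI idea, the following sketch reduces a small term-document matrix to a few latent factors with a truncated SVD. The matrix A, the factor count k, and the dense SVD call are illustrative choices of this sketch, not the exact procedure of the original LSI work.

```python
import numpy as np

def lsi_project(term_doc_matrix, k):
    """Reduce documents to k latent factors via a truncated SVD (LSI-style).

    term_doc_matrix: (n_terms, n_docs) array of weighted word counts.
    Returns a (k, n_docs) matrix of low-dimensional document vectors.
    """
    # Full SVD; for large collections a sparse/truncated solver would be used instead.
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Keep only the k strongest factors and drop the rest.
    return np.diag(s[:k]) @ Vt[:k, :]

# Toy usage: 6-term vocabulary, 4 documents, reduced to 2 factors.
A = np.random.rand(6, 4)
docs_lsi = lsi_project(A, k=2)   # shape (2, 4)
```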
2.3. Randomly projected histograms
We have shown experimentally that the dimensionality of the document vectors can be reduced radically by a much simpler method than the LSI, by a simple random projection method [4], [5], without essentially losing the power of discrimination between the documents. Consider the original document vector (weighted histogram) n_i ∈ ℝ^n and a rectangular random matrix R, the columns of which are normally distributed vectors normalized to unit length. Let us form the document vectors as the projections x_i ∈ ℝ^m, where m ≪ n:

x_i = R n_i .   (1)
It has transpired in our experiments that if m is at least of the order of 100, the similarity relations between arbitrary pairs of projection vectors (x_i, x_j) are very good approximations of the corresponding relations between the original document vectors (n_i, n_j), and the computing load of the projections is reasonable; on the other hand, with the decreasing dimensionality of the document vectors, the time needed to classify a document is radically decreased.
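A minimal sketch of the random projection of equation (1), assuming a dense matrix R with unit-length, normally distributed columns; the vocabulary size, the projected dimension, and the toy histogram are illustrative.

```python
import numpy as np

def random_projection_matrix(n, m, rng):
    """Random matrix R with normally distributed, unit-length columns (cf. Eq. (1))."""
    R = rng.standard_normal((m, n))
    R /= np.linalg.norm(R, axis=0)      # normalize each column to unit length
    return R

rng = np.random.default_rng(0)
n, m = 5789, 315                        # vocabulary size and projected dimension (as in Table 1)
R = random_projection_matrix(n, m, rng)

n_i = rng.random(n)                     # a weighted word histogram (illustrative)
x_i = R @ n_i                           # projected document vector, Eq. (1)
```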
2.4. Histograms on the word category map
In our original version of the WEBSOM [1], the reduction of the dimensionality of the document vectors was carried out by letting the words of free natural text be clustered onto neighboring grid points of another special SOM. The input to such a "word category map" [1] consisted of triplets of adjacent words in the text taken over a moving window, whereupon each word in the vocabulary was represented by a unique random vector. Later we abandoned the word category map, since an even better accuracy of document classification was achieved by the straightforward random projection of the word histograms.

2.5. Construction of random projections of word histograms by pointers
Before describing the new encoding of the documents [6] used in this work, we present some preliminary experimental results that motivate its idea. Table 1 compares a few projection methods in which the model vectors, except in the first case, were always 315-dimensional. For the material in this smaller-scale preliminary experiment we used 18,540 English documents from 20 Usenet newsgroups of the Internet. When the text was preprocessed as explained in Sec. 5, the remaining vocabulary consisted of 5,789 words
or word forms. Each document was mapped onto one of the grid points of the SOM. All documents that represented a minority newsgroup at any grid point were counted as classification errors.
The classification accuracy of 68.0 per cent reported on the first row of Table 1 refers to a classification that was carried out with the classical vector space model, with full 5,789-dimensional histograms as document vectors. In practice, this kind of classification would be orders of magnitude too slow. Random projection of the original document vectors onto a 315-dimensional space yielded, within the statistical accuracy of computation, the same figures as the basic vector space method. This is reported on the second row. The figures are averages from seven statistically independent tests, as in the rest of the cases. Consider now that we want to simplify the projection matrix R in order to speed up computations. We do this by thresholding the matrix elements, or by using sparse matrices. Such experiments are reported next. The remaining rows have the following meaning: third row, the originally random matrix elements were thresholded to +1 or -1; fourth row, exactly 5 randomly distributed ones were generated in each column, whereas the other elements were zeros; fifth row, the number of ones was 3; and sixth row, the number of ones was 2, respectively.
Table 1
Classification accuracies of documents, in per cent, with different projection matrices R. The figures are averages from seven test runs with different random elements of R.

Projection method               Accuracy   Std. dev. due to different randomization of R
Vector space model              68.0       -
Normally distributed R          68.0       0.2
Thresholding to +1 or -1        67.9       0.2
5 ones in each column           67.8       0.3
3 ones in each column           67.4       0.2
2 ones in each column           67.3       0.2
These results now suggest the following idea: if we, upon formation of the random projection, reserve a memory array like an accumulator for the document vector x, another array for the weighted histogram n, and permanent address pointers from all the locations of the n array to all such locations of the x array for which the corresponding matrix element of R is equal to one, we can form the product very fast by following the pointers and summing up to x those components of the n vector that are indicated by the ones of R. In the method that is actually being used we do not project ready histograms; instead, the pointers are already used with each word in the text in the construction of the low-dimensional document vectors. When scanning the text, the hash address for each word is formed, and if the word resides in the hash table, those elements of the x array that are found by the (say, three) address pointers stored at the corresponding hash table location are incremented by the weight value of that word. The weighted, randomly projected word histogram obtained in the above way may be optionally normalized. The computing time needed to form the histograms in the above way was about 20 per cent of that of the usual matrix-product method. This is due to the fact that the histograms, and also their projections, contain plenty of zero elements.
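The pointer-based construction can be sketched as follows. The sketch assumes a small illustrative vocabulary, three pointers per word, and externally supplied word weights; the hash table of the text is replaced by a plain dictionary for brevity.

```python
import numpy as np

def build_pointers(vocabulary, m, ones_per_word=3, seed=0):
    """Assign each word a few fixed random target indices in the m-dimensional
    document vector (the 'ones' of a sparse projection matrix R)."""
    rng = np.random.default_rng(seed)
    return {w: rng.choice(m, size=ones_per_word, replace=False) for w in vocabulary}

def project_document(words, weights, pointers, m):
    """Accumulate the projected document vector directly while scanning the text."""
    x = np.zeros(m)
    for w in words:
        if w in pointers:                      # words outside the vocabulary are skipped
            x[pointers[w]] += weights[w]       # follow the pointers, add the word's weight
    return x

# Illustrative usage with made-up words and weights (e.g. entropy-based weights).
pointers = build_pointers(["coffee", "machine", "retrieval"], m=315)
weights = {"coffee": 1.7, "machine": 0.9, "retrieval": 2.1}
x = project_document(["coffee", "machine", "coffee"], weights, pointers, m=315)
```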
3. RAPID CONSTRUCTION OF A LARGE DOCUMENT MAP
The SOM algorithm is capable of organizing even a randomly initialized map. However, if the initialization is regular and closer to the final state, the asymptotic convergence of the map can be made at least an order of magnitude faster. Below we introduce several speed-up methods by which, first, a reasonable approximation for the initial state is formed and then the stationary state of the SOM algorithm is reached effectively by a combination of various approximation methods. Let us denote the ith model vector of the SOM by m_i.

3.1. Fast distance computation
In word histograms there are plenty of zeros, and if the pointer method of random projection is used, the number of zeros in the projected document vectors is still predominant. The document vectors are mapped onto the SOM according to their inner products with the model vectors. Since the zero-valued components of the vectors do not contribute to inner products, it is possible to tabulate the indices of the non-zero components of each input vector and thereafter consider only those components when computing distances. A related method has been proposed for computing Euclidean distances between sparse vectors [7]. Our formulation is, however, simpler when only inner products are needed.
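A sketch of the tabulation idea, assuming the model vectors are stored as a dense array and the input vector is sparse; only the non-zero components of the input contribute to the inner products.

```python
import numpy as np

def sparse_inner_products(x, models):
    """Inner products between a sparse input vector x and all model vectors,
    using only the non-zero components of x."""
    nz = np.flatnonzero(x)              # tabulate indices of non-zero components once
    return models[:, nz] @ x[nz]        # zero components contribute nothing

# Illustrative usage: 1000 model vectors of dimension 500, a sparse document vector.
models = np.random.rand(1000, 500)
x = np.zeros(500)
x[np.random.choice(500, 15, replace=False)] = 1.0
winner = int(np.argmax(sparse_inner_products(x, models)))
```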
3.2. Estimation of larger maps based on carefully constructed smaller ones
Estimating initial values for a large SOM. Several suggestions have been made for increasing the number of nodes of the SOM during its construction (cf., e.g., [8]). The new idea presented below is to estimate good initial values for the model vectors of a very large map on the basis of the asymptotic values of the model vectors of a much smaller map. As the general nature of the SOM process and its asymptotic states are now fairly well known, we can utilize some "expert knowledge" here. For instance, consider first a rectangular two-dimensional SOM array with two-dimensional input vectors. If the probability density function (pdf) of the input is uniform in a rectangular domain and zero outside it, there is a characteristic "shrink" of the distribution of the model vectors with respect to the borders of the pdf, whereas inside the array the model vectors can be assumed to be uniformly distributed. For an arbitrary number of grid points in the SOM array, rectangular or hexagonal, the amount of this "shrinkage" can easily be estimated. Consider then that the input has an arbitrary higher dimensionality and an arbitrary pdf, which, however, is continuous and smooth. Even then, the relative "shrinkage" and the relative local differences of the new model vectors are similar to those in the uniform case (cf., e.g., [11], Fig. 3.7). Consider again a pdf that is uniform over a two-dimensional rectangular area. This
same area is now approximated by either the set of vectors {m'_i^(d) ∈ ℝ^2} or by {m'_i^(s) ∈ ℝ^2}, where the superscript d refers to the "dense" lattice and s to the "sparse" lattice, respectively. If the three "sparse" vectors m'_i^(s), m'_j^(s), and m'_k^(s) do not lie on the same straight line, then in the two-dimensional signal plane any "dense" vector m'_h^(d) can be approximated by the linear combination

m'_h^(d) = α_h m'_i^(s) + β_h m'_j^(s) + (1 − α_h − β_h) m'_k^(s) ,   (2)
where α_h and β_h are interpolation-extrapolation coefficients. This is a two-dimensional vector equation from which the two unknown scalars α_h and β_h can be solved. Consider then another, nonuniform, but still smooth pdf in a space of arbitrary dimensionality, and the two SOM lattices with the same topology but with different density as in the ideal example. When the true pdf is arbitrary, we may not assume the lattices of true codebook vectors to be planar. Nonetheless we can perform a local linear estimation of the true codebook vectors m_h^(d) ∈ ℝ^n of the "dense" lattice on the basis of the true codebook vectors m_i^(s), m_j^(s), and m_k^(s) ∈ ℝ^n of the "sparse" lattice, using the same interpolation-extrapolation coefficients as in (2). In practice, in order that the linear estimate be most accurate, the respective indices h, i, j, and k should be such that m'_i^(s), m'_j^(s), and m'_k^(s) are the three codebook vectors
closest to m'_h^(d) in the signal space (but not on the same line). With α_h and β_h solved from (2) for each node h separately, we obtain the wanted interpolation-extrapolation formula as
m_h^(d) = α_h m_i^(s) + β_h m_j^(s) + (1 − α_h − β_h) m_k^(s) .   (3)
Notice that the indices h, i, j, and k refer to topologically identical lattice points in (2) and (3). The interpolation-extrapolation coefficients for two-dimensional lattices depend on their topology and the neighborhood function used in the last phase of learning. For best results the "stiffness" of both the "sparse" and the "dense" map should be the same, i.e. the relative width of the final neighborhoods should be equal.
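The following sketch combines equations (2) and (3): for every node of the dense lattice, the three nearest nodes of the sparse lattice (in the two-dimensional signal plane) are selected, the coefficients α_h and β_h are solved from (2), and the same coefficients are applied to the true codebook vectors as in (3). The triplet selection is simplified here and does not guard against collinear triplets.

```python
import numpy as np

def estimate_dense_codebook(sparse_grid2d, dense_grid2d, sparse_codebook):
    """Estimate codebook vectors of a dense SOM lattice from a converged sparse one.

    sparse_grid2d   : (Ns, 2) ideal two-dimensional model vectors of the sparse lattice
    dense_grid2d    : (Nd, 2) ideal two-dimensional model vectors of the dense lattice
    sparse_codebook : (Ns, n) true codebook vectors of the sparse map
    """
    Nd = dense_grid2d.shape[0]
    dense_codebook = np.empty((Nd, sparse_codebook.shape[1]))
    for h in range(Nd):
        # The three sparse nodes closest to dense node h in the two-dimensional plane.
        dist = np.linalg.norm(sparse_grid2d - dense_grid2d[h], axis=1)
        i, j, k = np.argsort(dist)[:3]
        # Solve Eq. (2): m'_h = a*m'_i + b*m'_j + (1-a-b)*m'_k for the coefficients a, b.
        A = np.column_stack((sparse_grid2d[i] - sparse_grid2d[k],
                             sparse_grid2d[j] - sparse_grid2d[k]))
        alpha, beta = np.linalg.solve(A, dense_grid2d[h] - sparse_grid2d[k])
        # Apply the same coefficients to the true codebook vectors, Eq. (3).
        dense_codebook[h] = (alpha * sparse_codebook[i] + beta * sparse_codebook[j]
                             + (1 - alpha - beta) * sparse_codebook[k])
    return dense_codebook
```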
3.3. Rapid fine-tuning of the large maps Addressing old winners. Assume that we are somewhere in the middle of the training process, whereupon the SOM is already smoothly ordered although not yet asymptotically stable. Assume that the model vectors are not changed much during one iteration of training. When the same training input is used again some time later, it may be clear that the new winner is found at or in the vicinity of the old one. When the training vectors are then expressed as a linear table, with a pointer to the corresponding old winner location stored with each training vector, the map unit corresponding to the associated pointer is searched for first, and then a local search for the new winner in the neighborhood around the located unit will suffice. After the new winner location has been identified, the associated pointer in the input table is replaced by the pointer to the new winner location. This will be a significantly faster operation than an exhaustive winner search over the whole SOM. The search can first be made in the immediate surrounding of the said location, and only if the best match is found at its edge, searching is continued in the
surrounding of the preliminary best match, until the winner is one of the middle units in the search domain.

Figure 1. Finding the new winner in the vicinity of the old one, whereby the old winner is directly located by a pointer. The pointer is then updated.
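A simplified sketch of the winner search around the old winner depicted in Figure 1; the square search window, its radius, and the re-centering rule are illustrative simplifications of the edge test described above, and the stored pointer would be updated with the returned index.

```python
import numpy as np

def local_winner_search(x, models, grid_pos, old_winner, radius=1):
    """Search for the new winner only in the neighborhood of the old winner;
    re-center and widen the search as long as a better match is found."""
    c = old_winner
    while True:
        # Candidate units within 'radius' grid steps of the current best match.
        d = np.abs(grid_pos - grid_pos[c]).max(axis=1)
        cand = np.flatnonzero(d <= radius)
        best = cand[np.argmax(models[cand] @ x)]    # inner-product matching
        if best == c:                               # winner lies inside the search domain
            return best
        c = best                                    # continue around the preliminary best match
```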
Koikkalainen [9], [10] has suggested a similar speed-up method for a search-tree structure. Initialization of the pointers. When the size (number of grid nodes) of the maps is increased stepwise during learning using the estimation procedure discussed in Section 3.2, the initial pointers for all data vectors after each increase can be estimated quickly by utilizing the formula that was used in increasing the map size, equation (3). The winner is the map unit for which the inner product with the data vector is the largest, and so the inner products can be computed rapidly using the expression

x^T m_h^(d) = α_h x^T m_i^(s) + β_h x^T m_j^(s) + (1 − α_h − β_h) x^T m_k^(s) .   (4)
Here d refers to model vectors of the large map and s to those of the sparse map, respectively. Expression (4) can be interpreted as the inner product between two three-dimensional vectors, [α_h; β_h; (1 − α_h − β_h)]^T and [x^T m_i^(s); x^T m_j^(s); x^T m_k^(s)]^T, irrespective of the dimensionality of x. If necessary, the winner search can still be speeded up by restricting it to the area of the dense map that corresponds to the neighborhood of the winner on the sparse map. This is especially fast if only a subset (albeit a comprehensive subset) of all the possible triplets (i, j, k) is allowed in (3) and (4). Parallelized Batch Map algorithm. The Batch Map algorithm [11] directly aims at the solution for the equilibrium state of the SOM algorithm:
E[h_{c_k i} (x_k − m_i)] = 0   (5)
for all i. Here E denotes the expectation value, which is approximated over the input data set, and h_{c_k i} is the neighborhood function of the SOM, where c_k denotes the index of the model vector that is the best match for the kth input x_k.
In the asymptotic state of the SOM algorithm the model vectors must fulfill the condition

m_i = Σ_j n_j h_{ji} x̄_j / Σ_j n_j h_{ji} ,   (6)
where x̄_j = Σ_{k: c_k = j} x_k / n_j is the mean of the inputs that are closest to the model vector m_j, and n_j is the number of those inputs. The Batch Map algorithm consists of the iterative application of equation (6). Equation (6) allows for a very efficient parallel implementation in which extra memory need not be reserved for the new values of m_i. At each iteration we first compute the pointer c_k to the best-matching unit for each input x_k. If the old value of the pointer can be assumed to be close to the final value, as is the case if the pointer has been initialized properly or obtained in the previous iteration of a relatively well-organized map, we need not perform an exhaustive winner search, as discussed above. Moreover, since the model vectors do not change at this stage, the winner search can easily be implemented in parallel by dividing the data among the different processors of a shared-memory computer. After the pointers have been computed, the previous values of the model vectors are not needed any longer. They can be replaced by the means x̄_j, and therefore extra memory is not needed. Finally, the new values of the model vectors can be computed based on (6). This computation can also be implemented in parallel and done within the memory reserved for the model vectors if a subset of the new values of the model vectors is held in a suitably defined buffer. It should perhaps be noted that if the neighborhood function is very narrow, or if there is only a small amount of data in relation to the map size, the sum Σ_j n_j h_{ji} in (6) may become zero for some i. It has turned out in our experiments that there exists a simple remedy for this: the computation can be continued successfully by keeping the previous value of m_i in such situations (a sketch of one such iteration is given below). Saving memory by reducing representation accuracy. The memory requirements can be reduced significantly by using a coarser quantization of the vectors. We have used a common adaptive scale for all of the components of a model vector, representing each component with 8 bits only. If the dimensionality of the data vectors is large, the statistical accuracy of the distance computations is still sufficient. It is advisable to choose the correct quantization level probabilistically when computing new values for the model vectors.
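A sketch of one Batch Map iteration of equation (6). For brevity the winner search is written as a full matrix product rather than the pointer-guided local search, the neighborhood is passed as an explicit matrix, and the 8-bit quantization is omitted; parallelization over data and map units is only indicated in the comments.

```python
import numpy as np

def batch_map_step(data, models, pointers, neighborhood):
    """One Batch Map iteration (cf. Eq. (6)).

    data         : (N, dim) input vectors x_k
    models       : (M, dim) model vectors m_i (overwritten in place)
    pointers     : (N,) integer indices of the previous winners c_k (updated in place)
    neighborhood : (M, M) matrix with element [j, i] = h_ji
    """
    # 1. Winner search; in a parallel implementation the data would be split over processors.
    pointers[:] = np.argmax(data @ models.T, axis=1)
    # 2. Counts n_j and sums n_j * x_bar_j of the inputs mapped to each unit.
    M, dim = models.shape
    counts = np.bincount(pointers, minlength=M).astype(float)
    sums = np.zeros((M, dim))
    np.add.at(sums, pointers, data)
    # 3. New model vectors: m_i = sum_j n_j h_ji x_bar_j / sum_j n_j h_ji.
    numer = neighborhood.T @ sums
    denom = neighborhood.T @ counts
    ok = denom > 0                       # keep the previous m_i where the denominator is zero
    models[ok] = numer[ok] / denom[ok][:, None]
```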
4. USING THE MAPS FOR INFORMATION DISCOVERY
4.1. User interface and exploration of the document map
The document map is presented as a series of HTML pages that enable exploration of the grid points: when clicking a grid point with the mouse, links to the document database enable reading the contents of the articles. If the grid is large, subsets of it can first be viewed by zooming. There is also an automatic method for assigning descriptive signposts to map regions [12]; in deeper zooming, more signs appear. The signposts are words that appear often
in the articles in that map region and rarely elsewhere, and they are used to monitor the search.
4.2. Content-addressable search
The HTML page is provided with a form field into which the user can type a query of his or her own, in the form of a short "document." This query is preprocessed, and a document vector (histogram) is formed in the same way as for the stored documents. This histogram is then compared with the models of all grid points, and a specified number of best-matching points are marked with a symbol: the better the match, the larger the symbol. These symbols provide good starting points for browsing (a sketch of this search mode is given below). For comparison, we have also provided another option for the search. In the "keyword" mode each word of the vocabulary is indexed by pointers to those map units where these words occur, and one thereby uses a rather conventional indexed search to find the starting points for browsing.
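A sketch of the content-addressable search mode; the preprocess and project callables stand for the document preprocessing and the pointer-based projection described earlier, and are assumptions of this sketch.

```python
import numpy as np

def content_addressable_search(query_text, preprocess, project, models, n_hits=10):
    """Treat a short free-form query as a 'document': preprocess it, form its
    projected histogram, and return the best-matching map units as starting points."""
    words, weights = preprocess(query_text)          # same preprocessing as for stored documents
    q = project(words, weights)                      # e.g. the pointer-based projection above
    scores = models @ q                              # compare with the model of every grid point
    best_units = np.argsort(scores)[::-1][:n_hits]   # the better the match, the larger the symbol
    return best_units, scores[best_units]
```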
5. EXPERIMENTS: A DOCUMENT MAP OF ALL ELECTRONIC PATENT ABSTRACTS
We have constructed a map of all the 6,840,568 patent abstracts that were available in English in electronic form. The average length of the abstracts was 132 words, and in total they contained 733,179 different words (base forms). The size of the SOM was 1,002,240 models (map units).

5.1. Preprocessing
From the raw patent abstracts we first extracted the titles and the texts for further processing. We then removed non-textual information. Mathematical symbols and numbers were converted into special symbols. All words were converted to their base form using a stemmer. Words occurring fewer than 50 times in the whole corpus, as well as a set of common words on a stopword list of 1,335 words, were removed. The remaining vocabulary consisted of 43,222 words. Finally, we omitted the 122,524 abstracts in which fewer than 5 words remained.

5.2. Formation of statistical models
To reduce the dimensionality of the models we used the randomly projected word histograms. For the final dimensionality we selected 500, and 5 random pointers were used for each word (in the columns of the projection matrix R). The words were weighted using the Shannon entropy of their distribution over the subsections of the patent classification system. There are 21 subsections in the patent classification system in total; agriculture, transportation, chemistry, building, engines, and electricity are examples of such subsections (cf. Fig. 2).

5.3. Formation of the document map
The final map was constructed in four stages. First, a 435-unit document map was computed very carefully. The small map was used to estimate a larger one, which was then fine-tuned with the Batch Map algorithm. This process of estimation and fine-tuning was repeated in three steps with progressively larger maps. With the newest versions of our programs the whole process of computation of the
document map takes about six weeks on a six-processor SGI O2000 computer. We cannot yet provide exact figures for the real processing time, since we have continually developed the programs while carrying out the computations. We have computed the maps relatively carefully; reasonably organized maps could have been obtained in a much shorter time. The maximal amount of memory required was about 800 MB. Forming the user interface took an additional week of computation. This time includes finding the keywords to label the map, forming the WWW pages that are used in exploring the map, and indexing the map units for keyword searches.

5.4. Results
In order to get an idea of the quality of the organization of the final map, we measured how well the different subsections of the patent classification system were separated on the map. When each map node was labeled according to the majority of the subsections in the node and the abstracts belonging to the other subsections were considered as misclassifications, the resulting "accuracy" (actually, the "purity" of the nodes) was 64%. It should be noted that the subsections overlap partially; the same patent may have subclasses which belong to different subsections. The result corresponded well with the accuracies we have obtained with smaller maps computed on subsets of the document collection.
Figure 2. Distribution of four sample subsections of the patent classification system on the document map. The gray level indicates the logarithm of the number of patents in each node.
The distribution of patents on the final map is visualized in Fig. 2. Two case studies of document searches are depicted: the first, in Fig. 3, using the content-addressable search mode, and the second, with a more traditional keyword search, in Fig. 4.
Figure 3. Content-addressable search ("document search") was used to provide a starting point for exploration of the document map. The search for "coffee machine" directs the user to a map unit that contains several patents regarding various types of coffee machines. In the surrounding region related patents are found which concern coffee brewing, processing of coffee beans, etc. As suggested by the labels in the region, other food-related topics are located nearby. The automatically selected labels written on the display describe texts within the region. The color depicts the density of the documents in that region with light shade indicating a high density. Three zoom levels are used for such a large map.
Figure 4. The optional keyword search mode was utilized to find map units that contain abstracts about information retrieval. The best hits were marked with circles, the size of the circle indicating the goodness of the match. There seems to be a group of good hits near the label "information" (the starting point of the arrow). The contents of one of these hits are shown in the picture, and indeed seem to contain good results: several documents relate specifically to information retrieval. The largest circle on the display indicating best response is an isolated hit which, on closer inspection (not shown), turns out to contain patents not only relating to information retrieval, but also to other types of retrieval systems. In this manner the pattern of hits depicted on the display of the ordered map may aid the user in identifying the most relevant hits. This information is, of course, available in addition to any relevance judgements traditionally provided by search engines.
6. CONCLUSIONS
We have demonstrated that it is possible to scale up the SOMs in order to tackle very large-scale problems. Additionally, it has transpired in our experiments that the encoding of documents for their statistical identification can be performed much more effectively than was believed a few years ago. In particular, the various random-projection methods we introduced are as accurate in practice as the ideal theoretical vector space method, but orders of magnitude faster to compute than the eigenvalue methods (e.g., LSI) that have been used extensively to solve the problem of large dimensionality. Finally, it ought to be emphasized that the order that ensues in the WEBSOM may not represent any official taxonomy of the articles and does not serve as a basis for any automatic indexing of the documents; the "similarity" of the documents is based only on the usage of different words in them. Such similarity relationships better serve "finding" than "searching for" relevant information.

REFERENCES
1. S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, Neurocomputing 21 (1998) 101.
2. G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
3. S. Deerwester, S. Dumais, G. Furnas and K. Landauer, J. Am. Soc. Inform. Sci., 41 (1990) 391.
4. S. Kaski, Data exploration using self-organizing maps, Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82 (1997). Dr. Tech. Thesis, Helsinki University of Technology, Finland.
5. S. Kaski, Proc. of IJCNN'98, Int. Joint Conf. on Neural Networks, IEEE Press, Piscataway, NJ, 1998, p. 413.
6. T. Kohonen, Proc. 8th Int. Conf. on Artificial Neural Networks (ICANN'98), L. Niklasson, M. Boden, and T. Ziemke (eds.), Springer, London, 1998, p. 65.
7. D. Roussinov and H. Chen, CC-AI - Communication, Cognition and Artificial Intelligence 15 (1998) 81.
8. J.S. Rodrigues and L.B. Almeida, Proc. INNC'90, Int. Neural Networks Conference, Kluwer, Dordrecht, p. 813.
9. P. Koikkalainen, Proc. ECAI 94, 11th European Conf. on Artificial Intelligence, A. Cohn (ed.), John Wiley & Sons, New York, 1994, p. 211.
10. P. Koikkalainen, Proc. ICANN'95, Int. Conf. on Artificial Neural Networks, Paris, France, vol. 2, p. 63, 1995.
11. T. Kohonen, Self-Organizing Maps, Springer, Berlin, 1995, 2nd ed. 1997.
12. K. Lagus and S. Kaski, to appear in Proc. ICANN'99, Int. Conf. on Artificial Neural Networks, 1999.
Kohonen Maps. E. Oja and S. Kaski, editors
© 1999 Elsevier Science B.V. All rights reserved
Document Classification with Self-Organizing Maps
Dieter Merkl
Institut für Softwaretechnik, Technische Universität Wien
Resselgasse 3/188, A-1040 Wien, Austria
http://www.ifs.tuwien.ac.at/~dieter
The self-organizing map is a very popular unsupervised neural network for the analysis of high-dimensional input data, as encountered in information retrieval applications. The reason is that the map display provides a convenient possibility for the user to explore the contents of a document archive. From geography, however, it is known that maps are not always the best way to represent information spaces. Often it is better to provide a hierarchical view of the underlying data collection in the form of an atlas where, starting from a map representing the complete data collection, different regions are shown at finer levels of granularity. Using an atlas, the user can easily "zoom" into regions of particular interest while still having general maps for overall orientation. We show that a similar display can be obtained by using hierarchical feature maps to represent the contents of a document archive. These neural networks have a layered architecture where each layer consists of a number of individual self-organizing maps. By this, the contents of the document archive may be represented at arbitrary detail while still having the general maps available for global orientation.

1. Introduction
Today's information age may be characterized by the constant massive production and dissemination of written information. Powerful tools for exploring, searching, and organizing this mass of information are needed. Particularly the aspect of exploration has found only limited attention in the research community compared to other aspects of information retrieval systems. Current information retrieval technology still relies on systems that retrieve documents based on the similarity between keyword-based document and query representations. Typically, the retrieved documents are presented to the user in the form of ranked lists, which is not a particularly effective way to support archive exploration. An attractive way to assist the user in document archive exploration is based on self-organizing maps [5] for document space representation. A number of research publications show that this idea has found appreciation in the community [7,9-12,14,22]. Maps are used to visualize the similarity between documents in terms of distances within the two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map. This map metaphor for document space visualization, however, has its limitations in that each document is represented within one single two-dimensional map. Since the
documents are described in a very high-dimensional feature space constituted by the index terms representing the contents of the documents, the two-dimensional map representation necessarily has some imprecisions. In this paper we argue in favor of establishing a hierarchical organization of the document space based on an unsupervised neural network. In much the same way as we show the world on different pages of an atlas, where each page contains a map showing some portion of the world at some specific resolution, we suggest using a kind of atlas for document space representation [15,16]. A page of this atlas of the document space shows a portion of the library at some resolution while omitting other parts of the library. As long as general maps that provide an overview of the whole library are available, the user can find his or her way around the library by choosing maps that provide a sufficiently detailed view of the area of particular interest. More precisely, we show the effects of using the hierarchical feature map [18] for document archive organization. The distinguishing feature of this model is its layered architecture where each layer consists of a number of independent self-organizing maps. The training process results in a hierarchical arrangement of the document collection where self-organizing maps from higher layers of the hierarchy are used to represent the overall organizational principles of the document archive. Maps from lower layers of the hierarchy are used to provide fine-grained distinctions between individual documents. Such an organization comes close to what we would usually expect from conventional libraries. The remainder of this work is organized as follows. In Section 2 we give a brief description of the architectures and the training rules of the neural networks used in this study. Section 3 is dedicated to an exposition of some issues we believe to be crucial for document classification. Sections 4 and 5 provide the experimental results from document classification. The former describes the results from using the self-organizing map, i.e. library organization according to the map metaphor. The latter gives results from using the hierarchical feature map, i.e. library organization according to the atlas metaphor. Our future directions of research in the framework of self-organizing maps for document classification are outlined in Section 6. Finally, we present some conclusions in Section 7.
2. Self-organizing neural networks for document classification
2.1. Self-organizing maps
The self-organizing map [5,6] is one of the most prominent artificial neural network models adhering to the unsupervised learning paradigm. The model consists of a number of neural processing elements, i.e. units. Each of the units i is assigned an n-dimensional weight vector m_i, m_i ∈ ℝ^n. The training process of self-organizing maps may be described in terms of input pattern presentation and weight vector adaptation. Each training iteration t starts with the random selection of one input pattern x(t). This input pattern is presented to the self-organizing map, and each unit determines its activation. Usually, the Euclidean distance between the weight vector and the input pattern is used to calculate a unit's activation. In this particular case, the unit with the lowest activation is referred to as the winner, c, of the training iteration, as given in Expression (1).
c : m_c(t) = min_i ||x(t) − m_i(t)||   (1)
Subsequently, the weight vector of the winner as well as the weight vectors of selected units in the vicinity of the winner are adapted. This adaptation is implemented as a gradual reduction of the difference between corresponding components of the input pattern and the weight vector, as shown in Expression (2).

m_i(t + 1) = m_i(t) + α(t) · h_ci(t) · [x(t) − m_i(t)]   (2)
Geometrically speaking, the weight vectors of the adapted units are moved a bit towards the input pattern. The amount of weight vector movement is guided by a so-called learning rate, α, decreasing in time. The number of units that are affected by adaptation, as well as the strength of adaptation, is determined by a so-called neighborhood function, h_ci. This number of units also decreases in time. Typically, the neighborhood function is a unimodal function which is symmetric around the location of the winner and monotonically decreasing with increasing distance from the winner. A Gaussian may be used to model the neighborhood function. It is common practice that at the beginning of training a wide area of the output space is subject to adaptation. The spatial width of units affected by adaptation is reduced gradually during the training process. Such a strategy allows the formation of large clusters at the beginning and fine-grained input discrimination towards the end of the training process. In a nutshell, the training process of the self-organizing map describes a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space, where patterns that are similar in terms of the input space are mapped to geographically close locations in the output space.
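A minimal sketch of one training iteration following Expressions (1) and (2), with a Gaussian neighborhood as suggested above; the grid layout and the decreasing schedules for α and σ are left to the caller.

```python
import numpy as np

def som_train_step(x, weights, grid, alpha, sigma):
    """One training iteration of the basic SOM (Expressions (1) and (2)).

    x      : (n,) input pattern
    weights: (M, n) weight vectors m_i, modified in place
    grid   : (M, 2) positions of the units on the map lattice
    alpha  : learning rate at this iteration
    sigma  : neighborhood width at this iteration
    """
    # Winner c: the unit whose weight vector is closest to the input, Expression (1).
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Gaussian neighborhood centered on the winner.
    h = np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2 * sigma ** 2))
    # Move the weight vectors towards the input pattern, Expression (2).
    weights += alpha * h[:, None] * (x - weights)
    return c
```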
2.2. Hierarchical feature maps
The key idea of hierarchical feature maps as proposed in [18,19] is to use a hierarchical setup of multiple layers where each layer consists of a number of independent self-organizing maps. One self-organizing map is used at the first layer of the hierarchy. For every unit in this map a self-organizing map is added to the second layer of the hierarchy. This principle is repeated with any further layers of the hierarchical feature map. The training process of hierarchical feature maps starts with the self-organizing map on the first layer. This map is trained according to the standard training process of self-organizing maps as described above. When this first self-organizing map is stable, training proceeds with the maps of the second layer. Here, each map is trained with only that portion of the input data that is mapped onto the respective unit in the higher-layer map. By this, the amount of training data for a particular self-organizing map is reduced on the way down the hierarchy. Additionally, the vectors representing the input patterns may be shortened on the transition from one layer to the next. This shortening is due to the fact that some input vector components can be expected to be (almost) equal among those input data that are mapped onto the same unit. These equal components may be omitted for training the next-layer maps without loss of information. This reduction in input vector dimension, obviously, leads to shorter training times because of faster winner selection and weight vector adaptation. A sketch of this layer-by-layer training is given at the end of this section. Hierarchical feature maps produce disjoint clusters of the input data. Moreover, these disjoint clusters are gradually refined when moving down along the hierarchy. Contrary to that, the self-organizing map in its basic form cannot be used to produce disjoint clusters.
The separation of data items is rather a tricky task that requires some insight into the structure of the input data. What one gets, however, from a self-organizing map is an overall representation of input data similarities. In this sense we may use the following picture to contrast the two models of neural networks. Self-organizing maps can be used to produce maps of the input data, whereas hierarchical feature maps produce an atlas of the input data. Taking up this metaphor, the difference between both models is quite obvious. Self-organizing maps, in our point of view, provide the user with a single picture of the underlying data archive. As long as the map is not too large, this picture may be sufficient. As the maps grow larger, however, they have the tendency of providing too little orientation for the user. In such a case we would advise changing to hierarchical feature maps as the model for representing the contents of the data archive. In this case, the data is organized hierarchically, which facilitates browsing into relevant portions of the data archive. In much the same way as one would probably not use the map of the world in order to find one's way from, say, Schönbrunn to Neustift, one would probably not use a single map of a document archive to find a particular document. Conversely, when given an atlas one might follow the hierarchy of maps along a path such as World → Europe → Austria → Vienna in order to finally find the way from Schönbrunn to Neustift. In a similar way an atlas of a document archive might be used.
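A sketch of the layer-by-layer training of a two-layer hierarchical feature map; train_som and assign are hypothetical helpers standing for ordinary SOM training and winner assignment, and the optional shortening of the input vectors between layers is omitted.

```python
import numpy as np

def train_hierarchical_feature_map(data, layer_shapes, train_som, assign):
    """Train a two-layer hierarchical feature map: one SOM on the first layer and,
    for every first-layer unit, an independent SOM trained only on the data
    mapped onto that unit.

    train_som(data, shape) -> trained map;  assign(som, data) -> winner index per input.
    """
    top = train_som(data, layer_shapes[0])
    winners = assign(top, data)
    children = {}
    for unit in np.unique(winners):
        subset = data[winners == unit]          # only the data mapped onto this unit
        children[unit] = train_som(subset, layer_shapes[1])
    return top, children
```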
3. Issues in document classification
Generally, the task of text classification aims at uncovering the semantic similarities between various documents. In a first step, the documents have to be mapped onto some representation language in order to enable further analyses. This process is termed indexing in the information retrieval literature. A number of different strategies have been suggested over the years of information retrieval research. Still one of the most common representation techniques is single-term full-text indexing, where the text of the documents is accessed and the various words forming the document are extracted. These words may be mapped to their word stem, yielding the so-called terms used to represent the documents. The resulting set of terms is usually cleared of so-called stop-words, i.e. words that appear either too often or too rarely within the document collection and thus have only little influence on discriminating between different documents, and would just unnecessarily increase the computational load during classification. In a vector-space model of information retrieval, the documents contained in a collection are represented by means of feature vectors x of the form x = [ξ_1, ξ_2, ..., ξ_n]^T. In such a representation, the ξ_i, 1 ≤ i ≤ n, correspond to the index terms extracted from the documents as described above. The specific value of ξ_i corresponds to the importance of index term i in describing the particular document at hand. One might find a lot of strategies to prescribe the importance of an index term for a particular document [24]. Without loss of generality, we may assume that this importance is represented as a scalar in the range [0, 1], where zero means that this particular index term is absolutely unimportant for describing the document. Any deviation from zero towards one is proportional to the increased importance of the index term at hand. In such a vector-space model, the similarity between two text documents corresponds to the distance
between their vector representations [25]. A challenging issue in processing document representations obtained by full-text indexing is that the number of index terms is very large even for document collections of moderate size. As a consequence, the documents are represented as vectors belonging to a very high-dimensional feature space. In order to enable an efficient calculation of document similarities, the feature space should be compressed. The reason why compression of the feature space yields significant results is that natural language text is usually far from being free of correlations. Such correlations originate, for instance, from frequent word co-occurrences. Patterns of frequently co-occurring words may thus be reduced to single words. An approach based on singular value decomposition that is often used in the information retrieval community is latent semantic indexing [2]. An alternative approach based on auto-associative feedforward neural networks is described in [13]. A very effective alternative is the word category map as used in the WEBSOM project [4,7,9,10], where the self-organizing map is used to cluster similar words. The general idea is based on the seminal work of [21]. As another important issue in document classification we certainly have to mention the visualization of document similarity. A very convenient visualization is marked by the map display provided with the utilization of self-organizing maps. This certainly is a reason for the success of the self-organizing map in a number of information retrieval applications. The basic map display, however, has its limitations in that cluster boundaries are not shown explicitly. A number of approaches to overcome this limitation are based on the idea of representing clusters of similar input patterns by means of different shades of grey. As an example consider the U-matrix display as described in [26]. An appealing approach to cluster visualization by means of automatically coloring the map display is proposed in [8]. We have addressed cluster visualization in two different ways. Firstly, we have presented a visualization technique that is based on the adaptation of the coordinates used to plot the various units of the self-organizing map [14,17]. By this we reach a far more condensed arrangement of similar input patterns within the two-dimensional map display. Secondly, we have developed a method for automatically assigning labels to the units of the self-organizing map [20]. The labels are derived by analyzing the term co-occurrence patterns within documents mapped onto the same unit. These labels give clear hints on the contents of the documents and thus facilitate the interpretation of training results. Other possibilities to represent cluster boundaries explicitly may be found by using different artificial neural network models. A model based on the usage of a number of independent self-organizing maps is described in [27]. These independent maps compete to represent a particular input pattern, such that similar input patterns are contained in the same map. Other approaches rely on neural networks that are based on the self-organizing map's principle of unsupervised learning and address cluster boundary detection by means of adaptive network architectures. These models make use of incrementally growing and splitting architectures where the final shape of the network architecture is guided by the specific requirements of the input space.
Representatives of this type of artificial neural network are described in [1,3]. However, all of these models have in common a lack of inter-cluster similarity representation. In other words, the similarity of documents belonging to different clusters cannot be deduced from the map-like display provided by
these models. Finally, the hierarchical feature map, a neural network model based on a layered architecture of independent self-organizing maps, is described in [18]. We believe that this model has its advantages for a task such as text classification. These advantages are related to, firstly, the input vector compression inherent in training, which enables fast training times, and secondly, a visualization capability comparable to that of the self-organizing map.
4. A map of the world
For the experiments presented hereafter we use the 1990 edition of the CIA World Factbook (http://www.odci.gov/cia/publications/factbook) as a sample document archive. The CIA World Factbook represents a text collection containing information on countries and regions of the world. The information is split into different categories such as Geography, People, Government, Economy, Communications, and Defense Forces. We use full-text indexing to represent the various documents. The complete information on each country is used for indexing. In other words, for the present set of experiments we refrained from identifying the various document segments that contain the information on the various categories. In total, the 1990 edition of the CIA World Factbook consists of 245 documents. The indexing process identified 959 content terms, i.e. terms used for document representation. During indexing we omitted terms that appear in fewer than 15 documents or in more than 196 documents. These terms are weighted according to a simple tf × idf weighting scheme [23], i.e. term frequency times inverse document frequency (a sketch of this weighting is given below). Such a weighting scheme favors terms that appear frequently within a document yet rarely within the document archive. With this indexing vocabulary the documents are represented according to the vector-space model of information retrieval. The various vectors representing the documents are further used for neural network training. Figure 1 gives a graphical representation of the training result obtained with a 10 × 10 self-organizing map. Each unit is either marked by a number of countries or by a dot. The name of a country appears if the corresponding unit serves as the winner for that particular country. Contrary to that, a dot appears if the unit is never selected as winner for any document. Figure 1 shows that the self-organizing map was successful in arranging the various input data according to their mutual similarity. It should be obvious that in general countries belonging to similar geographical regions are rather similar with respect to the different categories described in the CIA World Factbook. These geographical regions can be found in the two-dimensional map display as well. In order to ease the interpretation of the self-organizing map's training result, we have marked several regions manually. For example, the area on the left-hand side of the map is allocated to documents describing various islands. It is interesting to note that the description of the oceans can be found in a map region neighboring the area of islands in the lower middle part of the map. In the lower center of the map we find the European countries. The cluster representing these countries is further decomposed into a cluster of small countries, e.g. San Marino and Liechtenstein, a cluster of Western European countries, and finally a cluster of Eastern European countries. The latter cluster is represented by a single unit in the last row of the output space. This unit has as neighbors other countries that are usually attributed as belonging to the Communist hemisphere, e.g. Cuba, North Korea, Albania, and the Soviet Union.
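A sketch of the tf × idf indexing described at the beginning of this section, with the document-frequency cutoffs of 15 and 196 used above; the raw term-count matrix and the exact idf variant are illustrative assumptions of this sketch.

```python
import numpy as np

def tf_idf_vectors(term_counts, min_df=15, max_df=196):
    """Weight documents with a simple tf x idf scheme, keeping only terms that
    occur in at least min_df and at most max_df documents.

    term_counts: (n_docs, n_terms) raw term frequencies.
    """
    df = np.count_nonzero(term_counts, axis=0)            # document frequency of each term
    keep = (df >= min_df) & (df <= max_df)
    tf = term_counts[:, keep]
    idf = np.log(term_counts.shape[0] / df[keep])         # inverse document frequency
    return tf * idf                                        # feature vectors for SOM training
```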
Figure 1. A map of the world
At this point it is important to recall that our document archive is the 1990 edition of the CIA World Factbook. Thus, the descriptions refer to a time before the 'fall' of the Communist hemisphere. Other clusters of interest are the region containing countries from Latin America (lower right of the map), the cluster of African countries (upper right of the map), and the cluster containing Arab countries (middle right of the map). For the latter cluster it is interesting to note that one of its neighboring units represents Indonesia and Malaysia, the large Islamic countries of Asia. The third country represented by this unit, the Philippines, has a highly similar geographical and economical description. Overall, the representation of the document space is highly successful in that similar documents are located close to one another. Thus, it is easy to find an orientation in this document space. The negative point, however, is that each document is represented
on the very same map. Since the self-organizing map represents a very high-dimensional data space within a two-dimensional display, it is only natural that some information gets lost during the mapping process. As a consequence, it is rather difficult to identify the various clusters. Imagine Figure 1 without the dashed lines indicating the cluster boundaries. Without this information it is only possible to identify, say, African countries when prior information about the document collection is available.

5. An atlas of the world

The hierarchical feature map can provide essential assistance in isolating the different clusters. The isolation of clusters is achieved thanks to the architecture of the neural network, which consists of layers of independent self-organizing maps. Thus, in the highest layer the complete document archive is represented by means of a small map. Each unit is then further developed within its own branch of the neural network. For the experiment presented hereafter we used a setup of the hierarchical feature map with four layers. The respective maps have the following dimensions: 3 x 3 on the first layer, 4 x 4 on the second layer, and 3 x 3 on the third and fourth layers. Figure 2 presents the contents of the first-layer self-organizing map. In order to keep the information at a minimum we refrained from showing the names of the various countries in this figure. We rather present some aggregated information concerning the various countries.
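To make the layered architecture concrete, the following is a minimal, hedged sketch of how such a hierarchy can be grown with the layer dimensions used above (3 x 3, 4 x 4, 3 x 3, 3 x 3). A plain nearest-centroid quantiser stands in for the per-layer SOM training, which is not shown, and the document vectors are random stand-ins.

```python
import numpy as np

def build_hfm(doc_vectors, layer_sizes=((3, 3), (4, 4), (3, 3), (3, 3)), seed=0):
    """Recursive sketch of a hierarchical feature map: a small map on the first
    layer, then one independent child map per unit, trained only on the documents
    mapped to that unit.  A nearest-centroid quantiser stands in for real SOM
    training here; the layer sizes follow the experiment described above."""
    rng = np.random.default_rng(seed)

    def grow(vectors, depth):
        rows, cols = layer_sizes[depth]
        n_units = rows * cols
        # stand-in codebook: randomly chosen documents act as unit prototypes
        centroids = vectors[rng.choice(len(vectors), size=n_units)]
        assignment = np.argmin(((vectors[:, None, :] - centroids) ** 2).sum(-1), axis=1)
        node = {"shape": (rows, cols), "centroids": centroids, "children": {}}
        if depth + 1 < len(layer_sizes):
            for unit in range(n_units):
                members = vectors[assignment == unit]
                if len(members) > 1:          # expand only units with enough documents
                    node["children"][unit] = grow(members, depth + 1)
        return node

    return grow(doc_vectors, 0)

docs = np.random.default_rng(1).random((190, 50))   # stand-in document vectors
atlas = build_hfm(docs)                              # four-layer "atlas" structure
```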
[Figure 2 shows the 3 x 3 first-layer map; its units carry labels such as Oceans, Islands, Arab States, Small Countries, Africa, Latin America, and Economically Developed.]
Figure 2. Hierarchical feature map: First layer
In the remainder of this discussion we will just present the branch of the hierarchical feature map that contains what we called economically developed countries. The other branches cannot be shown in this work because of space considerations. These branches, however, are formed quite similarly. In Figure 3 we show the arrangement of the second layer within the branch of economically developed countries. In this map, the various countries are separated roughly according to either their geographical location or their political system. The clusters are symbolized by using different shades of grey. Again, we find, for example, the countries from the "eastern hemisphere" well separated from countries belonging to the "western hemisphere".
Figure 3. Hierarchical feature map: Second layer
Finally, Figure 4 shows the full-blown branch of economically developed countries. In this case it is straightforward to identify the various cluster boundaries in that each cluster is represented by an individual self-organizing map. Higher-level similarities are shown in higher levels of the hierarchical feature map.
Figure 4. A subset from an atlas of the world: Economically developed countries
6. Future directions

The work on self-organizing maps for document classification, obviously, has not found an end with the research reported so far. We feel that a lot of highly interesting issues should be addressed in future work. At the moment we are working on three aspects related to the utilization of the self-organizing map for document classification. Firstly, we proposed an approach for automatically labeling the units of a self-organizing map with those index terms that best characterize the documents represented by a specific unit. The benefit of this approach is that the various document clusters are described in terms of shared index terms, thus making it easier for the user to explore the contents of an unknown document archive. Preliminary results are reported in [20]. Secondly, we are working towards an incrementally growing version of hierarchical feature maps. In such a model, the depth of the hierarchy as well as the dimensions of the various self-organizing maps will be determined during the unsupervised training process. The obvious benefit of such a model is that no prior knowledge about the document archive is required in order to define the setup of the neural network. Our first experience with this model indicates that results highly similar to those described in Section 5 can be expected. Just to give an example, one of the third-layer maps is shown in Figure 5. In this case, the self-organizing training process arranged some neighboring European countries along a 5 x 2 self-organizing map.
[Figure 5 shows a 5 x 2 map whose units are labeled italy, san marino, liechtenstein, vatican, switzerland, austria, and german fed rep.]
Figure 5. A map from an incrementally growing atlas
Finally, we are including metaphor graphics in the map display. Such graphics visualize the length of a particular document in terms of the thickness of books, the frequency of access to a particular document in terms of the appearance of the book's back, and the time of last access to a document in terms of the book's position within the shelf. Recently used books are standing towards the front of the shelf, whereas books that have not been in use for a longer time slowly move towards the back of the shelf. Consider Figure 6 as an example of such a map display.

7. Conclusions

In this paper we have provided an account of the feasibility of using self-organizing maps in a highly important task of information retrieval, namely document classification. As an experimental document archive we used the descriptions of various countries as contained in the 1990 edition of the CIA World Factbook. For this document collection it is rather easy to judge the quality of the classification result. For document representation we relied on the vector space model and a simple tf x idf term weighting scheme.
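For readers unfamiliar with that weighting, the following is a minimal sketch of a generic tf x idf scheme; it is not necessarily the exact weighting variant used in these experiments, and the token lists are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build tf x idf weighted document vectors (bag of words).

    docs: list of token lists.  Returns the vocabulary and, for each document,
    a dict mapping term -> weight.  Generic sketch: tf is the raw count, idf is
    log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vocabulary = sorted(df)
    weighted = []
    for doc in docs:
        tf = Counter(doc)                # raw term frequency in this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vocabulary, weighted

docs = [["self", "organizing", "map"], ["map", "of", "the", "world"]]
vocab, vecs = tf_idf_vectors(docs)
```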
Figure 6. Metaphor graphics for document archive representation
We demonstrated that both the self-organizing map and the hierarchical feature map are highly useful for assisting the user in finding his or her orientation within the document space. The shortcoming of the self-organizing map, however, is that each document is shown in one large map and thus, the borderline between clusters of related and clusters of unrelated documents is sometimes hard to find. This is especially the case if the user does not have sufficient insight into the contents of the document collection. The hierarchical feature map overcomes this limitation in that the clusters of documents are clearly visible because of the architecture of the neural network. The document space is separated into independent maps along different layers in a hierarchy. The user thus gets the best of both worlds. The similarity between documents is shown at a fine-grained level in maps of the lower layers of the hierarchy, while the overall organizational principles of the document archive are shown in the higher-layer maps. Since such a hierarchical arrangement of documents is the common way of organizing conventional libraries, only a small intellectual overhead is required from the user to find his or her way through the document space.
REFERENCES
1. J. Blackmore and R. Miikkulainen. Incremental Grid Growing: Encoding high-dimensional structure into a two-dimensional feature map. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, 1993.
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
3. B. Fritzke. Growing Cell Structures: A self-organizing network for unsupervised and supervised learning. Neural Networks, 7(9), 1994.
4. T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps of document collections. In Proceedings of the Workshop on Self-Organizing Maps, Espoo, Finland, 1997.
5. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.
6. T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.
7. T. Kohonen. Self-organization of very large document collections: State of the art. In Proceedings of the Int'l Conference on Artificial Neural Networks (ICANN'98), Skövde, Sweden, 1998.
8. T. Kohonen and S. Kaski. Automatic coloring of data according to its cluster structure. In E. Alhoniemi, J. Iivarinen, and L. Koivisto, editors, Triennial Report 1994-1996. Helsinki University of Technology, Neural Networks Research Center & Laboratory of Computer and Information Science, Otaniemi, Finland, 1997.
9. T. Kohonen, S. Kaski, K. Lagus, and T. Honkela. Very large two-level SOM for the browsing of newsgroups. In Proceedings of the Int'l Conference on Artificial Neural Networks (ICANN'96), Bochum, Germany, 1996.
10. K. Lagus, T. Honkela, S. Kaski, and T. Kohonen. Self-organizing maps of document collections: A new approach to interactive exploration. In Proceedings of the Int'l Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 1996.
11. X. Lin, D. Soergel, and G. Marchionini. A self-organizing semantic map for information retrieval. In Proceedings of the ACM SIGIR Int'l Conference on Research and Development in Information Retrieval (SIGIR'91), Chicago, IL, 1991.
12. D. Merkl. A connectionist view on document classification. In Proceedings of the Australasian Database Conference (ADC'95), Adelaide, SA, 1995.
13. D. Merkl. Content-based document classification with highly compressed input data. In Proceedings of the Int'l Conference on Artificial Neural Networks (ICANN'95), Paris, France, 1995.
14. D. Merkl. Exploration of document collections with self-organizing maps: A novel approach to similarity representation. In Proceedings of the European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'97), Trondheim, Norway, 1997.
15. D. Merkl. Exploration of text collections with hierarchical feature maps. In Proceedings of the Int'l ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), Philadelphia, PA, 1997.
16. D. Merkl. Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21(1-3), 1998.
17. D. Merkl and A. Rauber. Alternative ways for cluster visualization in self-organizing maps. In Proceedings of the Workshop on Self-Organizing Maps, Espoo, Finland, 1997.
18. R. Miikkulainen. Script recognition with hierarchical feature maps. Connection Science, 2, 1990.
19. R. Miikkulainen. Subsymbolic Natural Language Processing: An integrated model of scripts, lexicon, and memory. MIT Press, Cambridge, MA, 1993.
20. A. Rauber and D. Merkl. Automatic labeling of self-organizing maps: Making a treasure-map reveal its secrets. In Proceedings of the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), Beijing, China, 1999.
21. H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61, 1989.
22. D. Roussinov and M. Ramsey. Information forage through adaptive visualization. In Proceedings of the ACM Int'l Conference on Digital Libraries (DL'98), Pittsburgh, PA, 1998.
23. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
24. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 1988.
25. H. R. Turtle and W. B. Croft. A comparison of text retrieval models. Computer Journal, 35(3), 1992.
26. A. Ultsch. Self-organizing neural networks for visualization and classification. In O. Opitz, B. Lausen, and R. Klar, editors, Information and Classification - Concepts, Methods, and Applications. Springer-Verlag, Berlin, 1993.
27. W. Wan and D. Fraser. Multiple Kohonen self-organizing maps: Supervised and unsupervised formation with application to remotely sensed image analysis. In Proceedings of the Australian Conference on Neural Networks, Brisbane, QLD, 1994.
Kohonen Maps. E. Oja and S. Kaski, editors © Elsevier Science B.V. All rights reserved
NAVIGATION IN DATABASES USING SELF-ORGANISING MAPS

S.A. Shumsky

P.N. Lebedev Physics Institute, Moscow, Russia
1. Introduction

The total volume of databases worldwide is growing exponentially, and gradually the problem of information search, which was topical until recently, gives way to the problem of selecting relevant information in the expanding ocean of available data. For this problem it is important that "raw" data be reduced to a form suitable for assimilation, i.e. the central problem is the conversion of data into knowledge. Knowledge is data capable of reproduction. The difference between knowledge and data is like the difference between living and inorganic matter. Knowledge must be structured and linked by associations. Taking into account the volume of accumulated data, the process of data structuring must be highly automated. To this end, self-organising Kohonen maps are very well suited for the conversion of data into knowledge. They are self-organising and they provide clarity, making data perception easier for people. These features make self-organising maps an ideal navigation tool for huge collections of data, and the era of the Internet precipitates the need for such navigation tools.

In its time, the epoch of the great geographical discoveries initiated drastic changes in the standards of navigation devices: clocks became more accurate and astronomical instruments were significantly improved. In the long run, the mastering of the oceans led to the discoveries of Copernicus. In a similar way, the present mastering of the Information Ocean calls for a new generation of instruments leading to new scientific discoveries. Apparently, self-organising maps will play an important role in this process.

The goal of this work is to demonstrate the use of self-organising maps in Internet and business applications. The author does not aspire to present a complete review, focusing instead on the projects in which he participated. We start from a more detailed discussion of the advantages of the above technology. Then we present several applications of Kohonen maps for the organisation of business information, namely for the analysis of Russian banks, industrial companies and the stock market. The next section is devoted to the use of self-organising maps for navigation in document collections, including Internet applications. Finally we discuss some possible extensions of presently used self-organising maps.
2. Advantages of self-organising maps

Self-organising maps belong to the class of unsupervised neural networks. Such networks learn latent structures in data without any experts "marking" the data (for example, assigning data items to one class or another). This is a very valuable feature, because "marked" data is usually much more expensive. Scarce "marked" data can later be used for annotating the map, which was formed automatically without any experts. Thus, self-organising maps allow automatic data structuring. Another advantage of self-organising maps is also implied in their name. Dimensionality reduction of multi-factor information to (usually) two co-ordinates appeals to people's natural visual skills for data analysis. Mother Nature has polished our ability to perceive visual images for millions of years. Thus Kohonen maps open up the potential for analytical work to an ordinary user. Self-organising maps allow a new look at data for specialists in various fields; this alone may lead to new generalisations. As a matter of fact, self-organising maps propose a completely new style of working with databases: a comprehensive overview instead of numerous local cross-sections and selections. Such a holistic approach to data makes self-organising Kohonen maps a convenient tool for database navigation, based on associations between neighbouring parts of the maps. Databases are no longer "black boxes"; each database acquires a unique visual structure. Such navigation tools are indispensable in the Information Ocean. On the Internet the drawbacks of the usual Boolean search are especially noticeable. Search queries based on the traditional technique of Boolean selections bring back too much information. The necessity to make traditional search tools more intelligent, providing them with the ability to understand documents' subjects, has become evident. Self-organising maps prove their effectiveness in this field as well. In the next section we discuss in greater detail the prerequisites for using self-organising maps in financial analysis.
3. Analysis of financial information

Financial analysis assumes revealing and taking into account existing regularities in multi-factor information, and understanding the correlations of different factors. In the absence of a comprehensive economic theory, the ability to rely on historical precedents, proven by statistical analysis, is essential. The Kohonen map is such a statistical tool; it facilitates the recognition of similar situations and helps decision-making in business. Let us illustrate this with a concrete problem: bankruptcy prediction.
3.1. Analysis of bankruptcies

3.1.1. The price of the question
Let us first illustrate the "price of the question" with some numbers. The total volume of interbank credits worldwide is estimated at $38 trillion. This is approximately twice the volume of the world stock market (Sharpe, Alexander, and Bailey, 1995). Of course, all banks are interested in estimating the risk that a credit will not be returned.
The total number of bankruptcies in the USA was increasing by 14% annually during the 1980s. In the US banking sector the number of bankruptcies increased from 50 in 1984 to 400 in 1991 (Trippi and Turban, 1993). However, this makes up less than 3% of the approximately 14,000 US banks. As to Russia, in 1996 more than 10% of about 2,000 Russian banks lost their licences. The August crisis of 1998 will result in a much more drastic decrease in the number of viable Russian banks. Thus, the prediction of bankruptcies is indeed a very hot topic.
3.1.2. Statistical techniques for bankruptcy prediction
An extensive class of existing methods, including neural ones, is based on the statistical processing of bankruptcy cases. The task is to estimate the financial stability of a company, taking into account only impartial information: financial indicators. Usually one aims to estimate the probability of bankruptcy after a certain period (e.g. one or two years) based on the current financial accounting information. The pioneering work by Altman in this field appeared in 1968 (Altman, 1968). Using discriminant analysis he identified the five most important financial indicators describing the financial situation of a company in the context of bankruptcy prediction:

- Current assets to Gross assets ratio
- Retained income to Gross assets ratio
- Disposable income to Gross assets ratio
- Capitalisation to Debt ratio
- Sales to Gross assets ratio
These indicators are used also in the generally accepted CAMEL rating technique.
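As a small illustration, the five ratios above can be assembled into the feature vector that the statistical or neural models work on. The balance-sheet item names and figures below are illustrative placeholders only, not a standard chart of accounts.

```python
def altman_ratios(balance):
    """Compute the five Altman-style ratios listed above from a dict of
    balance-sheet items (hypothetical item names)."""
    ga = balance["gross_assets"]
    return {
        "current_assets_to_gross_assets": balance["current_assets"] / ga,
        "retained_income_to_gross_assets": balance["retained_income"] / ga,
        "disposable_income_to_gross_assets": balance["disposable_income"] / ga,
        "capitalisation_to_debt": balance["capitalisation"] / balance["debt"],
        "sales_to_gross_assets": balance["sales"] / ga,
    }

example = {"gross_assets": 120.0, "current_assets": 45.0, "retained_income": 10.0,
           "disposable_income": 6.0, "capitalisation": 30.0, "debt": 60.0, "sales": 90.0}
features = altman_ratios(example)   # feature vector for discriminant analysis or a SOM
```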
3.1.3. Neural prediction of bankruptcies
Neural methods for bankruptcy prediction inherited the basic principles of the statistical studies. Actually, neural nets provide an easy-to-use non-linear regression technique, and as a rule demonstrate better results. According to (Trippi and Turban, 1993), neural simulations provide the best accuracy in the bankruptcy prediction problem: approximately 90%, in comparison with 80%-85% for other statistical methods (discriminant analysis, logical analysis, ID3, kNN). Training a neural net on examples of bankruptcies produces a discriminant function, a numerical index of the financial health of a company and a measure of its stability. But stability is not the only criterion of the financial activity of a company.¹ Shareholders are interested not only in the eternal existence of their company, but in a good income as well. Besides, the tendencies are as important as the present state of the company. Another set of factors can be significant to this end, which will provide another estimating function. So, large profitability can increase reliability in the future. Meanwhile, it is not clear how to train a network to recognise "future success" in the absence of precise criteria, such as bankruptcy is for failure. It is possible, however, to overcome these difficulties, recalling that a company exists in a community of similar competitors, which are characterised by the same parameters. We can discuss the weak and strong aspects of a firm's activity in comparison with this community. This leads us to a different approach: systematic comparison of the company's financial state with the rest
¹ Biological evolution demonstrates numerous examples of "too successful" adaptation to a specific ecological niche, which led to quick extinction when the environmental conditions suddenly changed.
of its competitors. This approach, based on the unsupervised learning of self-organising maps, does not require answering difficult questions. Besides, it can be applied in situations when the amount of "marked" data, i.e. the total number of known bankruptcy cases, is limited and non-linear regression methods cannot be used effectively.
3.1.4. Self-organising maps of banking systems
In the first work in this field (Martin-del-Brio and Serrano-Cinca, 1993) a Kohonen map of Spanish banks was generated. It was based on the Altman set of financial indicators. Samples of bankrupt banks identify a risk area. The work described the crisis period, when about a half of the Spanish banks collapsed, so the map was divided into two approximately equal parts: one part contained relatively successful banks, while potential bankrupts defined the other part of the map. A similar approach was presented in (Shumsky and Yarovoy, 1998) for Russian banks. That paper reviewed the financial state of about 1,700 banks in 1994-1995, each described by 30 financial indicators from balance sheets and annual statements. Various financial parameters can be drawn on the generated maps, presented by colour schemes or by relief. This allows the visualisation of financial parameters on self-organising maps in a way similar to the use of colour schemes on climatic and geographic maps. An aggregate of such colour schemes gives one an atlas, a useful source of graphic information for financial analysts. Changes of financial positions can be presented as trajectories on the self-organising maps. These trajectories represent the evolution of a company and show up existing tendencies and cycles. One more advantage, from the macro-economic point of view, is the graphical representation of the share of companies with similar parameters; this is provided by the approximately uniform load of the map cells. This allows for the visualisation of such macro-economic parameters as the share of banks with large fixed capital or the share of banks that are pressed for funds.
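The workflow described above can be sketched as follows: train a small SOM on the indicator vectors and read off one "component plane" per indicator, which plays the role of one colour scheme in the atlas. This is a generic online-SOM sketch with random stand-in data; the map size, schedules and normalisation are illustrative assumptions, not those used for the published bank atlas.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=20000, lr0=0.1, radius0=3.0, seed=0):
    """Minimal rectangular SOM trained with the classical online rule.
    data: (n_samples, n_features) array of financial indicators."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    codebook = rng.normal(size=(rows, cols, d)) * data.std(0) + data.mean(0)
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(n)]
        # best-matching unit = codebook vector nearest in Euclidean distance
        dists = np.linalg.norm(codebook - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        radius = max(radius0 * (1.0 - frac), 0.5)
        # Gaussian neighbourhood on the map lattice
        h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * radius ** 2))
        codebook += (lr * h)[..., None] * (x - codebook)
    return codebook

# 1,700 "banks" described by 30 indicators (random stand-in data)
banks = np.random.default_rng(1).normal(size=(1700, 30))
som = train_som(banks, rows=12, cols=8)
# component plane for indicator 0: one shade per map unit, like a climatic map
component_plane = som[:, :, 0]
```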
3.2. Extension of fundamental analysis
The author expresses the opinion that self-organising maps present a breakthrough in bank rating techniques. Actually, we get two-dimensional rating information, which is certainly more informative than traditional one-dimensional ratings. For example, a self-organising map of Russian banks contains about 85% of the information (the mean-square error of representing a bank on the map is about 15% of the total variance of the "raw" 30-dimensional data). We should notice here that the first principal component of the same data contains only 35% of the total information. This means that the best rating made from a linear combination of balance sheet accounts is far less informative than a two-dimensional map. This technique can certainly be applied not only to banks. It can be considered as an augmentation of the arsenal of fundamental financial analysis. The same approach has been used for the analysis of the 200 largest Russian industrial companies (Shumsky and Kochkin, 1999).
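The two figures quoted above can be reproduced in spirit with a few lines of code: the share of variance captured by a map is one minus the quantisation error divided by the total variance, and the share captured by the first principal component is the largest eigenvalue of the covariance matrix divided by its trace. The data and codebook below are random stand-ins, so the numbers will not match the paper's 85% and 35%.

```python
import numpy as np

def share_of_variance_explained(data, codebook):
    """1 - MSE(quantisation) / total variance, for a SOM codebook of any grid shape."""
    flat = codebook.reshape(-1, codebook.shape[-1])
    d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(-1)   # squared distance to every unit
    mse = d2.min(axis=1).mean()                                  # quantisation error
    total = ((data - data.mean(0)) ** 2).sum(-1).mean()          # total variance around the mean
    return 1.0 - mse / total

def share_of_first_principal_component(data):
    """Largest eigenvalue of the covariance matrix divided by the trace."""
    cov = np.cov(data - data.mean(0), rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)                            # ascending order
    return eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(1700, 30))
units = rng.normal(size=(12, 8, 30))          # stand-in for a trained codebook
print(share_of_variance_explained(X, units), share_of_first_principal_component(X))
```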
3.3. Broadening technical analysis
Similarly, Kohonen maps can be considered as a tool of technical analysis and operate with market indicators, i.e. stock-market quotations. Traditional technical analysis is restricted to the analysis of a single financial instrument, aiming to predict a time series using patterns extracted from its past. Neural-based time series prediction allows for the analysis of cross-correlations between different time series. However, the problem of choosing input parameters for neural
networks is not so simple. The selection of closely related indicators can considerably reduce the effectiveness of the subsequent non-linear technical analysis. Self-organising maps allow time series to be grouped using non-linear statistics. For example, if every time series is presented as a multi-dimensional map input, the map will represent a comprehensive market structure. Companies that are close on the map will exhibit similar market behaviour. Another possibility for using self-organising maps lies in the graphic representation of the market situation in general. Snapshots of market conditions play the role of the multi-dimensional map input in this case. Similar market conditions will be grouped on the trained map. Analysis of the trajectories of market conditions in time would be useful for dealers. The above section was devoted to Kohonen maps of numerical information. In the next section we discuss a very interesting and difficult problem: structuring textual databases.
4. Navigation in document collections

The Internet has drastically changed the information structure of modern society. The plunge of the mass media into computer networks opens brand new opportunities for business. New businesses, such as electronic trade and direct active advertising, are emerging. Market quotations of Internet companies effectively represent the expectations of society concerning the further development of the Internet. It is worth mentioning here a recent series of Internet portal purchases; their founders made fortunes in a few years. For instance, the popular Internet portal Excite was sold for just a little less than Digital Equipment, which was one of the major companies in the computer business. New lines of information services will be associated with the processing of cheap data into valuable knowledge. Another important component of the Internet business will be concerned with the personalisation of knowledge delivery. Personal software agents will dig up information relevant to their hosts and act on the Internet on their behalf. Agentware, the basis of the information society in the near future, implies understanding what a person wants from his agent. The conversion of data to knowledge in the case of textual information requires retrieving the sense of a document, a feature that is absent in modern search systems. Sense-retrieving technology appears to be one of the most important elements for business on the Internet. Self-organising maps and related algorithms can really enhance the effectiveness of present information technologies. The WEBSOM project is a striking example of the motivating impulse in this direction. We shall discuss progress in Internet search systems in connection with the main ideas of this project.
4.1. Historical background
All indexing methods can be grouped into two classes: lexical indexing and vector indexing. Lexical indexing is very effective for the optimisation of Boolean queries; vector indexing allows the construction of queries for retrieving similar documents. Traditional lexical indexing has almost achieved its perfection. Advanced lexical search engines use stop-lists, understand all grammar forms and have an extended query language. The possibility to specify the proximity of words in the query to each other and to the beginning of the document is implemented in many search systems. The manual generation of thesauri has made some search systems more intelligent. Indeed, modern lexical search systems answer the demands of experts in particular domains.
However, the majority of Internet users do not consider lexical search convenient. Firstly, the complicated Boolean query language is too difficult for an ordinary user. Secondly, it is usually difficult to formulate a query in key words without being an expert in a certain domain. This is especially true if there is no settled terminology in the field. Besides, any attempt to exceed the limits of a particular speciality meets the problems of synonymy and polysemy. And, finally, ranking of the retrieved documents is not implemented in Boolean logic; this is especially relevant for the Internet, where typical queries bring back too many documents. Information filtering is possibly a more important component of search systems than information search. This implies a certain conception of document content. That is why a keyword-based search should be reinforced with content processing. Content representation has been investigated along different lines of approach: statistical, linguistic, and conceptual; noticeable practical results have been achieved only with statistical methods. At the end of the 1980s Salton (Salton, 1989) proposed the vector model as an alternative to lexical, context-free indexing. In the vector model every document is represented by a frequency spectrum of words, which identifies the document in a multi-dimensional semantic space. In the search process the frequency spectrum of a query is considered as a vector in the same semantic space, and the distance in this vector space defines the most relevant documents. The possibility of ranking documents according to their similarity, based on a distance in a vector space, is the most attractive feature of the vector model. An example of an effectively working associative search system is the Latent Semantic Indexing model, proposed in 1990 (Deerwester et al., 1990). The eigenvectors of the covariance matrix of word frequencies in documents form the components of the semantic space in this model. Another approach was proposed in the WEBSOM project, where the semantic space is based on the idea of context clustering. In general, there are two main feature extraction techniques: principal component analysis and clustering. These basic methods can also be used in combination. For example, in the recently announced associative search system Semantic Explorer, the semantic space is formed by such a hybrid algorithm. As in WEBSOM, self-consistent clustering gives rise to semantic categories. A context, however, includes not only the neighbouring words but the whole frequency spectrum, as in the LSI model. We shall describe the Semantic Explorer in greater detail in the next section.
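Before moving on, here is a minimal sketch of the Latent Semantic Indexing construction mentioned above, using a truncated singular value decomposition of a toy term-by-document matrix; the matrix, the number of dimensions and the query fold-in are illustrative choices only.

```python
import numpy as np

def lsi_embedding(term_doc, k=2):
    """LSI sketch: truncated SVD of a term-by-document matrix.  Each document
    gets a k-dimensional coordinate vector (a row of V_k); documents with
    similar word usage end up close together in this semantic space."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_coords = vt[:k, :].T             # one k-dimensional row per document
    return u[:, :k], s[:k], doc_coords

def fold_in_query(query_counts, u_k, s_k):
    """Project a query's term-count vector into the same space (q U_k / s_k)."""
    return (query_counts @ u_k) / s_k

term_doc = np.array([[2., 0., 1.],       # toy term-by-document counts
                     [0., 1., 1.],
                     [1., 1., 0.],
                     [0., 2., 1.]])
u_k, s_k, docs = lsi_embedding(term_doc, k=2)
q = fold_in_query(np.array([1., 0., 1., 0.]), u_k, s_k)   # query using terms 0 and 2
```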
4.2. Semantic Explorer
Semantic Explorer includes all the necessary tools for converting any document collection into a real knowledge base, which provides:

- Analysis: extraction of semantic categories specific to a collection
- Search: based on document content
- Navigation: via an annotated map of the document collection

Close documents have associative relations with each other. The product is saturated with neural algorithms. Self-consistent clustering extracts thematic categories. Self-organising maps are used for the visualisation of queries and responses and for navigation in a database. Colourings of the maps, illustrating the distribution of semantic categories, are generated by on-line RBF networks. Semantic Explorer contains a proprietary algorithm for the automated annotation of categories and clusters on the map. A completely automated cycle of
vector indexing without using language-specific thesauri allows processing documents in any language. Such tools, by all means, are going to be very useful for the automatic ordering of the textual content of the Internet; this allows navigation in large collections of textual information. Self-organising maps make the perception of large document collections much easier. Semantic categories, annotated by their most important words, serve as tables of contents for databases, and the distribution of categories on the map is a kind of visual context indexing.
Figure 1. Self-organising map of the SPIE abstracts collection in Semantic Explorer. Documents found are sparkling stars in the "galaxy" of the whole collection

For example, Figure 1 presents the semantic indexing of the full collection of about 50,000 abstracts from the SPIE conferences. Semantic Explorer extracts 96 basic semantic categories. One of them, annotated by its keywords, is shown on the snapshot. The document density for this subject is shown on the map. This map is a good starting point for a search on neural networks in this collection of abstracts. A click on the brightest point brings up the first portion of retrieved documents. To refine the search results one may use the "more like this" option, marking the documents of interest.
Figure 2. Results of the "more like this" search in Semantic Explorer
Figure 2 illustrates how an abstract on a fuzzy Kohonen clustering network brings up a document on Kohonen's self-organising maps among other closely related documents. This "more like this" technology is a cornerstone of another system for personal information delivery. Below we describe this latter prototype for agent-based information systems.
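The "more like this" step can be sketched as a simple nearest-neighbour query in the document vector space: build a profile from the marked documents and rank the rest by cosine similarity. This is a generic sketch, not Semantic Explorer's proprietary algorithm, and the vectors are random stand-ins.

```python
import numpy as np

def more_like_this(doc_vectors, liked_ids, top_n=5):
    """Rank documents by cosine similarity to the centroid of the documents the
    user has marked as interesting.  doc_vectors: (documents x features) matrix
    of, e.g., tf x idf weights or semantic-space coordinates."""
    profile = doc_vectors[liked_ids].mean(axis=0)
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(profile) + 1e-12
    sims = doc_vectors @ profile / norms
    sims[liked_ids] = -np.inf              # do not return the marked documents themselves
    return np.argsort(sims)[::-1][:top_n]

docs = np.random.default_rng(0).random((100, 20))
print(more_like_this(docs, liked_ids=[3, 17]))
```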
4.3. Proxima Daily
The Proxima Daily is a software complex for the automatic filtering and delivery of news via the Internet: a personal newspaper. The product uses the same semantic engine as the Semantic Explorer. The system learns the interest profile of its user and selects related documents from the news stream. The server itself supplies only the news content; its representation on the user's computer is formed dynamically according to one of the prearranged templates. Contrary to its title, the Proxima Daily can work not only in daily mode, but can also operate as a digest, in the modes Proxima Weekly and Proxima Monthly. One can also browse the news archives via this newspaper interface, and these snapshots of the past will always be shown according to the current user's profile.
5. Conclusion

In conclusion we shall mention one extension of Kohonen maps that is, in the author's view, interesting in the context of recent attempts to find new forms of information visualisation: multi-dimensional self-organising maps. In applications, two-dimensional Kohonen maps are usually used, three-dimensional maps being less common. Indeed, as the dimensionality of a map grows, the number of neurons increases exponentially. That is why it seems natural to use multi-dimensional Kohonen maps with the most economical hyper-cube topology: only two neurons along each direction. In this case, say, a 10-dimensional Kohonen hyper-cube will contain about 10^3 neurons (2^10 = 1024). However, the data co-ordinates in such a hyper-cube, defined for example as the centres of gravity of the neuron excitations, represent non-linear principal components. One might argue that multi-dimensionality nullifies the main advantage of Kohonen maps: the graphical representation of information. A hyper-cube, however, can be projected onto a plane. Such projections have the form of a "fish eye" centred on one of the hyper-cube's sides. Such a visualisation, which emphasises a certain subdomain of the data, has already proved its utility in a system designed at the Palo Alto Research Center. Kohonen hyper-cubes may provide a regular technology for the automatic creation and colouring of such visualisation systems. In conclusion, self-organising maps are very relevant in the modern world, overloaded as it is with easily accessible data. Kohonen maps give the possibility to organise document collections automatically and represent them in a way which allows easy perception of intrinsic regularities. It is a really magic tool for converting data to knowledge.
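A hedged sketch of such a hyper-cube map follows: two units per direction, with the lattice distance between units taken as the Hamming distance between their binary corner labels, and the data coordinates computed as the centre of gravity of the unit excitations. The neighbourhood kernel, schedules and data are illustrative assumptions.

```python
import numpy as np
from itertools import product

def train_hypercube_som(data, dims=10, n_iter=20000, lr0=0.2, seed=0):
    """Kohonen 'hyper-cube' sketch: two units along each of `dims` directions
    (2**dims units in total).  Lattice distance between units is the Hamming
    distance between their binary corner labels."""
    rng = np.random.default_rng(seed)
    corners = np.array(list(product([0, 1], repeat=dims)))      # (2**dims, dims)
    codebook = rng.normal(size=(2 ** dims, data.shape[1]))
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((codebook - x) ** 2).sum(1))            # winning unit
        hamming = np.abs(corners - corners[bmu]).sum(1)          # lattice distance
        lr = lr0 * (1 - t / n_iter)
        h = np.exp(-hamming)                                     # simple neighbourhood kernel
        codebook += (lr * h)[:, None] * (x - codebook)
    return corners, codebook

def hypercube_coordinates(x, corners, codebook, beta=2.0):
    """Non-linear 'principal components': centre of gravity of unit excitations."""
    act = np.exp(-beta * ((codebook - x) ** 2).sum(1))
    return (act[:, None] * corners).sum(0) / act.sum()

X = np.random.default_rng(1).normal(size=(2000, 30))
corners, cb = train_hypercube_som(X)                  # 10-dimensional cube, 1024 units
coords = hypercube_coordinates(X[0], corners, cb)     # coordinates of one data point
```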
6. References

Altman, E. I. (1968). "Financial ratios, discriminant analysis and the prediction of corporate bankruptcy", Journal of Finance, 23, No. 4, 589-609.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6), 391-407. (See also: Todd A. Letsche and Michael W. Berry, Large-Scale Information Retrieval with Latent Semantic Indexing, http://www.cs.utk.edu/~berry/sc95/sc95.html)
Martin-del-Brio, B., and Serrano-Cinca, C. (1993). "Self-Organizing Neural Network: The Financial State of Spanish Companies", in: Trippi, R., and Turban, E. (Eds), Neural Networks in Finance and Investing: Using Artificial Intelligence to Improve Real-World Performance, Probus Publishing, 341-357.
Salton, G. (1989). Automatic Text Processing. Addison-Wesley Publishing Company, Inc., Reading, MA.
Sharpe, W., Alexander, G., and Bailey, J. (1995). Investments, Fifth Edition, Prentice Hall.
Shumsky, S.A., and Kochkin, A.N. (1999). "Self-organising maps of 200 top Russian companies", in: Proceedings of Neuro-informatics'99, Moscow.
Shumsky, S.A., and Yarovoy, A.V. (1998). "Kohonen Atlas of Russian Banks", in: Deboeck, G., and Kohonen, T. (Eds), Visual Explorations in Finance with Self-Organizing Maps, Springer.
Trippi, R., and Turban, E. (Eds) (1993). Neural Networks in Finance and Investing, Probus Publishing.
Kohonen Maps. E. Oja and S. Kaski, editors © Elsevier Science B.V. All rights reserved
A SOM-based sensing approach to robotic manipulation tasks

E. Cervera* and A. P. del Pobil
Department of Computer Science, Jaume-I University, E-12071 Castelló, Spain
*Funded by a grant of the Spanish Ministry of Education

This paper presents a SOM-based approach for learning manipulation tasks, particularly the peg-in-hole insertion. The input to the SOM consists of force signals which define the contact state together with qualitative position information. The combination of a SOM for feature extraction and a reinforcement algorithm for action learning is proposed. The paper presents experimental results of the application of this approach to real manipulation tasks with uncertainty in three dimensions using non-cylindrical parts. The system learns the task by performing successive trials, without further supervision. The SOM-based representation is able to generalize to other shapes of the peg.

1. INTRODUCTION

We present a practical framework for robotic manipulation tasks, particularly the insertion of non-cylindrical parts with uncertainty in modeling, sensing and control. The approach is based on an algorithm which autonomously learns a relationship between sensed states and actions. This relationship allows the robot to select those actions which attain the goal in the minimum number of steps. A SOM extracts features from input signals and it complements the learning algorithm, forming a practical sensing-action architecture for manipulation tasks. In the type of manipulation problems addressed in this work, interactions between the robot and objects are allowed, or even mandatory, for operations such as compliant motions and parts mating. We restrict ourselves to tasks which do not require complex plans; however, they are significantly difficult to attain in practice due to uncertainties. Among these tasks, the peg-in-hole insertion problem has been broadly studied, but very few results can be found in the literature for three-dimensional non-cylindrical parts in an actual implementation. We believe that practicality, although an important issue, has been vastly underestimated in fine motion methods, since most of these approaches are based on geometric models which become complex for non-trivial cases, especially in three dimensions [2]. Learning methods provide a framework for autonomous adaptation and improvement during task execution. An approach to learning a reactive control strategy for peg-in-hole insertion under uncertainty and noise is presented in [6]. This approach is based on active generation of compliant behavior using a nonlinear admittance mapping from
sensed positions and forces to velocity commands. The controller learns the mapping through repeated attempts at peg insertion. A two-dimensional version of the peg-in-hole task is implemented on a real robot. The controller consists of a supervised neural network with stochastic units. In [5] the architecture is applied to a real ball-balancing task and a three-dimensional cylindrical peg-in-hole task. Kaiser and Dillman [7] propose a hierarchical approach to learning the efficient application of robot skills in order to solve complex tasks. Since people can carry out manipulation tasks with no apparent difficulty, they develop a method for the acquisition of sensor-based robot skills from human demonstration. Two manipulation skills are investigated: peg insertion and door opening.

1.1. Motivation
Approaches based on geometric models are far from being satisfactory: most of them are restricted to planar problems, and a plan might not be found if the part geometries are complex or the uncertainties are great. Many frameworks do not consider incorrect modeling and robustness. Though many of the approaches have been implemented in real-world environments, they are frequently limited to planar motions. Furthermore, cylinders are the most utilized workpieces in three-dimensional problems. If robots can be modeled as polygons moving amid polygonal obstacles in a planar world, and a detailed model is available, a geometric framework is fine. However, since such conditions are rarely found in practice, we argue that a robust, adaptive, autonomous learning architecture for robot manipulation tasks - particularly part mating - is a necessary alternative in real-world environments, where uncertainties in modeling, sensing and control are unavoidable.

2. SOMS AND REINFORCEMENT LEARNING
In the proposed architecture (see Fig. 1), an adaptation process learns a relationship between sensed states and actions, which guides the insertion task towards completion with the minimum number of actions. A sensed state consists of a discretized position and force measurement, as described below. A value is stored in a look-up table for each pair of state and action. This value represents the amount of reinforcement which is expected in the future, starting from the state, if the action is performed. The reinforcement (or cost) is a scalar value which measures the quality of the performed action. In our setup, a negative constant reinforcement is generated after every motion. The learning algorithm adapts the values of the table so that the expected reinforcement is maximized, i.e., the number of actions (cost) to achieve the goal is minimized. The discrete nature of the reinforcement learning algorithm poses the necessity of extracting discrete values from the sensor signals of force and position. This feature extraction process, along with the basis of the learning algorithm, is described below.

2.1. Feature extraction with SOMs
Force sensing is introduced to compensate for the uncertainty in positioning the end-effector. It does a good job when a small displacement causes a contact, since a big change
Figure 1. Block diagram of the learning system. A SOM performs the extraction of features from force and position input signals.
in force is detected. However, with only force signals it is not always possible to identify the actual contact state, i.e., different contacts produce similar force measurements, as described in [4]. The adopted solution is to combine the force measurements with the relative displacement of the end-effector from the initial position, i.e., that of the first contact between the part and the surface. The next problem is the discretization of the inputs, which is a requirement of the learning algorithm. There is a conflict between size and fineness. With a fine representation, the number of states is increased, thus slowing down the convergence of the learning algorithm. Solutions are problem-dependent, using heuristics for finding a good representation of manageable size. In a previous work [4] we pointed out the feasibility of SOMs for extracting feature information from sensor data in robotic manipulation tasks. An example is provided in Section 3. SOMs perform a nonlinear projection of the probability density function of the input space onto the two-dimensional lattice of units. Though all the six force and torque signals are available, the practical solution adopted is to use only the three torque signals as inputs to the map. The reason for this is the strong correlation between the force and the torque; thus, adding those correlated signals does not include any new information to the system. The SOM is trained with sensor samples obtained during insertions. After training, each cell or unit of the map becomes a prototype or codebook vector, which represents a region of the input space. The discretized force state is the codebook vector which comes nearest (measured by the Euclidean distance) to the analog force values. The number of units must be chosen a priori, seeking a balance between size and fineness. In the experiments, a 6 x 4 map is used, thus totalling 24 discrete force states. Since the final state consists of position and force, there are 9 x 24 = 216 discrete states in a cylindrical insertion, and 27 x 24 = 648 discrete states in the non-cylindrical task.

2.2. Reinforcement learning
The advantage of the proposed architecture over other random approaches is the ability to learn a relationship between sensed states and actions. As the system becomes skilled, this relationship is more intensely used to guide the process towards completion with the
210 minimum number of steps. The system must learn without a teacher. The skill measurement is the time or number of steps required to perform a correct insertion and is expressed in terms of cost or negative
reinforcement. Sutton [9] defined reinforcement learning (RL) as the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. Q-learning [10] is an RL algorithm that can be used whenever there is no explicit model of the system and the cost structure. This algorithm learns the state-action pairs which maximize a scalar reinforcement signal that will be received over time. In the simplest case, this measure is the sum of the future reinforcement values, and the objective is to learn an associative mapping that at each time step selects, as a function of the current state, an action that maximizes the expected sum of future reinforcement. In Q-learning, a look-up table of Q-values is stored in memory, one Q-value for each state-action pair. The Q-value is the expected amount of reinforcement if, from that state, the action is performed and, afterwards, only optimal actions are chosen. In our setup, when the system performs any action (motion), a negative constant reinforcement is signalled. This reinforcement represents the cost of the motion. Since the learning algorithm tends to maximize the reinforcement, cost will be minimized, i.e. the system will learn those actions which lead to the goal with the minimum number of steps. The basic learning step consists in updating a single Q-value. If the system senses state s, and it performs action a, resulting in reinforcement r and the system senses a new state s', then the Q-value for (s, a) is updated as follows:
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big( r + \gamma \max_{a' \in A(s')} Q(s', a') \big) \qquad (1)
where α is the learning rate and γ is a discount factor, which weighs the value of future reinforcement. The table converges to the optimal values as long as all the states are visited infinitely often. In practice, a good solution is obtained with a few thousand trials of the task.
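A minimal sketch of the learning machinery described in Sections 2.1-2.3 follows: the discrete state combines a qualitative position bin with the winning SOM unit (the codebook vector nearest in Euclidean distance), the table is updated with Eq. (1), and actions are chosen epsilon-greedily so that exploration is never completely turned off. The numeric constants are illustrative assumptions, not the values used in the experiments, except for the constant negative reinforcement per motion.

```python
import numpy as np

N_POSITION_BINS = 27      # 9 position regions x 3 orientation segments (non-cylindrical case)
N_FORCE_STATES  = 24      # units of the 6 x 4 force SOM
N_ACTIONS       = 10
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount, exploration (assumed values)

Q = np.zeros((N_POSITION_BINS * N_FORCE_STATES, N_ACTIONS))   # 648 x 10 = 6,480 entries
rng = np.random.default_rng(0)

def discrete_state(position_bin, torque, codebook):
    """Combine the qualitative position with the SOM winner (nearest codebook
    vector in Euclidean distance) into a single discrete state index."""
    winner = np.argmin(((codebook.reshape(-1, torque.size) - torque) ** 2).sum(1))
    return position_bin * N_FORCE_STATES + winner

def choose_action(state):
    """Epsilon-greedy selection: mostly exploit the table, sometimes explore."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reinforcement, next_state):
    """One application of Eq. (1)."""
    target = reinforcement + GAMMA * Q[next_state].max()
    Q[state, action] = (1 - ALPHA) * Q[state, action] + ALPHA * target

# example transition with made-up data
codebook = rng.normal(size=(6, 4, 3))
s = discrete_state(position_bin=5, torque=np.array([0.1, -0.2, 0.05]), codebook=codebook)
a = choose_action(s)
q_update(s, a, reinforcement=-1.0, next_state=s)   # constant negative cost per motion
```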
2.3. Action selection and exploration
During the learning process, there is a conflict between exploration and exploitation. Initially, the Q-values are meaningless and actions should be chosen randomly, but as learning progresses, better actions should be chosen to minimize the cost of learning. However, exploration cannot be completely turned off, since the optimal action might not yet be discovered.

3. A MANIPULATION EXAMPLE: THE PEG-IN-HOLE TASK
In the two-dimensional peg-in-hole insertion task, a rectangular peg has to be inserted into a vertical rectangular hole. Even small errors in the position a n d / o r orientation of the peg prevent the task from being accomplished successfully, causing undesired contacts which, if ignored, produce reaction forces which can damage the piece or the manipulator. In the considered setup, the hole is chamferless, but there is a clearance between the hole and the peg. In the first simulations friction is neglected, but it will be considered later. Force sensing is achieved by means of a simulated force sensor attached to the upper
Figure 2. Contacts in a peg-in-hole insertion task. Reaction forces are depicted as arrows.
face of the peg. When there is a contact between the peg and the surface, the reaction forces are measured by this hypothetical sensor. As we can see in Fig. 2, there are six different possible contact states between the peg and the surface of the hole. A contact state is the set of all configurations of contacts between the same topological elements (edges, vertices). By considering some clearance between the peg and the hole we have added three more states to the ones used by Asada [1993]. Obviously, each state has its own symmetric counterpart. Our aim is to train a SOM with the torque and forces measured by the sensor in each type of contact, and obtain an output from the network suitable for identifying these states. The map is represented in Fig. 3, which consists of a big white region on the top with units labeled with states p1, p4 and p6, and three smaller light regions isolated by darker zones, i.e. long distances. These regions are labeled with p2, p3 and p5. This representation reflects the state ambiguities, which are also presented in the table. Some units are labeled twice to show this problem, which occurs with states p1, p4 and p6. This means that those units are not only selected for the first state, but sometimes they are also selected for another state. Unlabeled neurons are displayed as a dot.

4. LEARNING COMPLEX MANIPULATION TASKS
It has been shown how a SOM can evolve to form clusters closely related to contact states, without any a-priori knowledge of those states. We have solved the problem in the case of the peg-in-hole. Now we want to apply this scheme to the real situation described. The SOM will be fed with the six signals of a real force/torque sensor attached to the wrist of the robot arm. We will limit the analysis to the fine motion involved in the tasks of inserting the tool in the pallet on the robot vehicle or the machining center. A similar t r e a t m e n t could be done for the task of extracting the tool. After training the SOM with data from a set of examples, the regions on the map are labeled appropriately. In this case, a label is assigned to each different error/offset. In
Figure 3. U-matrix representation of the SOM in a 2D peg-in-hole insertion. Task parameters: p = 0.2, Clearance = 1%
Fig. 4 the activation patterns for complete insertion sequences are shown. Each pattern corresponds to a different error. In each pattern, the brighter cells are the ones that were more activated at some moment of the process. On the figures, trajectories are also shown, which depict the temporal evolution of the activation pattern. The arrows join the winner units during a whole sequence of signals.

4.1. Insertion of a 3D square peg
Due to its radial symmetry, a cylinder is simpler than other pieces for insertions. It has been widely studied in the literature since the force analysis can be done in two dimensions. Analytical results for pegs of other shapes are much more difficult: [1] developed a heuristic approach to manage 1,000 different contact states of a rectangular peg insertion. In the proposed scheme, it is very simple to deal with shapes other than the cylinder. Besides the uncertainty in the position along the dimensions X and Y (tangential to the surface), the agent must deal with the uncertainty in the orientation with respect to the Z axis (the hole axis, which is normal to the surface). In addition to the center of the square being located exactly in the center of the hole, it has to be exactly oriented to allow a proper insertion. The peg used in the experiments has a square section, its side being 28.8 mm. The hole is a 29.2 mm square, thus the clearance is 0.2 mm, and the clearance ratio is approximately 0.013. The peg is made of wood, like the cylinder, and the hole is located in the same platform as before (Fig. 5). The radius of uncertainty in the position is 3 mm, and the uncertainty in the orientation is ±8.5 degrees. The exploration area is a 5 mm square and an angle of ±14 degrees. The area is partitioned into 9 regions, and the angle is divided into three segments. The self-organizing map contains 6 x 4 units, as in the previous case. The rest of the training parameters are the same as before. The input space of the SOM is defined by the three filtered torque components. The map is trained off-line with approximately 70,000 data vectors extracted from previous
Figure 4. Activation patterns (a = 0.5) and trajectories of winner units for the insertion task in the machining center with positive transversal error.
Figure 5. Robotic manipulator grasping a square peg for insertion.
Figure 6. Voronoi diagram defined by the SOM on dimensions (Mx, My), for the cube insertion.
9 Dimension: 6 x 4 9 Topology: hexagonal 9 Neighborhood: bubble Following the recommendations given by [8] for good convergence of SOMs, the map is trained in two phases: 9 150,000 iterations, learning rate = 0.1, radius = 4 9 1,500,000 iterations, learning rate = 0.05, radius = 2 The partition of the signal space corresponds to the Voronoi diagram defined by the map units, whose projection onto Ms and My is depicted in Fig. 6. The total number of states is 27 x 24 = 648. Though this is the total number of states, some of them may actually be never visited at all, thus the number of real states is somewhat smaller. There is a tradeoff between the number of states and the learning speed. If more states are used, the internal representation of the task is more detailed, but more trials are needed to learn the q-values of all those states. With less states, the q-values are updated more quickly and the learning process is faster. Unfortunately, there
215
Figure 7. Probability of insertion for random and learned strategies for the cube insertion.
is no general method for selecting the number of states, and it becomes a task-dependent heuristic process. Two new actions are added, namely rotations around the axis normal to the surface, since symmetry around it no longer holds. A qualitative measure of that angle is also included in the agent's location estimation. Since 10 different actions are possible in each state, the table of Q-values has 6,480 entries. The rest of the architecture and the training procedure remains unchanged. The increased difficulty of the task is shown by the low percentage of successful insertions that are achieved randomly at the beginning of the learning process. Only about 15% of the insertions are performed in less than 20 seconds. The higher difficulty is the reason for the longer learning time and the worse performance achieved with respect to the cylinder.
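Returning to the map definition of Section 4.1.1, the two-phase training schedule can be sketched as follows. This is a generic online SOM with a bubble neighbourhood on a rectangular lattice (the hexagonal topology is omitted for brevity), constant rates within each phase, and random stand-in torque samples; it is not the exact implementation used for the reported experiments, where the rates and radii typically decay within each phase.

```python
import numpy as np

def bubble_som_phase(codebook, data, n_iter, lr, radius, rng):
    """One training phase with a 'bubble' neighbourhood: every unit within
    `radius` lattice steps of the winner is updated with the full learning rate."""
    rows, cols, _ = codebook.shape
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), -1)
    for _ in range(n_iter):
        x = data[rng.integers(len(data))]
        bmu = np.unravel_index(np.linalg.norm(codebook - x, axis=-1).argmin(), (rows, cols))
        inside = np.abs(grid - np.array(bmu)).max(-1) <= radius   # bubble of units
        codebook[inside] += lr * (x - codebook[inside])
    return codebook

# 6 x 4 map on the three filtered torque components, trained in two phases
# (reduce the iteration counts for a quick test; the full schedule is slow in pure Python)
rng = np.random.default_rng(0)
torques = rng.normal(size=(70000, 3))            # stand-in for the recorded samples
som = rng.normal(size=(6, 4, 3))
som = bubble_som_phase(som, torques, 150_000, lr=0.10, radius=4, rng=rng)    # rough ordering
som = bubble_som_phase(som, torques, 1_500_000, lr=0.05, radius=2, rng=rng)  # fine tuning
```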
4.1.2. Learning results
Figure 7 depicts the probability of successful insertion for 1,000 random trials and 1,000 trials with the learnt controller, with respect to a time of up to 210 seconds (3 and a half minutes). The random controller, even for a long time of operation, is capable of performing only a low percentage of the trials (about 45%), whereas the learnt controller achieves more than 90% of the trials. As far as we know, this is the best performance achieved for this task using a square peg. In [5] only results for the cylinder are presented and, though generalizing to other shapes is said to be possible, no real experiments are carried out.
Figure 8. Probability of insertion for random and learned strategies with SOMs trained with the triangle and the cube.
4.1.3. Generalizing to another peg shape
An interesting generalization test is to use a SOM trained with samples from insertions of the square peg for learning the insertions of the triangle peg. Though the SOM is trained with a different shape, the purpose is to test whether the features learnt with the square are useful for the insertion of other shapes. Since the size of the SOMs is the same, the state representation is not modified at all. Figure 8 depicts the probability of successful insertion for 1,000 random trials, 1,000 trials with the strategy learnt with the specific SOM, and 1,000 trials with the strategy learnt with the SOM from the cube task, for times up to 210 seconds (three and a half minutes). Surprisingly enough, the results with the cube SOM are slightly better than those obtained with the specific SOM. A possible explanation is that the SOM trained with the cube is more powerful than the one trained with the triangle. It might occur that some input data have little influence during the training process of the triangle SOM (due to their low probability density) but are rather important for learning the insertion strategy. Since the cube SOM covers a wider area, some states may be properly identified with this SOM while remaining ambiguous with the triangle SOM. This is an interesting result which demonstrates the generalization capabilities of the SOM for extracting features that are suitable for different tasks.
5. CONCLUSION AND FUTURE DIRECTIONS
A practical SOM-based sensing approach to robotic manipulation has been presented. We have indicated the need for a robust representation of the task state, to minimize the effects of uncertainty. The implemented system is fully autonomous, and incrementally improves its skill in performing the task. Results for the 3D peg insertion task with both cylindrical and non-cylindrical pegs have demonstrated the effectiveness of the proposed approach. The learning process is fully autonomous. First, features are extracted from the sensor signals by the SOM. Later, the reinforcement learning algorithm associates the optimal actions with each state. The system is able to manage uncertainty in the position and orientation of the peg; notably, this uncertainty is larger than the clearance between the parts. Experimental results demonstrate the ability of the system to learn to insert non-cylindrical parts, for which no other working system has been described in the literature. In addition, the system generalizes well to other positions and orientations of the parts. Future work includes the study of skill transfer between tasks, to avoid learning a new shape from scratch. A promising example of using a SOM trained with the square peg for the insertion of a triangle peg has been shown. Another direction for future research is to investigate the integration of the presented techniques with other sensors, e.g. vision, possibly through the combination of several SOMs [3].
REFERENCES
1. M. E. Caine, T. Lozano-Pérez, and W. P. Seering. Assembly strategies for chamferless parts. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 472-477, 1989.
2. J. Canny and J. Reif. New lower bound techniques for robot motion planning problems. In 28th IEEE Symposium on Foundations of Computer Science, pages 49-70, 1987.
3. E. Cervera and A. P. del Pobil. Multiple self-organizing maps: A hybrid learning approach. Neurocomputing, 16:309-318, 1997.
4. E. Cervera, A. P. del Pobil, E. Marta, and M. A. Serna. Perception-based learning for motion in contact in task planning. Journal of Intelligent and Robotic Systems, 17:283-308, 1996.
5. V. Gullapalli, J. A. Franklin, and H. Benbrahim. Acquiring robot skills via reinforcement learning. IEEE Control Systems, 14(1):13-24, 1994.
6. V. Gullapalli, R. A. Grupen, and A. G. Barto. Learning reactive admittance control. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1475-1480, 1992.
7. M. Kaiser and R. Dillmann. Hierarchical learning of efficient skill application for autonomous robots. In International Symposium on Intelligent Robotic Systems, 1995.
8. T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences. Springer, 1995.
9. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 1998.
10. C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
SOM-TSP: An approach to optimize surface component mounting on a printed circuit board
H. Tokutaka and K. Fujimura
Department of Electrical and Electronic Engineering, Tottori University, 4-101 Koyama, Tottori 680-8552, JAPAN
E-mail: tokutaka@ele.tottori-u.ac.jp
The SOM-TSP method can handle problems of higher dimension more easily than other common TSP methods, such as the Hopfield method or chaotic methods. This opens the possibility of applying the method to various optimization problems. As one such proposal, we apply the SOM-TSP method to optimizing the efficiency of surface mounting of electronic parts on a printed circuit board. Here, we developed a SOM-TSP method handling up to four dimensions. Our numerical experiments show that the time required for mounting electronic parts is decreased by the proposed method compared with the built-in method of the mounting system, so the output of the factory can be increased.
1. INTRODUCTION
The SOM-TSP (Self-Organizing Map for Traveling Salesman Problem) method is one of the applications of the one-dimensional SOM (Self-Organizing Map) [1]. With the SOM-TSP method we can obtain a sub-optimum (sub-shortest) tour length, not the shortest one, in a short time on the TSP (Travelling Salesman Problem) compared with other methods. The SOM-TSP method is nevertheless useful for industrial applications, where the theoretical minimum solution is not needed. Here, we introduce a novel algorithm for the optimization of mounting electronic parts on a printed circuit board using the SOM-TSP method. In section 2, we briefly describe the improved original Angeniol method [2],[3]. In section 3, we apply the improved SOM-TSP method to the (maximum four-dimensional) optimization problem of the efficiency of surface mounting of electronic parts on a printed circuit board; the full algorithm is described in this section. In section 4, simulations and results are discussed. With large numbers of part kinds and large total numbers of parts to be mounted, the mounting time is reduced drastically by our proposed method compared with the usual built-in method. However, when the numbers are not so large, our proposed method and mounting by human beings are very similar in mounting time as well as in the pattern of the handler's movement route. Finally, in section 5, some concluding remarks are given.
2. SOM-TSP METHOD
2.1. Summary of the algorithm
The details of the algorithm are discussed in the original paper [2] and in our improved method [3]. In the algorithm, all the nodes are connected on a one-dimensional ring throughout the survey of the TSP. The nodes on the ring move continuously and are finally locked onto the locations of the cities, giving the optimum or a sub-optimum solution. During the survey, all the nodes move freely on the plane where the cities are fixed. When each city has caught exactly one node on the ring, the survey is terminated and the actual path (solution) is obtained. During the survey, each node J on the ring is renewed by the following eq. (1):

C_J ← C_J + f(G, n)·(X_i − C_J)    (1)

where the two-dimensional positions of the i-th city and the J-th node on the ring are expressed as X_i and C_J, respectively. The nodes on the ring are numbered from 1 to N (the total number of nodes on the ring at the present search). The node J is counted clockwise or counter-clockwise from the node J_c, which is the node closest to the i-th city in the present search; the smaller of the two counts is taken as the order distance n. As shown in eq. (1), each node J becomes closer to the i-th city by a factor proportional to the renewal function f(G, n), expressed as
f(G, n) = (1/√2)·exp(−n²/G²)    (2)
where G is the parameter adjusting the neighbourhood range that is renewed: when G is large the renewed range is broad, and when G is small it is narrow. The value of G in eq. (2) is chosen appropriately at the start of the survey of the TSP. During the survey, after each search over all M cities, G decreases according to eq. (3):
G ← α·G    (3)
We optimized the initial value of G and the renewal coefficient α as 10 and 0.95 [3], respectively. Angeniol et al. [2] fixed the value of α during the survey. However, we added the following momentum term to the renewal coefficient α, as shown in eq. (4):
α(t + 1) = α(t) + γ·[C(t) − C(t − 1)]    (4)
where C(t) is the total node number at the present search t over all M cities, C(t − 1) is the number at the previous search (t − 1), and γ is an appropriate proportionality coefficient. Here, γ is negative, which means that the renewal coefficient α decreases when the total node number increases. Figure 1 shows the characteristics of the momentum term γ [4]. A reduction of more than 80% in computing time is obtained for γ = −0.2 compared with γ = 0, while the tour length increases by only a small percentage. We confirmed that the effect of γ holds for very large problems of up to 10,000 cities.
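As a rough illustration of eqs. (1)-(4), the sketch below performs one survey over all cities. The node duplication and deletion rules of Angeniol's original algorithm, which make the node count C(t) grow and shrink, are omitted here, so the momentum term is only indicated; all names and default values are illustrative.

```python
import numpy as np

def renewal(G, n):
    """Renewal function f(G, n) of eq. (2)."""
    return np.exp(-(n ** 2) / (G ** 2)) / np.sqrt(2.0)

def survey(cities, nodes, G, alpha, gamma=-0.2, prev_node_count=None):
    """One search over all M cities: eq. (1) for every city, then eqs. (3)-(4).
    cities, nodes: arrays of shape (M, 2) and (N, 2); nodes form a ring."""
    N = len(nodes)
    for city in cities:
        jc = int(np.argmin(np.linalg.norm(nodes - city, axis=1)))  # closest node J_c
        for j in range(N):
            n = min(abs(j - jc), N - abs(j - jc))                  # order distance on the ring
            nodes[j] += renewal(G, n) * (city - nodes[j])          # eq. (1)
    G = alpha * G                                                  # eq. (3)
    if prev_node_count is not None:                                # eq. (4): momentum term
        alpha = alpha + gamma * (len(nodes) - prev_node_count)
    return nodes, G, alpha
```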
Figure 1. The characteristics of the momentum term γ.
(a) Theoretical shortest tour
(b) The tour obtained by the SOM-TSP method with γ = −0.2; +10.7%
Figure 2. The routes: (a) the theoretical shortest tour and (b) the tour obtained by our improved SOM-TSP method with γ = −0.2.
Figure 2 shows the results of our simulations for the pr2392 problem¹. The simulations were performed with γ = 0 and γ = −0.2 (only the result for γ = −0.2 is shown in figure 2). It was confirmed that:
• the computation time became 1/5 at γ = −0.2 (cf. γ = 0);
• a tour length 10.7% longer than the theoretical minimum was obtained at γ = −0.2.
Note that γ = 0 corresponds to Angeniol's algorithm. The improved SOM-TSP method, expanded to handle multi-dimensional data, was applied to this experiment. The experimental conditions were γ = −0.2, G(0) = 10 and α(0) = 0.95.
3. PROBLEM AND ALGORITHM
Figure 3 is a schematic diagram of the Automatic Electronic Parts Surface-Mounting Machine which is used to mount electronic parts on the printed circuit board. The machine is composed of one XY stage, a rotary-type handler (with multiple heads), reels, and one feeder axis. The printed circuit board is fixed on the XY stage. Electronic chip parts rolled in a reel are used, and the reels are set on the feeder axis. The XY stage and the feeder axis are movable in the machine. The rough mechanism by which electronic parts are mounted on the printed circuit board is as follows. An electronic part is picked up (adsorbed) at point (a) by a head installed in the handler. The head moves to the release point (b) by rotation of the handler. The XY stage is controlled so that the position on the board where the electronic part will be installed agrees with the release point (b) (positioning). After positioning, the part adsorbed on the head is dropped (released). The minimum mounting time cannot be obtained by only calculating the shortest tour length for the positions of the electronic parts on the circuit board. Moreover, the movement of the feeder axis is slower than that of the XY stage, because the feeder axis, which includes the reels, is very heavy. These factors complicate the problem further. If we can find the optimum combination of the reel order and the mounting order of the electronic parts, the mounting time will be minimized.
3.1. Algorithm
We used two processes in order to solve this problem. The notation below will help explain our algorithm.
• Kind of electronic parts: S_i (i = 1, 2, ..., n), where i indexes the kind of part;
• Total number of electronic parts: q ("k-th" means the part found in the k-th record of the original data file, k = 1, 2, ..., q);
¹ This problem is obtained from http://www.iwr.uni-heidelberg.de/iwr/comopt/soft/TSPLIB95/tsp/pr2392.tsp.gz
Figure 3. A schematic diagram of the Automatic Electronic Parts Surface-Mounting Machine.
• Each location of an electronic part: P_k = (x_{i,j}, y_{i,j}), where index j means the j-th part of the i-th kind S_i, j = 1, 2, ..., m_i.
Process 1: Calculate Reel Order {r_i}
The order of installing the reels on the feeder axis is calculated. We calculate the centre of gravity (x̄_i, ȳ_i) of each kind of electronic part S_i as

x̄_i = (x_{i,1} + x_{i,2} + ... + x_{i,j} + ... + x_{i,m_i}) / m_i    (5)

ȳ_i = (y_{i,1} + y_{i,2} + ... + y_{i,j} + ... + y_{i,m_i}) / m_i    (6)

where x_{i,j} and y_{i,j} are the x and y coordinates of the j-th part of kind S_i. The standard deviations σ_i^x and σ_i^y of each kind S_i are calculated as

σ_i^x = √[ ((x_{i,1} − x̄_i)² + (x_{i,2} − x̄_i)² + ... + (x_{i,m_i} − x̄_i)²) / m_i ]    (7)

σ_i^y = √[ ((y_{i,1} − ȳ_i)² + (y_{i,2} − ȳ_i)² + ... + (y_{i,m_i} − ȳ_i)²) / m_i ]    (8)

The average and the standard deviation of each kind S_i are treated as four-dimensional data (x̄_i, ȳ_i, σ_i^x, σ_i^y). All the data of the electronic parts are merged into one data set. This data set is used in the calculation of the SOM-TSP method with supplementary cities (described in a later section). The order of visits obtained as a result is used as the reel order r_i.
Process 2: Calculate Mounting Order {o_k}
The mounting order of the parts on the circuit board is calculated. The data P_k^0 = (x_{i,j}, y_{i,j}, A·r_i) is generated for each k-th part from three elements: the x and y coordinates from P_k, and A·r_i, which is r_i multiplied by a parameter A. The parameter A is used to control the allowed range of reel skipping: if A is large, the allowed range of reel skipping becomes narrow; in the opposite case, the range becomes wide. Parameter A can be selected according to the performance of the mounting machine being used. The data set {P_k^0} is generated and used in the calculation of SOM-TSP with supplementary cities (described in a later section). The order obtained from the calculation becomes the mounting order {o_k} for the given problem.
Supplementary cities
In principle, only a closed, looped route can be obtained by the original SOM-TSP method. Even if the route is found, an order cannot be obtained, because the beginning point and the end point are not defined. The SOM-TSP method with supplementary cities is introduced to remove this defect. We explain the method in three dimensions (the case of Process 2), as it is then easily understood. A new route containing supplementary cities is generated as follows (as shown in figure 5a):
1. Define the starting point and the end point;
2. Generate an artificial closed-loop route from the starting point toward the end point, in the space created from the three components x, y and A·r_i, chosen so that it does not influence the set {P_k^0}.
The route is then calculated using the usual SOM-TSP method for the new data set, which includes all data positions in the new route. All the arbitrarily added positions are deleted after the SOM-TSP calculation. Using the rest of the data, an open route is obtained, and the order of the P_k^0 from the starting point to the end point along this route becomes the mounting order o_k. Figure 5a shows the case where supplementary cities have been included in order to complete the loop; here the starting point and the ending point are not close to each other. With this approach, the starting point and the terminal point of the handler can be chosen arbitrarily, which increases the degree of freedom.
Figure 4. The algorithm for the problem of mounting electronic parts on the printed circuit board.
For figure 5b, there are no supplementary cities, so the starting point and the terminal point must be next to each other; the degree of freedom is therefore reduced.
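As a compact restatement of the two processes above, the following sketch builds the four-dimensional data of Process 1 and the three-dimensional data of Process 2 and hands each set to a SOM-TSP solver with supplementary cities. The solver itself is passed in as a callable, and all names and the default value of A are illustrative, not taken from the actual system.

```python
import numpy as np

def reel_order(parts_by_kind, solver):
    """Process 1: build (mean x, mean y, std x, std y) per kind S_i and obtain
    the reel order {r_i} from a SOM-TSP solver with supplementary cities.
    parts_by_kind: list of (xs, ys) coordinate arrays, one pair per kind."""
    data = np.array([[xs.mean(), ys.mean(), xs.std(), ys.std()]
                     for xs, ys in parts_by_kind])
    return solver(data)

def mounting_order(part_locations, kind_of_part, reel_rank, solver, A=2500.0):
    """Process 2: build (x, y, A * r_i) per part and obtain the mounting order {o_k}.
    reel_rank maps a kind index to its reel order r_i from Process 1."""
    data = np.array([[x, y, A * reel_rank[kind_of_part[k]]]
                     for k, (x, y) in enumerate(part_locations)])
    return solver(data)
```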
Figure 5. (a) with Supplementary cities, (b) without Supplementary cities.
4. SIMULATIONS AND RESULTS
The performance of our algorithm was evaluated using printed circuit boards actually in use. The performance was finally evaluated on the time required for mounting (referred to as the "mounting time" in several figures of this paper). The mounting time was measured using a simulator developed for the targeted mounting machine. For a printed circuit board with 81 kinds and 354 parts, the mounting time was 112 seconds per board using the built-in method of the mounting system (machine). When our proposed method was used for the same board, the mounting time was reduced to 90 seconds (see figure 6). In figure 6, the tracing routes obtained when mounting are drawn; the tracing route obtained by our method has fewer overlaps among the lines. From the comparison of these tracing routes, we can estimate at a glance that our method results in a better performance. As another example, we applied the proposed method to a printed circuit board with 109 kinds and 716 parts. In this case, the mounting time was reduced to 174 seconds (see figure 7). The proposed method was thus able to shorten the mounting time even with a higher number of parts. The software currently used in industry for component assembly works effectively when the number of components being assembled is large; it becomes ineffective when the number of parts is small, and under this condition the parts are assembled by human beings. Applying SOM-TSP to the assembly of larger numbers of parts yielded better results than the current software, as shown in figures 6 and 7. For the case of assembling fewer parts on the component board, as shown in
figure 8, with 65 kinds and 154 parts, there was not much difference between the assembly time obtained with SOM-TSP and that achieved by human beings. Subsequent experiments with fewer parts yielded results similar to those shown in figure 8. The component-assembly pattern found by SOM-TSP was also similar to that of human beings. In conclusion, for few parts the SOM-TSP solution is similar to that produced by human beings.
Figure 6. Obtained tracing routes for the problem with 81 kinds and 354 parts; (a) method used in factory at present time (b) proposed algorithm with supplementary cities.
The relationship between parameter A and the mounting time is shown in figure 9 for a printed circuit board with 109 kinds and 716 parts.
Figure 7. Obtained tracing routes for the problem with 109 kinds and 716 parts; (a) method used in the factory at present (b) proposed algorithm with supplementary cities. The characteristics of the reel skipping are compared.
Figure 8. Obtained tracing routes for the problem with 65 kinds and 154 parts; (a) method used in factory at present time (by human beings) (b) proposed algorithm with supplementary cities.
The mounting time obtained by our method was found to be optimized for parameter A in the range 2000 to 3000; the optimized mounting time was 175 seconds or below. Other experiments also showed that an optimum value of parameter A exists.
Figure 9. The characteristics of mounting-time vs. parameter A.
5. CONCLUSION
The SOM-TSP method was applied to the electronic parts mounting problem of a printed circuit board. Open routes can be obtained by SOM-TSP with supplementary cities; therefore, we can specify the starting point and the end point on the board. The algorithm, which consists of two processes (applications of SOM-TSP with supplementary cities), was applied to the problem. By applying this method to actual boards, a reduction in mounting time was achieved. If this technique is applied to the mounting process of printed circuit boards in an actual factory, the productive efficiency of the factory should improve appreciably.
REFERENCES
1. T. Kohonen, Self-Organizing Maps, Springer-Verlag, 1995.
2. B. Angeniol, G. de La C. Vaubois and J.Y. Le Texier, Neural Networks 1 (1988) 289.
3. K. Fujimura, H. Tokutaka, Y. Ohshima and S. Kishida, ICONIP'94-Seoul, 427 (1994).
4. K. Fujimura, H. Tokutaka, S. Tanaka, T. Maeno and S. Kishida, WSOM'97-Helsinki, 80 (1997).
5. M. Padberg and G. Rinaldi, Operations Research Lett. 6 (1987) 1.
Self-Organising Maps in Computer Aided Design of electronic circuits
Ahmed Hemani, ESD Lab, Department of Electronics, KTH, Sweden, ahmed@ele.kth.se
Adam Postula, Department of CSEE, University of Queensland, Australia, adam@elec.uq.edu.au
Abstract
Computer Aided Design of electronic circuits depends heavily on effective optimisation algorithms. Several neural network paradigms have been explored for optimisation problems in CAD, and Kohonen's Self-Organising Maps (SOM) proved to be one of the more successful. This presentation focuses on mapping scheduling and binding, the processes that are crucial for optimisation in High Level Synthesis, onto SOM. The presented analysis of various issues in mapping HLS scheduling to SOM may also be of interest to researchers in other fields facing similar mapping problems. The SOM-based algorithms have been implemented and form the optimisation kernel of the synthesis system SYNT, which was later developed into a commercial tool.
1. Introduction
Computer Aided Design (CAD) of electronic circuits is an area where the need for efficient optimisation algorithms and the dependence on advanced research results are overwhelming [1]. As silicon technology allows whole electronic systems to be put on a chip, the design tasks become much more complex and must be performed in a shorter time. The designer must concentrate on the functionality of the system and leave the implementation details to automatic tools capable of producing optimised solutions that compete with handcrafted designs. Research on hardware synthesis provides the methodology and optimisation algorithms for automating the design process. As the design problems span from placement of transistors to simulation or test of whole systems, a large variety of optimisation algorithms can be found in the research literature. Among those investigated lately are approaches based on neural networks and Self-Organising Maps applied to lower-level hardware design problems, such as placement [2],[3],[4],[5]. As the need for higher-level tools increases rapidly, it is of much interest to explore the possibilities created by these new approaches in this area. The High Level Synthesis (HLS) paradigm [6],[7] provides a way of producing optimised hardware from functional or behavioural descriptions, allowing the designer to work efficiently at a high abstraction level. HLS is a concept very similar to compilation in software, but it is much more difficult, since the solution space is multidimensional and extensive optimisations are needed to produce acceptable results.
The High Level Synthesis process is illustrated in Figure 1, where the most important components of the optimisation chain are identified and their intermediate results are indicated.
(Figure 1 sketches the High Level Synthesis flow: a graph of operation nodes and dependencies enters allocation, which chooses the types of functional units (FUs) from a library; scheduling assigns operations to FU types and time steps; binding assigns operations to particular FUs; structure extraction optimises the structure and interconnections before handing the design over to lower level tools.)
Figure 1. High Level Synthesis
Allocation, scheduling and binding [8], [9] form the optimisation kernel of behavioural synthesis. These tasks are complex and interdependent. Due to their complexity they are usually performed separately and, when needed, the sequence is reiterated to improve the results. Among them, scheduling is probably the best known in other disciplines and is considered the most important for HLS results, since it directly influences the performance and cost of the synthesised hardware. We focus this presentation on the issues of mapping the scheduling problem, as it is defined in HLS, onto Kohonen's Self-Organisation Maps (SOM) [10], [11]. We discuss in detail how the specific requirements of HLS scheduling are mapped to SOM, and expect that an interested reader will be able to use our approach to map scheduling problems specific to other domains. The application of SOM to optimisation in high level synthesis proved to be very successful, and we implemented scheduling and binding as part of the synthesis system SYNT. The SYNT system has been extensively tested in academic and industrial environments and later formed a basis for commercial development. In our development work [14], we found that the complexity and intricacies of the scheduling problem being mapped sometimes obscured the behaviour of the original SOM. Kohonen's algorithm can be presented and analysed in a strictly formal and mathematically correct way [10], but for the developers of synthesis tools it was equally important to have an intuitive understanding of this algorithm and its characteristics. The following section gives such an intuitive view.
2. Kohonen's Self-Organisation explained in an intuitive way
Consider a group of people, crowded in the centre of an arbitrary polygon, as shown in Figure 2.a. Each person in this group is a friend of some of the others. The friendship of two persons is proportional to the similarity of the shape of their heads. We want to make the people happier by giving them more living space, while retaining their social circles. Hence there are two objectives:
1. To distribute the people uniformly in the polygon.
2. To position friends close to each other. More accurately, a person's distance to another person should reflect the degree of similarity between them.
The required distribution is shown in Figure 2.b. We will later see how these two objectives mirror the optimisation needs of the scheduling and binding problems in HLS.
Figure 2.a Crowded people
Figure 2.b Result of self-organisation
Now let us subject these people to the following self-organising process, illustrated in Figure 3, and see that we achieve the objectives.
Step 1: Set process parameters. Set the friendship circle, a process parameter, to a large value. This parameter gradually decreases with each iteration. It controls how many friends a person has at any instance (iteration). As the friendship circle decreases, the more distant friends are the first to go out of the friendship circle.
Step 2: Generate a point. Generate a random point in the polygon. Points are generated with a uniform probability distribution over the polygon.
Step 3: Set up a competition. Let the people compete for the random point. The winner is the person closest to the random point.
Step 4: Move the winner and its neighbours. Move the winner towards the random point. Also move those friends of the winner who are in its current friendship circle towards the random point. Close friends move more than distant friends.
Step 5: Update process parameters. Decrease the friendship circle. If the friendship circle is large enough to have some friends in it, repeat by going to Step 2.
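A minimal sketch of the five-step loop above, assuming the polygon is simply the unit square and that friendship is given by a precomputed distance matrix (with distance 1 from a person to him/herself); the gain and shrink schedules are illustrative.

```python
import numpy as np

def self_organise(friend_dist, iterations=10000, gain0=0.5):
    """friend_dist[i, j]: friendship distance between persons i and j (1 on the diagonal)."""
    n = friend_dist.shape[0]
    positions = np.full((n, 2), 0.5) + 0.01 * np.random.randn(n, 2)   # crowded at the centre
    circle = friend_dist.max()                                        # Step 1: large circle
    shrink = circle / iterations
    for _ in range(iterations):
        point = np.random.rand(2)                                     # Step 2: random point
        winner = int(np.argmin(np.linalg.norm(positions - point, axis=1)))  # Step 3
        for p in range(n):                                            # Step 4: move winner + friends
            d = friend_dist[winner, p]
            if d <= circle:
                gain = gain0 / d          # closer friends move more; the winner (d = 1) most
                positions[p] += gain * (point - positions[p])
        circle -= shrink                                              # Step 5: shrink the circle
        if circle < 1:                                                # not even the winner inside
            break
    return positions
```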
Figure 3. The process of self-organisation
There are three characteristics of Kohonen's self-organising algorithm that we use in the scheduling and binding.
• Uniform distribution
This is due to people competing, winning and moving towards points which are uniformly distributed. When a person wins a point and moves towards it, he/she is in a better position to win any future points close to the point he/she has already won. As everyone competes on equal terms for the points, which are uniformly generated, in time everyone carves out a territory in the polygon almost equal to that of the others.
• Clustering
As a winner moves towards the point it has won, it also moves some of its friends towards that point. This directly explains why friends are clustered. Clustering is further reinforced in subsequent iterations: when a point is generated close to the point won by the winner, the friends are in a good position to win it and pull their friends (including the original winner) towards it. The amount by which the winner and its friends move towards the point is controlled by a gain parameter. The gain is proportional to the similarity of a person (being moved) to the winner. Obviously, the gain is highest for the winner, a person being most similar to him/herself. This explains why friends not only cluster, but there is an order in the clustering as well: one person's distance to another reflects the similarity between them. This is illustrated in Figure 2.b, where triangular-headed persons are on average closer to square-headed persons than to round-headed persons, as triangle and square are more similar than triangle and circle.
• Hill-climbing
Moving a winner towards a point is called a primary move, whereas moving friends of the winner is called a secondary move. The primary move is a conscious attempt by the winner to improve its position, whereas a secondary move blindly follows the winner; the secondary move may or may not improve the position of the secondary movers (the friends), and as such it provides the hill-climbing mechanism in the self-organising process. At the beginning of the process, when the friendship circle is large, every primary move entails many secondary moves, i.e., many hill-climbing moves. As the process evolves and the friendship circle shrinks, the number of secondary moves, and thereby of hill-climbing moves, decreases. This endows the self-organising process with the characteristic of a decreasing number of hill-climbing moves as the process evolves, analogous to the behaviour of the simulated annealing algorithm.
3. Scheduling as a self-organisation process
3.1. Scheduling in high level synthesis of digital circuits
Scheduling is a classic optimisation problem studied in fields as diverse as the factory floor [12] and satellite communication [13]. It is characterised by the trade-off between available time and available resources. Besides the limit on available resources, the other constraints on the scheduling problem are: a) some operations are dependent on others, and b) some operations can be performed on specific types of resources only. In high-level synthesis (HLS) of synchronous digital systems, the resources are functional units, registers and data path units like multiplexers and buses, and time is measured in number of control steps. In our synthesis system SYNT, the scheduler works under a fixed time constraint, but the allocator allows the user to explore interactively the effect of varying the time constraint on the estimated resource usage. Scheduling implicitly decides the number of units used of each functional unit type. Operations serviced by the same type of functional unit can reuse units used by operations in other control steps, so the maximum number of units required of a particular type is determined by the maximum number of operations scheduled in a single control step and serviced by that functional unit type.
3.2. Mapping high level synthesis scheduling to SOM
Use of Kohonen's Self-Organising Maps for scheduling [15], [16] is based on the following insight: "Given a number of control steps, the schedule that requires the minimum resources is the one that creates the most uniform distribution of the operations across the control steps." The scheduler uses two important characteristics of self-organisation: uniform distribution and hill climbing. The uniform distribution provides minimum resource usage, while the hill climbing is necessary for global optimisation.
Network. The neural network used by Kohonen's self-organisation algorithm, as shown in Figure 4, has three components: input nodes, output nodes and adaptive weights.
Figure 4. Organisation of the neural network used by the self-organising scheduler. The shrinking polygons around operation v* depict its friendship circle at iterations t1 < t2 < t3.
Input Nodes. Input nodes correspond to the dimensions of the output space. Since there are two dimensions in the schedule space, the time (control steps) and the resource (functional units), there are correspondingly two input nodes: the time node K and the resource node F. A random point in the schedule space (for operations to compete for) is represented by an input vector (Ik, If). This vector is put on the input nodes.
Output Nodes. Output nodes correspond to the objects we want to self-organise in the output space. In terms of the scheduling problem, these correspond to the operations we want to schedule, i.e., one output node for every operation to be scheduled. These are shown in Figure 4 by small circles with operations vx inside them. The connecting arrows show the dependencies between operations.
Adaptive weights. Every output node is connected to the two input nodes by a pair of adaptive weights (Wk, Wf). These adaptive weights allow all output nodes to evaluate the input vector simultaneously, and they specify the position of the output nodes in the schedule space. The weight pairs specify the state of the self-organising process at any instance, and indeed the final result, i.e., the schedule. For each operation, the Wk weight specifies the control step in which it will be executed, and the Wf weight specifies the functional unit type (and instance, when doing binding) to which it is bound. Two important concepts related to the self-organisation must be defined in terms of the scheduling problem.
Friends. In scheduling, two operations are friends if they are dependent. This qualifies friendship. Quantifying the friendship between two operations is more accurate if it is based on a measure of the freedom to move the operations, instead of on dependency only, as explained in Figure 5.a. It can be calculated from an average of the As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules, before the self-organisation scheduling starts. The friendship distance of an operation to itself is one, which is also the minimum friendship distance, as an operation is the closest friend of itself. The friendship distance matrix specifies the friendship distance for every pair of operations.
Friendship circle. A process parameter which controls how many friends an operation has at any instance (iteration). In Figure 4 the shrinking polygon size with increasing iteration represents the fact that operation v* gradually loses friends, the more distant ones being the first to go. The friendship circle is large in the beginning and it shrinks at a rate determined by the complexity and the size of the problem.
Schedule space. The output space in Kohonen's self-organisation algorithm is linear, but the schedule space is discrete in time (control steps) and in space (functional units, FUs). To map the scheduling problem to self-organisation, we treat the schedule space as linear. In the time dimension, each control step is given a resolution, which naturally corresponds to the length of the clock period. In the resource or FU dimension, the resolution is called the type frame. All FU types have type frames of equal width, and each of these widths is calculated from the estimated number of units required for the FU types. This is illustrated in Figure 5.b.
(Figure 5.a illustrates the quantification: operation b has more freedom between a and c, so it is not as close a friend to a and c as d would be to a and e. Figure 5.b shows the schedule space, with the type frame shared by operations op1 and op2 and the separate time frames for op1 and op2.)
Figure 5.a Quantifying friendship.
Figure 5.b Schedule space, time and type frames.
Output nodes in the self-organisation algorithm are free to move in the entire output space, but the scheduling problem imposes some movement restrictions. The following restrictions come from practical constraints of the scheduling problem, as their violation could create an illegal schedule:
Time frame. Given the number of control steps, every operation has associated with it a time frame. As an operation cannot be scheduled before its ASAP schedule or after its ALAP schedule, it would be illegal for an operation to move outside its time frame. This translates into the
Wk weight of every operation having a lower limit in the form of the ASAP schedule and an upper limit in the form of the ALAP schedule.
Type frame. Every operation is serviced by a particular FU type. Corresponding to each FU type, there is a contiguous segment in the resource dimension called the type frame. An operation is restricted in the resource dimension to moving within its type frame only. This restriction translates into the Wf weight of every operation having an upper limit in the form of the right edge of the type frame and a lower limit in the form of the left edge.
Dependency. Although an operation is restricted within its time frame, it is still possible to create illegal schedules by scheduling an operation before its predecessor or after its successor. This requires that the movement of an operation also be restricted by its dependencies.
3.3. Process parameters
We have already discussed the fundamental issues in mapping scheduling to self-organising maps. Now we are in a position to discuss process parameters such as the friendship circle, gain and shrink schedule, which are of crucial importance for practical applications.
Friendship circle influences two crucial aspects of the self-organising process:
• the time required for the process to converge;
• the number of hill-climbing moves at the beginning of the process.
The initial friendship circle is set to 0.5 of the critical path of the graph to be scheduled. The factor 0.5 was originally suggested by T. Kohonen [10], and our experiments confirmed its validity. The critical path is defined as the path from predecessor to successor nodes that has the largest number of edges, i.e., the data dependency distance between two such operations is largest.
Shrink schedule decides the rate at which the friendship circle shrinks and thereby influences:
• the convergence time, along with the friendship circle;
• the rate at which the hill-climbing moves decrease.
The shrink schedule in SOS is linear and is based on a measure of complexity of the scheduled graph. That measure of complexity takes into account the parallelism of operations and the length of the critical path, to obtain a balanced measure and to assure that the scheduling of complex parallel graphs gets enough iterations.
Gain controls how much an operation and its friends change their position as a result of winning the competition in an iteration. It always has a value between 1.0 and 0.0. In our implementation we start with a value of 0.5 and then decrease it with time and friendship distance.
3.4. Synthesis specific implementation of self-organising operations
Generation of points. In Kohonen's algorithm the output space is uniform and the points are generated with a uniform distribution. In scheduling, the output space is digitised by time/type frames, as discussed before. Some time/type frames in the schedule space can accommodate more operations than other frames; to exploit this, the number of points for competition should be larger in
those frames. Also, hardware functional units have different costs, and to prioritise cheaper solutions more points should be generated in the cheaper FU type frames.
Competing and deciding the winner. An operation is eligible to compete for a point if its time and type frames include that point; this criterion is necessary for creating legal solutions. An operation is eligible only if it needs to move from its position. This is a departure from the original SOM that stems from the needs of scheduling and deserves a more detailed explanation. SOM tends to distribute operations uniformly in a linear space, but scheduling should distribute operations uniformly over control steps. The scheduler has information about the lower bound on the number of functional units for a particular type of operation, obtained by an analysis of the graph before scheduling starts. By checking the number of operations scheduled in a particular control step, and not allowing an operation to compete if that number is lower than or equal to the lower bound, the scheduler contributes to the uniform distribution of operations over control steps. It should be noted that the same operation can still be moved as a result of a secondary move when it follows its friend, therefore hill climbing is not affected.
Moving the winner and its friends. There are some restrictions on how close operations can come to each other in time. All types of functional units have delays, and an operation inherits the delay of its unit type. That means that, for example, a close friend cannot come closer to the winner than its propagation delay. There are also further restrictions stemming from dependencies between operations in different branches of a graph. All this makes the implementation of the move operation fairly complex.
Adapting the weights and maintaining dependencies. As the position of an operation in the schedule space is represented by the pair of weights (Wk, Wf), moving the winner and its friends towards the input vector essentially means adapting their weights. In a general SOM, all the weights of all the moved nodes would be updated, but in the scheduler both the Wk and Wf weights are adapted for the winner, whereas for predecessors and successors only the Wk weights are adapted. This strategy has been adopted because we are primarily interested in a uniform distribution in the time dimension, and the SOS scheduler does not show any loss in the quality of the results by skipping the adaptation of the Wf weights of the friends. The need to maintain dependencies arises from the fact that an operation dependent on other operations in different branches of the graph can be wrongly pulled towards the winner in a secondary move. To rectify this, the operations violating the dependencies must be moved. These tertiary moves are similar in effect to secondary moves and do not have a deteriorating effect on the convergence of the algorithm.
Stopping the process. The scheduler stops when the friendship circle becomes less than one, the distance of an operation to itself; the friendship circle is then so small that it does not even include the winner. Another, scheduling-specific, criterion is reaching the lower bound of functional unit usage.
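Putting the previous subsections together, one SOS iteration could be sketched as follows. The eligibility check of section 3.4 is abstracted into a predicate, the frame clipping is simplified, and the tertiary dependency-repair moves are omitted; all names are illustrative rather than taken from the actual SYNT implementation.

```python
def sos_iteration(ops, point, circle, gain0, needs_to_move):
    """One SOS iteration: competition for a random point, a primary move of the winner,
    and secondary moves of its friends. Each op carries weights Wk (control step) and
    Wf (position in its FU type frame), its ASAP/ALAP limits, the edges of its type
    frame, and a dict `friends` mapping friend ops to friendship distances (>= 1)."""
    k, f = point
    eligible = [o for o in ops
                if o.asap <= k <= o.alap            # point inside the op's time frame
                and o.frame[0] <= f <= o.frame[1]   # and inside its type frame
                and needs_to_move(o)]               # lower-bound check of section 3.4
    if not eligible:
        return
    winner = min(eligible, key=lambda o: (o.Wk - k) ** 2 + (o.Wf - f) ** 2)
    # Primary move: adapt both weights, clipped to the legal frames
    winner.Wk = min(max(winner.Wk + gain0 * (k - winner.Wk), winner.asap), winner.alap)
    winner.Wf = min(max(winner.Wf + gain0 * (f - winner.Wf), winner.frame[0]), winner.frame[1])
    # Secondary moves: friends inside the friendship circle adapt only their Wk weight
    for friend, dist in winner.friends.items():
        if friend is not winner and dist <= circle:
            gain = gain0 / dist
            friend.Wk = min(max(friend.Wk + gain * (k - friend.Wk), friend.asap), friend.alap)
```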
Scheduling results. The immediate result of the self-organisation process is a schedule in the time/resource space. The result required by the synthesis system is not only the assignment of operations to control steps but also the number of functional units of different types needed to implement the operations. This information is easily extracted from the schedule, as the number of required units of a specific type is equal to the maximum number of operations of that type scheduled in any control step. A typical result of scheduling is shown in Figure 6, where the operations of a graph have been scheduled in control steps (horizontal slots) and assigned to functional unit types (vertical slots).
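Extracting the resource requirement from a finished schedule is straightforward; a minimal sketch, assuming the schedule is given as a list of (control_step, fu_type) pairs, one per operation.

```python
from collections import Counter

def units_required(schedule):
    """schedule: iterable of (control_step, fu_type) pairs, one per operation.
    Returns, per FU type, the maximum number of operations of that type in any
    single control step, i.e. the number of units of that type to allocate."""
    per_step_type = Counter(schedule)                 # (step, type) -> operation count
    required = {}
    for (step, fu_type), count in per_step_type.items():
        required[fu_type] = max(required.get(fu_type, 0), count)
    return required
```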
4. Binding as a self-organisation process
Binding is the process of assigning a scheduled operation to a particular functional unit, and as such is performed after scheduling. The optimisation goal of binding is the minimisation of the cost of the interconnect between functional units. The strategy is to cluster operations assigned to a particular functional unit type according to their common dependencies and sharing of data. In other words, binding is a multi-partitioning problem, where the CDFG is to be partitioned into as many clusters as the number of units decided by the scheduler, such that the sum of connections between the partitions is minimised. Like the scheduler, the binder in the SYNT system is also based on Kohonen's self-organisation algorithm and is called SOB, for self-organising binder [17]. The scheduler relied on the uniform distribution of nodes, while the binder relies on the clustering characteristic of Kohonen's algorithm. And like the scheduler, the binder also relies on the built-in hill-climbing mechanism to ensure that it does not get trapped in local minima. The binder uses a network very similar to that presented for the scheduler. The difference is in the way the Wf weight is interpreted: in binding, Wf also specifies the instance of the FU type to which the associated operation is currently bound. When the self-organisation process is finished, the instance is extracted from the Wf weight using the resource requirement table that was created by SOS. Another difference is that in SOB operations are allowed to move only in the resource dimension; this considerably simplifies SOB, since it does not need to resolve complicated dependency problems. A typical result of binding is presented in Figure 6, where all the operations within the shaded polygon have been grouped since they share or propagate data to each other. From this grouping, the assignment to functional units (an adder and a multiplier) has been easily obtained.
5. SYNT - a synthesis system with SOM based scheduler and binder
The SOS and SOB programs form the optimisation kernel of the SYNT system, which is capable of automatically synthesising digital circuitry from a behavioural description in VHDL. The system has been fully implemented and tested on different designs, ranging from medium to large size descriptions. As the system was intended for industrial use, it has been provided with a user interface allowing a designer to easily experiment with the schedules and to influence the
choice of functional unit types. A sample of the interface showing results of scheduling and binding is presented in Figure 6.
Figure 6. Scheduling and binding results for the fifth order filter, using 17 control steps.
This interface proved invaluable in testing the system on industrial designs [18]. The scheduler and binder provided consistently good results in reasonable run times. The planned improvements were to extend the SOS capability to optimise the interconnections between functional units and the number of registers, to schedule operations on multi-operational functional units, and to handle algorithmic pipelining. The SYNT system was so successful that it formed the core product of a start-up company and was later developed further into a fully commercial tool by one of the major CAD tool vendors.
6. References
1. S.H. Gerez, Algorithms for VLSI Design Automation, Wiley & Sons, 1999.
2. R. Lebeskind-Hadas, C.L. Liu, Solutions to module orientation and rotation problems by neural computation networks. Proceedings of 26th DAC, Las Vegas, U.S.A., 1989.
3. Sung-Soo Kim, Ching-Min Kyung, Circuit Placement in Arbitrarily-Shaped Region using Self-organisation. ISCAS '89.
4. Jih-Shyr Yih and P. Mazumdar, A neural network design for circuit partitioning. Proceedings of 26th DAC, Las Vegas, U.S.A., 1989.
5. A. Hemani, A. Postula, Cell placement by Self-organisation, Neural Networks, Vol. 3, 1990.
6. D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
7. R. Camposano, W. Wolf, High-Level VLSI Synthesis, Kluwer Academic Publishers, 1991.
8. M. Balakrishnan, Integrating scheduling and binding: A synthesis approach for design space exploration, Proceedings of the 26th DAC, 1989.
9. R. J. Cloutier and D. E. Thomas, The combination of Scheduling, Allocation and Mapping in a Single Algorithm, Proceedings of the 27th DAC, 1990.
10. T. Kohonen, Self-organisation and associative memory. Second edition. Springer Verlag, 1988.
11. H. Ritter, Kohonen's Self-organising maps, exploring their computational capabilities, Proceedings of the IEEE Int. conference on neural networks, San Diego, 1988.
12. D.N. Zhou, V. Cherkassky, T.R. Baldwin, Scaling neural network for Job-Shop Scheduling. Proceedings Vol. 3, IJCNN, San Diego, June 1990.
13. E. Bourret, E. Remy, A Special Purpose Neural Network for Scheduling Satellite Broadcasting times. Proceedings Vol. 2, IJCNN, Washington DC, Jan. 1990.
14. A. Hemani, High-Level Synthesis of Synchronous Digital Systems using Self-Organisation Algorithms for Scheduling and Binding, PhD Thesis, Department of Applied Electronics, KTH, Stockholm, Sweden, 1992.
15. A. Hemani, A. Postula, A neural net based Self-organising scheduling Algorithm, Proceedings of EDAC, Glasgow, March 1990.
16. A. Hemani, A. Postula, Scheduling by self-organisation, Proceedings of IJCNN'90, Washington DC, Jan. 1990.
17. A. Hemani, Self-organisation and its application to binding, Proceedings of the 6th International conference on VLSI Design, Bombay, India, Jan. '93.
18. Ahmed Hemani, Mats Fredriksson, Kurt Nordqvist, Björn Fjellborg, Börje Karlsson, Application of a High-level synthesis system in an Industrial Project, Jan. '94, Calcutta, India, pp. 5-10.
MODELING SELF-ORGANIZATION IN THE VISUAL CORTEX*
Risto Miikkulainen, James A. Bednar, Yoonsuck Choe, and Joseph Sirosh†
Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712 USA
*This research is supported in part by the National Science Foundation under grants IRI-9309273 and IIS-9811478. RF-LISSOM software and demos are available at http://www.cs.utexas.edu/users/nn.
†Present address: HNC Software Inc., San Diego, CA 92121.
To gain insight into biological knowledge organization and development, the Self-Organizing Map architecture can be extended with anatomical receptive fields, lateral connections, Hebbian adaptation, and spiking neurons. The resulting RF-LISSOM model shows how the observed receptive fields, columnar organization, and lateral connectivity in the visual cortex can arise through input-driven self-organization, and how such plasticity can account for partial recovery following retinal and cortical lesions. The self-organized network forms a redundancy-reduced sparse coding of the input, which allows it to process massive amounts of information efficiently, and explains how various low-level visual phenomena such as tilt aftereffects and segmentation and binding can emerge. Such models allow understanding biological processes at a very detailed computational level, and are likely to play a major role in cognitive neuroscience in the future.
1. INTRODUCTION
The Self-Organizing Map (SOM) architecture was originally motivated by the topological maps and the self-organization of sensory pathways in the brain [1]. The standard SOM algorithm, where a global supervisor finds the maximally responding unit and determines the weight-change neighborhood, is a computationally powerful abstraction of the actual biological mechanisms. Such an abstraction has turned out to be extremely useful in many real-world tasks, from visualization and data mining to robot control and optimization [2]. However, SOM has turned out to be a useful tool for understanding biological maps and biological self-organization as well. As more has been learned about the structure and development of the sensory areas in the brain, it has become increasingly clear that activity-dependent adaptation similar to SOM plays a central role. For example, the topological structure in the somatosensory cortex and its reorganization after peripheral lesions, as well as the formation of ocular dominance and orientation columns in the visual cortex, can be given a computational account by SOM models [3,4]. The SOM model can be extended in several ways to make it possible to model biological phenomena in more detail. For example, the neighborhood function can be implemented through chemical diffusion, or through lateral connections between neurons. Firing-rate
neurons can be replaced with spiking neurons, and actual anatomical receptive fields can be included in the model instead of abstracted input features. The neurons can adjust their weights in terms of Hebbian adaptation instead of vector differences [5,6]. This paper summarizes work to date on one such model, RF-LISSOM (Receptive-Field Laterally Interconnected Synergetically Self-Organizing Map [6-9]), which was designed specifically to provide computational insight into the development and function of the primary visual cortex. The model shows how the neurons' receptive fields, 2-D columnar organization, and lateral connectivity can be developed based on input-driven self-organization. These same mechanisms explain how the cortex can remain plastic and adapt to changes in the input and to internal lesions. The model suggests that the resulting organization is formed to represent and process visual input efficiently, forming a redundancy-reduced sparse coding of the visual input. The self-organized model can then be used to model various low-level visual phenomena, including tilt aftereffects and segmentation and binding.
2. THE RF-LISSOM MODEL
The RF-LISSOM model presents a unified theory of self-organization, plasticity, and function in the primary visual cortex. Although several models of cortical self-organization based on Hebbian learning have been proposed, RF-LISSOM is the only one where adapting lateral connections are explicitly taken into account. RF-LISSOM therefore allows testing hypotheses about cortical mechanisms at a fundamentally different level of detail than the other models. RF-LISSOM is based on the assumption that the vertical columns form the basic computational units in the visual cortex, and that it is possible to approximate cortical function by modeling the 2-D layout of such columns. The cortical network in the model consists of a sheet of computational units, interconnected by short-range excitatory and long-range inhibitory lateral connections. Units receive input from a receptive surface or "retina" through the afferent connections. These connections come from overlapping patches on the retina called anatomical receptive fields (figure 1). The inputs to the network consist of simple images of elongated Gaussian spots on the retinal receptors. The retinal activation propagates through the afferent connections and produces an initial response in the network:
η_ij = σ[ Σ_{x,y} μ_{ij,xy} ξ_xy ]    (1)
where η_ij is the response of unit (i, j), ξ_xy is the activation of a retinal receptor (x, y) within the receptive field of the unit, μ_{ij,xy} is the corresponding afferent weight, and σ is a piecewise linear approximation of the sigmoid activation function. At each subsequent time step, the unit combines the input activation with lateral excitation and inhibition:
η_ij(t) = σ[ Σ_{x,y} μ_{ij,xy} ξ_xy + γ_E Σ_{k,l} E_{ij,kl} η_kl(t − δt) − γ_I Σ_{k,l} I_{ij,kl} η_kl(t − δt) ]    (2)
where E_{ij,kl} is the excitatory lateral connection weight on the connection from unit (k, l) to unit (i, j), I_{ij,kl} is the inhibitory connection weight, and η_kl(t − δt) is the activity of unit (k, l) during the previous time step. The scaling factors γ_E and γ_I determine the strength of the lateral excitatory and inhibitory interactions.
Figure 1. The RF-LISSOM model. The lateral excitatory and lateral inhibitory connections of a single unit in the network are shown, together with its afferent connections. The afferents form a local anatomical receptive field on the retina.
Both afferent and lateral connections have positive synaptic weights, which are set to random values before self-organization. While the initial response is typically diffuse and widespread, this recurrent exchange of lateral excitation and inhibition rapidly turns the response into stable, focused patches of activity. After the activity has settled, both afferent and lateral connection weights adapt according to the same mechanism: the Hebbian rule, normalized so that the sum of the weights is constant:
$$w_{ij,mn}(t+1) = \frac{w_{ij,mn}(t) + \alpha\, \eta_{ij}\, X_{mn}}{\sum_{mn} \big( w_{ij,mn}(t) + \alpha\, \eta_{ij}\, X_{mn} \big)} \qquad (3)$$
where $\eta_{ij}$ stands for the activity of the unit $(i,j)$ in the settled activity bubble, $w_{ij,mn}$ is the afferent or the lateral connection weight ($\mu_{ij,xy}$, $E_{ij,kl}$ or $I_{ij,kl}$), $\alpha$ is the learning rate for each type of connection ($\alpha_A$ for afferent weights, $\alpha_E$ for excitatory, and $\alpha_I$ for inhibitory) and $X_{mn}$ is the presynaptic activity ($\xi_{xy}$ for afferent, $\eta_{kl}$ for lateral). The radius of the lateral excitatory interactions starts out large, but as self-organization progresses, gradually decreases until it covers only the nearest neighbors. Also during self-organization, those long-range inhibitory lateral connections that become very weak are periodically pruned away, resulting in highly specific patterns of lateral connections.
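To make equations (1)-(3) concrete, the following is a minimal NumPy sketch of one input presentation: the afferent response, a few settling steps with lateral excitation and inhibition, and the normalized Hebbian update. The sheet and retina sizes, the piecewise-linear thresholds, the number of settling steps, and the dense (unmasked, unpruned) lateral weight arrays are illustrative assumptions, not the parameters of the published simulations, and a random retina stands in for the elongated Gaussian inputs used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R, rf = 20, 24, 7          # cortical sheet N x N, retina R x R, RF width (assumed)

def sigma(x, lo=0.1, hi=0.65):
    """Piecewise-linear approximation of a sigmoid (threshold values assumed)."""
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

# Afferent weights over each unit's rf x rf retinal patch; dense lateral weights.
mu = rng.random((N, N, rf, rf))
E = rng.random((N, N, N, N))   # excitatory lateral weights (radius masking omitted)
I = rng.random((N, N, N, N))   # inhibitory lateral weights (pruning omitted)

def patch(retina, i, j):
    """Anatomical receptive field of unit (i, j): a patch centred on the retina."""
    ci = round(i * (R - rf) / (N - 1))
    cj = round(j * (R - rf) / (N - 1))
    return retina[ci:ci + rf, cj:cj + rf]

def settle(retina, gamma_e=0.9, gamma_i=0.9, steps=10):
    """Eq. (1) initial afferent response, then eq. (2) recurrent settling."""
    aff = np.array([[np.sum(mu[i, j] * patch(retina, i, j))
                     for j in range(N)] for i in range(N)])
    eta = sigma(aff)
    for _ in range(steps):
        exc = np.einsum('ijkl,kl->ij', E, eta)
        inh = np.einsum('ijkl,kl->ij', I, eta)
        eta = sigma(aff + gamma_e * exc - gamma_i * inh)
    return eta

def hebb_normalized(w, eta, pre, alpha):
    """Eq. (3): Hebbian increment followed by divisive normalization per unit."""
    w_new = w + alpha * eta[..., None, None] * pre
    return w_new / w_new.sum(axis=(-2, -1), keepdims=True)

# One presentation.
retina = rng.random((R, R))
eta = settle(retina)
pre_aff = np.array([[patch(retina, i, j) for j in range(N)] for i in range(N)])
mu = hebb_normalized(mu, eta, pre_aff, alpha=0.02)
E = hebb_normalized(E, eta, eta[None, None, :, :], alpha=0.02)
I = hebb_normalized(I, eta, eta[None, None, :, :], alpha=0.02)
```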
3. CORTICAL SELF-ORGANIZATION
The main thesis of the RF-LISSOM project is that receptive fields, topographic maps, and lateral connections in the primary visual cortex can all self-organize based on the same input-driven Hebbian learning mechanism. In a series of RF-LISSOM simulations, we indeed showed how the observed ocular dominance, orientation, motion direction, and size-selectivity columns and patterned lateral connections between them emerge based on correlations in the visual input [6,8-10]. Figure 2 shows results for the orientation simulation, run on the Cray T3D at the Pittsburgh Supercomputing Center. The self-organization of afferents resulted in oriented receptive fields similar to those found in the primary visual cortex. The global orientation map is also very similar to those in the cortex and includes structures such as pinwheels, fractures and linear zones, as has been observed by recent imaging techniques [11]. Although similar orientation maps have been obtained in computer simulations before, the RF-LISSOM model is the first to show that the lateral interactions can be learned at the same time as the orientation map forms, as a synergetic part of the process. The surviving lateral connections of highly-tuned units, such as those in figure 2b, link areas of similar orientation preference, and avoid units with the orthogonal orientation preference. Furthermore, the connection patterns are elongated along the direction that corresponds to the unit's preferred stimulus orientation. This organization reflects the activity correlations caused by the elongated Gaussian input pattern: such a stimulus activates primarily those units that are tuned to the same orientation as the stimulus, and located along its length. Such connection patterns have already been confirmed in very recent neurobiological experiments [12]. The connection patterns at pinwheel centers and fractures have not been studied experimentally so far; the RF-LISSOM model predicts that neurons at pinwheel centers have unselective lateral connections, and those at fractures have biaxially distributed connections.

4. INFORMATION ENCODING AND PLASTICITY
Several researchers have proposed that the visual cortex forms a sparse, redundancy-reduced representation of the visual input [13,14]. The RF-LISSOM model allows us to test this hypothesis in precise computational measurements [6,8]. The sparseness of the network representation was measured by its kurtosis (peakedness). Kurtosis was found to be significantly higher in the self-organized system than in a system with fixed lateral connections, and the high kurtosis was achieved by removing redundancies in the input. This result shows that the self-organized structures are not accidental, but serve a useful purpose in information processing. The self-organized model can also be used to study plasticity phenomena in the adult cortex [6]. The first such experiment showed how the network reorganizes after a retinal lesion, giving a computational explanation of the phenomenon of dynamic receptive fields. The second experiment showed how the network reorganizes in response to cortical lesions, predicting the extent of recovery. These experiments suggest that the same processes that are responsible for the development of the cortex also operate in the adult cortex, maintaining it in a dynamic equilibrium with the input. They also suggest ways to hasten recovery from damage to the retina or sensory cortex.
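As a concrete illustration, the sparseness of a set of unit responses can be quantified with the standard excess-kurtosis estimator sketched below; the exact estimator used in [6,8] is not spelled out in this summary, so this is only the textbook formula.

```python
import numpy as np

def kurtosis(responses):
    """Excess kurtosis ('peakedness') of a set of unit responses.

    Higher values indicate a sparser code: most responses near zero,
    with a few strongly active units.  Zero for a Gaussian distribution.
    """
    r = np.asarray(responses, dtype=float).ravel()
    z = (r - r.mean()) / r.std()
    return np.mean(z ** 4) - 3.0
```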
Figure 2. Self-Organization of the Orientation Map and Lateral Connections. Each computational unit in the simulated cortex is shaded according to its orientation preference. Shades from dark to light represent continuously-changing orientation preference from 127.5° to 37.5° from the horizontal, and from light to dark preference from 37.5° to -52.5°, i.e. back to 127.5°. To disambiguate the shading, every third neuron in every third row is marked with a line that identifies the neuron's orientation preference. In addition, the length of the line indicates how selective the neuron is to its preferred orientation. The outlined areas indicate units from which the center unit has lateral connections. (a) Initially, the afferent weights of each neuron are random, and the receptive fields are randomly oriented and very unselective, as shown by the random shades and random and short lines (the line lengths were slightly magnified so that they could be seen at all). The lateral connections cover a wide area uniformly. (b) After several thousand input presentations, the receptive fields have organized into continuous and highly selective bands of orientation columns. The orientation preference patterns have all the significant features found in visuo-cortical maps: (1) pinwheel centers, around which orientation preference changes through 180° (e.g. the neuron eight lines from the left and four from the bottom), (2) linear zones, where orientation preference changes almost linearly (e.g. along the bottom at the lower left), and (3) fractures, where there is a discontinuous change of orientation preference (as in 7 lines from the left and 17 from the bottom). Most of the lateral connections have been pruned, and those that remain connect neurons with similar orientation preferences. The marked unit prefers 127.5°, and its connections come mostly from dark neurons. In the near vicinity, the lateral connections follow the twists and turns of the darkly-shaded iso-orientation column, and avoid the lightly-shaded columns representing the orthogonal preference. Further away, connections exist mostly along the 127.5° orientation, since these neurons tend to respond to the same input. All the long-range connections shown are inhibitory at this stage; there are excitatory connections only from the immediately neighboring units.
5. LOW-LEVEL PERCEPTUAL PHENOMENA
Various hypotheses about the function of the visual cortex can also be tested on the RF-LISSOM model. The psychophysical experiments are replicated on the model, obtaining behavior similar to that of humans. Because RF-LISSOM is computational, it is possible to observe activation and connection patterns between large numbers of units simultaneously, making it possible to relate high-level phenomena to low-level events. In this way a thorough understanding of the visual phenomena can be achieved.

5.1. Tilt Aftereffects

The prevailing theory for tilt aftereffects attributes them to lateral interactions between orientation-specific feature detectors in the primary visual cortex [15]. The lateral inhibitory interactions between activated neurons are believed to increase temporarily while an input pattern is inspected, causing changes in the perception of subsequent orientations. This occurs because the detectors are broadly tuned, and detectors for neighboring orientations also adapt somewhat. When a subsequent line of a slightly different orientation is presented, the most strongly responding units are now the ones with orientation preferences further from the adapting line, resulting in a change in the perceived angle. This theory was tested in the RF-LISSOM model of orientation selectivity [17,6]. Figure 3 plots the change in the perceived angle as a function of the angle. For comparison, the figure also shows results from the most detailed data available for the tilt aftereffect in human foveal vision [16]. The results from the model are consistent with those seen in human subjects. When the test line has an orientation similar to the adaptation line, the angle appears expanded, as predicted by the theory. What is most interesting, however, is that the model also reproduces the indirect tilt aftereffect, where large angles appear contracted. The indirect effect has been difficult to explain in the inhibition theory, but the model suggests a simple explanation: since the total amount of inhibition stays constant while the weights change (equation 3), the increased inhibition at close angles results in reduced inhibition at large angles, causing angle contraction. Weight normalization of this kind has very recently been found in biological experiments [18].
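The step that turns unit responses into a "perceived" angle is not detailed in this summary; one common population-vector readout (not necessarily the procedure used in [17,6]) averages the preferred orientations of the units weighted by their responses, with angles doubled because orientation is 180°-periodic. The aftereffect is then the shift of this readout for the same test line before and after adaptation.

```python
import numpy as np

def perceived_orientation(responses, preferred_deg):
    """Population-vector estimate of orientation, in degrees."""
    theta = np.deg2rad(2.0 * np.asarray(preferred_deg, dtype=float))
    r = np.asarray(responses, dtype=float)
    return 0.5 * np.rad2deg(np.arctan2((r * np.sin(theta)).sum(),
                                       (r * np.cos(theta)).sum()))

def tilt_aftereffect(resp_before, resp_after, preferred_deg):
    """Perceived-angle change for the same test line before vs. after adaptation.

    For the few-degree effects of Figure 3 the raw difference is adequate;
    a wrap to (-90, 90] would be needed near the +/-90 degree boundary.
    """
    return (perceived_orientation(resp_after, preferred_deg)
            - perceived_orientation(resp_before, preferred_deg))
```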
5.2. Perceptual Grouping

What are the neural mechanisms that cause us to perceive a set of features in the scene together as a group? One possibility is that such grouping is based on synchronization and desynchronization of spiking activity between neurons, or between neuronal groups [19]. The RF-LISSOM model is ideal for testing this hypothesis. The model includes lateral connections that have been organized to efficiently represent visual input. They can also mediate visual function by modulating the spiking behavior of neuronal groups. In order to study perceptual phenomena, the basic RF-LISSOM model was combined with a leaky integrator model of the neuron [20,6]. The synapses in this model perform decayed summation of incoming spikes. If the sum exceeds a threshold, the neuron fires, and the threshold is temporarily raised. The input to the network consists of multiple square objects in the retina, spiking at a constant rate. The network goes through a similar settling process as the RF-LISSOM network. The input is kept constant and the cortical neurons are allowed to exchange spikes. After a while, the neurons reach a stable rate of firing, and this rate is used to modify the weights as in RF-LISSOM.
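A minimal sketch of the leaky-integrator unit just described; the decay rates, base threshold, reset rule, and the size of the temporary threshold increase are illustrative assumptions rather than the values used in [20,6].

```python
import numpy as np

def leaky_integrator_step(potential, threshold, spikes_in, weights,
                          decay=0.9, thresh_relax=0.95,
                          base_threshold=1.0, thresh_bump=0.5):
    """One time step of a leaky-integrator neuron with an adaptive threshold.

    The membrane potential performs a decayed summation of weighted incoming
    spikes; when it exceeds the threshold the unit fires and the threshold is
    temporarily raised, relaxing back toward its base value afterwards.
    """
    potential = decay * potential + float(np.dot(weights, spikes_in))
    threshold = base_threshold + thresh_relax * (threshold - base_threshold)
    fired = potential > threshold
    if fired:
        potential = 0.0              # reset after firing (an assumption)
        threshold += thresh_bump     # temporarily raised threshold
    return potential, threshold, fired
```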
Figure 3. Tilt Aftereffect in Humans and in the RF-LISSOM Model. (The plot shows the perceived-angle change as a function of the angle on the retina, from -90° to 90°, counterclockwise.) The open circles represent the average tilt aftereffect (TAE) over ten trials for the human subject DEM [16]. The subject adapted for three minutes on a sinusoidal grating of a given angle, then was tested for the effect on a horizontal grating. Error bars indicate ± standard error of measurement. The heavy line shows the average magnitude of the tilt aftereffect over ten trials in the RF-LISSOM model. The network adapted to a vertical line at a particular position for 90 iterations, then the TAE was measured for test lines oriented at each angle. The duration of adaptation was chosen so that the magnitude of the TAE matches the human data. The RF-LISSOM curve closely resembles the curve for humans at all angles, showing both direct and indirect tilt aftereffects.
Figure 4. Segmentation and Binding of Visual Input. The input consisted of three oriented blobs. Three different areas of the cortex responded, one for each blob. The percentage of neurons firing per unit time in each of these areas is shown by gray-scale coding in the three rows of the figure. Time is shown on the x-axis. At first, all areas are equally active, but after about 25 time steps, only one area is active at any one time (binding the neurons representing the same object together), and activation rotates from one area to another (segmenting the different object representations).
Starting from initially random weights, the spiking model formed an ordered map of the input, and was able to indicate the different objects on the retina by synchronizing and desynchronizing the multi-unit activation of the different areas (figure 4).

6. CONCLUSION
The RF-LISSOM project aims at understanding the information processing of the primary visual cortex at a specific and natural level: as computations arising in a laterally connected 2-D array of feature detectors. In the project, a number of theories about development and perceptual phenomena in the visual system have been put to the test computationally. The main result is strong computational support for the idea that the visual cortex is a continuously-adapting structure in a dynamic equilibrium with both external and intrinsic input, and that phenomena such as columnar organization, patchy lateral connections, sparse coding, plasticity, tilt aftereffects, and perceptual grouping are its emergent effects. Given that the physiology of the neocortex is rather uniform, it is possible that similar mechanisms underlie higher cognitive functions as well. For example, there is recent evidence that the fronto-temporal cortex lays out semantic properties of concepts [21]. It may be possible to extend the current low-level perceptual models to language development and memory, and some of their predictions may be verified with recent imaging techniques. This direction forms a most exciting challenge for future work on biological modeling with self-organizing maps.
REFERENCES
1. T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin; New York, 1989.
2. T. Kohonen. Self-Organizing Maps. Springer, Berlin; New York, 1995.
3. E. Erwin, K. Obermayer, and K. Schulten. Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation, 7(3):425-468, 1995.
4. K. Obermayer, H. J. Ritter, and K. J. Schulten. Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modelling. Parallel Computing, 14:381-404, 1990.
5. T. Kohonen. Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6:895-905, 1993.
6. R. Miikkulainen, J. A. Bednar, Y. Choe, and J. Sirosh. Self-organization, plasticity, and low-level visual phenomena in a laterally connected map model of the primary visual cortex. In R. L. Goldstone, P. G. Schyns, and D. L. Medin, editors, Perceptual Learning, volume 36 of Psychology of Learning and Motivation, pages 257-308. Academic Press, San Diego, CA, 1997.
7. J. Sirosh and R. Miikkulainen. Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71:66-78, 1994.
8. J. Sirosh, R. Miikkulainen, and J. A. Bednar. Self-organization of orientation maps, lateral connections, and dynamic receptive fields in the primary visual cortex. In J. Sirosh, R. Miikkulainen, and Y. Choe, editors, Lateral Interactions in the Cortex: Structure and Function. The UTCS Neural Networks Research Group, Austin, TX, 1996. Electronic book, ISBN 0-9647060-0-8, http://www.cs.utexas.edu/users/nn/web-pubs/htmlbook96.
9. J. Sirosh and R. Miikkulainen. Topographic receptive fields and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neural Computation, 9:577-594, 1997.
10. J. Sirosh and R. Miikkulainen. Lateral connections in the visual cortex can self-organize cooperatively with multisize RFs just as with ocular dominance and orientation columns. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, pages 430-435, Hillsdale, NJ, 1996. Erlbaum.
11. G. G. Blasdel. Orientation selectivity, preference, and continuity in monkey striate cortex. Journal of Neuroscience, 12:3139-3161, August 1992.
12. W. H. Bosking, Y. Zhang, B. Schofield, and D. Fitzpatrick. Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6):2112-2127, 1997.
13. H. B. Barlow. Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1:371-394, 1972.
14. D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559-601, 1994.
15. D. J. Tolhurst and P. G. Thompson. Orientation illusions and aftereffects: Inhibition between channels. Vision Research, 15:967-972, 1975.
16. D. E. Mitchell and D. W. Muir. Does the tilt aftereffect occur in the oblique meridian? Vision Research, 16:609-613, 1976.
17. J. A. Bednar and R. Miikkulainen. A neural network model of visual tilt aftereffects. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 37-42, Hillsdale, NJ, 1997. Erlbaum.
18. G. G. Turrigiano. Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neurosciences, 1998. In press.
19. C. von der Malsburg and W. Singer. Principles of cortical network organization. In P. Rakic and W. Singer, editors, Neurobiology of Neocortex, pages 69-99. Wiley, New York, 1988.
20. Y. Choe and R. Miikkulainen. Self-organization and segmentation in a laterally connected orientation map of spiking neurons. Neurocomputing, 21:139-157, 1998.
21. M. Spitzer, M. Bellemann, and T. Kammer. Functional MR imaging of semantic information processing and learning-related effects using psychometrically controlled stimulation paradigms. Cognitive Brain Research, 4:149-161, 1996.
Kohonen Maps. E. Oja and S. Kaski, editors. © Elsevier Science B.V. All rights reserved.
A Spatio-Temporal Memory Based on SOMs with Activity Diffusion

Neil R. Euliano^a and Jose C. Principe^b
^a NeuroDimension Inc., 1800 N. Main St., Gainesville, FL 32609
^b Computational NeuroEngineering Laboratory, University of Florida, Gainesville, FL 32611

This paper discusses the use of the biologically inspired concept of activity diffusion to create a spatio-temporal memory in the SOM and neural gas algorithms. The activity diffusion creates a system that is sensitive to temporal patterns it has been trained with and thus can "anticipate" future inputs. This technique uses temporal information to help remove variability inherent in the signal. In essence, we are using a system that is capable of creating time-varying Voronoi regions.

1. INTRODUCTION

Sensory processing can be grouped into two domains: static and dynamic problems. Static problems consist of information that is independent of time. For instance, in static image recognition, the image does not change over time. On the other hand, time is fundamental to the dynamic problem. The output of a dynamical system, for example, depends not only on the present input but also on the current state of the system, which encapsulates the entire past of the inputs. Temporal processing is the analysis, modeling, prediction, and/or classification of systems that vary with time. Patterns that evolve over time have traditionally provided the most challenging problems for scientists and engineers. Language skills (speech recognition, speech synthesis, sound identification, etc.), vision skills (motion detection, target tracking, object recognition, etc.), locomotion skills (synchronized movement, robotics, mapping, navigation, etc.), process control (both human and mechanical), time series prediction, and many other applications all require temporal pattern processing. In fact, the ability to properly recognize or generate temporal patterns is fundamental to cognition. Most neural network architectures were designed for static problems and much of the neural network successes have been in that domain. Many attempts have been made at adapting architectures for temporal processing, but with much less success than the static models.
1.1. Temporal SOM Research

There have been many attempts at integrating temporal information into the SOM. One major technique is to add temporal information to the input of the SOM. For example, exponential averaging and tapped-delay lines were tested in [1,2]. Another common method is to use layered or hierarchical SOMs where a second map tries to capture the spatial dynamics of the input moving through the first map [1,2]. More recently, researchers have
integrated memory inside the SOM, typically with exponentially decaying memory traces. Privitera and Morasso have created a SOM with leaky integrators and thresholds at each node which activate only after the pattern has been stable in an area of the map for a certain amount of time. This allows the map to pick out only the "stationary" regions of the input signal and use these sequences of regions to detect the input sequence [3,4]. Chappell and Taylor created a SOM whose neurons hold the activity on their surface via leaky integrator storage [5]. The learning law proposed by Chappell and Taylor, however, can lead to an unstable weight space. The methodology seems to work for patterns of binary inputs of at most length 3. Critchley [6] improved the architecture by moving the leaky integration to the synapses. This gives the network a much better picture of the temporal input space and has much more stable training, but becomes nothing more than an exponentially windowed input to a standard Kohonen map, as proposed by Kangas [1]. Kohonen and Kangas proposed the hypermap architecture to include context in the SOM architecture [7]. Kangas extended this concept by eliminating the context weights and allowing only nodes in the vicinity of the last winner to be selected [8]. Goppert and Rosenstiel conceptually extend this concept to include the notion of attention [9,10,11]. The theory is that the probability of selecting a winner is affected by either higher-cognitive processes (which may be considered a type of supervision) or by information from the past activations of the network (context). This gives two components to the selection of a winner, the extrasensory distance (context or higher processes) and the sensory distance (normal distance from weight to input). The architecture outperformed the standard SOM on simple temporal tasks but did not train well on more complicated trajectories.

2. TEMPORAL ACTIVITY DIFFUSION

We believe that by studying biological neural networks, we may find the key to temporal processing in artificial neural networks. Therefore we began our search for a new temporal methodology by studying biology. There we came across the concept of activity diffusion.
2.1. Activity Diffusion in Biology

The reaction-diffusion equations were originally proposed by Turing in 1952 and are typically used to explain natural pattern formation [12]. Turing's proposal modeled patterns found in nature by an interaction of chemicals called "morphogens". The different morphogens react with each other and diffuse throughout the substance via the equation:
$$\frac{\partial m_i(x,t)}{\partial t} = f\big(m_i(x,t), m_j(x,t)\big) + D_{m_i}\,\frac{\partial^2 m_i(x,t)}{\partial x^2} \qquad \text{Equation 1}$$
where $m_i(x,t)$ is the concentration of morphogen $i$ at time $t$, $D_{m_i}$ is the diffusion coefficient, and $f(m_i, m_j)$ is a function (typically nonlinear) which represents the interaction (reaction) between morphogens. By varying the interaction between chemicals and the speed of diffusion, complicated spatial patterns of chemicals are created. The reaction-diffusion equations can create traveling waves. An example of such a system is the FitzHugh-Nagumo equations (FHN), which describe the transmission of energy down the axon of a neuron. The general concept is that when an element fires, its activity is diffused to its neighbors and pushes them just far enough from their stable state to move them to the "excited" state. Next, these newly
excited elements excite their neighbors, etc. The excited elements begin to relax, creating a traveling wave of activity [13]. One particularly interesting aspect of the human brain is the gas Nitric Oxide (NO), which has been found to be involved in many processes in the central nervous system. One such process is the modification of synaptic strength thought to be the mechanism for learning. Neurons produce NO post-synaptically after depolarization. The NO diffuses rapidly (3.3 × 10⁻⁵ cm²/s) and has a long half-life (about 4-6 seconds), creating an effective range of at least 150 μm. Large quantities of NO at an active synapse strengthen the synapse (called Long Term Potentiation, or LTP). If the NO level is low, the synaptic strength is decreased (Long Term Depression, or LTD) even if the site is strongly depolarized. NO is thus commonly called a diffusing messenger, as it has the ability to carry information through diffusion, without any direct electrical contact (synapses), over much larger distances than normally considered (nonlocal). The NO diffusion and non-linear synaptic change mechanism has been shown to be capable of supporting the development of topographical maps without the need for a Mexican Hat lateral interaction [14,15]. In addition to the possibility of lateral diffusive messenger effects, the long life of NO can produce interesting temporal effects. Krekelberg has shown that NO can act as a memory trace in the brain that can allow the temporal correlations in the input to be converted into spatial connection strengths [15].
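As an illustration of how a reaction-diffusion system of the form of Equation 1 produces traveling waves, the sketch below integrates the FitzHugh-Nagumo equations on a 1-D ring with a simple finite-difference scheme; all parameter values and the stimulus are illustrative, not taken from [13].

```python
import numpy as np

n, dx, dt = 400, 1.0, 0.05
D, eps, a, b = 1.0, 0.08, 0.7, 0.8     # illustrative FitzHugh-Nagumo parameters

v = np.zeros(n)        # fast, voltage-like activity variable
w = np.zeros(n)        # slow recovery variable
v[:10] = 2.0           # a local stimulus pushes a few elements past rest

for _ in range(8000):
    lap = (np.roll(v, 1) - 2.0 * v + np.roll(v, -1)) / dx**2   # diffusion term
    v = v + dt * (v - v**3 / 3.0 - w + D * lap)                # reaction + diffusion
    w = w + dt * eps * (v + a - b * w)                          # slow recovery
# The excited region relaxes behind a front that travels along the ring
# (np.roll gives periodic boundary conditions).
```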
2.2. Temporal Activity Diffusion in Neural Networks

Using the reaction-diffusion equations, leaky integrators, and the concept of activity wavefronts, we have created a spatio-temporal memory that stores information in the activity of a distributed network. We have applied this memory technique to the SOM and a Neural Gas architecture. We call these architectures the Self-Organizing Map with Temporal Activity Diffusion (SOMTAD) and the Neural Gas with Temporal Activity Diffusion (GASTAD). The basic principles are fundamentally the same for both architectures. Unlike a short-term memory which converts time into space, the TAD method uses diffusion to create self-organization in time and space. The approach is to keep the fundamental operation of the neural network the same (in order to use the theory and knowledge we have already accumulated) but to add self-organization in time-space to the PEs in the network. By creating temporally correlated neighborhoods in the output space, the basic functionality of the network is more organized and temporally sensitive, without drastically changing its underlying operation. The mechanism for the creation of these temporally correlated neighborhoods is the diffusion mechanism. The activity diffusion is similar to the diffusion of NO in the brain. When a PE or group of PEs fire, they influence their neighbors, typically lowering their threshold such that they are more likely to fire in the near future. Because the underlying mechanism of most neural network training is Hebbian in nature, when neighboring PEs fire in a correlated fashion, they tend to continue to fire in a correlated fashion. This creates the temporally correlated neighborhoods and the self-organization in space-time. The key concept in the TAD architectures is the activity diffusion through the output space. The firing of a PE in the network causes activity to diffuse through the network and affects both the training and the operation of the network. In the SOMTAD, the activity diffusion moves through the lattice of an SOM structure. When the activity diffusion spreads to neighboring PEs, the thresholds of these neighboring PEs are lowered, creating a situation
where the neighboring PEs are more likely to fire next. We define enhancement as the amount by which a PE's threshold is lowered. In the SOMTAD model, the local enhancement acts like a traveling wave. This significantly reduces the computation of diffusion equations and provides a mechanism where temporally ordered inputs will trigger spatially ordered outputs. The traveling wave decays over time. It can only remain strong if spatially neighboring PEs are triggered from temporally ordered inputs, in which case the traveling waves are reinforced. In a simple one-dimensional case, Figure 1 shows the enhancement for a sequence of spatially ordered winners (winners in order were PE 1, PE 2, PE 3, PE 4) and for a sequence of random winners (winners in order were PE 4, PE 2, PE 1, PE 5), which would be the case if the input was noise or unknown. In the ordered case, the enhancement will lower the threshold for PE 5 dramatically more than the other PEs, making PE 5 likely to win the next competition. In the unordered case, the enhancement becomes weak and affects all PEs roughly evenly.
Figure 1: Temporal activity in the SOMTAD network. a) activity created by temporally ordered input; b) activity created by unordered input.

The second temporal functionality included in the TAD architectures is the decay of output activation over time. When a PE fires or becomes active, it maintains a portion (exponentially decaying) of its activity after it fires. Because the PE gradually decays, the wavefront it creates is more spread out over time, rather than a simple traveling impulse. This spreading creates a more robust architecture that can gracefully handle both time-warping and missing or noisy data. To simplify the description of the SOMTAD algorithm, we will use 1D maps and let the activity propagate in only one direction (since the diffusion of the activity is severely restricted in the one-dimensional case). Thus, the output space can be considered a set of PEs connected by a string where the information is passed between PEs along this string. The activity/enhancement moves in the direction of increasing PE number and decays at each step. An implementation of the activity diffusion in one string is shown in Figure 2, which includes the activity decay at each PE and the activity movement through the net in the left-to-right direction. The factors μ and (1−μ) are used to normalize the total activity in the network. This activity diffusion mechanism serves to store the temporal information in the network. During training, the PEs will be spatially ordered to sequentially follow any temporal sequences presented.
Figure 2: Model for activity diffusion in one string of the SOMTAD.

At each iteration, the activity of the network is determined by calculating the distance (or dot product) between the input and the weights of each PE and allowing for membrane potential decay:
$$act_k(t) = act_k(t-1)\,(1-\mu) + \mu \cdot dist\big(inp(t), w_k\big) \qquad \text{Equation 2}$$
where $act_k(t)$ represents the activity at PE $k$ at time $t$, and $dist(inp(t), w_k)$ represents the distance between the input at time $t$ and the weights of SOM PE $k$. This activity diffuses through the network and creates the enhancement for each PE. The decaying wavefront version of the activity diffusion is modeled using a scaled version of the activity of the previous PE at the previous time, called the enhancement $en_k(t)$:
$$en_k(t) = \nu \cdot act_{k-1}(t-1) \qquad \text{Equation 3}$$
where $\nu < 1$ is a decay constant applied to decay the enhancement. Next, the winning SOM PE is selected by
$$winner(t) = \arg\max_k \big( dist(inp(t), w_k) + \beta \cdot en_k(t) \big) \qquad \text{Equation 4}$$
where the enhancement is the activity being propagated from the left. The parameter $\beta$ is the spatio-temporal parameter that determines the amount that a temporal wavefront can lower the threshold for PE firing. Increasing $\beta$ lowers the threshold of neighboring PEs to the point where the next winner is almost guaranteed to be a neighbor of the current winner and forces the input patterns to be sequential in the output map. It is interesting to note that as $\beta \to 0$, the system operates like a standard SOM and when $\beta \to \infty$ the system operates like an avalanche network. Once the winner is selected, it is trained along with its neighbors in a Hebbian manner with normalization as follows:
$$w_k(t+1) = w_k(t) + \eta \cdot neigh(k) \cdot \big( inp(t) - w_k(t) \big) \qquad \text{Equation 5}$$
where the neighborhood function, $neigh(k)$, defines the closeness to the winner (typically a Gaussian function), and the learning rate is defined by $\eta$. In our current implementation, the spatio-temporal parameter, the learning rate, and the neighborhood size are all annealed for better convergence.
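The sketch below implements one SOMTAD iteration on a single 1-D string, following Equations 2-5. Here dist(·) is treated as a similarity (the negative Euclidean distance; the text notes that a dot product may be used instead), so that the winner is the arg max; the parameter values are illustrative, and the annealing of β, η and the neighborhood width mentioned above is left out.

```python
import numpy as np

def somtad_step(x, W, act, mu=0.3, nu=0.9, beta=1.0, eta=0.1, sigma=1.0):
    """One SOMTAD iteration on a 1-D string of PEs.

    x   : current input vector
    W   : (n_pes, dim) weight matrix
    act : (n_pes,) activities carried over from the previous time step
    """
    n = len(W)
    # Equation 3: enhancement = decayed copy of the left neighbour's previous activity.
    en = np.zeros(n)
    en[1:] = nu * act[:-1]
    # Equation 2: leaky combination of past activity and the current match
    # (negative distance used as the similarity dist(inp(t), w_k)).
    match = -np.linalg.norm(W - x, axis=1)
    act = act * (1.0 - mu) + mu * match
    # Equation 4: winner selection biased by the travelling enhancement wavefront.
    winner = int(np.argmax(match + beta * en))
    # Equation 5: Hebbian update of the winner and its string neighbours.
    neigh = np.exp(-0.5 * ((np.arange(n) - winner) / sigma) ** 2)
    W = W + eta * neigh[:, None] * (x - W)
    return W, act, winner

# Example: train a 20-PE string on a stream of 2-D inputs.
rng = np.random.default_rng(0)
W, act = rng.random((20, 2)), np.zeros(20)
for x in rng.random((500, 2)):
    W, act, _ = somtad_step(x, W, act)
```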
The SOMTAD architecture creates a spatially distributed memory. The memory of the system can be described by the following equation:
$$E_k(t) = \sum_{i=0}^{k} \sum_{\tau=0}^{t} act_{k-i}(t-i-\tau)\, \mu^{i}\, (1-\mu)^{\tau} \qquad \text{Equation 6}$$
This equation shows how the results from the matching activity ($act$) contribute to the enhancement. The traveling waves create two decaying exponentials, one which moves through space ($\mu^i$) and one which moves through time ($(1-\mu)^\tau$). The past history of the node is added to the enhancement via the recursive self-loop in $(1-\mu)$. The wavefront motion is added to the enhancement via the diagonal movement through the left-to-right channel scaled by $\mu$. The farther the node is off the diagonal and the farther back in time, the less influence it has on the enhancement.

2.3. A simple illustrative example

A simple, descriptive test case involves an input that is composed of two-dimensional vectors randomly distributed between 0 and 1. Embedded in the input are 20 'L'-shaped sequences located in the upper right hand corner of the input space (from [0.5, 1.0] → [0.5, 0.5] → [1.0, 0.5]). Uniform noise between −0.05 and 0.05 was added to the target sequences. When a standard 1D SOM maps this input space, it maps the PEs without regard to temporal order; it simply needs to cover the 2D input space with its 1D structure. To show how this happens, we plot an 'X' at the position in the input space represented by the weights of each PE (remember, the weights of each PE are the center point of the Voronoi region that contains the inputs that trigger that PE). Since the neighborhood relationship between PEs is important, we connect neighboring PEs with a line. In a 1D SOM, the result is a "string" of PEs, and this string of PEs is stretched and manipulated by the training algorithm so that the entire input space is mapped with the minimum distortion error while maintaining the neighborhood relationships (e.g. the string cannot be broken). The orientation of the output is not important, as long as it covers the input with minimal residual energy. A typical example is shown on the left side of Figure 3 (note the slightly higher density of the input in the 'L'-shaped region). When the SOMTAD temporal activity is added to the SOM, the mapping has the additional constraint that temporal neighbors (sequential winners) should fire sequentially. Thus, the string should not only cover the input space, but also follow prevalent temporal patterns found in the input. This is shown on the right side of Figure 3. Notice in the figure that sequential nodes have aligned themselves to cover the L-shaped temporal patterns found in the input.
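A sketch of how the toy input just described might be generated; the corner path and the ±0.05 noise follow the text, while the number of points per leg and the amount of background noise between the embedded sequences are assumptions.

```python
import numpy as np

def make_training_sequence(n_sequences=20, pts_per_leg=5,
                           noise=0.05, background_per_gap=20, seed=0):
    """Uniform 2-D noise with noisy 'L'-shaped sequences interleaved.

    Each 'L' runs [0.5, 1.0] -> [0.5, 0.5] -> [1.0, 0.5] with uniform noise
    of +/- `noise` added to every point.
    """
    rng = np.random.default_rng(seed)
    leg1 = np.linspace([0.5, 1.0], [0.5, 0.5], pts_per_leg)
    leg2 = np.linspace([0.5, 0.5], [1.0, 0.5], pts_per_leg)
    L = np.vstack([leg1, leg2[1:]])
    chunks = []
    for _ in range(n_sequences):
        chunks.append(rng.random((background_per_gap, 2)))       # random vectors
        chunks.append(L + rng.uniform(-noise, noise, L.shape))   # one noisy 'L'
    return np.vstack(chunks)
```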
Figure 3: One-dimensional mapping of a two-dimensional input space, both with and without spatio-temporal coupling. (Left panel: 1D Kohonen mapping without Temporal Enhancement; right panel: 1D Kohonen mapping with Temporal Enhancement.)
With a single string, the network can be trained to represent a single pattern or multiple patterns. Multiple patterns, however, require the string to be long. A long string may be difficult to train properly since it must weave its way through the input space, moving from the end of one pattern to the beginning of the next. Additional flexibility can be added by breaking up the large string into several smaller strings. Multiple strings can be considered a 2D array of output nodes with a 1D neighborhood function. This allows the network to either follow multiple trajectories or long complicated trajectories in a simplified manner. The SOMTAD has been used for landmark discrimination in a robotics application [16,17]. It has also been used to self-organize the clustering of phoneme sequences. In this application, Ruwisch et al. created a hardware system which temporally organized phonemes from spoken words in real-time [18].
2.4. Neural Gas Algorithm with Temporal Activity Diffusion (GASTAD)

The Neural Gas algorithm is similar to the SOM algorithm without the imposition of a predefined neighborhood structure on the output PEs. The Neural Gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distance to the reference vectors, not on the distance to the winning PE in the lattice. The Neural Gas algorithm has been shown to converge quickly to low distortion errors which are smaller than k-means, maximum entropy clustering or the SOM algorithm [19]. It has no predefined neighborhood structure as in the SOM and for this reason works better on disjoint or complicated input spaces. The dynamics of the GASTAD algorithm are the same as in the SOMTAD algorithm except that the neural gas algorithm does not have a predefined lattice structure. This allows us the flexibility to create a diffusion structure that can be trained to best fit the input data. The GASTAD diffuses activity through a secondary connection matrix that is trained with temporal Hebbian learning. This flexible structure decouples much of the spatial component from the temporal component in the network. In the SOMTAD, two neighboring nodes in time also needed to be relatively close in space in order for the system to train properly (since time and space were coupled). This is no longer a restriction in the GASTAD. This is still a space-time mapping, but now the coupling between space and time is directly controllable. The most interesting concept that falls out of this structure is the ability of the network to focus on temporal correlations. Temporal correlation can be thought of as the simple concept of anticipation. The human brain uses information from the past to enhance the recognition of "expected" patterns. For instance, during a conversation a listener uses the context from the past to determine what they expect to hear in the future. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech.

2.5. GASTAD Details

The GASTAD algorithm works as follows. First, the distance $d_i$ from the input to each of the PEs is calculated. The temporal activity in the network is similar to the SOMTAD diffusive wavefronts except that the wavefronts are scaled by the connection strengths between PEs. Thus, the temporal activity diffuses through the space defined by the connection matrix as follows:
$$act_i(t+1) = \alpha\, act_i(t) + \big( \mu + (1-\mu)\, act_{winner}(t) \big)\, \frac{p_{winner,i}}{\max(p)} \qquad \text{Equation 7}$$
where $act_i(t)$ is the activity at PE $i$ at time $t$, $\alpha$ is a decay constant less than 1, $p_{i,j}$ is the connection strength from PE $i$ to PE $j$, $\mu$ is the parameter which smoothes the activity, giving more or less importance to the past activity in the network, and $\max(p)$ normalizes the connection strengths. A previous winner that has followed a "known" path through the network will have higher activity and thus will have more influence on the next selection. The rest of the GASTAD algorithm is identical to the SOMTAD. The output is modified by the spatio-temporal activity and the PEs with the highest activity are trained using the neural gas update rule:
$$\Delta p_i = \eta\, h_\lambda\big(k_i(out)\big)\, (in - p_i) \qquad \text{Equation 8}$$
where $\eta$ is the learning rate (step size), $h_\lambda(\cdot)$ is an exponential neighborhood with the parameter $\lambda$ defining the width of the exponential, and $k_i(out)$ is the ranking of PE $i$ based on its modified distance from the input. The connection strengths are trained using temporal Hebbian learning with normalization. Temporal Hebbian learning is Hebbian learning applied over time, such that PEs that fire sequentially enhance their connection strength. The rationalization for this rule is that PEs will remain active for a period of time after they fire; thus both the current and the previous winners will be active at the same time. In the current implementation, the connection strengths are updated similarly to the conscience algorithm for competitive learning:
$$\Delta p_{\arg\min(out(t-1)),\; \arg\min(out(t))} = b \qquad \text{Equation 9}$$
The strength of the connection between the last winner and the present winner is increased by a small constant $b$, and all connections are decreased by a fraction that maintains constant energy across the set of connections ($N$ is the total number of PEs in the network). During operation, although the weights are fixed, the signal time structure creates temporal wavefronts in the network that allow plasticity during recognition. This temporal activity is mixed with the standard spatial activity (the distance from the input to the weights) via $\beta$, the spatio-temporal parameter. Two identical input values may fire different PEs depending on the temporal past of the signal. Figure 4 shows the Voronoi diagrams for the GASTAD network with two different temporal histories. In these particular diagrams, the number in each Voronoi region represents the PE number for that particular region and is located at the center of the static Voronoi region (the center is the same as the weights of the PE). These diagrams show the regions of the input space that will fire each PE in the network. The training data included random inputs interspersed with temporal diagonal lines moving from bottom-left to top-right and bottom-right to top-left. The left side of Figure 4 shows the Voronoi diagram during a presentation of random noise to the network. Since this input pattern was unlikely to be seen in the training input, temporal wavefronts were not created and the Voronoi diagram is very similar to the static Voronoi diagram. The right side of Figure 4 shows the Voronoi diagram during the presentation of the bottom-left to top-right diagonal line. The temporal wavefront grew to an amplitude of 0.5 by the time PE 18 fired. Also, from the training of the network, the connection strength between PE 18 and PE 27 was large compared to the other PEs. Thus, the temporal wavefront flowed preferentially to PE 27, enhancing its chances of winning the next competition.
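A minimal sketch of one GASTAD iteration following Equations 7-9. The way the temporal activity is mixed into winner selection (via β, mirroring the SOMTAD rule), the initialization of P to small positive values, and the uniform decrease approximating the constant-energy normalization of Equation 9 are assumptions, and all parameter values are illustrative.

```python
import numpy as np

def gastad_step(x, W, P, act, prev_winner,
                alpha=0.8, mu=0.5, beta=0.5, eta=0.1, lam=2.0, b=0.01):
    """One GASTAD iteration.

    W : (n, dim) reference vectors     P : (n, n) positive connection strengths
    act : (n,) temporal activities     prev_winner : index of the last winner
    """
    n = len(W)
    # Equation 7: activity decays (alpha) and diffuses from the previous winner
    # along its outgoing connections, normalized by the largest strength.
    act = alpha * act + (mu + (1.0 - mu) * act[prev_winner]) * P[prev_winner] / P.max()
    # Winner selection: spatial similarity plus beta times the temporal activity.
    score = -np.linalg.norm(W - x, axis=1) + beta * act
    winner = int(np.argmax(score))
    # Equation 8: neural-gas soft-max update based on the ranking of the scores.
    ranks = np.empty(n)
    ranks[np.argsort(-score)] = np.arange(n)
    W = W + eta * np.exp(-ranks / lam)[:, None] * (x - W)
    # Equation 9: temporal Hebbian update of the (previous winner -> winner)
    # connection, with a uniform decrease approximating constant total strength.
    P[prev_winner, winner] += b
    P = np.clip(P - b / P.size, 1e-6, None)
    return W, P, act, winner
```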
(Panel titles: "Voronoi diagram with previous winners: 20, 26, 14, 18, Beta=0.2" and "Voronoi diagram with previous winners: 23, 18, 10, 29".)
Figure 4: Voronoi diagrams without and with enhancement.

Notice how large region 27 is in the right side of Figure 4, since it is the next expected winner. This plasticity seems similar to the way humans recognize temporal patterns (e.g. speech). Notice that the network uses temporal information and its previous training to "anticipate" the next input. The anticipated result is much more likely to be detected since the network is expecting to see it. It is important to point out how the static and dynamic conditions are dramatically different. In the dynamic GASTAD the centroids (reference vectors) are not as important: the temporal information changes the entire characteristics of vector quantization, creating data-dependent Voronoi regions. An animation can demonstrate the operation of the GASTAD Voronoi regions much better than static figures.

3. GASTAD VECTOR QUANTIZATION OF SPEECH DATA

The goal of this application is to recognize spoken English digits from one to ten. The GASTAD will be used to vector quantize the sampled frequency representation of each digit. The corpus is a set of 15 speakers saying the digits one through ten. The first 10 speakers are used for training and the last 5 are for testing. The 15 speakers were graduate students and professors in the Electrical Engineering Department at the University of Florida. The speakers represent a wide variety of nationalities and accents, making this task significantly more difficult than one might think. The preprocessing comprised calculating the first 12 cepstral coefficients from 25.6 ms frames overlapped every 12.8 ms (10 kHz sampling). The cepstral coefficients were liftered by a raised sine to control the non-information-bearing cepstral variability for a more reliable discrimination of sounds. These cepstral coefficients were then mean filtered three at a time to reduce the number of input vectors. The major difference between using the GASTAD VQ method and a standard VQ method is that the GASTAD algorithm is trained to enhance patterns that it was trained with. There are two options to incorporate the temporal characteristics of the GASTAD into this architecture. Typically, one vector quantizer is used to quantize every word in the corpus. This can be done with the GASTAD VQ as well. In this case, the network would need to store all of the temporal information from all ten digits in a single network. Although this is possible, the task of the GASTAD is simplified by training a separate GASTAD network for each digit (i.e. each network stores the temporal characteristics of only a single digit).
Similarly, we will use a separate MLP to detect each digit. Figure 5 shows the overall block diagram of the system.
Figure 5. Block diagram for the digit recognition system: a) standard digit recognition system, b) GASTAD recognition system.

First we trained the 10 GASTAD VQ networks. This process was done by feeding each network an input consisting of the target digit spoken by the 10 training speakers interspersed with random vectors from the other 9 digits. While training, the activity wavefronts could easily be seen in a plot of the maximum activity in the network over time. This usually picks up the wavefront activity in the network quite well. Figure 6 shows the activity of the digit-six network with the training data. The instances where the word six is spoken are highlighted between dashed lines. The input data interspersed between the presentations of 'six' are random vectors from the other digits. Clearly the activity of the network is much higher when the word six is presented to the network. You should also notice, however, that certain speakers do not adequately match the "global average". For instance, speaker 10, near sample 400, does not create a large activity spike in the network. For larger systems, this can be solved by using multiple networks for each digit, allowing for more variation in the speakers. After the VQ networks were trained, we vector quantized each digit from all 15 speakers, 10 from the training set and 5 from the test set. We also vector quantized the data using a standard Neural Gas algorithm so that we could test our results. In order to remove some of the variability caused by the different rates at which the words and phonemes were spoken, we passed each sequence of reference vectors from each spoken word through a gamma memory. The gamma memory has 6 taps and a μ of 0.5, giving it a depth of 12 samples, which corresponds to the maximum length of any spoken digit in the corpus (the minimum was 6 vectors). The output of the six taps of the memory is then fed into an MLP with 6 hidden PEs and a sigmoidal output PE.
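A sketch of the gamma memory used here: a cascade of identical leaky integrators, each tap fed by the previous one, so that with 6 taps and μ = 0.5 the memory depth is roughly 6/0.5 = 12 samples as stated above. Whether the undelayed input itself counts as one of the six taps is not specified; this sketch returns the six integrator stages.

```python
import numpy as np

def gamma_memory(sequence, n_taps=6, mu=0.5):
    """Pass a sequence of feature vectors through a gamma memory.

    Tap recursion: g_k(t) = (1 - mu) * g_k(t-1) + mu * g_{k-1}(t-1),
    with g_0(t) = x(t).  Returns an array of shape (T, n_taps, dim).
    """
    x = np.asarray(sequence, dtype=float)
    g = np.zeros((n_taps + 1, x.shape[1]))     # g[0] is the undelayed input
    outputs = []
    for xt in x:
        g_prev = g.copy()
        g[0] = xt
        for k in range(1, n_taps + 1):
            g[k] = (1.0 - mu) * g[k] + mu * g_prev[k - 1]
        outputs.append(g[1:].copy())
    return np.array(outputs)

# The tap outputs are what would be fed to the per-digit MLP detectors
# described in the text.
```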
Figure 6. Maximum activity in the SOMTAD network for digit six.

All 10 networks are trained simultaneously and a winner-take-all PE is used to select the network with the largest output. The digit detector with the largest output is declared the winner, and this is compared against the desired signal, which indicates which digit was actually spoken.
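A sketch of the winner-take-all decision over the ten per-digit detectors; `detectors` and `embeddings` stand in for the trained MLPs and the gamma-memory embeddings described in the text and are not actual library objects.

```python
import numpy as np

def classify_digit(embeddings, detectors):
    """Winner-take-all over the ten per-digit detectors.

    embeddings[d] : gamma-memory embedding of the utterance quantized with
                    digit d's GASTAD codebook
    detectors[d]  : that digit's trained MLP, returning a scalar in (0, 1)
    """
    scores = np.array([detectors[d](embeddings[d]) for d in range(10)])
    return int(np.argmax(scores)) + 1      # digits are spoken 'one' .. 'ten'
```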
Figure 7. Digit recognition system The entire system was trained with three different sets of data. First, the original data from the preprocessor was used to train the system. This allows us to validate that the vector quantization reduces the variability of the data and allows for easier recognition of the digits. Second, the Neural Gas algorithm was used to vector quantize the input data. Third, the GASTAD network was used for each vector quantizer. In order to remove random variations
264 based on initial conditions, each system was trained and tested 5 different times and the results were averaged. Table 1 shows the results of the training. The key figures are the number of misclassifications in the test set. Since MLPs are universal mappers, a sufficiently large MLP can learn to classify virtually any data set. A common problem with MLPs is that they can be overtrained. If the network is overtrained, it will have very good classification in the training set, but very poor classification in the test set. Thus, the true indication of performance for MLPs is the performance in the test set. Table 1 Summary of the digit recognition system performance Training Testing Training System Type MSE MSE misclassifications No Vector Quantization 0.0005 0.0216 0.0 Neural Gas VQ 0.0010 0.0321 0.2 0.0199 0.2 6ASTADVQ ....... 0.0009 .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Testing misclassifications 12.2 10.0 7.6
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Percent Correct classification in Test set 75.5 80.0 84.8
The table shows that the GASTAD VQ system reduced the number of errors in the testing set by 25% over the Neural Gas VQ system and by 40% over the system without vector quantization. (With 5 test speakers saying 10 digits each, there are 50 test utterances, so 7.6 misclassifications corresponds to the 84.8% correct shown in the table.) This performance is due to the reduction in the variability in the systems. A vector quantization technique removes some of the variability of the signal by clustering all the inputs and representing every input in the cluster with a single reference vector. The GASTAD VQ system takes this one step further by using temporal information to enhance the clustering. The temporal sequence of the feature vectors plays an important role in the vector quantization. For additional comparison, a Hidden Markov Model (HMM) with 5 states was trained using the original input. The HMM was trained for 50 cycles starting from 5 different initial conditions. The average result for the HMM was 81% correct over the test set.

4. CONCLUSIONS

The TAD algorithm uses temporal plasticity induced by the diffusion of activity through time and space. It creates a unique spatio-temporal memory in either the SOM or the neural gas architecture. The activity diffusion couples space and time into a single set of dynamics that can help disambiguate the static spatial information with temporal information. This creates time-varying Voronoi diagrams based on the past of the input signal. This dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) future inputs.

5. REFERENCES
1. J. Kangas, Time-Delayed Self-Organizing Maps, in Proceedings of the International Joint Conference on Neural Networks, pp. 331-336, part 2 of 3, 1990.
2. J. Kangas, Phoneme Recognition Using Time-Dependent Versions of Self-Organizing Maps, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 101-104, 1991.
3. C. M. Privitera and P. Morasso, The Analysis of Continuous Temporal Sequences by a Map of Sequential Leaky Integrators, in Proceedings of ICNN 94, pp. 3127-3130, 1994.
4. C. M. Privitera and L. Shastri, A DSOM Hierarchical Model for Reflexive Processing: An Application to Visual Trajectory Classification, International Computer Science Institute, Berkeley, CA, Technical Report TR-96-011, June 1996.
5. G. J. Chappell and J. G. Taylor, The Temporal Kohonen Map, Neural Networks, Vol. 6, pp. 441-445, 1993.
6. D. A. Critchley, Extending the Kohonen Self-Organising Map by Use of Adaptive Parameters and Temporal Neurons, Ph.D. Thesis, University College London, Department of Computer Science, February 1994.
7. T. Kohonen, The Hypermap Architecture, in Proceedings of the International Conference on Artificial Neural Networks, pp. 1357-1360, 1991.
8. J. Kangas, Temporal Knowledge in Locations of Activations in a Self-Organizing Map, in Artificial Neural Networks 2, pp. 117-120, 1992.
9. J. Goppert and W. Rosenstiel, Selective Attention and Self-Organizing Maps, in Proceedings of Neural Networks and their Applications, IUSPIM, Marseille, France, 1994.
10. J. Goppert and W. Rosenstiel, Dynamic Extensions of Self-Organizing Maps, in Proceedings of the International Conference on Artificial Neural Networks, Sorrento, Springer, London, 1994.
11. J. Goppert and W. Rosenstiel, Neurons with Continuous Varying Activation in Self-Organizing Maps, in From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Vol. 930, pp. 416-426, Springer-Verlag, 1995.
12. A. Turing, The Chemical Basis of Morphogenesis, Phil. Transactions of the Royal Society of London, Ser. B, Vol. 237, pp. 37-72, 1952.
13. J. J. Tyson and J. P. Keener, Singular Perturbation Theory of Traveling Waves in Excitable Media (A Review), Physica D, Vol. 32, pp. 327-361, 1988.
14. B. Krekelberg and J. G. Taylor, Nitric Oxide and the Development of Long-Range Horizontal Connectivity, Neural Network World, Vol. 6, No. 2, pp. 185-189, 1996.
15. B. Krekelberg and J. G. Taylor, Nitric Oxide in Cortical Map Formation, International Conference on Artificial Neural Networks, 1996.
16. P. Kulzer, "NAVBOT - Autonomous robotic agent with neural network learning of autonomous mapping and navigation strategies", unpublished Master's Thesis, University of Aveiro, Portugal, 1996.
17. N. R. Euliano, J. C. Principe, and P. Kulzer, A Self-Organizing Temporal Pattern Recognizer with Application to Robot Landmark Recognition, Sintra Spatiotemporal Models in Biological and Artificial Systems Workshop, November 1996.
18. D. Ruwisch, B. Dobrzewski, and M. Bode, Wave Propagation as a Neural Coupling Mechanism: Hardware for Self-Organizing Feature Maps and the Representation of Temporal Sequences, in IEEE Workshop on Neural Networks for Signal Processing Proceedings, pp. 306-315, 1997.
19. T. M. Martinetz, S. G. Berkovich, and K. J. Schulten, "Neural-Gas" Network for Vector Quantization and its Application to Time-Series Prediction, IEEE Transactions on Neural Networks, Vol. 4, No. 4, July 1993, pp. 558-569.
Kohonen Maps. E. Oja and S. Kaski, editors. © Elsevier Science B.V. All rights reserved.
Advances in modeling cortical maps
Pietro G. Morasso, Vittorio Sanguineti and Francesco Frisone^a

^a Dept. of Informatics, Systems and Telematics, University of Genova, Via Opera Pia 13, 16145 Genova, Italy

In this paper, we explore the hypothesis that lateral connections in cortical maps are used to build topological internal representations, and propose that the latter are particularly suitable for the processing of high-dimensional 'spatial' quantities, like sensorimotor information.

1. LATERAL CONNECTIONS IN CORTICAL MAPS
Cortical areas can be seen as a massively interconnected set of basic processing elements (the so-called cortical 'columns'), which constitute what is called a 'computational map' [13]. Columns are interconnected by a large number of 'lateral' connections (i. e. parallel to cortical surface), which are mostly reciprocal. An apparent feature of many cortical areas is their 'somatotopic' or 'ecotopic' organization, in the sense that similar information (e.g., similar receptive fields) is mapped into physically adjacent columns. It has been suggested that this is required by the need for 'cooperative' computations in processing the sensory input, in which lateral interconnections play a crucial role. Many studies - for instance, [1,14] - have suggested that lateral connections (assumed to pre-exist) also play an essential role in the spontaneous emergence, or self-organization, of the maps themselves. When the sensory space to be 'represented' is more than twodimensional, as in maps of orientation sensitivity in the visual cortex, the observed discontinuities have been explained [6] as the effect of a compromise between the computational advantage of somatotopy and the physical constraint of a two-dimensional layout. All these studies imply that only adjacent columns are interconnected, and therefore cortical maps are two-dimensional lattices. However, the observed patterns of lateral connectivity are much more complicated. First of all, the structure of lateral connections is not genetically determined, but depends mostly on electrical activity during development. More precisely, they have been observed to grow exuberantly after birth and to reach their full extent within a short period; during the subsequent development, a pruning process takes place so that the mature cortex is characterized by a well defined pattern of connectivity, which includes a significant amount of non-local connections. Second, in the mature cortex the superficial connections to non-adjacent columns are organized into characteristic patterns: a collateral of a pyramidal axon typically travels a characteristic lateral distance without giving off terminal branches; then it produces
tightly connected terminal clusters (possibly repeating the process several times over a total distance of several millimeters). Such a characteristic distance is not a universal cortical parameter and is not distributed in a purely random fashion, but differs in different cortical areas: 0.43 mm in the primary visual area, 0.65 mm in the secondary visual area, 0.73 mm in the primary somatosensory cortex, 0.85 mm in the primary motor cortex, and up to several mm in the infero-temporal cortex (area 7a) [12,24,3]. Finally, in the visual cortex, joint anatomical and functional analysis has suggested that columns with similar receptive fields are also laterally connected [11], although they are not necessarily adjacent. A similar result has been found through correlation analysis in the motor cortex [10]: units with similar preferred movement direction seem to be connected. As the development of lateral connections depends on the cortical activity caused by the external inflow, one possibility is that they are used to capture and represent the (hidden) correlation in the input channels. In other words, lateral connections are a way to represent the inherent topology of the sensory space. In this regard, it has been pointed out [2] that lattices are effective ways to represent spaces, even multi-dimensional ones (with their associated topology), on a two-dimensional sheet (like the cortical surface). We therefore suggest that cortical maps are in fact lattices with the same dimensionality as the represented space, so that they are topologically continuous lattices or topology-representing networks [16]. According to this hypothesis, the discontinuities observed, for instance, in orientation-sensitive maps in the visual cortex are not pathologies of map topology, but only indicate that the inherent dimension of the represented space is greater than two. The physical arrangement of the map can then simply be explained by criteria based on the minimization of the global wiring; see [4]. An open issue is what the computational advantage of topologically continuous lattices is. In this paper, we will try to provide an answer to this question by showing that they have a number of interesting computational properties, which make them ideally suited for implementing forward and inverse coordinate transformations. In particular, we will show that (i) the underlying representations depend only on the inherent dimensionality of the input space and not on the number of input variables or on the particular coordinate systems in which they are defined; (ii) their dynamic behavior may underlie a mechanism of trajectory generation which automatically accounts for the topology of the represented space; (iii) different maps can be interconnected, so that they can implement both forward and inverse transformations among the underlying spaces.
2. A NEURAL NETWORK MODEL

One basic aspect of cortical maps is their modular organization into columns, which communicate by means of reciprocal lateral connections (intra-connectivity); reciprocal connectivity patterns also characterize the interaction among different cortical areas (cross-connectivity), directly or via cortico-thalamic loops, thus providing physical support for carrying out complex sensorimotor transformations. A 'spatial map' is a set F of processing elements (PEs) or filters, which model cortical
columns. A map is completely specified by the following entities:

- A matrix C of 'lateral connections', in which c_ij indicates the strength of the connection between the elements i, j ∈ F. We will assume that each PE is characterized by a non-empty neighborhood set N(i) (the set of PEs to which it is laterally connected), so that c_ij = 1 if and only if j ∈ N(i), and c_ij = c_ji. In other words, all lateral connections are reciprocal and excitatory.
- An internal state U_i, defined for all i ∈ F. We will also assume that 0 ≤ U_i ≤ 1.
- An external input I_i, i ∈ F. Such an input may come either from the thalamo-cortical connections or from the cortico-cortical connections from other cortical maps.

Let us consider the following dynamic equation:

\tau \frac{dU_i}{dt} + U_i = \sum_j \frac{c_{ij} U_j}{\sum_k c_{jk} U_k}\, I_i    (1)

Equation 1 is similar to that described by [1] but, to account for the preponderance of excitatory lateral connections in the cerebral cortex, the competition dynamics mediated by inhibitory lateral connections has been substituted by an equivalent mechanism of competitive distribution of activation, mediated by excitatory connections. It has been shown [21] that a map described by the above equation converges to an equilibrium state in which the U_i are a high-pass filtered version of the I_i.

2.1. Spatial maps as internal representations

The thalamic input establishes a relationship between the internal state of the map and the 'external world'. Let us model such an input as a simple mapping:

I_i = T_i(x), \quad i \in F    (2)

where x ∈ X is a sensory quantity. As a consequence, the equilibrium state of Eq. 1 will also depend on x. Eq. 2 determines, for each PE, a receptive field, defined as the subset of X for which I_i is such that the equilibrium state U_i is non-zero. The value of x that maximizes U_i at equilibrium will be referred to as the preferred input, x̂_i, or the sensory prototype 'stored' in that PE (see Fig. 1). Indirectly, receptive fields and prototypes allow us to interpret U_i as the probability that x = x̂_i, and thus we can estimate the actual x from the observed internal state of the map:

x \approx x_F = \frac{\sum_i \hat{x}_i U_i}{\sum_j U_j}    (3)

so that the U_i can be seen as the population code of x. Hereafter, x_F will be indicated as the population vector associated with F. It has been noted [23] that such a code has the important property of being independent of the particular coordinate system in which x is measured; see also Section 4. The properties of such an implicit and bi-directional mapping between X and F are entirely determined by both the thalamo-cortical transformation (i.e. Eq. 2) and the pattern of lateral connections, i.e. the 'topology' of the map.
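To make the relaxation and the read-out concrete, the following minimal numerical sketch (in Python) integrates the reconstructed Eq. (1) on a 1-D chain with Gaussian thalamic tuning (Eq. 2) and reads the equilibrium state out as a population vector (Eq. 3). It is not the authors' code: the lattice size, the time constants, the tuning width and the exact placement of I_i inside the competitive term follow our reconstruction above and are only illustrative.

import numpy as np

# Minimal sketch of the map dynamics (reconstructed Eq. 1) on a 1-D chain.
# All numerical choices are illustrative assumptions, not taken from the paper.
N, tau, dt = 60, 0.1, 0.005
x_hat = np.linspace(0.0, 3.0, N)          # sensory prototypes stored in the PEs

C = np.zeros((N, N))                      # lateral connection matrix c_ij
i = np.arange(N - 1)
C[i, i + 1] = 1.0                         # reciprocal, excitatory nearest-neighbour
C[i + 1, i] = 1.0                         # links (c_ij = c_ji = 1)

def thalamic_input(x, width=0.2):
    # Eq. (2): I_i = T_i(x), a decreasing (Gaussian) function of ||x - x_hat_i||
    return np.exp(-0.5 * ((x - x_hat) / width) ** 2)

def relax(I, steps=4000):
    # Integrate the reconstructed Eq. (1); U is kept in [0, 1]
    U = np.full(N, 0.5)
    for _ in range(steps):
        denom = C @ U + 1e-12             # sum_k c_jk U_k, one value per unit j
        lateral = C @ (U / denom)         # sum_j c_ij U_j / sum_k c_jk U_k
        U += dt / tau * (-U + I * lateral)
        np.clip(U, 0.0, 1.0, out=U)
    return U

def population_vector(U):
    # Eq. (3): estimate of x from the internal state of the map
    return float(x_hat @ U / (U.sum() + 1e-12))

U_eq = relax(thalamic_input(1.5))
print("population vector:", population_vector(U_eq))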
Figure 1. The dynamic cortical model
As regards Eq. 2, we will assume that T_i(x) is a decreasing function of the distance between x and a particular value x_i ∈ X: T_i(x) = G(||x − x_i||); for instance, a Gaussian. In this case, because of the equilibrium properties of Eq. 1, x_i actually defines the 'preferred value' for the i-th PE. As for the map topology, if adjacent PEs have 'similar' sensory prototypes (which does not require topological continuity: even Kohonen's SOMs [14] have this property), a continuous trajectory of the map internal state U_i(t) (whatever the input that has generated it) results in a continuous trajectory of its associated x_F(t). We also require that maps are topologically continuous, which implies that the converse is also true: 'similar' x's activate adjacent PEs; in this case, a continuous x(t) results in a continuous U_i(t). We will show (see Section 6) that this property is essential for implementing bidirectional coordinate transformations. An algorithm for the self-organization of topologically continuous maps has been demonstrated by [16].

3. Dynamic behavior of spatial maps

A number of experimental observations have shown that spatial maps also display a dynamic behavior: the continuous modification of the population vector in the primary motor cortex during mental rotation experiments [9], which is believed to code the direction of the planned movement, and the continuous update of the representation of the target - in retinal coordinates - in the superior colliculus during gaze shifts. Such facts have been explained by hypothesizing [20] a continuous movement of the peak of activation on the corresponding maps. Several attempts [5,15] were made to model such a moving-hill mechanism of continuous remapping in terms of recurrent neural networks. In the case of a sensory topology-representing map, the dynamic behavior described
Figure 2. Dynamic remapping in a 1-dimensional map. Evolution of the internal status of the map at different time steps (top) and time course of the population vector (bottom). The external input pattern (dashed) has a maximum for x = 2, whereas the population vector corresponding to the initial status equals x_est = 0.5.
by Eq. 1 can be interpreted as keeping the internal state 'in register' with the incoming sensory information. Moreover, if the external input is assumed to form a shunting interaction with the internal state (Grossberg 1973):

\tau \frac{dU_i}{dt} + U_i = \sum_j \frac{c_{ij} U_j}{\sum_k c_{jk} U_k} + g(U_i \cdot I_i)    (4)

the equilibrium state takes the shape of a sharp 'peak', centered on the processing element for which I_i is maximum. In this case, if the external inputs are non-zero for each PE, and the initial status is characterized by a local peak of activation, the effect is that of a 'migration' of the peak toward the position in the map in which I_i has its maximum value; the behavior is shown in the simulation of Fig. 2, involving a map with N = 60 PEs and lateral connections organized as a 1-dimensional lattice. It should be noted that, although the shape of the 'moving hill' is not completely preserved, the corresponding population vector varies smoothly from its initial value to that corresponding to the maximum of I_i (see Fig. 2). Moreover, the corresponding trajectory of the population vector will reflect the topology of F so that, for instance, if the 1-dimensional map of Fig. 2 represented a 1-dimensional X (for instance, a circle embedded in a plane), the time course of the population vector in the plane would in fact be circular. This property is potentially important in tasks like trajectory formation.
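The migration of the activity peak can be illustrated with a small simulation in the spirit of Fig. 2. The sketch below uses the shunting variant of the dynamics as reconstructed in Eq. (4), with g taken simply as the product U_i * I_i; this choice of g, like the lattice size and the time constants, is our assumption and not necessarily the authors' one.

import numpy as np

# Sketch of the 'moving hill': a local activity peak drifts towards the
# maximum of a non-zero external input on a 1-D chain (reconstructed Eq. 4).
N, tau, dt = 60, 0.1, 0.005
grid = np.linspace(0.0, 3.0, N)
C = np.zeros((N, N))
i = np.arange(N - 1)
C[i, i + 1] = 1.0
C[i + 1, i] = 1.0

I = np.exp(-0.5 * ((grid - 2.0) / 0.3) ** 2) + 0.05   # input peaked at x = 2
U = np.exp(-0.5 * ((grid - 0.5) / 0.1) ** 2)          # initial peak near x = 0.5

for step in range(6001):
    denom = C @ U + 1e-12
    lateral = C @ (U / denom)
    U += dt / tau * (-U + lateral + U * I)            # g(U_i * I_i) := U_i * I_i
    np.clip(U, 0.0, 1.0, out=U)
    if step % 1500 == 0:
        print(f"step {step:5d}  population vector = {grid @ U / U.sum():.2f}")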
4. Representation of a multisensory space

Figure 3. Sketch of the setup in the simulation study.

One of the most intriguing side-effects of spatial representations of a stimulus space is that, in a sense, they allow one to "add apples and oranges". Consider for example a cortical
map of neurons with bimodal (visual-auditory) receptive fields, which are found in the posterior parietal cortex. These neurons are in fact adding apples and oranges, and this does not make any sense in any retinocentric or cochleocentric framework. However, we should remember that in real, "ecological" perceptual situations the input stimuli converging on this cortical area are causally and lawfully determined by visuo-acoustic events happening in the physical 3D world: thus it is not surprising to expect that Hebbian learning applied to such a map will allow the emergence of a pattern of lateral connections consistent with the 3-dimensionality of the external world. This is illustrated in Fig. 3. In a simple simulation study [17], we considered a simplified visual-auditory paradigm, related to a binocular/binaural situation: the "eyes" and the "head" are fixed, for simplicity, and the directional capabilities of the auditory system, due to binaural time and/or intensity differences, are simply represented by using, for each "ear", a pair of measuring points which pick up the distance/propagation delay from the stimulus P. Thus, the visual-auditory vector s is 8-dimensional, with 4 retinal components and 4 cochlear components. During learning, based on the TRN algorithm [16], a visuo-auditory cortical map of 100 neurons received stimuli coming from a cubic area at the center of the visual field. The dataset (2000 points) was presented 20 times, and Fig. 4 shows the state of the network after 5, 10, 15, and 20 epochs, respectively. Each neuron in the network learned a prototype vector z_i in the embedding 8-dimensional space and developed a set of lateral connections. For the purpose of visualization, each 8-D prototype was back-projected into 3-D space; these points are plotted as small circles in the figures, and circles corresponding to neurons which are neighbors in the map are linked by segments. The figure shows that the visuo-auditory map is in fact able to learn a coordinate-free representation of the external space, exploiting the structure hidden in the "fruit salad" of multi-sensory data.
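A compact way to reproduce the flavour of this experiment is a competitive-Hebbian sketch in the style of the TRN rule [16]: 3-D points are encoded by a set of distance measurements (a stand-in for the 4 retinal plus 4 cochlear components), prototypes are adapted, and a lateral link is created between the two units closest to each stimulus. The sensor geometry, the adaptation of the winner only (the full TRN uses a neural-gas ranking), and all constants below are simplifying assumptions, not the setup of [17].

import numpy as np

# Competitive-Hebbian sketch of the visuo-auditory map experiment (illustrative).
rng = np.random.default_rng(0)
sensors = rng.uniform(-1.0, 1.0, size=(8, 3))      # 8 fixed measuring points

def encode(p):                                     # 3-D point -> 8-D stimulus
    return np.linalg.norm(sensors - p, axis=1)

n_units, epochs = 100, 20
points = rng.uniform(0.0, 1.0, size=(2000, 3))     # cubic stimulus region
data = np.array([encode(p) for p in points])

W = data[rng.choice(len(data), n_units, replace=False)]   # prototype vectors
links = np.zeros((n_units, n_units), dtype=bool)          # lateral connections

for epoch in range(epochs):
    lr = 0.3 * (0.01 / 0.3) ** (epoch / (epochs - 1))     # decaying step size
    for v in data[rng.permutation(len(data))]:
        d = np.linalg.norm(W - v, axis=1)
        s0, s1 = np.argsort(d)[:2]                         # two nearest units
        W[s0] += lr * (v - W[s0])                          # adapt the winner
        links[s0, s1] = links[s1, s0] = True               # competitive Hebb link

print("lateral links created:", int(links.sum()) // 2)

Back-projecting the learned prototypes to 3-D (as done for Fig. 4) would additionally require inverting the encoding, which is omitted here.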
Figure 4. Visuo-acoustic map trained with a 3-dimensional training set. State of the network after 5, 10, 15, and 20 epochs.
Figure 5. Inter-connectivity matrix and 'virtual' prototypes.
5. Function approximation by interconnected maps

The spatial representation of variables suggests that in cortical maps even coordinate transformations are expressed in terms of population codes: for instance, let us consider a continuous, smooth function y = y(x), with x ∈ X ⊆ R^n, y ∈ Y ⊆ R^m and, in general, m ≠ n. Let us also consider two maps, F_x and F_y, which represent X and Y, respectively. They can be obtained from a suitable learning process of competitive type [16]; after learning, any training sample (x, y) is represented in the maps F_x and F_y by two independent population codes, U^x and U^y. The structure of the mapping between U^x and U^y can be captured by a set of cross-connections c^{xy}_{ij} between the PEs of one map and the PEs of the other. This can be obtained with an extension of the TRN learning rule [16] in which, for each given (x, y) pair, the c^{xy}_{ij} element is set when the i-th and the j-th PEs happen to be the winners in the corresponding maps; this still identifies a Hebbian learning rule. Given F_x and F_y, their prototypes and the inter-connectivity matrix, an estimator of y, given x, is expressed by:

\hat{y}(x) = \frac{\sum_i \hat{y}_i U_i^y}{\sum_j U_j^y}    (5)

where the quantity

U_i^y = \sum_j c_{ij}^{xy} U_j^x    (6)
can be interpreted as the projection of the population code of x onto the output map. The quantity ŷ is the population vector corresponding to U^y. Therefore, the inter-connectivity matrix has the effect of transforming a distributed representation of x, i.e. U^x, into the corresponding population code of y, i.e. U^y. As an example, let us consider the simple one-dimensional mapping of Figure 5, corresponding to the function y = sin 2πx + sin(πx/3) + sin(πx/5) + sin(πx/7). The size of the input and output maps is, respectively, N = 60 and M = 20; the prototype values of x and y at the end of the separate training phases are indicated by open circles on the two axes; filled circles indicate the 'virtual' prototypes, implicitly defined by the inter-connections, which are visualized as horizontal and vertical segments, respectively. It should be noted that the architecture is completely bi-directional or symmetric, in the sense that it does not require one to specify what is the input and what is the output (thus resembling an associative memory). (In fact, it can even represent relationships between spaces that are not 'functions'.) For instance, given a particular y, yielding a distributed representation U^y on F_y, the corresponding projection on F_x is given by

U_i^x = \sum_j c_{ij}^{xy} U_j^y    (7)

Figure 6. Top: Forward projection of a population code U^x (which codes x = 1.5) onto F_y via the cross-connections. U^y is the projected activation, plotted versus the y prototypes. Bottom: Inverse projection of a population code U^y (which codes y = 2) onto F_x via the cross-connections. U^x is the projected activation, plotted versus the x prototypes.
Figure 6 shows how, through the inter-connections, the population codes are mapped back and forth; the set of maxima of U^x clearly identifies 'all' the possible inverses, i.e. the set of x values matching a specific y. In summary, the distinctive features of this approximation scheme are that (i) differently from artificial neural network models like RBF or NGBF networks, the transformation is among population codes, a feature that has been observed in biological sensorimotor networks [25,22]; (ii) the architecture is 'symmetric' or bi-directional, thus suggesting that it may perform both forward and 'inverse' transformations. The scheme is also efficient and modular, because a given map or representation layer might be shared by different computational modules, each implementing a specific transformation or association, in a sort of network of maps.
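The following sketch illustrates the scheme of Eqs. (5)-(7) on a toy problem: two one-dimensional maps with fixed, evenly spaced prototypes are linked by a binary cross-connection matrix trained with the pairwise-winner rule, and a Gaussian population code is pushed through the matrix in both directions. The map sizes, the tuning width and the target function (a single sinusoid rather than the four-term sum used in the text) are illustrative simplifications.

import numpy as np

# Sketch of function approximation by two inter-connected maps (Eqs. 5-7).
rng = np.random.default_rng(1)
Nx, My = 60, 20
x_proto = np.linspace(0.0, 3.0, Nx)                # prototypes of map F_x
f = lambda x: np.sin(2 * np.pi * x) + 2.0          # stand-in for y = y(x)
y_proto = np.linspace(1.0, 3.0, My)                # prototypes of map F_y

Cxy = np.zeros((Nx, My))                           # cross-connections c_ij^xy
for x in rng.uniform(0.0, 3.0, 5000):              # training pairs (x, y(x))
    i = np.argmin(np.abs(x_proto - x))             # winner in F_x
    j = np.argmin(np.abs(y_proto - f(x)))          # winner in F_y
    Cxy[i, j] = 1.0                                # Hebbian pairwise-winner rule

def code(protos, value, width=0.1):                # Gaussian population code
    u = np.exp(-0.5 * ((protos - value) / width) ** 2)
    return u / u.sum()

def pop_vector(protos, u):                         # population vector read-out
    return float(protos @ u / u.sum())

Ux = code(x_proto, 1.5)
Uy = Cxy.T @ Ux                                    # forward projection, Eq. (6)
print("estimate of y(1.5):", round(pop_vector(y_proto, Uy), 2),
      " true value:", round(f(1.5), 2))

Uy_target = code(y_proto, 2.0)
Ux_back = Cxy @ Uy_target                          # inverse projection, Eq. (7)
print("strongest inverse candidate x =", round(x_proto[np.argmax(Ux_back)], 2))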
Figure 7. Initial (dashed) and final state of U^x in response to the stimulus U^y; the final U^x determines the value of x (vertical dashed line).
6. Dynamic inversion

The dynamic mechanism may also be exploited by a pair of inter-connected maps, e.g. F_x and F_y, for inverting the mapping y = y(x). Let us suppose that the external input pattern I^x of F_x is provided by an activation peak on the map F_y, projected onto F_x through the inter-connections according to Eq. 7, as in Fig. 6 (bottom). In fact, I^x identifies a population vector y_e on F_y that can be interpreted as a target in an end-point control dynamics; the final population vector x of F_x will correspond to the end-point inversion of the mapping y = y(x) (see Fig. 7).

7. Conclusion

In this paper we have revisited the previously formulated hypothesis [18] that lateral connections in cortical maps are used to build topological internal representations suitable for processing sensorimotor information. The logical development of this work can proceed in two directions. At the theoretical level, it can attempt a unification of the learning mechanisms for the thalamo-cortical and cortico-cortical connections, along the lines outlined in [19,7], thus substituting the crisp set of lateral connections with a fuzzier but more robust set. From the neurobiological point of view, it is very important to correlate dynamic models of cortical maps with dynamic brain imaging [8].

REFERENCES
1. S. Amari. Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27:77-87, 1977.
2. V. Braitenberg. Vehicles - Experiments in Synthetic Psychology. MIT Press, Cambridge, MA, 1984.
3. W. Calvin. Cortical columns, modules and Hebbian cell assemblies. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 269-272. MIT Press, Cambridge, MA, 1995.
4. C. Cherniak. Component placement optimization in the brain. Journal of Neuroscience, 14:2418-2427, 1994.
5. J. Droulez and A. Berthoz. A neural model of sensoritopic maps with predictive short-term memory properties. Proceedings of the National Academy of Sciences, 88:9653-9657, 1991.
6. R. Durbin and G. Mitchison. A dimension reduction framework for understanding cortical maps. Nature, 343:644-647, 1990.
7. F. Frisone, P. Morasso, and L. Perico. Self-organization in cortical maps and EM learning. Journal of Advanced Computational Intelligence, 2:178-184, 1998.
8. F. Frisone, P. Vitali, G. Iannò, M. Marongiu, P. Morasso, A. Pilot, G. Rodriguez, M. Rosa, and F. Sardanelli. Can the synchronization of cortical areas be evidenced by fMRI? Neurocomputing, 1999. In press.
9. A.P. Georgopoulos, J.T. Lurito, M. Petrides, A.B. Schwartz, and J.T. Massey. Mental rotation of the neuronal population vector. Science, 243:234-236, 1989.
10. A.P. Georgopoulos, M. Taira, and A. Lukashin. Cognitive neurophysiology of the motor cortex. Science, 260:47-51, 1993.
11. C.D. Gilbert and T.N. Wiesel. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. Journal of Neuroscience, 9:2432-2442, 1989.
12. C.D. Gilbert and T.N. Wiesel. Morphology and intracortical projections of functionally identified neurons in cat visual cortex. Nature, 280:120-125, 1979.
13. E.I. Knudsen, S. du Lac, and S.D. Esterly. Computational maps in the brain. Annual Review of Neuroscience, 10:41-65, 1987.
14. T. Kohonen. Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
15. A.V. Lukashin and A.P. Georgopoulos. A neural network for coding trajectories by time series of neuronal population vectors. Neural Computation, 6:19-28, 1994.
16. T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.
17. P. Morasso and V. Sanguineti. How the brain can discover the existence of external egocentric space. Neurocomputing, 12(2-3):289-310, 1996.
18. P. Morasso and V. Sanguineti, editors. Self-Organization, Computational Maps and Motor Control. Elsevier, 1997.
19. P.G. Morasso, V. Sanguineti, F. Frisone, and L. Perico. Coordinate-free sensorimotor processing: Computing with population codes. Neural Networks, 11:1417-1428, 1998.
20. D.P. Munoz, D. Pelisson, and D. Guitton. Movement of neural activity on the superior colliculus motor map during gaze shifts. Science, 251:358-360, 1991.
21. J.A. Reggia, C.L. D'Autrechy, G.G. Sutton III, and M. Weinrich. A competitive distribution theory of neocortical dynamics. Neural Computation, 4:287-317, 1992.
22. E. Salinas and L.F. Abbott. Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15(10):6461-6474, 1995.
23. T.D. Sanger. Theoretical considerations for the analysis of population coding in motor cortex. Neural Computation, 6:29-37, 1994.
24. H.D. Schwark and E.G. Jones. The distribution of intrinsic cortical axons in area 3b of cat primary somatosensory cortex. Experimental Brain Research, 78:501-513, 1989.
25. D. Zipser and R.A. Andersen. A backpropagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331:679-684, 1988.
Kohonen Maps. E. Oja and S. Kaski, editors
© 1999 Elsevier Science B.V. All rights reserved
Topology Preservation in Self-Organizing Maps

Th. Villmann*

Universität Leipzig, Klinik und Poliklinik für Psychotherapie und Psychosomatische Medizin, D-04107 Leipzig, Karl-Tauchnitz-Str. 25, Germany

In the present article we give a summary of our work concerning the problem of topology preservation in self-organizing maps. We develop a mathematically exact definition of it and show ways of measuring the degree of topology preservation. Finally we introduce an advanced learning scheme for generating general hypercube structures for self-organizing maps, which then yields an improved topology preservation of the map.

1. Introduction

The self-organizing map (SOM) [21] introduced by T. Kohonen is a special type of neural map which has found a wide distribution ([22] or [27]). Neural maps constitute an important neural network paradigm. In brains, neural maps occur in all sensory modalities as well as in motor areas. In technical contexts, neural maps are utilized in the fashion of neighborhood-preserving vector quantizers. In both cases these networks project data from some possibly high-dimensional input space onto a position in some output space. To achieve this projection, neural maps are self-organized by unsupervised learning schemes. The characteristic feature distinguishing neural maps from other neural network paradigms, or from regular vector quantizers, is the preservation of neighborhoods. Even though many people seem to share a common intuitive understanding of what is meant by topology preservation, it has turned out to be difficult to capture it in a mathematically rigorous fashion. Loosely speaking, topology preservation means that a continuous change of a parameter of the input data leads to a continuous change of the position of a localized excitation in the neural map. Obviously, topology preservation depends on the choice of the output space topology. As a simple example, consider maps from the unit square onto output spaces shaped like a line, a square or a cube. Of these, only the map onto the square output space will be able to preserve neighborhoods in forward and backward directions (Fig. 1). However, in many other examples, including many technical data sets, the input data lie on a submanifold of the input space only (consider, e.g., speech data, which are usually represented in a 10-20 dimensional input space, but fill only a lower-dimensional submanifold of this input space [4]). In these cases, the proper dimensionality required by the input data is usually not known a priori. Yet the output space grid has to be specified prior to learning. This problem is specific to technical applications of neural map algorithms, while in biological modeling the output space topology is usually chosen following assumptions about the connectivity of the underlying tissue [1].

*e-mail: [email protected], Tel. +49 0341 9718868, Fax +49 0341 2131257
Figure 1. A map learning algorithm can achieve an optimal neighborhood preservation only if the output space topology roughly matches the effective structure of the data in the input space.
On the other hand, the topology preservation property of the SOM is often required in applications [24]. For example, interpolation schemes of the usual SOM like the parametrized self-organizing map (PSOM) [26] or the continuous self-organizing map (CI-SOM) [13] require the topology preservation of the underlying lattice. To solve this problem one can self-organize several maps of different output space topology, measure their degree of neighborhood preservation and select the map with the optimal preservation value. For this purpose, several measures of neighborhood preservation have been developed. Two of them we will shortly present in sect. 3.2. Alternatively, one can use advanced learning schemes which adapt not only the weight vectors of the neurons, but also the topology of the output space itself. Examples of such algorithms are the Topology Representing Network [23], the Growing Cell Structures [11] and others [16], [17]. All these algorithms yield an output space connectivity which is not easy to formalize as a simple structure, like, e.g., a hypercube (with varying dimensions along the different directions). However, it depends on the subsequent operations which have to be performed on the projected data whether the output space topology may be of a general graph structure or needs to be constrained. As an example for the latter case consider the color-code or gray-value visualization of data points lying on a low-dimensional submanifold of a high-dimensional data space. When the data are mapped from their original space (which cannot be visualized) to a low-dimensional space, it is essential to have a topology-preserving mapping onto an ordered output space grid. Only then can a transformation of output space position into color indices [15,28] be performed. A second example, requiring hypercube structures, are the above-mentioned interpolation schemes PSOM and CI-SOM. In the present contribution we give a mathematically rigorous definition of topology preservation in the SOM. In a further part we present a growing SOM (GSOM) to generate maps with improved topology preservation while retaining a regular lattice structure. The work is structured as follows: in sect. 2, to clarify notation, we shortly introduce the concept of the SOM. In sect. 3 we develop the mathematical definition of
topology preservation and discuss some measures to judge the degree of topology preservation. In sect. 4 we present the GSOM scheme. Some applications illustrate the work in sect. 4.2, followed by the conclusion.

2. The Self-Organizing Map

In the SOM, A is usually a prespecified D_A-dimensional rectangular grid of N neurons r, which can in principle be of any dimensionality, or can extend to any dimension along its individual directions. This can be cast in a formal way by writing the output space positions as r = (i_1, i_2, i_3, ...), 1 ≤ i_j ≤ n_j. Thereby N = n_1 × n_2 × ... denotes the overall number of nodes.² Associated with each of the neurons r ∈ A is a weight vector w_r which determines a particular position in V. The mapping Ψ_{V→A}: V → A is realized by a winner-take-all rule

\Psi_{V \to A}: v \mapsto s = \arg\min_{r \in A} \| v - w_r \|    (2.1)

whereas the back mapping is defined as Ψ_{A→V}: r ↦ w_r. Both functions determine the map

\mathcal{M} = (\Psi_{A \to V}, \Psi_{V \to A})    (2.2)

realized by the network. To achieve the map M, SOMs adapt the pointer positions w_r so that the input space is mapped onto the output space in a most faithful way. A sequence of data points v ∈ V is presented to the map according to the data distribution P(V). Then the currently most proximate neuron s is determined, and the pointer w_s as well as the pointers w_r of neurons in the neighborhood of s are shifted towards v,

\Delta w_r = \epsilon\, h_{rs}\, (v - w_r), \qquad h_{rs} = \exp\!\left( -\frac{\| r - s \|^2}{2\sigma^2} \right).    (2.3)

Thereby h_{rs} is the neighborhood function, which is evaluated in the output space and is usually chosen to be of Gaussian shape. All data points v ∈ V which are mapped onto the neuron r according to (2.1) form its (masked) receptive field

\Omega_r = \{ v \in V \mid \Psi_{V \to A}(v) = r \}    (2.4)

which uniquely corresponds to the Delaunay graph in the following manner: the induced Voronoi diagram of a subset H ⊆ R^D with respect to a set S = {w_i ∈ H ⊆ R^D; i = 1...n} of points is given by the masked Voronoi polyhedra Ω_i, as shown in [23]. We remark that the Voronoi polyhedra are closed sets forming a partition of V, i.e. H = ∪_{i=1...n} Ω_i. The induced Voronoi diagram V_H uniquely corresponds to its induced Delaunay graph G_H [8]: two Voronoi cells Ω_i, Ω_j are connected in G_H if and only if their intersection is non-vanishing, i.e., Ω_i ∩ Ω_j ≠ ∅ [8],[32]. Hence, we can define a graph metric in G_H as the minimal path length.

² Yet other arrangements are also admissible. Originally, Kohonen prefers a hexagonal structure [19]. In general, one can consider an arbitrary graph structure. Then the neighborhood relations in the graph can be given as a general connectivity matrix.
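As a concrete reference for the notation, the following short sketch implements the winner selection (2.1) and the adaptation step (2.3) for a two-dimensional rectangular lattice; the lattice size, learning rate and neighborhood width are arbitrary illustrative values, not parameters from this chapter.

import numpy as np

# Sketch of the SOM adaptation step, Eqs. (2.1) and (2.3).
rng = np.random.default_rng(0)
n1, n2, dim_V = 10, 10, 2
grid = np.array([(a, b) for a in range(n1) for b in range(n2)], dtype=float)  # positions r
W = rng.uniform(0.0, 1.0, size=(n1 * n2, dim_V))                              # pointers w_r

def som_step(v, W, eps=0.1, sigma=2.0):
    s = np.argmin(np.linalg.norm(W - v, axis=1))             # winner, Eq. (2.1)
    h = np.exp(-np.sum((grid - grid[s]) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W + eps * h[:, None] * (v - W)                    # update, Eq. (2.3)

for _ in range(10000):
    W = som_step(rng.uniform(0.0, 1.0, size=dim_V), W)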
3. Topology Preservation in SOM

Kohonen gave a definition of what it means for a one-dimensional topographic map to be ordered, which should be discussed in the light of the definition studied in this paper. The following definition, valid only for the one-dimensional case, was given in [20]: We consider a chain of N neural units i with weight vectors w_i and receptive fields Ω_i as defined in (2.4). Let η_i(v) be an activity function of the i-th neuron with respect to a stimulus vector v ∈ V ⊆ R^{D_V}, for instance the negative distance -||w_i - v|| or the inverse distance 1/||w_i - v||. Let X = {x_j ∈ V ⊆ R^{D_V} | j = 1, ..., N} be a set of points such that x_1 Rel x_2 Rel x_3 Rel ... Rel x_N holds, and for each neural unit i_j there exists an x_j ∈ X for which x_j ∈ Ω_{i_j} holds. Rel is an arbitrary, suitably chosen relation, not necessarily transitive. The system is said to implement a one-dimensional ordered mapping if, for i_1 > i_2 > i_3 > ... > i_N,

\eta_{i_1}(x_1) = \max_{i=1,\dots,N} \eta_i(x_1), \;\dots,\; \eta_{i_N}(x_N) = \max_{i=1,\dots,N} \eta_i(x_N)    (3.1)

holds. In this paper we will give an explicit order relation Rel based on an underlying topology, in contrast to the mere requirement of the existence of such a relation. The proposed straightforward generalization of definition (3.1) to higher dimensions is by no means trivial, as pointed out in [7]. In fact, one needs to make the relation Rel more explicit in order to find out whether the definition (3.1) is applicable to higher dimensions, too. In this sense we present a justification for Kohonen's early framework and solve the problem of a general definition of topology preservation as he formulated it in [22].

3.1. Mathematical Definition of Topology Preservation in SOMs

In a common-sense view, topology preservation is understood as the preservation of continuity from the input space onto the output space. However, an exact mathematical approach is non-trivial: similarity is defined by neighborhood structures both in the input space V and in the output space A, i.e. by the respective topologies based on their neighborhood structure. Hence, topology preservation is equivalent to the continuity of M between the topological spaces with properly chosen distance metrics. In this section we give an exact definition of what topology preservation means, now for more general lattice structures. We only assume A to be a network of N neurons r which are situated at points r = (r_1, ..., r_{D_A}) ∈ R^{D_A}. The connectivity graph C^A defines the structure of A. The basic idea is to describe the property of topology preservation in terms of mathematical topology. The property of topology preservation of a map may then be based on the continuity of this map between topological spaces. For this purpose we have to define suitable topologies, i.e. we have to determine the respective systems of open sets. Then a continuous mapping means that open sets are mapped onto open sets again. Using the concepts introduced above and the concept of the Delaunay graph (see sect. 2), we are now able to define in general terms what topology preservation means. However, as shown in [30], we have to define two kinds of neighborhood in the lattice A:
Definition 1 Suppose A to be a network of N neurons which are situated at points r = (r_1, ..., r_{D_A}) ∈ R^{D_A} with reference or synaptic weight vectors w_r ∈ V ⊆ R^{D_V}. Let furthermore C^A(r) denote the connectivity graph C^A where the neural unit r was taken as root. A (discrete) topology T_A^+(r) is induced by the graph metric d_{T_A^+(r)} in C^A(r). T_A^+(r) is said to be the strong neighborhood topology in A with respect to r, and (A, T_A^+(r)) is a topological space.

Definition 2 Consider for the moment A to be a set of points in R^{D_A}. Let V_A be the Voronoi diagram of R^{D_A} with respect to A and G_A be its dual, the Delaunay graph. Let furthermore G_A(r) denote G_A where the neural unit r was taken as root. G_A(r) is equipped with the graph metric d_{T_A^-(r)}, which in turn induces the (discrete) topology T_A^-(r) in G_A(r) and, hence, also in A. T_A^-(r) is said to be the weak neighborhood topology in A with respect to r, and (A, T_A^-(r)) is a further topological space defined on the set A.

In the next step we introduce a topology on the set of the synaptic weight vectors on the basis of their receptive fields, which again allows us to describe the neighborhood relationships between two vectors.

Definition 3 Let Ψ_{A→V}: A → V_A ⊆ V ⊆ R^{D_V} be the map attributing to each neuron r its weight vector w_r ∈ V_A, with V_A = {w_r ∈ R^{D_V} | r ∈ A}. Furthermore, let Γ_V be the induced Voronoi diagram of V with respect to V_A and G_V be its Delaunay graph. Let furthermore G_V(r) denote G_V where the neural unit r was taken as root. G_V(r) is equipped with the graph metric d_{T_{V_A}(r)}, which in turn induces the local (discrete) topology T_{V_A}(r) in G_V(r) and, hence, also in V_A. T_{V_A}(r) is said to be the Ψ_{A→V}-induced neighborhood topology with respect to r in V_A, and (V_A, T_{V_A}(r)) is a topological space.

Now topology preservation of a map can be expressed by the following definition:

Definition 4 The map M = (Ψ_{A→V}, Ψ_{V→A}) is said to be topology preserving if both Ψ_{V→A}: (V_A, T_{V_A}(r)) → (A, T_A^-(r)) and Ψ_{A→V}: (A, T_A^+(r)) → (V_A, T_{V_A}(r)) are continuous maps of the respective topological spaces for all neural units r ∈ A, where Ψ_{V→A} is defined by Ψ_{V→A}(v) = arg min_{r∈A} ||v − w_r||. Irrespective of the different topologies, Ψ_{V→A} is the inverse mapping of Ψ_{A→V}.

We have immediately the following two corollaries for the most important cases of rectangular and hexagonal (triangular) lattices:

Corollary 5 In the case of a rectangular D_A-dimensional lattice A of neurons, the strong topology is induced by the Euclidean distance d^A_{Euclid} in A, and the weak topology is induced by the maximum distance d^A_{max}(r, r') = max_{j=1,...,D_A} |(r − r')_j|. The systems of open sets S_{Euclid}(r) and S_{max}(r) defining the topologies T_A^+(r) = T_A^{Euclid}(r) and T_A^-(r) = T_A^{max}(r) are determined by

S_{Euclid}(r) = \{ s_k \mid s_k = \{ r' \in A \mid d^A_{Euclid}(r, r') \leq k \},\; k \geq 1 \}    (3.2)

S_{max}(r) = \{ s_k \mid s_k = \{ r' \in A \mid d^A_{max}(r, r') \leq k \},\; k \geq 1 \}    (3.3)

respectively.
Corollary 6 In the special case of A being a hexagonal (triangular) lattice, the weak and strong topology coincide. Hence, the definition of topology preservation relies on a single topology in the net, which corresponds to the strong neighborhood topology.

The conclusion in Corollary 6 is in agreement with the definition of neighborhood given in [23]. Moreover, we remark that in the case of a rectangular lattice the weak neighborhood topology T_A^-(r) is weaker than the strong neighborhood topology T_A^+(r) also in the sense of mathematical topology [18].

3.2. Measuring the Degree of Topology Preservation

Several approaches have been developed to judge the degree of topology preservation of a given map [4],[6],[9],[10],[12],[25],[33]. Here we shortly present two widely used measures, whereas a detailed analysis and comparison of them can be found in [3].

3.2.1. Topographic Function

The topographic function Φ [30] reflects Definition 4 of topology preservation. By means of Definitions 1, 2, 3 and 4, for each unit r we introduce the functions
f_r(k) := \#\{ r' \mid d_{T_A^-(r)}(r, r') > k \,;\; d_{T_{V_A}(r)}(r, r') = 1 \}
f_r(-k) := \#\{ r' \mid d_{T_A^+(r)}(r, r') = 1 \,;\; d_{T_{V_A}(r)}(r, r') > k \}    (3.4)

with k = 1, ..., N - 1. Here #{.} denotes the cardinality of the set, and d_{T(r)}(r, r') := ||w_r - w_{r'}||_{T(r)} is a distance measure based on the topology T(r). Looking at a neural unit r, f_r(k) with k > 0 determines the continuity of Ψ_{V→A}, and f_r(k) with k < 0 determines the continuity of Ψ_{A→V} as defined above. The topographic function of the neural lattice A with respect to the input manifold V is then defined as

\Phi(k) := \begin{cases} \frac{1}{N} \sum_{r \in A} f_r(k) & k > 0 \\ \Phi(1) + \Phi(-1) & k = 0 \\ \frac{1}{N} \sum_{r \in A} f_r(k) & k < 0 \end{cases}    (3.5)

We obtain Φ ≡ 0 and, particularly, Φ(0) = 0 if and only if the map M is perfectly topology preserving. The largest k^+ > 0 for which Φ(k^+) ≠ 0 holds yields the range of the largest fold if the effective dimension of the data manifold V is larger than the dimension D_A of the lattice A. The smallest k^- < 0 for which Φ(k^-) ≠ 0 holds yields the range of the largest fold if the effective dimension of the data manifold V is smaller than the dimension D_A of the lattice A. Small values of k^+ and k^- indicate that there are only local conflicts, whereas large values indicate a global dimensional conflict. Hence, the shape of Φ(k) allows a detailed discussion of the magnitude of the distortions occurring in a map. If less detailed information is desired, we propose here to consider simply the difference

\hat{\Phi} = \Phi(+1) - \Phi(-1).    (3.6)
While the values of the negative (positive) k components of the topographic function indicate whether the output dimension is too high (too small), the sign of Φ̂ expresses which terms are predominant. However, as pointed out in [3], Φ̂ is sensitive with respect to noise.

3.2.2. The modified topographic product

Second, we present the widely used topographic product P [4]. During the computation of P, for each node r the sequences n^A(r) and n^V(r) have to be determined, where n^A_j(r) denotes the j-th neighbor of r, with distances measured in the output space A, and n^V_j(r) denotes the j-th neighbor of r, with distances evaluated in the input space between w_r and w_{n^V_j(r)}. Usually, the distances in the input space are measured by the Euclidean norm. However, following this procedure, strangely curved maps would always be judged to be neighborhood violating, even though the shape of the map might be perfectly justified given a strangely curved data distribution [30].³ To overcome this problem we use here the distances d_{T_{V_A}(r)}(w_r, w_{r'}) of minimal path length in the induced Delaunay graph G_V of the w_r instead of the respective Euclidean distances d_V(w_r, w_{r'}) between the weight vectors, as in the original approach. The sequences n^A(r) and n^V(r) and further averaging over the neighborhood orders j and the nodes r finally lead to

\hat{P} = \frac{1}{N(N-1)} \sum_{r} \sum_{j=1}^{N-1} \frac{1}{2j} \log \left( \prod_{l=1}^{j} \frac{d_{T_{V_A}(r)}(w_r, w_{n^A_l(r)})}{d_{T_{V_A}(r)}(w_r, w_{n^V_l(r)})} \cdot \frac{d^A_{Euclid}(r, n^A_l(r))}{d^A_{Euclid}(r, n^V_l(r))} \right).    (3.7)

³ This problem is not specific to the topographic product. All approaches which are based on the evaluation of the positions of the neurons in the lattice and, on the other hand, on the evaluation of the positions of their weight vectors only, cannot distinguish a correct folding due to a folded non-linear data manifold from a folding due to a topological mismatch between V and A as in the linear case, because they do not take the shape of V into account.
As for the original topographic product, P̂ can take on positive or negative values, which have to be interpreted as follows: if P̂ < 0 holds, the output space is too low-dimensional; in contrast, if P̂ > 0, the output space is too high-dimensional. In both cases neighborhood relations are violated. Only if P̂ ≈ 0 does the output space approximately match the topology of the input data. The topographic product can be evaluated on the basis of the map parameters only, without knowledge of the underlying data distribution.
\
.......
~::::~
.............
/
i
J
\
.......... ~:-:~ I-!--I--{--I- ......... :
:
Wr+ .
...... i ............ ..... ~
/..,l
~-~
--Wr_ e
..... -
w
,,
1
Figure 2. a) (left) Illustration of the basic decision to be made during the growth procedure: an output space lattice can either grow in an existing direction or it can extend into a new direction, b) (right) Illustration of the criterion for determining the correct growth direction. Consider the center neuron with receptive field center position wr in a hypothetical one-dimensional chain of neurons. From the receptive-field-center positions w~+~ 1 and Wr-ez of its output space neighbors the local input space direction llwr+.z-w~-~i wr+'l--wr-eZ il can be estimated (large arrow) which corresponds to output space direction ez. The stimuli (stars) within the Voronoi cell of neuron r can now be decomposed into a parallel and a perpendicular part relative to this local direction. The average of the relative size of the resp. decomposition amplitudes then determines whether the output space is extended along ez or whether a new dimension is added.
The GSOM starts from an initial 2-neuron configuration, learns according to the regular SOM algorithm, adds neurons to the output space with respect to a certain criterion to be described below, learns again, adds again, etc., until a prespecified maximum number N_max of neurons is distributed. During this procedure, the output space topology remains of the form n_1 × n_2 × ..., with n_j = 1 for j > D_A, where D_A is the current dimensionality of A. Hence, the initial configuration is 2 × 1 × 1 × ..., D_A = 1. From there it can grow either by adding nodes in one of the directions which are already spanned by the output space, i.e. by having n_i → n_i + 1, i ≤ D_A, or by adding a new dimension, i.e. (n_{D_A+1} = 1) → (n_{D_A+1} = 2), D_A → D_A + 1 (see Fig. 2a). The decision in which direction nodes have to be added, or whether a new direction has to be initialized, is made on the basis of the receptive fields Ω_r. When reconstructing v ∈ V from neuron r, an error remains, which is decomposed along the different directions that result from projecting the output space grid back into the input space (see Fig. 2b):

\delta = v - w_r = \sum_{i=1}^{D_A} a_i(v)\, \frac{w_{r+e_i} - w_{r-e_i}}{\| w_{r+e_i} - w_{r-e_i} \|} + v', \qquad v' = a_{D_A+1}(v)\, \frac{w_{PCA}}{\| w_{PCA} \|} + v''.    (4.1)

Thereby, e_i denotes the unit vector in direction i of A. Considering the receptive field Ω_r and determining its first principal component w_{PCA} allows a further decomposition of v'. Projection of v' onto the direction of w_{PCA} then yields a_{D_A+1}(v) in (4.1). The criterion for the growing now is to add nodes in that direction which has, on average, the largest (normalized) expected error amplitude a_i:

\bar{a}_i = \left\langle \frac{| a_i(v) |}{\sqrt{\sum_{j=1}^{D_A+1} a_j^2(v)}} \right\rangle, \qquad i = 1, \dots, D_A + 1.    (4.2)

Once a direction in which to grow has been determined, new nodes have to be initialized. After each growth step, a new learning phase has to take place in order to readjust the map. For a detailed study of the algorithm we refer to [5] and [29].
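The growth decision of Eqs. (4.1) and (4.2) can be summarized in a small routine like the one below: the reconstruction errors of the stimuli in a unit's receptive field are decomposed along the back-projected lattice directions plus the first principal component of the residuals, and the direction with the largest average normalized amplitude wins. The treatment of boundary units, the PCA step and all names are our own reconstruction, not the original implementation of [5].

import numpy as np

# Sketch of the GSOM growth criterion (reconstructed Eqs. 4.1 / 4.2).
def growth_direction(stimuli, w_r, w_plus, w_minus):
    # stimuli: (M, D_V) inputs in Omega_r; w_plus/w_minus: lists of the weight
    # vectors w_{r+e_i} and w_{r-e_i} of the output-space neighbours of r.
    dirs = np.stack([(wp - wm) / np.linalg.norm(wp - wm)
                     for wp, wm in zip(w_plus, w_minus)])     # local axes, (D_A, D_V)
    err = stimuli - w_r                                       # reconstruction errors
    A = err @ dirs.T                                          # amplitudes a_1..a_DA
    resid = err - A @ dirs                                    # unexplained part v'
    _, _, vt = np.linalg.svd(resid - resid.mean(0), full_matrices=False)
    a_new = resid @ vt[0]                                     # a_{D_A+1}(v) via PCA
    A = np.column_stack([A, a_new])
    norm = np.sqrt((A ** 2).sum(axis=1, keepdims=True)) + 1e-12
    score = np.abs(A / norm).mean(axis=0)                     # Eq. (4.2)
    return int(np.argmax(score))          # == D_A means: add a new lattice dimension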
4.2. Results of GSOM Simulations

4.2.1. Theoretical Example

In this more theoretical example [5], a curved two-dimensional surface embedded in three dimensions is parametrized by r, s according to

v_1 = r + 0.5, \quad v_2 = s + 0.3, \quad v_3 = 4 r s (1 - r), \qquad 0 < r < 1, \; 0 < s < 0.3.    (4.3)

Here, in addition to dimension selection, also a scale-selection task was addressed. Running the GSOM algorithm on this data set, with N_max = 256 and the learning-parameter schedules σ_i = 0.5, σ_f = 0.1, ε_i = 0.1, ε_f = 0.02, c_i = 0.9, and c_f = 0.01, the algorithm adjusted the output space to a dimension of d = 2, with 28 × 9 (= 252) nodes along the two directions. This length ratio reflects the length ratio of the parameter space underlying the data; it was approximately kept constant during the learning procedure. In subsequent runs we investigated whether the data submanifold could also be detected in a much higher-dimensional space, with noise along all the irrelevant dimensions. For this purpose, we increased the input space dimension to D_V = 10, with v_{1,2,3} as in (4.3), and v_4
Figure 3. Visualization of the 28 × 9 GSOM map (part a), and the 64 × 4 (b), 16 × 16 (c) and 7 × 6 × 6 (d) fixed output space maps obtained for example 1. The weight vectors w_r lie on the two-dimensional submanifold of R^3 which is covered by the data points.
4.2.2. Real World Application - Satellite Remote Sensing Data

Satellites of the LANDSAT-TM type produce pictures of the earth in 7 different spectral bands. The ground resolution in meters is 30 × 30 for bands 1-5 and band 7. Band 6 (thermal band) has a resolution of only 60 × 60 and, therefore, it is often dropped. The spectral bands represent useful domains of the whole spectrum in order to detect and discriminate vegetation, water, rock formations and cultural features [28]. Hence, the spectral information, i.e., the intensity of the bands associated with each pixel of a LANDSAT scene, is represented by a vector v ∈ V ⊆ R^{D_V} with D_V = 6. The aim of any classification algorithm is to subdivide this data space into subsets of data points which belong to a certain category corresponding to a specific feature like wood, industrial region, etc., each feature being specified by a certain prototype data vector. One way to get good results for visualization is to use a SOM dimension D_A = 3 [15]. Then we are able to interpret the positions of the neurons r in the lattice A as vectors r = c = (r, g, b) in the color space C, whereby r, g, b are the intensities of the colors red, green and blue [15]. This assigns colors to winner neurons, so that we immediately end up with the pseudo-color version of the original picture for visual interpretation. However, when we map the data clouds from a 6-dimensional input space onto a three-dimensional color space, dimensional conflicts may arise and the visual interpretation may fail. Usually, for
visual interpretation only the bands 2, 3, 4 are used, which means a loss of information. In the first example we investigated a picture of the north-east region of Leipzig [29].⁵ For comparison we also trained several usual SOMs with fixed output spaces and determined the respective P̂-values (3.7), which are depicted in Tab. 1. The topographic product P̂ prefers an output space dimension D_A between 2 and 3. However, a clear decision cannot be made. An additional Grassberger-Procaccia analysis [14] yields D_GP ≈ 1.7. It should be mentioned that in these runs, as well as in the further GSOM simulations, we additionally applied a magnification control scheme according to [2] to achieve a maximum of transinformation [31]. The GSOM algorithm was applied in several runs (10^6 learning steps) with different values N_max. The obtained results are depicted in Tab. 1 and Fig. 4. The achieved P̂-values are better than the respective values for the fixed lattice structures. Furthermore, for all numbers N_max of maximally allowed neurons, we achieved approximately the same structure, and the length ratios of the edges show a good agreement with considerations from the usual principal component analysis (PCA) of the data space:

e_V = (274.06, 76.19, 39.78, 11.92, 8.27, 6.28)^T    (4.4)

as the vector of the respective eigenvalues. However, in general, the usual but linear PCA fails, as shown in the second example. It is again a LANDSAT scene, but now from the Colorado area.⁶ The respective PCA yields

e_V = (4.93121, 0.6838, 0.29047, 0.05474, 0.02242, 0.01737)^T    (4.5)

suggesting a one-dimensional structure, whereas a Grassberger-Procaccia analysis [14] gives D_GP ≈ 3.1414. The GSOM generates here a 12 × 7 × 3 lattice structure (N_max = 256), which corresponds to a P̂-value of 0.0095, indicating again a good topology preservation. For further real-world applications and more detailed investigations we refer to [29].

⁵ Obtained from Umwelt-Forschungszentrum Halle-Leipzig, Germany.
⁶ Thanks to M. Augusteijn (University of Colorado) for providing this image.
  N      lattice structure    P̂
  256    256                  -0.189  ± 0.00612
  256    16 × 16              -0.0642 ± 0.00031
  252    7 × 6 × 6            +0.0282 ± 0.00024
  256    4 × 4 × 4 × 4        +0.0816 ± 0.00387

  N_max  lattice structure    P̂
  128    12 × 5 × 2           0.0047
  256    14 × 6 × 3           0.0050
  512    15 × 6 × 4           0.0051

Table 1. Modified topographic product P̂ for different but fixed output spaces for the LANDSAT satellite image of Leipzig (top). Lattice results of several GSOM runs together with the respective P̂-values for the same image (bottom).
Figure 4. Pseudo-color images of LANDSAT-TM six-band spectral images: above - Leipzig image using the 7 × 6 × 6 standard SOM (left) and the 14 × 6 × 3 GSOM solution; bottom - Colorado image using the standard pseudo-color visualization using only bands 2, 3, 4 (left) and the 12 × 7 × 3 GSOM solution.
5. Conclusion

In this contribution we summarized the results of our work on the problem of topology preservation in SOMs. We have given a mathematically exact definition and methods for measuring its degree. Thereby we have solved this open question as it was formulated by T. Kohonen. Additionally, we have presented a growing scheme for SOMs generating hypercubic lattices, improving the topology-preserving abilities of SOMs while retaining their simple structure. In two applications we demonstrated the success of the approach.

ACKNOWLEDGMENT

The author thanks H.-U. Bauer and M. Herrmann (both from the MPI für Strömungsforschung, Göttingen, Germany), R. Der (University of Leipzig, Germany) and T. Martinetz (University of Bochum) for a long and successful co-operation which led to the results summarized in this contribution.
REFERENCES

1. H.-U. Bauer. Development of oriented ocular dominance bands as a consequence of areal geometry. Neural Computation, 7(1):36-50, Jan 1995.
2. H.-U. Bauer, R. Der, and M. Herrmann. Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8(4):757-771, 1996.
3. H.-U. Bauer, M. Herrmann, and T. Villmann. Neural maps and topographic vector quantization. Neural Networks, to appear, 1999.
4. H.-U. Bauer and K. R. Pawelzik. Quantifying the neighborhood preservation of Self-Organizing Feature Maps. IEEE Trans. on Neural Networks, 3(4):570-579, 1992.
5. H.-U. Bauer and T. Villmann. Growing a Hypercubical Output Space in a Self-Organizing Feature Map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.
6. J. C. Bezdek and N. R. Pal. An index of topological preservation and its application to self-organizing feature maps. In Proc. IJCNN-93, Int. Joint Conf. on Neural Networks, Nagoya, volume III, pages 2435-2440, Piscataway, NJ, 1993. IEEE Service Center.
7. M. Cottrell, J. C. Fort, and G. Pagès. Two or three things that we know about the Kohonen algorithm. In M. Verleysen, editor, Proc. ESANN'94, European Symp. on Artificial Neural Networks, pages 235-244, Brussels, Belgium, 1994. D facto conference services.
8. B. Delaunay. Sur la sphère vide. Bull. Acad. Sci. USSR (VII), Classe Sci. Mat. Nat., pages 793-800, 1934.
9. P. Demartines and J. Hérault. Representation of nonlinear data structures through a fast VQP neural network. In Sixth International Conference: Neural Networks and their Industrial and Cognitive Applications. NEURO-NIMES 93 Conference Proceedings and Exhibition Catalog, pages 411-424, Nanterre, France, 1993. EC2.
10. R. Der, M. Herrmann, and T. Villmann. Time behaviour of topological ordering in self-organized feature mapping. Biological Cybernetics, 77(6):419-427, 1997.
11. B. Fritzke. Growing cell structures - a self-organizing network in k dimensions. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks, 2, volume II, pages 1051-1056, Amsterdam, Netherlands, 1992. North-Holland.
12. G. J. Goodhill and T. J. Sejnowski. A unifying objective function for topographic mappings. Neural Computation, 9:1291-1303, 1997.
13. J. Göppert and W. Rosenstiel. The continuous interpolating self-organizing map. Neural Processing Letters, 5(3):185-192, 1997.
14. P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica, 9D:189-208, 1983.
15. M. H. Gross and F. Seibert. Visualization of multidimensional image data sets using a neural network. Visual Computer, 10:145-159, 1993.
16. A. Hämäläinen. Using genetic algorithm in self-organizing map design. In Proceedings of the ICANNGA'95, Arles, France, 1995.
17. S. Jockusch and H. Ritter. Self-Organizing Maps: Local competition and evolutionary optimization. Neural Networks, 7(8):1229-1239, 1994.
18. L. W. Kantorowich and G. P. Akilov. Funktionalanalysis in normierten Räumen. Akademie-Verlag, Berlin, 1978.
19. T. Kohonen. Self-organizing formation of topologically correct feature maps. Biol. Cyb., 43(1):59-69, 1982.
20. T. Kohonen. A simple paradigm for the self-organized formation of structured feature maps. In S. Amari and M. A. Arbib, editors, Competition and Cooperation in Neural Nets, Lecture Notes in Biomathematics, Vol. 45, pages 248-266. Springer, Berlin, Heidelberg, 1982.
21. T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, Heidelberg, 1984. 3rd ed. 1989.
22. T. Kohonen. Self-Organizing Maps. Springer, Berlin, Heidelberg, 1995. (Second Extended Edition 1997).
23. T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(2), 1994.
24. E. Oja and J. Lampinen. Unsupervised learning for feature extraction. In J. M. Zurada, R. J. Marks II, and C. J. Robinson, editors, Computational Intelligence Imitating Life, pages 13-22. IEEE Press, 1994.
25. D. Polani and J. Gutenberg. Organization measures for self-organizing maps. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pages 280-285. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997.
26. H. Ritter. Parametrized self-organizing maps. In S. Gielen and B. Kappen, editors, Proc. ICANN'93 Int. Conf. on Artificial Neural Networks, pages 568-575, London, UK, 1993. Springer.
27. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA, 1992.
28. T. Villmann. Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensoring processing. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), to appear, Brussels, Belgium, 1999. D facto publications.
29. T. Villmann and H.-U. Bauer. Applications of the growing self-organizing map. Neurocomputing, 21(1-3):91-100, 1998.
30. T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology Preservation in Self-Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
31. T. Villmann and M. Herrmann. Magnification control in neural maps. In Proc. of European Symposium on Artificial Neural Networks (ESANN'98), pages 191-196, Brussels, Belgium, 1998. D facto publications.
32. G. Voronoi. Nouvelles applications des paramètres à la théorie des formes quadratiques, deuxième mémoire: Recherches sur les parallélloèdres primitifs. Journ. reine angew. Math., 134:198-287, 1908.
33. S. Zrehen. Analyzing Kohonen maps with geometry. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 609-612, London, UK, 1993. Springer.
Second-Order Learning in Self-Organizing Maps

Ralf Der*ᵃ and Michael Herrmann†ᵇ

ᵃUniversity of Leipzig, Institute of Informatics, Postfach 920, 04109 Leipzig, Germany

ᵇMax-Planck-Institute for Fluid Dynamics, Bunsenstr. 10, 37073 Göttingen, Germany

Kohonen's self-organizing map has great potential as a universal tool for nonlinear data analysis. From the practical point of view, control parameters like the learning rate and the neighborhood width need special attention in order to exploit the possibilities of the approach. Our paper introduces second-order learning methods which generalize the dynamics of Kohonen's learning algorithm in that control parameters are attributed individually to each neuron and adapted automatically. This is achieved by making use of the special properties of the map at the phase transitions it undergoes when learning parameters cross critical values. We demonstrate by way of examples the automatic control of the self-organization process itself, the extraction of principal manifolds, and the mapping of hierarchically structured data, and we also provide a version of the algorithm which proves feasible in the case of sparse data sets.

1. Introduction

The clever handling of control parameters plays an essential role in most learning algorithms. Manipulating the parameters in a convenient way not only may speed up the learning procedure itself but often is responsible for the success of the learning as such. In many learning procedures of practical interest, finding the correct learning parameters or cooling strategies is done by trial and error. In a few cases, strong theorems are known which formulate the parameter strategy in an explicit way, an example being the Robbins-Monro theorem for the case of stochastic gradient descent. Other examples may be found in [13]. Since these theorems refer to the asymptotic time behavior, they are of limited value for practical applications. There, an initial learning period often decides whether a meaningful solution is approached during the convergence phase or not. For this reason a variety of empirical parameter-learning procedures has been invented. The self-learning of the learning parameters is called second-order learning and introduces another set of meta-parameters, which can however be assumed to be less sensitive to variations or are
*[email protected]
†[email protected]
to be controlled by higher-order learning. In this way the hope is that learning does not degenerate so much into a trial-and-error procedure. Whereas the basic learning algorithms are usually guided by some error functional and perhaps further constraints, the learning of learning parameters is intended to achieve additional qualities of the solution. These concern generalizability, plasticity, robustness etc. and allow one to cope with problems such as over-fitting, convergence speed, nonstationary input distributions or outliers.

2. Second-order learning in SOMs

In the following we will focus on Kohonen's self-organizing feature map (SOM), cf. Ref. [11] for a comprehensive overview, where the control parameters are the learning rate $\epsilon$ and the neighborhood width $\sigma$. The learning rate controls the plasticity of the map since it defines the attention a neuron attributes to a single stimulus. By an individual tuning of the neural attention, the magnification factor of the map can be controlled locally, cf. Ref. [2], such as to achieve maps minimizing particular error criteria. In another approach [1] the plasticity of the map is improved by a heuristic scheme for choosing optimal learning step lengths.

The neighborhood width $\sigma$ governs both the self-organization process and the topographic properties of the map. The self-organization of topological order is realized by a convenient cooling strategy which, however, is motivated only empirically. In contrast to the usual learning rates, $\sigma$ must not be cooled down to zero. In fact, below a critical value the topographic structure of the map will break down. Interestingly, this happens by a phase transition, as shown in Ref. [7]. Even more so, the asymptotic value of $\sigma$ is of crucial importance in one of the main applications of the SOM, which is the dimension reduction of noisy data distributions. We may consider this case as the mapping of the neuron lattice into the input space to form there a curvilinear coordinate system representing the main features of the higher dimensional data distribution. The quality of this fit hinges on the value of $\sigma$ in a sensitive way. With an appropriate value of $\sigma$ the curvilinear coordinate system follows the data distribution as closely as possible without fitting the noise. Finding this value may again be done by trial and error, slowly shrinking $\sigma$ while permanently registering the preservation of topology by a convenient measure, cf. Ref. [14]. However, this approach will break down if the strength of the noise varies over the data distribution in an essential way. Then one needs a local value of $\sigma$.

In the present paper we discuss a new algorithm which handles the neighborhood width in a completely self-regulating way, both for the process of topological ordering and for the fine tuning of the local asymptotic values of $\sigma$, in order to find the best fit to the data distribution. We will do this by exploiting the specific properties of Kohonen's learning algorithm at the phase transition between different regimes of the map. This way of reasoning can be carried over also to other learning procedures, so that we understand our paper also as a contribution to the more general problem of second-order learning.

2.1. The algorithm

For the sake of simplicity we study the case of a chain, i.e. of a one-dimensional lattice of neurons to be mapped into a data distribution embedded in a higher dimensional input space. Apart from noise the data cloud is supposed to be one-dimensional, so
that the image of the chain may serve as a nonlinear principal component of the data distribution. Kohonen's update rule for the image $w_r \in \mathbb{R}^n$ of the neuron position (lattice site $r \in \{1, 2, \ldots, N\}$) is

$$\Delta w_r = \epsilon\, h(r, r^*)\,(v - w_r), \qquad (1)$$

where $v \in \mathbb{R}^n$ is the input vector and $h(r, r^*)$ denotes the neighborhood function as usual.
2.2. Individual neighborhood width

Usually the neighborhood width is defined globally, to be the same for all neurons. In the present paper we consider the case that the noise varies strongly over the data distribution, so that the neighborhood width of the neurons is to be chosen individually in order to get a good fit of the principal curve to the data distribution. We define the neighborhood function as

$$h(r, r^*) = \frac{1}{\sqrt{2\pi}\,\sigma_r} \exp\!\left(-\frac{(r - r^*)^2}{2\sigma_r^2}\right), \qquad (2)$$
where $\sigma_r$ is the local value of the neighborhood width, to be defined below. The prefactor containing $\sigma_r$ normalizes the influence of the current winner neuron $r^*$ onto its neighbors. Without this factor the algorithm turns out to be less stable.

2.3. Phase transition dynamics

In order to find convenient values for the $\sigma_r$ we exploit the dynamics of the phase transition from the topographic to the over-fitting situation, cf. [12,6]. For a discussion consider our case of mapping a chain of neurons into a data manifold of dimension higher than one. The algorithm tries to adapt the (image of the) chain to the data points as closely as possible under the smoothness constraint defined by the value of $\sigma$.³ While gradually shrinking $\sigma$, after the self-organization process the chain adapts closer and closer to the data points, ending up with a complete match to all the data points, which is the over-fitting catastrophe. The point now is that there is a sharp phase transition to the over-fitting regime which occurs at a critical value $\sigma^{\mathrm{crit}}$ of the neighborhood width, $\sigma^{\mathrm{crit}}$ depending on the scattering of the data points about the principal curve [12,6]. At the phase transition point the quality of the map changes in that characteristic oscillations (foldings) are formed. These are signaled by topology violations.⁴ Although the question of measuring the topographic properties of the map is not trivial, cf. Refs. [4,14], we have found a simple criterion [3] which proved reliable in practical applications. We consider the distance $a = \|r' - r''\|$, which is the distance between the first and second closest neuron to the current data point, where $a \geq 1$. If for any data point the first and second winner are not neighbors ($a > 1$), then there is a violation of topology in the region of the data point, which means that $\sigma$ has fallen below its critical value. In other words, $a > 1$ signals a local onset of the phase transition to the over-fitting (topology violating) regime.

³We consider the case of a global value of $\sigma$ for the moment.
⁴Actually, there is a coexistence between a topology violating and a topology preserving phase. The latter, however, is much less stable, cf. [6], and can be ignored for the present discussion.
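As a minimal illustration of this criterion, the first and second winners for a data point can be determined and their lattice distance compared as in the following sketch (our own NumPy illustration, assuming a one-dimensional chain of neurons; it is not the authors' code):

```python
import numpy as np

def topology_violation(x, w):
    """Check the criterion a = |r' - r''| for a one-dimensional chain.

    x : input vector of shape (d,)
    w : array of shape (N, d) holding the weight vectors of the chain.
    Returns (a, r1, r2); a > 1 signals a local topology violation."""
    dist = np.linalg.norm(w - x, axis=1)   # distances of all neurons to x
    r1, r2 = np.argsort(dist)[:2]          # first and second winner
    return abs(int(r1) - int(r2)), int(r1), int(r2)
```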
2.4. Parameter learning dynamics

Consequently, our approach consists in keeping $\sigma_r$ fluctuating around its (unknown) critical value $\sigma_r^{\mathrm{crit}}$. This is done in two steps. On the one hand we decrement

$$\Delta\sigma_r = -\frac{1}{\tau_\sigma}\,\sigma_r \qquad \forall r, \qquad (3)$$

which is carried out at each step (presentation of a data point) for all $r$; on the other hand we increase the $\sigma_r$ locally whenever a violation of topology is registered, i.e. whenever $a > 1$ we reset the local values of $\sigma$ as

$$\sigma_r := \max\!\left(\sigma_r,\; a K \exp\!\left(-\frac{2(r - R)^2}{a^2}\right)\right), \quad \text{where } R = \tfrac{1}{2}(r' + r''), \qquad (4)$$
where $K$ is an empirical factor. In our simulations we always choose $K = 2.4$.⁵ As a result, the map fluctuates around the principal curve due to the phase transition taking place each time the phase barrier corresponding to the local critical value $\sigma_r^{\mathrm{crit}}$ is crossed.⁶
2.5. Fluctuation smoothing

In order to average over the fluctuations, each neuron keeps a second pointer $W_r$ obtained by the moving average

$$\Delta W_r = \frac{1}{K N \tau_\sigma}\,(w_r - W_r) \qquad (5)$$
over the fluctuations, where $K$ is of the order of 10. The $W_r$ provide in most cases a very good first-order data model. Further improvements depend on the task. In the case of modeling a functional relationship one may use the $W_r$ to investigate the properties of the noise $\eta$ in order to improve the model. For the principal curve case, an essential improvement consists in using the $W_r$ as starting positions for a final step in the sense of the iterative Hastie-Stuetzle algorithm [8]. This can be implemented more easily by monitoring directly the averages over the data in each domain. Hence, instead of $W_r$ each neuron gets a second pointer $V_r$, updated if the neuron is the winner as

$$\Delta V_r = \frac{1}{K \tau_\sigma}\,(v - V_r). \qquad (6)$$
The set $\{V_r \mid r = 1, \ldots, N\}$ is the final result of the algorithm, i.e. the $V_r$ represent the principal curve in input space. Several toy applications of the present algorithm may be found in the figures.
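The interplay of Eqs. (1)-(6) can be summarized in a single learning step. The following sketch is only an illustration of the flow of the algorithm under our reading of the equations above; the variable names, the NumPy implementation and all parameter values except $K = 2.4$ in Eq. (4) and $K$ of order 10 in Eqs. (5), (6) are our own assumptions rather than the authors' reference implementation.

```python
import numpy as np

def second_order_som_step(x, w, W_s, V, sigma, eps=0.1, tau_sigma=100.0,
                          K_reset=2.4, K_smooth=10.0):
    """One presentation of a data point x to a chain SOM with individually
    adapted neighborhood widths.
    x : (d,) input; w : (N, d) weights; W_s : (N, d) smoothed pointers, Eq. (5);
    V : (N, d) winner-averaged pointers, Eq. (6); sigma : (N,) local widths."""
    N = w.shape[0]
    r = np.arange(N)

    # first and second winner (Euclidean distance)
    dist = np.linalg.norm(w - x, axis=1)
    r1, r2 = np.argsort(dist)[:2]

    # normalized Gaussian neighborhood with individual widths, Eq. (2)
    s = np.maximum(sigma, 1e-3)                       # guard against sigma = 0
    h = np.exp(-(r - r1) ** 2 / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s)

    # Kohonen update, Eq. (1)
    w += eps * h[:, None] * (x - w)

    # exponential decay of all widths, Eq. (3)
    sigma -= sigma / tau_sigma

    # reset on a topology violation, Eq. (4)
    a = abs(int(r1) - int(r2))
    if a > 1:
        R = 0.5 * (r1 + r2)
        sigma[:] = np.maximum(
            sigma, a * K_reset * np.exp(-2.0 * (r - R) ** 2 / a ** 2))

    # smoothed pointers, Eq. (5), and winner average, Eq. (6)
    W_s += (w - W_s) / (K_smooth * N * tau_sigma)
    V[r1] += (x - V[r1]) / (K_smooth * tau_sigma)
    return w, W_s, V, sigma
```

Iterating this step over randomly drawn data points reproduces the qualitative behavior described below; the set of pointers $V_r$ is the final estimate of the principal curve.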
3. Applications

We are now going to demonstrate the properties of the above algorithm by the results of numerical simulations. We have applied the above algorithms to map both one- and two-dimensional lattices into higher-dimensional input spaces with inhomogeneous data distributions of effective dimension D = 1 or D = 2, respectively. The local scattering of the data points around the central manifold varied by up to an order of magnitude.

⁵The value of $K$ can be made plausible via the well-known relation $\lambda = 2.02\,\tilde\sigma$ between the wavelength of the critical folding and the width $\tilde\sigma$ of the neighborhood measured in input space.
⁶This is the main difference to related methods of self-learning the neighborhood width presented in [9,10].
3.1. The self-control of the self-organization process

In the usual scenario of the self-organization process one starts with a random initialization of the map and a high value of the (global) neighborhood width $\sigma$, which then is subject to a cooling procedure. Usually this is done by an exponential decay like the one of Eq. (3) on a time scale $\tau$ which is fixed empirically.
Figure 1. Mapping a one-dimensional chain of neurons into a nonlinear pseudo-one-dimensional data distribution with the scattering width of the data points varying by one order of magnitude. We used our algorithm, which adapts the individual neighborhood width of the neurons so that the chain (crosses) can follow the ideal principal curve (solid line). We started with a random initialization of synaptic weights and $\sigma_r = 0\ \forall r$. The squares denote the centers of the Voronoi cells. Parameters of the run were $N = 100$, $\epsilon = 0.1$, $\tau_\sigma = 100$.

In our scenario one may start with arbitrary values for the individual neighborhood widths $\sigma_r$. In fact, we start with $\sigma_r$ close to zero for all $r$. The random initialization of the synaptic vectors engenders long-ranging topology violations, so that the values of the $\sigma_r$ are subsequently reset to large values by the reset mechanism of Eq. (4). The decay time $\tau_\sigma$, in turn, plays a role largely different from that of the conventional cooling procedure. Quite generally, the cooling is much more rapid, i.e. $\tau_\sigma \ll \tau$, so that $\tau_\sigma$ may be much smaller than the average healing time of a topology violation of range (in neuron space) $a$. This is consistent because of the reset mechanism of Eq. (4), which keeps the $\sigma_r$ at a level corresponding to the range of the pertinent topology violations. In other words, the decrease of the $\sigma$ values is governed by the decay of the average range $a$ of the topology violations. This is clearly demonstrated in Fig. 2, where the time course of the values of $\sigma_r$ for the map of Fig. 1 is given. It is one of the peculiarities of our algorithm that the dynamics of the map is largely independent of the value of the control parameters, which is mainly the cooling time $\tau_\sigma$.
Figure 2. Time course of the $\sigma_r$ for the self-organization process of the map shown in the previous figure. Depicted are the neighborhood widths over time for the neurons at $r = 1, 5, 50$ of a chain of 100 neurons. The upper curve is that of the neuron in the middle of the chain, which converges to a large value of $\sigma$ because of the large scattering width of the data points in its domain.
In fact, in the example considered in Fig. 1 we could vary $\tau_\sigma$ by several orders of magnitude without jeopardizing the stability of the learning procedure. This is what was to be expected of a meta-parameter like $\tau_\sigma$ in a second-order learning scenario. A further discussion of the specific role of meta-parameters may be found in Sec. 3.3 below.
3.2. Revealing the structure of data distributions

One of the most prominent applications of the SOM is dimension reduction of noisy data. Mathematically this corresponds to the problem of extracting principal curves - principal manifolds (PMs) in the general case - from higher dimensional data distributions. A PM is defined self-consistently by the requirement that each point on the PM is the average of the data points projecting to it, cf. [8]. Thus, it minimizes the mean square deviations of the data from the PM subject to some smoothness constraint. The smoothness condition restricts the curvature of the PM. It is the precise formulation of these criteria which makes the problem highly nontrivial. Hastie and Stuetzle [8] described an algorithm for the construction of principal curves (PCs) which works iteratively, starting from the principal component of the data set. In each iteration a new estimate of the PC is obtained from the calculation of the centers of gravity of the data points with respect to the current estimate of the PC. This procedure is combined with a smoothing operation controlled via cross-validation in order to avoid the over-fitting catastrophe. The authors gave some evidence in favor of the convergence to a stable solution, mainly by referring to the linear case. However, so far there are no general criteria for the existence and uniqueness of the PM for arbitrary data distributions. In the Hastie-Stuetzle algorithm and other algorithms known so far, the smoothness and hence the stability are guaranteed by local averaging, the span of the average being guided globally by cross-validation. Our SOM-based algorithm avoids this restriction and
moreover leads to a stable though possibly suboptimal principal curve, see Fig. 1 for a demonstration. Moreover, the new algorithm is not restricted to the case of principal curves but has also been successfully applied to the problem of two-dimensional PMs, see Fig. 3.
Figure 3. Embedding a two-dimensional neural lattice in a three-dimensional data set with automatic learning of the individual values of $\sigma_r$.
3.3. Hierarchically structured data distributions

In Sec. 3.1 we have emphasized the robustness of the parameter dynamics with respect to the meta cooling parameter $\tau_\sigma$. In the present section we want to demonstrate further potentialities of SOM second-order learning. In the language of dynamical systems, with the parameter dynamics integrated, the learning process of the map is a (stochastic) dynamical process with the asymptotic configuration of the map as an attractor. If the map is topology preserving one may say that the map represents the principal structure (in the sense of a principal component) of the data distribution. In the case of hierarchically structured data distributions like the one of Fig. 4, more than one principal component is imaginable; they reveal, so to say, the principal structure of the distribution as seen on different length scales. The interesting point now is that our generalized map dynamics may converge to either of these principal components, depending on the value of the meta cooling parameter. We may say that the learning dynamics is bistable, or multi-stable in the general case, in the sense that there are several attractors to which the dynamics may converge. The meta-parameter in this case plays the role of a switch by which we can choose the attractor and hence the level of the structural hierarchy which is to be displayed by the map, an example being given by Fig. 4.
4. Sparse data sets

In the above algorithm we directly check the onset of the phase transition by monitoring the topology violations. This procedure works well if there are enough data points close to the boundaries of the data distribution, since it is mainly these data points which are not mapped to neighboring first and second winners.
Figure 4. Two alternatives of a principal component found by the algorithm by making use of its multistability property. The data distribution is hierarchically structured in the sense that on a coarse length scale it is more or less a rectangle, while on the fine scale we see a noisy sine function. The algorithm finds both solutions of the principal curve problem. For a sufficiently large value of $\tau_\sigma$ in Eq. (3) the straight-line solution is obtained as a stable final state of the learning procedure. If $\tau_\sigma$ is chosen at least one order of magnitude smaller, the map converges with the same initialization to represent the fine structure of the data distribution. Both representations are topology preserving solutions.

Therefore the above algorithm may fail if the number of data points is small. For this case, a very sensitive criterion for the emergence of the critical fluctuations was found to be a frequency-adapted wavelet transform [5] of the map. For a one-dimensional SOM we calculate for each neuron $r$ the Gabor transform
$$g_r = \frac{1}{\sqrt{2\pi}\,u_r} \sum_{k=1}^{N} w_k \exp\!\left(-\frac{(k - r)^2}{2u_r^2}\right) \exp\!\left(-\mathrm{i}\,k\,\omega_r\right), \qquad (7)$$
where both the frequency $\omega_r$ of the kernel and the width $u_r$ are functions of the current value of $\sigma_r$, so that the wavelength of the kernel always agrees with that of potential foldings. At the critical point $\sigma = \sigma^{\mathrm{crit}}$ the wavelength of the emerging folds is $\lambda = 4.04\,\sigma l$, where $l$ is the average distance between the neurons in that region, cf. [12]. Choosing $\omega_r = u_r = 4\sigma_r$ causes $g_r$ to jump by an order of magnitude when $\sigma_r$ crosses $\sigma_r^{\mathrm{crit}}$. Hence, $g_r$ is the desired sensitive criterion for detecting the onset of the phase transition. In the algorithm we use (3) as before. If for the winner $g_r$ exceeds a small threshold, we use $a = \alpha\,\sigma_r$ in (4), observing $1 \leq a \leq a_{\max}$, where $\alpha = 1.2$ is an empirical factor. In the simulations, a control of $\alpha$ is obtained from monitoring the fluctuations of $\sigma$, which optimally should stay in the range of a few percent.
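A direct transcription of this criterion might look as follows; this is a sketch under our reading of Eq. (7), in particular the roles of $u_r$ and $\omega_r$ and the choice $\omega_r = u_r = 4\sigma_r$, and is not the authors' implementation:

```python
import numpy as np

def gabor_criterion(w, sigma, r):
    """|g_r| of Eq. (7) for neuron r of a one-dimensional SOM.
    w : (N, d) weight vectors, sigma : (N,) local neighborhood widths."""
    N = w.shape[0]
    k = np.arange(N)
    u = omega = 4.0 * max(float(sigma[r]), 1e-3)     # width and frequency tied to sigma_r
    kernel = np.exp(-(k - r) ** 2 / (2.0 * u ** 2)) * np.exp(-1j * k * omega)
    g = (kernel[:, None] * w).sum(axis=0) / (np.sqrt(2.0 * np.pi) * u)
    return float(np.linalg.norm(g))                  # jumps when sigma_r crosses its critical value
```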
5. Concluding remarks

In the present paper we discussed new algorithms devoted to the general problem of learning the parameters of learning.
Figure 5. Using the wavelet transform to control the neighborhood widths in a sparse data set. A data set of 80 points and a chain of 100 neurons are chosen. Display of the emerged map (left) and the $\sigma_r$-values (right).
Figure 6. Time course of the wavelet transform of one of the neurons during 7000 steps.

For the case of Kohonen's feature map we have demonstrated by way of some toy examples that the algorithms are stable and do not depend too sensitively on meta-parameters like the cooling rate $\tau_\sigma$ of the local values of $\sigma$. We have studied the algorithm also in a few applications with real-world data and found the same behavior. We may therefore conclude that our parameter adaptation procedure makes Kohonen's algorithm an even more general tool for nonlinear data analysis.

REFERENCES
1. L. L. H. Andrew and M. Palaniswami. A unified approach to selecting optimal step lengths for adaptive vector quantizers. IEEE Transactions on Communications, 44(4):434-439, 1996.
2. H.-U. Bauer, R. Der, and M. Herrmann. Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8(4):757-771, 1996.
3. H.-U. Bauer, M. Herrmann, and T. Villmann. Neural maps and topographic vector quantization. To appear in Neural Networks, 1999.
4. H.-U. Bauer and K. R. Pawelzik. Quantifying the neighborhood preservation of Self-Organizing Feature Maps. IEEE Trans. on Neural Networks, 3(4):570-579, 1992.
5. C. K. Chui. An Introduction to Wavelets, volume 1 of Wavelet Analysis and its Applications. Academic Press, Inc., 1992.
6. R. Der and M. Herrmann. Critical phenomena in self-organizing feature maps: A Ginzburg-Landau approach. Phys. Rev. E, 49(5):5840-5848, 1994.
7. R. Der and M. Herrmann. Instabilities in self-organizing feature maps with short neighborhood range. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN'94), pages 271-276, 1994.
8. T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502-516, 1989.
9. M. Herrmann. Self-organizing feature maps with self-organizing neighborhood widths. In Proceedings IEEE Int. Conf. on Neural Networks (ICNN'95), IEEE Service Center, Piscataway, NJ, pages 2998-3003, 1995.
10. K. Kiviluoto. Topology preservation in self-organizing maps. In ICNN 96, IEEE International Conference on Neural Networks, volume 1, pages 294-299. IEEE, New York, 1996.
11. T. Kohonen. Self-Organizing Maps. Springer, 1995.
12. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA, 1992.
13. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
14. T. Villmann, R. Der, M. Herrmann, and T. M. Martinetz. Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
Energy functions for self-organizing maps

Tom Heskesᵃ

ᵃRWCP, Theoretical Foundation SNN, University of Nijmegen, Geert Grooteplein 21, 6252 EZ Nijmegen, The Netherlands

By slightly changing the definition of the winning unit, Kohonen's original learning rule can be viewed as performing stochastic gradient descent on an energy function. We show this in two ways: by explicitly computing derivatives and as a limiting case of a "soft" version of self-organizing maps with probabilistic winner assignments. Kinks in a one-dimensional map and twists in a two-dimensional map correspond to local minima in the energy landscape of the network weights.

1. INTRODUCTION

Much of the success of Kohonen's self-organizing map algorithm [1] can be ascribed to its clarity and practicality: easy to write down, to simulate, to understand (at least at a basic level), and with many important practical applications (see [2] for a list of thousands of papers on self-organizing maps). On the other hand, there are many theoretical issues (see e.g. [3] for a recent summary) and implications that have puzzled and probably also annoyed researchers.

Quantifying topology preservation. What is a good solution? When would you say that a map is well-organized? Many different measures have been proposed (see e.g. [4,5]), but in most cases the algorithm to compute the quality of the map is much more complicated than the learning rule itself.

Convergence proofs. Can you prove that the algorithm converges to such a "good" solution? Much work has been done to prove convergence in special cases (most notably one-dimensional maps, see e.g. [3] and references therein). No general proofs exist.

Energy functions. Can you write down an energy function, such that the learning rule corresponds to some kind of gradient descent? What is the learning rule in fact minimizing? For a finite set of training patterns, it is possible to come up with a (highly discontinuous) energy function. For a continuous distribution of inputs, it can be shown that there is no such energy function [6].

This paper is about the last issue. After people started to realize that there is no energy function for the Kohonen learning rule (in the continuous case), many attempts have been made to change the algorithm such that an energy can be defined, without drastically changing its properties. Here we will review a simple suggestion, which has been proposed
and generalized in several different contexts. The advantage over some other attempts is its simplicity: we only need to redefine the determination of the winning ("best matching") unit. The energy function and corresponding learning algorithm are introduced in Section 2. We give two proofs that there is indeed a proper energy function. The first one, in Section 3, is based on explicit computation of derivatives. The second one, in Section 4, follows from a limiting case of a more general (free) energy function derived in a probabilistic setting. The energy formalism allows for a direct interpretation of disordered configurations in terms of local minima, two examples of which are treated in Section 5.

2. AN ENERGY FUNCTION
Our notation is as follows. We have a network of $n$ units ("neurons"). Each unit $i$ has an $m$-dimensional weight vector $\vec{w}_i$. We use $W$ to denote the set of all $n$ weight vectors. Given an input vector $\vec{x}$, Kohonen's learning rule consists of the following steps [1,7,8].
1. Find the "winner": the unit $\kappa(W, \vec{x})$ with the smallest distance to the input vector, i.e.,

$$\kappa(W, \vec{x}) = \mathop{\mathrm{argmin}}_i \|\vec{w}_i - \vec{x}\|. \qquad (1)$$
2. Update the weights according to

$$\Delta \vec{w}_i = \eta\, h_{i\kappa(W,\vec{x})}\,(\vec{x} - \vec{w}_i), \qquad (2)$$

with $\eta$ a (usually small) learning parameter and $h$ the lateral-interaction matrix. The elements $h_{ij}$ are all non-negative and independent of $W$ and $\vec{x}$. Usually $h_{ij}$ is a decreasing function of the (physical) distance between units $i$ and $j$. We will assume $h$ given and fixed throughout this paper.

The learning rule (2) is an on-line learning rule: the weights are updated after the presentation of a single pattern $\vec{x}$. Energy functions are usually defined by averaging over the distribution of patterns, denoted as $\langle \ldots \rangle$. If one exists, the energy function corresponding to the learning rule (2) should obey
$$-\frac{\partial E(W)}{\partial \vec{w}_j} = \left\langle \sum_i h_{ij}\, p(i|W, \vec{x})\,(\vec{x} - \vec{w}_j) \right\rangle, \qquad (3)$$
where we rewrote (2) by defining $p(i|W, \vec{x}) = 1$ if input $\vec{x}$ is assigned to unit $i$, i.e., if $i = \kappa(W, \vec{x})$, and $p(i|W, \vec{x}) = 0$ otherwise. The idea is that the underlying $E(W, \vec{x})$ is a sample function of $E(W)$. Each on-line learning step then corresponds to a local gradient descent on such a sample. Complications arise due to the dependency of the winner $\kappa(W, \vec{x})$, and thus of the assignment $p(i|W, \vec{x})$, on the parameters $W$. The naive guess
$$E(W) = \left\langle \sum_i p(i|W, \vec{x})\, e_i(W, \vec{x}) \right\rangle \quad \text{(incorrect for continuous input distributions)}, \qquad (4)$$
where we have defined the local errors

$$e_i(W, \vec{x}) \equiv \frac{1}{2} \sum_j h_{ij}\, \|\vec{x} - \vec{w}_j\|^2, \qquad (5)$$
does not take this dependency into account and is therefore incorrect for continuous input distributions. As we will see in the next section, things go wrong at the boundaries of the Voronoi regions, i.e., for those inputs $\vec{x}$ for which two units have exactly the same smallest distance. Here the derivative of $p(i|W, \vec{x})$ with respect to the weights is infinite and cannot be neglected. For all inputs $\vec{x}$ not at such a boundary, this derivative is zero and thus the dependency does no real harm. The argument goes that with a finite set of patterns, the probability that there is an input $\vec{x}$ exactly on the boundary between two weights is zero. So, in that case, one could argue that (4) is a proper energy function (see e.g. [9]), although, due to its discontinuity, its usefulness seems rather limited.

The easiest way to check whether an energy function can be found anyway is to differentiate the right-hand side of (3) with respect to the weight vector $\vec{w}_k$. If and only if the resulting matrix is symmetric can an energy function be found. For the Kohonen learning rule (2) this is simply not the case, at least not for continuous input distributions, which proves that there just is no energy function reproducing the exact Kohonen learning rule (2). Attempts have been made to view the learning rule as the result of a whole system of energy functions [10], but the complexity of such a system and the applied approximations seem to make its usefulness questionable. The better option seems to be to go the other way around: define an energy function, derive the corresponding learning procedure and check whether it has similar properties as the original (2). A first option would be to derive the learning algorithm that corresponds to the energy function (4). This, however, is quite complicated (see [7]), and, since it does not reproduce the original rule anyway, it might be better to look for simpler options. The choice we made in [11,12] is
$$E(W) = \left\langle \min_i e_i(W, \vec{x}) \right\rangle, \qquad (6)$$
or, equivalently, the energy function of (4) with a different definition for $p(i|W, \vec{x})$, i.e., a different winner mechanism, replacing (1) by

$$\kappa(W, \vec{x}) = \mathop{\mathrm{argmin}}_i e_i(W, \vec{x}) = \mathop{\mathrm{argmin}}_i \sum_j h_{ij}\, \|\vec{x} - \vec{w}_j\|^2. \qquad (7)$$
In computing the derivative of (6) with respect to the weights, the derivative of the minimum operation does not contribute, and we obtain precisely the Kohonen learning algorithm (2), except for the change in winner determination. We will prove this in the next section, through explicit computation of derivatives, and in Section 4, where (6) is derived as a limiting case of a probabilistic interpretation of self-organizing maps.
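To make the difference between the two winner mechanisms concrete, the following sketch (our own minimal NumPy illustration, not code from the paper) computes the Euclidean winner of (1) and the local-error winner of (7) for the same input:

```python
import numpy as np

def winners(x, W, h):
    """Return (euclidean_winner, local_error_winner) for input x.
    W : (n, m) weight vectors, h : (n, n) lateral-interaction matrix."""
    sq_dist = np.sum((W - x) ** 2, axis=1)      # ||x - w_j||^2 for all units j
    kappa_euclid = int(np.argmin(sq_dist))      # winner of Eq. (1)
    local_err = 0.5 * h @ sq_dist               # local errors e_i(W, x) of Eq. (5)
    kappa_local = int(np.argmin(local_err))     # winner of Eq. (7)
    return kappa_euclid, kappa_local
```

With the winner of (7), the update (2) then performs stochastic gradient descent on the energy (6).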
3. PROOF IN TERMS OF VORONOI TESSELATIONS

Equation (6) can be derived as a limiting case of a more general (free) energy function, introduced in the next section. Here we will give a more direct and arguably more intuitive proof, which does not require taking limits.
We start by rewriting (6) in the form

$$E(W) = \sum_{i=1}^{n} \int_{V_i(W)} d\vec{x}\; p(\vec{x})\, e_i(W, \vec{x}), \qquad (8)$$
where $\langle \ldots \rangle$ has been replaced by an explicit integration over the density $p(\vec{x})$, and where $V_i(W)$ may be called the "receptive field" or "Voronoi tesselation" belonging to unit $i$, namely, that part of the input space for which unit $i$ has the smallest local error:
$$V_i(W) = \{\vec{x} \mid e_i(W, \vec{x}) \leq e_j(W, \vec{x})\ \forall j\}. \qquad (9)$$
Or, in terms of the winner $\kappa(W, \vec{x})$ in (7): $V_i(W)$ contains those inputs $\vec{x}$ for which $\kappa(W, \vec{x}) = i$. To prove that (3) holds, we have to compute the gradient of the energy with respect to one of its parameters $\vec{w}_j$:
$$\frac{\partial E(W)}{\partial \vec{w}_j} = \sum_i \int_{V_i(W)} d\vec{x}\; p(\vec{x})\, \frac{\partial e_i(W, \vec{x})}{\partial \vec{w}_j} \;+\; \sum_i \frac{\partial}{\partial \vec{w}_j}\!\left[\int_{V_i(W)}\right] d\vec{x}\; p(\vec{x})\, e_i(W, \vec{x}). \qquad (10)$$
The first term on the right-hand side yields the desired learning rule. The second term causes the problems. To evaluate it, we use

$$\frac{\partial}{\partial a} \int_{f(a,\vec{x}) \leq 0} d\vec{x}\; g(\vec{x}) = -\oint_{f(a,\vec{x}) = 0} d\vec{x}\; g(\vec{x})\, \frac{\partial f(a,\vec{x})}{\partial a} \left\| \frac{\partial f(a,\vec{x})}{\partial \vec{x}} \right\|^{-1} \qquad (11)$$

to turn the derivative of an integral over a region into an integral over its boundary. The boundary of a Voronoi tesselation consists of those inputs $\vec{x}$ where the local errors of two units are both equal and the lowest, i.e., at all possible intersections of $V_i(W)$ and $V_j(W)$ (which may be empty). At such an intersection the difference $e_{ij}(W, \vec{x}) \equiv e_i(W, \vec{x}) - e_j(W, \vec{x})$ vanishes. Using (11), with the differences $e_{ij}(W, \vec{x})$ playing the role of $f(a, \vec{x})$, we obtain in shorthand notation (leaving out all arguments)
$$\sum_i \frac{\partial}{\partial \vec{w}_k}\!\left[\int_{V_i}\right] p\, e_i = -\sum_i \sum_{j \neq i} \oint_{V_i \cap V_j} p\; e_i\, \frac{\partial e_{ij}}{\partial \vec{w}_k} \left\| \frac{\partial e_{ij}}{\partial \vec{x}} \right\|^{-1} = -\frac{1}{2} \sum_i \sum_{j \neq i} \oint_{V_i \cap V_j} p\; (e_i - e_j)\, \frac{\partial e_{ij}}{\partial \vec{w}_k} \left\| \frac{\partial e_{ij}}{\partial \vec{x}} \right\|^{-1}, \qquad (12)$$
where in the last step we used $e_{ij} = -e_{ji}$. We conclude that, to ensure that the sum of the (weighted) integrals of the local errors $e_i(W, \vec{x})$ over the boundaries vanishes, it is necessary and sufficient that the receptive fields are determined by the same local errors $e_i(W, \vec{x})$. With any other choice, the derivative of the integration region with respect to the parameters gives a finite contribution to the gradient, which makes it much more complicated (see [7]).

Table 1 summarizes the properties of the energy and gradient for both definitions of the winning unit, in the case of a finite set of patterns and of a continuous distribution of inputs. For the case of a finite set of patterns, we define the region $V$ in weight space where for at least one pattern $\vec{x}^\mu$ there are (again at least) two winning units. Its complement is denoted $\bar{V}$. The volume of $V$ is zero in weight space. The energy function is given in (4), where the winner is determined either through (1) or (7). The gradient is the derivative
of this energy function. Roughly speaking, both using the local errors in (5) to determine the winner and going from a finite set of patterns to a continuous distribution of inputs make the energy function an order "smoother": discontinuity of a derivative occurs at one order higher.

Table 1
Summary of properties of energy functions and gradients for different definitions of winning units. $V$ is the region of weight space where for at least one of the patterns there is more than one winner; $\bar{V}$ denotes its complement.

Winner based on Euclidean distance as in (1):
- Finite set of patterns: energy discontinuous (but finite on $V$); gradient discontinuous (infinite on $V$; original Kohonen on $\bar{V}$).
- Continuous distribution of inputs: energy continuous; gradient discontinuous (does not correspond to original Kohonen).

Winner based on local errors as in (7):
- Finite set of patterns: energy continuous; gradient discontinuous (but finite on $V$).
- Continuous distribution of inputs: energy continuous; gradient continuous.

4. PROBABILISTIC FRAMEWORK
The Kohonen learning rule, as well as the closely related version discussed in the previous section, uses a crisp winner assignment. In this section we will derive a related version based on soft assignments (see [11]). The same expressions have been derived in [13,14] from an optimization point of view. Similar ideas have been presented in [15,16]. Most of this is not restricted to the specific choice (5) for the local errors. In the limit of no interactions, $h_{ij} \propto \delta_{ij}$, we obtain the "soft" vector quantization or clustering approach described in e.g. [17-19].

We start with the same energy function as in (4), but now for a single input $\vec{x}$ and with the (probability) assignments $p(i)$, apart from the obvious constraint $\sum_i p(i) = 1$, still left free,
$$E(W, p, \vec{x}) = \sum_{i=1}^{n} p(i)\, e_i(W, \vec{x}). \qquad (13)$$
The choice for $p(i)$ minimizing the energy for fixed $W$ and $\vec{x}$ is the hard assignment given in the previous section: 1 for the winner $\kappa(W, \vec{x})$ and 0 for all other units. A nice and easy way to soften these assignments is to add an entropy term to the plain energy (13) and turn it into a "free energy" functional:

$$F(W, p, \vec{x}) = E(W, p, \vec{x}) - T\, S(p), \quad \text{with} \quad S(p) = -\sum_{i=1}^{n} p(i) \log p(i). \qquad (14)$$
The optimum of the free energy yields a compromise between trying to obtain the lowest error and trying to incorporate as many units as possible. The "temperature" $T$ acts as a kind of regularization parameter. Hard assignments are recovered for $T = 0$. The probability assignments minimizing the free energy (14) for given $W$ and $\vec{x}$ can easily be found to obey

$$p(i|W, \vec{x}) = \frac{e^{-\beta e_i(W, \vec{x})}}{\sum_j e^{-\beta e_j(W, \vec{x})}}, \qquad (15)$$
with $\beta = 1/T$ the "inverse temperature". The minimal free energy, again given $W$ and $\vec{x}$, then reads

$$F(W, \vec{x}) = -\frac{1}{\beta} \log \sum_i e^{-\beta e_i(W, \vec{x})}. \qquad (16)$$
In this scenario, an on-line learning step would follow the gradient of the free energy:

$$\Delta \vec{w}_j = -\eta\, \frac{\partial F(W, \vec{x})}{\partial \vec{w}_j} = \eta \sum_i h_{ij}\, p(i|W, \vec{x})\,(\vec{x} - \vec{w}_j), \qquad (17)$$
which, for the choice (5), is exactly of the form (2), but now with soft assignments instead of hard assignments. Averaged over all inputs we have

$$F(W) = -\frac{1}{\beta} \left\langle \log \sum_i e^{-\beta e_i(W, \vec{x})} \right\rangle. \qquad (18)$$
It is easy to see that the energy function (6) and the corresponding learning rule can be obtained from (18) and (17), respectively, in the zero-temperature limit $\beta \to \infty$. This is another proof that (6) is indeed a proper energy function for the learning rule (2) with the winner assignment (7).

As shown in [14], the nonzero-temperature version by itself has its own merits: the temperature can be used to implement a kind of deterministic annealing which can help a lot to avoid local minima without the need to adjust the lateral-interaction matrix $h$. This is especially useful in a batch-mode version. Arguably the best way to optimize in a batch-mode version is through an EM-type algorithm [20]. A derivation goes as follows. The free energy for a new state $W$ and a single pattern $\vec{x}$ can be written

$$F(W, \vec{x}) = e_j(W, \vec{x}) + \frac{1}{\beta} \log \frac{e^{-\beta e_j(W, \vec{x})}}{\sum_i e^{-\beta e_i(W, \vec{x})}} = e_j(W, \vec{x}) + \frac{1}{\beta} \log p(j|W, \vec{x}), \qquad (19)$$
for any choice of $j$. The usefulness of this identity becomes clearer when we multiply with $p(j|W_{\mathrm{old}}, \vec{x})$ and sum over $j$ to obtain

$$F(W, \vec{x}) = \sum_j p(j|W_{\mathrm{old}}, \vec{x})\, e_j(W, \vec{x}) + \frac{1}{\beta} \sum_j p(j|W_{\mathrm{old}}, \vec{x}) \log p(j|W, \vec{x}). \qquad (20)$$
It is easy to show, and this is the key result for any EM algorithm, that the last term on the right-hand side is maximized at $W = W_{\mathrm{old}}$: any other choice for $W$ is at least as good. Therefore we can focus on the first term: any decrease of the first term also decreases the free energy. The correspondence between (20) and (14) is striking. The important point in (20) is that this expression is true for any choice of the distribution $p(j|W_{\mathrm{old}}, \vec{x})$, but only for the specific choice of $p(j|W, \vec{x})$ in (15), which followed from optimizing (14).

In a full batch-mode approach, the E-step consists of computing the probabilities $p_j^\mu = p(j|W_{\mathrm{old}}, \vec{x}^\mu)$, given the current state $W_{\mathrm{old}}$, for all units $j$ and inputs $\vec{x}^\mu$. The M-step then follows from direct minimization of the first term on the right-hand side of (20), yielding, for the local errors of (5),
$$\vec{w}_j = \frac{\sum_\mu \sum_i h_{ij}\, p_i^\mu\, \vec{x}^\mu}{\sum_\mu \sum_i h_{ij}\, p_i^\mu}. \qquad (21)$$
This is the equivalent of the so-called "batch map" algorithm for the original Kohonen learning rule (see e.g. [21,22]). It is also possible to use an incremental EM algorithm [23], where parameters are updated incrementally based on individual inputs $\vec{x}^\nu$ or on small batches. The basic idea is that the assignments $p_j^\nu$ for the current input $\vec{x}^\nu$ are computed for the current setting of the parameters $W$, without changing the assignments for the other patterns $\mu \neq \nu$. The weight vectors still follow from (21), where both in the numerator and the denominator only a single term (or a few terms, if we consider a small batch) has been changed. A general convergence proof for the incremental EM algorithm is given in [23]. Especially in the case of large batches, the incremental EM algorithm can be much more efficient. A discussion of incremental EM algorithms in the context of generative topographic mappings (GTMs) can be found in [24].
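The E- and M-steps described above can be written very compactly. The sketch below (our own illustration of Eqs. (15) and (21) with assumed variable names; it is not code from the paper) performs one full batch iteration:

```python
import numpy as np

def batch_em_step(X, W, h, beta):
    """One batch EM step for the soft self-organizing map.
    X : (P, m) inputs, W : (n, m) weights, h : (n, n) interaction matrix."""
    sq_dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # (P, n): ||x^mu - w_j||^2
    E = 0.5 * sq_dist @ h.T                                        # (P, n): e_i(W, x^mu), Eq. (5)
    E -= E.min(axis=1, keepdims=True)                              # numerical stability only
    p = np.exp(-beta * E)
    p /= p.sum(axis=1, keepdims=True)                              # E-step: p_i^mu, Eq. (15)
    weights = p @ h                                                # (P, n): sum_i h_ij p_i^mu
    W_new = (weights.T @ X) / weights.sum(axis=0)[:, None]         # M-step, Eq. (21)
    return W_new, p
```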
5. EXAMPLES OF ENERGY SURFACES
One of the important advantages of having an energy function is that it is clear what you are trying to minimize and, at least for a finite set of patterns, you can check whether this energy indeed decreases as a result of running your algorithm. Furthermore, you have an objective measure against which you can compare different weight configurations. In this section we consider two examples of disordered states that correspond to local minima in the energy landscape, where the ordered states are global minima.

5.1. Kinks in one-dimensional maps

As a first example, we consider a one-dimensional map consisting of three units with weights $w_1$, $w_2$, and $w_3$. Inputs $x$ are drawn homogeneously from the interval [0,1]. The
lateral-interaction matrix is of the form

$$h = \frac{1}{1+\sigma} \begin{pmatrix} 1 & \sigma & 0 \\ \sigma & 1-\sigma & \sigma \\ 0 & \sigma & 1 \end{pmatrix}, \qquad (22)$$
with $0 \leq \sigma < 1/2$ the interaction strength.
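For concreteness, the matrix (22) and a Monte-Carlo estimate of the energy (6) for this toy problem can be written down directly. The following sketch uses the normalized form of (22) as reconstructed above and compares an approximately ordered line with a kink; all numerical choices (sample size, random seed, the particular weight values) are our own:

```python
import numpy as np

sigma = 0.06
h = np.array([[1.0,   sigma,       0.0],
              [sigma, 1.0 - sigma, sigma],
              [0.0,   sigma,       1.0]]) / (1.0 + sigma)   # Eq. (22)

def energy(w, x, h):
    """Estimate E(W) = <min_i e_i(W, x)>, Eq. (6), for scalar weights w (n,)
    and a sample of scalar inputs x (P,)."""
    sq_dist = (x[:, None] - w[None, :]) ** 2     # (P, n)
    local_err = 0.5 * sq_dist @ h.T              # e_i(W, x), Eq. (5)
    return local_err.min(axis=1).mean()

x = np.random.default_rng(0).uniform(0.0, 1.0, 10000)
print(energy(np.array([1/6, 1/2, 5/6]), x, h))   # line-like configuration (123)
print(energy(np.array([1/6, 5/6, 1/2]), x, h))   # kink-like configuration (132)
```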
Figure 1. Configurations in a one-dimensional map. (a) Line. (b) Kink.
Depending on the ordering of the weights, we call a particular configuration a "line" or a "kink". Figure 1(a) sketches the line (123), Figure 1(b) the kink (132). Figure 2 shows how the energy surface changes as a function of the interaction strength $\sigma$ and the temperature $T$. To visualize these energy surfaces, we mapped the weights $w_1$, $w_2$, and $w_3$ to a pair of coordinates $q_1$ and $q_2$ with

$$q_1 = r \cos\phi \quad \text{and} \quad q_2 = r \sin\phi, \quad \text{with} \quad r^2 = \frac{9}{2} \sum_{i=1}^{3} \left(w_i - \frac{1}{2}\right)^2 \quad \text{and} \quad \phi = \arctan\frac{(w_3 - w_2) + (w_1 - w_2)}{\sqrt{3}\,(w_3 - w_1)},$$
and plotted the energy surface as a function of the new variables $q_1$ and $q_2$ for parameters $W$ under the constraint $w_1 + w_2 + w_3 = 3/2$ (to get rid of one degree of freedom). Through this transformation, all minima will lie close to a circle of radius 1, at angles 0 and $\pi$ for the ordered minima (lines) and angles $\pi/3$, $2\pi/3$, $4\pi/3$, and $5\pi/3$ for the disordered minima (kinks). In Figure 2, the interaction strength increases from left (0) to right (0.12), the temperature from top (0) to bottom (0.04). For zero interaction strength (left column), all minima are of equal depth. With increasing $\sigma$, the kinks start to become unstable, ensuring that for sufficiently large $\sigma$ the lines are the only remaining minima. A larger temperature tends to attract the minima towards the origin (corresponding to all weights equal). The larger the temperature, the lower the critical $\sigma$ for which the kinks become unstable.
Figure 2. Energy surfaces for a one-dimensional mapping, for different interaction strengths $\sigma = 0$, 0.06, and 0.12 (left to right) and temperatures $T = 0$, 0.02, and 0.04 (top to bottom).
5.2. Twists in two-dimensional maps
In our second example, we consider a two-dimensional map consisting of four units. Inputs $\vec{x}$ are drawn with equal probability from the square $[-1,1] \times [-1,1]$. The lateral-interaction matrix is of the form
$$h = \frac{1}{(1+\sigma)^2} \begin{pmatrix} 1 & \sigma & \sigma^2 & \sigma \\ \sigma & 1 & \sigma & \sigma^2 \\ \sigma^2 & \sigma & 1 & \sigma \\ \sigma & \sigma^2 & \sigma & 1 \end{pmatrix}. \qquad (23)$$
Again, $\sigma$ gives the lateral-interaction strength. We expect to find possible (local) minima if each unit covers one quadrant of the input space. We denote a particular minimum by $(ijkl)$ if unit $i$ lies in the first quadrant, unit $j$ in the second, and so on. There are $4! = 24$ different possible minima. 8 of them
Figure 3. Configurations in a two-dimensional map. (a) Rectangle. (b) Twist.
are perfectly ordered. We will call these configurations "rectangles". An example of such a rectangle, (1234), is given in Figure 3(a). As usual [1], lines are drawn between neighboring units in the map, i.e., between 1-2, 2-3, 3-4, and 4-1. For small $\sigma$, disordered configurations are local minima of the error potential. These minima are called "twists" or "butterflies". In Figure 3(b) the twist (1324) is sketched.

In order to make pictures of energy functions $E(W)$ for this two-dimensional mapping, we have to get rid of 6 degrees of freedom. We can define a two-dimensional manifold as follows.

1. The four weight vectors $\vec{w}_i$ lie on a circle with radius $r$: $\vec{w}_i = r(\cos\phi_i, \sin\phi_i)$.

2. The first weight vector is fixed to cover the first quadrant: $\phi_1 = \pi/4$.

3. The sum of all angles $\phi_i$ is constant: $\sum_i \phi_i = 4\pi$.

4. The sum-squared difference between the remaining free angles is fixed (equal to $\pi^2$).

With these constraints the weight vectors can be fully described by two parameters $q_1$ and $q_2$, defined as

$$q_1 = r \cos\psi \quad \text{and} \quad q_2 = r \sin\psi, \quad \text{with} \quad \psi = \arctan\frac{(\phi_2 - \phi_3) + (\phi_4 - \phi_3)}{\sqrt{3}\,(\phi_4 - \phi_2)}. \qquad (24)$$
If $\vec{w}_1$ covers the first quadrant, there are still 6 different ways to cover the other three quadrants. Therefore there are 6 different minima: 2 rectangles and 4 twists. Energy surfaces as a function of the parameters $q_1$ and $q_2$ are plotted in Figure 4 for different choices of $\sigma$ and $T$. The similarity with Figure 2 is remarkable. The destabilizing effect of the temperature seems even stronger.
Figure 4. Energy surfaces for a two-dimensional map, for different interaction strengths $\sigma = 0$, 0.1, and 0.2 (left to right) and temperatures $T = 0$, 0.05, and 0.1 (top to bottom).
6. DISCUSSION
Why would we bother using another definition of the winning unit to turn the original Kohonen learning rule into something slightly different which corresponds to an energy function? In fact, the proposed winner definition requires an extra matrix multiplication, which makes the learning algorithm a little slower. The more theoretical advantages are the following.

- A global measure of learning performance. You know what you are minimizing: if you are not satisfied you might try and choose another definition of the local errors (5). By mapping the weight space to a lower dimension, it is possible to visualize the energy function, as we did in Section 5. Note that the energy function is not a direct measure of neighborhood preservation. It may be interesting to study for which kinds of "neighborhood preservation measures" there exists an indirect link, in the sense that a lower minimum has a better neighborhood preservation.
- A smoother gradient. As indicated in Table 1, the gradient corresponding to the new winner definition is smoother than the (average) Kohonen learning rule, both for a finite set of patterns and for a continuous distribution of inputs. This makes it much easier to borrow from standard theory on stochastic approximation (see e.g. [25-27]) to find convergence proofs and study asymptotics. Some of this can be found in [12], where transition times between disordered and ordered states are discussed.

Changing the determination of the winning unit has no effect on the basic properties of the Kohonen learning algorithm: a relatively simple procedure with remarkable self-organizing capabilities. At a more theoretical level, many results concerning the original Kohonen learning algorithm (see e.g. [3,22]) are not particularly surprising from a stochastic approximation or optimization point of view. Often, these results are just more difficult to prove for technical reasons, for example because of discontinuities or the lack of an energy function. Nowadays, the original learning rule proposed by Prof. Kohonen, with the more straightforward winner determination, is by far the most popular self-organizing map algorithm. Alternatives, like the one discussed in this paper, seem to receive a bit more attention lately, but still have a long way to go. One wonders what the current status of the different winner mechanisms would have been if Prof. Kohonen in his seminal work had chosen the slightly more complicated (but from a mathematical point of view arguably more sensible) version discussed in this paper...

REFERENCES
1. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
2. J. Kangas and S. Kaski. 3043 works that have been based on the self-organizing map (SOM) method developed by Kohonen. Technical report, Helsinki University of Technology, Laboratory of Computer Science, 1998.
3. M. Cottrell, J. Fort, and G. Pagès. Theoretical aspects of the SOM algorithm. Neurocomputing, 21:119-138, 1998.
4. H. Bauer and K. Pawelzik. Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Transactions on Neural Networks, 3:570-579, 1992.
5. T. Villmann, R. Der, and T. Martinetz. Topology preservation in self-organizing feature maps: exact definition and measurement. IEEE Transactions on Neural Networks, 8:256-266, 1997.
6. E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics, 67:47-55, 1992.
7. T. Kohonen. Self-organizing maps: optimization approaches. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 981-990, Amsterdam, 1991. North-Holland.
8. H. Ritter and K. Schulten. On the stationary state of Kohonen's self-organizing sensory mapping. Biological Cybernetics, 54:99-106, 1986.
9. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps. Addison-Wesley, Reading, Massachusetts, 1992.
10. V. Tolat. An analysis of Kohonen's self-organizing maps using a system of energy functions. Biological Cybernetics, 64:155-164, 1990.
11. T. Heskes and B. Kappen. Error potentials for self-organization. In International Conference on Neural Networks, San Francisco, volume 3, pages 1219-1223, New York, 1993. IEEE.
12. T. Heskes. Transition times in self-organizing maps. Biological Cybernetics, 75:49-57, 1996.
13. T. Graepel, M. Burger, and K. Obermayer. Phase transitions in stochastic self-organizing maps. Physical Review E, 56:3876-3890, 1997.
14. T. Graepel, M. Burger, and K. Obermayer. Self-organizing maps: generalizations and new optimization techniques. Neurocomputing, 21:173-190, 1998.
15. S. Luttrell. Self-organisation: A derivation from first principles of a class of learning algorithms. In International Joint Conference on Neural Networks, volume 2, pages 495-498. IEEE Computer Society Press, 1989.
16. S. Luttrell. A Bayesian analysis of self-organizing maps. Neural Computation, 6:767-794, 1994.
17. K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics of phase transitions in clustering. Physical Review Letters, 65:945-948, 1990.
18. K. Rose, E. Gurewitz, and G. Fox. Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38:1249-1257, 1992.
19. B. Bakker and T. Heskes. Model clustering by deterministic annealing. In M. Verleysen, editor, Proceedings of the European Symposium on Artificial Neural Networks '99, 1999.
20. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
21. T. Kohonen. The self-organizing map. Neurocomputing, 21:1-6, 1998.
22. Y. Cheng. Convergence and ordering of Kohonen's batch map. Neural Computation, 9:1667-1676, 1997.
23. R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, Dordrecht, 1998.
24. C. Bishop, M. Svensén, and C. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203-224, 1998.
25. H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, New York, 1978.
26. H. Kushner. Robustness and approximation of escape times and large deviations estimates for systems with small noise effects. SIAM Journal of Applied Mathematics, 44:160-182, 1984.
27. A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, 1987.
LVQ and single trial EEG classification

G. Pfurtschellerᵃ and M. Pregenzerᵇ

ᵃLudwig-Boltzmann Institute for Medical Informatics and Neuroinformatics, Inffeldgasse 16a, A-8010 Graz, Austria

ᵇDepartment of Medical Informatics, Institute of Biomedical Engineering, Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria
EEG signals, which reflect the electrical activity of the brain, can be classified on a single trial basis. In this paper the classification of EEG signals during preparation of different types of movement with Learning Vector Quantization (LVQ) is addressed. A good separability for left and right finger movements and also foot and tongue movements is shown. Furthermore, the problem of feature selection is addressed. A modified version of LVQ, Distinction Sensitive Learning Vector Quantization (DSLVQ), is discussed and applied for the selection of optimal EEG parameters.
1. INTRODUCTION

Electroencephalogram (EEG) signals, which can be recorded on the intact scalp, reflect the electrical activity of the brain. An internally or externally paced event results not only in the generation of an event-related potential (ERP), but also in a change of the ongoing bioelectrical brain activity in the form of either an event-related desynchronization (ERD, [1]) or an event-related synchronization (ERS, [2]). In contrast to the ERP, which is a time- and phase-locked response of the brain, the ERD/ERS is time-locked but not phase-locked (for details see [3]). While the ERP can be extracted from the ongoing EEG directly with averaging techniques, the evaluation of ERD and ERS requires a preceding filtering and squaring of the samples. The resulting signals can then be analyzed with averaging techniques or with learning classification methods. Learning classification methods allow a single trial discrimination of different types of EEG responses. The classification problem is usually non-linear and high dimensional, since parameters from different frequency bands, electrode positions and time segments are relevant. Different studies have investigated the separability of single trial EEG signals with the long-term objective to build a direct brain-to-computer communication system (cf., e.g., [4], [5], [6], [7]). This paper gives an overview of the application of Learning Vector Quantization (LVQ, [8]) in
this context. Furthermore, the problem of feature selection is addressed: a variation of LVQ, Distinction Sensitive Learning Vector Quantization (DSLVQ, [9]), can be used for a very efficient selection of the most relevant EEG parameters.
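For reference, the basic LVQ1 update underlying LVQ3 and DSLVQ moves the nearest codebook vector towards or away from the input depending on whether the class labels agree. The sketch below is a generic textbook illustration with assumed variable names, not the authors' implementation:

```python
import numpy as np

def lvq1_step(x, y, codebook, labels, alpha=0.05):
    """One LVQ1 update for input x with class label y.
    codebook : (K, d) prototype vectors, labels : (K,) their class labels."""
    k = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))   # nearest prototype
    sign = 1.0 if labels[k] == y else -1.0
    codebook[k] += sign * alpha * (x - codebook[k])            # attract or repel
    return codebook
```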
2. DISCRIMINATION OF LEFT AND RIGHT FINGER MOVEMENTS WITH LVQ
Table 1 LVQ classification results (mean + standard deviation of 10 runs with different initialization) using six electrodes. Results with different reference methods and time resolutions are shown. The input dimensionality varied between 18 (6x3) for the 500-ms resolution and 72 (6x12) for the 125-ms time resolution. 125 ms 250 ms 500ms
monopolar
common average
local average
73.8 % + 1.5 74.5% _+ 1.9 72.4 % _+ 1.7
75.0 % + 1.3 75.4% + 1.4 73.2% _+ 1.2
75.9% + 1.4 75.3% + 1.8 75.1% +_ 1.1
The results show that better classification results can be achieved with re-referenced signals on the expense of higher recording- and preprocessing costs. (A larger number of electrodes and additional calculations are required.) Considering the three investigated time resolutions, similar results are obtained with time resolution of 125 ms and 250 ms, whereas with a resolution of 500 ms a performance decrease can be observed. Detailed results with more electrodes and different time windows can be found in [4].
3. DISCRIMINATION BETWEEN PLANNING OF FINGER, FOOT AND TONGUE MOVEMENT USING LVQ

In this study 4 different classes had to be discriminated: planning of left index finger, right index finger, foot and tongue movement. The following paradigm was used: 1 s after a short warning tone, a visual cue was presented on a monitor and indicated which of the 4 movements should be prepared. Two seconds after the visual cue an acoustic stimulus (RS) indicated that the movement should now be executed. The subject was asked to make one brisk but distinct movement after RS. The different visual cues were presented in a random order, using blocks of 50 and intertrial intervals of 12 s. Each specific movement had to be made 150 times. The signals were recorded with a sampling frequency of 128 Hz and digitally filtered with a number of narrow bandpass filters centered between 8 and 40 Hz. After filtering, each sample value was squared and then 32 squared samples were averaged to obtain instantaneous power estimates at intervals of 250 ms. Data from 8 electrodes (positions selected by an expert) and 4 time segments were used, thus giving an input dimensionality of 32.

Table 2
Correct classification rates in % (mean ± standard deviation of 10 repetitions with different initialization) for all four types of movement (left and right index finger, toe and tongue). Different time windows and frequency bands are displayed.
              1 s before RS   1 s around RS   1 s after RS   before cue
10-12 Hz      51.1 ± 1.8       55.3 ± 2.1      57.8 ± 1.7     28.8 ± 1.9
30-33 Hz      35.8 ± 2.8       38.7 ± 3.1      43.6 ± 1.1     22.6 ± 2.9
38-40 Hz      40.2 ± 2.3       43.6 ± 2.2      48.8 ± 1.6     27.1 ± 2.5
all bands     62.3 ± 1.4       66.0 ± 1.8      69.6 ± 1.5     23.1 ± 1.6
Table 3
Correct classification rates in % (mean ± standard deviation of 10 repetitions with different initialization) for left vs. right index finger movements. Different time windows and frequency bands are displayed.
              1 s before RS   1 s around RS   1 s after RS   before cue
10-12 Hz      74.9 ± 1.3       82.8 ± 1.1      81.2 ± 1.3     49.0 ± 3.5
30-33 Hz      51.8 ± 2.4       60.9 ± 3.9      71.5 ± 1.8     52.0 ± 3.9
38-40 Hz      82.1 ± 2.9       80.6 ± 1.6      82.3 ± 1.9     39.2 ± 2.6
all bands     86.7 ± 2.9       87.9 ± 1.3      88.7 ± 1.4     47.8 ± 2.7
Tables 2 and 3 show the LVQ results on all four classes and on two selected classes, respectively. Three different 1-s intervals were investigated: before, during and after the reaction stimulus (RS), whereby movement onset was about 1.5 s after RS. Using the 10-12 Hz band alone, a discrimination between left and right index finger was possible with nearly 83 % accuracy. The discriminative power was similar for the band around
40 Hz. Discrimination between 4 different types of movement is a more difficult task and revealed a classification accuracy between 51 and 58 % when alpha band components were used alone. When all 3 frequency bands were classified together, the accuracy reached 62-70 %. These results show a very interesting and new relationship between 40-Hz EEG and preparation for movement [10]. When EEG segments before the cue presentation were analyzed, the classification accuracy reached only 23 % for the four-class problem and 48 % for the two-class problem. These results, which reflect the prior probabilities of the four- and two-class problems, confirm that the EEG contains no relevant information before the cue. Generalizing the results, it can be stated that planning of one out of 4 different types of movement can be classified with nearly 70 % accuracy.
4. ONLINE EEG CLASSIFICATION USING AN LVQ CLASSIFIER

For the construction of an EEG-based brain-computer interface (BCI) system the EEG signals must be classified online. To support subjects with severe motor disabilities, the training data should be recorded without subsequent real movements. Figure 1 shows the timing of an online experiment: a fixation cross was presented on a computer screen. After three seconds an arrow (cue) was superimposed on the fixation cross; depending on the direction of the arrow (left or right), the subject should imagine either a left or a right hand movement. The features used for classification were extracted from the 1-s epoch of EEG between 3.25 and 4.25 s (i.e. starting 250 ms after the cue presentation). The EEG was filtered in two subject-specific frequency ranges. Four power values (from the four 250-ms segments) were calculated for both frequency bands and two electrodes (C3, C4, see Figure 1, insert).
Figure 1. Timing of the online sessions. The cue stimulus in the form of an arrow (seconds 3-4) indicates the side of the movement that shall be imagined. The interval from 3.25 s to 4.25 s is used for classification; feedback is given at 6 s. Insert: bipolar EEG channels used: C3: C3a-C3p; C4: C4a-C4p. Channel C3 was derived from an electrode placed 2.5 cm anterior to C3 (C3a) and an electrode placed 2.5 cm posterior to C3 (C3p). Channel C4 was derived correspondingly.
Based on the 16 features per trial (4 power estimates x 2 frequency ranges x 2 EEG channels), the online classifier derived a classification plus a confidence measure. A modified LVQ classifier [11] was used which could provide this additional confidence measure (a value between 0 and 1). This allowed the identification and rejection of ambiguous input data by thresholding the confidence at a pre-specified value (0.1). A second threshold value (0.4) was used to distinguish between average and very clear classifications. The results were shown to the subject as small and big feedback symbols (large "+", small "+", "o", small "-", large "-"). The LVQ classifier was updated after the 4 test sessions to track possible changes of the EEG characteristics due to learning effects. Online classification results of one session are displayed in Figure 2. The 4 parts of the session can be seen clearly, with the first part containing a spell of bad performance (trials 25-35), probably due to a lack of concentration. The remaining 3 parts show the regular performance that is representative for other sessions and subjects. The data of 3 other subjects are shown in Figure 3: the results were between 70 % and 90 % for the other subjects and test sessions. The variation of the results over sessions can mostly be attributed to the subject's momentary mental and physical state (e.g. fatigue or stress leading to concentration difficulties), as reported by the subjects after the experiment.
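As an illustration of the two-threshold feedback scheme and of the summed performance curve shown in Figure 2, the sketch below maps a classifier output (predicted class plus a confidence between 0 and 1) onto the five feedback symbols and accumulates the score. The symbol values (+2, +1, 0, -1, -2) follow the Figure 2 caption; the assignment of rejected trials to "o", the helper names and the example trials are our assumptions (in Figure 2 the running sum is additionally restarted for each part of the session).

```python
REJECT_THRESHOLD = 0.1   # below this the trial is rejected as ambiguous
CLEAR_THRESHOLD = 0.4    # above this the classification counts as "very clear"

def feedback_symbol(predicted, target, confidence):
    """Return the feedback symbol for one trial (big/small +/-, or 'o')."""
    if confidence < REJECT_THRESHOLD:
        return "o"                      # ambiguous input -> no decision (our reading)
    big = confidence >= CLEAR_THRESHOLD
    if predicted == target:
        return "++" if big else "+"
    return "--" if big else "-"

SCORE = {"++": 2, "+": 1, "o": 0, "-": -1, "--": -2}

def summed_performance(trials):
    """Cumulative feedback score over a run of trials, as plotted in Figure 2."""
    total, curve = 0, []
    for predicted, target, confidence in trials:
        total += SCORE[feedback_symbol(predicted, target, confidence)]
        curve.append(total)
    return curve

# Example: three hypothetical trials (predicted class, true class, confidence).
print(summed_performance([("L", "L", 0.7), ("R", "L", 0.2), ("L", "L", 0.05)]))
# [2, 1, 1]
```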
Figure 2. Summed single-trial performance of one subject in one session: the session consisted of 160 trials (shown on the x-axis), which were classified independently. The y-axis represents the sum of the feedback symbols, where large "+" signs are assigned a value of +2, small "+" signs a value of +1, "o" a value of zero, small "-" signs a value of -1 and large "-" signs a value of -2. The 4 parts of the session, each starting off with a sum of zero, can be seen clearly. The first 40 trials show a period (from approx. trial 25 to trial 35) where the subject lacked concentration and thus performed badly.

No significant increase in the classification accuracy was found with an increasing number of test sessions (compare Figure 3). It can be expected that test sessions with feedback reinforce the neural circuitry involved in the imagination process and that the
spatiotemporal EEG patterns become more pronounced. Nevertheless, the classification performance does not increase. This can be explained by the "man-machine learning dilemma" (MMLD). With each successful classification, the processing mode and, therefore, the EEG pattern is slightly modified. Such a change in the EEG pattern results, however, in a decrease of the classifier performance, and no overall improvement is found. The MMLD implies that two systems (man and machine) are strongly interdependent, but have to be adapted independently. The starting point of this adaptation is the training of a "machine" to recognize certain EEG patterns of a subject. During this phase, no feedback can be given. As soon as feedback is provided (and this is always the case in any application of such a system), the feedback results in an adaptation of man to machine: man tries to repeat success and avoid failure. This also causes changes of the EEG patterns, which are the source for the machine's analyses. Changing distributions of patterns require the adaptation of the pattern recognition methods, i.e. adaptation of machine to man. Via feedback, however, these changes also influence the behavior of man, which results in further variations of the EEG patterns. Due to the interdependency of EEG patterns and machine behavior, a continuous tracking on both sides could lead to instability, and the initial adaptation could easily be destroyed.
Figure 3. Percentage of correctly classified trials over consecutive test sessions for 3 subjects.

The experiments show that an adaptation of the computer system to man is necessary. One solution is to adapt the system after completed data blocks (one or multiple sessions). This reduces the danger of instability, but still allows adaptation to changes of the brain.
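One way such block-wise adaptation could be realised is sketched below, assuming a plain LVQ1 update of the codebook after each completed block of labelled trials; the function names, learning rate, epoch count and data shapes are illustrative and are not taken from the BCI system described here.

```python
import numpy as np

def lvq1_update(codebook, labels, x, y, lr=0.05):
    """One LVQ1 step: move the nearest prototype towards x if its class
    matches y, otherwise move it away from x."""
    k = np.argmin(np.linalg.norm(codebook - x, axis=1))
    sign = 1.0 if labels[k] == y else -1.0
    codebook[k] += sign * lr * (x - codebook[k])

def adapt_after_block(codebook, labels, block_X, block_y, epochs=5, lr=0.05):
    """Retrain on one finished block (e.g. a whole session) instead of after
    every single trial, to avoid the instability discussed above."""
    for _ in range(epochs):
        for x, y in zip(block_X, block_y):
            lvq1_update(codebook, labels, x, y, lr)
    return codebook

# Example with random data: 4 prototypes (2 per class), 16-dimensional features.
rng = np.random.default_rng(0)
codebook, labels = rng.normal(size=(4, 16)), np.array([0, 0, 1, 1])
codebook = adapt_after_block(codebook, labels, rng.normal(size=(40, 16)),
                             rng.integers(0, 2, size=40))
```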
5. SELECTION OF OPTIMAL EEG PARAMETERS WITH DSLVQ

One problem with the classification of EEG signals is the large number of possible EEG parameters. Different frequency bands may be relevant at different electrode positions and during different time segments. Compared to the large number of possible input features, the number of usually available examples is extremely small. This causes a significant problem for learning classification methods: a pre-selection of input
features is essential to obtain good generalization results. Finding the optimal combination of input features is a very time-consuming process, since the number of possible combinations grows exponentially with the number of candidate features. With 20 spectral components from 20 electrodes and two different time segments, the number of candidate features is 800 (20 x 20 x 2); this yields 2^800 (more than 10^240) possible feature combinations! If only 20 features (e.g. the 20 spectral components) are considered, there still exist more than one million combinations. Obviously an exhaustive search is impractical. An additional problem is that the optimality of a certain combination depends strongly on the subsequent classification method: different features may, for example, be relevant for linear and non-linear classification, since non-linear dependencies cannot be exploited with a linear classifier. The optimal feature combination for an LVQ classifier can only be found through testing LVQ on all possible feature combinations. Even though LVQ training is comparably quick, a good heuristic to find relevant features is indispensable. DSLVQ has been developed for this purpose: it finds a suitable combination of input features for an LVQ classifier and is quick enough to allow even an online adaptation of the feature selection (this might be necessary in a BCI system, e.g. due to an electrode failure or visually induced disturbances). LVQ1 training segments the input space into regions which are exclusively assigned to one class (i.e. to the class with the highest posterior probability). The separation between two neighbouring regions is linear and - if the two regions are from different classes - it is also one part of the piecewise linear decision border. For a linear classification problem the orientation of the optimal decision border reflects the relevance of the features. DSLVQ uses this fact and obtains relevance estimates for the single features from analyses of the linear pieces of the decision border. A global relevance vector w = [w_1, w_2, ..., w_N], where w_n represents the relevance of input feature n, is used to store the feature relevances: w is iteratively trained in parallel to the codebook and used for all distance calculations between training examples and the codebook vectors. This can be written in the form of a weighted distance function
d(x, y, w) = \sqrt{ \sum_{n=1}^{N} w_n (x_n - y_n)^2 },
which is used instead of the standard Euclidean distance function, also in the iterative codebook training. The influence of each feature n is adapted through the weight value w_n, which reflects its relevance for the classification problem. The relevance vector w is initialized evenly or according to a priori knowledge and then trained in parallel to the LVQ1 codebook training. The major advantage of this dynamic scaling is that the influence of (obviously) irrelevant features can be reduced already during the LVQ1 training process. This avoids a repetition of the training with different feature sets. In a single training process, DSLVQ finds a relevant feature combination and the appropriate codebook positions. If the repeated application of DSLVQ with different initializations of the codebook always finds the same feature combination, it can be concluded that this combination is not merely one good feature combination, but the optimal combination for LVQ classification.
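In code, the weighted distance above is a one-liner; the sketch below shows it together with a simple normalisation of the relevance vector. The normalisation convention and function names are ours, and the DSLVQ relevance-update rule itself, which is not reproduced in this overview, is deliberately left out.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance d(x, y, w) used by DSLVQ in place of the
    standard Euclidean distance during codebook training."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

def normalise_relevances(w):
    """Keep the relevance vector non-negative and summing to one, so that the
    weights can be read as relative feature relevances (our convention, not
    necessarily the one used in the original DSLVQ)."""
    w = np.clip(w, 0.0, None)
    return w / w.sum()

# Example: with feature 0 weighted out, only the second feature contributes.
x, y = np.array([1.0, 2.0]), np.array([3.0, 5.0])
print(weighted_distance(x, y, np.array([0.0, 1.0])))   # 3.0
```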
In usual BCI experiments more than 800 candidate features are available, but only a few hundred consistent training examples. Recordings from different subjects cannot be directly compared, and the EEG patterns show a considerable variation when a subject gets more practice with the system [12]. The small number of consistent training examples requires a splitting of the search problem into three different sub-problems: selection of an optimal time interval, selection of optimal electrode positions and selection of optimal frequency bands. The parameters of one sub-problem (e.g. frequency bands) can be optimized while approximately optimal values, picked by an expert, are kept constant for the other parameters (e.g. time window and electrode positions). It can be assumed that these three sub-problems are locally independent, i.e. that a slightly longer time segment or a slightly different electrode location may influence the classification accuracy, but will generally not influence the relevant frequency band. Suitable feature selection methods are discussed in [11]. In [13] these methods are compared to DSLVQ and advantages of DSLVQ are shown. DSLVQ was therefore used in different studies to optimize the parameters along all three dimensions [14], [15], [16].
Figure 4. Average DSLVQ weight values for 1-Hz spectral bands between 5 and 25 Hz for 4 different subjects (A1, A6, B8, B9). High weight values reflect a high relevance for the discrimination during movement preparation. (Modified from [16].)
6. SELECTION OF OPTIMAL FREQUENCY COMPONENTS WITH DSLVQ
Three bipolar EEG channels were recorded from sensorimotor areas. After calculation of the spectral band power values, three different types of movement (left hand, right hand and foot) were discriminated. Figure 4 shows the DSLVQ weight values for 4 different subjects. The reported values are average weight values from 10 runs of DSLVQ and 3 electrodes. For all subjects the significance of the mu rhythms (10-12 Hz) can be seen clearly (which is in good agreement with [17] and [18], where a blocking of the mu rhythm over the contralateral hemisphere shortly before hand movements is reported). Figure 4 shows, however, also that the optimal feature selection depends on the individual subject. Notable differences between the 4 subjects are: i) a significance of the lower mu rhythms (8-10 Hz) can be observed for two subjects (B8 and B9); ii) the relevance of the central beta rhythms (20-23 Hz) varies strongly: central beta rhythms are even more informative than mu rhythms for one subject (A1), whereas for another subject (B9) no relevance can be observed. This demonstrates the necessity of a subject-specific frequency selection. The influence of this selection on the classification performance is discussed in detail in [12] and [16].
Figure 5. Weight maps derived by an interpolation of the DSLVQ weight values for 56 electrodes (two-class problem, planning of left and right hand movement). All 56 electrode positions and the approximate location of the central sulcus are also indicated. (Modified from [5].)
7. SEARCH FOR SOURCES OF MU AND CENTRAL BETA RHYTHMS USING DSLVQ

With dipole localization methods on neuromagnetic data, the sources of the Rolandic mu and beta rhythms have been found close to the somatosensory area and the motor area, respectively [19]. Similar results have been found by Pfurtscheller et al. (1994) with single-trial classification methods. This approach is based on the knowledge that the planning of a one-sided movement results in a blocking or desynchronization of mu and central beta rhythms over the contralateral sensorimotor area (cf., e.g., [18], [20], [21]). It is assumed that the EEG channels (electrodes) which overlie the cortical generators are also most important for the discrimination of different types of movement. To prove this, 56-dimensional data vectors (56 electrodes, one time window, one frequency band per evaluation) from two classes (right- and left-hand movement) were processed with the DSLVQ algorithm. The DSLVQ weight values of the individual electrodes displayed their relevance for the discrimination. Two frequency bands, around 10 Hz (mu rhythm) and 20 Hz (central beta rhythm), were compared. It was found that the focus is slightly more posterior for the mu rhythm compared to the central beta rhythm (Figure 5). This is in very good agreement with the results obtained with the dipole localization.
8. CONCLUSIONS

Classification of EEG signals on a single-trial basis is a non-linear and high dimensional problem. The LVQ classifier has been used in different studies on band power values and revealed a good separability of EEG signals recorded before real and imagined movement. DSLVQ is a variation of the original LVQ algorithm which finds optimal scaling factors for the input features. These scaling factors reflect the relevance of the features for the classification problem. The DSLVQ method has been used to optimize electrode locations, frequency bands and time segments for the separation of different types of planned movement. The constant improvements in single-trial EEG classification allow the vision of a future direct brain-to-computer communication system.

Supported by the "Fonds zur Förderung der wissenschaftlichen Forschung", project P11208MED, the "Steiermärkischen Landesregierung", the "Österreichischen Nationalbank" and the "Allgemeinen Unfallversicherungsanstalt (AUVA)" in Austria.
REFERENCES
1. G. Pfurtscheller, Electroenceph. clin. Neurophysiol. 43 (1977) 757.
2. G. Pfurtscheller, Electroenceph. clin. Neurophysiol. 83 (1992) 62.
3. G. Pfurtscheller and F.H. Lopes da Silva, Elsevier, Amsterdam, in press 1999.
4. D. Flotzinger, G. Pfurtscheller, Ch. Neuper, J. Berger and W. Mohl, Med. Biol. Eng. Comput. 32 (1994) 571.
5. M. Pregenzer, G. Pfurtscheller and D. Flotzinger, Biomed. Technik 39 (1994) 264.
6. G. Pfurtscheller, J. Kalcher, C. Neuper, D. Flotzinger and M. Pregenzer, Electroenceph. clin. Neurophysiol. 99 (1996) 416.
7. J. Kalcher, D. Flotzinger, C. Neuper, S. Gölly and G. Pfurtscheller, Med. Biol. Eng. Comput. 34 (1996) 382.
8. T. Kohonen, Proc. IEEE 78 (1990) 1464.
9. M. Pregenzer, Shaker Verlag, Aachen, 1998.
10. G. Pfurtscheller, D. Flotzinger and Ch. Neuper, Electroenceph. clin. Neurophysiol. 90 (1994) 456.
11. D. Flotzinger, Doctoral Thesis, University of Technology, Graz, 1995.
12. M. Pregenzer and G. Pfurtscheller, IEEE Trans. Rehab. Eng. (1999, in press).
13. D. Flotzinger, M. Pregenzer and G. Pfurtscheller, Proc. IEEE Int. Conf. on Neural Networks, IEEE Service Center, Piscataway, 1994.
14. M. Pregenzer, G. Pfurtscheller and D. Flotzinger, Neurocomputing 11 (1996) 19.
15. G. Pfurtscheller, M. Pregenzer and C. Neuper, Neuroscience Letters 181 (1994) 43.
16. M. Pregenzer and G. Pfurtscheller, Proc. Int. Conf. on Artificial Neural Networks (1995) 433.
17. H. Gastaut, Rev. Neurol. 87 (1952) 176.
18. G.E. Chatrian, M.C. Petersen and J.A. Lazarte, Electroenceph. clin. Neurophysiol. 11 (1959) 497.
19. R. Salmelin and R. Hari, Neuroscience 60 (1994) 537.
20. G. Pfurtscheller and A. Berghold, Electroenceph. clin. Neurophysiol. 72 (1989) 250.
21. P. Derambure, L. Defebvre, K. Dujardin, J.L. Bourriez, J.M. Jacquesson, A. Destee and J.D. Guieu, Electroenceph. clin. Neurophysiol. 89 (1993) 197.
Self-organizing map in categorization of voice qualities

L. Leinonen
Neural Networks Research Centre, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland
We have explored the applicability of the self-organizing map to the modeling of perceived similarity relationships among voice qualities. Voice samples (94 male and 124 female) representing healthy and pathological voices were divided into 5 classes using the six-dimensional auditory ratings of 6 speech pathologists. The acoustic categorizations were made with 20 spectral features, which were selected using simulated annealing and learning vector quantization to maximize the classification accuracy. An accuracy of about 80 per cent was reached. The selection improved the match between the auditory categorization and the clustering of samples on acoustic feature maps. Further improvement of the map models requires the addition of new types of acoustic features and a larger database.

1. VOICE QUALITY

Our ability to discriminate voices is based on the neural memory of hundreds or thousands of voices we have previously heard. From the voice we can tell, for instance, whether the speaker is male, a child, self-confident, angry, or has a sore throat. Such categorizations are easy to make but difficult to trace back to their origins in the acoustic environment. Voice quality is critical for success in many aspects of life. A man in his twenties with a voice resembling that of an old grandmother cannot count on being one day in charge of an army or a political party. Social norms and particular occupational needs create varying demands for voice therapy. Most voice problems are due to laryngeal pathologies. Anatomical deformities of the vocal folds may turn the voice hoarse and rough, and deterioration in the control of laryngeal function may result in a whispering, creaky or shrill voice, for instance.

2. CLINICAL MEASUREMENT OF VOICE QUALITY

Voices are molded both by the surgeon's knife and by behavioral therapies. In evidence-based medicine, therapies are developed using standardized measures of outcome. For the evaluation of voice quality there are, however, no standardized practices. Voices are usually evaluated with auditory rating tests: a group of speech pathologists judge the samples along several clinically motivated dimensions, such as the degree of pathology, breathiness, and roughness. The judgments suffer from considerable inter- and intra-rater variability, which
complicates the design of reliable tests and the comparison of results from different tests. Neither does voice evaluation with simple acoustic measures of signal variability and spectral energy distribution provide the solution, because the correlation between auditory judgments and acoustic measures is poor. This suggests that the acoustic patterns significant for the perception of deteriorated voice quality are relatively complex.

3. VISUALIZATION OF VOICE ACOUSTICS WITH SELF-ORGANIZING MAP

The self-organizing map [1] provides an easily understandable visualization of similarity relationships among complex feature patterns. If self-organized acoustic feature maps matched the similarity relationships among auditory voice ratings, they could be applied to clinical voice description. Such maps would serve both diagnostic purposes (as inhalation provocation tests [2], for instance) and the evaluation of therapeutic outcomes. The maps could also be used as easily understandable feedback devices during behavioral therapy. Visual feedback, instead of auditory, is often necessary because people with long histories of voice disorders may not themselves hear the variation of their own voices. As a simple example of the desired visual feedback, Figure 1 shows two trajectories of the utterance [saa] on a map which was trained with speech spectra of subjects with normal and pathological voices. One trajectory corresponds to a healthy voice and the other to a breathy voice (continuous leakage of air through the vocal folds). The trajectories show the temporal sequence of spectral samples taken at 10-ms intervals, and the small circles along the trajectory become the larger the longer the spectral samples remain at one map location (corresponding to the model vector which best matches the sample according to the Euclidean distance). Both trajectories begin in the upper left corner of the map with the representation of [s]. The differences in the trajectories of the two [aa]-samples on the map are in line with perception, namely that the breathy sample is more unstable and nearer to [s] than the healthy one.
Figure 1. Trajectories of two [saa]-samples (healthy and breathy) on a spectral feature map.
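A trajectory such as those in Figure 1 is obtained by looking up, for each 10-ms spectral frame, the map unit whose model vector is closest in Euclidean distance. The sketch below illustrates only this lookup and the dwell-time counting that determines the circle sizes; the randomly initialised map, the grid size and the feature dimensionality are placeholders standing in for a trained map.

```python
import numpy as np

def bmu_trajectory(spectra, codebook, grid_shape):
    """Map a sequence of spectral feature vectors (one per 10-ms frame) onto
    the coordinates of their best-matching map units."""
    trajectory, dwell = [], {}
    for s in spectra:
        bmu = np.argmin(np.linalg.norm(codebook - s, axis=1))  # best-matching unit
        r, c = divmod(int(bmu), grid_shape[1])                 # row/column on the grid
        trajectory.append((r, c))
        dwell[(r, c)] = dwell.get((r, c), 0) + 1               # dwell time -> circle size
    return trajectory, dwell

# Example: a random 12 x 8 map of 20-dimensional model vectors and 50 frames.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(12 * 8, 20))
trajectory, dwell = bmu_trajectory(rng.normal(size=(50, 20)), codebook, (12, 8))
```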
Using spectral feature extraction without much consideration of the perceptual significance of single features, a crude categorization of voices becomes possible [3-6]. We noticed that features derived from a Bark-scaled spectrum, which were suited for phonemic differentiation [7], were not appropriate for voice description. Voice disorders related to alterations in the vocal tract induce subtle changes in the formant structure. In the detection of hypernasalization (failure in closing the passage from the pharynx to the nasal cavity), we have experimented with features obtained by linear prediction [8].

4. PERCEPTUAL CATEGORIZATION OF VOICE SAMPLES

For the development of the acoustic feature description by the self-organizing map we gathered 94 male and 124 female voice samples (two-syllable words with the long vowel [aa]), which were then evaluated by six experienced speech pathologists on 6 analog scales for the degree of pathology, roughness, breathiness, strain, asthenia, and pitch. The ratings (six median scores for each voice) were categorized with the aid of the Sammon map [9] and the self-organizing map into five categories [10] which differed with respect to the breathiness-to-roughness ratio and the degree of pathology. Figure 2 shows the categories on a self-organized map whose model vectors correspond to the six-dimensional auditory ratings. As suggested by the map, the category boundaries are rather obscure, which agrees with the high variability in the auditory rating tests. Besides the setting of boundaries, the perceptual categorization is also complicated by the lack of agreement on the significant perceptual dimensions and on how they should be measured and used for categorizations.
Figure 2. Model vectors of a map trained with voice ratings on 6 analog scales.
5. ACOUSTIC FEATURE SELECTION AND PERCEIVED SIGNIFICANCE

Simulated annealing [11,12] and learning vector quantization [1] were implemented for the selection of acoustic features with respect to the perceptual categorization of the samples. Twenty-two parameters that defined 20 spectral windows (the adjusted parameters were the position, width and weighting of individual sinc-windows and the normalization of the feature vectors) were selected during simulated annealing to minimize the average error rate in the classification of samples by 10 different LVQ-codebooks. The selected features were compared with those previously used, using a new test set and 50 LVQ-codebooks. An accuracy of 81 and 83 percent was reached for female and male voices, respectively (unpublished). Feature selection requires three independent sets of samples, each representative of the categorization task: during the selection process one set is used for training and one for testing the LVQ-classifiers, and at the end of the process the final testing is carried out with a third set. With the present number of samples this requirement could not be strictly fulfilled and, therefore, lower classification accuracies might be obtained for new data. The studies show that hundreds of perceptually classified voice samples are needed for the selection of features that would meet clinical needs. Selection of spectral features improved the classification accuracy of the LVQ-codebooks and improved the clustering of perceptually similar voice samples on the spectral feature maps. The experiments demonstrated that simple spectral features are not sufficient for matching the perceptual categorization. Perceived roughness and breathiness, for instance, are partly related to signal features that are poorly depicted by short-time power spectra.

6. SELF-ORGANIZED MAPS OF PERCEPTUALLY MEANINGFUL SENSORY PATTERNS

Our experiments suggest that selection of signal features according to their perceived significance improves the match between self-organized acoustic feature maps and the perceived similarity relationships among acoustic signals. At present, self-organized maps have no rivals in the on-line transformation of complex acoustic information into an easily understandable visual form. The application of maps to the modeling of psychophysical problems, such as the one described here, is restricted by the high number of samples needed for the construction of the models. The high number relates to the high variability of the sensory signals that account for adult perception of the stimulus world.
REFERENCES
1. T. Kohonen. Self-Organizing Maps. Berlin, Springer 1995, 2nd ed. 1997.
2. L. Leinonen and H. Poppius. Allergy 52 (1997) 27.
3. L. Leinonen, J. Kangas, K. Torkkola and A. Juvas. J. Speech Hear. Res. 35 (1992) 287.
4. L. Leinonen, T. Hiltunen, J. Kangas, A. Juvas and H. Rihkanen. Scand. J. Log. Phon. 18 (1993) 159.
5. H. Rihkanen, L. Leinonen, T. Hiltunen and J. Kangas. J. Voice 8 (1994) 320.
6. L. Leinonen, T. Hiltunen, I. Linnankoski and M.-L. Laakso. J. Acoust. Soc. Am. 102 (1997) 1853.
7. L. Leinonen, T. Hiltunen, K. Torkkola and J. Kangas. J. Acoust. Soc. Am. 93 (1993) 3468.
8. M.-L. Haapanen, L. Liu, T. Hiltunen, L. Leinonen and J. Karhunen. Folia Foniatr. Logop. 48 (1996) 35.
9. J.W. Sammon, Jr. IEEE Trans. Comp. C-18 (1969) 401.
10. L. Leinonen, T. Hiltunen, M.-L. Laakso, H. Rihkanen and H. Poppius. Folia Phoniatr. Logop. 49 (1997) 9.
11. S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi. Science 220 (1983) 671.
12. K. Valkealahti and A. Visa. Proceedings of the 9th Scandinavian Conference on Image Analysis, Stockholm, 2 (1995) 965.
Chemometric analyses with self organising feature maps: a worked example of the analysis of cosmetics using Raman spectroscopy

Royston Goodacre, Naheed Kaderbhai, Aoife C. McGovern and Elizabeth A. Goodacre
Institute of Biological Sciences, University of Wales, Aberystwyth, Ceredigion, Wales, UK
Telephone: +44 (0)1970 621947; Fax: +44 (0)1970 622354; E-mail: [email protected]

QUOTATION
"Chemometrics is a chemical discipline that uses mathematics, statistics and formal logic (a) to design or select optimal experimental procedures; (b) to provided maximum relevant chemical information by analysing the chemical data; and (c) to obtain knowledge about chemical systems." [1]
1. ABSTRACT

Dispersive Raman spectroscopy was used to gain very high dimensional (1735 Raman scatter shifts) chemical fingerprints from perfumes and lipsticks. Spectral acquisition was reproducible, accurate, non-invasive, and very rapid; the typical analysis time per sample was only 1 minute. To observe the relationships between these cosmetics, based on their spectral fingerprints, it was necessary to reduce the dimensionality of these hyperspectral data by unsupervised feature extraction methods. The neural computational pattern recognition technique of self organising feature maps (SOMs) was therefore employed, and the clusters observed were compared with the groups obtained from the more conventional statistical approaches of principal components analysis (PCA) and hierarchical cluster analysis (HCA). All chemometric cluster analyses gave identical results. For the SOM analysis of the lipsticks a very successful exploratory analysis was performed. SOMs were also able unequivocally to classify and to identify all the perfumes analysed. This study demonstrates the potential of dispersive Raman spectroscopy for the non-invasive, non-destructive discrimination of perfumes and lipsticks, and may find application in authenticity testing of perfumes and as a forensic investigative tool for the identification of lipsticks from crime scenes.

2. INTRODUCTION

There is a continuing requirement for rapid, accurate, automated methods to characterise chemical systems, for instance in determining whether a particular flavour, essential oil, fragrance or perfume has the provenance claimed for it or whether it has been adulterated with or fraudulently substituted by a lower-grade material. Enantioselective (chirospecific) capillary
gas chromatography, coupled with isotope ratio mass spectrometry (to determine stable isotope ratios) [2,3], and site-specific natural isotope fractionation nuclear magnetic resonance (SNIF-NMR) [4,5] have been used to authenticate particular flavours or fragrances with some success.
Lipstick smears are sometimes found as evidence on clothing, cigarette butts, bedding, and miscellaneous crime scene surfaces. In forensic investigations, lipstick can frequently provide highly useful evidence about the linkage between a person and a crime scene. Thus non-destructive and non-invasive methods of forensic identification are valuable for these investigations. Various analytical methods have been applied to the characterisation of lipsticks, and these include those based on purge-and-trap gas chromatography [6,7], infrared spectroscopy [8], and surface enhanced resonance Raman scattering spectroscopy [9]. Dispersive Raman spectroscopy is a physico-chemical method which measures the vibrations of bonds within functional groups [10-13] by measuring the exchange of energy with EM radiation of a particular wavelength of light (e.g., a 785 nm near-infrared diode laser). This exchange of energy results in a measurable Raman shift in the wavelength of the incident laser light. The Raman effect (Figure 1) is however very weak, since only 1 in every 10^8 photons exchanges energy with a molecular bond vibration, and the rest of the photons are Rayleigh scattered (that is to say, scattered with the same frequency as the incident monochromatic (ν0) laser light). The Raman shift can result in two lines, ν0 - νm and ν0 + νm, which are called the Stokes and anti-Stokes lines, respectively. The Stokes Raman shift is approximately 10 times stronger
than anti-Stokes Raman scattering; thus the Stokes lines are usually collected and can be used to construct a Raman 'fingerprint' of the sample (Figure 2). Since different bonds scatter different wavelengths of EM radiation, these Raman 'fingerprints' are made up of the vibrational features of all the sample's components. Therefore, this method gives quantitative information about the total chemical composition of a sample, without its destruction, and produces 'fingerprints' which are reproducible and distinct for different materials. Multivariate data (such as those generated by Raman spectroscopy) consist of the results of observations of many different characters or variables (light frequency shifts) for a number of individuals or objects [14]. Each frequency shift (wavenumber) may be regarded as constituting a different dimension, such that if there are n variables (where n = 1735 measurements) each object may be said to reside at a unique position in an abstract entity referred to as n-dimensional hyperspace. This hyperspace is necessarily difficult to visualise, and an underlying theme of multivariate analysis is thus simplification [15,16] or dimensionality reduction, which usually means that we want to summarise a large body of data by means of relatively few parameters, preferably the two or three which lend themselves to graphical display, with minimal loss of information. The reduction of high dimensional multivariate data is often carried out using principal components analysis (PCA) [15,17,18]. This type of analysis falls into the category of "unsupervised learning", in which the relevant multivariate algorithms seek "clusters" in the data [19]. This allows the investigator to group objects together on the basis of their perceived closeness in the n-dimensional hyperspace referred to above. Such methods, then, although in some sense quantitative, are better seen as qualitative, since their chief purpose is merely to distinguish objects or populations. Recently there has been an interest in the use of neural computation methods which can also perform unsupervised learning on multivariate data; the most commonly used are self-organising (feature) maps (SOMs) [20] and auto-associative artificial neural networks [21]. Auto-associative ANNs have been used to reduce the dimensionality of the infrared spectra of polysaccharides and hence extract spectral features due to polysaccharides [22], to detect plasmid instability using on-line measurements from an industrial fermentation producing a recombinant protein expressed by Escherichia coli [23], for effecting exploratory cluster analyses of pyrolysis mass spectra [24,25], and for knowledge extraction in chemical process control [26]. More recently, non-linear PCA (NLPCA) using back-propagation ANNs has been used for image coding [27] and image processing [28], and for electrocardiogram (ECG) analysis for detecting ischemia in patients [29]. Wilkins et al. [30,31] have applied Kohonen maps to multi-dimensional flow cytometric data for the identification of seven species of fresh water phytoplankton, and Goodacre and colleagues have also exploited these SOMs successfully to carry out unsupervised learning from pyrolysis mass spectra, and hence the classification of canine Propionibacterium acnes isolates [32], P. acnes isolated from man [33], and plant seeds [24]. Within chemometrics, SOMs have also been used to detect and classify human blood plasma lipoprotein lipid profiles
on the basis of 1H NMR spectroscopic data [34], for cluster analysis of multivariate satellite data [35], and for seismological surveys of earthquakes and quarry blasts [36]. The aim of this study was to exploit SOMs for the chemometric analysis of the very high dimensional Raman spectra obtained from the direct non-invasive analysis of lipsticks and perfumes. The clustering results obtained from this neural computation method were compared with the conventional approaches of principal components analysis and hierarchical cluster analysis; the clusters produced by all methods were very similar.
3. MATERIALS AND METHODS

3.1. Samples
Two sets of cosmetics were analysed:
• Set 1 comprised the six eau de toilette perfumes: Chloe, Jean-Louis Scherrer, Cerruti, White Diamonds Elizabeth Taylor, Red Door, and Chloe Narcisse. In addition, since these perfumes were mostly composed of ethanol, 80% ethanol was also analysed.
• Set 2 comprised four 'St Michael Classics' lipsticks purchased from Marks and Spencer. These were named Champagne, Damson, Mango, and Pink Frost.
3.2. Dispersive Raman microscopy
Spectra were collected using a Renishaw System 100 dispersive Raman spectrometer (Renishaw plc., Old Town, Wotton-under-Edge, Gloucestershire, UK) as described in [37-40], with a near-infrared 785 nm diode laser with the power at the sampling point typically at 80.1 mW. The instrument grating was calibrated using neon lines [41] and was routinely checked with a silicon wafer centred at 520 cm-1. A spectrum from each sample was collected for 10 s using the continuous extended scan (so that the actual collection time was 60 s). Two millilitres of each perfume sample was pipetted into a 2 ml Supelco vial (Supelco Park, Bellfonte, PA, USA). The vial was placed into a pre-fixed sample holder such that the laser was focused into the centre of the vial (12 mm from the collection lens). Samples were analysed six times. For the direct analysis of the lipsticks, these were placed into a pre-fixed sample holder such that the laser was focused onto the surface of the lipstick. Samples were analysed six times and two spectra were obtained from three different locations on each of the four lipsticks. The GRAMS WiRE software package (Galactic Industries Corporation, 395 Main Street, Salem, NH, USA) running under Windows 95 was employed for instrument control and data capture. Spectra were collected over 100-3000 cm-1 wavenumber shifts with 1735 data points; the spectral resolution was therefore ~1.67 cm-1. The data were displayed as the intensity of Raman photon counts against the Stokes Raman shift in wavenumbers (see Figure 2 for typical spectra). ASCII data were exported from the GRAMS WiRE software used to control the Raman instrument into Matlab version 5 (The MathWorks, Inc., 24 Prime Park Way, Natick, MA, USA), which runs under Microsoft Windows NT on an IBM-compatible PC. To minimise
problems arising from cosmic rays and any small baselines from fluorescence interference, the following procedure was implemented: (i) any cosmic rays (which excite the CCD detector) were removed using a median filter with a window of 9 data points; (ii) the baselines were then removed by subtracting from these spectra the median average using a width of 250 (lipsticks) or 750 (perfumes) data points.
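A minimal sketch of this despiking and baseline-removal step is given below, written in Python with SciPy rather than the Matlab routines actually used. The window widths follow the text (9 points for cosmic-ray spikes, 250 or 750 points for the baseline, rounded up to odd lengths because a median filter requires an odd window), while the final normalisation step is our own convention.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess_raman(spectrum, baseline_width=251):
    """Remove cosmic-ray spikes and a slowly varying fluorescence baseline
    from a 1-D Raman spectrum (here 1735 photon counts)."""
    despiked = medfilt(spectrum, kernel_size=9)               # 9-point median filter
    baseline = medfilt(despiked, kernel_size=baseline_width)  # running-median baseline
    corrected = despiked - baseline
    return corrected / np.sum(np.abs(corrected))              # normalisation (our choice)

# Example: lipstick spectra use a 251-point baseline window, perfume spectra a
# 751-point window (250 and 750 in the text, made odd here).
spectrum = np.random.poisson(100, size=1735).astype(float)
clean = preprocess_raman(spectrum, baseline_width=751)
```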
3.3. Cluster analyses
To observe the natural relationships between samples, the normalised data were analysed by principal components analysis (PCA) [14-17,19,42,43], according to the NIPALS algorithm [44], using the program Matlab. PCA is a well-known technique for reducing the dimensionality of multivariate data whilst preserving most of the variance; it does not take account of any groupings in the data, neither does it require that the populations be normally distributed, i.e. it is a nonparametric method. Moreover, PCA can be used to identify correlations amongst a set of variables and to transform the original set of variables into a new set of uncorrelated variables called principal components (PCs). The objective of PCA is to see if the first few PCs account for most (>90%) of the variation in the original data. If they do reduce the number of dimensions required to display the observed relationships, then the PCs can more easily be plotted and 'clusters' in the data visualised; moreover, this technique can be used to detect outliers. For hierarchical cluster analysis, the Euclidean distances between the samples in PCA space (components 1, 2 and 3) were used to construct a similarity measure, with the Gower similarity coefficient Sc [45], and these distance measures were then processed by an agglomerative clustering algorithm to construct a dendrogram [46].
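The PCA-plus-HCA route just described can be sketched as follows. For brevity the principal components are obtained here from an SVD rather than the NIPALS algorithm [44], and average-linkage clustering on Euclidean distances in PC space stands in for the Gower-coefficient step, so this is an approximation of the published procedure rather than a reproduction of it; the data in the example are random placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def first_principal_components(X, n_components=3):
    """Project mean-centred spectra onto their first principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T        # scores on PCs 1..n_components

# Example: 42 spectra of 1735 points -> scores on PCs 1-3 -> dendrogram.
X = np.random.randn(42, 1735)
scores = first_principal_components(X, n_components=3)
Z = linkage(scores, method="average", metric="euclidean")
tree = dendrogram(Z, no_plot=True)         # or plot with matplotlib for a figure
```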
3.4. Self Organising Feature Maps (SOMs)
All SOM analyses (or Kohonen artificial neural networks; KANNs) were carried out with a user-friendly neural network simulation program, NeuFrame version 3,0,0,0 (Neural Computer Sciences, Lulworth Business Centre, Nutwood Way, Totton, Southampton, Hants), which runs under Microsoft Windows NT on an IBM-compatible personal computer, according to the general principles outlined by Kohonen [47]. KANNs provide a way of classifying data through self-organising networks of artificial neurons. The SOMs used in this work consisted of a two-dimensional network of neurons arranged on a square or rectangular grid. Each neuron was connected to its eight nearest neighbours on the grid. The neurons store a set of weights (a weight vector), each of which corresponds to one of the inputs in the data. Thus, for Raman data consisting of 1735 photon counts, each node stores 1735 weights in its weight vector. Upon presentation of a Raman spectrum (represented as a vector consisting of the 1735 normalised photon counts) to the network, each neuron calculates its "activation level". A node's activation level is defined as:
\mathrm{activation} = \sqrt{ \sum_{i=1}^{n} (\mathrm{weight}_i - \mathrm{input}_i)^2 } \qquad (1)
This is simply the Euclidean distance between the points represented by the weight vector and the input vector in n-dimensional space. Thus a node whose weight vector closely matches the input vector will have a small activation level, and a node whose weight vector is very different from the input vector will have a large activation level. The node in the network with the smallest activation level is deemed to be the "winner" for the current input vector. During the training process the network is presented with each input pattern in turn, and all the nodes calculate their activation levels as described above. The nodes included in the set which are allowed to adjust their weights are said to belong to the "neighbourhood" of the winner. The winning node and some of the nodes around it are then allowed to adjust their weight vectors to match the current input vector more closely, by an amount depending upon the distance from the most active node, the current size of the neighbourhood and the current value of α. This is the usual triangular shape [48]; thus if the neighbourhood size is 2 then the winning node can update its weights by 1 x α and the surrounding 8 nodes by 0.5 x α; likewise, if the neighbourhood size is 3 then the winning node can update its weights by 1 x α, the surrounding 8 nodes by 0.66 x α, and the 16 outer nodes by 0.33 x α. The size of the winner's neighbourhood is varied throughout the training process. Initially 60% of the nodes in the network are included in the neighbourhood of the winner, but as training proceeds the size of the neighbourhood is decreased linearly after each presentation of the complete "training set" (all the Raman spectra being analysed), until it includes only the winner itself. The amount by which the nodes in the neighbourhood are allowed to adjust their weights is also reduced linearly through the training period. The factor which governs the size of the weight alterations is known as the learning rate and is represented by α. The adjustments to each item in the weight vector (where δw is the change in the weight) are made in accordance with the following:

\delta w_i = -\alpha \, (w_i - \mathrm{input}_i) \qquad (2)
This is carried out for i = 1 to i = n, where in this case n = 1735. The initial value for α is 0.3 and the final value is 0.
The effect of the "learning rule" (weight update algorithm) is to distribute the neurons evenly throughout the region of n-dimensional space populated by the training set [20,47,49,50]. The neuron with the weight vector closest to a given input pattern will win for that pattern and for any other input patterns that it is closest to. Input patterns which allow the same node to win are then deemed to be in the same group, and when a map of their relationship is drawn a line encloses them. By training with networks of increasing size, a map with several levels of groups or "contours" can be drawn. These contours, however, may sometimes cross; this appears to be due to a failure of the SOM to converge to an even distribution of neurons over the input space [51]. Construction of these maps allows close examination of the relationships between the items in the training set, which in this case consisted of the normalised Raman spectra derived from the lipsticks or perfumes. Networks on grids of 1x1, 2x1, 2x2, 3x2, 3x3, and 4x3 nodes were trained for 350 epochs and used to group the samples. The SOMs were allowed to "wrap around" so that they formed toroidal structures; this was in order to avoid the edge effects which would otherwise tend to corrupt very small networks of this type. Although using toroidal KANNs means that the maximum topological distance for an NxN SOM is decreased from N to N/2, this does not limit the discriminatory ability, because a succession of larger KANNs was employed to assign quantitative differences and hence clustering. Indeed, it is true to say that it is very difficult to assign quantitative meaning to a single KANN. Finally, after training, new Raman spectra can be presented to the SOMs and these spectra will be clustered automatically into one of the nodes in the Kohonen output layer. This allows the identity of the cosmetics to be elucidated.
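The training procedure just described (the Euclidean activation of Eq. (1), the weight update of Eq. (2) scaled by a triangular neighbourhood, a learning rate decreasing linearly from 0.3 to 0, and a toroidal grid) can be sketched as follows. This is a minimal Python illustration rather than the NeuFrame implementation used in the study, and the exact neighbourhood schedule, initialisation and example data are assumptions.

```python
import numpy as np

def train_som(data, rows, cols, epochs=350, lr_start=0.3, lr_end=0.0, seed=0):
    """Minimal SOM: Euclidean activation (Eq. (1)), the update of Eq. (2)
    scaled by a triangular neighbourhood, linearly decreasing learning rate
    and neighbourhood size, on a toroidal ('wrap-around') grid."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    grid = np.array([rows, cols])
    radius_start = max(rows, cols) / 2.0          # initial neighbourhood: an assumption

    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        lr = lr_start + frac * (lr_end - lr_start)       # alpha: 0.3 -> 0, linearly
        radius = max(radius_start * (1.0 - frac), 1.0)   # shrinking neighbourhood
        for x in rng.permutation(data):
            activation = np.linalg.norm(weights - x, axis=1)   # Eq. (1)
            winner = coords[np.argmin(activation)]
            d = np.abs(coords - winner)
            d = np.minimum(d, grid - d)                  # toroidal grid distance
            grid_dist = np.sqrt((d ** 2).sum(axis=1))
            h = np.clip(1.0 - grid_dist / radius, 0.0, None)[:, None]  # triangular
            weights += lr * h * (x - weights)            # Eq. (2), scaled by h
    return weights

# Example: 30 six-dimensional patterns on a 3x2 toroidal map (50 epochs only).
trained = train_som(np.random.randn(30, 6), rows=3, cols=2, epochs=50)
```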
3.5. Dendrogram construction from SOMs
Construction of dendrograms from SOMs was conducted as previously detailed in [24]. The construction of a dendrogram begins when only a single node is used in the Kohonen layer and all x spectra necessarily group together; this is the starting point in the 1x1 zone of the Kohonen layer sizes. Visual inspection of the trained Kohonen output layer of the 2x1 topology allows further separation of the x spectra; these again can be drawn as two lines joining the 1x1 to the 2x1 layer, one for each of the nodes (clusters). Likewise, further separation of these 'sub-clusters' will be seen in the 2x2 Kohonen output layer of the next SOM, and can again be depicted by lines connecting the 2x1 and 2x2 zones. By progressively increasing the number of nodes to 6, 9 and 12 in the output
layer of the 3x2, 3x3 and 4x3 SOMs, more detailed discriminations are found. Moreover, the constructed dendrogram will allow quantitative information, in terms of the degree of relatedness of the samples, to be extracted from otherwise qualitative SOMs.

4. RESULTS AND DISCUSSION
4.1. The chemometric analysis of perfumes using SOMs
Typical Raman spectra from the perfume Red Door and 80% ethanol are shown in Figure 2. The reason that 80% ethanol was included in this study was because perfume typically consists of 80% ethanol. It is obvious that all of the spectral features observed in the ethanol spectrum are also seen in this perfume, and indeed in all five others (data not shown). Ethanol is a relatively simple molecule and its Raman vibration modes are also detailed in this figure (from [52]). Nevertheless, despite these obvious features, there were other smaller peaks observed in the Raman spectra of the perfumes. The complexity of the spectra was such that the classification (or clustering) of these spectra would not be possible by simple visual inspection, and this readily illustrates the need to employ chemometric techniques for the cluster analysis of Raman data. The next stage was to employ unsupervised learning to cluster the perfume samples. As detailed above, various SOMs, with different numbers of nodes in the Kohonen output layer, were trained with the first three replicates of each of the six perfumes and 80% ethanol; the other three replicates were reserved as an independent test set. The results from the 2x1, 2x2, 3x2, 3x3, and 4x3 SOMs are detailed in Figure 3. It can be seen that even the simplest (in discriminatory terms) 2-node output SOM separated 80% ethanol from all the other perfumes. As the number of nodes was increased, more discrimination in the perfumes can be seen; for example, in the 3x2 SOM Red Door and Chloe separate from the other perfumes and stay clustered together in the 3x3 SOM, and then are recovered separated in the 4x3 SOM. The clustering is often difficult to describe when only the output layers are viewed; therefore a dendrogram was constructed (as detailed above) and this is depicted in Figure 4. This allows an easier interpretation of the quantitative differences between the perfumes: Red Door and Chloe are similar to one another, as are Scherrer and Cerruti, whilst Chloe Narcisse and White Diamonds are very different from all the other perfumes.

Figure 5. Principal components analysis plot based on the normalised triplicate Raman spectra from the six perfumes and 80% ethanol (upper case letters). The first three principal components are displayed and they accounted for 99.97% of the total variation. The test set was 21 Raman spectra (lower case letters) which were projected into this PC space. Legend: A,a = Red Door; B,b = Chloe; C,c = Scherrer; D,d = Cerruti; E,e = White Diamonds; F,f = Chloe Narcisse; G,g = Ethanol (80%).
To test the quantitative clustering afforded by these dendrograms, the same data were also analysed by the multivariate statistical analysis of PCA. The results of PCA are shown in Figure 5 as a pseudo-three-dimensional scatter plot, and it was indeed evident that there was a great deal of congruence between this PCA plot and the dendrogram constructed from SOMs (Figure 4). Finally, since the 4x3 SOM separated all the perfumes unequivocally, the next stage was to challenge this SOM with the 21 spectra in the test set. The winning node for each of the Raman spectra in the test set is shown in Figure 3, and it can be seen that all six perfumes and the 80% ethanol were correctly identified. Again, this was in total agreement with the projection of these Raman spectra into PCA space.
4.2. The exploratory analysis of lipsticks using SOMs
The Raman spectra from the lipsticks were more complex than those collected from the perfumes (vide infra): generally more peaks were seen and some fluorescence was also observed. This fluorescence was seen as a broad baseline and could be easily removed by subtracting from these spectra the sliding median average of a wide number of data bins (this pre-processing of the spectra was performed as detailed above).
The six replicate lipstick Raman spectra (1 and 2) were used to train a variety of SOMs, and the topological contour map constructed from these computations is shown in Figure 6. It can be seen from this map that all six (1 and 2) replicates of Pink Frost were very different from the other lipsticks, and that 5 of these replicates were very similar to one another whilst the sixth was slightly different. All six replicates from the lipstick called Mango also clustered tightly together. By contrast, although the Damson and Champagne lipsticks clustered loosely together, three replicates from each lipstick could be easily separated.

Figure 7. Dendrogram representing the relationships between the four lipsticks based on the normalised six replicate Raman spectra. The dendrogram was constructed from the first three principal components, which account for 94.91% of the total variation.

To test the validity of this neural computational-based clustering, the multivariate statistical analysis of HCA was used to construct a dendrogram from the first three principal components from PCA (Figure 7). This dendrogram showed complete congruence with the clustering shown in the topological contour map (Figure 6); therefore the unexpected clustering observed above was indeed correct. From these analyses two questions need to be answered: (1) can the outlier in the Pink Frost be realistically explained? and (2) why are the replicates from the Damson and Champagne lipsticks so poorly clustered? The first question is easily answered by plotting the dispersive Raman spectra of the six replicates from the Pink Frost lipstick. Figure 8 shows these plots and one can easily see that five of the spectra effectively superimpose whilst the sixth outlier spectrum is slightly different. This shows that SOMs can be used as a powerful exploratory analysis tool for the detection of outliers, in a similar fashion to PCA and HCA. In order to answer the second question, a further six replicates of the Damson lipstick were analysed in an identical way to that described above; briefly, the same location on the lipstick was analysed twice before moving the lipstick and analysing a different surface spot. These and the other previously collected spectra were analysed by SOMs with output grids of 1x1, 2x1, 2x2, 3x2, 3x3, 4x3, 4x4, and 5x4 nodes, and a dendrogram was constructed (Figure 9).

Figure 9. Dendrogram produced using SOMs trained with the normalised 12 replicate Raman spectra from the Damson lipstick. Networks on grids of 1x1, 2x1, 2x2, 3x2, 3x3, 4x3, 4x4, and 5x4 nodes were trained for 350 epochs. Note that six different sites on this lipstick were analysed in duplicate (first and second analyses).

It can be seen that the samples cluster according to whether it was the first or the second Raman analysis on the same location. This was also found to be the case for the
Champagne lipstick (data not shown), and when these two lipsticks were visually inspected it was found that the lipstick had melted where the 80 mW laser beam had been focused onto its surface. By contrast, for Pink Frost and Mango there was no visible blemish on the surface of these lipsticks. Plots of the 12 Raman spectra from the Damson lipstick are shown in Figure 10, and it can be seen that the second analysis (black lines) was much more reproducible than the first analysis (grey lines). This can also be observed in the SOM-dendrogram (Figure 9), where there was much more overlap in the lines connecting the different output layer sizes in the second analysis; that is to say, the spectra were so similar that the winning nodes described different subsets of the same six spectra. The probable reason that the second analysis was more reproducible than the first is that in the first analysis the lipstick was being melted by the laser and there was a transition from solid to liquid lipstick. It is known that there is more freedom for molecules to move in the liquid phase than in the solid, and this would affect the Raman scattering. By contrast, because the second analysis was on the same location on the lipstick as the first analysis, the lipstick was already melted and so there would be no change in the vibrational modes of Raman scattering.

5. CONCLUDING REMARKS

Dispersive Raman spectra were obtained non-invasively from a wide variety of perfumes and lipsticks. The artificial neural network pattern recognition technique of SOMs, based on unsupervised learning, was compared with the statistical approaches of PCA and HCA, and all chemometric cluster analyses gave identical results. For the SOM analysis of the lipsticks a very successful exploratory analysis was performed. SOMs were also able unequivocally to classify and to identify all the perfumes analysed. Whilst fuzzified Kohonen clustering of FT-Raman spectra of nitro-containing explosive materials has been described previously [53], this is the first application of SOMs to the analysis of dispersive Raman spectra. The construction of these dendrograms from SOMs was first described by this author for the analysis of pyrolysis mass spectra from seeds [24]. In the present study, dendrograms constructed from a range of SOMs with different sizes of Kohonen output layers, trained with Raman spectra, allow a visual simplification of the groups and quantitative information on their relationship to one another to be gained. Furthermore, the results of feature extraction depicted in dendrograms are easier to interpret than either tabulated results, topological contour maps, or inspection of the individual SOMs.
This study demonstrates the potential of dispersive Raman spectroscopy for the non-invasive, non-destructive discrimination of perfumes and lipsticks. This method may therefore be a realistic candidate for determining whether a fragrance or perfume has the provenance claimed for it or whether it has been adulterated with, or fraudulently substituted by, a lower-grade material. Moreover, dispersive Raman, like surface-enhanced resonance Raman scattering spectroscopy [9], may also be a useful forensic investigative tool for the identification of lipsticks from crime scenes.

6. ACKNOWLEDGEMENTS

We are very grateful to Dr Ken Williams of Renishaw plc for useful discussions regarding Raman spectroscopy. R.G. is indebted to the Wellcome Trust for financial support (grant number 042615/Z/94/Z), and A.C.M. and N.K. thank the UK BBSRC for financial support.

7. REFERENCES
1. D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. DeJong, P.J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam, 1997.
2. A. Mosandl, Journal of Chromatography, 624 (1992) 267-292.
3. H. Casabianca, J.B. Graft, P. Jame, C. Perrucchietti and M. Chastrette, HRC-J. High Res. Chromatography, 18 (1995) 279-285.
4. G.J. Martin, C. Coulomb and S. Haneguelle, Abstracts of Papers of the American Chemical Society, 202 (1991) 197.
5. G.J. Martin, ACS Symp. Ser., 596 (1995) 79-93.
6. Y. Ehara, N. Oguri, S. Saito and Y. Marumo, Bunseki Kagaku, 46 (1997) 733-736.
7. Y. Ehara and Y. Marumo, Forensic Sci. Inter., 96 (1998) 1-10.
8. E.G. Bartick, M.W. Tungol and J.A. Reffner, Anal. Chim. Acta, 288 (1994) 35-42.
9. C. Rodger, V. Rutherford, D. Broughton, P.C. White and W.E. Smith, Analyst, 123 (1998) 1823-1826.
10. J.G. Graselli and B.J. Bulkin, Analytical Raman Spectroscopy, Wiley, New York, 1991.
11. J.R. Ferraro and K. Nakamoto, Introductory Raman Spectroscopy, Academic Press, 1994.
12. N.B. Colthup, L.H. Daly and S.E. Wiberly, Introduction to Infrared and Raman Spectroscopy, Academic Press, New York, 1990.
13. B. Schrader, Infrared and Raman Spectroscopy: Methods and Applications, Verlag Chemie, Weinheim, 1995.
14. H. Martens and T. Næs, Multivariate Calibration, John Wiley, Chichester, 1989.
15. I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
16. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis, Chapman & Hall, London, 1980.
17. D.R. Causton, A Biologist's Advanced Mathematics, Allen and Unwin, London, 1987.
18. W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, Oxford, 1988.
19. B.S. Everitt, Cluster Analysis, Edward Arnold, London, 1993.
20. T. Kohonen, Self-Organising Maps, Springer, Berlin, Heidelberg, New York, 1997.
21. M.A. Kramer, AIChE J., 37 (1991) 233-243.
22. S.P. Jacobsson, Anal. Chim. Acta, 291 (1994) 19-27.
23. G. Montague and J. Morris, Trends Biotechnol., 12 (1994) 312-324.
24. R. Goodacre, J. Pygall and D.B. Kell, Chem. Intell. Lab. Sys., 34 (1996) 69-83.
25. R. Goodacre, D.J. Rischert, P.M. Evans and D.B. Kell, Cytotechnol., 21 (1996) 231-241.
26. D.R. Kuespert and T.J. McAvoy, Chem. Eng. Comm., 130 (1994) 251-264.
27. D. Tzovaras and M.G. Strintzis, IEEE Trans. Image Proc., 7 (1998) 1218-1223.
28. R. Bowden, T.A. Mitchell and M. Sarhadi, Electronics Lett., 33 (1997) 1858-1859.
29. T. Stamkopoulos, K. Diamantaras, N. Maglaveras and M. Strintzis, IEEE Trans. Sig. Proc., 46 (1998) 3058-3067.
30. M.F. Wilkins, L. Boddy and C.W. Morris, Bin. Comput. Microbiol., 6 (1994) 64-72.
31. M.F. Wilkins, L. Boddy, C.W. Morris and R. Jonker, CABIOS, 12 (1996) 9-18.
32. R. Goodacre, M.J. Neal, D.B. Kell, L.W. Greenham, W.C. Noble and R.G. Harvey, J. Appl. Bacteriol., 76 (1994) 124-134.
33. R. Goodacre, S.A. Howell, W.C. Noble and M.J. Neal, Zbl. Bakt. - Int. J. Med. M., 284 (1996) 501-515.
34. J. Kaartinen, Y. Hiltunen, P.T. Kovanen and M. AlaKorpela, NMR Med., 11 (1998) 168.
35. J. Waldemark, Int. J. Neural Sys., 8 (1997) 3-15.
36. M. Musil and A. Plesinger, Bull. Seismological Soc. Amer., 86 (1996) 1077-1090.
37. K.P.J. Williams, G.D. Pitt, D.N. Batchelder and B.J. Kip, Appl. Spectrosc., 48 (1994) 232.
38. K.P.J. Williams, G.D. Pitt, B.J.E. Smith, A. Whitley, D.N. Batchelder and I.P. Hayward, J. Raman Spectrosc., 25 (1994) 131-138.
39. R. Goodacre, E.M. Timmins, R. Burton, N. Kaderbhai, A.M. Woodward, D.B. Kell and P.J. Rooney, Microbiol., 144 (1998) 1157-1170.
40. A.D. Shaw, N. Kaderbhai, A. Jones, A.M. Woodward, R. Goodacre, J.J. Rowland and D.B. Kell, Appl. Spectrosc., (1999) submitted.
41. C.H. Tseng, J.F. Ford, C.K. Mann and T.J. Vickers, Appl. Spectrosc., 47 (1993) 1808.
42. C.S. Gutteridge, L. Vallis and H.J.H. MacFie, in M. Goodfellow, D. Jones and F. Priest (Eds.), Computer-assisted Bacterial Systematics, Academic Press, London, 1985.
43. B. Flury and H. Riedwyl, Multivariate Statistics: A Practical Approach, Chapman and Hall, London, 1988.
44. H. Wold, in K.R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 1966, pp. 391-420.
45. J.C. Gower, Biometrika, 53 (1966) 325-338.
46. B.F.J. Manly, Multivariate Statistical Methods: A Primer, Chapman & Hall, London, 1994.
47. T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1989.
48. J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH Verlagsgesellschaft, Weinheim, 1993.
49. J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, California, 1991.
50. R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, Massachusetts, 1990.
51. E. Erwin, K. Obermayer and K. Schulten, Biol. Cyber., 67 (1992) 47-55.
52. F.R. Dollish, W.G. Fately and F.F. Bently, Characteristic Raman Frequencies of Organic Compounds, John Wiley, New York, 1974.
53. N.W. Daniel, I.R. Lewis and P.R. Griffiths, Appl. Spectrosc., 51 (1997) 1868-1879.
Self-Organizing Maps for Content-Based Image Database Retrieval

E. Oja, J. Laaksonen, M. Koskela, and S. Brandt
Laboratory of Computer and Information Science, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT

We have developed a novel system for retrieving images similar to a given set of reference images in large image databases, based on Tree Structured Self-Organizing Maps (TS-SOMs). Our image retrieval system is called PicSOM. It has been designed to provide a framework for generic research on algorithms and methods for content-based image retrieval. A new technique introduced in this paper facilitates the automatic combination of the responses from multiple TS-SOMs and their hierarchical levels. Each TS-SOM is tuned with a different image feature representation such as color, texture, or shape. This mechanism adapts to the user's preferences in selecting which images resemble each other, in the particular sense the user is interested in. The image queries are performed through the World Wide Web and the queries are iteratively refined as the system exposes more images to the user.

1. Introduction

Content-based image retrieval from unannotated image databases has been an object of ongoing research for a long period [1]. Digital image and video libraries are becoming more widely used as more visual information is produced at a rapidly growing rate. The technologies needed for retrieving and browsing this growing amount of information are still, however, quite immature and limited. This is an elusive scientific problem, still at the exploratory stage, with solid engineering solutions expected to appear in the future.

Many projects have been started in recent years to research and develop efficient systems for content-based image retrieval. The best-known system is Query By Image Content (QBIC) [2], developed at the IBM Almaden Research Center. Other notable systems include MIT's Photobook [3] and its more recent version, FourEyes [4]; the search engine family of VisualSEEk [5], WebSEEk [6], and MetaSEEk [7], which are all developed at Columbia University; and Virage [8], a commercial content-based search engine developed at Virage Technologies Inc.

We introduce here a recently implemented image-retrieval system called PicSOM. It uses a World Wide Web browser as the user interface and the Tree Structured Self-Organizing Map (TS-SOM) [9,10] as the image similarity scoring method. The implementation of our PicSOM system is based on a general framework in which the interfaces of co-operating modules are defined. Therefore, the use of TS-SOMs is only one choice for the similarity measure. However, the results we have obtained so far are very promising regarding the potential of the TS-SOM method.
As far as the current authors are aware, there have until now been no notable image retrieval applications based on the Self-Organizing Map (SOM) [11]. Some preliminary experiments with the SOM have been made previously [12]. MIT's FourEyes image browser uses Self-Organizing Maps to cluster weights for different features [13].

2. Principle of PicSOM

Our method is named PicSOM due to its similarity to the well-known WEBSOM [14,15] document browsing and exploration tool that can be used in free-text mining. WEBSOM is a means for organizing miscellaneous text documents into meaningful maps for exploration and search. It is based on the Kohonen SOM [11], which automatically organizes documents into a two-dimensional grid so that related documents appear close to each other. Up to now, databases of over one million documents have been organized for search using the WEBSOM system.

In an analogous manner, we have aimed at developing a tool that utilizes the strong self-organizing power of the SOM in unsupervised statistical data analysis for digital images. PicSOM is intended as a general framework for multi-purpose content-based image retrieval. The system is designed to be open and able to adapt to different kinds of image databases, ranging from small and domain-specific picture sets to large general-purpose image collections. The features may be chosen separately for each specific task and the system may also use keyword-type textual information for the images, if available. In this paper, we describe the PicSOM system in its current form.

The basic operation of the PicSOM image retrieval is as follows:
1) An interested user connects to the WWW server providing the search engine with her web browser.
2) The system presents a list of databases available to that particular user. Later, there will also be a list of available search strategies; currently only the TS-SOM-based engine has been implemented.
3) After the user has selected the database, the system presents an initial set of tentative images scaled to a small "thumbnail" size. The user then selects the subset of these images which best matches her expectations and is to some degree relevant to her purposes. Then, she hits the "Continue Query" button in her browser, which sends the information on the selected images back to the search engine.
4) The system marks the images selected by the user with a positive value and the non-selected images with a negative value in its internal data structure. Based on this data, the system then presents the user with a new set of images along with the images selected this far.
5) The user again selects the relevant images, submits this information to the system, and the iteration continues. Hopefully, the fraction of relevant images increases in each image set presented to the user and, finally, one of them is exactly what she was originally looking for.

2.1. Feature extraction

PicSOM may use one or several types of statistical features for image querying. Separate feature vectors can thus be formed for describing the colors, textures, and shapes in the images. A separate Tree Structured Self-Organizing Map is then constructed for each feature vector set and these maps are used in parallel to calculate the best-scoring similarity results. The feature selection is not restricted in any way and new features can be added to the system later on, as long as an equal number of features is calculated
from each picture in the database.

Color is a natural and widely-used feature in content-based image retrieval. Common representations for color information in image retrieval include color histograms, color moments, color layouts and the recent color correlograms. In PicSOM, average R-, G-, and B-values are calculated in five separate regions of the image, as seen in Figure 1. This division of the image area increases the discriminating power by providing a simple color layout scheme. The resulting 15-dimensional color feature vector thus not only describes the average color of the image but also gives information on the spatial color composition.
Figure 1. Image regions used in calculating the color and texture feature vectors.
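The color feature amounts to a short computation. The sketch below is an illustration only: the exact five-region geometry of Figure 1 is not reproduced here, so four quadrants plus a central square are assumed, which may differ from the actual PicSOM layout.

```python
import numpy as np

def color_feature(image: np.ndarray) -> np.ndarray:
    """Average R, G, B values in five image regions -> 15-dim vector.

    `image` is an (H, W, 3) RGB array.  The region geometry (four
    quadrants plus a central square) is an assumption made for this
    sketch; PicSOM uses the regions of its Figure 1.
    """
    h, w, _ = image.shape
    h2, w2 = h // 2, w // 2
    regions = [
        image[:h2, :w2],                              # upper left
        image[:h2, w2:],                              # upper right
        image[h2:, :w2],                              # lower left
        image[h2:, w2:],                              # lower right
        image[h // 4:3 * h // 4, w // 4:3 * w // 4],  # centre
    ]
    # Average colour of each region, concatenated into one vector.
    return np.concatenate([r.reshape(-1, 3).mean(axis=0) for r in regions])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(120, 160, 3)).astype(float)
    print(color_feature(img).shape)   # (15,)
```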
Texture is an innate property of all surfaces and therefore a suitable feature for image retrieval. Texture features for pattern recognition and computer vision have been researched extensively over the past decades and the achievements in the field include co-occurrence matrices, multi-resolution simultaneous autoregressive (MRSAR) models, shift-invariant eigenvector (EV) models, the Wold decomposition, and wavelets, among others. The texture feature vectors in PicSOM are calculated separately in the same five regions as the color features, shown in Figure 1. The Y-values of the YIQ color representation of every pixel's 8-neighborhood are examined and the estimated probabilities for each neighbor pixel being brighter than the center pixel are used as features. This results in five eight-dimensional vectors which are combined into one 40-dimensional texture feature vector.

Shape features can also be used in content-based image indexing. In PicSOM, various shape-describing features have been experimented with. They are all formed from a thresholded binary edge image obtained by convolving the image with Sobel masks of size 3 x 3. The edge filtration is performed on the saturation and intensity components of the HSI color representation and the resulting two binarized edge images are then logically ORed to form the edge image. An example of the images used in the experiments is seen in Figure 2. The corresponding edge image is displayed in Figure 3. The first shape features are based on the histogram of the eight quantized directions of the edges in the image. When the histogram is separately formed in all the five regions seen in Figure 1, 40-dimensional feature vectors are obtained. They describe the distribution
Figure 2. An example of the images used in the experiments.
Figure 3. Edges extracted from Figure 2 and used while forming the shape features.
of edge directions in various parts of the image and thus reveal the shape in a low-level statistical manner. The second shape features are formed from the co-occurrence matrix of neighboring edge elements. As the number of quantized directions is again eight, 320-dimensional vectors are obtained. The third and fourth shape-describing features are based on the Fourier Transform of the binarized edge image. The image sizes are normalized to 512 x 512 pixels before the FFT. The 2-dimensional amplitude spectrum is then smoothed and down-sampled to form feature vectors of 512 coefficients. The formation of the fourth set of shape features is otherwise similar, but the edge image is transformed from Cartesian to polar coordinates before the FFT.

2.2. Tree Structured SOM (TS-SOM)

The Tree Structured Self-Organizing Map (TS-SOM) [9,10] is a tree-structured vector quantization algorithm that uses Self-Organizing Maps (SOMs) [11] at each of its hierarchical levels. In PicSOM, all TS-SOM maps are two-dimensional. The number of map units increases when moving downwards in the TS-SOM. The search space on the underlying SOM level is restricted to a predefined portion just below the best-matching unit on the SOM above. Therefore, the complexity of the searches in the TS-SOM is remarkably lower than if the whole bottommost SOM level were accessed without the tree structure. The structure of the TS-SOM is illustrated in Figure 4.

The computational lightness of the TS-SOM facilitates the creation and use of huge SOMs which, in our PicSOM system, are used to hold the images stored in the image database. The feature vectors (color, texture, or shape) calculated from the images are used to train the levels of the TS-SOMs beginning from the top level. During the training, each feature vector is presented to the map multiple times and the model vectors stored in the map units are modified to match the distribution and topological ordering of the feature vector space. After the training phase, each unit of the TS-SOMs contains a model vector which may be regarded as the average of all feature vectors mapped to that particular unit. In PicSOM, we then search in the corresponding data set for the feature vector which best matches the stored model vector and associate the corresponding image to that map unit. Consequently, a tree-structured hierarchical representation of all the images in the
Figure 4. The structure of a three-layer two-dimensional TS-SOM.
database is formed. In an ideal situation, there should be one-to-one correspondence between the images and TS-SOM units in the bottom level of each map.
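The restricted search of the TS-SOM can be sketched as follows. The code is a simplified illustration, not the PicSOM implementation: it assumes square levels whose side length grows by a factor of four (matching the 4 x 4, 16 x 16, 64 x 64 layers shown later in Figure 5) and an arbitrary small search margin around the children of the previous best-matching unit.

```python
import numpy as np

def tssom_bmu_path(levels, x, margin=1):
    """Best-matching unit of `x` on each TS-SOM level.

    `levels` is a list of codebooks; level l has shape (s_l, s_l, d) and
    s_{l+1} = 4 * s_l, so each unit has a 4 x 4 block of children.  Below
    the top level the search is restricted to those children plus a small
    `margin` of surrounding units -- the exact window used in PicSOM may
    differ; this only illustrates the principle of the tree search.
    """
    path = []
    r0 = c0 = 0
    for depth, codebook in enumerate(levels):
        side = codebook.shape[0]
        if depth == 0:
            rows = cols = np.arange(side)   # full search on the top level
        else:
            rows = np.clip(np.arange(4 * r0 - margin, 4 * r0 + 4 + margin), 0, side - 1)
            cols = np.clip(np.arange(4 * c0 - margin, 4 * c0 + 4 + margin), 0, side - 1)
        window = codebook[np.ix_(rows, cols)]
        dist = np.linalg.norm(window - x, axis=2)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        r0, c0 = int(rows[i]), int(cols[j])
        path.append((r0, c0))
    return path

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 15
    levels = [rng.normal(size=(s, s, dim)) for s in (4, 16, 64)]
    print(tssom_bmu_path(levels, rng.normal(size=dim)))
```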
2.3. Using multiple TS-SOMs

Combining the results from several feature maps can be done in a number of ways. A simple method would be to ask the user to enter weights for different maps and then calculate a weighted average. This, however, requires the user to give information which she normally does not have. Generally, it is a difficult task to give low-level features such weights which would coincide with human perception of images at a more conceptual level. Therefore, a better solution is to combine the results of multiple maps automatically, using the implicit information from the user's responses during the query. The PicSOM system thus tries to learn the user's preferences from the interaction with her and sets its own responses accordingly.

The rationale behind our approach is as follows: if the images selected by the user map close to each other on a TS-SOM map, it seems that the corresponding feature performs well on the present query and the relative weight of its opinion should be increased. This can be implemented simply by marking on the maps the images shown to the user until now with positive and negative values, depending on whether she has selected or rejected them, respectively. The mutual relations of positively-marked units residing near to each other can then be enhanced by convolving the maps with a simple low-pass filtering mask. As a result, those areas which have many positively marked images spread the positive response to their neighboring map units. The images associated with these units are then good candidates for the next images to be shown to the user, if they have not been shown already. The current PicSOM implementation uses convolution masks whose values decrease as the 4-neighbor or "city-block" distance from the mask center increases. The convolution mask size increases as the size of the SOM layer increases.

Figure 5 shows a set of convolved feature maps during a query. The three images on the left represent three map levels on the Tree Structured SOM for the RGB color feature, whereas the convolutions on the right are calculated on the texture map. The sizes of the SOM layers are 4 x 4, 16 x 16, and 64 x 64, from top to bottom. The dark regions have positive and the light regions negative convolved values on the maps. Notice the dark regions in the lower-left corners of the three layers of the left TS-SOM. They indicate that there is a strong response and similarity between images selected by the user in that particular area of the color feature space.
Figure 5. An example of convolved TS-SOMs for color (left) and texture (right) features. Black corresponds to positive and white to negative convolved values.
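A minimal sketch of the map marking and low-pass filtering described above is given below. The +1/-1 marking, the mask radius, and the triangular city-block weights are illustrative assumptions; PicSOM scales the mask with the layer size and normalizes the responses as explained in the next subsection.

```python
import numpy as np

def convolved_response(som_shape, selected, rejected, mask_radius=2):
    """Mark user feedback on one SOM level and low-pass filter it.

    `selected` and `rejected` are lists of (row, col) units holding the
    images the user accepted or rejected; they receive values +1 and -1.
    The mask weights decrease linearly with the city-block distance from
    the mask centre.  Mask size and weights are assumptions of this
    sketch, not the exact PicSOM mask.
    """
    rows, cols = som_shape
    response = np.zeros(som_shape)
    for r, c in selected:
        response[r, c] += 1.0
    for r, c in rejected:
        response[r, c] -= 1.0

    # Low-pass mask built from the city-block ("Manhattan") distance.
    ax = np.arange(-mask_radius, mask_radius + 1)
    cityblock = np.abs(ax)[:, None] + np.abs(ax)[None, :]
    mask = np.maximum(mask_radius + 1 - cityblock, 0).astype(float)
    mask /= mask.sum()

    out = np.zeros_like(response)
    for r in range(rows):
        for c in range(cols):
            if response[r, c] == 0.0:
                continue
            for dr in range(-mask_radius, mask_radius + 1):
                for dc in range(-mask_radius, mask_radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr, cc] += response[r, c] * mask[dr + mask_radius,
                                                             dc + mask_radius]
    return out

if __name__ == "__main__":
    conv = convolved_response((16, 16), selected=[(2, 3), (3, 3)], rejected=[(12, 12)])
    print(round(conv.max(), 3), round(conv.min(), 3))
```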
2.4. Refining queries

In our current PicSOM implementation, all positive values on all convolved TS-SOM layers are sorted in descending order in one list. Then, a preset number, e.g. 15, of the best candidate images which have not been shown to the user before are output as a new tentative image selection. Image retrieval with PicSOM is therefore an iterative process in which new images get selected or rejected by the user. Initially, the query begins with a set of reference images picked from the top levels of the TS-SOMs in use. The SOM map units associated with the selected and rejected images get positive and negative values, respectively. The positive and negative responses are normalized so that their sum equals zero. Previously positive map units can also be changed to negative as the retrieval process iteration continues.

In the early stages of the image query, the system tends to present the user images from the upper TS-SOM levels. As soon as the convolutions begin to produce large positive values also on lower map levels, the images on these levels are shown to the user. The images are therefore gradually picked more and more from the lower map levels as the query is continued. The inherent property of PicSOM to use more than one reference image as the input information for retrievals is important. This feature makes PicSOM different from other content-based image retrieval systems, such as QBIC, which uses only one reference image at a time.
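The selection of the next images to show could then look like the following sketch. The data-structure names (convolved_levels, unit_to_image) are hypothetical stand-ins; the real PicSOM modules organize this information differently.

```python
import numpy as np

def next_images(convolved_levels, unit_to_image, already_shown, n_images=15):
    """Sort all positive convolved values from all maps and levels into
    one list and return the best candidate images not yet shown.

    `convolved_levels` maps a (feature, level) key to a 2-D array of
    convolved responses; `unit_to_image` maps (feature, level, row, col)
    to the image associated with that unit.  Both are assumed interfaces
    used only for this illustration.
    """
    scored = []
    for (feature, level), grid in convolved_levels.items():
        for (r, c), value in np.ndenumerate(grid):
            image = unit_to_image.get((feature, level, r, c))
            if value > 0 and image is not None and image not in already_shown:
                scored.append((float(value), image))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selection = []
    for _, image in scored:
        if image not in selection:
            selection.append(image)
        if len(selection) == n_images:
            break
    return selection

if __name__ == "__main__":
    grid = np.zeros((4, 4))
    grid[1, 1], grid[2, 3] = 0.8, 0.3
    images = {("rgb", 0, 1, 1): "img042.jpg", ("rgb", 0, 2, 3): "img007.jpg"}
    print(next_images({("rgb", 0): grid}, images, already_shown={"img042.jpg"}, n_images=5))
```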
3. Implementation of PicSOM

The issues of the implementation of the PicSOM image retrieval system can be divided into two categories. First, concerning the user interface, we have wanted to make our search engine, at least in principle, available and freely usable to anybody by implementing it in the World Wide Web. This also makes the queries on the databases machine independent, because standard web browsers can be used. Second, the functional components in the server running the search engine have been implemented so that the parts responsible for separate tasks have been isolated into separate processes. The functional interfaces between these processes have then been designed to be open and easily extensible, to allow the inclusion of new features in the system in the future.

3.1. User interface
Figure 6. WWW-based user interface of PicSOM. The user has already selected five aircraft images in the previous rounds. The system is displaying ten new images for the user to select from.
Figure 6 shows a screenshot of the current web-based PicSOM user interface, which can be found at http://www.cis.hut.fi/picsom/. On the top of the page, there are three pulldown menus for examining class information on the RGB color bands, if that information is available for the particular database. The convolved feature maps are shown next on the page. In this query, RGB color and texture maps have been used as seen on the labels above the maps. On color terminals, positive map points are seen as blue and negative as red. White represents zero values. The first row displays images selected on previous rounds of the retrieval process. This example shows a query with five images of airplanes selected. The next images are the
ten best-scoring new images obtained from the convolved units in the TS-SOMs. It seems that these ten images contain four airplanes. Finally, the page has some user-modifiable settings and a "Continue Query" button which submits the new selections back to the search engine. The user can at any time switch from the iterative queries to examine the TS-SOM map surfaces simply by clicking the map images. Relevant images on the maps can then also be selected for continuing queries.
3.2. Parts of the PicSOM system

The current computer implementation of PicSOM has three separate modular components:

picsom.cgi is a CGI/FCGI script which handles the requests and responses from the user's web browser. This includes processing the HTML form, updating the information from previous queries and executing the other components as needed to complete the requests.

picsomctrl is the main program responsible for updating the TS-SOM maps with new positive and negative response values, calculating the convolutions, creating new map images for the next web page, and selecting the next best-scoring images to be shown to the user in the next round.

picsomctrltohtml creates the HTML contents of the new web pages based on the output of the picsomctrl program.

Figure 7 illustrates the components of the current PicSOM system and the operations needed in handling the queries. The numbers indicate the usual order of actions.
Figure 7. The components of the PicSOM system and the operations performed in handling the queries.
4. The experimental image database

Currently, we have made experiments with an image database of 4350 images. Most of them are color photographs in JPEG format. The images were downloaded from the image collection residing at the Swedish University Network FTP server, located at
ftp://ftp.sunet.se/pub/pictures/. PicSOM also supports the utilization of textual class information for the images, if that kind of information is available in the database. The original directory structure of the collection has been used to give the images rough textual content classes. Figure 8 shows a tree-form representation of a small subset of the used classes. The classes on child nodes are subclasses of the classes on their father nodes. For instance, {"cars"} ⊂ {"vehicles"}.
[Figure 8: a tree of image classes including animals, apes, wolves, views, cats, vehicles, aircraft, cars, tv film, trains, and actors.]
Figure 8. A subset of the image classes in the ftp.sunet.se database.
In the user interface, the convolved TS-SOM map views can be changed to maps colored with this external information of the image content. The three color bands in the RGB color space can be used to visualize the spreads of three individual classes on the maps.

5. Quantitative results

A number of measures to evaluate various visual features are presented in this section. Assume a database D containing a total of N images, and an image class C ⊂ D with N_C relevant images. Then, the a priori probability p_C of the class C is

\[ p_C = \frac{N_C}{N} . \qquad (1) \]
An ideal performance measure should be independent of the a priori probability and the type of images in the used image class.

5.1. Observed probability

For each image I ∈ C with a feature vector f_I, we calculate the Euclidean distance d_{L2}(I, J) between f_I and the feature vectors f_J of the other images J ∈ D \ {I} in the database. Then, we sort the images based on their ascending distance to the image I and store the indices of the images in an (N-1)-sized vector g^I. We now have a vector g^I for each I ∈ C containing a sorted permutation of the images in D \ {I} based on their increasing Euclidean distance to I. By g^I_i, we denote the i-th component of g^I. Next, for all images I ∈ C, we define a vector h^I as follows:

\[ \forall i \in [1, N-1]: \quad h_i^I = \begin{cases} 1 & \text{if } g_i^I \in C, \\ 0 & \text{otherwise.} \end{cases} \qquad (2) \]
The vector h^I now has a value of one at location i if the corresponding image belongs to the class C. As C has N_C images, of which one is the image I itself, each vector h^I contains the value h_i^I = 1 in exactly N_C - 1 locations. In order to perform well with the class C, the feature extraction should cluster the images I belonging to C near each other. That is, the values h_i^I = 1 should be concentrated on the small values of i. We can now define the observed probability p_i:

\[ \forall i \in [1, N-1]: \quad p_i = \frac{1}{N_C} \sum_{K \in C} h_i^K . \qquad (3) \]
The observed probability p_i is a measure in [0, 1] of the probability that a given image K ∈ C has an image belonging to the class C as the i-th nearest image according to the feature vector f. A good feature should cluster similar images (in this case, images belonging to C) close to each other and thus the value of p_i should be high for small values of i and decrease monotonically as i grows. In the optimal case, p_i^* = 1 if i ≤ N_C - 1, and p_i^* = 0 if i > N_C - 1. This is equivalent to the situation where all the images in class C are clustered together. In this case, the shortest distance to the closest image not in C is always greater than any of the intra-class distances. This is obviously very rarely the case with the low-level features currently used in content-based image retrieval. Still, p_i can be used as a comparable performance measure for different features. The worst case happens when the feature f completely fails to discriminate the images in class C from the images which do not belong to the class C. The observed probability p_i is then close to the a priori p_C for every value of i ∈ [1, N-1].
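For concreteness, the observed probability of Equation (3) can be computed with a brute-force sketch such as the one below (an O(N^2) illustration that is adequate for databases of a few thousand images).

```python
import numpy as np

def observed_probability(features, labels, target_class):
    """Observed probability p_i (Eq. 3) for one class.

    `features` is an (N, d) array, `labels` a length-N array of class
    ids.  For every image I in the class the other images are ranked by
    Euclidean distance, and p_i is the fraction of class members whose
    i-th nearest image is also in the class.
    """
    members = np.flatnonzero(labels == target_class)
    n = len(features)
    h = np.zeros((len(members), n - 1))
    for row, i in enumerate(members):
        dist = np.linalg.norm(features - features[i], axis=1)
        order = np.argsort(dist)
        order = order[order != i]              # g^I: all images except I itself
        h[row] = (labels[order] == target_class).astype(float)
    return h.mean(axis=0)                      # p_i, i = 1 .. N-1

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    feats = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (20, 5))])
    labels = np.array([0] * 30 + [1] * 20)
    print(np.round(observed_probability(feats, labels, target_class=1)[:5], 3))
```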
5.2. Weighting the observed probability

The observed probability p_i is a function of the distance i, so it cannot easily be used to compare two different features f_1 and f_2. Therefore, it is useful to derive a scalar metric from p_i to enable us to do such comparisons directly. As the large values of p_i with small values of i and small values of p_i with large values of i correspond to good discriminating power of the feature f, the weighting function h(u, x) should respectively reward the large values of p_i when i is small and punish the large values of p_i when i is large. The Discrete Fourier Transform (DFT) P(u) of the observed probability function p_i can be used in forming measures for the discriminating power. P(u) is calculated as follows:

\[ P(u) = \sum_{i=0}^{N-1} p_i\, h(u, i) = \sum_{i=0}^{N-1} p_i\, e^{j 2\pi i u / N} . \qquad (4) \]
The weighting function h(u, x) now equals a point on the complex unit circle with the angle φ = 2πiu/N. The DFT is defined only for integer values of u, but for our weighting purposes there is no reason for this limit. For example, with the parameter value u = 1/2, the weighting function h(1/2, x) rotates in N steps from φ = 0 to φ = π, which equals the upper half of the unit circle. The values of h(1, x) are distributed on the whole unit circle, respectively. With the parameter value u = 0, the metric reduces to the sum of the probabilities p_i and is useful in normalizing the values of P(u).

Suitable values of u could then include, for example, u = 1/2 and u = 1, and performance metrics could include Re{P(u)/P(0)}, Im{P(u)/P(0)}, Abs{P(u)/P(0)} and Arg{P(u)/P(0)}, representing the real part, imaginary part, absolute value, and angle of P(u)/P(0), respectively.
5.3. Comparison of feature extraction methods

In order to assess the indexing ability of the color, texture, and four shape features currently used in PicSOM, a series of experiments were performed. We chose to use two figures of merit to describe the performances. First, a local measure calculated as the average of the observed probability p_i for the first 50 retrieved images, i.e.:

\[ \eta_{\mathrm{local}} = \frac{\sum_{i=1}^{50} p_i}{50} . \qquad (5) \]
The η_local measure obtains values between zero and one. Figures near one can be obtained even though the classes were globally split into many clusters, if each of these clusters is separate from the clusters of the other classes. On the other hand, for a global figure of merit we used the weighted sum of the observed probability p_i calculated as:

\[ \eta_{\mathrm{global}} = \mathrm{Re}\{ P(1/2) / P(0) \} . \qquad (6) \]
This figure again attains values between zero and one. It favors observed probabilities that are concentrated in small indices and additionally punishes for large probabilities in large index values as described above.
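The two figures of merit can be evaluated directly from the observed probability vector, for example as in the following sketch of Equations (5) and (6).

```python
import numpy as np

def eta_local(p, k=50):
    """Average of the observed probability over the first k images (Eq. 5)."""
    return float(np.mean(p[:k]))

def eta_global(p):
    """Re{P(1/2)/P(0)} of the DFT-style weighting of p_i (Eqs. 4 and 6)."""
    n = len(p)
    i = np.arange(n)
    weight = np.exp(1j * 2 * np.pi * i * 0.5 / n)   # h(1/2, i): upper half circle
    return float(np.real(np.sum(p * weight) / np.sum(p)))

if __name__ == "__main__":
    # A probability profile concentrated on small indices scores high.
    p = np.linspace(1.0, 0.0, 1000) ** 3
    print(round(eta_local(p), 3), round(eta_global(p), 3))
```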
Table 1
Comparison of the performances of different features for different image classes

                         aircraft (0.08)        buildings (0.11)       faces (0.08)
Feature types            η_local   η_global     η_local   η_global     η_local   η_global
RGB                      0.19      0.22         0.21      0.20         0.13      0.11
Texture                  0.15      0.10         0.28      0.23         0.12      0.03
Shape Histogram          0.35      0.46         0.30      0.16         0.15      0.22
Shape Co-occurrence      0.36      0.48         0.35      0.13         0.15      0.25
Shape FFT                0.26      0.42         0.41      0.30         0.22      0.30
Shape Polar FFT          0.20      0.29         0.21      0.35         0.19      0.20
Three hand-picked image classes were used in the experiments. These were: aircraft, building, and human face views. The results are shown in Table 1. The figures in parenthesis after the class names are the a priori probabilities of the classes. As could be expected, no single feature extraction method performs well for all the three classes. For example, the shape FFT features are better than others for buildings and faces but show relatively worse performance for aircraft images. The shape co-occurrence features seem to suit well for the aircraft class whereas their global performance in the buildings class is poor. In general, the local and global merit figures seem to agree in most evaluations.
[Plots for Figures 9 and 10: the curves compare the co-occurrence, histogram, FFT, polar FFT, RGB and texture features against the optimal and a priori references; the horizontal axes are the number of images in the neighborhood (Figure 9) and Re{Σ_i p_i e^{j2πiu/N}} (Figure 10).]
Figure 9. The slopes of the average cumulation of relevant images for different features and the aircraft images.
Figure 10. The accumulation of the global merit of indexing η_global for the aircraft image class.
Figures 9 and 10 illustrate the calculation of the η_local and η_global values, respectively. The slopes in Figure 9 can be verified to agree with the average observed probabilities in the 'aircraft' column of Table 1. Accordingly, the real parts of the endpoints of the weighted probability curves in Figure 10 are those tabulated as the η_global values in the 'aircraft' column.

5.4. Retrieval precision of the SOMs
Since we are using the Self-Organizing Map as the image indexing tool, we can use the map ordering directly to evaluate feature performance. The SOM algorithm organizes similar input vectors near to each other. Therefore, with a good feature, similar images should be attached either to the same or to a neighboring map unit on the SOMs. For each image I ∈ C, the precision P(r), i.e. the fraction of relevant images in the total number of retrieved images, is calculated for different neighborhood distances r. The images belonging to C are the relevant images. Then, the average precision P_avg(r) is calculated and used to measure the retrieval performance of the used feature. Regardless of the used distance function, the precision P(r) should then be high when r is small and decrease as the distance r becomes greater, eventually dropping below the a priori probability p.

It is also possible to calculate the observed probability function p_i for the SOMs. With the SOM the nearest feature vectors are located in the same or neighboring map units, and the complexity of the search is significantly smaller than in the full search of all the original data. On the other hand, the SOM does perform some averaging on the feature vectors and, as a result, its performance cannot match the direct use of the Euclidean distance in the n-dimensional feature space. First, several map units have equal distance to the current map unit. For example, the map units directly north, south, west, and east of the current map unit all have the same distance. So, these map units cannot be ordered and must be considered equal. Second, the distances between vectors mapped into the same SOM unit are not preserved. Therefore, it is not possible to sort these vectors unambiguously based on the distance function.

The observed probability p_i for the Self-Organizing Maps is created otherwise similarly as above for the original data, but the correctness values h_i of the vectors with equal distances are averaged. The resulting averaged value is then used to replace all the original values in the probability function. This will undoubtedly somewhat worsen the resulting probability function. Figure 11 displays the observed probabilities for the original data and the 64 x 64-sized bottommost TS-SOM layer in the case of the RGB features and aircraft images. The calculation of the η_global value of merit for the same observed probabilities is illustrated in Figure 12. It can be seen that the performance measures of the original data are somewhat superior to those of the SOM, i.e. η_local(SOM) = 0.16 < 0.19 and η_global(SOM) = 0.17 < 0.22.
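A sketch of this tie-averaged observed probability for a SOM is given below; the grid distance between units is taken here as the Euclidean distance between unit coordinates, which is an assumption since the chapter does not name the exact metric.

```python
import numpy as np

def som_observed_probability(bmu_coords, labels, target_class):
    """Observed probability p_i computed on a SOM: images whose map units
    lie at equal grid distance from the query image cannot be ordered, so
    their correctness values are averaged.  The Euclidean grid distance
    between unit coordinates is an assumed choice for this sketch.
    """
    members = np.flatnonzero(labels == target_class)
    n = len(bmu_coords)
    p = np.zeros(n - 1)
    for i in members:
        dist = np.linalg.norm(bmu_coords - bmu_coords[i], axis=1)
        dist[i] = -1.0                         # put the image itself first ...
        order = np.argsort(dist)[1:]           # ... and drop it from g^I
        h = (labels[order] == target_class).astype(float)
        dist_sorted = dist[order]
        for d in np.unique(dist_sorted):       # average correctness over ties
            tie = dist_sorted == d
            h[tie] = h[tie].mean()
        p += h
    return p / len(members)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    coords = rng.integers(0, 16, size=(200, 2)).astype(float)
    labels = rng.integers(0, 2, size=200)
    print(np.round(som_observed_probability(coords, labels, 1)[:5], 3))
```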
Figure 11. The observed probabilities of the original RGB feature vectors and the bottommost TS-SOM layer formed thereof for the aircraft image class.
Figure 12. The accumulation of the global merit of indexing η_global for the original RGB features and the bottommost TS-SOM layer in the case of aircraft images.
6. Conclusions and future plans

Following some of the core principles of the WEBSOM text document exploration tool [14], we have developed an image database retrieval and browsing system called PicSOM. It uses several feature representations for a digital image, currently the spatial color, texture, and shape compositions, and several feature maps structured according to the TS-SOM method [9,10]. In preliminary experiments, the PicSOM system does show potential and we are confident that it can evolve into a usable and fully functional tool for image retrieval.

A generic problem in image database search methods is how to measure their performance. The same quantitative measurements that have been used for PicSOM could be applied to other content-based image retrieval systems to facilitate fair evaluations and comparisons. The MPEG organization [16] has started to work on a new standard, called MPEG-7, to develop a set of features for image content description. The organization also plans a standard testbed for image retrieval applications that will later be used to assess the performance of the fully functional PicSOM system.
In order to study our method's applicability to larger image databases, we have started experimenting with the Corel Gallery [17]. Another vast collection of images is spread across the Internet, and we have plans to use PicSOM as an image search engine for the World Wide Web.

REFERENCES
1. Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Past, present and future. Journal of Visual Communication and Image Representation, 1998. To appear.
2. M. Flickner, H. Sawhney, W. Niblack, et al. Query by image and video content: The QBIC system. IEEE Computer, pages 23-31, September 1995.
3. A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Storage and Retrieval for Image and Video Databases II, volume 2185 of SPIE Proceedings Series, San Jose, CA, USA, 1994.
4. T. P. Minka. An image database browser that learns from user interaction. Master's thesis, M.I.T., Cambridge, MA, 1996.
5. J. R. Smith and S.-F. Chang. VisualSEEk: A fully automated content-based image query system. In Proceedings of the ACM Multimedia 1996, Boston, MA, 1996.
6. J. R. Smith and S.-F. Chang. Searching for images and videos on the world-wide web. Technical Report #459-96-25, Columbia University, 1996.
7. A. B. Benitez, M. Beigi, and S.-F. Chang. Using relevance feedback in content-based image metasearch. IEEE Internet Computing, pages 59-69, July-August 1998.
8. A. Gupta. Visual information retrieval technology: A Virage perspective. Available online at http://www.virage.com/wpaper/, 1997.
9. P. Koikkalainen and E. Oja. Self-organizing hierarchical feature maps. In Proceedings of 1990 International Joint Conference on Neural Networks, volume II, pages 279-284, San Diego, CA, 1990. IEEE, INNS.
10. P. Koikkalainen. Progress with the tree-structured self-organizing map. In A. G. Cohn, editor, 11th European Conference on Artificial Intelligence. European Committee for Artificial Intelligence (ECCAI), John Wiley & Sons, Ltd., 1994.
11. T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer-Verlag, 1997. Second Extended Edition.
12. H. Zhang and D. Zhong. A scheme for visual feature based image indexing. In Storage and Retrieval for Image and Video Databases III (SPIE), volume 2420 of SPIE Proceedings Series, San Jose, CA, February 1995.
13. T. P. Minka and R. W. Picard. Interactive learning using a 'society of models'. Technical Report #349, M.I.T. Media Laboratory, 1995.
14. WEBSOM - self-organizing maps for internet exploration, http://websom.hut.fi/websom/.
15. T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WEBSOM--self-organizing maps of document collections. In Proceedings of WSOM'97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pages 310-315. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland, 1997.
16. The Moving Picture Experts Group MPEG home page, http://www.cselt.it/mpeg/.
17. The Corel Corporation Home Page, http://www.corel.com/.
Indexing Audio Documents by using Latent Semantic Analysis and SOM

Mikko Kurimo
IDIAP, CP-592, Rue du Simplon 4, CH-1920 Martigny, Switzerland
Email: Mikko.Kurimo@idiap.ch

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.

1. INTRODUCTION

The development in large-vocabulary continuous speech recognition (LVCSR) has made it possible to automatically process big databases of recorded audio documents such as broadcast news, interviews, etc. An important application of LVCSR is the automatic indexing of spoken audio by extracting index terms from the speech which is decoded by a speech recognizer. Although the words in the spoken documents cannot be recognized with 100 % accuracy, an automatically generated index can still be very useful for certain applications. For example, on radio and television there are huge amounts of broadcast data for which the collection and indexing without an excessive delay would require a prohibitive amount of human labour. Even if many important words are missed by the automatic speech decoding, the rest may provide enough information for the indexing system so that most of the relevant documents can be found for given queries. Efficient audio indexing is also highly relevant for, e.g., the multimedia industry and interactive TV, since it provides possibilities for easy-to-use consumer interfaces and on-line remote access into archived recordings.

The motivation for this paper is that we can enhance the audio indexing by extracting more information from the documents than just the most probable decoding and then by clustering the documents semantically. The clustering based on the most important semantic content reduces noise coming from the choice of words in the documents and from recognition errors. Thus, a document can be indexed also for terms that were not found by the decoding, but which only appear in other semantically related documents and are probably relevant for the current document as well.

Self-Organizing Maps (SOMs) have successfully been applied to organize large text
archives [16,14] by presenting the documents as smoothed histograms of the word categories that match with the document content. In this paper SOMs are used to cluster documents based on document vectors which are weighted averages of the vectors representing the words (or stems) decoded from the speech. The objective is to associate the documents with the index terms that describe well the main (latent) semantics of the documents and will rank the documents as well as possible according to the terms in a given document query.

2. INDEXING SPOKEN AUDIO
The work presented here is related to the THISL project (Thematic Indexing of Spoken Language), which is an ESPRIT Long Term Research project for speech retrieval [1]. The project aims to explore the limits of state-of-the-art LVCSR, IR (Information Retrieval) and NLP (Natural Language Processing) technologies for indexing and retrieval of television and radio data. The target application is a "news-on-demand" system which recalls the relevant parts of audio or video broadcasts based on a query from the user. A prototype system for THISL has already been made for British and North American broadcast news based on the ABBOT [24] LVCSR system and a probabilistic IR system [1]. The system has been evaluated in the TREC-7 (Text REtrieval Conference) SDR track (Spoken Document Retrieval) [10,21]. Demonstrator systems have also been built for other English and French databases (e.g. [17]).

The basic approach for audio indexing can be divided into several consecutive phases:
1. The audio broadcast is recorded and preprocessed for speech recognition.
2. The recognizable speech is separated from music and other non-speech sounds.
3. Text files are created from the most probable decoding hypothesis.
4. The text files are indexed using the decoded words.
5. The queries are processed and relevant documents are retrieved.

The latest developments of the THISL broadcast news retrieval system are described in [1]. Corresponding full-text recognition based indexing approaches are currently used also by several other groups, e.g. [3,13]. Alternatively, indexing can be based on keyword spotting or phone recognition [19]. The advantages of these systems are computationally lighter speech recognition and no out-of-vocabulary word problems. However, full-text recognition can constrain the task using pronunciation dictionaries and language models and thus provide a more robust text retrieval [1].

3. LATENT SEMANTIC ANALYSIS

Latent Semantic Analysis (LSA) [9] is used for modeling text data based on semantic structures found by analyzing the co-occurrence matrix of words and documents. These models project the data into lower dimensional subspaces by finding the most relevant structures. It is important that by focusing on the relevant structures in the data, the amount of noise originating, e.g., from speech recognition errors, is reduced as well. LSA
is often associated with Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), by which the LSA is normally generated. In document indexing LSA is applied to find out the essential index terms to which the documents should be associated. LSA has traditionally been based on the idea that the data is efficiently compressed by extracting orthogonal components directed so that each new component minimizes the projection error remaining from the previous components.

For indexing, the document collections are usually presented as a matrix A where each column corresponds to one document and each row to the existence of a certain word [25]. This representation loses the information about the word positions and groups in the document, as it is mainly intended to determine only in which documents the words are used. With SVD the word-document co-occurrence matrix is decomposed as A = U S V^T to find the singular values and vectors. By choosing the n largest singular values from S we obtain a reduced space where A is approximated by the estimate A_n [9]:

\[ A_n = U_n S_n V_n^T . \qquad (1) \]

In this n-dimensional subspace the word w_i can be coded as

\[ x_i = \frac{u_i S_n}{\| u_i S_n \|} \qquad (2) \]

by using the normalized row i of matrix U_n S_n. We can then get smoothed representations by clustering the words or the documents using the semantic dissimilarity measure [5]
\[ \mathrm{dis}(w_i, w_j) = \dots \qquad (3) \]
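As an illustration of Equations (1) and (2), the following sketch builds a small word-document matrix and computes the normalized word codes with a truncated sparse SVD; SciPy's svds routine is used here merely as a stand-in for the Lanczos-type solver mentioned later in the text.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def lsa_word_codes(docs, n=2):
    """Truncated SVD of the word-document matrix and the normalized word
    codes x_i = u_i S_n / ||u_i S_n|| of Eqs. (1)-(2).  `docs` is a list
    of token lists; the choice of solver and of n is illustrative.
    """
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for j, d in enumerate(docs):
        for w in d:
            rows.append(index[w])
            cols.append(j)
            vals.append(1.0)
    A = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(docs)))
    U, S, Vt = svds(A, k=n)                      # n largest singular triplets
    codes = U * S                                # rows u_i S_n
    codes /= np.linalg.norm(codes, axis=1, keepdims=True)
    return vocab, codes

if __name__ == "__main__":
    docs = [["market", "stocks", "fell"], ["stocks", "rose", "market"],
            ["rain", "weather", "cold"], ["weather", "forecast", "rain"]]
    vocab, codes = lsa_word_codes(docs, n=2)
    print(vocab[0], np.round(codes[0], 3))
```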
In practice, and especially in spoken documents, the documents are short and important words quite rare. To still get meaningful distributions of the index words in the models, a careful smoothing is needed [5]. This is generally done by clustering similar documents together and using the average document vector of each cluster to represent the cluster members. The cluster vectors will also generate a smoothed representation of the documents, since they integrate the content of several semantically close documents into one model. The clusters can be interpreted as automatically selected topics based on the given document collection. To avoid quantization error between the document and its nearest cluster, a set of nearest clusters (or even all the clusters) can be used to compute the smoothed mapping. For example, we can consider their weighted average based on distance, so that nearby clusters will have the strongest effect. This generalization matches well the broadcast news example, since one section can be relevant to several topics.

4. USING SOM FOR LSA
The main contribution of this paper is the idea of using the SOM to compute an LSA-based index for spoken documents in a way which is more suitable for very large data collections. With very large document collections like broadcast news, recorded over a long time, the dimensionality of matrix A (the word-document co-occurrence matrix) becomes too large to handle. However, the matrix is sparse, because only a small subset of the very large vocabulary is actually used in one document. There exist efficient methods to compute the SVD for sparse matrices, such as the Single Vector Lanczos iteration [6],
which lower the computational complexity significantly. However, it can still be difficult to always obtain an acceptable solution using this kind of iterative approximation method. By Random Mapping (RM) [22] we can artificially (randomly) and quickly generate approximately orthogonal vectors for the words and present the documents as an average vector of the words. In fact, because the co-occurrence matrix is usually very sparse, we can get quite a good approximation with a considerably lower computational complexity than with SVD, already with only 100 - 200 dimensional random vectors [15]. By using this approximation it becomes feasible to use a very large vocabulary and also to expand the index later by adding new documents and words.

For automatically decoded documents we must somehow take into account that documents are not completely described by the decoded words. Some relevant words are often lost or substituted by fully irrelevant ones. Clustering has the advantage of mapping the decoded documents based on their whole content and in that way minimizing the effect of incorrect individual terms. In classical clustering methods such as LBG (Linde-Buzo-Gray) and K-means, each cluster vector is the average of the vectors only in that particular cluster. This adapts the clusters well to the fine structure of the data, but can make the smoothing sometimes inefficient. The more training vectors affect each cluster, the smoother is the representation, and the more will the clusters reflect the major structures of the data. If we do the clustering by SOM, each training vector affects at the same time all clusters around the best one, which makes it also easier to train a large number of clusters [18]. As learning proceeds in a SOM, the density of the cluster vectors eventually starts to reflect the density of the training vector space. This will provide the strongest smoothing on sparse areas and the highest accuracy on dense areas.

Like the RM document vectors, the SVD document vectors can be clustered by SOM as well to further reduce noise and to gain new index terms by mapping the documents to the clusters. If we train the SOM into a two-dimensional grid, the automatic ordering will provide a visualization of the structures in the data (Figure 1). If the display is suitably labeled, we can see the dominant clusters and directions and get immediately a conception of the area where the chosen document lies [11,16]. For more thorough database exploration, a graphical interface like WEBSOM [11] can be used to virtually move inside any point on the map and examine the document space around it.
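The random-mapping alternative and the SOM smoothing can be sketched as follows. The vector dimension, the IDF weighting, the 260-unit map and the training schedules only echo the numbers quoted later in Section 5.2; they are illustrative choices, not the original experimental setup.

```python
import numpy as np

def random_mapping_vectors(vocab, dim=200, seed=0):
    """Assign each stem an approximately orthogonal random unit vector."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(vocab), dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return {w: vecs[i] for i, w in enumerate(vocab)}

def document_vector(stems, word_vecs, idf):
    """IDF-weighted average of the stem vectors (the RM document code)."""
    acc = np.zeros(len(next(iter(word_vecs.values()))))
    total = 0.0
    for s in stems:
        if s in word_vecs:
            acc += idf.get(s, 1.0) * word_vecs[s]
            total += idf.get(s, 1.0)
    return acc / total if total > 0 else acc

def train_som(data, rows=13, cols=20, epochs=20, seed=0):
    """A very small online SOM, enough to cluster the document vectors.
    260 units as in the text; the learning-rate and neighbourhood
    schedules are arbitrary assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    codebook = rng.normal(scale=0.1, size=(rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), -1)
    for e in range(epochs):
        lr = 0.5 * (1 - e / epochs) + 0.01
        sigma = max(1.0, (max(rows, cols) / 2) * (1 - e / epochs))
        for x in rng.permutation(data):
            d = np.linalg.norm(codebook - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2) / (2 * sigma ** 2))
            codebook += lr * g[..., None] * (x - codebook)
    return codebook

if __name__ == "__main__":
    vocab = ["market", "stocks", "weather", "rain", "forecast"]
    wv = random_mapping_vectors(vocab, dim=50)
    idf = {w: 1.0 for w in vocab}
    docs = [["market", "stocks"], ["weather", "rain", "forecast"], ["stocks", "market"]]
    data = np.vstack([document_vector(d, wv, idf) for d in docs])
    print(train_som(data, rows=2, cols=3, epochs=5).shape)
```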
5. EXPERIMENTS
5.1. Evaluation metrics

The correct evaluation of a spoken document index is a difficult task. Indexes prepared in different ways describe documents using the same or different index terms and might thus return different documents for the same terms given as a query. In general, it is not possible to automatically judge which documents are relevant to a given query. For the user of the index it is also very important how the retrieved documents are ranked, i.e. the most relevant ones should be on top. However, a proper comparison of different ranking lists is even more difficult than just judging whether the results are relevant or not [10].

In this paper we apply the test used in the latest TREC evaluation for the SDR track [10]. For a database of North American Business news a set of text decoding hypotheses
Figure 1. Examples of visualizing an indexed document collection. Each cell corresponds to one cluster (node) in the SOM grid. The vectors of neighboring cells are usually also near each other in the original high-dimensional vector space. The color of the cells is here used to show the distance between the cluster and the selected test document. A light color means a short distance. The numbers in cells are here (picture on the left) used to show the pointers to the documents that are closest to that cluster. Another way to study the clusters is to find the best matching index terms (a different database on the right).
using different speech recognizers was provided. TREC provided as well a set of carefully composed test queries and relevance judgments by human experts for the documents concerning each query. Several measures were defined to compare the relevance of the retrieved set of documents. The two most important used in this paper are the recall, which is the proportion of the relevant documents which are obtained, and the precision, which is the proportion of the obtained documents which are relevant. A meaningful comparison for ranked retrieval lists is then to check the precision at different levels of recall or, as in this paper, by computing the average precision (AP) over all relevant documents. In addition to AP, we use another related measure which is the average R-precision (RP), defined by the precision of the top R documents, where R is the total number of the relevant documents.

For the databases where no relevance judgments are available, we propose a new concept called the average document perplexity [17] to give a numerical measure of how well an index describes the documents. In speech recognition the measure of perplexity is commonly used to quantify the relative difficulty of a recognition task. The perplexity is a measure of the strength or predictive power of the LM (Language Model) constraints and it is also widely used to compare LMs when it is too expensive to compute every time the actual WER (Word Error Rate) for the whole speech recognition system [8]. The perplexity for the words w_1, ..., w_T in the test set can be defined as

\[ PP = \exp\left( -\frac{1}{T} \sum_{i=1}^{T} \ln \Pr(w_i \mid \mathrm{LM}) \right) . \qquad (4) \]
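Equation (4) is straightforward to evaluate once the model probabilities of the test words are available, as in this small sketch; for the average document perplexity discussed next, the same formula is fed with LSA scores instead of language-model probabilities.

```python
import math

def perplexity(probabilities):
    """Eq. (4): PP = exp(-(1/T) * sum_i ln Pr(w_i | model)).

    `probabilities` are the model probabilities assigned to the T words
    of the test set (all must be > 0).
    """
    logs = [math.log(p) for p in probabilities]
    return math.exp(-sum(logs) / len(logs))

if __name__ == "__main__":
    print(round(perplexity([0.1, 0.02, 0.3, 0.05]), 2))   # more unlikely words -> higher PP
```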
For document set models the perplexity can be defined using the vector space representation of words and documents so that instead of Pr(wilLM)s we have the probabilities given by the LSA model for the test document. The LSA probabilities are computed using the normalized matches between the vectors of the index terms (words or stems) and the vector of the test document (or its smoothed version). A high word match means that the word is very likely to exist in the test document and the more unlikely words there are in the test document, the higher the perplexity. Thus a higher average document perplexity means also that the models have less predictive power for the tested documents and the index might be worse. However, perplexity is by no means a substitute for the actual retrieval test and, as it is well known from speech recognition experiments, even significant improvements in perplexity do not necessarily imply improvements in the actual WER [12]. 5.2. T e s t e d i n d e x i n g m e t h o d s After the tested news databases were processed as explained in Section 2, the obtained text files were used to prepare the indexes. Since full lattice decoding results were not yet available, the indexing was made based on the most probable decoding only. The French LVCSR system based on a hybrid HMM/MLP model that was used to decode the databases is described in [7] with latest details in [4,17]. The index that was called "default THISL", according to the first THISL prototype version [2], creates an inverted file using the stems of the decoded words directly as the index terms. The inverted file is basically a list of words with pointers to relevant documents. The stemming was made using the Porter stemming algorithm [20], so that
the stop words were first filtered out and then the suffixes removed from the rest of the words to get the stems. The stemming algorithm is tuned only for English, so the French stems are probably not optimal. The stop list is an edited list of the most frequent words in the language.

The LSA indexes were made by first preparing the smoothed document vectors as explained in Sections 3 and 4. For the traditional SVD approach, a sparse SVD with the 125 first singular values and vectors was computed, and the normalized word codes (Equation 2) of the word stems were used to form the document vectors. The RM + SOM approach was based on 200-dimensional normalized random vectors for the stems and a two-dimensional SOM of 260 units for the document vectors. A SOM of the same size was also used for smoothing the SVD based document vectors.

For the construction of the document vectors, an importance weighting was used for the word stems in both the RM + SOM and the traditional LSA (unlike in [17]). The rarer a word is in the collection, the better it usually describes and discriminates the documents. Thus, the importance weight reflects the relevance of a word to the whole document collection, and it can be derived, e.g., using the mutual information (defined with entropy) [26] or its simpler approximation, the Inverse Document Frequency (IDF) [23]. The forms of IDF used here, scaled within [0,1], are the simple

\[ \mathrm{IDF}_i = 1 / n_i , \qquad (5) \]
and the logarithmic

\[ \mathrm{IDF}'_i = 1 - \log n_i / \log \max_j n_j , \qquad (6) \]

where n_i is the number of documents in which stem i occurs [23].
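The two IDF variants of Equations (5) and (6) can be sketched as follows; this is an illustrative snippet, not the authors' implementation, and the document-frequency counts are made up:

```python
import numpy as np

def simple_idf(doc_freq):
    """Equation (5): IDF_i = 1 / n_i, where n_i is the number of documents
    containing stem i. Values fall in (0, 1] for n_i >= 1."""
    n = np.asarray(doc_freq, dtype=float)
    return 1.0 / n

def log_idf(doc_freq):
    """Equation (6): IDF'_i = 1 - log(n_i) / log(max_j n_j), also within [0, 1]."""
    n = np.asarray(doc_freq, dtype=float)
    return 1.0 - np.log(n) / np.log(n.max())

# Document frequencies of four hypothetical stems in a 1000-document collection.
n_i = np.array([1, 10, 100, 1000])
print(simple_idf(n_i))  # rare stems get weights close to 1
print(log_idf(n_i))     # [1.0, 0.667, 0.333, 0.0]
```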
To determine the best index terms for each document, the smoothed document vectors are compared to all the stem vectors. The indexing was made stochastic, so that the index words were weighted by the LSA scores scaled within [0,1]. To integrate the LSA index with the basic index, the index terms selected directly from the actual decoding were added with weight 1.0. Since it was not feasible to index every document with all index terms, the limit of significance was determined by assuming the LSA scores to be normally distributed and selecting all the terms corresponding to scores above the 99 % significance level. The LSA scores of a document, computed for all the index terms, actually approximate the probability

\[ \Pr(\mathrm{doc} \mid \mathrm{word}) = \Pr(\mathrm{word} \mid \mathrm{doc}) \, \Pr(\mathrm{doc}) / \Pr(\mathrm{word}) , \qquad (7) \]
where the probability of each word, Pr(word | doc), can be computed smoothed by the K best-matching clusters C_1, ..., C_K, weighted by their similarity with the current document:

\[ \Pr(\mathrm{word} \mid \mathrm{doc}) = \sum_{k=1}^{K} \Pr(\mathrm{word} \mid C_k) \, \Pr(C_k \mid \mathrm{doc}) . \qquad (8) \]
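A minimal sketch of the cluster-smoothed probability of Equation (8); the function names, the dot-product match and the way similarity scores are renormalized into Pr(C_k | doc) are illustrative assumptions, not the exact procedure of the paper:

```python
import numpy as np

def smoothed_word_probs(doc_vec, cluster_centroids, word_given_cluster, k_best=10):
    """Approximate Pr(word | doc) = sum_k Pr(word | C_k) Pr(C_k | doc)  (Equation 8),
    using only the K best-matching SOM clusters of the document.

    doc_vec:            (d,) smoothed document vector
    cluster_centroids:  (n_clusters, d) SOM prototype vectors
    word_given_cluster: (n_clusters, n_words) rows of Pr(word | C_k)
    """
    # Similarity of the document to every cluster.
    sims = cluster_centroids @ doc_vec
    best = np.argsort(sims)[::-1][:k_best]
    # Turn the similarities of the K best clusters into weights Pr(C_k | doc).
    w = np.clip(sims[best], 0.0, None)
    w = w / w.sum() if w.sum() > 0 else np.full(len(best), 1.0 / len(best))
    # Mixture over the selected clusters gives Pr(word | doc) for every word.
    return w @ word_given_cluster[best]
```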
After the LSA index is made, it can be used in the same way as the "default THISL" index [21]. Queries are processed by eliminating stop words and mapping the remaining words into their stems. To find the best matches, the documents are scored based on the number of matches between the query terms and the document in the index. The scores are normalized using weights for the document length and the term frequency in the collection [23].
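A toy sketch of scoring documents against a query with such an inverted file; the index contents and the length and term-frequency normalization are invented for illustration and only loosely follow the weighting of [23]:

```python
from collections import defaultdict
import math

# Hypothetical weighted index: stem -> list of (doc_id, weight), where the weight
# is 1.0 for stems taken directly from the decoding and the scaled LSA score otherwise.
index = {
    "elect": [(1, 1.0), (4, 0.6)],
    "minist": [(1, 0.8), (2, 1.0)],
    "strike": [(3, 1.0)],
}
doc_len = {1: 120, 2: 80, 3: 200, 4: 150}   # document lengths in words
n_docs = len(doc_len)

def score(query_stems):
    """Sum weighted matches per document, discounted by how common the stem is
    in the collection and normalized by document length."""
    scores = defaultdict(float)
    for stem in query_stems:
        postings = index.get(stem, [])
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))   # rarer stems count more
        for doc, w in postings:
            scores[doc] += w * idf / math.log(1 + doc_len[doc])
    return sorted(scores.items(), key=lambda x: -x[1])

print(score(["elect", "minist"]))   # ranked (doc_id, score) pairs
```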
6. RESULTS

Results are given here for two broadcast databases. The first database contains French-speaking news, and in the decoding used here the WER was high and varied considerably between sections. The average perplexity results in Table 1 indicate that the more smoothing is applied, the higher the perplexity on the training data (a smaller neighborhood and a larger number of SOM units imply less smoothing). The perplexities between RM and SVD based indexes are not directly comparable. Since no test queries were yet available for this database, another, better standardized test set was also analyzed (Tables 2 and 3).
Table 1
Average document perplexities (PP) for the French database. SOM0 is a SOM trained with 0-neighborhood (equivalent to an on-line adaptive version of the classical K-means clustering) and SOMb a larger SOM (600 units). For the clustered systems (SOM) the smoothed model is made using the weighted average of the 10 best-matching clusters (as explained in Section 2). For the non-clustered methods (RM, SVD) the weighted average of the 20 best-matching actual document vectors is used for smoothing. When no clustering was used, independent test data could be simulated by ignoring the current document, giving perplexities 1.94 and 2.46 for RM and SVD, respectively.

Index   RM    RMSOM0  RMSOM  RMSOMb  SVD   SVDSOM0  SVDSOM  SVDSOMb
PP      1.68  1.75    1.85   1.80    2.14  2.47     2.62    2.33
Table 2 presents perplexities and test query evaluations for the TREC test set. The speech decoding used here had a 36 % average WER. More results (also using another decoder) have been presented in [17]. Query expansion, where not only the index terms related to the query are checked but also terms that are commonly associated with them in reference databases [27,1], was not used here. From Table 2 we see that the average precision improves with SVD and improves further when we smooth the models by SOM. The closer comparison in Table 3 shows, e.g., that LSA retrieves many more documents than the references, including slightly more of the relevant ones. Looking at the lowest standard recall level 0.10, which gives the precision of the highest ranked documents, LSA also seems to do quite well. For higher recall levels the precision of LSA drops below that of the baseline, because the cost of the higher total recall seems to be a vast increase of irrelevant documents. In Table 2 the document perplexity for the RM index decreases as stronger smoothing is applied, but the AP and RP indicators do not show any clear improvement. For SVD coding the AP and RP indicators show improvements with smoothing, but the perplexity does not change much.
Table 2
Evaluation results for the TREC test set. AP is the average precision, RP the R-precision, and PP the average document perplexity. "THISL default" is a baseline index without LSA and "perfect" is an index based on the correct transcriptions. As in Table 1, the simulated test data perplexity gave 2.7 and 1.8 for the non-clustered RM and SVD, respectively.

Index           AP    RP    PP
RM              0.33  0.34  2.6
RMSOM0          0.33  0.35  2.2
RMSOM           0.34  0.36  2.1
SVD             0.35  0.34  1.7
SVDSOM0         0.37  0.34  1.8
SVDSOM          0.38  0.34  1.8
THISL default   0.37  0.37  -
"perfect"       0.43  0.41  -
Table 3
Some finer details of the comparison between the reference systems and the best LSA system (SVDSOM) for Table 2. "ranked" is the average number of documents ranked per query, "recall" the total recall, and "P.10" the precision at recall level 0.10.

          "perfect"   $1 decoding, ref.   $1 decoding, LSA
ranked    0.29        0.31                0.66
recall    0.92        0.91                0.96
P.10      0.65        0.62                0.65
AP        0.43        0.37                0.38
RP        0.41        0.37                0.34
7. CONCLUSIONS

This paper describes a system for decoding spoken documents and indexing them based on latent semantic analysis of the document contents. A new, computationally simple approximative approach is suggested for LSA in large document collections. To smooth the LSA models we apply clustering with a SOM. This also provides an organized view of the contents of the document collection. Experiments were made using French and American news databases, and for the latter we provide the results of relevance judgments using standardized test queries. To measure the predictive power of the models we define a new document perplexity measure.

The results show that the proposed way of constructing the LSA index by RM + SOM does not give quite as accurate retrieval results (AP) as the SVD based LSA or the baseline THISL index. At a higher recall level (RP) the precision of the RM-based indexes is between that of SVD and the baseline THISL. However, at the lowest recall level (P.10), which is probably the most useful for the interface users, the precision provided by SVD+SOM was the highest and as good as that of the "perfect" index. From a computational point of view RM + SOM is better than SVD, since it is much faster and there are far fewer complexity problems as the number of documents and words increases. It is also convenient that we do not need to change the old document vectors as the database is updated. The clustering of models is favorable, since the indexing is faster with a smaller total number of models and a smaller number of selected best models. The SOM algorithm behaves well for large document collections, because it is not affected by the vocabulary size and only almost linearly by the number of documents, as opposed to typical SVD methods where the complexity is usually much higher.

For further research we have left the integration of acoustic confidence measures and n-best hypotheses into the presented stochastic index, and the testing of the query expansion method with the LSA index. For the French databases the same stemming algorithm as for English has so far been used, but because the suffixes are different, we will probably have to implement a totally new algorithm. Further development of the ranking strategies might be useful for LSA, since we get significantly more matching documents and there is also more useful information included in the indexing weights. Another interesting aspect is the use of data visualization to help understand the structures in the database and to choose suitable words for queries.

ACKNOWLEDGMENTS

This work was supported by the ESPRIT Long Term Research Project THISL. I wish to thank Dr. Chafic Mokbel and the speech group at IDIAP for useful discussions concerning the methodology and for comments concerning this paper.

REFERENCES
1. Dave Abberley, David Kirby, Steve Renals, and Tony Robinson. The THISL broadcast news retrieval system. In ESCA ETRW Workshop on Accessing Information in Spoken Audio, Cambridge, UK, April 1999.
2. Dave Abberley, Steve Renals, and Gary Cook. Retrieval of broadcast news documents with the THISL system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3781-3784, 1998.
3. J. Allan, J. Callan, W.B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. INQUERY does battle with TREC-6. In Proceedings of the Sixth Text Retrieval Conference (TREC-6), pages 169-206, 1998.
4. Johan Andersen. Baseline system for hybrid speech recognition on French. COM 98-7, IDIAP, 1998.
5. Jerome R. Bellegarda. A statistical language modeling approach integrating local and global constraints. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 262-269, 1997.
6. Michael W. Berry. Large-scale sparse singular value computations. Int. J. Supercomp. Appl., 6(1):13-49, 1992.
7. Herve Bourlard and Nelson Morgan. Connectionist Speech Recognition - A Hybrid Approach. Kluwer Academic Publishers, 1994.
8. Stanley F. Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, 1998.
9. S. Deerwester, S. Dumais, G. Furnas, and K. Landauer. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci., 41:391-407, 1990.
10. John S. Garofolo, Ellen M. Voorhees, Cedric G. P. Auzanne, and Vincent M. Stanford. Spoken document retrieval: 1998 evaluation and investigation of new metrics. In ESCA ETRW Workshop on Accessing Information in Spoken Audio, 1999.
11. Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen. Newsgroup exploration with WEBSOM method and browsing interface. Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
12. R. Iyer, M. Ostendorf, and M. Meteer. Analyzing and predicting language model improvements. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 254-261, 1997.
13. S.E. Johnson, P. Jourlin, G.L. Moore, K. Sparck Jones, and P.C. Woodland. The Cambridge University spoken document retrieval system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 49-52, 1999.
14. Samuel Kaski, Timo Honkela, Krista Lagus, and Teuvo Kohonen. WEBSOM - self-organizing maps of document collections. Neurocomputing, 21:101-117, 1998.
15. Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume I, pages 413-418, 1998.
16. Teuvo Kohonen. Self-Organizing Maps. Springer, Berlin, 1997. 2nd extended ed.
17. Mikko Kurimo and Chafic Mokbel. Latent semantic indexing by self-organizing map. In ESCA ETRW Workshop on Accessing Information in Spoken Audio, Cambridge, UK, April 1999.
18. Mikko Kurimo. Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, Finland, 1997.
19. Kenney Ng and Victor W. Zue. Phonetic recognition for spoken document retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 325-328, 1998.
20. M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.
21. Steve Renals, Dave Abberley, Gary Cook, and Tony Robinson. THISL spoken document retrieval. In Proceedings of the Seventh Text Retrieval Conference (TREC-7), 1998.
22. Helge Ritter and Teuvo Kohonen. Self-organizing semantic maps. Biol. Cyb., 61(4):241-254, 1989.
23. S.E. Robertson and K. Sparck Jones. Relevance weighting of search terms. J. Amer. Soc. Inform. Sci., 27(3):129-146, 1976.
24. T. Robinson, M. Hochberg, and S. Renals. The use of recurrent networks in continuous speech recognition. In C. H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic Speech and Speaker Recognition - Advanced Topics, chapter 10, pages 233-258. Kluwer Academic Publishers, 1996.
25. G. Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice-Hall, NJ, 1971.
26. Matthew Siegler and Michael Witbrock. Improving the suitability of imperfect transcriptions for information retrieval from spoken documents. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 505-508, 1999.
27. J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proc. ACM SIGIR, 1996.
Self-Organizing Map in analysis of large-scale industrial systems

Olli Simula, Jussi Ahola, Esa Alhoniemi, Johan Himberg, and Juha Vesanto

Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 5400, FIN-02015 HUT, Finland

The Self-Organizing Map (SOM) is a powerful tool in visualization and analysis of high-dimensional data in engineering applications. The SOM maps the data onto a two-dimensional grid which may be used as a base for various kinds of visual approaches for clustering, correlation and novelty detection. In this paper, the methods are discussed and applied to the analysis of hot rolling of steel, a continuous pulping process, and technical data from the world's pulp and paper mills.

1. INTRODUCTION

Characterization of industrial processes is traditionally done based on analytic system models. The models may be constructed using knowledge based on physical phenomena and assumptions about the system behavior. However, many practical systems, e.g., industrial processes, are so complex that global analytic models cannot be defined. In such cases, system modeling must be based on experimental data obtained by various measurements. The measurement data and other types of information are typically stored in databases. In many practical situations even minor knowledge about the characteristic behavior of the system might be useful. However, interpretation of the data is often difficult. For this purpose, easy visualization of the data is vital. The information needs to be converted into a simple and comprehensive display which reduces the dimensionality of the data and simultaneously preserves the most important metric relationships in the data.

Adaptive and learning systems provide means to analyze the system without an explicit physical model. The Self-Organizing Map (SOM) [9] is one of the most popular neural network models. Due to its unsupervised learning and topology preserving properties it has proven to be very powerful in the analysis of complex industrial systems. The algorithm implements a nonlinear topology preserving mapping from a high-dimensional input data space onto a two-dimensional network which roughly approximates the density of the data. The SOM forms a display panel which automatically shows the states of the system. Various visualization alternatives of the SOM are useful in searching for correlations between measurements and in investigating the cluster structure of the data. The SOM is an effective clustering and data reduction algorithm and can thus be used for data cleansing and preprocessing. Integrated with other methods, it can also be used for rule extraction [14] and regression [11]. SOM based data exploration has been applied in various engineering applications such as pattern recognition, text and image analysis,
Figure 1. (a) The steps in Knowledge Discovery in Databases. (b) Different phases of a SOM-based data analysis process.
financial data analysis, process analysis and modeling as well as monitoring, control, and fault diagnosis [10]. In this paper, the role of the SOM as a part of the data mining process is considered. Visualization, clustering and correlation hunting possibilities are demonstrated using three case studies based on real-world data from large-scale industry.

2. SOM IN DATA ANALYSIS

Data analysis is a part of a larger framework: Knowledge Discovery in Databases (KDD). The purpose is to find new knowledge from databases containing large amounts of data. KDD consists of several steps, see Figure 1(a). The SOM is a powerful tool especially in the data understanding and preparation phases. The successful use of SOM based methods requires close cooperation and interaction with process experts whose task is to give a physical explanation for the "learned" properties and dependencies.

Data analysis based on the SOM is an iterative multi-stage process, the actual training of the SOM being only part of it, see Figure 1(b). The phases of a data analysis process can be sketched as follows:

• Data acquisition may be real-time measurement collection (on-line) or a database query (off-line), which is usually the case when an exploratory analysis is made.

• Data preprocessing, selection and segmentation is usually an elaborate task involving a lot of a priori knowledge of the problem domain: erroneous data have to be removed. Proper data scaling and representational transformations (e.g., symbolic to numerical values) have to be considered. If possible, clearly inhomogeneous data sets may be divided into disjoint subsets according to some criteria in order to avoid problems which would come up if a global model were applied.

• Feature extraction transforms the cleansed data into feature vectors. It is important to realize that the objective is to interpret the data and extract knowledge from it and from the relations in it, not to make black-box classification or regression. Therefore, the feature variables have to describe the important phenomena in the data in such a way that they are clear from the analysis' point of view. It is evident that this and the previous stages cannot be properly done without knowledge of the application domain.

• Training phase initializes and trains the SOM. One delicate issue is the scaling of the feature variables. Variables with large relative variance tend to dominate the map organization. In order to equalize the contribution of individual variables, they are usually normalized to be equivariant (a minimal sketch of this scaling follows after this list). The distance measure used in the SOM training has to be chosen in such a way that applying it to the data makes sense. Usually, the Euclidean distance is used. Variable normalization and the distance measure are, of course, data dependent issues and related to the feature extraction phase.

• Visualization and interpretation are the key issues for using the SOM in data analysis. These include correlation detection, cluster analysis and novelty detection.

The data analysis is usually not a flow-through process, but requires iteration, especially between the feature extraction and interpretation phases.
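As referred to in the training-phase item above, equivariance scaling (zero mean, unit variance per variable) could be sketched as follows on synthetic data; this is an illustration, not the toolbox implementation:

```python
import numpy as np

def normalize_equivariant(X):
    """Scale each feature to zero mean and unit variance so that no single
    variable dominates the Euclidean distances used in SOM training."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # leave constant variables unchanged
    return (X - mean) / std, mean, std

# Synthetic measurement matrix: 1000 samples, one variable with a much larger range.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 1000), rng.normal(50, 300, 1000)])
Xn, mean, std = normalize_equivariant(X)
print(Xn.std(axis=0))   # both close to 1 after scaling
```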
2.1. SOM visualization

The SOM provides a low-dimensional map of the data space. It actually performs two tasks: vector quantization and vector projection. Vector quantization creates from the original data a smaller, but still representative, data set to be worked with. The set of prototype vectors reflects the properties of the data space. The projection performed by the SOM is nonlinear and restricted to a regular map grid. The SOM tries to preserve the topology of the data space rather than relative distances. The aim of visualization is both to understand the mapped area and to enable investigation of new data samples with respect to it.¹

The unified distance matrix (u-matrix) [6,13] shows the cluster structure on a SOM grid visualization. It shows the distances between neighboring units, and clusters of units, using a gray scale or color representation on the map grid, see Figure 7(b). The SOM is also often "sliced" into component planes in order to see how the values of a certain variable change in different locations of the map [12]. Each plane represents the value of one component of the prototype vector in each node of the SOM using, e.g., a gray scale representation. The general behavior of the component values in different parts of the SOM, corresponding to different parts of the input space, can be easily seen as shown in Figure 3. The component planes play an important role in correlation detection: by comparing these planes, even partially correlating variables may be detected by visual inspection. This is easier if the component planes are reorganized so that the correlated ones are near each other [15,16], as depicted in Figure 5. In this way, it is easy to select interesting component combinations for further investigation. A more detailed study of interesting combinations can be done using scatter plots which can be linked to the map units by color as in the case studies below.
¹ See URL http://www.cis.hut.fi/projects/somtoolbox/ for a freely available software implementation of the SOM algorithm with several visualization techniques.
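A rough sketch of how a u-matrix value and a component plane can be computed for a rectangular SOM grid (here, average distance to the four grid neighbours); the SOM Toolbox mentioned in the footnote provides more refined versions, so this is only an illustration:

```python
import numpy as np

def u_matrix(prototypes, rows, cols):
    """Average distance from each SOM unit to its grid neighbours.

    prototypes: (rows*cols, d) array of prototype vectors, row-major grid order.
    Returns a (rows, cols) array; large values indicate cluster borders."""
    W = prototypes.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U

def component_plane(prototypes, rows, cols, i):
    """Values of the i-th prototype component arranged on the map grid."""
    return prototypes[:, i].reshape(rows, cols)
```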
When analyzing new data with the SOM, the key issue is to find out which part of the map best corresponds to the data. Traditionally, this has been done by finding the nearest prototype vector, the best matching unit (BMU), for each new data sample and then indicating it on the SOM. However, typically there are several units with almost as good a match as the BMU. The data sample may also be very far from the map, i.e., it is a novelty in terms of the map. Instead of simply pointing out the BMU, the response of all map units to the data can be shown. The resulting response surface shows the relative "goodness" of each map unit in representing the data. Perhaps a more interpretable response function results if the SOM is used as a basis for a reduced kernel density estimate of the data. Then one can estimate the probability P(i|x) of each map unit representing the data sample, see for example [2,5]. Another possibility is to use a "fuzzy response", a function of the quantization error having values on the interval [0, 1].

In order to clarify the connections between visualizations, they may be linked together using color, a dominant visual hint for grouping objects. This idea has been applied in carrying information from the SOM representation to a geographical map in [1,3,7]. The same idea can be used in linking different presentations of the same data together, e.g., the SOM grid and a scatter plot or Sammon's projection [4,17]. For an example of the former, see Figure 4. A similar linking idea for PCA has been presented earlier by Aristide [3].
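The BMU search and a simple quantization-error based response surface discussed above might be sketched as follows; the Gaussian form and the width parameter are arbitrary illustrative choices, not the estimators of [2,5]:

```python
import numpy as np

def bmu(prototypes, x):
    """Index of the best-matching unit for sample x, plus all unit distances."""
    d = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(d)), d

def response_surface(prototypes, x, width=1.0):
    """Relative 'goodness' of every map unit for sample x, in [0, 1].
    A flat, low surface signals a novelty: no unit represents the sample well."""
    _, d = bmu(prototypes, x)
    return np.exp(-(d / width) ** 2)

# Usage with a hypothetical 10x10 map of 6-dimensional prototypes:
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 6))
x = rng.normal(size=6)
unit, _ = bmu(W, x)
print(unit, response_surface(W, x, width=2.0).max())
```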
3. CASE STUDIES

The SOM was applied to qualitative analysis in two process industry applications. The studies dealt with real-world data acquired from (1) a continuous pulp digester and (2) hot rolling of steel strips. In both cases, the aim was to find reasons for variations in the end product quality. In the pulp digester, the SOM based analysis was used to explain the problematic process behavior. In the hot rolling of steel, an analysis of various parameter dependencies was carried out. The goal was to find the most important factors affecting the quality of the end product, which is described using several attributes: width, thickness, flatness, wedge shape, and profile. The third case study, which was quite different from the others, investigated the world's pulp and paper industry. The aim was to find a categorization of the world's pulp and paper mills using information on their technical characteristics.

3.1. Analysis of continuous pulp digester

In the continuous pulp digester case study, problems indicated by drops of pulp consistency in the digester outlet were the starting point for the analysis. In those situations, the end product quality variable (kappa number) had values lower than the target value. The analysis was started with several dozens of variables which were gradually reduced down to the six most important measurements during the data analysis process. In Figure 2, the six signals and the production speed of the fiber line are shown. Figure 3 shows the component planes of the SOM trained using the signals of Figure 2. Five of them depict the behavior of the digester and the last one is the output variable, the kappa number. The problematic process states are mapped to the upper left corner of the SOM: the model vectors in that part of the map have too low a kappa number value. Correlations between the kappa number and other variables are shown in Figure 4,
Figure 2. Measurement signals of the continuous digester. The analyzed parts are marked by a solid line and the ignored parts by a dotted line.
Figure 3. Component planes of the SOM trained using six measurement signals of the digester.
Figure 4. Color map and five scatter plots of model vectors of the SOM. The points have been dyed using the corresponding map unit colors.
where the SOM of Figure 3 has been presented using color coding. The colors assigned to the map units are shown in the top left corner of the figure. The five scatter plots are based on the model vector component values of the SOM. They all have the values of the kappa number on the x-axes and the five other variables on the y-axes. The scatter plots indicate that in the faulty states denoted by black (the most problematic process states) and gray color, there is only a weak correlation between the kappa number and the H-factor, which is the variable used to control the kappa number. Otherwise, there is a negative correlation, as might be expected. On the other hand, the variables Extraction and Chip level seem to correlate with the kappa number in the faulty process states. Also, the values of Press. diff. are low and the value of the variable Screens (which during the analysis was noticed to indicate digester fault sensitivity) is high.

The interpretation of the results is that in a faulty situation, the downward movement of the wood chips, which are continuously fed at the digester top, slows down. Thus, the residence time of the chips in the digester increases. The H-factor based digester control fails: in the H-factor computation, the cooking time is assumed to be constant, while in reality it becomes longer due to the slowing down of the chip plug movement.
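The color linking used in Figure 4 amounts to assigning each map unit a color derived from its grid position and dyeing every scatter-plot point with the color of its best-matching unit. A minimal sketch (the particular color scheme is an arbitrary choice, not the one used in the figure):

```python
import numpy as np

def grid_colors(rows, cols):
    """Continuous coloring of a SOM grid: red varies with row, green with column."""
    r, c = np.meshgrid(np.linspace(0, 1, rows), np.linspace(0, 1, cols), indexing="ij")
    return np.stack([r, c, 1 - 0.5 * (r + c)], axis=-1).reshape(rows * cols, 3)

def dye_points(prototypes, colors, X):
    """Color each data point with the color of its best-matching map unit."""
    bmus = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
    return colors[bmus]

# The dyed colors can then be passed to, e.g., matplotlib's scatter(c=...) so that
# points from the same map region share a color across several scatter plots.
```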
3.2. Analysis of the quality of the hot rolled strip

In the second case study, a hot rolling process was analyzed. In hot rolling, steel slabs are heated, rolled, cooled and coiled into the final products, strips. First, the slab is heated in the slab re-heating furnaces to a temperature appropriate for the following rolling process. After the formed scale is removed with a high-pressure water shower, the slab passes to the roughing mill. The slab is rolled back and forth several times, vertically in the edger and horizontally in the reversing rougher. The resulting transfer bar travels under the heat retention panels through another descaling and possible shearing of the head into the finishing mill. The finishing mill consists of six stands. The transfer bar goes through them with high, accelerating speed, being in several passes simultaneously. After the rolling, the strip is cooled with several water curtains and coiled.

The purpose of the analysis was to study which process parameters and variables affect the quality of the rolled strips. This was studied by hunting for correlations between process parameters and variables. The average and standard deviation of five process parameters were chosen to represent the quality: width, thickness, profile, flatness and wedge of the rolled strip. The other variables included information about the slab (analyzed chemical content), finishing mill parameters (average bending forces, entry tensions, and axial shifts for each stand), and the process state (strip strength, target dimensions, and average and standard deviation of the temperature after the last stand), for a total of 36 variables. After preprocessing of the data, the number of strips included in the study was slightly over 16500.

Both traditional linear correlation analysis and correlation detection using the SOM were utilized. Due to the quite large number of variables, the component planes of the SOM were reorganized to place possibly correlating planes close to each other. The result is illustrated in Figure 5. Based on this information and a priori knowledge of the system, the variables to be used in the more detailed analysis of the strip quality were chosen. In this case, the strip thickness was chosen to be studied further. The variables included in the new data set were the quality parameters, thickness average deviation and standard deviation, strip target dimensions, strip strength, bending forces, temperature after the last stand, and strip profile. Using scatter plots colored with a continuous coloring of the SOM plane, dependencies between thickness and other parameters in different process states could be found. Unfortunately, for this paper the color coding had to be limited to grayscale. The approach is illustrated in Figure 6(b), where all the other variables are plotted against the standard deviation of thickness. These plots revealed several reasons for problems with the strip. For example, the standard deviation of the strip thickness usually increases as the thickness of the strip increases and when the rolling temperature and the bending forces decrease.
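Reorganizing the component planes so that correlated variables lie near each other can be approximated by a crude one-dimensional ordering of the planes; the sketch below simply sorts the components along the first principal axis of their absolute correlation matrix, which is a simplification of the procedure actually used in [15,16]:

```python
import numpy as np

def plane_order(prototypes):
    """Return a permutation of the component indices that tends to place
    components with similar correlation profiles next to each other."""
    C = np.abs(np.corrcoef(prototypes.T))        # (d, d) correlations between planes
    C = C - C.mean(axis=0, keepdims=True)        # center the profiles
    _, _, vt = np.linalg.svd(C)
    return np.argsort(C @ vt[0])                 # 1-D coordinate along the main axis

# prototypes: (n_units, d) SOM codebook; the component planes are then displayed
# in the order given by plane_order(prototypes).
```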
Figure 5. The reorganized component planes of the SOM.
Figure 6. (a) Four component planes of the SOM. The particular planes were chosen because they show the state of the process quite well. From the SOM, two especially interesting regions were chosen (in the top right corner and on the lower right side). (b) Scatter plots of the standard deviation of thickness vs. the other components in these regions. Values from the upper region are depicted with gray diamonds and values from the lower region with black spots.
3.3. Analysis of the world's pulp and paper mills

In this study, the SOM was used for data mining to analyze the technology data of the world's pulp and paper industry. There were three data sets containing information on (1) production capacities of different product types in pulp and paper mills, (2) technology of paper machines and (3) technology of pulp lines. Since each mill could contain several paper machines and pulp lines, a hierarchical structure of maps was used, illustrated in Figure 7(a). The two low-level maps extracted relevant information regarding the paper machine and pulp line data of a mill, and the high-level map combined this with the production capacity data.

An analysis of mill types was performed. The map was divided into clusters and the clusters were analyzed by observing the component planes of the map. The cluster analysis resulted in the 20 different mill types described in Table 1 and shown in Figure 7(b). For the analysis of different geographical areas, the data were separated into several sets, each consisting of the pulp and paper mills in a certain area. The data sets were projected on the map and, based on the resulting histograms, some conclusions could be drawn for each region. For example, the region labelled PMS was typical for Chinese pulp and paper mills. The same approach can be directly used for comparing and analyzing different companies.
Figure 7. (a) The hierarchical map structure. Data histograms from the two smaller maps were utilized in the training of the third map. The arrows indicate the data sets used in training the maps. (b) The u-matrix of the mill technology map. Black corresponds to a high value of the u-matrix, white to a small value. The mill types have been marked on the map with lines and labeled as in Table 1.
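The hierarchical structure of Figure 7(a) can be outlined as follows: the paper machines (or pulp lines) of each mill are mapped onto a trained low-level SOM, the resulting hit histogram serves as a fixed-length description of the mill, and the histograms are concatenated with the capacity data to form the input of the high-level map. The sketch below is an illustrative outline only; training of the SOMs themselves is assumed to be done elsewhere (e.g., with a toolbox):

```python
import numpy as np

def bmu_indices(prototypes, X):
    """Best-matching unit for every row of X on a trained low-level SOM."""
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return np.argmin(d, axis=1)

def hit_histogram(prototypes, X):
    """Normalized hit histogram of the samples X over the map units; a fixed-length
    description of a variable-sized set (e.g., all paper machines of one mill)."""
    counts = np.bincount(bmu_indices(prototypes, X), minlength=len(prototypes))
    return counts / max(counts.sum(), 1)

def mill_vector(machine_som, pulp_som, machines, pulp_lines, capacity):
    """High-level input vector: machine histogram + pulp-line histogram + capacities."""
    return np.concatenate([hit_histogram(machine_som, machines),
                           hit_histogram(pulp_som, pulp_lines),
                           capacity])
```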
Table 1
Pulp and paper mill types. The clusters refer to the areas marked on the u-matrix of the mill technology map in Figure 7(b).

TYPE       Description
UNcWF      Uncoated woodfree paper, either new big machines or older and smaller.
SMALLIND   Various industrial papers, machines old and small.
TISSUE     Tissue paper, average wire length and speed.
DEWA       Some tissue paper, but especially high deinked waste paper usage.
CoWF       Many coaters, coated woodfree paper, machines of average capacity and speed.
PMS        Many paper machines including some high capacity paper machines, uncoated woodfree but also various industrial papers, unbleached and semibleached sulphite pulp.
DIWA       Cartonboard, linerboard and fluting papers, disperged waste paper for pulp.
BLWF       Uncoated woodfree paper, bleached chemical pulp from wood fibre.
BIG        High capacity, many machines and coaters, woodfree paper or linerboard, pulp is chemical (sulphate), machines are big.
WRLI       Wrapping paper and linerboard, unbleached sulphate pulp, big paper machines.
MECH       Various industrial papers, pulp is unbleached and mechanical: groundwood or rmp.
AVEIND     Cartonboard, linerboard and fluting paper, average capacity machines.
FLUT       Fluting paper, semichemical unbleached pulp, average capacity paper machines.
WOODC      Wood containing paper, chemimechanical or mechanical pulp from wood fibre, average capacity paper machines.
NEWS       High paper capacity: newsprint, thermomechanical pulp, high capacity paper machines and pulp lines.
PULP       No paper production, but large pulp production, high capacity pulp lines, big market percentage.
CARTON     Cartonboard and other papers, big weight in paper machines.
LINFL1     Linerboard and fluting paper, small to average sized machines.
LINFL2     Linerboard and fluting paper, high capacity machines.
BIGIND     Cartonboard, wrapping, tissue and other papers, high capacity machines.
4. CONCLUSIONS

The SOM has proven to be a powerful tool in data mining and analysis. It combines the benefits of vector quantization and data projection. The various visualization methods provide efficient ways to use the SOM in data understanding and exploration. There are different needs in exploratory visualization, but as the proposed principles are simple, they can be easily modified to meet the needs of the task. Future work is still needed to enable the methods to automatically take heed of the properties of the underlying data.

It should be emphasized that the data exploration and analysis process is usually iterative, i.e., the most important variables can be determined only after various steps of the data mining process. In the beginning, there are usually several dozens of measurements which will then be reduced to the most important ones affecting the behavior of the process. Several tests must be made and interpreted using the knowledge of process experts.

The topology preserving property together with the regular form of the SOM gives a compact base where many kinds of visualizations and interfaces may be linked together. In the practical examples, basic SOM visualizations were used together with methods that link different kinds of color visualizations. In this paper, the linking between the scatter plots has been made interactively by highlighting the interesting points with a few grades of gray. While this could be done without the SOM, the SOM grid brings a connection to cluster visualization by means of the u-matrix. Furthermore, an automated coloring using a simple continuous color coding of the SOM grid is used to get an overall view of the linking, see for instance [4]. A coloring that brings up the cluster structure (see [7,8]) would certainly be useful in this operation. The scatter plots connected to the map grid will benefit the analysis only if the dependencies are such that a variable can be considered to be (locally) a function of mainly one other latent variable. If the dependencies are more complex, the scatter plot visualization with the color linking becomes useless. Despite their evident limitations, the presented methods have facilitated industrial data analysis, especially in the explorative phase of the work.
Acknowledgements

The research in this work was carried out in the technology program "Adaptive and Intelligent Systems Applications" financed by the Technology Development Centre of Finland (TEKES) and in the "Application of Neural Network based Models for Optimization of the Rolling Process" (NEUROLL) project financed by the European Union. The cooperation of the following enterprises is gratefully acknowledged: Jaakko Poyry Consulting, Rautaruukki and UPM-Kymmene. Mr. Juha Parhankangas is acknowledged for carrying out various elaborate programming tasks in the projects.

REFERENCES

1. E. J. Ainsworth. Classification of Ocean Colour Using Self-Organizing Feature Maps. In Proceedings of IIZUKA '98, volume 2, pages 996-999, 1999.
2. E. Alhoniemi, J. Himberg, and J. Vesanto. Probabilistic Measures for Responses of Self-Organizing Map Units. 1999. Accepted for publication in International ICSC Symposium on Advances in Intelligent Data Analysis.
3. V. Aristide. On the use of two traditional statistical techniques to improve the readability of Kohonen Maps. In Proceedings of the NATO ASI on Statistics and Neural Networks, 1993.
4. J. Himberg. Enhancing SOM-based data visualization by linking different data projections. In L. Xu, L. W. Chan, and I. King, editors, Intelligent Data Engineering and Learning, pages 427-434. Springer, 1998.
5. L. Holmström and A. Hämäläinen. The self-organizing reduced kernel density estimator. In Proc. ICNN'93, Int. Conf. on Neural Networks, volume I, pages 417-421, Piscataway, NJ, 1993. IEEE Service Center.
6. J. Iivarinen, T. Kohonen, J. Kangas, and S. Kaski. Visualizing the clusters on the self-organizing map. In C. Carlsson, T. Järvi, and T. Reponen, editors, Proc. Conf. on Artificial Intelligence Res. in Finland, number 12 in Conf. Proc. of Finnish Artificial Intelligence Society, pages 122-126, Helsinki, Finland, 1994. Finnish Artificial Intelligence Society.
7. S. Kaski, J. Venna, and T. Kohonen. Tips for Processing and Color-Coding of Self-Organizing Maps. In G. Deboeck and T. Kohonen, editors, Visual Explorations in Finance, Springer Finance, chapter 14, pages 195-202. Springer-Verlag, 1998.
8. S. Kaski, J. Venna, and T. Kohonen. Coloring that reveals high-dimensional structures in data. 1999. Submitted to ICONIP'99.
9. T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, Heidelberg, 1995.
10. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas. Engineering Applications of the Self-Organizing Map. Proceedings of the IEEE, 84(10):1358-1384, 1996.
11. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA, 1992.
12. V. Tryba, S. Metzen, and K. Goser. Designing basic integrated circuits by self-organizing feature maps. In Neuro-Nîmes '89, Int. Workshop on Neural Networks and their Applications, pages 225-235, Nanterre, France, 1989. ECS.
13. A. Ultsch and H. P. Siemon. Kohonen's self organizing feature maps for exploratory data analysis. In Proc. INNC'90, Int. Neural Network Conf., pages 305-308, Dordrecht, Netherlands, 1990. Kluwer.
14. A. Ultsch. Self organized feature maps for monitoring and knowledge acquisition of a chemical process. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 864-867, London, UK, 1993. Springer.
15. J. Vesanto and J. Ahola. Hunting for Correlations in Data Using the Self-Organizing Map. 1999. Accepted for publication in International ICSC Symposium on Advances in Intelligent Data Analysis.
16. J. Vesanto, E. Alhoniemi, J. Himberg, K. Kiviluoto, and J. Parviainen. Self-Organizing Map for Data Mining in Matlab: the SOM Toolbox. Simulation News Europe, (25):54, March 1999.
17. J. Vesanto, J. Himberg, M. Siponen, and O. Simula. Enhancing SOM Based Data Visualization. In Takeshi Yamakawa and Gen Matsumoto, editors, Proceedings of the 5th International Conference on Soft Computing and Information/Intelligent Systems, pages 64-67. World Scientific, 1998.
Keyword index
adaptive systems 349
aggregation operators 47
audio indexing 363
autoencoder 57
Bayesian model 111, 121
behavioural synthesis 231
biological modeling 243, 267
chemistry 335
classification 183, 317
classifier design 71
clustering 47, 131
  topographic 57
competition function 47
computational neuroscience 243
computer-aided design (CAD) 231
content-addressable search 171
content-based image retrieval 349
cortical maps 267
data analysis 1, 15, 335
  economic 1
  financial 15, 197
data mining 1, 15, 33, 131, 171, 375
data selection 57
data visualization 1, 15, 33, 97, 375
digital implementation 145
document categorization 171, 183, 197
DSLVQ 317
dysphonia 329
economic data 1
EEG 317
electronic CAD 231
energy functions 303
evolution 157
feature selection 317
financial data 15, 197
force sensing 207
generative model 121
genetic 157
growing Self-Organizing Map 131, 279
hierarchical feature map 183
hyperbolic geometry 97
hyperbolic tesselation 97
image databases 349
incremental learning 131
information retrieval 171, 183, 363
knowledge discovery 33, 171
large Self-Organising Maps 171
latent semantic analysis 171, 363
lateral connections 243
learning vector quantization (LVQ) 47, 317
local learning parameters 293
manipulation 207
Markov chain 145
missing data 57
nearest prototypes 71
neurobiological models 243, 267
non-Euclidean space 97
non-stationary distributions 131
optimization 219
over-fitting 293
pairwise data 57
pattern recognition 111
phase transitions 293
phonation 329
population code 267
post-supervision 71
pre-supervision 71
quantization effects 145
Raman spectroscopy 335
reaction-diffusion 253
reformulation 47
robots 207
scheduling by SOM 231
second-order learning 293
soft assignments 303
spatio-temporal memory 253
spectroscopy 335
speech recognition 363
spiking neurons 243
stock markets 15
system analysis 375
text exploration 171, 183, 197
time dependent vector quantization 253
time series 33
topographic clustering 57
topology optimization 157
topology preservation 279
transcription 157
travelling salesman problem (TSP) 219
tree-structure 121
U-matrix 33
vector quantization 111, 131
  time dependent 253
  see also learning vector quantization
vector space model 171, 183
visual cortex 243
visualization 1, 15, 33, 97, 375
voice quality 329