Implementation Techniques
Neural Network Systems Techniques and Applications Edited by Cornelius T. Leondes
VOLUME 1. Algorithms and Architectures VOLUME 2. Optimization Techniques VOLUME 3. Implementation Techniques VOLUME 4. Industrial and Manufacturing Systems VOLUME 5. Image Processing and Pattern Recognition VOLUME 6. Fuzzy Logic and Expert Systems Applications VOLUME 7. Control and Dynamic Systems
Implementation Techniques Edited by
Cornelius T. Leondes Professor Emeritus University of California Los Angeles, California
V O L U M E
O
OF
Neural Network Systems Techniques and Applications
ACADEMIC PRESS San Diego London Boston New York Sydney Tokyo Toronto
This book is printed on acid-free paper. ( ^ Copyright © 1998 by ACADEMIC PRESS All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Academic Press a division of Harcourt Brace & Company
525 B Street, Suite 1900, San Diego, California 92101-4495, USA http://www.apnet.com Academic Press Limited 24-28 Oval Road, London NWl 7DX, UK http://www.hbuk.co.uk/ap/ Library of Congress Card Catalog Number: 97-80441 International Standard Book Number: 0-12-443863-6 PRINTED IN THE UNITED STATES OF AMERICA 97 98 99 00 01 02 ML 9 8 7 6
5
4
3 2 1
Contents
Contributors Preface xv
xiii
Recurrent Neural Networks: Identification and Other System Theoretic Properties Francesca Albertini and Paolo Dai Pra
I. Introduction
1
II. Recurrent Neural Networks 4 A. The Model 4 B. Examples of Admissible Activation Functions C. State Observability 7 D. Identiflability and Minimality 11 E. Controllability and Forward Accessibility III. Mixed Networks 17 A. The Model 17 B. state Observabihty 17 C. Proofs on Observability 23 D. Identiflability and Minimality 30 E. Proofs on Identifiability and Minimality rV. Some Open Problems References 47
46
5
13
33
vi
Contents
Boltzmann Machines: Statistical Associations and Algorithms for Training N. H. Anderson and D. M. Titterington I. Introduction
51
II. Relationship with Markov Chain Monte Carlo Methods 53 III. Deterministic Origins of Boltzmann Machines IV. Hidden Units
55
56
V. Training a Boltzmann Machine 57 A. The Case without Hidden Units 58 B. The Case with Hidden Units 59 C. Relationship with the EM Algorithm 60 D. Discussion of the Case with Training Data E. Information Geometry of Boltzmann Machines F. Training Boltzmann Machines by Alternating Minimization 63 G. Parameter Estimation Using Mean-Field Approximations 64 H. The Method of Gelfand and Carlin 65 I. Discussion 66 VI. An Example with No Hidden Unit 67 A. Exact Likelihood Estimation 68 B. The Method of Geyer and Thompson C. Alternating Minimization 71 D. Conclusions from the Example 72
61 62
70
VII. Examples with Hidden Units 73 A. Description of the Examples 73 B. The Gradient-Descent Method 73 C. Direct Maximization 76 D. Application of the EM Algorithm 76 VIII. Variations on the Basic Boltzmann Machine A. The Boltzmann Perceptron 81 B. Polytomous Boltzmann Machines 82 C. Neal's Approach Based on Belief Networks IX. The Future Prospects for Boltzmann Machines References 87
81
84 86
Contents
Constructive Learning Techniques for Designing Neural Network Systems Colin Campbell I. Introduction 91 II. Classification 95 A. Introduction 95 B. The Pocket Algorithm 96 C. Tower and Cascade Architectures 97 D. Tree Architectures: The Upstart Algorithm E. Constructing Tree and Cascade Architectures by Using Dichotomies 103 F. Constructing Neural Networks with a Single Hidden Layer 109 G. Summary 111 III. Regression Problems 111 A. Introduction 111 B. The Cascade Correlation Algorithm 111 C. Node Creation and Node-Splitting Algorithms D. Constructing Radial Basis Function Networks E. Summary 123 IV. Constructing Modular Architectures 124 A. Introduction 124 B. Neural Decision Trees 124 C. Reducing Tree Complexity 127 D. Other Approaches to Growing Modular Networks 128 V. Reducing Network Complexity 129 A. Introduction 129 B. Weight Pruning 129 C. Node Pruning 132 D. Summary 133 VI. Conclusion 134 VII. Appendix: Algorithms for Single-Node Learning A. Binary-Valued Nodes 135 B. Continuously-Valued Nodes 138 References 139
101
116 119
135
viii
Contents
Modular Neural Networks Kishan Mehrotra and Chilukuri K. Mohan I. Introduction 147 II. Why Modular Networks? 149 A. Computational Requirements 149 B. Conflicts in Training 150 C. Generalization 150 D. Representation 151 E. Hardware Constraints 151 III. Modular Network Architectures 152 IV. Input Decomposition 152 A. Neocognitron 153 B. Data Fusion 154 V. Output Decomposition 154 VI. Hierarchical Decomposition 157 VII. Combining Outputs of Expert Modules 159 A. Maximum Likelihood Method 160 B. Mixture of Experts Networks 161 C. Mixture of Linear Regression Model and Gaussian Densities 162 D. The EM Algorithm 164 VIII, Adaptive Modular Networks 167 A. Attentive Modular Construction and Training B. Adaptive Hierarchical Network 172 C. Adaptive Multimodule Approximation Network D. Blockstart Algorithm 176 IX. Conclusions 177 References 178
Associative Memories Zong-Ben Xu and Chung-Ping Kwong I. Introduction 183 A. Associative Memory Problems B. Principles of Neural Network Associative Memories 185 C. Organization of the Chapter
184
191
169 173
Contents II. Point Attractor Associative Memories A. Neural Network Models 193 B. Neurodynamics of PAAMs 198 C. Encoding Strategies 205
192
III. Continuous PAAM: Competitive Associative Memories 213 A. The Model and Its Significance 214 B. Competition-Winning Principle 217 C. Competitive Pattern Recognition Example D. Application: Perfectly Reacted PAAM IV. Discrete PAAM: Asymmetric Hopfield-Type Networks 232 A. Neural Network Model 233 B. General Convergence Principles 234 C. Classification Theory for Energy Functions D. Application: Complete-Correspondence Encoding 245 y. Summary and Concluding Remarks References 253
250
A Logical Basis for Neural Network Design Robert L Fry and Raymond M. Sova I. Motivation II. Overview
259 262
III. Logic, Probability, and Bearing
273
rv. Principle of Maximized Bearing and ILU Architecture 278 V. Optimized Transmission
293
VI. Optimized Transduction
296
VII. ILU Computational Structure VIII. ILU Testing
300
302
IX. Summary 305 Appendix: Significant Marginal and Conditional ILU Distributions 305 References 307
222 228
239
Contents
Neural Networks Applied to Data Analysis Aarnoud Hoekstra, Robert P. W. Duin, and Martin A. Kraaijveld I. Introduction 309 II. Data Complexity 311 A. Prototype Vectors 311 B. Subspace or Projection Methods 313 C. Density Estimation Methods 314 III. Data Separability 316 A. Mapping Algorithms 318 B. VisuaUzing the Data Space 321 C. Experiments 322 IV. Classifier Selection 327 A. Confidence Values 328 B. Confidence Estimators 329 C. VaUdation Sets 334 D. Experiments 334 V. Classifier Nonlinearity 341 A. A Nonlinearity Measure 341 B. Experiments 344 VI. Classifier Stability 354 A. An Estimation Procedure for Stability 354 B. Stabilizers 358 C. Stability Experiment 363 VII. Conclusions and Discussion 366 References 367
Multimode Single-Neuron Arithmetics Chang N. Zhang and Meng Wang I. Introduction 371 II. Defining Neuronal Arithmetics 374 A. A Brief History of Study 374 B. Polarization Configuration 375 C. Composite Operation and Clean Operation 376 D. Developedness of Arithmetical Modes 378 III. Phase Space of Neuronal Arithmetics 378 A. Prototypes of Clean Arithmetics 378 B. Domain Range 380
Contents C. Dynamic Range 384 D. Phase Diagram 386 IV. Multimode Neuronal Arithmetic Unit 387 A. Developedness of Arithmetic Operations 387 B. Multimode Arithmetic Unit 389 C. Triggering and Transition of Arithmetic Modes V. Toward a Computing Neuron 392 VI. Summary 395 References 395
Index
397
390
This Page Intentionally Left Blank
Contributors
Numbers in parentheses indicate the pages on which the authors' contributions begin.
Francesca Albertini (1), Dipartimento di Matematica Pura e Applicata, Universita di Padova, 35131 Padova, Italy N. H. Anderson (51), Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland Colin Campbell (91), Advanced Computing Research Centre, University of Bristol, Bristol BS8 ITR, England Paolo Dai Pra (1), Dipartimento di Matematica Pura e Applicata, Universita di Padova, 35131 Padova, Italy Robert P. W. Duin (309), Pattern Recognition Group, Faculty of Applied Physics, Delft University of Technology, 2600 GA Delft, The Netherlands Robert L. Fry (259), Applied Physics Laboratory, The Johns Hopkins University, Laurel, Maryland 20723-6099 Aarnoud Hoekstra (309), Pattern Recognition Group, Faculty of Applied Physics, Delft University of Technology, 2600 GA Delft, The Netherlands Martin A. Kraaijveld (309), Research and Technical Services, Shell International Exploration and Production, 2280 AB Rijswijk, The Netherlands Chung-Ping Kwong (183), Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, N. T., Hong Kong Kishan Mehrotra (147), Department of Electrical Engineering and Computer Science, Center for Science and Technology, Syracuse University, Syracuse, New York 13244-4100 xiii
xiv
Contributors
Chilukuri K. Mohan (147), Department of Electrical Engineering and Computer Science, Center for Science and Technology, Syracuse University, Syracuse, New York 13244-4100 Raymond M. Sova (259), Applied Physics Laboratory, The Johns Hopkins University, Laurel, Maryland 20723-6099 D. M. Titterington (51), Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland Meng Wang (371), Department of Computer Science, University of Regina, Regina, Saskatchewan S4S 0A2, Canada Zong-Ben Xu (183), Institute for Information and System Sciences and Research Center for Applied Mathematics, Xi'an Jiaotong University, Xi'an, China Chang N. Zhang (371), Department of Computer Science, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
Preface Inspired by the structure of the human brain, artificial neural networks have been widely applied to fields such as pattern recognition, optimization, coding, control, etc., because of their ability to solve cumbersome or intractable problems by learning directly from data. An artificial neural network usually consists of a large number of simple processing units, i.e., neurons, via mutual interconnection. It learns to solve problems by adequately adjusting the strength of the interconnections according to input data. Moreover, the neural network adapts easily to new environments by learning, and can deal with information that is noisy, inconsistent, vague, or probabilistic. These features have motivated extensive research and developments in artificial neural networks. This volume is probably the first rather comprehensive treatment devoted to the broad area of practical and effective implementation algorithms and techniques architectures for the realization of neural network systems. Techniques and diverse methods in numerous areas of this broad subject are presented. In addition, various major neural network structures for achieving effective systems are presented and illustrated by examples in all cases. Numerous other techniques and subjects related to this broadly significant area are treated. The remarkable breadth and depth of the advances in neural network systems with their many substantive applications, both realized and yet to be realized, make it quite evident that adequate treatment of this broad area requires a number of distinctly titled but well integrated volumes. This is the second of seven volumes on the subject of neural network systems and it is entitled Optimization Techniques, The entire set of seven volumes contains Volume 1: Algorithms and Architectures Volume 2: Optimization Techniques Volume 3: Implementation Techniques
xvi
Preface Volume 4: Volume 5: Volume 6: Volume 7:
Industrial and Manufacturing Systems Image Processing and Pattern Recognition Fuzzy Logic and Expert Systems Applications Control and Dynamic Systems
The first contribution to Volume 3 is "Recurrent Neural Networks: Identification and Other System Theoretic Properties," by Francesca Albertini and Paolo Dai Pra. Recurrent neural networks may be characterized as networks in which the neurons arranged in layers have feedforward, feedback, and possible lateral connections. In the terminology of multi-input-multi-output of nonlinear dynamic control systems, in general, such systems can be represented by nonlinear state space system equations and, depending on the application, the time variable may be treated as either continuous or discrete. Such neural network systems models occur in many important applications such as signal processing, diverse control system applications, etc. This contribution treats such broad areas of fundamental importance as observability, controllability, and identifiability. The major issue of minimal recurrent neural networks is also examined in depth. Many results of fundamental importance to the broadly significant area of recurrent neural network systems are presented. The next contribution is "Boltzmann Machines: Statistical Associations and Algorithms for Training," by N. H. Anderson and D. M. Titterington. Boltzmann machines are a class of recurrent stochastic neural network systems that are normally associated with binary outputs. In the case where the nodes are all visible, they can be regarded as stochastic versions of Hopfield networks, but in practice they often include hidden units. In principle, Boltzmann machines represent a flexible class of exponential family models for multivariate binary data; the flexibility is enhanced by the inclusion of hidden units (or latent variables in statistical terms). They are being appHed to numerous practical problems, particularly in the neural network literature, and they are also being used as the foundations of more complicated networks. Methods for efficient training of Boltzmann machines are under active development. Boltzmann machines continue to be an area for strong development, in particularly at the interface between neural network systems and modern statistics. This contribution is an in-depth and rather comprehensive treatment of the issues and techniques involved in Boltzmann machines, and it includes numerous illustrative examples. The next contribution is "Constructive Learning Techniques for Designing Neural Network Systems," by Colin Campbell. Constructive algorithms determine both the architecture of a suitable neural network system and the neural network system parameters (weights, thresholds, etc.) necessary
Preface
xvii
for learning the data. Constructive algorithms have the inherent advantages of rapid training in addition to finding both the architecture and weights. This contribution is a rather comprehensive treatment of constructive learning techniques for designing neural network systems. It includes numerous methods for treating the various significant aspects of this broad technique. The next contribution is "Modular Neural Networks," by Kishan Mehrota and K. Mohan. Many problems are best solved using neural networks whose architecture consists of several modules with sparse interconnections between them. Modularity allows the neural network developer to solve smaller tasks separately using small neural network modules and then to combine these modules in a logical manner. In this contribution, various modular neural network architectures are surveyed. Input modularity, output modularity, hierarchical organization, and combinations of expert networks are examined in detail, followed by discussion of four adaptive modular network construction algorithms. These networks have the potential to reduce the total time required for training networks by orders of magnitude while yielding better quality results. Illustrative examples are presented which clearly manifest the significance of the results presented in this contribution. The next contribution is "Associative Memories," by Zong-Ben Xu and Chung-Ping Kwong. Associative memories are one of the most active research areas in neural network systems. They have emerged as efficient models of biological memory and have produced powerful techniques in various applications of substantive significance including pattern recognition, expert systems, optimization problems, and intelligent control. This contribution is a rather comprehensive and in-depth treatment of associative memory system techniques and their application. It is perhaps worthy of mention that all existing neural network system models can function as associative memories in one way or another. Numerous illustrative examples are included in this contribution which clearly manifest the substantive significance of associative memory systems. The next contribution is "A Logical Basis for Neural Network Design," by Robert L. Fry and Raymond M. Sova. This contribution presents the characterization of a logical basis for the design of neural network systems. A methodology is developed for the design, synthesis, analysis, testing, and, above all, the understanding of a new type of computational element called the inductive logic unit (ILU). The ILU, as a physical device, may be thought of as being analogous to other physical devices utilized in electronic systems such as a transistor, operational amplifier, or logical gate. Such a characterization facilitates application-oriented designs. The ILU is shown to be a very powerful computational device, and it is also treated as
xviii
Preface
part of a larger system which is called an inductive logic computer (ILC). This contribution focuses on the ILU, but ultimately the ILU will typically be but one of many similar components comprising an ILC that can be configured to treat a wide variety of rather important applications. The next contribution is "Neural Networks Applied to Data Analysis," by Aarnoud Hoekstra, Robert P. W. Duin, and Martin A. Kraaijveld. This contribution is an in-depth treatment of nonlinear data analysis techniques utilizing neural network systems, primarily of the class of feedforward neural system classifiers. Techniques are presented for the analysis of the data themselves, as well as for the selection, analysis, and stability of classifiers. The nonlinear characteristics of neural network systems are thereby emphasized in the effective process of data analysis. Numerous illustrative examples are presented which clearly manifest the substantive effectiveness of the techniques presented. The final contribution to this volume is "Multimode Single-Neuron Arithmetics," by Chang N. Zhang and Meng Wang. Interest in modeling single neuron computation has been constantly shaped by two types of considerations: computational ability and biophysical plausibility. This contribution is a presentation of the potential and limitations of single neuron arithmetics. Driven by suitably patterned input signals, passive membranes can effectively perform multimode arithmetic operations on the input conductances. Based on this, an abstract model of a neuronal arithmetic unit is presented. By taking active membrane mechanisms into account, an integrated neuron model can be constructed that may function as a programmable rational approximator. Numerous illustrative examples are presented. This volume on implementation techniques in neural network systems clearly reveals their effectiveness and significance, and with further development, the essential role they will play in the future. The authors are all to be highly commended for their splendid contributions to this volume which will provide a significant and unique reference source for students, research workers, practitioners, computer scientists, and others on the international scene for years to come. Cornelius T. Leondes
Recurrent Neural Networks: Identification and Other System Theoretic Properties Francesca Albertini*
Paolo Dai Pra
Dipartimento di Matematica Pura e Applicata Urdversita di Padova 35131 Padova, Italy
Dipartimento di Matematica Pura e Applicata Universita di Padova 35131 Padova, Italy
I. INTRODUCTION Neural networks have become a widely used tool for both modeling and computational purposes. From plasma control to image or sound processing, from associative memories to digital control, neural network techniques have been appreciated for their effectiveness and their relatively simple implementability. We refer the reader to Hertz et aL [1] and Hunt et al. [2] (and references therein) for seminal works, and to Sontag [3], Bengio [4], and Zbikowski and Hunt [5] for more recent reviews on the subject. The rigorous analysis of neural network models has attracted less interest than their appUcations, and so it has developed at a slower pace. There are, of course, some exceptions. In the context of associative memories, for instance, a rather sophisticated study can be found in Talagrand [6] and Bovier and Gayrard [7]. The purpose of this work is to present up-to-date results on a dynamical version of neural networks, particularly recurrent neural networks. Recurrent neural networks are control dynamical systems that, in continuous time, are described *Also: Istituto di Ingegneria Gestionale, Viale X Giugno 22, 36100, Vicenza, Italy. Implementation Techniques Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.
1
Francesca Albertini and Paolo Dai Pra
by a system of differential equations of the form X = a (Ax + Bu) where x G M", u € R'", y G R^, and A, B,C are matrices of appropriate dimension. The function cr : R" ^- R'^ is defined by 5(^) = (or(xi),..., cr(xn)), where a is an assigned nonhnear function. In (1), M represents the input signal, or control, and y represents the output signal, or observation. Systems of type (1) are commonly used as realizations for nonlinear input/output behaviors; the entries of the matrices A, B,C, often called weights of the networks, are determined on the basis of empirical data and some appropriate best fitting criterion. A brief survey of the use of recurrent neural networks as models for nonlinear systems can be found in Sontag [8]. Models of these types are used in many different areas; see, for example, Bengio [4] and Matthews [9] and Sontag [10] and Polycarpou and loannou [10] for signal processing and control applications, respectively. The system theoretic study of recurrent neural networks was initiated by Sussmann [11] and Albertini and Sontag [12, 13], and continued by Albertini and Sontag [14] and Albertini and Dai Pra [15]. The purpose of this study is twofold. • Systems of type (1) provide a class of "semilinear" systems, the behavior of which may be expected to exhibit some similarity to linear systems that correspond to the choice a = identity. It is therefore natural to attempt to characterize the standard system theoretic properties, observability, controllability, identifiability, detectability, in terms of algebraic equations for the matrices A, 5, C, as for linear systems. One may expect that the nonlinearity introduced by a "typical" function a induces some "chaotic" disorder in the system, the overall effect of which can be described in simple terms. Surprisingly enough, this picture turns out to be correct, under quite reasonable assumptions about the model. • In modeling nonlinear input/output behaviors by recurrent neural networks, it is desirable to avoid redundances. More precisely, one would like to work with a recurrent neural network that is minimal; this means that the same input/output behavior cannot be produced by a recurrent neural network with lower state space dimension [= n in (1)]. This leads to a classical identifiability problem: characterize minimal recurrent neural networks in terms of their weights. Moreover, by analogy with linear systems, questions on relationships between minimality, observability, and controllability naturally arise. There are many other interesting issues related to recurrent neural networks, such as computability, parameter reconstruction, stability properties, memory capacity, sample complexity for learning, and others that will not be addressed in this work [16-21]. The analysis of recurrent neural networks, in both discrete and continuous time, is the subject of Section II. After stating some assumptions on the model
Recurrent Neural Networks
3
(Sections II.A and II.B), we give necessary and sufficient conditions for a system of type (1) to be observable, that is, for the initial state of the system to be determined by the input/output behavior. Such conditions can be checked by performing a finite number of linear operations on the entries of the matrices A,C. Moreover, it is shown that observability of the corresponding linear system (<j = identity) implies observability of the network, whereas the opposite is false. In Section II.D we turn to the problem of characterizing identifiability and minimality of recurrent neural networks. These notions are shown to be essentially equivalent to observability. Moreover, for nonobservable systems, it is shown that through an observability reduction similar to the one for linear systems [see, e.g., [22]], one can lower the state-space dimension without changing the input/output map. Unlike for linear systems, it may appear that controllability does not play any role in minimality. However, this is not the case. The above observability and minimality results are proved under "genericity" assumptions about the function cr and the control matrix B. In Section lI.E we show, for continuous time systems, that these assumptions imply forward accessibility, which means that by choosing different controls u we can steer the state of the system to any point of a nonempty open set. Forward accessibility for a discrete-time recurrent neural network is a more difficult problem and is not completely understood. Known results on the subject are summarized in Section lI.E. Systems of type (1), although very flexible, may be not the best choice in some contexts. In many engineering applications, nonlinear systems arise as perturbations of linear ones. This happens, for instance, when a linear and a nonlinear system are interconnected [see ref. [22], Chap. 6]. There is a natural way of generalizing systems of type (1) to include interconnections with linear systems. We consider control systems that, in continuous time, take the form XI = G-(A^^xi + A^^X2 + B^u) X2 = A^hi -h A^^X2 + B2U , [ y = C^xi + C^X2
(2)
where xi € E^i, X2 € E"^ ^ u eW^, y € E^, and a is a given nonlinear function. In Section III we present a complete analysis of observability, identifiability, and minimality for such systems, which will be called mixed networks. The results on observability can also be found in Albertini and Dai Pra [23, 24], and results concerning identifiability and minimality, which are technically the most demanding, are original contributions of this work. It is relevant to point out that the main point in understanding the minimality of both systems (1) and (2) consists of determining the symmetry group of minimal systems, that is, the linear transformations of the state space that do not affect the type of system and its input/output behavior. It is well known that the symmetry group of a minimal linear system of dimension n is GL(n), the group of all in-
4
Francesca Albertini and Paolo Dai Pra
vertible linear transformations. It turns out that minimal recurrent neural networks have a finite synmietry group, whereas mixed networks have an infinite symmetry group, but much smaller than GL{n\ +^2). This reductiont)f the symmetry group is not surprising, since the nonlinearity of a prevents linear symmetries. What is remarkable is that the knowledge of linear synunetries is enough to understand observability and identifiability. Section IV presents some open problems related to the properties of recurrent neural networks and mixed networks discussed in this work.
11. RECURRENT NEURAL NETWORKS A. THE MODEL In this section we consider recurrent neural networks evolving in either discrete or continuous time. We use the superscript + to denote time shift (discrete time) or time derivative (continuous time). The basic models we deal with are those in which the dynamics are assigned by the following difference or differential equation: jc"^ =
G{AX
y = Cx,
+ Bu) (3)
withx G R", w G R'^, y e W, A e R"^^ B e R"^'", and C e R^^". Moreover, 5(x) = [a(jci),..., a(;c„)], where a is a given odd function from R to itself. Clearly, if a is the identity, the model in (3) is just a standard linear system. If a is a nonlinear function, then systems of type (3) are called (single layer) recurrent neural networks (RNNs), and the function a is referred to as the activation function. For continuous time models, we always assume that the activation function a is, at least, locally Lipschitz, and the control map M() is locally essentially bounded, so the differential equation in (3) has a unique local solution. The system theoretic analysis of RNNs will be carried on under two basic conditions, involving the activation function a and the control matrix B. The first condition is a nonlinearity requirement on a. DEFINITION 1. A function a : R -> R is an admissible activation function if, for all A^ G N and all pairs (ai, b\),..., (^AT, BN) G R^, such that hi / 0 and (ai,bi) 7^ ±{aj,bj) for all/ ^ y, the functions^ ~> G{ai -\-bi^), / = 1 , . . . , A^, and the constant 1 are linearly independent.
Recurrent Neural Networks
5
Note that no polynomial is an admissible activation function. Sufficient conditions and examples of admissible activation functions will be given in Section II.B. The next condition is a controllability-type requirement on the matrix B. DEFINITION 2. A matrix B GW^^ is an admissible control matrix if it has no zero row, and there are no two rows that are equal or opposite. The above assumption is equivalent to saying that, whatever the matrix A is, and provided a is an admissible activation function, the orbit of the control system A:+ = 5 ( A X + 5 M )
(4)
is not confined to any subspace of positive codimension. In fact, the following statement is easy to prove. PROPOSITION 3. If a is an admissible activation function, then B is an admissible control matrix if and only ifspa.n[a(Bu): u € W^} = R" or, equivalently, for all X e W, spm{5(x -h Bu): u eW} = W.
The relation between the admissibility of B and the controllability of the system (3) will be the subject of Section lI.E. DEFINITION 4. A system of type (3) is said to be an admissible RNN if both the activation function a and the control matrix B are admissible.
B. EXAMPLES O F A D M I S S I B L E ACTIVATION FUNCTIONS In this section we state and prove two criteria for verifying admissibility of an activation function. For the first one see also Albertini et al [25]. PROPOSITION 5. Suppose a is a real analytic function that can be extended to a complex analytic function on a strip {z G C: \Im{z)\ S c}\ {zo» ib}, where Izol = c and zo, ib cire singularities (poles or essential singularities). Then a is admissible.
Proof Let (ai, ^ i ) , . . . , (AAT, ^A^) be such that bi ^ 0 and {ai^bt) / zb(aj, bj) for all/ i^ j . Suppose |Z7i| > |Z?2| > • • • > I^A^I, a n d l e t c i , . . . , c^v G M be such that
J2ciCT(ai-hbi^)=0
(5)
i=l
for all ^ € R. Because a is odd, we can assume without loss of generality that bi > 0 for all /. Note that the identity (5) extends to {§ e C: |/m(^)| < c/bi) \
6
Francesca Alhertini and Paolo Dai Pra
{(zo — cLi)lbi, (zo — cii)/bi: / = ! , . . . , / : } where k
1, lim^aiai^bi^n)
= ^(^^' + ^ ^ ' ( ^ ^ ^ ) ) ^ ^•
(6)
Thus, dividing expression (5) by a{ai + bi^) and evaluating at ^ = ^„, we get c\ -\- 7 Ci
= 0.
Letting n ^- 00, we conclude that c\ = 0. By repeating the the same argument, we get Ci = 0 for all /, which completes the proof. • It is very easy to show that two standard examples of activation functions, namely a{x) = tanh(A:) and or(x) = arctan(;c), satisfy the assumptions of Proposition 5, and so are admissible. The above criterion is based on the behavior of the activation function near a complex singularity. There are natural candidates for activation function that are either nonanalytic or entire, so Proposition 5 does not apply. We give here a further criterion that requires a suitable behavior of the activation function at infinity. PROPOSITION 6. Let a be an odd, bounded function such that, for M large enough, its restriction to [M, +00) is strictly increasing and has a decreasing derivative. Define
f(x)=
Mm a(?)-or(;c),
(7)
and assume lim ^ = 0 .
(8)
Then a is an admissible activation function. Proof
Suppose N
^c/afe+^-?)=0,
(9)
/=i
with bi ^ 0 and (ai^bi) ^ ±(<2;, bj) for all i ^ j . As before, we may assume that bi > 0 for all /. We order increasingly the pairs (a/, bi) according to the order relation {a, b) > (a\ y)
if
{b > b')
or
{b = b' and a > a').
(10)
Recurrent Neural Networks
7
So we assume (ai,bi) < • • • < (aN, b^). Letting ^ -)• cx) in (9), we get ^^ ci = 0. This implies N
J2cif(ai-hbi^)^0,
(11)
/=l
We now divide the expression in (11) by f(ai + bi^), and let § ^ - oo. In this way, if we can show that | ^ o o / ( a i +Z?i?) for every / > 2, we then get ci = 0 and, by iterating the argument, ct = 0 for all/. Notice that at + bj^ = ai -\- bi^ + (ai — ai) + (bj — bi)^, so for some c > 0, and for ^ sufficiently large, we have at -\-bi^ > ai + bi^ + c. Thus, to prove (12), it is enough to show that, for all c > 0, hm
f(x-j-c) _, ,
x^oo
f(x)
= 0.
(13)
By assumption, / ' is increasing for x large, and so f(x + c)
(14)
for X sufficiently large. Thus lim , , , • > hm 1 - - - — - — c x->00 / ( x + c )
which completes the proof.
x-^ooy
/ ( j c 4- C)
= +00,
(15)
/
•
Examples of functions satisfying the criterion in Proposition 6 are (T(x) = s g n ( ; c ) [ l - ^ - ^ ' ]
(16)
and a(x)=
f e'^^dt.
(17)
C. STATE OBSERVABILITY In general, let E be a given system, with dynamics x'^ = fix, u) y = h(x)
(18)
8
Francesca Albertini and Paolo Dai Pra
where jc 6 R", u eW^, j G R^; again, the superscript + denotes a time shift (discrete time) or time derivative (continuous time). Assume, for continuous time models, that the map / is, at least, locally Lipschitz. Moreover, let XQ G R " be a fixed initial state. Then to the pair (S, ;co) we associate an input/output map Xj:^xo as follows. • For discrete time models, with any input sequence u\,... ,Uk G R'", 'kY,,XQ associates the output sequence yo = h{xo), y\,... ,yk G R^, generated by solving (18), with controls M/ (/ = 1 , . . . , A:) and initial condition JCQ. • For continuous time models, to any input map u : [0, T] -^ W^, which is, at least, locally essentially bounded, we first let jc(0 be the solution of the differential equation in (18), with control M(-) and initial condition xo, and we denote by [0, €«) the maximum interval on which this solution is defined. Then, to the input map M(-), ^E.XO associates the output function y{t) = h{x{t)), r G [ 0 , 6 j . DEFINITION 7. We say that two states JCQ, XQ G R " are indistinguishable for the system E if A.s,;co = ^s,^:' • The system E is said to be observable if no two different states are indistinguishable.
The above notion is quite standard in system theory. In the case of linear systems, observability is well understood, and the following theorem holds. THEOREM 8. Suppose a{x) = x in (3). Denote by V the largest subspace ofW that is A-invariant (i.e., AV CV) and contained in ker C. Then two states X, z are indistinguishable if and only if x — z G V. In particular, the system is observable if and only ifV = {0}.
Among the various equivalent observability conditions for linear systems, the one above is the most suitable for comparison with the result we will obtain for recurrent neural networks. It should be remarked that, for general nonUnear systems, rather weak observabiUty results are known [see, e.g., [26-28]]. Before stating the main result of this section, we introduce a simple notion that plays a significant role in this work. DEFINITION 9. A subspace V c R" is called a coordinate subspace if it is generated by elements of the canonical basis {^i,..., ^„}. The relevance of coordinate subspaces in the theory of RNNs comes essentially from the fact that they are the only subspaces satisfying the following property, for a and B admissible: x^zeR"",
x-zeV
=>5(x-hBu)-
a(z -\-Bu) e V
VM G R'^.
(19)
This "invariance" property of coordinate subspaces is responsible for the surrival of some of the linear theory.
Recurrent Neural Networks
9
THEOREM 10. Consider an admissible system of type (3). Let V d W be the largest subspace such that
(i) AV CV andV C k e r C ; (ii) AV is a coordinate subspace. Then x,z EW- are indistinguishable if and only ifx — z € V. In particular, the system is observable if and only ifV = {0}. Theorem 10 is a special case of Theorem 34, the proof of which is given in Section III.C [see also [14]]. We devote the rest of this section to comments and examples. First of all, note that the largest subspace satisfying i) and ii) is well defined. Indeed, if Vi, V2 satisfy i) and ii), so does Vi + y2- Moreover, if V is the largest subspace for which i) holds, then x—zeVif and only if jc, z are indistinguishable for the Unear system (A,C). It follows that observability of the Hnear system (A, C) implies observability of the RNN (3). We now show that the observability condition in Theorem 10 can be efficiently checked by a simple algorithm. For a given matrix D, let lo denote the set of those indexes / such that the ith column of D is zero. Define, recursively, the following sequence of subsets of {1, 2 , . . . , «}: yo = { l , 2 , . . . , n } \ / c 7j+i = /o U {/: 3j e Jd such that Mj ^0}.
^ ^
Note that the sequence Jd is increasing, and stabilizes after at most n steps. Now let Oc{A, C) = spanl^y: j i Joo]
(21)
with Joo = [Jd JdPROPOSITION
11. The subspace V in Theorem 10 is given by V = kerC n A-^{Oc{A, C)).
(22)
In particular, the system is observable if and only if ker A H kerC = OM, C) = {0}.
(23)
Proof It is not hard to see that Oc(A, C) is the largest coordinate subspace contained in ker C and A-stable. Thus, the r.h.s. of (22) satisfies conditions i) and ii) of Theorem 10, and thus it is contained in V. For the opposite inclusion, just observe that, by definition, AV C Oc(A,C). • It is worth noting, and quite easy to prove, that x — z € V implies indistinguishability of x and z, for any system of type (3). Admissibility of the system
10
Francesca Alhertini and Paolo Dai Pra
guarantees that the condition is also necessary. We illustrate with some examples how that whole picture changes if we drop some admissibility requirements. EXAMPLE 12. Let a (•) be any periodic smooth function of period r; clearly such a function cannot be admissible. Consider the following system, with n — 2 and p = m = I:
x+ = y
a{x-\-bu)
= XI-
X2,
where b is any admissible 2 x 1 control matrix. It is easily checked that the observability conditions in Theorem 10 are satisfied. However, the system is not observable. Indeed, we consider x = (r, r). Then Cx = 0, and, because a is periodic with period r, it is easy to see that for both the discrete time and continuous time cases, x is indistinguishable from 0. EXAMPLE 13. Assume that a(x) = x^, which is not admissible. Consider a system of type (3), with this function a, n = 2, m = p = 1, and matrices A, B, and C as follows:
^^ = ( 0 1 ) '
'^"G)'
<^ = (-8'l)-
(25)
Notice that this system satisfies the observability conditions in Theorem 10, but it is not observable. In fact, let W = {(a, Sa): a e R}. It is easy to check that if X e W, then a (Ax + Bu) e W for every u e R. This implies that W is invariant for the dynamics and, being contained in ker C, it contains vectors that are indistinguishable from 0. In next example we show that the implication observability of the linear system (A,C)=^ observability of the RNN may fail if B is not admissible, even though a is admissible. Note that for linear systems the control matrix B does not influence observability. EXAMPLE 14. Suppose a is a strictly increasing admissible activation function. Pick any two nonzero real values jci, JC2 in the image of a such that
xi(T~^(x2) ^ X2cr~^(xi).
(26)
Such values always exist for nonlinear a. Consider the discrete time system with « = 2, /? = l , a n d 5 = 0:
A-
\
^1
C = (X2. -xi). X2
(27)
Recurrent Neural Networks
11
Given (26), it is easy to see that the pair {A, C) is observable; however, the nonlinear system is not. In fact, the state x = (x\,X2) is an equilibrium state and Cx = 0, so it is indistinguishable from zero.
D . IDENTIFIABILITY AND MINIMALITY Recurrent neural networks possess a very simple group of symmetries. Let us consider a model S of type (3), and suppose we exchange two components of jc. Define z — Sx, where S is the operator that exchanges two components. Noting that S = S~^ and aoS = Soa, it is easily seen that the following equations hold:
-f^;
a(A'z-hB'u) C'z
'
^^^^
with A' = SA5'~^ B^ = SB, C = CS~^, as for linear systems. Thus we say that the system (28) with initial condition zo = SXQ is input/output (i/o) equivalent to (3) with initial condition XQ\ this means that XY^^XQ = '^I:^zo• The above argument can be repeated by considering compositions of exchanges, that is, by letting S be any permutation operator. Because a is odd, the same argument applies to operators S that change the sign of some components. Permutation and sign changes generate a finite group that we denote by G„. We have thus shown that Gn is a symmetry group for any system of type (3), that is, transformation of the state by an element of Gn gives rise to an equivalent system that is still a RNN. In what follows, a RNN will be identified with the quadruple II„ = (cr, A, B, C), where the index n denotes the dimension of the state space. Moreover, with (i;„, jco), jco G M"", we will denote the RNN i:„ = {a,A,B,C) together with the initial state XQ; and we will call the pair (S^^, XQ) an initialized RNN, We will write (i:„,xo)-(5:;,,x^) to denote that the two initialized RNNs (S^, xo) and (E^,, XQ) are input/output equivalent, that is, A,S„,A:O = -^i:' x' • DEFINITION 15.
Let S„ = (a, A, 5 , C) and E^, = (a, A^ B\ C)
be
two
RNNs. The two initialized RNNs (S^, XQ) and (E^,, ^Q) ^^^ ^^^^ ^^ ^^ equivalent ifn = W and there exists S e Gn such that A' = 5 A 5 ~ ^ B' = SB, C' = CS'^, and%Q = Sxo. DEFINITION 16. An admissible RNN E„ = (cr, A, B, C) is said to be identifiable if the following condition holds: for every initial state A:O € E", and for
12
Francesca Albertini and Paolo Dai Pra
every admissible initialized RNN (E^, = (a, A', B\ C), JCQ) such that (E„, XQ) and (S^,, JCQ) are i/o equivalent, n' > n or (S„, XQ) and (E^,, JCQ) are equivalent. Note that the notions in Definitions 15 and 16 can be given for linear systems (or = identity) by replacing Gn with GL(n) = {n x n invertible matrices} (and cutting the admissibility assumption). We recall the following well-known result. THEOREM 17. A linear system (A, B,C) is identifiable if and only if it is observable and controllable.
We do not insist here on the notion of controllability; that will be discussed in the next section. Remarkably enough, a formally similar result holds for both continuous time and discrete time RNNs. THEOREM
18. An admissible RNN is identifiable if and only if it is observ-
able. Theorem 18 is a special case of Theorem 51, so we do not give its proof here. The "only if" part is not difficult to see. In fact, if a RNN E„ is not observable, then there exist two different states jci, JC2 that are indistinguishable, and thus (5]„,jci) ~ (5]„,JC2). Moreover, because the matrix B is admissible, the only matrix S e On such that B = SB is the identity; thus the two initialized RNNs (E„, jci), (T,n,X2) cannot be equivalent. In comparing Theorems 17 and 18, it is seen that no controUabihty assumption is required in Theorem 18. In a certain sense, the role of controllability is played by admissibility. Actual connections between admissibihty and controUabihty will be studied in Section lI.E. We have seen in Section II.C that observability of an admissible RNN is equivalent to ker A n kerC = OdA, C) = {0}.
(29)
Suppose Oc(A, C) ^ {0}, and let ni = dim OdA, C). Then, up to a permutation of the elements of the canonical basis, the system (3) can be rewritten in the form x^ = a(A\xi -h A2X2 + Biu) X2 = a(A3X2 + B2U)
(30)
y = C2X2,
with XI e R"i, X2 e R"~"i. It follows that the system (E„ = ((r,A, B, C)), with initial state (^1,^:2) € W, is i/o equivalent to (Zn-ni = (^' ^ 3 ' ^2, ^2)) with initial state X2 e W~^^. Thus we have performed a reduction in the dimension of the state space. On the other hand, if OdA, C) = {0}, then it will follow from the proof of Theorem 18 that for (E^, = (a, A^ 5 ^ CO, x^) to be i/o equivalent to (E„, jco), it must be that n' > n. Moreover, n = n' if and only if there exists
Recurrent Neural Networks S e Gn such that A' = SAS'^, B' = SB, C = C 5 ~ ^ and JCQ, indistinguishable for S^. We summarize these remarks as follows.
13 S'^X'^
are
DEFINITION 19. An admissible RNN S„ is said to be minimal if for all jco € E'^, and for any admissible initialized RNN (S^,, x^) such that (II„, XQ) and (E^,, JCQ) are i/o equivalent, it must be that n' >n. PROPOSITION
20. A« admissible RNN is minimal if and only if Oc (A, C) =
{0}. Notice that, clearly, if an admissible RNN is identifiable, then it is also minimal, whereas the converse implication may be false, as shown in the next example. EXAMPLE 21.
Let n = 2, and m = /? = 1. Consider any RNN 112 =
((7, A, 5 , C), where G is any admissible activation function, and B = (bi, b2)^ is any admissible control matrix (i.e., 0 ^ \bi\ ^ \b2\ ^ 0). Moreover, let C = (ci, C2), with Ci ^ 0, i = 1,2, and A be the zero matrix. Then, for this model, Oc(A, C) = {0}, because C has all nonzero columns, thus by Proposition 20, it is minimal. On the other hand, this model is not identifiable, because ker A Piker C ^ 0, and so it is not observable.
E. CONTROLLABILITY AND FORWARD ACCESSIBILITY A RNN E is said to be controllable if, for every two states xi,X2 eW^, there exists a sequence of controls ui,... ,uk G M'", for discrete time models, or a control map M() : [0, T] ^^ R'", for continuous time models, which steers xi to JC2. In general, for nonlinear models, the notion of controllability is very difficult to characterize; thus the weaker notion of forward accessibility is often studied. Let xo G M" be a state; E is said to be forward accessible from XQ if the set of points that can be reached from XQ, using arbitrary controls, contains an open subset of the state space. A model is said to be forward accessible if it is forward accessible from any state. Even if much weaker than controllability, forward accessibility is an important property. In particular, it implies that from any state, the forward orbit does not lay in a submanifold of the state space with positive codimension. Unlike the other properties that we have discussed before (observability, identifiability, and minimality), the characterization of which was the same for both dynamics, discrete and continuous time, the characterization of controllability and forward accessibility is quite different for the two dynamics.
14
Francesca Albertini and Paolo Dai Pra 1. Controllability and Forward Accessibility for Continuous Time RNNs
Let E be a continuous time RNN, that is, the dynamics are given by the differential equation x(t)=a{Axit)-\-Bu(t)),
teR.
(31)
Let Xu be the vector field defined by Xu(x) = a (Ax -\- Bu). Given the vector fields X^, let IAQ{XU\U G R ' " } be the Lie algebra generated by these vector fields. It is known that if Lie {Xu \u eW^} has full rank at x, then the system is forward accessible from x [29]. This result, together with Proposition 3 (which says that the ^i^diw{Xu]ueW has already full dimension at each x G R " ) , gives the following. THEOREM 22. Let E be an admissible RNN evolving in continuous time, then E is forward accessible. REMARK 23. In a recent paper [30], E. Sontag and H. Sussmann proved that if E is an admissible RNN, and the activation function a satisfies some extra assumptions, then E is indeed controllable. The extra requirement on a is fulfilled, for example, when cr = tanh. It is quite surprising that these RNNs are controllable for any matrix A.
2. Controllability and Forward Accessibility for Discrete Time RNNs Let E be a discrete time RNN, that is, the dynamics are given by the difference equation x(t+l)
= a{Ax(t)-^Bu{t)),
teZ.
(32)
Unlike in continuous time, characterizing the controllability of discrete time models is quite difficult. We will not give the proofs of the results presented in this section, because they are very long and technical; we refer the reader to Albertini and Dai Pra [15]. It is not difficult to see that, in this case, the admissibility assumption is not enough to guarantee forward accessibility. In fact, let p = rank [A, B], and V = [A, B](W^~^^). Then, except possibly for the initial condition, the reachable set from any point is contained in a (V), If p < n then, clearly, a (V) does not contain any open set. So a necessary condition for forward accessibility is p — n.
Recurrent Neural Networks
15
The result stated below proves the sufficiency of this condition, provided that the control matrix B and the activation function a satisfy a new condition that is, in some sense, stronger than admissibility. DEFINITION 24. We say that the activation function a and the control matrix B are n-admissible if they satisfy the following conditions.
1. a is differentiable and G\X) # 0 for all x G R. 2. Denote hy bt, / = ! , . . . , « , the rows of the matrix B\ then hi ^ 0 for all /. 3. For \
are Hnearly independent. A RNN S is n-admissible if its activation function a and its control matrix B are n-admissible. REMARK 25. Notice that, in the ^-admissibility property, we do not require that no two rows of the matrix B be equal or opposite. Moreover, the linear independence on the functions / / is asked only for a fixed n. On the other hand, because the functions / / involve products of the functions a\ai -h btu), this third requirement is, indeed, quite strong. Notice that the special case A: = 1 is equivalent to asking that the functions a {at + btu) and the constant 1 are linearly independent.
THEOREM 26. Let U be an n-admissible RNN evolving in discrete time. Then S is forward accessible if and only if rank [A, 5 ] = n.
(33)
REMARK 27. The n-admissibility property, as stated in Definition 24, is given as a joint property of (7() and B. This is not, indeed, what is desirable in applications, because usually or is a given elementary function. However, it is possible to prove that when a = tanh, a and B are n-admissible for all matrices 5 in a "generic" subset of R'^^'", that is, for B in the complement of an analytic subset of R"^'" (in particular, B may vary in an open dense subset of R"^'"). For more discussion and precise statements on this subject, see Albertini and Dai Pra [15, Sect. C].
The next theorem states another sufficient condition for forward accessibility, using a weaker condition on the map a, but adding a new condition on the pair A, B.
16
Francesca Albertini and Paolo Dai Pra
DEFINITION 28. We say that the activation function a and the control matrix B are weakly n-admissible if they satisfy the following conditions:
1. a is differentiable and cr'(x) 7^ 0 for all jc € R. 2. Denote by Z?/, / = 1 , . . . , n, the rows of the matrix B\ then bi ^ 0 for all /. 3. Let a i , . . . , a„ be arbitrary real numbers, then the functions from W^ to R (or'(a/ + ^ / M ) ) ~ ^ for / = 1 , . . . , n, are linearly independent. A RNN E is weakly n-admissible if its activation function a and its control matrix B are weakly n-admissible. REMARK 29. Notice that this condition is weaker than the one given in Definition 24; in fact, the third requirement of Definition 28 is exactly the third requirement of Definition 24 for the case A; = n — 1. It is not hard to show that if B is an admissible control matrix, the activation function tanh, together with the matrix B, is weakly n-admissible. However, the same would be false for the activation function arctan for n > 4.
THEOREM 30. Let H be a weakly n-admissible RNN evolving in discrete time. If there exists a matrix H eW^^^ such that (a) the matrix (A-\- BH) is invertible, (b) the rows of the matrix [{A -h BH)~^B] are all nonzero, then I] is forward accessible. It is easy to see that condition (a) of the previous theorem is equivalent to rank [A, B] = n; thus (a) is also a necessary condition for forward accessibility. So Theorem 30 adds a new condition on A, B [condition (b)] and guarantees forward accessibility with weaker assumption on a. REMARK 31. It is interesting to note that for the single-input case, condition (b) is independent on //. In fact, the following fact holds. Let h,k e W^^^ be such that A 4- bh^ and A + bk^ are invertible. Then
{{A-\-bk^)~^b).
^OWi <> {{A-hbh'y^b).
i^Q'ii.
It is not restrictive to assume that /: = 0. Let w = A~^b and i; = (A + bh^)~^b. To get the claim it is sufficient to show that there exists A. 7^ 0, such that i; = Ait;. Because A -h bh^ is invertible, we have that h^w ^ —I, otherwise (A + bh^)w = b — b = 0. Thus we may let 1 X = 1 -\-h^w Let v^ = Xw\ then (A + bh^)v' = X{Aw + bh^w) = kb{l + h^w) = b. So we may conclude that i;' = i;, as desired.
Recurrent Neural Networks
17
III. MIXED NETWORKS A. THE M O D E L In this section we consider models obtained by "coupling" a recurrent neural network with a linear system; we called this type of system mixed network. Furthermore, for these systems we consider both models to be evolving in discrete and continuous time. Again, the superscript + will denote a time shift (discrete time) or time derivative (continuous time). A mixed network (MN) is a system whose dynamics are described by equations of the form
x+ = A^hi + A^^X2 + B^u
(34)
y = C^xi + C^X2, withjci GW^, X2 eW^, u eW, y G R^,and A^^ A^^, A^^ A^^, 5 ^ 5^, C ^ and C^ are matrices of appropriate dimensions. We let n = ni + na, and ^11 ^12
A=
^21 ^22
s.[f.].
c.[c'.c^].
As for RNNs, we assume that the activation function a : E —> M is an odd map. For continuous time models, we always assume that the activation function is, at least, locally Lipschitz, and the control maps are locally essentially bounded; thus the local existence and uniqueness of the solutions of the differential equation in (34) are guaranteed. As for RNNs, the system theoretic analysis for MNs will be carried on for a suitable subclass. DEFINITION 32. A system of type (34) is said to be an admissible MN if both the activation function a and the control matrix B^ are admissible.
For the definitions of an admissible activation function and an admissible matrix, see Section II.A. Notice that the admissibility conditions involve only the first block of equations in (34), which evolves nonlinearly.
B. STATE OBSERVABILITY In this section we present the state observability result for MNs. For the definitions of state observability and coordinate subspace, see the corresponding Section II.C.
18
Francesca Albertini and Paolo Dai Pra
D E H N I T I O N 33. Consider an admissible MN of type (34). Let W c,WbQ the maximum subspace (maximum with respect to set inclusion) such that
(i) AW c, i y , a n d \ y c k e r C ; (ii) AW = Vie V2, where Vi c R"i, V2 c W^^ and Vi is a coordinate subspace. We call W the unobservable subspace. It is clear that the unobservable subspace is well defined, because if Wi and W2 are subspaces, both satisfying i) and ii), then W\ + W2 also does. THEOREM 34. Let E be an admissible MN of type (34), and W Q W be its unobservable subspace. Then x,z € R" are indistinguishable if and only if X — z ^W. In particular, E is observable if and only //" W = {0}.
Theorem 34 will be restated in a different form in Theorem 37, the proof of which is postponed to Section III.C. REMARK 35. Notice that if E is a RNN (i.e., ni = 0), then the previous theorem yields the observability result given in Theorem 10. On the other hand, if E is a linear model (i.e., ni = 0), then Theorem 34 gives the usual linear observabiHty result stated in Theorem 8.
As done for RNNs, we present a simple algorithm to efficiently check the observability condition given by Theorem 34. First we give a useful characterization of the unobservable subspace W. PROPOSITION 36. Consider an admissible MN of type (34). Let V\ c R«i and V2 ^ R"^ be the maximum pair of subspaces (maximum with respect to set inclusion), such that
PI. P2. P3. P4.
V\ is a coordinate subspace, V\ C kerCi, Ai\V\ C V^; V2CkerC2, A22V2C V2; A21V1 c V2; A12V2 £ Vh
Then the unobservable subspace W is given by W = A " ^ ( V i e V2)nkerC. Proof. First we prove that if Vi c R"i and V2 £ ^^^ is any pair of subspaces satisfying properties P1-P4, then W = A~^{V\ 0 V2) H kerC satisfies properties (i) and (ii) of Definition 33. (i) Clearly, W c kerC. To see that AW c VK, we argue as follows. AW c^ Vi 0 ^2 £ kerC, because Vi c kerCi, and V2 £ kerC2. On the other hand.
Recurrent Neural Networks
19
because AnVi c Vi, A22V2 Q Vi, and properties P3, P4 hold, one gets that A(yi0V2) c yieV2.Thus,ifx e W c A-nVi®V2),thenAx € yi®y2, which impHes that A{Ax) e A(Vi ® y2) ^ Vi ® V2. So Ax e A-^Vi ® y2) H ker C, as desired. (ii) We will prove that AW = yi ® y2. We need only estabhsh that Vi ® V2 '^ AW, the other inclusion being obvious. Notice that yi 0 yz c kerC and, because A(Vi ® y2) c y^ e V2, we get also yi © yz c A~^(Vi © yz), and so yi © y2 c AW, It is not difficult to see that ifWC.W is any subspace that satisfies properties (i) and (ii) of Definition 33, the two subspaces Vi c M^i and y2 c R'^2^ such that AW =Vi^V2, satisfy properties P1-P4. Now, by maximality of both the unobservable subspace W and the pair W\,V2, the conclusion follows. • If Vx c E'^i and y2 c E"2 are the two subspaces defined in Proposition 36, then one has W = 0 ^ ker A n ker C = 0,
yi = 0,
y2 = 0.
(35)
In fact, W = A'^iVi © y2) H kerC, thus if W = 0, then ker A n kerC c W = 0. Moreover, because yi © y2 c A~^(yi © y2) fi kerC, we also have that V\ = 0 and V2 = 0. On the other hand, if yi = 0 and V2 = 0, then W = A-\Vi © y2) n k e r C = ker A H ker C = 0. Using (35), Theorem 34 can be rewritten as follows. THEOREM 37. Let S be an admissible MN of type (34), and Vi c E'^i, y2 c M"2 be the two subspaces defined in Proposition 36. Then x,z € E" are indistinguishableifandonlyifx—z € ker C and A(x—z) € Vi^V2> In particular, S is observable if and only if
ker A n ker C = 0,
and
Vi = 0 , y2 = 0.
(36)
Before proving this result, we present an algorithm to compute the subspaces Vi and y2, which consists of solving a finite number of linear algebraic equations. Inductively, we define an increasing sequence of indexes 7^, and two decreasing sequences of subspaces Vf c R"i, y / c E"^, for J > 1, where Vf is a coordinate subspace. Recall that for a given matrix Z), we denote by //> the set of indexes i such that the iih column of D is zero. Let /i = { l , . . . , / : } \ / c p y / = span{^^|7 ^ Ji); Vj = kerC2;
20
Francesca Albertini and Paolo Dai Pra
and, for ^ > l,let 7^+1 = 7i U {i\3j e Jd such that A]] ^ 0} U [i\3j, 30
such that {C^{A^^y A^^)^. # 0}
d-l
U I J {i\3j e Js such that {A^^{A^^Y-'-^A^^)..
^ 0}
5=1
yf+i =span{^,l7^7^+i} y^^+i = [w\{A^^)^w e kerC^ forO < / < ^, A^^{A^^Y~'w
e V^ for 1 < 5 < d}.
REMARK 38. It is easy to show that the two sequences Vf and V^^ for 1 are both decreasing; thus they must become stationary after a finite number of steps. One can find conditions that guarantee the termination of the previous algorithm. Assume that a stationary string V^ = V^'^^ = . • • = y^^+"i of length ni + 1 is obtained. Then, using both the definitions of Vf and V^^, and applying the Hamilton-Cayley theorem, one proves that for a l l J > 5 and that V2 = V2^"^ for all ^ > 5 -h n2. Thus the two sequences Vf and V2 become stationary after at most («2 + l)«i steps, forni > 1, orn steps, forni = 0.
The two sequences Vf and V2 stabilize exactly at Vi and V2, as stated in Proposition 40. Moreover, for MN which evolves in discrete time, the previous subspaces Vf and V2 have a precise meaning, as stated next. For discrete time models, given any x, z eW and any 0 < J € N, we say that x and z are indistinguishable in J-steps if any output sequence from jc or z is the same up to time d. PROPOSITION 39. Let Hbea discrete time admissible MN. Then the following properties are equivalent:
(i) jc and z are indistinguishable in d-steps, (ii) x-ze kerC; A{x - z) e {Vf 0 V^). For the proof of this Proposition we refer to Albertini and Dai Pra [23]. PROPOSITION 40.
Let V\ c W\
and V2 c R"2 be the two subspaces de-
fined in Proposition 36 and Vf and V2 be the two sequences of subspaces determined by the algorithm described above. Then the following identities hold: ^ 1 = n ^^ d>\
^ 2 = n vf d>l
(37)
Recurrent Neural Networks
21
Proof. Let
d>\
d>l
d>l
Thus V^ = span{e j\j ^ /oo}. The following four properties can easily be proved by using the recursive definition of Jd, Vf, and V2 and the fact that, for some d, it holds that Vf = V^, V/ = V^ for d>d, {i\3j e Joo so that A\j 7^ 0} c J^. {i\3j, 3/ > 0 so that {C^{A^^fA^^)j.
^0}o
{i\3j e /oo, 3/ > 0 so that {A^^{A^^yA^^).. V^ C
{M;|(A22)^M;
(38) J^.
(39)
/ 0} c /^o.
(40)
€ kerC^ for/ > 0, ^^^^^22^^ ^ yoo f^^.^ >
Q}.
(41)
We first prove that the pair V^ and ¥2^ satisfies properties P1-P4 of Proposition 36. PI. The only nontrivial fact is the A^^ invariance of V^. Let v e V^. From (38), we get that Vi = 0 for all / such that 3 ; e Joo with A\] ^ 0. So, for j € /oo.
( A H = E^)^'=0' and, therefore, A^^v e V^. P2. Let w € y2^. We show that A^'^w e V^ ^^^ ^^^O^ ^ > 1. From (41), we first see that (A^^)^(A^^i(;) E ker C^ for 0 < / < ^. Moreover, for 1 < 5 < J, again from (41), we have that A12(A22)^-^(A22U;) 6 Vi^ C
Y{,
P3. Let i; G yf^. We need to prove that A ^ ^ G ¥2"^ for all J > 1. Using arguments similar to those used to derive PI, one can see that (39) yields C^(A^^yA2ii; = 0
V/>0,
and (40) yields ^i2(^22y^2i^ G y ^ ^ n ' This implies that A^^v G y / for all d, as desired. P4. This is an immediate consequence of (41).
v^ > 1.
22
Francesca Albertini and Paolo Dai Pra
Thus we have that V^ c Vi, and ¥2^ c V2- To conclude, we need to show that the converse inclusions also hold. We will prove, by induction, that Vi c V^, and i/d for ail d > I. The case d = I is obvious. Now let et e Vi. We show V2 ^ VJf that ^/ G Vf'^\ that is, / ^ Jd-\-i. • Suppose / is such that there exists j e Jd with A jj ^ 0. Then we have
This is impossible because A ^ ^ Fi C Vi and, by inductive assumption, Vl C Vf, • Suppose / is such that there exist j and 0 0. • Suppose / is such that there exist 0 < s < d — I and j G Js with {A^^iA^^)^-'A^^)ji ^ 0. This implies
This is impossible because ( A 2 V ~ ' A ^ ^ V I ^ inductive assumption, V\ C V^^.
VI, A^^V2
^ V2, and,by
Thus we have shown that et G vf'^^. Now let w G ^2- We need to prove that w e V^"^^. • Because V2 c kerC^, and A^^V2 c V2, it holds that (A^^)'w; G ker C^ for 0 < / < J. • Because A^^ 1^2 ^ V2, A^^V2 ^ Vi, and, by inductive assumption, Vl c V( for all 1 < ^ < J, it holds that A^^CA^^)^"^w; C V/, again for I <s
Recurrent Neural Networks
23
EXAMPLE 42. Assume that Ci has no zero columns. Then there is no nonzero coordinate subspace contained in ker Ci; so Vi = 0. Therefore, V2 is the largest A22-invariant subspace contained in ker C2 and ker A12. It follows that the MN is observable if and only ker C O ker A = 0, and the two linear systems with matrices (A22, C2) and (A22, ^12) are observable.
C. PROOFS ON OBSERVABILITY Throughout this section, even if not specifically stated, all of the models of MNs we will be dealing with are assumed to be admissible. A basic ingredient in the proof of the observability result is the following technical fact, which also explains how the admissibility assumptions on the activation map a and the control matrix B^ are used (see also [14]). Recall that, for a given matrix D, ID denotes the set of index / such that the /th column of D is zero. LEMMA 43. Assume that a is an admissible activation function, B^ G R^^"^ is an admissible control matrix, and D G R^^^ is any matrix. Then the following two properties are equivalent for each §, y;, e R^ and each a, ^ eW^:
1- ^i = m for alii ^ ID, a = fi, 2. DG{^ + B^U)+OL = D5(rj + B^u)^Pforallu
G R^.
Proof. We need only prove that property 2 implies property 1, the other implication being obvious. Because B^ is admissible, there exists i; G R'" such that bi = (B^v)i 7^ 0 for all / = 1 , . . . , n and \bi\ ^ \bj\ if i ^ j . To see this, it is enough to notice that the equations (B^u)i = 0 and (B^u)i = ±(B^u)j define a finite number of hyperplanes in R^; thus to get v, we only have to avoid their union. Now, for any ? G R, we consider tv e R'^, and we rewrite (2) as n
n
Y, Duai^i + bit) - ^
Diia{r)i + bit) + (a/ - ft) = 0, / = 1 , . . . , ^ , and? G R . (42)
Assume that property 1 is false. If §/ = r]i for all i ^ ID, then the first difference in (42) is always zero. Thus if there exists I such that aj ^ Pj, (42) does not hold for this I, contradicting property 2. On the other hand, if there exists / ^ ID such that ^j / Tjj, then by definition of ID, there exists I such that Dfj 7^ 0. Because in (42) all of the pairs for which ^i = rn cancel out, we may assume without loss of generality that ^/ 7^ rji for all / (notice that ^j ^ r]j\ thus not all of the pairs cancel). Consider now equation (42) for 1 = 1. The pairs (^/, bi) and {rn, bi) are all different; thus admissibility of a implies that (42) cannot hold for all f G R, contradicting property 2. •
24
Francesca Albertini and Paolo Dai Pra
First, we introduce some useful notations. Given x € R" and u eM!^, for discrete time MNs we denote by x'^(u) the state reached from x by using the control value u. For continuous time MNs, if v(t) is the control function constantly equal to M, we denote by Xu (0 the corresponding trajectory; notice that Xu (t) is certainly defined on an interval of the form [0, 6M), and it is differentiable on this interval. When dealing with two trajectories of this type starting at two different initial states, by [0, €u) we mean the interval in which both trajectories are defined. Given two pairs of states (x,z), (x\ z') € R'^ x R'*, we write
if, for discrete time, we can find an input sequence w i , . . . , M^, for some p > 0, which steers the state x (resp., z) to x' (resp., z'). For continuous time, we require that there exists some control function u{t): [0, T] ->• R'", such that it is possible to solve the differential equation (34) starting at x (resp., z), for the entire interval [0, r ] , and at time T the state x' (resp., z') is reached. Using this terminology, two states (x, z) € R" x R" are distinguishable if and only if there is some pair (jc', zO e W such that (;c, z) ^ (x\ z') and Cx' 7^ Cz'. In what follows, proofs in discrete time will be similar to proofs in continuous time, so they will only be sketched. To simplify notations, we write
LEMMA
44.
Let H^ e R^^'^i, H^ e R^^'^^ with q > I I < i < q, and
x,zeW. (a) For continuous time MNSy if(H^A^Xu(t))i u eW^ and all t e [0, €«), then we have {H'A'Hh). = {H^A^^).j = 0
= (H^A^ZuO))i for all
{H'A'^Ah),, for all j such that {A^x)j ^
{Ah)j.
(43)
For discrete time MNs, ifiH^A^x-^(u))i = {H^ Ah"" {u)) if or all u € R'", then the same conclusions hold. (b) For continuous time MNs, if(H^A^Xu(t))i = (H^A'^Zu(t))iforall u eW^ and all t G [0, €u), then we have
{H^A^^).. = 0
forallj
such that {A^x). #
{Ah).,
For discrete time MNs, if(H^A^x-^(u))i = {H^A^z'^{u))ifor all u e M!^, then the same conclusions hold.
(44)
Recurrent Neural Networks
25
Proof, (a) Suppose we are in the continuous time case, and fix w € M'". If (HUhu(t))i = (H^Ahu(t))i foralW e [0,€u), then (HU^Xu(t))i\t=o = (H^A^Zu(t))i \t=0' This equation reads X : {H'A%a{{A'x).
+ (B'u).)
+
{H^A^'A^),
7=1
= E {H'A%.a{{Ah)j
+ {B\)j)
+
(45)
{HU''A%,
7=1
Because (45) holds for all u e W^, both equalities in (43) follow directly by applying Lemma 43. The proof for the discrete time case is the same, because the assumption (H^A^x~^(u))i = (H^A^z'^(u))i, for all u e W^, implies directly equations (45). (b) This statement is proved similarly. In fact, by the same arguments as in (a), one easily sees that, for both continuous and discrete time dynamics, ±
{H'A%.a{{A^x).
^(B^u).)
+
{H^A^'A^X),
7=1
= E {H'A%.a{{Ah)j
+ {B^u).) +
{H'A''A%,
7=1
for all u eM!^. So, again, to conclude, it is sufficient to use Lemma 43. LEMMA
45.
{A^x)i =
•
Ifx, z € E" are indistinguishable, then C^P^x = C^A^z and {Ah)iforalliiIcu
Proof. If X, z are indistinguishable, then, for all u G M'", we get, for discrete time, Cx'^iu) = Cz^(u), or, for continuous time, Cxu(t) = Czuit), for t e [0, 6M). This implies, in both cases, C^G{A^X
+ B^u) + C^A^x + C'^B^u = C^a{A^z + B^u) + C^A^z + C^B^u
for all u eW^. From Lemma 43 our conclusions follow. LEMMA
46.
Ifx, z
G M" are
•
indistinguishable, then for all q > 0, we have
1. C^(A^^)'iA^x = C^(A^^)^A^z 2. (C^(A^^)^A^% = 0 for all i and all j such that (Ah)j ^
(Ah)j.
Proof 1. We prove this statement by induction on ^ > 0. The case q = 0 is the first conclusion of Lemma 45. Assume that part 1 holds for q and for all
26
Francesca Albertini and Paolo Dai Pra
indistinguishable pairs. We deal first with continuous time models. Notice that because x and z are indistinguishable, then so are Xu (t) and Zu (0 for ^U w e W^, and all t € [0, €«). So, by the inductive assumption, we get C\A^yA\At) = C^A^yAhuO).
(46)
Now, by applying Lemma 44 [part (b), first equality, with H^ = C^(A^^)^], we get C^{A^y^U^x = C^{A^y^'A\ as desired. For discrete time MNs, the proof is the same, after replacing Xu (t) and ZM(0, withjc+(M) andz+(M). 2. Again, we apply Lemma 44 [part (b), second equahty] to Eq. (46), to conclude: {C^{A^yA^^)..
= 0,
for all j such that {A^x)j i^ (A^x)^..
The proof is similar for discrete time dynamics.
•
LEMMA 47. Let 1 < / < ni, and x, z G R". Assume that for any x\ z^ e R", such that (x, z) -^ (x\ z')y we have
{A'x'). = {A\')..
(47)
Then, for all q >0,we have (a)
(A^2(A^VA^?)/
= {A^'^iA^^Y A^Oi for all ^,^ such that
(x,z)^(§,0; (b) (A^^(A^^)^ A^^)/y = Ofor all j such that there exists a pair ^, ^ such that
(x,z)^(^,0,««^(A^?);/(AlO;. Proof (a) By induction on ^. Fix any ^, f such that (jc, z) ^> (^, O- If ^ = 0^ Eq. (47) says that, for continuous time MNs, {A^^u{t))i = {A^i;u{t))i, for all u e W^ and for all t e [0, 6^); for discrete time MNs we get (A^§"^(M))/ = (A^^"^(M))/, again for all u 6 R'". In any case, we can apply Lemma 44(a) with H^ = I, and we have {A'^A%
= (A'^A^f).,
as desired. Now assume that the statement is true for ^. Thus, in particular, we get {A^^{A^yA^^u{t)).
= (A12(A22)^A2^«(0),. (A12(A22)^A2^+(«)). = (A12(A22)^A2^+(J)).
for continous time for discrete time, ^^^^
and these equations hold for all u e W^y and for all t e [0, €„). Again, the inductive step easily follows by applying Lemma 44(b) with H^ = A^'^(A^^)^.
Recurrent Neural Networks
27
(b) Fix any j such that there exists a pair ^, f with (x,z) ^> (?, 0» ^^^ (A^^)j ^ (A^Oj- Notice that for this pair §, <, Eq. (48) holds. Thus we apply again Lemma 44(b) with H^ = A^^(A^^)^, and we get
{A'^iA^^A^').. as desired.
= 0,
•
The next proposition proves the necessity of the observability conditions stated in Theorem 37. PROPOSITION 48. Leti: be an admissible MN, and Vi c W\ V2 c E"2 he the two subspaces defined in Proposition 36.Ifx,z€M."' are indistinguishable, then X - z ekerC, and A(x - z) e Vi ^ V2.
Proof. Because indistinguishability of x, z obviously implies x — z e ker C, we need only prove that A^(x — z) e Vi and A^(x — z) G V2. Let / := {^|3(Jc^z% with (JC, z) ^ {x\z'),
and (A V ) . ^
{Ah')-}.
Then we define Vi := span{^/|/ e J], where et are the vectors of the canonical base in W, and V2 := the largest A^^invariant subspace, contained in ker C^, such that A^^ V2 c Vi. Notice that Vi is a coordinate subspace. Next we will establish that (i) Vi is A^^-invariant, Vi c kerC^ and A^^Vi c V2. (ii) V2 = {(A^^)HA^x^ - A^z')\q > 0, (x, z) ^ (x\ z')] C V2. Our conclusion follows from these statements because • A^ (x — z) G Vi, by definition of J; . A2(x-z)€y2,by(ii); • Vi c Vu and V2 c V2, by (i). Now, we prove (i) and (ii). (i) • A^^-invariance Because Vi is a coordinate subspace, proving A^^ invariance is equivalent to seeing that Ajj = 0 for all i, j such that j e J, and i ^ J. Fix j e J and / ^ / . Then there exists (§, f) such that {x,z) -^ (§, O and (A^^)j ^ (A^Oj- Because / ^ / , we have, for all u e W^, {A^Ut))i (A'^^iu)).
= {A^Ut))r = (Aif+(^)).
Vr€[0,€«)
28
Francesca Albertini and Paolo Dai Pra in continuous and discrete time, respectively. Now by applying Lemma 44(a) with H^ = /, we get A}l=0^q,
such that (A^^)^ ^
{A'^)^.
In particular, Ajj = 0, as desired. • Vi c k e r C ^ If the pair jc, z is indistinguishable, then so is any pair x\ z^ such that (x, z) -^ {x\ zO- Thus the conclusion follows by observing that Lemma 45 implies 7 C /ci. • A^^Vl C V2 It is sufficient to prove C^{A^^)^A^^ej = 0 for all 7 G 7, ^ > 0 or, equivalently, \(CHA^^rA^% (A^^(A^^)'iA^%
=0 =0
V/, Wj e 7, V/ ^ 7, Vy G 7.
^ ^
The first equality in (49) is easily obtained by applying part (b) of Lemma 46. The second one follows from part (b) of Lemma 47 after having observed that (I) if / ^ 7, then (A^x% = (Ah% for all (x\ z') such that (x,z) -^ ix\z^), (II) if j e 7, then there exists {x\ zO such that {x,z) ^ (x\ zO and (A^x')j ^ {Ah')j. (ii) V2 is by definition A^^-invariant. Thus to prove that V2 C V2, we need to show that V2 C ker C^ and A^^V2 C V\. This amounts to establish the following identities:
C^{A^Y^^^' = C^A^^Ah'
V(x^ zO such that (x,z)^
A^\A^^YA^{X'
-z:)^Vx
{x\ z!) (50)
V(y^^/) such that (jc, z) ^ (A:^ zO- (51)
Because (jc, z) -^ (x', zO implies that the pair x^ z' is also indistinguishable, (50) is just part 1 of Lemma 46. Moreover, (51) is equivalent to {A^^^A^^^A^x').
=
(A12(A22)^AV).
which follows from part (a) of Lemma 47.
•
V/
^ 7,
Recurrent Neural Networks
29
Now we prove Theorem 37. Proof. Necessity is proved in Proposition 48, thus we need only prove sufficiency. Assume first that we are dealing with discrete time MN. We will prove that if X, z G M" satisfy the indistinguishability conditions of Theorem (37), then, for all u eW^, x^(u), z^(u) also satisfy the same conditions. This fact will clearly imply that X, z are indistinguishable. First notice that the following implications hold: A\x
-z)eVi=^
x+(w) - z+(w) G Vi
A^(x -z)eV2=>
x:^(u) - Z^(M) G V2.
Both implications are easily proved by using the properties of Vi and V2, and the fact that Vi is a coordinate subspace, and so if a G Vi, then a (a) e Vi. Because Vi c kerC^ and V2 c kerC^, (52) yields jc+(w) - z+(w) G kerC. Moreover, A^^-invariance of Vi and the fact that A^^V2 ^ Vi, implies A\X^(U)
- z+(w)) = A^\x^(u)
- zfiu))
-\- A^^{x:^(u) - z^(u)) G Vi,
whereas, A^^-invariance of V2 and the fact that A^^ Vi c V2 give A^{x^(u) - z+(w)) = A^^{x^(u) - z+(w)) + A^2(x+(w) - z+(w)) G V2. Thus A(x+(w) - Z+(M)) G VI 0 V2, as desired. Now we deal with continuous time MN. For a fixed but arbitrary input signal {u(t))t>o, let x{t), z(t) denote the corresponding solutions of (34), associated with initial conditions x(0), z(0). The pair (x(t), z(t)) solves the differential equation in M^'^,
©
""" =F(x,z),
(53)
where
F(x,z)
A^x -I- B^u a{Ah + B^u) \ A^z-\-B^u J
Let Z = {(x, z) G E^'^: x - z G ker C, A^x
- z) G Vi, A2(JC - z) G ¥2}. In the
proof for the discrete time case, we showed that if (x, z) G Z , then F{x,z) G Z. ThusZis stable forthe flow of (53), that is, if (;c(0),z(0)) G Z,then(x(0, z(0) G Z. Because (w(0)r>o is arbitrary and (x,z) G Z = > x — z G kerC, the proof is easily completed. •
30
Francesca Albertini and Paolo Dai Pra
D . IDENTIFIABILITY AND MINIMALITY As done for RNNs (see Section II.D), we identify a MN with the quadruple Ttni.m = (^» ^» ^ ' ^)» where n\ and ni are the dimensions of the nonlinear and linear blocks, respectively. As usual, we let n =n\ + «2» and
Moreover,with(£„j,„2'-^o)» ^o ^ R'^, we will denote the MN E„ = (a, A, B,C) together with the initial state JCQ; and we will call the pair (5]„j,„2' -^o) an initialized MN. Our goal is to determine the group of symmetries that leaves the i/o behavior of the initialized MN unchanged. In Section II.D, we have seen that for an admissible RNN, this group of symmetries is finite, and it coincides with the group Gn generated by permutations and sign changes. For MNs, because one block of the system behaves linearly, one expects that this group of symmetries will not be finite. Let Qn be the following set of invertible matrices:
It is easy to see that if T € ^„, and we let A^T'^AT,
B = T~^B,
C = CT,
and
JCQ = r'^jco,
then the two initialized MNs (Sni,Ai2 = (^' ^ ' ^ ' ^ ) ' -^o)* and (^ni,n2 = (^' ^^ B,C), XQ) have the same i/o behavior. It is interesting to note that this implication holds without any assumption about the two MNs, except the fact that the activation function a is odd. Next we will see that if the MNs are admissible and observable, and satisfy a controllability assumption, then there are no other symmetries. We will write
to denote that the two initiaUzed MNs (Sni,^2' -^o), and (^n^,n2»^o) are i/o equivalent, that is, Ai:„^ „2,;co = -^f;- . ^^o (where )^i:„^^„2^xo represents the i/o map, see Section II.C). DEFINITION 49. Let E„i,„2 = (o-, A, 5 , C) and £/ij,^2 = (cy,A,B,C) be two MNs. The two initialized MNs (E„j,^2'-^o) and (E~ - ,jco) are said to be equivalent if ni = ni, n2 = ^2, and there exists T e Qn such that A = T-^AT, B = T-^B, C = CT,mdxo = T'^XQ, Thus, from the above discussion, if two initialized MNs are equivalent, then they are also i/o equivalent.
Recurrent Neural Networks
31
DEFINITION 50. An admissible MN, T>ni,n2 = i^^ ^ . ^^ C), is said to be identifiable if the following condition holds: for every initial state JCQ G R'^, and for every admissible initialized MN (^ni,n2 = (^^ ^ ' ^ ' ^ ) ' -^o) such that (E„j,^, xo) and (Tin^^n^^xo) are i/o equivalent, then either n > n or (S„i,n2» -^o) and {Y^fi^^fi^,xo) are equivalent.
Note that these definitions correspond to Definitions 15 and 16 given for RNNs. Given two matrices M e RP^^'P^ and A^ € RP^^'P^, we say that the pair (M, N) is controllable if it satisfies the linear Kalman controllability condition (i.e., rank[N, MN,..., MP^-'^N] = pi). THEOREM 51. Let E„j ,^2 = (cr, A, B,C)be an admissible MN. Then Enj ,^2 is identifiable if and only if Tini,n2 ^^ observable, and the pair of matrices (A^^, (B^, A^i)) is controllable.
The proof of this theorem is given in Section III.E. REMARK 52. It is clear that if ^ni,n2 is a RNN (i.e., ^2 = 0), then the previous theorem becomes Theorem 18, which states the identifiability result for RNN. On the other hand, for ni = 0, i.e., Tini,n2 linear, we recover the linear identifiability result stated in Theorem 17. DEFINITION 53. An admissible MN S„j ,„2 is said to be minimal if for every initial state XQ € R", and for every admissible initialized MN, (E^^ ^^2' -^o)* such that (Il„j,„2' -^o) and (l^n^^n^, XQ) are i/o equivalent, then it must ben >n. REMARK 54. It would be reasonable, in Definition 50, to replace the inequality h > n with the statement "wi > ni, ^2 > ^2* where at least one inequality is strict." Analogously, in Definition 53, n > n could be replaced by ^1 ^ ^ 1 , ^2 > ^2- These modified definitions are not logically equivalent to the ones we gave. However, they are indeed equivalent, and this fact will be a byproduct of the proof of Theorem 51 and Proposition 55. This means, for instance, that if Sni,«2 ^^^ the same i/o behavior as Ilni,n2 ^^^ ^ S n, then, necessarily, ni
As observed in Section II.D for RNNs, it is obvious that if an admissible MN ^ni,n2 is identifiable, then it is also minimal, whereas the converse implication may be false, as shown in Example 21. The next proposition (the proof of which will be given in Section III.E) presents necessary and sufficient conditions for minimality. PROPOSITION 55. Let E„j,„2 — (^^ ^ ' ^ ' ^ ) ^^ ^^ admissible MN, and let Vi C R'^i and V2 ^ M"^ be the two sub spaces defined in Proposition 36. Then ^n\,n2 ^^ minimal if and only if
Vl 0 V2 = 0 and the pair (A22, (B2^ ^21^^ .^ controllable.
(54)
32
Francesca Albertini and Paolo Dai Pra
REMARK 56. Proposition 55 will be proved later; however, the necessity of the conditions in (54) is not difficult to establish, as shown next. (a) Assume that Vi 0 y2 7^ 0, and let n = dim Vi, r2 = dim V2, and qt = m — Vi (/ = 1, 2). Without loss of generality, we may assume that that Vi = span{^i,..., ^n}, where ej e R"i are the elements of the canonical basis. Thus we must have (in what follows, we specify only the dimensions of some particular matrices, meaning that all of the others have the appropriate dimensions)
A"yicyi=^Aii = ( ^ ^ J ' j ; ^ ) . Vi c k e r C ' = J ^ C ' = (0,Z)'),
withH22
J1X91
with D^ € M^^*'.
Using the Unear theory, we also know that there exists a matrix T2 e GL{n2) such that r-r'>i^^72 2 - - = (^21/22).
C^T2 =
{D\0),
where L " e R«^92, and D^ e R/'^^z. Moreover, we have with K^^ €
92X^1
withM^^GR^i'^^^^ Now, if we denote, for / = 1, 2, B' = (B'\ B'^f, with B^^ e ^21 ^ ]^^2xm^ ^Q j^^y rewrite the dynamics of 2^1,^2 ^s z^ = a(H^hi
+ -^ j r 2 K
;y =
?ixm ^ j j ^
+ H^^Z2 + M^^wi + M^^W2 + B^^u)
-22
21,
^22,
?22„
D^'zi^-D^Wx
where Z2 ^ R^^ and u;i G R^^ 'Wm^ because the two blocks of variables z\ and W2 do not affect the other two blocks and the output, it is clear that (E„i,„2' (^1' ^2, w;i, wi)) is i/o equivalent to ( i : , , ^ , , = ( a , ( ^22 ^ n ) - ( B21) ( O ' . ^ ^ ) ) , U2, u,i)) . Thus we have performed a reduction in the dimension of the state space, because ^1 + ^2 < w» and so ^n\,n2 Cannot be minimal.
Recurrent Neural Networks
33
(b) Now assume that the pair (A^^, [B^, A^^]) is not controllable. Let pi be the rank of {{B\
A21), A^^B\
A 2 ' ) , . . . , {A^y'-\B\
A21)),
and p2 = n2 — p\. Then, by the linear theory, there exists an invertible matrix r^ G GL{n2) such that
(rrV = (^), where H^ e R^i^^i, H^ e E^i^^^ H^ e RP^^'P^, K e M^i^"^ M e R^i^'". Now let A^^ = (A/^^ A^^) and C^ = (L^ L^), where the matrices A^^ and L^ represent the first pi columns. Consider the MN ^m+pi given by the matrices:
It is easy to see that (E„j,„2' ^) "^ (^ni+pp 0)- Because ni -h pi < n, again, E„j,„2 is not minimal.
E. PROOFS ON IDENTIFIABILITY AND MINIMALITY Now we introduce some useful notations. In what follows, we assume that two initialized admissible MNs, (I]„j,„2'-^o) and (J^n n^,xo), both evolving in either continuous or discrete time, are given. We let A^ = (A^^, A^^) and A^ = (A^^ A^^), and similarly for A^ and A^. For discrete time models, for any k > 1 and any ui,.. .,Uk G W^, we denote by Xk[ui,... ,Uk] (resp. Xk[ui,..., Uk]) the state we reach from XQ (resp. XQ) using the control sequence ui,... ,Uk. Moreover, we let ydui, ...,uk]
= Cxklui, ...,uk]
and
ydui, ...,Uk] = Cxk[ui,...,
Uk].
For the continuous time model, for any w('): [0, T^] -^ M'", we let Xw(t) and Xyj{t), for t e [0, €w), be the two trajectories of ^ni,n2 ^^^ ^n starting at XQ and XQ, respectively. Here, with €w > 0, we denote the maximum constant such that both trajectories XwiO and Xwit) are defined on the interval [0, ew)- Again, by yw(t) and yw(t), for t e [0, ew), we denote the two corresponding output signals (i.e., yUt) = Cxuj(t), and y^it) = Cxyj(t)). For any vector i; G R'^ with the superscript 1 (resp. 2), we denote the first ni (resp. the second ^2) block of coordinates (similarly, for v eW),
34
Francesca Albertini and Paolo Dai Pra
We will denote by W c M" (resp. W c M") the unobservable subspace of i:„i,„2 (resp. ^hun2)' Moreover, with Vi c M"i and V2 c R"2 (^esp. Vi c R^i and V2 c R"2)^ we will denote the two subspaces defined in Proposition 36 for ^ni,n2 (resp. £ni,/i2). Recall that it holds W = A'^iVi 0 V2) n kerC. Now we establish some preliminary results. First we state a technical fact, which gives the idea on how the admissibility assumption is going to be used in the proof of the identifiability result. For a given matrix D, ID denotes the set of indexes / such that the iih column of D is zero, and / ^ is its complement. LEMMA 57. Assume that the following matrices and vectors are given: B e W"^, B e R"^"', C G R^""", C e R^^'^, D,D e R^^'", a eW, a e W, and e,e e R^. Moreover, assume that o is any admissible activation function, that both B and B are admissible control matrices, and that for all u e R^, the following equality holds:
Ca(a + Bu) -{- Du -^ e = Ca(a + Bu) -i-Du+e.
(55)
Then we have (a) e = e. (b) D = D, (c) |/^| = |/£|, and for any I e I^ there exists 7t{l) e / - and P{1) = ± 1 , such that (c.l) Bu = P(l)Bn{i)iforall / € { 1 , . . . , m]; (C.2) ai = pmniiy (C.3) di = P{l)Cin{i)forall / € { 1 , . . . , p]. Moreover, the map n is injective. Proof such that
Because B and B are admissible control matrices, there exists v £ R^ bi = (Bv)i^OVi bi=:(Bv)i^OWi
= h....n, = l,...,h,
\bi\:^\bj\ \bi\^\bj\
for a l l / ^ 7 ; forall/7^7.
^ ^
Letting Z c R'" be the set of all vectors v e R'" for which (56) holds, we have that Z is a dense subset of R'". Fix any v e Z; then by rewriting Eq. (55) for u = XV, with X € R, and using the notations in (56), we get n
Y\,Cija{aj+bjx)
h
+ {Dv)iX^ei
= Y^Cij(T(aj-hbjx)-^(Dv)iX
+ ei,
(57)
for all X € R, and all / e 1 , . . . , /?. Fix any / e { 1 , . . . , /?}. After possibly some
Recurrent Neural Networks
35
cancellation, Eq. 57 is of the type r
Because a is admissible, we immediately get et - ei = 0, (Dv)i - (Dv)i = 0. The first of these equations implies (a), the second implies (b), because it holds for every v e Z, and Z is dense. Now, because (a) and (b) have been proved, we may rewrite (57) as n
h
Y^ Cijcr(aj + bjx) = ^ 7=1
Cija(aj + bjx).
(58)
7=1
Fix any I e I^\ then there exists / € { 1 , . . . , /?}, such that Cjj / 0. Consider Eq. (58) for this particular /. The terms for which Cj • = 0 or Cj: = 0 will cancel (however, not all of them will cancel, because Cjj ^ 0); thus we are left with an equation of the type r
r
p=i
p=i
for some r < n, and some r < h. Because a is admissible, and the bj^ (resp. bjp) have different absolute values, there must exist two indexes jp^, jnipi)^ and P(pi) = ± 1 , such that Ki'
^jp,)
=
P(PI){^MP,)^^MPO)'
So we have r
Now, by repeating the same arguments, we will find another index jp2, a corresponding index J7T(P2) with n{p2) 7^ 7t{pi), and P{p2) = ±1- Notice that necessarily p2 ^ p\, because, otherwise, \bj^^p )| = \bj^^p ^\ would contradict
36
Francesca Albertini and Paolo Dai Pra
(56). Thus we will collect two more terms. Going on with the same arguments, we must have that r = r, and after r steps, we end up with an equation of the type r p=i
Again, by the admissibility of a, we also have (Cj: — p{p)Cjj
.) = 0.
Thus, in particular, because Cjj # 0, we have shown that there exist 7t(l) and PQ) such that (aj,bj) = m{a^ijyK(J)) CJJ = mCj^gy
.^Q.
Notice that, if given this [ e /^, we would have chosen a different index / such that CJJ ^ 0; then we would have ended up with corresponding nQ) and ^(l). However, because the (bj) all have different absolute values, it must be that 7T(1) = Tt(l), and P(l) = P(l) (see (60)). Thus we have shown that V/G/^,
37r(/)G/^, )S(/) = ± 1 ,
such that I ^'^ [(ii)
to, W = ^(0(5.(0,^.(0), Cii = piDd^^i) ifC///o.
This implies |/^| < |/~|. By symmetry, we conclude that |/^| = |/~|. From (61) (i), we get directly that (c.2) holds. Again from (61) (i), we also have (Bv)i =
P(l)(Bv)^iiy
for all V e Z. Because Z is dense, this implies (c.l). Moreover, (61) (ii) proves (c.3) for those C// different from zero. On the other hand, if C// = 0, then, necessarily, CijT{i) must also be zero. Otherwise, one repeats the argument above, exchanging C with C, and finds an index A(7r(/)) such that | C / ; r ( o | = |QM7r(0)|,
|^7r(o| = |^A(7r(0)|-
In particular, X(7T(1)) ^ /, because Cn — 0 and C/A.(7r(/)) 7^ 0- But then |^A,(7r(/))l = \bji(jt)\ = \bi\, which is impossible, because the bi have all different absolute values. The injectivity of the map n is also a consequence of \bi\ ^ \bj\ for all
i ^ y.
•
LEMMA 58. Let (E„,_„2, xo) and (E/jj^/jj, JCQ) be two initializedMNs,^ Hi e R«^">, H2 e M«^"^ Hi e R*^"', and H2 e R«^"2.
Recurrent Neural Networks
37
• For continuous time models, if for all w : [0, Tu)] -^ M!^, and for all t e [0, €w), we have Hixlit) thenJor all u
+ H2xl(t) = Hixlit)
+ ^24(0,
(62)
eW,
Hia{A^xUt)
+ 5^M) + H2A^Xyj(t) + H2B^u
= Hia{A^xUt)
+ B\)
+ H2A^xUt)
+ H2B^u.
• For discrete time models, if for all r >0, and for allui,... have HiX
[Ui,
. . . ,Ur]-\-
= Hix^[ui,.
H2X
[Ui,
.. .
(63)
,Ur € R'", we
,Ur]
..,Ur] + H2x\ui,...,
M^],
(64)
(when r = 0, it is meant that the previous equality holds for the two initial states) then, for all u € W^, Hia[A^x[u\,...,
Wr] + B^u) + H2A^x{u\,...,
— H\5{A^x[ui,...,
M^] + H2B^u
Mr] + B^u) + H2A^x[ui,...,
w^]
-\-H2B^u.
(65)
Proof Because the discrete time case is obvious, we prove only the continuous time statement. Fix any t € [0, 6^;). For any u e M."^, let Wu : [0, t-\-1] -> W^ be the control map defined by Wu{t) = u;(0 if ^ € [0, i], and Wu = u if t G (t,t -\- I]. Then the two trajectories Xw^it) and Xyj^(t) are defined on an interval of the type [0, t-\-e) and are differentiable for any t € (F, f + €). Because Eq. (62) holds for all f € (0, r + e), we have, for all t e (tJ-\- e), Hixlit)
+ H2xl(t) = Hiklit)
+
H2il(t).
Now, by taking the limit as r ->/"+, we get (63), as desired. LEMMA
59.
//*(E„j,«2' -^o) ^ (^ni,«2' •^o)» then, for all I > 0, we have
(a) C^(A^^)^B2 =
(b)
•
C^iA^^yS^;
= l^g2(^22)/^2iL andforalii e /^2(A22)/A2I there exists 7t{i) G /p;2/T22y 121 and fi(i) = ± 1 , ^MC/Z f/zaf |/C2(A22)/A2II
for all j e { I , . . . , / ? } ;
38
Francesca Albertini and Paolo Dai Pra (cl) for continuous time models, for all w:[0,Tyj] ^^ M!^, and for all t G [0, 6t(;), we have
(c2) for discrete time models, for all r >0, for allui,... C^A^^fA^^x^uu
...,«,] +
,Ur G M'^, we have ...,ur]
C2(A22)'+'X2[MI,
= C2(A22)'A^1;C'[«I, . . . , « . ] + C^A^^)'-^'x\ui,
..., uA.
(When r = 0, it is meant that the previous equality holds for the two initial states.) Proof Assume that we are dealing with continuous time MNs. We first prove, by induction on / > 0, statements (a) and (cl). Assume / = 0. Because (^nun2^xo) ~ {Y,n^^n^,xo), then for all w : [0, 7^] -^ W^, and for all t e [0, e^), we have C^xlit) + C^xlit) = yy^^it) = 3)u,„(0 = C^xlit) +
C^xlit).
From Lemma 58 this implies C^5{A^^xlit)
+ A^^xlit) + B^u) + C^A^^xlit)
= C^5{A^^xl(t)
+ C^A^^xl(t) +
+ A^^xlit) + B^u) + C^A^^xlit)
+
C^B\
C^A^^xl(t)
•^C^B^u.
(66)
Now by (66) and by applying Lemma 57 (parts (a) and (b)), we immediately get conclusions (a) and (cl) when / = 0. The proof of the inductive step follows the same lines. By inductive assumption, we have
for all r € [0, €„,). By again applying Lemma 58, we get C\A^^)U^'a{A''xiit)
+ A^^xlit)
+ B'U) + C^A^^)'^'{A^x^{t}
+ B^u)
= c 2 ( A 2 2 ) ' p i a ( A ' 1 4 ( 0 + A ' 2 4 ( 0 + B '«) + C^A^^)'^\A^x^it)
+
Pu).
By applying Lemma 57 (part (a) and (b)), the previous equation gives both (a) and (cl) for the case / + L Moreover, from part (c) of Lemma 57, (b) must also hold.
Recurrent Neural Networks
39
The proof for the discrete time dynamics is very similar and simpler. The idea is to establish, again by induction on / > 0, statements (a) and (c2) first. We sketch only the case / = 0. Because (E„j,„2'-^o) "^ (^iii,n2^^o), given any sequence M1,..., w^, w, we must have y[ui,...
,Ur,u] = y[ui,...
,Ur, M],
which is the same as C^a(A^x[ui,
...,Ur] + B^u) + C^A^x[u\,...,
w^j + C^B^u
= C^a(A^jc[wi,..., Mr] + B^u) + C^A^x[ui,...,
w^] + C^B^u.
Again, because the previous equation holds for every u e W^, Lemma 57 gives conclusions (a) and (c2) for the case / = 0. • We omit the proof of the next lemma, the conclusions of which may be established by induction, using the same arguments as in the previous proof. LEMMA 60. Assume that (Xni,n2^^o) "^ (^hi,n2^ ^o)^ cind that there exists / € { ! , . . . , ni), P(i) = ibl, and7T(i) € { 1 , . . . ,^i} such that
• for the continuous time dynamics, for all w(') € [0, Tuj] -» M!^, for all t e [0, €«;), the following holds: [A'^xliO
+ A^^lit)].
= ; S ( / ) [ A " 4 W + A12;C2 (0]^(,.),
• for discrete time dynamics, for all r > 1, for allui,... following holds: [A^^X^[UI,
,Ur € R'^, the
. . . , Mr] + A^^X^[MI, . . . , Mr]].
= P(i)[A^^x\ui,
...,Ur]
+ A^^X^lui,
. . . , Ur]]^^.y
Then, for all I >0,we have (a) [A'\A^^)'B% := m[A'\A^^)'B\^.^j, y € { 1 , . . . , m], (bl) for continuous time dynamics, for all w{-) e [0, Tu,] —> W",forall t e [0, e„,), [A'\A^^)'A^'xlit)+A'^{A^^y^\lit)].
(b2) for discrete time dynamics, for all r > I, for allui,... [A12(A22)'A21;C1[«I, . . . , M.] + A'^A^^)'^'X\UU
^Pii)[A^^{A^^)'A^^X^[ui,...,Ur] +
A'^A^^f'-'xHu,,...,Ur]l^.y
,Ur G W^, ...,
«,]].
40
Francesca Albertini and Paolo Dai Pra
LEMMA 61. If(Tin^^n2^^o) "^ (^«i,/i2'^o) cind Vi = 0, then n\ < hi, and for alii e { 1 , . . . , wi} there exists 7r(/) e { 1 , . . . , ni} and P(i) = ±1, such that
(a) Bfj = p{i)Bl^yforall 7 G 1 , . . . , m; (b) C]. = p{i)C]^^^foralljeh..,,p; (cl) for continuous time dynamics, for all w(') e [0, T^] -^ M.^, for all t € [0, €w), the following holds:
(c2) for discrete time dynamics, for allr > 1, for allu\,... following holds:
,Ur € R'", the
[A11X^[MI,...,M,] + A 1 V [ M I , . . . , W , ] ] . = p{i)[A^^x\ui,
. . . , Mr] + A^^X^luu
. . . , Wr]]^(,).
Moreover, the map TC is injective. Proof Assume that we are dealing with continuous time MNs (the proof for the discrete time case is very similar and thus is omitted). Because V\ = 0, then, letting Jd denote the set of indexes defined in Section III.B, we have that for any / € { 1 , . . . , ni], there exists d > I, such that / e Jd- We first prove (a) and (cl) by induction on the first index d > \, such that i e Jd. Assume that d = I, that is, i e J\. Thus, by definition, there exists I < I < p such that C/- ^ 0. Because (^ni,n2^^o) "^ (^ni,n2^^o), then for all w : [0, Tyu] -> R'", and for all t e[0,€yj), we have C^xlit) + C^xlit) = yu;At) = yu^^it) = C^xl(t) +
C^xl{t).
From Lemma 58 this implies C^a{A^xUt)
+ B^u) -\- C^A^xUO + C^B^u
= C^o{A^x^{t)
+ B\)
+ C^A^Xyoit) + C^B^u,
(67)
Because C/- ^ 0, we have / e /^^; thus, by Lemma 57, we know that there exist n(i) € { 1 , . . . ,ni}, and)S(/) = ± 1 , such that
for all J e 1 , . . . , m, and, for all w{-) e [0, T^] -^ W", for all t e [0, £„,), [A^'xlit)
+ A'^xlit)].
=
fiii)[A''xlit)
+
A'^xlit)]^^.^
Now suppose that / e Jd+\, for d > O.lfi e Ji, then there is nothing to prove; otherwise we are in one of the following three cases: 1. there exists j e Jd, and AjJ ^ 0;
Recurrent Neural Networks
41
2. there exists 1 < y < /? and 0 < / < J - 1, such that (C^(A^^)^ A^^)^/ ^ 0; 3. there exists j e 7^, for 1 < ^ < d — 1, such that We will prove the three cases separately. 1. Because j e Jd, then, in particular, (cl) holds for this j . Thus we have, for all w{') e [0, Tuj] -^ W^, for all t e [0, 6^;), [A''xl(t)
+ A'^xlit)]j
= P(j)[A''xl(t)
+
A'^xlit)]^^.y
Now, by applying Lemma 58 to this equation, we get, for all u e W^, [A^^a{A^Xuj(t) + B\)
+ A^^A^xUt) + A^^B^u].
= P(j)[A^^a{A^xUt)
+ B\)
+ A^^A^xUt) +
A^^Pu]^^jy
Now, because A jj ^ 0, we have i e / ^ ^ ; by applying Lemma 57 to the previous equation, we conclude that there exists 7t(i) € { 1 , . . . , ^i}, and ^ ( 0 = ± 1 , such that (a) and (cl) hold for this index /. 2. From Lemma 59 (cl), we have that for all w : [0, T^j] -> W^, and for all t e [0, eyu), the following equality holds:
= C\A'^)'A^'xUt) +
C'{A'T'^iit).
By applying Lemma 58 to the previous equation, we get C^{A^^fA^'a{A'xUt)
+ B'U) + C^A^^f^^A^xUt)
+
= C^(P^)^P^5(A^ic^(r) + B^u) + C^{A^^)^^\A^xUt)
B\) + B^u).
(68)
Because (C^(A^^)^A^^)j/ 7^ 0, again by applying Lemma 57, we conclude that there exist n(i) G { 1 , . . . , ni}, and y6(0 = ± 1 , such that (a) and (cl) hold for this index /. 3. Because j e Js, then, in particular, (cl) holds for this j . Thus we have, for all w(-) € [0, Tuj] -> M'", for all t e [0, 6^^), [A^'xlit)
+ A'^xl(t)]j
= P(j)[A''xi(t)
+
A'^xlit)]^
ur
Thus Lemma 60 applies, and from conclusion (bl) of this lemma, we get, for all />0, [A'^iA^^fA^hlit)
+
= m[A''{A''yA^'xl(t)
A'^{A^^f^'xl{t)]j +
A^^{A^^y^'xl(t)l^.y
42
Francesca Albertini and Paolo Dai Pra
Now, letting I = d — s — I, because (A^^(A^^)^A^^)yi ^ 0, we may proceed as before, and by applying first Lemma 58, and then Lemma 57, (a) and (cl) also follow for this index /. To conclude the proof, we must establish (b). Using (a) and (cl), we may rewrite, for each j e { 1 , . . . , /?}, Eq. (67) as
= E
C]ia{{A^xUt))i
-^ {B^u)^) -h
{C^A^xUt))j-
Because a is admissible, the previous equality implies (b). Note that the injectivity of n is an obvious consequence of the injectivity of the corresponding map in Lemma 57. • Given two matrices M e W^ ^P^ and N eW^^P\ we say that the pair (M, A^) is observable if it satisfies the linear Kalman observability condition, that is. {
N NM
\
rank
LEMMA 62. then the pair
is observable.
PI-
Let ^nx,n2 = i^^ ^^ B, C) be an admissible MN. IfV2 = 0,
m--)
Proof. Assume, by contradiction, that the above pair is not observable. Then there exists a nonzero A^^-invariant subspace W2 ^ M"^ that is also contained in ker C^ and in ker A^^. Thus, clearly W2 ^ V2, contradicting the assumption that V2 = 0. • Now we are ready to prove the identification result. Proof of Theorem 51. Let ^nun2 = (<^' ^ ' B, C) be an admissible MN. Necessity. We prove this part by contradiction. First assume that ^m^ni is not observable. Then there exist two different states x\, X2 € W, which give the same i/o behavior; thus the two initialized MNs (E„J ,„2' -^1) ^^^ (^«i ,/i2' ^2) ^ c i/o equivalent. Next we will see that they cannot be equivalent. Assume that there is a matrix T e Qn (see Definition 49) such that A = T-^AT,
B = T-^B,
C = CT,
and
xi = 7-^x2.
Recurrent Neural Networks
43
Then, because fiMs an admissible control matrix, we have that T must be of the type
-(orO' On the other hand, the matrix T^ must satisfy A^^ = T^A^^,
C^ = T^C^,
PP- = T-^A^^T,
which, by Lemma 62, also impHes that T^ = I. Thus T = I, and so we must have xi — X2, which is a contradiction. So, in this case, T,ni,n2 is not identifiable. Now, if the pair (A^^, [B^, A^^]) is not controllable, we have already seen in Remark 56 that the MN ^ni,n2 is not minimal, because it is possible to perform a state-space reduction, and so, in particular, ^ni,n2 is not identifiable. Sufficiency. We prove this part only for continuous time dynamics, the proof in discrete time being the same after the obvious changes in notations. Let xo € W', consider another admissible initialized MN (lln^^n^, XQ), such that
Because T>ni,n2 is observable, in particular, Vi = 0, thus, by Lemma 61, ni < hi, and for all / e { 1 , . . . , ni} there exists 7t(i) e { 1 , . . . , wi} and fi(i) = ± 1 such that the conclusions (a), (b) and (cl) of Lemma 61 hold. Now let A,: { 0 , . . . , /zi} -> { 0 , . . . , ni} be the permutation given by k(i) = n{i), if 1 < / < n\, k(i) — i otherwise. Notice that the map X is indeed a permutation, because the map n is injective (see Lemma 61). Then, let T\ = PD e Gn^, where P is the permutation matrix representing A,, and D = diag(j6(l), . . . , ^ ( n i ) , l , . . . , l ) . Then, (a), (b), and (cl) of Lemma 61 can be rephrased as
-d")
TiB^ = (^g„ j ,
(69)
C^Ti = ( C ' , C " ) ,
(70)
and, for all w: [0, T^] -^ R*", for all ? e [0, e„,), we also have [A^^xlit) + A^^xlit)].
= [TiA^'xlit)
+
TiA^^xlit)]., ie{l,...,ni}.
(71)
Thus, Lemma 60 applies and we get [A^^{A^^)'B%J
=
[TIA^^A^^)'B^].J,
ie{l,...,ni},
j e{l,...,m},
(72)
44
Francesca Alhertini and Paolo Dai Pra
and
= inA'^{P^)'P'xl(t) + nP\PT'xi(t)),
(73)
By applying Lemma 58, from Eq. (73) we get [Ai2(A22)'A2la(Ai;c„,(0 + B'U) + A'^{A^^)'^\A^Xy„{t) = [TiA'^{A^^)'A^^a{A'xUt)
+ B^u)].
+ B'U)
+ riAi2(A22)'+i(Pi,(0 + pM)]. = [TiA'^iA'^yA^'T.-'aiA'xUt) + T,A'^A^^)'^\A^x^it)
+ {B\ +
B^'fu)
B\)l,
where, to get this last equality, we have used Eqs. (71) and (69). Now, because the function a and the matrix B^ are both admissible, by Lemma 57, we conclude: (A12(A22)'A21).. = {nA'\A^^)'A^'Tf%.,
Vi, ; € { ! , . . . , m } .
(74)
By Lemma 59, we also get, for all / > 0
and
iC'iA'')'A'% =
{C^P')'P%-% Wie{l,...,p},
>/j e {!,...,ni).
(76)
Let
where M^i eM«2xni a^^
T,X- = {in). where H^^ e W^ ^^^2. Then Eqs. (72), (74), (75), and (76) say that the two linear models, ( ( fn ) , A'\ {B\ A^i)) , ( ( %, ) , A'\ {B\ M^i)) ,
(77)
Recurrent Neural Networks
45
are i/o equivalent. Because 2^1,^2 is observable, we have V2 = 0. So, by Lemma 62, the pair
((.")•-) is observable. Moreover, the pair (A^^, (B^, A^^)) is controllable, by assumption, and thus, by the linear theory (see Theorem 17), we get «2 5 ^2So, in conclusion, n = n\ -\-n2
Now we must show that, if n = n, then (E^j^^i'-^o) is equivalent to {l^fi^^fi^,xo). Note that, from what we have seen before, necessarily, ni = ni and n2 = n2. Moreover, we must have B^ =T-^B^
A^^T-^
=M^^
C'=C'Ti
T,A^^ = H'\
^^^^
On the other hand, by the linear theory (again Theorem 17), there must exists T2 e GL(w2), such that
m-c
'C2^ T \
{B^m)
A^^^T^^A^^T2,
=
T^\B^A^^). (79)
By applying Lemma 58 to Eq. (71), we also have [A^^a{Ahu,it)
+ B^u) + A^^{A^Xyjit) + B^u)].
= [TiA^^T-^a{A^x^(t)
+ B\)
+ TiA^^{A^Xuj(t) + B^u)]..
Clearly, this equality implies A" = riA"rf^
(80)
Now let
(o-°.)Clearly, T G Gn, and by (78), (79), and (80), we have C = CT,
A = T-^AT,
B = T~^B.
(81)
So, to complete the proof, we need only show that XQ = T~^X0' Let xi = TXQ. Because (E„j,„2' ^o) ~ (^hun2^^o), and (81) holds, we conclude that XQ is in-
46
Francesca Albertini and Paolo Dai Pra
distinguishable from xi. On the other hand, because ^ni,n2 is observable, we conclude that TXQ = JCI = xo, as desired. • Proof of Proposition 55. We have already seen in Remark 56 that the conditions in (54) are necessary for minimality. On the other hand, the sufficiency follows directly from the previous proof. In fact, as observed in Remark 63, to get that h
IV. SOME OPEN PROBLEMS Recurrent neural networks have proved to be a quite stimulating object for system theoretic study. With the results of this work, we may say that minimality in both RNNs and MNs is well understood, under the assumption of admissibility. On the other hand, there are still many interesting questions concerning the system theoretic properties of these networks that remain open. 1. Although admissibility of the activation function is an acceptable assumption, the admissibility requirement for the control matrix appears to be an undesirable constraint. Is it conceivable, for instance, to be able to reduce the dimension of a RNN by using a nonadmissible control matrix? First of all, by dropping the admissibility assumption in the control matrix, the symmetry group of a RNN may increase considerably; this is due to the fact that if B is not admissible, then there are noncoordinate subspaces that are stable fora o B. It appears (we are currently working on this problem) that observability for a RNN can be characterized without requiring admissibility for the control matrix. On the other hand, identifiability seems to be much harder. For identifiability, in fact, we compare two systems with control matrices Bi, B2, so that the maps a o B\ and a o B2 may have different stable subspaces. 2. Dropping the admissibility assumption for the control matrix B would make it possible to deal with multiple layer neural networks. A multiple-layer neural network is a cascade of neural networks such that the output of one network is the input of the successive one. Thus a double-layer RNN would be a system of the type x^ = a(Aixi -h Biu) X2 = 5(A2X2 + B2C1X1) y = C2X2
Recurrent Neural Networks
47
that is still a RNN but with the nonadmissible control matrix
Co) By analogy with what is known for feedforward networks (see [32]), we expect that double-layer RNNs would better approximate certain nonlinear i/o behaviors than admissible RNNs. The identifiability and minimality problem is particularly relevant in this context. 3. As we mentioned earlier, complete controllability for continuous time RNNs with suitable activation functions has recently been proved. We believe that the techniques could be used to deal with MNs in continuous time. It seems, however, that controllability in discrete time is much harder, and makes a quite challenging theoretical problem. 4. Forward accessibility in discrete time has been proved under strong assumptions on the activation function and the control matrix; these assumptions are hard to check, even for systems with relatively low dimension. Improvements in this direction are quite desirable. 5. It appears that some results in this work come from "linear" properties of 5 : invariant subspaces, linear transformations that commute with 5, etc. We believe it is worthwhile to investigate whether similar arguments can be applied to more general vector fields that are not produced by scalar functions applied componentwise.
ACKNOWLEDGMENTS We are both indebted to Prof. E. Sontag, who has led us into the subject of recurrent neural networks, and has contributed to the development of many of the techniques used in this work.
REFERENCES [1] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991. [2] K. J. Hunt, D. Sbarbaro, R. Zbikowski, and P. J. Gawthrop. Neural networks for control systems: a survey. Automatica 28:1083-1122, 1992. [3] E. D. Sontag. Neural networks for control. In Essays on Control: Perspectives in the Theory and Its Applications (H. L. Trentelman and J. C. Willems, Eds.), pp. 339-380. Birkhauser, Boston, MA, 1993. [4] Y. Bengio. Neural Networks for Speech and Sequence Recognition. Thompson Computer Press, Boston, 1996. [5] R. Zbikowski and K. J. Hunt, eds. Neural Adaptive Control Technology. World Scientific, Singapore, 1996.
48
Francesca Albertini and Paolo Dai Pra
[6] M. Talagrand. Resultats rigoureux pour le model de Hopfield. C. R. Acad. Sci. Paris Sen I Math. 321:109-112, 1995. [7] A. Bovier and V. Gayrard. Lower bounds on the memory capacity of the dilute Hopfield model. In Cellular Automata and Cooperative Systems, pp. 55-66. Les Houches, Paris, 1992 (reprinted in NATO Adv. Sci. Inst. Sen C Math. Phys. Sci. 396). [8] E. D. Sontag. Neural nets as systems models and controllers. In Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, CT, 1992, pp. 73-79. [9] M. Matthews. A state-space approach to adaptive nonlinear filtering using recurrent neural networks. In Proceedings of the 1990 lASTED Symposium on Artificial Intelligence Applications and Neural Networks, Zurich, July 1990, pp. 197-200. [10] M. M. Polycarpou and P. A. loannou. Neural networks and on-line approximators for adaptive control. In Proceedings of the Seventh Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, 1992, pp. 93-98. [11] H. J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given inputoutput map. Neural Networks 5:589-593, 1992. [12] F. Albertini and E. D. Sontag. For neural networks, function determines form. Neural Networks 6:975-990, 1993. [13] F. Albertini and E. D. Sontag. Uniqueness of weights for recurrent nets. In Proceedings of Mathematical Theory of Networks and Systems, Regensburg, Germany, August 1993. [14] F. Albertini and E. D. Sontag. State observability in recurrent neural networks. Systems Control Lett. 22:235-244, 1994. [15] F. Albertini and P. Dai Pra. Forward accessibility for recurrent neural networks. IEEE Trans. Automat. Control 40:1962-\96S, 1995. [16] H. T. Siegelmann and E. D. Sontag. On the computational power of neural nets. /. Comput. System. Sci. 50:132-150, 1995. [17] R. Koplon and E. D. Sontag. Using Fourier-neural networks to fit sequential input/output data. Neurocomputing, to appear. [18] A. N. Michel, J. A. Farrell, and W. Porod. Qualitative analysis of neural networks. IEEE Trans. Circuits Systems I Fund Theory Appl. 36:229-243, 1989. [19] K. Romanik. Approximate testing and leamability. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory, Pittsburgh, PA, July 1992. [20] B. Dasgupta and E. D. Sontag. Sample complexity for learning recurrent perceptron mappings. IEEE Trans. Inform. Theory 42:1419-US1, 1996. [21] P. Koiran and E. D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks. Unpublished. [22] E. D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer-Verlag, New York, 1990. [23] F. Albertini and P. Dai Pra. Observability of discrete-time recurrent neural networks coupled with hnear systems. In Proceedings of the European Control Conference, Rome, September 58,1995, pp. 1586-1589. [24] F. Albertini and P. Dai Pra. Recurrent neural networks coupled with linear systems: observabiUty in continuous and discrete time. Systems Control Lett. 27:109-116, 1996. [25] F. Albertini, E. D. Sontag, and V. Maillot. Uniqueness of weights for neural networks. In Artificial Neural Networks with Applications in Speech and Vision (R. Mammone, Ed.), pp. 115-125. Chapman & Hall, London, 1993. [26] R. Hermann and A. J. Krener. NonUnear controllability and observability. IEEE Trans. Automat. Control. AC-22:728-740, 1977. [27] H. Nijmeijer. Observability of autonomous discrete time non-linear systems: a geometric approach. Intemat. J. Control 36:867-874, 1982. [28] F. Albertini and D. 
D'Alessandro. Remarks on the observability of discrete-time nonlinear systems. In Proceedings of the IFIP, System Modeling and Optimization, Prague, July 10-14, 1995.
Recurrent Neural Networks
49
[29] H. J. Sussmann. Orbits of families of vector fields and integrability of distributions. Trans. Amer. Math. Soc. 180:171-188, 1973. [30] E. D. Sontag and H. J. Sussmann. Complete controllability of continuous-time recurrent neural networks. Systems Control Lett. 30:177-183, 1997. [31] R. B. Koplon. Linear systems with constrained outputs and transitions. Ph.D. Thesis, Rutgers University, New Brunswick, NJ, 1994. [32] E. D. Sontag. Feedback stabiUzation using two-hidden-layer nets. In IEEE Trans. Neural Networks 3:981-990, 1992.
This Page Intentionally Left Blank
Boltzmann Machines: Statistical Associations and Algorithms for Training N. H. Anderson
D. M. Titterington
Department of Statistics University of Glasgow Glasgow G12 8QQ, Scotland
Department of Statistics University of Glasgow Glasgow G12 8QQ, Scotland
I. INTRODUCTION Boltzmann machines are a class of recurrent stochastic networks, normally with binary outputs. In the case in which the nodes are all visible, they can be regarded as stochastic versions of Hopfield networks, but in practice they often include hidden units. Our notation will be as follows. The network consists of n nodes (otherwise called neurons or units), and xt will denote the state of the ith node, i = 1 , . . . , n. In general, each Xi can be 0 or 1. We shall also envisage a "zeroth" neuron, which is permanently in state XQ = 1, and which allows for bias terms in the model. The set of xi will be denoted by x, which therefore represents the state of the network itself, and the state space of all possible x will be denoted byAf. The state of the network evolves according to a sequence of transitions, in each of which one of the nodes is given the chance to change its state. If the / th neuron is selected for updating, then the new value of x/ is -1
1 [0
with probability 1 + exp
(1)
otherwise.
Implementation Techniques Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.
51
52
N. H. Anderson and D. M. Titterington If, therefore, x\i denotes [xji j ^ i], then
It is assumed that wij = Wjt, for all /, y, and that wa = 0, for all /. In practice, it may be that node / receives inputs from only a subset of the other units, corresponding to wij > 0. Such units form the neighborhood of /, denoted by 9/, so that ^P(xi=0\x\i)\
f^.
WijXj.
Suppose that at each stage a node is chosen at random for updating and is updated according to the above stochastic rule. Repetition of this procedure generates a sequence of states for the network of nodes that corresponds to a Markov chain, with stationary transition probabilities, on the 2"-dimensional state space. At any given stage there is a positive probability that the state will remain unchanged, so that the chain is aperiodic. Furthermore, if the neighborhood system is also such that the chain is irreducible, that is, such that any state can be reached, with positive probability, from any other state, in finite time, then the chain is ergodic, and it follows that the Markov chain has a stationary probability distribution given by the Boltzmann-Gibbs distribution, p(x) = {Z(W)r^exp{-Eix)},
(3)
for all jc, where E{x) = — y ^ WijXiXj, W denotes the matrix of weights {wij}, and Z(W) =
J2^xp{-Eix)} X
is the appropriate normalizing constant, called the partition function, a term from the literature in statistical physics. There are various other points of contact with statistical physics. Section 20.3.5 of Titterington and Anderson [1] briefly notes, with references, the relationship between Boltzmann machines and spin-glass models, and we allude to another interface later in Section V.G. The plan of the rest of the chapter is as follows. In Section II we indicate the links between the dynamics of the Boltzmann machine with the Markov chain Monte Carlo techniques that are currently proving to be so useful in statistics and.
in particular, in Bayesian methods. Section III reminds the reader of the Boltzmann machine's origins as a stochastic version of Hopfield nets, and Section IV introduces the feature of hidden nodes, which gives Boltzmann machines so much added flexibility. The main purpose of the chapter is to discuss methods that have been developed for training Boltzmann machines. These are outlined in Section V, which concentrates on the relationship of most of the algorithms to the statistical paradigm of parameter estimation by maximum likelihood. The particular algorithms are taken both from the neural computation literature and from modern publications in statistics. Sections VI and VII illustrate some of the algorithms in detail with examples without and with hidden units, respectively. The examples are, admittedly, of trivial dimensions, but this allows a comparatively detailed study. Section VIII describes a number of modifications of the basic Boltzmann machine, and Section IX briefly mentions a few possible developments for the future.
II. RELATIONSHIP WITH MARKOV CHAIN MONTE CARLO METHODS

One consequence of the above discussion is that, if the updating rule is applied many times, starting from any arbitrary state, then, in the long term, the procedure generates a realization from the probability distribution (3), by what is essentially a version of the so-called Gibbs sampler, which is one of a class of simulation procedures that have recently become known as Markov chain Monte Carlo (MCMC) methods. The method is also known as the heat-bath method and as Glauber dynamics.

If $n$ is large, it is not practicable to sample directly from (3), because of the size of the state space ($2^n$ elements) and the consequent infeasibility of computing the partition function. MCMC methods are designed to obviate this difficulty by constructing a Markov chain whose limiting distribution is the distribution of interest. In the case of the Gibbs sampler, the general rationale is as follows. Of interest is the generation of a high-dimensional random vector $x = (x_1, \ldots, x_n)$ from a joint distribution $p(x)$. Instead of doing this directly, one generates a sequence of realizations of a Markov chain whose transition rules are based on the $n$ full conditional distributions $\{p(x_i \mid x_{\setminus i})\}$. Typically, the full conditional distributions are much more manageable; in particular, the corresponding normalizing constant is easily computed. In the case where $p(x)$ is given by (3), the Gibbs sampler leads to the updating rule (2). The only minor difference between Gibbs sampling and the procedure described in Section I is that, in Gibbs sampling, the updating usually cycles systematically through the elements of $x$, rather than choosing from them at random; in practice, the equilibrium behavior is not altered by this change.
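To make the updating rule concrete, the following is a minimal sketch, in Python with NumPy, of one systematic-scan sweep of this Gibbs sampler for the distribution (3). The function name and the convention that component 0 of the state vector is the always-on bias node are our own illustrative choices, not part of the chapter's presentation.

```python
import numpy as np

def gibbs_sweep(x, W, rng):
    """One systematic sweep of the Gibbs sampler for p(x) in Eq. (3).

    x is a 0/1 state vector whose component 0 is the bias node (always 1);
    W is the symmetric weight matrix with zero diagonal.  Each nonbias
    node is redrawn from its full conditional, Eqs. (1)-(2).
    """
    for i in range(1, len(x)):
        u = W[i] @ x                      # sum_j w_ij x_j (w_ii = 0)
        p1 = 1.0 / (1.0 + np.exp(-u))     # P(x_i = 1 | x_{\i})
        x[i] = 1 if rng.random() < p1 else 0
    return x

rng = np.random.default_rng(0)
```

Iterating `gibbs_sweep` many times and recording the visited states then yields (dependent) draws whose long-run distribution is (3).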
The Gibbs sampler can be modified by introducing an annealing feature. Instead of using (2), we use the corresponding rule based on

$$\log\frac{P(x_i = 1 \mid x_{\setminus i})}{P(x_i = 0 \mid x_{\setminus i})} = \frac{1}{T}\sum_{j \in \partial i} w_{ij} x_j,$$

in which $T$ is a parameter equivalent to the temperature parameter familiar in statistical physics. One may start with $T$ large and reduce it to the value 1, according to an appropriate annealing schedule. The Markov chain thereby created is now nonhomogeneous, but in the limit a realization from $p(x)$ should still be generated. Geman and Geman [2] pioneered the use of the Gibbs sampler in image restoration, combined with annealing schedules in which $T$ tended to zero, thereby locating the modes of $p(x)$. In their context this corresponded to maximum a posteriori (MAP) restorations of noisy images.

The Gibbs sampler is only one MCMC procedure, albeit one that has found many adherents, because of its simplicity and its motivation through the full conditional distributions. It has had a spectacular impact on the practical application of the philosophy of Bayesian statistics to realistic problems, and the work of Geman and Geman [2] was an early manifestation of this. In Bayesian statistics, inferences about unknown quantities should be made on the basis of a so-called posterior distribution, which amalgamates prior information with observed data. As well as modes, as in the image restoration example, various other summaries of the posterior distribution may be of interest, and many of them can be represented as averages with respect to the posterior distribution. In many practical problems, analytical evaluation of the corresponding sums or integrals is out of the question, and one way to approximate them is by a sample average, based on a large sample of realizations from the distribution, each realization generated by the application of an MCMC procedure.

A general class of MCMC procedures is that of the Metropolis-Hastings methods [3, 4], which include as special cases both the Gibbs sampler and the original Metropolis algorithm [5]. To see the extent to which MCMC methods have revolutionized the application of Bayesian statistics, see the book compiled by Gilks, Richardson, and Spiegelhalter [6] and the more theoretical review by Tierney [7]. For more details of the application of Gibbs sampling and the Metropolis method to Boltzmann machines, see Aarts and Korst [8, 9].

An important point to note is that, essentially, the study of Boltzmann machines amounts to the study of probability distributions such as $p(x)$, as given in (3). The relationship with a network of nodes is in a sense incidental, although reflection on the network structure can shed considerable light on the properties of the distribution.
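As a small illustration of the annealed variant: the temperature enters only by rescaling the conditional logit. The sketch below reuses the conventions of the earlier Gibbs sketch; the geometric schedule is one common choice that we assume for illustration, not one prescribed by the chapter.

```python
def annealed_run(x, W, rng, t_start=10.0, n_steps=500):
    """Annealed Gibbs sampling: each conditional logit is divided by T,
    with T decreasing geometrically from t_start toward 1."""
    for T in np.geomspace(t_start, 1.0, n_steps):
        for i in range(1, len(x)):
            u = (W[i] @ x) / T
            x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-u)) else 0
    return x
```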
III. DETERMINISTIC ORIGINS OF BOLTZMANN MACHINES

In this section we comment briefly on the fact that Boltzmann machines can be envisaged as stochastic versions of Hopfield associative memories. Note that the updating rule can be written as follows: the new $x_i$ is 1 with probability $f_s(\sum_j w_{ij} x_j)$, and 0 otherwise, where $f_s$ denotes the sigmoid (logistic) nonlinearity defined by $f_s(u) = \{1 + \exp(-u)\}^{-1}$. Suppose instead that the new $x_i$ is given by

$$x_i = f_h\Big(\sum_j w_{ij} x_j\Big),$$

where $f_h(u)$ denotes the hard-limiter nonlinearity,

$$f_h(u) = \begin{cases} 1 & \text{if } u \geq 0, \\ 0 & \text{if } u < 0. \end{cases}$$

This corresponds to the deterministic updating rule for a basic Hopfield network [10]. One interpretation of the workings of this network is as a procedure for locating the minima of the function $E(x)$, which can be interpreted as an energy function. If the nodes are updated according to the above rule, asynchronously and selecting the nodes at random, then $E(x)$ decreases at each stage until a local minimum of $E(x)$ is reached.

Many problems of combinatorial optimization have been attacked using the Hopfield network and variations thereof, by specifying an energy function of interest and applying the appropriate version of the updating rule. These problems include the famous traveling salesman problem; see, for instance, Aarts and Korst [8].

The energy surface often has many local minima, and the choice of the initial $x$ can strongly influence the local minimum reached by the algorithm. This behavior justifies the terminology of attractor neural networks for these networks; see Amit [11], for instance. It also lies behind the invention of the stochastic version that is of prime interest in this chapter, in that, if stochastic updating is permitted, along with annealing that allows $T$ to tend to zero, then there is the chance of escaping from the region of attraction of local, but nonglobal, minima. A major contribution of Geman and Geman [2] was to establish, in their context of image restoration, an annealing schedule that would guarantee convergence to the relevant global minimum. A drawback to this theoretical contribution was that the annealing schedule turned out to be impractically slow. For more pointers to the literature on Hopfield networks, see Section 20.2 of Titterington and Anderson [1].
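A minimal sketch of the deterministic rule and the energy it descends, under the same conventions as the Gibbs sketch in Section II (all names are ours):

```python
def energy(x, W):
    """E(x) = -sum_{i<j} w_ij x_i x_j, written as a quadratic form."""
    return -0.5 * x @ W @ x

def hopfield_step(x, W, rng):
    """Asynchronous hard-limiter update of one randomly chosen nonbias
    node; E(x) never increases under this rule (w_ii = 0, so the i term
    contributes nothing to the input sum)."""
    i = rng.integers(1, len(x))
    x[i] = 1 if W[i] @ x >= 0 else 0
    return x
```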
IV. HIDDEN UNITS

Although we shall regard the model introduced in Section I as a Boltzmann machine, the incorporation of hidden nodes is an important feature of Boltzmann machines as used in practice. In fact, the vector of node states, $x$, may well represent three types of variable, namely, inputs, outputs, and hidden variables. The inputs and outputs, which we shall denote by $x_I$ and $x_O$, both correspond to visible units and represent physical quantities; we shall write $x_V = (x_I, x_O)$. The hidden units are created to add flexibility to the model, and we shall denote the corresponding set of variables by $x_H$. Thus, $x = (x_V, x_H) = (x_I, x_O, x_H)$. We shall denote the corresponding numbers of variables (i.e., nodes) by $n_I$, $n_O$, $n_V$, and $n_H$, so that $n = n_V + n_H = n_I + n_O + n_H$.

When the Boltzmann machine is evolving, according to the stochastic updating rules, all units are treated in the same way, except in cases in which $x_I$ is fixed (clamped, in the neural computing terminology), so that, in the long term, realizations are generated from the (joint) stationary distribution for $x$. However, the distribution of interest in practice is

$$p(x_V) = \sum_{x_H} p(x_V, x_H), \qquad (4)$$

or, if $x_I$ is clamped,

$$p(x_O \mid x_I) = \sum_{x_H} p(x_O, x_H \mid x_I).$$
Because (4) can be written in the mixture-distribution form,

$$p(x_V) = \sum_{x_H} p(x_V \mid x_H)\, p(x_H),$$

we can see that the hidden variables are what are called latent variables in statistics. They add flexibility to the structure of the models in the same way as mixture models, factor analysis models, and latent class models.

The features underlying the structure of Boltzmann machines are the numbers of nodes, visible and hidden; the pattern of internodal connections; and the connection weights. Recalling the relationship between Boltzmann machines and Gibbs distributions, we note that these features relate to statistical model building (in terms of the choice of the nodes and connections) and selection of parameter values (so far as the choice of weights is concerned). The art of model building involves the familiar objectives of balancing parsimony of structure with the complexity of the task that has been set; for instance, one may aim to "minimize," in some sense, the number of hidden nodes while ensuring that the model is sufficiently rich. So far as choice of weights is concerned, the approach depends on the objective of the whole exercise. When Boltzmann machines are used in the solution of optimization problems, the appropriate weights are determined by the
particular task; see, for instance, Aarts and Korst [8] and Section 20.5 of Titterington and Anderson [1]. In this article, however, we shall concentrate mainly on the interpretation of Boltzmann machine learning as an exercise in statistical parameter estimation.
V. TRAINING A BOLTZMANN MACHINE

In this section, we consider what can be done if the parameters are unknown, although we do assume that the architecture (i.e., the model) is given. We identify the Boltzmann machine with its stationary distribution, with the result that we are interested in estimating the parameters of a type of exponential-family distribution; see, for instance, Cox and Hinkley [12, p. 12]. To see this, note that

$$\log p(x) = \sum_{i<j} w_{ij} x_i x_j + \text{constant},$$

where the constant does not depend on $x$. Thus the parameters, in $W$, and certain functions of the variables, which contribute to what are known as the sufficient statistics, are combined in a simple way.

The usual scenario is that a training set is available, which is assumed to come from the marginal stationary distribution for the visible variables, but we shall also consider the case in which a target distribution is prescribed for the visible variables and the objective is to find the best approximating Boltzmann machine with a given architecture. In statistical terms, this amounts to finding the best approximating exponential-family distribution, within a particular class, to the prescribed distribution. We shall see that there are close relationships with Amari's information geometry of exponential-family distributions, with maximum likelihood estimation, and with the Iterative Proportional Fitting Procedure (IPFP) [13] used for fitting log-linear models to categorical data.

We shall denote the target distribution by $\{r(x): x \in \mathcal{X}\}$, and we shall denote the class of Boltzmann machines of a specified architecture by $\mathcal{B}$. Thus we envisage each $p \in \mathcal{B}$ to be of the form (3), but with parameters $\{w_{ij}\}$ to be chosen. Target distributions representing a training set $D$ are simply the corresponding sets of relative frequencies. The training set itself is assumed to be a set of $N$ independent $n$-dimensional realizations from $\mathcal{X}$:

$$D = \{x^{(1)}, \ldots, x^{(N)}\}.$$

We shall define $p_{ij} = \mathbb{E}_p(x_i x_j) = \operatorname{Prob}_p(x_i = x_j = 1)$, and similarly for $r_{ij}$, for all $i, j$, where $\mathbb{E}$ denotes expectation.
A. THE CASE WITHOUT HIDDEN UNITS
Ackley et al. [14] proposed the following rule for iteratively modifying the weights $\{w_{ij}\}$: $w_{ij}$ changes to $w_{ij} + \Delta w_{ij}$, where

$$\Delta w_{ij} = \eta\,(r_{ij} - p_{ij}); \qquad (5)$$

see also Hinton and Sejnowski [15]. Various remarks can be made about (5).

(i) In the terminology of exponential-family distributions, $\{w_{ij}\}$ are the natural parameters and $\{\sum_{k=1}^{N} x_i^{(k)} x_j^{(k)}\}$ are the associated minimal sufficient statistics, as revealed by the formula for the likelihood function, which is defined to be the joint probability function of all the data, regarded as a function of $W$:

$$\operatorname{Lik}(W) = \prod_{k=1}^{N} p(x^{(k)}) = \{Z(W)\}^{-N} \exp\Big\{\sum_{i<j} w_{ij} \Big(\sum_k x_i^{(k)} x_j^{(k)}\Big)\Big\}.$$

(ii) Expression (5) defines a gradient-descent rule for minimizing $I(r; p)$ with respect to $W$, where $p$ is given by (3) and

$$I(r; p) = \sum_x r(x) \log\{r(x)/p(x)\}, \qquad (6)$$
which is the Kullback-Leibler directed divergence between $r$ and $p$; see also Luttrell [16]. (We assume in (ii) that $p(x) > 0$ whenever $r(x) > 0$ and that $0 \cdot \log 0 = 0$.) Of course, minimizing $I(r; p)$ is equivalent to maximizing

$$\sum_x r(x) \log p(x),$$

which is the log-likelihood function in the case where $r$ represents a training sample. To prove (ii), note that

$$\sum_x r(x) \log p(x) = \tfrac{1}{2} \sum_x r(x)\, x^{\mathsf T} W x - \log Z(W) = \sum_x r(x) \Big(\sum_{i<j} w_{ij} x_i x_j\Big) - \log Z(W) = \sum_{i<j} r_{ij} w_{ij} - \log Z(W).$$
Thus

$$\frac{\partial I(r; p)}{\partial w_{ij}} = -\Big(r_{ij} - \frac{\partial \log Z(W)}{\partial w_{ij}}\Big) = -(r_{ij} - p_{ij}),$$

by standard exponential-family theory, so that (5) can be written

$$\Delta w_{ij} = -\eta\, \frac{\partial I(r; p)}{\partial w_{ij}},$$

as required.

The parameter $\eta$ is the step size in the gradient-descent algorithm, and, as usual, its choice will affect the convergence characteristics of the procedure. The practical problem in implementing the algorithm is the calculation of $p_{ij}$. It is not a realistic proposition to compute this analytically, and in practice $p_{ij}$ is estimated by $\hat{p}_{ij}$, a relative frequency generated by simulation. This is in itself a time-consuming operation: each realization that contributes to $\hat{p}_{ij}$ involves either iterating the relevant Gibbs sampler to stationarity or continuing a Markov chain Monte Carlo procedure long enough to nullify the effects of serial correlation; see, for instance, Besag and Green [17], Gelman and Rubin [18], and Geyer [19].
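The following sketch shows one iteration of rule (5) with $p_{ij}$ replaced by such a simulation-based estimate. It reuses `gibbs_sweep` from the sketch in Section II, and the burn-in and sample-size constants are illustrative assumptions only:

```python
def train_step_visible(W, data, rng, eta=0.2, n_burn=1_000, n_samples=10_000):
    """One update of Eq. (5) for a machine with no hidden units.

    `data` is an (N, n) 0/1 array whose column 0 is the bias node, so
    data.T @ data / N gives the target moments r_ij; the model moments
    p_ij are estimated from a free-running Gibbs chain."""
    N, n = data.shape
    R = data.T @ data / N                 # empirical moments r_ij
    x = rng.integers(0, 2, size=n)
    x[0] = 1
    P = np.zeros((n, n))
    for t in range(n_burn + n_samples):
        x = gibbs_sweep(x, W, rng)
        if t >= n_burn:
            P += np.outer(x, x)
    P /= n_samples
    W = W + eta * (R - P)                 # Eq. (5); R and P are symmetric
    np.fill_diagonal(W, 0.0)
    return W
```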
B. THE CASE WITH HIDDEN UNITS

The algorithm must be modified when there are hidden units. Recall that $x = (x_V, x_H)$, where $x_V$ denotes the values of the visible units and $x_H$ denotes those of the hidden units. Then the training data take the form $D = \{x_V^{(k)}: k = 1, \ldots, N\}$, and the maximum likelihood estimation problem is to maximize, with respect to $W$,

$$\sum_{x_V} r_V(x_V) \log p_V(x_V),$$

where, for instance,

$$p_V(x_V) = p_V(x_V \mid W) = \sum_{x_H} p(x \mid W).$$

Equivalently, we have to minimize $I(r_V; p_V)$. In this case, the gradient-descent rule is

$$\Delta w_{ij} = \eta\,\{q_{ij}(W) - p_{ij}(W)\}, \qquad (7)$$

where $p_{ij}$ is as before, and

$$q_{ij}(W) = \mathbb{E}\Big\{N^{-1} \sum_k x_i^{(k)} x_j^{(k)} \,\Big|\, D, W\Big\}.$$
When $i$ and $j$ correspond to visible units, $q_{ij} = r_{ij}$; otherwise, $q_{ij}$ involves a conditional expectation. Thus, in general, both $q_{ij}$ and $p_{ij}$ involve expectations, which in practice must be estimated by simulation. The neural network terminology goes as follows. The updating of $w_{ij}$ contains two phases: in the positive-phase simulation (for $q_{ij}$), the visible units are clamped at the values given in the training set, so that the resulting realizations follow the required conditional distribution; the negative-phase simulation (for $p_{ij}$) is free-running, with no unit clamped. Thus the complete training algorithm is iterative, and each stage requires a large Monte Carlo exercise. As a result, it is notoriously slow to train a Boltzmann machine using a training set, although DeGloria et al. [20] indicate one way of speeding up the algorithm.
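A sketch of the positive-phase estimate of the $q_{ij}$, under the same illustrative conventions as before (the visible block, bias included, occupies the first `n_v` components of the state vector; all names are ours):

```python
def positive_phase(W, data_v, rng, n_h, n_samples=1_000):
    """Estimate q_ij by clamping the visible units at each training
    vector and Gibbs-sampling the hidden units from their conditionals."""
    N, n_v = data_v.shape
    n = n_v + n_h
    Q = np.zeros((n, n))
    for xv in data_v:
        x = np.concatenate([xv, rng.integers(0, 2, size=n_h)])
        for _ in range(n_samples):
            for i in range(n_v, n):       # update hidden units only
                p1 = 1.0 / (1.0 + np.exp(-(W[i] @ x)))
                x[i] = 1 if rng.random() < p1 else 0
            Q += np.outer(x, x)
    return Q / (N * n_samples)
```

The negative phase is exactly the free-running estimate used in the sketch for rule (5), so the update (7) becomes `W += eta * (Q - P)`.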
C. RELATIONSHIP WITH THE EM ALGORITHM

The case with hidden units fits naturally into the class of problems to which the EM algorithm [21] can be applied. The EM algorithm is a general iterative procedure for calculating maximum likelihood estimates in problems involving incomplete data. Suppose the log-likelihood of interest is $L(W)$, and that we seek to maximize this with respect to $W$. Let us expand the notation, writing $L(W; D)$ to indicate the dependence on the training data. Correspondingly, we write $L_c(W) = L_c(W; D_c)$ to denote the log-likelihood associated with the so-called complete data, $D_c$. In the context of Boltzmann machines, $D_c$ includes the values for the hidden nodes for all of the realizations: $D_c = \{(x_V^{(k)}, x_H^{(k)}): k = 1, \ldots, N\}$. Each iteration of the EM algorithm updates a current approximation, $W^{(s)}$, say, for the maximum likelihood estimate, to a better approximation, $W^{(s+1)}$. The iteration consists of an E-step and an M-step, defined as follows:

E-step: Compute $Q(W; D) = \mathbb{E}\{L_c(W; D_c) \mid D, W^{(s)}\}$.
M-step: Find $W = W^{(s+1)}$ to maximize $Q(W; D)$.

The EM algorithm has the appealing property of being monotonic, in that $L(W^{(s+1)}) \geq L(W^{(s)})$, so that the algorithm will converge to a local, if not necessarily a global, maximum of the log-likelihood of interest. In the case of the Boltzmann machine, the EM iteration turns out to have a particularly simple formulation, characteristic of exponential-family distributions, in that $W^{(s)}$ is updated to $W^{(s+1)}$, where $W^{(s+1)}$ solves

$$q_{ij}(W^{(s)}) = p_{ij}(W^{(s+1)}), \qquad \text{for all } i, j;$$

one computes, in the E-step, the expected values $q_{ij}(W^{(s)})$ of the sufficient statistics, given $D$ and assuming that $W^{(s)}$ are the correct weights, and the M-step equates them to the unconditional expected values at the new set of
weights, $W^{(s+1)}$. This gives, usually, a complicated set of equations. As before, each $q_{ij}(W^{(s)})$ usually must be computed using simulation.
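Since the M-step is exactly a moment-matching problem, one illustrative way to realize it is an inner gradient loop on the weights. In the sketch below, `negative_phase` is a hypothetical stand-in for any free-running estimator of the $p_{ij}$ (for example, the one used in the sketch of Section V.A); this is our own construction, not the chapter's prescription:

```python
def em_stage(W, data_v, rng, n_h, eta=0.2, n_inner=50):
    """One EM stage for a Boltzmann machine.

    E-step: Q holds the expected sufficient statistics q_ij(W_s),
    from the clamped (positive-phase) simulation.
    M-step: adjust W until the model moments p_ij match Q."""
    Q = positive_phase(W, data_v, rng, n_h)
    for _ in range(n_inner):
        P = negative_phase(W, rng)        # hypothetical p_ij estimator
        W = W + eta * (Q - P)
        np.fill_diagonal(W, 0.0)
    return W
```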
D. DISCUSSION OF THE CASE WITH TRAINING DATA

The principal obstacle to the widespread use of Boltzmann machines and their associated stationary distributions in practice is the need for Monte Carlo methods, in both running and training the machines. Considerable parallelism is possible, but much benefit would be gained from faster Markov chain Monte Carlo methods and innovative maximum likelihood procedures. An example of the latter is the work of Geyer and Thompson [22]. For simplicity, consider the case of no hidden unit, so that we are required to find $W$ to maximize

$$L(W) = \tfrac{1}{2} \sum_{k=1}^{N} x^{(k)\mathsf T} W x^{(k)} - N \log Z(W). \qquad (8)$$

Now,

$$Z(W) = \sum_x \exp\big(\tfrac{1}{2} x^{\mathsf T} W x\big) = Z(V) \sum_x \exp\big\{\tfrac{1}{2} x^{\mathsf T} (W - V) x\big\}\, p(x \mid V),$$

where $V$ is any set of weights. Thus $\log Z(W) = \log Z(V) + \log d(W)$, where

$$d(W) = \sum_x \exp\big\{\tfrac{1}{2} x^{\mathsf T} (W - V) x\big\}\, p(x \mid V). \qquad (9)$$
If now $X^{(1)}, \ldots, X^{(s)}$ are realizations from $p(x \mid V)$, then $d(W)$ can be estimated, for any $W$, by

$$d_s(W) = s^{-1} \sum_{l=1}^{s} \exp\big\{\tfrac{1}{2} X^{(l)\mathsf T} (W - V) X^{(l)}\big\},$$

and the maximizer of

$$L_1(W) = \tfrac{1}{2} \sum_{k=1}^{N} x^{(k)\mathsf T} W x^{(k)} - N \log d_s(W)$$

is used as an approximation to the maximizer of $L(W)$.
The maximization can be done iteratively, the main point being that only one set of realizations, that from $p(\cdot \mid V)$, is required for the whole process. Geyer and Thompson [22] discuss the choice of $V$.
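A minimal sketch of the Monte Carlo log-likelihood $L_1$ of this method. The log-sum-exp guard for numerical stability, and all names, are our own additions:

```python
def log_ds(W, V, X):
    """log d_s(W) of Eq. (9), from s realizations X (rows) drawn p(.|V)."""
    expo = 0.5 * np.einsum('li,ij,lj->l', X, W - V, X)
    m = expo.max()                        # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(expo - m)))

def L1(W, V, X, data):
    """Monte Carlo approximation L_1(W) to the log-likelihood (8)."""
    quad = 0.5 * np.einsum('ki,ij,kj->', data, W, data)
    return quad - len(data) * log_ds(W, V, X)
```

Maximizing `L1` over $W$ (for instance with a general-purpose optimizer) then approximates the maximum likelihood estimate, and the same draws `X` serve for every candidate $W$.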
E. INFORMATION GEOMETRY OF BOLTZMANN MACHINES

Amari et al. [23] review the differential geometry of exponential-family distributions. In particular, they make the following remarks:

(i) The Fisher information matrix for the natural parameters is the unique invariant metric in the manifold $\mathcal{S}$ of all probability measures on the finite state space $\mathcal{X}$.
(ii) $\mathcal{S}$ is a curved Riemannian manifold, but it admits a unique divergence measure between two points $q$ and $p$ in the direction from $q$ to $p$. This measure turns out to be the Kullback-Leibler directed divergence, $I(q; p)$, defined in (6).
(iii) If $\mathcal{M}$ is a submanifold of $\mathcal{S}$, then the $p \in \mathcal{M}$ that minimizes $I(r; p)$ represents a generalized projection of $r$ onto $\mathcal{M}$.

The relevance of (iii) is that the classes of Boltzmann machines (or, more precisely, their stationary distributions) constitute submanifolds, and the minimizer of $I(r; p)$, for a given target $r$, provides an optimal approximator for $r$ in the most appropriate sense. The gradient-descent approach to learning is given by (5), where now $r_{ij} = \mathbb{E}_r(x_i x_j)$.
Formula (5) relates to the case of Boltzmann machines without hidden units. When there are hidden units, and a target distribution $r_V$ on $x_V$, the optimal $p$ is a Boltzmann machine distribution such that $I(r_V; p_V)$ is minimized. It is awkward that this optimization problem involves the marginal distribution, $p_V$, of $p$, but it turns out that the problem can be re-expressed in terms of the underlying joint distributions. Suppose $\mathcal{V}$ denotes the class of probability distributions on $\mathcal{X}$ for which the marginal distribution on $x_V$ is $r_V$. Then

$$\min_{p \in \mathcal{B}} I(r_V; p_V) = \min_{p \in \mathcal{B},\, r \in \mathcal{V}} I(r; p); \qquad (10)$$

see Theorem 7 of Amari et al. [23] and Section III of Byrne [24].
F. TRAINING BOLTZMANN MACHINES BY ALTERNATING MINIMIZATION

The practical pay-off from (10) is the development of alternating-minimization algorithms for computing the optimal $p$. A sequence $\{p^0, r^0, p^1, r^1, \ldots\}$ is generated by alternating between the following two problems:

$$r^t = \operatorname*{argmin}_{r \in \mathcal{V}} I(r; p^t), \qquad (11)$$

$$p^{t+1} = \operatorname*{argmin}_{p \in \mathcal{B}} I(r^t; p). \qquad (12)$$
Because it is fairly easy to establish that, for any $p \in \mathcal{B}$,

$$I(r_V; p_V) = \min_{r \in \mathcal{V}} I(r; p),$$

it follows from (11) and (12) that

$$I(r_V; p_V^{t+1}) = I(r^{t+1}; p^{t+1}) \leq I(r^t; p^{t+1}) \leq I(r^t; p^t) = I(r_V; p_V^t).$$

Thus the sequence $\{I(r_V; p_V^t)\}$ converges, and the limiting $p^*$ gives at least a local minimum of $I(r_V; p_V)$.

In the case of no hidden unit, of course, stage (11) is degenerate, and we are left with a single iteration of (12), with $r^0 = r$. Otherwise, the solution $r^t$ of (11) can be obtained explicitly, as

$$r^t(x) = r_V(x_V)\, \frac{p^t(x)}{p_V^t(x_V)}, \qquad x \in \mathcal{X}. \qquad (13)$$

Thus, as in the case of no hidden unit, the computational difficulties come down to (12). Byrne [24] describes how to solve (12) using the Iterative Proportional Fitting Procedure (IPFP). The information geometry based on $I(r; p)$ implies that $p^{t+1}$ is that $p \in \mathcal{B}$ such that $p_{ij} = r_{ij}^t$ for all $i, j$; that is,

$$\sum_x p(x)\, x_i x_j = r_{ij}^t, \qquad \text{for all } i, j. \qquad (14)$$
A sequence $\{Q_k\}$ of members of $\mathcal{B}$ is generated by successively rescaling $Q_{k-1}$, say, to fit, exactly, one of the constraints defined by (14). Thereby, $Q_k$ is created, and the algorithm proceeds, visiting each constraint infinitely often (in theory).
The rescaling procedure is explicit; it does not involve Monte Carlo, and it effectively fits a member of a class of log-linear models by maximum likelihood.
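For a network small enough to enumerate, the rescaling step can be written directly on the joint probability table. The following sketch is our own construction; it cycles through the pairwise constraints (14), and each rescaling keeps the fitted distribution within the pairwise-interaction family:

```python
import numpy as np
from itertools import product

def ipfp(r, n, n_sweeps=200):
    """Fit the constraints (14) by iterative proportional fitting.

    r[i, j] holds the target probabilities Prob(x_i = x_j = 1); column 0
    of each state is the always-on bias node."""
    states = np.array([(1,) + s for s in product((0, 1), repeat=n - 1)])
    p = np.full(len(states), 1.0 / len(states))
    for _ in range(n_sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                on = (states[:, i] * states[:, j]) == 1
                pij = p[on].sum()
                if 0.0 < pij < 1.0 and 0.0 < r[i, j] < 1.0:
                    p[on] *= r[i, j] / pij          # match the (i, j) margin
                    p[~on] *= (1.0 - r[i, j]) / (1.0 - pij)
    return states, p
```

Each step multiplies $p(x)$ by a factor of the form $\exp\{\gamma\, x_i x_j + \text{const}\}$, so the iterates remain Boltzmann distributions of the given architecture.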
G. PARAMETER ESTIMATION USING MEAN-FIELD APPROXIMATIONS

The links with statistical physics bear fruit in the form of an approximate approach that can be used in various contexts, including parameter estimation. This so-called Mean-Field Theory can be derived in various ways, one of which is the following. Suppose a random vector $x$ has a probability distribution given by (3) and we wish to evaluate $\mathbb{E}(x_i)$, say. In view of the complicated partition function, $Z(W)$, this is usually not practicable, so an approximation, to be written $\langle x_i \rangle$, is defined as being $\mathbb{E}(x_i)$ evaluated as if all other $x_j$ are fixed at their own mean-field approximations. Thus $\langle x_i \rangle = \mathbb{E}\{x_i \mid x_{\partial i} = \langle x_{\partial i} \rangle\}$, that is,

$$\langle x_i \rangle = \sum_{x_i} x_i \exp\Big\{x_i \sum_j w_{ij} \langle x_j \rangle\Big\} \Big/ \sum_{x_i} \exp\Big\{x_i \sum_j w_{ij} \langle x_j \rangle\Big\}.$$

Because $x_i$ can only take the values 0 and 1, we obtain

$$\langle x_i \rangle = \exp\Big\{\sum_j w_{ij} \langle x_j \rangle\Big\} \Big/ \Big[1 + \exp\Big\{\sum_j w_{ij} \langle x_j \rangle\Big\}\Big], \qquad \text{for all } i. \qquad (15)$$
These equations can be solved numerically, for a given $W$, to give the set of values $\{\langle x_i \rangle\}$. One can similarly obtain corresponding equations for $\langle x_i x_j \rangle$, $i \neq j$, but it is common practice to make a further approximation and take

$$\langle x_i x_j \rangle = \langle x_i \rangle \langle x_j \rangle, \qquad (16)$$

with $\{\langle x_i \rangle\}$ obtained from (15). Values from (16) can then be used, instead of sample means from Monte Carlo simulation, in approximating, as appropriate, $r_{ij}$ and $p_{ij}$ in (5) or $q_{ij}$ and $p_{ij}$ in (7). The experience of Peterson and Anderson [25] is that the mean-field approach performs very well, with speed-up factors of up to 30 over the simulation approach. It is also possible to define a mean-field approximation to $p(x)$ itself, by

$$p_{\mathrm{MF}}(x) = \prod_i p\{x_i \mid x_{\partial i} = \langle x_{\partial i} \rangle\}, \qquad (17)$$
which bears a strong similarity to Besag's [26] pseudo-likelihood,

$$p_{\mathrm{PL}}(x) = \prod_i p(x_i \mid x_{\partial i}).$$

Clearly, both (15) and (16) follow from (17). Various aspects of Mean-Field Theory approximations are discussed in Amit [11], Hertz et al. [27], and Mezard et al. [28]. The theory is used by Geiger and Girosi [29] in image restoration, and Zhang [30] uses (17) as the basis of a practicable EM algorithm for dependent, incomplete data. Further applications of this type are mentioned later, in Section VIII.C. However, the deterministic aspects of the method raise some concern about the statistical properties of the resulting estimators; see, for instance, Dunmur and Titterington [31].
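A sketch of solving (15) by damped fixed-point iteration; the damping factor is our own safeguard against oscillation, not part of the chapter's description:

```python
def mean_field(W, n_iter=200, damping=0.5):
    """Iterate Eq. (15) to a fixed point; m[i] approximates <x_i>."""
    n = W.shape[0]
    m = np.full(n, 0.5)
    m[0] = 1.0                            # bias node is permanently on
    for _ in range(n_iter):
        new = 1.0 / (1.0 + np.exp(-(W @ m)))   # right-hand side of (15)
        new[0] = 1.0
        m = damping * m + (1.0 - damping) * new
    return m
```

The products `m[i] * m[j]` then supply the approximation (16) to the second moments needed in (5) or (7).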
H. THE METHOD OF GELFAND AND CARLIN

Gelfand and Carlin [32] present a Monte Carlo method for obtaining maximum likelihood estimates when there are missing data. Geyer's [33] account of the method is as follows. Let the complete data be represented by $D_c = (D, D_m)$, where $D_m$ denotes the missing data, and write

$$p(D_c \mid W) = p(D, D_m \mid W) = h(D, D_m \mid W)/Z_c(W),$$

where $Z_c(W)$ is the partition function and $h(D, D_m \mid W)$ is defined as appropriate. It turns out to be convenient to work in terms of likelihood ratios, relative to the model corresponding to an arbitrary set of weights $V$, say, so that the logarithm of the likelihood ratio, based on the observed data, is

$$l(W) = \log\left[\mathbb{E}_V\left\{\frac{h(D, D_m \mid W)}{h(D, D_m \mid V)} \,\Big|\, D\right\}\right] - \log\left[\mathbb{E}_V\left\{\frac{h(D, D_m \mid W)}{h(D, D_m \mid V)}\right\}\right],$$

in which the first expectation is over the conditional distribution of $D_m$ given $D$, and the second is over the joint distribution of $(D, D_m)$, both under the model with weights $V$.
In practice, the expectations are estimated by sample averages. Numerical maximization of $l(W)$ can then be undertaken. Note that Geyer [33] suggests that it is better to calculate the first term exactly, where possible, in contrast to Gelfand and Carlin's [32] suggestion that the calculation should be carried out iteratively, using successive estimates of $W$ as the choice for $V$ in the following iteration. In the case of Boltzmann machine training, based on a sample of $N$ realizations, we have $D_c = \{(x_1^{(k)}, \ldots, x_{n_V}^{(k)}, x_{n_V+1}^{(k)}, \ldots, x_{n_V+n_H}^{(k)}): k = 1, \ldots, N\}$, with $n_V$ visible units and $n_H$ hidden units, and

$$h(D, D_m \mid W) = \exp\Big(\sum_k \sum_{i<j} w_{ij}\, x_i^{(k)} x_j^{(k)}\Big).$$
Thus,

$$\frac{h(D, D_m \mid W)}{h(D, D_m \mid V)} = \exp\Big\{\sum_k \sum_{i<j} (w_{ij} - v_{ij})\, x_i^{(k)} x_j^{(k)}\Big\}.$$

As a result, the Gelfand and Carlin log-likelihood ratio for training Boltzmann machines is

$$l(W) = \log\left[\mathbb{E}_V\left\{e^{\sum_k \sum_{i<j} (w_{ij} - v_{ij}) x_i^{(k)} x_j^{(k)}} \,\Big|\, D\right\}\right] - \log\left[\mathbb{E}_V\left\{e^{\sum_k \sum_{i<j} (w_{ij} - v_{ij}) x_i^{(k)} x_j^{(k)}}\right\}\right].$$

In principle, this is now maximized numerically, with the expectations replaced by sample averages.
I. DISCUSSION

In both of the cases with and without hidden units, we have more than one possible algorithm for training a Boltzmann machine. For instance, if there is no hidden unit, we may use the gradient-descent algorithm (5) or the IPFP algorithm described in Section V.F. If there are hidden units, we may apply the gradient-descent algorithm (7) or the alternating minimization algorithm (11), (12). To implement the gradient-descent algorithm, we must choose the step length, $\eta$, and a time-consuming aspect of the method is the need to carry out both simulation and iteration steps. With the IPFP/alternating minimization approach, also iterative, no simulation is necessary, apart from the possible need to estimate the $\{r_{ij}\}$. Byrne [24] reports a limited amount of numerical experimentation using the alternating minimization algorithm, but as yet no meaningful comparison has been carried out with the gradient-descent approach.

We now return to the EM algorithm. As Byrne [24] points out, the alternating minimization algorithm is an EM algorithm, with (11) representing the E-step and (12) the M-step. Indeed, this represents a particular version of a more general formulation of EM algorithms as alternating minimization algorithms based on the divergence $I(r; p)$. The objective is to optimize $p$ within some parametric class, $\mathcal{B}$, and $r$ is constrained so as to be compatible with the observed data, i.e., $r \in \mathcal{V}$, say. Then the EM algorithm for updating $p^t$ to $p^{t+1}$ is

E-step: evaluate $I(r^t; p^t) = \min_{r \in \mathcal{V}} I(r; p^t)$.
M-step: $p^{t+1} = \operatorname*{argmin}_{p \in \mathcal{B}} I(r^t; p)$.
Amari [34] prefers to refer to these two geometrically motivated steps in general as the e-step and m-step components of what he calls the em algorithm. He shows that, in many particular problems, including that of Boltzmann machines, the EM and em algorithms are identical, but that examples do exist for which the two algorithms are different. Another discussion of when the EM algorithm can be represented in "em" terms appears in Csiszar and Tusnady [35], and the relationship is discussed by Neal and Hinton [36]. The EM algorithm also appears at a different level. As Anderson and Titterington [37] point out, each stage of the IPFP algorithm can be interpreted as an EM iteration, an observation that can be used to help establish the convergence of the IPFP when used to solve (12).
VI. AN EXAMPLE WITH NO HIDDEN UNIT

In this and the following section, we provide numerical illustrations of the training algorithms described in Section V. The examples in this paper are of extremely small scale, so that the properties of the various estimation procedures can be assessed in detail.

First we consider Boltzmann machines with no hidden unit, so that the training problem is simplified. In fact, for this case, the surface in the parameter space defined by the Kullback-Leibler metric is concave when viewed from above [14], and so the log-likelihood surface, which is, up to an additive constant, the negative of the Kullback-Leibler distance between the empirical distribution from the data and the theoretical distribution, should be unimodal. Thus training procedures such as the simple gradient-descent rule defined by Ackley et al. [14] will not become trapped in local optima, and the results should be independent of the starting values.

The standard example we shall use in this section is a completely connected, three-node Boltzmann machine with one bias node, called node 0. This is shown in Fig. 1, with the values of the connection weights attached. A standard test data set ($TD_1$) was obtained by annealing the network to equilibrium and then running for 10,000 cycles, at each stage flipping the state of a random node and collecting the state vectors at the end of each cycle. Further versions of the data set, denoted by $TD_k$ for various values of $k$, were generated by running the Boltzmann machine for a larger number ($10{,}000k$) of iterations, but collecting a state vector only once every $k$ iterations, so that successive state vectors should manifest lower serial correlation. Of course, with a network of this simplicity, it would be quite straightforward to calculate the corresponding set of multinomial probabilities and to simulate realizations directly, but the Boltzmann machine dynamics were implemented instead, to investigate informally the effects of serial correlation.
Figure 1 Architecture and weights for a simple Boltzmann machine with three nodes and bias node.
A. EXACT LIKELIHOOD ESTIMATION

As a result of the deliberately small scale of the example, it is possible to calculate and optimize the log-likelihood explicitly; exact computation of the partition function is feasible. As usual, $w_{ij}$ denotes the connection weight between nodes $i$ and $j$, and $x_i^l$ denotes the particular binary (0/1) state of node $i$ when the Boltzmann machine is represented by state vector $x^l$, where the eight possible state vectors are indexed in the order $\{0,0,0\}$, $\{0,0,1\}$, $\{0,1,0\}$, $\{1,0,0\}$, $\{1,1,0\}$, $\{1,0,1\}$, $\{0,1,1\}$, $\{1,1,1\}$. Then the log-likelihood for weights $W$ is

$$L(W) = \sum_{l=1}^{8} N_l \Big(\sum_{i<j} w_{ij}\, x_i^l x_j^l\Big) - N \log Z(W), \qquad (18)$$

where $N_l$ is the number of replications of the $l$th state in the complete set of $N$ observations. The partition function $Z(W)$ takes the form

$$Z(W) = \sum_{l=1}^{8} \exp\Big(\sum_{i<j} w_{ij}\, x_i^l x_j^l\Big).$$
Both of the terms in (18) can be calculated very quickly once $\{N_l\}$ and $\{w_{ij}\}$ have been specified, and $L(W)$ can be maximized numerically by a routine from a subroutine library; in our work we used the NAg Fortran Subroutine Library [38].
Table I. True weights, maximum likelihood estimates (MLEs), and approximate 95% confidence intervals for the network in Fig. 1, from 10,000 realizations.

Weight   True value   MLE      Confidence interval
w12      0.50         0.466    (0.311, 0.621)
w13      0.75         0.715    (0.572, 0.858)
w23      1.20         1.061    (0.884, 1.238)
w10      0.20         0.246    (0.067, 0.425)
w20      1.00         1.095    (0.913, 1.276)
w30      0.60         0.777    (0.594, 0.959)
Table I displays the estimates that resulted from the application of this calculation to the data set $TD_{50}$. The table shows the upper triangle of the matrix of maximum likelihood estimates of the weights, along with the true weights. According to standard asymptotic theory for maximum likelihood estimation, the covariance matrix of the estimators can be approximated by the negative of the inverse of the matrix of second derivatives of $L(W)$, evaluated at the estimates. Approximate 95% confidence intervals for the true values of the parameters can then be calculated, assuming asymptotic Gaussianity, again as a consequence of standard theory. In this example, the corresponding intervals are quite wide (see Table I), and all intervals do cover the true values. Table II contains the approximate, asymptotic correlation matrix of the estimators.
Table II. Approximate correlation matrix for the MLEs of the weights in the network in Fig. 1, from 10,000 realizations.

        w13     w23     w10     w20     w30
w12    -0.13   -0.13   -0.68   -0.50    0.18
w13            -0.07   -0.59    0.14   -0.47
w23                     0.15   -0.71   -0.77
w10                             0.28    0.17
w20                                     0.47
The correlation structure results from the highly interconnected nature of the Boltzmann machine: there are some large (in magnitude) off-diagonal terms, indicating high correlation between the estimator for the weight $w_{kl}$ connecting two nonbias nodes and the two corresponding bias connections, $w_{k0}$ and $w_{l0}$.
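For a network this small, the exact computation of (18) takes only a few lines. The following sketch is our own, with illustrative names; note that its enumeration order differs from the chapter's indexing of the eight states, so the counts must follow the same order as the enumeration.

```python
import numpy as np
from itertools import product

def exact_loglik(W, counts):
    """Exact L(W) of Eq. (18) for the three-node-plus-bias network.

    counts[l] is N_l for the l-th enumerated state; W is the 4x4
    symmetric weight matrix (row/column 0 is the bias node)."""
    states = np.array([(1,) + s for s in product((0, 1), repeat=3)], float)
    e = 0.5 * np.einsum('li,ij,lj->l', states, W, states)  # sum_{i<j} w_ij x_i x_j
    logZ = np.log(np.exp(e).sum())
    return counts @ e - counts.sum() * logZ
```

Passing the negative of this function to a general-purpose optimizer (e.g., `scipy.optimize.minimize`) reproduces the kind of direct maximization described above, and the Hessian at the optimum gives the asymptotic covariance matrix used for the confidence intervals in Table I.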
B. THE METHOD OF GEYER AND THOMPSON

Next we report an application of the method of Geyer and Thompson [22], described in Section V.D. Of course, for this particular small-scale example, it is unnecessary to use this method, but we include it here for comparison. Recall that we replace maximization of $L(W)$ by maximization of $L_1(W)$, where

$$L_1(W) = \sum_{l} N_l \Big(\sum_{i<j} w_{ij}\, x_i^l x_j^l\Big) - N \log d_s(W),$$

with $d_s(W)$ evaluated as in Section V.D, based on a sample of $s$ realizations from the Boltzmann machine with parameters $V$. Thus we must choose a $V$, generate the $s$ realizations, and then optimize $L_1(W)$ numerically. Ideally, $V$ should be chosen to be close to the true $W$, but Geyer and Thompson [22] suggest an iterative procedure in which the successive iterates for $W$ should be used as $V$ for the following iteration. They remark that only a few iterative steps should be required to provide acceptable accuracy; it should not be necessary to continue the iteration until $V$ converges too precisely to the maximum likelihood estimates.

We carried out a number of simulations to study the training of the Boltzmann machine of Fig. 1 by the method of Geyer and Thompson [22]. In each simulation $V$ was chosen to be the true weight matrix, so that only one of the above-mentioned iterations was carried out. The value chosen for $s$ was 10,000. Repetitions of some of the experiments suggested that the results were independent of the starting values for $W$ in the maximization calculation, so the true values were chosen for this purpose as well.

The results for the data sets $TD_1$, $TD_{10}$, and $TD_{20}$ were poor, suggesting that there was too much serial correlation in the sequence of samples, but those based on data set $TD_{50}$ were much closer to the maximum likelihood estimates obtained in Table I (see Table III). (For each $k$, the initial run of the Boltzmann machine with weights $V$ was carried out according to the same sampling plan as for the relevant $TD_k$ sample.) No further improvement in accuracy resulted for data with sampling gaps greater than 50, and in fact there seemed to be some degradation in performance. However, it must be said that there is considerable sampling variability inherent in the collection of the $TD_k$ training data, and this may well account for the phenomenon. It is not clear if the improved accuracy with $TD_{50}$ results from reduced serial correlation or from better attainment of equilibrium. It may also be
Table III. True weights and approximate MLEs obtained by the method of Geyer and Thompson [22], for the network in Fig. 1, from 10,000 realizations.

Weight   True value   Estimated value
w12      0.50         0.532
w13      0.75         0.672
w23      1.20         1.199
w10      0.20         0.270
w20      1.00         0.892
w30      0.60         0.668
appropriate to average the approximate maximum likelihood estimates obtained by this approach over runs that use different streams of pseudo-random numbers to obtain the $V$ samples, because these will also be affected by the sampling variability observed in the $TD_k$ sets.
C. ALTERNATING MINIMIZATION

In the case of no hidden unit, the alternating minimization algorithm of Byrne [24] reduces to the estimation of the weights by the IPFP. In fact, because of the existence of the bias term, we are fitting a log-linear model, for three binary variables, consisting of all main effects and two-way interactions, by maximum likelihood, with the IPFP providing an iterative scheme, with guaranteed convergence, for the calculation. For extensive discussion of log-linear models see, for instance, Bishop et al. [13] or Fienberg [39].

In the current example, the IPFP makes a number of sweeps through the set of weights, adjusting the weights at each sweep to rescale each marginal probability in turn, given the alterations made up to the previous step; that is, for each combination of $i$ and $j$, the marginal probability that nodes $i$ and $j$ are both 1 is the same as that induced by the target distribution, namely, the data; see Section V.F. The iteration continues until some convergence criterion is satisfied. When this procedure was applied to the $TD_{50}$ data, with weights initially set to zero, and with convergence deemed to have occurred when all relevant marginal probabilities were within 10~^ of the targets, convergence occurred after 127 iterations. The estimates agreed with those in Table I to three decimal places throughout, and to four in the cases of all but one parameter. (If the convergence tolerance was
reduced further, then the agreement was, naturally, even better.) The log-likelihood was observed to increase monotonically after each sweep through the weights, a behavior guaranteed by the theory of the IPFP.

In larger-scale examples (involving more nodes), calculation of the marginal probabilities becomes more demanding, because of the intractability of the partition function. Thus it may only be possible to estimate the required probabilities by simulation. This changes the nature of the algorithm, however, providing a noisy version of the IPFP in which the guaranteed monotonicity of the IPFP is lost. The algorithm tends to oscillate and, at best, converges extremely slowly. For example, with each marginal probability estimated from a sample of 10,000 observations at a spacing of 50, generated after an initial annealing process, the IPFP failed to converge after 500 iterations, at which stage the estimates of the set of weights, ordered as in Tables I and III, were

(0.5015, 0.7464, 1.0044, 0.2114, 1.0627, 0.8600).
Although of the correct order of magnitude, these estimates are quite poor. Not surprisingly, it is advantageous to reduce the sampling variability of the estimates of the marginal probabilities. When the above simulation was repeated with sample size 100,000, the IPFP converged, after 98 iterations, to give weight estimates

(0.4835, 0.7500, 1.0264, 0.1958, 1.1041, 0.7936),
and with a less noisy trajectory of log-likelihoods. However, this does increase the computational burden, and a more economical alternative would be desirable. In addition, questions about the convergence theory may have to be addressed.
D. CONCLUSIONS FROM THE EXAMPLE

The IPFP will work well when weight updates are based on explicit calculation of the relevant marginal probabilities. However, both the IPFP and the Geyer and Thompson [22] methods become much less accurate when Monte Carlo simulation is used instead. This suggests that the following important issues remain to be dealt with:

• ensuring the convergence of the Markov chains to their equilibrium state before relative frequencies of interest are accumulated;
• finding other ways of reducing sampling variability and thereby improving estimation efficiency;
• investigating whether averaging across several different estimation runs might reduce the severity of this problem.

If this can be done, these techniques will become promising ways of obtaining maximum likelihood estimates for Boltzmann machines. This applies particularly to the method of Geyer and Thompson [22], which requires less simulation and may therefore be appreciably faster in execution.
VII. EXAMPLES WITH HIDDEN UNITS

As described in Section V.B, in the standard method of training a Boltzmann machine, due to Ackley et al. [14], the estimates of the weights are iteratively updated by an amount depending on a learning-rate parameter, $\eta$, and the difference between the currently estimated probability that the two nodes thus connected are both active and its target value, represented by the corresponding relative frequency from the data. We discuss this further in Section VII.B. In Sections VII.C and VII.D, we consider direct maximization of the log-likelihood from the observed data and application of the EM algorithm. We begin in Section VII.A by describing the two main examples used to illustrate the methodology.
A. DESCRIPTION OF THE EXAMPLES

We make use of two data sets in this comparison. Each consists of 10,000 realizations of the state vector of a free-running Boltzmann machine consisting of four visible units, one hidden unit (node 5), and one bias node (node 0). The sets of weights corresponding to the two models, called BM1 and BM2, are given in Table IV. These networks have 15 weights. The training data consist of frequencies of all possible states of the visible units, of which there are $2^4 = 16$.
B. THE GRADIENT-DESCENT METHOD

As the original Boltzmann machine training algorithm, this scheme is a benchmark for other methods. In Section V.B it was noted that this algorithm is essentially a gradient-descent algorithm for minimizing the negative of the observed-data log-likelihood. Because the dimensionality of BM1 and BM2 is so small, it is feasible to calculate summations over all possible points in the state space of these Boltzmann machines. In the comparisons that follow, therefore, all probabilities are calculated exactly, using the appropriate Boltzmann-Gibbs distributions.
Table IV. True weights corresponding to the Boltzmann machine models BM1 and BM2. Node 5 is a hidden node, and node 0 is a bias node.

Weight   BM1    BM2
w12      0.50   0.50
w13      0.75   0.50
w14      0.20   0.50
w23      1.20   0.50
w24      0.10   0.50
w34      0.60   0.50
w15      0.30   0.50
w25      0.50   0.50
w35      1.80   0.50
w45      1.00   0.50
w10      0.10   0.25
w20      0.95   0.25
w30      0.80   0.25
w40      1.10   0.25
w50      0.60   0.25
A value also must be selected for the learning rate, $\eta$. After some experimentation with $\eta \in \{0.2, 1.0, 2.0\}$, we decided to use $\eta = 0.2$, because, although the steps taken by the algorithm were quite small, this choice did seem to lead eventually to the largest value of the observed-data log-likelihood, $L(W)$. The algorithm was deemed to have converged if, in a given iteration, the change in each weight was less than a particular tolerance level; an "iteration" corresponds to a complete pass through the set of weights. An alternative way of measuring convergence would be to look at changes in the value of $L(W)$.

The results of gradient-descent training for BM1 and BM2, using a convergence tolerance of 10~^, are shown in Table V, along with the attained value of $L(W)$ and the number of iterations required for convergence. Convergence is clearly slow, especially in the case of BM1. This may mean that we are still some way from the true optimum. Rather than decrease the convergence tolerance, we applied Steffensen's iteration [40, p. 152], which is a device, based on Aitken's $\delta^2$ method, for accelerating the convergence of iterative algorithms. The results of this are also shown in Table V. They show a noticeable improvement for the BM1 data and a marginal change for BM2.

The objective of Boltzmann machine training is to estimate the joint probabilities of the states of the visible nodes.
Table V. Estimates of weights obtained by gradient descent for data from BM1 and BM2, without and with accelerated convergence.

                      BM1                       BM2
Weight       Standard    Accelerated    Standard    Accelerated
w12           0.458       0.471          0.594       0.596
w13           0.824       0.840          0.359       0.363
w14          -0.083      -0.077          0.435       0.434
w23           1.401       1.374          0.505       0.507
w24           0.664       0.571          0.428       0.428
w34           0.284       0.146          0.575       0.574
w15           0.449       0.405          0.538       0.495
w25           0.664       0.719          0.485       0.476
w35           1.737       1.777          0.538       0.505
w45           1.382       1.417          0.497       0.501
w10           0.186       0.194          0.314       0.349
w20           1.081       1.312          0.230       0.238
w30           0.719       0.827          0.307       0.334
w40           1.544       1.731          0.243       0.244
w50           0.690       0.647          0.263       0.276

Iterations    21461       1475           1510        1856
L(W)      -6453.201   -6453.036     -14955.005  -14955.013
In Table VI we display, for each of the 16 possible state vectors, the observed frequencies and, rounded to the nearest integers, the estimated expected frequencies computed at the end of the various training algorithms. In these terms, accelerating the convergence of the algorithm did not have much effect.

The true weights were also used to initialize the algorithms. In practice this is not feasible, and we experimented with a variety of sets of initial estimates. It turned out that the values attained for $L(W)$ and for the estimated expected frequencies were not sensitive to the choice of initial values, but that the final values for some of the weights were. One example, based on the BM1 data, is shown in Table VII, along with the corresponding set of estimated expected frequencies. There is evidence of nonidentifiability of the weights connecting visible unit 4, the hidden unit, and the bias unit. However, the algorithm still serves its purpose of creating a set of estimates that optimizes $L(W)$. The identifiability of multidimensional categorical-data probability models is a complicated issue; see, for instance, Section 3.2.4 of Hagenaars [41] and Goodman [42].
Table VI. Observed and estimated expected frequencies, based on the estimates from Table V, in the various states, for BM1 and BM2.

                          BM1                              BM2
State vector   Obs.  Exp.(stan.)  Exp.(accel.)   Obs.  Exp.(stan.)  Exp.(accel.)
(0,0,0,0)         0       0            0           18      24           23
(0,0,0,1)         2       2            2           40      41           41
(0,0,1,0)         2       1            1           48      45           45
(0,0,1,1)        29      30           30          150     147          147
(0,1,0,0)         1       1            1           50      40           40
(0,1,0,1)        22      22           22          108     113          113
(0,1,1,0)        27      28           29          126     134          134
(0,1,1,1)      1321    1320         1319          704     701          702
(1,0,0,0)         0       0            0           51      45           45
(1,0,0,1)         4       3            3          129     129          129
(1,0,1,0)         4       5            5          128     131          132
(1,0,1,1)       117     116          116          692     694          695
(1,1,0,0)         1       2            2          137     148          148
(1,1,0,1)        60      59           59          676     671          671
(1,1,1,0)       192     189          190          751     742          743
(1,1,1,1)      8218    8220         8220         6192    6195         6195
C. DIRECT MAXIMIZATION

We applied the quasi-Newton algorithm E04JAF of the NAg Library [38] to the direct maximization of $L(W)$; the results are shown in Table VIII, with the estimated expected frequencies in Table IX. For BM1, the results are comparable to those obtained by gradient descent, apart from the troublesome weights identified above, and the expected frequencies behave well, apart from those associated with the state vectors $x = (0, 1, 1, 1)$ and $x = (1, 1, 1, 1)$. For BM2, however, some of the weights seem to have "blown up", although the estimated expected frequencies are acceptable and the attained value of $L(W)$ is the highest achieved so far.
D. APPLICATION OF THE EM ALGORITHM

Recall that each vector $x_V^{(k)}$ in the training set indicates the responses of the $n_V$ visible nodes, and we regard $x^{(k)}$ as consisting of $n = n_V + n_H$ components, with $x_V^{(k)}$ filling the first $n_V$ components and the remaining components filled by the (unobserved) responses $x_H^{(k)}$ from the $n_H$ hidden units.
Table VII. Results for the BM1 data, based on a different set of initial estimates for the weights. Achieved log-likelihood is -6453.075.

Weight   Estimate      State vector    Obs.    Exp.
w12       0.458        (0,0,0,0)          0       0
w13       0.847        (0,0,0,1)          2       2
w14      -0.083        (0,0,1,0)          2       1
w23       1.495        (0,0,1,1)         29      29
w24       0.592        (0,1,0,0)          1       1
w34       0.187        (0,1,0,1)         22      22
w15       0.396        (0,1,1,0)         27      28
w25       0.634        (0,1,1,1)       1321    1320
w35       1.864        (1,0,0,0)          0       0
w45       3.334        (1,0,0,1)          4       4
w10       0.212        (1,0,1,0)          4       5
w20       1.082        (1,0,1,1)        117     116
w30       0.545        (1,1,0,0)          1       2
w40      -0.243        (1,1,0,1)         60      59
w50       0.991        (1,1,1,0)        192     190
                       (1,1,1,1)       8218    8220
The complete-data log-likelihood, $L_c(W)$, is

$$L_c(W) = \tfrac{1}{2} \sum_{i=0}^{n} \sum_{j=0}^{n} w_{ij} \Big(\sum_{k=1}^{N} x_i^{(k)} x_j^{(k)}\Big) - N \log Z_c(W),$$

where $Z_c(W)$ is the appropriate partition function. If at stage $s$ in the corresponding EM algorithm we have current estimates $W^{(s)}$ of the weights, then the next set of values, $W^{(s+1)}$, is obtained by maximizing, with respect to $W$,

$$\tfrac{1}{2} \sum_{i=0}^{n} \sum_{j=0}^{n} w_{ij} \sum_{k=1}^{N} \mathbb{E}\big\{x_i^{(k)} x_j^{(k)} \,\big|\, x_V^{(k)}, W^{(s)}\big\} - N \log Z_c(W), \qquad (19)$$
in which the conditioning on the observed data only through $x_V^{(k)}$ is a consequence of the independence of the different observations. As suggested by Qian and Titterington, in the discussion of Geyer and Thompson [22], $Z_c(W)$ can be approximated by the type of Monte Carlo method described in Section V.D, and in principle, the conditional expectations can also be estimated, using Monte Carlo simulation,
Table VIII. Estimates of weights for data from BM1 and BM2, obtained by direct maximization.

Weight        BM1          BM2
w12           0.461        0.676
w13           0.842        0.247
w14          -0.078        0.408
w23           1.454        0.591
w24           0.548        0.488
w34          -0.047        0.552
w15           0.403       14.428
w25           0.679      -11.946
w35           1.843        9.946
w45           1.597        1.780
w10           0.201      -13.583
w20           1.114       12.393
w30           0.830       -9.115
w40           1.742       -1.111
w50           7.783       12.411

L(W)      -6452.916   -14952.698
by a weighted average of the sample averages of a series of network runs with weights $W^{(s)}$. At the lower level of averaging, we average, as required, several samples from the equilibrium distribution of the network, with the visible units clamped to one of the distinct vectors manifested in the data, thereby estimating the corresponding expectation of $x_i x_j$. At the higher level, this process is repeated for each of the $2^{n_V}$ different state vectors corresponding to the visible nodes, and the results are combined after multiplication by the corresponding relative frequencies in the data. The maximization is then carried out numerically.

In dealing with the small-scale illustrative examples in this paper, the expectations and partition function can be evaluated explicitly, without the need for Monte Carlo simulation, and this was done in the numerical results reported below. Convergence was defined, as before, in terms of a tolerance on the differences between successive estimates of the weights. Maximization of (19) was carried out with a Newton-Raphson procedure, which proved to be more reliable than the quasi-Newton routines we tried from the NAg library.

Table X contains results for the BM1 data after convergence on the second EM stage, initializing at the true weights and using a convergence tolerance of 0.01.
Table IX. Observed and estimated expected frequencies, based on the estimates from Table VIII, for the BM1 and BM2 data.

                     BM1                BM2
State vector    Obs.     Exp.      Obs.     Exp.
(0,0,0,0)          0        0        18       20
(0,0,0,1)          2        2        40       38
(0,0,1,0)          2        1        48       45
(0,0,1,1)         29       29       150      153
(0,1,0,0)          1        0        50       50
(0,1,0,1)         22       22       108      108
(0,1,1,0)         27       27       126      127
(0,1,1,1)       1321     1273       704      703
(1,0,0,0)          0        0        51       46
(1,0,0,1)          4        4       129      134
(1,0,1,0)          4        5       128      134
(1,0,1,1)        117      117       692      686
(1,1,0,0)          1        1       137      141
(1,1,0,1)         60       58       676      672
(1,1,1,0)        192      192       751      746
(1,1,1,1)       8218     8269      6192     6197
Bearing in mind the true values of the weights, it seems that EM has left the connections to the hidden unit relatively unchanged, but has altered the weights among the visible units more radically. This is more readily apparent in the third set of estimates in Table X, which corresponds to starting the algorithm at a set of weights, each of which was randomly chosen from the interval $(-1, 1)$; these initial values constitute the second set of weights in the table. Again, it appears that the visible-unit weights have been adjusted to fit the target distribution as well as possible, the hidden-unit connections have been largely unchanged, and the bias weights have been adjusted to ensure that we have a valid set of probabilities. Note that the attained values of the log-likelihood are very similar. When more extreme starting weights were chosen, the algorithm sometimes failed, presumably because the Newton-Raphson routine diverged. Similar behavior was observed when the BM2 data were analyzed.

When the convergence tolerance was somewhat more demanding, convergence was much slower. With a tolerance of 0.001, the algorithm took 941 iterations to reach the fourth set of weights in Table X, starting from the true weights, once more. After so many iterations, the weights associated with the hidden units have moved significantly away from their initial values, and at least three of the weights are very large. This may be associated with the presence of zero counts in the data. The BM2 data do not include zero counts.
Table X. Estimates of weights for BM1 data, obtained with the EM algorithm.

                            Estimates
Tolerance:          0.01        0.01        0.001
Initialized:        True        Random      True
Weight     True
w12        0.50     0.457       0.589       0.483
w13        0.75     0.816       0.831       0.606
w14        0.20    -0.087      -0.090      -0.151
w23        1.20     1.406       1.468       1.855
w24        0.10     0.530       0.565       0.658
w34        0.60    -0.170      -0.048      -1.304
w15        0.30     0.304      -0.666       0.756
w25        0.50     0.488       0.794     -25.611
w35        1.80     1.800      -0.715       3.942
w45        1.00     1.002      -0.087       2.457
w10        0.10     0.344       0.932       0.143
w20        0.95     1.384       1.259      26.868
w30        0.80     1.141       2.713       0.118
w40        1.10     2.512       3.388       2.173
w50        0.60     0.606       0.320      23.463

Iterations          2           2           941
L(W)           -6452.913   -6452.916   -6452.655
When the EM algorithm was applied to the BM2 data, initializing at the true weights and with a convergence tolerance of 0.001, results were obtained as shown in Table XI. The weights have not blown up in the same way, although the estimates obtained for the weights are quite different from the values obtained by gradient descent. This, again, is a manifestation of nonidentifiability, a problem that does not arise if one exploits the structure of the model for BM2 by reducing the number of parameters that must be estimated. For instance, if we require the bias connections to have a common (unspecified) value, then the number of parameters to be estimated is reduced to 11. The second set of results in Table XI was obtained after five EM iterations, starting from the true weights and using a convergence tolerance of 0.01. These results were similar to those obtained by the gradient-descent algorithm.

When the parameters are unconstrained, the networks represent saturated models, and this seemed to represent too many parameters for the EM algorithm to estimate reliably. EM may therefore be more suitable for less fully parameterized networks, or for more sparsely connected networks.
Table XI. Estimates of weights for BM2 data, obtained with the EM algorithm, initialized at the true weights. In the first set of estimates the bias weights $w_{i0}$ are unconstrained ("unequal"); in the second they are constrained to a common value ("equal").

                          Estimates
Weight    True    Unequal $w_{i0}$    Equal $w_{i0}$
w12       0.50     0.807               0.598
w13       0.50     0.010               0.363
w14       0.50     0.333               0.436
w23       0.50     0.691               0.508
w24       0.50     0.535               0.424
w34       0.50     0.498               0.577
w15       0.50     1.452               0.554
w25       0.50    -1.112               0.408
w35       0.50     1.247               0.546
w45       0.50     0.655               0.439
w10       0.25    -0.310               0.297
w20       0.25     1.234               0.297
w30       0.25    -0.155               0.297
w40       0.25     0.165               0.297
w50       0.25     0.376               0.297

Iterations           860                 5
L(W)            -14,954.167         -14,955.032
VIII. VARIATIONS ON THE BASIC BOLTZMANN MACHINE

In this section we collate remarks about extensions of, and variations on, the binary Boltzmann machine and discuss possible future developments, particularly from a statistical point of view.
A. THE BOLTZMANN PERCEPTRON

The formulation of the Boltzmann-Gibbs distribution in (2) allows, in principle, for connections among all nodes in the Boltzmann machine, be they inputs, outputs, or hidden nodes. One way both to simplify the model and to structure it around particular activities is to eliminate some of the connections; in other words, to insist that some of the weights be zero. In the Boltzmann perceptron,
the only connections permitted are between input nodes and hidden nodes, and between hidden nodes and output nodes; no intralayer connection is permitted, nor any direct link between inputs and outputs, so that the network has the feedforward character of the standard multilayer perceptron. In addition, the distribution of interest during training is p(x_O | x_I): the input units are clamped, and it is therefore unnecessary to insist that they be binary. The concept of the Boltzmann perceptron was developed by Hopfield [43], and Yair and Gersho [44, 45] showed that the network was sparse enough and suitably structured for the equilibrium probabilities to be computable deterministically, without the need for Monte Carlo approximations. Kappen [46] further extends the development, introducing lateral inhibition among the output units that facilitates a practicable deterministic learning rule. Kappen [46] also shows that Boltzmann perceptrons are universal classifiers, and designs a version, incorporating lateral inhibition among the hidden units, for use as an estimator of the joint probability distribution, p(x_I, x_O). Finally, Kappen and Nijman [47] study a version of this network, dealing with continuous inputs by incorporating the features of radial basis function networks, for deriving classification tools that can deal with missing data.
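Because the only couplings run between inputs and hidden units and between hidden units and outputs, the sum over all 2^H hidden configurations factorizes into a product of per-unit terms, so the class probabilities can be evaluated exactly. The following sketch illustrates this computation under an assumed parameterization (the names A, B, b_h, b_o and the one-of-K output coding are illustrative choices, not the notation of [44-46]):

```python
import numpy as np

def boltzmann_perceptron_probs(x, A, B, b_h, b_o):
    """Exact class probabilities for a Boltzmann perceptron (a sketch).

    x   : (N,) clamped input vector
    A   : (H, N) input-to-hidden weights
    B   : (K, H) hidden-to-output weights (one-of-K output coding)
    b_h : (H,) hidden biases;  b_o : (K,) output biases
    """
    K = B.shape[0]
    log_p = np.empty(K)
    for k in range(K):
        # Net input to each hidden unit when output class k is active.
        a = A @ x + b_h + B[k]
        # Summing over the 2^H binary hidden states factorizes into a
        # product of (1 + e^{a_j}) factors, i.e. a sum of softplus terms.
        log_p[k] = b_o[k] + np.sum(np.logaddexp(0.0, a))
    log_p -= log_p.max()                  # stabilize the normalization
    p = np.exp(log_p)
    return p / p.sum()
```

This deterministic evaluation is what removes the need for Monte Carlo sampling in this restricted architecture.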
B. POLYTOMOUS BOLTZMANN MACHINES
In statistical terms, the stationary distributions associated with Boltzmann machines provide a class of pairwise interaction models for multivariate binary data. A natural direction for generalization is to allow each x_i to be polytomous (that is, multicategory) or ordered polytomous. A real-life example of the latter might be the degree of pain felt by sufferers from backache, described as "nonexistent," "mild," "moderate," "bad," or "severe." Anderson and Titterington [37] work through the case of unordered polytomous variables, concentrating mainly on showing that the theory and algorithmic development described by Byrne [24] carry over to this case. They note that the formulation (2) corresponds to a linear logistic regression model for the binary response x_i, regressed on the other x_j, and carry over the corresponding approach in the statistical literature for dealing with a polytomous response; see, for instance, Anderson [48] and Cox and Snell [49, Chap. 5]. The key idea is to represent a d-category variable by d − 1 binary variables, so that, ultimately, all calculations are made in terms of binary variables. Suppose x_i ∈ {q_1, …, q_d} is the output from the ith neuron, and that z_i = (z_{i1}, …, z_{i(d−1)}) is a vector of d − 1 binary variables. We create a set of d such vectors, one of which is to represent each of the {q_l}. In particular, let e_l be a (d − 1)-vector with 1 in position l and zero elsewhere, and let e_d be the vector that consists of d − 1 zeros. Then we represent x_i by z_i according to the code

$$x_i = q_l \iff z_i = e_l, \qquad i = 1, \ldots, n; \ l = 1, \ldots, d.$$
The state space for each z_i is S = {e_1, …, e_d}, and the state space for z = (z_1, …, z_n) is Z = S^n. (This assumes that all of the x_i can take the same number, d, of categories. This can be generalized at the expense of more tortuous notation.) If z_{iu} denotes the uth component of z_i, then

$$x_i = q_l \iff z_{iu} = \delta_{lu},$$

where δ_{lu} is the Kronecker delta, for all i and u, and for l = 1, …, d − 1. For l = d, of course, z_{iu} = 0 for all u. With this formulation, the polytomous Boltzmann machine consists of a set of multicategory neurons linked by weighted connections, either completely interconnected or with the pattern of connections determined by the particular application, just as for the standard Boltzmann machine. However, each neuron is a composite "processing unit," consisting of d − 1 binary subneurons, a maximum of one of which is active at any one time, to represent the vectors {e_1, …, e_d}. There is no connection among the binary subneurons within one composite node, but the connection between two composite neurons consists of some or all of the possible connections between binary subneurons in different nodes; for an illustration see Figure 1 of Anderson and Titterington [37]. To describe the probabilistic mechanism for the Boltzmann machine, which in turn defines the activation function associated with a typical multicategory neuron, we specify a set of weights W = {w_{ij}^{lu} : i, j = 0, …, n; l, u = 1, …, d − 1}. We assume that w_{ii}^{lu} = 0 for all i, l, and u, define the energy function

$$E(z) = -\sum_{i}\sum_{j}\sum_{l}\sum_{u} w_{ij}^{lu}\, z_{il}\, z_{ju},$$

and let

$$p(z) = c \exp\{-E(z)\}, \qquad z \in Z,$$
denote the corresponding normalized probabilities on Z. For a particular i, denote by z_{\i} the set {z_j : j ≠ i}. Then, for l = 1, …, d − 1 we obtain

$$p(z_i = e_l \mid z_{\setminus i}) = \frac{\exp\bigl\{\sum_j \sum_u w_{ij}^{lu} z_{ju}\bigr\}}{1 + \sum_{r=1}^{d-1} \exp\bigl\{\sum_j \sum_u w_{ij}^{ru} z_{ju}\bigr\}}. \tag{20}$$

Note that this notation caters for the inclusion of a bias term, by setting z_0 = e_1, perpetually. Note also that, because w_{ii}^{lu} = 0, the summation over j on the right-hand side of (20) can tidily run from 0 to n. From (20), therefore, the updating rule for the ith neuron is z_i = e_l, with probability

$$\frac{\exp\bigl\{\sum_{j=0}^{n} \sum_{u=1}^{d-1} w_{ij}^{lu} z_{ju}\bigr\}}{1 + \sum_{r=1}^{d-1} \exp\bigl\{\sum_{j=0}^{n} \sum_{u=1}^{d-1} w_{ij}^{ru} z_{ju}\bigr\}}, \qquad l = 1, \ldots, d - 1, \tag{21}$$

and z_i = e_d, with probability

$$\frac{1}{1 + \sum_{r=1}^{d-1} \exp\bigl\{\sum_{j=0}^{n} \sum_{u=1}^{d-1} w_{ij}^{ru} z_{ju}\bigr\}}. \tag{22}$$
In terms of the original expression of the neuron states in terms of x_i, we have that x_i = q_l, with probability given by (21), for l = 1, …, d − 1, and x_i = q_d, with probability given by (22). The updating rules (21) and (22) constitute the appropriate Gibbs sampler for the Boltzmann machine. With this framework, Anderson and Titterington [37] spell out the corresponding version of the theory and training algorithms used by Byrne [24] in the context of the binary Boltzmann machine. They also clarify the relationship between the EM algorithm and the IPFP.
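A single step of this Gibbs sampler is easy to state concretely. The sketch below updates one composite neuron according to (21) and (22); the array layout, with z holding one row of d − 1 one-hot components per neuron and row 0 serving as the clamped bias node, is an assumed encoding rather than the authors' notation:

```python
import numpy as np

def gibbs_update_neuron(i, z, w, rng):
    """One Gibbs update of composite neuron i, following (21) and (22).

    z   : (n+1, d-1) array of subneuron states; row 0 is the bias node,
          clamped at e_1, and an all-zero row encodes the state e_d.
    w   : (n+1, n+1, d-1, d-1) weights w[i, j, l, u], with w[i, i] = 0.
    rng : a numpy Generator, e.g. np.random.default_rng().
    """
    # net[l] = sum_{j,u} w_ij^{lu} z_ju, the exponent in (21).
    net = np.einsum('jlu,ju->l', w[i], z)
    # Prepend a score of 0 for the all-zero state e_d, cf. (22).
    scores = np.concatenate(([0.0], net))
    p = np.exp(scores - np.logaddexp.reduce(scores))
    k = rng.choice(len(p), p=p)           # k = 0 selects e_d
    z[i, :] = 0.0
    if k > 0:
        z[i, k - 1] = 1.0                 # z_i = e_k
    return z
```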
C. NEAL'S APPROACH BASED ON BELIEF NETWORKS

The notoriously slow gradient-descent learning procedure for Boltzmann machines with hidden units was noted in Section VI.C: in the traditional approach, Monte Carlo simulation is required in both the so-called positive and negative phases of the iterative stage. In view of this, Neal [50] pointed out the use of belief networks [51] as creators of flexible probability distributions for multivariate binary data. The structure of the joint probabilities takes the simple form
$$p(x) = \prod_i p(x_i \mid X_i),$$
where X_i is some subset of {x_j : j < i}. The underlying units are implicitly ordered, and it is a trivial matter, once p(x) is specified, to generate a realization from the distribution. Furthermore, the problem of an intractable partition function has disappeared. The relationship of these models to Boltzmann machines is similar to that between causal Markov mesh random field models [52] and Markov random fields. Neal [50] introduces a particular class of such networks,
called sigmoid belief networks, characterized by the following assumption about p(x_i | X_i):

$$x_i = \begin{cases} 1 & \text{with probability } \Bigl(1 + \exp\bigl\{-\sum_j w_{ij} x_j\bigr\}\Bigr)^{-1}, \\ 0 & \text{otherwise.} \end{cases}$$
Note the similarity to, and minor but important difference from, (1). The terminology derives from the use of the logistic sigmoid in defining the probabilities. However, the computation of full conditional probabilities, such as p(x_i = 1 | x_{\i}), is typically not possible, and they usually must be estimated by Gibbs sampling. Neal [50] estimates parameters from a training set by gradient ascent of the log-likelihood function. He shows that the gradient involves terms corresponding only to the "positive phase" of Boltzmann machine learning (based on conditional expectations), so that the computational burden is much less. He also describes noisy-OR belief networks, in which, conditional on {x_j : j < i},

$$x_i = \begin{cases} 1 & \text{with probability } 1 - \exp\bigl\{-\sum_j w_{ij} x_j\bigr\}, \\ 0 & \text{otherwise.} \end{cases}$$
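Because the units are ordered and each conditional depends only on earlier units, a realization can be drawn in a single ancestral pass, which is the ease of simulation referred to above. The sketch below covers both unit types under an assumed strictly lower-triangular weight matrix; for the noisy-OR case the weights must be nonnegative for the probabilities to be valid:

```python
import numpy as np

def sample_belief_net(W, rng, noisy_or=False):
    """Draw one realization x from a belief network (a sketch).

    W : (n, n) weights with W[i, j] nonzero only for j < i, so that
        unit i depends only on units sampled before it.
    Sigmoid:  p(x_i = 1 | parents) = 1 / (1 + exp(-sum_j W[i, j] x_j))
    Noisy-OR: p(x_i = 1 | parents) = 1 - exp(-sum_j W[i, j] x_j)
    """
    n = W.shape[0]
    x = np.zeros(n)
    for i in range(n):                    # ancestral order: parents first
        s = W[i, :i] @ x[:i]
        p = 1.0 - np.exp(-s) if noisy_or else 1.0 / (1.0 + np.exp(-s))
        x[i] = float(rng.random() < p)
    return x
```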
In a numerical study of some small-scale problems, Neal [50] finds that his models are as flexible as those corresponding to Boltzmann machines, but that training is much faster, and he suggests using them in medical prognosis. Although, as commented above, the p(x) associated with a belief network is easy to calculate for a given x and a given set of weights W, this is not true if some of the components of x correspond to hidden units, with the result that the probability distribution of real interest is p(x_V) = Σ_{x_H} p(x). Unless the hidden variables "come after" all of the visible variables in the ordering defined by the x subscript, the summation cannot be done easily, and we are in a difficulty similar to that of being faced with an intractable partition function. An alternative to using Monte Carlo methods is to seek analytical approximations to p(x_V), for given x and W. Saul et al. [53] use mean-field approximations to derive a lower bound, in the case of sigmoid belief networks, adapting an approach developed for the so-called Helmholtz machine by Dayan et al. [54] and Hinton et al. [55]. Jaakkola and Jordan [56] extend the method to the case of noisy-OR networks and show how to obtain upper bounds for both types of belief network. In Jaakkola and Jordan [57], a way of obtaining upper and lower bounds is described for the Boltzmann machine itself, incorporating a process of pruning the corresponding network until a skeleton is left for which computation of the probability is tractable. In related work, Saul and Jordan [58, 59] show how some Boltzmann machines do turn out to be computationally tractable. These are called Boltzmann trees, and
are such that the output and hidden nodes can be arranged hierarchically in one or more isolated trees. They are examples of Boltzmann networks that are decimatable, in that they can be reduced to successively smaller networks, by deletion of nodes, such that the probability distribution associated with the remaining nodes is equal to the marginal distribution for those nodes in the original distribution. This is particularly important in the context of the gradient-descent learning rule, in which quantities such as E(x_i x_j) are required. Rüger et al. [60] show that, in decimatable machines, the network can be collapsed to a three-node network, involving node i, node j, and a bias node, for which E(x_i x_j) is exactly as in the original machine. For the small network, and with the binary outputs coded as −1/+1 rather than 0/1, there are weights W' such that

$$p(x_i, x_j) = \exp\bigl(x_i w'_{0i} + x_j w'_{0j} + x_i x_j w'_{ij}\bigr)/Z'(W'),$$

from which it is easy to show that

$$E(x_i x_j) = \bigl\{\exp(w'_{ij})\cosh(w'_{0i} + w'_{0j}) - \exp(-w'_{ij})\cosh(w'_{0i} - w'_{0j})\bigr\}/Z'(W'),$$

with

$$Z'(W') = \exp(w'_{ij})\cosh(w'_{0i} + w'_{0j}) + \exp(-w'_{ij})\cosh(w'_{0i} - w'_{0j}).$$

The W' are computable from the original W by a series of sets of linear equations, and there is a recursive computation that leads from Z'(W') back to the original partition function, Z(W). For further details of these tractable cases, see Rüger et al. [60].
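The closed-form expectation above is easy to verify numerically against direct enumeration of the four joint states of the collapsed three-node network; the sketch below does this for arbitrary illustrative weight values:

```python
import numpy as np
from itertools import product

def pair_expectation(w0i, w0j, wij):
    """E(x_i x_j) for the collapsed three-node machine, x in {-1, +1}."""
    num = np.exp(wij) * np.cosh(w0i + w0j) - np.exp(-wij) * np.cosh(w0i - w0j)
    den = np.exp(wij) * np.cosh(w0i + w0j) + np.exp(-wij) * np.cosh(w0i - w0j)
    return num / den

# Brute-force check by enumerating the four states of (x_i, x_j).
w0i, w0j, wij = 0.3, -0.7, 0.5
num = den = 0.0
for xi, xj in product((-1, 1), repeat=2):
    q = np.exp(xi * w0i + xj * w0j + xi * xj * wij)   # unnormalized p
    num += xi * xj * q
    den += q
assert np.isclose(pair_expectation(w0i, w0j, wij), num / den)
```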
IX. THE FUTURE PROSPECTS FOR BOLTZMANN MACHINES

In principle, Boltzmann machines represent a flexible class of exponential family models for multivariate binary data, in which the flexibility is enhanced by the inclusion of hidden units (or latent variables, in statistical terms). They are being used in some practical problems, particularly in the neural network literature, and are being used as the foundations of more complicated networks [61]. The involvement of time-consuming Monte Carlo methods in running and training Boltzmann machines has discouraged wider application, but the use of methods such as those of Geyer and Thompson [22], mentioned in Section V.D, and Smith and Gelfand [62], together with improved Monte Carlo strategies going beyond basic Gibbs sampling, and increased parallelism, may help to overcome this drawback. To the statistician this will improve access to a potentially useful class of exponential-family models. Further possible extensions include the
case where the responses are polytomous with ordered categories (rather than the unordered case dealt with in Section VIII.B); conceivably, the key to their development can be found in statistical regression models for ordered polytomous responses [63, 64]. In general, the topic promises to continue to be an area for strong development, at a particular part of the interface between neural network research and modern statistics that has been exposed by Cheng and Titterington [65] and Ripley [66], among others.
ACKNOWLEDGMENTS

Preparation of this chapter was supported, in part, by a Research Grant from the Complex Stochastic Systems Initiative of the UK Engineering and Physical Sciences Research Council.
REFERENCES

[1] D. M. Titterington and N. H. Anderson. Boltzmann machines. In Probability, Statistics and Optimization: A Tribute to Peter Whittle (F. P. Kelly, Ed.), pp. 255-279. Wiley, New York, 1994.
[2] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intelligence 6:721-741, 1984.
[3] W. K. Hastings. Monte Carlo sampling methods using Markov chains, and their applications. Biometrika 57:97-109, 1970.
[4] A. F. M. Smith and G. O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 55:3-23, 1993.
[5] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1091, 1953.
[6] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter (Eds.). Markov Chain Monte Carlo in Practice. Chapman & Hall, London, 1996.
[7] L. Tierney. Markov chains for exploring posterior distributions. Ann. Statist. 22:1701-1762, 1994.
[8] E. H. L. Aarts and J. H. M. Korst. Simulated Annealing and Boltzmann Machines. Wiley, Chichester, 1989.
[9] E. H. L. Aarts and J. H. M. Korst. Boltzmann machines as a model for parallel annealing. Algorithmica 6:437-465, 1991.
[10] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79:2554-2558, 1982.
[11] D. J. Amit. Modeling Brain Function. Cambridge Univ. Press, Cambridge, 1989.
[12] D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman & Hall, London, 1974.
[13] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis. MIT Press, Cambridge, MA, 1975.
[14] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Sci. 9:147-169, 1985.
[15] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing (D. E. Rumelhart and J. L. McClelland, Eds.), Vol. 1, pp. 282-317. MIT Press, Cambridge, MA, 1986.
[16] S. P. Luttrell. The use of Bayesian and entropic methods in neural network theory. In Maximum Entropy and Bayesian Methods (J. Skilling, Ed.), pp. 363-370. Kluwer Academic, Dordrecht, 1989.
[17] J. Besag and P. J. Green. Spatial statistics and Bayesian computation. J. Roy. Statist. Soc. Ser. B 55:25-37, 1993.
[18] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statist. Sci. 7:457-472, 1992.
[19] C. J. Geyer. Practical Markov chain Monte Carlo. Statist. Sci. 7:473-483, 1992.
[20] A. De Gloria, P. Faraboschi, and M. Olivieri. Efficient implementation of the Boltzmann machine algorithm. IEEE Trans. Neural Networks 4:159-163, 1993.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39:1-38, 1977.
[22] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J. Roy. Statist. Soc. Ser. B 54:657-699, 1992.
[23] S. Amari, K. Kurata, and H. Nagaoka. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3:260-271, 1992.
[24] W. Byrne. Alternating minimization and Boltzmann machine learning. IEEE Trans. Neural Networks 3:612-620, 1992.
[25] C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems 1:995-1019, 1987.
[26] J. Besag. Statistical analysis of non-lattice data. The Statistician 24:179-195, 1975.
[27] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.
[28] M. Mezard, G. Parisi, and M. A. Virasoro. Spin Glass Theory and Beyond. World Scientific, Singapore, 1987.
[29] D. Geiger and F. Girosi. Parallel and deterministic algorithms from MRF's: surface reconstruction. IEEE Trans. Pattern Anal. Machine Intell. 13:401-412, 1991.
[30] J. Zhang. The mean field theory in EM procedures for blind Markov random field image restoration. IEEE Trans. Image Processing 2:27-40, 1993.
[31] A. P. Dunmur and D. M. Titterington. Parameter estimation in latent structure models. Technical Report 96-2, Department of Statistics, University of Glasgow, 1996.
[32] A. E. Gelfand and B. P. Carlin. Maximum likelihood estimation for constrained or missing data problems. Canad. J. Statist. 21:303-311, 1993.
[33] C. J. Geyer. On the convergence of Monte Carlo maximum likelihood calculations. J. Roy. Statist. Soc. Ser. B 56:261-274, 1994.
[34] S. Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks 8:1379-1408, 1995.
[35] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. In Statistics and Decisions (E. J. Dudewicz et al., Eds.), Supplementary Issue no. 1, pp. 205-237. Oldenbourg, Munich, 1984.
[36] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. Preprint, Department of Computer Science, University of Toronto, 1993.
[37] N. H. Anderson and D. M. Titterington. Beyond the binary Boltzmann machine. IEEE Trans. Neural Networks 6:1229-1236, 1995.
[38] Numerical Algorithms Group. NAG Fortran Library, Mark 13. NAG Ltd., Oxford, 1988.
[39] S. E. Fienberg. The Analysis of Cross-Classified Data. MIT Press, Cambridge, MA, 1977.
[40] R. F. Churchhouse (Ed.). Handbook of Applicable Mathematics, Vol. III: Numerical Methods. Wiley, New York, 1981.
[41] J. A. Hagenaars. Categorical Longitudinal Data. Sage, London, 1990.
[42] L. A. Goodman. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61:215-231, 1974.
[43] J. J. Hopfield. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Nat. Acad. Sci. U.S.A. 84:8429-8433, 1987.
[44] E. Yair and A. Gersho. The Boltzmann Perceptron network: a soft classifier. Neural Networks 3:203-221, 1990.
[45] E. Yair and A. Gersho. Maximum a posteriori decision and evaluation of class probabilities by Boltzmann Perceptron classifiers. Proc. IEEE 78:1620-1628, 1990.
[46] H. J. Kappen. Deterministic learning rules for Boltzmann machines. Neural Networks 8:537-548, 1995.
[47] H. J. Kappen and M. J. Nijman. Radial basis Boltzmann machines and learning with missing values. In Proceedings of the 1995 World Congress on Neural Networks, Washington, DC, 1995.
[48] J. A. Anderson. Separate sample logistic discrimination. Biometrika 59:19-35, 1972.
[49] D. R. Cox and E. J. Snell. Analysis of Binary Data, 2nd ed. Chapman & Hall, London, 1989.
[50] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence 56:71-113, 1992.
[51] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[52] K. Abend, T. J. Harley, and L. Kanal. Classification of binary random patterns. IEEE Trans. Inform. Theory 11:538-544, 1965.
[53] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. J. Artificial Intelligence Res. 4:61-76, 1996.
[54] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Comput. 7:889-904, 1995.
[55] G. E. Hinton, P. Dayan, B. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science 268:1158-1161, 1995.
[56] T. S. Jaakkola and M. I. Jordan. Computing upper and lower bounds on likelihoods in intractable networks. AI Memo no. 1571, MIT, 1996.
[57] T. S. Jaakkola and M. I. Jordan. Recursive algorithms for approximating probabilities in graphical models. Computational Cognitive Science Technical Report 9604, MIT, 1996.
[58] L. K. Saul and M. I. Jordan. Learning in Boltzmann trees. Neural Computation 6:1174-1184, 1994.
[59] L. K. Saul and M. I. Jordan. Boltzmann chains and hidden Markov models. In Advances in Neural Information Processing Systems 7 (G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds.). MIT Press, Cambridge, MA, 1995.
[60] S. Rüger, A. Weinberger, and S. Wittchen. Decimatable Boltzmann machines vs. Gibbs sampling. Bericht 96-29 des Fachbereichs Informatik der Technischen Universität Berlin, 1996.
[61] R. A. Iltis and P.-Y. Ting. Data association in multi-target tracking: a solution using a layered Boltzmann machine. In Proceedings of the International Joint Conference on Neural Networks, 1991, Vol. 1, pp. 31-36. IEEE, New York, 1991.
[62] A. F. M. Smith and A. E. Gelfand. Bayesian statistics without tears: a sampling-resampling perspective. Amer. Statist. 46:84-88, 1992.
[63] J. A. Anderson. Regression and ordered categorical variables (with discussion). J. Roy. Statist. Soc. Ser. B 46:1-30, 1984.
[64] P. McCullagh. Regression models for ordinal data (with discussion). J. Roy. Statist. Soc. Ser. B 42:109-142, 1980.
[65] B. Cheng and D. M. Titterington. Neural networks: a review from a statistical perspective (with discussion). Statist. Sci. 9:2-54, 1994.
[66] B. D. Ripley. Neural networks and related methods for classification (with discussion). J. Roy. Statist. Soc. Ser. B 56:409-456, 1994.
Constructive Learning Techniques for Designing Neural Network Systems

Colin Campbell
Advanced Computing Research Centre
University of Bristol
Bristol BS8 1TR, England
I. INTRODUCTION

Neural networks have been applied to a wide range of application domains such as control, telecommunications, medical diagnosis, oil prospecting, finance, and other areas. Many of these applications can be broadly divided into three topics: classification problems, in which the input is assigned to one of a number of distinct classes or categories; tasks we will call regression, in which there is a continuous mapping from inputs to an output or outputs; and novelty detection, in which abnormal or unusual features are detected. Examples of classification include character recognition, speaker identification, digital channel equalization, and object recognition. In this case the neural network has one or more output nodes that are discrete and commonly binary-valued. For example, for digital channel equalization using neural networks (removing intersymbol interference and noise on communication channels), the binary outputs become the bits in the reconstructed digital code. On the other hand, many applications involve setting up a nonlinear functional map between the input nodes and outputs. In this case the neural network has one or more continuously valued output nodes. Examples would be control systems for robots, estimation of risk in insurance underwriting, control of telescope mirror segments with adaptive optics, prediction of currency exchange rates, etc. Finally, we have novelty detection, in which the system may be trained on normal data and it is required to
highlight divergence from normality. Examples might include the detection of abnormal features in electroencephalographic data, the detection of abnormal cells on cervical smears, fault detection in machinery, the detection of abnormal tissue on magnetic resonance imaging scans, etc. Typically the normal data are used to construct a density function, and the novel or abnormal features are highlighted as outliers from this distribution. Estimation of a density function is a well-known topic in statistics, but it is also extensively used in neural computing, particularly in unsupervised learning and radial basis function (RBF) networks, which form a distribution function covering the input data.

Classification and regression problems are usually handled by using multilayered neural network architectures. Networks trained with the Back Propagation algorithm and RBF networks are a common choice for these types of problem. For standard Back Propagation, the weights and biases of the nodes are found with a gradient descent procedure and the architecture of the network is found by experimentation. Usually the network is trained with several choices for the number of hidden nodes, and performance is evaluated against a validation set to determine the optimal number of hidden nodes. This step is computationally intensive, and therefore only a small fraction of the space of possible networks can be evaluated. For RBF networks it is similarly important to find the optimal choice of basis functions for covering the density distribution of the input data.

Constructive algorithms determine the architecture of the network and the parameters (weights, thresholds, etc.) necessary for learning the data. Compared to algorithms such as Back Propagation, they have the following potential advantages:

• They grow the architecture of the neural network in addition to finding the parameters required.
• They can reduce the training to single-node learning, hence substantially simplifying and accelerating the training process. The learning speed of constructive algorithms such as Cascade Correlation is at least an order of magnitude faster than that of Back Propagation.
• Some constructive algorithms are guaranteed to converge, unlike the Back Propagation algorithm, for example.
• Algorithms such as Back Propagation can suffer from catastrophic interference when learning new data; that is, storage of new information seriously disrupts the retrieval of previously stored data. The incremental learning strategy of constructive algorithms offers a possible solution to this problem.

Constructive techniques are of interest for other reasons. To date, a great deal of effort has been directed toward finding efficient algorithms for determining the weights in a neural network. Standard Back Propagation is typical in that
the procedure determines the weights but not the architecture of the network. On the other hand, for biological neural systems, the number of vesicles released at synaptic junctions suggests that the range of synaptic values is fairly limited and that the precision in these values can be expected to be poor. Learning in biological systems may be oriented toward setting up an appropriate connectivity rather than the determination of high-precision synaptic values. Evidence for this comes from the development of neural architectures in neonates, where neural structures are adapted through the branching of certain pathways, with pruning and decay of other neural circuits. This adaptive development may be essential for cognition in complex information-processing systems, because the category of learnable concepts is known to be very restricted [1], and therefore complicated concepts must be built up from simpler low-level concepts in a constructive fashion.

The major potential shortcoming of constructive algorithms is their aptitude for capturing noise and overfitting the training data. The objective of training is not to obtain an exact representation of the training data, but to construct a model of the rule or process that generated that data. Unfortunately, the ability to guarantee convergence can lead to the algorithm giving a perfect fit to the training data and therefore a perfect fit to the noise. We illustrate this idea in Fig. 1, where the solid line represents the underlying rule and the crosses represent the training data that are drawn from this rule but corrupted by noise. The dashed line represents the solution found by a constructive algorithm in which all of the training data are correctly learned, but the solution found is poor. Plainly the solution is over-complex, does not accurately capture the underlying rule, and could be expected to give poor generalization for new data drawn from that rule. The generalization error of neural networks can generally be viewed as the sum of a bias term and a variance. A nonzero bias can arise because the function learned
Figure 1 If the neural network has too many free parameters, it can fit the noise exactly (dashed line), giving a poor approximation to the underlying rule (solid line).
by the neural network differs on average from the underlying rule or function generating that data. A nonzero variance can arise because the function implemented by the network may be very sensitive to the particular choice of the training data, as in Fig. 1. A neural network with too simple an architecture will be unable to learn the data correctly and will have a large bias. On the other hand, if the network is over-complex, the variance will be large, with the network exactly fitting any noise in the chosen data set, for example. Constructive algorithms can frequently generate too complex an architecture with too many hidden nodes or hidden layers, and consequently we need a strategy to reduce network complexity if required. The solution is to remove excess weights or hidden nodes during or after the training process, a topic we will discuss further in Section V.

Rather than growing neural networks and then pruning, it may seem sensible to start with a network that is bound to solve the problem and then prune. However, such destructive algorithms have several disadvantages. First, a network with a small number of hidden nodes can be trained more rapidly, and therefore it makes sense to start with a small network and grow further hidden nodes. Having found a network that stores the data, we use this as a first approximation for beginning the pruning process. A second disadvantage is that it is difficult to guess the initial size of the network before running a destructive algorithm. Certain problems (e.g., N-parity or the Twin Spirals problem) require a substantial number of hidden nodes, and to avoid potential problems with a highly intermeshed data set, it would be necessary to start with a complex initial architecture.

In this review we will start with constructive algorithms for classification and regression tasks. Because networks with analog outputs can be readily adapted to give binary outputs, the latter category includes classification as a subclass. In both of these cases network construction proceeds through the successive addition of single hidden nodes. Section IV is then devoted to constructive techniques that add entire subnetworks. In Section V we consider pruning techniques to reduce network complexity. Given the rapid expansion of this topic, we have had to be selective and make only brief reference to many interesting algorithms that have been proposed recently. We have also had to ignore evolutionary techniques that are closely related to this topic, for example, the use of genetic algorithms to determine the network architecture [2]. In addition to altering weights and architecture, it is also possible to use adaptable updating functions at hidden nodes to add a further degree of flexibility [3-6]. This additional freedom is achieved by introducing further trainable parameters at the hidden nodes. These models therefore tend to favor networks with a small bias and high variance compared to networks with fixed updating functions, and consequently we will not consider them here.
II. CLASSIFICATION

A. INTRODUCTION
In this section we will consider constructive algorithms for solving classification problems. We have selected the algorithms to illustrate the different types of architectures that can be generated by constructive algorithms. Broadly speaking, there are two main classes: networks with a shallow structure, perhaps consisting of a single hidden layer that is grown laterally, and deeper architectures consisting of thin hidden layers (which may have only one hidden node each), with growth consisting of additional layers interposed between the input layer and the output.

During the training of a classifier, we want to identify input patterns with their correct category. We will use the objects x_j^μ to represent these inputs to the network, where the index μ labels the pattern (μ = 1, …, p) and j ranges over the components (j = 1, …, N for N nodes in the input layer). Because these components make up a vector, we will also sometimes write these objects x^μ. The corresponding targets are y_i^μ, where μ is the pattern label and i labels the output nodes. The output can also be viewed as a vector y^μ. Weights leading from a node j to a node i will be denoted W_ij, and the threshold at a node i will be denoted T_i (the bias is the negative of this threshold, −T_i). Thus if the nodes in the network use a ±1 updating function, the output at node i is related to the inputs by

$$y_i = \mathrm{sign}\Bigl(\sum_j W_{ij} x_j - T_i\Bigr), \tag{1}$$
where we define the sign function as having an output of +1 if its argument is greater than or equal to zero and −1 otherwise. It is also common to use the quantization (1, 0), in which case we use the Heaviside step function for the updating. For binary classification we will generally label the two target classes by P+ and P− (where P+ represents the set with target +1 and P− represents the set with targets −1 or 0). The separatrix in Eq. (1) can be viewed as an oriented hyperplane in an N-dimensional space:

$$\mathbf{w} \cdot \mathbf{x} = T. \tag{2}$$
For binary classification tasks, a constructive algorithm will typically try to separate the two classes of data, or if this is not possible, it will try to maximize the number of patterns on the correct sides of the hyperplane. If it is not possible to separate the two classes, then some of the patterns are incorrectly stored and the algorithm will proceed to grow further hidden nodes to store these patterns.
B. THE POCKET ALGORITHM

Because a data set may not be separable by a hyperplane, it is common to use a variant of the perceptron algorithm (called the pocket algorithm), which minimizes the number of errors for nonseparable data or finds a solution if the data is linearly separable (the perceptron algorithm and some alternatives to the pocket algorithm are considered in the Appendix). As for the perceptron algorithm, the pocket algorithm only changes the weights if a pattern is incorrectly stored (A1). The patterns are presented in a random order, and a separate set of weights W_j^p is kept (in one's pocket), along with the number of consecutive presentations for which these weights correctly classified the patterns. Because the patterns are presented randomly, a good set of weights will have a lower probability of being altered and hence will remain unchanged for a longer time than weights that give rise to errors. Consequently, to find a good solution, we only replace the pocketed weights with a new set when the current sequence of correct classifications is longer than the previous sequence of correct classifications. It is possible to show that these pocketed weights will give the minimum possible number of errors with a probability approaching 1 as training increases. Formally, the Pocket Convergence theorem [7-9] states that for a finite set of patterns and a probability P < 1, there exists an m such that after m iterations, the probability of finding optimal weights exceeds P.

A1. The Pocket Algorithm.

1. Randomly initialize the weight vector W_j^0.
2. Choose a training example at random (with input x_j^μ and corresponding target y^μ).
3. At iteration t, if the current weights correctly store the pattern, that is,

$$\mathrm{sign}\Bigl(\sum_j W_j^t x_j^{\mu}\Bigr) = y^{\mu},$$

then go to 4; else update the weight vector,

$$W_j^{t+1} = W_j^t + y^{\mu} x_j^{\mu},$$

and go to 2.
4. If the current sequence of correct classifications with the new weights is longer than the sequence of correct classifications for the pocketed weights W_j^p, then replace the pocketed weights by these weights and record the length of its run of correct classifications.
5. If the maximum number of iterations is exceeded, then stop; else go to step 2.
This algorithm improves the pocketed weights stochastically, and consequently there is a nonzero probability that a good set of weights could be overwritten by poor weights after a long sequence of the latter. An improvement is therefore to introduce a ratchet mechanism such that every time the pocketed weights are replaced, the number of errors must decrease. Thus in addition to the pocketed weights and length of sequence, we must also record the number of errors made with each pocketed weight set. Suppose the current sequence length of the pocketed weights is L and the current set of perceptron weights has remained unchanged for more than L presentations; then we proceed to evaluate the number of errors for the current perceptron weights. If the number of errors is less than that for the current pocketed weights, then we replace the pocketed weights with the weights from the perceptron rule, also recording the sequence length and this new minimum number of errors. Unfortunately, this evaluation of the number of errors for the new candidate weights substantially increases the computational load, although it gives a better eventual solution.
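A minimal sketch of the pocket algorithm with ratchet follows, with the bias absorbed as a clamped +1 input column. For simplicity, a zero net input is counted as an error here, a slight departure from the sign(0) = +1 convention used in the text:

```python
import numpy as np

def pocket_with_ratchet(X, y, max_iter=10000, rng=None):
    """Pocket algorithm (A1) with the ratchet modification (a sketch).

    X : (p, N) inputs, including a constant +1 column for the bias;
    y : (p,) targets in {-1, +1}.
    """
    rng = rng or np.random.default_rng()
    p = X.shape[0]
    w = rng.normal(size=X.shape[1])          # step 1: random initial weights
    pocket_w, pocket_run = w.copy(), 0
    pocket_err = np.sum((X @ pocket_w) * y <= 0)
    run = 0
    for _ in range(max_iter):
        mu = rng.integers(p)                 # step 2: random pattern
        if y[mu] * (w @ X[mu]) > 0:          # step 3: correctly stored
            run += 1
            if run > pocket_run:             # step 4 plus ratchet: accept
                err = np.sum((X @ w) * y <= 0)  # only if total errors drop
                if err < pocket_err:
                    pocket_w, pocket_run, pocket_err = w.copy(), run, err
                    if err == 0:
                        break
        else:                                # perceptron update; reset run
            w = w + y[mu] * X[mu]
            run = 0
    return pocket_w
```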
C. TOWER AND CASCADE ARCHITECTURES

The classification algorithms we will describe in the following sections have guaranteed convergence if the pattern sets have a property that Frean [10] has called convexity (although this is not a necessary condition for convergence for some algorithms). A pattern set is convex if each pattern in the set can be separated from the other patterns by a hyperplane [9-11]. For example, suppose the inputs in the training patterns, x_j^μ, have components ±1, and let us further suppose that pattern μ = 1 has target +1. If we use single-node learning with weights W_j = x_j^1 and a threshold T = N, then u = sign(Σ_j W_j x_j − T) gives an output u = +1 if x_j = x_j^1 for all j, and −1 otherwise (the same construction is possible with (1, 0) updating functions, of course). Geometrically, the patterns can be pictured as being located at some of the corners of an N-dimensional hypercube, and this solution amounts to a hyperplane that cuts off one corner from the rest of the hypercube. A second class of convex patterns arises with input vectors, x^μ, which have real-valued components satisfying the constraint

$$\sum_{j=1}^{N} \bigl(x_j^{\mu}\bigr)^2 = \text{const}.$$
In this case the pattern vectors can be viewed as emanating from the origin and as being of constant length, so that they lie as labeled points on the surface of an N-dimensional hypersphere. It is again possible to separate one point from the rest by using a tangential hyperplane that separates this point from the rest of the hypersphere.
Figure 2 The architecture generated by the Tower algorithm. Each hidden node is connected to the input nodes and the previous hidden node.
For convex pattern sets it is possible to develop a number of constructive algorithms with guaranteed convergence. As an illustration, let us consider one of the first and simplest algorithms, namely the Tower algorithm shown in Fig. 2. There are N inputs in the initial input layer of the Tower, together with an additional node (N + 1), clamped at +1 to handle the bias. The lowest hidden node has input from all of the nodes in this layer (including the bias node), and each higher node in the Tower has contributions from this input layer and the hidden node directly below it. Each node is trained using the pocket algorithm with ratchet, and then its weights are fixed and its output becomes a contribution to the next higher node. If the hidden nodes use the binary quantization (1, 0) and the pattern set is convex, it is straightforward to see that each new hidden node can be trained to make at least one less error than the immediately preceding hidden node. For example, suppose a hidden node is wrongly 0 when pattern ν is presented. At the next higher node in the Tower, we can use the convexity of the pattern set to find weights and a threshold between the input layer and the hidden node such that this node will be 1 when pattern ν is presented. If the weight between this hidden node and its predecessor is sufficiently large, it will respond correctly to the pattern set learned by the previous hidden node in addition to the νth pattern. For a finite pattern set this guarantees convergence (although with poor generalization, unless each hidden node can find a better solution, preferably correcting a number of errors). Another early constructive technique, the Tiling algorithm [12], similarly builds upward, using layers of hidden nodes consisting of a single master and several ancillary nodes to enforce learnable representations on the hidden layers. However, in the instance of master nodes with direct connections to all of the inputs
Figure 3 The cascade architecture generated by the Pyramid algorithm. Each hidden node is connected to the input nodes and all of the previous hidden nodes.
and the preceding master node [13], no ancillary nodes are required and this technique gives an alternative construction of the Tower architecture. We can generalize the Tower algorithm by introducing additional connections between each hidden node and all preceding hidden nodes in addition to the inputs. This is the Pyramid architecture illustrated in Fig. 3. Following more recent usage, we will henceforth refer to the Pyramid model as an example of a cascade architecture. In terms of successful applications and popularity, the cascade architecture is most commonly used in the context of the Cascade Correlation algorithm (Section III.B), in which the new hidden node is trained to correlate with the magnitude of the residual error at the output node. For classification tasks, the cascade architecture has been investigated (and its convergence proved) in the context of the Perceptron Cascade algorithm [14-16]. In this algorithm the hidden nodes are trained to correct the two types of classification error at the output: wrongly off (a 0 instead of a 1) and wrongly on (a 1 instead of a 0). Each time a hidden node is constructed, its weights are trained and it is added to the network and frozen; there is no subsequent training of its weights. The output nodes are connected to all of the input nodes and to all hidden nodes by weights that are retrained after the addition of each hidden node. As previously, a bias is also included, using a weight W_0 connected to an additional input node, which is clamped at +1 for all input patterns. Burgess et al. have successfully used this algorithm for 3D object recognition [15] and shown that it can converge for arbitrary real-valued inputs [14].
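A sketch of the growth loop for the Tower case is given below, reusing the pocket_with_ratchet routine sketched earlier and ±1 node outputs for simplicity (the convergence argument above was phrased for (1, 0) units); the Pyramid variant would feed each new node all previous hidden outputs rather than only the most recent one:

```python
import numpy as np

def grow_tower(X, y, max_nodes=20):
    """Grow a Tower network node by node (a sketch).

    X : (p, N) inputs with a +1 bias column; y : (p,) targets in {-1, +1}.
    Returns the list of weight vectors, lowest tower node first.
    """
    nodes, inputs = [], X
    for _ in range(max_nodes):
        w = pocket_with_ratchet(inputs, y)
        out = np.where(inputs @ w >= 0, 1.0, -1.0)   # sign(0) = +1
        nodes.append(w)
        if np.all(out == y):                         # all patterns stored
            break
        inputs = np.hstack([X, out[:, None]])        # inputs + last node
    return nodes

def tower_predict(nodes, X):
    """Replay the tower on (possibly new) inputs X."""
    inputs = X
    for w in nodes:
        out = np.where(inputs @ w >= 0, 1.0, -1.0)
        inputs = np.hstack([X, out[:, None]])
    return out
```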
Figure 4 To simplify the drawing of Cascade Correlation networks, it is common to use a compact diagrammatic representation. The network in (a) is represented by (b), in which a link via directed lines to the right and upward represents connections and trainable weights in the network.
Figure 5 A cascade architecture with three inputs, a single hidden node, and one output.
The structure of connections in cascade architectures can easily become complicated if there are a large number of inputs or hidden nodes. This is especially true for hidden nodes higher in the tree, which will develop a large fan-in from all of the previous hidden nodes and all of the nodes in the input layer. Consequently, it is common to introduce a compact diagrammatic representation for the network (Fig. 4). Thus, for example, a network with three inputs, a bias node, and one hidden node would be represented as in Fig. 5.
D. TREE ARCHITECTURES: THE UPSTART ALGORITHM

1. The Algorithm

The Upstart algorithm [10, 17] is an efficient constructive procedure for building classifier networks. As noted above, there are two types of error that can occur at the output node ((1, 0) node quantization is required, as we will see):

• The output is 1 when it should have been 0. We call these patterns wrongly on.
• The output is 0 when it should have been 1. We call these patterns wrongly off.

The algorithm starts without hidden nodes and attempts to separate the two classes of data by using the pocket algorithm, for instance. If it is not possible to separate the two classes, then the best solution is recorded and the corresponding weights and bias are used for the direct connections between the input layer and output (Fig. 9). To correct the remaining errors, we proceed as follows: If the output makes wrongly on mistakes, we build a hidden node called a corrector (Fig. 6). The weight between this hidden node and the output is large and negative (the magnitude must exceed the sum of the positive weights into the output node from the inputs). We now use the pocket algorithm to find a set of weights from the inputs to this hidden node. Let T_o be the target value at the output and O_o be the actual output achieved for a given input pattern; then the target values we want at the hidden node are given by the entries in the table in Fig. 7. Plainly, if the target output value (T_o) is 1 and we are considering a negative weight between this hidden node and the output, then the target at the hidden node must be 0 so as not to introduce an error. If the target output (T_o) is 0 and the present output (O_o) is 1, then we must use a target of 1 at the hidden node to correct this error. Finally, if the target output (T_o) is 0 and the present output (O_o) is 0, then either a target of 0 or 1 is satisfactory at the hidden node, because 1 will reinforce this pattern, whereas 0 will have no effect (the output will remain correct). This entry has been denoted by a don't care symbol X (either 0 or 1), and the corresponding patterns can simply be dropped from the training set.
Figure 6 The corrector nodes generated by the Upstart algorithm. Node A corrects wrongly on errors and has a large negative weight between it and the output node. Node B corrects wrongly off errors and has a large positive weight between it and the output node. There are connections into all three nodes not shown in this figure.
If the output makes wrongly off mistakes, we build a second corrector. The weight between this hidden node and the output is large and positive (again the magnitude must exceed the sum of the negative weights into the output node from the inputs). We now use the pocket algorithm to find a set of weights from the inputs to this hidden node, using the entries of the table in Fig. 8 as target values. If errors exist at these correctors, then we introduce further correctors to remove these errors and keep repeating the process until no errors are left. At the end of the algorithm, we obtain a binary tree network (Fig. 9), with all hidden nodes having direct connections to the input layer nodes. Because some patterns will fall into the category of "don't care" targets, we see that the learning task becomes successively easier as we descend the tree. This does not ensure convergence, because it is conceivable that no patterns will fall into this state as we descend a limb of the tree. However, for a convex pattern set, we can always correctly store at least one pattern at each corrector, and this guarantees convergence for a finite pattern set. In Section II.F we will see that the tree structure of the Upstart algorithm is also reducible to a network with a single hidden layer. Frean [17] performed numerical experiments with artificial tasks (N-parity, "2 or more clumps," etc.), and the generalization ability of the resulting networks
                    Target value (T_o)
                       0        1
Output value    0      X        0
(O_o)           1      1        0

X is 0 or 1

Figure 7 The truth table for the wrongly on corrector (node A) in Fig. 6. The entries in the table are the desired target values for this corrector. The output values are those at the output node (O_o), and the target values are the desired outcomes at the output node (T_o).
                    Target value (T_o)
                       0        1
Output value    0      0        1
(O_o)           1      0        X

X is 0 or 1

Figure 8 The truth table for the wrongly off corrector (node B) in Fig. 6. The entries in the table are the desired target values for this corrector. The output values are those at the output node (O_o), and the target values are the desired outcomes at the output node (T_o).
was good and much better than earlier algorithms such as the Tower and Tiling algorithms. However, the algorithm does not appear to have been used in practical classification tasks since its introduction.
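The recursive structure of the Upstart algorithm is compact enough to sketch directly. The version below uses (1, 0) units and an arbitrary single-node learner train_unit(X, t); BIG stands in for a weight whose magnitude exceeds the relevant sums of output weights, and termination is guaranteed only under the convexity condition discussed earlier:

```python
import numpy as np

BIG = 1e3          # stand-in for a sufficiently large corrector weight

def step(z):
    return (z >= 0).astype(float)          # (1, 0) quantization

def upstart(X, t, train_unit):
    """Recursive Upstart sketch; returns a predict(X) function.

    X : (p, N) inputs (bias column included); t : (p,) targets in {0, 1};
    train_unit(X, t) is any single-node learner returning weights w.
    """
    w = train_unit(X, t)
    out = step(X @ w)
    won = (out == 1) & (t == 0)            # wrongly on
    woff = (out == 0) & (t == 1)           # wrongly off
    corr_on = corr_off = None
    if won.any():                          # corrector A, cf. Fig. 7:
        keep = (t == 1) | won              # drop the don't-care patterns
        corr_on = upstart(X[keep], won[keep].astype(float), train_unit)
    if woff.any():                         # corrector B, cf. Fig. 8:
        keep = (t == 0) | woff
        corr_off = upstart(X[keep], woff[keep].astype(float), train_unit)
    def predict(Xq):
        net = Xq @ w
        if corr_on is not None:
            net = net - BIG * corr_on(Xq)      # large negative weight
        if corr_off is not None:
            net = net + BIG * corr_off(Xq)     # large positive weight
        return step(net)
    return predict
```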
E. CONSTRUCTING TREE AND CASCADE ARCHITECTURES BY USING DICHOTOMIES

Let us consider classification tasks where the input vectors have binary-valued components (quantized ±1) and the targets belong to two classes: patterns with target y^μ = +1 (the set P+) and those with target y^μ = −1 (the set P−) (this choice of target quantization is important, as we will see shortly). For input vectors with binary-valued components, it is always possible to find a set of weights
Figure 9 An example of a possible architecture generated by the Upstart algorithm with two corrector nodes. In contrast to this example, the tree will grow asymmetrically for most typical data sets.
and thresholds that will correctly store all of the patterns belonging to one of these sets and at least one member belonging to the other set (for target y^μ = +1, say, this was the construction W_j = x_j^1 and T = N for an updating function such as Eq. (1)). Usually it is possible to exceed this minimal solution and store a number of patterns of one target sign in addition to all of the patterns of the other target sign. A set of weights and thresholds that correctly stores all of the P+ patterns and some of the P− will be said to induce a ⊕-dichotomy, whereas a ⊖-dichotomy will correspond to correct storage of all of the P− patterns and some of the P+. Let us consider a pair of nodes in a hidden layer with direct connections to the output node, with each connection having a weight value of +1. Let us assume the first of these nodes induces a ⊕-dichotomy and the second induces a ⊖-dichotomy (we will also refer to these hidden nodes as dichotomy nodes below). If the first node correctly stored a pattern belonging to P−, then the second node must similarly store this pattern correctly, and each hidden node contributes −1 to the output node. On the other hand, if the pattern belonging to P− was not stored correctly, then the first node will contribute a +1 to the output, which is canceled out by the −1 from the second node (because the weights leading to the output node are both +1). In a similar fashion, if the second node successfully stores a pattern belonging to P+, then both nodes will contribute +1 to the output node; otherwise the contributions from the two nodes cancel each other out. If the threshold at the output is zero, then the patterns contributing two +1 or two −1 to the output are stored correctly.

Now let us consider those patterns belonging to P+ and P− that remain unstored; that is, the contributions from the pair of hidden nodes cancel each other out. In the Target Switch algorithm [18], we handle these unstored patterns by introducing further hidden nodes alternately inducing ⊕- and ⊖-dichotomies. Patterns correctly stored at the first two hidden nodes are discarded from the training sets of subsequent hidden nodes. Consequently, we avoid disrupting previously stored patterns (at earlier dichotomies) by introducing further nodes between this hidden layer and the output. This can be achieved by using a cascade architecture of linear nodes or a tree architecture of thresholding nodes (we will call these additional hidden nodes cascade nodes and tree nodes below). The topology of the cascade architecture is illustrated in Fig. 10. A link between two nodes indicates a connection with a corresponding weight value always fixed at +1, and the numbers indicate the order in which the dichotomy nodes are grown. Suppose the output of the first two dichotomy nodes is (+1, +1). The first linear cascade node feeds a +2 to all subsequent cascade nodes and the output. This inhibits subsequent cascade nodes from sending an erroneous signal to the output node. For example, suppose the output is −1 for all dichotomy nodes after the first pair; then the corresponding outputs of the other cascade nodes will be 0. On the other hand, if a pattern is not stored correctly at the first pair, then the first
Figure 10 The cascade architecture for the Target Switch algorithm. The dichotomy nodes labeled 1 feed the first linear cascade node, which then feeds the second and third cascade nodes and the output node. If the output from this first pair of dichotomy nodes is (+1, +1) or (−1, −1), then the first cascade node inhibits the succeeding cascade nodes and hence determines the output; otherwise the first cascade node outputs 0, and the final output is determined by the succeeding dichotomy nodes. Reprinted from C. Campbell and C. Perez Vicente, Neural Computation [18], 1995, with permission of MIT.
cascade node outputs 0, and consequently the output of the network is influenced solely by the remaining dichotomy nodes. The number of dichotomy nodes in the hidden layer can be odd or even. If we have m dichotomy nodes, the number of linear cascade nodes is m/2 (m even) or (m + 1)/2 (m odd). Instead of linear cascade nodes, we can use thresholding nodes between the dichotomy nodes and the output. The corresponding tree architecture is illustrated in Fig. 11, with the numbers indicating the order in which the dichotomy nodes are grown. This architecture is less economical in terms of the number of extra hidden nodes generated, with ¼(m² − 2m) (m even) or ¼(m − 1)² (m odd) tree nodes for m dichotomy nodes.

To implement this method for storing the patterns, we need an efficient procedure for obtaining the dichotomies. The method we now outline starts by attempting to maximize the number of correctly stored patterns (using the pocket algorithm, for example). If separation is not possible, we then shift the threshold so that all of the +1 are stored correctly (for a ⊕-dichotomy), although some of the −1 are incorrectly stored. We then locate the most problematic pattern among
Figure 11 The tree architecture of thresholding nodes for the Target Switch architecture. If the output of the first pair of hidden nodes (labeled 1) is either (+1, +1) or (−1, −1), then the output of the network is +1 or −1, respectively; otherwise the final output is determined by the succeeding dichotomy nodes (labeled 2 and 3). Reprinted from C. Campbell and C. Perez Vicente, Neural Computation [18], 1995, with permission of MIT.
the P− on the wrong side of the hyperplane and change its target value to +1 (we call this target switching). We keep repeating this process until the two sets are separated. In more detail, for a ⊕-dichotomy at hidden node i, the initial local pattern sets are P_i^+, which consists of members of P+ that were not stored at previous dichotomy nodes, and P_i^−, which consists of previously unstored members of P−. We then proceed as follows.

A2. The ⊕-Dichotomy Procedure.

1. We use the pocket algorithm or related techniques (Section VII) to find the maximum linearly separable subset storing most patterns. Let us suppose the corresponding weights are W_ij.
2. Using these weights, we now calculate m_μ = Σ_j W_ij x_j^μ for all of the patterns belonging to P_i^+ and P_i^−. Among those patterns belonging to the set P_i^−, we find the pattern with the largest value of m_μ. Suppose this is pattern μ = λ; then we set the threshold at i equal to m_λ, that is, T_i = m_λ.
3. For the set of patterns belonging to P_i^+, we then determine whether there are any patterns such that m_μ is less than or equal to m_λ, i.e., m_μ ≤ m_λ. If there are no patterns in P_i^+ with m_μ less than or equal to m_λ, then we have finished training the weights and thresholds leading into hidden node i and
we proceed to step 5 below. However, if there are patterns in P_i^+ satisfying this inequality, then we find the pattern in P_i^+ that has the smallest value of m_i^μ (including the most negative). Let us suppose this is pattern μ = ν.

4. Among those vectors belonging to P_i^- with m_i^μ ≥ m_i^ν, we find the pattern with the largest value of m_i^μ and assign it the target value +1 (i.e., this pattern moves from P_i^- to P_i^+). With the new sets P_i^+ and P_i^-, we return to step 1 to find a new set of weights and thresholds.

5. We have now obtained a ⊕-dichotomy. For the remaining members of P_i^-, the sums Σ_j W_ij x_j^μ are less than the threshold T_i, whereas for patterns belonging to P_i^+, these sums are greater than T_i. However, this dichotomy may not be the best solution, and consequently we can proceed with further training to maximize the number of patterns in P_i^- that are stored correctly. To do this we record the number of P_i^- patterns that were correctly stored (and the associated weights and thresholds). We then discard these correctly stored P_i^- patterns and use the unstored P_i^- and the original P_i^+ as our training set, repeating steps 1-4. Eventually we will exhaust the entire P_i^- set, and we choose the solution that stored the largest number of P_i^- patterns as the set of weights and threshold for this hidden node.

To obtain a ⊖-dichotomy, we follow a very similar procedure. In step 2 we find the pattern with the smallest value of m_i^μ (for μ ∈ P_i^+) and set the threshold T_i equal to this value of m_i^μ. If the pattern sets are not linearly separable, we search through the P_i^+ to find the pattern that is least well stored, switch its target value +1 → −1, and iterate the sequence until a separation of the two sets is achieved.

A few patterns can lie in the hyperplanes found in the ⊕- and ⊖-dichotomies (for example, in step 2 the pattern μ = λ lies in the hyperplane). Because we use the convention sign(0) = +1, it is necessary to offset the thresholds in step 2 of the ⊖-dichotomy by a very small positive quantity, T_i → T_i + δ.

Steps 1-5 are similar to a procedure proposed by Zollner et al. [19] (although we also introduce an outer loop in step 5 to further minimize the number of hidden nodes). However, unlike their algorithm, both pattern sets P^+ and P^- are handled symmetrically in the Target Switch algorithm outlined below. These two pattern sets are treated asymmetrically in the algorithm of Zollner et al., leading to poor generalization. Having defined the procedures for obtaining ⊕- and ⊖-dichotomies, we can now outline the Target Switch algorithm [18]:

A3. The Target Switch Algorithm.

1. For input and target sets x^μ and y^μ, we attempt a ⊕-dichotomy. If the dichotomy is achieved without target switching, then learning is completed without hidden nodes, the threshold at the output node is set equal to the
threshold value determined in step 2 above, and the weights directly connect the input and output nodes.

2. If target switching occurred, then a hidden node is created and the weights leading into this hidden node (i = 1) are set equal to the weights determined in 1 (similarly, the threshold T_1 is set equal to the threshold determined in 1). A second hidden node is created, i = 2, inducing a ⊖-dichotomy. The training sets P_2^+ and P_2^- are initially set equal to the original P^+ and P^-. We then determine the weights W_2j and threshold T_2.

3. If the previous pair of hidden nodes failed to store some of the pattern set, then we create training sets P_i^+, P_i^-, P_{i+1}^+, and P_{i+1}^- for a further two hidden nodes, inducing a ⊕- and a ⊖-dichotomy, respectively. These training sets consist of patterns previously unstored at earlier hidden nodes. Step 3 is iterated until all patterns are stored (the final separation of the remaining patterns can occur at an odd-numbered node).

4. A cascade or tree architecture is created (as illustrated in Figs. 10 and 11).

It is also possible to handle arbitrary real-valued inputs by using the Target Switch algorithm [18]. The algorithm has been tried on several artificial problems and has compared well with other techniques such as Back Propagation and neural decision trees (Section IV.B). In further experimentation by the author, it has been found to perform best for Boolean problems and with reasonably balanced target sets. Instead of using the above tree and cascade architectures, it is also possible to use weight pairs that are multiples of 2 between the hidden layer and output (Fig. 12). This architecture was outlined earlier by Marchand et al. [11] in 1990. However, the objective of Campbell and Perez [18] was to enhance generalization performance by reducing the number of trainable free parameters in
Figure 12 Instead of the cascade or tree architectures in Figs. 10 and 11, we can use a single hidden layer and weights that are multiples of 2 between the hidden layer and output.
the network. In particular, these authors were interested in training using binary-valued weights (quantized ±1) between the inputs and hidden layer, and weights with value 1 between the hidden layer and the output. Apart from having good generalization properties [20], such weight values are straightforward to implement in hardware [18]. Although binary-valued weights would appear to be very restrictive, it is important to remember that they give 2^N evenly distributed hyperplanes as candidates for a separatrix (about 10^30 hyperplanes for a network with N = 100 input nodes, for example). This restriction only results in an approximate doubling of the number of hidden nodes generated compared to real weights for most data sets considered [18]. Binary weight implementations were found to be best for learning Boolean problems, in which case the improvement in generalization can sometimes be substantial (for example, a generalization performance of 91.4 ± 3.4% on a test set was achieved for a Boolean problem for which Back Propagation gave 67.3 ± 5.7%). Other constructive approaches to generating feed-forward networks with binary-valued weights have been considered recently [21].
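To make the dichotomy procedure concrete, the following Python sketch illustrates the threshold-setting and target-switching loop of A2 for a single hidden node. It is a minimal illustration under simplifying assumptions, not the published implementation: the hypothetical function find_weights stands in for the pocket algorithm of Section VII (a least-squares fit is used as a placeholder), and the pattern sets are plain NumPy arrays.

import numpy as np

def find_weights(X, targets):
    # Stand-in for the pocket algorithm (Section VII): any single-node
    # trainer storing a large linearly separable subset would do here.
    return np.linalg.lstsq(X, targets, rcond=None)[0]

def plus_dichotomy(P_plus, P_minus, max_iters=100):
    # Sketch of the (+)-dichotomy procedure A2 with target switching.
    # P_plus, P_minus: lists of input vectors (NumPy arrays).
    P_plus, P_minus = list(P_plus), list(P_minus)
    W, T = None, 0.0
    for _ in range(max_iters):
        X = np.array(P_plus + P_minus)
        t = np.array([1.0] * len(P_plus) + [-1.0] * len(P_minus))
        W = find_weights(X, t)                         # step 1
        m_plus = np.array([W @ x for x in P_plus])
        m_minus = np.array([W @ x for x in P_minus])
        T = m_minus.max()                              # step 2: threshold
        if m_plus.min() > T:                           # step 3: all P+ above?
            break                                      # (+)-dichotomy achieved
        v = m_plus.min()                               # least well stored P+ value
        # step 4: switch the target of the worst-stored P- pattern above v
        candidates = [k for k in range(len(m_minus)) if m_minus[k] >= v]
        worst = max(candidates, key=lambda k: m_minus[k])
        P_plus.append(P_minus.pop(worst))              # target switch P- -> P+
    return W, T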
F. CONSTRUCTING NEURAL NETWORKS WITH A SINGLE HIDDEN LAYER
Instead of generating tree or cascade architectures, we can also use the dichotomy procedure to generate a shallower network with a single hidden layer and weights with a value of 1 between the hidden nodes and output. As for the Target Switch algorithm, we will use target values ±1. Let P_i^+ and P_i^- be the local pattern sets at hidden node i; then we can generate the network as follows [22].

A4. Growing a Single Hidden Layer: Version 1.

1. Perform a ⊕-dichotomy with the current training set. For hidden node i the training set P_i^+ is equal to the original P^+, whereas P_i^- only consists of members of P^- previously unstored at earlier hidden nodes, inducing a ⊕-dichotomy. We repeatedly grow hidden nodes, iterating this step until all patterns belonging to P^- are stored.

2. Similarly, we construct a set of hidden nodes inducing ⊖-dichotomies. For hidden node i, the training set P_i^- is equal to the original P^-, whereas P_i^+ only consists of members of P^+ previously unstored at earlier hidden nodes inducing a ⊖-dichotomy. We repeatedly grow hidden nodes and iterate this step until all patterns belonging to P^+ are stored.

3. We use a threshold at the output node equal to the difference between the number of hidden nodes inducing ⊕-dichotomies and the number inducing ⊖-dichotomies.
This algorithm generates a committee machine [23] in which the output value is dictated by the majority of +1 or −1 votes on the hidden layer (viewing the threshold as representing hidden nodes with fixed outputs). Instead of starting with ⊕- and ⊖-dichotomies, we can also start with a node that is neither of these, with the ⊕- and ⊖-dichotomies introduced to correct errors [24].

A5. Growing a Single Hidden Layer: Version 2.

1. We attempt to split the training data with a separating hyperplane by using the pocket algorithm. If the data set is linearly separable, we stop; otherwise the weights and threshold for the best solution are retained as the weights and threshold for the first hidden node (i = 1).

2. We construct a set of hidden nodes inducing ⊕-dichotomies. For hidden node i the training set P_i^+ is equal to the original P^+, whereas P_i^- consists only of members of P^- previously unstored either by the hidden node constructed in 1 or at earlier hidden nodes inducing a ⊕-dichotomy. We iterate this step until all patterns belonging to P^- are stored.

3. Similarly, we construct a set of hidden nodes inducing ⊖-dichotomies. For hidden node i the training set P_i^- is equal to the original P^-, whereas P_i^+ consists only of members of P^+ previously unstored either at the first hidden node constructed in 1 or at earlier hidden nodes inducing a ⊖-dichotomy. We iterate this step until all patterns belonging to P^+ are stored.

4. We use a threshold at the output equal to the difference between the number of hidden nodes inducing ⊕-dichotomies and the number inducing ⊖-dichotomies.

The architecture generated by this latter algorithm is, in fact, identical in performance to the Upstart algorithm mentioned previously. The nodes inducing ⊕- and ⊖-dichotomies play the role of correctors, amending the errors made by the initial node. Numerical simulations were pursued by Campbell and Perez [22, 24] to compare A4 and A5 with the Target Switch algorithm and to determine whether the deeper structure of cascade architectures is preferable to a single hidden layer in terms of generalization performance. The differences between the algorithms were generally small and equivocal for most problems. For example, using the same training and test sets and an equivalent dichotomy procedure, the algorithms were compared on the classic mirror symmetry problem [25], with 68.4 ± 3.2% on the test set for the Target Switch algorithm and 67.3 ± 3.8% for the single-layered network generated by A4 (using real weights). For another problem, shift detection [26], the Target Switch algorithm was poorer, with 86.6 ± 4.6% as against
88.9 ± 3.7% for A4. Because these are Boolean problems, a much more significant gain in performance was made by using binary weight quantization, for which these results were replaced by 80.4 ± 4.8%, 77.3 ± 5.0%, 91.4 ± 3.4%, and 95.4 ± 2.3%, respectively.
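The output rule shared by A4 and A5 is simple to state in code. The sketch below (names are illustrative) computes the committee decision for threshold hidden nodes with outputs ±1, with the output threshold set to the difference between the number of ⊕- and ⊖-dichotomy nodes, as described above.

import numpy as np

def committee_output(x, W, T, n_plus, n_minus):
    # W: matrix of hidden-node weights (one row per node); T: vector of
    # hidden-node thresholds; n_plus, n_minus: numbers of hidden nodes
    # inducing (+)- and (-)-dichotomies, respectively.
    hidden = np.sign(W @ x - T)        # +/-1 hidden-layer outputs
    hidden[hidden == 0] = 1            # convention sign(0) = +1
    T_out = n_plus - n_minus           # output threshold (step 3/4)
    return 1 if hidden.sum() > T_out else -1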
G. SUMMARY
Apart from the algorithms mentioned here, a number of other constructive techniques have been proposed for solving classification tasks [27-32]. To date there has been no extensive comparison of the performance of these algorithms, and few of them have been applied to real-life data sets. In addition to generating feed-forward architectures, some authors have considered generating higher order weights [33, 34] that can solve nonlinearly separable problems without the need for hidden nodes. However, the larger number of trainable parameters in higher order networks can often mean poorer generalization in practice.
III. REGRESSION PROBLEMS

A. INTRODUCTION
In this section we will consider three main approaches to regression problems. We start with the popular Cascade Correlation learning algorithm, which generates a deep architecture with a number of hidden layers, each with one hidden node. This algorithm is fast and efficient and has been used in a number of applications (Section III.B). Next we consider node creation and node-splitting algorithms, which add nodes to a shallow network with a single hidden layer. In both cases the hidden node adds a decision boundary across the entire input space, which can have the disadvantage that extra hidden nodes do not substantially improve performance. As our third approach, we therefore consider constructive methods in the context of RBF networks, in which hidden nodes cover local areas of the input space (Section III.D).
B. THE CASCADE CORRELATION ALGORITHM

1. The Algorithm

The Cascade Correlation algorithm was introduced by Fahlman [35] in 1989 and remains the most popular constructive technique, aside, perhaps, from adaptive RBF networks. As for the Perceptron Cascade, which succeeded the Cascade
Correlation, the hidden nodes are connected to all preceding hidden nodes in addition to the nodes in the input layer. The function of the hidden nodes is to correct the residual errors at the output. We are only interested in the magnitude of the correlation between the hidden node values and the residual error at the output because, if the hidden node correlates positively with the residual error, we can use a negative weight between it and the output, whereas if this correlation is negative, we can use a positive weight. A6. The Cascade Correlation Algorithm. 1. We start without hidden nodes and with an architecture looking like that in Fig. 13 (we have chosen one output node and two inputs for purposes of illustration). We then perform gradient descent to minimize an error function such as
E = (1/2) Σ_μ (y^μ − O^μ)²,     (6)
where O^μ is the output of the network and y^μ is the expected target. We can use the Delta rule with weight changes ΔW_j = −η ∂E/∂W_j, or a faster alternative, such as Quickprop [35] (Section VII.B). If the error falls below a certain tolerance, then we stop; the algorithm has found a solution without hidden nodes. If it reaches a plateau at a certain error value, then we end the gradient descent, recording the weights for the best solution found so far. It is now necessary to add hidden nodes to correct the error that remains in the output. Before doing so, we record the residual pattern errors,

δ^μ = (y^μ − O^μ),     (7)

for each pattern μ, where the output O^μ is calculated using those weights W_j that minimize E (we calculate the output from O^μ = f(Σ_k W_k x_k^μ) for our
Figure 13 A cascade architecture with two input nodes, a bias, and no hidden nodes.
chosen updating function, such as f(φ) = tanh(φ), for example). We also calculate the average pattern error, δ̄ = (1/p) Σ_μ δ^μ, where p is the number of patterns.

2. We now grow a hidden node as in Fig. 14 and adjust the weights between the hidden node and the inputs so as to maximize the correlation S:

S = Σ_μ (H^μ − H̄)(δ^μ − δ̄),     (8)
where H^μ is the output of the hidden node and H̄ is the average of these outputs over the pattern set, that is, H̄ = (1/p) Σ_μ H^μ. S is maximized by a gradient ascent. Suppose the weights leading from the input to the hidden node are J_j; then suitable weight changes are

ΔJ_j = η ∂S/∂J_j = η ε Σ_μ (δ^μ − δ̄) f′(φ^μ) x_j^μ,     (9)

where η is the learning rate and φ^μ = Σ_k J_k x_k^μ. f′(φ) is the derivative of f(φ) with respect to φ; for example, for an updating function f(φ^μ) = tanh(βφ^μ/2), f′(φ^μ) = β[1 − [f(φ^μ)]²]/2. ε is the sign of the correlation between the hidden node's value and the output node (this parameter arises from our earlier remark about the two sign choices for the weights between hidden nodes and output). We
Figure 14 The Cascade Correlation architecture after addition of the first hidden node.
perform this gradient ascent until S stops increasing, at which point we freeze the weights belonging to this hidden node.

3. With the hidden node in place in the network, we now go back to step 1 and perform a gradient descent in the total error E, but now adjusting the weights from the hidden node(s) to the output in addition to the weights from the inputs to the output. Having reached a plateau as in 1, we may satisfy our stopping criterion; otherwise we grow a further hidden node and perform a gradient ascent in S, but now including the weights from the previous hidden node(s) in addition to the weights from the input layer. This step is repeated until the error E falls below a prescribed tolerance.

Instead of using one candidate hidden node, we can also have several candidates with different initial weight configurations. Because single-node learning is involved at each step, this algorithm is very fast, outperforming standard Back Propagation by about an order of magnitude in terms of training speed. There are many variants on the original Cascade Correlation algorithm. For example, Phatak and Koren [36] have proposed a variant in which there is more than one hidden node in each layer, although the nodes in a new layer are only connected to those in the hidden layer immediately previous. This prevents the problem of steadily increasing fan-in to hidden nodes deep in the network, although the fixed fan-in of the hidden nodes may mean the network is not capable of universal approximation. Simon et al. [37] have investigated a cascade architecture in which each output node is connected to its own set of hidden nodes rather than having the hidden nodes connected to all of the output nodes. Simulations suggest this leads to faster training and improved generalization performance, although at the expense of more hidden nodes. In the original statement of the Cascade Correlation algorithm, each hidden unit is a single node with a sigmoidal updating function. We could also use other types of updating function, such as radial basis functions (Section III.D), or entire networks rather than a single node [38-40].

2. Recurrent Cascade Correlation

Recurrent neural networks can generate and store temporal information and sequential signals. It is straightforward to extend the Cascade Correlation algorithm to generate recurrent neural networks. A commonly used recurrent architecture is the Elman model (Fig. 15), in which the outputs of the hidden nodes at discrete time t are fed back as additional inputs to the hidden layer at time t + 1. To store the output of the hidden nodes, Elman introduced context nodes, which act as a short-term memory in the system. To generate an Elman-like network using Cascade Correlation, each added node has a time-delayed recurrent self-connection that is trained along with the other input weights to candidate nodes to maximize the correlation. When the candidate node is added to the network as a hidden node, this recurrent weight is frozen along with the other weights (Fig. 15).
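Returning to the basic algorithm, step 2 of A6 can be rendered compactly in code. The sketch below trains a single candidate node by gradient ascent on the correlation S of Eq. (8) using the update rule of Eq. (9); it assumes one output node and a single candidate rather than a pool, so it is a didactic outline rather than Fahlman's reference implementation.

import numpy as np

def train_candidate(X, residuals, eta=0.05, beta=2.0, epochs=200):
    # X: p x n inputs to the candidate (original inputs plus the outputs
    # of previously frozen hidden nodes); residuals: per-pattern errors
    # delta_mu of Eq. (7).
    p, n = X.shape
    J = np.random.uniform(-0.5, 0.5, n)
    d = residuals - residuals.mean()               # delta_mu - delta_bar
    S = 0.0
    for _ in range(epochs):
        H = np.tanh(beta * (X @ J) / 2.0)          # candidate outputs H_mu
        S = np.sum((H - H.mean()) * d)             # correlation S, Eq. (8)
        eps = 1.0 if S >= 0 else -1.0              # sign of the correlation
        fprime = beta * (1.0 - H**2) / 2.0         # derivative of tanh(beta*phi/2)
        J += eta * eps * (X.T @ (d * fprime))      # weight update, Eq. (9)
    return J, abs(S)

After the candidate is frozen, its weight to the output takes the opposite sign to the correlation, and gradient descent in E (Eq. 6) resumes over the output weights.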
Figure 15 (a) The structure of an Elman network with recurrent links and feedback via context nodes to the hidden layer. (b) Recurrent Cascade Correlation constructs similar recurrent links to the hidden nodes.
Recently Giles et al. [41] have argued that this Recurrent Cascade Correlation (RCC) architecture cannot realize all finite state automata, for example, cyclic states of length more than two under a constant input signal. This result was extended to a broader class of automata by Kremer [42]. Indeed, apart from RCC, there has been very little other work on the constructive generation of recurrent architectures [43] or on investigations of their representational capabilities [44].

3. Applications of Cascade Correlation

The Cascade Correlation algorithm has been used in a wide variety of applications, and it is certainly the most popular choice of constructive algorithm for this purpose. A number of applications in machine vision have been reported. Masic et al. [45] have used the algorithm to classify objects in scanned images for an automatic visual inspection system. In this case, the network was used to classify five different objects, with the feature vectors extracted from two-dimensional circularly scanned images. The system was able to classify partially occluded objects with a high degree of accuracy. Cascade Correlation has also been used in the segmentation of magnetic resonance images of the brain [46]. In this study the performance of the Cascade Correlation algorithm was compared to a fuzzy clustering technique. The latter algorithm was observed to show slightly better segmentation performance when compared on the raw image data. However, for more complex segmentations with fluid and tumor edema boundaries, the performance of Cascade Correlation was roughly comparable to the fuzzy clustering algorithm. Zhao et al. [47] used a Cascade Correlation neural network to classify
facial expressions into six categories of emotion: happiness, sadness, anger, surprise, fear, and disgust. After training, the network was successful in correctly classifying 87.5% of a test set. In chemistry, the algorithm was used to train a network for both quantitative and qualitative prediction on complex ion mobility spectrometry data sets with 229 spectra of 200 input points and 15 output classes [48], achieving significantly better results than those obtained from other techniques, such as partial least-squares regression. Cascade Correlation has been used in a number of other applications, such as cereal grain classification [49], learning the inverse kinematic transformations of a robot arm controller [37], the classification of cervical cells [50], and continuous speech recognition [51].
C. NODE CREATION AND NODE-SPLITTING ALGORITHMS

1. Dynamic Node Creation

The Dynamic Node Creation method of Ash [52] adds fully connected nodes to the hidden layer of a feed-forward neural network architecture trained using Back Propagation. Training starts with a single hidden node and proceeds until the functional mapping is learned (the final error is below a tolerance), or the error ceases to descend and a new hidden node is added. After the addition of a new node, both the weights into the new node and the previous weights are retrained. The algorithm retains a history of width w for the error averaged over all output nodes and patterns. If this error ceases to descend quickly enough during the width of this history, a new node is added to the hidden layer. Thus if E is this average error, then a new node is added if

|E^t − E^(t−w)| / E^s < δ_T,     (10)

where t denotes the current training epoch (number of presentations of the entire training set), s is the epoch at which the last node was added, and δ_T is the trigger slope parameter, which is the threshold for extra node addition. Dynamic node creation has the advantage that the procedure starts with a small network that can be trained and evaluated quickly. As new nodes are added, the preexisting weights are generally in a favorable region of weight space for finding an eventual solution, so the algorithm spends the least time in the final training stages before convergence. For hard benchmark problems such as n-parity, mirror symmetry, and the encoder problem, the algorithm always converged on a solution that was frequently close to known best solutions. The algorithm does have the disadvantage that the trigger parameter must be found by experimentation, because there is no theoretical guidance for a good choice of this parameter.
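In code, the trigger test is a few lines. The sketch below assumes the reconstructed form of Eq. (10): a node is added once a full history window has elapsed and the fractional drop in the average error over the last w epochs, relative to the error when the previous node was added, falls below δ_T.

def should_add_node(errors, s, w, delta_T):
    # errors: average error per epoch so far; s: epoch at which the last
    # node was added; w: history width; delta_T: trigger slope parameter.
    t = len(errors) - 1
    if t - s < w:                 # wait for a full history window
        return False
    drop = abs(errors[t] - errors[t - w]) / errors[s]
    return drop < delta_T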
2. Hanson's Meiosis Networks

Meiosis networks were proposed by Hanson [53] and construct a feed-forward neural network by using a node-splitting process in the single hidden layer, with the error dictating the splitting process. However, rather than just adding a new node to the hidden layer, a suitable hidden node is now selected and split. The selection is made on the basis of the ratio of variance to mean for the weight distributions at each hidden node. For nodes in the hidden layer, the summed inputs φ_j^μ = Σ_k J_jk x_k^μ are evaluated by using weights J_jk sampled from a distribution:

J_jk = μ_w + σ_w N(0, 1),     (11)
where μ_w and σ_w are the mean and standard deviation at weight W_jk, and N(0, 1) is a random normal deviate with mean 0 and standard deviation 1. Training is accomplished by modifying the means μ_w and standard deviations σ_w for each weight distribution, using an error gradient procedure. The mean is modified by using a standard gradient descent procedure:

μ_ij^(t+1) = μ_ij^t − η ∂E/∂J_ij,     (12)
where t is the iteration index, η is the learning rate, and ∂E/∂J_ij is the weight gradient evaluated with value J_ij. The standard deviations are updated in a similar fashion, so that errors will increase the variance:

σ_ij^(t+1) = ζ |∂E/∂J_ij| + λ σ_ij^t,     (13)
where ζ is a learning parameter and λ is a decay parameter that must be less than 1. If λ is smaller, the system settles faster, whereas if it is larger the system will tend to jump around more before settling. Starting with one hidden node and initial means and variances chosen randomly, the algorithm is therefore as follows.

A7. Hanson's Meiosis Algorithm.

1. We present an input vector and make a forward pass to find the outputs.

2. We find the differences between outputs and targets and use these errors to update the means and variances of the weight distributions (the errors appear via the derivatives ∂E/∂J_ij).

3. We evaluate the variance/mean ratio for each hidden node for both the input and output weights. If these ratios are greater than a predefined threshold, then the corresponding nodes are split.

4. Having split certain hidden nodes, the corresponding weights are assigned half the variance of the old nodes, and the new means for the split nodes are differentiated by introducing a small random noise, W_ij → W_ij ± δ.

5. Training ceases when the network error is below a given tolerance.
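Steps 3 and 4 of A7 are summarized in the following sketch, which assumes each hidden node keeps mean and standard-deviation arrays for its input and output weight distributions; the splitting threshold and noise scale delta are illustrative parameters rather than values from [53].

import numpy as np

def meiosis_split(mu_in, sigma_in, mu_out, sigma_out, threshold=1.0, delta=0.01):
    # mu_in[j], sigma_in[j]: means and standard deviations of the input
    # weight distribution of hidden node j (likewise for output weights).
    new_nodes = []
    for j in range(len(mu_in)):
        r_in = np.sum(sigma_in[j]**2) / (np.sum(np.abs(mu_in[j])) + 1e-12)
        r_out = np.sum(sigma_out[j]**2) / (np.sum(np.abs(mu_out[j])) + 1e-12)
        if r_in > threshold and r_out > threshold:     # step 3: split this node
            for sign in (+1, -1):                      # step 4: two offspring
                new_nodes.append((
                    mu_in[j] + sign * delta * np.random.randn(len(mu_in[j])),
                    sigma_in[j] / np.sqrt(2.0),        # half the variance
                    mu_out[j] + sign * delta * np.random.randn(len(mu_out[j])),
                    sigma_out[j] / np.sqrt(2.0)))
        else:
            new_nodes.append((mu_in[j], sigma_in[j], mu_out[j], sigma_out[j]))
    return new_nodes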
Based on experiments presented in [53], Meiosis seems to be reasonably effective, but it is rather slow when the number of input nodes is large. The parameters are also found by experimentation, and the performance can be quite sensitive to the choice of these parameters. For example, if the decay parameter λ is too small, node splitting may not occur, and if it is too high, then too many nodes may be generated. This experimentation to find the most suitable parameters adds to the computational load. Hanson's Meiosis method gives a criterion for splitting nodes based on those weight distributions exhibiting a large variance. The magnitude of the divergence is well defined, but the directions are not. This causes most splits to slip straight back to the original node's position without any improvement in network performance.

3. Other Node-Splitting Algorithms

An alternative scheme has been proposed by Wynne-Jones [54], in which there is an attempt to find the best location for the two splitting nodes. In this scheme we freeze the current weights in the network and find the weight updates that would be made for each training pattern. The distribution of weight updates is illustrated in Fig. 16, where the bold line indicates the current vector of weights leading into the hidden node, and the arrows indicate the scatter of proposed new weight updates (which would average to zero at a minimum of the error). There may exist a clear clustering of new updates (indicated by the dashed arrows in Fig. 16). If
Figure 16 After the error has reached a plateau, the proposed weight updates form a distribution averaging to zero net displacement in weight space. The best node-splitting choice would be in the direction of largest variance.
this principal direction can be found, the algorithm of Wynne-Jones [54] finds the variance in this direction, and new nodes are placed one standard deviation to either side of the old position. The direction for placing the new nodes and the magnitude of their distance from the tip of the current weight vector are calculated by principal component analysis. To find this best direction, we evaluate the covariance matrix for the set of proposed weight updates Δ^μ W_ij:

C = (1/p) Σ_{μ=1}^p (Δ^μ W)(Δ^μ W)^T.     (14)
The largest eigenvalue for this matrix is the variance for the split, and the corresponding eigenvector is the direction. There are a number of possible ways of performing principal component analysis, but because we have a large matrix, an iterative estimation procedure is most appropriate. We do not outline the method here, but refer to [55, 56] for a more complete discussion and outline of the methods. Wynne-Jones [54] has experimented with a number of different splitting criteria and a variety of classification problems to evaluate this algorithm. Unfortunately, it was found that the node-splitting technique was frequently unable to improve on the performance of the network used as the seed for the initial split. The most likely reason for this is that the splitting may correct some misclassified patterns, but a greater number of new misclassifications may be introduced because of the long-range effects of these decision boundaries. Consequently, the newly created nodes may revert to the decision boundary of the node from which they were created, with no overall benefit in terms of classification performance. One way of circumventing this problem is to avoid long-range decision boundaries, and instead use a network of localized receptive fields such as the radial basis function (RBF) networks we now describe. In this case simulations suggest that the splitting technique works very well.
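The placement rule can be sketched as follows: form the covariance matrix of the per-pattern weight updates (Eq. 14), extract the principal eigenvector by power iteration (one instance of the iterative estimation alluded to above), and place the two offspring one standard deviation either side of the old weight vector. This is a minimal sketch under those assumptions, not the implementation of [54].

import numpy as np

def split_node(W, updates, iters=100):
    # W: current weight vector of the node; updates: p x n matrix of
    # proposed per-pattern weight updates (rows average to ~0 at the
    # error plateau).
    C = (updates.T @ updates) / updates.shape[0]   # covariance, Eq. (14)
    v = np.random.randn(len(W))
    for _ in range(iters):                         # power iteration
        v = C @ v
        v /= np.linalg.norm(v)
    sd = np.sqrt(v @ C @ v)                        # s.d. along principal direction
    return W + sd * v, W - sd * v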
D. CONSTRUCTING RADIAL BASIS FUNCTION NETWORKS

RBF networks are a simple but powerful network architecture that has been used in a variety of applications. Whereas multilayer learning algorithms such as Back Propagation construct the architecture of the network by using separating hyperplanes, RBF networks develop functional approximations using combinations of local basis functions. The RBF network has a feed-forward architecture with a single hidden layer, with nonlinear updating functions that we will denote f(·). The output nodes usually have simple linear updating functions. If the weights from hidden nodes j to the output nodes i are denoted W_ij, then the RBF mapping
can be written

y_i^μ = Σ_j W_ij f_j(x^μ) + W_i0.     (15)
As before, the bias W_i0 can be included in the summation by incorporating an extra hidden node f_0 with updating fixed at +1, and consequently we can write the above equation compactly as y_i^μ = Σ_{j=0}^q W_ij f_j(x^μ). There are a number of possible choices for the basis functions on the single hidden layer. One obvious choice is to use Gaussian basis functions:

f_j(x^μ) = exp(−‖x^μ − r_j‖² / 2σ_j²),     (16)

where x^μ is the input vector, r_j is a vector that determines the center of basis function j, and the parameter σ_j controls the width of basis function j. Training RBF networks therefore amounts to finding the hidden-to-output weights W_ij and the parameters r_j and σ_j for the hidden nodes (Fig. 17). Assuming an arbitrary number of basis functions and with minor restrictions on the choice of basis function, it has been shown that an RBF network is capable of universally approximating any arbitrary function. Girosi and Poggio [57] have also shown that RBF networks are capable of finding the unique function that has the minimum approximating error. In early applications of RBF networks, the number of basis functions was selected arbitrarily, and the positions of the centers were found by randomly selecting examples from the data set (this has the advantage that the centers are most
Figure 17 An RBF network consists of a single hidden layer (the basis functions) interposed between the input nodes and the linear output nodes. Each hidden node represents a basis function, and the lines from inputs k to hidden node j represent the trainable parameters r_j and σ_j. The lines from the hidden nodes to the output i represent the trainable weights W_ij. The bias in the output layer is handled by an extra "basis function" clamped at +1.
likely to be concentrated in the regions of highest data density). The variances could then be chosen to be half the distance to the nearest center, for example. Rather than guessing the number of basis functions and their parameters, it makes more sense to follow a principled construction of the network through the growth of hidden nodes as required. The orthogonal least-squares (OLS) method proposed by Chen [58, 59] is such a constructive procedure which, starting from the empty set, adds one basis function at a time until the residual error is below a tolerance. The main idea is to select a subset of the input training vectors x^μ as the centers of the basis functions r_j (we will select the σ_j so that the difference between x^μ and r_j in (16) gives a Euclidean distance). The selection of the best candidate basis function is based on minimization of an error function. If input x^μ is presented to the RBF network, then the predicted output would be

O_i^μ = Σ_j W_ij f_j(x^μ).     (17)
To train the network, we then minimize the difference between this and the expected outcome:

E = Σ_{μ,i} (y_i^μ − O_i^μ)².     (18)

For linear updating functions at the output nodes, the minimization of this error function leads to a set of simultaneous linear equations for determination of the weights between the hidden layer and outputs:

W = (F^T F)^(−1) F^T y,     (19)
in matrix notation, where F is the design matrix:

    F = | f_1(x^1)  f_2(x^1)  ...  f_q(x^1) |
        | f_1(x^2)  f_2(x^2)  ...  f_q(x^2) |     (20)
        |   ...       ...            ...    |
        | f_1(x^p)  f_2(x^p)  ...  f_q(x^p) |
In matrix notation, we can therefore write O = FW, and by using Eq. (19), the difference between the expected and predicted output becomes

y − O = y − F(F^T F)^(−1) F^T y     (21)
      = (I − F(F^T F)^(−1) F^T) y     (22)
      = P y,     (23)
where P is called the projection matrix. We have a set of candidate basis functions that are columns of the design matrix with centers x^μ and which we will denote (g_1, g_2, ..., g_m) (m will decrease as candidates are selected). If we select a candidate basis function l, then the projection matrix is updated through [60]

P_(m+1) = P_m − (P_m g_l g_l^T P_m) / (g_l^T P_m g_l),     (24)

where P_m is the projection matrix for the m basis functions of the current selection, and P_(m+1) is the projection matrix after the addition of l. The change in the error for candidate l is found from the current projection matrix [60]:

ΔE = E_(m+1) − E_m = −(g_l^T P_m y)² / (g_l^T P_m g_l).     (25)
The maximum decrease is used to find the best basis function to add to the network. This forward selection procedure can be speeded up by using a Gram-Schmidt orthogonalization procedure [58-60]. As a result of this orthogonalization procedure, each new candidate basis function will be orthogonal to the others (hence more efficiently spanning the space). If all of the x^μ are used as centers, then the error will fall to zero, which may lead to overfitting; hence a reliable stopping criterion should be used (Section V.D). Apart from the orthogonal least-squares algorithm of Chen [58, 59], a number of other techniques have been proposed for constructing radial basis function networks. Holcomb and Morari [61] have also proposed an RBF model in which the radial basis functions associated with hidden nodes are approximately orthogonal to each other, with the locations of the hidden nodes found by optimization techniques. However, the stopping criterion suggested may be problem dependent. The technique proposed by Lee and Rhee [62] similarly starts with one hidden node with a large width and then constructs additional nodes, adapting the associated widths and centers as required. Instead of expanding the network, the algorithm proposed by Musavi et al. [63] begins with a large number of nodes, which are merged where possible, with the associated widths and locations of these nodes changed accordingly. Both the algorithms proposed by Musavi et al. [63] and Lee and Rhee [62] are based on hierarchical clustering of the pattern vectors, and this clustering is not optimal in general. Billings and Zheng [64] have investigated the use of genetic algorithms to optimize the selection of the number of hidden nodes and their locations. Unlike most gradient descent procedures, genetic algorithms can escape from local minima. Simulations indicate that the network complexity can be significantly reduced compared to other methods, but genetic algorithms are computationally intensive, and for this reason the training process can take some time or become prohibitive for a large number of inputs. For classification tasks, a simple but effective growth strategy called dynamic decay adjustment has been outlined by Berthold and Diamond [65]. In this case
Figure 18 An RBF node used in the dynamic decay adjustment algorithm. It has two thresholds, T^+ and T^−, which define a region of conflict around the node. Training proceeds so that no pattern is in such a region of conflict and all patterns lie within the inner region defined by T^+ for some node.
the RBF nodes in the hidden layer have two thresholds, T^+ and T^−, where T^+ is always greater than T^− (Fig. 18). If a pattern is misclassified during training, then two steps are possible. If none of the current RBF nodes of the correct class has an activation greater than T^+, then a new node is introduced with initial weight 1; otherwise the weight of the best correctly classifying node is altered accordingly. In both cases the radii of RBF nodes belonging to the wrong class are reduced, so that no node of the wrong class has an activation above T^−. This method separates the two classes within relatively few presentations of the pattern set. Performance is also not critically dependent on the choice of the threshold values; Zell [66] reports that a choice of T^+ = 0.4 and T^− = 0.2 was suitable for most applications. RBF networks have not been as widely applied as feed-forward networks, but they have still been successfully used in a broad range of problems, such as speech recognition [67], adaptive channel equalization [58, 68, 69], mapping ocean sediments [70], image processing [71], and medical diagnosis [72]. Adaptive construction of RBF networks has clear performance advantages over a random choice of the number of hidden nodes and their centers in these applications. For example, the OLS method has been successfully used in control applications [59, 73, 74] and disturbance rejection [75], generating RBF networks that are economical with the number of hidden nodes and show good performance.
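A compact rendering of the OLS forward-selection loop follows, using the projection-matrix updates of Eqs. (24) and (25). It is a didactic sketch under simplifying assumptions (one output, the Gram-Schmidt speed-up of [58-60] and numerical safeguards omitted), with Gaussian basis functions centered on training inputs as described above.

import numpy as np

def gaussian_column(X, center, sigma):
    # one candidate basis function evaluated on all p training inputs (Eq. 16)
    return np.exp(-np.sum((X - center)**2, axis=1) / (2.0 * sigma**2))

def ols_select(X, y, sigma, tol=1e-3, max_nodes=20):
    # Forward selection of RBF centers by orthogonal least squares.
    # X: p x n training inputs; y: length-p target vector.
    p = X.shape[0]
    candidates = [gaussian_column(X, X[k], sigma) for k in range(p)]
    P = np.eye(p)                                   # projection matrix, no basis yet
    chosen = []
    while len(chosen) < max_nodes and candidates:
        # Eq. (25): pick the candidate giving the largest error decrease
        scores = [(g @ P @ y)**2 / (g @ P @ g) for g in candidates]
        g = candidates.pop(int(np.argmax(scores)))
        Pg = P @ g
        P = P - np.outer(Pg, Pg) / (g @ Pg)         # Eq. (24) update
        chosen.append(g)
        if y @ P @ y < tol:                         # residual error is y'Py
            break
    F = np.column_stack(chosen)                     # design matrix, Eq. (20)
    W = np.linalg.lstsq(F, y, rcond=None)[0]        # output weights, Eq. (19)
    return W, F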
E. SUMMARY

Cascade Correlation and adaptive RBF networks have been successfully used in a wide range of applications and are now established techniques. Apart from the algorithms mentioned here, there are many more recent algorithms for dynamically creating or splitting nodes in single or multiple layers [76-85]. The claimed
performances are often good, at least on artificial data sets. However, node creation and splitting algorithms have been less successful in practical applications, possibly because, for some (although not necessarily all) of these approaches, the new hidden nodes do not always systematically decrease the error. In particular, they have also not been extensively used on real-life data sets, so their reliability is unknown. Recent algorithms include techniques that iteratively attempt to estimate the number of required hidden nodes from the dimensions of the input and output spaces [86] and algorithms that dynamically expand and shrink the network to fit the problem [87]. Constructive algorithms that use higher order weights have also been proposed in the context of regression problems [88, 89].
IV. CONSTRUCTING MODULAR ARCHITECTURES

A. INTRODUCTION
So far we have considered constructive methods in which single nodes are added to the network. We now consider techniques in which entire neural network modules are constructed during the training process. Mainly we will consider neural decision trees in which the modules are constructed as decision elements in a classification tree.
B. NEURAL DECISION TREES

Decision trees are a popular approach to pattern classification problems. Each branch node of a decision tree has a set of child nodes, each of which is also a decision tree, and so the structure is recursive. Information is presented at the root node and propagates down the tree until it reaches a terminal or leaf node. The leaf nodes perform no processing and only assign the appropriate class labels to the input patterns (Fig. 19). The essential idea is to solve the problem by using a divide-and-conquer approach, with successive partitionings or splittings of the input space. The splitting procedure usually generates a list of possible splits and then searches through this set to find the best choice. This is continued until each leaf node corresponds to a particular class. Different ways of generating the search space of partitions and evaluating the goodness of a particular split lead to different types of decision tree algorithm. This approach to classification [90-92] has been successfully used in a number of applications, for example, [93, 94], and has the useful feature that a complex rule may be decomposed into a sequence of simpler local classification rules.
Figure 19 The structure of a decision tree. Each circle represents a decision node (the root and branch tree-nodes), and the squares represent leaf nodes, which assign class labels.
Decision trees can be straightforwardly combined with neural networks, because neural network modules can be used to make the decisions at the branch nodes in the tree. For this reason, a number of neural tree models have been proposed [95-102]. For neural computing, the neural tree architecture has the advantage that it can be readily implemented in hardware because of its recursive nature. At a branch tree node in a classification tree, the outcome of the decision causes the loading of one of two sets of weights (Fig. 20). This process continues until a leaf node is reached (indicated perhaps by a flag in the weight register), at which point the result becomes the output for the entire tree. A straightforward way of generating a neural tree is as follows [95].

A8. Construction of a Neural Decision Tree.

1. Attempt to separate the pattern set using the pocket algorithm, etc. The solution will divide the data into two sets, S_1 and S_0, according to the outputs +1 and −1, respectively.

2. For the pattern set S_1, check whether all of the patterns correctly belong to the same class. If this is the case, then we have arrived at a leaf node. If not, then repeat step 1 with this subset and a new node, thus creating two further subsets S_10 and S_11.

3. We repeat the above step with the corresponding set S_0 if it also contains incorrectly classified patterns, generating further subsets S_00 and S_01.

4. Recursively repeat steps 2 and 3 for all subsets until leaf nodes are reached.

Neural decision trees can also be extended to the case of multiple output classes by generalizing the concept of separability. A two-class problem is linearly sep-
Figure 20 The recursive structure of a binary neural tree can be implemented by using one single-layered neural network. At a branch tree node, the output of the network determines which of two sets of weights (A or B) to load from the weight register. The output of a leaf tree node is the output from the entire neural tree.
arable if there exists a hyperplane capable of separating all patterns into two sets according to their class label. To generalize to the case of multiclass separability, Sankar and Mammone [95] define geometrical separability as follows: a q-class problem with features from N-dimensional space is said to be geometrically separable if hyperplanes can be drawn so that every member of class i lies within the same region defined by these hyperplanes and no member of class k ≠ i lies within that region. We can now extend the neural decision tree algorithm to multiple classes. Suppose that tree node t handles a subset S_t of the original training set, and let q be the number of classes in S_t. We use q nodes at tree node t and a local encoding scheme to label the classes, so that each node is activated by a different class. A pattern vector is then classified by the node that has the maximum output value. Thus pattern vector x^μ is classified as class i if, for all k ≠ i,

Σ_j W_ij x_j^μ > Σ_j W_kj x_j^μ.     (26)
The splitting rule used here is therefore to assign all pattern vectors that are classified as class i to child node i. A tree node t therefore has at most q child nodes. Unfortunately, finding the optimal architecture has been shown to be an NP-complete problem [103], and whatever the decision tree algorithm, the resulting tree is generally suboptimal, with a high complexity. Thus, although some reason-
able comparisons with Back Propagation have been made for some tasks (e.g., vowel recognition [104]), it is fair to say that neural decision trees have not outperformed multilayered neural networks or RBF networks for the majority of tasks.
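The recursion of A8 is captured by the short sketch below for the two-class case, with the hypothetical find_weights again standing in for the pocket algorithm of Section VII; the tree is represented as nested dictionaries purely for illustration.

import numpy as np

def build_tree(X, labels, find_weights, depth=10):
    # X: pattern matrix; labels: NumPy array of +/-1 class targets.
    if len(np.unique(labels)) == 1 or depth == 0:
        return {"leaf": 1 if labels.sum() >= 0 else -1}   # leaf node
    W = find_weights(X, labels)                    # step 1: split the data
    S1 = (X @ W) >= 0                              # sets S1 and S0
    if S1.all() or (~S1).all():                    # degenerate split: stop
        return {"leaf": 1 if labels.sum() >= 0 else -1}
    return {"weights": W,                          # steps 2-4: recurse
            "right": build_tree(X[S1], labels[S1], find_weights, depth - 1),
            "left": build_tree(X[~S1], labels[~S1], find_weights, depth - 1)}

def classify(tree, x):
    while "leaf" not in tree:
        tree = tree["right"] if x @ tree["weights"] >= 0 else tree["left"]
    return tree["leaf"]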
C. REDUCING TREE COMPLEXITY

One strategy for reducing over-complexity, called trio learning, has been proposed recently by d'Alche et al. [105]; it improves the generalization performance by optimizing small subtrees. Thus training is not performed on a single tree node but on a subtree consisting of a parent tree node and two child nodes. The algorithm is as follows.

A9. The Trio Learning Algorithm.

1. At the root or branch tree node, we try to partition the data with a single hyperplane (using gradient descent; see Eq. 27). If the pattern set cannot be partitioned, we proceed to step 2; otherwise we stop.

2. Replace this single tree node by a trio of tree nodes (a parent and two child tree nodes).

3. Globally optimize the parameters of this trio to maximize the classification performance of the two child tree nodes (see below).

4. Freeze the parameters of the parent tree node so that each input pattern is assigned to a child tree node.

5. If a child tree node correctly classifies each pattern assigned to it, then it becomes a terminal or leaf tree node; otherwise it is a branch tree node and we return to step 1, possibly growing a further trio.

The parameters in the trio are optimized by a gradient descent procedure [105], where the output of the parent node is now a sigmoidal function (rather than a thresholding function), for example:
f(φ) = 1 / (1 + e^(−βφ)),   φ(x) = W · x + W_0,     (27)
where β is a parameter controlling the slope of the sigmoid. Let p_R = f(φ) be the output of the parent, and define p_L by p_L = 1 − p_R. Furthermore, let o_L and o_R be the outputs of the left and right child nodes (with these nodes having an updating function as in (27)); then the output of the trio of tree nodes will be given by

o(x) = p_L(x) o_L(x) + p_R(x) o_R(x),     (28)

and we can interpret p_s(x) as the probability that input x will be processed by child node s, and o_s(x) as the probability that x belongs to class P^+ [105]. If o_D is
the desired output (a 1 for P^+ and a 0 for P^−), then the task of globally optimizing the performance of the trio reduces to minimization of a suitable error function, such as

E(x) = (1/2) (o(x) − o_D(x))²,     (29)
and this can be achieved by a suitable gradient descent procedure (Section VII.B) in the parent and child weights [105]. To further optimize the performance of the neural tree, the authors also outline a further pruning procedure [105]. For a classification task involving handwritten character recognition, this algorithm consistently outperformed the standard neural tree (A8) in performance on unseen characters [105]. A second approach to improving neural tree performance has been considered by Guo and Gelfand [106]. Many different splitting criteria have been proposed for decision trees, but it has been found that classification performance does not differ significantly between rules [92]. Instead, the performance is substantially affected by the set of features chosen by the splitting rule [92]. In the approach considered by Guo and Gelfand, the branch nodes in the classification tree consist of small multilayered neural network modules, which are trained to extract local features. These authors also propose an efficient pruning algorithm to further improve performance. On handwritten character recognition and waveform recognition, the proposed method outperformed Back Propagation in simulations.
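Under the reconstructed Eqs. (27)-(29), the forward pass of a trio can be written in a few lines; the gradients for the global optimization in step 3 of A9 follow from differentiating this expression. The (W, W0) node parameters are an illustrative representation, not the authors' data structures.

import numpy as np

def sigmoid(phi, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * phi))       # Eq. (27)

def trio_output(x, parent, left, right, beta=1.0):
    # Each node is a pair (W, W0) defining phi(x) = W.x + W0 (Eq. 27).
    p_R = sigmoid(parent[0] @ x + parent[1], beta) # prob. of the right child
    p_L = 1.0 - p_R
    o_L = sigmoid(left[0] @ x + left[1], beta)
    o_R = sigmoid(right[0] @ x + right[1], beta)
    return p_L * o_L + p_R * o_R                   # soft trio output, Eq. (28)

def trio_error(x, o_desired, parent, left, right):
    return 0.5 * (trio_output(x, parent, left, right) - o_desired)**2   # Eq. (29)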
D. OTHER APPROACHES TO GROWING MODULAR NETWORKS
Apart from trees of networks it is interesting to consider the possibility of growing other modular structures. Modular arrangements would be beneficial in terms of hardware implementation, training speed, and the interpretational advantage of decomposing a complex problem into simpler subproblems. Furthermore, there is extensive evidence for modular processing in neural structures such as the neocortex. More generally, modular neural networks may be required for complex information tasks encountered in the real world, where the number of nodes could be very large. Despite the appeal of modular neural networks, this remains a relatively undeveloped subject. Modular schemes have been proposed for fixed architectures with a single layer of modular subnetworks, for example, the mixture of experts model of Jacobs and Jordan [107, 108], and it would be interesting to develop constructive schemes along these lines.
V. REDUCING NETWORK COMPLEXITY

A. INTRODUCTION
In the Introduction we pointed out that constructive algorithms have the potential disadvantage that they can create over-complex networks. Pruning redundant weights and nodes will help reduce this complexity, apart from making implementation easier and highlighting the relevance of the various input nodes. Minimization of an error function (or storing all of the data in a training set) is also an inappropriate measure for stopping the training process, because it may give an over-complex model. Thus we will consider a number of other performance measures that validate the performance of a network in terms of its ability to capture the function or rule generating the training data (Section V.C). Here we will only consider pruning after training. In this case pruning procedures fall into two categories: weight pruning and node pruning. Node pruning can be further classified into input node pruning and hidden node pruning.
B. WEIGHT PRUNING

1. Magnitude-Based Pruning Algorithms

One of the simplest weight-pruning procedures is to successively remove weights according to their magnitude, starting with those with the smallest values. There is a theoretical justification for this rule from calculations based on the statistical mechanics approach to neural computation [109-111]. The rationale is that weights with small values generally have the least effect on the output, and they tend to arise from conflicting requirements imposed on the weights by different patterns. Although this is a simple procedure, it is also effective and often exhibits a performance comparable to that of more sophisticated approaches.
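In code, magnitude-based pruning is a one-line ranking; the sketch below zeroes a given fraction of the smallest-magnitude weights (with retraining between pruning rounds, as usual). The fraction parameter is illustrative.

import numpy as np

def prune_by_magnitude(W, fraction=0.1):
    # Zero out the given fraction of smallest-magnitude weights in W.
    flat = np.sort(np.abs(W).ravel())
    cutoff = flat[int(fraction * (flat.size - 1))]
    return np.where(np.abs(W) < cutoff, 0.0, W)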
2. Optimal Brain Damage

For many models (e.g., Cascade Correlation or dynamic node creation) learning involves the reduction of an error function. In this case more sophisticated procedures can be used. Let us consider the change in an error function due to a change in the value of one of the weights. Let us suppose the weight W_i is changed by an amount ΔW_i; then the corresponding change in the error function is

ΔE = Σ_i (∂E/∂W_i) ΔW_i + (1/2) Σ_ij H_ij ΔW_i ΔW_j + ...,     (30)

where H_ij is the Hessian matrix:

H_ij = ∂²E / (∂W_i ∂W_j).     (31)
If training has proceeded to a minimum of the error function, then ∂E/∂W_i = 0 and the first term in Eq. (30) is zero. In the optimal brain damage scheme of Le Cun et al. [112], the Hessian is approximated by discarding the nondiagonal terms. Neglecting the higher order terms in the Taylor expansion gives

ΔE = (1/2) Σ_i H_ii ΔW_i².     (32)
If the weight W_i is set to zero, then ΔW_i = −W_i, and so the saliency of each weight is given approximately by s_i = H_ii W_i²/2. The weights with the smallest saliencies are the best candidates for pruning. For this scheme to work, we need an approximation for the diagonal terms of the Hessian matrix. If the error function is additive, that is, E = Σ_{μ=1}^p E^μ, these diagonal terms are given by

H_ii = Σ_μ ∂²E^μ/∂W_i²,

where O^μ = f(x^μ) is the output of the node, given the input x^μ. For the standard sum of squared errors expression E^μ = (1/2)(y^μ − O^μ)², this gives

H_ii = Σ_μ [ (∂O^μ/∂W_i)² − (y^μ − O^μ)(∂²O^μ/∂W_i²) ].     (33)

If the training set has been correctly learned, then y^μ ≈ O^μ, and we can effectively discard the second term [112]. Thus the saliency of weight i can be written

s_i = (W_i²/2) Σ_μ (∂O^μ/∂W_i)²,

which can be evaluated straightforwardly for a given O^μ = f(x^μ). The pruning procedure is then as follows.
A10. The OBD Algorithm.

1. Train the network until a minimum of the error function is found.
2. Compute the saliency s_i of each weight.
3. Delete one or more of the low-saliency weights.
4. If a stopping criterion is achieved (see below), then stop; else go to step 1 to correct the small residual error and repeat.
Of course, the approximation of the Hessian using only the diagonal elements can be poor. A technique called optimal brain surgeon (OBS) has been proposed by Hassibi and Stork [113] that does not make this approximation. Let us suppose the weight W_i is set to zero; then the remaining weights are adjusted so as to minimize the increase in the error resulting from removal of this weight. Let us suppose the vector of weight changes is denoted by ΔW. As with the optimal brain damage scheme, the network has already been trained to a minimum of the error function, so the change in the error resulting from ΔW is given by

ΔE = (1/2) ΔW^T H ΔW     (37)

in vector notation. Furthermore, we have set weight W_i to zero, which gives the additional constraint

e_i^T ΔW + W_i = 0,     (38)

where e_i is the unit vector in weight space parallel to the W_i axis. Thus we need to find the weight changes ΔW that minimize ΔE, subject to this constraint. Using the method of Lagrange multipliers, we obtain the following result for the change in the weight vector:

ΔW = −(W_i / [H^(−1)]_ii) H^(−1) e_i,     (39)

which leads to a corresponding change in the error:

ΔE = (1/2) W_i² / [H^(−1)]_ii.     (40)
i _ i £ W ^ H r : _ ^
(41)
where g is the vector of gradients gj = dE/dWj and t is the iteration index. The initial choice for H (at r = 0) can be chosen as a multiple of the inverse matrix,
that is, λI. Consequently, the algorithm actually finds the inverse of H + λI. The OBS algorithm can be summarized as follows.
A11. The OBS Algorithm.

1. Train the network until a minimum of the error function is found.
2. Find the inverse Hessian by using the iterative technique in Eq. (41).
3. Find the error changes ΔE_i for each weight i and eliminate the weight or weights that give rise to the smallest increase in the error function.
4. After pruning, if a stopping criterion is met, stop; else go to step 1.

The OBS algorithm is more computationally intensive than OBD, and OBD is in turn more computationally intensive than straightforward weight elimination based on magnitudes. However, there is a reward for this greater effort, in that numerical simulations suggest that the OBS technique is superior to OBD, which in turn is often superior to magnitude-based pruning techniques. This has been confirmed in simulations [112-115].
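The OBS computations under the reconstructed Eqs. (37)-(41) can be sketched as follows: the inverse Hessian is accumulated recursively from per-pattern gradient vectors, starting from (λI)^(−1), and the weight with the smallest error increase (Eq. 40) is removed with the compensating update of Eq. (39). Parameter values are illustrative.

import numpy as np

def obs_inverse_hessian(grads, lam=1e-4):
    # grads: p x n matrix of per-pattern gradient vectors; returns an
    # estimate of (H + lam*I)^(-1) via the recursion of Eq. (41).
    p, n = grads.shape
    H_inv = np.eye(n) / lam                        # initial choice at t = 0
    for g in grads:
        Hg = H_inv @ g
        H_inv = H_inv - np.outer(Hg, Hg) / (p + g @ Hg)
    return H_inv

def obs_prune_one(W, H_inv):
    dE = 0.5 * W**2 / np.diag(H_inv)               # error increases, Eq. (40)
    i = int(np.argmin(dE))
    W = W - (W[i] / H_inv[i, i]) * H_inv[:, i]     # compensating update, Eq. (39)
    W[i] = 0.0                                     # remove weight i exactly
    return W, dE[i]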
C. NODE PRUNING

Apart from eliminating weights, we can also eliminate entire input or hidden nodes. The most obvious way of assessing the relevance of a node is to consider how well the network would perform in the absence of that node, in particular what the change would be in the error after its elimination:

ΔE = E_without node − E_with node.     (42)

To make this idea more mathematically precise, we can introduce [116] an attentional coefficient α_j for node j that can have a value of 0 or 1. Thus if v_j is the output of node j and it can feed node i (with output u_i), then

u_i = f(Σ_j W_ij α_j v_j),     (43)

and the α_j gates the contribution of node j to node i. This arrangement is shown in Fig. 21. Plainly, if α_j = 0, then node j has no effect on the rest of the network. Using these attentional coefficients, we can therefore write Eq. (42) as

ΔE = E_(α_j=0) − E_(α_j=1).     (44)
Figure 21 A feed-forward network. Semicircular arcs show the locations of the attentional coefficients. Each attentional coefficient gates all of the outputs of hidden or input nodes in a forward direction.
We can approximate this difference by the derivative of the error with respect to α_j:

ΔE = −∂E/∂α_j,     (45)

which can be computed by using an error back-propagation algorithm similar to the method used in Back Propagation [116-118]. This approximation saves on computation without much loss of accuracy in assessing the attentional strength of each node. In practice, this derivative fluctuates sharply, and it is best to moderate it with a component of the previous ΔE value during pruning:

ΔE_j^(t+1) = 0.8 ΔE_j^t + 0.2 ∂E^t/∂α_j.     (46)
Further details and examples are given by Mozer and Smolensky [116].
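As a minimal sketch of this estimate, the smoothing update of Eq. (46) can be maintained as a running array; the argument `dE_da`, holding the freshly back-propagated derivatives, is a hypothetical name.

```python
import numpy as np

def update_relevance(delta_E, dE_da):
    """One smoothing step of Eq. (46): delta_E and dE_da are arrays holding,
    for each prunable node j, the current estimate of the error change and
    the back-propagated derivative dE/da_j (with a_j held at 1)."""
    return 0.8 * delta_E - 0.2 * dE_da

# After training, the node with the smallest estimated error increase
# is the first candidate for removal:
# victim = int(np.argmin(delta_E))
```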
D. SUMMARY
These and other pruning techniques are surveyed further by Reed [119]. To complete the pruning process, we must also have an effective stopping criterion. So far we have suggested minimization of the training error to a less-than-prescribed tolerance. Although it is a common stopping criterion, the training error can introduce a bias due to the choice of data set. Instead, one could use a separate test set or generalized cross-validation [120]. Other criteria, such as the Bayesian evidence [121], the generalized prediction error, or the predicted squared error [122], can also be used and can be more accurate.
VI. CONCLUSION

Because most constructive methods are of recent origin, there remain many open questions regarding their use and performance. One obvious question is whether a deep architecture is preferable to a shallow architecture. Certainly deep cascade architectures can be more economical in their use of hidden nodes, owing to direct connections from inputs to outputs and additional links between hidden nodes. Thus, for example, the XOR problem is solvable with one hidden node using a cascade architecture, but requires a minimum of two hidden nodes if a single hidden layer is used. The powerful higher order feature detectors of cascade architectures may have advantages in certain applications, but there are also long propagation delays through the network, and the increasing number of inputs to deeper hidden nodes makes hardware implementation more difficult. We have experimented with the generalization performance of single hidden layer versus cascade architectures (Sections II.E and II.F), using the same data sets and dichotomy procedure, and we have found that performance is problem dependent, with one or the other architecture being most suitable for a particular problem. However, a proper comparison with theoretical insight from the Vapnik-Chervonenkis dimensions of these models would give more reliable guidance.

Looking over the current development of constructive techniques, it is noticeable that certain areas remain relatively undeveloped. We mentioned the potential advantages of modular construction and the lack of techniques in this area. Furthermore, it was pointed out that, to our knowledge, recurrent networks have only been implemented in the context of Cascade Correlation. It would be very interesting to develop further algorithms for recurrent networks that could handle complex bit strings, for example, in addition to expanding our knowledge of the classes of automata that can be handled by such models.

To date, most applications of constructive algorithms have used the Cascade Correlation algorithm or adaptive RBF networks. There has been very little practical use of algorithms dedicated to generating networks specifically for classification tasks. Part of the reason is that some of these algorithms have worked satisfactorily on artificial problems, but their practicality is unknown, or they have appeared disappointing in application to real-life data sets. Cascade Correlation and adaptive RBF networks have the advantage of established usage and known reliability. However, it would be useful to benchmark these algorithms across a wide range of common data sets to properly compare performance.

Constructive algorithms have the inherent advantages of rapid training in addition to finding both the architecture and the weights. The risk of overfitting may be alleviated through the use of suitable pruning procedures. Ultimately the most important challenge to their use would come from reliable methods for determining the network architecture directly, before training. For example, using Bayesian model comparison [121, 123, 124], networks with different numbers of hidden
nodes will give rise to different evidences, and the model yielding the greatest evidence would be the most suitable choice. However, although there is some correlation between Bayesian evidence and generalization performance [124, 125], it is not clear whether this correlation is close enough to reliably select the best architecture for most data sets [126].
VII. APPENDIX: ALGORITHMS FOR SINGLE-NODE LEARNING Constructive algorithms are generally based on single-node learning: the algorithm attempts to store the training set, using the weights and threshold for a single node, and proceeds to the training of further nodes if this is not possible. The nodes may be binary-valued (e.g., for classification problems) or continuously valued (e.g., for regression problems). For completeness, we outline the main approaches to single-node learning in this appendix.
A. BINARY-VALUED NODES

1. The Fisher Discriminant

Consider the weight vector $W$ normal to the proposed hyperplane and suppose we find the projections of the input vectors onto this vector. The two classes of points will give two distributions for these projections. With the Fisher discriminant method, we further assume that these two distributions are approximately normal, with means and standard deviations $(\mu_{+}, \sigma_{+})$ and $(\mu_{-}, \sigma_{-})$, where
$$\mu_{\pm} = \frac{1}{N^{\pm}} \sum_{\mu \in P^{\pm}} W \cdot x^{\mu},$$
(47)
$$\sigma_{\pm}^{2} = \frac{1}{N^{\pm}} \sum_{\mu \in P^{\pm}} \left( W \cdot x^{\mu} - \mu_{\pm} \right)^{2},$$
(48)
where $N^{+}$ and $N^{-}$ are the numbers of patterns belonging to $P^{+}$ and $P^{-}$, respectively. A good separating hyperplane is one that maximizes the distance between the means and minimizes the sum of the two variances, that is,
$$\max_{\{W\}} \left\{ \frac{(\mu_{+} - \mu_{-})^{2}}{\sigma_{+}^{2} + \sigma_{-}^{2}} \right\}.$$
(49)
This determines the weight vector that maximizes the separation between the two clusters and gives the solution
$$W \propto S_{W}^{-1}\, (m_{+} - m_{-}),$$
(50)
where $m_{+}$ and $m_{-}$ are the class mean input vectors and $S_{W}$ is the total within-class scatter matrix of the two pattern sets.
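The solution can be computed directly; the following is a minimal sketch, with the two classes given as arrays of patterns (rows) and the within-class scatter matrix assumed nonsingular.

```python
import numpy as np

def fisher_weights(X_plus, X_minus):
    """Fisher discriminant direction maximizing Eq. (49).

    X_plus, X_minus -- arrays of shape (N+, d) and (N-, d) holding the
    patterns of P+ and P- as rows."""
    m_plus = X_plus.mean(axis=0)
    m_minus = X_minus.mean(axis=0)
    # total within-class scatter: sum over both classes of (x - m)(x - m)^T
    D_plus = X_plus - m_plus
    D_minus = X_minus - m_minus
    S_w = D_plus.T @ D_plus + D_minus.T @ D_minus
    return np.linalg.solve(S_w, m_plus - m_minus)   # W, up to normalization
```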
The Fisher rule is fast to implement, but the assumption of a normal distribution for the two classes is generally unrealistic, and the solution found can be far from optimal. Its use may be restricted to problems in which the number of inputs is so large that the methods below are too computationally intensive.

2. The Perceptron Algorithm

The perceptron algorithm is a much more efficient algorithm for single-node learning. The well-known Perceptron Convergence Theorem [23, 25] states that the perceptron algorithm will find weights for storing a pattern set after a finite number of iterations through the set, provided a solution exists. For single-node learning with an updating function as in (1), the perceptron algorithm is as follows.

A12. The Perceptron Algorithm.
1. At the first iteration $t = 0$, initialize the weights $W_j$ to random values.
2. Present the pattern set $\mu = 1, \ldots, p$. For each pattern evaluate $\lambda^{\mu} = y^{\mu} \sum_j W_j x_j^{\mu}$. If $\lambda^{\mu} \leq 0$, then update the weights according to $W_j \to W_j + y^{\mu} x_j^{\mu}$.
3. If $\lambda^{\mu} > 0$ for all $\mu = 1, \ldots, p$, or a maximum number of iterations has been exceeded, then stop; else repeat step 2.

The perceptron rule is capable of storing patterns up to the optimal capacity. Given the simplicity of single-node learning, it has been possible to calculate the optimal properties of single-layered neural networks, though only for the idealized case of randomly constructed patterns with binary components [127-130]. We find that for $p$ random patterns and $N$ input nodes, it is theoretically possible to store up to $p = 2N$ input-output pairs in a fully connected single-layered neural network by using the perceptron algorithm (strictly, in the large-$N$ limit). By contrast, the Fisher rule can only store up to $p \approx 0.14N$ patterns [131]. If $p > 2N$, the result is shown in Fig. 22, where the horizontal axis is the storage ratio $\alpha = p/N$ and the vertical axis is the storage error (the minimum fraction of incorrectly stored patterns) [132, 133]. A nonzero error starts at $\alpha = 2$ and gradually increases as $\alpha$ increases. Constructive algorithms based on the Pocket algorithm utilize this curve: as successive hidden nodes are added, the algorithm iterates down this curve until the number of remaining patterns is sparse compared
to the dimensionality of the space, and so it is possible to separate the remaining pattern set. The use of randomly constructed patterns is unrealistic, of course, but the same essential feature holds for real-life data sets.

Figure 22 The theoretical minimum fractional error versus storage ratio for randomly constructed, unbiased, and uncorrelated patterns with binary components. The storage ratio is the number of stored patterns per node (in a fully connected network).

Although the perceptron rule is efficient and capable of finding a quasi-optimal solution, there has been interest in rules that offer additional improvements. In particular, rules have been developed that are faster and not only store the patterns correctly ($\lambda^{\mu} = y^{\mu} \sum_j W_j x_j^{\mu} > 0\ \forall \mu$), but embed them strongly, so that the patterns are still recognized correctly despite the presence of noise (by using the stronger bound $\lambda^{\mu} = y^{\mu} \sum_j W_j x_j^{\mu} > c|W|$, where $|W| = \sqrt{\sum_j W_j^{2}}$). Examples are the Minover algorithm [12] and the Adatron [134] (whereas the perceptron and Minover reduce the error as a power law, the Adatron approaches a solution exponentially fast [135]).
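A minimal sketch of the basic perceptron rule of Algorithm A12, for patterns `X` (rows) with targets `y` in {-1, +1} (the stronger embedding bound is omitted):

```python
import numpy as np

def perceptron(X, y, max_sweeps=1000, seed=0):
    """Algorithm A12: sweep the pattern set, updating W whenever
    lambda^mu = y^mu (W . x^mu) fails to be positive."""
    W = np.random.default_rng(seed).standard_normal(X.shape[1])
    for _ in range(max_sweeps):
        updated = False
        for x_mu, y_mu in zip(X, y):
            if y_mu * (x_mu @ W) <= 0:   # pattern mu not stored
                W += y_mu * x_mu         # corrective update
                updated = True
        if not updated:                  # all lambda^mu > 0: set stored
            break
    return W
```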
3. The Maxover Algorithm, the Thermal Perceptron, and Other Alternatives to the Pocket Algorithm For constructive algorithms we often want to store as many patterns as possible if we cannot store the entire training set. If we want to maximize the number of stored patterns, then it makes sense to train on those patterns that are closest to being stored correctly. The Maxover algorithm [136] is an example of such an algorithm.
A13. The Maxover Algorithm.
1. At the first iteration $t = 0$, initialize the weights $W_j$ to random values.
2. At iteration $t$, evaluate $\lambda^{\mu} = y^{\mu} \sum_j W_j x_j^{\mu}$ for the pattern set $\mu = 1, \ldots, p$. Among those $\mu$ such that $\lambda^{\mu} < c|W|$, where $|W| = \sqrt{\sum_j W_j^{2}}$, find the $\lambda^{\mu}$ that is maximal (most positive, or closest to 0 if negative). Let this be pattern $\nu$; then update the weights according to $W_j^{t+1} = W_j^{t} + y^{\nu} x_j^{\nu}$.
3. If a maximum number of iterations is exceeded or no corrections were required, then stop; else return to step 2.

A similar procedure for finding the optimal linearly separable subset is the thermal perceptron rule proposed by Frean [137]. This rule is based on the following observation. If a pattern is correctly stored and $\lambda^{\mu} = y^{\mu} \sum_j W_j x_j^{\mu}$ is large, then this pattern is less likely to be disrupted by the next weight change than a pattern for which $\lambda^{\mu}$ is small. Similarly, if a pattern is incorrectly stored ($\lambda^{\mu} < 0$), then the weight changes required to correct it will be larger, and therefore much more likely to disrupt the storage of other, correctly stored patterns. Consequently, the weight changes should be biased toward correcting those patterns for which $\lambda^{\mu}$ is close to zero. An effective way of doing this is to bias the weight changes by using an exponential factor:
$$\Delta W_j = \eta\, y^{\mu} x_j^{\mu}\, e^{-|\lambda^{\mu}| / T},$$
(51)
with the "temperature" $T$ controlling the degree of bias against patterns with a large value of $|\lambda^{\mu}|$. For a large value of $T$, we recover the usual perceptron algorithm, because the exponential is virtually unity for any input, whereas a small value of $T$ would mean no appreciable changes unless $|\lambda^{\mu}|$ is nearly zero. Thus the best way of using this algorithm is to gradually anneal the system, that is, to reduce the temperature from a high value of $T$, where the usual perceptron algorithm applies, toward small $T$, thereby biasing the algorithm toward those patterns with small $|\lambda^{\mu}|$ values. For best results it is also advisable to reduce the learning rate $\eta$ from 1 to 0 while the temperature $T$ is reduced. It is possible to show that this algorithm converges by using the perceptron convergence theorem [25], which can be shown to hold at any given value of $T$. This method compares well with the pocket algorithm and other techniques, such as gradient descent [137]. A further temperature-dependent algorithm for finding the maximum linearly separable subset is the Minierror algorithm of Raffin and Gordon [138].
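A minimal sketch of the thermal perceptron rule of Eq. (51); the linear annealing of both $T$ and $\eta$ used here is one simple choice, not necessarily Frean's exact schedule.

```python
import numpy as np

def thermal_perceptron(X, y, T0=5.0, sweeps=200, seed=0):
    """Eq. (51): perceptron updates damped by exp(-|lambda^mu| / T),
    with T and the learning rate eta annealed linearly toward zero."""
    W = np.random.default_rng(seed).standard_normal(X.shape[1])
    for sweep in range(sweeps):
        frac = 1.0 - sweep / sweeps      # goes from 1 down to 1/sweeps
        T, eta = T0 * frac, frac
        for x_mu, y_mu in zip(X, y):
            lam = y_mu * (x_mu @ W)
            if lam <= 0:                 # pattern not yet stored
                W += eta * y_mu * x_mu * np.exp(-abs(lam) / T)
    return W
```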
B. CONTINUOUSLY-VALUED NODES

In the case of continuously valued data, gradient descent techniques offer the best approach to finding a good approximate solution, even if the data set is not linearly separable. Gradient descent techniques need little introduction [117, 118],
but given their use in the Cascade Correlation algorithm and elsewhere, we briefly summarize them here. Suppose we wish to minimize an error function such as the standard quadratic form
$$E = \frac{1}{2} \sum_{\mu} \left( y^{\mu} - O^{\mu} \right)^{2},$$
(52)
where $O^{\mu} = f(\phi^{\mu})$ and $\phi^{\mu} = \sum_j W_j x_j^{\mu}$ (single-node learning is assumed). The simplest gradient descent technique involves iterative descent in weight space, $W_j' = W_j + \Delta W_j$, with
$$\Delta W_j = -\eta\, \frac{\partial E}{\partial W_j}.$$
(53)
This derivative can be straightforwardly evaluated by using the chain rule:
$$\Delta W_j = \eta \sum_{\mu} \left( y^{\mu} - O^{\mu} \right) f'(\phi^{\mu})\, x_j^{\mu};$$
(54)
if $f(\phi) = \tanh(\phi)$, then $f'(\phi) = 1 - f(\phi)^{2}$, for example. A number of much faster variants are possible and should be used in practice. For example, rather than using a constant learning rate $\eta$, it is better to use adaptive step sizes. The adaptive learning rates can be different for different weights [107, 139, 140]:
$$\Delta W_j = -\eta_j\, \frac{\partial E}{\partial W_j}.$$
(55)
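A minimal sketch of single-node batch gradient descent using Eqs. (52)-(54), with $f(\phi) = \tanh(\phi)$:

```python
import numpy as np

def train_node(X, y, eta=0.05, epochs=1000):
    """Batch gradient descent on the quadratic error of Eq. (52)
    for a single tanh node; X holds patterns as rows."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        O = np.tanh(X @ W)                        # O^mu = f(phi^mu)
        # Eq. (54) with f'(phi) = 1 - f(phi)^2
        W += eta * ((y - O) * (1.0 - O ** 2)) @ X
    return W
```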
Second-order methods use a quadratic approximation to the error function and therefore exploit information about the shape of the error surface beyond the gradient. An algorithm along these lines is Quickprop, which was used with Cascade Correlation in Fahlman's original paper [35]. Many other algorithms using second-order information have also been proposed [141-143].
REFERENCES

[1] L. G. Valiant. A theory of the learnable. Comm. Assoc. Comput. Machinery 27:1134-1142, 1984. [2] X. Yao. A review of evolutionary artificial neural networks. Internat. J. Intelligent Systems 8:539-567, 1993. [3] J. H. Friedman. Adaptive spline networks. In Advances in Neural Information Processing Systems, Vol. 3, pp. 675-683. Morgan Kaufmann, San Mateo, CA, 1991. [4] J. H. Friedman and W. Stuetzle. Projection pursuit regression. J. Amer. Statist. Assoc. 76:817-823, 1981. [5] J. N. Hwang, S. R. Lay, M. Maechler, D. Martin, and J. Schimert. Regression modelling in back-propagation and projection pursuit learning. IEEE Trans. Neural Networks 5:342-353, 1994.
[6] J. O. Moody and N. Yarvin. Networks with learned unit response functions. In Advances in Neural Information Processing Systems, Vol. 4, pp. 1048-1055. Morgan Kaufmann, San Mateo, CA, 1992. [7] S. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993. [8] S. I. Gallant. Optimal linear discriminants. In IEEE Proceedings of the 8th Conference on Pattern Recognition, pp. 849-852, 1986. [9] S. I. Gallant. Three constructive algorithms for network learning. In Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, pp. 652-660, 1986. [10] M. Frean. Small nets and short paths: optimising neural computation. Ph.D. Thesis, Center for Cognitive Science, University of Edinburgh, 1990. [11] M. Marchand, M. Golea, and P. Rujan. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11:487-492, 1990. [12] M. Mezard and J.-P. Nadal. Learning in feedforward layered networks: the Tiling algorithm. J. Phys. A 22:2191-2203, 1989. [13] J.-P. Nadal. Study of a growth algorithm for a feed-forward network. Internat. J. Neural Systems 1:55-60, 1989. [14] N. Burgess. A constructive algorithm that converges for real-valued input patterns. Internat. J. Neural Systems 5:59-66, 1994. [15] N. Burgess, R. Granieri, and S. Patarnello. 3D object classification: application of a constructive algorithm. Internat. J. Neural Systems 2:275-282, 1991. [16] N. Burgess, S. D. Zenzo, P. Ferragina, and M. N. Granieri. The generalisation of a constructive algorithm in pattern classification problems. Second workshop on neural networks: from topology to high energy physics. Internat. J. Neural Systems 3:65-70, 1992. [17] M. Frean. The Upstart algorithm: a method for constructing and training feedforward neural networks. Neural Comput. 2:198-209, 1990. [18] C. Campbell and C. Perez Vicente. The Target Switch algorithm: a constructive learning procedure for feed-forward neural networks. Neural Comput. 7:1221-1240, 1995. [19] R. Zollner, H. Schmitz, F. Wunsch, and U. Krey. Fast generating algorithm for a general three-layer perceptron. Neural Networks 5:111-111, 1992. [20] E. B. Baum and Y.-D. Lyuu. The transition to perfect generalization in perceptrons. Neural Comput. 3:386-401, 1991. [21] E. Mayoraz and F. Aviolat. Constructive training methods for feedforward neural networks with binary weights. Internat. J. Neural Systems 7:149-166, 1996. [22] C. Campbell and C. Perez Vicente. Constructing feed-forward neural networks for binary classification tasks. In European Symposium on Artificial Neural Networks, pp. 241-246. D. Facto Publications, Brussels, 1995. [23] N. J. Nilsson. Learning Machines. McGraw-Hill, New York, 1965. [24] C. Campbell, S. Coombes, and A. Surkan. A constructive algorithm for building a feedforward neural network. In Neural Networks: Artificial Intelligence and Industrial Applications, pp. 199-202. Springer-Verlag, New York, 1995. [25] M. Minsky and S. Papert. Perceptrons, 2nd Ed. MIT Press, Cambridge, MA, 1988. [26] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Comput. 4:473-493, 1992. [27] D. Martinez and D. Esteve. The Offset algorithm: building and learning method for multilayer neural networks. Europhys. Lett. 18:95-100, 1992. [28] P. Rujan. Learning in multilayer networks: a geometric computational approach. Lecture Notes in Phys. 368:205-224, 1990. [29] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network.
In Neurocomputing: Algorithms, Architectures and Applications (J. Fogelman, Ed.). Springer-Verlag, New York, 1990.
[30] M. Biehl and M. Opper. Tiling-like learning in the parity machine. Phys. Rev. A 44:6888-6902, 1991. [31] G. Barkema, H. Andree, and A. Taal. The Patch algorithm: fast design of binary feedforward neural networks. Network 4:393-407, 1993. [32] F. M. F. Mascioli and G. Martinelli. A constructive algorithm for binary neural networks: the Oil-Spot algorithm. IEEE Trans. Neural Networks 6:794-797, 1995. [33] N. J. Redding, A. Kowalczyk, and T. Downs. Constructive higher-order network algorithm that is polynomial in time. Neural Networks 6:997-1010, 1993. [34] S. Mukhopadhyay, S. Roy, L. S. Kim, and S. Govel. A polynomial time algorithm for generating neural networks for pattern classification: its stability properties and some test results. Neural Comput. 5:317-330, 1993. [35] S. Fahlman and C. Lebiere. The cascade correlation architecture. In Advances in Neural Information Processing Systems, Vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA, 1990. [36] D. S. Phatak and I. Koren. Connectivity and performance tradeoffs in the cascade correlation learning architecture. IEEE Trans. Neural Networks 5:930-935, 1994. [37] N. Simon, H. Cororaal, and E. Kerckhoffs. Variations on the cascade correlation learning architecture for fast convergence in robot control. In Proceedings of the Fifth International Conference on Neural Networks and Their Applications, Nimes, France, pp. 455-464, 1992. [38] E. Littmann and H. Ritter. Cascade LLM networks. In Artificial Neural Networks (I. Aleksander and J. Taylor, Eds.), Vol. 2, pp. 253-257. Elsevier, Amsterdam, 1992. [39] E. Littmann and H. Ritter. Cascade network architectures. In Proceedings of the International Joint Conference on Neural Networks, Baltimore, MD, 1992, Vol. 2, pp. 398-404. [40] E. Littmann and H. Ritter. Learning and generalisation in cascade network architectures. Neural Comput. 8:1521-1540, 1996. [41] C. L. Giles, D. Chen, G.-Z. Sun, H.-H. Chen, Y.-C. Lee, and M. W. Goudreau. Constructive learning of recurrent neural networks: limitations of recurrent cascade correlation and a simple solution. IEEE Trans. Neural Networks 6:829-836, 1995. [42] S. Kremer. Constructive learning of recurrent neural networks: limitations of recurrent cascade correlation and a simple solution—comment. IEEE Trans. Neural Networks 7:1047-1049, 1996. [43] P. J. Angeline, G. M. Saunders, and J. B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. Neural Networks 5:54-65, 1994. [44] D. Chen and C. Giles. Constructive learning of recurrent neural networks: limitations of recurrent cascade correlation and a simple solution—reply. IEEE Trans. Neural Networks 7:1045-1051, 1996. [45] N. Masic, S. D. Ristov, and B. Vojnovic. Application of the cascade correlation network to the classification of circularly scanned images. Neural Comput. Appl. 4:161-167, 1996. [46] L. O. Hall, A. M. Bensaid, L. P. Clarke, R. P. Velthuizen, M. S. Silbiger, and J. C. Bezdek. A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Trans. Neural Networks 3:672-682, 1992. [47] J. Zhao, G. Kearney, and A. Soper. Classifying expressions by cascade correlation neural network. Neural Comput. Appl. 3:113-124, 1995. [48] P. Zheng, P. D. Harrington, and D. M. Davis. Quantitative analysis of volatile organic compounds using ion mobility spectrometry and cascade correlation neural networks. Chemometrics Intelligent Lab. Systems 33:121-132, 1996. [49] B. Dolenko, H. Card, M. Neuman, and E. Shwedyk.
Classifying cereal grains using back propagation and cascade correlation networks. Canad. J. Electrical Comput. Engrg. 20:91-95, 1995. [50] S. J. McKenna, I. W. Ricketts, A. Y. Cairns, and K. A. Hussein. Cascade correlation neural networks for the classification of cervical cells. In IEE Colloquium on Neural Networks for Image Processing Applications, London, 1992.
[51] I. Kirschning, H. Tomabechi, M. Koyama, and J. I. Aoe. The Time-Sliced paradigm: a connectionist method for continuous speech recognition. Inform. Sci. 93:133-158, 1996. [52] T. Ash. Dynamic node creation in backpropagation networks. Connection Sci. 1:365-375, 1989. [53] S. J. Hanson. Meiosis networks. In Advances in Neural Information Processing Systems, Vol. 2, pp. 533-541. Morgan Kaufmann, San Mateo, CA, 1990. [54] M. Wynne-Jones. Node splitting: a constructive algorithm for feed-forward neural networks. Neural Comput. Appl. 1:17-22, 1993. [55] J. E. Jackson. A User's Guide to Principal Components. Wiley, New York, 1991. [56] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986. [57] F. Girosi and T. Poggio. Networks and the best approximation property. Biol. Cybernetics 63:169-176, 1990. [58] S. Chen, C. Cowan, and P. Grant. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Networks 2:302-309, 1991. [59] S. Chen, S. Billings, and W. Luo. Orthogonal least squares learning methods and their application to non-linear system identification. Internat. J. Control 50:1873-1896, 1989. [60] M. Orr. Introduction to radial basis function networks. Technical Report, Centre of Cognitive Science, University of Edinburgh, 1996. [61] T. Holcomb and M. Morari. Local training for radial basis function networks: towards solving the hidden unit problem. In Proceedings of the American Control Conference, pp. 2331-2336, 1991. [62] S. Lee and M. K. Rhee. A Gaussian potential function network with hierarchically self-organising learning. Neural Networks 4:207-224, 1991. [63] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels. On the training of radial basis function classifiers. Neural Networks 5:595-603, 1992. [64] S. A. Billings and G. L. Zheng. Radial basis function network configuration using genetic algorithms. Neural Networks 8:877-890, 1995. [65] M. Berthold and J. Diamond. Boosting the performance of RBF networks with dynamic decay adjustment. In Advances in Neural Information Processing Systems (G. Tesauro, D. Touretzky, and T. Leen, Eds.), Vol. 7. Morgan Kaufmann, San Mateo, CA, 1995. [66] A. Zell. Stuttgart Neural Network Simulator Manual, Report 6/96. Applied Computer Science, Stuttgart University, 1995. [67] M. Niranjan and F. Fallside. Neural networks and radial basis functions in classifying static speech patterns. Comput. Speech Language 4:275-289, 1990. [68] S. Chen. Radial basis functions for signal prediction and system modelling. Technical Report, Department of Electrical Engineering, University of Edinburgh, 1992. [69] J. Cid-Sueiro and A. Figueiras-Vidal. Improving conventional equalizers with neural networks. In Applications of Neural Networks to Telecommunications (J. Alspector, R. Goodman, and T. C. Brown, Eds.), pp. 20-26. Erlbaum, Hillsdale, NJ, 1993. [70] A. Caiti and T. Parisini. Mapping ocean sediments by RBF networks. IEEE J. Oceanic Engrg. 19:577-582, 1994. [71] T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects. Nature 343:263-266, 1990. [72] D. Lowe and A. Webb. Exploiting prior knowledge in network optimization: an illustration from medical prognosis. Network 1:299-323, 1990. [73] S. Chen and S. A. Billings. Neural networks for nonlinear dynamic system modelling and identification. Internat. J. Control 56:319-346, 1992. [74] S. Chen, S. A. Billings, and P. M. Grant.
Recursive hybrid algorithm for nonlinear system identification using radial basis function networks. Internat. J. Control 55:1051-1070, 1992. [75] S. Mukhopadhyay and K. Narendra. Disturbance rejection in nonlinear systems using neural networks. IEEE Trans. Neural Networks 4:63-72, 1993.
[76] V. Honavar and L. Uhr. Generating learning structures and processes for generalised connectionist networks. Inform. Sci. 70:75-108, 1993. [77] N. Thacker and J. Mayhew. Designing a layered network for context sensitive pattern classification. Neural Networks 3:291-299, 1990. [78] Y. Hirose, K. Koich, and S. Hijiya. Back-propagation algorithm which varies the number of hidden units. Neural Networks 4:61-66, 1991. [79] E. B. Bartlett. Dynamic node architecture learning: an information theoretic approach. Neural Networks 7:129-140, 1994. [80] T. Nabhan and A. Zomaya. Toward generating neural network structures for function approximation. Neural Networks 7:89-99, 1994. [81] X. Liang and S. Xia. Methods for training and constructing multilayer perceptrons with arbitrary pattern sets. Internat. J. Neural Systems 6:233-248, 1995. [82] M. R. Azimi-Sadjadi, S. Sheedvash, and F. Trujillo. Recursive dynamic node creation in multilayer neural networks. IEEE Trans. Neural Networks 4:242-256, 1993. [83] S. G. Romanuik. Pruning divide and conquer networks. Network 4:481-494, 1993. [84] R. Setiono and L. K. H. Hui. Use of a quasi-Newton method in feed-forward neural network construction. IEEE Trans. Neural Networks 6:273-277, 1995. [85] R. Setiono and G. Lu. A neural network construction algorithm with application to image compression. Neural Comput. Appl. 61-67, 1994. [86] Z. Wang, C. Di Massimo, M. T. Tham, and A. Julian Morris. A procedure for determining the topology of multi-layer feed-forward neural networks. Neural Networks 7:291-300, 1994. [87] Y. Q. Chen, D. W. Thomas, and M. S. Nixon. Generating-shrinking algorithm for learning arbitrary classifications. Neural Networks 7:1477-1490, 1994. [88] A. Roy, L. Kim, and S. Mukhopadhyay. A polynomial time algorithm for the construction and training of a class of multi-layer perceptrons. Neural Networks 6:535-546, 1993. [89] C.-H. Choi and J. Y. Choi. Constructive neural networks with piecewise interpolation capabilities for function approximation. IEEE Trans. Neural Networks 5:936-944, 1994. [90] J. R. Quinlan. Induction of decision trees. Machine Learning 1:81-106, 1986. [91] I. K. Sethi and G. P. R. Sarvarayudu. Hierarchical classifier design using mutual information. IEEE Trans. Pattern Analysis Machine Intelligence 4:441-445, 1982. [92] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, Monterey, CA, 1984. [93] I. Bratko and I. Kononenko. Learning diagnostic rules from incomplete and noisy data. In Interactions in Artificial Intelligence and Statistical Methods (B. Phelps, Ed.), pp. 142-153. Gower Technical Press, Aldershot, England, 1987. [94] S. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE Trans. Systems Man Cybernet. 21:660-674, 1991. [95] A. Sankar and R. J. Mammone. Growing and pruning neural tree networks. IEEE Trans. Comput. 42:291-299, 1993. [96] P. E. Utgoff. Perceptron trees: a case study in hybrid concept representations. In Proceedings of the Seventh AAAI National Conference on Artificial Intelligence, pp. 601-606. Morgan Kaufmann, San Mateo, CA, 1988. [97] R. P. Brent. Fast training algorithms for multi-layer neural nets. IEEE Trans. Neural Networks 2:346-354, 1991. [98] J. E. Stromberg, J. Zrida, and A. Isaksson. Neural trees—using neural nets in a tree classifier structure. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 137-140.
IEEE Press, Long Beach, CA, 1991. [99] K. J. Cios. A machine learning method for generation of a neural network architecture: a continuous ID3 algorithm. IEEE Trans. Neural Networks 3:280-291, 1992.
[100] T. Sanger. A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Trans. Neural Networks 2:285-293, 1991. [101] J. Sirat and J.-P. Nadal. Neural trees: a new tool for classification. Network 1:198-209, 1990. [102] M. Marchand and M. Golea. Greedy heuristics for halfspace intersections and neural decision lists. Network 4:67-86, 1993. [103] L. Hyafil and R. L. Rivest. Constructing optimal decision trees is NP-complete. Inform. Process. Lett. 5:15-17, 1976. [104] A. Sankar and R. J. Mammone. Neural Tree Networks, pp. 281-302. Academic Press, San Diego, 1991. [105] F. D'Alche-Buc, D. Zwierski, and J.-P. Nadal. Trio learning: a new strategy for building hybrid neural trees. Internat. J. Neural Systems 5:259-274, 1994. [106] H. Guo and S. B. Gelfand. Classification trees with neural network feature extraction. IEEE Trans. Neural Networks 3:923-933, 1992. [107] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks 1:295-307, 1988. [108] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive mixtures of local experts. Neural Comput. 3:79-87, 1991. [109] M. Bouten. Storage capacity of diluted neural networks. Lecture Notes in Phys. 368:225-236, 1990. [110] M. Bouten, A. Engel, A. Komoda, and R. Serneels. Quenched versus annealed dilution in neural networks. J. Phys. A 23:4643-4657, 1990. [111] P. Kuhlmann, R. Garces, and H. Eissfeller. A dilution algorithm for neural networks. J. Phys. A 25:L593-L598, 1992. [112] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, Vol. 2, pp. 598-605. Morgan Kaufmann, San Mateo, CA, 1990. [113] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, Vol. 5, pp. 164-171. Morgan Kaufmann, San Mateo, CA, 1993. [114] N. Tolstrup. Pruning of a large network by optimal brain damage and surgeon: an example from biological sequence analysis. Internat. J. Neural Systems 6:31-42, 1995. [115] J. Gorodkin, L. K. Hansen, A. Krogh, C. Svarer, and O. Winther. A quantitative study of pruning by optimal brain damage. Internat. J. Neural Systems 4:159-170, 1993. [116] M. C. Mozer and P. Smolensky. Skeletonization: a technique for trimming the fat from a network via relevance assessment. Neural Inform. Process. Systems 1:107-115, 1989. [117] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991. [118] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan Co., New York, 1994. [119] R. Reed. Pruning algorithms—a survey. IEEE Trans. Neural Networks 4:740-747, 1993. [120] G. Golub, M. Heath, and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics 21:215-223, 1979. [121] D. J. C. MacKay. Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network 6:469-505, 1995. [122] J. Moody. Prediction risk and architecture selection for neural networks. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications, NATO ASI Series F136 (J. H. Friedman and H. Wechsler, Eds.), pp. 147-165. Springer-Verlag, New York, 1994. [123] H. H. Thodberg. Ace of Bayes: application of neural networks with pruning. Technical Report no. 1132E, Danish Meat Research Institute, Roskilde, Denmark, 1993. [124] D. J. C. MacKay. A practical Bayesian framework for back propagation networks.
Neural Comput. 4:448-472, 1992.
[125] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Comput. 4:720-736, 1992. [126] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford Univ. Press, Oxford, 1995. [127] T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14:326-334, 1965. [128] E. Gardner. Maximum storage capacity of neural networks. Europhys. Lett. 4:481-485, 1987. [129] E. Gardner. The space of interactions in neural network models. J. Phys. A 21:257-270, 1988. [130] E. Gardner and B. Derrida. Optimal storage properties of neural network models. J. Phys. A 21:271-284, 1988. [131] D. Amit, H. Gutfreund, and H. Sompolinsky. Spin-glass models of neural networks. Phys. Rev. A 32:1007-1018, 1985. [132] J. K. Fontanari and W. K. Theumann. On the computational capability of a perceptron. J. Phys. A 26:L1233-L1238, 1993. [133] W. Whyte and D. Sherrington. Replica-symmetry breaking in perceptrons. J. Phys. A 29:3063-3073, 1996. [134] J. K. Anlauf and M. Biehl. Properties of an Adaptive Perceptron Algorithm, pp. 153-156. North-Holland, Amsterdam, 1990. [135] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. Lecture Notes in Phys. 368:175-188, 1990. [136] A. Wendemuth. Learning the unlearnable. J. Phys. A 28:5423-5436, 1995. [137] M. Frean. A thermal perceptron learning rule. Neural Comput. 4:946-957, 1992. [138] B. Raffin and M. Gordon. Learning and generalisation with Minierror, a temperature-dependent learning algorithm. Neural Comput. 7:1206-1224, 1995. [139] F. Silva and L. Almeida. Speeding up backpropagation. In Advanced Neural Computers, pp. 151-156. North-Holland, Amsterdam, 1990. [140] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the Rprop algorithm. In IEEE International Conference on Neural Networks, San Francisco, CA, 1993, pp. 586-591. [141] R. Battiti. First and second order methods for learning: between steepest descent and Newton's method. Neural Comput. 4:141-166, 1992. [142] S. Becker and Y. Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. Hinton, and T. Sejnowski, Eds.), pp. 29-37. Morgan Kaufmann, San Mateo, CA, 1989. [143] A. I. Shepherd. Second-Order Methods for Neural Networks: Fast and Reliable Training Methods for Multi-Layer Perceptrons. Springer-Verlag, London, 1997.
Modular Neural Networks
Kishan Mehrotra and Chilukuri K. Mohan
Department of Electrical Engineering and Computer Science, Center for Science and Technology, Syracuse University, Syracuse, New York 13244-4100
I. INTRODUCTION

Modularity is perhaps the most important principle underlying modern software engineering methodologies. Although all complex systems consist of interconnected parts, a system is considered modular only if each part has a clearly defined and distinguishable internal structure and purpose, and the linkages between different parts are sparse and well defined. This allows each module to be designed, analyzed, and understood separately; it is a maxim among software developers that modularity is essential to the synthesis and analysis of complex systems.

Modularity also allows us to grapple successfully with physical and biological systems that are too complex to be understood in their entirety by single individuals: a physician who specializes in understanding the digestive system, for instance, abstracts away from most of the rest of the complexity of human anatomy and physiology, and refers to other specialists if it becomes necessary to do so. The digestive system may be considered a module of the human body, weakly coupled with other modules, such as the circulatory system, via a relatively small number of interconnections. Within each such module, we may again find further evidence of specialization in submodules; for example, the functioning of the stomach may be understandable more or less separately from that of the intestines and the esophagus, to which it is connected.

It therefore comes as no surprise that the most marvelous of human organs, the brain, itself exploits the principle of modularity. For instance, different parts of the brain specialize for different tasks, and groups of neurons interact in
complex ways. There is no general agreement on the precise number of modules in the brain, the precise function of each module, or the precise nature of interaction between modules. It is believed that further specialization occurs within a high-level module such as the visual cortex, which is devoted to processing and understanding images. Minsky [1] describes a model that views the human brain as a collection of interacting modules called agents. In Minsky's model, each agent is capable only of performing simple actions, but intelligence emerges from the collective behavior of such agents.

Artificial neural networks and connectionist models have been inspired by studies of the behavior of animal brains. Hence it is logical that such networks must also exploit the principles of modularity that find expression in biological brains, to solve problems of significant complexity. Traditional neural network learning algorithms and models focus on small subtasks, mimicking the low-level behavior of small numbers of neurons, and making assumptions such as exhaustive interconnectivity between all neurons in adjacent layers. Although these algorithms are vital to the foundation of learning systems, a practical approach must necessarily address issues such as modular design if there is to be a good chance of solving complex tasks. To use an engineering analogy: constructing an edifice needs careful design, understanding the edifice as composed of high-level substructures, each of which is constructed from various building blocks and bricks. Similarly, developing an elaborate neural network for a practical application is best done by designing the system in terms of modules, which in turn invoke the learning algorithms that serve as the building blocks for connectionist systems. A building is more than a heap of bricks glued together; yet many neural network practitioners attempt to solve complex tasks by pushing data through a "black box" that is essentially a mass of interconnected neurons.

Neural networks may be modular at different conceptual levels. In a trivial sense, each node or each layer can be thought of as a module. However, our primary concern is with networks that consist of several loosely interconnected components, where each component (module) is itself a neural network. Many successful real-world applications of neural networks have been achieved by using modular architectures. Several papers in the literature address the construction of modular networks, often oriented toward solving a specific task by using a specific architecture; Plaut and Hinton [2] and Jacobs et al. [3] were among the first to address this issue explicitly, although other networks, such as those of Fukushima [4], have also used modular neural networks for difficult problems.

In this chapter, we survey modular connectionist systems, analyzed from the viewpoint of system architecture, with examples of successful applications of modular networks. We describe how modularity may be achieved in neural networks, and how modular neural networks are useful in practice, achieving faster and better quality results for complex tasks, even though a single neural network
is theoretically capable of achieving the desired goal. In most such networks, modularity is planned by the network builder, who is explicitly attempting an optimal use of resources. Such optimality may not be achieved easily by self-organizing networks whose structure develops via evolutionary mechanisms, because their progress is constrained by past history and current configurations.

Section II outlines some of the advantages of using modular neural networks. Various possible modular neural network architectures are described in Section III. Each of these architectures is discussed in greater detail in Sections IV to VII, along with examples. Adaptive modular networks, the internal structure of which evolves during the training process, are discussed in Section VIII. The last section contains a summary and presents issues for further study.
II. WHY MODULAR NETWORKS?

Modularity is the central tenet of modern software engineering practice, and ought to be rigorously applied when neural network software is to be used in applications where the networks may change with time. Maintaining, improving, and enhancing the functionality of a system are known to be much easier if a software system is designed to be modular, and this holds for neural networks as well. Modular neural networks have several advantages over their nonmodular counterparts: training is faster; the effects of conflicting signals in the weight-modification process can be diminished; networks generalize better; the network representation is more meaningful to the user (i.e., the behavior of the network can be interpreted more easily); and practical hardware constraints are met more easily. These advantages are discussed in the remainder of this section.
A. COMPUTATIONAL REQUIREMENTS

In theory, a sufficiently large neural network can always be trained to accomplish any function approximation task [5-8]. In practice, however, the learning algorithm invoked may not succeed in the training task in a reasonable amount of time. The task of learning is much easier if the function to be learned can be decomposed naturally into a set of simple functions, where a separate network can be trained easily to approximate each of the simple functions. Improvements in training time can be attributed to three reasons:

1. Each module often addresses a simpler task, and hence can be trained in fewer iterations than the unimodular network.
2. Each module is small, with fewer weights than the unimodular network, so that the time taken for each module's training iteration is much less than the time taken for the unimodular network's training iteration.

3. Modules can often be trained independently, in parallel.
B. CONFLICTS IN TRAINING

Training algorithms attempt to modify each weight based on information aggregated from signals received from several different neurons and at different times. The performance of the learning algorithm is adversely affected when these signals conflict with each other. One such phenomenon is called spatial cross-talk [3]. In an algorithm such as error back-propagation, weights in the nonoutermost layers are modified based on a weighted sum of the error signals flowing from all of the nodes at the next higher (outer) layer in the network, and these signals may partially neutralize each other. For instance, the error signal from one output node may suggest increasing an interior weight, whereas the error signal from another output node may suggest that the system error would instead be decreased by reducing that connection weight. The resulting net change in the weight may be so small that a considerable amount of time must be spent training the network until the weight modifications become significant enough to decrease the system error to an acceptable level. This problem can be alleviated by using a modular network, because the network consists of distinct parts to which conflicting signals may be channeled. Thus each training signal influences only a proper subset of the modules, avoiding the effects of cross-talk.
C. GENERALIZATION
A neural network is expected to perform reasonably well on test data distinct from the data on which the network has been trained, provided both are drawn from the same distribution or share the same problem characteristics. Unfortunately, this "generalization" property may not be satisfied unless the structure of the network accurately reflects the function to be learned. For example, a multilayer feedforward neural network may be trained to perform well on training data drawn from a linear function, fitting a complicated many-parameter curve to the data, yet perform poorly on new test data relevant to the same function. In this case, a single perceptron unit accurately reflects the nature of the function, and will hence have better generalization properties. If a complex function can be decomposed naturally into a set of simple functions, then its nature is more accurately reflected by a modular network, each of whose component modules mimics the simple functions into which the original
function is decomposed. A nonmodular network may appear to perform well on the training data, but its structure may be significantly different from that of the function to be learned; hence it may not generalize well and may perform poorly on test data. From another perspective, the process of training a nonmodular network involves optimizing the total cost (error) function; as a result, the network may perform well for some subtasks but not for others. By contrast, the modular approach attempts to minimize component cost functions separately, with better generalization qualities for all subtasks.
D. REPRESENTATION
When an expert system is used in an advisory or interactive mode, it is often necessary for the system to explain its conclusions. The reason for reaching a conclusion may be as important as the conclusion itself, especially if the user is making the final decision, based on the nonbinding recommendations of the system. The inability to give reasons or explain conclusions has traditionally been a failing of neural networks, which are data-driven and not model-driven. This problem is alleviated by the use of modular networks, which contain understandable and explainable structure, with individual modules designed to address specific subtasks.
E. HARDWARE CONSTRAINTS

Hardware limitations impose constraints that often require neural network implementations to be modular. The size of any silicon chip is limited, imposing natural restrictions on the size of any network to be implemented on a single chip. In practice, it may be very difficult or expensive to achieve the high degree of interconnectivity between nodes required in a large neural network. Furthermore, the pin count of a silicon chip may be much smaller than the required number of inputs or outputs to a neural network, so that many such chips must be combined to implement the network in hardware; the task of interconnecting the chips is simplified if each chip can correspond to a separate module. Because the hidden nodes in such a chip are not accessible to other chips, it is difficult to encode very large networks in which nodes in adjacent layers are exhaustively interconnected. This is not a concern with modular networks: the network designer's main task is to select the number and configuration of the chips (modules) needed for the task. Arrays of neural network chips have been designed to implement some simple neural networks [9]. To use such arrays effectively, it is desirable to design the neural network to be modular, so that each chip implements one module.
III. MODULAR NETWORK ARCHITECTURES

A system with complex input/output relationships can be decomposed into simpler systems in many ways. What are the various possible ways of putting modules together to form a modular neural network? We categorize modular network architectures as follows:

1. Input decomposition. Each first-level module processes a different subset of the input attributes (dimensions). These subsets need not be disjoint; that is, each input attribute may be fed into several modules. Sparse connectivity is achieved if each input node is connected only to a small number of higher level nodes.

2. Output decomposition. The output vector is separated into its components, and different modules are invoked for each component or subset of components. This is a popular decomposition procedure, particularly in many-class classification problems.

3. Hierarchical decomposition. Modules are organized into a treelike structure, the leaves of which are closest to the inputs, and the root of which generates the system outputs. Each module performs some operations, and each higher level module processes the outputs of the previous-level modules. "Pipelining" may be considered a special case of hierarchical decomposition, with a sequence of operations performed on the input vectors by different modules.

4. Multiple experts. In this architecture, the "opinions" of multiple expert modules (operating independently) are combined in a final decision-making module. Different modules may be specialized for different subtasks, or trained to perform similar tasks. In human decision making in the medical domain, for instance, experts with different expertise may be consulted, and (in some cases) different experts with the same specialization may be consulted to reach a high level of confidence in the final decision-making step. The decision-making module may form a consensus or probabilistic combination of the outputs of these modules, or may use a competitive approach, determining which of the modules has its say for any given element of the input space.

These architectures are examined in greater detail in the next four sections, with example applications.
IV. INPUT DECOMPOSITION A system with multiple inputs may be decomposable into subsystems in such a way that the inputs of each subsystem constitute a proper subset of the inputs of the entire system, as illustrated in Fig. 1. Each subsystem's outputs depend only on its own inputs. The inputs of different subsystems need not be completely
disjoint. For instance, different parts of the human retina and early visual system are devoted to different input regions, although they overlap to some extent.

Figure 1 Input modularity: each first-level module processes a different subset of inputs.
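As an architectural sketch of Fig. 1, the following assembles first-level modules over (possibly overlapping) input subsets and a combining output module; the module internals are simplified here to single tanh layers, and all names are illustrative.

```python
import numpy as np

class InputDecomposedNet:
    """Input decomposition (Fig. 1): module k sees only the attributes
    listed in subsets[k]; an output module combines the module outputs."""

    def __init__(self, subsets, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.subsets = [np.asarray(s) for s in subsets]
        self.W_mod = [0.1 * rng.standard_normal((len(s), n_hidden))
                      for s in self.subsets]
        self.W_out = 0.1 * rng.standard_normal((n_hidden * len(subsets), n_out))

    def forward(self, x):
        h = np.concatenate([np.tanh(x[s] @ W)          # per-module features
                            for s, W in zip(self.subsets, self.W_mod)])
        return np.tanh(h @ self.W_out)                 # combining module

# Overlapping input subsets are allowed, e.g.:
# net = InputDecomposedNet([range(0, 5), range(3, 8)], n_hidden=4, n_out=2)
```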
A. NEOCOGNITRON
In some situations, it is advantageous to decompose a large input array into several small arrays, because it is easier to understand and process smaller arrays. Information from the smaller arrays can be combined at a later stage. This is the essential feature of the Neocognitron, developed by Fukushima for visual pattern recognition [4, 10-12]. In a Neocognitron, the input images are two-dimensional arrays, and the final result of pattern recognition indicates which high-level feature or shape has been found in the entire input image, activating the appropriate output node. The network uses a sequence of many modules, with each module extracting features from the previous module. Each network module consists of two layers of nodes. The first layer in each module, known as the S-layer, searches for the presence of a feature in a specific region of the input image. For a character recognition problem, for instance, one S-layer node may search for a vertical line segment in the top left corner of the input image, as shown in Fig. 2, whereas another may search for a horizontal line segment in the lower left corner of the input image. Modules that are farther away from the input array perform successively higher level pattern recognition, abstracting away from details such as the precise positions of features.
Figure 2 Connections to and within the first module of the Neocognitron. The four outputs of the lowermost array in U_S1 indicate which quadrants of the input image contain vertical line segments.
B. DATA FUSION

In some applications, different kinds of inputs are available to the system, and each kind of input must be processed separately by a dedicated module before information from the different kinds of inputs can be meaningfully combined. For instance, audio and visual inputs may be available, which must first be processed separately (by a specialized audio module and a specialized visual module). This yields better results than invoking a single complex module that attempts to process raw audio and visual inputs simultaneously. For another example, in the medical diagnosis of a patient, some conclusions can be derived from the results of blood tests, and others from sources such as physical symptoms, X-rays, or the patient's verbal description of his condition. In this case it would be desirable to train one network for recognition based on the results of the blood tests, another neural network to learn from the results of the X-rays, and so on.
V. OUTPUT DECOMPOSITION

Sometimes a task can be seen to consist of several subtasks that can be learned independently. For instance, many practical applications of neural networks address classification tasks in which each subtask can be thought of as learning the characteristics of a single class. A neural network can be designed for each subtask, and the overall result is a collection or combination of the results of the smaller neural network modules. The basic architecture is illustrated in Fig. 3.
Figure 3 Output modularity: each first-level module processes a different subset of outputs.
In this approach to modularity, a training set with its own input and output vectors is obtained for each subtask. For instance, in attempting to recognize letters of the English alphabet, it is often desirable to group similar letters together and train one network for each group of letters.

Rueckl et al. [13] have studied the problem of analyzing images in which one of a number of known objects could occur anywhere. They found that training time was shorter when separate networks were used for two tasks: (a) identifying the location of the object in the image, and (b) recognizing the object in the image.

The problem of modularity has been approached from a more pragmatic viewpoint in the area of speech recognition. Waibel [14] has devised a technique called connectionist glue to train modules for different tasks and then connect them together, as shown in Fig. 4. Waibel [14] trained one neural network for recognition of the phonemes B, D, and G (voiced stops) and another neural network to recognize the phonemes P, T, and K (voiceless stops), using separate data sets. The individual neural networks achieved 98.3% and 98.7% correct recognition, respectively, whereas naively combining them into a single neural network achieved only 60.5% correct recognition for the six phonemes mentioned above. Fixing the (previously trained) weights from the inputs to the hidden nodes, and retraining the rest of the weights, improved the performance of the combined neural network to 98.1%. Next, four new nodes were introduced into the hidden layer, with their attached connection weights trained along with all weights from the hidden layer to the output nodes. These new nodes act like "glue" to join the previously trained neural networks. Performance of the network improved from 98.1% to 98.4% with the addition of these extra hidden-layer nodes. Some additional training of this modular neural network further improved the performance to 98.6%, illustrating
Figure 4 Waibel's connectionist modular network.
Anand et al. [15] exploit output modularity for problems involving the classification of input vectors into $K$ disjoint classes, where $K > 2$. When compared to a monolithic network with $K$ output nodes, a modular network with $K$ one-output modules is easier to train and is more likely to succeed in reaching desired error rates. A module for class $C_i$ is trained to distinguish between patterns belonging to class $C_i$ and its complement $\overline{C_i}$. Because a training set for a $K$-class problem may be expected to contain approximately equal numbers of exemplars for all classes, the training set for module $i$ will contain many more exemplars for the complement $\overline{C_i}$ than for $C_i$. Anand et al. [16] provide a training algorithm that performs better than back-propagation on such "imbalanced" problems. The relative performance of the modular network improves as the number of classes increases. The performance of modular and nonmodular networks has been compared on three problems: classifying iris flowers [17], a speech recognition example, and a character recognition example. The training process was faster for the modular network by a factor of 3 for the first two data sets, when compared to a monolithic network. For the third task, the
modular network yielded a speedup of about 2.5 when the target mean square error was relatively high (0.007). However, the nonmodular network was unable to converge to a lower target mean square error (0.003), whereas the modular network did succeed in achieving this criterion.
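The one-module-per-class decomposition described above can be made concrete with a short sketch. The following Python fragment is illustrative only, in the spirit of (but not identical to) the scheme of Anand et al.: the module interface (a callable returning a value in [0, 1]) and the helper names are our own, and the training of the individual one-output modules is left abstract.

```python
import numpy as np

def make_module_datasets(X, labels, num_classes):
    """Split a K-class training set into K two-class ("class vs. complement") sets.

    Module i receives target 1 for patterns of class C_i and 0 for all
    others, mirroring the output decomposition described in the text.
    Note the resulting sets are "imbalanced": the complement dominates.
    """
    datasets = []
    for i in range(num_classes):
        targets = (labels == i).astype(float)  # 1 for class i, 0 for complement
        datasets.append((X, targets))
    return datasets

def modular_predict(modules, x):
    """Classify x as the class whose one-output module responds most strongly.

    `modules` is any sequence of callables mapping an input vector to a
    scalar in [0, 1]; how each module is trained is left to the reader.
    """
    responses = [m(x) for m in modules]
    return int(np.argmax(responses))
```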
VI. HIERARCHICAL DECOMPOSITION
A system with multiple outputs and multiple inputs can sometimes be decomposed into simpler multi-input multi-output modules arranged in a hierarchy, as illustrated in Fig. 5. The outputs of lower level modules act as inputs to higher level modules. The simplest kind of modularity involves pipelining, shown in Fig. 6, which is useful when the task requires different types of neural network modules at various stages of processing.
Figure 5 (a) Hierarchical organization. Each higher level module processes the outputs of the previous level module. (b) Successive refinement. Each module performs some operations and distributes tasks to the next higher level modules.
Figure 6 Pipelining architecture.
Yang et al. [18] present an illustrative example of the hierarchical approach. They solve the problem of partial shape matching by using a two-stage algorithm in which each stage contains association network modules (resembling Hopfield networks). The goal of the network is to determine which of a set of model shapes are present in an input shape; shapes may overlap or be partially occluded or otherwise distorted. Each input vector is an attributed string that represents a shape [19]. In the first stage, each training set shape is compared with the input shape to find matching attributes between the input shape and the training shape. This requires one associator module for each training shape. Output from this layer of modules measures the similarity between the input shape and the training shapes. Results of the better matching first-stage modules are fed into a second-layer modular network for final decision making. Yang et al. [18] show that this approach performs much better than a monolithic network in analyzing complex input shapes. Anand et al. [20] developed a modular architecture to solve a problem in analyzing two-dimensional nuclear magnetic resonance (NMR) spectra to determine the composition of organic macromolecules. The input to the system is a large matrix obtained from an NMR spectrum, which is generally very sparse; the location and intensity of each nonzero entry support the hypothesis that one amino acid is present. The difficulty of the problem arises because of noise and random fluctuations in these entries. In the modular neural network, each module detects the presence of patterns that belong to one class (amino acid) in the input image, utilizing output decomposition. Each module has two stages: the first stage is a feature detector network that performs clustering, and the second stage is a standard back-propagation-trained feedforward neural network. The first stage exploits a self-organizing feature map [21] to preprocess the input and extract essential features, yielding seven new data sets, one for each amino acid. A feedforward network is trained on each data set, using back-propagation, to detect the presence of one amino acid. Figure 7 illustrates a schematic diagram of the complete network. The combined modular network can detect the presence of multiple amino acids in the NMR spectra being analyzed. This approach was successful in obtaining high correct classification rates (87%) on 256 × 256 images with noisy test data. Lee [22] has applied a modular neural network architecture to the classification problem of handwritten numerals. As in Anand et al. [20], the first layer of
Figure 7 Two-stage modular network for analyzing NMR data.
modules is designed to extract features, the output of which is piped into nodes in the second layer, where final classification occurs. The first layer contains five feedforward modules, each with a 4 × 4 array of hidden nodes. The second layer consists of 10 output nodes. Each first-layer module is separately trained; after the trained modules are put together, the weights are retrained. In contrast with this network, the monolithic feedforward network is very large, with node arrays of size 5 × 4 × 4 at the input and hidden layers, and 10 output nodes; the number of connection weights is very large because all nodes in one layer are connected to all nodes in the next layer. Lee shows that the performance of the monolithic network is poor; for example, the modular network achieves a classification rate of better than 93% after 400 epochs of training, whereas the performance of the monolithic network remains below 90%, even after 1000 epochs.
VII. COMBINING OUTPUTS OF EXPERT MODULES
In competitive neural networks such as Kohonen's self-organizing map [21], different nodes establish their primacy over different regions of the input data space, and the final network output for an input vector is the same as that of the node whose weight vector is nearest the input vector. Variations of this approach construct the network output as a weighted average of the node outputs, giving greater importance to nodes whose weight vectors are closer to the input vector. In their
use for function approximation, radial basis function (RBF) networks also construct the system output as a weighted average of some of the node outputs. The same principle can be extended to modular networks: each module is associated with a prototype or reference vector whose proximity to an input vector is used to determine the relative importance of the module for that input vector. Often each node in an RBF network computes a Gaussian function, suggesting that the approach may be evaluated in the context of previous statistical work on Gaussian distribution functions.
An interesting inference problem in statistics, often observed in industrial applications as well as in nature, involves parameter estimation in a mixture of distributions. The problem may be formally stated as follows: Let $d$ be a random vector. Let $p_i$ denote the a priori probability that the density function of $d$ is given by $f(d; \theta_i)$, for $i = 1, \ldots, m$, with $\sum_{i=1}^{m} p_i = 1$. Then the sum
$$g(d; p_1, \ldots, p_m, \theta_1, \theta_2, \ldots, \theta_m) = \sum_{i=1}^{m} p_i f(d; \theta_i)$$
is also a density function. It is known as the mixture density function* and represents the unconditional density of $d$. Unknown a priori probabilities $(p_1, \ldots, p_m)$ and unknown parameters $(\theta_1, \ldots, \theta_m)$ are estimated by using a sample (set of available vectors) $\{d_1, d_2, \ldots, d_n\}$ from $g(d; p_1, \ldots, p_m, \theta_1, \theta_2, \ldots, \theta_m)$. The maximum likelihood method, discussed in Section VII.A, is a favorite estimation procedure among statisticians. We discuss its neural network implementation in Section VII.B. Another possible statistical estimation procedure is the EM algorithm. This method and its neural network implementation are discussed in Section VII.D.
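As a numerical illustration of the mixture density just defined, the following sketch evaluates $g$ for univariate Gaussian components; the Gaussian form of $f$ and all parameter values are assumptions made only for this example.

```python
import numpy as np

def mixture_density(d, priors, means, sigmas):
    """Evaluate g(d; p_1..p_m, theta_1..theta_m) where each component is
    the univariate Gaussian f(d; theta_i) = N(mean_i, sigma_i^2)."""
    priors = np.asarray(priors)
    assert np.isclose(priors.sum(), 1.0), "a priori probabilities must sum to 1"
    means, sigmas = np.asarray(means), np.asarray(sigmas)
    comps = np.exp(-0.5 * ((d - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return float(np.dot(priors, comps))  # sum_i p_i f(d; theta_i)

# g is itself a density: a two-component mixture evaluated at d = 0.5
print(mixture_density(0.5, priors=[0.3, 0.7], means=[0.0, 1.0], sigmas=[1.0, 0.5]))
```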
A. MAXIMUM LIKELIHOOD METHOD
For notational convenience, we use $\Theta$ to denote the vector of all $\theta_i$'s, $\mathcal{X}$ to denote the sample $\{d_1, d_2, \ldots, d_n\}$, and $p$ to denote the vector $(p_1, \ldots, p_m)$. Assuming that $d_1, \ldots, d_n$ are chosen independently and randomly with density $g(d; p, \Theta)$, the "likelihood function" of $\Theta$ and $p$ is given by the joint density
$$L(\Theta, p \mid \mathcal{X}) = \prod_{j=1}^{n} g(d_j; p, \Theta) = \prod_{j=1}^{n} \left( \sum_{i=1}^{m} p_i f(d_j; \theta_i) \right).$$
*We have formulated the problem in terms of continuous random variables and densities, but a similar definition applies for discrete random variables as well.
In the maximum likelihood method, estimates of the unknown parameters are obtained by maximizing the above function or (equivalently) its logarithm. Using the classical maximization approach, it suffices to solve the following system of equations:
$$\frac{\partial}{\partial \theta_k} \log L(\Theta, p \mid \mathcal{X}) = 0, \quad \text{for } k = 1, \ldots, m,$$
$$\frac{\partial}{\partial p_k} \log L(\Theta, p \mid \mathcal{X}) = 0, \quad \text{for } k = 1, \ldots, m.$$
It is easy to verify which solutions of these equations give the maxima (instead of the minima). Because each $p_i$ is constrained to satisfy $0 \le p_i \le 1$ and $\sum_{i=1}^{m} p_i = 1$, the relation
$$p_k = \frac{\exp(u_k)}{\sum_{i=1}^{m} \exp(u_i)} \tag{1}$$
gives unconstrained parameters $(u_1, \ldots, u_m)$ that are easier to estimate. The log-likelihood function $\ell$ is given by
$$\ell = \log L = \sum_{j=1}^{n} \left[ \log\left( \sum_{i=1}^{m} \exp(u_i) f(d_j; \theta_i) \right) - \log\left( \sum_{i=1}^{m} \exp(u_i) \right) \right], \tag{2}$$
maximized by solving the following system of equations:
$$\frac{\partial \ell}{\partial \theta_k} = \sum_{j=1}^{n} \frac{\exp(u_k)\, \partial f(d_j; \theta_k)/\partial \theta_k}{\sum_{i=1}^{m} \exp(u_i) f(d_j; \theta_i)} = 0, \tag{3}$$
$$\frac{\partial \ell}{\partial u_k} = \sum_{j=1}^{n} \left[ \frac{\exp(u_k) f(d_j; \theta_k)}{\sum_{i=1}^{m} \exp(u_i) f(d_j; \theta_i)} - \frac{\exp(u_k)}{\sum_{i=1}^{m} \exp(u_i)} \right] = 0. \tag{4}$$
The system of equations in (3) and (4) is nonlinear and can be solved by iterative methods. In the following, we specialize the above general formulation to an important special case of multivariate Gaussian densities, give a few more details from the statistical viewpoint, and illustrate its neural network implementation as developed by Jordan et al. [3, 23-26].
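Before specializing to the Gaussian case, the log-likelihood (2) under the softmax parameterization (1) can be computed directly, as in the hedged sketch below; the component density $f$ is passed in as a callable, and all names are illustrative.

```python
import numpy as np

def log_likelihood(u, thetas, data, f):
    """Log-likelihood (2) under the softmax parameterization (1).

    u: array of unconstrained gating parameters u_1..u_m,
    thetas: sequence of component parameters theta_1..theta_m,
    f(d, theta): component density, e.g. for a unit-variance Gaussian,
        f = lambda d, th: np.exp(-0.5 * (d - th) ** 2) / np.sqrt(2 * np.pi).
    """
    u = np.asarray(u)
    p = np.exp(u - u.max())       # subtracting max(u) leaves the softmax unchanged
    p = p / p.sum()               # Eq. (1): p_k = exp(u_k) / sum_i exp(u_i)
    # Eq. (2) equals sum_j log( sum_i p_i f(d_j; theta_i) )
    return sum(np.log(sum(p_i * f(d, th) for p_i, th in zip(p, thetas)))
               for d in data)
```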
B. MIXTURE OF EXPERTS NETWORKS
Consider the problem of supervised learning. The training set $T = \{(x_j, d_j) : j = 1, \ldots, n\}$ consists of $n$ pairs of observations, where $x_j$ denotes an $I$-dimensional input vector and $d_j$ denotes the associated $O$-dimensional output
Figure 8 Basic structure of mixture of experts networks.
vector. The probabilistic model that relates an input $x_j$ to its output $d_j$ is described below: Given an input vector $x$, a parameter $\theta_i$ is chosen randomly with probability $p_i(x)$, where $0 \le p_i \le 1$ and $\sum_{i=1}^{m} p_i = 1$. The desired output $d$ represents a random variable satisfying the probability density function $f(\cdot\,; \theta_i)$. Finally, the functional relation $F_i(x) = \theta_i$ completes the relation between $x$ and $d$. The data come from a mixture of densities $g(d; \Theta) = \sum_{i=1}^{m} p_i f(d; \theta_i)$. Although $m$ is an unknown integer in most real-world problems, it is often possible to make a reasonable guess for its value during the design of the network. In the above model, parameters $\theta_1, \ldots, \theta_m$ and $p_1, \ldots, p_m$ are unknown. In a mixture of experts network, one expert (a neural network) is dedicated to learning the probability density function $f(\cdot\,; \theta_i)$ or the associated parameters for each $i$, and a gating network is assigned to learn $p_i$ for $i = 1, \ldots, m$. Figure 8 shows the basic structure of this network, extensively discussed by Jordan, Jacobs, and their co-workers.
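A minimal forward pass of the architecture in Fig. 8 might look as follows. Linear experts and a linear gating network are assumed here, anticipating the special case of Section VII.C; the function and parameter names are our own.

```python
import numpy as np

def moe_forward(x, expert_weights, gating_weights):
    """One forward pass of a mixture of experts (Fig. 8).

    expert_weights: list of (O x I) matrices W_i, one per expert;
    gating_weights: (m x I) matrix V whose rows give u_i = V_i x.
    Returns the gating probabilities p and the blended output sum_i p_i theta_i.
    """
    u = gating_weights @ x                     # u_i = V_i x
    p = np.exp(u - u.max()); p /= p.sum()      # softmax gating, Eq. (1)
    thetas = [W @ x for W in expert_weights]   # expert means theta_i = W_i x
    blended = sum(p_i * th for p_i, th in zip(p, thetas))
    return p, blended
```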
C. MIXTURE OF LINEAR REGRESSION MODEL AND GAUSSIAN DENSITIES
We illustrate the above-described modular structure, examining an important special case in greater detail. Let $\theta_i$ represent the expected value of $d$, conditional on the selected density. Assume that $\theta_i$ is a linear combination of $x$; that is,
$$\theta_i = F_i(x) = \mathrm{E}(d \mid x, i) = W_i x,$$
Figure 9 Structure of an experts network for a linear regression model.
where $W_i$ is a matrix of weights. The statistical implication is that the vector observation, $d$, is a linear combination of the input $x$ and a random (vector) component. Its neural network implication is that the $i$th expert network is a simple network without any hidden layer, as shown in Fig. 9. In addition, let $f$ be the probability density function of a multivariate Gaussian random variable with independently distributed components, that is,
$$f(d; \theta_i) = \frac{1}{(2\pi)^{O/2}} \exp\left( -\frac{1}{2} (d - \theta_i)^T (d - \theta_i) \right),$$
for $i = 1, \ldots, m$, where $\theta_i = F_i(x) = W_i x$, and $O$ is the output vector dimensionality. This mixture of experts is called an associative Gaussian mixture model. Recall that the gating frequencies $p_i$ are expressed through the unconstrained parameters $u_i$; the latter are related linearly to the inputs, with
$$u_i = V_i x.$$
Estimates of the weights $W_k$ and $V_k$ can be obtained from Eqs. (3) and (4), differentiating $\theta_k$ with respect to $W_k$, and $u_k$ with respect to $V_k$. Because the vector $\theta_k$ and the matrix $W_k$ are related by $\theta_k = W_k x$, it follows that
$$\frac{\partial \theta_k(i)}{\partial W_k(i,j)} = x_j, \tag{5}$$
and every other such derivative is 0. Similarly, because $u_k = \sum_j v_{kj} x_j$, it follows that
$$\frac{\partial u_k}{\partial v_{kj}} = x_j, \tag{6}$$
and
every other such derivative is 0. For convenience, the a posteriori probability that $d_j$ is drawn from the $k$th density $f(d_j; \theta_k)$ is denoted by
$$q_k = \frac{\exp(u_k) f(d_j; \theta_k)}{\sum_{i=1}^{m} \exp(u_i) f(d_j; \theta_i)}. \tag{7}$$
Summarizing the results for the linear regression model, from Eqs. (3) and (4) we obtain
$$\frac{\partial \ell}{\partial W_k(i,j)} = \sum_{l=1}^{n} q_{k,l} \left( d_l(i) - \theta_k(i) \right) x_l(j)$$
and
$$\frac{\partial \ell}{\partial v_{kj}} = \sum_{l=1}^{n} \left( q_{k,l} - p_k \right) x_l(j),$$
where $q_{k,l}$ denotes the value of $q_k$ computed for the $l$th training pair. The chain rule of derivatives allows us to estimate the weights. The following algorithm emerges from these observations:
1. Initialization: Assign randomly chosen uniformly distributed values between 0 and 1 to all elements of $W_k$ and $V_k$, for each $k$.
2. Iterative Modification: Repeatedly perform the following weight modifications for each training pair $(x, d)$:
$$\Delta W_k(i,j) = \eta\, q_k \left( d(i) - \theta_k(i) \right) x_j$$
and
$$\Delta v_{kj} = \eta \left( q_k - p_k \right) x_j,$$
where $\eta$ is a small positive real number.
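The two-step algorithm above translates almost line for line into code. The sketch below assumes the associative Gaussian mixture model (unit-covariance Gaussian components) and per-pattern gradient-ascent updates; the learning rate, epoch count, and function names are illustrative choices, not part of the original description.

```python
import numpy as np

def train_mixture_of_experts(X, D, m, eta=0.01, epochs=100, rng=None):
    """Gradient training of the linear mixture of experts (Section VII.C).

    X: (n x I) inputs, D: (n x O) targets, m: number of experts.
    Implements the two-step algorithm in the text for the associative
    Gaussian mixture model (unit covariance).
    """
    rng = rng or np.random.default_rng(0)
    n, I = X.shape
    O = D.shape[1]
    W = rng.uniform(0, 1, (m, O, I))   # expert weights W_k
    V = rng.uniform(0, 1, (m, I))      # gating weights V_k
    for _ in range(epochs):
        for x, d in zip(X, D):
            u = V @ x
            p = np.exp(u - u.max()); p /= p.sum()       # Eq. (1)
            theta = W @ x                               # theta_k = W_k x, shape (m, O)
            f = np.exp(-0.5 * ((d - theta) ** 2).sum(axis=1))
            q = p * f; q /= q.sum()                     # a posteriori q_k, Eq. (7)
            for k in range(m):
                W[k] += eta * q[k] * np.outer(d - theta[k], x)  # Delta W_k(i,j)
            V += eta * np.outer(q - p, x)                        # Delta v_kj
    return W, V
```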
D. THE EM ALGORITHM
The EM algorithm, proposed by Dempster et al. [27], is an iterative technique for obtaining maximum likelihood estimates in the presence of incomplete data. There are two types of sample sets: $\mathcal{X}$, which represents the data that can be observed, and $\mathcal{Y}$, the data that we desire to observe. The main difference between $\mathcal{X}$ and $\mathcal{Y}$ is the missing data, $\mathcal{Z}$. First we reexamine the mixture of distributions case mentioned above. Suppose that when we observe the sample element $d_j$ we also know which of the $m$ density functions generates it. That is, suppose that it is possible to observe $(d_j, Z_j)$, where $Z_j = (z_{j,1}, \ldots, z_{j,m})$, and precisely one element $z_{j,i}$ equals 1; each of the
rest equals 0. If $z_{j,i} = 1$, then the true density function of $d_j$ is $f(d_j; \theta_i)$, and the joint density function of $(d_j, Z_j)$ can be expressed as
$$g^*(d_j, Z_j; \Theta) = \prod_{i=1}^{m} \left[ p_i f(d_j; \theta_i) \right]^{z_{j,i}}.$$
Consequently, the likelihood function associated with $\mathcal{Y} = \{(d_j, Z_j) : j = 1, \ldots, n\}$ is given by
$$L_c(\Theta \mid \mathcal{Y}) = \prod_{j=1}^{n} g^*(d_j, Z_j; \Theta) = \prod_{j=1}^{n} \prod_{i=1}^{m} \left[ p_i f(d_j; \theta_i) \right]^{z_{j,i}},$$
and the log-likelihood function is given by
$$\ell_c(\Theta \mid \mathcal{Y}) = \sum_{j=1}^{n} \sum_{i=1}^{m} z_{j,i} \left[ \log f(d_j; \theta_i) + \log p_i \right]. \tag{8}$$
A comparison of the above equation with Eq. (2) indicates why this log-likelihood estimation is easier; the logarithm is now inside the second summation sign. Knowledge of the $z_{j,i}$ values facilitates the estimation problem considerably. However, in practice, $\mathcal{Z} = (Z_1, \ldots, Z_n)$ is unobservable; that is, $\mathcal{X}$ is the observable data, $\mathcal{Y}$ represents what we wish we could observe, and $\mathcal{Z}$ represents the missing data. In this setting, the EM algorithm proposes the following solution to estimate $\Theta$ and $p$ (via $u$):
1. Choose an initial estimate $\Theta_0$ of $\Theta$.
2. Until the estimates stabilize, repeatedly perform the following computation steps, where $\alpha$ denotes the iteration number and E denotes the expectation:
(a) Expectation step. Find the expected value of the log-likelihood $\ell_c$ associated with $\mathcal{Y}$, from Eq. (8), conditional upon the observed data $\mathcal{X}$ and the current estimate $\Theta_\alpha$ of $\Theta$.
(b) Maximization step. Find the "improved" estimate $\Theta_{\alpha+1}$ of $\Theta$ by using the above expected log-likelihood function.
To apply the EM algorithm to the mixture of densities problem, we need
$$\mathrm{E}\,\ell_c = \sum_{j=1}^{n} \sum_{i=1}^{m} \mathrm{E}\left( z_{j,i} \left[ \log f(d_j; \theta_i) + \log p_i \right] \mid \mathcal{X}, \Theta_\alpha \right),$$
which essentially requires evaluation of $\mathrm{E}(z_{j,i} \mid \mathcal{X})$, because the other terms are fixed by knowing $\Theta_\alpha$. Because $z_{j,i}$ depends on $\mathcal{X}$ only through $d_j$, we can obtain $\mathrm{E}(z_{j,i} \mid \mathcal{X})$ by an application of the Bayes theorem, as follows:
$$\mathrm{E}(z_{j,i} \mid d_j) = \Pr(z_{j,i} = 1 \mid d_j) = \frac{\Pr(d_j \mid z_{j,i} = 1)\, \Pr(z_{j,i} = 1)}{\sum_{i=1}^{m} \Pr(d_j \mid z_{j,i} = 1)\, \Pr(z_{j,i} = 1)} = \frac{f(d_j; \theta_i)\, p_i}{\sum_{i=1}^{m} f(d_j; \theta_i)\, p_i} = q_i,$$
where, as before, $q_i$ represents the (a posteriori) proportion of $d$ values belonging to the $i$th density $f(d; \theta_i)$. The maximization step merely involves differentiating
$$\mathrm{E}\,\ell_c = \sum_{j=1}^{n} \sum_{i=1}^{m} q_i \left[ \log f(d_j; \theta_i) + \log p_i \right]$$
with respect to $p_i$ and $\theta_i$, treating $q_i$ as fixed:
$$\frac{\partial}{\partial \theta_k}\, \mathrm{E}\,\ell_c = \sum_{j=1}^{n} \sum_{i=1}^{m} q_i \frac{\partial}{\partial \theta_k} \log f(d_j; \theta_i), \qquad \frac{\partial}{\partial u_k}\, \mathrm{E}\,\ell_c = \sum_{j=1}^{n} \sum_{i=1}^{m} q_i \frac{\partial}{\partial u_k} \log p_i.$$
Very simple expressions are obtained for the previously described associative Gaussian mixture model:
$$\frac{\partial}{\partial \theta_i}\, \mathrm{E}\,\ell_c = \sum_{j=1}^{n} q_i \left( d_j - \theta_i \right), \qquad \frac{\partial}{\partial u_i}\, \mathrm{E}\,\ell_c = \sum_{j=1}^{n} q_i \left( 1 - p_i \right).$$
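Assembling the expectation and maximization steps just derived gives the familiar EM iteration. The following sketch treats the plain mixture-of-densities case (no input-dependent gating), with unit-covariance Gaussian components; the closed-form M step follows from setting the above gradients to zero, and all names are our own.

```python
import numpy as np

def em_gaussian_mixture(D, m, iters=50, rng=None):
    """EM for a mixture of O-dimensional unit-covariance Gaussians.

    D: (n x O) observed data. Returns estimated priors p and means theta.
    E step: q_{j,i} = p_i f(d_j; theta_i) / sum_l p_l f(d_j; theta_l).
    M step: closed-form reestimation of the priors and means.
    """
    rng = rng or np.random.default_rng(0)
    n = D.shape[0]
    p = np.full(m, 1.0 / m)
    theta = D[rng.choice(n, m, replace=False)]   # initialize means at data points
    for _ in range(iters):
        # E step (Bayes theorem); the Gaussian normalizing constant cancels
        sq = ((D[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)  # (n x m)
        q = p * np.exp(-0.5 * sq)
        q /= q.sum(axis=1, keepdims=True)
        # M step
        p = q.mean(axis=0)                           # new priors
        theta = (q.T @ D) / q.sum(axis=0)[:, None]   # new means
    return p, theta
```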
Finally, specialization for the multivariate regression model and its neural network implementation requires an application of the chain rule in conjunction with Eqs. (3) and (4). The associated algorithm is straightforward. Jacobs and Jordan [26] have also developed a hierarchical extension containing several mixtures of experts, the simplest of which is shown in Fig. 10. The training algorithm is very similar to that for the mixture of experts model. Properties of this network, when trained by an application of the EM algorithm, have been studied by Jordan and Xu [28].
Figure 10 A two-level hierarchical mixture of experts network.
Singh [29] has also adapted the mixture of experts approach, presenting a modular network for multiple Markovian decision tasks. The problem consists of putting together subsets of a given set of elemental Markovian decision tasks, forming composite Markovian decision tasks. For each composite task, the exact number and sequence of elemental tasks must be obtained without excessive computational effort. Singh has introduced a "stochastic" switch into the mixture of experts network, to add an elemental task and measure its effect on the performance criterion, that is, the expected return on taking a particular decision in a specific state at a given time. The mixture of experts model and its extensions have been applied in several contexts. For instance, Jacobs [23] applied the competitive modular connectionist architecture to a multipayload robotic task, in which an expert controller is trained for each category of manipulated objects in terms of the object mass. Gomi and Kawata [30] use this type of network architecture for simultaneous learning of object recognition and control tasks, using feedback error learning. More recently, Fritsch [31] has used this model for speech recognition.
VIII. ADAPTIVE MODULAR NETWORKS
How may we modularize a network for a new task not encountered before, about which little heuristic information is available, to construct and arrange "expert modules" optimally? Most adaptive network construction algorithms (e.g., cascade correlation [32]) insert nodes one at a time into a network, and do not introduce modules, because it is hard to delineate the purpose or scope of the
module when it is introduced adaptively. The main questions to be addressed in developing adaptive modular networks are:
• When is a new module to be introduced?
• How is each module to be trained?
• What is the network architecture?
• How does the network output depend on the existing modules?
• How are existing modules to be modified when a new module is introduced?
The first of these questions has a straightforward answer: new modules can be generated when the performance of current modules is inadequate. No a priori information would then be needed regarding the number of modules needed for a specific problem. The remaining questions posed above are answered differently by different adaptive modular network approaches. This section outlines four such algorithms that adaptively introduce modules into a network.
1. The Attentive Modular Construction and Training Algorithm (AMCT) [33] successively introduces and trains new modules on small portions of the data set, steadily extending the "window" of data on which a module has been trained. A complete network is then constructed, with all of the modules constituting a "hidden layer" (of modules), and the weights (including those from the modules to an output node) are then trained, using a small number of iterations of the back-propagation algorithm.
2. Ersoy and Hong [34] propose a network consisting of multi-output perceptron modules. In their model, later modules are invoked only if it is obvious from the output of earlier modules that they "reject" an input vector (i.e., do not generate a reasonable output vector). The training process involves the successive addition of new modules. The input vectors are subjected to nonlinear transformations before being fed into the new modules. Note that the outputs of the earlier modules are not directly supplied as inputs into later modules, allowing for greater parallelism in execution.
3. The Adaptive Multi-Module Approximation Network (AMAN) [35] associates multiple "prototypes" or reference vectors with each module in the network; the network output depends on the output of the module whose reference vector is nearest an input vector. A competitive learning mechanism is used to associate the module with reference vectors identifying cluster centroids. The introduction of a new module may necessitate readjusting the reference vectors of existing modules, but not their internal connection weights.
4. Some existing nonmodular adaptive training algorithms can be modified so that the adaptive process successively introduces neural network modules into the network, instead of individual nodes. For instance, the "Blockstart" algorithm [36] adaptively develops a multilayer hierarchical neural network for two-class classification problems, following the principles of Frean's "Upstart" algorithm
[37]. Each module is a small feedforward network, trained by an algorithm such as back-propagation.
A. ATTENTIVE MODULAR CONSTRUCTION AND TRAINING
Theoretically, multilayered feedforward neural networks can approximate any continuous function to an arbitrary degree of accuracy [5-7, 38-40]. However, these networks are often difficult to train, possibly because of local minima, inadequate numbers of hidden units, or noise. Discontinuous functions and functions with sharp corners are particularly difficult for feedforward networks to approximate, as recognized by Sontag [41]. Such functions arise naturally in problems such as multicomputer performance modeling: a typical communication cost curve is flat for one range of input parameter values, then jumps steeply whenever a parameter such as message length reaches certain thresholds. Some work on discontinuity approximation has been carried out in surface reconstruction in 3-D images [42]. Current neural network methods that are successful for this problem [43-45] can only deal with the case of a discretized domain where function values (intensity or pixel values) are defined at a finite number of pixel locations. These networks have limitations in interpolation because each output node approximates the function value at a single (x, y) location. Moreover, they are designed to work only in two-dimensional input space. Lee et al. [33, 46] developed a new algorithm to solve this problem. In general, the number of discontinuities in a problem may not be known in advance; hence the neural network is constructed incrementally. Because discontinuities are local singularities and not global features, modules are trained by using small neighborhoods of data space. Partitioning an input space for better performance has been suggested for other applications by various researchers [47-50]. In the approach of Choi and Choi [49], for instance, the entire input space is tessellated into a large number of cells, for each of which a separate network "granule" is trained; however, the computational needs and number of parameters of the resulting system are excessively high. Ideally, the number of "cells" should be variable, with nonuniform sizes, and determined adaptively by the algorithm. In the network constructed by the AMCT algorithm, each module is a one-hidden-layer network containing a small number of hidden nodes, trained on a small subset of training samples. The need for a new module is automatically recognized by the system. This algorithm performs very well on low-dimensional problems with many discontinuities, and requires fewer computations than traditional back-propagation. The success of the algorithm relies on independent scaling of training data and window management. Modular training is followed by merging and finding weights of the top layer connections. The weights of the top
layer connections can be found adaptively by gradient descent training without back-propagating errors.
1. Module Development
Each module is a feedforward network with one hidden layer containing a small number of nodes, and is capable of learning only a very simple concept, such as a single discontinuity or a step. The window for the module is initially small, and overlaps partially with the window of the immediately preceding module. The window steadily expands, with the module's weights repeatedly retrained, until the current module's performance degrades drastically. This identifies the location of the next discontinuity, to be captured by using another module. The current module's window then contracts back to the region in which it performs well (a code sketch of this window management appears at the end of this section).
2. Final Network Formation
Modules are merged as follows, resulting in the network shown in Fig. 11.
1. All hidden nodes of all modules are concatenated to form the hidden layer.
2. The weights of the connections to the hidden units of the modules are copied onto the corresponding links of the merged network.
3. The weights of the upper layer connections are retrained.
3. Experimental Results
Results from the AMCT algorithm are given in Fig. 12, for a one-dimensional multistep function with nine discontinuities, illustrating the marked improvement over the result obtained with a monolithic network. In these experiments, the
Figure 11 Construction of network from modules in the AMCT algorithm. First-layer and intramodule weights are trained before modules are merged; outermost-layer weights are retrained after merging modules; $M_1, \ldots, M_n$ contain hidden nodes of the separately trained modules.
Figure 12 Approximation of a discontinuous function (with nine steps) by AMCT (modular network) and back-propagation (monolithic network).
AMCT algorithm was found to require roughly 30% fewer weight changes than a nonmodular network trained by back-propagation.
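To make the window-management scheme of the AMCT algorithm concrete, here is a hedged sketch; the growth step, degradation tolerance, and helper names (`train_module`, `error_of`) are illustrative, and the partial overlap between consecutive windows is omitted for brevity.

```python
def develop_modules(train_module, error_of, samples, grow=5, tol=0.05):
    """Window management in the spirit of the AMCT algorithm.

    `train_module(window)` returns a trained one-hidden-layer module;
    `error_of(module, window)` returns its error on that window.
    """
    modules, start = [], 0
    while start < len(samples):
        end = min(start + grow, len(samples))
        module = train_module(samples[start:end])
        # Expand the window until performance degrades drastically,
        # signalling the location of the next discontinuity.
        while end < len(samples):
            wider = train_module(samples[start:end + grow])
            if error_of(wider, samples[start:end + grow]) > tol:
                break  # keep the last module that performed well
            module, end = wider, min(end + grow, len(samples))
        modules.append(module)
        start = end  # the next module's window begins at the detected boundary
    return modules
```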
B. ADAPTIVE HIERARCHICAL NETWORK
Algorithms have been designed to solve classification problems by using an adaptively developing hierarchical neural network. These algorithms first use a "simple" neural network module, attempting to classify correctly as many input/output pairs as possible. If the resulting error exceeds a preassigned threshold, then a new neural network component is trained on the incorrectly classified data. This step is repeated until all or most data are correctly classified. An example of this approach is the nonmodular "tiling algorithm" of Mezard and Nadal [51], in which new nodes are successively introduced to separate samples of different classes that cannot be distinguished by existing nodes in the current layer. Ersoy and Hong [34] develop a similar adaptive neural network, based on the premise that most errors occur because of input/output pairs that are linearly nonseparable, and that a nonlinear transformation of the inputs can improve the separability between two classes. Inputs and outputs are presumed to be binary, and the applications explored so far are primarily in signal processing. Each module is a multi-output perceptron (without any hidden layer) containing one layer of weights trained by the delta rule (gradient descent), attempting to classify correctly the data for which previously existing modules were unsuccessful. The output vector generated by the module is examined to determine if the module has been successful in producing the desired output vector. After a reasonable training period, the remaining data vectors (on which the current module is unsuccessful) are piped through a nonlinear transform and then into a newly spawned module, on which the training process is again applied. The nonlinear transform most used by Ersoy and Hong consists of a real discrete Fourier transform followed by the sign function. Figure 13 illustrates the architecture of a network that develops in this manner. The training process continues until a prespecified stopping criterion, such as an error bound, is achieved. In applying the network to new test data, the modules are successively applied to the input data (along with the relevant nonlinear transforms) until one is found whose output is considered satisfactory. Ersoy and Hong [34] have observed that the performance of their self-organizing hierarchical neural networks is better than the performance of the usual feedforward back-propagation neural networks with one or two hidden layers. They also report that this modular network model has better fault-tolerant properties.
Figure 13 Adaptive modular network of Ersoy and Hong.
C. ADAPTIVE MULTIMODULE APPROXIMATION NETWORK
Kim et al. [35] propose an adaptive modular neural network in which each module is associated with one or more adaptively determined reference vectors that identify when the module is to be invoked. The connection weights of that module are used to generate the appropriate output vector. Unlike the other adaptive modular algorithms, each module may be appropriate for more than one small region in data space, because each module may be associated with multiple reference vectors. This network model is successful in solving problems involving the approximation of a collage of several different simpler functions; it is much more difficult to train unimodular networks for such tasks. The network architecture is illustrated in Fig. 14. In using the notion of "reference vectors," the relative proximity of which is used to determine the choice of the appropriate module, the AMAN algorithm bears an analogy with learning vector quantization (LVQ) [52] and counterpropagation [53] algorithms. However, those algorithms are not adaptive or modular, and were developed for classification and pattern association tasks, and not for arbitrary function approximation problems.
Figure 14 Adaptive Multi-Module Approximation Network (AMAN) being used to generate the output corresponding to a desired input.
The algorithm consists of the initialization of the first module, followed by the repeated introduction of new modules, until satisfactory outputs are obtained for all or most of the input data. When a new module is introduced, it is first trained on a randomly chosen subset of the input data on which currently existing modules do not perform well enough. The reference vectors of the old and new modules are adjusted to reflect the addition of the new module, to preserve the property that the reference vector closest to an input vector must be associated with a module for which the network error is low for that input vector. These steps of the algorithm are described below in greater detail.
1. New Module Introduction
Assume that modules $M_0, M_1, \ldots, M_{i-1}$ have already been introduced into the network, currently associated with some reference vectors; $r_{j,k}$ is the $k$th reference vector associated with module $M_j$, where $i > j \ge 0$ and $J > k \ge 0$, using the uppercase letter ($J$) to denote the current number of reference vectors of a module ($j$). $T_j$ consists of all of the training data for which the performance of the module $M_j$ is satisfactory, and $T$ consists of all of the training data for which the performance of each of the modules $M_0, \ldots, M_{i-1}$ is unsatisfactory. If $T$ is
sufficiently small, then the algorithm terminates and it is not necessary to introduce a new module. The initial connection weights of $M_i$ are chosen randomly. An element $x \in T$ is randomly selected, and the initial training set $T_i$ consists of all members of $T$ that are sufficiently close to $x$, by a simple distance criterion. The error back-propagation algorithm is applied for a relatively small number of iterations, training $M_i$ to produce the desired outputs for elements in $T_i$. The first reference vector for $M_i$ is the centroid of the subset of all of the input vectors in $T_i$ for which the error generated by $M_i$ is considered satisfactory. $M_i$ is then tested on all of the remaining input vectors in $T$, moving from $T$ to $T_i$ each input vector for which performance is satisfactory (i.e., the desired output is sufficiently close to the actual output). $T$ is trimmed, deleting all members of $T_i$.
2. Reference Vector Refinement
The reference vectors of previously existing modules may have to be refined as a result of the introduction of the new module $M_i$. For each element $x$ in $T_j$, for each $j < i$, we first determine the nearest reference vector $r_{a,b}$ among the existing reference vectors. No adjustment is necessary if $a = j$, because the nearest reference vector is the one whose module performs best on the input vector under consideration. Otherwise, if $a \neq j$, the reference vectors of module $M_j$ must be adjusted. Among the reference vectors of module $M_j$, if $r_{j,k}$ is nearest the current input vector, and its distance to $x$ is small enough, the desired goal is achieved merely by perturbing $r_{j,k}$ to move it closer to $x$. However, if distance$(x, r_{j,k})$ is relatively large, it becomes necessary to introduce a new reference vector $r_{j,J}$ centered at $x$. When this process is complete, we are assured that the nearest reference vector for each input vector in $T_0 \cup \cdots \cup T_{i-1}$ is associated with the module that performs best for that input vector. Minor adjustments may be necessary to ensure that the set of reference vectors of $M_j$ accurately reflects cluster centroids of vectors in $T_j$; a few iterations of the k-means algorithm should suffice.
3. Results
The AMAN algorithm was applied to several problems in which the input data were a composite of simple functions (such as sigmoids) applicable in different parts of the data space. In each case, modules continued to be added until some module gave a reasonably low error value for every input vector. Each module came to be associated with one to three reference vectors, indicating that in some cases the training carried out on one part of the data space could be exploited to obtain good performance for the same module on another part of the data space. In these experiments, AMAN required far fewer iterations and required a running
time an order of magnitude less than the unimodular network, reaching lower error levels.
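Recall in an AMAN-style network reduces to a nearest-reference-vector lookup followed by a module invocation, as in the following sketch; the data layout (a flat array of reference vectors with an ownership list) is an illustrative choice, not the authors' implementation.

```python
import numpy as np

def aman_output(x, modules, reference_vectors, owners):
    """Recall in an AMAN-style network (Fig. 14).

    reference_vectors: (R x I) array stacking all reference vectors r_{j,k};
    owners: length-R list giving the index of the module each vector belongs to;
    modules: list of callables. The module owning the reference vector
    nearest to x produces the network output.
    """
    dists = np.linalg.norm(reference_vectors - x, axis=1)
    nearest = int(np.argmin(dists))
    return modules[owners[nearest]](x)
```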
D. BLOCKSTART ALGORITHM
Frean's "Upstart" algorithm [37] adaptively adds nodes to a network, successively invoking the algorithm recursively on smaller subsets of the training data. The "Blockstart Algorithm" [36] extends this approach, adaptively developing a multilayer hierarchical neural network for two-class classification problems, as illustrated in Fig. 15. Each module is a small feedforward network, trained by an algorithm such as back-propagation. The first module is trained on the entire training set, with the initial invocation of the algorithm being $N = \mathrm{Blockstart}(T_1, T_0)$, where $T_1$ is the subset on which the desired output is 1 and $T_0$ is the subset on which the desired output is $-1$. The current invocation of the algorithm terminates if perfect (or acceptable) classification performance is obtained by this module. Otherwise, if $T_1^- \subset T_1$ and $T_0^- \subset T_0$ constitute the misclassified samples, the algorithm is recursively invoked twice: $N_1 = \mathrm{Blockstart}(T_1^-, T_0)$, $N_0 = \mathrm{Blockstart}(T_0^-, T_1)$. In the complete network, all inputs are supplied to each of $N$, $N_1$, $N_0$; in addition, there are also connections from the output nodes of $N_1$, $N_0$ to the output node of
Figure 15 Blockstart algorithm. Connections from lower level module outputs lead directly into the outermost node of the next higher level module, carrying a weight of MAX or −MAX.
$N$, with connection weights MAX and $-$MAX, respectively, where MAX is sufficiently large, say exceeding $|\sum_k w_k i_k^p|$, where $w_k$ is the connection weight from the $k$th input to $N$ and $i_k^p$ is the $k$th element of the $p$th input pattern. The resulting network has many layers, but the training process is not computationally expensive, because each module is small, and the weights within a module are trained exactly once. Termination of the algorithm is ensured because each successive recursive call addresses fewer training patterns.
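The recursive structure of Blockstart can be sketched as follows. The arguments of the recursive calls follow the reconstruction given above, the module-training and misclassification helpers are abstracted, and the depth cap is an added safeguard not present in the original description.

```python
def blockstart(T1, T0, train_module, misclassified, depth=0, max_depth=10):
    """Recursive skeleton of the Blockstart algorithm.

    T1, T0: patterns with desired outputs +1 and -1. `train_module`
    returns a small trained feedforward module; `misclassified(N, T)`
    returns the subset of T that module N gets wrong. Returns a nested
    (N, N1, N0) structure; in the complete network, the outputs of N1
    and N0 feed the output node of N with weights +MAX and -MAX.
    """
    N = train_module(T1, T0)
    T1_bad = misclassified(N, T1)
    T0_bad = misclassified(N, T0)
    if (not T1_bad and not T0_bad) or depth == max_depth:
        return (N, None, None)  # acceptable classification; stop recursing
    # Each recursive call addresses fewer training patterns,
    # which is what ensures termination in the original algorithm.
    N1 = blockstart(T1_bad, T0, train_module, misclassified, depth + 1, max_depth)
    N0 = blockstart(T0_bad, T1, train_module, misclassified, depth + 1, max_depth)
    return (N, N1, N0)
```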
IX. CONCLUSIONS
This chapter has surveyed the topic of modular neural networks, examining various modular network architectures. In the first model, different modules process different subsets of input attributes. In the second, the problem being addressed is partitioned into subtasks that are solved by using different modules. The third model includes a number of different levels of modules, with results from one level feeding into the next. The fourth model invokes a decision-making method to combine the results of various "expert" modules. This is followed by a discussion of adaptive modular networks, in which modules are automatically added by the learning algorithm. The construction of such modular networks allows us to conduct successful training of networks for large problems with reasonably small computational requirements. If a problem can be solved easily by using a small neural network whose number of weights is one or two orders of magnitude less than the number of input samples, then there is no need to resort to a modular architecture. In practice, this is rarely the case. Modular neural networks can be used in many practical applications for which training a single large monolithic neural network is infeasible or too computationally demanding, or leads to poor generalization because of the large number of weights in the network. The choice of the network architecture depends on the problem, and on the nature of the a priori information available about the problem. When fault tolerance is a major concern in an application, each module may be trained once and then duplicated, yielding the redundancy needed for the network in operational use. The research area of modular neural networks is far from exhausted. To begin with, it may be possible to formulate other modular network architectures for various problems. Few adaptive modular algorithms exist, providing scope for further exploration, especially in the context of applications characterized by a continually changing problem environment. Of particular interest are new adaptive network training algorithms for association and optimization tasks. Several algorithms also beg for a clear theory that indicates why modular networks are better than nonmodular ones of equal size. It may be conjectured, for instance, that the presence of a larger number of weights in a nonmodular network leads
to the existence of a larger number of local optima of the mean square error than in the case of the modular network. Similarly, when compared to nonmodular networks with the same number of nodes, modular networks are capable of implementing only a proper subset of the set of functions that produce desired results on the training data. When domain knowledge is available, the main task is that of embedding such knowledge in the network structure, and deciding whether a connectionist or hybrid model is appropriate for the problem. In the absence of clear a priori domain-specific knowledge, a difficult issue bearing serious study involves determining how best to break down a complex task into simpler subtasks. The learning process would be accelerated considerably if one could find ways to reduce the complexity of the function to be learned by the neural network. We end this chapter by noting that a modular connectionist system may include some modules that are not neural networks. For instance, such a system may contain rule-based expert system modules, or may contain specialized hardware components for tasks such as signal processing. The overall architecture of the system may again follow one of the schemes discussed in Section III. For instance, Zhao and Lu [54] use a hybrid pipelining architecture with two modules for classifying the depth of burn injury. The input, extracted from multiple-wavelength spectra of visible and infrared radiation used to evaluate burn injuries, is first preprocessed to have zero mean and unit variance. The first module in the system is a statistical analyzer that performs principal components analysis, and the second module is a feedforward neural network trained by back-propagation. A classification accuracy of 87.5% is reported by the authors. Another example of a hybrid neural network is the face recognition system of Lawrence et al. [55], which combines local image sampling, a self-organizing map neural network, and a convolutional network that resembles the neocognitron.
REFERENCES
[1] M. Minsky. The Society of Mind. Simon & Schuster, New York, 1985.
[2] D. C. Plaut and G. E. Hinton. Learning sets of filters using back-propagation. Comput. Speech Language 2:35-61, 1988.
[3] R. A. Jacobs, M. I. Jordan, and A. G. Barto. Task decomposition through competition in a modular connectionist architecture: the what and where vision task. Cognitive Sci. 15:219-250, 1991.
[4] K. Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybernet. 36:193-202, 1980.
[5] R. Hecht-Nielsen. Theory of the backpropagation neural network. Proc. Internat. Joint Conference Neural Networks 1:593-611, 1989.
[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2:303-314, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366, 1989.
[8] V. Kurkova. Kolmogorov's theorem and multilayer neural networks. Neural Networks 5:501-506, 1992.
[9] R. E. Robinson, H. Yoneda, and E. Sanchez-Sinencio. A modular CMOS design of a Hamming network. IEEE Trans. Neural Networks 3:117-124, 1992.
[10] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans. System Man Cybernet. SMC-13:826-834, 1983.
[11] K. Fukushima. Neural network model for selective attention in visual pattern recognition and associative recall. Appl. Optics 26:193-202, 1987.
[12] K. Fukushima. Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Networks 1:119-130, 1988.
[13] J. G. Rueckl, K. R. Cave, and S. M. Kosslyn. Why are 'What' and 'Where' processed by separate cortical visual systems? A computational investigation. J. Cognitive Neurosci. 1:171-186, 1989.
[14] A. Waibel. Connectionist glue: modular design of neural speech systems. Technical report, Carnegie Mellon University, Pittsburgh, PA, 1989.
[15] R. Anand, K. G. Mehrotra, C. K. Mohan, and S. Ranka. Efficient classification for multiclass problems using modular networks. IEEE Trans. Neural Networks 6:117-124, 1995.
[16] R. Anand, K. G. Mehrotra, C. K. Mohan, and S. Ranka. An improved algorithm for neural network classification of imbalanced training sets. IEEE Trans. Neural Networks 4:962-969, 1993.
[17] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7:179-188, 1936.
[18] M.-C. Yang, K. Mehrotra, C. K. Mohan, and S. Ranka. Partial shape matching with attributed strings and neural networks. In Proceedings of the Conference on Artificial Neural Networks in Engineering (ANNIE), November 1992, pp. 523-528.
[19] M. Flasinski. On the parsing of deterministic graph languages for syntactic pattern recognition. Pattern Recognition 21:1-16, 1993.
[20] R. Anand, K. G. Mehrotra, C. K. Mohan, and S. Ranka. Analyzing images containing multiple sparse patterns with neural networks. Pattern Recognition 26:1717-1724, 1993.
[21] T. Kohonen. Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43:59-69, 1982.
[22] S.-W. Lee. Multilayer cluster neural networks for totally unconstrained handwritten numeral recognition. Neural Networks 8:783-792, 1995.
[23] R. A. Jacobs and M. I. Jordan. A competitive modular connectionist architecture. In Advances in Neural Information Processing Systems, Vol. 4, pp. 767-773. Morgan Kaufmann, San Mateo, CA, 1991.
[24] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Comput. 3:79-87, 1991.
[25] R. A. Jacobs and M. I. Jordan. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 4, pp. 985-992. Morgan Kaufmann, San Mateo, CA, 1991.
[26] R. A. Jacobs and M. I. Jordan. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6:181-214, 1994.
[27] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39:1-38, 1977.
[28] M. I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks 8:1409-1432, 1995.
[29] S. P. Singh. The efficient learning of multiple task sequences. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 5, pp. 251-258. Morgan Kaufmann, San Mateo, CA, 1992.
[30] H. Gomi and M. Kawata. Recognition of manipulated objects by motor learning. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 5, pp. 547-554. Morgan Kaufmann, San Mateo, CA, 1992.
[31] J. Fritsch. Modular neural networks for speech recognition. Diploma Thesis, Interactive Systems Laboratory, Carnegie Mellon University, Pittsburgh, PA, 1996.
[32] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems II (Denver 1989) (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 3, pp. 524-532. Morgan Kaufmann, San Mateo, CA, 1990.
[33] H. Lee, K. Mehrotra, C. K. Mohan, and S. Ranka. An incremental network construction algorithm for approximating discontinuous functions. In Proceedings of the International Conference on Neural Networks (ICNN), Vol. 4, pp. 2191-2196, 1994.
[34] O. K. Ersoy and D. Hong. Parallel, self-organizing, hierarchical neural networks. IEEE Trans. Neural Networks 1:167-178, 1990.
[35] W. Kim, K. Mehrotra, and C. K. Mohan. Adaptive multi-module approximation network (AMAN). In Proceedings of the Conference on Artificial Neural Networks in Engineering, 1997.
[36] D. Anton. Block-start neural networks. Technical report, personal communication, 1994.
[37] M. Frean. The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation 2:198-209, 1990.
[38] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1989.
[39] A. N. Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk USSR 114:953-956, 1957.
[40] K. I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks 2:183-192, 1989.
[41] E. D. Sontag. Capabilities of four- vs. three-layer nets, and control applications. In Proceedings of the Conference on Information Science and Systems, Johns Hopkins University, Baltimore, MD, March 1991.
[42] S. S. Sinha and B. G. Schunck. A two-stage algorithm for discontinuity-preserving surface reconstruction. IEEE Trans. Pattern Analysis Machine Intelligence 14:36-55, 1992.
[43] C. Koch, J. Marroquin, and A. Yuille. Analog "neuronal" networks in early vision. Proc. Natl. Acad. Sci. U.S.A. 83:4263-4267, 1986.
[44] J. G. Harris. The coupled depth/slope approach to surface reconstruction. Technical Report TR-908, Artificial Intelligence Laboratory, MIT, 1986.
[45] J. G. Marroquin. Deterministic Bayesian estimation of Markovian random fields with applications to computational vision. In Proceedings of the First International Conference on Computer Vision, June 1987, pp. 597-601.
[46] H. Lee. Neural network modeling approaches in performance evaluation of parallel applications. Ph.D. Thesis, Syracuse University, Syracuse, New York, 1993.
[47] J. E. Moody and C. Darken. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. Hinton, and T. Sejnowski, Eds.), pp. 133-143. Morgan Kaufmann, San Mateo, CA, 1988.
[48] J. E. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Comput. 1:281-294, 1989.
[49] C. Choi and J. Y. Choi. Construction of neural networks for piecewise approximation of continuous functions. In Proceedings of the International Conference on Neural Networks, 1993, pp. 428-433.
[50] J. H. Friedman. Adaptive spline networks. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 3, pp. 675-683. Morgan Kaufmann, San Mateo, CA, 1991.
[51] M. Mezard and J.-P. Nadal. Learning in feedforward networks: the tiling algorithm. J. Phys. A 22:2191-2204, 1989.
[52] T. Kohonen. Improved versions of learning vector quantization. In Proceedings of the International Joint Conference on Neural Networks, San Diego, CA, 1990, Vol. 1, pp. 545-550.
[53] R. Hecht-Nielsen. Counterpropagation networks. IEEE Internat. Conf. Neural Networks 2:19-32, 1987.
[54] S. X. Zhao and T. Lu. The classification of the depth of burn injury using hybrid neural network. Technical Report, Physical Optics Corporation, Torrance, CA, 1996.
[55] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: a hybrid neural network approach. Technical Report UMIACS-TR-96-16, University of Maryland Institute for Advanced Computer Studies, April 1996.
Associative Memories
Zong-Ben Xu
Institute for Information and System Sciences and Research Center for Applied Mathematics, Xi'an Jiaotong University, Xi'an, China
Chung-Ping Kwong
Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
I. INTRODUCTION
Memory is a very fundamental human feature—it refers to the relatively enduring neural alterations induced by the interaction of an organism with its environment [1]. When an activity pattern (a certain environment) causes such a change (i.e., a learning process takes place), the pattern is stored in memory. This stored pattern can then be recalled later when required. The mechanism by which a pattern is stored in and retrieved from memory is complex and may not be exactly understood. In neurobiology, however, this has been commonly explained as performing association [2]. With this understanding, the memory is therefore referred to as associative memory. In this chapter we study a class of associative memories—neural network associative memories, in which a large number of simple connected building blocks (called artificial neurons) act individually in a synchronous or asynchronous manner, yet collectively constitute an organ that performs association in a robust way. The characteristics of such a memory can be identified as follows:
1. The memory is a brainlike network consisting of weightedly connected artificial neurons.
2. The memory stores information in a distributive way. To be more precise, the memory lies in the network's connections. Memorizing or learning will take place through specific changes in the connection weights of the network.
3. Recall of the stored information is accomplished by collective computation of all of the elements in the network.
4. The stored memories are accessed not by specifying an address, but by specifying a subpart or distorted stimulus of the desired contents. Hence the memory has content addressability.
5. The distributive storage and collective computing features make the memory robust against damage and noise, as local damage may be recovered by the correct contribution of other memory elements.
From this brief exposition of neural network associative memories, we may infer that such memories are indeed plausible models of biological memory. In addition to this strictly biological importance, they are also drawing increasing attention because of their successful use in many applications.
A. ASSOCIATIVE MEMORY PROBLEMS
We may distinguish two types of basic association problems: autoassociation and heteroassociation. For the former, a set of $p$ patterns $X^{(\mu)}$ are stored in such a way that when presented with a key vector $U$, the network responds by producing whichever one of the stored patterns most closely resembles $U$; and, for the latter, a set of $p$ pattern pairs $(X^{(\mu)}, Y^{(\mu)})$ are stored in such a way that when presented with one of the stored keys (say $X^{(\mu)}$) or its noisy version as a key, the network responds by producing the partner $Y^{(\mu)}$ or its nearest pattern. Here the pattern or pattern pairs to be stored are labeled by $\mu = 1, 2, \ldots, p$. An autoassociative memory is a system (device) that performs the autoassociation. Likewise, a heteroassociative memory is a system for performing the heteroassociation. The main difference between the two types of memories is clear. In an autoassociative memory, the key vector is always associated with itself—hence the name; but the key vector in a heteroassociative memory is associated (paired) with another memorized vector. Therefore, whereas the input and output spaces of an autoassociative memory have the same dimension, this may not be the case for a heteroassociative memory. In both cases, however, a stored pattern may be recalled from the memory by applying a stimulus that consists of a partial description or noisy version of the key vector originally associated with a desired form of the stored pattern. The heteroassociation problem appears to be more general and difficult than the autoassociation problem. Although this is understood, we would like to note that we can also establish the equivalence of the two problems by the follow-
ing assignment of stored patterns and input/output vectors for an autoassociative memory:
• the stored patterns $Z^{(\mu)} = (X^{(\mu)}, Y^{(\mu)})$
• the key vector $U = (\xi, 0)$, with $\xi = X^{(\mu)}$ or a fraction of $X^{(\mu)}$
• the output $V = (X^{(\mu)}, Y^{(\mu)})$ or its nearest resemblance
The key vector $(\xi, 0)$ is, of course, a partial description of the expected pattern $(X^{(\mu)}, Y^{(\mu)})$. The resultant memory can therefore be regarded as a special form of autoassociative memory. This equivalence makes possible the study of autoassociative and heteroassociative memories in a unified fashion, which can be shown to be very useful in the synthesis and analysis of associative memory networks. Henceforth we will focus our attention on autoassociative memories. And, if it is not specified, an associative memory will always mean an autoassociative memory. Associative memories have many applications and can be quite powerful. For example, suppose we store the coded information about a fixed set of images in a network. Then an approximate key input vector, despite the error or imprecision it contains, would be sufficient to recall every detail of the images (Fig. 1). Other applications of associative memories include information retrieval from incomplete references, pattern recognition, expert systems, and the solution of some difficult combinatorial optimization problems (e.g., the traveling salesman problem). See [3-6] for details.
B. PRINCIPLES OF NEURAL NETWORK ASSOCIATIVE MEMORIES There have been a large number of neural network models that can function as associative memories. We may classify all of these models into three types according to their principles of operation.
Principle 1: Function Mapping A direct way to perform heteroassociation is perhaps to take the correspondence between the key input and the output of the memory as a function mapping. From this point of view, a heteroassociative memory may be realized by a function mapping, say F : X -> Y, so that (i) F(X(^>) = Y^^^\ /x = 1, 2 , . . . , p, and (ii) F is continuous in each neighborhood of X^^\ Here X and Y are assumed to be the input and output vector spaces, respectively.
Zong-Ben Xu and Chung-Ping Kwong
186
(a) Figure 1 The image at the bottom is retrieved from an associative memory presented with the corrupted image at the top. The middle image is the recovered image at an intermediate stage.
Figure 2 shows schematically the function of such an associative memory, which is seen to correspond to an interpolator. Because of the continuity of F as required in (ii), not only can any key input X^^^ lead directly to the expected output Y^^\ but any input vector near X^^^ may result in an output closed to Y^^\
187
Associative Memories
Figure 1
(b) (Continued)
It should be noted, however, that the output F(§) is not necessarily nearest to the pattern Y^^\ even when the input § stays near X^^\ This drawback can be improved by imposing another constraint on the construction of the mapping F. The constraint is often the minimization of an appropriate error function, for example,
(iii) Min Yla=i IIF(X(^>) - Y^^^f.
Zong-Ben Xu and Chung-Ping Kwong
XM Figure 2 Schematic illustration of a function-mapping associative memory. A set of 50 onedimensional pattern pairs (X^^\ y(M)) was used.
This type of memory was initially studied by Taylor [7] in 1956, followed by the introduction of the learning matrix by Steinbuch [8]; the correlation matrix memories by Anderson [9], Kohonen [10], and Nakano [11]; and the pseudoinverse memories by Wee [12], Kohonen [13], and others [14-18]. All of these models use Unear functions F (i.e., F are matrices). Gabor et al were probably the first to introduce nonlinear function to the study of memory [19]. Since then, various nonhnear models were developed. Some of the most significant models include the perceptron of Rosenblatt [20, 21], the adaline of Widrow and Hoff [22], the madaline of Widrow and his students [23], the sparsely distributed memory of Kanerva [24], the back-propagation algorithm of Rumelhart et al [25], and the radial basis function layered feedforward networks of Broomhead and Lowe [26]. The above type of associative memories will henceforth be referred to as Function Mapping Associative Memories (FMAMs).
Principle 2: Attractors of Dynamic Systems The most striking feature of a dynamic system is the existence of attractors. There has been increasing evidence in biology that the attractors of a biological memory function as storage [27]. This suggests that an associative memory can
Associative Memories
189
Figure 3 Schematic illustration of an attractor associative memory. Three point attractors (patterns) were assumed.
be developed as a dynamic system, say Q : X x {/} isan attractor of the system Q, for example,
X, such that each X^f^^
(i) each X^f^^ is an equilibrium point of Q (/x = 1, 2, (ii) eachX^^> IS attractive
, p) and
Here X is again the pattern state space, and t is the time (usually t e [0, oo)). Figure 3 shows schematically the principle of this kind of associative memory. Each pattern to be stored is an (point) attractor of the constructed dynamic system. Because of this, the memory is named an Attractor Associative Memory (AAM). Two striking characteristics of an AAM are (1) every stored memory pattern can be recalled from its nearby vectors—an implicit error-correction capability; (2) the retrieval process of the stored pattern is not a direct mapping computation, but an evolution of the memory's orbit as time goes by. We will present a detailed analysis of this type of memory in this chapter. Attractor neural networks have been intensively studied during the past two decades. The additive model [27-30], the correlograph model [31], the brainstate-in-a-box [32], Hopfield models [33-35], the Cohen-Grossberg model [36], the Boltzman machine [37], the mean-field-theory machine [38], the bidirectional
Zong-Ben Xu and Chung-Ping Kwong
190
associative memory [39, 40] and Mortita's nonmonotone memory [41], among others, are the most typical. Principle 3: Pattern Clustering Another possible way to realize associative memory is to transform the association problem into a pattern clustering problem. Specifically, each given pattern X^^^ is embedded in a cluster of patterns, say ^(/x) = {X^^'^^: L = 1 , 2 , . . . , /(/x)}, in such a way that (i) Q (/x) takes X^^^ as its cluster center, and (ii) Qifjii) and Q (/L62) are disjoint for any different /xi and /X2 Then, regarding X^'^'^^ as the incomplete or distorted versions of X^^\ the association problem becomes the correct clustering of all patterns X^^'^^ (i = 1 , 2 , . . . , /(/x), /x = 1, 2 , . . . , p). Accordingly, any successful clustering system (technique) applied to the pattern set {^(/x): /x = 1,2,...,/?} functions as an associative memory. Such an associative memory is illustrated schematically in Fig. 4. It is seen that the memory also has implicit error-correction capability. The performance, however, relies heavily on the specific construction of the expanded cluster patterns {Q (/x): /x = 1, 2,...,/?} as well as the cluster technique being used. Neural networks developed for performing clustering are enormous. Related models in-
XM
) D
D
,D 0D D
c c c cc
X,
Figure 4 Schematic illustration of a pattern-clustering associative memory. Four clusters (prototypes) were assumed.
Associative Memories
191
elude the adaptive resonance theory of Grossberg [42, 43], the self-organizing map of Kohonen [44], and the counter-propagation net [45]. We will call this type of memory the patter-clustering associative memory (PCAM). Based on the above three principles of operations, we may conclude that almost every existing neural network model can serve as an associative memory in one way or another. However, each has its own advantages and shortcomings, as follows. The FMAMs have an exclusive advantage of generalization, that is, they may generate new patterns that are different from the given memory patterns but have reasonable correspondence with any key inputs. However, this is necessarily attained by carrying out a complicated, long-time training of the networks based on the pattern examples [4, 6,46]. Moreover, error-free recall of the stored pattern is not guaranteed. Consequently, function mapping neural networks may not be the most powerful when they are performing association tasks. The most attractive property of the PCAMs is their capability of error correction; they may also allow an error-free retrieval of the expected memory pattern, as long as the expanded cluster of patterns ^(/x) have taken the memory X^^^ as their center. Yet such attractive properties of the networks are returned from the huge cost of training an expanded cluster pattern {^(/x)}, the number of which must be very large to achieve perfect clustering. Thus, in some cases, the high complexity may offset the advantages. A distinctive feature of an AAM is the natural way in which the time, an essential dimension of learning and recall, manifests itself in the dynamics of the memory. As a result, the AAM is particularly suitable for handling input errors in associative memory tasks (because, unlike FMAMs or PCAMs, which are feedforward in topology, the time evolution of recalling a pattern can be continued until all errors are eliminated). This is known as content addressability or errorcorrection capability. Other advantages include short network training time, guarantee of perfect recall of all of the stored patterns, and easy electronic implementation. However, the storage capacity of such a memory is not high for small networks. The efficiency is also frequently depressed by the existence of spurious memories.
C . ORGANIZATION OF THE CHAPTER The objective of this chapter is to provide a uniform account of some newly developed theories of attractor-type neural network memories. In Section II the fundamental aspects of attractor associative memories (AAMs), which provide the basis for subsequent expositions, are discussed. Two fundamental problems of information storage and information retrieval will be ad-
192
Zong-Ben Xu and Chung-Ping Kwong
dressed. The emphasis here is placed on some new powerful encoding techniques. In Section III we investigate a new type of continuous AAM—the competitive associative memory, which is based on simulating the competitive persistence of biological species and aimed at excluding the spurious stable points. By applying the general theories developed in Section II, the neural network model and its neurodynamics are analyzed. As a demonstration of the effectiveness of the proposed model, we also derive in this section a novel neural network memory that can always respond to input reliably and correctly. Section IV is devoted to the study of a class of discrete AAMs—the asymmetric Hopfield-type networks. We first demonstrate the convergent dynamics of the networks by developing various convergence principles in a unified framework. We then conduct a quantitative analysis of the network's error-correction capability by establishing a classification theory of energy functions associated with the networks. The established theories are applied to the formulation of a "skill" (called the complete correspondence modification skill), for improving the error-correction capability of the symmetric Hopfield-type networks. Numerical simulations are used to demonstrate the effectiveness and power of the developed theories. We conclude the chapter with a summary of results in Section V.
11. POINT ATTRACTOR ASSOCIATIVE MEMORIES Loosely speaking, a dynamical system is a system whose state varies with time (such as those governed by a set of differential or difference equations). In this section we study a special type of dynamical system, feedback neural networks functioning as associative memories. From the dynamic system point of view, a feedback neural network is a largescale, nonlinear, and dissipative system. This last property implies that the state trajectories of the system may converge to manifolds of low dimensions [5, 4750]. If the manifold is a single point, we call it a point attractor. If the trajectory is in the form of a periodic orbit, we call it a stable limit cycle. A dynamic system whose attractors are all point attractors is said to be a point attractor (dynamic) system. A point attractor system is an associative memory or, more precisely, a point attractor associative memory (PAAM), if the system's attractors correspond to or contain the prototype patterns to be memorized. The PAAMs have been extensively studied because of their usefulness in modeling associative memory processes in both biological and artificial systems. In the following, the basic models and the related theories are reviewed and generalized in a unified framework. New techniques for design and analysis of the PAAMs are then developed.
Associative Memories
193
A. NEURAL NETWORK MODELS 1. Generalized Model and Examples Neural networks with dynamics are generally referred to as dynamic neural networks (DNNs). A DNN is usually described by a system of differential equations, j^Xi{t)
= Fi{t,x{t)),
/ = l,2,...,iV,
(1)
/ = 1, 2 , . . . , TV,
(2)
or a set of difference equations, Xi{t + 1) = Gi{t, x(0),
where x{t) = (xi(t), X2(t),..., XN{t))^ e HN is the state variable, ^N is the underlying state space, and Ft, G/: it^ x (0, oo) -> R are, in general, nonlinear functions of their arguments. The state space 9:^ can be the Euclidean space or its subset. It can also be a non-Euclidean space such as a circle, a sphere, or some other differentiable manifold. A neural network is a system that consists of a large number of processing units, called neurons, which are fully interconnected to each other. To see how a neural network can be modeled by either system (1) or by system (2), let us first consider possible models of a single neuron that accepts a set of stimuli (inputs) xi,X2,... ,XN and outputs a signal yt. The model of McCuUoch and Pitts [51] is given by yt =(pi{ui), with Ui = ^WijXj
-Oi,
(3)
where (pi, called the activation function, is a sigmoid nonlinearity (i.e., a continuous, increasing, range-bounded function); W/ = (wn, Wi2,..., VOIN)^ is the so-called synaptic weights, with wij representing the strength of the stimulus Xj acting on the /th neuron, and Oi is a threshold attached to the neuron. Alternatively, Hopfield [34], aiming at processing spatial-temporal inputs, treated the neuron function as a RC circuit described by Ji
=(pi(Ui),
194
Zong-Ben Xu and Chung-Ping Kwong
with ^ aui
Ui
v^
Here, apart from the specific physical significance, the function cpt and the synaptic weight Wij share the same meaning with that in the McCuUoch-Pitts model, whereas Ci, Rt, and // together may be understood as the threshold representing the external inputs. Some other models can also be found, for example, in [4, 5, 52]. For developing a unified framework, we go further, to introduce a generalized (but formal) neuron model: yi =(pi{ui), with Ui(t) = Hi{t,Wi,x(t)),
(5)
where Wt and x{t) are the same as before, and Hi is a general mapping such that, for any fixed t and W/, Hi maps any continuous function x(t) onto a continuous function Ui(t). The mapping Hi may, of course, involve differential or difference operators similar to that in the Hopfield model. Any high-order differential (difference) equation can be recast as a set of firstorder differential (difference) equations. We can therefore consider, without loss of generality, the following two particular forms of model (5): • Continuous dynamical neuron: yi =(pi(ui), with j^Ui(t)
= Hi{t,Wi,xit)),
(6)
• Discrete dynamical neuron: yi =(pi(Mi), with Ui{t + \) = Hi{t,Wi.x{t)).
(7)
When A^ neurons of the same type (say, the McCulloch-Pitts neuron) are interconnected, a specific neural network is built. Assume, for example, that the neurons are coupled in such a way that the output yj(t) of the 7 th neuron is taken as the 7th input into the i\h neuron (hence yj{t) = Xi(t)). Then the use of
Associative Memories
195
McCuUoch-Pitts type of neuron results in a network, • N
Vi(t-\-l)=(p
Y^WijVj{t)-Ii
/ = 1,2,,
A^,
L;=l
and the use of the Hopfield RC neuron leads to dui
u
^'IT^'^'^Yl
^iJ^(^j) + ^/'
Ri
1 , 2 , . . . , TV.
These are the most popular and typical neural network models known as Hopfield nets. In a similar way, with our formal dynamic neuron models (6) and (7), the subsequent generalized neural networks can be derived: d
Ui{t) =
Hi{t,Wi,^{u{t))),
1,2,..
A^,
(8)
Ui{t + \) =
Hi{t,Wi,^{u{t))),
1,2,
A^,
(9)
dt and
where w = (wi, W2, • • •, ujsi)^ e HN and ^(u)
=
{(pi(ui),(p2(u2),.. >(PN{UN))
(10)
We call the system (8) a continuous DNN, and the system modeled in Eq. (9) a discrete DNN. In either case, vi = (pi(ui) is said to be the state and Ui to be the activation potential of the ith neuron, and the vector ^ ( 0 = (i^i, ^2, • • •, ^N)^ is the state of the overall D N N at time t. As a dynamic system, a DNN modeled in Eq. (8) or (9) may not have a guaranteed convergent dynamics (i.e., no chaotic behavior) for a given set of weights Wi. This implies that not all of the DNNs can function as point attractor associative memories. For a DNN to be a PAAM, the DNN must be dissipative, that is, the system's energy should be decreasing along its trajectories. Therefore we incorporate this dissipation requirement into the formal definition of PAAM. D E F I N I T I O N 1 (PAAM). A DNN is a PAAM (point attractor associative memory) if there is a differentiable functional E : R ^ -^ R such that (i) when the DNN is continuous.
J2 -^a E{u{t))Hi{t, Wi, O(w(0)) < 0, i=l
dui
Vf > 0 ,
(11)
(ii) when the DNN is discrete,
E{v(t-hl))<E{v(t)),
t>0,
(12)
Zong-Ben Xu and Chung-Ping Kwong
196 where v(t -\- 1) satisfies
Vi(t + 1) = cpi{ui(t + D) = (pi{Hi{t, Wi,
(13)
In either case, E is said to be an energy of the PAAM. The conditions (11) and (12) here mean nothing but the dissipation property of the PAAMs. Equation (11) implies dE(u(t)) dt
(8)
-y E{u(t))-ui{t) ^ aui ' dt N = ^ — E{u(t))Hi{t, Wi, <^{u(t))) < 0, dui /=i
and Eq. (12) implies A£(i;(0)|(9) = E{v(t + 1)) - E{v{t)) = £((D(w(r + l ) ) ) - £ ( i ; ( 0 ) < 0 ; both indicate that the energy E decreases monotonically along the trajectories of the PAAMs. The following is a set of PAAM examples that encompass most of the wellknown neural network memories in the literature: Cohen-Grossberg Model [36]: dui
~dr
,
= ai(ui) bi(ui)-^Cij(pj(uj)
7 = l,2,...,iV,
(14)
bi(X)(pl(X)dk,
(15)
where «,(«/) > 0, ctj = Cjt and E(u) = ^TTcijipjiuj) Hopfield Model [34]:
where wij = Wji and N N E(v) = -^
- T
r
Associative Memories
197
• Little-HopfieldModel[35,53]: (16)
Vi(t -{- 1) = s g n 7=1
where wij = Wji, sgn (•) is a signum function, and A^
A^
A^
j=i
i=l
(17) i=i
• Brain-State-in-a-Box Model [32]: Vi(t + 1) = sat
(18) 7=1
where wij = Wji, sat(-) is a saturation function, and N
N
i=l
7=1
P E(v) = --J2J2'^iJ'^i''j' • Recurrent Back-Propagation Algorithm [54]: dxi (t) dt
N
-Xi(t)-h(p
Y^WijXjit)
-\-Ki,
/ = l,2,...,iV,
i=l
or, equivalently, dui (t)
dt
= -Ui(t)-\-^Wij(p{uj(t))-\-^WijKj,
i =
l,2,...,N,
where wtj = Wji, (p(') is sigmoid function, and N
N
N
y.
^(^) = -2 I ] E ^0-^/(^7 +2/^7) + E / (P~\m^i=l j=l
i=l
^^
We will further suggest some new PAAM structures in Sections III and IV of the present chapter. It should be noted that among all of the examples listed above, the CohenGrossberg model is the most general [5, 49, 55-58] (hence, in the following, any general result will be illustrated by and applied to this model). Nevertheless, it was the Hopfield model that first manifested the physical principle of storing informa-
198
Zong-Ben Xu and Chung-Ping Kwong
tion in a dynamically stable network, a crowning success of PAAM techniques [34, 35]. 2. The Main Issues The primary function of a PAAM is to store a given set of prototype patterns in the system, so that it can retrieve a stored pattern upon the presentation of an incomplete or noisy version of that pattern. Thus the storage and retrieval of patterns are fundamental issues to be addressed. Formally, we refer to the two issues as, respectively, the encoding (or learning) problem and the decoding problem. The encoding (or synthesis) problem is concerned with the identification of point attractors with the given prototype patterns by either encoding the given information or training the PAAM with a learning algorithm. The memory of a DNN lies completely in the synaptic weights; therefore an encoding scheme can be regarded as a specific synaptic-weight-setting rule, which uniquely determines the characteristics of the network. We will return to this shortly. The decoding problem is to recover the information stored in the memory from a reasonable cue. In a PAAM, all of the information is stored as point attractors. Retrieval of information, that is, actuating the PAAM system from a cue or key input, appears to be simple. There are, however, several considerations: Can any input serve as the key to retrieving a memorized pattern? If not, what type of inputs can do so? In particular, can any nearby vector of a memorized pattern be used as the key for recalling that pattern? All of these problems are associated with the analysis of the dynamics of the PAAM (the neurodynamics). The neurodynamics of a PAAM clearly provides the basis for successful PAAM synthesis. Conversely, a particular synthesis procedure, in turn, characterizes the neurodynamics of the synthesized PAAM. The encoding and decoding problems are intricately connected.
B. NEURODYNAMICS OF P A A M S We will consider in this subsection only the autonomous and continuous cases of PAAMs, that is, systems described by — Ui{t) = Hi{Wi, O(w(0)),
/ = 1, 2 , . . . , A^,
(19)
or, in its state-equation form, ^ Vi(t) = ip[{(p-\vi))Hi{t,
W/, i;(0),
/ = 1, 2 , . . . , A^.
(20)
Associative Memories
199
Denote H{W,
0(M))
= {Hi{Wu 0(w)), H2{W2, 0 ( w ) ) , . . . ,
Div) = dmg{(p[{cp-Hvi)), (p!^{cp-\v2)), • . . ,
HN{WN,
(u))f, (21)
(PN{
(22)
Equations (18) and (19) can also be rewritten in a more compact form: ^^u(t) = H{W,{u(t)))
(23)
and j^v(t)
= D(v)H{W,v(t)).
(24)
Whenever distinction between Eqs. (23) and (24) is unnecessary, we express both using a general form: dx —- = F(x), at
Xe
HN,
(25)
where F : ^N -^ R^ is a nonlinear operator. We assume the existence of a unique solution x(f, XQ) to Eq. (25) for any given initial state x(0) (which is the case, e.g., when H is locally Lipschizian [59]). The solution x(t, XQ) is also called a trajectory of (25) through XQ, denoted henceforth by FQ. If D is a subset of ft^v, D is said to be invariant under the system (25) if XQ e D implies FQ C. D. A point /? is a (w-limit point of FQ if there is a subsequence [tt} such that p = lim/_>oo-^(^z, -^o)- The (w-limit points of ^o are called the co-limit set of FQ. The a)-limit set is invariant under the dynamics. Our purpose is to provide a clear understanding of the -limit sets of every trajectory ofaPAAM. More background on the stability of a system is needed. A constant vector x is said to be an equilibrium state of the system (25) if Jc is a zero point of the operator F, that is, if Jc is such that F(x) = 0.
(26)
The equilibrium state x is said to be stable if any trajectory of (25) can stay within a small neighborhood of Jc whenever the initial state x(0) is close to Jc. It is attractive if there is a neighborhood K(x), called the attraction basin of i , such that any trajectory of (25) initiated from a state in K(^) will approach x as time approaches infinity. An equilibrium state is said to be asymptotically stable, or a point attractor, if it is both stable and attractive. The equilibrium state x is further said to be globally asymptotically stable if it is asymptotically stable, and ^^(jc) = ^ N . Let F~^ (0) denote the set of equilibrium states of (25). An equilibrium point is said
200
Zong-Ben Xu and Chung-Ping Kwong
to be almost globally asymptotically stable provided it is asymptotically stable
and^(x) = {ftN\{F-HO)}UW. In terms of the above dynamic system notions, the essence of a PAAM is a mapping from the given set of stored prototype patterns onto its asymptotically stable states. The number of the asymptotically stable states determines the storage capacity of the memory, and the size of attraction basins of the asymptotically stable states characterizes its error-correction capability. Thus the performance of a PAAM is characterized by its neurodynamics. We may expect two kinds of dynamics of a PAAM: global convergent and asymptotically stable dynamics. The stable dynamics is imperative for a PAAM to have the capability of error correction, whereas the convergent dynamics is necessary for keeping a PAAM out of oscillatory or chaotic behavior; the latter implies that every trajectory of the system should converge to some equilibrium. By Hirsch [47], the dynamics is said to be convergent if the convergent limit, the R be a functional that is continuously differentiable on D. (i) E is said to be a Liapunov function of the system (25) on D, if its derivative along the trajectories of (25) are nonpositive, that is, dE(x), , , ~~T~^\i25) = ( ^ ^ ( ^ ) ' ^(•^)> ^ ^'
when A: G D ,
where (•, •> denotes the usual inner product, (ii) The Liapunov function E is strict if the derivative along the trajectories vanishes only at the equilibria of (25):
dE(x) dt
= 0
iffX e F~\0)
no,
(25)
(iii) The Liapunov function E is positive definite if E(x) > 0 for any x e D, and E(x) = 0 if and only if JC G F~^ (0). THEOREM 1 (LaSalle Invariance Principle). Let the system (25) have a Liapunov function E on itjsf and TQ be a bounded trajectory. Then (JI){T) C E , where
Associative Memories
201
E is the largest invariant set contained in the set E, where E = {x:
= 0
THEOREM 2 (Liapunov Method). An equilibrium state x e F~^(0) is stable if there is a positive definite Liapunov fiinction on a neighborhood of x. The state X is asymptotically stable if furthermore, the Liapunov function is strict. THEOREM 3 (Linearization Method). Assume that x e F~^(0) and the linearization equation of the system (25) at x is given by
^ = DF(x)(x-x), (27) at where DF(x) denotes the derivative of F(x) at x. Let ReX(DF(x)) denote the real part of the eigenvalue k{DF(x))ofDF{x). Then x is asymptotically stable if all RQX(DF(X)) < 0; X is unstable if there is at least one eigenvalue X(DF(x)) for which Rek(DF(x)) > 0. It is important to note that both Theorems 1 and 2 apply the Liapunov function. However, although this is recognized, there is a fundamental difference between the LaSalle invariance principle and the Liapunov method: the positive definiteness of the Liapunov function is not required in the former, but the same condition is not only necessary, but crucial, to the validity of the latter approach. We are now in a position to apply these principles to investigating the convergence and stability of the PAAM systems. 1. Global Convergence Note first that, by Definition 1, the energy £" of a PAAM system is naturally identified by its Liapunov function. If E is strict in the sense of Definition 2, then E = F~^(0), and it follows that S, the largest invariant set of (25) contained in E, is identical to F"^(0). In addition, d/dtx(t) -^ 0 implies that the (W-limit set coiVo) is connected. Consequently, whenever F~^(0) is known to be disconnected (say, isolated), the convergence of x(t) follows directly from the LaSalle invariance principle. We state this in the following theorem. THEOREM 4 (Global Convergence of PAAM). If the PAAM modeled in (25) admits a bounded trajectory x(t, XQ) and its energy E is strict, then x(t, XQ) is quasi-convergent. Furthermore, if F~^(0) is disconnected (e.g., it consists of a finite number of states), x(t, XQ) is convergent.
The quasi-convergence does not imply convergence in general (we will present such an example in Section III.B). Therefore, the existence of a strict energy function is not sufficient to guarantee the convergence of a PAAM, even it is the case for quasi-convergence.
202
Zong-Ben Xu and Chung-Ping Kwong
Theorem 4 can be very powerful for the study of convergence of various PAAMs such as those Usted in the previous section. The reasons are as follows. (i) For almost all models, the boundedness of the trajectories is a direct consequence of (neuron) state boundedness implied by the range-boundedness of the sigmoid function. (ii) The vast majority of the models can be recast as a generalized gradient system, dx — = -D{x)VEix) at
:= Fix),
(28)
where D(jc) is a nonnegative definite, diagonal matrix for any jc. As an example, we can rewrite the Cohen-Grossberg model (14,15) in the form ^ = -D(u)VE(u), (29) at where Misgiven as in (15) and D(M) = diag{ai(Mi)(^'^(Mi),... ,a]si{uNWj^{uN)} is nonnegative definite. In this setting, it follows trivially that ^ ^ ^ ^ = -(D(jc)V£(jc), V£(jc)) = 0 iff F{x) = 0. at Consequently, E suggests a natural strict Liapunov function of the model. (iii) The disconnection of F~^ (0) is a natural requirement. For instance, if we assume that each equilibrium Jc 6 F~HO) is simple (i.e., DF{x) is invertible), the disconnection property follows. When applied to the generalized gradient system (28), this simplicity property can be met by the following condition: D{x) are positive definite and the Hessian matrices V^E{x) are invertible for every x € F~^(0) (because, in this case, F(x) = 0 amounts to WE(x) = 0, and hence DF(x) = D(x)V^E(x)). If this last condition specifically refers to the CohenGrossberg model, we then find that F~^ (0) is disconnected as long as at (w/) > 0, (Pi(ui) > 0, and the matrix M(w) is nonsingular, where M(M) is defined by M(u) = C- diag -77^—, ^ 7 3 - , . . . ,
,,- , .
(30)
Various criteria for the nonsingularity of M{u) may be given, for example, Cii - ^ - ^ \ > J2 l^^-l' (p\Ui)
/ = 1, 2 , . . . , A^.
(31)
Based on the above arguments, we have the following. THEOREM 5 (Convergence of Generalized Gradient System). If a PAAM is a generalized gradient system of the form (28), which yields bounded trajectories,
Associative Memories
203
then the PAAM is always quasi-convergent. In addition, if D{x) in (28) are positive definite and the Hessian matrices V^E{x) are nonsingular (or F has a finite number of local extremes), then the PAAM is convergent. COROLLARY
1. Let the PAAM be defined by the Cohen-Grossberg model
(i) lfai(ui)>0, and(p[{ui) > 0, then the PAAM is quasi-convergent; (ii) Ifai(ui) > 0, (p[{u) > 0 and the inequalities (31) are satisfied, then the PAAM is convergent. Theorem 5 provides a generic criterion for the convergence of generahzed gradient systems. Because most PAAMs designed for optimization fall into this category, the importance of Theorem 5 is obvious. Also observe that statement (i) of Corollary 1 is indeed the famous Cohen-Grossberg theorem [36]. 2. Asymptotic Stability Consider a PAAM of splitting form: ^ = D(x)Fi(x):=F(x), (32) dt where D(x) = dmg{di(xi), d2(x2),..., ^A^(^A^)}, with di(xi) > 0. We assume x is an equilibrium of the system (32). THEOREM 6 (Asymptotic Stability of PAAM). Suppose Fi is continuously differentiable on a neighborhood of x. Denote A = DF\{x) = (aij)NxN ^^^ B = (bij)NxN, where bu = {aul btj = -\aij\forany i, j = 1,2,..., N. Then, under either of the following conditions (i) and (ii), x is asymptotically stable:
(i) ReA.(A) < 0 , (ii) an <0(i = l,2,...,N)
andRQX(B) > 0.
The proof of Theorem 6 is straightforward for the case Re A (A) < 0, and under condition (ii), B^ is an M-matrix; thereby there is a positive vector § = (?i, ?2, • • •, ^NV such that B^^ > 0. Because an < 0, it then follows that ^j^jj + I]?H«.7l < 0,
y = 1, 2 , . . . , A^.
(33)
Let S(x) = diag{sign(xi), sign(x2),..., sign(jciv)} for any x e R^. Construct a positive definite Liapunov function by E(x) = J2^i
[di(xi)]
sign(A - Xi)dX.
Zong-Ben Xu and Chung-Ping Kwong
204 We find dE(x) = 5Z^'[^'^^'^] (32) dt
sign(x/ -Xi)— Xi
= I Six - x)^, D(x)-^ ^ \ = {Six - i ) ^ Fiix)). We express Fiix) at x as Fi(jc) = DFI(JC)(JC — Jc) + o(jc — jc), where oix — x) denotes the remaining nonlinear-order terms of Fiix). Substituting this into the above equation gives dEix) = [Six — A:)§, Aix ~ x) 4- oix — x)) (32) dt N
N
= ^^,sign(x/ - Xi)Y^aijix
- Xj) + o{\\x - x\\
i=l ?7^J7 + Z!?/I^OI \xj -Xj\
-\-o{\\x -x\\
/#; < max I ^jGjj + X^^/|tz,7l | ||A: - ^||i + o{\\x
-x\\).
In light of Eq. (33), the last term of the above inequality is negative whenever x is in a small neighborhood of Jc. The asymptotic stability of x then follows from Theorem 2 by applying the Liapunov method. The conditions in Theorem 6 can be shown to be strict in the sense that no inequality signs can be reversed (i.e., replacing the "larger than" sign with the "smaller than" sign). If we impose the reversion. Theorem 3 shows that x is indeed unstable [59]. To split a system into a form of (32) is of particular significance. Whenever it can be done, the stability of the original system is characterized by its simplified system, dx/dt = Fiix). For example, we can recast the Cohen-Grossberg model Eq. (14) in the splitting form dv — = dt
Div)Fi(v)
(34)
where D(v) = Ai&g{ai{(p-^{vi))(p[{
{bi{
Associative Memories
205
Theorem 6 can then be appHed directly to yield the stability of Cohen-Grossberg PAAM, as follows. THEOREM 7 (Asymptotic Stability of the Cohen-Grossberg Model). Assume that the PAAM is the Cohen-Grossberg model in which cp^ exist, adui) > 0 and (p[{ui) > 0 for any i and ut. If v is an equilibrium state of (34), then v is asymptotically stable provided
(i) di - b'.((p-\vi)) := bii > 0, / = 1, 2 , . . . , A^, and (ii) B = (bij)^ X N with btj = —\cij \ (i ^ j) is an M-matrix. There exists a great number of well-known results for M-matrices that can be applied to derive the specific conditions for B to satisfy (ii). An example is the quasi-diagonal dominance of B: N
^i\bii\>^^j\bijl
i = h2,...,N,
(35)
where (§i, ^2» • • •, §A^) is a set of A^ positive numbers. When this specification is used, we obtain, from Theorem 7, the following. COROLLARY 2. In the setting of Theorem 7, any v e F~^(0) is asymptotically stable if there are N positive constants §i, §2» • • •» ?A^ •^wc/i that
(pi): B(v) = Cv,Vv e F-\0) (p2): di - §.-1 Eyy/ ^j\cij\ > max{b^.(cp-\vi)y. v e F'^}, i =
h..,,N
Corollary 2 underlies the synthesis techniques of the Cohen-Grossberg model [60-62].
C. ENCODING STRATEGIES Suppose we are given a set of M A/^-dimensional prototype patterns {X^^^: /i = 1,2,...,/?} that we wish to store as asymptotically stable equilibrium states of an A^-neuron PAAM of the form (8). An encoding strategy is a method or rule to specify the configuration of such a PAAM system. Encoding is an "inverse" problem in the sense that, given the set of point attractors, we are asked to construct a dynamic system (PAAM) that possesses these attractors. The solution is not unique in general. To simplify the subsequent exposition, we assume in this subsection that the PAAMs are given as the CohenGrossberg model in which the functions a/(-) and /?/(•) are also prespecified (particularly, we assume bi(ui) = M^, as in the Hopfield model). Thus only the connection weights ctj and the neuron activation function (pd-) have to be deter-
206
Zong-Ben Xu and Chung-Ping Kwong
mined. However, although these assumptions simplify the problem, they by no means ensure a unique solution. In practice, the prototype patterns can be set to be bounded, say, in (—1,1)^ (i.e., X^^^ e (—1, 1)^). Moreover, the neuron activation functions (pi(ui) are taken to be the same type of sigmoid functions, for example, (Pi(ui) = (p(ui) = tanh
[~rT
+ exp(-Qf/M/)'
where at > 0 is called the gain of the neuron /. Note that
ail
Ll + i^uJ
oii
In this setting, it suffices to specify the connection weights ctj and the gains at for encoding. Corollary 2 in the last section provides the basis of encoding methods. Indeed, by taking the equilibria of the network as the prototype patterns (i.e., F~^(0) = [X^^^: jji = 1 , 2 , . . . , p}), every X^^^ will be asymptotically stable according to this corollary, as long as the network satisfies conditions (pi) and (p2) contained therein. Consequently, any solution of the equations in (pi) and (p2) gives an encoding for the Cohen-Grossberg PAAM. Furthermore, conditions (pi) and (p2) can be written as (pi)': R-^<^(X^'^^) = CX^^\ /x = 1, 2 , . . . , /?, (p2)': \-\-Cii>Hr'T!'j^i^j\cijl where R = diag{ai, a 2 , . . . , aA^} and 4>(u) = {(p-~\vi), (p'~\v2)....,
(P~HVN))^-
If we set W = /?C, we further obtain (pi)'': 0(X(^>) = WX(^\ M = 1, 2 , . . . , /?,
(p2)":a,>?-E"//?;
Wi
Now we observe that once the matrix W satisfying the condition (pi)" is constructed, (p2)" follows easily by letting A^ f ^ 1 at = -Wii + y ^ \wij\ or at =a = max \ - wa + Y^ \wij\ \ and f/ = 1. This reveals that the construction of the connection-weight-related matrix W is fundamental to the encoding. In the remainder of this section, we develop several efficient rules for constructing W that satisfies (pi)".
Associative Memories
207
1. A Successive Interpolating Encoding The problem may be subsumed under the following more general interpolation problem: Given a set of M vector pairs {X^^\ y(A^)}, find a matrix HM(X, Y ) such that HM{X, Y)X(^> = y ( ^ \
/x = 1, 2 , . . . , M,
where X = {X^^\ X^^\ . . . , X^^^} and Y = {7 = (Y^^\ Y^^\ . . . , y(^>}. We propose the following construction algorithm: Algorithm 1 (Successive Interpolating Encoding). Let
(xa))(za))r ^'^^^ - (za))^(x(i))'
(ya))(za))^ ^^^^' ^^ - (z(i))^(za))'
^^^^
and for anyfc= 1, 2 , . . . , Af — 1, Pk+i{X) = Pk(X) + €;t(JSOeJ(X)/eJ(X)Z(*+i),
(37)
Hk+iiX, Y) = if^(X, Y) + %(X, Y)eJ(X)/eJ(X)X(*+i>,
(38)
€k{X) = Z(*+i) - /\(X)X*+i>,
(39)
%(X, Y) =- y(*^+i' - Hfc(X, Y)X(*+'>.
(40)
where
When the vectors {X^'^\ Z^^) X^^)} are in general position (i.e., linearly independent [23, 63]), the recursive algorithm defined as above is well defined and the so constructed Pjt(X) is a best approximation of R^ in Xk, where X^ is the subspace spanned by X^^\ X^^\ . . . , X^^^ [64]. Furthermore, the matrix //^(X, Y) defined by the algorithm is an interpolation operator. THEOREM 8. / / X = {X^^\ X^'^\ . . . , X^^^} are in general position, then, for any integer A: G [1, M], the matrix Hk(X, Y) constructed by Eqs. (36)-(40) satisfies
Hk(X, Y)X^^^ = Y^^\
Ai = 1, 2 , . . . , A:.
(41)
We prove Theorem 8 by induction. First note that the equality (40) is trivially true for A: = 1, because Eq. (36) implies
208
Zong-Ben Xu and Chung-Ping Kwong
If one assumes Eq. (41) holds for any I < k < L — 1 with some L > 1, then, for k = L,(3S) becomes
where €l_^iX)X^^^ + 0, because of the definiteness of / / L - I ( X , Y ) . Because the matrix P L - I ( X ) is the best approximation of R^ in X L - I = span{X(i>,..., X^^-^\ which means (X - P L - I ( X ) X , 7} = 0 for any Z € R^ and Y € X L - I [61], we deduce that, for any /z < L — 1, 6[_i(X)X(^) = (X(^> - P L - I ( X ) X ( ^ \ X(^>) = 0.
It follows from Eq. (42) that i/L(X, Y)X('^> = i/L-i(X, Y)X(^> = F^^^ for any /JL < L — I, Furthermore, a direct calculation with Eqs. (42) and (40) gives / / L ( X , Y)X(^> = ifL-i(X, Y)X(^> + ^L-i(X, Y) = Y^^K
The equality (41) follows for i = L.By induction, the proof is completed. Some remarks on Algorithm 1 are useful. (i) The algorithm proposed here is a direct generalization of the Hebbian learning rule: M
//M = ^y(/^)(X(M))^.
(43)
To see this, let us suppose that {X^^\ X^^\ . . . , X^^^} and {Y^^\ Y^^\ . . . , Y^^^} are both normalized and orthogonal. Then Algorithm 1 implies that Pi(X) = {X^^^XX^^Y and Hi(X, Y) = (Y^^^)(X^^Y- If we assume that k
k
Pk(X) = Yl X^^\X^^y^
Hk(X, Y) = ^
/x=l
y(/^)(x(/^))^
ix=\
for some k > I, then by definition (Eqs. (39)-(40)),
€^(X) = x(^+i> _ ^x(^>(x(^))^z(^+i) = x^^+i),
it ix=\
(44)
Associative Memories
209
and hence, Eqs. (37) and (38) imply, respectively,
and
Thus the identities (44) are also valid for /x = /: + 1. This is true for any k = 1, 2 , . . . , M — 1. In particular, we obtain the Hebb's rule (43) when k-\-\ = M. (ii) The Hebb's rule has been regarded as a learning algorithm with biological plausibility [4, 65-67]. Apart from this, the most distinctive advantage of the rule is its "accumulation" feature in learning, that is, a new pattern can be learned by accumulating (adding) it directly to the existing memory. This accumulation property is very useful in problems in which the patterns to be stored are presented on-line. Clearly the much generalized Algorithm 1 retains this distinguished feature of the Hebb's rule. (iii) Let X = [X^'^\ X^^\ . . . , X^^>] and Y = [Y^^\ Y^^\ . . . , Y^^^]. Then HM(^, Y ) , defined by Algorithm 1, can also be expressed as i/M(X,Y)=YX+, where X"^ is the Moore-Penrose inverse of X. This is a direct encoding known as the pseudo-inverse method [12, 13, 39]. However, as noted by many authors, the accumulation property will be lost if such a direct encoding method is adopted. This reveals a fundamental difference between the proposed Algorithm 1 and the pseudo-inverse method. The successive interpolating encoding developed in this section is a very fundamental learning rule that can be applied to general PAAM research [64,68-71]. An example is shown in the following subsection. 2. A Sparsely Connected Encoding The matrix W to be encoded represents the connection weights among the neurons of the PAAM. The sparseness of W therefore reflects the connection complexity and the storage efficiency of the network [57, 64, 72]. All elements of the matrix Hp(X, Y) encoded according to Algorithm 1 are nonzero in general, which corresponds to a fully connected neural network. However, if the given pattern set X is known to be "decomposable" in certain way, we may obtain a very sparsely connected encoding. This subsection is devoted to this study. We begin with a fundamental definition.
210
Zong-Ben Xu and Chung-Ping Kwong
DEHNITION 3 (Decomposable Pattern Set). Let X = {X^^^ € (-1,1)^: IJL = 1, 2 , . . . , M} be a set of M pattern vectors. X is said to be decomposable at K-level if K >2 and there exist index sets Nk,k = 1 , 2 , . . . , ^ , such that
(i) {1, 2 , . . . , A^} = A^i U A^2 U .. • U A^^ (ii) NinNj^0J^j,iJ
^ = 1,2,...,/^,
are all in general position, where n{k) = \Nk\ is the cardinality of Nk and
xi^^ = {xf:jeNkY
e{-l,ir^'\
By this definition, if X is decomposable at ^-level, it is also decomposable at s-level for any s < K. The following is an example of a decomposable pattern set: EXAMPLE 1.
Let A^ = 10, M = 4, and
X(') = (1, - 1 , - 1 , - 1 , 1 , 1 , - 1 , - 1 , 1 , 1 ) ^ , X(2) = ( - 1 , 1 , 1 , - 1 , 1 , - 1 , 1 , 1 , - 1 , - 1 ) ^ , X^^^ = (1, - 1 , 1,1, - 1 , 1, - 1 , 1,1, - I ) ' ' , X^'^^ = ( 1 . 1 , - 1 , - 1 , 1 , 1 , - 1 , 1 , - 1 , 1 ) ^ Then X = [X^*^^: fi = 1, 2, 3,4} is decomposable at 2-level. One possible decomposition is { 1 , 2 , . . . , 10} = {1,2, 3,4, 7} U {5,6, 8,9,10}, from which XC^) = (X$'^\ 4 ' ^ y ,
Xi = { X | ^ X f \ Xf\
M = l,2,3,4,
Xf>}, andX2 = { x f , x f , x f , x f } ,
with X[^^ = (1, - 1 , - 1 , - 1 , - I ) ' ' , X P = ( - 1 , 1 , 1, - 1 , 1)^,
X^^) = (1,1, - 1 , 1,1)'"; X f = (1, - 1 , 1 , - 1 , -1)^;
Associative Memories
211 Xf^ = ( - 1 , 1 , 1 , 1 , - 1 ) ^ ;
Xf^ = ( 1 , - 1 , 1,1,-1)^, .(4)
= (1,1,-1,-1,-1)^,
^(4)
= (1,1,-1,1,-1,1)^
When the pattern set X = {X^^^: /JL = 1, 2 , . . . , M} is decomposable at Klevel, we can write X = {X^f^^: /x = l , 2 , . . . , M }
=
{{x[^\x^^^\,..,x^^y'"/x =
l,2.
M}
= XieX2e---eX/^, where X^^^ are A/^-dimensional patterns and X[^^ (k = 1, 2 , . . . , ^ ) are n(k)dimensional patterns. In this case, to memorize the patterns {Z^^^: /x = 1, 2 , . . . , M} is equivalent to memorizing the pattern pairs {(X^ \ ^2 ' • * *' X^O^* M = 1,2,..., M}. This prompts us to develop a sparsely connected encoding for the decomposable patterns. To formulate such encoding, let us cluster coordinates of an A/^-dimensional vector X in such a way that X = (Xi, X 2 , . . . , XKf
with Xk = (Xr. i € Nk)'^ e R"^^\
Then, under the decomposition, the given pattern pair (X^^\ yC^)) is split into X(^> = {X[^\ X[^\ . . . , Xff
with X^^^ = (X^^^^ / G A^,)"^ G R"^^>,
and consequently, the stationary equations WX^^^ = F^^^ (/x = 1, 2 , . . . , M) can be recast as
Vyj^'^V
WKI
Wi2
WlK\
W22
W2K I
WK2
WKK/
/x = 1,2, . . . , M ,
\x^^>/
where Wij (1 < /, 7 < ^ ) are n(/) x n(j) block matrices. We propose to solve these equations by the following algorithm. Algorithm 2 (Sparsely Connected Encoding). For any fixed permutation of ( 1 , 2 , . . . , K), say, (^i, 5 2 , . . . , SK), let ^
__ f HM(Xi,Xsi), ^^"10,
if 7 = Si, if j ^ Si
(45)
212
Zong-Ben Xu and Chung-Ping Kwong
where HM(', •) is the interpolation operator constructed according to Algorithm 1. The so constructed matrix W is clearly very sparse—each row or column contains only one nonzero block (hence the name "sparsely connected encoding"), for example,
(
0
Wn
0
.'.
'.
^'.'
- O x :::
.'.
•
(^6)
WK\ 0 0 •. • 0 / In contrast with the fully connected encoding that requires N x N connections, the number of connections of a neural network using sparsely connected encoding is K C(5i, 5 2 , . . . , 5^) = ^ [n(k) X nisk)}, k=l
which may be far less than N x N, because A^ = n(l) + n(2) -\ h n(K). This shows that the sparsely connected encoding can yield a PAAM of much lower connection complexity than any other fully connected encoding. Based on this observation, a complexity reduction principle has been developed in [64] for the Litter-Hopfield model. EXAMPLE 2 (Bidirectional Associative Memories). Consider the heteroassociation problem [(^^^\ f ^^^); M = 1, 2 , . . . , p}, where ^^^^ e ( - 1 , 1)^, ^W = (—1, 1)^. As explained in Section I, the problem can be converted into an autoassociation problem {X^^\- /x = 1, 2 , . . . , /?} with X^^^ = (^^^\ ^^^^) in the product space ( - 1 , 1)^+^. The pattern set X = {X^^>: /x = 1,2,...,/?} is clearly decomposable at 2-level whenever {§^^^: fji = 1,2, . . . , p } and |^(M); ^ = 1 2 , . . . , /?} are both in general position. The decomposition is given by, for example,
{1,2, ...,A^ + M} = M UA^2, with A^i = {1, 2 , . . . , A^} and yV2 = {A^ + 1, A^ 4- 2 , . . . , A^ + M}. In the decomposition framework, X^^^ = (X\^\ X2 ) is specified by Z$^^=f(^)andZ^^^=f(/^\ Therefore, Xi = {§(^>: /x = 1, 2 , . . . , /?} and X2 = {^^^^: ^ = 1, 2 , . . . , p}.
Associative Memories
213
Now if we apply sparely connected encoding with the permutation (^-i, 5*2) = (2, 1), we obtain /
O
H^(X2,Xi)\
"^ - \Hp(Ki,X2)
O
) •
Using this encoding in the Cohen-Grossberg model gives rise to the following PAAM system:
where A{X) = (ai(xi), (22fe),..., aN+MixN^u))
,
B{X) = (biixi), b2(X2), . . . , bN+M{XN+M)) , OCX) = (^l(xi), (P2(X2), . . . , (PN+M{XN+M))
'
Note that A, B, and O are all diagonally dependent only, so that we can express the system in the following form: ^ § = A{H)[B{H) - Hp(X2, Xi)cD(0],
^ ; = A(0[B(0
- //p(Xi, X2)0(§)],
which is seen to be a generalization of the bidirectional associative memory model proposed by Kosko [39, 73].
III. CONTINUOUS PAAM: COMPETITIVE ASSOCIATIVE MEMORIES In the design of a PAAM there are two major considerations: the storage capacity and the error-correction capability. We would like to attain both large storage capacity and high error-correction capability. However, these two requirements are incompatible. The reason is that a large storage capacity requires a small attraction basin for each stored pattern, whereas a high error-correction capability asks for a large-size attraction basin for each stored pattern; hence the number of patterns that can be stored is limited. The existence of the so-called spurious attractors is the major obstacle to extending the storage limit.
214
Zong-Ben Xu and Chung-Ping Kwong
It is difficult to avoid spurious attractors by using the encoding techniques introduced in the last section. In this section, we develop a new type of PAAM called competitive associative memories, which can solve this problem effectively. The new model exercises a novel control on the attraction basins of the stored memories and defines a new association (recognition) principle. This is an example of using the concept of quasi-autonomous dynamic systems to attain a trade-off between the storage capacity and the error-correction capability.
A. T H E M O D E L A N D ITS SIGNIFICANCE The underlying principle of all PAAM models discussed so far is based on storing the prototype patterns as point attractors. They perform "noncompetitive recognition" in the sense that recalling is initiated by using the plain information contained in the individual key patterns without considering the relationships between two different patterns. In real applications, however, dynamic characteristics such as the occurrence probabilities of the stored patterns and their evolution over time [6, 16] provide additional information on the differences among patterns. The type of associative memories that use this kind of additional information can be regarded as performing "competitive recognition." To perform competitive recognition, we regard the prototype patterns as being stored as a set of different biological species and the competitive retrieval of the stored patterns as the competitive persistence of the species. Using this point of view, the known competition equation of biological species (e.g., see [59] and [74]) suggests a competitive associative memory model: dU
r
-r
-I
— = Wx(t)u(t) - [u^(t)Wa(t)uit)]Wu(t), at
(47)
where M
M
Wx(t) = ^Ait(OW^, k=l
Wait) = Y^ak(t)Wk, k=l
M
W = J2^k
(48)
k=l
and Wk = (X^^MX^^Y, {X^^\ /: = 1, 2 . . . , M} are the given M A^-dimensional prototype patterns. In biological terms, kkiO and ak(t) in Eq. (48) are, respectively, the competitive parameter and the inhibitive parameter associated with the species X^^\ Note that both parameters are nonnegative and range from 0 to 1. The model (47) performs excellent competitive associative memory in such a way that, as long as the competitive and the inhibitive parameters are chosen appropriately, the system (47) will not contain any nontrivial equilibrium point other than the prototype patterns ±X^^\ Furthermore, if the competitive param-
Associative Memories
215
eter Xk(t) eventually evolves to a unique winner, then only the prototype pattern corresponding to the unique winner is asymptotically stable, and hence can be retrieved from any nearby vector [74]. Nevertheless, the model (47) is only effective for orthonormal patterns, thus is, X^^\ k = 1, 2 , . . . , M, must be orthonormal. This is a severe limitation that excludes many applications. To generalize the model (47) to the case of nonorthonormal patterns, we observe that, under the orthonormal assumption, (48) reads Wx(t)X^'^ = Xi{t)X^\ WX^'^ = X^'\
Wo,{t)X^^ = ai{t)X^'\ / = 1,2, . . . , M .
(49) (50)
Therefore, an extension of the model (47) may be obtained by substituting the matrices Wx{t), Wait), and W, respectively, by some operators satisfying the identities Eqs. (49) and (50). The successive interpolating encoding introduced in the last section can be applied exactly for this purpose because of the generality of Theorems. Thus let XA;,(0 = [klit)X^^\ k2it)X^^\ . . . , kM{t)X^^^\ XAa(t) = [ai(t)X^^\ a2(t)X^^\ . . . , aMiOX^^^]. We apply Algorithm 1 to yield the matrices //M(XAx,(t), X), //M(XAa(t), X) and PM(X), and use them to replace Wx{t), Ha{t), and W in (47). The following equation results: ^
= //M(XAx(t), X)u(t) - [u^(t)HM{XAcc(t), X)u(t)]PM(X)u(t).
(51)
This is the general competitive associative memory model we will study in this section. As in [74], when the prototype patterns X^^^ (/ = 1, 2 , . . . , M) are viewed as M different biological species, and A/ (t) and at (t) as the competitive and inhibitive parameters associated with the species X^^\ the model (51) can be interpreted as a competitive persistence equation of the species. Then competitive retrieval of the stored patterns amounts to a competitive persistence process of the biological species. The model (51) is a nonautonomous system of differential equations, for which the following existence theorem is proved in [75]. THEOREM 9. If [at (t)} and [ki {t)} are continuous in t, then, for any given initial value u(0) = UQ, there exists a unique solution u(t, UQ) to Eq. (51). More-
216
Zong-Ben Xu and Chung-Ping Kwong
over, ifM = N, a/(0 = ki(t) > € > 0 and \\X^'^\\ = I for any t > 0 and i = 1,2,..., M, the solution u(t,uo) has the following properties: (i) (ii) (iii) (iv)
\\u(t, Mo)ll = I for any t > 0provided \\uo\\ = 1 ||w(f, wo)ll increases monotonically to 1 provided0 < \\uo\\ < 1 \\u(t, Mo)II decreases monotonically to 1 provided \\UQ\\ > 1 IfuQ ^ 0, then u(t, UQ) ^ 0 for any t > 0
Theorem 9 shows that the system (51) can be considered as a third-order continuous Hopfield-type network without second-order connections [76,77]. In fact, with M = (MI, M2, • • . , Miv)^,
B(t) = HM(XAx(t), X) = (bijit))^^^,
(52)
C{t) = /fM(XA«(t),X) = (cit/(0)^xM'
(53)
^M(X) =
(pij)NxN,
we can rewrite the model (51) as M
'I
M
M
M
= X^^7(0i^; + ] ^ ^ X^ {pijhl(t))vjVkVi,
dt "" j=l
j=l
k=l
/ = 1, 2 , . . . , A^,
1=1
where vi = sat(M/) is the saturation nonlinearity:
1
+1, M/,
Ui > 1 M/G[-l,l] .
(54)
- 1 , Ui < 1 The competition mechanism embodied in Eq. (51) is uniquely characterized by the parameters A./(r) and a/(0, and their selection is, in general, problem dependent. For example, when the model (51) is used to perform distance-based recognition, ki (t) can be taken as a constant or the reciprocal of the distance between the pattern X^^^ and the initial guess MQ, that is,
' "
'• " ll«o -
A:(I)||-I
+ •. • + ||«o - x w | | - ' •
^ ^
On the other hand, when the network is used for "criminal recognition" or "spelHng check in writing" applications [74], A/(r) should be chosen to effect both the distance-based and the recurrence-frequency-based competitions. For instance, ki (t) may be specified as ^n = Pi
(56)
ki=kfkf,
(57)
or
Associative Memories
217
where Pi is the recurrence probabiUty (frequency) of the pattern X^^^ (obtained by, say, experience or statistics). Alternatively, A/ (t) may be taken as the solution of a specific differential or difference equation, if the competition does obey such an equation. It is seen that many different competition rules can be incorporated into the model (51), which may embrace more than one competition principle. However, no network training is necessary once the competition principle (equivalently, the parameters A/ (t) and ai (t)) is specified.
B. COMPETITION-WINNING PRINCIPLE The competitive recognition mechanism of the model (51) is elucidated in this subsection through a detailed stability analysis of the system. We will show, in particular, the following distinctive features of the model: • Provided that the competitive and inhibitive parameters are suitably chosen, there will be no nontrivial spurious stable state and only the prototype patterns are stored in the memory. • There holds a competition-winning principle: if there exists a unique winner, the pattern corresponding to the winner is almost asymptotically stable in global, and hence it can be retrieved from almost any input vector in the underlying state space. Without loss of generality, we assume that M — N (hence PM — i according to (50)), so that the model (51) can be analyzed in the whole space R^. We also take X^^^ and —X^^^ be identical, because it is well known that an associative memory cannot distinguish a pattern and its complement in general (see, e.g., [78-80]). 1. Nonexistence of Spurious Stable State The following theorem asserts the nonexistence of nontrivial spurious stable states in the model (51). THEOREM 10. Ifatit) = ki{t)/{X^^)'^{X^'^) andki(t) ^ Xj(t) whenever i ^ j for any i, j = 1, 2 , . . . , M, then the model (51) has only 2M + 1 equilibria given by the zero vector and the prototype patterns =FX^^\ T ^ ^ ^ ^ • • •» = F X ^ ^ \
We show by contradiction that no other equilibrium state exists except the said IM + 1 states. Assume that Eq. (51) has an equilibrium point M* that is not in {0, ipZ^i), T^^^\ . . . , T^^^^}. We apply the transformation M
W = ;^>;,X(^>=XF,
(58)
218
Zong-Ben Xu and Chung-Ping Kwong
where X = [X^^\ X^^\ . . . , X(^>] and Y = (yu J2, • • •, 3^M)^, to convert the system (51) into ^
= xl^f(XY)
where Z^^ = {X'^Xy^X^
= Ax(t)Y - [Y^X^XAa(t)Y]Y,
(59)
is the left inverse of X [113,114], and
Ax(0 = diag{Xi(0, ^ 2 ( 0 , . . . , ^A^(0}, Aa(t) = diag{ai(0, ^2(0, • • •, «iv(0}. The system (59) then possesses 2(M 4-1) equilibria,
{0, ±xi'x^-'\ ±xi'x^^\..., ±xi'x^^\ x-^M*}, because {0, ±X^^\ ±X^^\ . . . , ±X^^\ M*} are the equilibria of the system (51), and u = XY has a unique solution Y = X^^u for any M € X. On the other hand, [Y^X^XAait)] in (59) is a real number, showing that any equilibrium point Y of the system (59) satisfies Ax(t)Y = [Y'^X'^Aat)Y]Y, or, equivalently, Xi(t)yi = [Y^X^XAa(t)Y]yi,
/ = 1, 2 , . . . , M.
(60)
This shows that any nonzero equilibrium point Y of (59) can have only one nonzero component (otherwise, Y will have two nonzero components, say yt ^ 0 and yj ^ 0, so that eliminating yi and yj, respectively, in the iih and jth equations of (60) will lead to Xi(t) = Xj(t) = Y'^X'^XAa(t)Y, contradicting the assumption Xi(t) ^ Ay(r)). Furthermore, Y'^X^XAaiOY
= IUi(t)y^i-^f^aijyiyj\
+ U2(0>^| + f^ c.2;j2j; j
+ • • • + f A,M(OJM + X^ oiMjyMyj I 1,
(61)
where (x^'Y(x^j^) ''^J-^j^'\xij))T^x(J)y Bringing this into Eq. (60), we conclude that the system (59) has only 2M equilibria given by ±rO) = {yf = h j y = 0 , V j ^ / } , , =
[TXI^X^'K
i = l,2,
...,M}.
/ = l,2,...,M, (62)
Associative Memories
219
Thus M* must be one of the elements in {0, ±X^^\ ±X^^\ . . . , ±X^^^} to be an equihbrium point of the system (51)—a contradiction. Theorem 10 is proved. Thus the system (51) not only stores all of the given prototype patterns, but also excludes any nontrivial spurious states. This offers a significant advantage in terms of reliability over most existing associative memories, in which a large number of spurious states unavoidably exist.
2. Competition-Winning Principle Keeping the choice of the competitive and inhibitive parameters as that in Theorem 10, we analyze the effect of the parameters on the association dynamics of the stored prototype patterns. We assume, subsequently, that ||X^^^|| = 1, / = 1,2, . . . , M . THEOREM 11. Suppose the competitive parameters {ki{t)} satisfy the following conditions:
(i) bounded, namely, supJAKO: ^ > 0} < +00, inf [ki{t)\ r > 0} > 6 > 0,
/ = 1, 2 , . . . , //;
(ii) the limits k^- = limy_>oo ^j(t) exist and the winner parameter A*^=maxU*: y = l , 2 , . . . , M } is unique and positive. Then the prototype pattern ibX^^o) associated with A* is a unique equilibrium state that is almost globally asymptotically stable. Theorem 11 has several meanings: (i) all of the equilibria of the system (51) other than X^'o^ and -X^'o) are unstable; (ii) X^'o^ and -X^^'o^ are both asymptotically stable; and (iii) the attraction basins of these two equilibria cover the whole underlying state space, except a set of measure zero. Thus, whenever X^'o) and -X^^o) are regarded as identical, the prototype pattern ±X^^o^ can be retrieved from almost any key in R^. The instability of any equilibrium other than ±X^^o) can be checked directly by using the linearization method (Theorem 3). To see the almost globally asymptotic stability of ±X^^^\ we can apply the transformation (58) and study the transformed equation (59). The equilibrium states ±X^^o^ of the system (51) then correspond to the equilibrium states ±7^^^)^ defined as in Eq. (62) (i.e.,
220
Zong-Ben Xu and Chung-Ping Kwong
_l_yO'o) = {^.^ = ± 1 , yj = 0, V7 ^ io}). Thus system (59) can be written accordingly as ^ (3;,) = Xiit)yi - [Y^X^B(t)XY]yi,
i = h2,...,N,
(63)
Furthermore, by using (52), the system (51) can be rewritten as ^
= B(t)u(t) - [u^(t)B(t)u(t)]Pu(t),
(64)
which, by Theorem 8, has a unique solution u(t) for any initial state M(0). Moreover, IIM(Oil -> 1 as ^ goes to infinity. Define a zero-measure set ^0 by ^0 = {>^€R^:
yio=0}.
We proceed to show that the solution Y(t) of the system (64) will converge to -j-yC^o) from any initial value 7(0) € R^\^o- Actually, for any given initial value r(0), there is a unique solution yi{t) of the system (64). Moreover, if ytOo) = 0 for some to, then yi(t) = 0 for all t > to. Assume that Y(0) is not in ^0; then yi^(0) i^ 0. It follows from Eq. (64) that
Because ||M(OII -> 1, [Y'^X^B{t)XY] = u^{t)B{t)u{t) < max{XKO; i = 1 , 2 , . . . , A^}||M(0|P -> A,/Q, which means that there exists a T > 0 such that Xi^it) - [Y'^X'^B(t)XY]
> 0,
V^ > T.
We conclude from Eq. (65) that diyt^f/dt > 0 (t > T), and so yt^it) ^ 0 (t > T). For any t > 7, let Zi(t) = yi(t)/yio(t). By Eq. (64) and
| a , ) = {^(.,)(.,o)-|(y,o)(y,)}/(^.o)^ we obtain d ^^(zi) =
{Xi(t)-Xi,(t))zi,
the solution of which is Ziit) = exp N
[kiis) -
Xi,(s)]ds\zi(0).
Because, by assumption, lim^-^oo ^i (s) — Xi^ (s) = X* — X < 0, the above equality
Associative Memories
221
gives Zi (0 ^^ 0 as ^ -> oo for any / ^ io. Thus, yt (t) -> 0 (/ / /Q) and u(t) = XY = Yyi(t)X^'^
-^ lim yio)(t)X^^^\
i=l
This shows Hm^-^oo ytoiO = ± 1 because ||w(OII -> 1, that is, Y(t) converges to In Theorem 11, the assumption that the competition winner parameter ki^ is unique is fundamental. If this assumption does not hold, the dynamic behavior of the system (61) may become more complicated. The following example helps understanding this phenomenon. EXAMPLE 3 (Equal Competitive Parameters Imply Limit Cycle) [73]. Assume two 2-D prototype patterns X^^^ = (x[^\ x^^V and X^'^^ = (x[^\ x^^^)^, which are assumed to be orthonormal. We let Xi(t) = X2(t) = CQ and specialize the competitive associative memory model (51) to
- ^ at
= Cu(t) - \u'^(t)Cu(t)]u(t), -•
u = (Ml, U2) e R^,
(66)
where
c = co[{x^'^){x^'Y + (x(2))(x*2))^]. Then, with f/ = xj Ui -\- ^2 U2,
V = Xl U\ + ^2 M2,
the system (66) can be expressed as
«; = c o [ i - ( t / 2 + y2)](;c|i)[/ + ^ P y ) ,
u'^ = co[i - (t/2 + y2)](x*''[/ + x f y). Under the polar coordinate transformation U = r cos ^, V = r sin ^, we obtain -=co(l-.>,
-
= l.
(67)
From Eq. (67) we easily see that {r = 1, ^ = ^} is a nontrivial limit cycle of the system (76). This Hmit cycle is also asymptotically stable, because, starting from any (ro, OQ), the phase trajectory of the first equation of (67) must approach the solution r = 1 from either the outside or the inside of the circle r = 1, depending on whether r > 1 or r > 1. Therefore, {u(t) = {ui(t),U2(t)):
{x\^ui -\-X2 1^2) + (xj ^wi+X2 W2) = 1 }
is an asymptotically stable limit cycle of the system (66).
222
Zong-Ben Xu and Chung-Ping Kwong
The system (66) has an energy function E defined by E(u) = M^M - In {u^Cu),
(68)
and the time derivative of E along the trajectory of the system (66) is l{uTCu)u — Cu du\ -2\\Cu-(u^Cu)uf dt (66) "~ \ u'^Cu u'^Cu ' dt''dtj^ u'^Cu
<0.
This shows that E is a Liapunov function of the system (66). The function is also strict by Definition 2. TakeZ(i> = (-0.5547,-0.832), X^^^ = (-0.832, 0.5547). Figure 5a and b depict, respectively, the energy function and the contour plot for Xi = X2 = 0.5. It is seen that the zero point is an unstable state of the system and r = 1 is an asymptotically stable limit cycle. Starting from any point outside the limit cycle, the trajectory of the system will move toward the limit cycle. The stored prototype patterns X^^^ and X^^^ are shown at the comers of Fig. 5(b), passing through which is exactly the Hmit cycle. Figure 6a and b show the case when Xi = 0.7 and A2 = 0.3, for which the limit cycle disappears. In this case, there are five equilibria: 0, ±X^^\ and ±X^'^\ As claimed in Theorem 11, the patterns ±X^^^ are unique asymptotically stable equilibrium states (because k* > X^), but ±X^^^ are saddle points. The above results demonstrate the crucial role of the winner parameter A,*Q in Theorem 11. Note also that we have just considered a nontrivial example of systems with strict Liapunov function. The system does not possess convergent dynamics, although it is quasi-convergent by Theorem 4.
C. COMPETITIVE PATTERN RECOGNITION EXAMPLE An example of competitive recognition of mixed images is presented in this subsection to confirm all of the theoretical assertions of Theorems 10 and 11. To implement the continuous model (63), Euler's discrete approximation, Un+l =Un+
Sn{[B(tn)u(tn)]
- [u^(tn)B(tn)uitn)]Pu(tn)}.
« = 1, 2, . . . ,
(69) is adopted, where MQ = w(fo) is the initial value, and [Sn] are step sizes that satisfy n—l
Sn > 0,
tn= Y^Sj
00
< ^Sj
< +00.
The six pictures shown in Fig. 7 were taken as the prototype patterns to be stored, each composed of 32 x 32 pixels. The initial input was the picture shown in Fig. 8. It is composed of (l/6)[House] + (1/6)[Palette] + (l/6)[Tools] +
223
Associative Memories Two Dimension Potential Function
0
0
(a) Two Dimension Contour Plot (Phase Space)
Figure 5 (a) The landscape of the energy E with Ai = X2 = 0.5. (b) The corresponding contour plot of E.
(l/6)[No Smoking] + (l/6)[aock] + (l/6)[Lens], where [House], [Palette], [Tools], [No Smoking], [Clock] and [Lens] are matrices (the patterns) representing, respectively, the pictures of "House," "Palette," "Tools," "No Smoking," "Clock," and "Lens." The purpose of our simulation is to observe the competitive recognition dynamics of the model (51).
Zong-Ben Xu and Chung-Ping Kwong
224
Two Dimension Potential Function
0
0
(a) T w o D i m e n s i o n C o n t o u r Plot ( P h a s e S p a c e )
10
15
20
25
30
35
40
45
50
(b) Figure 6 The landscape of the energy E with X\ = 0.7 and A2 = 0.3. (b) The corresponding contour plot of E.
For the given initial guess, no stored patterns can be recognized by the conventional "competitive recognition in distance" associative memories, because the initial input has the same distance to all patterns of "House," "Palette," "Tools," "No Smoking," "Clock," and "Lens." However, the introduction of the competitive parameters {A„} in the associative memory model (51) allows not only competitions of stored patterns in distance, but also other factors that could improve
225
Associative Memories
House
Palette
No Smoking
Clock
Figure 7
Tools
Lens
The prototype patterns of six pictures, each composed of 64 x 64 pixels.
the recognition (e.g., the recurrence probabilities of the stored patterns). We have taken in the simulations the parameters {kn} to be kn = knK^
n = 1,2,.. . , 6 ,
(70)
Figure 8 The initial input composed of (1/6) [House] + (1/6) [Palette] + (1/6) [Tools] + (l/6)[No Smoking] + (l/6)[Clock] + (l/6)[Lens].
226
Zong-Ben Xu and Chung-Ping Kwong
where I|M0-X<")||-1 '^n — I I . . .
vniii-1
, , 11... vf6>ii-1' | | « O - X ( 1 ) | | - I + - - - + | | M O -- X ( 6 ) | | - l '
x^ =
^
f^ ^
^
^'^''
(72)
In Eqs. (70)-(72), MQ is the initial input, {^(1)^ X(2)^
^ ^(6)} ^ {"House," "Palette," "Tools,", "No Smoking," "Clock," and "Lens."},
and /x„ is the degree of attention paid to the pattern Z^"^ in the recognition process. The parameters X^ and k^ can be interpreted as contributions of the pattern Z^"^ to the recognition of the input UQ in the sense of, respectively, minimumdistance competition and maximum-attention competition. The two recognition principles are combating: whereas the former recognizes UQ through checking which of the stored pictures is nearest to UQ, the latter asks which picture receives the most attention. Hence the larger the parameter k^k^, the higher the probability of the nth pattern Z^"^ being recalled. In our first test, we took /^House = 0.3, /xpaiiete = 0.2, /ZTOOIS = 0.1, MNo Smoking = 0.1, Mciock = 0.2, and /XLens = 0.1. With this set of data, the implementation of the discrete model (69) in accordance with Eqs. (70)-(72) led to the convergence of {«„} to "House." In other words, the picture "House" was recognized from the given initial key. Fig. 9a shows the recognition process. In our second test, we took /XHouse = 0.2 and /xpaiette = 0.3, and the other parameters remained unchanged. It was found that the model converged to "Palette" this time (see Fig. 9b). Similarly, when we set /xciock = 0.3 and /xpaiette = 0.2 with other parameters unchanged, the system converged to "Clock", i.e., the picture "Clock" was recalled (Fig. 9c). We also tried initial inputs other than (1/6) [House] + (1/6) [Palette] + (l/6)[Tools] + (l/6)[No Smoking] + (l/6)[Clock] -h (l/6)[Lens] but the attention parameters /x/ (/ = 1,2,..., 6) were set to be identical. By Eqs. (70)-(72), different choices of initial input amount to different strategies of assigning the distance-competition parameters XP (i = 1, 2 , . . . , 6). All tests showed the convergence of the sequence {w„} to the pattern nearest the initial input in distance. In all simulation runs, no equilibria other than ±"House," di"Palette," ±"Tools," lb "No Smoking," ±"Clock," and ±"Lens" were observed. Therefore, all simulation results support the conclusions of Theorems 10 and 11.
227
Associative Memories
(a)
(b) Figure 9 The competitive retrieval processes of the proposed model for three different competition parameter-setting rules.
228
Zong-Ben Xu and Chung-Ping Kwong
Figure 9
(c) {Continued)
D. APPLICATION: PERFECTLY REACTED P A A M As demonstrated in the previous example, the competitive associative memory may have important appHcations in pattern recognition, especially when particular pattern must be recognized from a set of stored patterns and the recognition process is governed by certain known competition mechanism. In this section, we apply the developed method to the original association problem to yield a new, "perfectly reacted" PAAM. For most associative memory applications, it is natural to devise a system in a way that it can respond to any initial input MQ by producing a vector M* nearest to Mo in distance. Usually UQ is a distorted or incomplete version of the output M*, the latter being one of the stored prototype patterns in the memory. We have named the system that uses the above distance-based recognition process as "noncompetitive" associative memory. We ask if the competitive associative memory model (61) can also apply to this case. One straightforward way is to specify the competitive parameters A/ {t) as that in Eq. (55). However, in doing that, all of the stored prototype patterns must be examined, which is certainly undesirable (because the method will degenerate to the classical recognition of comparing the input directly with every given proto-
Associative Memories
229
type vector). In contrast, we would prefer a way of assigning A/ (t) that is based purely on the initial key. Such a new strategy is motivated by the following observations. In the competitive associative memory (cf. Eq. (63)), each competitive parameter Xi (t) is in fact an eigenvalue of B(t), with the stored prototype pattern X^^^ as its associated eigenvector (cf. (49)). From this point of view, we propose: 1. By associating with each prototype pattern X^^^ a. specific family of {Xi (t)}, the memory stores the pattern X^^^ as a common eigenvector associated with the eigenvalue A/ (t) of B(t). 2. The dynamics of the memory is described by an algorithm for finding the eigenvector associated with the largest eigenvalue of B(too), where B(too) = lim^->oo ^ ( 0 3. The variation of the eigenvalues {Xi (t)} plays a role in performing the competitive recognition of the stored memories. We suggest the following new PAAM model to implement the above proposal. Algorithm 3 (A Perfectly Reacted PAAM Model). Given a set of M prototype patterns {X^''^: / = 1, 2 , . . . , A^}, we denote A = diag{A^, A^ - 1 , . . . , 1}. (i) Let B = HN(X, X A ) be generated by Algorithm 1. (ii) Set R(uo) — (UQBUO)/(UQUO) and p = the integer nearest to R(uo)^ (iii) Take B(uo)
(R(uo) - B)-l - (R(uo) ~ Ap_i)-l + 1; [ (B - R(uo))-^ + (Xp+i - R(uo))-^ + 1;
if R(uo) > p if R(uo) < p ' (73)
and construct - ^ = B(uo)u(t) - [u^(t)B(uo)u{t)]u(t) dt
(74)
u(0) = MO.
(75)
The feasibility of the above algorithm can be explained as follows. Definition (73) ensures that the matrix B(uo) takes all prototype patterns X^^^ as its eigenvectors as B does. The eigenvalues of B(uo) are
. /,,. xx
\(R(uo)-jr^-(R(uo)-p-^ir^-{-l,
ifR(uo)>p
[ U - R(uo)r^ + (p + 1 - R(uo)r^ + 1, if R{uo) < p which are clearly positive. Then by the competition-winning principle (Theorem 11), the unique solution M (0 of the system (74) and (75) must converge to the pro-
230
Zong-Ben Xu and Chung-Ping Kwong
totype pattern ±X^^, where /Q is the index uniquely corresponding to the largest eigenvalue of B(uo). On the other hand, according to step (ii) of the above algorithm, this largest eigenvalue is p. Consequently, u(t) converges to X^^^ (or The function R(u) used in Algorithm 3 is known as the Rayleigh quotient of B in matrix theory; its extreme values are tightly related to the eigenvalues of B (particularly, R{X^^^) = i for any / = 1, 2 , . . . , A^). In the setting of Algorithm 3 all of the eigenvalues Xi are real and positive, and R^f^{u) can be shown to be a norm in R^ when u is restricted to the unit sphere. Thus the model is recaUing the prototype pattern that is nearest the initial input MQ in the distance R^/'^{'). Now the model (74) and (75) provides a PAAM that can always respond to a given input, the prototype pattern nearest the input in the distance /?^/'^(). The model also yields no nontrivial spurious stable state. Because any well-defined norms in R^ are equivalent, we may conclude that the PAAM so constructed is perfectly reacted. The following is an example. Suppose the prototype patterns to be stored are X(i> = (0.2117, 0.9773)^ and X(2) ^ (0.9823, 0.1874)^, associated with them
1.5-1
(a) 40 30 20 10 10 Figure 10 R{u).
15
20
25
30
35
40
(b) (a) The landscape of the Rayleigh quotient R{u). (b) The corresponding contour plot of
Associative Memories
231
are the competitive parameters Ai = 2 and A,2 = 1, respectively. The matrix B found by Algorithm 3 is given by _ . 1.9603 0.2080 \ * -^.1832 1.0317/' which defines the Rayleigh quotient R(u) (Fig. 10a) and its contour (Fig. 10b). We simulated the model (70) and (71) by Ruler's method (69) with a set of 100 randomly generated initial values UQ. It was observed that in every case the model did converge to one of the pattern vectors {±X^^\ ±X^^^} if it was close to UQ in the distance /?^/^(). Figure 11 illustrates the attraction basins of ±X^^^ and diX^^^andFig. 12 is for the case when all of the initial states are constrained to the unit circle (i.e., the initial states are normalized). All simulations strongly support the effectiveness of the model, and more interesting applications can be found in [81]. Finally, we remark that the perfectly reacted PAAM introduced in this section is no longer autonomous as in Section II. Nevertheless, the nonautonomous system could be considered quasi-autonomous because its stability can be studied in exactly the same way as that of autonomous systems.
1.5
i
^
' M
^
^
^
)K
)K
)K
^
(:
M
^
^
^
5K
^
5K
)K
)K
^
)K
5K
^
5K
5((
51^
yn
^
)K
^
*
^
)K
5K
)K
5K
^
5K
^ /
-
+ ^s ^
-
+
+
0.5 -
+
+
+ ^\ ^
+
^
/+
+
+
+
+
+
+
+
+
+
+
+-
+
+
+
+
+
+
+
+
^ / '^ + y4^ + +
+
+
X
_ -
+
+
+
+
+
+
+
+ X +
+
-0.5h-
+
+
+
+
+
_ ^
+
+
+
+
+
+
+
r
+
+
+
\
+
+
-1.5
+/
/^ ^
+
+
+
+
+
+
>K^ s +
+
+
+
+
+
+
+
+
+-
+
+
)K
3K 5K
5K
N^
^ ^ ^
)K
^
^
^
^
,^
ikL.
O
^_ , , / -1.5
>/^ )K
\t>
5K
^
)K
)K
5K
)K
51^
^
^
^ ^
^
)K
^
^
)K
)K
iU
w
1 w
W 1
^u:
iU— —:a£
+/ ^
L
,
^
-0.5
0.5
. +
_
1.5
Figure 11 The attraction basins of the stored memories iX^^^ and ±X^^^. All of the points marked with an * are attracted to ±X^^\ and the points marked with a -h are attracted to ±X^^^.
232
Zong-Ben Xu and Chung-Ping Kwong 1
1
1
1
1
\
1
1.5
1
5K ^
5K
''^ ''^ ^
bii
5K
+
+ + + + + + -k + + +
0.5
0
-0.5
+
+
+ + + ^ + + + + + +
5K
* * JK 5K )KC3I^ *
-1
-1.5 o -2
1
-1.5
1
1
-0.5
1
1
0.5
1
1
1.5
Figure 12 The attraction basins of the stored memories ±X^^^ and ±X^^^ when all initial states are hmited to the unit circle. Again, all of the points marked with an * are attracted to ±X^^\ and the points marked with a + are attracted to ±X^^^.
IV. DISCRETE PAAM: ASYMMETRIC HOPFIELD-TYPE NETWORKS The Cohen-Grossberg model, the Hopfield model, and the competitive associative memory model in the previous two sections represent important examples of PAAMs with continuous dynamics. In this section we consider another important class of PAAMs that are modeled by discrete dynamic systems. One of most successful and influential discrete PAAM models is the LittleHopfield model (16,17). Many other discrete PAAM models such as bidirectional associative memories [39], recurrent neural networks [82], and Q-state attractor neural networks [73, 83], are its variants or generahzations. The model has been extensively studied (see, e.g., [60-62], [78-80], [84-101]), because of its usefulness both as associative memory and optimization problem solver. In most of the previous studies, the symmetry of the connection weight of the network is commonly assumed. In this section we develop a unified theory that is valid not only
Associative Memories
233
for networks with symmetric connections, but also for networks with asymmetric connections. Henceforth we refer to the Little-Hopfield model with asymmetric (symmetric) connections as the asymmetric (symmetric) Hopfield-type network. We begin with some equivalent definitions of Hopfield-type networks.
A. NEURAL NETWORK MODEL A Hopfield-type network of order A^ comprises N neurons, which can be represented by a weighted graph A/* = (W, I), where W = (wij)NxN is 2in N x N matrix, with wtj representing the interconnection weight from neuron j to /, and / = (//)jvxi is an A^-dimensional real vector with It representing the threshold attached to the neuron /. There are two possible values for the state of each neuron: +1 or —1. Denote the state of neuron / at time t as Vi(t); the vector V(t) = (vi (t), V2(t),..., VNO))^ is then the state of the whole network at time t. According to Eq. (16), the state of a neuron is updated according to the following first-order difference equation:
-,(.+.)=.gn(»,(,+.))={i;,^ ; f : ; n ! ! ; o '
<'«
where sgn() is the signum activation function, and N
7=1
If only one neuron is allowed to change state at any time, the network is said to be operating in the serial mode. Otherwise, the network is operating in the parallel mode {ov fully parallel mode), in which all of the neurons are updated simultaneously. Let €i = m m
I
A^
N
I = I - J (€1, 62, . . . , 6iv)^ = (/l, /2, . . . , IN)
and define N
Ui{t + 1) = ^WijVjit) 7=1
-
li.
^
234
Zong-Ben Xu and Chung-Ping Kwong
We see that iiiit + 1) > 6//2 whenever udt + 1) > 0, and Ui(t + 1) < - 6 / / 2 whenever Ui (t -\-1) < 0. Therefore we can rewrite Eq. (76) as Vi(t + 1) = sgn{ui(t + 1)) = Sgn{ui(t + 1)),
(78)
where Sgn() is the strict signum function defined by Sgn(«,) = j _ j
if„,<0-
Definition (78) enables the assumption Ui(t) 7^ 0 in the analysis of Hopfield-type networks. Furthermore, the use of the strict signum function Sgn() instead of sgn() does not affect the trajectories of the network. We also rewrite the model (76) in a parameter form. More precisely, by introducing the so-called parameter matrix, R = diag{ai,a2, . . . , « „ } , which satisfies ai > 0 for any / e {1, 2 , . . . , N}, we recast the Hopfield-type network (76) into the form ^.(^ + 1) = 1 Sgn{a/[Ef=i ^ijVjit) - h]}. I Vi(t),
if/ e I(t) ^ otherwise
^^g^
where I(t) comprises all of the indices of neurons whose states are changed at time t. In particular, the network in fully parallel mode of operation is expressed as V(t-\-l)
= Sgn{H(R)V(t)},
t>0,
with H(R)V = R(WV - / ) ,
Vy e H ^ .
(80)
Clearly, Eqs. (79) and (80) show no difference in trajectories compared with the original update models. However, the representations (79) and (80) enable us to develop a convergence principle under a general and unified framework. In summary, the Hopfield-type network we will study is the parameter form (79) and (80), in which the strict signum function is employed.
B.
GENERAL CONVERGENCE PRINCIPLES
As for any PAAM, the convergence of the model (79) is a prerequisite for any successful application. The symmetric case has been well studied, and the basic results are summarized as follows.
Associative Memories
235
• The network J\f = (W, I) will always converge to a stable state when operating in the serial mode, if the diagonal elements of W are nonnegative [34]. • The network J\f = (W, I) will either converge to a stable state or enter a 2-cycle when operating in the fully parallel mode. J\f will converge to a stable state if W is nonnegative definite [88]. Similar convergence results have been obtained for the asynmietric cases [69, 89, 91, 96, 97]. We will develop in the following a more versatile framework within which all of the known results can be derived in a straightforward way. Let H ^ be the Hamming space defined by H ^ = {y = (i;i, t;2, . . . , VNf,
Vi G {-1, 1}}.
If To is the trajectory of the system (79) initialized from V(0), then, in dynamic system notation, the time derivative of a function E : H ^ -> R along the trajectory To is defined by A^(y)|ro = E{V{t + D) - E{V{t)).
(81)
Furthermore, by Definition 1, the function E is an energy of the system (76) if A£'(y)|ro S 0. Similar to the continuous case, we say that the energy £" is a strict Liapunov function of (79) if A£'(y*)ITQ = 0 when and only when V* is an equilibrium state of Eq. (76), i.e., V* satisfies y* = 5(Tyy* - / ) ,
(82)
where the nonlinear operator S: R^ -^ H ^ is defined by S{u) = (Sgn(Mi), Sgn(w2),..., Sgn(uN)f.
Vw € R^.
(83)
Corresponding to the Liapunov method (Theorem 2), the following theorem provides the basis of convergence analysis for the Hopfield-type networks. THEOREM 12 (Convergence Criterion). A Hopfield network M = (W, I) is convergent if it has a strict Liapunov function.
The validity of Theorem 12 is rather obvious once we notice that the total number of possible states of H ^ (hence To) is finite and E(v(t)) is strictly decreasing. It is highly desirable to construct a general form of the Liapunov function of the model (79). We will apply subdifferential-mapping for this purpose. A functional g : R^ -^ R is said to be subdifferential at x € R^ if there is a vectors* € R^ such that g{y) — g{x) > {x*, y — x), Wy e R^. All such vectors X* are called the subgradient of g at jc, denoted by dg{x). The correspondence X -^ 9g(x) defines a mapping dg{'): R^ -^ 2^ , called the subdifferential mapping. To connect this notion with the Liapunov function of the system (79), we introduce the following definition.
236
Zong-Ben Xu and Chung-Ping Kwong
DEFINITION 4 (Generalized Potential of a Hopfield-type Network). Let the mapping H(R): R^ ^ R^ be defined by Eq. (80). The operator H(R) is said to be a subdijferential mapping restricted to the orbit FQ, if there is a functional g such that g{v{t + D) - g{v{t)) > {H(R)(V(t)),
v(t + 1 ) - y(0),
t > 0.
In this case the functional —g is called a generalized potential of the Hopfield network (79) with respect to the orbit FQ. The following theorem reveals that a generalized potential can provide a general construction of the strict Liapunov function of the network. THEOREM 13. Any generalized potential of the system (79) is a strict Liapunov function.
To prove the theorem we assume, for instance, that —g is a generahzed potential of the network (79) with respect to its orbit FQ. Then, denote E(V) = —g(V) and using Definition 4, we have AE(V)\ro
=g{Vit))-g{V(t + l)) <{H(R)V(t)^V(t)-V{t-hl)) = -{R{WV(t) - l),V{t -\- I) - V(t))
= J2 (-«^)h(^ + i)]h(^ +1) - ^^(0] < 0, which vanishes if and only if V(t) is an equilibrium state of (79) from some time ^0 on. This last inequality follows because Vi(t-\-l) = sgn(M/(t -\-1)) implies [ui(t-{-l)][vi(t-\-l)-Vi(t)] = 2\ui(t-\-l)\ > O,and/(0 = 0implies y(^ + l) = V(t). Therefore V(t) is an equilibrium state of (79). The significance of Theorem 13 is that it shows how a strict Liapunov function of the system (79) can be obtained by finding a generalized potential of the system; the latter can be done easily by using the special form of H(R) (Definition 4). In fact. Definition 4, together with (80), suggests that the generalized potential —g should obey dg(V) = R(WV - / ) ,
VV 6 H^.
(84)
It follows that such a functional g must be quadratic: g(V)=^^{V^AV)-V^b
(85)
for some A = (aij)j\/xN and b = {bi,b2,..., bj^)^. We proceed to specify A and b so that g defined by Eq. (85) does become a generahzed potential of a Hopfield-type network. Express g(V) at V(t) in a
Associative Memories
237
Taylor series, g(V) = g{V{t}) + [^{A + A^)V(t) - b]{V - V(t)) + k{V-
V(t)f{A
+ A^)(V - Vit)),
and bring this into Eq. (84). We see that —g is a generalized potential of the system (79), provided [i(A + A^)V(t) - b]{V(t + 1) - V(t)) + i(V(f + 1) - V(t)f{A + A'^){V{t + 1) - V(t)) > {R{WVit) - / ) , Vit + 1) - V{t)). Or, equivalently, I(RW
- ^ - y ^ ) V(t) - (RI - b), V(t + 1) - V(t}\ < l{V(t + 1) - Vit)y(^^^^yV{t
+ 1) - V(0).
Let AVit) = Vit + 1) - Vit) and takefo= i?/ in the last inequality. Then the above inequality is equivalent to
QiRWit) := ((RW - A±A_V(,), AV(f)\ - l(iA + A^)AVit), AVit)) <0, 4 which also amounts to N
iel(t) 7=1 L
^
2
-•
^
^
^^^^^^^^^^^^^_^^^
^^^^
iel(t)jel(t)
Note that i;j(0 = —^Avj(t) for any y e I(t). A sufficient condition for the above inequality to hold is
iGl(t)jeJ(t)^
'
iel{t)jel{t)
(87) where J{t) = {1,2,..., N}\I(t). We can rewrite (87) in the form
Y, Y. ^oA,(OAi;;(0>0, /e/(0 7€/(0
(88)
238
Zong-Ben Xu and Chung-Ping Kwong
where
ki(ai) = Y^
\aiWij
atj + aji
V/ el(t).
(89)
In most cases the network (79) is operated either in serial mode (I(t) contains at most one index /) or in fully parallel mode (7(0 contains at least one index /). Based on Eqs. (88) and (89), if we denote K(R) = dmg{ki(ai), k2{ct2)...., kN{otN)],
(90)
we can prove the following general conditions for convergence of a Hopfield-type network. THEOREM 14 (General Convergence Principle). Let J\f = (W, / ) be a Hopfield-type network, W not necessarily symmetric. J\f will converge to a stable state from any initial state V (0) if it satisfies either (PI) or (P2) listed below:
(PI): M is operating in serial mode and W is weakly diagonal dominant in the sense that there exists a parameter matrix R such that you > ki(ai)/(Xi,
V/ = 1, 2 , . . . , A^.
(91)
(P2): Af is operating in parallel mode and there exists a parameter matrix R such that RW — K(R) is nonnegative definite. Theorem 14 gives a very general convergence test for Hopfield-type networks. In particular, because R in the listed conditions can be chosen arbitrarily, a series of specific tests for the global convergence of the networks can be derived. For examples, we have the following. COROLLARY 3. Any one of the conditions (C1)-(C6) listed below is sufficient for the network M = (W, / ) to converge globally to a stable state when operating in the serial mode. For alii = 1,2, ...,A^,
(CI): W is symmetric and wu > 0 (C2): U^,-,>^E7=I,MI«''7-«';<• I
(C3): u;,/>i:ylij#,l«';.l (C4): wu>j:%,j^,\wij\ (C5): The matrix B — (bij) with bu = wu, bij = —\wij\, i ^ j , is an M-matrix
Associative Memories
239
The above follows directly from Theorem 14. Notice that when at is chosen to be 1 and A = W, the condition (91) reduces to (CI) and (C2). (C3) and (C4) clearly imply (C5), and (C5) is equivalent to the statement: There exists a parameter matrix R such that
(C6):
wn>^j:U,Mi"J^'"'j\
The last condition can be established by letting A = 2RW Similarly, we have the following.
m(91).
COROLLARY 4. Any one of the conditions (C1)^-(C5)^ listed below is sufficient for the network J\f = {W, I) to converge globally to a stable state when operating in the parallel mode.
(ClY: W is symmetric and nonnegative definite (Ciy-. W — diag{^ Ylf=ij=^i \^ij ~~ '^ji\'' i = ^^,2,..., N] is nonnegative definite (C3)^: W — diaglXly^i y_^j- \wji\\ i = 1,2,..., N} is nonnegative definite (CSy-. W — diag{X!^=i,;^i \^ij\'' i = 1, 2 , . . . , A/'} is nonnegative definite (C6)^: There is a parameter matrix R such that N
yoii>ki{ai)/oti+
Y^
\wijl
i = l,2,...,N.
(92)
Observe that the conditions (CI), (Cl)^ and (C3), (C3)' are exactly those derived by Hopfield [34], Goles et al [88], and Xu and Kwong [69]. The conditions (C3)-(C5) generalize the corresponding results obtained recently in [96]. It should be remarked, however, that the M-matrix condition (C6) as well as the general condition (C6)^ are new findings for the global convergence of asymmetric Hopfield networks.
C. C L A S S I F I C A T I O N T H E O R Y FOR E N E R G Y F U N C T I O N S We now turn to the study of the stability and error-correction capability of Hopfield-type networks. Although there has been direct analysis of the symmetric Hopfield-type network with Hebbian encoding [78, 80, 86, 87], it appears that no method has been developed so far for the general Hopfield-type networks. Our contribution will be such a technique based on a classification theory on energy functions. Our starting point is still due to Hopfield. In his seminal papers [34, 35], Hopfield associated his memory model with an energy function such that each pattern to be stored is located at the bottom of a "valley" of the energy landscape.
240
Zong-Ben Xu and Chung-Ping Kwong
Recall that the stored pattern is a dynamic procedure aimed at minimizing the energy where the valley becomes a basin of attraction. It turns out that the errorcorrection capability of the memory is characterized by the "width" of the valleys of the energy function. Our theory is based exactly on a quantitative measurement of the width of these valleys. The analysis conducted in the last subsection reveals that the most natural energy function associated with the Hopfield-type network (79) is given by N
N
N
E(V) = - ^ ^X^a/yi^/i^; +J2^iVi = -\ (y^AW) + V^b,
(93)
We are therefore considering the following combinatorial optimization problem: Minimize{£:(y): V e H ^ } .
(94)
Notice that the problem (94) is itself a model of many important combinatorial optimization problems such as TSP or MIS [61,62,99-101]. This in turn suggests the application of the Hopfield-type networks in solving these problems. However, we will not, go into this topic in this chapter. To quantify the valley of a minimizer of E, some precise definitions are needed. DEFINITION 5 (/:-Local Minimizer). A vector V* € H ^ is said to be a k-local minimizer of E if E(V*) < E(V) for any
V e BH(V\ k) = {V: dniV. V*) < k}, where J//(V, V*) denotes the Hamming distance between V and V* (i.e., the number of different components in V and V*). A 1-local minimizer of E is simply called a local minimizer of E. DEFINITION 6 (Classification of Energy). Let Qk (E) be the set of all A:-local minimizers of E and Q (Af) be the set of all stable states of the Hopfield-type network Af. (i) E is called a regular energy of J\f if Qk(E) c ^ (Af) (i.e., any A:-local minimizer of £" is a stable state ofAf). In this case, Af is said to have a regular correspondence property, (ii) E is called a normal energy ofAf with order k if Qk(E) ^ Q (Af) (i.e., any stable state of A/" is a /:-local minimizer of E), In this case, Af is said to have a normal correspondence property, (iii) E is called a complete energy ofAf with order k, if Qk(E) = Q (Af) (i.e., there is a one-to-one correspondence between the k-locel minimizers of E and the stable states ofAf). In this case, Af is said to
Associative Memories
241
have a complete correspondence property. Thus E is both regular and normal if it is complete. It is known that Definition 6 plays an important role in using the Hopfieldtype network as a minimization algorithm for the problem (93) [96]. In brief, whereas the regularity of the energy function E characterizes the reliability of the deduced Hopfield-type network algorithm, the normality ensures the algorithm's high efficiency. We further remark that normality can be used to characterize the error-correction capability of the corresponding Hopfield-type network and that regularity is an indicator of nonexistence of states that are not asymptotically stable. Accordingly, Definition 6 also plays an important role in the stability analysis of Hopfield-type networks. We now proceed to find general conditions under which the energy E of the form (93) becomes regular, normal, or complete. To this end, we let AV* = y - V* for any fixed V e BHiV, k), and let /* = {/ € { 1 , 2 , . . . , A^}: A V,.* ^ 0}, / * = {1, 2 , . . . , iV}\/* (hence AV^* = -2i;f if and only if / e /*). By expanding the energy E, we obtain E{V) - EiV)
= -{HiRW,
AV*) + {{H(R)V*, AV*)
- [i(A + A^)V* + bfAV"
- 1{{A + A^)AV*, AV*)}
= -{H(R)V*, AV*) + Q(R)(V*),
(95)
where
Q(/^)(y*) = {/('/?w--^^-i^V*, Ay*\-^((A-f A^)Ay*, Ay*}}. (96) By Eq. (80), -{H(R)V*, AV*) = ^ a / ( i y y * -
l)i{2vf).
iel*
Identity (95) can then be written as E(V) - £ ( y * ) = ^ 2 a / ( w y * - I).vf + Q(R){V*).
(97)
If y* € Qk(E) is a ^-local minimizer, then it is also a local minimizer. So taking /* = {/} in the last identity gives (WV* - I)ivf > 0 (i.e., y* € ^(A^)) provided Q(R)(V*) < 0. Conversely, if y* € ^ (A^) is an equilibrium state of the network A/", then E(V) > E(V*) as long as Q(R)(V*) > 0.
242
Zong-Ben Xu and Chung-Ping Kwong
It is surprising to see that the condition Q(R)(V*) < 0 is the same as the inequaUty (86) if we identify V(t) with V*. Consequently, Q(R)(V*) < 0 if any of the convergence conditions Hsted in Theorem 14 and Corollary 3 are met. Furthermore, we conclude from Eqs. (86)-(90) that a sufficient condition for Q(R){V*) > 0 is that any /:-order principal submatrix of RW + K(R), denoted by [RW -\- K{R)]k, is nonpositive definite. We finally arrive at the major results for this subsection. THEOREM 15 (Classification Criteria). network, E is its energy of the form (93).
LetAf = (W, /) be a Hopfield-type
(i) If there is a parameter matrix R such that Wii > ki{(Xi)/ai,
V/ = 1, 2 , . . . , A^,
(or any one of the convergence conditions listed in the last subsection), then E is a regular energy ofJ\f. (ii) If there is a parameter matrix R such that [RW -\- K{R)\ is nonpositive definite, then E is a normal energy ofM with order k. The special case when k = I should be noted in which /* = {/} and hence, [RW -\- K(R)]i is nonpositive definite if and only if Wii < -ki(a)/ai,
V/ = 1, 2 , . . . , A^.
Compared with Theorem 15 (i), the regularity and the normality of an energy function have a special relationship. In particular, setting R to the identity matrix yields the following. THEOREM 16 (Completeness Theorem). If J\f = (W,I) is a symmetric Hopfield-type network without self-feedback connection (i.e., wu = Oj, then E is a complete energy (and hence, M has the complete correspondence property).
Theorem 16 is of great importance when the Hopfield-type networks are used as associative memories. We will give the details in the next subsection. When additional assumptions are imposed on the encoding. Theorem 15 can be extended to Theorem 17. THEOREM 17 (Error-Correction Capability Theorem). Let V* be an equilibrium state of the Hopfield network J\f = (W.O); I\ denotes any one of the k indices in { 1 , 2 , . . . , A^}, 7 = {1, 2 , . . . , A^}\/i, and
ki(ai) = J^
aij + aji OiiWii
^
Assume that the encoding of W satisfies (P3): There is a A* such that WV* = A* V*.
V/ eh.
Associative Memories
243
Then (i) y* is a k-local minimizer of the energy function E if there is a parameter R such that A* > ki{oti)/oti + Y, I^O-I'
/ = 1, 2 , . . . , A^.
(98)
(ii) The attraction basin ofV* contains Bn(V*, k) if k*>2Y^\wij\,
i = l,2,...,N,
(99)
The conclusion (i) readily follows from (98) by observing that (WV*)i vf = A* and whenever (98) is satisfied, E{V) - E(V*) > J2^ai{WV*).v*
- i ^
^a.u^oA^*^^;
-Y.kiiadAV;' iel
ieh ^
L
jeli\{i}
> 2 ^ a J r - \ki(ai)/ai + ^ ieh
"-
•"
-I J
\wij\\\ > 0.
jell
Furthermore, for any V e BuiV*, k), denote /* = {/ G {1, 2 , . . . , A^}: vt = -vf} and y = y* + AV* with AV* = 1 " ^ ^ * ' ^ [0,
^isin/* /, is not in /*.
Then
7=1
«€/*
/€/*
and therefore vf{WV)i =X-2J2
y^ij^t^^j > ^* - 2 ^ |it;o-| > 0.
This implies that the network A/^ converges to V* with initial value V in one step. Hence the attraction basin of V* contains B}i(V*,k).
244
Zong-Ben Xu and Chung-Ping Kwong
Theorem 17 may be very powerful in the quantitative measurement of the error-correction capability of a particular Hopfield-type network. The following are two examples. EXAMPLE 4 (Error-Correction CapabiUty of Little-Hopfield Network). Assume that ?(^> = (^l^\ ^^^\ . . . , ^^^^^f G H ^ , /x = 1, 2 , . . . , p, are p orthogonal prototype patterns, and A/" = (W, 0) is a Hopfield-type network, with W encoded by the Hebbian rule:
Then \wij\ < p/N, W^^^^ = l^^^\ and W is symmetric. Hence, with R = diag{l, 1 , . . . , 1}, A = Hy (so ki(ai) = 0), and A* = 1,
i>k^>j2\wiji iel
Consequently, the condition (98) is satisfied if k<—, P that is, each stored memory ^^^^ is at least a [N//?!-local minimizer, where ["•] denotes the integer part of N/p. Similarly, the condition (99) is met whenever 2/7
that is, ^^^^ can be retrieved from any key with maximum N/2p error bits. Note that this error estimate coincides with the well-known result k = fN/lp] of Cottrel [86]. This demonstrates the great power of the theory developed in this section. EXAMPLE 5 (Error-Correction Capability of the Hopfield Network with Pseudoinverse Encoding). We still assume that ^<^> = (^l^\ ^^^\ . . . , ^^^^^f e H^, />t = 1, 2 , . . . , /7, are the prototype patterns. However, instead of the orthogonal assumption, {§^^^} are now in general position. Consider the Hopfield-type network A/^ = (W, 0), with W encoded by the pseudo-inverse rule:
W = X{X^Xy^X^
= (Wij)NxN,
where X = [^^^\ ^^^\ ..., ^^P^l Because W is an orthogonal projector, we have W^^^^ = ^^^\ and max {|u;,7l, / , ; = 1 , 2 , . . . , A^} =max{u;//: / = 1, 2 , . . . , A^}.
Associative Memories
245
Denote max{ii;//} = max{w;//: / = 1, 2 , . . . , N}. Then, as long as
k<-
V-T'
^^^^>
2max{w;//} the condition (99) is satisfied with X* = I, A = W, and R = diag{l, 1 , . . . , 1}. The estimate (100) reveals an important fact: for the Hopfield network with pseudo-inverse encoding, the smaller the diagonal elements of W, the stronger is the network's error-correction capability. This also supports the general conclusion stated in Theorem 16.
D. APPLICATION: COMPLETE-CORRESPONDENCE ENCODING Based on the theoretical results of the last two subsections, we develop encoding techniques for a Hopfield-type network A/" = (W, I)hy requiring that W and / satisfy the following requirements: (i) Equilibrium-state requirement: S(W^^^^ - I) = ^^^\ /JL = 1,2,..., (ii) Convergence requirement: (PI) or (P2) in Theorem 14 (iii) Stability requirements: (P3) in Theorem 17.
p
The condition (P3) clearly implies the equilibrium-state requirement. So we can, instead of (i), ask the pair (W,I)to satisfy W§(/^) _ / = A(^)§(/^), ^ = 1, 2 , . . . , p,
(101)
where {A,(^)} are any prefixed set of positive real parameters. In practice, by taking any one of the given patterns {^^^^: /x = 1,2,...,/?}, say ^^P^ as a reference vector, and then letting I = W^^P^-X^p)^^P\
(102)
we can transform (101) into the following standard form: WX^f"^ = X(^)X^^\
M = 1, 2 , . . . , /7 - 1,
where X^^^ = §(/^) - ^^P^ and A(^) = k(^) - A,(p) for any /x ^ p. Denote
x=
[x^'\x^^\,..,x^p-'^]
and A = diag{A(i), X(2),..., ^(p-i)}-
(103)
246
Zong-Ben Xu and Chung-Ping Kwong
The equations (103) can also be rewritten as WX = XA.
(104)
A general solution to this equation is given by W = XAX+ + Y[ld - XX+],
(105)
where X"*" is the Moore-Penrose inverse of X, Id is the N x N identity matrix, and Y is an arbitrary matrix. It is known that XX"^ is the orthogonal projection of R^ onto the subspace
Hence XX"^ can be computed recursively by means of Algorithm 1 presented in Section II.C.l. Note also that XAX+ = H(p_i)(X, XA) can be calculated according to Algorithm 1. Thus, except for A and Y, the encoding (105) can be easily obtained. Now different specifications of A and Y in Eq. (105) correspond to different encoding schemes. For example, A = / and Y = 0 correspond to the projection scheme [62] or the pseudo-inverse method [12, 13]; F = 0 corresponds to the spectrum method [102]; and A = / and 7 = r / (r is a real parameter [67]) correspond to the eigenstructure methods [103-108]. In principle, the parameters A and Y should be chosen in such a way that the resulting network can fully satisfy the convergence and stability requirements. More preferably, the network's performance in terms of error-correction capability is also optimized. It is of course, a difficult, although not impossible task to find such a solution in general. In the following we will confine ourselves to the case of symmetric encodings. For this particular but commonly assumed case, we show how all of the currently known encodings can be dramatically improved by applying the new theories developed in this chapter. From Eq. (105), the symmetry of W requires that A = al, where a is a real constant, and that Y should satisfy Y[Id - XX+] = [Id - XX+]y^. This last requirement can be met for any arbitrary pattern set X, if and only if 7 = T / , where r is a real constant. Consequently, we conclude that a symmetric encoding W must have a general form W = aXX'^ — f^Ud — XX"^]. The parameter a can be assumed to be 1. Thus a general symmetric encoding W is of the form W{T) = XX+ H- T{ld - XX+) = (1 - r)XX+ -h rid.
(106)
It is observed that for Eq. (106), WiO) is exactly the pseudo-inverse rule (or projection rule) commonly used in the current literature, and W{—\~) is the eigenstructure rule suggested in [103-108], where 1~ means a real number
Associative Memories
247
less than and very near 1. We will define an optimal T* so that the network J\f = (W(r*), /) encoded by (106) will outperform those encoded by other existing rules. The completeness theorem (Theorem 16) can serve as the basis of recognizing such an optimal parameter r*. This is because, once the r* is specified in this way, the network A/* = (W(r*), /) will have the complete correspondence property (CCP), that is, each pattern X^^^ stored in the network is at least a 1-local minimizer of an energy (hence the attraction basin of Z^^^ at least contains the Hamming ball BH(X^^\1), and no equilibrium state in the network is not asymptotically stable, resulting in fewer spurious states). Such a network therefore has stronger error-correction capability. Given any symmetric Hopfield-type network A/* = (W, / ) , an equivalent network with the CCP is obtained by forcing the diagonal elements of W to zero. This is referred to as a complete correspondence modification skill [68]. We call an encoding with such a skill a complete-correspondence encoding. From Eq. (106), a complete-correspondence encoding for the symmetric Hopfield-type network J\f can be constructed according to the following algorithm. Algorithm 4 (Complete-Correspondence Encoding). (i) Given the pattern set {^^^): ^ = 1, 2 , . . . , p}, let X^^^ = §(^) - ^^P\ M = 1, 2 , . . . , /7 - 1, and let X = {X^^\ X^^,..., X^^-^^]. (ii) Set P+ = XX+ = (pij)NxN, and p* = min{piu pii,..., PNN}(iii) With - ( 1 - p*)-^/7* < r* < 1, define M(r*) = P+ + r*(/j - P+).
(107)
(iv) Let D(r*) = diag{(l-r*);7ii+t*, (l-r*)/722+T*,..., (l-r*)pArivH-r*} and define the synmietric Hopfield network A/" = (W(r*), /) by W(r*) = M(r*) - D(r*)
(108)
/ = W§(^)-§(/').
(109)
From Eqs. (107) and (108), the so defined synmietric Hopfield-type network assuredly has the CCP. Hence it takes all of the prototype patterns §^^^ (^6 = 1, 2 , . . . , p) as its asymptotically stable states, with the attraction basin not less than BH{^^^\ 1). By Corollary 3, the network is also globally convergent whenever it is operated in the serial mode. Furthermore, write P~ = (/^ — XX"^)
248
Zong-Ben Xu and Chung-Ping Kwong
and observe that, in Algorithm 4, P^ is the orthogonal projection of R^ onto L(p_i). Thus P~ is the orthogonal projection of R^ onto L^p_iy So M(r*) can be expressed as M(r*) = P+ + T*p-, which is nonnegative definite, provided r* > 0. Consequently, the network A/^ = (Ty(r*), /) with r* > 0 will also have convergent dynamics when operating in the parallel mode. LetL« = Lp_i + {^(^>}denotetheaffineclosureoftheset{Z^^\ ...,X^^-i>, §(^^}, that is. La = {? G R^: 3 (3/ € R, ai + (32 H
\-ap = \, such that
? = aiX(^) + a2X^^^ + . . . + ap-,X^P-'^
+ ap^^P^],
FromEqs.(107H109) we see that any § € LaPiH^ is such that ^-^^^^ € Lp_i, {Id - r+)($ - ?(^>) = 0 and ^ - ^^P^ = P+(? - ?^^^). It follows that ^ - D(r*)? = P+? - P+§^^^ + §^^^ - D ( T * ) ^ = M(r*)? - / and [1 - (1 - T*)/7,/ + r*] = [ M ( T * ) ^ - / ] / > 0. This shows that § is an equilibrium state of the network. Similarly, we can deduce that any ^ € L^_^ + {^^^^} is definitely not an equilibrium state of the network A/* = (W(r*), / ) . To summarize, we state the following. THEOREM 18. LetM = (iy(r*), /) be the symmetric Hopfield-typenetwork with the complete correspondence encoding. Then
(i) The network J\f with r* > —(1 — /?*)~^/?* is convergent when operating in the serial mode. (ii) The network N with r* >Ois convergent when operating in the parallel mode. (iii) Any element ^ G La H H^ is an equilibrium state ofhf, but any point in L-^-i + {?^^n is definitely not. Theorem 18 shows that the network Af = (V^(r*), /) not only guarantees the storage of the given patterns as its asymptotically stable states, but it also effectively excludes any spurious storage in the affine subspace L-^_i -\- {^^P^]. This is a striking feature and advantage of complete-correspondence encoding. Algorithm 4 defines a family of networks: {A/'(r*) = (W(r*), / ) : (1 - /7*)-i/?* < r* < l } , which, for any fixed r*, provides a complete correspondence modification of the corresponding symmetric network Af = (M(r*), / ) . In particular, the network
Associative Memories
249
A/'CO) is the complete correspondence modification of the pseudo-inverse or projection network Af = ( P ^ , / ) , the network A^(l~) is the modification of the eigenstructure network A/* = (M(l~), /) in [103-108], and A/'(-(l - /?*)~^ x /?*) is the modification of the network J\f = M(—(l — /7*)~V*) proposed in [67]. The following example serves to demonstrate the power of complete-correspondence modification. Let us take {§^^\ §^^\ . . . , ^(^^^} to be the 26 English letters as shown in Fig. 13, each of which is identified as a 70-dimensional vector. We build the pseudo-inverse network A/" = (P^,0) with ^^^^^ = 0, and form its complete correspondence modification A/'(0) = (M(0), 0) according to Algorithm 4. To evaluate the performance of A/" and Af(0), we let BH(X^^\ r) denote the r-ball neighborhood oi^^^\ which consists of all bipolar vectors having Hamming distance r with §^^\ For any fixed r, 1000 different initial states were randomly selected in BH(^^^\ r) for each ^^^M (i = 1, 2 , . . . , 26) to run the networks AT and A^CO) (thus a total of 1000 x 26 runs were made for each fixed r). The probabilities of the correct retrieval out of the 1000 x26 runs were calculated as the measurement of the error-correction capability. The simulation results are depicted in Fig. 14. It is seen that the complete correspondence modification Af(0) has indeed dramatically improved the performance of the original network A/* in terms of error-correction capability. This highlights the great significance of the complete-correspondence encoding as a general principle of improving the existing Hopfield-type networks.
Figure 13 The sample memory patterns. Each is a 70-dimensional binary vector.
250
Zong-Ben Xu and Chung-Ping Kwong 1000
Number of Error Bits in Input Patterns Figure 14 Comparison of error-correction capability between the network A/" = M(0) with the pseudo-inverse rule and its complete correspondence modification Af = ATCO). The task is to recognize the 26 English letters.
V. SUMMARY AND CONCLUDING REMARKS Associative memories have been one of the most active research fields in neural networks; they emerged as efficient models of biological memory and powerful techniques in various applications related to pattern recognition, expert systems, optimization, and inteUigent control. We have shown that almost all existing neural network models can function as associative memories in one way or another. According to the principle of performing the association, these models can be put into three categories: Function-Mapping Associative Memories (FMAMs), Attractor Associative Memories (AAMs), and Pattern-Clustering Associative Memories (PCAMs). Each of the different types of neural networks has its own advantages and shortcomings. However, we argued that the Point Attractor Associative Memories (PAAMs) in AAMs are the most natural and promising. A PAAM is a dissipative, nonlinear dynamical system formed by a large number of connected dynamic neurons. In most cases the system is modeled either as a set of differential equations like Eq. (8) (the continuous PAAM), or as a set of difference equations like Eq. (9) (the discrete PAAM). A unified definition of PAAM system has been presented (Definition 1), which generalizes the well-known PAAM networks such as the Cohen-Grossberg model, the Hopfield
Associative Memories
251
model, the Litter-Hopfield model, the Brain-State-in-a-Box model, and the recurrent back-propagation algorithm. A PAAM is featured by the convergence of its trajectories in state space to its point attractors. Because of that, the basic principle of the PAAM is characterized by its dynamics, which maps initial network states to the final states of the network. Whereas the final states are the point attractors identified with the memorized items, any initial state of the network is a probe vector, representing a complete or partial description of one of the memorized items. The evolution of the network from an initial state to an attractor then corresponds to pattern recall, information reading, or pattern recognition. Two fundamental issues of PAAM technique are (i) encoding: how to store the given prototype patterns as the attractors of a PAAM, and (ii) decoding: how to retrieve the stored memories according to a certain recognition principle. The encoding specifies the synaptic interconnections of a PAAM network, which then, together with the recall dynamics, determines the decoding mechanism. Therefore the neurodynamics of a PAAM provides the basis for both encoding and decoding. For a PAAM to be useful, two kinds of dynamics have been shown to be imperative: global convergence dynamics and asymptotic stability dynamics. We have presented in Section II a unified account of the methodologies for conducting such convergence and stability analysis. Several generic criteria for establishing these two kinds of dynamics for the continuous PAAMs are developed in Theorems 4-6, with specialization to the Cohen-Grossberg model stated in Corollary 1 and Theorem 7. It is concluded that the famous LaSalle invariance principle provides the general mathematical framework for the convergence analysis, whereas the classical Liapunov method and the linearization method underlie the stability analysis. Nevertheless, the full use of the generalized gradient system structure defined in Eq. (28) has been shown to be very helpful in all analyses. Based on the neurodynamics analysis for the general PAAM systems, two general-purpose encoding schemes have been proposed (Algorithms 1 and 2). The successive interpolating encoding defined by Algorithm 1 is a generalization of the well-known Hebb's rule, which retains the "accumulation" feature of the latter and at the same time ensures the restorability of any set of linear independent prototype patterns. Therefore the algorithm can be widely employed to supersede the encoding schemes using the pseudo-inverse, projection, or eigenstructure methods. On the other hand, the sparsely connected encoding described in Algorithm 2 gives us a flexible framework within which not only can the heteroassociation problems be tackled as autoassociative problems (as illustrated by Example 2), but many different complexity reduction strategies for PAAMs can also be deduced. As special cases of the general PAAM theories developed in Section II, two important examples are studied in Sections III and IV. In Section III a continuous PAAM model, the competitive associative memory, is investigated. It is deduced from the successive interpolating encoding scheme and is a three-order nonautonomous differential equation. With the aim of incor-
252
Zong-Ben Xu and Chung-Ping Kwong
porating certain competitive recognition principles into the information reading procedure, such a memory treats the prototype patterns as being stored as a set of different biological species. Then the model mimics the competitive persistence of the species in performing the pattern recall from the stored memories. In Theorems 10 and 11, it is shown that the memory performs very well in competitive recognition, in the sense that, as long as the competitive and inhibitive parameters are appropriately specified, no nontrivial spurious attractor exists in the network (i.e., only the prototype patterns to be stored are in its memories). Furthermore, whenever the competitive parameters eventually lead to a unique winner, only the prototype pattern corresponding to the winner is almost asymptotically stable globally, and hence can be retrieved from almost anywhere in the underlying state space. Owing to these features, the memory can be very efficient and reliable in pattern recognition tasks, especially when a specific pattern must be recognized from a set of stored memories and when the recognition process is governed by certain known competition mechanism. By applying this principle and regarding the pattern recall as recognizing a particular pattern (from among the given set of prototypes) that is the closest in distance to the given probe UQ, a novel PAAM model called perfectly reacted PA AM is formulated in Algorithm 3. This PAAM model can always find the best matching of the given probe from the memories in the Rayleigh quotient distance R^^^('). More precisely, for any given probe, the network can yield the best approximation of the probe when projected onto the unit sphere B^ = [x £ R^: \\x\\ = 1}. Thus, whenever the prototype patterns are normal (i.e., with norm one), this kind of memory can be perfectly applied. As a discrete PAAM example, the asymmetric Hopfield-type networks are examined in detail in Section IV. The symmetric version of the model has been extensively studied in the past. Yet a deep understanding of the general asymmetric case is yet to be achieved. To this end, we have developed a theory for the asymmetric networks with a general convergence principle and a novel classification theory of energy functions. The convergence principle (Theorem 14), which characterizes the conditions under which the networks have convergent dynamics, generalizes the previous results for symmetric network and yields a series of new criteria (Corollaries 3 and 4). The classification theory provides us with a fresh and powerful methodology for the asymptotic stability analysis of the networks. Through the categorization of the traditional energy functions into regular, normal, and complete classes, we have found that the normahty (and furthermore, completeness) of an energy function can quantitatively characterize the error-correction capability of the corresponding networks, and hence underHes the asymptotical stability. By applying this new theory, a generic modification strategy called completecorrespondence modification skill has been formulated, which dramatically improves the performance of the existing symmetric Hopfield-type networks in
in terms of error-correction capability. Incorporating this skill into the general encoding technologies then yields a new and useful symmetric encoding scheme, the complete-correspondence encoding defined in Algorithm 4. It is seen that the known encoding schemes, such as the projection rule, the pseudo-inverse rule, and the eigenstructure rules, are all naturally embodied in and improved by this new encoding. The exposition of the present chapter has been restricted to the point attractor-type neural associative memories that employ monotonic neurons (i.e., neurons whose activation function is monotonic, such as sigmoid, signum, or saturation neurons). It should be noted, however, that PAAMs with nonmonotonic neurons exist and have attracted increasing interest in recent years. Interested readers are referred to [41], [109], and [110]. In addition, beyond the point attractor associative memories, other dynamic neural associative memories, such as the chaotic and the oscillatory associative memories, are drawing much attention. A comprehensive exposition of these memories can be found in [111] and [112].
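The encoding rules contrasted above are classical and easy to state concretely. As a purely illustrative sketch (our own Python fragment with hypothetical helper names, not the chapter's Algorithms 1-4), the following builds a Hopfield-type weight matrix with the Hebb rule and with the projection (pseudo-inverse) rule, and then decodes with the usual signum recall dynamics; the projection rule makes any linearly independent prototype set an exact set of fixed points.

```python
import numpy as np

def hebb_encode(P):
    # Hebb ("accumulation") rule: W = (1/n) P^T P, with zero self-connections.
    # P is an m x n array whose rows are bipolar (+/-1) prototype patterns.
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def projection_encode(P):
    # Projection (pseudo-inverse) rule: W = P^T (P P^T)^{-1} P projects onto
    # span(P), so linearly independent prototypes become exact fixed points.
    return np.linalg.pinv(P) @ P

def recall(W, probe, steps=50):
    # Decoding: iterate x <- sgn(W x) from the probe until the state stops
    # changing, i.e., until a point attractor is reached.
    x = probe.copy()
    for _ in range(steps):
        x_new = np.where(W @ x >= 0.0, 1.0, -1.0)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x

rng = np.random.default_rng(0)
P = rng.choice([-1.0, 1.0], size=(3, 64))      # three prototypes, 64 neurons
W = projection_encode(P)                       # or hebb_encode(P)
probe = P[0].copy()
probe[:8] *= -1.0                              # corrupt 8 of 64 components
print(np.array_equal(recall(W, probe), P[0]))  # typically True for mild noise
```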
ACKNOWLEDGMENT

The artwork (Fig. 1 and Figs. 7-9) produced by Joseph C. K. Chan and the computing assistance provided by W. L. Yeung are gratefully acknowledged.
REFERENCES

[1] T. J. Teyler. Memory: Electrophysiological analogs. In Learning and Memory: A Biological View (J. L. Martinez, Jr., and R. S. Kesner, Eds.), pp. 237-265. Academic Press, Orlando, FL, 1986.
[2] J. A. Anderson. What Hebb synapses build. In Synaptic Modification, Neuron Selectivity, and Nervous System Organization (W. B. Levy, J. A. Anderson, and S. Lehmkuhle, Eds.), pp. 153-173. Erlbaum, Hillsdale, NJ, 1985.
[3] H. Wechsler and G. L. Zimmerman. 2-D invariant object recognition using distributed associative memory. IEEE Trans. Pattern Anal. Machine Intelligence 10:811-821, 1988.
[4] J. A. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1994.
[5] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, 1994.
[6] A. Fuchs and H. Haken. Pattern recognition and associative memory as dynamic processes in a synergetic system. Biol. Cybernet. 60:17-22, 1988.
[7] W. K. Taylor. Electrical simulation of some nervous system functional activities. In Information Theory (E. C. Cherry, Ed.), Vol. 3, pp. 314-328. Butterworth, London, 1956.
[8] K. Steinbuch. Die Lernmatrix. Kybernetik 1:36-45, 1961.
[9] J. A. Anderson. A simple neural network generating an interactive memory. Math. Biosci. 14:197-220, 1972.
[10] T. Kohonen. Correlation matrix memories. IEEE Trans. Comput. C-21:353-359, 1972.
[11] K. Nakano. Associatron—a model of associative memory. IEEE Trans. Systems Man Cybernet. SMC-2:380-388, 1972.
[12] W. Wee. Generalized inverse approach to adaptive multiclass pattern classification. IEEE Trans. Comput. C-17:1157-1164, 1968.
[13] T. Kohonen and M. Ruohonen. Representation of associative data by matrix operators. IEEE Trans. Comput. C-22:701-702, 1973.
[14] T. Kohonen. Associative Memory—A System Theoretical Approach. Springer-Verlag, New York, 1977.
[15] P. D. Olivier. Optimal noise rejection in linear associative memories. IEEE Trans. Systems Man Cybernet. SMC-18:814-815, 1985.
[16] S. W. Zhang, A. G. Constantinides, and L. H. Zou. Further noise rejection in linear associative memories. Neural Networks 5:163-168, 1992.
[17] K. Murakami and T. Aibara. An improvement on the Moore-Penrose generalized inverse associative memory. IEEE Trans. Systems Man Cybernet. SMC-17:699-707, 1987.
[18] Y. Leung, T. X. Dong, and Z. B. Xu. The optimal encoding for biased association in linear associative memories. IEEE Trans. Neural Networks. To appear.
[19] D. Gabor. Communication theory and cybernetics. IRE Trans. Circuit Theory CT-1:19-31, 1954.
[20] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psych. Rev. 65:386-408, 1958.
[21] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, Washington, DC, 1962.
[22] B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Part 4, pp. 96-104. IRE, New York, 1960.
[23] B. Widrow and M. A. Lehr. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. IEEE 78:1415-1442, 1990.
[24] P. Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, 1988.
[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature 323:533-536, 1986.
[26] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems 2:321-355, 1988.
[27] S. Grossberg. Nonlinear difference-differential equations in prediction and learning theory. Proc. Nat. Acad. Sci. U.S.A. 58:1329-1334, 1967.
[28] S. Grossberg. Studies of Mind and Brain. Reidel, Boston, MA, 1982.
[29] S. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Trans. Systems Man Cybernet. SMC-2:643-657, 1972.
[30] H. R. Wilson and J. D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12:1-24, 1972.
[31] D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins. Non-holographic associative memory. Nature 222:960-962, 1969.
[32] J. A. Anderson, J. W. Silverstein, S. A. Ritz, and R. S. Jones. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psych. Rev. 84:413-451, 1977.
[33] W. A. Little and G. L. Shaw. A statistical theory of short and long term memory. Behav. Biol. 14:115-133, 1975.
[34] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79:2554-2558, 1982.
[35] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Nat. Acad. Sci. U.S.A. 81:3088-3092, 1984.
[36] M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems Man Cybernet. SMC-13:815-826, 1983.
[37] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Sci. 9:147-169, 1985.
[38] C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems 1:995-1019, 1987.
[39] B. Kosko. Bidirectional associative memories. IEEE Trans. Systems Man Cybernet. SMC-18:49-60, 1988.
[40] P. K. Simpson. Higher-ordered and intraconnected bidirectional associative memories. IEEE Trans. Systems Man Cybernet. SMC-20:637-652, 1990.
[41] M. Morita. Associative memory with nonmonotonic dynamics. Neural Networks 6:115-126, 1993.
[42] G. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision Graph. Image Process. 37:54-115, 1987.
[43] G. Carpenter and S. Grossberg. ART2: Self-organization of stable category recognition codes for analog input patterns. Appl. Optics 26:4919-4930, 1987.
[44] T. Kohonen. Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43:59-69, 1982.
[45] T. Kohonen. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, New York, 1993.
[46] R. Hecht-Nielsen. Counterpropagation networks. In Proceedings of the First IEEE International Conference on Neural Networks, San Diego, CA, 1987, Vol. 2, pp. 19-32.
[47] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks 2:331-349, 1989.
[48] A. Atiya and Y. S. Abu-Mostafa. An analog feedback associative memory. IEEE Trans. Neural Networks 4:111-126, 1993.
[49] T. Coolen and D. Sherrington. Dynamics of attractor neural networks. In Mathematical Approaches to Neural Networks (J. G. Taylor, Ed.). Elsevier, Amsterdam, 1993.
[50] A. Bhaya, E. Kaszkurewicz, and V. S. Kozyakin. Existence and stability of a unique equilibrium in continuous-valued discrete-time asynchronous Hopfield neural networks. IEEE Trans. Neural Networks 7:620-628, 1996.
[51] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5:115-133, 1943.
[52] S. Y. Kung. Digital Neural Networks. Prentice-Hall, Englewood Cliffs, NJ, 1993.
[53] W. A. Little. The existence of persistent states in the brain. Math. Biosci. 19:101-120, 1974.
[54] F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59:2229-2232, 1987.
[55] A. Guez, M. Kam, and J. L. Eilbert. Computational-complexity reduction for neural network algorithms. IEEE Trans. Systems Man Cybernet. 19:409-414, 1989.
[56] K. Nakajima and Y. Hayakawa. Correct reaction neural network. Neural Networks 6:217-222, 1993.
[57] E. R. Caianiello and A. D. Benedictis. Neural associative memories with minimum connectivity. Neural Networks 5:433-439, 1992.
[58] M. A. Cohen. The construction of arbitrary stable dynamics in nonlinear neural networks. Neural Networks 5:83-103, 1992.
[59] F. Verhulst. Nonlinear Differential Equations and Dynamical Systems. Springer-Verlag, New York, 1990.
[60] A. Guez, V. Protopopescu, and J. Barhen. On the stability, storage capacity and design of nonlinear continuous neural networks. IEEE Trans. Systems Man Cybernet. 18:80-87, 1988.
[61] J. A. Farrell and A. N. Michel. A synthesis procedure for Hopfield's continuous-time associative memory. IEEE Trans. Circuits Systems 37:877-884, 1990.
[62] S. I. Sudharsanan and M. Sundareshan. Equilibrium characterization of dynamic neural networks and a systematic synthesis procedure for associative memories. IEEE Trans. Neural Networks 2:509-521, 1991.
[63] B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
[64] Z. B. Xu and C. P. Kwong. A decomposition principle for complexity reduction of artificial neural networks. Neural Networks 9:999-1016, 1996.
[65] L. Personnaz, I. Guyon, and G. Dreyfus. Collective computational properties of neural networks: New learning mechanisms. Phys. Rev. A 34:4217-4228, 1986.
[66] S. Grossberg. Nonlinear neural networks: Principles, machines and architectures. Neural Networks 1:15-51, 1988.
[67] H. Haken. Neural and synergetic computers. Springer Ser. Synergetics 42:2-28, 1988.
[68] Z. B. Xu, G. Q. Hu, and C. P. Kwong. Some efficient strategies for improving the eigenstructure method in synthesis of feedback neural networks. IEEE Trans. Neural Networks 7:233-245, 1996.
[69] Z. B. Xu and C. P. Kwong. Global convergence and asymptotic stability of asymmetric Hopfield neural networks. J. Math. Anal. Appl. 191:405-427, 1995.
[70] Z. B. Xu, Y. Leung, and X. W. He. Asymmetric bidirectional associative memories. IEEE Trans. Systems Man Cybernet. 24:1558-1564, 1994.
[71] Z. Y. Chen, C. P. Kwong, and Z. B. Xu. Multiple-valued feedback and recurrent correlation neural networks. Neural Comput. Appl. 3:242-250, 1995.
[72] F. Peper and M. N. Shirazi. A categorizing associative memory using an adaptive classifier and sparse coding. IEEE Trans. Neural Networks 7:669-675, 1996.
[73] B. Kosko. Adaptive bidirectional associative memories. Appl. Optics 26:4947-4960, 1987.
[74] X. W. He, C. P. Kwong, and Z. B. Xu. A competitive associative memory model and its dynamics. IEEE Trans. Neural Networks 6:929-940, 1995.
[75] Z. B. Xu and Y. Leung. Competitive associative memory: The extended model and its global stability analysis. IEEE Trans. Neural Networks. To appear.
[76] C. L. Giles and T. Maxwell. Learning, invariance, and generalization in high-order neural networks. Appl. Optics 26:4972-4978, 1987.
[77] A. N. Michel, J. A. Farrell, and H. F. Sun. Analysis and synthesis techniques for Hopfield type synchronous discrete time neural networks with application to associative memory. IEEE Trans. Circuits Systems 37:1356-1366, 1990.
[78] S. V. B. Aiyer, M. Niranjan, and F. Fallside. A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks 1:204-215, 1990.
[79] J. Bruck and V. P. Roychowdhury. On the number of spurious memories in the Hopfield model. IEEE Trans. Inform. Theory 36:393-397, 1990.
[80] J. Bruck and M. Blaum. Neural networks, error correcting codes and polynomials over binary n-cubes. IEEE Trans. Inform. Theory 35:976-987, 1989.
[81] Y. Leung and Z. B. Xu. An associative memory based on key management and competitive recognition. IEEE Trans. Neural Networks. To appear.
[82] T. D. Chiueh and R. M. Goodman. Recurrent correlation associative memories. IEEE Trans. Neural Networks 2:275-284, 1991.
[83] G. A. Kohring. On the Q-state neuron problem in attractor neural networks. Neural Networks 6:573-581, 1993.
[84] J. Bruck and J. W. Goodman. A generalized convergence theorem for neural networks. IEEE Trans. Inform. Theory 34:1089-1092, 1988.
[85] J. Bruck. On the convergence properties of the Hopfield model. Proc. IEEE 78:1579-1585, 1990.
[86] M. Cottrell. Stability and attractivity in associative memory networks. Biol. Cybernet. 58:129-139, 1988.
[87] S. Dasgupta, A. Ghosh, and R. Cuykendall. Convergence in neural memories. IEEE Trans. Inform. Theory 35:1069-1072, 1989.
[88] E. Goles, F. Fogelman, and D. Pellegrin. Decreasing energy functions as a tool for studying threshold networks. Discrete Appl. Math. 12:261-277, 1985.
[89] E. Goles. Antisymmetrical neural networks. Discrete Appl. Math. 13:97-100, 1986.
[90] R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Trans. Inform. Theory 33:461-482, 1987.
[91] S. Porat. Stability and looping in connectionist models with asymmetric weights. Biol. Cybernet. 60:335-344, 1989.
[92] I. Pramanick. Parallel dynamic interaction—an inherently parallel problem solving methodology. Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Iowa, 1991.
[93] Y. Shrivastava, S. Dasgupta, and S. M. Reddy. Guaranteed convergence in a class of Hopfield networks. IEEE Trans. Neural Networks 3:951-961, 1992.
[94] L. Personnaz, I. Guyon, and G. Dreyfus. Information storage and retrieval in spin-glass like neural networks. J. Physique Lett. 46:359-365, 1985.
[95] A. N. Michel, J. A. Farrell, and W. Porod. Qualitative analysis of neural networks. IEEE Trans. Circuits Systems 36:229-243, 1989.
[96] Z. B. Xu, G. Q. Hu, and C. P. Kwong. Asymmetric Hopfield-type networks: Theory and applications. Neural Networks 9:483-510, 1996.
[97] A. Crisanti and H. Sompolinsky. Dynamics of spin systems with randomly asymmetric bonds: Langevin dynamics and a spherical model. Phys. Rev. A 36:4922-4936, 1987.
[98] I. Kanter and H. Sompolinsky. Associative recall of memory without errors. Phys. Rev. A 36:444-445, 1987.
[99] J. J. Hopfield and D. W. Tank. Neural computation of decisions in optimization problems. Biol. Cybernet. 52:141-152, 1985.
[100] E. D. Dahl. Neural network algorithms for an NP-complete problem: Map and graph coloring. In Proceedings of the Second IEEE International Conference on Neural Networks, 1988, Vol. 3, pp. 113-120.
[101] G. V. Wilson and G. S. Pawley. On the stability of the TSP problem algorithm of Hopfield and Tank. Biol. Cybernet. 58:63-78, 1988.
[102] R. J. Marks, II, S. Oh, and L. E. Atlas. Alternating projection neural networks. IEEE Trans. Circuits Systems 36:846-857, 1989.
[103] J. H. Li, A. N. Michel, and W. Porod. Qualitative analysis and synthesis of a class of neural networks. IEEE Trans. Circuits Systems 35:976-986, 1988.
[104] J. H. Li, A. N. Michel, and W. Porod. Analysis and synthesis of a class of neural networks: Variable structure systems with infinite gain. IEEE Trans. Circuits Systems 36:713-731, 1989.
[105] J. H. Li, A. N. Michel, and W. Porod. Analysis and synthesis of a class of neural networks: Linear systems operating on a closed hypercube. IEEE Trans. Circuits Systems 36:1405-1422, 1989.
[106] A. N. Michel, J. Si, and G. Yen. Analysis and synthesis of a class of discrete-time neural networks described on hypercubes. IEEE Trans. Neural Networks 2:32-46, 1991.
[107] G. Yen and A. N. Michel. A learning and forgetting algorithm in associative memories: Results involving pseudo-inverse. IEEE Trans. Circuits Systems 38:1193-1205, 1991.
[108] G. Yen and A. N. Michel. A learning and forgetting algorithm in associative memories: The eigenstructure method. IEEE Trans. Circuits Systems 39:212-225, 1992.
[109] S. Yoshizawa, M. Morita, and S. Amari. Capacity of associative memory using a nonmonotonic neuron model. Neural Networks 6:167-176, 1993.
[110] S. Yoshizawa. In The Handbook of Brain Theory and Neural Networks (M. A. Arbib, Ed.), pp. 651-654. MIT Press, Cambridge, MA, 1996.
[111] P. Thiran and M. Hasler. Internat. J. Circuit Theory Appl. 24:57-77, 1996.
[112] J. M. Buhmann. In The Handbook of Brain Theory and Neural Networks (M. A. Arbib, Ed.), pp. 691-694. MIT Press, Cambridge, MA, 1996.
[113] C. R. Rao and S. K. Mitra. Generalized Inverse of Matrices and Its Applications. Wiley, New York, 1971.
[114] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Academic Press, New York, 1979.
A Logical Basis for Neural Network Design

Robert L. Fry and Raymond M. Sova
The Johns Hopkins University Applied Physics Laboratory, Laurel, Maryland 20723-6099
I. MOTIVATION

Before embarking on any long journey, it is essential to know where one is going, how one will get there, and, most importantly, why one is going. Somehow, the last question drives the methodology upon which the trip is planned. The motivation underlies the whole purpose of putting out the energy to make the journey to begin with. Therefore, our first objective is to motivate the reader to explore the approach to neural design described in this chapter. Our intention is to characterize a logical basis for the design of artificial neural networks. To do this, we develop a methodology for the design, synthesis, analysis, testing, and, above all, the understanding of a new type of computational element called the inductive logic unit or ILU. The functional architecture of this device is dictated by inductive logic, probability, and a complementary extension to information theory as originated by Shannon [1]. The ILU, as a physical device, may be thought of as being analogous to other physical devices used by electrical engineers, such as a transistor, operational amplifier, or logic gate. As such, we will adopt an engineering approach in deriving its properties, analyzing its characteristics, and evaluating the ILU through both simulation and test. For instance, we will develop operating curves over various parameter regimes that are analogous to operational gain curves for a transistor. This characterization will facilitate application-oriented designs intended to address specific problems that are not described here. The exact nature of the source of data processed by the ILU is unimportant and may arise from financial, scientific, or other concerns. Hence, the developed computational element may have
wide applicability. Here, however, the goal is to introduce a new way of thinking about a new, or perhaps very old, type of neural network. Our objective is to design a rational device. We should therefore try to quantify what we mean by rational. The word rational is defined in the dictionary as "having or exercising the ability to reason" and is derived from the Latin word rationalis, which, in turn, is derived from the Latin word ratio, meaning computation. The dictionary defines the word ratio as "a relation in degree or number between two similar things." We shall see that the ILU is a rational device in every regard stated above, in that it indeed computes a relation that is a ratio and that describes a relative degree between two similar things. The computed ratio is used by the ILU to determine the most logical questions to ask as well as to dictate the most rational decisions to give. We will see that the ILU arises as a natural consequence of a rational argument. This makes sense because the ILU is itself a rational device and hence should easily be understood by another rational device, just as digital computers can efficiently and precisely simulate other digital computers. As a rational device, the ILU learns, reasons, asks questions, and makes decisions. Even though a single ILU is a very powerful computational device, it should be construed as part of a larger device called an inductive logic computer or ILC. This chapter will focus on the ILU, but ultimately the ILU will typically be but one of many similar components comprising an ILC that is configured to perform some useful task. Why use the adjective "inductive"? What exactly is it that distinguishes inductive logic from, say, deductive logic? The answer is that inductive logic generalizes the concept of deductive logic. Most are familiar with deductive logic and its consequences, such as engineering realizations that rely on digital logic. Conventional digital computers are deductive. That is, digital computers are supplied with digital (binary) information upon which they, at their most fundamental level, operate using standard Boolean operators such as AND, OR, and NOT, or other Boolean functions arising from combinations thereof. Furthermore, a deductive logical process evolves rigorously and causally through the use of decision operators such as IF, THEN, and OR ELSE logical constructs. A common deductive statement might take the form "If a is true and b is true, then c is true." Decision operators have a fundamental element that is causal in nature and tends to drive the discrete trajectory of a deductive process in time. There is no intrinsic uncertainty associated with decisions derived through a deductive reasoning process: all of the information necessary to form a conclusion and then assert a decision at each critical juncture is present. Unfortunately, life outside the digital computer does not typically admit or permit the use of deductive logic, because all of the information required to make a deductive decision is rarely present. A deductive system confronted with such a dilemma could not evolve along its discrete deductive trajectory, and the causal flow of information would cease.
Alternatively, the fact that decisions must typically be made in the face of uncertainty is intrinsic to inductive logic. Decisions might be thought of as actions arising from a "participant" through inductive reasoning processes, whatever such processes might consist of or however they might be formulated. Information can be promulgated, although the quality of the information may not be on a par with that arising from a deductive decision-making process. In some intuitive sense, we would like to maximize the quality of the information that is promulgated within any notional inductive reasoning process. Here the term "inductive" is meant to describe a process whereby the "best" possible decisions are made and promulgated with respect to a measure of "goodness" that is itself derived through a logical reasoning process. Again, the main purpose of this chapter is to provide a qualitative and engineering-oriented overview of how the inductive reasoning process can be practically realized and implemented. This is done through the construct of the ILU. Furthermore, we will see that the ILU shares much in common with biological neurons, indeed, much more than is described here. The logical theory behind our approach has origins going back as far as Aristotle and perhaps earlier. That is to say, much of the following material is basically not new, although the interpretations given may be novel. The history of logic is both interesting and technically valuable for obtaining deeper insights into the analyses given here. Logic is by far the most rigorously developed quantification of any philosophical thought or idea. Indeed, philosophers have consistently formulated basic questions that at first may seem vain, but evolve with understanding to provide fundamentally new insights and new mathematical perspectives with practical utility. More precisely, many questions, although initially thought vain, become quantified after a time through the tools of mathematics, which are but mechanized and logically self-consistent expressions of our thoughts. Of course, we cannot know with certainty whether a question is vain or not, and if forced to decide, we are unable to do so. We are indecisive. But what is interesting is that our indecision can be quantified, as we will see. Such is the nature of inductive logic. The deeper theory of logic is not developed here, but is acknowledged through citations, references, and, to a large extent, supplied logical tables. Logical arguments and examples are given instead. Indeed, we ratiocinate, that is, we reason methodically and logically, but do not resort to mathematical rigor. Our problem is that the presented logical concepts are very basic and require the use of extremely precise language and logical meaning. This is not to say that the approach we describe cannot be rigorously formulated, but rather that we abandon this rigor in an effort to convey insight. Key to reading this chapter is the fact that the ideas described are very fundamental, so fundamental, in fact, that there is often no precise word or expression to sufficiently and succinctly convey the intended concept. Consequently, one is
forced to use the least complex language construct, whether a specific word, expression, or paragraph, to best capture the intended meaning. As a matter of convention, when outstanding potential exists for "word abuse," we will italicize such words or expressions to signal the reader to take special caution regarding their interpretation. Unfortunately, the inadequacy of language also means that a preponderance of words must sometimes be used instead to convey basic logical concepts when strict mathematical theorems and proofs are not used. Therefore, we ask your patience in advance. In trying to understand the meaning of this chapter, use simplicity as your guide and common sense as your teacher.
II. OVERVIEW

Logical reasoning represents an important and overt manifestation of human behavior. It is interesting to hypothesize that the human ability to reason arises as a natural and necessary consequence of the neurological structure of the brain itself. One can further hypothesize that even a single neuron can be described as a logical device with the capacity for asking questions, making measurements, assimilating information, formulating conclusions, and, finally, making rational decisions transmitted to other neurons. If the information necessary to make a deductive decision is lacking, then the neuron must infer the correct decision through induction and probability, the latter being the only logically consistent generalization of deductive logic [2-5]. This is our fundamental premise regarding inference and decision making. It can be shown that probability is a unique generalization of deductive logic. It makes sense, then, that probability will play an important role in the development of devices that operate inductively. A decision maker represents a source of information. Underlying assumptions in the form of a premise delineate for the decision maker all possible decisions that can be made by that decision maker. The premise, combined with the prevailing state of knowledge, provides logical support for both inference and subsequent external renderings of decisions by a decision maker. Alternatively, information is received by an observer. An observer uses measured information to resolve an internal issue. To resolve this issue efficiently, the observer must pose specific questions that obtain the required information efficiently. It somehow seems intuitive for an observer to ask those questions, in sequence, that have the most bearing on the observer's prevailing issue. The game of 20 questions provides a familiar example of trying to resolve the issue, What is the other player thinking of? The process of obtaining information represents a process of transduction on the part of the observer of the information. Transduction describes the process whereby externally presented information is encoded into an internal representation within a physical system. More precisely, optimal transduction of information by an observer is attained by posing the best questions
relative to the prevailing issue. For instance, in the game of 20 questions, knowing that the thing being thought of by the other player is a person, the next natural question to ask is gender. Alternatively, the efficient transmission of information requires that a source or participant make the best decisions as realized through asserted outputs, using all knowledge at hand. Transmission describes the process whereby internal information is decoded into an external representation, in the form of assertions, that exists between physical systems. Inductive logic and logical consistency dictate that asserted decisions must be based on probability. As will be seen, inductive logic and logical consistency also dictate which questions are the best questions to pose, as guided by bearing. Assertions and questions, transmission and transduction, premise and issue, probability and bearing: each of these four pairs describes aspects of the causal flux of physical information that will be described in detail. Regarding transmission, the hypothesized inductive reasoning process carried out by the ILU is an assertive process; that is, the objective of the ILU is to derive and generate assertions that represent logical decisions based on the premise of the ILU and relevant information it has previously obtained. Inductive decisions represent a generalization of the IF, THEN, and OR ELSE logical constructs used within deductive logic machines, where such constructs demand the consideration of time and causality. IF, THEN, and OR ELSE logical constructs serve to functionally implement logical implication, that is, "if a, then b" constructs. Such constructs are intrinsically causal in nature. Regarding transduction, the ILU must execute a program of inquiry in that it also asks and subsequently obtains answers to questions that are posed. Inference and inquiry [6] complement one another within a neural system or within any type of system, because all systems that support transduction and transmission must have both inputs and outputs. Inference leads to the expression of information, whereas inquiry leads to its acquisition. In and out complement one another. Without both, an entity cannot observe or participate, respectively, and the causal flow of information is interrupted. Systems engineers are familiar with the duality of inference and inquiry, although they may not think of it this way. One can jokingly say that a prerequisite to being a systems engineer is to be able to draw boxes connected to other boxes by arrows both entering and leaving them. Figure 1 is an example of such a system having inputs labeled x1 through xn and a solitary output y. Figure 1 describes a device, denoted Q, that receives input through inquiry and is itself a source of information through inference. Hence Q is an acceptable deductive or inductive logic component because it supports causal information flow. If Q were a deductive device, the transduction of information from input to output could be accomplished by using standard logical functions such as AND, OR, and NOT operators combined with IF, THEN, and OR ELSE logical decision
Figure 1 The logical device, denoted Q, represents a general system that supports causal information flow; that is, it can observe presented input assertions x1, x2, ..., xn that originate externally to the system while simultaneously supporting the assertion of a single output decision y, which may, in turn, be observed by another logical device.
constructs to effect decision points in time. At these decision points, supplied information is logically combined, operated upon, and then used to generate new information. The device Q can even be a conventional digital computer if it operates deductively. However, if Q were truly a deductive device, any input xi that was missing or somehow in error would likely cause the output y of the device to be in error as well. Imagine a digital computer whose operating system is 100 million bits in size. If even 1 bit were inadvertently changed, the operating system would likely fail at some point. Such is the unforgiving nature of deductive logic. From now on, we will assume that Q operates on the basis of inductive logic. Missing or erroneous inputs are easily tolerated and, indeed, expected by an ILC. It is the nature of an ILC to simply make the best possible decisions given all observable and relevant information while simultaneously asking the best possible questions, given the time allowed, with the criterion of best still to be defined. Returning to Fig. 1, we will restrict the inputs xi and the output y to be Boolean or, more precisely, distinctions, because they can take on only one of two possible values. Alternatively, each input xi and the output y can be thought of as being binary-valued, but this represents too restrictive a point of view for our purposes. However, a Boolean interpretation is consistent with the function of a digital device and is consistent with our objective to design an ILU that will not only function inductively, but will compute and then transmit decisions. The ILUs considered in this chapter must abide by the standard rules of Boolean algebra, which we will summarize, and, in the limit of complete knowledge, operate deductively. Furthermore, ILU operation should be based on principles that are a logical and unique generalization of Boolean algebra. These two assumptions form the basis of the discussions for the remainder of this section.
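The contrast between deductive fragility and inductive tolerance can be made concrete with a toy fragment. In the sketch below (ours, not the authors'; the majority vote is only a crude stand-in for the probabilistic rule developed later in the chapter), a deductive gate stalls when an input is missing, whereas the inductive-style rule still renders the best decision available from what was actually observed.

```python
from typing import Optional, Sequence

def deductive_and(inputs: Sequence[Optional[bool]]) -> Optional[bool]:
    # A deductive gate: a single missing input (None) leaves it unable to
    # decide, and the causal flow of information ceases.
    if any(v is None for v in inputs):
        return None
    return all(inputs)

def inductive_decision(inputs: Sequence[Optional[bool]]) -> bool:
    # A crude inductive stand-in: decide by majority over the inputs that
    # were actually observed, ignoring missing ones.
    seen = [v for v in inputs if v is not None]
    return 2 * sum(seen) >= len(seen)

print(deductive_and([True, None, True]))       # None: no decision possible
print(inductive_decision([True, None, True]))  # True: best available decision
```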
Boolean algebra is a logic of propositions. Propositions represent distinctions on the part of an observer, because it is the observer who defines the distinctions. That is, we distinguish things by giving them names. If two things are indistinguishable, they are not distinct relative to the observer who says they are not, and they therefore can be represented by the same proposition by this same observer. If we know only two things and these two things are not the same, then we can distinguish them from one another because whatever one thing is, the other is not. These observations are very simple and very fundamental. Observability and assertability can be viewed as complementary logical constructs. Regarding assertability, a decision maker who gives the same answer twice when asked a question will provide no new information to the person posing the question. Regarding observability, an observer posing a question twice will obtain no new information from the same source. Consider the following example. Let the concept "apple" be denoted by the proposition a. Now a is defined more or less universally by most adults. The operative word here is "defined." We define the distinction a, having learned it previously from someone in conjunction with our experiences. Suppose we observe the world (i.e., our external environment) while at the same time exclusively looking for apples. That which we observe and is asserted to us will be distinguished, by us who define apple, as either a or not a, with the latter denoted by ~a. The operator "~" denotes the logical operator NOT, but physically should be interpreted as "distinguished from." Now suppose we see "a bushel of apples." Well, we are not looking for a proposition b = "a bushel of apples," because we have no defining distinction for this concept. However, what we may observe and what is asserted to us are many apples with one another. This is denoted by the logical expression a = a ∨ a ∨ ··· ∨ a, where "∨" denotes the Boolean inclusive OR function and is the logical operator associated with the physical observation "apple with apple with ··· with apple" for each observable apple in the bushel. Even if all apples but one were to disappear, the last remaining apple would guarantee the answer "yes" to the observer. Thus the logical operator "∨" represents physical superposition in that all of the presented quantities are with one another. Another physical analogy would be the measurement of the fluid level in a cylindrical glass of water in which ice and liquid were originally combined in some unknown proportion. After the ice melts, the measured level is the sum of the original water level without ice plus the additional water level attributable to the melted ice. The measurement is made on a superposition of the two fluid sources, which cannot be distinguished. Hence observation and measurement have a physical description through the logical operators "~" and "∨." For instance, it is well known that a = a ∨ a is always logically true. As interpreted here, when you ask a particular and specific question, you can only get an answer at the point in time when you pose that question.
Therefore, when you ask the precise and sole question "Do I see an apple?" it is immaterial whether a single apple or a bushel of apples is seen, and the answer to this question can still only be "yes." As a more practical example, suppose you have lost your car keys in your house and are looking desperately for them. Going from room to room, you look for your keys, but see only nothing. Finally, you spot them on the television. They are presented to you, enabling you to assert the answer, "Yes, I finally found them." A major point here is that a physical interpretation of logic that deals with Boolean 0s and 1s can be misleading in a physical sense. In physical systems, assertions either are or are not. One either obtains the answer "yes" or "nothing," because nature never explicitly tells you "no." For example, a biological neuron either receives an action potential or does not, and in turn, either generates one or does not. Now returning to our previous example, suppose we are looking for b = "two apples." While looking, we sight an apple a1 and then turn around and see another apple a2. Because our sole purpose in life (at this moment) is to search for b, we observe the answer b = a1 ∧ a2. Here "∧" denotes the logical AND function. This case differs from the "bushel of apples" case in that the conjunctive operator "∧" must physically and explicitly be carried out by the observer. The disjunctive operator is opposite the conjunctive operator in the sense that disjunction logically describes physical superposition, which is a property of the external world observed by the system posing the question. For the conjunctive case, the observer must be able to grasp that he or she has made two separate observations and has, in fact, performed coincidence detection. The conjunctive case represents an added level of complexity on the part of the observer and requires the consideration of the notion of time in making multiple measurements. Hence, "Time is that which keeps everything from happening at once" [7], or more precisely, time is that which keeps us from making all measurements at once and giving all answers at the same time. The perspective of this chapter is that we can describe the world about us as having a Boolean structure that is imposed by its inherent physical properties and our own. The famous logician Emil Post [8] published a remarkable paper in which he exhaustively examined many of the essential properties of logic and noted the amazing similarity between logic, geometry, and the physical world about us. Upon understanding the extreme simplicity of this structure, one must agree with Post's logical interpretations. From the perspective described so far, we can state that the information we receive has a Boolean structure dictated by its logical characteristics and our own. Historically, Boolean logic has been used to characterize the transmission of information through assertions. However, as described earlier, information flow requires the simultaneous consideration of transmission and transduction. With this in mind, consider the complementary nature of, and the interrelationships between,
questions and assertions, as captured in the following two definitions:

(1) A question is defined by the possible answers derived through its posing and as obtained in response to presented assertions.
(2) An assertion is defined by the posable questions answered through its presentation and as generated in response to answers obtained.

Although innocuous enough, these two definitions are extremely important and form the basis for all that follows. Definitions (1) and (2) provide for a logical factorization of causal information flow into the separate processes of transduction and transmission, respectively. These definitions bring to mind the example of a stereogram. Stereograms are flat, two-dimensional figures which, when viewed properly (slightly out of focus with your eyes crossed), "come to life" as a three-dimensional figure. This rendering is only possible through a cooperative interaction of the observer and the properties of the stereogram. Many metaphors and philosophical arguments capture the essence of (1) and (2) in one manner or another, for example, "Beauty is in the eye of the beholder" or "If a tree falls in the forest, does it make a sound?" Regarding the latter question, and from the previous descriptions, the answer is a resounding "No!" in that quantities that are not observable, that is, do not enter as arrows into the device shown in Fig. 1, cannot be considered information by that device. We adopt the definitions given by Cox [6] and call questions that cannot be answered relative to an inquiring observer vain. The term vain contrasts with the notion of real questions, that is, questions that can be answered. Real and vain have a logical significance for questions that is analogous to the significance of the terms true and false for assertions. The following examples and discussions describe in further detail the notions of real and vain and the logical meaning of definitions (1) and (2). Suppose one person wants to convey to another person which card, out of a standard deck of 52, is being thought of. The person thinking of the card will answer any posed question except for the direct question, "What card is it?" which is therefore deemed a vain question, that is, a question that cannot be answered on the part of the person trying to guess the card. One could, for instance, ask the suit of the card, the color of the card, or even a very specific question like "Is it the queen of hearts?" Let us denote the questions relating to suit and color by B and C, respectively. We maintain the convention that questions are denoted by capital italicized letters, whereas assertions are denoted by lowercase italicized letters. For example, one can obtain the answer b = "The card is a club" as one possible response to posing B and hearing the presentation made by the other player, which might take the form "The suit of the card has the shape of a clover." The answer is logically distinct from the presentation made to the relevant observer regarding the question that is posed. The person trying to guess the card can ask B and then C, which together will convey what these questions convey jointly to the observer. Asking both questions
can be logically denoted B ∧ C, signifying that both B and C have been posed separately. Furthermore, B ∧ C = B, because knowing the suit will also indicate the color. The condition B ∧ C = B is the condition of logical implication, in that it signifies that the question B implies the question C, or B → C. The practical interpretation is that if B is answered, then C is also answered. Alternatively, one can ask for the information that the questions B and C convey in common. This is described by the question B ∨ C. Because both questions yield the color of the card, one can see that B ∨ C = C. This condition also signifies the logical implication of C by B and is logically equivalent to the condition B ∧ C = B. One can derive one relation from the other. For instance, disjoining each side of B ∧ C = B with C gives (B ∧ C) ∨ C = B ∨ C; because (B ∧ C) ∨ C = (B ∨ C) ∧ (C ∨ C) = (B ∨ C) ∧ C = C (see Table I), we then obtain the expression B ∨ C = C. Logical identities such as those just described are listed in Table I for both questions and assertions. A one-to-one correspondence is seen between identities and properties in the logical domain of questions and the logical domain of assertions. For instance, either condition, b ∧ c = b or b ∨ c = c, describes the condition that assertion b implies assertion c, or b → c. This can be interpreted as the condition that "If c answers the question, then b also answers the same question." For example, one can ask F = "Is the plant just delivered a flower?" In this case, r → f, where r and f are the responses "It is a rose" and "It is a flower," respectively. We will discuss Table I in more detail in the next section.
Table I
List of Corresponding Logical Identities for Assertions and for Questions

Assertions                              Questions

a ∧ a = a                               A ∧ A = A
a ∨ a = a                               A ∨ A = A
a ∧ b = b ∧ a                           A ∧ B = B ∧ A
a ∨ b = b ∨ a                           A ∨ B = B ∨ A
(a ∧ b) ∧ c = a ∧ (b ∧ c)               (A ∧ B) ∧ C = A ∧ (B ∧ C)
(a ∨ b) ∨ c = a ∨ (b ∨ c)               (A ∨ B) ∨ C = A ∨ (B ∨ C)
(a ∧ b) ∨ c = (a ∨ c) ∧ (b ∨ c)         (A ∧ B) ∨ C = (A ∨ C) ∧ (B ∨ C)
(a ∨ b) ∧ c = (a ∧ c) ∨ (b ∧ c)         (A ∨ B) ∧ C = (A ∧ C) ∨ (B ∧ C)
(a ∧ b) ∨ b = b                         (A ∧ B) ∨ B = B
(a ∨ b) ∧ b = b                         (A ∨ B) ∧ B = B
(a ∧ ~a) ∧ b = a ∧ ~a                   (A ∧ ~A) ∧ B = A ∧ ~A
(a ∨ ~a) ∨ b = a ∨ ~a                   (A ∨ ~A) ∨ B = A ∨ ~A
(a ∧ ~a) ∨ b = b                        (A ∧ ~A) ∨ B = B
(a ∨ ~a) ∧ b = b                        (A ∨ ~A) ∧ B = B
~(a ∧ b) = ~a ∨ ~b                      ~(A ∧ B) = ~A ∨ ~B
~(a ∨ b) = ~a ∧ ~b                      ~(A ∨ B) = ~A ∧ ~B
~~a = a                                 ~~A = A
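The assertion-column identities in Table I can be checked mechanically by exhausting all truth assignments; the short script below (ours, purely illustrative) does exactly that. By the duality emphasized in the text, the question-column identities are read off from the same table with capital letters.

```python
from itertools import product

# Each lambda encodes one assertion-column identity from Table I and must
# return True for every truth assignment of a, b, and c.
identities = [
    lambda a, b, c: (a and a) == a,
    lambda a, b, c: (a or a) == a,
    lambda a, b, c: (a and b) == (b and a),
    lambda a, b, c: (a or b) == (b or a),
    lambda a, b, c: ((a and b) and c) == (a and (b and c)),
    lambda a, b, c: ((a or b) or c) == (a or (b or c)),
    lambda a, b, c: ((a and b) or c) == ((a or c) and (b or c)),
    lambda a, b, c: ((a or b) and c) == ((a and c) or (b and c)),
    lambda a, b, c: ((a and b) or b) == b,
    lambda a, b, c: ((a or b) and b) == b,
    lambda a, b, c: (not (a and b)) == ((not a) or (not b)),
    lambda a, b, c: (not (a or b)) == ((not a) and (not b)),
    lambda a, b, c: (not (not a)) == a,
]

for i, holds in enumerate(identities):
    assert all(holds(a, b, c)
               for a, b, c in product([False, True], repeat=3)), i
print("all Table I assertion identities verified")
```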
In discussing the implication of assertions, one replaces asks with tells and answered with answers. Logical implication is important in the consideration of ILUs and ILCs and plays a role analogous to that of IF, THEN, and OR ELSE logical constructs in deductive computers. In posing the question B ∨ C, an additional burden is placed on the interrogator to compute the question B ∨ C in advance of its posing. This case, therefore, has added complexity on the part of the observer relative to the posing of B ∧ C. In the latter case, the observer simply poses both B and C. The question B ∨ C, for instance, corresponds to the example in which the observer was looking for more than one apple and then sighted two and took explicit note of this fact. In that example, the question B = A1 ∨ A2 was effectively posed and the corresponding answers to A1 and A2 were compressed into one observation, that is, b = a1 ∧ a2. The duality between the properties of questions and the properties of assertions is intrinsic to the formulation given here. Acknowledging this duality to be the case can greatly facilitate understanding. This is one reason why we use the term assertion instead of proposition, because an assertion is more nearly the complement to the concept of a question. A question leads to the ingestion of information by an observer in the form of realized answers obtained through assertions presented to it that are subsequently transduced. Alternatively, an answer can lead to the transmission of information in the form of an assertion by a decision maker. This assertion can then be presented to another observer, thereby completing a full information transaction. Such information transactions represent quantized steps of causal information flow. The main purpose of this discussion is to demonstrate that assertions and questions each represent logical entities and that both are needed to support information flow. Figure 1 explicitly shows assertions as comprising external inputs xi and a generated output y. The presence of questions as logical entities in Fig. 1 is implicit. A question Xi is associated with each assertion xi. These questions must be present and associated with each input. Regarding the output, at least one other logical device (not shown) must exist with the potential for posing the real question Y and which therefore can receive the assertion y through its posing. If this were not so, the transmission of information from Q could not occur. Furthermore, through the assertions of other logical devices (also not shown), Q can receive inputs through the questions X1, X2, ..., Xn that it poses. If this were not so, the transduction of information would be interrupted. The symmetry and duality between transduction and transmission are perfect. Answers are obtained on the basis of questions posed in response to presented assertions. Assertions are made on the basis of answers to questions that were posed and are presented to other questions. Again, the causal element of time must be acknowledged. In this chapter we are interested in developing a methodology for the design of a device that operates inductively. This device will use assertions and questions as a substrate. Alone, however, these logical constructs can only support the design
of devices that operate deductively. Further insight into the functional nature of an ILC is provided by the fact that any computer, whether deductive or inductive, has a motive or computational objective that defines its operating principles, its structure, and the intrinsic nature of its computation. Let us hypothesize that the motive of the logical device shown in Fig. 1 is to optimize its information throughput. Previously, the notion of information throughput was factored into two components, transduction and transmission, because causal information flow requires both. From Fig. 1, one can see that the device Q, having multiple inputs and a single output, is "lossy" in that another ILU that can observe y cannot exactly reconstruct the inputs x = (x1, x2, ..., xn) to the first ILU that induced y. The ILU is motivated to simultaneously minimize the loss of information through transduction while maximizing the quantity of information conveyed through transmission. Let us now define a logical architecture for the device Q shown in Fig. 1 that can support the computational goal of maximizing information throughput. To be consistent with Fig. 1 and with previous discussions, we specify that the architecture must independently pursue two computational objectives: (i) optimizing the transduction of information from the vector-valued input x to the output y, thereby minimizing information loss through the ILU, and (ii) maximizing the information transmitted from the output of the device to any subsequent logical device to which it is physically connected and hence can observe it. Furthermore, in seeking to fulfill its computational objectives, the assumption is made that the logical device can only perform computation based on observable information. One can ask no more. This may seem an obvious requirement, but in fact it is necessary to quantify and practically realize the ILU. Objectives (i) and (ii) are still not specific enough to quantify the architecture and other functional properties of the ILU. To do so, we must introduce the formalized concepts of an issue and a premise. An issue in general represents a problem to be resolved through the posing of questions. It might be thought of as a question defined by the conjunction of a list of more elemental questions. If an observer can somehow ask an exhaustive set of questions relative to a specific issue, then the issue can be resolved. Alternatively, a premise represents the possible decisions that can be made by a decision maker. Each possible decision is an assertion that may be generated. The premise consists of the disjunction of all possible assertions relative to the decision maker. With complete knowledge, one of the assertions relative to the subject premise can be asserted with total confidence. Suppose the issue of the ILU is the determination of all of the things it can observe. Furthermore, in addition to its inputs xi, suppose the ILU can also observe its own output y. Denote the issue of the ILU by A = X1 ∧ X2 ∧ ··· ∧ Xn ∧ Y, that is,
the conjunction of all of the questions that the ILU can ask. Let us abbreviate X = X1 ∧ X2 ∧ ··· ∧ Xn, so that A = X ∧ Y. Alternatively, the premise of the ILU is a = x1 ∨ x2 ∨ ··· ∨ xn ∨ y, which delineates all possible states of nature or possible decisions that the ILU can make regarding these inferred states of nature. Both the issue A and the premise a belong, in a very real sense, to the ILU and, in fact, are what intrinsically define it. Our next step is to ratiocinate, that is, use as precise a logical argument as possible without resorting to mathematical proof, to propose that optimized transduction corresponds to maximizing the information common to the ILU input X and the ILU output Y relative to the issue A. Alternatively, optimized transmission corresponds to maximizing the information content of the output Y relative to the issue A. The first computational goal is to optimize the bearing of the question common to X and Y on the issue A. Practically speaking, we want to determine the best question the ILU can ask of X that will maximize the ability of another observer to reconstruct the input, given that the other observer can only ask Y. Like probability, bearing is a numerical measure or quantity. Bearing is denoted by b(B|A), where the notation b(B|A) denotes the bearing of the question B on the issue A. The second computational objective of the ILU is to maximize the bearing of the question Y on the issue A, which is denoted by b(Y|A). The symbol "b" is used because it is the mirror image of the symbol "p" used for probability, and, as will be seen, bearing and probability are logically complementary functions. This complementary relationship holds in the sense that bearing is a numerical measure of the degree to which the subject question resolves the prevailing issue. Alternatively, probability is a numerical measure of the degree to which generated decisions are inferable from the subject premise and the prevailing state of knowledge. The first computational objective of the ILU seeks to optimize the transduction of information by the ILU. We use the term transduction to mean the process by which presented assertions are transformed through the process of measurement or observation into answers resident within the measurement device. By optimizing the transduction of information, it is the intent of the device to minimize information loss by making the best measurements. In minimizing information loss, we mean that, based on measured quantities, we can optimally reconstruct the original input that gave rise to the measured quantities. That is, the original input cannot explicitly be recorded by a second device, but its transduced influence can. Therefore, the ILU can, through its issue A and the measure of transduction quality given by b(X ∨ Y|A), attempt to perform this optimization through the modification of the question or questions posed by the ILU. The common question X ∨ Y provides information common to both input and output, but only relative to the issue A, that is, the complete list of possible questions that the ILU can pose or the total state of ignorance of the ILU. Because Y provides much less information than X, the impetus is on
determining the measurement that will maximize the information extracted from X that is common to Y. In optimizing the transmission of information, we mean that another device having the capacity for observing the outputs of the subject ILU can maximize the information received by posing the question Y. The transmission of information deals with the exchange of information between devices, whereas the transduction of information deals with the processing of information within a device. Another device that can optimally receive the output y of the subject ILU must be able to ask the specific question Y. In maximizing the totality of forward information flow, the second observer must have the same issue A that the source ILU has (that is, exactly the same level of ignorance) and have as one of its computational goals the determination of the original input x to the first ILU. By having the same issue A, the second observer would essentially be approximating the optimized transduction process implemented within the first device. An engineer designing a communications channel implicitly incorporates the assumption of a shared issue when matching a receiver to the possible codes generated by the intended source. Ideally, the first stage of information transfer, described by X ∨ Y, should be optimally matched to the second stage of information transfer, Y. This strategy would support minimizing overall information loss, or equivalently, maximizing information throughput or flow. However, the determination of the input x by a second device is, in general, impossible because a second, physically separate device cannot even know the inputs to the first ILU, as this information cannot be transmitted. Hence it is up to the first device to also ask Y, thereby emulating the second receiver by having access to the same logical issue A implicit in the transduction process. Because the issue A of the subject ILU cannot be transmitted without incurring additional inefficiency or information loss, the best that can be done is for the source ILU to attempt to optimize itself as a source of information. This can be achieved if the source ILU can measure its own output and hence have the ability to ask Y, as might any subsequent ILUs to which it may be connected. More specifically, the ILU must try to maximize the bearing of the question Y on the issue A, for in doing so it is attempting to optimize its efficiency as an information source by incorporating aspects of any other hypothetical receiver, thereby supporting the local evaluation of transmission quality. Therefore, the ILU must use b(Y|A) as a numerical measure of the amount of information that is transmitted while realizing that it is itself receiving this same information, but in a feedback fashion. Because the ILU can ask Y, it can measure all of the physical quantities necessary for maximizing the bearing of the question Y on the issue A and hence hypothetically support the reconstruction of inputs arising from other physically separate devices. So what is this measure called bearing? In attempting to resolve an issue, one must obtain answers to a thorough list of posed questions. If the list of questions
is exhaustive on the subject issue, then the issue will be resolved. As mentioned previously, the game of 20 questions is a good example of the concept of bearing. In this game, one player thinks of something (just about anything). The issue of the other player is to resolve what this thing is. To do so, the second player poses up to 20 "yes or no" questions to the first player. A good player poses questions with significant bearing on his issue. This issue evolves with each posed question and each new piece of information obtained. The objective of the game is to resolve the issue by posing a minimum number of questions. For instance, knowing that the thing is a person, a good question would be M = "Is the person male?" because this has considerable bearing on the prevailing issue of which particular person is being thought of. This intuitive notion of bearing is quantified in the next section. Again, the computational goals of the ILU require that its architecture lend itself to the numerical computation of both b(Y|A) and b(X ∨ Y|A). The bearing b(Y|A), or some monotonic function of b(Y|A), must be computed to maximize transmission from the subject ILU to any subsequent ILUs to which it is connected. The bearing b(X ∨ Y|A), or some monotonic function of b(X ∨ Y|A), must be computed by the same ILU to maximize transduction between input and output. In both instances, logic requires that computation be performed on measurable quantities relative to the issue of the ILU. To reiterate, the issue A of the ILU is the conjunction of "What is my input?" = X, where X = X₁ ∧ X₂ ∧ ⋯ ∧ Xₙ, and the output question, "What is my output?" = Y. The Xᵢ represent the individual questions that correspond to "What is my ith input?" All of this is a bit terse. A great depth of logical formalism underpins the described logical structure. Fortunately, and as it should be, the basic ideas are all logical and become intuitively obvious once recognized. Upon accepting the foregoing computational objectives, all that remains to computationally realize the ILU is to specify a quantitative measure for bearing that is useful for inquiry and measurement. In turn, we must specify a complementary measure for assertions useful for guiding decision making and inference. This second measure corresponds to probability.
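The 20-questions intuition can be made concrete with a few lines of code. The following is a minimal sketch, assuming the issue is uniform over N candidate persons and treating bearing as entropy normalized by the total ignorance K = H(A), as Section IV will justify; the function name and the choice N = 64 are ours, for illustration only.

```python
import math

# A toy "20 questions" issue: which of N equally likely persons is being
# thought of?  Resolving the issue requires log2(N) bits of questioning.
N = 64
K = math.log2(N)                      # total ignorance H(A), in bits

def bearing(p_yes: float) -> float:
    """Normalized bearing of a yes/no question whose answer is 'yes'
    with probability p_yes: its information gain divided by K."""
    if p_yes in (0.0, 1.0):
        return 0.0                    # a question with a certain answer asks nothing
    h = -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))
    return h / K

# "Is the person male?" splits the candidates roughly in half: ~1 bit gained,
# i.e., 1/6 of the issue resolved when N = 64.
print(bearing(0.5))
# A lopsided question ("Is it this one specific person?") has far less bearing.
print(bearing(1 / N))
```

The even split maximizes the bearing of a single yes/no question, which is why good players ask questions that divide the remaining possibilities in half.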
III. LOGIC, PROBABILITY, AND BEARING

The stated motive of the ILU is to maximize the forward flow of information. As discussed in the previous section, this can be done by simultaneously maximizing both b(Y|A) and b(X ∨ Y|A), where b(Y|A) describes the transmitted component and b(X ∨ Y|A) describes the transduced component of information flow. Both b(Y|A) and b(X ∨ Y|A) are computed by the ILU in response to having observed a sequence of L vector-valued inputs {x₁, x₂, …, x_L}, where each xᵢ = x_{i1} ∧ x_{i2} ∧ ⋯ ∧ x_{in} for i = 1, 2, …, L, in addition to the ILU having
observed its own corresponding output decisions {y₁, y₂, …, y_L}. This adaptation, which may formally be associated with learning, is assumed to take place over a long time scale relative to the one associated with decision making. On the shorter decision-making time scale, the ILU generates decisions y through inductive inference. Inductive logic demands that the ILU use probability, or a monotonic function thereof, to perform decision making. Regarding inference, the ILU must use the conditional probability p(y|x ∧ a) for decision making. The premise consists of the available information given by the conjunction of x = x₁ ∧ x₂ ∧ ⋯ ∧ xₙ with a, where a is the basic premise that delineates precisely everything that can be decided. The total ILU knowledge state is given by x ∧ a and represents the total information that the ILU possesses for making its decision y at each decision point in time. Logically, p(y|x ∧ a) is the degree to which the prevailing knowledge x ∧ a implies the decision y. Therefore, p(y|x ∧ a) provides a true generalization of deductive decision making. As might be surmised, relationships between questions and assertions are logical as well. As described earlier, any question may be thought of as being defined by the set of possible answers that may arise through its posing in response to presented assertions. Alternatively, an assertion is defined by the set of possible questions that are answered through its presentation. These respective properties of questions and assertions are necessary for causal information flow. Generated assertions arise from an overt decision by the ILU. It is important to realize that assertions are distinct from answers. Answers arise from the presentation of an assertion to a posed question and are internal to the observer-participant entity described by the ILU. Alternatively, assertions only exist between physical devices. Although this discussion may be intuitively appealing, its value can only be realized through a quantification of exactly what is meant by questions, assertions, issues, and premises. These concepts, along with probability and bearing, have so far been described in a tutorial fashion. We will now describe them in more depth and characterize probability and bearing as real-valued measures of Boolean functions of questions, assertions, issues, and premises. In fact, bearing and probability are constrained to have specific functional forms as induced by the extended logic of inference and inquiry described here and by Cox [6]. The general concept of probability is likely already familiar, although probability may be better viewed as a measure of degree of distinguishability, to dissociate it from its subjective interpretation. The concept of bearing will correspond to entropy, which, although traditionally considered a measure of uncertainty, should be viewed as a measure of the degree to which a question resolves an issue. Logically, entropy is a unique measure of the degree to which one question implies another. We will now quickly review the logical structure of questions and assertions, using Table I. These properties may be thought of as arising from standard Boolean algebra. However, it is more instructive to understand the physical as well as the logical interpretation of each property. One can do this by mentally
using the word with to describe disjunction "∨" and the word coincidence to describe conjunction "∧" when these operators are used with logical assertions. Alternatively, one can use the word common to describe "∨" and the word joint to describe "∧" when they are used with logical questions. Some examples have already been given. One can construct physical examples of each assertive property. For instance, the property (a ∧ b) ∨ b = b can be interpreted as follows: if an observer B is looking for an assertion b and a ∧ b is asserted to B, then B will only observe b, because it cannot distinguish a ∧ b from b. Consider, for example, an observer B looking for b = "bananas." If the observer spots a bowl of bananas and apples, and we denote a = "apples," then this observer will see only b. The assertion b dominates the assertion a ∧ b relative to B as defined. Otherwise stated, the assertion a ∧ b implies the assertion b relative to the question B. The left side of Table I delineates a set of assertive properties that are in one-to-one correspondence with the interrogative ones on the right. For questions, remember to interpret the disjunctive operator "∨" as common, so that, for example, the common question A ∨ B is the question common to A and B, which is only answered by answers that answer both A and B. Therefore, the logical construct A ∨ B provides a computational substrate for an observer configured to perform coincidence detection, that is, the detection of the conjunctive event a ∧ b. Similarly, the joint question A ∧ B provides a computational substrate for an observer configured to perform a measurement of two physically disparate quantities and yields the response a ∨ b, which is the coexistent answer provided through responses to the individual questions A and B. The question A ∨ ~A asks nothing relative to what might be asked by any observer B, because (A ∨ ~A) ∨ B = A ∨ ~A, whereas A ∧ ~A asks everything relative to what might be asked by any observer B, because (A ∧ ~A) ∨ B = B. The term "relative" must be included because any physically separate observer must ask a question that is distinct from a question posed by any other observer. The property (A ∧ B) ∨ B = B in Table I says that asking the question common to B and the question asked jointly by A and B is equivalent to just asking B. For example, asking A = "Are you my sister?" and then B = "Are you female?" yields the logical relationship A ∧ B = A, because obtaining a true answer to A will always yield a true answer to B. Then, because "Are you female?" is the common question corresponding to A ∨ B, it is true that A ∨ B = B. Therefore, it is true that (A ∧ B) ∨ B = B. This is also an example of logical implication, as indicated by either of the conditions A ∨ B = B or A ∧ B = A. So, like assertions, questions can be treated as Boolean entities. In fact, questions and assertions only exist in mutual relation to one another. Within the logical domain of assertions, true and false only exist in mutual relation to one another, as do real and vain questions in the logical domain of questions.
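Because these assertive properties are ordinary Boolean identities, they can be checked exhaustively by machine. The following is a minimal truth-table sketch, modeling assertions as Python Booleans with "or" for ∨ and "and" for ∧; it verifies the absorption property and the equivalence of the two forms of implication cited above.

```python
from itertools import product

# Exhaustive truth-table check of two assertive properties used above:
# absorption, (a ∧ b) ∨ b = b, and the equivalence of the two forms of
# implication, a → b  iff  a ∧ b = a  iff  a ∨ b = b.
for a, b in product([False, True], repeat=2):
    assert ((a and b) or b) == b                  # absorption
    assert ((a and b) == a) == ((a or b) == b)    # both implication forms agree

print("Table I assertive properties verified over all truth assignments.")
```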
A question is vain if it cannot be answered by a true assertion; a question is real if it can be answered by a true assertion. An assertion is false if it answers a vain question and true if it answers a real question. A question may change from vain to real and back again over the course of time. For instance, asking what will happen tomorrow is vain until tomorrow comes, at which point the question becomes real. Answers received today regarding this question are false. The answers that will arise tomorrow can be true. Time itself plays the role of a synchronizing agent within the process of information flow and is itself a manifestation of it. Perhaps the most important logical property of both questions and assertions is that of logical implication. If an assertion a implies another b, then, whenever b answers a question, a also answers the same question (both will give a true answer to the subject real question). For instance, if we ask the question, "What color is the apple?" the answer "dark red" implies that the answer "red" will also answer the question. We have denoted the logical implication of assertions by a → b, which formally corresponds to either of the two logical identities a ∧ b = a or a ∨ b = b. The two are equivalent and each can be obtained from the other; for example, a ∧ b = a can be disjoined on each side by b to obtain a ∨ b = b, in much the same way as discussed earlier regarding the logical implication of questions. Therefore, implication represents a possible logical relationship that may exist between two assertions. It also describes, in a complementary fashion, a possible logical relationship between two questions. If question A implies B, then question B is answered if A is answered. We already saw an example of the implication of questions, where B = "Are you female?" and A = "Are you my sister?" Similarly, for assertions, if a implies b, then any question answered by b is also answered by a. Here the statement a = "She is my sister" and the statement b = "She is a female" both answer the question B, and hence a ∧ b = a relative to the question B. Alternatively, if questions A and B are both answered by the statement a, then A ∨ B = B relative to the assertion a. This example demonstrates the symmetry that exists within and between the logical domain of assertions and the logical domain of questions. It is instructive to work through several such practical exercises to firmly understand these interrelationships. Logical implication forms the basis for deductive logic. For instance, Sherlock Holmes would use a train of logical implication to deduce a decision. Such logical constructs are easily realized in a digital computer and have formed the traditional basis for artificial intelligence through the IF, THEN, and OR ELSE logical constructs discussed earlier. However, deductive logic demands complete knowledge to complete a logical train of inference. In real life, we rarely have such complete knowledge and are typically forced to make rational guesses. So if logical implication forms the basis of deductive logic, what then should form the basis for inductive logic? Richard Cox [2, 4] has shown that, for reasons of logical consistency, inductive logic must, in the limit of complete knowledge, tend to the deductive reasoning process. In addition, decisions reached inductively must be based on probability. More specifically, probability is a real-valued
measure of the degree to which an implicant a implies an implicate b. Therefore, the degree of implication in this case is p(b|a), which can be read either as "the probability of b given the premise a" or as "the logical degree to which a → b." Furthermore, the properties of any objective probability that can be derived must be consistent with every property listed on the left side of Table I. An objective derivation of probability [2, 4, 5] has shown it to be the same unique function with the properties that we ascribe to conventional probability. The implication of questions, like the implication of assertions, has a unique and complementary measure in an inductive framework. Just as probability is the unique measure describing the extent to which a prevailing state of knowledge implies each candidate decision, another measure, which we call bearing, describes the extent to which a question B resolves the issue A of an observer having a prevailing state of ignorance. This has been formally expressed as b(B|A). We will see that the measures p(b|a) and b(B|A) together are adequate to derive and describe the operation of the ILU. Consistent with Fig. 1, the ILU will operate in such a manner as to maximize its information throughput. The ILU will do so by giving the best decisions via inference, while asking the best questions through interrogation to efficiently resolve its issue. As discussed, one can show that any "degree of implication" of one assertion by another must formally and uniquely correspond to probability. Alternatively, any such degree of implication of one question by another must formally and uniquely correspond to bearing [6]. The reader is undoubtedly familiar with the properties of probability.
Table II. Some Selected Properties of Probability as a Unique Measure of Degree of Implication of One Assertion by Another

p(b ∧ c|a) = p(b|a) p(c|a ∧ b) = p(c|a) p(b|a ∧ c)   [Conjunctive rule (Bayes' theorem)]
p(b|a) + p(~b|a) = 1   [The premise is a = b ∨ ~b]
p(b ∨ c|a) = p(b|a) + p(c|a) − p(b ∧ c|a)   [Disjunctive rule]
Σ_{i=1}^{n} p(bᵢ|a) = 1   [The premise is a = b₁ ∨ b₂ ∨ ⋯ ∨ bₙ, where the bᵢ are exhaustive and mutually exclusive, i.e., ⋁_{i=1}^{n} bᵢ is true while bᵢ ∧ bⱼ is false for all i and j with i ≠ j, relative to the premise a]
p(b ∧ c|a) + p(b ∧ ~c|a) = p(b|a)   [Partial sum decomposition rule]
Table III. Some Selected Properties of Bearing as a Unique Measure of Degree of Implication of One Question by Another

b(B ∨ C|A) = b(B|A) b(C|A ∨ B) = b(C|A) b(B|A ∨ C)   [Disjunctive rule (Bayes' theorem)]
b(B|A) + b(~B|A) = 1   [The issue is A = B ∧ ~B]
b(B ∧ C|A) = b(B|A) + b(C|A) − b(B ∨ C|A)   [Conjunctive rule]
Σ_{i=1}^{n} b(Bᵢ|A) = 1   [The issue is A = B₁ ∧ B₂ ∧ ⋯ ∧ Bₙ, where the Bᵢ are exhaustive and mutually exclusive, i.e., ⋀_{i=1}^{n} Bᵢ resolves A, whereas Bᵢ ∨ Bⱼ asks nothing for all i and j with i ≠ j, relative to the issue A]
b(B ∨ C|A) + b(B ∨ ~C|A) = b(B|A)   [Partial sum decomposition rule]
The properties of bearing are likely less familiar, although they have a one-to-one correspondence with those of probability, as well as a one-to-one relationship with each property listed on the right side of Table I. Some of the more important properties of bearing will be discussed in detail in the next section. Tables II and III summarize this section in terms of some selected properties of probability and bearing, respectively. Table I forms the basis for deductive computation; Tables II and III form the computational basis for a computer that operates inductively. Table II lists a few familiar properties of probability. A one-to-one correspondence can be seen between the properties in Table II for assertions and probability and those listed in Table III for questions and bearing. Transformation between these two tables (and between the left and right sides of Table I) can be found by exchanging assertions and questions and by exchanging disjunction and conjunction. Cox [6] simply calls this process transformation, which is appropriate, because one can hardly envision a more basic transformation than the one that exists between questions and assertions. A numerical check of the Table II rules appears in the sketch below.
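The following is a minimal sketch, assuming an arbitrary joint distribution over two binary assertions b and c on a premise a; the variable names are ours, and the random weights merely stand in for whatever distribution a device might hold.

```python
import random

# Minimal numerical check of the Table II rules for an arbitrary joint
# distribution over two binary assertions b and c on a premise a.
random.seed(0)
w = [random.random() for _ in range(4)]
total = sum(w)
p = {(b, c): w[2 * b + c] / total for b in (0, 1) for c in (0, 1)}  # p(b ∧ c|a)

def pb(b): return p[(b, 0)] + p[(b, 1)]          # marginal p(b|a)
def pc(c): return p[(0, c)] + p[(1, c)]          # marginal p(c|a)

# Sum rule: p(b|a) + p(~b|a) = 1
assert abs(pb(1) + pb(0) - 1) < 1e-12
# Partial sum decomposition: p(b ∧ c|a) + p(b ∧ ~c|a) = p(b|a)
assert abs(p[(1, 1)] + p[(1, 0)] - pb(1)) < 1e-12
# Disjunctive rule: p(b ∨ c|a) = p(b|a) + p(c|a) − p(b ∧ c|a)
assert abs((1 - p[(0, 0)]) - (pb(1) + pc(1) - p[(1, 1)])) < 1e-12
# Conjunctive rule (Bayes): p(c|a ∧ b) is defined as p(b ∧ c|a)/p(b|a),
# so p(b ∧ c|a) = p(b|a) p(c|a ∧ b) holds by construction.
print("Table II rules verified.")
```

The Table III analogs follow by Cox's transformation; their entropic form is checked numerically in Section III, after bearing is equated with entropy.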
IV. PRINCIPLE OF MAXIMIZED BEARING AND ILU ARCHITECTURE

Logical consistency dictates that probability and bearing are unique measures of the "degree of implication of assertions" and the "degree of implication of questions," respectively. In this section we will specify an exact computational form for bearing and then use it to obtain a functional architecture for the ILU.
We assume that the ILU has n inputs x = (x₁, x₂, x₃, …, xₙ) or x = x₁ ∨ x₂ ∨ ⋯ ∨ xₙ, indicating that the presented input is x₁ with x₂ with ⋯ with xₙ. The single ILU output is y. The inputs x represent that which can be presented to the ILU, as opposed to that which has been measured. The xᵢ can arise from any source insofar as the ILU is concerned, because knowing such information would admit the availability of additional information not already accounted for. The source of information is unimportant to the ILU, because it can only observe x and not its source. Because the ILU can observe x, it can by implication "ask" the question "What is my input x?", which is denoted here by the question X = X₁ ∧ X₂ ∧ X₃ ∧ ⋯ ∧ Xₙ, where Xᵢ corresponds to the question "What is my input xᵢ?" The question X can literally be interpreted as the question "What is my input x₁ and what is my input x₂ and ⋯ and what is my input xₙ?", the totality of which conveys all that is presentable to the ILU, except for admitting the presentability of its own output y through the posing of the question Y. The question Y must exist, or be a real question, because if it did not, we would not need to consider the ILU further: it would be a closed physical system and hence be unobservable to any other system. In having the ability to ask Y, the ILU can ask the eigen-question that corresponds to that which the ILU itself asserts. Subsequent to the presentation of the disjunctive form of x to the question X, the ILU obtains a measurement or answer x′ = x₁ ∧ x₂ ∧ x₃ ∧ ⋯ ∧ xₙ, that is, a binary-encoded word. The bits in this word can then be used for subsequent computation by the ILU, because at this point the measurement has been made. Before making this measurement, x has a disjunctive form that delineates what can be presented to the ILU, where the disjunctive operator signifies the spatial superposition of each element of x. The measurement process transforms x from the disjunctive to the conjunctive form by encoding and recording the presented assertions. The measurement process itself can be formally expressed as a logical transformation, but this is beyond the scope of our discussion. The asserted output y arises through inductive inference and logical decision making on the part of the ILU, based on answers obtained through posing X. Therefore, a causal relationship must exist between x and y that is intrinsic to the notion of an open system as embodied by the ILU. Forward flow of information demands that another entity, perhaps another ILU, be able to ask Y, thereby making it real. We assume that the ILU can also ask the question Y, that is, the ILU can observe its own output y in addition to asking what its input x consists of. Fundamentally, this means that the potential exists for the ILU to modify its functional behavior through the feedback of its own generated outputs. With the additional assumption that the ILU can observe its own output, the issue of the ILU consists of the most thorough question that the ILU can resolve. Specifically, this issue is A = X ∧ Y, which can literally be translated (relative to the ILU) as "What are all my inputs x, and what output y have I generated
in response to having observed these inputs?" We reemphasize that the ILU can observe x = x₁ ∧ x₂ ∧ ⋯ ∧ xₙ in response to asking X = X₁ ∧ X₂ ∧ X₃ ∧ ⋯ ∧ Xₙ. That is, while the ILU asks "X₁ and X₂ and ⋯ Xₙ" and is presented with "x₁ with x₂ with ⋯ with xₙ," it then observes "x₁ and x₂ and ⋯ and xₙ." In having A as its issue, the ILU is asking n + 1 elementary questions. All of these questions must be asked by the ILU to resolve its issue A. Implicitly, the ILU must also provide an assertion y to Y, because this question is only real if the ILU can participate, in response to the observed presentations x, in answering the question regarding its own output. Because the ILU can ask Y, it contains a computational mechanism whereby it can change what it asks of X while simultaneously changing how it decides. Together, these adaptations provide for learning by the ILU and overall modulation of ILU information throughput. Learning characterizes an organizational structure that can, in response to all observable information, modify how it generates assertions to questions it has posed and how it adapts questions in response to assertions it has given. In addition to having the issue A, the ILU also has the premise a regarding the possible inputs it can be presented with and the possible outputs it can generate. The premise a of the ILU is directly obtained from its issue A through the process of transformation. The issue is A = X₁ ∧ X₂ ∧ X₃ ∧ ⋯ ∧ Xₙ ∧ Y, which, through the process of transformation, yields the premise a = x₁ ∨ x₂ ∨ ⋯ ∨ xₙ ∨ y. The premise a represents every possible combination of things the ILU can anticipate as possible states of nature. Conversely, the issue A represents all things that the ILU can resolve based on what has been presented to it. Because a question is defined by the set of possible answers that can be obtained through its posing, it can be seen that the ILU can pose n + 1 elementary questions as represented by the issue A and obtain 2^{n+1} possible answers as delineated by the premise a. The domain of questions described by A is a logarithmic domain relative to the domain of assertions described by a, in the sense that |A| = log₂|a|, where |·| denotes the number of elements comprising either an issue or a premise. This numerical property is a fundamentally important relationship between a system of questions and its corresponding system of assertions, and it is the underlying reason why probability and bearing have the exponential and logarithmic forms to be described. In summary, we have characterized a logical structure for the ILU which corresponds to that of Q as shown in Fig. 1, with one important exception: the ILU can observe its own output y in addition to its external inputs xᵢ. With the general logical structure of the ILU and its computational objectives in place, we can finally proceed with its design. The computational objectives will serve to drive the ILU to maximize the total flux of information associated with and under its control. In particular, we require that the ILU pursue the strategy suggested earlier, and
stated here more quantitatively. In particular, the ILU should (1) optimize the transduction of information by maximizing the bearing of the question X ∨ Y on the issue A, that is, the measure b(X ∨ Y|A), or a monotonic function thereof, and (2) independently of (1), optimize its transmission of information to other logical devices that can pose Y; this should be accomplished by maximizing the bearing of Y on the issue A, that is, b(Y|A) or a monotonic function thereof. Remember that bearing addresses how well the argument question resolves the relevant issue given the prevailing state of ignorance, whereas probability measures the extent to which the premise and prevailing state of knowledge support inference. The ILU defines both its issue A and its premise a; logically, it is more proper to say that the issue and the corresponding premise define the ILU. From the partial sum decomposition rule in Table III, it can be seen that b(Y ∨ ~X|A) = b(Y|A) − b(X ∨ Y|A), and if b(Y|A) and b(X ∨ Y|A) can independently be maximized and have the same numerical maximum, then one is tempted simply to minimize b(Y ∨ ~X|A), provided that the global minimum occurs at a common global maximum of b(Y|A) and b(X ∨ Y|A). Minimizing b(Y ∨ ~X|A) can be accomplished by a simpler mathematical derivation than one that chooses to separately maximize b(Y|A) and b(X ∨ Y|A). However, such an approach does not yield the degree of insight given by the individual optimizations of b(Y|A) and b(X ∨ Y|A). Before proceeding, a more detailed consideration of complementary questions can provide additional insight into the logic of questions and assertions described here. Let A correspond to the issue of which room in your home contains the car keys that you lost. Let X represent things you can ask to see within the room you are in. Then let the question ~X represent things you can ask to see in all other rooms of your house, except for the room you are in. Thus, while standing in the subject room, ~X is a vain question and can yield no directly observable result until you go to another room. Even then, only a partial answer to ~X can be obtained. You must travel, in succession, to every other room in the house to resolve the issue A by asking ~X. Thus one can view the logical property which states Y = (X ∨ Y) ∧ (~X ∨ Y) as equivalent to saying that the information obtained by asking Y is the same as the information obtained by asking the two questions X ∨ Y jointly with ~X ∨ Y. The question X ∧ ~X asks everything relative to the ILU because X ∧ ~X will always resolve any issue, including A. One can equivalently say that b(X ∧ ~X|A) = 1, because logically X ∧ ~X → A. Upon acceptance of goals (1) and (2) in this section as comprising the computational objectives of the ILU, all that is necessary to derive its architecture is to specify a functional form for bearing. We claimed earlier that bearing is
constrained by logical consistency to have properties that mirror those of probability as described in Table II. However, unlike probability, we have no definitive mathematical form for bearing. It is stated, but not proved, that bearing is really a normalized function of the information-theoretic measure entropy, denoted by the function H. The entropy of a random variable X, denoted H(X), is interpreted as being b(X|A), that is, the bearing of the question X on the issue A, where the scale factor K necessary to normalize H(X) to unity has been omitted. More generally, K = H(A), which is the entropy of the issue of the system/observer and which, using our notation, is simply b(A|A). Just as probability is normalized such that the probabilities over a set of mutually exclusive and exhaustive assertions of an assertor must sum to 1, the bearings over a set of mutually exclusive and exhaustive questions of an inquirer must sum to 1. H(A) provides the normalization factor K because the observer can, at most, have a total ignorance of H(A), as this quantity corresponds to the average number of questions the observer would have to ask to resolve its issue A. The bearing of an issue on the same issue is 1, that is, b(A|A) = 1. The probability of a premise on the same premise is also 1, that is, p(a|a) = 1. In the first instance, in asking all questions relevant to an issue, the issue will be resolved. In the second instance, having all knowledge enables the decision maker to make a certain inference regarding which assertion, from among all possible assertions on the subject premise, is true. Although H(X)/K and b(X|A) may have the same numerical value, it is imperative that we understand the conceptual difference in their interpretations. Regarding the function H, entropy is typically construed to provide a measure of uncertainty. However, uncertainty is a relative term; one must specify the uncertainty with respect to something or someone, that is, it must be referenced to a prevailing state of ignorance. Obviously, as one gains information, uncertainty can be reduced from an initial value, but this reduction occurs with respect to the prevailing issue of the observer. The observer and associated issue are implicit in an entropic formulation and are systematically dismissed in information theory, just as the premise of a probability measure is typically dismissed or subsumed in what might be considered the more important aspects of a problem. However, the conditioning of bearing on the underlying issue makes explicit the fact that there is a specific issue at hand that is to be resolved by the relevant observer, who is made part of the problem formulation. A key point in this discussion is that bearing and probability are both properties of an open system. Consider a glass of water that is only partially full. The molecules of water in the glass might be thought of as knowledge from which inferences can be made using probability. The empty portion of the glass, if filled, would provide a deductive basis for decision making and represents ignorance. Even though ignorance exists, this amount of ignorance is recognized and can
be quantified by entropy or bearing. Bearing can dictate the succession of questions that can most quickly fill the glass until a complete knowledge state, relative to the volume of the glass, is attained. Lao Tsu, a philosopher contemporary with Confucius, perhaps said it best in the 11th verse of the Tao Te Ching [9]:

Thirty spokes share the wheel's hub;
It is the center hole that makes it useful.
Shape clay into a vessel;
It is the space within that makes it useful.
Cut doors and windows for a room;
It is the holes which make it useful.
Therefore profit comes from what is there;
Usefulness from what is not.
At this point, it is interesting to associate the concepts we have described with Shannon's measures of information, and with information theory in general, where entropy and mutual information are used extensively. We have equated entropy with bearing. However, we have suggested no analog to mutual information. In fact, there is no intrinsic difference between mutual information and entropy in the described formulation of bearing. Mutual information between two random variables, say X and Y, corresponds to the bearing of the common question X ∨ Y on an underlying issue. More precisely, the question common to X and Y is X ∨ Y, and so the mutual information I(X; Y) is proportional to b(X ∨ Y|A), with the same constant of proportionality K given earlier. From Table III it can be seen that b(Y ∨ ~X|A) = b(Y|A) − b(X ∨ Y|A). This relationship corresponds to the well-known information-theoretic identity for two random variables X and Y that states H(Y|X) = H(Y) − I(X; Y). The conditional entropy H(Y|X) reads as "the uncertainty of Y given X." In our context, H(Y|X) should be read as "the bearing of the common question Y ∨ ~X on the issue A" and corresponds to b(Y ∨ ~X|A). All of these notions can be understood in terms of the two diagrams shown in Fig. 2. The Venn diagram shown in Fig. 2a depicts logical relationships between probabilities of two assertions b and c on a premise a; Fig. 2b depicts logical relationships between bearings of two questions B and C on an issue A. Figure 2a shows regions within the Venn diagram that correspond to p(b ∧ c|a), p(b ∧ ~c|a), p(~b ∧ c|a), and p(b ∨ c|a). Alternatively, Fig. 2b shows regions in a complementary type of Venn diagram that correspond to the bearings b(B ∨ C|A), b(B ∨ ~C|A), b(~B ∨ C|A), and b(B ∧ C|A). In a way, Figs. 2a and 2b are mirror images of one another. The bearings b(B ∨ C|A), b(B ∨ ~C|A), b(~B ∨ C|A), and b(B ∧ C|A) correspond to the information-theoretic measures I(B; C), H(B|C), H(C|B), and H(B, C), respectively. These correspondences are easy to confirm numerically, as in the sketch below.
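The following is a minimal sketch, assuming an arbitrary joint distribution over the answers to two binary questions B and C; the factor K normalizes entropy into bearing as described earlier. It checks the Table III partial sum decomposition rule in its entropic form.

```python
import math, random

# Numerical confirmation of the Fig. 2b correspondences for an arbitrary
# joint distribution over the answers to two binary questions B and C.
random.seed(1)
w = [random.random() for _ in range(4)]
total = sum(w)
p = {(0, 0): w[0] / total, (0, 1): w[1] / total,
     (1, 0): w[2] / total, (1, 1): w[3] / total}

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

pB = [p[(0, 0)] + p[(0, 1)], p[(1, 0)] + p[(1, 1)]]
pC = [p[(0, 0)] + p[(1, 0)], p[(0, 1)] + p[(1, 1)]]
H_B, H_C, H_BC = H(pB), H(pC), H(list(p.values()))

I_BC = H_B + H_C - H_BC        # corresponds to b(B ∨ C|A), up to the factor K
H_B_given_C = H_BC - H_C       # corresponds to b(B ∨ ~C|A), up to the factor K

# Partial sum decomposition rule of Table III: b(B ∨ C|A) + b(B ∨ ~C|A) = b(B|A)
assert abs((I_BC + H_B_given_C) - H_B) < 1e-12
print(f"I(B;C) = {I_BC:.4f} bits; I(B;C) + H(B|C) = H(B) confirmed")
```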
Figure 2 Familiar Venn diagrams such as that shown in (a) really only make sense in the context of a dual diagram, such as that shown in (b). Diagram (a) describes standard propositional logic, which in the present context consists of logical assertions; (b) describes the corresponding interrogative logic and must coexist with the Venn diagram in (a). Probability represents a natural measure for the implication of assertions, whereas bearing (entropy) represents a natural measure for the implication of questions.
Information-theoretic notation explicitly suppresses the consideration of the underlying issue A. This issue must be present to make information-theoretic measures meaningful to the implicit observer. In information theory, B and C represent random variables, that is, set-theoretic quantities upon which an entropic measure has been defined. There is fundamentally no mathematical problem in using the logical methodology described here instead of a Kolmogorov-based approach, which deals with probability or entropy triples [10, 11] of the form {Ω, F, μ}, where Ω is the universal set, F is a finite Borel field, and μ is the established relevant measure on elements of this field. In Fig. 2a, the natural measure is probability. In Fig. 2b, the natural measure is entropy. We interpret B and C as questions upon which bearings are defined relative to an issue A. Although a Kolmogorov-based approach and the mathematical framework we propose lead to the same results, the conceptual difference
between the two should be understood because of the additional insights gained and the resulting mathematical courses upon which one may be led. The relationships depicted in Fig. 2b have been noted by other authors, including Papoulis [10] and Yeung [11]. Yeung calls Fig. 2b an I-diagram. Mutual information is not essentially different from any other entropic measure. More specifically, mutual information corresponds to the bearing of the information common to two questions on an underlying issue. Furthermore, one can consider and easily interpret the significance of bearing on many other logical combinations of questions that otherwise would defy meaningful interpretation in an information-theoretic context. For instance, b(B ∨ C ∨ D|A) has no information-theoretic analog, but in the present context it is an easily interpreted construct. Now returning to our issue, we want to consider how the ILU can implement its optimization strategy as described earlier by the simultaneous and independent maximization of both b(X ∨ Y|A) and b(Y|A). This presumes that the ILU can ask (X ∨ Y) ∧ Y relative to the issue A. We now know that bearing can be equated numerically to entropy or mutual information, depending on the logical argument of the bearing function. Furthermore, logic dictates that the ILU be capable of directly computing or estimating monotonic functions of b(X ∨ Y|A) and b(Y|A) in the execution of its optimization strategy. Knowing this and the mathematical form of entropy, we can write explicit formulas for both b(Y|A) and b(X ∨ Y|A). In particular,
b(Y|A) = −Σ_{y∈Y} p(y|a) ln p(y|a)   (1)

and

b(X ∨ Y|A) = Σ_{y∈Y} Σ_{x∈X} p(y ∧ x|a) ln [p(y ∧ x|a) / (p(x|a) p(y|a))],   (2)
respectively, where for brevity we have written x = x₁ ∧ x₂ ∧ ⋯ ∧ xₙ. The scale factor K needed to convert entropy into bearing has not been included, because it is a fixed constant and does not change our results. In Eqs. (1) and (2), the respective bearings are found by summing over all possible answers to the question Y and the question X. Because each of these answers is Boolean in nature, we can take answers as being either "0" or "1," that is, nothing or true, respectively. As discussed earlier, true and nothing represent the types of answers that can arise from a posed question in response to a presented assertion, as well as the two classes of answers that can arise from any physical question. Alternatively, vain and real describe the possible types of questions, contingent upon whether the relevant question can be answered. The sketch below evaluates both bearings for a toy joint distribution.
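The following is a direct, minimal evaluation of Eqs. (1) and (2) by summation, assuming a small ILU with n = 2 binary inputs; the dictionary p_joint stands in for p(x ∧ y|a), and its values here are arbitrary placeholders.

```python
import math
from itertools import product

# Direct evaluation of Eqs. (1) and (2) for a toy ILU with n = 2 binary
# inputs; p_joint stands in for p(x ∧ y|a), with arbitrary positive weights.
n = 2
states = [(x, y) for x in product((0, 1), repeat=n) for y in (0, 1)]
weights = {s: float(i + 1) for i, s in enumerate(states)}
Z = sum(weights.values())
p_joint = {s: w / Z for s, w in weights.items()}

p_y = {y: sum(v for (x, yy), v in p_joint.items() if yy == y) for y in (0, 1)}
p_x = {}
for (x, y), v in p_joint.items():
    p_x[x] = p_x.get(x, 0.0) + v

b_Y = -sum(q * math.log(q) for q in p_y.values() if q > 0)            # Eq. (1)
b_XvY = sum(v * math.log(v / (p_x[x] * p_y[y]))                       # Eq. (2)
            for (x, y), v in p_joint.items())
print(f"b(Y|A) = {b_Y:.4f} nats, b(X v Y|A) = {b_XvY:.4f} nats")
```

Note that Eq. (2) is exactly the mutual information between the input and output answers, consistent with the correspondence I(X; Y) = K·b(X ∨ Y|A) established in the previous section.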
Surprisingly, there is no reference to false answers, because such answers can only logically arise by posing vain questions, and none of the questions dealt with here are vain. The simultaneous and independent maximization of Eqs. (1) and (2) defines the computational objectives of the ILU. Therefore the ILU must be able to compute measures that are directly related to these equations. Furthermore, the ILU must have physical or, equivalently, logical access to the various probabilities upon which the two subject bearings are defined, through the total measurement (X ∨ Y) ∧ Y of which it is capable. The establishment of these probabilities appears to be the more fundamental problem of the ILU, because this measure is a prerequisite to computing the bearings b(Y|A) and b(X ∨ Y|A). We will address this problem first and see how the form of the distribution p(x ∧ y|a) dictates the general architecture of the ILU. As stated earlier, probabilities represent degrees of belief on the part of a decision-making device regarding the possible states of nature. Bearings represent degrees of resolution of a prevailing issue based on posable questions. Here the more basic problem consists of determining the form of the distribution p(x ∧ y|a) used by the ILU, because this distribution reflects the complete knowledge state of the device. The probability p(x ∧ y|a) can be used to perform inference regarding inputs, outputs, and conditional input and output distributions, as well as all bearings computable on logical combinations of questions posable by the ILU. Alternatively, entropy represents the state of ignorance of the observer; in the latter case, these bearings represent the state of ignorance of the observer regarding the argument questions. The premise a completely describes the ILU's state of knowledge regarding decisions that might be made. Initially, one might think that this distribution can have an infinite number of perhaps equally reasonable forms. In fact, it has only one. By independently maximizing the bearings b(Y|A) and b(X ∨ Y|A), the ILU is effectively asking, "What question Y can another observer ask that has the most bearing on my issue A?" and "What question or measurement X ∨ Y can I pose that has the most bearing on my issue A?" To answer these questions, the ILU must be capable of measuring two distinct quantities: (i) the bearing that information contained in its own output assertion y provides about A, and (ii) the bearing of the information common to its output y and its input x on the issue A. The output decision y of the ILU has a clear interpretation and need not be discussed further, except to note that it answers Y and must somehow be causally generated in response to the ILU having observed x. Regarding quantity (ii), the question X ∨ Y yields answers containing the information that is common to the input and the output. Because questions are defined by the set of possible answers that can be obtained, we can therefore look at the set of possible answers to the common question X ∨ Y. By asking X ∨ Y, we can only accept responses that answer both the question X and the question Y, because responses that do not answer both will not provide
for an answer to the common question. Possible answers to X ∨ Y = (X₁ ∧ X₂ ∧ ⋯ ∧ Xₙ) ∨ Y or X ∨ Y = (X₁ ∨ Y) ∧ (X₂ ∨ Y) ∧ ⋯ ∧ (Xₙ ∨ Y) must take the form x ∧ y = (x₁ ∧ y) ∨ (x₂ ∧ y) ∨ ⋯ ∨ (xₙ ∧ y) for every assertion y that can be presented to Y and every disjunction of input assertions x = x₁ ∨ x₂ ∨ ⋯ ∨ xₙ that can be presented to X, that is, for every presented input, y ∈ Y and x ∈ X. The true answers contained within x ∧ y = (x₁ ∨ x₂ ∨ ⋯ ∨ xₙ) ∧ y or, equivalently, each disjunctive component of x ∧ y = (x₁ ∧ y) ∨ (x₂ ∧ y) ∨ ⋯ ∨ (xₙ ∧ y) that is true, indicate the existence of information common to the respective input xᵢ and the output y of the ILU. This is our second example of the transformation between the logical domains of questions and assertions. That is, any rule in one logical domain can be formulated into a corresponding rule in the other logical domain by exchanging questions and assertions and by exchanging disjunction and conjunction. In particular, the common question Xᵢ ∨ Y has as its possible answer xᵢ ∧ y, because a true answer to the common question Xᵢ ∨ Y can only be obtained by a simultaneously true answer to both the individual questions Xᵢ and Y. Thus Xᵢ ∨ Y provides for coincidence detection as realized
by xᵢ ∧ y. Therefore, using the foregoing argument, we state that the ILU must be capable of two distinct types of measurements of its input x and its own output y. The first measurement is simple and consists of the ability of the ILU to measure its own output y by asking Y. The second measurement is only slightly more complex and consists of the detection of the coincidence of the output y with the presented input state x by asking X ∨ Y. We should note that x ∧ y = (x₁ ∨ x₂ ∨ x₃ ∨ ⋯ ∨ xₙ) ∧ y and that the presented input x consists of x₁ ∨ x₂ ∨ x₃ ∨ ⋯ ∨ xₙ, to reemphasize that the disjunctive Boolean operations described here simultaneously represent logical and physical aspects of the ILU. Although the input x presented to the ILU has a physically disjunctive representation, after observation the ILU has access to the answer (x₁ ∧ x₂ ∧ x₃ ∧ ⋯ ∧ xₙ), that is, the value of each measured input component xᵢ. As has been discussed, the distinction between presented and measured input quantities represents a very subtle point. In particular, the physical measurement process can be described, in sequence, as the posing of the question X and the eventual presentation of the quantity x, leading to the answer (x₁ ∧ x₂ ∧ x₃ ∧ ⋯ ∧ xₙ). The measured quantity x is an n-bit encoded quantity. Therefore, an essential aspect of the observation process is the conversion of disjunction into conjunction. The distinction between the quantity x presented to the ILU and the resulting answer will not be of direct consequence for most of our discussions, so we will assume that sometimes x = (x₁ ∨ x₂ ∨ x₃ ∨ ⋯ ∨ xₙ) and sometimes x = (x₁ ∧ x₂ ∧ x₃ ∧ ⋯ ∧ xₙ), and that context will make it clear whether we are referring to the quantity x presented to the ILU or the quantity x measured by the ILU. More formally, the question X is an operator that has the disjunctive form of x as its input and the conjunctive form of x as its output, with the measurement
process providing the logical transformation. When necessary, we will explicitly make the distinction between these two flavors of x clear. Let us ask the following question: What probability distribution p(x ∧ y|a) maximizes the normalization factor K = H(A), given that the ILU must make the measurements y and x ∧ y? The factor K represents the total ignorance of the ILU and is necessary to convert standard entropy to normalized bearing. By maximizing K over p(x ∧ y|a), we are effectively asking, What can we optimally infer, given that we have ignorance H(A)? That is, what exactly and truthfully can be said about all that we know, without saying any more or any less? The maximization of K formally corresponds to the principle of maximized entropy [3, 5, 12]. In the notation p(x ∧ y|a), x is in the conjunctive form because it is available to the ILU for use in the physical computation of this probability and can support the causal inference of y. Regarding the measurement x ∧ y, x is in the disjunctive form because this represents a potential presentation to the ILU. The total ignorance b(A|A) of the ILU is given by

b(A|A) = −Σ_{y∈Y} Σ_{x∈X} p(y ∧ x|a) ln p(y ∧ x|a).   (3)
Over its operational life, the ILU will have many presentations y and x ∧ y made to it. It cannot be expected to maintain an inventory or otherwise store every answer to these presentations; rather, it can retain a memory of these answers. We assume that the ILU can store a memory, or some other type of reconstruction, of the expected values of y and x ∧ y. These expected values are denoted here by ⟨y⟩ and ⟨x ∧ y⟩, respectively. The assumption that these expected values (or a mathematical equivalent) are stored within the ILU requires some type of internal storage mechanism for n + 1 real-valued parameters. There must be exactly n + 1 values because each ⟨xᵢ ∧ y⟩ for i = 1 to n requires a separate storage area, as does ⟨y⟩. We will be satisfied if the ILU can store n + 1 quantities that could, in principle, be used to compute ⟨y⟩ and ⟨x ∧ y⟩. This will be the case for the subject ILU. Now, the n + 1 expectations ⟨y⟩ and ⟨x ∧ y⟩ are defined by the joint input-output distribution p(x ∧ y|a), the determination of which was our original problem. Therefore, in addition to the optimization of Eq. (3), we can add that the ILU must measure ⟨y⟩ = E{y} and ⟨x ∧ y⟩ = E{x ∧ y}, where the expectation operations denoted by E are with respect to p(x ∧ y|a), that is, the knowledge state of the ILU regarding inference. Incorporating these constraints into Eq. (3) serves to acknowledge exactly that information which the ILU obtains by posing the questions X ∨ Y and Y, that is, (X ∨ Y) ∧ Y, thereby constraining its architecture to those structures that admit the possible answers x ∧ y with y, or (x ∧ y) ∨ y. A simple running estimate of these stored expectations is sketched below.
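The following is a minimal sketch of how the n + 1 stored quantities ⟨y⟩ and ⟨x ∧ y⟩ could be accumulated incrementally from a stream of observations; the random inputs and decisions here are placeholders standing in for whatever the ILU actually observes.

```python
import random

# Running estimate of the n + 1 stored quantities <y> and <x ∧ y> from a
# stream of (x, y) observations; x ∧ y is the coincidence vector (x_i AND y).
random.seed(2)
n, L = 4, 10_000
mean_y, mean_xy = 0.0, [0.0] * n

for t in range(1, L + 1):
    x = [random.randint(0, 1) for _ in range(n)]      # presented input (placeholder)
    y = random.randint(0, 1)                          # ILU's own decision (placeholder)
    mean_y += (y - mean_y) / t                        # incremental average
    for i in range(n):
        mean_xy[i] += ((x[i] & y) - mean_xy[i]) / t   # coincidence x_i ∧ y

print(f"<y> ≈ {mean_y:.3f}, <x ∧ y> ≈ {[round(v, 3) for v in mean_xy]}")
```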
The optimization of Eq. (3) subject to the measurable quantities ⟨y⟩ and ⟨x ∧ y⟩ can be accomplished by introducing Lagrange multipliers, with one multiplier assigned to each measurement function that the ILU must perform. We assume n Lagrange multipliers λ = (λ₁, λ₂, …, λₙ), one for each element of ⟨x ∧ y⟩ = E{x ∧ y}, and one additional multiplier μ for ⟨y⟩ = E{y}. By introducing these Lagrange multipliers, we can effectively turn the constrained optimization of Eq. (3) into the unconstrained problem of finding the extrema of

b′(A|A) = −Σ_{y∈Y} Σ_{x∈X} p(x ∧ y|a) ln p(x ∧ y|a)
  + λᵀ[Σ_{y∈Y} Σ_{x∈X} p(x ∧ y|a)(x ∧ y) − ⟨x ∧ y⟩]
  − μ[Σ_{y∈Y} Σ_{x∈X} p(x ∧ y|a) y − ⟨y⟩]
  + λ₀[Σ_{y∈Y} Σ_{x∈X} p(x ∧ y|a) − 1],   (4)
where ⟨x ∧ y⟩ is treated as a column vector. Note also that we have included one additional Lagrange multiplier λ₀, which introduces another important and fundamental constraint, namely that the probabilities p(x ∧ y|a) must sum to unity over x ∈ X and y ∈ Y. Strictly speaking, we also must introduce constraints on each of the 2^{n+1} values that p(x ∧ y|a) can take over x ∈ X and y ∈ Y to ensure that each value is positive. Fortunately, the form of the solution to Eq. (4) implicitly guarantees that this is true, thereby making the explicit introduction of this large number of additional constraints unnecessary. Because Eq. (3) has been made unconstrained by the introduction of additional dimensions to the problem through the Lagrange multipliers, we can optimize Eq. (4) by finding its partial derivative with respect to p(x ∧ y|a) for each answer x ∈ X and each y ∈ Y. In doing so, and by setting these derivatives to zero, we obtain

−ln p(x ∧ y|a) − 1 + λᵀ(x ∧ y) − μy + λ₀ = 0.   (5)
By combining −1 + λ₀ into just −λ₀, Eq. (5) can be solved to give

p(x ∧ y|a) = exp[λᵀ(x ∧ y) − μy − λ₀],   (6)
where, because Eq. (6) must sum to 1, we can define λ₀ in terms of the other Lagrange factors λ and μ by forming the sum

λ₀ = ln Σ_{y∈Y} Σ_{x∈X} exp[λᵀ(x ∧ y) − μy] = ln Z,   (7)
where Z is the normalization factor. The total ILU probability function then becomes

p(x ∧ y|a) = exp[λᵀ(x ∧ y) − μy] / Z,   (8)
which is the final form for the probability function known to the ILU and which allows it to optimally infer from what it has measured. In deriving Eq. (8), we made explicit what the ILU must measure to support the computation of b(Y|A) and b(X ∨ Y|A). It must be capable of measuring the n + 1 quantities x ∧ y and y, while somehow retaining memory of ⟨x ∧ y⟩ and ⟨y⟩. This is reflected in the form of the probability distribution in Eq. (8), in that the argument of the exponent is a function of the observable information, as is the normalization factor Z. Now the bearings b(Y|A), b(X ∧ Y|A), b(X ∨ Y|A), and all other bearings on logical combinations of the questions Y and X can be explicitly computed. For instance, b(X ∧ Y|A) = −λᵀ⟨x ∧ y⟩ + μ⟨y⟩ + λ₀, less the normalization factor K. The quantity b(X ∧ Y|A) represents the amount of information that the ILU can extract from the question X ∧ Y to help resolve its issue A. Similarly, the joint input-output probability p(x ∧ y|a) in Eq. (8) represents the state of knowledge for inference and subsequent decision making. Equation (8) contains all of the information necessary for the ILU to compute the expected measurements ⟨x ∧ y⟩ and ⟨y⟩. The ILU, in having possession of the probability p(x ∧ y|a), does not explicitly store the n + 1 values ⟨x ∧ y⟩ and ⟨y⟩, but rather the n + 1 quantities λ and μ, which are parameters of the distribution p(x ∧ y|a). Therefore, the quantities λ and μ must be physically stored somewhere within the ILU, thereby making Eq. (8) computable by the ILU. The parameters λ and μ are sufficient for the ILU to compute ⟨x ∧ y⟩ and ⟨y⟩. Equation (8) dictates the physical architecture of the ILU. For example, knowing that the inputs consist of x and that the single output is y, one can find the conditional probability p(y|a ∧ x) for decision making. The Appendix lists the more important marginal and conditional distributions for Eq. (8). In particular, the conditional output distribution for y given the input x is given by
p(y = 1|a ∧ x) = 1 / (1 + exp[−(λᵀx − μ)])   (9)

or

p(y = 1|a ∧ x) = 1 / (1 + exp[−ξ(x)]),   (10)
where ξ(x) = λᵀx − μ. The quantity ξ(x) is called the evidence for decision making, statistical evidence, or just evidence. Evidence is formally defined as the logarithm of the odds function O given by O[p(y = 1|a ∧ x)] = ln[p(y = 1|a ∧ x) ÷ p(y = 0|a ∧ x)]. More generally, O(p) = ln[p/(1 − p)] for any discrete probability p. Evidence has many interesting characteristics, with perhaps the foremost being that its properties agree well with intuition and the subjective use of the term evidence. Evidence is linear in the sense that new information can either add to or subtract from the prevailing evidence. Evidence represents a numerical encoding of knowledge at any point in time and is truly the knowledge state of the ILU for inference. New evidence can combine linearly from independent sources, such as testimony for the jurors in a courtroom trial. Moreover, the linear properties of evidence minimize the complexity of the arithmetic operations required of the ILU, as will be seen (look ahead to Eq. (11)). Figure 3 is a plot of the probability of deciding y = 1 based on the evidence ξ(x) assimilated by the ILU. The ILU can make one of two decisions. It can decide to do nothing, that is, not provide a presentation to the question Y; in this case, y = 0 and nothing is asserted. Alternatively, the ILU can assert y = 1 to Y, yielding the answer yes; in this case, y = 1 and an overt action on the part of the ILU is required. In deciding to assert y = 1, the ILU must perform inference using p(y = 1|x ∧ a) or a convenient monotonic function thereof. The evidence ξ(x) provides the simplest function with which to carry out the inference on y. Because ξ(x) is effectively computed anyway, the ILU can simply use evidence as the sole basis for inference and decision making.
Figure 3 Conditional probability p(y = 1|a ∧ x) as a function of the evidence ξ(x). As p(y = 1|a ∧ x) approaches either 0 or 1, substantial amounts of new evidence are required to significantly alter these probabilities. This is one reason why a logarithmic scale is often used to characterize the evidence function.
The evidence ξ(x) yields a probabilistic decision rule. That is, ξ(x) supports inference in that it embodies the complete knowledge state of the ILU at each time instant. In particular, the expression ξ(x) = λᵀx − μ corresponds to the logarithm of the posterior odds of deciding y = 1, given observed information x, as shown by the linear terms present in a logarithmic form of Bayes' theorem given by

ln [p(y = 1|a ∧ x) / p(y = 0|a ∧ x)] = ln [p(x|a ∧ (y = 1)) / p(x|a ∧ (y = 0))] + ln [p(y = 1|a) / p(y = 0|a)] = λᵀx − μ = ξ(x).   (11)
In other words, the new information for decision making equals the original information for decision making plus the newly observed information for decision making. The prior evidence for decision making is given by the decision threshold μ, and λᵀx represents newly measured information. Figure 4 summarizes the inferential architecture of the ILU. Note that the reason for incorporating the question Y has not yet been discussed. The question Y will be seen to support ILU question formation and adaptation regarding both optimized transduction and optimized transmission. In the next section, we will develop adaptation algorithms, to be implemented by this architecture, that will optimize its transduction and transmission through separate optimizations of b(X ∨ Y|A) and b(Y|A), respectively. Both of these optimizations intrinsically require that the ILU ask the question Y.
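The inference path of Eqs. (8)-(10) is short enough to sketch in full. The following is a minimal illustration, assuming small arbitrary parameters; lam and mu are stand-ins for λ and μ, and the final probabilistic decision mirrors the architecture of Fig. 4.

```python
import math, random
from itertools import product

# Sketch of the ILU inference path of Eqs. (8)-(10), for illustrative
# parameters only.
random.seed(3)
n = 3
lam = [random.uniform(-1, 1) for _ in range(n)]
mu = 0.2

# Eq. (8): p(x ∧ y|a) ∝ exp[λᵀ(x ∧ y) − μy].  Note λᵀ(x ∧ y) = y·λᵀx,
# since the coincidence vector x ∧ y has components x_i AND y.
states = [(x, y) for x in product((0, 1), repeat=n) for y in (0, 1)]
energy = {(x, y): y * (sum(l * xi for l, xi in zip(lam, x)) - mu)
          for (x, y) in states}
Z = sum(math.exp(e) for e in energy.values())
p = {s: math.exp(e) / Z for s, e in energy.items()}

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

# Check Eqs. (9)/(10): conditioning Eq. (8) on x recovers the logistic rule.
for x in product((0, 1), repeat=n):
    xi = sum(l * xv for l, xv in zip(lam, x)) - mu          # evidence ξ(x)
    p_cond = p[(x, 1)] / (p[(x, 0)] + p[(x, 1)])
    assert abs(p_cond - sigmoid(xi)) < 1e-12

# A probabilistic decision, as in Fig. 4: assert y = 1 with probability (10).
x = tuple(random.randint(0, 1) for _ in range(n))
y = int(random.random() < sigmoid(sum(l * xv for l, xv in zip(lam, x)) - mu))
print(f"input {x} -> decision y = {y}")
```

The check confirms that the maximum-entropy joint distribution of Eq. (8), conditioned on the measured input, yields exactly the logistic decision rule of Eqs. (9) and (10).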
Figure 4 Architectural layout of the ILU for supporting inference and forward information flow. Asserted inputs x are measured through λ and then combined with μ to compute the evidence ξ(x). Subsequently, ξ(x) can be used to generate a probabilistic decision y. The decision y is measured by the ILU through feedback to support subsequent optimization of transmission and transduction.
V. OPTIMIZED TRANSMISSION

In attempting to optimize the transmission of information by the ILU, we effectively want to generate and present the best assertions to questions that other observers can ask regarding the ILU output. For the source ILU, this optimization corresponds to maximizing b(Y|A) with respect to the decision threshold μ, given that the ILU can measure presented inputs x through X. The decision threshold μ dictates what answers will be given in response to the measurement λᵀx and therefore realizes the question Y, because questions are intrinsically defined by their possible answers. The threshold μ cleaves the space of possible assertions that the ILU can generate, while simultaneously forming the output premise of the ILU. Qualitatively, maximizing transmission corresponds to maximizing b(Y|A) over μ for fixed λ, with no inherent constraints placed on μ except that it is real-valued. The problem then becomes that of solving

∂b(Y|A)/∂μ = 0,   (12)

where

b(Y|A) = Σ_{y∈Y} p(y|a) ln [1/p(y|a)].   (13)
From the Appendix, the marginal distribution of the output assertion y is

p(y|a) = Σ_{x′∈X} p(x′ ∧ y|a) = (exp[−μy]/Z) Σ_{x′∈X} exp[λᵀ(x′ ∧ y)].   (14)
Now let us define S_λ according to

S_λ = Σ_{x′∈X} exp[λᵀx′].   (15)
Then

b(Y|A) = −(1/ln 2)[(2ⁿ/Z) ln(2ⁿ/Z) + (S_λ exp[−μ]/Z) ln(S_λ exp[−μ]/Z)].   (16)
We incorporate the factor 1/ln 2 in Eq. (16) to evaluate it in terms of base 2, which is overall a more natural base for the consideration of computations performed by the ILU. This factor can be eliminated because it will not change the location of the extrema of Eq. (16). The factor Z in Eq. (16) is the normalization factor
for p(x ∧ y|a) and again is given by

Z = Σ_{y∈Y} Σ_{x′∈X} exp[λᵀ(x′ ∧ y) − μy].   (17)
A summation over the possible answers to Y, that is, y = 0 and y = 1, yields

Z = 2ⁿ + (Σ_{x′∈X} exp[λᵀx′]) exp[−μ].   (18)
It can be seen that the second term in Eq. (18) approximately equals 2ⁿ if

exp[μ] ≈ exp[λᵀx],   (19)
for typical x ∈ X. If this condition holds, the maximum of b(Y|A) is approximately 1 and Z is fixed at Z ≈ 2^{n+1}. These conditions can be achieved if Eq. (19) holds with a sufficient fidelity, to be determined. In particular, it can be shown that Eq. (19) asymptotically holds if

μ = E[λᵀx|y = 1] = ⟨λᵀx|y = 1⟩.   (20)
This can be verified by explicitly solving Eq. (12), then making use of Eq. (20) and of the conditional probability p(y = 1|x) from the Appendix, which is given by

$$p(y = 1|x) = \frac{1}{1 + \exp[-\xi(x)]}, \qquad (21)$$

and then finally ensuring that |ξ(x)| = |λ^T x − μ| is small relative to unity. In general, this last constraint requires that the total measurement vector-squared magnitude |λ|² = γ² be such that γ² < 1. Fully optimized transmission can at best yield b(Y|A) = 1 bit. This can be demonstrated through a Monte Carlo simulation in which γ² is varied for randomly selected measurement vectors λ having |λ|² = γ², with μ = ⟨λ^T x|y = 1⟩, and for varying input dimension n. The condition μ = ⟨λ^T x|y = 1⟩ represents our first adaptation condition for the ILU: the ILU parameter μ is just the expected value of the quantity λ^T x, given that this quantity, in conjunction with the prevailing value of μ, yields the decision y = 1 in response to the computed evidence ξ(x). Figure 5 contains plots generated by 32 Monte Carlo trials for each selected value of γ² and n, where n is the dimension of the ILU or, equivalently, the number of inputs. The ILU dimension is restricted to the range n ∈ [2, 12]; results and trends for n > 12 can be extrapolated from the four plots shown, and larger dimensions only serve to increase the simulation time. The total modulus-squared magnitude γ² was varied in logarithmic steps of 10, from several orders of magnitude below unity to several orders above.
Figure 5 Each (γ², n) coordinate in all plots arises from an average of 32 Monte Carlo trials. (a) Averaged decision threshold μ = ⟨λ^T x|y = 1⟩. (b) Variance of μ over the 32 trials, where the computed values are statistically stable for each selected (γ², n) coordinate. (c) and (d) Average bearing b(Y|A) (or entropy H(Y)) and the variance of this estimate, respectively. The conditions γ² < 1 and μ = ⟨λ^T x|y = 1⟩ guarantee near-optimal ILU operation in that b(Y|A) approaches 1.
Figure 5a shows that for γ² < 1, the parameter μ stays bounded with n. Moreover, the parameter μ is a monotonically decreasing function of γ². Figure 5b demonstrates that for γ² < 1, the variability of μ over the randomly selected λ is very small for all selected dimensions and is small relative to unity. This suggests that μ is approximately a deterministic function of λ, γ², and n. Figure 5c shows that the output entropy b(Y|A) is approximately 1, or that the bearing b(Y|A) is consistently maximized for γ² < 1 and for all selected dimensions n. This indicates that the ILU transmission is indeed optimized for μ ≈ ⟨λ^T x|y = 1⟩. Finally, Fig. 5d shows that the variability of the output bearing is essentially 0, thereby providing additional confidence that b(Y|A) has minimal deviation from unity, regardless of n, λ, or the selected γ² < 1.
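A sketch (not the authors' code) of the kind of Monte Carlo check summarized in Fig. 5 is given below: for a random measurement vector normalized to |λ|² = γ² < 1, setting μ per Eq. (20) should drive the bearing b(Y|A) toward 1 bit. The exact enumeration over all 2^n Boolean inputs and all names used are assumptions made for illustration.

```python
import itertools
import numpy as np

def bearing(n, lam, mu):
    """Exact output entropy b(Y|A) in bits for an n-input ILU, using
    p(y=0|a) = 2^n / Z and p(y=1|a) = S_x exp(-mu) / Z (Eqs. A5, A6)."""
    X = np.array(list(itertools.product([0, 1], repeat=n)))
    S = np.exp(X @ lam).sum()
    Z = 2 ** n + S * np.exp(-mu)
    p0, p1 = 2 ** n / Z, S * np.exp(-mu) / Z
    return -(p0 * np.log2(p0) + p1 * np.log2(p1))

def mu_star(n, lam):
    """mu = <lambda^T x | y = 1>, Eq. (20), computed exactly: by Eq. (A12),
    p(x | y = 1) is proportional to exp(lambda^T x)."""
    X = np.array(list(itertools.product([0, 1], repeat=n)))
    w = np.exp(X @ lam)
    return float((w * (X @ lam)).sum() / w.sum())

rng = np.random.default_rng(1)
n, gamma2 = 8, 0.5
lam = rng.normal(size=n)
lam *= np.sqrt(gamma2) / np.linalg.norm(lam)  # enforce |lambda|^2 = gamma^2
mu = mu_star(n, lam)
print(bearing(n, lam, mu))                    # close to 1 bit for gamma^2 < 1
```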
VI. OPTIMIZED TRANSDUCTION

In attempting to optimize the transduction of information by the ILU, we effectively want to determine the best questions to ask based on assertions previously presented to it. The determination of the best question to ask should be achieved through the maximization of b(X ∨ Y|A) with respect to each element λ_i in the vector λ, subject to the constraint |λ|² = γ² that arose during the optimization of ILU transmission. The parameter vector λ dictates what answers can arise through inputs x as interrogated by the question X. Good measurements will maximize the information common to the output and the input, that is, that which can be asked of the input x and that which can be asked of the output y. These measurements will emphasize the x_i whose coincidences x_i ∧ y are most frequent. If x_i ∧ y ≈ x_i, then x_i → y, which even for a single-input ILU would minimize information loss. Because γ² must be bounded below unity to satisfy the maximization of b(Y|A), we maintain this as an explicit constraint for optimizing b(X ∨ Y|A). The quantitative mathematical problem then becomes that of finding the extremum of the objective function J by solving

$$\frac{\partial J}{\partial \lambda_i} = 0 \qquad (22)$$

for i = 1, 2, ..., n, where n is the number of input elements in x and where

$$J = b(X \vee Y|A) + \frac{1}{2}\alpha|\lambda|^2. \qquad (23)$$
The factor α in Eq. (23) is a Lagrange multiplier introduced to explicitly enforce the condition |λ|² = γ². One can prove the so-called Gibbs mutual information theorem or GMIT [13], which states that if the mutual or common information between a question X and a question Y is extremized for a joint probability distribution with an exponential form, the resulting condition for the extremal solution has an especially simple form. Basically, the theorem states that if one has a Gibbs or exponential distribution of the form

$$p(x \wedge y|a; v) = \exp[-v_1 f_1(x, y) - v_2 f_2(x, y) - \cdots - v_n f_n(x, y)]/Z,$$

then the common information b(X ∨ Y|A) is extremized over the parameter set v when

$$\sum_{x' \in X}\sum_{y' \in Y} p(x' \wedge y'|a; v)\,\ln\frac{p(x' \wedge y'|a; v)}{p(x'|a; v)\,p(y'|a; v)}\,\bigl[\langle f_j(x', y')\rangle - f_j(x', y')\bigr] = 0 \qquad (24)$$
for j = 1, 2, ..., n. Here v = (λ, μ), f_j(x, y) = x_j ∧ y, and ⟨f_j(x, y)⟩ = ⟨x_j ∧ y⟩ is the expected value of this coincidence, that is, of the answers to the
common question X_j ∨ Y, and is given by

$$\langle x_j \wedge y\rangle = \sum_{x' \in X}\sum_{y' \in Y} x'_j\, y'\, p(x' \wedge y'|a; \lambda, \mu), \qquad (25)$$
with p(x ∧ y|a; λ, μ) = p(x ∧ y|a) given by Eq. (8). We have replaced the conjunction operator with the multiplication operator in making the substitution x_i ∧ y = x_i y in Eq. (8), because for Boolean numbers, multiplication and conjunction are equivalent. It is worthwhile to note that rescaling the problems here from the Boolean values of 0 and 1 to 0 and any other number will not change any results, because discrete entropy is invariant to scale. The computational objective is to solve Eq. (23) and to find the conditions under which Eq. (22) holds. One can go through the rigorous mathematical detail, but using the GMIT, using Eqs. (22) through (25), and noting that
$$\frac{\partial}{\partial \lambda_i}\left[\frac{1}{2}\alpha|\lambda|^2\right] = \alpha\lambda_i, \qquad (26)$$
the optimization condition stated by Eq. (22) becomes

$$\frac{\partial J}{\partial \lambda_i} = E\left\{\ln\frac{p(y|x \wedge a)}{p(y|a)}\,\bigl[x_i y - \langle x_i y\rangle\bigr]\right\} + \alpha\lambda_i = 0. \qquad (27)$$
As one can see from Eqs. (22) and (23), the partial derivatives are only taken with respect to the Lagrange factors of the n elements {x_1 y, x_2 y, ..., x_n y} contained in the vector xy = x ∧ y. From the Appendix, the forms of p(y|x ∧ a) and p(y|a) are known. When these are substituted into Eq. (27), one obtains the equation

$$A_i(\lambda, \mu) + B_i(\lambda, \mu) + \alpha\lambda_i = 0 \qquad (28)$$
for i = 1, 2, ..., n. This equation must be simultaneously satisfied for all i, where A_i and B_i are defined according to

$$A_i(\lambda, \mu) = E\{[x_i y - \langle x_i y\rangle][\lambda^T x y - \mu y]\} \qquad (29)$$
and BKA,/.)
=E{[.,.,-
(.,.,)]lnPll^^|^},
(30)
respectively. A_i(λ, μ) can be simplified by remembering that μ = λ^T⟨x|y = 1⟩. Furthermore, λ^T⟨x|y = 1⟩⟨y⟩ = λ^T⟨xy⟩. With this last substitution, Eq. (29) becomes

$$A_i(\lambda, \mu) = \lambda^T\langle x x_i y\rangle - \mu\langle x_i y\rangle. \qquad (31)$$
B_i(λ, μ) cannot be simplified so easily, although one can develop a detailed asymptotic expression for it. We will not take an analytic approach, but rather simulate it, much as we did for the optimization of b(Y|A), although an analytic approach provides the same answer. The simulation approach we will use is to compute the ratio R_i(λ, μ) = A_i(λ, μ)/B_i(λ, μ). Having done so in advance, and knowing the answer in advance, we find that R_i(λ, μ) = −2[1 + ε_i(λ, μ)/2], where ε_i(λ, μ) is an error term that will momentarily be shown to be small if γ² < 1. Now, using this equation to solve for B_i(λ, μ), we find that

$$B_i(\lambda, \mu) = -\frac{A_i(\lambda, \mu)}{2}\left[\frac{1}{1 + \varepsilon_i(\lambda, \mu)/2}\right]. \qquad (32)$$
Making this substitution into Eq. (28) gives

$$\frac{A_i(\lambda, \mu)}{2}\left[\frac{1 + \varepsilon_i(\lambda, \mu)}{1 + \varepsilon_i(\lambda, \mu)/2}\right] = -\alpha\lambda_i. \qquad (33)$$
If ε_i(λ, μ) is sufficiently small, then

$$A_i(\lambda, \mu) = -2\alpha\lambda_i. \qquad (34)$$
Equation (34) requires that ε_i(λ, μ) be sufficiently small. The ratio ε_i(λ, μ) can be computed for varying γ², λ, and n to investigate the magnitude of ε_i relative to 1 (μ is a function of λ and is therefore not an independent parameter, because μ = λ^T⟨x|y = 1⟩). Figure 6 is a plot of ε_i(λ, μ) for γ² varied logarithmically from several orders of magnitude below unity to several orders above, and for n in the range of 2 to 12; these are approximately the same ranges used for evaluating the optimization of b(Y|A). The decision threshold μ is computed per Eq. (20) to optimize transduction. The magnitudes of ε_i(λ, μ) are plotted in units of −10 log₁₀ ε_i to indicate the wide dynamic range of values that ε_i(λ, μ) takes over the selected values of γ², λ, and n. Each (γ², n) coordinate in Fig. 6 was computed from the averaged value of ε_i(λ, μ) over 32 random trials, that is, over randomly selected λ with nonzero elements. Zero elements in λ can cause a convergence problem in computer simulations, but will be seen not to occur in practice, because the corresponding inputs x_i will, in this case, have no common information with the output y. Simulation results remain unchanged for dynamic ranges of each λ_i varied between 10 and 1000; in fact, a dynamic range of only 10 works quite well. Dynamic range is an important hardware implementation issue. Figure 6 shows that ε_i is consistently less than 0.01 for γ² < 1 and decreases in an approximately linear fashion with γ². Thus Eq. (34) will hold for the ranges of values that provide optimized b(Y|A). Furthermore, ε_i(λ, μ) decreases monotonically with n. Therefore, ILUs having a greater number of inputs should have a
Figure 6 A plot of the error term ε_i defined in the ratio R_i(λ, μ) = −2[1 + ε_i(λ, μ)/2]. The magnitude of ε_i is consistently less than 0.01 for γ² < 1, decreases in an approximately linear fashion with γ², and decreases monotonically with the input dimension n.
better degree of approximation regarding Eq. (34). In fact, though, even n = 2 inputs work quite satisfactorily and readily support optimized transduction. The computational objective defined by the optimization of Eq. (23) can now be approximately satisfied if Eq. (34) holds. To repeat, we desire that

$$\lambda^T\langle x x_i y\rangle - \mu\langle x_i y\rangle = -2\alpha\lambda_i \qquad (35)$$

for i = 1, 2, ..., n. The system of equations defined by Eq. (35) can be greatly simplified regarding their representation prior to their solution. In particular, combining these equations for i = 1, 2, ..., n gives the matrix equation

$$R\lambda = -4\alpha\lambda, \qquad (36)$$
where we have taken account of the fact that μ = ⟨λ^T x|y = 1⟩ and where the matrix R is defined by

$$R = \bigl\langle [x - \langle x|y = 1\rangle][x - \langle x|y = 1\rangle]^T \,\big|\, y = 1\bigr\rangle. \qquad (37)$$
Because the Lagrange factor α for the constraint |λ|² = γ² has not been specified, Eq. (36) can be written as the eigenvector equation

$$R\lambda = \alpha'\lambda \qquad (38)$$
by letting α′ = −4α. One can go back and solve for α to find that α′ must be the largest eigenvalue and hence λ the largest eigenvector of the covariance matrix R. Note that R, being a covariance matrix for the input x present during an output decision corresponding to y = 1, has full rank as long as none of the inputs are perfectly correlated. This condition is guaranteed if one of the elements of a perfectly correlated pair forces λ_i → 0 through the optimized transduction adaptation algorithm. Forcing λ_i → 0 essentially eliminates input x_i from any further consideration by the ILU, subsequently reducing the number of input dimensions from n to n − 1. In summary, finding the largest eigenvector of the matrix R defined by Eq. (37) represents the computational goal of the ILU regarding the optimized transduction of information.
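A batch-mode sketch of this computational goal follows: estimate R from the inputs that induced y = 1 and extract its largest eigenvector, rescaled so that |λ|² = γ². The sampling setup and names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def transduction_lambda(X_pos, gamma2):
    """X_pos: array of input patterns that induced y = 1, shape (m, n).
    Returns lambda as the largest eigenvector of the conditional
    covariance matrix R of Eq. (37), normalized to |lambda|^2 = gamma^2."""
    R = np.cov(X_pos, rowvar=False, bias=True)  # conditional covariance
    evals, evecs = np.linalg.eigh(R)
    lam = evecs[:, -1]                          # eigenvector of largest eigenvalue
    return np.sqrt(gamma2) * lam / np.linalg.norm(lam)

# Example with made-up Boolean patterns assumed to have yielded y = 1.
rng = np.random.default_rng(2)
X_pos = rng.integers(0, 2, size=(200, 10))
lam = transduction_lambda(X_pos, gamma2=0.5)
```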
VII. ILU COMPUTATIONAL STRUCTURE

Now that we know the computational objectives of the ILU, we must consider how to efficiently achieve them for both transmission and transduction. The optimized transmission, or inferential structure, has been described through the computation of the evidence ξ(x) for inference and decision making. The optimized transduction or interrogational structure must be addressed to modify the questions asked by the ILU. The simultaneous or individual modification of the inferential and interrogational structures of the ILU corresponds to learning. As derived in the previous section, the computational objectives were found to include

$$\mu = \langle \lambda^T x\,|\,y = 1\rangle \qquad (39)$$

for optimized transmission and to find the largest eigenvector solution of

$$R\lambda = \alpha'\lambda \qquad (40)$$

for optimized transduction, where again, R = ⟨[x − ⟨x|y = 1⟩][x − ⟨x|y = 1⟩]^T | y = 1⟩. It is significant that Eqs. (39) and (40) only hold for presented inputs x that causally yield y = 1. This is fundamental to our development of
an efficient algorithm for simultaneously achieving the computational objectives given by Eqs. (39) and (40). We assume that "on-line learning" occurs as driven by the sequential presentation of input patterns {x₁, x₂, ...}. The evidence function was defined to be ξ(x) = λ^T x − μ. The ILU must compute evidence to support inference and decision making. The conditional expectation of ξ(x), given a positive output decision y = 1, is

$$\langle \xi(x)\,|\,y = 1\rangle = \langle \lambda^T x\,|\,y = 1\rangle - \mu = 0. \qquad (41)$$
Because Eq. (39) holds, Eq. (41) states that the average evidence for decision making, given that a positive decision is made, is 0! Equation (39) basically states that the decision threshold μ is just the conditional average of the measurement λ^T x, given that a positive decision y = 1 has been made. The ILU output decision conditionalizes the validity of both Eqs. (39) and (40). The simplest conceivable way of computationally realizing Eq. (39) is to numerically average the induced measurement λ^T x conditionalized on a previously made decision y = 1 and, internal to the ILU, adjust the parameter μ accordingly. This computation can practically be realized through a simple difference equation of the form

$$\mu(t + 1) = (1 - \eta_1)\mu(t) + \eta_1(\lambda^T x) \qquad (42)$$
implemented with time constant 1 − η₁ for η₁ ∈ (0, 1) and only applied when y = 1. Equation (42) realizes an exponential window averaging of the past quantities λ^T x measured by the ILU that induced the decision y = 1. The width of the exponential window is controlled by the parameter η₁. Now we consider Eq. (40), which optimizes transduction. In 1982, Erkki Oja [14] derived and discussed a very important and simple adaptation equation, which he proposed as a biologically plausible adaptation algorithm for neurons. Basically, Oja's adaptation equation takes as input a sequence of real-valued vectors upon which it operates. For a statistically stationary input, the equation computes and identifies the largest eigenvector of the correlation (not covariance) matrix of the input sequence. The form of Oja's adaptation equation is given by

$$\frac{d\lambda(t)}{dt} = \pi_1\eta(x)[\pi_2 x(t) - \pi_3\eta(x)\lambda(t)], \qquad (43)$$
where η(x) = λ^T x, π₁ is a time constant analogous to η₁, and π₂ and π₃ are two scaling factors for the solution. For a sequence of inputs {x₁, x₂, ...}, this equation will compute the largest eigenvector λ of the autocorrelation matrix R′ = E[xx^T]. This is almost exactly the same problem we have here, except for the following
three items:

1. The matrix R in Eq. (40) is an autocovariance matrix, not the autocorrelation matrix. In general, the largest eigenvector of the autocovariance matrix is much more powerful in terms of the ILU's ability to distinguish than that of the autocorrelation matrix.
2. The matrix R we are interested in is conditionalized on the output decision y = 1. This poses no conceptual or practical difficulty in the adaptation algorithm proposed here.
3. The magnitude of the resulting largest eigenvector is constrained to be normalized such that |λ|² = γ².

All of these conditions can be satisfied with some minor modifications of Oja's original equation. In particular, Eq. (44) will satisfy each of conditions (1) through (3) and, when implemented for inputs inducing y = 1, will compute the largest eigenvector of the covariance matrix R. The proposed adaptation equation for optimized transduction is

$$\lambda(t + \Delta t) = \lambda(t) + \eta\,\xi(x)\bigl[x(t) - \gamma^{-2}\xi(x)\lambda(t)\bigr], \qquad (44)$$
where ξ(x) = λ^T x − μ is the evidence for decision making. One can compute E[ξ²(x)|y = 1] and find that it is equal to α′γ². Because γ² is fixed and the computed λ has α′ as its largest eigenvalue, it can be seen that the evidence has maximum variance or variability, given that the ILU has made the decision y = 1. The quantity ξ(x) can be viewed as a current, voltage, or other physical quantity computed within the ILU, depending on the selected physical implementation. Equation (44) ensures that the evidence will have maximum variability given that the ILU decides y = 1.
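A compact sketch of the two on-line updates is given below; both Eq. (42) and Eq. (44) are applied only on steps where the ILU has decided y = 1. The learning rates eta1 and eta are assumed free parameters, not values from the original text.

```python
import numpy as np

def ilu_adapt(x, lam, mu, gamma2, eta1=0.05, eta=0.01):
    """One adaptation step, given an input x that induced y = 1."""
    xi = float(lam @ x) - mu                          # evidence xi(x)
    mu = (1.0 - eta1) * mu + eta1 * float(lam @ x)    # Eq. (42): window average
    lam = lam + eta * xi * (x - xi * lam / gamma2)    # Eq. (44): modified Oja rule
    return lam, mu
```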
VIII. ILU TESTING

We will now complete this chapter by testing the ILU. Table IV describes a spatial pattern of data used to train a model ILU with the adaptation methodology described. It contains 20 training vectors {x₁, x₂, ..., x₂₀} describing the set of inputs to an ILU with n = 20 distinct inputs. Input 10 never has an assertion presented to it; therefore, question X₁₀ is never answered. Alternatively, input 20 is consistently asserted in every input pattern. Considerable similarity can be noted among subgroups of the input patterns. This similarity is intentional, so that groupings of input code vectors can more easily be evaluated regarding the distinguished class in which they reside following training. The results of 16 random training sessions using the indicated patterns for input are shown in Fig. 7. The vector λ was initialized to different nonzero random values for each simulation. A constant value of γ² = 0.001 was maintained, although
Table IV Set of 20 Input Pattern Vectors Used to Train the Model ILU

 1  11110110100011011001
 2  01000100000111011101
 3  11010001000000100001
 4  11110110100010011001
 5  11110110100011011001
 6  11111010001000001001
 7  01010110100001011001
 8  11110110100010011001
 9  11110110100011011001
10  10010001101111011001
11  11000110000001001011
12  11110110100010011001
13  11110110100011011001
14  10001110101100111101
15  00001111001010100111
16  11110110100010011001
17  01101110001111010011
18  10000010001011110111
19  10010001000100001111
20  11010100101000000001
little sensitivity to this parameter was observed as long as γ² < 1. Between 60 and 100 training iterations were required for the convergence of the ILU parameters λ and μ to a state whereby approximately half the training patterns typically yielded a decision y = 1 by the ILU. A computational temperature T was monotonically decreased to near 0 during training, where ILU decision making was performed probabilistically based on p(y = 1|x) = 1/(1 + exp[−ξ(x)/T]). Figure 7a shows that the decision threshold μ varies by only a few percent between simulations, in agreement with the previous simulation results shown in Fig. 5. Figure 7b describes the number of input patterns (out of 20) that will typically induce the decision y = 1. About half of the input patterns yield y = 1, as indicated by the fact that the mean value is approximately 9.5, averaged over the 16 random training trials. The bearing b(Y|A) is maximized when exactly half of the input patterns induce y = 1. Figure 7c is a plot of the averaged Lagrange vectors λ over the 16 training sessions. Little variability occurred in the computed values, a finding consistent with previous simulation results. The Lagrange factor λ₁₀ associated with input 10 asymptotically tended to 0, which is consistent with the absence of presented assertions. Alternatively, the factor λ₂₀, associated with input 20, was consis-
Figure 7 The decision threshold μ in (a) varies by only a few percent between simulations. (b) describes the number of input patterns that induce the decision y = 1 across simulations. A plot of the averaged Lagrange vectors λ over the 16 training sessions is given in (c). The specific input patterns x that induce the decisions y = 0 and y = 1 are described in (d).
tently the largest in each training session. This finding is counterintuitive from an information-theoretic standpoint, because this input offers no distinguishability among the training patterns. However, this interpretation is improper, because the computational objective is to compute the magnitude of each λ_i based on the degree to which there is common information X_i ∨ Y as described by x_i ∧ y. Therefore, outputs y that are consistently asserted when input x_i is asserted will weight λ_i more heavily, because this condition is a stronger indicator of the implication y → x_i, implying reconstructability of inputs. Figure 7d indicates which training patterns give rise to the decisions y = 0 and y = 1 by letting the computational temperature T tend to 0, where inference was performed based on p(y = 1|x) = 1/(1 + exp[−ξ(x)/T]). The parameter T essentially models additive measurement noise. As the noise tends to 0, T tends to 0, and the decision rule is deterministic and consists of the simple rule in which the ILU decides y = 1 if ξ(x) > 0, that is, if λ^T x > μ. As T becomes large, the decision rule is completely random and is independent of the measured quantity ξ(x). In this case p(y = 1|x) and p(y = 0|x) each approach 1/2.
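The following sketch assembles the pieces above into the kind of training loop described here: the Table IV patterns are presented repeatedly, decisions are drawn with a cooling temperature T, and λ and μ adapt on y = 1 decisions via Eqs. (42) and (44). The cooling schedule and rates are assumptions; the original experimental settings are only partially specified.

```python
import numpy as np

rng = np.random.default_rng(3)

def train_ilu(patterns, gamma2=0.001, epochs=100, eta1=0.05, eta=0.01):
    """patterns: (m, n) array of 0/1 vectors, e.g. the rows of Table IV."""
    n = patterns.shape[1]
    lam = rng.normal(size=n)
    lam *= np.sqrt(gamma2) / np.linalg.norm(lam)       # |lambda|^2 = gamma^2
    mu = 0.0
    for t in range(epochs):
        T = max(1e-3, 1.0 - t / epochs)                # cooling schedule (assumed)
        for x in rng.permutation(patterns):
            xi = float(lam @ x) - mu
            if rng.random() < 1.0 / (1.0 + np.exp(-xi / T)):    # decide y = 1
                mu = (1.0 - eta1) * mu + eta1 * float(lam @ x)  # Eq. (42)
                lam = lam + eta * xi * (x - xi * lam / gamma2)  # Eq. (44)
    return lam, mu
```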
IX. SUMMARY

We must conclude our discussion without having produced a practical application, although a single ILU has shown biomedical application for the screening of cataracts. Nor have we dealt with the issue of timing and synchronization between multiple ILU configurations, or even considered such configurations. It is a straightforward exercise to extend the analysis we have described to such configurations, which have more practical use than a single ILU. We have, however, described an analytically tractable, intuitive, and rational methodology for the design of artificial neural networks that operate inductively rather than deductively. In this sense, the proposed ILU and any realized ILCs represent a computational generalization of conventional digital computers and deductive logic. As stated at the beginning of this chapter, many of the ideas we have presented here are not new. A logical legacy can be traced back quite far. Limited space does not permit a complete account of all participants. George Spencer-Brown [15] was the first to ascertain the logical origins of Boolean algebra and to emphasize the essential nature of distinguishability as a basis for logic. Edwin Jaynes [3], Myron Tribus [5], and Richard Cox [6] are major contemporary contributors to the field of inductive logic. Jaynes established the connection between thermodynamics and information theory. In doing so, he showed that thermodynamics is not an arbitrary theory, but rather a theory of theories. The work of Jaynes provided a Rosetta stone to Tribus [5], who extended the practical utility of inductive logic and advanced its logical underpinnings. It was Tribus who first conceived of a set of desiderata for the design of an ILC, which provided the basis for the approach described here. Finally, Cox, in his unassuming and profoundly insightful manner, realized the fundamental and physical nature of assertions, questions, probability, and entropy. It is to him that this chapter is dedicated, in that it is but a more detailed rendering of his account.
APPENDIX: SIGNIFICANT MARGINAL AND CONDITIONAL ILU DISTRIBUTIONS

Complete joint input-output distribution for x and y:

$$p(x \wedge y|a) = \frac{\exp[\lambda^T(x \wedge y) - \mu y]}{Z}. \qquad (A1)$$
Partition function Z required for normalization:

$$Z = \sum_{y' \in Y}\sum_{x' \in X} \exp[\lambda^T(x' \wedge y') - \mu y']. \qquad (A2)$$
Marginal input distribution for x:

$$p(x|a) = \frac{1 + \exp[\lambda^T x - \mu]}{Z}. \qquad (A3)$$
Marginal output distribution for y:

$$p(y|a) = \frac{1}{Z}\sum_{x' \in X} \exp[\lambda^T(x' \wedge y) - \mu y]. \qquad (A4)$$
Unconditional probability of no output decision:

$$p(y = 0|a) = \frac{2^n}{Z}. \qquad (A5)$$
Unconditional probability of the output decision y = 1:

$$p(y = 1|a) = \frac{\exp[-\mu]\sum_{x' \in X}\exp[\lambda^T x']}{Z}. \qquad (A6)$$
Conditional probability of output y, given the input x:

$$p(y|a \wedge x) = \frac{\exp[(\lambda^T x - \mu)y]}{1 + \exp[\lambda^T x - \mu]}. \qquad (A7)$$
Conditional probability of no decision, that is, y = 0, given the input x:

$$p(y = 0|a \wedge x) = \frac{1}{1 + \exp[\lambda^T x - \mu]}. \qquad (A8)$$
Conditional probability of the decision y = 1, given the input x:

$$p(y = 1|a \wedge x) = \frac{\exp[\lambda^T x - \mu]}{1 + \exp[\lambda^T x - \mu]} = \frac{1}{1 + \exp[-\lambda^T x + \mu]} = \frac{1}{1 + \exp[-\xi(x)]}. \qquad (A9)$$
Conditional distribution of input x, given output y:

$$p(x|a \wedge y) = \frac{\exp[\lambda^T(x \wedge y) - \mu y]}{\sum_{x' \in X}\exp[\lambda^T(x' \wedge y) - \mu y]}. \qquad (A10)$$
Conditional distribution of input x, given no output decision (y = 0):

$$p[x|a \wedge (y = 0)] = \frac{1}{2^n}. \qquad (A11)$$
Conditional distribution of input x, given output decision y = 1:

$$p[x|a \wedge (y = 1)] = \frac{\exp[\lambda^T x - \mu]}{\sum_{x' \in X}\exp[\lambda^T x' - \mu]} = \frac{\exp[\lambda^T x]}{\sum_{x' \in X}\exp[\lambda^T x']}. \qquad (A12)$$
ACKNOWLEDGMENTS

My gratitude to my wife for taking up more than her share of the load in keeping things running smoothly at home while I was distracted from the family with the writing of this chapter. Also, my gratitude to The Johns Hopkins University Applied Physics Laboratory for its support through the Janney Fellowship in compiling this chapter. And finally, my sincere thanks to Myron Tribus for his thoughts and inspirations that assisted in the final preparation of this chapter.
REFERENCES

[1] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. Univ. of Illinois Press, Urbana, IL, 1949.
[2] R. T. Cox. Probability, frequency, and reasonable expectation. Amer. J. Phys. 14:1-13, 1946.
[3] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev. 106:620-630, 1957.
[4] R. T. Cox. The Algebra of Probable Inference. Johns Hopkins Univ. Press, Baltimore, MD, 1961.
[5] M. Tribus. Rational Descriptions, Decisions, and Designs. Pergamon Press, New York, 1969.
[6] R. T. Cox. Of inference and inquiry. In Proceedings of the 1978 Maximum Entropy Formalism Conference, pp. 119-167. MIT Press, Cambridge, MA, 1979.
[7] J. A. Wheeler. Information, physics, quantum: The search for links. In Proceedings of the 1988 Workshop on Complexity, Entropy, and the Physics of Information, pp. 3-28. Addison-Wesley Publishing, Redwood City, CA, 1989.
[8] E. L. Post. Introduction to a general theory of elementary propositions. Amer. J. Math. 43:163-185, 1921.
[9] L. Tsu. Tao Te Ching. Random House, New York, 1972. (A contemporary translation by G.-F. Feng and J. English. Originally dated to around the sixth century B.C.)
[10] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1984.
[11] R. W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. Inform. Theory 37:466-475, 1991.
[12] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, New York, 1992.
[13] R. L. Fry. Observer-participant models of neural processing. IEEE Trans. Neural Networks 6:918-928, 1995.
[14] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15:267-273, 1982.
[15] G. Spencer-Brown. The Laws of Form. E. P. Dutton, New York, 1979.
Neural Networks Applied to Data Analysis Aarnoud Hoekstra
Robert P. W. Duin
Martin A. Kraaijveld
Pattern Recognition Group Faculty of Applied Physics Delft University of Technology 2600 GA Delft, The Netherlands
Pattern Recognition Group Faculty of Applied Physics Delft University of Technology 2600 GA Delft, The Netherlands
Shell International Exploration and Production Research and Technical Services 2280 AB Rijswijk, The Netherlands
I. INTRODUCTION

The analyst who has to find an automatic data classifier has to solve a set of problems. On one hand there are the data, objects, which have to be classified as well as possible; on the other hand, one has numerous classifiers to choose from. The question that arises is: How should one choose the best classifier for the given set of objects? There is no general recipe for training a neural classifier on a data set. The best one can do is to obtain as much information as possible from the data available. This information can then be matched against the capabilities of the classifier one has chosen. The techniques presented in this chapter can be used to study the neural classifier in relation to the data it has to classify. Neural networks have significantly increased the possibilities for analyzing and classifying data sets. The adaptive nonlinear property of these networks is of primary importance: with it they can be used for describing and generalizing almost any data set that can be represented in a continuous space. This property is directly related to the fact that neural networks can be given an arbitrary complexity by their architectural design, for example, the number of neurons and the number of layers, provided that the learning scheme is sufficiently powerful. Therefore in this chapter we will study how the properties of neural network classifiers can be used to analyze data.
The chapter will start off with an overview of the different techniques available in the literature for the analysis of unlabeled data. This includes not only neural techniques but also nonneural techniques that are generally accepted. These techniques give an insight into the complexity of the available data. However, one is also interested in how labeled data can be analyzed. Moreover, if one has labeled data, how can these data be investigated so as to apply the appropriate (neural) classifier? Section III introduces a technique with which we can investigate the separability of the available data. By visualizing the scatter of the data in the feature space, it is possible to gain insight into its separability. The method maps the high-dimensional space onto a two-dimensional plane. This enables us to visually inspect the high-dimensional space and choose an appropriate classifier.

However, in practice many classifiers will solve a classification problem after a separability analysis has been applied. Therefore, in Section IV a technique is introduced that can be used for classifier selection. By estimating the reliability, or confidence, of a classifier, a selection can be made. A classifier that has a poor confidence, that is, a low probability of correct classification, is likely to be rejected as a possible solution. Comparing different classifiers, either by confidence or by test set performance, requires an independent set that can be used for testing. Generally these sets are not available; therefore validation sets are used for comparing the different classifiers. This implies that a method is required to generate these sets; Section IV also introduces a technique to generate such sets. An advantage of such validation sets is that they share the same characteristics as an independent test set.

A next step is to train a classifier on a given classification problem. Neural network classifiers are well known for their ability to solve almost any classification problem. This constitutes a problem in training a classifier on a particular task: if the classifier is trained too long, it may not generalize at all. Section V introduces a measure that can be used to keep track of the network's generalization behavior during training. This behavior is strongly determined by the complexity of the classifier, for example, the number of hidden units, in relation to the data it is trained on. By examining this measure, the appropriate moment to stop training the classifier can be determined.

The last section goes into the problem of classifier stability. During the training of a classifier its decision regions are frequently changing. This behavior can be characterized by the stability of a classifier: the more stable the resulting classifier, the better suited it is to solving a particular problem.

The general line of this chapter follows the solution of a given classification problem with data by the following steps. Analyze the data using general techniques to get an indication of their complexity. Next, analyze the separability to be able to select an appropriate classifier. Having several appropriate classifiers, select from these the ones that have the most confidence. In training a classifier on a particular classification problem, keep track of its generalization behavior and stability to yield an optimal solution.
II. DATA COMPLEXITY

The most basic form in which data can be observed is as unlabeled data. This type of data is obtained, for instance, from measured signals belonging to unknown categories. Several new techniques have been developed to gain insight into such data. In this section an overview will be presented of the different methods. It is not our intention to give a full and exhaustive list of algorithms; we will merely give an indication of how they can be placed in the entire field. Numerous methods can be found in the literature [1-14]. They belong to one of the main categories we distinguish below. Although the techniques were mainly developed for unlabeled data, they can be applied to labeled data as well.

One can distinguish between algorithmic and descriptive complexity. Algorithmic complexity describes the methods in terms of the computing power needed to perform the algorithms: the more power that is needed, the more complex a method is. However, this complexity measure depends on the implementation of a method and the available computing power, and is therefore a subjective one. Descriptive complexity is concerned with describing the method in terms of abilities, for instance, generalization capacity and the number of free parameters. We distinguish three main categories to which the data analysis algorithms may belong:

1. Prototype vector methods
2. Subspace or projection methods
3. Density estimation methods

The distinction is based on an increasing complexity of the methods, starting with the techniques with the lowest complexity. Our ordering is based on descriptive complexity, specifically the amount of data needed: the more data (objects) are needed to perform an algorithm accurately, the more complex it is. For example, to estimate the density of a set of objects and their a priori probabilities, far more objects are needed than for finding the cluster centers of the objects. Now the different categories are studied in more detail.
A. PROTOTYPE VECTORS
As indicated, methods belonging to this category have the lowest descriptive complexity, that is, in terms of data need. The prototype methods are also known as vector quantization algorithms and determine cluster centers in the data. The available objects can be divided into (separate) clusters, and each of these clusters can be represented by a single vector called a prototype. Depending on the method, the number of prototypes is either fixed or can be increased during
learning. In general, the more prototypes that are used, the better the data can be represented. However, too many prototypes may lead to an "overrepresentation," the extreme case being that in which each object is represented by its own prototype. The problem when applying these algorithms is how to choose the number of prototypes. In the literature many vector quantization algorithms can be found. We distinguish the following four algorithms:

Vector Quantization [15]

This algorithm starts with a fixed number of randomly initialized prototypes. At each update step, the prototype that is closest to the object under investigation is updated toward this object. This results in a set of prototypes that are centered in the clusters present in the data (a small code sketch of this update is given at the end of this subsection).

Isodata/K-Means Clustering [1,14,16]

These methods are basically the same as vector quantization. They differ from the previous method in the sense that they apply batch updating. At each update step the available objects are assigned to a prototype. This prototype is thereafter updated by taking the mean of the feature vectors of the objects assigned to it.

Self-Organizing Map [10,15,17]

This method is more complex than the previous ones because it takes the neighborhood relations between the different prototypes into account. This implies that more data are needed for an accurate performance of the algorithm. The self-organizing map is described in more detail in one of the next sections.

The ART Network [18-20]

This method is similar to the vector quantization method, although it is described in the literature in a rather complicated way. It differs from vector quantization in the sense that it is able to increase the number of prototypes, at least up to a certain limit. An object that is fed to the network is either added to a matching prototype or rejected, that is, no prototype was available to assign the object to. If the object is rejected, a new prototype can be created.

There are also more traditionally oriented, that is, nonneural, techniques used to gain insight into data complexity. We can distinguish two types of algorithms, although in the literature variations can be found:
Hierarchical Clustering [1,14,16]

This type of data analysis algorithm builds a tree structure (dendrogram) of the objects available. The tree is built according to rules that state how objects can be clustered. This tree can be inspected visually to get an indication of how the data are clustered.

Minimal Spanning Tree [1,14]

Minimal spanning trees are pruned versions of cluster trees such that the neighbor relations between objects are preserved.

The methods presented in this section have a low complexity. In general not many objects are needed for an accurate performance of the algorithms [8].
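The following minimal sketch (referred to above under Vector Quantization) shows the competitive update in which the closest prototype is moved toward the presented object. The prototype count, learning rate, and all names are assumptions made for illustration.

```python
import numpy as np

def vq_train(objects, n_prototypes=4, rate=0.1, epochs=20, seed=0):
    """objects: (m, d) array; returns n_prototypes cluster-center vectors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(objects), n_prototypes, replace=False)
    protos = objects[idx].astype(float)        # initialize on random objects
    for _ in range(epochs):
        for x in rng.permutation(objects):
            winner = np.argmin(np.linalg.norm(protos - x, axis=1))
            protos[winner] += rate * (x - protos[winner])  # move toward object
    return protos
```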
B. SUBSPACE OR PROJECTION METHODS
Subspace approaches differ from prototype algorithms in that they try to determine which components in the data are important. In contrast to vector quantization methods, subspace or projection algorithms describe the data in terms of a reduced set of components. This requires more objects: the number of objects needed must be at least as large as the dimension of the subspace. For example, to locate a two-dimensional subspace in a three-dimensional space, we need at least three objects. Here five methods are described; more can be found in the literature:
Karhunen-Loeve Transform [3]

The Karhunen-Loeve transform is a linear mapping method that describes the data by using principal components. These principal components are derived from the available objects by transforming them to a space of lower dimensionality. By determining the eigenvalues and eigenvectors, a description of the data is obtained. The larger an eigenvalue, the more important that direction is for the description of the data.

Oja Neuron [21,22]

This is also a linear mapping method; it is the neural variant of the Karhunen-Loeve transform. As in the previous method, principal components are derived. In contrast to the Karhunen-Loeve transform, a learning algorithm is used to find the principal components.
Sammon's Projection [23]

This method preserves all of the neighbor relations between the objects as much as possible, in contrast to the methods from the previous subsection, where a minimum number of neighbor relations are preserved. Sammon's method is explained in more detail in Section III.

Self-Organizing Map [10,15,17]

The self-organizing map is not only a vector quantization method; it is also a nonlinear projection method. Because neighborhood relations are preserved by using a fixed, usually two-dimensional, grid of prototype vectors, a mapping is performed onto a two-dimensional space. By this type of mapping, possible subspace properties of the data are revealed. More details regarding this type of mapping algorithm can be found in Section III.

Auto-Associative Feedforward Neural Network [3,7]

In contrast to the self-organizing map, the auto-associative network is able to project from a high-dimensional space onto an arbitrary lower dimensional space. This is done using a feedforward neural network with a "bottleneck": the number of neurons in the hidden layer is much smaller than the number of input neurons. By using auto-associative learning (that is, the network is taught to reproduce the input), the network is forced to compress the high-dimensional input into the bottleneck. Another major difference with the self-organizing map is that this network performs a global mapping, whereas the previous method preserves local neighborhood relations.

As stated earlier, these methods require more objects than vector quantization methods, and are therefore more complex.
C. DENSITY ESTIMATION METHODS

The last category of methods for data analysis is density estimators. This category contains the most complex methods: to estimate the densities in the data, many objects are needed. We distinguish the following three methods:

Radial Basis Functions [3,9]

This type of neural network uses a Gaussian function as the activation function in the hidden units. The output is constituted by a linear combination of the Gaussian functions in the hidden layer, thus resulting in a global density estimation of the
data. Because this method makes a global estimate of the data density distribution (in the sense that a kernel encloses several objects, in contrast to Parzen), it is less complex than the Parzen method.

Parzen Method [6,24]

In Parzen estimation, a superposition of Gaussian kernels located on the objects is used as the density estimate. These kernels are combined by linear summation, as in the radial basis functions, to find a probability density function of the objects. This method is very accurate but requires a huge number of objects, especially when estimating in a high-dimensional space.
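A minimal sketch of the Parzen estimate described above, assuming an isotropic Gaussian kernel; the kernel width h is a free parameter not specified in the text.

```python
import numpy as np

def parzen_density(query, objects, h=0.5):
    """Estimate p(query): average of Gaussian kernels centered on the objects.
    objects: (m, d) array; query: length-d vector; h: assumed kernel width."""
    d = objects.shape[1]
    sq = np.sum((objects - query) ** 2, axis=1)       # squared distances
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)        # Gaussian normalization
    return float(np.mean(np.exp(-sq / (2.0 * h ** 2))) / norm)
```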
Boltzmann Neural Networks [25]

This type of network is a so-called stochastic network. The objects are fed to the network, and after learning, the neurons in the network represent the probability distribution of the objects. A drawback, however, is that the Boltzmann machine can only handle binary coded inputs.

The methods described in the previous sections can be used to analyze the given data. Depending on the degree of analysis, one of the methods from a category may be used. However, the accuracy of a method is restricted by the number of objects available. If, for instance, only a few objects are available, a vector quantization might be more informative than a Parzen estimation, which requires a lot of objects. This section has introduced some general methods available in the literature that can be used to gain insight into the data. Depending on the information one needs to build a (neural) classifier, any one of these methods can be chosen. However, one must bear in mind that the number of available objects restricts the applicability of the different methods: the more objects that are available, the more information that can be extracted. Prototype vector approaches are suited to investigating whether the data are clustered in one way or another, whereas projection methods map the data onto a space of lower dimensionality and thus extract information on how the data are located. The last class of methods gives insight into the distribution of the data in terms of probability density estimations: it gives an indication of the probability of finding an object at a particular place. The next step is to find out how separable the data are. The more separable a set of objects is, the less complex the classifiers that can be used to solve the problem. In general, less complex classifiers are easier to find and analyze. The next section will deal with this separability problem.
III. DATA SEPARABILITY

As illustrated in the previous section, there are many methods for analyzing unlabeled data. However, in practical situations one is usually dealing with a classification problem, which implies the use of labeled data. In this section we will focus on the problem of data separability. A method will be discussed for estimating the separability of the data. Knowing the separability of the data enables one to choose an optimal classifier that solves the problem, or at least one that is better than a random choice. Some data sets are more complex in the sense that they need a more complex classifier for classification. More complex classifiers are, for instance, neural classifiers with a large number of hidden units, or nearest-neighbor classifiers with a small k.

Let us consider the following classification problem. Figure 1 shows two classes with equal covariances but different means, which are separated using the k-nearest-neighbor classifier, that is, a classifier that takes the k nearest neighbors around an object to determine its class membership. These two classes are separable with a small error, and therefore this does not appear to be a hard data set to separate. However, if we increase the variance of both classes in the y direction (Fig. 1) and keep the number of objects fixed, the problem becomes more complex. Applying the nearest-neighbor method to these classes again now results in a classifier that performs worse, although the increase in the variance does not contribute to a better separation. The classes, however, are still separable with a small error. Figure 2 shows how the error depends on k for the different, stretched data sets. The error made by the classifier differs for each set, although the discriminating direction of the data set (the x direction) does not change. Consequently, from the viewpoint of the classifier, the data have become more complex, inasmuch as they appear to be harder to separate. Therefore, knowing the separability of the data will give an indication of which classifier to use (k in Fig. 2). For a large variance a small k should be selected, whereas small variances require a relatively large k.

Another complexity problem, related to separability, is the problem of dimensionality. In two-dimensional problems, like the one in Fig. 1, it is easy to see how separable the data are. In higher dimensions it is not so clear how the data are distributed. To determine this, a method for inspecting the space is needed. The remainder of this section is devoted to the problem of visualizing high-dimensional data such that their separability can be investigated. It will be shown that a high-dimensional data space can be mapped onto a two-dimensional plane such that the separability of the data can be visualized. A nonlinear mapping method, the self-organizing map (SOM), is used to map the space onto a grid. This method is topology preserving, that is, neighbor relations in the high-dimensional space are preserved, as much as possible, in the lower dimensional grid. This property is used to inspect the scatter of the data.
Figure 1 Two classes (circles and plusses) with equal covariances but different means, separated using the k-nearest-neighbor classifier.
Figure 2 The curves show the effect on the complexity of the data as "seen" by the k nearest neighbor classifier. The larger the variance, varying from 0.01 to 100, the more complex a data set seems. The error of the classifier approaches 1 because of the leave-one-out method used to estimate the performance.
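The experiment behind Figs. 1 and 2 can be sketched as follows: two Gaussian classes are stretched in the y direction and the k-nearest-neighbor error is estimated with the leave-one-out method. Class sizes, variances, and all names are illustrative assumptions rather than the exact settings used for the figures.

```python
import numpy as np

def knn_loo_error(X, labels, k):
    """Leave-one-out error of the k-nearest-neighbor rule (binary labels)."""
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                               # leave the object itself out
        nearest = labels[np.argsort(d)[:k]]
        if np.round(nearest.mean()) != labels[i]:   # majority vote
            errors += 1
    return errors / len(X)

rng = np.random.default_rng(0)
stretch = 10.0                                      # variance in the y direction
a = rng.normal([-1, 0], [1, np.sqrt(stretch)], size=(100, 2))
b = rng.normal([+1, 0], [1, np.sqrt(stretch)], size=(100, 2))
X = np.vstack([a, b])
labels = np.array([0] * 100 + [1] * 100)
print(knn_loo_error(X, labels, k=5))
```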
A. MAPPING ALGORITHMS

The Self-Organizing Map

The self-organizing map (SOM) [10,15,17] is a well-known method for mapping a high-dimensional feature space onto a low-dimensional one. In general the feature space is mapped onto a two-dimensional rectangular grid of interconnected neurons (see Fig. 4). This makes the SOM both a feature extractor and a clustering method. The map is trained in an unsupervised manner using competitive learning. Each neuron in the low-dimensional grid (usually two-dimensional) is assigned a feature vector representing a part of the feature space (Fig. 3). The training of the map is started by initializing each neuron in the grid with a small random feature vector, or weight, m. In every step of the learning process this weight is updated in such a way that it represents a part of the feature space as
Figure 3 A neuron from the rectangular grid of the self-organizing map.
accurately as possible. Three steps in the learning procedure can be distinguished:

1. Presentation of a randomly selected object from the input feature space
2. Evaluation of the network
3. Updating of the weight vectors of the network

The first step is to randomly select an input object, although in practice the procedure is such that all objects are used at least once. This object is then mapped by the network. Each unit in the grid computes the distance of its weight vector m_i to the input vector x by using the Euclidean distance (Fig. 3). The neuron having the least distance is selected as the "winner" s:

$$\|x(t) - m_s(t)\| = \min_i\{\|x(t) - m_i(t)\|\}. \qquad (1)$$
Note that the iteration step is indicated with t. The winning neuron is used as a starting point for updating all of the other neurons in the grid. All neurons within a certain neighborhood N_c(t) of the winning neuron (for a graphical interpretation of N_c(t), see Fig. 4) receive an update toward the presented input object. The update rule is as follows:

$$m_i(t + 1) = \begin{cases} m_i(t) + \alpha(t)[x(t) - m_i(t)], & \text{if } i \in N_c(t) \\ m_i(t), & \text{if } i \notin N_c(t). \end{cases} \qquad (2)$$
The neighborhood size N_c(t) decreases monotonically in time; fewer neurons around the winning neuron must be updated because the objects are better represented each time an update is performed. The learning parameter α(t) in Eq. (2) linearly decreases in time to ensure that after a certain number of steps the network stops training. In the original formulation of the network [10,15], a neighborhood parameter h_si(t) was used. This h_si(t) is, in contrast to N_c(t), a continuous value modeled by a Gaussian that has a kernel shrinking in time:

$$h_{si}(t) = h_0(t)\exp\left(-\frac{\|r_i - r_s\|^2}{\sigma(t)^2}\right), \qquad (3)$$
Figure 4 The two-dimensional grid of neurons in the self-organizing map.
where r_i and r_s are the coordinates of the neurons in the grid, and ||r_i − r_s|| is a suitable distance measure between those neurons. For instance, if the coordinates are numbered as in Fig. 4, the distance between i and s would be the maximum coordinate difference max_j |r_{ij} − r_{sj}|. The size of the Gaussian function, determined by σ(t)², decreases as the network reaches convergence. The parameter h₀(t) has the same role as the learning parameter α(t) in Eq. (2). The update rule stated in Eq. (2) now changes into

$$m_i(t + 1) = m_i(t) + h_{si}(t)[x(t) - m_i(t)]. \qquad (4)$$
After training, the network has reached convergence (see [26-30]) and can be used for classification. Here we will use the final values of the neuron weights to compute the interneuron distances [31]. This will make it possible to visualize the scatter of the data set.

Sammon's Projection

Sammon's projection [23] is a nonlinear projection method for mapping a high-dimensional space onto a space of lower dimensionality. The algorithm maps the original space onto the projection space in such a way that the interobject dis-
tances in the original space are preserved in the projected space, just as in the SOM. However, experiments show that Sammon's algorithm usually outperforms any other mapping algorithm [32]. Let us denote the distance between an object i and an object j in the feature space by d*_{ij}, and let d_{ij} be the distance between the same objects in the projected space. Sammon's algorithm tries to minimize the following error term, that is, the distortion of the projection:

$$E = \frac{1}{\sum_{i<j} d^*_{ij}}\sum_{i<j}\frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}. \qquad (5)$$
This equation can be minimized by using, for instance, a gradient descent, as proposed by Sammon [23]. Although in principle Sammon's projection is not neural network oriented, it can be modeled by using a neural network, as shown in [33].
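A direct transcription of the error term of Eq. (5) into code is given below; in practice one would hand this function to any gradient-based minimizer acting on the projected coordinates Y. The condensed upper-triangle layout of the distances is an implementation choice of ours, not part of the original formulation.

```python
import numpy as np

def sammon_stress(D_star, Y):
    """Sammon distortion, Eq. (5).
    D_star: pairwise distances in the original space, condensed upper
    triangle (length m*(m-1)/2); Y: projected coordinates, shape (m, 2)."""
    m = Y.shape[0]
    iu = np.triu_indices(m, k=1)
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)[iu]
    c = D_star.sum()                                 # normalization constant
    return float(np.sum((D_star - D) ** 2 / D_star) / c)
```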
B. VISUALIZING THE DATA SPACE

Two methods for mapping a high-dimensional space onto a low-dimensional one have been introduced. However, to be able to display the structure of the feature space, the visualization of the feature vectors is modified: when inspecting the SOM, we do not display the feature vector values but the feature vector distances. By displaying the distances between the feature vectors of neighboring neurons, one is able to get an indication of how the feature vectors of the objects are distributed. Figure 5 shows a data set (for a description of this set see the experimental section) as it is mapped onto the SOM grid. The distances between the neurons in the grid are depicted as gray values. Because each neuron has eight surrounding neighbors, at least in the two-dimensional grid, eight distances are calculated. The gray value that is finally displayed is the maximum of these distances. Large distances result in a high gray value, and consequently a bright "color" in the map, whereas small distances result in dark areas in the image. The picture clearly shows a white area across the grid. This indicates that the data are clustered in at least two separable clouds of objects. Note that because of the high dimensionality of the feature space (ten-dimensional) and the low dimension of the network grid (two-dimensional), the network is not able to show all of the differences between the clusters. These kinds of images can be obtained from any trained SOM by using the distances of neighboring neurons in the grid [31]. Because the mapping algorithm is topology preserving, neurons that are neighbors in the grid are also expected to represent neighboring clusters of objects. Neurons in the grid at a large distance from each other, in terms of Euclidean distance, are expected to be separated by a large distance in the feature space. Therefore, the SOM can be used to visualize the data separability.
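A sketch of the display computation described above: for each neuron, take the maximum weight-vector distance to its (up to) eight grid neighbors and use it as the gray value. The array layout assumes the (rows, columns, features) shape produced by the SOM sketch given earlier.

```python
import numpy as np

def interneuron_image(m):
    """m: trained SOM weights, shape (gy, gx, d); returns a (gy, gx) image."""
    gy, gx, _ = m.shape
    img = np.zeros((gy, gx))
    for i in range(gy):
        for j in range(gx):
            dists = [np.linalg.norm(m[i, j] - m[i + di, j + dj])
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di or dj) and 0 <= i + di < gy and 0 <= j + dj < gx]
            img[i, j] = max(dists)   # bright pixels mark cluster boundaries
    return img
```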
Figure 5 Mapping of the ten-dimensional data on a two-dimensional SOM grid. The data are divided into two parts, just as would be expected.
C. EXPERIMENTS

To judge the method introduced in the previous section, several experiments were conducted. Here we will examine three of them, two conducted on artificial data sets and one conducted on a real-life data set; more experiments are mentioned in the tables. Further experiments and their results can be found in [31]. The two artificial data sets were used to verify the method because the properties of these sets are well known. The three sets are:

Data set 1: A ten-dimensional data set consisting of two standard normally distributed clusters of 500 objects each. Both sets have the identity matrix as covariance, and have means of 1 and −1, respectively. They have a small overlap (only 0.078% Bayes error) and are therefore suited to displaying the mapping property of the SOM.

Data set 2: A three-dimensional artificial data set as depicted in Fig. 6. The classes have an equal number of objects (500) and were generated by a standard procedure [31]. As can be observed from the figure, the data have an intrinsic dimension of almost 2; therefore it is expected that they map very well onto the two-dimensional network grid.
Figure 6 The data set in three dimensions, with an intrinsic dimensionality of 2.
Data set 3: A real-life data set consisting of the IRIS [34] data. These data are four-dimensional and consist of 150 objects in three classes; each class has an equal number of objects (50).

For each of the data sets, 10 different SOMs with different initializations were trained, in order to obtain statistics about the SOM performance. Each of these maps consisted of 100 × 100 units. Before training, all the objects were randomized to cancel out any preferences in the data. Randomization was also performed to be able to cyclically train the data. Furthermore, the maps were trained for 100,000 epochs. The exact experimental conditions can be found in [31].

Figures 5, 7, and 8 show the mapping results for data sets 1, 2, and 3, respectively. The artificial data sets are clearly mapped onto the two-dimensional SOM grid. The white areas denote a large distance between neighboring neurons; the darker an area, the smaller the distance between two neighboring neurons. Figure 5 shows a bright white line across the map. This line separates the feature space into separate clusters, which is in agreement with the data, since they consist of two ten-dimensional normally distributed clusters. The same holds for data set 2, which was intrinsically almost two-dimensional. Figure 7 shows that the SOM grid represents the two clusters very well, and indeed extracts their two-dimensional nature. More interesting is to investigate the behavior of the SOM on a real-life data set such as the IRIS set. Figure 8 shows the mapping of the IRIS data. Remarkably, only two clusters are detected, whereas the data can be classified into three different classes. When we have a look at the Sammon mapping, shown in Fig. 9, it turns out that two classes are mapped close to each other. Because the Sammon mapping has less distortion than the SOM [32], it can be expected that the SOM shows only two clusters.
Figure 7 Mapping of the three-dimensional data set on the SOM grid. The intrinsic two-dimensional nature of the data is clearly extracted. The data can be found in Fig. 6.
Figure 8 Mapping of the IRIS data on a two-dimensional SOM grid. The data are divided into two parts, whereas the number of classes is three.
Figure 9 Mapping of the IRIS data set using Sammon's mapping method (+: iris setosa; filled dots: iris versicolor; triangles: iris virginica). Reprinted with permission from M. A. Kraaijveld, J. Mao, and A. K. Jain, IEEE Trans. Neural Networks 6:548-559, 1995 (©1995 IEEE).
Table I summarizes the distortion of the two mappings for the three data sets. Sammon's projection generally results in less distortion when mapping from the high dimension onto the lower one. However, if the mappings are used for classification, it turns out that the SOM preserves the classes much better than Sammon's mapping. Tables II and III show the classification results for the mappings.

In conclusion, the proposed method indeed shows how separable a set of objects is. However, because of the high dimensionality of certain spaces, information is lost during the nonlinear mapping.
Table I  The Average Sammon Distortion (in %) for the Different Data Sets

    Set   Description                                  SOM          Sammon
    1     Ten-dimensional separated normal clusters    34.6 (1.2)   6.0 (0.1)
    2     Two elongated clusters in 3D space           0.51 (0.0)   2.7 (1.2)
    3     IRIS data                                    3.4 (0.4)    0.6 (0.1)
    —     16 texture data                              42.0 (6.9)   5.7 (0.1)
    —     SOX data                                     49.3 (2.2)   5.3 (0.4)

Standard deviations are in parentheses.
Table II  The Average Error of the Nearest-Neighbor Classifier (in %), Estimated with the Leave-One-Out Method and Its Standard Deviation

    Set   Description                        Input space   SOM projection   Sammon projection
    1     Ten-dimensional normal clusters    0.2           0.3 (0.1)        0.3 (0.1)
    2     Two clusters in 3D space           0.0           0.0 (0.0)        0.3 (0.4)
    3     IRIS data                          4.0           4.1 (0.7)        5.5 (1.8)
    —     16 texture data                    3.6           4.6 (0.9)        27.0 (2.5)
    —     SOX data                           4.4           4.9 (2.2)        11.8 (4.0)

Standard deviations are in parentheses.
The advantage of the visualization method is that it enables one to actually look into the high-dimensional feature space. Neurons separated by a large distance in the map represent objects separated by a large distance in the feature space. The experiments shown in this section indeed confirmed this. Furthermore, the mapping distortion of the SOM is generally worse than that of Sammon's method (see Table I), but when the map is used as input to a classifier, it performs better. The method introduced in this section enables us to visually inspect labeled data located in some feature space. By visual inspection one gets an indication of the separability, and therefore an appropriate classifier can be selected to classify the data.
Table III  The Average Error of the Nearest Mean Classifier (in %), Estimated with the Leave-One-Out Method and Its Standard Deviation

    Set   Description                        Input space   SOM projection   Sammon projection
    1     Ten-dimensional normal clusters    0.1           3.1 (3.5)        0.1 (0.1)
    2     Two clusters in 3D space           13.3          13.5 (0.1)       13.4 (1.2)
    3     IRIS data                          8.0           6.6 (1.8)        7.7 (1.3)
    —     16 texture data                    6.5           8.8 (1.4)        28.4 (0.9)
    —     SOX data                           4.4           12.9 (2.2)       16.0 (5.4)

Standard deviations are in parentheses.
For instance, in the case of the IRIS data set, it can clearly be seen that two large clusters exist. This indicates that a simple classifier can be used to separate "setosa" from "versicolor" and "virginica" simultaneously, but a more complex classifier is needed to separate "versicolor" from "virginica" (see also Fig. 9). The next section introduces the concept of confidence values. This enables us, for instance, to select the best classifier from a set of appropriate classifiers.
IV. CLASSIFIER SELECTION

In the previous section we introduced a method for estimating and inspecting the separability of labeled data. Suppose one has found a classifier (for example, a neural network) to separate these data in the feature space. In general there are many classifiers that can be used to build a function that separates the data, and a selection must be made from among them. The problem, however, is deciding on the criterion for selection. In general, a classifier is selected by looking at the empirical error: select the classifier with the lowest empirical error, or the one that gives the most reliable estimate of the empirical error.

In this section a technique is introduced that can be used to select a classifier based on the reliability of its classifications. This reliability, referred to as confidence, can be used to judge the classifier's decisions. By evaluating the levels of confidence for the different classifiers, the one in which we have the most confidence can be selected as our final classifier. This section only introduces the tools for classifier selection, because at this moment there are no actual results on classifier selection; we do show some experimental results in which the confidence levels were used as a reject criterion. Confidence values are not restricted to classifier selection, but can also be used to reject objects in order to increase the performance of a classifier. The confidence can also be used to construct new, multistage classifiers [35-37], based on the confidence assigned to an object by different classifiers. Note that the concept of confidence values holds for a particular classifier and data set; it is highly problem dependent.

This section is divided as follows. First the concept of confidence values is introduced (what exactly do we define as a confidence value, and how can it be computed?). The following subsection deals with the different methods for estimating confidence: four methods will be introduced that can be used to estimate the confidence of a network classification. These different estimators must be tested by using a test set. However, this constitutes a problem, because independent test sets are hard to get or are not available at all. Subsection IV.C therefore introduces a method for generating validation sets that share the same characteristics as independent test sets. The section ends with the results obtained from several experiments.
A. CONFIDENCE VALUES

When a trained neural network is used to classify an unknown object, one would like to know the reliability of the classification, that is, how certain we are that the network classification is correct [6, 38-40]. In this section it is shown how the probability that an estimate of the class label is correct can be estimated. The reliability, or confidence, and its corresponding estimators will be introduced next.

Suppose that a network implements a discriminant function $S: x \mapsto \hat{\omega}$, which assigns a class label $\hat{\omega}$ to an object $x$. For each object $x$ the classification based on $S$ is either correct or incorrect. Therefore, a correctness function $C(x, \omega)$ is defined:

$$C(x, \omega) = \begin{cases} 1 & \text{if } S(x) = \omega, \\ 0 & \text{otherwise,} \end{cases} \tag{6}$$

where $\omega$ is the true class label of object $x$. When an unknown object is classified, the confidence of the estimate of the class label is the probability that the classification is correct. Given a classifier (a feedforward neural network classifier in our case), for an (in general unknown) object the confidence $q(x)$ is given by

$$q(x) = P\big(C(x, \omega) = 1 \mid x\big). \tag{7}$$

The classification confidence is thereby the a posteriori probability of the estimated class. For known distributions and a given $S$, only the errors due to the class overlap as well as those due to imperfect training influence $q(x)$; therefore it can be expressed as

$$q(x) = \frac{P_{\hat{\omega}} f_{\hat{\omega}}(x)}{f(x)}, \tag{8}$$

where $f_{\hat{\omega}}$ is the class conditional density function for class $\hat{\omega}$ (the class with the highest a posteriori probability) as a function of the object $x$, and $P_{\hat{\omega}}$ is the a priori class probability. The function $f(x)$ denotes the joint probability density for $x$, $f(x) = \sum_{\omega} P_{\omega} f_{\omega}(x)$. However, in practice the distributions of the objects are unknown; consequently, the confidence $q(x)$ must be estimated somehow. The next section introduces several possibilities for estimating $q(x)$.

Assume that $q(x)$ can be estimated by some $q_i(x)$. Because many different $q_i(x)$ are possible, a criterion function is needed to compare them. This criterion function $J_i$ is defined as follows:

$$J_i = \frac{1}{N_T} \sum_{x \in T} \big[ C(x, \omega) - q_i(x) \big]^2, \tag{9}$$
where $T$ is an independent test set, $N_T$ is the number of objects in the test set, and the index $i$ indicates the $i$th estimator used for $q(x)$. The lower bound of $J_i$ is usually not zero, because of class overlap. However, when the underlying class densities are known, $J_{\min}$ can be determined by substituting Eq. (7) for each object in Eq. (9). The aim is to minimize $J_i$ over different estimators $q_i(x)$ for a certain independent test set, yielding the best estimator for the confidence $q(x)$.

Using an estimator for the confidence, the classifier we want can be selected. This is, of course, only useful if we have classifiers with the same performance on a test set; otherwise we would simply select the classifier with the best performance on the test set. Suppose the total confidence for a classifier is calculated by averaging $q(x)$ over all objects $x$ in the test set:

$$Q_S = \frac{1}{N_T} \sum_{x \in T} q(x), \tag{10}$$

where the symbols $N_T$ and $T$ have the same meaning as in Eq. (9). If the classifier predicts all of the objects perfectly, $Q_S$ will be 1, because every $q(x) = 1$. The selection of the best classifier then comes down to selecting $\max(Q_S)$ over all possible classifiers $S$.

The performance can also be boosted by using $q(x)$ as a reject mechanism: by rejecting the objects with a low confidence, the performance increases. This is important, because many applications do not allow a wrong prediction; in these cases it is better to return the object to the operator for a second classification. In the experimental section an application is described in which the confidence was used as a reject mechanism. A last possibility is to use the different $q_i(x)$ as input to train a new classifier, thus constructing, it is hoped, a more reliable classifier. We will not go into the details of such multistage classifiers; the interested reader is referred to [37].
B. CONFIDENCE ESTIMATORS

Because in real applications parameters such as the distributions of the objects and the a priori class probabilities are unknown, the confidence values must be estimated. In the previous section it was discussed how the different estimators can be compared, but not how such an estimate can be made. Two aspects of the estimation are considered: the method used to calculate the estimate, and the place in the network where the estimate is calculated. A feedforward neural network maps the input space onto the hidden layer(s) and then onto the output layer. Thereby $q(x)$ can be estimated in the input space as well as in the hidden unit space or the output space, by the methods to be described next. Consider the example of two overlapping classes that must be separated
by using a quadratic discriminant function. A neural network consisting of two input units, two hidden units, and two output units was trained on the data set depicted in Fig. 10, which also shows the optimal Bayes decision function in the input space. However, if one considers the hidden unit space, shown in Fig. 11 (left), the classes might be more easily separated, and it may therefore be advantageous to estimate $q(x)$ in that part of the network. As a result, the level of training of the network (undertrained or overtrained) determines the layer in which to estimate. For sufficiently trained networks, it may be better to estimate $q(x)$ in a layer other than the input space.

Second, there are different methods for estimating the confidence. Four methods that were used in the experiments will be introduced; they are listed after Figs. 10 and 11.
Figure 10 The data set used to train the neural network. The solid line shows the optimal Bayes decision function.
Figure 11 Here the mapping of the objects into the hidden unit and the output unit space is shown. The figure on the left shows how the objects are mapped into the hidden unit space. The figure on the right shows the mapping of the objects in the output unit space.
1. Network output
2. Nearest-neighbor method
3. Parzen method
4. Logistic method

Each of these methods has its advantages and disadvantages. In the experiments we restricted ourselves to methods 2 and 4. Method 3 has been tested extensively [41], but appeared to be outperformed by the other methods; however, the Parzen method is suited to detecting outliers and is mentioned for the sake of completeness. The methods will now be discussed in more detail.
Network Output

This method is the classical way to determine the reliability of a network. However, in general we do not know whether the network is well trained, which means that the a posteriori probabilities of the objects may not be estimated well. Moreover, neural network outputs are generally not directly interpretable as probabilities. This implies either postprocessing the network outputs such that they constitute probabilities, or adapting the learning algorithm; see, for instance, [38]. The last thing we want to do is adapt the learning rule, inasmuch as we are given the networks as they are. In the experiments we did take the network output into account, because it is a well-established method.
Nearest-Neighbor Method

Given a certain space with a defined metric, for example, the Euclidean distance, the $k$-nearest-neighbor density estimator ($k$-NN) can be used to estimate the confidence of the object under consideration:

$$q_{knn}(x) = \frac{N_{S(x)=\omega}}{k}, \tag{11}$$

where $N_{S(x)=\omega}$ is the number of correctly classified objects among the $k$ nearest neighbors, that is, the number of objects among the $k$ nearest neighbors sharing the same label as the object under consideration. The more objects "agree" about the label of $x$, the more certain we are that the label assignment is correct. If one were to take a 10-NN approach and five neighbors point to class A, whereas the remaining neighbors point to class B, the confidence of the object $x$ would be 0.5. A general rule of thumb is to use $k = \sqrt{N_{\mathcal{L}}}$, where $N_{\mathcal{L}}$ is the number of learning objects. A disadvantage of this method is that it is computationally heavy for large data sets.
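A minimal sketch of Eq. (11), with names of our own choosing:

    import numpy as np

    def knn_confidence(x, train_X, train_labels, assigned_label, k):
        # Eq. (11): fraction of the k nearest training objects that share
        # the label assigned to x by the classifier.
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(np.mean(train_labels[nearest] == assigned_label))

With k = 10 and five of the ten neighbors agreeing with the assigned label, this returns 0.5, as in the example above.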
Parzen Method

In the Parzen method, first the density functions for the classes are estimated, and those estimates are used to estimate $q(x)$. It is assumed that a fixed Gaussian kernel is used:

$$\hat{f}_{\omega}(x) = \frac{1}{N_{\omega}} \sum_{x' \in \mathcal{L}_{\omega}} \big(2\pi h_{\omega}^2\big)^{-d/2} \exp\left( -\frac{D(x, x')^2}{2 h_{\omega}^2} \right), \tag{12}$$

where $D(x, x')$ is the Euclidean distance between object $x$ and an object $x'$ from the subset $\mathcal{L}_{\omega}$, that is, the set of objects with the same label $\omega$. The dimension of the space in which $q(x)$ is estimated is denoted by $d$. Each class has a single kernel $\hat{f}_{\omega}(x)$, of which the smoothing factor $h_{\omega}$ must be estimated. This parameter can be determined by using a maximum likelihood estimation over the learning set. Having found the densities for each class, $q(x)$ can be approximated by

$$q_{\text{Parzen}}(x) = \frac{\hat{P}_{\hat{\omega}} \hat{f}_{\hat{\omega}}(x)}{\hat{f}(x)}, \tag{13}$$

where $\hat{f}(x)$ is the estimated joint density function, $\hat{f}(x) = \sum_{\omega} \hat{P}_{\omega} \hat{f}_{\omega}(x)$. Experiments show that the Parzen estimator never outperforms any of the other methods [38, 41]; therefore, it was not used in the experiments to be presented in a later section.
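Equations (12) and (13) translate directly into code. In the following sketch, priors and h are dictionaries mapping class labels to prior probabilities and smoothing factors; fitting h by maximum likelihood is assumed to have been done beforehand:

    import numpy as np

    def parzen_confidence(x, train_X, train_labels, priors, h, assigned_label):
        # Eqs. (12)-(13): a Gaussian-kernel density per class, with smoothing
        # factor h[w] (assumed fitted by maximum likelihood), then
        # q(x) = P_w f_w(x) / f(x) for the assigned class w.
        d = train_X.shape[1]
        f = {}
        for w in np.unique(train_labels):
            Xw = train_X[train_labels == w]
            dist2 = np.sum((Xw - x) ** 2, axis=1)
            f[w] = np.mean((2 * np.pi * h[w] ** 2) ** (-d / 2)
                           * np.exp(-dist2 / (2 * h[w] ** 2)))
        joint = sum(priors[w] * f[w] for w in f)
        return priors[assigned_label] * f[assigned_label] / joint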
Logistic Method

The logistic method [42] was chosen for its simplicity. It is fast, because the parameters need to be calculated only once. The logistic method models the conditional class probabilities by using a sigmoid function. For a two-class problem, the probability that $x$ belongs to the correct class is given by

$$q_{\text{logistic}}(x) = P\big(C(x, \omega) = 1 \mid x\big) = \frac{1}{1 + \exp(\beta x + \beta_0)}, \tag{14}$$

assuming equal prior class probabilities. The parameters to be determined are $\beta$ and $\beta_0$. The total number of parameters is given by the dimensionality of the space (for $\beta$) plus one for the bias $\beta_0$. They can be estimated by using a learning set and an optimization technique like Newton's method. As stated earlier, Eq. (14) only holds for the two-class case; however, it can be extended to more classes quite easily. This is done by taking one class as a base class, called $g$, and estimating all of the parameters
of the other classes with respect to that base class. Rewriting Eq. (14) for the base class approach results in

$$q_{\text{logistic}}(x) = \left[\, 1 + \sum_{i \neq g}^{N_g} \exp(-\alpha_{gi} \tilde{x}) \right]^{-1}, \tag{15}$$

where $\tilde{x} = (x, 1)$, $\alpha_{gi}$ is the $(\beta, \beta_0)$ for class $i$ with respect to the base class $g$, and $N_g$ denotes the number of classes present.

C. VALIDATION SETS

In the previous section, three estimation methods for $q(x)$ were introduced. Each of these estimators should be tested by using an independent test set (see Eq. (9)). Usually such sets are not available or are hard to derive; moreover, in most practical situations one has only a single set from which both a learning set and a test set must be drawn. A possible solution might be to use the learning set for the estimation as well, but this is an unhealthy situation: both the classifier and the estimator would be found by using the same set, which makes the estimator biased.

To overcome this situation we have studied the possibility of generating an artificial validation set. Such a validation set has roughly the same characteristics as the learning set, but is distinctive enough to act as a test set. To create a validation set, the k-NN data generation method [43] was used. The learning set is used as a starting point for creating a new set: around each object in the learning set, new objects are generated using Gaussian noise around the original object. The objects are generated in such a way that the local probability density is followed; this local probability density is estimated using the k nearest neighbors around an original object. In the experiments, k was chosen to be the dimensionality of the space plus 2. Figure 12 shows the effect of the k-NN data generation method on a learning set.

Note that a validation set will not be able to replace an independent test set entirely, but it is an acceptable alternative. Experiments show that the k-NN data generation method generates a validation set that is as reliable in selecting the estimators and the network layer as an independent test set would be. In the application described below, both a validation set and a test set were used. Many experiments not reported here can be found in [41]; these experiments show the validity of the data generation method for badly trained neural networks.
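The k-NN data generation method can be sketched as follows. The exact coupling of the noise scale to the local density is not specified in the text, so the estimate used below (the mean distance to the k nearest neighbors, scaled by the dimensionality) is our assumption:

    import numpy as np

    def knn_generate(learning_X, n_new, k=None, rng=None):
        # Around randomly chosen original objects, draw Gaussian noise whose
        # scale follows the local density, estimated from the k nearest
        # neighbors (k = dimensionality + 2 in the experiments).
        rng = np.random.default_rng() if rng is None else rng
        n, d = learning_X.shape
        if k is None:
            k = d + 2
        new_objects = []
        for _ in range(n_new):
            x = learning_X[rng.integers(n)]
            dists = np.sort(np.linalg.norm(learning_X - x, axis=1))
            scale = dists[1:k + 1].mean()   # local density estimate (assumed)
            new_objects.append(x + rng.normal(0.0, scale / np.sqrt(d), size=d))
        return np.array(new_objects)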
D. EXPERIMENTS
To test the applicability of confidence value estimation for classifier selection, several experiments were performed.
Figure 12 The effect of the k-NN data generation method. Here the validation set is denoted by o and the original set by +. The original set contained 50 objects, Gaussian distributed around (0, 0). The validation set was generated using five neighbors and consists of 150 objects.
Two experiments are reported here: one on artificial data, showing the influence of the level of training of the classifier on the place to estimate $q(x)$, and one concerned with using the confidence estimator as a reject mechanism. In the latter experiment real-world data, chromosome banding profiles, were used, and the confidence was used to improve the performance of the classifier. The first experiment uses only the k-NN method and the network output for estimating the confidence, but estimates the confidence in different parts of the network. In the second experiment the logistic method is also used, and the confidence serves as a reject mechanism. Furthermore, on the real-life problem, the validity of the k-NN data generation method was tested.

Artificial Data

The experiments with an artificial data set were done to show that in some cases it is beneficial to estimate the confidence in a layer other than the input layer. Two data sets were generated: a two-class two-dimensional problem, and a four-class six-dimensional data set.
Table IV  The Estimation Results for the k-NN Method for a Badly Trained and a Well-Trained Neural Network on the Two-Class, Two-Dimensional Data Set

    Estimator        J (well)   J (bad)
    k-NN, input      0.093      0.093
    k-NN, hidden     0.092      0.094
    k-NN, output     0.080      0.093
    Netout           0.080      0.209
    J_min            0.080      0.080
    Performance      90%        86%
On the first data set a 2-3-2 network architecture (two inputs, three hidden units, and two outputs) was trained. Each class consisted of 2000 objects, of which 1000 were used for training and the remaining 1000 for testing the estimator. The two classes have about 10% overlap, equal covariances, and different means. Note that the objects were generated independently of each other and therefore constitute a valid training set and a valid test set. The network was trained for 10 epochs, resulting in a badly trained network, and for 1000 epochs, resulting in a well-trained network. After the training was stopped, the test set was used to find the best layer in the network for estimating the confidence. In both experiments the k-NN method was used.

Table IV shows the results for the two-class experiment. The $J$ for the different layers of the network where the estimator was used was determined using an independent test set; this test set was also used to calculate the network performance. The confidence itself was computed from the training set. Because the distribution of the objects is known, $J_{\min}$ can be estimated, and the results of the k-NN estimator can be compared against this value. For the well-trained network this shows that the best layer in which to estimate $q(x)$ is the output space: because the network is trained well, it best approximates the a posteriori probabilities in the output space. This is also shown by the netout values, which are the actual outputs of the network used as estimator. For a badly trained network the selection is unclear; none of the layers shows a dominance, and it is unclear in which layer $q(x)$ can best be estimated. Using the network outputs in this case is certainly a bad decision: because of the low level of training of the network, the a posteriori probabilities are not approximated very well.

The second experiment involved a 6-15-4 architecture. Again, the network was trained such that we obtained a well-trained and a badly trained network. Because
Table V  The Estimation Results for the k-NN Method for a Badly Trained and a Well-Trained Neural Network on the Four-Class, Six-Dimensional Data Set

    Estimator        J (well)   J (bad)
    k-NN, input      0.170      0.146
    k-NN, hidden     0.175      0.154
    k-NN, output     0.171      0.149
    Netout           0.173      0.210
    J_min            0.149      0.129
    Performance      73%        55%
of the higher dimension of the data, 2000 objects for training and 2000 objects for testing were used. Again the k-NN method was applied in different layers of the network; Table V summarizes the results. Because of the high dimensionality and number of classes, the optimal performance is much lower than in the first experiment. More remarkable is the fact that, in contrast to the previous experiment, in the badly trained situation there is a clear preference for a layer (the input space). This is a consequence of the larger number of classes: in general, $J_{\min}$ will increase for more than two classes, because in regions with overlapping classes very bad classification yields lower values for $J$. However, both experiments do show that selecting a layer other than the input layer may be beneficial.

Chromosome Data

A normal human cell contains 46 chromosomes, which can be divided into 24 groups: there are 22 pairs of so-called autosomes and two sex chromosomes. The chromosomes have different features that can be used for classification, like length, centromere position, and banding pattern. In this classification problem we take only the banding pattern into account. This banding pattern is sampled and then fed to a neural network for classification. Figure 13 shows a chromosome and its corresponding banding profile.

The sampling of a chromosome band resulted in 30 features (that is, 30 samples per band), which are used for classification of a chromosome. Earlier studies indicate that this is a reasonable value to take [44]. Consequently, a neural network consisting of 30 input neurons and 24 output neurons (one for each class) was used. The number of hidden units, however, is still arbitrarily chosen and was
Figure 13 A chromosome and its corresponding profile. At the top a gray-value image of a chromosome is shown; the profile at the bottom is derived by sampling the chromosome along the line drawn in the chromosome. Note that the banding profile (the dashed line) is obtained by averaging over 38 chromosome profiles of class 10. The solid line shows a single chromosome profile of class 10.
set to 100, a number derived from earlier experiments [44]. For training the network a total set of 32,000 training objects was used; the same number of objects was used for testing the network. Furthermore, each object was scaled between 0 and 1. Network training was done for 200 epochs with momentum and an adaptive learning rate. The momentum term was set to 0.7 and the initial learning rate to 0.1; each time the mean squared error of the network increased by 10%, the learning rate was halved. After training, the network performance was 85% on the learning set and 83.5% on the test set (see also Table VI).

Two estimators plus the output of the network were used for the estimation of the confidence values. Table VI, published earlier in [45], states the different values of $J$ that were found for the different estimators. Furthermore, the k-NN and the logistic estimators were applied in different parts of the network to estimate the confidence. Table VI shows that the k-NN estimator in the output space has the best score over all of the different sets.
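The adaptive learning-rate schedule described above can be sketched as follows. This is illustrative only; train_epoch is a hypothetical stand-in for one epoch of backpropagation, and the interpretation of "increased by 10%" as being relative to the previous epoch is our assumption:

    import random

    def train_epoch(rate, momentum):
        # Hypothetical stand-in for one epoch of backpropagation with
        # momentum; returns the mean squared error (simulated here).
        return random.random()

    learning_rate, momentum = 0.1, 0.7
    previous_mse = float("inf")
    for epoch in range(200):
        mse = train_epoch(learning_rate, momentum)
        if mse > 1.10 * previous_mse:   # MSE grew by more than 10%
            learning_rate /= 2.0        # halve the learning rate
        previous_mse = mse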
Table VI  The J of the Estimators for Different Estimators and in Different Parts of the Network, Trained on the Chromosome Data Set

    Estimator           Learning set with      Validation set   Test set
                        l.o.o. est. (J)        (J)              (J)
    k-NN, input         0.167                  0.165            0.161
    k-NN, hidden        0.119                  0.119            0.119
    k-NN, output        0.086                  0.088            0.095
    Logistic, input     0.140                  0.141            0.141
    Logistic, hidden    0.692                  0.692            0.694
    Logistic, output    0.147                  0.150            0.158
    Netout              0.482                  0.479            0.474
    Avg. SD (J)         0.001                  0.001            0.001
    Class perf.         85.0%                  84.5%            83.5%

l.o.o., leave-one-out estimation method. Reprinted with permission from A. Hoekstra, S. A. Tholen, and R. P. W. Duin, "Estimating the reliability of neural network classifications," Proceedings of the ICANN 96, p. 57, 1996 (©1996 Springer-Verlag).
The logistic estimator performs worse in the output space than in the input space. Furthermore, the network's output is not very useful as an estimator for the confidence, which is due to the fact that the network was trained for only 200 epochs. The data generation method was also tested; the resulting validation set shows no advantage over using the learning set.

Now that we have the different estimators, they can be used for rejecting objects. An obvious method is to first reject the objects with the lowest confidence; this way we can avoid accepting classifications that carry a high risk of error. Considering Table VI, it was expected that the k-NN estimator would have the best performance, which is indeed the case. In Fig. 14 the performance of the different estimators is depicted. Note that the lowest performance of the network is 83.5%, obtained when rejecting 0% of the objects. The performance of the network increases quickly when rejecting objects: for a 90% performance only 13% of the objects must be rejected, that is, about 4,000 objects of the 32,000.

To summarize, the confidence values are useful for judging a classifier. By calculating the confidence levels for the objects in different parts of the classifier, an indication can be gained of the reliability of the classification made by the network.
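The reject curves of Fig. 14 can be computed as in the following sketch (function and variable names are ours):

    import numpy as np

    def performance_vs_reject(confidences, correct, reject_fractions):
        # Reject the lowest-confidence objects first and report the
        # classification performance on the remaining objects.
        order = np.argsort(confidences)            # least confident first
        kept_correct = np.asarray(correct, dtype=float)[order]
        n = len(kept_correct)
        curve = []
        for frac in reject_fractions:
            remaining = kept_correct[int(frac * n):]
            curve.append(remaining.mean() if len(remaining) else 1.0)
        return curve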
Figure 14 Network performance on the remaining test set as a function of the percentage of rejected test objects, for several estimators (1: k-NN, output; 2: netout values; 3: logistic, output; 4: k-NN, input; 5: logistic, hidden).
Furthermore, in the real-life application, chromosome banding profile classification, it was shown that the confidence level can be used as a method for improving the classifier performance: by rejecting a small number of objects, those with the lowest confidence, the performance is increased significantly. Although we have not shown any experiments on classifier selection, we do have a tool for judging a classifier, and this confidence can be used in various ways, as shown in the experiments. The next step is to train a classifier on a particular problem. This is quite hard, because we are interested in an optimal classifier, that is, one that is not overtrained. The next sections deal with this problem of classifier training.
V. CLASSIFIER NONLINEARITY

The previous sections explained how data can be analyzed and how trained classifiers can be selected for a particular problem. In this section we go into the details of classifier training; moreover, the concept of nonlinearity in classifier training will be introduced.

A strong point of neural classifiers is their ability to implement almost any classification function [46, 47]. However, this is also a weak point, because it means that one must be very careful not to select a classifier that implements a good function on the learning set but a bad one on the test set. Overtrained classifiers are expected not to generalize very well [48-50], which makes such classifiers useless for solving the classification problem [40, 51]. Overtraining can be studied by inspecting an estimate of the generalization: whenever this estimate reaches a certain value, one might consider the classifier well trained and stop further training. Here the shape of the classification function implemented by the network is related to the overtraining behavior. This will be referred to as the nonlinearity of the classifier. It is expected that overtrained classifiers show a larger nonlinearity than optimal ones. Moreover, the nonlinearity is also applicable to traditional classifiers (e.g., the k-nearest-neighbor classifier), which makes it possible to judge neural networks against traditional classifiers on given data.

The nonlinearity measure presented in this section is closely related to well-known complexity measures like the Vapnik-Chervonenkis (VC) dimension [52, 53]. The VC dimension gives a bound on the capacity of a classifier related to its architecture, for instance, the number of hidden neurons. The difference with the nonlinearity measure, however, is that the VC dimension is used without any knowledge about the data to be classified: given a network, the VC dimension can be calculated and its maximum capacity is known. Here we are interested in the actual behavior of the classifier on the data. In the next section the nonlinearity measure is introduced; this measure is quite computationally intensive, and therefore an estimation procedure is also presented. After that, a number of experiments will be studied for two-dimensional problems, as this enables us to illustrate the nonlinearities with the classification functions themselves and to show how this is related to the overtraining behavior of a neural classifier.
A. A NONLINEARITY MEASURE

As stated earlier, the classifier adapts itself to the data; therefore we relate the nonlinearity of a classifier to the data it has to classify. Consequently, nonlinearities that are not observed in the classification result are not interesting, because they do not contribute. This can be the case for the behavior of the classifier in
areas where no data are present, or on a small scale between objects but without influencing their classification. Figure 15, derived from [54], shows a set of objects and two classifiers $S_1$ and $S_2$. The behavior of both classifiers is the same for objects in a remote area of the data space; in that case, the two classifiers are interchangeable, as objects are classified the same by $S_1$ and $S_2$. Moreover, for remote objects it holds that a linear combination of objects (i.e., of their feature vectors) sharing the same classification results in a linear combination of the outputs; that is, all interpolated objects have the same classification.

This situation changes for objects located near the classifier. Consider again Fig. 15. Objects that lie between $y_1$ and $y_2$ have the same classification with respect to classifier $S_1$. However, objects that lie on a linear interpolation between $y_1$ and $y_3$ (or $x_1$ and $x_2$, depicted by the dotted lines) do not share the same classification: a fraction of the interpolated objects is assigned to a class different from that of $y_1$ (or $y_3$), depicted by the heavy dashed line in Fig. 15. In other words, here a nonlinearity of the classifier $S_1$ is observed. If one were to choose $S_2$, none of the nonlinearities observed with respect to the data and the classifier would occur. Note that the relation between the classifier and the data is of particular importance. Suppose the objects $x_1, \ldots, x_3$ and
Figure 15 A highly nonlinear and a linear decision function on a set of objects. Reprinted with permission from A. Hoekstra and R. P. W. Duin, "On the nonlinearity of pattern classifiers," Proceedings of the 13th ICPR, Vienna, Vol. 3, 1996 (©1996 IEEE Comput. Soc).
$y_1, \ldots, y_3$ are removed; then the nonlinearity of $S_1$ will not be observed, because all objects and their linear interpolations share the same classification. These observations lead to the following definition of nonlinearity: the nonlinearity $\mathcal{N}$ of a classifier $S$ with respect to a data set $\mathcal{L}$ is the probability that an arbitrary object, uniformly and linearly interpolated between two arbitrary objects in the data set with the same classification, does not share this classification. This definition of nonlinearity is independent of the true class labels of the objects, because only the classification made by the classifier is taken into account. Moreover, because of this independence of the true class labels, the nonlinearity is also independent of the classifier performance. It does, however, depend on the data set used to train the classifier.

Now the nonlinearity of a classifier is introduced more formally. First we define the nonlinearity of a classifier $S$ with respect to two objects $x_k$ and $x_l$ sharing the same classification, for example, $x_1$ and $x_2$ in Fig. 15. The nonlinearity $n(\bar{x})$ for a pair of objects is defined as

$$n(\bar{x}) = n\big((x_k, x_l)\big) = \frac{1}{\|x_k - x_l\|} \int_0^1 S_{\Delta}\big(\alpha x_k + (1 - \alpha) x_l,\, x_k\big)\, d\alpha, \tag{16}$$

where $\bar{x} = (x_k, x_l)$ is a pair of objects with the same classification. The term $1/\|x_k - x_l\|$ is used to normalize for the distance between the two objects. Between the objects $x_k$ and $x_l$, a linear interpolation is done by $\alpha x_k + (1 - \alpha) x_l$. Each of the objects in the interpolation is checked against the class of $x_k$ using $S_{\Delta}$:

$$S_{\Delta}(x, x_k) = \begin{cases} 0 & \text{if } S(x) = S(x_k), \\ 1 & \text{otherwise.} \end{cases} \tag{17}$$

Consequently, if all interpolated objects share the same classification, no nonlinearity is observed and the nonlinearity will be zero. Any change in classification increases the nonlinearity, and the size of the increment is determined by the fraction of differently classified objects. The nonlinearity of $S$ with respect to a data set $\mathcal{L}$ can now be defined by using the definition of $n(\bar{x})$:

$$\mathcal{N}(\mathcal{L}, S) = \frac{1}{|\mathcal{C}|} \sum_{\bar{x} \in \mathcal{C}} n(\bar{x}), \tag{18}$$

where $\mathcal{N}(\mathcal{L}, S)$ is the nonlinearity of the classifier with respect to the data set, and $\mathcal{C}$ is the set of pairs of objects sharing the same classification. This restriction is needed because we interpolate between objects of the same assigned class label. The nonlinearity $\mathcal{N}(\mathcal{L}, S)$ is calculated by interpolation between all possible pairs of objects in the data set, provided that they have the same assigned class labels.
Depending on the required accuracy, the calculation of Eqs. (16) and (18) requires a huge number of computations; this gets even worse when we calculate the nonlinearity at different stages of the training phase of the classifier. An algorithm was therefore constructed that uses a Monte Carlo estimation to approximate $\mathcal{N}(\mathcal{L}, S)$:

1. Classify all objects according to the classifier $S$.
2. Randomly draw $n$ objects $x_k$ from the data set $\mathcal{L}$.
3. Find for each $x_k$ a corresponding $x_l$ with the same classification label, that is, create a pair $\bar{x}$.
4. Randomly generate $n$ $\alpha$'s, where $0 < \alpha < 1$ (for each pair $\bar{x}$ an $\alpha$ is drawn).
5. Compare the classification $S(\alpha x_k + (1 - \alpha) x_l)$ with $S(x_k)$, and count a nonlinearity if they are not classified into the same class. Do this for all pairs $\bar{x}$.
6. $\mathcal{N}(\mathcal{L}, S)$ is now determined by the number of nonlinearity counts divided by $n$.
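A direct transcription of these six steps (a sketch; the classify interface, mapping an array of objects to class labels, is an assumption of ours):

    import numpy as np

    def estimate_nonlinearity(classify, X, n_pairs, rng=None):
        # Monte Carlo estimate of N(L, S) following the six steps above.
        rng = np.random.default_rng() if rng is None else rng
        labels = classify(X)                      # step 1
        counts = 0
        for _ in range(n_pairs):
            k = rng.integers(len(X))              # step 2: draw x_k
            same = np.where(labels == labels[k])[0]
            l = rng.choice(same)                  # step 3: x_l, same label
            a = rng.random()                      # step 4: alpha in (0, 1)
            x_new = a * X[k] + (1.0 - a) * X[l]
            if classify(x_new[None, :])[0] != labels[k]:
                counts += 1                       # step 5
        return counts / n_pairs                   # step 6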
B. EXPERIMENTS

To observe the nonlinearity behavior of classifiers and to relate it to their generalization behavior, several experiments were conducted. Moreover, one experiment shows the nonlinearity behavior of a traditional classifier. Such a classifier yields a stable nonlinearity and can therefore serve as a reference for the neural classifier in the experiment, because a neural classifier yields an unstable nonlinearity. The data sets used were two-dimensional, to make it possible to inspect the behavior visually; the nonlinearity measure, however, is not restricted to these cases.

Nearest-Neighbor Classifier

In Fig. 16 the data set is depicted that was used to train a k-nearest-neighbor classifier. These data require a highly nonlinear decision function, but are perfectly separable. First the optimal $k$ and its corresponding $\mathcal{N}$ were calculated; second, $\mathcal{N}$ for different $k$ was determined. The optimal $k$ turned out to be 1, which is not surprising: if $k$ is large, meaning that many neighbors are taken into account, the separation between the two classes will not be observed. Figure 16 [54] shows the 1 nearest neighbor on the data set, its nonlinear nature clearly visible. The corresponding $\mathcal{N}$ was calculated and turned out to be 0.11, indicating a high nonlinearity.

In Fig. 17 [54] the results for the other experiment are shown. The nonlinearity of the nearest neighbor decreases when $k$ becomes larger.
Figure 16 The data set with no overlap, requiring a highly nonlinear decision function. Also depicted is the decision function implemented by the 1 nearest-neighbor classifier, which is optimal in this case. Reprinted with permission from A. Hoekstra and R. P. W. Duin, "On the nonlinearity of pattern classifiers," Proceedings of the 13th ICPR, Vienna, Vol. 3, 1996 (©1996 IEEE Comput. Soc).
Figure 17 The nonlinearity behavior of the k-nearest-neighbor classifier, for different k. Reprinted with permission from A. Hoekstra and R. P. W. Duin, "On the nonlinearity of pattern classifiers," Proceedings of the 13th ICPR, Vienna, Vol. 3, 1996 (©1996 IEEE Comput. Soc).
This implies that the classifier becomes more linear. That this is indeed the case can be seen in Fig. 18, which shows the nearest-neighbor decision function for larger values of $k$. As $k$ increases, the decision function implemented becomes "straighter." It can be concluded that the classifier becomes more linear for large $k$, that is, a classifier that has a larger local sensitivity. A large $k$ implies that many neighbors are taken into account for classification, and therefore it is less likely that nonlinearities are observed.

A neural network with different numbers of hidden units was also trained on this data set. It was expected that, because the classes are perfectly separable, a neural network consisting of enough hidden units would not suffer from overtraining. The network was trained by using the Levenberg-Marquardt rule and consisted, over the experiments, of 2, 4, 10, and 20 hidden units. Figure 19 shows the decision functions for 2 and 10 hidden units. Clearly, 2 hidden units is too small a number, resulting in a weakly nonlinear classifier (which is confirmed by the nonlinearity curve in Fig. 21). The 10-hidden-unit network implements almost the same decision function as the 1 nearest neighbor and therefore has an almost identical nonlinearity value of 0.096, where the nearest-neighbor classifier had 0.11.

Figures 20 and 21 show the development of the nonlinearity of the networks during training. Each network was trained for a fixed number of epochs, namely 150 (the figures show the results per five epochs). Remarkably, the nonlinearity of the larger neural network starts high and then stabilizes at a lower nonlinearity value. Probably the best network is already found at an early stage of the training; inspection of the decision function at that early stage indeed shows that it gives a better separation. This also indicates that the nonlinearity of the nearest-neighbor classifier can be used as a reference.

Neural Network Classifier

Like almost any other classifier, neural classifiers suffer from small sample size problems. Building the classifier by using too few samples will generally result in a classifier with a poor generalization capability: it has adapted too much to the noise in the data set. It is therefore expected that such classifiers have an increased nonlinearity, because the decision function they implement is highly nonlinear.

In Fig. 22 the data set is depicted on which a neural network was trained. This data set consists of two overlapping classes that require at least a quadratic decision function to be separated. A small training set of 40 samples per class was used to investigate the nonlinearity behavior. Overtraining of the neural network was obtained by using the Levenberg-Marquardt learning rule; this learning rule, combined with a sufficient number of hidden units and a small number of examples, leads relatively quickly to an overtrained network. The overtraining is desirable because the nonlinearity of the classifier during the
Figure 18 The decision function implemented by the 80 nearest-neighbor classifier (on the left) and the function implemented by the 140 nearest-neighbor classifier (on the right).
Figure 19 The decision function implemented by the network with 2 hidden units (on the left) and the function implemented by the 10-hidden-unit network (on the right). Both functions are the result of training for a fixed number of epochs (150).
Figure 20 The nonlinearity development of the neural network with 10 hidden units. The dotted line shows the nonlinearity value for the 1 nearest-neighbor classifier. The nonlinearity quickly stabilizes at a high value, indicating a highly nonlinear decision function.
Figure 21 The nonlinearity development of the neural network with 2 hidden units. The dotted line shows the nonlinearity value for the 1 nearest-neighbor classifier. The nonlinearity quickly stabilizes at a low value, indicating a more or less quadratic decision function.
Figure 22 Two overlapping classes, requiring a quadratic decision function, but which can easily cause overtraining in a neural network.
whole learning process, and not only the optimal case, is of interest. However, the number of hidden units is difficult to determine beforehand; therefore, several neural networks with different hidden layer sizes were trained and investigated. Six hidden layer sizes were inspected: 2, 4, 8, 10, 20, and 30. For each of these neural networks, the nonlinearity was computed on 10,000 data pairs. Furthermore, each network was trained for a fixed number of epochs, which was set at 150. A quadratic decision function was also calculated on the data set. The nonlinearity of this (optimal) decision function was used as a reference, because it is fixed: it is calculated only once, whereas the nonlinearity of the neural network is computed throughout the training process.

Figure 23 depicts the nonlinearities of the different networks. Each network is denoted by 2-X-1, indicating that it has two inputs, X hidden units, and one output unit. The nonlinearity of each network starts small and then almost monotonically increases to a certain level, where it remains more or less stable. However, this does not occur in all situations; for instance, for the 2-20-1 network the nonlinearity continues to increase and does not reach a stable point.
Figure 23 The nonlinearity curves for networks with different numbers of hidden units (2-2-1, 2-4-1, 2-8-1, 2-10-1, 2-20-1, and 2-30-1). The dashed line is the nonlinearity of the quadratic classifier, which is stable for each epoch.
Figure 24 An overtrained network on the overlapping classes. The dashed line denotes the optimal, quadratic decision function.
When these curves are compared with the error curves for the networks (see Fig. 25), it turns out that the nonlinearity increases with increasing generalization error. Moreover, if the test set error remains almost constant, the nonlinearity also remains stable. Hence the nonlinearity is an indicator of the network's local sensitivity. A small local sensitivity results in a small training error, in general zero, and a large nonlinearity.

Figure 24 depicts the 2-20-1 network's decision function and the optimal decision function of a quadratic classifier. This network is clearly in a severely overtrained situation, which implies a high nonlinearity: compared to the "optimal" nonlinearity, it is about 6 times higher. The network with 20 hidden units has far too much flexibility. The 2-2-1 neural network, by contrast, has just enough flexibility, because its nonlinearity is almost equal to that of the quadratic classifier.

In Fig. 25 the error curves for the 2-20-1 neural network are displayed. It is clear that the network reaches an overtrained situation, because of the increasing test set error and zero training error. This behavior of the generalization curve is reflected in the nonlinearity curve: up to epoch 50, $\mathcal{N}$ gradually increases just as the test set error does; at about epoch 50 there is a large increase in nonlinearity and a relatively large increase in the test set error.
Figure 25 This figure shows the error curves for the 2-20-1 neural network and the quadratic classifier. For the neural network both the generalization error (dashed line) and the learning set error (solid line) are depicted.
The apparent error behaves the other way around: at epoch 50 it drops to a lower level, resulting in a test set error increase. To conclude, training a neural network classifier results in a classifier that starts as a linear one and gradually becomes highly nonlinear. Experiments show that if the generalization error stabilizes, the nonlinearity also reaches a stable point. However, this behavior strongly depends on initial conditions, such as sample size, learning algorithm, and network size. The larger the neural network, the more flexibility it has and the higher its nonlinearity will be if no precautions are taken. However, because a traditional classifier yields a stable nonlinearity, it is possible to measure the performance of a neural network with respect to that classifier.

Another problem in classifier training is that, due to initialization and learning, the classifier might be unstable: during training its decision boundary changes frequently, such that no stable classification is obtained. The next section deals with this problem.
VI. CLASSIFIER STABILITY

We have indicated in several places in this paper that there is a trade-off between sample size and classifier complexity. The larger the training set, the more complex the classifiers can be without risking overtraining or noise fitting. Recall that neural network classification functions can be made more complex by more training iterations. As more complex classifiers are better able to describe the details of classes or their decision boundaries, we have here the natural trade-off between sample size and observed class detail.

If for a given sample size the classifier complexity is decreased below some optimal complexity, then fewer details will be observed, and wrong decisions will systematically be made for a number of objects. This will be called bias, as it corresponds to the systematic errors in, for instance, regression. For classifiers it results in a larger generalization error, larger than is possible with the given training set. If for a given sample size the classifier complexity is increased above some optimal complexity, then more details can be observed. These details, however, will be accidental and are just a result of fitting the noise in the data. In regression this results in a larger variance if the same estimator is applied to repeated realizations of the training set. For classifiers the consequence is that they are unstable, that is, one and the same test object has a large probability of being classified into a different class for small variations of the training set. This is the bias-variance dilemma [55, 56] encountered in choosing the complexity of the classifier.

The concept of classifier instability has been studied before by one of the authors [57]. It is defined here as the probability that the classification of an arbitrary object is changed by some small disturbance of the classifier, such as another initialization or the addition or deletion of a single training object. In the next section we will discuss a way to measure the instability using the training set, illustrated by some experiments. Thereafter we will go into the issue of finding good stabilizers and the consequences for the classification error.

A. AN ESTIMATION PROCEDURE FOR STABILITY

As we are interested in investigating stabilizing techniques, we need to measure the stability (or the instability) of classifiers. In this section a possible estimation procedure will be discussed, illustrated by the relation between the performance and the instability of some linear classifiers. Our procedure for the estimation of the instability of a classifier has the following steps:

1. Use the entire training set for computing the classifier $S_0$.
2. Compute a large set of classifiers $S_i$ ($i = 1, 2, \ldots, n$) using one and the same modified training procedure. Possibilities are:
• Delete one training sample and rotate over the entire set (leave-one-out procedure).
• Bootstrap the training set (that is, sample with replacement). Bootstrapped training sets are as large as the original set, but may contain copies of some samples, whereas other samples are not represented.
• Use different initializations for the classifier optimization.

3. Compute the average fraction of objects classified differently by $S_0$ and the classifiers $S_i$, averaged over the $n$ classifiers $S_i$ and over a test set. For this test set the following might be used:

• the original training set
• a truly independent test set
• an artificially generated test set, as discussed in Section IV.C

In the last step of this procedure we just count the number of differently estimated classification labels; the true labels (if known) are not used. Thereby the size of the probability mass whose classification is changed is found. The classification error itself might be unaffected. A sketch of this procedure, for the bootstrap variant, is given below.

The instability is a probability, and thereby a number between 0 and 1. It may show a large variation from classifier to classifier; therefore a large number of different classifiers (e.g., [24]) might be needed to get a reliable estimate. If the classifier training procedure is computationally heavy, like some neural network optimizers, the computation of the instability might become prohibitive. An instability of zero points to an entirely stable classifier: no changes in the individual classification results whatsoever. Such a number can only be expected for very separable classes or for very bad classifiers that assign all samples to the same class. At the other end, an instability of 1 is only possible if all samples change classification, which is very improbable. As a result, for many situations the instability will clearly stay away from these two extreme values.

An example is presented in Fig. 26, originally published in [58]. It shows the instability, based on 50 bootstrapped versions of a set of linear classifiers, on a 30-dimensional Gaussian problem as a function of the size of the training set. In the left graph of Fig. 26 the estimate is based on an independent test set; for the right graph the training set itself is used. The absolute numbers differ, but the relative behavior over the classifiers is preserved. The classifiers used in this experiment are the Nearest Mean Classifier (NMC), the Pseudo Fisher Linear Discriminant (PFLD) (see [59]), the Small Sample Size Classifier (SSSC) (see [59]), a linear discriminant based on a Karhunen-Loeve Feature Selection (KLLC), and a regularized version of the Fisher Linear Discriminant (RFLD) (see [60]). The generalization error of this experiment is presented in Fig. 27. One can observe that the performance and stability of classifiers are related: more stable classifiers perform better.
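The sketch below implements the bootstrap variant of the procedure; the train and classify interfaces are assumptions of ours, not a specific library:

    import numpy as np

    def estimate_instability(train, classify, X_train, y_train, X_test,
                             n_boot=50, rng=None):
        # Instability: average fraction of test objects labeled differently
        # by the base classifier S_0 and by classifiers S_i trained on
        # bootstrapped training sets.  True labels are not used.
        rng = np.random.default_rng() if rng is None else rng
        s0 = train(X_train, y_train)
        base = classify(s0, X_test)
        changed = []
        n = len(X_train)
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)      # sample with replacement
            s_i = train(X_train[idx], y_train[idx])
            changed.append(np.mean(classify(s_i, X_test) != base))
        return float(np.mean(changed))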
Figure 26 The instability of the NMC, the PFLD, the SSSC, the KLLC, and the RFLD versus the training sample size for 30-dimensional Gaussian correlated data, measured by an independent test set (left) and by the training set itself (right).
Figure 27 The generalization error of the NMC, the PFLD, the SSSC, the KLLC, and the RFLD (λ = 0.3) versus the training sample size for 30-dimensional Gaussian correlated data.
In general, the instability of a classifier depends on many factors, such as the complexity of the classifier, the distribution of the data used to construct the classifier, and the sensitivity of the classifier to the size and the composition of the training data set. For different classifiers the instability shows a different behavior as a function of the training sample size. All classifiers are most unstable for small training data sets. After the number of training samples is increased above the critical value (the data dimensionality), the classifiers become more stable and their generalization error decreases. Only the Nearest Mean Classifier is an exception; it is almost insensitive to the number of training samples (see the experiments in the following sections): in spite of becoming more stable with an increasing training set, its generalization error does not change much. The peaking of the Pseudo Fisher Linear Discriminant is discussed in [59].

The instability measured by the training data set (right graph of Fig. 26) increases for larger training sets, at least for the sizes studied here. Any classifier built on a very small training set is bad, so changes in the training data set
do not change the bad performance. The classifiers constructed on large sets are already stable. Therefore, changes in the composition of the training set do not change the classifier much. Because the instability identifies the situations in which the classifier is most sensitive to the composition of the training set, it could help to predict the possible use of stabilizers for a particular classifier for certain data and certain sizes of the training set.
B. STABILIZERS

Classifier instability is caused by too large a sensitivity to the noise in the data. Here we summarize three techniques that prevent this: adding different noise realizations, using a penalty for a too complex (too sensitive) classifier, and combining different realizations of the same, unstable, classifier.
Adding Noise

Increasing the noise itself does not stabilize, of course. What is meant by "adding noise" is the addition of several noise realizations, whereby the training set is enlarged artificially. Such a larger training set yields a more stable classifier. Sometimes it appears to be possible to add the noise in an analytic way. This holds for classifiers that are based on noise estimators. For instance, in the Fisher Linear Discriminant the covariance of the noise is estimated. This estimator G can be made more robust by replacing it by

G_R = G + λI,   (19)

in which I is the identity matrix and λ is some small constant (e.g., 0.1). In Figs. 26 and 27 a regularized version of the Fisher Linear Discriminant, RFLD, is used with λ = 0.3. In Figs. 28 and 29, the error and the instability of this classifier are shown as a function of λ. These figures, borrowed from [57, 58], show that a large regularization parameter λ really stabilizes. They also show that for a large value of λ the classifier approaches the Nearest Mean Classifier (NMC), which is more stable, but performs worse for increasing sample sizes, as it is insensitive to covariances. Generally, this is the character of the technique of adding noise: it simplifies the structure of the problem.
[Figures 28 and 29: generalization error and instability of the RFLD as a function of the regularization parameter λ.]
This addition of noise is heavily used in neural network training procedures. In this area the concept is also used for adding noise to the weights or to the weight updates. The consequences of such procedures are hard to analyze mathematically. Both sides, the data as well as the classifier, are affected.
Weight Decay

Weight decay is a typical example of stabilizing by a simplification of the classification procedure. It is used in neural network training by penalizing large weights. As large weights correspond to large nonlinearities, they also correspond to complex and thereby noise-sensitive classifiers. An experimental analysis is computationally heavy, as both training and bootstrapping are very time consuming. The verification that weight decay stabilizes must therefore be skipped here. An interesting observation on weight decay has been made by Raudys [61]. The use of a negative weight decay causes an "antiregularization," which might be helpful for finding better neural network classifiers (see also [62]). In our context this can be understood to mean that an increase in the instability might be useful for improving too stable classifiers like the Nearest Mean Classifier, as discussed in the next section.
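As an illustration, a weight-decay penalty can be added to any squared-error training criterion. The sketch below assumes NumPy and uses illustrative names; a negative value of the decay constant would give the antiregularization of Raudys [61].

```python
# A minimal sketch of a weight-decay penalty added to a squared-error
# criterion, assuming NumPy; names are illustrative. A negative value of
# `decay` gives the antiregularization of Raudys [61].
import numpy as np

def penalized_loss(squared_error, weights, decay=1e-3):
    """Training criterion with weight decay: E + decay * sum ||w||^2."""
    penalty = decay * sum(float(np.sum(w ** 2)) for w in weights)
    return squared_error + penalty

# In gradient descent this adds 2 * decay * w to each weight gradient,
# so every update shrinks the weights (and the network's nonlinearity).
```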
Bagging

A third possible way of stabilizing might be to accept classifier instability, but to collect and combine a number of classifiers for the same problem in such a way that the combined result is better. A well-known technique is to combine neural networks derived from different initializations. Another, systematic approach has been proposed by Breiman [63], called bagging. It is based on the concepts of bootstrapping and aggregating. In the case of linear classifiers, bagging can be implemented by averaging the parameters (coefficients) of the classifiers built on several bootstrap replicates. From each bootstrap replicate of the data set, a "bootstrap" version of the classifier is built. The average of these "bootstrap" versions gives the "bagged" classifier. Breiman has shown that bagging can reduce the error of linear regression and classification. It was noticed that bagging is useful for unstable procedures only. For stable ones it could even cause a deterioration of the performance of the classifier. The performance of bagged classifiers was also investigated by Tibshirani [56] and by Wolpert and Macready [64], who used the bias-variance decomposition to estimate the generalization error of the "bagged" classifiers. An example is shown in Fig. 30 [57, 58]. Here the instability of the Nearest Mean Classifier is presented for the same Gaussian example as used in this chapter. This example shows that bagging might improve performance, but at the same time does not stabilize.
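A sketch of bagging for linear classifiers along these lines, assuming NumPy; the least-squares fit used for each bootstrap version is an illustrative assumption, not the authors' code.

```python
# Sketch of bagging for a linear classifier: one least-squares linear fit
# per bootstrap replicate, coefficients averaged into the "bagged" model.
import numpy as np

def bagged_linear_classifier(X, y, n_boot=25, seed=0):
    """y coded as -1/+1; returns averaged weights and bias."""
    rng = np.random.default_rng(seed)
    Xa = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias term
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        w, *_ = np.linalg.lstsq(Xa[idx], y[idx], rcond=None)
        coefs.append(w)
    w_bag = np.mean(coefs, axis=0)              # the "bagged" classifier
    return w_bag[:-1], w_bag[-1]
```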
[Figure 30: instability (measured by the training data set) and generalization error of the bagged Nearest Mean Classifier versus training sample size.]
C. STABILITY EXPERIMENT

The stability of neural network classifiers depends on the architecture (the number of hidden units), the training algorithm, and the training set. An example is given in Figs. 31 and 32 for the instability caused by repeated initializations. The networks are trained on two different realizations of the same two-dimensional problem of spherical Gaussian distributed classes, with identical means but different variances. In the left figures the scatter plots of these two training sets are presented. In the two right figures the instability as a function of the number of hidden units is shown. Levenberg-Marquardt optimization is used with a fixed number of 150 epochs. The label assignments for an independent test set are compared, yielding an estimate of the instability as discussed in Section VI.B. The instability curves for the two training sets are entirely different.

This section shows that classifier stability plays its role between bias and variance. As the above experiments illustrate, no fixed rules can be formulated for the architecture in relation to the size of the classification problem (numbers of features and samples). The use of some stabilizing technique, as discussed in this chapter or elsewhere, is thereby necessary in almost any case. This is due to the fact that neural networks have a large number of free parameters. A straightforward optimization yields unstable solutions in the case of limited data sizes. Classifier stability has been studied in its relation to the bias and the variance of the resulting classification function. Very stable classifiers are biased and insensitive to class irregularities. They thereby tend to have a bad performance. Very unstable classifiers show large variances in the classification, in which bad performances dominate. Somewhere in between, classifiers have sufficient sensitivity to follow data peculiarities without being too heavily affected by the noise. Each of the techniques we discussed here controls this stability in its own way. Adding noise in an artificial way and regularization by weight decay stabilize. Bagging might improve the performance, but it is certainly not a clear stabilizer. More research is needed to determine the conditions that can be expected from this method. The way we measure the stability is very time consuming. For larger neural networks this will become impossible, certainly on a routine basis. Measuring the instability, however, might help us to understand why and the conditions under which certain techniques are useful and others are counterproductive. Combining neural networks is a popular issue [37, 65, 66] and might certainly improve the performance. Bagging experiments presented here and elsewhere [58] show that this is not necessarily caused by stabilizing. This corresponds with the conclusions drawn by Raudys [61] and Hamamoto [62] that sometimes antiregularization might be useful in training neural networks.
Figure 31 Instability as a function of the number of hidden units (right) for a given training set (left), averaged over 100 experiments with 20 training samples. In the scatter plot some examples of neural network classification functions are given.
Figure 32 Instability as a function of the number of hidden units (right) for a given training set (left), averaged over 500 experiments with 20 training samples. In the scatter plot some examples of neural network classification functions are given.
So sometimes neural network training procedures are too stable, are thereby insensitive to the data, and need some destabilizing; sometimes they are too sensitive to the noise, tend to overtrain, and need a stabilizer.
VII. CONCLUSIONS AND DISCUSSION

In this chapter we have introduced several techniques for analyzing data in relation to neural networks. We started off with the analysis of plain data without any classification objective and showed how these data can be investigated in general (Section II) and inspected with respect to their separability (Section III). This separability inspection is important because the final objective is to find a classifier that separates the objects as well as possible. However, a separability analysis alone is not enough; the classifier must also be taken into account. Tools for estimating the reliability of neural classifiers were introduced in Section IV. Next to selecting classifiers to solve a particular problem, training classifiers is an important issue. In Sections V and VI, techniques were introduced for following the classifier's generalization behavior and stability during learning. This is important because we are interested in optimal classifiers, and not in ones that are overtrained or undertrained. What can be done with all of these available techniques, from data analysis to classifier analysis, in practical applications? No single answer can be given to this question, but we distinguished the following steps from data to classifier:

1. Analyze the data with general techniques to gain insight into their structure. This includes density estimation, cluster analysis, separability analysis, etc.
2. Select an appropriate classifier or a set of classifiers that should be trained (e.g., a feedforward neural classifier with various numbers of hidden units).
3. Monitor its generalization behavior during training to prevent overtraining (e.g., nonlinearity analysis).
4. Finally, verify the classifier found after training by judging its reliability (e.g., confidence analysis).

Techniques have been introduced for each of these steps. Section II introduced the techniques available in the literature for gaining insight into the structure of the data by applying cluster analysis, projection analysis, or density estimation. Section III showed how insight into the data structure can be obtained by mapping the data onto a two-dimensional grid and visualizing the resulting mapping. In this way classifiers can be selected that might be used to solve the problem. Techniques for the actual (trained) classifier selection, in step 2, were introduced in Section IV. By estimating the confidence of the classification made by the classifier, a classifier can be selected or rejected. Depending on how well a network is trained, these confidence levels can be estimated for different layers of the
network. Different methods for estimating the confidence levels were also introduced, next to the "classical" network output values. Step 3 is training a neural network classifier in such a way that the classification problem is solved in an optimal way. This can be done by monitoring the network's generalization capability and stability in relation to the data it has to classify. The nonlinearity measure can be inspected during training, in contrast to the VC dimension-related methods, which determine the generalization capability in advance. This nonlinearity measure is very useful because the tendency of a network to enter an overtrained situation can be seen almost immediately. The stability must also be monitored, because unstable classifiers yield suboptimal solutions. Finally, the last step was introduced in Section IV, in which it was shown how reliability estimations can be performed. In practical situations, however, one is not always interested in a general procedure of classifier selection and data analysis. The questions are more specific: "How many hidden layers should I select?" "How many hidden neurons do I need?" or "Which learning algorithm should I choose?" This chapter has not dealt with these kinds of questions in detail, although in the presented experiments the techniques have been applied to networks with various architectures. However, one must bear in mind that each of the presented techniques can be applied to answering particular questions. The authors' objective was to present a variety of techniques that can be applied in the various stages of analysis and training to solve a classification problem by using neural networks.
ACKNOWLEDGMENTS This research was partially sponsored by the SION (Dutch Foundation for Computer Science Research), the NWO (Dutch Foundation for Scientific Research), and the SNN (Foundation for Neural Networks, Nijmegen). Servaes Tholen, Marina Skurichina, Jianchang Mao, and Anil Jain are gratefully acknowledged for their contribution to the research described.
REFERENCES

[1] E. Backer. Computer Assisted Reasoning in Cluster Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1995.
[2] R. Beale and T. Jackson. Neural Computing: An Introduction. Hilger, Bristol, England, 1990.
[3] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon, Oxford, 1995.
[4] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, London, 1982.
[5] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.
[6] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1990.
[7] J. A. Hertz, A. S. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity. Addison-Wesley, Reading, MA, 1991.
[8] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[9] B. Ripley. Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge, England, 1996.
[10] H. Ritter, K. Schulten, and T. Martinetz. Neural Computation and Self-organizing Maps: An Introduction. Addison-Wesley, Reading, MA, 1991.
[11] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, Heidelberg, 1982.
[12] P. K. Simpson. Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon, Elmsford, NY, 1990.
[13] J. A. Anderson and E. Rosenfeld. Neurocomputing: A Collection of Classic Papers. MIT Press, Cambridge, MA, 1987.
[14] J. A. Hartigan. Clustering Algorithms. Wiley, New York, 1975.
[15] T. Kohonen. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Heidelberg, 1989.
[16] B. S. Duran and P. L. Odell. Cluster Analysis: A Survey. Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, Berlin, 1974.
[17] T. Kohonen. The self-organizing map. Proc. IEEE 78:1464-1480, 1990.
[18] S. Grossberg. Studies of Mind and Brain. Reidel Press, Boston, 1982.
[19] S. Grossberg. The Adaptive Brain I: Cognition, Learning, Reinforcement, and Rhythm. Elsevier/North-Holland, Amsterdam, 1986.
[20] S. Grossberg. The Adaptive Brain II: Vision, Speech, Language and Motor Control. Elsevier/North-Holland, Amsterdam, 1986.
[21] E. Oja. Simplified neuron model as a principal component analyzer. J. Math. Biol. 15:267-273, 1982.
[22] E. Oja. Neural networks, principal components, and subspaces. Internat. J. Neural Syst. 1:61-68, 1989.
[23] J. W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. C-18:401-409, 1969.
[24] E. Parzen. On estimation of a probability density function and mode. Ann. Math. Statist. 33:1065-1076, 1962.
[25] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Sci. 9:141-169, 1985.
[26] J. S. Baras and A. La Vigna. Convergence of Kohonen's learning vector quantization. In International Joint Conference on Neural Networks, June 17-21, 1990, Vol. 3, pp. 21-26. IEEE Press, New York.
[27] E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering, convergence properties and energy functions. Biol. Cybernet. 67:47-55, 1992.
[28] H. Ritter and K. Schulten. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybernet. 54:99-106, 1986.
[29] H. Ritter and K. Schulten. Convergence properties of Kohonen's topology preserving maps: fluctuations, stability, and dimension selection. Biol. Cybernet. 60:59-71, 1988.
[30] H. Ritter and K. Schulten. Kohonen's self-organizing maps: exploring their computational capabilities. In IEEE International Conference on Neural Networks, Vol. 1. IEEE, New York, 1988.
[31] M. A. Kraaijveld, J. Mao, and A. K. Jain. A nonlinear projection method based on Kohonen's topology preserving maps. IEEE Trans. Neural Networks 6:548-559, 1995.
[32] G. Biswas, A. K. Jain, and R. C. Dubes. Evaluation of projection algorithms. IEEE Trans. Pattern Anal. Machine Intelligence 3:701-708, 1981.
[33] J. Mao and A. K. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans. Neural Networks 6:296-317, 1995.
[34] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7(Part II):179-188, 1936.
[35] M. Anthony and S. B. Holden. On the power of linearly weighted neural networks. In Proceedings of the ICANN'93 (S. Gielen and B. Kappen, Eds.), pp. 736-743. Springer-Verlag, Heidelberg, 1993.
[36] R. Battiti and A. M. Colla. Democracy in neural nets: voting schemes for classification. Neural Networks 7:691-707, 1994.
[37] J. Kittler, M. Hatef, and R. P. W. Duin. Combining classifiers. In Proceedings of the 13th ICPR, Vienna, Vol. 2, track B: Pattern Recognition and Signal Analysis, pp. 897-901. IEEE Comput. Soc., Los Alamitos, CA, 1996.
[38] J. S. Denker and Y. Le Cun. Transforming neural-net output levels to probability distribution. In Advances in Neural Information Processing Systems (R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Eds.), Vol. 3, pp. 853-859. Morgan Kaufmann, San Mateo, CA, 1991.
[39] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[40] R. P. W. Duin. Superlearning capabilities of neural networks? In Proceedings of the 8th Scandinavian Conference on Image Analysis, May 25-28, 1993, pp. 547-554. Norwegian Society for Image Processing and Pattern Recognition.
[41] S. A. Tholen. Confidence values for neural network classifications. Master's Thesis, Delft University of Technology, Delft, The Netherlands, 1995.
[42] J. A. Anderson. Handbook of Statistics, Vol. 2, Logistic Discrimination, pp. 169-191. North-Holland, Amsterdam, 1982.
[43] R. P. W. Duin. Nearest neighbor interpolation for error estimation and classifier optimization. In Proceedings of the 8th SCIA, Tromsø, 1993.
[44] J. D. J. Houtepen. Chromosome banding profiles: how many samples are necessary for a neural network classifier? Master's Thesis, Delft University of Technology, Delft, The Netherlands, 1994.
[45] A. Hoekstra, S. A. Tholen, and R. P. W. Duin. Estimating the reliability of neural network classifications. In Proceedings of the ICANN 96, Bochum, Germany, 1996, pp. 53-58.
[46] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks 2:183-192, 1989.
[47] K. Hornik. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366, 1989.
[48] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Comput. 1:151-160, 1989.
[49] V. Roychowdhury, K.-Y. Siu, and A. Orlitsky, Eds. Theoretical Advances in Neural Computation and Learning. Kluwer Academic, Dordrecht, The Netherlands, 1994.
[50] D. H. Wolpert, Ed. The Mathematics of Generalization. Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, November 1994.
[51] L. N. Kanal and P. R. Krishnaiah. Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. North-Holland, Amsterdam, 1982.
[52] V. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 4, pp. 831-838. Morgan Kaufmann, San Mateo, CA, 1992.
[53] M. A. Kraaijveld and R. P. W. Duin. The effective capacity of multilayer feedforward network classifiers. In Twelfth International Conference on Pattern Recognition, pp. B-99-B-103, 1994.
[54] A. Hoekstra and R. P. W. Duin. On the nonlinearity of pattern classifiers. In Proceedings of the 13th ICPR, Vienna, Vol. 3, track D: Parallel and Connectionist Systems, pp. 271-275. IEEE Comput. Soc., Los Alamitos, CA, 1996.
[55] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Comput. 4:1-58, 1992.
[56] R. Tibshirani. Bias, variance and prediction error for classification rules. Technical Report, University of Toronto, Toronto, 1996.
[57] M. Skurichina and R. P. W. Duin. Stabilizing classifiers for very small sample sizes. In Proceedings of the 13th ICPR, Vienna, Vol. 2, track B: Pattern Recognition and Signal Analysis, pp. 891-896. IEEE Comput. Soc., Los Alamitos, CA, 1996.
[58] M. Skurichina and R. P. W. Duin. Bagging for linear classifiers. Pattern Recognition. To appear.
[59] R. P. W. Duin. Small sample size generalization. In Proceedings of the 9th Scandinavian Conference on Image Analysis, June 6-9, 1995 (G. Borgefors, Ed.), pp. 957-964.
[60] S. Raudys and M. Skurichina. Small sample properties of ridge estimate of covariance matrix in statistical and neural classification. In New Trends in Probability and Statistics, Vol. 3, pp. 237-245, 1994. Tartu, Estonia.
[61] S. Raudys. A negative weight decay or anti-regularization. In Proceedings of the ICANN 95, Paris, 1995.
[62] Y. Hamamoto, Y. Mitani, H. Ishihara, T. Hase, and S. Tomita. Evaluation of an anti-regularization technique in neural networks. In Proceedings of the 13th ICPR, Vienna, Vol. 2, track B: Pattern Recognition and Signal Analysis, pp. 205-208. IEEE Comput. Soc., Los Alamitos, CA, 1996.
[63] L. Breiman. Bagging predictors. Machine Learning 24:123-140, 1996.
[64] D. H. Wolpert and W. G. Macready. An efficient method to estimate bagging's generalization error. Technical Report, Santa Fe Institute, Santa Fe, NM, 1996.
[65] G. Rogova. Combining the results of several neural network classifiers. Neural Networks 7:777-781, 1994.
[66] K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29:341-348, 1996.
[67] H. Demuth and M. Beale. Neural Network Toolbox for Use with Matlab. The MathWorks, 1993.
[68] R. P. W. Duin. PRTOOLS, a Matlab Toolbox for Pattern Recognition. Pattern Recognition Group, Delft University of Technology, October 1995.
[69] A. Hoekstra, M. A. Kraaijveld, D. de Ridder, and W. F. Schmidt. The Complete SPRLIB & ANNLIB. Pattern Recognition Group, Faculty of Applied Physics, Delft University of Technology, 1996.
[70] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen. SOM_PAK: The Self-Organizing Map Program Package, version 3.1. Laboratory of Computer and Information Science, Helsinki University of Technology, Rakentajanaukio 2 C, SF-02150 Espoo, Finland, April 7, 1995.
Multimode Single-Neuron Arithmetics

Chang N. Zhang and Meng Wang
Department of Computer Science, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
I. INTRODUCTION

Neurons, whether real or artificially modeled, are basic processing elements (PEs) of computational neural networks. Formalizing single-neuron computation has been inspired by two types of endeavors. For computer scientists, understanding the computation performed by real neurons is essential to the design of basic processing elements that equip new models of artificial neural networks (ANNs). In contrast, computational neuroscientists have been exploring the biophysical mechanisms that enable neurons to "compute," to support their computational explanation of brain functions. Computational ability and biophysical plausibility, respectively, are the two focuses of interest for researchers in these two different areas. New insights into either of these aspects would inevitably benefit both. In this chapter we shall alternately take the standpoints of computational neuroscience and computer science. We choose to proceed this way because we believe that many fundamental problems in neuron modeling are, in fact, of the same importance to both disciplines. The current generation of artificial neural networks makes use of highly simplified models of neurons, such as summation-and-firing models. The reason for the use of these simplified models is twofold. On one hand, many researchers believe that the complicated dynamic behavior and advanced functions of a neural network system are primarily the result of collective behaviors of all participating neurons, and are less relevant to the operational detail of each processing element [1]. Assuming this, simple neuron models, such as summation-and-firing
models, are preferable to complicated neuron models because of their ease of implementation and understanding. On the other hand, there has been a lack of available models of single neurons that are both computationally powerful and biologically plausible as replacements of the current models. In fact, as a main source of novel neuron models, computational neuroscience is just at the initial stage of tackling the principles of single-neuron information processing [2]. Our present knowledge of the biophysical basis of single-neuron computation is largely based on the understanding of the functional structure of biological membranes [3]. In brief, membrane structures that separate the inner world of a single neuron from its environment form a leaky barrier for charged ions, resulting in a difference in electrical potentials across the cell membrane (membrane potentials). An important variable observable at the cellular level, membrane potential has been chosen as the basic variable describing the state of a single neuron in most modeling works. Neurophysiological studies at cellular and molecular levels have elucidated the fact that it is the opening and closing of ion channels provided by certain membrane proteins that are responsible for changes in membrane potentials [4]. In the investigation of how the opening and closing of ion channels change the membrane potentials, equivalent electrical circuits are a useful model.
Figure 1 Single-neuron computation can be studied via equivalent circuit models. (a) Simplified model of a membrane patch subject to synaptic input. The nongated channel in the figure is an integrated representation of all nongated channels in this patch that are selectively permeable to various ion species; similarly, the gated channel on the right represents all chemical-gated channels in this patch whose opening/closing is controlled by the synaptic input x. (b) Equivalent circuit model for the membrane patch in (a). The nongated channels are modeled by the constant conductance gm, and the gated channels are modeled by the gsyn-Esyn branch. The temporal property of the membrane potential is modeled by the capacitive branch Cm. Membrane potential is given by V measured in reference to the extracellular ground.
Figure 1b shows a typical equivalent circuit model of the membrane patch shown in Fig. 1a. In this electric model, membrane permeability of charged ions is modeled by lumped electrical parameters (conductances). Consequently, the influence of a synaptic input signal x on the membrane potential V is studied in terms of how the variable input conductance gsyn determines the value of the voltage V. Mathematically, this equivalent circuit model defines the membrane potential V as a real function of the input conductance gsyn. Suppose that V is the membrane potential under observation and {g1, g2, ..., gn} is a set of input conductances. Then the task of single-neuron computation can be stated as a multivariate mapping f : R^n → R from the n-dimensional real space of input conductances to the one-dimensional real space of membrane potential. More generally, when n different spatial sites of a cell membrane are observed, n membrane potentials can be defined (Fig. 2). In this case, the mapping from input conductances to membrane potentials is f : R^n → R^n.
Figure 2 A typical equivalent circuit model for passive membranes. The entire surface of a cell membrane is considered as being paved by n membrane patches. For each of these patches a unique membrane potential Vi (i = 1, 2, ..., n) is measured. The difference between membrane potentials at different patches is caused by tangential membrane resistances, which are modeled by an axial resistance network. The resting property of the membrane is assumed to be the same throughout the membrane, and therefore the same gm and Cm values are assigned to all patches. Each of the membrane patches is assumed to be subject to a synaptic input. Depending on the property of the synaptic input, the reversal potential of the synaptic potential can be negative (e.g., the E1), positive (e.g., the E2), or neutral (e.g., the missing En in series with gn). (Throughout this chapter, the resting membrane potential is chosen to be zero, and all potential values are specified in reference to this convention.)
A central interest of single-neuron computation is to understand this mapping f for both its computational power and the supporting biophysical mechanisms. During the last three decades, biophysical studies of membrane potential dynamics have led to several models of the mapping f [5-12]. From the viewpoint of computation, all of those models can be classified into two categories: arithmetical models and logic models. According to the arithmetical model, the mapping f can be described in terms of arithmetical operations, such as addition and subtraction [5, 12], multiplication [6, 7, 12], and division [5, 11, 12]. By combining these neuronally feasible operations, a single neuron can be regarded as a polynomial approximator [6, 7, 10, 13-15], a normalization functor [11], or a rational approximator [12]. In logic models, the mapping f from conductance space to membrane potential space is modeled as logic operations [10], such as AND-NOT [7], AND and OR [8], and XOR [8, 9]. All of these models represent a substantial challenge to the traditional view of the single neuron as a linear summing machine of input signals. In this chapter we focus on arithmetical models. Because of the elementary importance of arithmetical operations in constructing any complicated mathematical operation, the single neuron's ability to perform arithmetical operations is a fundamental issue in understanding single-neuron computation. Questions related to this issue include: What is the biophysical basis of neuronal arithmetics? What kinds of arithmetical operation can a neuron perform? What are the potential and limitations of single-neuron arithmetics? Do neurons change their operational modes under the control of certain input signals? How do neuronal arithmetical operations support higher level functions?
II. DEFINING NEURONAL ARITHMETICS

A. A BRIEF HISTORY OF STUDY

For the biophysical basis of neuronal arithmetics, most previous works have relied on two types of gated membrane mechanisms: chemical-gated (passive) and voltage-gated (active) channels. Voltage-gated membrane mechanisms have been found to be especially important to logic operations [8-10]. Noticing the relationship between the logic AND operation and arithmetic multiplication, voltage-gated mechanisms are also thought to contribute to multiplicative operations [16]. Recent discoveries made in neurophysiology [17, 18] suggest that voltage-gated mechanisms can, in fact, participate in all computations. They do so by converting computation results (the membrane potentials) into new operands (the conductances) and therefore play a role of integrating several intermediate computations (usually achieved in local dendritic structures) into more complicated outcomes [17]. In contrast, chemical-gated mechanisms (as depicted in Fig. 1) seem to contribute more directly to all types of arithmetic operations. Blomfield [5] studied three types of operations: addition, subtraction, and division, all supported by passive membrane properties. A notable passive membrane property, termed shunting
inhibition, has been studied a great deal by many researchers, and has been found to contribute to multiplicative [6, 7, 16] and divisive [11] operations. Almost all models of neuronal arithmetics studied so far have been built by following a "finite-precision approximation" approach. In this approach, none of those model operations (say, multiplication) is precisely performed by a single neuron. Instead, all of them are approximations to the actual computation by the neuron. The idea is to set up a precision of approximation, and to determine the conditions under which the actual neuronal operation can be reduced to a model operation subject to the precision of approximation. Usually the actual operation is a composite operation, and the model operation is clean. The way the model operation is derived prescribes the restriction for the valid range of the model. For example, Torre and Poggio [6] derived their model of shunting multiplication by applying a Taylor expansion to the original function (a rational function). Accordingly, the result of [6] applies only to small-conductance circumstances [16]. Furthermore, to give a clean multiplication, some additional procedures are required to remove the unwanted terms in the Taylor expansion [19]. To all researchers, addition and subtraction are merely the linear approximation to the actual membrane operation under the condition of small input conductances [5]. This finite-precision approximation approach leads to a logical concern over the conditions under which an arithmetical model applies. Recently, the authors [12] have developed a phase space analysis of the arithmetic ability of passive membrane mechanisms. In that work, a rational function model is proposed as the unified basis for the description of all four types of arithmetical operations (+, −, ×, ÷). The phase diagrams of all four arithmetical operations are determined in the space of input conductances, and serve as a quantitative measure for the potential and limitation of neuronal arithmetic operations. This result extends the previous works in several aspects. For example, subtraction is found to be an "observable" operation, not only for small input signals but also for large inputs; multiplication and division are found to have several possible realizations, not being restricted to shunting inhibition.
B. POLARIZATION CONFIGURATION

Figure 3 Equivalent circuit of two spatially adjacent patches. (a) Two isopolarized patches; (b) two heteropolarized patches; (c) a shunting patch and a polarized patch.

In the equivalent circuit model shown in Fig. 1, the polarity of the battery Esyn is determined by the property of the synaptic connection: if the input is excitatory, then Esyn represents a positive potential with respect to the extracellular ground; if the input is inhibitory, then Esyn is negative. A neutral Esyn (that is, Esyn = 0) corresponds to a special inhibitory mechanism termed shunting inhibition. In general, for two adjacent membrane patches, there are three types of configurations that are significant to computation (Fig. 3): (a) two isopolarized patches, that is, two patches with the same polarity of Esyn, such as the E1 and E2 in
Fig. 3a; (b) two heteropolarized patches, that is, two patches with opposite polarity of Esyn (see Fig. 3b); and (c) a shunting (neutral) patch and a polarized patch (Fig. 3c). Notice that in these equivalent circuits shown in Fig. 3, the nongated conductances gm of the two patches are assumed to be of the same value and are merged into a single conductance. Configurations that are not considered here are symmetrical to the listed configurations (e.g., two inhibitory patches are symmetrical to two excitatory patches) or computationally trivial (e.g., two shunting patches). With the polarization configuration defined, various arithmetic operations can be termed accordingly. For example, the multiplication model studied in [6,7,16] can be referred to as the shunting multiplication, the divisive operations constructed by [11] as shunting division, and so on. Additions and subtractions studied by many researchers can be termed isopolarized addition and heteropolarized subtraction, respectively, or simply addition and subtraction, because the polarization configuration for these two linear operations is implied from the name of the operation.
C. COMPOSITE OPERATION AND CLEAN OPERATION

In a neuronal arithmetic computation, the operands are the input conductances and the result is the local membrane potential. The set of operators involved in a computation defines an arithmetical mode. The nature of a neuronal arithmetical mode is described by several specifications. First of all, we may distinguish between clean operations and composite operations. In a clean arithmetic operation, only one type of operator (+, −, ×, or ÷) is involved (e.g., addition with several input conductances, yielding a membrane potential value). The clean operations considered here include the case in which an operand may be a variable
or a variable multiplied with a constant (e.g., a·x1 + x2, where a is a constant), and the case in which the result of the operation may contain an additive constant (e.g., x1x2 + b for constant b). In particular, group divisions are considered as a clean operation in which both the numerator p and the denominator q of a rational expression p/q are polynomials of several input conductances. Composite operations are constructed by applying subsequent operations to the result of several clean operations (e.g., x1x2 + x3/x4). Composite arithmetics are usually observed in the case of group operations where more than two input conductances are involved. Two famous examples of composite operation that have been considered by many researchers are:

V = Σ_{i,j} w_{ij} g_i g_j: addition-after-multiplication/subtraction (ΣΠ operation),

V = [Σ_{k} w_k g_k][Σ_{j} w_j g_j]: multiplication-after-addition/subtraction (ΠΣ operation).

Figure 4a shows a possible case of the ΣΠ operation. In that example, components of two input vectors are first multiplied by each other, and the products are then summed up (inner product of two vectors). An example of the ΠΣ operation is shown in Fig. 4b. In both figures, a total of eight input conductances is to be dealt with. For the well-definedness of arithmetical operations, two values g_min and g_max are chosen as the lowest and highest values of the input conductances, respectively, and v_min and v_max are the lowest and highest values of local membrane potentials. For real neurons, v_min = E_IPSP ≈ −5 mV and v_max = E_EPSP ≈ 65 mV are usually assumed. For the conductances, a natural setting is g_min = 0, and g_max can be, say, 10^−5 S.

Figure 4 (a) Sigma-Pi operation. (b) Pi-Sigma operation.
Following the finite-precision approximation approach, two types of measure are used to characterize a model, clean operation: (1) For a given ε > 0, the actual, composite operation V = f(g1, g2, ..., gn) is said to be approximated by a clean operation g1 ∘ g2 ∘ ··· ∘ gn in a subspace S of R^n(g1, g2, ..., gn) if, for any (g1, g2, ..., gn) ∈ S, we have |f(g1, g2, ..., gn) − g1 ∘ g2 ∘ ··· ∘ gn| < ε. S is referred to as the domain range of the operation ∘. (2) Given a domain range S and (g1, g2, ..., gn) ∈ S, the quantity η = max{g1 ∘ g2 ∘ ··· ∘ gn} − min{g1 ∘ g2 ∘ ··· ∘ gn} is used to describe the dynamic range of the operation ∘.
D. DEVELOPEDNESS OF ARITHMETICAL MODES

The measures of domain range and dynamic range present quantitative criteria for the biological plausibility of a model operation. Generally speaking, if a proposed model arithmetic can be implemented by a single neuron only in very rare cases (i.e., with small domain range), it can hardly be accepted as a reasonable model. Formally, we can describe model operations in three categories: less developed, developed, and well developed. An arithmetical mode is said to be developed if it has a wide domain range (compared with gm) in at least one input conductance dimension; it is well developed if, in addition, the domain range S measures a large portion of the input space R^n(g1, g2, ..., gn); and it is less developed if the domain range is narrow in all dimensions.
III. PHASE SPACE OF NEURONAL ARITHMETICS

A. PROTOTYPES OF CLEAN ARITHMETICS

In this section we describe the structure of the input conductance space R^n(g1, g2, ..., gn). It has been shown in [12] that for the system of n adjacent patches, the most general composite operation that can be performed by any of the three polarization configurations shown in Fig. 3 is the rational function

V = (Σ_j E_j g_j) / (n gm + Σ_j g_j).   (1)

In this rational function, if we restrict ourselves to positive ohmic relations, then all coefficients, with the possible exception of the E_j, are positive constants. It is obviously analytical everywhere in its domain of definition. Furthermore, it is easy to verify that V is a monotonically increasing (or decreasing) function of g_j, j = 1, 2, ..., if E_j > 0 (or E_j < 0). It also saturates itself for large-valued input conductances because both the numerator and the denominator are of the same degree in g_j, j = 1, ..., n. In fact, we have

min{E1, E2, ..., En} ≤ V ≤ max{E1, E2, ..., En}.   (2)

An even looser estimator follows:

E_IPSP ≤ V ≤ E_EPSP.   (3)

In other words, the results of neuronal arithmetical operations are bounded at both the lower and upper ends. Based on the general rational function (1), we construct the following prototypes of clean arithmetic operations subject to the criterion of developedness as defined in the last section:

1. Summation and subtraction. In these two clean operations, the rational function (1) is approximated by the linear arithmetic

V ≈ (1/(n gm)) Σ_j E_j g_j,   (4)

where the E_j (j = 1, ..., n) are constant. In this linear arithmetic, summation takes place among the input conductances of isopolarized patches and subtraction among heteropolarized patches.

2. Multiplication. The composite operation (1) also has multiplicative approximations. In the case of n = 2, where (1) is reduced to V = (E1 g1 + E2 g2)/(2gm + g1 + g2), we may construct an approximation of the form

V ≈ (E2/(4 gm²)) g1 g2.   (5)

3. Division. In (1), suppose that the index set {j} is partitioned into two nonoverlapping subsets {k} and {l} such that

V = (Σ_{k} E_k g_k + Σ_{l} E_l g_l) / (n gm + Σ_{k} g_k + Σ_{l} g_l),   (6)

where K + L = n. A divisive approximation of (1) takes the form

V ≈ (Σ_{k} E_k g_k) / (n gm + Σ_{l} g_l);   (7)

that is, the input conductances corresponding to the index set {l} effectively divide the conductances in the subset {k}. For the case of n = 2 (L = K = 1), the rational function V = (E1 g1 + E2 g2)/(2gm + g1 + g2) is approximated by E1 g1/(g2 + 2gm).
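As a numerical illustration of the model (1) and of the linear prototype (4) as reconstructed above, a short sketch follows; it assumes NumPy, the names are illustrative, and the parameter values follow the chapter (E_EPSP = 65, gm = 10^−6).

```python
# A numerical sketch of the rational-function model (1), assuming NumPy.
# E holds the reversal potentials E_j (mV), g the input conductances (S).
import numpy as np

def membrane_potential(g, E, g_m=1e-6):
    """V = (sum_j E_j g_j) / (n g_m + sum_j g_j), Eq. (1)."""
    g, E = np.asarray(g, float), np.asarray(E, float)
    return float(np.dot(E, g) / (len(g) * g_m + g.sum()))

# For small inputs the exact value approaches the linear prototype (4):
g = [2e-8, 3e-8]
print(membrane_potential(g, [65.0, 65.0]))   # exact value of Eq. (1)
print(65.0 * sum(g) / (2 * 1e-6))            # linear estimate, Eq. (4)
```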
B. DOMAIN RANGE

For the case of n = 2, the domain range S of the arithmetic approximations (4), (5), and (7) is described by certain inequalities in g1 and g2, which specify S as an area (finite or infinite) in the g1-g2 plane. (In all formulas below, ε > 0 is assumed. For the derivation of these formulas, see [12].)

1. Linear Operations

Case I. E1 = E2 ≠ 0 (isopolarized patches; additive operation):

0 < g1 + g2 < gm[ε + √(ε(ε + 4|E1|))]/|E1|.   (8)

Case II. E1 > 0 > E2 (heteropolarized patches; subtractive operation; the case of E2 > 0 > E1 follows by symmetry): when g2 > gm[ε + √(ε(ε + 4|E2|))]/|E2|,

√([g2(|E1| + |E2|) − 2εgm]² − 16ε|E1|gm²) − 2εgm < 2|E1|g1 + (|E1| − |E2|)g2 < √([g2(|E1| + |E2|) + 2εgm]² + 16ε|E1|gm²) + 2εgm,   (9)

and when g2 < gm[ε + √(ε(ε + 4|E2|))]/|E2|,

0 < 2|E1|g1 < √([g2(|E1| + |E2|) + 2εgm]² + 16ε|E1|gm²) + 2εgm − (|E1| − |E2|)g2.   (10)

It is seen from Fig. 5 that the domain ranges of the two types of linear operation possess different natures. In brief, the additive operation is performed when both g1 and g2 are small with respect to gm (refer to relation (8)). This observation is consistent with the general intuition that the linear approximation (4) is adequate for small-signal situations. Nevertheless, as can be seen from Fig. 5b, the domain of the subtractive operation is not described by small signals: its S occupies a band expanding over a wide range in both the g1 and g2 dimensions. The slope of this band can be roughly described by the ratio |E2|/|E1|. For E1 = 65 and E2 = −5, this indicates that the input conductance of the inhibitory patch must be ~13 times higher than the input conductance of the excitatory patch to perform a subtractive operation.

Figure 5 Domain range of the linear approximation in the case of (a) E1 = E2 = E_EPSP and (b) E1 = E_EPSP and E2 = E_IPSP. Parameters for the curves: E_EPSP = 65, E_IPSP = −5, gm = 10^−6, ε = 5.
2. Multiplication

The domain range of (5) is determined for three cases:

Case I. E1 = E2 ≠ 0 (isopolarized multiplication):

4gm²(|E1| − ε) − |E1|g2(g2 + 2gm) + √([4gm²(|E1| − ε) + |E1|g2(g2 + 2gm)]² − 32E1²gm³g2) < 2|E1|g1g2 < 4gm²(|E1| + ε) − |E1|g2(g2 + 2gm) + √([4gm²(|E1| + ε) + |E1|g2(g2 + 2gm)]² − 32E1²gm³g2).   (11)

Case II. E1 > 0 > E2 (heteropolarized multiplication):

4gm²(|E2| − ε) − |E2|g2(g2 + 2gm) + √([4gm²(|E2| − ε) + |E2|g2(g2 + 2gm)]² − 32E2²gm³g2) < 2|E2|g1g2 < 4gm²(|E2| + ε) − |E2|g2(g2 + 2gm) + √([4gm²(|E2| + ε) + |E2|g2(g2 + 2gm)]² − 32E2²gm³g2).   (12)

Case III. E1 = 0, E2 ≠ 0 (shunting multiplication):

√([g2|E2|(g2 + 2gm) − 4εgm²]² + 16(gm g2|E2|)²) − [g2|E2|(g2 + 2gm) + 4εgm²] < 2|E2|g1g2 < √([g2|E2|(g2 + 2gm) + 4εgm²]² + 16(gm g2|E2|)²) − [g2|E2|(g2 + 2gm) − 4εgm²].   (13)

Figure 6 Domain range of the multiplicative approximation in the case of (a) E1 = E2 = E_EPSP, (b) E1 = E_EPSP and E2 = E_IPSP, and (c) E1 = 0 and E2 = E_EPSP.

Figure 6 illustrates the domain range of the multiplicative operation (5) in isopolarized, heteropolarized, and shunting cases. In both the isopolarized and heteropolarized cases, S is a narrow band bounded by two curves (not strictly hyperbolic; see relations (11) and (12)). When one of the two patches is of the shunting type, the shunting conductance g1 can vary over a wide range for small g2, and this range becomes narrower as the shunted conductance increases (Fig. 6c).
3. Division

Case I. E1 = E2 ≠ 0 (isopolarized division):

[−ε(g2 + 4gm) + 2gm|E1| + √([2gm|E1| − εg2]² + 4|E1|(|E1| + ε)g2²)] / (2(|E1| + ε)) < g1 < [ε(g2 + 4gm) − 2gm|E1| + √([εg2 + 2gm|E1|]² + 4|E1|(|E1| − ε)g2²)] / (2(|E1| − ε)).   (14)

Case II. E1E2 < 0 (heteropolarized division):

0 ≤ g1 < [ε(g2 + 4gm) − 2gm|E1| + √([εg2 + 2gm|E1|]² − 4|E2|(|E1| − ε)g2²)] / (2(|E1| − ε)), with g2 ≤ gm(|E1| − ε)/(√(|E2|(|E1| − ε)) − 0.5ε).   (15)

Case III. E1 = 0 and E2 ≠ 0 (shunting division):

g1 > g2[√(0.25 + |E2|/ε) − 0.5] − 2gm.   (16)

The S of the isopolarized division is shown in Fig. 7a. As in the case of subtraction, the S is a band with a slope of almost 1. There are two possibilities with the case of E1E2 < 0, that is, E1 > 0 > E2 (Fig. 7b) and E1 < 0 < E2 (Fig. 7c). In both cases, the S stretches along the dimension of the inhibitory conductance. The S of the shunting division occupies one-half of the g1-g2 plane, indicating a very loose restriction to be met by the two participating conductances in performing a division.
Figure 7 Domain ranges of the divisive approximation. (a) When E1 = E2 = E_EPSP. (b) E1 = E_EPSP and E2 = E_IPSP. (c) E1 = E_IPSP and E2 = E_EPSP. (d) When E1 = 0 and E2 = E_EPSP. Parameters for all curves: E_EPSP = 65, E_IPSP = −5, gm = 10^−6, ε = 5.
C. DYNAMIC RANGE

Because of the monotonicity of the rational function (1) and of its various approximations (4), (5), and (7), it suffices to determine the dynamic range of each prototype operation by looking at the function values on the boundaries of S. For the additive operation and the shunting divisive operation, it is readily known that η = [ε + √(ε(ε + 4|E1|))]/2. With ε = 5, this gives η = 20.7 (mV) for E1 = 65 (EPSP) and η = 8.1 (mV) for E1 = −5 (IPSP). Figure 8 illustrates the dynamic range of other prototype operations.
[Figure 8: dynamic ranges of the prototype operations: (a) subtractive, (b) isopolarized multiplicative, (c) heteropolarized multiplicative, (d) shunting multiplicative, and (e) isopolarized divisive.]
D. PHASE DIAGRAM
The phase diagram is a map of the input conductance space that classifies the subspaces of input conductances according to the domain range of the prototype operations. Figure 9 gives the phase diagrams for all three types of polarization configurations. Two types of information can be read from a phase diagram. First, the phase diagram prescribes the possible types of arithmetical operation that can be performed by a specific polarization configuration. As can be seen from Fig. 9, two isopolarized patches can effectively perform addition, multiplication, and division, but no subtraction (Fig. 9a); two heteropolarized patches may perform subtraction, multiplication, and (a little) division (Fig. 9b); finally, multiplication and division are two typical operations in the shunting configuration (Fig. 9c). Secondly, the phase diagram also gives the domains of operation and the relationships between the domains of different operations. There are overlaps between some phases, indicating that for certain input values the arithmetic function of the system may be interpreted in different ways. The areas not marked with specific functions correspond to the general rational function (1).

Figure 9 Phase diagram of (a) the system of two isopolarized patches, (b) the system of two heteropolarized patches, and (c) the system of a shunting patch and an excitatory patch. Add, additive phase; Sub, subtractive phase; Mul, multiplicative phase; Div, divisive phase.
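A phase diagram of this kind can also be probed numerically by testing the ε-criterion of Section II.C on a grid of input conductances. The sketch below does this for the isopolarized configuration, using the prototype approximations as reconstructed in Section III.A; the grid and all names are illustrative choices.

```python
# Numerical probe of a phase diagram for two isopolarized patches, using
# the epsilon-criterion of Section II.C and the prototype approximations.
import numpy as np

E1 = E2 = 65.0          # two isopolarized (excitatory) patches
g_m, eps = 1e-6, 5.0

def f(g1, g2):          # the rational function (1) for n = 2
    return (E1 * g1 + E2 * g2) / (2 * g_m + g1 + g2)

prototypes = {
    "additive":       lambda g1, g2: (E1 * g1 + E2 * g2) / (2 * g_m),
    "multiplicative": lambda g1, g2: E2 * g1 * g2 / (4 * g_m ** 2),
    "divisive":       lambda g1, g2: E1 * g1 / (g2 + 2 * g_m),
}

grid = np.linspace(1e-8, 1e-5, 150)
for name, approx in prototypes.items():
    hits = sum(abs(approx(a, b) - f(a, b)) < eps
               for a in grid for b in grid)
    print(f"{name}: {hits / len(grid)**2:.3f} of the grid")
```

A grid point (g1, g2) is assigned to a phase whenever the corresponding clean operation approximates (1) to within ε; overlapping phases then show up as points assigned to more than one operation.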
IV. MULTIMODE NEURONAL ARITHMETIC UNIT

A. DEVELOPEDNESS OF ARITHMETIC OPERATIONS

Given a prototype of clean arithmetic operation, we can discuss its developedness based on the phase diagram. This discussion defines the biological plausibility of that arithmetic operation in a specific polarization configuration. As shown in Fig. 9, a total of eight phases can be observed in the three possible polarization configurations. Their respective levels of developedness are described below and are summarized in Table I.

Table I  Developedness of Arithmetic Operations

Operation                                   | Domain range in g1 | Domain range in g2 | Phase area | Developedness
+                                           | Narrow             | Narrow             | Small      | Less developed
−                                           | Wide               | Wide               | Small      | Developed
× (isopolarized)                            | Wide               | Wide               | Small      | Developed
× (heteropolarized)                         | Narrow             | Wide               | Small      | Developed
× (shunting)                                | Wide               | Wide               | Small      | Developed
÷ (isopolarized)                            | Wide               | Wide               | Small      | Developed
÷ (heteropolarized, excitatory/inhibitory)  | Narrow             | Narrow             | Small      | Less developed
÷ (heteropolarized, inhibitory/excitatory)  | Wide               | Narrow             | Small      | Developed
÷ (shunting)                                | Wide               | Wide               | Large      | Well developed

1. Subtraction

Subtraction can be observed only in heteropolarized patches, and is a developed mode because of its broad domain range in both the g1 and g2 dimensions (Fig. 9b). Moreover, its wide dynamic range (Fig. 8a) also suggests it to be a significant prototype operation in heteropolarized patches. However, it is not well developed because of its limited phase area. According to Blomfield's work [5], the subtraction is plausible for small input conductances. As an extension to this observation, it is evident from the phase diagram (Fig. 9b) that the subtraction operates not only for small values of input conductances but also for large values. In fact, the long, narrow band of the subtraction phase implies that it is the interrelation between the two input conductances, rather than the magnitude, that is critical to the formation of a subtractive operation.

2. Addition

A linear operation, addition (observable in isopolarized patches) is a less developed operation because of its narrow domain range in both dimensions (in comparison to gm). This fact calls into question the validity of taking the summation-and-firing model as a widely observable operation, even though addition can be a fair prototype as long as the membrane input is sustained at a low level, as in the case of spontaneous presynaptic activities.

3. Multiplication

Multiplicative operation is found to be developed in all three configurations. The isopolarized and shunting multiplications possess a wide domain range in both the g1 and g2 dimensions (Fig. 9a and c), whereas the heteropolarized multiplication is widely defined only in the g2 dimension. The dynamic range of the isopolarized and shunting multiplications is wide (Fig. 8b and d). In contrast, heteropolarized multiplication has a narrow dynamic range (Fig. 8c). Combining these, it can be implied that multiplication is a more widely observable operation in isopolarized and shunting configurations than in heteropolarized configurations.

4. Division

Division is also observed in all three configurations. As indicated by the small domain range (Fig. 7b), the heteropolarized division of the excitatory-divides-inhibitory type is a less developed mode. On the other hand, the division of the inhibitory-divides-excitatory type can be observed for a large range of inhibitory conductance values (Fig. 7c), suggesting it to be a developed operation (this is consistent with Blomfield's study [5]). This asymmetry results from the difference in the magnitude of E_EPSP and E_IPSP. The isopolarized division is a developed mode (Figs. 9a and 8e) and possesses a narrow-phase band similar to the
case of subtraction. Of all eight prototype clean operations, the shunting division is the only well-developed mode. This is evident from its broad domain range (Fig. 9c) and its large phase area (a half phase plane). It also has a wide dynamic range, η = [ε + √(ε(ε + 4|E_i|))]/2, which is independent of the values of the two input conductances.
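For reference, the classifications of Table I can be captured in a small lookup structure. The following Python sketch simply encodes the table rows; the string keys are our own shorthand, not notation from the chapter.

# Developedness of the prototype clean operations, encoding the rows
# of Table I as a lookup structure.
DEVELOPEDNESS = {
    ("+", "isopolarized"):                "less developed",
    ("-", "heteropolarized"):             "developed",
    ("x", "isopolarized"):                "developed",
    ("x", "heteropolarized"):             "developed",
    ("x", "shunting"):                    "developed",
    ("/", "isopolarized"):                "developed",
    ("/", "heteropolarized, inh/exc"):    "less developed",
    ("/", "heteropolarized, exc/inh"):    "developed",
    ("/", "shunting"):                    "well developed",
}

def developedness(op, configuration):
    """Look up an (operation, configuration) pair, e.g. developedness("/", "shunting")."""
    return DEVELOPEDNESS.get((op, configuration), "not observed")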
B. MULTIMODE ARITHMETIC UNIT

Most previous models can be regarded as "single-mode" models, in the sense that a single model neuron is usually equipped with a single type of composite or clean operation. The phase space representation provides a viewpoint from which a single neuron can be considered as potentially capable of several arithmetical operations. In recent work by the authors [12], the concept of multimode single-neuron arithmetics was introduced. For a given polarization configuration, the developed and well-developed arithmetical operations correspond to possible operational modes in which the system may stay. If there is more than one developed or well-developed operation in the phase diagram of the given configuration, then that configuration is potentially capable of multimode arithmetical operation. Thus, corresponding to the three polarization configurations in Fig. 3, one can define three types of two-input, multimode arithmetic units (AUs) (Fig. 10). In each type of AU there is a set of internal arithmetic modes, in each of which the operation of the AU can be described by a clean, prototype operation. Computationally, we may consider these AUs as possessing a set of internal arithmetic instructions; the selection among these instructions is determined by the mode the AU is currently in. If the AU is in none of these modes, then its operation is described by the general rational function (1).
Figure 10  Three types of arithmetic units, (a)-(c), corresponding to the three polarization configurations; each unit receives the two input conductances g1 and g2.
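To make the mode-selection logic concrete, the following Python sketch shows how a two-input AU of this kind might dispatch on its current operands. The phase test, the numerical constants, and the form of the fallback function are hypothetical placeholders; in the model itself the phases are those of the Fig. 9 diagrams and the clean operations are those given by Eqs. (4), (5), and (7).

# Hypothetical sketch of a two-input multimode arithmetic unit (AU).
# The phase boundaries and constants below are illustrative stand-ins,
# not the published phase diagram of Fig. 9.

def classify_phase(g1, g2):
    """Return the clean mode that (g1, g2) falls into, or None."""
    if g1 < 0.1 and g2 < 0.1:
        return "add"          # low sustained input: linear regime
    if abs(g1 - g2) < 0.02:
        return "div"          # highly synchronized, comparable drive
    if g1 > 0.5 and g2 > 0.5:
        return "mul"          # both inputs strongly driven
    return None               # no clean phase

def rational(g1, g2, a=1.0, b=1.0, c=1.0):
    """Stand-in for the general rational function (1)."""
    return (a * g1 + b * g2) / (c + g1 + g2)

def au_output(g1, g2):
    mode = classify_phase(g1, g2)
    if mode == "add":
        return g1 + g2
    if mode == "mul":
        return g1 * g2
    if mode == "div":
        return g1 / g2
    return rational(g1, g2)   # composite operation outside the clean phases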
C. TRIGGERING AND TRANSITION OF ARITHMETIC MODES
Triggering of clean arithmetic modes has been widely observed in neurophysiological experiments. As an example, we consider a normalization model for simple cells of primate visual cortex [11, 20]. In this model, the simple cell's response begins with a linear stage, which performs essentially a linear summation of its input, followed by a normalization stage (Fig. 11). At the normalization stage, each cell's response is divided by a quantity proportional to the pooled activity of the other cortical cells. This model can explain a variety of physiological phenomena [20, 21]. Recently the authors [12] considered an implementation of this model using a group division of the form (7). In fact, if we interpret the group K in that formula as the input from the LGN and the group L as the pooled cortical activity, then the normalization model is restated. A numerical simulation result is illustrated in Fig. 12. This simulation introduces a general question about the conditions under which the input may trigger a specific arithmetic mode. In the case of n = 2, we have seen that as long as the value of the (g1, g2) pair falls into a specific phase in the phase plane, the corresponding arithmetical operation is performed. Functionally significant situations are those in which the input conductance pair (g1, g2) remains in a specific phase long enough for a sustained clean operation to be performed. Because in most practical systems both g1 and g2 are time varying, it is not the magnitude of a single input conductance but the temporal pattern of several changing conductances that is relevant to the triggering of a specific arithmetic mode.
Figure 11  Summation and division by visual cortical neurons: LGN input feeds a linear summation stage, followed by a normalization stage and an output nonlinearity.
[Figure 12 appears here: paired panels plotting group conductance g (S) against time t (ms) on the left, and membrane potential V (mV) on the right, with curve 1 the rational function, curve 2 the normalization model, and curve 3 the error.]
Figure 12  Simulation of the normalization model. A total of 400 patches were used, and the average LGN contribution to the membrane potential is set to 20 mV. In (a), (c), and (e), a hypothetical sinusoidal temporal pattern of LGN activity Σ_K g_k is shown as curve 1, the lower bound for the intracortical activity level is shown as curve 2, and an admissible intracortical activity is shown as curve 3. The intracortical activity is determined in (a) by white noise superposed on the lower bound, in (c) as an oppositely phased temporal pattern, and in (e) as an in-phase pattern. The corresponding original rational model and the normalization model are depicted on the right. For all curves, ε = 1.
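A toy Python sketch of this group-division reading of the normalization stage follows: the response is the summed LGN-driven conductance divided by the pooled intracortical activity. The functional form, the constant epsilon, and all conductance values are hypothetical stand-ins; the chapter's actual implementation is the group division (7).

import math

def normalized_response(g_lgn, g_pool, epsilon=1.0):
    """Toy group division: LGN drive divided by pooled cortical activity."""
    return sum(g_lgn) / (epsilon + sum(g_pool))

# Sinusoidal LGN drive (in the spirit of Fig. 12) against a constant pool.
times = [0.1 * k for k in range(150)]                                 # ms (illustrative)
lgn = [[0.009 + 0.004 * math.sin(2 * math.pi * t / 15.0)] for t in times]
pool = [[0.005]] * len(times)
responses = [normalized_response(k, l) for k, l in zip(lgn, pool)]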
From Fig. 9, we may distinguish the following two types of cases:
1. The persistent triggering of the isopolarized multiplication and division, of the heteropolarized subtraction, and of the shunting multiplication demands highly ordered (g1, g2) activity patterns. Specifically, in-phase (g1, g2) temporal patterns may effectively trigger isopolarized division and heteropolarized subtraction, and oppositely phased (g1, g2) patterns may trigger a multiplication.
2. Some clean arithmetics, such as addition and shunting division, do not require highly ordered input patterns. As pointed out earlier, addition is performed whenever both input conductances are small, and it is typically associated with spontaneous presynaptic activity. The shunting division does not require well-organized input either, although a patterned input still triggers it.
For the three types of AUs shown in Fig. 10, the strategy of arithmetical operation can be summarized as follows. (1) For the first type of AU, if the input level is low, linear addition may be performed; if the input level is high and highly synchronized, division is performed; with high-level, oppositely phased input, multiplication is performed. (2) For the second type of AU, low-level input can effectively evoke division, subtraction, and multiplication, and high-level in-phase activity triggers subtraction. (3) For the third type of AU, high-level, oppositely phased input produces multiplication, and sufficiently high-level activity at the shunting input terminal produces division.
A sustained clean arithmetic mode is maintained by a particular input pattern; when the input pattern changes, the operational mode changes as well. Figure 13 illustrates the transition of the arithmetic mode in an AU with the isopolarized configuration (a sketch of this data-driven mode selection follows below). The neuronal AU's ability to change its internal arithmetic mode is an interesting property. In general, a necessary condition for a computational mechanism to be programmable is that there be a set of internal modes in the mechanism's state space that can be selected by input instructions. In this sense, the neuronal AU is a promising candidate for programmable, data-driven arithmetic computation.
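The following Python sketch mimics the data-driven mode transition of Fig. 13 for an isopolarized AU. The rule separating the multiplicative and divisive phases, the threshold, and the operand values are made-up stand-ins for the digitized phase diagram of Fig. 13a; only the principle, that the current operands select the operator, follows the text.

def select_mode(g1, g2, threshold=15.0):
    # Hypothetical phase rule: strong combined drive falls in the
    # divisive phase, weaker drive in the multiplicative phase.
    return "div" if g1 + g2 > threshold else "mul"

def run(operands):
    mode = None
    for g1, g2 in operands:
        new_mode = select_mode(g1, g2)
        if new_mode != mode:                 # the input pattern has changed
            mode = new_mode
            print("mode transition ->", mode)
        result = g1 / g2 if mode == "div" else g1 * g2
        print("  (%g, %g) %s: %.3f" % (g1, g2, mode, result))

run([(4, 9), (12, 13), (11, 12), (8, 4)])    # illustrative operand stream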
V. TOWARD A COMPUTING NEURON

So far we have investigated in detail the arithmetical power of single AUs. To summarize, the three types of AUs shown in Fig. 10 can effectively perform a set of prototypical arithmetical operations, as specified by (4), (5), and (7), when the input values fall into the corresponding phase areas, and perform the general rational function (1) otherwise. Because the rational function (1) is an approximation to a more general rational-function model of neuronal spatial summation, obtained when the axial resistance network of Fig. 2 is short-circuited, these single AUs may represent local computations performed across a small piece of membrane patch. As shown in Fig. 14, for a typical cortical neuron with branching dendrites (Fig. 14a), we may approximate the local computation in each dendritic segment by an AU with multiple inputs. Because of the large values of the axial resistances, the computational results of these AUs can be considered as making virtually no contribution to one another (Fig. 14b). An integrated model of multimode single-neuron arithmetics can then be constructed by assembling these local computing AUs with certain active membrane mechanisms.
Figure 13  Transition between multiplicative and divisive modes. (a) A hypothetical "digitized" phase diagram of the isopolarized configuration; the multiplicative phase is marked with × and the divisive phase with /. (b) Operations performed by an AU of the isopolarized type with two input terminals. The operator of the AU is chosen by the current operands; the currently active operator is indicated by a double circle.
Readers interested in the details of the supporting biophysical mechanisms are referred to recent neurophysiological experiments [18, 22, 23]. Some operational features of the integrated model of multimode single-neuron arithmetics are summarized as follows:
• A single neuron is modeled by a multilevel tree (Fig. 14c). Each node of the tree consists of an AU with multiple inputs; its output corresponds to the local computation result in a dendritic segment. The root node corresponds to the soma of the neuron, and the output of the root AU stands for the output of the integrated neuron. Synaptic inputs can be fed to leaf nodes as well as to any internal node. Each edge of the tree carries a converter C, whose function is a one-to-one mapping from voltage to conductance.
Figure 14  (a) A model dendrite with multiple synaptic inputs. (b) Local computation in each dendritic segment is approximated by an AU; the computations in different AUs are independent of each other. (c) Supported by active membrane mechanisms, the local computation results can be integrated.
• For each leaf AU, rational arithmetic is performed on the input conductances. The operands of an internal AU are either input conductances or the voltage-converted conductances produced by the afferent C. In particular, voltage-converted conductances can participate in both clean and composite operations.
• The cascaded operation may work either synchronously or asynchronously. For synchronous operation, all AUs are assumed to start a new operation at the same time and to produce an arithmetic result in unit time; the whole neuron thus works in a pipelining fashion (a schematic sketch of this tree evaluation follows below).
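As a structural sketch, the following Python fragment models the multilevel AU tree: each node evaluates its operands (direct synaptic conductances plus the voltage-converted outputs of its children), and the root plays the role of the soma. The converter and the local operation used here are hypothetical placeholders; only the tree layout and the bottom-up evaluation order follow the description above.

def convert(voltage):
    """Placeholder one-to-one voltage-to-conductance mapping (converter C)."""
    return 0.001 * voltage

class AUNode:
    def __init__(self, children=(), synaptic_inputs=()):
        self.children = list(children)                # afferent AUs
        self.synaptic_inputs = list(synaptic_inputs)  # direct conductances

    def evaluate(self):
        # Operands: direct synaptic conductances plus the voltage-converted
        # conductances produced by the afferent converters.
        operands = self.synaptic_inputs + [convert(c.evaluate()) for c in self.children]
        total = sum(operands)
        # Placeholder local computation standing in for the rational arithmetic.
        return total / (1.0 + total)

# Leaves model dendritic tips; the root models the soma.
leaf1 = AUNode(synaptic_inputs=[0.010, 0.020])
leaf2 = AUNode(synaptic_inputs=[0.015])
soma = AUNode(children=[leaf1, leaf2], synaptic_inputs=[0.005])
print(soma.evaluate())                                # integrated output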
VI. SUMMARY

Interest in modeling single-neuron computation has constantly been shaped by two considerations: computational ability and biophysical plausibility. In this chapter we have presented an introductory review of the potential and limitations of single-neuron arithmetics. Driven by suitably patterned input signals, passive membranes can effectively perform multimode arithmetic operations on the input conductances. Based on this observation, an abstract model of the neuronal arithmetic unit was described. By taking active membrane mechanisms into account, an integrated neuron model can be constructed that may function as a programmable rational approximator.
REFERENCES
[1] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79:2554-2558, 1982.
[2] T. McKenna, J. Davis, and S. F. Zornetzer, Eds. Single Neuron Computation. Academic Press, Boston, 1992.
[3] A. Finkelstein and A. Mauro. Physical principles and formalisms of electrical excitability. In Handbook of Physiology (E. R. Kandel, Ed.), Vol. 1, Part 1, pp. 161-213. Am. Physiol. Soc., Bethesda, MD, 1977.
[4] E. R. Kandel, J. H. Schwartz, and T. M. Jessell, Eds. Principles of Neural Science, 3rd ed. Elsevier, New York, 1991.
[5] S. Blomfield. Arithmetical operations performed by nerve cells. Brain Res. 69:115-124, 1974.
[6] V. Torre and T. Poggio. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. Roy. Soc. London Ser. B 202:409-416, 1978.
[7] C. Koch, T. Poggio, and V. Torre. Retinal ganglion cells: A functional interpretation of dendritic morphology. Philos. Trans. Roy. Soc. London Ser. B 298:227-264, 1982.
[8] G. M. Shepherd and R. K. Brayton. Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neuroscience 21:151-166, 1987.
[9] A. M. Zador, B. J. Claiborne, and T. J. Brown. Nonlinear pattern separation in single hippocampal neurons with active dendritic membrane. In Advances in Neural Information Processing Systems (J. Moody, S. Hanson, and R. Lippmann, Eds.), Vol. 4, pp. 51-58. Morgan Kaufmann, San Mateo, CA, 1992.
[10] B. W. Mel. Information processing in dendritic trees. Neural Comput. 6:1031-1085, 1994.
[11] M. Carandini and D. Heeger. Summation and division by neurons in primate visual cortex. Science 264:1333-1336, 1994.
[12] M. Wang and C. N. Zhang. Single neuron local rational arithmetic revealed in phase space of input conductances. Biophys. J. 71:2380-2393, 1996.
[13] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247:978-982, 1990.
[14] R. Durbin and D. E. Rumelhart. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Comput. 1:133-142, 1990.
[15] B. W. Mel and C. Koch. Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 2, pp. 474-481. Morgan Kaufmann, San Mateo, CA, 1990.
[16] C. Koch and T. Poggio. Multiplying with synapses and neurons. In Single Neuron Computation (T. McKenna, J. Davis, and S. F. Zornetzer, Eds.), pp. 315-345. Academic Press, Boston, 1992.
[17] M. Barinaga. Dendrites shed their dull image. Science 268:200-201, 1995.
[18] J. C. Magee and D. Johnston. Synaptic activation of voltage-gated channels in the dendrites of hippocampal pyramidal neurons. Science 268:301-304, 1995.
[19] T. Poggio and V. Torre. A theory of synaptic interactions. In Theoretical Approaches in Neurobiology (W. E. Reichardt and T. Poggio, Eds.), pp. 28-38. MIT Press, Cambridge, MA, 1981.
[20] D. Heeger. Normalization of cell responses in cat striate cortex. Visual Neurosci. 9:181-197, 1992.
[21] D. Heeger. Modeling single-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol. 70:1885-1898, 1993.
[22] W. A. Catterall. Structure and function of voltage-sensitive ion channels. Science 242:50-61, 1988.
[23] N. Spruston, Y. Schiller, G. Stuart, and B. Sakmann. Activity-dependent action potential invasion and calcium influx into hippocampal CA1 dendrites. Science 268:297-300, 1995.
Index
Adaptive modular neural networks, 167-177 Admissible control matrix for recurrent neural networks, 23 Admissible discrete time mixed recurrent neural networks, 20 Admissibility of an activation function for recurrent neural networks, 5, 23 Alternating minimization algorithm for training Boltzmann machines, 66 of Byrne for Boltzmann machines, 71 Amari's information geometry, 57 Annealing schedule, 55 Application of EM algorithm for Boltzmann machine training, 76 Associative memories in neural network systems, 183-192 associative memory problems, 184-185 neural network associative memories, principles, 185-191 Attractor neural networks, 55 Attractor type neural networks, 192-253 B Bayesian statistics, 54 Binary classification tasks in constructive learning techniques, 95 Boltzmann-Gibbs distribution, 52 Boltzmann machine basic variations, 81-86 Boltzmann Perceptron, 81-82 Neal's approach based on belief networks, 84-86 polytomous Boltzmann machines, 82-84
Boltzmann machine links with the Markov chain and Monte Carlo techniques, 52 Boltzmann machines, 51-87 algorithms for training, 51-53 as a class of recurrent stochastic networks, 51 deterministic origins, 55-56 hidden units, 56-57 in optimization problems, 56 relationships with Markov chain and Monte Carlo methods, 53-54 statistical associations, 51-53 without hidden units, 58 Boltzmann machine training, 57-67 alternating minimization, 63-64 case with hidden units, 59-60 case without hidden units, 58-59 case with training data, 61-62 Gelfand and Carlin method for parameter estimation, 65-66 information geometry, 62 parameter estimation, mean-field approximation, 64-65 Boltzmann machine training algorithm example with hidden units, 73-81 direct maximization for training, 76 EM algorithm for training, 76-81 gradient descent method for training, 73-76 Boltzmann machine training algorithm example with no hidden units, 67-73 alternating minimization algorithm for training, 71-72 397
exact likelihood estimation, 68-70 Geyer and Thompson method for training, 70-71 Boolean problem learning, 109
Characterizing identifiability and minimality of recurrent neural networks, 3 Classical identifiability problem, 2 Complex singularities in recurrent neural networks, 6 Constructing modular architectures, 124-128 neural decision trees, 124-127 other approaches to growing modular networks, 128 reducing neural decision tree complexity, 127-128 Constructive algorithms for regression problems, 111-124 applications of the cascade correlation algorithm, 115-116 cascade correlation algorithm, 111-116 dynamic node creation, 116 Hanson's meiosis algorithm, 117-118 Hanson's meiosis networks, 116-117 node creation and node-splitting algorithm, 116 radial basis function networks for regression problems, 119-123 recurrent cascade correlation, 114 Constructive algorithms for solving classification problems, 95-111 binary classification tasks, 95 single hidden layer classification neural networks, 109-111 pocket algorithm, 96-97 target switching algorithm, 107-109 tower and cascade architectures, 97-101 training of a classifier, 95 tree and cascade architectures, construction, 103-109 tree architectures, 101-103 upstart algorithm, 101-103
Index pocket algorithm, 137-138 single-node learning, 135-139 thermal perceptron, 137-138 Constructive learning techniques for designing neural networks, 91-139 Constructive learning techniques in network complexity reduction, 129-133 magnitude-based pruning algorithm, 129 node pruning in complexity reduction, 132-133 optical brain damage algorithm for complexity reduction, 129-132 Continuous point attractor associative memories (PAAM): competitive associative memories, 213-232 application to perfectly reacted PAAM, 228-232 competitive pattern recognition example, 222-228 competitive winning principle, 217-222 PAAM model and its significance, 214-217 Continuous time recurrent neural networks, 4,8 Controllability requirements in recurrent neural networks, 5 Coordinate subspaces in recurrent neural networks, 8 D Decision tree neural networks in pattern classification, 124-128 Delta rule with weight changes in constructive learning, 112 Discrete point attractor associative memories (PAAM): asymmetric classification theory for energy functions, 239-245 complete correspondence encoding, appHcation, 239-250 general correspondence principles, 234-239 Hopfield-type networks, 232-250 neural network model, 232-234 Discrete time recurrent neural networks, 4, 8 Dynamical version of neural networks, 1
Elman-like network for constructive learning techniques, 114
architecture layout of inductive logic unit (ILU), 290 inductive logic unit (ILU), 259-281 computational structure, 300-302 optimized transduction of information, 296-300 optimized transmission of information, 293-295 testing, 302-304 Kolmogorov-based approach, 284 principle of maximized bearing and ILU architecture, 278
EM algorithm, 60, 66, 164 Exact likelihood estimation by Boltzmann machines, 68
Fisher information matrix, 62 Forward accessibility for continuous time recurrent neural networks, 14 for discrete time recurrent neural networks, 14
Gelfand and Carlin method, 65 Geyer and Thompson methods, 61, 70 Gibbs distributions, 56 Gibbs sampler, 53, 59 Glauber dynamics, 53 Gradient-descent rule for minimization, 58, 59, 66, 73, 112 H Heat-bath method, 53 Hidden nodes, 56 Hidden units in Boltzmann machines, 62, 73 Hopfield associative memories, 55 Hopfield network function, 55
Incorporation of hidden nodes in Boltzmann machines, 56 Information geometry of Boltzmann machines, 62 Interconnected linear and nonlinear recurrent network systems, 3 Invariance property of coordinate subspaces in recurrent neural networks, 8 Iterative proportional fitting procedure (IPFP), 57 K Kullback-Leibler directed divergence, 58
Learning in Boltzmann machines, 57 Logical basis for neural network design, 259-305
M Markov chain Monte Carlo methods (MCMC), 53, 59, 61 Maximality of unobservable subspaces in recurrent neural networks, 19 Maximum likelihood estimation, 57, 59, 60 McCulloch-Pitts neuron, 195 Metropolis-Hastings method, 54 Minimal recurrent neural networks, 2, 13 Mixed recurrent neural networks, 17-46 identifiability, 30-33 proofs, 33-46 minimality, 30-33 proofs, 33-46 models, 17 observability, 17-23 state observability, 17-23 Modeling nonlinear input/output behaviors, 2 Modular network architectures, 152-159 data fusion, 153 input decomposition, 152-153 neocognitron, 153 output decomposition, 153-154 Modular networks by combining outputs of expert systems, 159-167 EM algorithm, 164-167 maximum likelihood method, 160-161 mixture of expert systems, 161-162 mixture of linear regression model and Gaussian densities, 162-164 with adaptivity, 167-177 adaptive hierarchical networks, 172-173 adaptive multimodule approximation networks, 173-176
attentive modular construction and training, 169-172 blockstart algorithm, 176-177 Modular neural networks, 147-178 computational requirements, 149-150 conflicts in training, 150 generalizations of modular networks, 150-151 hardware constraints of modular networks, 151-152 representations of modular networks, 151 Multimode single-neuron arithmetics, 371-394 defining neuronal arithmetics, 374-378 biophysical basis, 374-375 composite operation and clean operation, 376-378 developedness of arithmetical modes, 378 polarization configuration, 375-376 development of computing neuron, 392-394 multimode neuronal arithmetic unit, 386-392 developedness of arithmetic operations, 387-389 multimode arithmetic unit, 389 triggering and transition of arithmetic modes, 390-392 phase space of neuronal arithmetics, 378-387 domain range of linear arithmetics, 380-384 dynamic range of operations, 384-385 phase diagram for neuronal arithmetics, 386-387 prototypes of clean arithmetics, 378-379
N n-admissibility in recurrent neural networks, 15 Nearest-neighbor neural network data classifiers, 344-353 Necessity of observability conditions in recurrent neural networks, 27 Neural networks applied to data analysis, 309-367
data classifier nonlinearity, 340-353 experiments, 344-353 nonlinearity measure, 341-344 data classifier selection, 327-340 confidence estimators, 329-334 confidence values, 328-329 data validation sets, 334 experiments for data separability, 334-340 data classifier stability, 353-365 estimation procedure for stability, 354-355 stability experiments, 361-365 stabilizing techniques, 358-361 data complexity, 311-316 density estimation methods, 314-316 prototype vectors, 311-313 subspace or projection methods, 313-314 data separability, 316-327 experiments, 322-325 mapping algorithms, 318-321 visualizing the data space, 321-322 Node creation and node-splitting algorithms for constructive learning techniques, 116-119
Parameter estimation using mean field approximations, 64 Parzen method for reliability classification of data, 331 Perceptron algorithm for constructive learning, 99 Phase space of neuronal arithmetics, 378-387 Point attractor associative memories (PAAM), 192-213 encoding strategies, 205-213 neural network models, 193-198 neurodynamics, 198-205 Pyramid algorithm for constructive learning, 99
Quickprop methods in constructive learning techniques, 112
Radial basis function networks in constructive learning applications, 119-123
Realizations for nonlinear input/output behaviors, 2 Recurrent neural networks, 1-47 admissible activation functions, 5-7 as models for nonlinear systems, 2 controllability, 13 for constructive learning techniques, 114 forward accessibility, 13-16 identifiability and minimality, 11-13 models for recurrent networks, 4-5 open problems, 46-47 state observability, 7-11
Sammon's nonlinear projection method, 320-321 Self-organizing map (SOM) for mapping high dimension feature spaces to a lower dimension space, 318-321 Single layer recurrent neural networks, 4 Single-neuron computation, 372 Stationary distributions in Boltzmann machines, 57 Stochastic updating rules for Boltzmann machines, 56 Symmetries in recurrent neural networks, 11 Symmetry group reduction in recurrent neural networks, 4
System theoretic study of recurrent neural networks, 2, 4
Target Switch algorithm in constructive learning, 104 Thresholding nodes in tree architectures for constructive learning, 106 Tiling algorithm for constructive learning, 98 Tower algorithm for constructive learning, 98 Training Boltzmann machines by alternating minimization, 63 with a sample of N realizations, 65 Training data for Boltzmann machines, 61 U Unobservable subspaces in recurrent neural networks, 18 Upstart algorithm for constructive learning, 101 W Weak n-admissibility in recurrent neural networks, 16 Weight pruning techniques in neural networks, 129-133