Optimization Techniques
Neural Network Systems Techniques and Applications
Edited by Cornelius T. Leondes
VOLUME 1. Algorithms and Architectures
VOLUME 2. Optimization Techniques
VOLUME 3. Implementation Techniques
VOLUME 4. Industrial and Manufacturing Systems
VOLUME 5. Image Processing and Pattern Recognition
VOLUME 6. Fuzzy Logic and Expert Systems Applications
VOLUME 7. Control and Dynamic Systems
Optimization Techniques
Edited by
Cornelius T. Leondes
Professor Emeritus
University of California
Los Angeles, California
VOLUME 2 OF
Neural Network Systems Techniques and Applications
ACADEMIC PRESS
San Diego  London  Boston  New York  Sydney  Tokyo  Toronto
This book is printed on acid-free paper.
Copyright © 1998 by Academic Press. All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Academic Press, a division of Harcourt Brace & Company
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
http://www.apnet.com
Academic Press Limited
24-28 Oval Road, London NW1 7DX, UK
http://www.hbuk.co.uk/ap/
Library of Congress Card Catalog Number: 97-80441
International Standard Book Number: 0-12-443862-8
PRINTED IN THE UNITED STATES OF AMERICA
97 98 99 00 01 02 ML 9 8 7 6 5 4 3 2 1
Contents
Contributors xv
Preface xvii
Optimal Learning in Artificial Neural Networks: A Theoretical View
Monica Bianchini, Paolo Frasconi, Marco Gori, and Marco Maggini
I. Introduction 1
II. Formulation of Learning as an Optimization Problem
   A. Static Networks 7
   B. Recurrent Neural Networks 8
III. Learning with No Local Minima 10
   A. Static Networks for Pattern Classification 11
   B. Neural Networks with "Many Hidden Units" 22
   C. Optimal Learning with Autoassociators 23
   D. Recurrent Neural Networks 25
   E. On the Effect of the Learning Mode 32
IV. Learning with Suboptimal Solutions 33
   A. Local Minima in Neural Networks 34
   B. Symmetrical Configurations 40
   C. Network Saturation 41
   D. Bifurcation of Learning Trajectories in Recurrent Neural Networks 42
V. Advanced Techniques for Optimal Learning 44
   A. Growing Networks and Pruning 44
   B. Divide and Conquer: Modular Architectures 45
   C. Learning from Prior Knowledge 45
VI. Conclusions 45
References 47
Orthogonal Transformation Techniques in the Optimization of Feedforward Neural Network Systems
Partha Pratim Kanjilal
I. Introduction 53
II. Mathematical Background for the Transformations Used 55
   A. Singular Value Decomposition 55
   B. QR Factorization 56
   C. QR with Column Pivoting Factorization and Subset Selection 56
   D. Modified QR with Column Pivoting Factorization and Subset Selection 57
   E. Remarks 58
III. Network-Size Optimization through Subset Selection 58
   A. Basic Principle 58
   B. Selection of Optimum Set of Input Nodes 59
   C. Selection of Optimum Number of Hidden Nodes and Links 60
IV. Introduction to Illustrative Examples 61
V. Example 1: Modeling of the Mackey-Glass Series 62
VI. Example 2: Modeling of the Sunspot Series 65
   A. Principle of Modeling a Quasiperiodic Series 65
   B. Sunspot Series Model 66
VII. Example 3: Modeling of the Rocket Engine Testing Problem 71
VIII. Assessment of Convergence in Training Using Singular Value Decomposition 74
IX. Conclusions 76
Appendix A: Configuration of a Series with Nearly Repeating Periodicity for Singular Value Decomposition-Based Analysis 76
Appendix B: Singular Value Ratio Spectrum 77
References 77
Sequential Constructive Techniques
Marco Muselli
I. Introduction 81
II. Problems in Training with Back Propagation 82
   A. Network Architecture Must Be Fixed a Priori 83
   B. Optimal Solutions Cannot Be Obtained in Polynomial Time 85
III. Constructive Training Methods 85
   A. Dynamic Adaptation to the Problem 87
   B. High Training Speed 87
IV. Sequential Constructive Methods: General Structure 88
   A. Sequential Decision Lists for Two-Class Problems 89
   B. Sequential Decision Lists for Multiclass Problems 96
   C. General Procedure for Two-Class Problems 98
   D. General Procedure for Multiclass Problems 100
V. Sequential Constructive Methods: Specific Approaches 105
   A. Halfspace Choice Set 106
   B. Hyperplane Choice Set 117
VI. Hamming Clustering Procedure 123
VII. Experimental Results 125
   A. Exhaustive Learning 128
   B. Generalization Tests 132
VIII. Conclusions 139
References 140
Fast Backpropagation Training Using Optimal Learning Rate and Momentum
Xiao-Hu Yu, Li-Qun Xu, and Yong Wang
I. Introduction 145
II. Computation of Derivatives of Learning Parameters 148
   A. Derivatives of the Learning Rate 149
   B. Derivatives of the Learning Rate and Momentum 151
III. Optimization of Dynamic Learning Rate 154
   A. Method 1: Learning Rate Search with an Acceptable δE 154
   B. Methods 2 and 3: Using a Newton-like Method to Compute μ 156
   C. Method 4: Using the Higher-Order Derivatives of μ 156
IV. Simultaneous Optimization of μ and α 158
   A. Method 5: Using the First Two Partial Derivatives 159
V. Selection of the Descent Direction 160
VI. Simulation Results 161
VII. Conclusion 168
References 172
Learning of Nonstationary Processes
V. Ruiz de Angulo and Carme Torras
I. Introduction 175
II. A Priori Limitations 177
III. Formalization of the Problem 178
IV. Transformation into an Unconstrained Minimization Problem 179
V. One-to-One Mapping D 182
VI. Learning with Minimal Degradation Algorithm 183
VII. Adaptation of Learning with Minimal Degradation for Radial Basis Function Units 186
VIII. Choosing the Coefficients of the Cost Function 188
IX. Implementation Details 190
   A. Advance Rate 190
   B. Stopping Criterion 191
   C. Initial Hidden-Unit Configuration 191
X. Performance Measures 191
XI. Experimental Results 194
   A. Scaling Properties 194
   B. Solution Quality for Different Coefficient Settings 194
   C. Computational Savings Derived from the Application of Learning with Minimal Degradation 197
   D. Learning with Minimal Degradation versus Back Propagation 198
XII. Discussion 200
   A. Influence of the Back Propagation Advance Rate on Forgetting 200
   B. How to Prepare a Network for Damage, or the Relation of Learning with Minimal Degradation with Fault Tolerance 201
   C. Relation of Learning with Minimal Degradation with Pruning 204
XIII. Conclusion 204
References 206
Constraint Satisfaction Problems
Hans Nikolaus Schaller
I. Constraint Satisfaction Problems 209
II. Assessment Criteria for Constraint Satisfaction Techniques 213
   A. P and NP Problems, Complexity Theory 213
   B. Scaleability and Large-Scale Problems, Empirical Complexity 214
   C. Parallelization 216
   D. Design Principles for Computer Architectures 217
   E. Examples of Constraint Satisfaction Problems 218
   F. Summary 220
III. Constraint Satisfaction Techniques 221
   A. Global Search 222
   B. Local Search 223
   C. Neural Networks 226
IV. Neural Networks for Constraint Satisfaction 227
   A. Hopfield Networks 228
   B. Neural Algorithms and the Strictly Digital Neural Network 231
   C. Neural Computing Networks 233
   D. Guarded Discrete Stochastic Net 234
   E. Boltzmann Machine 234
   F. K-Winner-Take-All 235
   G. Dynamic Barrier Neural Network and Rolling Stone Neural Network 236
V. Assessment 240
   A. N-Queens Benchmark 240
   B. Comparison of Neural Techniques 241
   C. Comparison of All Techniques 241
   D. Summary 243
References 244
Dominant Neuron Techniques
Jar-Ferr Yang and Chi-Ming Chen
I. Introduction 249
II. Continuous Winner-Take-All Neural Networks 252
III. Iterative Winner-Take-All Neural Networks 256
   A. Pair-Compared Competition 256
   B. Fixed Mutually Inhibited Competition 259
   C. Dynamic Mutual-Inhibition Competition 262
   D. Mean-Threshold Mutual-Inhibition Competition 263
   E. Highest-Threshold Mutual-Inhibition Competition 264
   F. Dynamic Thresholding Competition 265
   G. Simulation Results 267
IV. K-Winners-Take-All Neural Networks 268
   A. Continuous K-Winners-Take-All Competition 268
   B. Interactive Activation K-Winners-Take-All Competition 269
   C. Coarse-Fine Mutual-Inhibition K-Winners-Take-All Competition 270
   D. Dynamic Threshold Search K-Winners-Take-All Competition 270
   E. Simulation Results 272
V. Conclusions 273
References 274
CMAC-Based Techniques for Adaptive Learning Control
Chun-Shin Lin, Ching-Tsan Chiang, and Hyongsuk Kim
I. Introduction 277
II. Neural Networks for Learning Control 278
   A. Nonlinear Controller: Identification of Inverse Plant and Its Usage 278
   B. Model Reference Adaptive Controller 280
   C. Learning a Sequence of Control Actions by Back Propagation through Time 280
   D. Neural Networks for Adaptive Critic Learning 283
III. Conventional Cerebellar Model Articulation Controller 284
   A. Scheme 284
   B. Application Example of Cerebellar Model Articulation Controller 287
IV. Advanced Cerebellar Model Articulation Controller-Based Techniques 290
   A. Cerebellar Model Articulation Controller with Weighted Regression 290
   B. Cerebellar Model Articulation Controller with General Basis Functions 293
V. Structure Composed of Small Cerebellar Model Articulation Controllers 298
   A. Neural Network Structure with Small Cerebellar Model Articulation Controllers 298
   B. Learning Rules 299
   C. Example: Function Approximation 301
VI. Conclusions 302
References 303
Information Dynamics and Neural Techniques for Data Analysis
Gustavo Deco
I. Introduction 305
II. Statistical Structure Extraction: Parametric Formulation by Unsupervised Neural Learning 307
   A. Basic Concepts of Information Theory 309
   B. Independent Component Analysis 311
   C. Nonlinear Independent Component Analysis 318
   D. Linear Independent Component Analysis 322
   E. Dual Ensemble Theory for Unsupervised and Supervised Learning 323
III. Statistical Structure Extraction: Nonparametric Formulation 326
   A. Statistical Independence Measure 328
   B. Statistical Test 331
IV. Nonparametric Characterization of Dynamics: The Information Flow Concept 337
   A. Information Flow for Finite Partitions 339
   B. Intrinsic Information Flow (Influential Partition) 342
V. Conclusions 345
References 349
Radial Basis Function Network Approximation and Learning in Task-Dependent Feedforward Control of Nonlinear Dynamical Systems
Dimitry Gorinevsky
I. Introduction 353
II. Problem Statement 357
   A. Control Formulation 357
   B. Example: Control of Two-Link Flexible Arm 360
   C. Discretized Problem 362
   D. Problems of Task-Dependent Feedforward Control 365
III. Radial Basis Function Approximation 366
   A. Exact Radial Basis Function Interpolation 367
   B. Radial Basis Function Network Approximation 369
   C. Recursive Identification of the Radial Basis Function Model 370
   D. Radial Basis Function Approximation of Task-Dependent Feedforward 372
IV. Learning Feedforward for a Given Task 373
   A. Learning Control as On-Line Optimization 374
   B. Robust Convergence of the Learning Control Algorithm 375
   C. Finite-Difference Update of the Gradient 376
V. On-Line Learning Update in Task-Dependent Feedforward 378
   A. Approximating System Sensitivity 378
   B. Local Levenberg-Marquardt Update 379
   C. Update of Radial Basis Function Approximation in the Feedforward Controller 380
VI. Adaptive Learning of Task-Dependent Feedforward 382
   A. Affine Radial Basis Function Network Model of the System Mapping 382
   B. Adaptive Update Algorithm 384
   C. Discussion 386
   D. Application Example: Learning Control of Flexible Arm 387
VII. Conclusions 391
References 391
Index 395
Contributors
Numbers in parentheses indicate the pages on which the authors' contributions begin.
Monica Bianchini (1), Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, 50139 Florence, Italy
Chi-Ming Chen (249), Department of Electrical Engineering, Kao Yuan College of Technology and Commerce, Luchu, Kaohsiung, Republic of China
Ching-Tsan Chiang (277), Department of Electrical Engineering, University of Missouri-Columbia, Columbia, Missouri 65211
Gustavo Deco (305), Siemens AG, Corporate Research and Development, Munich 81739, Germany
Paolo Frasconi (1), Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, 50139 Florence, Italy
Marco Gori (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Siena, Italy
Dimitry Gorinevsky (353), Measurex Devron, Inc., North Vancouver, British Columbia V7J 3S4, Canada
Partha P. Kanjilal (53), Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur 721-302, India
Hyongsuk Kim (277), Department of Control and Instrumentation Engineering, Chonbuk National University, Republic of Korea
Chun-Shin Lin (277), Department of Electrical Engineering, University of Missouri-Columbia, Columbia, Missouri 65211
Marco Maggini (1), Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Siena, Italy
Marco Muselli (81), Istituto per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, 16149 Genoa, Italy
Vicente Ruiz de Angulo (175), Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Edifici NEXUS, 08034 Barcelona, Spain
H. Nikolaus Schaller (209), DSJ TRI, D-80798 Munich, Germany
Carme Torras (175), Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Edifici NEXUS, 08034 Barcelona, Spain
Yong Wang (145), Department of Radio Engineering, National Communications Research Laboratory, Southeast University, Nanjing 210018, China
Li-Qun Xu (145), Intelligent Systems Research, Advanced Applications and Technology, BT Laboratories, Ipswich IP5 7RE, England
Jar-Ferr Yang (249), Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan
Xiao-Hu Yu (145), Department of Radio Engineering, National Communications Research Laboratory, Southeast University, Nanjing 210018, China
Preface

Inspired by the structure of the human brain, artificial neural networks have been widely applied to fields such as pattern recognition, optimization, coding, and control, because of their ability to solve cumbersome or intractable problems by learning directly from data. An artificial neural network usually consists of a large number of simple processing units (neurons) linked by mutual interconnections. It learns to solve problems by adequately adjusting the strength of the interconnections according to input data. Moreover, the neural network adapts easily to new environments by learning, and can deal with information that is noisy, inconsistent, vague, or probabilistic. These features have motivated extensive research and developments in artificial neural networks.

This volume is probably the first rather comprehensive treatment devoted to the broad area of optimization techniques, including systems structures and computational methods. Techniques and diverse methods in numerous areas of this broad subject are presented. In addition, various major neural network structures for achieving effective systems are presented and illustrated by examples in all cases. Numerous other techniques and subjects related to this broadly significant area are treated.

The remarkable breadth and depth of the advances in neural network systems, with their many substantive applications both realized and yet to be realized, make it quite evident that adequate treatment of this broad area requires a number of distinctly titled but well-integrated volumes. This is the second of seven volumes on the subject of neural network systems and it is entitled Optimization Techniques. The entire set of seven volumes contains:

Volume 1: Algorithms and Architectures
Volume 2: Optimization Techniques
Volume 3: Implementation Techniques
Volume 4: Industrial and Manufacturing Systems
Volume 5: Image Processing and Pattern Recognition
Volume 6: Fuzzy Logic and Expert Systems Applications
Volume 7: Control and Dynamic Systems
The first contribution to Volume 2 is "Optimal Learning in Artificial Neural Networks: A Theoretical View," by Monica Bianchini, Paolo Frasconi, Marco Gori, and Marco Maggini. The effectiveness of neural network systems in emulating intelligent behavior and in solving many significant applied problems is strictly related to the learning algorithms intended to determine the optimal or near-optimal values of the network weights. This contribution is a rather comprehensive treatment of techniques and methods for optimal learning (weight determination), and it provides a unified view of these techniques as well as a presentation of the state of the art in this broad and fundamental area. It treats the issues and techniques related to the problem of local minima of the cost function that might be utilized in the process of determining neural network weights. Some rather significant links with the computational complexity of learning are presented, as are various techniques for determining optimum neural network weights. A number of rather illuminating illustrative examples are included in this contribution.

The next contribution is "Orthogonal Transformation Techniques in the Optimization of Feedforward Neural Network Systems," by Partha Pratim Kanjilal. Orthogonal transformation techniques can be utilized to identify the dominant modes in any information set. Because this implies the realization of a neural network system of reduced or minimum order (minimum complexity), it is the basic motivation behind the use of orthogonal transformation techniques in optimizing neural network systems. This contribution is a rather comprehensive treatment of the techniques and methods utilized in this important area, with illustrative examples that show the substantive effectiveness of the techniques presented.

The next contribution is "Sequential Constructive Techniques," by Marco Muselli. The theoretical and practical problems associated with the backpropagation algorithm have led to continual advances in learning techniques for this significant problem. Among these techniques is a new class of learning algorithms called sequential constructive methods. This highly effective approach allows the treatment of training sets that contain several thousand samples. This contribution is a rather comprehensive treatment of the techniques and methods involved, with numerous substantive examples.

The next contribution is "Fast Backpropagation Training Using Optimal Learning Rate and Momentum," by Xiao-Hu Yu, Li-Qun Xu, and Yong Wang. This contribution presents a family of fast backpropagation (BP) learning algorithms for supervised training of neural networks. The achievement of a rapid convergence rate is the result of using a systematically
optimized dynamic learning rate (and momentum, if required). This is in contrast to both the standard BP algorithm, in which a constant learning rate and momentum term are adopted, and other ad hoc or heuristics-based methods. The main feature of these algorithms is the attempt to explore the derivative information of the error surface (cost function) with respect to the learning rate and momentum up to a certain necessary order, rather than to obtain the Hessian matrix of the synaptic weights, which is normally very costly to compute. This contribution is a rather comprehensive treatment of the methods and techniques for fast backpropagation neural network learning. It includes illustrations of the application of the techniques presented to several benchmark problems, as well as comparisons to other well-studied classic algorithms. The highly effective performance of the techniques is made quite clear by these examples in terms of both fast convergence rate and robustness to weight initialization.

The next contribution to this volume is "Learning of Nonstationary Processes," by V. Ruiz de Angulo and Carme Torras. The degradation in performance of an associative network over a training set when new patterns are trained in isolation is usually called forgetting or catastrophic interference. Applications entailing the learning of a time-varying function require the ability to quickly modify some input-output patterns while at the same time avoiding catastrophic forgetting. Learning algorithms based on the repeated presentation of the learning set, such as the popular backpropagation, are suited only to tasks admitting two separate phases: an off-line phase for learning and another phase for operation. This contribution is a rather comprehensive treatment of techniques for the use of neural network systems for learning nonstationary processes. Numerous illustrative examples are presented that clearly manifest the effectiveness of the techniques presented.

The next contribution is "Constraint Satisfaction Problems," by Hans Nikolaus Schaller. System optimization problems in which the variables involved are either continuous or discrete are rather straightforward, comparatively speaking, when compared with similar problems wherein the continuous or discrete system variables are required to satisfy constraints. This contribution is a rather comprehensive treatment of the utilization of neural network systems for this class of problems, which has many diverse and broad applications of substantial applied significance. Numerous illustrative examples of the techniques and methods presented are included as an important element of this contribution.

The next contribution is "Dominant Neuron Techniques," by Jar-Ferr Yang and Chi-Ming Chen. This chapter provides an integrated and intensive investigation of the fundamental issues in the design and analysis of
unsupervised learning neural networks for resolving which neuron (or neurons) has the maximum preference. The exploration of the dominant neuron and of K dominant neurons can be related to the techniques for winner-take-all (WTA) and K-winners-take-all (KWTA) problems, respectively. Generally, the KWTA neural network performs a selection of the K competitors whose activations are larger than those of the remaining (M - K) ones. When K = 1, the KWTA network devolves to the WTA process, in which the neuron with the maximum activation is determined. Hence, the KWTA network can be treated as a generalization of the WTA network. Well-known neural networks such as Grossberg's competitive learning, adaptive resonance theory, fuzzy associative memory, learning vector quantizers, and their various versions all require a WTA neural network. WTA methods have applications in such other diverse areas as classification, error correction systems, fuzzy associative memory systems, Gaussian classifiers, nearest-match content addressable memory, signal processing, and the building of many complex systems. This contribution is a rather comprehensive treatment of the dominant neuron techniques of WTA and KWTA methods, with illustrative examples.

The next contribution is "CMAC-Based Techniques for Adaptive Learning Control," by Chun-Shin Lin, Ching-Tsan Chiang, and Hyongsuk Kim. This chapter treats the cerebellar model articulation controller (CMAC) and CMAC-based techniques, which are often used in learning control applications. The CMAC was first developed by Albus in the mid-1970s for robot manipulator control and functional approximation. The CMAC is an efficient table lookup technique. Its most attractive characteristic is that learning always converges to the result with least square error, and the convergence is fast. The CMAC technique did not receive much attention until the mid-1980s, when researchers started developing strong interests in neural networks. CMAC is now considered one type of neural network with major applications in learning control. Several illustrative examples are included which clearly manifest the significance and substantive effectiveness of CMAC systems.

The next contribution to this volume is "Information Dynamics and Neural Techniques for Data Analysis," by Gustavo Deco. One of the most essential problems in the fields of neural networks and nonlinear dynamics is the extraction and characterization of the statistical structure underlying an observed set of data. In the context of neural networks, the problem is posed as the data-based learning of a parametric form of the statistical dependences behind the data. In this parametric formulation, the goal is to model the observed process. On the other hand, an a priori requirement
for the extraction of statistical structures is the detection of their existence and their characterization. For time series, for example, it is useful to know whether the dynamics that generates the observed values is stationary or nonstationary, and whether the time series is deterministic or stochastic, and to be able to distinguish between white noise, colored noise, Markov processes, and chaotic and nonchaotic determinism. The detection and characterization of such dependences should therefore first be performed in a nonparametric fashion, so that the process can be modeled a posteriori in a parametric form. The basic problem is of a statistical nature, and therefore information theory offers the ideal theoretical framework for a mathematical formulation. This contribution is a rather substantive treatment of a detailed and unifying formulation of the theory of parametric and nonparametric structure extraction, with a view toward establishing a consistent theoretical framework for the extremely important problem of discovering the knowledge implicit in empirical data. The significant implications are made manifest by considering only a few of the many significant applications, including biological data such as EEGs and financial data such as the stock market. Illustrative examples are included.

The final contribution to this volume is "Radial Basis Function Network Approximation and Learning in Task-Dependent Feedforward Control of Nonlinear Dynamical Systems," by Dimitry Gorinevsky. This contribution considers intelligent control system architectures for task-level control. The problem is to compute feedforward control for a sequence of control tasks, each of which can be compactly described by a task parameter vector. The control update is performed in discrete time, from task to task. This contribution considers an innovative controller architecture based on radial basis function (RBF) approximation of nonlinear mappings. The more advanced of these architectures enable an on-line learning update for optimization of the system performance from task to task. This learning update can be considered a generalization of the well-known learning (repetitive) control approach. Unlike repetitive control, which is only applicable to a single task, the proposed algorithms work for a parametric family of such tasks. As an example, task-level feedforward control of a flexible articulated arm is considered. Vibration-free terminal control of such an arm is achieved using a task-level algorithm that learns the optimal task-dependent feedforward as the arm goes through a random sequence of point-to-point motions.

This volume on neural network system optimization techniques clearly reveals the effectiveness and significance of the techniques available and, with further development, the essential role they will play in the future.
The authors are all to be highly commended for their splendid contributions to this volume, which will provide a significant and unique reference source for students, research workers, practitioners, computer scientists, and others on the international scene for years to come.

Cornelius T. Leondes
Optimal Learning in Artificial Neural Networks: A Theoretical View*

Monica Bianchini
Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Florence, Italy

Paolo Frasconi
Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Florence, Italy

Marco Gori
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Siena, Italy

Marco Maggini
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Siena, Italy

*This chapter is partially reprinted from M. Bianchini and M. Gori, Neurocomputing 13:313-346, 1996, courtesy of Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, the Netherlands; and partially from M. Bianchini, M. Gori, and M. Maggini, IEEE Trans. Neural Networks 5:167-177 (© 1994 IEEE), M. Bianchini, P. Frasconi, and M. Gori, IEEE Trans. Neural Networks 6:512-515 (© 1995 IEEE), M. Bianchini, P. Frasconi, and M. Gori, IEEE Trans. Neural Networks 6:749-756 (© 1995 IEEE), and M. Maggini and M. Gori, IEEE Trans. Neural Networks 7:251-254 (© 1996 IEEE).
I. INTRODUCTION

In the last few years impressive efforts have been made in using connectionist models either for modeling human behavior or for solving practical problems. In the field of cognitive science and psychology, we have been witnessing a debate on the actual role of connectionism in modeling human behavior. It has been claimed [1] that, like traditional associationism, connectionism treats learning as basically a sort of statistical modeling and that it is not adequate for capturing
the rich structure of most significant cognitive processes. As for the actual novelty of the recent revival of connectionist models, Fodor and Pylyshyn [1] look quite skeptical and state "We seem to remember having been through this argument before. We find ourselves with a gnawing sense of déjà vu." A parallel debate has been taking place concerning the application of connectionist models to engineering (pattern recognition, artificial intelligence, motor control, etc.). The arguments addressed in these debates seem strictly related to each other and refer mainly to the peculiar kind of learning that is typically carried out in connectionist models, which seems not to take enough of the structure into account. Unlike other symbolic approaches to machine learning, which are based on "intelligent search" (see, e.g., [2]), in connectionist models the learning is typically framed as an optimization problem. After the seminal books by the PDP group, Minsky published an extended edition of Perceptrons [3] that contains an intriguing epilogue on PDP's novel issues. He pointed out that what the PDP group calls a "powerful new learning result is nothing more than a straightforward hill-climbing algorithm" and commented on the novelty of backpropagation by saying: "We have the impression that many people in the connectionist community do not understand that this is merely a particular way to compute a gradient and have assumed instead that Backpropagation is a new learning scheme that somehow gets around the basic limitation of hill-climbing" (see [3, p. 286]).^

^The criticism raised by Minsky against the backpropagation (BP) learning scheme also involves the mapping capabilities of feedforward nets. For example, Minsky [3, p. 265] points out that the net proposed by Rumelhart et al. [4, pp. 340-341] for learning to recognize symmetry has very serious problems of scaling up. It may happen that the bits needed for representing the weights exceed those needed for recording the patterns themselves!

Minsky's issues call for the need to give optimal learning a theoretical foundation. Because simple gradient descent algorithms get stuck in local minima, in principle, one has no guarantee of learning the assigned task. It may be argued that more sophisticated optimization techniques (see, e.g., [5, 6]) guarantee reaching the global minimum, but the computational burden can become excessive quite early for most practical problems. The computational burden is obviously related to the shape of the error surface and particularly to the presence of local minima. Hence, it turns out to be very interesting to investigate the presence of local minima and particularly to look for conditions that guarantee their absence. Obviously, we do not claim that the absence of local minima identifies the limit of practically solvable problems, because the use of sophisticated optimization techniques can actually be valuable also in the presence of error surfaces with local minima. However, beyond that bound, troubles are likely to begin for any learning algorithm, whose effectiveness seems very difficult to assess in advance.

One primary goal of this chapter is that of reviewing, in a unified framework, the basic results known in the literature concerning the optimal convergence of supervised learning algorithms. In the case of batch mode, the optimal convergence
is strictly related to the shape of the error surface and particularly to the presence of local minima. When using pattern mode (or other schemes), the optimal convergence cannot be framed as an optimization problem, unless we use a conveniently small learning rate that leads us to an approximation of the "true gradient descent." We focus mainly on batch mode by investigating conditions that guarantee local minima free error surfaces for both static and dynamic networks. In the case of feedforward networks, local minima free error surfaces are guaranteed when the patterns are linearly separable [7] or when using networks with as many hidden units as patterns to learn [8, 9]. Analogous results hold for radial basis function networks, for which the absence of local minima is gained under the condition of patterns separable by hyperspheres [10]. Roughly speaking, these results suggest that optimal learning is certainly achieved in the limit cases of "many input" and "many hidden unit" networks. In the first case, the assumption of using networks with many inputs makes the probability of linearly separable patterns very high.^ In the second case, the result holds independently of the problem at hand, but the main drawback turns out to be the excessive number of hidden units that are necessary for dealing with most practical problems.

^This follows directly from the results found by Cover [11] and Brown [12] concerning the average number of random patterns with random binary desired responses that can be absorbed by an ADALINE.

In the case of dynamic networks, local minima free error surfaces are guaranteed when matching the decoupling network assumptions (DNAs). They are essentially related to the decoupling of sequences of different classes on at least one gradient coordinate. Unlike other sufficient conditions, DNAs seem more valuable in network design. Basically, for a given classification task, one can look for architectures that are well suited for learning. In the best case, such a search leads to the discovery of networks for which learning takes place with no local minima. When no optimal network is found that guarantees a local minima free error surface, one can, in any case, exploit DNAs for discovering architectures that are well suited for the task at hand.

The theoretical results described for batch mode can partially be extended, at least for feedforward networks, to the case of pattern mode learning [13]. This duality, which also holds for "nonsmall learning rates," is quite interesting, because it suggests conceiving new learning algorithms that are not necessarily based on function optimization, but on smart weight updating rules acting similarly to pattern mode. In practice, particularly for large experiments, the learning process takes place on subsets of the learning environment selected heuristically. For example, "difficult patterns" are commonly presented more often than others. We also discuss some examples of suboptimal learning in the framework of the theory developed for understanding local minima free error surfaces. In so doing, two different kinds of local minima are identified that depend on joint spurious
choices of the neuron nonlinear function and the cost (spurious local minima), and on the relationship between network and data (structural local minima), respectively. We also discuss premature saturation and appropriate choices of the cost for avoiding getting stuck in configurations in which the neurons are saturated.

This chapter is organized as follows. In the next section, we give the formulation of learning as an optimization problem and define the notation used throughout the chapter. In Section III, we review the basic results on local minima free error surfaces, while in Section IV, we discuss problems of suboptimal learning. In Section V we give a brief sketch of approaches that have been currently pursued to overcome the local minima problem. Finally, some conclusions are drawn in Section VI.
II. FORMULATION OF LEARNING AS AN OPTIMIZATION PROBLEM

In this section, we define the formalism adopted throughout the chapter. Basically, we are interested in experiments involving neural networks that can be represented concisely by $\mathcal{E} = \{\mathcal{N}, \mathcal{C}_e, E_T\}$, $\mathcal{N}$ being the network, $\mathcal{C}_e$ the learning environment, and $E_T$ the cost index.
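Purely as a reading aid, the experiment triple can be mirrored in code; the class and field names below are our own illustrative choices, not notation from the chapter:

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class Experiment:
    network: object                  # the network N
    learning_env: Sequence[Tuple]    # the learning environment Ce: (input, target) pairs
    cost: Callable                   # the cost index E_T
```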
1. Network $\mathcal{N}$

We consider neural networks whose $N$ neurons are grouped into sets called layers. With reference to the index $l$, we distinguish between the input layer ($l = 0$), the output layer ($l = L$), and the hidden layers ($0 < l < L$). The number of neurons per layer is denoted by $n(l)$, whereas each neuron of layer $l$ is referred to by its index $i(l)$, $i(l) = 1, \ldots, n(l)$. We assume that the network is fed at discrete time $0, \ldots, t-1, t, t+1, \ldots, T$ by a sequence of vectors. For each $t$ and $l$, we consider

$$A_l(t) = [a_{1(l)}(t), \ldots, a_{n(l)}(t)]', \qquad X_l(t) = [x_{1(l)}(t), \ldots, x_{n(l)}(t)]',$$

where $A_l(t) \in \mathbb{R}^{n(l)}$ and $X_l(t) \in \mathbb{R}^{n(l)}$ are the activation and the output vector, respectively. The following model is assumed for the activation:

$$a_{i(l)}(t) = \mathcal{F}\big(W_{i(l)}, X_0(t), \ldots, X_{l-1}(t); X_1(t-1), \ldots, X_L(t-1)\big), \tag{1}$$

where $W_{i(l)}$ is the weight vector associated with the neuron $i(l)$. The function $\mathcal{F}(\cdot)$ depends on the particular model of each neuron and defines the way of combining the inputs received from all other neurons or external inputs. The initial state of the network is referred to as $X_0(0), \ldots, X_L(0)$. The output of neuron $i(l)$ is related to its activation as follows:

$$x_{i(l)}(t) = f\big(a_{i(l)}(t)\big),$$

where $f(\cdot): \mathbb{R} \to [\underline{d}, \overline{d}\,]$ is a $C^1$ function and $f'(a_{i(l)}) \neq 0$ in $\mathbb{R}$. For example, a "squashlike" function [4] satisfies these hypotheses.
2. Learning Environment $\mathcal{C}_e$

In this chapter, we deal with supervised learning and, therefore, we need to refer to the following collection of $T$ input-output pairs:

$$\mathcal{C}_e = \big\{ (I(t), D(t)),\ I(t) \in \mathcal{X},\ D(t) \in \{\underline{d}, \overline{d}\}^n,\ t = 1, \ldots, T \big\},$$

where $I(t)$ is the input, $D(t)$ the corresponding target, and $\mathcal{X}$ is the input space. Each component of $D(t)$ belongs to $\{\underline{d}, \overline{d}\}$. All the targets are collected in the matrix $\mathcal{D} = [D(1), \ldots, D(T)]' \in \{\underline{d}, \overline{d}\}^{T,n}$.

3. Cost Function $E_T$

For a given experiment $\mathcal{E}$, the output-target data fitting is estimated by means of the cost function

$$E_T = \sum_{t=1}^{T} E_t = \sum_{t=1}^{T} d\big(X_L(t), D(t)\big),$$

where $d(\cdot, \cdot)$ is a distance in $\mathbb{R}^n$. The choice of this function plays a very crucial role in practice and depends significantly on the problem at hand. A common choice, which simplifies the mathematical analysis, is that of considering the distance induced by an $L_p$ norm ($1 \le p < \infty$). In the case of $p = 2$, which is most frequently considered, the cost is given by

$$E_T^{\mathrm{LMS}} = \frac{1}{2} \sum_{t=1}^{T} \sum_{j=1}^{n} \big[x_j(t) - d_j(t)\big]^2.$$

The use of different values for $p$ has been evaluated by Hanson and Burr [14] and by Burrascano [15] in a number of different problem domains. It turns out that the noise in the target domain can be reduced by using power values less than 2, whereas the sensitivity of partition planes to the geometry of the problem may be increased with increasing power values. The choice of the cost follows several different criteria, which may lead to opposite requirements. We focus on the requirements deriving from the need to limit problems of suboptimal solutions. One important requirement is that the
particular function choice should not give rise to spurious local minima that, as will be shown in Section IV, depend on the relationship between the cost and the neuron functions. As pointed out in the following, this can be achieved by using an error criterion that does not penalize the outputs "beyond" the target values. Suppose that the outputs are exclusively coded, that is, if $t$ belongs to class $j$, then $d_{i(L)}(t) = \overline{d}$ for $i(L) = j$, and $d_{i(L)}(t) = \underline{d}$ otherwise. In order to deal with spurious local minima, the introduction of the following LMS (least mean square)-threshold error function^ turns out to be useful:

$$E_T^{\mathrm{LMST}} = \sum_{j=1}^{n(L)} \bigg[ \sum_{t \in j} l_2\big(x_{j(L)}(t) - \overline{d}\,\big) + \sum_{t \notin j} l_1\big(x_{j(L)}(t) - \underline{d}\,\big) \bigg],$$

where $t \in j$ ranges over the patterns belonging to class $j$, $t \notin j$ over the remaining ones, and $l_k(\cdot): \mathbb{R} \to \mathbb{R}$, $k = 1, 2$, are $C^1$ functions (except for $\alpha = 0$), such that

$$\begin{cases} l_1(\alpha) = 0, & \text{if } \alpha < 0, \\ l_1(\alpha) > 0,\ l_1'(\alpha) > 0, & \text{if } \alpha > 0, \end{cases} \qquad \begin{cases} l_2(\alpha) = 0, & \text{if } \alpha > 0, \\ l_2(\alpha) > 0,\ l_2'(\alpha) < 0, & \text{if } \alpha < 0, \end{cases}$$
if a < 0 ,
and ^ stands for differentiation with respect to a. Another important requirement for the error function is that of limiting the "premature saturation" problems due to erroneous choice of the initial weights. As will be shown in Section IV.C, the following relative cross-entropy metric [1720]: T
n
E- = EE''.W"|5^ + C-<'.<'))lnt^
(2)
has the property of significantly reducing premature saturation problems. Solla et al. [19] have also pointed out that using the logarithmic error function (2) yields significant reductions in learning times. It is worth mentioning that in the case of an exact solution $E_T = E_T^{\mathrm{LMST}} = E_T^{\mathrm{CE}} = 0$, because $d_j(t) = x_j(t)$ for all $t, j$. The learning experiments analyzed in this chapter involve networks $\mathcal{N}$ with the architecture fixed in advance. Studies on variable networks are beyond the scope of this work.

^This definition is just the extension of the function proposed in [16] for the case of single-output networks.
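To make the difference between the two quadratic criteria concrete, the following sketch contrasts the plain LMS cost with an LMS-threshold variant that uses $l_1(\alpha) = \alpha^2$ for $\alpha > 0$ and $l_2(\alpha) = \alpha^2$ for $\alpha < 0$; this is one admissible choice satisfying the sign conditions above, not a prescription from the chapter, and the target values $\underline{d} = 0$, $\overline{d} = 1$ and array shapes are illustrative assumptions:

```python
import numpy as np

def lms_cost(outputs, targets):
    # Plain LMS: penalizes any deviation from the target, including
    # outputs that lie "beyond" it (e.g., 1.2 when the target is 1.0).
    return 0.5 * np.sum((outputs - targets) ** 2)

def lms_threshold_cost(outputs, targets, d_lo=0.0, d_hi=1.0):
    # LMS-threshold: only penalizes outputs on the "wrong side" of the
    # target, so saturating past the target costs nothing.
    over = np.maximum(outputs - d_lo, 0.0)   # l1 term, where the target is d_lo
    under = np.minimum(outputs - d_hi, 0.0)  # l2 term, where the target is d_hi
    return np.sum(np.where(targets == d_hi, under ** 2, over ** 2))

outputs = np.array([1.2, 0.7, 0.1])
targets = np.array([1.0, 1.0, 0.0])
print(lms_cost(outputs, targets))            # penalizes the 1.2 output
print(lms_threshold_cost(outputs, targets))  # does not: 1.2 is beyond the target
```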
A. STATIC NETWORKS In the previous general framework, static networks [4],"^ whose neuron activation does not involve past samples, commonly adopt the following choice of function ^(O: n{l~l)
ai{l)it) = J^{Wi(i), Xi-i(t))
= WiQ) +
Y^
Wi(i)j(i-i)Xj(i-i){t)
where W/(/) e R"^'~^^+^ u;/(/)j(/_i) is the weight of the link between the neurons 7 (/ — 1), /(/) and WIQ) is the neuron threshold. In this case, the function T is simply the dot product of X^_^ = [X[_^, 1]' and W^(/), with " 1 " accounting for bias. The output of neuron i{l) is related to the activation by means of the squashing function^: Xi(l) = f{cii{l)) = -rz r. (3) l+exp(-a,(/)) Another common choice of the function T(') leads to radial basis function networks (RBFs). The activation of the locally tuned units of these networks follows the equation: J cii{i){t) = T{Wi(i), X/_i(0) = WiQ) + -Y-
nil-l) ^
{xj(i-i)(t) - w;/(/),j(/_i))
||X/-i(o-W,(/)|p
4) where WIQ) = [W^/^, w;/(/), a/(/)]' e R"^^~^^+^, WiQ)j{i-\) is the weight of the link between the neurons j{l — \)J (/), WIQ) is the unit threshold, and a^i) is the "width" of the Gaussian. In this case, the output neuron function is Xi{i) = /te(/)) = exp(-a/(/)). The REF networks also contain an output layer of linear or sigmoidal neurons.^ ^In the literature and throughout the chapter, feedforward networks are also referred to as multilayered networks (MLNs). ^The symmetric squashing function tanh(M) can be used with analogous results. ^Unlike the RBFs proposed in [21], Eq. (4) includes a bias term. Let L = 2 be. Because x/(i) = exp(-a/(i)) = exp(-u;j(i) - WX^it) -^i{\)f/(^f^i^) = exp(-M;/(i))exp(-||Xo(0 W/(l)|P/a^jO, the contribution of hidden neuron i{\) to output neuron j{2) turns out to be [u;^(2),i(i)exp(-u;,(i))]exp(-||Xo(0 - W'/CDlP/^fd))- When denoting w;^(2),i(i) [^7(2),/(I) exp(—w;j(i))], the RBF we consider becomes exactly the one assumed in [21].
=
For both MLNs and RBFs, the learning environment involves pairs of input-output vectors as follows:

$$\mathcal{C}_e = \big\{ (X_0(t), D(t)),\ X_0(t) \in \mathbb{R}^{n(0)},\ D(t) \in \mathbb{R}^{n(L)},\ t = 1, \ldots, T \big\},$$

where $X_0(t)$ is the input pattern and $D(t)$ is its corresponding target. When using a feedforward network as an autoassociator [22], the target is set to the input, thus reducing the learning environment to

$$\mathcal{C}_e = \big\{ X_0(t) \in \mathbb{R}^{n(0)},\ t = 1, \ldots, T \big\}.$$

In order to understand the basic results proposed in this paper, additional notation is needed that allows us to deal with a compact vectorial formulation of the problem.

1. The vector $X_{i(l)} = [x_{i(l)}(1), \ldots, x_{i(l)}(T)]' \in \mathbb{R}^T$, called the output trace of neuron $i(l)$, stores the output of neuron $i(l)$ for all the $T$ patterns of the learning environment. The output trace for all the neurons of a given layer $l$ is called the layer output trace. It is kept in the matrix $\mathcal{X}_l = [X_{1(l)} \cdots X_{n(l)}] \in \mathbb{R}^{T,n(l)}$, $0 \le l \le L$.
2. $w_{i(l),j(l-1)}$ is the weight connecting neuron $j(l-1)$ to neuron $i(l)$. The associated matrix $W_{l-1} \in \mathbb{R}^{n(l),n(l-1)}$ is referred to as the layer weight matrix. The symbol $\Omega$ denotes the weight space.
3. Let $y_{i(l)}(t) = \partial E_t / \partial a_{i(l)}(t)$. We call the delta trace the vector $Y_{i(l)} = [y_{i(l)}(1), \ldots, y_{i(l)}(T)]'$ and the layer delta trace the matrix $\mathcal{Y}_l = [Y_{1(l)} \cdots Y_{n(l)}] \in \mathbb{R}^{T,n(l)}$. We denote by $S_l^y \subset \mathbb{R}^{T,n(l)}$ the set of all $\mathcal{Y}_l$ generated when varying the weights in $\Omega$. Moreover, let us define $\hat{\mathcal{Y}}_l$ such that $\hat{y}_{i(l)}(t) = y_{i(l)}(t) / f'(a_{i(l)}(t))$.
4. We assume that there is no connection that jumps a layer. Therefore, for MLNs and autoassociators, for weights connecting layer $l-1$ to layer $l$ the gradient can be represented by a matrix $\mathcal{G}_{l-1} \in \mathbb{R}^{n(l-1)+1,n(l)}$, whose generic element $g(j(l-1), i(l))$ is given by $\partial E_T / \partial w_{i(l),j(l-1)}$ if $j(l-1) \le n(l-1)$ and $\partial E_T / \partial w_{i(l)}$ if $j(l-1) = n(l-1)+1$. For the hidden layer of an RBF network, $\mathcal{G}_{l-1} \in \mathbb{R}^{n(l-1)+2,n(l)}$, where the last row accounts for the derivative w.r.t. $\sigma_{i(l)}$.
B. RECURRENT NEURAL NETWORKS

We are mainly interested in recurrent networks used for processing sequences. The computational style we consider is that of feeding the network by sequences of frames (tokens) and computing the activation on-line without waiting for state relaxation as in Hopfield networks [23] and Boltzmann machines [24].
Formally, a token $S_{F(t)}(t)$, with $t = 1, \ldots, T$, is a sequence of $F(t)$ frames (input vectors): $S_{F(t)}(t) = \{X_0(f, t) \in \mathbb{R}^{n(0)},\ f = 1, \ldots, F(t)\}$. The number of frames composing a given token $t$ is referred to as the token length $F(t) \le F_{\max}$, where $F_{\max} = \max_{1 \le t \le T} F(t)$. The networks we consider have a single layer of $n(1)$ recurrent neurons,^ whose activation follows the equation

$$a_{i(1)}(f, t) = \mathcal{G}\big(W_{i(1)}, X_0(f, t), X_1(f-1, t)\big) = \big(W^r_{i(1)}\big)' X_1(f-1, t) + \big(W^0_{i(1)}\big)' X_0(f, t),$$

where $W^r_{i(1)} \in \mathbb{R}^{n(1)}$ denotes the feedback connections and $W^0_{i(1)} \in \mathbb{R}^{n(0)}$ the connections from external inputs. The output of the neuron $i(1)$ is related to its activation by a squashing function [see Eq. (3)]. Finally, among the processing units of the network, the output neuron plays the important role of coding the class of the sequences that feed the network, as we are interested in dealing with positive and negative tokens only. The learning process is based on a set of supervised tokens, collected in the $T$ input-target pairs:

$$\mathcal{C}_e = \big\{ (S_{F(t)}(t), D(t)),\ S_{F(t)}(t) \in \mathcal{S}_T,\ D(t) \in \mathbb{R},\ t = 1, \ldots, T \big\},$$

where $S_{F(t)}(t)$ is the input sequence, $D(t)$ is its corresponding target value for the output at $F(t)$, and $\mathcal{S}_T$ is the token space.

^Thus, for the sake of simplicity, we will discard the layer index.

For recurrent neural networks the following notation needs to be defined:

1. $\mathcal{X}_{0,s}(t) = [X_0(1, t) \cdots X_0(F(t), t)] \in \mathbb{R}^{n(0),F(t)}$ is called the token trace. Let $F^* = \sum_{t=1}^{T} F(t)$. $\mathcal{X}_0 = [\mathcal{X}_{0,s}(1) \cdots \mathcal{X}_{0,s}(T)] \in \mathbb{R}^{n(0),F^*}$ is called the input trace. It collects all the tokens of the learning environment.
2. Let us define $\mathcal{X}_{0,f}(f) = [X_0(f, 1) \cdots X_0(f, T)] \in \mathbb{R}^{n(0),T}$. If $f > F(t)$, then we assume $X_0(f, t) = 0$. $\mathcal{X}_{0,f}(f)$ is referred to as the frame trace.
3. The matrix $\mathcal{X}_s(t) = [X(0, t) \cdots X(F(t)-1, t)] \in \mathbb{R}^{n(1),F(t)}$, $1 \le t \le T$, is referred to as the output token trace. $\mathcal{X} = [\mathcal{X}_s(1) \cdots \mathcal{X}_s(T)] \in \mathbb{R}^{n(1),F^*}$ is called the neuron trace. It collects the outputs of all the neurons of the learning environment.
4. The matrix $\mathcal{X}_f(f) = [X(f, 1) \cdots X(f, T)] \in \mathbb{R}^{n(1),T}$, $0 \le f \le F_{\max}-1$, is called the output frame trace. For $f \ge F(t)$ we define $X(f, t) = 0$.
5. Let us define $y_i(f, t) = \partial E / \partial a_i(f, t)$. The $y_i(f, t)$ delta errors can be collected in vectorial structures similar to those used for inputs and neuron outputs. Hence, $\mathcal{Y}_s(t) \in \mathbb{R}^{n(1),F(t)}$ is called the delta token trace, $\mathcal{Y}_f(f) \in \mathbb{R}^{n(1),T}$ is referred to as the delta frame trace, and $\mathcal{Y} \in \mathbb{R}^{n(1),F^*}$ is the delta trace.
6. The gradient of the cost function $E_T(w^r_{ij}, w^0_{ij}; \mathcal{N}, \mathcal{C}_e)$ w.r.t. the weights $W^r \in \mathbb{R}^{n(1),n(1)}$ and $W^0 \in \mathbb{R}^{n(1),n(0)}$ may be kept in the matrices $\mathcal{G}_{W^r} \in \mathbb{R}^{n(1),n(1)}$ and $\mathcal{G}_{W^0} \in \mathbb{R}^{n(0),n(1)}$, respectively. Notice that the transpose of these matrices must be used for weight updating.
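A minimal sketch of the on-line computational style just described, processing a token frame by frame through a single recurrent layer with squashing outputs; the sizes and random weights are purely illustrative assumptions:

```python
import numpy as np

def run_token(token, W_r, W_0):
    # token: array of shape (F, n0) holding the F frames of one sequence.
    # W_r: (n1, n1) feedback weights; W_0: (n1, n0) input weights.
    x = np.zeros(W_r.shape[0])          # initial state X_1(0, t)
    for frame in token:
        a = W_r @ x + W_0 @ frame       # activation a_{i(1)}(f, t)
        x = 1.0 / (1.0 + np.exp(-a))    # squashing, as in Eq. (3)
    return x                            # state after the last frame F(t)

rng = np.random.default_rng(0)
n0, n1, F = 4, 3, 5                     # illustrative sizes
token = rng.normal(size=(F, n0))
W_r = rng.normal(scale=0.5, size=(n1, n1))
W_0 = rng.normal(scale=0.5, size=(n1, n0))
print(run_token(token, W_r, W_0))       # output read at f = F(t)
```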
III. LEARNING WITH NO LOCAL MINIMA

This section contains some theoretical results aimed at guaranteeing local minima free error surfaces under some hypotheses on networks and data. The identification of similar conditions ensures global optimization just by using simple gradient descent learning algorithms (batch mode). The interest in similar conditions is motivated by the comparison with the perceptron learning (PL) algorithm [3, 25, 26] and with ADALINE [27, 28], for which optimal learning is guaranteed under the assumption of linearly separable patterns.

Baldi and Hornik [29] proposed a first interesting analysis on local minima under the assumption of linear neurons. They proved that the attached cost function has only saddle points and a unique global minimum. As the authors pointed out, however, it does not seem easy to extend such an analysis to the case of nonlinear neurons. Sontag and Sussmann [30] provided other conditions guaranteeing local minima free error surfaces in the case of single-layered networks of sigmoidal neurons. When adopting LMS-threshold cost functions, they proved the absence of local minima for linearly separable patterns. This is of remarkable interest, in that it allows us to get rid of spurious local minima arising with an improper joint selection of cost and squashing functions [31]. Shynk [32] showed that the perceptron learning algorithm may be viewed as a steepest-descent method by defining an appropriate performance function. In so doing, the problem of optimal convergence in perceptrons turns out to be closely related to the shape of such a performance function. However, although interesting, these analyses make no prediction in the case of networks with nonlinear hidden neurons.

Beginning from an investigation of small examples, Hush and Salas [33] gave some interesting qualitative indications on the shape of the cost surface. They pointed out that the cost surface is mainly composed of plateaus, which extend to infinity in all directions, and very steep regions. When the number of patterns is "small," they observed "stair-steps" in the cost surface, one for each pattern. When increasing the cardinality of the training set, however, the surface becomes smoother. Careful analyses on the shape of the cost surface, also supported by a detailed investigation of an example, were proposed by Gouhara et al. [34, 35]. They introduced the concepts of memory and learning surface.
The learning surface is the surface attached to the cost function, whereas the memory surface is the region in the weight space that represents the solution to the problem of mapping the patterns onto the target values. One of their main conclusions is that the learning process "...has the tendency to descend along the memory surfaces because of the valley-hill shape of the learning surface." They also suggest what the effect of the P and S symmetries^ [37] is on the shape of the learning surface. In the next sections, we give a detailed review of studies that address the problem of local minima for networks with nonlinear hidden layers from a theoretical point of view.
A. STATIC NETWORKS FOR PATTERN CLASSIFICATION

In this section, all our analyses and conclusions rely upon the following assumption:

ASSUMPTION 1. The entire training set can be learned exactly.

This hypothesis can be met when using a network with just one hidden layer, provided that it is composed of a sufficient number of neurons [38-41]. According to more recent research, when using hard-limiting neurons (the sgn(·) function), the perfect mapping of all the training patterns can also be attained by using at least T − 1 hidden neurons [42]. It may be argued that this architectural requirement is unreasonable in most interesting problems dealing with redundant information. On the other hand, for many problems of this kind (e.g., pattern recognition), the architectures that are commonly selected simply by trial and error give errors that are very close to 0, and it is likely even to find examples showing perfect mapping (see, e.g., [43, 44]).
1. Feedforward Networks

We begin by imposing the condition for finding stationary points in the cost function. On the basis of the definitions given in the previous section and on the

^P and S symmetries are weight transformations that do not affect the network output. The P symmetry can act on any vector $W_i$ of input weights of a given hidden neuron $i$. The vectors of the hidden neurons of an assigned layer can be permuted in any order, because their global contribution to the upper layer is not affected at all. The S symmetry acts for symmetric squashing functions such that $f(a) = -f(-a)$. In this case, a transformation of the weights can be created which inverses the sign of all the input and output connections of a neuron. More recently, Chen et al. [36] have proven that when using P and S symmetries, there are $n!2^n$ different weight assignments with the same output.
backpropagation rule,^ the gradient of the cost can be written as

$$\mathcal{G}_{l-1} = \big(\hat{\mathcal{X}}_{l-1}\big)' \mathcal{Y}_l, \qquad l = 1, \ldots, L, \tag{5}$$

where $\hat{\mathcal{X}}_{l-1} = [\mathcal{X}_{l-1}\ \mathcal{U}] \in \mathbb{R}^{T,n(l-1)+1}$ and $\mathcal{U} = [1, \ldots, 1]' \in \mathbb{R}^T$.

^As pointed out by le Cun [45], to some extent, the basic elements of backpropagation can be traced back to the famous book of Bryson and Ho [46]. A more explicit statement of the algorithm has been proposed by Werbos [47], Parker [48], le Cun [49], and the members of the PDP group [4]. Although many researchers have contributed in different ways to the development and proposition of different aspects of BP, there is no question that Rumelhart and the PDP group are given credit for the current high diffusion of the algorithm.

The following theorem introduces some hypotheses primarily concerning the network architecture, but also the relationship between the network and the learning environment. Basically, the theorem gives a sufficient condition for local minima free error surfaces in the case of pyramidal networks, commonly used in pattern recognition.

THEOREM 1. The cost function $E_T^{\mathrm{LMS}}(w_{ij}; \mathcal{N}, \mathcal{C}_e)$ is local minima free if the network $\mathcal{N}$ and the associated learning environment $\mathcal{C}_e$ meet the following PR1 (pattern recognition) hypotheses:
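In matrix form, Eq. (5) is one line of code. The sketch below assembles a layer output trace, appends the all-ones column $\mathcal{U}$ accounting for the bias, and forms $\mathcal{G}_{l-1} = (\hat{\mathcal{X}}_{l-1})' \mathcal{Y}_l$; the shapes and random contents are illustrative assumptions only:

```python
import numpy as np

T, n_prev, n_l = 5, 3, 2                       # illustrative sizes
rng = np.random.default_rng(1)
X_prev = rng.normal(size=(T, n_prev))          # layer output trace X_{l-1}
Y_l = rng.normal(size=(T, n_l))                # layer delta trace Y_l

X_hat = np.hstack([X_prev, np.ones((T, 1))])   # append U = [1, ..., 1]'
G = X_hat.T @ Y_l                              # Eq. (5): (n_prev + 1) x n_l gradient
print(G.shape)                                 # (4, 2): weight rows plus a bias row
```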
1. $n(l+1) \le n(l)$, $l = 1, \ldots, L-1$ (pyramidal hypothesis).
2. The weight layer matrices $W_l$, $l = 1, \ldots, L-1$, are full-rank matrices.
3. $\mathrm{Ker}[\hat{\mathcal{X}}_0'] \cap S_1^y = \{0\}$.

Proof Sketch (see [50] for more details). Because of PR1.3, $\mathcal{G}_1 = (\hat{\mathcal{X}}_0)' \mathcal{Y}_1 = 0$ implies $\mathcal{Y}_1 = 0$. According to the backpropagation step, $\hat{\mathcal{Y}}_l = \mathcal{Y}_{l+1} W_l$. From $\mathcal{Y}_l = 0$, $\hat{\mathcal{Y}}_l = 0$ follows and, consequently, $0 = \hat{\mathcal{Y}}_l = \mathcal{Y}_{l+1} W_l$, from which $\mathcal{Y}_{l+1} = 0$ because of PR1.2. Because $\mathcal{Y}_1 = 0$, $\mathcal{Y}_L = 0$ follows by induction on $l$. Finally, $\mathcal{Y}_L = 0$ implies $E_T^{\mathrm{LMS}} = 0$. ∎

A few remarks concerning the hypotheses of this theorem help to clarify its meaning. First, we notice that the pyramidal assumption does not involve the input layer ($l = 0$). This hypothesis appears as a natural consequence of the task accomplished by the neurons in networks devoted to classification, because the closer a hidden layer gets to the output, the more the information is compressed. This structure is often adopted in many practical experiments, and particularly for classification (see, e.g., [43, 44, 51-54]). Second, as also pointed out in [29], the hypothesis concerning the rank of the weight layer matrices $W_l$ is quite reasonable. Finally, the PR1.3 hypothesis involves both the network and the learning environment. Unfortunately, it is quite hard to understand its practical meaning, as it requires knowledge of $S_1^y$, that is, the set of all the $\mathcal{Y}_1$ generated when varying the weights in $\Omega$. The computation of $S_1^y$ seems to be very hard with no assumption on the problem at hand.
Basically, this condition involves both the network and the learning environment very closely, thus stating formally the intuitive feeling that the presence of local minima depends heavily on the mutual relationship between the given problem and the architecture chosen for its solution. We can think of Theorem 1 as a first general attempt to investigate the presence of stationary points in the error surface in the case of pyramidal networks. From this general point of view, the problem is very complex and the role of this theorem is essentially that of moving all the difficulties to condition PR1.3. A case in which PR1.3 holds is when all the patterns are linearly independent, because, in that case, $\mathrm{Ker}[\hat{\mathcal{X}}_0'] = \{0\}$. It is worth mentioning that if $\mathrm{Ker}[\hat{\mathcal{X}}_0'] = \{0\}$ holds, then the PR1.3 hypothesis only involves the learning environment. This is a very desirable property but, on the other hand, when the patterns are linearly independent the number of patterns $T$ cannot be greater than $n(0)$. This is a very serious restriction, because the number of inputs dramatically limits the cardinality of the learning environment. However, as will be shown later, this condition can be extended to more significant practical cases.

Theorem 1 can be easily restated to provide a necessary and sufficient condition guaranteeing local minima free error surfaces.

COROLLARY 1. Let us consider experiments based on pyramidal networks matching PR1.2 and learning environments satisfying Assumption 1. The associated cost function $E_T^{\mathrm{LMS}}(w_{ij}; \mathcal{N}, \mathcal{C}_e)$ is local minima free if and only if, for all the stationary points $\hat{W}$, $\mathcal{Y}_1(\hat{W}) = 0$ holds.
Proof If y\{W) = 0 holds for all the stationary points W , then E^^{wi^j\ J\f, Ce) is local minima free using the same arguments of the proof of Theorem 1. On the other hand, if E^^^ (wtj ;J\f,Ce) has only one global minimum, £^MS ^ Q implies 3^L(W^) = 0 for this point, from which yi(W) = 0 follows, because of the recursive application of the backpropagation relationship 3^/_i = yiWi-i and because j)/_i = 0 presupposes yi =0. • This corollary deserves attention primarily for the intriguing relationship with Rosenblatt's PL algorithm [26] and ADALINE [27] which it suggests. A close relationship between learning in multilayered and single-layered networks comes out because, under the PRl.l and PR1.2 hypotheses, the search for global minima of the cost in the case of multilayered networks is restricted to inspecting 3^1 (W^) = 0 only, exactly as in single-layered networks. This also makes clear that the additional problem coming out in the analysis of multilayered networks is that of providing a description of yi (W^), that is, of space S^. In order to discover meaningful conditions with a straightforward and practical interpretation, we propose investigating the case of patterns that are separable by
13
14
Monica Bianchini et al.
Figure 1 Separable patterns: (a) linearly separable patterns; (b) patterns separated by hyperspheres. Reprinted with permission from Neurocomputing 13,1996; courtesy of Elsevier Science-NL.
a family of surfaces Oc(), c = 1 , . . . , C; that is, cD,(Xo(0) < 0, Oc(Zo(0) > 0,
Vr in class c, otherwise.
For example, if Oc(^o(0) = A^Zg(0, Ac e W^^^-^^, the patterns are linearly separable (see Fig. la), whereas if c|)^(Xo(0) = ll^oCO — Cc\\ — re the patterns are separable by hyperspheres, where Cc and r^ are the center and the radius of the hypersphere, respectively. In the last case, all "positive examples" of different classes in the learning environment belong to regions bounded by hyperspheres, whereas all eventual negative examples, which do not belong to the assumed classes, are in the complementary domain (see Fig. lb). The following theorem deals with the simplest case of linearly separable patterns and specializes the results given in Theorem 1 under this new assumption. THEOREM 2. The cost function EY^^(wij;J\f, Ce) is local minima free if the network and the learning environment satisfy the following PR2 hypotheses:
• Network 1. The network has only one hidden layer (L = 2). 2. The network has C outputs where C is the number of classes. 3. Full connections are assumed from the input to the hidden layer The hidden layer is divided into C sublayers, Hi,..., He,..., He, and connections are only permitted from any sublayer to the associated output unit (see Fig. 2). The sublayer He contains ne(l) neurons. • Output coding Exclusive coding is used for the output. • Learning environment All the patterns ofCe are linearly separable.
Optimal Learning in Artificial Neural Networks
o
o^ ^ o /
i
^i,n(i)
Hi
(^nil)
i
^C,ie(l
1=2 C output neurons
)
Hc{^ic(i)
Hc(^ic{i)
15
C sub-layers He with nc(l) neurons
1
Wo
/=0 n(0) inputs Figure 2 Network architecture with multiple outputs.
Proof Sketch (see [50] for more details). Because of PR2.3, yi((o) is composed of elements of the same sign for patterns of the same class. Let us consider Qi = (^:^yyi = O and A € W^^^^K The equation A^X^yyi = 0 has, at least, the same solutions as Qi = 0. When considering ^ i 's sign and the hypothesis of linearly separable patterns, it follows that (A'(A:Q )0>'i is composed of terms with the same sign. As a result, 3^i = 0, which, in turn, implies that E^^^ is local minima free, because of Corollary 1. • The hypothesis on the architecture is not very restrictive. No output interaction is assumed; that is, the outputs are computed independently of each other. This hypothesis has also been adopted for proving the interpolation capabilities of MLNs in [38, 40, 41]. Plant and Hinton [55] have shown that these architectures learn faster than those with fully connected layers. Jacobs et al [56] have considered network assumption PR2.2 as a first step toward the conception of modular architectures that are usually well suited for high generalization. When keeping the pyramidal assumption, this architectural hypothesis can be removed at the price of introducing the assumption that W\ is a full-rank matrix [57]. The hypothesis of linearly separable patterns suggests a comparison with Rosenblatt's perceptron. It is well known that this hypothesis is also sufficient for guaranteeing, in the case of the simple perceptron, the convergence of the 5 rule [3, 25, 26] to configurations where all the patterns are correctly classified. Nevertheless, this must not lead us to conclude that when dealing with linearly separable patterns, perceptrons and MLNs are equivalent. As pointed out in [7, 44], the generalization to new examples is significantly better for networks
16
Monica Bianchini et al.
with a hidden layer. In the case of MLNs, the assumption of hnearly separable patterns is only sufficient to guarantee the convergence of a gradient descent learning algorithm. Moreover, also in the presence of local minima, one still has a chance to perform global optimization. Finally, there are cases in which backpropagation gets stuck in local minima, but the resulting suboptimal solutions are still useful in practice, whereas the PL algorithm for the perceptron oscillates. ^^ As a result, we can state that the superiority of MLNs with respect to single-layer perceptrons is not only a matter of experiments, but that it can be established on the basis of theoretical results. It is still an open research problem to identify sharper sufficient conditions guaranteeing local minima free cost surfaces.
2. Radial Basis Functions The results stated in Theorems 1 and 2 involve classic sigmoidal neurons with activation computed by the dot product of weights and inputs. When looking into the details of the proofs of these theorems, it is quite easy to realize that they are essentially based on the network architecture that is responsible for the factorization of the gradient stated by Eq. (5). This factorization is gained by using backpropagation and has nothing to do with the special neuron with which we are dealing. These remarks suggest extending the previous results to other multilayered networks based on different neurons. From the different choices following Eq. (1), the radial basis function networks [21,59] seem the most interesting. We consider multilayered architectures with a hidden layer of locally tuned units [21] and an output layer of ordinary processing units [4].^^ The multilayered architecture of radial basis function networks makes it possible to give the gradient a factorization that looks analogous to Eq. (5). • For the output layer, the use of the BP computing scheme makes it possible to determine the stationary points, as for MLN networks, by means of
Qi = {xD'yi = 0. • For locally tuned hidden neurons, the use of backpropagation leads to
a,(i) = 0 =^ x;^,iii)ym = o,
HD = i,..., MD,
^^This undesirable behavior, however, is not found for single-layered networks trained by LMS [28]. For Rosenblatt's perceptron, there is a generalization of the PL algorithm, called the pocket algorithm [58], that avoids cycling in the case of nonlinearly separable patterns. ^^The RBFs proposed in [21] have linear outputs. The assumption made in this chapter, however, does not change the essence of the analysis, which can also be carried out under the hypothesis of linear outputs.
Optimal Learning in Artificial Neural Networks
17
where ,^o--^r(
A-(i) -2-
n and A^r(-) is the T-matrix-replica operator, which creates a matrix with T rows all equal to the argument W^(i). ^la) = [^/(i)(l)' • • •» M{\){T)y e R^, where A,-(i)(0 = ll^o(0 - W/(i)||Vo^f(i), ^ = 1 , . . . , T, derives from differentiation with respect to Gaussian widths, and n = [ 1 , . . . , 1]' e R^ derives from biases. We can provide a natural extension of Theorems 1 and 2, stated for sigmoidal networks, to this case. THEOREM 3. The cost function E^^^(wij;J\f, Ce) is local minima free if the network Af and the associated learning environment Ce satisfy the following PRl-bis hypotheses:
1. n(2) < n(l) (pyramidal hypothesis). 2. The weight layer matrix Wi is full row rank. The following relationships between the kernel of matrices ^Q^I) cind ^i(\) ^^^d-
M ' ^ a / d ) ] ^ ^?i) = {0}' Proof (SQC [10]).
^^'(1) = 1' • • •' "(I)-
•
Nevertheless, as in the case of MLNs, in order to discover more significant conditions, we propose a geometric hypothesis on the data. The presence of locally tuned processing units leads us to consider inputs separated by hyperspheres. This assumption turns out to be dual with respect to that of linearly separable patterns for inner product-based neurons. THEOREM 4. The cost function EY^^(wij',J\f, Ce) is local minima free if the hypotheses of Theorem 2 are satisfied for the network J\f and the output coding, whereas the patterns of the learning environment Ce are separated by hyperspheres (PR2-bis hypotheses).
18
Monica Bianchini et al. Proof, • If the PRl-his hypotheses hold, then ^o,/(i)^W=0,
/(l) = l , . . . , n ( l ) ,
(6)
only admits the solution Yi{\) = 0. In fact, if the patterns are separable by hyperspheres, for each W/(i) e W^^\ there exist C vectors 4>^(\y^(i)) = (^'(W/(i)), 1, P(Wi^i))y G R"(0>+^ where HWi(i)) =
'—
,
c = 1 , . . . , C,
and
PiWiii)) = 4 - [ ^ c - \\Cc - W,(i)f ],
c = 1,..., C,
such that
sgn[(D;,;e^,.(!)] = sgnl" - ^ { o ' ( A b ( 0 - yWr(W,(i))) + ||A'o(O-W,-(i)f}+)0l = [-^ii-^ii---kii---i-^n-
(7)
The solution 7/(1) of Eq. (6) must necessarily satisfy the following equations:
KKiaM) = ^'
vc = i,...,c.
Hence, for each class c and for each neuron idl) of the hidden sublayer He, the following equality must hold:
V/e(l) e He.
(8)
Then this stationary point is not a local minimum. Condition (8) implies that for / (2) = c we get yi{2)(t) = f\We){f{We)
- de(t)) # 0,
(9)
Optimal Learning in Artificial Neural Networks
19
where Wc is the bias for output c, ddt) is the target for pattern t, and / is the squash function of output neurons. From the network hypotheses, it follows that the Hessian matrix H has the following block-diagonal structure W = diag[Wi,...,Hc,...,Wc],
(10)
where He e M(«(0)+2)nc(i)+nc(i)+i,(n(0)+2)ne(i)+nc(i)H-i is the submatrix associated with the subnetwork defined by the input layer, hidden sublayer He, and output c. This submatrix is partitioned as follows:
Hcihl) nc = HciOA)
HcihO) Wc(0,0)
where Wc(l, 1) € M«C(1)+I,A2C(1)+I and W C ( 0 , 0 ) G M(«(0)+2K(1),(«(0)+2K(1) ^re generated by the weights connecting the hiddens to the outputs and the inputs to the hiddens, respectively, whereas HdO, 1) = Hdl, Oy G R('^(0)+2)MC(1),«C(1)+I represents the cross-contribution of these weights. We observe that condition (8) implies that the delta trace 7/^(1) e R^, V/c(l) € He, is identically null. Hence, from the BP equations, we obtain that He(0, 0) is the null matrix. Now, the generic element of He(0, 1) has the following expression: 217LMS
d^E,
dWi^^l)j^0)dWi(2)Ml)
=2E t=l
2
f Wc(l)(0)jr(2)(0»
'iciD
where / (2) = c, j (0) denotes the generic input, and / is the locally tuned output function. Hence, any subcolumn H^(0, 1) e R"(^)+2 ofHeiO, 1) can be written as
rK(i)(i))jc(i) n',(o,i) = ;Y(,, (1)
-
^0,i(\)^c{2),
lf(ai^^l^(T))ye(T)j
where 7^(2) ^ R"^ has the sign structure explained by the right-hand side of Eq. (7). From the previous item and condition (9), we get that W^(0, 1) is not identically null. Therefore, the matrix He has the following structure:
ne =
P D' D 0
where the matrix P is assumed positive definite and D is not the null matrix. By applying Sylvester's theorem (see, e.g., [60, p. 104]), it can be obtained that He has both positive and negative eigenvalues, and, hence, the proof directly follows from (10).
20
Monica Bianchini et ah
In conclusion, if the PR2-bis hypotheses hold, the stationary points defined by the condition of zero weight for the connections between any hidden sublayer to the corresponding output are not local minima. Therefore, only the case in which the weights connecting each hidden sublayer with the corresponding output are not all null remains to be considered. In this case, the proof easily follows from a direct application of Theorem 3 because the PR2-bis network conditions satisfy the PRl.l-bis and PR1.2-bis hypotheses, and the PR2-bis learning environment and output coding conditions satisfy the PRl .3-bis hypothesis. • Some remarks are worth mentioning concerning this result. Remark 1 (Output coding). It is quite easy to see that Theorem 4 also holds for networks with only one output and positive examples belonging to a region delimited by a given hypersphere. In that case, the class c is simply coded by 1, whereas c is coded by 0. This extension also holds in the case of C exclusively coded classes and a set of negative examples that do not belong to the assumed classes. Remark 2 (Beyond hyperspheres?). When choosing more general processing units for the hidden layer, one may wonder if the results put forward in Theorem 4 also hold for different separation surfaces. It can be shown that this is indeed the case for more general radial basis functions in which the locally tuned processing units follow the equation: «/(i)(0 = (^0(0 - W,(1))'G,(1)(XO(0 - W,(i)),
(2/(1) being a symmetric matrix associated with neuron /(I). This extension is related to input preprocessing, which allows us to obtain complex separation surfaces with just single-layer networks (linear machines; see, e.g., [25, pp. 29-30]). Notice that the quadratic preprocessing for linear machines is not equivalent to radial basis functions for at least a couple of reasons. First, with linear machines and quadratic preprocessing, the quadratic separation between patterns of different classes is a necessary condition for learning their parameters. That is, if the separation condition does not hold, no solution can be found. However, with radial basis functions, the condition established by Theorem 4 is only sufficient and, therefore, solutions can still be discovered also in the case in which the separation condition is not met. Second, the generalization to new examples is generally different. With RBFs, the prior knowledge on a given problem can be exploited much better than for linear machines and this may lead us to significant improvements in generalization. Remark 3 (Forcing limited hyperspheres). In many cases, for discriminating patterns of different classes, the optimization algorithms may discover hyperspheres with "large" a. That means the same pattern discrimination could have
Optimal Learning in Artificial Neural Networks
21
been accomplished with significantly smaller hyperspheres. This is due to the fact that, in these cases, we do not restrict the space of the possible solutions with "negative examples" that do not belong to the assigned classes. Unfortunately, in most relevant applications, the definition of the set of negative examples is almost impossible. We can, however, constrain the radii of the hyperspheres and learn under such hypotheses. Theorem 4 also holds under these constraints. It suffices to assume a = B/{\ + exp(—a^)) and to consider the cost as a function of Gp. In so doing, a is constrained into the interval [—B, + 5 ] , which can be properly selected with some prior knowledge of the problem at hand. Remark 4 (Relationships with hybrid learning). It is well known that pattern mode weight updating departs, to some extent, from exact gradient descent of the cost function Ej. The adoption of "small" learning rates,^^ however, reduces arbitrarily the difference with respect to batch mode [4]. Following the gradient descent scheme in pattern mode, the locally tuned processing unit weights are updated according to W,(i)(r + 1) = W,(i)(r) + ?^^M(!) (Xo(0 - W,(i)(r)),
V/(l) = 1 , . . . , n(l),
where W/(i) = [M^/(I) i(0),..., w^/(l),rt(0)]^ This equation is essentially the same as that used for performing the updating of codebook vectors in learning vector quantization (LVQ)[61]. In particular, 2syi(^\){t)/af^^s plays the role of the coefficient a ( 0 and is consistent with the requirement of being asymptotically null. The basic difference, however, is that with BP optimization schemes any pattern of the learning environment affects, in principle, any vector W/(i) (codebook vector), whereas with LVQ just the closest patterns to Xo(0 in the Euclidean metric are taken into account (one vector in LVQl, two vectors in LVQ2 and LVQ3). However, with BP optimization schemes, the updating of the codebook vectors depends strongly on the distance ||Xo(0 — W/(i)ll- For a given pattern Xo(0» no codebook vectors W/(i) such thatfl/(i)(0 ^ 0, in practice, are updated because, in that case, yaj) ^ 0. One more remarkable difference is that with BP schemes one may have two or more codebook vectors reacting to a given pattern almost in the same way. These remarks also suggest that hybrid learning, as proposed in [61], may turn out to be less affected by suboptimal solutions. The self-organization step provides a first tuning of the codebook vectors such that they go away from each other. In so doing, in practice, the number of codebook vectors reacting to a given pattern decreases. The resulting effect is that of linking the reaction of patterns to the processing units. As a result, for a given processing unit the optimization takes place in a subset of the learning environment. Hence, hybrid learning 12.The term "small" is strictly related to the dimension of the learning environment.
22
Monica Bianchini et al.
performs a sort of divide and conquer, and is likely to exceed ordinary BP optimization starting with random weights. The conclusion is that, in many practical cases, hybrid learning can be very successful, particularly if BP optimization is used as second step instead of simple LMS.
B. NEURAL NETWORKS WITH " M A N Y HIDDEN U N I T S ' ' In the previous section, we investigated the presence of local minima without considering the influence of the number of hidden units. Let us consider a two-layered static network with linear output units ^^ and let Ej be the cost function. The following theorem puts forward the relationship between local minima and the number of hidden neurons. THEOREM 5. The cost function Ej (wij; A/*, Ce) is local minima free vided that the number of hidden units n{l) = T — I.
pro-
Proof Sketch (see [62] for more details). Let W^ be a stationary point for the cost function. The proof of this result can be obtained by considering the following two cases: • A'j^ is nonsingular when computed in W^. Notice that, under this assumption, ^^ = [X\, n ] G R^'^ is a square nonsingular matrix, and the condition Q2 = i^iYyi = 0 implies directly 3^2 = 0 which, in turn, yields Ej{wij\M, Ce) = 0. • X^ is singular in W^. In this case, the proof needs some more technical arguments, based on the fact that, for any small neighborhood of >Vp there exists W" lying in this neighborhood such that, the resulting X^'^ is nonsingular. • This result gives PDP's claims on the role of the hidden units a clear theoretical foundation. On the other hand, the assumption on the number of hidden units is completely unrealistic in most practical problems dealing with redundant information. We are confident, however, that Poston and Yu's result [8, 9] can be extended to data distribution "clustered" on some centers. In those cases, one could look for conditions involving networks with as many hidden units as centers and relax to "quasi-optimal" configurations. PR and Poston and Yu's conditions refer to two limit cases involving/not involving data structure assumption, respectively. PR conditions represent an attempt to break the joint relationship between Ker[A:'Q] and S^ (see Theorem 1). The assumption on pattern separation allows us to perform such a break and state ^^The same conclusions can also be drawn in the case of sigmoidal output neurons. ^^The theorem holds also when using LMS-threshold cost functions.
Optimal Learning in Artificial Neural Networks
23
the absence of local minima without being involved in careful analyses on S^. The knowledge on S^ that is exploited in the PR conditions is very limited and only concerns the sign of 3^i [7, 10]. Poston and Yu's condition avoids the problem of describing 3^1 by giving a very general result that is independent of the problem with which we are dealing. This is, at the same time, the strength and the weakness of this result, which is only of theoretical interest in most interesting practical problems. Open research problems are the extension of both PR and Poston and Yu's conditions and the exploration of possible integration of the ideas on which they are based.
C. OPTIMAL LEARNING WITH AUTOASSOCIATORS The results given in the previous sections concern static networks. It has been pointed out (see, e.g., [61]) that multilayered networks can also be used in a selforganizing style and that they are very useful for compression of information [22] and for speech verification [63]. The case of linear neurons has been analyzed carefully from a theoretical point of view and it is well known that linear autoassociators are very successful as autoassociative memories [64]. One significant limitation of linear autoassociators is that they do not perform input clustering. On the contrary, nonlinear autoassociators deal with clustering very well [65]. Concerning the learning process, however, linear autoassociators behave significantly better. It has recently been shown [29] that linear autoassociators produce local minima free error surfaces, whereas, as pointed out in [66,67], there is no such guarantee in the nonlinear case. Let us choose the quadratic cost function: T
T
nil)
and let J\f be an autoassociator with linear outputs. A general result can be derived by inspecting solely the null gradient condition w.r.t. the weights connecting the last layer. This result can be established for a network of any number of layers, but is reported here only for the case of one hidden layer, typically used in practice. THEOREM 6. Let M bea network used as an autoassociatorfor the learning environment Ce^ Under this assumption, the following condition:
^'2^2 = ^'2^0
holds at the end of the learning.
(11)
24
Monica Bianchini et al.
Proof Sketch (see [65] for more details). The equation WiA'{3^2 = 0 has, at least, the same solutions as Qi = X[y2 = 0. Because the network has linear outputs, y2 = Xi-V = X2- Ab. As a result, W\X[y2 = 0 implies A'^CAi -
Ab) = 0 .
•
Therefore, at the end of the learning process, this condition imposes the equality of the output coordinate correlation ^'2^2 and the input-output coordinate correlation A^^Ab. In this chapter, Eq. (11) is referred to as the end-of-leaming condition. Notice that this condition is also met for pyramidal networks, as no hypothesis has been done on Wi. Remark 5 (The end-of-leaming condition). In order to understand better the meaning of condition (11), let us consider the generic input /(O) and the associated output /(2). With reference to these units, the end-of-leaming condition becomes ||X/(2)|p = (X/(2), Xf(0)) and, if we sum up w.r.t. all input (output) units, X!?=i ll^i*(2)lP = m=ii^i(2),Xi(0)) holds. Because the scalar product and the induced norm are defined in R^, the sums w.r.t. the pattern coordinate i and the pattern itself t may be exchanged, thus obtaining YlJ=i 11^2(0iP = IZLi (^2(0, Xo(t)). This relationship has a very intriguing geometric meaning, which comes out while considering the network transformation A/^: R'^ ^ R": AfaiXoit)) -> X2(r), from which ^ f ^ i \\K{Xoit))f = J2j=i {J^a (^o(0)» ^o(O) follows. In the case of linear hidden units, the operator Ma is linear and the associated matrix A^^ satisfies the relationship A^^ = Na. Na is, therefore, di projection matrix and represents the optimal solution. By analogy with linear algebra, the operator A/'^ is referred to as di projection operator. Moreover, if we exploit the identity ||X2(0 - Xo(t)f = \\X2(t)f - 2Xo(t)X2(t) + \\Xoit)\\\ it becomes Zj^^ \\Xo{t)f = E L I P W I P + E r ^ i ll^2(0lP, where 8(t) = Xo(t) — X2{t). A concise way of expressing the preceding equation is that of introducing the norm (Xi) = ( l / r ) X ; L i ||X/(OlP. Using this symbolism, the projection condition sounds like (X2, X2) < {XQ, XQ). It becomes clear that the "efficiency" of the learning depends on the "energy" lost in YlJ=i 11^(0iP- Because the matrix A2 e R^'"^^^ has rank n(l), at most, for an efficient learning a dimensionality reduction of the input pattern must take place. Remarks (The end-of-leaming condition for classification tasks). Notice that Eq. (11) is properly derived for networks acting as autoassociators though it can similarly be restated when considering classification tasks. We still need to consider a linear output, but any target V can be assumed. In that case, Eq. (11) becomes ^^2^2 = ^^2^.
Optimal Learning in Artificial Neural Networks
25
D. R E C U R R E N T N E U R A L N E T W O R K S In this section, we discuss conditions guaranteeing local minima free error surfaces for recurrent neural networks. The analysis that we put forward has many relationships with the previous results given for static networks because it relies on the structure of the gradient equations which, because of backpropagation through time (BPTT) [4, 68]), resemble Eq. (5). The analysis that we propose leads us to identify some optimal conditions guaranteeing a local minima free cost function. These assumptions, referred to as the decoupling network assumptions (DNAs), involve both the network and the learning environment and are based on a new idea that, so far, has not been explored in the case of static networks. Let us begin by considering the similarities with static networks that emerge very clearly when dealing with the gradient equations. In order to give the gradient a formal expression, let us consider the learning environment
Ce = {{SFit)(t),d(t)), SFit)(t)eST, d(t)e{d-,d^],
r = l,...,r},
where [d~, d^] C [ J, J ]. It can be partitioned into the following sets: C^ = {t eCe: d(t) = d"^}, C- = {t eCe: d(t) = d-}, collecting the positive and the negative tokens, respectively. With the adopted formalism, we can compute the gradient of the error function by the following vectorial equations: ^max
g^
= J^ Xoj{f)yf{f)'
i=' f=i
T
= ^ A b , , (03^.(0' = -%y,
'=\
(12)
t=i
1. Role of the Input Structure Let us focus on the role of the input structure. This analysis is motivated by the results obtained in the case of feedforward networks, where the presence of many inputs is likely to limit the convergence problems due to local minima. We focus attention on the frame structure regardless of the dynamic relationships among frames within the sequences. The following theorem gives a first insight into the role of the input structure. THEOREM 7. If rank A<) = F*, then the cost function E^^{w\ Ce) has no local minima.
., w^y, J\f,
26
Monica Bianchini et al. Proof (sec [69]). 15
This condition is hardly met in practice because it requires the adoption of networks with an exaggerated number of inputs. In particular, the condition is likely to hold provided that n(0) > F* — 1. On the other hand, this theorem does not fully exploit the network structure and the stated result looks interesting only from a theoretical point of view. The theorem formalizes the "intuitive" feeling that the absence of local minima is strongly related to the number of inputs of each frame. Apart from the rank deficiency of the matrix AQ, which is not likely to hold if n(0) > F* — 1, the choice of enough inputs guarantees the absence of local minima, no matter what problem we consider. ^^ One may wonder if more interesting results can be obtained when assuming some structural properties on the learning environment. Because of the network dynamics, such a structure can involve both the single frames and their sequential relationships. The next theorem gives a result that only involves the frame structure. THEOREM 8. The cost function ElfJ^^iwj j , w^f, J\f, Ce) has no local minima if the network M and the learning environment Ce satisfy the following hypotheses:
• Network The matrix W^ is composed of nonnegative weights. • Learning environment All the frames of Ce are linearly separable into two classes^ depending on the token to which they belong. Proof (SQQ [69]).
•
Remark 7 (Network architecture). In practice, the assumption on the wj 's sign is not restrictive. In fact, for the case of symmetric squashing functions, a mapping from a general network with no constraints on the weight sign to a network with all nonnegative weights is always possible, which makes their inputoutput response equivalent [69]. Notice that no constraint is placed on the u;? 's sign. Architectures for which the theorem's constraints hold have already been shown to be very useful for applications to automatic speech recognition [70]. Remarks (Learning environment). The assumption of linearly separable frames is certainly more reasonable than that proposed in Theorem 7. Notice, however, that only one hyperplane must separate all the frames of different ^^The proof of this theorem, as well as that of Theorem 8, follows arguments closely related to those given for proving the theoretical results for static networks. ^^Notice that we impUcitly assume that there exists at least one solution with null cost for the given problem. As a consequence, this also places some indirect assumptions on the network architecture.
Optimal Learning in Artificial Neural Networks
17
classes. Therefore, a comparison with the case of feedforward networks shows that this condition is still quite restrictive, as it would suggest dealing with linearly separable tokens. The given results do not provide a very useful bound because they are likely to hold for networks requiring too many inputs for most interesting practical problems. From a theoretical point of view, however, they are remarkable for their clear explanation of the role of the input dimension and geometrical structure on the shape of the cost. In order to get some more general and useful results, one must be able to exploit the network architecture with its nonlinear dynamics. Basically, neither Theorem 7 nor Theorem 8 exploits the dynamic relationship between different frames, thus neglecting the most significant structural assumptions of a given problem. 2. Decoupling Network Assumptions In order to deal with the sequential relationships of different frames, we need the introduction of the following network unfolding matrix. Let A/y € R'^^i^'^max be defined as follows: 1. A/V(/, /^max) = kn{\)}^
V/: 1 < / < n ( l ) .
2. Vy: 1 < 7
MY{iJ)=\
1,
if 3/: 1 < ? < n ( l ) , wl. # 0 , andA/V(/,7 + l) = 1,
0,
otherwise.
A/y is only associated with the network M and, particularly, only with the neuron connection matrix W^ G ]K:'^(i)'"(i). The number of columns of A/V is the maximum token length Fmax- Each token t\ 1 < r < T of the learning environment Ce can be associated with a network unfolding matrix. In particular, for each token t, the following unfolding matrix can be defined as
5l(0 = O(Ary,F(0), where O(-) is the operator that extracts the last F{t) columns of A/y. From the gradient equation (12), we thus define the gradient contribution matrix for token t as
^
|1,
^ ' ^ ' ^ ^ ~ [0,
if[Ab,.(o5^.(OlO\7)/0, otherwise,
where Qt (/, j) is nonzero if token t contributes to the gradient element ^yyo (/, j). Finally, for each element of Qt, we define the token contribution set w.r.t. the ^^5j y is the Kronecker symbol, 8ij = 1 for i = j and bfj — 0 otherwise.
28
Monica Bianchini et al
learning environment >Ce» Hh j\^e) = {t ^ ^e- GtiL j) = ^], which collects all the tokens that contribute to the corresponding element of Qy\;o. In the following, we assume that each sequence contributes, at least, to one element of the matrix Gy\;o. Formally, this is stated by
[JX(iJ\Ce) = Ce.
(13)
Using the preceding definitions, we can now introduce the concept of decoupling for the classes C~^ and C~ w.r.t. a gradient component Gy\;o(i, j). We say that the gradient component Gy\;o(i, j) is decoupled w.r.t. the classes C"^ and C" provided that A.(/, j\Ce) = C^ OTX(i, j\Ce) = C~ holds. Let us consider the case in which Gy\;o(i, j) is decoupled w.r.t. C^ (C~) and k(i, j\Ce) C C~ strictly (X(/, j\Ce) C C^). In order to extend the preceding definition of decoupUng, a simple algorithm can be conceived that recursively checks if some sequences in Ce can be decoupled by, at least, one of the gradient elements. ALGORITHM
1 (Gradient decoupling test).
1. Initialize: 2. If A^ = 0 and A^ = 0 then stop. 3. If Biikjk)' G\/\;o(ikJk) is decoupled w.r.t. A^ then A~_^^ "^ ^k \ ^(^k, jk\Ce) and A^^^ ^ A+; else if 3(ik, jk)'- Gy\;oiik, jk) is decoupled w.r.t. A^ then A+_i ^ K \ Hik, jk\Ce) and A^^^ ^ A^; else stop. 4. k ^(^k-\-\ and goto step 3. DEHNITION 1. The matrix Gy\;o is decoupled w.r.t. the classes C^ and C~ if Algorithm 1 terminates fork = k and AT = A^ = 0 .
Remark 9. It is quite easy to prove that if there exists (/, j) such that the gradient component ^yy;o(/, j) is decoupled w.r.t. C^ and C~, then Algorithm 1 terminates with A t = A r = 0 and, therefore, Gy\;o is decoupled w.r.t. C^ and C~, too. It suffices to choose (/i, j\) = (/, 7) at step 3 of the algorithm. The consequence is A]^ = 0 (A]" = 0); that is, the next steps will involve only elements in Aj~ (A]^). Because of (13), after k steps the algorithm necessarily ends with A t = AT = 0. k
k
Optimal Learning in Artificial Neural Networks
29
THEOREM 9. The cost function EY^^iwj J, w^i'.Af, Ce) has no local minima if the network M and the learning environment Ce satisfy the following DNA hypotheses:
• Network The matrix W^ is composed of nonnegative weights. • Output coding The supervision is only placed on the neuron n (I) at the end of each token. • Learning environment 1. The network is fed by nonnegative inputs; 2. The gradient component Qyy^o(i, j) is decoupled w.r.t. the classes C^ andC~. Proof. The proof of the theorem is based on the impHcations of the condition Qy^o z= 0. Because weight constraints^^ are only assumed on the neuron weight matrix W ^ this condition must certainly hold for any optimal solution. • For allt, Et =0 if and only ifyn{F(t), t) = 0. It follows directly from the BPTT relationships yniF(t),t)
\f(an(F(t),t))r^(xn(F(t),t)-d+) = i ' [f(an(F(t),t))l[(xn(F(t),t) - d-)
yi(F(t),t)
=0,
ifr G C + , if
teC-,
i = l,...,n-l,
(14)
yiif,t) = f{ai(fj))J2^liykif
+ ht),
f < F(t),
k
and the definitions of Et and yn(F(t), t). • If the matrix W^ is composed of nonnegative elements and there exists a neuron i: I < i < n, such that yi(f,t) = 0 and Myii, f) = 1, then yn{F{t),t)=0. According to the BPTT backward step (14),
ytif, t) = f{ai(f, t)) J2 ^iMf
+ 1' ^)
k
holds. Because of the assumption A/y(/, / ) = 1, a path connecting neuron / to the output n exists in the unfolded network associated with J\f. Because W^ has nonnegative weights, along that path the weights of W^ are certainly positive. As a result, the proof follows from Eq. (14) by induction on / . ^^A simple implementation of the nonnegativity constraints on W^ can be achieved by the introduction of hyperparameters (pij such that wj . = 0? ..
30
Monica Bianchini et al.
• Let the matrix W^ be composed of nonnegative weights. For a token t, if yn(F(t), t) > 0 [yn(F(t), t) < 0], then ys(t) elements are positive (negative) for all coordinates (/, / ) , where the correspondent element of ys{t) is 1. The proof can easily be obtained by induction on / by using the backward BPTT relationship (14) and considering the hypothesis on the sign of W ^ The hypothesis on ys(t) just allows us to identify neurons and frames where the backpropagation takes place. If this assumption does not hold for indices /, / , then
yi(f,t)=o. Assume that the DNA hypotheses hold. We prove that Wt e Ce =^ Et = 0. Let us execute Algorithm 1 step by step. At the beginning, AQ" = C^ and A^ = C~ hold. Because of the hypothesis AT = At = 0, there exists (/Q, jo) such that Gy\;o(io, jo) is decoupled w.r.t. C^ or C~. As a consequence, all the tokens t e Hlo, jol^e) belong to the same class, causing the corresponding final delta errors yn{F(i), i) to have all the same sign. Then all the delta errors yiQif, f), V/, / have the same sign. Hence, the null gradient condition Fit)
Gy^o(io, jo) =
J2
Yl ^^•'0(^' 0>^K/' 0 = 0'
tek(io,Jo\Ce) / = 1
implies that
V/(f): xj,4t, fit)) > 0,
^fY{io, fit)) = 1 =^ ytoifii), i) = 0.
Therefore, yn(F(i),t) = 0 follows, which, in turn, implies E^ = 0. Because E^ = 0, we can consider all the tokens collected in A(/o, jol^e) as correctly classified, thus reducing the learning environment. Let us assume by induction on k that the application of Algorithm 1 implies Et =0 for all the tokens considered up to step A: — 1 and choose (ik, jk) such that Gy\;o{ik, jk) is decoupled w.r.t. A^ or A^. If all the tokens in X(ik, jkl^e) are of the same class, we can proceed as before; otherwise consider the case for which Gy\;o(ik, jk) is decoupled w.r.t. A^.^^ From this assumption, it follows that each token r € C"^, t e X(ik, jkl^e) was eliminated in a previous execution of step 3 of Algorithm 1 and, therefore, Et = 0. The tokens that actually contribute to element Gyyo(ik, jk) are only from class C~. If we impose the condition Gy\;o(ik, jk) = 0, then we deduce that Et =0 also for these tokens. Because AT" = A t = 0, Vr G Ce=^ Et=0 and, finally, E = 0. • Remark 10 (DNA and Architectural Design). The hypotheses concerning the network architecture and the output coding are the same as those of Theorem 8 and have already been discussed, whereas the conditions on the learning environment are different. The first assumption involving the input sign is not restrictive 19In the case in which Gyuoiik^ Jk) is decoupled w.r.t. A^ , we can proceed in the same way.
Optimal Learning in Artificial Neural Networks
31
at all and can always be met in practice under simple linear translation of all the data. On the other hand, the practical implications of the last condition are more difficult to evaluate directly. However, the analysis of the network unfolding matrix suggests that the decoupling test (condition 2 on the learning environment) is likely to succeed for networks having few connections. Obviously, the choice of similar networks requires a sort of prior knowledge of the task at hand. More interestingly, the role of the DNAs can go beyond the simple test. The DNA conditions can be used to design the network architecture in order to avoid the presence of local minima [69]. EXAMPLE 1. In this example, we show how to choose the network architecture to meet the DNA conditions for the following task:
• Consider the set of the binary tokens for which F(t) = 3/7, p = 1,2,..., Pmax- Classify these strings so that the positive strings are those for which Xo(/, 0 = 0 , f ^3k, k = 1,2,..., pmax, whereas all the others are negative. Because the positive strings do not generate the whole Euclidean space for each sequence length, it is possible to choose a vector that is orthogonal to these sequences. If we construct an unfolding matrix A/y having this vector as a row, then the corresponding gradient component will be decoupled w.r.t. classes C^ and C~. In particular, the vector [..., 1, 1, 0, 1, 1,0] meets this requirement. If we choose such a vector as the first row of the unfolding matrix (see Fig. 3b), thQnk(l,l\Ce) = C-. Because of the structure of the problem, an unfolding matrix is required in which the columns are repeated with period 3. Notice that the previous row suggested for the unfolding matrix meets this requirement. As a design choice, let us assume that the network has a ring structure as in Fig. 3a. In particular, the network is composed of a ring of three neurons containing the output neuron and
^fY =
0 0 0 1
1 1 1 0 0 1 0 0
0 0 0 1
1 1 0 1 0 0 0 1 0 0 0 1
(b) Figure 3 Example of DNA application: (a) network architecture; (b) network unfolding matrix. Reprinted with permission from IEEE Trans. Neural Networks 5(2), 1994; courtesy of IEEE.
32
Monica Bianchini et ah
a single control neuron properly connected to the ring. These design choices lead us to define the network unfolding matrix depicted in Fig. 3b. Using the My definition, the connections from the control neuron to each neuron of the ring turn out to be automatically specified (see Fig. 3). Several experiments were carried out in order to get some comparisons between the network created using the DNA design criteria (DNA network) and some fully connected networks having one input, n fully connected hidden neurons, and one output connected to all the hidden units (l-w-l networks; see [69]). In all cases, the DNA network exhibited a perfect generalization even when trained with few examples and the convergence behavior was significantly better. These experimental results confirm the importance of choosing the "right architecture" for solving a given problem. The design criteria we have proposed are very successful for both speeding up the convergence and improving the generalization, because they lead us to choose an architecture tuned to the task at hand and such that the associated cost function is local minima free. The same design criteria can be extended to a class of analogous problems, whereas the basic idea is likely to be useful in general. Of course, there are problems for which this simple design scheme does not allow us to reach a complete decoupling. In those cases, one may introduce a sort of decoupling index and conceive searching algorithms, in the space of the network architectures, aimed to optimize such an index. In so doing, the ordinary learning step based on optimization in the weight space would be preceded by a searching step producing a network architecture tuned to the problem at hand.
E.
O N THE EFFECT OF T H E L E A R N I N G M O D E
All the analyses that have been carried out so far concern the shape of the error surface and are independent of the learning algorithm used. This makes the previous results very attractive from a theoretical point of view, because there is no doubt that any learning algorithm has to deal with the shape of the error surface which, to some extent, gives us a sort of index of complexity. Moreover, if one uses batch mode with gradient descent optimization techniques, then the previous results on the absence of local minima sound like results on optimal learning. We should not neglect, however, that in many experiments, particularly those based on redundant data, the pattern mode weight updating turns out to be more efficient than batch mode. If we place the learning in artificial neural networks in the framework of function optimization, then the use of learning modes different from batch mode looks quite heuristic and appears not to have any theoretic foundations. All we can do is to realize that the smaller the learning rate is, the slighter pattern mode departs from correct batch mode gradient descent. On the
Optimal Learning in Artificial Neural Networks other hand, if pattern mode is just an approximation of batch mode, there is neither theoretical nor practical interest in its application. Pattern mode and other weight updating schemes are themselves interesting and worthy of exploration. The extension of Rosenblatt's analyses on the optimal convergence of the perceptron does not appear a simple task, but we feel that both the practical and the theoretical exploration of learning modes different from batch are very important. The results given in the previous sections suggest that progress in the field is based on the ability of optimization algorithms to go beyond the "border of local minima" efficiently. The conditions that we have reported are a first attempt to draw such a border. Beyond that border, however, the results rely heavily on the capability of our learning algorithm to perform global optimization. It is the difficulty of this general problem that suggests alternative learning modes. Recently, Gori and Maggini [13] have proven that a feedforward network with one hidden layer and one output unit, learning with pattern mode, converges to an optimal solution if the patterns are linearly separable. Notice that this result holds independently of the learning rate, which is also the case in which pattern mode is not just an approximation of the correct function optimization performed by batch mode.
IV. LEARNING SUBOPTIMAL SOLUTIONS In this section, we explore cases in which the learning process may not produce the optimal solution. There are several reasons for which a learning algorithm can fail to discover the optimal solution. When using batch mode, the presence of local minima in the cost function is the direct flag of potential failures. With other modes, the algorithm's behavior becomes difficult to understand only on the basis of the shape of the cost function, although it can still be useful. For example, if we use pattern mode on a large database, a potential problem is that the use of too large learning rates may lead to updating the weights only on the basis of the "recently seen" patterns. This forgetting behavior is not the only problem one has to face. Numerous different problems may emerge from special updating techniques, depending on the choice of the learning parameters. As pointed out by Lee et al [71], a very remarkable problem is premature saturation, that is, saturation of neuron outputs in configurations that are far away from optimal solutions. Premature saturation causes the learning to cross very flat regions (plateaus) from which it may escape only if there is enough patience and computational power available. In the case of recurrent networks, this problem may become very serious when dealing with "long" sequences, because of backpropagation through time of the errors. Moreover, another source of troubles for "long sequences" is bifurcation of the learning trajectories [72], commonly found by researchers in experiments on inductive inference of regular grammars.
33
34
Monica Bianchini et al.
A. LOCAL M I N I M A IN NEURAL NETWORKS In this section, we propose some artificial examples in which the associated error surface is populated by local minima or other stationary points. These simple examples have been conceived to clarify the mechanisms behind the creation of local minima. As for their relevance in most interesting practical applications, one should not forget that these problems have a significantly different structure typically due to the data redundancy. As pointed out by Sontag [16], "It is entirely possible that 'real' problems—as opposed to mathematically constructed ones—will not share these pathologies." Thus, it becomes even more urgent to extensively characterize all the features that would cause "real" problems to be incorrectly faced by neural networks. In the following, we will try to give a detailed exhibition of all these features. Then several examples are proposed and referred to having local minima in the error surface. Their analysis makes clear the mutual role of networks and data. Most serious local minima are essentially due to dealing with "difficult" problems: these minima depend on the structure of the problem {structural local minima) and on the fitness of the network to the assigned data. Moreover, spurious local minima may arise from an inappropriate joint choice of J\f, Ce, and ET (e.g., squashing and cost functions, target values). 1. Spurious and Structural Local Minima a. Spurious Local Minima There have been some efforts to understand the BP behavior in feedforward networks with no hidden layer. Even if this case may also be approached with the perceptron learning algorithm [26] or in the framework of ADALINE [28]^^ it provides a testing ground for hypotheses on the local minima structure of the cost function in more general cases. Brady et al [31] give examples illustrating that with a linearly separable training set, a network performing gradient descent may get stuck in a solution that fails to separate the data, thus leading to the pessimistic conclusion that BP fails where perceptron succeeds. Nevertheless, the analysis of these cases reveals that those spurious local minima are due to an improper pined choice of the cost function, the nonlinear neuron functions, and the target values. A quick glance makes it clear that these examples only hold when choosing targets different from the asymptotic squashing function values. As pointed out in [30], using instead an LMS-threshold cost function (see Section II), where values "beyond" the targets are not penalized, these counterexamples cease to exist, whereas a convergence theorem that closely parallels that of perceptrons holds [7,57] also 20.*The choice of the algorithm depends on the use of hard-Hmiting or Unear neurons.
Optimal Learning in Artificial Neural Networks for networks with a hidden layer. The spurious local minima suggested in [31] are only present in the case in which the target values differ from the asymptotic limits J, d of the squashing function /(•). They are essentially due to the fact that guaranteeing 3^i 's sign is no longer possible (see Theorem 2 and [7] for further details). b. Structural Local Minima If we look at the problem of supervised learning in general, the shape of the cost function depends on several elements. Keeping fixed the pattern of connectivity, we have seen that squashing and cost functions still play quite an important role. As a result, different choices of the cost may lead to optimization problems with different minima. Most importantly, the optimization problem at hand is closely related to the mapping performed by the network. Consequently, the network architecture and the learning environment play a very fundamental role. The data are fixed, because they are an input from the problem at hand, whereas the network architecture is ordinarily tailored to the problem itself. Sontag and Sussman [16] have proposed an example of a local minimum in a single-layered network which is remarkably different from those cited previously, as it involves the problem structure. They observed that, intuitively, the existence of local minima is due to the fact that the error function is the superposition of functions whose minima are at different points. In the case of linear response units, all these functions are convex, so no difficulties arise because a sum of convex functions is still convex. In contrast, sigmoidal units give rise to nonconvex functions, which give no guarantee for the summed function to have a unique minimum. In order to exhibit such a behavior, it is necessary to obtain terms whose minima are far apart and to control the second derivative so that the effect of such minima is not cancelled by the other terms. In the example given in [16], this is done via a network with one output neuron without threshold, which is incapable of correctly learning the proposed set of patterns. Other "difficult" problems where the networks were, in fact, capable of mapping the given data are given in [7, 50,73, 74].
2. Examples of Local Minima in Feedforward Networks We now discuss some examples of MLNs where BP gets stuck in local minima. Basically, these examples belong to the two different classes described previously, depending on the fact that the local minima are associated with the cost and squashing function chosen, or are intrinsically related to the network and the learning environment. 2. The first example is taken from [31], where a single-layered sigmoidal network is considered. The following linearly separable learning enviEXAMPLE
35
36
Monica Bianchini et ah
ronment is selected for minimizing the quadratic cost J^e = {([-^0,^1]',^)}
= {([-1, or, 0.1), ([1, or, 0.9), ([0, ir, 0.9), ao, s y , 0.9)}. (is) For the sake of simplicity, the explicit dependence of JCQ, xi, and d ont has been omitted. It turns out that there exists a nonseparating local minimum. The presence of this minimum is due to the fact that the asymptotic values (d_, d) are not used as targets as required for the quadratic cost function. If asymptotic values were used, then local minima no longer hold. In particular, as previously pointed out, this kind of local minimum is due to the fact that, for a given pattern, y/(i)'s may change their sign, depending on the weight configuration, whereas this sign cannot change when using asymptotic targets. This change of the sign is a source of spurious combinations of terms in the gradient equations that give rise to local minima. The F/(i)'s "sign structure" is associated with that of 3^2, and depends strictly on the asymptotic target assumption. Brady et al [31] have proposed other similar examples concerning linearly separable patterns. The common characteristic in these examples is that the patterns are not of the same "importance" [e.g., in Eq. (15) the last pattern has a module sensibly greater than the others]. It is quite intuitive that for such a set of patterns, a strong learning decision (e.g., asymptotic target values) must be used for BP to work properly. EXAMPLE 3. Sontag and Sussman [16] have proposed an example of a local minimum in a single-layered network which has no bias and which has the symmetric squash function as activation function. The quadratic cost function is minimized with reference to the following linearly separable learning environment
Ce = {([1,1,1, - 1 , - l ] ^ 1), ([1,1, - 1 , 1 , - l ] ^ 1), ([i,-i,i,-i,i]M), ([-i,i,i,-i,i]M), ([-i,i,i,i,-i]M), ([-i,-i,-i,i,i]M), ( [ - 1 , - 1 , 1 , - 1 , 1 ] ^ 1), ( [ - 1 , 1 , - 1 , 1 , - 1 ] ^ 1), ([1, - 1 , - 1 , 1 , - 1 ] ' , 1), ([1, - 1 , - 1 , - 1 , l]^ 1), ([i,i,i,i,i]M)}. This example is different with respect to the one discussed previously in that it assumes asymptotic target values. In this case, the presence of the local minimum is due to the fact that the chosen network (without bias) is not able to learn the patterns exactly. Hence, it turns out that Assumption 1 is no longer satisfied. EXAMPLE 4. Let us consider the standard XOR Boolean function. Following [74], it can be shown that there is a manifold of local minima with cost
37
Optimal Learning in Artificial Neural Networks
ET = 0.5. Another particular local minimum weight configuration is that having null all the weights. This fact makes it clear how situations in which a sort of symmetry involving the combined effect of weights and data may cause local minima for nonlinearly separable patterns. E X A M P L E S . In this example, we consider the X0R5 Boolean function [XOR, plus the training pattern ([0.5, 0.5]^ 0); see Fig. 4a]. Obviously, in this case. Theorem 2 cannot also be appHed because the patterns are not Hnearly separable. Depending on the initial weights, the gradient can get stuck in points where the cost is far from being null. The presence of these local minima is intuitively re-
Pattern
Target
XQ
Xi
A
0
0
0
B
1
0
1
C
1
1
0
D
0
1
1
E
0.5
0.5
0
Figure 4 (a) Network and learning environment of Example 5; (b) separation lines of the local and global minima configurations. For simplicity, neurons are progressively numbered, with no regard to the layered structure.
38
Monica Bianchini et al
lated to the symmetry of the learning environment. Experimental evidence of the presence of local minima is given in Fig. 5. We ran this experiment several times with different initial weights. We found that we can be trapped in these minima no matter what gradient descent algorithm is used. From a geometric point of view, these minima are related to the position of the separation lines identified by vec-
(a)
(b) Figure 5
Cost surfaces as functions of (a) W2o, W42 and (b) W20, W4, respectively, for Example 5.
Optimal Learning in Artificial Neural Networks
39
tors [1, 1]' (configuration Sg) and [1, —1]' (configuration Si). In [75], it is clearly proven that the particular configuration Si is a local minimum and that, starting from this configuration, the gradient descent procedure never reaches the global minimum Sa.
3. Mapping Local Minima from Feedforward to Recurrent Networks The training of recurrent networks has to deal with problems of suboptimal learning which, similar to feedforward networks, depend on the shape of the error surface. The presence of local minima and also of large plateaus is the source of most serious problems. In recurrent networks, the problem of neuron saturation, which gives rise to plateaus, is significantly more serious than in feedforward networks, because it emerges dramatically when trying to capture long-term dependencies [76]. Analogously, the presence of very abrupt changes in the cost, which can be monitored by the gradient instability, is the other feature that makes recurrent network training very hard, particularly for long sequences. This feature is related to the presence of bifurcations in the weight learning trajectory [72]. In the previous section, we have seen examples of small problems involving feedforward networks giving rise to local minima in the error surface. One may wonder if these examples can be replicated in the case of recurrent networks. Let us consider the examples of local minima proposed in [74] and [7], respectively. They involve the well known XOR net proposed by Rumelhart et ah [4, p. 332]. A simple recurrent network and two associated learning environments can be constructed which give rise to exactly the same cost functions as those of the XOR network. In order to build this mapping, we consider tokens composed of two frames only and, according to the theoretical framework proposed in the chapter, we place the supervision at the end of each sequence. EXAMPLE 6. The recurrent network^^ that we consider (see Fig. 6a) is fed on the following learning environment:
0010 | 0001 | → 0,
0110 | 0001 | → 1,
1010 | 0001 | → 1,
1110 | 0001 | → 0,        (16)
where "|" is used for separating the frames. The recurrent network acts exactly as the associated static feedforward architecture of Fig. 6b. Moreover, the first two components in the first frame represent the static XOR inputs, whereas the others simulate the biases for the hidden and output neurons. The time delays assure that the
Figure 6 (a) The recurrent network; (b) the corresponding time-unfolded feedforward network. Reprinted with permission from IEEE Trans. Neural Networks 5, 1994; courtesy of IEEE.
biases act at the right time. Finally, the supervision is taken only on the second output neuron.

EXAMPLE 7. We can also consider an analogous example for the XOR5, proposed in [7], obtained by simply adding the token
0.5 0.5 1 0 | 0001 | → 0

to (16). As for feedforward networks, one can assess the influence of the local minima on the actual learning of the weights in the two different examples. Just one more token makes the problem of learning significantly more "difficult."

The method used for mapping the XOR and XOR5 examples to recurrent networks can obviously be extended to any problem of learning with feedforward networks. As a result, any learning task for feedforward networks can be related to one for recurrent networks fed by tokens composed of two frames having the same associated cost function. It is quite easy to realize that, in the case of tokens with F_max = 2, because the initial state is null, we need not constrain the weights of the associated unfolded network, whereas this nice property is lost for tokens with three or more frames. In these cases, the constraints on the weights of different layers of the unfolded feedforward network suggest that the probability of finding local minima increases, thus making the problem of learning in recurrent networks even more difficult.
B. SYMMETRICAL CONFIGURATIONS

The XOR5 example, given in Section IV.A.2, shows that local minima emerge from problems where a sort of symmetry exists in both the networks and the data. Similar considerations suggest not to begin learning from networks with
symmetrical configurations, and particularly with equal or null weights. Of course, in the case of equal weights, the actual impact on the learning depends on the data, whereas null-weight configurations are always associated with stationary points, no matter what problem we are dealing with. For example, this is certainly true in the case of feedforward networks with a symmetrical squashing function and no thresholds in the output layer. In that case, the outputs of the hidden units are null. The delta error y_l is non-null only at the output layer but, because x_{L−1} = 0 and there are no thresholds in the output units, it follows that the gradient g_{L−1} = 0. Because the weights are null, the backpropagation of the output delta error leads to y_l = 0, ∀ l = 1, ..., L − 1. As a result, g_l = 0, ∀ l = 0, ..., L − 1. For this reason, Rumelhart et al. [4] suggest neither learning from symmetric configurations nor starting from null weights.
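As a quick numerical check of the argument above, the following sketch (ours, assuming a single-hidden-layer tanh network with a quadratic cost and no output thresholds) verifies that the null-weight configuration yields identically null gradients:

```python
import numpy as np

def backprop_grads(W1, W2, X, D):
    """Quadratic-cost gradients for a one-hidden-layer tanh network
    with no thresholds, following the argument in the text."""
    H = np.tanh(X @ W1)            # hidden outputs: null when W1 = 0
    Y = np.tanh(H @ W2)            # network outputs
    dY = (Y - D) * (1 - Y**2)      # output delta error: not null in general
    dH = (dY @ W2.T) * (1 - H**2)  # backpropagated delta: null when W2 = 0
    return X.T @ dH, H.T @ dY      # gradients w.r.t. W1 and W2

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR inputs
D = np.array([[0.], [1.], [1.], [0.]])                  # XOR targets

g1, g2 = backprop_grads(np.zeros((2, 3)), np.zeros((3, 1)), X, D)
print(np.allclose(g1, 0), np.allclose(g2, 0))   # True True: stationary point
```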
C. NETWORK SATURATION

Problems of suboptimal learning may also arise when learning with "high" initial weights. In the literature, this is referred to as premature saturation (see, e.g., [71]). The problems deriving from high weights are essentially due to neuron saturation, which, in turn, makes the backpropagation of the delta error very hard. Obviously, as stated earlier by le Cun [44], neuron saturation is strictly related to neuron fan-in: the more the fan-in increases, the higher the probability of neuron saturation. le Cun [44] suggests selecting the initial weights randomly distributed in [−2.4/F_i, 2.4/F_i], where F_i is the fan-in of the unit to which the connection belongs. These considerations, and the comments of the previous section, suggest choosing the initial weights neither too high nor too small. Drago and Ridella [77] have proposed a statistical analysis aimed at determining the relationship between a scale factor (proportional to the maximum magnitude of the weights) and the percentage of "paralyzed" neurons, which has been shown to be very useful for improving the convergence speed. Based on their analysis, they also show how to choose the initial weight range with quick computer simulations.

It is interesting to notice that premature saturation of the output units is a problem essentially due to the LMS cost function. Indeed, LMS is a wrong choice when the network is interpreted as a statistical model for classification and training is supposed to obey the maximum likelihood principle. In this case, assuming a multinomial model for the class variable, the negative log-likelihood of the training data yields the relative cross-entropy metric. The main difference of this metric with respect to the ordinary quadratic cost is that the erroneous saturation of output neurons does not lead to plateaus, but to very high values of the cost: if one output is erroneously saturated (e.g., x_j → 0 while its target d_j → 1), then E → ∞. It is worth mentioning that the
large plateaus associated with the quadratic cost do not represent local minima and, consequently, do not attract the learning trajectory toward suboptimal solutions. However, the computational burden for escaping from similar configurations may be huge, and serious problems due to limited numerical precision may also arise. When using the relative cross-entropy metric, the repulsion from the previous erroneous configurations is much more effective, because there are no plateaus but surfaces with high gradient, and underflow errors are likely to be avoided.

Saturation problems also emerge when learning with radial basis functions in the case in which the Gaussian centers are randomly placed and the associated variance σ is "small." The hybrid learning scheme suggested in [21] handles this problem very effectively thanks to a proper weight initialization.

Neuron saturation is significantly more serious in recurrent networks, particularly when they are applied to capture long-term dependencies. This problem can be understood if we bear in mind the BPTT scheme for gradient computation. The network time unfolding for long sequences leads to a vanishing gradient and, consequently, it seems very hard to keep track of long-term dependencies [76]. Recent attempts to deal with these very serious problems can be found in [78-80].
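The contrast between the quadratic and the cross-entropy cost at an erroneously saturated output, as well as le Cun's fan-in-scaled initialization, can be illustrated with a small sketch (the numbers and helper names are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lecun_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Initial weights uniform in [-2.4/F_i, 2.4/F_i], F_i = fan-in [44]."""
    return rng.uniform(-2.4 / fan_in, 2.4 / fan_in, size=(fan_in, fan_out))

# An erroneously saturated output: activation a = -10 (x -> 0), target d = 1.
a, d = -10.0, 1.0
x = sigmoid(a)

grad_lms  = (x - d) * x * (1 - x)  # quadratic cost: ~ -4.5e-5, a plateau
grad_xent = x - d                  # cross-entropy:  ~ -1, strong repulsion
print(grad_lms, grad_xent)
```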
D. BIFURCATION OF LEARNING TRAJECTORIES IN RECURRENT NEURAL NETWORKS

Let us consider the problem of learning long-term dependencies using recurrent networks. A very serious problem arises that depends on the different qualitative dynamic behaviors taking place in recurrent networks acting on long sequences. It has been proven that, depending on the network weights, the network dynamics can change significantly (see, e.g., [72]). For example, depending on the weight of the self-loop connection, a recurrent network can exhibit a forgetting behavior, or information latching [81]. This different dynamic behavior can be understood by considering a single neuron having a self-loop connection as follows:
a_{i(1)}(t) = Σ_{j(1)∈N} w_{i(1),j(1)} x_{j(1)}(t − 1) + Σ_{k(0)∈I} w_{i(1),k(0)} x_{k(0)}(t)
            = w_{i(1),i(1)} x_{i(1)}(t − 1) + u_{i(1)}(t),

x_{i(1)}(t) = f(a_{i(1)}(t)) = tanh[a_{i(1)}(t)],
Figure 7 Equilibrium points for a neuron with a self-loop connection.
where

u_{i(1)}(t) = Σ_{j(1)∈N} w_{i(1),j(1)} (1 − δ_{i(1),j(1)}) x_{j(1)}(t − 1) + Σ_{k(0)∈I} w_{i(1),k(0)} x_{k(0)}(t).
Let us investigate the possibility of latching the information of a given state.

DEFINITION 2. We say that a given dynamic hidden neuron latches the information at t_0, represented by its activation a_i(t_0), provided that the following relationship holds:
x_i^s(t) = sgn(a_i(t)) = sgn(a_i(t_0)),    ∀ i, t, t_0 : t ≥ t_0.
It is quite easy to realize that, depending on the value of the self-loop weight w_{i(1),i(1)}, the neuron output will reach one of the three equilibrium points depicted in Fig. 7. As a result, around w_{i(1),i(1)} = 1, a very small change of the weight will lead to different equilibrium points, which results in a very different dynamic behavior. In the case in which the neuron output reaches zero, there is a sort of forgetting behavior, whereas in the other two cases an event is latched independently of the length of the sequence. Similar but more detailed analyses are drawn in [72] for the case of continuous recurrent networks, and lead us to conclude that there are bifurcation points in the learning trajectories. Like the saturation problem, the bifurcation of the learning trajectories is due to dealing with sequences of unbounded (in practice, "long") length. Facing saturation and bifurcation is an open research problem.
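The forgetting/latching dichotomy is easy to reproduce numerically. The following sketch (our notation; the external input is reduced to a brief initial pulse) iterates x(t) = tanh(w x(t − 1) + u(t)) for a sub- and a supercritical self-loop weight:

```python
import numpy as np

def iterate(w, u0, steps=50):
    """x(t) = tanh(w*x(t-1) + u(t)), with a pulse u(0) = u0, then u(t) = 0."""
    x = 0.0
    for t in range(steps):
        x = np.tanh(w * x + (u0 if t == 0 else 0.0))
    return x

for w in (0.5, 1.5):
    print(w, iterate(w, u0=2.0))
# w = 0.5: x -> 0           (forgetting behavior)
# w = 1.5: x -> ~0.86 != 0  (the pulse is latched indefinitely)
```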
V. ADVANCED TECHNIQUES FOR OPTIMAL LEARNING

The theoretical limitations that have been shown in the previous sections suggest looking for learning techniques capable of dealing more effectively with local minima and with generalization to new examples. Of course, there are many different approaches for coping with this problem, which are making neural network learning a very multifaceted discipline. The following brief review is not supposed to cover the many different techniques recently proposed in the literature, but simply to offer a sketch of some ideas that look promising.
A. GROWING NETWORKS AND PRUNING

As pointed out in the previous section, the requirement of reaching good convergence for any learning algorithm, no matter how it is conceived, and of attaining high generalization to new examples leads to a sort of uncertainty principle. In order to face this problem, several researchers have proposed pruning algorithms, in which we begin by training a network larger than "necessary," and then continue by pruning connections that, to some extent, do not affect the learning process significantly. As a result, the networks turn out to be tuned to the task at hand, with a consequent improvement of the generalization to new examples; small networks also have the advantage of being cheaper to build, and their operation is easier to understand.

Pruning algorithms may be grouped into two broad categories [82]: sensitivity and penalty-term methods. The sensitivity methods modify a trained network with a given structure by estimating the sensitivity of the error function with respect to the removal of an element (weight or unit), and then remove the element with the least effect (see, e.g., [83-86]). The penalty-term methods modify the cost function so that backpropagation based on that function drives unnecessary weights to zero; a sketch of such a penalty is given below. Even if the weights are not actually removed, the network acts like a smaller one (see, e.g., [87-92]).

Rather than beginning from large networks and subsequently pruning unnecessary connections, an alternative approach is that of using a small network that grows gradually in order to face problems of optimal convergence. Successful techniques based on this idea have been proposed in [58, 93-98]. Unlike pruning methods, those based on growing networks can often be guaranteed to converge to an optimal solution; the reverse of the coin is that the resulting networks may be too large, with consequent poor generalization to new examples.
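As an illustration of the penalty-term family, the following sketch implements a weight-elimination-style penalty in the spirit of [90-92]; the functional form, constants, and names are illustrative rather than the exact proposal of those papers:

```python
import numpy as np

def weight_elimination_penalty(W, w0=1.0, lam=1e-3):
    """Penalty lam * sum r/(1+r), with r = (w/w0)^2, and its gradient;
    weights that do not reduce the error enough are driven toward zero,
    so the network acts like a smaller one even before actual removal."""
    r = (W / w0) ** 2
    penalty = lam * np.sum(r / (1.0 + r))
    grad = lam * (2.0 * W / w0**2) / (1.0 + r) ** 2
    return penalty, grad

# During training, the penalty gradient is simply added to the cost gradient:
#   W -= eta * (grad_cost + weight_elimination_penalty(W)[1])
```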
B. DIVIDE AND CONQUER: MODULAR ARCHITECTURES

Another remarkable attempt to cope with the problems of learning from examples in neural networks is that of giving the networks a modular architecture. A divide-and-conquer approach has a biological inspiration and, at the same time, is well suited to giving rise to networks exhibiting high generalization to new examples. Modular architectures are the natural solution to most significant practical problems. For example, to deal with phoneme recognition, Waibel [99] has suggested a solution, referred to as connectionist glue, that is based on different modules trained on specific phoneme subsets having some common phonetic feature. Learning the single tasks associated with the modules turns out to be significantly simpler than learning the whole task. One major problem that must be addressed is the effective integration of the modules. Such integration must take spatial and temporal crosstalk into account: spatial crosstalk occurs when the output units of a network provide conflicting error information to a hidden unit, whereas temporal crosstalk occurs when a unit receives inconsistent training information at different times [100]. Interesting modular proposals have been put forward by Jacobs et al. [56, 101]. Moreover, Jacobs and Jordan [102] have recently suggested the use of EM (expectation maximization) [103] for learning in modular systems, with very promising results.

C. LEARNING FROM PRIOR KNOWLEDGE

The problems of learning from tabula rasa were put forward very well by Minsky [3]. He claimed that "... significant learning at a significant rate presupposes some significant prior structure. Simple learning schemes based on adjusting coefficients can indeed be practical and valuable when the partial functions are reasonably matched to the task ...". This chapter shows that today's neural network learning from tabula rasa has some theoretical foundations concerning the convergence to an optimal solution. However, although there have been significant experimental achievements using connectionist models, we are confident that Minsky's claim is still quite effective and that the integration of prior rules in neural networks can help reduce significantly the computational burden for the most relevant practical problems (see, e.g., [70, 104-106]).
VI. CONCLUSIONS

Most common neural network paradigms are based on function optimization. As a consequence, the success of the learning schemes taking place in such networks is strictly related to the shape of the error surface.
In this chapter, we have addressed the problem of optimal learning from a theoretical point of view. The focus is on batch mode learning and, therefore, on the shape of the cost function. In particular, we have reviewed conditions that guarantee a local minima free error surface for different network architectures. The PR conditions are based on the hypothesis of data properly separated in the pattern space, whereas Poston and Yu's conditions guarantee local minima free cost functions, no matter what examples are given, provided that we choose as many hidden units as patterns. These conditions give us a first comprehension of the problem but, unfortunately, both of them are only sufficient. The PR conditions seem limited by their restrictive assumption on the data, whereas Poston and Yu's conditions appear severely limited by the requirement on the number of hidden units. Bridging the PR and Poston and Yu's conditions or, most importantly, finding necessary and sufficient conditions for local minima free error surfaces is still an open research problem.

In the light of our theoretical framework, we have discussed problems of suboptimal learning due to the presence of spurious and structural local minima, premature saturation, and also bifurcations of the weight learning trajectory. The theoretical analyses on local minima reviewed in this chapter are not only interesting in themselves, but also give us insight into a different approach to machine learning. Basically, the decoupling network assumptions suggest that, for a given problem, some networks are better suited than others to perform optimal learning. In particular, there are cases in which one may design a network such that the associated cost function is local minima free. When this is not possible, one can introduce an index accounting for the decoupling on all the connections and perform a search, in the space of the network architectures, aimed at optimizing such an index [107]. In so doing, the learning process, ordinarily conceived as a function optimization in the weight space, would be preceded by a search step for selecting an architecture that is likely to be adequate for the task at hand. This integration of search and optimization seems to bridge artificial intelligence and neural network approaches to machine learning in a very natural way.

Finally, the presence of local minima does not necessarily imply that a learning algorithm will fail to discover an optimal solution, but we can think of their presence as a boundary beyond which troubles for any learning technique are likely to begin. We are confident that the theoretical results reviewed in this chapter also open the doors to more thorough analyses involving discrete computation, and that they could shed light on the computational complexity of learning.
REFERENCES

[1] J. Fodor and Z. Pylyshyn. Connectionism and cognitive architecture: a critical analysis. Connections and Symbols, pp. 3-72, 1989.
[2] R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning, an Artificial Intelligence Approach, Vols. 1/2. Morgan Kaufmann, San Mateo, CA, 1983.
[3] M. Minsky and S. Papert. Perceptrons, Expanded Edition. MIT Press, Cambridge, MA, 1988.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing (D. E. Rumelhart and J. L. McClelland, Eds.), Vol. 1, Chap. 8, pp. 318-362. MIT Press, Cambridge, MA, 1986.
[5] A. Törn and A. Žilinskas. Global Optimization. Lecture Notes in Computer Science. Springer-Verlag, 1987.
[6] A. A. Zhigljavsky and J. D. Pinter. Theory of Global Random Search. Kluwer Academic, Dordrecht, 1991.
[7] M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE Trans. Pattern Analysis and Machine Intelligence 14:76-86, 1992.
[8] T. Poston, C. Lee, Y. Choie, and Y. Kwon. Local minima and backpropagation. In International Joint Conference on Neural Networks, Seattle, Vol. 2, pp. 173-176. IEEE, New York, 1991.
[9] X. Yu. Can backpropagation error surface not have local minima? IEEE Trans. Neural Networks 3:1019-1020, 1992.
[10] M. Bianchini, P. Frasconi, and M. Gori. Learning without local minima in radial basis function networks. IEEE Trans. Neural Networks 6:749-756, 1995.
[11] T. Cover. Geometrical and statistical properties of linear threshold devices. Ph.D. Thesis, Stanford Electronics Laboratories, Stanford, CA, 1964.
[12] R. J. Brown. Adaptive multiple-output threshold systems and their storage capacities. Ph.D. Thesis, Stanford Electronics Laboratories, Stanford, CA, 1964.
[13] M. Gori and M. Maggini. Optimal convergence of pattern mode backpropagation. IEEE Trans. Neural Networks 7:251-254, 1996.
[14] S. J. Hanson and D. J. Burr. Minkowski-r back-propagation: learning in connectionist models with non-Euclidean error signals. In Advances in Neural Information Processing Systems (D. Anderson, Ed.), pp. 348-357, 1987.
[15] P. Burrascano. A norm selection criterion for the generalized delta rule. IEEE Trans. Neural Networks 2:125-130, 1991.
[16] E. Sontag and H. Sussman. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems 3:91-106, 1989.
[17] G. E. Hinton. Connectionist learning procedures. Artificial Intelligence 40:185-234, 1989.
[18] E. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In Advances in Neural Information Processing Systems (D. Anderson, Ed.), pp. 52-61, 1988.
[19] S. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems 2:625-639, 1988.
[20] T. Samad. Backpropagation improvements based on heuristic arguments. In International Joint Conference on Neural Networks, Washington, DC, Vol. 1, pp. 565-568. IEEE, New York, 1990.
[21] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Comput. 1:281-294, 1989.
[22] G. Cottrell, P. Munro, and D. Zipser. Learning internal representation of gray scale images: an example of extensional programming. In Ninth Annual Cognitive Science Society Conference, Seattle, 1987.
[23] J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. U.S.A. 79:2554-2558, 1982.
[24] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Sci. 9:147-169, 1985.
[25] N. J. Nilsson. Learning Machines. McGraw-Hill, New York, 1965. Reissued as Mathematical Foundations of Learning Machines. Morgan Kaufmann, San Mateo, CA, 1990.
[26] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanism. Spartan Books, Washington, DC, 1962.
[27] B. Widrow and M. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Vol. 4, pp. 96-104. IRE, New York, 1960.
[28] B. Widrow. 30 years of adaptive neural networks: perceptron, Madaline, and backpropagation. Proc. IEEE 78:1415-1442, 1990.
[29] P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2:53-58, 1989.
[30] E. Sontag and H. Sussman. Backpropagation separates when perceptrons do. In International Joint Conference on Neural Networks, Washington, DC, Vol. 1, pp. 639-642. IEEE, New York, 1989.
[31] M. Brady, R. Raghavan, and J. Slawny. Back-propagation fails to separate where perceptrons succeed. IEEE Trans. Circuits Systems 36:665-674, 1989.
[32] J. Shynk. Performance surfaces of a single-layer perceptron. IEEE Trans. Neural Networks 1:268-274, 1990.
[33] D. Hush and J. Salas. Improving the learning rate of back-propagation with the gradient reuse algorithm. In IEEE International Conference on Neural Networks, San Diego, Vol. 1, pp. 441-447. IEEE, New York, 1988.
[34] K. Gouhara, N. Kanai, and Y. Uchikawa. Experimental learning surface and learning process in multilayer neural networks. Technical Report, Nagoya University, Nagoya, Japan, 1993.
[35] K. Gouhara and Y. Uchikawa. Memory surface and learning surfaces in multilayer neural networks. Technical Report, Nagoya University, Nagoya, Japan, 1993.
[36] A. M. Chen, H. Lu, and R. Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Comput. 5:910-927, 1993.
[37] F. Jordan and G. Clement. Using the symmetries of multi-layered networks to reduce the weight space. In International Joint Conference on Neural Networks, Seattle, Vol. 2, pp. 391-396. IEEE, New York, 1991.
[38] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2:303-314, 1989.
[39] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks 2:183-192, 1989.
[40] R. Hecht-Nielsen. Theory of the backpropagation neural network. In International Joint Conference on Neural Networks, Washington, DC, Vol. 1, pp. 593-605. IEEE, New York, 1989.
[41] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366, 1989.
[42] S. C. Huang and Y. F. Huang. Bounds on the number of hidden neurons in multi-layer perceptrons. IEEE Trans. Neural Networks 2:47-55, 1991.
[43] Y. Bengio, P. Cosi, and R. De Mori. Phonetically-based multi-layered networks for vowel classification. Speech Comm. 9:15-29, 1990.
[44] Y. le Cun. Generalization and network design strategies. In Connectionism in Perspective, pp. 143-155. North-Holland, Amsterdam, 1989.
[45] Y. le Cun. A theoretical framework for backpropagation. In The 1988 Connectionist Models Summer School (D. Touretzky, G. E. Hinton, and T. Sejnowski, Eds.), pp. 21-28. Morgan Kaufmann, San Mateo, CA, 1988.
[46] A. Bryson and Y. C. Ho. Applied Optimal Control. Blaisdell, New York, 1969.
[47] P. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. Thesis, Harvard University, Cambridge, MA, 1974.
[48] D. Parker. Learning logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, Cambridge, MA, 1985.
[49] Y. le Cun. Learning processes in an asymmetric threshold network. In Disordered Systems and Biological Organization (F. F. Soulié, E. Bienenstock, and G. Weisbuch, Eds.), pp. 233-240. Springer-Verlag, Les Houches, France, 1986.
[50] M. Gori and A. Tesi. Some examples of local minima during learning with backpropagation. In Parallel Architectures and Neural Networks, Vietri sul Mare, Italy, 1990.
[51] H. Bourlard and C. Wellekens. Speech pattern discrimination and multi-layered perceptrons. Comput. Speech Language 3:1-19, 1989.
[52] J. Elman and D. Zipser. Learning the hidden structure of speech. J. Acoust. Soc. Amer. 83:1615-1626, 1988.
[53] Y. Bengio, R. De Mori, and M. Gori. Learning the dynamic nature of speech with backpropagation for sequences. Pattern Recognition Lett. 13:375-386, 1992.
[54] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37:328-339, 1989.
[55] D. C. Plaut and G. E. Hinton. Learning sets of filters using back-propagation. Comput. Speech Language 2:35-61, 1987.
[56] R. A. Jacobs, M. I. Jordan, and A. Barto. Task decomposition through competition in a modular connectionist architecture: the what and where vision tasks. Technical Report, COINS, 1990.
[57] P. Frasconi, M. Gori, and A. Tesi. Backpropagation for linearly separable patterns: a detailed analysis. In IEEE International Conference on Neural Networks, San Francisco, Vol. 3, pp. 1818-1822. IEEE, New York, 1993.
[58] S. I. Gallant. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1:179-192, 1990.
[59] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78:1481-1497, 1990.
[60] R. Bellman. Introduction to Matrix Analysis, 2nd ed. McGraw-Hill, New York, 1974.
[61] T. Kohonen. The self-organizing map. Proc. IEEE 78:1464-1480, 1990.
[62] X. Yu and G. Chen. On the local minima free condition of backpropagation learning. IEEE Trans. Neural Networks 6:1300-1303, 1995.
[63] M. Gori, L. Lastrucci, and G. Soda. Neural autoassociators for phoneme-based speaker verification. In International Workshop on Automatic Speaker Recognition, Identification, and Verification, Martigny, Switzerland, pp. 189-192, 1994.
[64] T. Kohonen. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Berlin, 1989.
[65] M. Bianchini, P. Frasconi, and M. Gori. Learning in multilayered networks used as autoassociators. IEEE Trans. Neural Networks 6:512-515, 1995.
[66] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybernet. 59:291-294, 1988.
[67] J. L. McClelland and D. E. Rumelhart. Explorations in Parallel Distributed Processing, Vol. 3. MIT Press, Cambridge, MA, 1988.
[68] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Comput. 2:490-501, 1990.
[69] M. Bianchini, M. Gori, and M. Maggini. On the problem of local minima in recurrent neural networks. IEEE Trans. Neural Networks 5:167-177, 1994.
[70] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit rules and learning by example in recurrent networks. IEEE Trans. Knowledge Data Engineering 7:340-346, 1995.
[71] Y. Lee, S. Oh, and M. Kim. The effect of weights on premature saturation in back-propagation learning. In International Joint Conference on Neural Networks, Seattle, Vol. 1, pp. 765-770, 1991.
[72] K. Doya. Bifurcations of recurrent neural networks in gradient descent learning. Connectionist News Neuroprose, 1993.
[73] J. M. McInerney, K. G. Haines, S. Biafore, and R. Hecht-Nielsen. Back propagation error surfaces can have local minima. In International Joint Conference on Neural Networks, Washington, DC, Vol. 2, p. 627. IEEE, New York, 1989.
[74] E. Blum. Approximation of Boolean functions by sigmoidal networks, I: XOR and other two-variable functions. Neural Comput. 1:532-540, 1989.
[75] M. Gori. Apprendimento con supervisione in reti neuronali [Supervised learning in neural networks]. Ph.D. Thesis, Università degli Studi di Bologna, 1990 (in Italian).
[76] Y. Bengio, P. Frasconi, and P. Simard. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5:157-166, 1994.
[77] G. P. Drago and S. Ridella. Statistically controlled activation weight initialization (SCAWI). IEEE Trans. Neural Networks 3:627-631, 1992.
[78] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Comput. 4:234-242, 1992.
[79] M. Gori, M. Maggini, and G. Soda. Scheduling of modular architectures for inductive inference of regular grammars. In Workshop on Combining Symbolic and Connectionist Processing, ECAI '94, Amsterdam, pp. 78-87, 1994.
[80] T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. Technical Report UMIACS-TR-95-78, University of Maryland, 1995.
[81] P. Frasconi, M. Gori, and G. Soda. Local feedback multi-layered networks. Neural Comput. 4:120-130, 1992.
[82] R. Reed. Pruning algorithms - a survey. IEEE Trans. Neural Networks 4:740-747, 1993.
[83] Y. le Cun, J. Denker, and S. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 2, pp. 598-605. Morgan Kaufmann, San Mateo, CA, 1990.
[84] M. Mozer and P. Smolensky. Skeletonization: a technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 1, pp. 107-115. Morgan Kaufmann, San Mateo, CA, 1989.
[85] E. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks 1:188-197, 1990.
[86] B. Hassibi, D. Stork, and G. Wolff. Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems (S. J. Hanson, J. D. Cowan, and C. L. Giles, Eds.), Vol. 5. Morgan Kaufmann, San Mateo, CA, 1992.
[87] Y. Chauvin. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 1, pp. 519-526. Morgan Kaufmann, San Mateo, CA, 1989.
[88] C. Ji, R. Snapp, and D. Psaltis. Generalizing smoothness constraints from discrete samples. Neural Comput. 2:188-197, 1990.
[89] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Comput. 4:473-493, 1992.
[90] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Back-propagation, weight-elimination, and time series prediction. In Connectionist Models Summer School (D. Touretzky, J. Elman, T. Sejnowski, and G. E. Hinton, Eds.), pp. 105-116, 1990.
[91] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination applied to currency exchange rate prediction. In International Joint Conference on Neural Networks, Seattle, Vol. 1, pp. 837-841, 1991.
[92] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems (R. Lippmann, J. Moody, and D. Touretzky, Eds.), Vol. 3, pp. 875-882, 1991.
[93] S. I. Gallant. Three constructive algorithms for network learning. In Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, pp. 652-660. IEEE, New York, 1986.
[94] S. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993.
[95] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA, 1990.
[96] M. Mezard and J. Nadal. Learning in feedforward layered networks: the Tiling algorithm. J. Phys. A 22:2191-2204, 1989.
[97] M. Frean. The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Comput. 2:198-209, 1990.
[98] T. S. Chang and K. A. S. Abdel-Ghaffar. A universal neural net with guaranteed convergence to zero system error. IEEE Trans. Acoust. Speech Signal Process. 40:3022-3030, 1992.
[99] A. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural Comput. 1:39-46, 1989.
[100] R. A. Jacobs. Task decomposition through competition in a modular connectionist architecture. Ph.D. Thesis, Department of Computer and Information Science, University of Massachusetts, 1990.
[101] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Comput. 3:79-87, 1991.
[102] R. A. Jacobs and M. I. Jordan. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6:181-214, 1994.
[103] A. P. Dempster, N. Laird, and D. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39:1-38, 1977.
[104] Machine Learning 7 (special issue), 1991.
[105] Artificial Intelligence 46 (special issue), 1990.
[106] R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning. Wiley, New York, 1990.
[107] N. J. Nilsson. Principles of Artificial Intelligence. Tioga, Palo Alto, CA, 1980.
[108] J. Anderson and E. Rosenfeld (Eds.). Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, 1988.
Orthogonal Transformation Techniques in the Optimization of Feedforward Neural Network Systems

Partha Pratim Kanjilal
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur 721-302, India
I. INTRODUCTION

Orthogonal transformation can be used to identify the dominant modes in any information set, which is the basic idea behind the application of orthogonal transformation techniques to the optimization of neural networks. In this chapter, singular value decomposition (SVD) and different forms of QR with column pivoting factorization are used for the optimization of the size of a feedforward neural network in terms of the optimum number of links and nodes; the prime objective is to improve the representativeness and generalization ability [1] of the model. As corollaries to the application of orthogonal transformation for optimization, studies on (a) the compaction of the process information through orthogonal transformation, followed by the operation of the neural network with the reduced set of transformed data, and (b) the assessment of the convergence of the training process for the network using SVD are also included in this chapter.

In any method of modeling, overparameterization or redundancy in the structure is undesirable [2]. In the case of an oversized neural network, if the training data are not noise-free (which is usually the case), the network will tend to learn
the information along with the noise associated with the data, leading to poor validation results. It is well known that, at any stage of the flow of information, collinearity among the set of information links can lead to identification problems [3-5]; any model should be parsimonious to be representative [2], hence the primary need for optimizing the size of the neural network model. For an optimum representation of the underlying process, there should be parsimony in both the configuration of the problem and the design of the model; that is, the neural network should be fed with the optimum number of inputs, and there should be an optimum number of links and nodes within the network.

There have been some studies concerning the optimization of the size of neural networks [6-8]. The approaches [9-11], largely based on statistical information criteria [12], can be complex and quite computation intensive; furthermore, the imprecision associated with such information criteria (e.g., [13]) may lead to improper modeling. Pruning-based methods [6, 8] often use ad hoc conditions to assess the importance of links within the network. The approaches employing eigenvalue-based optimization [14] can suffer from numerical ill-conditioning problems [15, 16].

In this chapter, direct and robust methods for the optimization of homogeneous as well as nonhomogeneous feedforward neural networks, in terms of the essential links and nodes within the network, are discussed. The three different approaches considered for the optimization of neural networks are based on (i) the singular value decomposition (SVD) [15], (ii) SVD followed by QR with column pivoting (QRcp) factorization, and (iii) the modified QR with column pivoting (m-QRcp) factorization coupled with the Cp statistic for the assessment of optimality [17]. SVD is used for homogeneous network optimization. QRcp factorization (with SVD) and m-QRcp factorization (with Cp) are used for nonhomogeneous network optimization. Both QRcp and m-QRcp factorizations can also be used for the selection of the optimum set of inputs to the network. All the transformations used are numerically robust and can have robust implementations. In all cases, the optimization is performed in the linear sections of the network; the problem is configured as a subset selection problem in each case. Three-layer neural networks with a single hidden layer are considered for the present study.

Three illustrative examples are considered: (i) the Mackey-Glass series, representing the chaotic dynamics of controlled physiological systems [18]; (ii) the nonlinear data series of yearly averaged sunspot numbers [2, 19]; and (iii) the rocket engine testing process [20], which is a multi-input, single-output problem.

The organization of this chapter is as follows. The mathematical background for the orthogonal transformations used is presented in Section II. Section III explains the principles for the optimization of neural networks. The illustrative examples are presented and the results are discussed in Sections IV to VII. The convergence analysis during training using SVD features in Section VIII.
II. MATHEMATICAL BACKGROUND FOR THE TRANSFORMATIONS USED

Orthogonal transformation can be used very effectively for data analysis, modeling, prediction, and filtering [5, 15, 21]. It can be used to convert data sets into relatively decorrelated sets of transform coefficients (or spectral components). The energy in the data, which represents the information content, remains conserved through the transformation, but the distribution of the energy becomes more compact following the transformation. The process of transformation is linear and reversible. Two popular classes of orthogonal transformation are singular value decomposition and QR factorization; the two special forms of QR factorization used here are the QR with column pivoting factorization and the modified QR with column pivoting factorization, both of which can be used for subset selection. The characteristic features of all these transformations are discussed here.
A. SINGULAR VALUE DECOMPOSITION

Singular value decomposition [15] of an m × n matrix A is given by A = UΣV^T, where U = [u_1, ..., u_m] ∈ R^{m×m} and V = [v_1, ..., v_n] ∈ R^{n×n} are orthogonal matrices (i.e., U^T U = UU^T = I, etc.); U^T A V = Σ = [diag(σ_1, ..., σ_p) : 0] ∈ R^{m×n}, where p = min(m, n) and σ_1 ≥ ··· ≥ σ_p ≥ 0. σ_1, ..., σ_p are the singular values of A, which are nonnegative. U and V are the left and the right singular vector matrices, respectively; the left and the right singular vectors form bases for the column space and the row space of A, respectively. The number of nonzero singular values gives the rank of A. In fact, SVD is the most numerically robust and precise method for the determination of the null space of a matrix: the smallest nonzero singular value of A gives the precise 2-norm distance of A from the set of all rank-deficient matrices. The energy contained in A (= {a_ij}) is given by

E = Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij².

Because

A = Σ_{i=1}^{p} u_i σ_i v_i^T,                  (1)
the energy in the ith decomposed mode, u_i σ_i v_i^T, is given by σ_i². If q of the p singular values are dominant, the prime information of A will be contained in

Â = Σ_{i=1}^{q} u_i σ_i v_i^T.
A nearly periodic series {x(k)} of periodicity n can be arranged in the m × n matrix A such that successive n-long segments occupy the successive rows of the matrix; if σ_1 of A is significantly dominant, with σ_1 ≫ σ_2,

A ≈ u_1 σ_1 v_1^T.                  (2)

In such a case, v_1 will represent the periodic pattern, or the normalized distribution of the series over one period, and the successive elements of u_1 σ_1 will represent the scaling factors for the successive periodic segments of {x(k)}. Thus, SVD can be used very effectively for the characterization of periodic or nearly periodic series [5].
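A minimal sketch of this use of SVD (the synthetic series and thresholds are our own):

```python
import numpy as np

# A nearly periodic series of period n, arranged row-wise into A (m x n).
n, m = 11, 19
k = np.arange(m * n)
x = (1 + 0.1 * np.sin(0.05 * k)) * np.sin(2 * np.pi * k / n)
A = x.reshape(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s[:3])                        # sigma_1 >> sigma_2: one dominant mode
pattern = Vt[0]                     # v_1: normalized pattern over one period
scales = U[:, 0] * s[0]             # u_1*sigma_1: per-segment scaling factors
A_hat = np.outer(scales, pattern)   # rank-1 reconstruction, as in Eq. (2)
```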
B. QR FACTORIZATION

QR factorization [15] of a data matrix A = [a_1, ..., a_N] is expressed as A = QR, where the a_i are the m-column vectors, Q = [q_1, ..., q_N] has orthonormal columns (Q^T Q = I), and R is upper triangular. The columns of Q span the same subspace as the columns of A. The number of nonzero diagonal elements R_ii (where i ≤ min(m, N)) of R indicates the rank of A. |R_jj| = 0 implies that the jth column vector of A is redundant, as it has no component in the q_j vector space that is orthogonal to the q_i (i ≠ j) vector spaces.
C. QR WITH COLUMN PIVOTING FACTORIZATION AND SUBSET SELECTION

QR with column pivoting (QRcp) factorization of an m × n matrix A involves pivoting the columns of A in order of maximum Euclidean norm in successive orthogonal directions while QR factorization is performed on the matrix. Subset selection is inherent in the pivoting, or rotation, of the columns within QRcp factorization. The mechanism of the rotation of the columns can be explained as follows [5].
Given any m × n matrix A = [a_1, ..., a_i, ..., a_n], with n m-column vectors, the column vector of A with max(a_i^T a_i) is first selected and swapped with a_1. Let q_1 (= a_1/‖a_1‖) be the unit vector in the direction of a_1. The selected (or rotated) second vector is the one maximizing the norm (a_j − q_1^T a_j q_1)^T (a_j − q_1^T a_j q_1); it is swapped with a_2, and q_2, the corresponding unit vector, is computed. At the ith stage of selection, the rotated vectors (a_j*) are

a_j* = a_j − (q_1^T a_j q_1 + ··· + q_{i−1}^T a_j q_{i−1}),    i = 2 to n,  j = i to n,

and the ith selected vector is the one maximizing a_j*^T a_j*. The subsequent rotation within the QR decomposition is performed with respect to this vector, and so on. The selection is continued for up to r stages, where r [≤ min(m, n)] may be the rank of A or may be prespecified. The sequence of successive selections of the columns of A is registered in the permutation matrix P; AP will have the first r columns of A appearing in order of selection.

If A has q (< p) dominant singular values, with σ_q ≫ σ_{q+1}, for increased numerical stability QRcp factorization may be performed on the q × n matrix W^T instead of A [15], where W consists of the first q columns of V: W = [v_1 v_2 ··· v_q].
If W^T = [W_1 W_2], where W_1 is a q × q matrix, QRcp factorization of W^T will produce the n × n permutation matrix P:

Q^T [W_1  W_2] P = [R_11  R_12],

such that R_11 is upper triangular and Q^T Q = I. The selected subset is given by the first q columns of AP.
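A compact sketch of this SVD-plus-QRcp subset selection, using SciPy's pivoted QR (the tolerance defining a "dominant" singular value is our assumption):

```python
import numpy as np
from scipy.linalg import qr

def qrcp_subset(A, q=None, tol=0.05):
    """Select the q most informative columns of A, roughly following
    Section II.C: SVD to fix q, then QRcp on W^T (first q rows of V^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if q is None:
        q = int(np.sum(s > tol * s[0]))        # number of dominant modes
    _, _, perm = qr(Vt[:q, :], pivoting=True)  # QR with column pivoting
    return perm[:q]                            # selected columns of A

rng = np.random.default_rng(1)
B = rng.standard_normal((100, 3))
A = np.column_stack([B, B @ rng.standard_normal((3, 2))])  # 2 collinear cols
print(qrcp_subset(A))   # three columns spanning the information in A
```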
D. MODIFIED QR WITH COLUMN PIVOTING FACTORIZATION AND SUBSET SELECTION

Modified QR with column pivoting (m-QRcp) factorization [17] can lead to the optimal successive selection of n (< N) regressors in A ∈ R^{m×N}, with respect to the output vector y, in the linear regression problem (3) discussed in Section III.A. A description of the algorithm follows.

First, the column vector a_i (i ≤ N) of A producing maximum correlation with y is detected. This vector is swapped with the first column vector a_1. The so-arranged A is appended by y, forming X = [A|y]. The subsequent columns of A are pivoted as follows. Using the Gram-Schmidt orthogonalization concept [15], if q_1 is the unit vector in the direction of a_1, the portions of a_j (j = 2 to N) and of y in a direction orthogonal to a_1 are given by (a_j − q_1^T a_j q_1) and (y − q_1^T y q_1), respectively; these are referred to as the rotated vectors a_j* and y* with respect to a_1.
At the ith stage of selection, the rotated variable vectors (a_j*) are given by

a_j* = a_j − (q_1^T a_j q_1 + ··· + q_{i−1}^T a_j q_{i−1})    for i = 2 to n, j = i to N,

and the rotated output vector (y*) is given by

y* = y − (q_1^T y q_1 + ··· + q_{i−1}^T y q_{i−1});

the ith selected vector is the one for which a_j* has maximum correlation with the rotated output vector y*. Here, each normalized vector a_j* (i = 2 to n) lies in a plane orthogonal to the subspace spanned by the earlier (i − 1) selected vector spaces. The selection procedure is repeated until n regressors are selected. All the column swappings are recorded in the permutation matrix P.
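A Gram-Schmidt-style sketch of the selection loop just described (didactic rather than the Householder implementation recommended in the remarks below):

```python
import numpy as np

def m_qrcp_select(A, y, n_sel):
    """At each stage, pick the rotated regressor most correlated with the
    rotated output y*, then rotate the remaining columns and y orthogonally
    to the selected direction (Section II.D)."""
    A = A.astype(float).copy()
    y = y.astype(float).copy()
    remaining = list(range(A.shape[1]))
    selected = []
    for _ in range(n_sel):
        def corr(j):
            aj = A[:, j]
            return abs(aj @ y) / (np.linalg.norm(aj) * np.linalg.norm(y) + 1e-12)
        best = max(remaining, key=corr)
        selected.append(best)
        remaining.remove(best)
        q = A[:, best] / np.linalg.norm(A[:, best])
        for j in remaining:
            A[:, j] -= (q @ A[:, j]) * q     # rotated regressors a_j*
        y -= (q @ y) * q                     # rotated output y*
    return selected
```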
E. REMARKS

1. Compared to QRcp factorization, m-QRcp factorization is more appropriate for causal representations, as it takes the output vector into account. However, both methods take into consideration near collinearity in A (i.e., one regressor being a linear function of one or more other regressor vectors) by ascribing lower importance to nearly collinear vectors.
2. Implementation of QRcp factorization and m-QRcp factorization using Householder rotations is more robust than the Gram-Schmidt orthogonalization approach.
3. SVD can indicate the closeness to rank redundancy and hence the number of significant regressors required in a subset selection problem.
4. The present methods of subset selection are all numerically robust and computationally efficient. No explicit parameterizations are necessary for subset selection. Implementations of SVD and QRcp are available [22]. Alternative methods of subset selection are also possible [23, 24].
III. NETWORK-SIZE OPTIMIZATION THROUGH SUBSET SELECTION

A. BASIC PRINCIPLE

Consider the linear modeling problem

y = Aθ,                  (3)

where A = [a_1, ..., a_i, ..., a_N] contains N m-regressor vectors a_i, y is the output vector, and θ is the N-parameter vector. The two prime aspects concerning
optimal modeling are (i) collinearity (among the regressors within A) and (ii) orthogonality of the regressors with respect to y. The collinear regressor(s) in A are redundant. On the other hand, a regressor orthogonal to y may not be redundant (if N ≥ 2), because the relationship between y and the regressors in A within (3) is a group phenomenon. This aspect is discussed further in [5].

An optimal model has to be parsimonious [2]. Parsimony can be achieved through the elimination of redundancy in the model (a) by eliminating the collinear regressors in A and (b) by accommodating only those regressors in A that collectively contain maximum information about the output in some appropriate statistical index sense, such as the minimization of the Cp statistic [25] discussed next. The Cp statistic is given by

C_p = RSS_p / RSS_N − (m − 2p),

where m is the number of data sets, N is the maximum number of regressors, p is the number of regressors constituting the optimal model (1 ≤ p ≤ N), and RSS_i is the residual sum of squared errors with i regressors. In linear modeling, the Cp statistic is often used for the assessment of the optimality of a parsimonious model [26].

The aforementioned concepts are applicable to linear models. In the present study, they are applied to optimize the size of a neural network by applying the methods in those sections of the network that are linear or can be configured to be linear. The optimization is applied to determine (i) which of the candidate inputs to the neural network constitute the best set of inputs, and (ii) which links between the post-hidden layer stage and the subsequent stage should be retained for representative modeling.
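In code, using the chapter's form of the statistic (helper names are ours):

```python
import numpy as np

def rss(A, y, cols):
    """Residual sum of squared errors of the least-squares fit on `cols`."""
    theta, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
    r = y - A[:, cols] @ theta
    return float(r @ r)

def cp_statistic(rss_p, rss_N, m, p):
    """C_p = RSS_p / RSS_N - (m - 2p), as defined above."""
    return rss_p / rss_N - (m - 2 * p)

# Given an ordering `sel` of the regressors (e.g., from m-QRcp), keep the
# subset size p that minimizes C_p:
# cps = [cp_statistic(rss(A, y, sel[:p]), rss(A, y, sel), len(y), p)
#        for p in range(1, len(sel) + 1)]
```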
B. SELECTION OF OPTIMUM SET OF INPUT NODES

Assume that there are n inputs, for each of which there are m data points, together constituting an m × n matrix A. In a classification problem, m is the number of experiments performed or the number of subjects, and n is the number of properties or features. In the case of a multi-input, single- or multi-output process, n is the number of inputs and m is the number of data points available for each input. In the case of a discrete-time causal expression or a time series, n is the number of appropriately time-delayed regressors and m is the number of data points for each regressor. The objective is to determine which of the n inputs carry significant information. The subset selection of A will identify the m × q subset A_1 containing the prime information of A, as discussed in Section II. The subset selection can be
performed by using QRcp factorization of A following SVD of A, to determine the number of dominant modes of A. Alternatively, m-QRcp factorization may be used on X (= [A|y]) and the number of inputs as well as the specific inputs are selected using the minimization of the Cp value.
C. SELECTION OF OPTIMUM NUMBER OF HIDDEN NODES AND LINKS

1. Optimization of Homogeneous Network

In a homogeneous network, all the hidden nodes are connected with all the input nodes, and thus the structure is homogeneous. The optimization is performed as follows. An overparameterized network with a sufficient number of hidden nodes (say, r) is considered. Following crude learning of the network, an m × r matrix B is formed at the post-hidden layer stage, where m is the length of the epoch or the number of iterations. The number of dominant singular values of B will indicate the number of hidden nodes to be retained. The reduced network is retrained to convergence.

2. Optimization of Nonhomogeneous Network

In a nonhomogeneous network, all possible combinations of connections between the input nodes and the hidden nodes are permitted. Because the links with the individual hidden nodes are different (unlike homogeneous networks), it is necessary to identify the specific nodes that are to be retained in the optimized structure. Two different approaches may be considered.

a. Using Singular Value Decomposition Followed by QR with Column Pivoting Factorization

Proceeding the same way as in Section III.C.1, the m × r matrix B is formed, where r is the number of hidden nodes (including unity-gain dummy nodes for direct links between an input node and the output node). The desired number of hidden nodes is ascertained using SVD of B. QRcp factorization is then performed on B to determine the significant columns of B and thus identify the specific nodes to be retained. The desired output y does not feature in this selection process.

b. Using Modified QR with Column Pivoting Factorization Coupled with Cp Statistic Assessment [17]

The reference output y is reverse nonlinearly transformed (with respect to the nonlinearity within the output node) to y'. Following crude learning of the net-
work, the candidate inputs to the output node, together with the transformed vector y', constitute a linear regression problem with y' being the response vector. The matrix B is formed the same way as before (Section III.C.1), and m-QRcp factorization is performed on X (= [B|y']). The columns of B are successively selected from X, and the corresponding Cp index is computed. The selected optimal subset is the one producing the minimum value of Cp, which indicates the desired specific set of links or nodes to be retained. The reduced network is retrained to convergence.

3. Remarks

1. For the same number of hidden nodes, the nonhomogeneous network is expected to incorporate larger degrees of nonlinearity within the network, compared to the homogeneous network. A nonhomogeneous network is a closer structural realization of the Kolmogorov-Gabor polynomial [27] through the neural network.
2. The optimization can be performed after crude training of the network. Experience shows that the optimization can be performed quite early during the training. Other fast crude learning approaches (e.g., [28]) may also be used.
3. Although SVD is the most definitive method for the assessment of the rank of a matrix, QR factorization can also indicate the rank of a matrix [29]. So, to determine the desired number of hidden-layer nodes, QR factorization may be used in place of SVD. Because, in a real-life situation, the distribution of the singular values (or the magnitudes of the diagonal elements of R) of B may not show a significant jump, one should be rather conservative in deciding the desired number of hidden-layer nodes.
4. The method of partial least squares (PLS) [30] offers a closely related approach to subset selection, where a set of mutually orthogonal vectors is determined to explain the output. Apparently, PLS is not as powerful as m-QRcp factorization coupled with Cp assessment [17]. No detailed comparison between PLS and other subset selection methods is available.
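A sketch of the homogeneous-network procedure of Section III.C.1 (the threshold and names are illustrative; per Remark 3, a conservative threshold is advisable):

```python
import numpy as np

def optimal_hidden_nodes(B, tol=0.05):
    """Count the dominant singular values of the post-hidden-layer matrix B
    (m x r: hidden-node outputs over m patterns of a crudely trained net)."""
    s = np.linalg.svd(B, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Outline: (1) crudely train an oversized r-hidden-node network;
# (2) collect B from the hidden-layer outputs; (3) rebuild the network with
# optimal_hidden_nodes(B) hidden nodes and retrain to convergence.
```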
IV. INTRODUCTION TO ILLUSTRATIVE EXAJVIPLES Three complex examples are studied to illustrate the application of the methods of optimization discussed in this chapter. The first two examples are time series depicting nonlinear dynamics of real-life processes, and the third one is on a complex input-output process. In all cases, feedforward neural network architectures are considered, with unity gain input nodes and sigmoidal nonlinearity between 0 and 1 for nonlinear nodes; the learning of the networks is performed using the
61
back-propagation algorithm. Because a single hidden layer can adequately model most nonlinear systems [31], three-layer networks with a single hidden layer are considered. The study covers both homogeneous and nonhomogeneous networks. The optimization of the networks through SVD, through QRcp factorization coupled with SVD, and through m-QRcp factorization coupled with the optimality assessment by the Cp statistic is studied.
V. EXAMPLE 1: MODELING OF THE MACKEY-GLASS SERIES

The Mackey-Glass (MG) equation [18], which models the nonlinear oscillations occurring in physiological processes, is given by

x(k + 1) − x(k) = αx(k − τ)/(1 + x^γ(k − τ)) − βx(k),

with typically α = 0.2, β = 0.1, and γ = 10. For τ = 17 (Fig. 1), the attractor of the series has a fractal dimension of 1.95 [32]. The series can be modeled as

x(k + p) = f(x(k), x(k − T), x(k − 2T), ..., x(k − (N − 1)T)),
where p is the prediction or lead time and N can typically be between 4 and 8 [33]. Here, the values N = 6, p = 6, and T = 17 have been used.

A homogeneous feedforward neural network having 6 input nodes, 11 hidden nodes, and 1 output node (i.e., a 6-11-1 network) is considered to model the MG
Figure 1 Mackey-Glass series (τ = 17).
Figure 2 (a) Homogeneous 6-11-1 network modeling the MG series; (b) reduced 6-3-1 network (o a node, • a node passing data as they are).
series. For all exercises, a 300 × 6 data set is used for training, and the subsequent 200 × 6 data set is used for the validation test; the lead time p is taken to be 6. The network used (Fig. 2a) has all 11 hidden nodes linked with all 6 input nodes and with the output node. The training is performed with the 300 × 6 input data set. Throughout the training, SVD is performed on a 99 × 11 matrix B, a subset of the available 300 × 11 matrix at the post-hidden layer stage (the size of B is not a limitation), to determine the optimum number of hidden nodes. The results (Table I, Fig. 3) show three to four singular values being relatively dominant throughout, so three hidden nodes are considered necessary. Apparently, the selection is possible even at an early stage, with crude convergence. Both the 6-11-1 network and the reduced 6-3-1 network (Fig. 2b) are trained to convergence and validated. The validation root mean square
Table I  Selection of Optimum Number of Hidden Nodes Using Singular Value Decomposition

Number of epochs    Singular values of 99 × 11 matrix B              Number of nodes selected
10                  20.1, 2.7, 0.8, 0.3, 0.2, 0.104, ..., 0.003      3
1000                20.0, 2.4, 1.2, 0.8, 0.4, 0.328, ..., 0.006      3
>
..^...Jl.«.»»''"
.S 10"^
10" 10"
0
Figure 3
10
20
30
40
50 Epochs
60
70
80
90
100 (xlOO)
Distribution of the singular values of B during the training of the 6-11-1 network.
error (RMSE) for the two networks works out to be 0.137 and 0.092, respectively; see Fig. 4.

Remark. The estimation and validation performances are also quite close, which validates the reduction in the size of the network. Further results on this series appear in [34].
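For readers who wish to reproduce the setup, the following sketch generates the MG series and assembles the 6-input, lead-6 patterns (the discretization and constant initial history are our assumptions):

```python
import numpy as np

def mackey_glass(n, tau=17, alpha=0.2, beta=0.1, gamma=10, x0=1.2):
    """Discrete MG recursion used above, with a constant initial history."""
    x = np.full(n + tau, x0)
    for k in range(tau, n + tau - 1):
        x[k + 1] = x[k] + alpha * x[k - tau] / (1 + x[k - tau] ** gamma) \
                        - beta * x[k]
    return x[tau:]

x = mackey_glass(2000)
T, p, N = 17, 6, 6
rows = [np.r_[[x[k - i * T] for i in range(N)], x[k + p]]
        for k in range((N - 1) * T, len(x) - p)]
data = np.array(rows)  # cols 0..5: inputs x(k), ..., x(k-5T); col 6: x(k+p)
```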
Figure 4 Estimation and validation of the MG series using the 6-11-1 network and the 6-3-1 network (— original series, - - 6-11-1 network, — 6-3-1 network).
Figure 5 Yearly averaged series of sunspot numbers (from 1700 to 1987).
VI. EXAMPLE 2: MODELING OF THE SUNSPOT SERIES

The series of yearly averaged sunspot numbers (obtained from the daily observations of more than 50 observatories) has been of great interest to researchers and analysts [19, 35]. Data from the year 1700 are available (see Fig. 5). In the present study, the first 221 data points (over 1700 to 1920) are used for modeling, and data over the next 33 years are used for the validation study.
A. PRINCIPLE OF MODELING A QUASIPERIODIC SERIES

The three basic attributes of a nearly periodic series are the periodicity, the pattern over the periodic segments, and the scaling factor associated with each periodic segment. In the case of a quasiperiodic series, all three features may vary. There are different approaches for modeling a quasiperiodic series like the sunspot series [2, 5, 35]. In the present study, the most dominant periodicity (N) is detected by using the singular value ratio (SVR) spectrum or the periodicity spectrum [5] (see Appendix B). The successive nearly periodic segments of the sunspot series are compressed or expanded to length N as follows. Let the objective be to replace y(1), ..., y(N*) by the set x(1), ..., x(N), where

$$x(j) = y(j^*) + \bigl[\, y(j^*+1) - y(j^*)\,\bigr](r_j - j^*), \qquad r_j = (j-1)(N^*-1)/(N-1) + 1, \qquad (4)$$
and j* is the integer part of r_j. The transformed pseudo-segments are arranged in consecutive rows of the data matrix X. The modeling proceeds as follows. An m × N data window A(K) is assumed moving over X, thus tracking the dynamic variations in the data. A(K) is singular value decomposed. If σ_1² ≫ σ_2², most of the information energy will be contained in the most dominant mode u_1σ_1v_1^T = z(K)v_1^T, where z = [z_1, ..., z_i, ..., z_m]^T. So a sensible approach is to model the sequence of elements within z and to use the model to produce the one-step-ahead prediction z_(m+1|m), which can lead to the one-(pseudo)period-ahead prediction z_(m+1|m)v_1^T, the assumption being that the pattern v_1^T remains unaltered over the predicted segment. Similarly, the p-period-ahead prediction will be given by z_(m+p|m)v_1^T. In the present case, both homogeneous and nonhomogeneous neural networks are used for the modeling of the {z_i(K)} series.
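A sketch of the segment normalization of Eq. (4); the function name and the handling of the final sample are choices of this sketch, not of the chapter.

```python
import numpy as np

def normalize_segment(y, N):
    """Compress/expand a pseudo-periodic segment y(1..N*) to length N using
    the linear interpolation of Eq. (4)."""
    n_star = len(y)
    x = np.empty(N)
    for j in range(1, N + 1):
        r = (j - 1) * (n_star - 1) / (N - 1) + 1    # r_j
        j_star = int(r)                             # integer part of r_j
        if j_star >= n_star:                        # last point: no interpolation
            x[j - 1] = y[-1]
        else:
            x[j - 1] = y[j_star - 1] + (y[j_star] - y[j_star - 1]) * (r - j_star)
    return x
```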
B. SUNSPOT SERIES MODEL

The occurrence of peaks at row lengths of 11 and its multiples in the SVR spectrum (Fig. 6) of the sunspot series shows that the prime periodicity is 11 (years). The data series shows 19 apparent periodic segments over the first 221 data points. So the data set is transformed into 19 periodic segments, each of length 11, using (4). The transformed data are arranged into a 19 × 11 matrix X, where successive periodic segments occupy the successive rows of the matrix.
Figure 6 SVR spectrum or the periodicity spectrum of the sunspot series (σ_1/σ_2 versus row length).
Here a 4 × 11 data window A(K) is considered moving over X, where K = 4, ..., 19. For each K, SVD is performed on A(K), and the vector z(K) (= [z_1 z_2 z_3 z_4]^T) is obtained. The most dominant decomposition component is found to be sufficiently strong (with σ_1²/σ_2² > 16) to justify approximation of A(K) by the most dominant mode z v_1^T. For neural network modeling, z_1(K), z_2(K), and z_3(K) are used as the inputs and z_4(K) is used as the output. The modeling exercises are detailed next.

1. Homogeneous Network

Because the number of training pattern sets is 16 (considering one pattern for each value of K), initially a 3-15-1 homogeneous network is considered, as shown in Fig. 7a. During the course of the training, SVD of the data matrix B ∈ R^(16×15) at the post-hidden-layer stage is performed; because five singular values work out to be relatively dominant, five hidden nodes appear to be necessary. The normalized values of the dominant singular values of B are shown in Table II. The reduced 3-5-1 network (Fig. 7b) is retrained to convergence, and one-period-ahead prediction is computed for three successive periods using both the 3-15-1 and the 3-5-1 networks.
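A sketch of the moving-window decomposition just described, extracting z(K) and v_1 for each window position; note that SVD leaves the joint sign of u_1 and v_1 ambiguous, so a practical implementation would also fix a sign convention across windows (not shown here).

```python
import numpy as np

def window_modes(X, m=4):
    """Slide an m x N window down the rows of X; for each position K return
    the vector z(K) = u1*sigma1 (segment scalings) and the pattern v1 of the
    dominant SVD mode, so that A(K) is approximated by z(K) v1^T."""
    zs, vs = [], []
    for K in range(m, X.shape[0] + 1):
        A = X[K - m:K, :]
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        zs.append(U[:, 0] * s[0])       # z(K)
        vs.append(Vt[0])                # v1: common (pseudo)periodic pattern
    return np.array(zs), np.array(vs)
```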
Figure 7 (a) 3-15-1 homogeneous network modeling {z_i} in the sunspot series model (z v_1^T); (b) reduced 3-5-1 network.
Table II  Normalized Singular Values of B for 3-15-1 Homogeneous Network (Sunspot Series)

Iterations | Normalized singular values
500        | 1, 0.17, 4.85E−2, 3.23E−3, 9.68E−4, 3.02E−4, ...
5,000      | 1, 0.17, 8.81E−2, 2.93E−2, 1.15E−2, 9.66E−4, ...
10,000     | 1, 0.21, 0.16, 8.53E−2, 1.30E−2, 4.66E−3, ...
20,000     | 1, 0.43, 0.25, 0.19, 0.10, 2.83E−2, ...
30,000     | 1, 0.25, 0.17, 0.13, 8.6E−2, 4.91E−2, ...
The learning curves for the two networks shown in Fig. 8 and the prediction performances shown in Fig. 9 display close conformity between the 3-15-1 network and the reduced 3-5-1 network.

2. Nonhomogeneous Network

A nonhomogeneous 3-10-1 network is used, where all possible combinations of links between the input and the output layer through the hidden layer are considered. The network structure is shown in Fig. 10a; let the hidden nodes be numbered sequentially from 1 (at the top) to 10. Nodes 8, 9, and 10 connect input nodes directly with the output node. At different stages during the training, SVD followed by QRcp factorization-based subset selection is performed on the matrix B ∈ R^(16×10). Five singular values being significant, five (specific) hidden nodes are selected; the selection is seen to be fairly consistent, as shown in Table III.
Figure 8 Learning curves (output error versus iteration number) for the 3-15-1 and the reduced 3-5-1 homogeneous networks.
Figure 9 1- to 11-year-ahead prediction of the sunspot series over 1921 to 1953 using homogeneous networks (actual; SVD-based 3-15-1 homogeneous network; SVD-based 3-5-1 homogeneous network).
The reduced network is retrained, and one-period-ahead predictions are computed over three successive pseudo-periods (without retraining of the network).
Figure 10 (a) Nonhomogeneous 3-10-1 network modeling of the sunspot series; (b) reduced 3-5-1 network.
Table III  Selection of Hidden Nodes of 3-10-1 Nonhomogeneous Network (Sunspot Series)

Iterations | Nodes selected (Fig. 10a)
10,000     | 9, 4, 10, 8, 6
20,000     | 9, 10, 4, 5, 6
30,000     | 9, 10, 5, 4, 6
40,000     | 9, 5, 10, 4, 6
50,000     | 9, 5, 4, 10, 6
The prediction error (in terms of the mean square error per sample) for the nonhomogeneous and the reduced homogeneous (3-5-1) networks is 85.96% and 119.33%, respectively, of that obtained for the homogeneous 3-15-1 network. Thus, the nonhomogeneous network appears to offer the best modeling strategy; the performance of the homogeneous 3-5-1 network is also comparable to that of the much larger 3-15-1 homogeneous network.

3. Remarks

1. Even though the underlying assumption of v_1(K) remaining the same for the predicted period is only approximately true for the quasiperiodic sunspot series, the periodic prediction performance is reasonably good. The main reasons are (a) the capability of SVD to extract the prime repetitive feature, when the data matrix is suitably configured to accommodate the repetitive structure in the signal, and (b) the strength of neural network modeling. Note that relatively longer steps-ahead prediction has been possible through the present method compared to alternative methods [2, 10, 35]; hence the ability to recognize a greater degree of determinism in the series.

2. Here, the neural network operates with orthogonally transformed data, enabling substantial reduction in the size of the network irrespective of the complex nature of the series.

3. From a numerical point of view, SVD is one of the most robust orthogonal transformations. Hence, the use of SVD with or without QRcp factorization is expected to be much more robust compared to eigenvalue-based approaches [16].

4. The present method of modeling [36] through the determination of the prime periodic component in a quasiperiodic series is worth noting. This, together with the fact that rank-one approximation in terms of u_1σ_1v_1^T is used in the modeling, makes the method relatively immune to noise contamination. Further, the complete left and right singular vector matrices need not be stored, which adds to the computational advantage.
Table IVa  Selection Based on m-QRcp Factorization and Cp Statistic in Rocket Engine Testing Problem

Epoch | m-QRcp selection           | Cp values                          | Number of nodes selected
4000  | 10, 12, 11, 15, 14, 5, ... | 34.5, 4.6, 1.5, 2.7, 3.7, 4.6, ... | 3 (Cp = 1.5)
6000  | 10, 12, 11, 15, 5, 7, ...  | 36.0, 6.3, 1.6, 2.9, 4.7, 7.4, ... | 3 (Cp = 1.6)
VII. EXAMPLE 3: MODELING OF THE ROCKET ENGINE TESTING PROBLEM

This is a widely studied problem [20, p. 380; 37]. Here, the chamber pressure y is the output, which can be expressed as

$$y = f(x_1, x_2, x_3, x_4),$$

where x_1 is the temperature of the cycle, x_2 is the vibration, x_3 is the drop (shock), and x_4 is the static fire. Altogether 24 sets of data are available, out of which the first 19 sets are used for modeling and the rest are left for validation. The problem is modeled using a 4-15-1 network with exhaustive choice of links between the layers of the network, as shown in Fig. 11a. The network is trained with a 19 × 4 data set, and at different stages of training m-QRcp factorization coupled with Cp statistic-based subset selection is performed on the matrix X = [B | y], where B ∈ R^(19×15) is formed from the data at the post-hidden-layer stage.
Table IVb  Selection Based on QRcp Factorization and SVD in Rocket Engine Testing Problem

Epoch | QRcp selection                | Singular values                                      | Number of nodes selected
4000  | 4, 11, 14, 2, 5, 6, 7, 8, ... | 8.89, 0.66, 0.29, 0.20, 0.05, 0.05, 0.01, 0.009, ... | 4
6000  | 4, 11, 2, 14, 5, 6, 7, 8, ... | 8.9, 0.66, 0.34, 0.21, 0.07, 0.06, 0.01, 0.003, ...  | 4
Figure 11 (a) Overparameterized 4-15-1 nonhomogeneous network; (b) reduced 4-3-1 network obtained through m-QRcp and Cp statistic; (c) reduced 4-4-1 network obtained through SVD and QRcp.
The selection of modes and the Cp values for two different stages during training are shown in Table IVa. The Cp statistic shows a distinct minimum for the three successively selected hidden nodes marked 10, 12, and 11; the reduced 4-3-1 network (having three nonlinear nodes with three inputs) is shown in Fig. 11b. The training and the validation performance are shown in Fig. 12.
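The following sketch illustrates the underlying selection idea using ordinary QRcp (via SciPy's pivoted QR) and Mallows' Cp; the chapter's m-QRcp variant, which also folds knowledge of the output into the factorization itself, is not reproduced here.

```python
import numpy as np
from scipy.linalg import qr

def qrcp_cp_selection(B, y):
    """Rank candidate hidden nodes by QR with column pivoting, then score each
    nested subset with Mallows' Cp = SSE_p/s^2 - (n - 2p), where s^2 is the
    residual variance of the full least-squares fit of y on B."""
    n, k = B.shape
    _, _, piv = qr(B, pivoting=True)          # columns in decreasing relevance
    resid_full = y - B @ np.linalg.lstsq(B, y, rcond=None)[0]
    s2 = resid_full @ resid_full / (n - k)
    cps = []
    for p in range(1, k + 1):
        Bp = B[:, piv[:p]]
        r = y - Bp @ np.linalg.lstsq(Bp, y, rcond=None)[0]
        cps.append(r @ r / s2 - (n - 2 * p))
    best = int(np.argmin(cps)) + 1            # subset size minimizing Cp
    return piv[:best], np.array(cps)
```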
Figure 12 Estimation and validation performance of the chamber pressure in the rocket engine problem (— actual data, · · · estimation/validation).
The exercise of optimizing the network is repeated with QRcp factorization and SVD performed on the matrix B during the training of the network. The distribution of the singular values and the respective selections of the hidden-layer nodes are shown in Table IVb. Because four singular values are relatively dominant, four hidden nodes are required for the reduced network shown in Fig. 11c. The performances of the original and the reduced networks are shown in Table V.

Remarks. 1. The detection of optimality through minimization of the Cp statistic coupled with m-QRcp factorization is seen to be conclusively distinct (see Fig. 13); further, knowledge of the output is inherently taken into consideration in m-QRcp factorization, hence the relative superiority of this approach. The distribution of the singular values may not show decisive jumps, so it may be somewhat difficult to decide the number of nodes to be selected with QRcp factorization.
Table V  Comparative Validation Performance for the Rocket Engine Problem

        | Exhaustive nonhomogeneous model | QRcp-SVD-based modeling   | m-QRcp-Cp-based modeling
Network | 4-15-1 (16 nodes, 47 links)     | 4-4-1 (5 nodes, 12 links) | 4-3-1 (4 nodes, 11 links)
RMSE    | 2.996                           | 2.113                     | 1.944
Figure 13 Profile of the Cp statistic (—) and the singular values (—) of X versus model order.
2. In the exhaustive nonhomogeneous network (Fig. 11a), direct links between the inputs and the output have not been considered because of the limited amount of data.
VIII. ASSESSMENT OF CONVERGENCE IN TRAINING USING SINGULAR VALUE DECOMPOSITION

The convergence in the training of a neural network is usually assessed in terms of the output error remaining almost unchanged at a low value. If an m × n input data set is used for training, the output error for m different sets of input has to be studied. SVD offers an alternative method for convergence assessment through the rank-oneness assessment of the output matrix over several epochs (or iterations). Training through one m-long epoch implies m network-weight updates, which generate an m-output vector, and g epochs will produce an m × g matrix Y_g at the output. At true convergence, all the columns of Y_g should be identical to y_R, the corresponding reference output vector. On the other hand, as long as the training is not complete, the columns of Y_g will keep changing. So, the degree of convergence can be assessed from the closeness of Y_g to rank-oneness. Let the SVD of Y_g be performed. The ratio of the energy contained in the most dominant decomposed mode u_1σ_1v_1^T and the total reference output energy is given by

$$c = \sigma_1^2 / (g\, E_R),$$
where σ_1 is the largest singular value of Y_g and E_R = y_R^T y_R is the energy of the reference output vector. Ideally, at convergence c = 1, so the percentage of residual energy at convergence can be defined as κ = 1 − c.

Remarks. 1. κ will be insensitive to a local minimum if g is large enough to cover any such minima.

2. The output of the network and the reference data sets are mean extracted before computing κ to make σ_1 insensitive to the mean value for nonzero-mean data.

EXAMPLE (Convergence in training for the Mackey-Glass series). Consider the 6-3-1 homogeneous neural network model of the Mackey-Glass series (Section V). At different stages of training, the output matrix Y_g is formed with g = 200, the epoch length being 209. As shown in Fig. 14, the progression of learning depicted by the κ profile conforms to that shown by the output error profile.
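A sketch of the index computation as reconstructed from the text; the exact normalization of c (here σ_1² divided by g times the reference energy) is our reading of the definition above and should be treated as an assumption.

```python
import numpy as np

def residual_energy_index(Yg, y_ref):
    """SVD-based convergence index kappa = 1 - c, where c compares the energy
    in the dominant mode of the m x g output matrix Yg with the total
    reference output energy.  Mean extraction follows the remarks above."""
    g = Yg.shape[1]
    Yg = Yg - Yg.mean(axis=0)           # mean-extract outputs
    yr = y_ref - y_ref.mean()           # mean-extract reference
    s = np.linalg.svd(Yg, compute_uv=False)
    c = s[0] ** 2 / (g * (yr @ yr))
    return 1.0 - c
```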
Figure 14 Assessment of the convergence during training of the 6-3-1 network modeling the Mackey-Glass series: (a) profile of the mean square output error; (b) profile of the SVD-based index κ.
IX. CONCLUSIONS

It has been shown that orthogonal transformation through singular value decomposition and various forms of QR with column pivoting factorization can offer robust and efficient methods for the optimization of feedforward neural networks. The optimization of homogeneous networks is much simpler than that of nonhomogeneous networks, although the latter attract more interest as they can have a larger density of nonlinearity. Here, the sensible objective is to produce meaningful optimization of the network such that the possibility of learning representative information about the underlying process from the available data is enhanced. It is not necessary to expect a unique solution in terms of the optimized structure, because the neural network is inherently nonlinear and hence many solutions may produce close results. Orthogonal transformation can lead to meaningful optimization of neural networks with relatively less computational effort, irrespective of the problems of collinearity within the data, noise associated with the data, or uncertainty concerning the available knowledge of the process.
APPENDIX A: CONFIGURATION OF A SERIES WITH NEARLY REPEATING PERIODICITY FOR SINGULAR VALUE DECOMPOSITION-BASED ANALYSIS

Consider a process or series {x(·)} = {x(1), x(2), ...}. The successive n-long segments of the series can be arranged in a matrix X such that the successive segments occupy successive rows of the matrix as follows:

$$X = \begin{bmatrix} x(1) & x(2) & \cdots & x(n) \\ x(n+1) & x(n+2) & \cdots & x(2n) \\ \vdots & \vdots & & \vdots \\ x((m-1)n+1) & x((m-1)n+2) & \cdots & x(mn) \end{bmatrix}.$$

The SVD of the m × n matrix X is given by X = UΣV^T = ZV^T, where Z = UΣ. If the series is strictly or nearly periodic with fixed periodicity of n, and if the periodic segments have the same pattern, irrespective of the scaling over the successive segments, Rank(X) will be 1, and only σ_1, the first singular value of X, will be nonzero, whereas all other singular values will be zero.
Consider two other possibilities: (i) If the successive apparently periodic segments of the series have the same period length but almost, yet not exactly, similar patterns over the successive segments, Rank(X) will be > 1, where the closeness of X to rank-oneness will be given by σ_1/σ_2. (ii) If the series has an apparently repetitive pattern but the successive segments are of different period lengths, the series may be arranged in X as follows. First the prime periodicity (say n) in the data series is determined using the SVR spectrum (see Appendix B); next, the successive nearly repetitive segments are identified (say in terms of relatively regularly occurring features like the peaks or the valleys), and these (pseudo-periodic) segments are compressed or expanded in time (using (4)) to the same period length (n) and arranged in successive rows of the matrix X having the row length of n. If the SVD of X shows one singular value to be significantly dominant, X can be expressed as in (2).
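A minimal sketch of the arrangement above and of the σ_1/σ_2 closeness measure of possibility (i).

```python
import numpy as np

def periodicity_matrix(x, n):
    """Arrange successive n-long segments of {x(k)} into the rows of X."""
    m = len(x) // n
    return np.asarray(x[:m * n]).reshape(m, n)

def rank_one_closeness(X):
    """sigma1/sigma2 of X: large values indicate a nearly rank-one X, i.e.,
    nearly repeating segments sharing a common pattern."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s[1] if len(s) > 1 and s[1] > 0 else np.inf
```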
APPENDIX B: SINGULAR VALUE RATIO SPECTRUM

The singular value ratio (SVR) spectrum offers a unique way of detecting the presence and the periodicity of a dominant periodic component (which need not be sinusoidal) in any composite signal or data sequence {x(k)}. The concept of the SVR spectrum can be briefly stated as follows. Let the series {x(k)} be arranged into a matrix X having row length of n, as shown in Appendix A. If {x(k)} is strictly periodic with period length N, σ_1/σ_2 of X will be infinity if n = lN, where l is a positive integer. If l is a noninteger or if {x(k)} deviates from periodicity, σ_1/σ_2 will decrease. For a random series σ_1/σ_2 can be as low as 1. Hence, if the data matrices X(n) are formed with varying row length n, the corresponding pattern of σ_1/σ_2 of X(n) will show peaks at the values of n for which there is a dominant periodic component of period length n or any of its multiples present in {x(k)}. The σ_1/σ_2 values may be filtered such that the peaks in the profile are pronounced. Further discussions on the SVR spectrum and its applications appear in [5, 38].
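A sketch of the SVR spectrum computation; the filtering of the σ_1/σ_2 profile mentioned above is omitted.

```python
import numpy as np

def svr_spectrum(x, n_max):
    """Singular value ratio spectrum: sigma1/sigma2 of X(n) for row lengths
    n = 2..n_max; peaks flag a dominant periodic component of length n."""
    ratios = {}
    for n in range(2, n_max + 1):
        m = len(x) // n
        if m < 2:                      # need at least two rows
            break
        X = np.asarray(x[:m * n]).reshape(m, n)
        s = np.linalg.svd(X, compute_uv=False)
        ratios[n] = s[0] / s[1] if s[1] > 0 else np.inf
    return ratios
```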
REFERENCES

[1] D. Sarkar. Randomness in generalization ability: a source to improve it. IEEE Trans. Neural Networks 7:676–685, 1996.
[2] G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, 1976.
[3] G. W. Stewart. Collinearity and least squares regression. Statist. Sci. 2:68–100, 1987.
[4] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics, Identifying Influential Data and Sources of Collinearity. Wiley, New York, 1980.
[5] P. P. Kanjilal. Adaptive Prediction and Predictive Control. IEE Control Engrg. Ser., No. 52. Peter Peregrinus, Stevenage, 1995.
[6] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 2. Morgan Kaufmann, San Mateo, CA, 1990.
[7] R. Reed. Pruning algorithms — a survey. IEEE Trans. Neural Networks 4:740–747, 1993.
[8] A. Levin, T. K. Leen, and J. E. Moody. Fast pruning using principal components. In Advances in Neural Information Processing Systems (J. D. Cowan, G. Tesauro, and J. Alspector, Eds.), Vol. 6. Morgan Kaufmann, San Mateo, CA, 1994.
[9] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion — determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks 5:865–871, 1994.
[10] M. Cottrell, B. Girard, Y. Girard, M. Mangeas, and C. Muller. Neural modelling for time series: a statistical stepwise method for weight elimination. IEEE Trans. Neural Networks 6:1355–1363, 1995.
[11] D. B. Fogel. An information criterion for optimal neural network selection. IEEE Trans. Neural Networks 2:490–497, 1991.
[12] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automat. Control 19:716–723, 1974.
[13] E. J. Hannan and B. G. Quinn. The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41:190–195, 1979.
[14] S. J. Hanson and L. Pratt. A comparison of different biases for minimal network construction with back-propagation. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 1, pp. 177–185, 1989.
[15] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[16] A. J. Laub. Numerical linear algebra aspects of control design computations. IEEE Trans. Automat. Control 30:727–764, 1985.
[17] P. P. Kanjilal, G. Saha, and T. J. Koickal. On robust nonlinear modelling of a complex process with large number of inputs using m-QRcp factorization and Cp statistic. IEEE Trans. Systems Man Cybernet., 1997, to appear.
[18] M. C. Mackey and L. Glass. Oscillations and chaos in physiological control systems. Science 197:287–289, 1977.
[19] N. O. Weiss. Periodicity and aperiodicity in solar magnetic activity. Philos. Trans. Roy. Soc. London Ser. A 330:617–625, 1990.
[20] N. Draper and H. Smith. Applied Regression Analysis, 2nd ed. Wiley, New York, 1981.
[21] F. Deprettere, Ed. SVD and Signal Processing, Algorithms, Applications and Architectures. North-Holland, Amsterdam, 1988.
[22] MATLAB matrix software. The MathWorks, Inc., Sherborn, MA.
[23] S. Chen, S. A. Billings, and W. Luo. Orthogonal least squares methods and their application to non-linear system identification. Internat. J. Control 50:1873–1896, 1989.
[24] S. V. Huffel and J. Vandewalle. The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, 1991.
[25] C. L. Mallows. Some comments on Cp. Technometrics 15:661–675, 1973.
[26] C. Daniel and F. S. Wood. Fitting Equations to Data, 2nd ed. Wiley, New York, 1980.
[27] A. G. Ivakhnenko. Past, present, and future of GMDH. In Self-Organizing Methods in Modelling (S. J. Farlow, Ed.), pp. 105–119. Marcel Dekker, New York, 1984.
[28] F. Biegler-König and F. Bärmann. A learning algorithm for multilayer neural networks based on linear least squares problems. Neural Networks 6:127–131, 1993.
[29] T. F. Chan. Rank revealing QR factorizations. Linear Algebra Appl. 88/89:67–82, 1987.
[30] T. R. Holcomb and M. Morari. PLS/neural networks. Comput. Chem. Engrg. 16:393–411, 1992.
[31] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2:303–314, 1989.
[32] J. D. Farmer. Chaotic attractors of an infinite-dimensional dynamical system. Physica D 4:366–393, 1982.
[33] A. Lapedes and R. Farber. Nonlinear signal processing using neural networks. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, 1987.
[34] P. P. Kanjilal and D. N. Banerjee. On the application of orthogonal transformation for the design and analysis of feedforward networks. IEEE Trans. Neural Networks 6:1061–1070, 1995.
[35] M. Casdagli. Chaos and deterministic versus stochastic nonlinear modelling. J. Roy. Statist. Soc. Ser. B 54:303–328, 1992.
[36] P. P. Kanjilal and S. Palit. Modelling and prediction of time series using singular value decomposition and neural networks. Comput. Electric. Engrg. 21:299–309, 1995.
[37] A. Desrochers and S. Mohseni. On determining the structure of a nonlinear system. Internat. J. Control 40:922–938, 1984.
[38] P. P. Kanjilal, S. Palit, and G. Saha. Fetal ECG extraction from single-channel maternal ECG using singular value decomposition. IEEE Trans. Biomed. Engrg. 44:51–59, 1997.
Sequential Constructive Techniques
Marco Muselli Istituto per i Circuiti Elettronici Consiglio Nazionale delle Ricerche 16149 Genoa, Italy
I. INTRODUCTION

The theoretical and practical problems deriving from the application of the back-propagation algorithm have led to the introduction of a new class of learning techniques, called sequential constructive methods, that allows the treatment of training sets containing several thousands of samples. The computational cost of these algorithms is kept low by adopting two independent methodologies: first, the neural network is constructed in an incremental way by subsequently adding units to the hidden layer. With this approach the learning process does not require the simultaneous updating of the whole weight matrix (as in the back-propagation algorithm), but only the modification of a small portion of the network. Second, the size of the training set employed for the addition of a new hidden neuron decreases in the course of the algorithm, thus allowing a further increase of the convergence speed.

Unfortunately, these interesting features have not yet yielded a wide application of the sequential constructive methods to the solution of real-world problems.
A reason for this could be the lack of a detailed description that presents the general approach of these algorithms along with the specific implementation procedures adopted. Furthermore, for a correct analysis of the sequential constructive techniques it is necessary to perform a series of comparative tests showing both the properties of the resulting neural networks and the actual reduction of the computational cost.

This chapter may be a first step in that direction; it is subdivided into three distinct parts. In the first part the theoretical and practical problems involved in the application of the back-propagation algorithm are analyzed (Section II) and the solutions adopted by sequential constructive methods to overcome these obstacles are pointed out (Section III). The general (Section IV) and specific (Sections V and VI) approaches employed by these algorithms form the subject of the second part of this chapter. The main theoretical results (along with the relative proofs) and the implementation aspects are described here in great detail. The results obtained through the application of sequential constructive methods to several experimental tests are contained in the third part (Section VII); the comparison with the back-propagation algorithm allows an objective evaluation of the performance offered by these training techniques.
II. PROBLEMS IN TRAINING WITH BACK PROPAGATION

The most widely used method for supervised training of neural networks is certainly the back-propagation algorithm [1–5]. Its ease of implementation is one of the main reasons for this wide diffusion and makes back propagation a flexible tool for the solution of many problems belonging to a variety of application fields [6–10]. Its ability to obtain an optimal or near-optimal configuration for a given training set has been further increased by the introduction of appropriate methodologies that accelerate the convergence of the procedure [11–16]. However, these improvements leave unchanged the basic kernel of the method, which performs a minimization of the error made by the current neural network on the given training set.

In the original version, still widely employed, the back-propagation algorithm searches for the global minimum in the weight space by applying the method of steepest descent [17]. Although the implementation of this optimization technique is straightforward, its convergence properties are poor, because it can be prematurely stopped by a flat region or a local minimum of the cost function. Better
results can be obtained by making some changes that improve the reliability of the method. First of all, it is necessary to obtain a good initial point from which to begin the search. This is a crucial step because the algorithm of steepest descent is basically an optimization method that pursues the local minimum closest to the initial point. Procedures for approaching this problem can be found in the literature [18, 19] and have a great influence on the behavior of back propagation.

Also the updating rules for the current point in the search have been the subject of several modification proposals, not always supported by precise theoretical motivations. Most of them try to adapt the search trajectory to the behavior of the cost function so as to avoid getting stuck in an unsatisfactory local minimum [13, 14, 20, 21].

Finally, the expression of the cost function to minimize plays an important role in the determination of the convergence properties of the back-propagation algorithm. Usually it contains a measure of the error made by the current neural network on the given training set, together with other optional quantities that attempt to improve the generalization ability of the final configuration. An important contribution in this direction is offered by regularization theory [22, 23], which has been successfully applied to the training of connectionist models. In particular, the method of weight decay [15, 24] has allowed the achievement of interesting results in the treatment of real-world problems.

These and other techniques, which we omit for the sake of brevity, have led to refined versions of the back-propagation algorithm that are especially suited for application to specific fields. As an example, consider the problem of handwritten character recognition: the introduction of appropriate methodologies has produced neural networks with increasing generalization ability, very close to that presented by the human brain [8, 25, 26].

In spite of these promising results, however, there are some difficulties, both theoretical and practical, that thwart the employment of the back-propagation algorithm, particularly when the dimension of the input space or the size of the training set increases. In the next sections we shall analyze in detail the following two important problems:

• The network architecture must be fixed a priori.
• Optimal solutions cannot be obtained in polynomial time.
A. NETWORK ARCHITECTURE MUST BE FIXED A PRIORI
The back-propagation algorithm provides the weight matrix for a feedforward neural network with fixed architecture: the number of hidden layers and the number of neurons in each layer must therefore be chosen beforehand. Let us denote by g: R^n → R^m the input-output transformation performed by the final multilayer perceptron when the training process is completed. The integers n and m then correspond to the dimension of the input and the output space, respectively.

It can be shown that the network architecture determines the complexity of the function g that can be realized. In fact, there are important theorems that assert the general validity of connectionist models: every (Borel) measurable function can be approximated to within an arbitrary precision by a feedforward neural network with a single hidden layer containing a sufficient number of units [27–29]. Unfortunately, the proofs of these theorems are not constructive and the choice of the network architecture is a matter of trial and error in most practical cases.

Let us denote by f the unknown function that has generated (eventually in the presence of noise) the samples contained in the given training set. It should be pointed out that the number of weights plays a fundamental role in the determination of the generalization ability that the final neural network exhibits in a specific application. If this number is too small, then many input-output relations contained in the given training set will not be satisfied; thus, the corresponding transformation g is a poor approximation of the unknown function f [30]. On the other hand, if the number of weights in the neural network is too high, an overfitting of the available data will occur with great probability; consequently, the generalization ability of our connectionist model will be low even if the error on the given training set is close to zero. In general, this means that the resulting neural network has memorized the available samples without extracting sufficient information on the underlying input-output function f.

A quantitative analysis of this phenomenon has been the subject of an important series of papers in the fields of mathematical statistics and machine learning [31–35]. In particular, a proper quantity, called the Vapnik-Chervonenkis (VC) dimension, has been defined, which measures the complexity of the trainable model. Unfortunately, a direct determination of the VC dimension for a given neural network is very difficult even when the number n of inputs is small [36]. Furthermore, although such an analysis has great theoretical relevance, the resulting relations give unusable values in most practical situations, because they refer to a worst-case study. For this reason, an estimate of the VC dimension obtained by applying simplified hypotheses [34] does not allow for an efficient forecast of the number of hidden units to employ.

Other approaches have been proposed to achieve, in a theoretical way, alternative measures of the complexity of the connectionist model [37, 38]. Nevertheless, at present the optimal neural network architecture for a given real-world problem is mainly obtained through the application of heuristic rules and the execution of subsequent trials following a cross-validation procedure [39, 40]. This generally requires a high computing time, which increases rapidly with the dimension of the input space or the size of the training set.
B. OPTIMAL SOLUTIONS CANNOT BE OBTAINED IN POLYNOMIAL TIME
The choice of the configuration for the neural network to be trained is not the only problem inherent in the application of back propagation. In fact, there are some basic theoretical drawbacks that arise even when the architecture considered is very simple. In particular, it has been shown that the task of deciding if a given training set can be entirely satisfied by a given multilayer feedforward neural network is NP-complete [41, 42]. This result prevents us from obtaining optimal solutions in a reasonable time even for small values of the number n of inputs.

This limitation is closely related to the definition of learnability in the field of machine learning [43]. In short, a problem is called learnable if there is an algorithm, having polynomial computing time in its fundamental variables (number of inputs, complexity of the function f, etc.), which is able to find a satisfying approximation g to the unknown function f. Because the task of training neural networks with fixed architecture is NP-complete, we cannot use back propagation to establish the learnability of a practical problem.

This theoretical drawback is emphasized by the technical difficulties encountered in the application of the method. As previously noted, the search for the optimal weight matrix involves the minimization of a proper cost function often containing many flat areas and local minima, which can create problems for many optimization methods. Thus, it can be convenient to study different training algorithms for neural networks that try to avoid these theoretical and practical problems. A proposal in this direction is offered by the class of constructive methods, which forms the subject of the following section.
III. CONSTRUCTIVE TRAINING METHODS

The theoretical limitations involved in the application of the back-propagation algorithm have given rise to several alternative proposals, which can be subdivided into two classes: pruning methods and constructive techniques. The former have the aim of achieving the neural network with minimal complexity for the solution of a given problem, rather than accelerating the training process. In fact, a multilayer perceptron containing a smaller number of weights generally has a lower VC dimension and consequently presents a better generalization ability for a given training set.

To this end, pruning methods implement the following approach: at first a larger network, containing a higher number of hidden units than necessary, is trained (by using some learning algorithm). Then the application of proper techniques [44–46] allows the location and removal of the connections (and eventually
the neurons) that have a negligible influence on the behavior of the input-output transformation g. It should be pointed out that these methods are often able to find possible inputs that are not relevant in the determination of the outputs. This is an important result for both the modeling of physical systems and the automatic control of processes.

However, as follows from this short description, pruning methods by themselves cannot overcome the drawbacks involved in the application of the back-propagation algorithm. In fact, it is still necessary to know an upper bound on the number of hidden units needed for a neural network that approximates the unknown function f. Furthermore, an optimal or near-optimal weight matrix for the redundant multilayer perceptron must be obtained within a reasonable execution time. As pointed out in Section II, the back-propagation algorithm cannot achieve this result, particularly when the number of hidden neurons is too high. Nevertheless, the practical importance of the class of pruning methods should be emphasized: their employment allows us to obtain interesting information on the relevance of every connection and neuron contained in the multilayer perceptrons obtained through the application of a training algorithm.

A symmetrically opposite approach is followed by constructive methods [6]: after the training process, they provide both the configuration of the resulting neural network and the weight values for the relative connections. The learning is typically performed by subsequently adding hidden units to the network architecture until all the input-output relations in the given training set are satisfied. In general, the topology of the connections among the neurons is fixed beforehand and the addition of a new neuron simply implies the redetermination of a (small) portion of the global weight matrix.

This approach leads to learning techniques that present a high convergence speed in the construction of the multilayer perceptron and consequently allow the treatment of complex training sets. Nevertheless, because the updating of the weight matrix involves only a restricted number of connections (in most cases those associated with the neuron to be added), some kinds of regularities in the given training set could be missed. This may reduce the generalization ability of neural networks trained by a constructive method. A technique for dealing with this drawback is proposed in Section VI for the case of classification problems with binary inputs.

In the following two sections we shall describe how constructive methods try to overcome the limitations involved in the employment of the back-propagation algorithm. It should be pointed out, however, that some training methods [47, 48] provide the number of hidden neurons for the resulting neural networks together with the corresponding weight matrix without executing an incremental construction of its configuration. These techniques can still be inserted in the class of constructive methods because they do not work on a fixed architecture. However,
some considerations contained in the following two sections may not be applied in this particular case.
A. DYNAMIC ADAPTATION TO THE PROBLEM

At first glance, the lack of a fixed value for the number of hidden neurons could seem the cause of an increase in the training complexity, as it introduces an additional unknown quantity in the learning process. On the contrary, the possibility of adapting the network architecture to the given problem is one of the advantages of constructive techniques. In fact, there is no need to find an estimate of the complexity of the resulting multilayer perceptron; it is automatically determined during the training on the grounds of the samples contained in the given training set.

This is surely the main advantage of constructive algorithms; each of them tries to obtain the minimal neural network that satisfies all the input-output relations in the given training set by using proper heuristic methods. Unfortunately, only in two particular cases [49, 50] is a theoretical support provided which asserts the optimality (in some sense) of the multilayer perceptron generated by a constructive method. For most algorithms it is only shown that the learning process converges (at least asymptotically) to a configuration that provides correct outputs for all the given samples. However, this result can be achieved only if the training set is consistent, that is, if it does not contain an input pattern with two different outputs.

Although such a convergence theorem ensures the stability of the method employed, its practical relevance is moderated because all real-world problems are affected by the presence of noise. Because of this, even when the given training set is consistent the fulfillment of all its samples can lead to a neural network with low generalization ability [32]. In fact, the presence of noise can increase the number of hidden neurons (and consequently the complexity of the multilayer perceptron) so as to take into account disturbed patterns that do not follow the behavior of the function f to be approximated. A general approach to the solution of this problem is not yet available.
B. HIGH TRAINING SPEED

The possibility of adapting the neural network architecture to the current problem also has important effects on the convergence speed of the training process. In most constructive methods, the addition of a new hidden unit implies the
updating of a small portion of the weights, generally only those regarding the neuron to be added. Hence, it is possible to employ training algorithms for a single neuron that present good convergence properties and allow us to obtain (at least asymptotically) an optimal set of weights [51, 52]. In most cases, the learning process does not involve a search for the global minimum of a proper cost function, as for the back-propagation algorithm, but is based on simpler procedures that lead to a higher convergence speed. Some constructive methods do not even require the training of the output layer because the associated convergence theorems give suitable values for the corresponding weights.

Furthermore, there are some techniques, such as the class of sequential constructive algorithms, in which only a portion of the training set is considered for the addition of a new hidden unit. In these cases, the aforementioned stability properties are maintained and a relevant saving of computation time can be achieved.

However, besides the practical interest of constructive methods, is there any deeper theoretical motivation to prefer this approach with respect to back propagation? Unfortunately, at present it is not possible to give a definitive answer to this question. Baum in a review [53] has pointed out that the possibility of choosing in an adaptive way the architecture of the neural network can allow us to avoid the NP-completeness result found by Judd [41]. In fact, in the loading problem, the user has no control over the structure of the multilayer perceptron, which is fixed a priori.

This conjecture could be proved by generating constructive methods that are able to obtain an optimal configuration for some basic problems (like the intersection of halfspaces) in a polynomial execution time. In this case, we should conclude the superiority of the incremental approach with respect to back propagation. Unfortunately, only when the training uses examples and queries has it been possible to achieve a result of this kind [50]. However, this additional information (queries) makes the task of constructing the final neural network easier, so no light is shed on the original question.
IV. SEQUENTIAL CONSTRUCTIVE METHODS: GENERAL STRUCTURE

Among the variety of constructive techniques for the supervised training of feedforward neural networks, we can determine a class of algorithms, denoted as sequential, which are characterized by a common methodology. The rest of this chapter will be dedicated to this class of methods, and the implementation choices made by each algorithm will be described and discussed. Particular attention will
be paid to the theoretical analysis of the convergence properties and the experimental evaluation of the properties of the multilayer perceptrons obtained.

Let S be a training set containing a finite number s of input-output relations (x_j, y_j), j = 1, ..., s, which characterize the problem to be solved. Suppose that S is consistent; that is, there are no samples (x_{j1}, y_{j1}) and (x_{j2}, y_{j2}) having the same input pattern x = x_{j1} = x_{j2} and different output vectors y_{j1} ≠ y_{j2}. If this is not the case, we can adopt proper techniques to remove every ambiguity (e.g., simple statistical methods, such as the nearest neighbor algorithm [54], can be applied in the treatment of classification problems).

Denote again by f the unknown function that has generated the samples (x_j, y_j), eventually in the presence of noise. The domain D of this function depends on the problem considered and the range of values that can be assumed by an input pattern x. If n is the dimension of the input space, the most frequent choices are surely D ⊂ R^n (real inputs) and D = B^n (binary inputs), where B is a set containing two values employed for the coding of Boolean data. In the following we set B = {−1, +1}, although all the results are still valid, with minor changes, for other definitions of the set B (in particular, B = {0, 1}).

The class of sequential constructive methods has been expressly developed for the solution of classification problems where the range of the function f is given by {−1, +1}^m, m being the number of components of the output patterns y_j. The general technique employed for the construction of a neural network that approximates f can be easily described by introducing a generalization of the concept of decision lists defined in [55].
A. SEQUENTIAL DECISION LISTS FOR TWO-CLASS PROBLEMS

Consider at first the case m = 1 where the output patterns y_j of the given training set are single-valued. In this situation the function f subdivides its domain D into two disjoint subsets D_{+1} and D_{−1} given by

$$D_{+1} = \{ x \in D \mid f(x) = +1 \}, \qquad D_{-1} = \{ x \in D \mid f(x) = -1 \}.$$

Because D_{+1} ∪ D_{−1} = D this separation can be viewed as the result of a classification process of the input patterns based on the output of the function f. Then let us introduce the following:

DEFINITION 1. An ordered sequence of pairs (L_j, d_j), j = 1, ..., h + 1, where L_j ⊂ D and d_j ∈ {−1, +1}, will be called a sequential decision list for a two-class problem, or simply a 1-SDL, if L_{h+1} = D. The first element L_j of every pair will be called the choice set, whereas the latter d_j will be called the pertaining class.
if x ∈ L_1 then
    g(x) = d_1
else if x ∈ L_2 then
    g(x) = d_2
    ⋮
else
    g(x) = d_{h+1}
end if

Figure 1 Procedure implementing the function g associated with the sequential decision list (L_j, d_j), j = 1, ..., h + 1, for a two-class problem.
This definition includes as special cases the decision list presented in [55] and the neural decision list introduced in [56], where the choice sets L_j are halfspaces in the domain R^n. Every 1-SDL is associated with a function g: D → {−1, +1} given by g(x) = d_j, j being the first (least) index for which x ∈ L_j. Because L_{h+1} = D the function g is defined on the whole domain D. By following the interpretation given in [55], the value of g(x) for any x ∈ D can be obtained through the application of a sequence of nested if-then-else statements, as shown in Fig. 1.

Now consider a threshold neuron whose output y is defined as follows:

$$y = \begin{cases} +1, & \text{if } \displaystyle\sum_{j=0}^{k} u_j z_j \ge 0, \\ -1, & \text{otherwise,} \end{cases} \qquad (1)$$

where u_1, ..., u_k are the weights corresponding to the inputs z_1, ..., z_k, respectively. The bias u_0 is included in the summation by adding a new component z_0 = +1.
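Equation (1) translates directly into code; the following sketch (in Python, an assumption since the chapter contains no code) is a one-line transcription of the threshold unit.

```python
import numpy as np

def threshold_neuron(u, z):
    """Threshold unit of Eq. (1): output +1 iff sum_{j=0}^{k} u_j z_j >= 0,
    with the bias u_0 acting on a constant extra input z_0 = +1."""
    z = np.concatenate(([1.0], z))      # prepend z_0 = +1
    return 1 if np.dot(u, z) >= 0 else -1
```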
The relevance of sequential decision lists for the incremental construction of neural networks is pointed out by the following basic result [57]:

THEOREM 1. The function g associated with a given 1-SDL (L_j, d_j), j = 1, ..., h + 1, can always be realized by a perceptron containing a single hidden layer and an output threshold neuron.
Proof. The assertion of the theorem can easily be shown by providing the weights and the activation functions for the desired neural network. First of all, we put in the hidden layer h neurons whose output z_j, j = 1, ..., h, is given by

$$z_j = \begin{cases} +1, & \text{if } x \in L_j, \\ -1, & \text{otherwise.} \end{cases} \qquad (2)$$

Then we verify that the following set of weights for the final threshold neuron leads to the realization of the given function g:

$$u_0 = \sum_{j=1}^{h} u_j + d_{h+1}, \qquad u_j = d_j\, 2^{h-j} \quad \text{for } j = 1, \ldots, h. \qquad (3)$$
To this end take a generic input pattern x ∈ D and denote by j* the first index of the 1-SDL for which x ∈ L_{j*}. When j* ≤ h, Eq. (2) gives

$$z_{j^*} = +1, \qquad z_j = -1 \quad \text{for every } j = 1, \ldots, j^* - 1,$$

from which we obtain

$$\sum_{j=0}^{h} u_j z_j = \sum_{j=1}^{h} (1 + z_j) u_j + d_{h+1} = d_{j^*} 2^{h-j^*+1} + \sum_{j=j^*+1}^{h} (1 + z_j)\, d_j\, 2^{h-j} + d_{h+1}.$$

However,

$$\left| \sum_{j=j^*+1}^{h} (1 + z_j)\, d_j\, 2^{h-j} + d_{h+1} \right| \le \sum_{j=0}^{h-j^*} 2^j = 2^{h-j^*+1} - 1 < \left| d_{j^*} 2^{h-j^*+1} \right|,$$

for which the corresponding output of the final threshold neuron will be

$$y = \operatorname{sgn}\left( \sum_{j=0}^{h} u_j z_j \right) = \operatorname{sgn}\left( d_{j^*} 2^{h-j^*+1} \right) = d_{j^*}.$$
In the complementary case j* = h + 1, we have from Eq. (2) z_j = −1 for every j = 1, ..., h, and obtain the following output y:

$$y = \operatorname{sgn}\left( \sum_{j=0}^{h} u_j z_j \right) = \operatorname{sgn}\left( \sum_{j=1}^{h} (1 + z_j) u_j + d_{h+1} \right) = d_{h+1}. \qquad (4) \quad \blacksquare$$
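The construction of the proof can be checked numerically; in the following sketch the choice sets L_j are represented as membership predicates, which is an assumption of the sketch rather than part of the theorem.

```python
import numpy as np

def sdl_output_weights(d):
    """Weights (3) for the output threshold neuron of a 1-SDL with pertaining
    classes d = [d_1, ..., d_{h+1}]:
    u_j = d_j * 2^(h-j) for j = 1..h, and u_0 = sum(u_j) + d_{h+1}."""
    h = len(d) - 1
    u = np.array([d[j] * 2.0 ** (h - (j + 1)) for j in range(h)])
    u0 = u.sum() + d[h]
    return u0, u

def sdl_evaluate(x, choice_sets, d):
    """Evaluate g(x) through the equivalent two-layer perceptron: hidden unit
    j outputs +1 iff x is in L_j (Eq. (2)); the output unit applies Eq. (1)."""
    z = np.array([1.0 if L(x) else -1.0 for L in choice_sets])
    u0, u = sdl_output_weights(d)
    return 1 if u0 + np.dot(u, z) >= 0 else -1

# Example: a 1-SDL on the real line with h = 2 choice sets.
choice_sets = [lambda x: x < 0, lambda x: x < 2]
d = [+1, -1, +1]
assert sdl_evaluate(-1.0, choice_sets, d) == +1   # first index: L_1
assert sdl_evaluate(1.0, choice_sets, d) == -1    # first index: L_2
assert sdl_evaluate(3.0, choice_sets, d) == +1    # default: d_{h+1}
```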
Unfortunately, the choice (3) for the weights u_j of the output neuron provides integer values that increase exponentially with the size h of the hidden layer (given by the length of the 1-SDL). This makes extremely difficult or impossible both the simulation on a conventional computer and the implementation on a physical support, even for moderate values of h. An alternative set of weights u_j that may not present this drawback can be obtained by taking the groups of consecutive pairs (L_j, d_j) in the given 1-SDL having the same pertaining class d_j. Let l be the number of these groups and h_i, i = 1, ..., l, the index of the last pair (L_{h_i}, d_{h_i}) belonging to the ith group. Then we have

$$d_j = d_{h_i} \quad \text{for every } j = h_{i-1} + 1, \ldots, h_i, \quad i = 1, \ldots, l,$$

where h_0 = 0 and h_l = h by definition. The construction followed by the proof of Theorem 1 is still valid if we employ the following values for the weights u_j of the final threshold neuron [58]:

$$u_0 = \sum_{j=1}^{h} u_j + d_{h+1}, \qquad u_j = d_j \left( 1 + \sum_{k=h_i+1}^{h} |u_k| \right) \quad \text{for } j = h_{i-1}+1, \ldots, h_i, \; i = 1, \ldots, l. \qquad (5)$$
To verify this assertion, we take again a generic input pattern x ∈ D and denote by j* the first index of the 1-SDL for which x ∈ L_{j*}. Moreover, let i* be the index of the group containing the pair (L_{j*}, d_{j*}). When j* ≤ h, Eq. (2) gives

$$z_{j^*} = +1, \qquad z_j = -1 \quad \text{for every } j = 1, \ldots, j^* - 1,$$

for which we can write

$$\sum_{j=0}^{h} u_j z_j = \sum_{j=1}^{h} (1+z_j) u_j + d_{h+1} = \left( 2 + \sum_{j=j^*+1}^{h_{i^*}} (1+z_j) \right) u_{j^*} + \sum_{j=h_{i^*}+1}^{h} (1+z_j) u_j + d_{h+1},$$

because all the weights u_j belonging to the same group are equal. However,

$$\left| \sum_{j=h_{i^*}+1}^{h} (1+z_j) u_j + d_{h+1} \right| < \left( 2 + \sum_{j=j^*+1}^{h_{i^*}} (1+z_j) \right) \left( \sum_{k=h_{i^*}+1}^{h} |u_k| + 1 \right).$$

Thus, the corresponding output of the final threshold neuron will be given by

$$y = \operatorname{sgn}\left( \sum_{j=0}^{h} u_j z_j \right) = \operatorname{sgn}\left( \left( 2 + \sum_{j=j^*+1}^{h_{i^*}} (1+z_j) \right) d_{j^*} \left( \sum_{k=h_{i^*}+1}^{h} |u_k| + 1 \right) \right) = d_{j^*}.$$
In the complementary case j* = h + 1, the choice (5) for the weights u_j leads again to Eq. (4), and the verification is completed. If the given 1-SDL is formed by a single group of consecutive pairs (L_j, d_j), all the pertaining classes d_j for j = 1, ..., h are equal and Eq. (5) becomes

$$u_0 = h\, d_1 + d_{h+1}, \qquad u_j = d_j \quad \text{for } j = 1, \ldots, h. \qquad (6)$$
Consequently, the weights u_j associated with the inputs of the final neuron are binary, whereas the bias u_0 increases linearly with the number h of hidden units. Unfortunately, in the opposite case, when every pair (L_j, d_j) belongs to a separate group, we have d_{j+1} = −d_j for every j = 1, ..., h − 1 and Eq. (5) provides the same exponentially increasing values for the weights u_j as Eq. (3). The pertaining classes d_j of the 1-SDL can be freely chosen only in the case D ⊂ {−1, +1}^n, when the inputs are binary.
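A sketch of the group-based assignment (5), computed backwards over the groups so that the inner sum of later-group magnitudes is already available; passing the group boundaries h_1, ..., h_l explicitly is a choice of this sketch.

```python
def group_output_weights(d, bounds):
    """Weights (5) for a 1-SDL whose pairs form groups of equal pertaining
    class.  `bounds` = [h_1, ..., h_l] are the (1-indexed) last indices of
    the groups, with h_l = h; d = [d_1, ..., d_{h+1}]."""
    h = len(d) - 1
    u = [0.0] * h
    tail = 0.0                                   # sum_{k = h_i+1..h} |u_k|
    lo = [0] + bounds[:-1]                       # h_{i-1} for each group
    for start, end in reversed(list(zip(lo, bounds))):
        for j in range(start, end):              # 0-indexed hidden units
            u[j] = d[j] * (1.0 + tail)
        tail += sum(abs(u[j]) for j in range(start, end))
    u0 = sum(u) + d[h]
    return u0, u

# A single group reproduces Eq. (6); one group per pair reproduces Eq. (3),
# e.g. d = [+1, -1, +1, +1], bounds = [1, 2, 3] gives u = [4, -2, 1].
```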
Figure 2 General architecture of a cascade net.
In this situation it could be convenient to generate pairs (L_j, d_j) belonging to the same group because, in general, this choice leads to neural networks containing fewer connections. On the contrary, when the inputs are real numbers there are training sets for which the value of the pertaining class d_j, for some j, is fixed and consequently no decision is allowed. In practical applications, sequential constructive methods select the pertaining classes by following proper criteria that take account of the samples contained in the given training set. It could therefore be useful to choose the values d_j freely without obtaining a multilayer perceptron with intractable weights. The employment of a different architecture, called a cascade net [56], allows this requirement to be satisfied.

In the cascade net (Fig. 2), every hidden layer contains a single unit that is fed by all the outputs of previous neurons together with the input pattern considered. As in the two-layer perceptron case, the final neuron is again a threshold unit whose activation function is given by Eq. (1). We can state the following theorem [56]:

THEOREM 2. The function g associated with a given 1-SDL (L_j, d_j), j = 1, ..., h + 1, can always be realized by a cascade net containing h hidden neurons.
Proof. The assertion will be verified by providing the weights and the activation functions for the desired cascade net. The h hidden layers contain as many units whose output z_j, j = 1, ..., h, is given by

$$z_j = \psi_j(x, z_1, \ldots, z_{j-1}) = \begin{cases} +1, & \text{if } x \in L_j \text{ and } z_i = -1 \text{ for every } i = 1, \ldots, j-1, \\ -1, & \text{otherwise.} \end{cases} \qquad (7)$$

With this choice an input pattern x ∈ D can activate at most one of the hidden neurons by setting its output to the value +1. Then we verify that there is a weight vector u for the output threshold unit that yields the value g(x) generated by the given 1-SDL. It is sufficient to take

$$u_0 = \sum_{j=1}^{h+1} d_j, \qquad u_j = d_j \quad \text{for } j = 1, \ldots, h. \qquad (8)$$
As a matter of fact, denote by j* the first index for which x ∈ L_{j*}. In the case j* ≤ h, we obtain

$$z_{j^*} = +1, \qquad z_j = -1 \quad \text{for every } j \ne j^*.$$

The corresponding output of the cascade net is then given by Eq. (1):

$$y = \operatorname{sgn}\left( \sum_{j=0}^{h} u_j z_j \right) = \operatorname{sgn}\left( \sum_{j=1}^{h} (1+z_j)\, d_j + d_{h+1} \right) = \operatorname{sgn}(2 d_{j^*} + d_{h+1}) = d_{j^*}. \qquad (9)$$

In the complementary case j* = h + 1, we have z_j = −1 for every j = 1, ..., h, and the application of the activation function (1) leads to

$$y = \operatorname{sgn}\left( \sum_{j=0}^{h} u_j z_j \right) = \operatorname{sgn}\left( \sum_{j=1}^{h} (1+z_j)\, d_j + d_{h+1} \right) = d_{h+1}. \quad \blacksquare$$
When all the first h pairs (L_j, d_j) of the given 1-SDL have the same pertaining class d_j, the assignments (6) and (8) for the two-layer perceptron and the cascade net, respectively, are equivalent. Thus, in this special case the lateral connections among the hidden neurons have no effect on the final output.

This consideration can be further extended to the situation where we have l groups of consecutive pairs (L_j, d_j) having the same pertaining class. In this case only the lateral connections among the neurons associated with different groups are necessary. The resulting neural network therefore contains l hidden layers, each of which is fed by the outputs of previous layers and the input pattern; this architecture can be called a generalized cascade net.
To avoid excessive notational complications, we omit the detailed study of this configuration.
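A sketch of the cascade net of Theorem 2; the lateral connections of Eq. (7) are emulated here by a "fired" flag, and the choice sets are again membership predicates (an assumption of the sketch).

```python
def cascade_net(x, choice_sets, d):
    """Evaluate the cascade net of Theorem 2: hidden unit j fires (+1) only
    if x is in L_j and no earlier unit fired (Eq. (7)); the output unit uses
    the weights (8): u_j = d_j, u_0 = sum of all d_j including d_{h+1}."""
    h = len(choice_sets)
    z, fired = [], False
    for L in choice_sets:               # the flag realizes the lateral links
        on = (not fired) and L(x)
        z.append(1.0 if on else -1.0)
        fired = fired or on
    u0 = float(sum(d))                  # = sum_{j=1}^{h+1} d_j
    total = u0 + sum(dj * zj for dj, zj in zip(d[:h], z))
    return 1 if total >= 0 else -1
```

Since at most one hidden unit fires, the output reduces to sgn(2d_{j*} + d_{h+1}) = d_{j*}, exactly as in the proof, with weights that never grow beyond ±1 (apart from the bias).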
B. SEQUENTIAL DECISION LISTS FOR MULTICLASS PROBLEMS

So far we have considered two-class problems where the output patterns are given by single binary values (m = 1). Now we wish to extend the results of the previous section to the general case m > 1, in which the output vectors y_j of the training set S have m binary components. We will use the term multiclass problem to denote this situation, because a different class can be associated with each particular sequence of outputs. The definition of 1-SDL can be directly extended in the following way.

DEFINITION 2. An ordered sequence of pairs (L_j, d_j), j = 1, ..., h + 1, where L_j ⊂ D and d_j is a nonnull vector with m components in {−1, 0, +1}, will be called a sequential decision list for a multiclass problem, or m-SDL, if L_{h+1} = D and d_{h+1} does not contain null components.
With this definition the vector d_j may not correspond to the pertaining class of the patterns x ∈ L_j, because it leaves the outputs corresponding to its null components undetermined. However, it should be noted that in the case m = 1, Definitions 1 and 2 coincide. Furthermore, the association of a function g: D → {−1, +1}^m with any m-SDL is straightforward: for every input pattern x ∈ D the kth component g_k(x) of the corresponding output g(x) is given by g_k(x) = d_{jk}, j being the first (least) index for which x ∈ L_j and d_{jk} ≠ 0. Again, the constraints L_{h+1} = D and d_{h+1,k} ≠ 0 for every k = 1, ..., m ensure the correct definition of the function g(x) in the whole domain D. The value of this function can still be obtained through the application of a sequence of nested if-then-else statements, as shown in Fig. 3.

From Definitions 1 and 2 it follows that an m-SDL (L_j, d_j) is equivalent to m 1-SDLs (L_j, d_{jk}), k = 1, ..., m, each of which contains only the pairs with d_{jk} ≠ 0. Then Theorems 1 and 2 can be extended to the multiclass case, leading to the following results.

THEOREM 3. The function g associated with a given m-SDL (L_j, d_j), j = 1, ..., h + 1, can always be realized by a perceptron containing a single hidden layer and m threshold neurons in the output layer.
for k = 1, ..., m do
    if x ∈ L_1 and d_{1k} ≠ 0 then g_k(x) = d_{1k}
    else if x ∈ L_2 and d_{2k} ≠ 0 then g_k(x) = d_{2k}
    ...
    else g_k(x) = d_{h+1,k}
    end if
end do

Figure 3 Procedure implementing the function g associated with the sequential decision list (L_j, d_j), j = 1, ..., h + 1, for a multiclass problem.
Proof. The activation functions φ_j(x) for the h hidden units are given again by Eq. (2), whereas the weights for the output neurons can be obtained by applying Eq. (3) to each of the m 1-SDLs obtained by breaking the original m-SDL. ■

THEOREM 4. The function g associated with a given m-SDL (L_j, d_j), j = 1, ..., h + 1, can always be realized by a cascade net containing h hidden units and m threshold neurons in the output layer.
Proof. It is sufficient to proceed as in the proof of Theorem 3 by applying Eqs. (7) and (8). ■

Furthermore, it is possible to subdivide each of the m 1-SDLs deriving from the original m-SDL into groups of consecutive pairs (L_j, d_{jk}) having the same pertaining class d_{jk}; the assignments (5) for the weights in the output layer can therefore be used. In general, this choice leads to a better range of variability for the two-layer perceptron or a lower number of connections for the resulting cascade
net. For the sake of simplicity, we omit an exhaustive description of this generalization, referring to [58] for some implementation details.
C. GENERAL PROCEDURE FOR TWO-CLASS PROBLEMS

Sequential constructive methods build the neural network associated with a classification problem by generating a sequential decision list starting from the given training set S. The kernel of these techniques is the learning algorithm for the hidden units, which has the aim of constructing neurons with activation functions given by Eq. (2) or (7) for some choice sets L_j. The following definition will be used:

DEFINITION 3. Let Q+ and Q− be two subsets of the input space; a neuron will be called a partial classifier if it provides output +1 for at least one pattern in Q+ and output −1 for all the elements of Q−.

As one can note, a partial classifier forms a choice set L_j containing all the patterns of the input space for which it provides output +1. The general structure of this set is determined by the activation function of the neuron; for example, if threshold units are employed, the sets L_j are given by halfspaces in the domain D. Furthermore, in all the practical cases it is possible to change from Eq. (2) for the two-layer perceptron to Eq. (7) for the cascade net by modifying directly the weights of the input connections. In Section V, such a passage will be explained in detail for every choice of the sets L_j used in the construction of sequential decision lists.

Nevertheless, the general procedure for obtaining the multilayer perceptron associated with a finite training set S is common to every sequential constructive method. In the case of two-class problems (m = 1), its outline is reported in Fig. 4. To obtain the convergence of this algorithm, it is necessary to suppose that the training set S is consistent; in this case the sets Q+ and Q− formed at step 2 are always disjoint (Q+ ∩ Q− = ∅). It is, however, possible, as previously noted, to allow the treatment of ambiguous training sets by using proper tricks, such as the elimination of critical samples or the employment of simple statistical methods, like the nearest neighbor algorithm [54], for the assignment of the most probable outputs to ambiguous input patterns.
GENERAL SEQUENTIAL CONSTRUCTIVE METHOD (Procedure for two-class problems)

1. Set h = 1. (number of hidden neurons)
2. Choose a value d_h for the pertaining class of the hth pair (L_h, d_h) of the 1-SDL. Let Q+ and Q− contain the input patterns of the current training set having corresponding output d_h and −d_h, respectively.
3. Train the hth hidden neuron so as to obtain a partial classifier for Q+ and Q−. Let R be the subset of Q+ containing the input patterns for which the hth hidden unit provides output +1.
4. Remove from the training set all the samples corresponding to the elements of R.
5. If there are two samples in the current training set having different output, then set h = h + 1 and go to Step 2.
6. Set d_{h+1} = −d_h and construct the resulting neural network by following the proof of Theorem 1 or 2.
Figure 4 General procedure followed by sequential constructive methods for the solution of two-class problems.
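Rendered in code, the procedure of Fig. 4 reduces to a short loop. The following Python fragment is only a minimal sketch, not the original implementation: train_partial_classifier stands for any of the specific methods of Section V (it must return a weight vector and the nonempty set R), and choose_pertaining_class is a hypothetical placeholder for the heuristic choice of d_h discussed below.

```python
def choose_pertaining_class(current):
    # Placeholder: take the class of the first remaining sample. The text
    # describes a better heuristic based on the ratios r+/q+ and r-/q-.
    return current[0][1]

def sequential_constructive(samples, train_partial_classifier):
    """samples: list of (x, y) pairs with y in {-1, +1}."""
    current = list(samples)
    hidden_units = []                               # one (w, d_h) per neuron
    while len({y for _, y in current}) > 1:         # step 5: both classes left
        d_h = choose_pertaining_class(current)      # step 2
        Q_plus = [x for x, y in current if y == d_h]
        Q_minus = [x for x, y in current if y == -d_h]
        w, R = train_partial_classifier(Q_plus, Q_minus)   # step 3
        hidden_units.append((w, d_h))
        removed = {tuple(x) for x in R}
        current = [(x, y) for x, y in current
                   if tuple(x) not in removed]      # step 4
    # step 6: the remaining samples all share the default class d_{h+1}
    d_default = current[0][1] if current else -hidden_units[-1][1]
    return hidden_units, d_default
```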
In all the sequential constructive methods present in the literature, the activation functions employed for the hidden neurons always allow us to obtain, within a finite execution time, a partial classifier with a nonempty associated set R for at least one value of the pertaining class d_h. Thus, the elimination performed at step 4 reduces the size of the current training set, leading to the construction of a neural network that satisfies all the given samples after a finite number of iterations. This simple reasoning shows the assertion of the following theorem, which is widely employed to ensure the convergence of sequential constructive methods.

THEOREM 5. Every sequential constructive method provides, in a finite execution time, a neural network that satisfies all the samples contained in a given finite consistent training set S if the learning algorithm for the hidden neurons
allows us to obtain, within a finite computing time, a partial classifier, whichever pair of sets Q+ and Q− is considered.

The hypothesis of Theorem 5, regarding the availability of a partial classifier, will be verified in Section V for every activation function used in the literature. However, it should be pointed out that a better generalization ability for the resulting neural network can be attained by maximizing the size of the set R obtained at step 3. In general, this leads to a lower number of hidden neurons and consequently to a reduction in the network complexity. To achieve this goal, every sequential constructive technique employs proper heuristic methods.

At the beginning of the construction of the multilayer perceptron, the training of the output threshold neuron is performed so as to check the linear separability of the given training set S. In fact, in this particular situation, there is no need to apply the procedure in Fig. 4. In any case, when threshold activation functions are also used for the hidden layer, we can set h = 2 at step 1, because a single hidden unit can always be removed by modifying the weights of the output neuron.

An important role in the general procedure described in Fig. 4 is played by the choice of the pertaining class d_h (step 2) for the hth pair of the 1-SDL (associated with the hth hidden neuron). A widely used approach [57] is to compute the number of patterns in the sets R obtained for the two assignments d_h = +1 and d_h = −1. Let r+ and r− be these values; the choice of the pertaining class is made by considering the maximum of the ratios r+/q+ and r−/q−, where q+ and q− are the sizes of the sets Q+ and Q−, respectively. Such an approach will be used in our experimental tests and is preferred to the simple comparison of the integers r+ and r−, because it takes into account the relative importance of the two classes. In particular, if r+ = q+ < r−, comparing the raw counts would miss the choice d_h = +1 that leads to the correct classification of the whole training set. Nevertheless, some sequential constructive methods [59] radically solve the problem by adding two hidden units with opposite pertaining classes at the same time.
D. GENERAL PROCEDURE FOR MULTICLASS PROBLEMS
The natural extension of the procedure in Fig. 4 to the solution of multiclass problems (m > 1) is shown in Fig. 5. In this case, we can define m pairs of sets (D_{k,+1}, D_{k,−1}), with k = 1, ..., m, containing the input patterns of the given training set S whose kth output is +1 and −1, respectively:

D_{k,+1} = {x_i | y_{ik} = +1, (x_i, y_i) ∈ S},    D_{k,−1} = {x_i | y_{ik} = −1, (x_i, y_i) ∈ S}.

These subsets of the domain D will contain at any iteration the current training set for each output.
GENERAL SEQUENTIAL CONSTRUCTIVE METHOD (Natural extension to multiclass problems)

1. Set h = 1. (number of hidden neurons)
2. Choose a vector d_h in {−1, 0, +1}^m containing the pertaining classes for some of the m outputs (d_{hk} = 0 if the hth hidden neuron does not affect the kth output).
3. Set
   Q+ = ∩_{k=1}^{m} Q_k^+,    Q− = ∪_{k=1}^{m} Q_k^−,
   where
   Q_k^+ = D if d_{hk} = 0, Q_k^+ = D_{k,d_{hk}} otherwise;
   Q_k^− = ∅ if d_{hk} = 0, Q_k^− = D_{k,−d_{hk}} otherwise.
4. Train the hth hidden neuron so as to obtain a partial classifier for Q+ and Q−. Let R be the subset of Q+ containing the input patterns for which the hth hidden unit provides output +1.
5. For every k such that d_{hk} ≠ 0, set D_{k,d_{hk}} = D_{k,d_{hk}} \ R.
6. If D_{k,+1} ≠ ∅ and D_{k,−1} ≠ ∅ for any output k, then set h = h + 1 and go to Step 2.
7. Define the vector d_{h+1} in the following way:
   d_{h+1,k} = +1 if D_{k,−1} = ∅, d_{h+1,k} = −1 otherwise, for k = 1, ..., m,
   and construct the resulting neural network by following the proof of Theorem 3 or 4.
Figure 5 Natural extension of sequential constructive methods to the solution of multiclass problems.
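As an illustration, step 3 of Fig. 5 can be coded in a few lines. The sketch below is hypothetical in its data layout: it assumes that the current sets D_{k,c} are kept in a dictionary indexed by the pair (k, c), with c in {+1, −1}, and that the patterns are hashable tuples.

```python
def build_Q(D, d_h, domain):
    """D: dict mapping (k, c) to the current set D_{k,c}; d_h: vector in
    {-1, 0, +1}^m; domain: iterable of all admissible input patterns."""
    Q_plus = set(domain)          # the intersection starts from the whole D
    Q_minus = set()               # the union starts from the empty set
    for k, d_hk in enumerate(d_h):
        if d_hk == 0:
            continue              # Q_k+ = D and Q_k- = empty: no constraint
        Q_plus &= D[(k, d_hk)]
        Q_minus |= D[(k, -d_hk)]
    return Q_plus, Q_minus
```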
The sets Q+ and Q− for the learning of the hidden neurons will be extracted from the pairs (D_{k,+1}, D_{k,−1}) by following the method described at step 3. It can easily be seen that, in this way, an m-SDL (L_j, d_j), j = 1, ..., h + 1, is constructed in which every choice set L_j contains some input patterns of the training set S. Its form depends again on the activation function employed for the hidden units.

As one can note, in the multiclass case the choice of the vector d_h at step 2 is not so straightforward as in the solution of two-class problems. The variety of possible assignments increases the probability of constructing a neural network with low generalization ability. Nevertheless, it is always possible to consider the outputs one at a time by performing m executions of the procedure in Fig. 4; this choice has been adopted in our experimental tests.

A heuristic approach that allows us to construct multilayer perceptrons with low complexity has been proposed in [58]. It initially chooses for every output a pertaining class d_k (e.g., that containing the minimum number of samples in the training set S) and a lower threshold r for the size of the set Q+ at step 2. Then, when a new hidden neuron has to be added, a set of outputs K is selected by applying a greedy technique so as to obtain |Q+| ≥ r, where
Q+ = ∩_{k∈K} D_{k,d_k}.

The vector d_h is thus defined in the following way: d_{hk} = d_k if k ∈ K, and d_{hk} = 0 otherwise.
In general, the resulting networks contain a lower number of hidden neurons and therefore have a high probability of showing a good generalization ability. If the inputs are real, such an approach may need some minor corrections to take account of particular situations where there are no partial classifiers with nonnull R.

The convergence properties of sequential constructive methods ensured by Theorem 5 for two-class problems can easily be extended to the multiclass case. It is sufficient to subdivide the m-SDL obtained by applying the technique in Fig. 5 into m 1-SDLs, one for each output, according to the procedure described in Section IV.B. The application of Theorem 5 to each of these 1-SDLs determines a corresponding finite execution time t_k necessary for the construction of the portion of the neural network associated with the kth output. Because the number m of outputs is also finite, we obtain the following general result.

THEOREM 6. Under the hypotheses of Theorem 5, the natural extension of a sequential constructive method to multiclass problems is able to provide, within a finite execution time, a neural network that satisfies all the samples contained in a given training set S.
The choice of the vectors d_h can also be avoided by modifying the procedure in Fig. 5 and employing a proper algorithm for the training of the neurons in the output layer [58]. The resulting method is reported in Fig. 6; in this case the number h of hidden units is initially set to zero, emphasizing the preliminary learning of the output weights (also performed in the procedures described in Figs. 4 and 5). If the training of the output layer does not lead to a multilayer perceptron that satisfies all the samples of the training set S, a new partial classifier is added to the hidden layer (step 5).

The sets Q+ and Q− employed for the training of this neuron are generated at step 3 by taking at first the output k+ that scores the highest number of errors on the input patterns of V_{k,d_k}, where d_k is the pertaining class chosen beforehand for the kth output (step 1). The size of the auxiliary sets V_{k,+1} and V_{k,−1}, initially equal to D_{k,+1} and D_{k,−1}, respectively, decreases with the number of iterations (step 6). This prevents us from generating two hidden neurons having the aim of correcting the same output error. If all the input patterns belonging to the sets V_{k,d_k} are correctly classified (for every output k), the procedure in Fig. 6 tries to correct the errors associated with the pertaining classes −d_k. In particular, it considers the output k− corresponding to the set U_{k,−d_k} with the largest size.

This method can always construct a neural network that satisfies all the samples contained in a given (finite and consistent) training set S if the learning algorithm for the neurons in the output layer is able to obtain an optimal configuration, that is, a weight matrix that makes the minimum number of errors on the sets D_{k,+1} and D_{k,−1}. A method of this kind is the pocket algorithm [51, 60]; in fact, it can be shown that by applying this technique, the probability of obtaining an optimal weight vector for a single threshold neuron approaches unity as the number of iterations increases. Furthermore, the version with ratchet, very interesting from an applicative point of view, provides an optimal configuration within a finite execution time [52]. This property will be used to show the following result:

THEOREM 7. Under the hypotheses of Theorem 5, the extension with output training of a sequential constructive method to multiclass problems is able to provide, within a finite execution time, a neural network that satisfies all the samples contained in a given training set S.
Proof. In the procedure of Fig. 6, we can note that the repeated execution of steps 2-7 causes the addition of a (possibly null) number of partial classifiers for every output. Denote by I_k the set of hidden units generated when, at step 3, we take Q+ = U_{k,d_k} or Q+ = U_{k,−d_k}. It can easily be seen that the sets R_h associated with these neurons (step 5) are all disjoint among them.
GENERAL SEQUENTIAL CONSTRUCTIVE METHOD (Extension with output training to multiclass problems)

1. For each output k = 1, ..., m choose a pertaining class d_k. Set V_{k,+1} = D_{k,+1}, V_{k,−1} = D_{k,−1}, h = 0.
2. Train the m threshold neurons in the output layer. Let U_{k,+1} (U_{k,−1}) be the set of input patterns in V_{k,+1} (V_{k,−1}) that are not correctly classified by the current neural network. Denote with k+ (k−) the index of the output associated with the set U_{k,d_k} (U_{k,−d_k}) having maximum size.
3. If U_{k+,d_{k+}} ≠ ∅ then set Q+ = U_{k+,d_{k+}} and Q− = D_{k+,−d_{k+}}; otherwise, if U_{k−,−d_{k−}} ≠ ∅ then set Q+ = U_{k−,−d_{k−}} and Q− = D_{k−,d_{k−}}; otherwise the current neural network satisfies all the samples in the training set S and the construction is complete.
4. Set h = h + 1.
5. Train the hth hidden neuron so as to obtain a partial classifier for Q+ and Q−. Let R_h be the subset of Q+ containing the input patterns for which the hth hidden unit provides output +1.
6. If U_{k+,d_{k+}} ≠ ∅, then set V_{k+,d_{k+}} = V_{k+,d_{k+}} \ R_h; otherwise set V_{k−,−d_{k−}} = V_{k−,−d_{k−}} \ R_h.
7. Go to Step 2.

Figure 6 Extension with output training of sequential constructive methods to the solution of multiclass problems.
Thus, in the worst case, the procedure will yield one of the following two situations:

D_{k,+1} ⊆ ∪_{h∈I_k} R_h    or    D_{k,−1} ⊆ ∪_{h∈I_k} R_h.

In both of these cases the application of Theorem 1 ensures the existence of a weight vector for the kth output neuron which leads to the correct classification of the whole training set S. Because this configuration is surely optimal, it will be found by the pocket algorithm with ratchet in a finite number of iterations. ■

It is important to emphasize that the convergence theorems (Theorems 5-7) for sequential constructive methods require the employment of a training algorithm for the hidden neurons which is able to generate a partial classifier within a finite execution time. In practice, this condition is always satisfied, because all the activation functions used in the applications allow us to separate with relative simplicity at least one pattern of a given class from all the patterns belonging to the opposite one. Unfortunately, such a separation leads to networks that memorize the samples of the given training set without performing any generalization. A way to avoid this undesired effect is to employ a learning algorithm for the hidden neurons that gives output +1 for most input patterns contained in Q+. The techniques used to achieve this goal change from one sequential constructive method to another and form the subject of the following section.
V. SEQUENTIAL CONSTRUCTIVE METHODS: SPECIFIC APPROACHES

Consider two subsets Q+ and Q− of an n-dimensional domain D, containing a finite number of patterns q+ and q−, respectively. In particular, we shall analyze the cases D ⊆ ℝ^n and D ⊆ {−1, +1}^n, because they are commonly used in the treatment of sequential constructive methods. Let φ: D → {−1, +1} be the activation function of a given neuron, which, in general, depends on a vector of parameters w ∈ ℝ^{n+1} (n input weights plus the bias w_0). We shall examine in detail training algorithms that generate a set of weights w for a neuron of this kind which provides output +1 for the greatest number of input patterns in Q+ and output −1 for all the elements of Q−. In the following we assume that the two sets Q+ and Q− are disjoint, that is, Q+ ∩ Q− = ∅ (consistent training set).

Because the procedure followed by these training algorithms depends on the form of the activation function φ, we shall subdivide the available methods by
considering the geometrical aspect of the choice set L associated with the partial classifier:

L = {x ∈ D: φ(x) = +1}.    (10)
A. HALFSPACE CHOICE SET

Most sequential constructive methods employ threshold neurons for the construction of the hidden layer; their activation function is given by Eq. (1), here rewritten for convenience:

φ(x) = sgn( Σ_{i=0}^{n} w_i x_i ) = +1, if Σ_{i=0}^{n} w_i x_i ≥ 0; −1, otherwise.    (11)
As usual, we have added a component x_0 = +1 to the input pattern x so as to include the bias in the weights of the neuron. The value φ(0) = +1 has been arbitrarily chosen; all the following considerations are still valid if we set φ(0) = −1.

It can be argued from Eqs. (10) and (11) that the choice set L for threshold neurons is a halfspace in ℝ^n, whose position is univocally determined by the weight vector w. The equation for the associated boundary hyperplane is then

Σ_{i=0}^{n} w_i x_i = 0.

When D ⊆ {−1, +1}^n the input patterns are binary vectors with n components and correspond to the vertices of an n-dimensional hypercube. In this case, it is always possible to find a hyperplane that separates a pattern x ∈ D from all the other points of the set {−1, +1}^n; a possible choice for the weight vector is the following:

w_0 = 1 − n,    w_i = x_i for every i = 1, ..., n.    (12)
A threshold neuron that performs this separation is called a grandmother cell and is used to show that a two-layer perceptron containing only threshold units can realize any Boolean function. It is sufficient to put at most s/2 grandmother cells in the hidden layer (where s is the size of the given training set) and a final threshold neuron that performs the logic operation OR among the outputs of the hidden units [6]. Nevertheless, such a configuration is interesting only from a theoretical point of view, because it simply memorizes the samples of the training set without making any kind of generalization.
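The grandmother-cell construction (12) is easily checked in code. The Python fragment below is only an illustration of the argument, with the whole hypercube enumerated for a small n; the function names are introduced here and are not part of the original text.

```python
import itertools

def grandmother_cell(x):
    # Eq. (12): the returned threshold neuron outputs +1 exactly on the
    # binary pattern x and -1 on every other vertex of {-1, +1}^n.
    return 1 - len(x), list(x)          # bias w_0 = 1 - n, weights w_i = x_i

def threshold_output(w0, w, x):
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1          # Eq. (11)

# Check on the 3-dimensional hypercube.
x = (1, -1, 1)
w0, w = grandmother_cell(x)
for v in itertools.product((-1, 1), repeat=3):
    assert threshold_output(w0, w, v) == (1 if v == x else -1)
```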
In the case D ⊆ ℝ^n, a similar construction has not been found; however, it can be shown that, given two sets of patterns U and V, there is always a partial classifier with threshold activation function, that is, a threshold neuron that separates some patterns of one set from all the elements of the other [61, 62]. This result can be obtained directly by considering the convex hulls formed by the points of U and V. However, a direct approach that gives the weight vector of such a unit is not available.

As previously pointed out, our aim is to generate a partial classifier that provides the correct output for the greatest number of patterns of Q+, so as to minimize the number of connections in the final neural network. Unfortunately, the achievement of this result can lead, for some choices of the sets Q+ and Q−, to the solution of an NP-complete problem called the densest-hemisphere problem [63]. Consequently, to lower the total execution time, the learning techniques employed by sequential constructive methods only try to obtain a near-optimal partial classifier that approaches the desired configuration.

In the following sections we shall analyze in detail the main training algorithms used to achieve this result. For each of them we shall describe the procedure followed to obtain the weight vector w of the hidden unit (2) to be employed in the construction of the two-layer perceptron according to the proof of Theorem 1. From this configuration, the weight vector v for the threshold neuron (7) to be inserted in the corresponding cascade net can be obtained directly (under the same choice set L_j). As a matter of fact, consider the addition of the jth hidden unit: the n + j components of its weight vector v can be subdivided in the following way:

v_0 = neuron bias,
v_1, ..., v_n = weights associated with the network inputs x_1, ..., x_n,
v_{n+1}, ..., v_{n+j−1} = weights associated with the outputs z_1, ..., z_{j−1} of the first j − 1 hidden neurons.
If the algorithm employed for the training of the (threshold) partial classifier has generated a weight vector w starting from the sets Q+ and Q−, a possible choice for the components of v is

v_0 = w_0 + Σ_{i=n+1}^{n+j−1} v_i,
v_i = w_i    for i = 1, ..., n,    (13)
v_i = −(1/2) ( |w_0| + Σ_{k=1}^{n} |w_k| · max|x_k| + 1 )    for i = n + 1, ..., n + j − 1,
max|x_k| being the maximum absolute value that can be assumed by the kth network input x_k. When the inputs are binary, we obtain directly max|x_k| = 1 for every k = 1, ..., n, from which it follows that [56]

v_i = −(1/2) ( Σ_{k=0}^{n} |w_k| + 1 )    for i = n + 1, ..., n + j − 1.    (14)
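As an illustration, the conversion defined by Eqs. (13) and (14) can be written as a small Python function. The names cascade_weights and max_abs are assumptions introduced here; the binary case (14) is recovered by passing max_abs = [1, ..., 1].

```python
def cascade_weights(w, j, max_abs):
    """w: weights of the j-th partial classifier, with the bias in w[0];
    max_abs[k]: maximum absolute value of the (k+1)-th network input.
    Returns the weight vector v of the corresponding cascade-net unit,
    which also receives the outputs z_1, ..., z_{j-1}."""
    n = len(w) - 1
    lateral = -0.5 * (abs(w[0])
                      + sum(abs(w[k + 1]) * max_abs[k] for k in range(n))
                      + 1)                        # lateral weights, Eq. (13)
    v = [0.0] * (n + j)
    v[1:n + 1] = w[1:]                            # v_i = w_i for i = 1..n
    for i in range(n + 1, n + j):
        v[i] = lateral
    v[0] = w[0] + lateral * (j - 1)               # v_0 = w_0 + sum of laterals
    return v
```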
Now let us verify that Eq. (13) leads to the activation function (7), where L_j is the halfspace associated with the partial classifier having weight vector w. At first, suppose that there is an index l such that z_l = +1 and z_i = −1 for i ≠ l; then Eq. (13) gives

Σ_{i=0}^{n} v_i x_i + Σ_{i=n+1}^{n+j−1} v_i z_{i−n} = Σ_{i=0}^{n} w_i x_i + Σ_{i=n+1}^{n+j−1} v_i (1 + z_{i−n}) = Σ_{i=0}^{n} w_i x_i − |w_0| − Σ_{k=1}^{n} |w_k| · max|x_k| − 1 < 0    (15)

for every value of the network inputs x_1, ..., x_n. In the complementary case, if z_i = −1 for every i = 1, ..., j − 1, we obtain

Σ_{i=0}^{n} v_i x_i + Σ_{i=n+1}^{n+j−1} v_i z_{i−n} = Σ_{i=0}^{n} w_i x_i + Σ_{i=n+1}^{n+j−1} v_i (1 + z_{i−n}) = Σ_{i=0}^{n} w_i x_i.    (16)
Thus, the output of the threshold neuron is equal to that of the partial classifier having weight vector w.

1. Irregular Partitioning Algorithm

The first sequential constructive method was probably that proposed by Rujan and Marchand [49], usually called the regular partitioning algorithm. This technique can only be applied when the training set has binary inputs; it is centered on the definition of a regular partitioning of the hypercube whose vertices belong to the set {−1, +1}^n. Although the general approach followed is very similar to that described in Section IV, the training of the hidden units must satisfy a greater number of constraints, making the construction of the resulting neural network more complex.

In a subsequent paper the same authors, together with Golea [57], removed the regularity requirement for the partitioning of the hypercube and laid the bases for the sequential constructive methods as described in this chapter. The technique proposed in [57] is sometimes called the irregular partitioning algorithm (IPA) and is still devoted to the treatment of training sets with binary inputs.
A recent paper [56] extends this method to the general case D ⊆ ℝ^n, thus allowing for a direct application to many real-world problems. However, the training algorithm for the hidden neurons is basically unchanged and still employs a greedy technique to find a near-optimal partial classifier. The procedure followed is shown in Fig. 7. Here we have denoted by |A| the number of elements of a finite set A.

Its computational kernel makes use of a proper method (steps 4 and 8) that verifies the existence of a hyperplane separating the two sets of patterns R_i and Q−. Several algorithms in the literature can perform this task; in particular, the perceptron algorithm [64], used in [57], and linear programming, employed in [56]. In the first case, a weight vector is initially chosen and is subsequently updated so as to correct eventual classification errors on the patterns of the training set. At every iteration, a sample (x, y) is randomly extracted and the output φ(x) of the threshold neuron associated with the current weight vector w is computed by applying Eq. (11). If y ≠ φ(x), the vector w is modified according to the following rule:

w_i = w_i + η y x_i,    (17)

η being a real constant called the learning rate (generally 0 < η ≤ 1). It can be shown [65] that the perceptron algorithm converges in a finite number of iterations if the training set is linearly separable; in the opposite case, the method cycles indefinitely through feasible weight vectors. Unfortunately, there is no estimate for the number of iterations needed to establish whether a given problem can be solved by a threshold neuron. Thus, it is difficult to assign a maximum duration n_t (to be used at steps 4 and 8) for the perceptron algorithm, after which we conclude that a separating hyperplane for the sets R_i and Q− does not exist. In fact, if n_t is too low, we can reach a wrong conclusion even when treating a linearly separable training set; in the opposite case, for great values of n_t, the computational cost of this learning algorithm can become too expensive.

An alternative approach, employed in [56], is given by linear programming, which converges in polynomial time [66] to a separating neuron (if one exists). It tries to maximize the distance between the current hyperplane and the samples of the training set while ensuring the correct classification of all the input patterns. Other techniques can be used, such as the Minover algorithm [67], the Adatron method [68], etc., but a detailed analysis of their advantages and drawbacks goes beyond the scope of this chapter.

In the procedure of Fig. 7, three auxiliary sets are employed: W contains the patterns of Q+ to be considered for the initialization of a new unit, R_i holds the portion of Q+ correctly classified by the current threshold neuron (associated with the weight vector w_i), and V contains the elements of Q+ that can be included in R_i to increase its size.
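A minimal Python sketch of the perceptron algorithm with the update rule (17) follows; the sample set, the learning rate η, and the iteration bound n_t are passed explicitly, and no conclusion about separability is drawn when the bound is exhausted.

```python
import random

def perceptron(samples, eta=1.0, n_t=10000):
    """samples: list of (x, y) pairs with y in {-1, +1}. The component
    x_0 = +1 is prepended so that the bias is handled as w[0]."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(n_t):
        x, y = random.choice(samples)               # random extraction
        xx = (1,) + tuple(x)
        out = 1 if sum(wi * xi for wi, xi in zip(w, xx)) >= 0 else -1
        if out != y:                                # misclassified sample
            w = [wi + eta * y * xi for wi, xi in zip(w, xx)]   # rule (17)
    return w
```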
IRREGULAR PARTITIONING ALGORITHM (Method for generating a partial classifier)

1. Set i = 1, W = Q+, w = (−1, 0, ..., 0)^t, R = ∅.
2. If W = ∅ then output the threshold neuron associated with the weight vector w.
3. Choose at random an input pattern x ∈ W and set W = W \ {x}, R_i = {x}, V = Q+ \ {x}.
4. Use a proper method to check if the sets R_i and Q− are linearly separable.
5. If R_i and Q− are not linearly separable, then go to Step 3; otherwise let w_i be the vector associated with the separating hyperplane.
6. If V = ∅ go to Step 11.
7. Choose at random an input pattern x ∈ V and set R_i = R_i ∪ {x}, V = V \ {x}.
8. Use a proper method to check if the sets R_i and Q− are linearly separable.
9. If R_i and Q− are not linearly separable, then set R_i = R_i \ {x}; otherwise let w_i be the vector associated with the separating hyperplane.
10. Go to Step 6.
11. Set W = W \ R_i. If |R_i| > |R| set R = R_i, w = w_i.
12. Go to Step 2.

Figure 7 Procedure employed by the irregular partitioning algorithm to generate partial classifiers.
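The separability checks at steps 4 and 8 of Fig. 7 can be cast as a linear feasibility problem. The sketch below uses the linprog routine of SciPy as one possible realization of the linear-programming approach mentioned in the text; the unit margins stand in for the strict inequalities, which is legitimate for finite point sets after a rescaling of w.

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(R_i, Q_minus):
    """Look for w such that w.x >= 1 on R_i and w.x <= -1 on Q_minus,
    with x extended by x_0 = +1. Returns w, or None if the two sets are
    not linearly separable."""
    pos = np.hstack([np.ones((len(R_i), 1)), np.asarray(R_i, float)])
    neg = np.hstack([np.ones((len(Q_minus), 1)), np.asarray(Q_minus, float)])
    A_ub = np.vstack([-pos, neg])          # -w.x <= -1 and w.x <= -1
    b_ub = -np.ones(len(pos) + len(neg))
    res = linprog(c=np.zeros(pos.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * pos.shape[1])
    return res.x if res.success else None
```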
As one can note, through the repeated application of steps 2-12, the algorithm generates a pool of candidate units among which the best partial classifier is chosen. The decision is based on the size of the set R_i associated with every neuron. In particular, the set R and the vector w are used to memorize the partial classifier that separates the greatest number of elements of Q+ from the patterns of Q−. A drawback of this approach is the employment of a greedy technique that requires a high number of checks for linear separability at steps 4 and 8. This number can increase exponentially with the dimension n of the input space, leading to an excessive execution time even for (relatively) small values of n. Finally, we point out that step 4 can be omitted when the inputs are binary, because the grandmother cell solution (12) allows us to obtain directly the weight vector w_i for any pattern x.

2. Carve Algorithm

A different approach to the construction of a partial classifier for two sets of input patterns Q+ and Q− is to consider the convex hull generated by the elements of Q−. It can be pointed out that only the points of Q+ outside this convex hull can be separated from the elements of Q− through a hyperplane. This is the basic idea of the carve algorithm (CA). The original version [61] uses the beneath-beyond method [69] directly for the construction of convex hulls and leads to a total execution time that is exponential in the number n of inputs. In fact, by employing this technique, it is possible to generate an optimal partial classifier that correctly classifies the maximum number of patterns in Q+ and Q− (an NP-complete task). A near-optimal result can be obtained much faster by considering only a restricted number of faces of the convex hull; this can be done by applying a proper algorithm, which derives from the gift-wrapping method [70] and has polynomial complexity.

The procedure followed by this technique can be described by introducing the following definition, widely used in the theory of convex sets.

DEFINITION 4. Let A be a subset of ℝ^n. A hyperplane that passes through (at least) a point x ∈ A and has the remainder of this set in only one halfspace is called a supporting hyperplane of A. The elements of A that belong to (at least) a supporting hyperplane are called boundary points.
If the number of elements in A is finite, it can easily be seen that, given any direction in ℝ^n, we can always find at least one supporting hyperplane of A which is orthogonal to that direction. Moreover, apart from some critical cases, there are infinitely many supporting hyperplanes that pass through a boundary point. With these premises, the procedure employed by the CA to find a partial classifier for the sets Q+ and Q− can be outlined as in Fig. 8.
CARVE ALGORITHM (Method for generating a partial classifier)

1. Choose at random a direction in ℝ^n and find a supporting hyperplane of Q− which is orthogonal to this direction. The set of points of Q− that belong to this initial hyperplane is called the boundary set.
2. Choose at random another direction in ℝ^n and find a supporting hyperplane of Q− which is orthogonal to this direction and passes through (at least) a point of the boundary set.
3. Rotate the initial hyperplane around its intersection with the second hyperplane until it touches another point of Q−. Output the threshold neuron associated with this final hyperplane.

Figure 8 Procedure employed by the carve algorithm to generate partial classifiers.
To obtain a pool of candidate neurons among which the best partial classifier is chosen, steps 2 and 3 are repeatedly executed with the same initial hyperplane, and the whole procedure is iterated for a fixed number of times. Nevertheless, it can be shown that the execution time for this procedure remains polynomial [62]. Sometimes, at step 2, a direction is randomly chosen that does not allow the existence of a supporting hyperplane containing a point of the boundary set. In this case, new different directions are selected at random until a supporting hyperplane satisfying the required conditions is found or a maximum number of iterations is exceeded.

3. Target Switch Algorithm

A different technique for the construction of partial classifiers was initially proposed by Zollner et al. [71] to treat training sets with binary inputs and subsequently extended to the general case D ⊆ ℝ^n by Campbell and Perez Vicente [59]. To better explain the procedure followed by the target switch algorithm (TSA), let us denote by x_j^+, j = 1, ..., q+, the patterns contained in the set Q+ and by x_j^−, j = 1, ..., q−, the elements of Q−. If w is the weight vector of
a threshold neuron, we can assign to every pattern x_j^+ of the set Q+ a real value μ_j^+ given by

μ_j^+ = Σ_{i=1}^{n} w_i x_{ji}^+.    (18)
In the same way we can obtain a corresponding value μ_j^− for every x_j^− ∈ Q−. By using this notation, we can outline the procedure followed by the TSA as shown in Fig. 9.

As in the irregular partitioning algorithm (Fig. 7), the TSA employs a set R and a weight vector w to hold the best partial classifier found during the repeated execution of steps 2-10. At step 3, a near-optimal threshold neuron can be obtained by applying the pocket algorithm with ratchet [51]; in fact, as pointed out in Section IV.D, it converges in a finite number of iterations to the optimal configuration. Other similar procedures can be constructed by substituting the perceptron learning method, which forms the kernel of the pocket algorithm, with alternative training techniques for threshold neurons, such as the Minover algorithm [67].

The auxiliary sets W+ and W−, initially equal to Q+ and Q−, respectively (step 2), are subsequently modified through the process of target switching, so as to achieve a linearly separable training set. In particular, at step 4 a suitable bias v_0 is determined that allows the correct classification of all the patterns in W+. It should be noted, however, that the assignment v_0 = −μ_l^+ leads to a hyperplane containing some elements of W+; to avoid this problem, a small positive quantity δ is subtracted from μ_l^+ in the assignment. If the resulting threshold neuron does not correctly classify some patterns of W− (μ_k^− ≥ μ_l^+), the element x_k^− ∈ W− that leads to the maximum value μ_k^− is found. Then the vector x_j^+, with μ_j^+ ≤ μ_k^−, that presents the maximum overlapping ν_j with x_k^− is moved from W+ to W− (target switching). Such a modification of the set W+ leads to a partial classifier (if there is one) for the given training set in a finite number of iterations. Finally, step 10 allows us to obtain a pool of candidate weight vectors v, among which the one providing the correct output for the greatest number of patterns in Q+ is selected.
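For concreteness, a simplified Python sketch of the pocket algorithm with ratchet used at step 3 follows. With respect to the original formulation [51], the run-length bookkeeping of the plain pocket algorithm is omitted here and only the ratchet test on the total number of errors is kept.

```python
import random

def pocket_with_ratchet(samples, eta=1.0, n_t=10000):
    """samples: list of (x, y) pairs with y in {-1, +1}."""
    def errors(w):
        return sum(1 for x, y in samples
                   if (1 if sum(wi * xi for wi, xi in
                                zip(w, (1,) + tuple(x))) >= 0 else -1) != y)
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    pocket, pocket_err = list(w), errors(w)
    for _ in range(n_t):
        x, y = random.choice(samples)
        xx = (1,) + tuple(x)
        if (1 if sum(wi * xi for wi, xi in zip(w, xx)) >= 0 else -1) != y:
            w = [wi + eta * y * xi for wi, xi in zip(w, xx)]  # perceptron step
            e = errors(w)                 # ratchet: full re-evaluation
            if e < pocket_err:            # keep only strict improvements
                pocket, pocket_err = list(w), e
    return pocket
```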
TARGET SWITCH ALGORITHM (Method for generating a partial classifier)

1. Set w = (−1, 0, ..., 0)^t, R = ∅.
2. Set W+ = Q+, W− = Q−.
3. Use a proper method to find the weight vector v of a near-optimal threshold neuron for the training set formed by the sets W+ and W−.
4. Find the minimum value μ_l^+ by applying Eq. (18) to the patterns of the set W+ (μ_l^+ ≤ μ_j^+ for every x_j^+ ∈ W+) and set v_0 = −μ_l^+.
5. Find the maximum value μ_k^− by applying Eq. (18) to the patterns of the set W− (μ_k^− ≥ μ_j^− for every x_j^− ∈ W−).
6. If μ_l^+ > μ_k^− go to Step 9 (a partial classifier is found).
7. Consider the subset V+ = {x_j^+ ∈ W+ | μ_j^+ ≤ μ_k^−} and find a pattern x_s^+ ∈ V+ such that ν_s ≥ ν_j for every x_j^+ ∈ V+, being
   ν_j = Σ_{i=1}^{n} x_{ji}^+ x_{ki}^−    (x_k^− is the pattern of W− associated with μ_k^−).
8. Remove the pattern x_s^+ from W+ and add it to W− (target switching). Go to Step 3.
9. If |W+| > |R| then set w = v and R = W+.
10. Set Q+ = Q+ \ W+. If Q+ = ∅, then output the threshold neuron having weight vector w; otherwise go to Step 2.

Figure 9 Procedure employed by the target switch algorithm to generate partial classifiers.
4. Oil Spot Algorithm

If the input patterns have binary components, we can employ some basic properties to obtain the hyperplane associated with a partial classifier for the sets Q+ and Q−. As previously noted, in the case D ⊆ {−1, +1}^n the binary vectors are placed at the vertices of an n-dimensional hypercube centered at the origin of the space ℝ^n. Two patterns x_j and x_k will be called contiguous if they are connected by a single edge of the hypercube; because they differ by only one component, it is possible to define the orientation of the linking edge through the quantity [72]

Σ_{i=1}^{n} (x_{ji} − x_{ki}),

which can be either positive (+2) or negative (−2). It can easily be seen that two edges are parallel if the corresponding pairs of vertices differ by the same component; moreover, two parallel edges are congruent if their orientations coincide. Finally, according to the classification induced by the sets Q+ and Q−, the edges that join two vertices belonging to opposite classes will be called critical. By using these definitions, it is possible to show that the sets Q+ and Q− can be separated by a threshold neuron if the following two conditions are satisfied [73]:

1. The patterns of Q+ must be the nodes of a connected subgraph of the hypercube {−1, +1}^n.
2. Two parallel critical edges must be congruent.

This property is used by the oil spot algorithm (OSA) [72] to construct partial classifiers; its implementation is shown in Fig. 10. Steps 3-7 have the aim of determining a subset R ⊆ Q+ of vertices of the hypercube which can be separated from the patterns of Q− through a threshold neuron. The method followed is to construct, step by step, a connected subgraph of vertices of the hypercube, checking at every iteration (step 6) that condition 2 mentioned previously is satisfied. The choice of the weight vector for the partial classifier (step 7) follows from geometrical considerations [72]. The value for the bias has been modified to prevent the separating hyperplane from containing patterns of either Q+ or Q−.

To increase the generalization ability of the resulting neural network, the following smoothing rule is used [72]: the set Q+ is enlarged by adding to it all the contiguous vertices that do not belong to Q−. In this way a precise output is assigned to some patterns not included in the training set, by following an approach similar to that used by the nearest neighbor method [54]. Nevertheless, the choice of the initial vertex (step 3) is crucial for the resulting generalization ability, because it determines the portion of the hypercube that will be analyzed for the construction of the set R. Its influence can be reduced by iterating the procedure in Fig. 10, as in the sequential constructive techniques previously described. Every time a set R is completed, its elements are removed from Q+ to avoid the achievement of the same threshold neuron. When this reduction leads to Q+ = ∅, the best partial classifier obtained (that associated with the set R having greatest size) is inserted in the neural network to be constructed.
OIL SPOT ALGORITHM (Method for generating a partial classifier)

1. Set w = (−1, 0, ..., 0)^t, R = ∅.
2. If Q+ = ∅, then output the threshold neuron having weight vector w.
3. Choose at random a pattern x ∈ Q+ and set R = {x}, Q+ = Q+ \ {x}.
4. Let V be the set of vertices which are contiguous with the patterns of R. If V ⊆ Q− go to Step 7.
5. Choose at random a pattern x ∈ V ∩ Q+. Let E_x (E_R) be the set of critical edges that have a vertex in x (in a pattern of R).
6. If there are two parallel edges of E_x and E_R which are not congruent, then set Q− = Q− ∪ {x} and go to Step 5; otherwise set R = R ∪ {x} and go to Step 4.
7. The weights w_1, ..., w_n for a threshold neuron that separates R from Q− are given by
   w_i = β ( X_i(+1) − X_i(−1) )    for i = 1, ..., n,
   where β is an arbitrary positive real value and X_i(c) is the number of patterns in R that have the ith component equal to c. The following formula provides the value for the bias w_0:
   w_0 = β − min_{x∈R} Σ_{i=1}^{n} w_i x_i.
Figure 10 Procedure employed by the oil spot algorithm to generate partial classifiers.
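Step 7 of Fig. 10 translates directly into code. The following Python fragment is a sketch of that single step (not of the whole OSA), with the bias computed according to the formula reported in the figure as reconstructed above.

```python
def oil_spot_weights(R, beta=1.0):
    """R: nonempty collection of binary patterns (tuples over {-1, +1}).
    Returns the bias w_0 and the weights w_1, ..., w_n of a threshold
    neuron built as in step 7 of Fig. 10."""
    n = len(next(iter(R)))
    # w_i = beta * (X_i(+1) - X_i(-1)), counting components over R
    w = [beta * (sum(1 for x in R if x[i] == 1)
                 - sum(1 for x in R if x[i] == -1)) for i in range(n)]
    # bias: beta minus the minimum weighted sum over the patterns of R
    w0 = beta - min(sum(wi * xi for wi, xi in zip(w, x)) for x in R)
    return w0, w
```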
B. HYPERPLANE CHOICE SET

When the input patterns have binary components (D ⊆ {−1, +1}^n), a fast training algorithm is available [58] for partial classifiers having the following activation function:

φ(x) = +1, if |Σ_{i=0}^{n} w_i x_i| ≤ δ; −1, otherwise.    (19)

Units of this kind are called window neurons because of the behavior shown by Eq. (19) (see Fig. 11). The real quantity δ is called the amplitude and is meaningful only from an implementation point of view. In fact, the considerations of the present section are still valid in the particular case δ = 0; however, due to the limited precision offered by a machine (either a computer or a dedicated support), the weighted sum of the inputs in Eq. (19) can move away from its theoretical value. The introduction of a small amplitude δ is therefore necessary to allow the practical use of the window neuron.

It can easily be seen that in the case δ = 0 the choice set L associated with a window unit is a hyperplane in the space ℝ^n, whose position is univocally determined by the weight vector w. Also with the activation function (19), generating a grandmother cell for any pattern x is straightforward: it is sufficient to set

w_0 = −n,    w_i = x_i for i = 1, ..., n.
Figure 11 Behavior of the activation function of window neurons.
The construction described in Section V.A can thus be used to show that a two-layer perceptron containing only window neurons in the hidden layer is able to realize any Boolean function. From this theoretical result we can infer a certain generality of the window neuron, which motivates its employment in the solution of real-world problems.

Even in this case, the proposed training algorithm gives the weight vector w for the partial classifier (2) to be inserted in the resulting two-layer perceptron. From this unit we can obtain directly the corresponding vector v for the window neuron to be included in the cascade net architecture (under the same choice set L_j). In particular, the choice (13) introduced for the threshold activation function is still valid in this situation. As a matter of fact, Eqs. (15) and (16) also hold for window neurons, thus proving the correctness of the weight vector v defined by Eq. (13). Furthermore, we can note that it is possible to employ the simplified expression (14) for the components v_i, with i = n + 1, ..., n + j − 1, because the training set has binary inputs.

1. Sequential Window Learning Algorithm

The interest in window neurons is motivated by the existence of a fast and efficient learning algorithm that allows the treatment of training sets containing a high number of samples. Before describing this technique, let us introduce some theoretical results that prove the validity of the proposed approach. The following two theorems play a crucial role in the construction of the partial classifier.

THEOREM 8. Given a set V containing q linearly independent input patterns x_j, with j = 1, ..., q, in the space ℝ^{n+1}, it is always possible to construct a window neuron that provides a desired output y_j for each vector x_j ∈ V.
Proof. Consider the matrix A, having dimension q × (n + 1), whose rows are given by the input patterns x_j, j = 1, ..., q. Because these vectors are linearly independent among them, a nonsingular minor B with size q can be extracted from A. Suppose, without loss of generality, that B contains the first q columns of A (i = 0, 1, ..., q − 1); thus, the following system of linear algebraic equations:

Bw = z,    (20)

where the jth component of the vector z is given by

z_j = 1 − y_j − Σ_{i=q}^{n} w_i x_{ji},    for j = 1, ..., q,    (21)

has a unique solution in ℝ^q. If the values w_i, for i = q, ..., n, in (21) are arbitrarily chosen, by solving Eq. (20) we obtain the weight vector for the desired window neuron. As a matter
of fact, we have from Eq. (21)

Σ_{i=0}^{n} w_i x_{ji} = Σ_{i=0}^{q−1} w_i x_{ji} + Σ_{i=q}^{n} w_i x_{ji} = 1 − y_j

and the application of the activation function (19) gives

φ( Σ_{i=0}^{n} w_i x_{ji} ) = φ(1 − y_j) = y_j    for j = 1, ..., q,

if 0 < δ < 2. ■

THEOREM 9. If the arbitrary weights w_i used in Eq. (21) for the generation of the vector z are linearly independent in ℝ as a Q-vector space, then the window neuron obtained by Eq. (20) when x_j ∈ Q+, for j = 1, ..., q, provides output +1 for all the vectors that are linearly dependent with x_1, ..., x_q and output −1 for any other input pattern.
Proof. Because x_j ∈ Q+, the weights w_i of the resulting window neuron must satisfy the following system of linear algebraic equations:

Σ_{i=0}^{n} w_i x_{ji} = 0    for j = 1, ..., q.    (22)

Suppose again that the nonsingular minor B in Eq. (20) contains the components i = 0, 1, ..., q − 1 of the input patterns x_1, ..., x_q and consider a vector x different from them. The q + 1 vectors

(x_0, x_1, ..., x_{q−1})^t, (x_{10}, x_{11}, ..., x_{1,q−1})^t, ..., (x_{q0}, x_{q1}, ..., x_{q,q−1})^t

are linearly dependent in Q^q, where Q is the rational field (t denotes the transpose operation). Hence, there are constants λ_1, ..., λ_q ∈ Q, some of which are nonnull, such that

x_i = Σ_{j=1}^{q} λ_j x_{ji}    for i = 0, 1, ..., q − 1    (23)

and consequently, by using Eq. (22), we can write

Σ_{i=0}^{n} w_i x_i = Σ_{i=0}^{q−1} w_i x_i + Σ_{i=q}^{n} w_i x_i = Σ_{j=1}^{q} λ_j Σ_{i=0}^{q−1} w_i x_{ji} + Σ_{i=q}^{n} w_i x_i = Σ_{i=q}^{n} w_i ( x_i − Σ_{j=1}^{q} λ_j x_{ji} ).    (24)

Now we have two different situations. If the pattern x is linearly dependent with x_1, ..., x_q, the right-hand side of Eq. (24) vanishes for some rational constants
λ_j, j = 1, ..., q, that satisfy Eq. (23). Hence, the corresponding output of the window neuron will be +1. In the opposite case, if x, x_1, ..., x_q are linearly independent in Q^{n+1}, then there is at least one index i such that

x_i − Σ_{j=1}^{q} λ_j x_{ji} ≠ 0,

with q ≤ i ≤ n. However, the quantities x_i − Σ_{j=1}^{q} λ_j x_{ji} are rational numbers; so, if the weights w_i, for i = q, ..., n, are linearly independent in ℝ as a Q-vector space, we obtain from Eq. (24)

Σ_{i=0}^{n} w_i x_i = Σ_{i=q}^{n} w_i ( x_i − Σ_{j=1}^{q} λ_j x_{ji} ) ≠ 0.
Hence, the corresponding output of the window neuron will be −1, assuming that the amplitude δ is small enough. ■

A possible choice for the arbitrary weights w_i that fulfills the hypothesis of Theorem 9 is the following [74]:

w_i = √γ_i    for i = 0, 1, 2, ...,    (25)

where the γ_i are positive square-free (not divisible by a perfect square) integers, sorted in increasing order (by definition, γ_0 = 1). The first 50 values of the constants γ_i are shown in Table I.
Table I  First 50 Values of the Integer Constants γ_i

i      1   2   3   4   5   6   7   8   9  10
γ_i    1   2   3   5   6   7  10  11  13  14

i     11  12  13  14  15  16  17  18  19  20
γ_i   15  17  19  21  22  23  26  29  30  31

i     21  22  23  24  25  26  27  28  29  30
γ_i   33  34  35  37  38  39  41  42  43  46

i     31  32  33  34  35  36  37  38  39  40
γ_i   47  51  53  55  57  58  59  61  62  65

i     41  42  43  44  45  46  47  48  49  50
γ_i   66  67  69  70  71  73  74  77  78  79
Now let us recall the definition of the VC dimension [32]: DEFINITION 5. Let C be a class of Boolean functions and S a set of points. We say C shatters S if C induces all Boolean functions on S. The VC dimension of C is the size of the largest set C shatters.
Then the following important result is obtained directly from Theorems 8 and 9:

COROLLARY 1. The VC dimension of the class of window neurons is equal to n + 1.

Proof. It can be verified directly that the following group of n + 1 vertices of the n-dimensional hypercube:

x_1 = (+1, +1, +1, +1, ..., +1, +1),
x_2 = (+1, −1, +1, +1, ..., +1, +1),
x_3 = (+1, +1, −1, +1, ..., +1, +1),
...
x_n = (+1, +1, +1, +1, ..., −1, +1),
x_{n+1} = (+1, +1, +1, +1, ..., +1, −1)

forms a set of n + 1 linearly independent vectors in ℝ^{n+1}. Thus, according to Theorem 8, there is a window neuron that realizes any classification of these n + 1 input patterns. Hence, the required VC dimension will not be less than n + 1. Now consider a set V containing q > n + 1 input patterns x_j. Because they are linearly dependent in the space ℝ^{n+1}, Theorem 9 ensures that the outputs provided by any window neuron cannot be freely chosen. Thus, the required VC dimension is exactly n + 1. ■

As it has been shown that the VC dimension of the class of threshold neurons is also equal to n + 1 [75], we can conclude that the window and threshold units present, at least in principle, the same complexity. Theorems 8 and 9 also allow us to extend the concept of the grandmother cell to the case of window neurons.

COROLLARY 2. It is always possible to find a window neuron that separates two patterns x_1 and x_2 from the remaining vertices of the hypercube {−1, +1}^n.
Proof. If x_1 and x_2 are two distinct patterns, then there is at least one component i such that x_{1i} ≠ x_{2i}. Hence, the determinant of the minor

B = ( x_{10}  x_{1i} ; x_{20}  x_{2i} )
is nonnull; by employing the choice (25) and Eqs. (20) and (21), it is then always possible to obtain the weights w_i for the desired window neuron. ■

Thus, we can always associate a grandmother cell having a window activation function with any pair of input patterns.

Finally, Theorem 9 allows us to formulate a greedy technique for the construction of a partial classifier having a window activation function; this method tries to maximize the number of patterns in Q+ with correct output. The procedure employed is shown in Fig. 12 and forms the kernel of the sequential window learning (SWL) algorithm [58]. The set R contains the linearly independent patterns that form the minor B employed in Eq. (20) to obtain the weight vector for the desired window neuron.
SEQUENTIAL WINDOW LEARNING ALGORITHM (Method for generating a partial classifier)

1. Choose at random two patterns x_1, x_2 ∈ Q+ and obtain the weight vector w for the corresponding grandmother cell. Set R = {x_1, x_2}, Q+ = Q+ \ {x_1, x_2}.
2. Through the application of (20) and (21) find all the patterns of Q+ that lead to a partial classifier once added to R. Let V be the set of these patterns.
3. If V = ∅, go to Step 6.
4. Add to R the element x ∈ V that leads to the generation of the weight vector w for a window neuron that correctly classifies the greatest number of patterns in Q+. Let W be the set of these patterns.
5. Set Q+ = Q+ \ W. If Q+ ≠ ∅, go to Step 2.
6. Output the weight vector w for the required window neuron.

Figure 12 Procedure employed by the sequential window learning algorithm to generate partial classifiers.
The choice (25) is used at steps 1 and 2 to compute the right-hand side z through Eq. (21). At every iteration of the procedure, the set R is enlarged (step 4) by adding to it the vector x ∈ Q+ that maximizes the number of input patterns correctly classified by the current window neuron. In this way we try to obtain a weight vector w that satisfies the greatest number of samples in the given training set. The algorithm can stop either at step 3, if it is impossible to add other patterns to R, or at step 5, if a single window neuron is able to provide the correct output for all the patterns of Q+ and Q−. Naturally, we can iterate the procedure in Fig. 12 starting from different initial patterns x_1, x_2, so as to perform a further maximization of the number of patterns correctly classified. The corresponding changes to the SWL can be made by following the approach employed in the irregular partitioning method (Fig. 7).
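A compact Python sketch of the construction underlying steps 1 and 2, based on Eqs. (20), (21), and (25), is given below. It is not the original implementation: for simplicity it assumes that the first q columns of the extended pattern matrix already form a nonsingular minor B, which in general must be searched for, and it truncates the list of square-free constants.

```python
import numpy as np

SQUARE_FREE = [1, 2, 3, 5, 6, 7, 10, 11, 13, 14, 15, 17, 19, 21, 22, 23]

def window_neuron(X, y):
    """X: q linearly independent binary patterns; y: desired outputs in
    {-1, +1}. Returns the full weight vector w (bias in w[0])."""
    A = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])  # q x (n+1)
    q, n1 = A.shape
    w = np.zeros(n1)
    w[q:] = np.sqrt(SQUARE_FREE[:n1 - q])        # arbitrary weights, Eq. (25)
    z = (1 - np.asarray(y, float)) - A[:, q:] @ w[q:]            # Eq. (21)
    w[:q] = np.linalg.solve(A[:, :q], z)                         # Eq. (20)
    return w

def window_output(w, x, delta=1.0):
    s = w[0] + np.dot(w[1:], x)
    return 1 if abs(s) <= delta else -1          # Eq. (19), with 0 < delta < 2
```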
VI. HAMMING CLUSTERING PROCEDURE

As pointed out in Section III, the addition of hidden units to the resulting neural network through the execution of a sequential constructive method involves the training of a reduced portion of the weight matrix. In many cases, this prevents us from determining some important features of the function f that generated the given training set S; consequently, the generalization ability of the final connectionist model is reduced.

When the input patterns have binary components, a way to overcome this problem is provided by the procedure of Hamming clustering (HC) [58, 76]. With this approach we can find redundant input variables that do not affect the behavior of one or more outputs. The removal of the corresponding connections therefore allows a lowering of the number of weights in the neural network and consequently of its complexity.

HC is based on the following observation [77]: simple statistical methods, such as the nearest neighbor algorithm [54], show a high generalization ability in many real situations. Hence, the analysis of the local properties of the training set can be used to accelerate and improve the learning process. If the input patterns have binary components, a simple way to implement this suggestion is to gather the samples belonging to the same class that are close to each other according to the Hamming distance. This produces some clusters in the input space that determine the extension of each class, thus providing an output for the patterns that are not contained in the training set (generalization). Such an approach can also lead to a simplification of the sets Q+ and Q− (used for the generation of hidden units), which increases the convergence speed of the training algorithm.
HC recalls the methods employed in the synthesis of digital networks [78], but shows a low computational cost that allows the treatment of training sets containing tens of thousands of samples. To describe the algorithm employed, let us introduce the following definition, widely used in the design of logical networks.

DEFINITION 6. Let x be a binary vector in {−1, +1}^n and I a subset containing m of its components. The set C_m(x, I), obtained by assigning all the possible values to the m components in I, is called an m-cube.

For example, if we take x = (−1, +1, −1, −1, +1) and I = {2, 4}, we obtain the 2-cube C_2(x, I) = {(−1, −1, −1, −1, +1), (−1, −1, −1, +1, +1), (−1, +1, −1, −1, +1), (−1, +1, −1, +1, +1)} by considering all the possible pairs of values that can be assumed by the second and the fourth components of the vector x. Note that the remaining components maintain the same values in every element of the set C_2(x, I).

Now, it is always possible to construct a threshold or window neuron that separates an m-cube C_m(x, I) from the other vertices of the hypercube {−1, +1}^n. It is sufficient to employ the following weights:
Wi = Xi for / ^ /,
wi
=OforieI
in the case of a threshold unit and wo = m —n,
Wi = Xi for / ^ /,
Wi = 0 for / e /,
when the activation function is given by Eq. (19). Hence, the connections associated with the components belonging to the set / can be removed, because they do not affect the output of the neuron. HC implements this consideration to simplify the construction of the partial classifier for the sets Q~^ and Q~. It generates a group of m-cubes C^Cx, / ) , with the same component set /, starting from the patterns x e Q^', every m-cube must not contain elements of Q~ so as to realize a clusterization of the set Q^. A possible algorithm that pursues this result is shown in Fig. 13. The procedure followed is very simple: it adds to the set / the components that maximize (at each iteration) the number of m-cubes C,„(x, /) generated by the elements of Q^ and not containing patterns ofQ~. The set W to which the vectors X belong decreases with the number of iterations and causes the termination of the clustering process when | W| falls below a given threshold a. In the simulations we have set a = (q~^/2, 2n), where q'^ is the number of elements in Q^. After the application of HC, the construction of the partial classifier to be inserted in the neural network follows these steps:
PROCEDURE OF HAMMING CLUSTERING
1. Set m = 0, I = ∅, W = Q+.
2. For every i ∉ I let V_i be the subset of W containing all the input patterns x for which C_{m+1}(x, I ∪ {i}) ∩ Q− = ∅. Let i* be the component associated with the set V_i having maximum size.
3. If |V_{i*}| > σ then set I = I ∪ {i*}, W = V_{i*}, m = m + 1 and go to Step 2.
4. Remove the components belonging to I from the sets Q+ and Q− for the training of the partial classifier.
Figure 13 Algorithm employed by the procedure of Hamming clustering.
• Set w_i = 0 for every i ∈ I. In fact, any change in these components of the patterns belonging to W does not lead to a vector of Q−, and vice versa.
• Train the weights w_i, for i ∉ I, on the grounds of the sets R+ and R− obtained by removing the components of I from Q+ and Q−, respectively.

In this way we achieve a reduction in the dimension of the input space, which improves the convergence speed of the training algorithm. Furthermore, no modifications are required to the procedures of the previous sections for the generation of partial classifiers. This leads to a perfect modularity in the construction of the multilayer perceptron. Other choices are possible at the same computational cost; for example, we can consider, for the generation of an m-cube, only the patterns whose Hamming distance from the set Q+ is not greater than that from the set Q−. However, the efficiency of a given rule is measured by the generalization ability of the resulting neural network and depends on the problem to be solved.
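The greedy loop of Fig. 13 is compact enough to sketch directly. The following Python fragment is a minimal illustration under our own conventions; the names hamming_clustering, Q_pos, Q_neg, and sigma are ours, and this is not the author's implementation. The admissibility test simply checks that no pattern of Q− agrees with x on every component outside I ∪ {i}.

```python
import numpy as np

def hamming_clustering(Q_pos, Q_neg, sigma):
    """Greedy m-cube growth in the spirit of Fig. 13 (a sketch).

    Q_pos, Q_neg: arrays of shape (q, n) with entries in {-1, +1}.
    Returns the set I of components whose connections can be removed.
    """
    n = Q_pos.shape[1]
    I = set()          # components already absorbed into the m-cubes
    W = Q_pos.copy()   # patterns still generating admissible cubes
    while True:
        best_i, best_V = None, None
        for i in range(n):
            if i in I:
                continue
            # x generates an admissible (m+1)-cube iff no pattern of Q_neg
            # agrees with x on every component outside I U {i}
            free = [j for j in range(n) if j not in I and j != i]
            V = [x for x in W
                 if not any(np.array_equal(x[free], z[free]) for z in Q_neg)]
            if best_V is None or len(V) > len(best_V):
                best_i, best_V = i, V
        if best_V is None or len(best_V) <= sigma:
            break                      # step 3 fails: stop clustering
        I.add(best_i)
        W = np.array(best_V)
    return I
```

The returned set I identifies the input components whose connections can be set to zero before the remaining weights are trained on the reduced sets R+ and R− (step 4 of Fig. 13).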
VII. EXPERIMENTAL RESULTS

The sequential constructive techniques described in Section V have been extensively tested on several benchmark problems to evaluate their performance. In particular, three basic properties have been examined in detail:
• complexity of the resulting configuration,
• convergence speed of the learning,
• generalization ability of the trained neural network.
A reference value for each of these quantities has been obtained by the application of the back-propagation algorithm (BPA) to every set of trials (except those regarding random Boolean functions). The implementation adopted is an optimized version of the standard procedure and makes use of the acceleration suggested by
Vogl et al. [14]. As we shall see, such an implementation presents a high convergence speed, comparable with that shown by some sequential constructive methods. Nevertheless, the resulting execution times for the BPA do not take into account the preliminary runs needed for the determination of the network architecture and are consequently underestimated. To obtain a better evaluation of the complexity and the generalization ability of the neural networks generated by each constructive algorithm, we have performed two different groups of tests. In the first group (exhaustive learning), complete training sets deriving from artificial Boolean functions are presented to every method and the average number of hidden neurons contained in the final two-layer perceptron is considered. No test phase takes place in this case. The second group of trials (generalization tests) has the aim of determining the performance of sequential constructive methods in the solution of practical problems. In these benchmarks, the generalization ability of the resulting neural networks will be analyzed in detail through the employment of both artificial tests and data sets deriving from real-world applications. Thirty runs have been performed for every trial and every learning algorithm so as to obtain sufficient statistics of the measured quantities. Apart from the exhaustive training on the parity and the symmetry functions, every run uses different training and test sets for the construction of the two-layer perceptron. The implementation of the sequential constructive methods strictly follows the procedures reported in Figs. 7-12; the values of the parameters for every technique have been kept fixed throughout all the trials. In particular, we have noted that in the irregular partitioning algorithm, the stopping criterion employed at step 2 of Fig. 7 (W = ∅) leads to huge execution times for training sets containing more than 100 samples. Thus, we have imposed an upper bound n_W = 10 on the number of iterations of steps 3-12; this change provides a good compromise between the convergence speed and generalization ability of the resulting configuration. A similar approach has also been used in the other sequential constructive methods; the only exception is the carve algorithm, whose duration depends on the number n_r of iterations of steps 2-3 (Fig. 8) and on the number n_t of different initial hyperplanes chosen at step 1. For these quantities the values n_t = 50 and n_r = 50 have been chosen.
At step 3 of the target switch algorithm (Fig. 9), the weight vector w for a near-optimal threshold neuron is found by adopting the thermal perceptron learning rule (TPLR) proposed by Frean [79]. Although the optimality of this method is not theoretically ensured, it generally requires a lower execution time than the pocket algorithm with ratchet to reach an equivalent configuration. The TPLR follows the same approach employed by the perceptron algorithm, but the updating of the weight vector w is performed by using the following rule instead of Eq. (17):

w_i = w_i + η y x_i e^(−|φ|/T),    (26)

where

φ = Σ_{i=0}^{n} w_i x_i.
It can easily be seen that the introduction of the factor e^(−|φ|/T) allows us to treat samples of the training set that present different values of φ in a different way. In fact, in the standard perceptron algorithm, the change of the weight vector w deriving from an input pattern x with high |φ| often leads to an increase in the total number of misclassifications. To avoid this undesirable effect, the TPLR biases the weight changes toward correcting errors for which φ is close to zero. The parameter T, called temperature, controls the behavior of the TPLR: if T is much higher than |φ|, the rules (17) and (26) become equivalent and all the samples in the training set are treated in the same way. On the other hand, if T is very small, the weights are frozen at their current values. In practice, an annealing of the temperature T is performed by gradually reducing its value from a high initial T_0 to zero. With this trick, the weight vector w can stabilize around near-optimal configurations for a given problem. A further improvement in the convergence of the TPLR can also be obtained by decreasing at the same time the learning rate η from 1 to 0. In our simulations of the TSA, n_t = 1000 iterations of the TPLR have always been performed, starting from an initial temperature T_0 = 5n, where n is the number of inputs. Some tests can produce equivalent neural networks even with lower values of n_t, but, as previously noted, every parameter has been kept constant throughout all the trials. Consequently, the reported execution times for the TSA are often excessive for the treatment of the corresponding problems. The TPLR has also been employed as an alternative to linear programming (LP) in the IPA to evaluate the properties of the two different implementations proposed in the literature. The same values T_0 = 5n and n_t = 1000 have been used for the initial temperature and the number of iterations. In the sequential window learning algorithm, the search for all the patterns in Q+ that lead to a partial classifier once added to R (step 2 of Fig. 12) can lead
to high computing time when training sets containing many patterns are considered. Hence, an upper bound n_V = 20 for the dimension of the set V has been introduced. All the execution times refer to a DECStation 3000/600 with 64 MB RAM under the Digital UNIX 4.0 operating system.
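Because the TPLR drives several of the methods compared below, a compact sketch may help fix ideas. The following Python fragment is our own illustrative rendering of the thermal update of Eq. (26) with a linear annealing of T and η, not the code used in the experiments; the sampling scheme, initialization, and gating on misclassified samples (as in Frean's rule) are assumptions.

```python
import numpy as np

def thermal_perceptron(X, y, n_t=1000, T0=None, rng=None):
    """Sketch of the thermal perceptron rule, Eq. (26).
    Assumptions: X has a leading bias component x_0 = 1 and entries in
    {-1, +1}; y holds targets in {-1, +1}; names and schedule are ours."""
    rng = rng or np.random.default_rng(0)
    P, n = X.shape
    T0 = T0 if T0 is not None else 5 * (n - 1)   # T_0 = 5n, n = no. of inputs
    w = rng.normal(scale=0.1, size=n)
    for k in range(n_t):
        frac = 1.0 - k / n_t
        T, eta = T0 * frac, frac          # anneal T and learning rate to 0
        s = rng.integers(P)               # visit one random sample per step
        phi = w @ X[s]
        if y[s] * phi <= 0:               # misclassified: thermal update
            w += eta * y[s] * X[s] * np.exp(-abs(phi) / max(T, 1e-12))
    return w
```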
A. EXHAUSTIVE LEARNING

The first three groups of trials are devoted to evaluating the architecture complexity, measured by the number of hidden neurons included in the two-layer perceptrons generated by the various sequential constructive methods. The training sets used in these runs refer to artificial Boolean functions and contain all the possible (s = 2^n) samples that can be extracted from the corresponding truth tables. The Hamming clustering (HC) procedure is never applied here because no generalization is required.

1. Parity Function

The output of the parity function is +1 if and only if the number of components with value +1 in the input pattern is odd. This benchmark has been widely used in the literature to evaluate the performance of training algorithms for multilayer perceptrons because it cannot be realized by a single threshold neuron [65]. It has been shown that a two-layer neural network containing at least n (number of inputs) threshold units [5] or ⌈(n + 1)/2⌉ window neurons [58] can realize the parity function, ⌈x⌉ being the smallest integer not less than x. Extensive simulations of sequential constructive algorithms have been performed to determine the average complexity of the resulting neural networks for n = 2, 3, ..., 9. The corresponding results are shown in Fig. 14, together with the average CPU time employed by each sequential constructive method. We have also included the computational cost needed by a typical BPA to train a minimal neural network (containing n hidden units). As one can note, only the IPA with LP and the SWL are always able to obtain the optimal configuration when the number n of inputs increases. Nevertheless, whereas the SWL employs a lower CPU time than the BPA, the computational cost of the IPA with LP becomes excessive for n = 9 (almost 30 hours for a single trial). The OSA is the fastest method but leads to neural networks containing a high number of hidden neurons.

2. Symmetry Function

In the symmetry function, the output is +1 if and only if the binary string associated with the current input pattern is symmetric (with respect to its center).
Figure 14 Number of hidden units and average CPU time (s) for the neural networks performing the parity function.
It has been shown that two hidden threshold neurons are sufficient to realize this Boolean function [5], whereas a single window unit can provide the correct target output [58]. Hence, both in this case and in the previous one, the minimal configuration containing window neurons presents a lower complexity. The simulations performed for n = 2, 3, ..., 9 have produced neural networks whose average numbers of hidden neurons are reported in Fig. 15. The average execution times employed by each algorithm are also included.
Figure 15 Number of hidden units and average CPU time (s) for the neural networks performing the symmetry function.
As in the previous test, only the IPA with LP and the SWL are always able to obtain the optimal configuration. The computational cost of the IPA with TPLR is comparable with that shown by the BPA. The SWL requires the lowest CPU time and generates the smallest two-layer perceptron for all the values of the number n of inputs. When threshold networks are considered, the CA represents a good compromise between training speed and efficiency.
3. Random Boolean Functions

A third group of trials has been devoted to determining the average complexity of a neural network performing a Boolean function whose output is randomly generated with uniform probability. This benchmark has been widely employed in the literature to compare different constructive training methods [80, 57, 81]. Unfortunately, a general upper bound on the number of hidden neurons contained in a minimal two-layer perceptron that realizes this kind of function is not known. Thus, a valid comparison with the computational cost of the BPA is not possible.
Figure 16 Number of hidden units and average CPU time (s) for the neural networks performing a random Boolean function.
Figure 16 shows the complexity of the neural networks generated by each sequential constructive algorithm for n = 2, 3, ..., 9, and the average CPU time needed for each training. Also in this case the employment of window units allows us to minimize the number of hidden neurons contained in the resulting two-layer perceptrons. The computational cost of the SWL is again low; only the OSA employs a smaller CPU time for each training, but it leads to neural networks with high complexity. The CA again achieves a good compromise between convergence speed and efficiency.
B. GENERALIZATION TESTS

The generalization ability of neural networks generated by sequential constructive methods has been evaluated by considering six different applications. Two of them (symmetry and Monk's problems) have been artificially built and allow a direct comparison with other training algorithms, because several simulation results on these two benchmarks are available in the literature. The other four groups of tests derive from a collection of data sets distributed by the machine learning group of the University of California at Irvine [82]. These real-world problems have been previously considered in [56] and hence offer an important validation for our implementation of the IPA. Moreover, the execution of the BPA again provides a valid reference measure for the results obtained. In these tests, the influence of HC on the complexity and the generalization ability of the configurations generated by the OSA and SWL will also be analyzed. However, all the applications considered (except the symmetry problem) have real or discrete inputs, so HC and training algorithms that build binary neural networks, like the OSA and SWL, cannot be applied directly. The use of proper transformations allows the removal of this obstacle but, as we shall see later, leads to a degradation of the resulting generalization ability. The complexity of the resulting neural networks will be measured through the number of nonnull weights contained in each of them; in this way we can better evaluate the influence of the employment of HC.
1. Mirror Symmetry Detector

The first series of generalization tests refers to the symmetry function (Section VII.A.2) with n = 30. The training and test sets employed in each trial always contain samples that are equally subdivided between the two classes (symmetrical and nonsymmetrical patterns); hence, they allow an unbiased analysis of the properties of the resulting neural networks.
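A balanced sample generator for this benchmark is easy to sketch. The fragment below is our own construction, not the one used in the experiments; the name mirror_symmetry_set and the rejection step are assumptions. It draws half of the s patterns as exact mirror images and half at random, as the balanced-classes setup described above requires.

```python
import numpy as np

def mirror_symmetry_set(s, n=30, rng=None):
    """Balanced {-1,+1}^n training set for the mirror symmetry problem:
    half the patterns symmetric about the center (target +1), half not."""
    rng = rng or np.random.default_rng(0)
    X, y = [], []
    while len(X) < s:
        if len(X) < s // 2:
            half = rng.choice([-1, 1], size=n // 2)
            x = np.concatenate([half, half[::-1]])   # symmetric pattern
            t = 1
        else:
            x = rng.choice([-1, 1], size=n)
            if np.array_equal(x, x[::-1]):           # reject rare symmetric draws
                continue
            t = -1
        X.append(x); y.append(t)
    return np.array(X), np.array(y)
```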
Table II. Number of Nonnull Weights for the Mirror Symmetry Problem

Algorithm        s = 100   s = 200   s = 400   s = 600
BPA                62.00     62.00     62.00     62.00
IPA with LP        62.00     62.00     62.00     62.00
IPA with TPLR      88.87    159.13    373.03    140.53
CA                177.73    272.80    415.40    512.53
TSA                77.50    116.77    706.80    125.03
OSA               647.43   1087.90   1675.80   2043.73
OSA with HC        95.57    126.57    135.57    141.87
SWL                30.00     30.00     30.00     30.00
SWL with HC        35.37     38.40     43.67     50.63
Every test set is formed by 4000 samples, whereas the number s of patterns in the training set assumes the values s = 100, 200, 400, 600. Tables II and III show, respectively, the average number of nonnull weights and the percentage of the test set correctly classified. The average CPU time of each training algorithm is reported in Table IV. Here, the SWL achieves excellent results in a reduced computing time: it is always able to obtain the optimal window neuron performing the mirror symmetry function even when the training set contains only 100 samples. Obviously, the use of HC cannot improve this result, but it allows the reduction of the number of
Table III. Generalization Ability for the Mirror Symmetry Problem

Algorithm        s = 100   s = 200   s = 400   s = 600
BPA               70.72%    86.10%    92.63%    94.60%
IPA with LP       80.25%    90.98%    96.12%    98.74%
IPA with TPLR     67.60%    86.55%    94.46%    96.12%
CA                59.16%    70.79%    81.21%    84.43%
TSA               71.09%    85.98%    90.63%    94.03%
OSA               69.55%    75.77%    82.07%    85.14%
OSA with HC       61.58%    73.75%    85.67%    89.30%
SWL              100.00%   100.00%   100.00%   100.00%
SWL with HC       67.82%    82.98%    90.36%    92.38%
Table IV. Average CPU Time (s) for Solving the Mirror Symmetry Problem

Algorithm        s = 100   s = 200   s = 400    s = 600
BPA                12.27     16.18    103.55     169.12
IPA with LP        98.91   1670.17  23908.64   97923.01
IPA with TPLR      10.45     46.20    209.98     545.72
CA                171.76    464.10   1286.53    2413.48
TSA                 4.13     16.04    242.80    1258.31
OSA                 4.45      5.98      9.64      13.65
OSA with HC         4.01      5.32     13.53      25.55
SWL                 5.43      7.13     10.76      14.85
SWL with HC        11.18     12.24     19.52      32.01
connections in the networks generated by the OSA. In this case the generalization ability is also increased when s = 400, 600. Among the sequential constructive algorithms for two-layer perceptrons with threshold units, the IPA achieves the best performances, but its computational burden is very high if LP is employed for the training of hidden neurons.

2. Monk's Problems

Three artificial classification problems have been proposed in [83] as benchmarks for training algorithms. They can be viewed as recognition tasks in a robot domain, where every instance is described by six attributes, which are listed in
Table V. Attributes and Possible Values for Monk's Problems

Attribute       Possible values
Head shape      Round, square, octagon
Body shape      Round, square, octagon
Is smiling      Yes, no
Holding         Sword, balloon, flag
Jacket color    Red, yellow, green, blue
Has tie         Yes, no
Table V, along with their possible values. With this characterization 432 different robots can be obtained. The aim of each proposed benchmark is the determination of a general classification rule starting from a limited set of samples. The first two Monk's problems are noise-free, whereas in the third case the outputs of the training set can undergo small changes from their correct value. This last test can therefore provide a measure of the robustness of sequential constructive methods. Because the inputs are discrete, a proper transformation must be applied to allow the employment of the SWL and OSA. The resulting binary training sets contain 15 inputs, each of which is associated with the presence of a particular attribute in the corresponding robot. The results obtained for the complexity and the generalization ability are shown in Tables VI and VII, respectively. The average CPU time for every trial is reported in Table VIII. The employment of HC allows the achievement of good performances for both the SWL and the OSA: the resulting neural networks show high generalization ability and fair robustness (Monk's problem 3). Furthermore, the low computational cost of HC allows the reduction of the total CPU time in all three tests. Interesting results are also offered by the CA, which is again a good compromise between training speed and efficiency.

3. Real-World Problems

An interesting analysis of the properties of sequential constructive methods can be obtained through their application to real-world problems. Among the several benchmarks maintained at the UCI machine learning repository [82], we have considered the following four:
Table VI. Number of Nonnull Weights for Monk's Problems

Algorithm        Problem 1  Problem 2  Problem 3
BPA                  21.00      42.00      21.00
IPA with LP          25.20      59.07      27.80
IPA with TPLR        34.07      87.27      36.17
CA                   38.20      77.93      37.33
TSA                  70.93     147.70      75.60
OSA                 373.37    1008.80     347.47
OSA with HC          13.17      74.20      41.43
SWL                  38.13      16.00      67.57
SWL with HC          10.97      17.67      30.17
Table VII. Generalization Ability for Monk's Problems

Algorithm        Problem 1  Problem 2  Problem 3
BPA                 95.74%     84.31%     90.22%
IPA with LP         88.87%     83.67%     88.19%
IPA with TPLR       90.30%     83.08%     89.26%
CA                  87.11%     82.31%     88.87%
TSA                 84.73%     82.72%     83.42%
OSA                 81.78%     74.70%     80.76%
OSA with HC         99.72%     91.42%     91.54%
SWL                 95.66%    100.00%     83.36%
SWL with HC        100.00%     98.58%     92.18%
1. Glass identification (GL): The correct identification of types of glass has great importance in criminological investigation. Starting from input patterns containing eight components measuring the concentration of a given element (Mg, Na, Al, ...) and a component associated with the refractive index, we want to determine whether a given piece of glass is float-processed or not.
2. Iris data set (IR): Three different types of iris plant, virginica, versicolor, and setosa, must be recognized from the length and width of the sepal and petal. One class (setosa) is linearly separable from the other two; the latter are not
Table VIII. Average CPU Time (s) for Solving Monk's Problems

Algorithm        Problem 1  Problem 2  Problem 3
BPA                   8.30      36.33       4.15
IPA with LP          84.83     246.67      59.28
IPA with TPLR         6.03      16.89       5.74
CA                    4.37       9.89       3.52
TSA                   8.81      27.49       6.73
OSA                   1.12       2.31       0.92
OSA with HC           0.46       0.81       0.51
SWL                  17.02       1.62      26.53
SWL with HC           0.46       1.07       1.12
separable from each other. It is a benchmark widely used in the literature [84] to test pattern recognition methods.
3. Chess endgame (CH): In this problem, the goal is to determine whether or not a given chess board configuration is a winning position for the white player [85]. White has a king and a rook, whereas black has a king and a pawn which is ready to be promoted to a queen. Every input pattern contains 35 binary components and one ternary component describing a board position of a chess endgame.
4. United States congressional voting records database (V0 and V1): This benchmark contains the results of 16 key votes for each member of the U.S. House of Representatives (V0). Every input can assume three values: "yea", "nay", or "?", where this last value is taken when a congressman did not vote, voted present, or voted present to avoid conflict of interest. This is a two-class problem because a congressman is either a Democrat or a Republican. The V1 data set is obtained from V0 by deleting the most informative input (vote on physician fee freeze), which makes the problem harder.

The set of samples available for each of these benchmarks has been subdivided in the following way: two-thirds of the patterns (randomly chosen in each of the 30 trials) form the training set, whereas the remaining one-third has been used to test the generalization ability of the resulting neural networks. In all the groups of trials, the application of the OSA and SWL needs a proper transformation of the input patterns. In the case of GL and IR, the real attributes have been converted by employing a Gray code that holds the same resolution as that of the original data. The equivalent binary patterns contain 79 (GL) and 22 (IR) inputs, respectively. For the data sets CH, V0, and V1, the same conversion used for Monk's problems has been adopted; the resulting binary samples have 37 (CH), 48 (V0), and 45 (V1) components. As one can note, these transformations greatly increase the dimension of the input space, making the training process more difficult; consequently, the performances of the OSA and SWL will be poorer. The average complexity of the resulting neural networks and the associated generalization ability are reported in Tables IX and X, respectively. The average CPU time for every trial is reported in Table XI. The BPA has been included again to provide an objective evaluation measure; it should be remembered, however, that the computational costs of the BPA do not take into account the preliminary runs needed for the determination of the number of hidden neurons and are consequently underestimated. In all these tests, the TSA generates two-layer perceptrons with good generalization ability in reasonable training times. The version of the IPA with TPLR performs better than the IPA with LP, even if it leads to configurations with a greater number of connections. The generalization ability of the neural networks constructed by the OSA and SWL suffers from the data binarization, particularly in the GL and IR problems where the inputs are real.
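The Gray-code binarization mentioned above can be illustrated in a few lines. The sketch below is our own rendering of the general idea, not the exact conversion used in the experiments: a real attribute is quantized, Gray-encoded so that adjacent levels differ in a single bit, and mapped to the {−1, +1} alphabet required by the OSA and SWL.

```python
def to_gray_bipolar(value, lo, hi, bits):
    """Quantize a real attribute in [lo, hi] to 2**bits levels, Gray-encode
    the level, and return the bits as a {-1, +1} pattern (hypothetical helper)."""
    level = round((value - lo) / (hi - lo) * (2**bits - 1))
    gray = level ^ (level >> 1)              # binary-reflected Gray code
    return [1 if (gray >> k) & 1 else -1 for k in reversed(range(bits))]

# Adjacent quantization levels differ in a single component, so small
# changes in the attribute give small Hamming distances:
print(to_gray_bipolar(0.50, 0.0, 1.0, 4))
print(to_gray_bipolar(0.60, 0.0, 1.0, 4))
```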
Table IX. Number of Nonnull Weights for Five Real-World Problems

Algorithm           GL       IR        CH        V0        V1
BPA              40.00    10.00    148.00     34.00     64.00
IPA with LP      41.66    27.00    446.22     35.13     59.43
IPA with TPLR   103.33    45.83    657.89     78.20    128.00
CA               67.00    27.33   1794.50     88.97    137.07
TSA             150.33    39.33    557.96     62.33    144.53
OSA            1741.33   602.00   7329.40   1051.84   1841.53
OSA with HC     103.57    49.20    155.78     46.67    130.10
SWL              85.33   108.50    370.88    140.88    176.33
SWL with HC      47.03    43.43    102.40     31.90     55.57
Nevertheless, we can note the positive influence of HC, which generally leads to better configurations in a smaller training time. In particular, in the CH problem where the inputs are almost binary, the OSA and SWL achieve a better generalization ability than the other sequential constructive techniques considered.
Table X. Generalization Ability for Five Real-World Problems

Algorithm           GL       IR       CH       V0       V1
BPA             73.58%   95.44%   98.64%   94.55%   87.47%
IPA with LP     70.56%   91.67%   96.52%   92.71%   87.59%
IPA with TPLR   74.07%   92.53%   95.96%   95.66%   90.25%
CA              71.79%   91.67%   90.38%   92.76%   88.53%
TSA             74.88%   93.67%   96.81%   94.74%   87.78%
OSA             67.84%   74.13%   88.32%   91.63%   87.15%
OSA with HC     63.83%   81.07%   98.31%   93.33%   85.98%
SWL             50.43%   35.73%   93.05%   89.40%   82.64%
SWL with HC     55.30%   77.40%   98.03%   91.70%   82.71%
Table XI. Average CPU Time (s) for Solving Five Real-World Problems

Algorithm           GL        IR         CH        V0        V1
BPA             142.64     92.81     798.87     24.23    141.18
IPA with LP     116.81     18.97  107232.43   2682.46   5734.09
IPA with TPLR     7.25      3.35    2116.79     16.49     32.39
CA                4.67      1.03    1178.17      9.93     15.98
TSA               7.14      0.53    2501.53      1.18     14.49
OSA               5.06      0.85    8945.01      3.32      5.35
OSA with HC       1.74      0.46     411.44      2.75      4.55
SWL              77.61     96.52    5146.10    713.52    995.61
SWL with HC      15.72      2.74     650.53      6.12     22.29
VIII. CONCLUSIONS

The class of sequential constructive methods has been analyzed in detail, from both the theoretical and the applicative point of view. The good convergence properties of these training algorithms have been validated by some basic theorems that show the common general procedure followed by them. In particular, five different sequential constructive methods for the generation of the hidden layer have been considered and the specific implementation choices adopted have been examined. Four of them, the irregular partitioning algorithm (IPA), the carve algorithm (CA), the target switch algorithm (TSA), and the oil spot algorithm (OSA), build neural networks containing threshold units. The fifth method, the sequential window learning (SWL) algorithm, employs window neurons for the construction of the hidden layer. The aim of the experimental tests performed was to point out the properties of each sequential constructive technique; three main characteristics have been analyzed:
• the complexity of the resulting neural network (measured by the number of hidden units or by the number of nonnull weights),
• the generalization ability of this configuration, and
• the computing time needed for the learning.
Furthermore, the procedure of Hamming clustering (HC) has been introduced to improve the performances of training algorithms that build binary neural networks (OSA and SWL).
On the grounds of the experimental results and the comparison with the configurations generated by the back-propagation algorithm (BPA), we can draw the following conclusions:
• The IPA yields neural networks having low complexity and generalization ability close to that obtained by the BPA. Unfortunately, the employment of linear programming (LP) to check the linear separability of a given training set leads to high execution times even for small numbers of inputs. On the contrary, the version using the thermal perceptron learning rule (TPLR) requires a lower computational cost.
• The CA provides a good compromise between convergence speed and efficiency of the resulting configuration. However, the values achieved for the generalization ability are generally lower than those obtained by the BPA.
• The TSA and the IPA with TPLR have similar performances. Nevertheless, the neural networks generated by the TSA are generally more complex and present a lower stability. Better results can be obtained by increasing the number n_t of iterations for the TPLR, but the computational cost can become too expensive.
• The OSA is the fastest sequential constructive method, but leads to configurations having great complexity and small generalization ability. The employment of HC allows a significant improvement of its performances and can even lower the total computing time.
• The SWL is able to build neural networks with few hidden units and high stability in a reasonable execution time. Unfortunately, the generalization ability of the resulting configurations is often inadequate. The application of HC allows us to overcome this drawback, leading to interesting performances, particularly when the original problem to be solved has binary inputs.
Finally, the wide diffusion of the benchmarks considered allows a direct comparison with other training algorithms so as to obtain an objective evaluation of the performances offered by sequential constructive methods.
REFERENCES
[1] P. J. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. Thesis, Harvard University, 1974.
[2] Y. Le Cun. A learning procedure for asymmetric networks. In Proceedings of Cognitiva, Paris, pp. 599-604, 1985.
[3] D. B. Parker. Learning logic. Technical Report TR-47, MIT Center for Research in Computational Economics and Management Science, Cambridge, MA, 1985.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature 323:533-536, 1986.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing (D. E. Rumelhart and J. L. McClelland, Eds.), pp. 318-362. MIT Press, Cambridge, MA, 1986.
[6] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991.
[7] R. P. Lippmann. Review of neural networks for speech recognition. Neural Comput. 1:1-38, 1989.
[8] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1:541-551, 1989.
[9] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems (J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds.), Vol. 4, pp. 471-479. Morgan Kaufmann, San Mateo, CA, 1992.
[10] A. Lapedes and R. Farber. How neural nets work. In Neural Information Processing Systems (D. Z. Anderson, Ed.), pp. 442-456. American Institute of Physics, New York, 1987.
[11] B. S. Wittner and J. S. Denker. Strategies for teaching layered networks classification tasks. In Neural Information Processing Systems (D. Z. Anderson, Ed.), pp. 850-859. American Institute of Physics, New York, 1987.
[12] S. A. Solla, E. Levin, and M. Fleischer. Accelerated learning in layered neural networks. Complex Systems 2:625-639, 1988.
[13] D. Plaut, S. Nowlan, and G. Hinton. Experiments on learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, 1986.
[14] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon. Accelerating the convergence of the back-propagation method. Biol. Cybernet. 59:257-263, 1988.
[15] A. H. Kramer and A. Sangiovanni-Vincentelli. Efficient parallel learning algorithms for neural networks. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 1, pp. 40-48. Morgan Kaufmann, San Mateo, CA, 1989.
[16] S. Makram-Ebeid, J.-A. Sirat, and J.-R. Viala. A rationalized back-propagation learning algorithm. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. 2, pp. 373-380, 1989.
[17] D. G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, 1984.
[18] D. Nguyen and B. Widrow. Improving the learning speed of 2-layer neural networks by choosing initial values of adaptive weights. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. 3, pp. 21-26, 1989.
[19] G. P. Drago and S. Ridella. Statistically controlled activation weight initialization (SCAWI). IEEE Trans. Neural Networks 3:257-263, 1992.
[20] J. P. Cater. Successfully using peak learning rates of 10 (and greater) in back-propagation networks with the heuristic learning algorithm. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, Vol. 2, pp. 645-651, 1987.
[21] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks 1:295-307, 1988.
[22] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston, Washington, DC, 1977.
[23] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature 317:314-319, 1985.
[24] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, pp. 1-12, 1986.
[25] H. Drucker, R. Schapire, and P. Simard. Boosting performance in neural networks. Internat. J. Pattern Recognition and Artificial Intelligence 7:705-719, 1993.
[26] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. Le Cun, U. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Los Alamitos, CA, pp. 77-83, 1994.
[27] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2:303-314, 1989.
[28] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366, 1989.
[29] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks 2:183-192, 1989.
[30] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.
[31] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16:264-280, 1971.
[32] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[33] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36:929-965, 1989.
[34] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Comput. 1:151-160, 1989.
[35] V. Vapnik and L. Bottou. Local algorithms for pattern recognition and dependencies estimation. Neural Comput. 5:893-909, 1993.
[36] V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC dimension of a learning machine. Neural Comput. 6:851-876, 1994.
[37] J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel, and J. Hopfield. Large automatic learning, rule extraction, and generalization. Complex Systems 1:877-922, 1987.
[38] D. Schwartz, V. K. Samalam, S. Solla, and J. S. Denker. Exhaustive learning. Neural Comput. 2:374-385, 1990.
[39] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36:111-147, 1974.
[40] G. Wahba and S. Wold. A completely automatic French curve: fitting spline functions by cross-validation. Comm. Statist. Theory Methods 4:1-17, 1975.
[41] J. S. Judd. Learning in networks is hard. In Proceedings of the First International Conference on Neural Networks, San Diego, Vol. 2, pp. 685-692, 1987.
[42] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks 5:117-127, 1992.
[43] L. G. Valiant. A theory of the learnable. Comm. Assoc. Comput. Mach. 27:1134-1142, 1984.
[44] J. Sietsma and R. J. F. Dow. Neural network pruning: why and how. In Proceedings of the IEEE International Conference on Neural Networks, San Diego, Vol. 1, pp. 325-333, 1988.
[45] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 2, pp. 598-605. Morgan Kaufmann, San Mateo, CA, 1990.
[46] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems (S. J. Hanson, J. D. Cowan, and C. L. Giles, Eds.), Vol. 5, pp. 164-171. Morgan Kaufmann, San Mateo, CA, 1993.
[47] D. L. Gray and A. N. Michel. A training algorithm for binary feedforward neural networks. IEEE Trans. Neural Networks 3:176-194, 1992.
[48] J. V. Jaskolski. Construction of neural network classification expert systems using switching theory algorithms. In Proceedings of the International Joint Conference on Neural Networks, Baltimore, Vol. 1, pp. 1-6, 1992.
[49] P. Rujan and M. Marchand. Learning by minimizing resources in neural networks. Complex Systems 3:229-241, 1989.
[50] E. B. Baum. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Trans. Neural Networks 2:5-19, 1991.
[51] S. I. Gallant. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1:179-191, 1990.
[52] M. Muselli. On convergence properties of pocket algorithm. IEEE Trans. Neural Networks 8:623-629, 1997.
[53] E. B. Baum. Review of Neural Network Design and the Complexity of Learning, by S. Judd. IEEE Trans. Neural Networks 2:181-182, 1991.
[54] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[55] R. L. Rivest. Learning decision lists. Machine Learning 2:229-246, 1987.
[56] M. Marchand and M. Golea. On learning simple neural concepts: from halfspace intersections to neural decision lists. Network 4:67-85, 1993.
[57] M. Marchand, M. Golea, and P. Rujan. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11:487-492, 1990.
[58] M. Muselli. On sequential construction of binary neural networks. IEEE Trans. Neural Networks 6:678-690, 1995.
[59] C. Campbell and C. Perez Vicente. The target switch algorithm: a constructive learning procedure for feed-forward neural networks. Neural Comput. 7:1245-1264, 1995.
[60] S. I. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993.
[61] S. Young and T. Downs. CARVE: a constructive algorithm for real valued examples. In Artificial Neural Networks (ICANN 94) (M. Marinaro and P. G. Morasso, Eds.), pp. 785-788. Springer-Verlag, Berlin, 1994.
[62] S. Young and T. Downs. Improvements and extensions to the constructive algorithm CARVE. In Artificial Neural Networks (ICANN 96) (C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, Eds.), pp. 513-518. Springer-Verlag, Berlin, 1996.
[63] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoret. Comput. Sci. 6:93-107, 1978.
[64] F. Rosenblatt. Principles of Neurodynamics. Spartan Press, Washington, DC, 1961.
[65] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.
[66] N. Karmarkar. A new polynomial time algorithm for linear programming. Combinatorica 4:373-395, 1984.
[67] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A 20:745-752, 1987.
[68] J. K. Anlauf and M. Biehl. Properties of an adaptive perceptron algorithm. In Parallel Processing in Neural Systems and Computers (R. Eckmiller, G. Hartmann, and G. Hauske, Eds.), pp. 153-156. North-Holland, Amsterdam, 1990.
[69] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, Berlin, 1987.
[70] R. A. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Inform. Process. Lett. 2:18-21, 1973.
[71] R. Zollner, H. J. Schmitz, F. Wunsch, and U. Krey. Fast generating algorithm for a general three-layer perceptron. Neural Networks 5:771-777, 1992.
[72] F. M. Frattale Mascioli and G. Martinelli. A constructive algorithm for binary neural networks: the oil-spot algorithm. IEEE Trans. Neural Networks 6:794-797, 1995.
[73] M. R. Emamy-Khansary. On the cuts and cut number of the 4-cube. J. Combin. Theory Ser. A 41:221-227, 1986.
[74] D. A. Marcus. Number Fields. Springer-Verlag, New York, 1977.
[75] R. S. Wenocur and R. M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Math. 33:313-318, 1981.
[76] M. Muselli. Hamming clustering: improving generalization in binary neural networks. In Artificial Neural Networks (ICANN 94) (M. Marinaro and P. G. Morasso, Eds.), pp. 1083-1086. Springer-Verlag, Berlin, 1994.
[77] L. Bottou and V. Vapnik. Local learning algorithms. Neural Comput. 4:888-900, 1992.
[78] H. W. Gschwind and E. J. McCluskey. Design of Digital Computers. Springer-Verlag, New York, 1975.
[79] M. Frean. A "thermal" perceptron learning rule. Neural Comput. 4:946-957, 1992.
[80] M. Mezard and J.-P. Nadal. Learning in feedforward layered networks: the tiling algorithm. J. Phys. A 22:2191-2203, 1989.
[81] M. Frean. The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Comput. 2:198-209, 1990.
[82] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine, CA, 1996.
[83] S. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, K. De Jong, S. Dzeroski, S. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang. A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Department of Computer Science, Carnegie Mellon University, Pittsburgh, 1991.
[84] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7:179-188, 1936.
[85] A. D. Shapiro. Structured Induction in Expert Systems. Addison-Wesley, Wokingham, UK, 1987.
Fast Backpropagation Training Using Optimal Learning Rate and Momentum

Xiao-Hu Yu
National Communications Research Laboratory, Department of Radio Engineering, Southeast University, Nanjing 210018, China

Li-Qun Xu
Intelligent Systems Research, Advanced Research and Technologies, BT Laboratories, Ipswich IP5 7RE, England

Yong Wang
National Communications Research Laboratory, Department of Radio Engineering, Southeast University, Nanjing 210018, China
I. INTRODUCTION

The backpropagation algorithm (BPA) has been playing a unique role in the application of neural networks to various domain problems. The standard BPA with a fixed learning rate and momentum usually suffers from extremely slow convergence. This is because the error (cost function) surface is far from a quadratic bowl in the weight space, usually exhibiting many plateaus and ravines arising from the nonlinearity of the sigmoid units used in a multilayer feedforward network [1]. Figure 1 shows two typical slices of such an error surface, which result from learning a six-parity check problem by varying two arbitrary weights of a converged neural network. The standard BPA, however, fails to account for these changes in the error surface while pursuing its minimum. Therefore, the exploration of fast and robust learning algorithms is of particular interest. To achieve faster convergence, the learning rate of an algorithm that defines the shift of the weight vector has to be dynamically varied in accordance with the
Figure 1 Illustration of typical error surfaces of a trained feedforward neural network for six-parity check problem.
region in which the weight vector currently stands. In recent years, there has been a good deal of effort in developing efficient methods to speed up the convergence of the BPA by using a dynamic learning rate. The techniques can be largely divided into two categories: those based on weight-specific local variation information and those based on global knowledge of the state of the entire network. The techniques of the first category include some ad hoc approaches [2, 3] and heuristic methods [4-8]. Jacobs [5], for example, presented a simple method for updating the learning parameters. He suggested dynamically increasing or decreasing the learning rate and momentum term by a fixed factor based on observations of the previous error signals. Variations of Jacobs' method were reported by Vogel [6] and compared in detail by Allred and Kelly [2]. Although these kinds of methods have proved to work well in many cases, they may lead to overadjustment of the weights, resulting in dramatic divergence. In fact, the optimal learning rate
varies almost randomly from iteration to iteration, as discussed by Yu et al. [9]; it is not possible to obtain optimal learning parameters based on observations. The second category of algorithms effectively exploits the second-order derivative (Hessian matrix) information of the synaptic weights. They include Newton's method [10, 11], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Levenberg-Marquardt methods [12], and many others. These methods normally converge in fewer iterations than the standard BPA requires. The computational complexity at each iteration, however, increases quadratically with the number of weights, because the Hessian or Jacobian matrix is very costly to compute and store. The extended Kalman type of algorithms, such as those introduced by Singhal and Wu [13], Puskorius and Feldkamp [14], and Kollias and Anastassiou [15], also roughly belong to this category. A simplified version of these kinds of algorithms was recently reported by Mohandes et al. [16], where an estimate of the Hessian matrix norm was used to update the learning rate. Le Cun et al. [17] also proposed an interesting modification that avoids directly computing the Hessian matrix. For a recent review of some of the preceding accelerated BPAs, readers are referred to the books of Haykin [18] and Bishop [19]. In this chapter, we explore a family of rather different approaches to accelerating backpropagation learning [9, 20, 21]. Instead of acquiring the Hessian matrix of the synaptic weights, we concern ourselves mainly with the derivative information with respect to the learning rate and momentum, which can be computed from an extended feedforward propagation procedure by way of a set of recursive formulas. Because the computational complexity of this set of formulas scales like that of the standard BPA, the estimation of the optimal dynamic learning rate can be achieved with a moderate overhead at each iteration. As a result, backpropagation learning can be accelerated considerably, with a significant reduction in overall training time. In addition, because a near-optimal learning rate can be obtained at each iteration, the proposed technique is very robust to the initial setting of random weights, in the sense that multiple trials of the same experiment for a task give a consistently small standard deviation in performance, as opposed to the aforementioned ad hoc and heuristic methods. This chapter is organized as follows. In Section II, the computational procedures are introduced for recursively computing various derivatives of the cost function with respect to the learning parameters (learning rate and momentum), within the context of backpropagation learning of a multilayer feedforward neural network. Specifically, Section II.A is devoted to formulating a univariate cost function of the learning rate and computing its first four derivatives, whereas Section II.B considers a bivariate function of the learning rate and momentum and computes their first two partial derivatives. In Section III, several methods are proposed for calculating the optimized dynamic learning rate for a BP algorithm.
First, an effective line search method employing only the first derivative is used to search for a valid optimal learning rate, with almost no extra computational and storage burden being introduced. Next, considering the fact that the univariate cost function of the learning rate can be well approximated by a parabola in the small neighborhood of its origin, a Newton-like method using the first two derivatives is presented. This method works very well unless the quadratic assumption becomes invalid. In this case, the higher-order (up to four) derivatives of the learning rate proved to be vital; hence, a higher-order derivatives method is proposed which is of more general application. Section IV discusses the strategy of simultaneously optimizing both the learning rate and the momentum. For this purpose, a second-order dual variable Taylor series expansion is used to approximate the cost function near the origin. The estimate of the optimal learning rate and momentum can then be pursued by simple algebraic manipulations. In Section V, we analyze the specification of the direction vector for the methods proposed. Possible choices include the negative gradient and the Gauss-Newton and Newton directions, though our attention is drawn to a modified conjugate gradient direction. In Section VI, we carry out analysis on the computational complexity of the preceding methods together with several well-studied classic methods for training feedforward neural networks, including the extended Kalman filter (EKF), the delta-bar-delta (DBD), and standard BP algorithm. Simulation results on three benchmark problems are presented. The advantages can be clearly seen of this family of algorithms over other methods in terms of the convergence rate, computational complexity, and robustness to network weights initialization.
II. COMPUTATION OF DERIVATIVES OF LEARNING PARAMETERS

We start by introducing backpropagation learning under the normal batch training mode. Given a set of P training pairs, {(X_1, D_1), (X_2, D_2), ..., (X_P, D_P)}, where X_s and D_s denote, respectively, an N_0-dimensional input pattern and an N_M-dimensional desired output pattern, backpropagation learning can be described as a process of minimizing the following cost function, the summed squared error in the weight space:

E(W) = [1/(P·N_M)] Σ_{s=1}^{P} ||Y_s^M − D_s||²,    (1)

where W represents a vector consisting of all synaptic weights and biases in an M-layered network and Y_s^M is the actual output vector corresponding to the input X_s.
A. DERIVATIVES OF THE LEARNING RATE

Let W(k) be the current estimate of the optimal weight vector and P(k) a descent direction at W = W(k). The BPA is essentially a descent algorithm that adjusts the weight vector according to

W(k + 1) = W(k) + μP(k),    (2)

where μ is a small constant, the so-called learning rate. The optimal learning rate at W = W(k), μ*(k) say, should be such that

μ*(k) = arg min{h(μ) = E[W(k) + μP(k)] | μ ≥ 0}.    (3)
Note that, given the current weight vector W(k), when the direction vector P(k) is specified, the cost function for the next iteration, E[W(k + 1)] or h(μ), is a univariate function of μ. Once the derivatives of h(μ) w.r.t. μ are available, the estimation of μ*(k) can be easily pursued. In the following we give a set of formulas for the efficient calculation of these derivatives. First, we consider the feedforward propagation in the mth layer (whose number of units is N_m) of an M-layered network. Given X_s, the sth input pattern of the network, the output of unit i in the mth layer is given by
y_{s,i}^m = f([W_i^m(k) + μP_i^m(k)]^t Y_s^{m−1}),    (4)

where t denotes the transpose operation and f(x) the unipolar sigmoid function 1/(1 + e^{−x}). W_i^m(k) and P_i^m(k) represent, respectively, the subset of weights and their corresponding directions associated with the connections to unit i of layer m; Y_s^{m−1} is the activation vector formed by all outputs from layer m − 1 (including the bias), with

Y_s^0 = [1  X_s^t]^t  for m = 1.
Next, we compute the first four derivatives of (4) w.r.t. the learning rate at its origin μ = 0:

∂y_{s,i}^m/∂μ = f'(a)b,    (5)
∂²y_{s,i}^m/∂μ² = f''(a)b² + f'(a)c,    (6)
∂³y_{s,i}^m/∂μ³ = f⁽³⁾(a)b³ + 3f''(a)bc + f'(a)d,    (7)
∂⁴y_{s,i}^m/∂μ⁴ = f⁽⁴⁾(a)b⁴ + 6f⁽³⁾(a)b²c + f''(a)[3c² + 4bd] + f'(a)e,    (8)

where the parameters a, b, c, d, e are specified, respectively, by

a = W_i^m(k)^t Y_s^{m−1},    (9)
b = P_i^m(k)^t Y_s^{m−1} + W_i^m(k)^t ∂Y_s^{m−1}/∂μ,    (10)
c = 2P_i^m(k)^t ∂Y_s^{m−1}/∂μ + W_i^m(k)^t ∂²Y_s^{m−1}/∂μ²,    (11)
d = 3P_i^m(k)^t ∂²Y_s^{m−1}/∂μ² + W_i^m(k)^t ∂³Y_s^{m−1}/∂μ³,    (12)
e = 4P_i^m(k)^t ∂³Y_s^{m−1}/∂μ³ + W_i^m(k)^t ∂⁴Y_s^{m−1}/∂μ⁴,    (13)

and the first four derivatives of the unipolar sigmoid function f(x) are

f'(x) = f(x)[1 − f(x)],
f''(x) = f'(x) − 2f(x)f'(x),
f⁽³⁾(x) = f''(x) − 2[f'(x)]² − 2f(x)f''(x),
f⁽⁴⁾(x) = f⁽³⁾(x) − 6f''(x)f'(x) − 2f(x)f⁽³⁾(x).

Note that (5)-(8) can be recursively calculated for m = 2, ..., M, with the initial values (m = 1) taken as

∂Y_s^0/∂μ = ∂²Y_s^0/∂μ² = ∂³Y_s^0/∂μ³ = ∂⁴Y_s^0/∂μ⁴ = 0.

Finally, considering the cost function (1), the first four derivatives of h(μ) at μ = 0 follow immediately:

h'(0) = F_c Σ_{s=1}^{P} E_s^t (∂Y_s^M/∂μ),    (14)
h''(0) = F_c Σ_{s=1}^{P} [ ||∂Y_s^M/∂μ||² + E_s^t (∂²Y_s^M/∂μ²) ],    (15)
h⁽³⁾(0) = F_c Σ_{s=1}^{P} [ 3(∂Y_s^M/∂μ)^t (∂²Y_s^M/∂μ²) + E_s^t (∂³Y_s^M/∂μ³) ],    (16)
h⁽⁴⁾(0) = F_c Σ_{s=1}^{P} [ 3||∂²Y_s^M/∂μ²||² + 4(∂Y_s^M/∂μ)^t (∂³Y_s^M/∂μ³) + E_s^t (∂⁴Y_s^M/∂μ⁴) ],    (17)
where we have defined the N_M-dimensional error vector E_s = Y_s^M − D_s and the constant F_c = 2/(P·N_M). It is particularly noted that the computational and storage demand required for calculating the higher-order derivatives of the learning rate has the same scale as that of the standard BPA, as opposed to the previous second-order methods [11, 12, 16], where the computational complexity increased quadratically with the scale of a network because of the need to acquire Hessian-like information on the synaptic weights. From the expression of h(μ) in (3), one can also obtain h'(μ) and h''(μ) for arbitrary μ ≥ 0 as follows:

h'(μ) = P(k)^t ∇_W E[W(k) + μP(k)],    (18)
h''(μ) = P(k)^t ∇²_W E[W(k) + μP(k)] P(k),    (19)

where ∇_W and ∇²_W represent, respectively, the gradient and the Hessian matrix of the cost function w.r.t. W. Note that (18) and (19) are especially useful when a second-order method, for example, Newton's method with P(k) = −[∇²_W E(W(k))]^{−1} ∇_W E(W(k)), is used. In this case, the information on ∇_W|_{μ=0} and ∇²_W|_{μ=0} is available, so the computation of h'(0) and h''(0), as required by Methods 1-3 to be addressed in Section III, is straightforward.
B. DERIVATIVES OF THE LEARNING RATE AND MOMENTUM
In order to improve the efficiency of the standard BPA, one simple technique is to add a momentum term to the gradient descent formula. This can effectively smooth out the fluctuations of the weight adjustments in the course of learning. The modified formula of weight adjustment becomes

ΔW(k) = μP(k) + αΔW(k − 1),    (20)
where 0 ≤ α < 1. We shall take a close look at the heuristic roles the momentum term plays. First, when ΔW(k − 1) and P(k) assume the same numerical sign, meaning that there may exist a large distance between the current weight vector position and the forthcoming minimum point, the weight adjustment is increased; the inclusion of the momentum term here thus acts as an accelerating factor. Second, when ΔW(k − 1) and P(k) have opposite numerical signs, which means that an oscillation is currently taking place, the moving pace is decreased. The inclusion of the momentum term therefore stabilizes the motion of the learning process in the weight space.
We can write the cost function E[W(k + 1)] as a bivariate function of μ and α:

h(μ, α) = E[W(k) + μP(k) + αΔW(k − 1)].    (21)
To achieve the largest possible reduction in the cost function at step k, the optimal learning rate and momentum term (μ*(k), α*(k)) should be such that

(μ*(k), α*(k)) = arg min{h(μ, α) | μ ≥ 0, α ≥ 0}.    (22)
Note again that, given W(k) and ΔW(k − 1), when P(k) is specified, h(μ, α) is a bivariate function of μ and α. Once the partial derivatives of h(μ, α) w.r.t. μ and α are available, the estimation of μ*(k) and α*(k) can be easily obtained. Following the discussion in the previous subsection, we shall give a set of recursive formulas for calculating these derivatives. This time, given input X_s, the output associated with unit i in the mth layer is given by

y_{s,i}^m = f([W_i^m(k) + μP_i^m(k) + αΔW_i^m(k − 1)]^t Y_s^{m−1}),    (23)
where ΔW_i^m(k − 1) represents the weight adjustment in the last step associated with the connections to unit i of layer m. Next, we compute the first two partial derivatives of (23) with respect to (μ, α) at their origin (0, 0):

∂y_{s,i}^m/∂μ = f'(A)B,    (24)
∂y_{s,i}^m/∂α = f'(A)C,    (25)
∂²y_{s,i}^m/∂μ² = f''(A)B² + f'(A)D,    (26)
∂²y_{s,i}^m/∂α² = f''(A)C² + f'(A)E,    (27)
∂²y_{s,i}^m/∂μ∂α = f''(A)BC + f'(A)H,    (28)

where

A = W_i^m(k)^t Y_s^{m−1},
B = P_i^m(k)^t Y_s^{m−1} + W_i^m(k)^t ∂Y_s^{m−1}/∂μ,
C = ΔW_i^m(k − 1)^t Y_s^{m−1} + W_i^m(k)^t ∂Y_s^{m−1}/∂α,
D = 2P_i^m(k)^t ∂Y_s^{m−1}/∂μ + W_i^m(k)^t ∂²Y_s^{m−1}/∂μ²,
E = 2ΔW_i^m(k − 1)^t ∂Y_s^{m−1}/∂α + W_i^m(k)^t ∂²Y_s^{m−1}/∂α²,
H = P_i^m(k)^t ∂Y_s^{m−1}/∂α + ΔW_i^m(k − 1)^t ∂Y_s^{m−1}/∂μ + W_i^m(k)^t ∂²Y_s^{m−1}/∂μ∂α.
It is then readily shown from (21) that the first two partial derivatives of h(μ, α) with respect to (μ, α) at (0, 0) are as follows:

h'_μ(0,0) = F_c Σ_{s=1}^P E_s^T (∂Y_s^M/∂μ),  (29)

h'_α(0,0) = F_c Σ_{s=1}^P E_s^T (∂Y_s^M/∂α),  (30)

h''_{μ²}(0,0) = F_c Σ_{s=1}^P [(∂Y_s^M/∂μ)^T (∂Y_s^M/∂μ) + E_s^T (∂²Y_s^M/∂μ²)],  (31)

h''_{α²}(0,0) = F_c Σ_{s=1}^P [(∂Y_s^M/∂α)^T (∂Y_s^M/∂α) + E_s^T (∂²Y_s^M/∂α²)],  (32)

h''_{μα}(0,0) = F_c Σ_{s=1}^P [(∂Y_s^M/∂μ)^T (∂Y_s^M/∂α) + E_s^T (∂²Y_s^M/∂α∂μ)],  (33)

where, as before, we have defined the error vector E_s = Y_s^M − D_s and the constant F_c = 2/(P·N_M). Note that in the input layer (m = 1) the conditions

∂Y_s^1/∂μ = ∂Y_s^1/∂α = ∂²Y_s^1/∂μ² = ∂²Y_s^1/∂α² = ∂²Y_s^1/∂μ∂α = 0  (34)

hold, so that the desired derivatives on the right-hand sides of (29)-(33) can be recursively calculated for m = 2, …, M.
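To make the recursion concrete, the following C sketch propagates the first derivative with respect to μ through one layer, per (23) and (24) and the definition of B; it assumes a dense layer representation and externally supplied sigmoid routines, and it starts from the zero initial conditions (34) at the input layer. It is a sketch of the idea, not the authors' implementation.

```c
double f(double x);       /* unipolar sigmoid, assumed defined elsewhere */
double fprime(double x);  /* its first derivative                        */

/* One layer transition of the extended forward pass: given activations
 * y_prev[] and their derivatives dy_prev[] = dY^{m-1}/dmu of layer m-1,
 * compute y[] and dy[] = dY^m/dmu of layer m. w[i][j] are the weights
 * of unit i, p[i][j] the corresponding entries of the direction P(k). */
void layer_mu_derivative(int n_prev, int n, double **w, double **p,
                         const double *y_prev, const double *dy_prev,
                         double *y, double *dy)
{
    for (int i = 0; i < n; i++) {
        double A = 0.0, B = 0.0;
        for (int j = 0; j < n_prev; j++) {
            A += w[i][j] * y_prev[j];                        /* A = W^T Y^{m-1} */
            B += p[i][j] * y_prev[j] + w[i][j] * dy_prev[j]; /* definition of B */
        }
        y[i]  = f(A);           /* Eq. (23) evaluated at (mu, alpha) = (0, 0) */
        dy[i] = fprime(A) * B;  /* Eq. (24): dy/dmu = f'(A) * B               */
    }
}
```

The second derivatives (26)-(28) propagate in exactly the same pattern, carrying the quantities V, S, and U alongside B and C.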
III. OPTIMIZATION OF DYNAMIC LEARNING RATE

In this section, we consider the weight adjustment formula (2). We investigate several methods for computing the optimized dynamic learning rate μ*(k) at each iteration, based on the necessary derivative information of the cost function h(μ) at μ = 0, whose computation was discussed in Section II.
A. METHOD 1: LEARNING RATE SEARCH WITH AN ACCEPTABLE δE

Using the first two derivatives of the learning rate, this method is based on a simple but effective line search algorithm due originally to Goldstein, as can be found in Wolfe [22]. The method first tries to obtain a valid range for the learning rate based on the information in h'(0). Specifically, a learning rate μ is sought such that the descent value δE = h(μ) − h(0) stays within the region bounded by two lines, both starting from the point (0, h(0)) but with τ1h'(0) and τ2h'(0) as their respective slopes:

τ2μh'(0) ≤ h(μ) − h(0) ≤ τ1μh'(0),

or, equivalently,

h(0) + τ2μh'(0) ≤ h(μ) ≤ h(0) + τ1μh'(0).  (35)
Considering that h'(0) < 0, to make (35) meaningful, 0 < τ1 < τ2 < 1 should be satisfied. From the geometric point of view, the inequality involving τ1 requires the descent δE to be large enough, whereas the one involving τ2 keeps the valid μ away from zero. As discussed in Yu et al. [9], in most cases h(μ) is well approximated by a parabola in a small neighborhood of μ = 0, so a reasonable choice of τ1 and τ2 should place the optimal learning rate μ* within the range of valid learning rates satisfying (35). In the case that h(μ) takes an exact quadratic form, that is,

h(μ) = h(0) + h'(0)μ + ½h''(0)μ²,  where h''(0) > 0,  (36)

the optimal learning rate μ* can be explicitly expressed as

μ* = −h'(0)/h''(0),  (37)
and its corresponding h(μ*) as

h(μ*) = h(0) − [h'(0)]²/(2h''(0)).  (38)

So the slope of the line from (0, h(0)) to (μ*, h(μ*)) is

[h(μ*) − h(0)]/μ* = ½h'(0).
This suggests that when 0 < τ1 < ½ < τ2 < 1 holds, (35) is satisfied, so the optimal learning rate thus obtained will be valid. Based on the preceding analysis, the whole procedure in search of a valid optimal learning rate μ* can be implemented in four steps:

1. Initialize the line search. Choose τ1 and τ2 such that 0 < τ1 < ½ < τ2 < 1 is satisfied. Let μ_max be a sufficiently large upper bound on the learning rate. For k = 0, take an arbitrary μ_0 ∈ (0, μ_max) as the initial value for the line search; otherwise, for k > 0, set μ_0 = μ*(k − 1), the optimal learning rate estimated in the (k − 1)th iteration. For j = 0, set a_j = 0 and b_j = μ_max, then go to step 2.
2. Verify the descent requirement. Compute h(μ_j) = E[W(k) + μ_j P(k)]. If h(μ_j) ≤ h(0) + τ1μ_j h'(0), so that the descent δE is large enough, perform step 3; otherwise, let a_{j+1} ← a_j and b_{j+1} ← μ_j, and go to step 4.
3. Verify the requirement keeping μ_j away from zero. If h(μ_j) ≥ h(0) + τ2μ_j h'(0), both inequalities of (35) hold, μ_j is a valid learning rate, and the line search ends with μ*(k) = μ_j. If not, set a_{j+1} ← μ_j and b_{j+1} ← b_j; if b_{j+1} < μ_max, proceed to step 4; otherwise, let μ_{j+1} = 2μ_j, set j ← j + 1, and return to step 2.
4. Try a new test point. Go back to step 2 with μ_{j+1} = (a_{j+1} + b_{j+1})/2 and j ← j + 1.

This procedure terminates after a few iterations with a valid optimal learning rate. Another, more effective criterion for keeping the valid μ away from zero is to use the derivative information at μ = μ_j, requiring

h'(μ_j) ≥ τ3 h'(0).  (39)
To guarantee that the optimal learning rate μ* given in (37) and (38) falls within the valid range, τ1 and τ3 should be confined, respectively, to the intervals

0 < τ1 < ½  and  0 < τ3 < 1.  (40)
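A compact C sketch of the four-step search might look as follows, assuming h'(0) < 0 and a caller-supplied routine h(μ) that evaluates E[W(k) + μP(k)]; all names are illustrative.

```c
double h(double mu);  /* cost along the search direction: E[W(k)+mu*P(k)] */

/* Sketch of the four-step search of Section III.A. h0 and dh0 are the
 * precomputed values h(0) and h'(0) < 0; tau1 < 1/2 < tau2 are the
 * Goldstein constants. Returns a learning rate satisfying (35). */
double goldstein_search(double h0, double dh0, double mu0,
                        double tau1, double tau2, double mu_max, int max_iter)
{
    double a = 0.0, b = mu_max, mu = mu0;
    for (int j = 0; j < max_iter; j++) {
        double hm = h(mu);
        if (hm > h0 + tau1 * mu * dh0) {        /* descent too small: shrink */
            b = mu;
        } else if (hm > h0 + tau2 * mu * dh0) { /* both sides of (35) hold   */
            return mu;
        } else {                                /* mu too close to zero      */
            a = mu;
            if (b >= mu_max) { mu = 2.0 * mu; continue; } /* expand upward   */
        }
        mu = 0.5 * (a + b);                     /* bisect the bracket        */
    }
    return mu;
}
```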
B. METHODS 2 AND 3: USING A NEWTON-LIKE METHOD TO COMPUTE μ

These two methods also use the first two derivatives. As discussed in Section III.A, the optimal learning rate is explicitly given by (37) if the cost function h(μ) can be characterized as a convex parabola; indeed, in most cases h(μ) approximately takes a convex quadratic form in a small neighborhood of μ = 0. Therefore,

μ*(k) = −h'(0)/h''(0)  (41)

is a suitable estimate of the optimal learning rate at iteration k. As remarked in Section II.A, h'(0) and h''(0) can be computed from (14) and (15) by iterating (5) and (6), respectively. They can also be obtained from (18) and (19) if one decides to employ a second-order method [23]. Furthermore, if one wishes to estimate the optimal learning rate with much higher accuracy, (41) should be generalized into the standard Newton's method [20]. In the case that h''(0) ≤ 0 [this happens when h(μ) decreases sharply at μ = 0], the Newton-like method fails to provide a valid learning rate, and the line search method of Section III.A should be used instead. We distinguish two cases when (41) is used in connection with a specified direction vector P(k) for the weight update (to be discussed in Section V): (a) P(k) is simply the negative gradient, referred to as Method 2, and (b) P(k) is the conjugate gradient direction, referred to as Method 3.
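In code, Methods 2 and 3 reduce to one division plus a guard, as in this hedged C sketch; it reuses the goldstein_search sketch above as the fallback, and the τ constants are illustrative.

```c
double goldstein_search(double h0, double dh0, double mu0,
                        double tau1, double tau2, double mu_max, int max_iter);

/* Sketch of Methods 2 and 3: Newton-like estimate (41) of the optimal
 * learning rate from the first two derivatives of h at mu = 0. Falls back
 * to the line search of Method 1 when h''(0) <= 0, as prescribed above. */
double newton_learning_rate(double h0, double dh0, double d2h0,
                            double mu_prev, double mu_max)
{
    if (d2h0 > 0.0)
        return -dh0 / d2h0;                    /* Eq. (41): mu* = -h'(0)/h''(0) */
    return goldstein_search(h0, dh0, mu_prev,  /* invalid curvature: Method 1   */
                            0.25, 0.75, mu_max, 50);
}
```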
C. METHOD 4: USING THE HIGHER-ORDER DERIVATIVES OF μ

Once again, we approximate the cost function h(μ) by a Taylor series expansion in a small neighborhood of μ = 0, but now assume that the higher-order derivatives are also available at this point:

h(μ) = h(0) + h'(0)μ + (1/2)h''(0)μ² + (1/6)h'''(0)μ³ + (1/24)h''''(0)μ⁴ + ⋯ .  (42)

To estimate the optimal μ that minimizes h(μ), we differentiate (42) with respect to μ and obtain

h'(μ) = D + Cμ + Bμ² + Aμ³ + ⋯ ,  (43)

where the coefficients are, respectively, A = h''''(0)/6, B = h'''(0)/2, C = h''(0), and D = h'(0).
In the following, we treat four special cases separately in light of the numerical values taken by the coefficients C, B, and A. Note that, because P(k) is a descent direction, we always have D = h'(0) < 0.

Case 1. B > 0: In this case, (43) is truncated to a quadratic equation, a parabola whose arms open upward:

h'(μ) ≈ Bμ² + Cμ + D.  (44)

Because D < 0, we have C² − 4BD > C² ≥ 0. The optimal solution for μ is given by

μ*(k) = [−C + √(C² − 4BD)] / (2B).  (45)
Case 2. B < 0 and C > 0: The condition C > 0 implies that h(μ) itself in (42) can be suitably characterized by a quadratic function for small μ by leaving out the higher-order terms O(μ³). This makes h'(μ) a linear function, so the optimal estimate of μ is obtained from

μ*(k) = −D/C.  (46)

Case 3. B < 0, C < 0, and A > 0: In this case, (43) takes the approximate form

h'(μ) ≈ Aμ³ + Bμ² + Cμ + D.  (47)
Noting the assumptions on A, B, and C, plus D < 0, both conditions 3AC − B² < 0 and 2B³ − 9ABC + 27A²D < 0 are satisfied. According to Press et al. [24], the unique positive root of the cubic equation (47) is given by

μ*(k) = 2√(−p) cos θ − B/(3A),  for q² + p³ < 0,
μ*(k) = u + v − B/(3A),  for q² + p³ ≥ 0,  (48)

where

p = (3AC − B²)/(9A²),  q = (2B³ − 9ABC + 27A²D)/(54A³),
θ = (1/3) arccos(q/√(−p³)),

and

u = ∛(−q + √(q² + p³)),  v = ∛(−q − √(q² + p³)).
Case 4. B < 0, C < 0, and A < 0: When this case occurs, we cannot find a meaningful value from h'(μ) = 0 at all, because all of the first four derivatives are negative at μ = 0. Nevertheless, we can estimate the optimal μ*(k) by directly searching for the positive root of h(μ) itself [or the absolute minimum of the cost function E(W)]. From this perspective, we rewrite the truncated Taylor series expansion of h(μ) as

h_t(μ) = h(0) + Dμ + Cμ² + Bμ³ + Aμ⁴.  (49)

Note that, because all four coefficients A, B, C, and D are negative, h_t'(μ) < 0 for μ > 0; therefore, h_t(μ) is monotonically decreasing for μ > 0. Because h(0) > 0, it follows that (49) has a unique positive root, which can be obtained explicitly by factoring the quartic into two quadratic equations, (50) and (51), whose coefficients are expressed in terms of α = D/h(0), β = C/h(0), γ = B/h(0), and δ = A/h(0); the auxiliary quantity y appearing in the factorization is a real-valued root of the associated resolvent cubic equation (52). Because D < 0, an alternative approach to the solution of (49) is simply to iterate the expression

μ_{l+1} = −(1/D)[h(0) + Cμ_l² + Bμ_l³ + Aμ_l⁴]  (53)

for l = 0, 1, …, with the initial value μ_0 = 0. It is readily shown that (53) converges to the unique positive root of (49) because of the monotonicity of h_t(μ) for μ > 0.
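The case analysis of this section can be summarized in a short C routine; the sketch below implements Cases 1, 2, and 4 directly and, for brevity, substitutes a plain Newton iteration on the cubic for the closed form (48) of Case 3. This substitution, and all the names, are assumptions of the sketch rather than the authors' procedure.

```c
#include <math.h>

/* Sketch of Method 4. Given h(0) and the coefficients of (43),
 * h'(mu) ~ D + C*mu + B*mu^2 + A*mu^3 with D = h'(0) < 0, return an
 * estimate of the optimal learning rate mu*(k). */
double method4_learning_rate(double h0, double A, double B, double C, double D)
{
    if (B > 0.0)                        /* Case 1: quadratic (44), root (45) */
        return (-C + sqrt(C * C - 4.0 * B * D)) / (2.0 * B);
    if (C > 0.0)                        /* Case 2: linear model, Eq. (46)    */
        return -D / C;
    if (A > 0.0) {                      /* Case 3: cubic (47); Newton steps  */
        double mu = 1.0;                /* stand in for the closed form (48) */
        for (int it = 0; it < 100; it++) {
            double g  = ((A * mu + B) * mu + C) * mu + D;
            double dg = (3.0 * A * mu + 2.0 * B) * mu + C;
            if (dg == 0.0) break;
            mu -= g / dg;
        }
        return mu;
    }
    /* Case 4: all coefficients negative; fixed-point iteration (53) on the
     * truncated quartic (49), starting from mu = 0. */
    double mu = 0.0;
    for (int l = 0; l < 200; l++) {
        double m2 = mu * mu;
        double next = -(h0 + C * m2 + B * m2 * mu + A * m2 * m2) / D;
        if (fabs(next - mu) < 1e-12) { mu = next; break; }
        mu = next;
    }
    return mu;
}
```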
IV. SIMULTANEOUS OPTIMIZATION OF μ AND α

In this section, we consider the modified weight adjustment formula (20). We propose a method to simultaneously determine, when possible, both the learning rate and the momentum term based on the first two partial derivatives of the cost function h(μ, α).
A. METHOD 5: USING THE FIRST TWO PARTIAL DERIVATIVES

Consider the bivariate formulation (21) of the cost function h(μ, α). The Taylor series expansion of h(μ, α) in a small neighborhood of μ = 0, α = 0 can be written as

h(μ, α) = h(0,0) + h'_μ(0,0)μ + h'_α(0,0)α + ½h''_{μ²}(0,0)μ² + ½h''_{α²}(0,0)α² + h''_{μα}(0,0)μα + ⋯ .  (54)
Note that, as discussed in Section II.B, the initial conditions (34) hold, so h'_μ(0,0), h'_α(0,0), h''_{μ²}(0,0), h''_{α²}(0,0), and h''_{μα}(0,0) can be recursively computed via (23)-(33). The minimum of the cost function h(μ, α) is achieved at (μ*, α*) if the following two equations are satisfied:

h'_μ(μ*, α*) = 0,  (55)

h'_α(μ*, α*) = 0.  (56)

Differentiating (54) with respect to μ and α, respectively, leaving out the higher-order terms, and substituting the results into (55) and (56), we have

[ h''_{μ²}(0,0)  h''_{μα}(0,0) ] [μ]   [ −h'_μ(0,0) ]
[ h''_{μα}(0,0)  h''_{α²}(0,0) ] [α] = [ −h'_α(0,0) ].  (57)

Let q_d be the determinant of the preceding coefficient matrix:

q_d = h''_{μ²}(0,0) h''_{α²}(0,0) − [h''_{μα}(0,0)]².

In the following, we recognize three different cases for computing valid learning parameters:

Case 1. h''_{μ²}(0,0) > 0 and q_d > 0: In this case, the approximation (54) is valid, so the best learning parameters (μ*, α*) can be calculated from (57) as

μ* = [h'_α(0,0) h''_{μα}(0,0) − h'_μ(0,0) h''_{α²}(0,0)] / q_d,
α* = [h'_μ(0,0) h''_{μα}(0,0) − h'_α(0,0) h''_{μ²}(0,0)] / q_d.

Case 2. h''_{μ²}(0,0) > 0 and q_d ≤ 0: The best learning rate μ* is obtained by a Newton-like method (Method 2 or 3), and the momentum term α* is reset to 0.
Case 3. h''_{μ²}(0,0) ≤ 0: In this case, α* is clamped to 0, and the best learning rate μ* is estimated by the learning rate search with an acceptable descent value (Method 1).
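The three cases translate directly into code. The following C sketch solves the 2×2 system (57) by Cramer's rule in Case 1 and falls back otherwise, reusing the goldstein_search sketch of Method 1; all names are illustrative.

```c
double goldstein_search(double h0, double dh0, double mu0,
                        double tau1, double tau2, double mu_max, int max_iter);

/* Sketch of Method 5: simultaneous choice of learning rate and momentum.
 * hmu, hal are h'_mu(0,0) and h'_alpha(0,0); hmm, haa, hma are the second
 * partials h''_{mu^2}, h''_{alpha^2}, h''_{mu alpha} at the origin. */
void method5_parameters(double h0, double hmu, double hal,
                        double hmm, double haa, double hma,
                        double mu_prev, double mu_max,
                        double *mu, double *alpha)
{
    double qd = hmm * haa - hma * hma;      /* determinant of (57)           */
    if (hmm > 0.0 && qd > 0.0) {            /* Case 1: solve (57) directly   */
        *mu    = (hal * hma - hmu * haa) / qd;
        *alpha = (hmu * hma - hal * hmm) / qd;
    } else if (hmm > 0.0) {                 /* Case 2: Newton-like mu only   */
        *mu    = -hmu / hmm;
        *alpha = 0.0;
    } else {                                /* Case 3: fall back to Method 1 */
        *mu    = goldstein_search(h0, hmu, mu_prev, 0.25, 0.75, mu_max, 50);
        *alpha = 0.0;
    }
}
```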
V. SELECTION OF THE DESCENT DIRECTION

So far, the specification of the descent direction P(k) has not been discussed. There are many choices, including the negative gradient [17, 25], the conjugate gradient direction [9, 20, 26, 27], the Newton direction [11, 12], and the Gauss-Newton direction [20, 28], among others. These four choices are listed as follows:

Negative gradient: P(k) = −∇_W(k),
Conjugate gradient: P(k) = −∇_W(k) + λ_k P(k − 1),
Newton: P(k) = −H⁻¹(k) ∇_W(k),
Gauss-Newton: P(k) = −T⁻¹(k) ∇_W(k),
where ∇_W(k) represents the gradient of the cost function E(W) with respect to the weight vector W at iteration k; H(k) and T(k) are the Hessian matrix and the Jacobian correlation matrix, respectively; λ_k is an orthogonalizing factor. Note that although the Newton and Gauss-Newton directions are very effective, in that the number of iterations needed by a learning process can be sharply reduced, they are limited to small-scale applications because their computational and storage demands per iteration grow quadratically with the size of the network. In the standard BPA, P(k) is the steepest descent direction. When the search for optimal weights crosses long, narrow regions of the cost function, the BPA becomes inefficient, exhibiting many zig-zag movements with only slight progress at each iteration [19]. To avoid this undesirable behavior without appreciably increasing the computational complexity, we employ the conjugate gradient direction stated previously. Note that the orthogonalizing factor λ_k, usually given as the ratio of the squared gradient vector lengths of two consecutive iterations,

λ_k = ||∇_W(k)||² / ||∇_W(k − 1)||²,  (58)

is appropriate only when the cost function can be treated as a quadratic form in the neighborhood of W = W(k). For the purpose of accommodating the nonlinear optimization task of backpropagation learning, the orthogonalizing factor is periodically restarted and/or cleared as

λ_k = 0,  if k = rN_r or |∇_W(k)^T ∇_W(k − 1)| / ||∇_W(k − 1)||² ≥ 0.2,
λ_k = ||∇_W(k)||² / ||∇_W(k − 1)||²,  otherwise,  (59)

where r is a positive integer and N_r stands for the restarting period, which empirically takes the value Q/2 (Q is the dimension of the weight vector W). As can easily be seen, once the optimal dynamic learning rate is obtained, the use of a conjugate gradient direction for the weight vector update is straightforward. It was observed by the authors [9] that the preceding conjugate gradient method employing an optimal learning rate at each iteration is equivalent to a BPA with both an optimal learning rate and an optimal momentum factor. Compared to the conjugate gradient methods reported in [26, 27], the present approach is more effective because a near-optimal learning rate is utilized, leading to near-orthogonal search directions in the weight space.
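A C sketch of the restarted conjugate gradient direction follows; the orthogonality test and its 0.2 threshold, as well as all names, are assumptions of this sketch.

```c
#include <math.h>

/* Sketch of the conjugate gradient direction with periodic restart,
 * Eqs. (58)-(59): P(k) = -grad(k) + lambda_k * P(k-1), with lambda_k
 * cleared every Nr iterations (Nr = Q/2 empirically) or when consecutive
 * gradients are far from orthogonal. g and g_prev must be nonzero. */
void conjugate_direction(const double *g, const double *g_prev,
                         double *p, int Q, int k, int Nr)
{
    double gg = 0.0, gg_prev = 0.0, gdot = 0.0;
    for (int i = 0; i < Q; i++) {
        gg      += g[i] * g[i];
        gg_prev += g_prev[i] * g_prev[i];
        gdot    += g[i] * g_prev[i];
    }
    double lambda = gg / gg_prev;                /* Eq. (58)               */
    if ((Nr > 0 && k % Nr == 0) ||               /* periodic restart       */
        fabs(gdot) / gg_prev >= 0.2)             /* loss of orthogonality  */
        lambda = 0.0;                            /* Eq. (59)               */
    for (int i = 0; i < Q; i++)
        p[i] = -g[i] + lambda * p[i];            /* new search direction   */
}
```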
VI. SIMULATION RESULTS

In this section, we compare our optimized dynamic learning parameter determination methods, Methods 1-5 detailed in the previous sections, with several classic backpropagation learning algorithms, including the standard BP algorithm, the delta-bar-delta algorithm, and the extended Kalman filtering (EKF) algorithm; the latter views the training of a neural network as a nonlinear system identification problem. We start with a brief analysis of the computational complexity associated with each method. As usual, we use P to denote the size of the training set, M the number of layers of the network, N_m the number of units (including bias) in the mth layer, U = Σ_{m=2}^M (N_m − 1) the total number of noninput units of the network, and finally Q = Σ_{m=2}^M (N_m − 1)N_{m−1} the total number of synaptic weights. The computational and storage demands of each algorithm per training example are listed as follows. For clarity, we use the symbols +, ×, T, and S to represent, respectively, the number of additions, the number of multiplications, the number of unipolar sigmoid function evaluations, and the storage demand.

The standard BP algorithm (BP):
+: ADD_BP = N_M − 1 + Σ_{m=2}^{M−1} (N_m − 1)(N_{m+1} − 1) + Q + Q/P,
×: MUL_BP = N_M − 1 + Σ_{m=2}^{M−1} (N_m − 1)(N_{m+1} − 1) + 2Q + Q/P,
T: SIG_BP = U,
S: STR_BP = 2Q + 2Σ_{m=1}^M N_m.
Delta-bar-delta:
+: ADD_BP + 4Q/P,
×: MUL_BP + 2Q/P,
T: SIG_BP,
S: STR_BP + 2Q.

EKF:
+: 4Q²N_M + Q(2N_M² − 3N_M) + N_M + N_M²,
×: 4Q²N_M + Q(2N_M² + N_M) + N_M² + 2N_M,
T: SIG_BP,
S: STR_BP + Q² + 2QN_M.

Line search (Method 1):
+: ∝ ADD_BP,
×: ∝ MUL_BP,
T: ∝ SIG_BP,
S: STR_BP + Q.

Newton-like method with negative gradient direction (Method 2):
+: ADD_BP + 5Q + 5U + 3N_M,
×: MUL_BP + 5Q + 8U + 3N_M,
T: SIG_BP,
S: STR_BP + 2Σ_{m=1}^M N_m.

Newton-like method with conjugate gradient direction (Method 3):
+: ADD_BP + 5Q + 5U + 3N_M + 3Q/P,
×: MUL_BP + 5Q + 8U + 3N_M + 3Q/P,
T: SIG_BP,
S: STR_BP + 2Σ_{m=1}^M N_m + Q.

Higher-order derivatives method (Method 4):
+: ADD_BP + 9Q + 14U + 8N_M,
×: MUL_BP + 9Q + 39U + 8N_M,
T: SIG_BP,
S: STR_BP + 4Σ_{m=1}^M N_m.

First two partial derivatives of (μ, α) (Method 5):
+: ADD_BP + 12Q + 16U + 8N_M,
×: MUL_BP + 12Q + 24U + 8N_M,
T: SIG_BP,
S: STR_BP + 5Σ_{m=1}^M N_m.
There are several observations to be made from the preceding list. First, the computational demand of the EKF algorithm is quite large, on the order of O(Q²N_M), against roughly O(Q) for the standard BP algorithm, and the storage requirements of the two algorithms are O(Q²) versus O(Q); the main reason is that the EKF algorithm requires substantial Q × Q and Q × N_M matrix operations. Second, the computational complexity of the delta-bar-delta (DBD) method is almost the same as that of the standard BP algorithm, as is that of our Method 1, the line search method. The DBD, however, does not provide a systematic way to adapt the learning parameters: it relies on local information and heuristics, and although it sometimes obtains a good guess of the step length for the next move, it cannot be guaranteed to achieve the optimal effect. It is also sensitive to the setting of the initial parameters; if these are not set properly, its convergence rate may even be slower than that of the standard BPA. Third, unlike the other methods, the family of methods we propose attempts to fully explore the error surface so as to dynamically determine the optimal learning rate and/or momentum term at every iteration step. The increase in computational complexity per iteration is only proportional to that of the standard BP algorithm, yet a considerably faster convergence rate can be achieved, as shown in the following examples.

In order to evaluate the performance of these algorithms in backpropagation learning, we conducted simulations on three typical benchmark problems comprising two classification tasks and one approximation task. The basis for comparing the different techniques was the learning curves (convergence performance) showing the decrease of the cost function versus CPU running time. For fair treatment, all the classic methods were carefully tuned, with the learning parameters, where appropriate, chosen via trial and error so as to obtain the fastest possible convergence rate. All the simulations, written in C, were run on a SPARC-10 workstation. Unless otherwise stated, the initial values of the synaptic weights were drawn uniformly at random from the interval [−0.1, 0.1]. In all trials, the training process terminated when any of the following three conditions was met: (a) the total number of iterations (epochs) of an algorithm exceeded 10⁴; (b) the CPU running time exceeded a preset limit; (c) the cost function reached a prespecified small value T_e. The performance of an algorithm is measured by averaging the results of 10 independent trials. The root mean square (rms) error is used in the learning performance plots shown later. Finally, we adopt the notation N_i-N_{h1}-N_{h2}-⋯-N_o to represent a fully connected feedforward neural network with N_i input units, N_o output units and, in between, (at least one) hidden layers with N_{h1}, N_{h2}, …, hidden units, respectively. All hidden and output units have a unipolar sigmoid transfer function.

EXAMPLE 1. This is a function approximation problem. A three-layered network having a 1-5-8-1 architecture with 67 synaptic weights was used to perform
Figure 2 Example 1: (a) original recursive function; (b) function approximation from a 1-5-8-1 neural network trained by Method 4.
the task. The function is given by

y = f(f(x)),  with f(x) = 3.95x(1 − x),  0 ≤ x < 1,
and is plotted in Fig. 2 (solid line). The training set consisted of 100 examples with x drawn randomly from the interval [0, 1]. Similarly, a second data set of 1000 samples was obtained for testing the convergence performance of each training algorithm. Figures 3 and 4 give the plots of the (averaged) test set convergence performance for each method. It can easily be seen that the Newton-like method (with or without the conjugate gradient direction), the higher-order derivatives method (Method 4), and the EKF show clear advantages over the rest in terms of both convergence rate and the final root mean square error achieved. The EKF in this case has a very rapid initial decrease. All five of our dynamic learning parameter adaptation methods considerably outperform the delta-bar-delta and standard BP algorithms. Figure 2 (dashed line) shows the neural network approximation of the function obtained with the higher-order derivatives method.

EXAMPLE 2. Our second example is to classify the two-dimensional meshed and disconnected regions depicted in Fig. 5a. The whole square area is divided into four categories with interlocking subregions. This problem was considered by Singhal and Wu [13], Pushkorius and Feldkamp [14], and Yu et al. [9] as
Figure 3 Example 1: averaged test set convergence performance for each of the algorithms proposed. (a) Method 1; (b) Method 2 (Newton-like method with negative gradient direction); (c) Method 3 (Newton-like method with conjugate gradient direction); (d) Method 4; (e) Method 5.
Figure 4 Example 1: comparison of averaged test set convergence performance between the standard BPA, delta-bar-delta, EKF, and the current Methods 4 and 5. (a) Standard BPA with μ = 0.45 and α = 0; (b) delta-bar-delta algorithm with μ = 0.5, α = 0.4, κ = 0.01, β = 0.2, and ζ = 0.6; (c) EKF algorithm with R(k) = Ie^(−k/…) and P(0) = 1000I; (d) Method 4; (e) Method 5.
Figure 5 Example 2: (a) ideal decision regions; (b) learned decision regions formed by a 2-10-10-4 neural network trained by Method 5.
a benchmark for testing the convergence rate of backpropagation learning. The network used for this task has a 2-10-10-4 architecture with 184 synaptic weights. A set of 1000 training examples drawn uniformly over the square area was used to train the network. The plots of learning performance for each method are summarized in Figs. 6 and 7. We can observe that Method 5, which attempts to simultaneously optimize both the learning rate and the momentum, achieves the best convergence
Figure 6 Example 2: averaged learning performance for the algorithms proposed. (a) Method 1; (b) Method 2 (Newton-like method with negative gradient direction); (c) Method 3 (Newton-like method with conjugate gradient direction); (d) Method 4; (e) Method 5.
Figure 7 Example 2: comparison of the averaged learning performance between the standard BPA, delta-bar-delta, EKF, and the current Methods 1 and 5. (a) Method 1; (b) Method 5; (c) standard BPA with μ = 0.025 and α = 0; (d) delta-bar-delta algorithm with μ = 0.01, α = 0.1, κ = 0.001, β = 0.2, and ζ = 0.6; (e) EKF algorithm with R(k) = Ie^(−k/…) and P(0) = 1000I.
performance. The delta-bar-delta method in this case is comparable to the other dynamic learning parameter determination methods in terms of convergence rate, though it is quite sensitive to the random weight initialization. The decision regions formed by a trained neural network adopting Method 5 are given in Fig. 5b.

EXAMPLE 3. The final task is a six-parity check problem, which was investigated in Yu et al. [9]. It was found that for this task the performance of a neural network is very sensitive to the initial weight values. For this problem, the input patterns are six-dimensional vectors with each element equal to either 1 or 0. The target output takes the value 1 if the input pattern contains an odd number of 1's; otherwise, the target output is 0. A neural network having a 6-8-5-1 architecture with 107 synaptic weights was used in the simulations. The training set consisted of all 64 possible training pairs. The learning rate and momentum of the standard BPA were set to 2.0 and 0.8, respectively, to obtain better learning performance. It was observed through numerous trials that, when the initial random weights were drawn from the interval [−d, d] with d less than 1.0, all the methods became more or less unstable. However, our five methods were able to converge in most cases (only one diverged among 30 trials), whereas the delta-bar-delta and EKF methods could hardly succeed in converging. As the interval was enlarged to [−2, 2], or d = 2, all methods worked normally except the EKF, with Method 3 achieving the fastest convergence rate. Figures 8 and 9 summarize the averaged performance (over 10 trials) for each method. For the case where the interval was reduced to [−1.25, 1.25], or d = 1.25, our five methods converged, whereas the delta-bar-delta and EKF methods diverged in most trials. The simulation results are provided in Figs. 10 and 11. The results clearly indicate that the set of algorithms we propose is much more robust to the initial weight values. In addition to being sensitive to the initial weight values, we found that the delta-bar-delta method is also sensitive to the learning parameters μ, α, κ, β, and ζ. This phenomenon can be observed in Fig. 12, where μ, α, κ, β, and ζ take the values 0.1, 0.6, 0.005, 0.8, and 0.6, respectively, and the delta-bar-delta method converges very rapidly. However, if we change α from 0.6 to 0.5 and β from 0.8 to 0.7, keeping the other three parameters unchanged, the learning process becomes dramatically different from the previous case. This implies that the learning parameters of the delta-bar-delta algorithm are highly problem-dependent and should be chosen very carefully through trial and error.
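For reference, the six-parity training set described above can be generated with a few lines of C:

```c
/* The six-parity benchmark of Example 3: the target is 1 when the 6-bit
 * input contains an odd number of 1's, and 0 otherwise. Builds all 64
 * training pairs. */
void make_parity_set(double input[64][6], double target[64])
{
    for (int n = 0; n < 64; n++) {
        int ones = 0;
        for (int b = 0; b < 6; b++) {
            int bit = (n >> b) & 1;
            input[n][b] = (double)bit;
            ones += bit;
        }
        target[n] = (double)(ones & 1);   /* odd parity -> 1 */
    }
}
```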
VII. CONCLUSION

In this chapter, a family of novel methods has been presented for effectively estimating the optimal dynamic learning parameters (learning rate and momentum) so as to speed up backpropagation learning in neural networks. The derivative information needed to compute the learning parameters is gathered exclusively from
Figure 8 Example 3: averaged learning performance for the algorithms proposed when the initial weight values of the network are drawn randomly from the range [−2, 2]. (a) Method 1; (b) Method 2 (Newton-like method with negative gradient direction); (c) Method 3 (Newton-like method with conjugate gradient direction); (d) Method 4; (e) Method 5.
Figure 9 Example 3: comparison of the averaged learning performance between the standard BPA, delta-bar-delta, EKF, and the current Methods 2 and 3 when the initial weight values of the network are drawn randomly from [−2, 2]. (a) Method 2; (b) Method 3; (c) standard BPA with μ = 0.12 and α = 0.0; (d) delta-bar-delta algorithm with μ = 0.1, α = 0.5, κ = 0.005, β = 0.7, and ζ = 0.6; (e) EKF algorithm with R(k) = Ie^(−k/…) and P(0) = 1000I.
Figure 10 Example 3: averaged learning performance for the algorithms proposed when the initial weight values of the network are drawn randomly from the range [−1.25, 1.25]. (a) Method 1; (b) Method 2 (Newton-like method with negative gradient direction); (c) Method 3 (Newton-like method with conjugate gradient direction); (d) Method 4; (e) Method 5.
Figure 11 Example 3: comparison of the averaged learning performance between the standard BPA, delta-bar-delta, EKF, and the current Methods 1 and 2 when the initial weight values of the network are drawn randomly from [−1.25, 1.25]. (a) Method 1; (b) Method 2; (c) standard BPA with μ = 0.12 and α = 0; (d) delta-bar-delta algorithm with μ = 0.1, α = 0.5, κ = 0.005, β = 0.7, and ζ = 0.6; (e) EKF algorithm with R(k) = Ie^(−k/…) and P(0) = 1000I.
Figure 12 Example 3: illustration of the learning performance of the delta-bar-delta algorithm with different learning parameters; the current Methods 1 and 2 are included for comparison. (a) Method 1; (b) Method 2; (c) delta-bar-delta algorithm with μ = 0.1, α = 0.6, κ = 0.005, β = 0.8, and ζ = 0.6; (d) delta-bar-delta algorithm with μ = 0.1, α = 0.5, κ = 0.005, β = 0.7, and ζ = 0.6.
some extended feedforward propagation procedures, with the computational and storage overhead per iteration limited to the same order as that of the standard BPA. This is in contrast to the previously reported second-order methods exploiting Hessian-like information of the weight vector, which require at least an order-of-magnitude increase in computation and storage. Extensive computer simulations have demonstrated the effectiveness of this set of methods. In general, they provide rapid convergence and achieve considerable savings in overall running time compared with several classic training methods also used in the experiments, including the delta-bar-delta, the extended Kalman filter, and the standard BPA. In addition, the strong dependence of these classic methods on the network weight initialization is largely removed, owing to the near-optimal learning parameters obtained at each iteration. Finally, the family of methods is well suited to large-scale application problems.
ACKNOWLEDGMENTS

XHY gratefully acknowledges partial support from the Transcentury Talent Foundation of the State Education Commission of China and from the Climbing Programme (National Key Project for Basic Research in China) under Grant NSC 92097. LQX would like to thank BT Laboratories for its support of and interest in this project.
REFERENCES

[1] D. R. Hush and B. G. Horne. Progress in supervised neural networks. IEEE Signal Process. Mag. 10:8-39, 1993.
[2] L. G. Allred and G. E. Kelly. Supervised learning techniques for backpropagation networks. In Proceedings of the International Joint Conference on Neural Networks, San Diego, Vol. 1, pp. 702-709, 1990.
[3] Z. Luo. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Comput. 3:226-245, 1991.
[4] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, School of Computer Science, Carnegie-Mellon University, 1988.
[5] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks 1:295-308, 1988.
[6] T. P. Vogl. Accelerating the convergence of the backpropagation method. Biol. Cybernet. 59:257-263, 1988.
[7] F. M. Silva and L. B. Almeida. Speeding up backpropagation. In Advances of Neural Computers (R. Eckmiller, Ed.), pp. 151-158. North-Holland, Amsterdam, 1990.
[8] M. Riedmiller. Advanced supervised learning in multi-layer perceptrons: from backpropagation to adaptive learning algorithms. Internat. J. Comput. Standards Interfaces 5, 1994.
[9] X.-H. Yu, G.-A. Chen, and S.-X. Cheng. Dynamic learning rate optimization of the backpropagation algorithm. IEEE Trans. Neural Networks 6:669-677, 1995.
[10] S. Becker and Y. Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. E. Hinton, and T. J. Sejnowski, Eds.), pp. 29-37. Morgan Kaufmann, San Mateo, CA, 1989.
[11] R. Battiti. First- and second-order methods for learning: between steepest descent and Newton's method. Neural Comput. 4:141-166, 1992.
[12] A. R. Webb, D. Lowe, and M. D. Bedworth. A comparison of nonlinear optimization strategies for feed-forward adaptive layered networks. Memorandum 4157, Royal Signals and Radar Establishment, 1988.
[13] S. Singhal and L. Wu. Training feedforward networks with the extended Kalman algorithm. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Scotland, pp. 1187-1190, 1989.
[14] G. V. Pushkorius and L. A. Feldkamp. Decoupled extended Kalman training of feedforward layered networks. In Proceedings of the International Joint Conference on Neural Networks, Seattle, Vol. 1, pp. 771-777, 1991.
[15] S. Kollias and D. Anastassiou. An adaptive least squares algorithm for the efficient training of artificial neural networks. IEEE Trans. Circuits Systems 36:1092-1101, 1989.
[16] M. Mohandes, C. W. Codrington, and S. B. Gelfand. Two adaptive stepsize rules for gradient descent and their application to the training of feedforward artificial neural networks. In Proceedings of the IEEE International Conference on Neural Networks, Orlando, Vol. 1, pp. 555-560, 1994.
[17] Y. Le Cun, P. Y. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Advances in Neural Information Processing Systems (S. J. Hanson, J. D. Cowan, and C. L. Giles, Eds.), Vol. 5, pp. 156-163. Morgan Kaufmann, San Mateo, CA, 1993.
[18] S. Haykin. Neural Networks: A Comprehensive Foundation, pp. 138-219. Macmillan, New York, 1994.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[20] X.-H. Yu and S.-X. Cheng. Training algorithms for backpropagation neural networks with optimal descent factor. Electron. Lett. 26:1698-1700, 1990.
[21] X.-H. Yu and L.-Q. Xu. Optimization of dynamic learning rate by its higher-order derivatives in backpropagation learning. Unpublished.
[22] M. A. Wolfe. Numerical Methods for Unconstrained Optimization. Van Nostrand Reinhold, New York, 1978.
[23] R. L. Watrous. Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization. In Proceedings of the First International Conference on Neural Networks, Vol. 2, pp. 619-628, 1987.
[24] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press, 1992.
[25] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA, 1986.
[26] A. H. Kramer and A. Sangiovanni-Vincentelli. Efficient parallel learning algorithms for neural networks. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 1, pp. 40-48. Morgan Kaufmann, San Mateo, CA, 1989.
[27] M. F. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6:525-534, 1993.
[28] M. T. Hagan and M. B. Menhaj. Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Networks 5:989-993, 1994.
[29] M. Feigenbaum. Quantitative universality for a class of nonlinear transformations. J. Statist. Phys. 19:25-52, 1978.
[30] B. Widrow and M. A. Lehr. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
Learning of Nonstationary Processes
V. Ruiz de Angulo and Carme Torras
Institut de Robotica i Informatica Industrial (CSIC-UPC), Edifici NEXUS, 08034 Barcelona, Spain
I. INTRODUCTION

The degradation in performance of an associative network over a training set when new patterns are trained in isolation is usually called forgetting or catastrophic interference. Applications entailing the learning of a time-varying function require the ability to quickly modify some input-output patterns while avoiding catastrophic forgetting. Learning algorithms based on the repeated presentation of the learning set, back propagation being the most popular representative, are only suited to tasks admitting two separate phases: an off-line one for learning and another for operation. If a very different and representative input-output pattern needs to be learned after the training of the main set of patterns has been completed, one gets into trouble. One can train the new pattern in isolation or retrain a mixture of the new and old patterns. The first option is the best for this kind of application because quicker adaptation is obtained, but the interference must not be very strong, or at least one must be able to mitigate it. Unfortunately, the forgetting problem turns out to be a very serious drawback of back-propagation networks. Although some interest has recently emerged in this issue, the connectionist community has not yet paid enough attention to it. Ratcliff [1] and McCloskey and Cohen [2], after many systematic studies, simply arrive at the conclusion that this problem cannot be satisfactorily solved. French [3] claims that the cause of forgetting is the overlap between the representations of the different patterns and, therefore, he modifies back propagation so as to produce semidistributed
representations. In these representations, only a few hidden units take the value 1, whereas the majority take the value 0. This type of approach has very serious convergence problems and requires a much larger number of hidden units than straight back propagation [4]. Reducing the distributedness of the representations also has the very undesirable effect of losing some of the most interesting neural network properties, such as generalization (as French himself points out) and damage resistance. Hetherington and Seidenberg [5] and especially Robins [6] have studied another aspect of the problem, namely the influence of the training regimes. The latter proposes mixing the training of the new pattern with pseudo-patterns, that is, points of the function implemented by the network. The problem is that the number of pseudo-patterns that must be mixed with the new one is much higher than the number of old true patterns required to obtain the same effect. Brousse and Smolensky [7] hold that there is no forgetting problem in what they call a "combinatorial environment," because in such an environment there are many virtual memories (error-free novel patterns) that do not interfere with old patterns. However, their results can be accounted for by the drastic restriction of the possible input-output patterns that can appear in a combinatorial environment (as Hetherington [8] recognizes) and by the use of autocoder networks. Both facts help generalization, as actually happens also in some of Hetherington's own experiments regarding the influence of increasing the training set size. Sharkey and Sharkey (1994) also used autocoder networks in their studies of catastrophic interference. We think that results about forgetting obtained with autocoder networks and low-error patterns must not be extrapolated to more general situations. Kruschke [9] has pointed out that the huge receptive field of weighted-sum units is responsible for interference in neural networks. Units with a limited receptive field are increasingly being used [10-12]. Locally tuned units that use radial basis functions (RBFs) are the common choice. This can be a valid solution, but an important drawback of RBF units is that they need many more examples than weighted-sum units to generalize well, especially in high-dimensional input spaces. The problem comes from the very local representations formed by this type of unit, which can be avoided (e.g., by allowing large radii). However, it is precisely this locality that allows the prevention of forgetting. The more the receptive fields grow and overlap, the more the interference problem comes back onto the scene. There exists a tradeoff between resistance to interference (local representations) and generalization (distributed representations). Our approach does not require information to be stored in special types of representations. In fact, we even try to take advantage of the distributed ones. We simply investigate what can reasonably be done to introduce a new pattern into a previously trained network while increasing minimally the error in the recall of the previously trained items. An algorithm, which we call LMD for "learning with minimal degradation," has been developed to accomplish this task efficiently in a
general feedforward net. The previous work most closely related to ours is that of Park et al. [13]. They state the problem in a way very similar to that of Section III, but their resolution method is considerably different, being based on the reduced gradient method for nonlinear constrained optimization. Our approach, based on the transformation of the problem into an unconstrained optimization, has evident advantages. For a more detailed comparison, see [14].
II. A PRIORI LIMITATIONS

It is worth noting that the success of any procedure with the same objective as LMD is necessarily limited. For the different settings in which such a procedure can be applied, some a priori limitations are spelled out in the following discussion. When a feedforward network with a fixed number of units has enough capacity to encode a fixed set of patterns, there is a bound on how fast learning can take place, because this problem has been proven to be NP-complete [15, 16]. Therefore, we cannot aim at finding a procedure that, in approximately constant time, learns a new pattern while leaving the old ones intact, because by iterating this procedure learning could be carried out in time linear in the number of patterns. Now suppose that the chosen architecture is unable to encode all the patterns perfectly. Let E be the error function over the patterns 1, …, n − 1, E′ be the error function over the patterns 1, …, n, and E_p be the individual error on pattern p. Applying an ideal procedure to learn pattern n when the network is at the minimum of E, the sole, unlikely possibility of arriving at the minimum of E′ is that E_n = 0 at this minimum. Note that, in general, the value of E at the minimum of E′ will almost surely be higher than the minimum of E. Therefore, if the final aim is to arrive at the minimum of E′, introducing the nth pattern perfectly can be worse than doing nothing. As shown in Fig. 1, it is possible that E′ grows when one forces the surface implemented by the net to pass over a point. The network whose results are displayed in the figure has one input and one bias unit connected to one output unit. The axes in the diagram stand for the input-output coordinates, each point thus representing a pattern. The best approximation the network can make of the old patterns is the continuous line. Learning the new pattern while minimizing the error over the old patterns results in weights giving the dashed-line interpolation. Note that this figure presents the worst possible case: a huge number of points that can be interpolated only by a surface having low-frequency, high-amplitude oscillations; a network with few parameters that is completely unable to fit the surface; and a new pattern far away from the mean of the old patterns. The moral of this discussion is not only that it is impossible to devise a "perfect" algorithm for the nondisruptive encoding of a new pattern, but that even
Figure 1 The points represent the learned patterns, which are approximated by the network with the continuous line. Constraining the network to perfectly encode the new association (open circle), while minimizing the error over the old patterns, gives a global approximation (dashed line) worse than the previous one.
if we had such an algorithm, applying it indiscriminately in an incremental way could be inappropriate in many situations. The most natural setting for the application of LMD is that in which the set of patterns to be learned is not fixed but time-varying, in the sense that there is a moving window for the error function, so that when new patterns arrive, some of the oldest ones are no longer taken into account. In this case, the previous arguments do not apply. When introducing a new pattern, the emphasis must be put on quick adaptation, and some forgetting of the old patterns is desirable. Typical applications of this kind are time series prediction and some control problems.
III. FORMALIZATION OF THE PROBLEM

The problem addressed in this chapter can be formulated as the minimization of the error over the n − 1 previously trained patterns, constrained by perfect encoding of the new pattern. It is convenient to consider the current weights (before the application of LMD) as constants and then write each error E_i as a function of the weight increments:

Min_ΔW E = Σ_{i=1}^{n−1} E_i(ΔW),
subject to E_n(ΔW) = 0.  (1)
Evaluating the E_i's accurately for a given ΔW would entail presenting the whole set of patterns to the network. If the optimum is to be found through some search
process, this evaluation has to be performed repeatedly, leading to a high computational cost. An alternative to accurate evaluation is to approximate the error function over the old patterns through a truncated Taylor series expansion, for example, with a second-order polynomial in the weights. A linear model is too crude and not feasible because, as will be pointed out at the end of Section V, it may turn the problem into an unsolvable one. The coefficients of the polynomial have a direct interpretation in terms of the first and second derivatives of E. Cross-terms are not included because, besides adding great complexity to the algorithm, the calculation of the Hessian requires computations very costly in memory and time. Thus, the most faithful problem formulation we can reasonably aspire to deal with is

Min_ΔW F(ΔW) = Σ_i c_i ΔW_i² + Σ_i b_i ΔW_i,  (2)
subject to E_n(ΔW) = 0,
where F is the cost function that estimates the error increment in E (loss function), and the constants b_i and c_i are, respectively, the first and second weight derivatives of E. A usual way to tackle a constrained optimization problem is to linearly combine the cost function with the deviation of the constraint from zero, and then minimize this new error function:

Min μF(ΔW) + βE_n(ΔW).
In the minimization of this function, there is a tradeoff between E_n and F that depends on μ and β, and the error on the new pattern will not be 0 unless the ratio β/μ tends to infinity with time. In practice, this is impossible, and it is approximated through an appropriate schedule for changing μ and β. The algorithm we have developed avoids this approximation by converting the constrained minimization problem into an unconstrained one, thus tackling the problem in a more direct and efficient manner. A different derivation of the algorithm can be found in [14].
IV. TRANSFORMATION INTO AN UNCONSTRAINED MINIMIZATION PROBLEM

The crucial finding underlying the transformation described in this section, and finally leading to the LMD algorithm, is the existence of a one-to-one mapping (except in very special cases) between a hidden-unit configuration and the best solution in a big subset of the weight increment space ΔW. It is therefore possible
to disregard all solutions but the best one from each subset, thus drastically reducing the search space. Furthermore, owing to the one-to-one relation mentioned previously, the optimum solution can be looked for indirectly by searching through the set of hidden-unit configurations. To explain this in more detail, we need to introduce some notation. An individual weight will be called w_ji (from unit i to unit j). Inc(j) is the index set of the units from which unit j receives direct input; Out(j) is the index set of the units to which unit j sends direct output. The connection graph is supposed to be loop free. Let x_j = Σ_{i∈Inc(j)} w_ji y_i be the total input received by unit j, and y_j = f_j(x_j) be the activation of the same unit, where f_j is the activation function of unit j, which, for the moment, we assume to be invertible. I and O stand for the spaces of input and output vectors, whereas I_d and O_d denote the input vector and the output vector of the new pattern. Remember that our actual variables are weight increments, and thus we consider the current weights as constants. All possible configurations of hidden-unit activations y_j define a vector space H, in which each component is limited by the range of its corresponding activation function. Then we define f_1: ΔW → H, the function that produces the vector of hidden-unit activations originated by a certain ΔW when I_d is presented as input to the network. We also define f_2: (ΔW, H) → O as the function that returns the network output vector originated by a set of increments and hidden-unit activations, under I_d as input to the network. Notice that this function is only defined for (ΔW, H) such that f_1(ΔW) = H. Finally, we define the one-to-one mapping we have referred to previously:

D: H → ΔW,
H ↦ ΔW = D(H),  (3)

such that

F(D(H)) = min F(ΔW),
f_1(ΔW) = H,
f_2(ΔW, H) = O_d.

D(H) is thus the best solution among those that produce the hidden activation vector H when the new pattern (I_d, O_d) is perfectly encoded. Figure 2 illustrates this mapping graphically, as well as the displacement of the search from the original weight-increment space ΔW to the space of hidden-unit configurations H. Let F̄(H) denote the H-dependent function F(D(H)). Provided we derive an expression for D and prove that D is a well-defined one-to-one mapping, which we do in the next section, we have transformed the original problem into the following one:

Min F̄(H),

because, once the solution H* to this problem is obtained, the solution to the original problem is just ΔW* = D(H*).
Figure 2 Diagram illustrating the meaning of the mapping D. The dashed lines delimit the zones of ΔW leading to the same hidden-unit configuration in response to the new input pattern. In the intersection of one of these zones with the subset of ΔW satisfying the new pattern, there is a unique minimum of the function F, which is the one we place in D(H). Thus, only the weight increments in D(H) are worth the search, which can be carried out through the function F̄(H) = F(D(H)). The search is thus brought to the space of hidden-unit configurations and, when finished, the solution obtained H* can be easily translated into its corresponding weight increment D(H*).
The benefits of this problem transformation are:
• The original constrained minimization problem has been turned into an unconstrained one.
• The new pattern is always perfectly encoded, thus obviating the tradeoff between cost minimization and constraint satisfaction mentioned in the preceding section.
• There are far fewer variables than in the original formulation.
• The domain of each variable is more restricted, because hidden-unit activation functions are normally of limited range.
V. THE ONE-TO-ONE MAPPING D

The validity of the transformation performed in the preceding section depends on the fact that D is really a function, that is, that for each H there is one and only one ΔW satisfying the conditions in (3). Let us now prove this fact. We start by noting that the unit activations y_j are components of either H or the new input-output pattern. Owing to the assumption that the activation functions f_j are invertible, the x_j become fixed as x_j = f_j^{-1}(y_j). Therefore, in what follows, we assume that x_j and y_j are constants and the usual equations x_j = Σ_i w_ji y_i do not hold. Then consider

R_j = Σ_i (Δw_ji + w_ji) y_i − x_j = 0.  (4)
When j runs along all the indices such that Inc(j) ≠ ∅ (hidden and output units), F = Σ_j F_j, and the set of R_j is equivalent to the conditions in the definition of D in (3). Thus, (3) can be rewritten as

Min Σ_{j: Inc(j)≠∅} F_j(ΔW),
subject to [R_j = 0]_{j: Inc(j)≠∅}.  (5)
Because F_j and R_j share all their variables and have none in common with F_k and R_k if k ≠ j, our problem reduces to independently minimizing each F_j under the R_j constraint. This can easily be done, for example, using the Lagrange multipliers method. First, we rewrite R_j in a more convenient manner:
R_j = Σ_i Δw_ji y_i = x_j − Σ_i w_ji y_i

and, calling Δx_j the constant x_j − Σ_i w_ji y_i, we have

R_j = Σ_i Δw_ji y_i = Δx_j.

Now we define the functions to be minimized, G_j, as

G_j = F_j + tR_j.  (6)
All ∂G_j/∂Δw_jk must be zero in the solution:

∂G_j/∂Δw_jk = 2c_jk Δw_jk + b_jk + t y_k = 0.

Hence,

Δw_jk = (−b_jk − t y_k) / (2c_jk).  (7)

Then, using the constraint R_j in (6),

Σ_i Δw_ji y_i = −Σ_i (b_ji + t y_i) y_i / (2c_ji) = Δx_j.

Therefore,

t = −2 [Δx_j + Σ_i b_ji y_i/(2c_ji)] / [Σ_i y_i²/c_ji],

and, substituting in (7),

Δw_jk = y_k [Δx_j + Σ_i b_ji y_i/(2c_ji)] / [c_jk Σ_i y_i²/c_ji] − b_jk/(2c_jk).  (8)

Observe that for the case in which c_ji = 1 and b_ji = 0, this is the Widrow-Hoff rule [17]. This formula will be simplified further, but we can already conclude that D is a well-defined function if and only if:
(a) At least one input arriving at each unit is nonzero. This is not a real danger unless threshold units that take zero as one of their states are used.
(b) All the c_ji are nonzero. We have to take care of this when selecting the parameters of the quadratic function (see Section VIII). This is the reason for the unsuitability of a linear F.
(c) The denominator of the first fraction is nonzero. The opposite is an unlikely event that is avoided by choosing positive c_ji (see again Section VIII).
VI. LEARNING WITH MINIMAL DEGRADATION ALGORITHM

We can now derive a gradient algorithm for carrying out the minimization in (5). Let us begin by giving F̄ = Σ_j F̄_j a more explicit form. From (8),

Δw*_jk = y_k ℰ_j / (c_jk M_j) − b_jk / (2c_jk),

where

ℰ_j = f_j^{-1}(y_j) − Σ_i w_ji y_i + Σ_i b_ji y_i / (2c_ji)

and

M_j = Σ_i y_i² / c_ji.

Substituting this expression for Δw*_jk into (5), expanding F_j = Σ_i [c_ji Δw_ji² + b_ji Δw_ji], and using the definition of M_j in the last step (observe that M_j and ℰ_j can be taken out of the summations), we finally obtain

F̄(H) = Σ_j F̄_j = Σ_j ℰ_j²/M_j − Σ_j Σ_i b_ji²/(4c_ji).  (9)
nn = Ew = E|:-iE|-
<')
To greatly reduce the complexity, it is convenient to work with the parameters ^)k = ^Jk - bjk/2cjk, because then £j = fj^ijj) - E / ^'jtyiThus, the first step in the algorithm will be the transformation of all the weights Wjk into u;^, and the last step will be the translation of the optimal hidden-unit configuration into the new increments Au;'*^, which can also be simplified as follows:
A.;a - (-.. + A.*,) -«.;, = A.*, + 1 ^ - ^ -
(10)
Learning of Nonstationary Processes
185
Let us now derive the gradient of F. To prevent that during the search the y's would travel beyond their valid ranges, because the activation functions fj usually have a limited range, we choose to calculate dF/dxj. Note that, using Xj = fj(xj), Xj appears explicitly in Sj and therefore in Fj, but also implicitly in all Fs such that s e Out(7). The gradient when j is a hidden-unit index is then
This formula is valid for networks with whatever number of layers, and even with jumps between layers. The gradient formula is easily implemented by an algorithm with a data flow in two phases, which resembles back propagation. In the first forward phase, £j and Aij are computed for all units with incoming connections. Then the first gradient term is easily calculated and each of the second term addends is backpropagated to get the total gradient. Do not be misled by the "forward" and "backward" names, because the similarity with back propagation is limited by the fact that here the information needed by a unit in both phases is already available locally, without any time delay required to wait for information from remote layers. Because of this independence, updating can occur in any order, allowing even total overlapping. This makes complete parallelism in the implementation possible, not only within a layer, but also among layers. The LMD algorithm is, therefore, as follows: 0. w'jj^ ^
Wjk -
bjk/2cjk.
1. Fix the input and output pattern in the network input and output and derive Xj for all output units. Choose initialization values for all the hidden-unit total inputs Xj. 2. Repeat until a given stopping criterion a. Calculate M.j and £j for all units with incoming connections (forward pass). b. Back-propagate the second term in (11) and update Xj for all hidden units using Xj ^^ Xj + iJi(dF/dxj) (backward pass). 3. Change weights using (10). Remember that every time Xj is changed, yj must also be updated because they are binded variables. It may seem excessive to dedicate a sequence of cycles for only one pattern. Perhaps a solution could be approximated, for example linearizing the network or with another heuristic. It all depends on the difference between a solution of this kind and the true minimum, how this difference is reflected in E, and how the increase in damage has repercussion on the recovery learning time over the
previous patterns and the new pattern. It can be supposed that it is worthwhile to spend some more time on only one pattern, trying to trim E a bit, if in exchange one avoids some cycles over the whole learning set.
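As a sketch of how steps 0-3 fit together, the following code runs the forward/backward cycle of equations (9)-(11) for a network with one hidden layer. It is a minimal illustration under stated assumptions (tanh hidden units, weights already translated to w′ = w − b/2c, all names ours), not the authors' implementation:

    import numpy as np

    def f(x):
        return np.tanh(x)              # assumed hidden activation

    def fp(x):
        return 1.0 - np.tanh(x)**2

    def lmd_cycle(x_h, y_in, W1p, W2p, C1, C2, x_out, mu):
        # One forward/backward LMD cycle for a single-hidden-layer net.
        # x_h: hidden total inputs (the search variables); y_in: input pattern;
        # W1p, W2p: translated weights w' = w - b/2c; C1, C2: c-coefficients;
        # x_out: required total inputs of the output units.
        y_h = f(x_h)
        # forward pass: script-E and M for hidden and output units
        E1 = x_h - W1p @ y_in                  # hidden-unit script-E_j
        M1 = (y_in**2 / C1).sum(axis=1)        # hidden-unit M_j
        E2 = x_out - W2p @ y_h                 # output-unit script-E_s
        M2 = (y_h**2 / C2).sum(axis=1)         # output-unit M_s
        F = (E1**2 / M1).sum() + (E2**2 / M2).sum()   # Eq. (9), constants dropped
        # backward pass: gradient (11); the second term is backpropagated
        back = (E2 / M2) @ W2p + (E2**2 / M2**2) @ (y_h / C2)
        grad = 2.0 * E1 / M1 - 2.0 * fp(x_h) * back
        return x_h - mu * grad, F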
VII. ADAPTATION OF LEARNING WITH MINIMAL DEGRADATION FOR RADIAL BASIS FUNCTION UNITS

Section II demonstrated that the forgetting problem can be insurmountable for fixed architectures in some applications. In this case, it would be convenient to use incremental architectures, and RBF units seem the best candidates to be the new units, because they are suited to modifying the learned function locally. In this section, the derivations needed to apply the LMD algorithm to RBF units are spelled out. We shall limit the study to the case of b_ji = 0 which, as we will see in the next section, is the one used in practice, because the inclusion of the b_ji reduces to a preprocessing step with a standard second-order algorithm. The solution for arbitrary c_ji is difficult to find analytically. For this reason, the c_ji of the weights impinging on a Gaussian unit j are all taken to be equal and are denoted by c_j. The propagation function of the weights outgoing from an RBF unit is a scalar product and the activation function of the output units is linear or sigmoidal. Thus, the cost function associated with these weights is the same as in back-propagation networks. RBF units, however, introduce an element that does not fit in the solution framework stated in Sections IV and V. The activation function of these units is Gaussian, and thus not invertible, which was one of the assumptions made to show that D is a one-to-one function. Nonetheless, this difficulty can be easily overcome. Although we base all the framework on the hidden-unit activations for the sake of clarity in the exposition, it is also possible to base it on the total inputs to the hidden units, D being a function of these, in such a way that the invertibility assumption can be dropped. Consider C_j and R_j when j is an RBF unit:
    C_j = c_j Σ_i Δw_ji²,    R_j : Σ_i (w_ji + Δw_ji − y_i)² − a_j² = 0,

where a_j is the new total input (distance) required for unit j. The independence of the problems associated with each pair C_j and R_j still holds, due to the absence of common variables. Thus, it suffices to solve each subproblem using Lagrange multipliers. The function G_j that must be optimized is
again G_j = C_j + t R_j, and thus its derivatives must be null:

    ∂G_j/∂Δw_jk = 2 c_j Δw_jk + 2t (w_jk + Δw_jk − y_k) = 0.

Hence,

    Δw_jk = −t (w_jk − y_k)/(c_j + t).

Now we substitute the increments in R_j to obtain t:

    Σ_i (w_ji − t (w_ji − y_i)/(c_j + t) − y_i)² = a_j².

From this we cannot obtain t easily. We shall make the variable change t′ = −t/(c_j + t). In this way, Δw*_jk = t′(w_jk − y_k) and

    (1 + t′)² Σ_i (w_ji − y_i)² = a_j²,    1 + t′ = ± a_j/√(Σ_i (w_ji − y_i)²),

so

    Δw*_jk = (w_jk − y_k) (± a_j/√(Σ_i (w_ji − y_i)²) − 1).

The increments Δw*_jk can take two values. We must select those corresponding to the minimum of C_j. Because a_j is always positive, it is easy to check that

    Δw*_jk = (w_jk − y_k) a_j/√(Σ_i (w_ji − y_i)²) − (w_jk − y_k)

is always the option with the lower absolute value, and thus this is the desired solution.
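In code, the solution for a Gaussian unit is a one-liner. The sketch below (our naming, with a the target distance) moves the centre weights as little as possible while meeting the constraint:

    import numpy as np

    def rbf_increments(w, y, a):
        # Minimal-cost increments for a Gaussian unit: move the centre w so
        # that its distance to the input y equals the target distance a.
        d = np.linalg.norm(w - y)        # current distance
        return (w - y) * (a / d - 1.0)   # the lower-|.| root of the two options

    w = np.array([1.0, 0.0])
    y = np.array([0.0, 0.0])
    dw = rbf_increments(w, y, a=0.5)
    assert np.isclose(np.linalg.norm(w + dw - y), 0.5)   # constraint R_j holds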
Let us now calculate C_j:

    C_j = c_j Σ_k (Δw*_jk)²
        = c_j a_j² + c_j Σ_k (w_jk − y_k)² − 2 c_j a_j √(Σ_k (w_jk − y_k)²).

The gradient of C_j with respect to a_j is

    ∂C_j/∂a_j = 2 c_j a_j − 2 c_j √(Σ_k (w_jk − y_k)²).

We can finally write the total gradient of an RBF unit with respect to C. Remember that the second term of (11) corresponds to the gradients of the outgoing weights, which are the same in this case:

    ∂C/∂a_j = 2 c_j a_j − 2 c_j √(Σ_k (w_jk − y_k)²) − 2 f′_j(a_j) Σ_{s∈Out(j)} (ℰ_s w′_sj/M_s + ℰ_s² y_j/(c_sj M_s²)).
Because each RBF unit is directly connected to the input of the network, y_k = x_k, and thus √(Σ_k (w_jk − y_k)²) (which would be a_j in the forward operating mode of the network) is constant during the search and must be computed only once.
VIII. CHOOSING THE COEFFICIENTS OF THE COST FUNCTION

The most obvious choice for the coefficients b_jk and c_jk is that yielded by an instantaneous second-order approximation of the error function E(ΔW) ignoring the off-diagonal terms. Then c_jk = ∂²E/∂w_jk² and b_jk = ∂E/∂w_jk. With this choice, F is exactly the local estimation of E assumed by several authors [18-20] to justify this simple pseudo-Newton rule for optimizing E:

    Δw_jk = −(∂E/∂w_jk) / (∂²E/∂w_jk²).
Note that step 0 in the LMD algorithm, when c_jk ≠ 0, is exactly one of these pseudo-Newton steps. The implicit aim of this step is to bring the network to the minimum of F, cancelling out first derivatives. Unfortunately, the minimum of F normally is not the minimum of E, and first derivatives could still remain significant. The conclusion is that this step is scarcely useful and, in fact, what we really need is to minimize the first derivatives of E by whatever means before the application of LMD. Later, in Section XII.B, we will give some advice on reducing first derivatives during training. Then, if step 0 is taken out of the algorithm, we are minimizing the cost function

    F = Σ_jk c_jk Δw_jk².

In the minimum, this is still a diagonal second-order approximation. At an arbitrary point, it can be shown [21] that, for a function of the type

    E(ΔW) = Σ_jk c_jk Δw_jk²,

this choice of coefficients is, on average, the best for estimating E. In order to prevent the nullity of the c_jk, so as to fulfill condition (c) stated in Section V, we can add a very small constant (the range of the hidden activation function divided by 1000, for instance) to every one of the second derivatives. Even when some c_jk are not null, but very close to zero, this helps to ameliorate the behavior of the algorithm. Because, as we said, it is necessary to have positive c_jk, we should take the absolute values of the second derivatives. Note that when the network is in the minimum, these second derivatives are already positive. There is another way of using the coefficients to realize a coarser estimation of the damage to the net. By making all the c_jk parameters equal to 1 and all the b_jk equal to 0, the algorithm is somewhat simpler and permits either saving the cost of calculating the c_jk (though it is relatively cheap) or working when they are not available. What LMD is calculating in this case is the nearest solution for the new pattern in the weight space. This is a good heuristic to look for the intersection of the solution space of patterns 1,…,n−1 and the solution space of pattern n. Under total uncertainty about the shape of the solution space for the new pattern, using this version amounts to introducing white noise into the network with a uniform probability distribution. This type of noise seemed to do little injury in a study performed by Hinton and Sejnowski [22]. In sum, two versions of LMD have been mainly explored in the experiments: the standard one, which uses c_jk = |∂²E/∂w_jk²|, and the coarse one, which uses c_jk = 1.
IX. IMPLEMENTATION DETAILS

One of the features that must characterize the application of the algorithm is total automation. We can neither expect to test and correct parameters each time we introduce a new pattern, nor to watch over the process to decide when convergence has been completed. Besides, we need an algorithm that is reliable in all situations, bringing the network to the minimum without risk of catastrophe. Therefore, in the following subsections we describe the rationale underlying the determination of parameters and initialization values.
A. ADVANCE RATE

The current implementation of LMD follows pure gradient descent but with adaptive step size. The strategy is very simple. When the last step is beneficial (i.e., it leads to a decrement in the cost function), the advance rate μ is multiplied by a number μ⁺ slightly larger than 1. In the opposite case (increment in F), the step is partially undone, the advance rate is reduced by a factor μ⁻, and then the step is redone. This can be implemented with little more computation than a forward pass. The advance rate control routine, which runs after step 2b, is as follows:

    if ΔF < 0 then
        μ ← μ μ⁺
    while ΔF > 0
        x_j ← x_j + (1 − μ⁻) μ (∂F/∂x_j)
        μ ← μ μ⁻
        forward pass 2a
    end while

We have always taken μ⁻ = 0.5 and μ⁺ = 1.3. Observe that, according to (9), ΔF can be easily calculated from the old value of F and the new values of ℰ_j and M_j obtained in the forward pass. The ∂F/∂x_j are always those computed in step 2b and, thus, they do not need to be recomputed. The initial value of μ is only important for quicker convergence, because convergence is guaranteed. The most convenient value depends on the size of the network and, therefore, for the scaling experiment we also scale the initial μ; in the other experiments the networks are of comparable size and a value of 2 is always used. As a refinement to the basic algorithm, it is possible to use a much larger μ⁺ and a very small μ⁻ in the first iterations of LMD until a mistaken step is done (if the initial step is not overshot) or a valid step is done (if the initial step is overshot), and afterward use the normal values.
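A runnable version of this control routine might look as follows (a sketch with assumed names; forward_pass recomputes F from the current hidden total inputs):

    def advance_rate_control(x, grad, mu, F_old, forward_pass,
                             mu_plus=1.3, mu_minus=0.5):
        # x is the hidden-input vector after the step x <- x - mu*grad.
        F = forward_pass(x)
        if F < F_old:                              # beneficial step: speed up
            return x, F, mu * mu_plus
        while F > F_old:                           # overshot: partial undo, retry
            x = x + (1.0 - mu_minus) * mu * grad   # leaves x_old - mu_minus*mu*grad
            mu = mu * mu_minus
            F = forward_pass(x)
        return x, F, mu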
B. STOPPING CRITERION

The search is finished when

    √(Σ_{j∈hid} (∂F/∂x_j)²) / √(n_H) ≤ G_min,

where n_H is the total number of hidden units and G_min is a constant that regulates the accuracy in finding the minimum. Dividing by √n_H seems convenient to obtain values independent of the network dimensionality. In normal practice, G_min = 0.005 seems a good choice, and this is the value used in all the experiments reported.
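For concreteness, this criterion is a one-liner (a sketch; grad_x and the function name are our naming):

    import numpy as np

    def converged(grad_x, g_min=0.005):
        # grad_x: the dF/dx_j values over the n_H hidden units.
        return np.linalg.norm(grad_x) / np.sqrt(grad_x.size) <= g_min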
C. INITIAL HIDDEN-UNIT CONFIGURATION

We have used several architectures with different weights to test for the existence of local minima. This was done by using a great number of random initial hidden-unit configurations and measuring the cost function at the final points reached. The result has been that networks that are able to learn, that is, those with moderately saturated units, rarely lead to landscapes with local minima. Only when using random weights of increasing magnitude do local minima begin to appear, more numerous and higher. Because of the scarcity of local minima, the initial hidden-unit configuration is not a crucial question, but the number of cycles required for convergence can increase with a bad selection of the initial point. There seem to be two privileged initial points: one is the hidden-unit configuration that results from propagating activity through the network and the other is the one with all the unit activations in the middle of the activation range. The first is good in the case where the error in the new pattern is low, because then the weight modifications needed to get the new pattern are small and, as a consequence, the ideal hidden-unit configuration is near this point. The second point has the advantage that each hidden unit is completely free to go in one direction or another (in fact, this is also useful to avoid local minima) and the gradient of the activation function is maximum (at least for sigmoids). Here all the experiments use the second option. A more elaborate decision could be to switch to the first option when the error in the new pattern is small.
X. PERFORMANCE MEASURES In this section, we develop tools for evaluating the computational savings provided by LMD, which turn out to have wider application.
One method of measuring the benefits of using LMD would be comparing E_n, the global error over the patterns 1,…,n, before and after applying LMD to the nth pattern. This should be indicative of how much the search for the E_n minimum is facilitated. Surprisingly, it is not like this. For instance, even if the error after the application of LMD is slightly higher, we have systematically observed that LMD nevertheless helps to shorten learning times. Probably this phenomenon is similar to the accelerated relearning times registered in [22] when some disturbance is introduced into trained networks. It is a knotty problem to measure the computational costs of arriving at a minimum from two different points. Here we suggest two measures that are independent of any algorithm parameters and rely only on the landscape of the error function. Therefore, they reflect objectively some aspect of the difficulty of finding a minimum from different initial points. We have checked that both give qualitatively similar results if it is not required to arrive very accurately at the minimum. For instance, the phenomenon mentioned previously appears independently of the measure used. The first measure is the time a dynamical system driven by the system of equations

    dW/dt = −∂E/∂W

would take to go from one point to another of its trajectory. We call this measure the back-propagation time, because these are the learning equations of back propagation as a continuous dynamical system. To estimate the back-propagation time, we constrain pure gradient back propagation to take only steps that produce error decrements that can be predicted by a linear approximation of the error function; that is, we take the criterion

    |Est ΔE^(k) − (E^(k+1) − E^(k))| / √((E^(k+1) − E^(k)) Est ΔE^(k)) < exigence    (12)

to accept the learning rate μ^(k) that produced ΔW^(k) as correct, where Est ΔE^(k) is the linear estimation of the error increment based on the first derivatives:

    Est ΔE^(k) = (∂E/∂W)ᵀ ΔW^(k).
If all the steps along the learning process satisfy this criterion, we can say that the trajectory followed by the algorithm approximates the trajectory that the preceding dynamical system would follow. The fidelity to the continuous path is controlled by the exigence parameter. If step k satisfies the criterion, W(t) is approximately linear in the section between W^(k) and W^(k+1), and thus we can estimate the time to perform ΔW^(k)
by dividing the distance directly by the instantaneous velocity of the system, the gradient norm:

    Δt^(k) = ||ΔW^(k)|| / ||∂E/∂W||.
Therefore, the time to complete a trajectory is Σ_k Δt^(k) when all the steps satisfy the linear constraint (12). It is possible to adapt μ near-optimally during the training with an algorithm similar to the one presented in the following discussion. Using this measure, we have observed that the back-propagation time required to eliminate the last residuals of the error is much larger than that needed to eliminate the main part of the error. This is due to the fact that the velocity slows down in flat regions and especially in the neighborhood of a minimum. Algorithms more sophisticated than raw back propagation are expected to behave in a rather different manner. We propose another measure that overcomes the previous shortcoming, while at the same time allowing quicker computation. The measure is the standard curve length defined for rectifiable functions:

    ∫_{t₁}^{t₂} ||dW/dt|| dt,

where t₁ is the initial time point and t₂ is the final time point. We could compute it in a way similar to the one used for the back-propagation time, taking only linearly predictable steps. Instead, we have developed a more efficient, although more complicated, algorithm, which will be used later for other purposes. The idea is that, to calculate the curve length approximating the system trajectory, we can relax the linearity constraint of magnitude predictability to only angle continuity between two steps; that is, we accept a step only if its angle with the next step is close enough to zero. In this way, we profit from the fact that, unlike the back-propagation time, the curve length is independent of the velocity with which E comes down, and can always be computed in one step in the zones where the gradient direction does not change. The only inconvenience, which adds some complexity to the algorithm, is that to know whether one step is too large to be accepted, we must know the direction of the next step. This is the algorithm used to estimate the curve length:
where ti is the initial time point and t2 is the final time point. We could compute it in a similar way to the one used for the back-propagation time, taking only linearly predictable steps. Instead, we have developed a more efficient, although more complicated, algorithm, which will be used later for other purposes. The idea is that to calculate the curve length approximating the system trajectory, we can relax the linearity constraint of magnitude predictability to only angle continuity between two steps; that is, we only accept one step if the angle with the next step is close enough to zero. In this way, we profit from the fact that, unlike the back-propagation time, the curve length is independent of the velocity with which E comes down, and can always be computed in one step in the zones where the gradient direction does not change. The only inconvenience, which adds some complexity to the algorithm, is that to know whether one step is too large to be accepted, we must know the direction of the next step. This is the algorithm used to estimate the curve length: Repeat until stopping criterion AW <- i^dE/dW W <- W + AW While ang(AW, dE/dW{W)) > exigence W ^W -aAW AW ^ (I-a)AW
Exigence regulates the fidelity to the continuous system trajectory; μ is adapted to grow slightly at a geometric rate of β > 1 when a step is accepted, and to decrease more quickly at a rate of 1 − α, 0 < α < 1, when it is not. This keeps μ near the highest values allowed by the angle continuity constraint. In our simulations, α = 0.5 and β = 1.2. To compute ang(ΔW, ∂E/∂W(W)), a complete back-propagation cycle is required to get the E gradients (which, when the while condition fails, can be reused without further computation for the next step), but weights and weight increments must not be updated.
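The following is a runnable rendering of this curve-length estimator (a sketch; reading exigence as a cosine-of-angle threshold is our assumption, and grad_E stands in for a full back-propagation cycle):

    import numpy as np

    def cos_angle(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    def curve_length(w, grad_E, mu=0.1, alpha=0.5, beta=1.2,
                     exigence=0.99, max_steps=10000, tol=1e-6):
        length = 0.0
        for _ in range(max_steps):
            g = grad_E(w)
            if np.linalg.norm(g) < tol:            # minimum reached
                break
            dw = -mu * g
            w = w + dw
            # shrink the step until the gradient direction barely turns
            while cos_angle(dw, -grad_E(w)) < exigence:
                w = w - alpha * dw                 # partial undo
                dw = (1.0 - alpha) * dw
                mu = (1.0 - alpha) * mu
            mu = beta * mu                         # accepted: let mu grow again
            length += np.linalg.norm(dw)
        return length

    # On the bowl E(w) = ||w||^2 / 2 the trajectory is a straight line,
    # so the estimate approaches the distance to the origin (here 5).
    print(curve_length(np.array([3.0, 4.0]), lambda w: w))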
XI. EXPERIMENTAL RESULTS

A. SCALING PROPERTIES
Figure 3 gives an idea of the number of steps needed to reach the minimum and of the scaling properties of LMD. Networks of several sizes were used in this experiment, all with the same proportion of units in each layer. The smallest was an 8-3-1 network and the others were obtained by multiplying the number of units in each layer of this network by 10, 20, .... The initial μ for the 8-3-1 network was 1, and this was multiplied in the same manner as before for the rest of the networks. The abscissas indicate the number of connections. The ordinates show the average number of forward-backward cycles and of forward steps alone (due to mistaken steps) for 20 random new patterns. The cost function was F = Σ Δw_jk², and each weight w_jk in the network was obtained randomly from the uniform distribution [−1.5/|In(j)|, 1.5/|In(j)|]. It can be seen that the number of steps is not excessive, indicating the great simplicity of the unconstrained search space. The scaling properties are good, making the development of an acceleration algorithm less necessary. Other experiments also indicate that scaling with the number of layers is better than for back propagation.
B. SOLUTION QUALITY FOR DIFFERENT COEFFICIENT SETTINGS

Table I presents results for three different networks trained with two different numbers of patterns. It shows the average error increment in the output units for the previously trained patterns when LMD is applied, with c_jk = 1 and c_jk = |∂²E/∂w_jk²|.
Figure 3 Scaling experiment: number of LMD complete cycles and feedforward steps alone (due to overshot steps) as a function of network size (number of connections).
The error measure used is

    E = (1/(2 n N_o)) Σ_{j=1,…,n} Σ_{i=1,…,N_o} ε_ji²,

where N_o is the number of output units and ε_ji is the difference between the desired value and the network response value in output unit i for pattern j. The net structure is expressed in the obvious way (net 8-4-1 has eight input units, four hidden units, and one output unit). Both the previously learned patterns and the new ones were generated randomly from a uniform distribution in [−1.2, 1.2]. A symmetric sigmoid ranging from −1.712 to 1.712 was used for the hidden-unit activations. The activation functions of the output units were linear. Take into account that these conditions are bad for reducing the error increment: linear output
Table I  Error Increments after Applying LMD with c_jk = 1 and c_jk = |∂²E/∂w_jk²|

Architecture          8-3-1          32-12-4        300-30-10
Number of patterns    5      10      10     25      50     150
Coarse LMD            0.064  0.111   0.018  0.032   0.002  0.002
Standard LMD          0.038  0.087   0.012  0.028   0.001  0.002
units allow the error to grow unlimitedly, and the large ranges of the input-output patterns and hidden-unit activation functions lead to a great variance in the output of the network. The table shows averages over 20 new patterns. When the networks have been trained with many patterns, the error increments are larger for the two versions. This is due to the fact that a network with more random patterns has more output variance, and thus the average error of the newly learned random patterns is higher (more than double in the 32-12-4 network). Also, the advantage of the standard version over the coarse one is lower with more patterns, because the second derivatives are more uniform, indicating that the network is more saturated with information, the parameters are less free to vary, and, as a consequence, there are no privileged directions. When the patterns are not random, the networks become saturated more gradually. We can also guess that, in the case of many patterns, the quadratic estimation of the error function is poorer, because the very large errors force the network to look for solutions far from the present position in weight space. Finally, another experiment was made to test the solution quality provided by different settings of the coefficients in networks outside the vicinity of a minimum of the learning set. Table II presents results for the 8-3-1 architecture trained with five patterns until some fixed error level was reached. Five networks with the error levels shown in the table were used, and each of them underwent the introduction of a new pattern with the coarse and standard versions of LMD. As before, results were averaged over 20 new patterns. It can be seen that the standard version of LMD still gets significantly better results than the coarse one.
Table II  Error Increments out of the Minimum

Old patterns error   0.1     0.05    0.01    0.001   0.0005  0.0001
Coarse LMD           0.049   0.054   0.057   0.061   0.062   0.063
Standard LMD         0.023   0.025   0.029   0.033   0.034   0.035
C. COMPUTATIONAL SAVINGS DERIVED FROM THE APPLICATION OF LEARNING WITH MINIMAL DEGRADATION

We present now some experimental results regarding the improvement in training times provided by the use of LMD. The networks are the same as those in the preceding experiment, with the same already learned patterns and the same 20 new patterns. Let W*_{1,…,n−1} be the weight configuration at which the minimum error E*_{1,…,n−1} for the set of patterns 1,…,n−1 is attained, and let E be the error function over patterns 1,…,n. For each network and new pattern, we first calculate very accurately the minimum error E*_{1,…,n} for the set of patterns 1,…,n. Then we measure the curve length from W*_{1,…,n−1} to the first point in the trajectory with an error less than E*_{1,…,n} + 0.2(E(W*_{1,…,n−1}) − E*_{1,…,n}), and then from the same original point, transformed by applying LMD with pattern n, to the first point with the same level of error as before. Table III shows the results. The differences in curve length should be indicative of the advantage of using LMD to tune the network before applying any algorithm of the back-propagation type, because the length depends only on the shape of the error function. The computational effort to introduce the new pattern with LMD is negligible, because the number of LMD cycles is usually less than the number of patterns in the training set. The table reveals that it is more advantageous to use LMD when there are few patterns than when there are many. The main reason is the same one pointed out in the preceding subsection: growth of the new pattern errors leads to growth of the damage to the network. However, this is a very abnormal situation, because in typical applications the error in the new patterns tends to decrease as the number of learned associations grows. Moreover, to be always advantageous under this incremental scheme, LMD (and any other algorithm) needs to be presented with new patterns yielding progressively smaller errors, because, in the long run, a highly erroneous new pattern can lead to a situation of the type discussed in Section II and exemplified in Fig. 1. This shortcoming could be overcome, for example, by wisely reducing only a fraction instead of the complete error of the new pattern. We have left these refinements for future work.
Table III  Length Traveled with and without LMD to Reach the Minimum of the Patterns 1,…,n

Architecture           8-3-1          32-12-4        300-30-10
Number of patterns     5      10      10     25      50     150
Length (without LMD)   0.335  0.546   0.334  0.456   0.182  0.169
Length (after LMD)     0.088  0.397   0.059  0.262   0.002  0.061
D. LEARNING WITH MINIMAL DEGRADATION VERSUS BACK PROPAGATION

We present now some experiments comparing LMD and standard back propagation with different advance rates μ. This parameter turns out to be a very important factor in the comparison. In Fig. 4a, the 32-12-4 net, trained with 10 random patterns, undergoes the introduction of a new pattern with different back-propagation learning rates. The
Figure 4a Starting with a network in the minimum of the error function for the old patterns, a new pattern is introduced with different back-propagation learning rates. The graphic shows the number of cycles needed to learn the pattern and the total distance covered from the initial point in each case. It is clear that, when the learning rate tends to zero, the distance tends to a limit.
pattern is considered to have been learned when the average absolute error in the output units is 0.01. The distance from the starting point to the final one in the weight space and the number of cycles needed are shown. The limit to which the distance tends when μ → 0 can only be approximated with large compu-
Figure 4b Another aspect of the experiment described in Fig. 4a. The distances represented in Fig. 4a are now on the abscissa axis. The two curves represent the error increments and the back-propagation times beginning after the introduction of the new pattern and ending in the global minimum of the error function including the new pattern. Coarse LMD (c_jk = 1) minimizes the distance explicitly and so it must obtain results better than the back-propagation limit. However, standard LMD (c_jk = |∂²E/∂w_jk²|), which does not take the distance into account, gets the best results.
tational costs. Figure 4b displays how the distances to the starting point shown in Fig. 4a affect the error in the old patterns, and how this, in turn, affects the recovery time (measured in terms of back-propagation time) needed to relearn both the new and the old patterns. The distance lower bound for back propagation when μ → 0 (0.69), as well as the results with the coarse (c_jk = 1) and standard (c_jk = |∂²E/∂w_jk²|) versions of LMD, is also shown here. The different weight solutions provided by back propagation are in random directions with respect to the starting point, because they were obtained independently of E; however, a neat linear relation appears between the distance and the error increment. An important finding for our investigation is that the recovery time follows an exponential-like curve, implying that it is very important, with respect to recovery time, to trim as much as possible the damage caused to the old patterns, even if the possible reductions are small. This example also shows how the distance for the coarse version of LMD must be, by definition of the cost function, always the shortest one. The difference between the true minimum distance found by the coarse version of LMD and the approximation made in the limit by back propagation is variable, especially for highly erroneous new patterns, but usually the two points are close. Finally, note that the standard version of LMD provided by far the best results (very small error increment and recovery time) and it did so by moving a long distance away from the initial point, thus proving the existence of a privileged direction.
XII. DISCUSSION

A. INFLUENCE OF THE BACK-PROPAGATION ADVANCE RATE ON FORGETTING

One of the main results we have obtained is the dependence of back-propagation forgetting on the learning rate. Cohen [2] already observed this in a systematic variation of all back-propagation parameters. We can now provide an explanation of this dependence. The gradient of the error function for the new pattern can in principle lead the network anywhere, but it is a good heuristic for finding one of the nearest solutions. The problem is that back propagation with usual learning rates does not follow the true gradient line, because of its discrete nature. Because the solutions for the new pattern pervade the weight space, too large a step may lead to a point crossed by a gradient line driving the network to a different and farther solution. If back propagation is forced to closely follow the true gradient descent line (as is the case when the advance rate tends to zero), it becomes a reasonable application of the coarse version of LMD, but with high
computational cost. Usual accelerating algorithms, taking bold steps, can only worsen forgetting. Contrarily, the algorithm for measuring the curve length follows the gradient line with the desired accuracy, but with the highest learning rates possible in each step, alleviating the inefficiency versus catastrophic-forgetting tradeoff in back propagation, thus finding another use complementary to the one for which it was designed in Section X. For instance, applying the curve-length algorithm with exigence = 0.99 in the last experiment of the preceding section, the distance obtained was 0.693 (only slightly higher than the back-propagation limit) and only 27 cycles were used (almost half of those needed by back propagation with the appropriate μ to get the same distance). The algorithm to compute the back-propagation time could also do the job in a simpler but more inefficient way.
B. HOW TO PREPARE A NETWORK FOR DAMAGE, OR THE RELATION OF LEARNING WITH MINIMAL DEGRADATION TO FAULT TOLERANCE

A conclusion from the exponential-like curve for the recovery time reported in Section XI.D, as well as from the reasoning about the convenience of dispensing with the b_jk parameters presented in Section VIII, is that one must wait as long as possible for the completion of the learning of the previous patterns (by second-order methods, standard back propagation, or whatever means) before introducing the new patterns. This overtraining effect had been observed, but not explained, in studies of forgetting [2] and fault tolerance [23]. Avoidance of forgetting and increasing fault tolerance can be seen as intimately related goals. Both try to minimize the effect of weight perturbations on the information stored in the network. The only difference is that, in order to avoid forgetting, one can somewhat control the form of the perturbation. This similarity can be exploited directly. Here, for example, the explanation for damage reduction after overtraining can be easily transferred from one domain to the other. Through the Taylor series expansion of the error-increment function produced by a perturbation, it is evident that minimizing the first derivatives of E (the b_jk parameters) is the first priority for reducing the error increment. A minimum of E is also a minimum of the absolute value of its first derivatives, but moving across the weight space while decreasing the error function does not imply decreasing derivatives, unless the network is really near the minimum of E. Examining the curves produced by the learning rate 0.002 in Fig. 5a and b, it can be seen that, when the training is finished, the level of error is very good, practically 0, but the gradient norm is still relevant, of the same order as that found in the middle of learning. This is the effect of the velocity difference in minimizing E and its
Figure 5a Error evolution in a network with two different learning rates (0.002 and 0.0038).
derivatives. With enough training (overtraining) both values can be brought to zero. On the other hand, with a higher learning rate of 0.0038, the learning curve fluctuates but arrives faster at the minimum. Notice that, in the last part of the training, the curve stabilizes and descends uniformly, arriving at a level three times lower than before, but nevertheless the final gradient norm is huge. Even a minimal modification of the weights will produce catastrophic forgetting. The different versions of back propagation are normally used with the highest learning rates that allow faster training in the long term. As a consequence, on many occa-
Figure 5b Gradient norm evolution along the learning, corresponding to the training curves in Fig. 5a.
sions (even if the error function is always decreasing) the network is out of the bottom of the error function valleys, at points with high derivatives. Then, if one wants to alleviate the perturbation effects at a given stage of the learning, the best one can do is to minimize locally the derivatives of E by following for a certain time the true gradient of E. This will bring the network to the bottom of the current valley. To follow the gradient line, one can take some steps of back propagation with a very cautious learning rate or, more efficiently and safely, one can use one of the algorithms presented in Section X, which thus find here still another use.
C. RELATION OF LEARNING WITH MINIMAL DEGRADATION TO PRUNING

The standard version of LMD, without b_jk and with the c_jk as second derivatives in the minimum of E, minimizes the same function (constrained by the new pattern) as Le Cun et al. [24] in their pruning procedure, our c_jk being their sensitivities. In a certain sense, our technique can be seen as the opposite of pruning. Pruning detects the least exploited weights in order to eliminate them. Instead, LMD uses them to introduce new information. The relation with pruning suggests that advances in pruning techniques can be incorporated into LMD. For example, some authors have recently used weight sensitivities that go beyond the strict locality of second derivatives [25].
XIII. CONCLUSION

There are easy ways to overcome the catastrophic interference problem by using incremental architectures. In fact, a local hidden unit can be added each time an erroneous new pattern arrives, centering its receptive field on the input pattern and assigning an appropriate value to the hidden-output weight in order to correct the error. It is enough to make the receptive field as small as necessary to mitigate interference to a desired degree. However, there are reasons to develop methods compatible with fixed architectures:

1. In an always changing environment, the network would tend to grow indefinitely.
2. More importantly, even when adding a new unit, the new information should somehow be shared with the rest of the network. Otherwise, if each unit only assumes responsibility for a point, one gets a kind of look-up table network.

When adding a new unit is considered necessary, it is advisable to correct the new pattern as much as possible with the rest of the network by applying the methods developed to minimize damage over the previous patterns in the case of fixed architectures. For this reason, it is important that these methods be compatible with the use of local units. The problem of catastrophic interference can be divided into two subproblems. The first (retroactive interference) arises when one tries to introduce new information into a previously trained network without disrupting the stored information. The second is how to encode a learning set of patterns in such a way that the retention of this set is maximized when some new, unknown information has to be stored. The second problem is tackled in [21, 26], where its relation to other connectionist problems is discussed.
This chapter has been concerned with the retroactive interference subproblem, which is solved with the LMD algorithm, based on the relation found between the constrained minimization that represents this problem and the unconstrained minimization of an implicit function over the hidden-unit activations. Full parallelism, both within and between layers, is one of the features of LMD. We implemented two versions of LMD: the coarse one, which searches for the nearest solution in the weight space, and the standard one, based on the information provided by the second derivatives of the error with respect to the weights. We have demonstrated the good scaling properties of the algorithm, as well as the solution quality and the computational savings derived from its application. The experiments also revealed that when the learning rate tends to zero, back propagation approximates the coarse version of LMD. The latter algorithm can control the comparative importance of the previous data by weighting each pattern through the coefficients c_jk in the error function, that is, using c_jk = Σ_p α(t_p)|∂²E_p/∂w_jk²|, where α(t_p) is the weight of each pattern p, depending on the time elapsed since the pattern was presented to the network for the first time. This feature can be useful in applications of the moving-window type, strengthening, for example, the remembrance of the last presented patterns. Furthermore, if the coefficients are not rigidly used to estimate the error function, LMD allows great flexibility in controlling how much of a pattern is learned by each individual connection, each unit, or each layer. Two measures and algorithms for measuring the costs of going from one point to another of the weight space have been developed. These algorithms have characteristics that make them interesting for some other purposes, some of which have been pointed out. They could also be interesting for evaluating the goodness of some initial weights for faster learning [27]. The variety of strategies possible for the use of LMD is worth further study. A less raw application of LMD could be beneficial or even necessary, as Section II suggests. The question is: what fraction of the new pattern error should be reduced with LMD? The answer, when tackling the minimization of E(W), will depend on the number of patterns already stored and on the average error increment expected to be required to modify the network outputs for the pattern, but in any case the new pattern should not be forced to be less erroneous than the previous patterns. On the other hand, we normally want to optimize generalization, which provides another constraint: the error should not be reduced beyond the variance of the noise of the output data. LMD could also be combined with back propagation, by applying it in each back-propagation cycle to the most erroneous pattern, which would help to avoid local minima and, as we are beginning to observe, seems to favor learning enormously. Finally, a general-purpose algorithm based on LMD can be devised. This algorithm would benefit from direct manipulation of the hidden-unit representations, just as the algorithms presented in [28-30] do. The rationale is that, because LMD can exactly control the network output for each pattern, it is possible to implement
a gradient descent algorithm at this level. If these variables were independent, then, due to the quadratic shape of E(W) with respect to the outputs of the network for the training patterns, the algorithm would bring the network to the minimum in a single step. However, the network outputs for the different patterns are dependent and, although LMD allows us to change some of them while minimizing the effect on the others, shorter steps should be taken. Let Y_p be the network output for pattern p and O_p the desired output. Then the fraction of (O_p − Y_p) reduced in each cycle would be regulated by a general parameter for all patterns, which should be enlarged or reduced depending on how much the result of applying LMD to all patterns in the previous step deviated from the straight line (in network output space) leading to the minimum.
XIV. ACKNOWLEDGMENT Carme Torras acknowledges partial support from the Commission of the European Union under contract 8556 (NeuroColt).
REFERENCES

[1] R. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Rev. 97:235-308, 1990.
[2] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: the sequential learning problem. In The Psychology of Learning and Motivation (G. H. Bower, Ed.). Academic Press, New York, 1989.
[3] R. M. French. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. Technical Report 51-1991, Center for Research on Concepts and Cognition, Indiana University, 1991.
[4] J. M. J. Murre. Categorization and learning in neural networks. Ph.D. Thesis, University of Leiden, 1991.
[5] P. A. Hetherington and M. S. Seidenberg. Is there catastrophic interference in connectionist networks? In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pp. 26-33. Erlbaum, Hillsdale, NJ, 1989.
[6] A. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Sci. 7:123-146, 1995.
[7] O. Brousse and P. Smolensky. Virtual memories and massive generalization in connectionist combinatorial learning. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pp. 380-387. Erlbaum, Hillsdale, NJ, 1989.
[8] P. A. Hetherington. The sequential learning problem in connectionist networks. Master's Thesis, Department of Psychology, McGill University, Montreal, 1990.
[9] J. K. Kruschke. Human category learning: implications for backpropagation models. Connection Sci. 5:3-36, 1993.
[10] J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Comput. 1:281-294, 1989.
[11] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78:1481-1497, 1990.
[12] J. Platt. A resource-allocating network for function interpolation. Neural Comput. 3:213-225, 1991.
[13] D. C. Park, M. A. El-Sharkawi, and R. J. Marks II. An adaptively trained neural network. IEEE Trans. Neural Networks 2:334-345, 1991.
[14] V. Ruiz de Angulo and C. Torras. On-line learning with minimal degradation in feedforward networks. IEEE Trans. Neural Networks 6:657-668, 1995.
[15] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Morgan Kaufmann, San Mateo, CA, 1988.
[16] S. Judd. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA, 1990.
[17] B. Widrow and R. Winter. Neural nets for adaptive filtering and adaptive pattern recognition. Computer, 1988.
[18] S. Becker and Y. Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. Hinton, and T. Sejnowski, Eds.), pp. 29-37. Morgan Kaufmann, San Mateo, CA, 1988.
[19] R. Scalettar and A. Zee. Emergence of grandmother memory in feedforward networks: learning with noise and forgetfulness. In Connectionist Models and Their Implications: Readings from Cognitive Science (D. Waltz and J. A. Feldman, Eds.), pp. 309-323. Ablex, Norwood, NJ, 1988.
[20] L. P. Ricotti, S. Ragazzini, and G. Martinelli. Learning word stress in a suboptimal second order back-propagation neural network. In Proceedings of the IEEE International Conference on Neural Networks, San Diego, Vol. 1, pp. 355-364. IEEE, New York, 1990.
[21] V. Ruiz de Angulo. Interferencia catastrófica en redes neuronales: soluciones y relación con otros problemas del conexionismo. Ph.D. Dissertation, Basque Country University, 1996.
[22] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations (D. E. Rumelhart and J. L. McClelland, Eds.). MIT Press, Cambridge, MA, 1986.
[23] J. Nijhuis, K. Höfflinger, V. S. Haik, and L. Spaanenburg. Limits to the fault-tolerance of a feedforward neural network with learning. In Proceedings of the 20th International Symposium on Fault-Tolerant Computing, Newcastle upon Tyne, pp. 228-235, 1990.
[24] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.). Morgan Kaufmann, San Mateo, CA, 1990.
[25] E. D. Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks 1:239-242, 1990.
[26] V. Ruiz de Angulo and C. Torras. Random weights and regularization. In Proceedings of the International Conference on Artificial Neural Networks (M. Marinaro and P. G. Morasso, Eds.), pp. 1456-1459. Springer-Verlag, Berlin/New York, 1994.
[27] S. Gavin. Designing multilayer perceptrons from nearest-neighbour systems. IEEE Trans. Neural Networks 3, 1992.
[28] T. Grossman, R. Meir, and E. Domany. Learning by choice of internal representations. Complex Systems 2:555-575, 1988.
[29] A. Krogh, C. J. Thorbergsson, and J. A. Hertz. A cost function for internal representations. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.). Morgan Kaufmann, San Mateo, CA, 1990.
[30] R. Rohwer. The moving target training algorithm. In Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.). Morgan Kaufmann, San Mateo, CA, 1990.
Constraint Satisfaction Problems
Hans Nikolaus Schaller DSITRI D-80798 Munich, Germany
I. CONSTRAINT SATISFACTION PROBLEMS

How do we usually solve problems like the seating arrangement for a banquet (Fig. 1), scheduling a meeting, finding timetables, or placing modules onto a printed circuit board (Fig. 2)? Other than in certain special cases, we have to start to examine tryout solutions and modify them until everything fits. If we begin to do this by hand, we rather quickly make compromises or give up because the problem is too difficult. Then we begin to ask for machine support. But what makes this type of problem much more difficult to solve than, for example, the calculation of a square root or the solution of a linear equation system? It is more difficult because it is only known which properties solutions must have and which properties they must not have, and no general algorithm for finding solutions is known other than more or less systematically trying and verifying all possible solutions. Problems of this type are called constraint satisfaction problems (CSPs). If the variables are restricted to a discrete domain, they are called more specifically discrete constraint satisfaction problems (DCSPs). CSPs and DCSPs have been studied intensively in the artificial intelligence (AI) literature and many solution procedures have been described. Several neural networks have been proposed to overcome drawbacks of the AI methods. This was initiated by a publication by
Figure 1 Seating arrangement problem. The constraints: A not neighbour of B; D not opposite to E; H more than two places away from B; if A neighbour of D, then E not neighbour of K; two of A, F, or G at both ends of the table; H on the right hand of C.
Hopfield in 1982 [1]. The driving factors of this research are the paradigm of inherently parallel computation, the resemblance to the human brain, and the ability to handle soft decisions, which could enable neural networks to find CSP solutions that the discrete AI algorithms do not find. In this chapter, we will discuss and compare all of them to understand the relationship of the paradigms and to identify common ideas. CSPs are closely related to optimization problems (OPTs), where we want to know the best solution among many, according to a given optimization criterion. Including CSPs, we can classify these problems by the variable domain (discrete or continuous), the existence and type of constraints, and the type of optimization target function (Table I).
Figure 2 Placement of modules.
Table I  Some Common Optimization Problem Groups

Problem                           Abbreviation  Variable domain       Constraints          Target function
Constraint satisfaction           CSP           Continuous            Yes                  —
Discrete constraint satisfaction  DCSP          Discrete              Boolean relation     —
(General) optimization            OPT           Continuous, discrete  —                    Any function
Linear optimization               LINOPT        Continuous            Linear inequalities  Linear function
Quadratic optimization            QOPT          Continuous            Linear inequalities  Quadratic function
Nonlinear optimization            NL-OPT        Continuous            Linear inequalities  Nonlinear function
Constrained optimization          COPT          Discrete              Boolean relation     Any function
0/1-integer optimization          0/1-OPT       0/1                   —                    Any function
The classification is not unique because, for example, constraints can be added to the target function in the form of penalty functions, and CSPs are, in fact, optimization problems where solutions are optimal and nonsolutions are worse. Additionally, optimality can be viewed as an additional constraint. Because of this, it is difficult to tell which of the two classes of problems is more difficult. Now, let us assume that we are working on a technical product development project and have identified that a certain task of the job can easily be described as solving a CSP. This may be, for example, the problem of finding a placement for N optical processing elements in a two-dimensional plane so that all elements can be reached by light beams from eight directions (Fig. 3) [2]. Because the constraints (the size N, the number of directions, the dimensionality, etc.) of this task may have to be dynamically modified in an unpredictable manner, we have to abandon the development of a specialized algorithm. So we are now trying to find a general solution strategy for this CSP that can be run effectively on a microcontroller or on special electronic hardware. We will have to search through the literature and then do an assessment of the methods we find. To get information on the methods, we will take a look at traditional state space search methods (depth-first search), more advanced stochastic methods like simulated annealing and genetic algorithms, and, last but not least, neural network techniques.
Figure 3 Optical processor placement.
Figure 4 One of the 92 solutions of the 8-queens problem. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
To make this assessment fair, we take the optical processor placement as a benchmark problem to compare the quality of the methods. It is structurally identical to the well-known N-queens problem (NQP), which was posed in 1850 by Carl Friedrich Gauss [3] (according to [4], by Nauck). In this problem, we take a chess board and want to know how to place N queens on the board so that they do not attack each other according to the rules of the game. These nonattacking requirements are the constraints on the possible placements. It is known that the problem has solutions for N = 1 and all values of N ≥ 4. An example of such a placement for N = 8 is shown in Fig. 4. Seven more solutions result from mirroring and rotation, and all eight build a family. The other solutions belong to 11 families with either four or eight solutions. Together, there are 92 solutions for N = 8 [5].
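For reference, the count of 92 can be reproduced with a few lines of depth-first search, a sketch of the systematic trial-and-verify approach discussed later in this chapter:

    def count_nqueens(n):
        # Count solutions of the N-queens problem by depth-first search,
        # placing one queen per row and checking the column and the two
        # diagonals against all previously placed queens.
        def place(row, cols, diag1, diag2):
            if row == n:
                return 1
            total = 0
            for col in range(n):
                if col in cols or row - col in diag1 or row + col in diag2:
                    continue                      # constraint violated
                total += place(row + 1, cols | {col},
                               diag1 | {row - col}, diag2 | {row + col})
            return total
        return place(0, set(), set(), set())

    print(count_nqueens(8))   # 92, as stated in the text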
II. ASSESSMENT CRITERIA FOR CONSTRAINT SATISFACTION TECHNIQUES

Before we start to look at any of the proposed methods, we must clarify the properties we want a practically useful constraint satisfaction method to have. We can get ideas for the development of criteria by studying the theory of computational complexity, the resource requirements (memory capacity and circuit complexity) and speed, techniques of parallel computation, computer architecture, and examples of applications.
A. P AND NP PROBLEMS, COMPLEXITY THEORY

We must always be aware that there are principal limitations described by the theory of computational complexity. Besides a better understanding of the limitations, a look into this theory will give us a metric for differentiating the resource requirements and the speed [6]. The computation time and the requirements for space resources, which means the memory capacity and the circuit complexity, usually grow with the size of the problem. This dependency can be given exactly, but it is commonly summarized by the O notation, which describes only the most significant component. For example, O(n^k) means that the time or resource depends on a problem parameter n (e.g., the number of variables) by a polynomial of degree k; O(2^n) stands for exponential dependency on n, and O(1) for independence. Computational complexity theory classifies problems to be computed as polynomial (P) and nondeterministic polynomial (NP) problems. This means a separation of problems that can be solved on a deterministic computer and problems
that require a nondeterministic computer (i.e., one with an oracle). A nondeterministic computer does not exist physically, of course, so it is interesting to ask how fast NP problems can be solved on a deterministic computer. Computational complexity theory now tells us, in a well-known but yet unproven conjecture, which is, in a popular version: a deterministic algorithm that finds solutions of a problem of class NP with polynomial complexity, that is, O(n^k), does not exist; in other words, P ≠ NP. This conjecture was published for the first time about 25 years ago, but until now it has neither been proven nor been rejected by presenting a polynomial algorithm for an NP problem. In this direction, much has been anticipated from neural network techniques, as it is assumed that they have a nondeterministic component. Some publications suggest algorithms with polynomial complexity for NP problems [7-9], but the results are not general and have not been verified independently. Therefore, we have to expect exponential growth of the computation time or processor complexity for some of the CSPs, because we cannot classify arbitrary CSPs in advance as P or NP. Remember that it is sometimes simple and sometimes difficult to solve them by hand. Therefore, general large-scale (approximately, N > 10) constraint satisfaction problems may require too much processing time or too large a processor to be solved by a machine. However, we just notice that this conjecture does not say that all problems are difficult, nor does it say that a specific CSP must be difficult. Therefore, the picture is not as dark as it may look. We have learned that the general CSP belongs to NP, and there is no general efficient, that is, polynomial, algorithm for NP problems. However, there are still many practical problems belonging to P that can be described as a CSP, and we will look for efficient algorithms to solve them.
B. SCALEABILITY AND LARGE-SCALE PROBLEMS, EMPIRICAL COMPLEXITY

To assess the speed of a given algorithm or CSP-solving method, the classification of the underlying problem as P or NP is not of primary importance, but rather the behavior of the algorithm for different problems and problem sizes. This is described more exactly by the scaleability. The scaleability describes how the computation time increases for large-scale variants of a given problem. This requires a problem class whose members are differentiated by a size parameter, like the N of the N-queens problem. If we now take a problem of size N, run the algorithm to solve the problem, and measure the number of computation steps, we will get the empirical complexity of the problem for a given algorithm.
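Measuring the empirical complexity can be automated. In the sketch below (solve is a hypothetical solver taking the size N as argument), the local log-log slope stays near a constant k for O(N^k) behavior and keeps growing for exponential behavior:

    import math
    import time

    def measure(solve, n):
        t0 = time.perf_counter()
        solve(n)
        return time.perf_counter() - t0

    def empirical_complexity(solve, sizes, runs=5):
        # Median run time per problem size, and the log-log slope
        # between successive sizes.
        medians = []
        for n in sizes:
            times = sorted(measure(solve, n) for _ in range(runs))
            medians.append(times[runs // 2])       # median, not the mean
        slopes = [math.log(t2 / t1) / math.log(n2 / n1)
                  for (n1, t1), (n2, t2) in zip(zip(sizes, medians),
                                                zip(sizes[1:], medians[1:]))]
        return medians, slopes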
Figure 5 Complexity of algorithms for a given problem (computation time or memory capacity over problem size N: simplest algorithm, exponential; best known algorithm, polynomial; best algorithm, still unknown, linear).
We can also use the same problem class and compare the speed of several algorithms to determine their relative speed. Or we can vary both the size and the algorithm. All these approaches help us to determine the best and most general algorithm among several. Possible outcomes for the solution-finding time plotted over the problem size for several algorithms are shown in Fig. 5. Using the appropriate coordinate system, we can easily distinguish between linear, polynomial, and exponential behavior (Fig. 6). We also have to take into account here that some of the algorithms we will compare have a stochastic component, that is, a random noise generator that throws a die at certain decision points. This random noise introduces unpredictability similar to the oracle of a nondeterministic computer. Note, however, that nondeterminism is not the same as stochasticity. To make the results comparable, we have to repeat several runs for the same problem size and then take some mean value of the individual solution-finding time measurements. Neither the arithmetic nor the geometric mean value is appropriate in this case. Rather, we should use the median (i.e., the value separating the sorted list of values into two equally sized halves) and the span (not the standard deviation) to describe the variance.
Figure 6 Exponential, polynomial, and linear dependency in different coordinate systems.
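A small Python sketch (not from the chapter) of this bookkeeping for repeated runs:

def median_and_span(times):
    # Median: the value separating the sorted list into two equal halves.
    # Span: distance between the fastest and the slowest run (not the
    # standard deviation, as recommended above).
    s = sorted(times)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return median, s[-1] - s[0]

runs = [120, 95, 430, 101, 98, 2500, 110, 105]   # eight runs, same problem size
print(median_and_span(runs))                     # -> (107.5, 2405)

The single extreme run (2500 steps) barely moves the median, which is exactly why it is preferred here over the arithmetic mean.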
From this discussion, we have the first criterion:

(C1) polynomial empirical complexity

Some algorithms, especially those based on stochastic components like neural networks, simulated annealing, and genetic algorithms, introduce control parameters (e.g., the random initialization, the cooling curve, the mutation rate). These parameters have to be selected carefully to make the algorithm converge at all. Unfortunately, the optimal parameter may depend on the given problem, and there does not exist a method to determine the parameter automatically. Otherwise, it could become part of the algorithm and would no longer be a parameter. Therefore, applying these algorithms contains a "good luck" component, which makes them very unreliable for practical purposes. Therefore, we require a

(C2) low number of control parameters

The stochastic component itself makes the result not repeatable. Therefore, we require a

(C3) low influence of stochastic components
C. PARALLELIZATION

A well-known paradigm of electronic computation is to use several processors if a single one is too slow. This requires parallelization of the algorithm to solve the given problem. In the last few years, there has been much success in the parallelization of computational algorithms, as supercomputers have shown with their TeraFLOPs computing power. And there are also clusters of workstations or multiprocessor personal computers (PCs) on the market, especially database servers. Classical parallelization is based on the paradigm of communicating sequential processes. The limitation lies in the growing demand for communication with an increasing number of processors. In many cases, the increase in speed is limited to a factor in the range of 4 to 10 even with a large number of processors. Therefore, much research has been done to improve the communication network, of which the Internet has become the most prominent. Today, the parallelization of algorithms is supported on the operating system level by tasks and threads, semaphores, mailboxes, communication channels, pipes, shared variables, clients and servers, etc. A different approach tries to hide the communication from the programmer and adds parallel operators to programming languages to give the compiler hints to generate parallel code. Finally, operations can be parallelized on the machine instruction level by pipelining. This technique is used in most modern microprocessors. Modern compilers also try to
identify code that has inherent parallelism (e.g., vector operations) and generate appropriate parallel code. The neural network paradigm, also called parallel distributed processing (PDP), has opened a very different approach to parallel operation. It assumes a large number of rather simple processors, called neurons, and an interconnection network that also participates in the operation by weighting the values sent over the network. Another major difference from sequential computers is the lack of a program that defines the steps of the calculation. The main hope is that the communication bottleneck can be overcome. This leads to a new criterion

(C4) simple algorithm structure that can be parallelized without scaling loss
D. DESIGN PRINCIPLES FOR COMPUTER ARCHITECTURES

For practical applications, a well-defined architecture makes the understanding, integration, and modification of a computing device much simpler. Therefore, we must also take a look at the quality of the architecture of the CSP solving technique. Blaauw [10] defines three computer design levels:

1. programming model (architecture)
2. implementation
3. realization

The programming model describes the logical view of the machine for a programmer. The programmer need not be concerned about how the commands are implemented. The implementation is the view of the functional organization. Most notably, there may be different implementations of the same programming model (compare, for example, the 80386 and 80486 microprocessors), which makes software portable. And the same implementation may have a different realization by electronic devices, for example, different silicon technology. Although these levels were originally introduced for mainframe computers and then used for microprocessors, they are very general for all computation devices. Because of this, we will compare whether the CSP solving technique follows this architectural view. Remember, we are interested in solving CSPs on a machine, and perhaps we want to replace the machine with a newer or different one, but we do not want to reformulate the CSP again. From this, we formulate the criterion

(C5) well-defined architecture

Blaauw has also defined design principles for the design of a high-quality programming model. From the principles of orthogonality, symmetry, generality,
transparency, open-endedness, and completeness, the generality (appropriateness for many different problems) and the open-endedness (extensibility) can easily be transferred to a CSP language. Therefore, we want to have a

(C6) general and open-ended CSP language
E. EXAMPLES OF CONSTRAINT SATISFACTION PROBLEMS

We have now collected many criteria for the assessment of constraint satisfaction methods, but have not yet taken a look at examples and applications. Let us check whether we can learn additional criteria. Table II lists applications that can be mapped to constraint satisfaction problems, taken from the neural network literature. An example of the polyomino placement puzzle is shown in Fig. 7. The N-coloring problem and the NQP are described in the following sections.
Table II
Constraint Satisfaction Problems from the Neural Network Literature

Problem                                        Application area      Reference
Analog-digital converter                       Circuit design        [11, 12]
Frequency planning for cellular mobile radio   Telecommunications    [13, 14]
Switching                                      Telecommunications    [15-19]
Error correction                               Circuit design        [20]
Circuit layout                                 CAD                   [21-23]
Steiner tree for circuit layout                CAD                   [24]
Digital circuit test pattern generation        CAD                   [25, 26]
Composition of harmonic music                  Music                 [27]
N-coloring of a graph                          Scheduling            [28-30]
Assignment problem                             Scheduling            [31, 32]
Scheduling                                     Scheduling            [10, 33-35]
Polyomino placement puzzle                     Puzzle                [36-38]
3-satisfaction problem                         Mathematical logic    [39]
Knight's tour problem                          Puzzle                [7, 40]
N-queens problem (NQP)                         Puzzle                References to Table V
Jigsaw puzzle                                  Puzzle                [41, 42]
Factorization of large numbers                 Cryptography          [41, 42]
RNA analysis                                   Biochemistry          [8]
Figure 7 The 7 × 7 polyomino puzzle.
1. N-Coloring Problem

Many technical resource allocation problems, like register assignment in compilers, scheduling in automation, switching, etc., are similar to the coloring problem in map drawing. The question is how to assign colors to a set of areas so that adjacent areas do not get the same color assigned. Otherwise, they cannot be distinguished visually any more. The structure of the areas can be represented by an adjacency graph, and colors are assigned to the nodes (Fig. 8). This problem brings up the topic of the solvability of problems. For a long time, it was an open question of mathematics whether the four-coloring problem is really solvable for arbitrary maps. Recently, there has been a proof, but we can still formulate a three-coloring problem as a CSP that has no solutions. Therefore, we also want our solver to indicate if the given problem is unsolvable, that is, overconstrained. Thus, we require the

(C7) detection of unsolvable problems
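As a concrete illustration (not from the chapter; the small graph is a made-up example in the spirit of Fig. 8), a Python sketch that counts violated adjacency constraints and decides solvability by exhaustive enumeration:

from itertools import product

def violations(edges, coloring):
    # Count adjacent node pairs that received the same color.
    return sum(1 for a, b in edges if coloring[a] == coloring[b])

def solvable(edges, nodes, colors):
    # Exhaustive check of criterion C7; only feasible for tiny graphs,
    # since the state space grows as len(colors) ** len(nodes).
    return any(
        violations(edges, dict(zip(nodes, assignment))) == 0
        for assignment in product(colors, repeat=len(nodes))
    )

nodes = "ABCDE"   # hypothetical 5-node map
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(solvable(edges, nodes, ["red", "green", "blue"]))   # -> True
print(solvable(edges, nodes, ["red", "green"]))           # -> False (overconstrained)

The triangle B-C-D makes the two-color version unsolvable, which is exactly the overconstrained case a solver should report.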
Figure 8 A map-coloring problem (coloring of the nodes: A red, B green, C green, D blue, E red).
2. N-Queens Problem

What if there is more than a single unique solution to the problem? We already know this to be the case for the NQP. We could expect the solver to identify all solutions, one after the other. However, we have to note that the number of solutions may grow exponentially with the size of the problem, and, for most practical applications, a single schedule, a single circuit placement, etc. suffices. Therefore, we expect that the solver generates one of the solutions, but, to be fair, with the same probability as any other solution. Thus, we require

(C8) fair selection between several solutions

The NQP is good for describing the aspect of variable encoding [43]. For the NQP, which asks for the placement of N queen figures on an N × N chess board, we have several options. Three of them are:

(Q1) N variables for all N rows that indicate the column position (1, ..., N) of the queen figure in that row.
(Q5) N variables for the figures; the domain, that is, the possible values, is the (x, y)-positions of the board.
(Q6) N × N binary variables, one for each position, that indicate whether a queen is placed on the position (value 1) or not (value 0).

For all these different encodings, a different set of constraints for the values of the variables has to be formulated. For Q6, for example, the constraints are:

(NQP1) exactly one queen per row
(NQP2) exactly one queen per column
(NQP3) at most one queen per falling diagonal
(NQP4) at most one queen per rising diagonal
All the encodings describe the same problem and will result in the same set of solutions. However, we can expect that the speed of the algorithm differs largely between these different encodings. And an algorithm may not even be appropriate for a given encoding, or fail to find a solution at all. So we have an additional criterion, the

(C9) unaffectedness by different problem formulations
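As an illustration of the Q6 encoding, a small Python sketch (not from the chapter) that counts how many of the constraints NQP1-NQP4 a given 0/1 board violates:

def q6_violations(board):
    # board[i][j] == 1 iff a queen sits on row i, column j (encoding Q6).
    n = len(board)
    v = sum(abs(sum(row) - 1) for row in board)                               # NQP1
    v += sum(abs(sum(board[i][j] for i in range(n)) - 1) for j in range(n))   # NQP2
    for d in range(-(n - 1), 2 * n - 1):
        falling = sum(board[i][i - d] for i in range(n) if 0 <= i - d < n)    # NQP3
        rising = sum(board[i][d - i] for i in range(n) if 0 <= d - i < n)     # NQP4
        v += max(0, falling - 1) + max(0, rising - 1)
    return v

solution = [[0, 1, 0, 0],
            [0, 0, 0, 1],
            [1, 0, 0, 0],
            [0, 0, 1, 0]]
print(q6_violations(solution))   # -> 0, a valid 4-queens placement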
F. SUMMARY

Finally, we are interested in hardware implementations, and therefore we want to check if

(C10) a hardware implementation is reported
Table III
Assessment Criteria for CSP Solving Techniques

C1   Polynomial (empirical) complexity                                          exp.
C2   Low number of control parameters                                           dir.
C3   Low influence of stochastic component                                      dir.
C4   Simple algorithm structure that can be parallelized without scaling loss   dir.
C5   Well-defined architecture                                                  dir.
C6   General and open-ended CSP language                                        dir.
C7   Detection of unsolvable problems                                           dir.
C8   Fair selection between several solutions                                   exp.
C9   Unaffectedness by different problem formulations                           exp.
C10  A hardware implementation is reported                                      dir.
Now, let us summarize all the assessment criteria we have defined (Table III). The last column indicates how we can check these criteria: we either have to do experiments (exp.) or can check directly (dir.).
III. CONSTRAINT SATISFACTION TECHNIQUES

The literature is rife with methods for solving problems. They can be separated into three major groups (Fig. 9): algebraic methods and two groups of numeric methods. The numeric methods can be subdivided into global search and local search methods.
Figure 9 Problem-solving techniques (algebraic: term rewriting (calculus), divide and conquer; numeric global search: enumeration, depth first search; numeric local search: gradient descent, simulated annealing, genetic algorithms, neural networks).
We will skip the algebraic methods (term rewriting, divide and conquer) here and concentrate on the search methods. Before we describe the constraint satisfaction techniques in detail, we define a CSP formally as:

1. a set of variables z_i,
2. a domain of values A for each variable z_i,
3. a set of relations over these variables (constraints).

For a discrete CSP, the domains have a discrete set of values. Note, however, that the discreteness itself may be seen as a constraint to a continuous problem. The set of relations can be described by formulas. An example of a constraint set over {z_1, z_2, z_3, z_4} is

1 ≤ z_1 + z_3 ≤ 3,    2 ≤ z_1 + z_2 + z_4 ≤ 3.    (1)

A. GLOBAL SEARCH

1. Enumeration
Enumeration is a very simple, safe, but very inefficient method to solve discrete CSPs. It simply tries all combinations of values out of the variable domains, one after the other, until a solution has been found. This process requires an exponentially growing number of steps for larger problems, as the full state space will be enumerated. Therefore, it is practically useless.

2. Depth First Search

To speed up the enumeration, the depth first search (DFS) method has been developed. It assigns the first value to the first variable and then checks if any constraint has been violated even by this partial assignment. If not, it assigns the first value to the second variable, and so on, until all variables are assigned. Then a solution has been found. However, if any constraint is violated during a step, the search method goes back (backtracking) to the previous variable and tries the next value. This is also called pruning of the search tree, as all these partial assignments can be ordered as the nodes of a tree structure. The order in which variables are assigned and the order in which values are taken from the domain dramatically influence the number of steps until the first solution is found. Unfortunately, there is no method to define this order optimally. Therefore, heuristics have to be applied, for example, the "most constrained variable first" rule. With these rules, the DFS method is successful for many small-scale problems and has found an application in expert system shells.
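A minimal Python sketch (not from the chapter) of depth first search with pruning, using the NQP in the Q1 encoding introduced in Section II:

def dfs_nqueens(n):
    # Depth first search with pruning for the NQP in encoding Q1:
    # cols[i] is the column of the queen in row i.
    cols = []

    def consistent(col):
        r = len(cols)
        return all(c != col and abs(c - col) != r - i for i, c in enumerate(cols))

    def search():
        if len(cols) == n:
            return True
        for col in range(n):          # the value order influences the step count
            if consistent(col):
                cols.append(col)
                if search():
                    return True
                cols.pop()            # backtrack: prune this subtree
        return False

    return cols if search() else None

print(dfs_nqueens(8))   # -> [0, 4, 7, 5, 2, 6, 1, 3]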
Figure 10 Number of steps required to find the first solution of the N-queens problem by sorted, pruned depth first search in lin-log coordinates. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
The depth first search method (backtracking) can be parallelized [44]. In some cases, the speed-up may be superlinear, which can be explained by the nonuniform distribution of solutions within the search space. Generally, however, it is limited, because the communication overhead required for effective load balancing methods increases with the number of processors [45]. Experimental results with the NQP have shown that the empirical complexity of an optimized depth first search grows exponentially with N (Fig. 10). We could not find, for example, solutions for N = 80 within 10^ trials (several days of computation time on a RISC workstation) [46]. The same result has been reported by other authors [47].
B. LOCAL SEARCH

A different approach is based on the paradigm of minimization of the number of errors E (number of conflicts, number of violated constraints, "energy") in a certain test assignment z. This is described by the generate and test principle shown in Fig. 11. When started, the generator produces vectors z, which are evaluated by the tester. The tester returns the value E, which also indicates when a solution has been found and the search process can stop (finished). When we are interested in the next solution, we can continue the search process. Ideally, the solver can indicate unsolvable problems. The problem to be solved is directly put
Figure 11 CSP solver based on the generate and test paradigm (generator and tester; control signals: start, continue, solution, finished, unsolvable).
into the tester. To do this, we can implement an algorithm or define a nonlinear function consisting of penalty terms for violated constraints. Unfortunately, there are methods for the generator that need some information, in the form of control parameters, about the problem to be solved. This connection is indicated in Fig. 11 by the dotted arrow.

1. Gradient Descent

To reduce the number of errors E, the gradient descent starts at a certain initial state. Then the state is modified to reduce the error. As differential calculus shows, the error is reduced by the largest amount if the method makes a step in the direction of the negative gradient. After several such state change steps, the dynamic system comes to rest in a state with minimal error. This state is also called the equilibrium. There are some difficulties to be overcome. First of all, applying the gradient descent to a discrete error function is not possible. Therefore, the gradient is usually replaced by an error difference when changing a single variable by the value 1. Then the minimum of the error may be local; that is, the system settles in a state with some constraints still violated. To avoid this, a stochastic component is added.

2. Simulated Annealing

Simulated annealing (SA) [48, 49] resembles the annealing of a metal, in which a new atomic structure with minimal energy is rebuilt after heating and then slowly cooling down. This method starts with an initial state z, generates a new, neighboring state z′, and compares the value of the target function E for both by building the difference ΔE.
If the new energy is lower, the new state is accepted, moving the state toward a minimum of E. If the energy is higher, the new state is accepted with an acceptance probability of

P_A = exp(−ΔE/T).    (2)

Several of these steps are repeated in a cycle at a certain temperature level T. Then the temperature is reduced according to a cooling schedule and a new cycle is started. It can be shown that the probability P of finding the system in the state z_0 is described by the Boltzmann distribution

P(z = z_0) ∝ exp(−E(z_0)/T).    (3)

As we can see, the probability becomes maximal for those states that have minimal E if the temperature is lowered, T → 0. Therefore, the system relaxes to global minima. The simulated annealing method has some drawbacks. First of all, it lacks a programming model and does not show how to define the error function for a given CSP. Then the selection of a cooling schedule is not straightforward. However, because the schedule determines a compromise between the probability of finding a solution and the computation time, it is a system parameter with unpredictable influence. Finally, the algorithm is generally slow, because the number of cycles at a certain temperature level is high. The algorithm can be easily parallelized, however, which will be described in Section IV.E as the Boltzmann machine.

3. Genetic Algorithms

Genetic algorithms (GAs) simulate a population of individuals. Their properties are described by a set of genes. Three operators are applied to this population: mutation, recombination, and selection. Mutation randomly modifies parts of the genes of an individual. Recombination of the genes of two individuals makes a mixture, while selection removes individuals that have a worse performance than others. By repeatedly applying these operators to a set of random initial individuals, the population becomes more and more homogeneous at best fitness. Like simulated annealing, this method can be formulated as an optimization algorithm. The genes represent trial assignments to variables and the fitness is the target function. This eliminates bad assignments and favors good solutions. The NQP has been solved by this method for N = 200 [50]. Like simulated annealing, genetic algorithms require the selection of system parameters that control the convergence: the mutation rate, the recombination rule, and the number of individuals.
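A minimal Python sketch (not from the chapter) of the SA loop for the NQP in the permutation encoding; the cooling parameters t0 and tau are illustrative assumptions:

import math
import random

def diag_conflicts(perm):
    # Violated diagonal constraints of an N-queens permutation.
    return sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
               if abs(perm[i] - perm[j]) == j - i)

def swap_two(perm):
    q = list(perm)
    i, j = random.sample(range(len(q)), 2)
    q[i], q[j] = q[j], q[i]
    return q

def simulated_annealing(energy, neighbor, state, t0=10.0, tau=500.0, steps=20000):
    # Accept uphill moves with probability exp(-dE/T), Eq. (2); T follows an
    # exponential cooling schedule. t0, tau, steps are exactly the kind of
    # control parameters whose choice the text criticizes.
    for t in range(steps):
        T = t0 * math.exp(-t / tau)
        cand = neighbor(state)
        dE = energy(cand) - energy(state)
        if dE <= 0 or random.random() < math.exp(-dE / max(T, 1e-12)):
            state = cand
        if energy(state) == 0:        # E == 0 signals a solution
            break
    return state

start = list(range(8))
random.shuffle(start)
print(simulated_annealing(diag_conflicts, swap_two, start))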
4. Probabilistic Local Search

Rok Sosic and Jun Gu have developed a specialized algorithm called probabilistic local search to solve the NQP [2, 4, 51, 52]. They use the variable encoding Q1, which describes the column number of each queen, each of which is tied to a certain row. The values of these variables are initially a permutation of the numbers 1, ..., N. This ensures that there is only one queen in each row and column (NQP1, NQP2). The algorithm calculates the number of conflicts, that is, the number of queens that attack each other on any of the diagonals. Then, for all pairs of conflicting queens, the algorithm exchanges the column numbers to test whether this exchange reduces the number of conflicts. If the conflicts are reduced, the new state is accepted and the next pair is taken. This algorithm rapidly converges to a solution for larger problems, while it may get stuck in a local minimum for smaller ones. In this case, the algorithm is restarted with a different initial permutation. The authors report that they have solved up to N = 3000000 [52]. Although this method is very successful, it is not clear how to transfer it to other CSPs. It benefits from a specialized variable encoding that automatically fulfills two of the constraint sets (rows and columns).
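A simplified Python sketch in the spirit of this algorithm (not the authors' exact code; unlike the original, it tests swaps for all pairs rather than only conflicting ones, and restarts after a full sweep without improvement):

import random

def conflicts(perm):
    # Attacking pairs on the diagonals; rows and columns are conflict-free
    # by construction of the permutation encoding Q1.
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(perm[i] - perm[j]) == j - i)

def local_search_nqueens(n, max_restarts=100):
    for _ in range(max_restarts):
        perm = list(range(n))
        random.shuffle(perm)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                for j in range(i + 1, n):
                    before = conflicts(perm)
                    perm[i], perm[j] = perm[j], perm[i]
                    if conflicts(perm) < before:
                        improved = True
                    else:
                        perm[i], perm[j] = perm[j], perm[i]   # undo the swap
        if conflicts(perm) == 0:
            return perm                # solution found
    return None                        # stuck in a local minimum every time

print(local_search_nqueens(20))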
C. NEURAL NETWORKS

A promising approach pioneered by John Hopfield [1, 53-57] uses recurrent neural networks. He discovered in 1982 that it is possible to describe the dynamic behavior of a single-layer, symmetrically fed-back network of neurons as the minimization of a quadratic energy function. This energy function depends on the state of the neurons and is determined by the weights of the feedback synapses. By defining the synapse weights appropriately, it is possible to encode a quadratic optimization problem or a discrete constraint satisfaction problem into the network. After relaxing the network state into an equilibrium, the individual neuron states describe the solution that has been found. The work of John Hopfield and David Tank became popular because they could approximately solve a difficult optimization problem that belongs to NP: the traveling salesperson problem (TSP) [54, 55].
IV. NEURAL NETWORKS FOR CONSTRAINT SATISFACTION

Neural networks for constraint satisfaction problems are recurrent networks based on a technical neuron model. These technical neurons (units, processing units, processing elements) try to imitate the average spike rate (activity) of biological neurons. A neuron is generally defined as follows (Fig. 12):

• The input signals (e_s) of a neuron are weighted by synapses (w_s) and are superimposed linearly to a threshold value w_0.
• The resulting activation (x) modifies the soma potential (state u) of the neuron with a certain time dependency (T).
• The output signal (z) is derived from the soma potential by a nonlinear function g(u).

This is described by a set of equations:

x = Σ_s w_s e_s + w_0,    (4a)
u(t + Δt) = f_T(u(t), x(t), Δt),    (4b)
z = g(u).    (4c)

The weighting factors w_s determine the input-output transfer function of the neuron, and the function f describes the dynamic behavior (delay line, integrator, low-pass) of the neuron. The nonlinear function g(u) (Fig. 13) is usually a threshold function (Heaviside function) or a differentiable sigmoid function, for example, g(u) = (1/2)(1 + tanh u). An algorithm that can be represented by such neurons is called a neural algorithm.
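A minimal Python sketch (not from the chapter) of one neuron update; choosing a first-order low-pass for f_T and the concrete time constant are illustrative assumptions:

import math

def neuron_step(w, w0, e, u, dt, T):
    # One discrete update of Eqs. (4a)-(4c).
    x = w0 + sum(wi * ei for wi, ei in zip(w, e))   # (4a) activation
    u = u + dt * (x - u) / T                        # (4b) soma potential, low-pass
    z = 0.5 * (1 + math.tanh(u))                    # (4c) sigmoid output
    return u, z

u = 0.0
for _ in range(100):
    u, z = neuron_step([1.0, -0.5], 0.1, [0.8, 0.3], u, dt=0.1, T=1.0)
print(round(u, 3), round(z, 3))   # u approaches the activation x = 0.75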
Figure 12 General definition of a technical neuron.
Figure 13 Threshold and sigmoid function.
A. HOPFIELD NETWORKS

1. Continuous Hopfield Network

A Hopfield network [53] (Fig. 14) consists of a single layer of technical neurons that are fully and mutually interconnected and have a bias input I_i (w_0 in the neuron model). The weights are described by the symmetrical weight matrix T_ij. Additionally, the diagonal vanishes (T_ii = 0) to make the dynamical system stable. The neurons multiply the output voltages V_j by the weights T_ij and change their soma potential u_i accordingly. The output voltage V^0 ≤ V_i ≤ V^1 (z in the neuron model) depends on the soma potential and is described by the function g_i(u_i). Neurons with a higher voltage level are called active; those with a lower level are called inactive.
Figure 14 Hopfield network (single layer of amplifiers with bias inputs I_i, outputs V_i, resistors of the T_ij network, and inverting amplifiers).
In most discussions, the physical units are ignored and the output swing is defined as V^0 = 0 and V^1 = 1. The definition V^0 = −1 is also in use and is called the Ising spin model [58]. The dynamic behavior is described by the differential equation system

C_i du_i/dt = Σ_j T_ij V_j − u_i/R_i + I_i,    u_i = g_i^(−1)(V_i),    (5)

where C_i is the capacity of the cell membranes and R_i is the self-discharge. In many cases, they are assumed to be the same for all neurons and therefore C_i = C, R_i = R, and g_i(u_i) = g(u_i). For the continuous model, g(u) is the sigmoid function. All states of the neurons can be associated with a computation energy E [57]:

E = −(1/2) Σ_{i,j} T_ij V_i V_j − Σ_i I_i V_i + Σ_i (1/R_i) ∫_0^{V_i} g_i^(−1)(V) dV.    (6)

The special property of this definition of E is that it is a Lyapunov function of the system Eq. (5). This means [53]

dE/dt ≤ 0    and    E ≥ E_0.    (7)

Therefore, the Hopfield network is stable, cannot oscillate, and must converge into a state with low energy. Note that the equation of motion defined by Eq. (5) describes a modified gradient descent in which the gradient of E without the integral term (Σ_j T_ij V_j + I_i) is low-pass filtered. Finally, let us comment on the name continuous Hopfield network (CHN): it is so called because the time is continuous, as are the neuron states.
2. Discrete Hopfield Network

Although the discrete Hopfield network (DHN) was published earlier [1] than the CHN, it was extended in [53] and the description follows the latter reference. The discrete model assumes that the slope of the sigmoid function g′(0) is made infinitely large and g(u) becomes a threshold function

V_i = V^0    for Σ_j T_ij V_j + I_i < U_i,
V_i = V^1    for Σ_j T_ij V_j + I_i > U_i.    (8)

U_i is an additional threshold parameter that is usually set to zero.
The time is also made discrete by stochastic sampling. This means that individual neuron states are updated within the time interval Δt with a probability W [1]. If not updated, they retain their value unmodified. Simulations of the discrete Hopfield network usually do a cyclic update, but in a randomly permuted order [56]. The discrete Hopfield network also has an associated energy function

E(V_1, ..., V_N) = −(1/2) Σ_{i,j} T_ij V_i V_j − Σ_i I_i V_i + Σ_i U_i V_i.    (9)

If just a single neuron changes its state within the interval t, ..., t + Δt, then it can be shown that

ΔE = E(t + Δt) − E(t) ≤ 0,    (10)

which means that the network does not oscillate and relaxes into a (local) minimum of E.

3. Defining Constraint Satisfaction and Optimization Problems

Hopfield networks can be used to solve constraint satisfaction problems and quadratic optimization problems, as Hopfield and Tank have shown. For this purpose, the problem must be formulated as a quadratic function over Boolean variables V_i of the form Eq. (6) or Eq. (9), paying attention to the conditions T_ij = T_ji, T_ii = 0 and minimality for the solutions of the problem. For example, an n-flop (extended flip-flop) is defined by mutual inhibitory feedback [57].
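A minimal Python sketch (not from the chapter) of this discrete dynamic: Eq. (8) applied asynchronously in randomly permuted order, with the energy of Eq. (9) for checking the result. The 3-flop example is a tiny instance of the n-flop just mentioned:

import random

def dhn_energy(T, I, U, V):
    # Energy of Eq. (9).
    n = len(V)
    return (-0.5 * sum(T[i][j] * V[i] * V[j] for i in range(n) for j in range(n))
            - sum(I[i] * V[i] for i in range(n))
            + sum(U[i] * V[i] for i in range(n)))

def dhn_relax(T, I, U, V, sweeps=100):
    # Asynchronous threshold updates per Eq. (8); by Eq. (10) each single
    # flip cannot increase E, so the state relaxes into a (local) minimum.
    n = len(V)
    for _ in range(sweeps):
        changed = False
        for i in random.sample(range(n), n):
            act = sum(T[i][j] * V[j] for j in range(n)) + I[i]
            new = 1 if act > U[i] else 0
            if new != V[i]:
                V[i], changed = new, True
        if not changed:
            break
    return V

# Tiny example: a 3-flop (mutual inhibition) leaves exactly one neuron active.
T = [[0, -2, -2], [-2, 0, -2], [-2, -2, 0]]
I = [1, 1, 1]
U = [0, 0, 0]
V = dhn_relax(T, I, U, [1, 1, 1])
print(V, dhn_energy(T, I, U, V))   # one active neuron, E = -1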
Table IV
Examples of the Weight Matrix or Energy Function for the NQP

Tagliarini and Page [59] (variables v_ij):
T_{ij,kl} = A(1 − δ_{j,l})δ_{i,k} + D(δ_{i+j,k+l} + δ_{i−j,k−l})(1 − δ_{i,k})

Shagrir [60] (variables a_xy):
E = (A/2) Σ_x Σ_k Σ_{j≠k} a_{xk} a_{xj} + (A/2) Σ_k Σ_x Σ_{y≠x} a_{xk} a_{yk} + C (Σ_x Σ_k a_{xk} − n)² + 2 Σ_x Σ_k Σ_{m≠0} a_{xk} a_{x+m,k+m} + Σ_x Σ_k Σ_{m≠0} a_{xk} a_{x+m,k−m}
Figure 15 Example of a k-out-of-n constraint (here 2-out-of-6) for neurons (represented by small circles).
effectively eliminates the arbitrary parameters A, B, ... of other approaches. Additionally, it makes the detection of solutions (global minima) very simple: E is zero only for solutions and positive otherwise. The introduction of slack neurons (Fig. 16) makes the design even more flexible by extending to between-k-and-l-out-of-n constraints. However, the authors still use the standard Hopfield network, which gets stuck in undesirable equilibria [68]. Tohru Nakagawa et al. developed the strictly digital neural network (SDNN) [71-75], which is based on the discrete Hopfield network. The SDNN introduces a time-dependent self-feedback of the neurons (T_ii), which results in a hysteresis. This dynamically controlled hysteresis avoids oscillations when several neurons change their state simultaneously to speed up convergence [72]. A traveling on a hypercube (TOH) rule controls the dynamic behavior of each individual neuron into a feasible and stable state. For this purpose, the SDNN includes a global detector of stable and feasible states. This allows us to distinguish uniquely between nonminima, local minima, and global minima (solutions). Timers control the hysteresis so that local minima are avoided. Extensive simulations have been conducted and hardware implementations are reported [73]. Nakagawa and Kitagawa have also extended the design rules by more complex building blocks, the neural logic gates (Fig. 17). They represent logical conjunction, disjunction, etc. between Boolean variables and permit us to formulate larger-scale problems.
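One common way to obtain weights with this property, sketched below in Python, is the quadratic penalty E = (Σ_i v_i − k)², which is zero exactly when k neurons are active; expanding it over Boolean variables yields a weight matrix and bias vector in the DHN form of Eq. (9). The derivation here is offered as a sketch of the design-rule idea, not as the exact published rule:

def k_out_of_n_weights(n, k):
    # Encode E = (sum(v) - k)**2 (up to the constant k**2) in the form of
    # Eq. (9) with U_i = 0: expanding the square over Boolean v_i, using
    # v_i**2 == v_i, gives T_ij = -2 for i != j and I_i = 2k - 1.
    T = [[0 if i == j else -2 for j in range(n)] for i in range(n)]
    I = [2 * k - 1] * n
    return T, I

T, I = k_out_of_n_weights(6, 2)   # the 2-out-of-6 constraint of Fig. 15

Relaxing a discrete Hopfield network with these weights (e.g., with the dhn_relax sketch above) settles into states with exactly k active neurons.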
Figure 16 Introducing slack neurons to represent range constraints (between-3-and-5-out-of-6; 5-out-of-8 with application and slack neurons). Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
Figure 17 Neural logic gates (a∧b and a∨b realized with auxiliary neurons). Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
The SDNN can solve the NQP for N = 3000 within constant [O(1)] time and was also applied successfully to test pattern generation, N-coloring, and other problems [25, 28, 72-76].
C. NEURAL COMPUTING NETWORKS

Yoshiyasu Takefuji et al. have applied recurrent neural networks to a very large number of problems [7, 8, 30, 38, 40, 65, 77-81]. For convenience, their networks will be called neural computing networks (NCNs), based on the titles of two books [7, 8]. Although the networks are tailored to the given problem, all are based on Hopfield networks, and the following modifications can be found in many of them:

• discrete output function V = f(u) = 1 for u > 0, otherwise 0 (i.e., a threshold function like in the discrete Hopfield network), or
• maximum neurons with V = 1 only if u > 0 and u = max_i u_i within a group of neurons [80, 81],
• a hill-climbing term h(x) = 1 for x = 0, otherwise 0 (to detect and escape from local minima) [30, 38, 78],
• asynchronously sampled equation of motion based on the Euler integration du/dt = −∂E/∂V (i.e., no low-pass filter as in the continuous Hopfield network),
• different control parameters A, B, C, ω, etc.,
• periodical modulation of control parameters or contributions to E (to stimulate the escape from local minima) [30, 38].

Neural computing networks are built from these parts for the N-coloring problem, various puzzles like the NQP, the knight's tour problem, the polyomino puzzle [38], and many others from very large scale integration (VLSI) design to RNA structure identification [8]. Although the reported results are very encouraging, even indicating the solvability of an NP problem (maximum clique problem) [7, 8], a major drawback of this approach is the lack of a general design principle. The definition of the
energy functions or equations of motion, the application of the hill-climbing term, and the selection of the parameters A, B, C, ... are only justified by the positive simulation results. This makes a comment on the results difficult.
D. GUARDED DISCRETE STOCHASTIC NET
The guarded discrete stochastic net (GDS net) was developed out of the discrete Hopfield network by Hans-Martin Adorf and Mark Johnston [64, 82-84]. The basic idea is to detect certain local minima by guard neurons. The outputs of the guard neurons are fed back to the guarded neurons so that the network cannot stabilize in such a minimum. This does not necessarily guarantee that the network finally reaches a global optimum. Therefore, a timer supervises the progress, and after a time-out it is assumed that a cycle between local minima has been reached. In this case, the neuron states are randomly reinitialized and the process starts over. It must be noted that the duration of the time-out is a parameter that influences the speed and solution quality. To speed up convergence into a minimum, a most active first rule is applied to groups of neurons. Only the most active neuron (largest absolute value of the activation x) may change its state. In a tie situation, a random-number generator is consulted to break the tie. Applications of the GDS net are the NQP (N up to 1024, resp. 6001 [85]), N-coloring problems, and the experiment planning module of the Hubble Space Telescope [84]. For all these applications, construction rules for the weights have been given. The dynamic behavior of the GDS network was analyzed in detail by Minton et al. [86] and redefined as a (nonneural) heuristic repair algorithm that was able to solve the NQP for N = 1000000.
E. BOLTZMANN MACHINE

The Boltzmann machine (BM) is the neural implementation of simulated annealing. It can be seen as a discrete Hopfield network with a stochastic neuron dynamic defined as follows:

x_α = Σ_β T_αβ z_β + I_α,    (11a)

P[z_α(t + Δt) = 1] = 1 / (1 + exp(−x_α/T)).    (11b)

These equations describe that the probability of a neuron becoming active [z_α(t + Δt) = 1] is determined by a nonlinear function of the weighted sum of all neuron outputs. The temperature T plays the same role as in simulated annealing and is
gradually lowered. The schedule may be, for example, T(t) = T(0) exp(−t/τ). For the NQP, the weights T_{xy,XY} and biases I_xy

T_{xy,XY} = −A δ_{x,X}(1 − δ_{y,Y}) − B δ_{y,Y}(1 − δ_{x,X}) − C δ_{x+y,X+Y}(1 − δ_{x,X}) − D δ_{x−y,X−Y}(1 − δ_{x,X}),
I_xy = A/2 + B/2    (12)

have been successful for N up to 1000 [37]. An open issue is the definition of the parameters A, B, C, D and the cooling schedule. Both are more or less arbitrarily selected and determine the success of the method. Similar to the Boltzmann machine are the Cauchy machine [87] and the Gauss machine [36], which use a different probability distribution than Eq. (11b) of the Boltzmann machine. The benefit of this change seems to be marginal. A different method has been developed to speed up the Boltzmann machine. Because the Boltzmann machine requires many steps to stabilize at a certain temperature level, mean field annealing (MFA) [24, 49, 88, 89] simply calculates the probability that a neuron of the Boltzmann machine would be active. This is very similar to a continuous Hopfield network, in which the temperature controls the slope of the sigmoid function. Consequently, the slope is increased as time passes.
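As an illustration, a Python sketch that builds these weights; it follows Eq. (12) as reconstructed above, so both the formula details and the parameter values A = B = C = D = 2 are assumptions rather than verified published settings:

def nqp_bm_weights(n, A=2.0, B=2.0, C=2.0, D=2.0):
    # Weight matrix and biases for the NQP in the Q6 encoding, following the
    # structure of Eq. (12); delta is the Kronecker delta. The parameter
    # values are illustrative: their choice is exactly the open issue named
    # in the text.
    delta = lambda a, b: 1.0 if a == b else 0.0
    T = {}
    for x in range(n):
        for y in range(n):
            for X in range(n):
                for Y in range(n):
                    if (x, y) == (X, Y):
                        continue      # no self-feedback (T diagonal vanishes)
                    T[(x, y, X, Y)] = (-A * delta(x, X) * (1 - delta(y, Y))
                                       - B * delta(y, Y) * (1 - delta(x, X))
                                       - C * delta(x + y, X + Y) * (1 - delta(x, X))
                                       - D * delta(x - y, X - Y) * (1 - delta(x, X)))
    I = {(x, y): A / 2 + B / 2 for x in range(n) for y in range(n)}
    return T, I

T, I = nqp_bm_weights(8)   # weights for the 8-queens board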
F. K-WINNER-TAKE-ALL

Winner-take-all (WTA) networks are neural networks in which the neurons compete with each other for the largest activity. The initially most active neuron becomes more active and suppresses the activity of all the others. Finally, only a single neuron (the winner) remains active. A recurrent network shows this behavior if the feedback weights are negative (inhibition). This model was extended by Majani et al. [90] to the KWTA, in which the K neurons with the largest initial activity become more active. Several of these KWTA neuron groups may overlap and form more complex networks [91, 92]. Brown has proposed applying such networks to switching problems in digital telecommunications systems [15, 16]. An additional extension was done by Eberhardt et al. in [31] to form a (k, m)-WTA and was proposed for general assignment problems. It turns out that the WTA describes the same feedback structure as Hopfield's "n-flop" and the KWTA has the same structure as the k-out-of-n design rule. Therefore, the encoding of CSPs into the network structure has the same potential as the SDNN and DBNN (see following section) design. On the other hand, the WTA feedback dynamic is just the basic continuous Hopfield network with all its drawbacks. Therefore, it will get stuck in local minima and has no way to escape.
G. DYNAMIC BARRIER NEURAL NETWORK AND ROLLING STONE NEURAL NETWORK

1. Programming Model

The dynamic barrier neural network (DBNN) [41, 42] and the rolling stone neural network (RSNN) [41, 93] are both based on a programming model with between-k-and-l-out-of-n constraints [29], shown in Fig. 18. The constraints take clusters of n neurons (described by C_α) and try to enforce the number of active neurons to be between k and l. The values k and l may be externally controlled by input data. The variables z_α may be part of the output. Variables with o_α = 0 are auxiliary variables (slack). This is very similar to the constraints used in the SDNN, but has additionally been described as a universal programming model, independent of any neural network method and hardware. As in all programming environments, higher-level language elements have been described [29, 94], following a declarative programming style. They include input-output interfaces, rules like "at-most-one-out-of-n", "not-all-of-n", etc., complex IF THEN rules (Fig. 19), arbitrary Boolean expressions [95],
Figure 18 Programming model with between-k-and-l-out-of-n constraints (working register with input and output masks, machine program, and the signals start, continue, solved, unsolvable). Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
Figure 19 Constraint representation of an IF ALL a THEN ANY b rule. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
conversion between different representations, and arithmetic constraints for addition and multiplication. Additionally, a compiler technique has been described [96] to convert high-level declarations into a set of the basic constraints. Several large-scale problems have been formulated using this technique: a switching problem (Fig. 20), the factorization of large numbers (Fig. 21), and, partially, a complex jigsaw puzzle (Fig. 22) [41].
Figure 20 Constraints for a 3 × 3 crossbar switch (only a single connection demand constraint is shown). Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
Figure 21 Factorization of P in the binary system using a multiplier constraint macro. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
Although summarized in just a few sentences, the method is, in fact, a toolbox for systematically developing quadratic error functions over Boolean variables with the special property that the error is minimal and zero only for solutions of the problem. And, besides the applicability of any optimization method, large-scale problems can be uniquely encoded into the weight matrix and bias vector of a recurrent neural network [41].

2. Dynamic Barrier Neural Network

The dynamic barrier neural network (DBNN) is an implementation of a processor to efficiently solve problems formulated in the between-k-and-l-out-of-n language. It is based on a conversion of these constraints into a weight matrix
Figure 22 Solution of the 7 × 7 jigsaw puzzle. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
encoding of a discrete Hopfield network. Like other approaches, it contains a convergence speed-up, a local minima detection, and a local minima treatment component. The convergence is accelerated by simultaneously updating the state of several neurons in a discrete Hopfield network. This is done by a rule that selects the most active variable within each constraint. In the case of a tie, a random-number generator selects between alternatives. The activity, defined by the deviation of the activation x of a neuron from a reference value determined by the connection weights and the neuron state, describes how the neuron participates in constraint violations. By this most active first rule, oscillations are avoided while having rapid convergence into minima of E. Minima are detected when all of the neurons are stable; that is, they do not change their state any more. In this case, all activity values of the neurons are checked. If all of the neurons are inactive, a global minimum has been reached and a solution is found. Otherwise, the DBNN has fallen into a local minimum. In this case, a dynamic barrier is built up. The dynamic barrier disables, for a certain number of iteration steps, all of the neurons from state changes that have participated in the local minimum. This destabilizes the local minimum and forces the DBNN to increase the error E. It must then find a different path to lower E again. Therefore, the DBNN will search through several minima until a global minimum is found. Experiments with the NQP have been successful for N up to 200. Figure 23 shows the minimum, median, and maximum number of steps required to find solutions in eight repetitions for each different value of N. The results indicate O(1) empirical complexity, that is, constant speed. This result is very important, because the DBNN has not at all been optimized for the NQP but derived from general design principles to improve the discrete Hopfield network and to solve arbitrary between-k-and-l-out-of-n constraint sets. The factorization of large numbers and the jigsaw puzzle have been reported to be too difficult for the DBNN (Fig. 22 was found by a specialized DFS algorithm [97]). The DBNN could not find global minima within reasonable simulation time. This may indicate that they are, in fact, NP problems.
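The following Python sketch is a loose illustration of the barrier idea only, not the published DBNN; the moves(state) interface yielding (new state, changed variable) pairs and the fixed barrier length are assumptions:

def barrier_search(violations, moves, state, block_for=5, max_steps=10000):
    # Descend greedily on the number of violated constraints; when no move
    # improves, we are in a local minimum, so the moved variable is blocked
    # for a few steps, forcing the search to increase the error and leave
    # the minimum (the "barrier").
    blocked = {}                                   # variable -> remaining barrier time
    for _ in range(max_steps):
        if violations(state) == 0:
            return state                           # global minimum: a solution
        blocked = {v: t - 1 for v, t in blocked.items() if t > 1}
        options = [(s, v) for s, v in moves(state) if v not in blocked]
        if not options:
            blocked = {}
            continue
        nxt, var = min(options, key=lambda m: violations(m[0]))
        if violations(nxt) >= violations(state):   # stuck: erect a barrier
            blocked[var] = block_for
        state = nxt
    return None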
Figure 23 Number of steps to find the first solution of the N-queens problem by the DBNN in log-log coordinates. Reprinted from H. N. Schaller, Design of neurocomputer architectures for large-scale constraint satisfaction problems, Neurocomputing 8:315-339, 1995, with kind permission from Elsevier Science-NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
3. Rolling Stone Neural Network

The rolling stone neural network (RSNN) is a modification of the continuous Hopfield network model. Beginning with the observation that the gradient descent method moves in the direction of the gradient with a velocity proportional to the gradient and comes to rest in a (local) minimum, the RSNN simply makes the speed constant. This has the effect that the RSNN cannot get stuck at minima but must alternately search for maxima and minima. Because the influence of small disturbances of the movement direction in minima and maxima becomes infinitely large (chaotic behavior), the RSNN follows a trajectory through the error surface E that passes through all extreme points, including global minima. By using the same weight matrix definition as for the DBNN, the RSNN can in principle find solutions of CSPs. However, the model has not been studied in more detail. Therefore, solutions of the NQP have not been reported.
V. ASSESSMENT

A. N-QUEENS BENCHMARK

The results of a comparison between many different constraint satisfaction methods for the N-queens problem are shown in Table V, which gives the maximum problem size that was reported by the authors to be solved by their method. About one-half are based on a neural network method. The most successful are, of course, the algebraic methods. Among the numerical methods, specialized algorithms are most successful, but neural network techniques are also very good. One of the worst is the depth first search method.
Table V
Success in Finding Solutions to the N-Queens Problem

Author                 Method                                    Neural network?  N_max     Year    References
Gauss and Nauck        Trial and error                           —                8         1850    [3, 4]
Several                Construction                              —                N > 4     >1850   [2, 60, 85, 98, 99]
Stone and Stone        Depth first search                        —                96        1987    [47]
Page and Tagliarini    Hopfield network                          X                10        1987    [59]
Kajura et al.          Boltzmann machine                         X                1000      1989    [37]
Adorf and Johnston     GDS                                       X                1024      1989    [64, 82, 83]
Abramson and Yung      Divide and conquer                        —                N > 9     1989    [98]
Sosic and Gu           Probabilistic local search with
                       conflict minimization                     —                3000000   1990    [2, 51, 52]
Chen and Wu            Parallel PROLOG                           —                7         1991    [100]
Nakagawa and Kitagawa  SDNN                                      X                3000      1991    [71-74]
Miller                 Depth first search                        —                64        1992    [46]
Shagrir                Hopfield network                          X                100       1992    [60]
Mandziuk and Macukow   Hopfield network                          X                8         1992    [101]
Ali et al.             Genetic algorithm                         —                200       1992    [50]
Minton et al.          Heuristic repair                          —                1000000   1992    [86]
Takefuji               Neural computing network                  X                100       1992    [7]
Lorenz                 GDS                                       X                6001      1993    [85]
Schaller               DBNN                                      X                200       1994    [41, 42]

B. COMPARISON OF NEURAL TECHNIQUES
All neural network techniques are based on the original Hopfield networks. A characterization of the network models is shown in Table VI, whereas Table VII summarizes the modifications that have been made to render each network more successful than the Hopfield network on which it is based.
C. COMPARISON OF ALL TECHNIQUES
Now, finally, we compare the methods according to the criteria developed in Section II. The criterion (C1) got nine credits if the NQP was solved for N > 100; that is, polynomial complexity can be assumed. The DFS still got five credits, as it could solve for N = 96 but seems to exhibit exponential behavior.
Table VI
Characterization of Neural Networks for Constraint Satisfaction

Network  Based on  Random component                                       System parameters
CHN      —         Initialization                                         Weight matrix
DHN      —         Initialization                                         Weight matrix, update rule
SDNN     DHN       —(?)                                                   Weight matrix
NCN      DHN       Initialization                                         Weight matrix, several
GDS      DHN       Initialization, break ties in most active first rule   Weight matrix, time-out
BM       DHN       New state                                              Weight matrix, cooling schedule
MFA      DHN, BM   —                                                      Weight matrix, cooling schedule
KWTA     CHN       Initialization                                         Weight matrix
DBNN     DHN       Break tie in most active first rule                    Weight matrix
RSNN     CHN       Small disturbances                                     Weight matrix
(C2)-(C7) and (C10) give the personal view of the author. (C8) and (C9) have not been included, because there is not enough information in the literature. Finally, the credit points are multiplied by a weighting factor and the scores are summarized in Table VIII.
Table VII
Modifications to the Hopfield Network

Network  E design          Convergence acceleration  Diagnosis of local minima  Treatment of local minima
SDNN     k-out-of-n        Most active first         Yes                        TOH rule
NCN      Good guess        ?                         Yes                        Hill-climbing term
GDS      Systematical      Most active first         Timeout                    Random reset
BM       Out of the scope  —                         Implicit                   Implicit
KWTA     KWTA rule         —                         No                         No
DBNN     Systematical      Most active first         Yes                        Dynamic barrier
RSNN     Systematical      —                         Implicit                   Constant speed
Table VIII
Assessment According to the Criteria C1-C10

                            C1   C2   C3   C4   C5   C6   C7   C10   Weighted
Criterion factor            10   6    5    9    8    7    3    4     score
Term rewriting              0    9    9    0    0    0    5    0     114
Divide and conquer          9    9    9    0    0    0    5    0     204
Enumeration                 0    9    9    9    0    0    9    0     207
Depth first search          5    9    9    9    0    0    9    0     257
Gradient descent            0    9    9    9    1    0    0    0     188
Simulated annealing         9    5    0    7    2    0    0    0     199
Genetic algorithms          9    3    0    7    2    0    0    0     187
Probabilistic local search  9    9    3    5    3    1    0    0     235
CHN, DHN                    0    8    8    9    3    1    0    9     236
SDNN                        9    9    9    9    7    8    0    9     418
NCN                         9    7    9    9    6    5    0    0     341
GDS                         9    8    8    9    7    6    0    0     357
BM                          9    5    0    9    3    1    0    0     232
KWTA                        0    8    8    9    5    6    0    0     251
DBNN                        9    9    9    9    9    9    0    0     405
RSNN                        0    9    9    9    9    9    0    0     315
D. SUMMARY
Apparently, the between-k-and-l-out-of-n design rule is a very successful method to represent constraint satisfaction problems. The language based on it allows us to program even large-scale problems like test pattern generation, IF THEN rule sets, switching problems, N-coloring, factorization of numbers, and large, complex puzzles. Although independent of any solution-finding method, these constraints are simple enough to be easily converted into the weight matrix of recurrent neural networks to implement a parallel processor in hardware. Furthermore, neural network techniques are much more successful in finding solutions to the N-queens problem than the traditional depth first search algorithm. On appropriately sized neural hardware, they are even capable of solving the problem in O(1) (nearly constant) time, independent of the parameter N. Although it has not yet been proven that neural network techniques can really solve NP problems within polynomial time, the SDNN and the DBNN are especially good candidates for solving a large range of practical large-scale constraint satisfaction problems, including all those suggested by the NCNs. To finally find an answer to (C8) and (C9), all methods should pass a more complex standard benchmark: a set of solvable problems such as the N-queens
problem. This should give an empirical proof of the independence from parameters by repeating with several problem sizes, different problem and variable encodings, different random seeds, and different random generators. And, if the median number of steps to find a solution is approximately the same (which may be specified more exactly by a statistical test), independent of the problem size, the maximum number of steps is finite, and all solutions are found with equivalent probability, then we have finally found a robust constraint satisfaction method for all our problems. However, returning to the initially posed technical problem of the optical processor placement we want to solve by a machine, the best solution would appear to be to encode the problem using the design rules of the DBNN paradigm and buy an SDNN chip, if it were available.
REFERENCES

[1] J. J. Hopfield. Neural networks and physical systems with emergent collective computation abilities. Proc. Nat. Acad. Sci. 79:2554-2558, 1982.
[2] R. Sosic and J. Gu. Fast search algorithms for the N-queens problem. IEEE Trans. Systems Man Cybernet. 21:1572-1576, 1991.
[3] Duden "Informatik." BI Wissenschaftsverlag, Mannheim, 1988; corrected reprint 1989.
[4] J. Gu. On a general framework for large-scale constraint-based optimization. SIGART Bull. 2:8, 1991.
[5] R. Shinghal. Formal Concepts in Artificial Intelligence. Chapman & Hall, London, 1992.
[6] R. Sedgewick. Algorithms, 2nd ed. Addison-Wesley, Reading, MA, 1988.
[7] Y. Takefuji. Neural Network Parallel Computing. Kluwer Academic, Boston, 1992.
[8] Y. Takefuji and J. Wang. Neural Computing for Optimization and Combinatorics. World Scientific, Singapore, 1996.
[9] C. Petersen and J. R. Anderson. Neural networks and NP-complete optimization problems; a performance study on the graph bisection problem. Complex Systems 59-89, 1988.
[10] G. A. Blaauw. Computer architecture. Elektronische Rechenanlagen 4:154-159, 1972.
[11] B. W. Lee and B. J. Sheu. Modified Hopfield neural networks for retrieving the optimal solution. IEEE Trans. Neural Networks 2:137-142, 1991.
[12] D. W. Tank and J. J. Hopfield. Simple "neural" optimization networks: an A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits Systems 33:533-541, 1986.
[13] M. Duque Anton and D. Kunz. Parallel algorithms for channel assignment in cellular mobile radio systems: the neural network approach. In Parallel Processing in Neural Systems and Computers (R. Eckmiller, G. Hartmann, and G. Hauske, Eds.), pp. 265-268. Elsevier, Amsterdam, 1990.
[14] D. Kunz. Suboptimum solutions obtained by the Hopfield-Tank neural network algorithm. Biol. Cybernet. 65:129-133, 1991.
[15] T. X. Brown. Neural networks for switching. IEEE Comput. Mag. 72-81, 1989.
[16] T. X. Brown and K.-H. Liu. Neural network design of a Banyan network controller. IEEE J. Select. Areas Comm. 8:1428-1438, 1990.
[17] J. Ghosh, A. Hukkoo, and A. Varma. Neural networks for fast arbitration and switching noise reduction in large crossbars. IEEE Trans. Circuits Systems 38:895-904, 1991.
[18] A. Marrakchi and T. Troudet. A neural net arbitrator for large crossbar packet-switches. IEEE Trans. Circuits Systems 36:1039-1041, 1989.
[19] T. P. Troudet and S. M. Walters. Neural network architecture for crossbar switch control. IEEE Trans. Circuits Systems 38:42-56, 1991.
[20] J. Bruck and M. Blaum. Neural networks, error-correcting codes, and polynomials over the binary n-cube. IEEE Trans. Inform. Theory 35:976-987, 1989.
[21] D. D. Caviglia et al. Neural algorithms for cell placement in VLSI design. In Proceedings of the IEEE First IJCNN, Washington, DC, pp. 573-580, 1989.
[22] K. H. Kim et al. Neural optimization network for minimum-via layer assignment. Neurocomputing 3:15-21, 1991.
[23] J. Naft. A modified Hopfield net approach to multiobjective design optimization for printed circuit board component placement. Internat. J. Neural Networks 1:78-84, 1989.
[24] U. Peine and H. P. Siemon. Optimization of the rectilinear Steiner tree using a mean field theory model. In Artificial Neural Networks: Proceedings ICANN'92 (I. Aleksander and J. Taylor, Eds.), Vol. 2, pp. 1043-1046. Elsevier, Amsterdam, 1992.
[25] M. Arai et al. An approach to automatic test pattern generation using strictly digital neural networks. In IJCNN'92, Baltimore, Vol. 4, pp. 474-479, 1992.
[26] M. Arai et al. A neural inverse function for automatic test pattern generation using strictly digital neural networks. In 11th IEEE VLSI Test Symposium, Atlantic City, pp. 238-243, 1993.
[27] M. Bellgard and C. P. Tsang. Harmonizing music using a network of Boltzmann machines. In Proceedings Neuro-Nimes'92, Nanterre Cedex, Vol. 2, pp. 321-322, 1992.
[28] K. Murakami et al. Solving four-coloring map problems using strictly digital neural networks. In IJCNN'91, Singapore, pp. 2440-2443, 1991.
[29] H. N. Schaller. A collection of constraint design rules for neural optimization networks. In Artificial Neural Networks II: Proceedings ICANN'92 (I. Aleksander and J. Taylor, Eds.), pp. 1039-1042. Elsevier, Amsterdam, 1992.
[30] Y. Takefuji and K. C. Lee. Artificial neural networks for four-coloring map problems and K-colorability problems. IEEE Trans. Circuits Systems 38:326-333, 1991.
[31] S. B. Eberhardt et al. Competitive neural architecture for hardware solution to the assignment problem. Neural Networks 4:431-442, 1991.
[32] S. Silven. A neural approach to the assignment algorithm for multiple-target tracking. IEEE J. Oceanic Engrg. 17:326-332, 1992.
[33] P. Bourret et al. Optimal scheduling by competitive activation: application to the satellite antennae scheduling problem. In Proceedings of the IEEE First IJCNN, Washington, DC, Vol. 1, pp. 529-532, 1989.
[34] L. Fang et al. A neural network for job sequencing. In Parallel Processing in Neural Systems and Computers (R. Eckmiller, G. Hartmann, and G. Hauske, Eds.), pp. 253-256. Elsevier, Amsterdam, 1990.
[35] P. W. Protzel. Artificial neural network for real-time task allocation in fault-tolerant, distributed processing systems. In Parallel Processing in Neural Systems and Computers (R. Eckmiller, G. Hartmann, and G. Hauske, Eds.), pp. 307-310. Elsevier, Amsterdam, 1990.
[36] Y. Akiyama et al. Combinatorial optimization with Gaussian machines. In Proceedings of the IEEE First IJCNN, Washington, DC, Vol. 1, pp. 533-540, 1989.
[37] M. Kajiura et al. Solving large scale puzzles with neural networks. In Proceedings IEEE TAP-89, pp. 562-569, 1989.
[38] Y. Takefuji and K.-C. Lee. A parallel algorithm for tiling problems. IEEE Trans. Neural Networks 1:143-145, 1990.
[39] J. L. Johnson. A neural network approach to the 3-satisfiability problem. J. Parallel Distributed Process. 6:435-449, 1989.
[40] Y. Takefuji and K. C. Lee. Neural network computing for knight's tour problems. Neurocomputing 4:249-254, 1992.
246
Hans Nikolaus Schaller
[41] H. N. Schaller. Design of neurocomputer architectures for large-scale constraint satisfaction problems. Neurocomputing 8:315-339, 1995. [42] H. N. Schaller. Entwicklung hochgradig paralleler Rechnerarchitekturen zur Losung diskreter Belegungsprobleme. Dissertation, Technical University of Munich, 1994. [43] B. Nadel. Representation selection for constraint satisfaction: a case study. IEEE Expert 5:1625, 1990. [44] V. N. Rao and V. Kumar. On the efficiency of parallel backtracking. IEEE Trans. Parallel Distributed Systems 4:427^37, 1993. [45] V. Plesser et al. Ein Workstation-LAN als verteiltes System zum parallelen Losen von kombinatorischen Suchproblemen. Informationstechnik, 273-279, 1992. [46] A. Miller. Optimierung eines Backtrackingverfahrens zur Losung von Belegungsproblemen. Diploma Thesis, Lehrstuhl fiir Datenverarbeitung, Technical University of Munich, 1992. [47] H. S. Stone and J. M. Stone. Efficient search techniques—an empirical study of the A^-queen problem. IBM J. Res. Develop. 31:464-474, 1987. [48] R J. M. van Laarhofen and E. H. L. Aarts. Simulated Annealing: Theory and Applications. Kluwer Academic, Dordrecht, 1987. [49] A. Cichocki and R. Unbehauen. Neural Networks for Optimization and Signal Processing. TeubnerAViley, Stuttgart/Chichester, 1993. [50] A. Homaifar et al. The A'^-queens problem and genetic algorithms. In Proceedings IEEE SOUTHEASTCON'92, pp. 262-267, 1992. [51] R. Sosic and J. Gu. A polynomial time algorithm for the A^-queens problem. SIGART Bull. 1:7-11,1990. [52] R. Sosic and J. Gu. 3,000,000 queens in less than one minute. SIGART Bull. 2:22-24, 1991. [53] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Nat. Acad Sci. 81:3088-3092, 1984. [54] J. J. Hopfield and D. W. Tank. "Neural" computation of decisions in optimization problems. Biol. Cybernet. 52:141-152, 1985. [55] J. J. Hopfield and D. W. Tank. Computing with neural circuits: a model. Science 233:625-633, 1986. [56] J. J. Hopfield. Collective computation, content-addressable memory, and optimization problems. In Complexity in Information Theory (Y. Abu-Mostafa, Ed.), pp. 99-113. Springer-Verlag, New York. [57] D. W. Tank and J. J. Hopfield. Collective computation in neuronlike circuits. Scientific American 62-70, 1987. [58] B. Miiller and J. Reinhardt. Neural Networks—An Introduction. Springer-Verlag, Berlin, 1990. [59] G. A. Tagliarini and E. Page. Solving constraint satisfaction problems with neural networks. In Proceedings of the First ICNN, San Diego, Vol. 3, pp. 741-747, 1987. [60] O. Shagrir. A neural net with self-inhibiting units for the A^-queens problem. Intemat. J. Neural Systems 3:249-252, 1992. [61] A. Johannet et al. Specification and implementation of a digital Hopfield-type associative memory with on-chip training. IEEE Trans. Neural Networks 3:529-539, 1992. [62] A. Kuh and B. W. Dickinson. Information capacity of associative memories. IEEE Trans. Inform. Theory 35:59-68, 1989. [63] H. J. Sussmann. On the number of memories that can be prefectly stored in a neural net with Hebb weights. IEEE Trans. Inform. Theory 35:174-178, 1989. [64] M. D. Johnston and H.-M. Adorf. Learning in stochastic neural networks for constraint satisfaction problems. In Proceedings of the NASA Conference on Space Telerobotics, Pasadena, 1989. [65] G. A. TagUarini and E. W. Page. Learning in systematically designed networks. In Proceedings of the IEEE First IJCNN, Washington, DC, Vol. 1, pp. 497-502, 1989.
Constraint Satisfaction Problems
247
[66] 80170NX, electrically trainable analog neural network, experimental data sheet. Intel Corporation, 1991. [67] E. W. Page and G. A. Tagliarini. Algorithm development for neural networks. In Proceedings lEEESPIE: High Speed Computing, Vol. 880, pp. 11-19, 1988. [68] G. A. Tagliarini. Undesirable equilibria in systematically designed neural networks. In Proceedings of the IEEE Southeastcon Region 3 Conference, Columbia, SC, pp. 63-67, 1989. [69] G. A. Tagliarini and E. W. Page. A neural-network solution to the concentrator assignment problem. In First IEEE Conference NIPS, Denver, pp. 775-782. Amer. Inst, of Phys., New York, 1987. [70] G. A. Tagliarini et al Optimization using neural networks. IEEE Trans. Comput. 40:1347-1358, 1991. [71] T. Nakagawa and H. Kitagawa. SDNN: an 0(1) parallel processing with strictly digital neural networks for combinatorial optimization. In Artificial Neural Networks (T. Kohonen, K. Makisara, O. Simula, and J. Kangas, Eds.), pp. 1181-1184. Elsevier, Amsterdam, 1991. [72] T. Nakagawa et al SDNN: a computation model for strictly digital neural networks and its application. In Proceedings of the Fifth AAAIC'89. ACM/SIGART, Dayton, OH, 1989. [73] T. Nakagawa et al. SDNN-3: a simple processor architecture for 0(1) parallel processing in combinatorial optimization with strictly digital neural networks. In IJCNN'91, Singapore, pp. 2444-2449, 1991. [74] T. Nakagawa et al. Strictly digital neurocomputer based on a paradigm of constraint set programming for solving combinatorial optimization problems. ICNN'93, San Francisco, pp. 1086-1091, 1993. [75] T. Nakagawa and K. Murakami. Evaluation of virtual slack-neurons for solving optimization problems in circuit design using neural networks based on the between-/-and-A:-out-of-« design rule. In WCNN'93, Portland, OR, pp. 122-125, 1993. [76] K. Murakami et al. A high-speed and low-cost parallel convergence in coloring map problems with virtual slack-neurons. In WCNN'93, Portland, OR, Vol. 4, pp. 421^24, 1993. [77] Y. Takefuji and K.-C. Lee. A super-parallel sorting algorithm based on neural networks. IEEE Trans. Circuits Systems 37:1425-1429, 1990. [78] Y Takefuji and K.-C. Lee. A near-optimum parallel planarization algorithm. Science 245:12211223, 1989. [79] Y Takefuji and K. C. Lee. An artificial hysteresis binary neuron: a model suppressing the oscillatory behaviors of neural dynamics. Biol. Cybernet. 64:141-152, 1991. [80] Y Takefuji et al. An artificial maximum neural network: a winner-take-all neuron model forcing the state of the system in a solution domain. Biol. Cybernet. 67:243-251, 1992. [81] K. C. Lee et al. A parallel improvement algorithm for the bipartite subgraph problem. IEEE Trans. Neural Networks 3:139-145, 1992. [82] H.-M. Adorf. Connectionism and neural networks. In Knowledge Based Systems in Astronomy (F. Murtagh and A. Heck, Eds.), pp. 215-245. Springer-Verlag, Heidelberg, 1989. [83] H.-M. Adorf and M. D. Johnston. A discrete stochastic neural network algorithm for contraint satisfaction problems. In Proceedings of the International Conference on Neural Networks, San Diego, pp. 917-924, 1990. [84] M. D. Johnston and H.-M. Adorf. ScheduHng with neural networks—^the case of the Hubble Space Telescope. Comput. Open Res. 19:209-240, 1992. [85] J. Lorenz. Vergleich verschiedener Neuronenmodelle zur Losung kombinatorischer Suchprobleme. Diploma Thesis, Lehrstuhl fiir Datenverarbeitung, Technical University of Munich, 1993. [86] S. Minton et al. 
Minimizing conflicts: a heuristic repair method for constraint satisfaction problems. A mTzda//nre///^ence 58:161-205, 1992. [87] Y Takefuji and H. Szu. Design of parallel distributed Cauchy machines. In Proceedings of the First IJCNN, Washington, DC, Vol. 1, pp. 529-532, 1989.
248
Hans Nikolaus Schaller
[88] G. Bilbro et al. Optimization by mean field annealing. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 1, pp. 91-98. Morgan Kaufmann, San Mateo, CA, 1989. [89] B. Hellstrom and L. N. Kanal. Asynmietric mean-field neural networks for multiprocessor scheduling. Neural Networks 5:671-686, 1992. [90] E. Majani et al. On the A'-winners-take-all network. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 1, pp. 634-642. Morgan Kaufmann, San Mateo, CA, 1989. [91] R. Erlanson and Y. Abu-Mostafa. Analog neural networks as decoders. In Advances in Neural Information Processing Systems (R. Lippmann, Ed.), Vol. 3, pp. 585-588. Morgan Kaufmann, SanMateo, CA, 1991. [92] D. S. Touretzky. Analyzing the energy landscapes of ditributed winner-take-all networks. In Advances in Neural Information Processing Systems (D. Touretzky, Ed.), Vol. 1, pp. 626-633. Morgan Kaufmann, San Mateo, CA, 1989. [93] H. N. Schaller. Problem solving by global optimization. In Proceedings IJCNN'93, Nagoya, Japan, pp. 1481-1484, 1993. [94] H. N. Schaller. On the problem of systematically designing energy functions for neural expert systems based on combinatorial optimization networks. In Proceedings Neuro-Nimes'92, Nanterre Cedex, Vol. 2, pp. 648-653,1992. [95] H. N. Schaller and K. Ehrenberger. Defining the attractor of a recurrent neural network by Boolean expressions. In Proceedings ICANN'93 (S. Gielen and B. Kappen, Eds.), pp. 712-715. Springer-Verlag, London, 1993. [96] M. Angermayer. Entwicklung eines heuristischen Minimierungsverfahrens fur eine Klasse neuronaler Strukturen. Diploma Thesis, Lehrstuhl fiir Datenverarbeitung, Technical University of Munich, 1992. [97] J. Nievergelt. Das Zahlenkreuz—^Eiger-Nordwand des parallelen Rechnens? In InformatikSpectrum, Vol. 13, pp. 344-346. Springer-Verlag, Berlin, 1990. [98] B. Abramson and M. Yung. Divide and conquer under global constraints: a solution to the AT-queens problem. J. Parallel Distributed Process. 6:649-662, 1989. [99] B. Bemhardsson. Explicit solutions to the AT-queens problem for all A^. SIGARTBull. 2:7,1991. [100] A. C. Chen and C.-L. Wu. A parallel execution model of logic programs. IEEE Trans. Parallel Distributed Systems 2:79-92, 1991. [101] J. Mandziuk and B. Macukow. A neural network to solve the A^-queens problem. Biol. Cybernet. 66:375-379, 1992.
Dominant Neuron Techniques

Jar-Ferr Yang
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, Republic of China

Chi-Ming Chen
Department of Electrical Engineering, Kao Yuan College of Technology and Commerce, Luchu, Kaohsiung, Republic of China
I. INTRODUCTION

Neural networks in various realization platforms have become a very popular research field for cognitive science, neurobiology, computer science, signal processing, and system modeling. Various applications can be implicitly related to the construction of neural processing models. To acquire a specific neural processing model, the associated training process plays an important role in optimally updating the connection weights of the network. In terms of their corresponding training algorithms, neural networks can generally be categorized as fixed-weight networks, supervised networks, and unsupervised networks. Fixed-weight networks with predetermined connection weights are mostly directed at associative information processing. Autoassociation retrieves the complete pattern when partial information about the desired pattern is given. Cross-association retrieves a corresponding pattern defined in one metric space when another pattern is given from a different metric space. The weights in associative neural networks are usually determined from either the autocorrelation or the cross-correlation formulation. No learning is required for fixed-weight networks. In supervised training networks, the learning processes involve pairs of input and output patterns, which represent the assistance of a teacher. The most
popular supervised network is the well-known multilayer back-propagation network, which is more suitable for applications involving a large number of classes with more complex separating clusters. Many classification applications, such as pattern and speech recognition, use supervised learning networks because their input and output patterns are one to one. When the relationship between the input and the output is unclear, unsupervised training networks should be introduced to overcome best-match problems. Thus, the unsupervised network adapts the weights and verifies the result based exclusively on the input patterns. Without the benefit of any teacher, the networks learn to adapt from the experience gathered through previously trained results.

In the preceding neural networks, dominant neuron techniques, which select the neuron (neurons) with the maximum initial activation (activations), are involved in many applications. In recognition problems, for example, we need to identify which candidate (or candidates) has the highest matching score. With back-propagation models, trained multilayer neural networks can only provide the maximum tendency for inputs which are not the exact training patterns. Determination of one or more maximally activated neurons therefore becomes important for supervised neural networks. During unsupervised learning processes, the best-matched candidate (or candidates) should be identified for the later reference patterns. Of course, trained unsupervised neural networks also require the same dominant neuron techniques to pick the best-matched pattern(s). Hence, the exploration of dominant neuron techniques has recently gained a lot of attention.

In this chapter, we shall provide an integrated and intensive investigation of the fundamental issues in the design and analysis of unsupervised learning neural networks for resolving the dominant neuron (or neurons) with the maximum preference. The explorations of the dominant neuron and of K neurons can be related to techniques for solving the winner-take-all (WTA) and K-winners-take-all (KWTA) problems, respectively. Well-known neural networks such as Grossberg's competitive learning [1], adaptive resonance theory [2], fuzzy associative memory [3], the learning vector quantizer [4], and their various versions [5] all require a WTA neural network. In classification applications, WTA neural networks can be applied to error correction systems, fuzzy associative memories, Gaussian classifiers, nearest-match content-addressable memory (CAM) [6], and so on. Recently, winner-take-all networks have been widely applied to signal processing and have become a fundamental building block of many complex systems [7, 8].

Consider the inputs $X_1, X_2, \ldots, X_M$, which are assumed to lie in the range $[X_{\min}, X_{\max}]$, where M denotes the number of inputs, and $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values of all possible inputs, respectively. If the inputs are arranged in ascending order of magnitude, which
satisfy

$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(M)}, \qquad (1)$$

where $X_{(m)}$ represents the mth value ordered increasingly from the minimum; the content of (m) carries the original index of the mth ordered input. Thus, the maximum activation of the inputs is denoted by $X_{(M)}$, the second maximum activation is expressed by $X_{(M-1)}$, ..., and the minimum activation, of course, is represented by $X_{(1)}$. The winner-take-all (WTA) neural network should process the M inputs to converge such that the M outputs become

$$Z_{(m)} = \begin{cases} X_{\max}, & \text{for } m = M, \\ X_{\min}, & \text{for } m \ne M. \end{cases} \qquad (2)$$
In other words, the neuron initially most activated will gradually dominate and become maximally activated, while the other competitive neurons die out in this so-called WTA neural system. Generally, the K-winners-take-all (KWTA) neural network performs a selection of the K competitors whose activations are larger than those of the remaining (M − K) ones. When K is equal to one, the KWTA network achieves the winner-take-all (WTA) process, which can verify the neuron with the maximum activation. Hence, we may treat the KWTA network as a generalization of the WTA network. The K-winners-take-all neural network, therefore, should make the M neurons converge respectively to

$$Z_{(m)} = \begin{cases} X_{\max}, & \text{for } m = (M-K+1), (M-K+2), \ldots, M, \\ X_{\min}, & \text{for } m = 1, 2, \ldots, (M-K). \end{cases} \qquad (3)$$
For example, if $X_1 = 0.1$, $X_2 = 0.9$, $X_3 = 0.05$, $X_4 = 0.2$, and $X_5 = 0.5$ are the inputs of the WTA and the 2WTA (i.e., K = 2) neural networks, then the maximum value of these inputs can be represented by $X_{(5)} = 0.9$ and the minimum is initially expressed by $X_{(1)} = 0.05$. It is obvious that the order indices related to their corresponding original indices are (5) = 2, (4) = 5, (3) = 4, (2) = 1, and (1) = 3 in this case. Equation (2) shows that the WTA neural network should finally output $Z_2 = 1$ and $Z_1 = Z_3 = Z_4 = Z_5 = 0$ if the global maximum and the global minimum values of the WTA and KWTA networks are 1 and 0, respectively. Similarly, Eq. (3) shows that the final outputs of the two-winners-take-all network will be $Z_2 = Z_5 = 1$ and $Z_1 = Z_3 = Z_4 = 0$. In general, WTA and KWTA neural networks can be categorized into two types. The first type is attained by continuous active networks, whose competitive processes converge asynchronously until the dynamic states are stabilized. Usually, the WTA competitive mechanisms embedded in continuous neural networks are characterized by differential equations. The second category, which
is performed by iterative procedures, is treated as an iterative updating WTA process controlled by a synchronous system clock. The WTA competitive mechanisms for iterative neural networks are described by difference equations. In this chapter, we shall discuss and analyze the well-known WTA and KWTA neural networks. In Section II, we first introduce continuous WTA neural networks. In Section III, iterative WTA neural networks are discussed, and convergence analyses of the recently developed iterative networks are provided. Simulation examples associated with the analyses are given to verify the convergence behaviors of the reviewed iterative WTA networks. In Section IV, continuous and iterative KWTA networks are examined and analyzed.
II. CONTINUOUS WINNER-TAKE-ALL NEURAL NETWORKS

Continuous WTA neural networks are the simplest competitive processes. Each competitive process can generally be characterized by differential equations as

$$\frac{dZ_i}{dt} = F_i(Z_1, Z_2, \ldots, Z_M, X_i), \qquad (4)$$

where

$$\frac{\partial F_i}{\partial Z_j} < 0, \qquad \text{for } j \ne i. \qquad (5)$$
The criterion stated in Eq. (5) explicitly reveals that all the neurons with a competitive tendency inhibit the other neurons in the network. Grossberg [9] introduced a continuous-time WTA mechanism, which satisfies the differential equation

$$\frac{dZ_i}{dt} = -A Z_i + (B - Z_i)\, f(Z_i) - Z_i \sum_{k \ne i} f(Z_k) + X_i, \qquad (6)$$
where $Z_i$ denotes the activation of the ith neuron and $X_i$ is the ith input. In Eq. (6), A and B are positive constants with the requirement that $B > Z_i$ to assure positive self-excitation, and f(Z) is a neuron function selected properly to cause the network to perform the WTA mechanism. Thus, the neuron function has nonlinear and nondecreasing characteristics. It is almost impossible to obtain a closed-form solution directly from Eq. (6). However, Grossberg [9] gave several comprehensive discussions of the convergence behavior of competitive mechanisms. The WTA behaviors expressed in Eq. (6) are exponential decay through the term $-A Z_i$; shunting self-excitation through $(B - Z_i) f(Z_i)$; shunting inhibition of the other units through $Z_i \sum_{k \ne i} f(Z_k)$; and externally applied inputs
through $X_i$. Each neuron of the network competes with the others by sending positive signals to itself and negative signals to all its neighbors in the network. The convergence of the preceding competitive processes has been verified in detail by Grossberg. With the existence of an equilibrium state, we can subsequently verify the WTA behavior by checking the order-preserving condition

$$\frac{dZ_i(t)}{dt} > \frac{dZ_j(t)}{dt} \qquad \text{for } Z_i(t-1) > Z_j(t-1), \qquad (7)$$

which should hold in any WTA competitive learning process. Feldman and Ballard [10] offered another WTA mechanism in which each unit would set itself to zero if it knew of a higher input to another competitor. Koch and Ullman [11] suggested two similar WTA mechanisms. In the first mechanism, the activity of the ith unit was assumed to satisfy the following differential equation:

$$\frac{dZ_i(t)}{dt} = Z_i(t)\left(X_i - \sum_{j=1}^{M} X_j Z_j(t)\right). \qquad (8)$$

It is obvious that the differential equation stated in Eq. (8) also satisfies both the competitive requirement shown in Eq. (5) and the order-preserving condition depicted in Eq. (7). With the initial condition $\sum_j Z_j(0) = 1$, we can easily prove that the activation of the ith unit is given by
$$Z_i(t) = \frac{Z_i(0)\exp(X_i t)}{\sum_j Z_j(0)\exp(X_j t)}. \qquad (9)$$

Of course, we can verify that $\sum_j Z_j(t) = 1$. By inspecting Eq. (9), we can see immediately that if $X_i$ is the maximum among all $X_j$, the exponential function $\exp(X_i t)$ will gradually dominate the remaining exponential functions $\exp(X_j t)$. Thus, the corresponding $Z_i$ will tend asymptotically to 1, while all the other $Z_j$'s decay to 0. Koch and Ullman admitted that the preceding WTA process was biologically improbable and suggested a second mechanism consisting of units arranged in a treelike structure. Each unit at a branch could select the winner of, say, two competing units at subbranches (or at the "leaves" of the tree where input values are presented) and faithfully transmit the value of the winner to the next (smaller) level in the tree. A parallel tree structure would help keep track of the path of the winner from the leaves to the root of the tree so that the winner could be identified. Yuille and Grzywacz [12] designed another WTA mechanism. In their network, the activity of the ith unit is governed by the following differential equation:
$$\frac{dZ_i}{dt} = -\frac{Z_i}{\tau} + X_i\, F\!\left(\sum_j Z_j\right), \qquad (10)$$
where the first and last terms correspond to time decay (with time constant τ) and lateral shunting inhibition, respectively. In Eq. (10), F(·) is a monotone decreasing function, which can be an exponential, step, hyperbolic, or rational function. We can choose the function F(·) to contain the inhibition, implemented, for example, by

$$F(z) = \exp(-\lambda z). \qquad (11)$$

We can then solve the differential equation stated in Eq. (10) to obtain

$$Z_i(t) = Z_i(0)\exp\left(-\frac{t}{\tau}\right) + X_i \exp\left(-\frac{t}{\tau}\right) \int_0^t \exp\left(\frac{s}{\tau}\right) \exp\left(-\lambda \sum_j Z_j(s)\right) ds, \qquad (12)$$

where λ is a constant. According to Eq. (12), if $X_i$ is the maximum among all $X_j$, the corresponding $Z_i$ will asymptotically tend to 1, while all the other $Z_j$'s decay to 0.

Lazzaro et al. [13] have designed and fabricated a series of compact, completely functional complementary metal oxide semiconductor (CMOS) integrated circuits that realize the WTA function using their full analog nature. This circuit has been used successfully as a component in several very large scale integration (VLSI) sensory systems that perform auditory localization and visual stereopsis. Another dynamic circuit, proposed by Perfetti [14] and shown in Fig. 1, makes use of M competing units (M is the number of different inputs from among which the network must find the largest). It consists of a single-ended operational (OP) amplifier operating as an integrator. The inverting input of the OP amp is connected to its own inverted output through a resistor of conductance $G_0$ and to the outputs of the remaining M − 1 amplifiers through resistors of conductance G. The operation of the circuit that performs a continuous WTA function is as follows. The first pulse moves the switches to position 1, feeding the input $V_i$ into the capacitor C as a sample-and-hold, as shown in Fig. 1. The second pulse moves the switches to position 2. Then the state of the network, represented by the voltages $V_{ci}$ across the capacitors, evolves continuously through the state space from the initial conditions $V_{ci}(0) = V_i$ for i = 1, 2, ..., M. At equilibrium, the output $V_{\mathrm{out}}$ corresponding to the maximum $V_i$ is at a high saturation level, while all the others are at zero level.
Figure 1 Basic structure of the differential WTA network.
Applying the Kirchhoff current law (KCL) at the node connected to the inverting input of each OP amplifier in Fig. 1, we obtain the state equations of the network as

$$C\,\frac{dV_{ci}}{dt} = G_0 V_{ci} - G \sum_{j=1,\, j \ne i}^{M} V_{cj}, \qquad \text{for } i = 1, 2, \ldots, M. \qquad (13)$$

When the dynamic behavior is completed, in other words, when the right-hand side of Eq. (13) is equal to zero, the maximum output converges to the saturation voltage of the OP amp, and the other outputs are inhibited to zero, if the values of the conductances satisfy the following design constraints:

$$G + G_0 > 1/R \qquad (14)$$

and

$$(M-1)G - G_0 > -1/R. \qquad (15)$$

The detailed discussions of the preceding network are complex (see [14] for further investigation). However, we can apply results similar to those for the differential equation depicted in Eq. (6) to verify the convergence of the network.
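To make the behavior of such continuous mechanisms concrete, the sketch below numerically integrates the normalized dynamics of Eq. (8) with a forward-Euler step. This is only an illustration: the step size, iteration count, and sample inputs are our own choices, not values from the text.

```python
import numpy as np

def continuous_wta(x, dt=0.01, steps=5000):
    """Forward-Euler integration of dZ_i/dt = Z_i (X_i - sum_j X_j Z_j),
    the normalized dynamics whose closed-form solution is Eq. (9)."""
    x = np.asarray(x, dtype=float)
    z = np.full(x.shape, 1.0 / x.size)       # initial condition: sum_j Z_j(0) = 1
    for _ in range(steps):
        z = z + dt * z * (x - np.dot(x, z))  # competition through the shared sum
    return z

print(continuous_wta([0.1, 0.9, 0.05, 0.2, 0.5]).round(3))
# the most activated unit tends to 1 while the others decay to 0
```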
For other continuous WTA networks, Coultrip et al. [15] suggested a WTA mechanism consisting of several excitatory neurons jointly innervating and receiving feedback from a common inhibitory interneuron. Ermentrout [16] designed a very simple network of excitatory cells coupled via a single inhibitory neuron whose time constant is allowed to vary. The basic model is assumed to mimic a small piece of cortex where the ratio of excitatory pyramidal cells to inhibitory interneurons is large. Lemmon and Kumar [17] provided an input-output description to predict the active neurons resulting from the dynamics of laterally inhibited neural networks. Pankove et al. [18] suggested an optically controlled WTA circuit based on the pnpn structure that can be used in the optical implementation of a competitive network. An implementation of a WTA network suitable for content-addressable-memory applications was presented by Johnson and Jalaleddine [19]. Seiler and Nossek [20] presented an implementation of winner-take-all behavior in inputless cellular neural networks. Although this approach basically requires a fully interconnected network, a simplified structure with only linear architectural complexity exists. The design and implementation of a high-precision VLSI winner-take-all circuit that can be arranged to process 1024 inputs was presented by Choi and Sheu [21]. The cascade configuration can be used to significantly increase the competition resolution and maintain high-speed operation for a large-scale network. The circuit can easily be extended to a large scale of over 1000 competitive cells for real-world applications. The prototype chip was fabricated in a 2-μm CMOS technology and successfully connected to construct a 200-input winner-take-all circuit.
III. ITERATIVE WINNER-TAKE-ALL NEURAL NETWORKS

In this section, we shall introduce well-known iterative WTA neural networks and discuss the principles of their competitive processes. In iterative WTA neural networks, the neurons are synchronously updated to achieve competition systematically. The realization of iterative WTA networks can be either in a recursive style with single-layer competitions or in a direct feedforward fashion with multiple-layer competitions.
A. PAIR-COMPARED COMPETITION

The pair-compared neural network (PACNET) shown in Fig. 2 is an eight-input multilayer feedforward structure composed of pair-comparison subnets that pick a maximum in a pair-by-pair hierarchy. The PACNET is recognized as the simplest structure of multiple-layer competition.
Figure 2 Complete eight-input PACNET.
The basic element in the PACNET is the pair-comparison subnet [22], which can be treated as the smallest WTA network for two-input competition. The detailed network configuration of the comparison subnet is shown in Fig. 3.

Figure 3 Basic pair-comparison subnet.

There are three outputs in this basic pair-comparison subnet: two WTA outputs and the maximum output. The WTA outputs $Z_1$ and $Z_2$ can be obtained by two hard limiters as

$$Z_1 = f_h(X_1 - X_2) \qquad (16)$$
and

$$Z_2 = f_h(X_2 - X_1), \qquad (17)$$

where $f_h(\cdot)$ is the hard-limiter function whose transfer characteristics, shown in Fig. 4a, perform

$$f_h(X) = \begin{cases} 1, & \text{for } X \ge 0, \\ 0, & \text{for } X < 0. \end{cases} \qquad (18)$$

Figure 4 (a) Transfer characteristics of the hard-limiter function; (b) transfer characteristics of the ramp function.
The maximum output of the pair-comparison subnet is expressed by

$$Y = 0.5\, f_r(X_1 - X_2) + 0.5\, f_r(X_2 - X_1) + 0.5\, X_1 + 0.5\, X_2, \qquad (19)$$

where $f_r(\cdot)$ is the ramp threshold function whose transfer characteristics, shown in Fig. 4b, are

$$f_r(X) = \begin{cases} X, & \text{for } X \ge 0, \\ 0, & \text{for } X < 0. \end{cases} \qquad (20)$$

When $X_1 > X_2$, the WTA outputs become $Z_1 = 1$ and $Z_2 = 0$, and the maximum output produces $Y = 0.5(X_1 - X_2) + 0 + 0.5X_1 + 0.5X_2 = X_1$. When $X_1 < X_2$, it is obvious that $Z_1 = 0$, $Z_2 = 1$, and $Y = X_2$. Each subnet not only determines
the dominant input but also feeds the maximum value of its two inputs $X_1$ and $X_2$ forward to the next layer for further comparisons. The PACNET for M inputs constructed from pairwise comparisons requires $\lceil \log_2 M \rceil$ layers, where $\lceil \cdot \rceil$ denotes the nearest integer that is greater than or equal to the argument. In Fig. 2, the PACNET must be constructed as a three-layer structure to resolve eight-input comparisons. Because the PACNET requires $\lceil \log_2 M \rceil$ layers to achieve a WTA process for M inputs, it needs exactly $\lceil \log_2 M \rceil$ iterative cycles to complete a competitive process. The convergence of the PACNET is obviously assured; however, its layered structure is too complicated to be utilized for performing the WTA process in hardware implementations.
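Because Eqs. (16)-(20) define the pair-comparison subnet completely, the PACNET tournament is easy to sketch in software. The following minimal Python illustration (function names are ours, and a power-of-two input count is assumed for brevity) mirrors the hierarchy of Fig. 2.

```python
def f_h(x):            # hard limiter, Eq. (18)
    return 1 if x >= 0 else 0

def f_r(x):            # ramp threshold, Eq. (20)
    return x if x >= 0 else 0

def pair_subnet(x1, x2):
    """Pair-comparison subnet: two WTA outputs and the maximum, Eqs. (16)-(19)."""
    z1, z2 = f_h(x1 - x2), f_h(x2 - x1)
    y = 0.5 * f_r(x1 - x2) + 0.5 * f_r(x2 - x1) + 0.5 * x1 + 0.5 * x2
    return z1, z2, y

def pacnet_max(xs):
    """Pairwise tournament over ceil(log2 M) layers; returns the maximum."""
    layer = list(xs)
    while len(layer) > 1:
        layer = [pair_subnet(a, b)[2] for a, b in zip(layer[::2], layer[1::2])]
    return layer[0]

print(pacnet_max([0.1, 0.9, 0.05, 0.2, 0.5, 0.3, 0.7, 0.6]))  # -> 0.9
```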
B. FIXED MUTUALLY INHIBITED COMPETITION

The MAXNET [22] shown in Fig. 5 is a one-layer neural network with a feedback structure whose design mimics the heavy use of lateral inhibition in the human brain. This mutually inhibited M-competitor WTA network consists of M neurons and $M^2$ connection weights. The connection weight from node j to node i in the MAXNET is fixed as

$$w_{ij} = \begin{cases} 1, & \text{for } i = j, \\ -\varepsilon, & \text{for } i \ne j,\ 1 \le i, j \le M. \end{cases} \qquad (21)$$
The weight from each node to itself is set to 1, and the weights with a fixed value −ε between distinct nodes perform the inhibition of the others.
Figure 5 Configuration of the MAXNET.
The characteristics of the neurons in the MAXNET could be sigmoid, ramp, or any threshold function that has a monotonically increasing nature. The monotonically increasing nature preserves the original order of the inputs, and the threshold vitality ensures that the dominant neuron (neurons) remains active while the inactive ones are disabled. In this chapter, without loss of generality, we assume that the following iterative WTA networks use neurons with the ramp threshold function described in Eq. (20). Hence, in the MAXNET, the activation of the ith neuron after the tth iteration becomes

$$Z_i(t+1) = Z_i(t) - \varepsilon\bigl(Z_1(t) + \cdots + Z_{i-1}(t) + Z_{i+1}(t) + \cdots + Z_M(t)\bigr) = Z_i(t) - \varepsilon \sum_{j=1,\, j \ne i}^{M} Z_j(t) \qquad (22)$$

for i = 1, 2, ..., M. To ensure that the winning neuron remains active (i.e., $Z_{(M)}(t+1) > 0$), the selection of ε should satisfy
$$Z_{(M)}(t) - \varepsilon \sum_{j=1,\, j \ne (M)}^{M} Z_j(t) > 0 \qquad (23)$$

for all t. Because $Z_{(M)}(t) \ge Z_j(t)$ for $j \ne (M)$, we can develop another inequality,

$$Z_{(M)}(t) - \varepsilon \sum_{j=1,\, j \ne (M)}^{M} Z_j(t) \ge Z_{(M)}(t) - \varepsilon \sum_{j=1,\, j \ne (M)}^{M} Z_{(M)}(t) = \bigl[1 - \varepsilon(M-1)\bigr] Z_{(M)}(t). \qquad (24)$$
If we choose

$$1 - \varepsilon(M-1) > 0, \qquad (25)$$

we can ensure that the inequality stated in Eq. (23) will hold in all cases. Equation (22) shows that all the neurons are inhibited by the others in each iteration. In other words, the activations of the neurons gradually decrease. With the bound on ε stated in Eq. (25), the winning neuron will never die out. Thus, the convergence of the MAXNET is confirmed if ε < 1/(M − 1). Hence, Lippmann [22] claimed that the mutual inhibition ε should be less than 1/M in order to ensure the convergence of the MAXNET. By observing Eq. (22), the mutual inhibition embedded in the MAXNET can be expressed as the iterative thresholding

$$Z_i(t+1) = (1+\varepsilon)\, Z_i(t) - \varepsilon \sum_{j=1}^{M} Z_j(t). \qquad (26)$$
Because (1 + ε) is a magnifying factor in the WTA processing, we can ignore this factor without altering the final result. Thus, the equivalent thresholding in the MAXNET is given by

$$Z_i(t+1) = Z_i(t) - \frac{\varepsilon}{1+\varepsilon} \sum_{j=1}^{M} Z_j(t) = Z_i(t) - T_{\mathrm{MAX}}(t), \qquad (27)$$

where

$$T_{\mathrm{MAX}}(t) = \frac{\varepsilon}{1+\varepsilon} \sum_{j=1}^{M} Z_j(t). \qquad (28)$$
Note that this threshold function is time-varying but the same for all neurons. To explore the convergence behavior of the MAXNET, we can examine the difference between the largest (i.e., the winner's) activation, $Z_{(M)}(t)$, and the second largest one, $Z_{(M-1)}(t)$. If the difference between the initial activations of these two neurons is defined as $Z_{(M)}(1) - Z_{(M-1)}(1) = \Delta$, from Eq. (26) we can obtain the difference between the first and second maximum activations in each iteration as

$$Z_{(M)}(t) - Z_{(M-1)}(t) = (1+\varepsilon)\bigl\{Z_{(M)}(t-1) - Z_{(M-1)}(t-1)\bigr\} = (1+\varepsilon)^2\bigl\{Z_{(M)}(t-2) - Z_{(M-1)}(t-2)\bigr\} = \cdots = (1+\varepsilon)^{t-1}\Delta. \qquad (29)$$

For a large number of competitors M, ε is very small, resulting in

$$Z_{(M)}(t) - Z_{(M-1)}(t) = \Delta(1+\varepsilon)^{t-1} \approx \Delta(1 + t\varepsilon). \qquad (30)$$
The difference between the first two maximum activations thus increases linearly with the number of iterations. We can say that the MAXNET has converged when the difference is greater than a certain threshold, say $\Delta_{\max}$. Then the number of iterations required for convergence becomes

$$n_{\mathrm{MAXNET}} \ge \frac{\Delta_{\max} - \Delta}{\Delta\,\varepsilon} = K\,\frac{1}{\varepsilon}, \qquad (31)$$

where $K = (\Delta_{\max} - \Delta)/\Delta$. It is obvious that K depends on the initial difference Δ. However, the initial difference depends not only on the distribution of activations but also on the number of input competitors. Usually, the larger the number of inputs M, the smaller the initial difference Δ. Once these two factors are fixed, the convergence speed is determined by ε. For the MAXNET [22], the mutual inhibition ε is suggested to be no larger than 1/M. If we choose ε ≈ 1/M, the number of iterations required for convergence is then

$$n_{\mathrm{MAXNET}} \ge K M. \qquad (32)$$
For a small number of competitors, the convergence of the MAXNET is linearly proportional to M. If the number of competitors is large or several activations are close to each other, the convergence becomes even slower. However, the simple configuration of the net is the MAXNET's favorable advantage.
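A direct software rendering of the MAXNET iteration of Eq. (22), with the ramp threshold of Eq. (20) applied after each update, might look as follows; the default ε = 1/M follows Lippmann's suggestion, and the stopping test simply counts surviving neurons.

```python
import numpy as np

def maxnet(x, eps=None, max_iter=10000):
    """Iterate Z_i(t+1) = f_r(Z_i(t) - eps * sum_{j != i} Z_j(t)), Eq. (22)."""
    z = np.asarray(x, dtype=float)
    if eps is None:
        eps = 1.0 / z.size                # Lippmann's suggestion, eps < 1/M
    for t in range(max_iter):
        z = np.maximum(0.0, z - eps * (z.sum() - z))  # ramp threshold f_r
        if np.count_nonzero(z) <= 1:      # only the winner survives
            return z, t + 1
    return z, max_iter

z, iters = maxnet([0.1, 0.9, 0.05, 0.2, 0.5])
print(np.argmax(z), iters)                # index 1 wins
```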
C. DYNAMIC MUTUAL-INHIBITION COMPETITION

The major problem of the MAXNET is its slow convergence when the values of the competitors are nearly the same or the number of inputs is large. To overcome these problems, Yen and Chang [24] designed a dynamic WTA scheme in which the strength of the mutual inhibition between the processing units is adaptively updated. The dynamic MAXNET, which is also a one-layer neural network with a feedback structure, utilizes the dynamic connection weight [24]

$$w_{ij}(t) = \begin{cases} 1, & \text{for } i = j, \\ -\dfrac{1}{M(t)}, & \text{for } i \ne j, \end{cases} \qquad (33)$$

between node i and node j, where M(t) is the number of active neurons at the tth iteration. The adaptation of the dynamic MAXNET tends to increase the effect of the mutual inhibition during the iteration process. As shown in Eq. (27), the equivalent threshold function of the improved MAXNET (IMAXNET) becomes

$$T_{\mathrm{IMAX}}(t) = \frac{\varepsilon(t)}{1+\varepsilon(t)} \sum_{j=1}^{M} Z_j(t), \qquad (34)$$

where ε(t) = 1/M(t). Similar to the MAXNET, the difference between the first and second maximum activations of the improved MAXNET is

$$Z_{(M)}(t) - Z_{(M-1)}(t) = \bigl(1+\varepsilon(t)\bigr)\bigl(Z_{(M)}(t-1) - Z_{(M-1)}(t-1)\bigr) = \Delta \prod_{k=1}^{t-1} \bigl(1+\varepsilon(k)\bigr). \qquad (35)$$
Because M(t) < M, we know that ε(t) > ε. Comparing Eq. (28) to Eq. (34), the threshold function in the improved MAXNET is larger than that in the MAXNET. At the same time, we can also show that the difference stated in Eq. (35) is larger than that depicted in Eq. (29). Thus, we can prove that the speed of convergence of the improved MAXNET is faster than that of the MAXNET [24].
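The improved MAXNET differs from the preceding sketch only in recomputing the inhibition strength from the current number of active neurons, ε(t) = 1/M(t); a minimal variant under that reading:

```python
import numpy as np

def improved_maxnet(x, max_iter=10000):
    """MAXNET with dynamic inhibition eps(t) = 1/M(t), M(t) = active neurons."""
    z = np.asarray(x, dtype=float)
    for t in range(max_iter):
        m_t = np.count_nonzero(z)         # M(t): currently active neurons
        if m_t <= 1:
            return z, t
        eps_t = 1.0 / m_t                 # inhibition grows as losers drop out
        z = np.maximum(0.0, z - eps_t * (z.sum() - z))
    return z, max_iter
```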
D. MEAN-THRESHOLD MUTUAL-INHIBITION COMPETITION

When we increase the strength of inhibition, the number of iterations of the dynamic MAXNET will be decreased. The GEMNET proposed in [25] was first introduced from the basic idea of the general mean-based WTA neural network. It is well known that the maximum is always greater than the mean of any subset of activations of all competitive neurons. So the decision of the winner is processed by the following simple mean-based WTA axiom: inhibit the processing elements whose activations are less than the mean of the activated neurons until the stable state is reached. Because the maximum is always greater than the mean value when more than one neuron is activated, after each mean-based WTA process a portion of the neurons will be inhibited if their activations are less than the mean value of the active activations. The threshold function for the GEMNET, of course, is expressed as

$$T_{\mathrm{GEM}}(t) = \frac{1}{M(t)} \sum_{j \in \text{active}} Z_j(t), \qquad (36)$$

where M(t) is the number of active neurons. The whole process continues iterating until only one neuron, which contains the maximum activation, remains activated. When only one activated neuron remains, the maximum value and the mean value are the same. Under the guideline of the mean-based WTA axiom, no neuron will be further inhibited when there is a single active neuron; therefore, the GEMNET reaches the desired stable state. After simplification, the connection weight $w_{ij}(t)$ in the GEMNET becomes

$$w_{ij}(t) = \begin{cases} \gamma, & \text{for } i = j, \\ -\dfrac{\gamma}{M(t)-1}, & \text{for } i \ne j,\ 1 \le i, j \le M, \end{cases} \qquad (37)$$
where γ (1 < γ) acts as a compensation factor. The compensation factor, which varies with the distribution of the activations, helps to keep the maximum activation constant. The detailed configuration of the GEMNET is depicted in Fig. 6. It has been verified that the suggested WTA neural network on average requires fewer than $\log_2(M)$ iterations to complete a WTA process for uniform, normal, and peak-uniform distributed inputs, where M is the number of competitors. Comparing Eq. (36) to Eqs. (28) and (34), it is obvious that the threshold function in the GEMNET is larger than those in the MAXNET and the improved MAXNET. Thus, after n iterations, the difference between the maximum and the second maximum of the GEMNET is larger than that of the MAXNET [22] and the improved
MAXNET [24]. In other words, the convergence speed of the GEMNET is faster than that of the MAXNET and the improved MAXNET. Suter and Kabrisky [26] suggested a WTA network which requires different types of nodal computation elements and no nodal switches. This WTA network, built on the concept of the statistical mean, is similar to the GEMNET and likewise completes the WTA process in about $\log_2(M)$ iterations.

Figure 6 Configuration of the GEMNET.
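In software, the mean-based WTA axiom reduces to thresholding at the mean of the currently active activations. The sketch below uses the equivalent threshold of Eq. (36) directly, rather than the weight matrix of Eq. (37); this is our simplification, not code from [25].

```python
import numpy as np

def gemnet(x, max_iter=1000):
    """Inhibit every neuron whose activation falls below the mean of the
    active activations, Eq. (36), until a single winner (or a stable
    state of tied maxima) remains."""
    z = np.asarray(x, dtype=float)
    for t in range(max_iter):
        active = z > 0
        if active.sum() <= 1:
            return z, t
        mean = z[active].mean()                 # T_GEM(t), Eq. (36)
        new_z = np.where(z >= mean, z, 0.0)     # mean-based WTA axiom
        if np.array_equal(new_z, z):            # stable state reached
            return z, t
        z = new_z
    return z, max_iter

print(gemnet([0.1, 0.9, 0.05, 0.2, 0.5]))      # winner at index 1
```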
E. HIGHEST-THRESHOLD MUTUAL-INHIBITION COMPETITION

If we can increase the level of inhibition, the convergence speed of the GEMNET will be further improved. By using an acceleration factor, the GEMNET can be extended to the higher-order-statistics-based neural network (HOSNET) [27, 28]. The basic concept of the HOSNET evolved from the mean-based threshold. By raising the threshold up to the average of the second maximum, the mutual-inhibition process is expected to finish in just one iteration. Thus, the threshold function in the HOSNET is designed to become

$$T_{\mathrm{HOS}}(t) = E\bigl[Z_{(M-1)}\bigr]. \qquad (38)$$

The connection weight between node i and node j of the HOSNET is given by

$$w_{ij}(t) = \begin{cases} \gamma, & \text{for } i = j, \\ -\dfrac{\gamma\,\beta(t)}{M(t)-\beta(t)}, & \text{for } i \ne j,\ 1 \le i, j \le M, \end{cases} \qquad (39)$$
Dominant Neuron Techniques Z,(0
Z^(t)
M(r)=Z/,(Z,(M)+^ o<e
/O *) =
\
^1
^2
-8(0
^
Figure 7
Configuration of the HOSNET.
where M(t) and ^ (t) denote the number of active neurons and the proposed acceleration factor in the HOSNET, respectively. The HOSNET is depicted in Fig. 7. If ^(r) = 1, the HOSNET is identical to the GEMNET. How to determine a better acceleration factor to increase the convergence speed becomes an important issue [28]. The acceleration factor can be designed dynamically to optimally increase the convergence speed. However, ^(t) is suggested as P(t) =
S(t), 1,
forM(0 > 1, for M(t) = 0,
(40)
where the selection of the optimal acceleration factor 5(0, which is developed in [28], should be greater than one in order to improve the convergence speed. The convergence speed of the HOSNET is higher than that of the GEMNET for the case of large competitors.
R DYNAMIC THRESHOLDING COMPETITION In the preceding mutual-inhibition WTA neural networks, the activations of neurons are varied during competition processes once the original inputs, {Xi, X 2 , . . . , XM} are fed into the networks. After the completion of the WTA process, one and only one neuron will survive, and the remaining neurons will die out completely. The preceding WTA neural networks do not require storing the original inputs.
266
Jar-Fen Yang and Chi-Ming Chen
By storing the original inputs, {Xi, X 2 , . . . , XM}, the SELECTRON [29] is proposed to physically select a maximum threshold to result in a WTA neural process. The self-excited connection weight in the SELECTRON is designed as giXi), Mjt) - 1
Wii(t) =
for Z , - ( 0 = 0 , . 7 , , , ,
(41)
where g(X) denotes the function where g(X) = —1 for X > 0 and g(X) = 1 for X < 0. The mutual-inhibition connection weight between node / and node j (i / 7) is given by 0, Wij(t) =
forZKO = O o r Z ; ( 0 = 0, 1_ —
otherwise.
^ ^
In Eqs. (35) and (36), the WTA output
Zi(t) =
fhlj2-^Wij(t-l)Xj^Ii\
\
yeactive
/
represents the hard-limited output of the output node / at the A:th iteration and // < 0. It is obvious that Z/ (t) is an indication of the action or nonaction of the ith output neuron. The number of active output neurons can be obtained by M(t) = J2j=i ^jiO' The dynamic threshold is obtained from the average of the original inputs corresponding to the active output neurons. In the dynamic thresholding process, Eq. (43) shows that the SELECTRON gradually raises the threshold as
yeactive
which represents the average of the largest M(t — 1) activations. The SELECTRON converges if M(t-\-1) = M(t). The number of iterations required for convergence is bounded above by M. The average number of iterations is not greater than Log2(M) for large competitors. The convergence speed of the SELECTRON network is similar to that of the GEMNET network; however, the SELECTRON can select the minimum neuron for any data sets, if we modify some of functions and parameters.
Table I  Average Number of Iterations Required for Completion of a WTA Process (Uniformly Distributed Inputs in [0, 1])

WTA networks   10 inputs   50 inputs   100 inputs   500 inputs   1000 inputs   2000 inputs   3000 inputs   4000 inputs
MAXNET         11.89       72.51       145.3        645.4        1000*         2000*         3000*         4000*
IMAXNET        4.54        7.01        8.05         10.12        11.35         12.13         12.73         13.17
GEMNET         3.12        5.39        6.38         8.76         9.76          10.78         11.35         11.66
HOSNET         2.97        3.18        3.30         3.51         3.67          3.80          3.89          3.94

*Not converged at this point.
G. SIMULATION RESULTS
Inputs with uniform, normal, and peak-uniform distributions are randomly generated by Monte Carlo simulations to evaluate the WTA behavior of the MAXNET, improved MAXNET, GEMNET, and HOSNET. After 1000 independent runs, Tables I-III show the average number of iterations required for complete convergence using these four WTA nets. Complete convergence is defined such that all of the activations except the winner's are inhibited to zero by the WTA nets. For the uniform and normal distributions, the MAXNET cannot converge within M iteration cycles, where M is the number of inputs. However, the MAXNET does converge within a reasonable number of iteration cycles for the peak-uniform distribution if M is less than 50. When M is greater than 1000, the MAXNET cannot converge for the reasons stated in the previous section. The MAXNET requires the largest number of iterations to converge, whereas the HOSNET with the optimal acceleration factor requires the least.
Table II  Average Number of Iterations Required for Completion of a WTA Process (Normally Distributed Inputs with N[0, 1])

WTA networks      10 inputs   50 inputs   100 inputs   500 inputs   1000 inputs   2000 inputs   3000 inputs   4000 inputs
MAXNET            8.60        50.16       106.6        515.3        1066          2000*         3000*         4000*
IMAXNET           3.30        5.27        5.84         7.66         8.52          9.10          9.51          9.68
GEMNET            2.12        3.94        4.44         6.28         6.90          7.60          8.04          8.42
HOSNET (β = 3)    1.30        3.86        4.09         3.69         4.41          4.84          4.92          4.83

*Not converged at this point.
Table III  Average Number of Iterations Required for Completion of a WTA Process (One Peak Value at 0.9 and the Rest Uniformly Distributed in [0, 0.5])

WTA networks       10 inputs   50 inputs   100 inputs   500 inputs   1000 inputs   2000 inputs   3000 inputs   4000 inputs
MAXNET             3.15        11.29       19.42        46.68        67.72         97.93         122.6         143.8
IMAXNET            2.34        3.71        4.01         5.05         6.04          6.32          6.89          7.03
GEMNET             2.02        3.16        4.00         5.00         5.92          6.00          6.34          7.00
HOSNET (β = 1.5)   1.31        2.00        2.24         3.00         3.02          3.79          4.00          4.00
In other words, the HOSNET shows the fastest WTA behavior among the four nets in the cases of uniformly, normally, and peak-uniformly distributed inputs. The simulation results agree with the preceding statistical convergence analyses of the four iterative WTA neural networks.
IV. K-WINNERS-TAKE-ALL NEURAL NETWORKS

The K-winners-take-all (KWTA) neural network performs a selection of the K competitors whose activations are larger than those of the remaining (M − K) ones. When K is equal to one, the KWTA network achieves the well-known winner-take-all (WTA) process, which can verify the neuron with the maximum activation. When K = (M − 1), we can select the minimum activation if the indication of the minimum output is reversed. Thus, the KWTA network is a generalization of the WTA network. The KWTA neural networks can be categorized into two types. The first type of KWTA network uses dynamic mutual inhibition in a one-layer structure. The second category of KWTA network, which adopts a dynamic threshold with decaying variations, is also built in a one-layer competitive architecture.
A. CONTINUOUS K-WINNERS-TAKE-ALL COMPETITION

Majani et al. [30] suggested a KWTA mechanism which uses the continuous Hopfield model of dynamics for reliable K-winners convergence. In the continuous Hopfield K-winners-take-all (CHKWTA) network [30], the convergence
behavior can be characterized by the differential equations

$$C\,\frac{dZ_i}{dt} = -Z_i + \sum_{j=1}^{M} w_{ij}\, f_s(Z_j) + E, \qquad \text{for } i = 1, 2, \ldots, M, \qquad (45)$$

where k = M − 1 + a (a < 1), E = 2K − M, and $f_s(Z_i)$ is a sigmoid function that varies between −1 and +1. Thus, the CHKWTA network possesses the weight between node i and node j given by

$$w_{ij} = \begin{cases} a, & i = j, \\ -1, & i \ne j \text{ and } 1 \le i, j \le M, \end{cases} \qquad (46)$$

if E is treated as the external input. It has been verified in [30] that the stable equilibrium state of the CHKWTA network converges to K positive numbers in the positions of the K largest inputs and (M − K) negative numbers in the remaining positions.
B. INTERACTIVE ACTIVATION K-WINNERS-TAKE-ALL COMPETITION

Wolfe et al. [31] proposed and analyzed a special class of mutually inhibitory networks to complete a KWTA network by using interactive activations. The interactive activation K-winners-take-all (IAKWTA) network [31] suggests a special class of mutually inhibitory network and provides parameters for reliable K-winners performance. In the IAKWTA network, the weight between node i and node j is given by

$$w_{ij} = w, \qquad i \ne j \text{ and } 1 \le i, j \le M, \qquad (47)$$
where w < 0 and is usually chosen as −1. The network dynamics are modeled by using interactive activation to update the units as

$$Z_i(t+1) = Z_i(t) + \gamma\,\Delta Z_i(t), \qquad (48)$$

with

$$\Delta Z_i(t) = \mathrm{Net}_i(t)\,\bigl[X_{\max} - Z_i(t)\bigr] \qquad \text{if } \mathrm{Net}_i(t) > 0, \qquad (49)$$

$$\Delta Z_i(t) = \mathrm{Net}_i(t)\,\bigl[Z_i(t) - X_{\min}\bigr] \qquad \text{if } \mathrm{Net}_i(t) < 0, \qquad (50)$$

$$\mathrm{Net}_i(t) = \sum_{j=1}^{M} Z_j(t)\, w_{ij} + E_i, \qquad (51)$$
where $E_i$ is the (constant) external input to unit i and γ > 0 is the step size. In the preceding equations, the selection of $X_{\min}$ is constrained by $X_{\min} = -[K/(M-K)]\,X_{\max}$. The IAKWTA network converges more rapidly if the step size γ is larger. However, the order preservation of $Z_i(t)$ and $Z_j(t)$ is constrained by the condition $\gamma < 1/[-w(X_{\max} + S)]$, where S denotes the sum of all the activations except $Z_i(t)$ and $Z_j(t)$. In other words, the step size should be chosen small enough to ensure stability. Large numbers of competitors always cause S to increase and result in a small selectable step size. Thus, the convergence speed of the IAKWTA network becomes sluggish for many competitors. When the variable $X_{\min}/X_{\max}$ ratio and the connection weights w are properly chosen, the IAKWTA network [31] has been proven to be a dual of the CHKWTA network. The CHKWTA network should therefore be similar in convergence rate to the IAKWTA network. The convergence speed of the CHKWTA network is also slow for a very large number of inputs.
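An update step following Eqs. (48)-(51) can be sketched as below, assuming w = −1, $E_i = 0$, and the $X_{\min}$ constraint quoted above; the stopping test follows the completion criterion used later in the simulation section, and the clipping guards against overshoot from the discrete step.

```python
import numpy as np

def iakwta(x, k, gamma=0.02, x_max=1.0, max_iter=50000):
    """Interactive-activation KWTA update, Eqs. (48)-(51), with w = -1."""
    z = np.asarray(x, dtype=float)
    m = z.size
    x_min = -k / (m - k) * x_max                # X_min = -K/(M-K) * X_max
    for t in range(1, max_iter + 1):
        net = -(z.sum() - z)                    # Net_i = -sum_{j != i} Z_j, E_i = 0
        dz = np.where(net > 0, net * (x_max - z), net * (z - x_min))
        z = np.clip(z + gamma * dz, x_min, x_max)   # Eq. (48)
        if np.count_nonzero(z > (x_min + x_max) / 2) == k:
            return z, t                         # K units above the midpoint
    return z, max_iter

z, iters = iakwta(np.random.rand(10), k=3)
print(np.argsort(z)[-3:], iters)                # indices of the 3 winners
```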
C. COARSE-FINE MUTUAL-INHIBITION K-WINNERS-TAKE-ALL COMPETITION

Yen and Chang [32] extended the mutual-inhibition concept of the MAXNET to achieve the KWTA process. The MAXNET and other mutually inhibited WTA neural networks gradually inhibit the smallest-activation neurons first, then the second smallest, and so on. Hence, the number of winners (active neurons), M(t), is successively decreased in each mutual-inhibition process. By continuous observation of M(t), the mutual-inhibition neural network exhibits the same convergence behavior as the MAXNET as long as M(t) is greater than K. Thus, the coarse-fine KWTA (CFKWTA) [32] network possesses a weight between node i and node j of

$$w_{ij}(t) = \begin{cases} 1, & i = j, \\ -\varepsilon(t), & i \ne j \text{ and } 1 \le i, j \le M, \end{cases} \qquad (52)$$
in the coarse search stage with normal WTA processes. These normal WTA processes are completed when M(t) is exactly equal to K. In the final normal WTA process, overinhibition could result in M(t) less than K. Once the overinhibition stage is reached, the CFKWTA network should retrieve the previous state and change into the fine search stage by choosing a much smaller ε(t) to ensure convergence. Of course, the development of coarse-fine mutual-inhibition rules can be further explored by using various WTA competitions.
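The coarse-fine control logic can be wrapped around any mutual-inhibition WTA step. The backtracking rule below is our schematic rendering of the description above, not code from [32]; the initial ε and the shrink factor are arbitrary illustrative choices.

```python
import numpy as np

def cfkwta(x, k, eps=0.05, shrink=0.1, max_iter=10000):
    """Coarse-fine KWTA: inhibit with step eps until M(t) == k; on
    over-inhibition (M(t) < k), restore the previous state and retry
    with a much smaller eps (the fine search stage)."""
    z = np.asarray(x, dtype=float)
    for t in range(max_iter):
        if np.count_nonzero(z) == k:
            return z, t
        prev = z.copy()
        z = np.maximum(0.0, z - eps * (z.sum() - z))   # one WTA inhibition step
        if np.count_nonzero(z) < k:                    # over-inhibition detected
            z = prev                                   # retrieve the previous state
            eps *= shrink                              # switch to fine search
    return z, max_iter

z, iters = cfkwta([0.1, 0.9, 0.05, 0.2, 0.5], k=2)
print(np.flatnonzero(z), iters)                        # -> [1 4]
```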
D. DYNAMIC THRESHOLD-SEARCH K-WINNERS-TAKE-ALL COMPETITION

Departing from typical mutual inhibition, the KWTA network can be built in a one-layer competitive architecture that possesses a dynamic threshold with decaying variations. Figure 8 shows the configuration of the dynamic KWTA (DKWTA) network of [33].
Figure 8 Dynamic KWTA (DKWTA) network.
If K winners are desired, the DKWTA network progressively provides a proper threshold, $T_k(t)$, which should gradually approach and finally fall into the range between $X_{(K-1)}$ and $X_{(K)}$. The dynamic threshold search algorithm suggested in [33] is given by

$$T_k(t+1) = T_k(t) + \bigl[f_h\bigl(M(t) - K\bigr) - f_h\bigl(K - 1 - M(t)\bigr)\bigr]\,\Delta T_k(t), \qquad (53)$$

where the threshold search step is designed as a sequence of decaying functions given by

$$\Delta T_k(t) = (0.5)^t\,\delta \qquad (54)$$

for t = 1, 2, 3, .... Once the threshold $T_k(t)$ is in the range between the (K − 1)th and the Kth maximum input values, that is, $X_{(K-1)} < T_k(t) < X_{(K)}$, the outputs of the hard limiters exactly provide the KWTA results. The DKWTA network in this case provides K active neurons and M − K inhibited ones. Because the dynamic threshold $T_k(t)$ is updated by the bisection concept, the DKWTA network can achieve high convergence speed. Thus, the DKWTA network needs

$$i \ge \log_2 \frac{1}{X_{(K)} - X_{(K-1)}} \qquad (55)$$
iterations to achieve convergence. If the inputs are randomly selected from another distribution, the convergence depends strongly on the expected resolution of $[\mu_{K-1:M}, \mu_{K:M}]$. The DKWTA network requires

$$i \ge \log_2 \frac{1}{\mu_{K:M} - \mu_{K-1:M}} \qquad (56)$$

iterations for convergence, where $\mu_{m:M}$ denotes the expectation of the mth-order statistic of M inputs. Generally, uniformly distributed inputs are more difficult for the KWTA process than the others. Conservatively, we may say that the convergence speed of the proposed DKWTA network is logarithmic. In general, mutual inhibition possesses a convergence rate on the order of M, where M is the number of competitors. Thus, mutual-inhibition KWTA networks become infeasible for a very large number of competitors. The DKWTA network can therefore conquer this problem for suitably large inputs.
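The bisection-style threshold search of Eqs. (53) and (54) is compact in code; the starting threshold $T_k(0) = 0.5$ and δ = 0.5 below follow the choices quoted in the simulation section that follows.

```python
import numpy as np

def dkwta(x, k, t0=0.5, delta=0.5, max_iter=60):
    """Bisection-style threshold search, Eqs. (53)-(54): raise T_k while
    M(t) >= k neurons exceed it, lower it while fewer than k do."""
    x = np.asarray(x, dtype=float)
    tk = t0
    for t in range(1, max_iter + 1):
        m = int(np.count_nonzero(x > tk))     # M(t): hard-limiter outputs
        if m == k:                            # T_k lies in the desired gap
            return (x > tk).astype(int), t
        direction = 1 if m >= k else -1       # f_h(M - K) - f_h(K - 1 - M)
        tk += direction * (0.5 ** t) * delta  # decaying step, Eq. (54)
    return (x > tk).astype(int), max_iter

print(dkwta([0.1, 0.9, 0.05, 0.2, 0.5], k=2))  # -> (array([0, 1, 0, 0, 1]), 2)
```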
E. SIMULATION RESULTS

Competitors uniformly distributed in [0, 1] are randomly generated by Monte Carlo experiments to evaluate the KWTA behavior of the CHKWTA, IAKWTA, and DKWTA networks. For the DKWTA network [33], we select $T_k(0) = 0.5$ and δ = 0.5. In the IAKWTA network [31], we choose $X_{\min} = -0.428571$, $X_{\max} = +1$, γ = 0.02, and $E_i = 0$. For a fair comparison, we use the finite-difference equation to realize the CHKWTA network iteratively, and we pick a = 0.5 and C = 50 for the CHKWTA network [30]. All simulation results in Table IV depict the averages of 1000 independent runs. Table IV shows the average number of iterations required for completion of the convergence of 3WTA using the preceding KWTA networks. The completion of convergence for the IAKWTA network is defined as obtaining K neurons with maximum activation values greater than the threshold $(X_{\min} + X_{\max})/2$. Similarly, the threshold of the CHKWTA network is also chosen as $(X_{\min} + X_{\max})/2$. However, the DKWTA network must attain K neurons at $X_{\max}$ and (M − K) neurons at $X_{\min}$ for the completion of convergence.
Table IV  Average Number of Iterations Required for Completion of a 3WTA Process

KWTA networks   10 inputs   20 inputs   30 inputs   40 inputs   50 inputs   100 inputs
IAKWTA          41.20       160.75      310.49      427.90      457.51      956.08
CHKWTA          101.84      218.49      315.37      443.73      585.10      1051.30
DKWTA           3.57        4.37        5.07        5.48        5.88        6.85
From Table IV, the convergence speeds of the IAKWTA and CHKWTA networks are nearly proportional to the number of competitors. However, the DKWTA network achieves logarithmic convergence, which agrees with the theoretical result depicted in Eq. (56). Therefore, the simulation results show that the DKWTA network exhibits faster KWTA behavior than the IAKWTA and CHKWTA networks.
V. CONCLUSIONS

Many neural networks need a dominant neuron technique to select the neuron (neurons) which has (have) the maximum initial activation (activations) for various applications. In this chapter, we have investigated the design and analysis of WTA and KWTA neural networks for resolving the dominant neuron (neurons) which has (have) the maximum preference. We first introduced some continuous and iterative WTA neural networks. Theoretically, the relationship between continuous and iterative WTA neural networks is similar to the relationship between continuous-time and discrete-time systems. For example, we can readily transfer the continuous WTA neural networks to iterative ones by replacing $dZ_i/dt$ with $Z_i(t) - Z_i(t-1)$. The development of continuous WTA networks concentrates more on the issues of stability and order preservation. For iterative WTA neural networks, the mutual-inhibition processes are mostly focused on the convergence speed of the network. The MAXNET with fixed connection weights has the advantage of the simplest structure but suffers from the slowest convergence. Provided with dynamics in their connections, the later iterative WTA neural networks, such as the GEMNET and the SELECTRON, can speed the linear convergence up to logarithmic convergence. By extending the concept of mean-based thresholding, which is used in the GEMNET, a faster iterative WTA network can be designed by raising the threshold above the mean. An increase in the level of the threshold is equivalent to an increase in the strength of the mutual inhibition in iterative WTA networks. Beyond mean-based thresholding, an extra control mechanism to detect possible overinhibition must be configured along with the dynamic connection weights. Usually, this control mechanism becomes feasible if the overinhibition detection associated with the adaptive procedure is properly designed. The continuous competitive mechanism for K-winners-take-all neural networks is similar to that for WTA neural networks. However, the design of system parameters for KWTA networks is much more difficult than that for WTA networks. As for the iterative KWTA networks, we also found that the update procedure is more complex than that of iterative WTA networks, although the concept of mutual inhibition is still used. From the simulation results, the DKWTA network is better than the IAKWTA and CHKWTA networks. The DKWTA network uses the bisection concept to gradually locate the proper threshold such that the K winners remain active. Because the mutual
inhibition and the thresholding of inputs are equivalent for iterative competitions, we can conclude that the fastest dominant neuron networks conceptually seek a proper threshold function, which is, at best, $X_{(M-1)}$ for WTA and $X_{(M-K)}$ for KWTA, if the stability of the network can be ensured.
REFERENCES

[1] S. Grossberg. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biol. Cybernet. 23:121-134, 1976.
[2] G. Carpenter and S. Grossberg. Adaptive resonance theory: stable self-organization of neural recognition codes in response to arbitrary lists of input patterns. In Eighth Annual Conference of the Cognitive Science Society, pp. 45-62. Erlbaum, Hillsdale, NJ, 1986.
[3] B. Kosko. Fuzzy associative memories. In Fuzzy Expert Systems (A. Kandel, Ed.). Addison-Wesley, Reading, MA, 1987.
[4] T. Kohonen. Automatic formation of topological maps in a self-organizing system. In Proceedings of the Second Scandinavian Conference on Image Analysis (E. Oja and O. Simula, Eds.), pp. 214-220, 1981.
[5] T. Kohonen. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, New York, 1989.
[6] S. Grossberg. Nonlinear neural networks: principles, mechanisms, and architectures. Neural Networks 1:17-61, 1988.
[7] A. K. Krishnamurthy et al. Neural networks for vector quantization of speech and images. IEEE J. Select. Areas Comm. 8:1449-1457, 1990.
[8] T. Kohonen. Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43:59-69, 1982.
[9] S. Grossberg. Contour enhancement, short term memory, and constancies in reverberating neural networks. Stud. Appl. Math. 52:213-257, 1973.
[10] J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Sci. 6:205-254, 1982.
[11] C. Koch and S. Ullman. Selecting one among the many: a simple network implementing shifts in selective visual attention. Human Neurobiol. 4:219-227, 1985.
[12] A. L. Yuille and N. M. Grzywacz. A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comput. 1:334-347, 1989.
[13] J. P. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. A. Mead. Winner-take-all networks of O(N) complexity. Computer Science Department, California Institute of Technology, Technical Report Caltech-CS-TR-21-88, 1989.
[14] R. Perfetti. Winner-take-all circuit for neurocomputing applications. IEE Proc. 137:353-359, 1990.
[15] R. Coultrip, R. Granger, and G. Lynch. A cortical model of winner-take-all competition via lateral inhibition. Neural Networks 5:47-54, 1992.
[16] B. Ermentrout. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks 5:415-431, 1992.
[17] M. Lemmon and B. V. Kumar. Emulating the dynamics for a class of laterally inhibited neural networks. Neural Networks 2:193-214, 1989.
[18] J. Pankove, C. Radehaus, and K. Wagner. Winner-take-all neural net with memory. Electron. Lett. 26:349-350, 1990.
[19] L. G. Johnson and S. M. S. Jalaleddine. MOS implementation of winner-take-all network with application to content-addressable memory. Electron. Lett. 27:957-958, 1991.
Dominant Neuron Techniques
275
[20] G. Seller and J. A. Nossek. Winner-take-all cellular neural networks. IEEE Trans. Circuits Systems II Analog Digital Signal Process 40:184-190, 1993. [21] J. Choi and B. J. Sheu. A high-precision VLSI winner-take-all circuit for self-organizing neural networks. IEEE J. Solid-State Circuits 28:576-583, 1993. [22] R. P. Lippmann. An introduction to computing with neural nets. lEEEASSP Mag., 1987. [23] H. K. Kwan. One-layer feedforward neural network fast maximum/minimum determination. Electron. Lett. 1583-1585, 1992. [24] J. C. Yen and S. Chang. Improved winner-take-all neural network. Electron. Lett. 662-664,1992. [25] J. F. Yang, C. M. Chen, W. C. Wang, and J. Y Lee. A general mean based iteration winner-takeall neural network. IEEE Trans. Neural Networks 6:14-24, 1995. [26] B. W. Suter and M. Kabrisky. On a magnitude preserving iterative Maxnet algorithm. Neural Comput. 4:224-233, 1992. [27] J. F. Yang and C. M. Chen. An improved general mean based iterative winner-take-all neural network. In 1994 International Symposium on Artificial Neural Networks, pp. 429-434. [28] J. F. Yang and C. M. Chen. Winner-take-all neural networks based on higher order statistics. IEEE Trans. Neural Networks. To appear. [29] J. C. Yen, F. J. Chang, and S. Chang. A new winners-take-all architecture in artificial neural networks. IEEE Trans. Neural Networks 5:838-843, 1994. [30] E. Majani, R. Erlanson, and Y Abu-Mostafa. On the ^-winners-take-all network. Advances in Neural Information Processing Systems (D. S. Touretzky, Ed.), Vol. 1, pp. 634-642. Morgan Kaufmann, Los Altos, CA, 1989. [31] W. J. Wolfe et al. ^-winners network. IEEE Trans. Neural Networks 2:310-315, 1991. [32] J. C. Yen and S. Chang. A newfirst-A;-winnersneural network. ISANN'93 pp. D-Ol-D-06, 1993. [33] J. F. Yang and C. M. Chen. A dynamic ^-winners-take-all neural network. IEEE Trans. System Man Cybernet. B 6:523-526, 1997.
This Page Intentionally Left Blank
CMAC-Based Techniques for Adaptive Learning Control
Chun-Shin Lin
Ching-Tsan Chiang
Hyongsuk Kim
Department of Electrical Engineering, University of Missouri-Columbia, Columbia, Missouri 65211
Department of Electrical Engineering, University of Missouri-Columbia, Columbia, Missouri 65211
Department of Control and Instrumentation Engineering, Chonbuk National University, Republic of Korea
I. INTRODUCTION

In this chapter, we shall introduce the cerebellar model articulation controller (CMAC) and CMAC-based techniques, which are often used in learning control applications. The CMAC was first developed by Albus in the mid-1970s for robot manipulator control and functional approximation [1, 2]. It is an efficient table-lookup technique. Its most attractive characteristics are that learning always converges to the least-square-error result [3, 4] and that the convergence is fast. The CMAC technique did not receive much attention until the mid-1980s, when researchers started developing strong interest in neural networks. The CMAC is now considered one type of neural network, with major applications in learning control. In Section II, we will first give a brief introduction to neurocontrol. Neural networks are often used in modeling plants or inverse plants, implementing adaptive
controllers, learning to evaluate control performance, and learning nonlinear control techniques. The CMAC and other improved CMAC-based techniques, which are adequate for such usage, are introduced in Sections III to V. Section VI gives a conclusion.
II. NEURAL NETWORKS FOR LEARNING CONTROL

Adaptive control techniques are often developed for control problems with nonlinear dynamic processes and uncertainty. Conventional methods are usually based on the assumption that the plant can be linearized at an operating point so that an adaptive linear controller can be developed. Many of today's complex control problems, however, involve highly nonlinear dynamic processes and require the use of general nonlinear models for better control performance. Many control engineers and researchers feel that neural networks are good candidates for such usage. Most research in this area has considered the use of multilayer neural networks (MNNs) [5-7] for the necessary modeling in learning control. The choice of MNNs was probably because this type of neural network was among the first introduced and could perform functional approximation well. After the MNN and its associated error back-propagation learning algorithm were introduced, many other single-layer neural nets that use basis functions were extensively studied. These neural networks include radial basis function networks [8-10], functional-link networks [11, 12], polynomial neural networks [13, 14], CMACs, etc. All of these possess the capability of learning general nonlinear static mappings and are capable of handling nonlinearity and adapting to uncertainty. Later, in Sections III to V, we shall introduce the CMAC and some CMAC-based schemes. Before that, we shall give a brief introduction to the neurocontrol techniques to which CMAC-related structures can be applied. In learning control, neural networks can be used for necessary system identification, inverse plant implementation, etc. Although the theory and methodology of linear control have been well developed, neural networks are not likely to provide an advantage in the control of linear systems. The promising area is nonlinear system control.
A. NONLINEAR CONTROLLER: IDENTIFICATION OF INVERSE PLANT AND ITS USAGE

The most direct usage of a neural network in learning control is the identification of an inverse plant. The identified inverse plant model can be used as a controller. There are two typical ways to develop an inverse plant. The first one
Figure 1 Identification of an inverse plant: (a) direct method for inverse plant identification; (b) specialized method for inverse plant identification.
is the direct method, as shown in Fig. 1a. Using this method, the control signal is applied to the plant and the plant output is collected. What the inverse plant should implement is the mapping from the plant output to the plant input. The input-output pairs are used by the neural network learning algorithm for adjusting the neural network parameters in order to implement the mapping. Once learning is complete, the inverse plant can be used for control. The desired plant output is provided as the input to the neurocontroller, and the output of the controller is used to control the plant. The cascade of the inverse plant model and the plant gives an identity transformation. Another way to develop the inverse plant model relies on the use of a plant model. This specialized learning technique is illustrated in Fig. 1b. The plant
should be identified first. With the plant model, the partial derivative ∂y_error/∂u can be obtained. This partial derivative indicates the direction in which u should be changed to reduce the control output error. This piece of information can be used for adjusting the parameters in the inverse plant model. Neural networks can be used for modeling both the plant and the inverse plant. The specialized approach is based on error back propagation through the plant.
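As a minimal sketch of the direct method of Fig. 1a, the following trains a small network on the reversed input-output pairs; the scalar plant g(·), the excitation signal, and the scikit-learn regressor are illustrative assumptions (the plant must be invertible for the mapping to be well defined).

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def plant(u):                        # hypothetical invertible static plant
        return u + 0.2 * np.sin(np.pi * u)

    # 1. Apply control signals to the plant and collect the outputs.
    u = np.random.uniform(-1.0, 1.0, size=2000)
    y = plant(u)

    # 2. Train on the reversed pairs: plant output -> plant input.
    inverse_model = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=3000)
    inverse_model.fit(y.reshape(-1, 1), u)

    # 3. Use the inverse model as the controller: feed it the desired output.
    y_desired = 0.7
    u_command = inverse_model.predict([[y_desired]])[0]
    print(y_desired, plant(u_command))   # plant(inverse(y_d)) ~ y_d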
B. MODEL REFERENCE ADAPTIVE CONTROLLER

The model reference adaptive controller (MRAC) uses a reference model and tries to control the plant so that its output asymptotically tracks the reference model's output. The reference model is a linear model representing a stable system. A neural network can be used as the controller. The structure is shown in Fig. 2. The training is related to the identification of the inverse plant. However, the controller is trained so that the plant output reproduces the output of the reference model, rather than the command input.
C. LEARNING A SEQUENCE OF CONTROL ACTIONS BY BACK PROPAGATION THROUGH TIME
There are control problems that require a sequence of control actions to achieve specific goals. One example is the truck backer-upper problem [15] shown in Fig. 3a. The goal is to steer the truck backward so as to place the trailer at the dock
Figure 2 Model reference adaptive control.
Figure 3 Examples whose controllers can be developed by BTT: (a) truck backer-upper problem [15]; (b) bioreactor problem [16].
for loading and unloading. The accomplishment of the goal requires a sequence of control actions. Another example is the control of a bioreactor [16] (see Fig. 3b). The tank contains water and cells. The inflow liquid contains nutrients. The cells consume nutrients and produce more cells. The goal is to maintain the amount of cells at a desired level by controlling the inflow and outflow rate. If the amount of cells is away from the desired level, a sequence of control actions is needed to bring the amount back to the correct level. In these two examples, several steps are needed to bring the system to the desired state. Back propagation through time (BTT) [17] can be used to learn the control sequence. The BTT technique requires a plant model, for which a neural network can be used. It also requires a learning controller. The learning process starts from an
initial state. The neurocontroller generates a control action, which will change the state of the plant. The output of the plant will be sent back to the input of the neurocontroller and the next control action will be generated. The control process continues until the system fails or a specified time period elapses. The BTT learning algorithm will then use the sequence of control actions and the final plant output to adjust the neurocontroller. The BTT learning algorithm starts from the last action and the final result. The difference between the final output and the desired one is back-propagated through the plant model and then through the controller. Back propagation through the plant gives information on how the control signal should be altered to reduce the error at the plant output. Back propagation through the controller has two functions. One function is to give information on how the previous output at time k − 1 can be altered in order to reduce the error at the plant output at time k. The other function is to give information on how the parameters of the neurocontroller can be adjusted to improve the quality of the control action. The former information is then further brought to the output end of the plant for another iteration of back propagation and adjustment. The process described is the same as that shown in Fig. 4, in which the entire back propagation through the plant and the neurocontroller is unfolded along the time axis. While the plant model is assumed well trained, it will not be further adjusted during the BTT learning. Only the necessary adjustments for the neurocontroller will be memorized and accumulated. The total adjustment will be made at the end of the BTT learning procedure. Narendra [18] once demonstrated, using a nonlinear system example, that a neurocontroller developed by BTT can be more robust than a linear controller developed through linearization and analysis. His results show that the neurocontroller can control the system to bring the state back to the operating point from a farther starting point.
Figure 4 Unfolding along the time axis for explanation of BTT (after [15]).
D. NEURAL NETWORKS FOR ADAPTIVE CRITIC LEARNING

Adaptive critic learning (ACL) [19-21] is one kind of reinforcement learning. The potential usage of this scheme is for problems that have a delayed reward/penalty. The immediate reinforcement signal is not available from the environment and must be generated by the learning structure itself. Because learning must provide this internal reinforcement, the scheme is called "adaptive critic." As shown in Fig. 5, the ACL structure consists of two main modules, a control module and an evaluation module. These two modules both require learning capability and can be implemented using neural networks. The evaluation module learns to provide an adequate internal reinforcement signal, which is used for adjusting the controller. The necessity of an evaluation module is due to the need for an immediate evaluation of a control action. Barto et al. [19] first introduced the technique using the cart-pole system as an example. The cart-pole example has a pole attached to a cart with a hinge. The cart is allowed to move on a limited-length track in order to keep the pole balanced. It is assumed that, at the very beginning, the control system does not know how to evaluate a state. The capability to tell whether a state is good (and how good) is developed through learning. The only signal available to develop the evaluation capability is the "control failure" signal, which is delayed. The failure happens when the pole angle measured from the vertical is greater than a specific angle (e.g., 12°) or the cart hits either end of the track. In the learning control, if the state is getting better, earlier states are credited; otherwise, they are discredited. Although the reward/penalty is the result of a time series of actions, all the previous actions are responsible for the current result; the earlier actions, however, are assigned less responsibility. If the state is getting better, actions are reinforced.
Figure 5 Adaptive critic learning.
III. CONVENTIONAL CEREBELLAR MODEL ARTICULATION CONTROLLER

Neurocontrol techniques were introduced in the previous section. In this section, we shall introduce the CMAC, which is one of the most promising neural network structures for learning control problems. The most important advantages of the CMAC neural network are its fast learning speed and excellent convergence property. The structure is also easy to implement.
A. SCHEME

The CMAC is basically a table-lookup technique. Figure 6 shows Albus's CMAC model, which imitates the function of a cerebellum. The input space is quantized into discrete states. Several memory locations associated with a discrete state are used to store information for that state. The output for the state is the sum of the data retrieved from the associated memory locations.
Figure 6 CMAC model.
The technique basically includes three mapping relations:

S → I    Quantized state to intermediate variables
I → P    Intermediate variables to physical memory addresses
P → Y    Memory addresses to CMAC output

where S is a state input vector, I is a set of intermediate variables, P is a set of physical memory addresses, and Y is the output vector. Figure 7 is used for further explanation. This example has two input variables, s1 and s2. Each input variable is quantized into several discrete regions, called blocks. For instance, s1 can be divided into A, B, C, D, and E and s2 into a, b, c, d, and e. A, B, C, D, etc. are intermediate variables. Areas formed by the quantized regions, labeled Aa, Ab, Ba, etc., are called hypercubes. Each hypercube is assigned a memory location (or a vector for the vector output case).
Figure 7 Block division of CMAC for a two-variable example.
If the quantization for each variable is shifted by one small interval (called an element), different hypercubes will be obtained. F, G, H, I, J for s1 and f, g, h, i, j for s2 are shifted regions. Fh, Fi, Hg, etc. are new hypercubes formed from the shifted regions. In most CMAC schemes, no hypercube is formed by combining different layers, such as "A, B, C, D, E" with "f, g, h, i, j." With this kind of quantization and hypercube composition, each state is covered by Ne different hypercubes, where Ne is the number of elements in a complete block. The set of names for the blocks that cover the quantized state is the set I. In Fig. 7, for the state (7, 8), the set I is {Be, Hh, Mm, Rr}. The set P consists of addresses assigned to the Ne hypercubes, which are Be, Hh, Mm, and Rr. The mapping from I to P can be a one-to-one correspondence if the memory size is not a concern. It can also be many-to-one if the entire information for the quantized states does not need to be stored. Hashing, which is a random mapping, is used in the many-to-one mapping. The CMAC scheme can be viewed as a technique whose basis function is constant in each hypercube. Information for a quantized state is distributively stored in Ne memory locations. The output for a state is obtained as the sum of the stored contents of the hypercubes covering the state. Assume that M is the memory size. Then, using the CMAC technique, a stored datum can be mathematically expressed as

y(s) = a^T(s)w = Σ_{j=1}^{M} a_j(s) w_j,    (1)
where s is a specific state, w = [w_1, w_2, ..., w_M]^T is the vector of memory contents, and a(s) is a memory element selection vector that has Ne 1's. Note that the element a_j(s) of the vector a(s) is 1 if memory location j is used by one of the hypercubes that cover the state s. In learning, the w_j's are the parameters to be adjusted. Although the information can be expressed in the form of Eq. (1), it is actually retrieved from just a small number (Ne) of memory locations allocated to the hypercubes covering the state. The conventional CMAC uses iterative learning to create the information in the memory. For a given sample, the updating rule can be expressed as
w_new = w_old + (α/Ne) a(s)(ŷ(s) − a^T(s)w_old),    (2)

where α is a learning rate, ŷ(s) is the target function value, and ŷ(s) − a^T(s)w_old is the error for this training sample. Premultiplication by a(s) distributes the error to those memory elements used by this sample.
B. APPLICATION EXAMPLE OF CEREBELLAR MODEL ARTICULATION CONTROLLER

In Section II, the general development of a neurocontroller was discussed. In this section, an advanced application of the CMAC to robotic manipulator control, studied by Miller [22], will be introduced. The problem was to have a four-joint manipulator learn to track an object on a conveyor. A video camera was mounted on the wrist of the manipulator as a sensor. The object used in Miller's experiment was a plastic disposable razor. The centroid of the razor in the video image coordinates was (X, Y) and the size of its handle was Z. The orientation of the razor in the image was denoted by R. These four image parameters were obtained by the video camera. The goal of the control was to drive the joints so as to (1) keep the object image at the center, (2) keep the handle of the razor parallel to the horizontal video axis (i.e., X), and (3) keep the size of the object image constant (implying a constant altitude of the video camera). The learning control system assumed very little knowledge of the kinematics and inverse dynamics of the robot manipulator. The conveyor speed and object orientation were also assumed unknown. Figure 8 shows the block diagram of the learning controller. There are two CMAC memory blocks in this diagram; the forward-model CMAC estimates the object position and the inverse-model CMAC generates the feedforward control voltages for the joints. The voltages generated by the inverse-model CMAC are feedforward components to be added to those from the fixed-gain controller. This CMAC represents an inverse model for the system being controlled. Its output should compensate for the highly nonlinear system properties. Another CMAC is used
Figure 8 Block diagram of the control structure [22].
to estimate the current object location. It represents a plant model. Its output gives the estimated changes in the image parameters between two image acquisitions. The estimation of the object position is necessary because of the 280-ms delay in image processing. The object will move a significant distance during this processing period. A simple fixed-gain feedback controller is included in this structure to make sure that the end effector of the manipulator can move roughly above the conveyor for the necessary learning. The inverse-model CMAC needs to generate vp for object tracking. It has the following inputs and outputs:
4 outputs:
4 joint positions (^o) 4 object image parameters [i^ = {Xp, Yp,Zp, Rp)] from another CMAC 4 desired image parameters [i^ = (Xd, Yd,Zd, Rd)] from the trajectory generator [note that id — ip gives the desired change of image parameters, i.e., di = (dX, dY, dZ, dR) for the next control period] control voltages for four joints
To be able to generate adequate control signals, this CMAC needs to learn the nonlinear relationship between the preceding inputs and outputs over particular regions of the system state space. In the control structure, the fixed-gain error-feedback controller is in parallel with the inverse-model CMAC. The error terms are computed as the difference between the desired image parameters and the estimated values of the present image parameters. At the end of each control cycle, the drive signals from the fixed-gain controller are added to those from the CMAC for control usage. The fixed-gain control portion is described as follows:

    if (-2950 < θ5 < 2950) {
        v1 = 10*(Xd - Xp);  v2 = -10*(Yd - Yp)
    } else if (θ5 > 2950) {
        v1 = -10*(Yd - Yp); v2 = -10*(Xd - Xp)
    } else {
        v1 = 10*(Yd - Yp);  v2 = 10*(Xd - Xp)
    }
    v3 = -10*(Zd - Zp)
    v5 = 10*Rp
The image parameters, which specify the object location and orientation, are supposed to come from the image system. The time required for image processing causes the location information to be delayed. In order to have accurate "current" parameters, estimation is necessary. Another CMAC is used to learn the nonlinear relationship between the commanded voltages and the changes in image parameters. This CMAC has the following inputs and outputs:

12 input components:
    4 joint positions (θo)
    4 object image parameters from the video camera [io = (Xo, Yo, Zo, Ro)]
    4 control voltages (vo) for the four joints
4 outputs:
    4 estimated image parameter changes [dip = (dXp, dYp, dZp, dRp)]
To train the two CMACs, θo, io, dio, and vo obtained from past control cycles are used. The forward-model CMAC can be updated by
Δw2 = β(dio − f2(θo, io, vo)),    (3)
where β is a learning rate, f2 is the output of this CMAC, and dio is the actual change used as the target value. The inverse-model CMAC, which provides the feedforward voltages, can be updated by

Δw1 = β(vo − f1(θo, io, dio)),    (4)
where f1 is the output of this CMAC using θo, io, and the actual change dio as inputs, and vo is the vector of actual control voltages. In Miller's experiment, the image-processing time was 280 ms and the control cycle was selected to be 350 ms. The control procedure is summarized below.

1. Obtain θo. Take the object image.
2. Process the image to obtain the object image parameters io.
3. θo, io, and vo are provided to the forward-model CMAC to obtain the estimated image parameter change dip.
4. θo, io + dip, and id are provided to the inverse-model CMAC to retrieve vp.
5. vo = vp + (feedback control voltage) is applied. Go to step 1.
IV. ADVANCED CEREBELLAR MODEL ARTICULATION CONTROLLER-BASED TECHNIQUES

The CMAC scheme has very attractive properties with respect to learning convergence and speed, and is useful for learning control. Because the CMAC is a table-lookup technique, a model implemented by the structure cannot provide a derivative of its output. This creates difficulty and inconvenience in learning schemes that require derivative information. One example where the derivative is needed is back propagation through a model, which is used in BTT and some other learning techniques. This section further introduces two modified schemes that can better provide derivative information.
A. CEREBELLAR MODEL ARTICULATION CONTROLLER WITH WEIGHTED REGRESSION
In this section, we introduce a scheme that integrates locally weighted regression (LWR) with the CMAC addressing technique. The technique is referred to as LWR-CMAC [23, 24]. Derivatives exist except on the boundary of quantized regions. With the use of the CMAC addressing technique, data points in the input space are efficiently organized. Only those in the local area are used for regression. This limits the computational load. Because of the use of regression, the approximate function is rather smooth and precise, and the function derivative can be retrieved. Compared with the conventional CMAC, the LWR-CMAC requires the same size of memory, has a similar learning speed, but provides output differentiability and more precise output. Compared with the typical weighted regression technique, this scheme offers an efficient way to organize and utilize the collected information. The CMAC addressing technique is adopted for systematically selecting the set of neighboring points. The same method of hypercube decomposition for CMAC is used but the output is computed in a different way. The LWR-CMAC intends to have the target function value at the hypercube center stored in the memory allocated to that hypercube. To retrieve information for a given input s, the hypercubes covering s are first identified. A local regression model is formed using the data stored for these hypercubes. The output for s is computed from this model. Construction of the local regression model and the computation of the function value for a given s are introduced in the following section.
1. Local Regression Model and Output Computation

In discussing information retrieval, we assume that correct data have been stored. The weighted regression technique assigns different weights to data points at different distances. Given an input point s, the weight for a hypercube depends on the distance from s to the center of the hypercube. A small modification of Eq. (1) describes the change. The element a_j(s), which is either 0 or 1 in Eq. (1), is redefined as a function of the distance between the input and the center of hypercube j:

a_j(s) = exp(−dist²(s, c_j)/(2H²)) if hypercube j covers s, and a_j(s) = 0 otherwise,    (5)
m ^
and
Zj=ak(s)wk.
(6)
Note that a " 1 " is added in the vector of the center in Eq. (6). The locally weighted regression is to determine the vector of regression coefficients for the following equation to achieve the least square error: z = Xb,
(7)
where z = (zi, Z2, • • •, ZNeV and X = (xi, X2,..., XA^J^. The vector b can be solved as b = (X^X)-^X^z.
(8)
To prevent possible singularity, a small positive number can be added to all diagonal elements of X^X in implementation. The effect of weighting is that the error for a distant point is considered less important. With b obtained in Eq. (8), the output y(s) is computed as y(s)=s^b.
(9)
The output is continuous and differentiable except at the boundary of a quantized element. The vector b gives the derivative information.
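The retrieval path of Eqs. (5)-(9) can be sketched as follows; the function name, the ridge term (the "small positive number" mentioned above), and the toy centers and contents are illustrative, and the stored contents are assumed to have been produced by the learning algorithm described next.

    import numpy as np

    def lwr_cmac_output(s, centers, contents, H=0.2, ridge=1e-8):
        s = np.asarray(s, dtype=float)
        # Eq. (5): Gaussian weight for each covering hypercube.
        a = np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * H * H))
        # Eq. (6): weighted augmented centers and weighted contents.
        X = a[:, None] * np.hstack([centers, np.ones((len(centers), 1))])
        z = a * contents
        # Eq. (8) with the small diagonal term to prevent singularity.
        G = X.T @ X + ridge * np.eye(X.shape[1])
        b = np.linalg.solve(G, X.T @ z)
        s_aug = np.append(s, 1.0)
        y = s_aug @ b                          # Eq. (9)
        return y, b[:-1]                       # output value and dy/ds

    # Toy usage: 4 hypercube centers around s, storing values of f(x) = x1 + x2.
    centers = np.array([[0.2, 0.2], [0.2, 0.4], [0.4, 0.2], [0.4, 0.4]])
    contents = centers.sum(axis=1)
    y, dyds = lwr_cmac_output([0.3, 0.3], centers, contents)
    print(y, dyds)    # ~0.6 and ~[1, 1]: the vector b carries the derivative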
2. Learning Algorithm

The objective of learning is to determine the function values at the hypercube centers. However, these target values are not explicitly provided. Data available for training could be at any location in the input space. The memory contents must be developed from the available information. Given an s, the learning algorithm uses the current memory contents to form the local regression model and uses the model to evaluate the output for that s. The difference between the target value and the evaluated value is used to modify the vector b. Note that the new vector b is not memorized, but is used to compute a "guessed" target vector (guessed values at the hypercube centers) using Eq. (7). The guessed target values are used to update the memory contents. The mathematics used in the updating is provided in the following discussion. With the retrieval procedure given in Eqs. (7)-(9), if the target value at s is ŷ(s), the error for the input s will be err(s) = ŷ(s) − y(s). Note that ŷ(s) is given. The rule for modifying b_i should be

Δb_i = −(η/2) ∂err²(s)/∂b_i = η (∂y(s)/∂b_i) err(s) = η s_i err(s).    (10)
With the change in Eq. (10), the new y(s) will be

Σ_{i=1}^{Nv+1} (b_i + Δb_i) s_i = Σ_{i=1}^{Nv+1} b_i s_i + Σ_{i=1}^{Nv+1} η s_i² err(s),    (11)
where Nv is the number of input variables and s_{Nv+1} is the added constant 1. With η selected as 1/Σ_i s_i², the error at s would be completely corrected. However, one may use a smaller updating rate and have the following rule for modifying the vector b:

b_i ← b_i + α_b (err(s)/Σ_{k=1}^{Nv+1} s_k²) s_i,    (12)
where α_b is the learning rate. A value of 1 for α_b gives the fastest learning speed and does not affect the final precision. Note that no target values are available at the centers of the hypercubes. Thus, Eq. (7) with the new coefficient vector b_new is used to compute the guessed target vector. The amount α_m(x_j^T b_new − z_j) is added back to the corresponding memory contents. Assume that x_j and z_j are for hypercube k. Then w_k is updated by

w_k ← w_k + α_m [x_j^T b_new − z_j].    (13)
CMAC-Based Techniques
293
LWR-CMAC Learning Procedure 1. Initialize all memory contents (i.e., w's) to 0. 2. For a given input s and the target output y{s), find all hypercubes that cover the input s. 3. Use Eq. (6) to obtain the weighted center vectors and the weighted memory contents for all involved hypercubes. They are Xj and Zj for 4. 5. 6. 7. 8. 9.
Compute b = (X^Xy^X^z as in Eq. (8). Compute y(s) = s^b as in Eq. (9). Compute err(s) = j(s) — y(s). Update bt by bi ^^ bi + a^(err(s)/ Ylk=i ^k'^^i ^^ given in Eq. (12). Obtain the "guessed" target value x^bnew Update the memory content by Wk ^^ Wk -\- c^m[x^bnew — Zj] as given in Eq. (13). 10. Go to step 2 if the learning result is not satisfactory. Compared to CMAC, the computation in Eq. (8) is the major extra computation. In that equation, X is an (A^^ + 1) x A^^ matrix and X^X is an (Ny + 1) X (Ny + 1) matrix, where Ny is the number of variables. For the twovariable case, the inversion is for a 3 x 3 matrix. In learning, a similar computation is also required. When the CMAC addressing technique is adopted, the number of parameters for this scheme remains the same as that of the conventional CMAC. For the A^^block and A/"^-element case, the number of parameters is the same as the number of hypercubes and is equal to (Nb)^^ x Ne. 3. Example: Functional Approximation The example is to approximate the function | ( ^ i , O2) = sin Oi cos O2. Figures 9 and 10 show plots of g, g, dg/dOi, and dg/dO\ from a learning experiment with H = 0.2, Qf^ = 1, and am being initialized to 1 and reduced to 0.1. Figure 9a and b shows the three-dimensional plots of g and g. Figure 10a and b shows plots of the function and its derivative versus Oi while O2 is set to 0.
B. CEREBELLAR MODEL ARTICULATION CONTROLLER WITH GENERAL BASIS FUNCTIONS
Another approach is to use nonconstant basis functions. The cerebellar model articulation controller (CMAC) can be viewed as a basis function network (BFN). The conventional CMAC uses local constant basis functions and has as
Figure 9 Three-dimensional plots of g and ĝ, g(θ1, θ2) = sin θ1 cos θ2: (a) target output g; (b) LWR-CMAC output ĝ [23].
Figure 10 Plots with θ2 set to 0: (a) function plotted versus θ1 (with θ2 set to 0); (b) derivative with respect to θ1 (with θ2 set to 0) [23].
its output a constant in each quantized state. Derivative information is not preserved. If the constant basis functions are replaced by nonconstant differentiable basis functions, the derivative can be stored in the structure as well. Spline functions [25] and Gaussian functions are possible selections for the basis functions. In this section, the generalized scheme that uses general basis functions [4] will be introduced. The conventional CMAC is a special case of the generalized technique.

1. Scheme

The generalized technique uses general basis functions and is called the GBF-CMAC. The basis function f_i(·) is associated with the ith hypercube. All basis functions have bounded values inside their hypercube areas. Note that if f_i(·) is a constant inside the area covered by the ith hypercube and zero outside, the generalized scheme becomes the conventional CMAC. To determine an output value for a given input, a linear combination of the basis functions associated with the involved hypercubes is used. A stored datum y(s) for the input s can be mathematically expressed as
y(s) = a^T(s)w(s) = [a_1(s) a_2(s) ··· a_{Nh}(s)] [w_1(s), w_2(s), ..., w_{Nh}(s)]^T = Σ_{i=1}^{Nh} a_i(s) w_i(s),    (14)

where a(s) is a basis function selection vector that has Ne 1's and w(s) is a vector with the ith element w_i(s) = v_i f_i(s). Note that v_i is a weight to be obtained through learning. With w_i(s) = v_i f_i(s), w(s) can be rewritten as

w(s) = [v_1 f_1(s), v_2 f_2(s), ..., v_{Nh} f_{Nh}(s)]^T = diag(f_1(s), f_2(s), ..., f_{Nh}(s)) [v_1, v_2, ..., v_{Nh}]^T = F(s)v.    (15)

Thus, Eq. (14) becomes

y(s) = a^T(s)F(s)v.    (16)
Use ŷ(s) to denote the target output value for the input s. With the energy function selected as

E = (1/2)(ŷ(s) − y(s))² = (1/2)(ŷ(s) − a^T(s)w(s))²,    (17)

the updated amount for v_k can be set equal to

Δv_k = −(α/Ne) (∂E/∂w_k(s)) (∂w_k(s)/∂v_k) = (α/Ne)(ŷ(s) − a^T(s)w(s)) a_k(s) f_k(s),    (18)
where α/Ne is the learning rate.

2. Examples of Cerebellar Model Articulation Controller with Gaussian Basis Functions

As pointed out previously, the CMAC with differentiable basis functions can provide derivative information. In this section, we examine this capability with the use of Gaussian functions. The basis function is f_k(s) = Π_{j=1}^{Nv} φ_kj(s_j) with

φ_kj(s_j) = exp{−(1/2)(s_j − μ_kj)²/σ_kj²},    (19)

where μ_kj is the mean and σ_kj is the variance. Nv is the number of variables in the target function. Consequently, w_k(s) = v_k Π_{j=1}^{Nv} φ_kj(s_j). From Eq. (18), Δv_k can be further derived as

Δv_k = (α/Ne)(ŷ(s) − a^T(s)w(s)) a_k(s) Π_{j=1}^{Nv} φ_kj(s_j).    (20)
The function g(θ1, θ2) = sin(θ1π) cos(θ2π) used previously shows the performance of the GBF-CMAC. Figures 11 and 12 show plots of g, ĝ, ∂g/∂θ1, and ∂ĝ/∂θ1 from a learning experiment with the learning parameter α initialized to 1 and later reduced to 0.1. g(s) and ĝ(s) are the actual function (target) and the computed output from the learning module. Figure 11a and b shows the three-dimensional plots of g and ĝ. Figure 12a and b shows plots of the function and its derivative versus θ1 while θ2 is set to 0. This GBF-CMAC has good accuracy while the learning speed is very close to that of the conventional CMAC. Computation in this technique only involves the
Figure 11 Three-dimensional plots of g(θ1, θ2) = sin(θ1π)cos(θ2π) and ĝ: (a) target output g; (b) GBF-CMAC output ĝ [4].
Figure 12 Plots with θ2 set to 0: (a) function plotted versus θ1 with θ2 set to 0; (b) derivative with respect to θ1 with θ2 set to 0 [4].
basis functions used by the current state. Therefore, the computational load is much lower than that of general global basis function networks. One advantage of this technique is that it is able to provide output derivative information with respect to the input if differentiable basis functions are used.
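A one-dimensional sketch of the GBF-CMAC update of Eqs. (19) and (20) follows; the layer/block bookkeeping mirrors the conventional CMAC, and the widths, shifts, and sizes are illustrative assumptions.

    import numpy as np

    # Each hypercube k carries a Gaussian basis phi_k centered at mu_k;
    # only the weights v_k of the ne hypercubes covering the current
    # input are evaluated and updated.
    nb, ne, alpha = 8, 4, 1.0
    width = 1.0 / nb
    mus = {}                     # hypercube -> Gaussian center mu_k
    v = {}                       # hypercube -> weight v_k

    def covering(s):             # the ne hypercubes (layer, block) covering s
        cell = int(s * nb * ne)
        return [(L, (cell - L) // ne) for L in range(ne)]

    def basis(s, h):
        L, blk = h
        mu = mus.setdefault(h, (blk + 0.5) * width + L / (nb * ne))
        return np.exp(-0.5 * ((s - mu) / width) ** 2)            # Eq. (19)

    def predict(s):
        return sum(v.get(h, 0.0) * basis(s, h) for h in covering(s))

    def train(s, y_target):
        err = y_target - predict(s)
        for h in covering(s):
            v[h] = v.get(h, 0.0) + (alpha / ne) * err * basis(s, h)  # Eq. (20)

    rng = np.random.default_rng(2)
    for _ in range(20000):
        s = rng.random()
        train(s, np.sin(np.pi * s))
    print(predict(0.25), np.sin(np.pi * 0.25))   # smooth, differentiable fit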
V. STRUCTURE COMPOSED OF SMALL CEREBELLAR MODEL ARTICULATION CONTROLLERS

In this section, we shall further discuss one structure for high-dimensional learning control. MNNs and RBFNs are widely used neural networks for static mapping. However, they both have some drawbacks and limitations. It is known that MNNs often encounter difficulty in learning. For a slightly complicated mapping, it is hard to predict how long the learning will take and whether the learning will converge to an acceptable result. Another type of neural network, the RBFN, often uses the Gaussian function as the basis function. Because a Gaussian function provides fitting in a local area, learning convergence is less of a problem than in MNNs. However, the number of basis functions may become enormous for problems with a high-dimensional input space. The conventional CMAC has a similar problem in high-dimensional mapping because it is a table-lookup technique. One suggestion for alleviating the problem is to use a structure composed of small CMACs [26].
A. NEURAL NETWORK STRUCTURE WITH SMALL CEREBELLAR MODEL ARTICULATION CONTROLLERS

Figure 13 shows one example of the structure with small CMACs. The network output ONET(·) is the sum of the outputs from a set of submodules. Each CMAC in a submodule has a subset of the system inputs as its inputs, and each submodule implements a basis function. The number of inputs for each CMAC can be selected by the designer (two inputs in the example in Fig. 13). Let Nb be the number of blocks in the decomposition of each variable, Ne be the number of elements in each complete block, and Nv be the number of input variables. For Ne = 8, Nb = 8, and Nv = 6, if one uses two inputs for each CMAC in the submodules, the memory size will be 512 (= 8 × 8²) for each CMAC and 3 × 512 for a submodule. Using a single conventional CMAC, the memory size would be 2,097,152 (= 8 × 8⁶). For this memory size, one could have up to 1365 submodules; most problems will need just a small portion of that. Each submodule can implement a self-generated basis function. One may use the same input
Figure 13 Small CMAC-based neural network [26].
arrangement for all submodules, that is, x1 and x2 as inputs of the first CMAC, x3 and x4 for the second CMAC, and x5 and x6 for the third CMAC. Such an input arrangement can be denoted by {(1, 2), (3, 4), (5, 6)}. One can also use different kinds of combinations such as {(1, 2), (3, 4), (5, 6)}, {(1, 3), (2, 4), (5, 6)}, {(1, 4), (2, 3), (5, 6)}, etc. for different submodules. The combinations can be repeated if necessary. One question that arises is whether the structure, given enough submodules, can implement a general mapping. Let us use the structure with two-input CMACs for explanation. It is noted that a two-input CMAC can well implement a two-dimensional Gaussian function. Thus, the output of a submodule, which is a product of several two-dimensional Gaussian functions, can implement a high-dimensional Gaussian function. Consequently, the proposed structure can at least implement what a Gaussian function network can implement. The structure is more powerful than typical basis function networks because the "basis function" is self-generated and need not be limited to a Gaussian or any other specific type. The structure with an adequate number of blocks, suitable resolution, and enough submodules will be able to approximate any smooth function well.
B. LEARNING RULES

Contents of the CMAC memory must be obtained through learning. In learning, the output error can be distributed to all submodules. Let ε denote the error distributed to each submodule and Z_i indicate the output of submodule i. Then
Z_i = O_{i1} O_{i2} ··· O_{im},    (21)
where O_{ij} is the output of the jth CMAC in submodule i, and m is the number of CMACs in this submodule.
The gradient descent learning rule can be derived and used. The partial derivative of Z_i with respect to a CMAC output O_{ij} is

∂Z_i/∂O_{ij} = Π_{k≠j} O_{ik}.    (22)

The amount to be updated in the jth CMAC should be

ΔO_{ij} = −α Π_{k≠j} O_{ik} ε,    (23)

where α is the learning rate. The updated amount is distributed to all involved hypercubes in the same way as for updating a conventional CMAC, as described in Section III.A. The following summarizes a basic learning procedure in which the neural network structure is fixed:

Basic Learning Procedure with Predetermined Neural Network Structure
1. Initialize all CMACs with random memory contents between −δ and δ.
2. Obtain a training sample.
3. Compute the overall output for this sample and calculate the error.
4. Update all CMACs using Eq. (23).
5. If the error for the past N samples is acceptable, then stop. Otherwise, go to step 2.
This procedure uses the fixed neural network. Actually, one can increase the number of submodules during the training if it is necessary. The following summarizes the learning procedure in which the neural network is allowed to grow: Learning with Neural Network
Growing
1. Initialize the neural network with one submodule (note that one can start with more than one submodule). Select all learning parameters. 2. Initialize all CMACs of the new submodule with random memory contents between —8 and 8. 3. Obtain a training sample. 4. Compute the overall output for this sample and calculate the error. 5. Use ^ % of the error to update the new submodule and (100 — ^ ) / ^ % of the error for each of the old submodules, where K is the current number of old submodules. 6. Update all CMACs using Eq. (23). 7. If the error for the past A^i samples is acceptable, then stop.
8. If the improvement over the last N2 samples is insignificant, then add one more submodule to the neural network and go to step 2. Otherwise, go to step 3.

The value Q in step 5 specifies the weight given to updating the memory contents of the new submodule relative to the old submodules. Setting Q equal to 100% disables learning in all old submodules. The use of unequal back-propagated errors helps fully utilize the new submodule to speed up the learning. One reasonable suggestion is to set Q equal to 50%. This places a large updating weight on the new submodule during the training. Note that at the very beginning, without any old submodules, the error should be evenly distributed among the initial submodules.
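The following sketch assembles the fixed structure of Fig. 13 from two-input conventional CMACs and trains it with the product rule of Eq. (23); the input arrangement {(1, 2), (3, 4)}, the initialization range, the learning rate, and the target function are illustrative assumptions rather than a tuned implementation.

    import numpy as np

    class SmallCMAC:
        # Two-input conventional CMAC used as a building block.
        def __init__(self, nb=8, ne=8, seed=0):
            self.nb, self.ne, self.w = nb, ne, {}
            self.rng = np.random.default_rng(seed)
        def cubes(self, s):
            c = np.floor(np.asarray(s) * self.nb * self.ne).astype(int)
            return [(L, *((c - L) // self.ne)) for L in range(self.ne)]
        def _get(self, h):       # random initial contents, as in steps 1-2 above
            if h not in self.w:
                self.w[h] = self.rng.uniform(-0.3, 0.3)
            return self.w[h]
        def out(self, s):
            return sum(self._get(h) for h in self.cubes(s))
        def add(self, s, delta): # distribute an update over the ne hypercubes
            for h in self.cubes(s):
                self.w[h] = self._get(h) + delta / self.ne

    pairs = [[0, 1], [2, 3]]     # input arrangement {(1, 2), (3, 4)}
    subs = [[SmallCMAC(seed=i), SmallCMAC(seed=i + 10)] for i in range(3)]
    alpha = 0.02
    target = lambda x: np.sin(np.pi * x[0]) * np.cos(np.pi * x[1]) * x[2] * x[3]

    rng = np.random.default_rng(42)
    for _ in range(30000):
        x = rng.random(4)
        outs = [[c.out(x[p]) for c, p in zip(sub, pairs)] for sub in subs]
        y = sum(o1 * o2 for o1, o2 in outs)        # Eq. (21), summed over i
        eps = (y - target(x)) / len(subs)          # error share per submodule
        for sub, (o1, o2) in zip(subs, outs):      # Eq. (23): product rule
            sub[0].add(x[pairs[0]], -alpha * o2 * eps)
            sub[1].add(x[pairs[1]], -alpha * o1 * eps)

    x = np.array([0.3, 0.6, 0.5, 0.8])
    outs = [[c.out(x[p]) for c, p in zip(sub, pairs)] for sub in subs]
    print(sum(o1 * o2 for o1, o2 in outs), target(x))

Note the random initialization: with all memory contents at zero, every product Z_i and hence every gradient in Eq. (23) would vanish, which is why steps 1 and 2 of the procedures above start from random contents.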
C. EXAMPLE: FUNCTION APPROXIMATION

This example is to approximate the following four-variable nonlinear function:

g(x1, x2, x3, x4) = x1 + sin(x1π) cos(x2π) sin(x3π)[sin²(x4π) − 1].    (24)

The fixed structure with three submodules is used. Each submodule consists of two two-input CMACs, and each CMAC uses eight blocks for each input variable and has eight elements in each complete block (i.e., Nb = 8 and Ne = 8). This neural network requires a memory size of 3K (= number of submodules × CMACs per submodule × Ne × Nb²). The learning rate α was set to 0.02. The target function and the neural network output are plotted in Fig. 14. Since there are four inputs to plot, two of them are fixed. Figure 14a shows the target function g with x2 and x3 both set to 0.25, and Fig. 15a shows the function with
Figure 14 (a) Target function g(x1, 0.25, 0.25, x4) and (b) network output ONET(x1, 0.25, 0.25, x4) (with x2 = 0.25 and x3 = 0.25).
Figure 15 (a) Target function g(0.25, 0.25, x3, x4) and (b) network output ONET(0.25, 0.25, x3, x4) (with x1 = 0.25 and x2 = 0.25).
x1 and x2 both set to 0.25. The corresponding plots for the output of the trained neural network are given in Figs. 14b and 15b.
VI. CONCLUSIONS

The capability of modeling nonlinear functions/relations makes the application of neural networks to nonlinear system learning control promising. Neural networks are often used to model plants and inverse plants. In many learning control methods, a neural network may be trained and used to provide some derivative information. This happens in back propagation through the plant model. Multilayer neural networks have been the most popular type of structure referred to in the neurocontrol literature. One often encountered problem is the difficulty of learning, owing to possible local minima in the energy function used. The long training time is often unacceptable in an application. This chapter has focused on another type of neural network structure, which is believed to have advantages with respect to these issues. Four different types of CMAC-related techniques have been introduced in this chapter. The first one is the conventional CMAC. It has very good learning speed and its learning always converges. One disadvantage is that the derivative information, which is needed in many learning control techniques, is not directly available. The second one is the integration of the CMAC with weighted regression. The derivative information is created with the use of regression in this scheme. The third one uses general basis functions to replace the constant basis used in the conventional CMAC and can also generate derivative information. The last scheme aims at applications to learning control problems with a high-dimensional input space. It is composed of small CMACs and has a
sum-of-products structure. Although the CMAC is a table-lookup technique, the use of small CMACs relaxes the potentially large memory requirement.
REFERENCES
[1] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). J. Dynamic Systems Measurement Control 220-227, 1975.
[2] J. S. Albus. Data storage in the cerebellar model articulation controller (CMAC). J. Dynamic Systems Measurement Control 220-227, 1975.
[3] P. C. Parks and J. Militzer. Convergence properties of associative memory storage for learning control systems. Automat. Remote Control 50:254-286, 1989.
[4] C. T. Chiang and C. S. Lin. CMAC with general basis functions. Neural Networks 9:1199-1211, 1996.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representation by error propagation. In Parallel Distributed Processing: Exploration in the Microstructure of Cognition (D. E. Rumelhart and J. L. McClelland, Eds.), Vol. 1. MIT Press, Cambridge, MA, 1986.
[6] R. Hecht-Nielsen. Theory of the back-propagation neural network. In Proceedings of the International Joint Conference on Neural Networks, Vol. 1, pp. 593-611, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White. Multi-layer feedforward networks are universal approximators. Neural Networks 2:359-366, 1989.
[8] S. Lee and R. M. Kil. A Gaussian potential function network with hierarchically self-organizing learning. Neural Networks 4:207-224, 1991.
[9] Q. Zhang and A. Benveniste. Wavelet network. IEEE Trans. Neural Networks 3:889-898, 1992.
[10] T. D. Sanger. A tree-structured adaptive network for function approximation in high-dimensional space. IEEE Trans. Neural Networks 2:285-293, 1991.
[11] Y. H. Pao, G. H. Park, and D. J. Sobajic. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6:163-180, 1994.
[12] B. Igelnik and Y.-H. Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans. Neural Networks 6, 1995.
[13] C. L. Giles and T. Maxwell. Learning, invariance, and generalization in a high-order neural network. Appl. Optics 26:4972-4978, 1987.
[14] J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Sci. 6:205-254, 1982.
[15] D. N. Nguyen and B. Widrow. Neural networks for self-learning control systems. IEEE Control Mag. 18-23, 1990.
[16] L. H. Ungar. A bioreactor benchmark for adaptive network-based process control. In Neural Networks for Control (W. T. Miller III, R. S. Sutton, and P. J. Werbos, Eds.), pp. 388-402. MIT Press, Cambridge, MA, 1990.
[17] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proc. IEEE 78:1550-1560, 1990.
[18] A. S. Levin and K. S. Narendra. Control of nonlinear dynamical systems using neural networks: controllability and stabilization. IEEE Trans. Neural Networks 4:192-206, 1993.
[19] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Systems Man Cybernet. 13:834-846, 1983.
[20] C. W. Anderson. Learning to control an inverted pendulum using neural networks. IEEE Control Mag. 31-37, 1989.
[21] C. S. Lin and H. Kim. CMAC-based adaptive critic self-learning control. IEEE Trans. Neural Networks 2:530-533, 1991.
[22] W. T. Miller III. Real-time application of neural networks for sensor-based control of robots with vision. IEEE Trans. Systems Man Cybernet. 19:825-831, 1989.
[23] C. S. Lin and C. T. Chiang. Integration of CMAC and weighted regression for efficient learning and output differentiability. IEEE Trans. Systems Man Cybernet., to appear.
[24] C. T. Chiang. CMAC addressing technique based learning structure. Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Missouri-Columbia, Columbia, Missouri, 1994.
[25] S. H. Lane, D. A. Handelman, and J. J. Gelfand. Theory and development of high-order CMAC neural networks. IEEE Control Mag. 23-30, 1992.
[26] C. S. Lin and C. K. Li. A new neural network structure composed of small CMACs. In Proceedings of the 1996 IEEE International Conference on Neural Networks, Washington, DC, Vol. 3, pp. 1777-1783, 1996.
Information Dynamics and Neural Techniques for Data Analysis Gustavo Deco Corporate Research and Development Siemens AG 81739 Munich, Germany
I. INTRODUCTION

One of the most essential problems in the fields of neural networks and nonlinear dynamics is the extraction and characterization of the statistical structure underlying an observed set of data. In the context of neural networks, the problem is posed as the data-based learning of a parametric form of the statistical dependences behind the data. In this parametric formulation, the goal is to model the observed process. On the other hand, an a priori requirement for the extraction of statistical structures is the detection of their existence and their characterization. For time series, for example, it is useful to know if the dynamics that originate the observed values are stationary or nonstationary, if the time series is deterministic or stochastic, and to be able to distinguish between white noise, colored noise, Markov processes, and chaotic and nonchaotic determinism. The detection and characterization of such dependences should therefore first be performed in a nonparametric fashion in order to be able to model the process a posteriori in a parametric form. The basic problem is of a statistical nature, and therefore information theory offers the ideal theoretical framework for a mathematical formulation. The aim of this chapter is to review a detailed and unifying formulation of the theory of parametric and nonparametric statistical structure extraction based on an information-theoretic approach developed by the author in recent years. The formulation presented in this chapter establishes a consistent theoretical framework for the problem of discovering knowledge behind empirical data. This approach
offers great potential for achieving optimal solutions of complex real-world problems, as illustrated in the different sections of this chapter by simulations and experiments. The experiments utilize biological data such as electroencephalogram (EEG) signals, financial data such as the German Stock Index (DAX) time series, and physical data such as sunspot data. The chapter is organized as follows. Section II concentrates on the problem of parametric extraction of the statistical structure underlying the measured data. We present the solution of this problem in the framework of information theory and the theory of neural networks. Feature extraction is one of the principal goals of unsupervised learning. In biological systems, this is the first step of the cognitive mechanism that enables the processing of higher-order cognitive functions. Therefore, the parametric extraction of a statistical structure in the data can be posed as an information-theoretic approach to the problem of unsupervised learning. The concept of feature extraction is defined as independent component analysis (ICA), where independence is formulated in the statistical sense. An information-theoretic-based formulation of ICA is presented for the case of arbitrary input probability distributions and arbitrary, possibly nonlinear, input-output transformations. We apply this architecture to the problem of modeling time series by learning statistical correlations between the past and present elements of the series in an unsupervised fashion. We also discuss in this parametric framework the problem of learning and generalization from examples by using neural networks and frameworks from statistics. In the statistical approach, an ensemble of neural networks is used to address the problem of generalization from a finite number of noisy training examples. The ensemble treatment of neural networks assumes that the final model is built by an integration of singular models weighted with the appropriate probability distribution. Gibbs' distribution is obtained from the maximum entropy principle or, alternatively, by imposing the equivalence of the minimum error and maximum likelihood criteria for training the network. This theory is used to obtain a general formulation of a statistical approach to unsupervised learning. The duality between supervised and unsupervised learning is also discussed. Section III handles the problem of nonparametric statistical structure extraction from an information-theoretic point of view. We restrict ourselves to the presentation of time series, but the theory is valid in general. We present a nonparametric formalism for the extraction of statistical structures. We introduce a nonparametric cumulant-based statistical approach for detecting linear and nonlinear statistical dependences in nonstationary time series. The statistical dependence is detected by measuring the predictability, which tests the null hypothesis of statistical independence, expressed in Fourier space, by the surrogate method. Therefore, the predictability is defined as a higher-order
cumulant-based significance discriminating between the original data and a set of scrambled surrogate data which corresponds to the null hypothesis of a noncausal relationship between past and present, that is, of a random process with statistical properties equivalent to those of the original time series. In this formulation, linear and nonlinear as well as Gaussian and non-Gaussian temporal dependences can be detected in time series. The theory introduced herein is illustrated by artificial and real-world stochastic as well as chaotic time series. In Section IV, we extend the nonparametric and test-statistics-based theory presented in Section III to the concept of information flow, in such a way that we can not only answer the question of the existence of structure but also characterize the dynamics that originate the temporal structure. This section introduces an information-theoretic-based concept for the characterization of the information flow in systems in the framework of symbolic dynamics. The information flow characterizes the loss of information about the initial conditions, that is, the decay of statistical correlations (i.e., nonlinear and non-Gaussian) between the entire past and a point p steps into the future, as a function of p. In the case where the partition that generates the symbolic dynamics is finite, the information loss is measured by a mutual information. The profiles in p of the mutual information describe the short- and long-range forecasting possibilities for the given partition resolution. The information loss provides us with a mean measure of sensitivity to the initial conditions, which is the main characteristic of chaotic dynamical systems. Therefore, it is more relevant to study the evolution of the information loss for the case of infinitesimal partitions, which characterize the intrinsic behavior of the dynamics on an extremely fine scale. In this case, the information flow is characterized by the evolution of a conditional entropy which generalizes the Kolmogorov-Sinai entropy to the case of observing the uncertainty more than one step ahead into the future. This definition gives a rigorous definition of chaotic systems in the language of information theory. A dynamical system is chaotic if the generalized Kolmogorov-Sinai entropy for a point p steps into the future increases linearly with p, which is equivalent to saying that the conditional mutual information is always constant and equal to the Kolmogorov-Sinai entropy.
II. STATISTICAL STRUCTURE EXTRACTION: PARAMETRIC FORMULATION BY UNSUPERVISED NEURAL LEARNING

Parametric extraction of statistical dependencies in a given data set can be formulated as an unsupervised learning task. In biological systems it is the first step of the cognitive mechanism that enables the processing of higher-order cognitive functions. The idea of the nervous system and brain being regulated by
an economy principle has been well known since the publication of the pioneering work of Zipf [1] and the ideas of Attneave [2] about information processing in visual perception. In the neural network community, these ideas were introduced by the important papers of Barlow [3, 4], where the author presented the connectionist model of unsupervised learning under the perspective of redundancy reduction. Barlow describes the process of cognition as a preprocessing of the sensorial information performed by the nervous system in order to extract the statistically relevant and independent features of the inputs without losing information. This means that the brain should statistically decorrelate the extracted information. As a learning strategy, Barlow formulated the principle of redundancy reduction. The term redundancy refers to the statistical dependence between the components involved, and therefore the principle of learning by redundancy reduction tries to find a transformation such that the transformed components are as statistically independent as possible. This kind of learning is called factorial learning. In the binary case, factorial learning leads to factorial codes, that is, codes with no redundancy [5]. This section formulates the problem of independent component analysis (ICA) as a search for an information-preserving mapping (linear or nonlinear) which results in statistically independent output components, that is, as a factorial learning paradigm. We present an information-theoretic-based formulation of ICA for the case of arbitrary input probability distributions and arbitrary, possibly nonlinear, input-output transformations. Two general criteria, the optimization of which leads to the desired solution of ICA, are defined under the assumption of invertibility of the input-output map. The first criterion establishes the connection between statistical independence and the properties of the cumulant expansion of the output joint probability. The second criterion formulates a measure of statistical dependence as the mutual information between the individual elements of the output vector variable. In the linear case, Deco and Obradovic [6] apply Barlow's principle and derive a learning rule that performs principal component analysis (PCA). In this chapter, PCA is derived as a linear orthogonal transformation which conserves the transmitted entropy and which minimizes the mutual information between the outputs in order to decorrelate them. This point of view is an alternative to the "infomax" principle presented by Linsker [7-9], who essentially proposes a learning rule which maximizes the mutual information between the input and output layers. A more general formulation of linear ICA for the case of non-Gaussian and nonorthogonally correlated inputs was developed by Obradovic and Deco [10]. Some nonlinear extensions of ICA for statistical decorrelation of sensorial input signals were recently introduced. The information-theoretic formulation of the principle of redundancy minimization in the nonlinear and non-Gaussian case was devised for different neural architectures in the papers of Deco and Brauer [11, 12], Deco and Schürmann [13, 14], and Parra et al. [15, 16].
In this section, we review these approaches. First, however, we summarize the concepts of information theory which are required for the mathematical formulation of our theory.
A. BASIC CONCEPTS OF INFORMATION THEORY

The concept of entropy is introduced as a measure of the uncertainty of a random variable. Let us consider the case of discrete random variables, which will be denoted by X. The discrete random variable X takes discrete values x from an alphabet \mathcal{X}. Let us define a probability p(x) for all x \in \mathcal{X}. Then a measure of the uncertainty of the probability distribution p(x) is given by the entropy H(X) as defined by Shannon [17]:

H(X) = -\sum_{x} p(x)\log(p(x)) = E\left(\log\frac{1}{p(X)}\right),    (1)
where E(\cdot) denotes the expectation operator. The entropy provides a measure of the sharpness of the distribution, which is nothing else than the degree of uncertainty corresponding to the random variable X. If the entropy is equal to zero, that is, H(X) = 0, the variable X describes a deterministic process. In other words, zero entropy implies that there is absolute certainty that only one outcome of X is possible. On the other hand, the maximum value of H(X) is reached when the distribution p(x) is uniform, that is, when the uncertainty about the random variable X is maximal. Let X, Y be a pair of random variables over the discrete alphabets \mathcal{X} and \mathcal{Y}, respectively. The joint probability will be denoted by p(x, y) and the conditional probability of y for a given outcome x by p(y|x). Then the joint entropy H(X, Y) is defined as

H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\log(p(x, y)) = E\left(\log\frac{1}{p(X, Y)}\right).    (2)
The conditional entropy H(Y|X),

H(Y|X) = \sum_{x} p(x) H(Y|X = x) = -\sum_{x}\sum_{y} p(x, y)\log(p(y|x)),    (3)
is defined as the average of the degree of uncertainty about Y over all concrete outcomes of X. An important problem often encountered in statistics is to define a measure of the difference between two distributions. The Kullback-Leibler entropy K(p, q), also called the relative entropy or cross-entropy, is a measure of the "distance"
between two distributions p(x) and q(x), and it is defined as (see Kullback and Leibler [18])

K(p, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}.    (4)
The relative entropy is not a true distance because it is not symmetric, that is, K(p, q) \neq K(q, p). However, both expressions can be interpreted as quasidistances which are always positive and equal to zero if and only if p(x) = q(x). To measure the statistical independence between two random variables X and Y with associated probability distributions p(x) and p(y), respectively, it is useful to introduce the notion of mutual information I(X; Y). The latter is defined as the Kullback-Leibler distance between the joint probability and the factorized one, and it is equal to zero if and only if X and Y are independent. Following Shannon [17], the mutual information between X and Y is defined as

I(X; Y) = K(p(x, y), p(x)p(y)) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}.    (5)
The mutual information is symmetric, that is, I(X; Y) = I(Y; X), and I(X; X) = H(X). Thus, the mutual information is a measure of the amount of information that Y conveys about X (or vice versa); that is, it provides a measure of the statistical correlation between X and Y. When X and Y are defined as the input and output, respectively, of a stochastic channel, I(X; Y) is the amount of transmitted information in the stochastic channel. The definition of entropy is extended to the case of a continuous random variable X described by a probability density function f(x). The entropy in the continuous case is defined as (see Cover [19])

h(X) = -\int_{A} dx\, f(x)\log(f(x)),    (6)
where A is the support of the continuous variable x. The relation between the discrete definition of entropy and the previous definition is clearly understandable if the density function f(x) is Riemann integrable:

H(X^{D}) + \log(D) \to h(X),  D \to 0,    (7)

where X^{D} is a discrete random variable generated by partitioning the continuous random variable X into bins of length D. Equation (7) implies that the entropy of an n-bit quantization of a continuous random variable X is approximately h(X) + n.
The Kullback-Leibler entropy of two density distributions f(x) and g(x) can be defined in a similar fashion [19] as
K(f, g) = \int dx\, f(x)\log\left(\frac{f(x)}{g(x)}\right),    (8)
whereas the mutual information between two continuous random variables X and Y with associated joint density distribution f(x,y) is given by
I(X; Y) = \int dx\, dy\, f(x, y)\log\frac{f(x, y)}{f(x)f(y)}.    (9)
The properties of K(f, g) and I(X; Y) are the same as in the discrete case. In particular, in the limit D \to 0, the mutual information of the discretizations converges to the mutual information of the continuous distributions, that is,

I(X^{D}; Y^{D}) \to I(X; Y),  D \to 0.    (10)
Owing to the fact that the theorems and inequalities about entropy hold in general for both discrete and continuous variables, we will not distinguish between these two cases and we will use the notation introduced for the discrete case as the common nomenclature. Note that the only difference is that in the continuous case the entropy can be negative. Fortunately, this fact usually has no influence on the interpretation and on the use of the relations involving relative entropy and mutual information.
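For concreteness, the discrete quantities defined above can be computed directly from probability tables, as in the following minimal sketch (an illustration added here, not part of the original development); the 2 x 2 joint distribution used at the end is an arbitrary example.

# Minimal sketch: discrete entropy, Kullback-Leibler entropy, and mutual
# information, following Eqs. (1), (4), and (5); natural logarithms are used.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                           # 0 log 0 is taken as 0
    return -np.sum(p * np.log(p))

def kullback_leibler(p, q):
    # assumes q > 0 wherever p > 0
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(pxy):
    pxy = np.asarray(pxy, dtype=float)     # joint table p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return kullback_leibler(pxy, np.outer(px, py))   # Eq. (5)

pxy = np.array([[0.4, 0.1],                # arbitrary example joint table
                [0.1, 0.4]])
print(entropy(pxy.sum(axis=1)))            # H(X)
print(mutual_information(pxy))             # I(X; Y) >= 0, zero iff independent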
B. INDEPENDENT COMPONENT ANALYSIS

1. Definition

Let x be a random vector of dimension d with the joint probability density function p(x), whose covariance matrix is nonsingular. Furthermore, let F(x) be a bijective square map which maps x into the random vector y with probability density function p(y). ICA is a bijective input-output map F from the d-dimensional input vector x to the d-dimensional output vector y,

y = F(x),    (11)

such that the output components with joint probability p(y) = p(y_1 \cdots y_d) are "as independent as possible" according to an appropriate measure. In the special case where complete independence of the output components is achieved, the following holds:

p(y_1 \cdots y_d) = p(y_1) \cdots p(y_d).    (12)
If the input vector x is jointly Gaussian and the input-output map is linear, the corresponding linear ICA is equivalent to the problem of finding a nonsingular matrix M which diagonalizes the output covariance matrix Q_y:

y = Mx,    (13)

Q_y = M Q_x M^{T} = D,    (14)
where Q_x is the input covariance matrix and D is a nonnegative diagonal matrix with real-valued entries [20]. In addition, if the matrix M is a rotation, that is, an orthogonal matrix with determinant equal to one, then linear Gaussian ICA coincides with the well-known PCA. According to the definition, ICA implies a search for the map F which maximizes an appropriate measure of the statistical independence of the output components. Hence, by establishing a measure of statistical independence, ICA will be posed herein as an optimization problem with respect to the function F. The latter can, in general, be parameterized as a neural network whose architecture and learning mechanism guarantee required properties such as bijectivity. The transmitted information from the input to the output of a differentiable map F is given through the output entropy [21]:

H(\mathbf{y}) \le H(\mathbf{x}) + \int d\mathbf{x}\, p(\mathbf{x})\ln\left(\left|\det\left(\frac{\partial F}{\partial \mathbf{x}}\right)\right|\right),    (15)

where \partial F/\partial \mathbf{x} stands for the Jacobian of the input-output transformation. The inequality becomes an equality if and only if the map F is bijective. In the case of a linear, invertible mapping, the entropy relationship takes the following form:

H(\mathbf{y}) = H(\mathbf{x}) + \ln(|\det(M)|).    (16)
The entropy-preserving property will be used to simplify an appropriate measure of statistical independence. Conservation of the input entropy is assured if the input-output transformation is bijective and conserves the volume, that is, if its Jacobian matrix has a determinant equal to one:

\det\left(\frac{\partial F}{\partial \mathbf{x}}\right) = 1.    (17)

According to Eq. (17), the inequality in (15) is then reduced to

H(\mathbf{y}) = H(\mathbf{x}).    (18)
2. General Criteria for Independent Component Analysis

The two criteria derived in this section are based on suitable properties of joint and marginal output probability density functions and are general in the sense that they are valid for arbitrary non-Gaussian distributions. Nevertheless, because, in
general, arbitrary distributions are not known explicitly, the introduced measures of statistical dependence have to be estimated from the available data. The first criterion for evaluation of the statistical dependence among output components is based on the properties of a cumulant expansion of the joint probability density function p(y).

a. Cumulant Expansion-Based Criterion for Independent Component Analysis

To formulate the cumulant expansion, which is a common tool in statistical analysis, the Fourier transforms of the output distributions are needed:

\phi(\mathbf{w}) = \int d\mathbf{y}\, \exp(i(\mathbf{w}\cdot\mathbf{y}))\, p(\mathbf{y}),    (19)

\phi(w_i) = \int dy_i\, \exp(i\, w_i y_i)\, p(y_i).    (20)
Now the cumulant expansion of a distribution [22] is defined as follows:

\phi(\mathbf{w}) = \exp\left(\sum_{n=1}^{\infty}\frac{i^n}{n!}\sum_{i_1,\ldots,i_n} K_{i_1\cdots i_n}\, w_{i_1}\cdots w_{i_n}\right),    (21)

\phi(w_i) = \exp\left(\sum_{n=1}^{\infty}\frac{i^n}{n!}\, K_i^{(n)} w_i^{n}\right).    (22)
The general cumulants K_i^{(n)} and K_{i_1,\ldots,i_n} used in Eqs. (21) and (22) are defined in Gardiner [22] and will be introduced herein as needed. In Fourier space, the statistical independence condition [22] is specified as

\phi(\mathbf{w}) = \prod_{i}\phi(w_i),    (23)

which is equivalent to

\ln(\phi(\mathbf{w})) = \sum_{i}\ln(\phi(w_i)).    (24)
The condition in (24) combined with the cumulant expansions in Eqs. (21) and (22) can be used to formulate the following requirement for statistical independence:

\sum_{n=1}^{\infty}\frac{i^n}{n!}\sum_{i_1,\ldots,i_n}^{d} K_{i_1\cdots i_n}\, w_{i_1}\cdots w_{i_n} = \sum_{i}^{d}\left(\sum_{n=1}^{\infty}\frac{i^n}{n!}\, K_i^{(n)} w_i^{n}\right).    (25)
314
Gustavo Deco
The multidimensional cumulants are defined in Gardiner [22] as appropriate functions of the multidimensional higher-order moments C_{i\cdots j} = \int d\mathbf{y}\, p(\mathbf{y})\, y_i \cdots y_j, and the one-dimensional cumulants as functions of the one-dimensional higher-order moments C_i^{(n)} = \int dy_i\, p(y_i)\, y_i^{n}. The cumulants of order higher than 1 are bias independent. Hence, the transformation y' = y - \langle y \rangle can be performed to simplify the cumulant calculations. Equation (25) implies that the statistical independence test requires the evaluation of cumulants of all orders. This is practically impossible, and an appropriate approximation is necessary. If the cumulants decay at higher orders, as is often the case, we can obtain a good approximation by retaining only the first few cumulants. By neglecting the cumulants of order higher than 4, the following "fourth"-order independence condition is derived:
-\frac{1}{2!}\sum_{i,j} w_i w_j \{C_{ij} - C_i^{(2)}\delta_{ij}\} - \frac{i}{3!}\sum_{i,j,k} w_i w_j w_k \{C_{ijk} - C_i^{(3)}\delta_{ijk}\} + \frac{1}{4!}\sum_{i,j,k,l} w_i w_j w_k w_l \{(C_{ijkl} - 3C_{ij}C_{kl}) - (C_i^{(4)} - 3(C_i^{(2)})^2)\delta_{ijkl}\} = 0.    (26)

In Eq. (26), \delta_{i,\ldots,j} denotes Kronecker's delta. Because Eq. (26) should be satisfied for all w, all coefficients in each summation must be zero. This yields the following conditions:

C_{ij} = 0,  if i \neq j,    (27)

C_{ijk} = 0,  if i \neq j \vee i \neq k,    (28)

C_{ijkl} = 0,  if \{i \neq j \vee i \neq k \vee i \neq l\} \wedge \neg L,    (29)

C_{iijj} - C_{ii}C_{jj} = 0,  if i \neq j.    (30)
In (29), L is the logical expression

L = \{(i = j \wedge k = l \wedge j \neq k) \vee (i = k \wedge j = l \wedge i \neq j) \vee (i = l \wedge j = k \wedge i \neq j)\},    (31)
which excludes the cases considered in Eq. (30). The conditions of independence at the "fourth" order defined in Eqs. (27)-(30) can be achieved by minimizing the cost function
E = \alpha\sum_{i<j} C_{ij}^2 + \beta\sum_{i<j\le k} C_{ijk}^2 + \gamma\sum_{i<j\le k\le l} C_{ijkl}^2 + \delta\sum_{i<j}\left(C_{iijj} - C_{ii}C_{jj}\right)^2    (32)
over the invertible maps F, where \alpha, \beta, \gamma, \delta are the inverses of the number of elements in each summation, respectively. Hence, the cost function in (32) represents a suitable measure of statistical dependence up to cumulant order 4 of an arbitrary, possibly non-Gaussian output distribution. The numerical complexity of evaluating the expression in (32) is proportional to the output dimension and might be high. Although in the general case no simplification is possible, the special case of linear transformations offers alternative cost functions of much lower computational complexity.
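To illustrate how the cost function in (32) can be estimated from data, the following sketch computes a sample-based version of E for a matrix Y of outputs (rows are samples). The moment estimators, the exact index ranges of the four sums, and the restriction of the fourth-order sum to distinct indices are simplifying assumptions made for this illustration.

# Sketch: sample-based estimate of the fourth-order dependence cost of
# Eq. (32).  Each sum is weighted by the inverse of its number of terms,
# as in the text; the index handling is deliberately simplified.
import numpy as np
from itertools import combinations, combinations_with_replacement

def cumulant_cost(Y):
    Y = Y - Y.mean(axis=0)       # cumulants of order > 1 are bias independent
    N, d = Y.shape

    def moment(idx):             # estimator of C_{i...j} from the samples
        return np.mean(np.prod(Y[:, list(idx)], axis=1))

    t2 = [moment((i, j)) ** 2 for i, j in combinations(range(d), 2)]
    t3 = [moment((i, j, k)) ** 2
          for i, j, k in combinations_with_replacement(range(d), 3)
          if not (i == j == k)]
    t4 = [moment((i, j, k, l)) ** 2
          for i, j, k, l in combinations_with_replacement(range(d), 4)
          if len({i, j, k, l}) == 4]
    t22 = [(moment((i, i, j, j)) - moment((i, i)) * moment((j, j))) ** 2
           for i, j in combinations(range(d), 2)]
    return sum(np.mean(t) for t in (t2, t3, t4, t22) if t)

Y = np.random.randn(5000, 3)     # independent Gaussians: cost near zero
print(cumulant_cost(Y))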
The second criterion for ICA introduced herein is based on the evaluation of the mutual information at the output of the map F.

b. Mutual Information as Criterion for Independent Component Analysis
The mutual information between the output components, defined as

R = I(y_1; \ldots; y_n) = \sum_{i} I(y_1, \ldots, y_{i-1}; y_i) = \sum_{j=1}^{n} H(y_j) - H(\mathbf{y}),    (33)

is a measure of the statistical correlations between the components of the outputs. In fact, statistical independence is equivalent to

R = \sum_{j=1}^{n} H(y_j) - H(\mathbf{y}) = 0.    (34)
The latter implies that in order to minimize the redundancy at the output, the mutual information between the different components of the output vector should be minimized with respect to F. On the other hand, it was shown earlier that restricting the input-output map to be bijective and volume preserving guarantees that the joint entropy at the output is equal to the joint entropy at the input, that is, H(\mathbf{y}) = H(\mathbf{x}). Hence, the minimization of R reduces to the minimization of

h = \sum_{j=1}^{n} H(y_j).    (35)
The restriction to invertible, volume-preserving maps has simplified the optimization of the statistical independence measure in (34) because the evaluation of the joint output entropy is unnecessary. Nevertheless, the evaluation of individual entropies in (35) is still nontrivial in the case of arbitrary probability distributions. The estimation of the individual entropies H(y_j) from the data is based on the appropriate approximation of the corresponding probability densities p(y_j). According to the second Gibbs theorem, the entropy of an arbitrary distribution is bounded from above by the entropy of a Gaussian distribution with the same variance as the original one. Hence, the cost function in (35) is bounded from above by the scaled sum of the logarithm of the output variances, that is, the
entropies of Gaussian distributions. If the variance of each component is denoted by \sigma_i^2, then

\text{minimization}(h) \equiv \text{minimization}\left(\sum_i \ln(\sigma_i^2)\right).    (36)
The minimum of this upper bound is achieved when the output covariance matrix is diagonalized. A potential problem with the cost function in (36) stems from the fact that the entropy of a continuous variable can be negative and even infinite in magnitude if a variable is deterministic. Hence, it is worth obtaining a somewhat different upper bound which does not exhibit such problems. Let G(y_i) be a Gaussian distribution with mean equal to that of the real distribution p(y_i) but with unit variance. Hence, the distribution G(y_i) is not equal to the best Gaussian approximation of the real output distribution. The Kullback-Leibler distance between p(y_i) and G(y_i) is defined as

\int dy_i\, p(y_i)\ln\left(\frac{p(y_i)}{G(y_i)}\right) = -H(y_i) - \int dy_i\, p(y_i)\ln(G(y_i)) = -H(y_i) + \frac{1}{2}\ln(2\pi) + \frac{\sigma_i^2}{2}.    (37)
Hence, an alternative upper bound (modulo a constant) for h is defined as

\text{minimization}(h) = \text{minimization}\left(\sum_i \sigma_i^2\right).    (38)
With a variational approach, it can be shown that a spherical Gaussian distribution minimizes the sum of variances under the constraint of constant entropy transmission (see Deco and Obradovic [23]). The advantage of using the previously defined upper bounds stems from their low numerical complexity, because the estimation of the real entropies is substituted by the estimation of variances. Although there are several examples where the minimization of an upper bound leads to satisfactory results, there are other cases where the approximation of an arbitrary distribution by a Gaussian distribution yields poor results. One way to improve over the pure Gaussian approximation is to carry out the Edgeworth expansion [24] of the given probability distribution around its best Gaussian approximation. The upper bound in (36) can be seen as resulting from the Edgeworth expansion where only the first term is taken into account. The Edgeworth expansion of a scalar variable with zero mean and unit variance around its best Gaussian approximation N_w is a natural candidate for the
cumulant-based evaluation of the unknown probability density function p(w). Writing explicitly the terms corresponding to the Cramér-Edgeworth polynomials [25] up to order 4, the Edgeworth expansion is defined as [24]

\frac{p_w(s)}{N_w(s)} = 1 + \frac{K^{(3)}}{3!}h_3(s) + \frac{K^{(4)}}{4!}h_4(s) + \frac{10\,[K^{(3)}]^2}{6!}h_6(s) + \cdots + \mathrm{Rem}(s) = 1 + F_4(s) + \mathrm{Rem}(s).    (39)
Hence, the function F_4 contains all the cumulants corresponding to Cramér-Edgeworth polynomials up to order 4, whereas Rem is the remainder function. In the expansion in (39), K^{(i)} is the cumulant of order i of the corresponding normalized scalar variable, and h_j(s) denotes the Hermite polynomial of degree j. The accuracy of the "nth-order" approximation of the unknown density function p(w), based on taking only the first n elements of the Edgeworth series expansion, is proportional to n. The Edgeworth expansion of density functions presented in (39) can be used to express the entropy of a normalized scalar variable w,

H(w) = -\int_{-\infty}^{\infty} ds\, p_w(s)\ln(p_w(s)),    (40)
as a function of cumulants by rearranging the expression in (40) as a function of the ratio between p_w and its best Gaussian approximation N_w(s):

H(w) = -\int_{-\infty}^{\infty} ds\, p_w(s)\ln\left(\frac{p_w(s)}{N_w(s)}\right) - \int_{-\infty}^{\infty} ds\, p_w(s)\ln(N_w(s)) = -\int_{-\infty}^{\infty} ds\, p_w(s)\ln\left(\frac{p_w(s)}{N_w(s)}\right) + H_N(w).    (41)
The entropy H_N(w) of the best Gaussian approximation of the normalized variable in (41) is equal to 0.5(1 + \ln(2\pi)). The first term in the expression in (41) is always positive because it corresponds to the Kullback-Leibler distance [21] between p_w and N_w:

\int_{-\infty}^{\infty} ds\, p_w(s)\ln\left(\frac{p_w(s)}{N_w(s)}\right) = H_N(w) - H(w) \ge 0.    (42)
The expression in (41) can be used to derive an approximation of H(w) by keeping only a finite number of elements in the Edgeworth expansion. An extreme case corresponds to keeping only the first term of the expansion, that is, the best Gaussian approximation N_w.
C. NONLINEAR INDEPENDENT COMPONENT ANALYSIS

1. Redundancy Reduction by Triangular Volume-Conserving Architectures

This section formulates a neural network architecture which satisfies the conditions required by the two ICA criteria defined in Section II.B: bijectivity and volume preservation. A single-layer triangular architecture is depicted in Fig. 1a.
Figure 1 Volume-conserving neural architectures: (a) single-layer volume-conserving network; (b) multilayer network. The Jacobians of the layers are lower and upper triangular matrices, respectively.
The dimensions of the input and output layers are the same and equal to d. The analytical formulation of the transformation defined by this architecture can be written as

y_i = x_i + f_i(x_1, \ldots, x_{i-1}, \mathbf{w}_i),    (43)

where \mathbf{w}_i represents a set of parameters of the function f_i. Note that the network is always volume conserving; that is, its Jacobian satisfies Eq. (17) regardless of the choice of the functions f_i. In general, each f_i can be parameterized in an arbitrary way: by another neural network, by a single sigmoid neuron, by a polynomial (higher-order neurons), etc. The Jacobian matrix of the transformation defined in Eq. (43) is upper triangular with diagonal elements all equal to one, and its determinant is, therefore, equal to one. An "inverted" version of the network can be defined as

y_i = x_i + g_i(x_{i+1}, \ldots, x_d, \mathbf{v}_i).    (44)
The network in Eq. (44) has a lower triangular Jacobian matrix with diagonal elements all equal to one, which also guarantees volume conservation. The vectors \mathbf{v}_i represent the parameters of the functions g_i. To construct a general nonlinear transformation from inputs to outputs, it is possible to build a multilayer architecture like the one shown in Fig. 1b, which consists of both networks described by Eqs. (43) and (44). Because the successive application of volume-conserving transformations is also volume conserving, the multilayer architecture is volume conserving as well. Once the architecture is specified, independent feature extraction can be posed as optimization of the cumulant or mutual information criterion. For the moment, we focus on the cumulant criterion defined in (32). The cost function E should be minimized in order to statistically decorrelate nonlinearly correlated non-Gaussian inputs. The learning rule can be obtained by gradient descent over the cost functions defined in Section II.B.

2. Unsupervised Modeling of Chaotic Time Series

Modeling time series by learning from experiments can be viewed as the parametric extraction of statistical correlations between past and future values of the time series measurements. Owing to the short-term predictability of chaotic series, a thorough study of statistical correlations between components of the embedding vector yields the only way to distinguish between a purely random process and a chaotic deterministic series, possibly corrupted by colored or white noise. For short-term prediction, neural network models have been implemented using supervised learning paradigms and feedforward [26] or recurrent architectures [27]. However, the problem of extracting statistical correlations in a sensorial environment is the subject of unsupervised learning. Hence, the typically supervised problem of modeling dynamical systems is herein transformed into an unsupervised problem of independent feature extraction. Independent feature extraction is applied to extract, in an unsupervised fashion, the statistical correlations between the components of the embedding vector associated with a time series. A single-layer architecture is employed which attempts to extract correlations considering only the past information relative to each element of an embedding vector. The architecture is always reversible and conserves the volume and, therefore, the transmitted information [13]. In general, the environment is non-Gaussian distributed and nonlinearly correlated. The learning rule statistically decorrelates the elements of the output. For modeling a chaotic system using observations collected from a chaotic attractor, the Takens method [28], called phase-space reconstruction, is briefly reviewed. This method results in a d-dimensional "embedding space" in which the dynamics of the multidimensional attractor are captured. Let us assume a time series of a single (one-dimensional) measured variable from a multidimensional dynamical system. The aim of forecasting is to predict the future evolution of this variable. It has been shown that in nonlinear, deterministic, chaotic systems it is possible to determine the dynamical invariants and the geometric structure of the multivariable dynamical system from the observations of a single dynamical variable [28, 29]. Let a chaotic system be described as y(t + 1) = g[y(t)], and let the observable measurement be x(t) = f[y(t)]. The Takens theorem ensures that for an embedding h(t) = [x(t), x(t - \tau), \ldots, x(t - d\tau)] a map h(t + 1) = F[h(t)] exists which has the same dynamical characteristics as the original system y(t) if the number of delays is equal to d = 2D + 1, where D is the dimension of the strange attractor and \tau is the delay. This sufficient condition may be relaxed to d > 2D [28]. The theorem implies that all the coordinate-independent properties of g(\cdot) and F(\cdot) will be identical. The proper choice of d and \tau is an important topic of investigation [30, 31]. The goal of unsupervised neural network modeling is to learn the map given by F(\cdot) by learning the statistical correlations between successive elements of the embedding vector. In the example, the architecture defined in Fig. 1a is used. The input vector is the embedding vector defined as x = [x(t - d\tau), x(t - (d - 1)\tau), \ldots, x(t)]. The learning rules were defined in the previous section.
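Before turning to the example, the construction of such embedding vectors can be sketched as follows; the helper name and the sinusoidal stand-in series are illustrative assumptions.

# Sketch: delay-coordinate (Takens) embedding of a scalar series.  Each row
# is one embedding vector [x(t - d*tau), ..., x(t - tau), x(t)], matching
# the input vector used for the unsupervised network in this example.
import numpy as np

def delay_embed(series, d, tau):
    series = np.asarray(series, dtype=float)
    n = len(series) - d * tau
    if n <= 0:
        raise ValueError("series too short for this embedding")
    return np.stack([series[i * tau : i * tau + n] for i in range(d + 1)],
                    axis=1)

x = np.sin(0.05 * np.arange(1000))       # stand-in signal
X = delay_embed(x, d=5, tau=10)          # rows: 6-dimensional input vectors
print(X.shape)                           # (950, 6)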
The example focuses on modeling the Mackey-Glass system. Owing to the presence of a pure delay in the differential equation (45), the Mackey-Glass system formally has an infinite number of degrees of freedom, but its strange attractor has finite dimension. The delay differential equation of Mackey-Glass [32] is as follows:

\dot{x}(t) = -b\,x(t) + \frac{a\,x(t - T)}{1 + x(t - T)^{10}},    (45)
where a = 0.2, b = 0.1, and T = 30. A polynomial neural network of order 2 is used in this example. The learning constant was \eta = 0.01, and 25000 iterations of training were performed. Five hundred training patterns corresponding to the time interval from t = 2000 to t = 25000 are generated by integrating Eq. (45). The input and output dimension is 6, and the criterion for learning is again the cumulant-expansion-based cost function in (32). The six inputs are x(t - 50), x(t - 40), x(t - 30), x(t - 20), x(t - 10), and x(t). Because the neural architecture in this example is a polynomial of second order, it can only approximate the real dynamics of the Mackey-Glass series. Nevertheless, the embedding dimension was detected even with this second-order approximation by analyzing the weight connections and output components after training. Those weights which are negligibly small indicate statistical independence. The embedding dimension found is 4, in agreement with the results of Liebert and Schuster [30] and Liebert et al. [31]. Figure 2 shows the output components after training. Output components 5 and 6 have very low variances, meaning that the network with second-order polynomial functions f_i has extracted an approximate form of the correlation between these outputs and the past four points. This also indicates that four points in the past (the embedding dimension) are required to approximately model the map by the second-order polynomial.
Figure 2 Outputs as a function of time of a six-input, six-output neural network trained in unsupervised fashion for extracting the decorrelation between the components of a six-dimensional embedding vector for the Mackey-Glass time series.
The correlation cannot be totally extracted because the original series is nonpolynomial, whereas the network is a second-order polynomial. In addition, the data were generated by a differential equation and not by a finite map. This is why components 5 and 6 of the output are not constant. It is important to remember that the optimal embedding dimension is determined by the number of points in the past that are statistically correlated with the present. A strategy for measuring statistical correlations is to find out how many points in the past are necessary to model the series, that is, to find the statistical correlations with the present. This technique is related to the ones proposed by Fraser and Swinney [33] and to the works of Liebert and Schuster [30] and Liebert et al. [31], which formulate the minimization of the mutual information for the detection of the optimal embedding.
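For reference, data of the kind used in this example could be generated along the following lines; the Euler scheme, the step size, the constant initial history, and the transient length are illustrative assumptions rather than the procedure used in the chapter.

# Sketch: generating Mackey-Glass data by Euler integration of Eq. (45)
# with a = 0.2, b = 0.1, T = 30.
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, T=30, dt=1.0, transient=2000):
    steps = n_samples + transient
    lag = int(T / dt)
    x = np.empty(steps)
    x[: lag + 1] = 0.9                      # constant initial history
    for t in range(lag, steps - 1):
        dx = -b * x[t] + a * x[t - lag] / (1.0 + x[t - lag] ** 10)
        x[t + 1] = x[t] + dt * dx           # Euler step
    return x[transient:]                    # discard the transient

series = mackey_glass(500)                  # e.g., 500 training samples
print(series[:5])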
D. LINEAR INDEPENDENT COMPONENT ANALYSIS

Linear independent component analysis is defined as the search for a linear map, that is, a matrix, which optimizes an appropriate measure of the statistical dependence of the output components. The conservation of information in the strict sense of Eq. (16) corresponds to matrices whose determinant is equal to one. Nevertheless, one can argue that the invertibility of the matrix is the actual requirement for information preservation and that the determinant condition can be achieved by a simple scalar scaling of the matrix M. One measure of the statistical dependence between the output components is defined as the average mutual information:

I = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{x}) - \ln(|\det(M)|) \ge 0  \quad\text{iff } |\det(M)| \neq 0.    (46)
It is easy to see that the average mutual information as a measure of statistical independence is invariant to diagonal scaling as well as to the permutation of the output components. The invariance with respect to diagonal scaling can be used to reduce the number of terms in (46) which depend on the transformation M, because there always exists a scaling

[\det(M)]^{-1/n}    (47)

which makes the determinant of the resulting matrix

M[\det(M)]^{-1/n}    (48)
equal to one. Hence, without loss of generality, it is possible to restrict the original matrix to have a unit determinant and, therefore, H(\mathbf{y}) = H(\mathbf{x}). Equation (46) shows that the only term which is dependent on the linear transformation M with unit determinant is the sum of the entropies of the individual output elements. Consequently, linear ICA can now be formulated as the following constrained optimization problem:

J(M) = \min_{M} \sum_{i=1}^{n} H(y_i)  \quad\text{such that } |\det(M)| = 1.    (49)
The cost function J(M) does not impose any restriction on the probability density function of the input and output signals, but it requires estimation of the entropies corresponding to the output components. In the case where the input signal is jointly Gaussian, the entropies are simple functions of the output variances. If the matrix M is further restricted to be a rotation, the minimization of the cost function J(M) results in the standard PCA [6]. When the input signal is not jointly Gaussian, the evaluation of the cost function J(M) has to be based on the cumulant/moment expansion of the probability density functions p(y_i) (see Deco and Obradovic [23] for a detailed discussion of the linear case).
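As a minimal illustration of the Gaussian case, the following sketch diagonalizes the output covariance with an orthogonal matrix built from the eigenvectors of Q_x, which is exactly the PCA solution of Eqs. (13) and (14); the mixing matrix of the example is arbitrary.

# Sketch: linear Gaussian ICA as PCA (Eqs. (13), (14)).  An orthogonal M
# built from the eigenvectors of the input covariance diagonalizes the
# output covariance, decorrelating the components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                                [0.0, 1.0, 0.3],
                                                [0.0, 0.0, 0.5]])
Qx = np.cov(X, rowvar=False)            # input covariance Q_x
eigvals, eigvecs = np.linalg.eigh(Qx)
M = eigvecs.T                           # orthogonal; a rotation up to sign
Y = X @ M.T                             # y = M x for every sample
Qy = np.cov(Y, rowvar=False)            # M Q_x M^T, diagonal up to noise
print(np.round(Qy, 3))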
E. DUAL ENSEMBLE THEORY FOR UNSUPERVISED AND SUPERVISED LEARNING
The problem of learning and generalization from examples using neural networks has been posed in the context of statistics [34]. In the statistical approach, an ensemble of neural networks is used to address the problem of generalization of learning from a finite number of noisy training examples. The ensemble treatment of neural networks [34] assumes that the final model is a probabilistic model built by an integration of singular models weighted with the corresponding probability distribution. The Gibbs distribution is obtained from the maximum entropy principle [35] or, alternatively, by imposing the equivalence of the minimum error and the maximum-likelihood criteria for training the network [34]. Learning is defined as a maximization of the Kullback-Leibler entropy of the network distribution in parameter space, and it reduces the ensemble volume, where the initial volume was fixed by the a priori distribution [34]. A principle similar to the principle of minimum predictive description length [36, 37] is derived in this framework by applying the maximum-likelihood approach to the problem of explaining the data by the ensemble of neural models. This section establishes a duality between unsupervised factorial learning and supervised learning based on the maximum-likelihood principle and derives a common ensemble theory.
Figure 3 Unsupervised architecture for supervised learning.
Let us consider an input vector u for the triangular architecture of Fig. 3 as composed of two components x and y of dimensions n and m, respectively, that is, u = {x, y}. These two vectors are related through a probability distribution p(x, y), and they can be regarded as the input and output of a map to be learned by supervised learning. The input vector u is given empirically by the set of training data. Let us denote the output of the triangular architecture by v, which is also composed of two vectors such that v = {x, e}. The network output component e is defined by e = y - f(x, w), where for supervised learning e is the error and w is the parameter vector that describes the input-output model f. The maximum-likelihood principle for supervised learning requires w to be chosen so that the empirical likelihood

L_e = \frac{1}{P}\sum_{i=1}^{P}\ln\left(p(y^{(i)}|x^{(i)}, w)\right)    (50)

is maximal. In Eq. (50), the conditional probability p(y^{(i)}|x^{(i)}, w) should be regarded as a measure of the compatibility of the pairs (x^{(i)}, y^{(i)}). On the other hand, the goal of unsupervised learning is redundancy minimization. The architecture of Fig. 3 can only minimize the redundancy inherent in the relation between the vectors x and y; that is, it aims to extract the correlations between these vectors, which is the goal of supervised learning. In fact, unsupervised learning minimizes the redundancy at the output components given by

R = \sum_{j=1}^{n} H(x_j) + \sum_{j=1}^{m} H(e_j) - H(\mathbf{v}).    (51)

Using the fact that the entropy is conserved, that is,

H(\mathbf{v}) = H(\mathbf{u}) = \text{constant},    (52)
and assuming that the distribution of x is stationary, that is,

\sum_{j=1}^{n} H(x_j) = \text{constant},    (53)
minimization of the redundancy is reduced to minimization of the term \sum_{j=1}^{m} H(e_j). Owing to the fact that

H(\mathbf{e}) \le \sum_{j=1}^{m} H(e_j)    (54)

and taking into account that

H(\mathbf{e}) = -L    (55)

because of

p(y|x, w) = p(y - f(x, w) = e|x, w) = p(e|x, w),    (56)
minimization of the redundancy of the output components v is equivalent to maximization of the likelihood L. In other words, maximum-likelihood supervised learning is equivalent to minimizing the entropy of the error, which is the goal of unsupervised reduction of redundancy. This is the dual formulation of maximum-likelihood-based supervised learning and factorial unsupervised learning. The ensemble theory of unsupervised learning was derived in Deco and Schürmann [14]. In this framework, we can write the prediction probability of a new point as

p(y|x, D^{(P)}) = p(e|x, D^{(P)}) = \int dw\, \frac{\exp\left(\beta\sum_{i=1}^{P}\ln(p(e^{(i)}|x^{(i)}, w))\right)}{Z}\, p(e|x, w, \beta),    (57)
which is the maximum entropy principle applied to an ensemble of networks constrained by the fact that they should minimize the redundancy. Furthermore, assuming for each network the escort distribution

p(e|x, w, \beta) = \frac{\exp(\beta\ln(p(e|x, w)))}{Z},    (58)

it is possible to obtain the distribution of y, p(y|x, D^{(P)}) [Eq. (59)], and, therefore,

p(y|x, D^{(P)}) \le p(y|x, w_P, \beta)\,\sqrt{\frac{\det(F^{(P)})}{\det(F^{(P+1)})}}.    (60)
In the last two equations,

F^{(P)} = -\sum_{i=1}^{P}\nabla\nabla\ln\left(p(e^{(i)}|x^{(i)}, w_P)\right),    (61)

H' = \ln(p(e|x, w_P)),    (62)

and

g' = \nabla H'|_{w_P}.    (63)
Hence, an upper bound for the prediction probability of a new data point given P training samples is obtained. The upper bound is essentially determined by the square root of the ratio between the determinants of the Fisher information matrices with and without the new point. The factor p(y|x, w_P, \beta) is the probability of observing the new pair given by the network with parameter vector w_P. The negative logarithm of the second term on the right-hand side of Eq. (60) is the novelty measure NW(P), that is,

NW(P) = -\frac{1}{2}\ln\left(\frac{\det(F^{(P)})}{\det(F^{(P+1)})}\right).    (64)
Hence, the statistical theory of unsupervised factorial learning with a volume-conserving network has been extended to the ensemble theory of supervised learning based on the maximum-likelihood approach.
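The novelty measure of Eq. (64) is directly computable once the Fisher information matrices are available, as the following sketch shows; the stand-in matrices and the rank-one update used to form F^(P+1) are assumptions made for illustration.

# Sketch: the novelty measure of Eq. (64) as minus half the log-ratio of
# Fisher-information determinants with and without the new data point.
# slogdet is used for numerical stability.
import numpy as np

def novelty(F_P, F_P1):
    s0, logdet_P = np.linalg.slogdet(F_P)
    s1, logdet_P1 = np.linalg.slogdet(F_P1)
    assert s0 > 0 and s1 > 0, "Fisher matrices must be positive definite"
    return -0.5 * (logdet_P - logdet_P1)

A = np.random.randn(6, 4)
F_P = A.T @ A + np.eye(4)            # stand-in Fisher matrix F^(P)
g = np.random.randn(4, 1)
F_P1 = F_P + g @ g.T                 # assumed rank-one update for new point
print(novelty(F_P, F_P1))            # larger value = more surprising point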
III. STATISTICAL STRUCTURE EXTRACTION: NONPARAMETRIC FORMULATION

During this decade, interest in the inverse problem of nonlinear systems has increased remarkably [38-42]. In the case of time series, this problem consists of determining the underlying dynamics of the process when only the measured data are given. In particular, knowledge of the kind of dynamics that generate the data opens the possibility of answering essential questions such as theoretical predictability and the forecasting horizon of a time series, that is, the possibility and reliability of modeling. Predictability can also be used as a mechanism for the
detection of useful data for training parametric models such as neural networks, or for deciding when a prediction with the model is feasible. Especially in economic applications, predictability can be used as a tool for developing investment strategies. A challenging related task is to detect the presence of chaos in real data, but, as was noticed by Theiler et al. [40], this problem is cumbersome because the erratic fluctuations that are observed in real data owe their variation to many influences such as chaos, nonchaotic but nonlinear determinism, linear correlations, and noise, both in the dynamics and in the measurement process. Therefore, a task that has to be solved before the final goal of extracting the dynamics of the system can be achieved is the reliable detection of statistical dependence (nonlinear and non-Gaussian in general) in the data. In the context of time series, the statistical dependence between the past points and the considered present point therefore yields a measure of the predictability of the corresponding system. Several excellent works [41-43] concentrated especially on the previously mentioned problem of detecting nonlinear correlations in time series. In the paper of Palus [42], an information-theoretic-based measure of statistical dependence called redundancy was successfully used in combination with the surrogate method of Theiler et al. [40]. The same kind of information-theoretic-based redundancy in data has been formulated in the previous section in the context of neural networks for a parametric formulation of statistical structure extraction. A drawback of the redundancy measure of statistical dependence is that it is an entropy-based measure, and therefore box-counting algorithms usually have to be used for the estimation of the probability distributions involved in its calculation. In the case where the dimensionality involved is high, for example, when the redundancy between several points in the past and the present should be measured, very unreliable results can be obtained, owing to the difficulty of density estimation by histograms in high dimensions, especially when the number of data points is small [44]. The aim of the present section is to formulate a nonparametric cumulant-based statistical approach for detecting linear and nonlinear statistical dependences in time series. The statistical dependence is detected by measuring the discriminating significance, which tests the null hypothesis of statistical independence, expressed in Fourier space, by the surrogate method. The surrogate data herein used correspond to the null hypothesis of a noncausal relationship between the past and present, that is, of a random process with statistical properties equivalent to the original time series. The formulation of statistical independence in Fourier space leads automatically and consistently to a cumulant-based measure. In this formulation, linear and nonlinear as well as Gaussian and non-Gaussian temporal dependences can be detected in time series. Contrary to box-counting-based methods for the estimation of temporal redundancy, this cumulant-based predictability offers a method for the estimation of temporal redundancy which is reliable even
in the case of a small number of data points. This fact permits us to avoid the assumption of stationarity and to use the slicing-window method for measuring nonstationarity, which can be reliably employed only when the estimation of predictability can be performed with a small number of data points.
A. STATISTICAL INDEPENDENCE MEASURE
In this section, we present a measure of statistical independence based on the expansion of Fourier-transformed densities in higher-order cumulants. The measure derived here will be used in the context of time series for testing the null hypothesis of statistical independence of one present point from its past, in order to measure the predictability of the system. Let us consider a time series x(t) and let us define an embedding vector constructed from the observable x by \mathbf{x}(t) = (x(t), x(t - \Delta), \ldots, x(t - (m - 1)\Delta)) = (x_1, \ldots, x_m) of dimensionality m and time lag \Delta [28]. The vector \mathbf{x}(t) is a random variable distributed according to the probability distribution P(\mathbf{x}). A measure of the predictability of the present from the past is given by how strongly the probability of the present [say the point x(t)] is conditioned by the past [i.e., by x(t - \Delta), \ldots, x(t - (m - 1)\Delta)]. In other words, the predictability is a measure of how different from zero is the subtraction

P(x_1) - P(x_1|x_2, \ldots, x_m).    (65)

Multiplying this quantity by P(x_2, \ldots, x_m) yields

P(x_1)P(x_2, \ldots, x_m) - P(x_1, x_2, \ldots, x_m).    (66)
We see that the predictability is nothing else than a measure of the statistical independence between the present and the past. If this quantity is zero, then the present is independent of the past and therefore unpredictable from the past; that is, there is no statistical structure underlying the time series data, meaning that the data are just uncorrelated noise and therefore defined by a Bernoulli process. Because of this, we will measure the predictability by testing the null hypothesis H of statistical independence, that is,

H = \{P(x_1, x_2, \ldots, x_m) = P(x_1)P(x_2, \ldots, x_m)\}.    (67)
To estimate the null hypothesis H from the data, we express H in terms of empirically measurable entities such as higher-order cumulants, which take into account the effect of non-Gaussianity and nonlinearity underlying the data. Therefore, let us define the Fourier transforms of the joint and marginal probability distributions
by the following equations:

\phi(w_1, w_2, \ldots, w_m) = \int dx_1 dx_2 \cdots dx_m\, \exp\left(i\sum_j x_j w_j\right) P(x_1, x_2, \ldots, x_m),    (68)

\varphi(w_1) = \int dx_1\, \exp(i\, w_1 x_1)\, P(x_1),    (69)

\lambda(w_2, \ldots, w_m) = \int dx_2 \cdots dx_m\, \exp\left(i\sum_{j\ge 2} x_j w_j\right) P(x_2, \ldots, x_m),    (70)
(70)
where / = ^A-T. The cumulant expansions are [22]
(
oo ^n
^ —
oo
Yl
\
^iu-^in^h"'^in\^
(71)
^(u;i) = e x p f J ] - / r | " ^ < j , 00
A,(u;2,..., u;„) = expl ^
(72)
.y
—
^
Ki^,..jn^n
'^in]-
(73)
It is clear from the definition that K_1^{(n)} = K_{i_1,\ldots,i_n} when i_1 = \cdots = i_n = 1. In Fourier space, the independence condition of Eq. (67) is given by Papoulis [21]:

\phi(w_1, w_2, \ldots, w_m) = \varphi(w_1)\lambda(w_2, \ldots, w_m),    (74)

which is equivalent to

\ln(\phi(w_1, w_2, \ldots, w_m)) = \ln(\varphi(w_1)) + \ln(\lambda(w_2, \ldots, w_m)).    (75)
Substituting Eqs. (71)-(73) into Eq. (75), we find that in the case of independence the following equality should be satisfied:

\sum_{n=1}^{\infty}\frac{i^n}{n!}\sum_{i_1,\ldots,i_n=1}^{m} K_{i_1\cdots i_n}\, w_{i_1}\cdots w_{i_n} = \sum_{n=1}^{\infty}\frac{i^n}{n!}\, K_1^{(n)} w_1^{n} + \sum_{n=1}^{\infty}\frac{i^n}{n!}\sum_{i_1,\ldots,i_n=2}^{m} K_{i_1\cdots i_n}\, w_{i_1}\cdots w_{i_n},    (76)

and, therefore,

\sum_{n=1}^{\infty}\frac{i^n}{n!}\sum_{i_2,\ldots,i_n=1}^{m} (1 - \delta_{1 i_2,\ldots,i_n})\, K_{1 i_2\cdots i_n}\, w_1 w_{i_2}\cdots w_{i_n} = 0.    (77)
The Kronecker delta \delta_{1 i_2,\ldots,i_n} is defined by

\delta_{1 i_2,\ldots,i_n} = \begin{cases} 1, & \text{if } i_2 = 1 \wedge i_3 = 1 \wedge \cdots \wedge i_n = 1, \\ 0, & \text{otherwise.} \end{cases}    (78)
Owing to the fact that Eq. (74) should be satisfied for all w, all coefficients in each summation of Eq. (77) must be zero. This means that the nondiagonal elements of all higher-order n-dimensional cumulants whose first index is 1 should be zero, that is,

K_{1 i_2,\ldots,i_n} = 0 \quad\text{if } 1 \neq i_2 \text{ or } 1 \neq i_3 \cdots \text{ or } 1 \neq i_n.    (79)
Therefore, a measure that can be used for testing statistical independence in time series can be defined by the cost function

D = \sum_{n=1}^{\infty}\sum_{i_2,\ldots,i_n=1}^{m} (1 - \delta_{1 i_2,\ldots,i_n})\,(K_{1 i_2,\ldots,i_n})^2,    (80)
based on Eq. (77). The quantity D is such that D \ge 0. It measures the degree of statistical dependence, indicating independence when it is minimal, that is, zero, and increasing statistical dependence with increasing positive value. Let us note that the past has been considered here as the last m points, and therefore m should be large enough to really test the predictability of the present given the "whole" past. Note also that the marginal redundancy I(x_m, \ldots, x_2; x_1), defined by the mutual information between the past and the present, is minimal, that is, zero, if condition (77) is satisfied for all n. The redundancy measures the statistical independence (also nonlinear and non-Gaussian), and therefore it is consistent with our formulation. Furthermore, I(x_m, \ldots, x_2; x_1) is the measure of the condition given by Eq. (66) in the form of a Kullback-Leibler entropy [19]. In the special case where only second-order statistics are considered, that is, only the expansion up to the second-order cumulant is employed, the trivial result that the autocorrelation coefficient measures statistical independence in the linear and Gaussian case is obtained; that is, if

C_{1j} = 0 \quad\text{if } 1 \neq j,    (81)
the present is linearly decorrelated from the past if the signal is Gaussian. It is relatively easy to show that, in the linear Gaussian case, Eq. (81) ensures that the redundancy I is minimal and equal to zero.
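As an illustration, the following sketch estimates a truncated version of D, keeping only cumulant orders 2 and 3 and only indices referring to the past; for zero-mean data these cumulants coincide with the corresponding moments, which keeps the estimator simple. The logistic-map comparison is ours: its second-order terms nearly vanish while the third-order terms expose the dependence.

# Sketch: a truncated estimate of the independence measure D of Eq. (80).
# X holds embedding vectors as rows (column 0 = present, the rest = past);
# the variables are standardized so that D is comparable across series.
import numpy as np
from itertools import combinations_with_replacement

def dependence_D(X):
    X = X - X.mean(axis=0)
    X = X / X.std(axis=0)
    m = X.shape[1]
    D = 0.0
    for j in range(1, m):                          # order 2: K_{1j}, j != 1
        D += np.mean(X[:, 0] * X[:, j]) ** 2
    for j, k in combinations_with_replacement(range(1, m), 2):
        D += np.mean(X[:, 0] * X[:, j] * X[:, k]) ** 2   # order 3: K_{1jk}
    return D

def embed(x, m):
    return np.stack([x[m - 1 - i : len(x) - i] for i in range(m)], axis=1)

noise = np.random.randn(3000)                      # independent: D small
logistic = np.empty(3000)
logistic[0] = 0.3
for i in range(2999):                              # chaotic, correlated
    logistic[i + 1] = 4.0 * logistic[i] * (1.0 - logistic[i])
print(dependence_D(embed(noise, 3)), dependence_D(embed(logistic, 3)))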
In conclusion, we have developed in this section a tool for measuring statistical correlations between the past and the future in time series. To give a quantitative meaning to this measure, especially when handling smaller and noisy data sets, we introduce in the next section the method of surrogates [40], which aims to detect predictability by a significance test of a null hypothesis corresponding to the assumption of statistical independence in time.
B. STATISTICAL TEST

1. Surrogate Method

In the framework of statistical testing, a hypothesis about the data can be rejected or accepted based on a measure that discriminates the statistics underlying a distribution consistent with the hypothesis from the original distribution of the data (see Breiman [45]). The hypothesis to be tested is called the null hypothesis. The measure that quantifies the aforementioned distinction between the statistics of the distributions is called the discriminating statistic. In general, it is very difficult to derive analytically the distribution of a given statistic under a given null hypothesis. The surrogate method proposes to estimate such distributions empirically by generating different versions of the data, called surrogates, such that they share some given properties of the original data and are, at the same time, consistent with the null hypothesis. The rejection of a null hypothesis is therefore based on the computation of the discriminating statistic D_0 for the original time series and the discriminating statistics D_{Si} for the ith surrogate time series generated under the null hypothesis. If multiple realizations of the original time series are possible, then a Kolmogorov-Smirnov test is recommended [45]. In our case, we assume that only one realization is possible (which is a realistic assumption, especially in the case of real-world examples), and therefore a one-realization test is adopted. The null hypothesis is rejected if the significance given by

S = \frac{|D_0 - \mu_S|}{\sigma_S}    (82)

is greater than the S corresponding to a p value p = \mathrm{erfc}(S/\sqrt{2}) (p usually being 0.05, which corresponds to S = 1.645). This means that the probability of observing a significance S or larger is p if the null hypothesis is true, under the assumption of Gaussianity of the surrogate distribution (which is realistic if the number of surrogates is equal to or larger than 100).
In Eq. (82),

\mu_S = \frac{1}{N}\sum_{i=1}^{N} D_{Si}, \qquad \sigma_S = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(D_{Si} - \mu_S)^2}    (83)
are the estimated mean value and standard deviation of the discriminating statistics under the surrogate distribution, with N the number of surrogates. In our case, the concrete null hypothesis that we would like to test is the one corresponding to the assumption of a noncausal relationship between past and present, that is, of statistical independence in time. In other words, we are trying to answer the most modest question, namely, whether there are any dynamics at all underlying the data. The surrogate data are generated from the original time series by mixing the temporal order, so that if any temporal dependence was originally present, it is destroyed by this mixing process. The generated surrogates are scrambled surrogates and have the same statistical characteristics as the original data: the histogram of the original data is preserved, but with the particularity that these data do not possess any statistical temporal correlation. The discriminating statistics used here measure the statistical correlations by means of the cumulant expansion, that is, by calculating the nondiagonal elements of the cumulants according to Eq. (80). In this section, we include cumulant terms up to fourth order.

2. Nonstationarity

If the assumption of stationarity of the observed time series cannot be accepted, the test of the existence of dynamics should be performed by the method of overlapping or slicing windows. In this method, the statistical test of temporal independence at time t is performed in a window of the data of length N_w, that is, by using the data from time t - N_w to t. Subsequently, we shift the window by N_s time steps and perform the statistical test again by using the data from t + N_s - N_w to t + N_s, and so on. In this case, it is useful to plot the significance S as a function of the time t at which the slicing window ends. It is clear that for a stationary time series the value of S will be approximately the same for all times. In other cases, this method can be used for the detection of regions in which stationary behavior is observed or regions in which dynamics really underlie the process. We present some interesting tests and applications on real-world examples in the next section.

3. Experimental Results

a. Testing Predictability: Artificial Time Series

In this section, we would like to demonstrate the detection of nonstationarity by using the method of slicing windows in combination with our cumulant-surrogate-based testing of statistical correlations. We use the chaotic time series
Figure 4 Significance as a function of time for the chaotic Henon series without noise (stationary) and with additive nonstationary noise. The dimension of the embedding vector is m = 3 and the time lag is Δ = 1. All quantities were computed for a slicing window with N_w = 256 and N_s = 4.
of Henon [46], defined by the following iterative equation:

x_{t+1} = 1 - A x_t^2 + B x_{t-1},    (84)

with A = 1.4 and B = 0.3. This time series contains nonlinear correlations, and it is stationary. A nonstationary variation of this time series can be implemented by adding nonstationary Gaussian noise \nu_t in the following form:

x_{t+1} = 1 - A x_t^2 + B x_{t-1} + n_v \nu_t,    (85)

with

n_v = \begin{cases} 0, & \text{if } t \le 1500, \\ \dfrac{t - 1500}{1500}, & \text{if } t > 1500. \end{cases}    (86)
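The complete test can be sketched as follows; the single third-order cross-cumulant used as the discriminating statistic is a deliberate simplification of the full measure D of Eq. (80), and the transient length and surrogate count are illustrative choices.

# Sketch: scrambled-surrogate significance test (Eq. (82)) on the Henon
# map (Eq. (84)).
import numpy as np

rng = np.random.default_rng(1)

def henon(n, A=1.4, B=0.3):
    x = np.zeros(n + 100)
    for t in range(1, n + 99):
        x[t + 1] = 1.0 - A * x[t] ** 2 + B * x[t - 1]
    return x[100:]                           # discard transient

def statistic(x):
    x = (x - x.mean()) / x.std()
    return abs(np.mean(x[1:] * x[:-1] ** 2)) # one third-order cumulant

x = henon(2000)
d0 = statistic(x)
surr = [statistic(rng.permutation(x)) for _ in range(100)]  # scrambled
mu, sigma = np.mean(surr), np.std(surr)
S = abs(d0 - mu) / sigma                     # significance, Eq. (82)
print(S > 1.645)                             # reject independence, p = 0.05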
In this example, we calculate the significance, including cumulant terms up to fourth order. In Fig. 4, the significance for the noise-free and noisy Henon time series is plotted as a function of time for a slicing window with N_w = 256 and N_s = 4. The dimension of the embedding vector is m = 3 and the time
lag is Δ = 1. In the stationary case, an approximately horizontal evolution of the significance over time is observed, indicating that the evidence of statistical correlations is strong and stationary. In the noisy case, a nonstationary evolution of the significance is observed, indicating that from t > 1500 on the evidence of nonlinear correlations slowly disappears until the end (t = 3000) where, owing to the signal-to-noise ratio, only uncorrelated Gaussian noise is observed. This clearly does not represent correlations in time, and therefore the significance is approximately zero, indicating that the null hypothesis of independence cannot be refuted.

b. Testing Predictability: Real-World Time Series

In this section, several real-world examples for the detection of temporal statistical correlations in time series are presented. In all of the examples, we calculate the significance by including terms up to fourth-order cumulants. It is important to note that we only test the existence of statistical correlations, which would implicitly indicate the existence of underlying dynamics; these can be stochastic, deterministic chaotic, or nonchaotic. We do not test for the existence of chaos.

Sunspot Data. In this section, we test the existence of temporal statistical correlations in the yearly and monthly sunspot time series. The yearly sunspot time series has been extensively studied in the literature [40]. The monthly sunspots were collected by the Sunspot Index Data Centre (Brussels). In Fig. 5, the significance as a function of time for the yearly sunspot time series is shown. The dimension of the embedding vector is m = 13 and the time lag is Δ = 1. The slicing window used had N_w = 200 and N_s = 1. Stationary and clear evidence of statistical correlations is observed, which supports the usual assumption that a stable cycle of 12 years really exists. Figure 6 shows the plot of the significance as a function of time for the monthly sunspot time series. The dimension of the embedding vector in this case is m = 5 and the time lag is Δ = 1. All quantities were computed for a slicing window with N_w = 300 and N_s = 4. The evolution of the significance supports the existence of statistical correlations between past and future which are probably of a nonstationary nature, owing to the strong changes in the significance magnitude over time. This would mean that, on a short-term scale, the dynamics involved are nonstationary.

Financial Data. In this section, we deal with the tick German Stock Index (DAX). The problem was to predict the 60-minute volatility of the DAX. Volatility estimates have gained major attention in financial analysis, for example, as a basis for directional market forecasts or to evaluate and hedge derivative market instruments. Intradaily stock market data are relatively difficult to predict because of the
Figure 5 Significance as a function of time for the yearly sunspot time series. The dimension of the embedding vector is m = 13 and the time lag is Δ = 1. All quantities were computed for a slicing window with N_w = 200 and N_s = 1.
Figure 6 Significance as a function of time for the monthly sunspot time series. The dimension of the embedding vector is $m = 5$ and the time lag is $\Delta = 1$. All quantities were computed for a slicing window with $N_w = 300$ and $N_s = 4$.
Figure 7 Significance as a function of time for the German Stock Index (DAX) time series. The dimension of the embedding vector is $m = 5$ and the time lag is $\Delta = 60$. All quantities were computed for a slicing window with $N_w = 440$ and $N_s = 8$.
Intradaily stock market data are relatively difficult to predict because of the large amount of random noise. The "speed" of the market changes frequently, and observations are disturbed by large moves caused by information shocks or by discontinuities in time owing to limited trading periods. The basic time series is the value of the stock index DAX observed per minute in November 1994. The volatility is computed by averaging the last 60 absolute differences of the index. The results are plotted in Fig. 7. In this figure, a strongly nonstationary evolution of the significance as a function of time is observed. Remarkably, a strong predictability exists over the entire time. The dimension of the embedding vector is $m = 5$ and the time lag is $\Delta = 60$. The DAX significances were computed for a slicing window with $N_w = 440$ and $N_s = 8$.

Electroencephalogram Data. A relevant application of nonlinear time series analysis is the investigation of EEG signals. For instance, considerable progress in the therapy of patients who suffer from epileptic seizures could be made if the existence of a strange attractor in the signal space were established, as remarked by Rapp et al. [47]: "Small changes in parameter values that could be effected pharmacologically could produce a reverse transition to ordered, physiologically acceptable behavior."
Figure 8 Significance as a function of time for the time series corresponding to the EEG signal measured using intracranial electrodes in the entorhinal cortex of a male Wistar rat during an epileptic seizure. The dimension of the embedding vector is $m = 3$ and the time lag is $\Delta = 10$. All quantities were computed for a slicing window with $N_w = 1000$ and $N_s = 20$.
In this section, we analyze the existence of nonlinear correlations in the EEG time series measured using intracranial electrodes in the entorhinal cortex (EC) of a male Wistar rat during provoked epileptic seizures triggered by a kindling stimulus applied to the entorhinal cortex (see [48]). The epileptic seizure took approximately 24 seconds, sampled in 12700 data points. Figure 8 shows the significance as a function of time corresponding to an embedding vector of dimension $m = 3$ and a time lag of $\Delta = 10$. All quantities were computed for a slicing window with $N_w = 1000$ and $N_s = 20$. In the figure, it is possible to observe a clear rejection of the independence hypothesis, indicating the existence of statistical correlations in time. The evolution of the significance is nonstationary and shows a decay of correlations over time. This last fact agrees with the results of Pijn [48], who also detected an increase in the dimension of the attractor during the last 8 seconds, indicating a loss of the dynamics and therefore of the temporal correlations, as in our experiment.
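The sliding-window computation behind Figs. 5-8 can be sketched as follows. The code below is an illustrative outline only, not code from the chapter: the chapter's significance is a higher-order-cumulant discriminating statistic, whereas this sketch substitutes a simpler stand-in statistic (a time-delayed correlation) for brevity, and the window parameters, surrogate count, and function names are assumptions. The sliding-window logic (window length $N_w$, shift $N_s$) and the comparison of the original data against scrambled surrogates under the independence null hypothesis follow the procedure described above (cf. Theiler et al. [40]).

```python
import numpy as np

def significance(window, statistic, n_surrogates=40, rng=None):
    """Surrogate-data significance of temporal correlations in one window.

    Compares the statistic on the original window against its distribution
    over randomly scrambled surrogates (the null hypothesis of independence),
    measured in units of surrogate standard deviations.
    """
    rng = rng or np.random.default_rng(0)
    q0 = statistic(window)
    qs = np.array([statistic(rng.permutation(window)) for _ in range(n_surrogates)])
    return abs(q0 - qs.mean()) / qs.std()

def delayed_correlation(x, lag=1):
    # Simple stand-in statistic; the chapter uses terms up to fourth-order cumulants.
    x = (x - x.mean()) / x.std()
    return np.mean(x[:-lag] * x[lag:])

def sliding_significance(series, n_w=256, n_s=4, statistic=delayed_correlation):
    """Significance over a slicing window of length N_w shifted by N_s samples."""
    return [significance(series[i:i + n_w], statistic)
            for i in range(0, len(series) - n_w + 1, n_s)]

# Example: noisy Henon series, as in Fig. 4
x = np.zeros(3000); y = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 1.0 - 1.4 * x[t - 1] ** 2 + y[t - 1]
    y[t] = 0.3 * x[t - 1]
noisy = x + 0.1 * np.random.default_rng(1).normal(size=3000)
sig = sliding_significance(noisy)   # horizontal for stationary data, decaying for noise
```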
IV. NONPARAMETRIC CHARACTERIZATION OF DYNAMICS: THE INFORMATION FLOW CONCEPT

Over the past decade, there has been an increasing interest in statistical- and information-theoretic-based approaches to chaotic dynamical systems [49-56]. In spite of the deterministic character of the equations that generate chaotic dynamics, there appears to be a stochastic aspect in such systems because of the essential problem of extreme sensitivity to the initial conditions. In fact, if one has no exact knowledge about the initial state of a chaotic dynamical system, the
prediction of states in the future can be done only in a probabilistic sense. The exponential divergence in time of two neighboring trajectories in a strange attractor is usually quantitatively described by a positive Lyapunov exponent [50]. This stochastic property of chaotic systems can be globally specified by the concept of Kolmogorov-Sinai (KS) entropy [57, 58]. This concept is an information-theoretic-based measure of the rate of information production in dynamical systems and therefore characterizes the degree of randomness in one-step predictions based on the whole past of a chaotic dynamical system. In some typical chaotic systems, the sum of the positive Lyapunov exponents yields the KS entropy [59]. In general, this sum is an upper bound of the KS entropy [59]. The KS entropy is usually defined in the context of symbolic dynamics [56]. The symbolic dynamics can be defined by a partition of the phase space owing to the uncertainty in the measurement device [51, 54]. If the state of a dynamical system is measured with finite precision, the KS entropy yields a measure of the uncertainty of a future state provided the whole past is known. Some analytical results on the calculation and the behavior of the KS entropy in one-dimensional chaotic maps can be found in Szepfalusy [55]. On the other hand, several other statistics-based measures were introduced in order to characterize dynamical chaos, for example, the cumulants of the information loss in one-step predictions [60], Renyi entropies [55], correlation integrals [61], and the generalized mutual information [53].

A decisively important and essential generalization of the KS entropy is a measure of the information loss or of the statistical correlations for predictions $p$ steps into the future. Especially in the works of Pompe [53, 54] and Matsumoto and Tsuda [62-64], information-theoretic-based measures are formulated for the study of the information flow $p$ steps into the future, but only the dependence of the prediction point on the present point, and not on the entire past, is considered therein. The aim of this section is to formulate a new entropy measure of the statistical correlations between the entire past of a dynamical system and a state $p$ steps into the future. This measure, which is based on the concept of mutual information, characterizes the information loss through time for chaotic systems, providing us with a quantitative measure of the prediction horizon and of the evolution of the uncertainty in $p$-step predictions. In the context of the symbolic dynamics generated by a chaotic attractor, the introduced entropy measure yields a consistent instrument for the study of the evolution of the information loss during time. The information loss provides us with a mean measure of the sensitivity to the initial conditions, which is the main characteristic of chaotic dynamical systems. In order to characterize the dynamics on a fine scale and to describe the true and intrinsic information flow of chaotic systems, infinitesimal partitions are used. In the framework of the thermodynamical formulation, a new characterization of information flow is formulated based on the behavior of the conditional mutual
information between the points a single step and $p$ steps ahead in the future, given the entire past. The zeta-function formalism is applied. The result obtained offers a rigorous definition of chaotic systems in the language of information theory. A dynamical system is chaotic if the generalized Kolmogorov-Sinai entropy for a point $p$ steps ahead in the future increases linearly with $p$, which is equivalent to saying that the aforementioned conditional mutual information is always constant and equal to the Kolmogorov-Sinai entropy.
A. INFORMATION FLOW FOR FINITE PARTITIONS

This section deals with sequences consisting of a finite number of symbols chosen from a finite alphabet $\mathcal{A}$. A sequence of $n$ symbols will be referred to as a block of length $n$. If $\mathcal{A}$ consists of $m$ different symbols, there are $k_n = m^n$ different blocks of length $n$. A measure of the uncertainty of blocks of length $n$ is given by the block entropy $H_n$ defined by
$$H_n = -\sum_{i=1}^{k_n} p_i \log p_i. \tag{87}$$
Here, $p_i$ is the probability of finding a block of type $i$ if a block of length $n$ is randomly chosen. The sum extends over all types of blocks of length $n$. Another relevant quantity is the entropy per step $h_n$ given by
$$h_n = H_{n+1} - H_n, \tag{88}$$
which is a measure of the uncertainty or of the predictability of an added symbol given the previous $n$ symbols. A measure of the predictability of an added symbol independent of the block length is defined by the source entropy $h_s$ (also called the entropy rate) given by
$$h_s = \lim_{n\to\infty} H^{(n)}, \tag{89}$$
where $H^{(n)} = H_n/n$ is the uncertainty per symbol of a block of length $n$. It is easy to demonstrate [19] that
$$h_s = \lim_{n\to\infty} h_n. \tag{90}$$
Using the chain rule for the entropy [19], we obtain
$$h_s = \lim_{n\to\infty} H_{n|n-1,\dots,1}, \tag{91}$$
with $H_{n|n-1,\dots,1}$ being the conditional entropy of the $n$th symbol given the previous $n-1$ symbols. This can be written as
$$H_{n|n-1,\dots,1} = -\sum_{i=1}^{k_{n-1}} \sum_{j=1}^{m} p_{j,i} \log p_{j/i}, \tag{92}$$
where $p_{j,i}$ and $p_{j/i}$ are the joint and conditional probabilities, respectively, of observing the symbol $j$ after the observation of a block of type $i$ randomly chosen from the blocks of length $n-1$. The source entropy is a measure of the uncertainty of an added symbol given the whole past and therefore is an absolute measure of the predictability of the next symbol. For Markov processes of order $p$, Shannon [17] proved that the source entropy reaches its limit for $n = p$.

Now the same concepts can be used in the ergodic theory of chaotic systems [49, 50]. To do so, it is necessary to apply the concept of symbolic dynamics. Let us define a time series $\{x_t\}$, $t = 1, 2, \dots$, and a partition $\beta$ of an attractor $A$ which is a set of disjoint boxes $B_i$, that is,
$$\beta = \{B_i\}_{i=1}^{m}, \qquad \bigcup_{i=1}^{m} B_i = A, \qquad B_i \cap B_j = \emptyset, \quad i \neq j. \tag{93}$$
By interpreting each box $B_i$ as a symbol $i$, we transform the original, real-valued time series $\{x_t\}$, $t = 1, 2, \dots$, into a sequence $\{i_t\}$, $t = 1, 2, \dots$, of $m$ different symbols and thereby define the symbolic dynamics. Let $p_i^{\beta}$ denote the probability of observing symbol $i$ of the partition $\beta$. Then the entropy of the symbolic sequence is defined by
$$H^{\beta} = H^{\beta}(1) = -\sum_{i=1}^{m} p_i^{\beta} \log p_i^{\beta}. \tag{94}$$
The block entropy for the symbolic dynamics is now given by
$$H^{\beta}(n) = -\sum_{i=1}^{k_n} p_{n,i}^{\beta} \log p_{n,i}^{\beta}, \tag{95}$$
where $p_{n,i}^{\beta}$ is the probability of finding a block of type $i$ if a block of length $n$ is randomly selected from the symbolic sequence. Let us now introduce the joint entropy
$$H^{\beta}(n,p) = -\sum_{i=1}^{k_n} \sum_{j=1}^{m} p_{n,i;p,j}^{\beta} \log p_{n,i;p,j}^{\beta}, \tag{96}$$
which is the entropy of the set of patterns which are the concatenation of $n$ subsequent symbols and the symbol which is $p$ steps ahead. In Eq. (96), $p_{n,i;p,j}^{\beta}$ is the probability of observing a block of type $i$ randomly chosen from blocks of length $n$ and the symbol $j$ which is $p$ steps ahead with respect to the originally chosen block. Evidently, $H^{\beta}(n,1) = H^{\beta}(n+1)$. The source entropy in this case is given by
$$h^{\beta} = \lim_{n\to\infty} \frac{H^{\beta}(n)}{n} = \lim_{n\to\infty} \bigl(H^{\beta}(n+1) - H^{\beta}(n)\bigr) = \lim_{n\to\infty} \bigl(H^{\beta}(n,1) - H^{\beta}(n)\bigr) = \lim_{n\to\infty} H^{\beta}\bigl((n+1)|n,\dots,1\bigr). \tag{97}$$
In the last equation, $H^{\beta}((n+1)|n,\dots,1)$ is the conditional entropy of the one-step prediction given the previous symbols. In this context, it is now possible to introduce the Kolmogorov-Sinai (KS) or metric entropy [57, 58] of the underlying dynamical system as the supremum over all possible partitions, that is,
$$h = \sup_{\beta} h^{\beta}. \tag{98}$$
Each partition for which the supremum is reached is called a generating partition. In this section, we are interested in studying the flow of information, that is, in measuring the predictability not only one step ahead (KS entropy) but eventually arbitrary $p$ steps into the future. Therefore, a natural extension of the concept of KS entropy is given by
$$I_p^{\beta} = \lim_{n\to\infty} I^{\beta}(n,p), \tag{99}$$
where $I^{\beta}(n,p) = H^{\beta} + H^{\beta}(n) - H^{\beta}(n,p) = H^{\beta} - H^{\beta}((n+p)|n,\dots,1)$ is the mutual information between a word of $n$ subsequent symbols and the symbol which is $p$ steps ahead. The mutual information is a measure of the statistical correlation between two random variables. In this case, we are therefore measuring the statistical correlation between the symbol $p$ steps ahead and the whole past $n \to \infty$. It is easy to see that
$$0 \le I^{\beta}(n,p) \le H^{\beta}, \tag{100}$$
where the minimal value (0) corresponds to statistical independence and the maximal value to perfect predictability. For chaotic time series, we expect $I^{\beta}(n,p) > I^{\beta}(n,p+1)$, which expresses the loss of information in the prediction horizon. For the special case of $p = 1$ and if $\beta_g$ is a generating partition, we get
$$I_1^{\beta_g} = H^{\beta_g} - \lim_{n\to\infty} \bigl(H^{\beta_g}(n+1) - H^{\beta_g}(n)\bigr) = H^{\beta_g} - h, \tag{101}$$
where $H^{\beta_g}$ is the entropy corresponding to the symbolic dynamics associated with the generating partition $\beta_g$. We see that for the case of one-step predictions we recover the concept of KS entropy as the amount of lost information. As a result, the information flow is an extension of the concept of KS entropy to predictions $p$ steps into the future. It is important to notice the differences between this definition of information loss and the one given by other authors. Pompe [53, 54] and Matsumoto and Tsuda [62-64] formulate mutual-information-based measures for the study of the information flow $p$ steps into the future, but they consider only the dependence of the prediction point on the present point and not on the entire past; that is, they consider basically $I^{\beta}(1,p)$ instead of our definition in Eq. (99).
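To make the finite-partition quantities concrete, the sketch below estimates the block entropy of Eq. (95), the joint entropy of Eq. (96), and the mutual information $I^{\beta}(n,p)$ of Eq. (99) from a finite symbolic sequence using plug-in (empirical) probabilities. This is an illustrative sketch, not code from the chapter: the plug-in estimator is biased for short sequences (compare the finite-sample effects discussed in [44]), and the demonstration map, partition, and parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from collections import Counter

def entropy(counts):
    """Shannon entropy (nats) of an empirical distribution given by counts."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def block_entropy(symbols, n):
    """H(n): entropy of blocks of n consecutive symbols, cf. Eq. (95)."""
    blocks = (tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    return entropy(Counter(blocks))

def joint_entropy(symbols, n, p):
    """H(n, p): entropy of (block of length n, symbol p steps ahead), cf. Eq. (96)."""
    pairs = ((tuple(symbols[i:i + n]), symbols[i + n - 1 + p])
             for i in range(len(symbols) - n - p + 1))
    return entropy(Counter(pairs))

def information_flow(symbols, n, p):
    """I(n, p) = H(1) + H(n) - H(n, p), cf. Eq. (99)."""
    return (block_entropy(symbols, 1) + block_entropy(symbols, n)
            - joint_entropy(symbols, n, p))

# Demonstration: symbolic dynamics of the logistic map x -> 4x(1 - x)
# with the generating bipartition B_0 = [0, 1/2), B_1 = [1/2, 1].
x, xs = 0.3, []
for _ in range(200000):
    x = 4.0 * x * (1.0 - x)
    xs.append(x)
symbols = [1 if v >= 0.5 else 0 for v in xs]

for p in range(1, 6):
    print(p, information_flow(symbols, n=8, p=p))
```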
B. INTRINSIC INFORMATION FLOW (INFINITESIMAL PARTITION)

In the case of infinitesimal partitions, it is not possible to study the information flow by the mutual information $I_p^{\beta} = \lim_{n\to\infty} I^{\beta}(n,p)$, owing to the fact that the partition $\beta$ which is used in this case corresponds to a partition with infinitesimal diameter. Therefore, the first term in the definition of the information flow, that is, $H^{\beta}$, is always infinite, yielding an infinite (and, consequently, meaningless) $I_p$ for all $p$. These infinitesimal partitions are essential for the study of the effect of sensitivity to the initial conditions in the present information-theoretic framework. The relevant term for the evolution of statistical dependences between future and past is therefore the second term of the mutual information. We denote this term $h_p^{\beta}$ and call it the generalized conditional entropy for a point $p$ steps into the future, that is,
$$h_p^{\beta} = \lim_{n\to\infty} H^{\beta}(n+p\,|\,n,\dots,1). \tag{102}$$
The generalized Kolmogorov-Sinai entropy for a point $p$ steps ahead is defined by taking the supremum over all possible partitions, that is,
$$h_p = \sup_{\beta} h_p^{\beta}, \tag{103}$$
which yields a partition-independent result and therefore characterizes the intrinsic information flow. The supremum can be replaced by the limit [56]
$$h_p = \lim_{\varepsilon\to 0} h_p^{\beta}, \tag{104}$$
where $\varepsilon = \max_i(\mathrm{diameter}(B_i))$ corresponding to the partition $\beta$. The evolution of the uncertainty in the case of infinitesimal partitions, that is, the information-theoretic measure of the sensitivity to initial conditions, is therefore described by $h_p$.
The increase of $h_p$ as a function of $p$ can be studied by analyzing the evolution of the information flow
$$I_p = \lim_{\varepsilon\to 0} \lim_{n\to\infty} I^{\beta}(n+1;\, n+p\,|\,n,\dots,1) \tag{105}$$
as a function of $p$. Here, the conditional mutual information $I^{\beta}(n+1;\, n+p\,|\,n,\dots,1)$ is given by
$$I^{\beta}(n+1;\, n+p\,|\,n,\dots,1) = H^{\beta}(n+p\,|\,n,\dots,1) - H^{\beta}(n+p\,|\,n+1,\dots,1). \tag{106}$$
$I_p$ is the supremum of the asymptotic value of the conditional mutual information between the points one and $p$ steps ahead given the $n$ previous symbols. In fact,
$$h_p = \sum_{j=1}^{p} I_j, \tag{107}$$
which can be easily derived using the following relations:
$$H^{\beta}(n+p\,|\,n,\dots,1) = \sum_{j=2}^{p}\bigl(H^{\beta}(n+j\,|\,n,\dots,1) - H^{\beta}(n+j-1\,|\,n,\dots,1)\bigr) + H^{\beta}(n+1\,|\,n,\dots,1), \tag{108}$$
$$\lim_{n\to\infty} I^{\beta}(n+1;\, n+p\,|\,n,\dots,1) = \lim_{n\to\infty}\bigl(H^{\beta}(n+p\,|\,n,\dots,1) - H^{\beta}(n+p\,|\,n+1,\dots,1)\bigr) = \lim_{n\to\infty}\bigl(H^{\beta}(n+p\,|\,n,\dots,1) - H^{\beta}(n+p-1\,|\,n,\dots,1)\bigr). \tag{109}$$
Using the fact that conditioning reduces the entropy leads to $h_p \ge h_{p-1}$ and therefore $h_p \ge h$. On the other hand, using
$$I^{\beta}(n+1,\dots,n+p-1;\, n+p\,|\,n,\dots,1) = H^{\beta}(n+p\,|\,n,\dots,1) - H^{\beta}(n+p\,|\,n+p-1,\dots,1) \tag{110}$$
yields
$$h_p = ph - \lim_{\varepsilon\to 0}\lim_{n\to\infty} H^{\beta}(n+1,\dots,n+p-1\,|\,n+p,n,\dots,1) \tag{111}$$
and, because of the positivity of the entropy, $h_p \le ph$. To summarize, it was proven that
$$h \le h_p \le ph. \tag{112}$$
It immediately follows that
$$I_p \le h \tag{113}$$
holds. Using the zeta-function formalism, we will prove that the intrinsic information flow is given by
$$I_p = h \quad \forall p, \tag{114}$$
which is equivalent to
$$h_p = ph \quad \forall p. \tag{115}$$
This means that a chaotic system is characterized by an intrinsic information flow which is constant and therefore produces a linear increase, as a function of $p$, of the prediction uncertainty of a point $p$ steps ahead given the whole past, independent of the partition used. It is important to notice that for a Markov process of order $p$ the mutual information $\lim_{n\to\infty} I^{\beta}(n+1;\, n+p\,|\,n,\dots,1)$ decreases with increasing $p$, meaning that the points $p$ steps ahead are more and more independent of the past points. In other words, the particularity of a chaotic system is that $I_p$ is always constant and equal to the KS entropy and does not decay to zero as for Markov processes. The essential point in the characterization of the information flow is the fact that we study the supremum over all partitions (i.e., $\varepsilon \to 0$); only in this manner can we obtain an "intrinsic" evolution of the information flow. For partitions that do not maximize $\lim_{n\to\infty} I^{\beta}(n+p;\, n,\dots,1)$ for all $p$, a spurious Markov process may be observed (e.g., a tent map where the classical generating bipartition seems to have the structure of a first-order Markov process, i.e., $I_p = 0$ for $p > 1$; see the example in the preceding section). This means that a generating partition which maximizes only the KS entropy is not sufficient for a characterization of the information flow. To summarize, the "true" information flow in chaotic systems is given by the fact that $I_p = h\ \forall p$. This is the principal result of this section. For the particular case of the tent map, this may easily be verified by refining the partitions. This results in a profile of $I_p$ as a function of $p$ that looks like a higher-order Markov process, that is, $I_p = 0$ for $p > p_\varepsilon$, with $p_\varepsilon$ increasing when $\varepsilon$ decreases (i.e., $I_p = h\ \forall p$ for $p_\varepsilon \to \infty$ and $\varepsilon \to 0$). The proof of this will be given now for one-dimensional hyperbolic chaotic systems. Let us consider a decimated map $x_{(n+1)p} = f^{(p)}(x_{np})$. The KS entropy of this map is evidently
$$h^{(p)} = \lim_{\varepsilon\to 0}\lim_{n\to\infty} H^{\beta}(np+p\,|\,np, np-p,\dots,p) = \lim_{\varepsilon\to 0}\lim_{n\to\infty} H^{\beta}(n+p\,|\,n, n-p,\dots,p) = ph. \tag{116}$$
A sufficiently fine partition is also a generating partition of $f^{(p)}$. An infinite string of past symbols of the decimated sequence allows us to pin down $x_{np}$ arbitrarily precisely, so that the other values of $x_k$ (with $k \neq np, np-p, \dots, p$) do not carry independent information and therefore $h_p = h^{(p)} = ph$. An alternative proof based on the thermodynamical formalism is summarized in the following. The proof demonstrates that $I_p = h\ \forall p$ by decomposing the attractor into unstable periodic orbits. The associated zeta function (see [56]) is defined in such a way that its zeros result in $I_p$, and it is shown that all zeros are identical and yield the same value $h$. This means that $I_p$ is equal to $h$ for all $p$; that is, the process always loses information, even for $p \to \infty$. A system can always lose information only if it has infinite information, and this is the case for deterministic chaos ($I_{n,p} = \infty$). The deterministic character of the dynamical equations is reflected by the fact that the process contains infinite information, and therefore it can always lose information with increasing $p$ (i.e., $I_p = h$). In other words, the memory of the process is infinite. Moreover, the fact that $h_p = ph\ \forall p$ for chaos implies that the KS entropy $h$ fully characterizes the information flow in the system. For a deterministic and nonchaotic process, the information is also infinite, that is, $I_{n,p} = \infty$, but the system never loses information ($I_p = 0\ \forall p$ because $h = 0$). It is important to notice that for chaos the most relevant fact is the study of the "intrinsic" information flow, that is, for the case of infinitesimal partitions, because it describes the sensitivity to the initial conditions on a fine scale and therefore characterizes the intrinsic evolution of statistical correlations.
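The tent-map observation above can be checked numerically. The sketch below is an illustrative experiment, not code from the chapter: it generates a tent-map trajectory (with a tiny dither to avoid the well-known collapse of the tent map under exact binary floating-point doubling), symbolizes it with uniform partitions of decreasing diameter $\varepsilon = 1/(\text{number of cells})$, and estimates $I_p = h_p - h_{p-1}$ from empirical conditional block entropies, cf. Eqs. (107) and (109). All parameter values here are assumptions. The bipartition should show $I_p \approx 0$ for $p > 1$, while finer partitions keep $I_p$ near $h = \log 2$ for more values of $p$.

```python
import numpy as np
from collections import Counter

def entropy(counts):
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def cond_block_entropy(sym, n, p):
    """Empirical H(n+p | n, ..., 1) = H(block, symbol p steps ahead) - H(block)."""
    idx = range(len(sym) - n - p + 1)
    pairs = Counter((tuple(sym[i:i + n]), sym[i + n - 1 + p]) for i in idx)
    blocks = Counter(tuple(sym[i:i + n]) for i in idx)
    return entropy(pairs) - entropy(blocks)

# Tent-map trajectory; the dither keeps the orbit from degenerating in float64.
rng = np.random.default_rng(0)
x, xs = 0.2345, np.empty(200000)
for t in range(xs.size):
    x = 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)
    x = min(max(x + 1e-12 * rng.standard_normal(), 0.0), 1.0 - 1e-15)
    xs[t] = x

for cells in (2, 4, 8):                      # partition diameter eps = 1 / cells
    sym = np.minimum((xs * cells).astype(int), cells - 1).tolist()
    h_prev, flows = 0.0, []
    for p in range(1, 6):
        h_p = cond_block_entropy(sym, n=6, p=p)
        flows.append(h_p - h_prev)           # I_p = h_p - h_{p-1}
        h_prev = h_p
    print(cells, np.round(flows, 2))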
V. CONCLUSIONS

In this chapter, an information-theoretic-based theory has been presented which, in a unifying fashion, poses the problem of structure extraction from observed data in a solid mathematical framework. Two approaches were analyzed: parametric and nonparametric.

The parametric formulation establishes a new statistical theory of neural learning which corresponds to the biologically inspired theory of redundancy minimization of Barlow. Barlow describes the process of cognition as a preprocessing of the sensorial information performed by the nervous system in order to extract the statistically relevant and independent features of the inputs without losing information. This means that the brain should statistically decorrelate the extracted information. As a learning strategy, Barlow formulated the principle of redundancy reduction. The term redundancy refers to the statistical dependence between the components involved, and therefore the principle of learning by redundancy reduction tries to find a transformation such that the transformed components are as statistically independent as possible. The basic idea is to define a
constrained, general, nonlinear, parametric model (e.g., a neural network) that ensures the perfect transmission of information by performing a volume-conserving input-output transformation, and to achieve by learning a redundancy minimization between the output components, that is, to statistically decorrelate the output components by factorizing the joint output probability. This kind of transformation is called independent component analysis (ICA). Our theory formulates ICA in an information-theoretic framework and therefore generalizes previous work on ICA in two ways: our architecture performs a general nonlinear transformation without loss of information, and the statistical decorrelation is performed without assuming Gaussian distributions. Several architectures and paradigms are presented which handle the cases of nonlinear and linear ICA with Gaussian and non-Gaussian input distributions. Principal component analysis (PCA) is formulated as the very restricted, particular case of ICA in which the transformation involved is linear and orthogonal and the input distribution is Gaussian.

An ensemble model of unsupervised learning defined by redundancy reduction at the output components and entropy conservation from inputs to outputs has also been derived. We have obtained an approximate expression for the probability distribution of the output components which is essentially determined by the probability distribution given by the best network and by the square root of the ratio between the determinants of the Fisher information with and without including the new point. In this framework, the problem of supervised learning has been posed as an unsupervised one. The ensemble theory derived for unsupervised learning then results in one for supervised learning by using ensemble theory based on the maximum-likelihood principle. An upper bound for the prediction probability of a new point not included in the training data is given. This upper bound is essentially determined by the ratio between the Fisher information given the training set and the one given a set which includes both the training data and the new point. This upper bound can be used as a mechanism to actively decide on the novelty of new data, and therefore it is a mechanism of query learning. Query learning aims to improve the generalization ability of a network that continuously learns by actively selecting optimal, nonredundant data, that is, data that contain new information for the model.

A very interesting perspective of this theory is its extension to the case of biologically based neural networks, that is, networks of spiking neurons. The ideas formulated in this work should be extended in order to understand what a real brain is actually doing. Information maximization through a network of spiking neurons, instead of information preservation, would surely be required, together with the same goal of redundancy minimization. The essential problem here is the definition of the information channel and the identification of how information is coded in the spike responses. Work of the author in this direction is now in progress.
In the nonparametric formulation, we address an a priori requirement for the parametric extraction of statistical structures, namely the detection of their existence and their characterization. For time series, for example, it is useful to know if the dynamics that originate the observed values are stationary or nonstationary, if the time series is deterministic or stochastic, and to be able to distinguish between white noise, colored noise, Markov processes, and chaotic and nonchaotic determinism. The detection and characterization of such dependences should therefore be performed beforehand in a nonparametric fashion in order to be able to model the process a posteriori in a parametric form. Another important application is the following one. When a parametric model of a nonstationary time series is constructed based on data, as in neural networks, it is very important to train the model with a set of data which contains the underlying structure to be discovered. The regions where only noisy behavior is observed should be ignored. Information about the predictability can be used, for example, to select regions where a temporal structure is visible in order to select data for training a neural network for prediction.

The basic problem is of a statistical nature, and therefore information theory offers the ideal theoretical framework for a mathematical formulation. In this work, we introduced a nonparametric, cumulant-based, statistical approach for detecting statistical dependences in nonstationary data. The statistical dependence is detected by measuring the predictability, which tests the null hypothesis of statistical independence, expressed in Fourier space, by the surrogate method. The formulation of statistical independence in Fourier space leads automatically and consistently to a cumulant-based measure. In this formulation, linear and nonlinear as well as Gaussian and non-Gaussian temporal dependences can be detected in time series. Contrary to box-counting-based methods for the estimation of temporal redundancy, this cumulant-based predictability offers a method for the estimation of temporal redundancy which is reliable even in the case of a small number of data. This fact permits us to avoid the assumption of stationarity and to use the slicing-window-based method for measuring the nonstationarity, which can only be used reliably when the estimation of predictability can be done with a small number of data points. Therefore, the predictability is defined as a higher-order cumulant-based discriminating significance between the original data and a set of scrambled surrogate data which correspond to the null hypothesis of a noncausal relationship between past and present, that is, of a random process with statistical properties equivalent to the original time series. We demonstrated the efficiency of the methods for several artificial and real-world time series (financial and medical), which include chaos, stochasticity, and nonstationarity.

When structure is detected in a nonparametric fashion, additional information about the underlying dynamics can be extracted by the herein introduced concept
of information flow. The information flow $I_p$ characterizes the loss of information in dynamical systems and describes the decay of statistical dependences between the entire past and a point $p$ steps into the future as a function of $p$. We studied this concept in the framework of symbolic dynamics for the cases of finite and infinitesimal partitions. In the case of chaotic systems, the most relevant case is the study of the "intrinsic" information flow, that is, for the case of infinitesimal partitions, because it describes the sensitivity to the initial conditions on a fine scale and therefore characterizes the intrinsic evolution of statistical correlations. The intrinsic information flow of hyperbolic chaotic systems is characterized by a constant value of $I_p$ equal to the KS entropy $h$. This fact offers an interpretation of chaos based on the information flow: a deterministic chaotic process can permanently lose a constant amount of information ($I_p = h$) because of an infinite information content and because of stochasticity. This means that $I_p$ is equal to $h$ for all $p$; that is, the process always loses information, even for $p \to \infty$. The deterministic character of the dynamical equations is reflected by the fact that the process contains infinite information, and therefore it can permanently lose information. In other words, the memory of the process is infinite. Moreover, the fact that $h_p = ph\ \forall p$ for chaos implies that the KS entropy $h$ fully characterizes the information flow in the system. For a deterministic and nonchaotic process, the information is also infinite, but the system never loses information ($I_p = 0\ \forall p$) because $h = 0$.

The information flow can be used to classify the dynamic behavior of complex systems, thereby guiding the selection of the most convenient form of the parametric model that should be used for modeling the data. A possible extension would be the formulation of the concept of information flow for the multivariate case in order to handle problems such as multivariate EEG or spatiotemporal turbulence as in transit flows or weather patterns. In this case, not only should present-past structures be detected, but statistical cross-correlations between the different time series could also give us information about the local or global character of the dynamics involved. This could be crucial for the analysis of EEG data, where it is suspected that, in the case of epilepsy or under the influence of medicaments, the global character of the dynamics is affected. Similarly, in weather-forecasting studies of the global character of the dynamics, information could be gained about the range of the local corrections that should be considered in local forecasting. Work in this direction is also currently in progress.

We conclude that the theory exposed in this work not only presents a mathematically consistent solution of the problem of neural learning and of a priori nonparametric detection and characterization of structures, but also offers a concrete framework for the development and exploration of methods and theories for one of the most fascinating scientific fields: complex systems.
REFERENCES

[1] G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.
[2] F. Attneave. Informational aspects of visual perception. Psychol. Rev. 61:183-193, 1954.
[3] H. Barlow. Sensory mechanism, the reduction of redundancy, and intelligence. In National Physical Laboratory Symposium 10. The Mechanization of Thought Processes. Her Majesty's Stationery Office, London, 1959.
[4] H. Barlow. Unsupervised learning. Neural Comput. 1:295-311, 1989.
[5] H. Barlow, T. Kaushal, and G. Mitchison. Finding minimum entropy codes. Neural Comput. 1:412-423, 1989.
[6] G. Deco and D. Obradovic. Linear redundancy reduction learning. Neural Networks 8:751-755, 1995.
[7] R. Linsker. Self-organization in a perceptual network. Computer 21:105, 1988.
[8] R. Linsker. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comput. 1:402-411, 1989.
[9] R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. 4:691-702, 1992.
[10] D. Obradovic and G. Deco. Generalized linear features extraction: an information theory approach. Neurocomputing 12:203-221, 1996.
[11] G. Deco and W. Brauer. Higher order statistics with neural networks. In Advances in Neural Information Processing (G. Tesauro, D. Touretzky, and T. Leen, Eds.), Vol. 7, pp. 247-254. MIT Press, Cambridge, MA, 1994.
[12] G. Deco and W. Brauer. Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks 8:525-535, 1995.
[13] G. Deco and B. Schurmann. Learning time series evolution by unsupervised extraction of correlations. Phys. Rev. E 51:1180-1190, 1995.
[14] G. Deco and B. Schurmann. Statistical ensemble theory of redundancy reduction and the duality between unsupervised and supervised learning. Phys. Rev. E 52:6580-6587, 1995.
[15] L. Parra, G. Deco, and S. Miesbach. Redundancy reduction with information-preserving nonlinear maps. Network 6:61-72, 1995.
[16] L. Parra, G. Deco, and S. Miesbach. Statistical independence with information preserving maps. Neural Comput. 8:262-271, 1996.
[17] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J. 27:379-423, 623-656, 1948.
[18] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist. 22:79-86, 1951.
[19] T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 1991.
[20] P. Comon. Independent component analysis, a new concept? Signal Process. 36:287-314, 1994.
[21] A. Papoulis. Probability, Random Variables and Stochastic Processes, 3rd ed. McGraw-Hill, New York, 1991.
[22] C. W. Gardiner. Handbook of Stochastic Methods, 2nd ed. Springer-Verlag, New York, 1990.
[23] G. Deco and D. Obradovic. An Information-Theoretic Approach to Neural Computing. Springer-Verlag, New York, 1996.
[24] M. G. Kendall and A. Stuart. The Advanced Theory of Statistics. Charles Griffin, London, 1969.
[25] B. C. Y. Wong and I. F. Blake. Detection in multivariate non-Gaussian noise. IEEE Trans. Comm. 42, 1994.
[26] M. Casdagli. Nonlinear prediction of chaotic time series. Phys. D 35:335-356, 1989.
[27] G. Deco and B. Schurmann. Recurrent neural networks capture the dynamical invariance of chaotic time series. IEICE Trans. Fund. Electron., Comm. Comput. Sci. 77-A:1840-1845, 1994.
[28] F. Takens. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence. Lecture Notes in Mathematics (D. A. Rand and L. S. Young, Eds.), Vol. 898, pp. 366-381. Springer-Verlag, 1980.
[29] T. Sauer, J. Yorke, and M. Casdagli. Embedology. J. Statist. Phys. 65:579-616, 1991.
[30] W. Liebert and H. G. Schuster. Proper choice of the time delay for the analysis of chaotic time series. Phys. Lett. A 142:107-111, 1989.
[31] W. Liebert, K. Pawelzik, and H. G. Schuster. Optimal embedding of chaotic attractors from topological considerations. Europhys. Lett. 14:521-526, 1991.
[32] M. Mackey and L. Glass. Oscillation and chaos in physiological control systems. Science 197:287-291, 1977.
[33] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33:1134, 1986.
[34] N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Vol. 2, pp. 403-409. IEEE Press, Washington, DC, 1989.
[35] R. Meir and F. Fontanari. Data compression and prediction in neural networks. Phys. A 200:644-654, 1993.
[36] J. Rissanen. Modeling by shortest data description. Automatica 14:465-471, 1978.
[37] J. Rissanen. Stochastic complexity and modeling. Ann. Statist. 14:1080-1100, 1986.
[38] N. Abraham, A. Albano, A. Passamante, and P. Rapp. Measures of Complexity and Chaos. Plenum, New York, 1989.
[39] H. Tong. Non-linear Time Series: A Dynamical System Approach. Clarendon, Oxford, 1990.
[40] J. Theiler, S. Eubank, A. Longtin, B. Galdrikian, and J. Farmer. Testing for nonlinearity in time series: the method of surrogate data. Phys. D 58:77-94, 1992.
[41] M. Palus, V. Albrecht, and I. Dvorak. Information theoretic test for nonlinearity in time series. Phys. Lett. A 175:203-209, 1993.
[42] M. Palus. Testing for nonlinearity using redundancies: quantitative and qualitative aspects. Phys. D 80:186-205, 1995.
[43] D. Prichard and J. Theiler. Generalized redundancies for time series analysis. Phys. D 84:476-493, 1995.
[44] H. Herzel, A. Schmitt, and W. Ebeling. Finite sample effects in sequence analysis. Chaos, Solitons Fractals 4:91-113, 1994.
[45] L. Breiman. Statistics. Houghton Mifflin, Boston, 1973.
[46] M. Henon. A two-dimensional mapping with a strange attractor. Comm. Math. Phys. 50:69, 1976.
[47] P. E. Rapp, I. D. Zimmerman, A. M. Albano, G. C. deGuzman, N. N. Greenbaum, and T. R. Bashore. Experimental studies of chaotic neural behavior: cellular activity and electroencephalographic signals. In Nonlinear Oscillations in Biology and Chemistry. Springer-Verlag, Berlin/New York, 1986.
[48] J. P. Pijn. Quantitative evaluation of EEG signals in epilepsy: nonlinear associations, time delays and nonlinear dynamics. Ph.D. Thesis, University of Amsterdam, 1990.
[49] R. S. Shaw. Strange attractors, chaotic behavior, and information flow. Z. Naturforsch. 36a:80-112, 1981.
[50] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Rev. Modern Phys. 57:617-656, 1985.
[51] B. Pompe, J. Kruscha, and R. Leven. State predictability and information flow in simple chaotic systems. Z. Naturforsch. 41a:801-818, 1986.
[52] B. Pompe and R. Leven. Transinformation of chaotic systems. Phys. Scripta 33:8-13, 1986.
[53] B. Pompe. Measuring statistical dependences in a time series. J. Statist. Phys. 73:587-610, 1993.
[54] B. Pompe. On some entropy methods in data analysis. Chaos, Solitons Fractals 4:83-96, 1994.
[55] P. Szepfalusy. Characterization of chaos and complexity by properties of dynamical entropies. Phys. Scripta 25:226-229, 1989.
[56] C. Beck and F. Schlogl. Thermodynamics of Chaotic Systems. Cambridge Nonlinear Science Series. Cambridge University Press, Cambridge, 1993.
[57] A. Kolmogorov. A new metric invariant of transient dynamical system and automorphism in Lebesgue spaces. Dokl. Akad. Nauk SSSR 119:861-864, 1958.
[58] Ya. G. Sinai. On the concept of entropy for a dynamic system. Dokl. Akad. Nauk SSSR 124:768-771, 1959.
[59] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, Cambridge, 1993.
[60] F. Schlogl. The variance of information loss as a characteristic quantity of dynamical chaos. J. Statist. Phys. 46:135-146, 1987.
[61] P. Grassberger and I. Procaccia. Characterization of strange attractors. Phys. Rev. Lett. 50:346, 1983.
[62] K. Matsumoto and I. Tsuda. Information theoretical approach to noisy dynamics. J. Phys. A 18:3561-3566, 1985.
[63] K. Matsumoto and I. Tsuda. Extended information in one-dimensional maps. Phys. D 26:347-357, 1987.
[64] K. Matsumoto and I. Tsuda. Calculation of information flow rate from mutual information. J. Phys. A 21:1405-1414, 1988.
Radial Basis Function Network Approximation and Learning in Task-Dependent Feedforward Control of Nonlinear Dynamical Systems

Dimitry Gorinevsky
Honeywell-Measurex
North Vancouver, British Columbia V7J 3S4, Canada
I. INTRODUCTION

Motor control in humans and animals is believed to use feedforward widely instead of relying only on feedback. One hypothesis on feedforward control organization is that control programs of complex motions are learned, memorized, and then just extracted from the memory when needed. This hypothesis provided an inspiration for this chapter, which presents a paradigm for task-level feedforward control. The practical value of this paradigm is demonstrated by applying it to solve a difficult control problem.

Classical automatic control theory heavily concentrates on the issues of low-level feedback control or feedforward compensation of disturbances. This is because most controlled systems used in industrial and other applications in the past two decades, and many of those used now, are characterized by a low level of
computational power and relatively simple logic of operation. In addition to feedback control, classical control theory paid much attention to the issues of open-loop (programmed) control, in particular, optimal control. This was initially motivated by space flight and ballistic missile control applications, where the control program for a mission can be accurately computed in advance.

Many modern advanced control systems are integrated systems with high-performance computers, multiple sensors, and actuators. These advanced systems possess automated operation, adaptation and self-tuning features, built-in identification, and fault diagnosis capability. The trend in control practice is toward the development and deployment of intelligent control systems performing increasingly complex tasks with minimal or no operator supervision and adapting to changing operation conditions. The setup, commissioning, and operation regime changes for such systems should also be automated. Development of intelligent systems requires comprehensive multilayered control and information processing architectures, which are more complicated than classical feedback loops.

The task-level control algorithms considered in this chapter are assumed to be executed one level above the traditional planning and servo feedback level. To define this new control level, we assume that the overall motion planning and feedback tracking of this planned motion is a sequence of separate (but possibly related) tasks. We assume that each task and a control problem associated with this task are completely defined by a few task parameters. The algorithms considered in this chapter operate in discrete time: from task to task. In this respect, the proposed controllers can be considered as hybrid systems.

There are a few groups of papers in the literature more specifically related to the topic and technical approaches of this chapter. Recently, a learning control paradigm has been considered by a number of authors, starting with Arimoto [1], mostly with regard to manipulator path tracking. The paradigm regards a feedforward control program as a high-dimension array, which stores a time history of the feedforward control input. Some authors [1-9] assume that a dynamic model of the system is to some extent known and consider an iterative method for improving the performance by updating a feedforward control program in the course of repeated motion trials. The iterations converge for a single given motion, but the learned control would not work for another trajectory.

Another direction of work related to the technical approach of this chapter is associated with approximation-based control. In a sense, any practical linear control approach is based on a model, which is but an approximation of a real plant. Herein, we consider nonlinear gray-box approaches that use generic computational architectures to approximate unknown nonlinear mappings. Such approximation-based approaches are usually applied within the neural network or fuzzy logic framework. Many authors use neural networks to acquire knowledge of the controlled plant dynamics that allows us to compute feedforward control at each instant, given the
planned motion, velocity, and acceleration (see, e.g., [10-12]). Some related papers consider different techniques for computing the feedforward through approximation of the dynamics mapping of the system by using polynomial associative memories [13-15] or fuzzy control. Such an approach to learning is capable of generalization, which means that the knowledge of inverse dynamics acquired for one trajectory allows us to compute feedforward control for other motions. Yet the robustness of such an approach to high-order dynamics is not always acceptable, and real-time implementation of the approach requires enormous resources, especially if the order of the system dynamics is high.

A distinct learning control paradigm combining features of the two previously described approaches was proposed in the author's earlier papers [16-18]. According to this paradigm, a mapping from the task parameters to the feedforward control program is learned (approximated). An example of the task parameters might include the initial and the desired final state of the system in a transient motion control. The described approach is also related to the task-level control paradigm proposed in [19, 20].

This chapter surveys the author's work on approximation and learning in task-level control problems [16, 17, 21-23]. Herein, we mostly concentrate on the paradigms and algorithm ideas. The convergence results and applications of the approach are referenced but not presented in any detail. The learning process, which we consider in this chapter, can be regarded as an adaptive control of a linearly parametrized nonlinear system. In [24] we demonstrate an application of the algorithms considered in this paper to the adaptive control of an unstructured nonlinear system.

The proposed learning control paradigm has several fundamental advantages that make it potentially very useful for many applications. First, our approach is well suited for real-time implementation, because the entire control program is computed before the motion begins. Second, unlike sophisticated feedback controllers, the approach is robust to high-order unmodeled dynamics. Third, the high sensitivity to system parameter variations, which is inherent in open-loop control, is compensated by the on-line feedback update of the controller parameters.

Our approach is based on using a connectionist network approximation of the control program dependence on the task parameter vector. Such a network has a fixed number of weights that can be updated based on the system operation results. This chapter uses a radial basis function (RBF) network architecture for approximating dependences such as the feedforward program dependence on the task parameters. It has recently been acknowledged that in many problems the RBF networks possess superior spatial filtration and approximation accuracy properties, as compared to the multilayered perceptron networks [25-31]. Even more important for this study is that the RBF network output is linear in the network weights. This property makes the powerful tools of linear
system theory applicable to system estimation (identification) with an RBF network. Algorithms designed in this way converge much faster than the usual neural network algorithms based on gradient descent (backpropagation). Furthermore, the computational complexity of the RBF network approximation does not grow with the dimension of the output variable, which is usually large for problems of this type.

The task-level algorithms considered in this chapter are expected to be useful for many applications. Currently, these algorithms have been applied to problems of robotics, process control, automotive control, and flexible spacecraft control. In this chapter, we demonstrate the efficiency of the proposed approach in the control of flexible arm motions. We simulate very fast arm motions that take only about 1.5 periods of the lowest eigenfrequency oscillations. The system is oscillatory and very nonlinear and is therefore difficult to control using classical approaches even if the system dynamics are known exactly. For this system, we achieve a high control performance by using a learning control algorithm without any a priori information on the system dynamics.

The outline of the chapter is as follows. Section II considers a general statement of the task-level control problem. One of the goals of this section is to show how such problems relate to the classical feedback and programmed control problems. A concise formal mathematical problem statement of the task-level control is related to discrete-time on-line optimization of a static multivariate vector-valued mapping. We formulate four different problem statements, which assume different degrees of uncertainty about the system. The problems stated in Section II are then considered in four subsequent sections of the chapter. Section III presents some background facts on the RBF approximation. The formulation of this section constitutes the basis for the development of Sections V and VI. Section III also discusses a basic architecture of a task-level controller based on the RBF network approximation. Section IV discusses learning control algorithms applicable for a fixed task parameter vector. The problem statement is similar to the standard learning control papers cited previously, but the suggested algorithms differ from most of these papers. The advantage of the algorithms considered is that they can be conveniently generalized to the case of a changing task parameter vector. Sections V and VI propose task-level feedforward control algorithms with learning (adaptive) capabilities. In Section V, we assume that the system sensitivity information is available prior to the system operation, similar to system gain information in feedback control. We then derive an algorithm for on-line update of the feedforward control approximation based on the task completion errors. In Section VI, the sensitivity information is assumed to be inaccurate and is updated on-line in the adaptive control manner. Section VI also presents an application of the algorithm to the terminal control of a two-link flexible manipulator. The manipulator performs a random sequence of point-to-point motion tasks.
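As noted above, the RBF network output is linear in the weights, so fitting the weights is a linear least-squares problem rather than a gradient-descent search. The sketch below illustrates this point with Gaussian basis functions. It is an illustrative example only: the chapter's specific network, centers, and widths are introduced later, and all names and parameter choices here are assumptions.

```python
import numpy as np

def rbf_features(P, centers, width):
    """Gaussian RBF feature matrix: Phi[k, i] = exp(-||p_k - c_i||^2 / (2 width^2))."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy task-parameter data: scalar p, vector-valued target mapping g(p)
rng = np.random.default_rng(1)
P = rng.uniform(-1.0, 1.0, size=(200, 1))              # task parameters
G = np.hstack([np.sin(3 * P), np.cos(2 * P)])          # mapping to approximate

centers = np.linspace(-1, 1, 15).reshape(-1, 1)
Phi = rbf_features(P, centers, width=0.2)

# The network output y(p) = Phi(p) @ W is linear in the weights W, so the
# optimal weights solve a single linear least-squares problem; no
# backpropagation iterations are needed. One solve fits all output
# components at once, so the cost does not grow with the output dimension.
W, *_ = np.linalg.lstsq(Phi, G, rcond=None)

print("max abs fit error:", np.abs(Phi @ W - G).max())
```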
II. PROBLEM STATEMENT

This section presents a formal statement of the learning control problem. The problem is formulated in a general form, so that the statement is applicable to a large class of control systems and tasks. To make the exposition more transparent, Section II.B introduces an example problem—one of feedforward control of a flexible planar arm—and illustrates how the general formulation considered can be applied to this example. Further, in Section VI, the algorithms developed in the following sections are applied to the same example.
A. CONTROL FORMULATION

This chapter considers task-level control problems and algorithms. Such algorithms perform computations and updates of control variables and internal models from task to task. They are essentially discrete-time algorithms evolving with the performed task number. The goal of this and the two following subsections is to explain how these task-level algorithms relate to more classical issues of feedback, feedforward, and programmed control. To this end, we start with a continuous-time controlled system in a state-space form and then arrive at the task-level control formulation for such a system. The task-level control problems and algorithms considered further, however, are generic discrete algorithms (in the same sense as classical optimization algorithms) and can be applied to a wider class of practical problems.

Let us consider a parametric family of nonlinear time-invariant dynamical systems of the form
$$\dot{x} = f(x, u; \mu), \tag{1}$$
$$y = h(x), \tag{2}$$
where $x \in \Re^{n_x}$ is a state vector, $y \in \Re^{n_y}$ is an observation vector, $u \in \Re^{n_u}$ is a control input vector, $\mu \in M \subset \Re^{n_\mu}$ is a vector of system parameters, and $M$ is a given domain. We further assume that the nonlinear mappings $f(\cdot,\cdot;\cdot)$ and $h(\cdot)$ are smooth and the nonlinear system (1), (2) is known to be observable and reachable for each value of the parameter $\mu \in M$. We will discuss the assumptions about the smoothness, controllability, observability, and nature of available information about the system in more detail later on as they will be needed.

The system (1), (2) describes a task being executed. This system evolves in a local time, which is the time since the beginning of the task. In the sections to follow, we will consider issues of task-level control, that is, the control and estimation evolving from one task to another. It is assumed that a local time $t$ can be used for the description of the system dynamics in each task.
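As a concrete illustration of this task-level setting, the sketch below simulates one task for a toy instance of the system (1), (2): a given feedforward input u(·) is applied over the local time interval [0, T] and the output trajectory y(·) is recorded. Everything here — the toy dynamics, the integrator, and the parameter values — is a hypothetical example, not the chapter's system.

```python
import numpy as np

def run_task(f, h, x0, u_of_t, T, dt=1e-3, mu=None):
    """Apply feedforward u(.) to xdot = f(x, u; mu), y = h(x) over local time [0, T].

    Returns the sampled output trajectory y(.) for this single task.
    Forward-Euler integration keeps the sketch short; any ODE solver would do.
    """
    x = np.asarray(x0, dtype=float)
    ys = []
    for k in range(int(T / dt)):
        t = k * dt                      # local time, reset to 0 for every task
        x = x + dt * f(x, u_of_t(t), mu)
        ys.append(h(x))
    return np.array(ys)

# Toy example: scalar nonlinear plant xdot = -mu * x^3 + u, y = x
f = lambda x, u, mu: -mu * x ** 3 + u
h = lambda x: x
y = run_task(f, h, x0=[0.0], u_of_t=lambda t: np.sin(2 * np.pi * t), T=1.0, mu=2.0)
print(y[-1])
```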
In each task, we consider a controlled motion of the system (1), (2) on the given interval $[0, T]$ of the local time $t$ and assume that the initial state vector belongs to a smooth manifold of the form
$$x(0) = \psi(\lambda), \tag{3}$$
where $\lambda \in \Re^{n_\lambda}$ is a vector that defines the initial conditions. The control problem for the task is to find a control input $u(\cdot)$ defined on the time interval $[0, T]$ that allows us to achieve the control goal formulated as the minimization of a performance index of the form
$$J_1\bigl(y(\cdot),\, y_d(\cdot; \lambda, \nu);\, u(\cdot)\bigr) \to \min. \tag{4}$$
Here $\nu \in \Re^{n_\nu}$ is a vector that defines the task goal, for instance, in the form of the desired system state at the end time $T$, the trajectory of motion, etc.; and $y_d(t; \lambda, \nu) \in \Re^{n_y}$ is the preplanned desired output of the controlled plant in the task. We will call $\nu$ a vector of the control goal parameters. The preplanned output $y_d$ depends on both the initial condition vector $\lambda$ and the control goal vector $\nu$. Let us introduce a task parameter vector $p$ comprising the initial condition vector $\lambda$ (3), the vector of the system parameters $\mu$ (1), and the vector of the control goal $\nu$ (4) for this task:
$$p = \bigl[\lambda^T\ \mu^T\ \nu^T\bigr]^T \in \Re^{N_p}, \qquad N_p = n_\lambda + n_\mu + n_\nu. \tag{5}$$
The optimal feedforward control input $u(\cdot)$ that solves the problem (1)-(4) depends on the vector $p$ (5). We assume that the vector $p$ belongs to a given compact domain $\mathcal{P} \subset \Re^{N_p}$. We further assume that we can repeatedly apply the computed feedforward control $u(\cdot)$ to the system (1), (2) in different tasks, observe its output $y(\cdot)$ on the time interval $[0, T]$, and, possibly, use the obtained observations to update the control. For each task, we suppose that the local time $t$ is reset to zero and the initial condition has the form (3). We assume that the parameter vectors $\lambda$, $\mu$, and $\nu$ can take different values for different runs (system motions). The control objective is to minimize the performance index (4) for each value of the parameter vector $p = [\lambda^T\ \mu^T\ \nu^T]^T$ (5) in the given domain: $p \in \mathcal{P} \subset \Re^{N_p}$. At this stage, we do not specify how the parameter vector $p$ (5) changes from one task to another. This issue is discussed further on in Section IV.

The task-level learning control algorithms, which we are going to discuss, are more general than those considered in previous papers on learning and repetitive control. Standard learning control formulations (e.g., [2]) correspond to assuming the parameters $\lambda$, $\mu$, and $\nu$ to be fixed. Repetitive control setting [5] assumes that the initial conditions for each task coincide with the final system state at the previous task
and the variation of the initial conditions is small. Furthermore, the prior learning control work mostly confines itself to the trajectory tracking problem. The task-level learning paradigm of [19, 20] is related to the preceding general problem statement, but covers a somewhat different class of problems. Herein, we formulate a control problem as a minimization of a general performance index, which includes most of the previous learning control formulations as special cases. In particular, as shown in the following discussion, such formulations can be used for both trajectory tracking and terminal control problems. A few recently published papers study control approaches related to some aspects of the considered formulation [32, 33]. As already mentioned in Section I, the learning control papers usually consider a trajectory tracking problem. This problem can be described with a performance index of the form (4):
$$J_1(y, y_d, u) = \int_0^T \|y(t) - y_d(t; \lambda, v)\|^2\,dt + \rho \int_0^T \|u(t)\|^2\,dt \to \min, \tag{6}$$
where $\|\cdot\|$ denotes the Euclidean norm of a vector. The first term in (6) is a penalty for the deviation of the system output from its desired value. Because the trajectory tracking problem is generally ill posed, the second term in the performance index (6) is very important: it regularizes the solution in accordance with the technique of [34]. For $0 < \rho \ll 1$, a solution to (6) provides good quality of the trajectory tracking (see [17] for additional discussion on the subject). Similarly, the point-to-point control problem can be described with the performance index (4) of the form
$$J_2(y, y_d, u) = \int_T^{T_f} \|y(t) - y_d(T; v)\|^2\,dt + \rho \int_0^T \|u(t)\|^2\,dt \to \min, \tag{7}$$
where we assume that the output $y$ of the plant (1), (2) can be monitored on the time interval $[T, T_f]$, $T_f > T$. The first term in (7) gives a measure of the overshoot; the second term, a measure of the control effort. We assume that on the interval $T \le t \le T_f$ the desired output $y_d$ is a constant defined by the vector $v$, and $u(t) = 0$, which corresponds to the system being in the desired final steady state. Unlike (6), the performance index (7) penalizes only the overshoot of the system output after the end of the desired motion. Under appropriate conditions of controllability and observability, a solution $u(\cdot)$ of (7) approaches quadratic-optimal terminal control as $\rho \to 0$. For linear systems, this is studied in more detail in [35]; generally, a smooth nonlinear system can be linearized in the vicinity of the optimal solution to make the linear-system results applicable. Unlike most other work on learning control, we will be interested in obtaining a dependence $u(\cdot; p)$ of the feedforward control (1) on the parameter vector (5), rather than in learning the feedforward control for a single given value of the parameter vector $p$.
The mapping $p \mapsto u(\cdot)$ defined by the optimal control problem (1)-(3), (4), and (5) is generally complicated and nonlinear, even for a linear system (1), (2). We suppose that the system (1), (2) and the performance index (4) are such that this mapping is continuous. For instance, it is continuous if the right-hand-side mappings in (1), (2) are smooth, the system is reachable, $\rho > 0$, and we are considering a quadratic performance index (6) or (7). Yet, we would like to note that the study of the properties of the mapping from the coordinates into control for a general nonlinear system is a complicated problem. In particular, it is known that for certain systems with smooth right-hand sides, such as nonholonomic systems, no continuous stabilizing state feedback exists [36]. Some discussion regarding the approximation of the dependence of the feedforward shape on the task parameter vector for a nonholonomic system can be found in [21].
B. EXAMPLE: CONTROL OF TWO-LINK FLEXIBLE ARM

Let us consider the two-link articulated arm shown in Fig. 1. We assume that inertial drives placed in the arm joints are connected to the links through lumped elastic elements and that all motion is in a horizontal plane. Our goal is to demonstrate an application of the approach of this paper to a standard problem. Therefore, we employ commonly made assumptions about the elastic-joint manipulator dynamics [37]. In particular, we assume that the damping in the elastic elements is negligible and that the angular motion of the drive rotors is decoupled from the motion of the arm structure. The latter assumption holds for drives with high transmission ratios. Let $q \in \mathbb{R}^2$ be the vector of the arm joint angles; $y \in \mathbb{R}^2$, the vector of the rotation angles of the drive output shafts; and $\mu \in \mathbb{R}^1$, a lumped mass (payload) attached to the arm tip. Under the assumptions made, the equations of
Figure 1 Two-link planar arm with flexible joints. Reproduced with permission from D. M. Gorinevsky, IEEE Trans. Automat. Control 42:912-927, 1997 (© 1997 IEEE).
motion of the system have the form
$$M(q)\,\ddot{q} + C(q, \dot{q}) + K(q - y) = 0, \tag{8}$$
$$J\,\ddot{y} + B\,\dot{y} + K(y - q) = \tau, \tag{9}$$
where $M(q) \in \mathbb{R}^{2 \times 2}$ is the inertia matrix of the arm; $C(q, \dot{q}) \in \mathbb{R}^2$ is the vector of Coriolis and centrifugal forces; the matrix $K = \mathrm{diag}\{k_1, k_2\} \in \mathbb{R}^{2 \times 2}$ defines the stiffnesses $k_1$ and $k_2$ of the elastic elements; $J = \mathrm{diag}\{j_1, j_2\} \in \mathbb{R}^{2 \times 2}$ is the diagonal matrix of the drive rotor inertias; $B = \mathrm{diag}\{\beta_1, \beta_2\} \in \mathbb{R}^{2 \times 2}$ is the matrix of the internal drive damping; and $\tau \in \mathbb{R}^2$ is the drive torque vector, which we consider as a control variable. The exact form of the nonlinear functions $M(q)$ and $C(q, \dot{q})$ for a planar arm with uniform links and a lumped payload can be found elsewhere (e.g., in [38]). We further assume that the control torque vector $\tau$ applied to the drives is computed in the usual way, as a sum of a proportional and derivative (PD) drive position feedback and a feedforward compensation $u(t)$:
$$\tau(t) = K_*\big(q_d(t) - y(t)\big) + B_*\big(\dot{q}_d(t) - \dot{y}(t)\big) + u(t), \tag{10}$$
where $q_d(t)$ is the reference drive position, $K_* = \mathrm{diag}\{k_{*1}, k_{*2}\} \in \mathbb{R}^{2 \times 2}$ is the matrix of the proportional drive position feedback gains, and $B_* = \mathrm{diag}\{\beta_{*1}, \beta_{*2}\} \in \mathbb{R}^{2 \times 2}$ is the matrix of the velocity feedback gains. We consider the feedback gains $K_*$ and $B_*$ as fixed parameters and the vector of the feedforward joint torques $u(t) \in \mathbb{R}^2$ as an external (control) input to the system. We assume that the observations used in the learning include the joint angles $q$ and the elastic element deformations $q - y$. We suppose that the drive velocities $\dot{y}$ are available to the low-level servocontrol (10), but not to a higher-level controller implementing the learning algorithm. Such a situation is very common in robot control practice, where the drive servocontrollers usually do not provide the upper level of the control system with velocity information. For the system (8)-(10), the state vector has the form $x = [q^T \; \dot{q}^T \; y^T \; \dot{y}^T]^T \in \mathbb{R}^8$ and the observation vector, $[q^T \; (q - y)^T]^T \in \mathbb{R}^4$. The system (8)-(10) is a special case of the system (1), (2). Let us introduce a vector $\lambda \in \mathbb{R}^2$ of the initial condition parameters, so that the initial state vector (3) has the form
$$x(0) = [\lambda^T \; 0 \; 0 \; \lambda^T \; 0 \; 0]^T, \tag{11}$$
where $\lambda \in \mathbb{R}^2$ defines the initial joint angles of the arm and the drive angles. The initial condition (11) means that $q(0) = y(0) = \lambda$ and $\dot{q}(0) = \dot{y}(0) = [0 \; 0]^T$. Let us consider the control goal parameter vector $v \in \mathbb{R}^2$, which is equal to the desired joint angle vector $q_d(T)$ after the motion of the arm. The preplanned output $y_d(\cdot; \lambda, v)$ describes the desired path of the arm motion, which is used as a reference in the PD controller (10), and the planned joint deformations. The path
planning method commonly used in robotics is to compute the reference path as a straight line in the joint angle space, which can be written in the form
$$q_d(t) = \lambda\big(1 - s(t)\big) + v\, s(t), \tag{12}$$
where $s(t)$ is a smooth scalar-valued function (e.g., a third-order polynomial) such that $s(0) = \dot{s}(0) = \dot{s}(T) = 0$, and $s(t) = 1$ for $t \ge T$. We assume that the planned values of the joint deformations $q_d(t) - y_d(t)$ and their derivatives are always zero. Let us consider a problem of point-to-point control of the arm stated as a so-called "cheap control" problem, that is, the problem of minimizing a performance index of the form (7) with $0 < \rho \ll 1$. For $t > T$, we assume that the feedforward is zero and the PD controller uses the constant set point $y_d(T)$. As mentioned in Section II.A, a solution of the control problem (7) approaches a quadratic-optimal solution of the terminal control problem as $\rho \to 0$.
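One common third-order choice for the blend function in (12) is $s(t) = 3(t/T)^2 - 2(t/T)^3$, which satisfies the stated boundary conditions. The sketch below evaluates the reference path (12) with this blend; the numerical values of $\lambda$, $v$, and $T$ are placeholders, not values from the chapter.

```python
import numpy as np

def s_blend(t, T):
    """Cubic blend s(t) = 3(t/T)^2 - 2(t/T)^3 with s(0)=0, s'(0)=s'(T)=0, s(T)=1."""
    tau = np.clip(t / T, 0.0, 1.0)   # clamping gives s(t) = 1 for t > T
    return 3.0 * tau**2 - 2.0 * tau**3

def q_d(t, lam, v, T):
    """Reference joint path (12): straight line in joint space from lam to v."""
    s = s_blend(t, T)
    return lam * (1.0 - s) + v * s

# Placeholder task: move two joints from lam to v in T = 2 s.
lam = np.array([0.0, 0.5])    # initial joint angles lambda (rad), placeholder
v = np.array([1.0, -0.3])     # desired final joint angles v (rad), placeholder
T = 2.0
for t in (0.0, 1.0, 2.0, 2.5):
    print(t, q_d(t, lam, v, T))
```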
C. DISCRETIZED PROBLEM
Section II.A states the problem of approximating the optimal feedforward control $u$ over a domain of the parameter vector $p$. This approximation problem involves a mapping from a domain $\mathcal{P} \subset \mathbb{R}^{N_p}$ of the parameter vector $p$ (5) into the Banach space $\mathcal{L}_2(\mathbb{R}^{n_u}; 0, T)$, to which the feedforward control input $u(\cdot)$ belongs. An implementable computational algorithm that solves the stated problem has to use a truncation of Banach-space vectors such as $u(\cdot)$. Furthermore, a control algorithm implemented on a digital computer introduces truncation in the form of input and output signal sampling. The problem treatment based on the truncation is acceptable because, in practice, only a restricted approximation accuracy is required. After the truncation, the time histories of the feedforward input $u(\cdot)$ to the system (1) and of the system output $y(\cdot)$ (2) are represented by finite-dimensional vectors, and the approximation problem includes only mappings between finite-dimensional vector spaces. In order to truncate the formulated Banach-space problem, let us introduce a set of shape functions $\phi_j(\cdot)\colon [0, T] \mapsto \mathbb{R}$ ($j = 1, \ldots, N$). The shape functions $\phi_j(\cdot) \in \mathcal{L}_2(\mathbb{R}; 0, T)$ make a basis of the linear manifold $\Pi_N \subset \mathcal{L}_2(\mathbb{R}; 0, T)$. By using the Galerkin-Ritz approach to the solution of the problem (1)-(3), we consider a projection of the control $u(\cdot) \in \mathcal{L}_2(\mathbb{R}^{n_u}; 0, T)$ onto the linear manifold $\mathbb{R}^{n_u} \otimes \Pi_N$, where $\otimes$ denotes a direct product of two vector spaces. This projection, also known as an assumed-mode expansion, has the form
$$u(t) = \sum_{j=1}^{N} U_j\, \phi_j(t), \qquad U_j \in \mathbb{R}^{n_u}, \tag{13}$$
where we have collected the weight vectors $U_j$ into the vector $U$ of dimension $N_U = n_u N$. The vector $U$ (13) is a coordinate vector on the linear space $\mathbb{R}^{n_u} \otimes \Pi_N$. It is well known that for many choices of the shape functions—such as trigonometric or polynomial Fourier series, B-spline approximations, or wavelet expansions—the Galerkin-Ritz method converges to the exact solution of the continuous-time problem as $N \to \infty$. We do not discuss a particular choice of the shape function set or the expansion order here, because this is a well-studied problem addressed elsewhere. We assume that the expansion of the form (13) gives an acceptable solution of the problem. Additional insight can be obtained from the papers [33, 35, 39, 40], among many others, where applications of expansions of the form (13) to related continuous-time control problems are discussed in more detail. In many applications, it is advantageous to use B-splines in the expansion (13). B-spline shape functions with the same basis (support) differ only by translations. It is possible to use B-splines of various orders in (13). In particular, zero-order B-splines will give a piecewise constant feedforward (13), whereas cubic B-splines will result in a twice continuously differentiable feedforward. In the application example of Section VI, first-order B-spline functions are used. We further assume that the shape function set is given, so that the vector $U$ defines the feedforward control on the interval $[0, T]$, that is, for the task in question. We will call $U$ a control input vector or a control program. Similarly, we describe the system output with a finite-dimensional output vector $Y$. We introduce a sampling time sequence $\{t_l\}_{l=1}^{L}$ and define an output vector $Y$ and a desired output vector $Y_d$ that have the form
$$Y = \big[y(t_1)^T \; \cdots \; y(t_L)^T\big]^T, \qquad Y_d = \big[y_d(t_1)^T \; \cdots \; y_d(t_L)^T\big]^T, \tag{14}$$
where $y \in \mathbb{R}^{n_y}$ is the system output (2), $y_d(t) = y_d(t; \lambda, v)$ is the desired output as in (6) and (7), and $N_Y = n_y L$. Under appropriate observability conditions imposed on the sampling time sequence and on the controlled plant, the desired output of the system (1), (2) can be evaluated by monitoring only the sampled output (14). Owing to the measurement sampling in a digital computer, the measured output has the form (14) in most practical cases. The input vector $U$ defines the output vector $Y$ in accordance with Eqs. (1), (2), (5), (13), and (14). We will write this dependence in the form
$$Y = S(U, p) \tag{15}$$
and assume that the system (1) is such that $S$ is a smooth mapping. Let us also consider a modified form of the performance index (4),
$$J(Y, Y_d, U; p) \to \min, \tag{16}$$
where $Y_d$ is obtained by sampling the preplanned output $y_d(t; \lambda, v)$ in the same way as $y(t)$ in (14). The vector $p$ in (15) is presented explicitly in order to show the dependence of the control problem on the task parameters $\lambda$, $\mu$, and $v$ that influence the output according to (1)-(3), (13), (14), and (16). For a broad class of controlled systems, it is shown in [35] that, under appropriate conditions, a solution of the discretized problem as introduced in this subsection approaches the solution of the original continuous-time problem. Because of space limitations, we do not discuss this issue in more detail here and will further limit the consideration to the discretized problem (14), (15). For the performance index (16), one can write the condition of the extremum in the form
$$\frac{\partial J}{\partial U}(U^*, p) + G^T(U^*, p)\,\frac{\partial J}{\partial Y}(U^*, p) = 0, \qquad G(U, p) = \frac{\partial Y}{\partial U} = \frac{\partial S}{\partial U}(U, p), \tag{17}$$
where $G \in \mathbb{R}^{N_Y \times N_U}$ is an input-output sensitivity matrix of the system—a Jacobian matrix of the mapping (15). If the mapping (15) is known analytically, one can use (17) to find, analytically or numerically, an optimal control input $U^*$ for each value of $p$. For a performance index of the form (6) or (7), we can write the modified performance index (16) in the form
$$J = \|Y - Y_d\|^2 + \rho\, \|U\|_R^2 \to \min, \tag{18}$$
where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^{N_Y}$ and $\|U\|_R^2 = U^T R U$, where $R \in \mathbb{R}^{N_U \times N_U}$ is a positive-definite matrix defined through a Gramian of the shape function set $\{\phi_j(\cdot)\}_{j=1}^N$. The performance index (18) can approximate either (7) or (6), assuming that the sampling of the output $y(t)$ (14) is uniform on the interval $[T, T_f]$ or $[0, T]$, respectively. To make the presentation more transparent, we further assume that in (18) $R = I_{N_U}$ is the identity matrix and both norms in (18) are the regular Euclidean norms. In particular, this is valid if the shape functions (13) are orthonormal. In general, the orthonormality of the shape functions can be achieved with a simple linear transformation. For the problem (18), the extremum condition (17) has the form
$$\rho\, U + G^T \big[S(U, p) - Y_d\big] = 0 \tag{19}$$
and could be solved analytically or numerically, once the form of the mapping (15) is known. The solution to (19) has the form
$$U = U^*(p), \qquad p \in \mathcal{P}, \tag{20}$$
where we assume that the mapping $U^*(p)$ is smooth, which is valid for a broad class of systems (15).
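As a toy end-to-end illustration of this subsection (not the chapter's algorithm), the sketch below discretizes a stable scalar linear plant with first-order B-spline (hat) shape functions as in (13), samples its output as in (14), and solves the extremum condition (19), which becomes linear when the mapping $S$ is; the plant, horizon, and all constants are placeholder assumptions.

```python
import numpy as np

T, N, L = 2.0, 8, 40                        # horizon, number of hat functions, samples
t = np.linspace(0.0, T, L)                  # sampling sequence {t_l} as in (14)
centers = np.linspace(0.0, T, N)            # knots of first-order B-spline (hat) basis

def hat(j, tk):
    """First-order B-spline shape function phi_j(t): piecewise linear, 1 at its knot."""
    return np.interp(tk, centers, np.eye(N)[j])

def simulate(u_of_t):
    """Forward-Euler simulation of the scalar plant y' = -y + u, sampled at {t_l}."""
    ts = np.linspace(0.0, T, 1000)
    dt, y, out = ts[1] - ts[0], 0.0, []
    for tk in ts:
        y += dt * (-y + u_of_t(tk))
        out.append(y)
    return np.interp(t, ts, np.array(out))

# The plant is linear, so S(U) = G U: column j of G is the sampled response to
# phi_j, which realizes the mapping (15) numerically for this toy problem.
G = np.column_stack([simulate(lambda tk, j=j: hat(j, tk)) for j in range(N)])

Yd = 1.0 - np.exp(-3.0 * t)                 # placeholder desired output samples
rho = 1e-3
U_star = np.linalg.solve(rho * np.eye(N) + G.T @ G, G.T @ Yd)  # solves (19) for linear S
print(np.linalg.norm(G @ U_star - Yd))      # residual tracking error
```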
D. PROBLEMS OF TASK-DEPENDENT FEEDFORWARD CONTROL

In what follows, we consider a few different control architectures and problems related to (15), (16). In particular, we will concentrate on the performance index (16) of the form (18). The first problem we are going to consider is how to build a controller implementing the solution (20) to the problem (15), (16). The key issue here is to design a practically implementable controller that is able to compute the optimal feedforward vector $U^*$ in real time with limited computational resources. This can be achieved by computing $U^*(p)$ approximately, rather than exactly. Such an approach is perfectly justified from a practical viewpoint because real-life systems always differ from their computational models, however accurate the latter are.

Problem 1 (Computing a task-dependent approximation for the feedforward). Design a practically implementable controller computing an accurate approximation $\hat{U}(p)$ of the optimal solution $U^*(p)$ of the problem (15), (16) for any parameter vector $p$ (5) in the given domain $\mathcal{P}$.

The design of an approximation-based controller solving Problem 1 is discussed in Section III.D. This design employs a radial basis function network for approximating the mapping $U^*(p)$. As will be discussed further in more detail, Problem 1 can be solved by first obtaining an optimal control $U^*$ for selected task parameters $p$ and then interpolating these data. As an alternative to numerical model-based optimization, an optimal control vector $U^*$ for a fixed given task parameter vector $p$ can be directly learned in the course of repeated execution of the same task. Though repeated experiments are possible only in some problems, the following learning problem is of didactic importance to us. The RBF network learning problems considered further can be seen as generalizations of this problem.

Problem 2 (Iterative learning of the feedforward for a given task). Let us assume that $p$ in (15) is given and fixed. We assume that the mapping (15) is unknown, but we can repeatedly execute the same task defined by (1)-(3), (5), (13), (14). For each task repetition, an input vector $U$ is applied and the output vector $Y$ is observed. The problem is to design a learning control algorithm that iteratively updates the input vector $U$ in order to optimize the performance index (18).

A learning procedure solving Problem 2 is studied in Section IV. For many real problems, designing a feedforward controller by interpolating solutions of Problem 2 is not practical, because this would require multiple repetitions of the same task. A more practical controller can be obtained by first using an available model of the system for off-line design and then updating the
feedforward control approximation based on the output errors registered as the system operates. Such an approach follows the usual practice of general feedback controller design, where setpoints can be calculated off-line or on an upper level of control and then tracked using an error feedback. The difficulty in updating the control approximation $\hat{U}(p)$ is that, unlike Problem 2, in the course of normal operation of the system the sequence of the task vectors $p$ is defined by a higher control level. Thus, it is unreasonable to make any assumptions about this sequence when designing the controller. The formal statement of this problem is as follows.

Problem 3 (On-line learning update in task-dependent feedforward approximation). Assume that an imprecise approximation $\hat{U}(p)$ of the control input optimal in the sense of (16) is available for each value of $p$. Assume further that an approximation $\hat{G}(p)$ of the sensitivity of the system output $Y$ (16) to the input $U$ is available for each value of $p$. For a generic sequence of the parameter vectors $p$ (5) and the corresponding sequence of the control input vectors $\hat{U}(p)$ (1), (13) applied to the system (1), (2), (15), observe the sequence of the output vectors $Y$ (2), (14) and update the approximation $\hat{U}(p)$ of the control input to optimize it in the sense of (16) for each value of $p$.

Problem 3 is studied in Section V. As mentioned previously, Problem 3 is essentially about the design of a discrete-time feedback controller. In some cases, such model-based controllers may not perform well enough in practice, in particular if an available model of the system sensitivity (gain) is not sufficiently accurate. In such cases, one may want to use an adaptive or self-tuning controller, which estimates the system gain based on the closed-loop operation data (possibly with self-excitation added). The problem of designing such an adaptive controller is as follows.

Problem 4 (Adaptive learning of a task-dependent approximation for the feedforward). Given a sequence of the parameter vectors $p$ (5), apply a sequence of the control input vectors $U$ (1), (13) to the system (1), (2), (15) and, by observing the sequence of the output vectors $Y$ (2), (14), estimate for each $p$ a local model of the input-output mapping $U \mapsto Y$, including the system sensitivity. Using this model, update an approximation $\hat{U}(p)$ of the optimal control input (20).

Section VI presents an iterative adaptive learning procedure that updates an approximation of the mapping $p \mapsto U^*(p)$ for an a priori unknown mapping (15), as required in Problem 4.
III. RADIAL BASIS FUNCTION APPROXIMATION

This section considers a generic approximation problem and introduces an RBF network architecture suitable for solving such problems. We then show how a controller using an RBF network approximation can be used to solve Problem 1
stated in Section II.D. The material of this section is used as a basis for developing the algorithms in the subsequent sections.
A. EXACT RADIAL BASIS FUNCTION INTERPOLATION

Let us consider an auxiliary problem of approximating a smooth nonlinear mapping $g(\cdot)\colon \mathbb{R}^{N_p} \mapsto \mathbb{R}^{N_Y}$ over a compact domain:
$$Y = g(p), \qquad Y \in \mathbb{R}^{N_Y}, \qquad p \in \mathcal{P} \subset \mathbb{R}^{N_p}, \tag{21}$$
where $p$ is an input parameter vector, $\mathcal{P}$ is a compact domain, and $Y$ is an output vector. We assume that a scattered (irregularly placed) set of $N_t$ input-output pairs is available and call this set the training data set:
$$\big\{Y^{(j)} = g(p^{(j)}),\; p^{(j)}\big\}, \qquad j = 1, \ldots, N_t. \tag{22}$$
The problem is to find an approximation $Y = \hat{g}(p)$ of the mapping (21) that can be used for any argument $p \in \mathcal{P}$. A computationally convenient way of representing an unknown nonlinear function is to present it as an expansion that is linear in parameters, the parameters being assumed unknown. Let us consider an approximation of the mapping (21) that has the form
$$Y = \hat{g}(p) = \sum_{j=1}^{N_a} Z^{(j)}\, w_j(p), \tag{23}$$
where $N_a$ is the order of the expansion, $w_j(p)$ are scalar expansion shape functions, and $Z^{(j)} \in \mathbb{R}^{N_Y}$ are the expansion weights. The truncated Fourier series expansion, the polynomial expansion, and the B-spline expansion all have the form (23). In the artificial neural network literature, approximations of the form (23) are known as functional link networks [41, 42], and we further call the vectors $Z^{(j)}$ the network weight vectors. Given the expansion shape functions $w_j(p)$ (23), a standard way to solve the scattered data approximation problem is to choose the parameter vectors $Z^{(j)}$ by fitting the expansion (23) to the data (22) with the least error. In the special case $N_a = N_t$, when the number of expansion weight vectors (23) coincides with the number of training set data pairs (22), one can generally fit the training data exactly. In this section, we consider an expansion of the form (23) with functions $w_j(\cdot)$ that depend on the radii $r_j = \|Q^{(j)} - p\|$, where $Q^{(j)} \in \mathbb{R}^{N_p}$ are given vectors. Such an expansion is known as radial basis function (RBF) approximation. RBF approximation has been used in computer graphics and experimental data processing applications (e.g., for geophysical data) for two decades and has been demonstrated to provide a high quality of approximation. One can find further
details and references in [43-48]. These papers employ the method recently referred to as exact RBF interpolation. This method uses radial functions centered at each of the data points (22). In this method, the radial function centers are $Q^{(j)} = p^{(j)}$ and the expansion functions in (23) have the form
$$w_j(p) = h(p - p^{(j)}), \tag{24}$$
where $h(\cdot)$ is a radial function; that is, $h(p - p^{(j)})$ depends on the radius $\|p - p^{(j)}\|$. The most commonly used radial basis functions are
$$h(p) = \exp\big(-\|p\|^2/d^2\big), \qquad h(p) = \big(1 + \|p\|^2/d^2\big)^{1/2}, \qquad h(p) = \big(1 + \|p\|^2/d^2\big)^{-1/2}, \tag{25}$$
where $\|\cdot\|$ denotes the Euclidean vector norm. The first radial function in (25) is Gaussian, and the last two are called Hardy multiquadrics and reverse multiquadrics, respectively [46]. Usually, the radial function width parameter $d$ in (25) is chosen to be about the average distance between neighboring node centers [43, 44, 48, 49]. Let us introduce the data matrix $\mathbf{Y}$ and the parameter matrix $\Theta$ built of the vectors (22) and (23):
$$\mathbf{Y} = \big[Y^{(1)} \; \cdots \; Y^{(N_t)}\big] \in \mathbb{R}^{N_Y \times N_t}, \qquad \Theta = \big[Z^{(1)} \; \cdots \; Z^{(N_a)}\big] \in \mathbb{R}^{N_Y \times N_a}. \tag{26}$$
In the exact RBF interpolation, we have $N_a = N_t$. Substituting (23) and (24) into (22) and using (26), we can write the condition of the exact fit for the training data, $\hat{Y}(p^{(j)}) = Y^{(j)}$, in the matrix form
$$\mathbf{Y} = \Theta H, \qquad H = \big\{h(p^{(i)} - p^{(j)})\big\}_{i,j=1}^{N_t} \in \mathbb{R}^{N_t \times N_t}. \tag{27}$$
The symmetric matrix $H$ in (27) is called the interpolation matrix. This matrix has been proved to be invertible for the commonly used radial functions, provided the vectors $p^{(j)}$ are distinct [50]. With (23), (26), and (27), we obtain the interpolation of the mapping (21) in the form
$$Y = \hat{g}(p) = \mathbf{Y} H^{-1} h(p), \qquad h(p) = \mathrm{col}\big(\{h(p - p^{(j)})\}_{j=1}^{N_t}\big) \in \mathbb{R}^{N_t}. \tag{28}$$
It has recently been acknowledged that RBF interpolation minimizes a certain regularization performance index that describes the roughness of the interpolated surface [43, 47, 51]. Different forms of the radial functions (25) correspond to minimization of different regularization indexes. Note that the approximation (28) is linear in the data vectors $Y^{(j)}$ (26). Thus, the computational complexity of the method remains moderate even for a large dimension $N_Y$ of the vector $Y$.
The exact RBF interpolation (28) is global in the sense that it has the same form for any $p \in \mathcal{P}$. One needs to complete the most computationally expensive part of (28)—the inversion of the matrix $H$ (27)—only once, for any number of points at which the approximation (28) of the function (21) is to be computed. Yet this advantage cannot be exploited if the training set (22) grows with time.
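A minimal sketch of the exact interpolation (27), (28), assuming the Gaussian radial function from (25); the scattered training data below are synthetic placeholders.

```python
import numpy as np

def gaussian_rbf(r2, d):
    """Gaussian radial function from (25), evaluated on squared radii."""
    return np.exp(-r2 / d**2)

def fit_exact_rbf(P, Ydata, d):
    """Exact RBF interpolation (27): solve Ydata = Theta @ H for the weights.
    P: (Nt, Np) training inputs p^(j); Ydata: (NY, Nt) training outputs Y^(j)."""
    r2 = np.sum((P[:, None, :] - P[None, :, :])**2, axis=-1)  # pairwise squared radii
    H = gaussian_rbf(r2, d)                                   # interpolation matrix (27)
    return np.linalg.solve(H.T, Ydata.T).T                    # Theta = Ydata @ H^{-1}

def eval_rbf(Theta, P, p, d):
    """Evaluate the interpolant (28) at a query point p."""
    r2 = np.sum((P - p)**2, axis=-1)
    return Theta @ gaussian_rbf(r2, d)

rng = np.random.default_rng(2)
P = rng.uniform(-1, 1, size=(20, 2))              # scattered centers p^(j), placeholder
Ydata = np.sin(P[:, :1].T * 3.0)                  # (1, 20) sampled outputs of a test map
Theta = fit_exact_rbf(P, Ydata, d=0.5)
print(eval_rbf(Theta, P, P[0], d=0.5), Ydata[:, 0])  # interpolation reproduces the data
```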
B. RADIAL BASIS FUNCTION NETWORK APPROXIMATION

Recently, some authors have treated the RBF approximation in the connectionist network setting [28, 30, 51-55]. They consider radial function centers $Q^{(j)}$ (we will call $Q^{(j)}$ the network node centers) that do not coincide with the training set points, so that the expansion functions in (23) have the form
$$w_j(p) = h(p - Q^{(j)}), \qquad j = 1, \ldots, N_a. \tag{29}$$
Suppose that the node centers $Q^{(j)}$ ($j = 1, \ldots, N_a$) are given and fixed vectors. Let us fit the training set data (22) using the network (23). Employing the same notation (26) as in (27), we can represent the fitting problem in the regression form
$$\mathbf{Y} = \Theta \Phi + \mathcal{E}, \qquad \Phi = \big\{h(p^{(i)} - Q^{(j)})\big\}_{j,i=1}^{N_a, N_t} \in \mathbb{R}^{N_a \times N_t}, \tag{30}$$
where $\mathcal{E} = [\varepsilon^{(1)} \; \cdots \; \varepsilon^{(N_t)}] \in \mathbb{R}^{N_Y \times N_t}$ is a residual error matrix. Because we cannot be sure that $\Phi$ is a well-conditioned or even a full-rank matrix, we will look for a regularized least-squares solution to (30) that minimizes
$$\|\mathcal{E}\|_F^2 + \alpha\, \|\Theta\|_F^2 \to \min, \qquad 0 < \alpha \ll 1, \tag{31}$$
where $\|\cdot\|_F$ denotes the matrix norm equal to the square root of the sum of the squared matrix entries (the Frobenius norm). In (31), $\alpha$ is a scalar regularization parameter, introduced to obtain a solution of a possibly ill-conditioned problem following the regularization technique of [34]. The parameter $\alpha$ is small and does not influence the solution if the problem is well conditioned. Solving (30), (31) for $\Theta$ gives
$$\Theta = \mathbf{Y} \Phi^T \big(\alpha I_{N_a} + \Phi \Phi^T\big)^{-1}, \tag{32}$$
where $I_{N_a}$ is the $N_a \times N_a$ identity matrix. If $N_t > N_a$ and the training set inputs $p^{(j)}$ are uniformly distributed in the input domain, the matrix $\Phi$ should have rank $N_a$, because the basis functions (29) are linearly independent. The condition that $\Phi$ has full rank is called the persistency of excitation condition. As the exact RBF interpolation is known to yield very accurate results, one can expect that an RBF network with fixed centers can provide good approximation accuracy. Additional discussion of the properties of
the RBF network with nodes placed on a uniform grid can be found in [31]. The idea discussed in [31] and theoretically studied in more detail in [48] is that an RBF interpolation on a uniform grid performs a spatial filtering of the approximated function. Thus, the RBF approximation error is small if the function has small high-frequency content.
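A minimal sketch of the regularized least-squares fit (30)-(32) for a network whose node centers are fixed and fewer than the training points; the data and the width parameter are placeholder assumptions.

```python
import numpy as np

def h(r2, d=0.4):
    """Gaussian radial function from (25), on squared radii."""
    return np.exp(-r2 / d**2)

rng = np.random.default_rng(10)
Nt, Na, alpha = 60, 12, 1e-6
P = rng.uniform(0, 1, size=(Nt, 1))                     # training inputs p^(i)
Q = np.linspace(0, 1, Na)[:, None]                      # fixed node centers Q^(j)
Ydata = np.vstack([np.sin(4 * P[:, 0]), np.cos(4 * P[:, 0])])   # (NY, Nt) outputs

Phi = h(np.sum((Q[:, None, :] - P[None, :, :])**2, axis=-1))    # (Na, Nt) matrix of (30)
Theta = Ydata @ Phi.T @ np.linalg.inv(alpha * np.eye(Na) + Phi @ Phi.T)  # weights (32)

p = np.array([0.31])                                    # a query point off the grid
phi = h(np.sum((Q - p)**2, axis=-1))
print(Theta @ phi, np.array([np.sin(4 * 0.31), np.cos(4 * 0.31)]))  # close fit
```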
C. RECURSIVE IDENTIFICATION OF THE RADIAL BASIS FUNCTION MODEL

Equation (32) describes the computation of the network parameter matrix $\Theta$ with the method called batch learning in the ANN literature. This method assumes that the whole training data set (22) is available at once. In many practical situations, however, the training data pairs arrive one by one, and a recursive weight updating procedure is desirable. The recursive weight update enables the user not to keep track of all upcoming data, but rather to modify the parameter matrix $\Theta$ as the new data arrive. This feature is especially important for RBF network-based nonlinear adaptive control applications, such as those considered further in this chapter. We can apply a well-known recursive least-squares estimation method to update an already available estimate of the matrix $\Theta$ (26). Let us introduce a regressor vector for the expansion (23), (29),
$$\Phi(p) = \big[h(p - Q^{(1)}) \; \cdots \; h(p - Q^{(N_a)})\big]^T, \tag{33}$$
and denote $\Phi^{(k)} = \Phi(p^{(k)})$. Note that the vectors $\Phi^{(k)}$ are the columns of the regression matrix $\Phi$ (30). The RBF approximation (28), (29) can be presented in the form
$$Y = \hat{g}(p, \Theta) = \Theta\, \Phi(p), \qquad \Theta = \big[Z^{(1)} \; \cdots \; Z^{(N_a)}\big], \tag{34}$$
where $\Theta \in \mathbb{R}^{N_Y \times N_a}$ is an RBF network parameter matrix. A recursive estimation technique commonly used in signal processing and adaptive control is projection estimation. Projection estimation is a special case of the least mean square algorithm, which is known as the Widrow-Hoff updating rule in the signal processing literature and as the delta rule in the ANN literature. To derive the projection update, instead of minimizing a mean error index (31), let us minimize a one-step error increment index similar to (31). Let $\hat\Theta^{(k)}$ be an estimate of $\Theta$ available at step $k$ and $e^{(k)} = Y^{(k)} - \hat\Theta^{(k)}\Phi^{(k)}$ be the $k$th step approximation error:
$$\big\|Y^{(k)} - \hat\Theta^{(k+1)}\Phi^{(k)}\big\|^2 + \big\|\hat\Theta^{(k+1)} - \hat\Theta^{(k)}\big\|_F^2 \to \min. \tag{35}$$
The solution of (35) for $\hat\Theta^{(k+1)}$ has a form similar to (32):
$$\hat\Theta^{(k+1)} = \hat\Theta^{(k)} + a^{(k)}\, e^{(k)}\, \Phi^{(k)T} \big/ \big(1 + \|\Phi^{(k)}\|^2\big), \tag{36}$$
where $\|\Phi^{(k)}\|^2 = \Phi^{(k)T} \Phi^{(k)}$ and, in the error-free case, the dead-zone parameter $a^{(k)}$ is 1. In the presence of the approximation error, $a^{(k)}$ should be set to zero inside a dead zone in the usual way to ensure robust convergence of the algorithm despite the approximation error and, possibly, measurement noise (see, e.g., [56] for more details). Let us discuss the algorithm convergence issue. We assume that the RBF network approximation (34) of the mapping (21) can be made accurate enough by an appropriate choice of the network weight matrix $\Theta$; that is, for some $\Theta = \Theta_*$,
$$\big\|g(p) - \hat{g}(p, \Theta_*)\big\| \le \delta_Y, \tag{37}$$
where $\delta_Y$ is a "sufficiently" small approximation error. The convergence of the recursive estimation algorithm (36) in the presence of the approximation error and, possibly, measurement noise can be guaranteed according to the standard results of adaptive control and parameter estimation theory [56, Section III.D, pp. 88-91]. To ensure the convergence, the dead-zone parameter sequence $a^{(k)}$ should be chosen as
$$a^{(k)} = \begin{cases} 1, & \text{if } \|e^{(k)}\| > 2\delta_Y, \\ 0, & \text{otherwise}. \end{cases} \tag{38}$$
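A sketch of the projection (delta-rule) update (36) with the dead zone (38); the data stream, the noise level, and the dead-zone bound $\delta_Y$ are placeholder assumptions.

```python
import numpy as np

def projection_update(Theta, phi, y, delta_Y):
    """One projection step (36) with dead zone (38) for the model y = Theta @ phi."""
    e = y - Theta @ phi                                   # prediction error e^(k)
    a = 1.0 if np.linalg.norm(e) > 2 * delta_Y else 0.0   # dead zone (38)
    return Theta + a * np.outer(e, phi) / (1.0 + phi @ phi)

rng = np.random.default_rng(3)
NY, Na = 3, 10
Theta_true = rng.normal(size=(NY, Na))
Theta = np.zeros((NY, Na))
for k in range(500):                          # training pairs arriving one by one
    phi = rng.normal(size=Na)                 # regressor Phi^(k), e.g., of the form (33)
    y = Theta_true @ phi + 0.01 * rng.normal(size=NY)  # output with small model error
    Theta = projection_update(Theta, phi, y, delta_Y=0.05)
print(np.linalg.norm(Theta - Theta_true))     # parameter error after the recursive updates
```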
Projection estimation of RBF network weights in adaptive control of a nonlinear system is considered, for instance, in [57]. The papers [25, 41] consider an application of the orthogonal least-squares modification of the recursive least-squares (RLS) algorithm to RBF network approximation. The papers [25, 29] consider modifications of the RLS identification of an RBF network for the cases of dynamical node creation, update, and clustering of the node centers. In this chapter, we consider the number of nodes and the node centers as fixed parameters. The well-known condition for the estimation algorithm (36) to converge to the correct parameter matrix $\Theta_*$ is the persistency of excitation condition [56]. The persistency of excitation requirement is that there exist $N_e \ge 1$ and $\delta > 0$ such that, for any $n$,
$$\underline{\sigma}\left(\sum_{k=n+1}^{n+N_e} \Phi^{(k)}\, \Phi^{(k)T}\right) \ge \delta, \tag{39}$$
where $\underline{\sigma}(A)$ denotes the minimal singular value of the matrix $A$. According to (33), the persistency of excitation depends on the sequence of the training inputs $p^{(k)}$. In [58], it is proved that in the RBF network identification, persistency of excitation is provided if the inputs $p^{(k)}$ are in certain neighborhoods of the network node centers $Q^{(k)}$.
It is a well-established fact that for the projection update the prediction errors $e^{(k)}$ always converge into the $2\delta_Y$ dead zone [56]. The convergence result is valid for any regressor vector sequence, that is, for an arbitrary task parameter vector sequence $\{p^{(k)}\}$.
D. RADIAL BASIS FUNCTION APPROXIMATION OF TASK-DEPENDENT FEEDFORWARD

Let us return to Problem 1 as introduced in Section II.D. It is possible to design a task-dependent approximation $\hat{U}(p)$ of the dependence of the optimal feedforward vector on the task parameter vector $p$ (5) by using the RBF network approximation technique discussed in the beginning of this section. Let us assume that the optimal feedforward vectors $U_{*,k} = U^*(Q^{(k)})$ are known for certain (discrete) values $Q^{(k)}$ of the task parameter vector $p$. Given $N_a$ pairs of vectors $Q^{(k)} \in \mathcal{P}$ and $U_{*,k}$ as $\{Q^{(k)},\, U_{*,k} = U^*(Q^{(k)})\}_{k=1}^{N_a}$, we can approximate the mapping $U^*(p)$ over the given domain $\mathcal{P}$ of the task parameter vector $p$ by using exact RBF interpolation as discussed in Section III.A. This approximation can also be represented by an RBF network of the form
$$\hat{U}(p) = K\, \Phi(p), \tag{40}$$
where $\Phi(p) \in \mathbb{R}^{N_a}$ is the RBF regressor vector (33) and $K \in \mathbb{R}^{N_U \times N_a}$ is the RBF network weight matrix. It follows from (23), (24) and (28), (40) that the matrix $K$ can be found from the exact RBF interpolation conditions as
$$K = \big[U_{*,1} \; \cdots \; U_{*,N_a}\big] H^{-1}, \tag{41}$$
where $H \in \mathbb{R}^{N_a \times N_a}$ is the RBF interpolation matrix of the form (27):
$$H = \big\{h(Q^{(i)} - Q^{(j)})\big\}_{i,j=1}^{N_a}. \tag{42}$$
The RBF network task-level controller defined by (40) is schematically shown in Fig. 2. A sequence of tasks, each defined by a task parameter vector $p$, is generated externally with respect to the controller. The vector $p$ is supplied to the input of the system mapping $Y = S(U, p)$ (15) and to the input of the controller, which computes the feedforward $U$ by using the RBF network approximation. The task parameter vector $p$ changes in discrete time (from task to task). This change cannot be predicted by the controller and can be considered as an external disturbance acting on the task completion process. The designed RBF controller provides a feedforward compensation for this (measurable) disturbance. The design of the proposed RBF network controller is straightforward. We assume that the dependence (20) of the optimal feedforward vector on the task parameters is a smooth mapping. Such a mapping can be approximated by an RBF network
Figure 2 Schematics of feedforward controller using RBF approximation of dependence on task parameter vector.
to arbitrary accuracy, provided the grid of the RBF nodes $Q^{(k)}$ is dense enough. Thus, for $K = K_*$,
$$\big\|K_*\, \Phi(p) - U^*(p)\big\| \le \delta_U, \tag{43}$$
where, as mentioned previously, $\delta_U$ can be made as small as needed by choosing a denser grid of the RBF node centers. The RBF network approximation weights $K_*$ can be computed according to (41) or by another method. The optimal feedforward vectors $U_{*,k}$ for the RBF interpolation (41) can be obtained in different ways. The vectors $U_{*,k}$ can be computed in the course of numerical optimization by using a detailed numerical model of the system. An advantage of an RBF network approximation in this case is that it can be used for fast computation in real time, while the computationally expensive numerical optimization is done off-line. An application of this approach in the control of car parking is considered in [21], in the control of a free-flying space robot in [22], and in three-dimensional slewing maneuvers of a flexible spacecraft system in [59]. Alternatively, the optimal feedforward vectors can be learned in the course of iterative repetitive experiments with the system, where the task parameter vector $p$ repeatedly takes the respective value $Q^{(k)}$. This approach, discussed in the next section, does not require detailed knowledge of the system dynamics. An experimental application of such an approach in the control of fast motions for a direct-drive robot is studied in [23].
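A sketch of the controller construction (40)-(42): precomputed optimal feedforward vectors on the node grid are interpolated exactly, and the controller is then evaluated for a task parameter off the grid. The "optimal feedforward" function used to generate the data below is a stand-in for the off-line numerical optimization described above; all values are placeholders.

```python
import numpy as np

def h(r2, d=0.6):
    return np.exp(-r2 / d**2)            # Gaussian radial function from (25)

def regressor(p, Q):
    """RBF regressor vector Phi(p) of (33) for node centers Q (Na, Np)."""
    return h(np.sum((Q - p)**2, axis=-1))

# Node centers Q^(k) on a grid over the task-parameter domain (1-D placeholder).
Q = np.linspace(0.0, 1.0, 7)[:, None]                     # (Na, 1)
U_star = np.stack([np.array([np.sin(3 * q[0]), q[0]**2])  # stand-in optimal U*(Q^(k))
                   for q in Q], axis=1)                   # (NU, Na)

H = h(np.sum((Q[:, None, :] - Q[None, :, :])**2, axis=-1))   # interpolation matrix (42)
K = np.linalg.solve(H.T, U_star.T).T                         # weights (41): K = U_star H^{-1}

p_new = np.array([0.37])                    # a task parameter not on the grid
U_ff = K @ regressor(p_new, Q)              # controller output (40)
print(U_ff, np.array([np.sin(3 * 0.37), 0.37**2]))   # close to the underlying optimum
```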
IV. LEARNING FEEDFORWARD FOR A GIVEN TASK

This section is devoted to the problem of learning a single optimal shape vector $U^*$. We assume that the task parameter vector $p$ is fixed. Therefore, we will not write the dependence on $p$ explicitly unless this is needed to avoid ambiguity. The section has a didactic purpose and exposes some background ideas of learning control that are further elaborated in the more comprehensive approaches of Sections V and VI.
A. LEARNING CONTROL AS ON-LINE OPTIMIZATION

Let us assume that the parameter vector $p$ is fixed. In order to achieve the control goal, a feedforward shape vector $U^*$ has to be found that minimizes the performance index (18). Let us first assume that an estimate $\hat{G}$ of the Jacobian matrix $G = \partial S/\partial U$ of the mapping (15) is known for the optimal input $U^*$. Let $U^{(n)}$ and $Y^{(n)} = S(U^{(n)})$ be the input and output vectors obtained at iteration $n$. By using the Levenberg-Marquardt algorithm [60], the next minimizing input guess can be computed as
$$U^{(n+1)} = U^{(n)} - \big((\rho + \mu_n) I + \hat{G}^T \hat{G}\big)^{-1}\big(\rho\, U^{(n)} + \hat{G}^T (Y^{(n)} - Y_d)\big), \tag{44}$$
where $\mu_n > 0$ is a step length parameter, $I = I_{N_U}$ is the $N_U \times N_U$ identity matrix, and $\hat{G} = \hat{G}(U^{(n)})$. The motivation for the update (44) is as follows [60]. Let us consider a local affine model of the mapping (15) of the form
$$\hat{Y} = Y^{(n)} + \hat{G}\,\big(U - U^{(n)}\big). \tag{45}$$
Let us also demand that the minimization step length does not exceed a given value $d_n > 0$:
$$U^{(n+1)} = U^{(n)} + s^{(n)}, \qquad \|s^{(n)}\| \le d_n. \tag{46}$$
By solving (18) and (45) with respect to $U^{(n+1)}$ and using the Lagrange multiplier method to satisfy the constraint (46), we arrive at (44). The Lagrange multiplier $\mu_n$ is nonnegative and can be computed once the Jacobian estimate $\hat{G}$ and the allowed step length $d_n$ are given. With an increase of $\mu_n$ in (44), all eigenvalues of the inverted matrix in (44) increase; hence, the step length $\|s^{(n)}\|$ decreases. Therefore, the dependence of $\mu_n$ on $d_n$ is decreasing. In practice, instead of computing $\mu_n$ from $d_n$, usually $\mu_n$ itself is made a parameter of choice. More details on the method can be found in [60]. If $\mu$ is small, the method approximates the Gauss-Newton method; if $\mu$ is large, it approximates the downhill (gradient) method. Each iteration in (44) presumes a repeated execution of the same given task, each time using a different feedforward input $U^{(n)}$ and measuring the corresponding sampled output vector $Y^{(n)}$. Thus, (44) is a learning control iteration. In the commercially available software implementing the Levenberg-Marquardt method, the step-limiting parameter $\mu$ in (44) is chosen anew at each step. This, however, requires several evaluations of the minimized function. In learning control problems, each evaluation is a completion of the task by the controlled system and thus has a very large cost. Therefore, we use the update (44) with a constant preselected parameter $\mu$. The schematics of the discussed learning update are shown in Fig. 3. At step $k$, the feedforward vector $U = U^{(k)}$ stored in memory is applied to the system
Figure 3 Schematics of learning update in a feedforward controller.
and the output $Y^{(k)} = S(U^{(k)})$ is obtained. This output is used in the Levenberg-Marquardt update (44) to compute an update for $U$. The updated value of $U$ is applied at the next iteration, and so on. As shown in Fig. 3, the update uses an estimate $\hat{G}$ for the sensitivity matrix $G(U^*)$. The update rule (44) presumes that an estimate of the gradient (input-output sensitivity) matrix $G$ is known and is sufficiently accurate. If unknown, this matrix can be estimated with a finite-difference method. The next subsection presents a simple result showing that the update (44) has certain robustness to error in the estimate $\hat{G}$. Section IV.C discusses a finite-difference update for this estimate. Such an update would add an extra feedback loop for the gain $\hat{G}$ in Fig. 3 and make the learning algorithm adaptive.
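A sketch of the learning iteration (44) on an affine plant of the form (47), using a deliberately misestimated sensitivity $\hat{G}$ to illustrate the robustness discussed in the next subsection; the plant and all parameter values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
NY, NU = 10, 6
G_true = rng.normal(size=(NY, NU))       # true sensitivity of the affine plant (47)
Z = rng.normal(size=NY)
Yd = rng.normal(size=NY)
G_hat = G_true + 0.1 * rng.normal(size=(NY, NU))   # imperfect estimate used in (44)

rho, mu = 1e-2, 1.0                      # regularization and step-limiting parameters
U = np.zeros(NU)
M = np.linalg.inv((rho + mu) * np.eye(NU) + G_hat.T @ G_hat)
for n in range(50):                      # each pass = one repeated task execution
    Y = G_true @ U + Z                   # "measured" sampled output Y^(n)
    U = U - M @ (rho * U + G_hat.T @ (Y - Yd))     # Levenberg-Marquardt update (44)
print(np.linalg.norm(G_true @ U + Z - Yd))         # tracking error after learning
```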
B. ROBUST CONVERGENCE OF THE LEARNING CONTROL ALGORITHM

Analysis of the convergence of the Levenberg-Marquardt algorithm for a nonlinear problem can be found in [60]. This analysis assumes that the Jacobian $G$ is known exactly at each step. Herein, we consider the update (44) as a part of a discrete-time closed-loop control system. Following the established approach to the analysis of such systems, let us study the robust stability of the linearized loop. We assume that the system (15) is affine in $U$ in the vicinity of the optimum. The affine model has the form
$$Y = GU + Z. \tag{47}$$
In the linear-quadratic setting (18), (47), the Levenberg-Marquardt algorithm (44) converges for any positive values of the parameters $\rho$ and $\mu$, and it is robust with respect to the error in estimating the matrix $G$. A sufficient condition for the convergence is given by the following theorem.
THEOREM 1. Let us consider the update (44) of the input of the system (47). The algorithm asymptotically converges for any initial condition $U^{(0)}$ if some $k_0 \ge 1$ exists such that for any $k > k_0$ the maximal singular value of the gradient estimation error satisfies the following inequality:
$$\bar{\sigma}\big(G - \hat{G}\big) < \frac{2\rho}{\sqrt{\rho + \mu}}. \tag{48}$$
Theorem 1 shows that the convergence robustness improves for larger values of the regularization parameter $\rho$ and is absent if no regularization is performed. At the same time, increasing $\rho$ increases the steady-state tracking error $\|Y - Y_d\|$ achieved upon convergence. Proofs of Theorem 1 can be found in the papers [17, 23], which also present an analysis of the static error of the learned feedforward depending on the error in the estimation of the matrix $G$. The papers [17, 23] demonstrate experimental results of applying the learning control update of the form (44) to trajectory control of robotic arms.
C. FINITE-DIFFERENCE UPDATE OF THE GRADIENT

Let us proceed with a situation where the Jacobian $G$ is not known and we can only evaluate the function mapping (15) pointwise, by executing the respective task with a given input $U$ and observing the output $Y$. In this case, a possible approach is to introduce an affine model (45) of the mapping (15) and update the estimates of the parameters of this model from the available input-output measurements. The most common practically used method for estimating the Jacobian $G$ is the Broyden secant update. Let $\hat{G}^{(n)}$ be an estimate of the Jacobian at step $n$. Denote by $s^{(n)}$ the variation of the input and by $w^{(n)}$ the corresponding variation of the output at the previous minimization step. For a small step length $\|s^{(n)}\|$, the updated estimate should provide a fit to the observed data, that is,
$$\hat{G}^{(n+1)} s^{(n)} = w^{(n)}, \qquad s^{(n)} = U^{(n)} - U^{(n-1)}, \qquad w^{(n)} = Y^{(n)} - Y^{(n-1)}. \tag{49}$$
The Broyden update rule can be considered as an application to (49) of the projection estimation algorithm [56], which is very popular in adaptive control and signal processing applications. The Broyden update is used in conjunction with the input update (44) and has the form
$$\hat{G}^{(n+1)} = \hat{G}^{(n)} + \frac{\big(w^{(n)} - \hat{G}^{(n)} s^{(n)}\big)\, s^{(n)T}}{c + \|s^{(n)}\|^2}, \tag{50}$$
where $c > 0$ is a scalar parameter used to avoid division by zero. For a nonlinear mapping, local convergence of the Levenberg-Marquardt algorithm with the Broyden secant update can be proved using the bounded deterioration technique, as considered in [60]. The idea of such a proof is that in the vicinity of the optimum and for a sufficiently small initial error of approximating the gradient $G$, the algorithm will converge before the gradient approximation error has time to grow owing to the system nonlinearity.
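A sketch coupling the Broyden secant update (50) with the input update (44), so that the Jacobian estimate is refined from the task executions themselves; the mildly nonlinear plant and all constants are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
NY, NU = 8, 5
A = rng.normal(size=(NY, NU))

def S(U):
    """Placeholder smooth nonlinear plant mapping (15) for a fixed task."""
    return A @ U + 0.1 * np.sin(A @ U)

Yd = S(rng.normal(size=NU) * 0.5)        # a reachable desired output, placeholder
rho, mu, c = 1e-3, 1.0, 1e-8
U, Y = np.zeros(NU), S(np.zeros(NU))
G = A.copy()                             # rough initial Jacobian estimate
for n in range(30):
    step = -np.linalg.solve((rho + mu) * np.eye(NU) + G.T @ G,
                            rho * U + G.T @ (Y - Yd))        # input update (44)
    U_new = U + step
    Y_new = S(U_new)                                         # one more task execution
    w = Y_new - Y                                            # output variation of (49)
    G += np.outer(w - G @ step, step) / (c + step @ step)    # Broyden update (50)
    U, Y = U_new, Y_new
print(np.linalg.norm(Y - Yd))            # small residual; rho keeps it from exactly 0
```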
Let us now consider another method for estimating the Jacobian $G$, which can be more appropriate if the measurements are corrupted with noise. In the on-line learning algorithms discussed in the subsequent sections, approximation errors can be considered as such a noise. Let us write the affine model (47) for the mapping (15) in the form of a linear regression:
$$Y = GU + Z = \Theta W, \qquad \Theta = [Z/c \;\; G], \qquad W = \begin{bmatrix} c \\ U \end{bmatrix}, \tag{51}$$
where $c$ is a positive scaling constant, $\Theta \in \mathbb{R}^{N_Y \times (N_U + 1)}$ is a regression parameter matrix, and $W$ is a regressor vector. The Broyden gradient update (50) is a two-step estimation algorithm for the regression (51) that first sets $Z = \hat{G} U^{(n-1)} - Y^{(n-1)}$ in (51) and then updates an estimate for $G$ with the projection method. A more natural, one-step projection estimation algorithm for the model (51) has the form
$$\hat\Theta^{(n+1)} = \hat\Theta^{(n)} + a^{(n)}\big(Y^{(n)} - \hat\Theta^{(n)} W^{(n)}\big) W^{(n)T} \big/ \|W^{(n)}\|^2, \tag{52}$$
where $\hat\Theta^{(n)} = [\hat{Z}^{(n)}/c \;\; \hat{G}^{(n)}]$, $W^{(n)} = [c \;\; U^{(n)T}]^T$, and $a^{(n)} \in \{0, 1\}$ is a scalar dead-zone parameter. Unlike the secant update (49), (50), which uses function values obtained on two consecutive steps, the update (52) uses only one function value. This makes it possible to generalize the update (52) to the optimization with changing task parameters, as shown in the next sections. Let us write a step of the Levenberg-Marquardt algorithm for the affine model in the form (51). By solving (18) and (51) with respect to $U^{(n+1)}$ and using the Lagrange multiplier method to satisfy the constraint (46), we obtain
$$U^{(n+1)} = U^{(n)} - \big((\mu_n + \rho) I + \hat{G}^{(n)T} \hat{G}^{(n)}\big)^{-1}\big(\rho\, U^{(n)} + \hat{G}^{(n)T}(\hat{Y}^{(n)} - Y_d)\big), \tag{53}$$
where $\hat{Y}^{(n)} = \hat{G}^{(n)} U^{(n)} + \hat{Z}^{(n)}$. We have $\hat{Y}^{(n)} = Y^{(n)}$ after the projection update (52), as long as $a^{(n)} = 1$. Therefore, (53) coincides with the Levenberg-Marquardt step (44). A more detailed theoretical study is presented in [61]. The experiments in applying learning control algorithms with an adaptive update of $\hat{G}$, as considered in this section, to trajectory control of a direct-drive manipulator are described in [23].
V. ON-LINE LEARNING UPDATE IN TASK-DEPENDENT FEEDFORWARD

This section considers Problem 3 stated in Section II.D. We assume that a model of the system is available for the design of the task-dependent feedforward controller. This model is, however, imprecise, and the approximation $\hat{U}(p)$ of the optimal control input (1), (13), (16) is not satisfactorily accurate. This section presents an algorithm that allows us to update (learn) the approximation $\hat{U}(p)$ in the course of normal system operation, that is, assuming that the sequence $p^{(k)}$ of the task parameter vectors is arbitrary. Upon completion of each task, characterized by the task vector $p^{(k)}$, the output vector $Y^{(k)}$ is used to compute an update of the controller approximation $\hat{U}(p)$ available at this step. In what follows, we propose and study such an update.
A. APPROXIMATING SYSTEM SENSITIVITY

Similarly to the learning algorithm considered in Section IV, the algorithm of this section updates the guess of the optimal feedforward control based on an available estimate of the system input-output sensitivity matrix $G$ in (17). Unlike Section IV, in this section we have to consider the dependence of this sensitivity matrix on the changing vector $p$. Let us introduce the matrix-valued function defining the system sensitivity at the optimal feedforward input:
$$G_*(p) = \left.\frac{\partial S}{\partial U}(U, p)\right|_{U = U^*(p)}. \tag{54}$$
Though the mapping (54) is not known exactly, it can be approximated based on the available system model, in the same way as the approximation $\hat{U}(p)$ of the optimal control is built in Section III. Let us assume that for each of the RBF approximation nodes $p = Q^{(k)}$ used for building the approximation of the optimal feedforward in Section III.D, the sensitivity matrix $G_{*,k} = G_*(Q^{(k)})$ is computed along with the optimal input vector $U_{*,k} = U^*(Q^{(k)})$. The sensitivity matrix would usually be computed as a byproduct of a numerical optimization procedure (such as Levenberg-Marquardt) applied to the available system model to find $U^*(Q^{(k)})$. In the process of the numerical optimization, a matrix $G_*(Q^{(k)})$ can be obtained, for instance, by a finite-difference method. Similarly to the approximation (40) for the mapping $U^*(p)$, let us use an RBF network approximation for the mapping $G_*(p)$. Unlike the vector-valued mapping $U^*(p)$, the mapping $G_*(p)$ is matrix-valued. To facilitate work with such mappings, let us introduce the vectorization operator $\mathrm{vec}(\cdot)$. For a matrix $A \in \mathbb{R}^{m \times n}$, the vector $\mathrm{vec}(A) \in \mathbb{R}^{mn}$ is composed of all the entries of $A$, column by column.
An RBF network approximation $\hat{G}(p)$ for $G_*(p)$ can be presented in a form similar to (40):
$$\mathrm{vec}\big(\hat{G}(p)\big) = \sum_{j=1}^{N_a} \mathrm{vec}(G_j)\, h(p - Q^{(j)}) = \Gamma\, \Phi(p), \qquad p \in \mathcal{P}, \tag{55}$$
where $G_j \in \mathbb{R}^{N_Y \times N_U}$ are the RBF network weights and $\Phi(p)$ is the RBF regressor vector (33). Note that (55) can also be represented in the form
$$\hat{G}(p) = \sum_{j=1}^{N_a} G_j\, h(p - Q^{(j)}). \tag{56}$$
The weights of the RBF network (55) can be computed so as to implement exact RBF interpolation of the matrices $G_{*,k} = G_*(Q^{(k)})$. In this case, the network node centers in (55) coincide with the data points $p = Q^{(k)}$, and the weight matrix $\Gamma$ can be computed similarly to (41) as
$$\Gamma = \big[\mathrm{vec}(G_{*,1}) \; \cdots \; \mathrm{vec}(G_{*,N_a})\big] H^{-1}, \tag{57}$$
where $H$ is the RBF interpolation matrix of the form (42). Under general conditions, the function $G_*(p)$ has derivatives that are uniformly bounded on $\mathcal{P}$, provided that the derivatives of $U^*(p)$ are bounded. Hence, the error of the RBF network approximation (55) can be made small for a high order $N_a$ of the expansion (55). Computing and storing the approximation $\hat{G}(p)$ of the sensitivity matrix $G_*(p)$ is similar to the usual practice of storing the system gain information along with setpoints as part of a feedback controller design. It is particularly related to gain scheduling methods, where gain and setpoint tables are stored in the controller.
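A sketch of the matrix-valued interpolation (55)-(57) using the $\mathrm{vec}(\cdot)$ device; the node centers and the sampled Jacobians are synthetic placeholders.

```python
import numpy as np

def h(r2, d=0.5):
    return np.exp(-r2 / d**2)

NY, NU, Na, Np = 4, 3, 9, 2
rng = np.random.default_rng(6)
Q = rng.uniform(-1, 1, size=(Na, Np))                  # node centers Q^(j)
G_samples = rng.normal(size=(Na, NY, NU))              # precomputed G_*(Q^(j)) placeholders

H = h(np.sum((Q[:, None, :] - Q[None, :, :])**2, axis=-1))   # interpolation matrix (42)
V = np.stack([G.reshape(-1, order="F") for G in G_samples], axis=1)  # columns vec(G_*j)
Gamma = np.linalg.solve(H.T, V.T).T                    # weights (57): Gamma = V H^{-1}

def G_hat(p):
    """Evaluate the sensitivity approximation (55) and un-vec it into a matrix."""
    phi = h(np.sum((Q - p)**2, axis=-1))               # regressor vector (33)
    return (Gamma @ phi).reshape(NY, NU, order="F")

p = Q[2]
print(np.linalg.norm(G_hat(p) - G_samples[2]))         # ~0: exact at the nodes
```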
B. LOCAL LEVENBERG-MARQUARDT UPDATE

In this subsection, we assume that the RBF approximation $\hat{U}(p)$ (40) is not accurate enough and design an update for the feedforward using the input-output data for the system (15). Let us assume that in task $k$ an input vector $U^{(k)}$ (13) was applied to the system and an output vector $Y^{(k)}$ (14) was obtained, while the task parameter vector was $p^{(k)}$. Note that the applied input will be $U^{(k)} = \hat{U}(p^{(k)})$, where $\hat{U}(p)$ is the approximation (40) of the optimal feedforward mapping as available at this step. Following the usual practice of iterative optimization, as discussed in the derivation of the Levenberg-Marquardt algorithm in Section IV.A, let us consider
an affine model of the mapping (15). A local model valid for the current task, that is, for $p = p^{(k)}$, can be obtained by using the input-output data $U^{(k)}$, $Y^{(k)}$ and the sensitivity estimate $\hat{G}(p)$ (55). This local affine model has a form similar to (45):
$$\hat{Y}(p^{(k)}) = Y^{(k)} + \hat{G}(p^{(k)})\,\big(U - U^{(k)}\big). \tag{58}$$
By substituting the affine model into the performance index (18) and finding the minimum, we arrive at the optimality condition
$$\hat{G}^T(p^{(k)})\big(\hat{Y}(p^{(k)}) - Y_d\big) + \rho\, U = 0, \tag{59}$$
where $\hat{Y}(p^{(k)})$ is defined by (58). By solving (58), (59) and limiting the update step as in (46), we obtain a Levenberg-Marquardt update similar to (44). This update gives us $U = U^{(k|k+1)} = U^{(k)} + \Delta U^{(k)}$, the a posteriori optimal control input for task $k$ (task parameter vector $p = p^{(k)}$):
$$\Delta U^{(k)} = -\big(\hat{G}^T(p^{(k)})\,\hat{G}(p^{(k)}) + (\rho + \mu) I\big)^{-1}\big(\hat{G}^T(p^{(k)})\,(Y^{(k)} - Y_d) + \rho\, U^{(k)}\big), \tag{60}$$
where $\Delta U^{(k)}$ is the update step for the feedforward $U$. Note that the update (60) is calculated assuming that the task parameter vector is fixed, $p = p^{(k)}$. In fact, our goal is to calculate an update of the RBF approximation controller (40) for all $p$. This can be done based on the update (60), as explained in the next subsection.
C. UPDATE OF RADIAL BASIS FUNCTION APPROXIMATION IN THE FEEDFORWARD CONTROLLER

As mentioned previously, we assume that the feedforward vector $U$ applied at step $k$ is computed using the task-dependent RBF approximation controller (40) shown in Fig. 2. In accordance with (40), this control has the form
$$U^{(k)} = \hat{K}^{(k)}\, \Phi(p^{(k)}), \tag{61}$$
where $\hat{K}^{(k)}$ is the weight matrix of the feedforward RBF controller available at step $k$. Based on the a posteriori optimal feedforward solution (60), the weight matrix $\hat{K}$ should be modified so that the controller would yield this new optimal solution for $p = p^{(k)}$. The latter condition can be written as
$$U^{(k|k+1)} = \hat{K}^{(k+1)}\, \Phi(p^{(k)}), \tag{62}$$
where $\hat{K}^{(k+1)}$ is the updated RBF network weight matrix.
By using (60) and (62), we obtain a projection update for the controller (61) in the same way as (36) is obtained from (35). Modified to include a dead-zone compensation for the approximation error, this update has the form
$$\hat{K}^{(k+1)} = \hat{K}^{(k)} + a^{(k)}\, \Delta U^{(k)}\, \Phi^T(p^{(k)}) \big/ \big\|\Phi(p^{(k)})\big\|^2, \tag{63}$$
where $a^{(k)} \in \{0, 1\}$ is a dead-zone parameter and $\Delta U^{(k)}$ is defined by (60). The update (60), (63) is analogous to the projection update (36). As one can easily check by substitution, for $a^{(k)} = 1$, (62) holds exactly. For $a^{(k)} = 0$, no update is performed.

Figure 4 illustrates the proposed design of the learning controller. The task parameter vectors $p$ are generated externally to the diagram of Fig. 4 and are supplied to the controller (40) computing an RBF approximation for $U^*(p)$. The RBF approximation in (40) is updated in accordance with (60), (63), depending on the system output $Y^{(k)}$. The update (60) uses the approximation of the Jacobian $\hat{G}(p)$, which is computed by the RBF network (55). The weights of the network (55) are computed off-line, for example, according to (57). The dead-zone parameter $a^{(k)}$ in (63) should be chosen similarly to (38) in order to ensure the algorithm convergence. The dead zone must compensate for the approximation error (43) and also for the initial error of the approximation (61). It is possible to demonstrate dead-zone convergence of the proposed algorithm and estimate the necessary dead zone by using a modification of the standard convergence results in [56]. However, presenting such a proof is beyond the scope of this chapter.
Figure 4 Schematics of learning update in a task-dependent RBF approximation feedforward controller.
The algorithms of this section have been applied by the author in the control of slewing maneuvers of a flexible spacecraft system with nonlinear rotational dynamics.
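A sketch of the on-line update of this section (controller evaluation (61), local Levenberg-Marquardt step (60), and weight projection update (63)) on a synthetic task family with a parameter-independent sensitivity; the dead zone is omitted, $a^{(k)} \equiv 1$, and every numerical detail is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(7)
NY, NU, Na = 6, 4, 11
Q = np.linspace(0.0, 1.0, Na)                    # RBF node centers, placeholder
G0 = rng.normal(size=(NY, NU))                   # plant sensitivity (p-independent here)

def plant(U, p):                                 # placeholder system mapping (15)
    return G0 @ U + np.sin(3 * p) * np.ones(NY)
def Yd_of(p):                                    # placeholder desired output
    return np.cos(3 * p) * np.ones(NY)
def phi(p, d=0.25):                              # regressor vector (33)
    return np.exp(-(Q - p)**2 / d**2)

rho, mu = 1e-3, 0.5
K = np.zeros((NU, Na))                           # controller weight matrix in (61)
# The sensitivity model is taken as the exact constant G0, mimicking the off-line
# model (55)-(57); M is the fixed Levenberg-Marquardt matrix appearing in (60).
M = np.linalg.inv(G0.T @ G0 + (rho + mu) * np.eye(NU))
for k in range(400):
    p = rng.uniform(0.0, 1.0)                    # arbitrary task sequence {p^(k)}
    f = phi(p)
    U = K @ f                                    # controller output (61)
    Y = plant(U, p)                              # completed task, measured output
    dU = -M @ (G0.T @ (Y - Yd_of(p)) + rho * U)  # local L-M update step (60)
    K += np.outer(dU, f) / (f @ f)               # projection weight update (63), a^(k)=1
p = 0.5
print(np.linalg.norm(plant(K @ phi(p), p) - Yd_of(p)))   # tracking error over tasks
```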
VI. ADAPTIVE LEARNING OF TASK-DEPENDENT FEEDFORWARD

This section proposes a solution to Problem 4 stated in Section II.D. The problem is to learn (update) an approximation $\hat{U}(p)$ (40) of the feedforward input optimal in the sense of (18), for an arbitrary sequence of the parameter vectors $p$ (5), by using only input-output data. Unlike the previous section, an accurate approximation of the dependence of the Jacobian $G_*(p)$ on the task parameters is no longer assumed to be available in advance. The available approximation of the Jacobian is assumed to contain an error and is refined on-line as a part of the learning controller we are going to develop. In order to solve the stated problem, we introduce an affine RBF network model of the system mapping and estimate this model on-line. The algorithm discussed in this section resembles the algorithm for on-line parametric nonlinear least squares (NLS) optimization proposed and studied in [61, 63-65]. This algorithm can be viewed as a discrete-time adaptive algorithm for nonlinear system control.
A. AFFINE RADIAL BASIS FUNCTION NETWORK MODEL OF THE SYSTEM MAPPING
The on-line parametric optimization algorithm of [62] that we are about to derive can be considered as an extension of the Levenberg-Marquardt algorithm. Similarly to the standard derivation of the Levenberg-Marquardt algorithm given earlier, our derivation here will be based on an affine model (45) of the mapping (15). Such an affine model can be written in the form
$$Y = G(p)\, U + Z(p), \tag{64}$$
where $G(p)$ and $Z(p)$ are (smooth) matrix and vector functions of $p$. For a fixed task parameter vector $p$, the model (64) is affine in $U$; at the same time, the model depends on $p$ in a nonlinear way. Let us introduce the functions
$$Y_*(p) = S\big(U^*(p), p\big), \qquad Z_*(p) = Y_*(p) - G_*(p)\, U^*(p), \tag{65}$$
where $G_*(p)$ is given by (54). Similarly to (56), let us use RBF networks for approximating the functions $Z_*(p)$ and $G_*(p)$. Let us assume that these mappings
can be represented in the form
$$Z_*(p) = \sum_{j=1}^{N_a} Z_{*j}\, h(p - Q^{(j)}) + \delta_Z(p), \qquad \|\delta_Z(p)\| \le \delta_Z, \quad p \in \mathcal{P}, \tag{66}$$
$$G_*(p) = \sum_{j=1}^{N_a} G_{*j}\, h(p - Q^{(j)}) + \delta_G(p), \qquad \|\delta_G(p)\| \le \delta_G, \quad p \in \mathcal{P}, \tag{67}$$
where $Z_{*j} \in \mathbb{R}^{N_Y}$ and $G_{*j} \in \mathbb{R}^{N_Y \times N_U}$ are the expansion weights. The residual errors $\delta_Z$ and $\delta_G$ can be made small for a high order $N_a$ of the expansions (66), (67), that is, by selecting a sufficiently high density of the RBF network nodes. We are now in a position to explain the basic algorithm of [61, 62], which we are going to use for the adaptive update of the controller. As in Section V, when deriving the algorithm, we neglect the approximation errors $\delta_U$, $\delta_Z$, and $\delta_G$. These errors are taken into account in the algorithm convergence analysis of [61, 62]. In the absence of the approximation error in (43) (for $\delta_U = 0$), (40) can be represented in the linear regression form
$$U^*(p) = K_*\, \Phi(p). \tag{68}$$
Similarly to (68), we can present (66) and (67) in the form of regressions linear in the weights $Z_{*j}$ and $G_{*j}$. Keeping these regression representations of (66) and (67) in mind, a model of the form (51) can be represented as the following regression:
$$Y = \Theta\, \bar{\Phi}(p, U), \qquad \Theta \in \mathbb{R}^{N_Y \times N_a (N_U + 1)}, \tag{69}$$
$$\bar{\Phi}(p, U) = \Phi(p) \otimes W \in \mathbb{R}^{N_a (N_U + 1)}, \qquad W = [c \;\; U^T]^T, \tag{70}$$
where $\otimes$ denotes the Kronecker (direct) product of matrices, $c > 0$ is a scalar scaling parameter, $\Phi(p)$ is the regressor vector (33), $\bar{\Phi}(p, U)$ is an extended regressor vector, and $\Theta$ is the RBF network weight matrix. For a fixed parameter $p$, the model (69) has the form (64). By substituting $\Theta = [Z_1/c \;\; G_1 \; \cdots \; Z_{N_a}/c \;\; G_{N_a}]$ into (69), one obtains an affine model of the form (64), where
$$Z(p) = \sum_{j=1}^{N_a} Z_j\, h(p - Q^{(j)}), \qquad G(p) = \sum_{j=1}^{N_a} G_j\, h(p - Q^{(j)}). \tag{71}$$
We assume that for $\Theta = \Theta_*$ and in the absence of the approximation errors, that is, for $\delta_G = \delta_Z = \delta_U = 0$, the affine model (69) gives exactly the linearization
of the mapping (15) at the optimum (66), (67). That is,
$$\Theta_* = \big[Z_{*1}/c \;\; G_{*1} \; \cdots \; Z_{*N_a}/c \;\; G_{*N_a}\big]. \tag{72}$$
Note that the approximation (71) for the function $G_*(p)$ (54) has the same form as the approximation (57) considered in Section V. Unlike Section V, where the approximation (57) was estimated using RBF interpolation of precomputed data on the Jacobian matrix, in this section we consider an algorithm that estimates (71) using only values of the mapping (15), that is, the input and output vectors for each task.
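A sketch of the extended regressor (70) and of recovering the affine form (64), (71) from the weight matrix $\Theta$; all dimensions and weights are placeholder assumptions.

```python
import numpy as np

Na, NU, NY, c = 5, 3, 4, 1.0
rng = np.random.default_rng(8)
Q = np.linspace(0, 1, Na)                             # scalar node centers, placeholder

def phi(p, d=0.3):
    return np.exp(-(Q - p)**2 / d**2)                 # regressor vector (33)

def phi_bar(p, U):
    """Extended regressor (70): Phi(p) kron W, with W = [c, U^T]^T."""
    W = np.concatenate(([c], U))
    return np.kron(phi(p), W)                         # length Na*(NU+1)

Theta = rng.normal(size=(NY, Na * (NU + 1)))          # RBF weight matrix in (69)
p, U = 0.4, rng.normal(size=NU)
Y = Theta @ phi_bar(p, U)                             # model output (69)

# For fixed p the model is affine in U: Y = G(p) U + Z(p), as in (64), (71).
blocks = Theta.reshape(NY, Na, NU + 1)
f = phi(p)
Z_p = (blocks[:, :, 0] * c) @ f                       # Z(p) = sum_j Z_j h(p - Q_j)
G_p = np.einsum('yjk,j->yk', blocks[:, :, 1:], f)     # G(p) = sum_j G_j h(p - Q_j)
print(np.linalg.norm(Y - (G_p @ U + Z_p)))            # ~0: the two forms agree
```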
B. ADAPTIVE UPDATE ALGORITHM

The algorithm we are going to present is an extension of the Section V algorithm; it updates a guess of the optimal weight matrix K* in (68). Our goal is to build an approximation of the form (68) to the optimal input mapping U*(p). We assume that a sequence of the information vectors {p^{(k)}} is given. Let K^{(k)} be the value of the input parameter matrix in (68) at step k. Then, in accordance with (40) and similarly to (61), the input vector for task k is U^{(k)} = K^{(k)}Φ(p^{(k)}). Let us denote by

U^{(k+1|k)} = K^{(k)}Φ(p^{(k+1)})   (73)
the output of the controller (40) that would be obtained at step k + 1 if the matrix K^{(k)} were not updated. As in Section V, we denote by Y^{(k)} the output vector for task k. Let us demand that, similarly to the standard Levenberg-Marquardt method, the step of the control update be bounded. Instead of bounding the step for the updated variable K, we consider the change in U that this update brings, and bound this change. We use a condition similar to (46):

s^{(k)} = U^{(k+1)} − U^{(k+1|k)},   ‖s^{(k)}‖ ≤ d_k.   (74)
Whereas in (46) we did not consider the dependence on p, in (74) we do. Therefore, both U^{(k+1|k)} given by (73) and U^{(k+1)} given by (61) are computed for the same task parameter vector p = p^{(k+1)}, before and after the update of the RBF weight matrix K^{(k)}, respectively. The output of the off-line RBF model (69) at step k + 1 is

Ŷ^{(k+1)} = Θ^{(k)}Φ(p^{(k+1)}, U^{(k+1|k)} + s^{(k)}) = Ŷ^{(k+1|k)} + G(p^{(k+1)}) s^{(k)},   (75)
where w^{(k)} = [0  s^{(k)T}]^T and Ŷ^{(k+1|k)} = Θ^{(k)}Φ(p^{(k+1)}, U^{(k+1|k)}). By substituting Ŷ^{(k+1)} from (75) for Y and U^{(k+1)} = U^{(k+1|k)} + s^{(k)} for U in (18), and minimizing over the step s^{(k)} subject to the constraint ‖s^{(k)}‖ ≤ d_k, we obtain

s^{(k)} = −(μ_k I + D^{(k+1)})^{-1} G^T(p^{(k+1)}) (Ŷ^{(k+1|k)} − Y_*),   U^{(k+1)} = U^{(k+1|k)} + s^{(k)},   (76)

D^{(k+1)} = G^T(p^{(k+1)}) G(p^{(k+1)}),   (77)
where Y_* denotes the desired output vector for the task and μ_k is a Lagrange multiplier introduced to comply with the step-boundedness condition ‖s^{(k)}‖ ≤ d_k. As for the classical Levenberg-Marquardt method explained earlier, the dependence of μ_k on d_k is monotone nonincreasing; instead of empirically choosing d_k first and computing the Lagrange multiplier μ_k from it, it is advisable to make μ_k itself the parameter of choice. Recall that we update the input U indirectly, by updating the weights of the RBF network approximation of U*(p). According to (68), (73), and (74), we can write

s^{(k)} = (K^{(k+1)} − K^{(k)}) Φ(p^{(k+1)}),   (78)
By finding a least-squares solution of (78) for the RBF weight matrix update K^{(k+1)} − K^{(k)} and substituting (76) for s^{(k)}, we obtain a step of the proposed basic parametric NLS optimization method:

K^{(k+1)} = K^{(k)} − (μ_k I + D^{(k+1)})^{-1} G^T(p^{(k+1)}) (Ŷ^{(k+1|k)} − Y_*) Φ^T(p^{(k+1)}) / ‖Φ(p^{(k+1)})‖²,   (79)
where G(p) is defined by (71); D^{(k+1)}, by (77); U^{(k+1|k)}, by (73) and (33); and Ŷ^{(k+1|k)} is defined in accordance with (69), (73), and (75) as Ŷ^{(k+1|k)} = Θ^{(k)}Φ(p^{(k+1)}, U^{(k+1|k)}).

As discussed previously, the affine model (64) can be written in the regression form (69). Therefore, at each step of the proposed update algorithm, we can use the projection update for an estimate of the parameter matrix Θ in (69). This update has the form

Θ^{(k+1)} = Θ^{(k)} + a^{(k)} (Y^{(k)} − Θ^{(k)}Φ^{(k)}) Φ^{(k)T} / ‖Φ^{(k)}‖²,   (80)

where Θ^{(k)} is the regression parameter matrix at step k and Φ^{(k)} = Φ(p^{(k)}, U^{(k)}) is the extended regressor vector at step k. The update (80) can be considered a generalization of the Broyden update (50). In (80), a^{(k)} is a scalar dead-zone parameter introduced in the usual way to compensate for the influence of the mismodeling error: a^{(k)} is zero if the prediction error Y^{(k)} − Θ^{(k)}Φ^{(k)} is within the mismodeling bounds defined by the approximation errors δ_Y, δ_Z in (66), (67), and is unity otherwise.
[Figure 5 block diagram: an affine RBF model, maintained by an affine model update loop, supplies a Levenberg-Marquardt update to the RBF controller U(p), which drives the system S(U, p).]
Figure 5 Schematics of adaptive learning update in a task-dependent RBF approximation feedforward controller.
Figure 5 illustrates the designed controller. The vector p is defined externally to the control diagram of Fig. 5 and acts as a disturbance for the control (18). The Levenberg-Marquardt update (79) modifies the weight matrix K in the RBF network controller (18). The update (79) uses the estimate Ŷ^{(k+1|k)} of the output, (75), and the Jacobian (sensitivity) matrix G(p) obtained from the affine model (69), (70). The external loop in Fig. 5 updates an estimate of this affine model according to (80). The choice of the dead-zone parameter a^{(k)} in (80), as well as a proof of the algorithm convergence, is discussed in [61, 62]. To ensure convergence of the estimation algorithm, a small self-excitation signal can be added to the computed control U^{(k)} before it is applied to the system. This self-excitation is needed to make the regressor vector sequence in (80) persistently exciting, as discussed in [57, 62].
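A minimal sketch of this self-excitation (the amplitude eps and the use of Gaussian noise are assumptions of this sketch; the chapter does not specify the signal):

```python
import numpy as np

def control_with_excitation(K, p, centers, width, rng, eps=1e-3):
    """Feedforward (40) plus a small random self-excitation that helps keep
    the regressor sequence in (80) persistently exciting (cf. [57, 62])."""
    U = K @ rbf_regressor(p, centers, width)
    return U + eps * rng.standard_normal(U.size)
```

Here rng is a numpy random generator, for example np.random.default_rng().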
C. DISCUSSION

Equations (33), (70), (79), and (80) constitute the basic algorithm for on-line parametric NLS optimization first proposed in [62]. An analysis of the algorithm convergence is presented in [61, 62].
For a generic nonlinear mapping (15), it is only possible to prove local convergence of the algorithm. Locality here means that the initial approximation to the nonlinear feedforward control mapping (68), (33) should be sufficiently close to the optimum. The domain of the algorithm convergence theoretically ensured in [61, 62] depends on the degree of the system nonlinearity (a second-derivative bound). The local convergence results of [61, 62] demonstrate that the algorithm of this section is consistent. It can be shown that, under certain persistence of excitation conditions (such as those discussed in [58]), the parameter matrix Θ of the RBF affine model (69) converges into a dead-zone neighborhood of the matrix (72). At the same time, the approximation (61) of the feedforward converges into a neighborhood of the "best" approximation (43). The practical usefulness of the algorithm depends on its performance in a particular application. The next subsection considers one application of the proposed approach.
D. APPLICATION EXAMPLE: LEARNING CONTROL OF A FLEXIBLE ARM

This section applies the developed learning algorithm to the control of fast motions of a flexible-joint arm. No a priori knowledge of the system dynamics is assumed to be available.

Let us consider the planar flexible-joint arm example introduced in Section II.B, and the problem of point-to-point control for such an arm. We assume that the joint torques of the arm are computed according to (10), where the desired trajectory is planned as in (12). The control problem is to compute the feedforward so that the arm comes to the final position q_d(T) at time T = 1.5 without oscillations. This is a difficult problem, because the oscillation period for the lowest eigenfrequency of the system is close to unity. We divide the motion interval [0, T] into seven subintervals [τ_j, τ_{j+1}], with knots τ_j (j = 0, ..., 7), τ_0 = 0, τ_7 = T, and consider a feedforward input (13), U ∈ ℝ^{12}, that is piecewise linear on these subintervals and zero at times 0 and T. In other words, the shape functions φ_j(·) in (13) are the first-order (triangular) B-splines. We monitor the arm motion on the interval [T, T_f], T_f = T + 0.5, at L = 14 uniformly spaced output sampling instants t_1 = T, ..., t_14 = T_f. The measurement vector y ∈ ℝ^4 comprises the drive angles and the joint deformations. By sampling the vector y at the instants t_l, we obtain the output vector Y ∈ ℝ^{56}.

The task mapping (15) in this problem depends on the task parameter vector p (5) that includes the initial and desired final configurations of the arm. Because the system is cyclic, the control depends only on the variation of the first joint angle. Thus, we can write the task parameter vector p in the form

p = [q_{2d}(0)   q_{2d}(T)   (q_{1d}(T) − q_{1d}(0))]^T.   (81)
We assume that the task parameter vector (81) remains bounded inside the domain

P = {p : π/4 ≤ q_{2d}(0), q_{2d}(T) ≤ 3π/4;  −π/2 ≤ q_{1d}(T) − q_{1d}(0) ≤ π/2}.   (82)
To implement the adaptive task-level learning algorithm of Section VI.B, we use a network of Gaussian radial basis functions with the node centers Q^{(j)} placed on a uniform 5 × 3 × 3 mesh in the task parameter space. When implementing the control algorithm, we assume the dynamical model of the system to be completely unknown and set the initial estimate of the parameter matrix Θ in (80) to zero.

The adaptive update algorithm of Section VI.B was implemented as a Matlab program on a 1.2-MFlops computer, and the arm motion simulation was coded in C. The task-level control algorithm does not exploit the initial knowledge of the controlled system dynamics available to the simulation and uses just the input-output data for the control tasks. Given the input and output dimensions U ∈ ℝ^{12} and Y ∈ ℝ^{56} and the number of RBF network nodes N_a = 45, the sizes of the matrices in (70) are Φ(p, U) ∈ ℝ^{585} and Θ ∈ ℝ^{56×585}. These sizes cause no computational problems, as the updates (79) and (80) involve only matrix multiplications, and the matrix inverted in (79) is only of the size of the input (12 × 12). For our Matlab implementation of the algorithm, the control update (79) took 0.16 s, and the affine model update (80), 0.23 s. These computational delays could be acceptable even for the feedforward control of a real-life system, because the updates need to be done only once for each motion. The computation of the control in accordance with (40) takes less than 25 ms, which suggests that the proposed algorithm is feasible for on-line control, especially if the updates (79) and (80) are scheduled outside time-critical feedback loops.

When simulating the planar arm motion, we assume that the arm links are uniform rods of unit mass and length. We take the moments of inertia of the drive rotors as J = diag{2, 2}, the damping in the drives as B = diag{0, 0}, and the angular stiffnesses of the lumped elastic elements in the joints as K = diag{200, 200}. We further assume that the angular position gain of the PD feedback controller (10) is diag{100, 100} and the angular velocity gain is diag{40, 40}. Note that for these parameters of the system, the period of oscillations with the lowest eigenfrequency is about 1 if the elbow angle is 3π/4. The motion time T = 1.5 is close to this period, which makes the control problem very difficult.

We have found that adding a small measurement noise to the simulated system output does not change the algorithm performance in any visible way. The reason is that, for a random parameter vector sequence p^{(k)}, the error of approximating the system mappings with RBF networks already acts in the same way as an output noise.
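The dimensions quoted above can be reproduced with a small sketch (the assignment of the five-node axis to the first-joint variation is an assumption; the chapter states only that the mesh is 5 × 3 × 3 over the domain (82)):

```python
import numpy as np
from itertools import product

q2_axis = np.linspace(np.pi / 4, 3 * np.pi / 4, 3)    # q_2d(0) and q_2d(T) axes
dq1_axis = np.linspace(-np.pi / 2, np.pi / 2, 5)      # q_1d(T) - q_1d(0) axis
centers = np.array([[a, b, d] for a, b, d in product(q2_axis, q2_axis, dq1_axis)])
assert centers.shape == (45, 3)                       # Na = 45 RBF nodes

NU, NY, Na = 12, 56, 45
assert Na * (NU + 1) == 585                           # size of Phi(p, U), Eq. (70)
# Theta is then a 56 x 585 matrix, matching the sizes quoted in the text.
```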
[Figure 6: plot of the terminal error against the iteration number; panel title FLEXIBLE ARM.]
Figure 6 Progress of the terminal error ‖Y‖ with the optimization iteration number. The arm moves through a randomly generated sequence of goal positions. Reproduced with permission from D. M. Gorinevsky, IEEE Trans. Automat. Control 42:912-927, 1997 (© 1997 IEEE).
In a numerical experiment, a sequence of the task parameter vectors p is generated so that the initial arm configuration of each task coincides with the final arm configuration at the end of the previous task. Figure 6 shows the progress of the error ‖Y − Y_*‖ with the optimization iteration number.
[Figure 7: feedforward plotted against time (0 to 2.5 s); first joint solid, second joint dashed.]
Figure 7 Feedforward for a test motion after algorithm convergence. Reproduced with permission from D. M. Gorinevsky, IEEE Trans. Automat. Control 42:912-927, 1997 (© 1997 IEEE).
[Figure 8: joint deformations plotted against time (0 to 2.5 s); first joint solid, second joint dashed.]
Figure 8 Joint deformations for the test motion with approximated feedforward. Reproduced with permission from D. M. Gorinevsky, IEEE Trans. Automat. Control 42:912-927, 1997 (© 1997 IEEE).
One can see that the control error converges to a small acceptable value over the entire parameter vector domain. The error achieved at the end of the optimization process is about 20 times smaller than the initial error obtained without feedforward. The oscillations of the motion error in Fig. 6 are related to the variation in the arm motion amplitude as new task parameter vectors p are randomly generated in the course of the learning.

Figure 7 illustrates the feedforward control computed as a result of the RBF network approximation after the algorithm convergence, for the motion with initial joint angles [0° 60°]^T and final angles [70° 105°]^T. Figure 8 shows the joint deformations for the same motion. Owing to the computed feedforward, the deformation is small after time T = 1.5, which means the arm arrives at the final position without visible oscillations.
[Figure 9: joint deformations plotted against time (0 to 2.5 s); first joint solid, second joint dashed.]
Figure 9 Joint deformations for the test motion with zero feedforward. Reproduced with permission from D. M. Gorinevsky, IEEE Trans. Automat. Control 42:912-927, 1997 (© 1997 IEEE).
An acceptable motion accuracy is achieved despite the high motion speed, the low feedback gains, the moderate network size, and the large covered domain of the task parameters (82). For comparison, Fig. 9 shows the deformations for the same motion in the absence of feedforward.
VII. CONCLUSIONS

We have presented a new paradigm and RBF network architectures for task-level feedforward control of nonlinear systems. These algorithms belong to the realm of intelligent control and work at a higher hierarchical level than classical feedback or programmed control algorithms. We assume that the system operation can be considered as a sequence of clearly defined tasks and compute a feedforward control for each task. The learning features of the proposed algorithms are aimed at optimizing the feedforward from one task to another based on the performance for a completed task. The dependence of the feedforward control on the task parameters is approximated using an RBF network.

The surveyed applications of the algorithms demonstrate their usefulness, as illustrated by the example of point-to-point control of a flexible articulated arm. The computational resources required for practical implementation of the algorithms are moderate, especially as the algorithms need to run only once for each task. The application of the proposed paradigm can help to solve difficult practical control problems for which other known methods are not adequate. The algorithms can also be extended to allow adaptive feedback control of nonlinear systems.

The main limitation of common approximation-based approaches to nonlinear control, such as neural networks, is the necessity of completing many learning trials in order to train the network. Generally, the number of examples required for network identification (training) grows exponentially with the input variable dimension. In the proposed paradigm, such an input variable is the task parameter vector p. Thus, the proposed task-level control technique offers the greatest advantage over state-space learning approaches when there are few task parameters and the state dimension is high. This advantage is achieved because the proposed algorithms do not attempt to approximate the full nonlinear dynamics of the controlled system and limit themselves to optimizing performance just for a parametric family of control tasks.
REFERENCES

[1] S. Arimoto, S. Kawamura, and F. Miyazaki. Bettering operation of robots by learning. J. Robotic Systems 1:123-140, 1984.
[2] S. Arimoto. Learning control theory for robotic motion. Internat. J. Adapt. Control Signal Process. 4:543-564, 1990.
[3] Z. Geng et al. Learning control system design based on 2-D theory—an application to parallel link manipulator. In Proceedings of the 1990 IEEE International Conference on Robotics and Automation, Cincinnati, pp. 1510-1515, 1990.
[4] K. Guglielmo and N. Sadegh. Experimental evaluation of a new robot learning controller. In Proceedings of the 1991 IEEE International Conference on Robotics and Automation, Sacramento, pp. 734-739, 1991.
[5] S. Hara, Y. Yamamoto, T. Omata, and M. Nakano. Repetitive control systems: a new type of servo systems for periodic exogenous signals. IEEE Trans. Automat. Control 33:659-668, 1988.
[6] R. Horowitz, W. Messner, and J. B. Moore. Exponential convergence of a learning controller for robot manipulators. IEEE Trans. Automat. Control 36:890-894, 1991.
[7] W. Messner et al. A new adaptive learning rule. IEEE Trans. Automat. Control 36:188-197, 1991.
[8] S. R. Oh, Z. Bien, and I. H. Suh. An iterative learning control method with application for the robot manipulator. IEEE J. Robotics Automat. 4:508-514, 1988.
[9] M. Togai and O. Yamano. Learning control and its optimality. In Proceedings of the 1986 IEEE Conference on Robotics and Automation, San Francisco, pp. 248-253, 1986.
[10] K. J. Hunt, D. Sbarbaro, R. Zbikowski, and P. J. Gawthrop. Neural networks for control systems—a survey. Automatica 28:1083-1112, 1992.
[11] M. Kawato. Adaptation and learning in control of voluntary movement by the central nervous system (tutorial). Advanced Robotics 3:229-249, 1989.
[12] V. D. Sanchez and G. Hirzinger. State-of-the-art robotic learning control based on artificial neural networks: an overview. In The Robotics Review 2 (O. Khatib et al., Eds.). MIT Press, Cambridge, MA, 1991.
[13] C. G. Atkeson. Using locally weighted regression for robot learning. In Proceedings of the 1991 IEEE International Conference on Robotics and Automation, Sacramento, pp. 958-963, 1991.
[14] H. Tolle et al. Learning control with interpolating memories—general ideas, design lay-out, theoretical approaches and practical applications. Internat. J. Control 56:291-311, 1992.
[15] H. Tolle, J. Militzer, and E. Ersü. Zur Leistungsfähigkeit lokal verallgemeinernder assoziativer Speicher und ihren Einsatzmöglichkeiten in lernenden Regelungen. Messen Steuern Regeln 32:98-105, 1991.
[16] D. M. Gorinevsky. Learning and approximation in database for feedforward control of flexible-joint manipulator. In ICAR '91: Fifth International Conference on Advanced Robotics, Pisa, pp. 688-692, 1991.
[17] D. M. Gorinevsky. Experiments in direct learning of feedforward control for manipulator path tracking. Robotersysteme 8:139-147, 1992.
[18] D. M. Gorinevsky. Modeling of direct motor program learning in fast human arm motions. Biol. Cybernet. 69:219-228, 1993.
[19] E. W. Aboaf, C. G. Atkeson, and D. J. Reinkensmeyer. Task-level robot learning. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, pp. 1311-1312, 1988.
[20] M. S. Branicky. Task-level learning: experiments and extensions. In Proceedings of the 1991 IEEE International Conference on Robotics and Automation, Sacramento, pp. 266-271, 1991.
[21] D. M. Gorinevsky, A. Kapitanovsky, and A. A. Goldenberg. Neural network architecture for trajectory generation and control of automated car parking. IEEE Trans. Control Systems Technol. 4:50-56, 1996.
[22] D. M. Gorinevsky, A. Kapitanovsky, and A. A. Goldenberg. Radial basis function network architecture for nonholonomic motion planning and control of free-flying manipulators.
IEEE Trans. Robotics Automat. 12:491-496, 1996.
[23] D. Gorinevsky, D. Torfs, and A. A. Goldenberg. Learning approximation of feedforward control dependence. IEEE Trans. Robotics Automat. 12, 1997.
[24] D. M. Gorinevsky. Sampled-data indirect adaptive control of bioreactor using affine radial basis function network approximation. Trans. ASME J. Dynam. Systems Measurement Control 118, 1996.
[25] S. Chen, S. A. Billings, and P. M. Grant. Recursive hybrid algorithm for non-linear systems identification using radial basis function networks. Internat. J. Control 55:1051-1070, 1992.
[26] S. Chen, C. F. N. Cowan, and P. M. Grant. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Networks 2:302-309, 1991.
[27] D. M. Gorinevsky and T. H. Connolly. Comparison of some neural network and scattered data approximations: the inverse manipulator kinematics example. Neural Comput. 6:519-540, 1994.
[28] E. Hartman and D. Keeler. Predicting the future: advantages of semilocal units. Neural Comput. 3:566-578, 1991.
[29] V. Kadirkamanathan, M. Niranjan, and F. Fallside. Sequential adaptation of radial basis function neural networks and its application to time-series prediction. In Advances in Neural Information Processing Systems (J. E. Moody, R. P. Lippmann, and D. S. Touretzky, Eds.), Vol. 3, pp. 721-727. Morgan Kaufmann, San Mateo, CA, 1991.
[30] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Comput. 1:281-294, 1989.
[31] R. M. Sanner and J.-J. E. Slotine. Gaussian networks for direct adaptive control. IEEE Trans. Neural Networks 3:837-863, 1992.
[32] O. Bock, G. M. T. D'Eleuterio, J. Lipitkas, and J. J. Grodski. Parametric motion control of robotic arms—a biologically based approach using neural networks. Telematics and Informatics 10:179-185, 1993.
[33] N. Sadegh. A perceptron network for functional identification. IEEE Trans. Neural Networks 4:982-988, 1993.
[34] A. N. Tikhonov and V. Ya. Arsenin. Methods for Solution of Ill-Posed Problems, 2nd ed. Nauka, Moscow, 1979 (in Russian).
[35] D. M. Gorinevsky. On the approximate inversion of linear system and quadratic-optimal control. J. Comput. System Sci. Internat. 30:6-23, 1992.
[36] R. W. Brockett. Control theory and singular Riemannian geometry. In New Directions in Applied Mathematics (P. Hilton and G. Young, Eds.). Springer-Verlag, Berlin/New York, 1981.
[37] M. Spong. Modeling and control of elastic joint robots. Trans. ASME J. Dynam. Systems Measurement Control 109:310-319, 1987.
[38] J. Craig. Introduction to Robotics, 2nd ed. Addison-Wesley, New York, 1989.
[39] C. Fernandes, L. Gurvits, and Z. X. Li. Foundations of nonholonomic motion planning. Technical Report 577-RR-253, Robotics Research Laboratory, Courant Institute of Mathematical Sciences, New York, 1991.
[40] J. Vlassenbroeck and R. Van Dooren. A Chebyshev technique for solving nonlinear optimal control problems. IEEE Trans. Automat. Control 33:333-340, 1988.
[41] S. Chen and S. A. Billings. Neural networks for non-linear dynamic system modelling and identification. Internat. J. Control 56:319-346, 1992.
[42] Y.-H. Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, 1989.
[43] N. Dyn. Interpolation of scattered data by radial functions. In Topics in Multivariate Approximation (L. L. Schumaker, C. K. Chui, and F. I. Utreras, Eds.), pp. 41-61. Academic Press, Boston, 1987.
[44] R. Franke. Scattered data interpolation: tests of some methods. Math. Comput. 38:181-200, 1982.
[45] R. Franke. Recent advances in the approximation of surfaces from scattered data. In Topics in Multivariate Approximation (L. L. Schumaker, C. K. Chui, and F. I. Utreras, Eds.), pp. 79-98. Academic Press, Boston, 1987.
[46] E. J. Kansa. Multiquadrics—a scattered data approximation scheme with applications to computational fluid dynamics, I. Comput. Math. Appl. 19:127-145, 1990.
[47] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation (J. C. Mason and M. G. Cox, Eds.), pp. 143-168. Clarendon, Oxford, 1987.
[48] M. J. D. Powell. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis (W. Light, Ed.), Vol. 2, pp. 102-205. Clarendon, Oxford, 1992.
[49] M. Botros and C. G. Atkeson. Generalization properties of radial basis functions. In Advances in Neural Information Processing Systems (R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Eds.), Vol. 3, pp. 707-713. Morgan Kaufmann, San Mateo, CA, 1991.
[50] C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx. 2:11-22, 1986.
[51] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78:1481-1497, 1990.
[52] C. Bishop. Improving the generalization properties of radial basis function neural networks. Neural Comput. 3:579-588, 1991.
[53] J. A. Leonard, M. A. Kramer, and L. H. Ungar. Using radial basis functions to approximate a function and its error bounds. IEEE Trans. Neural Networks 3:624-627, 1992.
[54] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Comput. 3:246-257, 1991.
[55] J. Platt. A resource-allocating network for function interpolation. Neural Comput. 3:213-225, 1991.
[56] G. C. Goodwin and K. S. Sin. Adaptive Filtering, Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ, 1984.
[57] S. Mukhopadhyay and K. S. Narendra. Disturbance rejection in nonlinear systems using neural networks. IEEE Trans. Neural Networks 4:63-72, 1993.
[58] D. M. Gorinevsky. On the persistency of excitation in radial basis function network identification of nonlinear systems. IEEE Trans. Neural Networks 6:1237-1244, 1995.
[59] D. Gorinevsky and G. Vukovich. Control of flexible spacecraft using nonlinear approximation of input shape dependence on reorientation maneuver parameters. In 13th World Congress of IFAC, San Francisco, 1996.
[60] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ, 1983.
[61] D. M. Gorinevsky. An approach to parametric nonlinear least square optimization and application to task-level learning control. IEEE Trans. Automat. Control 42:912-927, 1997.
[62] D. M. Gorinevsky. An algorithm for on-line parametric nonlinear least square optimization. In Proceedings of the 33rd IEEE Conference on Decision and Control, Lake Buena Vista, FL, 1994.
[63] D. M. Gorinevsky. Adaptive learning control using radial basis function network approximation over task parameter domain. In Proceedings of the 1993 IEEE International Symposium on Intelligent Control, Chicago, 1993.
[64] D. M. Gorinevsky. Learning task-dependent input shaping control using radial basis function network. In IEEE World Congress on Computational Intelligence, Orlando, 1994.
[65] D. Gorinevsky and L. Feldkamp. RBF network feedforward compensation of load disturbance in idle speed control. IEEE Control Systems Mag., 1996.
[66] T. Ishihara, K. Abe, and H. Takeda. A discrete-time design of robust iterative learning algorithm. IEEE Trans. Systems Man Cybernet. 22:74-84, 1992.
ERRATUM

In the chapter "Constraint Satisfaction Problems," by Hans Nikolaus Schaller, page 231 is incorrect as printed. For the reader's convenience, the correct version of page 231 is given here.
The definition of such a function is not easy and not even unique. Therefore, we need not be astonished that different functions or weight matrices have been proposed for the same problem. Two different proposals for the NQP are shown in Table IV. Both are based on the encoding Q6. These examples show the principal approach. Starting with the variable encoding Q6, penalty terms are added to E for the constraints of the problem. The parameters A, B, C, ... weight these contributions to the energy function. However, these parameters are finally set to A = B = C = ... = 1 without further reasoning, which must be recognized as an inherent design flaw of these approaches.

The literature on recurrent neural networks is rife with energy functions. Besides those listed in Table II, Hopfield networks have also been proposed for image processing and associative memories (e.g., [56, 61]), which raises questions about memory capacity (e.g., [62, 63]) and learning techniques (e.g., [64, 65]). Hardware implementations have also been reported [61, 66].

4. Troublesome Local Minima

Hopfield networks in their original formulation have, like gradient descent, a single major drawback: they simply get stuck in local minima from which they cannot escape. Unfortunately, the local minima are not related in any way to the solutions of the problem, so the result is useless. Therefore, research has developed methods to overcome this problem. A smaller drawback is the convergence speed. For the discrete model, the convergence is rather slow because only a single neuron may change its state at a time. Therefore, improvements of the Hopfield network for constraint satisfaction address (1) the design of the energy (error) function, (2) the speed of convergence, (3) the diagnosis of local minima, and (4) the treatment of local minima.
B. NEURAL ALGORITHMS AND THE STRICTLY DIGITAL NEURAL NETWORK

Edward Page and Gene Tagliarini were the first to make the design of the energy function or the weight matrix more systematic [65, 67-70]. They developed the k-out-of-n design rule [67], which defines the constraint that, for a set of n neurons, k of them are active in solutions and the others inactive (Fig. 15). Each of these constraints results in the weight -2 for the mutual connections T_ij between all the n neurons of the set and a contribution of (2k - 1) to the bias I_j.
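The k-out-of-n rule just stated admits a direct sketch (Python is an editorial choice here; the weight and bias values are exactly those given in the text):

```python
import numpy as np

def k_out_of_n_weights(n, k):
    """Hopfield weights for the k-out-of-n design rule: mutual connections
    T_ij = -2 (zero diagonal), and each constraint adds (2k - 1) to every
    neuron's bias I_j."""
    T = -2.0 * (np.ones((n, n)) - np.eye(n))
    I = (2.0 * k - 1.0) * np.ones(n)
    return T, I
```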