ARTICLE
Communicated by Peter Dayan
Cortical Map Reorganization as a Competitive Process

Granger G. Sutton III
James A. Reggia
Steven L. Armentrout
C. Lynne D'Autrechy
Department of Computer Science, A. V. Williams Bldg., University of Maryland, College Park, MD 20742 USA
Past models of somatosensory cortex have successfully demonstrated map formation and subsequent map reorganization following localized repetitive stimuli or deafferentation. They provide an impressive demonstration that fairly simple assumptions about cortical connectivity and synaptic plasticity can account for several observations concerning cortical maps. However, past models have not successfully demonstrated spontaneous map reorganization following cortical lesions. Recently, an assumption universally used in these and other cortex models, that peristimulus inhibition is due solely to horizontal intracortical inhibitory connections, has been questioned and an additional mechanism, the competitive distribution of activity, has been proposed. We implemented a computational model of somatosensory cortex based on competitive distribution of activity. This model exhibits spontaneous map reorganization in response to a cortical lesion, going through a two-phase reorganization process. These results make a testable prediction that can be used to experimentally support or refute part of the competitive distribution hypothesis, and may lead to practically useful computational models of recovery following stroke.

1 Introduction
Feature maps in primary sensory cortex are highly plastic in adult animals: they undergo reorganization in response to deafferentation (Merzenich et al. 1983; Kaas 1991), de-efferentation (Sanes et al. 1988), localized repetitive stimuli (Jenkins et al. 1990), and focal cortical lesions (Jenkins and Merzenich 1987). During the past few years there have been several efforts to develop computational models of such cortical map self-organization and map refinement (Obermayer et al. 1990; Pearson et al. 1987; Grajski and Merzenich 1990; Sklar 1990; von der Malsburg 1973; Ritter et al. 1989). For example, models of the hand region of primary somatosensory cortex (SI) have demonstrated map refinement, map reorganization after localized repetitive stimulation and deafferentation
(Pearson et al. 1987), and the inverse magnification rule (Grajski and Merzenich 1990). These computational studies show that fairly simple assumptions about network architecture and synaptic modifiability can qualitatively account for several fundamental facts about cortical map self-organization. However, it is known from limited animal experiments that focal cortical lesions also produce spontaneous map reorganization (Jenkins and Merzenich 1987). Developing a model of map reorganization following cortical lesions is important not only for the insights it may provide into basic cortical physiology, but also because it could serve as a model of nervous system plasticity following stroke (Reggia et al. 1993). In the only previous computational model we know of that simulated a focal cortical lesion, map reorganization would not occur unless it was preceded by complete rerandomization of weights (Grajski and Merzenich 1990).

Map reorganization following a cortical lesion is fundamentally different from that involving deafferentation or focal repetitive stimulation. In the latter situations there is a change in the probability distribution of input patterns seen by the cortex, and such a change has long been recognized to result in map alterations (Kohonen 1989). In contrast, a focal cortical lesion does not affect the probability distribution of input patterns, so other factors must be responsible for map reorganization.

Past computational models of cortical map self-organization and plasticity have all assumed that the sole mechanism of intracortical inhibition is horizontal (lateral) inhibitory connections. Recently, the validity of this assumption has been called into question and an additional mechanism, the competitive distribution of activity, has been proposed for some intracortical inhibitory phenomena (Reggia et al. 1992). We recently implemented a model of cerebral cortex and thalamocortical interactions based on the hypothesis that competitive distribution is an important factor in controlling the spread of activation at the level of the thalamus and cortex (Reggia et al. 1992). Because of the flexible nature of the competitive cortex model, we hypothesized that a version of this model augmented by making thalamocortical synapses plastic would not only demonstrate map formation and reorganization as with previous models, but would also demonstrate spontaneous map reorganization following cortical lesions. In the following we describe simulations that show that this is correct.

Our computational model of somatosensory (SI) cortex uses competitive distribution of activity as a means of producing cortical inhibitory effects (Sutton 1992). This model, augmented with an unsupervised learning rule, successfully produces cortical map refinement, expansion of cortical representation in response to focal repetitive stimulation, and map reorganization in response to focal deafferentation. More importantly, our model exhibits something that previous models have not yet produced: map reorganization following focal cortical damage without the need to rerandomize weights. This reorganization is a two-phase process, and
results in a testable prediction that can be examined experimentally. We conclude that competitive distribution of activity can explain some features of cortical map plasticity better than the traditional view of cortical inhibition.

2 Methods
We augmented the original competitive distribution cortical model with an unsupervised learning rule that modifies synaptic strengths over time. The intent was to examine how attributing peristimulus inhibition in cortex to competitive distribution of activity, rather than to lateral inhibitory connections, affected map formation and reorganization. We refer to our augmented model as the competitive SI model because it is a crude representation of a portion of the thalamus and primary somatosensory cortex (area 3b of SI), specifically portions of those structures receiving sensory input from the hand. This area was chosen because of its topographic organization, the availability of interesting experimental data (Jenkins and Merzenich 1987; Merzenich et al. 1983), and to allow comparison with some previous models of SI which make more traditional assumptions about intracortical inhibition (Pearson et al. 1987; Grajski and Merzenich 1990).

The competitive SI model is constructed from two separate hexagonally tessellated layers of 32 x 32 volume elements representing the thalamus and the cortex. Each element represents a small set of spatially adjacent and functionally related neurons. To avoid edge effects, opposite edges of the cortical sheet are connected to form a torus. All connections are excitatory and competitive. Each thalamic element connects to its corresponding cortical element and the 60 surrounding cortical elements within a radius of four. With probability 0.5 a thalamocortical connection is initially assigned the minimum weight value 0.00001; otherwise, the weight is chosen uniformly randomly between this minimum and 1.0. Each cortical element connects to its six cortical neighbors; all corticocortical weights are equal (their magnitude then has no effect due to the activation rule).

Each element's functionality is governed by a competitive activation rule (Reggia et al. 1992). The activation $a_j(t)$ of cortical element j, representing the mean firing rate of the neurons contained in element j, is governed by

$$\frac{da_j(t)}{dt} = c_s\,a_j(t) + [M - a_j(t)]\,in_j(t) \qquad (2.1)$$
where $in_j(t) = \sum_i out_{ij}(t)$, with i ranging over all thalamic and cortical elements sending connections to cortical element j. Activation of thalamic element j is also determined by equation 2.1, but its $in_j(t)$ term represents only input from sensory receptors.
Equation 2.1 bounds $a_j(t)$ between zero and a constant M. An output dispersal rule provides for competitive distribution of activation:

$$out_{ij}(t) = c_p\;\frac{w_{ij}\,[a_j(t) + q]}{\sum_k w_{ik}\,[a_k(t) + q]}\;a_i(t) \qquad (2.2)$$
for both thalamic and cortical elements. The small predefined, nonnegative constant q serves two purposes: it dampens competitiveness and it prevents division by zero. The competitive learning rule for changing the weight $w_{ji}$ on the connection to cortical element j from thalamic element i is $\Delta w_{ji}(t) = \epsilon\,[a_j(t) - w_{ji}(t)]\,a_i(t)$. To maintain normalized incoming weight vectors, an explicit weight renormalization step is needed after the weight update takes place. Simulations were run both with and without this explicit weight renormalization step, and only minor differences were observed. The results reported here are for the normalized model, to be consistent with Grajski and Merzenich (1990).

Random, uniformly distributed hexagonal patches were used as input stimuli because of their simplicity, intuitive appeal, and similarity to stimuli used with some past models of SI cortex. To evaluate topographic map formation, we defined the following measures:
Total response:
$$r_j = \sum_i a_{ij} \qquad (2.3)$$

Center:
$$\bar{x}_j = \frac{1}{r_j}\sum_i a_{ij}\,x_i, \qquad \bar{y}_j = \frac{1}{r_j}\sum_i a_{ij}\,y_i \qquad (2.4)$$

Moments:
$$wx_j = \Big[\frac{1}{r_j}\sum_i a_{ij}\,(x_i - \bar{x}_j)^2\Big]^{1/2}, \qquad wy_j = \Big[\frac{1}{r_j}\sum_i a_{ij}\,(y_i - \bar{y}_j)^2\Big]^{1/2} \qquad (2.5)$$
Here $a_{ij}$ is the activation level of cortical element j when a point stimulus (input of 1.0 to a single thalamic element) is applied at thalamic element i, $r_j$ is the total response of cortical element j summed over the thalamic point stimuli, $x_i$ and $y_i$ are the x and y coordinates of thalamic element i, $\bar{x}_j$ and $\bar{y}_j$ are the x and y coordinates for the center of cortical element j's receptive field, and $wx_j$ and $wy_j$ are the x and y moments of cortical element j's receptive field. The x and y moments of the cortical receptive field do not indicate the entire extent of the receptive field, but rather are measures of its width analogous to the standard deviation.
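For concreteness, the update rules above can be sketched in a few lines of code. The following is a minimal illustration under our own simplifying assumptions, not the authors' implementation: it treats only the thalamocortical pathway with dense connectivity (ignoring the hexagonal radius-four topology and the corticocortical links), uses forward Euler integration, and renormalizes incoming weights to unit sum; all names are ours.

```python
import numpy as np

def competitive_si_step(a_thal, a_cort, W, c_s=-2.0, c_p=1.0, M=3.0,
                        q=0.0001, eps=0.01, dt=0.5):
    """One update of the sketched competitive SI model.

    a_thal : (T,) thalamic activations
    a_cort : (C,) cortical activations
    W      : (T, C) thalamocortical weights, W[i, j] = w_ij
    c_p    : output gain; 1.0 for thalamic elements, 0.6 for cortical
             elements (values from the Figure 2 caption)
    """
    # Equation 2.2: thalamic element i competitively distributes its output
    # among its cortical targets j in proportion to w_ij * (a_j + q).
    bids = W * (a_cort + q)                                   # (T, C)
    out = c_p * bids / bids.sum(axis=1, keepdims=True) * a_thal[:, None]

    # Equation 2.1: c_s < 0 damps activity while the (M - a_j) factor gates
    # further growth, keeping each a_j within [0, M].
    in_j = out.sum(axis=0)
    a_cort = a_cort + dt * (c_s * a_cort + (M - a_cort) * in_j)

    # Competitive learning rule: pull w_ji toward a_j, gated by presynaptic
    # activity a_i, then renormalize each cortical element's incoming weight
    # vector (unit sum here; the exact normalization is our assumption).
    W = W + eps * (a_cort[None, :] - W) * a_thal[:, None]
    W = W / W.sum(axis=0, keepdims=True)
    return a_cort, W
```

In the full model the same dispersal rule also governs the corticocortical connections, and the thalamic layer is updated by equation 2.1 with sensory input as its $in_j$ term.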
3 Results

Figure 1 shows the coarse topographic map that existed before training, due to initially random weights and the topographic projection of thalamocortical connections. The topographic map is plotted by placing
points at the computed centers of the cortical elements' receptive fields and connecting any two points which represent cortical elements that are nearest neighbors (Fig. 1a). The x and y cortical receptive field moments are plotted by drawing ellipses in sensory/thalamic space centered on the computed centers (Fig. 1b). The lengths of the x and y axes of each ellipse represent the x and y moments (not the full receptive fields).

Figure 1: Cortical receptive field plots for the model before learning: (a) receptive field centers with nearest neighbors connected, and (b) receptive field centers and moments.
Analogous plots can be defined for thalamic response fields (each thalamic point stimulus generates a response across the cortical layer, called its response field), cortical incoming weight vectors, and thalamic outgoing weight vectors.

Figure 2 shows that with training a finely tuned, uniform topographic map appeared, and receptive field moments became uniform and decreased in size. Incoming weight vectors of cortical elements also became very uniform compared to their initial random state; after training with hexagonal patches of radius two, the incoming weight vectors became roughly bell-shaped when the weights are plotted as a surface.

When the fingertips of attending monkeys are stimulated much more frequently than the rest of the hand, the region of SI cortex which represents the fingertips increases in size (Jenkins et al. 1990). This increase in size is mostly at the expense of neighboring regions, but also at the expense of more distant regions. The receptive fields in the expanded fingertip region of cortex also show a decrease in size. For our model, these results were simulated by first performing topographic map formation as above. After the map formed (Fig. 2), the input scheme was changed so that an 8 x 16 region of the thalamic layer, designated as the repetitively stimulated finger (second finger region from the left), was now seven times more likely to be stimulated than other regions. After the topographic map reorganized due to this change in the input scheme, a number of effects observed in animal studies occurred in the competitive SI model (Fig. 3). The number of cortical elements whose receptive field centers were in the repetitively stimulated finger representation increased dramatically, more than doubling. Thus, there was a substantial increase in the cortical magnification for the repetitively stimulated finger, as is observed experimentally. The neighboring finger representations decreased in size and shifted, and even more distant finger representations were reduced in size. Following repetitive stimulation, the mean receptive field size for the enlarged representation of the second finger shown in Figure 3b did not decrease. However, receptive field size did decrease for a large number of the cortical elements whose receptive field centers lie toward the edges of the repetitively stimulated finger representation, consistent with the inverse magnification rule in these regions.

To simulate an afferent lesion, our model was trained and then a contiguous portion of the thalamic layer corresponding to a single finger was deprived of sensory input. Input patterns within the deafferented finger region ceased to occur. With continued training, some cortical receptive fields which were in the deafferented finger region shifted outside of this region, forcing a resultant shift of surrounding cortical receptive fields. Some cortical elements near the center of the deafferented region received insufficient activation to reorganize and remained essentially unresponsive to all input patterns. Much reorganization in biological cortex involves replacement of deafferented glabrous representation by
Figure 2: Cortical receptive field plots for the model after training with randomly positioned hexagonal patches of radius two as input stimuli: (a) receptive field centers with nearest neighbors connected and (b) receptive field centers and moments. Contrast with Figure 1. The refined topographic map (b) has small, roughly equal-size receptive fields [actual receptive fields extend beyond the ovals drawn in (b) and overlap]. As a reference for discussion, boundaries for four "fingers" and a "palm" are added. In all the simulations described in this paper, c_s = -2.0 (self-inhibition), c_p = 0.6 (excitatory gain), M = 3.0, and q = 0.0001 in equations 2.1 and 2.2, and a time step of 0.5 is used, for cortical elements. The same values are used for thalamic elements except c_p = 1.0. A learning rate of ε = 0.01 is used. Distance between neighboring receptive fields here and in subsequent figures is 1.0.
Figure 3: Cortical receptive field moments plotted in cortical space and filled to show regional location of receptive field centers (a) after training with uniformly distributed stimuli, and (b) after subsequent repeated finger stimulation. The horizontal line indicates the breadth of the cortical representation of the selectively stimulated finger before (a) and after (b) repetitive stimulation. The x and y moments of the cortical receptive fields are represented by ellipses centered at the physical location of the cortical element in the cortex rather than the location of the receptive field center in sensory/thalamic space. The general location of receptive field centers is shown by filling ellipses with different gray scale patterns to indicate the different finger and palm regions.
new/extended representation of dorsal hand surfaces (Kaas et al. 1990); this could not happen in our model, as no sensory input corresponding to the dorsum of the hand was present.

To simulate a focal cortical lesion in our model, a contiguous portion of the trained cortical layer (elements representing the second finger from the left) was deactivated after training (i.e., activation levels clamped at 0.0). After lesioning, the topographic map showed a two-phase reorganization process. Immediately after lesioning and before any retraining, the receptive fields of cortical elements adjoining the lesioned area shifted toward the second thalamic finger and increased in size (Fig. 4a). This immediate shift was due to the competitive redistribution of thalamic output from lesioned to unlesioned cortical elements. The second phase of map reorganization occurred more slowly with continued training and was due to synaptic weight changes (Fig. 4b).¹ Cortical representation of the "lesioned finger" was reduced in size (reduced magnification). The mean receptive field x moment (y moment) prior to the lesion was 0.626 (0.627) for the entire cortex. Following the cortical lesion and subsequent map reorganization, the mean receptive field x moment (y moment) increased to 0.811 (0.854) for elements within a distance of two of the lesion site (mostly shaded black in Fig. 4b), consistent with an inverse magnification rule.

¹Results reported here are for unchanged uniformly random stimuli. If stimulus frequency in the region originally represented in the lesioned cortex was increased, reorganization was even more pronounced.
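This lesion protocol is easy to express on top of the hypothetical competitive_si_step sketched earlier: the lesion is simply a boolean mask that clamps the affected cortical elements at 0.0 after every update, so the first (rapid) phase needs no extra code, and the second (slow) phase emerges from continued training. A sketch, with all names ours:

```python
import numpy as np

def simulate_focal_lesion(sample_stimulus, a_cort, W, lesion_mask,
                          n_trials=1000):
    """Clamp lesioned cortical elements at 0.0 and continue training.

    sample_stimulus : callable returning a (T,) thalamic activation pattern
                      (e.g., a random hexagonal patch, per the Methods)
    lesion_mask     : boolean (C,) array, True for lesioned elements
    """
    for _ in range(n_trials):
        a_thal = sample_stimulus()
        a_cort, W = competitive_si_step(a_thal, a_cort, W)
        # Lesion: activation levels clamped at 0.0.  With a_j ~ 0, the bids
        # w_ij * (a_j + q) of lesioned targets become negligible, so the
        # dispersal rule (equation 2.2) immediately reroutes thalamic output
        # to intact perilesion cortex (phase one); the learning rule then
        # consolidates the shifted map over many trials (phase two).
        a_cort[lesion_mask] = 0.0
    return a_cort, W
```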
4 Discussion

It has recently been proposed that competitive distribution of activity may underlie some inhibitory effects observed in neocortex (Reggia et al. 1992). The present study supports this hypothesis in two ways. First, we have shown that a computational model of cerebral cortex based on the competitive distribution hypothesis, starting from a coarse topographic map, can simulate the development of a highly refined topographic map with focused, bell-shaped receptive fields. Once such a map was formed, changing the probability distribution of input stimuli resulted in substantial map reorganization. With repetitive stimuli to a localized region of the sensory surface, the cortical representation of that region increased dramatically, and changes occurred that were in part consistent with an inverse magnification rule, as has been observed experimentally (Jenkins et al. 1990). With deafferentation of a localized region, many of the cortical elements originally representing the deafferented region came to represent surrounding sensory surface regions, as has been described experimentally (Kaas 1991). Of course, neither our model nor previous ones, being tremendous simplifications of real cortex (e.g., having very limited radii of interconnectivity and modeling only a small cortical
region), can account for all experimental data related to these phenomena. For example, none of these models, including ours, adequately accounts for the almost immediate changes in receptive fields observed following deafferentation (Calford and Tweedale 1991; Kaas et al. 1990; Chino et al. 1992; Gilbert and Wiesel 1992), nor for long-term "massive" reorganization (Pons et al. 1991).

Second, our competitive SI model exhibited dramatic map reorganization in response to a focal cortical lesion. Reorganization following a cortical lesion is fundamentally different from repetitive stimulation and deafferentation, as there is no change in the probability distribution of input stimuli. To our knowledge, the only previous cortical model which has tried to simulate reorganization following a focal cortical lesion is the three-layer model of Grajski and Merzenich (1990). Following a focal cortical lesion, map reorganization did not occur with this earlier model unless the synaptic strengths of all intracortical and cortical afferent connections that remained intact were randomized, that is, unless the model reverted to its completely untrained state (it was also necessary to enhance cortical excitation or reduce cortical inhibition). In our SI model no special procedure such as weight randomization was required: sensory regions originally represented in the lesioned cortex spontaneously reappeared in cortex outside the lesion area. This spontaneous map reorganization is consistent with that seen experimentally following small cortical lesions (Jenkins and Merzenich 1987). Further, the model's receptive fields increased in size in perilesion cortex, as has also been described experimentally (Jenkins and Merzenich 1987). Demonstration that the competitive SI model reorganizes after a cortical lesion provides a potential computational model of stroke.²

Map reorganization following a cortical lesion to the competitive SI model involved a two-phase process where each phase, rapid and slow, is due to a different mechanism. Immediately after a cortical lesion, competitive distribution of activation caused some finger regions originally represented by the lesioned area of cortex to "shift outward" and appear in adjacent regions of intact cortex. This result provides a specific testable prediction for half of the competitive distribution hypothesis: if competitive distribution of activity is present from thalamus to cerebral cortex, then significant shifts of sensory representation out of a lesioned cortical area should be observed right after a cortical lesion. The second, slower phase of additional map reorganization is due to synaptic plasticity, and is apparently triggered by the first phase. It is not yet clear whether a model based on more traditional intracortical inhibitory connections can produce spontaneous reorganization following a cortical lesion. Further computational studies should determine whether the difficulties encountered in obtaining such reorganization are a general property of cortical models using inhibitory connections or whether they
reflect specific details of the one computational model studied so far (Grajski and Merzenich 1990).

²We have recently extended this work to a model of proprioceptive cortex based on length and tension input from muscles in a model arm (Cho et al. 1993).
Acknowledgments

Supported by NINDS Awards NS 29414 and NS 16332. The authors are also with the Department of Neurology and the Institute for Advanced Computer Studies at the University of Maryland.
References

Calford, M., and Tweedale, R. 1991. Acute changes in cutaneous receptive fields in primary somatosensory cortex after digit denervation in adult flying fox. J. Neurophys. 65, 178-187.
Chino, Y., Kaas, J., Smith, E., et al. 1992. Rapid reorganization of cortical maps in adult cats following restricted deafferentation in retina. Vision Res. 32, 789-796.
Cho, S., Reggia, J., and D'Autrechy, C. 1993. Modelling map formation in proprioceptive cortex. Tech. Rep. CS-TR-3026, Dept. of Computer Science, Univ. of Maryland.
Gilbert, C., and Wiesel, T. 1992. Receptive field dynamics in adult primary visual cortex. Nature (London) 356, 150-152.
Grajski, K., and Merzenich, M. 1990. Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Comp. 2, 71-84.
Jenkins, W., and Merzenich, M. 1987. Reorganization of neocortical representations after brain injury: A neurophysiological model of the bases of recovery from stroke. In Progress in Brain Research, Vol. 71, F. Seil, E. Herbert, and B. Carlson, eds., pp. 249-266. Elsevier, Amsterdam.
Jenkins, W., Merzenich, M., Ochs, M., Allard, T., and Guic-Robles, E. 1990. Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophys. 63, 82-104.
Kaas, J. 1991. Plasticity of sensory and motor maps in adult mammals. Ann. Rev. Neurosci. 14, 137-167.
Kaas, J., Krubitzer, L., Chino, Y., et al. 1990. Reorganization of retinotopic cortical maps in adult mammals after lesions of the retina. Science 248, 229-231.
Kohonen, T. 1989. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Merzenich, M., Kaas, J., et al. 1983. Topographic reorganization of somatosensory cortical areas 3b and 1 in adult monkeys following restricted deafferentation. Neuroscience 8, 33-55.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A neural network model for the formation of topographic maps in the CNS: Development of receptive fields. Proceedings of the International Joint Conference on Neural Networks, Vol. II, pp. 423-429. San Diego, CA.
Pearson, J., Finkel, L., and Edelman, G. 1987. Plasticity in the organization of adult cerebral cortical maps: A computer simulation. J. Neurosci. 7, 4209-4223.
Pons, T., Garraghty, P., Ommaya, A., et al. 1991. Massive cortical reorganization after sensory deafferentation in adult macaques. Science 252, 1857-1860.
Reggia, J., D'Autrechy, C., Sutton, G., and Weinrich, M. 1992. A competitive distribution theory of neocortical dynamics. Neural Comp. 4, 287-317.
Reggia, J., Berndt, R., and D'Autrechy, C. 1993. Connectionist models in neuropsychology. Handbook of Neuropsychology, Vol. 9, Elsevier, in press.
Ritter, H., Martinetz, T., and Schulten, K. 1989. Topology-conserving maps for learning visuo-motor-coordination. Neural Networks 2, 159-168.
Sanes, J., Suner, S., Lando, J., and Donoghue, J. 1988. Rapid reorganization of adult rat motor cortex somatic representation patterns after motor nerve injury. Proc. Natl. Acad. Sci. U.S.A. 85, 2003.
Sklar, E. 1990. A simulation of cortical map plasticity. Proc. IJCNN III, 727-732.
Sutton, G. 1992. Map formation in neural networks using competitive activation mechanisms. Ph.D. Dissertation, Department of Computer Science, CS-TR-2932, Univ. of Maryland.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
Received January 4, 1993; accepted June 1, 1993.
NOTE
Communicated by Michael Hines
An Efficient Method for Computing Synaptic Conductances Based on a Kinetic Model of Receptor Binding

A. Destexhe
Z. F. Mainen
T. J. Sejnowski
The Howard Hughes Medical Institute and The Salk Institute, Computational Neurobiology Laboratory, 10010 North Torrey Pines Road, La Jolla, CA 92037 USA
Synaptic events are often formalized in neural models as stereotyped, time-varying conductance waveforms. The most commonly used of such waveforms is the α-function (Rall 1967):

$$g(t) = g_{\rm syn}\,\frac{t - t_0}{\tau}\,e^{-(t - t_0)/\tau} \qquad (1)$$
where g_syn is the synaptic conductance and t_0 is the time of transmitter release. This function peaks at a value of 1/e at t = t_0 + τ, and decays exponentially with a time constant of τ. When multiple events occur in succession at a single synapse, the total conductance at any time is a sum of such waveforms calculated over the individual event times.

There are several drawbacks to this method. First, the relationship to actual synaptic conductances is based only on an approximate correspondence of the time course of the waveform to physiological recordings of the postsynaptic response, rather than plausible biophysical mechanisms. Second, summation of multiple waveforms can be cumbersome, since each event time must be stored in a queue for the duration of the waveform and necessitates calculation of an additional exponential during this period (but see Srinivasan and Chiel 1993). Third, there is no natural provision for saturation of the conductance.

An alternative to the use of stereotyped waveforms is to compute synaptic conductances directly using a kinetic model (Perkel et al. 1981). This approach allows a more realistic biophysical representation and is consistent with the formalism used to describe the conductances of other ion channels. However, solution of the associated differential equations generally requires computationally expensive numerical integration. In this paper we show that reasonable biophysical assumptions about synaptic transmission allow the equations for a simple kinetic synapse model to be solved analytically. This yields a mechanism that preserves the advantages of kinetic models while being as fast to compute as a single α-function. Moreover, this mechanism accounts implicitly for saturation and summation of multiple synaptic events, obviating the need for event queuing.
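For contrast, the bookkeeping that the note objects to can be made concrete with a small sketch of queue-based α-function summation (our own illustration; the names and the cutoff window are ours):

```python
import math
from collections import deque

def alpha_conductance(t, event_times, g_syn=1.0, tau=1.0, window=10.0):
    """Total conductance at time t as a sum of alpha-functions (equation 1).

    Every release time within `window` (several tau) must be kept in a
    queue, and each one costs an exponential evaluation per query.
    """
    while event_times and t - event_times[0] > window:
        event_times.popleft()          # drop events whose waveform has decayed
    g = 0.0
    for t0 in event_times:
        if t >= t0:
            g += g_syn * ((t - t0) / tau) * math.exp(-(t - t0) / tau)
    return g

events = deque([1.0, 3.0, 4.5])        # presynaptic release times (msec)
print(alpha_conductance(5.0, events))
```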
Following the arrival of an action potential at the presynaptic terminal, neurotransmitter molecules, T, are released into the synaptic cleft. These molecules are taken to bind to postsynaptic receptors according to the following first-order kinetic scheme:
$$R + T \;\underset{\beta}{\overset{\alpha}{\rightleftharpoons}}\; TR^* \qquad (2)$$
where R and TR* are, respectively, the unbound and the bound form of the postsynaptic receptor, and α and β are the forward and backward rate constants for transmitter binding. Letting r represent the fraction of bound receptors, these kinetics are described by the equation

$$\frac{dr}{dt} = \alpha\,[T]\,(1 - r) - \beta\,r \qquad (3)$$
where [T] is the concentration of transmitter. There is evidence from both the neuromuscular junction (Anderson and Stevens 1973) and excitatory central synapses (Colquhoun et al. 1992) that the concentration of transmitter in the cleft rises and falls very rapidly. If it is assumed that [T] occurs as a pulse, then it is straightforward to solve equation 3 exactly, leading to the following expressions:

1. During a pulse (t_0 < t < t_1), [T] = T_max and r is given by

$$r(t) = r_\infty + [r(t_0) - r_\infty]\,e^{-(t - t_0)/\tau_r} \qquad (4)$$

where

$$r_\infty = \frac{\alpha\,T_{\max}}{\alpha\,T_{\max} + \beta}$$

and

$$\tau_r = \frac{1}{\alpha\,T_{\max} + \beta}$$

2. After a pulse (t > t_1), [T] = 0 and r decays as

$$r(t) = r(t_1)\,e^{-\beta\,(t - t_1)} \qquad (5)$$
If the binding of transmitter to a postsynaptic receptor directly gates the opening of an associated ion channel, then the total conductance through all channels of the synapse is r multiplied by the maximal conductance of the synapse, g_syn.
Response saturation occurs naturally as r approaches 1 (all channels reach the open state). The synaptic current, I_syn, is given by the equation

$$I_{\rm syn}(t) = g_{\rm syn}\,r(t)\,[V_{\rm syn}(t) - E_{\rm syn}] \qquad (6)$$
where V_syn is the postsynaptic potential, and E_syn is the synaptic reversal potential.

These equations provide an easily implemented method for computing synaptic currents and have storage and computation requirements that are independent of the frequency of presynaptic release events. To simulate a synaptic connection, it is necessary only to monitor the state of the presynaptic terminal and switch from equation 5 to equation 4 for a fixed time following the detection of an event. At each time step, this method requires the storage of just two state variables [either t_0 and r(t_0) or t_1 and r(t_1)], and the calculation of a single exponential (either equation 4 or equation 5). This compares favorably to summing α-functions, which requires storage of n release times and n corresponding exponential evaluations, where n is the product of the maximum frequency of release events and the length of time for which the conductance waveform is calculated.

The parameters of the kinetic synapse model can be fit directly to physiological measurements. For instance, the duration of the excitatory neurotransmitter glutamate in the synaptic cleft has been estimated to be on the order of 1 msec at concentrations in the 1 mM range (Clements et al. 1992; Colquhoun et al. 1992).
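The analytic solution is compact enough to state directly in code. The sketch below is our own rendering of equations 4-6, not the authors' code; the class and variable names are ours, and the default parameters follow the fast excitatory case of Figure 1. Each query costs a single exponential, independent of the event rate:

```python
import math

class KineticSynapse:
    """Pulse-based kinetic synapse solved analytically (equations 4-6)."""

    def __init__(self, alpha=2.0, beta=1.0, g_syn=1.0, E_syn=0.0,
                 T_max=1.0, pulse_dur=1.0):
        self.alpha, self.beta = alpha, beta    # msec^-1 mM^-1, msec^-1
        self.g_syn, self.E_syn = g_syn, E_syn  # nS, mV
        self.r_inf = alpha * T_max / (alpha * T_max + beta)
        self.tau_r = 1.0 / (alpha * T_max + beta)
        self.pulse_dur = pulse_dur             # pulse length (t1 - t0), msec
        self.t_event = -1e9                    # time of last state switch
        self.r_event = 0.0                     # r at that switch
        self.pulse_on = False

    def on_spike(self, t):
        """Presynaptic event detected: begin a pulse (switch to equation 4)."""
        self.r_event, self.t_event, self.pulse_on = self.r(t), t, True

    def r(self, t):
        """Fraction of bound receptors at time t, via one exponential."""
        if self.pulse_on and t - self.t_event >= self.pulse_dur:
            # Pulse ended: store r(t1), then switch to decay (equation 5).
            t1 = self.t_event + self.pulse_dur
            self.r_event = self.r_inf + (self.r_event - self.r_inf) * \
                math.exp(-self.pulse_dur / self.tau_r)
            self.t_event, self.pulse_on = t1, False
        if self.pulse_on:                      # equation 4
            return self.r_inf + (self.r_event - self.r_inf) * \
                math.exp(-(t - self.t_event) / self.tau_r)
        return self.r_event * math.exp(-self.beta * (t - self.t_event))  # eq. 5

    def current(self, t, V):
        """Equation 6: synaptic current at postsynaptic potential V (mV)."""
        return self.g_syn * self.r(t) * (V - self.E_syn)
```

Saturation needs no special handling: r relaxes toward r_inf < 1 during each pulse, so closely spaced events sum only until the receptors approach full occupancy.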
Figure 1: Facing page. Postsynaptic potentials from receptor kinetics. Presynaptic voltage, V_pre (mV); concentration of transmitter in the synaptic cleft, [T] (mM); fraction of open (i.e., transmitter-bound) postsynaptic receptors, r; synaptic current, I_syn (pA); and postsynaptic potential, V_syn (mV), are shown for different conditions. (A) A single transmitter pulse evokes a fast, excitatory conductance (α = 2 msec⁻¹ mM⁻¹, β = 1 msec⁻¹, E_syn = 0 mV). (B) A train of presynaptic spikes releases a series of transmitter pulses evoking excitatory synaptic conductances (parameters as in A). C and D correspond to A and B, but with parameters set for slower, inhibitory synaptic currents (α = 0.5 msec⁻¹ mM⁻¹, β = 0.1 msec⁻¹, E_syn = -80 mV). For all simulations, the synaptic current was calculated using equations 4-6, with g_syn = 1 nS, T_max = 1 mM, and transmitter pulse duration (t_1 - t_0) = 1 msec. Membrane potentials were simulated using NEURON (Hines 1993). Presynaptic and postsynaptic compartments were described by single-compartment cylinders (10 μm diameter and 10 μm length) with passive (leak) conductance (specific membrane capacitance of 1 μF/cm², specific membrane resistance of 5000 Ω-cm², leak reversal potential of -70 mV). Presynaptic action potentials were modeled by standard Hodgkin-Huxley kinetics. A transmitter pulse was initiated when V_pre exceeded a threshold of 0 mV, and pulse initiation was inhibited for 1 msec following event detection.
Figure 1 shows simulated synaptic events obtained using these values. Figure 1A and B show fast, excitatory currents resulting from a single synaptic event and a train of four events. Note that the time course of the postsynaptic potential resembles an α-function even though the underlying current does not. Figure 1C and D show the time courses of the same variables for a slower, inhibitory synapse. In this case the rates α and β were slower, allowing a more progressive saturation of the receptors.

We have presented a method by which synaptic conductances can be computed with low computational expense using a kinetic model. The kinetic approach provides a natural means to describe the behavior of synapses in a way that handles the interaction of successive presynaptic events. Under the same assumption that transmitter concentration occurs as a pulse, more complex kinetic schemes can be treated
in a manner analogous to that described above (Destexhe et al., in preparation). The "kinetic synapse" can thus be generalized to give various conductance time courses with multiexponential rise and decay phases, without sacrificing the efficiency of the first-order model.
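As a usage sketch (reusing the hypothetical KineticSynapse class above, here with the slower inhibitory rates of Figure 1C and D), a train of presynaptic events requires no event queue; each detected event simply switches the synapse back to the pulse phase:

```python
syn = KineticSynapse(alpha=0.5, beta=0.1, E_syn=-80.0)  # slow inhibitory case
spikes = [5.0, 15.0, 25.0, 35.0]     # presynaptic spike times (msec)
t, dt, next_spike = 0.0, 0.1, 0
while t < 60.0:
    if next_spike < len(spikes) and t >= spikes[next_spike]:
        syn.on_spike(t)              # event detection restarts the pulse phase
        next_spike += 1
    I = syn.current(t, V=-70.0)      # postsynaptic potential clamped here
    t += dt
```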
Acknowledgments

This research was supported by the Howard Hughes Medical Institute, the U.S. Office of Naval Research, and the National Institute of Mental Health. Z. F. M. is a Howard Hughes Medical Institute Predoctoral Fellow.
References

Anderson, C. R., and Stevens, C. F. 1973. Voltage clamp analysis of acetylcholine-produced end-plate current fluctuations at frog neuromuscular junction. J. Physiol. (London) 235, 655-691.
Clements, J. D., Lester, R. A. J., Tong, J., Jahr, C., and Westbrook, G. L. 1992. The time course of glutamate in the synaptic cleft. Science 258, 1498-1501.
Colquhoun, D., Jonas, P., and Sakmann, B. 1992. Action of brief pulses of glutamate on AMPA/KAINATE receptors in patches from different neurons of rat hippocampal slices. J. Physiol. (London) 458, 261-287.
Hines, M. 1993. NEURON-A program for simulation of nerve equations. In Neural Systems: Analysis and Modeling, F. Eeckman, ed., pp. 127-136. Kluwer Academic Publishers, Norwell, MA.
Perkel, D. H., Mulloney, B., and Budelli, R. W. 1981. Quantitative methods for predicting neuronal behavior. Neurosci. 6, 823-827.
Rall, W. 1967. Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic inputs. J. Neurophysiol. 30, 1138-1168.
Srinivasan, R., and Chiel, H. J. 1993. Fast calculation of synaptic conductances. Neural Comp. 5, 200-204.
Received March 31, 1993; accepted May 26, 1993.
Communicated by David Robinson
A Neural Network for Coding of Trajectories by Time Series of Neuronal Population Vectors

Alexander V. Lukashin
Apostolos P. Georgopoulos
Brain Sciences Center, Department of Veterans Affairs Medical Center, Minneapolis, MN 55417 USA, and Departments of Physiology and Neurology, University of Minnesota Medical School, Minneapolis, MN 55455 USA
The neuronal population vector is a measure of the combined directional tendency of the ensemble of directionally tuned cells in the motor cortex. It has been found experimentally that a trajectory of limb movement can be predicted by adding together population vectors, tip-to-tail, calculated for successive instants of time to construct a neural trajectory. In the present paper we consider a model of the dynamic evolution of the population vector. The simulated annealing algorithm was used to adjust the connection strengths of a feedback neural network so that it would generate a given trajectory by a sequence of population vectors. This was repeated for different trajectories. The resulting sets of connection strengths reveal a common feature regardless of the type of trajectories generated by the network: namely, the mean connection strength was negatively correlated with the angle between the preferred directions of the pair of neurons involved in the connection. The results are discussed in the light of recent experimental findings concerning neuronal connectivity within the motor cortex.
1 Introduction

The activity of a directionally tuned neuron in the motor cortex is highest for a movement in a particular direction (the neuron's preferred direction) and decreases progressively with movements farther away from this direction. Quantitatively, the change of neuron activity can be approximated by the cosine of the angle between the movement direction and the neuron's preferred direction (Georgopoulos et al. 1982). The direction of an upcoming movement in space can be represented in the motor cortex as the neuronal population vector, which is a measure of the combined directional tendency of the whole neuronal ensemble (Georgopoulos et al. 1983, 1986). If C_i is the unit preferred direction vector for the ith neuron,
then the neuronal population vector P is defined as the weighted sum of these vectors:

$$\mathbf{P}(t) = \sum_i V_i(t)\,\mathbf{C}_i \qquad (1.1)$$
where the weight V_i(t) is the activity (frequency of discharge) of the ith neuron at time bin t. The neuronal population vector has proved to be a good predictor of the direction of movement (for a review see Georgopoulos 1990; Georgopoulos et al. 1993). Moreover, the population vector can be used as a probe by which to monitor in time the changing directional tendency of the neuronal ensemble. One can obtain the time evolution of the population vector by calculating it at short successive intervals of time, or continuously, during the periods of interest. Adding these population vectors together, tip-to-tail, one may obtain a neural trajectory. It was shown that real trajectories of limb movement can be accurately predicted by neural trajectories (Georgopoulos et al. 1988; Schwartz and Anderson 1989; Schwartz 1993).

It was hypothesized (Georgopoulos et al. 1993) that the observed dynamic evolution of the neuronal population vector is governed by the interactions between directionally tuned neurons in motor cortex, while extrinsic inputs can initiate the changes in activity and contribute temporarily or constantly to the ongoing activity. Two types of neural network models could be suggested in the framework of this hypothesis. In a model of the first type, the movement is decomposed into piecewise parts, and local geometric parameters of a desired trajectory are introduced into the network by a mechanism of continuous updating of the current position (Bullock and Grossberg 1988; Lukashin and Georgopoulos 1993). The main disadvantage of this model is that it needs a mechanism for relatively fast local changes of synaptic weights during the movements. The second type of model may be treated as an opposite limiting case. It could be supposed that subsets of synaptic weights in the motor cortex permanently store information about possible trajectories, or at least about their essential parts, and that synaptic weights do not change during the movement. Then for realization of a particular trajectory only one external command is needed: namely, a global activation of an appropriate neuronal subset.

The purpose of the present paper is to simulate the dynamic evolution of the neuronal population vector in the framework of the second model above. We consider a one-layer feedback network that consists of fully interconnected neuron-like units. In full analogy with experimental approaches, the neuronal population vector is calculated at successive instants of time in accordance with equation 1.1 as a vector sum of activities of units. A neural trajectory is computed by attaching these vectors tip-to-tail. The network is trained to generate a neural trajectory that coincides with a given curve, and its synaptic weights are adjusted until it does. This is repeated for different trajectories. It is obvious that practically any kind of reasonable dynamic evolution could be reached by an appropriate learning procedure; for example, rather complex dynamics of trained neuronal ensembles have been demonstrated by Jordan (1986), Pineda (1987), Dehaene et al. (1987), Massone and Bizzi (1989), Pearlmutter (1989), Williams and Zipser (1989), Fang and Sejnowski (1990), and Amirikian and Lukashin (1992). For the same network design, learning different trajectories entails different sets of synaptic weights. Moreover, one and the same trajectory can be generated by the network with different sets of connection strengths. The main question we address in the present paper is whether these sets of connection strengths reveal common features. The results of this analysis are compared with experimental data (Georgopoulos et al. 1993) concerning functional connections between directionally tuned neurons in the motor cortex.
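The tip-to-tail construction of a neural trajectory is simple to pin down in code. Here is a minimal sketch under our own conventions (planar unit preferred-direction vectors, activity sampled at successive time bins; the names are ours):

```python
import numpy as np

def neural_trajectory(V, theta):
    """Population vectors (equation 1.1) accumulated tip-to-tail.

    V     : (K, N) array of firing rates, one row per time bin
    theta : (N,) preferred directions of the N neurons (radians)
    Returns the (K, 2) sequence of neural trajectory points.
    """
    C = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # unit vectors C_i
    P = V @ C                        # population vector at each time bin
    return np.cumsum(P, axis=0)      # attach successive vectors tip-to-tail
```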
appropriate learning procedure; for example, rather complex dynamics of trained neuronal ensembles have been demonstrated by Jordan (1986), Pineda (1987), Dehaene et al. (1987), Massone and Bizzi (1989), Pearlmutter (1989), Williams and Zipser (1989), Fang and Sejnowski (1990), and Amirikian and Lukashin (1992). For the same network design, learning different trajectories entails different sets of synaptic weights. Moreover, one and the same trajectory can be generated by the network with different sets of connection strengths. The main question we address in the present paper is whether these sets of connection strengths reveal common features. The results of this analysis are compared with experimental data (Georgopoulos et a / . 1993) concerning functional connections between directionally tuned neurons in the motor cortex. 2 Model and Learning Procedure
We consider a network of N neurons whose dynamics is governed by the following system of coupled differential equations: 1fU;
Tdt V,(t) El
-
(2.1)
-
=
tanh[ir,(t)]
(2.2)
=
COS(H
(2.3)
- (ti)
The argument t is shown for values which depend on time. The variable u_i(t) represents the internal state (for example, the soma membrane potential) and the variable V_i(t) represents correspondingly the output activity (for example, firing frequency) of the ith neuron, τ is a constant giving the time scale of the dynamics, and w_ij is the strength of the interaction between neurons (j ≠ i). The external input E_i (2.3) serves to assign a preferred direction to the ith neuron. Indeed, in the simplest case, w_ij = 0, one has u_i(t ≫ τ) = E_i and V_i ≈ cos(θ - θ_i). Thus, if the angle θ is treated as a direction of "movement" that is given externally, then the angle θ_i can be regarded as the preferred direction of the ith neuron. It is noteworthy that preferred directions of motor cortical neurons range throughout the directional continuum (Georgopoulos et al. 1988). The same type of distribution was obtained for a network that learns arbitrary transformations between input and output vectors (Lukashin 1990). Therefore, below we use a random uniform distribution of the angles θ_i. Once preferred directions are assigned, components of the neuronal population vector P can be calculated as the decomposition (equation 1.1) over preferred directions:
P_x(t) = Σ_i V_i(t) cos θ_i,    P_y(t) = Σ_i V_i(t) sin θ_i    (2.4)
where the time dependence of the V_i values is determined by equations 2.1-2.3. Equations 2.4 may be interpreted as an addition of two output units with assigned synaptic weights. Let a desired two-dimensional trajectory be given as a sequence of points with coordinates X^d(t_k), Y^d(t_k), k = 1, ..., K. In accordance with the above consideration, the corresponding points X^a(t_k), Y^a(t_k) of the actual trajectory generated by the network should be calculated by attaching successive population vectors:

X^a(t_k) = X^a(t_{k−1}) + P_x(t_k),    Y^a(t_k) = Y^a(t_{k−1}) + P_y(t_k)    (2.5)

The goal of the training procedure is to find a set of connection strengths w_ij that ensures that the difference between desired and actual trajectories is as small as possible. We minimized this difference by means of the simulated annealing algorithm (Kirkpatrick et al. 1983), treating the chosen cost function

E = Σ_{k=1}^{K} { [X^d(t_k) − X^a(t_k)]² + [Y^d(t_k) − Y^a(t_k)]² }    (2.6)

as the "energy" of the system. The optimization scheme is based on the standard Monte Carlo procedure (Aart and van Laarhoven 1987) that accepts not only changes in synaptic weights w_ij that lower the energy, but also changes that raise it. The probability of the latter event is chosen such that the system eventually obeys the Boltzmann distribution at a given temperature. The simulated annealing procedure is initialized at a sufficiently high temperature, at which a relatively large number of state changes are accepted. The temperature is then decreased according to a cooling schedule. If the cooling is slow enough for equilibrium to be established at each temperature, the global minimum is reached in the limit of zero temperature. Although the achievement of the global minimum cannot be guaranteed in practice when the optimal cooling rate is unknown, the simulated annealing algorithm seems to be the most adequate procedure for our specific purposes. We wish to extract the common features of the sets of synaptic weights ensuring different trajectories. In general, each given trajectory can be realized by different sets of synaptic weights. A complete analysis of the problem needs exhaustive enumeration of all possible network configurations, which can be done only for sufficiently simple systems (Carnevali and Patarnello 1987; Denker et al. 1987; Baum and Haussler 1989; Schwartz et al. 1990). The advantage of the simulated annealing method is that during the procedure the treated system at each temperature (including the zero-temperature limit) tends to occupy the likeliest (in a thermodynamical sense) regions of the phase space (Kirkpatrick et al. 1983; Aart and van Laarhoven 1987). Thus the algorithm provides a useful tool for obtaining a likeliest or "typical" solution of the problem.
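Because the procedure just described is compact, it can be summarized in a short numerical sketch. The following Python fragment is ours, not the authors' code: the Euler step size, the function names, and the single-weight Metropolis move are illustrative assumptions, and the desired trajectory is assumed to start at the origin in "neural" units.

```python
import numpy as np

rng = np.random.default_rng(0)

N, tau, dt = 24, 1.0, 0.01                 # tau sets the time scale (eq. 2.1)
theta_i = rng.uniform(0, 2 * np.pi, N)     # preferred directions (uniform)
theta = 0.3                                # externally given direction
E = np.cos(theta - theta_i)                # external input E_i (eq. 2.3)

def neural_trajectory(W, K=300, substeps=10):
    """Integrate eqs. 2.1-2.2 and attach population vectors tip-to-tail."""
    u = np.zeros(N)
    xs, ys = [0.0], [0.0]
    for _ in range(K):
        for _ in range(substeps):
            V = np.tanh(u)                          # eq. 2.2
            u += (dt / tau) * (-u + W @ V + E)      # eq. 2.1
        V = np.tanh(u)
        xs.append(xs[-1] + np.sum(V * np.cos(theta_i)))   # eqs. 2.4-2.5
        ys.append(ys[-1] + np.sum(V * np.sin(theta_i)))
    return np.array(xs), np.array(ys)

def cost(W, Xd, Yd):
    """Summed squared deviation between desired and actual points (eq. 2.6)."""
    Xa, Ya = neural_trajectory(W, K=len(Xd) - 1)
    return np.sum((Xa - Xd) ** 2 + (Ya - Yd) ** 2)

def anneal_step(W, Xd, Yd, T):
    """One Metropolis step: perturb one weight, accept by the Boltzmann rule.
    (A real implementation would cache the current energy between steps.)"""
    i, j = rng.integers(N, size=2)
    if i == j:                                  # keep w_ii = 0
        return W
    W_new = W.copy()
    W_new[i, j] = rng.uniform(-0.5, 0.5)        # no symmetry assumed
    dE = cost(W_new, Xd, Yd) - cost(W, Xd, Yd)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        return W_new
    return W
```

Training then amounts to iterating anneal_step over a desired point sequence while the temperature is lowered according to the exponential schedule described in the next section.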
3 Results of Simulations
The minimal size of the network that still allows the realization of the desired dynamics is about 10 units. In routine calculations we used networks with the number of neurons N from 16 to 48. Since in this range the size of the network was not an essential parameter, below we show the results only for N = 24. During the learning procedure, the randomly chosen set of preferred directions θ_i was not varied. For each selected set of connection strengths w_ij, the system of equations 2.1-2.3 was solved as the initial value problem, u_i(0) = 0, using a fifth-order Runge-Kutta-Fehlberg formula with automatic control of the step size during the integration. Components of the neuronal population vector (equation 2.4), current positions on the actual trajectory (equation 2.5), and the addition to the cost function (equation 2.6) were calculated at time instances separated from each other by the interval τ/100. The total running time ranged from τ (K = 100) to 5τ (K = 500). Below we show results for K = 300. The time constant, τ, is usually thought of as the membrane time constant, about 5 msec. At this point one should take into account that the running time in the model is completely determined by the time that it takes for a desired trajectory to complete, and may be given arbitrarily. Since the crucial parameter for us was the shape of trajectories, we considered short (or fast) trajectories in order to make the training procedure less time-consuming. Slower trajectories can easily be obtained. Nevertheless, we note that a direct comparison of the velocity of the real movement and the velocity obtained in the model is impossible. The model operates with neural trajectories, and the question of how the "neural" length is related to the real length of a trajectory cannot be answered within the model. For each learning trial, the connection strengths w_ij were initialized to uniform random values between −0.5 and 0.5. The temperature at the initial stages of the simulated annealing was chosen so that practically all states of the system were accepted. During the simulated annealing procedure, values w_ij were selected randomly from the same interval [−0.5, 0.5] without assuming symmetry. The angle θ (equation 2.3) was also treated as a variable parameter on the interval [0, 2π]. We used the standard exponential cooling schedule (Kirkpatrick et al. 1983): T_{n+1} = βT_n, where T_n is the temperature at the nth step; the value 1 − β was small and was varied across runs. Each step of the simulated annealing procedure included a change of one parameter and the entire recalculation of the current trajectory. We checked the robustness of the results with respect to different series of random numbers used for the generation of particular sets of preferred directions θ_i and during the realization of the simulated annealing procedure (about 10 trials for each desired trajectory; data not shown). Figure 1 shows three examples of desired curves and trajectories produced by the trained network described above.
Figure 1: The (X, Y)-plots of desired (upper) and actual (lower) trajectories. Arrows show directions of tracing. The actual curves shown were obtained after the following number of steps of the simulated annealing procedure: 2 × 10⁴ for the orthogonal bend (a), 9 × 10⁴ for the sinusoid (b), and 4 × 10⁵ for the ellipse with the relation between axes 3:1 (c).

It is seen that the actual trajectories generated by the network reproduce the desired ones very well. The trajectories generated by the network (Fig. 1) do not correspond to the global minimum of the cost function (equation 2.6). In all cases these are local minima. This is the reason why the corner in Figure 1a is rounded and the curve in Figure 1c is not closed. If allowed to continue, the network in Figure 1c would trace a finite unclosed trajectory. However, we have found that a limit cycle close to the desired elliptic trajectory can be obtained if the network is trained to trace the elliptic trajectory twice. To extract the common features of the sets of synaptic weights giving the dynamics shown in Figure 1, we calculated the mean value of the synaptic weight as a function of the angle between the preferred directions of the two neurons in a pair. Corresponding results are shown in Figure 2a, b, c for each trajectory presented in Figure 1a, b, c. Regardless of the type of trajectory generated by the network, the mean connection strength is negatively correlated with the angle between preferred directions: r = −0.86 for the orthogonal bend (Fig. 2a), −0.90 for the sinusoid (Fig. 2b), and −0.95 for the ellipse (Fig. 2c).
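For concreteness, the analysis behind Figure 2 can be sketched as follows. This is a hypothetical helper, not code from the paper; the function name and the treatment of bin edges are our assumptions, and the 18° bin width is taken from the Figure 2 legend.

```python
import numpy as np

def weight_vs_angle(W, theta_i, bin_deg=18.0):
    """Mean w_ij (and SEM) as a function of the angle between the
    preferred directions of the pre- and postsynaptic neurons."""
    N = W.shape[0]
    diffs, weights = [], []
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # Absolute angular difference, wrapped into [0, 180] degrees:
            d = np.abs(np.angle(np.exp(1j * (theta_i[i] - theta_i[j]))))
            diffs.append(np.degrees(d))
            weights.append(W[i, j])
    diffs, weights = np.array(diffs), np.array(weights)
    edges = np.arange(0.0, 180.0 + bin_deg, bin_deg)
    centers, means, sems = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (diffs >= lo) & (diffs < hi)
        if m.any():
            centers.append((lo + hi) / 2)
            means.append(weights[m].mean())
            sems.append(weights[m].std(ddof=1) / np.sqrt(m.sum()))
    # Linear correlation of bin means with angle, as reported in Fig. 2:
    r = np.corrcoef(centers, means)[0, 1]
    return np.array(centers), np.array(means), np.array(sems), r
```

Applied to trained weight matrices, the returned r would correspond to the correlations of roughly −0.86 to −0.95 reported above.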
4 Discussion

Increasing efforts have recently been invested in neural network models for motor control (see, for example, Bullock and Grossberg 1988; Massone and Bizzi 1989; Kawato et al. 1990; Burnod et al. 1992; Corradini et al. 1992). An important question is whether the neural networks that control different types of movements share many or few neuronal subsets. At one end of the spectrum, quite different behavior could be produced by continuous modulation of a single network.
Figure 2: The dependence of the mean value (±SEM) of connection strength on the angle between the preferred directions of the neurons involved in the connection. The mean value of connection strength was calculated by averaging over connections between neurons whose preferred directions did not differ from each other by more than 18°. Straight lines are linear regressions. The connection strengths w_ij used in the calculation of the mean values were the same w_ij parameters that gave the actual trajectories presented in Figure 1: (a) orthogonal bend, (b) sinusoid, and (c) ellipse.
At the other end, different subsets could generate each type of movement or "movement primitive." Taken together in sequential chains or in parallel combinations, these movement primitives may provide a variety of natural behavior. Both types of organization have been found experimentally (for a discussion see, for example, Alexander et al. 1986; Harris-Warrick and Marder 1991; Bizzi et al. 1991). Clearly, intermediate cases involving multiple networks with overlapping elements are likely. The model we have used implies that synaptic weights do not change during the movement. This means that at the level of the motor cortex different trajectories are realized by different neuronal subsets or by different sets of synaptic weights that store the information about trajectories. Our main result is that although different trajectories correspond to different sets of synaptic weights, all of these sets clearly have a common feature: namely, neurons with similar preferred directions tend to be mutually excitatory, those with opposite preferred directions tend to be mutually inhibitory, whereas those with orthogonal preferred directions tend to be connected weakly or not at all (see Fig. 2). Remarkably, the same structure of the synaptic weight matrix was obtained in modeling
studies of connection strengths that would ensure the stability of the neuronal population vector (Georgopoulos et al. 1993). The results of this study are relevant to those obtained experimentally in the motor cortex (Georgopoulos et al. 1993). In those studies, the connectivity between cells was examined by recording the impulse activity of several neurons simultaneously. The correlation between the firing times of pairs of neurons was examined. The correlation reveals the net effect of the whole synaptic substrate through which two neurons interact, including both direct and indirect connections; it represents the "functional connection" between the two neurons. The weight of a connection was estimated by calculating the "difference distribution" between the observed and randomly shuffled distributions of waiting times (for details see Note 34 in Georgopoulos et al. 1993). It was found that the mean connection strength was negatively correlated with the difference between the preferred directions of the neurons in a pair (r = −0.815). This result is in good agreement with the results of our calculations (Fig. 2). Although the weight of the functional connection estimated experimentally is not completely equivalent to the efficacy of a single synapse that is implied in the model, our simulations show how this type of organization of connections in the motor cortex can provide a dynamic evolution of the neuronal population vector during limb movement. The correlations between the strength of interaction and the similarity among units observed in the experiments and in our simulations might reflect a general principle of the organization of connections in the central nervous system (for a discussion see Ts'o et al. 1986; Martin 1988; Sejnowski et al. 1988; Douglas et al. 1989; Georgopoulos et al. 1993).

Acknowledgments

This work was supported by United States Public Health Service Grants NS17413 and MH48185, Office of Naval Research contract N00014-88-K-0751, and a grant from the Human Frontier Science Program. A. V. Lukashin is on leave from the Institute of Molecular Genetics, Russian Academy of Sciences, Moscow, Russia.

References

Aart, E. H. L., and van Laarhoven, P. J. M. 1987. Simulated Annealing: A Review of the Theory and Applications. Kluwer Academic Publishers.
Alexander, G. E., DeLong, M. R., and Strick, P. L. 1986. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci. 9, 357-381.
Amirikian, B. R., and Lukashin, A. V. 1992. A neural network learns trajectory of motion from the least action principle. Biol. Cybern. 66, 261-264.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Bizzi, E., Mussa-Ivaldi, F., and Giszter, S. 1991. Computations underlying the execution of movement: A biological perspective. Science 253, 287-291.
Bullock, D., and Grossberg, S. 1988. Neural dynamics of planned arm movements: Emergent invariants and speed-accuracy properties during trajectory formation. Psychol. Rev. 95, 49-90.
Burnod, Y., Grandguillaume, P., Otto, I., Ferraina, S., Johnson, P. B., and Caminiti, R. 1992. Visuomotor transformations underlying arm movement toward visual targets: A neural network model of cerebral cortical operations. J. Neurosci. 12, 1435-1453.
Carnevali, P., and Patarnello, S. 1987. Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199-1204.
Corradini, M. L., Gentilucci, M., Leo, T., and Rizzolatti, G. 1992. Motor control of voluntary arm movements. Kinematic and modelling study. Biol. Cybern. 67, 347-360.
Dehaene, S., Changeux, J. P., and Nadal, J. P. 1987. Neural networks that learn temporal sequences by selection. Proc. Natl. Acad. Sci. U.S.A. 84, 2727-2731.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction, and generalization. Complex Syst. 1, 877-922.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1989. A canonical microcircuit for neocortex. Neural Comp. 1, 480-488.
Fang, Y., and Sejnowski, T. J. 1990. Faster learning for dynamic recurrent backpropagation. Neural Comp. 2, 270-273.
Georgopoulos, A. P. 1990. Neural coding of the direction of reaching and a comparison with saccadic eye movements. Cold Spring Harbor Symp. Quant. Biol. 55, 849-859.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., and Massey, J. T. 1982. On the relation between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci. 2, 1527-1537.
Georgopoulos, A. P., Caminiti, R., Kalaska, J. F., and Massey, J. T. 1983. Spatial coding of movement: A hypothesis concerning the coding of movement by motor cortical populations. Exp. Brain Res. Suppl. 7, 327-336.
Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.
Georgopoulos, A. P., Kettner, R. E., and Schwartz, A. B. 1988. Primate motor cortex and free arm movement to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci. 8, 2928-2937.
Georgopoulos, A. P., Taira, M., and Lukashin, A. V. 1993. Cognitive neurophysiology of the motor cortex. Science 260, 47-52.
Harris-Warrick, R. M., and Marder, E. 1991. Modulation of neural networks for behavior. Annu. Rev. Neurosci. 14, 39-57.
Jordan, M. I. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the 1986 Cognitive Science Conference, pp. 531-546. Erlbaum, Hillsdale, NJ.
Kawato, M., Maeda, Y., Uno, Y., and Suzuki, R. 1990. Trajectory formation of arm movements by cascade neural network models based on minimum torque change criterion. Biol. Cybern. 62, 275-288.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Lukashin, A. V. 1990. A learned neural network that simulates properties of the neuronal population vector. Biol. Cybern. 63, 377-382.
Lukashin, A. V., and Georgopoulos, A. P. 1993. A dynamical neural network model for motor cortical activity during movement: Population coding of movement trajectories. Biol. Cybern., in press.
Martin, K. A. C. 1988. From single cells to simple circuits in the cerebral cortex. Q. J. Exp. Physiol. 73, 637-702.
Massone, L., and Bizzi, E. 1989. A neural network model for limb trajectory formation. Biol. Cybern. 61, 417-425.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. 1987. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Schwartz, A. B. 1993. Motor cortical activity during drawing movements: Population representation during sinusoid tracing. J. Neurophysiol., in press.
Schwartz, A. B., and Anderson, B. J. 1989. Motor cortical images of sinusoidal trajectories. Soc. Neurosci. Abstr. 15, 788.
Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. 1990. Exhaustive learning. Neural Comp. 2, 374-385.
Sejnowski, T. J., Koch, C., and Churchland, P. S. 1988. Computational neuroscience. Science 241, 1299-1306.
Ts'o, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationship between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Received February 10, 1993; accepted June 9, 1993.
Communicated by Michael Arbib
Theoretical Considerations for the Analysis of Population Coding in Motor Cortex

Terence D. Sanger
MIT, E25-534, Cambridge, MA 02139 USA

Recent evidence of population coding in motor cortex has led some researchers to claim that certain variables such as hand direction or force may be coded within a Cartesian coordinate system with respect to extrapersonal space. These claims are based on the ability to predict the rectangular coordinates of hand movement direction using a "population vector" computed from multiple cells' firing rates. I show here that such a population vector can always be found given a very general set of assumptions. Therefore the existence of a population vector constitutes only weak support for the explicit use of a particular coordinate representation by motor cortex.

Neural Computation 6, 29-37 (1994)
© 1993 Massachusetts Institute of Technology

1 Introduction

Recent results suggest that the representation of arm movement in motor cortex involves the simultaneous activity of many cells, and that the pattern of activation over the group of cells specifies the motion that occurs (Caminiti et al. 1990; Kalaska and Crammond 1992, for review). These results have led many researchers to ask whether movement variables are coded internally in terms of a particular coordinate system such as a Cartesian or polar representation of extrapersonal space, a representation of muscle lengths around relevant joints, or some other set of coordinates. In the following, I distinguish between a coded variable (such as hand position) and the coordinates used to represent that variable (such as Cartesian coordinates). I will summarize certain experiments that demonstrate that hand movement direction is represented within motor cortex, but I claim that these experiments cannot be used to determine the coordinate system in which movements are coded. I discuss a set of experiments that investigated the relationship between cell firing rates during free arm movements in awake monkeys, and the direction in which the hand was moved to a target in space (Georgopoulos et al. 1988; Kettner et al. 1988; Schwartz et al. 1988). These experiments led to the following results:
R1: The firing rate of 89.1% (486/568) of the cells tested in motor cortex varied consistently with the direction of hand motion within a
limited region of space, and many cells would be simultaneously active for any given direction.

R2: A statistically significant component of the variance of the firing rate of 83.6% of the cells could be accounted for by a broadly tuned function of the form

d_i(M) ≈ b_i + k_i cos(θ_i − θ_M)    (1.1)
where d_i(M) is the firing rate of cell i for hand motion in the direction of a unit vector M, θ_i is the direction of motion in which the cell has maximal response, θ_i − θ_M is the angle¹ between the direction of hand motion θ_M and the cell's preferred direction, and b_i and k_i determine the average firing rate and modulation depth, respectively. (Here and in the following, capital letters indicate vector quantities.)

¹A difference of 3D angles is defined by θ_i − θ_M = cos⁻¹(C_i · M), where C_i and M are both unit vectors.

R3: The preferred directions θ_i are approximately uniformly distributed with respect to directions in the workspace.

R4: The hand direction vector M can be approximated in Cartesian coordinates by a population vector P computed from a linear combination of the cell firing rates.

R5: The coefficients of this linear combination are given by unit vectors C_i along the preferred direction θ_i for each cell, so that

M ≈ P = Σ_{i=1}^{N} C_i d_i    (1.2)
where the d_i have been normalized to account for resting firing rate and response amplitude, and both the movement direction M and the preferred direction vectors C_i are given in Cartesian coordinates with respect to the external workspace. Together, these results might suggest that a Cartesian representation of the direction of hand motion is coded in motor cortex (Schwartz et al. 1988). I will show that results R2, R4, and R5 are direct consequences of results R1, R3, and the experimental design. This in no way reduces the importance of these experiments, but rather emphasizes the fact that results R1 and R3 contain the most significant information. Although their importance was recognized in Georgopoulos et al. (1988), the fact that they imply the other results was not. Previous investigations have studied the conditions under which results R4 and R5 hold, and it has been shown that the population vector predicts the direction of hand motion if
the tuning curve is symmetric and the distribution of preferred directions is uniform (Georgopoulos et al. 1988) or has no second harmonic components (Mussa-Ivaldi 1988). I derive a necessary and sufficient condition that is even broader, since it requires only that the three components of the preferred directions be uncorrelated with each other over the population. Before doing this, I will first show that the cosine tuning curves found in Schwartz et al. (1988) may be an artifact of the analytic techniques used.

2 Single Unit Tuning Curves

In Schwartz et al. (1988), the firing rate of each tuned cell d_i(M) is approximated by a linear combination of the normalized Cartesian coordinates of the target toward which the monkey is reaching. These coordinates are relative to the initial hand position X_0 and are given by a unit vector in the direction of motion M = (m_x, m_y, m_z). The linear approximation is

d_i(M) = b_i + b_ix m_x + b_iy m_y + b_iz m_z    (2.1)
and an F test showed that the variance of 83.6% of all cells was at least partly accounted for by this linear regression. The preferred direction vector C_i is calculated from

k_i = (b_ix² + b_iy² + b_iz²)^{1/2}    (2.2)

C_i = (b_ix, b_iy, b_iz) / k_i    (2.3)

and we can now write

d_i(M) = b_i + k_i C_i · M    (2.4)
which is equivalent to equation 1.1. Note that cells with k_i = 0 are not sensitive to the direction of movement and were not analyzed further, so k_i ≠ 0. To understand results R1 and R2, I perform a simplified analysis of movement in two dimensions (the extension to three dimensions is straightforward but complicates the notation significantly). For a fixed initial hand position and with all other variables held constant, consider any arbitrary firing rate function d(θ_M) that depends on the direction of hand movement θ_M. θ_M is a periodic variable, so the output of d(θ_M) will be periodic, and if eight uniformly spaced directions are tested then the complete behavior can be described by a discrete Fourier series with periods up to π/2:

d(θ_M) = Σ_{k=0}^{4} α_k cos(kθ_M + φ_k)    (2.5)

where φ_k is the phase for each angular frequency component k. Note that for k > 1 the terms have no directional component, since they consist of
either two, three, or four "lobes" symmetrically placed around the circle. Thus a linear regression on the Cartesian coordinates x = cos(θ_M), y = sin(θ_M) will be unaffected by the values of α_2, α_3, and α_4 and will depend only on α_0 and α_1. To see this, note that linear regression computes the three expected values:
E[d(θ_M)] = α_0 cos(φ_0)

E[d(θ_M) cos(θ_M)] = (α_1/2) cos(φ_1)

E[d(θ_M) sin(θ_M)] = −(α_1/2) sin(φ_1)
where the expectation operator E[·] is taken over all tested directions θ_M. The preferred direction is therefore determined by φ_1 alone and is independent of α_2, α_3, and α_4 (and of φ_2, φ_3, and φ_4). Even if more than eight directions are tested, the linear regression will respond only to the k = 1 component. The "goodness of fit" to the linear regression is the extent to which the k = 0 and k = 1 terms capture the behavior of d(θ_M). However, it is important to realize that a statistically significant F test does not indicate a good fit to a linear model in the sense of having small prediction error variance. Fit is determined by mean squared error, which distributes according to a χ² statistic. The F test estimates only the probability that the linear model accounts for some portion of the total variance. This is equivalent to testing whether α_1 is significantly different from 0. A significant F test does not imply that α_1 describes the dominant response behavior, and α_2, α_3, or α_4 might well be larger. If a set of tuning curves were generated randomly by selecting the coefficients α_k independently from a normal distribution, then one would expect 95% of the tuning curves to have statistically significant values of α_1. Thus the observed value of 83.6% [93% in Caminiti et al. (1990)] does not support statistical arguments that the population has been "engineered" to have directional tuning. Since this method of analysis ignores terms for k > 1, it in effect low-pass filters the tuning curves. So the cosine tuning results from the method of analysis and may not be justified by the original data. These considerations show that result R2 does not provide any information beyond result R1, since R2 would be true for a randomly chosen set of tuning curves satisfying R1 that were analyzed in this way.
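This argument is easy to verify numerically. The sketch below (our assumed variable names; eight uniformly spaced directions, as in the experiments) generates an arbitrary tuning curve with components up to k = 4 and confirms that the regression recovers only the k = 1 term.

```python
import numpy as np

rng = np.random.default_rng(1)
thetas = np.linspace(0, 2 * np.pi, 8, endpoint=False)  # 8 tested directions

# An arbitrary tuning curve with angular frequencies up to k = 4 (eq. 2.5):
alpha = rng.normal(size=5)
phi = rng.uniform(0, 2 * np.pi, 5)
d = sum(alpha[k] * np.cos(k * thetas + phi[k]) for k in range(5))

# Linear regression on the Cartesian coordinates of direction (eq. 2.1):
X = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
b0, bx, by = np.linalg.lstsq(X, d, rcond=None)[0]

# Only the k = 1 component survives: bx = alpha_1*cos(phi_1) and
# by = -alpha_1*sin(phi_1), so the fitted "preferred direction" is -phi_1
# no matter how large alpha_2..alpha_4 are.
assert np.allclose(bx, alpha[1] * np.cos(phi[1]))
assert np.allclose(by, -alpha[1] * np.sin(phi[1]))
```

The higher components are orthogonal to cos θ_M and sin θ_M over the eight tested directions, which is why the asserts hold exactly up to floating-point error.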
True cosine tuning could be verified by fitting equation 1.1 to data samples from many different directions and measuring the average mean-squared approximation error over the population using a χ² statistic. A similar test was done in the two-dimensional case, where it was found that 75% of 241 cells had a normalized mean-squared approximation error less than 30% of the total variance (Georgopoulos et al. 1982). Although this is not a statistically good fit to the population, there may have been individual cells whose response was well predicted by cosine tuning. What is the significance of the cells that were well fit by a cosine tuning curve? As shown in equation 2.4, these cells have a response d that is approximately linearly related to the hand movement vector M. We can thus claim either that these cells are in fact linear in the movement direction, or else that they are linear in the testing region but may be nonlinear if tested in other regions of space. So if we write the response as d(X_0, X) where X_0 is the initial hand position and X is the target, then we know that d(X_0, X) must be sufficiently smooth that it appears locally linear for the positions X that were tested. Over larger distances, d may not be well approximated linearly, but it can still be written as
d(M) ≈ b + kC(X_0) · M    (2.6)
where C(X_0) emphasizes that the preferred direction may become dependent on the initial position, as was indeed found in Caminiti et al. (1990). But equation 2.6 is a general representation for arbitrary smooth functions, so even an accurate fit to a locally linear function does not allow one to claim much beyond the fact that the preferred direction remains approximately constant over the tested region.
3 Population Vectors

Result R4, that there exists a linear combination of the firing rates that can predict the Cartesian coordinates of hand motion, follows as a direct consequence of well-known results on coarse coding and the theory of radial basis functions (Poggio and Girosi 1990, for example), since a raised cosine function of angle can be thought of as a local basis function centered on the preferred direction. An alternate way to prove this fact follows. Define an N × 3 matrix Q whose rows are the preferred direction vectors C_i. Let D be an N-dimensional column vector formed from the firing rates of all the cells by the formula [D]_i = (d_i − b_i)/k_i, as in Georgopoulos et al. (1988). Then from equation 2.4 we can write
D ≈ QM    (3.1)
We seek a 3 × N weighting matrix H such that a population vector of the form HD predicts hand direction M according to

M ≈ HD = HQM    (3.2)
There are many matrices H that will satisfy this equation. One possibility is to use linear least-squares regression, giving

H = (QᵀQ)⁻¹Qᵀ    (3.3)
where the inverse (QᵀQ)⁻¹ will always exist so long as there are three linearly independent preferred direction vectors. We now have

M ≈ HD ≈ HQM = (QᵀQ)⁻¹QᵀQM = M    (3.4)
as desired. This equation means that so long as there exist three linearly independent direction vectors, the hand direction will be approximately linearly related to the cell firing rates d_i in any coordinate system for M that satisfies equation 3.1. So far I have shown that result R1 implies both results R2 and R4, given the method of analysis. In Georgopoulos et al. (1988) the columns of H were not found by performing a regression of the cell firing rates against the hand direction according to equation 3.3, but instead were assumed a priori to be equal to the preferred direction C_i for each cell, so that H = Qᵀ. I now discuss under what conditions result R5 holds, so that this particular linear combination will give the right answer. The population vector is given by equation 1.2, which we can rewrite in vector notation as

M ≈ P = QᵀD ≈ QᵀQM    (3.5)
and if this holds for all directions M then we must have QᵀQ = I. This is a necessary condition for the existence of a population vector. In Georgopoulos et al. (1988) a more restrictive sufficient condition satisfying equation 1.2 is that the distribution of preferred directions is uniform over the sphere. Another necessary and sufficient condition, based on Fourier analysis of the distribution of preferred directions for the planar case, is given in Mussa-Ivaldi (1988). To understand the meaning of equation 3.5, we can write each component of QᵀQ as

[QᵀQ]_jk = Σ_{i=1}^{N} [C_i]_j [C_i]_k
and I = QᵀQ implies that Σ_{i=1}^{N} [C_i]_j [C_i]_k = 0 whenever j ≠ k. This expression is the correlation of the jth and kth components of the preferred direction vectors C_i, so a necessary and sufficient condition for equation 1.2 to work is that the x, y, and z components of these vectors are uncorrelated and have equal variance. The result that equation 1.2 is satisfied is thus implied by the approximately uniform distribution of cell preferred directions in result R3. Note that for other coordinate systems, even if the components of the C_i are correlated, there will still exist a linear combination H ≠ Qᵀ of the firing rates that will predict the desired values, although the matrix H may need to be found by regression using
equation 3.3. But if both results R1 and R3 hold, then result R5 must hold. Suppose that rather than using the predicted value QM we use the true measured value D, and this includes significant noncosine (nonlinear) terms. Then we have

D = QM + E

where E is a vector with components

e_i(θ_M) = Σ_{k=2}^{4} α_ik cos(kθ_M + φ_ik)
If the terms α_ik and φ_ik are distributed independently of the components of C_i, then QᵀE ≈ 0 and these terms will not affect the value of the population vector. So even if the individual cells do not have cosine tuning, the population vector will correctly predict hand direction if the terms for k > 1 do not correlate with the terms for k = 1 in the expansion given in equation 2.5. If the experiments are repeated with differing initial positions as in Caminiti et al. (1990), then the preferred directions C_i may change. This will lead to a new matrix Q′ so that D′ = Q′M. Population vector analysis under the new conditions will give P′ = Q′ᵀQ′M, so again the requirement for success is that the components of the new preferred directions are uncorrelated. The fact that population vectors "proved to be good predictors of movement direction regardless of where in space the movements were performed" (Caminiti et al. 1990, p. 2039) provides no information beyond the knowledge that the components of the preferred directions remain uncorrelated as the initial hand position changes.
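The condition can be checked numerically. In the sketch below (our assumptions: 500 cells and preferred directions drawn uniformly on the sphere), QᵀQ is proportional to the identity up to sampling error, so the unweighted population vector P = QᵀD recovers M up to a scale of roughly N/3.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
C = rng.normal(size=(N, 3))
C /= np.linalg.norm(C, axis=1, keepdims=True)  # uniform directions on sphere
Q = C                                          # rows = preferred directions C_i

M = np.array([0.6, -0.64, 0.48])               # a unit movement direction
D = Q @ M                                      # normalized firing rates (eq. 3.1)

print(3.0 / N * (Q.T @ Q))                     # close to the identity matrix
P = Q.T @ D                                    # population vector (eq. 1.2)
print(P / np.linalg.norm(P))                   # nearly parallel to M
```

Making the component correlations of C nonzero (for example, by stretching one axis before normalizing) makes P systematically deviate from M, which is the failure mode the text describes.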
4 Coordinate-Free Representations

One might ask if the experiments described above could be modified to determine the "true" coordinate system used by motor cortex to describe hand movement direction. However, I claim that for certain classes of distributed representation this is not a well-defined question. Distributed representations of measured variables can be coordinate-free in the sense that they do not imply any particular coordinate system. To see this, let X be any variable represented in cortex (such as hand movement direction), and let D(X) be a vector-valued function representing the outputs of a large set of basis functions d_i(X) that describe the behavior of (motor) cortical cells. D(X) is then a distributed representation of the variable X. Now, consider a vector function T(X) that measures X in a particular coordinate system [T(X) might give the three Cartesian components of hand movement direction, for example]. If there exists a matrix H such that HD(X) ≈ T(X), then one can say that the distributed representation D codes the coordinate system T. Yet this will hold for any T(X) that is
close to the linear span of the basis functions d_i(X), so we cannot claim that D encodes any single coordinate system for X within this span better than another.

5 Conclusion
In this letter, I have extended the generality of previous results (Georgopoulos et al. 1988; Mussa-Ivaldi 1988) to show that cosine tuning curves will be found for large classes of arbitrary response functions if they are analyzed according to the statistical techniques in Schwartz et al. (1988), and that the existence of a population vector as found in Georgopoulos et al. (1988) is determined by very general necessary and sufficient conditions that depend only on the distribution of preferred directions rather than on any intrinsically coded coordinate system. The concept that a distributed representation codes a particular coordinate system may not be well-defined, since certain types of representations can be considered "coordinate-free." These considerations imply that experiments of the type described may yield population vectors that predict many different three-dimensional coordinates (such as Cartesian, polar, muscle lengths, or joint angles). It is important to understand that the considerations presented here in no way reduce the importance of the results reported in Schwartz et al. (1988), Georgopoulos et al. (1988), Kettner et al. (1988), Caminiti et al. (1990), and elsewhere. The fact that results R2, R4, and R5 are direct consequences of R1 and R3 serves only to underscore the significance of these two results. They show that large populations of motor cortical cells respond to hand motion in a predictable way, and that the preferred directions are approximately uniformly distributed with respect to a Cartesian representation of extrapersonal space. No additional conclusions can be drawn from the population vector, since its existence is a mathematical consequence of these two facts. However, if the distribution of preferred directions is nonuniform with respect to other coordinate systems or if the distribution can be modified through experience, then this would provide significant information about cortical representations. In addition, if cosine tuning can be verified by explicitly fitting cell tuning curves to a linear regression model, then further studies may discover constraints that explain why more than 486 linear cells are needed to code for only 3 linearly independent components of hand direction.

Acknowledgments
I would like to thank Sandro Mussa-Ivaldi, Ted Milner, Emilio Bizzi, Richard Lippmann, Marc Raibert, and the reviewers for their comments and criticism. This report describes research done within the laboratory of Dr. Emilio Bizzi in the Department of Brain and Cognitive Sciences
at MIT. The author w a s supported during this work by the division of Health Sciences a n d Technology, a n d by N I H Grants 5R37AR26710 a n d 5ROlNS09343 to Dr. Bizzi.
References

Caminiti, R., Johnson, P. B., and Urbano, A. 1990. Making arm movements within different parts of space: Dynamic aspects in the primate motor cortex. J. Neurosci. 10(7), 2039-2058.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., and Massey, J. T. 1982. On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci. 2(11), 1527-1537.
Georgopoulos, A. P., Kettner, R. E., and Schwartz, A. B. 1988. Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci. 8(8), 2928-2937.
Kalaska, J. F., and Crammond, D. J. 1992. Cerebral cortical mechanisms of reaching movements. Science 255, 1517-1523.
Kettner, R. E., Schwartz, A. B., and Georgopoulos, A. P. 1988. Primate motor cortex and free arm movements to visual targets in three-dimensional space. III. Positional gradients and population coding of movement direction from various movement origins. J. Neurosci. 8(8), 2938-2947.
Mussa-Ivaldi, F. A. 1988. Do neurons in the motor cortex encode movement direction? An alternative hypothesis. Neurosci. Lett. 91, 106-111.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Schwartz, A. B., Kettner, R. E., and Georgopoulos, A. P. 1988. Primate motor cortex and free arm movements to visual targets in three-dimensional space. I. Relations between single cell discharge and direction of movement. J. Neurosci. 8(8), 2913-2927.
Received April 28, 1992; accepted May 13, 1993.
Communicated by Stephen Lisberger
Neural Network Model of the Cerebellum: Temporal Discrimination and the Timing of Motor Responses

Dean V. Buonomano*
Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX 77225 USA; Departamento de Matemática Aplicada, Instituto de Matemática, Universidade Estadual de Campinas, Campinas, Brasil; and Laboratório de Psicobiologia, Universidade de São Paulo, Ribeirão Preto, Brasil

Michael D. Mauk
Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX 77225 USA

*Present address: Keck Center for Integrated Neuroscience, University of California, San Francisco, CA 94143.

Neural Computation 6, 38-55 (1994)
© 1993 Massachusetts Institute of Technology

Substantial evidence has established that the cerebellum plays an important role in the generation of movements. An important aspect of motor output is its timing in relation to external stimuli or to other components of a movement. Previous studies suggest that the cerebellum plays a role in the timing of movements. Here we describe a neural network model based on the synaptic organization of the cerebellum that can generate timed responses in the range of tens of milliseconds to seconds. In contrast to previous models, temporal coding emerges from the dynamics of the cerebellar circuitry and depends neither on conduction delays, arrays of elements with different time constants, nor populations of elements oscillating at different frequencies. Instead, time is extracted from the instantaneous granule cell population vector. The subset of active granule cells is time-varying due to the granule-Golgi-granule cell negative feedback. We demonstrate that the population vector of simulated granule cell activity exhibits dynamic, nonperiodic trajectories in response to a periodic input. With time encoded in this manner, the output of the network at a particular interval following the onset of a stimulus can be altered selectively by changing the strength of granule → Purkinje cell connections for those granule cells that are active during the target time window. The memory of the reinforcement at that interval is subsequently expressed as a change in Purkinje cell activity that is appropriately timed with respect to stimulus onset. Thus, the present model demonstrates that
a network based on cerebellar circuitry can learn appropriately timed responses by encoding time as the population vector of granule cell activity.
1 Introduction
Generating movements inherently involves producing appropriately timed contractions in the relevant muscle groups. Accordingly, an important aspect of motor control is the timing of motor commands in relation to internal (i.e., proprioceptive cues) and external stimuli. One clear and experimentally tractable example of the timing of responses with respect to an external stimulus is Pavlovian conditioning of eyelid responses. Learned responses are promoted in this paradigm by paired presentation of a cue or conditioned stimulus (CS) and a reinforcing unconditioned stimulus (US). For example, the presentation of a tone (the CS) is reinforced by the copresentation of a puff of air directed at the eye (the US). This promotes the acquisition of learned eyelid responses that are elicited by the CS. These learned eyelid movements are delayed so that they peak near the onset of the potentially harmful US (Schneiderman and Gormezano 1964; Gormezano et al. 1983). Previous studies have shown that the timing of the eyelid response is learned and that the underlying neural mechanism involves temporal discrimination during the CS (Mauk and Ruiz 1992). Since appropriately timed responses can be generated for CS-US intervals between 80 and 2000 msec, the neural mechanism appears to be capable of temporal discrimination in this range. Very little is known about how and where the nervous system encodes time. There is evidence that neurons can use axonal delays to detect temporal intervals. For example, axonal conduction delays appear to contribute to the detection of interaural time delays (Carr and Konishi 1988; Overholt et al. 1992). Furthermore, theoretical work (Koch et al. 1983) has suggested that dendritic conduction delays could contribute to the detection of temporal delays important for direction selectivity in the visual system. In both these instances, however, the relevant time intervals are below tens of milliseconds. Braitenberg (1967) suggested that the parallel fibers in the cerebellum could function as delay lines that would underlie the timing of movements. Given the conduction velocity of parallel fibers of approximately 0.2 mm/msec it is unlikely that such a mechanism could underlie timing in the tens of milliseconds to second range (Freeman 1969). Other models have been presented in which intervals above tenths of seconds could be stored by a group of oscillating neurons with different frequencies (Miall 1989; Church and Broadbent 1991; Fujita 1982; Gluck et al. 1990). As of yet, however, no such population of neurons has been described.
It has been suggested that the cerebellum may play an important role in the timing of movements (Braitenberg 1967; Freeman 1969; Eccles 1973), in addition to being an important component for the learning of movements (Marr 1969; Albus 1971; Fujita 1982; Lisberger 1988). Indeed, it has been shown that while cerebellar cortex lesions do not abolish the conditioned response, the timing of the responses is disrupted (McCormick and Thompson 1984; Perrett et al. 1993), whereas numerous studies indicate that output from the cerebellum via the cerebellar nuclei is required for Pavlovian eyelid conditioning (Thompson 1986; Yeo 1991; however, see Welsh and Harvey 1989). One class of cerebellar afferents (mossy fibers) appears to convey to the cerebellum the presentation of the CS and a second class of cerebellar afferents (climbing fibers) appears to convey the presentation of the US (e.g., see Thompson 1986). These data suggest that (1) Pavlovian conditioning is a relatively simple, experimentally tractable paradigm for the study of the input/output properties of the cerebellum, (2) the cerebellum is necessary for the appropriate timing of conditioned movements, and thus, (3) Pavlovian conditioning is particularly well suited for the study of cerebellar mechanisms that mediate the ability to discriminate time with respect to the onset of an internal or external stimulus. The purpose of this paper is to use a neural network model to test a specific hypothesis suggesting how the known circuitry of the cerebellum could make temporal discriminations and mediate the timing of the conditioned response. The model tested here is based on the known circuitry of the cerebellum and does not utilize delay lines, assume arrays of elements with different time constants, or assume arrays of elements that oscillate at different frequencies, but instead shows how the dynamics of cerebellar circuitry could encode time with the population vector of granule cell activity.

2 Structure of the Model
2.1 Cerebellar Circuitry. In comparison to most brain regions, the neural circuitry, cell ratios, and synaptic convergence/divergence ratios of the cerebellum are fairly well established (Eccles et al. 1967; Palkovits et al. 1971; Ito 1984). The principal six cell types are the granule (Gr), Golgi (Go), Purkinje (PC), basket, stellate, and cells of the cerebellar nuclei. The two primary inputs to the cerebellum are conveyed by mossy fibers (MF) and climbing fibers (CF). Figure 1A illustrates the known connectivity of these cells.
Figure 1: Synaptic organization of the cerebellum. (A) A schematic diagram of the known synaptic connections in the cerebellum. Shown in bold and with solid lines are the components incorporated into the present computer model. (B) A schematic representation of the connectivity in the present model. Small representations of the granule (Gr) and Golgi (Go) layers are shown; there were 10⁴ and 900 Gr and Go cells, respectively, in the simulations. The 500 MFs and single PC are omitted for clarity. The arrows and shaded pyramids illustrate the spans or regions to which the cells are eligible to make a synaptic contact. Within the spans connections are made with a uniform distribution. The white cells in each span exemplify cells that receive a synaptic input from the presynaptic cell. Thus, the shape of the spans was designed to reflect the geometry of the projections of each cell type, whereas the number of cells within the span that actually received a connection was a reflection of the convergence and divergence ratios of the synaptic connections.
place are encoded by the population vector of Gr cell activity, (2) the CFs convey error signals indicating that the response requires modification, and (3) this CF error signal modifies active Gr → PC synapses such that subsequent motor performance in that context is improved. Marr also
suggested that the Go cell negative feedback stabilizes the amount of Gr cell activity, which could maximize the discriminability of the contexts.
2.3 Hypothesis. The hypothesis we test here is an elaboration of the Marr/Albus scheme which suggests that due to (1) the dynamic interactions between Gr and Go cell populations and (2) the complex geometry of their interconnections, the population vector of granule cell activity exhibits time-variant trajectories that permit both stimulus and temporal discriminations (see Mauk and Donegan 1991). Thus, the population vector of Gr cell activity encodes both the particular MF input pattern and the time since its onset. A particular periodic MF input pattern will activate a subset of Gr cells, which will activate a subset of Go cells; these in turn will inhibit a second, partially overlapping subset of Gr cells. The Gr → Go → Gr negative feedback loop will create a dynamic, nonperiodic population vector of Gr cell activity even when driven by a periodic pattern of MF input. Thus, the population vector of Gr cell activity would encode not just the constellation of stimuli impinging on the organism, but also the time since the onset of the stimuli. In a manner consistent with the Marr/Albus theories, a given subset of Gr cells, and thus a particular interval, can be stored by changing the strength of the Gr → PC connections of active Gr cells. The retrieval of the interval is expressed as a change in the activity level of the PC.
2.4 Neural Network. The neural network consisted of 10⁴ Gr cells, 900 Go cells, 500 MF inputs, and one PC. The architecture is illustrated schematically in Figure 1B; the 500 MFs and the single PC are omitted for clarity. Each Gr cell received excitatory synaptic inputs from three MFs and inhibitory inputs from three Go cells. Each Go cell received excitatory inputs from 100 Gr cells and 20 MFs. The PC received inputs from all 10⁴ Gr cells. In scaling the network to computationally feasible dimensions, it is not possible to maintain both the empirically observed Gr/Go cell ratios and the convergence/divergence ratios. The biological Gr/Go cell ratio is approximately 5000/1, and the Gr convergence/divergence ratio is approximately 0.4 (4/10, i.e., each Gr cell receives input from 4 Go cells and sends contacts to 5-10 Go cells; Eccles 1973; Palkovits et al. 1971). Thus, a network with 10⁴ Gr cells should have only 2 Go cells, yet each of the Gr cells should contact approximately 5-10 Go cells. The underlying assumption we employed in making the compromise between cell ratios and convergence/divergence ratios was that the latter is more important. Thus, the convergence/divergence ratios for the Gr and Go cells were maintained within an order of magnitude of the observed experimental values. In the model the Gr/Go ratio was 11.1 (10⁴/900) and the Gr convergence/divergence ratio was 0.33 (3/9).
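For illustration, this wiring can be sketched as follows. The geometric spans of Figure 1B are ignored here for brevity (a simplifying assumption), as is the possibility of duplicate draws for the Gr inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
N_GR, N_GO, N_MF = 10_000, 900, 500

# Each Gr cell: 3 excitatory MF inputs and 3 inhibitory Go inputs
gr_from_mf = rng.integers(N_MF, size=(N_GR, 3))
gr_from_go = rng.integers(N_GO, size=(N_GR, 3))

# Each Go cell: 100 Gr inputs and 20 MF inputs (without replacement)
go_from_gr = np.stack([rng.choice(N_GR, 100, replace=False)
                       for _ in range(N_GO)])
go_from_mf = np.stack([rng.choice(N_MF, 20, replace=False)
                       for _ in range(N_GO)])
```

With these index arrays, the resulting Gr → Go divergence is 9 on average (900 × 100 / 10⁴), matching the 0.33 convergence/divergence ratio stated above.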
The Gr and Go cells were simulated as modified integrate-and-fire elements. For example, the voltage of each Go cell (V_i^Go) was determined by the leak and synaptic conductances:

dV_i^Go/dt = g^Go:leak (E_leak − V_i^Go) + g_i^Go:MF (E_ex − V_i^Go) + g_i^Go:Gr (E_ex − V_i^Go)    (2.1)

A spike (S_i^Go) was generated if threshold was reached:

S_i^Go = 1 if V_i^Go ≥ Thr_i^Go; S_i^Go = 0 otherwise    (2.2)
where Thr_i^Go is the spike threshold for Go cell i. Synaptic currents were simulated with an instantaneous rise and an exponential decay. All inputs of a particular type are summed into a single current that saturates at 1.0 and decays at a rate set by τ. Thus the MF → Go cell synaptic current is determined by

dg_i^Go:MF/dt = Σ_j S_j^MF W^Go:MF (1 − g_i^Go:MF) − g_i^Go:MF / τ^Go:MF    (2.3)
where S_j^{MF} represents a spike in MF j, W^{Go:MF} is the synaptic weight of the MF synapses, and τ^{Go:MF} is the decay time constant. A relative refractory period and spike accommodation were simulated by increasing threshold to MaxThr^{Go} after each spike, with a subsequent exponential decay to the initial value with a time constant of τ^{Go:Thr}. The dynamics of each Gr cell was controlled by similar equations, except that each Gr cell received an excitatory and an inhibitory synaptic current, g^{Gr:MF} and g^{Gr:Go}, respectively. The values and definitions of the constants are given in Table 1. The single PC received input from all Gr cells in the network. Initially all Gr cells were connected to the PC with the same weight. During the first presentation of the stimulus the weights of the Gr → PC connections for all Gr cells active within a target window were decreased. This decrease in synaptic weight during a particular time window simulates the long-term depression (LTD) of the parallel fiber to PC connection produced by coactivation of parallel and climbing fibers (Ito et al. 1982; Linden et al. 1991). The ability of the network to generate an appropriately timed response was then tested by monitoring the PC activity during a second presentation of the stimulus. The voltage of the PC represented the weighted, summed activity of all the Gr cells in the network, with a time constant of 2.5 msec. In certain versions of the network a conduction delay was incorporated in the Gr → Go connection. Since the delay did not affect the results in a significant manner, and to stress that timing in the network emerges from the dynamics of the Gr → Go → Gr feedback loop, the simulations presented here did not include conduction delays.
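To make the update scheme concrete, here is a minimal sketch (ours, in Python/NumPy) of one Euler step of the Go cell dynamics, equations 2.1-2.3, using the constants from Table 1 below. The 1-msec step size and the absence of an explicit post-spike voltage reset are our assumptions, since the paper does not state them:

import numpy as np

DT = 1.0                                # integration step in msec (assumed)
E_LEAK, E_EX = -60.0, 0.0               # equilibrium potentials (Table 1)
G_LEAK = 0.07                           # Go leak conductance
W_GO_MF, TAU_GO_MF = 0.007, 2.87        # MF -> Go weight and decay constant
W_GO_GR, TAU_GO_GR = 0.008, 2.86        # Gr -> Go weight and decay constant
THR_GO, MAX_THR_GO, TAU_THR = -35.0, -25.0, 2.0

def go_step(v, g_mf, g_gr, thr, n_mf_spikes, n_gr_spikes):
    """One Euler step for an array of Go cells; n_mf_spikes / n_gr_spikes
    give the number of presynaptic spikes reaching each cell this bin."""
    # Eq. 2.3: instantaneous rise, exponential decay, saturation at 1.0.
    g_mf += DT * (n_mf_spikes * W_GO_MF * (1.0 - g_mf) - g_mf / TAU_GO_MF)
    g_gr += DT * (n_gr_spikes * W_GO_GR * (1.0 - g_gr) - g_gr / TAU_GO_GR)
    # Eq. 2.1: leaky integration of the two excitatory synaptic currents.
    v += DT * (G_LEAK * (E_LEAK - v) + (g_mf + g_gr) * (E_EX - v))
    # Eq. 2.2: spike wherever voltage reaches threshold.
    spikes = v >= thr
    # Refractoriness/accommodation: threshold jumps to MaxThr after a spike
    # and decays exponentially back toward its minimum value.
    thr = np.where(spikes, MAX_THR_GO, thr + DT * (THR_GO - thr) / TAU_THR)
    return v, g_mf, g_gr, thr, spikes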
Table 1: Definitions and Values of Constants.^a

Decay time constants
  τ^{Gr:MF}     MF → Gr synaptic conductance     2.86
  τ^{Gr:Go}     Go → Gr synaptic conductance     5.0
  τ^{Go:MF}     MF → Go synaptic conductance     2.87
  τ^{Go:Gr}     Gr → Go synaptic conductance     2.86
  τ^{Gr:Thr}    Gr threshold decay               1.7
  τ^{Go:Thr}    Go threshold decay               2.0

Cellular parameters
  E_leak        Leak equilibrium potential       -60
  E_ex          EPSP equilibrium potential       0
  E_inh         IPSP equilibrium potential       -80
  g^{Gr:leak}   Leak conductance for Gr          .07
  g^{Go:leak}   Leak conductance for Go          .07

Synaptic parameters
  W^{Gr:MF}     Synaptic weight of MF → Gr       0.15
  W^{Gr:Go}     Synaptic weight of Go → Gr       0.15
  W^{Go:MF}     Synaptic weight of MF → Go       0.007
  W^{Go:Gr}     Synaptic weight of Gr → Go       0.008

Threshold parameters
  Thr^{Gr}      Gr cell minimum threshold        -40
  Thr^{Go}      Go cell minimum threshold        -35
  MaxThr^{Gr}   Maximum Gr threshold             -35
  MaxThr^{Go}   Maximum Go threshold             -25

^a In the table and equations the first superscript refers to the postsynaptic cell that the constant or variable applies to, whereas the second superscript refers to the presynaptic cell or to a particular variable.
Input to the network was conveyed by 500 MFs. In the simulations described here the presentation of a CS was represented by activation of 20% of the MFs at a frequency of 100 Hz. Driving the network with a constant frequency input ensured that any timing exhibited by the network was a result of the intrinsic dynamics of the circuit.
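A sketch of this input stage (ours; 1-msec time bins and a 100 Hz background rate are assumptions, since only the CS rate is stated explicitly):

import numpy as np

rng = np.random.default_rng(1)
N_MF, T0, T1 = 500, -50, 500
t = np.arange(T0, T1)                       # time axis in msec
spikes = np.zeros((N_MF, t.size), dtype=bool)

background = rng.choice(N_MF, int(0.05 * N_MF), replace=False)
cs_fibers = rng.choice(np.setdiff1d(np.arange(N_MF), background),
                       int(0.20 * N_MF), replace=False)

PERIOD = 10                                 # 100 Hz = one spike per 10 msec
for f in background:                        # background drive from t = -45
    spikes[f, (t >= -45) & ((t - T0) % PERIOD == rng.integers(PERIOD))] = True
for f in cs_fibers:                         # CS drive from t = 0, random phase
    spikes[f, (t >= 0) & ((t - T0) % PERIOD == rng.integers(PERIOD))] = True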
3 Simulations

3.1 Timing. Figure 2 illustrates the ability of the network to learn a temporal discrimination when driven by the MF input. Following a baseline of low background activity, a subset of mossy fibers was activated as a hypothetical CS presented to the network (total MF activity is shown in the upper panel). The lower panel shows the output of the network as represented by the activity of a PC during two trials. Activity in the PC is proportional to the weighted sum of active Gr cells.
Figure 2: Results from a typical simulation of a Pavlovian conditioning trial in which temporal coding is expressed in the PC activity. Shown along the top is the percent of MFs active in each time bin, with the onset of the simulated CS marked by the increase in MF activity at time zero. During each trace the network is initially silent due to the absence of MF activity. To provide "background" activity, 5% of the MFs were turned on at t = -45, and at CS onset (t = 0) an additional 20% of the MFs were activated at 100 Hz (the phase of each MF action potential was random). The lighter continuous trace represents PC activity in the initial training trial, in which a climbing fiber input was simulated by decreasing to zero the strength of the Gr → PC connections of cells that spiked three or more times between t = 200 and 205 msec. The darker continuous trace shows a subsequent test trial in which no US was presented. The retention of the temporal interval between CS and US is expressed as a decrease in PC activity whose onset precedes, and whose peak occurs near, the time at which the US was previously presented. Finally, a raster plot showing the activity of 600 of the Gr cells as a function of time during the trial is superimposed.
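The training rule in the caption (setting to zero the Gr → PC weights of cells that spiked at least three times in the US window) and the leaky PC readout can be sketched as follows; this is our code, and the function and variable names are hypothetical:

import numpy as np

def apply_ltd(w_gr_pc, gr_spikes, t, us_window=(200, 205), min_spikes=3):
    """Climbing-fiber-gated LTD: zero the weights of Gr cells that spiked
    min_spikes or more times inside the US window (as in Figure 2)."""
    in_window = (t >= us_window[0]) & (t < us_window[1])
    tagged = gr_spikes[:, in_window].sum(axis=1) >= min_spikes
    w = w_gr_pc.copy()
    w[tagged] = 0.0
    return w

def pc_trace(w_gr_pc, gr_spikes, dt=1.0, tau=2.5):
    """PC voltage as a leaky integral of the weighted Gr spike count
    (2.5-msec time constant, as stated in the Methods)."""
    drive = w_gr_pc @ gr_spikes              # (n_gr,) @ (n_gr, n_t) -> (n_t,)
    v = np.zeros(drive.size)
    for i in range(1, drive.size):
        v[i] = v[i - 1] + dt * (drive[i] - v[i - 1] / tau)
    return v

Comparing pc_trace with the initial weights against pc_trace after apply_ltd on a test trial should show the learned dip in PC activity near the trained time.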
Initially all weights were equal. The initial peak corresponds to the activation of Gr cells in the presence of very little Go cell inhibition. The Gr → Go → Gr feedback establishes a more or less constant amount of total Gr cell activity. Note that at t = 0 a 5-fold increase in MF activity did not increase PC activity. Although at any point in time there is approximately a constant number of Gr cells active, the subpopulation of active cells is distinct. This is demonstrated in the raster plot of a sample of Gr cells superimposed on Figure 2. In the first "training" trial (lighter trace) a US was simulated at t = 200 by decreasing the strength of the Gr → PC connections for Gr cells that were active between t = 200 and t = 205. These changes are reflected in the decrease in PC activity. In a subsequent "test" trial in which no US was presented (bold trace), the retention of the previous training is expressed by the decrease in PC activity that peaks near t = 200. This appropriately timed decrease in PC activity occurs because the subset of Gr cells active around t = 200 was (1) identical to the subset active at t = 200 on the training trial and (2) different from the subsets active at other post-CS intervals. Note that there is also an overall decrease in PC activity after training. This decrease occurs because the subset active at t = 200 (the subset reinforced by the US) has a low but nonzero percentage of elements in common with the subsets of Gr cells active at other times. If we consider the population of Gr cells as a binary vector (spike/no spike) with 10^4 dimensions, then each time step will define a point in Gr cell state space. Throughout a stimulus these population vectors will describe a trajectory in this Gr cell state space, where different times during the stimulus are encoded by different points along the trajectory. Thus, the degree of overlap between points along the trajectory (or the average distance in Gr cell state space) determines the signal/noise ratio.

3.2 Ability to Store Multiple Intervals. Simulations similar to those shown in Figure 2 demonstrate that the network is capable of storing multiple intervals in response to the same or different stimuli. Figure 3 illustrates the ability of the network to store two different intervals in response to two different stimuli. Each trace represents the response of the network after training with a different input stimulus (see figure legend). The first trace (light line) represents the first stimulus, which was trained with the US presented 150 msec following the onset of the CS; the second trace (dark line) represents the second stimulus, which was trained with a CS-US interval of 350 msec.

3.3 Sensitivity to Noise. A crucial aspect of any realistic biological model is its ability to perform in the presence of noisy inputs. In the context of the present simulations, noise refers to trial-to-trial variation in the initial (pre-CS) state of the model and in the MFs activated by the CS, as well as variability in various properties of the individual elements, such as threshold, transmitter release, etc.
Figure 3: Performing multiple intervals. Simulations similar to those shown in Figure 2, demonstrating that the network is capable of both stimulus and temporal discriminations. The two PC traces represent post-training responses elicited by different CSs (different subsets of MFs were activated for each), each of which was previously reinforced by the US, but at different intervals. The light trace shows the response elicited by the CS previously reinforced at 150 msec; the dark trace shows the response elicited by the CS previously reinforced at 350 msec.
This noise has the potentially deleterious effect of reducing the trial-to-trial consistency of the trajectory of the Gr population vector elicited by CS presentations. For example, if each presentation of the same CS promoted a completely different trajectory of the Gr population vector, then the ability to learn an appropriately timed response would be abolished. In the simulations presented thus far no noise was present; the pre-CS state of the network and the CS-evoked MF inputs were identical for each CS presentation. However, dynamic systems characteristically show sensitivity to noise: the introduction of small errors leads to increasing divergence of the trajectory. As a first step toward addressing this issue we have analyzed how the injection of small amounts of noise affects the ability of the network to reproduce a CS-evoked trajectory in the Gr population vector (Fig. 4). Noise was introduced by altering the activity of five Gr cells at t = 150 during a CS that had been trained previously. As shown in Figure 4, the introduction of this noise attenuated the PC response.
Figure 4: Sensitivity of the network to noise. The lower panel shows PC activity during three separate test trials. Prior to each test trial the network had been trained using a reinforced trial in which the US was presented at 125, 175, or 225 msec, respectively, for the three traces. During the test trials, noise was injected at the time shown (t = 150) in the form of spurious activity in five randomly chosen Gr cells. The top panel shows the overlap between the Gr cell population vectors in the presence and absence of noise. Within approximately 100 msec of the injection of noise (t = 250) the overlap decreases to chance. The effect of this decrease on the retention of previous training is illustrated in the PC traces. As the overlap in population vectors of Gr activity between training and test trials decreases, the ability to generate the response deteriorates. Overlap is defined as 1 - (Hamming distance ÷ number of active Gr cells), where the Hamming distance is the number of mismatches between the two population vectors.
This attenuation increased with the time between noise injection and the response. This occurs because the spurious Gr activity changes the trajectory of the network in Gr state space. Since the small initial change leads to further changes in the trajectory, the divergence from the normal trajectory increases with time. With small changes in the trajectory the learned PC response is attenuated, but eventually the divergence is sufficient to eliminate the response completely. The rate at which trajectories are altered
by the introduction of noise can be illustrated by plotting the overlap of Gr cell states during two CS presentations, one with noise and one without (Fig. 4). Before the injection of the noise the overlap between the two Gr cell population vectors is 100%. With the injection of noise the overlap decreases within 100 msec to a baseline level of 10-15%. At this point the ability to generate timed responses is lost. Note that the speed with which the trajectories diverge following noise injection (that is, the slope of the overlap function) can be used as a measure of noise sensitivity (see next section).
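The overlap measure defined in the Figure 4 caption, and the slope-based noise-sensitivity index used in the next section, can be computed as follows (our sketch; the caption does not state which of the two vectors supplies the active-cell count, so we take the noise-free one):

import numpy as np

def overlap(pop_clean, pop_noisy):
    """1 - (Hamming distance / number of active Gr cells), per Figure 4."""
    hamming = np.count_nonzero(pop_clean != pop_noisy)
    n_active = max(np.count_nonzero(pop_clean), 1)
    return 1.0 - hamming / n_active

def noise_sensitivity(overlap_series, dt=1.0):
    """Absolute slope of the overlap function; its reciprocal is the
    'resistance to noise' plotted in Figure 5."""
    return np.abs(np.diff(overlap_series) / dt).max()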
3.4 Effects of the MF → Go Connection on Timing and Sensitivity to Noise. The ability of the network to make temporal discriminations was fairly insensitive to most parameters. The parameters analyzed most thoroughly were the connection strengths from the MFs to the Gr cells (W^{Gr:MF}) and the reciprocal connection strengths between the Gr and Go cells (W^{Go:Gr}, W^{Gr:Go}). Within reasonable boundaries timing was not dramatically affected by these parameters, although changes in the signal/noise ratio and in the absolute number of cells active in the network were observed. We were particularly interested in the MF to Go cell connection (W^{Go:MF}), since neither in our model nor in previous cerebellar models is the functional consequence of this connection entirely clear. To obtain insights into the possible function of this connection we performed a parametric analysis of the effect of the strength of the MF → Go connection on temporal discrimination and on sensitivity to noise. Figure 5 illustrates the effect of different values of the MF → Go connection on temporal discrimination and sensitivity to noise. We define temporal discrimination as the ability to generate a temporally specific response to the CS. We observed that temporal discrimination is best with low W^{Go:MF} or when the MF → Go connection is omitted (W^{Go:MF} = 0.0). With higher W^{Go:MF} values temporal discrimination deteriorates. Conversely, the network is more resistant to noise injected into the Gr cells with higher W^{Go:MF} values. At low W^{Go:MF} the dynamics of the network are determined mostly by the Gr → Go → Gr loop, resulting in an unstable but highly variable Gr cell population trajectory (robust temporal discrimination but high sensitivity to noise). At high values the periodic MF input can entrain the network by dominating the input to the Go cells, resulting in little or no temporal discrimination but better resistance (low sensitivity) to Gr cell noise. Thus the MF → Go cell connection may play a role in making the dynamics of the network more resistant to Gr cell noise, and the effectiveness of the MF → Go cell connection in the cerebellum may reflect a compromise between temporal discrimination and resistance to noise. It should be stressed that this conclusion pertains to instances in which the noise is present in the Gr cell activity. It remains to be determined whether the same holds true in the presence of noisy MF inputs.
Figure 5: The influence of the MF → Go synaptic strength on the sensitivity of the network to noise and on a measure of the ability of the network to perform temporal discriminations. The injection of noise was identical to that in the simulations shown in Figure 4. With low MF → Go synaptic strength (or no connection, W^{Go:MF} = 0.0), the quality of temporal discrimination or timing was relatively high but resistance to noise was relatively low. As the MF → Go connection was increased, noise resistance increased, but temporal discrimination decreased. Resistance to noise was defined as the reciprocal of the slope (absolute value) of the overlap function shown in Figure 4. Thus, small values reflect a sharp slope of the overlap function, indicating sensitivity to noise. The timing index was defined as the amplitude of the learned decrease in PC activity minus the decrease in the baseline activity following learning. This difference was then normalized to the prelearning baseline activity. Thus the timing index ranges from 0 (no temporal discrimination) to 1 (maximal amplitude of the timed learned response with no change in baseline).
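A sketch of the timing index just defined (ours; the choice of baseline window is an assumption, since the caption does not specify one):

import numpy as np

def timing_index(pc_pre, pc_post, t, t_us, baseline=(-40, 0)):
    """(learned dip amplitude - post-learning baseline drop) / pre baseline."""
    base = (t >= baseline[0]) & (t < baseline[1])
    base_pre, base_post = pc_pre[base].mean(), pc_post[base].mean()
    dip = base_pre - pc_post[np.argmin(np.abs(t - t_us))]
    return (dip - (base_pre - base_post)) / base_pre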
4 Discussion

We have demonstrated that an artificial neural network whose circuitry is based on the synaptic organization of the cerebellar cortex is capable of temporal discriminations and the generation of appropriately timed responses over intervals of tens of milliseconds to seconds. In this network time is encoded in the population vector of Gr cell activity: during
a stimulus the subset of granule cells that is active varies in a characteristic manner. To use Gr cell dynamics to encode particular time windows, there has to be an efficient way to select or tag the subpopulation of Gr cells that encodes a particular time window. It turns out that LTD gated by climbing fiber inputs to PCs provides an effective way to accomplish this. However, it is important to note that the timing mechanism we propose does not necessarily require LTD; any plasticity rule based on temporally correlated climbing fiber and Gr cell inputs would be effective. With LTD, storage of a particular interval is induced by decreasing the Gr → PC connection strength during the desired time window. The retrieval of the interval is expressed as a decrease in the activity level of the PC. Any particular time window can be encoded in this manner because the population of active Gr cells is dynamic. Chapeau-Blondeau and Chauvet (1991) also proposed a model in which time could be encoded by the population of active Gr cells, although their model stresses the importance of the parallel fibers functioning as delay lines.

An interesting feature that emerged from the model is the anticipatory response; that is, after training the response begins before the US onset and peaks during the US. This type of anticipatory response is observed in eyelid conditioning of the rabbit (e.g., Mauk and Ruiz 1992). In the model the anticipatory response occurs because the Gr cells that code for the US interval tend to have a higher probability of being active shortly before and after the US interval. In other words, as the Gr cell population trajectory approaches the time of US onset, the anticipatory response is generated.

The major weakness of the model is its sensitivity to noise. Spurious activity in a few Gr cells can degrade the trajectory of the network such that within 100-200 msec the population dynamics is altered to a degree that the timing signal is lost. Since timing is generated by the dynamics of the Gr → Go → Gr loop, and encoded in the population of active Gr cells, activity in spurious Gr cells is amplified throughout the network. We did not perform a formal mathematical analysis to determine if the network was chaotic, although the sensitivity of the network to noise suggests that this may be the case. It has been proposed (Mauk and Donegan 1991) that sensitivity to noise could be responsible for the dependence of Pavlovian conditioning on the interstimulus interval (ISI); conditioning requires that CS onset precede the US by at least 80-100 msec, but not by more than 2-3 sec (the so-called ISI function). Thus acquisition could require trial-to-trial consistency in the subset of Gr → PC synapses modified by the US, and the ISI function could reflect a within-trial variation of the across-trials CS-evoked trajectory. The present simulations demonstrate that this notion might be feasible. Noise injected early in a CS leads to a time-dependent increase in the variation of the CS-evoked Gr cell trajectories. This increase is consistent with the decrease in conditioning as the ISI increases.
It is clear that the network presented here is too sensitive to noise, since changes in a small percentage of Gr cells lead to rapid divergence of the Gr cell trajectory and thus loss of timing. There are several factors that may contribute to the extreme sensitivity of this network to noise. (1) We simulated timing under the worst-case condition in which the network is driven by MF inputs that convey no temporal information. It seems likely that the MFs may convey a small degree of temporal coding; perhaps some MFs fire tonically to a stimulus while others fire phasically at stimulus onset and offset. (2) The parameters used in this model favored robust temporal discrimination at the possible expense of resistance to noise. As shown in Figure 5, with increases in the MF → Go cell connection strength it is possible to improve the resistance to noise and maintain some degree of temporal discrimination. (3) We simulated "learning" with a single trial; it may be that with multiple trials the tradeoff between resistance to noise and temporal discrimination would be more forgiving. (4) Aspects of cerebellar physiology not incorporated into the network, such as the correct Gr/Go cell ratios or cellular compartmentalization, may be important. Indeed, preliminary simulations in which the excitatory/inhibitory interactions of the Gr cells were compartmentalized in each dendrite improved the resistance of the network to noise.

Several neural-like models have been proposed to account for the ISI function and for the learned timing of conditioned responses. For example, Moore and colleagues (Moore et al. 1986, 1989) presented a neural model that used an eligibility period to obtain the ISI function and tapped delay lines to obtain response timing. Grossberg and Schmajuk (1989) proposed a model using an array of CS-activated elements with different time constants to obtain response timing. Also, Gluck et al. (1990) obtained response timing with an array of CS-activated elements that oscillate at different frequencies and phases. Clearly, from a computational point of view it is not difficult to develop hypothetical systems that generate time-varying CS representations. The important questions concerning each candidate mechanism are its overall biological plausibility in general and its ability to map specifically onto the anatomy and physiology of particular brain regions. We suggest that the present approach differs from previous models in that there are no parameters or hypothetical mechanisms included that specifically code for timing or the ISI function. Instead, we have attempted to capture, qualitatively if not quantitatively, the basic properties of the organization of certain parts of cerebellar cortex. Our simulations show that behavioral properties such as response timing and the ISI function can emerge from this organization.

The present model gives rise to several expectations regarding the activity of granule cells that would be elicited by the presentation of a particular stimulus: (1) the presentation of a stimulus should elicit activity in a subset of granule cells, (2) the activity in each cell should follow the onset of the stimulus by a fairly consistent interval, and (3) assuming
recordings from a sufficient number of cells, all conditionable intervals should be represented by activity in at least some cells. Assuming again that recordings from a large number of granule cells could be obtained, the possibility that Pavlovian eyelid conditioning occurs only for a limited range of interstimulus intervals can be explained by within-trials variation in the across-trials consistency of the population vector of active granule cells (Mauk and Donegan 1991); this suggests an additional prediction: (4) the trial-to-trial consistency in the stimulus-evoked activity in the granule cells should vary throughout the stimulus in a manner that parallels the ability of different interstimulus intervals to support conditioning. In particular, as the duration of the stimulus (e.g., the ISI) increases, the trial-to-trial consistency in the stimulus-evoked granule cell activity should decrease. Thus, given recordings from a sufficiently large subset of granule cells during the presentation of a stimulus, the population vector of the activity of those cells should provide for temporal coding throughout the stimulus, and the consistency of this temporal code should parallel the effectiveness of each interval to support Pavlovian eyelid conditioning. While these predictions of the present model are quite concrete, we acknowledge the difficult and time-consuming nature of the experiments. This further highlights the value of combining biologically inspired computer simulations with empirical approaches to investigate the information-processing mechanisms of the nervous system.
Acknowledgments

This research was supported by NIMH Fellowship F31 MH-09895 (DVB), FAPESP Fellowship 91/5111-4 (DVB), NIMH Grant MH46904-02, and Scholars Awards from the National Down Syndrome Society (MDM) and the McKnight Foundation (MDM). We would like to thank John Byrne, Len Cleary, Carlos Tomaz, and Vincent Buonomano for providing the necessary conditions for this research, and Garrett T. Kenyon for helpful comments.
References

Albus, J. S. 1971. A theory of cerebellar function. Math. Biosci. 10, 25-61.
Braitenberg, V. 1967. Is the cerebellar cortex a biological clock in the millisecond range? Prog. Brain Res. 25, 334-346.
Carr, C. E., and Konishi, M. 1988. Axonal delay lines for time measurement in the owl's brainstem. Proc. Natl. Acad. Sci. U.S.A. 85, 8311-8316.
Chapeau-Blondeau, F., and Chauvet, G. 1991. A neural network model of the cerebellar cortex performing dynamic associations. Biol. Cybern. 65, 267-279.
Church, R. M., and Broadbent, H. A. 1991. A connectionist model of timing. In Neural Network Models of Conditioning and Action, M. L. Commons, S. Grossberg, and J. E. R. Staddon, eds., pp. 225-240. Erlbaum, Hillsdale, NJ.
Eccles, J. C. 1973. The cerebellum as a computer: Patterns in time and space. J. Physiol. 229, 1-32.
Eccles, J. C., Ito, M., and Szentágothai, J. 1967. The Cerebellum as a Neuronal Machine. Springer-Verlag, New York.
Freeman, J. A. 1969. The cerebellum as a timing device: An experimental study in the frog. In Neurobiology of Cerebellar Evolution and Development, R. Llinas, ed., pp. 397-420. American Medical Association, Chicago, IL.
Fujita, M. 1982. An adaptive filter model of the cerebellum. Biol. Cybern. 45, 195-206.
Gluck, M. A., Reifsnider, E. S., and Thompson, R. F. 1990. Adaptive signal processing and the cerebellum: Models of classical conditioning and VOR adaptation. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 131-186. Erlbaum, Hillsdale, NJ.
Gormezano, I., Kehoe, E. J., and Marshall, B. S. 1983. Twenty years of classical conditioning with the rabbit. Prog. Psychobiol. Physiol. Psychol. 10, 197-275.
Grossberg, S., and Schmajuk, N. A. 1989. Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks 2, 79-102.
Ito, M. 1984. The Cerebellum and Neuronal Control. Raven Press, New York.
Ito, M., Sakurai, M., and Tongroach, P. 1982. Climbing fibre-induced depression of both mossy fibre responsiveness and glutamate sensitivity of cerebellar Purkinje cells. J. Physiol. 324, 113-134.
Koch, C., Poggio, T., and Torre, V. 1983. Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. Proc. Natl. Acad. Sci. U.S.A. 80, 2799-2802.
Linden, D. J., Dickinson, M. H., Smeyne, M., and Connor, J. A. 1991. A long-term depression of AMPA currents in cultured cerebellar Purkinje neurons. Neuron 7, 81-89.
Lisberger, S. G. 1988. The neural basis for learning simple motor skills. Science 242, 728-735.
Marr, D. 1969. A theory of cerebellar cortex. J. Physiol. 202, 437-470.
Mauk, M. D., and Donegan, N. H. 1991. A model of eyelid conditioning based on the cerebellum. Neurosci. Abstr. 17, 869.
Mauk, M. D., and Ruiz, B. P. 1992. Learning-dependent timing of Pavlovian eyelid responses: Differential conditioning using multiple interstimulus intervals. Behav. Neurosci. 106, 666-681.
McCormick, D. A., and Thompson, R. F. 1984. Cerebellum: Essential involvement in the classically conditioned eyelid response. Science 223, 296-299.
Miall, C. 1989. The storage of time intervals using oscillating neurons. Neural Comp. 1, 359-371.
Moore, J. W., Desmond, J. E., Berthier, N. E., Blazis, D. E., Sutton, R. S., and Barto, A. G. 1986. Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neural firing, and interstimulus intervals. Behav. Brain Res. 21, 143-154.
Moore, J. W., Desmond, J. E., and Berthier, N. E. 1989. Adaptively timed conditioned responses and the cerebellum: A neural network approach. Biol. Cybern. 62, 17-28.
Overholt, E. M., Rubel, E. W., and Hyson, R. L. 1992. A circuit for coding interaural time differences in the chick brainstem. J. Neurosci. 12, 1698-1708.
Palkovits, M., Magyar, P., and Szentagothai, J. 1971. Quantitative histological analysis of cerebellum in cat. III. Structural organization of the molecular layer. Brain Res. 34, 1-18.
Perrett, S. P., Ruiz, B. P., and Mauk, M. D. 1993. Cerebellar cortex lesions disrupt learning-dependent timing of conditioned eyelid responses. J. Neurosci. 13(4), 1708-1718.
Schneiderman, N., and Gormezano, I. 1964. Conditioning of the nictitating membrane of the rabbit as a function of CS-US interval. J. Comp. Physiol. Psychol. 57, 188-195.
Thompson, R. F. 1986. The neurobiology of learning and memory. Science 233, 941-947.
Welsh, J. P., and Harvey, J. A. 1989. Cerebellar lesions and the nictitating membrane reflex: Performance deficits of the conditioned and unconditioned response. J. Neurosci. 9, 299-311.
Yeo, C. 1991. Cerebellum and classical conditioning of motor responses. Ann. N.Y. Acad. Sci. 627, 292-304.
Received October 15, 1992; accepted April 9, 1993.
Communicated by Peter Rowat
Computational Aspects of the Respiratory Pattern Generator

Allan Gottschalk
Department of Anesthesia, Center for Sleep and Respiratory Neurobiology, University of Pennsylvania, Philadelphia, PA 19104 USA
To help evaluate the hypothesis that the central respiratory rhythm is generated by a network of interacting neurons, a network model of respiratory rhythmogenesis is formulated and examined computationally. The neural elements of the network are driven by tonic inputs and generate a continuous variable representing firing rate. Each neural element in the model can be described by an activation time constant, an adaptation time constant, and a step nonlinearity. Initial network connectivity was based on an earlier proposed model of the central respiratory pattern generator. These connections were adjusted interactively until the model trajectories resembled those observed electrophysiologically. The properties of the resulting network were examined computationally by simulation, determination of the phase resetting behavior of the network oscillator, and examination of the localized eigenstructure of the network. These results demonstrate that the network model can account for a number of diverse physiological observations, and, thus, support the network hypothesis of respiratory rhythmogenesis.

1 Introduction

The mechanism that produces the respiratory rhythm remains incompletely characterized, and hypotheses range from the presence of respiratory pacemaker neurons (Feldman and Cleland 1982) to a network of neurons whose collective interactions produce the observed rhythm

Neural Computation 6, 56-68 (1994)
@ 1993 Massachusetts Institute of Technology
(Richter et al. 1986). Experimental evidence to support both points of view can be found in the literature (Suzue 1984; Smith and Feldman 1987; Smith et al. 1991; Cohen 1979; Richter 1982). Earlier computational efforts directed toward examination of the network hypothesis were hindered by the limited physiologic data available at the time. The current study exploits detailed recordings from respiratory neurons (Ballantyne and Richter 1984, 1986; Richter et al. 1975, 1979; Lindsey et al. 1987, 1989; Segers et al. 1985, 1987) to configure a network whose connections are supported to a large extent by both spike-triggered averaging (for references see Ezure 1990; Anders et al. 1991) and cross-correlation studies (Lindsey et al. 1987, 1989; Segers et al. 1985, 1987). The computational properties of this network are explored by simulation, determination of the phase resetting of the network oscillator, and examination of the localized eigenstructure of the network. In all cases the proposed network is consistent with known physiology and serves to unify a number of diverse observations.

The respiratory rhythm can be broken into three electrophysiologically distinct phases: inspiration, post-inspiration or stage-I expiration, and stage-II expiration (Richter et al. 1986). The outputs from intracellular recordings of the different types of brainstem neurons that are thought to mediate these events are schematized in Figure 1. We stress that the trajectories of Figure 1 are idealizations of the heterogeneous population of neurons that constitute each physiologically defined neuron class. Theoretical justification for the computational use of idealized neural elements to represent a larger population is given in part by MacGregor and Tajchman (1988). The objectives of the current study are to demonstrate that individual neurons, which are represented by a simple mathematical model and which are appropriately connected, can replicate the membrane trajectories of Figure 1 in significant detail, in addition to a variety of other complex behaviors that have been observed physiologically. Our approach, aspects of which are described in greater detail in Ogilvie et al. (1992), is similar to that of Botros and Bruce (1990); the differences lie in the nature of the connections and the types of behavior that are studied. We emphasize that the connectivities between the different neural elements in the model are based on physiologic observations of the membrane trajectories (Richter et al. 1986), and that many of these connections have been confirmed by other techniques (Ezure 1990; Lindsey et al. 1987, 1989; Segers et al. 1985, 1987).

2 Methods
The network was configured with neural elements that generate a continuous variable representing firing rate. Neural elements of this type were described in detail by Matsuoka (1985), whose equations have been modified below to include modulation of the reticular activating system
Figure 1: Schematic depicting the patterns of activity of five types of respiratory neurons as determined from intracellular recordings using previously described techniques (Richter et al. 1975). The darkened regions represent superthreshold spike activity, whereas membrane potential below threshold is represented by a single line. When compared with phrenic nerve activity, three electrophysiologically distinct phases of activity are present: inspiration, post-inspiration or stage-I expiration, and stage-II expiration.
Table 1: Network Parameters.^a

                              To
From        Early I   Ramp I   Late I   Post I   Exp     Tonic (s_i)   p-I effect on tonic (c_i)   Adaptation (b_i)
Early I        -       +0.2       -      -2.5    -1.5       1.0          +0.5                        +2.5
Ramp I         -       +0.5       -        -     -3.5       1.0          +0.5                        -0.7
Late I       -2.6      -0.1       -        -     -2.5       1.0          +2.5                        +2.2
Post I       -2.5      -1.0     -2.5       -     -1.0       1.0            -                         +2.5
Exp          -5.8        -      -6.8       -       -        1.1          +0.5                        +2.0

^a Parameterization of the model of Figure 2 and equations 2.1-2.4. As described in Methods, these parameters were selected interactively so that the resulting network trajectories (like those of Figure 3) reproduced the physiologically determined trajectories depicted in Figure 1. The effect of the post-inspiratory (p-I) neurons on the other neurons involves passing the firing rate of the p-I neurons (y_{p-I}) through a single-pole filter with a time constant γ = 0.1 (equation 2.4), multiplying by c_i, and subtracting the result from the tonic inputs (s_i) of the other neurons. Adaptation in cat respiratory neurons is thought to result from a calcium-activated potassium conductance (Richter et al. 1986) that is known to be present in such neurons. The adaptation time constant (T) represents the entire process of calcium influx, accumulation, and its activation of the calcium-activated potassium conductance. Our choice of T = 1000 msec is in the same range as the intracellular oscillations of calcium in Purkinje cells of cerebellar slices (Tank et al. 1988). The time constant of the activity variable (τ = 80 msec) is not intended to represent the membrane time constant of the neurons. Instead, it represents the rise in activity within the population of neurons that is represented here by a single neural element.
(RAS) by the post-inspiratory (p-I) neuron:

τ dx_i/dt + x_i = Σ_j a_{ij} y_j + (s_i - c_i w) - b_i z_i    (2.1)
T dz_i/dt + z_i = y_i    (2.2)
y_i = g(x_i)    (2.3)
γ dw/dt + w = y_{p-I}    (2.4)

where g(v) = max(0, v). Here, x_i is the activity of the ith neuron, which is analogous to membrane potential, y_i is the firing rate of the ith neuron, z_i is the adaptation variable of the ith neuron, and w is the output of the RAS, which is modulated by the output of the p-I neuron. The a_{ij} are not necessarily symmetric and indicate the strength of the connection from the jth to the ith element in Table 1. Here, a_{ij} < 0 indicates an inhibitory connection, a_{ij} > 0 indicates an excitatory connection, and self-inhibition or self-excitation is permitted when i = j. Each element receives a tonic input s_i, and uses b_i to weigh the influence of the corresponding adaptation variable z_i, such that for b_i > 0 adaptation occurs, and for b_i < 0 self-excitation results. c_i weighs the influence of the RAS on tonic activity of the ith neuron. τ is the activation time constant for the neural
elements, and represents the time course for the activation of a population of synergistic neurons. T is the time constant governing adaptation in this population, and γ is the time constant of the RAS.

Figure 2: Diagram of the neural connections used by the simulations. These connections appear in tabular form in Table 1. Here, inhibitory connections are indicated by solid bars and excitatory connections are indicated by small open circles. The interaction of the neural elements with the reticular activating system (RAS) represents a component of s_i to each neural element as indicated in Table 1.

We began with the network model proposed by Richter et al. (1986), which was based on data of the type shown in Figure 1. The network depicted in Figure 2 was configured from the five identified neuron types, and included the post-inspiratory modulation of the tonic excitatory inputs described above and in Table 1. Initial estimates of the connection strengths and time constants were made on the basis of the trajectories of the intracellular recordings. The conditions under which the proposed network, when configured using equations 2.1-2.4, would oscillate were computed following Matsuoka, and these were used to guide the initial choices of tonic inputs s_i and the levels of adaptation b_i. Thereafter, these parameters were adjusted interactively until network trajectories most closely matched the physiologic data. All simulations were programmed using the ASYST scientific programming package (Keithley
Asyst, Rochester) on an IBM PC/AT or compatible computer, using the integration scheme suggested by MacGregor (1987). Eigenvalues of the linearized system were analyzed with PHSPLAN (© Bard Ermentrout, Pittsburgh).
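For readers who wish to reproduce the qualitative behavior, equations 2.1-2.4 can be integrated directly. The following Euler sketch is ours, written in Python/NumPy rather than ASYST; the connection matrix A, tonic inputs s, adaptation gains b, and RAS weights c would be filled in from Table 1, with the time constants from its footnote. The step size must be kept small relative to the RAS time constant γ:

import numpy as np

def g(x):
    return np.maximum(0.0, x)            # rectifying output nonlinearity

def simulate(A, s, b, c, p_i, tau=80.0, T=1000.0, gamma=0.1,
             dt=0.02, n_steps=200_000):
    """Euler integration of equations 2.1-2.4.

    A: connection matrix a_ij; s: tonic inputs; b: adaptation gains;
    c: RAS weights; p_i: index of the post-inspiratory element."""
    n = len(s)
    x, z, w = np.zeros(n), np.zeros(n), 0.0
    ys = np.empty((n_steps, n))
    for k in range(n_steps):
        y = g(x)                                       # eq. 2.3
        dx = (A @ y + (s - c * w) - b * z - x) / tau   # eq. 2.1
        dz = (y - z) / T                               # eq. 2.2
        dw = (y[p_i] - w) / gamma                      # eq. 2.4
        x, z, w = x + dt * dx, z + dt * dz, w + dt * dw
        ys[k] = y
    return ys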
3 Results

The network parameters obtained as described above are given in Table 1, and the corresponding network outputs are shown in Figure 3. These trajectories are comparable to those of Figure 1, especially with respect to the cardinal three-phase feature of the physiologic network. We have not performed a quantitative point-by-point comparison of the trajectories of Figures 1 and 3 because the substantial heterogeneity within each physiological neuron class makes such a comparison inappropriate. It is especially important to appreciate that physiological recordings are available that span the range of differences between Figures 1 and 3. The level of adaptation in the expiratory (exp) neuron that was required to terminate the expiratory phase of the network model, although less than that of the other neurons of the model, is still greater than physiologically observed levels of adaptation of the exp neurons (Richter et al. 1975, 1986). The three-phase feature of the model outputs is preserved, and the finer features are minimally altered, when parameters are varied about the set given in Table 1. The topological structure of the model is also preserved when individual parameters or combinations of parameters are varied. This was indicated by analysis of the signs of the eigenvalues of the linearized system, which consistently exhibited a single complex pair of eigenvalues with a positive real component accompanied by eigenvalues that were negative and real.

We now evaluate the ability of the network model, which generated the outputs of Figure 3, to replicate the phase resetting studies of Paydarfar et al. (1986, 1987) (see Glass and Mackey 1988 for a review of this approach). Briefly, these investigators determined the change in the phase of the phrenic output after either superior laryngeal nerve (SLN) stimulation, which is inhibitory to inspiration (Remmers et al. 1986), or stimulation of the midbrain reticular formation, which is facilitatory to inspiration (Gauthier et al. 1983). In these studies, the important variables are the change in phase of the onset of phrenic nerve activity as a function of both the magnitude and timing of the stimulus. The results of Paydarfar et al. are reproduced on the left of Figures 4A and B for the inhibitory and facilitatory stimuli, respectively. Relatively weak stimuli (top of figure) produce phase resetting curves whose average slope has an absolute value of one. This is referred to as Type 1 resetting. Relatively strong stimuli (bottom of figure) produce phase resetting curves whose average slope is 0. This is referred to as Type 0 resetting. Intermediate strength stimuli (middle of figure) produce a highly variable
Figure 3: Normalized outputs of the five components of the network model for the parameterization given in Table 1. The horizontal line indicates the threshold for each neural element. The regions above the horizontal line indicate the period of superthreshold activity (y_i), which is proportional to firing rate. The regions below the horizontal line correspond to the subthreshold pattern of activity (x_i). These patterns should be compared to the physiologically determined patterns of activity depicted in Figure 1, noting the presence of three well-defined phases of activity.
Figure 4: Phase resetting studies for (A) superior laryngeal nerve (SLN) stimuli, which are inhibitory to ventilation, and (B) midbrain reticular formation stimuli, which are facilitatory to ventilation. The left-hand side of each component of the figure reproduces the physiologically determined phase resetting studies of Paydarfar et al. The right-hand side of each component of the figure depicts the results of phase resetting studies performed computationally on the network model. For each section of the figure, the top panel illustrates the Type 1 phase resetting behavior that accompanies relatively weak stimuli, the bottom panel illustrates the Type 0 behavior that accompanies relatively strong stimuli, and the middle panel illustrates the variable phase resetting that occurs for intermediate strength stimuli presented at the phase singularity of the oscillator (see text for details).
resetting, and the phase in the respiratory cycle at which this occurs is the phase singularity of the oscillator. Some nonlinear oscillators have a topologic structure such that stimulation with an intermediate strength stimulus at the phase singularity will terminate the oscillations (Glass and Mackey 1988). Note that apneas were not produced in the phase resetting experiments of Paydarfar et al. In our simulations we examined
the phase resetting of the onset of output by the inspiratory ramp (ramp-I) neuron. The effects of SLN stimulation were simulated by us as excitation of the post-inspiratory neuron (Remmers et al. 1986) and inhibition of the exp neuron (Ballantyne and Richter 1986), and the corresponding set of phase resetting curves is shown to the right of Figure 4A, where the ratio of p-I:exp in the stimulus pulse was 1:3. Facilitatory stimuli were modeled as leading to excitation of the early-inspiratory neurons (Gauthier et al. 1983), and the corresponding phase resetting curves are shown to the right of Figure 4B. It is important to note that the location of the phase singularity is nearly the same in both the physiologic data and the corresponding simulations. However, there is a small, and possibly significant, difference in the location of the phase singularity with respect to the expiratory to inspiratory (E-I) junction in the simulations of the SLN data (Fig. 4A, right middle). Here, the phase singularity appears earlier in the respiratory cycle than in the experimental data. As in the phase resetting experiments, none of the simulations produced apnea when the resetting pulse was applied.

A number of distinct qualitative changes in the pattern of phrenic activity can be induced by physiologic modifications of the network afferents. These include the two-phase alternating inspiratory and post-inspiratory rhythm described by Remmers et al. (1986), and the double firing of the p-I cells (Schwarzacher et al. 1991). Remmers et al. observed the two-phase inspiratory/post-inspiratory rhythm with low frequency stimulation of the SLN. Modeling SLN stimulation as in the phase resetting studies, we observed a similar two-phase rhythm with a 5% increase in the tonic input to the p-I neurons and an 18% decrease in the tonic input to the expiratory neurons. Cycles without a post-inspiratory phase have not been observed physiologically and could be produced in the model only by changes of over 300% in some of the tonic inputs. Double firing of the p-I cells, which is commonly observed (Schwarzacher et al. 1991), could be simulated with as little as a 10% increase in the tonic input to the p-I neuron of the model.
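The phase-resetting protocol itself is straightforward to script. The sketch below (ours; names hypothetical, and using one common convention for "new phase") delivers a pulse at a grid of old phases and reads off the new phase of ramp-I onset, from which Type 1 versus Type 0 resetting can be judged by the average slope:

import numpy as np

def ramp_onsets(y_ramp, thresh=0.01):
    """Indices where the ramp-I output crosses threshold from below."""
    above = y_ramp > thresh
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

def resetting_curve(run_with_pulse, period, onset_ref, old_phases):
    """run_with_pulse(t_pulse) must rerun the simulation with a stimulus at
    t_pulse and return the time of the first subsequent ramp-I onset."""
    new_phases = []
    for phi in old_phases:
        t_pulse = onset_ref + phi * period
        t_next = run_with_pulse(t_pulse)
        # New phase from the latency to the next onset, modulo one cycle.
        new_phases.append(1.0 - ((t_next - t_pulse) / period) % 1.0)
    return np.array(new_phases)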
Our results demonstrate that a network model of respiratory rhythmogenesis based on that proposed by Richter et al. (1986) can replicate the physiologically observed outputs of the brainstem respiratory neurons in significant detail. Although this result supports the network hypothesis, it is reasonable to expect that the network that was configured specifically to replicate the physiologic data should be capable of reproducing its trajectories in some detail. For this reason, and because of the variability in each physiologically defined neuron class, we turned to other physiological phenomena of respiratory rhythmogenesis to validate the behavior of the network model described above. In addition to support-
ing the network hypothesis, our computational experiments also indicate how a diverse array of physiologic data can be unified under a single hypothesis. Moreover, the topologic structure of the network, as indicated by localized eigenvalue analysis and the phase resetting studies, has important implications for the generation of respiratory arrhythmias.

The failure of the network model to produce the E-I transition without a level of adaptation of the exp neurons in excess of that which is observed physiologically (Richter et al. 1975, 1986) suggests a limitation of the model. Additional computational evidence of the limitations of the model is seen in the simulations of the phase resetting data, where the phase singularity produced by the SLN pulses is not perfectly located with respect to the E-I junction. This difficulty cannot be overcome by simple parameter changes to the model. All of this suggests the possibility of an additional type of neuron that serves to mediate the E-I transition. Physiologic evidence does exist for an additional respiratory neuron (phase-spanning expiratory-inspiratory) that fires at the E-I transition and thus could help to terminate the expiratory phase (Cohen 1969; Smith et al. 1990; Schwarzacher et al. 1991). Preliminary computational data indicate that such a neuron could terminate the expiratory phase with a more physiologically appropriate level of adaptation of the exp neuron. In this manner, our computational results have aided physiologic investigation.

Although apnea could be produced in the model by manipulations, such as sustained and intense SLN activity, that also produce apnea in physiologic preparations (Lawson 1981), we observed nothing to suggest that brief perturbations of any type to the network, as it is presently configured, could produce apnea. It has been postulated that infant apnea has its origins in the topologic structure of the respiratory pattern generator (Paydarfar et al. 1986, 1987), and that if the system is driven by a perturbation to a locally stable region of the state space, respiratory activity will cease. That the phase singularity of the oscillator should be located within such a region motivated the studies of Paydarfar et al. However, their work in adult animals did not reveal apneas at the phase singularity. Similar results were obtained in our model, and the presence of eigenvalues with positive real components when the network is linearized about its fixed point indicates that termination of ventilation with a well-timed pulse of any type is impossible in the model as it is currently configured. Thus, if infant apnea is the consequence of perturbation of the network to a region of stability, we would anticipate the observation in susceptible individuals of networks with substantially different properties than those described above, especially since it appears that neonatal respiratory control, at least in minipigs, is organized similarly to that of adult animals (Lawson et al. 1989).

In summary, our computational studies have demonstrated how a group of virtually identical model neurons, when appropriately connected, can replicate the physiologically observed trajectories of the dif-
ferent respiratory neurons, as well as a large number of the rhythmic phenomena of the respiratory pattern generator. In addition to supporting the network hypothesis of respiratory pattern generation, our studies demonstrate that the topologic structure of the network is stable with respect to brief perturbations. These studies have also motivated the physiologic search for additional types of respiratory neurons.

Acknowledgments
The work reported here was supported in part by NATO Grant RG 0834/86, SCOR Grant HL-42236, DFG Ri 279/7-11, and Training Grant GM-07612-16.

References

Anders, K., Ballantyne, D., Bischoff, A. M., Lalley, P. M., and Richter, D. W. 1991. Inhibition of caudal medullary expiratory neurones by retrofacial inspiratory neurones in the cat. J. Physiol. (London) 437, 1-25.
Ballantyne, D., and Richter, D. W. 1984. Postsynaptic inhibition of bulbar inspiratory neurones in cat. J. Physiol. (London) 348, 67-87.
Ballantyne, D., and Richter, D. W. 1986. The non-uniform character of expiratory synaptic activity in expiratory bulbospinal neurones of the cat. J. Physiol. (London) 370, 433-456.
Botros, S. M., and Bruce, E. N. 1990. Neural network implementation of the three-phase model of respiratory rhythm generation. Biol. Cybern. 63, 143-153.
Cohen, M. I. 1969. Discharge patterns of brain-stem respiratory neurons during Hering-Breuer reflex evoked by lung inflation. J. Neurophysiol. 32, 356-374.
Cohen, M. I. 1979. Neurogenesis of respiratory rhythm in the mammal. Physiol. Rev. 59(4), 1105-1173.
Ezure, K. 1990. Synaptic connections between medullary respiratory neurons and considerations on the genesis of respiratory rhythm. Prog. Neurobiol. 35, 429-450.
Feldman, J. L., and Cleland, C. L. 1982. Possible roles of pacemaker neurons in mammalian respiratory rhythmogenesis. In Cellular Pacemakers, Vol. II, D. O. Carpenter, ed., pp. 101-119. Wiley, New York.
Gauthier, P., Monteau, R., and Dussardier, M. 1983. Inspiratory on-switch evoked by stimulation of mesencephalic structures: A patterned response. Exp. Brain Res. 51, 261-270.
Glass, L., and Mackey, M. C. 1988. From Clocks to Chaos, pp. 98-118. Princeton University Press, Princeton, NJ.
Lawson, E. E. 1981. Prolonged central respiratory inhibition following reflex-induced apnea. J. Appl. Physiol. 50, 874-879.
Lawson, E. E., Richter, D. W., and Bischoff, A. 1989. Intracellular recordings of respiratory neurons in the lateral medulla of piglets. J. Appl. Physiol. 66(2), 983-988.
Lindsey, B. G., Segers, L. S., and Shannon, R. 1987. Functional associations among simultaneously monitored lateral medullary respiratory neurons in the cat. II. Evidence for inhibitory actions of expiratory neurons. J. Neurophysiol. 57, 1101-1117.
Lindsey, B. G., Segers, L. S., and Shannon, R. 1989. Discharge patterns of rostrolateral medullary expiratory neurons in the cat: Regulation by concurrent network processes. J. Neurophysiol. 61, 1185-1196.
MacGregor, R. J. 1987. Neural and Brain Modeling, pp. 250-254. Academic Press, San Diego, CA.
MacGregor, R. J., and Tajchman, G. 1988. Theory of dynamic similarity in neuronal systems. J. Neurophysiol. 60, 751-768.
Matsuoka, K. 1985. Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biol. Cybern. 52, 367-376.
Ogilvie, M. D., Gottschalk, A., Anders, K., Richter, D. W., and Pack, A. I. 1992. A network model of respiratory rhythmogenesis. Am. J. Physiol. 263, R962-R975.
Paydarfar, D., and Eldridge, F. L. 1987. Phase resetting and dysrhythmic responses of the respiratory oscillator. Am. J. Physiol. 252 (Regulatory Integrative Comp. Physiol. 21), R55-R62.
Paydarfar, D., Eldridge, F. L., and Kiley, J. P. 1986. Resetting of mammalian respiratory rhythm: Existence of a phase singularity. Am. J. Physiol. 250 (Regulatory Integrative Comp. Physiol. 19), R721-R727.
Remmers, J. E., Richter, D. W., Ballantyne, D., Bainton, C. R., and Klein, J. P. 1986. Reflex prolongation of stage I of expiration. Pflügers Arch. 407, 190-198.
Richter, D. W. 1982. Generation and maintenance of the respiratory rhythm. J. Exp. Biol. 100, 93-107.
Richter, D. W., Heyde, F., and Gabriel, M. 1975. Intracellular recordings from different types of medullary respiratory neurons of the cat. J. Neurophysiol. 38, 1162-1171.
Richter, D. W., Camerer, H., Meesmann, M., and Rohrig, N. 1979. Studies on the interconnection between bulbar respiratory neurones of cats. Pflügers Arch. 380, 245-257.
Richter, D. W., Ballantyne, D., and Remmers, J. E. 1986. How is the respiratory rhythm generated? A model. NIPS 1, 109-112.
Schwarzacher, S. W., Wilhelm, Z., Anders, K., and Richter, D. W. 1991. The medullary respiratory network in the rat. J. Physiol. (London) 435, 631-644.
Segers, L. S., Shannon, R., and Lindsey, B. G. 1985. Interactions between rostral pontine and ventral medullary respiratory neurons. J. Neurophysiol. 54, 318-334.
Segers, L. S., Shannon, R., Saporta, S., and Lindsey, B. G. 1987. Functional associations among simultaneously monitored lateral medullary respiratory neurons in the cat. I. Evidence for excitatory and inhibitory connections of inspiratory neurons. J. Neurophysiol. 57, 1078-1100.
Smith, J. C., and Feldman, J. L. 1987. Involvement of excitatory and inhibitory amino acids in central respiratory pattern generation in vitro. Fed. Proc. 46, 1005.
Smith, J. C., Greer, J. J., Liu, G., and Feldman, J. L. 1990. Neural mechanisms
68
Allan Gottschalk et al.
generating respiratory pattern in mammalian brain stem-spinal cord in vitro. I. Spatiotemporal patterns of motor and medullary neuron activity. I. Neurophysiol. 64, 1149-1169. Smith, J. C., Ellenberger, H., Ballanyi, K., Richter, D. W., and Feldman, J. L. 1991. Pre-Botzinger complex: A brainstem region that may generate respiratory rhythm in mammals. Science 254, 726-729. Suzue, T. 1984. Respiratory rhythm generation in the in vitro brain stem-spinal cord preparation of the neonatal rat. I. Physiol. (London) 354, 173-183. Tank, D. W., Sugimori, M., Connor, S. A., and Llinas, R. R. 1988. Spatially resolved calcium dynamics of mammalian Purkinje cells in cerebellar slice. Science 242, 773-777.
Received May 28, 1992; accepted March 29, 1993.
Communicated by John Rinzel
Subharmonic Coordination in Networks of Neurons with Slow Conductances

Thomas LoFaro
Nancy Kopell
Department of Mathematics, Boston University, Boston, MA 02215 USA
Eve Marder
Department of Biology and Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA
Scott L. Hooper
Department of Biology and Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA, and Department of Biological Sciences, Ohio University, Athens, OH 45701 USA
We study the properties of a network consisting of two model neurons that are coupled by reciprocal inhibition. The study was motivated by data from a pair of cells in the crustacean stomatogastric ganglion. One of the model neurons is an endogenous burster; the other is excitable but not bursting in the absence of phasic input. We show that the presence of a hyperpolarization-activated inward current (i_h) in the excitable neuron allows these neurons to fire in integer subharmonics, with the excitable cell firing once for every N ≥ 1 bursts of the oscillator. The value of N depends on the amount of hyperpolarizing current injected into the excitable cell as well as the voltage activation curve of i_h. For a fast synapse, these parameter changes do not affect the characteristic point in the oscillator cycle at which the excitable cell bursts; for slower synapses, such a relationship is maintained within small windows for each N. The network behavior in the current work contrasts with the activity of a pair of coupled oscillators for which the interaction is through phase differences; in the latter case, subharmonics exist if the uncoupled oscillators have near integral frequency relationships, but the phase relationships of the oscillators in general change significantly with parameters. The mechanism of this paper provides a potential means of coordinating subnetworks acting on different time scales but maintaining fixed relationships between characteristic points of the cycles. Neural Computation 6, 69-84 (1994)
© 1993 Massachusetts Institute of Technology
1 Introduction
Most biological neurons show a wide variety of voltage- and time-dependent conductances that shape their electrical properties. In particular, many neurons have conductances that operate on multiple time scales, and that allow them to display plateau potentials and slow bursting pacemaker potentials. The extent to which neurons express these slow conductances is often controlled by neuromodulatory substances, or influenced directly by membrane potential (Harris-Warrick and Marder 1991; Kiehn and Harris-Warrick 1992; Marder 1991, 1993; McCormick and Pape 1990). Many neurons that display plateau potentials show low-threshold Ca2+ currents necessary for the sustained depolarization during the plateau (Angstadt and Calabrese 1991; Hounsgaard and Kiehn 1989). These same neurons often also show strong hyperpolarization-activated inward currents (i_h) that contribute to their recovery from inhibition (Angstadt and Calabrese 1989; Golowasch and Marder 1992; Golowasch et al. 1992; Kiehn and Harris-Warrick 1992). The voltage dependence of i_h is such that this current is very small when the neuron is depolarized, but may profoundly influence the neuron's activity when the neuron is hyperpolarized. Therefore, we were particularly interested in exploring the role of i_h in neurons that are phasically inhibited in pattern-generating networks. This paper illustrates that if an oscillator is coupled by reciprocal inhibition to an excitable neuron with an i_h-like current, there can be stable patterns in which the excitable cell fires much less frequently than the oscillator. These N : 1 patterns are known as "subharmonics" in mathematics and are sometimes called "coupling ratios." In this patterned output, for fast synapses, the characteristic point in the cycle at which the excitable cell begins to fire is relatively insensitive to parameter changes. This characteristic point is just after the end of inhibition from the oscillator to the excitable cell. (For a slower synapse the delay can be more variable.) Thus this mechanism is a candidate for maintaining timing relationships among cells or subnetworks that operate on different time scales.

2 Experimental Example
The pyloric rhythm of the crustacean stomatogastric ganglion displays alternating bursts of activity in the pyloric dilator (PD) and lateral pyloric (LP) motor neurons (Fig. 1A). The strict alternation of these functional antagonists is ensured by reciprocal inhibitory synaptic connections between the PD and LP neurons. However, if the LP neuron is hyperpolarized, either by current injection or by a strong, sustained inhibitory input, it produces a characteristic depolarizing voltage "sag" (Fig. 1B), due to the presence of an i_h current (Golowasch and Marder 1992). Figure 1B also shows that when the LP neuron is hyperpolarized by current injection, it rebounds every so often from inhibition to produce a sustained depolarization and high-frequency firing of action potentials. In the presence of the inhibition of the PD neurons by the LP, this leads to a network output in which the number of PD bursts between each LP burst is an integer. This integer increases (Fig. 1C) with the amount of hyperpolarizing current until the LP neuron eventually completely ceases to burst (Hooper, unpublished data). This paper is not intended to model the above data but to explore how the presence of an i_h current can lead to subharmonics of the type shown and to investigate timing relationships of the model cells in the presence of fast and slow synapses.

Figure 1: (A) (left) A schematic of the experimental setup. The PD and LP neurons mutually inhibit each other; each neuron's membrane potential is monitored intracellularly (V electrodes), and current can be injected into the LP neuron through a second electrode (i electrode). (right) The activity of the two neurons with 0 current injected into the LP neuron. The neurons exhibit 1 : 1 locking. (B) Negative current is injected into the LP neuron until the locking ratio is approximately 6 PD bursts per LP burst. (C) Negative current is injected into the LP neuron until the locking ratio is approximately 12 PD bursts per LP burst.
3 Equations
We model the PD cell by the Morris-Lecar (1981) equations, a simple, two-dimensional conductance-based system. These equations, originally formulated to describe electrical activity in barnacle muscle fiber, are sometimes used as a simple caricature of the envelope of bursting neurons without the spiking activity (Rinzel and Ermentrout 1989); they model situations in which the behavior of the envelope is determined primarily by calcium dynamics, which are explicitly included in the equations. We use nondimensional versions of these equations, as given in Rinzel and Ermentrout (1989). Parameters are chosen so that in the absence of coupling, the PD cell oscillates spontaneously [although in the real preparation, the oscillatory behavior of the PD neuron depends on that of another neuron, the anterior burster (AB) neuron]. The LP cell is also modeled with the Morris-Lecar equations but is given an additional current, based on i_h. This current slowly activates when the LP neuron is hyperpolarized, and more rapidly inactivates when the neuron is depolarized. Its reversal potential is such that it is depolarizing when the current is active. This current thus responds to hyperpolarizing pulses (rhythmic inhibition) of the LP neuron with a delayed excitatory current that makes the LP neuron more likely to burst after the inhibition (postinhibitory rebound, Perkel and Mulloney 1974). In the absence of coupling to the PD cell, the LP cell has a stable equilibrium and a threshold for a large excursion before returning to rest. We denote the voltages of the PD and LP cells by v_PD and v_LP. The other variable in the Morris-Lecar equations is loosely modeled on the activation of an outward current. These variables for the PD and LP are labeled n_PD and n_LP. The level of i_h is called s, and d is the gating of the synapse from PD to LP. The full equations are

dv_PD/dt = [I_PD - I_ion(v_PD, n_PD) + G_PD(v_LP, v_PD)]/τ_1
dn_PD/dt = φ_PD λ(v_PD)[n_∞(v_PD) - n_PD]/τ_1
dv_LP/dt = I_LP - I_ion(v_LP, n_LP) + H(v_LP, s) + G_LP(v_LP, d)
dn_LP/dt = φ_LP λ(v_LP)[n_∞(v_LP) - n_LP]
ds/dt = ε k_s(v_LP)[s_∞(v_LP) - s]
τ_2 dd/dt = m_∞(v_PD) - d     (3.1)

with the ionic currents

I_ion(v, n) = g_Ca m_∞(v)(v - 1) + g_K n(v - v_K) + g_L(v - v_L)     (3.2)

and the steady-state functions

m_∞(v) = (1/2)[1 + tanh((v - v_1)/v_2)],  n_∞(v) = (1/2)[1 + tanh((v - v_3)/v_4)],  λ(v) = cosh((v - v_3)/(2v_5))     (3.3)

where v_1 = -0.01, v_2 = 0.15, v_3 = -0.12, v_4 = 0.3, and v_5 = 0.22. The constants for conductances and reversal potentials in 3.2 are g_K = 2.0, v_K = -0.7, g_L = 0.5, v_L = -0.5, and g_Ca = 1.1. These constants were assumed to be cell independent. The time constant τ_1 = 3.3 was used to control the frequency of the PD oscillation. The LP cell was assumed to have a slow activation of the recovery variable, so we used φ_LP = 0.01 and φ_PD = 0.1. Finally, I_PD = 0.35, and I_LP was varied. The effect of i_h has the form H(v_LP, s) = g_s s[v_s - v_LP] with g_s = 0.5 and v_s = 0.3. The dynamics of the i_h current are defined by the functions k_s(v) and s_∞(v), which are given by

k_s(v) ≡ 1/(1 + e^((v_6 - v)/v_7)),  s_∞(v) ≡ 1/(1 + e^((v - v_9)/v_8))     (3.4)

where v_6 = 0.04, v_7 = 0.05, v_8 = 0.075, and v_9 was varied. Since this process is slow, ε = 0.003. The inhibitory chemical synapses are modeled as

G_PD(v_LP, v_PD) = α m_∞(v_LP)[v_syn - v_PD],  G_LP(v_LP, d) = α d[v_syn - v_LP]     (3.5)

with α = 0.5 and v_syn = -0.7. Finally, in the last equation of 3.1, τ_2 = 10.0, except in some simulations in which we used a simplified version of the equations with τ_2 ≡ 0, that is, d = m_∞(v_PD). The numerical simulations were done using the software PHASEPLANE (Ermentrout 1989) and dstool (Guckenheimer et al. 1991). The integration method employed was either a fourth-order Runge-Kutta with small step size or the variable-step-size Gear's method.
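As a concrete companion to these equations, the following Python sketch integrates the system with a fixed-step fourth-order Runge-Kutta scheme (the authors used PHASEPLANE and dstool, not this code). It assumes the functional forms written out in equations 3.1-3.5 above, including the placement of τ_1 on the PD equations and the instantaneous synapse d = m_∞(v_PD); all variable names are ours.

import numpy as np

# Parameter values as listed in Section 3 (nondimensional).
v1, v2, v3, v4, v5 = -0.01, 0.15, -0.12, 0.3, 0.22
gk, vk, gl, vl, gca = 2.0, -0.7, 0.5, -0.5, 1.1
tau1, phi_pd, phi_lp = 3.3, 0.1, 0.01
I_pd, I_lp = 0.35, 0.09               # I_LP was varied in the paper
gs, vs = 0.5, 0.3                     # i_h conductance and reversal
v6, v7, v8, v9, eps = 0.04, 0.05, 0.075, 0.2, 0.003
alpha, vsyn = 0.5, -0.7               # synaptic strength and reversal

minf = lambda V: 0.5 * (1 + np.tanh((V - v1) / v2))
ninf = lambda V: 0.5 * (1 + np.tanh((V - v3) / v4))
lam  = lambda V: np.cosh((V - v3) / (2 * v5))
ks   = lambda V: 1 / (1 + np.exp((v6 - V) / v7))   # fast reset when depolarized
sinf = lambda V: 1 / (1 + np.exp((V - v9) / v8))   # activates with hyperpolarization

def i_mem(V, N_, I):
    # Morris-Lecar membrane currents, equation 3.2 (v_Ca = 1).
    return I - gca * minf(V) * (V - 1) - gk * N_ * (V - vk) - gl * (V - vl)

def rhs(x):
    v_pd, n_pd, v_lp, n_lp, s = x
    d = minf(v_pd)                    # instantaneous PD-to-LP synapse (tau_2 = 0)
    dv_pd = (i_mem(v_pd, n_pd, I_pd) + alpha * minf(v_lp) * (vsyn - v_pd)) / tau1
    dn_pd = phi_pd * lam(v_pd) * (ninf(v_pd) - n_pd) / tau1
    dv_lp = (i_mem(v_lp, n_lp, I_lp) + gs * s * (vs - v_lp)   # H(v_LP, s)
             + alpha * d * (vsyn - v_lp))
    dn_lp = phi_lp * lam(v_lp) * (ninf(v_lp) - n_lp)
    ds = eps * ks(v_lp) * (sinf(v_lp) - s)
    return np.array([dv_pd, dn_pd, dv_lp, dn_lp, ds])

def rk4_step(x, dt):
    k1 = rhs(x); k2 = rhs(x + 0.5 * dt * k1)
    k3 = rhs(x + 0.5 * dt * k2); k4 = rhs(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

x = np.array([0.1, 0.1, -0.4, 0.0, 0.5])  # v_PD, n_PD, v_LP, n_LP, s
dt = 0.05
trace = [x]
for _ in range(int(250 / dt)):            # 250 time units, as in Figure 2
    x = rk4_step(x, dt)
    trace.append(x)
trace = np.array(trace)                    # inspect trace[:, 0] (PD) and trace[:, 2] (LP)

Varying I_lp and v9 in this sketch is the experiment reported in Section 4.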
4 Results
Two sets of simulations were done, the first using the simplified system assuming d = m_∞(v_PD), that is, an instantaneous synapse from the PD to
the LP. The first simulation of the first set replicated some key features of the data in the motivating example in Figure 1. For this simulation, v_9 is set at 0.1. The parameter v_9 sets the position of the activation curve of i_h. At this position, i_h is active even at the uncoupled rest potential v_LP ≈ -0.3 of the cell. However, as shown by other simulations described below, this does not affect the qualitative behavior we are describing in this paper. Figures 2A,B,C give a typical picture of the voltages of LP and PD versus time at three levels of hyperpolarization, showing one LP burst for every one, four, and seven cycles of the PD. In the absence of i_h, there is no bursting in the LP cells (Fig. 2D). With a sufficient amount of tonic hyperpolarizing current, the LP cell is completely shut off (data not shown). Note that when an LP burst occurs, it does so at the same characteristic point within the PD's cycle: just on release of inhibition from the PD cell, regardless of the amount of injected current or the subharmonic locking it gives rise to. As discussed in Section 5, these results contrast strongly with the expected behavior of a pair of coupled phase oscillators whose uncoupled frequencies are nearly multiples of one another. We now give a heuristic explanation for the results, using geometrical phase plane analysis (Fig. 3) (Edelstein-Keshet 1988; Rinzel and Ermentrout 1989). (A more detailed analysis will be forthcoming; LoFaro 1993.) In the absence of the slow i_h, the Morris-Lecar equations model the phenomenon of inhibitory rebound for parameter regimes in which the equations are excitable (Perkel and Mulloney 1974). That is, a pulse of inhibitory current, held for sufficiently long, moves the system to a new stationary state with a more hyperpolarized voltage; if the pulse is sufficiently strong, the trajectory from this new point in the uninhibited system jumps to the excited branch (Fig. 3A). Thus, the system is excited on release from inhibition. If the pulse is too weak, the system reverts quickly after release from inhibition to its previous stable steady state, without excitation (Fig. 3B). In the parameter regime in which φ_LP is small, the threshold or boundary between these two possible responses is at the local minimum of the nullcline for the uninhibited system (see Fig. 3A). Without i_h, if the pulse of inhibitory current is sufficiently long, the system does not produce subharmonics. (Subharmonics and other patterned output can occur using another mechanism if the pulses are short; see Section 5.1.) The effect of i_h is to modulate the position of the threshold. During hyperpolarization, i_h is activated, opposing the hyperpolarizing current (causing the depolarizing "sag" back toward the original potential). Furthermore, until the LP bursts, the activation of this current does not decrease, or does so slowly. This enables the system to have a "memory" for hyperpolarizing pulses; the amount of excitation needed to pass the threshold decreases with each successive inhibitory pulse (Fig. 3C).
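The threshold modulation can be visualized directly from the v_LP nullcline, obtained by setting dv_LP/dt = 0 and solving for n_LP. The short Python sketch below does this under the same assumed equations and parameter values as the sketch in Section 3 (function and variable names are ours); raising s lifts the nullcline, which is exactly the drift invoked in the argument above.

import numpy as np

# Same assumed nondimensional parameter values as the Section 3 sketch.
gk, vk, gl, vl, gca = 2.0, -0.7, 0.5, -0.5, 1.1
gs, vs, alpha, vsyn = 0.5, 0.3, 0.5, -0.7
minf = lambda V: 0.5 * (1 + np.tanh((V + 0.01) / 0.15))

def lp_nullcline(V, s, I_lp=0.09, inhibited=False):
    # Solve dv_LP/dt = 0 for n_LP.  Turning the PD synapse on (inhibited=True)
    # lowers the curve (the dashed nullcline of Fig. 3); raising s lifts it,
    # shifting the rebound threshold downward.
    syn = alpha * (vsyn - V) if inhibited else 0.0
    num = I_lp - gca * minf(V) * (V - 1) - gl * (V - vl) + gs * s * (vs - V) + syn
    return num / (gk * (V - vk))

V = np.linspace(-0.69, 0.5, 300)          # stay above v_K = -0.7, where n diverges
curves = {s: lp_nullcline(V, s) for s in (0.0, 0.3, 0.6)}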
Figure 2: Numerical integrations of equations 3.1 with fast synaptic currents. In each part the upper trace is v_LP and the lower trace is v_PD versus time. (A) 1 : 1 subharmonic; I_LP = 0.27, v_9 = 0.2. (B) 4 : 1 subharmonic; I_LP = 0.09, v_9 = 0.2. (C) 7 : 1 subharmonic; I_LP = 0.08, v_9 = 0.2. (D) Small-amplitude oscillations of LP due to the absence of the sag current; I_LP = 0.09, v_9 = 0.2, g_s = 0.0.
Thus, an inhibitory pulse too small to cause inhibitory rebound after one pulse can do so after some finite number. However, if the level of tonic inhibitory current is sufficiently large, then i_h builds slowly toward some saturation value over many cycles.
For large enough hyperpolarization, this saturated level is still insufficient to allow inhibitory rebound, and the LP shuts off. We note that this mechanism can work in a region in which i_h is partially activated even at the resting potential of the uncoupled cell; in this case the inhibitory postsynaptic potentials (IPSPs) that provide the forcing do not substantially change the rate of increase of i_h. Figures 4A,B show the range of voltages of the LP cell during the buildup of i_h, and also during the burst of the LP, for two different values of I_LP. When the voltages in the high and low plateaus of the LP burst lie in the saturated regimes of the activation curve, as in Figure 4C, the IPSPs to the LP cell act mainly to prevent the firing of the LP; above a certain threshold necessary to keep the LP from firing during the on phase of the PD cell, an increase in the size of the IPSPs does not substantially change the output. Even in cases where the rate of increase of i_h is changed by the IPSPs (Fig. 4D), the mechanism works as described above. The above heuristic explanation suggests that an effect similar to changing the level of hyperpolarization of the LP cell can be achieved by a shift of the activation curve of i_h. This is of interest because Kiehn and Harris-Warrick (1992) found that serotonin appears to shift the activation curve of i_h in a crab STG neuron, and similar results are suggested in P. interruptus cultured neurons (Turrigiano and Marder 1993).
Figure 3: (Facing page.) The v_LP - n_LP phase plane diagram, showing the mechanism for LP firing. (A) 1 : 1 firing without i_h. The point A is the steady state before onset of inhibition from PD. Inhibition effectively lowers the v_LP nullcline (dashed curve), causing a sudden decrease in v_LP and creating a "ghost" steady state B to which all trajectories tend. On release from inhibition the nullcline reverts to its original position, causing a sudden increase in v_LP. If the steady state B is below the threshold T and the inhibition is long enough to allow the trajectory to approach B, then firing ensues. (B) No firing with i_h. The points A, B, and T are as in Figure 3A. In this figure, however, the "ghost" steady state B is greater than the threshold T, preventing the LP from firing on release from inhibition. (C) 2 : 1 firing with i_h. At the end of an LP burst the PD is released from inhibition and thus becomes depolarized. In turn, it inhibits the LP. During this initial inhibition v_LP and n_LP drift along the dashed nullcline to the point denoted B. In addition, during this first inhibition the nullcline dv_LP/dt = 0 drifts upward due to the slow increase in s (solid curve). Because the point B is above the threshold T, the LP trajectory tends during the PD interburst interval toward the point marked A, without a large excursion. During the second inhibition this mechanism repeats, with the LP trajectory during the PD burst tending to the point marked B. Again the buildup of i_h causes the dv_LP/dt = 0 nullcline to rise, giving a new threshold T'. Since B is now below T', the LP bursts upon release from inhibition.
Figure 4: i_h activation curves [s_∞(v)] versus LP depolarization and hyperpolarization. (Note: Because this current activates with hyperpolarization rather than depolarization, as is common for other currents, the activation curve has the opposite slope.) The left dark bar on the v axis represents the range of v_LP during subthreshold oscillations and the right dark bar the range during depolarization; the numbers are taken from numerical integrations of equations 3.1. In the upper two graphs we kept v_9 fixed while changing only I_LP. In both these graphs the LP voltage during inhibition is on the upper branch of s_∞ while during excitation it is on the lower part of the transition. The identical qualitative aspects of these indicate that altering the amount of injected current to the LP does not significantly change the rate at which i_h increases or decreases. In figures A, C, and D, I_LP was kept fixed while v_9 was reduced, shifting the i_h activation curve leftward. Here a qualitative difference is apparent (especially in C and D) in the relationship between voltage ranges during subthreshold oscillations and the steep portion of the activation curve of i_h. The shift of the activation curve significantly changes the average rate of increase of i_h during subthreshold oscillations. (A) 1 : 1 subharmonic; I_LP = 0.27, v_9 = 0.2. (B) 4 : 1 subharmonic; I_LP = 0.09, v_9 = 0.2. (C) 2 : 1 subharmonic; I_LP = 0.27, v_9 = -0.2. (D) 2 : 1 subharmonic; I_LP = 0.27, v_9 = -0.4.
Thus, we performed other simulations holding injected current fixed and shifting the activation curve s_∞(v) by varying parameter v_9. As expected, if the curve is shifted in the depolarized direction, the effect of each pulse is enhanced, and the number of PD cycles per LP burst goes down; a shift in the hyperpolarizing direction produces the opposite effect. Figure 5 summarizes this work.
Figure 5: Subharmonics of equations 3.1 with fast synaptic currents. For each of the four values of v_9 shown, the range of I_LP exhibiting various subharmonics is plotted. For example, when v_9 = 0.0 and I_LP = 0.15, equations 3.1 display 3 : 1 subharmonics.

For example, for I_LP = 0.1, subharmonics of three, five, six, and eight were observed. The simulations summarized in Figure 5 show that the activation curve can be shifted until i_h is entirely off at the LP resting potential. The subharmonics thus obtained have high values of N unless the maximal conductance of i_h is increased, in which case low subharmonics can be obtained. We note that in the parameter regime in which i_h is partially activated at rest, the reset of i_h can create a postburst hyperpolarization of LP. This can affect the numerical relationship between the subharmonic N and the amount of injected current; it does not change the qualitative behavior. One difference between the motivating data shown in Figure 1 and the above simulations is that the biological LP neuron does not fire immediately on release from inhibition. This difference is to be expected, because the synapse from PD to LP is relatively slow (Eisen and Marder 1982; Hartline and Gassie 1979). The second set of simulations tested the effect of such a slow synapse by replacing the instantaneous function of voltage in the model synapse from PD to LP by a synapse with dynamics as given in Section 3. The subharmonic coordination continues to be displayed. Now, however, as in the data, there is a delay from the offset of inhibition of the LP cell to the firing of the LP cell. The delay increases with the amount of hyperpolarization. That delay can be seen in the PD trace in Figure 6A and B; the PD begins a part of its next cycle before being shut off by the LP burst.
Figure 6: Numerical integration of equations 3.1 with slow synaptic currents. The traces are v_LP and v_PD versus time. (A) 1 : 1 subharmonic; I_LP = 0.27, v_9 = 0.2. (B) 4 : 1 subharmonic; I_LP = 0.15, v_9 = 0.2.

5 Discussion
5.1 Other Mechanisms for Subharmonic Coordination and Related Work. Our simulations employ one inherently oscillatory model (the PD) and one nonoscillatory, but excitable, model (the LP) that possesses a hyperpolarization-activated, persistent inward current (i_h). The subharmonic coordination displayed in these simulations is reminiscent of the subharmonic coordination observed when two true oscillators with different inherent cycle frequencies are coupled through their phase differences. However, the two types of systems are very different. Though "phase" could be defined in various ways for the LP-PD system, the interactions depend on the rebound property and not the difference in phases, even if the LP cell is in a parameter regime where it can spontaneously oscillate. There are two kinds of differences between the two systems that are potentially relevant to regulatory properties. The first has to do with timing relationships between the elements as a parameter is varied. For a pair of oscillators whose interactions depend only on phases, there is usually a range of ratios of frequencies near each integer N in which stable N : 1 coordination is possible. As some parameter is changed, moving this ratio from one end of the possible range to the other, the subharmonic N remains fixed but the phase relationships between the oscillators in the locked solution change substantially. [For example, in
the equations dθ_i/dt = ω_i + sin(θ_j - θ_i), i, j = 1, 2, i ≠ j, 1 : 1 coordination is possible if |ω_1 - ω_2| < 1. For ω_1 fixed and ω_2 varying between ω_1 - 1 and ω_1 + 1, the difference θ_1 - θ_2 in the locked solution varies between -π/2 and +π/2.] Such a large change in phase difference can be repeated within each of the regimes in which N : 1 coordination is stable (Ermentrout 1981). In particular, there is no pair of points on the two cycles that occur at the same time, independent of parameters that change the relative frequency of the pair. However, in the case of instantaneous synapses illustrated in this paper, if the LP bursts at all, it does so with the onset of its burst coming right after the offset of the PD burst. In Figure 2A,B,C one can see a slight delay in this timing. In the "singular limit" in which the time constant φ_LP → 0, this delay goes to zero. Another difference between the two kinds of systems is their activity in the regimes near the transitions from one subharmonic to the next. For the coupled oscillator system, between the intervals of N : 1 coordination, there are parameter ranges in which other behavior is expected, including no locking or more complex frequency relations (Ermentrout 1981). These sorts of complex activities are not seen in the model i_h system; this system instead shifts from one subharmonic to the next in a stable, step-like pattern as the LP is further hyperpolarized, with possible bistability between N : 1 and (N + 1) : 1 behavior. This is proved under some limiting conditions in LoFaro (1993). Other papers dealing with similar dynamic phenomena in a different context are Levi (1990) and Rinzel and Troy (1983). It should be emphasized that the rebound property alone is not sufficient to ensure the behavior described in the above paragraphs. The step-like changes in locking pattern occur when the inhibitory pulses of the PD cell are long enough that the LP cell can approach sufficiently close to the critical point of the inhibited system (marked B in Fig. 3A-C) before being released; also important is the fast reset of i_h in the depolarized regime, due to the magnitude of the function k_s. In other parameter regimes, the behavior can be reminiscent of that of a forced oscillator when the forcing oscillator has a significantly different frequency from the forced oscillator. Such systems display subharmonic oscillations along with more complicated behavior (Glass and Mackey 1979; Arnold 1989; Chialvo et al. 1990; Keener 1981). Indeed, Wang (1993) has observed the characteristic "devil's staircase" well known in forced oscillators (Arnold 1989) in a system closely related to equations 3.1, but in a different parameter range. We have also observed such behavior in the absence of i_h when the PD pulses are short. The effects of an i_h current have been discussed in the context of half-center oscillations, in which two cells, neither of which is an oscillator, are reciprocally coupled to form an oscillatory circuit (Angstadt and Calabrese 1989; Wang and Rinzel 1992). If the inhibitory synapses between the cells are sufficiently slow, the reciprocal inhibition can give rise to
in-phase oscillations; otherwise they are antiphase. In these papers the locking ratio between the components is 1 : 1.
5.2 Biological Relevance. The mechanism described here was suggested by the electrophysiological recordings shown in Figure 1. However, it is important to reiterate that although stomatogastric ganglion neurons display i_h (Golowasch and Marder 1992; Kiehn and Harris-Warrick 1992), much additional experimental work is needed to ascertain the role of this current in network function. Moreover, the kinds of recordings seen in Figure 1 could arise from other constellations of membrane currents and synaptic interactions in the STG. Despite the above caveat, the mechanism described here suggests that i_h could contribute to phenomena in which neurons change their activity pattern from the fast pyloric rhythm to the slow gastric rhythm (Weimann et al. 1991). One possible mechanism for these circuit "switches" is a decrease in synaptic input from the faster rhythm and an increase in an i_h-like current that would provide slower time constant bursting. The rebound after many cycles displayed by cells with i_h is potentially relevant to other situations as well. As shown in Kopell and Le Masson (1993), it provides a possible mechanism for producing network oscillations from nonoscillatory cells connected together with a cortical-like architecture, in such a way that each cell fires only once in many cycles of the network oscillation. Such a network can be capable of modulating its field potential amplitude while keeping its frequency relatively fixed (Kopell and Le Masson 1993). Thus, the mechanisms of this paper may be relevant to complex brain functions. In another example (Kopell et al. 1993), an i_h current provides a possible mechanism for the rhythmic switching of each of the two leech heart tubes between peristaltic and synchronous pumping; this happens on a longer time scale than the period of the individual beats (Calabrese and Peterson 1983).

Acknowledgments

We wish to thank R. Harris-Warrick, F. Nagy, J. Rinzel, and X. J. Wang for useful conversations and helpful comments. T. L. and N. K. were supported in part by NIMH Grant MH47150, and E. M. and S. L. H. were supported in part by NIMH Grant MH46742.

References

Angstadt, J. D., and Calabrese, R. L. 1989. A hyperpolarization-activated inward current in heart interneurons of the medicinal leech. J. Neurosci. 9, 2846-2857.
Angstadt, J. D., and Calabrese, R. L. 1991. Calcium currents and graded synaptic transmission between heart interneurons of the leech. J. Neurosci. 11, 746-759.
Arnold, V. I. 1989. Mathematical Methods of Classical Mechanics. Graduate Texts in Math., 60. Springer-Verlag, Berlin.
Calabrese, R., and Peterson, E. 1983. Neural control of heartbeat in the leech, Hirudo medicinalis. In Soc. for Exp. Biol. XXXVII, Neural Origin of Rhythmic Movements, A. Roberts and B. Roberts, eds., pp. 195-221. Cambridge University Press, Cambridge.
Chialvo, D. R., Michaels, D., and Jalife, J. 1990. Supernormal excitability as a mechanism of chaotic dynamics of activation in cardiac Purkinje fibers. Circ. Res. 66, 525-545.
Edelstein-Keshet, L. 1988. Mathematical Models in Biology. Random House, New York.
Eisen, J. S., and Marder, E. 1982. Mechanisms underlying pattern generation in lobster stomatogastric ganglion as determined by selective inactivation of identified neurons. III. Synaptic connections of electrically coupled pyloric neurons. J. Neurophysiol. 48, 1392-1415.
Ermentrout, G. B. 1981. n:m phase-locking of weakly coupled oscillators. J. Math. Biol. 12, 327-342.
Ermentrout, G. B. 1989. PHASEPLANE, Version 3.0. Brooks/Cole.
Glass, L., and Mackey, M. 1979. A simple model for phase locking of biological oscillators. J. Math. Biol. 7, 339-352.
Golowasch, J., and Marder, E. 1992. Ionic currents of the lateral pyloric neuron of the stomatogastric ganglion of the crab. J. Neurophysiol. 67, 318-331.
Golowasch, J., Buchholtz, F., Epstein, I. R., and Marder, E. 1992. The contribution of individual ionic currents to the activity of a model stomatogastric ganglion neuron. J. Neurophysiol. 67, 341-349.
Guckenheimer, J., Myers, M. R., Wicklin, F. J., and Worfolk, P. A. 1991. dstool: A Dynamical System Toolkit with an Interactive Graphical Interface. Center for Applied Mathematics, Cornell University.
Harris-Warrick, R. M., and Marder, E. 1991. Modulation of neural networks for behavior. Annu. Rev. Neurosci. 14, 39-57.
Hartline, D. K., and Gassie, D. V. 1979. Pattern generation in the lobster (Panulirus) stomatogastric ganglion. I. Pyloric neuron kinetics and synaptic interactions. Biol. Cybernet. 33, 209-222.
Hounsgaard, J., and Kiehn, O. 1989. Serotonin-induced bistability of α-motoneurones caused by a nifedipine-sensitive calcium plateau potential. J. Physiol. 414, 265-282.
Keener, J. P. 1981. On cardiac arrhythmias: AV conduction block. J. Math. Biol. 12, 215-225.
Kiehn, O., and Harris-Warrick, R. M. 1992. 5-HT modulation of hyperpolarization-activated inward current and calcium-dependent outward current in crustacean motor neuron. J. Neurophysiol. 68, 496-508.
Kopell, N., and Le Masson, G. 1993. Rhythmogenesis, amplitude modulation and multiplexing in a cortical architecture. Submitted.
Kopell, N., Nadim, F., and Calabrese, R. 1993. In preparation.
Levi, M. 1990. A period-adding phenomenon. SIAM J. Appl. Math. 50, 943-955.
LoFaro, T. 1993. Thesis: A period adding bifurcation in a family of maps describing a pair of coupled neurons. Boston University.
Marder, E. 1991. Plateau in time. Current Biol. 1, 326-327.
Marder, E. 1993. Modulating membrane properties of neurons: Role in information processing. In Exploring Brain Functions: Models in Neuroscience, Dahlem Conference, pp. 27-42. John Wiley, Chichester.
McCormick, D. A., and Pape, H. C. 1990. Noradrenergic and serotonergic modulation of a hyperpolarization-activated cation current in thalamic relay neurons. J. Physiol. 431, 319-342.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Perkel, D. H., and Mulloney, B. 1974. Motor pattern production in reciprocally inhibitory neurons exhibiting postinhibitory rebound. Science 185, 181-183.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 135-163. MIT Press, Cambridge, MA.
Rinzel, J., and Troy, W. C. 1983. A one-variable map analysis of bursting in the Belousov-Zhabotinskii reaction. In Nonlinear Partial Differential Equations, 1982 Summer Research Conference, J. A. Smoller, ed., Durham, NH.
Turrigiano, G. G., and Marder, E. 1993. Modulation of identified stomatogastric ganglion neurons in dissociated cell culture. J. Neurophysiol. 69, 1993-2002.
Wang, X.-J. 1993. Multiple dynamical modes of thalamic relay neurons: Rhythmic bursting and intermittent phase-locking. Neurosci., in press.
Wang, X.-J., and Rinzel, J. 1992. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp. 4, 84-97.
Weimann, J. M., Meyrand, P., and Marder, E. 1991. Neurons that form multiple pattern generators: Identification and multiple activity patterns of gastric/pyloric neurons in the crab stomatogastric system. J. Neurophysiol. 65, 111-122.
Received December 7, 1992; accepted May 13, 1993.
Communicated by Bruce McNaughton
Setting the Activity Level in Sparse Random Networks

Ali A. Minai
William B. Levy
Department of Neurosurgery, University of Virginia, Charlottesville, VA 22908 USA
We investigate the dynamics of a class of recurrent random networks with sparse, asymmetric excitatory connectivity and global shunting inhibition mediated by a single interneuron. Using probabilistic arguments and a hyperbolic tangent approximation to the gaussian, we develop a simple method for setting the average level of firing activity in these networks. We demonstrate through simulations that our technique works well and extends to networks with more complicated inhibitory schemes. We are interested primarily in the CA3 region of the mammalian hippocampus, and the random networks investigated here are seen as modeling the a priori dynamics of activity in this region. In the presence of external stimuli, a suitable synaptic modification rule could shape this dynamics to perform temporal information processing tasks such as sequence completion and prediction. Neural Computation 6, 85-99 (1994)
1 Introduction
Recurrent networks of neural-like threshold elements with sparse, asymmetric connectivity are of considerable interest from the computational and biological perspectives. From the perspective of neurobiology, sparse, recurrent networks are especially useful for modeling the CA3 region of the mammalian hippocampus. Due to its recurrent connectivity, CA3 is thought to play a central role in associative memory (Marr 1971; Levy 1989; McNaughton and Nadel 1989; Rolls 1989; Treves and Rolls 1992) and the processing of temporal information (Levy 1988, 1989). In both cases, CA3 is seen as a system capable of learning associations between patterns and using these associations for spatiotemporal pattern recognition. Synaptic counts (Amaral et al. 1990) indicate that CA3 is a sparsely connected recurrent network. Another characteristic of the system is that the primary pyramidal cells are inhibited by a much smaller population of interneurons (Buzsaki and Eidelberg 1982). The relative scarcity of
© 1993 Massachusetts Institute of Technology
inhibitory cells also implies that inhibition is broadly directed, with each interneuron inhibiting a large number of primary cells in its vicinity. Few neural network models take these two aspects of CA3 into account. Motivated to understand the role played by these characteristics in the network's dynamical behavior, we have recently investigated a class of sparse, random networks with nonspecifically directed inhibition (Minai and Levy 1993a,b) and have found fixed point, cyclical, and effectively aperiodic behavior. We have also developed a simple model relating the level of network activity to parameters such as inhibition, firing threshold, and the strength of excitatory synapses. Several researchers have studied the dynamics of sparse, asymmetric networks within the framework of statistical mechanics (Derrida and Pomeau 1986; Sompolinsky and Kanter 1986; Derrida et al. 1987; Gutfreund and Mezard 1988; Kree and Zippelius 1991). One interesting conclusion to emerge from some studies is that, above some critical value, both sparseness (Kürten 1988) and asymmetry (Spitzner and Kinzel 1989a,b; Nützel 1991) lead to a network dynamics that may be called effectively aperiodic. In this paper, we present a simplification of our model based on an approximation to the standard error function. This simplification allows accurate prediction of network activity using a closed-form equation. The level of activity in a recurrent network is of crucial importance for learning. A low but consistent level of activity enables the network to recode its stimuli as sparse patterns, thus decreasing interference between the representations of different stimuli and increasing capacity. Our model demonstrates how a CA3-like network without synaptic modification can control its level of activity using inhibition.
2 Network Specification and Firing Probability
Our network model is qualitatively similar to the associative memory model proposed by Marr (1971) and later investigated by others (Gardner-Medwin 1976; McNaughton and Nadel 1989; Willshaw and Buckingham 1990; Gibson and Robinson 1992). A network consists of n binary (0/1) primary neurons, each with identical firing threshold θ. The network's connectivity is generated through a Bernoulli process. Each neuron i has probability p of receiving a fixed excitatory connection of strength w from each neuron j (including itself). The existence/nonexistence of such a connection is indicated by the 1/0 random variable c_ij, where P(c_ij = 1) = p. Inhibition is provided by a single interneuron that takes input from all primary neurons and provides an identical shunting conductance, proportional to its input, to all primary neurons. Defining K as the inhibitory weight and m(t) as the number of active neurons at time t,
the excitation y_i and output z_i of neuron i at time t are given by

y_i(t) = w Σ_j c_ij z_j(t-1) / [w Σ_j c_ij z_j(t-1) + K m(t-1)]     (2.1)

z_i(t) = 1 if y_i(t) ≥ θ, and 0 otherwise     (2.2)

with y_i(t) = 0 for all i if m(t-1) = 0, so that once the network becomes totally inactive, it cannot spontaneously reactivate itself. Substituting equation 2.2 in equation 2.1 and defining α = θK/(1-θ)w, the condition for firing is obtained as

Σ_j c_ij z_j(t-1) ≥ α m(t-1)     (2.3)
which means that, to fire at time t, neuron i must have at least ⌈αm(t-1)⌉ active inputs, where ⌈x⌉ denotes the smallest integer not less than x. Note that, due to the effect of averaging, the right-hand side of equation 2.3 is independent of i, and the inequality represents a universal firing condition. The firing condition also demonstrates that it is the composite, dimensionless parameter α that determines the dynamics of the network. In effect, α represents the relative strength of inhibition and excitation in the network, weighted appropriately by the threshold. If m(t-1) = M, the average firing probability for a neuron i at the next time step t is

p(M; n, p, α) = Σ_{k=⌈αM⌉}^{M} C(M, k) p^k (1-p)^(M-k)     (2.4)

If M is sufficiently large, the binomial can be approximated by a gaussian, giving

p(M; n, p, α) ≈ (1/2)[1 - erf((⌈αM⌉ - Mp)/√(2Mp(1-p)))]     (2.5)

A reasonable criterion for applying this approximation is Mp > 5 and M(1-p) > 5. In sparse networks with p < 0.5, only the first condition is relevant, and the criterion is M > 5/p. Assuming that neurons fire independently, as they will tend to do in large and sparsely connected
networks (Minai and Levy 1993a), we obtain a stochastic return map relating m(t) to m(t-1):

m(t) = n p[m(t-1)] + O(√n)     (2.6)

Thus, the expected activity at time t is

⟨m(t) | m(t-1)⟩ = n p[m(t-1)]     (2.7)

In the long term, the activity is attracted to 0 or to an O(√n) region around m̄, the point satisfying the fixed-point condition m̄ = n p(m̄). Alternatively, one might look at the activity level r(t) ≡ n⁻¹ m(t) rather than the total activity m(t) to obtain the map

r(t) = p[n r(t-1)] + O(1/√n)     (2.8)

and expected activity level

⟨r(t) | r(t-1)⟩ = p[n r(t-1)]     (2.9)

Here r(t) is an instance of what Amari (1974) calls the macrostate of the system. The activity level fixed point of the network is defined as r̄ = n⁻¹ m̄. The qualitative behavior of the network depends in large measure on the value of r̄, as discussed in our earlier studies (Minai and Levy 1993a,b).

3 Approximating the Firing Probability
While equations 2.4 and 2.5 can be used to obtain the activity map (and thus the activity level map) for a network with specified parameters, it is useful to look for a closed form, both for ease of calculation and for analytical manipulation. Such a closed form can be found using the approximation (see, e.g., Hertz et al. 1991)

erf(x) ≈ tanh(2x/√π)     (3.1)

Substituting equation 3.1 in equation 2.5, we obtain

p(M; n, p, α) ≈ (1/2){1 - tanh[(2/√π)(⌈αM⌉ - Mp)/√(2Mp(1-p))]}     (3.2)

If M is large enough, ⌈αM⌉ ≈ αM, which simplifies equation 3.2 to

p(M; n, p, α) ≈ (1/2)[1 - tanh(T√M)]     (3.3)

where

T ≡ (α - p)√(2/(πp(1-p)))

Equation 3.3 shows that the average firing probability is an increasing function of M for α < p, a decreasing one for α > p, and constant at 0.5 if α = p (see Figure 1).
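To make the closed form concrete, here is a small Python sketch (our illustration, not the authors' code) that runs the network of Section 2 to a settled state and evaluates equation 3.3 at the resulting activity; note that the constant 2/√π in T is part of the reconstruction of equations 3.1-3.3 above, and the values of K and the initial condition are our choices.

import numpy as np

rng = np.random.default_rng(0)
n, p, theta, w, K = 1000, 0.05, 0.5, 1.0, 0.07
alpha = theta * K / ((1 - theta) * w)      # composite parameter of equation 2.3

C = (rng.random((n, n)) < p).astype(float) # Bernoulli connectivity c_ij
z = (rng.random(n) < 0.1).astype(float)    # initial active set

def step(z):
    m = z.sum()
    if m == 0:
        return np.zeros(n)                 # a dead network stays dead
    x = w * (C @ z)                        # recurrent excitation
    y = x / (x + K * m)                    # shunting inhibition, equation 2.1
    return (y >= theta).astype(float)      # threshold, equation 2.2

for _ in range(200):
    z = step(z)
empirical = z.mean()                       # activity level r(t) after settling

def p_fire(M):
    # Closed-form firing probability of equation 3.3 (tanh approximation).
    T = (alpha - p) * np.sqrt(2.0 / (np.pi * p * (1 - p)))
    return 0.5 * (1.0 - np.tanh(T * np.sqrt(M)))

predicted = p_fire(n * empirical)          # should be close to `empirical`

With these values (α = 0.07, p = 0.05), the fixed point of equation 3.3 sits near r̄ ≈ 0.15, which is the level the simulation should approach.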
Figure 1: The average firing probability at time t as a function of the activity at time t - 1 for a 1000 neuron network with p = 0.05. Each curve corresponds to a different value of α, as indicated on the graph. Equation 3.3 was used for the calculation. The point where the diagonal crosses each curve is the predicted stable activity level for the corresponding value of α.
4 Setting Parameter Values to Obtain a Stable Activity Level
The most useful application of this model is in specifying the average level of activity in a network. Since the activity level stabilizes around r̄ in the long term, we can use it as an estimate of the average activity level, ⟨r⟩, though, strictly speaking, there might be a slight discrepancy between the estimate and the actual value due to the asymmetric shape of the firing probability curve. As r(t) becomes more confined with increasing network size, this discrepancy becomes less and less significant.
To obtain a specific r̄ by setting α, we just need to find the α satisfying the activity fixed-point equation

m̄ = n p(m̄)     (4.1)

and substitute m̄ = nr̄. This gives

α(r̄) ≈ p + √(πp(1-p)/(2nr̄)) · tanh⁻¹(1 - 2r̄)     (4.2)
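Equation 4.2 is simple enough to evaluate directly; the sketch below (ours, with the same reconstructed constant as above) computes the α needed for a target activity level.

import numpy as np

def alpha_for(r_bar, n, p):
    # Equation 4.2: alpha needed to hold the activity level fixed point near r_bar.
    return p + np.sqrt(np.pi * p * (1 - p) / (2 * n * r_bar)) * np.arctanh(1 - 2 * r_bar)

# Example: a 1000-neuron network with p = 0.05 held near 15% activity.
a = alpha_for(0.15, n=1000, p=0.05)   # roughly 0.07, consistent with Figure 1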
As long as r̄ is not too close to 0 or 1, this is an adequate method for setting α, as shown by the results in Section 6. Note that the useful range of α is bounded as 0 ≤ α ≤ 1; for values of α larger than 1, the firing condition (equation 2.3) shows that p(M) = 0 for all M. When r̄ is too low, m̄ is not large enough to allow the gaussian approximation of equation 2.5, and the model breaks down. However, as n grows, lower values of r̄ come within the range of satisfactory approximation. Using the criterion Mp > 5 for the gaussian approximation, we conclude that equation 4.2 can be applied for r̄ > 5/np.

5 Extension to the Multiple Interneuron Case

So far in our model, we have assumed that inhibition is mediated by a single interneuron. In this simplification, we follow previous studies such as those by Gardner-Medwin (1976) and Gibson and Robinson (1992). However, as we argue in this section, the neuron model of equation 2.1 is also consistent with a population of interneurons, provided that these interneurons respond faster than the primary neurons and are statistically identical. Thus, let there be a set I of N interneurons, where each interneuron l receives an input synapse of weight u from each primary cell j with a fixed probability γ and projects back to each j with probability λ and weight v. Also, let E denote the set of n primary cells. The net excitation to each l is given by
y_l(t) = u Σ_{j∈E} c_lj z_j(t-1)     (5.1)

where c_lj is a binary variable indicating the presence or absence of a connection from j to l. Since interneurons are postulated to be linear (see the Appendix), the output of l is

z_l(t) = C y_l(t)     (5.2)

where C is a constant. Based on physiological evidence that hippocampal interneurons respond faster than pyramidal cells (Buzsaki and Eidelberg 1982; McNaughton and Nadel 1989), we assume that l responds instantaneously to its input (unlike the primary cells, which take one time unit to respond). The inhibitory input to a primary cell i at time t is then given by

φ_i(t) = v Σ_{l∈I} c_il z_l(t)     (5.3)

If γm(t-1) is large enough, the distribution of z_l(t) for each l ∈ I will approximate a gaussian with a mean value of uCγm(t-1). By the central limit effect, then, φ_i(t) will also be approximately normally distributed with mean uvCγλN m(t-1). Thus, we can rewrite equation 2.1 as

y_i(t) = w Σ_j c_ij z_j(t-1) / [w Σ_j c_ij z_j(t-1) + K m(t-1) + η(u, v, C, γ, λ, m(t-1), N)]     (5.4)

where K = uvCγλN and η(u, v, C, γ, λ, m(t-1), N) is a random fluctuation that is O(√(Nm(t-1))) and becomes increasingly insignificant as Nm(t-1) increases. Thus, equation 2.1 represents a reasonable approximation for equation 5.4 in large networks with a few thousand primary cells and a few hundred inhibitory neurons, especially if λ is not too small. It should be noted that the multiple interneuron scheme described above is equivalent to having nonuniformly distributed random inhibitory connections between primary neurons, albeit with a shunting effect. Calculations from intracellular studies of inhibition in CA3 (Miles 1990) suggest that each pyramidal cell receives inhibitory synapses of widely varying strengths from 10 to 50 interneurons in its general neighborhood. The same study indicates that a specific inhibitory interneuron makes synapses of roughly equal strength with a large proportion of pyramidal cells in its neighborhood. Together, these factors suggest that the inhibition to neighboring pyramidal cells is highly correlated in amplitude and phase (Miles 1990). Thus, our model does capture part of the inhibitory structure in CA3, though more realistic models will probably be needed as the physiology becomes clearer.
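For completeness, here is a sketch of the multiple-interneuron scheme just described (again our illustration; the gain C and the other numerical values are arbitrary choices), computing the inhibitory drive explicitly rather than through its mean-field value K m(t-1).

import numpy as np

rng = np.random.default_rng(1)
n, N = 1000, 50                          # primary cells and interneurons
p, gamma, lam = 0.05, 0.1, 0.5           # connection probabilities p, gamma, lambda
w, u, v, theta = 1.0, 1.0, 1.0, 0.5
Cgain = 0.028                            # interneuron gain C, chosen so K is near 0.07

C_exc = (rng.random((n, n)) < p).astype(float)      # primary -> primary, weight w
C_in  = (rng.random((N, n)) < gamma).astype(float)  # primary -> interneuron, weight u
C_out = (rng.random((n, N)) < lam).astype(float)    # interneuron -> primary, weight v

K = u * v * Cgain * gamma * lam * N      # mean-field inhibitory weight of eq. 5.4

def step(z):
    m = z.sum()
    if m == 0:
        return np.zeros(n)
    z_int = Cgain * u * (C_in @ z)       # linear interneuron outputs, eq. 5.2
    phi = v * (C_out @ z_int)            # per-cell inhibitory input, eq. 5.3
    x = w * (C_exc @ z)
    y = x / (x + phi)                    # eq. 5.4 with the fluctuations left in
    return (y >= theta).astype(float)

z = (rng.random(n) < 0.1).astype(float)
for _ in range(200):
    z = step(z)
# z.mean() should sit near the level that equation 4.2 predicts for this K.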
6 Results and Discussion
To show that the relationship between α and r̄ as expressed in equation 4.2 can be used to estimate the average activity level, we simulated a number of 300 and 1000 neuron networks and obtained empirical data to compare against the model. We obtained the average activity level, ⟨r⟩, by running each network for 2000 steps and averaging its activity level over the last 1000 of these steps. We simulated seven different, randomly generated networks for each value of α and averaged the ⟨r⟩s obtained in the seven cases. The results for n = 300 and n = 1000 are shown in Figure 2a and b, respectively. It is clear from the graphs that r̄, as calculated from equation 4.2, is a good estimator for ⟨r⟩ when the activity level is not too high (> 0.8 or so) or too low (< 0.15 or so). One notable feature of the data is the large variance in the empirical average activity
Figure 2: The predicted and empirical activity level fixed points for different values of α in (a) a 300 neuron network, and (b) a 1000 neuron network, with p = 0.05. The solid line shows the curve predicted by equation 4.2, while the bullets show the empirical values. The activity level is averaged over seven different networks for each α value. All runs were started from the same initial condition, and the last 1000 steps of a 2000 step simulation were used in each case. It is clear that the model works very well when r̄ is not too low or too high. The large error bars at low activity levels are due to the fact that some networks switched off while others converged to low activity cycles.
level at high α. This is due to the fact that some networks with these α values switched off in the first 1000 steps while others settled down to relatively short limit cycles at the activity level predicted by equation 4.2. This trivial network behavior was discussed in our earlier study (Minai and Levy 1993a). Suffice it to say that this phenomenon is mediated by very low-activity states, and we expect it to become less significant in
larger networks, where even low-activity states have a large number of active neurons. Figure 3 plots the empirically measured average activity level of a network with 1000 excitatory neurons and 50 inhibitory neurons for various values of C with fixed u, v, γ, and λ. The solid curve indicates the prediction generated using equation 4.2 with K = uvCγλN. The results at activity levels above 0.5 are not as good as in the single interneuron case, mainly because the variance in the inhibitory term is relatively high, but, comparing Figure 3 with Figure 2b, it is clear that the model of equation 4.2 works better in the multiple interneuron case when the activity level is low, presumably because the averaging in the inhibitory term makes simultaneous switch-off of all neurons less likely. Since the overall performance of the model should improve as n and N increase, there is reason to expect that equation 4.2 can be used to predict activity levels in large networks with multiple interneurons and very low activity.
7 Biological Considerations

The results given above show that equation 4.2 is a good model for relating average activity to α. However, the network model is supposed to be a representation of biological networks such as CA3, and it is important to put it in its full biological context. Above all, it is necessary to demonstrate that the model can be applied to networks of realistic size without running into problems. The most obvious potential problem, as implied by equation 4.2 and borne out by a comparison of Figure 2a and b, is that the dependence of r̄ on α tends toward a step function as n increases. In infinitely large networks, therefore, all neurons fire if α is less than p and none fire if it is greater than p, leading to an all-or-none activity situation. Even in large finite-sized networks, however, obtaining a moderate activity level is problematic unless α is set very precisely. Since α is in arbitrary units, it is difficult to say exactly what degree of precision is prohibitive, but, as shown by equation 4.2, activity in large networks is easier to control at the lower end of the activity level spectrum. This is consistent with the physiological observation that activity levels in the rat CA3 are typically less than 1% (Thompson and Best 1989). Since Amaral et al. (1990) estimate that the rat CA3 has 300,000 or so pyramidal cells and a connectivity of around 1.9%, we calculate the relationship between r̄ and α for n = 300,000 and p = 0.02 as predicted by our model (Fig. 4). The activity levels shown are well above the value of 0.0008 needed for the gaussian approximation according to the criterion given earlier. It is apparent that r̄ varies smoothly with α in the range shown, and α can, therefore, be used to control activity at typical CA3 levels. In the biological context, one must also try to account for the effects of synaptic modification. Whereas our model treats all excitatory synapses
Figure 3: The predicted and empirical activity level fixed points for different values of α in a network with 1000 primary neurons and 50 inhibitory interneurons. The connectivity parameters are p = 0.05, γ = 0.1, and λ = 0.5. The firing threshold is set to 0.5 and the connection weights are w = 1.0, u = 1.0, and v = 1.0. The values of w and θ mean that α = K. The bullets indicate the empirically obtained average activities, while the solid line shows the curve predicted by equation 4.2 for a 1000 neuron network with a single interneuron and calculated using K = uvCγλN. The activity level is averaged over seven different networks for each value. All runs were started from the same initial condition, and the last 1000 steps of a 2000 step simulation were used in each case. The model works adequately for moderate activity values, and much better than in the single interneuron case for low activities. Its overall performance should improve when the number of neurons and interneurons is larger.
Figure 4: The relationship between the parameter α and the activity level fixed point r̄, as predicted by equation 4.2, for a network with 300,000 primary cells and a connectivity of p = 0.02. These values correspond roughly to those found in the rat CA3 region. The range of activity levels shown is also consistent with that seen in the rat CA3. It is clear that at these low activity levels, α is an effective means of controlling the average activity level of the network.
as identical, real synapses have varying strengths. Furthermore, due to the effects of synaptic modification, highly potentiated synapses on a primary cell would tend to be activated in correlated groups, not at random as in our model. This is both a cause and, through recurrence, an effect of correlated firing among the neurons themselves. The variation in synaptic strength and correlated input activity means that the excitation, w Σ_j c_ij z_j(t-1), to a typical CA3 primary cell i will not necessarily have a unimodal gaussian distribution, as is the case in our model. This more
complex distribution of excitation to different cells will also alleviate the all-or-none activity problem described in the previous paragraph. Thus, it is best to see the random case treated in this paper as the starting point of the synaptic modification process, and the dynamics given by our model as the a priori or intrinsic dynamics of a CA3-like network. Synaptic modification can then be considered a "symmetry-breaking" process that gradually distorts the intrinsic dynamics into the spatiotemporal patterns implicit in the received environmental information. There have been some studies of synaptic modification in networks similar to ours, but only in the context of associative memory (Marr 1971; Gardner-Medwin 1976; Palm 1980; Gibson and Robinson 1992). Our main interest lies in the temporal aspects of network behavior and not just in the stability properties required by associative memory. Finally, we turn to another interesting aspect of our model: the assumption of continuous-valued (linear) inhibition. There is some evidence that the response of inhibitory interneurons in the hippocampus differs from that of the primary cells in more than just its speed. Interneurons have very low response thresholds and respond with multiple spikes whose number (and onset latency) is directly related to the intensity of the stimulus to the interneuron (Buzsaki and Eidelberg 1982). When integrated postsynaptically, these spike trains could represent an almost continuous-valued signal with a significant dynamic range. Of course, this does not imply that the interneuron's response is linear in its stimulus, as we, and others, have assumed. However, the analysis developed in this paper can be extended to some kinds of nonlinear inhibitory schemes, as described in the Appendix.

8 Conclusion
The mammalian hippocampus is a complex system and is probably involved in highly abstract information processing tasks such as recoding, data fusion, and prediction (Levy 1985, 1989; Rolls 1989; Rolls and Treves 1990; McNaughton and Nadel 1989). With its recurrent connectivity, it is natural to expect that the CA3 region plays an important role in any temporal processing done by the hippocampus (Levy 1989). In this paper, we have studied a recurrent network with CA3-like characteristics and have presented a model for relating its average activity level to parameters such as neuron firing threshold and the strength of inhibition. Using simulations, we have demonstrated that this model successfully relates network parameters to the activity level over a reasonable range, and that it is easily extended to situations with multiple inhibitory interneurons. Our model thus provides an understanding of the intrinsic dynamics of the untrained random network, and from this we can proceed to the problem of temporal learning and prediction through synaptic modification.
Appendix: Nonlinear Inhibition

We could rewrite equation 5.2 more generally as Z(t) = g[m(t - 1)], where g(x) is an appropriate monotonically nondecreasing function (e.g., a sigmoid, as is often the case in artificial neural networks). Furthermore, since there is evidence of spontaneous, low-intensity firing in interneurons (Buzsaki and Eidelberg 1982), one could postulate an inhibition of the general form Kg[m(t - 1)] + K₀, where K₀ is a small constant offset. The analysis given in the paper for linear inhibition transfers directly to the nonlinear case with offset; equation 2.1 then becomes equation A.1. Following the logic used in the linear case, and applying the hyperbolic tangent approximation to the error function, we get equation A.2, where m(t - 1) = M and θ̂ = θK/(1 - θ)w. As before, we have taken ⌈θ̂g(M)⌉ ≈ θ̂g(M). This, in turn, leads to a relationship between α and r, albeit a slightly more complicated one than in the linear case (equation A.3). The extension of this more general case to multiple, statistically identical interneurons is straightforward, and leads essentially to equations A.2 and A.3 for large networks if g(x) is reasonably well-behaved.
Acknowledgments

This research was supported by NIMH MH00622 and NIMH MH48161 to W.B.L., and by the Department of Neurosurgery, University of Virginia, Dr. John A. Jane, Chairman. The authors would like to thank Dawn Adelsberger-Mangan for her constructive comments. The paper has also benefited greatly from the suggestions of the reviewers.
References

Amaral, D. G., Ishizuka, N., and Claiborne, B. 1990. Neurons, numbers and the hippocampal networks. In Understanding the Brain through the Hippocampus: The Hippocampal Region as a Model for Studying Brain Structure and Function (Progress in Brain Research, Vol. 83), J. Storm-Mathisen, J. Zimmer, and O. P. Ottersen, eds., pp. 1-11. Elsevier, Amsterdam.
Amari, S. 1974. A method of statistical neurodynamics. Kybernetik 14, 201-215.
Buzsaki, G., and Eidelberg, E. 1982. Direct afferent excitation and long-term potentiation of hippocampal interneurons. J. Neurophysiol. 48, 597-607.
Derrida, B., and Pomeau, Y. 1986. Random networks of automata: A simple annealed approximation. Europhys. Lett. 1, 45-49.
Derrida, B., Gardner, E., and Zippelius, A. 1987. An exactly solvable asymmetric neural network model. Europhys. Lett. 4, 167-173.
Gardner-Medwin, A. R. 1976. The recall of events through the learning of associations between their parts. Proc. R. Soc. London B 194, 375-402.
Gibson, W. G., and Robinson, J. 1992. Statistical analysis of the dynamics of a sparse associative memory. Neural Networks 5, 645-661.
Gutfreund, H., and Mezard, M. 1988. Processing of temporal sequences in neural networks. Phys. Rev. Lett. 61, 235-238.
Gutfreund, H., Reger, J. D., and Young, A. P. 1988. The nature of attractors in an asymmetric spin glass with deterministic dynamics. J. Phys. A: Math. Gen. 21, 2775-2797.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Kree, R., and Zippelius, A. 1991. Asymmetrically diluted neural networks. In Models of Neural Networks, E. Domany, J. L. van Hemmen, and K. Schulten, eds., pp. 193-212. Springer-Verlag, New York.
Kurten, K. E. 1988. Critical phenomena in model neural networks. Phys. Lett. A 129, 157-160.
Levy, W. B. 1988. A theory of the hippocampus based on reinforced synaptic modification in CA1. Soc. Neurosci. Abstr. 14, 833.
Levy, W. B. 1989. A computational approach to hippocampal function. In Computational Models of Learning in Simple Neural Systems. The Psychology of Learning and Motivation, R. D. Hawkins and G. H. Bower, eds., Vol. 23, pp. 243-305. Academic Press, San Diego, CA.
Marr, D. 1971. Simple memory: A theory for archicortex. Phil. Trans. R. Soc. London B 262, 23-81.
McNaughton, B. L., and Nadel, L. 1989. Hebb-Marr networks and the neurobiological representation of action in space. In Neuroscience and Connectionist Theory, M. A. Gluck and D. Rumelhart, eds., pp. 1-63. Erlbaum, Hillsdale, NJ.
Miles, R. 1990. Variation in strength of inhibitory synapses in the CA3 region of guinea-pig hippocampus in vitro. J. Physiol. 431, 659-676.
Minai, A. A., and Levy, W. B. 1993a. The dynamics of sparse random networks. Biol. Cybernet. (in press).
Minai, A. A., and Levy, W. B. 1993b. Predicting complex behavior in sparse asymmetric networks. In Advances in Neural Information Processing Systems 5, pp. 556-563. Morgan Kaufmann, San Mateo, CA.
Nützel, K. 1991. The length of attractors in asymmetric random neural networks with deterministic dynamics. J. Phys. A: Math. Gen. 24, L151-157.
Rolls, E. T. 1989. Functions of neuronal networks in the hippocampus and neocortex in memory. In Neural Models of Plasticity, J. H. Byrne and W. O. Berry, eds., pp. 240-265. Academic Press, New York.
Rolls, E. T., and Treves, A. 1990. The relative advantages of sparse versus distributed encoding for associative neuronal networks in the brain. Network 1, 407-421.
Sompolinsky, H., and Kanter, I. 1986. Temporal association in asymmetric neural networks. Phys. Rev. Lett. 57, 2861-2864.
Spitzner, P., and Kinzel, W. 1989a. Freezing transition in asymmetric random neural networks with deterministic dynamics. Z. Phys. B: Condensed Matter 77, 511-517.
Spitzner, P., and Kinzel, W. 1989b. Hopfield network with directed bonds. Z. Phys. B: Condensed Matter 74, 539-545.
Thompson, L. T., and Best, P. J. 1989. Place cells and silent cells in the hippocampus of freely-behaving rats. J. Neurosci. 9, 2382-2390.
Treves, A., and Rolls, E. T. 1992. Computational constraints suggest the need for two distinct input systems to the hippocampal CA3. Hippocampus 2, 189-200.
Willshaw, D. J., and Buckingham, J. T. 1990. An assessment of Marr's theory of the hippocampus as a temporary memory store. Phil. Trans. R. Soc. London B 329, 205-215.
Received November 12, 1992; accepted May 26, 1993.
Communicated by Christof von der Malsburg
The Role of Constraints in Hebbian Learning

Kenneth D. Miller*
Division of Biology, Caltech 226-76, Pasadena, CA 91125 USA

David J. C. MacKay†
Computation and Neural Systems, Caltech 139-74, Pasadena, CA 91125 USA
Models of unsupervised, correlation-based (Hebbian) synaptic plasticity are typically unstable: either all synapses grow until each reaches the maximum allowed strength, or all synapses decay to zero strength. A common method of avoiding these outcomes is to use a constraint that conserves or limits the total synaptic strength over a cell. We study the dynamic effects of such constraints. Two methods of enforcing a constraint are distinguished, multiplicative and subtractive. For otherwise linear learning rules, multiplicative enforcement of a constraint results in dynamics that converge to the principal eigenvector of the operator determining unconstrained synaptic development. Subtractive enforcement, in contrast, typically leads to a final state in which almost all synaptic strengths reach either the maximum or minimum allowed value. This final state is often dominated by weight configurations other than the principal eigenvector of the unconstrained operator. Multiplicative enforcement yields a "graded" receptive field in which most mutually correlated inputs are represented, whereas subtractive enforcement yields a receptive field that is "sharpened" to a subset of maximally correlated inputs. If two equivalent input populations (e.g., two eyes) innervate a common target, multiplicative enforcement prevents their segregation (ocular dominance segregation) when the two populations are weakly correlated; whereas subtractive enforcement allows segregation under these circumstances. These results may be used to understand constraints both over output cells and over input cells. A variety of rules that can implement constrained dynamics are discussed.

Development in many neural systems appears to be guided by "Hebbian" or similar activity-dependent, correlation-based rules of synaptic modification (reviewed in Miller 1990a). Several lines of reasoning suggest

*Current address: Departments of Physiology and Otolaryngology, University of California, San Francisco, CA 94143-0444 USA.
†Current address: Radio Astronomy, Cavendish Laboratory, Madingley Road, Cambridge CB3 0HE, United Kingdom.

Neural Computation 6, 100-126 (1994)
@ 1993 Massachusetts Institute of Technology
that constraints limiting available synaptic resources may play an important role in this development. Experimentally, such development often appears to be competitive. That is, the fate of one set of inputs depends not only on its own patterns of activity, but on the activity patterns of other, competing inputs. A classic example is given by the experiments of Wiesel and Hubel (1965) on the effects of monocular versus binocular visual deprivation in young animals (see also Guillery 1972). If neural activity is reduced in one eye, inputs responding to that eye lose most of their connections to the visual cortex, while the inputs responding to the normally active, opposite eye gain more than their normal share of connections. If activity is reduced simultaneously in both eyes for a similar period of time, normal development results: each eye's inputs retain their normal cortical innervation. Such competition appears to yield a roughly constant final total strength of innervation regardless of the patterns of input activity, although the distribution of this innervation among the inputs depends on neural activities. Evidence of competition for a limited number of synaptic sites exists in many biological systems (e.g., Bourgeois et al. 1989; Hayes and Meyer 1988a,b, 1989a,b; Murray et al. 1982; Pallas and Finlay 1991).

The existence of constraints limiting synaptic resources is also suggested on theoretical grounds. Development under simple correlation-based rules of synaptic modification typically leads to instability. Either all synapses grow to the maximum allowed value, or all synapses decay to zero strength. To achieve the results found biologically, a Hebbian rule must instead lead to the development of selectivity, so that some synaptic patterns grow in strength while others shrink. Von der Malsburg (1973) proposed the use of constraints conserving the total synaptic strength supported by each input or output cell to achieve selectivity; related proposals were also made by others (Perez et al. 1975; Rochester et al. 1956; Rosenblatt 1961).

A constraint that conserves total synaptic strength over a cell can be enforced through nonspecific decay of all synaptic strengths, provided the rate of this decay is set for the cell as a whole to cancel the total increase due to specific, Hebbian plasticity. Two simple types of decay can be considered. First, each synapse might decay at a rate proportional to its current strength; this is called multiplicative decay. Alternatively, each synapse might decay at a fixed rate, independent of its strength; this is called subtractive decay. The message of this paper is that the dynamic effects of a constraint depend significantly on whether it is enforced via multiplicative or subtractive decay. We have noted this briefly in previous work (MacKay and Miller 1990a,b; Miller 1990a; Miller et al. 1989).

1 Simple Examples of the Effects of Constraints

A few simple examples will illustrate that strikingly different outcomes can result from the subtractive or multiplicative enforcement of a
constraint. The remainder of the paper will present a systematic analysis of these differences.

Consider synaptic plasticity of a single postsynaptic cell. Let w be the vector of synaptic weights onto this cell; the ith component, w_i, is the synaptic weight from the ith input. We assume synaptic weights are initially randomly distributed with mean winit, and are limited to remain between a maximum value wmax and minimum value wmin. We consider the effect of a constraint that conserves the total synaptic strength, Σ_i w_i, implemented either multiplicatively or subtractively.

Consider first a simple equation for Hebbian synaptic plasticity, (d/dt)w = Cw, where C is a matrix describing correlations among input activities (derived for example in MacKay and Miller 1990a). Suppose this correlation is a gaussian function of the separation of two inputs. We assume first that wmin = 0. The final outcomes of development under this equation are shown in Figure 1A. With no constraint (column 1), all synapses saturate at wmax, so all selectivity in the cell's response is lost. Under a multiplicative constraint (column 2), synaptic strengths decrease gradually from center to periphery. The final synaptic pattern in this case is proportional to the principal eigenvector of C. Under a subtractive constraint (columns 3 and 4), a central core of synapses saturate at or near strength wmax, while the remaining synapses saturate at wmin. If wmax is increased, or the total conserved synaptic strength wtot is decreased by decreasing winit, the receptive field is sharpened (column 4). In contrast, under the multiplicative constraint or without constraints, the shape of the final receptive field is unaltered by such changes in wmax and wtot. This sharpening of the receptive field under subtractive constraints occurs because all synapses saturate, so the final number of nonzero synapses is approximately wtot/wmax. Such sharpening under subtractive constraints can create a precise match between two spatial maps, for example, maps of auditory and of visual space, despite spatially broad correlations between the two maps (Miller and MacKay 1992).

If wmin is decreased below zero, center-surround receptive fields can result under subtractive constraints (Fig. 1B). In contrast, the results under multiplicative constraints or unconstrained dynamics are unaltered by this change. This mechanism of developing center-surround receptive fields underlies the results of Linsker (1986), as explained in Section 2.5. Again, an increase in wmax or decrease in wtot leads to sharpening of the positive part of the receptive field under subtractive constraints (column 4).

Finally, consider ocular dominance segregation (Miller et al. 1989) (Fig. 1C). We suppose the output cell receives two equivalent sets of inputs: left-eye inputs and right-eye inputs. A gaussian correlation function C describes correlations within each eye as before, while between the two eyes there is zero correlation; and wmin = 0. Now results are much as in (A), but a new distinction emerges. Under subtractive constraints, ocular dominance segregation occurs: the output cell becomes
Figure 1: Outcomes of development without constraints and under multiplicative and subtractive constraints. (A,B) Outcome of a simple Hebbian development equation: unconstrained equation is (d/dt)w = Cw. Initial synaptic weights are shown at the top left. The correlation matrix C is a gaussian function of the separation between two synapses (shown at top right). (A) wmin = 0; (B) wmin = -2. (C) Outcome of a similar equation but with two identical sets of inputs, representing left- and right-eye inputs. Within each eye, correlations are the same as in (A); between the eyes there is zero correlation. Unconstrained equations are (d/dt)wL = CwL; (d/dt)wR = CwR. All results are from simulations of a two-dimensional receptive field consisting of a diameter-13 circle of inputs drawn from a 13 x 13 square grid. The resulting receptive fields were approximately circularly symmetric; the figures show a slice horizontally through the center of the field. All simulations used wmax = 8; all except (B) used wmin = 0. The left three columns show results for winit = 1. The right column [subtractive (2)] uses winit = 0.5, which halves the conserved total synaptic strength wtot.
monocular, receiving input from only a single eye. Under multiplicative constraints, there is no ocular dominance segregation: the two eyes develop equal innervations to the output cell. Segregation under multiplicative constraints can occur only if there are anticorrelations between the two eyes, as will be explained in Section 2.6.

In summary, unconstrained Hebbian equations often lead all synapses to saturate at the maximal allowed value, destroying selectivity. Multiplicative constraints instead lead the inputs to develop graded strengths.
Subtractive constraints lead synapses to saturate at either the maximal or minimal allowed value, and can result in a sharpening to a few best-correlated inputs. They also can allow ocular dominance segregation to develop in circumstances where multiplicative constraints do not. These differences between subtractive and multiplicative constraints are easily understood, as we now show. A simple simulation of these contrasting outcomes is sketched below.
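As a concrete illustration, the following minimal sketch (Python; the network size, correlation width, rates, and step counts are illustrative choices, not the paper's actual simulation parameters) integrates (d/dt)w = Cw for a gaussian correlation matrix C under no constraint, under multiplicative (M1) enforcement, and under subtractive (S1) enforcement of the conserved total strength, with weights clipped to [wmin, wmax]:

```python
import numpy as np

# Sketch of single-cell development dw/dt = Cw under hypercube limits,
# comparing no constraint, M1, and S1 enforcement of the total-strength
# constraint (the gamma(w) and epsilon(w) of Section 2.1). Clipping at the
# weight limits breaks exact conservation, as in any bounded simulation.
rng = np.random.default_rng(0)
N = 25
x = np.arange(N)
C = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 3.0 ** 2))  # gaussian correlations
n = np.ones(N)                       # constraint vector (1,1,...,1)^T
w_min, w_max, dt = 0.0, 8.0, 0.01

def develop(mode, steps=20000):
    w = 1.0 + 0.1 * rng.standard_normal(N)    # fluctuations about winit = 1
    for _ in range(steps):
        dw = C @ w
        if mode == "M1":                       # gamma(w) = n.Cw / n.w
            dw -= (n @ dw) / (n @ w) * w
        elif mode == "S1":                     # epsilon(w) = n.Cw / n.n
            dw -= (n @ dw) / (n @ n) * n
        w = np.clip(w + dt * dw, w_min, w_max)
    return w

for mode in ("none", "M1", "S1"):
    w = develop(mode)
    frac_sat = np.mean((w <= w_min) | (w >= w_max))
    print(f"{mode:>4}: total = {w.sum():7.2f}, saturated fraction = {frac_sat:.2f}")
# Expected qualitative outcome: "none" saturates everything at w_max, M1 keeps
# a graded field proportional to the principal eigenvector, and S1 drives a
# central core to w_max with the remaining weights at w_min.
```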
2 Multiplicative and Subtractive Constraints for a Single Output Cell

We begin with a general linear synaptic plasticity equation without decays, (d/dt)w(t) = Cw(t). We assume that the matrix C is symmetric: in Hebbian learning, C_ij represents the correlation in activity between inputs i and j, so C_ij = C_ji.¹ Thus, C has a complete set of orthonormal eigenvectors e^a with corresponding eigenvalues λ^a (that is, Ce^a = λ^a e^a). Typically most or all of the eigenvalues of C are positive; for example, if C is the covariance matrix of the input activities then all its eigenvalues are positive. We use indices i, j to refer to the synaptic basis, and a, b to refer to the eigenvector basis. The strength of the ith synapse is denoted by w_i. The weight vector w can also be written as a combination of the eigenvectors, w = Σ_a w_a e^a, where the components of w in the eigenvector basis are w_a = w · e^a. We assume as before that the dynamics are linear up to hard limits on the synaptic weights, wmin ≤ w_i(t) ≤ wmax; we will not explicitly note these limits in subsequent equations.

2.1 Formulation of Multiplicative and Subtractive Constraints. By a multiplicative or subtractive constraint, respectively, we refer to a time-varying decay term γ(t)w or ε(t)n that moves w, after application of C, toward a constraint surface. We assume γ and ε are determined by the current weight vector w(t) and do not otherwise depend on t, so we write them as γ(w) or ε(w). Thus, the constrained equations are²
(d/dt)w(t) = Cw(t) - γ(w)w(t)   (Multiplicative Constraint)   (2.1)

(d/dt)w(t) = Cw(t) - ε(w)n   (Subtractive Constraint)   (2.2)
¹We work in a representation in which each synapse is represented explicitly, and the density of synapses is implicit. Equivalently, one may use a representation in which the synaptic density or "arbor" function is explicit (MacKay and Miller 1990a, App. A). Then, although the equation governing synaptic development may appear nonsymmetric, it can be symmetrized by a coordinate transformation. Thus, the present analysis also applies in these representations, as further described in Miller and MacKay (1992, App. B).

²To understand why the term -γ(w)w(t) represents a multiplicative constraint, consider a multiplicatively constrained equation w(t + Δt) = β(w)[w(t) + Cw(t)Δt], where β(w) achieves the constraint. This is identical to [w(t + Δt) - w(t)]/Δt = Cw(t) - γ(w)w(t + Δt) where γ(w) = [1 - β(w)]/β(w)Δt. For Δt → 0 this becomes equation 2.1.
The vector n is a constant. Typically, all synapses have equal subtractive decay rate, so n = (1,1,...,1)^T in the synaptic basis.

Multiplicative or subtractive constraints represent two methods of enforcing a constraint, that is, of maintaining the weight vector on some constraint surface. We now consider the type of constraint to be enforced. We will focus on two types. First, a constraint may conserve the total synaptic strength Σ_i w_i, as in Section 1. We refer to this as a type 1 constraint, and to a multiplicative or subtractive constraint of this type as M1 or S1, respectively. These are frequently used in modeling studies (e.g., M1: Grajski and Merzenich 1990; von der Malsburg 1973, 1979; von der Malsburg and Willshaw 1976; Perez et al. 1975; Rochester et al. 1956; Whitelaw and Cowan 1981; Willshaw and von der Malsburg 1976, 1979; S1: Miller 1992; Miller et al. 1989).³ We define a type 1 constraint more generally as one that conserves the total weighted synaptic strength, Σ_i w_i n_i = w · n, where n is a constant vector. Typically, n = (1,1,...,1)^T. A type 1 constraint corresponds to a hyperplane constraint surface. For an S1 constraint, we choose the subtracted vector n in equation 2.2 to be the same as this constraint vector n. This means we consider only subtractive constraints that project perpendicularly onto the constraint surface. Then type 1 constraints can be achieved by choosing
M1: γ(w) = n · Cw / n · w   [with n · w(t = 0) ≠ 0]   (2.3)

S1: ε(w) = n · Cw / n · n   (2.4)
These choices yield n · (d/dt)w = (d/dt)(n · w) = 0 under equations 2.1 or 2.2, respectively.

Second, we consider a constraint that conserves the sum-squared synaptic strength, Σ_i w_i² = w · w. This corresponds to a hypersphere constraint surface. We refer to this as a type 2 constraint (the numbers "1" and "2" refer to the exponent p in the constrained quantity Σ_i w_i^p). This constraint, while not biologically motivated, is often used in theoretical studies (e.g., Kohonen 1989; Oja 1982). We will consider only multiplicative enforcement of this constraint,⁴ called M2. M2 can be achieved
³Many of these models used nonlinearities other than hypercube limits on synaptic weights. Our results nonetheless appear to correctly characterize the outcomes in these models.

⁴Subtractive enforcement, S2, does not work in the typical case in which the fixed points are unstable. The constraint fails where n is tangent to the constraint hypersphere (i.e., at points where w · n = 0). Such points form a circumference around the hypersphere. The S2 dynamics flow away from the unstable fixed points, at opposite poles of the hypersphere, and flow into this circumference unless prevented by the bounds on synaptic weights.
Table 1: Abbreviations used.ᵃ

Type 1 constraint: Conserves the total synaptic strength, Σ_i w_i = w · n
Zero-sum vector: A vector w with zero total synaptic strength: Σ_i w_i = w · n = 0
M1: Multiplicatively enforced type 1 constraint
S1: Subtractively enforced type 1 constraint, using perpendicular projection onto the constraint surface
Type 2 constraint: Conserves the length of the weight vector, Σ_i w_i² = w · w
M2: Multiplicatively enforced type 2 constraint

ᵃThe typical case n = (1,1,...,1)^T is used to describe type 1 constraints.
by choosing

M2: γ(w) = w · Cw / w · w   (2.5)
This yields 2w · (d/dt)w = (d/dt)(w · w) = 0 under equation 2.1. The abbreviations introduced in this section are summarized in Table 1.

2.2 Projection Operators. Each form of constrained dynamics can be written (d/dt)w = PCw, where P is a projection operator that projects the unconstrained dynamics onto the constraint surface. For S1, the projection operator is P = 1 - (nn^T/n · n); for M1, it is P = 1 - (wn^T/w · n); and for M2, it is P = 1 - (ww^T/w · w). We can write these operators as P = 1 - (sc^T/s · c), where s is the subtracted vector, c the constraint vector, and 1 the identity matrix (Fig. 2). The projection operator removes the c component of the unconstrained derivative Cw, through subtraction of a multiple of s. Thus, the subtracted vector s represents the method of constraint enforcement: s = w for multiplicative constraints, while s = n for subtractive constraints. The constraint vector c determines the constraint that is enforced: the dynamics remain on the constraint surface w · c = constant. Given a constraint surface, there are two "natural" methods of constraint enforcement: projection perpendicular to the surface (s = c), or projection toward the origin (s = w). For a type 1 constraint, these lead to different dynamics: S1 is perpendicular projection, while M1 is projection along w. For a type 2 constraint, these are identical: M2 is both perpendicular projection and projection along w.

2.3 Dynamic Effects of Multiplicative and Subtractive Constraints. In this section, we characterize the dynamics under M1, S1, and M2 constraints. In Section 2.3.1, we demonstrate that under S1, the dynamics
Figure 2: Projection onto the constraint surface. The projection operator is P = 1 - (sc^T/s · c). This acts on the unconstrained derivative Cw by removing its c component, projecting the dynamics onto the constraint surface: c · PCw = c · (d/dt)w = 0. This constraint surface is shown as the line perpendicular to c. The constraint is enforced through subtraction of a multiple of s: PCw = Cw - βs where β = c · Cw/c · s. For multiplicative constraints, s = w; for subtractive S1 constraints, s = c = n.
typically have no stable fixed point, and flow until all synapses are saturated; while under multiplicative constraints, the principal eigenvector e^0 of C is a stable fixed point of the dynamics, provided that it satisfies the constraint. In Section 2.3.2, we characterize the conditions under which multiplicatively constrained dynamics flow to the principal eigenvector fixed point. In Section 2.3.3, we characterize the outcome under S1 constraints in terms of the eigenvectors of C.

We begin by illustrating in Figure 3 the typical dynamics under M1, S1, and M2 in the plane formed by the principal eigenvector of C, e^0, and one other eigenvector with positive eigenvalue, e^1. Figure 3 illustrates the main conclusions of this section, and may be taken as a visual aid for the remainder:
- In Figure 3A we illustrate M1 and S1 dynamics in the case in which the principal eigenvector e^0 is close in direction to the constraint vector n (n is the vector perpendicular to the constraint surface). This is typical for Hebbian learning when there are only positive correlations and the total synaptic sum is conserved, as in the examples of Section 1. Positive correlations lead to a principal eigenvector in which all weights have a single sign; this is close in direction to the usual constraint vector (1,1,...,1)^T, which conserves total synaptic strength. In this case, growth of e^0 would violate the constraint. Multiplicative and subtractive constraints lead to very
different outcomes: multiplicative constraints lead to convergence to e^0, whereas subtractive constraints lead to unstable flow in a direction perpendicular to n. The outcome in this case was illustrated in Figure 1A,B.
- In Figure 3B we illustrate M1 and S1 dynamics in the case in which the principal eigenvector e^0 is parallel to the constraint surface: e^0 · n = 0. We call such vectors w, for which w · n = 0, zero-sum vectors. Growth of a zero-sum vector does not violate the type 1 constraint. For practical purposes, any vector that is approximately parallel to the constraint surface, so that it intersects the surface far
outside the hypercube that limits synaptic weights, may be treated as a zero-sum vector. The principal eigenvector is typically a zero-sum vector in Hebbian learning when correlations among input activities oscillate in sign as a function of input separation (Miller 1990a). Such oscillations lead to a principal eigenvector in which weights oscillate in sign, and sum approximately to zero; such a vector is approximately perpendicular to the constraint vector (1,1,...,1)^T. In this case, growth of e^0 does not violate the constraint. The type of constraint enforcement makes little difference: the weight vector typically flows to a saturated version of e^0.
- Under M2 constraints (Fig. 3C), the principal eigenvector e^0 is always perpendicular to the constraint surface, and its growth would always violate the constraint. The dynamics converge to e^0.
2.3.1 General Differences between Multiplicative and Subtractive Constraint Enforcement: Fixed Points and Stability. We now establish the essential difference between multiplicative and subtractive constraints. To do so, we examine the locations and stability of the fixed points that are in the interior of the hypercube of allowed synaptic weights ("interior fixed points").
Figure 3: Facing page. Dynamics under multiplicative and subtractive constraints. Dynamics in the plane formed by the principal eigenvector of C, e^0, and one other eigenvector with positive eigenvalue, e^1. (A) M1 and S1 constraints when e^0 is close in direction to n. Diagonal lines indicate the constraint surface on which n · w is constant. Unconstrained: arrows show the unconstrained derivative Cw from the point w at the base of the arrow. M1: Solid arrows show the unconstrained flow; dashed arrows show the return path to the constraint surface (as in Fig. 2). Return path is in the direction w. Open circle indicates unstable fixed point, large filled circle indicates stable fixed point. The fixed points are the eigenvectors, where Cw ∝ w. S1: Return path is in the direction n. The fixed point is the point where Cw ∝ n (indicated by perpendicular symbol), and is unstable. Second row: The resulting constrained flow along the constraint surface for M1 and S1. (B) M1 and S1 constraints when e^0 is perpendicular to n. The constraint surface does not intersect e^0. M1 and S1 lead to similar outcomes: unstable growth occurs, predominantly in the e^0 direction, until the hypercube that limits synaptic weights is reached. The outcome is expected to be a saturated version of ±e^0. Note that the unconstrained dynamics also flow predominantly in the ±e^0 direction and so should lead to a similar outcome. For convenience, we have chosen the constraint direction n = e^1. (C) M2 constraints. The return path is in the direction of w, as for M1. Thus, locally (for example, near the fixed points) the dynamics are like M1. On a large scale, the dynamics differ because of the difference in constraint surface. Left, unconstrained derivative and return path; right, constrained flow. Figures were drawn using eigenvalues λ^0/λ^1 = 3; constraint vector in A: n · e^0/n · e^1 = 1.5.
points"). A fixed point w"'is a point where the flow ( d / d t ) w = 0. For C symmetric, the constrained dynamics must either flow to a stable interior fixed point or else flow to the hypercube. We will show that the only stable fixed point of multiplicatively constrained dynamics is the intersection of the principal eigenvector e" of C with the constraint surface; and that if eo intersects the constraint surface, then the dynamics typically converge to eo,as was illustrated in Section 1. Under subtractive S1 constraints, there is generally no stable fixed point within the hypercube. S1 dynamics typically are only stabilized once all synapses (or all but one) reach saturation at w,,, or wn,in. The locations of the interior fixed points follow trivially from equations 2.1-2.2 and the fact that the dynamics remain on the constraint surface: 0
- The fixed points under a multiplicatively enforced constraint are the intersections of the eigenvectors of C with the constraint surface, that is, the points w on the constraint surface at which Cw ∝ w.
- The fixed points under a subtractively enforced constraint are the points w on the constraint surface at which Cw ∝ n.
The stability of a fixed point can be determined as shown in Figure 3A, by determining whether a point perturbed from the fixed point is taken farther away by the dynamics. Generalizing the reasoning illustrated there, it is easy to prove (Appendix):
Theorem 1. Under a multiplicatively enforced constraint, if the principal eigenvector of C is an interior fixed point, it is stable. Interior fixed points that are nonprincipal eigenvectors are unstable.

Theorem 2. Under an S1 constraint, if C has at least two eigenvectors with positive eigenvalues, then any interior fixed point is unstable.

A case of theorem 1 for M2 constraints was proven by Oja (1982). Theorem 2 shows that S1 dynamics are unstable when no synapse is saturated. If in addition the following condition holds, as occurs when C is a correlation matrix, then these dynamics remain unstable until all synapses have saturated (Appendix):
Theorem 3. Let i and j be indices in the synaptic basis. Suppose that for all i and j with i ≠ j, C_ii > |C_ij|. Then under an S1 constraint, either all synapses or all but one are saturated in a stable final condition.

This condition is satisfied for Hebbian models, because C_ij represents the correlation in activities between input i and input j. The result is sharply different from that for multiplicative constraints: in that case, the principal eigenvector may be a stable fixed point with no synapse saturated. This theorem generalizes a result proven by Linsker (1986).
Theorem 3 explains the sharpening of the receptive field that occurs under an S1 constraint (Fig. 1). A practical implication is that an upper limit on synaptic strengths, wmax, is needed for stability under an S1 constraint (whereas no such limit is needed under a multiplicative constraint). If there is no upper synaptic limit, eventually one synapse will acquire all of the allowed synaptic strength, while all other synapses will become saturated at wmin. The sketch below illustrates this numerically.
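The following small sketch (Python, illustrative parameters, reusing the gaussian C of the sketch in Section 1) suggests this behavior numerically: with only a lower bound on the weights, S1 development keeps sharpening until a single synapse carries essentially all of the strength.

```python
import numpy as np

# Sketch of Theorem 3's practical implication: under S1 with w_min = 0 and no
# upper weight limit, development sharpens until one synapse carries nearly
# all of the synaptic strength. Parameters are illustrative.
rng = np.random.default_rng(5)
N = 25
x = np.arange(N)
C = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 3.0 ** 2))
n = np.ones(N)
w = 1.0 + 0.1 * rng.standard_normal(N)
dt = 0.01
for _ in range(100000):
    dw = C @ w
    dw -= (n @ dw) / (n @ n) * n       # S1: perpendicular projection
    w = np.maximum(w + dt * dw, 0.0)   # lower bound only; no w_max
    w /= w.sum()                       # rescale for numerical safety only: the
                                       # update is positively homogeneous in w,
                                       # so rescaling does not alter the outcome
print("largest weight / total strength:", w.max() / w.sum())  # approaches 1
```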
2.3.2 The Outcome under Multiplicative Constraints. From the previous section, we conclude that a multiplicatively enforced constraint results in convergence to the principal eigenvector e^0 of C provided (1) the principal eigenvector intersects the constraint surface within the hypercube of allowed weights, forming a stable fixed point; and (2) the initial weight vector is within the basin of attraction of this fixed point. We ignore the possible effect of the hypercube, and assess when these conditions are met. Under M2 constraints, both of these conditions are always satisfied (Fig. 3C), so M2 constraints always lead to convergence. Under M1 constraints, condition (1) is satisfied when e^0 is nonzero-sum (Fig. 3A). Then, for n = (1,1,...,1)^T, condition (2) is also satisfied in at least two typical cases: if both the initial weight vector and the principal eigenvector have no changes in sign, or if weights are initialized as small fluctuations about a nonzero mean.⁵ Thus, M1 constraints typically converge to e^0 when e^0 is nonzero-sum.

2.3.3 The Outcome under S1 Constraints. S1 constraints lead to altered, linear dynamics. To see this, write the subtractively constrained equation as (d/dt)w(t) = PCw(t) with P = 1 - n̂n̂^T; here, n̂ = n/|n|. Write w as the sum w(t) = Pw(t) + w_n n̂, where w_n = w · n̂ is conserved. Then the dynamics can be written:
-w(t) dt
=
+
PCPw(t) 7o,,Pcn
(2.6)
PCP is the operator C restricted to the subspace of zero-sum vectors. w_n PCn̂ is a constant vector. Thus, S1 constraints lead to linear dynamics driven by PCP rather than by C. These dynamics have been characterized in MacKay and
⁵Let βe^0 be the stable fixed point, and let w₀ be the initial weight vector on the constraint surface. Condition (2) is satisfied if w₀ · (βe^0) > 0. Suppose the constraint conserves w · n = w_n, so that w₀ · n = βe^0 · n = w_n. Then if w₀ and e^0 are each single-signed, they must have the same sign, so w₀ · (βe^0) > 0. If weights are initialized as small fluctuations about a nonzero mean, then w₀ ≈ w_n n̂, so w₀ · (βe^0) ≈ w_n² > 0.
Miller (1990a, Appendices B and E). To understand them, consider first the eigenvectors of PCP. These are of two types:

1. Any zero-sum eigenvector of C is also an eigenvector of PCP with identical eigenvalue. So zero-sum eigenvectors of C grow freely, at the same rate as they would in the absence of constraints.
2. Each nonzero-sum eigenvector of C is replaced by a corresponding zero-sum eigenvector of PCP with smaller eigenvalue;⁶ for example, an all-positive, centrally peaked nonzero-sum eigenvector of C may be replaced by a center-surround (positive center and negative surround) zero-sum eigenvector of PCP. Eigenvalue order of the nonzero-sum eigenvectors is preserved under this correspondence. (A small numerical check of this eigenstructure is sketched below.)
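These two cases can be checked numerically. The following minimal sketch (Python, with an illustrative gaussian C, not the paper's own computation) compares the eigenstructure of C with that of PCP:

```python
import numpy as np

# Sketch: eigenstructure of the constrained operator PCP versus that of C.
# With all-positive gaussian correlations, C's principal eigenvector is
# nonzero-sum; PCP replaces it with a zero-sum vector of smaller eigenvalue.
N = 25
x = np.arange(N)
C = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 3.0 ** 2))
n_hat = np.ones(N) / np.sqrt(N)
P = np.eye(N) - np.outer(n_hat, n_hat)   # projector onto zero-sum subspace
PCP = P @ C @ P

eval_C, evec_C = np.linalg.eigh(C)       # eigenvalues in ascending order
eval_P, evec_P = np.linalg.eigh(PCP)

print("principal eigenvalue of C  :", eval_C[-1])
print("principal eigenvalue of PCP:", eval_P[-1])                 # smaller
print("sum of C's principal eigenvector  :", evec_C[:, -1].sum())  # nonzero
print("sum of PCP's principal eigenvector:", evec_P[:, -1].sum())  # ~ 0
```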
Now consider the constant term w_n PCn̂. This term boosts the growth rate of the eigenvectors of PCP that compose it. These are the eigenvectors derived from the nonzero-sum eigenvectors of C. Thus, under S1 constraints, the dynamics may be dominated either by the principal zero-sum eigenvector of C, or by a zero-sum vector that replaces the principal eigenvector. Both vectors may be very different from the principal eigenvector of C.

2.3.4 Summary: The Outcome under M1, S1, and M2. Multiplicative and subtractive constraints lead to dramatically different outcomes in many cases. In particular, under a type 1 constraint, multiplicative constraints converge to a dominant nonzero-sum pattern, whereas subtractive constraints suppress such a pattern in favor of a zero-sum pattern. We may summarize as follows:
1. If the principal eigenvector e^0 of C is a nonzero-sum vector and intersects the constraint surface within the hypercube, as is typical for Hebbian learning when there are only positive correlations, then
a. M1 constraints lead to a stabilized version of e^0;
b. S1 constraints lead to a zero-sum vector that grows to complete saturation, superimposed on the constrained background (w · n̂)n̂. The dominant zero-sum vector may be either
i. A zero-sum vector derived from e^0, as in Figure 1A,B; or
ii. The principal zero-sum eigenvector of C.
2. If the principal eigenvector e^0 of C is a zero-sum vector, as is typical for Hebbian learning when correlations among input activities oscillate in sign, then a type 1 constraint has little effect; the unconstrained dynamics, M1 constraints, and S1 constraints all lead to saturated versions of e^0.
3. M2 constraints always lead to a stabilized version of the principal eigenvector of C, unless the hypercube limiting synaptic weights interferes with the dynamics.

⁶There is one exception: the nonzero-sum eigenvector of C with smallest eigenvalue is replaced by n, which is an eigenvector of PCP with eigenvalue 0.

2.4 What Is Maximized under Multiplicative and Subtractive Constraints? Under multiplicative constraints, the weight vector tends to a multiple of the principal eigenvector e^0 of C. This is the direction in weight space that maximizes ŵ^T C ŵ over all directions ŵ. This maximizes the mutual correlations among the weights; so most mutually correlated inputs are expected to retain representation. Under S1 constraints, for C symmetric, the dynamics maximize E = ½w^T PCPw + w_n w^T PCn̂ (Section 2.3.3). For w_n sufficiently small, the first term dominates, so the weight vector is dominated by the principal eigenvector ê^0 of PCP. This is the direction in weight space that maximizes ŵ^T C ŵ over all zero-sum directions ŵ. When n = (1,1,...,1)^T, ê^0 is a vector in which some subset of maximally correlated weights are set to positive values, while remaining weights are set to negative values. In the final weight structure, positive weights in ê^0 tend to wmax, while negative weights tend to wmin. The receptive field thus becomes sharpened to a subset of maximally correlated inputs.
2.5 Application to Simple Hebbian Learning, Including Linsker's Simulations. The results just derived explain the outcome for simple Hebbian learning with a positive correlation function (Fig. 1A,B). M1 constraints lead to convergence to the principal eigenvector e^0, which is all-positive. S1 constraints instead lead to growth of a zero-sum vector; in Figure 1A,B, this vector is a center-surround vector derived from e^0.

The results of Section 2.3.3 explain some of the results found by Linsker (1986). He explored Hebbian dynamics under S1 constraints with a gaussian correlation function, using wmin = -wmax and a spatially gaussian distribution of inputs. Then, as analyzed in MacKay and Miller (1990a,b), the leading eigenvector of C is an all-positive eigenvector, and the zero-sum vector derived from this is a center-surround vector. The leading zero-sum eigenvectors of C, and leading eigenvectors of the constrained operator PCP, are two vectors that are bilobed, half positive and half negative. The all-positive eigenvector dominates the unconstrained development. For small values of w_n, the bilobed vectors dominate the constrained development. For larger values of w_n, the contribution of the w_n PCn̂ term to the growth of the center-surround vector allows the center-surround vector to dominate the constrained development within the hypercube, despite its having a smaller eigenvalue than the bilobed vectors under PCP.
2.6 Extension to Two Input Layers. We now consider the case in which two equivalent input layers innervate a common output cell (Fig. 1C). For example, in the visual system, inputs serving the left eye and right eye each project to the visual cortex, in an initially completely overlapping manner (Miller et al. 1989). Similarly, ON-center and OFF-center cells make initially equivalent projections to visual cortex (Miller 1992). Let w1, w2, respectively, be the synaptic weight vector from each input projection. Define the sum, wS = w1 + w2, and the difference, wD = w1 - w2. Because of the symmetry between the two input layers, the eigenvectors of the unconstrained equation can be divided into sum eigenvectors, wS = e_a^S, wD = 0, with eigenvalues λ_a^S; and difference eigenvectors, wD = e_a^D, wS = 0, with eigenvalues λ_a^D (Miller 1990a).

Now an additional critical distinction emerges between subtractive and multiplicative constraints. A type 1 constraint conserves the total synaptic strength Σ_i w_i^S. Patterns of wD have zero total synaptic strength (i.e., are zero-sum vectors). Therefore, wD grows freely under an S1 constraint, whereas under a multiplicative constraint growth of wD is suppressed unless a difference eigenvector is the principal eigenvector of the unconstrained development.

In models of ocular dominance segregation, the principal eigenvector is typically determined as follows (Miller 1990a). Let λ_max^S and λ_max^D be the largest sum and difference eigenvalues, respectively. If there are positive correlations between the activities of the two eyes, then λ_max^S > λ_max^D. If there are no between-eye correlations, then λ_max^S = λ_max^D. If these correlations are negative, then λ_max^D > λ_max^S. Thus, under a multiplicative constraint, wD cannot grow, and ocular dominance segregation cannot occur, unless the two eyes are negatively correlated⁷ (Miller et al. 1989). Such anticorrelations could be produced by intralaminar inhibition within the LGN. However, it seems unlikely that ocular dominance segregation depends on anticorrelations, since ocular dominance develops in the presence of vision in some animals, and vision should partially correlate the two eyes.

Under an S1 constraint, wD will grow if λ_max^D > 0. The dynamics under an S1 constraint may be dominated either by e_max^D, or by the zero-sum vector that derives from e_max^S, depending on which has the faster growth rate (Section 2.3.3). In practice, ocular dominance segregation develops under an S1 constraint even if there are positive between-eye correlations of moderate size relative to the within-eye correlations (unpublished observations). Thus, subtractive rather than multiplicative enforcement of constraints appears more appropriate for modeling Hebbian development in visual cortex. A two-eye variant of the earlier single-cell sketch appears below.

⁷If between-eye correlations are zero, then under multiplicative constraints the ratio of the principal eigenvector components, w_max^D/w_max^S, does not change under time development, while all other components are suppressed. Typically this ratio is initially small, so ocular dominance segregation does not occur.
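As a concrete illustration, here is a two-eye variant of the single-cell sketch from Section 1 (Python; parameters are illustrative, with zero between-eye correlation as in Fig. 1C):

```python
import numpy as np

# Sketch of the two-input-layer case: left- and right-eye weights develop
# under dw/dt = Cw within each eye, with zero between-eye correlation. M1
# preserves the small initial left/right imbalance, so the two eyes stay
# roughly balanced; under S1 the difference wD grows to saturation and the
# cell becomes monocular.
rng = np.random.default_rng(4)
N = 25
x = np.arange(N)
C = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 3.0 ** 2))
Z = np.zeros((N, N))
C2 = np.block([[C, Z], [Z, C]])      # block-diagonal: no between-eye correlation
n = np.ones(2 * N)
w_min, w_max, dt = 0.0, 8.0, 0.01

def develop(mode, steps=20000):
    w = 1.0 + 0.1 * rng.standard_normal(2 * N)
    for _ in range(steps):
        dw = C2 @ w
        if mode == "M1":
            dw -= (n @ dw) / (n @ w) * w
        else:                         # S1
            dw -= (n @ dw) / (n @ n) * n
        w = np.clip(w + dt * dw, w_min, w_max)
    return w

for mode in ("M1", "S1"):
    w = develop(mode)
    print(mode, "left-eye total:", round(w[:N].sum(), 2),
          "right-eye total:", round(w[N:].sum(), 2))
```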
3 Constraints Given a Full Layer of Output Cells

When modeling Hebbian learning in a full layer of output cells, two differences arise compared to the case of an isolated cell. First, constraints may be applied to the total innervation onto each output cell (M1: Grajski and Merzenich 1990; von der Malsburg 1973; von der Malsburg and Willshaw 1976; Willshaw and von der Malsburg 1976; S1: Miller 1992); or to the total innervation from each input cell (M1: von der Malsburg 1979; Willshaw and von der Malsburg 1979); or to both (M1: Whitelaw and Cowan 1981; S1: Miller et al. 1989). Second, there is usually coupling between the weight changes on different output cells. For example, neighboring cells' activities may interact through intralaminar synaptic connections, causing the evolution of their weights to be coupled; or modulatory factors may diffuse, directly affecting neighboring synapses. Both types of coupling may take the mathematical form of an output layer "lateral interaction function" (Miller 1990b; Miller et al. 1989). Formulation of constraints in the case of a full layer is discussed in Miller and MacKay (1992, Appendix C).

We have not studied constraints over a full layer in detail. However, the following heuristics, based on the single cell studies of Section 2, appear to be compatible with the studies cited in the previous paragraph and with unpublished observations. We refer to the projection to a single output cell as a receptive field or RF, and the projection from a single input location as a projective field or PF. The eigenvectors are patterns of weights across the entire network, not just across individual RFs or PFs.

In simple Hebbian models, the dominant, fastest-growing patterns can often be characterized as follows. First, in the absence of constraints, the RFs of the dominant patterns are primarily determined by a particular input correlation function, and the PFs of the dominant patterns are similarly determined by the output layer lateral interaction function (Miller 1990a). If the correlations are all positive, the RFs have a single sign; if correlations oscillate in sign with input separation, the RFs oscillate in sign with a similar wavelength. A single-signed RF can be regarded as one that oscillates with an infinite wavelength, so we may summarize: in the absence of constraints, the RFs of the dominant patterns vary between positive and negative values with a wavelength corresponding to the peak of the Fourier transform of the appropriate input correlation function. Similarly, the PFs of the dominant patterns vary between positive and negative values with a wavelength corresponding to the peak of the Fourier transform of the output layer lateral interaction function.

Second, constraints on output cells appear only to affect the form of the individual RFs, while constraints on input cells only affect the form of the individual PFs. Consider the case of two layers that are topographically connected: each input cell initially makes synapses onto cells over a certain diameter ("arbor diameter") in the output layer, and
adjacent input cells project adjacent arbors. Then output cells also receive connections over an arbor diameter from the input layer. Suppose that output or input cell constraints conserve total synaptic strength over the cell. Then an RF or PF that alternates in sign with a wavelength less than or equal to the arbor diameter is approximately zero-sum, that is, it has summed synaptic strength near 0. An RF or PF that alternates with longer wavelength is nonzero-sum. Subtractive constraints selectively suppress the growth of nonzero-sum patterns, whereas multiplicative constraints stabilize the growth of a dominant nonzero-sum pattern. Thus, we arrive at the following heuristic rules for the wavelength with which RFs or PFs alternate in sign (Fig. 4):

1. If the dominant pattern in the absence of constraints has (RF,PF) wavelength larger than an arbor diameter, then
a. Subtractive (output,input) constraints suppress this pattern in favor of a pattern with (RF,PF) wavelength of an arbor diameter;
b. Multiplicative (output,input) constraints do not alter the (RF,PF) wavelength of this dominant pattern, but only stabilize its amplitude.

2. If the dominant pattern in the absence of constraints has (RF,PF) wavelength smaller than an arbor diameter, then (output,input) constraints, whether enforced multiplicatively or subtractively, will have little effect.

In all cases, saturation of all synapses is expected without constraints or under subtractive constraints, but not under multiplicative constraints.

Several cautions must be emphasized about this approach. First, it predicts only the characteristic wavelength of weight alternation, and does not distinguish between different weight structures with similar wavelength. Second, the approach is heuristic: its validity must be checked in any particular case. In particular, the final weight pattern is expected to be one in which the dominant PF and RF patterns are "knitted together" into a compatible overall pattern. If such a "knitting" is not possible, the heuristics will fail.

This analysis can be applied to understand the effects of subtractive input constraints on ocular dominance segregation. Consider the development of the difference wD between projections from the two eyes (Section 2.6). An RF across which wD is all-positive or all-negative corresponds to a monocular receptive field. Subtractive output constraints have no effect on the development of wD: such constraints affect only the sum, not the difference, of the two projections. When RFs are monocular, an oscillation across PFs of wD corresponds to the oscillation between ocular dominance columns (Fig. 4). Subtractive input constraints separately conserve the total strength from the left-eye input and from the right-eye
[Figure 4 panel labels: correlation function, lateral interaction function, receptive field, projective field.]
Figure 4: The role of constraints on input and output cells: a heuristic approach. Top: Output cell receptive fields (RFs) expected to develop under unconstrained dynamics (U), or under M1 or S1 constraints on output cells. White regions in receptive fields indicate positive weights; dark regions indicate zero or negative weights, depending on whether wmin is zero or negative. Correlations between input activities are shown as a function of input separation. Without constraints or under M1, the weights vary in sign to match the oscillation of the correlation function. Under S1, the weights always alternate in sign, with wavelength no larger than an arbor diameter. Note that this approach does not distinguish between different weight structures with similar wavelength of alternation, such as the two lower RFs. Bottom: Input cell projective fields (PFs) are determined in the same manner as RFs, except that (1) the determining function is the output layer lateral interaction function; and (2) the determining constraints are those on input cells. Here, solid lines indicate positive weights, dashed lines indicate zero or negative weights.
input at each position, and so conserve the total difference wD from each input position. Thus, these constraints ensure that there is an oscillation of wD across PFs with wavelength no larger than an arbor diameter. Subtractive input constraints thus determine the width of a left-eye plus a right-eye ocular dominance column to be an arbor diameter when the unconstrained dynamics would lead to larger columns, but have little effect on ocular dominance segregation otherwise (Miller et al. 1989).

4 How Can Constraints Be Implemented?
4.1 Learning Rules That Converge to Constrained Dynamics. The formulations in equations 2.3-2.5 confine the dynamics to the constraint surface that contains the initial weight vector. Alternatively, constraints may be formulated so that the dynamics converge from an arbitrary initial weight vector to one particular constraint surface, and remain on that constraint surface thereafter. In this case the dynamics are described by equations 2.3-2.5 after an initial transient in which the constraint surface is reached. Such a formulation of S1 constraints is obtained by setting ε(w) = k₂(n · w - k₁) for constants k₁ and k₂ in equation 2.2. When |k₂| is large, this term enforces the constraint n · w = k₁ (Linsker 1986) and is equivalent to an S1 constraint (MacKay and Miller 1990a, Appendix E). Multiplicative constraints can be similarly formulated.

Dynamics that converge to a multiplicative constraint can also be obtained by substituting a constant k > 0 for the denominator of γ(w) in equations 2.3 or 2.5. Let c be the constraint vector (for M1, c = n; for M2, c = w) and set γ(w) = c · Cw/k. Then, if the initial condition and dynamics maintain c · Cw > 0, the dynamics will flow to the constraint surface c · w = k and remain stable to perturbations off it thereafter [as can be seen by examining c · (d/dt)w]. Oja (1982) studied such M2 constraints with k = 1 and proved convergence to the principal eigenvector. Finally, if the principal eigenvalue λ^0 of C is positive, convergent multiplicative dynamics can also be formulated by using any γ(w) in equation 2.1 that grows with |w| and takes values both smaller and larger than λ^0 (B. Pearlmutter, unpublished manuscript). This leads to convergence to a multiple of the principal eigenvector, βe^0, satisfying the constraint γ(βe^0) = λ^0. An example is γ(w) = |w|² (Yuille et al. 1989).
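For instance, the following minimal sketch (Python; the input distribution, learning rate, and step count are illustrative assumptions, not taken from the paper) simulates the stochastic, per-pattern form of Oja's rule, (d/dt)w = y(x - wy) with y = w · x, and exhibits the convergence just described:

```python
import numpy as np

# Sketch of Oja's (1982) rule dw/dt = y(x - w y), y = w.x: a stochastic M2
# constraint with k = 1. For zero-mean inputs, w converges approximately to
# the unit-length principal eigenvector of C = <x x^T>.
rng = np.random.default_rng(1)
N, eta = 10, 0.01
A = rng.standard_normal((N, N))
C = A @ A.T / N                          # an illustrative input correlation matrix
L = np.linalg.cholesky(C)
w = 0.1 * rng.standard_normal(N)
for _ in range(50000):
    x = L @ rng.standard_normal(N)       # zero-mean sample with <x x^T> = C
    y = w @ x
    w += eta * y * (x - y * w)           # Hebbian growth plus multiplicative decay

e0 = np.linalg.eigh(C)[1][:, -1]         # principal eigenvector of C
print("|w| =", np.linalg.norm(w))                         # ~1 (the M2 surface)
print("|cos(w, e0)| =", abs(w @ e0) / np.linalg.norm(w))  # ~1
```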
4.2 Use of Thresholds to Achieve Constraints. Consider a linear Hebbian rule:

(d/dt)w = λ(y - y_H)(x - x_H)   (4.1)
Here y is the activation of the output cell, x the vector of input activities, and xo and yo are threshold activity levels for Hebbian plasticity. Assume
a linear activation rule, y = w·x. We average equation 4.1 over input patterns, assuming that x_θ and y_θ are constant over input patterns. The resulting equation is
(d/dt)w(t) = Qw + λ[⟨y⟩ − y_θ][⟨x⟩ − x_θ]   (4.2)
where Q is the input covariance matrix, Q = λ⟨(x − ⟨x⟩)(x − ⟨x⟩)^T⟩, and ⟨y⟩ = w·⟨x⟩. The second term is a decay term, and can enforce a constraint. If the elements of [⟨x⟩ − x_θ] are large and negative, the type 1 constraint ⟨y⟩ = y_θ, that is, w·⟨x⟩ = y_θ, will be enforced. If x_θ is independent of w, this constraint is enforced subtractively; if furthermore all inputs have the same mean activity level and threshold, this is an S1 constraint, as discussed in Section 4.1 and as used in Linsker (1986).

The presynaptic threshold x_θ can also enforce a constraint if its elements increase with those of w. For example, an M2 constraint that converges to w·w = 1 (Section 4.1), when applied to the unconstrained equation (d/dt)w = yx, yields the rule proposed by Oja (1982): (d/dt)w = yx − w(w·yx), or (d/dt)w = y(x − wy). This is x_θ = wy.¹ Both of these mechanisms require that inputs activated at an average or below-average level lose synaptic strength when the postsynaptic cell is highly activated. This is not the case in at least one biological system, LTP in hippocampus (Gustafsson et al. 1987). This difficulty is avoided by the choice x_θ = w, which yields the multiplicatively stabilized rule (d/dt)w = y(x − w). This rule does not ensure a selective outcome: as noted in Kohonen (1989, Section 4.3.2), it converges either to the principal eigenvector of C = ⟨xx^T⟩, or else to w = 0. However, with a nonlinear, competitive rule for y that ensures localized activation, this rule is that of the self-organizing feature map of Kohonen (1989) and does achieve selectivity.

The postsynaptic threshold, y_θ, can enforce a constraint if it increases faster than linearly with the average postsynaptic activation, ⟨y⟩, and if the elements of [⟨x⟩ − x_θ] are positive. Bienenstock et al. (1982) proposed the rule y_θ = ⟨y⟩²/y_set, where y_set is a cell's "preset" desired activity level. They combined this with a nonlinear Hebbian rule, for example, y(y − y_θ) in place of (y − y_θ) in equation 4.1. With either Hebbian rule, this "sliding threshold" has the effect of adjusting synaptic strengths to achieve ⟨y⟩ ≈ y_θ, or ⟨y⟩ = w·⟨x⟩ ≈ y_set. Thus, it provides another method of achieving a type 1 constraint. Recent results both in the peripheral auditory system (Yang and Faber 1991) and in hippocampus (Huang et al. 1992) are suggestive that increased neural activity may elevate a threshold for modification, but in a manner specific to those inputs whose activity is increased. This is consistent with an increase in the elements of x_θ corresponding to activated inputs, but not with an increase in y_θ, which would elevate thresholds for all inputs.
"Note: equation 4.2 is not valid for this case, because
XH
varies with input patterns.
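As an illustration of the sliding threshold, the following sketch (ours, with assumed numbers) keeps only the averaged threshold term of equation 4.2, dropping the Qw term, and shows ⟨y⟩ = w·⟨x⟩ being driven toward y_set:

```python
# Sketch: BCM-style sliding threshold y_theta = <y>^2 / y_set.
import numpy as np

x_mean = np.array([1.0, 0.5, 0.2])   # mean input activities (assumed)
y_set = 2.0                          # "preset" desired activity level
w = np.array([0.1, 0.1, 0.1])
dt = 0.05

for _ in range(5000):
    y_mean = w @ x_mean
    y_theta = y_mean**2 / y_set      # threshold slides superlinearly with <y>
    w = w + dt * (y_mean - y_theta) * x_mean

print("final <y> =", w @ x_mean)     # approaches y_set = 2.0
```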
Covariance models (Sejnowski 1977a,b) have been proposed to solve the problem that chance coincidences drive synapses to saturate at w_max. This problem was known to occur in Hebb models of the form (d/dt)w ∝ yx. In a covariance model, (d/dt)w ∝ (y − ⟨y⟩)(x − ⟨x⟩). This is a Hebb model with linearly sliding thresholds: y_θ = ⟨y⟩, x_θ = ⟨x⟩. In this case, the decay term in equation 4.2 is zero, so synaptic growth is driven by the unconstrained equation (d/dt)w = Qw. Thus, the problem of loss of selectivity under a Hebbian rule is not avoided by a linear covariance rule.

4.3 Biological Implementation of Constraints. Rules that conserve synaptic strength have been criticized as nonlocal (e.g., Bienenstock et al. 1982). Thus, it is important to note that multiplicative or subtractive constraints in their general form (equations 2.1-2.2) can be implemented locally if each of a cell's synaptic weights undergoes decay, either at a fixed rate (subtractive decay) or at a rate proportional to its strength (multiplicative decay); and if the overall gain of this decay, γ(w) or ε(w), is set for the cell as a whole, and increases with the cell's total synaptic strength. Such a cellular increase in decay, implemented locally at each synapse, might be achieved in at least two ways. First, a cell might have a limited capacity to metabolically supply its synapses, so that greater total synaptic strength means less supply and thus faster decay for each synapse. Second, the overall rate of decay might increase with a cell's average degree of activation, which in turn would increase with the total synaptic strength received by a cell. Increased activation could increase release of a molecule that degrades synapses, such as a protease, or decrease release of a molecule that supports synapses, such as a trophic, adhesion, or sprouting factor (evidence for such mechanisms is reviewed in Van Essen et al. 1990). Increased activation might also increase decay due to thresholds for synaptic modification, as just discussed.
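A minimal sketch of this local implementation (our illustration; the gain function, Hebbian drive, and limits are made up): each synapse decays at a rate set by a single cell-wide gain that grows with the cell's total synaptic strength, either at a fixed rate per synapse (subtractive) or in proportion to each weight (multiplicative).

```python
# Sketch: cell-wide decay gain applied locally at each synapse.
import numpy as np

def hebb_with_cellwide_decay(w, hebb_drive, multiplicative, dt=0.01):
    gain = 0.5 * w.sum()              # cell-wide gain, grows with total strength
    decay = gain * w if multiplicative else gain * np.ones_like(w)
    return np.clip(w + dt * (hebb_drive - decay), 0.0, None)

w = np.array([0.2, 0.4, 0.6])
drive = np.array([0.3, 0.3, 0.3])     # assumed Hebbian growth term
for _ in range(1000):
    w = hebb_with_cellwide_decay(w, drive, multiplicative=False)
print(w)   # total strength settles where decay balances the drive
```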
5 Discussion
We have demonstrated that multiplicative and subtractive constraints can lead to fundamentally different outcomes in linear learning. Under multiplicative constraints, the weight vector tends to the principal eigenvector of the unconstrained time development operator. This is a “graded” receptive field in which most mutually correlated inputs are represented. Thus, when two equally active eyes compete, both retain equal innervation unless the two eyes are anticorrelated. Under subtractive constraints, the weight vector tends to a receptive field that is “sharpened” to a subset of maximally correlated inputs: the weights of these inputs reach the maximum allowed strength, while all other weights reach the minimum allowed strength. When two eyes compete, subtractive constraints can lead to domination by one eye (ocular dominance segregation) provided
only that correlations within one eye are stronger than those between the eyes. The instability of subtractive constraints depends on the unconstrained operator having at least two positive eigenvalues, which is typical for Hebbian learning. An interesting alternative is anti-Hebbian learning (Mitchison 1991): in this case, all unconstrained eigenvalues are reversed in sign from the Hebbian case, so typically no eigenvalue is positive. Our analysis applies to this case also: multiplicatively constrained dynamics flow to the principal eigenvector, which is the vector that would have the smallest eigenvalue under Hebbian dynamics (Mitchison 1991); while subtractively constrained dynamics flow to the fixed point, which is stable.

Multiplicative and subtractive constraints represent two fundamentally different methods of controlling the size of the weight vector. Multiplication equally rescales all weight patterns, while subtraction directly acts on only a single weight pattern. Because this difference is general, many of the results we have found for the linear case may generalize to cases involving nonlinear rules. Biologically, there is as yet little evidence as to the mechanisms that lead activity-dependent plasticity to be competitive or to achieve selectivity. Of the two choices, subtractive constraints seem to resemble biology more closely in systems where sharp receptive fields are achieved, and in visual cortex, where ocular dominance columns are likely to develop without requiring anticorrelation between the eyes; multiplicative constraints might resemble biology more closely in situations like adult cortical plasticity, where continually moving and graded representations may occur (Kaas 1991). We do not advocate that one or the other of these is the biologically correct choice. Rather, we wish (1) to point out that different choices of competitive mechanism can yield different outcomes, so it is important for modelers to know whether and how their results depend on these choices; and (2) to begin to distinguish and characterize different classes of such mechanisms, which might then be compared to biology.
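A numerical sketch of this central contrast (ours; the toy two-eye correlation matrix, initial bias, and saturation limits are assumed): with within-eye correlations stronger than between-eye ones, convergent M2 dynamics retain both eyes in a graded pattern, while S1 dynamics with weight limits segregate to one eye.

```python
# Sketch: multiplicative vs. subtractive constraints on a two-eye problem.
import numpy as np

within, between = 1.0, 0.4
C = np.array([[1.5, within, between, between],
              [within, 1.5, between, between],
              [between, between, 1.5, within],
              [between, between, within, 1.5]])   # eyes: inputs (0,1) and (2,3)
n = np.ones(4) / 2.0                     # unit constraint vector for S1
P = np.eye(4) - np.outer(n, n)
w_mult = np.array([0.6, 0.5, 0.5, 0.4])  # slight left-eye bias
w_sub = w_mult.copy()
dt, wmin, wmax = 0.01, 0.0, 1.0

for _ in range(20000):
    w_mult += dt * (C @ w_mult - (w_mult @ C @ w_mult) * w_mult)  # M2
    w_sub = np.clip(w_sub + dt * (P @ C @ w_sub), wmin, wmax)     # S1 + limits

print("multiplicative:", np.round(w_mult, 3))   # graded, both eyes retained
print("subtractive:   ", np.round(w_sub, 3))    # one eye saturates high
```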
Appendix: Proofs of Mathematical Results

We study dynamics confined to a constraint surface and governed by a general multiplicative constraint (equation 2.1) or by an S1 subtractive constraint (equation 2.4). As in the text, we use indices a, b, … to refer to the eigenvector basis of C. We assume that C is symmetric and thus has a complete set of orthonormal eigenvectors e^a with corresponding eigenvalues λ^a. Write the constrained equation as (d/dt)w = f(w). To determine the stability of a fixed point w^FP [where f(w) = 0], we linearize f(w) about the fixed point. Call this linearized operator D; in the eigenvector basis of C,
it is a matrix with elements D_ab = ∂f_a(w)/∂w_b |_{w=w^FP}. For an S1 constraint, f(w) = PCw is linear (here, P = [1 − nn^T]), so D = PC. We define the constraint plane to be the hyperplane tangent to the constraint surface at the fixed point, and the constraint vector c to be the vector normal to the constraint plane. c is a left eigenvector of D.² The stability of a fixed point is determined by the eigenvalues of D (Hirsch and Smale 1974). If one eigenvalue is positive, the fixed point is unstable; if all eigenvalues are negative, the fixed point is stable. In assessing the outcome of the constrained dynamics, we are concerned only with stability of the fixed point to perturbations within the constraint surface. Thus, if all eigenvalues are negative except one zero eigenvalue corresponding to a direction outside the constraint surface, then the fixed point is stable.
Theorem 1 Proof. We consider a multiplicatively constrained equation, (d/dt)w = Cw − γ(w)w. We assume that multiplicative confinement of the dynamics to the constraint surface means two things. First, D has one zero or negative eigenvalue corresponding to the enforcement of the constraint, with associated left eigenvector c. Therefore any right eigenvector of D with positive eigenvalue is parallel to the constraint plane. Second, the constraint plane is not parallel to the subtracted vector w that enforces the constraint. A fixed point is an eigenvector of C: w^FP = w_a e^a for some a, with γ(w^FP) = λ^a. The linearized operator is D = C − λ^a 1 − w_a e^a [∇γ(w^FP)]^T, where ∇ is the gradient operator defined by ∇γ(x) = Σ_b e^b [∂γ(w)/∂w_b]|_{w=x}. In the eigenvector basis of C, D is a diagonal matrix with the addition of one row of off-diagonal elements; such a matrix has the same eigenvalues as the diagonal matrix alone [because the characteristic equation, det(D − λ1) = 0, is unchanged by the additional row]. The diagonal part of D is the matrix C − λ^a 1 − h e^a [e^a]^T, where h = w_a [∂γ(w)/∂w_a]|_{w=w_a e^a}. This is the operator C with eigenvalue λ^a reduced to −h and all other eigenvalues reduced by λ^a. Note that e^a is the right eigenvector of D with eigenvalue −h; e^a is not parallel to the constraint plane, so −h ≤ 0. Now we determine whether D has positive eigenvalues. If e^a is not the principal eigenvector of C, then D has a positive eigenvalue and the fixed point is unstable. If e^a is the principal eigenvector of C, and it is nondegenerate (no other eigenvector has the same eigenvalue), then all eigenvalues of D except perhaps a zero corresponding to e^a are negative; so the fixed point is stable. If C has N degenerate principal eigenvectors, and e^a is one of them, then D has N − 1 zeros corresponding to perturbations within the degenerate subspace: the principal eigenvector fixed points are thus marginally stable (eigenvalue 0) to perturbations within this subspace, and stable to other perturbations. □
"Proof. For any Aw in the constraint plane, D A w must remain within the constraint plane; that is, c'DAw = 0 for all Aw satisfying c r a w= 0. Therefore, c ' D x c'.
Lemma 1. Under an S1 constraint, (d/dt)w = PCw with P = (1 − nn^T), if there is a vector v parallel to the constraint plane such that v^T PCv > 0, then PC has a positive eigenvalue.

Proof. v is parallel to the constraint plane, so v·n = 0 and Pv = v. Thus, from v^T PCv > 0 we conclude that v^T PCPv > 0. Since PCP is symmetric, this implies that PCP has a positive eigenvalue; call this eigenvalue λ^0, with corresponding eigenvector e^0. e^0 is parallel to the constraint plane, that is, Pe^0 = e^0 (because e^0 = PCPe^0/λ^0, and P^2 = P). So PCe^0 = PCPe^0 = λ^0 e^0. □

Theorem 2 Proof. We consider the S1 constrained equation (d/dt)w = PCw. Let v be a linear combination of the two eigenvectors of C with positive eigenvalues, such that v is parallel to the constraint plane: v·n = 0. Then v^T PCv = v^T Cv > 0. So, by Lemma 1, PC must have a positive eigenvalue. □

Theorem 3 Proof. This is a generalization of a similar proof in Linsker (1986). Suppose PCw^FP = "0" when synapses w_i^FP and w_j^FP are both not saturated. By "0," we mean that each component of the vector is either 0, or else of a sign that would take an already saturated synapse beyond its limiting value. Let u^i be the unit vector with ith element 1 and all other elements 0, and similarly for u^j. Consider stability of the fixed point to a perturbation confined to the u^i/u^j plane. The action of C in this plane is given by the submatrix

C_{u^i u^j} = [ C_ii  C_ij
               C_ji  C_jj ]

The eigenvalues of C_{u^i u^j} are both real and positive when the conditions of the theorem are met. Let e^1 and e^2 be the two orthonormal eigenvectors of C_{u^i u^j}, with all synaptic components other than i and j set to zero. As in the proof of Theorem 2, let v be a linear combination of e^1 and e^2 that is parallel to the constraint plane, v·n = 0. Then v^T PCv = v^T Cv > 0. So, by Lemma 1, the fixed point is unstable. □
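The instability claimed by Theorem 2 is easy to check numerically; this sketch (ours, with a random positive semidefinite C, which has at least two positive eigenvalues) verifies that the S1-projected operator PC acquires a positive eigenvalue:

```python
# Sketch: numerical check of Theorem 2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = A @ A.T                        # symmetric, >= 2 positive eigenvalues
n = np.ones(5) / np.sqrt(5)        # unit constraint vector
P = np.eye(5) - np.outer(n, n)
eigs = np.linalg.eigvals(P @ C)
print("max Re(eig) of PC:", eigs.real.max())   # > 0: fixed point unstable
```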
Acknowledgments K. D. M. thanks C. Koch, Caltech, and M. P. Stryker, UCSF, for supporting this work, which was performed in their laboratories. K. D. M. was supported by a Del Webb fellowship and a Markey Foundation internal grant, both from Caltech Division of Biology, and by an N.E.I. Fellowship at UCSF. D. J. C. M. was supported by a Caltech Fellowship and a Studentship from SERC, UK. We thank Bartlett Mel and Terry Sejnowski for helpful comments on the manuscript. This collaboration would have been impossible without the internet/NSFnet.
References

Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Bourgeois, J.-P., Jastreboff, P. J., and Rakic, P. 1989. Synaptogenesis in visual cortex of normal and preterm monkeys: Evidence for intrinsic regulation of synaptic overproduction. Proc. Natl. Acad. Sci. U.S.A. 86, 4297-4301.
Grajski, K. A., and Merzenich, M. M. 1990. Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Comp. 2, 71-84.
Guillery, R. W. 1972. Binocular competition in the control of geniculate cell growth. J. Comp. Neurol. 144, 117-130.
Gustafsson, B., Wigstrom, H., Abraham, W. C., and Huang, Y.-Y. 1987. Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci. 7, 774-780.
Hayes, W. P., and Meyer, R. L. 1988a. Optic synapse number but not density is constrained during regeneration onto surgically halved tectum in goldfish: HRP-EM evidence that optic fibers compete for fixed numbers of postsynaptic sites on the tectum. J. Comp. Neurol. 274, 539-559.
Hayes, W. P., and Meyer, R. L. 1988b. Retinotopically inappropriate synapses of subnormal density formed by misdirected optic fibers in goldfish tectum. Dev. Brain Res. 38, 304-312.
Hayes, W. P., and Meyer, R. L. 1989a. Impulse blockade by intraocular tetrodotoxin during optic regeneration in goldfish: HRP-EM evidence that the formation of normal numbers of optic synapses and the elimination of exuberant optic fibers is activity independent. J. Neurosci. 9, 1414-1423.
Hayes, W. P., and Meyer, R. L. 1989b. Normal numbers of retinotectal synapses during the activity-sensitive period of optic regeneration in goldfish: HRP-EM evidence implicating synapse rearrangement and collateral elimination during map refinement. J. Neurosci. 9, 1400-1413.
Hirsch, M. W., and Smale, S. 1974. Differential Equations, Dynamical Systems and Linear Algebra. Academic Press, New York.
Huang, Y.-Y., Colino, A., Selig, D. K., and Malenka, R. C. 1992. The influence of prior synaptic activity on the induction of long-term potentiation. Science 255, 730-733.
Kaas, J. H. 1991. Plasticity of sensory and motor maps in adult mammals. Annu. Rev. Neurosci. 14, 137-167.
Kohonen, T. 1989. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Berlin.
Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.
MacKay, D. J. C., and Miller, K. D. 1990a. Analysis of Linsker's applications of Hebbian rules to linear networks. Network 1, 257-298.
MacKay, D. J. C., and Miller, K. D. 1990b. Analysis of Linsker's simulation of Hebbian rules. Neural Comp. 2, 173-187.
Miller, K. D. 1990a. Correlation-based models of neural development. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 267-353. Erlbaum, Hillsdale, NJ.
Miller, K. D. 1990b. Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp. 2, 321-333.
Miller, K. D. 1992. Development of orientation columns via competition between ON- and OFF-center inputs. NeuroReport 3, 73-76.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Miller, K. D., and MacKay, D. J. C. 1992. The role of constraints in Hebbian learning. Tech. Rep. Memo 19, Program in Computation and Neural Systems, Caltech, Pasadena, CA.
Mitchison, G. 1991. Removing time variation with the anti-Hebbian differential synapse. Neural Comp. 3, 312-320.
Murray, M., Sharma, S., and Edwards, M. A. 1982. Target regulation of synaptic number in the compressed retinotectal projection of goldfish. J. Comp. Neurol. 209, 374-385.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Pallas, S. L., and Finlay, B. L. 1991. Compensation for population-size mismatches in the hamster retinotectal system: Alterations in the organization of retinal projections. Vis. Neurosci. 6, 271-281.
Perez, R., Glass, L., and Shlaer, R. 1975. Development of specificity in the cat visual cortex. J. Math. Biol. 1, 275-288.
Rochester, N., Holland, J. H., Haibt, L. H., and Duda, W. L. 1956. Tests on a cell assembly theory of the action of the brain, using a large digital computer. IRE Trans. Info. Theory IT-2, 80-93.
Rosenblatt, F. 1961. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.
Sejnowski, T. J. 1977a. Statistical constraints on synaptic plasticity. J. Theor. Biol. 69, 385-389.
Sejnowski, T. J. 1977b. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Van Essen, D. C., Gordon, H., Soha, J. M., and Fraser, S. E. 1990. Synaptic dynamics at the neuromuscular junction: Mechanisms and models. J. Neurobiol. 21, 223-249.
von der Malsburg, C. 1973. Self-organization of orientation selective cells in the striate cortex. Kybernetik 14, 85-100.
von der Malsburg, C. 1979. Development of ocularity domains and growth behavior of axon terminals. Biol. Cybern. 32, 49-62.
von der Malsburg, C., and Willshaw, D. J. 1976. A mechanism for producing continuous neural mappings: Ocularity dominance stripes and ordered retino-tectal projections. Exp. Brain Res. Suppl. 1, 463-469.
Whitelaw, V. A., and Cowan, J. D. 1981. Specificity and plasticity of retinotectal connections: A computational model. J. Neurosci. 1, 1369-1387.
Wiesel, T. N., and Hubel, D. H. 1965. Comparison of the effects of unilateral and bilateral eye closure on cortical unit responses in kittens. J. Neurophysiol. 28, 1029-1040.
Willshaw, D. J., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London B 194, 431-445.
Willshaw, D. J., and von der Malsburg, C. 1979. A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Phil. Trans. R. Soc. London B 287, 203-243.
Yang, X. D., and Faber, D. S. 1991. Initial synaptic efficacy influences induction and expression of long-term changes in transmission. Proc. Natl. Acad. Sci. U.S.A. 88, 4299-4303.
Yuille, A. L., Kammen, D. M., and Cohen, D. S. 1989. Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern. 61, 183-194.
Received October 9, 1992; accepted May 13, 1993.
Communicated by Harry Barrow and David Field
Toward a Theory of the Striate Cortex

Zhaoping Li
Joseph J. Atick
The Rockefeller University, 1230 York Avenue, New York, NY 10021 USA
We explore the hypothesis that linear cortical neurons are concerned with building a particular type of representation of the visual world: one that not only preserves the information and the efficiency achieved by the retina, but in addition preserves spatial relationships in the input, both in the plane of vision and in the depth dimension. Focusing on the linear cortical cells, we classify all transforms having these properties. They are given by representations of the scaling and translation group and turn out to be labeled by rational numbers (p + q)/p (p, q integers). Any given (p, q) predicts a set of receptive fields that comes at different spatial locations and scales (sizes) with a bandwidth of log_2[(p + q)/p] octaves and, most interestingly, with a diversity of q cell varieties. The bandwidth affects the trade-off between preservation of planar and depth relations and, we think, should be selected to match structures in natural scenes. For bandwidths between 1 and 2 octaves, which are the ones we feel provide the best matching, we find for each scale a minimum of two distinct cell types that reside next to each other in phase quadrature, that is, differing by 90° in the phases of their receptive fields, as are found in the cortex; in special cases they resemble the "even-symmetric" and "odd-symmetric" simple cells. An interesting consequence of the representations presented here is that the pattern of activation in the cells in response to a translation or scaling of an object remains the same but merely shifts its locus from one group of cells to another. This work also provides a new understanding of color coding changes from the retina to the cortex.
1 Introduction
What is the purpose of the signal processing performed by neurons in the visual pathway? Are there first principles that predict the computations of these neurons? Recently there has been some progress in answering these questions for neurons in the early stages of the visual pathway. In Atick and Redlich (1990, 1992) a quantitative theory, based on the principle of redundancy reduction, was proposed. It hypothesizes that the main goal of retinal transformations is to eliminate redundancy in input signals, particularly that due to pairwise correlations among pixels, i.e., second-order statistics.¹

Neural Computation 6, 127-146 (1994) © 1993 Massachusetts Institute of Technology
pixels-second-order statistics.' The predictions of the theory agree well with experimental data on processing of retinal ganglion cells (Atick and Redlich 1992; Atick et al. 1992). Given the successes of this theory, it is natural to ask whether redundancy reduction is a computational strategy continued into the striate cortex, One possibility is that cortical neurons are concerned with eliminating higher-order redundancy, which is due to higher-order statistics. We think this is unlikely. To see why, we recall the facts that make redundancy reduction compelling when applied to the retina and see that these facts are not as relevant for the cortex. First, the retina has a clear bottleneck problem: the amount of visual data falling on the retina per second is enormous, of the order of tens of megabytes, while the retinal output has to fit into an optic nerve of a dynamic range significantly smaller than that of the input. Thus, the retina must compress the signal, and it can do so without significant loss of information by reducing redundancy. In contrast, after the signal is past the optic nerve, there is no identifiable bottleneck that requires continued redundancy reduction beyond the retina. Second, even if there were pressure to reduce data,2 eliminating higherorder statistics does not help. The reason is that higher-order statistics do not contribute significantly to the entropy of images, and hence no significant compression can be achieved by eliminating them (for reviews of information theory see Shannon and Weaver 1949; Atick 1992). The dominant redundancy comes from pairwise correlation^.^ There is another intrinsic difference between higher- and second-order statistics that suggests their different treatment by the visual pathway. Figure 1 shows image A and another image B that was obtained by randomizing the phases of the Fourier coefficients of A. B thus has the same second-order statistics as A but no higher-order ones. Contrary to A, B has no clear forms or structures (cf. Field 1989). This suggests that for defining forms and for discriminating between images, secondorder statistics are useless, while higher-order ones are essential. Actually, eliminating the former highlights the higher-order statistics that should be used to extract form signals from " n ~ i s e . " ~ 'Since retinal neurons receive noisy signals it is necessary to formulate the redundancy reduction hypothesis carefully taking noise into account. In Atick and Redlich (1990, 1992) a generalized notion of redundancy was defined, whose minimization leads to elimination of pairwise correlations and to noise smoothing. 2For example, there could be a computational bottleneck such as an attentional bottleneck occuring deep into the cortex-perhaps in the link between V4 and IT (Van Essen et nl. 1991). 3This fact is well known in the television industry (see, eg., Schreiber 1956). This is why practical compression schemes for television signals never take into account more than pairwise correlations, and even then, typically nearest neighbor correlations. This fact was also verified for several scanned natural images in our laboratory by N. Redlich and by Z. Li. 4Extractingsignal from noise can achieve by far more significant data reduction than trying to eliminate higher-order correlations.
Figure 1: (A, B) Demonstration of the uselessness of second-order statistics for form definition and discrimination. Following Field (1989), image B is constructed by first Fourier transforming A, randomizing the phases of the coefficients and then taking the inverse Fourier transform. The two images thus have the same second-order statistics but B has no higher-order ones. All relevant object features disappeared from B.
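The Figure 1 manipulation can be sketched in a few lines (ours; it assumes any grayscale image array, and taking the real part after the inverse transform is an approximation, since exact realness would require Hermitian-symmetric phases):

```python
# Sketch: randomize Fourier phases, keep amplitudes; second-order
# statistics are preserved while higher-order structure is destroyed.
import numpy as np

def phase_randomize(img, seed=0):
    rng = np.random.default_rng(seed)
    F = np.fft.fft2(img)
    phases = np.exp(1j * rng.uniform(0, 2 * np.pi, img.shape))
    return np.fft.ifft2(np.abs(F) * phases).real  # approx; see note above

img = np.random.rand(64, 64)   # stand-in for image A
B = phase_randomize(img)       # analog of image B
```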
So what is the cortex then trying to do? Ultimately, of course, the cortex is concerned with object and pattern recognition. One promising direction could be to use statistical regularities of images to discover matched filters that lead to better representations for pattern recognition. Research in this direction is currently under way. However, there is another important problem that a perceptual system has to face before the recognition task. This is the problem of segmentation, or equivalently, the problem of grouping features according to a hypothesis of which objects they belong to. It is a complex problem, which may turn out not to be solvable independently from the recognition problem. However, since objects are usually localized in space, we think an essential ingredient for its successful solution is a representation of the visual world where spatial relationships, both in the plane of vision and in the depth dimension, are preserved as much as possible.

In this paper we hypothesize that the purpose of early cortical processing is to produce a representation that (1) preserves information, (2) is free of second-order statistics, and (3) preserves spatial relationships. The first two objectives are fully achieved by the retina so we merely require that they be maintained by cortical neurons. We think the third objective is attempted in the retina (e.g., retinotopic and scale invariant sampling);
however, it is completed only in the cortex where more computational and organizational resources are available. Here, we focus on the cortical transforms performed by the relatively linear cells; the first two requirements immediately limit the class of transforms that linear cells can perform on the retinal signals to the class of unitary matrices⁵ U with U·U† = 1. So the principle for deriving cortical cell kernels reduces to finding the U that best preserves spatial relationships. Actually, preserving planar and depth relationships simultaneously requires a trade-off between the two (see Section 2). This implies that there is a family of Us, one for every possible trade-off. Each U is labeled by the bandwidth of the resulting cell filters and forms a representation of the scaling and translation group (see Section 3). We show that the requirement of unitarity limits the allowed choices of bandwidths, and for each choice predicts the needed cell diversity. The bandwidth that should ultimately be selected is the one that best matches structures in natural scenes. For bandwidths around 1.6 octaves, which are the ones we feel are most relevant for natural scenes, the predicted cell kernels and cell diversity resemble those observed in the cortex. The resulting cell kernels also possess an interesting object constancy property: when an object in the visual field is translated in the plane or perpendicular to the plane of vision, the pattern of activation it evokes in the cells remains intrinsically the same but shifts its locus from one group of cells to another, leaving the same total number of cells activated. The importance of such representations for pattern recognition has been stressed repeatedly by many people before and recently by Olshausen et al. (1992). Furthermore, this work provides a new understanding of the color coding change from the single opponency in the retina to the double opponency in the cortex.

2 Manifesting Spatial Relationships
In this section we examine the family of decorrelating maps and see how they differ in the degree to which they preserve spatial relationships. We start with the input, represented by the activities of photoreceptors in the retina, {S(x_n)}, where x_n labels the spatial location of the nth photoreceptor in a two-dimensional (2D) grid. For simplicity, we take the grid to be uniform. To focus on the relevant issues without the notational complexity of 2D, we first examine the one-dimensional (1D) problem and then generalize the analysis to 2D in Section 4. The autocorrelator of the signals {S(x_n)} is
R_nm ≡ ⟨S(x_n) S(x_m)⟩   (2.1)
⁵In this paper we use the term "unitary" instead of "orthogonal" since we find it more convenient to use a complex basis [e.g., e^{ifx} instead of cos(fx)]. U† = (U*)^T, where the asterisk denotes complex conjugate. For real matrices, unitary means orthogonal.
where brackets denote ensemble average. To eliminate this particular redundancy, one has to decorrelate the output and then apply the appropriate gain control to fit the signals into a limited dynamic range. This can be achieved by a linear transformation

O_j = Σ_n K_jn S(x_n)   (2.2)

where j = 1, …, N and the kernel K_jn is the product of two matrices. Using boldface to denote matrices,

K = V · M   (2.3)
M is the rotation to the principal components of R: (M·R·M^T)_ij = λ_i δ_ij, where {λ_i} are the eigenvalues of R, while V is the gain control, a diagonal matrix with elements V_ii = 1/√λ_i. Thus the output has the property

⟨O_i O_j⟩ = (K·R·K^T)_ij = δ_ij   (2.4)
An important fact to note is that redefining K by K′ = U·K, where U is a unitary matrix (U·U† = 1), does not alter the decorrelation property (equation 2.4). (Actually U should be an orthogonal matrix for real O_j, but since we will for convenience use complex variables, unitary U is appropriate.) Therefore, there is a whole family of equally efficient representations parameterized by {U}. Any member is denoted by K^U:

K^U = U·(V·M) ≡ U·K^(p)   (2.5)
where K^(p) = V·M is the transformation to the principal components. Without compromising efficiency, this nonuniqueness allows one to look for a specific U that leads to K^U with other desirable properties such as manifest spatial relationships.⁶ To see this, let us exhibit the transformation K^(p) more explicitly. For natural signals, the autocorrelator is translationally invariant, in the sense that R_nm = R(n − m). One can then define the autocorrelator by its Fourier transform or its power spectrum, which in 2D is R(f) ~ 1/|f|², where f is the 2D spatial frequency (Field 1987; Ruderman and Bialek 1993). For illustration purposes, we take in this section the analogous 1D "scale-invariant" spectrum, namely, R(f) ~ 1/f.
⁶It should be noted that this nonuniqueness in receptive field properties is due to the fact that the principle used is decorrelation. If one insists on minimization of pixel entropy (which for gaussian signals is equivalent to decorrelation) this symmetry formally does not exist for ensembles of nongaussian signals. In other words some choice of U may be selected over others. However, for the ensemble of 40 images that we have considered, we found that the pixel entropy varied only by a few percent for different Us. This is consistent with the idea that natural scenes are dominated by second-order statistics that do not select any particular U. In other systems it is possible that higher-order statistics do select a special U; see, for example, Hopfield (1991). For another point of view see Linsker (1992).
In the 2D analysis of Section 4 we use the measured spectrum 1/|f|². For a translationally invariant autocorrelator, the transformation to principal components is a Fourier transform. This means the principal components of natural scenes, or the row vectors of the matrix M, are sine waves of different frequencies:

M_jn = (1/√N) e^{i f_j x_n},   j = 0, 1, 2, …, N − 1   (2.6)

where f_j = (2π/N)(j + 1)/2 if j is odd and f_j = −(2π/N)(j/2) if j is even, so that positive and negative frequencies alternate.
The gain control matrix V is V_jj = 1/√λ_j = 1/√(R(f_j)). The total transform then becomes

K^(p)_jn = (1/√(N R(f_j))) e^{i f_j x_n}   (2.7)
This performs a Fourier transform and at the same time normalizes the output such that the power is equalized among frequency components, ⟨|O_j|²⟩ = const. (i.e., the output is whitened).

One undesirable feature of the transformation K^(p) is that it does not preserve spatial relationships in the plane. As an object is translated in the field of view, the locus of response {O_j} will not simply translate. Also two objects separated in the input do not activate two separate groups of cells in the output. Typically all cells respond to a mixture of features of all objects in the visual field. Segmentation is thus not easily achievable in this representation. Mathematically, we say that the output {O_i} preserves planar spatial relationships in the input if

O_i[S] = O_{i−m}[S′]   when   S′(x_n) = S(x_{n+m})   (2.8)

where O_i[S] = Σ_n K_in S(x_n). In other words, a translation in the input merely shifts the output from one group of cells to another. Implicitly, preserving planar spatial relationship also requires, and we will therefore enforce, that the cell receptive fields be local, so a spatially localized object evokes activities only in a local cell group, which shifts its location when the object moves and is separated from another cell group evoked by another spatially disjoint object in the image plane. Technically speaking, an {O_i} that satisfies equation 2.8 is said to form a representation of the discrete "translation group."
Insisting on equation 2.8 picks up a unique choice of U. In fact in this case U is given by

U_nj = (1/√N) e^{−i f_j x_n}   (2.9)

which is just the inverse Fourier transform. The resulting transformation K^(r) = U·V·U† gives translationally invariant center-surround cell kernels K^(r)(x_n − x_m) = (1/N) Σ_j e^{i f_j (x_n − x_m)}/√(R(f_j)).
In two dimensions, taking into account optical properties of the eye medium and the noise, these kernels were shown to account well for properties of retinal ganglion cells (Atick and Redlich 1992).

Although the representation defined by K^(r) is ideal for preserving spatial relationships in the plane, it completely destroys spatial relations in the scale or depth dimension. The change in the patterns of activation in {O_i} in response to a change in the object distance is very complicated. To preserve depth relations the output should form a representation of another group, the so-called "scaling group." This is because when an object recedes or approaches, the image it projects goes from S(x) to S(λx) for some scale factor λ. The requirement of object invariance under scaling dictates that

O_i[S] = O_{i+l}[S′]   for   S′(x) = S(λx)   (2.10)

for some shift l depending on λ. It is not difficult to see that K^(r), which satisfies equation 2.8 all the way down to the smallest possible translation, violates this condition. Actually, satisfying equations 2.8 and 2.10 for the smallest possible translation and scale changes simultaneously is not possible. A compromise between them has to be found. The problem of finding the kernels that lead to {O_i} with the best compromise between equations 2.8 and 2.10 is equivalent to the mathematical problem of constructing simultaneous representations of the translation and scaling group, which is what we do next.
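The center-surround character of the whitening kernel of this section can be seen in a short sketch (ours; the 1/f spectrum is regularized at f = 0 by hand and the sizes are arbitrary choices):

```python
# Sketch: 1D whitening kernel K^(r) for a scale-invariant spectrum R(f) ~ 1/f.
import numpy as np

N = 128
j = np.fft.fftfreq(N, d=1.0 / N)           # integer frequency index
f = 2 * np.pi * np.abs(j) / N
R = np.where(f > 0, 1.0 / np.maximum(f, 1e-6), 1.0)   # R(f) ~ 1/f, regularized
gain = 1.0 / np.sqrt(R)                    # whitening gain 1/sqrt(R(f))
kernel = np.fft.ifft(gain).real            # one row of K^(r)(x_n - x_m)
print(kernel[:5])                          # positive center, negative surround
```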
3 Representations of Translation and Scaling Group
To satisfy equations 2.8 and 2.10 the cells must carry two different labels. One is a spatial position label "n" and the other is a scale label "a." The idea is that under translations of the input the output translates over the "n" index, while under scaling by some scale factor λ the output shifts
over the "a" index. Such cell groups can be obtained from O = U·K^(p) using a U that is block diagonal:

U = diag(U^0, U^1, U^2, …)
Each submatrix U^a has dimension N^a and gives rise to N^a cells with outputs O^a_n located at lattice points x^a_n = (N/N^a)n for n = 1, 2, …, N^a. Since the block matrices U^a act on K^(p), which are the Fourier modes of the inputs, the resulting cells in any given block a filter the inputs through a limited and exclusive frequency band with frequencies f_j for Σ_{a′<a} N^{a′} ≤ j < Σ_{a′≤a} N^{a′}. Since N^a < N these cells sample more sparsely on the original visual field. Notice the cells from different blocks a are spatially mingled with each other, and their total numbers add up to N = Σ_a N^a. The hope is to have translation invariance within each block and scale invariance between blocks, that is,
O^a_n[S] = O^a_{n+δn}[S′]   for S′(x) = S(x + δx) and δx = (N/N^a)δn   (3.1)

O^a_n[S] = O^{a+1}_n[S′]   for S′(x) = S(λx)   (3.2)
Each block "a" thus represents a particular scale; the translation invariance within that scale can be achieved with a resolution δx ∝ N/N^a, inversely proportional to N^a. Larger blocks or larger N^a thus give better translation invariance, and the single block matrix U = U^0 = M† achieves this symmetry to the highest possible resolution. On the other hand, a higher resolution in scaling invariance calls for a smaller λ > 1. As we will see below, (λ − 1) ∝ N^a/f^a, where f^a is the smallest frequency sampled by the ath block. Hence a better scaling invariance requires smaller block sizes N^a. A trade-off between better translation and scaling invariance reduces to choosing the scaling factor λ, or the bandwidth depending on it. This will become clearer as we now follow the detailed construction of U.

The unitarity condition now requires having U^a(U^a)† = 1 for each a, resulting in output cells uncorrelated within each scale and between scales. To construct U^a, one notices that the requirement of translation invariance is equivalent to having identical receptive fields, except for a spatial shift of the centers, within each scale a. It forces U^a_nj ∝ e^{i f_j x^a_n}. For a general λ, it turns out that the constraint U^a(U^a)† = 1 for a > 0 cannot be satisfied if one insists on only one cell or receptive field type within the scale. However, if one allows the existence of several, say q, cell types
within the scale, U^a(U^a)† = 1 is again possible. In this case, each cell is identical to (or is the off-cell type of) the one that is q lattice spaces away in the same scale lattice (i.e., x^a_n → x^a_{n+q}). The most general choice for real receptive fields is then

U^a_nj = (1/√(N^a)) e^{i(f_j x^a_n + φ_n + Θ)}   if f_j > 0
U^a_nj = (1/√(N^a)) e^{−i(|f_j| x^a_n + φ_n + Θ)}   if f_j < 0   (3.3)

where Θ is an arbitrary phase that can be thought of as zero for simplicity at the moment, and

φ_n = (p/q)π n   (3.4)
for two relatively prime integers p and q. This means the number of cell types in any given scaling block will be q. The frequencies sampled by this cell group are f_j = ±(2π/N)j for j^a < j ≤ j^{a+1}. Including both the positive and the negative frequencies, the total number of frequencies sampled, and, since U^a is a square matrix, the total number of cells in this scale, is N^a = 2(j^{a+1} − j^a). The constraint of unitarity for a > 0 leads to the equation

Σ_j U^a_nj (U^a_n′j)* = (1/N^a) Σ_{j=j^a+1}^{j^{a+1}} e^{i[f_j(x^a_n − x^a_n′) + φ_n − φ_n′]} + c.c. = δ_nn′   (3.5)

whose solution determines the phase increment

Δφ ≡ φ_{n+1} − φ_n   (3.6)

The condition Δφ = (p/q)π then leads to the nontrivial consequence

j^{a+1} = [(q + p)/p] j^a + q/(2p)   (3.7)
In a discrete system, the only acceptable solutions are those where q/2p is an integer. For example, the choice of q = 2 and p = 1 leads to the scaling j^{a+1} = 3j^a + 1. This is the most interesting solution as discussed below. Mathematically speaking, in the continuum limit a large class of solutions exists, since in that limit one takes j^a → ∞ and N → ∞ such that f^a = (2π/N)j^a remains finite; then we are simply led to f^{a+1} = f^a(q + p)/p for any q and p. Thus representations of the scaling and translation group are possible for all rational scaling factors λ = (q + p)/p. The bandwidth, B_oct, of the corresponding cells is log_2[(q + p)/p]. Interesting consequences follow from the relationship between cell bandwidth and diversity:
Cell types = q,   Bandwidth B_oct = log_2[(q + p)/p] octaves
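For the discrete q = 2, p = 1 solution described above (j^{a+1} = 3j^a + 1, scaling factor 3), the band edges and block sizes can be tabulated directly; this sketch (ours) assumes the starting edge j^1 = 1:

```python
# Sketch: frequency-band edges and block sizes for q = 2, p = 1 (lambda = 3).
j_edges = [1]
while j_edges[-1] < 400:
    j_edges.append(3 * j_edges[-1] + 1)          # j^{a+1} = 3 j^a + 1

print("band edges j^a:", j_edges)                 # 1, 4, 13, 40, 121, ...
print("block sizes N^a:",
      [2 * (hi - lo) for lo, hi in zip(j_edges, j_edges[1:])])
# Successive N^a grow by the scaling factor 3, as required by N^{a+1} = lambda N^a.
```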
For example, a bandwidth of one octave or a scaling factor (q + p)/p = 2 needs only one cell type in each scale, when q = p = 1. If it turns out to be necessary to have B_oct greater than 1 octave, then at least two classes of cells are needed to faithfully represent information in each scale, with q = 2 and p = 1 giving a scaling factor of 3 or B_oct close to 1.6 octaves.

It is interesting to compare our solutions to the so-called "wavelets" constructed in the mathematical literature, which also form representations of the translation and scaling group. In the standard construction of Grossman and Morlet (1984) and Meyer (1985), the representations could be made orthonormal (i.e., unitary in the case of real matrices) only for a limited choice of scaling factors given by 1 + 1/m, where m ≥ 1 is an integer. Such constructions need only one filter type in each scale and give scale factors no larger than 2 [equivalently the largest bandwidth is 1 octave, e.g., the well-known Haar basis wavelets (Daubechies 1988)]. This agrees with what we derived above for the special case of q = 1, where B_oct = log_2(1 + 1/p). However, allowing q > 1 gives more bandwidth choices in our construction. For example, q = 2 gives B_oct = log_2(1 + 2/p), no larger than 1.6 octaves, and q = 3 gives log_2(1 + 3/p), no larger than 2 octaves, etc. These results also agree with the recent theorem of Auscher (1992), who proved that multiscale representations can exist for scalings by any rational number k/l, provided k − l filter types are allowed in each scale. Our conclusion above yields exactly the same result by redefining k = p + q and l = q. We arrived at our conclusion independently through the explicit construction presented above.⁷

The connection between the number of cell types and the bandwidth that is possible to achieve is significant. We believe the bandwidth needed by cortical cells is determined by properties of natural images. Its value should be the best compromise between planar and depth resolution preservation for the distribution of structures in natural scenes. Actually, Field (1987, 1989) examined the issue of best bandwidth for filters that modeled cortical cells and found that bandwidths between 1 and 2 octaves best matched natural scene structures. Our results here show that cortical cells cannot achieve bandwidths of more than one octave without having more than one cell type.

Next we show what the predicted cell kernels look like. For generality, we give the expression for the kernels in the continuum limit for any scale factor λ = (q + p)/p, or equivalently with any allowed bandwidth, although the ones we think are most relevant to the cortex are the discrete p = 1, q = 2 kernels. The cell kernels are given by {K^a(x^a_n − x), a > 0} and {K^0(x^0_n − x)}. For any given a > 0, the kernels sample the frequency in the range f ∈ (f^a, λf^a) = (f^a, f^{a+1}). For a = 0, K^0 samples only frequencies f ∈ (0, f^1), and U^0 is given by U^0 = M† in equation 2.9 with N replaced by
⁷We thank Ingrid Daubechies for pointing out the result of P. Auscher to us.
Figure 2: "Even-symmetric" (A, C) and "odd-symmetric" (B, D) kernels predicted for the scale factor 3 (equivalently for BOct= 1.6 octaves) for two neighboring scales (top and bottom rows, respectively), together with their spectra (frequency sensitivities or selectivities).
N". Including both the positive and negative frequencies the predicted kernels are
-
-
x)
-
-m 9
+H
]
(3.8) (3.9)
For any given p and q the kernels for a > 0 come in q varieties. Even and odd varieties are immediately apparent when one sets q = 2, p = 1, and Θ = 0 [K^a(x^a_n − x) are even or odd functions of x^a_n − x for even or odd n]. In Figure 2 we exhibit the even and odd kernels in two adjacent scales and their spectra. The a = 0 kernels, where Θ = 0 is chosen, are similar to the center-surround retinal ganglion cells (however, they are larger in size), and hence we need not exhibit them here. In general, though, Θ can take any value, and the neighboring cells will simply differ by a 90° phase shift, or be in quadrature, without necessarily having even or odd symmetry in their receptive field shapes.
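A sketch of the q = 2, p = 1 kernels of equation 3.8 with Θ = 0 (ours; the band edge, normalization, and discretization are illustrative choices), producing even (n even) and odd (n odd) receptive fields in quadrature:

```python
# Sketch: whitened even/odd kernels over one scaling band (f^a, 3 f^a].
import numpy as np

def kernel(x, n, f_a, num_f=200):
    f = np.linspace(f_a * 1.001, 3 * f_a, num_f)    # band for scaling factor 3
    whitening = np.sqrt(f)                          # 1/sqrt(R(f)) with R(f) ~ 1/f
    phi_n = np.pi * n / 2                           # phi_n = (p/q) pi n, p=1, q=2
    return (whitening[:, None]
            * np.cos(f[:, None] * x[None, :] + phi_n)).sum(axis=0) / num_f

x = np.linspace(-20.0, 20.0, 401)
even = kernel(x, n=0, f_a=0.5)   # even-symmetric receptive field
odd = kernel(x, n=1, f_a=0.5)    # odd-symmetric (90-degree phase) partner
```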
From equation 3.8, it is easy to show that the kernels for a > 0 satisfy the following recursive relations:
K^a(x^a_n − λx) = (1/λ) K^{a+1}(x^{a+1}_n − x)   (3.10)

K^a[x^a_n − (x − δ^a)] = K^a(x^a_{n+q} − x)   (3.11)

where δ^a = x^a_{n+q} − x^a_n is the spacing between a cell and its qth neighbor within scale a.
To prove these one needs to use the following facts: f^{a+1} = λf^a, N^{a+1} = λN^a, and φ^{a+1}_n = φ^a_{n+q} = φ^a_n. (Equation 3.11 also applies for K^0.) The above relations imply that, except for a shift in space, each cell has the same receptive field as its qth neighbor within the same scale block; for example, when q = 2 as in the example above, all the even (or odd) cells are identical. Furthermore, except for the lowest scale a = 0, the nth cell in all scales has the same receptive field, up to a factor of λ expansion in size and a factor of λ reduction in amplitude. Of course, since x^a_n ≠ x^{a+1}_n, these cells are located at different spatial locations. Now it is straightforward to see that the translation invariance (equation 3.1) for δn = q and the scale invariance (equation 3.2) are direct consequences of the translation and scaling relationships, 3.11 and 3.10, respectively, between the receptive fields. This is exactly our goal of object constancy. Notice that the scaling constancy would not have been possible if the whitening factor were not present in equation 3.8. These results can be extended to 2D, where the whitening factor is 1/√R(f) = |f|, as we will see next.
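To make the construction concrete, the following is a minimal numerical sketch (our own code, not the paper's; an arbitrary band edge f0 and the scale-invariant spectrum R(f) = 1/f² are assumed) that builds the 1D continuum-limit kernels of equation 3.8 and checks the scaling relation 3.10 between adjacent scales:

```python
import numpy as np

# Sketch only: 1D kernels of eq. 3.8 with whitening factor 1/sqrt(R(f)) = |f|
# (R(f) = 1/f^2 assumed), normalized by the band width, which plays the role
# of the N^a normalization (N^{a+1} = lambda N^a).
lam, f0 = 3.0, 1.0      # scale factor lambda = (q+p)/p and lowest band edge

def kernel(x, a, phi=0.0, nf=2000):
    f = np.linspace(f0 * lam**a, f0 * lam**(a + 1), nf)  # band (f^a, f^{a+1})
    vals = f * np.cos(np.outer(x, f) + phi)              # whitened cosines
    return np.trapz(vals, f, axis=1) / (f[-1] - f[0])

x = np.linspace(-2.0, 2.0, 401)
even = kernel(x, 0, phi=0.0)          # "even-symmetric" variety
odd  = kernel(x, 0, phi=np.pi / 2)    # its quadrature ("odd") partner

# Scaling relation of eq. 3.10: K^a(lambda x) = (1/lambda) K^{a+1}(x)
print(np.allclose(kernel(lam * x, 0), kernel(x, 1) / lam))   # -> True
```

The 1/λ amplitude factor matches equation 3.10, and the two phase choices give the quadrature pair shown in Figure 2.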
4 Extension to 2D and Color Vision: Oriented Filters and Color Opponent Cells
The extension of the above construction to two dimensions is not difficult but involves a new subtlety. In this case, the constraint of unitarity on the matrices U^a, a > 0, is hard to satisfy even if we allow for the phase factor φ that leads ultimately to different classes of cells. This constraint is considered in more detail in the Appendix; here we only state the conclusions of that analysis. What one finds is that to ensure unitarity of U^a, one needs to allow for cell diversity of a different kind: cells in the ath scale need to be further broken down into different types or orientations, each sampling from a limited region of the frequency space in that scale. Three examples of acceptable unitary breakings are shown in Figure 3A, B, C. In A (B) filters are broken into two classes in any scale a > 0, in addition to the q-cell diversity discussed in 1D. One filter type is a lowpass-bandpass in the x-y directions and the other is a bandpass-lowpass in the x-y directions; these are denoted by "lb" and "bl." In C there are three classes of filters, "lb," "bl," and finally a class of filters that are bandpass in both x and y, "bb." The "lb" and "bl" filters are oriented while the "bb" ones
Theory of the Striate Cortex
139
I"+'
I"
I'
I'
.I
/"+I
B
I'
c
Figure 3: (A–C) Proliferation of more cell types by the breakdown of the frequency sampling region in 2D within a given scale a. Ignoring the negative frequencies, the frequencies f within the scale are inside the large solid box but outside the small dashed box. The solid lines within the large solid box further partition the sampling into subregions denoted by "bl," "lb," and "bb," which indicate bandpass-lowpass, lowpass-bandpass, and bandpass-bandpass, respectively, in the x-y directions. (A, B) Asymmetric breakdown between the x and y directions; the "lb" cells are not equivalent to a 90° rotation of the "bl" cells. (C) Symmetric breakdown between the x and y directions. The "bb" cells are significantly different from the others (see Fig. 4).
are not.⁸ Figure 4A and B shows the five cell types one encounters for the breaking in Figure 3B and the nine cell types for the breaking in Figure 3C, respectively, for a choice of scaling factor 3. Finally, the object constancy equations 3.1 and 3.2 still hold, since equations 3.10 and 3.11 extend to 2D as

K^a(x^a_n − λx) = (1/λ²) K^{a+1}(x^{a+1}_n − x)

K^a[x^a_n − (x − δ^a)] = K^a(x^a_{n+q} − x)

where x and x^a_n are 2D vectors, and n and q are 2D indices. These relationships are understood to hold between cells belonging to the same frequency sampling category ("lb," "bl," or "bb"). The factor of 1/λ² arises because the whitening factor in 2D is 1/√R(f) = |f|.

⁸One notices that this extension to 2D requires a choice of orientations, such as the x-y axes, breaking the rotational symmetry. Furthermore, it is natural to ask whether the object constancy under translations and scalings should be extended to object rotations in the image plane, which would require the cells to be representations of the rotation group. At this point, it is not clear whether rotational invariance is necessary (noting that we usually tilt our heads to read a tilted book, or fail to recognize a face upside down), or whether rotational invariance can be incorporated simultaneously with the translation and scaling invariances without increasing the number of cells. We leave this question outside the scope of this paper.
Zhaoping Li and Joseph J. Atick
140
Figure 4: (A, B) The predicted variety of cell receptive fields in 2D. The five cell types in (A) and the nine cell types in (B) arise from the frequency partitioning schemes in Figure 3B and C, respectively. The kernels in the lower-left corner of both images show the lowpass-lowpass filter K^0 in 2D; they are nonoriented. All others are bandpass in at least one direction; those are actually significantly smaller but are expanded in size in this figure for demonstration. The "bb" cells in the upper-right part of (B) come in four varieties (even-even, odd-odd, even-odd, and odd-even when θ = 0 is taken for both the x and y directions) and should exist in the cortex if the scheme in Figure 3C is favored. All kernels are constructed taking into account the optical MTF of the eye.
" :1
From equations 3.8 and 3.9, it is clear that the cortical kernels K^a(x) ∝ ∫_{f^a}^{f^{a+1}} df [1/√R(f)] cos(fx + φ^a) differ from the retinal kernel K(x) ∝ ∫_0^{f_max} df [1/√R(f)] cos(fx + φ) only by the range of the frequency integration, or selectivity. The cortical receptive fields are lowpass or bandpass versions of the retinal ones. One immediate consequence of this is that most cortical cells, especially the lowpass ones like those in the cytochrome oxidase blobs, have larger receptive fields than the retinal ones. Second, when considering color vision, the power spectra R_L(f) and R_C(f) for the luminance and chrominance channels, respectively, differ in their magnitudes. In reality, when noise is considered, the receptive field filters are not simply 1/√R(f), which would have resulted in identical receptive field forms for luminance and chrominance except for their different strengths; instead, the filter for luminance is more of a bandpass and the filter
for chrominance relatively lowpass. Since the retinal cells carry luminance and chrominance information simultaneously by multiplexing the signals from both channels, the resulting retinal cells are of red-center-green-surround (or green-center-red-surround) types (Atick et al. 1992). This is because at low spatial frequencies the chrominance filter dominates, while at higher spatial frequencies the luminance one dominates. As we argued above, the cortical cells simply lowpass or bandpass the signals from the retinal cells; thus the lowpass version will carry mostly the chrominance signals, while the bandpass or highpass ones carry the luminance signals. This is indeed observed in the cortex (Livingstone and Hubel 1984; Ts'o and Gilbert 1988), where the large (lowpass) blob cells are more color selective, while the smaller (higher-pass) nonblob cells, which are also more orientation selective by our results above, are less color sensitive. Furthermore, since the luminance signals are negligible at low frequencies, when one considers only the linear cell properties, the color sensitive blob cells are double-opponent (e.g., red-excitatory-green-inhibitory center and red-inhibitory-green-excitatory surround) or color-opponent-center-only (type II), depending on the noise levels. This is apparent when one tries to spatially lowpass the signals from a group of single-opponent retinal cells (Fig. 5).
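This last point lends itself to a toy demonstration. The sketch below uses our own illustrative parameters (Gaussian profiles, not the fitted filters of Atick et al. 1992): a single-opponent profile with a red-excitatory center and green-inhibitory surround is spatially lowpassed, as a blob cell would do to a row of such cells.

```python
import numpy as np

# Toy parameters, for illustration only.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

red_in   =  gauss(x, 0.5)     # red-excitatory center
green_in = -gauss(x, 1.5)     # green-inhibitory surround

lowpass = gauss(x, 3.0)       # blob cell pools over many ganglion cells
red_out   = np.convolve(red_in,   lowpass, mode="same") * dx
green_out = np.convolve(green_in, lowpass, mode="same") * dx

# After pooling, the red and green spatial profiles are nearly mirror images
# (red-excitatory, green-inhibitory everywhere): color opponency with little
# spatial opponency, as discussed above.
print(np.corrcoef(red_out, -green_out)[0, 1])   # close to 1
```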
5 Discussion: Comparison with Other Work
The types of cells that we arrive at in constructing unitary representations of the translation and scaling group (see Figs. 2 and 4) are similar to simple cells in cat and monkey striate cortex. The analysis also predicts an interesting relationship between the bandwidths of cells and their diversity, as was discussed in Sections 3 and 4. One consequence of that relationship is that for cells to achieve a representation of the world with sampling bandwidth between 1 and 2 octaves, there must be at least two cell types that are adjacent to each other and differ by 90° in their receptive field phases (Fig. 2). This bandwidth range is the range of measured bandwidths of simple cells (e.g., Kulikowski and Bishop 1981; Andrews and Pollen 1979) and also, we think, is best suited for matching structures in natural scenes (cf. Field 1987, 1989). This analysis thus explains the presence of phase quadrature (e.g., paired even-odd simple cells) observed in the cortex (Pollen and Ronner 1981): such cell diversity is needed to build a faithful multiscale representation of the visual world. The analysis also requires breaking orientation symmetry. Here we do not wish to advocate scaling symmetry as an explanation for the existence of oriented cells in the cortex. It may be that orientation symmetry is broken for a more fundamental reason and that scaling symmetry takes advantage of that. Either way, orientation symmetry breaking is an important ingredient in building these multiscale representations.
In the past, there has been a sizeable body of work on modeling simple cells in terms of "Gabor" and "log Gabor" filters (Kulikowski et al. 1982; Daugman 1985; Field 1987, 1989). Such filters are qualitatively close to those derived here, and they describe some of the properties of simple cells well. Our work differs from previous work in many ways. The two most important differences are the following. First, the filters here are derived by unitary transforms on retinal filters that reduce
[Figure 5 plots: top, contrast sensitivity to luminance (solid) and chrominance (dashed) vs. spatial frequency (c/deg); bottom, red/green receptive field profiles of ganglion and blob cells vs. spatial distance. See caption below.]
redundancy in inputs by whitening. By selecting the unitary transformation that manifests the spatial-scale relationships in signals, one arrives at a representation that exhibits object constancy: the output responses to an input S(x) and to its planar- and depth-translated version [i.e., S(x) → S[λ(x + δx)]] are related by the index shifts of equations 3.1 and 3.2.
Hence, a visual object moved in space simply shifts the outputs from one group of cells to another. Second, we find a direct linkage between cell bandwidth and diversity. Such linkage does not appear in previous work, where orthonormality or unitarity was not required. More recently there has also been a lot of work on orthonormal multiscale representations of the scaling and translation group, alternatively known as wavelets (Meyer 1985; Daubechies 1988; Mallat 1989). The relationship of our work to wavelets was discussed in Section 3. Here we should add that in this paper we provide an explicit construction of these representations for any rational scaling factor. Furthermore, our filters satisfy K^a(λx) = (1/λ^d) K^{a+1}(x), where d is the dimension of the space (e.g., d = 1 or 2), while those in the wavelet construction satisfy K^a(λx) = (1/λ^{d/2}) K^{a+1}(x). This difference stems from the fact that our filters are the convolution of the whitening filter and a standard-type wavelet. The whitening filter, given by 1/√R(f) where R(f) is the scale-invariant power spectrum of natural scenes, is what ultimately leads to the object constancy property that is absent from standard-type wavelets. The question at this stage is whether we can identify the pieces of our mathematical construction with classes of cells in the cortex. First, there is the class of lowpass cells, a = 0, which have large receptive fields and no orientation tuning (actually, since their kernels contain the whitening factor, they are not completely lowpass but weakly bandpass, with a weak surround). We think a good candidate for these cells are the cells in the cytochrome oxidase blob areas of the cortex. When we add color
Figure 5: Facing page. Change of color coding from retina to cortex. The top plot shows the visual contrast sensitivities to the luminance and chrominance signals. The bottom plot shows the receptive field profiles (sensitivity to red or green cone inputs) of the color selective cells in the retina (ganglion cells) and the cortex. The parameters used for the ganglion cells are the same as those in Atick et al. (1992). The blob cells are constructed by lowpass filtering the ganglion cell outputs with a filter of frequency sensitivity e^{−f²/(2f²_low)}, where f_low = 1.5 c/deg. The strengths of the cell profiles are individually normalized for both the ganglion and the blob cells. The range of the spatial distance axes, or the size, of the blob cells is 3.7 times larger than that of the ganglion cells. This means that each blob cell sums the outputs of on the order of (3.7)² ≈ 14 local ganglion cells.
to our analysis, this class will come out to be color opponent.⁹ These cells, a lowpass version of the single-opponent retinal cells, turn out to be double-opponent or color-opponent-center-only (see Fig. 5) in this mathematical construction, in agreement with observations. Second, the representation requires several orientation classes in every higher scale; these are less likely to be color selective and, within each orientation and scale, there are two types of cells, in phase quadrature (e.g., even and odd symmetric), if the bandwidth of the cells is greater than one octave. These have kernels similar to simple cells. Also, in some choices of the division of the two-dimensional frequency space into bands (see Fig. 4), one encounters cells that are very different from simple cells. These cells come from the region that is bandpass in both the x and y directions (the "bb" region in Fig. 3C) and as such possess relatively small receptive fields in space. It is amusing to note their resemblance to the type of cells that Van Essen discovered in V4 (private communication). It is important at this stage to look in detail for evidence that cortical neurons are building a multiscale, translationally invariant representation of the input along the lines described in this paper. However, in looking for such evidence we must allow for the possibility that these representations are formed in an active process starting as early as the striate cortex, as was proposed recently by Olshausen et al. (1992). We must also keep in mind that to perform detailed comparisons with real cortical filters, our filters have to be modified to take noise into account.
Appendix

In this appendix we examine the condition of unitarity on the matrix U^a. The matrix elements of U^a in the scale a > 0 are generalized from the 1D case simply as cosine samples of the 2D frequencies (taking θ = 0), with 2D indices n = (n_x, n_y), j = (j_x, j_y), and f = (f_x, f_y), and phase φ = [(f_x/|f_x|)φ_x, (f_y/|f_y|)φ_y]. A priori the cells in U^a sample from the frequency region inside the big solid box but outside the dashed box in Figure 3. The critical fact that makes the 2D case different from 1D is that there are then 4(j^{a+1})² − 4(j^a)² cells in the ath class, while the total number of cells is (N)².
⁹It is easy to see why: since they are roughly lowpass, with large receptive fields, they have high signal-to-noise in space, and hence they can afford to have low signal-to-noise in color. Cells that are opponent in space have low signal-to-noise, so they need to integrate in color to improve their signal-to-noise (see Atick et al. 1992).
The unitarity requirement U^a(U^a)† = 1 (a > 0) can be shown to be equivalent to
where Δn_x is any integer ≠ 0. A similar condition in the y direction must also hold. To satisfy equation A.2 one can only hope that the cosine factor is zero for odd Δn_x and the sine factor is zero for the rest. This is impossible in 2D, although possible in 1D. To see the difference, note that in 1D, N^a = 2(j^{a+1} − j^a) and the argument of the sine is Δn π/2, which makes the sine vanish for even Δn. One then makes the cosine term zero for odd Δn by choosing φ such that [(j^{a+1} + j^a)/N^a]π + φ = ±π/2. This is exactly how equation 3.6 is reached. In 2D, N^a = 2√[(j^{a+1})² − (j^a)²], and hence the sine term is

sin[Δn_x √((j^{a+1} − j^a)/(j^{a+1} + j^a)) π/2] ≠ 0 for even Δn_x.

Although we cannot prove that the negative result in 2D is not caused by our use of a Euclidean grid, we think it is not possible to construct the representation even on a radially symmetric lattice. To ensure unitarity of U^a, we therefore need to allow for cell diversity of a different kind: cells in the ath scale need to be further broken down into different types or orientations, each type sampling from a limited region of the frequency space, as shown, for example, in Figure 3.
Acknowledgments

We would like to thank D. Field, C. Gilbert, and N. Redlich for useful discussions, and the Seaver Institute for its support.
References

Andrews, B. W., and Pollen, D. A. 1979. Relationship between spatial frequency selectivity and receptive field profile of simple cells. J. Physiol. (London) 287, 163-176.
Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251.
Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Atick, J. J., Li, Z., and Redlich, A. N. 1992. Understanding retinal color coding from first principles. Neural Comp. 4, 559-572.
Auscher, P. 1992. Wavelet bases for L²(R) with rational dilation factor. In Wavelets and Their Applications, M. B. Ruskai, ed., pp. 439-451. Jones and Bartlett, Boston.
Daubechies, I. 1988. Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math. 41, 909-996.
Daugman, J. G. 1985. Uncertainty relations for resolution in space, spatial frequency and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2, 1160-1169.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Field, D. J. 1989. What the statistics of natural images tell us about visual coding. SPIE Vol. 1077, Human Vision, Visual Processing, and Digital Display, 269-276.
Grossmann, A., and Morlet, J. 1984. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J. Math. 15, 17-34.
Hopfield, J. J. 1991. Olfactory computation and object perception. Proc. Natl. Acad. Sci. U.S.A. 88, 6462-6466.
Kulikowski, J. J., and Bishop, P. 1981. Linear analysis of the responses of simple cells in the cat visual cortex. Exp. Brain Res. 44, 386-400.
Kulikowski, J. J., Marcelja, S., and Bishop, P. 1982. Theory of spatial position and spatial frequency relations in the receptive fields of simple cells in the visual cortex. Biol. Cybern. 43, 187-198.
Linsker, R. 1992. Private communication. See also, talk at NIPS 92.
Livingstone, M. S., and Hubel, D. H. 1984. Anatomy and physiology of a color system in the primate visual cortex. J. Neurosci. 4(1), 309-356.
Mallat, S. 1989. A theory of multiresolution signal decomposition: The wavelet representation. IEEE Transact. Pattern Anal. Machine Intelligence 11, 674-693.
Meyer, Y. 1985. Principe d'incertitude, bases hilbertiennes et algèbres d'opérateurs. Sém. Bourbaki 662, 209-223.
Olshausen, B., Anderson, C. H., and Van Essen, D. C. 1992. A neural model of visual attention and invariant pattern recognition. Caltech Report No. CNS MEMO 18, August.
Pollen, D. A., and Ronner, S. F. 1981. Phase relationships between adjacent simple cells in the cat. Science 212, 1409-1411.
Ruderman, D. L., and Bialek, W. 1993. Statistics of natural images: Scaling in the woods. Private communication and to appear.
Schreiber, W. F. 1956. The measurement of third order probability distributions of television signals. IRE Trans. Inform. Theory IT-2, 94-105.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL.
Ts'o, D. Y., and Gilbert, C. D. 1988. The organization of chromatic and spatial interactions in the primate striate cortex. J. Neurosci. 8(5), 1712-1727.
Van Essen, D. C., Olshausen, B., Anderson, C. H., and Gallant, J. L. 1991. Pattern recognition, attention, and information bottlenecks in the primate visual system. Conf. on Visual Information Processing: From Neurons to Chips (SPIE Proc. 1473).
Received October 23, 1992; accepted April 20, 1993.
Communicated by Yann Le Cun
Fast Exact Multiplication by the Hessian Barak A. Pearlmutter Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540 USA
Just storing the Hessian H (the matrix of second derivatives ∂²E/∂w_i∂w_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. To calculate Hv, we first define a differential operator R_v{f(w)} = (∂/∂r) f(w + rv)|_{r=0}, note that R_v{∇_w} = Hv and R_v{w} = v, and then apply R_v{·} to the equations used to compute ∇_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to a one-pass gradient calculation algorithm (backpropagation), a relaxation gradient calculation algorithm (recurrent backpropagation), and two stochastic gradient calculation algorithms (Boltzmann machines and weight perturbation). Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating any need to calculate the full Hessian.

1 Introduction

Efficiently extracting second-order information from large neural networks is an important problem, because properties of the Hessian appear frequently: in the analysis of the convergence of learning algorithms (Widrow et al. 1979; Le Cun et al. 1991; Pearlmutter 1992), in some techniques for predicting generalization rates in neural networks (MacKay 1991; Moody 1992), in techniques for enhancing generalization by weight elimination (Le Cun et al. 1990; Hassibi and Stork 1993), and in full second-order optimization methods (Watrous 1987). There exist algorithms for calculating the full Hessian H (the matrix of second derivative terms ∂²E/∂w_i∂w_j of the error E with respect to the weights w_i) of a backpropagation network (Bishop 1992; Buntine and Weigend 1994), or reasonable estimates thereof (MacKay 1991), but even storing the full Hessian is impractical for large networks. There is also an algorithm for efficiently computing just the diagonal of the Hessian (Becker and Le Cun 1989; Le Cun et al. 1990). This is useful when the trace of the Hessian is needed, or when the diagonal approximation is being made, but there is no reason to believe that the diagonal approx-
imation is good in general, and it is reasonable to suppose that, as the system grows, the diagonal elements of the Hessian become less and less dominant. Further, the inverse of the diagonal approximation of the Hessian is known to be a poor approximation to the diagonal of the inverse Hessian. Here we derive an efficient technique for calculating the product of an arbitrary vector v with the Hessian H. This allows information to be extracted from the Hessian without ever calculating or storing the Hessian itself. A common use for an estimate of the Hessian is to take its product with various vectors, which takes O(n²) time when there are n weights. The technique we derive here finds this product in O(n) time and space,¹ and does not make any approximations. We first operate in a very general framework, to develop the basic technique. We then apply it to a series of more and more complicated systems, starting with a typical noniterative gradient calculation algorithm, in particular a backpropagation network, and proceeding to a deterministic relaxation system, and then to some stochastic systems, in particular a Boltzmann machine and a weight perturbation system.

2 The Relation between the Gradient and the Hessian
The basic technique is to note that the Hessian matrix appears in the expansion of the gradient about a point in weight space,
∇_w(w + Δw) = ∇_w(w) + HΔw + O(||Δw||²)

where w is a point in weight space, Δw is a perturbation of w, ∇_w is the gradient, the vector of partial derivatives ∂E/∂w_i, and H is the Hessian, the matrix of second derivatives of E with respect to each pair of elements of w. This equation has been used to analyze the convergence properties of some variants of gradient descent (Widrow et al. 1979; Le Cun et al. 1991; Pearlmutter 1992), and to approximate the effect of deleting a weight from the network (Le Cun et al. 1990; Hassibi and Stork 1993). Here we instead use it by choosing Δw = rv, where v is a vector and r is a small number. We wish to compute Hv. Now we note that

rHv = ∇_w(w + rv) − ∇_w(w) + O(r²)

or, dividing by r,

Hv = [∇_w(w + rv) − ∇_w(w)]/r + O(r)   (2.1)

¹Or O(pn) time when, as is typical for supervised neural networks, the full gradient is the sum of p gradients, each for one single exemplar.

This equation provides a simple approximation algorithm for finding Hv for any system whose gradient can be efficiently computed, in time about
that required to compute the gradient (assuming that the gradient at w has already been computed). Also, applying the technique requires minimal programming effort. This approximation was used to good effect in Le Cun et al. (1993) and in many numerical analysis optimization routines, which use it to gradually build up an approximation to the inverse Hessian. Unfortunately, this formula is susceptible to numeric and roundoff problems. The constant r must be small enough that the O(r) term is insignificant; but as r becomes small, large numbers are added to tiny ones in w + rv, causing a loss of precision in v. A similar loss of precision occurs in the subtraction of the original gradient from the perturbed one, because two nearly identical vectors are being subtracted to obtain the tiny difference between them.
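As a concrete illustration of equation 2.1 (a sketch; `grad` stands for any routine that returns the gradient):

```python
import numpy as np

def approx_Hv(grad, w, v, r=1e-5):
    """One-sided finite-difference estimate of Hv, per equation 2.1."""
    return (grad(w + r * v) - grad(w)) / r

# Tiny test on E(w) = 0.5 w^T A w, whose Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
w, v = np.array([0.3, -0.7]), np.array([1.0, 2.0])
print(approx_Hv(lambda u: A @ u, w, v), A @ v)   # nearly identical
```

On this quadratic test the error is pure roundoff; on a real network the O(r) truncation error and the cancellation just described trade off against each other.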
3 The R{·} Technique

Fortunately, there is a way to make an algorithm that exactly computes Hv, rather than just approximating it, and simultaneously to rid ourselves of these numeric difficulties. To do this, we first take the limit of equation 2.1 as r → 0. The left-hand side stays Hv, while the right-hand side matches the definition of a derivative, and thus

Hv = lim_{r→0} [∇_w(w + rv) − ∇_w(w)]/r = (∂/∂r) ∇_w(w + rv)|_{r=0}   (3.1)

As we shall see, there is a simple transformation to convert an algorithm that computes the gradient of the system into one that computes this new quantity. The key to this transformation is to define the operator

R_v{f(w)} = (∂/∂r) f(w + rv)|_{r=0}   (3.2)

so Hv = R_v{∇_w(w)}. (To avoid clutter we will usually write R{·} instead of R_v{·}.) We can then take all the equations of a procedure that calculates a gradient (e.g., the backpropagation procedure), and we can apply the R{·} operator to each equation. Because R{·} is a differential operator, it obeys the usual rules for differential operators, such as

R{c f} = c R{f},   R{f + g} = R{f} + R{g},   R{f g} = R{f} g + f R{g},   R{f(g)} = f'(g) R{g}   (3.3)
Also note that

R{w} = v   (3.4)
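These rules are exactly those of forward-mode differentiation, so they can be exercised mechanically; a minimal sketch (our own notation, not the paper's) carries (value, R{value}) pairs through a computation:

```python
import numpy as np

class Dual:
    """Carries (val, R{val}); propagating dot implements R_v{.} exactly."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, o):                 # R{f + g} = R{f} + R{g}
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):                 # R{f g} = R{f} g + f R{g}
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def tanh(d):                              # R{f(g)} = f'(g) R{g}
    t = np.tanh(d.val)
    return Dual(t, (1 - t * t) * d.dot)

w = Dual(np.array([0.5, -1.0]), np.array([1.0, 2.0]))   # R{w} = v, eq. 3.4
print(tanh(w * w).dot)   # equals (d/dr) tanh((w + r v)^2) at r = 0
```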
These rules are sufficient to derive, from the equations normally used to compute the gradient, a new set of equations involving a new set of R-variables. These new equations make use of variables from the original gradient calculation on their right-hand sides. This can be thought of as an adjoint system to the gradient calculation, just as the gradient calculation of backpropagation can be thought of as an adjoint system to the forward calculation of the error measure. This new adjoint system computes the vector R{∇_w}, which is precisely the vector Hv that we desire.

4 Application of the R{·} Technique to Various Networks
Let us utilize this new technique for transforming the equations that compute the gradient into equations that compute Hv, the product of a vector v with the Hessian H. We will, rather mechanically, derive appropriate algorithms for some standard sorts of neural networks that typify three broad classes of gradient calculation algorithms. These examples are intended to be illustrative, as the technique applies equally well to most other gradient calculation procedures. Usually the error E is the sum of the errors for many patterns, E = Σ_p E_p. Therefore ∇_w and H are sums over all the patterns, H = Σ_p H_p, and Hv = Σ_p H_p v. As is usual, for clarity this outer sum over patterns is not shown except where necessary, and the gradient and Hv procedures are shown for only a single exemplar.
4.1 Simple Backpropagation Networks. Let us apply the above procedure to a simple backpropagation network, to derive the R{backprop} algorithm, a set of equations that can be used to efficiently calculate Hv for a backpropagation network. While this paper was in press, I found that the R{backprop} algorithm had been independently derived a number of times. Werbos (1988, eq. 14) derived it as a backpropagation process to calculate Hv = ∇_w(v · ∇_w E), where ∇_w E is also calculated by backpropagation. That derivation is dual to the one given here, in that the direction of the equations is reversed, the backward pass of the ∇_w E algorithm becoming a forward pass in the Hv algorithm, while here the direction of the equations is unchanged. Another derivation is given in Møller (1993a). Also, the procedure is known to the automatic differentiation community (Christianson 1992; Kim et al. 1985). For convenience, we will now change our notation for indexing the weights w. Let w be the weights, now doubly indexed by their source
and destination units' indices, as in w_ij, the weight from unit i to unit j. Because v is of the same dimension as w, its elements will be similarly indexed. All sums over indices are limited to weights that exist in the network topology. As is usual, quantities that occur on the left sides of the equations are treated computationally as variables, and calculated in topological order, which is assumed to exist because the weights, regarded as a connection matrix, form a zero-diagonal matrix that can be put into triangular form (Werbos 1974). The forward computation of the network is²

x_i = Σ_j w_ji y_j,   y_i = σ_i(x_i) + I_i   (4.1)
where σ_i(·) is the nonlinearity of the ith unit, x_i is the total input to the ith unit, y_i is the output of the ith unit, and I_i is the external input (from outside the network) to the ith unit. Let the error measure be E = E(y), and its simple direct derivative with respect to y_i be e_i = ∂E/∂y_i. We assume that e_i depends only on y_i, and not on any y_j for j ≠ i. This is true of most common error measures, such as squared error or cross entropy (Hinton 1987).³ We can thus write e_i(y_i) as a simple function. The backward pass is then

∂E/∂y_i = e_i(y_i) + Σ_j w_ij σ'_j(x_j) ∂E/∂y_j,   ∂E/∂w_ij = σ'_j(x_j) y_i ∂E/∂y_j   (4.2)

Applying R{·} to the above equations gives
²This compact form of the backpropagation equations, due to Fernando Pineda, unifies the special cases of input units, hidden units, and output units. In the case of a unit i with no incoming weights (i.e., an input unit), it simplifies to y_i = σ_i(0) + I_i, allowing the value to be set entirely externally. For a hidden or output unit i, the term I_i = 0. In the corresponding equations for the backward pass (4.2), only the output units have nonzero direct error terms e_i, and since such output units have no outgoing weights, the situation for an output unit i simplifies to ∂E/∂y_i = e_i(y_i).
³If this assumption is violated, then in equation 4.4 the e'_i(y_i)R{y_i} term generalizes to Σ_j (∂e_i/∂y_j) R{y_j}.
R{x_i} = Σ_j (w_ji R{y_j} + v_ji y_j),   R{y_i} = σ'_i(x_i) R{x_i}   (4.3)

for the forward pass, and, for the backward pass,

R{∂E/∂y_i} = e'_i(y_i) R{y_i} + Σ_j [v_ij σ'_j(x_j) ∂E/∂y_j + w_ij σ''_j(x_j) R{x_j} ∂E/∂y_j + w_ij σ'_j(x_j) R{∂E/∂y_j}]

R{∂E/∂w_ij} = σ'_j(x_j) y_i R{∂E/∂y_j} + σ''_j(x_j) R{x_j} y_i ∂E/∂y_j + σ'_j(x_j) R{y_i} ∂E/∂y_j   (4.4)
The vector whose elements are R{∂E/∂w_ij} is just R{∇_w} = Hv, the quantity we wish to compute. For sum squared error, e_i(y_i) = y_i − d_i, where d_i is the desired output for unit i, so e'_i(y_i) = 1. This simplifies equation 4.4 for output units to R{∂E/∂y_i} = R{y_i}. Note that, in the above equations, the topology of the neural network sometimes results in some R-variables being guaranteed to be zero when v is sparse, in particular when v = (0 ... 0 1 0 ... 0), which can be used to compute a single desired column of the Hessian. In this situation, some of the computation is also shared between the various columns.
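For concreteness, here is a vectorized sketch of the procedure just derived (names and shapes are ours, for a one-hidden-layer network y = W2 tanh(W1 x) with squared error; the per-unit equations 4.1-4.4 above are the general form):

```python
import numpy as np

def grad_and_Hv(W1, W2, V1, V2, x, d):
    """Gradient and exact Hv for E = 0.5*||W2 tanh(W1 x) - d||^2,
    where v = (V1, V2) has the same shapes as (W1, W2)."""
    # forward pass and its R{.} counterpart (eqs. 4.1, 4.3)
    a  = W1 @ x;   h  = np.tanh(a)
    Ra = V1 @ x;   Rh = (1 - h**2) * Ra
    y  = W2 @ h;   Ry = W2 @ Rh + V2 @ h
    e  = y - d;    Re = Ry                    # e_i'(y_i) = 1 for squared error
    # backward pass and its R{.} counterpart (eqs. 4.2, 4.4)
    gW2  = np.outer(e, h)
    RgW2 = np.outer(Re, h) + np.outer(e, Rh)
    dh   = W2.T @ e
    Rdh  = W2.T @ Re + V2.T @ e
    da   = dh * (1 - h**2)
    Rda  = Rdh * (1 - h**2) - dh * 2 * h * Rh  # product rule on sigma'(a)
    gW1  = np.outer(da, x)
    RgW1 = np.outer(Rda, x)
    return (gW1, gW2), (RgW1, RgW2)            # (gradient, Hv)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
V1, V2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
x, d = rng.normal(size=2), rng.normal(size=2)
g, Hv = grad_and_Hv(W1, W2, V1, V2, x, d)

# sanity check against the finite-difference formula of equation 2.1
r = 1e-6
g2, _ = grad_and_Hv(W1 + r * V1, W2 + r * V2, V1, V2, x, d)
print(np.allclose((g2[0] - g[0]) / r, Hv[0], atol=1e-4))   # -> True
```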
4.2 Recurrent Backpropagation Networks. The recurrent backpropagation algorithm (Almeida 1987; Pineda 1987) consists of a set of forward equations that relax to a solution for the gradient,

x_i = Σ_j w_ji y_j,   dy_i/dt ∝ −y_i + σ_i(x_i) + I_i   (4.5)

Adjoint equations for the calculation of Hv are obtained by applying the R{·} operator, yielding

R{x_i} = Σ_j (w_ji R{y_j} + v_ji y_j),   dR{y_i}/dt ∝ −R{y_i} + σ'_i(x_i) R{x_i}   (4.6)
These equations specify a relaxation process for computing Hv. Just as the relaxation equations for computing ∇_w are linear even though those for computing y and E are not, these new relaxation equations are linear.
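A sketch of these two relaxations (ours; simple Euler integration, with small weights assumed so that both systems settle):

```python
import numpy as np

def relax_R(W, V, I, steps=4000, dt=0.05):
    """Euler-integrate eq. 4.5 to a fixed point y, then the linear
    R-system of eq. 4.6 for R{y}. W[j, i] holds w_ji; V holds v_ji."""
    y = np.zeros_like(I)
    for _ in range(steps):                        # forward relaxation (4.5)
        y += dt * (-y + np.tanh(W.T @ y) + I)
    x = W.T @ y
    Ry = np.zeros_like(I)
    for _ in range(steps):                        # adjoint relaxation (4.6)
        Rx = W.T @ Ry + V.T @ y
        Ry += dt * (-Ry + (1 - np.tanh(x)**2) * Rx)
    return y, Ry

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4)); np.fill_diagonal(W, 0.0)
V = 0.1 * rng.normal(size=(4, 4))
y, Ry = relax_R(W, V, I=np.array([1.0, -0.5, 0.0, 0.0]))
```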
4.3 Stochastic Boltzmann Machines. One might ask whether this technique can be used to derive a Hessian multiplication algorithm for a classic Boltzmann machine (Ackley et al. 1985), which is discrete and stochastic, unlike its continuous and deterministic cousin, to which the application of R{·} is simple. A classic Boltzmann machine operates stochastically, with its binary unit states s_i taking on random values according to the probability

P(s_i = 1) = p_i = σ(x_i/T),   x_i = Σ_j w_ji s_j   (4.7)

At equilibrium, the probability of a state α of all the units (not just the visible units) is related to its energy

E_α = −Σ_{i<j} s_i^α s_j^α w_ij   (4.8)
by P(α) = Z⁻¹ exp(−E_α/T), where the partition function is Z = Σ_α exp(−E_α/T). The system's equilibrium statistics are sampled because, at equilibrium,

∂G/∂w_ij = −(p⁺_ij − p⁻_ij)/T   (4.9)

where p_ij = ⟨s_i s_j⟩, G is the asymmetric divergence, an information-theoretic measure of the difference between the environmental distribution over the output units and that of the network, as used in Ackley et al. (1985), T is the temperature, and the + and − superscripts indicate the environmental distribution, + for waking and − for hallucinating. Applying the R{·} operator, we obtain

R{∂G/∂w_ij} = −(R{p⁺_ij} − R{p⁻_ij})/T   (4.10)
We shall soon find it useful to define

D_α = R{E_α} = −Σ_{i<j} s_i^α s_j^α v_ij   (4.11)

q_ij = ⟨s_i s_j D⟩   (4.12)

(with the letter D chosen because it has the same relation to v that E has to w) and to note that

⟨D⟩ = Σ_α P(α) D_α   (4.13)
With some calculus, we find R{exp(−E_α/T)} = −P(α) Z D_α/T, and thus R{Z} = −Z⟨D⟩/T. Using these and the relation between the probability of a state and its energy, we have

R{P(α)} = P(α)(⟨D⟩ − D_α)/T   (4.14)

where the expression P(α) cannot be treated as a constant because it is defined over all the units, not just the visible ones, and therefore depends on the weights. Since R{p_ij} = Σ_α R{P(α)} s_i^α s_j^α = (p_ij⟨D⟩ − q_ij)/T, this can be used to calculate

R{∂G/∂w_ij} = [(q⁺_ij − p⁺_ij⟨D⟩⁺) − (q⁻_ij − p⁻_ij⟨D⟩⁻)]/T²   (4.15)
This beautiful formula⁴ gives an efficient way to compute Hv for a Boltzmann machine, or at least as efficient a way as is used to compute the gradient, simply by using sampling to estimate q_ij. This requires the additional calculation and broadcast of the single global quantity D, but is otherwise local. The collection of statistics for the gradient is sometimes accelerated by using the equation

p_ij = ⟨p_i s_j⟩   (4.16)

The analogous identity for accelerating the computation of q_ij is

q_ij = ⟨p_i s_j D|_{s_i=1}⟩   (4.17)

or

q_ij = ⟨p_i s_j [D − (1 − s_i)ΔD_i]⟩   (4.18)

where ΔD_i = Σ_j s_j v_ij is defined by analogy with ΔE_i = E|_{s_i=0} − E|_{s_i=1} = Σ_j s_j w_ij. The derivation here was for the simplest sort of Boltzmann machine, with binary units and only pairwise connections between the units. However, the technique is immediately applicable to higher-order Boltzmann machines (Hinton 1987), as well as to Boltzmann machines with nonbinary units (Movellan and McClelland 1991).

⁴Equation 4.15 is similar in form to that of the gradient of the entropy (Geoff Hinton, personal communication).
4.4 Weight Perturbation. In weight perturbation (Jabri and Flower 1991; Alspector et al. 1993; Flower and Jabri 1993; Kirk et al. 1993; Cauwenberghs 1993) the gradient ∇_w is approximated using only the globally broadcast result of the computation of E(w). This is done by adding a random zero-mean perturbation vector Δw to w repeatedly and approximating the resulting change in error by E(w + Δw) = E(w) + ΔE ≈ E(w) + ∇_w · Δw. From the viewpoint of each individual weight w_i,

ΔE = Δw_i ∂E/∂w_i + noise   (4.19)
Because of the central limit theorem it is reasonable to make a least-squares estimate of ∂E/∂w_i, which is ∂E/∂w_i ≈ ⟨Δw_i ΔE⟩/⟨Δw_i²⟩. The numerator is estimated from corresponding samples of Δw_i and ΔE, and the denominator from prior knowledge of the distribution of Δw_i. This requires only the global broadcast of ΔE, while each Δw_i can be generated and used locally. It is hard to see a way to mechanically apply R{·} to this procedure, but we can nonetheless derive a suitable procedure for estimating Hv. We note that a better approximation for the change in error would be

E(w + Δw) = E(w) + ∇_w · Δw + ½Δw^T Ĥ Δw   (4.20)
where Ĥ is an estimate of the Hessian H. We wish to include in Ĥ only those properties of H that are relevant. Let us define z = Hv. If Ĥ is to be small in the least-squares sense, but also satisfy Ĥv = z, then the best choice would be Ĥ = zv^T/||v||², except that then Ĥ would not be symmetric, and therefore the error surface would not be well defined. Adding the symmetry requirement, which amounts to the added constraint v^T Ĥ = z^T, the least-squares Ĥ becomes

Ĥ = (1/||v||²) (z v^T + v z^T − [(v·z)/||v||²] v v^T)   (4.21)

Substituting this in and rearranging the terms, we find that, from the perspective of each weight,

ΔE = Δw_i ∂E/∂w_i + [(Δw·v)/||v||²] [Δw_i − (Δw·v) v_i/(2||v||²)] z_i + noise   (4.22)

This allows both ∂E/∂w_i and z_i to be estimated in the same least-squares fashion as above, using only locally available values, v_i and Δw_i, and the globally broadcast ΔE, plus a new quantity that must be computed and globally broadcast, Δw · v. The same technique applies equally well to other perturbative procedures, such as unit perturbation (Flower and Jabri 1993), and a similar derivation can be used to find the diagonal elements of H, without the need for any additional globally broadcast values.
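A sketch of the first-order part of this recipe (our names; the z_i = (Hv)_i estimate of equation 4.22 follows the same least-squares pattern, with the extra broadcast quantity Δw · v):

```python
import numpy as np

def perturbative_gradient(E, w, sigma=1e-3, samples=5000, rng=None):
    """Estimate dE/dw_i ~= <dw_i * dE> / <dw_i^2> from random perturbations,
    using only the globally broadcast scalar dE (eq. 4.19)."""
    rng = rng or np.random.default_rng(0)
    E0 = E(w)
    num = np.zeros_like(w)
    for _ in range(samples):
        dw = rng.normal(0.0, sigma, size=w.shape)
        num += dw * (E(w + dw) - E0)      # accumulate <dw_i * dE>
    return num / (samples * sigma**2)     # divide by <dw_i^2> = sigma^2

E = lambda w: 0.5 * w @ np.array([[3.0, 1.0], [1.0, 2.0]]) @ w
w = np.array([0.3, -0.7])
print(perturbative_gradient(E, w))        # noisy estimate of [0.2, -1.1]
```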
5 Practical Applications
The R{·} technique makes it possible to calculate Hv efficiently. This can be used in the center of many different iterative algorithms to extract particular properties of H. In essence, it allows H to be treated as a generalized sparse matrix.

5.1 Finding Eigenvalues and Eigenvectors. Standard variants of the power method allow one to

• Find the largest few eigenvalues of H, and their eigenvectors (a sketch follows this list).
• Find the smallest few eigenvalues of H, and their eigenvectors.
• Sample H's eigenvalue spectrum, along with the corresponding eigenvectors.
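For the first item, power iteration needs nothing but Hv as a black box; a sketch (ours; convergence assumes a well-separated dominant eigenvalue):

```python
import numpy as np

def top_eigenpair(hvp, n, iters=200, rng=None):
    """Power iteration using only Hv products; `hvp(v)` can be any of
    the Hv procedures derived earlier. Never forms H."""
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v), v          # Rayleigh quotient and eigenvector estimate

H = np.array([[3.0, 1.0], [1.0, 2.0]])
print(top_eigenpair(lambda v: H @ v, n=2))   # largest eigenvalue ~ 3.618
```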
The clever algorithm of Skilling (1989) estimates the eigenvalue spectrum of a generalized sparse matrix. It starts by choosing a random vector v₀, calculating v_i = H^i v₀ for i = 1, ..., m, using the dot products v_i · v_j as estimates of the moments of the eigenvalue spectrum, and using these moments to recover the shape of the eigenvalue spectrum. This algorithm is made applicable to the Hessian by the R{·} technique, in both deterministic and, with minor modifications, stochastic gradient settings.

5.2 Multiplication by the Inverse Hessian. It is frequently necessary to find x = H⁻¹b, which is the key calculation of all Newton's-method second-order numerical optimization techniques, and is also used in the Optimal Brain Surgeon technique, and in some techniques for predicting the generalization rate. The R{·} technique does not directly solve this problem, but instead one can solve Hx = b for x by minimizing ||Hx − b||² using the conjugate-gradient method, thus exactly computing x = H⁻¹b in n iterations without calculating or storing H⁻¹. This squares the condition number, but if H is known to be positive definite, one can instead minimize x^T Hx/2 − x · b, which does not square the condition number (Press et al. 1988, p. 78).

5.3 Step Size and Line Search. Many optimization techniques repeatedly choose a direction v, and then proceed along that direction some distance μ, which takes the system to the constrained minimum of E(w + μv). Finding the value of μ that minimizes E is called a line search, because it searches only along the line w + μv. There are many techniques for performing a line search. Some are approximate while others attempt to find an exact constrained minimum, and some use only the value of the error, while others also make use of the gradient. In particular, the line search used within the Scaled Conjugate Gradient (SCG) optimization procedure, in both its deterministic (Møller 1993b)
Multiplication by the Hessian
157
and stochastic (Møller 1993c) incarnations, makes use of both first- and second-order information at w to determine how far to move. The first-order information used is simply ∇_w(w), while the second-order information is precisely Hv, calculated with the one-sided finite difference approximation of equation 2.1. It can thus benefit immediately from the exact calculation of Hv. In fact, the R{backprop} procedure was independently discovered for that application (Møller 1993a). The SCG line search proceeds as follows. Assuming that the error E is well approximated by a quadratic, the product Hv and the gradient ∇_w(w) predict the gradient at any point along the line w + μv by
∇_w(w + μv) = ∇_w(w) + μHv + O(μ²)   (5.1)
Disregarding the O(μ²) term, if we wish to choose μ to minimize the error, we take the dot product of ∇_w(w + μv) with v and set it equal to zero, as the gradient at the constrained minimum must be orthogonal to the space under consideration. This gives v · ∇_w(w) + μ v · Hv = 0, or

μ = −v · ∇_w(w) / (v · Hv)   (5.2)

Equation 5.1 then gives a prediction of the gradient at w + μv. To assess the accuracy of the quadratic approximation we might wish to compare this with a gradient measurement taken at that point, or we might even preemptively take a step in that direction.

5.4 Eigenvalue-Based Learning Rate Optimization for Stochastic Gradient Descent. The technique described in the previous section is, at least as stated, suitable only for deterministic gradient descent. In many systems, particularly large ones, deterministic gradient descent is impractical; only noisy estimates of the gradient are available. In joint work with colleagues at AT&T Bell Labs (Le Cun et al. 1993), the approximation technique of equation 2.1 enabled H to be treated as a generalized sparse matrix, and properties of H were extracted to accelerate the convergence of stochastic gradient descent. Information accumulated online, in particular the eigenvalues and eigenvectors of the principal eigenspace, was used to linearly transform the weight space in such a way that the ill-conditioned off-axis long narrow valleys in weight space, which slow down gradient descent, become well-conditioned circular bowls. This work did not use an exact value for Hv, but rather a stochastic unbiased estimate of the Hessian based on just a single exemplar at a time. Computations of the form x(t) = H(t)v were replaced with relaxations of the form x(t) = (1 − α)x(t − 1) + αH(t)v, where 0 < α ≪ 1 determines the trade-off between steady-state noise and speed of convergence.
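Returning to the Section 5.2 idea, a conjugate-gradient sketch (ours) that computes H⁻¹b using only Hv products, assuming H is positive definite so the condition number is not squared:

```python
import numpy as np

def solve_Hx_eq_b(hvp, b, iters=None, tol=1e-10):
    """Conjugate gradients on x^T H x / 2 - x.b, using only Hv products."""
    x = np.zeros_like(b)
    r = b - hvp(x)            # residual b - Hx
    p = r.copy()
    rs = r @ r
    for _ in range(iters or len(b)):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(solve_Hx_eq_b(lambda v: H @ v, b), np.linalg.solve(H, b))  # agree
```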
6 Summary and Conclusion
Second-order information about the error is of great practical and theoretical importance. It allows sophisticated optimization techniques to be applied, appears in many theories of generalization, and is used in sophisticated weight pruning procedures. Unfortunately, the Hessian matrix H, whose elements are the second derivative terms ∂²E/∂w_i∂w_j, is unwieldy. We have derived the R{·} technique, which directly computes Hv, the product of the Hessian with a vector. The technique is

• exact: no approximations are made.
• numerically accurate: there is no drastic loss of precision.
• efficient: it takes about the same amount of computation as a gradient calculation.
• flexible: it applies to all existing gradient calculation procedures.
• robust: if the gradient calculation gives an unbiased estimate of ∇_w, then our procedure gives an analogous unbiased estimate of Hv.
Procedures that result from application of the R{·} technique are about as local, parallel, and efficient as the original untransformed gradient calculation. The technique applies naturally to backpropagation networks, recurrent networks, relaxation networks, Boltzmann machines, and perturbative methods. Hopefully, this new class of algorithms for efficiently multiplying vectors by the Hessian will facilitate the construction of algorithms that are efficient in both space and time.
Acknowledgments

I thank Yann Le Cun and Patrice Simard for their encouragement and generosity. Without their work and enthusiasm I would not have derived the R{·} technique or recognized its importance. Thanks also go to Nandakishore Kambhatla, John Moody, Thomas Petsche, Steve Rehfuss, Akaysha Tang, David Touretzky, Geoffrey Towell, Andreas Weigend, and an anonymous reviewer for helpful comments and careful readings. This work was partially supported by Grants NSF ECS-9114333 and ONR N00014-92-J-4062 to John Moody, and by Siemens Corporate Research.

References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.
Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE International Conference on
Neural Networks, M. Caudill and C. Butler, eds., pp. 609-618. San Diego, CA.
Alspector, J., Meir, R., Yuhas, B., and Jayakumar, A. 1993. A parallel gradient descent method for learning in analog VLSI neural networks. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 836-844. Morgan Kaufmann, San Mateo, CA.
Becker, S., and Le Cun, Y. 1989. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, eds., pp. 29-37. Morgan Kaufmann, San Mateo, CA.
Bishop, C. 1992. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comp. 4(4), 494-501.
Buntine, W., and Weigend, A. 1994. Computing second derivatives in feed-forward networks: A review. IEEE Transact. Neural Networks, in press.
Caudill, M., and Butler, C., eds. 1987. IEEE First International Conference on Neural Networks. San Diego, CA.
Cauwenberghs, G. 1993. A fast stochastic error-descent algorithm for supervised learning and optimization. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 244-251.
Christianson, B. 1992. Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis 12, 135-150.
Flower, B., and Jabri, M. 1993. Summed weight neuron perturbation: An O(n) improvement over weight perturbation. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 212-219. Morgan Kaufmann, San Mateo, CA.
Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164-171. Morgan Kaufmann, San Mateo, CA.
Hinton, G. E. 1987. Connectionist learning procedures. Tech. Rep. CMU-CS-87-115, Carnegie Mellon University, Pittsburgh, PA.
Jabri, M., and Flower, B. 1991. Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Comp. 3(4), 546-565.
Kim, K. V., Nesterov, Y. E., and Cherkassky, B. V. 1985. An algorithm for fast differentiations and its applications. In Abstracts of the 12th IFIP Conference on System Modeling and Optimization, pp. 181-182, Budapest, Hungary.
Kirk, D. B., Kerns, D., Fleischer, K., and Barr, A. H. 1993. Analog VLSI implementation of gradient descent. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 789-796.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
Le Cun, Y., Kanter, I., and Solla, S. A. 1991. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 918-924. Morgan Kaufmann, San Mateo, CA.
Le Cun, Y., Simard, P. Y., and Pearlmutter, B. A. 1993. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 156-163. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1991. A practical Bayesian framework for back-prop networks. Neural Comp. 4(3), 448-472.
Møller, M. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time. Daimi PB-432, Computer Science Department, Aarhus University, Denmark.
Møller, M. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525-533.
Møller, M. 1993c. Supervised learning on large redundant training sets. Int. J. Neural Syst. 4(1), 15-25.
Moody, J. E. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 847-854. Morgan Kaufmann, San Mateo, CA.
Movellan, J. R., and McClelland, J. L. 1991. Learning continuous probability distributions with the contrastive Hebbian algorithm. Tech. Rep. PDP.CNS.91.2, Dept. of Psychology, Carnegie Mellon University, Pittsburgh, PA.
Pearlmutter, B. A. 1992. Gradient descent: Second-order momentum and saturating error. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 887-894. Morgan Kaufmann, San Mateo, CA.
Pineda, F. 1987. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 19(59), 2229-2232.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Skilling, J. 1989. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, J. Skilling, ed., pp. 455-466. Kluwer Academic Publishers, Norwell, MA.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. In IEEE First International Conference on Neural Networks, M. Caudill and C. Butler, eds., pp. 619-627. San Diego, CA.
Werbos, P. J. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Werbos, P. J. 1988. Backpropagation: Past and future. In IEEE International Conference on Neural Networks, Vol. I, pp. 343-353, San Diego, CA.
Widrow, B., McCool, J. M., Larimore, M. G., and Johnson, C. R., Jr. 1979. Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proc. IEEE 64, 1151-1162.
Received January 15, 1993; accepted June 9, 1993.
Communicated by Françoise Fogelman-Soulié
Polyhedral Combinatorics and Neural Networks

Andrew H. Gee
Richard W. Prager
Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England
The often disappointing performance of optimizing neural networks can be partly attributed to the rather ad hoc manner in which problems are mapped onto them for solution. In this paper a rigorous mapping is described for quadratic 0-1 programming problems with linear equality and inequality constraints, this being the most general class of problem such networks can solve. The problem's constraints define a polyhedron P containing all the valid solution points, and the mapping guarantees strict confinement of the network's state vector to P. However, forcing convergence to a 0-1 point within P is shown to be generally intractable, rendering the Hopfield and similar models inapplicable to the vast majority of problems. A modification of the tabu learning technique is presented as a more coherent approach to general problem solving with neural networks. When tested on a collection of knapsack problems, the modified dynamics produced some very encouraging results.

Neural Computation 6, 161-180 (1994)
@ 1993 Massachusetts Institute of Technology

1 Introduction and Outline
While feedback neural networks have been studied for some time as a means of approximately solving combinatorial optimization problems, many researchers have reported that such networks rarely find valid solutions, let alone high quality ones (Kamgar-Parsi and Kamgar-Parsi 1990; Wilson and Pawley 1988). We suggest that a significant cause of such failures has been the rather ad hoc manner in which problems are mapped onto the network for solution. Accepting that such networks can perform nothing more sophisticated than continuous bounded descent on a quadratic energy function, it is quite possible to derive efficient mappings to exploit these simple dynamics to the full. Throughout this paper we draw on results from traditional mathematical programming (Schrijver 1986) to examine both problems and network dynamics from a polyhedral perspective. For a general quadratic 0-1 programming problem with linear equality and inequality constraints, it is possible to identify a bounded polyhedron, or polytope, within which the solution vector must lie if it is not to violate any of the aforementioned constraints. In this paper we describe a rigorous mapping for such
problems, which guarantees confinement of the network's state vector to the constraint polytope; hence the possibility of arriving at states which violate any of the constraints is eliminated. The mapping is a generalization of the "valid subspace" approach (Aiyer et al. 1990; Aiyer 1991), combined with an improved slack variable technique for dealing with inequality constraints (Tagliarini and Page 1988; Tagliarini et al. 1991). However, confinement to the correct polytope is not the final goal: to find a valid solution we require convergence to a 0-1 point within this polytope. We demonstrate that forcing the network to converge to such a point is probably impossible, since finding such a point is generally an NP-complete problem. In particular, this means that for correctly mapped problems, the popular "annealing" techniques, often employed to force convergence to a 0-1 point, are quite ineffective. For the small class of integral polytopes (Schrijver 1986), convergence to a 0-1 point is feasible, though such polytopes are the exception and not the rule. It follows that conventional neural network techniques are quite unsuited to optimization over most polytopes. In response, we present modified network dynamics, based on the tabu learning technique of Beyer and Ogier (1990) and Beyer and Ogier (1991), which constitute a more coherent approach to optimization over nonintegral polytopes. Beyer and Ogier's scheme marries tabu search with Hopfield-style dynamics to produce a viable neural search technique, which should be contrasted with the one-shot descent offered by the Hopfield and similar models. The modified tabu search dynamics presented here are tested on a collection of knapsack problems, with very encouraging results.

The organization of the paper is as follows. In Section 2 we review continuous descent dynamics, including the Hopfield model, and their application to the solution of 0-1 programming problems. We proceed to describe in Section 3 a rigorous mapping for the most general quadratic 0-1 programming problems with linear equality and inequality constraints. In the light of this mapping we go on to address the issue of efficient simulation of descent dynamics on discrete-time machines. In Section 4 we examine the convergence properties of such dynamics, concluding that it is probably impossible to force convergence to a 0-1 point for most problems; we also discuss the deficiencies of the various annealing techniques in this context. A more coherent approach is offered by tabu search dynamics, which we develop in Section 5 and subsequently apply to a set of knapsack problems, with very encouraging results. Finally, in Section 6 we discuss the key issues raised in the paper and present our conclusions.

2 Bounded Descent Techniques
While the arguments of this paper are of relevance to any method attempting to solve 0-1 programming problems by continuous descent, of
particular interest is the Hopfield network¹ technique, since it has been the subject of significant research in the recent past, and highlights the potential advantages of fast, analogue implementations. The dynamics of the network are governed by the following equations:

\dot{u}_i = -\eta u_i + \sum_j T_{ij} v_j + i_i^b   (2.1)

v_i = g(u_i)   (2.2)

v_i is a continuous variable in the interval 0 to 1, and g(u_i) is a monotonically increasing function which constrains v_i to this interval, usually a hyperbolic tangent of the form

g(u_i) = \frac{1}{2} [1 + \tanh(u_i/u_0)]   (2.3)

T is a matrix of interconnection weights, and i^b is a bias vector. If T is symmetric, and either \eta = 0 or the gain of the transfer functions (2.3) is high (as we shall assume throughout this paper), the system has a Liapunov function (Hopfield 1984)

E(v) = -\frac{1}{2} v^T T v - v^T i^b   (2.4)
The Hopfield network therefore provides a continuous method of performing bounded descent minimization of functions of the form (2.4) with symmetric T. By "bounded," we mean that the descent is limited to the unit hypercube, at the vertices of which lie the 0-1 points; the network is therefore potentially suited to the solution of 0-1 programming problems. Should we wish to solve any particular problem, we must first decide how to set the network parameters T and i^b, so that minimization of the Liapunov function (2.4) coincides with minimization of the problem's objective function and enforces satisfaction of the problem's constraints: this process is termed "mapping" the problem onto the network. The particular attraction of the network lies in the existence of an analogue electrical circuit with dynamics corresponding to (2.1) and (2.2) (Hopfield 1984). It is also possible to use the network for cases where T is not symmetric, by noting that

v^T T v = \frac{1}{2} v^T (T + T^T) v   (2.5)
Hence, should we wish to use the network to minimize a function E with a nonsymmetric T, we simply replace T with \frac{1}{2}(T + T^T), which is symmetric; in so doing we do not change E.

¹The Hopfield network is in fact a special case of the additive model developed by Grossberg in the 1960s [see Grossberg (1988) for an historical survey].
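As a concrete illustration, the following sketch (ours, not the authors' code) integrates the dynamics (2.1)-(2.3) with a simple Euler scheme and tracks the Liapunov function (2.4); for symmetric T, \eta = 0, and a small enough time step, E should be nonincreasing. All parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 8
T = rng.standard_normal((n, n))
T = 0.5 * (T + T.T)               # symmetrize as in (2.5); E is unchanged
ib = rng.standard_normal(n)
u0 = 0.05                         # small u0 gives high-gain transfer functions

def g(u):                         # transfer function (2.3)
    return 0.5 * (1.0 + np.tanh(u / u0))

def energy(v):                    # Liapunov function (2.4)
    return -0.5 * v @ T @ v - v @ ib

u = 0.01 * rng.standard_normal(n)
dt = 1e-4
E = [energy(g(u))]
for _ in range(5000):
    u += dt * (T @ g(u) + ib)     # dynamics (2.1) with eta = 0
    E.append(energy(g(u)))
print("max single-step change in E:", max(np.diff(E)))   # should be <= ~0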
Also of interest are steepest descent dynamics, previously studied in Aiyer (1991) and Ogier and Beyer (1990):
\dot{v}_i = \begin{cases} 0 & \text{if } v_i = 0 \text{ and } [Tv + i^b]_i < 0 \\ 0 & \text{if } v_i = 1 \text{ and } [Tv + i^b]_i > 0 \\ [Tv + i^b]_i & \text{otherwise} \end{cases}   (2.6)
The dynamics (2.6) share the same Liapunov function (2.4) as the Hopfield network, and the v variables are also limited to the range 0 \le v_i \le 1. Thus (2.6) provides an alternative method of minimizing E within the unit hypercube, and has the advantage of a more efficient simulation on digital computers, as we shall see in Section 3.3. Throughout this paper we shall refer to bounded, quadratic continuous descent systems (like the Hopfield and steepest descent models) as descent networks.

3 Quadratic 0-1 Programming on Descent Networks

3.1 Problem Statement. In this paper we study quadratic 0-1 programming problems with linear constraints, this being the most general class of problem that can be mapped onto a descent network for solution. The mathematical statement of such a problem is
minimize E^{op}(v) = -\frac{1}{2} v^T T^{op} v - v^T i^{op}   (3.1)

subject to A^{eq} v = b^{eq}   (3.2)

A^{in} v \le b^{in}   (3.3)

0 \le v_i \le 1, \quad i \in \{1, \ldots, n\}   (3.4)

and v integral   (3.5)
where v \in R^n, b^{eq} \in R^{m^{eq}} and b^{in} \in R^{m^{in}}. With reference to condition (3.5), we define an integral vector to be one whose elements are all integers. The system (3.1)-(3.5) describes an optimization problem in which it is required to minimize the quadratic objective function (3.1) subject to a set of linear equality (3.2) and inequality (3.3) constraints on the elements of v. The final two conditions (3.4) and (3.5) ensure that v_i \in \{0, 1\}. From a geometric point of view, we note that the conditions (3.2)-(3.4), if feasible, define a bounded polyhedron, or polytope, within which
v must remain if it is to represent a valid solution: let us denote this polytope by the symbol P. When attempting to solve such problems with descent networks, we strive toward discovering a mapping so that v continually remains within P while performing some sort of descent on the objective function (3.1). It may also be necessary to employ an annealing process to free v from local minima of E and drive v toward a hypercube corner, where the condition (3.5) is satisfied. Unfortunately, problems are often mapped onto descent networks in a rather ad hoc manner, and strict confinement to P is rarely achieved.

3.2 A Rigorous Mapping for Quadratic 0-1 Programming Problems. In this section we describe a rigorous mapping for problems of the form (3.1)-(3.5), which achieves all the goals mentioned in Section 3.1. The mapping is a generalization of the "valid subspace" approach, developed for a few specific problems in Aiyer et al. (1990) and Aiyer (1991), coupled with an improved slack variable technique for dealing with inequality constraints (Tagliarini and Page 1988; Tagliarini et al. 1991). We begin by considering the simple case where there are no inequality constraints, leaving only equality constraints of the form (3.2). The equality constraints constitute a system of linear equations that can be solved to obtain an affine subspace of solutions. If the rows of A^{eq} are linearly independent (as is invariably the case for a set of feasible, irredundant constraints), then we can describe this subspace in the form
v = T^{val} v + s   (3.6)

where

T^{val} = I - A^{eq T} (A^{eq} A^{eq T})^{-1} A^{eq}   (3.7)

s = A^{eq T} (A^{eq} A^{eq T})^{-1} b^{eq}   (3.8)
If we now set the network's Liapunov function to be

E = E^{op} + \frac{c_0}{2} \| v - (T^{val} v + s) \|^2   (3.9)
then, in the limit of large c_0, v will satisfy (3.6) and therefore (3.2) throughout convergence. In this way all the equality constraints have been combined into a single penalty term in the network's Liapunov function; moreover, this term does not interfere with the objective E^{op} at any point within the polytope P. The Liapunov function (3.9) can be achieved by setting the network's parameters as follows (Aiyer 1991):
T = T^{op} + c_0 (T^{val} - I)   (3.10)

i^b = i^{op} + c_0 s   (3.11)
It now remains only to point out that the network's transfer functions naturally enforce conditions (3.4), and so confinement of v within P is guaranteed. The mapping can be extended to cope with inequality constraints of the form (3.3). This is possible using the slack variable technique, by which inequality constraints are converted into equality constraints. Slack variables were first proposed for the assignment problem (Tagliarini and Page 1988) and subsequently generalized (Tagliarini et al. 1991). Here we improve the technique so that each inequality constraint requires the introduction of only one slack variable. A typical constraint

A^{in}_{i1} v_1 + A^{in}_{i2} v_2 + \cdots + A^{in}_{in} v_n \le b^{in}_i   (3.12)

becomes

A^{in}_{i1} v_1 + A^{in}_{i2} v_2 + \cdots + A^{in}_{in} v_n + k_i w_i = b^{in}_i, \quad k_i > 0   (3.13)
Here w_i is the slack variable, and k_i is a positive constant that is set to ensure that w_i is bounded between 0 and 1, and can therefore be treated in the same manner as the v variables.² Thus we transform the system of inequality constraints into a system of equality constraints acting on an extended set of variables v^{+T} = [v^T w^T], where w is the vector of slack variables [w_1 w_2 \ldots w_{m^{in}}]^T: the size of v^+ is therefore n^+ = n + m^{in}. The optimization problem can now be expressed as
minimize E^{op}(v^+)   (3.15)

subject to A^{eq+} v^+ = b^{eq+}   (3.16)

and 0 \le v^+_i \le 1, \quad i \in \{1, \ldots, n^+\}   (3.17)

and v integral   (3.18)

where A^{eq+} and b^{eq+} stack the original equality constraints (3.2) on top of the converted inequality constraints (3.13).

²This can be achieved by setting k_i as follows:

k_i = b^{in}_i - \sum_j \min(A^{in}_{ij}, 0)   (3.14)

If this gives a negative value for k_i, then the inequality constraint (3.12) cannot be satisfied by any 0-1 vector v.
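The conversion is easily mechanized. In the sketch below (ours), the expression for k_i implements the worst-case bound of footnote 2, under the assumption that each v_j ranges over [0, 1]:

import numpy as np

def add_slacks(Ain, bin_):
    """Convert Ain v <= bin_ into equalities with one slack per row (3.13)."""
    k = bin_ - np.minimum(Ain, 0.0).sum(axis=1)   # worst-case slack scale (3.14)
    if np.any(k < 0):
        raise ValueError("some inequality cannot be met by any v in [0,1]^n")
    A_ext = np.hstack([Ain, np.diag(k)])          # acts on v+ = [v, w]
    return A_ext, bin_, k

Ain = np.array([[1.0, 1.0, -0.5]])
bin_ = np.array([1.2])
A_ext, b_ext, k = add_slacks(Ain, bin_)
v = np.array([0.3, 0.4, 1.0])                     # a feasible point
w = (b_ext - Ain @ v) / k                         # implied slack
assert (0 <= w).all() and (w <= 1).all()          # w can be treated like v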
Figure 1: Schematic diagram of the efficient descent algorithm.

Note that E^{op} has no dependency on the slack variables w, and the equality constraints (3.16) embody both the original equality and inequality constraints (3.2) and (3.3). The formulation (3.15)-(3.18) contains only equality constraints, and so the mapping technique described above is applicable. We can therefore find an n^+ \times n^+ matrix T^{val}, and an n^+-element vector s, such that if the network's parameters are set as in (3.10)-(3.11) (substituting T^{op+} for T^{op} and i^{op+} for i^{op}), the state vector v^+ will perform a descent on E^{op} while remaining strictly within the polytope P^+ defined by (3.16) and (3.17). If we observe only the variables v within the extended set v^+, we shall see that they perform a descent on E^{op} while remaining strictly within the polytope P, as required. Note that the integrality conditions v_i \in \{0, 1\} apply only to the original variables and not to the slack variables, which may take any value between 0 and 1 at a valid solution point. Thus it is not necessary (and indeed incorrect) to force the slack variables toward 0 or 1 using an annealing technique. To find a valid solution we need only ensure that v converges to an integral point within P, which is the subject of Section 4.

3.3 Efficient Simulation on Digital Computers. In this section we consider discrete-time simulation of descent dynamics on Liapunov functions of the form (3.9), as encountered in proper mappings of 0-1 programming problems. A standard Euler approximation of the dynamic equations, though feasible, would require a very small time step at each iteration, and would therefore be slow to converge. This is because c_0 must be large to guarantee confinement to the polytope P^+, resulting in correspondingly large values of \dot{v}^+ when v^+ strays marginally outside P^+. Hence a large time step is bound to lead to unstable oscillations of v^+ around P^+. Figure 1 shows a schematic illustration of a far more efficient simulation algorithm, previously presented in Aiyer (1991), for the steepest descent dynamics (2.6). Each iteration has two distinct stages. First, as depicted in box (1), v^+ is updated over a finite time step \Delta t using the
gradient of the objective term E^{op} alone. The time step can be far larger than that for the standard Euler approximation, since the problematic c_0 term is not implemented. However, updating v^+ in this manner will typically take v^+ outside the polytope P^+. Hence, after every update, v^+ is directly projected back onto P^+. This is an iterative process, requiring several passes around the loop (2), in which v^+ is first orthogonally projected onto the subspace given by (3.6), and then thresholded so that its elements lie in the range 0 to 1. For more details of projection onto polytopes the reader is referred to Agmon (1954) and Motzkin and Schoenberg (1954). The resulting algorithm is highly efficient and has general applicability to any quadratic 0-1 programming problem.
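In outline, one iteration of the algorithm in Figure 1 might be coded as follows. This is a reconstruction from the text, not the authors' implementation; the step size and pass count are illustrative.

import numpy as np

def project(v, Tval, s, n_passes=50):
    # Loop (2): alternate subspace projection and thresholding.
    for _ in range(n_passes):
        v = Tval @ v + s            # orthogonal projection onto (3.6)
        v = np.clip(v, 0.0, 1.0)    # threshold into the unit hypercube
    return v

def descend(Top, iop, Tval, s, v, dt=0.05, n_iters=200):
    for _ in range(n_iters):
        v = v + dt * (Top @ v + iop)   # box (1): step on the objective alone
        v = project(v, Tval, s)        # then restore confinement to P+
    return v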
4 Polyhedral Issues in Convergence

Having seen how it is possible to achieve descent dynamics on E^{op} while remaining within the polytope P, we now need to investigate the convergence properties of such a process. This is best illustrated by an example. Consider the following knapsack problem in two variables:

minimize E^{op}(v) = -(v_1 + 2 v_2)   (4.1)

subject to v_1 + v_2 \le 1.7   (4.2)

0 \le v_i \le 1, \quad i \in \{1, 2\}   (4.3)

and v integral   (4.4)
For knapsack problems the objective function E^{op} is linear and therefore has no local minima. The problem is easily mapped onto a descent network for solution, through the introduction of one slack variable to cope with the single inequality constraint (4.2). It is clear that the optimal solution to our simple example problem is v = [0 1]^T, though the general knapsack problem is NP-complete (Garey and Johnson 1979; Schrijver 1986). Figure 2 shows the polytopes P and P^+ for this problem. Also marked within P are contours of E^{op} and a typical descent trajectory within the polytope; we see that the descent converges to the point A = [0.7 1.0]^T, which is a vertex of P. The corresponding descent path within P^+ is also illustrated. The vertex A is in fact the optimal solution to the problem (4.1)-(4.3), without the integrality condition (4.4). This less restricted problem is known as the LP-relaxation (linear programming relaxation) of the knapsack problem. So the correct mapping of the knapsack problem
leads the network to solve the LP-relaxation accurately,³ though a valid 0-1 solution is not found.

Figure 2: Network convergence for a simple knapsack problem.

So how can we force the descent to converge to B, C, or D, one of the 0-1 points in P? The usual approach is to use some kind of annealing process, and we will examine the use of hysteretic annealing (Eberhart et al. 1991) [variants have been reported as convex relaxation (Ogier and Beyer 1990) and matrix graduated nonconvexity (Aiyer 1991)] with this problem. The idea behind hysteretic annealing is to add a term of the form

E^{ann} = -\gamma \sum_{i=1}^{n} (v_i - 0.5)^2   (4.5)
to the objective function E^{op}. The function E^{ann} is either convex or concave, depending on the sign of \gamma, and has full spherical symmetry around the point v^T = [0.5 0.5 \ldots 0.5]. A consequence is that E^{ann} has the same value at all 0-1 points, and therefore does not invalidate the objective for 0-1 programming problems. The annealing process is usually initialized with a large negative value of \gamma, in which case E^{ann} is convex and v converges to a point within the interior of P. Subsequently, the value of \gamma is gradually increased, eventually E^{ann} becomes concave, and v is driven toward the boundary of P. For quadratic objectives this process helps v to escape from local minima of E and can guide v toward good solutions, though this latter advantage is somewhat problem-dependent (Gee and Prager 1992).

³Indeed, a descent network correctly set up using the mapping in Section 3.2 will reliably solve linear programming problems, without the need for any modifications as proposed in Tank and Hopfield (1986) and Van Hulle (1991).
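Because E^{ann} is quadratic with spherical symmetry, it folds directly into the network parameters as a diagonal shift of T and a uniform shift of i^b (a constant term is dropped). A minimal sketch, with our own function name:

import numpy as np

def anneal_params(T, ib, gamma):
    n = T.shape[0]
    T_ann = T + 2.0 * gamma * np.eye(n)   # contributes -gamma * v.v to E
    ib_ann = ib - gamma * np.ones(n)      # contributes +gamma * sum(v_i) to E
    return T_ann, ib_ann

# A schedule would start with gamma << 0 (making E_ann convex) and then
# raise gamma slowly, eventually making the effective objective concave.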
Applied to the knapsack problem in Figure 2, however, hysteretic annealing has no useful effect. For v trapped at the vertex A, no amount of increasing \gamma will drive v toward a 0-1 point, at least until the magnitude of \gamma becomes comparable with that of c_0 in (3.10) and (3.11), at which point v will depart from the polytope P and converge to the point [1.0 1.0]^T, though this is clearly undesirable. Negative values of \gamma will succeed in freeing v from the vertex A, though only to guide v back toward the interior of P, where it will stabilize; this, too, is clearly not very useful.⁴ In fact, the failure of annealing techniques to find a 0-1 point within P is hardly surprising, since finding such points is generally an NP-complete problem (Schrijver 1986). The best we can expect from annealing procedures is to force convergence to some vertex of P. This is in fact what hysteretic annealing does, since increasing \gamma in (4.5) will eventually make the effective objective E^{op} + E^{ann} concave. It is easy to prove (and is intuitively obvious) that if E^{op} is linear or concave, then convergence to some vertex of P is guaranteed, except in highly pathological cases where the path of descent is exactly orthogonal to a face of P. However, for 0-1 programming problems we hope that v will converge to a 0-1 point, which will be an integral vertex of P. This can be guaranteed only if all the vertices of P are 0-1 points, that is if all the vertices of P are integral: a polytope exhibiting this property is called an integral polytope (Schrijver 1986). Integral polytopes constitute an important concept in the field of mathematical programming. For example, integer linear programming over integral polytopes can be solved in polynomial time using Khachiyan's method (Schrijver 1986), though the general integer linear programming problem is NP-complete. Likewise, we now see that integral polytopes are highly relevant to continuous descent solution techniques, which will reliably converge to a valid solution point only if the optimization is over an integral polytope. Neural network techniques, in their conventional form, are quite unsuited to the solution of 0-1 programming problems over nonintegral polytopes. Given that integral polytopes are the exception and not the rule, it would be dangerous to assume the integrality of a polytope associated

⁴We might well ask whether any other form of annealing would be more successful. For example, it seems feasible that using temperature annealing with a Hopfield network [increasing the gain of the neuron transfer functions (2.3) by reducing u_0] would eventually force the outputs v_i to either 0 or 1. However, even with very high gain transfer functions, we are still operating on the Liapunov function E (2.4). Taking the knapsack problem in Figure 2 as an example, any direction away from the vertex A is "uphill" in the sense of E. Since v evolves in such a way that E is nonincreasing, it is clear that v will not move from this vertex, however steep the transfer functions become. As a means of forcing convergence to a 0-1 point, varying the neuron gains is therefore ineffective. The same argument can also be applied to the discrete mean field annealing algorithm (Bilbro et al. 1989; Peterson and Anderson 1988; Peterson and Söderberg 1989; Van den Bout and Miller 1990), which, in the low temperature limit, seeks the same solution points as a Hopfield network running with high gain transfer functions and \eta = 1 (Aiyer 1991; Peterson and Anderson 1988).
with any particular problem mapping. Unfortunately, the identification of integral polytopes is itself an NP-complete problem (Papadimitriou and Yannakakis 1990). However, it is possible to recognize some special classes of integrality, covering some of the polytopes implicitly studied by the neural network community. While it is beyond the scope of this paper to discuss the recognition of integral polytopes in detail [a good summary can be found in Schrijver (1986)], it is worth remarking that the polytope associated with the traditional Hopfield-Tank mapping of the traveling salesman problem (Hopfield and Tank 1985) is indeed integral (Birkhoff 1946; von Neumann 1953). Unfortunately, the objective function associated with the Hopfield-Tank mapping is not linear, so while convergence to a valid solution point can be guaranteed, convergence to the optimal solution point cannot. This should be contrasted with the mappings of the traveling salesman problem traditionally studied by the mathematical programming community, in which a linear objective over a nonintegral polytope is considered (Langevin et al. 1990; Schrijver 1986). The strength of the neural network technique therefore lies in the use of annealing over integral polytopes. However, as we have seen with the knapsack problem, most problems do not present themselves naturally in this form. We must therefore investigate alternative network dynamics for nonintegral polytopes, which is the subject of Section 5.

5 Tabu Search Networks for Nonintegral Polytopes

5.1 System Definition. While conventional neural optimization techniques are completely unsuited to optimization over nonintegral polytopes, a modification to their dynamics results in a far more coherent approach to connectionist problem solving. In this section we describe how tabu learning, presented in Beyer and Ogier (1990) and Beyer and Ogier (1991) as a viable neural search technique, can be modified to ensure that v^+ escapes from vertices of P^+. Tabu search (Glover 1989) was originally developed as a "metaheuristic," to work in conjunction with other search techniques. The idea is that certain parts of the search space are designated tabu for some time, directing the search toward more fruitful areas. The choice of appropriate tabu strategies can lead to highly efficient searches, with some extremely impressive results in the field of combinatorial optimization (Glover 1989). Applied to descent networks, tabu search suggests a dynamic objective function, where we continuously update E to penalize points that v^+ has already visited. In this way, if v^+ gets trapped at any point, eventually E will build up locally to such an extent that v^+ is driven away. What emerges is a neural search technique, which no longer converges to a single point, and therefore needs constant monitoring to spot any valid solution points the search may pass through. This should
be contrasted with the one-shot descent offered by the conventional optimization networks, which typically converge to nonintegral points. To be sure of finding an integral point the use of some sort of search procedure is imperative. The tabu methodology makes this search efficient, while a connectionist implementation in hardware could provide very rapid processing speeds. The implementation of the search on feedback networks is particularly elegant. In what follows we build on the approach in Beyer and Ogier (1990) and Beyer and Ogier (1991), adapting it where necessary to operate over the interior of the unit hypercube, as opposed to over its vertices, and to achieve the specific goal of escape from vertices of P^+. We consider a time-varying objective function

E^{op}_t(v^+, t) = E^{op}(v) + F_t(v^+, t)   (5.1)

where

F_t(v^+, t) = \beta \int_0^t e^{\alpha(s-t)} \, p[v^+, f[v^+(s)]] \, ds   (5.2)

In (5.2), \alpha and \beta are positive constants, p(a, b) measures the proximity of the vectors a and b, and f(a) is a small displacement function such that f(a) \approx a. Thus F_t evolves in time to increase the objective in the vicinity of points recently visited. Let us now consider how F_t affects the descent dynamics for v^+ trapped at a vertex v_0^+ of P^+ since time t = 0. In this case

F_t(v^+, t) = \beta \int_0^t e^{\alpha(s-t)} \, p[v^+, f(v_0^+)] \, ds
It is apparent that if we set f(a) = a, then even though the objective is increased at the vertex, escape is not guaranteed since the maximum of F_t will be at v_0^+, and therefore the gradient of F_t at v_0^+ will be zero. What is required is to place the maximum of F_t near v_0^+ but just outside the polytope P^+. This can be achieved if we set

f(a) = a + \epsilon (a - c)   (5.3)

where c is some vector within P^+, and \epsilon is a small, positive constant. If we choose the quadratic proximity function p(a, b) = -\|a - b\|^2, then F_t becomes
F_t(v^+, t) = -\frac{1}{2} H v^{+T} v^+ - h^T v^+ + g(t)   (5.4)

where

H(t) = 2\beta \int_0^t e^{\alpha(s-t)} \, ds   (5.5)

and

h(t) = -2\beta \int_0^t e^{\alpha(s-t)} f[v^+(s)] \, ds   (5.6)
Equations (5.5) and (5.6) can be differentiated to obtain the following dynamic update equations for H and h:

\dot{H} = 2\beta - \alpha H   (5.7)

\dot{h} = -2\beta f(v^+) - \alpha h   (5.8)
with initial conditions H(0) = 0 and h(0) = 0. The tabu search system is obtained by arranging for the overall objective function to be

E(v^+) = -\frac{1}{2} v^{+T} T v^+ - v^{+T} i^b = E^{op}_t(v^+, t) - g(t) + \frac{c_0}{2} \|v^+ - (T^{val} v^+ + s)\|^2   (5.9)
where we have discarded g(t), the part of F_t that was not dependent on v^+. Remember that the T^{val} and s terms enforce confinement to the polytope P^+, as discussed in Section 3.2. The objective function (5.9) is achieved by setting

T = T^{op+} + c_0 (T^{val} - I) + H(t) I   (5.10)

i^b = i^{op+} + c_0 s + h(t)   (5.11)
v^+ can be updated using either Hopfield or steepest descent dynamics on E,⁵ though the dynamics of v^+ will now depend on H and h. In turn, the dynamics of H and h, given in (5.7) and (5.8), depend on v^+. The system of coupled first-order differential equations is surprisingly simple and quite possibly retains the attraction of being implementable in analogue electrical hardware. Note that only the diagonal interconnections and input bias currents are time varying, with the former approaching an asymptotic value of 2\beta/\alpha. The tabu search dynamics are intended for use over nonintegral polytopes with linear objectives E^{op}. Indeed, if E^{op} is linear, it is straightforward to show that E^{op}_t is linear or concave for all time t, and so there are no local minima of E^{op}_t in which v^+ can get trapped. Ability to escape from any vertex of P^+ can be guaranteed by appropriate settings of the parameters \alpha, \beta, and \epsilon (see Appendix). For moderate \beta, the trajectory of v^+ is initially dominated by the gradient of E^{op}, until v^+ gets trapped at a vertex of P^+. At this point the tabu objective F_t proceeds to dominate, and eventually v^+ escapes from the vertex and moves toward another. We can therefore expect v^+ to search through a number of vertices, with an initial bias toward those minimizing E^{op}. The search might exhibit limit cycles: it is a matter of future research to develop continuous dynamics which search every vertex of a polytope, given sufficient time.

⁵For a time-varying objective function E, the rate of change of E is given by \dot{E} = (\partial E/\partial t) + \nabla E \cdot \dot{v}. The nonzero \partial E/\partial t term means that the direction -\nabla E is no longer guaranteed downhill: hence E is not a Liapunov function of the Hopfield or steepest descent dynamics, which move v downhill only on an instantaneous snapshot of the energy surface, oblivious to the fact that the energy surface changes over time. Since the tabu search system has no stable states, we would not expect it to admit a Liapunov function.
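One Euler-stepped realization of the coupled system (5.7)-(5.11) is sketched below. This is our reconstruction: v^+ follows steepest descent, and for brevity confinement is reduced to clipping (a full implementation would use the projection algorithm of Section 3.3).

import numpy as np

def tabu_step(v, H, h, Top, iop, Tval, s, c0, c,
              alpha=0.01, beta=2.0, eps=0.1, dt=0.01):
    n = v.size
    f_v = v + eps * (v - c)                        # displacement function (5.3)
    H = H + dt * (2.0 * beta - alpha * H)          # tabu dynamics (5.7)
    h = h + dt * (-2.0 * beta * f_v - alpha * h)   # and (5.8)
    T = Top + c0 * (Tval - np.eye(n)) + H * np.eye(n)   # (5.10)
    ib = iop + c0 * s + h                          # (5.11)
    v = np.clip(v + dt * (T @ v + ib), 0.0, 1.0)   # descent on the current E
    return v, H, h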
(a) time t_0 = 0; (b) time t_1 > t_0; (c) time t_2 > t_1.
Figure 3: Tabu search with steepest descent dynamics for the example knapsack problem. Parameters are \alpha = 0.01, \beta = 2.0, \epsilon = 0.1. Contours of E are shown in dotted lines.

5.2 Illustration and Experiments. We implemented the tabu search system with steepest descent dynamics (2.6). With tabu search, the steepest descent dynamics have the additional advantage over the Hopfield dynamics that reaction to changes in F_t is more rapid. This is because there are no u variables, which can stray far from the midpoint of the transfer functions (2.3), and subsequently take a long time to return and effect significant changes in v. Figure 3 shows the trajectory of v^+ under the tabu dynamics, with T and i^b set to map the knapsack problem of Figure 2. We see that the search passes through all the vertices of P^+, including the valid solution points B, C, and D. In fact, the first vertex to be visited after the LP-relaxation solution at A is vertex B, which is the optimal solution to the knapsack problem. To demonstrate the usefulness of tabu search with more challenging problems, we applied the technique to a number of randomly generated knapsack problems of the form:
E"P(v) = -
Cap,
(5.12)
I=1
subject to (5.13) and 01v;11,
ie
{l,. . . n }
(5.14)
and
v integral
(5.15)
Table 1: Average Solution Qualities (-E^{op}) for Knapsack Problems, Normalized to Linear Programming Solutions.ᵃ

                           Number of items
Solution technique        50       100      150      200      250
Linear programming        0.0%     0.0%     0.0%     0.0%     0.0%
Tabu search network      +0.37%   +0.06%   -0.01%   -0.31%   -0.34%
Greedy heuristic        -11.7%    -7.1%    -6.9%   -14.1%    -6.0%

ᵃTabu search parameters were \alpha = 0.01, \beta = 1.0, \epsilon = 0.1.
The a_i and b_i were independent random numbers on the unit interval, and the knapsack capacity C was chosen to be a different random fraction of the sum of the b_i for each problem.⁶ Steepest descent dynamics were used on the time-varying objective function (5.9), with the dynamics simulated using the algorithm described in Section 3.3 (see Appendix for comments on the initialization of v^+). The search was run for a fixed number of iterations on all problems, and the network output was continuously observed, so that valid solutions could be logged as the network found them. The results in Table 1 reflect the best solution found within the iteration limit. The table also shows the performance of two other algorithms on the same problems. For the linear programming approach, a descent network was used to solve the LP-relaxation of the knapsack problem, without the integrality condition (5.15). This produces a solution vector with one nonintegral element: if this element is reduced to zero then a high quality, valid 0-1 solution is obtained. The greedy approach is simply to pack items in order of decreasing a_i, unless the inclusion of an item violates the constraint (5.13), in which case that item is omitted; a minimal sketch of this baseline appears below. The tabu search system compares well with the very effective linear programming approach, and consistently outperforms the greedy heuristic. The performance of the system was found to be largely insensitive to changes in the parameters \alpha, \beta, and \epsilon, so long as the ability to escape from a vertex was maintained (see Appendix). Limit cycles were observed in the state vector trajectory, though these typically included visits to about 100 vertices. This implies that the tabu search dynamics may fail to find a valid solution if the proportion of integral vertices is less than about 1%. However, for the sort of knapsack problems considered here about 50% of the vertices are integral, so this does not pose a problem.

⁶Restricting the various knapsack variables to particular ranges affects the difficulty of the problem: see Ohlsson et al. (1992) for an interesting discussion of this point. Here we consider only unrestricted problems, with varying degrees of difficulty.
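The following sketch (ours, not the authors' code) shows the problem generator and the greedy baseline just described; the tabu network itself would be run with the dynamics of Section 5.1.

import numpy as np

def random_knapsack(n, rng):
    a = rng.random(n)              # objective coefficients on the unit interval
    b = rng.random(n)              # weights on the unit interval
    C = rng.random() * b.sum()     # capacity: a random fraction of sum(b)
    return a, b, C

def greedy(a, b, C):
    v = np.zeros_like(a)
    used = 0.0
    for i in np.argsort(-a):       # pack items in order of decreasing a_i
        if used + b[i] <= C:       # skip any item that violates (5.13)
            v[i] = 1.0
            used += b[i]
    return v

rng = np.random.default_rng(42)
a, b, C = random_knapsack(50, rng)
v = greedy(a, b, C)
print("greedy value:", a @ v, "weight:", b @ v, "capacity:", C)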
Table 2: Classification of Quadratic 0-1 Programming Problems.ᵃ

Integral polytope:
Objective      Quadratic                 Linear                   None
Complexity     NP-complete               P-solvable               P-solvable
Solution       Deterministic annealing   Simple descent           Simple descent
Examples       TSP, GPP                  One-to-one assignment    Crossbar switching

Nonintegral polytope:
Objective      Quadratic                 Linear                   None
Complexity     At least NP-complete      NP-complete              NP-complete
Solution       Tabu search               Tabu search              Tabu search
Examples                                 TSP, Knapsack            "Teachers and classes"

ᵃThe graph partitioning problem (GPP) has been mapped onto neural networks using a quadratic objective over an integral polytope (Peterson and Anderson 1988; Peterson and Söderberg 1989; Van den Bout and Miller 1990). The traveling salesman problem (TSP) can be mapped in two ways: a quadratic objective over an integral polytope (Hopfield and Tank 1985) or a linear objective over a nonintegral polytope (Langevin et al. 1990). One-to-one linear assignment was studied in Eberhart et al. (1991), while the crossbar switching problem was the subject of Takefuji and Lee (1991). Attempts to solve knapsack problems with neural networks can be found in Hellstrom and Kanal (1992) and Ohlsson et al. (1992). The "teachers and classes" problem (Gislén et al. 1989) is an example of resource-constrained multiprocessor scheduling (Garey and Johnson 1979, p. 239).
6 Discussion and Conclusions
In Table 2 we see a classification of quadratic 0-1 programming problems, in which the problem complexity and suggested connectionist solution technique are governed more by the nature of the polytope than by the order of the objective function. The popular deterministic annealing techniques are recommended for only one class of problem, that being quadratic optimization over integral polytopes. When the objective is linear, a simple descent technique will reliably find an optimal solution within an integral polytope. This approach will also work for pure constraint satisfaction problems, which have no objective: simple descent on an arbitrary linear objective will find a valid solution within an integral polytope. For nonintegral polytopes the picture is very different. Even the simplest problems, in which we desire to find any 0-1 solution,
are NP-complete, and one-shot descent or annealing solution techniques are quite inappropriate. Tabu search dynamics offer one connectionist, continuous approach to tackling these problems, though no doubt other approaches will emerge with time. Meanwhile, however, tabu search dynamics have achieved encouraging results with knapsack problems, and certainly deserve further investigation, especially in relation to any possible analogue circuit implementation. In this paper we have sought to reassess the field of "neural" optimization from a refreshing viewpoint, and have demonstrated that, in the light of a rigorous problem mapping, the current techniques are unsuited to the solution of the vast majority of 0-1 programming problems. However, with appropriate modifications reflecting the polyhedral nature of the problem at hand, such techniques show great promise for the future, especially when the potential for fast, analogue implementations is realized.
Appendix: Tabu Dynamics for Stationary v^+

In this appendix we consider v^+ trapped at a vertex v_0^+ of P^+ from time t_0. In these circumstances it is possible to analytically integrate the tabu dynamic equations, the results of which allow us to judge the speed of the dynamics and place bounds on the variables \alpha, \beta, and \epsilon to guarantee escape from the vertex. The analysis mirrors Appendix C of Beyer and Ogier (1990), with alterations to reflect the modified dynamics under consideration here. Suppose at time t_0 the tabu variables have values H = H_0 and h = h_0. Then integrating equations 5.7 and 5.8 gives

H(t) = \frac{2\beta}{\alpha} + \left( H_0 - \frac{2\beta}{\alpha} \right) e^{-\alpha(t-t_0)}   (A.1)

h(t) = -\frac{2\beta}{\alpha} f(v_0^+) + \left( h_0 + \frac{2\beta}{\alpha} f(v_0^+) \right) e^{-\alpha(t-t_0)}   (A.2)
The resulting gradient of E at v_0^+ is

-\nabla E = T^{op+} v_0^+ + i^{op+} + \frac{2\beta\epsilon}{\alpha} \left( 1 - e^{-\alpha(t-t_0)} \right) (c - v_0^+) + e^{-\alpha(t-t_0)} (H_0 v_0^+ + h_0)   (A.3)

where we have substituted the expression for f(v_0^+) from equation 5.3. We see that -\nabla E has a limiting value as t \to \infty, specifically
-\nabla E_\infty = T^{op+} v_0^+ + i^{op+} + \frac{2\beta\epsilon}{\alpha} (c - v_0^+)   (A.4)
To guarantee escape from the vertex, we require that -a^T \nabla E_\infty > 0 for some direction a that does not lead out of P^+. The only such direction we can reliably identify is (c - v_0^+), since c is defined as being within P^+
and P^+ is convex. Assuming a linear objective E^{op}, so that T^{op+} = 0, we require

(c - v_0^+)^T i^{op+} + \frac{2\beta\epsilon}{\alpha} \|c - v_0^+\|^2 > 0   (A.5)

The worst-case scenario is when i^{op+} is in the direction -(c - v_0^+), in which case the escape condition becomes

\frac{2\beta\epsilon}{\alpha} \|c - v_0^+\| > \|i^{op+}\|   (A.6)

Hence the relevant quantity governing the ability to escape from a vertex is the ratio \beta\epsilon/\alpha. Equation A.6 indicates that it is desirable to locate the vector c as far from any vertex as possible. Some vector c within P^+ can be found in polynomial time using Khachiyan's method (Schrijver 1986), or more practically one of the projection techniques (Agmon 1954; Motzkin and Schoenberg 1954); once located, c can also serve as a starting position for v^+. Often, however, the most convenient choice of c is the vector s (see equation 3.6), which is a natural product of the mapping process and lies within P^+ for many classes of problem. For the knapsack problem, using c = s, a similar worst-case escape condition can be demonstrated.
Finally, a note about the speed of the overall search process. The system spends most time in situations examined in this appendix, that is with v^+ trapped at a vertex and the tabu variables H and h changing to free v^+ from the vertex. Equation A.3 indicates that this process takes a characteristic time of 1/\alpha, and so we conclude that the overall speed of the system is approximately proportional to \alpha.
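A quick numerical check of this time constant (our own Euler integration of (5.7) with v^+ held fixed):

alpha, beta, dt = 0.01, 2.0, 0.1
H = 0.0
for _ in range(int(5 / (alpha * dt))):        # run for about 5 time constants
    H += dt * (2.0 * beta - alpha * H)        # Euler step on (5.7)
print(H, "vs asymptote", 2.0 * beta / alpha)  # close after a few 1/alpha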
Acknowledgments Andrew Gee acknowledges the financial support of the Science and Engineering Research Council of Great Britain. The authors would also like to thank the reviewers for their helpful comments and suggestions.
References

Agmon, S. 1954. The relaxation method for linear inequalities. Can. J. Math. 6, 382-392.
Aiyer, S. V. B. 1991. Solving combinatorial optimization problems using neural networks. Tech. Rep. CUED/F-INFENG/TR 89, Cambridge University Department of Engineering.
Aiyer, S. V. B., Niranjan, M., and Fallside, F. 1990. A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks 1(2), 204-215.
Beyer, D. A., and Ogier, R. G. 1990. The tabu learning neural network search method applied to the traveling salesman problem. Tech. Rep., SRI International, Menlo Park, California.
Beyer, D. A., and Ogier, R. G. 1991. Tabu learning: A neural network search method for solving nonconvex optimization problems. In Proceedings of the International Joint Conference on Neural Networks, Singapore.
Bilbro, G., Mann, R., Miller, T. K., III, Snyder, W. E., Van den Bout, D. E., and White, M. 1989. Optimization by mean field annealing. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 91-98. Morgan Kaufmann, San Mateo, CA.
Birkhoff, G. 1946. Tres observaciones sobre el algebra lineal. Rev. Facultad Ciencias Exactas, Puras y Aplicadas Univ. Nac. Tucumán, Ser. A (Mat. Fisica Teorica) 5, 147-151.
Eberhart, S. P., Daud, D., Kerns, D. A., Brown, T. X., and Thakoor, A. P. 1991. Competitive neural architecture for hardware solution to the assignment problem. Neural Networks 4, 431-442.
Garey, M. R., and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco.
Gee, A. H., and Prager, R. W. 1992. Alternative energy functions for optimizing neural networks. Tech. Rep. CUED/F-INFENG/TR 95, Cambridge University Department of Engineering.
Gislén, L., Peterson, C., and Söderberg, B. 1989. 'Teachers and Classes' with neural networks. Int. J. Neural Syst. 1(2), 167-176.
Glover, F. 1989. Tabu search, a tutorial. Tech. Rep., Center for Applied Artificial Intelligence, University of Colorado. Revised February 1990.
Grossberg, S. 1988. Nonlinear neural networks: Principles, mechanisms and architectures. Neural Networks 1(1), 17-61.
Hellstrom, B. J., and Kanal, L. N. 1992. Knapsack packing networks. IEEE Trans. Neural Networks 3(2), 302-307.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. 'Neural' computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Kamgar-Parsi, B., and Kamgar-Parsi, B. 1990. On problem solving with Hopfield neural networks. Biol. Cybern. 62, 415-423.
Langevin, A., Soumis, F., and Desrosiers, J. 1990. Classification of travelling salesman problem formulations. Oper. Res. Lett. 9(2), 127-132.
Motzkin, T. S., and Schoenberg, I. J. 1954. The relaxation method for linear inequalities. Can. J. Math. 6, 393-404.
Ogier, R. G., and Beyer, D. A. 1990. Neural network solution to the link scheduling problem using convex relaxation. In Proceedings of the IEEE Global Telecommunications Conference, pp. 1371-1376, San Diego.
Ohlsson, M., Peterson, C., and Söderberg, B. 1992. Neural networks for optimization problems with inequality constraints: the knapsack problem. Tech. Rep. LU TP 92-11, Department of Theoretical Physics, University of Lund, Sweden.
Papadimitriou, C. H., and Yannakakis, M. 1990. Note on recognizing integer polyhedra. Combinatorica 10(1), 107-109.
Peterson, C., and Anderson, J. R. 1988. Neural networks and NP-complete optimization problems: A performance study on the graph bisection problem. Complex Syst. 2(1), 59-89.
Peterson, C., and Söderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Schrijver, A. 1986. Theory of Linear and Integer Programming. John Wiley, Chichester.
Tagliarini, G. A., Fury Christ, J., and Page, E. W. 1991. Optimization using neural networks. IEEE Trans. Comput. 40(12), 1347-1358.
Tagliarini, G. A., and Page, E. W. 1988. A neural-network solution to the concentrator assignment problem. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 775-782. American Institute of Physics, New York.
Takefuji, Y., and Lee, K.-C. 1991. An artificial hysteresis binary neuron: A model suppressing the oscillatory behaviors of neural dynamics. Biol. Cybern. 64, 353-356.
Tank, D. W., and Hopfield, J. J. 1986. Simple 'neural' optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits Syst. 33(5), 533-541.
Van den Bout, D. E., and Miller, T. K., III. 1990. Graph partitioning using annealed neural networks. IEEE Trans. Neural Networks 1(2), 192-203.
Van Hulle, M. M. 1991. A goal programming network for linear programming. Biol. Cybern. 65, 243-252.
von Neumann, J. 1953. A certain zero-sum two-person game equivalent to the optimal assignment problem. In Contributions to the Theory of Games, II, H. W. Kuhn and A. W. Tucker, eds. Annals of Mathematics Studies 28, Princeton University Press, Princeton, NJ.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the TSP problem algorithm of Hopfield and Tank. Biol. Cybern. 58, 63-70.
Received August 5, 1992; accepted May 13, 1993.
Communicated by Steven Nowlan
ARTICLE
Hierarchical Mixtures of Experts and the EM Algorithm

Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Robert A. Jacobs
Department of Psychology, University of Rochester, Rochester, NY 14627 USA
We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIMs). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.

1 Introduction
The principle of divide-and-conquer is a principle with wide applicability throughout applied mathematics. Divide-and-conquer algorithms attack a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the complex problem. This approach can often lead to simple, elegant, and efficient algorithms. In this paper we explore a particular application of the divide-and-conquer principle to the problem of learning from examples. We describe a network architecture and a learning algorithm for the architecture, both of which are inspired by the philosophy of divide-and-conquer. In the statistical literature and in the machine learning literature, divide-and-conquer approaches have become increasingly popular. The CART algorithm of Breiman et al. (1984), the MARS algorithm of Friedman (1991), and the ID3 algorithm of Quinlan (1986) are well-known examples. These algorithms fit surfaces to data by explicitly dividing the input space into a nested sequence of regions, and by fitting simple surfaces (e.g., constant functions) within these regions. They have convergence times that are often orders of magnitude faster than gradient-based neural network algorithms.

Neural Computation 6, 181-214 (1994)
@ 1994 Massachusetts Institute of Technology
Although divide-and-conquer algorithms have much to recommend them, one should be concerned about the statistical consequences of dividing the input space. Dividing the data can have favorable consequences for the bias of an estimator, but it generally increases the variance. Consider linear regression, for example, in which the variance of the estimates of the slope and intercept depends quadratically on the spread of data on the x-axis. The points that are the most peripheral in the input space are those that have the maximal effect in decreasing the variance of the parameter estimates. The foregoing considerations suggest that divide-and-conquer algorithms generally tend to be variance-increasing algorithms. This is indeed the case and is particularly problematic in high-dimensional spaces where data become exceedingly sparse (Scott 1992). One response to this dilemma, that adopted by CART, MARS, and ID3, and also adopted here, is to utilize piecewise constant or piecewise linear functions. These functions minimize variance at a cost of increased bias. We also make use of a second variance-decreasing device; a device familiar in the neural network literature. We make use of "soft" splits of data (Bridle 1989; Nowlan 1991; Wahba et al. 1993), allowing data to lie simultaneously in multiple regions. This approach allows the parameters in one region to be influenced by data in neighboring regions. CART, MARS, and ID3 rely on "hard" splits, which, as we remarked above, have particularly severe effects on variance. By allowing soft splits the severe effects of lopping off distant data can be ameliorated. We also attempt to minimize the bias that is incurred by using piecewise linear functions, by allowing the splits to be formed along hyperplanes at arbitrary orientations in the input space. This lessens the bias due to high-order interactions among the inputs and allows the algorithm to be insensitive to the particular choice of coordinates used to encode the data (an improvement over methods such as MARS and ID3, which are coordinate-dependent).

The work that we describe here makes contact with a number of branches of statistical theory. First, as in our earlier work (Jacobs et al. 1991), we formulate the learning problem as a mixture estimation problem (cf. Cheeseman et al. 1988; Duda and Hart 1973; Nowlan 1991; Redner and Walker 1984; Titterington et al. 1985). We show that the algorithm that is generally employed for the unsupervised learning of mixture parameters, the Expectation-Maximization (EM) algorithm of Dempster et al. (1977), can also be exploited for supervised learning. Second, we utilize generalized linear model (GLIM) theory (McCullagh and Nelder 1983) to provide the basic statistical structure for the components of the architecture: In particular, the "soft splits" referred to above are modeled as multinomial logit models, a specific form of GLIM. We also show that the algorithm developed for fitting GLIMs, the iteratively reweighted least squares (IRLS) algorithm, can be usefully employed in our model, in particular as the M step of the EM algorithm. Finally, we show that these ideas can be developed in a recursive manner, yielding a tree-
structured approach to estimation that is reminiscent of CART, MARS, and ID3. The remainder of the paper proceeds as follows. We first introduce the hierarchical mixture-of-experts architecture and present the likelihood function for the architecture. After describing a gradient descent algorithm, we develop a more powerful learning algorithm for the architecture that is a special case of the general Expectation-Maximization (EM) framework of Dempster et al. (1977). We also describe a least-squares version of this algorithm that leads to a particularly efficient implementation. Both of the latter algorithms are batch learning algorithms. In the final section, we present an on-line version of the least-squares algorithm that in practice appears to be the most efficient of the algorithms that we have studied.

2 Hierarchical Mixtures of Experts
The algorithms that we discuss in this paper are supervised learning algorithms. We explicitly address the case of regression, in which the input vectors are elements of R^m and the output vectors are elements of R^n. We also consider classification models and counting models in which the outputs are integer-valued. The data are assumed to form a countable set of paired observations X = {(x^{(t)}, y^{(t)})}. In the case of the batch algorithms discussed below, this set is assumed to be finite; in the case of the on-line algorithms, the set may be infinite. We propose to solve nonlinear supervised learning problems by dividing the input space into a nested set of regions and fitting simple surfaces to the data that fall in these regions. The regions have "soft" boundaries, meaning that data points may lie simultaneously in multiple regions. The boundaries between regions are themselves simple parameterized surfaces that are adjusted by the learning algorithm. The hierarchical mixture-of-experts (HME) architecture is shown in Figure 1.¹ The architecture is a tree in which the gating networks sit at the nonterminals of the tree. These networks receive the vector x as input and produce scalar outputs that are a partition of unity at each point in the input space. The expert networks sit at the leaves of the tree. Each expert produces an output vector \mu_{ij} for each input vector. These output vectors proceed up the tree, being blended by the gating network outputs. All of the expert networks in the tree are linear with a single output nonlinearity. We will refer to such a network as "generalized linear," borrowing the terminology from statistics (McCullagh and Nelder 1983).

¹To simplify the presentation, we restrict ourselves to a two-level hierarchy throughout the paper. All of the algorithms that we describe, however, generalize readily to hierarchies of arbitrary depth. See Jordan and Xu (1993) for a recursive formalism that handles arbitrary hierarchies.
Figure 1: A two-level hierarchical mixture of experts. To form a deeper tree, each expert is expanded recursively into a gating network and a set of subexperts.
Expert network (i, j) produces its output μ_ij as a generalized linear function of the input x:

\mu_{ij} = f(U_{ij} x)    (2.1)

where U_ij is a weight matrix and f is a fixed continuous nonlinearity. The vector x is assumed to include a fixed component of one to allow for an intercept term. For regression problems, f(·) is generally chosen to be the identity function (i.e., the experts are linear). For binary classification problems, f(·) is generally taken to be the logistic function, in which case the expert outputs are interpreted as the log odds of "success" under a Bernoulli probability model (see below). Other models (e.g., multiway classification, counting, rate estimation, and survival estimation) are handled by making other choices for f(·). These models are smoothed piecewise analogs of the corresponding GLIM models (cf. McCullagh and Nelder 1983).
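As a concrete illustration of equation 2.1, the following minimal sketch (in Python with NumPy; the function and variable names are ours, not the paper's) evaluates a single generalized linear expert:

```python
# A minimal sketch of equation 2.1 with an identity link (the regression
# case); all names here are illustrative assumptions, not from the paper.
import numpy as np

def expert_output(U, x, f=lambda eta: eta):
    """Generalized linear expert: mu = f(U x), with x augmented by a one."""
    x_aug = np.append(x, 1.0)      # fixed component of one -> intercept term
    return f(U @ x_aug)            # linear predictor passed through the link f

# Example: a 2-output expert on a 3-dimensional input.
U = np.zeros((2, 4))               # weight matrix includes the intercept column
x = np.array([0.5, -1.0, 2.0])
mu = expert_output(U, x)           # identity f => linear expert
```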
The gating networks are also generalized linear. Define intermediate variables ξ_i as follows:

\xi_i = v_i^T x    (2.2)

where v_i is a weight vector. Then the ith output of the top-level gating network is the "softmax" function of the ξ_i (Bridle 1989; McCullagh and Nelder 1983):

g_i = \frac{e^{\xi_i}}{\sum_k e^{\xi_k}}    (2.3)

Note that the g_i are positive and sum to one for each x. They can be interpreted as providing a "soft" partitioning of the input space. Similarly, the gating networks at lower levels are also generalized linear systems. Define ξ_ij as follows:

\xi_{ij} = v_{ij}^T x    (2.4)

Then

g_{j|i} = \frac{e^{\xi_{ij}}}{\sum_k e^{\xi_{ik}}}    (2.5)
is the output of the jth unit in the ith gating network at the second level of the architecture. Once again, the g_{j|i} are positive and sum to one for each x. They can be interpreted as providing a "soft" sub-partition of the input space nested within the partitioning provided by the higher-level gating network.

The output vector at each nonterminal of the tree is the weighted output of the experts below that nonterminal. That is, the output at the ith nonterminal in the second layer of the two-level tree is

\mu_i = \sum_j g_{j|i} \mu_{ij}

and the output at the top level of the tree is

\mu = \sum_i g_i \mu_i
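The full two-level forward pass can be sketched as follows, again assuming NumPy, linear experts, and inputs already augmented with the fixed component of one; the shapes and names are illustrative assumptions:

```python
# A sketch of the two-level forward pass (equations 2.2 through 2.5 and the
# blending equations above), assuming linear experts.
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())          # subtract max for numerical stability
    return e / e.sum()

def hme_forward(x, V_top, V_low, U):
    """V_top: (I, m) top gating weights; V_low: (I, J, m) lower gating
    weights; U: (I, J, n, m) expert weight matrices; x: (m,) input."""
    g = softmax(V_top @ x)                         # g_i, a partition of unity
    mu = 0.0
    for i in range(len(g)):
        g_ji = softmax(V_low[i] @ x)               # g_{j|i}, nested partition
        mu_i = sum(g_ji[j] * (U[i, j] @ x)         # expert outputs mu_ij
                   for j in range(len(g_ji)))
        mu += g[i] * mu_i                          # blend up the tree
    return mu
```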
Note that both the g's and the μ's depend on the input x; thus the total output is a nonlinear function of the input.

2.1 Regression Surface. Given the definitions of the expert networks and the gating networks, the regression surface defined by the hierarchy is a piecewise blend of the regression surfaces defined by the experts. The gating networks provide a nested, "soft" partitioning of the input space and the expert networks provide local regression surfaces within the partition. There is overlap between neighboring regions. To understand the nature of the overlap, consider a one-level hierarchy with two
expert networks. In this case, the gating network has two outputs, g_1 and g_2. The gating output g_1 is given by

g_1 = \frac{e^{v_1^T x}}{e^{v_1^T x} + e^{v_2^T x}}    (2.6)

    = \frac{1}{1 + e^{-(v_1 - v_2)^T x}}    (2.7)
which is a logistic ridge function whose orientation is determined by the direction of the vector v_1 - v_2. The gating output g_2 is equal to 1 - g_1. For a given x, the total output μ is the convex combination g_1 μ_1 + g_2 μ_2. This is a weighted average of the experts, where the weights are determined by the values of the ridge function. Along the ridge, g_1 = g_2 = 1/2, and both experts contribute equally. Away from the ridge, one expert or the other dominates. The amount of smoothing across the ridge is determined by the magnitude of the vector v_2 - v_1. If v_2 - v_1 is large, then the ridge function becomes a sharp split and the weighted output of the experts becomes piecewise (generalized) linear. If v_2 - v_1 is small, then each expert contributes to a significant degree on each side of the ridge, thereby smoothing the piecewise map. In the limit of a zero difference vector, g_1 = g_2 = 1/2 for all x, and the total output is the same fixed average of the experts on both sides of the fictitious "split."

In general, a given gating network induces a smoothed planar partitioning of the input space. Lower-level gating networks induce a partition within the partition induced by higher-level gating networks. The weights in a given gating network determine the amount of smoothing across the partition at that particular level of resolution: large weight vectors imply sharp changes in the regression surface across a ridge and small weights imply a smoother surface. In the limit of zero weights in all gating networks, the entire hierarchy reduces to a fixed average (a linear system in the case of regression).

2.2 A Probability Model. The hierarchy can be given a probabilistic interpretation. We suppose that the mechanism by which data are generated by the environment involves a nested sequence of decisions that terminates in a regressive process that maps x to y. The decisions are modeled as multinomial random variables. That is, for each x, we interpret the values g_i(x, v_i^0) as the multinomial probabilities associated with the first decision and the g_{j|i}(x, v_ij^0) as the (conditional) multinomial probabilities associated with the second decision, where the superscript "0" refers to the "true" values of the parameters. The decisions form a decision tree. We use a statistical model to model this decision tree; in particular, our choice of parameterization (cf. equations 2.2, 2.3, 2.4, and 2.5) corresponds to a multinomial logit probability model at each nonterminal of the tree (see Appendix B). A multinomial logit model is a special case of a GLIM that is commonly used for "soft" multiway classification (McCullagh and Nelder 1983). Under the multinomial logit model, we
interpret the gating networks as modeling the input-dependent, multinomial probabilities associated with decisions at particular levels of resolution in a tree-structured model of the data.

Once a particular sequence of decisions has been made, resulting in a choice of regressive process (i, j), output y is assumed to be generated according to the following statistical model. First, a linear predictor η_ij is formed:

\eta_{ij} = U_{ij} x

The expected value of y is obtained by passing the linear predictor through the link function² f:

\mu_{ij} = f(\eta_{ij})

The output y is then chosen from a probability density P, with mean μ_ij and "dispersion" parameter³ φ_ij. We denote the density of y as

P(y \mid x, \theta_{ij})

where the parameter vector θ_ij includes the weights U_ij and the dispersion parameter φ_ij:

\theta_{ij} = \{U_{ij}, \phi_{ij}\}

We assume the density P to be a member of the exponential family of densities (McCullagh and Nelder 1983). The interpretation of the dispersion parameter depends on the particular choice of density. For example, in the case of the n-dimensional gaussian, the dispersion parameter is the covariance matrix.

Given these assumptions, the total probability of generating y from x is the mixture of the probabilities of generating y from each of the component densities, where the mixing proportions are multinomial probabilities:

P(y \mid x, \theta^0) = \sum_i g_i(x, v_i^0) \sum_j g_{j|i}(x, v_{ij}^0) P_{ij}(y \mid x, \theta_{ij}^0)    (2.8)
Note that θ^0 includes the expert network parameters θ_ij^0 as well as the gating network parameters v_i^0 and v_ij^0. Note also that we have explicitly

²We utilize the neural network convention in defining links. In GLIM theory, the convention is that the link function relates η to μ; thus, η = h(μ), where h is equivalent to our f⁻¹.

³Not all exponential family densities have a dispersion parameter; in particular, the Bernoulli density discussed below has no dispersion parameter.
indicated the dependence of the probabilities g_i and g_{j|i} on the input x and on the parameters. In the remainder of the paper we drop the explicit reference to the input and the parameters to simplify the notation:

P(y \mid x, \theta) = \sum_i g_i \sum_j g_{j|i} P_{ij}(y)    (2.9)
We also utilize equation 2.9 without the superscripts to refer to the probability model defined by a particular HME architecture, irrespective of any reference to a "true" model.

2.2.1 Example (Regression). In the case of regression the probabilistic component of the model is generally assumed to be gaussian. Assuming identical covariance matrices of the form σ²I for each of the experts yields the following hierarchical probability model:

P(y \mid x, \theta) = \sum_i g_i \sum_j g_{j|i} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \| y - \mu_{ij} \|^2 \right\}
2.2.2 Example (Binary Classification). In binary classification problems the output y is a discrete random variable having possible outcomes of "failure" and "success." The probabilistic component of the model is generally assumed to be the Bernoulli distribution (Cox 1970). In this case, the mean μ_ij is the conditional probability of classifying the input as "success." The resulting hierarchical probability model is a mixture of Bernoulli densities:

P(y \mid x, \theta) = \sum_i g_i \sum_j g_{j|i} \mu_{ij}^y (1 - \mu_{ij})^{1-y}
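For the regression case above, the mixture density of equation 2.8 can be sketched as follows (assuming NumPy and spherical gaussian experts; all names are ours):

```python
# A sketch of the hierarchical mixture likelihood (equation 2.8) for the
# regression case with spherical gaussian experts.
import numpy as np

def hme_density(y, g, g_nested, mus, sigma):
    """g: (I,) top gate; g_nested: (I, J) lower gates; mus: (I, J, n) expert
    means for this x; returns P(y | x) as a doubly nested mixture."""
    n = y.shape[0]
    total = 0.0
    for i in range(len(g)):
        for j in range(g_nested.shape[1]):
            resid = y - mus[i, j]
            p_ij = np.exp(-resid @ resid / (2 * sigma**2)) \
                   / (2 * np.pi * sigma**2) ** (n / 2)   # gaussian density
            total += g[i] * g_nested[i, j] * p_ij
    return total
```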
2.3 Posterior Probabilities. In developing the learning algorithms to be presented in the remainder of the paper, it will prove useful to define posterior probabilities associated with the nodes of the tree. The terms "posterior" and "prior" have meaning in this context during the training of the system. We refer to the probabilities g_i and g_{j|i} as prior probabilities, because they are computed based only on the input x, without knowledge of the corresponding target output y. A posterior probability is defined once both the input and the target output are known. Using Bayes' rule, we define the posterior probabilities at the nodes of the tree as follows:

h_i = \frac{g_i \sum_j g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}    (2.10)

and

h_{j|i} = \frac{g_{j|i} P_{ij}(y)}{\sum_j g_{j|i} P_{ij}(y)}    (2.11)
We will also find it useful to define the joint posterior probability h_ij, the product of h_i and h_{j|i}:

h_{ij} = h_i h_{j|i} = \frac{g_i g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}    (2.12)

This quantity is the probability that expert network (i, j) can be considered to have generated the data, based on knowledge of both the input and the output. Once again, we emphasize that all of these quantities are conditional on the input x. In deeper trees, the posterior probability associated with an expert network is simply the product of the conditional posterior probabilities along the path from the root of the tree to that expert.
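Equations 2.10 through 2.12 can be computed together from the priors and the expert densities; a minimal sketch, assuming NumPy arrays, is:

```python
# A sketch of the posterior probabilities (equations 2.10 through 2.12)
# via Bayes' rule, assuming the priors and densities have been computed.
import numpy as np

def posteriors(g, g_nested, P):
    """g: (I,), g_nested: (I, J) priors; P: (I, J) densities P_ij(y).
    Returns h_i, h_{j|i}, and the joint posterior h_ij."""
    joint = g[:, None] * g_nested * P        # g_i g_{j|i} P_ij(y)
    h_ij = joint / joint.sum()               # equation 2.12, normalized
    h_i = h_ij.sum(axis=1)                   # equation 2.10
    h_j_given_i = joint / joint.sum(axis=1, keepdims=True)  # equation 2.11
    return h_i, h_j_given_i, h_ij
```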
2.4 The Likelihood and a Gradient Ascent Learning Algorithm. Jordan and Jacobs (1992) presented a gradient ascent learning algorithm for the hierarchical architecture. The algorithm was based on earlier work by Jacobs et al. (1991), who treated the problem of learning in mixture-of-experts architectures as a maximum likelihood estimation problem. The log likelihood of a data set X = {(x^(t), y^(t))}_{t=1}^N is obtained by taking the log of the product of N densities of the form of equation 2.9, which yields the following log likelihood:

l(\theta; X) = \sum_t \ln \sum_i g_i^{(t)} \sum_j g_{j|i}^{(t)} P_{ij}(y^{(t)})    (2.13)
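A sketch of equation 2.13, reusing the hme_density routine above (the priors_and_means argument is a hypothetical forward pass supplied by the caller):

```python
# Accumulates the log of the mixture density for each data pair.
import numpy as np

def log_likelihood(data, priors_and_means, sigma):
    total = 0.0
    for x, y in data:
        g, g_nested, mus = priors_and_means(x)    # g_i, g_{j|i}, mu_ij
        total += np.log(hme_density(y, g, g_nested, mus, sigma))
    return total
```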
Let us assume that the probability density P is gaussian with an identity covariance matrix and that the link function is the identity. In this case, by differentiating l(θ; X) with respect to the parameters, we obtain the following gradient ascent learning rule for the weight matrix U_ij:

\Delta U_{ij} = \rho \sum_t h_i^{(t)} h_{j|i}^{(t)} (y^{(t)} - \mu_{ij}^{(t)}) (x^{(t)})^T    (2.14)

where ρ is a learning rate. The gradient ascent learning rule for the ith weight vector in the top-level gating network is given by

\Delta v_i = \rho \sum_t (h_i^{(t)} - g_i^{(t)}) x^{(t)}    (2.15)

and the gradient ascent rule for the jth weight vector in the ith lower-level gating network is given by

\Delta v_{ij} = \rho \sum_t h_i^{(t)} (h_{j|i}^{(t)} - g_{j|i}^{(t)}) x^{(t)}    (2.16)

Updates can also be obtained for covariance matrices (Jordan and Jacobs 1992). The algorithm given by equations 2.14, 2.15, and 2.16 is a batch learning algorithm.
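A sketch of one batch step of equations 2.14 through 2.16, assuming NumPy and precomputed priors, posteriors, and expert outputs; the array layout is our own convention:

```python
# Batch gradient ascent for a two-level hierarchy with linear experts.
import numpy as np

def batch_gradient_step(U, V_top, V_low, X, Y, H, H_nested, G, G_nested,
                        Mu, rho=0.01):
    """X: (N, m); Y: (N, n); H: (N, I); H_nested: (N, I, J); G: (N, I);
    G_nested: (N, I, J); Mu: (N, I, J, n) expert outputs."""
    I, J = H_nested.shape[1], H_nested.shape[2]
    for i in range(I):
        # equation 2.15: top-level gating update
        V_top[i] += rho * ((H[:, i] - G[:, i])[:, None] * X).sum(axis=0)
        for j in range(J):
            w = H[:, i] * H_nested[:, i, j]            # h_i * h_{j|i}
            # equation 2.14: expert weight update
            U[i, j] += rho * np.einsum('t,tn,tm->nm', w, Y - Mu[:, i, j], X)
            # equation 2.16: lower-level gating update
            err = H_nested[:, i, j] - G_nested[:, i, j]
            V_low[i, j] += rho * ((H[:, i] * err)[:, None] * X).sum(axis=0)
    return U, V_top, V_low
```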
The corresponding on-line algorithm is obtained by simply dropping the summation sign and updating the parameters after each stimulus presentation. Thus, for example,

\Delta U_{ij}^{(t)} = \rho h_i^{(t)} h_{j|i}^{(t)} (y^{(t)} - \mu_{ij}^{(t)}) (x^{(t)})^T    (2.17)

is the stochastic update rule for the weights in the (i, j)th expert network based on the tth stimulus pattern.

2.5 The EM Algorithm. In the following sections we develop a learning algorithm for the HME architecture based on the Expectation-Maximization (EM) framework of Dempster et al. (1977). We derive an EM algorithm for the architecture that consists of the iterative solution of a coupled set of iteratively reweighted least-squares problems.

The EM algorithm is a general technique for maximum likelihood estimation. In practice EM has been applied almost exclusively to unsupervised learning problems. This is true of the neural network and machine learning literature, in which EM has appeared in the context of clustering (Cheeseman et al. 1988; Nowlan 1991) and density estimation (Specht 1991), as well as the statistics literature, in which applications include missing data problems (Little and Rubin 1987), mixture density estimation (Redner and Walker 1984), and factor analysis (Dempster et al. 1977). Another unsupervised learning application is the learning problem for Hidden Markov Models, for which the Baum-Welch reestimation formulas are a special case of EM. There is nothing in the EM framework that precludes its application to regression or classification problems; however, such applications have been few.⁴

EM is an iterative approach to maximum likelihood estimation. Each iteration of an EM algorithm is composed of two steps: an Estimation (E) step and a Maximization (M) step. The M step involves the maximization of a likelihood function that is redefined in each iteration by the E step. If the algorithm simply increases the function during the M step, rather than maximizing the function, then the algorithm is referred to as a Generalized EM (GEM) algorithm. The Boltzmann learning algorithm (Hinton and Sejnowski 1986) is a neural network example of a GEM algorithm. GEM algorithms are often significantly slower to converge than EM algorithms.

An application of EM generally begins with the observation that the optimization of the likelihood function l(θ; X) would be simplified if only a set of additional variables, called "missing" or "hidden" variables, were known. In this context, we refer to the observable data X as the "incomplete data" and posit a "complete data" set Y that includes the missing variables Z. We specify a probability model that links the fictive missing variables to the actual data: P(y, z | x, θ). The logarithm of the density P defines the "complete-data likelihood," l_c(θ; Y). The original likelihood,

⁴An exception is the "switching regression" model of Quandt and Ramsey (1972). For further discussion of switching regression, see Jordan and Xu (1993).
l(θ; X), is referred to in this context as the "incomplete-data likelihood." It is the relationship between these two likelihood functions that motivates the EM algorithm. Note that the complete-data likelihood is a random variable, because the missing variables Z are in fact unknown. An EM algorithm first finds the expected value of the complete-data likelihood, given the observed data and the current model. This is the E step:

Q(\theta, \theta^{(p)}) = E[l_c(\theta; Y) \mid X]

where θ^(p) is the value of the parameters at the pth iteration and the expectation is taken with respect to θ^(p). This step yields a deterministic function Q. The M step maximizes this function with respect to θ to find the new parameter estimates θ^(p+1):

\theta^{(p+1)} = \arg\max_{\theta} Q(\theta, \theta^{(p)})

The E step is then repeated to yield an improved estimate of the complete likelihood and the process iterates.

An iterative step of EM chooses a parameter value that increases the value of Q, the expectation of the complete likelihood. What is the effect of such a step on the incomplete likelihood? Dempster et al. proved that an increase in Q implies an increase in the incomplete likelihood:

l(\theta^{(p+1)}; X) \geq l(\theta^{(p)}; X)
Equality obtains only at the stationary points of l (Wu 1983). Thus the likelihood l increases monotonically along the sequence of parameter estimates generated by an EM algorithm. In practice this implies convergence to a local maximum.
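Schematically, the generic EM iteration can be written as follows (the e_step and m_step arguments are model-specific callables, and the stopping rule is our own illustrative choice):

```python
# A schematic of the generic EM iteration just described.
def em(theta, data, e_step, m_step, max_iter=100, tol=1e-6):
    prev_loglik = -float("inf")
    for _ in range(max_iter):
        stats = e_step(theta, data)          # expectations of missing data
        theta, loglik = m_step(stats, data)  # maximize Q(theta, theta_p)
        if loglik - prev_loglik < tol:       # monotone by the EM property
            break
        prev_loglik = loglik
    return theta
```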
2.6 Applying EM to the HME Architecture. To develop an EM algorithm for the HME architecture, we must define appropriate "missing data" so as to simplify the likelihood function. We define indicator variables z_i and z_{j|i}, such that one and only one of the z_i is equal to one, and one and only one of the z_{j|i} is equal to one. These indicator variables have an interpretation as the labels that correspond to the decisions in the probability model. We also define the indicator variable z_ij, which is the product of z_i and z_{j|i}. This variable has an interpretation as the label that specifies the expert (the regressive process) in the probability model. If the labels z_i, z_{j|i}, and z_ij were known, then the maximum likelihood problem would decouple into a separate set of regression problems for each expert network and a separate set of multiway classification problems for the gating networks. These problems would be solved independently of each other, yielding a rapid one-pass learning algorithm. Of course, the missing variables are not known, but we can specify a probability model that links them to the observable data. This probability model can be written in terms of the z_ij as follows:

P(y^{(t)}, z_{ij}^{(t)} \mid x^{(t)}, \theta) = g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)})    (2.18)

= \prod_i \prod_j \left\{ g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right\}^{z_{ij}^{(t)}}    (2.19)

using the fact that z_ij^(t) is an indicator variable. Taking the logarithm of this probability model yields the following complete-data likelihood:

l_c(\theta; Y) = \sum_t \sum_i \sum_j z_{ij}^{(t)} \ln \left\{ g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right\}    (2.20)

= \sum_t \sum_i \sum_j z_{ij}^{(t)} \left\{ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right\}    (2.21)
Note the relationship of the complete-data likelihood in equation 2.21 to the incomplete-data likelihood in equation 2.13. The use of the indicator variables z_ij has allowed the logarithm to be brought inside the summation signs, substantially simplifying the maximization problem. We now define the E step of the EM algorithm by taking the expectation of the complete-data likelihood:

Q(\theta, \theta^{(p)}) = \sum_t \sum_i \sum_j h_{ij}^{(t)} \left\{ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right\}    (2.22)

where the expectation of each indicator variable is the posterior probability that it equals one (equations 2.23 through 2.26):

E[z_{ij}^{(t)} \mid X] = P(z_{ij}^{(t)} = 1 \mid y^{(t)}, x^{(t)}) = h_{ij}^{(t)}
(Note also that E[z_i^(t) | X] = h_i^(t) and E[z_{j|i}^(t) | X] = h_{j|i}^(t).) The M step requires maximizing Q(θ, θ^(p)) with respect to the expert network parameters and the gating network parameters. Examining equation 2.22, we see that the expert network parameters influence the Q function only through the terms h_ij^(t) ln P_ij(y^(t)), and the gating network parameters influence the Q function only through the terms h_i^(t) ln g_i^(t) and h_ij^(t) ln g_{j|i}^(t). Thus the M step reduces to the following separate maximization problems:
U_{ij}^{(p+1)} = \arg\max_{U_{ij}} \sum_t h_{ij}^{(t)} \ln P_{ij}(y^{(t)})    (2.27)

v_i^{(p+1)} = \arg\max_{v_i} \sum_t \sum_k h_k^{(t)} \ln g_k^{(t)}    (2.28)

v_{ij}^{(p+1)} = \arg\max_{v_{ij}} \sum_t \sum_k h_k^{(t)} \sum_l h_{l|k}^{(t)} \ln g_{l|k}^{(t)}    (2.29)

Each of these maximization problems is itself a maximum likelihood problem. This is clearly true in the case of equation 2.27, which is simply a weighted maximum likelihood problem in the probability density P_ij. Given our parameterization of P_ij, the log likelihood in equation 2.27 is a weighted log likelihood for a GLIM. An efficient algorithm known as iteratively reweighted least squares (IRLS) is available to solve the maximum likelihood problem for such models (McCullagh and Nelder 1983). We discuss IRLS in Appendix A.

Equation 2.28 involves maximizing the cross-entropy between the posterior probabilities h_i^(t) and the prior probabilities g_i^(t). This cross-entropy is the log likelihood associated with a multinomial logit probability model in which the h_i^(t) act as the output observations (see Appendix B). Thus the maximization in equation 2.28 is also a maximum likelihood problem for a GLIM and can be solved using IRLS. The same is true of equation 2.29, which is a weighted maximum likelihood problem with output observations h_{j|i}^(t) and observation weights h_i^(t).

In summary, the EM algorithm that we have obtained involves a calculation of posterior probabilities in the outer loop (the E step), and the solution of a set of IRLS problems in the inner loop (the M step). We summarize the algorithm as follows (a code sketch of the regression special case of step 2 appears after the algorithm):

Algorithm 1

1. For each data pair (x^(t), y^(t)), compute the posterior probabilities h_i^(t) and h_{j|i}^(t) using the current values of the parameters.

2. For each expert (i, j), solve an IRLS problem with observations {(x^(t), y^(t))}_1^N and observation weights {h_ij^(t)}_1^N.

3. For each top-level gating network, solve an IRLS problem with observations {(x^(t), h_i^(t))}_1^N.

4. For each lower-level gating network, solve a weighted IRLS problem with observations {(x^(t), h_{j|i}^(t))}_1^N and observation weights {h_i^(t)}_1^N.

5. Iterate using the updated parameter values.
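To make step 2 concrete: in the gaussian regression case the expert-fitting IRLS problem collapses to a weighted least-squares problem, which has the closed-form solution sketched below (assuming NumPy; names are ours). The gating networks, in contrast, require the full iterative IRLS solver of Appendix A.

```python
# Weighted least squares via the normal equations.
import numpy as np

def weighted_least_squares(X, Y, w):
    """Solve min_B sum_t w_t * ||y_t - B x_t||^2.
    X: (N, m) inputs; Y: (N, n) targets; w: (N,) observation weights.
    Returns the (n, m) coefficient matrix B."""
    Xw = X * w[:, None]                    # scale rows by the weights
    A = X.T @ Xw                           # weighted Gram matrix, (m, m)
    Bt = np.linalg.solve(A, Xw.T @ Y)      # (m, n) solution of the system
    return Bt.T                            # the expert weight matrix, (n, m)

# Example: fitting expert (i, j) with weights h_ij across the training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)); Y = rng.normal(size=(100, 2))
h_ij = rng.uniform(size=100)
U_ij = weighted_least_squares(X, Y, h_ij)  # shape (2, 5)
```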
2.7 A Least-Squares Algorithm. In the case of regression, in which a gaussian probability model and an identity link function are used, the IRLS loop for the expert networks reduces to weighted least squares, which can be solved (in one pass) by any of the standard least-squares algorithms (Golub and Van Loan 1989). The gating networks still require iterative processing. Suppose, however, that we fit the parameters of the
gating networks using least squares rather than maximum likelihood. In this case, we might hope to obtain an algorithm in which the gating network parameters are fit by a one-pass algorithm.

To motivate this approach, note that we can express the IRLS problem for the gating networks as follows. Differentiating the cross-entropy (equation 2.28) with respect to the parameters v_i (using the fact that ∂g_i/∂ξ_j = g_i(δ_ij − g_j), where δ_ij is the Kronecker delta) and setting the derivatives to zero yields the following equations:

\sum_t \left( h_i^{(t)} - g_i(x^{(t)}, v_i) \right) x^{(t)} = 0    (2.30)

which are a coupled set of equations that must be solved for each i. Similarly, for each gating network at the second level of the tree, we obtain the following equations:

\sum_t h_i^{(t)} \left( h_{j|i}^{(t)} - g_{j|i}(x^{(t)}, v_{ij}) \right) x^{(t)} = 0    (2.31)
which must be solved for each i and j. There is one aspect of these equations that renders them unusual. Recall that if the labels z_i^(t) and z_{j|i}^(t) were known, then the gating networks would be essentially solving a set of multiway classification problems. The supervised errors (z_i^(t) − g_i^(t)) and (z_{j|i}^(t) − g_{j|i}^(t)) would appear in the algorithm for solving these problems. Note that these errors are differences between indicator variables and probabilities. In equations 2.30 and 2.31, on the other hand, the errors that drive the algorithm are the differences (h_i^(t) − g_i^(t)) and (h_{j|i}^(t) − g_{j|i}^(t)), which are differences between probabilities. The EM algorithm effectively "fills in" the missing labels with estimated probabilities h_i and h_{j|i}. These estimated probabilities can be thought of as targets for the g_i and the g_{j|i}. This suggests that we can compute "virtual targets" for the underlying linear predictors ξ_i and ξ_ij by inverting the softmax function. (Note that this option would not be available for the z_i and z_{j|i}, even if they were known, because zero and one are not in the range of the softmax function.) Thus the targets for the ξ_i are the values:

\ln h_i^{(t)} - \ln C

where C = Σ_k e^{ξ_k} is the normalization constant in the softmax function. Note, however, that constants that are common to all of the ξ_i can be omitted, because such constants disappear when the ξ_i are converted to g_i (cf. equation 2.3). Thus the values ln h_i^(t) can be used as targets for the ξ_i. A similar argument shows that the values ln h_{j|i}^(t) can be used as targets for the ξ_ij, with observation weights h_i^(t).

The utility of this approach is that once targets are available for the linear predictors ξ_i and ξ_ij, the problem of finding the parameters v_i and v_ij reduces to a coupled set of weighted least-squares problems. Thus we obtain an algorithm in which all of the parameters in the hierarchy,
both in the expert networks and the gating networks, can be obtained by solving least-squares problems. This yields the following learning algorithm (a code sketch follows the algorithm):

Algorithm 2

1. For each data pair (x^(t), y^(t)), compute the posterior probabilities h_i^(t) and h_{j|i}^(t) using the current values of the parameters.

2. For each expert (i, j), solve a weighted least-squares problem with observations {(x^(t), y^(t))}_1^N and observation weights {h_ij^(t)}_1^N.

3. For each top-level gating network, solve a least-squares problem with observations {(x^(t), ln h_i^(t))}_1^N.

4. For each lower-level gating network, solve a weighted least-squares problem with observations {(x^(t), ln h_{j|i}^(t))}_1^N and observation weights {h_i^(t)}_1^N.

5. Iterate using the updated parameter values.
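A sketch of one pass of Algorithm 2, reusing the weighted_least_squares routine above; the data layout, the posteriors callable, and the small constant guarding the logarithms are our own assumptions:

```python
# One iteration of Algorithm 2 for a two-level hierarchy with linear experts.
import numpy as np

def algorithm2_iteration(U, V_top, V_low, X, Y, posteriors, wls):
    """U: (I, J, n, m); V_top: (I, m); V_low: (I, J, m); X: (N, m); Y: (N, n).
    posteriors(X, Y) returns H: (N, I) and H_nested: (N, I, J)."""
    H, H_nested = posteriors(X, Y)                # step 1
    eps = 1e-10                                   # guard the logarithms
    I, J = V_low.shape[0], V_low.shape[1]
    for i in range(I):
        # step 3: unweighted least squares on targets ln h_i
        V_top[i] = wls(X, np.log(H[:, i:i+1] + eps), np.ones(len(X)))[0]
        for j in range(J):
            h_ij = H[:, i] * H_nested[:, i, j]
            U[i, j] = wls(X, Y, h_ij)             # step 2: weighted LS
            # step 4: weighted LS on targets ln h_{j|i} with weights h_i
            V_low[i, j] = wls(X, np.log(H_nested[:, i, j:j+1] + eps),
                              H[:, i])[0]
    return U, V_top, V_low
```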
It is important to note that this algorithm does not yield the same parameter estimates as Algorithm 1; the gating network residuals are being fit by least squares rather than maximum likelihood. The algorithm can be thought of as an approximation to Algorithm 1, an approximation based on the assumption that the differences between h_i^(t) and g_i^(t) are small. This assumption is equivalent to the assumption that the architecture can fit the underlying regression surface (a consistency condition) and the assumption that the noise is small. In practice we have found that the least-squares algorithm works reasonably well, even in the early stages of fitting when the residuals can be large. The ability to use least squares is certainly appealing from a computational point of view. One possible hybrid algorithm involves using the least-squares algorithm to converge quickly to the neighborhood of a solution and then using IRLS to refine the solution.

2.8 Simulation Results. We tested Algorithm 1 and Algorithm 2 on a nonlinear system identification problem. The data were obtained from a simulation of a four-joint robot arm moving in three-dimensional space (Fun and Jordan 1993). The network must learn the forward dynamics of the arm, a state-dependent mapping from joint torques to joint accelerations. The state of the arm is encoded by eight real-valued variables: four positions (rad) and four angular velocities (rad/sec). The torque was encoded as four real-valued variables (N·m). Thus there were 12 inputs to the learning system. Given these 12 input variables, the network must predict the four accelerations at the joints (rad/sec²). This
mapping is highly nonlinear due to the rotating coordinate systems and the interaction torques between the links of the arm. We generated 15,000 data points for training and 5,000 points for testing. For each epoch (i.e., each pass through the training set), we computed the relative error on the test set. Relative error is computed as a ratio between the mean squared error and the mean squared error that would be obtained if the learner were to output the mean value of the accelerations for all data points. We compared the performance of a binary hierarchy to that of a backpropagation network. The hierarchy was a four-level hierarchy with 16 expert networks and 15 gating networks. Each expert network had 4 output units and each gating network had 1 output unit. The backpropagation network had 60 hidden units, which yields approximately the same number of parameters in the network as in the hierarchy. The HME architecture was trained by Algorithms 1 and 2, utilizing Cholesky decomposition to solve the weighted least-squares problems (Golub and van Loan 1989). Note that the HME algorithms have no free parameters. The free parameters for the backpropagation network (the learning rate and the momentum term) were chosen based on a coarse search of the parameter space. (Values of 0.00001 and 0.15 were chosen for these parameters.) There were difficulties with local minima (or plateaus) using the backpropagation algorithm: Five of 10 runs failed to converge to "reasonable" error values. (As we report in the next section, no such difficulties were encountered in the case of on-line backpropagation.) We report average convergence times and average relative errors only for those runs that converged to "reasonable" error values. All 10 runs for both of the HME algorithms converged to "reasonable" error values. Figure 2 shows the performance of the hierarchy and the backpropagation network. The horizontal axis of the graph gives the training time in epochs. The vertical axis gives generalization performance as measured by the average relative error on the test set. Table 1 reports the average relative errors for both architectures measured at the minima of the relative error curves. (Minima were defined by a sequence of three successive increases in the relative error.) We also report values of relative error for the best linear approximation, the CART algorithm, and the MARS algorithm. Both CART and MARS were run four times, once for each of the output variables. We combined the results from these four computations to compute the total relative error. Two versions of CART were run; one in which the splits were restricted to be parallel to the axes and one in which linear combinations of the input variables were allowed. The MARS algorithm requires choices to be made for the values of two structural parameters: the maximum number of basis functions and the maximum number of interaction terms. Each basis function in MARS yields a linear surface defined over a rectangular region of the input
Figure 2: Relative error on the test set for a backpropagation network and a four-level HME architecture trained with batch algorithms. The standard errors at the minima of the curves are 0.013 for backpropagation and 0.002 for HME.

Table 1: Average Values of Relative Error and Number of Epochs Required for Convergence for the Batch Algorithms.

Architecture          Relative Error    # Epochs
Linear                0.31              1
Backpropagation       0.09              5,500
HME (Algorithm 1)     0.10              35
HME (Algorithm 2)     0.12              39
CART                  0.17              NA
CART (linear)         0.13              NA
MARS                  0.16              NA
space, corresponding roughly to the function implemented by a single expert in the HME architecture. Therefore we chose a maximum of 16 basis functions to correspond to the 16 experts in the four-level hierarchy. To choose the maximum number of interactions (mi), we compared the performance of MARS for mi = 1, 2, 3, 6, and 12, and chose the value that yielded the best performance (mi = 3). For the iterative algorithms, we also report the number of epochs required for convergence. Because the learning curves for these algorithms
generally have lengthy tails, we defined convergence as the first epoch at which the relative error drops within 5% of the minimum. All of the architectures that we studied performed significantly better than the best linear approximation. As expected, the CART architecture with linear combinations performed better than CART with axis-parallel splits.⁵

⁵It should be noted that CART is at an advantage relative to the other algorithms in this comparison, because no structural parameters were fixed for CART. That is, CART is allowed to find the best tree of any size to fit the data.

The HME architecture yielded a modest improvement over MARS and CART. Backpropagation produced the lowest relative error of the algorithms tested (ignoring the difficulties with convergence). These differences in relative error should be treated with some caution. The need to set free parameters for some of the architectures (e.g., backpropagation) and the need to make structural choices (e.g., number of hidden units, number of basis functions, number of experts) make it difficult to match architectures. The HME architecture, for example, involves parameter dependencies that are not present in a backpropagation network. A gating network at a high level in the tree can "pinch off" a branch of the tree, rendering useless the parameters in that branch of the tree. Raw parameter count is therefore only a very rough guide to architecture capacity; more precise measures are needed (e.g., VC dimension) before definitive quantitative comparisons can be made. The differences between backpropagation and HME in terms of convergence time are more definitive. Both HME algorithms reliably converge more than two orders of magnitude faster than backpropagation.

As shown in Figure 3, the HME architecture lends itself well to graphic investigation. This figure displays the time sequence of the distributions of posterior probabilities across the training set at each node of the tree. At Epoch 0, before any learning has taken place, most of the posterior probabilities at each node are approximately 0.5 across the training set. As the training proceeds, the histograms flatten out, eventually approaching bimodal distributions in which the posterior probabilities are either one or zero for most of the training patterns. This evolution is indicative of increasingly sharp splits being fit by the gating networks. Note that there is a tendency for the splits to be formed more rapidly at higher levels in the tree than at lower levels.

Figure 4 shows another graphic device that can be useful for understanding the way in which an HME architecture fits a data set. This figure, which we refer to as a "deviance tree," shows the deviance (mean squared error) that would be obtained at each level of the tree if the tree were clipped at that level. We construct a clipped tree at a given level by replacing each nonterminal at that level with a matrix that is a weighted average of the experts below that nonterminal. The weights are the total prior probabilities associated with each expert across the training set. The error for each output unit is then calculated by passing the test set through the clipped tree. As can be seen in the figure, the deviance is
[Figure 3 appears here; its four panels are labeled Epoch 0, Epoch 9, Epoch 19, and Epoch 29.]
Figure 3: A sequence of histogram trees for the HME architecture. Each histogram displays the distribution of posterior probabilities across the training set at each node in the tree.

substantially smaller for deeper trees (note that the ordinate of the plots is on a log scale). The deviance in the right branch of the tree is larger than in the left branch of the tree. Information such as this can be useful for purposes of exploratory data analysis and for model selection.

2.9 An On-Line Algorithm. The batch least-squares algorithm that we have described (Algorithm 2) can be converted into an on-line algorithm by noting that linear least-squares and weighted linear least-squares problems can be solved by recursive procedures that update the parameter estimates with each successive data point (Ljung and Soderstrom 1986). Our application of these recursive algorithms is straightforward; however, care must be taken to handle the observation weights (the posterior probabilities) correctly. These weights change as a function of the changing parameter values. This implies that the recursive least squares algorithm must include a decay parameter that allows the system to "forget" older values of the posterior probabilities.
Figure 4: A deviance tree for the HME architecture. Each plot displays the mean squared error (MSE) for the four output units of the clipped tree. The plots are on a log scale covering approximately three orders of magnitude.
In this section we present the equations for the on-line algorithm. These equations involve an update not only of the parameters in each of the networks,⁶ but also the storage and updating of an inverse covariance matrix for each network. Each matrix has dimensionality m × m, where m is the dimensionality of the input vector. (Note that the size of these matrices depends on the square of the number of input variables, not the square of the number of parameters. Note also that the update equation for the inverse covariance matrix updates the inverse matrix directly; there is never a need to invert matrices.)

The on-line update rule for the parameters of the expert networks is given by the following recursive equation:

U_{ij}^{(t+1)} = U_{ij}^{(t)} + h_{ij}^{(t)} (y^{(t)} - \mu_{ij}^{(t)}) (x^{(t)})^T R_{ij}^{(t)}    (2.32)

⁶Note that in this section we use the term "parameters" for the variables that are traditionally called "weights" in the neural network literature. We reserve the term "weights" for the observation weights.
where R_ij is the inverse covariance matrix for expert network (i, j). This matrix is updated via the equation:

R_{ij}^{(t)} = \frac{1}{\lambda} \left[ R_{ij}^{(t-1)} - \frac{h_{ij}^{(t)} R_{ij}^{(t-1)} x^{(t)} (x^{(t)})^T R_{ij}^{(t-1)}}{\lambda + h_{ij}^{(t)} (x^{(t)})^T R_{ij}^{(t-1)} x^{(t)}} \right]    (2.33)

where λ is the decay parameter. It is interesting to note the similarity between the parameter update rule in equation 2.32 and the gradient rule presented earlier (cf. equation 2.14). These updates are essentially the same, except that the scalar ρ is replaced by the matrix R_ij^(t). It can be shown, however, that R_ij^(t) is an estimate of the inverse Hessian of the least-squares cost function (Ljung and Soderstrom 1986); thus equation 2.32 is in fact a stochastic approximation to a Newton-Raphson method rather than a gradient method.⁷

⁷This is true for fixed values of the posterior probabilities. These posterior probabilities are also changing over time, however, as required by the EM algorithm. The overall convergence rate of the algorithm is determined by the convergence rate of EM, not the convergence rate of Newton-Raphson.

Similar equations apply for the updates of the gating networks. The update rule for the parameters of the top-level gating network is given by the following equation (for the ith output of the gating network):

v_i^{(t+1)} = v_i^{(t)} + S_i^{(t)} \left( \ln h_i^{(t)} - \xi_i^{(t)} \right) x^{(t)}    (2.34)

where the inverse covariance matrix S_i is updated by

S_i^{(t)} = \frac{1}{\lambda} \left[ S_i^{(t-1)} - \frac{S_i^{(t-1)} x^{(t)} (x^{(t)})^T S_i^{(t-1)}}{\lambda + (x^{(t)})^T S_i^{(t-1)} x^{(t)}} \right]    (2.35)

Finally, the update rule for the parameters of the lower-level gating networks is as follows:

v_{ij}^{(t+1)} = v_{ij}^{(t)} + S_{ij}^{(t)} h_i^{(t)} \left( \ln h_{j|i}^{(t)} - \xi_{ij}^{(t)} \right) x^{(t)}    (2.36)

where the inverse covariance matrix S_ij is updated by

S_{ij}^{(t)} = \frac{1}{\lambda} \left[ S_{ij}^{(t-1)} - \frac{h_i^{(t)} S_{ij}^{(t-1)} x^{(t)} (x^{(t)})^T S_{ij}^{(t-1)}}{\lambda + h_i^{(t)} (x^{(t)})^T S_{ij}^{(t-1)} x^{(t)}} \right]    (2.37)
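A sketch of the per-expert recursive update of equations 2.32 and 2.33 (assuming NumPy; the initialization of R, typically a large multiple of the identity, is our own choice):

```python
# Recursive weighted least squares for one expert network.
import numpy as np

def expert_rls_update(U, R, x, y, h, lam=0.99):
    """U: (n, m) expert weights; R: (m, m) inverse covariance; x: (m,);
    y: (n,); h: scalar joint posterior h_ij; lam: decay parameter."""
    Rx = R @ x
    denom = lam + h * (x @ Rx)
    R_new = (R - np.outer(Rx, Rx) * h / denom) / lam   # equation 2.33
    mu = U @ x                                         # current prediction
    U_new = U + h * np.outer(y - mu, x) @ R_new        # equation 2.32
    return U_new, R_new
```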
2.10 Simulation Results. The on-line algorithm was tested on the robot dynamics problem described in the previous section. Preliminary simulations convinced us of the necessity of the decay parameter (λ). We also found that this parameter should be slowly increased as training proceeds: on the early trials the posterior probabilities are changing rapidly, so the covariances should be decayed rapidly, whereas on later trials the posterior probabilities have stabilized and the covariances should be decayed less rapidly. We used a simple fixed schedule: λ was
Figure 5: Relative error on the test set for a backpropagation network and a four-level hierarchy trained with on-line algorithms. The standard errors at the minima of the curves are 0.008 for backpropagation and 0.009 for HME.
initialized to 0.99 and increased a fixed fraction (0.6) of the remaining distance to 1.0 every 1000 time steps. The performance of the on-line algorithm was compared to an on-line backpropagation network. Parameter settings for the backpropagation network were obtained by a coarse search through the parameter space, yielding a value of 0.15 for the learning rate and 0.20 for the momentum. The results for both architectures are shown in Figure 5. As can be seen, the on-line algorithm for backpropagation is significantly faster than the corresponding batch algorithm (cf. Fig. 2). This is also true of the on-line HME algorithm, which has nearly converged within the first epoch. The minimum values of relative error and the convergence times for both architectures are provided in Table 2. We also provide the corresponding values for a simulation of the on-line gradient algorithm for the HME architecture (equation 2.17). We also performed a set of simulations that tested a variety of different HME architectures. We compared a one-level hierarchy with 32 experts to hierarchies with five levels (32 experts), and six levels (64 experts). We also simulated two three-level hierarchies, one with branching factors of 4, 4, and 2 (proceeding from the top of the tree to the bottom), and one with branching factors of 2, 4, and 4. (Each three-level hierarchy contained 32 experts.) The results are shown in Figure 6. As can be
Table 2: Average Values of Relative Error and Number of Epochs Required for Convergence for the On-Line Algorithms.

Architecture               Relative Error    Number of Epochs
Linear                     0.32              1
Backpropagation (on-line)  0.08              63
HME (on-line)              0.12              2
HME (gradient)             0.15              104
Figure 6: Relative error on the test set for HME hierarchies with different structures. "3-level (a)" refers to a 3-level hierarchy with branching factors of 4, 4, and 2, and "3-level (b)" refers to a 3-level hierarchy with branching factors of 2, 4, and 4. The standard errors for all curves at their respective minima were approximately 0.009.

seen, there was a significant difference between the one-level hierarchy and the other architectures. There were smaller differences among the multilevel hierarchies. No significant difference was observed between the two different 3-level architectures.
3 Model Selection

Utilizing the HME approach requires that choices be made regarding the structural parameters of the model, in particular the number of levels
and the branching factor of the tree. As with other flexible estimation techniques, it is desirable to allow these structural parameters to be chosen based at least partly on the data. This model selection problem can be addressed in a variety of ways. In this paper we have utilized a test set approach to model selection, stopping the training when the error on the test set reaches a minimum. As is the case with other neural network algorithms, this procedure can be justified as a complexity control measure. As we have noted, when the parameters in the gating networks of an HME architecture are small, the entire system reduces to a single "averaged" GLIM at the root of the tree. As the training proceeds, the parameters in the gating networks begin to grow in magnitude and splits are formed. When a split is formed the parameters in the branches of the tree on either side of the split are decoupled and the effective number of degrees of freedom in the system increases. This increase in complexity takes place gradually as the values of the parameters increase and the splits sharpen. By stopping the training of the system based on the performance on a test set, we obtain control over the effective number of degrees of freedom in the architecture.

Other approaches to model selection can also be considered. One natural approach is to use ridge regression in each of the expert networks and the gating networks. This approach extends naturally to the on-line setting in the form of a "weight decay." It is also worth considering Bayesian techniques of the kind considered in the decision tree literature by Buntine (1991), as well as the MDL methods of Quinlan and Rivest (1989).

4 Related Work
There are a variety of ties that can be made between the HME architecture and related work in statistics, machine learning, and neural networks. In this section we briefly mention some of these ties and make some comparative remarks.

Our architecture is not the only nonlinear approximator to make substantial use of GLIMs and the IRLS algorithm. IRLS also figures prominently in a branch of nonparametric statistics known as generalized additive models (GAMs; Hastie and Tibshirani 1990). It is interesting to note the complementary roles of IRLS in these two architectures. In the GAM model, the IRLS algorithm appears in the outer loop, providing an adjusted dependent variable that is fit by a backfitting procedure in the inner loop. In the HME approach, on the other hand, the outer loop is the E step of EM and IRLS is in the inner loop. This complementarity suggests that it might be of interest to consider hybrid models in which an HME is nested inside a GAM or vice versa.

We have already mentioned the close ties between the HME approach and other tree-structured estimators such as CART and MARS. Our
approach differs from MARS and related architectures, such as the basis-function trees of Sanger (1991), by allowing splits that are oblique with respect to the axes. We also differ from these architectures by using a statistical model, the multinomial logit model, for the splits. We believe that both of these features can play a role in increasing predictive ability: the use of oblique splits should tend to decrease bias, and the use of smooth multinomial logit splits should generally decrease variance. Oblique splits also render the HME architecture insensitive to the particular choice of coordinates used to encode the data. Finally, it is worth emphasizing the difference in philosophy behind these architectures. Whereas CART and MARS are entirely nonparametric, the HME approach has a strong flavor of parametric statistics, via its use of generalized linear models, mixture models, and maximum likelihood.

Similar comments can be made with respect to the decision tree methodology in the machine learning literature. Algorithms such as ID3 build trees that have axis-parallel splits and use heuristic splitting algorithms (Quinlan 1986). More recent research has studied decision trees with oblique splits (Murthy et al. 1993; Utgoff and Brodley 1990). None of these papers, however, has treated the problem of splitting data as a statistical problem, nor have they provided a global goodness-of-fit measure for their trees.

There are a variety of neural network architectures that are related to the HME architecture. The multiresolution aspect of HME is reminiscent of Moody's (1989) multiresolution CMAC hierarchy, differing in that Moody's levels of resolution are handled explicitly by separate networks. The "neural tree" algorithm (Stromberg et al. 1991) is a decision tree with multilayer perceptrons (MLPs) at the nonterminals. This architecture can form oblique (or curvilinear) splits; however, the MLPs are trained by a heuristic that has no clear relationship to overall classification performance. Finally, Hinton and Nowlan (see Nowlan 1991) have independently proposed extending the Jacobs et al. (1991) modular architecture to a tree-structured system. They did not develop a likelihood approach to the problem, however, proposing instead a heuristic splitting scheme.

5 Conclusions
We have presented a tree-structured architecture for supervised learning. We have developed the learning algorithm for this architecture within the framework of maximum likelihood estimation, utilizing ideas from mixture model estimation and generalized linear model theory. The maximum likelihood framework allows standard tools from statistical theory to be brought to bear in developing inference procedures and measures of uncertainty for the architecture (Cox and Hinkley 1974). It also opens the door to the Bayesian approaches that have been found to be useful
in the context of unsupervised mixture model estimation (Cheeseman et
al. 1988).

Although we have not emphasized theoretical issues in this paper, there are a number of points that are worth mentioning. First, the set of exponentially smoothed piecewise linear functions that we have utilized is clearly dense in the set of piecewise linear functions on compact sets in ℜ^m; thus it is straightforward to show that the hierarchical architecture is dense in the set of continuous functions on compact sets in ℜ^m. That is, the architecture is "universal" in the sense of Hornik et al. (1989). From this result it would seem straightforward to develop consistency results for the architecture (cf. Geman et al. 1992; Stone 1977). We are currently developing this line of argument and are studying the asymptotic distributional properties of fixed hierarchies. Second, convergence results are available for the architecture. We have shown that the convergence rate of the algorithm is linear in the condition number of a matrix that is the product of an inverse covariance matrix and the Hessian of the log likelihood for the architecture (Jordan and Xu 1993).

Finally, it is worth noting a number of possible extensions of the work reported here. Our earlier work on hierarchical mixtures of experts utilized the multilayer perceptron as the primitive function for the expert networks and gating networks (Jordan and Jacobs 1992). That option is still available, although we lose the EM proof of convergence (cf. Jordan and Xu 1993) and we lose the ability to fit the subnetworks efficiently with IRLS. One interesting example of such an application is the case where the experts are autoassociators (Bourlard and Kamp 1988), in which case the architecture fits hierarchically nested local principal component decompositions. Another area in unsupervised learning worth exploring is the nonassociative version of the hierarchical architecture. Such a model would be a recursive version of classical mixture-likelihood clustering and may have interesting ties to hierarchical clustering models. Finally, it is also of interest to note that the recursive least squares algorithm that we utilized in obtaining an on-line variant of Algorithm 2 is not the only possible on-line approach. Any of the fast filter algorithms (Haykin 1991) could also be utilized, giving rise to a family of on-line algorithms. Also, it is worth studying the application of the recursive algorithms to PRESS-like cross-validation calculations to efficiently compute the changes in likelihood that arise from adding or deleting parameters or data points.

Appendix A: Iteratively Reweighted Least Squares

The iteratively reweighted least squares (IRLS) algorithm is the inner loop of the algorithm that we have proposed for the HME architecture. In this section, we describe the IRLS algorithm, deriving it as a special case of the Fisher scoring method for generalized linear models. Our presentation derives from McCullagh and Nelder (1983).
IRLS is an iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model. It is a special case of a general algorithm for maximum likelihood estimation known as the Fisher scoring method (Finney 1973). Let l(β; X) be a log likelihood function, a function of the parameter vector β, and let ∂²l/∂β∂β^T denote the Hessian of the log likelihood. The Fisher scoring method updates the parameter estimates β as follows:

\beta_{r+1} = \beta_r - \left\{ E\left[ \frac{\partial^2 l}{\partial \beta \partial \beta^T} \right] \right\}^{-1} \frac{\partial l}{\partial \beta}    (5.1)

where β_r denotes the parameter estimate at the rth iteration and ∂l/∂β is the gradient vector. Note that the Fisher scoring method is essentially the same as the Newton-Raphson algorithm, except that the expected value of the Hessian replaces the Hessian. There are statistical reasons for preferring the expected value of the Hessian (and the expected value of the Hessian is often easier to compute), but Newton-Raphson can also be used in many cases.

The likelihood in generalized linear model theory is a product of densities from the exponential family of distributions. This family is an important class in statistics and includes many useful densities, such as the normal, the Poisson, the binomial, and the gamma. The general form of a density in the exponential family is the following:

P(y; \eta, \phi) = \exp\{ (\eta y - b(\eta))/\phi + c(y, \phi) \}    (5.2)
where p, denotes the parameter estimate at the rth iteration and al/ap is the gradient vector. Note that the Fisher scoring method is essentially the same as the Newton-Raphson algorithm, except that the expected value of the Hessian replaces the Hessian. There are statistical reasons for preferring the expected value of the Hessian-and the expected value of the Hessian is often easier to c o m p u t e b u t Newton-Raphson can also be used in many cases. The likelihood in generalized linear model theory is a product of densities from the exponential family of distributions. This family is an important class in statistics and includes many useful densities, such as the normal, the Poisson, the binomial, and the gamma. The general form of a density in the exponential family is the following:
where 71 is known as the "natural parameter" and parameter.6
4 is the
dispersion
Example (Bernoulli Density). The Bernoulli density with mean π has the following form:

P(y; \pi) = \pi^y (1 - \pi)^{1-y}
          = \exp\left\{ \ln\left( \frac{\pi}{1 - \pi} \right) y + \ln(1 - \pi) \right\}
          = \exp\{ \eta y - \ln(1 + e^{\eta}) \}    (5.3)
where η = ln(π/(1 − π)) is the natural parameter of the Bernoulli density. This parameter has the interpretation as the log odds of "success" in a random Bernoulli experiment.

In a generalized linear model, the parameter η is modeled as a linear function of the input x:

\eta = \beta^T x
where β is a parameter vector. Substituting this expression into equation 5.2 and taking the product of N such densities yields the following log likelihood for a data set X = {(x^(t), y^(t))}_{t=1}^N:

l(\beta; X) = \sum_t \left\{ (\beta^T x^{(t)} y^{(t)} - b(\beta^T x^{(t)}))/\phi + c(y^{(t)}, \phi) \right\}

The observations y^(t) are assumed to be sampled independently from densities P(y, η^(t), φ), where η^(t) = β^T x^(t). We now compute the gradient of the log likelihood:

\frac{\partial l}{\partial \beta} = \sum_t (y^{(t)} - b'(\beta^T x^{(t)}))\, x^{(t)} / \phi
and the Hessian of the log likelihood:

\frac{\partial^2 l}{\partial \beta \partial \beta^T} = -\sum_t b''(\beta^T x^{(t)})\, x^{(t)} (x^{(t)})^T / \phi    (5.5)

These quantities could be substituted directly into equation 5.1; however, there is additional mathematical structure that can be exploited. First note the following identity, which is true of any log likelihood:

E\left[ \frac{\partial l}{\partial \beta} \right] = 0
(This fact can be proved by differentiating both sides of the identity ∫P(y, β, φ) dy = 1 with respect to β.) Because this identity is true for any set of observed data, including all subsets of X, we have the following:

E\left[ (y^{(t)} - b'(\beta^T x^{(t)}))\, x^{(t)} / \phi \right] = 0
for all t. This equation implies that the mean of y^(t), which we denote as μ^(t), is a function of η^(t). We therefore include in the generalized linear model the link function, which models μ as a function of η:

\mu = f(\eta)

Example (Bernoulli Density). Equation 5.3 shows that b(η) = ln(1 + e^η) for the Bernoulli density. Thus

\mu = b'(\eta) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}

which is the logistic function. Inverting the logistic function yields η = ln(μ/(1 − μ)); thus, μ equals π, as it must.
The link function f(η) = b'(η) is known in generalized linear model theory as the canonical link. By parameterizing the exponential family density in terms of η (cf. equation 5.2), we have forced the choice of the canonical link. It is also possible to use other links, in which case η
no longer has the interpretation as the natural parameter of the density. There are statistical reasons, however, to prefer the canonical link (McCullagh and Nelder 1983). Moreover, by choosing the canonical link, the Hessian of the likelihood turns out to be constant (cf. equation 5.5), and the Fisher scoring method therefore reduces to Newton-Raphson.⁹

To continue the development, we need an additional fact about log likelihoods. By differentiating the identity ∫P(y, β) dy = 1 twice with respect to β, the following identity can be established:

E\left[ \frac{\partial^2 l}{\partial \beta \partial \beta^T} \right] = -E\left[ \frac{\partial l}{\partial \beta} \left( \frac{\partial l}{\partial \beta} \right)^T \right]
This identity can be used to obtain a relationship between the variance of y^(t) and the function b(η) in the exponential family density. Beginning with equation 5.5, and using the identity above together with the independence of the observations, the expected Hessian can be equated with the negative of the expected outer product of the gradient; comparing the result with equation 5.5 term by term, we obtain the following relationship:

Var[y^{(t)}] = \phi\, b''(\beta^T x^{(t)})

Moreover, because f(η) = b'(η), we have

Var[y^{(t)}] = \phi\, f'(\beta^T x^{(t)})    (5.6)
We now assemble the various pieces. First note that equation 5.6 can be utilized to express the Hessian (equation 5.5) in the following form:

\frac{\partial^2 l}{\partial \beta \partial \beta^T} = -\sum_t f'(\beta^T x^{(t)})\, x^{(t)} (x^{(t)})^T / \phi
⁹Whether or not the canonical link is used, the results presented in the remainder of this section are correct for the Fisher scoring method. If noncanonical links are used, then Newton-Raphson will include additional terms (terms that vanish under the expectation operator).
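Assembling these pieces, the IRLS iteration for a Bernoulli GLIM with optional observation weights can be sketched as follows (assuming NumPy; the damping term and stopping rule are our own choices, and with the canonical link the Fisher scoring step coincides with Newton-Raphson):

```python
# IRLS for a Bernoulli GLIM (logistic regression) with observation weights.
import numpy as np

def irls_bernoulli(X, y, obs_weights=None, max_iter=25, tol=1e-8):
    """X: (N, m) inputs; y: (N,) targets in [0, 1]; returns beta: (m,)."""
    N, m = X.shape
    w_obs = np.ones(N) if obs_weights is None else obs_weights
    beta = np.zeros(m)
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # canonical (logistic) link
        grad = X.T @ (w_obs * (y - mu))      # gradient of the log likelihood
        W = w_obs * mu * (1.0 - mu)          # f'(eta), the IRLS weights
        H = X.T @ (X * W[:, None])           # expected information matrix
        step = np.linalg.solve(H + 1e-10 * np.eye(m), grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```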
Finally, note that equation 6.2 implies that b(η) must be defined as follows (cf. equation 5.2):

b(\eta) = n \ln \sum_i e^{\eta_i}

which implies

\mu_i = \frac{\partial b}{\partial \eta_i} = n \frac{e^{\eta_i}}{\sum_k e^{\eta_k}}    (6.5)

The fitting of a multinomial logit model proceeds by IRLS as described in Appendix A, using equations 6.4 and 6.5 for the link function and the mean, respectively.
Acknowledgments

We want to thank Geoffrey Hinton, Tony Robinson, Mitsuo Kawato, Carlotta Domeniconi, and Daniel Wolpert for helpful comments on the manuscript. This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by Grant IRI-9013991 from the National Science Foundation, and by Grant N00014-90-J-1942 from the Office of Naval Research. The project was also supported by NSF Grant ASC-9217041 in support of the Center for Biological and Computational Learning at MIT, including funds provided by DARPA under the HPCC program, and NSF Grant ECS-9216531 to support an Initiative in Intelligent Control at MIT. Michael I. Jordan is an NSF Presidential Young Investigator.
References Bourlard, H., and Kamp, Y. 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59, 291-294.
Breiman, L., Friedman, J. H., Olshen, R. A,, and Stone, C. J. 1984. CIassificatiorz and Regression Trees. Wadsworth International Group, Belmont, CA. Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures, and Applications, F. Fogelman-Soulie and J. Herault, eds. Springer-Verlag, New York. Buntine, W. 1991. Learning classification trees. NASA ,\mes Tech. Rep. FIA-9012-19-01, Moffett Field, CA. Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. 1988. Autoclass: A Bayesian classification system. In Proceedings of the Fifth Znternational Conference on Machine Learning, Ann Arbor, MI. Cox, a. R. 1970. The Analysis of Binary Data. Chapman-Hall, London.
Cox, D. R., and Hinkley, D. V. 1974. Theoretical Statistics. Chapman-Hall, London.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Finney, D. J. 1973. Statistical Methods in Biological Assay. Hafner, New York.
Friedman, J. H. 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-141.
Fun, W., and Jordan, M. I. 1993. The Moving Basin: Effective Action Search in Forward Models. MIT Computational Cognitive Science Tech. Rep. 9205, Cambridge, MA.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-52.
Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.
Hastie, T. J., and Tibshirani, R. J. 1990. Generalized Additive Models. Chapman and Hall, London.
Haykin, S. 1991. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, pp. 282-317. MIT Press, Cambridge, MA.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Jordan, M. I., and Jacobs, R. A. 1992. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 985-993. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Xu, L. 1993. Convergence Properties of the EM Approach to Learning in Mixture-of-Experts Architectures. Computational Cognitive Science Tech. Rep. 9301, MIT, Cambridge, MA.
Little, R. J. A., and Rubin, D. B. 1987. Statistical Analysis with Missing Data. John Wiley, New York.
Ljung, L., and Soderstrom, T. 1986. Theory and Practice of Recursive Identification. MIT Press, Cambridge.
McCullagh, P., and Nelder, J. A. 1983. Generalized Linear Models. Chapman and Hall, London.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Murthy, S. K., Kasif, S., and Salzberg, S. 1993. OC1: A Randomized Algorithm for Building Oblique Decision Trees. Tech. Rep., Department of Computer Science, The Johns Hopkins University.
Nowlan, S. J. 1990. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1991. Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. Tech. Rep. CMU-CS-91-126, CMU, Pittsburgh, PA.
Quandt, R. E., and Ramsey, J. B. 1972. A new approach to estimating switching regressions. J. Am. Statist. Soc. 67, 306-310.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learn. 1, 81-106.
Quinlan, J. R., and Rivest, R. L. 1989. Inferring decision trees using the Minimum Description Length Principle. Information and Computation 80, 227-248.
Redner, R. A., and Walker, H. F. 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26, 195-239.
Sanger, T. D. 1991. A tree-structured adaptive network for function approximation in high dimensional spaces. IEEE Transact. Neural Networks 2, 285-293.
Scott, D. W. 1992. Multivariate Density Estimation. John Wiley, New York.
Specht, D. F. 1991. A general regression neural network. IEEE Transact. Neural Networks 2, 568-576.
Stone, C. J. 1977. Consistent nonparametric regression. Ann. Statist. 5, 595-645.
Stromberg, J. E., Zrida, J., and Isaksson, A. 1991. Neural trees--using neural nets in a tree classifier structure. IEEE International Conference on Acoustics, Speech and Signal Processing, 137-140.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. 1985. Statistical Analysis of Finite Mixture Distributions. John Wiley, New York.
Utgoff, P. E., and Brodley, C. E. 1990. An incremental method for finding multivariate splits for decision trees. In Proceedings of the Seventh International Conference on Machine Learning, Los Altos, CA.
Wahba, G., Gu, C., Wang, Y., and Chappell, R. 1993. Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance. Tech. Rep. 899, Department of Statistics, University of Wisconsin, Madison.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. Ann. Statist. 11, 95-103.
Received February 22, 1993; accepted July 9, 1993.
NOTE
Communicated by Richard Sutton
TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play
Gerald Tesauro
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
TD-Gammon is a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results, based on the TD(λ) reinforcement learning algorithm (Sutton 1988). Despite starting from random initial weights (and hence random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a "raw" description of the board state), the network learns to play at a strong intermediate level. Furthermore, when a set of hand-crafted features is added to the network's input representation, the result is a truly staggering level of performance: the latest version of TD-Gammon is now estimated to play at a strong master level that is extremely close to the world's best human players.
Reinforcement learning is a fascinating and challenging alternative to the more standard approach to training neural networks by supervised learning. Instead of training on a "teacher signal" indicating the correct output for every input, reinforcement learning provides less information to work with: the learner is given only a "reward" or "reinforcement" signal indicating the quality of output. In many cases the reward is also delayed, that is, it is given at the end of a long sequence of inputs and outputs. In contrast to the numerous practical successes of supervised learning, there have been relatively few successful applications of reinforcement learning to complex real-world problems. This paper presents a case study in which the TD(λ) reinforcement learning algorithm (Sutton 1988) was applied to training a multilayer neural network on a complex task: learning strategies for the game of backgammon. This is an attractive test problem due to its considerable complexity and stochastic nature. It is also possible to make a detailed comparison of TD learning with the alternative approach of supervised training on human expert examples; this was the approach used in the development of Neurogammon, a program that convincingly won the backgammon championship at the 1989 International Computer Olympiad (Tesauro 1989).
Details of the TD backgammon learning system are described elsewhere (Tesauro 1992). In brief, the network observes a sequence of board positions x1, x2, ..., xf leading to a final reward signal z determined by the outcome of the game. (These games were played without doubling; thus the network did not learn anything about doubling strategy.) The sequences of positions were generated using the network's predictions as an evaluation function: the move selected at each time step was the move that maximized the network's estimate of expected outcome. Thus the network learned based on the outcome of self-play (a schematic sketch of this loop is given below, just before Table 1). This procedure of letting the network learn from its own play was used even at the very start of learning, when the network's initial weights are random, and hence its initial strategy is a random strategy. From an a priori point of view, this methodology appeared unlikely to produce any sensible learning, because a random strategy is exceedingly bad, and because the games end up taking an incredibly long time: with random play on both sides, games often last several hundred or even several thousand time steps. In contrast, in normal human play games usually last on the order of 50-60 time steps.

Preliminary experiments used an input representation scheme that encoded only the raw board information (the number of white or black checkers at each location), and did not utilize any additional precomputed features relevant to good play, such as, for example, the strength of a blockade or the probability of being hit. These experiments were completely knowledge-free in that there was no initial knowledge built in about how to play good backgammon. In subsequent experiments, a set of hand-crafted features was added to the representation, resulting in higher overall performance. This feature set was the same set that was included in Neurogammon.

The rather surprising result, after tens of thousands of training games, was that a significant amount of learning actually took place, even in the zero initial knowledge experiments. These networks achieved a strong intermediate level of play approximately equal to that of Neurogammon. The networks with hand-crafted features have greatly surpassed Neurogammon and all other previous computer programs, and have continued to improve with more and more games of training experience. The best of these networks is now estimated to play at a strong master level that is extremely close to equaling world-class human play. This has been demonstrated in numerous tests of TD-Gammon in play against several world-class human grandmasters, including Bill Robertie and Paul Magriel, both noted authors and highly respected former World Champions. For the tests against humans, a heuristic doubling algorithm was added to the program that took TD-Gammon's equity estimates as input, and tried to apply somewhat classical formulas developed in the 1970s (Zadeh and Kobliska 1977) to determine proper doubling actions. Results of testing are summarized in Table 1.
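To make the self-play training loop concrete, here is a minimal sketch in Python (our illustration, not Tesauro's code; `net`, `initial_position`, `legal_moves`, `game_over`, and `reward` are hypothetical placeholders, with dice handling and side-to-move bookkeeping folded into `legal_moves` for brevity):

```python
import numpy as np

def td_lambda_selfplay_game(net, lam=0.7, alpha=0.1):
    """One self-play training game with online TD(lambda) updates.

    `net` is assumed to expose predict(x) -> scalar value estimate,
    gradient(x) -> list of arrays dV/dw, and a list `weights`.
    """
    x = initial_position()                               # hypothetical helper
    traces = [np.zeros_like(w) for w in net.weights]     # eligibility traces
    while True:
        # Greedy move choice: maximize the network's estimate of outcome.
        x_next = max(legal_moves(x), key=net.predict)    # hypothetical helper
        over = game_over(x_next)                         # hypothetical helper
        # TD target: the final reward z at the end of the game, otherwise
        # the network's own prediction for the successor position.
        target = reward(x_next) if over else net.predict(x_next)
        delta = target - net.predict(x)                  # temporal-difference error
        for tr, g, w in zip(traces, net.gradient(x), net.weights):
            tr *= lam                                    # decay eligibility traces
            tr += g                                      # accumulate current gradient
            w += alpha * delta * tr                      # TD(lambda) weight update
        if over:
            break
        x = x_next
```

Repeated over hundreds of thousands of games, a loop of this shape is the entire training signal; no expert-labeled positions are used.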
Table 1: Results of Testing TD-Gammon in Play against World-Class Human Opponents.*

Program         Training games   Opponents                  Results
TD-Gammon 1.0   300,000          Robertie, Davis,           -13 pts/51 games (-0.25 ppg)
                                 Magriel
TD-Gammon 2.0   800,000          Goulding, Woolsey,         -7 pts/38 games (-0.18 ppg)
                                 Snellings, Russell,
                                 Sylvester
TD-Gammon 2.1   1,500,000        Robertie                   -1 pt/40 games (-0.02 ppg)

*Version 1.0 used 1-ply search for move selection; versions 2.0 and 2.1 used 2-ply search. Version 2.0 had 40 hidden units; versions 1.0 and 2.1 had 80 hidden units.
TD-Gammon 1.0, which had a total training experience of 300,000 games, lost a total of 13 points in 51 games against Robertie, Magriel, and Malcolm Davis, the 11th highest rated player in the world in 1991. TD-Gammon 2.0, which had 800,000 training games of experience and was publicly exhibited at the 1992 World Cup of Backgammon tournament, had a net loss of 7 points in 38 exhibition games against top players Kent Goulding, Kit Woolsey, Wilcox Snellings, former World Cup Champion Joe Sylvester, and former World Champion Joe Russell. The latest version of the program, version 2.1, had 1.5 million games of training experience and achieved near-parity to Bill Robertie in a recent 40-game test session: after trailing the entire session, Robertie managed to eke out a narrow one-point victory by the score of 40 to 39.

According to an article by Bill Robertie published in Inside Backgammon magazine (Robertie 1992), TD-Gammon's level of play is significantly better than that of any previous computer program. Robertie estimates that TD-Gammon 1.0 would lose on average in the range of 0.2 to 0.25 points per game against world-class human play. (This is consistent with the results of the 51-game sample.) This would be about equivalent to a decent advanced level of human play in local and regional open-division tournaments. In contrast, most commercial programs play at a weak intermediate level that loses well over one point per game against world-class humans. The best previous commercial program scored -0.66 points per game on this scale. The best previous program of any sort was Hans Berliner's BKG program, which in its only public appearance in 1979 won a short match against the World Champion at that time (Berliner 1980). BKG was about equivalent to a very strong intermediate or weak advanced player and would have scored in the range of -0.3 to -0.4 points per game.

Based on the latest 40-game sample, Robertie's overall assessment is that TD-Gammon 2.1 now plays at a strong master level that is extremely close to equaling the world's best human players. In fact, due to the
program's steadiness (it never gets tired or careless, as even the best of humans inevitably do), he thinks it would actually be the favorite against any human player in a long money-game session or in a grueling tournament format such as the World Cup competition. The only thing that prevents TD-Gammon from genuinely equaling world-class human play is that it still makes minor, practically inconsequential technical errors in its endgame play. One would expect these technical errors to cost the program on the order of 0.05 points per game against top humans. Robertie thinks that there are probably only two or three dozen players in the entire world who, at the top of their game, could expect to hold their own or have an advantage over the program. This means that TD-Gammon is now probably as good at backgammon as the grandmaster chess machine Deep Thought is at chess.

Interestingly enough, it is only in the last 5-10 years that human play has gotten good enough to rival TD-Gammon's current playing ability. If TD-Gammon had been developed 10 years ago, Robertie says, it would have easily been the best player in the world at that time. Even 5 years ago, there would have been only two or three players who could equal it.

The self-teaching reinforcement learning approach used in the development of TD-Gammon has greatly surpassed the supervised learning approach of Neurogammon, and has achieved a level of play considerably beyond any possible prior expectations. It has also demonstrated favorable empirical behavior of TD(λ), such as good scaling behavior, despite the lack of theoretical guarantees. Prospects for further improvement of TD-Gammon seem promising. Based on the observed scaling, training larger and larger networks with correspondingly more experience would probably result in even higher levels of performance. Additional improvements could come from modifications of the training procedure or the input representation scheme. Some combination of these factors could easily result in a version of TD-Gammon that would be the uncontested world's best backgammon player.

However, instead of merely pushing TD-Gammon to higher and higher levels of play, it now seems more worthwhile to extract the principles underlying the success of this application of TD learning, and to determine what kinds of other applications may also produce similar successes. Other possible applications might include financial trading strategies, military battlefield strategies, and control tasks such as robot motor control, navigation, and path planning.

At this point we are still largely ignorant as to why TD-Gammon is able to learn so well. One plausible conjecture is that the stochastic nature of the task is critical to the success of TD learning. One possibly very important effect of the stochastic dice rolls in backgammon is that during learning, they enforce a certain minimum amount of exploration of the state space. By stochastically forcing the system into regions of state space that the current evaluation function
tries to avoid, it is possible that improved evaluations and new strategies can be discovered.
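To make the temporal difference mechanism referred to above concrete, the following is a minimal sketch of a linear, undiscounted TD(λ) update in the style of Sutton (1988). TD-Gammon itself trains a multilayer network by backpropagating the TD error; the feature encoding, step size, and trace parameter below are illustrative assumptions, not the program's actual settings.

```python
import numpy as np

def td_lambda_episode(features, rewards, w, alpha=0.1, lam=0.7):
    """Apply linear TD(lambda) updates along one observed trajectory.

    features[t] is the feature vector of the state at step t;
    rewards[t] is the reward received on the transition t -> t+1.
    """
    e = np.zeros_like(w)                          # eligibility trace over weights
    for t in range(len(features) - 1):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + w @ x_next - w @ x   # TD error (no discounting)
        e = lam * e + x                           # decay old credit, add current gradient
        w = w + alpha * delta * e                 # nudge recently visited states toward target
    return w

# Toy usage on random data, just to show the call shape.
rng = np.random.default_rng(0)
w = td_lambda_episode(rng.normal(size=(20, 8)), rng.normal(size=19), np.zeros(8))
```

The credit-assignment structure is the same in the nonlinear case; only the gradient that feeds the eligibility trace changes.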
References

Berliner, H. 1980. Computer backgammon. Sci. Am. 243(1), 64-72.
Robertie, B. 1992. Carbon versus silicon: Matching wits with TD-Gammon. Inside Backgammon 2(2), 14-22.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Tesauro, G. 1989. Neurogammon wins Computer Olympiad. Neural Comp. 1, 321-323.
Tesauro, G. 1992. Practical issues in temporal difference learning. Machine Learn. 8, 257-277.
Zadeh, N., and Kobliska, G. 1977. On optimal doubling in backgammon. Manage. Sci. 23, 853-858.
Received April 19, 1993; accepted May 25, 1993.
NOTE
Communicated by Françoise Fogelman-Soulié
Correlated Attractors from Uncorrelated Stimuli

L. F. Cugliandolo
Dipartimento di Fisica, Università di Roma "La Sapienza," and INFN Sezione di Roma, I-00185 Rome, Italy

Analytic results are presented for a neural network model that was proposed to account for neurophysiological experiments in which the attractors corresponding to uncorrelated patterns are found to be correlated. Exact expressions for the associative attractors and their correlations are described.

1 Introduction
Griniasty et al. (1993) proposed a neural network model that describes phenomena observed in single unit recordings in the anterior ventral temporal cortex of performing monkeys (Miyashita 1988). The synaptic matrix treats the memorized patterns as a (cyclic) sequence and couples consecutive patterns with a strength a in addition to the autoassociative term. The network has correlated associative attractors even though the learned patterns are uncorrelated.

The notation in this paper follows Griniasty et al. (1993). The number of stored patterns p is finite. To solve the model without numerical approximations, we have used the Mathematica (Wolfram 1990) program. The relevant dynamics of the network when stimulated by a pure pattern is described by the iteration of the mean-field equations starting from a network state identical to the stimulating pattern. The overlap $m_\eta$ can be exactly computed at each time step and, interestingly enough, exact fixed points that correspond to the attractors can be found. Numerical simulations and the storage capacity analysis of this type of network are presented elsewhere (Cugliandolo and Tsodyks 1993).

2 The Model with ±1 Neurons

The synaptic matrix is, as in Griniasty et al. (1993),

$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^{p}\left[\xi_i^\mu \xi_j^\mu + a\left(\xi_i^\mu \xi_j^{\mu+1} + \xi_i^\mu \xi_j^{\mu-1}\right)\right] \qquad (2.1)$$

with the pattern index taken modulo p.
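The exact results below come from the mean-field equations; as a purely illustrative companion, here is a finite-N NumPy sketch of the same construction (network size, seed, and the rounded printout are assumptions, and finite-size fluctuations of order $1/\sqrt{N}$ are to be expected):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, a = 4000, 9, 0.7                      # assumed size and seed; a > 1/2, p odd
xi = rng.choice([-1.0, 1.0], size=(p, N))   # uncorrelated +/-1 patterns

# Synaptic matrix 2.1: autoassociative term plus strength-a coupling of each
# pattern to its cyclic neighbors in the sequence.
J = (xi.T @ xi
     + a * xi.T @ np.roll(xi, 1, axis=0)
     + a * xi.T @ np.roll(xi, -1, axis=0)) / N
np.fill_diagonal(J, 0.0)

s = xi[0].copy()                            # stimulate with pure pattern 0
for _ in range(10):                         # parallel zero-temperature updates
    s = np.where(J @ s >= 0, 1.0, -1.0)
m = xi @ s / N                              # overlap with each stored pattern
print(np.round(m, 2))                       # ~ (0.60, 0.40, 0.10, ...), symmetric in the sequence
```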
If a < 1/2 the network behaves as a Hopfield one [see Griniasty et al. (1993)].

Neural Computation 6, 220-224 (1994)
© 1994 Massachusetts Institute of Technology
If a > 1/2 and p < 11, two different behaviors appear depending on whether p is even or odd. If p is odd, after (p - 1)/2 steps the network relaxes to the fixed points

p = 3:  $m_\eta = \frac{1}{2}\,(1,1,1)$
p = 5:  $m_\eta = \frac{1}{8}\,(5,3,1,1,3)$
p = 7:  $m_\eta = \frac{1}{32}\,(19,13,3,1,1,3,13)$
p = 9:  $m_\eta = \frac{1}{128}\,(77,51,13,3,1,1,3,13,51)$

They can be represented in closed form (equations 2.2-2.5): the central component is $m_{(p+1)/2} = 1/2^{p-2}$, the component associated with the stimulating pattern satisfies $m_1 = m_2 + 2m_3$, and the remaining components are disposed symmetrically about the central one, $m_{\frac{p+1}{2}+\gamma} = m_{\frac{p+1}{2}-(\gamma-1)}$ for $\gamma = 1, \ldots, \frac{p-1}{2}$.

If p is even, two-cycle solutions given by analogous formulas (equations 2.6 and 2.7) are present; they appear because the dynamics is parallel.

When $p \ge 10$, the fixed points are reached after four iterations. The whole set of attractors is obtained by cyclic rotations of

$$m_\eta = \frac{1}{2^7}\,(77, 51, 13, 3, 1, 0, \ldots, 0, 1, 3, 13, 51) \qquad (2.8)$$
and they are universal; that is, the nonzero components depend on neither p nor a. Surprisingly, the overlap vector has exactly 9 components different from zero, symmetrically disposed around the component associated with the stimulating pattern. These attractors are mutually correlated, and the correlations depend only on the separation of the corresponding stimulating patterns in the
memorized sequence, $d = \alpha - \beta$ ($\alpha > \beta$), and $C_d = C_{p-d}$. If $p \ge n + 2 + d$ (n is the number of nonzero components of $m_\eta$, i.e., n = 9),

$C_1 = C_{p-1} = 1360/2^{11} \approx 0.66$
$C_2 = C_{p-2} = 680/2^{11} \approx 0.33$
$C_3 = C_{p-3} = 252/2^{11} \approx 0.12$
$C_4 = C_{p-4} = 82/2^{11} \approx 0.04$
$C_5 = C_{p-5} = 23/2^{11} \approx 0.01$ \qquad (2.9)
while more distant attractors are not significantly correlated. These values are independent of p and a, for $a \in (1/2, 1)$. Furthermore, if $p \ge 22$ and $10 < d < p - 10$, the attractors are not correlated at all: $C_d = 0$.

3 Network of 0,1 Neurons
The dynamics of the spike emission is

$$s_i(t + \delta t) = \begin{cases} 1 & \text{if } \sum_j J_{ij}\, s_j(t) - \theta > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (3.1)$$

with $J_{ij}$ as in Griniasty et al. (1993) and $P[\eta_i^\mu = 1] = f$. The critical $a_{cr}$, below which pure pattern states are stable, is given by equation 3.2.
The dependence of the attractors on a and p, once the probability f and the threshold $\theta$ are chosen, becomes more complicated. If $a > a_{cr}$ and p is small enough, the overlap vectors corresponding to the new fixed points have all components different from zero. Increasing p (a fixed), a "critical" $p_{cr}$ is reached after which the overlap vector has a relatively small number of nonzero components concentrated around the starting pattern, the rest being exactly zero. $p_{cr}$ increases in steps with a, and inside each subinterval there is universality. The values of a dividing the subintervals depend on f and $\theta$.

Some examples are in order. The ±1 neural activity model is recovered taking f = 1/2 and $\theta = 0$. Then $a_{cr} = 1/2$, the whole interval (1/2, 1) presents universality, $p_{cr} = 10$, n = 9, and the attractors and correlations are given by equations 2.8-2.9.

If, for instance, f = 1/10, a = 1/10, and $\theta = 0$, then $a > a_{cr}$. The fixed points are

p = 3:  $m_\eta = (1-f)^2\,(1,1,1)$
p ≥ 4:  $m_\eta = (1-f)^2\,(1,1,0,\ldots,0,1)$

Thus $p_{cr} = 4$ and n = 3.
Increasing a so as to move to another subinterval, the fixed points change. Taking a = 3/10 they are

p = 3:  $m_\eta = (1-f)^2\,(1,1,1)$
p = 4:  $m_\eta = (1-f)^3\,(1,1,1,1)$
p = 5:  $m_\eta = (1-f)^3\,(1+f,\,1+f,\,f,\,f,\,1+f)$
p ≥ 6:  $m_\eta = (1-f)^3\,(1+f,\,1+f,\,f,\,0,\ldots,0,\,f,\,1+f)$
Here $p_{cr} = 6$ and n = 5. In the first example, if $p \ge n + 2 + d = 5 + d$ the correlations are

$C_1 = C_{p-1} = 171/271 \approx 0.63$
$C_2 = C_{p-2} = 81/271 \approx 0.30$
$C_3 = \cdots = C_{p-3} = 0$
while in the second example, if $p \ge n + 2 + d = 7 + d$, they are

$C_1 = C_{p-1} = 206119/306119 \approx 0.673$
$C_2 = C_{p-2} = 106119/306119 \approx 0.347$
$C_3 = C_{p-3} = 16119/306119 \approx 0.053$
$C_4 = C_{p-4} = 279/306119 \approx 0.002$
$C_5 = \cdots = C_{p-5} = 0$
In both examples the correlations are independent of p. Their values decay with increasing separation and they are exactly zero beyond a small distance. In general, if $p > 2n + 3$ and $n + 1 < d < p - (n + 1)$, the correlations are independent of d and exactly zero. The attractors and correlations are similar to those given by the examples over the whole range of parameter values for which inequality 3.2 holds.

In conclusion, it has been shown that the model of Griniasty et al. (1993), both for ±1 and 0,1 neurons, has attractors with a finite similarity index (the components of the overlap) with a small number of the nearest stimuli in the learning sequence (5 if a > 1/2 for the ±1 neurons; a function of a and $\theta$, at fixed f, varying in steps, for the 0,1 neurons).
Acknowledgments

We wish to thank D. J. Amit for stimulating and helpful discussions and J. Kurchan, E. Sivan, M. Tsodyks, and M. A. Virasoro for useful discussions.
References

Griniasty, M., Tsodyks, M. V., and Amit, D. J. 1993. Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Comp. 5, 1-17.
Miyashita, Y. 1988. Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature (London) 335, 817.
Wolfram, S. 1990. Mathematica. Computer program, Wolfram Research Inc.
Cugliandolo, L. F., and Tsodyks, M. V. 1993. Capacity of networks with correlated attractors. J. Phys. A, in press.
Received January 15, 1993; accepted July 27, 1993.
Communicated by Laurence Abbott
Learning of Phase Lags in Coupled Neural Oscillators

Bard Ermentrout
Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260 USA

Nancy Kopell
Department of Mathematics, Boston University, Boston, MA 02215 USA

If an oscillating neural circuit is forced by another such circuit via a composite signal, the phase lag induced by the forcing can be changed by changing the relative strengths of components of the coupling. We consider such circuits, with the forced and forcing oscillators receiving signals with some given phase lag. We show how such signals can be transformed into an algorithm that yields the connection strengths needed to produce that lag. The algorithm reduces the problem of producing a given phase lag to one of producing a kind of synchrony with a "teaching" signal; the algorithm can be interpreted as maximizing the correlation between voltages of a cell and the teaching signal. We apply these ideas to regulation of phase lags in chains of oscillators associated with undulatory locomotion.

1 Introduction

Networks of oscillatory neurons are known to control the rhythmic motor patterns in vertebrate locomotion (Cohen et al. 1988). These networks often consist of large numbers of coupled subnetworks that are themselves capable of oscillating. By connecting such oscillatory units, it is possible to produce a variety of patterns in which each unit maintains a fixed phase relationship with other units in the network (Ermentrout and Kopell 1994; Kiemel 1989; Schoner et al. 1990; Bay and Hemami 1987; Collins and Stewart 1992). The resulting pattern of phases determines the phase relations among the motor units that induce the locomotory activity. The various gaits of quadrupeds provide examples of such patterns, and have been theoretically explored by analyzing or simulating networks of four simple oscillators (Schoner et al. 1990; Bay and Hemami 1987; Collins and Stewart 1992). Other work on asynchronous behavior in networks of oscillators is in Tsang et al. (1991), Aronson et al. (1991), and Abbott and van Vreeswijk (1993).

Neural Computation 6, 225-241 (1994)
© 1994 Massachusetts Institute of Technology
The example that most strongly motivated the work of this paper is the undulatory swimming pattern of vertebrates such as lampreys and various species of fish. In such animals, the relevant network of neurons is believed to be a linear chain of subnetworks, each capable of autonomous oscillations (Cohen et al. 1982; Grillner 1981). In the lamprey, there is a traveling wave of one body length that is maintained over a broad range of swimming speeds, and for animals of many sizes. This traveling wave can be produced in vitro, using an isolated spinal cord for which activity is induced by the addition of an excitatory amino acid (D-glutamate); larger concentrations of the glutamate lead to higher frequencies of the local oscillators, but with the phase lag between fixed points along the cord unchanged (Wallen and Williams 1984).

One of the main questions for which a body of theory has been developed (Cohen et al. 1992) is what produces and regulates the phase lags between the successive oscillators so that the wavelength is maintained at the body length. It has been shown that local coupling (nearest neighbor or multiple near neighbor) is sufficient to produce traveling waves with constant phase lags (Kopell and Ermentrout 1986; Kopell et al. 1990). Furthermore, such systems are capable of regulating the phase lags (and hence the wavelength) under changes of swimming speed (Wallen and Williams 1984; Kopell and Ermentrout 1986). However, there is no mechanism in these systems to ensure that the regulated wavelength will be one body length.

To investigate how wavelengths equal to one body length might be produced, we studied chains of oscillators with long coupling fibers joining the ends of the chain to positions near the middle of the chain (Ermentrout and Kopell 1994). We showed that if the long-coupling fibers tended to produce antiphase relationships between the oscillators directly coupled, then the system with such long connections plus short coupling could produce a variety of wave-like patterns, including constant speed traveling waves with wavelength equal to the length of the chain (Ermentrout and Kopell 1994). This in itself was not an answer to the question of regulation in the adult animals, since it is known that short sections of the in vitro spinal cord can self-organize into traveling waves with essentially the same lag per unit length as larger sections of the in vitro cord. Such smaller segments may be considerably smaller than the half-body length connections required to produce the waves using the mechanism of Ermentrout and Kopell (1994).

In this paper, we show how the long fibers might still play an important role in the production of the traveling waves with correct wavelength. We discussed in Ermentrout and Kopell (1994) how long fibers might be involved in the production of a variety of patterns associated with movement during very early development of vertebrates. Here we suggest how the patterns produced by such connections can be used as teaching signals to allow the local connections to be tuned to produce appropriate phase lags in the absence of the long connections (as, for
example, in sections of the in vitro cord). In such a circuit, after tuning, the phase lags can be expected to be the same even for sections of the cord significantly shorter than half a body length.

The techniques used in this paper to achieve local connections sufficient by themselves to produce the waves make use of local circuits that are more complicated than a single cell. A central idea is to use a local circuit for the jth oscillator, in which the signal from the jth oscillator to the (j + 1)st is a composite of signals coming from more than one cell in the jth oscillator. Then, by adjusting the relative strengths of the components of the signal, we show in Section 2 that a range of phase lags between the oscillators can be attained. We shall be concerned with methods for achieving appropriate phase lags for coupling in one direction only [the "dominant" direction; see Kopell and Ermentrout (1986)]. If the local coupling is not reciprocal, the lag to be "learned" depends only on the two oscillators directly involved and not the behavior of the rest of the network; thus it suffices to consider a pair of oscillators and coupling from one to the other. The work of Section 2 shows that for appropriate values of the connection strengths, and within some parameter ranges, the desired lags can be produced.

Section 3 deals with the question of how the teaching signal, which specifies the desired lags, can be transformed into an algorithm that produces the appropriate connection strengths. The algorithm reduces the problem of producing an appropriate phase lag to one of producing a kind of synchrony between some component of the circuit and the teaching signal. Since the teaching signal need not have a wave form at all related to that of the voltage of a cell of the oscillator being taught, synchrony does not mean here total overlap in signal or that some thresholds for the teacher and for the oscillator being taught are reached simultaneously; our notion will be described in Section 3. We show that the algorithm is related to gradient ascent on the correlation of the voltages of the teaching signal and one of the cells of the oscillator. We give analysis and simulations to show that the algorithm works. In Section 4 we comment on the relationship of this paper to previous work.
2 Connection Strengths and Phase Lags
Consider a pair of limit cycle oscillators, the second forced by the first. If the unforced frequencies of the two are close, in general there is one-one locking with a phase lead or lag that varies with the forcing strength. Now assume that the forcing oscillator is composite, that is, that it has more than one component capable of sending a signal. Thus, within a cycle, the forced oscillator receives more than one signal. We can consider the strengths of these signals as independent. Each forcing signal alone, at a fixed strength, produces a characteristic lag. We show below that,
under robust conditions, a predetermined phase lag can be obtained by suitably varying the relative strengths of the components of the signal.

The learning algorithm we shall use in Section 3 works on composite oscillators, each component of which is described by a simple model of a neural oscillator. We first describe some behavior of composite oscillators in a simple phase model. We then present specific equations to be used in Section 3 and show that, at least in some parameter ranges, they behave like the general phase models.

If one limit cycle forces another, and the forcing is not too strong, the averaging method can be used to reduce the full equations to ones involving the interactions of phases, one for each oscillator (Ermentrout and Kopell 1991). Let $\theta_1$ and $\theta_2$ denote the phases of the two oscillators. To lowest order, the equation describing the forced system has the form

$$\theta_1' = \omega, \qquad \theta_2' = \omega + H(\theta_1 - \theta_2) \qquad (2.1)$$

where $\omega$ is the uncoupled frequency of oscillators 1 and 2, and H is a $2\pi$-periodic function of its argument. It is well known that oscillators 1 and 2 phaselock with a phase difference $\phi = \theta_1 - \theta_2$ satisfying $H(\phi) = 0$. The locking is stable if $dH/d\phi > 0$. If there is more than one component to the forcing, and the forcing is weak, the effects of the two components are additive to lowest order. Thus 2.1 may be written as

$$\theta_1' = \omega, \qquad \theta_2' = \omega + A\,H_\alpha(\theta_1 - \theta_2) + B\,H_\beta(\theta_1 - \theta_2) \qquad (2.2)$$

where we think of A as the variable strength of a particular component of the forcing, while $B\,H_\beta(\theta_1 - \theta_2)$ denotes all the other components of the forcing. As we show below, under fairly general conditions, the achievable phase lag between the forced and forcing oscillators includes some subset of the interval between the lags that would be produced by $H_\alpha$ or $H_\beta$ alone. A lag that can be produced by $H_\alpha$ (resp. $H_\beta$) alone is a value $\phi_\alpha$ (resp. $\phi_\beta$) of $\phi$ for which $H_\alpha(\phi) = 0$ [resp. $H_\beta(\phi) = 0$]; such a lag is stable providing that $dH_\alpha/d\phi(\phi_\alpha) > 0$ [resp. $dH_\beta/d\phi(\phi_\beta) > 0$]. We assume that there are such lags $\phi_\alpha$ and $\phi_\beta$, and that they are not equal; for definiteness, we may assume that $\phi_\alpha > \phi_\beta$, which will not affect the conclusions below. The subinterval in question, which we call $[\hat\phi_\beta, \hat\phi_\alpha]$, is one on all of which $dH_\alpha/d\phi > 0$ and $dH_\beta/d\phi > 0$ (see Fig. 1). (The functions in Figure 1 are qualitatively similar to those computed numerically from the more explicit equations to be given below. They show that, for those equations, there is such a subinterval, and that it is not all of $[\phi_\beta, \phi_\alpha]$.) In general, if there is such a subinterval, it is now easy to see that any lag $\hat\phi$ in that subinterval can be achieved stably by choosing the appropriate A and B. Namely, let $A = -B\,H_\beta(\hat\phi)/H_\alpha(\hat\phi)$ for B > 0.
Then $H(\hat\phi) = 0$, so $\hat\phi$ is a lag produced by 2.2. Note that A > 0, since the hypotheses on $dH_\alpha/d\phi$ and $dH_\beta/d\phi$ imply that $H_\alpha(\hat\phi) < 0$ and $H_\beta(\hat\phi) > 0$. Since the derivatives of each component are positive by hypothesis, this in turn implies that $dH/d\phi > 0$ at $\phi = \hat\phi$, so the lag is stable.

Figure 1: A qualitative sketch of parts of the numerically computed coupling functions $H_\alpha$ and $H_\beta$ associated to the full equations used in the simulations of Section 3.3. The attainable phase lags include those on the dark subinterval of the $\phi$ axis.
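The prescription for A is easy to check numerically. In the sketch below, the two coupling functions are made-up sinusoids standing in for the numerically computed $H_\alpha$ and $H_\beta$; their zeros, the target lag, and B are all illustrative assumptions chosen to satisfy the hypotheses above.

```python
import numpy as np
from scipy.optimize import brentq

H_alpha = lambda phi: np.sin(phi + 0.2)   # zero at phi_alpha = -0.2, increasing there
H_beta  = lambda phi: np.sin(phi + 1.0)   # zero at phi_beta  = -1.0, increasing there

B, target = 1.0, -0.6                     # desired lag inside (phi_beta, phi_alpha)
A = -B * H_beta(target) / H_alpha(target) # choice from the text; comes out positive

H = lambda phi: A * H_alpha(phi) + B * H_beta(phi)
locked = brentq(H, -1.0, -0.2)            # zero of the composite coupling function
print(A, locked)                          # locked phase difference equals the target lag
```

With these choices $H_\alpha(\hat\phi) < 0$ and $H_\beta(\hat\phi) > 0$, so the computed A is positive and the composite zero is stable, matching the argument above.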
We now consider the equations that we shall use for the learning algorithm. For simplicity of exposition, we restrict ourselves to two-dimensional equations such as those in Morris and Lecar (1981), though the ideas can be generalized. Each of the components of each of the two oscillators is then described by equations of the form

$$V' = F(V, n) \equiv g_L(V_L - V) + g_K\,n\,(V_K - V) + g_{Na}\,m_\infty(V)(V_{Na} - V) + I_{app}$$
$$n' = G(V, n) \equiv \lambda(V)\,[n_\infty(V) - n] \qquad (2.3)$$

Here, V is the voltage and n is a generalized recovery variable. The terms on the right-hand side of the first equation denote the ionic currents. Within a given oscillator, these components are coupled through the voltages by excitatory and/or inhibitory synapses, that is, terms of the form $z\,g(V_{pre})[V_R - V]$, where V is the voltage of the component receiving the synaptic current, $V_{pre}$ is the voltage of the presynaptic component, $V_R$ is the reversal potential of the synapse, and z is the strength of the interaction. The components need not be the same or even similar. The essential requirement for each of the two "oscillators" is that, when its components are coupled, the system has a stable limit cycle.
Figure 2: Model architecture for learning a phase lag between a pair of oscillators, the first forcing the second. Each of the two oscillators has two components, an excitatory cell and an inhibitory cell. The connections between the components are fixed. In addition, there are synapses from both components of oscillator 1 to the excitatory cell of oscillator 2. Of these, the connection from $I_1$ to $E_2$ is held fixed, and the connection from $E_1$ to $E_2$ is varied in order to produce the desired lag.

We assume that the effects of the forcing signals are additive to lowest order. (This is automatically true if the forcing is weak.) Thus, some or all of the component cells of the forced oscillator have a sum of synaptic currents added to the voltage equation, with the currents gated by components of the forcing oscillators. The coupling terms within a given oscillator are assumed to be stronger than the forcing between the oscillators.

An explicit example is given by the architecture in Figure 2. In this architecture, each circuit has two cells, one excitatory (E) and one inhibitory (I). It is well known that two such cells can be connected to construct a local oscillating unit (Wilson and Cowan 1973). Let $V_{je}$ and $V_{ji}$ denote the voltages of the excitatory and inhibitory cells of the jth circuit and $n_{je}$, $n_{ji}$ the corresponding recovery variables. Then the equations for the jth circuit are given by
$$V_{je}' = F(V_{je}, n_{je}) + w_i\,g_i(V_{ji})[V_K - V_{je}]$$
$$n_{je}' = G(V_{je}, n_{je})$$
$$V_{ji}' = F(V_{ji}, n_{ji}) + w_e\,g_e(V_{je})[V_{Na} - V_{ji}]$$
$$n_{ji}' = G(V_{ji}, n_{ji}) \qquad (2.4)$$
Here $g_e$ and $g_i$ are the usual sigmoidal gating functions and $w_i$, $w_e$ are the coupling strengths of the units within an oscillator. We have
Figure 3: The numerically computed composite coupling function for three different values of $z_e$ and fixed $z_i$. Note that changing the value of $z_e$ changes the zero of the composite coupling function.
chosen the reversal potentials of the synapses to be $V_K$ and $V_{Na}$, but any choices suffice provided that the inhibitory synapse acts to hyperpolarize the cell and the excitatory synapse acts to depolarize the cell. We allow circuit 1 to force circuit 2 by adding to the $V_{2e}$ equation inputs from $V_{1e}$ and $V_{1i}$. The equation for $V_{2e}$ is thus

$$V_{2e}' = F(V_{2e}, n_{2e}) + w_i\,g_i(V_{2i})[V_K - V_{2e}] + z_e\,g_e(V_{1e})[V_{Na} - V_{2e}] + z_i\,g_i(V_{1i})[V_K - V_{2e}] \qquad (2.5)$$
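For readers who want to see equations 2.3-2.4 in executable form, the following sketch integrates a single E-I circuit using the parameter values quoted later in Section 3.3. The initial conditions and integration settings are arbitrary assumptions, and the printed voltage range is only a crude check for oscillation.

```python
import numpy as np
from scipy.integrate import solve_ivp

gL, gK, gNa = 0.5, 2.0, 1.33
VL, VK, VNa, Iapp = -0.5, -0.7, 1.0, 0.125
minf = lambda V: 0.5 * (1 + np.tanh((V + 0.01) / 0.15))
ninf = lambda V: 0.5 * (1 + np.tanh(V / 0.1))
lam  = lambda V: 0.33 * np.cosh(V / 0.2)

def F(V, n):   # ionic currents of one Morris-Lecar-type component (equation 2.3)
    return gL * (VL - V) + gK * n * (VK - V) + gNa * minf(V) * (VNa - V) + Iapp
def G(V, n):
    return lam(V) * (ninf(V) - n)

wi, we = 0.5, 0.5
ge, gi = minf, ninf            # gating functions, as in Section 3.3

def circuit(t, y):             # one E-I pair (equation 2.4)
    Ve, ne, Vi, ni = y
    dVe = F(Ve, ne) + wi * gi(Vi) * (VK - Ve)    # inhibition onto E
    dVi = F(Vi, ni) + we * ge(Ve) * (VNa - Vi)   # excitation onto I
    return [dVe, G(Ve, ne), dVi, G(Vi, ni)]

sol = solve_ivp(circuit, (0.0, 100.0), [0.1, 0.1, -0.3, 0.1], max_step=0.05)
print(sol.y[0].min(), sol.y[0].max())  # a wide spread suggests a limit cycle
```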
If the forcing is not too large, the explicit equations above can be reduced to those of the form 2.2, where $H_\alpha(\phi)$ comes from the excitatory coupling, $H_\beta(\phi)$ comes from the inhibitory coupling, and A, B are $z_e$, $z_i$. Figure 3 shows the zeros of the composite function resulting from the averaging of 2.5 for various strengths of the excitatory coupling $z_e$. As $z_e$ increases, the lag moves closer to $\phi = 0$. The numerical computations of the functions of Figure 3 were done using PhasePlane and numerical procedures given in Ermentrout and Kopell (1991), with parameter values given in Section 3.3.
3 Learning Synchrony to Learn a Phase Lag

3.1 An Architecture for Learning Synchrony. In this section, we discuss an architecture that reduces the problem of learning a specified phase lag between two oscillators in a chain to the problem of learning connections that can produce an averaged version of synchrony between two signals.

We start with some interconnected local circuit that is producing the traveling waves. We now suppose that another circuit is formed at each locus, capable of oscillation. (We have in mind that the first circuit may be very crude: anything capable of producing oscillations at an appropriate frequency. The second one, to be retained through adult life, may be more complicated and subject to functional constraints.) We shall refer to the former as the teacher circuits and the latter as the student circuits. For definiteness, we shall use the circuit example of Section 2 for the local student circuits. However, this is merely the minimal complexity needed to carry out the scheme; local circuits with more components can work as well.

We envision the process of tuning the local connections between the local student circuits as starting at one end of the chain and proceeding one circuit at a time. The sequential changes in anatomical structures are consistent with other developmental changes that happen sequentially, starting at the head end (Bekoff 1985). Since the process is the same at each stage, we restrict ourselves to describing only the change in coupling between the first and second local circuits.

We assume that the local teacher circuit produces the same signal at each site, but with a time lag of $\zeta$ from site to site. This signal is oscillatory, with the same period P as that of the student circuits. There is no direct connection between the teaching circuit and the local circuits. They are assumed to be physically close enough to allow some unspecified process to have access to information from both, but the teacher does not interfere with the outcome of the coupling from circuit j to circuit j + 1 (see Fig. 4). Our problem is to change the weights of the connections from $E_1$ and $I_1$ to $E_2$ so that the phase lag induced from circuit 1 to circuit 2 is exactly the phase lag $\zeta/P$ of the teaching circuit. We assume that the connections are such that the lag $\zeta/P$ lies between the lags induced by each of the connections alone.

When the learning of the connections between student circuit 1 and student circuit 2 has been accomplished, the learning between student circuits 2 and 3 can begin. Thus, the learning progresses along the chain, one circuit at a time. As will be shown below, a lack of complete synchrony between teacher and student in circuit 1 leads to an error in the learned phase lag between circuit 1 and circuit 2. However, that error does not propagate down the chain.

3.2 Algorithms for Learning Synchrony. The essential idea is to change one of the connection strengths until there is "synchrony" between the signals of teaching circuit 2 and $E_2$.
Figure 4: Schematic diagram of the teacher and student circuits. For each j, the teacher and student circuit are assumed to be physically close, so that some process may have access to information from both. The teaching circuit does not directly influence the student.

Since these signals need not have even similar wave forms, we must first describe what we mean by synchrony. We will say that the signals are synchronous if
$$\int_0^P V_j'(t)\,T_j(t)\,dt = 0 \qquad (3.1)$$
Here P is the common period of the local circuits, $V_j$ is the membrane potential of the E cell of the jth student circuit, and $T_j$ is the pulse-like signal from the jth teaching circuit (see Fig. 5A). The integral averages the current from the E cell against the voltage from the teacher circuit. This contrasts with some formulations of Hebbian rules in which what matters is the voltage of the two cells being compared (Amari 1977).
Remark: Note that if the signals $V_j(t)$ and $T_j(t)$ happen to have the same wave form up to a phase shift, then the integral vanishes if the phase shift is zero, that is, if there is exact synchrony. (This follows from the periodicity of V and T and the fundamental theorem of calculus.) Suppose instead that the teaching signal is more pulsatile (as in Fig. 5A). Then if that signal occurs when the postsynaptic potential is decreasing, as in Figure 5B, the integral is negative; if it occurs when the postsynaptic signal is increasing, the integral is positive.
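To spell out the zero-shift case (taking $T_j = V_j$, as the remark assumes), the integral vanishes by a one-line computation:

$$\int_0^P V_j'(t)\,V_j(t)\,dt \;=\; \frac{1}{2}\Big[V_j(t)^2\Big]_0^P \;=\; 0,$$

since $V_j(0) = V_j(P)$ by periodicity; for a nonzero shift the same integral picks up a sign determined by whether the pulse arrives on the rising or falling phase of $V_j$.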
Figure 5: (A) Voltage vs. time for signal from teacher circuit (dashed curve) and output of the E cell (solid curve) when these two signals are synchronous (in the sense of the text). (B) The voltage signal of the E cell (solid curve) lagging the teaching signal (dashed curve).
We now specify an algorithm to change $z_e$ so that the $E_2$ cell will synchronize (in the above sense) with the teaching signal at teacher circuit 2. (Recall that $z_e$ is the strength of the connection from the E cell of circuit 1 to the E cell of circuit 2.) It follows from the fact that $\phi_\beta < \phi_\alpha$ and the monotonicity of $H_\alpha$ and $H_\beta$ that increasing $z_e$ increases the resulting value of $\phi = \theta_1 - \theta_2$ and therefore shifts the curve $V_{2e}(t)$ to the right. In the explicit case in which we give the computations, the attainable values of $\phi$ are all negative, that is, the forced oscillator leads the forcer.
In this case, increasing $z_e$ then decreases the phase lead of oscillator 2 over oscillator 1. We shall fix the other connection strength $z_i$. The teaching signal at circuit 2 is the same as the signal at circuit 1, with a time lag of $\zeta$ and thus a phase lag of $\zeta/P$, so $T_2(t) = T_1(t - \zeta)$. The learning algorithm is
$$z_e' = -\epsilon\,V_{2e}'(t)\,T_1(t - \zeta) \qquad (3.2)$$
where $\epsilon \ll 1$, so the learning process is slow compared with the cycle time. It is not explicit in 3.2 how the value of $z_e$ affects the right-hand side of that equation. For this we note that the averaging theorem implies that, to lowest order in $z_e$, the dependence occurs as a phase shift $\xi(z_e)$, where $\xi'(z_e) > 0$. That is, $V_{2e}(t) = \bar V_{2e}[t - \xi(z_e)]$, where $\bar V_{2e}(t)$ is the signal, after decay of transients, if $z_e = 0$. The algorithm 3.2 changes $z_e$ until an appropriate shift has been performed; at that point, $z_e' = 0$, on the average over a cycle.

In principle, $z_e$ tends to an oscillatory function of t, rather than a number. However, in some circumstances, the magnitude of this oscillation may be very small. For example, suppose that the teaching signal is quite pulsatile. Then the right-hand side of 3.2 is close to zero except during the time interval during which the signal is on. Near or at synchrony, the value of $V'$ over that interval is also small. Thus the multiple of $\epsilon$ in 3.2, which is a product of $V'$ and the teaching signal, stays small over the entire cycle.

The correlation between the signals $\bar V_{2e}[t - \xi(z_e)]$ and $T_1(t - \zeta)$ is given by

$$C(z_e) = \int_0^P \bar V_{2e}[t - \xi(z_e)]\,T_1(t - \zeta)\,dt \qquad (3.3)$$

The algorithm that changes $z_e$ in order to maximize 3.3 is

$$z_e' = \epsilon\,\frac{dC}{dz_e} \qquad (3.4)$$
Since $d\xi/dz_e > 0$, this changes $z_e$ in the same direction as 3.2, and has the same stable state. The advantage of 3.2 over 3.4 is that 3.2 is local in time, and could conceivably be performed by chemical processes monitoring the cross-membrane current of $E_2$ and the voltage of the teaching signal. Algorithm 3.2 converges because the simple gradient ascent of 3.4 does. The algorithm is effective even if the frequencies of the student circuits differ somewhat from the phaselocked teacher circuit and from each other. In that case, different values of $z_e$ for each pair are needed to match the desired lag, but the process produces the same lag $\zeta$ between each pair.
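A minimal numerical sketch of rule 3.2 follows. Here the student voltage is modeled as a sinusoid whose phase shift grows linearly with $z_e$, a stand-in assumption for the averaging-theorem relation $V_{2e}(t) = \bar V_{2e}[t - \xi(z_e)]$; the teaching signal uses the pulse shape of Section 3.3, and the particular numbers are illustrative.

```python
import numpy as np

P, eps, zeta = 8.15, 0.02, -0.5        # period, learning rate, teacher time lag
t = np.linspace(0.0, P, 400, endpoint=False)
dt = t[1] - t[0]

T = lambda tt: 0.1 * (1 + np.tanh(30 * np.sin(2 * np.pi * tt / P) - 1))
xi = lambda ze: 1.5 * ze               # assumed monotone shift, xi'(z_e) > 0

z_e = 0.0
for _ in range(200):                   # learning is slow relative to the cycle
    V = np.cos(2 * np.pi * (t - xi(z_e)) / P)
    dV = np.gradient(V, dt)            # membrane "current" entering rule 3.2
    z_e += -eps * np.sum(dV * T(t - zeta)) * dt   # rule 3.2, accumulated over one cycle
print(z_e, xi(z_e))                    # settles where the synchrony integral 3.1 vanishes
```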
Remark: For weak coupling, the algorithm produces a phase lag that is independent of the frequency of the oscillating circuit. This is seen most easily from the phase equations to which the full equations reduce. Thus, the resulting system shares the property of the lamprey cord that the phase lags remain unchanged when the frequency is changed uniformly along the cord.

3.3 Numerics. The learning algorithm was simulated with the circuits as shown in Figure 4. The parameters in 2.3 specifying the components of one of the circuits are $g_L = 0.5$, $g_K = 2$, $g_{Na} = 1.33$, $V_L = -0.5$, $V_K = -0.7$, $V_{Na} = 1$, $I_{app} = 0.125$, $\lambda(V) = 0.33\cosh(V/0.2)$. The functions $m_\infty(V)$ and $n_\infty(V)$ are given by $m_\infty(V) = 0.5\{1 + \tanh[(V + 0.01)/0.15]\}$, $n_\infty(V) = 0.5[1 + \tanh(V/0.1)]$. The synapses between the components of a single circuit give a synaptic current to the E cell of the form $w_i\,g_i(V_{ji})[V_K - V]$ and a current to the I cell of the form $w_e\,g_e(V_{je})[V_{Na} - V]$. Here $w_i = 0.5$, $w_e = 0.5$, and the gating functions are $g_e(V) = m_\infty(V)$, $g_i(V) = n_\infty(V)$. The synapses connecting the two circuits go from the I and E cells of circuit 1 to the E cell of circuit 2. The synaptic currents are as in 2.5, namely $z_i\,g_i(V_{1i})[V_K - V] + z_e\,g_e(V_{1e})[V_{Na} - V]$. Here $z_i$ is fixed at 0.2, $z_e$ is a variable, and $g_e$ and $g_i$ are as above. Figure 3 shows the computed composite H functions for three different values of $z_e$, showing how the relevant zero of H changes.

The teaching signal is given by $T(t) = a_f\{1 + \tanh[s\,\sin(2\pi t/P) - \theta]\}$. Here the strength is set at $a_f = 0.1$, the sharpness of the pulse is determined by s = 30, the period P = 8.15 is set to be the period of the student circuit oscillations, and the threshold is set at $\theta = 1$. The phase lags $\zeta/P$ to be learned were set at several values between -0.03 and -0.1 to show that a range of phase lags could be learned. The teaching algorithm was given in 3.2. The only parameter to be specified there is $\epsilon$, which was taken to be 0.02. For each of the values of the phase lag, the algorithm converged.

We note that the averaging theorem, which we relied on for heuristic understanding, guarantees this method will work for "sufficiently small" values of all the coupling variables. The values used in the numerics may not be small enough to be within the range of this theorem. Nevertheless, as in many other circumstances involving asymptotics, the numerics (done with PhasePlane) show that the method appears to work quite well.

To show that this method works for a chain of oscillators, we carried out the procedure for a chain of five composite oscillators. The coefficients $z_e$ from oscillator j to oscillator j + 1 were allowed to vary one at a time according to the training equation 3.2; when each $z_e$ reached equilibrium, then the next was run. Figure 6A shows the values of $z_e$ vs. t for each of the four connections, with the phase lag to be learned set at -0.06. Figure 6B shows the voltage vs. time for the five oscillators in the chain after the learning has been achieved and $z_e$ set to the equilibrium value of 3.2.
Figure 6: (A) The functions $z_e(t)$ for the connections between adjacent oscillators in the chain during training. The numbers correspond to the source oscillator of the connection; for example, 1 represents the connection from oscillator 1 to oscillator 2. The first student oscillator receives a direct pulse from the teacher, creating different training conditions than for the other pairs. All weights start at $z_e = 0$, and after the first pair, all converge to the same value. The time shown in the figure is t ≤ 2000, which corresponds to 245 cycles. The convergence occurs within 100 cycles. (B) The time course of $V_{je}(t)$ for each of the 5 oscillators in the chain after training. The numbers labeling the curves give the value of j. Note that the phase lag between oscillators 1 and 2 differs from the other lags, which are all the same.
In these simulations, a small pulse was added in student circuit 1 to lock teacher 1 to student 1. [This extra pulse was given by the addition of the term $0.01\,T(t)(1 - V_{1e})$ to the equation for $V_{1e}$.] For j > 1, there were no direct connections between student and teacher. The small pulse creates a change in the signals from student 1 to student 2, creating a small error in the first phase lag. This error does not propagate, and the other lags are the ones the network is intended to learn.

Remark: In the lamprey, which motivated this study, there is both tail-to-head and head-to-tail coupling. Various sets of experiments (Williams et al. 1990; Sigvardt et al. 1991) suggest via theory (Kopell et al. 1991; Kopell and Ermentrout 1990) that the tail-to-head coupling is the "dominant" one that sets the phase lag. In applying this work to the lamprey, we consider the direction of the forcing to be that of the dominant coupling. Since the observed wave is head to tail, the oscillator forcing its rostral neighbor (i.e., the neighbor in the head direction) lags this neighbor ($\zeta < 0$), as in the simulations. Continuing with this interpretation, the sequential changes in coupling along the chain in the above simulation start at the tail end and proceed rostrally.
4 Discussion
There have been many papers addressing the question of how to train networks to learn certain quantities. Most of this work concerns algorithms for adjusting connection strengths so as to learn a family of patterns encoded by digital representations (Rumelhart and McClelland 1986). By contrast, a phase lag is an analogue quantity, and the methods presented for digital patterns do not apply in an obvious way.

Recently, there has been some interest in mechanisms for training networks to oscillate, apparently motivated by synchronous cortical activity during visual tasks. For example, Doya and Yoshizawa (1989) use backpropagation to alter the weights in a hidden layer of cells to generate oscillations with a particular wave form. Other authors (Pearlmutter 1989; Williams and Zipser 1989) have developed similar algorithms.

Other work by Doya and Yoshizawa (1992) is more directly relevant to the present work. In the latter paper, the authors assume that the networks are oscillatory, as we do, and then attempt to connect them in such a way as to attain a given phase lag; thus, they are addressing the same problem as in this paper. The major difference between that paper and the current one is in the learning algorithm. In Doya and Yoshizawa (1992), the algorithm involves gradient descent on a function E(t) that measures the phase difference between the oscillators to be synchronized. A physical process that carries out that algorithm must be able to compute a weighted average of the output of cells and subtract that from the output of another
cell. By contrast, the learning rule proposed here requires only the calculation of a voltage and a membrane current, and really depends only on the sign of that product. Thus, it is easier to conceive of a physical process able to carry it out. Furthermore, the method could work equally well with any periodic signal produced by one of the cells of each student circuit; for example, a synaptic current can be used instead of a membrane current. This could produce a different phase between $T_i$ and the ith student circuit, but would produce the same phase between student circuits when carried out in a chain.

The mechanism proposed here has been implemented with membrane models based on gated currents and voltages. The mechanism could also be implemented with models in which the variables are firing rates. The essential idea is that the learning algorithm works on a variable associated with a cell in the circuit (such as the firing rate of that cell) and the time derivative of the analogous variable in the teaching signal. Any physical process capable of determining the sign of the product of those two quantities could be used to produce the learning.
Acknowledgments

We wish to thank T. LoFaro for his assistance in preparing the figures. Supported in part by NIMH Grant 47150 and NSF Grants DMS-9002028 (B.E.) and DMS-9200131 (N.K.).
References

Abbott, L. F., and van Vreeswijk, C. 1993. Asynchronous states in networks of pulse-coupled oscillators. Neural Comp. 5, 823-848.
Amari, S. I. 1977. Neural theory of association and concept formation. Biol. Cybern. 26, 175-185.
Aronson, D. G., Golubitsky, M., and Mallet-Paret, J. 1991. Ponies on a merry-go-round in large arrays of Josephson junctions. Nonlinearity 4, 903-910.
Bay, J. S., and Hemami, H. 1987. Modelling of a neural pattern generator with coupled nonlinear oscillators. IEEE Trans. Biomed. Eng. 4, 297-306.
Bekoff, A. 1985. Development of locomotion in vertebrates: A comparative perspective. In The Comparative Development of Adaptive Skills: Evolutionary Implications, E. S. Gallin, ed., pp. 57-94. Erlbaum, Hillsdale, NJ.
Cohen, A. H., Holmes, P. J., and Rand, R. H. 1982. The nature of the coupling between segmental oscillators of the lamprey spinal generator for locomotion: A mathematical model. J. Math. Biol. 13, 345-369.
Cohen, A. H., Rossignol, S., and Grillner, S. 1988. Neural Control of Rhythmic Movements in Vertebrates. John Wiley, New York.
Cohen, A. H., Ermentrout, G. B., Kiemel, T., Kopell, N., Sigvardt, K. A., and Williams, T. L. 1992. Modelling of intersegmental coordination in the lamprey central pattern generator for locomotion. Trends Neurosci. 15, 434-438.
Collins, J. J., and Stewart, I. N. 1992. Symmetry-breaking bifurcation: A possible mechanism for 2:1 frequency locking in animal locomotion. J. Math. Biol. 30, 827-838.
Doya, K., and Yoshizawa, S. 1989. Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks 2, 375-386.
Doya, K., and Yoshizawa, S. 1992. Adaptive synchronization of neural and physical oscillators. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Ermentrout, G. B., and Kopell, N. 1991. Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol. 29, 195-217.
Ermentrout, G. B., and Kopell, N. 1994. Inhibition produced patterning in chains of coupled nonlinear oscillators. SIAM J. Appl. Math. 54, in press.
Grillner, S. 1981. Control of locomotion in bipeds, tetrapods and fish. In Handbook of Physiology, Section 1: The Nervous System, 2, V. B. Brooks, ed., pp. 1179-1236. American Physiological Society, Bethesda, MD.
Guckenheimer, J., and Holmes, P. J. 1983. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, New York.
Kiemel, T. 1989. Three problems on coupled nonlinear oscillators. Ph.D. thesis, Cornell University.
Kopell, N., and Ermentrout, G. B. 1986. Symmetry and phaselocking in chains of coupled oscillators. Commun. Pure Appl. Math. 39, 623-660.
Kopell, N., and Ermentrout, G. B. 1990. Phase transitions and other phenomena in chains of oscillators. SIAM J. Appl. Math. 50, 1014-1052.
Kopell, N., Zhang, W., and Ermentrout, G. B. 1990. Multiple coupling in chains of oscillators. SIAM J. Math. Anal. 21, 935-953.
Kopell, N., Ermentrout, G. B., and Williams, T. 1991. On chains of neural oscillators forced at one end. SIAM J. Appl. Math. 51, 1397-1417.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing, Vols. 1 and 2. MIT Press, Cambridge, MA.
Schoner, G., Young, W. Y., and Kelso, J. A. S. 1990. A synergetic theory of gaits and gait transitions. J. Theor. Biol. 142, 359-391.
Sigvardt, K., Kopell, N., Ermentrout, G. B., and Remler, M. 1991. Soc. Neurosci. Abstr. 17, 122.
Tsang, K. Y., Mirollo, R. E., Strogatz, S. H., and Wiesenfeld, K. 1991. Dynamics of a globally coupled oscillator array. Physica D 48, 102-112.
Wallen, P., and Williams, T. L. 1984. Fictive locomotion in the lamprey spinal cord in vitro compared with swimming in the intact and spinal animal. J. Physiol. 347, 225-239.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Williams, T., Sigvardt, K., Kopell, N., Ermentrout, G. B., and Remler, M. 1990. Forcing of coupled nonlinear oscillators: Studies of intersegmental coordination in the lamprey locomotor central pattern generator. J. Neurophysiol. 64, 862-871.
Wilson, H. R., and Cowan, J. 1973. A mathematical theory for the functional dynamics of cortical and thalamic tissue. Kybernetik 13, 55-80.
Received March 1, 1993; accepted June 22, 1993.
Communicated by Idan Segev
A Mechanism for Neuronal Gain Control by Descending Pathways

Mark E. Nelson
Department of Physiology and Biophysics and Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801 USA
Many implementations of adaptive signal processing in the nervous system are likely to require a mechanism for gain control at the single neuron level. To properly adjust the gain of an individual neuron, it may be necessary to use information carried by neurons in other parts of the system. The ability to adjust the gain of neurons in one part of the brain, using control signals arising from another, has been observed in the electrosensory system of weakly electric fish, where descending pathways to a first-order sensory nucleus have been shown to influence the gain of its output neurons. Although the neural circuitry associated with this system is well studied, the exact nature of the gain control mechanism is not fully understood. In this paper, we propose a mechanism based on the regulation of total membrane conductance via synaptic activity on descending pathways. Using a simple neural model, we show how the activity levels of paired excitatory and inhibitory control pathways can regulate the gain and baseline excitation of a target neuron.

1 Introduction
Mechanisms for gain control at the single neuron level are likely to be involved in many implementations of adaptive signal processing in the nervous system. The need for some sort of adaptive gain control capability is most obvious at the sensory periphery, where adjustments to the gain of sensory neurons are often required to compensate for variations in external conditions, such as the average intensity of sensory stimuli. Adaptive gain control is also likely to be employed at higher levels of neural processing, where gain changes may be related to the functional state of the system. For both peripheral and higher-order neurons, the information necessary to make proper gain adjustments may be carried by neurons that are not part of the local circuitry associated with the target neuron. Thus it would be useful to have a means for adjusting the gain of neurons in one part of the nervous system, using control signals arising from another.

Neural Computation 6, 242-254 (1994)
Such a capability seems to exist in the electrosensory system of weakly electric fish, where descending pathways to a first-order sensory nucleus have been shown to influence the gain of its output neurons. Although the neural circuitry associated with this system is well studied, the exact nature of the gain control mechanism is not fully understood. In this paper, we propose a general neural mechanism that may underlie gain control in the electrosensory system, and which could potentially be used in many other systems, as a means for carrying out adaptive signal processing. In our model, gain control is achieved by regulating activity levels on paired excitatory and inhibitory control pathways that impinge on a target neuron. Increased synaptic activity on the control pathways gives rise to an increase in total membrane conductance of the target neuron, which lowers its input resistance. Consequently, input current is less effective in bringing about changes in membrane voltage and the effective gain of the neuron is reduced. Changing gain by increasing synaptic conductance levels would normally bring about a shift in the baseline level of excitation in the target neuron as well. In order to avoid this coupling between gain and baseline excitation, it is necessary to implement the conductance change using paired excitatory and inhibitory pathways.

2 Gain Control in the Electrosensory System
Certain species of freshwater tropical fish, known as weakly electric fish, detect and discriminate objects in their environment using self-generated electric fields (Bullock and Heiligenberg 1986). Unlike strongly electric fish, which use their electrogenic capabilities to stun prey, weakly electric fish produce fields that are generally too weak to be perceptible to other fish or to human touch. When a weakly electric fish discharges its electric organ, a small potential difference, on the order of a few millivolts, is established across its skin. Nearby objects in the water distort the fish's field, giving rise to perturbations in the potential across the skin. Specialized electroreceptors embedded in the skin transform these small perturbations, which are typically less than 100 µV in amplitude, into changes in spike activity in primary afferent nerve fibers. In the genus Apteronotus, primary afferent fibers change their firing rate by about 1 spike/sec for a 1 µV change in potential across the skin (Bastian 1981a). Weakly electric fish often live in turbid water and tend to be nocturnal. These conditions, which hinder visual perception, do not adversely affect the electric sense. Using their electrosensory capabilities, weakly electric fish can navigate and capture prey in total darkness in much the same way that bats do using echolocation. A fundamental difference between bat echolocation and fish "electrolocation" is that the propagation of the electric field emitted by the fish is essentially instantaneous when considered on the time scales that characterize nervous system function.
Thus rather than processing echo delays as bats do, electric fish extract information from instantaneous amplitude and phase modulations of their emitted signals. The electric sense must cope with a wide range of signal intensities because the magnitude of detectable electric field perturbations can vary over several orders of magnitude depending on the size, distance, and impedance of the object that gives rise to them (Bastian 1981a). In the electrosensory system there are no peripheral mechanisms to compensate for variations in signal intensity. Unlike the vertebrate visual system, which can regulate the intensity of light arriving at photoreceptors by adjusting pupil diameter, the electrosensory system has no equivalent means for directly regulating the overall electric field strength experienced by the electroreceptors,¹ and unlike the auditory system, there are no efferent projections to the sensory periphery to control the gain of the receptors themselves. The first opportunity for the electrosensory system to make adjustments in sensitivity occurs in a first-order sensory nucleus known as the electrosensory lateral line lobe (ELL). In the ELL, primary afferent axons from peripheral electroreceptors terminate on a class of pyramidal cells referred to as E-cells (Maler et al. 1981; Bastian 1981b), which represent a subset of the output neurons for the nucleus. Figure 1 shows a reconstructed E-cell (Bastian and Courtright 1991) that illustrates the general morphology of this class of neurons, including the basal dendrite with its terminal arborization that receives afferent input and the extensive apical dendrites that receive descending input from higher brain centers (Maler et al. 1981). Figure 2 shows a highly simplified diagram of the ELL circuitry, indicating the afferent and descending pathways, and their patterns of termination on E-cells. In view of our proposed gain control mechanism involving a conductance change, it is particularly noteworthy that descending inputs account for the majority of E-cell synapses, and can thus potentially make a significant contribution to the total synaptic conductance of the cell. In experiments in which the average electric field strength experienced by the fish is artificially increased or decreased, E-cell sensitivity is observed to change in a compensatory manner. For example, a 50% reduction in overall field strength would cause no significant change in the number of spikes generated by an E-cell in response to an object moving through its receptive field, whereas the number of spikes generated by primary afferent fibers that impinge on the E-cell would be greatly reduced. To maintain a constant output response, while receiving a reduced input signal, the effective gain of the E-cell must somehow have been increased. Interestingly, this apparent gain increase occurs without significantly altering the baseline level of spontaneous activity in the E-cell.

¹In principle, this could be achieved by regulating the strength of the fish's own electric discharge. However, these fish maintain a remarkably stable discharge amplitude and such a mechanism has never been observed.
[Figure 1 appears here: drawing of a reconstructed E-cell, with the apical dendrite (descending input), soma (inhibitory input), and basal dendrite (afferent input) labeled.]
Figure 1: Reconstructed E-cell from the electrosensory lateral line lobe (ELL) of the weakly electric fish Apteronotus leptorhynchus (brown ghost knife fish). This cell was reconstructed by Bastian and Courtright (1991) following intracellular injection of Lucifer yellow. E-cells receive excitatory primary afferent input on the terminal arbor of their basal dendrites, excitatory descending input on their extensive apical dendrites, and inhibitory input primarily on the soma and somatic dendrites. E-cells serve as output neurons of the ELL and send their axons to higher brain centers. Descending feedback pathways from these higher centers have been shown to play a role in controlling the gain of E-cell responses to afferent input (Bastian 1986a,b). Redrawn from Bastian and Courtright (1991).

Descending pathways are known to play an important role in this gain control capability, since the ability to make compensatory gain adjustments is abolished when descending inputs to the ELL are blocked (Bastian 1986a,b). Although the neural circuitry associated with this system is well studied, the mechanism underlying the gain control capability is not fully understood.

3 A Model of Descending Gain Control
In this paper, we propose that the underlying mechanism could involve the regulation of total membrane conductance by activity on descending pathways. Figure 2 shows a schematic diagram of the relevant circuitry that forms the basis of our model. The target neuron receives afferent input on its basal dendrite and control inputs from two descending pathways on its apical dendrites and cell body.
[Figure 2 appears here: schematic of the ELL circuitry, showing excitatory and inhibitory descending (control) pathways and an inhibitory interneuron converging on the target neuron.]
Figure 2: Neural circuitry for descending gain control based on the circuitry of the electrosensory lateral line lobe (ELL). The target neuron (E-cell) receives primary afferent input on its basal dendrite, and descending control inputs on its apical dendrites and cell body. The control pathway is divided into an excitatory and an inhibitory component. One component makes excitatory synapses (open circles) directly on the apical dendrites of the target neuron, while the other acts through an inhibitory interneuron (shown in gray) to activate inhibitory synapses on the soma (filled circles). The gain and offset of the target neuron's response to an input signal can be regulated by adjusting activity levels on the two descending control pathways.

One descending pathway makes excitatory synaptic connections directly on the apical dendrite of the target neuron, while a second pathway exerts a net inhibitory effect by acting through an interneuron that makes inhibitory synapses on the cell body of the target neuron. The model circuitry shown in Figure 2 has the input and control pathways segregated onto different parts of the dendritic tree, as is the case for actual E-cells (Fig. 1). This spatial segregation has played a key role in the discovery and characterization of gain control in this system by allowing independent experimental manipulation of the input and control pathways (Bastian 1986a,b). However, in this paper, we will ignore the effects of the spatial distribution of synapses and will treat the target neuron as an electrotonically compact point neuron. This approximation turns out to be sufficient for describing the basic operation of the proposed gain control mechanism when the conductance changes associated with the descending pathways can be treated as slowly varying. In the actual system, the placement of monosynaptic excitatory inputs on the distal dendrites and bisynaptic inhibitory inputs on the soma may help maintain the relative timing between the two components of the control pathway, such that the system could better handle rapid gain changes associated with transient changes in descending activity.
Figure 3: Electrical equivalent circuit for the target neuron (E-cell) in Figure 2. The membrane capacitance $C_m$ and leakage conductance $g_{leak}$ are intrinsic properties of the target neuron, while the excitatory and inhibitory conductances, $g_{ex}$ and $g_{inh}$, are determined by activity levels on the descending pathways. The input signal $I(t)$ is modeled as a time-dependent current that can represent either the synaptic current arising from afferent input or an externally injected current. By adjusting $g_{ex}$ and $g_{inh}$, activity levels on descending pathways can regulate the total membrane conductance and baseline level of excitation of the target neuron.
The gain control function of the neural circuitry in Figure 2 can be understood by considering the electrical equivalent circuit for the target neuron, as shown in Figure 3. For the purpose of understanding this model, it is sufficient to consider only the passive and synaptic conductances involved, and ignore the various types of voltage-dependent channels that are known to be present in ELL pyramidal cells (Mathieson and Maler 1988). The passive properties of the target neuron are described by a membrane capacitance $C_m$, a leakage conductance $g_{leak}$, and an associated reversal potential $E_{leak}$. The excitatory descending pathway directly activates excitatory synapses on the target neuron, giving rise to an excitatory synaptic conductance $g_{ex}$ with a reversal potential $E_{ex}$. The inhibitory descending pathway acts by exciting a class of
inhibitory interneurons, which in turn activate inhibitory synapses on the target neuron with inhibitory conductance $g_{inh}$ and reversal potential $E_{inh}$. The excitatory and inhibitory conductances, $g_{ex}$ and $g_{inh}$, are taken to represent the population conductances of all the individual excitatory and inhibitory synapses associated with the descending pathways. While individual synaptic events give rise to a time-dependent conductance change (which is often modeled by an $\alpha$ function), we consider the domain in which the activity levels on the descending pathways, the number of synapses involved, and the synaptic time constants are such that the summed effect can be well described by a single time-invariant conductance value for each pathway. The input signal (the one under the influence of the gain control mechanism) is modeled in a general form as a time-dependent current $I(t)$. This current can represent either the synaptic current arising from activation of synapses in the primary afferent pathway, or it can represent direct current injection into the cell, such as might occur in an intracellular recording experiment. The output of the system is considered to be the time-varying membrane potential of the target neuron $V(t)$. For most biological neurons, there is a subsequent transformation in which changes in membrane potential give rise to changes in the rate at which action potentials are generated. However, for the purpose of understanding the gain control mechanism proposed here, it is sufficient to limit our consideration to the behavior of the underlying membrane potential in the absence of the spike generating processes. Thus, in this model, the gain of the system describes the magnitude of the transformation between input current and the output membrane voltage change. The behavior of the membrane potential $V(t)$ for the circuit shown in Figure 3 is described by
$$C_m\frac{dV}{dt} + g_{leak}(V - E_{leak}) + g_{ex}(V - E_{ex}) + g_{inh}(V - E_{inh}) = I(t) \qquad (3.1)$$

In the absence of an input signal ($I = 0$), the system will reach a steady-state ($dV/dt = 0$) membrane potential $V_{ss}$ given by

$$V_{ss}(I{=}0) = \frac{g_{leak}E_{leak} + g_{ex}E_{ex} + g_{inh}E_{inh}}{g_{leak} + g_{ex} + g_{inh}} \qquad (3.2)$$

If we consider the input $I(t)$ to give rise to fluctuations in membrane potential $U(t)$ about this steady-state value

$$U(t) = V(t) - V_{ss} \qquad (3.3)$$

then 3.1 can be rewritten as

$$C_m\frac{dU}{dt} + g_{tot}U = I(t) \qquad (3.4)$$

where $g_{tot}$ is the total membrane conductance

$$g_{tot} = g_{leak} + g_{ex} + g_{inh} \qquad (3.5)$$
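To make the relation between conductance, gain, and time constant concrete, here is a minimal Python sketch (not part of the original model; the function name and all parameter values are arbitrary illustrations) that integrates equation 3.4 for a current step:

```python
import numpy as np

def membrane_response(I_amp, g_tot, C_m=0.5e-9, dt=1e-5, t_end=0.5):
    """Forward-Euler integration of equation 3.4:
    C_m dU/dt + g_tot * U = I(t), with I(t) a current step of size I_amp."""
    t = np.arange(0.0, t_end, dt)
    U = np.zeros_like(t)
    for i in range(1, len(t)):
        U[i] = U[i - 1] + dt * (I_amp - g_tot * U[i - 1]) / C_m
    return t, U

# Steady-state response per unit current (the gain) is 1/g_tot, and the
# time constant is tau = C_m/g_tot: doubling g_tot halves both quantities.
for g_tot in (10e-9, 20e-9):  # total conductance in siemens (illustrative)
    t, U = membrane_response(I_amp=0.1e-9, g_tot=g_tot)
    print(f"g_tot = {g_tot:.0e} S: U_ss = {U[-1]*1e3:.1f} mV, "
          f"tau = {0.5e-9/g_tot*1e3:.0f} ms")
```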
Figure 4: Gain as a function of frequency for three different values of total membrane conductance $g_{tot}$. At low frequencies, gain is inversely proportional to $g_{tot}$. Note that the time constant $\tau$, which characterizes the low-pass cutoff frequency, also varies inversely with $g_{tot}$.
Equation 3.4 describes a first-order low-pass filter with a transfer function $H(s)$ given by

$$H(s) = \frac{R_{tot}}{\tau s + 1} \qquad (3.6)$$
where $s$ is the complex frequency ($s = i\omega$), $R_{tot}$ is the total membrane resistance ($R_{tot} = 1/g_{tot}$), and $\tau$ is the RC time constant ($\tau = R_{tot}C_m$). The frequency dependence of the response gain $|H(i\omega)|$, as given by equation 3.6, is illustrated in Figure 4. In this figure, the gain has been normalized to the maximum system gain, $|H|_{max} = 1/g_{leak}$, which occurs at $\omega = 0$ when $g_{ex}$ and $g_{inh}$ are both zero. The normalized gain, in decibels, is given by $20\log_{10}(|H|/|H|_{max})$. For frequency components of the input signal below the knee of the response curve ($\omega\tau \ll 1$), the gain is directly proportional to the total membrane resistance $R_{tot}$. Above the knee ($\omega\tau \gg 1$), the gain rolls off with increasing frequency and is independent
of $R_{tot}$, due to the fact that the impedance is dominated by the capacitive component in this domain. For frequency components of the input signal below the cutoff frequency, gain control can be accomplished by regulating the total membrane conductance. It is important to note that the value of the cutoff frequency is not constant, but is correlated with the magnitude of the response gain; both the membrane time constant $\tau$ and the gain $|H|$ are directly proportional to $R_{tot}$. In many cases, the biologically relevant frequency components of the input signal will be well below the cutoff frequencies imposed by the gain control circuitry. For example, in the case of weakly electric fish with continuous wave-type electric organ discharge (EOD) signals, the primary afferent input signal carries information about amplitude modulations of the EOD signal due to objects (or other electric fish) in the vicinity. For typical object sizes and velocities, the relevant frequency components of the input signal are probably below about 10 Hz (Bastian 1981a,b) and can thus be considered low frequency with respect to the cutoffs imposed by the gain control circuitry. From the point of view of biological implementation, there is a potential concern that low-frequency signals are actually constructed from the summation of individual postsynaptic potentials (PSPs) that can have much higher frequency components and that may thus be significantly affected by the low-pass filtering mechanism. In the proposed model, increases in gain are accompanied by a reduction in the cutoff frequency. Hence there is a potential concern that attempts to increase the gain would be counteracted by decreases in amplitude of individual PSPs, rendering the overall gain control mechanism ineffective. However, because the system acts as a linear filter, it turns out that filtering the individual PSPs and then adding them together (filtering then summing) is equivalent to adding the individual PSPs together (to form a signal with only low-frequency components) and then passing that signal through the filter (summing then filtering). When a "fast" PSP is low-pass filtered, the peak amplitude is indeed attenuated, but the duration of the filtered PSP is prolonged, such that the overall contribution to a low-frequency signal is not diminished. Thus, to the extent that the neural processing associated with summing afferent PSPs together can be treated as linear, the gain control mechanism is not affected by the fact that low-frequency input signals are built up from PSPs with higher frequency components. Note that the requirement of linearity only applies to neural processing associated with the summation of PSPs and not to subsequent processing, such as the generation of action potentials. In the case of E-cells in the ELL, it is interesting to note that the input region of the neuron in the terminal arbor of the basal dendrite is well separated from the spike generating region in the soma (Fig. 1).
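The filtering-versus-summing argument is easy to verify numerically. In the sketch below (a toy construction, assuming alpha-function PSP shapes and illustrative time constants, not a model of real PSPs), low-pass filtering a sum of fast PSPs gives the same result as summing their individually filtered versions:

```python
import numpy as np

def lowpass(x, tau, dt):
    """Discrete first-order low-pass filter (time-domain version of eq. 3.6)."""
    y = np.zeros_like(x)
    for i in range(1, len(x)):
        y[i] = y[i - 1] + (dt / tau) * (x[i - 1] - y[i - 1])
    return y

dt, tau = 1e-4, 20e-3                 # 0.1 ms steps, 20 ms membrane time constant
t = np.arange(0.0, 0.5, dt)
rng = np.random.default_rng(1)

# Fifty "fast" alpha-function PSPs (2 ms rise time) at random onset times
psps = []
for t0 in rng.uniform(0.0, 0.4, size=50):
    s = np.clip(t - t0, 0.0, None) / 2e-3
    psps.append(s * np.exp(1.0 - s))

sum_then_filter = lowpass(np.sum(psps, axis=0), tau, dt)
filter_then_sum = np.sum([lowpass(p, tau, dt) for p in psps], axis=0)
# Effectively zero up to floating-point rounding: the filter is linear.
print(np.max(np.abs(sum_then_filter - filter_then_sum)))
```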
In this model, we propose that regulation of total membrane conductance occurs via activity on descending pathways that activate excitatory and inhibitory synaptic conductances. For this proposed mechanism to be effective, descending synaptic conductances must make a significant contribution to the total membrane conductance of the target neuron. Whether this condition actually holds for ELL pyramidal cells has not yet been experimentally tested. However, it is not an unreasonable assumption to make, considering recent reports that synaptic background activity can have a significant influence on the total membrane conductance of cortical pyramidal cells (Bernander et al. 1991) and cerebellar Purkinje cells (Rapp et al. 1992).

4 Control of Baseline Excitation
If the only functional goal of the circuitry shown in Figure 2 was to regulate total membrane conductance, then synaptic activity on a single descending pathway would be sufficient and there would be no need for paired excitatory and inhibitory pathways. However, attempting to control response gain using a single pathway results in a coupling between the gain and the baseline level of excitation of the target neuron, which may be undesirable. For example, if one tried to decrease response gain using a single inhibitory control pathway, then as the inhibitory conductance was increased to lower the gain, the steady-state membrane potential of the target neuron would simultaneously be pulled toward the inhibitory reversal potential. If we would like to be able to change the sensitivity of a neuron's response without changing its baseline level of excitation, as has been observed in E-cells in the ELL, then we need a mechanism for decoupling the gain of the response from the steady-state membrane potential. The second control pathway in Figure 2 provides the extra degree of freedom necessary to achieve this goal. To change the gain of a neuron without changing its baseline level of excitation, the excitatory and inhibitory conductances must be adjusted so as to achieve the desired total membrane conductance $g_{tot}$, as given by equation 3.5, while maintaining a constant steady-state membrane voltage $V_{ss}$, as given by equation 3.2. Solving equations 3.2 and 3.5 simultaneously for $g_{ex}$ and $g_{inh}$, we find

$$g_{ex} = \frac{g_{tot}(V_{ss} - E_{inh}) - g_{leak}(E_{leak} - E_{inh})}{E_{ex} - E_{inh}} \qquad (4.1)$$

$$g_{inh} = \frac{g_{tot}(E_{ex} - V_{ss}) - g_{leak}(E_{ex} - E_{leak})}{E_{ex} - E_{inh}} \qquad (4.2)$$

For example, consider a case where the reversal potentials are $E_{leak} = -70$ mV, $E_{ex} = 0$ mV, and $E_{inh} = -90$ mV. Assume we want to find values of the steady-state conductances $g_{ex}$ and $g_{inh}$ that would result in a total membrane conductance that is twice the leakage conductance (i.e., $g_{tot} = 2g_{leak}$), and would produce a steady-state depolarization of 10 mV (i.e., $V_{ss} = -60$ mV). Using 4.1 and 4.2 we find the required synaptic conductance levels are $g_{ex} = \frac{4}{9}g_{leak}$ and $g_{inh} = \frac{5}{9}g_{leak}$.
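This decoupling calculation is easily checked in code. The sketch below implements the reconstructed equations 4.1 and 4.2 (conductances in units of $g_{leak}$, potentials in mV; the function name is a hypothetical choice for this exposition) and reproduces the worked example:

```python
def control_conductances(g_tot, V_ss, g_leak=1.0,
                         E_leak=-70.0, E_ex=0.0, E_inh=-90.0):
    """Equations 4.1 and 4.2: conductances achieving a desired g_tot and V_ss."""
    g_ex = (g_tot * (V_ss - E_inh) - g_leak * (E_leak - E_inh)) / (E_ex - E_inh)
    g_inh = (g_tot * (E_ex - V_ss) - g_leak * (E_ex - E_leak)) / (E_ex - E_inh)
    return g_ex, g_inh

g_ex, g_inh = control_conductances(g_tot=2.0, V_ss=-60.0)
print(g_ex, g_inh)   # 0.444... and 0.555..., i.e., 4/9 and 5/9 of g_leak

# Sanity check: conductances sum to g_tot, and equation 3.2 recovers V_ss
assert abs((1.0 + g_ex + g_inh) - 2.0) < 1e-12
V_ss = (1.0 * -70.0 + g_ex * 0.0 + g_inh * -90.0) / 2.0
assert abs(V_ss - (-60.0)) < 1e-12
```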
5 Discussion
The ability to regulate a target neuron's gain using descending control signals would provide the nervous system with a powerful means for implementing adaptive signal processing algorithms in sensory processing pathways as well as other parts of the brain. The simple gain control mechanism proposed here, involving the regulation of total membrane conductance, may find widespread use in the nervous system. Determining whether this is the case, of course, requires experimental verification. Even in the electrosensory system, which provided the inspiration for this model, definitive experimental tests of the proposed mechanism have yet to be carried out. Fortunately the model is capable of generating straightforward, experimentally testable predictions. The main prediction of the model is that gain changes will be correlated with changes in input resistance of the target neuron and with changes in RC time constant, as illustrated in Figure 4. If such correlations were observed experimentally, more subtle aspects of the model could be tested, such as the prediction that both the excitatory and inhibitory pathways should contribute to the conductance change. To make the discussion more specific, we will describe how one might go about testing this model in the ELL of the weakly electric fish Apteronotus leptorhynchus. In this system it is possible to make in vivo intracellular recordings from intact animals (e.g., Bastian and Courtright 1991), thus making it possible to measure input resistance directly while manipulating the system so as to bring about gain changes. One experimental procedure for testing the gain control model would involve delivering hyperpolarizing current step pulses into the soma of an E-cell via an intracellular electrode to determine its input resistance and membrane time constant.² Input resistance would be determined by measuring the steady-state ratio of the change in membrane potential to injected current, while the time constant could be estimated from the time course of the exponential onset of the voltage change. The input resistance and time constant would first be determined under "normal" conditions, and then under conditions of increased or decreased gain. Bastian (1986a) has demonstrated that gain changes can be induced by artificially altering the nominal strength of the electric organ discharge (EOD) signal by adding or subtracting a scaled version of the fish's own EOD signal back into the experimental tank. Thus, if the gain of the system were increased using this technique, the model predicts that a hyperpolarizing test pulse of fixed amplitude should give rise to a larger change in membrane voltage (i.e., increased input resistance) with a slower onset
²Previous in vitro studies (Mathieson and Maler 1988) have demonstrated that hyperpolarizing current steps do not seem to activate any large nonlinear conductances, thus validating this approach to measuring membrane time constant.
(i.e., increased RC time constant) relative to the "normal" case. Note that the gain of the system can be monitored independently from the input resistance by delivering peripheral stimuli to the electroreceptors and observing the response of the E-cell. Thus the proposed model could be rejected by the above experimental procedure if the system gain were shown to increase while the observed input resistance and time constant remained unchanged. This could occur, for example, if the actual gain control mechanism were implemented at the level of the afferent synapses onto the E-cell, such that an increased gain was associated with increased synaptic current, whereas our model predicts that the synaptic current remains the same but the responsiveness of the E-cell changes.

We have mentioned that the model circuitry of Figure 2 was inspired by the circuitry of the ELL. For those familiar with this circuitry, it is interesting to speculate on the identity of the interneuron in the inhibitory control pathway. In the gymnotid ELL, there are at least six identified classes of inhibitory interneurons. For the proposed gain control mechanism, we are interested in identifying those that receive descending input and that make inhibitory synapses onto pyramidal cells. Four of the six classes meet these criteria: granule cell type 2 (G2), polymorphic, stellate, and ventral molecular layer neurons. While all four classes may participate to some extent in the gain control mechanism, one would predict that G2 (as suggested by Shumway and Maler 1989) and polymorphic cells make the dominant contribution, based on cell number and synapse location. The morphology of G2 and polymorphic neurons differs somewhat from that shown in Figure 2. In addition to the apical dendrite, which is shown in the figure, these neurons also have a basal dendrite that receives primary afferent input. G2 and polymorphic neurons are excited by primary afferent input and thus provide additional inhibition to pyramidal cells when afferent activity levels increase. This can be viewed as providing a feedforward component to the inhibitory conductance change associated with the proposed gain control mechanism.

In this paper, we have confined our analysis to the effects of tonic changes in descending activity, which has allowed us to treat the control conductances as time-invariant quantities. While this may be a reasonable approximation for certain experimental situations, it is unlikely to be a good representation of the actual patterns of control activity that occur under natural conditions. This is particularly true in the electrosensory system, where the descending pathways are known to form part of a feedback loop that includes the ELL output neurons. In fact, there is already experimental evidence demonstrating that in addition to gain control, descending pathways influence the spatial and temporal filtering properties of ELL output neurons (Bastian 1986a,b; Shumway and Maler 1989). Thus the simple model presented here is only a first step toward understanding the full range of effects that descending pathways can have on the signal processing capabilities of single neurons.
Acknowledgments

Thanks to J. Bastian and L. Maler for many enlightening discussions on descending gain control in the ELL, and to J. Payne for his detailed comments on this manuscript. This work was supported by NIMH 1R29-MH49242.
References

Bastian, J. 1981a. Electrolocation I: How the electroreceptors of Apteronotus albifrons code for moving objects and other electrical stimuli. J. Comp. Physiol. 144, 465-479.
Bastian, J. 1981b. Electrolocation II: The effects of moving objects and other electrical stimuli on the activities of two categories of posterior lateral line lobe cells in Apteronotus albifrons. J. Comp. Physiol. 144, 481-494.
Bastian, J. 1986a. Gain control in the electrosensory system mediated by descending inputs to the electrosensory lateral line lobe. J. Neurosci. 6, 553-562.
Bastian, J. 1986b. Gain control in the electrosensory system: A role for the descending projections to the electrosensory lateral line lobe. J. Comp. Physiol. 158, 505-515.
Bastian, J., and Courtright, J. 1991. Morphological correlates of pyramidal cell adaptation rate in the electrosensory lateral line lobe of weakly electric fish. J. Comp. Physiol. 168, 393-407.
Bernander, O., Douglas, R. J., Martin, K. A. C., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Bullock, T. H., and Heiligenberg, W., eds. 1986. Electroreception. Wiley, New York.
Knudsen, E. I. 1975. Spatial aspects of the electric fields generated by weakly electric fish. J. Comp. Physiol. 99, 103-118.
Maler, L., Sas, E., and Rogers, J. 1981. The cytology of the posterior lateral line lobe of high frequency weakly electric fish (Gymnotidei): Dendritic differentiation and synaptic specificity in a simple cortex. J. Comp. Neurol. 195, 87-140.
Mathieson, W. B., and Maler, L. 1988. Morphological and electrophysiological properties of a novel in vitro preparation: The electrosensory lateral line lobe brain slice. J. Comp. Physiol. 163, 489-506.
Rapp, M., Yarom, Y., and Segev, I. 1992. The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Comp. 4, 518-533.
Shumway, C. A., and Maler, L. M. 1989. GABAergic inhibition shapes temporal and spatial response properties of pyramidal cells in the electrosensory lateral line lobe of gymnotiform fish. J. Comp. Physiol. 164, 391-407.

Received February 8, 1993; accepted July 9, 1993.
Communicated by Todd Leen
The Role of Weight Normalization in Competitive Learning Geoffrey J. Goodhill University of Edinburgh, Centre for Cognitive Science, 2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom
Harry G. Barrow University of Sussex, School of Cognitive and Computing Sciences, Falmer, Brighton BN1 9QH, United Kingdom
The effect of different kinds of weight normalization on the outcome of a simple competitive learning rule is analyzed. It is shown that there are important differences in the representation formed depending on whether the constraint is enforced by dividing each weight by the same amount ("divisive enforcement") or subtracting a fixed amount from each weight ("subtractive enforcement"). For the divisive cases weight vectors spread out over the space so as to evenly represent "typical" inputs, whereas for the subtractive cases the weight vectors tend to the axes of the space, so as to represent "extreme" inputs. The consequences of these differences are examined.

1 Introduction
Competitive learning (Rumelhart and Zipser 1986) has been shown to produce interesting solutions to many unsupervised learning problems [see, e.g., Becker (1991); Hertz et al. (1991)]. However, an issue that has not been greatly discussed is the effect of the type of weight normalization used. In common with other learning procedures that employ a simple Hebbian-type rule, it is necessary in competitive learning to introduce some form of constraint on the weights to prevent them from growing without bounds. This is often done by specifying that the sum [e.g., von der Malsburg (1973)] or the sum-of-squares [e.g., Barrow (1987)] of the weights for each unit should be maintained at a constant value. Weight adaptation in competitive learning is usually performed only for the "winning" unit $\mathbf{w}$, which we take to be the unit whose weight vector has the largest inner product with the input pattern $\mathbf{x}$. Adaptation usually consists of taking a linear combination of the current weight vector and the input vector. The two most common rules are
$$\mathbf{w}' = \mathbf{w} + \epsilon\mathbf{x} \qquad (1.1)$$
and

$$\mathbf{w}' = \mathbf{w} + \epsilon(\mathbf{x} - \mathbf{w}) \qquad (1.2)$$

Consider the general case

$$\mathbf{w}' = a\mathbf{w} + \epsilon\mathbf{x} \qquad (1.3)$$
where $a = 1$ for rule 1.1 and $a = 1 - \epsilon$ for rule 1.2. For a particular normalization constraint, e.g., $\|\mathbf{w}\| = L$, there are various ways in which that constraint may be enforced. The two main approaches are
$$\mathbf{w} = \mathbf{w}'/\alpha \qquad (1.4)$$

and

$$\mathbf{w} = \mathbf{w}' - \beta\mathbf{c} \qquad (1.5)$$
where $\mathbf{c}$ is a fixed vector, and $\alpha$ and $\beta$ are calculated to enforce the constraint. For instance, if the constraint is $\|\mathbf{w}\| = L$ then $\alpha = \|\mathbf{w}'\|/L$. The simplest case for $\mathbf{c}$ is $c_i = 1\ \forall i$. We refer to rule 1.4 as "divisive" enforcement, since each weight is divided by the same amount so as to enforce the constraint, and rule 1.5 as "subtractive" enforcement, since here an amount is subtracted from each weight so as to enforce the constraint. It should be noted that the qualitative behavior of each rule does not depend on the value of $a$. It is straightforward to show that any case in which $a \neq 1$ is equivalent to a case in which $a = 1$ and the parameters $\epsilon$ and $L$ have different values. In this paper, therefore, we will consider only the case $a = 1$. The effect of these two types of enforcement on a model for ocular dominance segregation, where development is driven by the time-averaged correlation matrix of the inputs, was mentioned by Miller (1990, footnote 24). Divisive and subtractive enforcements have been thoroughly analyzed for the case of general linear learning rules in Miller and MacKay (1993, 1994). They show that in this case divisive enforcement causes the weight pattern to tend to the principal eigenvector of the synaptic development operator, whereas subtractive enforcement causes almost all weights to reach either their minimum or maximum values. Competitive learning however involves choosing a winner, and thus does not succumb to the analysis employed by Miller and MacKay (1993, 1994), since account needs to be taken of the changing subset of inputs for which each output unit wins. In this paper we analyze a special case of competitive learning that, although simple, highlights the differences between divisive and subtractive enforcement. We also consider both normalization constraints $\sum_i w_i = $ constant and $\sum_i w_i^2 = $ constant, and thus compare four cases in all. The analysis focuses on the case of two units (i.e., two weight vectors) evolving in the positive quadrant of a two-dimensional space under the influence of normalized input vectors uniformly distributed in direction.
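As a concrete illustration of these ingredients (rules 1.1, 1.4, and 1.5), the following Python sketch simulates the two-unit case with the squared constraint $\|\mathbf{w}\| = L$ and $\mathbf{c} = (1,1)$. This is a hypothetical implementation written for this exposition, not the authors' simulation code; the learning rate, step count, and clipping to the positive quadrant are illustrative choices. The printed angles anticipate the stable states derived in Section 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(enforcement, n_steps=50_000, eps=0.02, L=1.0):
    """Two units competing for inputs uniform in direction in the positive
    quadrant; the weight update is rule 1.1, then the constraint ||w|| = L
    is enforced divisively (rule 1.4) or subtractively (rule 1.5, c = (1,1))."""
    W = rng.uniform(0.1, 1.0, size=(2, 2))            # rows are weight vectors
    W *= L / np.linalg.norm(W, axis=1, keepdims=True)
    c = np.array([1.0, 1.0])
    for _ in range(n_steps):
        theta = rng.uniform(0.0, np.pi / 2)
        x = np.array([np.cos(theta), np.sin(theta)])  # ||x|| = d = 1
        i = int(np.argmax(W @ x))                     # winner: largest w.x
        w = W[i] + eps * x                            # rule 1.1
        if enforcement == "divisive":
            w *= L / np.linalg.norm(w)                # rule 1.4
        else:
            # rule 1.5: smallest beta satisfying ||w - beta*c|| = L
            b = w @ c
            beta = (b - np.sqrt(b * b - 2.0 * (w @ w - L * L))) / 2.0
            w = w - beta * c
        W[i] = np.clip(w, 0.0, None)                  # stay in the quadrant
    return sorted(np.degrees(np.arctan2(W[:, 1], W[:, 0])))

print(train("divisive"))     # approx [22.5, 67.5]: weights rest at pi/8, 3pi/8
print(train("subtractive"))  # approx [0.0, 90.0]: weights saturate at the axes
```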
Table 1: Notation for Calculation of Weight Vectors.

$\mathbf{w}$: Weight vector
$\mathbf{x}$: Input vector
$\delta\mathbf{w}$: Change in weight vector
$\omega$: Angle of weight vector to right axis
$\delta\omega$: Change in angle of weight vector to right axis
$\theta$: Angle of input pattern vector to right axis
$\phi$: Angle of enforcement vector to right axis
$\psi$: Angle of normal to constraint surface to right axis
$L$: Magnitude of the normalization constraint
$d$: $\|\mathbf{x}\|$ (constant)
$\epsilon$: Learning rate
Later it is suggested how the conclusions can be extended to various more complex situations. It is shown that, for uniformly distributed inputs, divisive enforcement leads to weight vectors becoming evenly distributed through the space, while subtractive enforcement leads to weight vectors tending to the axes of the space.

2 Analysis
The analysis proceeds in the following stages: (1) Calculate the weight change for the winning unit in response to an input pattern. (2) Calculate the average rate of change of a weight vector, by averaging over all patterns for which that unit wins. (3) Calculate the phase plane dynamics, in particular the stable states.

2.1 Weight Changes. The change in direction of the weight vector for the winning unit is derived by considering the geometric effect of updating weights and then enforcing the normalization constraint. A formula for the change in the weight in the general case is derived, and then instantiated to each of the four cases under consideration. For convenience the axes are referred to as "left" ($y$ axis) and "right" ($x$ axis). Figure 1 shows the effect of updating a weight vector $\mathbf{w}$ with angle $\omega$ to the right axis, and then enforcing a normalization constraint. Notation is summarized in Table 1. A small fraction of $\mathbf{x}$ is added to $\mathbf{w}$, and then the constraint is enforced by projecting back to the normalization surface (the surface in which all normalized vectors lie) at angle $\phi$, thus defining the new weight. For the squared constraint case, the surface is a circle centered on the origin with radius $L$.
Figure 1: (a) The general case of updating a weight vector $\mathbf{w}$ by adding a small fraction of the input vector $\mathbf{x}$ and then projecting at angle $\phi$ back to the normalization surface. (b) The change in angle $\omega$, $\delta\omega$, produced by the weight update.
For the linear constraint case, the surface is a line normal to the vector $(1,1)$, which cuts the right axis at $(L, 0)$. When $\epsilon$ is very small, we may consider the normalization surface to be a plane, even in the squared constraint case. For this case the normalization surface is normal to the weight vector, a tangent of the circle. For divisive enforcement, the projection direction is back along $\mathbf{w}'$, directly toward the origin. For subtractive enforcement, the projection direction is back along a fixed vector $\mathbf{c}$, typically $(1,1)$.
Table 2: Value of $\delta\omega$ for the Winning Unit.

Squared constraint, divisive enforcement ($\psi = \omega$, $\phi = \omega$): $\delta\omega = \frac{\epsilon d}{L}\sin(\theta-\omega)$
Squared constraint, subtractive enforcement ($\psi = \omega$): $\delta\omega = \frac{\epsilon d}{L}\,\frac{\sin(\theta-\phi)}{\cos(\omega-\phi)}$
Linear constraint, divisive enforcement ($\psi = \pi/4$, $\phi = \omega$): $\delta\omega = \frac{\sqrt{2}\,\epsilon d}{L}\sin(\theta-\omega)\cos(\pi/4-\omega)$
Linear constraint, subtractive enforcement ($\psi = \pi/4$): $\delta\omega = \frac{\sqrt{2}\,\epsilon d}{L}\,\frac{\sin(\theta-\phi)\cos^2(\pi/4-\omega)}{\cos(\pi/4-\phi)}$
Referring to Figure 1a, consider $\delta\mathbf{w} = \epsilon\mathbf{x} - \beta\mathbf{c}$. Resolving horizontally and vertically and then eliminating $\beta\|\mathbf{c}\|$ yields

$$\|\delta\mathbf{w}\| = \frac{\epsilon\|\mathbf{x}\|\sin(\theta-\phi)}{\cos(\psi-\phi)} \qquad (2.1)$$
Now referring to Figure 1b, consider the change in angle $\omega$, $\delta\omega$:

$$\|\mathbf{w}\|\,\delta\omega = \|\delta\mathbf{w}\|\cos(\psi-\omega)$$
which in conjunction with equation 2.1 gives

$$\delta\omega = \frac{\epsilon\|\mathbf{x}\|\sin(\theta-\phi)\cos(\psi-\omega)}{\|\mathbf{w}\|\cos(\psi-\phi)} \qquad (2.2)$$

For the squared constraint case $\|\mathbf{w}\| = L$, whereas in the linear constraint case

$$\|\mathbf{w}\| = \frac{L}{\sqrt{2}\cos(\psi-\omega)}$$
For divisive enforcement $\phi = \omega$, whereas for subtractive enforcement $\phi$ is constant. From now on we assume $\|\mathbf{x}\| = d$, a constant. Table 2 shows the instantiation of equation 2.2 in the four particular cases studied below. An important difference between divisive and subtractive enforcement is immediately apparent: for divisive enforcement the sign of the change is dependent on $\mathrm{sign}(\theta - \omega)$, while for subtractive enforcement it is dependent on $\mathrm{sign}(\theta - \phi)$. (Note that the factors $\cos(\omega - \phi)$, $\cos(\psi - \phi)$, and $\cos(\psi - \omega)$ are always positive for $\omega, \phi \in [0, \pi/2]$.) Thus in the divisive case a weight vector only moves toward (say) the right axis if the input pattern is more inclined to the right axis than the weight is already, whereas in the subtractive case the vector moves toward the right axis whenever the input pattern is inclined farther to the right axis than the constraint vector.
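A quick numerical check of this sign property, using the reconstructed equation 2.2 for the squared constraint ($\psi = \omega$, $\|\mathbf{w}\| = L$); the function name and the test values of $\omega$ and $\theta$ are arbitrary choices for illustration:

```python
import numpy as np

def delta_omega(theta, omega, phi, psi, w_norm, eps=1e-3, d=1.0):
    """Equation 2.2: change in weight angle for a single input at angle theta."""
    return (eps * d * np.sin(theta - phi) * np.cos(psi - omega)
            / (w_norm * np.cos(psi - phi)))

L, omega = 1.0, np.pi / 3
for theta in (np.pi / 6, 0.45 * np.pi):
    div = delta_omega(theta, omega, phi=omega, psi=omega, w_norm=L)
    sub = delta_omega(theta, omega, phi=np.pi / 4, psi=omega, w_norm=L)
    print(np.sign(div) == np.sign(theta - omega),     # divisive: sign(theta - omega)
          np.sign(sub) == np.sign(theta - np.pi / 4)) # subtractive: sign(theta - phi)
```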
2.2 Averaged Weight Changes. The case of two competing weight vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ with angles $\omega_1$ and $\omega_2$, respectively, to the right axis is now considered. It is assumed that $\omega_1 < \omega_2$: this is simply a matter of the labeling of the weights. The problem is to calculate the motion of each weight vector in response to the input patterns for which it wins, taking account of the fact that this set changes with time. This is done by assuming that the learning rate $\epsilon$ is small enough so that the weight vectors move infinitesimally in the time it takes to present all the input patterns. Pattern order is then not important, and it is possible to average over the entire set of inputs in calculating the rates of change. Consider the evolution of $\mathbf{w}_i$, $i = 1, 2$. In the continuous time limit, from equation 2.2 we have

$$\dot\omega_i = \frac{\epsilon d\,\sin(\theta-\phi)\cos(\psi-\omega_i)}{\|\mathbf{w}_i\|\cos(\psi-\phi)}$$
Using the assumption that $\epsilon$ is small, an average is now taken over all the patterns for which $\mathbf{w}_i$ wins the competition. In two dimensions this is straightforward. For instance consider $\mathbf{w}_1$: in the squared constraint cases $\mathbf{w}_1$ wins for all $\theta < (\omega_1+\omega_2)/2$. In the linear constraint cases the weight vectors have variable length, and the condition for $\mathbf{w}_1$ to win for input $\theta$ is now

$$\|\mathbf{w}_1\|\cos(\theta-\omega_1) > \|\mathbf{w}_2\|\cos(\theta-\omega_2)$$

where

$$\|\mathbf{w}_i\| = \frac{L}{\sqrt{2}\cos(\pi/4-\omega_i)}$$

This yields the condition $\theta < \pi/4$ for $\mathbf{w}_1$ to win for input $\theta$. That is, in the linear cases the unit that wins is the unit closest to the axis to which the input is closest, and the weights evolve effectively independently of each other. (Note that we have only assumed $\omega_1 < \omega_2$, not $\omega_1 < \pi/4$.) First equation 2.2 is integrated for general limits $\theta_1$ and $\theta_2$, and then the particular values of $\theta_1$ and $\theta_2$ for each of the cases are substituted. We have

$$\langle\dot\omega_i\rangle = \int_{\theta_1}^{\theta_2} \frac{\epsilon d\,\sin(\theta-\phi)\cos(\psi-\omega_i)}{\|\mathbf{w}_i\|\cos(\psi-\phi)}\,P(\theta)\,d\theta \qquad (2.3)$$
where the angle brackets denote averaging over the specified range of $\theta$, and $P(\theta)$ is the probability of input $\theta$. The outcome under any continuous distribution can be determined by appropriate choice of $P(\theta)$. Here we just consider the simplest case of the uniform distribution $P(\theta) = p$, a constant. With some trigonometrical manipulation it follows that

$$\langle\dot\omega_i\rangle = \frac{2\epsilon dp\,\cos(\psi-\omega_i)}{\|\mathbf{w}_i\|\cos(\psi-\phi)}\,\sin\left(\frac{\theta_1+\theta_2}{2}-\phi\right)\sin\left(\frac{\theta_2-\theta_1}{2}\right) \qquad (2.4)$$
2.3 Stable States.
2.3.1 Linear Constraints. Substituting the limits derived above for linear constraints into equation 2.4 yields for the divisive enforcement case

$$\langle\dot\omega_1\rangle = \sqrt{2}\,C\cos\left(\frac{\pi}{4}-\omega_1\right)\sin\frac{\pi}{8}\,\sin\left(\frac{\pi}{8}-\omega_1\right)$$

$$\langle\dot\omega_2\rangle = \sqrt{2}\,C\cos\left(\frac{\pi}{4}-\omega_2\right)\sin\frac{\pi}{8}\,\sin\left(\frac{3\pi}{8}-\omega_2\right)$$

where for conciseness we have defined $C = 2\epsilon dp/L$. To determine the behavior of the system the conditions for which $\langle\dot\omega_1\rangle$ and $\langle\dot\omega_2\rangle$ are positive, negative, and zero are examined. It is clear that $\omega_1$ moves toward the right axis for $\omega_1 > \pi/8$, $\omega_2$ moves towards the left axis for $\omega_2 < 3\pi/8$, and the stable state is

$$\omega_1 = \frac{\pi}{8}, \qquad \omega_2 = \frac{3\pi}{8}$$
Each weight captures half the patterns, and comes to rest balanced by inputs on either side of it. Weights do not saturate at the axes. This behavior can be clearly visualized in the phase plane portrait (Fig. 2a). For the subtractive enforcement case

$$\langle\dot\omega_1\rangle = \frac{\sqrt{2}\,C\cos^2(\pi/4-\omega_1)\sin(\pi/8)}{\cos(\pi/4-\phi)}\,\sin\left(\frac{\pi}{8}-\phi\right)$$

$$\langle\dot\omega_2\rangle = \frac{\sqrt{2}\,C\cos^2(\pi/4-\omega_2)\sin(\pi/8)}{\cos(\pi/4-\phi)}\,\sin\left(\frac{3\pi}{8}-\phi\right)$$
For $\langle\dot\omega_1\rangle < 0$, that is, $\omega_1$ heading for the right axis, it is required that $\phi > \pi/8$. Similarly for $\omega_2$ to be heading for the left axis it is required that $\phi < 3\pi/8$. Thus the weights saturate, one at each axis, if $\pi/8 < \phi < 3\pi/8$. They both saturate at the left axis for $\phi < \pi/8$, and both at the right axis for $\phi > 3\pi/8$. Phase plane portraits for some illustrative values of $\phi$ are shown in Figure 2b-d.

2.3.2 Squared Constraints. Instantiating equation 2.4 in the divisive enforcement case yields

$$\langle\dot\omega_1\rangle = C\sin\left(\frac{\omega_1+\omega_2}{4}\right)\sin\left(\frac{\omega_2-3\omega_1}{4}\right)$$

$$\langle\dot\omega_2\rangle = C\sin\left(\frac{\omega_1-3\omega_2+\pi}{4}\right)\sin\left(\frac{\pi-\omega_1-\omega_2}{4}\right)$$
Figure 2: Phase plane portraits of the dynamics for linear constraint cases. (a) Divisive enforcement: weights tend to $(\pi/8, 3\pi/8)$. (b,c,d) Subtractive enforcement for $\phi = \pi/4$, $\phi = \pi/6$, and $\phi = \pi/16$, respectively. For $\pi/8 < \phi < 3\pi/8$ weights saturate one at each axis, otherwise both saturate at the same axis.
For $\langle\dot\omega_1\rangle < 0$ we require $3\omega_1 > \omega_2$, for $\langle\dot\omega_2\rangle > 0$ we require $3\omega_2 < \omega_1 + \pi$, and the stable state is the same as in the linear constraint, divisive enforcement case:

$$\omega_1 = \frac{\pi}{8}, \qquad \omega_2 = \frac{3\pi}{8}$$

The phase plane portrait is shown in Figure 3a.
Figure 3: Phase plane portraits of the dynamics for squared constraint cases. (a) Divisive enforcement: weights tend to $(\pi/8, 3\pi/8)$. (b,c,d) Subtractive enforcement for $\phi = \pi/4$, $\phi = \pi/6$, and $\phi = \pi/16$, respectively. For $\phi = \pi/4$ weights saturate at different axes. As $\phi$ moves from $\pi/4$, there is an increasing region of the $(\omega_1, \omega_2)$ plane for which the final outcome is saturation at the same axis.

In the subtractive enforcement case we have

$$\langle\dot\omega_1\rangle = \frac{C}{\cos(\phi-\omega_1)}\,\sin\left(\frac{\omega_1+\omega_2}{4}-\phi\right)\sin\left(\frac{\omega_1+\omega_2}{4}\right)$$

$$\langle\dot\omega_2\rangle = \frac{C}{\cos(\phi-\omega_2)}\,\sin\left(\frac{\omega_1+\omega_2+\pi}{4}-\phi\right)\sin\left(\frac{\pi-\omega_1-\omega_2}{4}\right)$$

For $\langle\dot\omega_1\rangle < 0$ we require

$$\phi > \frac{\omega_1+\omega_2}{4}$$

Similarly for $\langle\dot\omega_2\rangle > 0$ we require

$$\phi < \frac{\omega_1+\omega_2+\pi}{4}$$

If $\phi = \pi/4$ both these conditions are satisfied for all but two initial states and weights saturate one at each axis, that is, the only stable attractor is $(0, \pi/2)$. The two points for which this is not true are the critical points $(0,0)$ and $(\pi/2, \pi/2)$. Here $\dot\omega_1 = 0$, $\dot\omega_2 = 0$, and these are unstable equilibria. If $\pi/8 < \phi < \pi/4$, both $(\pi/2, \pi/2)$ and $(0, \pi/2)$ are stable attractors. Both weights can saturate at the left axis if they start sufficiently close to it. The size of the basin of attraction around $(\pi/2, \pi/2)$ gradually increases as $\phi$ decreases, until for $\phi < \pi/8$ the point $(0, \pi/2)$ is no longer an attractor and all initial conditions lead to saturation of both weights at the left axis. Analogous results hold for $\pi/4 < \phi < 3\pi/8$ and $\phi > 3\pi/8$. Phase plane portraits for some illustrative values of $\phi$ are shown in Figure 3b-d. We have not been able to find an analytic expression for the boundary between the different basins of attraction in this case. Convergence properties for each of the four cases of constraints and enforcement are summarized in Table 3.

Table 3: Convergence Properties of the Four Cases.

Linear constraint, divisive enforcement: weights stable at $\omega_1 = \pi/8$, $\omega_2 = 3\pi/8$.
Linear constraint, subtractive enforcement: weights saturate at different axes for $\pi/8 < \phi < 3\pi/8$; for $\phi < \pi/8$ both weights saturate at the left axis; for $\phi > 3\pi/8$ both weights saturate at the right axis.
Squared constraint, divisive enforcement: weights stable at $\omega_1 = \pi/8$, $\omega_2 = 3\pi/8$.
Squared constraint, subtractive enforcement: weights saturate at different axes for $\phi = \pi/4$; for $\pi/8 < \phi < 3\pi/8$ weights may saturate at the same or different axes (see text); for other $\phi$ both weights saturate at the same axis as in the linear case.

3 Discussion
3.1 Extension to More Units. Extending the above analysis to the case of more than two units evolving in two-dimensional space is straightforward. Consider $N$ units with weight vectors $\mathbf{w}_i$, indexed according to their angle with the right axis, so that the smallest angle with the right axis is $\omega_1$ and so on.
For the squared constraint, divisive enforcement case the stable state is $\omega_1 = \omega_2/3$, $\omega_N = (\omega_{N-1}+\pi)/3$. The weight vectors in between are stable when they are equidistant from their neighbors. The angle between each pair is thus $\alpha = \pi/2N$, and the angle between $\omega_1$, $\omega_N$ and the right and left axes respectively is $\alpha/2$. For the linear constraint, divisive enforcement case the situation is different since it is the weight vector closest to the axis to which the input vector is closest that wins. First consider the case where $\pi/8 < \omega_1 < 3\pi/8$. Then $\mathbf{w}_1$ is the only vector that ever wins for $\theta < \pi/4$, and so is stable at $\omega_1 = \pi/8$ as before, while all other vectors $j$ such that $\omega_j < \pi/4$ remain in their initial positions. A similar situation holds in the upper octant. If $\omega_1 < \pi/8$ then it still eventually comes to rest at $\omega_1 = \pi/8$. However, if there are other vectors $k$ such that $\omega_k < \pi/8$ then these will begin to win as $\omega_1$ passes by on its way to $\pi/8$. Which unit wins changes as each progresses toward $\pi/8$, where they are all finally stable. Again, vectors with initial angles between $\pi/8$ and $3\pi/8$ remain in their initial states. By similar arguments, the situation is even more straightforward in the linear constraint, subtractive enforcement case. $\mathbf{w}_1$ saturates at the right axis and $\mathbf{w}_N$ at the left axis as before (for appropriate $\phi$), and all other weights remain unchanged. The squared constraint, subtractive enforcement case is, however, more complicated. In general all weight vectors $\mathbf{w}_i$ for which $\omega_i < \phi$ saturate at the right axis; an analogous result holds for the left axis. However, this is not quite true of the two initial weight vectors $\mathbf{w}_j$ and $\mathbf{w}_{j+1}$, which are such that $\omega_j < \phi$ and $\omega_{j+1} > \phi$. Assume $\omega_{j+1}$ is closer to $\phi$ than $\omega_j$. Then $\mathbf{w}_{j+1}$ can win for inputs $\theta < \phi$, and $\mathbf{w}_{j+1}$ can eventually be pulled to the right axis. The effect of a conscience mechanism (Hertz et al. 1991) that ensures that each unit wins roughly the same amount of the time can thus be clearly seen. In the linear constraint, subtractive enforcement case, this would mean that eventually all weights would saturate at the axis to which they were initially closest. For instance, for $\theta < \pi/4$, $\mathbf{w}_1$ would win for the first pattern, but then $\mathbf{w}_2$ would win for the next since $\mathbf{w}_1$ is temporarily out of the running, and similarly for all weights.

3.2 Higher Dimensions. In higher dimensional spaces, the situation immediately becomes much more complicated, for two main reasons. First, it is harder to calculate the new weight vector for the winning unit in terms of the old by the geometric methods we have used here. Second, the dividing lines between which unit wins for the set of inputs form a Voronoi tessellation (Hertz et al. 1991) of the constraint surface. The limits of the integral required to calculate $\langle\dot\omega\rangle$ are thus hard to determine, and the integrand is more complicated. Empirical results have been obtained for this case in the context of a model for ocular dominance segregation (Goodhill 1993). Figure 4 shows an example for this model (discussed further below), illustrating that qualitatively similar behavior to the two-dimensional case occurs in higher dimensions.
Figure 4: Outcome in a high dimensional case under divisive and subtractive enforcement of a linear normalization constraint ($c_i = 1\ \forall i$). The pictures show the ocular dominance of an array of 32 by 32 postsynaptic units, whose weights are updated using a competitive rule in response to distributed, locally correlated patterns of input activity. For each postsynaptic unit the color of the square indicates for which eye the unit is dominant, and the size of the square represents the degree of dominance. (a) Initial state (random connection strengths): no postsynaptic units are strongly dominated. (b) State after 250,000 iterations for divisive enforcement of a linear constraint: again no postsynaptic units are strongly dominated. (c) State after 250,000 iterations for subtractive enforcement of a linear constraint: segregation has occurred. For further details of this model see Goodhill (1991, 1993).
See Miller and MacKay (1993, 1994) for analysis of linear learning rules in spaces of general dimension. A more general analysis for the competitive case will be found in Barrow and Goodhill (1993, 1994).

3.3 Representations. How do the representations formed by divisive and subtractive enforcement differ, and for what kinds of problems might each be appropriate? From the two-dimensional results presented here, it appears that divisive enforcement is appropriate if it is desired to spread weight vectors out over the space so as to evenly represent "typical" inputs, or, if the inputs are clustered, to find the cluster centers. Subtractive enforcement on the other hand represents "extreme" inputs: instead of finding the centers of clusters, the weight vectors tend to the axes of the space to which clusters are closest. Subtractive enforcement can be thought of as making harsher decisions about the input distribution. These properties are illustrated for a high-dimensional case of a similar learning rule in the model of visual map formation and ocular dominance segregation of Goodhill (1991, 1993). Here, an array of "cortical" units competes in response to inputs from two arrays of "retinal" units. With divisive enforcement of a linear normalization rule, no ocular dominance segregation occurs unless only a small patch of retina in one eye or the other is active at a time (Goodhill 1990). However, with subtractive enforcement segregation does occur when all retinal units are simultaneously active, with local correlations of activity within each retina and positive correlations between the two retinae (Goodhill 1991, 1993). This is illustrated in Figure 4. Two other points of note are as follows. (1) Whereas the stable state for divisive enforcement is invariant to affine transformations of the input space, it is not for subtractive enforcement. (2) A natural type of projection onto the constraint surface to consider is an orthogonal projection. For squared constraints orthogonal projection corresponds to divisive enforcement, whereas for linear constraints orthogonal projection corresponds to subtractive enforcement with $c_i = 1\ \forall i$ in equation 1.5. Thus applying a rule of orthogonal projection leads to a very different outcome for squared and linear constraints.
4 Conclusions
A simple case of competitive learning has been analyzed with respect to whether the normalization constraint is linear or sum-of-squares, and also whether the constraint is enforced divisively or subtractively. It has been shown that the outcome is significantly different depending on the type of enforcement, while being relatively insensitive to the type of constraint. Divisive enforcement causes the weights to represent "typical" inputs, whereas subtractive enforcement causes the weights to represent
268
Geoffrey J. Goodhill and Harry G. Barrow
"extreme" inputs. These results are similar to the linear learning rule case analyzed in Miller and MacKay (1993, 1994). Directions for future work include analysis of normalization in competitive learning systems of higher dimension, and studying the differences in the representations formed by divisive and subtractive enforcement on a variety of problems of both practical and biological interest.
Acknowledgments We thank Ken Miller both for first introducing us to the difference between divisive and subtractive constraints in the linear case and the twodimensional version of this, and for kindly making available Miller and MacKay (1993) while still in draft form. G.J.G. thanks David Griffel, Peter Williams, Peter Dayan, Martin Simmen, and Steve Finch for helpful comments and suggestions. G.J.G. was supported by an SERC Postgraduate Studentship (at Sussex University), and a Joint Councils Initiative in CogSci/HCI postdoctoral training fellowship.
References Barrow, H. G. 1987. Learning receptive fields. Proc. JEEE First Annual Conference on Neural Netzoorks, IV, 115-121. Barrow, H. G., and Goodhill, G. J. 1994. In preparation. Becker, S. 1991. Unsupervised learning: Procedures for neural networks. Int. I. Neural Syst. 2, 17-33. Goodhill, G. J. 1990. The development of topography and ocular dominance. In Proceedings of the 2990 Conriectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds. Morgan Kaufmann, San Mateo, CA. Goodhill, G. J. 1991. Correlations, competition and optimality: Modelling the development of topography and ocular dominance. Ph.D. Thesis, University of Sussex. Goodhill, G. J. 1993. Topography and ocular dominance: A model exploring positive correlations. Biol. Cybernet. 69, 109-118. Hertz, J., Krogh, A,, and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Lecture notes in the Santa Fe Znstitute Studies in the Sciences of Complexity. Addison-Wesley, Reading, MA. Malsburg, C. von der 1973. Self-organizationof orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100. Miller, K. D. 1990. Correlation-based mechanisms of neural development. In Neuroscienceand Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ. Miller, K. D., and MacKay, D. J. C. 1993. The role of constraints in Hebbian Learning. Caltech Computation and Neural Systems Memo 19. Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6(1), 100-126.
Weight Normalization in Competitive Learning
269
Rumelhart, D. E., and Zipser, D. 1986. Feature discovery by competitive learning. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, J. L. McClelland and D. E. Rumelhart, eds., pp. 151-193. MIT Press, Cambridge, MA.
Received August 13, 1992; accepted June 22, 1993.
This article has been cited by: 2. Matthieu Gilson, Anthony N. Burkitt, David B. Grayden, Doreen A. Thomas, J. Leo Hemmen. 2010. Emergence of network structure due to spike-timing-dependent plasticity in recurrent neuronal networks V: self-organization schemes and weight dependence. Biological Cybernetics . [CrossRef] 3. Terry Elliott . 2003. An Analysis of Synaptic Normalization in a General Class of Hebbian ModelsAn Analysis of Synaptic Normalization in a General Class of Hebbian Models. Neural Computation 15:4, 937-963. [Abstract] [PDF] [PDF Plus] 4. T. Elliott , N. R. Shadbolt . 2002. Multiplicative Synaptic Normalization and a Nonlinear Hebb Rule Underlie a Neurotrophic Model of Competitive Synaptic PlasticityMultiplicative Synaptic Normalization and a Nonlinear Hebb Rule Underlie a Neurotrophic Model of Competitive Synaptic Plasticity. Neural Computation 14:6, 1311-1322. [Abstract] [PDF] [PDF Plus] 5. Randall C. O'Reilly . 2001. Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian LearningGeneralization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning. Neural Computation 13:6, 1199-1241. [Abstract] [PDF] [PDF Plus] 6. Gal Chechik , Isaac Meilijson , Eytan Ruppin . 2001. Effective Neuronal Learning with Ineffective Hebbian Learning RulesEffective Neuronal Learning with Ineffective Hebbian Learning Rules. Neural Computation 13:4, 817-840. [Abstract] [PDF] [PDF Plus] 7. Eugene Mihaliuk, Renate Wackerbauer, Kenneth Showalter. 2001. Topographic organization of Hebbian neural connections by synchronous wave activity. Chaos: An Interdisciplinary Journal of Nonlinear Science 11:1, 287. [CrossRef] 8. Jianfeng Feng , David Brown . 1998. Fixed-Point Attractor Analysis for a Class of NeurodynamicsFixed-Point Attractor Analysis for a Class of Neurodynamics. Neural Computation 10:1, 189-213. [Abstract] [PDF] [PDF Plus] 9. Lipo Wang. 1997. On competitive learning. IEEE Transactions on Neural Networks 8:5, 1214-1217. [CrossRef] 10. David Willshaw, John Hallam, Sarah Gingell, Soo Leng Lau. 1997. Marr's Theory of the Neocortex as a Self-Organizing Neural NetworkMarr's Theory of the Neocortex as a Self-Organizing Neural Network. Neural Computation 9:4, 911-936. [Abstract] [PDF] [PDF Plus] 11. DeLiang Wang. 1996. Primitive Auditory Segregation Based on Oscillatory Correlation. Cognitive Science 20:3, 409-456. [CrossRef] 12. Christopher W. Lee , Bruno A. Olshausen . 1996. A Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot StereogramsA Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot Stereograms. Neural Computation 8:3, 545-566. [Abstract] [PDF] [PDF Plus]
13. Barak A. Pearlmutter . 1995. Time-Skew Hebb Rule in a Nonisopotential NeuronTime-Skew Hebb Rule in a Nonisopotential Neuron. Neural Computation 7:4, 706-712. [Abstract] [PDF] [PDF Plus] 14. Mathew David Mackenzie. 1995. CDUL: Class directed unsupervised learning. Neural Computing & Applications 3:1, 2-16. [CrossRef]
Communicated by John Platt
A Probabilistic Resource Allocating Network for Novelty Detection Stephen Roberts Lionel Tarassenko Neural Network Research Group, Department of Engineering Science, Utziversity of Oxford, Oxford, UK
The detection of novel or abnormal input vectors is of importance in many monitoring tasks, such as fault detection in complex systems and detection of abnormal patterns in medical diagnostics. We have developed a robust method for novelty detection, which aims to minimize the number of heuristically chosen thresholds in the novelty decision process. We achieve this by growing a gaussian mixture model to form a representation of a training set of ”normal” system states. When previously unseen data are to be screened for novelty we use the sume threshold as was used during training to define a novelty decision boundary. We show on a sample problem of medical signal processing that this method is capable of providing robust novelty decision boundaries and apply the technique to the detection of epileptic seizures within a data record. 1 Introduction The detection of novelty (or abnormality) is a very important task in many diagnostic or monitoring systems. Sensors distributed around a plant or an engine, for example, will be used to monitor its performance. In safety-critical applications, it will be essential to detect the occurrence of an unexpected event as quickly as possible. This can best be done by a continuous on-line assessment of the novelty of successive sets of sensor readings. If we have a training set of data, 7 say, which represents the states of some system, we can ask a simple question when a previously unseen data vector is presented to us; does the vector coincide with any of the data points in 7?If not, we can decide that the new vector is novrl. This simplistic argument is, of course, flawed. The data set 7 would have to encode the entirety of ”normal” system states and all data would have to be noiseless. Nature does not provide us with such luxuries and “realworld” data sets are incomplete and noisy. We wish, therefore, to form some representation of our data set before we can decide on the novelty, or otherwise, of new input vectors. N w r a l Cornputation 6, 270-284 (1994)
@ 1994 Massachusetts Institute of Technology
Network for Novelty Detection
271
The most appropriate representation is an estimate of the probability density function (PDF) of the data set we are given. For most problems, we have no a priori statistical information regarding this data set, i.e., we neither know the number of generators within the data set nor their underlying functions. Parametric estimation methods, therefore, cannot be applied. Nonparametric techniques, such as Parzen windows (Parzen 1962), require the application of a windowing, or kernel, function sited at every sample in the training set and are thus computationally expensive for large data sets with the added drawback that they may also model any noise that is in the training data (Ripley 1992; Bishop 1991). A more parsimonious representation of the training data is hence required. Semiparametric methods using kernel estimation (where the number of kernel functions is less than the number of x E 7 but still large compared to the probable number of generators within the data) offer most of the advantages of nonparametric methods along with computational economy (Trdvh 1991; Ripley 1992). 1.1 Semiparametric Estimation. Semiparametric methods assume that the data set can be encoded, generally, as a parameterization of the statistical moments of the data (typically the first and second moments only). It should be noted that semiparametric estimation may be regarded as a clustering or partitioning of the data set in terms of a set of cluster means (first moments) and covariance matrices (second moments). The K-means algorithm clusters using a model where data points are ”hardpartitioned” into subgroups, that is, membership functions’ are either 1 or 0. Gath and Geva (1989) reported on a method for fuzzy partitioning of data using a variant of the maximum likelihood approach. Such an approach is more realistic in that membership functions are continuous variables and has much in common with statistical techniques for parameter estimation in a semiparametric model.* We assume that the underlying PDF may be represented as a mixture of a finite number of component densities. Much work has, historically, concentrated on the use of (nonorthogonal) gaussian functions as model generators for the component densities. The choice of gaussians is not an arbitrary one, indeed it is possible to show that a finite sum of gaussian responses can approximate, with arbitrarily small error, any PDF (the property of universal approximation) (Park and Sandberg 1991). 2 Theory
We consider some finite training set of data, 7,consisting of a sequence of &dimensional feature vectors, x = [ X I , . . . , x d I T such that x E 7 E rRd. Let ‘The degree of association between a data vector and some data cluster. 21n the framework of neural networks, there is an intimate link between kemel-based PDF estimation (or fuzzy data partitioning) and the formulation of the hidden layer of a radial-basis-function (RBF) network (Broomhead and Lowe 1988).
Stephen Roberts and Lionel Tarassenko
272
the number of members of 7 be N and the number of gaussian kernels at a given time be K . Each gaussian kernel has two sets of degrees of freedom, its vector location in input-space (the first moment) m, and a smoothing function (here generalized as a covariance matrix-the second moment) F. The response of the jth such kernel function to an input feature vector, x, is denoted as d(x; mi,F,) or, more simply, as @(x). Bayes‘ rule specifies that the mixture density may be written as K P(X) =
C P ( k ) P ( X I k)
(2.1)
k=l
where p ( k ) is the prior probability of selecting the kth kernel function and p(x I k) is the conditional density of x on the kth kernel. If a gaussian mixture model is used then p(x I k) is simply the response of the kth gaussian function in the mixture and equation 2.1 can be rewritten as K
(2.2) The set of priors is subject to the constraints, CkK_]p(k)= 1 and each 1 2 p ( k ) 2 0. Bayes’ theorem for densities specifies that the posterior probability of selecting the jth kernel function given feature vector x can be written as (2.3) k=l
The conventional approach for defining the free parameters of each gaussian function is to maximize the log-likelihood over all x E 7, namely3 N
C logp(xJ
(2.4)
I=]
Upon substitution from equation 2.2, solutions that maximize 2.4 can be sought, subject to the constraints imposed on the priors, using Lagrange multipliers. Details may be found in several papers (Dempster et al. 1977; Trdven 1991) and we quote the results here. The form of the free F,] that satisfy parameters of each dj may be specified by seeking [m,. N
log $(xi; m,, Fj)
=0
(2.5)
I=1
The solutions of equation 2.5, specifiying a gaussian component density of the form
~
3Assuming independence between data samples.
Network for Novelty Detection
273
are given by
and
(2.8)
Solutions to equations 2.7 and 2.8 require a recursive method for nonlinear optimization, for example the expectation-maximization (EM) algorithm (Dempster et al. 1977). An iterative version of this nonlinear optimization scheme, which resembles reinforcement learning, can also be formulated (Lowe 1991; T r d v h 1991; Neal and Hinton 1993). With the method described in this paper, we also solve equations 2.7 and 2.8 using reinforcement learning using an adaption, or learning, parameter (defining the "cooling curve" of the network), which is gradually reduced during training. Defining this parameter as 0 5 of < 1 the iterative equations are as follows: (2.9) $+I
=
F,,f + "f[P(j I X , ) ( X f - m,,f)(xf- A,,JT- F,fl (1 - 0,) + '',PO' I XI)
(2.10)
where x f is the data vector randomly selected from 7 at the tth iteration. Convergence of equations 2.9 and 2.10 to equations 2.7 and 2.8 as + 0 may be confirmed analytically (see Appendix). 2.1 Network Growth. The major problem with any form of novelty detection is the choice of an appropriate "novelty threshold." We have attempted to minimize the number of heuristically-determined thresholds and parameters and the main point of our paper is that once the network is fully trained, new data may be tested for novelty using the same threshold as was used to determine network growth during training. We define a test parameter, X(xf), such that
where xf is the input vector presented at time t during training and $( )
Stephen Roberts and Lionel Tarassenko
274
is the response of a gaussian kernel, such that JJ(X= m)= 1, namely
(2.12) We monitor the value of X(xt) at each data presentation and use it to decide whether the network should grow by one further gaussian unit, based on the following criterion: 4xt)
{ 5>
c1 +
c1
-+
growth no growth
Taking natural logarithms of equation 2.12 leads to a reformulated growth criterion of
where Qt = 2 ln(l/c,). The growth decision is thus based on monitoring the smallest Mahalanobis distance between xt and each m within the network. Network growth is thus similar to that proposed by other authors, being based on some distance metric (Sebestyen 1962; Platt 1991). Note that two data vectors may have identical minimum Euclidean distances, but differing minimum Mahalanobis distances depending on the statistics of the data distribution (Fig. 1). There are no kernel centers when the network is initialized. The first presentation of data causes the addition of a single kernel function, positioned at the site of the first feature vector. As usual, the training set is presented in random order and the first kernel function is adapted according to equations 2.9 and 2.10 until some X(x) becomes equal to, or falls below, the f 1 threshold. At this point a new kernel function is generated. The growth threshold, 0 5 ct 5 1 is initially set as to = 0, that is, growth will occur only if some x1 has a 0% chance of having been generated by the network at that time. The magnitude of f 1 increases linearly with time according to
The network thus starts fitting kernel functions to model the statistics of 7 coarsely; as tt increases, the precision of the fit becomes progressively better. The upper limit, emax, specifies the final precision to which the statistics of 'T are represented by the kernel functions. Note that if f,, is set to unity, then the network will grow until a kernel function is allocated to every data vector in 'T.4 41t should be noted that such a system could result in matrices.
ill-conditioned covariance
Network for Novelty Detection
275
0 New data vector " 5
1 No growth
i 1
-0.I
0.I
-0.2
0.2
0.3
0.4
0.5
-
-0.3 -0.4 -
Growth
-0.5 -
Figure 1: Growth based on a weighted distance criterion. There are three clear clusters in this artificial 2-D problem indicated by the covarianceellipses around each cluster center (filled square). When a new kernel function, index n say, is generated at time t, its position vector is set as
m, = xf The initial estimate for the covariance matrix, F,, is uniform and symmetric with each component F,, being defined as
where C = (m, - m,)(m, - m,)T in which 1 is the index of the kernel function that, prior to growth, had the largest posterior p(I I X I ) . To avoid having to make any assumptions about the distribution of the data in the training set, we take all the priors, p u ) , to be equal, their value being updated after growth such that 1
for all j 5 n, where K f is the current number of kernel functions.
Stephen Roberts and Lionel Tarassenko
276
2.1.1 Local "Cooling". There is one remaining problem with such network growth, especially if it occurs late during the training process. If ( t , the adaption gain, is a scalar (i.e., equal and decreasing globally for all kernel functions) then a new kernel function may not be sufficiently "plastic" to adapt adequately to the statistics of the region of input space to which it is sensitive. This problem can be avoided by allowing both 0 , and the time index t, to be vectors, with components for each kernel function. The dimensions of a and t thus grow with the number of kernel functions. As detailed in the Appendix the adaption gain for the jth kernel function is
where t, is the component of the time vector for kernel function j . When a new kernel function, index n, is added, we set t,, to zero. This ensures that the new kernel function is "plastic" ((],, is large) and its free parameters can converge with minimal disturbance to all other kernel functions (remember that, in the intervals between new kernel functions being generated, all existing kernels continue to be adapted using equations 2.9 and 2.10). Training ends when min{t,} reaches some predefined limit, ti say, where tl >> N, N being the number of x E 7, such that we may assume that every gaussian has seen every member o f 7 at least once. 2.1.2 Choice of Parameters. In implementing the algorithm we have imposed several model constraints. In addition to making all kernel priors equal we can let the covariance matrices, F, be diagonal, without loss of generality. This leads to a considerable reduction in computing overheads when the dimensionality of input space is large, since matrix inversion then becomes a simple task. The choice of the parameters T,, r,,, and 0 , )is not critical, save t h i t M V require the rate of increase of c, to be small compared to the i n i t i r i l r,itc> of decrease of (r,. We choose r, to be equal to N, the number of x i T , 5 0 that c, reaches its maximum value after one iteration through the tr,iining data, and r(,is set to unity. In all results presented here 110 0 7. ~
2.2 Novelty Detection. Once training is complete we know that no member x, of 7 has X(x,) 5 fmax. On presentation of some previously unseen test data, u say, we may calculate X(u) via equations 2.11 and 2.12. Our novelty criterion makes use of the threshold f,,,, such that, if
{5
cmax +
~ ( u ) >, , ,c
+
u is a novel vector u is not novel
Network for Novelty Detection
277
3 An Example from Medical Signal Processing
In earlier papers (Roberts and Tarassenko 1992a,b), we have described a method for analyzing the human electroencephalogram (EEG) during sleep using a radial-basis-function (RBF) network. Our results have shown that it is possible to classify sleep states from patterns of electrical activity recorded during sleep from the cortex using scalp electrodes. There are many patients, however, whose EEG records contain abnormal signal features (epileptics, for example). With these patients, unexpected events, not represented in the training set, are likely to occur during the patient’s EEG recording. It is very important clinically to identify these novel input vectors as they occur. To validate the method of novelty detection described in this paper for such an application, we have constructed an artificial test problem using our EEG data, in which all the input vectors recorded during times of wakefulness5 are deliberately excluded from the training database. Such a database, consisting of 3644 10-dimensional input vectors,6 was assembled and a representation of this database was formed using test thresholds of, , ,c = 0.1, which led to the growth of a total of 97 gaussian functions and cmax = 0.2, which gave 258 gaussian functions. Figure 2 shows the time course of K,, the number of gaussian funcThe trained tions during training for the two different values of .,f networks were then presented with patterns from a previously unseen EEG recording (Fig. 3a and b). At the beginning of this recording, the patient is not asleep and this is clearly indicated by the value of X(x,) falling below the novelty thresholds in each case, save for three short occurrences of drowsy sleep. As the subject falls asleep (t = 501, the data are no longer identified as being novel with respect to the training data. There are subsequent episodes during which X(x,) decreases to lower values, but these correspond either to body movements (during which the subject’s EEG state is that of wakefulness, for very short periods of time) and a recording dropout (end of record). The training data did not contain any dropouts, hence its inclusion as a “novel feature.” It is important to make the point that novelty detection does not specify a particular class of input vectors, merely the fact that these vectors are novel with respect to the training database. We see from Figure 3 that as long as enough gaussian functions are grown to represent the complexity of the problem, the novelty decision is robust when the value of F,,,, and hence the number of kernel functions, is altered.
sAs defined by the consensus of three human scorers working from a rule-based scoring system. hAs described in Roberts and Tarassenko (1992a,b), the EEG signal is parameterized on a time scale of 1 sec using a 10th-order Kalman filter.
Stephen Roberts and Lionel Tarassenko
278
250
200
100
SO
0
T-
Figure 2: Growth of gaussian functions during training on nonwake EEG database (a) f m a x = 0.1 and (b) Fmax = 0.2. 3.1 Detection of Epileptic Seizures. We have now carried out a pilot study of our novelty detection algorithm on an EEG record known to contain epileptic seizures (i.e., “abnormal events”). From the 20 min of available data, a training database of 1000 ”normal” EEG segments was constructed. The algorithm described in this paper was used to form a gaussian mixture representation of this data, using a threshold of cmax = 0.1. A total of 195 gaussian kernels was grown by the algorithm. Figure 4 shows the time course of X(x,) (upper trace) along with the error term from the 10th-order Kalman filter, the coefficients of which are used as the input representation to the network (lower trace). The four major peaks (A, B, C, and D) in the novelty trace correspond to epileptiform activity, which is also shown u p by the Kalman error term. There are two revealing areas of discrepancy between the two plots, however: 1. The peak at E in the Kalman error term ( t zz 950 sec) corresponds to high-frequency muscle artifact. This type of artifact is present elsewhere in the training database and the novelty threshold is there-
Network for Novelty Detection
279
Figure 3: Time course of X(x) during presentation of unseen test data including data novt.l to the system (sections of wake-state EEG) - (a) 97 gaussian units, fmaX = 0.1 and (b) 258 gaussian units, f,,,, = 0.2. The novelty decision threshold is shown in each case.
fore not crossed (the reason for the discontinuity in the Kalman error term is the exceptionally large amplitude of the artifact). 2. Between the first peak at A ( t =: 75 sec) and the second one at B ( t = 400 sec), there are four smaller peaks (1, 2, 3, and 4) that can be identified from the novelty trace but not from the Kalman filter error term. On returning to the original EEG record, it is quite clear that these short bursts of "novel" activity correspond to short sections of signal which do exhibit seizure-type waveforms.
The above represents only our first results from a short pilot study but the detection of novel events between the first two seizures does represent a promising beginning.
Stephen Roberts and Lionel Tarassenko
280
I 40-
35 -
-
t (seconds)
Error term
D ~
A
B
E
C
30 25 20 I5
-
10 -
501
n
200
400
600
800
IWO
I200
Figure 4: Time course of X(x) during presentation of an EEG record with epileptic seizures (upper trace) and Kalman filter error term from the same record (lower trace). 4 Conclusions
In this paper, we have introduced a method for the detection of novelty, based on a gaussian mixture model. The main advantage of our method lies in the fact that the detection of novelty is robust in thechoiceofthreshold, provided that a sufficient number of kernel functions is used to build u p the representation of the training data set. This was demonstrated on a medical signal processing problem, specifically constructed from real data for the purposes of this paper. We are now using our method as a screening tool for the detection of unexpected abnormalities in EEG recordings. We are also extending its use to other monitoring systems such as plasma diagnostics (Bishop et al. 1993). In the implementation of the model presented here we have assumed that all kernel priors are equal. We may, however, allow the priors to be free parameters of the model and attempt to evaluate them using nonlinear optimization. The kernel priors may be regarded as an unbiased
Network for Novelty Detection
281
estimate of kernel posteriors, given all the data set (TrBven 19911, namely (4.1)
a solution for which may be sought by reinforcement learning (see Appendix): Pci)f+l =
P(jh + {Yf [PV I XI)
-
Po’),]
(4.2)
If priors are allowed to be free parameters it is possible to prune the trained network by removal of kernel functions with small priors, since these kernels typically correspond to outliers within T. The removal of outliers from a training set, however, is a complex problem in pattern recognition (Ripley 1992; Beckman and Cook 1983) and this strategy would have to be investigated in detail. The algorithm described in this paper does have some similarities with previous work by others, for example, the pioneering work of Sebestyen (1962) who describes a method to build a representation of data in input space using gaussian kernels of uniform width centered on a subset of the training patterns. New input patterns are then clussified according to their Euclidean distances from the cluster means, each of which has a class label attached to it. We believe that there are two weaknesses associated with this approach: first, the use of the same width for all clusters and of the Euclidean metric removes any information about the way the data are distributed around each center. Second, an algorithm that builds u p a representation of data according to its density in input space is unlikely to be optimal for classification when one is more interested in the boundaries between classes rather than the distribution of data in input space per se. Class labels represent important a priori information that must be built into the data encoding by the network to achieve optimal classification performance. Our algorithm can be adapted, however, to provide a validation parameter for the outputs of an RBF network if the gaussian mixture representation is used as the hidden layer of the network. In addition to the a posteriori class membership probabilities that can be estimated at the output of the RBF network, we can then decide whether the input vector lies within the hidden-layer representation of the training data. Neural networks, like all function approximation methods, cannot extrapolate and the X(x,) parameter can therefore serve as validation of the network outputs.
5 Appendix We consider a stochastic gradient descent in some parameter 8, of the form 0, rr,(hfit - 0,) (5.1) 8f+1 = (1- ( Y f ) rrfht
+
+
Stephen Roberts and Lionel Tarassenko
282
where frt is a monotonically decreasing parameter, 0 5 f t t < 1 and itis some input parameter specified at time t . The stable point is when
(AO)= (Ht+1
- 0,) = 0
where (.) denotes an expectation value. Upon rearranging equation 5.1 we obtain (5.2) The stable point of this system is hence when
(hri,- ktHt) = 0 and at this point a limit value of Ht
= HL
is obtained, hence
(Atit) - 4.(h)= 0 or N
(5.3) lkl
where N is the total number of available it. Inspection of equations 2.9 and 2.10 shows that they are in the form of equation 5.1, where kt = p(j I xt), 0 is m, or F, and it is xt or ( x t - m,,t)(xt m,,t respectively. Equation 5.3 thus corresponds to a limit convergence to equations 2.7 and 2.8. Note that for k = 1 equation 5.3 reduces to
which is of the form of equation 4.1 if it = p ( j I xt). We must, however, show that the stochastic algorithm reaches this fixed point. Equation 5.2 is of the form AH = Etf(Ht) stochastic approximation of which may be achieved by means of the Robbins-Munro algorithm (Duda and Hart 1973). The step size of the algorithm is given here by
The convergence conditions of the Robbins-Munro algorithm are satisfied, for bounded training data, by (f
> 0;
Vt
Network for Novelty Detection
283
and
If we set
then Et
=
(YO
t
+ r,, + tro(ht
-
1)
The first of these conditions is met for 0 < 00 < 1, 0 5 At 5 1 and r, 2 I as r, cro(h, - 1) > 0 for all t . The other two conditions are also met, as
+
oc
5+ f
(kl
r,,
+ cro(h, - 1) =cc
and x
284
Stephen Roberts and Lionel Tarassenko
Neal, R. M., and Hinton, G. E. 1993. A new view of the EM algorithm that justifies incremental and other variants. Biometrika, submitted. Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basisfunction networks. Neural Comp. 3(2), 246-257. Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065-1076. Platt, J. 1991. A resource-allocating network for function interpolation. Neural Conip. 3, 213-225. Ripley, B. D. 1992. Statistical aspects of neural networks. Proc. SemStat, Denmark, April 1992. Roberts, S., and Tarassenko, L. 1992a. A new method of automated sleep quantification. Med. Bid. Eng. Comput. 30(5), 509-517. Roberts, S., and Tarassenko, L. 1992b. The analysis of the sleep EEG using a multi-layer network with spatial organisation. ZEE Proc.-F 139(6),420-425. Sebestyen, G. S. 1962. Pattern recognition by an adaptive process of sample set construction. IRE Trans. Info. Theory IT-8, S82-S91. Trsven, H. G. C. 1991. A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. l E E E Trans. Neural Networks 2(3), 366-377.
Received November 16, 1992; accepted June 2, 1993.
This article has been cited by: 2. Anna Koufakou, Michael Georgiopoulos. 2010. A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Mining and Knowledge Discovery 20:2, 259-289. [CrossRef] 3. Jia-Jun Wong, Siu-Yeung Cho. 2010. A face emotion tree structure representation with probabilistic recursive neural network modeling. Neural Computing and Applications 19:1, 33-54. [CrossRef] 4. Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka. 2007. Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering 19:5, 631-645. [CrossRef] 5. Carolina Sanchez-Hernandez, Doreen S. Boyd, Giles M. Foody. 2007. One-Class Classification for Mapping a Specific Land-Cover Class: SVDD Classification of Fenland. IEEE Transactions on Geoscience and Remote Sensing 45:4, 1061-1073. [CrossRef] 6. Victoria J. Hodge, Jim Austin. 2004. A Survey of Outlier Detection Methodologies. Artificial Intelligence Review 22:2, 85-126. [CrossRef] 7. Finn �rup Nielsen, Lars Kai Hansen. 2002. Modeling of activation data in the BrainMap? database: Detection of outliers. Human Brain Mapping 15:3, 146-156. [CrossRef] 8. S.J. Roberts. 2000. Extreme value statistics for novelty detection in biomedical data processing. IEE Proceedings - Science, Measurement and Technology 147:6, 363. [CrossRef] 9. V. Ramamurti, J. Ghosh. 1999. Structurally adaptive modular networks for nonstationary environments. IEEE Transactions on Neural Networks 10:1, 152-160. [CrossRef] 10. S. J. Roberts. 1999. Novelty detection using extreme value statistics. IEE Proceedings - Vision, Image, and Signal Processing 146:3, 124. [CrossRef] 11. C. C. Holmes , B. K. Mallick . 1998. Bayesian Radial Basis Functions of Variable DimensionBayesian Radial Basis Functions of Variable Dimension. Neural Computation 10:5, 1217-1233. [Abstract] [PDF] [PDF Plus] 12. Tin-Yau Kwok, Dit-Yan Yeung. 1997. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8:3, 630-645. [CrossRef] 13. Stephen J. Roberts, Will Penny. 1997. Neural networks: friends or foes?. Sensor Review 17:1, 64-70. [CrossRef] 14. Lionel Tarassenko, David A. Clifton, Peter R. Bannister, Steve King, Dennis KingNovelty Detection . [CrossRef] 15. Charles R. Farrar, Keith Worden, Janice Dulieu-BartonPrinciples of Structural Degradation Monitoring . [CrossRef]
Communicated by Halbert White
Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima William Finnoff Siemens AG, Corporate Research and Development, Otto-Hahn-Ring 6,D-8000 Munich 83, Germany
In this paper we discuss the asymptotic properties of the most commonly used variant of the backpropagation algorithm in which network weights are trained by means of a local gradient descent on examples drawn randomly from a fixed training set, and the learning rate 7) of the gradient updates is held constant (simple backpropagation). Using stochastic approximation results, we show that for r/ + 0 this training process approaches a batch training. Further, we show that for small 7) one can approximate simple backpropagation by the sum of a batch training process and a gaussian diffusion, which is the unique solution to a linear stochastic differential equation. Using this approximation we indicate the reasons why simple backpropagation is less likely to get stuck in local minima than the batch training process and demonstrate this empirically on a number of examples. 1 Introduction The original (simple) backpropagation algorithm, incorporating pattern for pattern learning and a a constant learning rate 'r] E (O.co),remains in spite of many real (and imagined) deficiencies the most widely used network training algorithm, and a vast body of literature documents its general applicability and robustness. In this paper we will draw on the highly developed literature of stochastic approximation theory to demonstrate several asymptotic properties of simple backpropagation. The close relationship between backpropagation and stochastic approximation methods has been long recognized, and various properties of the algorithm for the case of decreasing learning rate q n + 1 < rlrI, n E N were shown for example by White (1989a,b) and Darken and Moody (1991). Hornik and Kuan (1991) used comparable results for the algorithm with constant learning rate to derive weak convergence results. In the first part of this paper we will show that simple backpropagation has the same asymptotic dynamics as batch training in the small learning rate limit. As such, anything that can be expected of batch Neural Computation 6, 285-295 (1994)
@ 1994 Massachusetts Institute of Technology
William Finnoff
286
training can also be expected in simple backpropagation as long as the learning rate of the algorithm is very small. In the special situation considered here [in contrast to that in Hornik and Kuan (1991)l we will also be able to provide a result on the speed of convergence. In the next part of the paper, gaussian approximations for the difference between the actual training process and the limit are derived. It is shown that this difference (properly renormalized) converges to the solution of a linear stochastic differential equation. In the final section of the paper, we combine these results to provide an approximation for the simple backpropagation training process and use this to show why simple backpropagation will be less inclined to get stuck in local minima than batch training. This ability to avoid local minima is then demonstrated empirically on several examples. 2 Notation and Conventions
+
In the following we denote with R the real and with N the natural numbers. For I E N, we denote I' = [-1. l]', further, we will assume that R' is equipped with the norm 11 - 11 (and induced metric) defined b setting for x,y = (x'. . . . ,XI), (y', . . . ,y') E R', IIx - yII = EL=l(x" - ytC1*.In the following all random elements that appear are assumed to be defined on the same probability space ( R , 3 , P). For a topological space X, B ( X ) will denote the sigma algebra of the Bore1 sets. Any references to measurability in such a space will then be with respect to this structure. For a random element K : R -+ X, we denote with L { K }the law of K [i.e., the probability measure on B ( X ) defined for every A E B ( X ) by setting L{h-}(A)= P ( K - ' ( A ) ) ] and , for a further sigma algebra d c 3, P(KIA) denotes the conditional probability and E(hid) denotes the expectation of h- given A. Finally, let L j , k E N and let V : R' x R' -+ Rk,(z,x) H V(2.x) be a continuously differentiable function. Then, V , = V2(., .) will denote the partial derivative of V with respect to x. Conditions. In the remainder of the paper, we assume that m. k , 9 E N are arbitrarily chosen but fixed. Then, setting d = m(9 + 1) ( k 1)9, we assume to be given a three times continuously differentiable (squashing) function d : R -+ I, x H $(x) such that I(d/dx)J41I 1 for j = 0, 1, 2, 3. l=l,.,.,k+l In the following we will denote with 7; = (-$')~~,',;':;;~l, [3; = (p') /I=', ...,q ' and H = 0, = ( 0 1 ) 1 = l , , . , , d = (y, p;) elements of the space R"' x Rq+', Rkflx Rq, and Rd, respectively, and define the parametric version of a single hidden layer network activation function with k inputs, m outputs, and 9 hidden units
+ +
f : R" x Rk + R"'.( 0 .X)
+
r'(0,x), . . . .f"'(0.x)]
(2.1)
Backpropagation Algorithm
287
by setting for x E Rm,K = ( x , . . . . . x k , l), 0 = (7;.b;),and u
=
1.. . . . rn,
(2.2)
where XT denotes the transpose of 52. We then define the parametric error function
U : Rm x Rk x Rd -+ R,(y. X, 0) H IIy - f ( 0 ?x )
(2.3)
Further, let [(y,, x,,)],~N be a sequence of independent, identically distributed (i.i.d.) random variables in lmx lk, consisting of targets ( Y , , ) , ~ ~ and inputs ( x , , ) " ~ and ~ , / L a Bore1 probability measure on I" x Ik so that for every n E N, L { ( y , . x , ) } = 11. Let 00 be a further random variable taking values in Rd and independent of the ( ~ , . X , ~ ) S , such that Il00((5 B. Finally we denote with F,,the sigma algebra generated by the random variables O0, (y,, x , ) , i = 1.. . . ,n and define for every fl E R", (y. x ) E l m f k , p(y, x , 0) = -U@(y.x . 0) and h ( 0 ) = J,.,+I-p(y, x , 0)d//(y.x ) . We then note that for every 0 E Rd,
E[P(Yn+ly xn+1. 0)
I3n1=
L,,+,
P(Y.X. g)dlL(y. X)
[= h(0)1
(2.4)
(Robbins-Munro condition). The hypotheses are tailored to those given in simple backpropagation. There, one has a fixed finite set of training examples A = { ( Y l . X I .) . . . , (YK,XK)}for some K E N, which (eventually through a linear transformation) can be assumed to take values in [--l,l]'"+".By repeatedly drawing examples at random from this set one creates a sequence of i.i.d. random variables [(y,,, x,,)],~N with L { ( ~ , , , X=, ~p ) = } p4, where pa denotes the "empirical measure" belonging to the training set A defined as follows: Denoting for a point z E Im+k by &z the Dirac measure on the point z [i.e. &z(A)= 1 if z E A, & z ( A )= 0 otherwise for A E f?(lm+k)], then set ~4 = (1/K) C,",l E(v,,x,).The function p(y. x . 0) corresponds to the local gradient of the error U(y,x , 0) for a network with weights 0 by target y and input x . The function h then corresponds to the cumulative gradient of the error averaged over all the examples in the training set. Finally, our hypotheses are also fulfilled in the case of online learning with uniformly bounded training examples generated by independent drawings from the distribution 11. 3 Approximation with the Ordinary Differential Equation
In the following we will be considering the asymptotic properties of network training processes induced by the starting value 00, the gradient (or direction) function p, the learning rate T,J, and training examples
William Finnoff
288
( ~ , ~ . x ~ jOne , , ~defines ~. the discrete parameter process H of weight updates by setting
=
0'' = (19ij,,Ez+
(3.1)
and the corresponding continuous parameter process ting
[ P I (
t ) ] , , [ ),~ by , ~ set-
H:;(t)= 15',':-, + [t - ( n - ljt/]p(y,l.~n.0,'-1j for t E [(n - 1)q>n r ~ ) n. E N (3.2) The first question that we will investigate is that of the "small learning rate limit" of the continuous parameter process This refers to the natural and obvious question that arises when one considers the family of processes I9q as to the limiting properties of the family (if any) for 71 -+ 0. In the following we will show that the family of (stochastic) processes ( H T ' ) T , > ~converges ~ with probability one to a limit process H, where denotes the solution to the cumulative gradient equation,
e
,
H(tj = 00
+ Ju'h[O(s)]ds
(3.3)
which, by following Lemma 3.1 is unique and defined on the entire interval (0.00). Here, for 00 = n = constant, this solution is deterministic. This result corresponds to a "law of large numbers" for the weight update process, in which the small learning rate (in the limit) averages out the stochastic fluctuations. Central to any application of many stochastic approximation results is the derivation of local Lipschitz and linear growth bounds for p and h. That is the subject of the following: Lemma 3.1. i. There exists a constant K > 0 so that SUP(^,^)^,,,,+^ IIp(y. x , 8) 11 5 + Il0ll) and Ilh(0)II I K ( 1 + Ilf4l). ii. For every G > 0 there exists a constant LG so that for any O,8 E [-G. GId, SUp(y,*)E,m+h IIP(y,x,e)-l,(y,x.e)ll I L G I l H - ~ l l m d1") -h@Il I Lcll~-flll.
w
Proof. i. This follows from the fact that for every (y,xj E lm+h,the coordinates of the vector valued function p(y, x, 8) are either uniformly bounded, or the product of a bounded function and one of the coordinates of the weight vector 8. ii. The calculations on which this result are based are tedious but straightforward, making repeated use of the fact that products and sums of locally Lipschitz continuous functions are themselves locally Lipschitz continuous. As a consequence of the local Lipschitz condition on h, for every starting point Bo one can always find a unique local solution to equation 3.3 by finding the fixpoint to an operator g -+ G(g), where G(g) is a function on an interval (0,TI defined by setting G(g)( t) = 190 Ji h(g(s)]dsfor every
+
Backpropagation Algorithm
289
t E [O, TI. By choosing the interval [O. T ] sufficiently small, the Lipschitz condition ensures that G is a contraction, and the existence of a unique fixpoint is guaranteed. Under the linear growth condition given by (i) of Lemma 3.1, this can be extended to a (global) solution on the entire time interval [O. m ) in a unique fashion. For details consult Walter (1976, pp. 48-50). We collect this in the following:
Corollary. For every to equation 3.3.
6'0 E
Rd there exists a unique solution g = [ g ( t ) ] p s c )
- At this point we would like to emphasize that the path of the process
6' very much depends on the starting point 6'0 and that our conditions are not sufficient to ensure that the deterministic process 8 will become constant in the limit. The most obvious situation in which this may occur is when the target y is defined as a discontinuous function of the inputs. For example, assume that the sequence ( x , ) , , ~ N is generated by random drawings from a uniform distribution on the interval [-1.11 and yn = f ( x n ) for every n E N, wheref(x) = 1 if x 2 0 and = -1 otherwise. Then, the training process 8 for a network consisting of a single hidden unit with a hyperbolic tangent squashing function would never converge to a fixed value, since the error can always be reduced by increasing the size of the weight from the input to the hidden unit. To ensure that the deterministic training process 8 converges to a fixed value without placing further assumptions on the data, it will generally be necessary to modify the error function. One such modification is to add a term that is quadratic in the weights, that is, instead of the gradient of U(y. x. 6') determining the training process, one uses U x ( y , x 6') , = U(y, x.6') + XI(6'1(*, leading to training with "weight decay" and decay rate X > 0. Under our assumption that the data only takes values in 1" x Ik, this modification ensures that the error goes to infinity for 6 .+ 03. Further, if one defines px and hx in an analogous fashion, Lemma 3.1 and its Corollary still hold for the modified functions. Using classical results, it can be shown that the corresponding training process 8, with weight decay will then converge to a constant value 6: for t + 00 [see, for example, Krasovskii 1963, Th. 5.3, p. 311. Since the judicious use of weight decay can often improve generalization performance (see Finnoff et al. 1992) this would seem to be a reasonable modification. Using Lemma 3.1 and its Corollary, we are now in a position to present the results on the probability of deviations of the process 6"' from the limit 8. Theorem 3.2. Let r, 6 E (0. m). Then there exists a constant B, (whichdoes not depend on 7) so that
i. E(supsj, Il6'"s) ii. P(sups<, - IlO(s)
8(s)Il2) 5 B,v. - $(s)ll > 6)5 (l/hz)Brr/. -
Proof. The first part of the proof requires that one finds bounds for P ( t ) and e ( t ) for t E [O. This is accomplished using the results of
4.
William Finnoff
290
Lemma 3.1 and Gronwall’s Lemma (see Walter 1973, p. 214). This places independent bounds on B,. The remainder of the proof uses Theorem 9, 51.5, Part I1 of Benveniste et al. (1987). The required conditions (Al), (A2) follow directly from our hypotheses, and (A3), (A41 from Lemma 3.1. Due to the boundedness of the variables ( y f lX,),,~N . and 80, condition (A5) is trivially fulfilled. 0 It should be noted that the constant B, is usually dependent on r and may indeed increase exponentially (in r ) unless it is possible to show that the training process remains in some bounded region for t -+ 00. This is not necessarily due exclusively to the difference between the stochastic approximation and the discrete parameter cumulative gradient process, but also to the error between the discrete (Euler approximation) and continuous parameter versions of 3.3. 4 Gaussian Approximations
In this section we will give a gaussian approximation for the difference between the training process 8’’ and the limit 6. Although in the limit these coincide, for ‘7 > 0 the training process fluctuates away from the limit in a stochastic fashion. The following gaussian approximation provides an estimate for the size and nature of these fluctuations depending on the second order statistics (variance/covariance matrix) of the weight update process. Define for any t E [0,m),
W ( t )=
e y t )- e(t)
Jsi
(4.1)
Further, for i = 1,.. . , d we denote with p’(y,x.0) and h’(0) the ith coordinate vector of p ( ~x ., 0) and h ( 8 ) , respectively. Then define for i , j = 1 . .. . , d , B E Rd R”(0)= / m + h [ P l ( Y : X . S ) $ ( Y I X I 8) - h ’ ( w ’ ( 0 ) l d / 4 Y ? x )
(4.2)
Thus, for any n E N, 8 E Rd, R(8) represents the covariance matrix of the random element p(yn,xn,B). We can then define for the symmetric matrix R ( 0 ) a further Rdxdvalued matrix R’12(0) with the property that R(8) = R’/2(8)[R’12(8)]’. We now present the gaussian approximation result for constant learning rate.
Theorem 4.1. Under the assumptions given above, the distributions of the processes @, 71 > 0, converge weakly (in the sense of weak convergence of measures, see Billingsley 2968) for 7 + 0 to a uniquely defined measure L { 0 } , where 8 denotes the solution to the following stochastic differential equation
8 ( t )=
f h0[8(s)]8(s)ds+ l R 1 / 2 [ 8 ( s ) ] d W ( s ) 0
(4.3)
Backpropagation Algorithm
291
hs(0) = dh(Q)/dOand Wdenotesa standard d-dimensional Brownian motion (i.e., with covariance matrix equal to the identity matrix). Proof. The proof here uses Theorem (71, $4.4, Part I1 of Benveniste et al. (1987). As noted in the proof of Theorem 3.2, under our hypotheses, the conditions (Al)-(A5) are fulfilled. Define for i , j = 1, . .d, ( y . x ) E lm+h, 0 E Rd, w”(y,x,0) = p’(y,x,0)lj(y9x , 0) - h’(O)h’(O),and v = p. Under our hypotheses, h has continuous first- and second-order derivatives for all 0 E Rd and the function R = (R’’)l,,=l, ,d as well as W = ( w ” ) ~ , =,d~ , fulfill the remaining requirements of (A81 as follows: (A8,i) and (A8,ii) are trivial consequence of the definition of R and W. Finally, setting p 3 = p4 = 0 and 11 = 1, (A8,iii) then can be derived directly from the 0 definitions of W and R and Lemma (5.1)ii). For further results concerning gaussian approximations of stochastic approximation processes consult Bouton (1985), Kushner and Schwartz (1984), and Metivier and Priouret (1987). 5 Simple Backpropagation’s Resistance to Local Minima
In this section we combine the results of the two preceding sections to provide a gaussian approximation of simple backpropagation. Recalling the results and notation of Theorem 3.2 and Theorem 4.1 we have for any t E 10, 001,
+
Q y t )= e(t) T p * e ( t )
+O ( 7 p )
(5.1)
As can be seen from this approximation, for “very small” learning rate 7, simple backpropagation and batch learning will produce essentially the same results since the stochastic portion of the process (controlled by vl/*) will be negligible. Otherwise, there is a nonnegligible stochastic element in the training process which can be approximated by the gaussian diffusion 4.’ This diffusion term gives simple backpropagation a “quasiannealing” character, in which the cumulative gradient is continuously perturbed by the gaussian term allowing it to avoid getting stuck at local minima having small, shallow basins of attraction. The imperviousness of simple backpropagation to local minima, which is part of neural network “folklore” is documented here in four examples (see figs. 1-4). A single hidden layer feedforward network with 4 = tanh, 10 hidden units, and one output was trained with both simple backpropagation and batch training using data generated by four different models. The data consisted of pairs (yl.xl),i = 1,.. . T, T E N with targets yl E R and inputs xI = (x,’, . . . , x f ) E [-1. 1IK,where yl =
e
.
‘It should be noted that the rest term in 5.1 will actually have a better convergence ) . calculation of exact rates, though, would require a rate than the indicated o ( ~ ’ ’ ~ The generalized version of the Berry-Ess6en theorem. To our knowledge, no such results are available that would be applicable to the situation described above.
William Finnoff
292
Error x 10-3 -
simple BP ...................... BatchL.
Epochs
0.00
100.00
200.00
300.00
Figure 1: Net.
Error x 10-3 simple BP Batch L.
.................... 800.00600.00400.00-
Epochs
0.00
100.00
200.00
300.00
400.00
Figure 2: Product mapping.
+
gi[xl.. . . . 4)] u,, for j . K E N. The first experiment was based on an additive structure g havinp the following form with j = 5 and K = 10, g[(x:. . . . . $)]= C:=,sin((?$), ok E R. The second model had a product structure g with j = 3, K = 10, and g[(x,'.. . . ,x;)] = niplxf, ak E R. The third structure considered was constructed with j = 5 and K = 10, using sums of radial basis functions (RBFs) as follows: g[(x:. . . . $)] = C~-l(-l)'exp[C:=l(trk~' - $)2/2a2]. Here, for every I = 1 , .. . , 8 the vec. . . rr5s') corresponds to the center of the RBF. These tor parameter points were chosen by independent drawings from a uniform distribution o n [ - 1 1Is. The final experiment was conducted using data generated by a feedforward network activation function. The network used for this task had 50 input units, 200 hidden units, and 1 output. The weights ( 0 ' 1 ' ~ .
~
BackpropagationAlgorithm
293 -
Error x 10-3 simple BP B a ~ h . :...... L
600.00 500.00 400.00
Epochs
0.00
100.00
200.00
300.00
Figure 3: Sums of Sins.
Error x 103
600.00
simple BP Batch L.
....................
500.00 400.00 Epochs
0.00
100.00
200.00
300.00
Figure 4: Sums of RBFs. from the input to hidden layer and hidden layer to output were generated by independent drawings from a standard Cauchy distribution. The parameter settings for the experiments are collected in Table 1. These examples were chosen to display a wide range of complexity based on the degree of nonlinearity in the structure, the number of variables in the problem, and the strength of the noise in the disturbance term. Originally they (and a number of other examples) were produced to test the performance of complexity control methods (weight pruning, the use of complexity penalty terms, and stopped or crossvalidation training) commonly used in the neural network community to prevent overfitting of data. In particular, the learning rates used here were those chosen for the original complexity control experiments, which were found by testing across a small range of values ~0.1,0.05,0.01,0.005).
William Finnoff
294
Table 1: Parameter Settings of the Experiments.
Structure
Noise variance Learn step
Sum of RBFs
0.6
0.01
3 n k = O xk
0.2
0.05
50/200/1
0.3
0.01
c:=osin((tkxk)
0.6
0.05
The final choice was made based on the best performance achieved by the network on a validation set of examples during the training process (stopped training) and appears to be a robust and generally applicable method for determining a reasonable constant learning rate. We note that this choice was in no way influenced by the question as to whether the training process might be more or less likely to get stuck at local minima of the error function. For more details concerning the construction of the examples used here consult Finnoff et al. (1992). For each model three training runs were made using the same vector of starting weights for both simple backpropagation and batch training. As can be seen, in all but one example the batch training process got stuck in a local minimum producing much worse results than those found using simple backpropagation. Due to the wide array of structures used to generate data and the number of data sets used, it would be hard to dismiss the observed phenomena as being example dependent. In conclusion we remark that simple backpropagation, with a constant learning rate chosen as indicated above, is notable for its robustness, general applicability, and ease of implementation. Indeed, these properties may be sufficient to make simple backpropagation competitive with considerably more sophisticated algorithms, and justify its widespread application.
References Benveniste, A., Metivier, M., and Priouret, P. 1987. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin. Billingsley, P. 1968. Convergence of Probability Measures. John Wiley, New York. Bouton, C. 1985. Approximation Gaussienne d’algorithmes stochastiques a dynamique Markovienne. Thesis, Paris VI (in French). Darken, C., and Moody, J. 1991. Note on learning rate schedules for stochastic optimization. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Finnoff, W., Hergert, F., and Zimmermann, H. G. 1992. Improving generalization by nonconvergent model selection methods. Neural Networks 6(6), 771-783.
Hornik, K., and Kuan, C. M. 1991. Convergence of learning algorithms with constant learning rates. IEEE Trans. Neural Networks 2, 484-489.
Krasovskii, N. 1963. The Stability of Motion. Stanford University Press, Stanford.
Kushner, H. J., and Schwartz, A. 1984. An invariant measure approach to the convergence of stochastic approximations with state dependent noise. SIAM J. Control Optim. 22(1), 13-27.
Métivier, M., and Priouret, P. 1987. Théorèmes de convergence presque-sûre pour une classe d'algorithmes stochastiques à pas décroissant. Probab. Theory Relat. Fields 74, 403-428 (in French).
Walter, W. 1976. Gewöhnliche Differentialgleichungen. Springer-Verlag, Berlin.
White, H. 1989a. Some asymptotic results for learning in single hidden-layer feedforward network models. J. Am. Stat. Assoc. 84(408), 1003-1013.
White, H. 1989b. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Received August 21, 1992; accepted May 17, 1993.
Communicated by Ronald Williams
Relating Real-Time Backpropagation and Backpropagation-Through-Time: An Application of Flow Graph Interreciprocity

Françoise Beaufays
Eric A. Wan
Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055 USA

We show that signal flow graph theory provides a simple way to relate two popular algorithms used for adapting dynamic neural networks, real-time backpropagation and backpropagation-through-time. Starting with the flow graph for real-time backpropagation, we use a simple transposition to produce a second graph. The new graph is shown to be interreciprocal with the original and to correspond to the backpropagation-through-time algorithm. Interreciprocity provides a theoretical argument to verify that both flow graphs implement the same overall weight update.

1 Introduction
Two adaptive algorithms, real-time backpropagation (RTBP) and backpropagation-through-time (BPTT), are currently used to train multilayer neural networks with output feedback connections. RTBP was first introduced for single layer fully recurrent networks by Williams and Zipser (1989). The algorithm has since been extended to include feedforward networks with output feedback (see, e.g., Narendra et al. 1990). The algorithm is sometimes referred to as real-time recurrent learning, online backpropagation, or dynamic backpropagation (Williams and Zipser 1989; Narendra et al. 1990; Hertz et al. 1991). The name recurrent backpropagation is also occasionally used, although this should not be confused with recurrent backpropagation as developed by Pineda (1987) for learning fixed points in feedback networks. RTBP is well suited for on-line adaptation of dynamic networks where a desired response is specified at each time step. BPTT (Rumelhart and McClelland 1986; Nguyen and Widrow 1990; Werbos 1990), on the other hand, involves unfolding the network in time and applying standard backpropagation through the unraveled system. It does not allow for online adaptation as in RTBP, but has been shown to be computationally less expensive. Both algorithms attempt to minimize the same performance Neural Computation 6, 296-306 (1994)
@ 1994 Massachusetts Institute of Technology
criterion, and are equivalent in terms of what they compute (assuming all weight changes are made off-line). However, they are generally derived independently and take on very different mathematical formulations. In this paper, we use flow graph theory as a common support for relating the two algorithms. We begin by deriving a general flow graph diagram for the weight updates associated with RTBP. A second flow graph is obtained by transposing the original one, that is, by reversing the arrows that link the graph nodes, and by interchanging the source and sink nodes. Flow graph theory shows that transposed flow graphs are interreciprocal, and for single input single output (SISO) systems have identical transfer functions. This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al. 1970), finds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al. 1984), the relationship between controller and observer canonical forms in control theory (Kailath 1980), and the duality between decimation in time and decimation in frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer 1989). The transposed flow graph is shown to correspond directly to the BPTT algorithm. The interreciprocity of the two flow graphs allows us to verify that RTBP and BPTT perform the same overall computations. These principles are then extended to a more elaborate control feedback structure.

2 Network Equations
A neural network with output recurrence is shown in Figure 1. Let r(k−1) denote the vector of external reference inputs to the network and x(k−1) the recurrent inputs. The output vector x(k) is a function of the recurrent and external inputs, and of the adaptive weights w of the network:

    x(k) = N[x(k−1), r(k−1), w]    (2.1)
The neural network N is most generally a feedforward multilayer architecture (Rumelhart and McClelland 1986). If N has only a single layer of neurons, the structure of Figure 1 represents a completely recurrent network (Williams and Zipser 1989; Pineda 1987). Any connectionist architecture with feedback units can, in fact, be represented in this standard format (Piche and Widrow 1991). Adapting the neural network amounts to finding the set of weights w that minimizes the cost function

    J = Σ_{k=1}^{K} E[e^T(k) e(k)]    (2.2)
where the expectation E[·] is taken over the external reference inputs r(k) and over the initial values of the recurrent inputs x(0). The error e(k) is
Figure 1: Recurrent neural network (q⁻¹ represents a unit delay operator).
defined at each time step as the difference between the desired state d(k) and the recurrent state x(k) whenever the desired vector d(k) is defined, and is otherwise set to zero:

    e(k) = d(k) − x(k)  if d(k) is defined;  e(k) = 0  otherwise    (2.3)
For such problems as terminal control (Bryson and Ho 1969; Nguyen and Widrow 1990; Plumer 1993) a desired response may be given only at the final time k = K, while for other problems such as system identification (Ljung 1987; Narendra and Parthasarathy 1990) it is more common to have a desired response vector for all k. In addition, only some of the recurrent states may represent actual outputs while others may be used solely for computational purposes. In both RTBP and BPTT, a gradient descent approach is used to adapt the weights of the network. At each time step, the contribution to the weight update is given by

    Δw^T(k) = μ e^T(k) · dx(k)/dw    (2.4)

where μ is the learning rate. Here the derivative is used to represent the change in error due to a weight change over all time.¹ The accumulation of weight updates over k = 1...K is given by Δw = Σ_{k=1}^{K} Δw(k). Typically, RTBP uses on-line adaptation in which the weights are updated

¹We define the derivative of a vector a ∈ R^n with respect to another vector b ∈ R^m as the matrix da/db ∈ R^{n×m} whose (i,j)th element is da_i/db_j. Similarly, the partial of a vector a ∈ R^n with respect to another vector b ∈ R^m is the matrix ∂a/∂b ∈ R^{n×m} whose (i,j)th element is ∂a_i/∂b_j. For m = n = 1, this notation reduces to the scalar derivative and partial derivative as traditionally defined in calculus. It is easy to verify that most scalar operations in calculus, such as the chain rule, also hold in the vectorial case.
at each time k, whereas BPTT performs an update based on the aggregate Δw. The differences due to on-line versus off-line adaptation will not be considered in this paper. For consistency, we assume that in both algorithms the weights are held constant during all gradient calculations.

3 Flow Graph Representation of the Adaptive Algorithms
RTBP was originally derived for fully recurrent single layer networks.² A more general algorithm is obtained by using equation 2.1 to directly evaluate the state gradient dx(k)/dw in the above weight update formula. Applying the chain rule, we get

    dx(k)/dw = ∂x(k)/∂x(k−1) · dx(k−1)/dw + ∂x(k)/∂r(k−1) · dr(k−1)/dw + ∂x(k)/∂w · dw/dw    (3.1)

in which dr(k−1)/dw = 0 since the external inputs do not depend on the network weights, and dw/dw = I, where I is the identity matrix. With these simplifications, equation 3.1 reduces to

    dx(k)/dw = ∂x(k)/∂x(k−1) · dx(k−1)/dw + ∂x(k)/∂w    (3.2)
Equation 3.2 is then applied recursively, from k = 1 to k = K, with initial conditions dx(0)/dw = 0. For sake of clarity, let

    Λ^rec(k) = dx(k)/dw    (3.3)
    Λ(k) = ∂x(k)/∂w    (3.4)
    J(k) = ∂x(k)/∂x(k−1)    (3.5)

With this new notation, equation 3.2 can be rewritten as

    Λ^rec(k) = J(k) · Λ^rec(k−1) + Λ(k)    ∀k = 1...K    (3.6)

with initial condition Λ^rec(0) = 0. The weight update at each time step is

    Δw^T(k) = μ e^T(k) · Λ^rec(k)    (3.7)
Equations 3.6 and 3.7 can be illustrated by a flow graph (see Fig. 2). The input to the flow graph, or source node variable, is set to 1.0 and propagated along the lower horizontal branch of the graph. The center horizontal branch computes the state derivatives Λ^rec(k), and the upper horizontal branch accumulates the weight changes Δw^T(k). The total weight change, Δw^T, is readily available at the output (sink node). RTBP is completely defined with this flow graph.

²The linear equivalent of the RTBP algorithm was first introduced in the context of infinite impulse response (IIR) filter adaptation (White 1975).
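To make the recursion concrete, here is a minimal numerical sketch (ours, not the authors' implementation). A single-layer recurrent network x(k) = tanh(W x(k−1) + r(k−1)) is assumed purely for illustration, and the function accumulates the updates of equations 3.6 and 3.7 in a single forward pass; all names are hypothetical.

```python
# Sketch of the RTBP forward recursion (eqs. 3.6-3.7) for a toy network
# x(k) = tanh(W x(k-1) + r(k-1)); w is the flattened weight matrix W.
import numpy as np

def rtbp(W, x0, rs, ds, mu=0.01):
    """Return the accumulated weight update Delta w over k = 1..K."""
    N = W.shape[0]
    lam_rec = np.zeros((N, N * N))        # Lambda_rec(k) = dx(k)/dw, eq 3.3
    dw = np.zeros(N * N)
    x = x0
    for r, d in zip(rs, ds):
        x_new = np.tanh(W @ x + r)
        D = np.diag(1.0 - x_new ** 2)     # local tanh derivative
        J = D @ W                         # J(k) = dx(k)/dx(k-1), eq 3.5
        lam = D @ np.kron(np.eye(N), x)   # Lambda(k) = dx(k)/dw, eq 3.4
        lam_rec = J @ lam_rec + lam       # eq 3.6
        dw += mu * (d - x_new) @ lam_rec  # eq 3.7 with e(k) = d(k) - x(k)
        x = x_new
    return dw.reshape(N, N)
```

Note how the O(N²W)-sized quantity Λ^rec(k) is carried forward at every step, which is exactly the computational burden discussed below.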
Figure 2: Flow graph associated with the real-time backpropagation algorithm.

Let us now build a new flow graph by transposing the flow graph of Figure 2. Transposing the original flow graph is accomplished by reversing the branch directions, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging source and sink nodes. The new flow graph is represented in Figure 3. From the work by Tellegen (1952) and Bordewijk (1956) (see Appendix A), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means, in a SISO case, that the sink value obtained in one graph, when exciting the source with a given input, is the same as the sink value of the transposed graph, when exciting its source by the same input. Thus, if an input of 1.0 is distributed along the upper horizontal branch of the transposed graph, the output, which is now accumulated on the lower horizontal branch, will be equal to Δw. This Δw is identical to the output of our original flow graph.³ With the notation introduced before, and calling δ^bp(k) the signal transmitted along the center horizontal branch, we can directly write down the equations describing the new flow graph:

    δ^bp(k) = J^T(k+1) · δ^bp(k+1) + μ e(k)    ∀k = K...1    (3.8)

with initial condition δ^bp(K+1) = 0. The weight update at each time step is given by

    Δw(k) = Λ^T(k) · δ^bp(k)    (3.9)
Figure 3: Transposed flow graph: a representation of the backpropagation-through-time algorithm.

Equations 3.8 and 3.9, obtained from the new flow graph, are nothing other than the description of BPTT: δ^bp(k) is the error gradient backpropagated from k = K...1. This provides a simple theoretical derivation of BPTT. Alternative derivations include the use of ordered derivatives (Werbos 1974), heuristically unfolding the network in time (Rumelhart and McClelland 1986; Nguyen and Widrow 1990), and solving a set of Euler-Lagrange equations (Le Cun 1988; Plumer 1993). Clearly one could have also taken a reverse path by starting with a derivation for BPTT, constructing the corresponding flow graph, transposing it, and then reading out the equations for RTBP. The two approaches lead to equivalent results. Another nice feature of flow graph representations is that the computational complexity differences between RTBP and BPTT can be directly observed from their respective flow graphs. By observing the dimension of the terms flowing in the graphs and the necessary matrix calculations and multiplications, it can be verified that RTBP requires O(N²W) operations while BPTT requires only O(W) operations (N is the number of recurrent states and W is the number of weights).⁴

³The flow graphs introduced here are in fact single input multiple output (SIMO). The arguments of interreciprocity may be applied by considering the SIMO graph to be a stack of SISO graphs, each of which can be independently transposed.
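For comparison, the transposed recursion can be sketched the same way (again ours, reusing the toy tanh network assumed in the previous sketch): a forward pass stores the local quantities, and the backward pass implements equations 3.8 and 3.9. Run on the same data, it should reproduce the total update returned by the rtbp sketch.

```python
# Sketch of the BPTT recursion (eqs. 3.8-3.9) for the same toy network.
import numpy as np

def bptt(W, x0, rs, ds, mu=0.01):
    N, K = W.shape[0], len(rs)
    J, lam, e = {}, {}, {}
    x = x0
    for k in range(1, K + 1):             # forward pass: store local terms
        x_new = np.tanh(W @ x + rs[k - 1])
        D = np.diag(1.0 - x_new ** 2)
        J[k] = D @ W
        lam[k] = D @ np.kron(np.eye(N), x)
        e[k] = ds[k - 1] - x_new
        x = x_new
    dw, delta = np.zeros(N * N), np.zeros(N)
    for k in range(K, 0, -1):             # backward pass
        back = J[k + 1].T @ delta if k < K else 0.0  # delta_bp(K+1) = 0
        delta = back + mu * e[k]          # eq 3.8
        dw += lam[k].T @ delta            # eq 3.9
    return dw.reshape(N, N)

# With matching random W, x0, rs, ds:
# np.allclose(rtbp(W, x0, rs, ds), bptt(W, x0, rs, ds))  -> True
```

Only the length-N vector δ^bp(k) is propagated backward here, which is the source of the O(W) versus O(N²W) gap noted above.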
4 Extension to Controller-Plant Structures

Flow graph theory can also be applied to more complicated network arrangements, such as the dynamic controller-plant structure illustrated in Figure 4. A discrete-time dynamic plant P described by its state-space equations is controlled by a neural network controller C. Let x(k−1) be the state of the plant, r(k−1) the external reference inputs to the controller, and u(k−1) the control signal used to drive the plant.

⁴For fully recurrent networks W = N² and RTBP is O(N⁴).
Figure 4: Controller-plant structure.

Figure 4 can be described formally by the following equations:
    x(k) = P[x(k−1), u(k−1)]    (4.1)
    u(k−1) = C[x(k−1), r(k−1), w]    (4.2)
As before, the error vector e(k) is defined as the difference between the desired state d(k) and the actual state x(k) when there exists a desired state, and zero otherwise. Using RTBP to adapt the weights of the controller requires the evaluation of the derivatives of the state with respect to the weights. Applying the chain rule to equations 4.1 and 4.2, we get
    dx(k)/dw = ∂x(k)/∂x(k−1) · dx(k−1)/dw + ∂x(k)/∂u(k−1) · du(k−1)/dw    (4.3)
    du(k−1)/dw = ∂u(k−1)/∂x(k−1) · dx(k−1)/dw + ∂u(k−1)/∂w    (4.4)

Equations 4.3 and 4.4 can then be represented by a flow graph (see Fig. 5a). Transposing this flow graph, we get a new graph, which corresponds to the BPTT algorithm for the controller-plant structure (see Fig. 5b). Again, the argument of interreciprocity immediately shows the equivalence of the weight updates performed by the two algorithms. In addition, it can be verified that BPTT applied to this structure still requires a factor of O(N²) less computations than RTBP.
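A compact sketch (ours, not the paper's) of the coupled recursions: for illustration only, the plant is taken to be linear, x(k) = A x(k−1) + B u(k−1), and the controller linear in its adaptive weights, u(k−1) = Wc x(k−1) + r(k−1). All names and the linearity assumption are ours.

```python
# Sketch of the coupled gradient recursions 4.3-4.4 for a linear
# plant/controller pair; w = vec(Wc) are the controller weights.
import numpy as np

def controller_state_gradients(A, B, Wc, x0, rs):
    """Return dx(k)/dw for k = 1..K."""
    N = A.shape[0]
    dx_dw = np.zeros((N, N * N))
    x, out = x0, []
    for r in rs:
        # eq 4.4: du(k-1)/dw = (du/dx) dx(k-1)/dw + du/dw
        du_dw = Wc @ dx_dw + np.kron(np.eye(N), x)
        u = Wc @ x + r
        # eq 4.3: dx(k)/dw = (dx/dx) dx(k-1)/dw + (dx/du) du(k-1)/dw
        dx_dw = A @ dx_dw + B @ du_dw
        x = A @ x + B @ u
        out.append(dx_dw.copy())
    return out
```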
5 Conclusion
We have shown that real-time backpropagation and backpropagation-through-time are easily related when represented by signal flow graphs. In particular, the flow graphs corresponding to the two algorithms are the exact transpose of one another. As a consequence, flow graph theory could be applied to verify that the gradient calculations performed by the algorithms are equivalent. These principles were then extended to a controller-plant structure to illustrate how flow graph techniques can be applied to a variety of adaptive dynamic systems.

Figure 5: (a) Flow graph corresponding to RTBP. (b) Transposed flow graph: a representation of BPTT. Notation: J_C(k) = ∂u(k)/∂x(k), J_Px(k) = ∂x(k)/∂x(k−1), J_Pu(k) = ∂x(k)/∂u(k−1), Λ_u(k) = ∂u(k)/∂w, Λ_x(k) = dx(k)/dw, Λ_u^rec(k) = du(k)/dw, δ^bp(k) = −μ d[Σ_{k'=k}^{K} e^T(k')e(k')]/dx(k).
Appendix A: Flow Graph Interreciprocity

In this appendix we provide the formal definition of interreciprocity. We then prove that transposed flow graphs are interreciprocal, and that the transfer functions of single input single output interreciprocal flow graphs are identical. Let F be a flow graph. In F, we define Y_k, the value associated with node k; T_{j,k}, the transmittance of the branch (j,k); and V_{j,k} = T_{j,k} · Y_j, the output of branch (j,k). Let us further assume that each node k of the graph has associated to it a source node, that is, a node connected to it
Figure 6: Example of nodes and branches in a signal flow graph.

by a branch of unity transmittance. Let X_k be the value of this source node (if node k has no associated source node, X_k is simply set to zero). It results from the above definitions that Y_k = Σ_j V_{j,k} + X_k = Σ_j T_{j,k} · Y_j + X_k (see Fig. 6). Let us now consider a second flow graph, F̃, having the same topology as F (i.e., F̃ has the same set of nodes and branches as F, but the branch transmittances of both graphs may differ). F̃ is described with the variables Ỹ_k, T̃_{j,k}, Ṽ_{j,k}, and X̃_k.
Definition 1. Two flow graphs, F and F̃, are said to be the transpose of each other iff their transmittance matrices are transposed, i.e.,

    T̃_{j,k} = T_{k,j}    ∀ j, k    (5.1)

Definition 2 (Bordewijk 1956). Two flow graphs, F and F̃, are said to be interreciprocal iff

    Σ_k Ỹ_k · X_k = Σ_k Y_k · X̃_k    (5.2)

We can now state the following theorem:
Theorem 1. Transposed flow graphs are interreciprocal.

Proof. Let F be a flow graph, and let F̃ be the transpose of F. We start from the identity Σ_k Ỹ_k · Y_k = Σ_k Y_k · Ỹ_k, and replace Y_k by Σ_j T_{j,k} · Y_j + X_k in the first member, and Ỹ_k by Σ_j T̃_{j,k} · Ỹ_j + X̃_k in the second member (Oppenheim and Schafer 1989). Rearranging the terms, we get

    Σ_{j,k} (Ỹ_k · V_{j,k} − Y_k · Ṽ_{j,k}) + Σ_k (Ỹ_k · X_k − Y_k · X̃_k) = 0    (5.3)

Equation 5.3 is usually referred to as "the two-network form of Tellegen's theorem" (Tellegen 1952; Penfield et al. 1970). Since F̃ is the transpose of F, the first term of equation 5.3 can be rewritten as Σ_{j,k}(Ỹ_k · V_{j,k} − Y_k · Ṽ_{j,k}) = Σ_{j,k}(Ỹ_k · T_{j,k} · Y_j − Y_k · T̃_{j,k} · Ỹ_j) = Σ_{j,k}(Ỹ_k · T_{j,k} · Y_j − Y_k · T_{k,j} · Ỹ_j) = 0. Since the first term of equation 5.3 is zero, the second term Σ_k(Ỹ_k · X_k − Y_k · X̃_k) is also zero. The flow graphs F and F̃ are thus interreciprocal. QED.

The last step consists in showing that SISO interreciprocal flow graphs have the same transfer functions. Let node a be the unique source of F and node b its unique sink. From the definition of transposition, node a is the sink of F̃, and node b is its source. We thus have X_k = 0 ∀k ≠ a and X̃_k = 0 ∀k ≠ b. Therefore, equation 5.2 reduces to
    X_a · Ỹ_a = X̃_b · Y_b    (5.4)
This last equality can be interpreted as follows (Penfield et al. 1970; Oppenheim and Schafer 1989): the output Y_b, obtained when exciting graph F with an input signal X_a, is identical to the output Ỹ_a of the transposed graph F̃ when exciting it at node b with the same input X̃_b = X_a. The transfer functions of the SISO systems represented by the two flow graphs are thus identical, which is the desired conclusion.

Acknowledgments

This work was sponsored in part by EPRI under Contract RP8010-13.

References

Bordewijk, J. L. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.
Bryson, A. E., Jr., and Ho, Y. 1969. Applied Optimal Control, Chap. 2. Blaisdell, New York.
Hertz, J. A., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Kailath, T. 1980. Linear Systems. Prentice-Hall, Englewood Cliffs, NJ.
Le Cun, Y. 1988. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 21-28. Morgan Kaufmann, San Mateo, CA.
Ljung, L. 1987. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1(1), 4-27.
Nguyen, D., and Widrow, B. 1990. Neural networks for self-learning control systems. IEEE Control Syst. Mag. 10(3), 18-23.
Oppenheim, A. V., and Schafer, R. W. 1989. Digital Signal Processing, 2nd ed. Prentice-Hall, Englewood Cliffs, NJ.
Penfield, P., Spence, R., and Duinker, S. 1970. Tellegen's Theorem and Electrical Networks. MIT Press, Cambridge, MA.
Piche, S. W., and Widrow, B. 1991. First-order gradient descent training of adaptive discrete-time dynamic networks. Tech. Rep. RL-TR-91-62, Rome Laboratory, Air Force Systems Command, Griffiss A.F.B., NY 13441-5700.
Pineda, F. J. 1987. Generalization of back-propagation to recurrent neural networks. IEEE Transactions on Neural Networks, special issue on recurrent networks.
Plumer, E. S. 1993. Time-optimal terminal control using neural networks. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, pp. 1926-1931.
Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, 2nd ed. John Wiley, New York.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing. MIT Press, Cambridge, MA.
Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.
Werbos, P. 1990. Backpropagation through time: What it does and how to do it. Proc. IEEE, special issue on neural networks, 2, 1550-1560.
Werbos, P. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Dissertation, Harvard University, Cambridge, MA.
White, S. A. 1975. An adaptive recursive digital filter. Proc. 9th Asilomar Conf. Circuits Syst. Comput., 21.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270-280.
Received July 21, 1992; accepted July 9, 1993.
Communicated by Eric Baum
Smooth On-Line Learning Algorithms for Hidden Markov Models

Pierre Baldi
Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA
Yves Chauvin Net-ID, Inc. and Department of Psychology, Stanford University, Stanford, CA 94305 USA
A simple learning algorithm for Hidden Markov Models (HMMs) is presented together with a number of variations. Unlike other classical algorithms such as the Baum-Welch algorithm, the algorithms described are smooth and can be used on-line (after each example presentation) or in batch mode, with or without the usual Viterbi most likely path approximation. The algorithms have simple expressions that result from using a normalized-exponential representation for the HMM parameters. All the algorithms presented are proved to be exact or approximate gradient optimization algorithms with respect to likelihood, log-likelihood, or cross-entropy functions, and as such are usually convergent. These algorithms can also be cast in the more general EM (Expectation-Maximization) framework where they can be viewed as exact or approximate GEM (Generalized Expectation-Maximization) algorithms. The mathematical properties of the algorithms are derived in the appendix.

Hidden Markov Models (HMMs) are a particular class of adaptive systems that has been extensively used in speech recognition problems (see Rabiner 1989 for a review), but also in other interesting applications such as single channel kinetic modeling (Ball and Rice 1992). More recently, HMMs and related more general statistical techniques such as the EM (Expectation-Maximization) algorithm (Dempster et al. 1977) have been applied to the modeling and analysis of DNA and protein sequences in biology (Baldi et al. 1993a,b; Cardon and Stormo 1992; Haussler et al. 1993; Krogh et al. 1993, and references therein) and optical character recognition (Levin and Pieraccini 1993). A first order HMM M is characterized by a set of states, an alphabet of symbols, a probability transition matrix T = (t_ij), and a probability emission matrix E = (e_ij). The parameter t_ij (resp. e_ij) represents the probability of transition from state i to state j (resp. of emission of symbol j

Neural Computation 6, 307-318 (1994)
@ 1994 Massachusetts Institute of Technology
from state i). HMMs can be viewed as adaptive systems: given a training sequence of symbols O, the parameters of an HMM can be iteratively adjusted in order to optimize the fit between the model and the data, as measured by some criterion. Most commonly, the goal is to maximize the likelihood L_O(M) = P(O | M) (or rather its logarithm) of the sequence according to the model. This likelihood can be computed exactly using a simple dynamic programming scheme, known as the forward procedure, which takes into account all possible paths through M capable of producing the data. In certain situations, it is preferable to use only the likelihood of the most probable path. The most probable path is also easily computed using a slightly different recursive procedure, known as the Viterbi algorithm. When several training sequences are available, they can usually be treated as independent and therefore the goal is to maximize the likelihood of the model L(M) = Π_O L_O(M), or its logarithm Σ_O log L_O(M). Different objective functions, such as the posterior distribution of the model given the data, are also possible (for instance, Stolcke and Omohundro 1993). Learning from examples in HMMs is typically accomplished using the Baum-Welch algorithm. At each iteration of the Baum-Welch algorithm, the expected number n_ij(O) [resp. m_ij(O)] of i → j transitions (resp. emissions of letter j from state i) induced in a given model, with a fixed set of parameters, by each training sequence O, is calculated using the forward-backward procedure (see Rabiner 1989 and our appendix for more formal definitions). The transition and emission probabilities are then reset according to

    t_ij = [Σ_O n_ij(O)/L_O(M)] / [Σ_O n_i(O)/L_O(M)]  and  e_ij = [Σ_O m_ij(O)/L_O(M)] / [Σ_O m_i(O)/L_O(M)]    (1.1)

where n_i(O) = Σ_j n_ij(O) and m_i(O) = Σ_j m_ij(O). Thus, in the case of a single training sequence, at each iteration the Baum-Welch algorithm resets a transition or emission parameter to its expected frequency, given the current model and the training data. In the case of multiple training sequences, the contribution of each sequence must be weighted by the inverse of its likelihood. It is clear that the Baum-Welch algorithm can lead to abrupt jumps in parameter space and that the procedure is not suitable for on-line learning, that is, after each training example. This is even more so if the Viterbi approach is used by computing only the most likely path associated with the production of a sequence (see also Juang and Rabiner 1990; Merhav and Ephraim 1991), as opposed to the forward-backward procedure where all possible paths are examined. Along such a single path, the transition or emission counts are necessarily 0 or 1 (provided there are no loops) and therefore cannot be reasonably used in an on-line version of equation 1.1. Another problem with the Baum-Welch algorithm is that 0 probabilities are absorbing: once a transition or emission probability is set to 0, it is never used again and therefore remains equal
to 0. This is of course undesirable and usually is prevented by artificially enforcing that no parameter be less than a fixed small threshold. A different algorithm for HMM learning that is smooth, overcomes the previous obstacles, and can be used on-line or in batch mode, with or without the Viterbi approximation, can be defined as follows. First, a normalized-exponential representation for the parameters of the model is introduced. For each t_ij (resp. e_ij), a new parameter w_ij (resp. v_ij) is defined according to

    t_ij = e^{λ w_ij} / Σ_k e^{λ w_ik}  and  e_ij = e^{λ v_ij} / Σ_k e^{λ v_ik}    (1.2)
where λ is a temperature parameter. Whichever changes are applied to the ws and vs by the learning algorithm, the normalization constraints on the original parameters are automatically enforced by this reparameterization. One additional advantage of this representation is that none of the parameters can reach the absorbing value 0. The representation (equation 1.2) is general in the sense that any finite probability distribution can be written in this form, provided there are no 0 probabilities (or else one must allow for infinite negative exponents). This is a very good property for HMMs since, in general, 0 probabilities need to be avoided. We shall now describe the on-line version of the learning algorithm; the batch version can be obtained immediately by summing over all training sequences. After estimating the statistics n_ij(O) and m_ij(O) using, for instance, the forward-backward procedure, the update equations of the new algorithm are particularly simple and given by

    Δw_ij = η [n_ij(O) − n_i(O) t_ij] / L_O(M)  and  Δv_ij = η [m_ij(O) − m_i(O) e_ij] / L_O(M)    (1.3)
where η is the learning rate that incorporates all the temperature effects. Although, as we shall prove in the appendix, equation 1.3 represents nothing more than on-line gradient descent on the negative log-likelihood of the data, its remarkable simplicity seems to have escaped the attention of previous investigators, partly because of the monopolistic role played by the Baum-Welch algorithm. Notice that the on-line version of gradient descent on the negative likelihood itself is given by

    Δw_ij = η [L(M)/L_O(M)] [n_ij(O) − n_i(O) t_ij]  and  Δv_ij = η [L(M)/L_O(M)] [m_ij(O) − m_i(O) e_ij]    (1.3')
Thus, with suitable learning rates and excluding the probably rare possibility of convergence to saddle points, 1.3 and 1.3' will converge to a
possibly local maximum of the likelihood L(M). As for backpropagation or any other gradient algorithm, the convergence is exact in batch mode and stochastic in the on-line version. In general, for dynamic range reasons, 1.3 should be preferable to 1.3'. Furthermore, in the case of multiple training sequences, 1.3' requires that the global factor L(M) be available. In certain situations, it may be possible to use a nonscaled version of 1.3 in the form

    Δw_ij = η [n_ij(O) − n_i(O) t_ij]  and  Δv_ij = η [m_ij(O) − m_i(O) e_ij]    (1.3'')
This is particularly true when the bulk of the training sequences tend to have likelihoods that are roughly in the same range. If the distribution of the likelihoods of the training sequences is approximately gaussian, this requires that the standard deviation be relatively small. When the likelihoods of the training sequences are in the same range, then 1.3'' can be viewed as an approximate rescaling of 1.3 with the corresponding vectors being almost aligned and certainly in the same half space. So, with a proper choice of learning rate (in general different from the rate used in 1.3), 1.3'' should also lead to an increase of the likelihood. The previous algorithms rely on the local discrepancy between the transition and emission counts induced by the data and their predicted value from the parameters of the model. Variations on 1.3 and 1.3' can be constructed using the discrepancy between the corresponding frequencies, that is, with n_ij(O)/n_i(O) − t_ij in place of n_ij(O) − n_i(O) t_ij (and similarly for the emissions); with the 1/L_O(M) and L(M)/L_O(M) scalings of 1.3 and 1.3' this gives rules 1.4 and 1.4', and the nonscaled form is

    Δw_ij = η [n_ij(O)/n_i(O) − t_ij]  and  Δv_ij = η [m_ij(O)/m_i(O) − e_ij]    (1.4'')
In the appendix, it is shown that 1.4'' can be reached through a different heuristic line of reasoning: by approximating gradient descent on an objective function constructed as the sum of locally defined cross-entropy terms. All the variations 1.4-1.4'' are obtained by multiplying 1.3 or 1.3' by positive coefficients such as 1/n_i(O), which may depend both on the sequence O and the state i. In the case of a single training sequence, all the vectors associated with 1.3-1.4'' are in the same half plane and all these rules will increase the likelihood, provided the learning rate is sufficiently small. In the case of multiple training sequences, it is reasonable to expect that on average, in many situations, 1.4-1.4'' will still tend to increase the likelihood L(M), or its logarithm, although not along the line of steepest ascent. Accordingly, their convergence can be expected to be slower.
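As a concrete illustration of the nonscaled rule 1.3'' together with the parameterization 1.2, consider the following sketch (ours, not the authors' code). The expected counts are assumed to have been produced by a forward-backward routine (not shown), and all names and toy values are hypothetical.

```python
# Minimal sketch of the smooth rule 1.3'' under the normalized-exponential
# parameterization 1.2; n[i][j] stands in for the expected counts n_ij(O).
import numpy as np

def transition_probs(w, lam=1.0):
    """Equation 1.2: t_ij = exp(lam*w_ij) / sum_k exp(lam*w_ik)."""
    z = np.exp(lam * (w - w.max(axis=1, keepdims=True)))  # stable softmax
    return z / z.sum(axis=1, keepdims=True)

def smooth_update(w, n, eta=0.1, lam=1.0):
    """Equation 1.3'': Delta w_ij = eta * (n_ij - n_i * t_ij)."""
    t = transition_probs(w, lam)
    n_i = n.sum(axis=1, keepdims=True)
    return w + eta * (n - n_i * t)        # normalization preserved by 1.2

# toy usage with made-up expected counts for a 3-state model
w = np.zeros((3, 3))
n = np.array([[2.0, 1.0, 0.0], [0.5, 0.5, 1.0], [0.0, 1.0, 2.0]])
for _ in range(50):
    w = smooth_update(w, n)
print(transition_probs(w))  # approaches the empirical frequencies n_ij/n_i
```

Note that no probability ever reaches the absorbing value 0 here: the softmax of 1.2 keeps every t_ij strictly positive, which is the point of the reparameterization.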
In the case of an on-line Viterbi approach, the optimal path π*(O) associated with the current training sequence is first computed, together with the associated likelihood L*_O(M). Any one of the previous algorithms can then be approximated by replacing the expected transition counts n_ij(O) and n_i(O) by the corresponding counts n*_ij(O) and n*_i(O) obtained along the optimal Viterbi path (and similarly for the emissions). From the definition of n_ij given in the appendix, it is easy to see that n*_ij(O) = c_ij(π*) L*_O(M), where c_ij(π*) is the number of times the i → j transition appears along the path π*. For instance, for any i along π*, the gradient descent equations of 1.3 can then be approximated by Δw_ij = η[c_ij(π*) − t_ij c_i(π*)], and similarly for the emissions. In the case of an architecture without any loops, the Viterbi approximations are particularly simple since c_ij(π*) and c_i(π*) are 0 or 1. In a specific application (Baldi et al. 1992, 1993), good results have been obtained with a Viterbi version of 1.3, rewritten in the form
    Δw_ij = η [t*_ij − t_ij]  and  Δv_ij = η [e*_ij − e_ij]    (1.5)
Here, for a fixed state i on the path, t*_ij and e*_ij are the target transition and emission values: t*_ij = 1 every time the i → j transition is part of the Viterbi path of the corresponding training sequence and 0 otherwise; and similarly for e*_ij. In particular if, as a result of a loop, a Viterbi path visits the state i several times, then 1.5 must be repeated at each visit (1.5 is "on-line" not only with respect to different paths associated with different training sequences but also within each path). As for 1.4'', it is shown in the appendix that 1.5 can also be derived by approximate gradient descent on a sum of local cross-entropy terms. One important difference, however, is that this time, this objective function can be discontinuous as a result of the evolution of the Viterbi paths themselves. In many applications, the likelihoods L_O(M) are very small and beyond machine precision. This problem can be dealt with by using a scaling approach (Rabiner 1989). The scaling equations derived for the Baum-Welch algorithm can readily be extended to the algorithms presented here. Another possibility, often used for obvious reasons in conjunction with a Viterbi algorithm, is to calculate only the logarithm of the likelihoods. Accordingly, it may be possible in some situations to replace in the previous algorithms the factors 1/L_O(M) by −log L_O(M). The algorithms previously described can also be seen in the more general EM framework (Dempster et al. 1977). In the general EM framework, one considers two dependent random variables X and Y, where Y is the observed random variable. For a given value of X, there is a unique value of Y but, in general, different values of X can lead to the same value of the observable Y. In addition, there is a parameterized family of densities f(x | θ) depending on a set of parameters θ. In HMM terminology, X corresponds to the paths, Y to the output sequences, and θ to the transition and emission parameters. As usual, the problem is to try to find a maximum likelihood estimate for the set of parameters
θ, from the observations y. The EM algorithm is a recursive procedure defined by

    θ(t+1) = argmax_θ F[θ, θ(t), y] = argmax_θ E[log f(x | θ) | θ(t), y]    (1.6)
It can be shown that, in general, 1.6 converges monotonically to a possibly local maximum likelihood estimator of θ. When applied in the HMM context, the EM algorithm becomes the Baum-Welch algorithm. An interesting and slightly different view of the EM algorithm, which leads also to incremental variants, can be found in Neal and Hinton (1993), where X can be interpreted as representing the states of a statistical mechanical system. If the energy of a state is measured by its negative log likelihood, then 1.6 can be seen as a double minimization step on the corresponding free energy function. Any algorithm that increases the function F in 1.6 (and, as a result, increases also the likelihood of the observation given the parameters), without necessarily maximizing it, is called a GEM algorithm. It can be shown in general (a proof is given in the appendix in the HMM context) that the gradient of F and the gradient of the log likelihood of the observations given the parameters are identical. A small gradient ascent step on the log likelihood must lead to an increase of F. Thus, with sufficiently small learning rates, gradient descent on the negative log-likelihood and the other related algorithms presented here can be seen as special cases of GEM algorithms or approximations to GEM algorithms. For a specific application, it is natural to ask which of the previous learning rules should be used. The answer to this question depends on the application considered, the architecture of the HMM, and possibly other implementation constraints, such as the available precision. It is clear that 1.3 should play for HMMs the role backpropagation plays for neural networks. Indeed, it is well known that HMMs can be viewed as a particular kind of linear neural networks. Backpropagation applied to these equivalent networks together with the normalized-exponential parameter representation leads immediately to 1.3. On the other hand, it is easy to see on simple examples that, for instance, 1.3 and 1.5 can behave very differently. There are problems, such as those in the area of DNA or protein sequence modeling, where the Viterbi paths play a particular role and where Viterbi learning may be more desirable. In this context, extensive simulation results on one of these smooth algorithms (1.5) can be found in Baldi et al. (1993a,b). As in any gradient method, the choice of the learning rate is also crucial and may require some experimentation. It should also be obvious that, as in the case of neural networks and other modeling techniques, the present ideas can easily be extended to more complex objective functions including, for instance, regularizer terms reflecting prior knowledge on the HMM parameters. The same is true also for higher order HMMs.
In general, the simple algorithms introduced above should be useful in situations where smoothness and/or on-line learning are important. These could include:

1. large models with many parameters and relatively scarce data that may be more prone to overfitting, local minima trapping, and other pathological behaviors when trained with discontinuous or batch rules;

2. analog physical implementations where only continuous learning rules can be realized; and

3. all situations where the storage of examples for batch learning is not desirable.
In the following mathematical appendix, we first show that 1.3 and 1.3' correspond to gradient ascent on the log likelihood and likelihood, respectively. We also give a different derivation of 1.4'' and 1.5 as gradient descent algorithms on a different objective function. We then examine the relation of the algorithms to the Baum-Welch algorithm and to the more general EM approach.

2 Mathematical Appendix
For brevity, only transition parameters will be considered here but the analysis for emission parameters is analogous. In what follows, we will need the partial derivatives

    ∂t_ij/∂w_ij = λ t_ij (1 − t_ij)  and  ∂t_ij/∂w_ik = −λ t_ij t_ik    (2.1)
Derivation of 1.3 and 1.3' as Global Gradient Descent Algorithms. For any path π through the model M and any fixed sequence O, let

    α_{π,O} = P(π, O | M)    (2.2)

so that

    L_O(M) = Σ_π α_{π,O}    (2.3)

the summation being over all possible paths through the architecture. With several training sequences, the likelihood of the model is given by

    L(M) = Π_O L_O(M)    (2.4)
Since the probability α_{π,O} is the product of all the corresponding transition and emission probabilities along the path π, it is easy to see that

    ∂α_{π,O}/∂t_ij = c_ij(π) α_{π,O} / t_ij    (2.5)

where c_ij(π) is the number of times the i → j transition is used along the path π consistent with the production of O. From 2.3 and 2.5,

    ∂L_O(M)/∂t_ij = n_ij(O) / t_ij    (2.6)

where n_ij(O) = Σ_π c_ij(π) α_{π,O} is the expected number of times the i → j transition is used to produce O in model M. Combining 2.4 and 2.6 leads to

    ∂L(M)/∂t_ij = Σ_O [L(M)/L_O(M)] n_ij(O) / t_ij    (2.7)

Finally, using 2.1, the chain rule, and a few simplifications, we get

Proposition 1.

    ∂L(M)/∂w_ij = λ Σ_O [L(M)/L_O(M)] [n_ij(O) − n_i(O) t_ij]    (2.8)

If in 2.4 we use the log likelihood, then we get

Proposition 2.

    ∂log L(M)/∂w_ij = λ Σ_O [n_ij(O) − n_i(O) t_ij] / L_O(M)    (2.9)

Clearly, 1.3 is the on-line version of 2.9, 1.3' is the on-line version of 2.8, and the temperature parameter λ can be absorbed in the learning rate. The key conclusive point is that 1.3 corresponds to gradient descent on the negative log likelihood and therefore is the most sensible algorithm.

Heuristic Derivation of 1.4'' and 1.5 as Approximate Gradient Descent Algorithms on a Sum of Local Cross-Entropy Terms. Once the numbers n_ij(O) have been calculated, the local distance between the current distribution t_ij of transitions in the model and the expected distribution induced by the data can be measured using the cross-entropy function

    H_{i,O}(n, t) = Σ_j [n_ij(O)/n_i(O)] log{[n_ij(O)/n_i(O)] / t_ij}    (2.10)
Equivalently, one could consider the local likelihood associated with the distribution of the transitions out of state i. The logarithm of this likelihood then yields a term similar to 2.10. The parameters can now be updated by gradient descent on H(n, t) = Σ_O Σ_i H_{i,O}(n, t) in order to reduce this distance:

    Δw_ij = −η ∂H(n, t)/∂w_ij    (2.11)

The approximation used in 2.11 computes only the explicit derivative of H(n, t) with respect to w_ij. All other higher order contributions, associated with the fact that a change in t_ik affects all the quantities n_ij(O) (∂H_{i,O}/∂t_ik is usually nonzero), are neglected. After simplifications, we get

Proposition 3.

    Δw_ij = η λ Σ_O [n_ij(O)/n_i(O) − t_ij]    (2.12)

Clearly, 1.4'' is the on-line descent version of 2.12. Thus 1.4'' is the learning rule derived by approximating gradient descent on the sum of the local cross-entropy terms between the distribution of transitions in the model at state i and the expected value of this distribution once the data are taken into account. Similar more complex learning rules could also be derived by weighting the terms H_{i,O} in the sum H [for instance by 1/L_O(M)]. Again, the temperature coefficient λ can be merged into the learning rate. The same reasoning using the Viterbi approximation yields 1.5 [notice that the instantaneous distribution of transitions out of any state along a Viterbi path is a multinomial distribution of the form Π_j (t_ij)^{c_ij(π)}].

Relations to Baum-Welch, EM, and GEM Algorithms. To examine the relation of the previous algorithms to the Baum-Welch algorithm, it is useful to examine the proof of convergence for the Baum-Welch algorithm (Baum 1972). Consider, for a fixed architecture, two models M and M' with different transition and emission parameters. M' can be thought of as being the improvement over M that we are seeking. For any path π through the model, we can define the probabilities

    α_{π,O} = P(π, O | M)  and  β_{π,O} = P(π, O | M')    (2.13)
Assuming, as usual, independence between the sequences, the likelihood of each model is then given by

    L(M) = Π_O L_O(M)  and  L(M') = Π_O L_O(M')    (2.14)

with

    L_O(M) = Σ_π α_{π,O}  and  L_O(M') = Σ_π β_{π,O}    (2.15)

the summations being over all possible paths π through the architecture. For each sequence, we can then define two probability distributions, α_{π,O}/Σ_π α_{π,O} and β_{π,O}/Σ_π β_{π,O}, induced by the two models and the sequence O over all paths. We can again measure the distance between these two distributions using the cross-entropy:

    H(α_O, β_O) = Σ_π [α_{π,O}/L_O(M)] log{[α_{π,O}/L_O(M)] / [β_{π,O}/L_O(M')]}    (2.16)

The cross-entropy being always positive, after simple manipulations we get

    log L_O(M') − log L_O(M) ≥ Σ_π [α_{π,O}/L_O(M)] log[β_{π,O}/α_{π,O}]    (2.17)

which gives

    log L(M') − log L(M) ≥ Σ_O Σ_π [α_{π,O} log β_{π,O} − α_{π,O} log α_{π,O}] / L_O(M) = R    (2.18)

When M = M', the right-hand side R of 2.18 is equal to 0. Therefore, to find a model M' with a higher likelihood than the model M, we must increase the term Σ_O Σ_π [α_{π,O} log β_{π,O}]/L_O(M) (in the general EM context, this term corresponds to the function F defined above). The Baum-Welch algorithm directly maximizes this term. On the other hand, the algorithm introduced here in 1.3 is just a gradient descent step toward the maximization of this term. Remarkably, therefore, both the log-likelihood log L(M) and the right-hand side of 2.18 have the same gradient. To see this, one must first observe that the probability β_{π,O} is a product of probabilities corresponding to all the transitions associated with the path π and all the corresponding emissions associated with the sequence O. The contribution to the sum Σ_O Σ_π [α_{π,O} log β_{π,O}]/L_O(M) from the transitions only (there is an additional similar term for the contributions resulting from the emissions) can therefore be rewritten as
    Q = Σ_O Σ_π [α_{π,O}/L_O(M)] Σ_{i,j} c_ij(π) log t'_ij = Σ_O Σ_{i,j} [n_ij(O)/L_O(M)] log t'_ij    (2.19)

where c_ij(π) is the number of i → j transitions contained in the path π and, as above, n_ij(O) is the expected number of i → j transitions in model M, given the data. It is easy to check that the expression for the probability distribution that maximizes Q corresponds to the Baum-Welch algorithm
with t'_ij = Σ_O [n_ij(O)/L_O(M)] / Σ_O [n_i(O)/L_O(M)] and that, if we use a normalized-exponential representation for the parameters w', the gradient of R is given by

Proposition 4.

    ∂R/∂w'_ij = λ Σ_O [n_ij(O) − n_i(O) t'_ij] / L_O(M)

When used on-line, this immediately yields the algorithm in 1.3. Thus, the batch version of 1.3 performs gradient ascent on R or on log L(M) and leads, for sufficiently small step sizes, to an increase of the likelihood unless the gradient is zero. It leads also to an increase of R and, as such, can also be viewed as a GEM algorithm. The gradient of the cross-entropy term maximized by Baum-Welch is also the gradient of the log-likelihood function. Although the proof given here is based on HMMs, the same result can be proved similarly in the general EM framework by showing that the gradient of the log-likelihood and the gradient of the function being maximized during the M step of the EM algorithm are identical.
Acknowledgments

We would like to thank David Haussler, Anders Krogh, Yosi Rinott, and Esther Levin for useful discussions. The work of P. B. is supported by grants from the AFOSR and the ONR.
References

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. 1992. Hidden Markov models of biological primary sequence information. PNAS (USA), in press.
Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. 1993a. Hidden Markov models in molecular biology: New algorithms and applications. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. Lee Giles, eds. Morgan Kaufmann, San Mateo, CA.
Ball, F. G., and Rice, J. A. 1992. Stochastic models for ion channels: Introduction and bibliography. Math. Biosci. 112(2), 189-206.
Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1-8.
Blahut, R. E. 1987. Principles and Practice of Information Theory. Addison-Wesley, Reading, MA.
Cardon, L. R., and Stormo, G. D. 1992. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223, 159-170.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-22.
Haussler, D., Krogh, A., Mian, I. S., and Sjölander, K. 1993. Protein modeling using hidden Markov models: Analysis of globins. In Proceedings of the Hawaii International Conference on System Sciences, Vol. 1, pp. 792-802. IEEE Computer Society Press, Los Alamitos, CA.
Juang, B., and Rabiner, L. R. 1990. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transact. Acoustics, Speech Signal Process. 38(9), 1639-1641.
Krogh, A., Brown, M., Mian, I. S., Sjölander, K., and Haussler, D. 1993. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, in press.
Levin, E., and Pieraccini, R. 1993. Planar hidden Markov modeling: From speech to optical character recognition. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. Lee Giles, eds. Morgan Kaufmann, San Mateo, CA.
Merhav, N., and Ephraim, Y. 1991. Maximum likelihood hidden Markov modeling using a dominant sequence of states. IEEE Transact. Signal Process. 39(9), 2111-2115.
Neal, R. M., and Hinton, G. E. 1993. A new view of the EM algorithm that justifies incremental and other variants. Biometrika, submitted.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.
Stolcke, A., and Omohundro, S. 1993. Hidden Markov model induction by Bayesian model merging. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. Lee Giles, eds. Morgan Kaufmann, San Mateo, CA.
Received January 8, 1993; accepted July 9, 1993.
Communicated by Steven Nowlan
On Functional Approximation with Normalized Gaussian Units

Michel Benaim
Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 USA

Feedforward neural networks with a single hidden layer using normalized gaussian units are studied. It is proved that such neural networks are capable of universal approximation in a satisfactory sense. Then, a hybrid learning rule as per Moody and Darken that combines unsupervised learning of hidden units and supervised learning of output units is considered. By using the method of ordinary differential equations for adaptive algorithms (ODE method) it is shown that the asymptotic properties of the learning rule may be studied in terms of an autonomous cascade of dynamical systems. Some recent results from Hirsch about cascades are used to show the asymptotic stability of the learning rule.

1 Introduction
Many of the recent studies on feedforward neural networks concern the problem of representing arbitrary nonlinear mappings. Due to some limitations of the backpropagation technique with multilayered perceptrons, neural network architectures with "nonsigmoid" hidden units have been recently considered by many authors including Moody and Darken (1989), Poggio and Girosi (1990), Nowlan (1990), and Hartman and Keeler (1991) (this list is not exhaustive). Moody and Darken proposed a hybrid algorithm for training radial basis function (RBF) networks in which the RBFs' centers are determined by a preprocessing k-means algorithm and the output weights are modified according to a linear rule. They worked with gaussian hidden units for the same Mackey-Glass equation as in Lapedes and Farber (1987) and reported that the hybrid algorithm is much faster than standard backpropagation. In the same paper, Moody and Darken also suggested the use of normalized radial functions as hidden units. More recently, Nowlan (1990) demonstrated the superiority of normalized gaussian units in supervised classification tasks. He used a normalized gaussian basis function (NGBF) network for a digit classification task and a vowel discrimination task. He reported that the NGBF network was superior in performance to RBF networks and to other classifiers that have been applied to the same data. Neural Computation 6, 319-333 (1994)
@ 1994 Massachusetts Institute of Technology
Figure 1: Architecture of an NGBF network (input, hidden, and output layers).

An NGBF network has a feedforward architecture with one hidden layer as shown in Figure 1. From now on we shall assume that there are d input units, N hidden units, and, without loss of generality, one output layer. If x is the current input vector, the output of the ith hidden unit is given by the normalized function

    Φ_i(x, w, σ) = φ_σ(||x − w_i||) / Σ_{j=1}^{N} φ_σ(||x − w_j||)
where w_i is the weight vector from the input layer to the hidden unit i and ||·|| denotes the Euclidean distance over R^d. The nonlinear function φ_σ is a gaussian kernel with a uniform width σ:

    φ_σ(u) = exp(−u²/2σ²)
The overall response function of the output unit is a weighted sum of the outputs from the hidden units:

    O(x, w, a, σ) = Σ_{i=1}^{N} a_i Φ_i(x, w, σ)
where a_i denotes the weight of the connection from hidden unit i to the output unit. The learning rule used by Nowlan is a modification of the Moody and Darken algorithm in which the RBFs' centers are adapted according to a "soft competitive learning" scheme rather than a k-means algorithm ("hard competitive learning" scheme). An "on-line" implementation of this learning rule has been proposed by Benaim and Tomasini (1992). It can be described as follows: Assume the training set is a subset of R^d × R consisting of pairs (x, y) of input and target vectors. At each time step a
training vector (x, y) is randomly chosen from the data set and the weight vectors are updated according to the recursion
Δw_i = ε [x − w_i] Θ_i(x, w, σ)    (1.2)
Δa_i = ε [y − O(x, w, a, σ)] Θ_i(x, w, σ)    (1.3)
where ε << 1 is a learning rate parameter. The adaptive modification of the weights {a_i} follows the classical Widrow-Hoff algorithm (Widrow and Hoff 1960), which aims at fitting the desired output function according to standard least mean square minimization. The adaptive modification of the weights {w_i} is in the form of a competitive learning law (Grossberg 1976; Von der Malsburg 1973; Rumelhart and Zipser 1985). We remark that when the parameter σ tends to zero, Θ_i(x, w, σ) performs a winner-take-all function: the only unit that becomes active is the one whose weight vector best matches the input vector. The associated learning rule (1.2) is therefore a classical adaptive implementation of the Linde-Buzo-Gray (LBG), or k-means, algorithm (Linde et al. 1980). For a nonzero value of the parameter σ, the competition can be referred to as "soft" competition. Our purpose in this paper is to study from a theoretical point of view normalized gaussian basis function networks for functional approximation. The problem of approximating an arbitrary function with a neural network architecture leads to two important questions. The first question deals with existence: is the neural network architecture rich enough to provide a good approximation (in a sense to be made precise) of any arbitrary function? The second question deals with feasibility: how to train the network in an adaptive, efficient, and "neuronal" fashion in such a way that the network produces a global approximation to the unknown map embedded by the training set? We will address these two questions for the case of NGBF networks. We will show in Section 2 that NGBF networks with one hidden layer are capable of universal approximation. The question of feasibility will be discussed in Section 3. We will study the adaptive version (on-line implementation) of the training rule experimentally studied by Nowlan. To provide a rigorous analysis of this learning rule, we shall use the ordinary differential equation (ODE) method for "recursive stochastic algorithms" combined with recent results of Hirsch (1989) about cascades of dynamical systems. It will be shown that the learning dynamics (1.2 and 1.3) may be studied in terms of a cascade of dynamical systems. This will lead us to conclude the asymptotic stability of the learning rule.
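As an illustration of the architecture and of the recursion (1.2 and 1.3), the following sketch implements the forward pass and one on-line update in NumPy. It is a minimal reading of the equations above, not code from the paper; the width σ, the learning rate ε, and the toy data are placeholder choices.

```python
import numpy as np

def ngbf_forward(x, W, a, sigma):
    """Activations Theta_i(x, w, sigma) and output O(x, w, a, sigma)."""
    d2 = np.sum((W - x) ** 2, axis=1)          # ||x - w_i||^2 for the N centers
    e = np.exp(-d2 / (2.0 * sigma ** 2))
    theta = e / e.sum()                        # normalization over hidden units
    return theta, float(a @ theta)             # O = sum_i a_i Theta_i

def hybrid_step(x, y, W, a, sigma, eps=0.05):
    """One step of the recursion (1.2 and 1.3): soft competition + Widrow-Hoff."""
    theta, out = ngbf_forward(x, W, a, sigma)
    W += eps * (x - W) * theta[:, None]        # (1.2): centers move toward x
    a += eps * (y - out) * theta               # (1.3): LMS on output weights
    return W, a

# Toy usage: learn y = sin(2*pi*x) on [0, 1] from random samples.
rng = np.random.default_rng(0)
W, a = rng.random((20, 1)), np.zeros(20)
for _ in range(5000):
    x = rng.random(1)
    W, a = hybrid_step(x, float(np.sin(2 * np.pi * x[0])), W, a, sigma=0.1)
```

Note that setting σ very small in this sketch recovers the winner-take-all (k-means) behavior described above.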
2 Universal Approximation with (NGBF) Networks

In this section we consider the approximation capabilities of NGBF networks. The approximation capabilities of feedforward neural network architectures have been studied by many authors, including Cybenko (1989),
Hornik et al. (1989, 1990), Funahashi (1989), and Hornik (1991) for the case of multilayer perceptrons with sigmoid activation functions. More related to our work are recent studies concerning approximation with radial basis function networks. Hartman et al. (1990) apply the Stone-Weierstrass theorem to show that RBF networks with gaussian units having different kernel widths are universal approximators with respect to the uniform norm for continuous functions defined on a compact convex set. Park and Sandberg (1991) significantly improve this result by relaxing the condition to be imposed on the activation function and providing an L^p(R^d, dx) approximation result with respect to the Lebesgue measure dx. Due to the particular form of NGBF units, the results and methods of the authors mentioned above are not applicable and there is a need for a proof. In this section, we will show that NGBF networks with a single hidden layer are universal approximators in the space of continuous functions with compact support, in the space L^p(R^d, dx), and in the space L^p(R^d, μ) where μ is an arbitrary environment probability measure. Approximation in L^p(R^d, μ) is of fundamental interest in the neural network context, in which it is natural to think of input vectors as realizations of a random variable having probability law μ (see, e.g., White 1989). For multilayer perceptrons, such approximation results have been proved by Hornik et al. (1989) for probability measures having compact support, and more recently by Hornik (1991) for general probability measures. We use the following terminology: L^0(R^d) denotes the space of R-valued functions defined on R^d that are Borel measurable. For p ≥ 1 and for any Borel positive measure μ on R^d, L^p(R^d, μ) denotes the space of R-valued functions that are pth power integrable. In other words, a function f is an element of L^p(R^d, μ) iff f is Borel measurable and

∫_{R^d} |f(x)|^p μ(dx) < +∞
The usual norm on L^p(R^d, μ) is defined by

‖f‖_p = [∫_{R^d} |f(x)|^p μ(dx)]^{1/p}
If μ is a probability measure [that is, μ(R^d) = 1] we let L^0(R^d, μ) be the metric space obtained by endowing L^0(R^d) with the distance d_0 defined by

d_0(f, g) = inf{ε > 0 : μ{x : |f(x) − g(x)| > ε} < ε}

d_0 is the metric associated with convergence in probability (see, e.g., Hornik et al. 1989, Lemma 2.1).
If K is a compact subset of R^d, C(K) denotes the space of R-valued continuous functions defined on K. The uniform norm for the space C(K) is defined by

‖f‖_K = sup_{x∈K} |f(x)|
The Euclidean norm on R^d is denoted ‖·‖. Let A be a subset of R^d. The set of all R-valued functions defined on A implemented by an NGBF network with N hidden units is denoted S_N(A). An element of S_N(A) is a function O(·, w, a, σ) : A → R represented by expression 1.1, where σ > 0, w_i ∈ R^d, a_i ∈ R. The set of all functions defined on A and implemented by such a network with an arbitrarily large number of hidden units is

S(A) = ∪_{N=1}^∞ S_N(A)
The following theorem shows that NGBF networks are capable of approximating, with arbitrary accuracy, any function in C(K) and L^p(R^d, μ), where μ is either the Lebesgue measure or a probability measure. The corresponding result for general multidimensional output functions can easily be deduced from this case. The proof is given in the appendix.

Theorem 1.
i. Let K ⊂ R^d be a compact set. Then S(K) is dense in C(K).
ii. Let μ be either the Lebesgue measure or a probability measure on R^d. Then, for p ≥ 1, S(R^d) ∩ L^p(R^d, μ) is dense in L^p(R^d, μ).
iii. Let μ be a probability measure on R^d. Then S(R^d) is dense in L^0(R^d, μ).

The use of a probability measure μ instead of the Lebesgue measure dx is of particular interest when considering neural network applications. In this case the measure μ can be seen as the input environment measure. A consequence of (iii) is that an NGBF neural network can approximate any "decision function." More precisely, if P_1, P_2, ..., P_m is a partition of R^d into m disjoint measurable sets, Cybenko (1989) defines the "decision function" f associated to this partition according to

f(x) = i  if and only if  x ∈ P_i

It follows from (iii) that for arbitrary ε > 0, η > 0, there exists g in S(R^d) such that μ{x : |g(x) − f(x)| > η} < ε. Such a result can be compared with Theorem 5 in Cybenko (1989) and Corollary 2.1 in Hornik et al. (1989) obtained for multilayer perceptrons.
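A quick numerical check of Theorem 1(i) in one dimension, using the construction of the appendix (centers on a fine grid of K, output weights a_i = f(w_i), width σ → 0). The grid size and the widths below are illustrative choices.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x    # a continuous f on K = [0, 1]
centers = np.linspace(0.0, 1.0, 60)              # {w_i} with small rad(i, K)
weights = f(centers)                             # a_i = f(w_i), as in A.2

def g_sigma(x, sigma):
    e = np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))
    return weights @ (e / e.sum())               # NGBF output at x

xs = np.linspace(0.0, 1.0, 500)
for sigma in (0.1, 0.02, 0.005):
    err = max(abs(g_sigma(x, sigma) - f(x)) for x in xs)
    print(f"sigma = {sigma}: sup-norm error ~ {err:.4f}")  # shrinks with sigma
```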
3 The Learning Algorithm
In this section we will consider and analyze the real-time algorithm (1.2 and 1.3) for functional approximation with NGBF networks. Within the framework of functional approximation the relationship between x and y can be expressed as y = g(x) for some mapping g : R^d → R. More generally we may assume that the training set is described by a joint probability ν(x, y) over R^d × R. This formalism allows us to take into account perturbations and errors that may occur in the determination of the deterministic relation y = g(x). The probability distribution of the input data (the environment) is the environmental measure μ over R^d, the marginal of ν:

μ(dx) = ∫_{y∈R} ν(dx, dy)
To the recursion (1.2 and 1.3), we associate the autonomous averaged differential system:

dw_i/dt = ∫ (x − w_i) Θ_i(x, w, σ) μ(dx)    (3.1)
da_i/dt = ∫ [y − O(x, w, a, σ)] Θ_i(x, w, σ) ν(dx, dy)    (3.2)
We also use the notation

dw_i/dt = E[(X − w_i) Θ_i(X, w, σ)]    (3.3)
da_i/dt = E[[Y − O(X, w, a, σ)] Θ_i(X, w, σ)]    (3.4)

where (X, Y) is a random vector having ν as joint probability and E[·] stands for the mathematical expectation. Using the method of ordinary differential equations (ODE) for adaptive stochastic algorithms (see, e.g., Ljung 1977 or Benveniste et al. 1987), the asymptotic properties of the algorithm (1.2 and 1.3) may be studied in terms of the averaged system (3.1 and 3.2). With a constant gain parameter ε << 1, a trajectory of (1.2 and 1.3) issued from the initial condition [w(0), a(0)] remains ε-close (with high probability), on a large time scale of order 1/ε, to the deterministic trajectory solution of (3.1 and 3.2) with initial condition [w(0), a(0)]. (For a precise statement see, e.g., Benveniste et al. 1987, Theorem 1, p. 46.) The asymptotic behavior of the ODE (3.1 and 3.2) is given by the nature of its omega limit sets. Let t → x(t) be the unique solution to a differential equation dx/dt = F(x) with initial condition x(0) = x. The omega limit set of x, denoted ω(x), consists of points p such that p = lim x(t_k) for some sequence of times t_k → +∞. If the forward trajectory {x(t), t ≥ 0} is bounded, ω(x) is a nonempty compact connected invariant set (see, e.g., Hirsch and Smale 1974). An invariant set is a set in which trajectories remain for all times −∞ < t < +∞.
We shall now prove that under mild assumptions the omega limit sets of 3.1 and 3.2 are equilibria. Equation 3.1 is a gradient system of the form

dw/dt = σ² ∇C(w)    (3.5)

where

C(w) = ∫ log[h(x, w)] μ(dx)    (3.6)

and

h(x, w) = (1/N) Σ_{j=1}^N (2πσ²)^{−d/2} exp(−‖x − w_j‖²/2σ²)    (3.7)

(the normalizing constant in 3.7 does not affect the gradient of log h).
C(w) is the log-likelihood of an internal representation of the network that is given by a mixture of N gaussian modes centered on the weight vectors w_i. Note that if μ admits a density dμ [i.e., μ(dx) = dμ(x)dx], the maximization of C(w) is exactly the minimization of the Kullback discrepancy

K(dμ, h(·, w)) = ∫ dμ(x) log[dμ(x)/h(x, w)] dx

between the internal representation h(·, w) and the environmental density dμ(x).
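For completeness, the computation behind 3.5 is one line (a sketch; the normalizing constant of h drops out of the gradient of the logarithm):

```latex
\frac{\partial C}{\partial w_i}
  = \int \frac{\partial}{\partial w_i} \log h(x,w)\, \mu(dx)
  = \int \frac{e^{-\|x-w_i\|^2/2\sigma^2}}
           {\sum_{j} e^{-\|x-w_j\|^2/2\sigma^2}}
    \cdot \frac{x-w_i}{\sigma^2}\, \mu(dx)
  = \frac{1}{\sigma^2} \int (x-w_i)\, \Theta_i(x,w,\sigma)\, \mu(dx)
  = \frac{1}{\sigma^2}\,\frac{dw_i}{dt}.
```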
For a fixed value of the weight vector w = {w_i}, 3.2 is also a gradient system, governed by the LMS energy function

D(w, a) = (1/2) E[(Y − O(X, w, a, σ))²]    (3.8)
In summary, the differential system (3.1 and 3.2) can be rewritten as
dw/dt = σ² ∇C(w)
da/dt = −∂D(w, a)/∂a    (3.9)

This is a cascade of dynamical systems (see Hirsch 1989), that is, a dynamical system defined on a product E₁ × E₂ of Euclidean spaces, of the form

dw/dt = F(w)
da/dt = G(w, a)    (3.10)

If we know that both dw/dt = F(w) and da/dt = G(u, a), for any parameter u, have convergent dynamics, it seems natural to think that the cascade 3.10 also has convergent dynamics. However, this intuitive result is not true in full generality. Hirsch (1989) considered this problem and proved
several convergence results. Using the main line of the proof of Hirsch (1989, Theorem 5) it is not difficult to prove the following statement.

Theorem 2. Consider the cascade 3.10 defined on the product E₁ × E₂. Suppose trajectories of 3.10 are bounded. Suppose that omega limit sets for dw/dt = F(w) consist of equilibria. Suppose that for any parameter u, any nonempty compact set invariant under the dynamics da/dt = G(u, a) is composed of equilibria. Then the omega limit set of any solution to 3.10 is composed of equilibria.

To apply the preceding theorem to the averaged ODE (3.1 and 3.2) we need mild assumptions concerning the environmental probability μ. We use the following terminology:
Definition. Let {f_i}_{i=1,...,N} be a set of measurable functions. We say that {f_i}_{i=1,...,N} is independent in L^0(R^d, μ) if for any scalars (a_1, ..., a_N), Σ_{i=1}^N a_i f_i(x) = 0 μ-almost surely [i.e., μ{x : Σ_{i=1}^N a_i f_i(x) = 0} = 1] implies that a_1 = ... = a_N = 0. We say that a family {f_w} is N-independent in L^0(R^d, μ) if every finite subset of {f_w} with cardinal N is independent in L^0(R^d, μ). We say that a family {f_w} is independent in L^0(R^d, μ) if it is N-independent for any N.

Our main assumption concerning μ is that the family {exp(−‖x − w‖²)}_{w∈R^d} is N-independent in L^0(R^d, μ). This assumption is reasonable from a theoretical point of view as well as for practical purposes. For example, if we assume that μ admits a continuous density dμ [i.e., μ(dx) = dμ(x)dx], the family {exp(−‖x − w‖²)}_{w∈R^d} is independent in L^0(R^d, μ). If we assume that μ is a discrete probability associated with a finite training input set {x_1, x_2, ..., x_m}, the family {exp(−‖x − w‖²)}_{w∈R^d} is "generically" N-independent in L^0(R^d, μ) provided that m ≥ N + 1. Now, we can state the main result of this section, whose proof is given in the appendix.

Theorem 3. Suppose that the environmental probability μ has a compact support. Suppose that {exp(−‖x − w‖²)}_{w∈R^d} is N-independent in L^0(R^d, μ). Then the omega limit set of any solution to the system (3.1 and 3.2) consists of equilibria.

4 Conclusions
In this paper we have considered normalized gaussian unit networks for functional approximation from a theoretical point of view. We have proved that NGBF networks are capable of universal approximation in a satisfactory sense. Then, we have considered a hybrid learning algorithm in which the gaussian centers are modified according to a "soft"
competitive algorithm (that maximizes the log-likelihood of an internal representation) and the output weights are modified according to standard least mean square minimization (Nowlan 1990). Using the methods introduced by Hirsch (1989), we have proved that trajectories of the ODE associated with the learning rule converge toward equilibria. From a practical point of view, the hybrid algorithm we consider is easy to implement, intrinsically neuronal, and well suited to time series prediction problems. Moreover, it seems that NGBF networks have good generalization properties. This is an interesting fact that needs further mathematical analysis and computer investigation to be explained. Finally, it is interesting to point out that the recent theory of "cascades of dynamical systems" provides a basic tool for the rigorous analysis of hybrid algorithms in which hidden layers are trained according to an unsupervised rule and output units according to a supervised scheme. For example, the results of Section 3 could probably be extended to other algorithms such as that of Moody and Darken (1989).

Acknowledgments

This work was supported by a grant from the Centre National de la Recherche Scientifique (Programme Cognisciences). This research originated when the author was at the Centre d'etudes et de recherches de Toulouse. It is a pleasure to acknowledge the assistance of Manuel Samuelides and Linda Tomasini.

Mathematical Appendix

Proof of Theorem 1. Given a finite subset {w_1, w_2, ..., w_N} of R^d, we define the ith "closed Voronoi cell" as

cell(i) = {x ∈ R^d : ∀j ∈ {1, ..., N}, ‖x − w_i‖ ≤ ‖x − w_j‖}

If A is a closed set, we define the ith "closed Voronoi cell in A" as cell(i, A) = cell(i) ∩ A, and we define the radius of cell(i, A) (possibly infinite) as

rad(i, A) = sup{‖x − w_i‖ : x ∈ cell(i, A)}

Proof of (i). Let K ⊂ R^d be a compact set, f ∈ C(K), ε > 0. By uniform continuity of f on K, there exists η > 0 such that
∀(x, y) ∈ K², ‖x − y‖ ≤ η ⇒ |f(x) − f(y)| < ε/4    (A.1)

Since K is compact, there always exists a finite set {w_1, w_2, ..., w_N} ⊂ K
such that ∀i ∈ {1, ..., N}, rad(i, K) ≤ η/2. Now, define the family of functions {g_σ}_{σ>0} ⊆ S_N(K) by

g_σ(x) = Σ_{i=1}^N f(w_i) Θ_i(x, w, σ)    (A.2)

One has

|f(x) − g_σ(x)| ≤ Σ_{i=1}^N |f(w_i) − f(x)| Θ_i(x, w, σ)    (A.3)

Let I(i) = {j ∈ {1, ..., N} : cell(i, K) ∩ cell(j, K) ≠ ∅} and let x ∈ cell(i, K). If j ∈ I(i), there exists y ∈ cell(i, K) ∩ cell(j, K), and from A.1 one has

|f(w_j) − f(x)| ≤ |f(w_j) − f(y)| + |f(y) − f(w_i)| + |f(w_i) − f(x)| ≤ 3ε/4    (A.4)

If j ∉ I(i), there exists a_{ij} > 0 such that ‖x − w_i‖² − ‖x − w_j‖² ≤ −a_{ij} for all x ∈ cell(i, K). Let a = inf{a_{ij} : j ∈ {1, ..., N} − I(i), i = 1, ..., N}. From the relation

Θ_j(x, w, σ) = Θ_i(x, w, σ) exp[(‖x − w_i‖² − ‖x − w_j‖²)/2σ²]    (A.5)

we deduce that, for x ∈ cell(i, K) and j ∉ I(i),

Θ_j(x, w, σ) ≤ exp(−a/2σ²)    (A.6)

Choosing σ₀ such that 2N‖f‖_K exp(−a/2σ₀²) ≤ ε/4, we conclude from A.3, A.4, and A.6 that for σ ≤ σ₀,

‖f − g_σ‖_K ≤ ε    (A.7)
(i) is proved. Now, we shall prove (ii) and (iii), first for p ≥ 1. We assume that μ is either the Lebesgue measure or a probability measure. Let h ∈ L^p(R^d, μ), and let ε > 0. The measure μ is a Borel positive measure on R^d such that μ(K) < +∞ for any compact set K ⊆ R^d. It follows from a standard theorem that μ
is a regular measure (see, e.g., Rudin 1966, p. 48, Theorem 2.18). Since μ is a regular measure on R^d, the set of continuous functions with compact support is dense in L^p(R^d, μ) (see, e.g., Rudin 1966, Theorem 3.14). So there exist a compact set K′ ⊂ R^d and a continuous function f such that

f(x) = 0 for all x ∈ R^d − K′  and  ‖f − h‖_p ≤ ε/2    (A.8)
Because f is zero on R^d − K′, there exists η > 0 such that assertion A.1 is true for any compact set K ⊇ K′. Choose ρ > 2η such that K′ ⊆ B(0, ρ), where B(0, ρ) = {x ∈ R^d : ‖x‖ ≤ ρ}, and let us note K = B(0, 2ρ). Then, from A.7, there exist σ₀ and a function ĝ represented by expression A.2 such that for σ ≤ σ₀, ‖f − ĝ‖_K ≤ ε₁, where ε₁ > 0 is chosen so that ε₁^p μ(K) ≤ ε^p/2. From A.8 one has

‖ĝ − h‖_p ≤ ‖ĝ − f‖_p + ‖f − h‖_p ≤ ‖ĝ − f‖_p + ε/2    (A.9)
Moreover, since f vanishes outside K′ ⊆ K,

‖ĝ − f‖_p^p = ∫_K |ĝ − f|^p μ(dx) + ∫_{R^d−K} |ĝ|^p μ(dx)    (A.10)

It follows that

∫_K |ĝ − f|^p μ(dx) ≤ ε₁^p μ(K) ≤ ε^p/2    (A.11)

Let J = {i : w_i ∈ B(0, ρ)}. Since, for ‖x‖ ≥ ρ, f(x) = 0, it is clear from the construction of ĝ that, for x ∈ R^d − K,

|ĝ(x)| ≤ ‖f‖_K Σ_{i∈J} Θ_i(x, w, σ)
It follows that

∫_{R^d−K} |ĝ|^p μ(dx) ≤ ‖f‖_K^p ∫_{R^d−K} [Σ_{i∈J} Θ_i(x, w, σ)]^p μ(dx)

When μ is the Lebesgue measure or a probability measure, this last term converges to 0 as σ → 0; choosing σ ≤ σ₀ sufficiently small, it can be bounded by ε^p/2. We then get from A.9 and A.10 that ‖ĝ − h‖_p ≤ (3/2)ε, and since ε > 0 is arbitrary,
(ii) is proved. The proof of (iii), the case p = 0, is a straightforward application of (i) and Lusin's theorem (see, e.g., Rudin 1966) (see also Lemma 2.2 and Lemma A.1 given in Hornik et al. 1989). □

Proof of Theorem 2. Let [w(t), a(t)] be a trajectory of 3.10 and ω(w, a) the omega limit set of [w(t), a(t)]. Let (p, q) ∈ ω(w, a). Let π_p : E₂ → E₁ × E₂ be the function defined by π_p(x) = (p, x). Since p is an equilibrium for dw/dt = F(w), the set π_p^{−1}[ω(w, a)] = {x : (p, x) ∈ ω(w, a)} is a nonempty set invariant under the dynamics of da/dt = G(p, a). Since ω(w, a) is compact, it is clear that π_p^{−1}[ω(w, a)] is closed and bounded. So it is compact. By hypothesis, any compact nonempty invariant set for da/dt = G(p, a) is composed of equilibria. It follows that (p, q) is an equilibrium. □
Proof of Theorem 3. Since μ has a compact support, there exists T > 0 such that the support of μ is included in [−T, T]^d. It follows that trajectories of 3.1 are bounded. Let w_{ik} be the kth component of weight w_i. For w_{ik} > T one has dw_{ik}/dt < 0, and for w_{ik} < −T one has dw_{ik}/dt > 0. It follows that w_{ik} eventually enters [−T, T]. Because the system 3.1 is a gradient system (see equation 3.5) with bounded trajectories, all omega limit points of any trajectory are equilibria (see, e.g., Hirsch and Smale 1974). Let us now consider equation 3.2, where w is assumed to be a fixed parameter. In that case 3.2 is a linear system
da_i/dt = b_i − Σ_{j=1}^N A_{ij} a_j,  i = 1, ..., N    (A.12)
with b_i = E[Y Θ_i(X, w, σ)] and A_{ij} = E[Θ_i(X, w, σ) Θ_j(X, w, σ)]. First let us show that the fixed point equation

b_i = Σ_{j=1}^N A_{ij} a_j,  i = 1, ..., N    (A.13)
always admits a solution (not necessarily unique!). By reordering the variables we can always assume that {w_1, w_2, ..., w_N} = {w_1, ..., w_p} with p ≤ N and w_i ≠ w_j for i ≤ p, j ≤ p, and i ≠ j. Because w_i = w_j implies that b_i = b_j and A_{ik} = A_{jk}, the linear system A.13 can be rewritten as

b_i = Σ_{j=1}^p A_{ij} ã_j,  i = 1, ..., p    (A.14)

where the ã_j collect the coefficients of the repeated weight vectors.
To show that A.14 admits a solution we shall prove that the submatrix A_p = (A_{ij})_{i=1,...,p; j=1,...,p} is invertible. A_p is the symmetric matrix associated to the quadratic form

(A_p u, u) = E[(Σ_{i=1}^p Θ_i(X, w, σ) u_i)²]
So (A_p u, u) = 0 implies Σ_{i=1}^p Θ_i(X, w, σ) u_i = 0 μ-almost surely. Because the family {exp(−‖x − w‖²)}_{w∈R^d} is N-independent in L^0(R^d, μ), Σ_{i=1}^p Θ_i(X, w, σ) u_i = 0 μ-almost surely implies u_1 = ... = u_p = 0. It follows that A_p is invertible, and hence that A.13 admits a (not necessarily unique) solution. Let a* be such a solution; A.12 can then be rewritten as

da/dt = −A(a − a*)    (A.15)
with A = (A_{ij})_{i,j=1,...,N}. Because A is the symmetric matrix of the quadratic form (Au, u) = E[(Σ_{i=1}^N Θ_i(X, w, σ) u_i)²], the eigenvalues of A are nonnegative. So by a linear change of coordinates A.15 can be rewritten as

dã_i/dt = −γ_i(ã_i − ã_i*),  i = 1, ..., N    (A.16)
with γ_i ≥ 0. From A.16 we deduce that nonempty compact invariant sets of A.15 are composed of equilibria (not necessarily isolated). We conclude by applying Theorem 2 to the system (3.1 and 3.2). □
References

Benveniste, A., Métivier, M., and Priouret, P. 1987. Algorithmes adaptatifs et approximations stochastiques. Masson, Paris.
Benaim, M., and Tomasini, L. 1991. Competitive and self-organizing algorithms based on the minimization of an information criterion. In Artificial Neural Networks, T. Kohonen, K. Makisara, and J. Kangas, eds., Vol. 1, pp. 391-396. North-Holland, Amsterdam.
Benaim, M., and Tomasini, L. 1992. Approximating functions and predicting time series with multi-sigmoidal basis functions. In Artificial Neural Networks, I. Aleksander and J. Taylor, eds., Vol. 1, pp. 407-411. Elsevier Science Publishers B.V., Amsterdam.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals Syst. 2, 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Grossberg, S. 1976. Adaptive pattern classification and universal recoding I: Parallel development and coding of neural feature detectors. Biol. Cybernet. 23, 187-202.
Hartman, E., and Keeler, J. D. 1991. Predicting the future: Advantage of semilocal units. Neural Comp. 3, 566-578.
Hartman, E., Keeler, J. D., and Kowalski, J. 1990. Layered neural networks with gaussian hidden units as universal approximators. Neural Comp. 2, 210-215.
Hirsch, M. W. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2, 331-349.
Hirsch, M. W., and Smale, S. 1974. Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modeling. Los Alamos Tech. Rep. LA-UR-87.
Linde, Y., Buzo, A., and Gray, R. 1980. An algorithm for vector quantizer design. IEEE Transact. Commun. COM-28, 84-95.
Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Transact. Automatic Control 22(4), 551-575.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nowlan, S. 1990. Maximum likelihood competitive learning. Proceedings of Neural Information Processing Systems, 574-582.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3, 246-257.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Rudin, W. 1966. Real and Complex Analysis. McGraw-Hill, New York.
Rumelhart, D., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. Institute of Radio Engineers WESCON Convention Record, Pt. 4, pp. 96-104.
Received June 22, 1992; accepted June 22, 1993.
Communicated by Radford Neal
Statistical Physics, Mixtures of Distributions, and the EM Algorithm

Alan L. Yuille
Division of Applied Sciences, Harvard University, Cambridge, MA 02138 USA
Paul Stolorz
Jet Propulsion Laboratory, MS 198-219, Pasadena, CA 91109, and Santa Fe Institute, Santa Fe, NM 87501 USA
Joachim Utans
International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704 USA
We show that there are strong relationships between approaches to optimization and learning based on statistical physics or mixtures of experts. In particular, the EM algorithm can be interpreted as converging either to a local maximum of the mixtures model or to a saddle point solution to the statistical physics system. An advantage of the statistical physics approach is that it naturally gives rise to a heuristic continuation method, deterministic annealing, for finding good solutions.
In recent years there has been considerable interest in formulating optimization problems in terms of statistical physics. This has led to the development of powerful optimization algorithms, such as deterministic annealing. At the same time good results have been attained by formulating learning theory in terms of mixtures of distributions (Jacobs et al. 1991) and using the EM algorithm (Jordan and Jacobs 1993). The aim of this note is to show that there are close connections between the mixture of distributions and the statistical physics approaches. The EM algorithm can be, and has been, used in conjunction with deterministic annealing. This equivalence has previously been mentioned for some specific cases (Yuille et al. 1991; Stolorz 1991; Utans 1993) but, to our knowledge, its generality does not seem to be widely appreciated. We will demonstrate these equivalences by examining the elastic net
algorithm for the Traveling Salesman Problem (TSP). Then we will discuss the generalization to other problems. The elastic net (Durbin and Willshaw 1987) attempts to fit an elastic net, consisting of points {y_j : j = 1, ..., N} joined together by elastic strings, to a set of cities {x_μ : μ = 1, ..., M}, where N ≥ M. The intuition is that the elastic forces will cause the net to find the shortest possible tour. This corresponds to minimizing an energy function:

E_eff[{y_j}; β] = −(1/β) Σ_μ log Σ_j e^{−β‖x_μ − y_j‖²} + γ Σ_j ‖y_j − y_{j+1}‖²    (1)
where the net is circular so that y_{N+1} = y_1. Here β is a parameter that characterizes the inverse scale. The idea is to minimize E_eff[{y_j}; β] at large scale, small β, and then track the solution as β increases. It was shown (Durbin et al. 1989) that this could be interpreted in a Bayesian framework. We can write the Gibbs distribution P(y | x) = (1/Z) exp{−βE[y]} (where Z is a normalization constant) and express this in terms of Bayes' formula as

P(y | x) = P(x | y) P(y) / P(x)    (2)

We interpret P(y) = (1/Z₁) e^{−βγ Σ_j ‖y_j − y_{j+1}‖²} as the prior (conditional) probability for the tour (Z₁ is a normalization constant). The distribution

P(x | y) = (1/Z₂) Π_μ Σ_j e^{−β‖x_μ − y_j‖²}    (3)
corresponds to the product of mixtures of gaussian distributions (Z₂ is a normalization constant). More specifically, each data point x_μ is assumed to be produced by a mixture of gaussians centered on the points {y_j}. Thus the elastic net corresponds to a mixture model of data generation combined with a prior model. It was then shown (Simic 1990; Yuille 1990) that the elastic net could be derived from a statistical physics system as a saddle point approximation. The derivation in Yuille (1990) started from an energy

E[V, y] = Σ_{μj} V_{μj} ‖x_μ − y_j‖² + γ Σ_j ‖y_j − y_{j+1}‖²    (4)

where the {V_{μj}} are binary (0,1) variables which obey the constraint Σ_j V_{μj} = 1, ∀μ. The partition function of the corresponding Gibbs distribution can be written as Z = Σ_V ∫[dy] e^{−βE[V,y]}. The sum over the V variables can be done explicitly (Yuille 1990), while imposing the constraints, to yield Z = ∫[dy] e^{−βE_eff[y;β]}, where E_eff[y; β] is given by equation 1. By an identical calculation we can compute the marginal distribution P_M(y) = Σ_V P(V, y),
where P_M(y) is a Gibbs distribution with energy E_eff[y; β], corresponding to the mixture distribution (equation 3). Thus extremizing E_eff[y; β] corresponds to performing a saddle point approximation¹ to the partition function and hence to finding the mean field approximation to the system (Amit 1989). It can also be considered to be maximizing P_M(y). This equivalence between finding the saddle point approximation and maximizing P_M is the key reason why the statistical physics and mixture approaches correspond. It is important to emphasize that there are many possible algorithms for attempting to minimize E_eff[y; β]. The original algorithm (Durbin and Willshaw 1987) was a discretized version of steepest descent:

dy_j/dt = −2 Σ_μ [e^{−β‖x_μ−y_j‖²} / Σ_k e^{−β‖x_μ−y_k‖²}] (y_j − x_μ) + 2γ(y_{j+1} − 2y_j + y_{j−1})    (5)
To obtain the algorithm used in Durbin and Willshaw (1987), approximate dy/dt by [y(t+1) − y(t)]/K, where K is a constant, and set y = y(t) in the right-hand side of equation 5. However, the EM algorithm has also been successfully applied to find the saddle point solutions (Durbin, private communication), with results reported in Peterson (1990). An EM algorithm assumes there are two types of parameters, in this case the V and the y. An E-step estimates the V with the y fixed. An M-step maximizes to find the y with the V fixed. The E-step and the M-step alternate until convergence. (Observe that the M-step finds a single value for y while the E-step finds an expectation over a distribution of values for V. The EM algorithm finds a probability distribution for the V but, because they are binary variables, this is equivalent to finding their mean values.) For the TSP the E-step is

V̄_{jμ} = e^{−β‖x_μ−y_j‖²} / Σ_k e^{−β‖x_μ−y_k‖²}    (6)
and the M-step corresponds to maximizing P(V̄, y) with V given by V̄. This is equivalent to solving the linear equations for y:

Σ_μ V̄_{jμ}(y_j − x_μ) = γ(y_{j+1} − 2y_j + y_{j−1})    (7)

These can be solved by a variety of algorithms, including the dynamic system

y_j(t+1) = [Σ_μ V̄_{jμ} x_μ + γ(y_{j+1}(t) + y_{j−1}(t))] / [Σ_μ V̄_{jμ} + 2γ]    (8)
¹This corresponds to approximating the integral for Z by the maximum of the integrand. The more peaked the integrand, as β → ∞, the better the approximation.
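To make the E- and M-steps concrete, here is a minimal sketch of the iteration (equations 6-8) wrapped in a crude annealing loop. The cities, net size, γ, and the schedule are illustrative assumptions, not settings from the papers cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, gamma = 10, 30, 0.2
x = rng.random((M, 2))                                    # cities x_mu
t = 2 * np.pi * np.arange(N) / N
y = 0.5 + 0.1 * np.column_stack((np.cos(t), np.sin(t)))   # circular net y_j

for beta in np.geomspace(1.0, 200.0, 60):                 # deterministic annealing
    for _ in range(5):                                    # EM at fixed beta
        # E-step (equation 6): soft assignments V_bar[j, mu].
        d2 = ((y[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        V = np.exp(-beta * (d2 - d2.min(axis=0)))         # stabilized exponent
        V /= V.sum(axis=0)
        # M-step: one sweep of the fixed-point iteration (equation 8)
        # for the linear equations (7); neighbors via circular shifts.
        neigh = np.roll(y, -1, axis=0) + np.roll(y, 1, axis=0)
        y = (V @ x + gamma * neigh) / (V.sum(axis=1)[:, None] + 2 * gamma)
```

At low β every net point feels all cities almost equally; as β increases the V̄ sharpen and the net settles onto a tour.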
Observe that the previous steepest descent algorithm can be written, using equation 5, as
dy_j/dt = −2 Σ_μ V̄_{jμ}(y_j − x_μ) + 2γ(y_{j+1} − 2y_j + y_{j−1})
Thus the only difference between EM (equations 6, 8) and steepest descent (equation 5) is that for EM the V and y are estimated in turn, while for steepest descent they are estimated together. Both algorithms will converge to a local minimum of E_eff[{y_j}; β]. It appears that, at fixed temperature, EM converges faster than steepest descent (Durbin, private communication), probably because, for this specific energy function, the E-step can be computed directly (see equation 6) and the M-step corresponds to solving linear equations (see equation 7). But the quality of the results on the TSP (Peterson 1990) decreased badly as the annealing rate was increased, demonstrating that EM was effective only when used in conjunction with annealing. In summary, we can regard the elastic net as two types of system: (1) a mixture of distributions that can be solved, at fixed β, by an EM algorithm, or (2) a statistical physics system whose mean fields can be estimated by a variety of algorithms including steepest descent and EM. In both cases deterministic annealing requires that the solutions are found at low β and then tracked as β increases. This continuation method is a heuristic technique for finding the global minimum of the effective energy. By contrast, the EM algorithm applied at fixed β is guaranteed to find only a local minimum. The basic ideas here are straightforward to generalize. A problem posed in terms of a mixture of distributions can be reformulated as a statistical physics problem and vice versa. An EM algorithm can be applied and can be thought of as either a way to obtain a maximum a posteriori estimate of the mixture distribution or as a solution to the saddle point equations, the mean field equations, for the statistical physics system. In addition there is a simple relationship between a steepest descent algorithm to estimate the mean fields and the EM algorithm. Thus the EM algorithm should not be thought of as a rival to deterministic annealing. It is simply one way to solve the mean field equations. The key idea of deterministic annealing, which takes it beyond EM, is the continuation strategy of finding the solution at small β and tracking it as β increases. We now briefly discuss how these results can be extended to other problems such as learning/adaptive experts (Jacobs et al. 1991). For a general mixtures model one assumes that the data {x_μ} are generated by a mixture of distributions:

P(x) = Σ_i a_i P_i(x : α_i)
where the set {a_i} consists of nonnegative numbers such that Σ_i a_i = 1, and the set {α_i} characterizes the (continuous) parameters of the distributions {P_i}. For example, we might let P_i be a gaussian with parameters α_i = (μ, σ). In practice, we are interested in determining the parameters {α_i} from the data. This can be done by applying Bayes' theorem to obtain

P({α_i} | {x_μ}) = P({x_μ} | {α_i}) P_p(α) / P(x)

where P_p(α) is the prior probability of the parameters² and P(x) is a normalization constant. The {α_i} can be chosen by an estimator for this distribution, for example, the maximum a posteriori estimator {α_i*} = argmax_{{α_i}} P({α_i} | {x_μ}). The nature of x will depend on the problem being modeled: (1) for supervised learning it is an input-output training pair, (2) for unsupervised learning it is an input, (3) it is the data for an optimization problem, for example, see our discussion of the TSP, and (4) for a signal processing or vision problem it is some processed version of the input signal or image. Observe that the V are interpreted differently for these cases.
where Pp(rr) is the prior probability of the parameters and P ( x ) is a normalization constant. The {o,} can be chosen by an estimator for this distribution, for example, the maximum u posteriori estimator {(r:} = argmaq,,,)P ( { ( k l } I {x,,)). The nature of x will depend on the problem being modeled: (1) for supervised learning it is an input-output training pair, (2) for unsupervised learning it is an input, (3) it is the data for an optimization problem, for example, see our discussion of the TSP, and (4) for a signal processing or vision problem it is some processed version of the input signal or image. Observe that the V are interpreted differently for these cases. To formulate this as a mixtures problem, or in terms of statistical physics, we introduce binary decision variables {V,,,}, as for the TSP, with 1, V,,, = 1. V 11. This gives rise to P(V. IP I x) = eci'E(V,rr)/Z where
As before we can define an EM algorithm for V and gives
It.
The E-step
This will converge to a local maximum of P(α | x) or equivalently to a solution of the saddle point equations of the statistical physics system. Deterministic annealing will give a heuristic continuation method for solving these equations which, in general, will be preferable to using EM at fixed β. While completing this work we learnt of an interesting result by Neal and Hinton (1993; see also Hathaway 1986), which states that both the E- and the M-steps of the EM algorithm can be interpreted as minimizing an effective energy, or equivalently as maximizing the associated Gibbs

²These can be considered as hyperpriors for the models.
distribution. To understand this result from our perspective, observe that we can use saddle point techniques (see, for example, Simic 1990) to obtain an effective energy for the TSP energy without eliminating the V variables. This gives

E_eff[V̄, y] = Σ_{μj} V̄_{μj} ‖x_μ − y_j‖² + γ Σ_j ‖y_j − y_{j+1}‖² + (1/β) Σ_{μj} V̄_{μj} log V̄_{μj} + Σ_μ Q_μ (Σ_j V̄_{μj} − 1)
where the {Q_μ} are Lagrange multipliers and the {V̄_{μj}} correspond to the expected values of the {V_{μj}}. Minimizing E_eff[V̄, y] with respect to either V̄ or y (keeping the other fixed) will yield the E- and the M-steps. (Observe that minimizing E_eff[V̄, y] with respect to V̄, solving for V̄(y), and substituting back gives the effective energy E_eff[y] of equation 1.)
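As a minimal numerical sketch of the general recipe above, consider a two-mode 1-D gaussian mixture whose means are fit by EM inside a deterministic annealing loop. All numbers are illustrative, and the tiny jitter is added only so the symmetric high-temperature solution can split as β increases.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.5, 200)])
mu = np.zeros(2)                                  # symmetric start

for beta in np.geomspace(0.05, 4.0, 40):          # annealing in beta
    mu += 1e-3 * rng.standard_normal(2)           # jitter to allow splitting
    for _ in range(10):                           # EM at fixed beta
        # E-step: mean-field responsibilities V_bar[mu, i]
        logp = -beta * (x[:, None] - mu[None, :]) ** 2
        V = np.exp(logp - logp.max(axis=1, keepdims=True))
        V /= V.sum(axis=1, keepdims=True)
        # M-step: maximize with V_bar fixed (weighted means)
        mu = (V * x[:, None]).sum(axis=0) / V.sum(axis=0)

print(np.sort(mu))                                # approaches (-2, 3)
```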
Acknowledgments

We would like to thank Eric Mjolsness and Anand Rangarajan for helpful conversations and encouragement. One of us (A.L.Y.) thanks DARPA and the Air Force for support under contract F49620-92-J-0466 and Geoffrey Hinton for a helpful conversation.
References

Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge, England.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. L. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Hathaway, R. J. 1986. Another interpretation of the EM algorithm for mixture distributions. Stat. Prob. Lett. 4, 53-56.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Jordan, M. I., and Jacobs, R. A. 1993. Hierarchical mixtures of experts and the EM algorithm. MIT Dept. of Brain and Cognitive Science preprint.
Neal, R. M., and Hinton, G. E. 1993. A new view of the EM algorithm that justifies incremental and other variants. Biometrika, submitted.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization. Neural Comp. 2, 261.
Simic, P. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimization. NETWORK: Comp. Neural Syst. 1(1), 1-15.
Stolorz, P. 1991. Abusing statistical mechanics to do adaptive learning and combinatorial optimization. Los Alamos preprint.
Utans, J. 1993. Mixture models and the EM algorithm for object recognition within compositional hierarchies. Part 1: Recognition. TR-93-004, ICSI, 1947 Center St., Berkeley, CA 94704.
Yuille, A. L. 1990. Generalized deformable models, statistical physics and matching problems. Neural Comp. 2, 1-24.
Yuille, A. L., Peterson, C., and Honda, K. 1991. Deformable templates, robust statistics, and Hough transforms. Proceedings SPIE Geometric Methods in Computer Vision, San Diego, CA.
Received March 29, 1993; accepted July 15, 1993.
Communicated by Douglas Miller
REVIEW
Statistical Physics Algorithms That Converge

A. L. Yuille
J. J. Kosowsky
Division of Applied Sciences, Harvard University, Cambridge, MA 02138 USA
In recent years there has been significant interest in adapting techniques from statistical physics, in particular mean field theory, to provide deterministic heuristic algorithms for obtaining approximate solutions to optimization problems. Although these algorithms have been shown experimentally to be successful, there has been little theoretical analysis of them. In this paper we demonstrate connections between mean field theory methods and other approaches, in particular, barrier function and interior point methods. As an explicit example, we summarize our work on the linear assignment problem. In this previous work we defined a number of algorithms, including deterministic annealing, for solving the assignment problem. We proved convergence, gave bounds on the convergence times, and showed relations to other optimization algorithms.

1 Statistical Physics and Mean Field Theory for Optimization
In recent years there has been significant interest in adapting techniques from statistical physics, in particular mean field theory, to provide deterministic heuristic algorithms for obtaining approximate solutions to optimization problems (Hopfield and Tank 1985; Grzywacz and Yuille 1986; Peterson and Soderberg 1989; Simic 1990; Yuille 1990; Geiger and Girosi 1991; Rose et al. 1990; Platt and Hopfield 1986). These algorithms, some of which are known as deterministic annealing, are closely related to simulated annealing (Kirkpatrick et al. 1983; Geman and Geman 1984) and the Boltzmann machine (Kienker et al. 1986). Both approaches formulate the optimization problem in terms of minimizing a cost function and defining a corresponding Gibbs distribution. Simulated annealing then proceeds by sampling the Gibbs probability distribution as the temperature is reduced to zero, while deterministic annealing attempts to track an approximation to the mean of the distribution. An exciting aspect of these algorithms is the possibility of implementing them in special purpose VLSI (Elfadel 1993). Although simulations have shown (Durbin and Willshaw 1987; Peterson and Soderberg 1989; Peterson 1990) that
mean field methods often yield effective algorithms,¹ there has been little theoretical analysis of them and few attempts to explain how they relate to more standard optimization algorithms. In this paper, we demonstrate connections between mean field theory methods and other approaches. We will specifically consider linear programming with barrier functions (Bayer and Lagarias 1989; Wright 1992) and interior point methods (Karmarkar 1984; Faybusovich 1991; Monteiro and Adler 1989). In addition, there are interesting similarities to the auction algorithm (Bertsekas 1990) and primal/dual min cost flow algorithms (Ford and Fulkerson 1962). As an explicit example, we summarize our earlier work (Kosowsky and Yuille 1991; Yuille and Kosowsky 1991) on the linear assignment problem. In this previous work we defined a number of algorithms, including deterministic annealing, for solving the assignment problem. We proved convergence, gave bounds on the convergence times, and showed relations to other optimization algorithms. The second section describes the saddle point method for obtaining the mean field equations. The third section shows relations between mean field methods and other optimization techniques. The fourth section derives useful results for putting bounds on the convergence of algorithms derived from mean field theory. In Section 5, we analyze the linear assignment problem and obtain convergence bounds. In addition we show relationships to other optimization techniques.

2 Deterministic Annealing
In this paper we will only consider optimization problems that can be formulated in terms of minimizing an energy E[V] with respect to binary variables V = {V_ia}. We assume that V corresponds to an N × N square matrix {V_ia}. We impose global row constraints Σ_a V_ia = 1, ∀i, and column constraints Σ_i V_ia = 1, ∀a. Note that {V_ia} now represents a permutation matrix and thus the solution space of the original optimization problem can be embedded in the permutation group of N elements. A specific example, that we will study in more detail in Section 5, is the linear assignment problem. The energy can be written as E[V] = −Σ_ia V_ia A_ia, where the {A_ia} are fixed weights. To obtain an intuitive, economic understanding of this problem, one can let the i's label people, the a's label objects, and set A_ia to be the benefit of the ath object to the ith person. The problem then consists of finding a one-to-one mapping between people and objects such that the overall benefit is maximized.

¹Deterministic annealing has sometimes been criticized for its performance on large problems. These criticisms are mainly directed at the formulation used in Hopfield and Tank (1985) and, in particular, at the way global constraints were imposed by energy biases, though significant improvements to this approach have been made in Aiyer et al. (1990). The work in this paper, however, follows an alternative strategy (Durbin and Willshaw 1987) that imposes constraints directly and generalizes well to large problems.
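As a concrete toy instance of this assignment problem (the benefit matrix below is random and purely illustrative), the optimum can be found by brute force over the permutation group:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
N = 5
A = rng.random((N, N))            # A[i, a]: benefit of object a to person i

# E[V] = -sum_ia V_ia A_ia over permutation matrices V; perm[i] is the
# object assigned to person i, so maximal benefit = minimal energy.
benefit = lambda perm: sum(A[i, a] for i, a in enumerate(perm))
best = max(permutations(range(N)), key=benefit)
print(best, benefit(best))
```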
A similar energy function arises in the minimal mapping theory for long-range motion perception (Ullman 1979). The optimization problem is embedded in a statistical physics framework by defining a Gibbs distribution (Parisi 1988) so that the probability of any specific configuration V is given by P[V] = e^{−βE[V]}/Z, where β = 1/T is the inverse of the temperature and the corresponding partition function of the system is Z = Σ_V e^{−βE[V]}, where the summation should be taken only over configurations that satisfy the global row and column constraints. This direct method of imposing the constraints is known as hard constraints (Peterson and Soderberg 1989; Simic 1990; Yuille 1990).

2.1 Effective Energies and the Saddle Point Approximation. If the partition function for a physical system is known then it is typically straightforward to calculate the mean fields V̄_ia = Σ_V V_ia P[V] (Parisi 1988). In the limit as T → 0 the mean fields will tend to the lowest energy state of the system (all other states will have zero probability in the limit). Unfortunately it is often impossible to calculate the partition function explicitly and instead one is reduced to approximating it. Physicists have developed a variety of techniques for approximating the partition function. We will use one of the simplest possible methods, the saddle point approximation [for details see Parisi (1988)]. This method involves imbedding the discrete space of permutation matrices in Euclidean space and then enforcing the constraints via an integral representation of the delta function. This induces a new set of auxiliary variables and results in a new effective energy function that is a function of both the original and auxiliary variables plus a temperature parameter. The partition function is now the integral of the exponential of minus the effective energy. In the saddle point approximation we approximate the mean field at a fixed temperature by finding the minima of the effective energy. Thus we obtain approximations to the mean fields as a function of temperature. We call the equations for the extrema of the effective energy the saddle point equations. [An alternative approach involves imbedding the permutation group in the continuous rotation group (Brockett 1988). Relations to this approach are discussed in Yuille and Kosowsky (1993).] It should be emphasized that the saddle point approximation is often a poor, and sometimes terrible, approximation to the system.² For optimization, however, we are not concerned with modeling the full physical system. We are only interested in obtaining the correct, or approximately correct, solution in the limit as T → 0. This requires (1) that the global minimum of the effective energy tends to the global minimum of the original energy function in the limit as T → 0, and (2) that we can find the global minimum of the effective energy. We will discuss these issues in later sections.

²Physicists have shown that the approximation can become exact in the limit as the size of the system tends to infinity provided the system has long range interactions (Parisi 1988). Unfortunately such conditions usually do not hold for the class of problems we are concerned with.
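Since the assignment problem is small enough to enumerate, the exact mean fields can be computed directly, which makes the T → 0 statement easy to check numerically (the sizes here are illustrative):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
N = 4
A = rng.random((N, N))

def mean_fields(beta):
    """Exact V_bar_ia = sum_V V_ia P[V], summing over permutation matrices."""
    Vbar, Z = np.zeros((N, N)), 0.0
    for perm in permutations(range(N)):
        E = -sum(A[i, a] for i, a in enumerate(perm))   # E[V] = -sum V_ia A_ia
        w = np.exp(-beta * E)
        Z += w
        for i, a in enumerate(perm):
            Vbar[i, a] += w
    return Vbar / Z

for beta in (0.1, 1.0, 20.0):
    print(beta)
    print(np.round(mean_fields(beta), 2))   # sharpens to the optimal permutation
```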
Several related effective energies can be used to describe the same physical system. The extrema of these effective energies will correspond; hence the approximations obtained will be equivalent. We illustrate with two examples. The first (Peterson and Soderberg 1989) imposes half of the constraints by introducing additional variables {U_ia} and the remainder by Lagrange multipliers {P_a}. It expresses the partition function by Z = ∫[dS][dP][dU] e^{−βE_eff[S,P,U]}, where the {S_ia} are continuous variables taking values in [0,1] that correspond to the {V_ia}. The values {S*} that extremize E_eff[S, P, U] are the mean field approximation. The effective energy E_eff[S, P, U] is given by

E_eff[S, P, U] = E[S] + (1/β) Σ_ia S_ia U_ia − (1/β) Σ_i log Σ_a e^{U_ia} + Σ_a P_a (Σ_i S_ia − 1)    (2.1)
The second [see, for example, Elfadel and Yuille (1993)] is perhaps more intuitive. Instead of using saddle point techniques to approximate an exact expression for the partition function, the approach directly obtains an alternative, but related, effective energy by approximating the free energy. It imposes the global constraints by Lagrange multipliers {P_i} and {Q_a}. The resulting effective energy is given by

E_eff[S, P, Q] = E[S] + Σ_i P_i (Σ_a S_ia − 1) + Σ_a Q_a (Σ_i S_ia − 1) + (1/β) Σ_ia S_ia log S_ia    (2.2)

Thus, the effective energy consists of the original energy, the Lagrange multiplier terms, plus an entropy term (1/β) Σ_ia S_ia log S_ia. The entropy serves two purposes. First, since it is convex it will cause the effective energy to be convex for sufficiently small β (large T). Thus we can solve the problem at high temperature and then track the solution as β increases. This procedure is an example of a continuation method (Wasserstrom 1973). Second, the entropy acts as a barrier function [see Wright (1992)] that, in conjunction with the Lagrange multiplier terms, will prevent the {S_ia} from taking values outside the range [0,1]. To see this, observe that the gradient of the effective energy approaches minus infinity as one of the {S_ia} approaches zero, and hence imposes a positivity constraint on the {S_ia}, while the Lagrange multiplier constraints prevent any of the {S_ia} from becoming greater than one, since this would force other S's in the same row, or column, to be negative.
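One simple way to solve the saddle point equations of 2.2 for the assignment energy E[S] = −Σ_ia S_ia A_ia is worth recording: stationarity gives S_ia ∝ e^{βA_ia} up to row and column factors coming from P_i and Q_a, and alternately rescaling rows and columns determines those factors (a Sinkhorn-style iteration; this particular solver is our illustration, not an algorithm specified in the text). A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 6
A = rng.random((N, N))

for beta in (1.0, 5.0, 50.0):
    S = np.exp(beta * (A - A.max()))       # stationarity: S_ia ~ exp(beta*A_ia)
    for _ in range(200):                   # alternate the two constraint sets
        S /= S.sum(axis=1, keepdims=True)  # rows sum to 1 (P_i)
        S /= S.sum(axis=0, keepdims=True)  # columns sum to 1 (Q_a)
    print(beta)
    print(np.round(S, 2))                  # sharpens toward a permutation matrix
```

For this convex case each β can be solved independently; for energies with multiple minima the tracking described in Section 3.2 becomes essential.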
To see the equivalence between these formulations we can extremize 2.1 with respect to {U_ia}, which gives

S_ia = e^{U_ia} / Σ_b e^{U_ib}    (2.3)

This can be rewritten as log S_ia = U_ia − log{Σ_b e^{U_ib}}, ∀i, a. Multiplying by S_ia/β and summing over a and i, and then applying the global row constraint Σ_a S_ia = 1 (which is automatically imposed by 2.3), yields

(1/β) Σ_ia S_ia log S_ia = (1/β) Σ_ia S_ia U_ia − (1/β) Σ_i log Σ_b e^{U_ib}    (2.4)
The equivalence between 2.2 and 2.1 follows directly from 2.4.

2.2 Optimization Methods. Once we have obtained the saddle point equations for a specific problem, we must attempt to solve for the minima of the effective energy. For certain classes of problem the effective energy is convex. Two prototypical algorithms will always find the minimum and thereby solve the original problem³: (1) descent in the effective energy at constant and sufficiently low temperature,⁴ and (2) tracking the minimum energy solution as the temperature decreases. The first method is similar to existing neural network algorithms and it seems plausible (Elfadel 1993) that it can be implemented in VLSI circuits. The second (see Section 3.2) is sometimes equivalent to interior point algorithms. For more specific problems a larger variety of algorithms exist; see Section 5. We emphasize that our goal is to develop differential equations (i.e., analog algorithms) for solving combinatorial problems. If the algorithms are to be simulated on serial computers then techniques such as Newton's method and conjugate gradient descent may be preferable to steepest descent. For difficult optimization problems, such as the traveling salesman problem, the effective energy will typically have many local minima at low temperature. It follows from 2.2, however, that the effective energy will be convex at sufficiently high temperatures. Deterministic annealing is a heuristic continuation method which attempts to find the global minimum of the effective energy at high temperature and track it as the temperature decreases. There is no guarantee that the minimum at high temperature can always be tracked to the minimum at low temperature, but the experimental results are encouraging (Peterson and Soderberg 1989; Durbin and Willshaw 1987; Peterson 1990). The same saddle point method and deterministic annealing can be applied to problems where the energy depends also on a continuous variable and where only one, or none, of the global constraints is imposed. The modification is straightforward and can be seen in Yuille (1990), Geiger and Girosi (1991), Rose et al. (1990), and Ohlsson et al. (1992). A similar approach, though without using partition functions, was described in Koch et al. (1986).

³The minimum of the effective energy as T → 0 does not always correspond to the true solution to the problem, but the energy can usually be modified until it does; see Section 4.
⁴If the effective energy contains Lagrange multiplier terms then we will project the energy gradient onto the space obeying the global constraints.

3 Related Algorithms
3.1 Barrier Function for Alternative Statistical System. Barrier function methods are a standard technique for solving linear and nonlinear optimization problems (Wright 1992). If we need to minimize a function f(x) within a domain D, then a barrier function g(x) can be added to impose the domain constraints. Thus, we minimize a combined function f(x) + p g(x) for values of p decreasing to zero. Typically g(x) is chosen to be infinite at the boundary of D. For example, the barrier function −log x is often used to impose the constraint x ≥ 0. We will now show that this can be obtained as the entropy of an alternative physical system. We derived the previous effective energies 2.2 and 2.1 by considering a physical system where the states of the system correspond to all possible permutations. This corresponds to the subset of the vertices of the hypercube that obey the global constraints. Alternatively, we can define a physical system whose states are those internal to the hypercube that satisfy the same global constraints. The lowest energy of this more generalized system is also found at the optimal permutation.⁵ We can then derive the following effective energy (Yuille and Kosowsky 1993):

E_eff[S, P, Q] = E[S] + Σ_i P_i (Σ_a S_ia − 1) + Σ_a Q_a (Σ_i S_ia − 1) + (1/β) Σ_ia [S_ia log S_ia + (1 − S_ia) log(1 − S_ia)]    (3.1)
Observe that this differs from 2.2 by the form of the barrier function. For the more general problem of minimizing E = E[x] subject to the constraints Σ_j A_ij x_j = b_i, ∀i, and x_i ≥ 0, ∀i, the saddle point method directly gives the standard barrier function −log x_i. More specifically, we obtain

E_eff[x, P] = E[x] − Σ_i P_i (Σ_j A_ij x_j − b_i) − (1/β) Σ_i log x_i.   (3.2)

If we restrict the energy to E[x] = Σ_i c_i x_i and project onto the subspace satisfying the constraints, we obtain a standard method for solving linear programming problems (Wright 1992). A variety of algorithms, such as Newton's method, can be used to minimize this energy.

⁵We should point out, however, that the resulting physical system contains unnecessary states and hence may be inferior to the original method.
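A minimal sketch of this projected barrier method for the linear programming case is given below. We assume a strictly feasible interior start x0 (with A x0 = b and x0 > 0); the step-size damping and β schedule are our illustrative choices, not taken from the paper.

```python
import numpy as np

def barrier_lp(c, A, b, x0, beta=1.0, beta_max=1e6, lr=1e-2, inner=2000):
    """Minimize c.x subject to Ax = b, x >= 0, via the barrier of eq. 3.2;
    b enters only through the feasible starting point x0."""
    x = x0.astype(float).copy()
    # Orthogonal projector onto the null space of A: keeps Ax = b invariant.
    P = np.eye(len(c)) - A.T @ np.linalg.pinv(A @ A.T) @ A
    while beta < beta_max:
        for _ in range(inner):
            g = c - (1.0 / beta) / x       # gradient of c.x - (1/beta) sum_i log x_i
            step = P @ g
            # Damp the step so x stays strictly positive (interior).
            t = min(lr, 0.5 * float(np.min(x / np.maximum(np.abs(step), 1e-12))))
            x -= t * step
        beta *= 10.0                       # barrier weight 1/beta -> 0
    return x

# Toy problem: minimize x1 + 2*x2 subject to x1 + x2 = 1, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
print(barrier_lp(c, A, b, x0=np.array([0.5, 0.5])))   # approaches [1, 0]
```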
3.2 Temperature Tracking and Relations to Interior Point Methods. If the effective energy is convex then we can derive a variant of deterministic annealing by writing down a differential equation to track the unique minimum as a function of temperature. At high temperature (small β) the unique minimum can typically be solved for explicitly, thereby giving initial conditions for the temperature tracking differential equation. More precisely, we use the equations

∂E_eff/∂S_ia [S*(β); β] = 0, ∀ i, a,   (3.3)
to implicitly define a solution S*(β) = {S*_ia(β)} as a function of temperature. Differentiating 3.3 with respect to β gives an equation for updating S*(β) as β changes:

Σ_jb (∂²E_eff/∂S_ia∂S_jb) (dS*_jb/dβ) = −∂²E_eff/∂β∂S_ia.   (3.4)
This equation can be inverted to solve for dS*_jb/dβ (care must be taken to ensure that the global constraints are satisfied). Then, given S*(β = 0), one can use 3.4 to find the solution as the temperature goes to zero. Surprisingly, for the linear programming problem, Faybusovich (1991) has shown that this temperature tracking algorithm is equivalent to his analog variant of the interior point algorithm (Bayer and Lagarias 1989; Karmarkar 1984) provided that β is reinterpreted as the time variable for the interior point algorithm. This is encouraging since interior point algorithms have been shown empirically to be effective ways of solving linear programming problems. We emphasize, however, that temperature tracking (deterministic annealing) is more general than interior point methods since it is not restricted to convex problems. We note that iterative interior point methods have also been related to barrier functions (Gill et al. 1986). For problems with more than one local minimum, we expect bifurcations when ∂²E_eff/∂S_ia∂S_jb develops a zero eigenvalue. In such cases 3.4 cannot be inverted and we obtain a bifurcation.

3.3 Multiple Minima-Bifurcations. For more challenging optimization problems, such as the TSP, there will be many local minima and analysis becomes difficult. For some specific problems, it can be shown (Peterson and Soderberg 1989; Durbin et al. 1989; Yuille and Elfadel 1992) that there are bifurcations of the solutions. These bifurcations are analogous to phase transitions in statistical physics systems. We should note that, strictly speaking, phase transitions cannot occur in the problems that we are considering, because of their finite size; they can occur only in the limit as the size of the system goes to infinity. Nevertheless our systems will display the essential properties of such transitions. More specifically, there will be a critical temperature T_c above which the solution to
the mean field equations is constant over temperature and below which this solution is unstable. T_c can be computed analytically and gives an upper bound for the initial temperature in the annealing process.⁶ We will show how to compute T_c for a special class of problems that includes the TSP and many models for texture and pattern generation (Yuille and Elfadel 1992). The generic form is E[V] = Σ_ijab R_ij d_ab V_ia V_jb, along with the standard global constraints, where R_ij is a shift-invariant matrix. For the TSP, d_ab is the distance between cities and R_ij = δ_{j,i+1}. Differentiating the free energy formulation with Lagrange multipliers, 2.2, we obtain equations for the extrema:

2 Σ_jb R_ij d_ab S_jb + T(1 + log S_ia) + P_a + Q_i = 0,   (3.5)

subject to the row and column constraints for {S_ia}. The shift-invariance of R_ij implies that there is always a trivial solution with S_ia = 1/N, obtained by setting P_a = −2 Σ_b d_ab (1/N) and Q_i = const, ∀i. To determine the stability of this solution we must look at the eigenvalues of the Hessian H_{ia,jb}, at S_ia = 1/N, of the unconstrained effective energy,

E_eff[S] = Σ_ijab R_ij d_ab S_ia S_jb + T Σ_ia S_ia log S_ia.   (3.6)

The associated Hessian is H_{ia,jb} = 2R_ij d_ab + NT δ_ij δ_ab. To ensure stability the Hessian must be positive definite after being projected onto the subspace obeying the global constraints. For high T the second term dominates and the solution S_ia = 1/N, ∀i, a is stable. For low temperature the first term dominates and a bifurcation will occur unless R_ij d_ab is positive definite (in all the directions obeying the global constraints). This bifurcation is desirable since we do not want S_ia = 1/N, ∀i, a to be a stable solution as T → 0. Moreover, it gives a critical temperature T_c, and initial conditions, to start the annealing.⁷ The eigenvectors of H_{ia,jb} are of the form X_j^μ Y_b^ν, where the {X^μ} and {Y^ν} are eigenvectors of R_ij and d_ab, respectively (μ and ν label the eigenvectors and j and b their components). Since R_ij is shift-invariant, its eigenvectors are of the form X_j^μ = e^{2π√−1 μj/N} for μ = 1, ..., N. In general, d_ab will not be shift invariant and its eigenvectors must be found for each specific problem.

⁶It was observed (Durbin and Willshaw 1987) that above a critical temperature the elastic net converged to a single point at the center of mass of the cities. It was then shown (Durbin et al. 1989) that an analytic expression could be derived for this critical temperature.
⁷There is no point in starting the annealing at temperatures higher than T_c since the system would only converge to the state S_ia = 1/N, ∀i, a.
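For readers who prefer to check T_c numerically rather than analytically, the following sketch estimates it directly from the projected Hessian. We symmetrize R (only the symmetric part enters the quadratic form) and use TSP-style example data; both choices are ours, for illustration.

```python
import numpy as np

def critical_temperature(R, d):
    """T_c for H_{ia,jb} = 2 R_ij d_ab + N T delta_ij delta_ab, projected onto
    the subspace of zero row and column sums."""
    N = R.shape[0]
    Rs = 0.5 * (R + R.T)
    H0 = 2.0 * np.kron(Rs, d)                 # flat index (i, a) -> i * N + a
    C = np.eye(N) - np.ones((N, N)) / N       # removes the mean in one index
    P = np.kron(C, C)                         # zero row sums and column sums
    lam_min = np.linalg.eigvalsh(P @ H0 @ P).min()
    # On the constraint subspace, stability needs lam + N*T > 0 for all lam.
    return max(0.0, -lam_min / N)

N = 8
R = np.roll(np.eye(N), 1, axis=1)             # TSP shift matrix delta_{j, i+1}
pts = np.random.default_rng(2).random((N, 2))
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print("estimated T_c:", critical_temperature(R, d))
```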
4 Convergence of Solution as T → 0
It is important to investigate the convergence of mean field theory algorithms both with temperature and with time. For example, how does the minimum of the effective energy at temperature T relate to the solution to the original problem? In this section we state results that quantify the changes in the energy and the effective energy along a trajectory of extrema as the temperature varies. In the next section these results will be applied to the assignment problem to obtain convergence. In this section we will use the form of effective energy specified by equation 2.2.

It is easy to see that, for the extremized effective energy 2.2, β(∂E_eff/∂β) = −(1/β) Σ_ia S_ia log S_ia, where the right-hand side can be interpreted as the entropy of the state. This result tells us how the depth of an extremum in the effective energy changes with temperature and can be used to obtain bounds on the convergence of the system as the temperature goes to zero.

Theorem 1. Let E[S*(T); T] be the energy, not the effective energy, of a trajectory S*(T) of the extrema of the effective energy as temperature T varies. Then, for all T₁ we have |E[S*(0); T = 0] − E[S*(T₁); T = T₁]| ≤ 3T₁N log N.

Proof. See Yuille and Kosowsky (1993). □
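A small numeric illustration, under our reading of the theorem: to make the energy gap |E[S*(0); 0] − E[S*(T₁); T₁]| smaller than a tolerance eps, it suffices to track the extremum down to T₁ < eps/(3N log N).

```python
import math

def stopping_temperature(N, eps):
    # Theorem 1: the gap is at most 3 * T1 * N * log(N), so solve for T1.
    return eps / (3.0 * N * math.log(N))

print(stopping_temperature(N=100, eps=0.1))   # about 7.2e-05
```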
For a specific choice of the energy function, and assuming uniqueness of the saddle point solution, we can use this theorem to put a bound on T₁ to ensure that the saddle point solution is arbitrarily close to the minimum of E[S; T = 0]. The bound 3T₁N log N in the theorem arises from our being able to bound the size of the entropy term |Σ_ia S_ia log S_ia| by N log N. If we used the more traditional barrier function Σ_ia log S_ia, the entropy of the alternative statistical system mentioned in Section 3.1, such a bound would be impossible. We may then hypothesize that the entropy term Σ_ia S_ia log S_ia may yield better convergence properties than the standard barrier function.⁸

⁸It has a standard physical interpretation (Parisi 1988).

The global minimum of E[S], even assuming that it can be found by tracking, does not necessarily correspond to the minimum of E[V]. It will fail if the global minimum of E[S] does not occur at an allowable⁹ vertex of the hypercube [0, 1]^{N²}. Thus we need conditions that ensure that the global minimum of E[S] lies at a vertex. For optimization problems, this vertex condition has usually been imposed by adding additional terms, such as −K Σ_ia S_ia², to the energy. Such terms will not affect the relative energy of the allowed solutions but, provided K is sufficiently large, they will ensure that E[S] does not have an internal global minimum. In practice the values of constants like K are set empirically.

⁹That is, a vertex that satisfies the global constraints.

For some systems it is straightforward to put constraints on E[S] to ensure that the global minimum must lie on a vertex. If the Hessian H_{ia,jb} = ∂²E[S]/∂S_ia∂S_jb, projected onto the subspace satisfying the global constraints, has negative eigenvalues in the interior then there cannot be a minimum there. Thus the magnitude of K must be chosen to ensure that the projected Hessian has some negative eigenvalues. To prevent minima from lying on one of the faces of the hypercube we must ensure that the projected Hessian still contains negative eigenvalues even when some of the {S_ia} are constrained to be 1 or 0. Bounds for K will depend on the specific energy function, but the following theorem gives a sufficient bound for K for quadratic energies.
Theorem 2. Let the energy be of the form E[S] = Σ_ijab T_ijab S_ia S_jb. Then we can ensure that the minima lie at the vertices of the hypercube by adding a term −K Σ_ia S_ia², where K > Σ_ijab |T_ijab|.

Proof. Σ_ijab |T_ijab| is an upper bound for the eigenvalues of T_ijab. If K > Σ_ijab |T_ijab| then the Hessian of E[S] − K Σ_ia S_ia², even when projected onto the subspace obeying the constraints, is guaranteed to have negative eigenvalues, so there are no minima in the interior. At faces of the hypercube certain elements S_ia are fixed to be 1 and the remaining elements of the corresponding rows and columns are 0. Thus the stability will depend on the reduced matrix obtained from T_ijab by eliminating the rows and columns corresponding to the S_ia fixed at 1. Σ_ijab |T_ijab| remains an upper bound for the eigenvalues of this reduced matrix, so the result follows as before. □
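A quick numeric sanity check of Theorem 2, using a randomly generated quadratic form; the construction of T_ijab (here flattened to an N² × N² matrix) is our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
Tq = rng.normal(size=(N * N, N * N))
Tq = 0.5 * (Tq + Tq.T)                     # symmetric quadratic form
K = np.abs(Tq).sum() + 1.0                 # K > sum_{ijab} |T_ijab|
# Hessian of E[S] - K sum_ia S_ia^2 is 2(Tq - K I); per the theorem it has
# only negative eigenvalues, so no minimum can sit in the interior.
H = 2.0 * (Tq - K * np.eye(N * N))
print("largest eigenvalue:", np.linalg.eigvalsh(H).max())   # negative
```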
5 The Linear Assignment Problem
As an example of a specific system where analytical results can be obtained we will summarize our earlier work on the assignment problem. The proofs of these results are not straightforward and we refer the interested reader to our technical reports (Kosowsky and Yuille 1991; Yuille and Kosowsky 1991). The assignment problem has energy E[V] = −Σ_ia V_ia A_ia, where the {A_ia} are numbers. An effective energy can be obtained for this problem using 2.1. The equations for the extrema can be solved to give S and U as functions of P only. Substituting back into the effective energy gives the P-energy:

E_P[P; T] = Σ_a P_a + T Σ_i log Σ_a e^{(A_ia − P_a)/T}.   (5.1)
The mean field variables can be determined from the P by the relation

S_ia[P, T] = e^{(A_ia − P_a)/T} / Σ_c e^{(A_ic − P_c)/T}.   (5.2)
5.1 Convergence and Bounds. This section states several theorems which can be used for putting bounds on convergence times of algorithms for linear assignment.
Theorem 3. The Hessian of E_P is positive semidefinite, and E_P is bounded below by N min_{i,a} A_ia. Hence, for nonzero temperature T = 1/β, there is a unique minimum of E_P up to the translation P_a → P_a + K, ∀a. Imposing the constraint Σ_a P_a = 0 yields a unique minimum at P(β). This translation invariance does not affect the solution {S_ia} = {S_ia[P(β), T]}.

Proof. See Kosowsky and Yuille (1991). □
Thus, there is a unique optimal solution for nonzero temperature and we can find the minimum of E_P by steepest descent at fixed temperature. At zero temperature there may be more than one optimal solution in certain nongeneric situations.
Theorem 4. Suppose the assignment problem associated with the N × N benefit matrix {A_ia} admits a unique optimal solution. Let Δ equal the difference in energy between the optimal solution and the second best solution. Then, rounding off each of the entries of S_ia[P, T] to the nearest integer yields the unique permutation matrix that solves the assignment problem whenever

T < Δ / (2N log N).   (5.3)

Proof. See Yuille and Kosowsky (1991). □
This theorem tells us that once we have found the minimum of E_P at fixed temperature we can obtain the solution to the assignment problem by rounding off the corresponding {S_ia} to the nearest integer, provided the temperature is below the bound given above. Observe that scaling the {A_ia} to be integers will ensure that Δ ≥ 1 in the generic case where there exists a unique optimal solution.

In practice, it is never completely clear that a steepest descent algorithm has converged. Hence we proved the following theorem:

Theorem 5. Suppose that N ≥ 4 and ‖∇E_P[P; T]‖ ≤ ε. Then for all i and a,

|S_ia(P, T) − Π_ia| < (8N² log N / Δ) [2ε(max_{i,a} A_ia − min_{i,a} A_ia) + T log((N − 1 + ε)/(1 − ε)) + T log N] + εN log(N − 1),   (5.4)

where {Π_ia} is the optimal assignment.
Proof. See Yuille and Kosowsky (1991). □
This theorem shows that provided the temperature is sufficiently small we do not need to get to the minimum of E_P. Instead we can put a threshold on ‖∇E_P‖ and stop the descent as soon as this threshold is reached. This will only take a finite time since, as E_P is convex, we can put a lower bound on the rate of decrease of E_P until it reaches the threshold.¹⁰ More specifically, we need |S_ia(P, T) − Π_ia| < 1/2 to ensure that we can obtain the correct solution by rounding off the {S_ia(P, T)} to the nearest integer. This condition can be enforced by requiring that the four terms on the right-hand side of 5.4 are all less than 1/8. For the third term we need

T < Δ / {64N²(log N)²}.   (5.5)

Similarly, for the first term we require

ε < Δ / {128N²(log N)(max_{i,a} A_ia − min_{i,a} A_ia)}.   (5.6)

Now, the fourth term is less than 1/8 if

ε < 1 / {8N log(N − 1)}.   (5.7)

Finally, for the second term, {8N²(log N) T log[(N − 1 + ε)/(1 − ε)]}/Δ < 1/8 will follow from equation 5.5, provided ε < 1/(N + 1). Since N ≥ 4, equation 5.7 in turn ensures that ε < 1/(N + 1), so that the second relation is automatically satisfied. Now, we can bound

Δ ≤ 2(max_{i,a} A_ia − min_{i,a} A_ia).   (5.8)

It then follows that the conditions 5.6 and 5.7 are satisfied if ε < 1/(6N log N). Thus, convergence is assured if

T < Δ / {64N²(log N)²}.   (5.9)
In the next section we estimate how long it will take to reach such a configuration by steepest descent.

¹⁰This procedure seems similar to that used in Khatchiyan (1979) for the ellipsoid method.
5.2 Algorithms. Using the results of the previous section we see that solving the assignment problem is equivalent to minimizing the convex E_P energy at sufficiently low temperature.

The first algorithm for minimizing the E_P energy is steepest descent. We give the system unbiased initial conditions P_a = 0, ∀a, and its initial energy can be written as E_P[P = 0; T]. The steepest descent equation is

dP_a/dt = −∂E_P/∂P_a = Σ_i S_ia[P, T] − 1.   (5.10)

The convergence time for this dynamic system can now be estimated. Using the chain rule, dE_P[P(t)]/dt = ∇E_P[P(t)] · dP/dt, and the update rule, dP/dt = −∇E_P[P(t)], we observe that dE_P[P(t)]/dt = −‖∇E_P[P(t)]‖². Recalling that the P-energy is convex, it follows that after (E_P[P = 0; T] − E_P[P(t)])/ε² units of time we can descend to a point where ‖∇E_P[P(t)]‖ < ε. Using Lagrange multipliers, it can be shown that E_P[P = 0; T] = Σ_ia S_ia(0) A_ia − T Σ_ia S_ia(0) log S_ia(0) is bounded above by TN log N + N max_{i,a} A_ia. Clearly, E_P[P(t)] is bounded below by N min_{i,a} A_ia. Hence,

E_P[P = 0; T] − E_P[P(t)] ≤ TN log N + N(max_{i,a} A_ia − min_{i,a} A_ia).   (5.11)

So, if we fix

T ≤ Δ / {64N²(log N)²}   (5.12)

and set ε = 1/(6N log N), then we have convergence in time

t ≤ 36N³(log N)² [T log N + (max_{i,a} A_ia − min_{i,a} A_ia)].   (5.13)
Note that the convergence time is polynomial in N. We should emphasize that this is the time for the analog dynamic system to converge and should not be confused with complexity bounds used in computer science.

Steepest descent also has an interesting economic interpretation and can be related to Bertsekas' auction algorithm (Bertsekas 1990). To see these connections we interpret the {P_a} as the prices of objects labeled by a and the {A_ia} as the utility of object a to person i. We interpret our algorithm as adjusting the prices of the objects until the total demand Σ_i S_ia = Σ_i e^{β(A_ia − P_a(t))}/Σ_c e^{β(A_ic − P_c(t))} for an object a is equal to its rigid supply 1. Thus we refer to our algorithm as the invisible hand algorithm. This is closely related to Bertsekas' auction algorithm (Bertsekas 1990), where the price of objects is adjusted, by a bidding process, until only one person desires each object. One can roughly think of our algorithm as a parallel continuous time version of Bertsekas' serial discrete time algorithm.
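A minimal discrete-time sketch of the invisible hand algorithm follows: steepest descent on the P-energy (equation 5.10) at fixed low temperature, followed by rounding (Theorem 4). The temperature, step size, iteration count, and brute-force check are our illustrative choices; for generic benefit matrices and sufficiently small T the rounding recovers the optimal permutation.

```python
import itertools
import numpy as np

def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

def invisible_hand(A, T=0.005, lr=0.05, iters=40000):
    P = np.zeros(A.shape[1])                   # unbiased initial prices P_a = 0
    for _ in range(iters):
        S = softmax_rows((A - P) / T)          # mean field variables, eq. 5.2
        P += lr * (S.sum(axis=0) - 1.0)        # raise prices where demand > supply
    return np.round(softmax_rows((A - P) / T))

rng = np.random.default_rng(0)
N = 5
A = rng.random((N, N))
X = invisible_hand(A)
best = max(itertools.permutations(range(N)),
           key=lambda p: sum(A[i, p[i]] for i in range(N)))
print("matches brute force:", all(X[i, best[i]] == 1.0 for i in range(N)))
```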
Another algorithm is the temperature tracking approach mentioned in Section 3.2. The algorithm can be obtained by solving

Σ_b (∂²E_P/∂P_a∂P_b)(dP_b/dβ) = −∂²E_P/∂β∂P_a.   (5.14)

Note that this equation cannot be solved directly for dP_b/dβ since ∂²E_P/∂P_a∂P_b has a single zero eigenvalue, corresponding to the translation invariance P_a → P_a + K, ∀a. If we eliminate this invariance by imposing the constraint Σ_a P_a = 0, so that Σ_a dP_a/dβ = 0, then we get a unique trajectory for the temperature tracking problem. Theorem 5 tells us that we do not need to know the value of the {P_a(β = 0)} precisely, provided the gradient is small enough. Then we can use Theorem 5 to get temperature and time bounds on the algorithm. Stability can be ensured by adding a small amount of steepest descent as a gradient restoring force. As mentioned in Section 3.2, this temperature tracking algorithm is equivalent to a version of the interior point methods (Bayer and Lagarias 1989; Karmarkar 1984; Faybusovich 1991). A variety of other algorithms can be applied. For example (Yuille and Kosowsky 1993), the minimum of the E_P energy can also be found iteratively by applying Sinkhorn's theorem (Sinkhorn 1964).
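A minimal sketch of the Sinkhorn alternative, under our reading: at fixed temperature, alternately normalizing the rows and columns of exp(A/T) converges to the doubly stochastic mean field solution. The temperature and iteration count are illustrative.

```python
import numpy as np

def sinkhorn(A, T=0.05, iters=500):
    S = np.exp((A - A.max()) / T)              # subtract max to avoid overflow
    for _ in range(iters):
        S /= S.sum(axis=1, keepdims=True)      # row sums -> 1
        S /= S.sum(axis=0, keepdims=True)      # column sums -> 1
    return S

rng = np.random.default_rng(1)
print(np.round(sinkhorn(rng.random((5, 5))), 2))   # near-permutation for small T
```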
6 Conclusion

We have related the mean field statistical physics methods to more standard optimization techniques such as linear programming with barrier functions and interior point methods. We stress that the mean field methods are more general than these techniques: they are applicable to a wider class of problems and have proven to be good heuristic algorithms for obtaining solutions to difficult optimization problems. In addition, we have analyzed the mean field approach for the special case of the linear assignment problem, obtained convergence bounds for a variety of physics-based algorithms, and related them to more standard approaches.
Acknowledgments

We would like to acknowledge support from DARPA and the Air Force with contracts AFOSR-89-0506 and F49620-92-J-0466, and to thank the Brown, Harvard, and MIT Center for Intelligent Control Systems for a United States Army Research Office Grant DAAL03-86-C-0171. We would also like to thank Roger Brockett and Leonid Faybusovich for helpful conversations. The input from three anonymous reviewers was also appreciated.
References

Aiyer, S. V. B., Niranjan, M., and Fallside, F. 1990. A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks 1, 204-215.
Bayer, D. A., and Lagarias, J. C. 1989. The nonlinear geometry of linear programming. I: Affine and projective scaling trajectories. Trans. AMS 314(2), 499-526.
Bertsekas, D. P. 1990. The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces 20, 133-149.
Brockett, R. W. 1991. Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems. Linear Algebra and Its Applications 146, 79-91.
Durbin, R., and Willshaw, D. 1987. An analog approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. L. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Elfadel, I. M. 1993. From Random Fields to Networks. Ph.D. thesis, Dept. of Mechanical Engineering, MIT.
Elfadel, I. M., and Yuille, A. L. 1993. Mean-field phase transitions and correlation functions for Gibbs random fields. J. Math. Imaging Vision 3(2), 167-186.
Faybusovich, L. 1991. Interior point methods and entropy. In Proceedings of the IEEE Conference on Decision and Control, pp. 2094-2095.
Ford, L. R., and Fulkerson, D. R. 1962. Flows in Networks. Princeton University Press, Princeton, NJ.
Geiger, D., and Girosi, F. 1991. Parallel and deterministic algorithms from MRFs: Surface reconstruction and integration. IEEE Trans. PAMI 13, 401-412.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. PAMI 6, 721-741.
Gill, P., Murray, W., Saunders, M., Tomlin, J., and Wright, M. 1986. On projective Newton barrier methods for linear programming and an equivalence to Karmarkar's projective method. Math. Program. 36, 183-209.
Grzywacz, N. M., and Yuille, A. L. 1986. Massively parallel implementations of theories of apparent motion. In AIP Conference Proceedings 151: Neural Networks for Computing, J. Denker, ed. American Institute of Physics, New York.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Karmarkar, N. 1984. A new polynomial-time algorithm for linear programming. Combinatorica 4(4), 373-395.
Khatchiyan, L. G. 1979. A polynomial algorithm in linear programming. Sov. Math. Dokl. 20(1), 191-194.
Kienker, P. K., Sejnowski, T. J., Hinton, G. E., and Schumacher, L. E. 1986. Separating figure from ground with a parallel network. Perception 15, 197-216.
Kirkpatrick, S., Gelatt Jr., C., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.
Koch, C., Marroquin, J., and Yuille, A. L. 1986. Analog "neuronal" networks in early vision. Proc. Natl. Acad. Sci. U.S.A. 83, 4263-4267.
Kosowsky, J. J., and Yuille, A. L. 1991. The invisible hand algorithm: Solving the assignment problem with statistical physics. Harvard Robotics Laboratory Tech. Rep. 91-1, Harvard University, Cambridge, MA.
Monteiro, R. D. C., and Adler, I. 1989. Interior path following primal-dual algorithms. Part I: Linear programming. Math. Program. 44, 27-41.
Ohlsson, M., Peterson, C., and Yuille, A. L. 1992. Track finding with deformable templates: The elastic arms approach. Computer Phys. Commun. 71, 77-98.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization problems: Benchmark studies on TSP. Neural Comp. 2(3), 261-270.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Platt, J., and Hopfield, J. 1986. Analog decoding using neural networks. In Neural Networks for Computing, pp. 365-369. American Institute of Physics Press, New York.
Rose, K., Gurewitz, E., and Fox, G. 1990. A deterministic annealing approach to clustering. Pattern Recog. Lett. 11, 589-594.
Simic, P. 1990. Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimization. Network: Comp. Neural Syst. 1(1), 1-15.
Sinkhorn, R. 1964. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35, 876-879.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Wasserstrom, E. 1973. Numerical solutions by the continuation method. SIAM Rev. 15, 89-119.
Wright, M. 1992. Interior methods for constrained optimization. In Acta Numerica 1992, A. Iserles, ed., pp. 341-407. Cambridge University Press, Cambridge.
Yuille, A. L. 1990. Generalized deformable models, statistical physics and matching problems. Neural Comp. 2, 1-24.
Yuille, A. L., and Elfadel, I. M. 1992. Mean-field theory and phase transitions for grayscale texture synthesis. In Proceedings 26th Conference on Information Sciences and Systems, Princeton University.
Yuille, A. L., and Kosowsky, J. J. 1991. The invisible hand algorithm: Time convergence and temperature tracking. Harvard Robotics Laboratory Tech. Rep. 91-10, Harvard University.
Yuille, A. L., and Kosowsky, J. J. 1993. A mathematical analysis of deterministic annealing. In Artificial Neural Networks with Applications in Speech and Vision. Chapman and Hall, New York.
Received July 23, 1992; accepted August 19, 1993.
ARTICLE
Communicated by Jeffrey Elman
Object Recognition and Sensitive Periods: A Computational Analysis of Visual Imprinting

Randall C. O'Reilly
Mark H. Johnson
Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Using neural and behavioral constraints from a relatively simple biological visual system, we evaluate the mechanism and behavioral implications of a model of invariant object recognition. Evidence from a variety of methods suggests that a localized portion of the domestic chick brain, the intermediate and medial hyperstriatum ventrale (IMHV), is critical for object recognition. We have developed a neural network model of translation-invariant object recognition that incorporates features of the neural circuitry of IMHV, and exhibits behavior qualitatively similar to a range of findings in the filial imprinting paradigm. We derive several counter-intuitive behavioral predictions that depend critically upon the biologically derived features of the model. In particular, we propose that the recurrent excitatory and lateral inhibitory circuitry in the model, and observed in IMHV, produces hysteresis on the activation state of the units in the model and the principal excitatory neurons in IMHV. Hysteresis, when combined with a simple Hebbian covariance learning mechanism, has been shown in this and earlier work (Földiák 1991; O'Reilly and McClelland 1992) to produce translation-invariant visual representations. The hysteresis and learning rule are responsible for a sensitive period phenomenon in the network, and for a series of novel temporal blending phenomena. These effects are empirically testable. Further, physiological and anatomical features of mammalian visual cortex support a hysteresis-based mechanism, arguing for the generality of the algorithm.
1 Introduction

General approaches to the computational problem of spatially invariant object recognition have come and gone over the years, but the problem remains. In both the symbolic and neural network paradigms the research emphasis has shifted from underlying principles to special case performance on real world tasks such as handwritten digit recognition or
assembly-line part recognition. We have adopted a different approach, which is to link the computational principles and behavior of a model to those of a biological object recognition system, wherein the research emphasis is on the ability to use empirical data to constrain and shape the model. The biologically based approach is limited by the level of computational understanding about how specific neural properties produce specific behavioral phenomena. However, if it were possible to establish a mapping between a biological system and a model at both the behavioral and neural levels, then the relationship between neural computation and behavior could be understood in the simplified framework of the model. If the model makes unique and testable behavioral predictions based on its identified neural properties, then experimental research can be used to test the computational theory of object recognition embodied in the model.

We have developed a model of invariant visual object recognition based on the environmental regularity of object existence: objects, though they might exhibit motion relative to the observer, have a tendency to persist in the environment. This regularity can be capitalized upon by introducing a corresponding persistence or hysteresis in the activation states of neurons responsible for object recognition. When combined with a Hebbian learning rule, these neurons become associated with the many different images of a given object, resulting in an invariant representation. This algorithm (Földiák 1991; O'Reilly and McClelland 1992), as described in O'Reilly and McClelland (1992), relies on specific neural properties that lead to hysteresis. Instead of attempting to test for evidence of the algorithm in the complex mammalian nervous system, we have taken the approach of studying the relatively well known and simpler vertebrate object recognition system of the domestic chick.

Visual object recognition in the chick has been studied behaviorally for nearly 50 years under the guise of filial imprinting, which is the process whereby young precocial¹ birds learn to recognize the first conspicuous object that they see after hatching. The original work of Lorenz (1935, 1937) on imprinting has given rise to half a century of active research on this process by ethologists and psychologists, and more recently by neuroscientists interested in the neural basis of imprinting. Recently, the area of the chick brain that subserves this imprinting process has been identified, and some of its neurobiological properties studied. Therefore, it is now possible to assess a model of imprinting in the chick both with regard to its fidelity to these properties and the behavioral effects they produce.

¹That is, young which are capable of fending for themselves from birth.

With a variety of neuroanatomical, neurophysiological, and biochemical techniques, Horn, Bateson and their collaborators have established that a particular region of the chick forebrain, referred to as the intermediate and medial part of the hyperstriatum ventrale (IMHV), is essential
for imprinting (see Horn 1985; Horn and Johnson 1989; Johnson 1991 for reviews). This region receives input from the main visual projection areas of the chick, and may be analogous to mammalian association cortex on both embryological and functional grounds (Horn 1985). From a Golgi study of the histology and connectivity of the IMHV by Tombol et al. (1988), we were able to determine that the principal anatomical features necessary for hysteresis of the principal excitatory cells of this region as proposed by O'Reilly and McClelland (1992) are present. These features, recurrent excitatory connections and lateral inhibition, constitute an essential component of our object recognition model. In addition, evidence consistent with a Hebbian learning rule operating in this region has been found in morphometric studies of synaptic modification and changes in the density of postsynaptic NMDA receptors (e.g., Horn et al. 1985; McCabe and Horn 1988). Thus, both the hysteresis and the Hebbian learning properties of our model could be present in the region of the chick brain thought to be responsible for object recognition.

Behavioral data are valuable to the extent that they can be accounted for by only a subset of possible models. Thus, the finding that chicks learn something about an object to which they have been exposed for a period of time is not a particularly strong constraint. However, there are several findings from the imprinting literature that appear to require more specialized mechanisms. Perhaps the best known of these is the critical or sensitive period for imprinting. A sensitive period means that a strong preference for a given object can be established only during a specific period of life, and that the animal is relatively unaffected by subsequent exposure to different objects. This kind of behavior is not typical of neural network models, which typically exhibit strong (even "catastrophic") interference effects from subsequent learning. Further, Lorenz's original theory has been revised to reflect the fact that the termination of the sensitive period is experience-driven (Sluckin and Salzen 1961; Bateson 1966), so that one cannot simply posit a hard-wired maturational process responsible for terminating the sensitivity of the network. Our model shows how a self-terminating sensitive period is a consequence of a system incorporating hysteresis and a covariance Hebbian learning rule.

Another constraining behavioral phenomenon derives from research on temporal blending. It has been shown that chicks will blend two objects if they appear in close temporal proximity to each other (Chantrey 1974; Stewart et al. 1977). This temporal dependency alone is consistent with the idea that the object recognition system uses hysteresis to develop invariant representations of objects. However, our model demonstrates a paradoxical interaction between stimulus similarity and temporal blending, such that more similar stimuli experience relatively less blending. This phenomenon can be directly related to the same hysteresis and covariance Hebbian learning rule properties as the sensitive period phenomenon.
In order to claim that our model of object recognition is applicable to brains other than that of the domestic chick, we would need to find evidence of hysteresis and Hebbian associative learning in structures such as the mammalian visual cortex. Several lines of evidence are present. First, and most relevant for our model, the intracortical connectivity of the visual system contains recurrent excitatory connections and lateral inhibition (see Douglas and Martin 1990 for a review, and Douglas et al. 1989; Bush and Douglas 1991 for models). There are also embryological similarities between the hyperstriatum ventrale in chicks and neocortex in mammals (Horn 1985), and Johnson and Morton (1991) argue for a correspondence between IMHV and mammalian visual object recognition centers in their relationship to "subcortical" processing. Finally, it is possible that the temporally extended parvocellular inputs to the form pathway of the mammalian visual system (e.g., Livingstone and Hubel 1988; Maunsell et al. 1990) are an additional source of hysteresis in mammals. With regard to the Hebbian associative learning, several studies have found evidence for associative LTP in visual cortex (e.g., Frégnac et al. 1988).

We adopt a multilevel approach to modeling the behavioral phenomena and neural substrate. The basic mechanism of translation-invariant object recognition can be specified and implemented with a relatively abstract neural network model, as it is based on rather general properties of neural circuitry, and not on the detailed properties of individual neurons. Using an abstract neural network model offers the further advantage of simplicity and explanatory clarity: since the model has only a few properties, the phenomena that result can be related more directly to these properties. However, caution is required in relating such a simple model to both detailed behavioral and neural data; in general this model makes qualitative rather than quantitative predictions. For this reason, it is necessary to also construct a more realistic neural model that is based on anatomical and physiological measurements of actual IMHV neurons. Such a model is planned for future research.

2 The Case for Visual Imprinting as Object Recognition
In order to use imprinting behavior in the chick as an example of an object recognition process, we must be confident that the IMHV region of the chick brain, known to be critical for imprinting, is also relevant for other object recognition tasks, as opposed to classical conditioning or operant learning tasks. Visual imprinting is studied using dark-reared chicks that are exposed to a conspicuous object for a training period that usually lasts for several hours. Hours or days later, the chick is given a preference test in which it is released in the presence of two objects: the training object and a novel object. The extent to which the chick attempts to approach the familiar object as opposed to the novel one results in a
preference score, which correlates with the strength of imprinting. We may infer from this behavior that chicks acquire information about the visual characteristics of objects to which they are exposed, making imprinting dependent upon object recognition. Several studies have shown that the stimulus must be moving in order to obtain strong imprinting (Sluckin 1972; Hoffman and Ratner 1973). This effect is unlikely to be simply due to the differential attraction of attention, since chicks will preferentially approach a familiar object even when it is static, once they have been trained on it while it was moving. In the model, the input stimulus must be moving because this enables an invariant representation to develop, and we propose that the motion during training is performing the same function for the chick.

If IMHV is indeed a crucial site for imprinting then damage to it prior to imprinting should prevent the acquisition of preferences, and its destruction after imprinting should render a chick amnesic for existing preferences. These results have been confirmed (McCabe et al. 1982). Furthermore, lesions to IMHV had no effect on several other behaviors and learning tasks, demonstrating the relatively specific effects of such lesions (see Horn 1985; Horn and Johnson 1989; Johnson 1991 for reviews). For example, McCabe et al. (1982) showed that while small, localized lesions to IMHV impair the ability of chicks to acquire information about an imprinting object, these chicks were able to learn an association between a reward and a static pattern consisting of repeating visual features (e.g., dots, stripes). This suggests that neither low-level visual discrimination nor reward associations are impaired by IMHV lesions. Similarly, Johnson and Horn (1986) trained chicks to press a particular pedal for a reward of exposure to a moving imprinting object. Intact chicks, and chicks with small lesions elsewhere in the forebrain, were able both to learn the operant component of the task and to imprint onto the rewarding object. In contrast, while chicks with IMHV lesions learned the operant component of the task, they were subsequently unable to recognize the rewarding object.

Other evidence shows that IMHV is involved in other tasks that require object recognition. For example, IMHV lesions impair the ability of chicks to recognize a particular individual member of their own species (Johnson and Horn 1987), and the ability to recognize individual members of the species in a mate choice situation (Bolhuis et al. 1989). Davies et al. (1988) demonstrated that chicks with IMHV lesions are unable to learn not to peck at an unpleasant-tasting colored bead in a one-trial passive avoidance learning (PAL) task. The authors argued that this deficit comes from an inability to recognize the beads, since the intact chicks selectively withheld their pecking from the color of bead that had previously been coated with the unpleasant-tasting substance. Perhaps the best description of the functional effect of IMHV lesions would be that it induces object agnosia (Horn 1985; Johnson 1991).
3 The Neural Basis of Object Recognition in Imprinting

IMHV is a small "sausage-shaped" region immediately around the midpoint between the anterior and posterior poles of the cerebral hemispheres. It receives input from the main visual projection areas of the chick, including a prominent input from the avian forebrain primary visual projection area (the hyperstriatum accessorium, HA), and is embryologically and functionally related to mammalian association cortex (Horn 1985). See Figure 1 for a diagram of this visual input to IMHV. In addition, IMHV is extensively connected to other regions of the avian brain (Bradley et al. 1985), including those thought to be involved in motor control, such as the archistriatum.

Recent cytoarchitectonic studies of IMHV have revealed that, in contrast to the six-layered structure with many distinct cell types found in mammalian cerebral cortex, there is no clear laminar structure in IMHV, and only four distinctive types of cells have been identified to date (Tombol et al. 1988). Figure 2 shows the typical connectivity between the four cell types that Tombol et al. (1988) identified. There are two types of principal neurons (PNs) somewhat similar to mammalian pyramidal neurons (although they lack true apical dendrites) that are
Figure 1: Visual inputs to IMHV from the thalamic pathway. Shown is a sagittal view (strictly diagrammatic, as some areas are out of the plane with others) of the chick brain, with visual information coming in through the optic tract, which then synapses in the optic nucleus of the thalamus. This then projects to area HA (hyperstriatum accessorium), which connects reciprocally with IMHV. This pathway may correspond to the retina → LGN → V1, V4 → IT pathway in mammals. There are other routes of visual input to IMHV, which are not shown in this figure (see Horn 1985). The brain of a 2-day-old chick is approximately 2 cm long.
Figure 2: Schematic drawing summarizing the circuitry of IMHV at two levels of detail (simplified version in the box). Excitatory contacts are represented by open circles, and inhibitory ones by flat bars. Shown are the local circuit inhibitory neurons (LCN) and their reciprocal connectivity with the excitatory principal neurons (PN), and the recurrent excitatory connectivity between the principal neurons. In the detailed version, the thick solid lines are dendrites, while the axons are dashed or dotted lines. Both the inhibition and the recurrent excitatory connectivity are used in the simplified model to produce hysteresis in the activation state of the IMHV. (After Tombol et al. 1988.)
probably excitatory. The type 1 PNs are spiny, large, and possess long bifurcating axons that probably project outside the region, while type 2 PNs are medium sized with thick spiny dendrites. The presence of a high density of spines is indicative of extensive afferentation, as is the case with the main input cells of the mammalian neocortex: the spiny stellate cells found in layer 4 (Douglas and Martin 1990). As shown in the figure, the two types of PN cells are interconnected such that they form a characteristic positive feedback loop. The other two classes of neuron identified are medium and small local circuit neurons (LCNs) that are probably inhibitory, receiving excitatory input from the PNs and projecting inhibitory output back onto them. It is not known if they also receive excitatory input from the afferents that excite the PNs. Thus, the LCNs are probably performing at least feedback inhibition, and possibly feedforward inhibition as well. Presumably, these inhibitory neurons are critical for moderating the intrinsic instability of the positive feedback present between the PNs. It should be noted that the types of PNs and LCNs and their characteristic interconnectivity found in IMHV are not commonly observed in neighboring regions of the chick brain that have been studied (Tombol et al. 1988).
Two characteristics of the cytoarchitectonics of IMHV described above are incorporated in our model: the existence of positive feedback loops between the excitatory principal neurons, and the extensive inhibitory circuitry mediated by the local circuit neurons. We propose that these properties lead to a hysteresis of the activation state of PNs in IMHV, a feature that contributes to the development of translation-invariant object-based representations. In our model, we assume that type 2 PNs are the main target of projections to IMHV from area HA, while type 1 PNs are probably the main source of projections from IMHV, for the reasons mentioned earlier.

Changes in synaptic morphology and cell activity in IMHV have been recorded following imprinting. For example, Horn et al. (1985) showed that the mean length of the postsynaptic density of axospinous synapses within the left IMHV increased by 17% following 140 min of training, indicating a strengthening of synaptic efficacy, but no increase was found for shorter training or in dark-reared controls. In another study, McCabe and Horn (1988) measured the numbers of NMDA receptors, which are widely thought to be involved in associative LTP (Collingridge and Bliss 1987). They found a significant increase in the number of NMDA sites in the left IMHV of postimprinting chicks compared with dark-reared controls, which was positively correlated with the degree to which a chick preferred the familiar stimulus at testing. Other factors such as locomotor activity were not significantly correlated. Finally, electrophysiological recording indicates that spontaneous activity levels in IMHV are relatively low, even during presentation of a stimulus, and that effects of training may be confined to IMHV (Payne and Horn 1984; Brown and Horn 1992). Significant effects are obtained only when several multicellular sites within IMHV are combined, suggesting that the representation is distributed within the region (Davey and Horn 1991).

4 The Self-organization of Invariant Representations
Despite the relative ease of its everyday execution, visual object recognition is a difficult computational problem. One of the principal reasons for this difficulty is also a clue to a potential solution: there are a practically infinite number of different images that a given object can project onto the retina. Deciphering which of the many thousands of familiar objects a given image represents is difficult because of this many-to-many correspondence. However, the ways in which a given object can produce different images on the retina are limited to a few dimensions of variability, corresponding to translation, dilation, rotation, etc. The algorithm proposed by Földiák (1991) and O'Reilly and McClelland (1992) collapses across irrelevant dimensions by capitalizing on the idea that the visual environment naturally presents a sequence of images of the same object undergoing a transformation along one or more
of these dimensions. The information contained in the translating sequence of images is that all of them correspond to the same object in the world; one could refer to this information as "identity from persistence." In computational terms, the world imposes a temporal smoothness constraint on the existence of objects that can be used to regularize the ill-posed problem of visual object recognition (cf. Poggio et al. 1985; Yuille 1990). The temporal smoothness of the environment can be capitalized upon by a smoothness constraint (i.e., hysteresis, or the impact of previous activation states on subsequent ones) in the activation state of units in an artificial neural network, combined with an associative learning rule that causes temporally contiguous patterns of input activity to become represented by the same subset of higher-level units. These higher-level units develop representations that are invariant over the differences between the temporally contiguous patterns.
4.1 T h e Algorithm. The specific biologically plausible implementation of the general ideas described above, as proposed by OReilly and McClelland (1992) and used in the present simulations, takes advantage of commonly occurring features of neural anatomy to implement hysteresis in the activation states of units. We propose that hysteresis comes from the combined forces of lateral inhibition, which prevents other units from becoming active, and recurrent, excitatory activation loops, which cause active units to remain so through mutual excitation. The need for recurrent excitatory activation loops and lateral inhibition requires an recurrent network in the tradition of McClelland (1981) and Hopfield (1984). For simplicity, the IAC activation function of McClelland (1981) was used, with stepsize 0.05, decay 1, and a range of -1 to 1 with a provision that only positive activations are propagated to other units (see Appendix A for the exact equation used). Associative learning is implemented with a simple Hebbian covariance learning rule. The specific learning rule used in the present simulations is a modification of the Competitive Learning scheme (Rumelhart and Zipser 1986) (see Appendix B for the formulation), although other similar rules have been used with equal success (cf. Foldi6k 1991; OReilly and McClelland 1992). It is important that the learning rule be a covariance formulation (Sejnowski 19771, having both increase and decrease components that work together to shape the receptive fields of units both toward those inputs which excite them, and away from those that do not. However, the existence of an associative weight decrease phenomenon is a matter of some debate in the field. Nevertheless, there is considerable empirical evidence for a subtractive synaptic modification phenomenon known as associative long-term depression (LTD) in several different types of neurons and species (e.g., Stanton and Sejnowski 1989; Artola et al. 1990; Bradler and Barrioneuvo 1990; Fr6gnac et al. 1988).
The temporal contiguity of the visual environment is captured by "visual" stimuli that appear in a series of different locations sequentially over time, all of which share the same features, and differ only in retinotopic location. Thus, only translational invariance is simulated, as this is the simplest form of invariance to model and analyze.
4.2 Network Model and Methods. The detailed architecture of the model (shown in Fig. 3) is designed around the anatomical connectivity of IMHV and its primary input area, HA. The input layer of the network, layer 0, represents area HA, which contains cells with properties similar to the simple and complex retinotopic feature detectors described by Hubel and Wiesel in the cat visual cortex (Hubel and Wiesel 1962; Horn 1985). HA then projects to area IMHV, which we have divided conceptually into two different layers² according to the two types of PNs in IMHV described by Tombol et al. (1988). In the model, the first "IMHV" layer represents the IMHV type 2 PNs, which are likely to receive afferent input, and layer 2 represents type 1 PNs, which are likely to produce efferent output from IMHV due to their long, bifurcating axons. Layer 2 sends outputs to layer 1 of the model, which in turn sends outputs to layer 2, creating the recurrent feedback loop. There were 10 × 8 or 80 units in layer 0, and 24 units each in layers 1 and 2.

²Note that the laminar distinction in the model between these two component cells of IMHV is not intended to suggest that the cells are arranged as such in the IMHV itself, but rather serves to reflect the functional distinction between the two types of principal neurons.

Strong lateral inhibition is present within each layer of the model, in the form of relatively large negative weights between all units in the layer. While they are not explicitly simulated, this reflects the influence of a large number of GABAergic inhibitory interneurons in IMHV (Tombol et al. 1988), and its relatively low levels of spontaneous activity. The strong inhibition was implemented in the model with fixed weight values of −3.0, which enabled only one unit in each layer to become active at any time, making analysis of the system simpler. In the real system, it is assumed that inhibition does restrict activity to a relatively low level, but clearly the winner-take-all (WTA) extreme in the model is not realistic. The generalizability of the results to distributed patterns of activity throughout the system is an issue that will be addressed with the more detailed model. Preliminary results indicate that the same basic effects are found.

Excitatory connections exist between the two principal cell types in IMHV. In the model, these excitatory connections were of the same strength as the excitatory connections from layer 0 to layer 1, and they were subject to the same learning rules. Both the inhibitory and excitatory connections provide the hysteresis effect necessary for the learning algorithm. However, only the strength of the excitatory connections was adjusted by the learning rule. This is in accordance with biochemical and
[Figure 3 appears here; see caption below.]
Figure 3: Network architecture used for the simplified model of IMHV, showing the three layers (layer 0 represents HA, layer 1 represents IMHV PN type 2, layer 2 represents IMHV PN type 1), and their interconnectivity. The network is shown being trained on stimulus D, and the 3 different units numbered 1-3 in layer 1 have partially invariant fields that capture local invariance over portions of the different positions of stimulus D. These different units project reciprocally to the layer 2 unit, which has a fully invariant representation of stimulus D by combining the partially invariant fields from layer 1.
While the units in layer 1 of the model received excitatory input from both layers 0 and 2, the units in layer 2 did not receive any excitatory input from higher layers that are known to exist in the chick but are not present in the model, resulting in lower levels of activation in this layer compared to layer 1. To compensate for this, layer 2 had a lower level of decay than the other layers (0.5 vs. 1). (It is likely that the type 1 PNs in IMHV receive reciprocal excitatory connections from the areas that they project to, since most of the regions that receive input from IMHV also have projections back to it; Bradley et al. 1985.) As a result of being farther from the moving input, there is a greater degree of hysteresis in layer 2 than in layer 1 of the model, which is reflected in the kinds of receptive fields that form in the two layers. See O'Reilly and McClelland (1992) for more discussion of the graded invariance transformation over increasingly deeper layers.
Figure 4: Stimuli used in training the network, consisting of three feature bits active in any of eight different positions. For visual clarity, the positions were arranged along the horizontal axis, and the object features along the vertical, with 1 bit of overlap between two pairs of the four primary stimuli. Object AB was a hybrid stimulus used in some simulations that had 2 bits in common (overlap) with both A and B, while the others had 1 bit in common with their neighbor.
Training in the model consisted of presenting a set of feature bits (assumed to correspond to a given object) in sequential positions across the input layer 0 (simulated HA) with either rightward or leftward motion (randomly chosen). At each position of a stimulus, the weights between all units in the system were adjusted according to the Hebbian learning rule once the activation state of the network had reached equilibrium for that position (defined here as the point at which the maximum change in activation went below a threshold of 0.0005). The activation state was initialized to zero between different objects, but not between positions of a single object, enabling a state resulting from an object in one position to exert the desired hysteresis effect on the state with the object in the next position. Four "objects" were used in the simulations, which consisted of three active features in any of the eight possible retinotopic locations represented in the input layer (see Fig. 4). The arrangement of the feature bits into columns simply represents eight different views of an object, which could in theory correspond to any transformation sequence, not just horizontal translation. There was 1 bit out of the three features in
common between neighboring stimuli, so that object A overlapped with B by 1 bit, as did C with D. B and C did not overlap. Since it is not possible to record preferential approach behavior from the network, some proxy for this kind of preference measure must be used. As layer 2 in the model represents the output of IMHV, we recorded the activation level over this layer when different stimuli were presented to determine the preference. Because of the extreme inhibitory competition between units, the activation level of the single active unit is not indicative of the level of excitation this unit is receiving (this is an artifact of the winner-take-all nature of our model, and is probably not the case in the real system). To circumvent this potential measurement artifact, we instead used the total excitatory input to layer 2 (i.e., summed over all units in the layer) as an indication of preference. The actual preference scores were computed by averaging the total excitatory input over all the different positions for each stimulus, resulting in a raw preference score for each stimulus. These raw preference scores were converted into percentage preference measures between two stimuli (e.g., A over D) by dividing each raw score by the total for both stimuli. This is analogous to the preference score measure used in many behavioral studies (Horn 1985).
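In code, the preference measure reduces to a few lines, as in the following sketch. Here `total_excitation` is a hypothetical placeholder for whatever routine sums the excitatory input to all layer 2 units for a given input pattern; it is not part of the text.

```python
import numpy as np

def raw_preference(positions, total_excitation):
    """Raw preference score for one stimulus: the total excitatory input to
    layer 2, averaged over all positions of that stimulus."""
    return np.mean([total_excitation(p) for p in positions])

def pct_preference(raw_a, raw_b):
    """Percentage preference for stimulus a over b; 0.5 means no preference."""
    return raw_a / (raw_a + raw_b)
```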
5 Basic Imprinting and Capacity Properties

Before addressing the more challenging behavioral phenomena with our model, it is first necessary to establish its basic imprinting and capacity properties. The latter is especially important because one does not want a putative sensitive period phenomenon to reduce to a simple capacity limitation of the model. The basic simulation of the imprinting effect involved presenting a single stimulus over many epochs while recording the development of the preference for this stimulus. The imprinting results for the model are shown in Figure 5, which indicates that a preference for the training stimulus over a novel, dissimilar object does develop over time. In the model, this effect is simply due to the selective enhancement of weights to units that respond to the imprinted stimulus, and is not a surprising or novel result. The capacity simulation was run by training on all four stimuli (objects A-D), and showing that the model developed individuated representations for each object that were invariant across the entire range of positions of the object. Each epoch of training consisted of sweeping each stimulus in turn across the input layer, covering all eight positions of each object. The order in which the stimuli were presented was randomized, with a "delay" (implemented by zeroing the activation states) between each sweep of the stimuli. Training continued for 100 epochs. The receptive fields for the two simulated IMHV layers developed invariant representations in a graded fashion. That is, the first layer developed representations specific to one stimulus in any of several (but not all) different positions of that stimulus, making it partially invariant with respect to position. Layer 2, however, did have fully invariant representations, so that a single pattern of activation coded for a single stimulus in any position. The development of invariance was quite robust over different parameter settings, although the specific qualities of the representations depended on certain ranges of settings. In particular, larger levels of activation decay tended to reduce the hysteresis effect and caused lower levels of invariance to develop. Likewise, smaller levels of decay caused more invariance to develop on layer 1 (layer 2 was already fully invariant). These effects appeared over relatively large changes in decay (1 vs. 1.5, for example).
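Continuing the sketch above, one training epoch under these conventions might look as follows. The callables `settle` (which would run the IAC dynamics until the maximum activation change falls below 0.0005; see Appendix A), `learn` (the weight update of Appendix B), `reset_state` (zeroing of activations), and `place` (construction of the 10 x 8 input pattern for one position) are all hypothetical stand-ins, named here only for illustration.

```python
def sweep(features, n_pos=8):
    """Yield the stimulus at each of its 8 positions, moving left or right."""
    order = range(n_pos) if rng.random() < 0.5 else reversed(range(n_pos))
    for p in order:
        yield place(features, p)     # place() is a hypothetical helper

def train_epoch(stimuli, settle, learn, reset_state, delay=True):
    """Sweep each stimulus across the input layer in random order."""
    for idx in rng.permutation(len(stimuli)):
        if delay:
            reset_state()            # zero activations between objects...
        for pattern in sweep(stimuli[idx]):
            settle(pattern)          # ...but not between positions, so the
            learn()                  # settled state carries hysteresis over
```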
[Figure 5 appears here: "Imprinting Preference Development," plotting preference (roughly 0.2-0.8) against epochs of training on A (0-150), for A (trained) vs. D (novel) and for C (novel) vs. D (novel).]
Figure 5: The basic imprinting effect, showing the preference for the imprinted stimulus A as compared to a novel stimulus, D. The preference for a control stimulus C as compared to D is also shown. This preference does not deviate from chance (0.5).

6 Reversibility and the Sensitive Period

6.1 Behavior. Lorenz (1937) originally claimed that an imprinted preference was irreversible. Jaynes (1956) pointed out that there are two senses in which imprinting could be irreversible. First, that after
imprinting a bird will never again direct its filial responses to a novel object. Alternatively, that while a bird can direct its filial responses to a second object, it always retains information about the first object. There is a considerable amount of evidence that imprinting is not irreversible in the first sense (e.g., Klopfer and Hailman 1964a,b; Klopfer 1967; Salzen and Meyer 1968; Kertzman and Demarest 1982). Much evidence supports the second, and weaker, form of the irreversibility claim (for review see Bolhuis and Bateson 1990; Bolhuis 1991). Thus, while imprinted preferences can be reversed by prolonged exposure to a second object, a representation of the original object remains. A number of factors affect whether imprinted preferences can be reversed, and, if so, the extent of reversal. These factors include the length of exposure to the first object experienced by the chick, and the length of subsequent exposure to a second object. For example, a prolonged exposure to the first object will prevent reversal of preference for a second object (Shapiro and Thurston 1978). With shorter exposure to the first object, a very brief period of exposure to a second object will not result in a reversal of preference, while a longer period of exposure to a second object will (Salzen and Meyer 1968). Data from Bolhuis and Bateson (1990), shown in Figure 6, illustrate this effect. The chicks were trained for either 3 or 6 days, and preference tests revealed a strong preference for the training object. Then, the chicks were exposed to a second object for 3 or 6 days, and preferences were reversed. The extent to which a reversal occurs depends on the length of exposure to the first stimulus (longer exposure reduces the subsequent extent of reversibility). While Bolhuis and Bateson (1990) did not systematically manipulate the length of exposure to the second stimulus, other studies have shown that this variable influences the degree of reversal (Salzen and Meyer 1968; Einsiedel 1975).

6.2 Simulation. The implementation of reversibility in the model was straightforward: train on one stimulus for a variable amount of time, then train on a different one for a variable amount of time. For the basic effect, we trained on stimulus A for 100 epochs (100 sweeps across the input layer), and then trained again with stimulus D from various points of training on A. The results are shown in Figure 7a for a network that was exposed to A for 100 epochs, and then exposed to D for up to 300 epochs. Thus, reversibility occurs despite a relatively strong initial preference for A, with D being preferred to A after more than 150 epochs of exposure to D. This finding is consistent with those cited above. Further, the strength of preference for D increases with longer training on that object. Also shown in this figure is the continued preference for A, the initial training object, over a novel stimulus, B. This finding shows that a representation of the first object still exists, and is consistent with the second, weaker form of irreversibility observed in the chick (Cook 1993).
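The reversibility protocol itself reduces to a short loop under the conventions sketched in Sections 4 and 5. In the sketch below, `train_epoch`, `raw_preference`-style scoring, and `pct_preference` are the hypothetical helpers introduced earlier, and `raw_score` stands for evaluating a stimulus over all its positions with learning disabled; none of these names come from the original simulations.

```python
def reversibility_run(stim_a, stim_d, stim_b, epochs_a=100, epochs_d=300):
    """Imprint on A, then retrain on D, tracking A-vs-D and A-vs-B preference."""
    history = []
    for _ in range(epochs_a):
        train_epoch([stim_a], settle, learn, reset_state)
    for epoch in range(epochs_d):
        train_epoch([stim_d], settle, learn, reset_state)
        history.append((epoch,
                        pct_preference(raw_score(stim_a), raw_score(stim_d)),
                        pct_preference(raw_score(stim_a), raw_score(stim_b))))
    return history
```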
[Figure 6 appears here: "Training Time and Reversibility in the Chick" (data from Bolhuis and Bateson, 1990), plotting preference across the two test periods for groups given 3 or 6 days of exposure to the first object and 3 or 6 days to the second.]
Figure 6: Data showing an initial preference for the first object after either 3 or 6 days, recorded on test 1, and a reversal of this preference after a period of exposure of 3 or 6 days to the other stimulus, as recorded in test 2. A decreased level of reversal is observed with greater exposure to the first object.

That the network displayed reversibility is not terribly surprising, as many neural networks will continue to adapt their weights as the environment changes. However, unlike many networks that display a "catastrophic" level of interference from subsequent learning (cf. McCloskey and Cohen 1989), this model retained a relatively intact preference for the initial stimulus over a completely novel one. The explanation for this preserved learning effect will be presented below. While reversibility seems to occur in the network and in chicks, there is also support for the idea of a self-terminating sensitive period, where sufficiently long exposure to an imprinting stimulus will prevent the preference from being reversed (e.g., Shapiro and Thurston 1978). Indeed, in the model, a sensitive period is observed with just 25 more epochs of exposure to A before exposing the network to stimulus D. This additional exposure prevents any amount of further training on D from reversing the initial preference for A (Fig. 7b). This kind of sensitive period is self-terminating because it is determined solely by exposure to the object, and not by some other maturational change in the system.

6.3 Discussion. Both the sensitive period effect and the preserved preference effect described above can be explained by an interaction between the subtractive component of the covariance learning rule and hysteresis effects.
[Figure 7 appears here: (a) "Reversibility of Preference: Effect of Exposure to D after 100 Epochs on A" and (b) "Self-Terminating Sensitive Period: Effect of Exposure to D after 125 Epochs on A," both plotting preference against epochs of training on D.]
Figure 7: (a) The basic reversibility effect from training on D after initial imprinting on A. The comparison of the preference for A vs. D shows a reversal (from above 50% preference to below 50% preference). The preference for A over a second object B remains stable despite the training on D. (b) A sensitive period effect from 125 epochs of imprinting on A before exposing to D. The A vs. D preference does not reverse, indicating a preserved preference for A despite up to 900 epochs of exposure to D.

Consider a unit in layer 1 of the model network that has become active by the presentation of a stimulus in position 1 of the
input layer. This unit will increment its weights to the feature units that activated it, and to the unit in layer 2 that is also active. Also, according to the subtractive component of the covariance learning rule, it will decrease its weights to all the other inactive units in the network, including those in the other positions of the same stimulus. When, due to hysteresis, the same layer 1 unit remains active for position 2 of the stimulus, the same subtraction will occur on weights from those input units in position 1 that were just increased on the previous time step, as they will now be inactive. Indeed, if this unit remains active for more than two different positions of an object, the net result will be to decrease its weights to each of these positions more than increase them. With the weight bounding procedure used in the model (see Appendix B), the value of a weight represents the balance of increasing and decreasing forces. Thus, the weights to a unit representing an object over several different positions will reflect the conditional probability that the input unit is active given that the receiving unit is (Rumelhart and Zipser 1986): if a unit is active for N positions of an object, the weight from each position will equilibrate around a value of 1/N. Because the weights are initialized to random values with a mean of 0.5, they will typically be decreasing initially whenever N > 2. This means that a unit that was active for a given object on one epoch will be less likely to be active the next epoch. As a result, a different unit will become active the next time, and adjust its weights for the same object. In this way, the repeated presentation of a given object will result in a recruitment process whereby multiple units become tuned to the same object. It should be noted that this process happens gradually, with all of the recruited units taking turns becoming active, even when the system has reached equilibrium (i.e., all weights near their 1/N values). As layer 1 units become recruited, they are also always decreasing their weights to input units with which they are never contemporaneously active. This makes them unresponsive to the features corresponding to objects other than the one they have been exposed to, which we refer to as a tuning process. In the presence of multiple objects, the influence of recruitment is balanced by this tuning process, because units tuned to a given object will be unavailable for recruitment for another object. However, when only one object is viewed for a period of time, recruitment and tuning work together to cause many units to respond very selectively to the imprinting stimulus. Thus, as training continues on the imprinting stimulus, both the tuning and recruitment effects get stronger, so that by a certain point (125 epochs in the present simulations), a majority of the units become selective to the imprinting stimulus, and are unavailable for recruitment to any new training stimulus because their weights to the other object features are near zero. Once this majority has been established, no amount of exposure to a different stimulus will cause these units to be recruited to the new stimulus, and the balance of preference will not shift. Instead, the retraining will cause the minority of initially unrecruited or less strongly recruited units to become selective to the new stimulus.
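The 1/N equilibrium noted above can be made explicit with a short calculation. Suppose, as a simplification, that a unit stays active throughout a sweep, so that the input unit at any one of the N positions is active on 1/N of the learning steps (weight pushed toward 1 by the increase component of the rule in Appendix B) and inactive on the remaining (N-1)/N (weight pushed toward 0 by the decrease component):

$$\mathbb{E}[\Delta w] \;=\; \lambda\left[\frac{1}{N}\,(1-w) \;-\; \frac{N-1}{N}\,w\right] \;=\; \lambda\left(\frac{1}{N} - w\right)$$

This expected change vanishes exactly at w = 1/N and is negative whenever w > 1/N; starting from weights near 0.5, it is therefore negative precisely when N > 2, as stated in the text.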
The tuning and recruitment processes also explain the retention of preference for a trained stimulus over a novel one even after retraining. The recruited units, being tuned to a particular object, do not become recruited by the other object, and the original preference for the object is preserved. Thus, the two computationally important and biologically supported features of our model, hysteresis and a covariance Hebbian learning rule, are critical for its self-terminating sensitive period behavior. However, the specific properties of both of these components will affect the quantitative nature of the reversibility phenomenon in the chick. In particular, the level of recruitment will affect the exposure duration necessary to terminate the critical period. To demonstrate this, simulations were run that were identical to the previous ones in all respects except for the size of layers 1 and 2, which were doubled. In these simulations, the initial preference for stimulus A was reversible after 150 epochs of exposure to A, but not after 300 epochs of exposure to A. Examination of the development of the receptive fields for units in layer 1 of this network showed that it was not the number of units tuned to stimulus A that changed over time, but rather the contrast between the weights from the imprinted object and the weights from the other objects, supporting the distributed and gradual account of preference formation given above. Other control simulations showed that variables such as the magnitude of the initial random weights were not important for the effect.

6.4 Predictions. While the model's behavior is consistent with the empirical studies described, we state here the clear predictions that the model makes. The falsification of any of these would pose a challenge to the model.

- The length of time training on the first stimulus will affect the amount of retraining time necessary to reverse the preference for the first stimulus in favor of the second.

- A sufficiently long period of training on the initial stimulus will prevent any reversal effect regardless of how long the second stimulus is presented.

- A neuron in IMHV that shows a preference for a given stimulus will tend to retain this preference even after retraining. Further, this propensity for retention of initial preference will be correlated with the selectivity and/or strength of the initial preference.

- More neurons in IMHV will show a strong preference for the first training stimulus after the sensitive period for reversibility than before it.

- Early on in training, neurons in IMHV should exhibit phasic or noisy correlations (i.e., correlated for a period, then not, then correlated again) with the stimulus presentation as the balance between weight increase and decrease forces is struck.
7 Generalization

At a behavioral level, a number of studies have shown that, following imprinting on an object, generalization from the training stimulus occurs to objects having the same shape or color (Jaynes 1956, 1958; Cofoid and Honig 1961). More recently, Bolhuis and Horn (1992) found evidence that generalization varies systematically with stimulus similarity. Generalization occurs in the model because of the tendency of overlapping input patterns to activate the same units in higher layers according to the degree of overlap present in the input layer. This is a basic property of most neural network models. The central finding from the simulations is that generalization appears to vary in a nonlinear fashion with respect to stimulus similarity. Thus, for the basic imprinting simulation, generalization was present for stimulus AB (a hybrid of A and B, having 2/3 overlap with each), but not for stimulus B, which had only 1/3 overlap with A, after initial training on A (see Fig. 8). This nonlinearity is due to the lateral inhibition, which causes weakly activated units to become inhibited by more active units. Similar results were found when generalization was tested in the reversibility simulation. In this case, the initial exposure to A generalized to AB, and not B, and subsequent exposure to D caused the preference for A and AB to reverse. This reversed preference was less than the preference for D over B, indicating that the nonlinear generalization from the initial exposure is preserved following reversal (see Fig. 9a). The reversibility simulations were also run with AB as the retraining stimulus, which is highly similar (2/3 overlap) to the original stimulus A. As can be seen in Figure 9b, the preference for A relative to a novel stimulus (D) increases slightly, rather than decreases, after exposure to AB. This follows from the fact that the same units are active in both cases, causing learning to benefit both stimuli. Another interesting effect shown in the figure is that the retraining on AB generalizes to stimulus B, because of B's 2/3 overlap with AB. Thus, initially quite distinct objects, A and B, can be made to elicit the same preference from the network, and activate the same units, by providing a "bridge" between them. Control simulations using different orderings of initial and subsequent stimuli (e.g., exposing to A and B, then AB) indicate that order is not important for this effect. Ryan and Lea (1989) describe findings in the chick consistent with our model's behavior. They exposed chicks to a stimulus composed of four balls, all of one of two colors. One group saw the stimulus gradually change by flipping the color of one ball every 4 days, while the other group's stimulus was changed all at once. The chicks with the gradually changing stimulus developed an equally strong preference for both colors of the stimulus, while the others did not. The authors argue that these chicks had undergone category enlargement.
[Figure 8 appears here: "Generalization in Imprinting: Effect of Exposure to Stimulus A," plotting preference against epochs of training on A (0-150).]
Figure 8: Generalization of imprinting on stimulus A to stimulus AB (2/3 overlap with A), as revealed by a comparison with the measured preference for D. Note that stimulus B, which has a 1/3 overlap with A, does not experience significant imprinting from exposure to A, revealing a nonlinearity in generalization.
Finally, using stimulus AB for retraining prevented the termination of the critical period for imprinting, since, after retraining on AB, it will be preferred to A even after any number of epochs of initial exposure to A. However, the preference for A did not diminish relative to a novel stimulus (it actually increased slightly), and the level of preference of AB over A was not strong. This is a direct consequence of the fact that many of the same units in the model that were active for A are active for AB. Some evidence from imprinting on highly similar objects, such as two live hens, suggests that preferences remain fluid for a longer period (Kent 1987).
7.1 Predictions. Again, the model's behavior is consistent with the empirical studies described, but we consolidate here the clear predictions that the model makes.
[Figure 9 appears here: (a) "Generalization in Reversibility: Effect of Exposure to D after 100 Epochs on A," plotting preference against epochs of training on D; (b) "Generalization from a Similar Stimulus," plotting preference against epochs of training on AB.]
Figure 9: (a) Generalization of the reversibility effect from exposure to D after 100 epochs on A. Stimulus AB (2/3 overlap with A) shows both the initial preference, and the reversal of this preference from exposure to D. Note that stimulus B remains less preferred than A or AB, due to nonlinear generalization effects. (b) Training on a stimulus (AB) highly similar to the original imprinting stimulus A increases the preference for A relative to a novel stimulus (D). While there is a reversal of preference for A relative to AB, this effect is minor relative to the increase in preference compared to the novel stimulus. Also, the AB stimulus, being equally similar to A and B, causes the system to generalize to B as well, which shows a level of preference relative to the novel stimulus similar to that of A.
- The shape of the generalization window in chicks will be nonlinear with respect to stimulus similarity, with a rather sharp boundary between the range of stimuli that generalize and those that do not.

- The generalization of preference should be preserved under reversal due to subsequent training on a novel, dissimilar stimulus.

- The objects within the generalization window will activate many of the same principal neurons in IMHV, while those not in the window will not. This finding would establish the neural basis of generalization as predicted from our model.

- Secondary training on a stimulus that is very similar to the original imprinting stimulus, but also similar to another stimulus, will result in a "widening" of the imprinting preference, so that all three stimuli will now show a similar level of preference relative to a novel stimulus.

- Retraining with a stimulus that is sufficiently similar to the original will not result in a decrease in preference relative to a completely novel stimulus, and only a weak reversal of preference relative to the initial stimulus.

- The termination of the sensitive period will occur at a variable time depending on the degree of similarity between the retraining stimulus and the original imprinting stimulus.
8 Temporal Contiguity and Blending
8.1 Behavior. In a phenomenon related to generalization, a number of authors have reported that blending can occur when two objects are presented in close temporal or spatial contiguity. For example, Chantrey (1974) varied the interstimulus interval (ISI) between the presentation of two objects from 15 sec to 30 min. By subsequently attempting to train chicks on a visual discrimination task involving the two objects, he was able to demonstrate that when the objects had been presented in close temporal proximity (under a 30 sec gap between them) they became blended together to some extent. This result was replicated by Stewart et al. (1977), who also controlled for the total amount of exposure to the stimuli.

8.2 Simulations. According to our assumptions about the specific nature of the learning taking place in IMHV, temporal contiguity should be a very important variable in determining what the network considers to be the same object. In the "identity from persistence" algorithm, an object is defined as that which coheres over time. If this kind of learning is indeed taking place in IMHV, then one would expect to find the kinds of blending effects found by Chantrey (1974) due to close temporal
contiguity. To simulate the effects of delay in the model, we used the reinitialization of the activation state as a proxy for a delay long enough to allow the neural activation states to decay to values near resting, or at least to retain little information about previous states. The exact length of time that this corresponds to in the chick is not known, so we manipulated "delay" as a binary variable. The delay condition networks were reinitialized between each sweep of a stimulus across the input layer, while the no-delay networks were not. (Note that all simulations to this point involving multiple stimuli per epoch have been run with a "delay.") In the model, we measure blending by looking directly at the units' weights to determine what they have encoded. If a unit has weights that have been enhanced from features belonging to two or more different stimuli, then this unit has a representation that blends the distinction between these objects. If the unit only has strong weights from features for one stimulus, then the unit has a representation that differentiates between stimuli. Thus, to measure blending, we simply count the number of units that have nondifferentiating representations. In order to allow for noise in the weights, we take an arbitrary threshold at 0.36788 (1/e) times the strength of the strongest weight into a unit as a cutoff for considering that weight to be enhanced. Figure 10 shows the level of blending for training on stimuli A and D in the delay and no-delay conditions. Clearly, the absence of a delay causes almost total blending in the representations that form, so that the network would be unable, based on the active units, to distinguish whether stimulus A or stimulus D was present in the environment. The explanation for this effect is simple: a different position of a given object and an entirely different object are no different to a naive network. Thus, the intermingling, without a delay, of two different stimuli causes the network to treat them as different "positions" of the same stimulus. Note that without hysteresis this effect would not arise, making the blending phenomenon a crucial source of empirical support for our hysteresis-based algorithm. Intuitively, one might expect that the effect of stimulus similarity on blending would be to increase it, so that more similar objects would be subject to greater blending effects. This was tested in the model by training with stimulus A in conjunction with AB or B, in the delay and no-delay conditions, and comparing the results to those found with stimulus D. Figure 11 shows that the effect of similarity on blending is not the same for the delay and no-delay conditions. In the delay condition, the more similar the stimuli were, the more blending occurred. This is the predicted direction of the effect. However, in the no-delay condition, the more similar the two stimuli, the less blending occurred: the level of blending in the no-delay condition was inversely proportional to the similarity of the stimuli.
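The blending count might be computed as in the following sketch. The index arrays `feats_a` and `feats_b` (the input units carrying each stimulus's features) are hypothetical bookkeeping conveniences introduced here, not part of the model itself.

```python
import numpy as np

def is_blended(w_in, feats_a, feats_b, ratio=np.exp(-1)):
    """True if a unit has enhanced weights from the features of BOTH stimuli.

    A weight counts as enhanced if it exceeds 1/e (~0.36788) of the largest
    weight into the unit."""
    enhanced = w_in >= ratio * w_in.max()
    return bool(enhanced[feats_a].any() and enhanced[feats_b].any())

def blending_count(weights, feats_a, feats_b):
    """Number of units (columns of `weights`) with nondifferentiating fields."""
    return sum(is_blended(weights[:, j], feats_a, feats_b)
               for j in range(weights.shape[1]))
```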
[Figure 10 appears here: "Temporal Contiguity: Effect of Delay with Stimuli A and D," plotting the number of blended representations in the delay and no-delay conditions.]
Figure 10: The temporal contiguity manipulation (delay vs. no-delay) affects the number of blended representations, so that the no-delay condition causes considerably increased blending of the two stimuli. Blending is measured by examining the receptive fields of the units in each of the two layers of the model.

8.3 Discussion. The inverse relationship between blending in the delay and no-delay conditions suggests a kind of inoculation effect might be operating. Since the delay condition represents a "best case" for stimulus discriminability, we can think of it as representing the baseline performance of the network for blending. This baseline is much worse for similar stimuli than dissimilar ones. When networks are run in the no-delay condition, the intrinsic level of blending (as revealed in the delay networks) somehow inoculates the system against the effects of no-delay induced blending. (Keep in mind that the delay and no-delay conditions are run on separate networks, with the delay network serving as a control condition. Thus, the delay condition does not itself inoculate the networks for the no-delay condition; rather, it reveals something about the intrinsic blending present in the network.) The explanation for this effect, like that for the reversibility effects described above, depends critically on hysteresis and the covariance formulation of the learning rule. When the stimuli overlap, units that respond to one stimulus will be more likely to subsequently respond to the other, due to generalization of the learning. However, as was the case with the recruitment process described previously, the more positions of stimuli a unit is active for, the more it decreases its weights to each individual stimulus position that activates it. Thus, as a unit is active more often with similar input stimuli, it experiences a greater tuning pressure. Given the competitive activation dynamics in the network, a slight imbalance in weight strength between two similar stimuli can result in a unit being active for one stimulus and not the other. Such an imbalance can be caused by the weight tuning process, and once a unit begins to respond preferentially to one stimulus over the other, the imbalance is magnified, resulting eventually in a differentiated representation for one stimulus over the other. Paradoxically, increasing overlap in the input causes increased pressure to form differentiated representations in layers 1 and 2, explaining why the blending in the delay condition for stimuli A-AB is only around 50% even though AB is known to show a high degree of generalization to A. In the no-delay condition for these highly overlapping stimuli, the absence of a delay has relatively little effect because the units are already becoming active for both stimuli even with the delay. Thus, causing them to remain active across stimulus boundaries in the no-delay condition does not alter the balance of tuning and generalization very dramatically, resulting in only a small increase in blending for the 2/3 overlap condition. The inoculation analogy is an appropriate one. Think of the similarity of the stimuli and the no-delay manipulation as two similar viruses that cause the same disease, blended representations. These viruses cause the illness (blending), but the similarity virus also arouses natural defenses against it (enhanced tuning due to increased activity levels). Thus, in the similar stimuli (A-AB) no-delay condition, the additional virus of the no-delay does not have much of an additional effect, because the natural defenses are engaged by the similarity virus in such a way as to inoculate the system against the effects of this other virus. However, in a system without the similarity virus (e.g., the A-D condition), the no-delay virus causes much more rampant damage because its natural defenses are not otherwise aroused. Finally, the 1/3 overlap condition (A-B) falls somewhere in between these two extremes.
[Figure 11 appears here: "Temporal Contiguity: Effect of Inter-Stimulus Similarity," plotting blending in the delay and no-delay conditions against degree of similarity (0% A-D, 33% A-B, 66% A-AB).]
Figure 11: The temporal contiguity manipulation (delay vs. no-delay) interacts with the similarity of the stimuli, so that in the delay condition, more similar stimuli (AB, B) show greater blending than dissimilar stimuli (D). The no-delay condition causes increased blending in inverse proportion to the amount of blending in the delay case. These data are the average of layers 1 and 2, but each layer individually shows the same pattern.
8.4 Predictions.

- When dissimilar stimuli are used in a temporal blending experiment, significant blending will be observed in short delay conditions. However, when stimuli that share many visual features are used, evidence for blending will be more difficult to detect. We propose that the difficulties in extending the Chantrey (1974) findings to other stimuli and experimental conditions (Stewart et al. 1977) are due to the relative similarity of the stimuli used.

- The temporal blending effect should be apparent in neural recording studies. With two dissimilar training stimuli, a short temporal gap training condition should result in more cells responding to both stimuli. In contrast, a long temporal gap condition with the same two stimuli should reveal more neurons responsive to one or the other stimulus.

- Some IMHV neurons initially activated by both of two similar stimuli should become gradually more sensitive to one or the other as training proceeds, while others should remain active for both. The gradient of overall similarity in IMHV firing patterns should be towards more distinct patterns for the two stimuli as exposure increases, reflecting the tuning process responsible for the inoculation of similar stimuli at work.
9 Discussion
Through a series of simulations and analysis of the mechanisms behind them, we have been able to evaluate a simple neural network model of invariant object recognition by relating its behavior and neural properties to those of the domestic chick during imprinting. The model exhibits a range of behaviors in different analogs of experimental manipulations, from reversibility and sensitive periods to generalization and temporal blending, that qualitatively agree with empirical results. It is this range, combined with the counterintuitive predictions regarding temporal blending and its ability to exhibit a self-terminating sensitive period,
that enables the behavioral data to support our theoretical claims. Further, the relative simplicity of the model has enabled its behaviors to be related to certain properties that the model incorporates from the neuronal structure of IMHV. This lends support to the idea that these neural circuit properties have a functional role in object recognition within IMHV corresponding to their role in our model. Aside from the specific predictions made regarding each of the different behavioral effects found in imprinting, we feel the model leads to several broader conclusions. To the extent that our account of IMHV fits the behavioral and neural data, it provides support for the computational model of object recognition upon which our account is based. Given the challenging nature of the object recognition problem from a computational standpoint, the existence of a simplified animal model of object recognition would provide a valuable tool for further exploration and understanding of how neural systems recognize objects. Further, the sensitive period issues explored in our model may have utility outside the sphere of visual imprinting in birds. For example, several authors have pointed out similarities between some of the phenomena associated with imprinting and observations about the development and plasticity of the primate cortex (e.g., Rauschecker and Marler 1987). In particular, there may be strong parallels between the reversibility effects examined in this paper and the extent of recovery of visual acuity following monocular occlusion in the kitten (e.g., Mitchell 1991). Several important assumptions have been made in constructing the model. First, in applying a neural network model of translation invariant object recognition to imprinting phenomena in the chick, we have assumed (and presented evidence in support of the claim) that area IMHV is involved in the process of object recognition. Further, we have assumed that the response properties of neurons in IMHV are sufficient to determine the preference behavior of chicks, since our preference measures in the model were taken directly from the units corresponding to those in IMHV. However, we do not wish to imply that IMHV is the only neural structure that influences the preferences of chicks. Indeed, it is well established that there are other areas of the chick brain that are involved in both imprinting (Horn 1985; Horn and Johnson 1989) and in passive avoidance learning (Kossut and Rose 1984; Rose 1991). A general issue that arises with a model such as the one presented in this paper concerns the extent to which it can be considered a "neural" model or purely an abstract psychological model. Clearly, the units in our model have nothing like the detail of a real neuron, or even of a relatively simplified point neuron model. For this reason we have been unable to make any quantitative predictions about patterns of firing of neuron types within IMHV, or about the temporal parameters that would produce blending effects. Instead, we have attempted to capture in a simplified computational framework a relatively few critical features of IMHV's neural structure that we believe relate directly to its role as an
object recognition system. The resulting model is simple enough to make the analysis of its behavior possible, furthering our understanding of the system and allowing us to make several qualitative predictions that are open to experimental investigation.
Appendix A: IAC Activation Function

The IAC activation function (in a slightly modified form) contains an input term and a decay term (with the relative contribution of these two factors controlled by modifying the level of the decay term, $\alpha$), controlled by a rate parameter $\lambda$:

$$\Delta a_i = \lambda\left[f(\mathrm{net}_i) - \alpha\,(a_i - \mathrm{rest})\right] \tag{A.1}$$

where $f(\mathrm{net}_i)$ is the following input function:

$$f(\mathrm{net}_i) = \begin{cases} \mathrm{net}_i\,(\mathrm{max} - a_i) & \mathrm{net}_i > 0 \\ \mathrm{net}_i\,(a_i - \mathrm{min}) & \mathrm{net}_i < 0 \end{cases} \tag{A.2}$$

and $\mathrm{net}_i$ is the net input to the unit:

$$\mathrm{net}_i = \sum_j o_j w_{ji} + \sum_{l \neq i} o_l w_{li} \tag{A.3}$$

with $l$ indexing over the other units in the same layer with inhibitory connections, and $o_j$ representing the positive-only activation value of unit $j$ (0 otherwise).
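A direct transcription of equations A.1-A.3 into code might read as follows. The resting level and the use of clipping as a numerical safeguard are assumptions made here (the function itself keeps activations within range asymptotically, and the text does not give the rest value).

```python
import numpy as np

LAMBDA = 0.05            # stepsize, as given in Section 4.1
A_MAX, A_MIN = 1.0, -1.0
REST = 0.0               # resting level; the exact value is an assumption

def net_input(o_ext, w_ext, a_layer, w_inhib):
    """Equation A.3: external excitation plus within-layer inhibition.

    Only positive activations propagate, so activations are rectified first;
    w_inhib is a fixed negative matrix with a zero diagonal (no self term)."""
    return np.maximum(o_ext, 0.0) @ w_ext + np.maximum(a_layer, 0.0) @ w_inhib

def iac_step(a, net, alpha=1.0):
    """One update of equation A.1 (alpha = 0.5 for layer 2, 1 elsewhere)."""
    f = np.where(net > 0, net * (A_MAX - a), net * (a - A_MIN))   # eq. A.2
    return np.clip(a + LAMBDA * (f - alpha * (a - REST)), A_MIN, A_MAX)
```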
Appendix B: Hebbian Weight Update Rule

The weight update rule used throughout was as follows:

$$\Delta w_{ij} = \lambda \begin{cases} (1.0 - w_{ij}) & a_i > 0,\; a_j > 0 \\ -w_{ij} & a_i > 0,\; a_j < 0 \\ 0 & \text{otherwise} \end{cases} \tag{B.1}$$

where $\lambda$ is the learning rate. The rule depends on the signs of the pre- and postsynaptic terms ($a_j$ and $a_i$, respectively), as the inhibition within a layer will drive all but a single unit into the negative activation range. Without hysteresis, this unit would always be the one with the largest net input from the current input pattern, making this formulation, in combination with the lateral inhibition, roughly equivalent to the Competitive Learning scheme (Rumelhart and Zipser 1986). The hysteresis can cause an already-active unit that might not have the largest amount of input from the current pattern to remain active, which is the basis of the translation invariance learning, and in this way the system differs from Rumelhart and Zipser (1986). Note that it differs also from the Földiák (1991) scheme in that the hysteresis or trace component of the activation is implemented in the activation function through the influence of lateral inhibition and mutual excitation, and not directly in the weight update function.
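In code, equation B.1 is likewise a few lines; the learning rate value below is a placeholder, as the text does not specify it.

```python
import numpy as np

def hebb_update(w, a_pre, a_post, lam=0.01):
    """Equation B.1 over a full weight matrix (rows: presynaptic units j,
    columns: postsynaptic units i).  lam is a placeholder learning rate."""
    pre = a_pre[:, None]    # a_j, presynaptic activations
    post = a_post[None, :]  # a_i, postsynaptic activations
    dw = np.where((post > 0) & (pre > 0), 1.0 - w,   # both active: toward 1
         np.where((post > 0) & (pre < 0), -w,        # post only: toward 0
                  0.0))                              # post inactive: no change
    return w + lam * dw
```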
Acknowledgments

We would like to thank Jeff Elman, Jay McClelland, and Yuko Munakata
for helpful comments on the paper. R. C. O'Reilly was supported by an ONR graduate fellowship.
Note Added in Proof

Another neural network model of imprinting in the chick has recently come to our attention (Bateson & Horn, in press, Animal Behaviour). A comparison between the two models will be the subject of a future paper.

References

Artola, A., Brocher, S., and Singer, W. 1990. Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature (London) 347, 69-72.

Bateson, P. 1966. The characteristics and context of imprinting. Biol. Rev. 41, 177-220.

Bolhuis, J. J. 1991. Mechanisms of avian imprinting: A review. Biol. Rev. 66, 303-345.

Bolhuis, J. J., and Bateson, P. 1990. The importance of being first: A primacy effect in filial imprinting. Anim. Behav. 40, 472-483.

Bolhuis, J. J., and Horn, G. 1992. Generalization of learned preferences in filial imprinting. Anim. Behav. 44, 185-187.

Bolhuis, J. J., Johnson, M. H., Horn, G., and Bateson, P. 1989. Long-lasting effects of IMHV lesions on social preferences in domestic fowl. Behav. Neurosci. 103, 438-441.

Bradler, J., and Barrionuevo, G. 1990. Heterosynaptic correlates of long-term potentiation induction in hippocampal CA3 neurons. Neuroscience 35(2), 265-271.

Bradley, P., Davies, D., and Horn, G. 1985. Connections of the hyperstriatum ventrale in the domestic chick Gallus domesticus. J. Anat. 140, 577-589.

Brown, M. W., and Horn, G. 1992. Neurones in the intermediate and medial part of the hyperstriatum ventrale (IMHV) of freely moving chicks respond to visual and/or auditory stimuli. J. Physiol. 452, 102P.

Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.

Chantrey, D. 1974. Stimulus pre-exposure and discrimination learning by domestic chicks: Effect of varying interstimulus time. J. Comp. Physiol. Psychol. 87, 517-525.

Cofoid, D., and Honig, W. 1961. Stimulus generalization of imprinting. Science 134, 1692-1694.

Collingridge, G., and Bliss, T. 1987. NMDA receptors: their role in long-term potentiation. Trends Neurosci. 10, 288-293.
Cook, S. 1993. Retention of primary preferences after secondary filial imprinting. Anim. Behav. 46, 405-407.

Davey, J., and Horn, G. 1991. The development of hemispheric asymmetries in neuronal activity in the domestic chick after visual experience. Behav. Brain Res. 45, 81-86.

Davies, D., Taylor, D., and Johnson, M. H. 1988. Restricted hyperstriatal lesions and passive avoidance learning in the chick. J. Neurosci. 8, 4662-4668.

Douglas, R. J., and Martin, K. A. C. 1990. Neocortex. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., Chap. 12, pp. 389-438. Oxford University Press, Oxford.

Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1989. A canonical microcircuit for neocortex. Neural Comp. 1, 480-488.

Einsiedel, A. A. 1975. The development and modification of object preferences in domestic white leghorn chicks. Dev. Psychobiol. 8(6), 533-540.

Földiák, P. 1991. Learning invariance from transformation sequences. Neural Comp. 3(2), 194-200.

Frégnac, Y., Shulz, D., Thorpe, S., and Bienenstock, E. L. 1988. A cellular analogue of visual cortical plasticity. Nature (London) 333, 367-370.

Hoffman, H., and Ratner, A. 1973. A reinforcement model of imprinting. Psychol. Rev. 80, 527-544.

Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.

Horn, G. 1985. Memory, Imprinting, and the Brain: An Inquiry into Mechanisms. Clarendon Press, Oxford.

Horn, G., Bradley, P., and McCabe, B. J. 1985. Changes in the structure of synapses associated with learning. J. Neurosci. 5, 3161-3168.

Horn, G., and Johnson, M. H. 1989. Memory systems in the chick: Dissociations and neuronal analysis. Neuropsychologia 27, 1-22.

Hubel, D., and Wiesel, T. N. 1962. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.

Jaynes, J. 1956. Imprinting: The interaction of learned and innate behavior. I. Development and generalization. J. Comp. Physiol. Psychol. 49, 200-206.

Jaynes, J. 1958. Imprinting: The interaction of learned and innate behavior. IV. Generalization and emergent discrimination. J. Comp. Physiol. Psychol. 51, 238-242.

Johnson, M. H. 1991. Information processing and storage during filial imprinting. In Kin Recognition, P. Hepper, ed., pp. 335-357. Cambridge University Press, Cambridge.

Johnson, M. H., and Horn, G. 1986. Dissociation of recognition memory and associative learning by a restricted lesion of the chick forebrain. Neuropsychologia 24, 329-340.

Johnson, M. H., and Horn, G. 1987. The role of a restricted region of the chick forebrain in the recognition of conspecifics. Behav. Brain Res. 23, 269-275.

Johnson, M. H., and Morton, J. 1991. Biology and Cognitive Development: The Case of Face Recognition. Blackwell, Oxford.

Kent, J. 1987. Experiments on the relationship between the hen and chick:
The role of the auditory mode in recognition and the effects of maternal separation. Behaviour 102, 1-14.

Kertzman, C., and Demarest, J. 1982. Irreversibility of imprinting after active vs. passive exposure to the object. J. Comp. Physiol. Psychol. 96, 130-142.

Klopfer, P. H. 1967. Stimulus preferences and imprinting. Science 156, 1394-1396.

Klopfer, P. H., and Hailman, J. P. 1964a. Basic parameters of following and imprinting in precocial birds. Z. Tierpsychol. 21, 755-762.

Klopfer, P. H., and Hailman, J. P. 1964b. Perceptual preferences and imprinting in chicks. Science 145, 1333-1344.

Kossut, M., and Rose, S. 1984. Differential 2-deoxyglucose uptake into chick brain structures during passive avoidance training. J. Neurosci. 12, 971-977.

Livingstone, M., and Hubel, D. 1988. Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science 240, 740-749.

Lorenz, K. 1935. Der Kumpan in der Umwelt des Vogels. J. Ornithol. 83, 137-213, 289-413.

Lorenz, K. 1937. The companion in the bird's world. Auk 54, 245-273.

Maunsell, J. H. R., Nealey, T. A., and DePriest, D. D. 1990. Magnocellular and parvocellular contributions to responses in the middle temporal visual area (MT) of the macaque monkey. J. Neurosci. 10(10), 3323-3334.

McCabe, B. J., and Horn, G. 1988. Learning and memory: Regional changes in N-methyl-D-aspartate receptors in the chick brain. Proc. Natl. Acad. Sci. U.S.A. 85, 2849-2853.

McCabe, B. J., Cipolla-Neto, J., Horn, G., and Bateson, P. 1982. Amnesic effects of bilateral lesions placed in the hyperstriatum ventrale of the chick after imprinting. Exp. Brain Res. 48, 13-21.

McClelland, J. L., and Rumelhart, D. E. 1981. An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychol. Rev. 88(5), 375-407.

McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In The Psychology of Learning and Motivation, G. H. Bower, ed., Vol. 24, pp. 109-164. Academic Press, San Diego, CA.

Mitchell, D. E. 1991. The long-term effectiveness of different regimens of occlusion on recovery from early monocular deprivation in kittens. Phil. Trans. Royal Soc. (London) B 333, 51-79.

O'Reilly, R. C., and McClelland, J. L. 1992. The self-organization of spatially invariant representations. Parallel Distributed Processing and Cognitive Neuroscience PDP.CNS.92.5, Carnegie Mellon University, Department of Psychology.

Payne, J., and Horn, G. 1984. Differential effects of exposure to an imprinting stimulus on 'spontaneous' neuronal activity in two regions of the chick brain. Brain Res. 232, 191-193.

Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature (London) 317, 314-319.

Rauschecker, J., and Marler, P. 1987. Imprinting and Cortical Plasticity: Comparative Aspects of Sensitive Periods. Wiley, New York.
Rose, S. 1991. How chicks make memories: The cellular cascade from C-fos to dendritic remodelling. Trends Neurosci. 14, 390-397.

Rumelhart, D. E., and Zipser, D. 1986. Feature discovery by competitive learning. In Parallel Distributed Processing, Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and PDP Research Group, eds., Chap. 5, pp. 151-193. MIT Press, Cambridge, MA.

Ryan, C., and Lea, S. 1989. Pattern recognition, updating, and filial imprinting in the domestic chicken (Gallus gallus). In Models of Behaviour: Behavioural Approaches to Pattern Recognition and Concept Formation. Quantitative Analyses of Behavior, M. Commons, R. Herrnstein, S. Kosslyn, and D. Mumford, eds., Vol. 8, pp. 89-110. Lawrence Erlbaum, Hillsdale, NJ.

Salzen, E. A., and Meyer, C. 1968. Reversibility of imprinting. J. Comp. Physiol. Psychol. 66, 269-275.

Sejnowski, T. J. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.

Shapiro, L., and Thurston, K. 1978. The effect of enforced exposure to live models on the reversibility. Psychol. Record 28, 479-485.

Sluckin, W. 1972. Imprinting and Early Learning, 2nd edn. Methuen, London.

Sluckin, W., and Salzen, E. A. 1961. Imprinting and perceptual learning. Q. J. Exp. Psychol. 13, 65-77.

Stanton, P. K., and Sejnowski, T. J. 1989. Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature (London) 339, 215-218.

Stewart, D., Capretta, P., Cooper, A., and Littlefield, V. 1977. Learning in domestic chicks after exposure to both discriminanda. J. Comp. Physiol. Psychol. 91, 1095-1109.

Tombol, T., Csillag, A., and Stewart, M. G. 1988. Cell types of the hyperstriatum ventrale of the domestic chicken Gallus domesticus: A Golgi study. J. Hirnforschung 29(3), 319-334.

Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2(1), 1-24.
Received November 16, 1992; accepted September 10, 1993.
Communicated by Ralph Freeman
Computing Stereo Disparity and Motion with Known Binocular Cell Properties Ning Qian* Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Many models for stereo disparity computation have been proposed, but few can be said to be truly biological. There is also a rich literature devoted to physiological studies of stereopsis. Cells sensitive to binocular disparity have been found in the visual cortex, but it is not clear whether these cells could be used to compute disparity maps from stereograms. Here we propose a model for biological stereo vision based on known receptive field profiles of binocular cells in the visual cortex and provide the first demonstration that these cells could effectively solve random dot stereograms. Our model also allows a natural integration of stereo vision and motion detection. This may help explain the existence of units tuned to both disparity and motion in the visual cortex.

1 Introduction

It is well known that binocular disparity forms the basis of stereoscopic depth perception. There have been many physiological investigations of the mechanisms of stereopsis (see Freeman and Ohzawa 1990 for a recent review). The best known work is perhaps that of Poggio and his co-workers (Poggio and Fischer 1977; Poggio et al. 1985), who found that a large proportion of V1 and V2 cells in alert monkeys are disparity sensitive. They classified their cells into several classes based on tuning behavior. For example, "tuned excitatory" cells respond best to disparities near zero, while "near" or "far" cells prefer a range of crossed or uncrossed disparities. Other investigators argued for a continuous distribution of disparity tuning instead of discrete classes (LeVay and Voigt 1988). More recently, Freeman and his collaborators have carried out quantitative analyses of the receptive field structures of binocular cells in cat primary visual cortex (Freeman and Ohzawa 1990; Ohzawa et al. 1990). All these studies indicate that most disparity sensitive cells are broadly tuned.

*Current address: Center for Neurobiology and Behavior, Columbia University, 722 W. 168th St., New York, NY 10032.

Neural Computation 6, 390-404 (1994)
© 1994 Massachusetts Institute of Technology
Even the most sharply tuned cells have tuning widths of about 0.1°-0.2°, comparable to 2-4 pixels in the stereogram in Figure 3 of this paper when viewed at a distance of about 40 cm. It has been shown that broadly tuned disparity sensitive cells such as those found in the brain can explain some psychophysical results of stereo vision (Lehky and Sejnowski 1990). It remains to be demonstrated whether these cells can be used to compute disparity maps from stereograms. Many models for disparity computation have been proposed over the years (Marr and Poggio 1976, 1979; Quam 1984; Prazdny 1985; Pollard et al. 1985; Qian and Sejnowski 1988; Sanger 1988; Yeshurun and Schwartz 1989). Unfortunately, most of them are nonbiological, either because they require sharply disparity-tuned units with preferred disparities covering a wide range of values, or because they involve mathematical operations that are unlikely to be physiological or at least have not been demonstrated physiologically. Some models are biologically inspired (Marr and Poggio 1979; Sanger 1988), but they are not solely based on the properties of binocular cells in the brain. Among the existing algorithms, the one that comes closest to physiology is perhaps the model proposed by Sanger (1988), who used Gabor filters for disparity computation. The model uses the fact that displacement of a function generates a proportional phase shift in its Fourier transform. The binocular disparity at each location is therefore proportional to the difference of the Fourier phases of the corresponding left and right image patches. A Gabor function is the product of a gaussian envelope with a sinusoid and can be used to perform an approximate localized Fourier transform. Sanger thus used sine and cosine Gabor filters to estimate the local Fourier phases of the left and right images, and the phase difference at each location was used to find the disparity. Although Sanger's model employs Gabor filters, which are known to describe simple cell receptive fields well (Marcelja 1980; Daugman 1985; Jones and Palmer 1987), the filters are used only in a monocular fashion. The binocular interaction in the model occurs only at the final step, when the left and right phases are compared. Before that, the left and right images are processed separately. No stage in the model uses binocular receptive fields or disparity tuned cells resembling those found in the visual cortex. Also, the explicit representations of the phases of the left and right images, and of the phase differences, are not physiologically plausible. Simple cells in cortex are phase sensitive, but their responses are not monotonic functions of the image phases as the model implies. The work described in this paper is mainly based on the physiological and computational studies by Freeman and Ohzawa (1990) and by Ohzawa et al. (1990). These investigators measured the receptive field structures of simple and complex binocular cells in the cat primary visual cortex. They found that simple cells do not reliably signal binocular disparity because they are also sensitive to the contrast and the position of the stimulus. They then showed that a subpopulation of complex cells
is well suited as disparity detectors. Finally, they demonstrated through computer simulations that the left and right receptive fields of a simple cell can be described by two Gabor functions with a certain phase difference between them, and that disparity-selective complex cells can be modeled by combining the outputs of a quadrature pair of simple cells, similar to the procedure for motion energy computation (Adelson and Bergen 1985; Watson and Ahumada 1985). While their work represents a major step toward an understanding of biological stereopsis, it is incomplete from the computational point of view. The work is limited to specific simulations of a few simple and complex cells' responses. They did not explore the explicit relationship between the parameters of the cells and their tuning behavior. More importantly, they did not provide a computational theory (Marr 1982) for computing disparity maps from stereograms. We propose such a theory in this paper by generalizing and formalizing the model of Ohzawa et al. (1990). Specifically, we show through mathematical analysis that the disparity tuning curves of model simple cells depend strongly on the Fourier phase of the stimulus used to measure them. The expression we derive for the simple cell response leads naturally to the quadrature pair method of combining simple cell outputs to form model complex cells that have reliable disparity tuning. A family of such complex cells then constitutes a distributed representation of disparity, and the actual disparity values of the stimuli can easily be estimated from such a distribution. We demonstrate the effectiveness of our model by applying it to random dot stereograms. We finally show that our formulation of disparity computation can be combined naturally with the energy models of motion detection (Adelson and Bergen 1985; Watson and Ahumada 1985) into a unified framework.

2 Theory and Simulation
Through single unit recording from the cat's primary visual cortex, Freeman and his colleagues (Freeman and Ohzawa 1990; Ohzawa et al. 1990; DeAngelis et al. 1991) found that a typical binocular simple cell can be described by two Gabor functions, one for each of its receptive fields in the left and the right retinas. A mathematical description of such a pair of receptive fields is given by Nomura et al. (1990):

f_l(x) = exp(−x²/2σ²) cos(ωx + φ_l)    (2.1)
f_r(x) = exp(−x²/2σ²) cos(ωx + φ_r)    (2.2)

where σ and ω determine the size and the preferred (angular) spatial frequency of the receptive fields, and φ_l and φ_r are the phase parameters for the left and the right receptive fields, respectively. It has been shown
(Nomura et al. 1990) that the different disparity tuning types found in the visual cortex, including those described by Poggio et al. (1985), can be generated with appropriate combinations of parameters in equations 2.1 and 2.2. The response of the simple cell to a stimulus is given by

r_s = ∫ f_l(x) I_l(x) dx + ∫ f_r(x) I_r(x) dx    (2.3)
where I_l(x) and I_r(x) are the left and the right retinal images of the stimulus. That is, the cell sums the contributions from the two receptive fields linearly (Freeman and Ohzawa 1990; Nomura et al. 1990; Ohzawa et al. 1990). For a stimulus with a binocular disparity D we can write

I_l(x) = I(x)    (2.4)
I_r(x) = I(x − D)    (2.5)

Under the assumption that the receptive field widths are much larger than the image disparity, the gaussian envelope of the receptive fields can be ignored. It can be shown that the response of a cell with the pair of receptive fields given by equations 2.1 and 2.2 is approximately

r_s ≈ ρ[cos(θ + φ_l) + cos(θ + φ_r − ωD)]    (2.6)
or equivalently

r_s ≈ 2ρ cos(θ + (φ_l + φ_r)/2 − ωD/2) cos((φ_l − φ_r)/2 + ωD/2)    (2.7)

where ρ and θ are the amplitude and phase of the Fourier transform of the image I(x) at frequency ω, the preferred spatial frequency of the cell. Equation 2.7 indicates that the disparity tuning of a binocular simple cell depends on the Fourier phase θ of the input stimulus. A special case of this phase dependency has been reported by Ohzawa et al. (1990), who found that the disparity tuning of simple cells varies with stimulus position and sign of contrast. Our result is more general because the Fourier phase can be affected by variables other than the position and contrast of the stimulus. Any change to a pattern, other than a constant baseline shift or a scaling of brightness, alters the Fourier phase, which in turn affects the disparity tuning. For example, independently generated random dot patterns contain different Fourier phases even when they occupy the same retinal position and have the same overall textural appearance. Equation 2.7 gives an explicit expression of how the disparity tuning of a simple cell depends on the Fourier phase of a stimulus and the cell's parameters. To test this equation, we carried out computer simulations using a model cell with σ = 4 pixels, ω/2π = 0.125 cycles/pixel, and φ_l = φ_r = 0.
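The simulation just described is easy to reproduce in a few lines. The sketch below is our own minimal rendering, not the paper's code: it builds the Gabor receptive fields of equations 2.1 and 2.2, computes the linear simple-cell response of equation 2.3 to a line stimulus, and shows that shifting the left line position shifts the disparity tuning curve, as equation 2.7 predicts. The helper names and the 64-pixel image size are our assumptions.

import numpy as np

SIGMA = 4.0                   # receptive field size (pixels)
OMEGA = 2 * np.pi * 0.125     # preferred angular spatial frequency (rad/pixel)

def gabor(x, phase, sigma=SIGMA, omega=OMEGA):
    """Gabor receptive field: gaussian envelope times a sinusoid."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(omega * x + phase)

def simple_response(img_l, img_r, phi_l=0.0, phi_r=0.0):
    """Equation 2.3: linear sum of the two receptive fields' contributions."""
    x = np.arange(len(img_l)) - len(img_l) / 2
    return gabor(x, phi_l) @ img_l + gabor(x, phi_r) @ img_r

def line_stimulus(n, pos):
    img = np.zeros(n)
    img[int(n / 2 + pos)] = 1.0   # a bright vertical line at offset `pos`
    return img

n = 64
for left_pos in (0, -2, 2):       # three retinal positions of the left line
    tuning = [simple_response(line_stimulus(n, left_pos),
                              line_stimulus(n, left_pos + d))
              for d in range(-6, 7)]   # disparity of the right line, pixels
    print(left_pos, np.round(tuning, 2))   # tuning peak shifts with position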
Figure 1: Disparity tuning curves of a simple binocular cell to a vertical line stimulus at three retinal positions. Different positions generate different tuning behavior. The solid line is tuned to zero disparity, while the dotted and dashed lines prefer near and far disparities of 2 pixels, respectively. In this simulation, σ = 4 pixels, ω/2π = 0.125 cycles/pixel, and φ_l = φ_r = 0.

Based on the modeling work of Nomura et al. (1990), this cell should show tuned excitatory behavior. The results of our simulation using a vertical line are shown in Figure 1. When the left image of the line is centered in the left receptive field while the right line position is varied to cover a range of disparity values, the cell is indeed tuned excitatory, as shown in Figure 1 (solid line). When the left line position is shifted in one direction or the other and the right line position is varied to cover the same disparity range, however, there is a corresponding shift in the cell's tuning curve, and the cell behaves like a near or far cell (Fig. 1, dotted and dashed lines). The amount of shift and the shape of the tuning curves are well predicted by equation 2.7. This is easier to see using the equivalent equation 2.6: the second term in equation 2.6 determines the horizontal shift and the shape of the tuning curves, while the first term determines the vertical shift. For the same model cell we also performed simulations to obtain its disparity tuning curves using independently generated random dot patterns with the same dot density, contrast, and overall position. Each pattern is used to generate a series of random dot stereograms with different disparities for establishing the tuning curve. Three tuning curves from the cell are shown in Figure 2.
Figure 2: Disparity tuning curves of a simple binocular cell to independently generated random dot patterns. Each pattern is used to generate a series of random dot stereograms for measuring disparity tuning. The model parameters are the same as those in Figure 1.

Once again, the cell can be classified as either tuned excitatory, near, or far depending on the patterns used. The conclusion is that simple cells as described by equations 2.1 and 2.2 do not have reliable disparity tuning and thus cannot be used directly to compute disparity maps from stereograms. We can, however, combine the outputs of the simple cells to form model complex cells whose responses do not depend on the Fourier phase of the stimulus. To achieve this, we note that equation 2.7 consists of two cosine terms. The Fourier phase of the stimulus appears only in the first term, which also contains (φ_l + φ_r)/2. This observation suggests that we can construct a phase-independent complex binocular cell from two simple cells with equal σ, ω, and (φ_l − φ_r)/2 but with their (φ_l + φ_r)/2 having a 90° phase difference. If the outputs of these two cells are squared and then summed, the resulting complex response will be

r_c = 4ρ² cos²((φ_l − φ_r)/2 + ωD/2)    (2.8)

which is no longer dependent on the Fourier phases of the input patterns. Since the square of cosine is a periodic function of period π, the range of
ωD/2 in equation 2.8 should be restricted to π, or equivalently, the range of disparity for the cell to detect should be restricted to 2π/ω, in order to avoid ambiguity. It is easy to show that the preferred disparity of such a complex cell is given by

D_pref = (φ_r − φ_l)/ω    (2.9)

and that its width of tuning (defined at half the peak amplitude) is equal to

ΔD = π/ω    (2.10)
Note that our construction of the complex cell above is equivalent to using two simple cells with both their φ_l's and φ_r's in quadrature relationship. We prefer to use the linear combinations (φ_l − φ_r)/2 and (φ_l + φ_r)/2 because they are more relevant to disparity computation (see equations 2.8, 2.11, and 2.13). The method described above is very similar to the quadrature pair method developed for motion energy computation (Adelson and Bergen 1985; Watson and Ahumada 1985). It was first used by Ohzawa et al. (1990) to model the disparity selectivity of real complex cells. Their work, however, was limited to computer simulations, and was not based on the theoretical analysis we outlined above. Our equations 2.8, 2.9, and 2.10 are more general, and they provide an explicit relationship between the parameters of complex cells and their disparity tuning curves. This relationship forms the basis of the computational theory for solving stereograms to be described next. Equation 2.8 suggests a simple way of computing stereo disparity using complex binocular cells. If we have a family of complex cells at a spatial location with their (φ_l − φ_r)/2 covering the range from −π/2 to π/2, these cells will constitute a distributed representation of the stereo disparity present at that location in the input images. Such a representation could be sufficient from a biological point of view. For example, it could be used as input for the control of vergent eye movements. To compute the actual image disparity explicitly, we note that according to equation 2.8 the cell in the family with the strongest response satisfies the condition

D = (φ_r* − φ_l*)/ω    (2.11)

where the starred parameters refer to those of the most responsive cell. Thus, by identifying the most responsive cell in the family of complex cells centered at a given location, we can compute the image disparity at that location from the parameters of that cell. Alternatively, we could pick the cell with the highest slope in the distribution instead of the maximum response, for better discriminability (Lehky and Sejnowski 1990). In this
case, equation 2.11 should be modified to

D = (φ_r* − φ_l*)/ω ± π/(2ω)    (2.12)

where the starred parameters are now the parameters of the cell with the highest slope. It is interesting to note that, superficially, equation 2.11 is similar to the one used by Sanger (1988) for disparity computation. The two approaches are actually different. In Sanger's method, φ_l* and φ_r* refer to the Fourier phases of the corresponding left and right input image patches. The current approach, on the other hand, uses quadrature pairs to deliberately eliminate the image Fourier phase dependence in the simple cell responses. φ_l* and φ_r* in our equation 2.11 are the left and the right phase parameters of the maximally responsive complex cell in the family. They are not the Fourier phases of the input images. Besides, equation 2.11 is not an essential part of our model, and may not correspond to a real step of computation in the brain. It is merely used to demonstrate that an explicit disparity map can be constructed from the distributed representation based on equation 2.8. As we mentioned above, the distributed representation, which is absent in Sanger's model, might be sufficient from the biological point of view. We have applied our method to random dot stereograms (Julesz 1971). Figure 3 shows an example. The random dot stereogram in Figure 3a has a −2 pixel disparity for the surround, and +2 for the center. The dot density is 0.5 and the dot size is 1 pixel. We used 8 complex cells with their (φ_l − φ_r)/2 distributed evenly in the range [−π/2, π/2]. The σ's and (ω/2π)'s are 4 pixels and 0.125 cycle/pixel, respectively, for all cells. The response of the 8 complex cells at each spatial location is computed by first convolving the stereograms with the corresponding 16 simple cells (one quadrature pair of simple cells for each complex cell) and then combining the results into complex responses. At each spatial location the highest of the 8 responses is found, and equation 2.11 is used to compute the disparity. The disparity map computed this way is then smoothed with a gaussian weighting function with σ = 4 pixels. The final disparity map is shown in Figure 3b, which agrees well with the correct map. The top and the bottom surfaces have disparities of around 2 and −2 pixels, respectively. The error occurs mainly at the transition, which is not as sharp as in perception. To our knowledge, this is the first demonstration that filters with properties similar to those of real binocular cells in the brain can be used to compute disparity maps from stereograms. Note that the result is obtained with filters of a single spatial scale (i.e., a single set of values for σ and ω). Further improvements could be achieved by combining outputs from several different scales. We could also improve the result by first fitting a smooth curve to the distribution of responses at each location and then identifying the maximum. Our algorithm requires a family of binocular simple cells with various (φ_l − φ_r)/2.
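For concreteness, here is a toy 1-D version of the disparity-map algorithm just described. It is our own sketch with assumed parameter choices; the paper's simulations are 2-D and include gaussian smoothing of the final map. Eight complex cells, each a quadrature pair of binocular simple cells as in equation 2.8, are applied at every pixel, and the preferred disparity (equation 2.9) of the most responsive cell is read out at each location. The sign convention for the image shift is ours.

import numpy as np

SIGMA, OMEGA = 4.0, 2 * np.pi * 0.125

def gabor(x, phase):
    return np.exp(-x**2 / (2 * SIGMA**2)) * np.cos(OMEGA * x + phase)

def complex_family(img_l, img_r, n_cells=8):
    """Energies (n_cells x n_pixels) and preferred disparities of the family."""
    x = np.arange(-16, 17, dtype=float)
    half_diffs = np.linspace(-np.pi / 2, np.pi / 2, n_cells, endpoint=False)
    energies, preferred = [], []
    for hd in half_diffs:                  # hd = (phi_l - phi_r)/2
        phi_l, phi_r = hd, -hd             # so (phi_l + phi_r)/2 = 0
        r1 = (np.correlate(img_l, gabor(x, phi_l), 'same')
              + np.correlate(img_r, gabor(x, phi_r), 'same'))
        # quadrature partner: (phi_l + phi_r)/2 advanced by 90 degrees
        r2 = (np.correlate(img_l, gabor(x, phi_l + np.pi / 2), 'same')
              + np.correlate(img_r, gabor(x, phi_r + np.pi / 2), 'same'))
        energies.append(r1**2 + r2**2)               # equation 2.8
        preferred.append((phi_r - phi_l) / OMEGA)    # equation 2.9
    return np.array(energies), np.array(preferred)

# a 1-D "random dot stereogram": the right image is the left one displaced
# by 2 pixels
rng = np.random.default_rng(0)
img_l = rng.integers(0, 2, 256).astype(float)
img_r = np.roll(img_l, -2)

energies, preferred = complex_family(img_l, img_r)
disparity_map = preferred[np.argmax(energies, axis=0)]
print(np.median(disparity_map[20:-20]))    # close to the true 2-pixel shift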
Figure 3: (a) A random dot stereogram with dot density equal to 50% and dot size 1 pixel. The center and the surround have disparities of 2 and −2 pixels, respectively. (b) The computed disparity map. See text for the parameters used in the computation.
The computation can be made much more efficient if we use the fact that the convolution with a binocular simple cell of arbitrary (φ_l − φ_r)/2 can be expressed as a linear combination of convolutions at two independent and fixed values of (φ_l − φ_r)/2. This is a direct consequence of a trigonometric identity, and is also related to the steerability property discussed by Freeman and Adelson (1991). It is well known that we can still perceive stereoscopic depth even when the contrast of one image in a stereo pair differs from that of the other (Julesz 1971). The response of our model complex cell under
this condition can be shown to be

r_c = ρ²[1 + γ² + 2γ cos(φ_l − φ_r + ωD)]    (2.13)

where γ is the contrast ratio of the two images in a pair. The expression differs from equation 2.8 by only a dc component and a scale factor, both independent of the disparity D. The actual stimulus disparity can therefore be computed explicitly with the same equation 2.11. We repeated our computer simulation on the stereogram in Figure 3a with the contrast of the right image reduced to half that of the left. The resulting disparity map (not shown) is indistinguishable from Figure 3b.
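This robustness admits a quick numeric check, reusing the toy machinery from the previous sketch (again our own illustration): halving the contrast of one image rescales the energies and adds a disparity-independent offset, exactly as equation 2.13 states, so the population's peak, and hence the disparity estimate, is unchanged.

# continues the previous sketch: img_l, img_r, complex_family already defined
e_full, preferred = complex_family(img_l, img_r)
e_half, _ = complex_family(img_l, 0.5 * img_r)   # right image at half contrast
best_full = preferred[np.argmax(e_full.sum(axis=1))]
best_half = preferred[np.argmax(e_half.sum(axis=1))]
print(best_full, best_half)   # the same preferred disparity wins both times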
3 Integrating Motion with Stereo

Our method of combining simple cell outputs to achieve phase independent disparity responses is rather similar to that used in motion energy models (Adelson and Bergen 1985; Watson and Ahumada 1985). This similarity allows a natural integration of motion and stereo vision into a common computational framework. The integration can be achieved by using binocular simple cells with the following left and right receptive field structures:

f_l(x, y, t) = exp(−x²/2σ_x² − y²/2σ_y² − t²/2σ_t²) cos(ω_x x + ω_y y + ω_t t + φ_l)    (3.1)
f_r(x, y, t) = exp(−x²/2σ_x² − y²/2σ_y² − t²/2σ_t²) cos(ω_x x + ω_y y + ω_t t + φ_r)    (3.2)
where the σ's and ω's determine the sizes of the receptive fields and the preferred (angular) frequencies along the spatial and temporal dimensions, and φ_l and φ_r are again the phase parameters. We assume that these phase parameters are constants independent of x, y, and t. The cell is consequently more sensitive to motion in a constant disparity plane and relatively insensitive to motion in depth. This is consistent with the physiological finding that few cells in area MT are tuned to motion in depth (Maunsell and Van Essen 1983) and with the psychophysical observation that human subjects are poor at detecting motion in depth based on disparity cues alone (Westheimer 1990). The filters described by equations 3.1 and 3.2 are nonzero on the negative time axis. They are thus noncausal. This is, however, not a major problem, because these filters decay to zero exponentially due to the gaussian envelopes. We could in practice make the filters causal by shifting them in the positive time direction by, for example, 3σ_t. The following results would not be affected by such a shift. A more serious problem with using temporal Gabor filters is that the temporal response of real simple cells is skewed, with its envelope having a longer decay time than
rise time. Also, zero-crossing intervals in the temporal dimension are not equally spaced (DeAngelis et al. 1993). We will use temporal Gabor filters in the following analysis for their mathematical simplicity. Similar results can be obtained with more realistic temporal filters and will be presented elsewhere. Consider a stimulus with a constant disparity D moving at speeds v_x and v_y along the horizontal and vertical directions, respectively. The left and the right images of such a stimulus are given by

I_l(x, y, t) = I(x − v_x t, y − v_y t)    (3.3)
I_r(x, y, t) = I(x − v_x t − D, y − v_y t)    (3.4)

Let us assume again that the widths of the gaussians in equations 3.1 and 3.2 are large. The response of the simple cell described above to the stimulus is then

r_s = 2ρ δ(ω_t + ω_x v_x + ω_y v_y) cos(θ + (φ_l + φ_r)/2 − ω_x D/2) cos((φ_l − φ_r)/2 + ω_x D/2)    (3.5)

where δ( ) is the delta function, and ρ and θ are the amplitude and phase of the Fourier transform of I(x, y). The dependence on the stimulus phase can be removed using the same quadrature pair method discussed before. The resulting complex response is then given by
r_c = 4ρ² δ²(ω_t + ω_x v_x + ω_y v_y) cos²((φ_l − φ_r)/2 + ω_x D/2)    (3.6)

This equation indicates that the cell is indeed sensitive to both motion and stereo disparity. Moreover, the dependencies on motion and on stereo are separated into two terms of the product. This is desirable, for it allows separate estimation of the stereo and motion parameters by using different populations of cells. For disparity computation, we can look at the responses of a family of cells with identical ω_t, ω_x, and ω_y but different (φ_l − φ_r)/2, as we did in the previous section. Similarly, for velocity field computation we can use a family of cells with constant (φ_l − φ_r)/2 but different ω_t, ω_x, and ω_y (Heeger 1987; Grzywacz and Yuille 1990). By holding (φ_l − φ_r)/2 at different values, we could estimate velocity fields at different depth levels. The unified model presented above can account for several interesting psychophysical observations about the interactions between motion and stereo. For example, it allows the representation of more than one velocity vector at the same location if the two vectors have significantly different stereo disparities (Qian et al. 1994a). This is because the motion constraint plane determined by the δ( ) function in equation 3.6 is weighted not only by the image Fourier power ρ² but also by the
disparity-dependent cosine term. When information from a local area is combined to solve the motion aperture problem, the cosine term reduces the interference between motion signals coming from different directions at different disparities. The detailed results of our psychophysical and computational experiments will be presented elsewhere (Qian et al. 1994a,b).
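The following sketch (our construction, with assumed parameters) illustrates the unified unit of equations 3.1-3.6 in one spatial dimension plus time: a quadrature pair of binocular spatiotemporal Gabor filters whose energy should be largest only when the stimulus velocity lies near the filter's motion-constraint plane and the stimulus disparity matches the unit's phase difference. The sign conventions for the disparity shift are ours.

import numpy as np

SX, ST = 4.0, 4.0                   # spatial and temporal envelope widths
WX = 2 * np.pi * 0.125              # preferred spatial frequency (rad/pixel)
WT = -WX * 1.0                      # motion-constraint plane: prefers v = +1
D_PREF = 2.0                        # unit's preferred disparity (pixels)
PHI_L, PHI_R = WX * D_PREF / 2, -WX * D_PREF / 2   # our sign convention

x, t = np.meshgrid(np.arange(-32, 32, dtype=float),
                   np.arange(-16, 16, dtype=float), indexing='ij')

def st_gabor(phase):                # equations 3.1/3.2, one spatial dimension
    return (np.exp(-x**2 / (2 * SX**2) - t**2 / (2 * ST**2))
            * np.cos(WX * x + WT * t + phase))

rng = np.random.default_rng(1)
base = rng.standard_normal(256)

def stimulus(v, shift):             # drifting 1-D random pattern
    return base[(np.round(x - v * t - shift) % len(base)).astype(int)]

def energy(v, d):
    """Quadrature-pair energy of the binocular unit for stimulus (v, d)."""
    def simple(extra):
        return (np.sum(st_gabor(PHI_L + extra) * stimulus(v, 0.0))
                + np.sum(st_gabor(PHI_R + extra) * stimulus(v, d)))
    return simple(0.0)**2 + simple(np.pi / 2)**2

for v in (1.0, -1.0):
    for d in (2.0, 0.0, -2.0):
        print(f"v={v:+.0f}, D={d:+.0f}: {energy(v, d):9.1f}")
# the energy should be largest for v = +1, D = +2, matching the unit's tuning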
4 Discussion

The main purpose of this paper is to show that the binocular receptive field properties of real cortical cells can be used to compute disparity maps from stereograms. As we mentioned in the text, some ingredients of our model were previously proposed by Ohzawa et al. (1990). Our model, however, was derived through mathematical analysis of the binocular system instead of being based on examples of computer simulations. This distinction is an important one, since our equations, which describe the relationship between the parameters of cells and their disparity tuning, cannot easily be obtained through computer simulations. These equations are the essential part of our model because they constitute a computational theory for solving stereograms. Our analysis and computer simulations showed that simple cells do not have reliable disparity tuning, because their responses also depend on the Fourier phases of the stimuli. A unit that is tuned excitatory under one condition can behave like a near or far cell under other conditions. We suggest that the disparity tuning curves of simple cells are incomplete descriptions, as they are not uniquely defined for these cells, and that the best way to study the disparity sensitivity of simple cells is by mapping their binocular receptive field structures (Ohzawa et al. 1990). In order to achieve reliable disparity computation, we combined the outputs of simple cells to eliminate the Fourier phase dependence. This can be realized by squaring and then summing the responses of two simple cells that have the same σ, ω, and (φ_l − φ_r)/2 but have a 90° phase difference in their (φ_l + φ_r)/2 (Ohzawa et al. 1990; Adelson and Bergen 1985; Watson and Ahumada 1985). An alternative way to eliminate the phase dependence, and at the same time preserve disparity tuning, is to average over many simple cells with equal σ, ω, and (φ_l − φ_r)/2 but various (φ_l + φ_r)/2 (see equation 2.7). This approach is less demanding on the specificity of connections between cells but requires more simple cells for each complex cell. Given the fact that there are large numbers of cells in the visual cortex with a wide range of phase parameters (DeAngelis et al. 1993), the second approach may be biologically more plausible. The two methods are equivalent from a computational point of view, however, as they both generate the complex cell response given by equation 2.8. They can thus model real complex cells equally well. We conclude that
the energy method using quadrature pairs is just one way of achieving phase independence and is not an indispensable part of the model. Our model can explain an observation by Poggio et al. (1985), who reported that while both simple and complex cells are disparity tuned to bars, only complex cells show disparity tuning to dynamic random dot stereograms. Each of these stereograms maintains a constant disparity over time, but the actual arrangement of dots, and thus the Fourier phase, changes randomly from frame to frame. Since simple cells are sensitive to the Fourier phase as well as to the disparity of the stimulus, they lose their disparity tuning as a result of averaging over the random phases from frame to frame. Complex cells, on the other hand, are sensitive to disparity but independent of the stimulus Fourier phase. They therefore respond consistently to the fixed disparity in a dynamic random dot stereogram, regardless of the randomly changing Fourier phase. Owing to the similarity between our disparity model and the motion energy models, we are able to combine motion and stereo into a common framework. The resulting model allows a distributed and simultaneous representation of velocity and disparity among a population of cells. This is consistent with the fact that many V1 and MT cells are broadly tuned to direction, speed, and disparity. By looking at different subpopulations of these cells, either velocity or disparity information can be extracted. The resulting model is not only physiologically plausible, it also explains several interesting psychophysical observations regarding the interaction between motion and stereo.
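The computational equivalence of the two constructions is easy to verify numerically. The fragment below (our illustration, built directly from equation 2.7) compares the two-cell quadrature energy with the average of squared responses over a pool of simple cells whose (φ_l + φ_r)/2 is distributed uniformly; both equal 4ρ²cos²((φ_l − φ_r)/2 + ωD/2).

import numpy as np

rho, theta, w, D = 1.3, 0.7, np.pi / 4, 2.0
half_diff = 0.5           # (phi_l - phi_r)/2, fixed across the pool

def simple(mean_phase):   # equation 2.7 for one simple cell
    return (2 * rho * np.cos(theta + mean_phase - w * D / 2)
                    * np.cos(half_diff + w * D / 2))

quad = simple(0.0)**2 + simple(np.pi / 2)**2     # quadrature pair
pool = 2 * np.mean([simple(p)**2                  # phase-averaged pool
                    for p in np.linspace(0, 2 * np.pi, 100, endpoint=False)])
print(quad, pool)   # both equal 4 * rho^2 * cos^2(half_diff + w*D/2)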
Acknowledgments

I would like to thank Professor Richard Andersen for his support and encouragement. I am also grateful to Bard Geesaman, David Bradley, and two anonymous reviewers for their helpful comments. The work was supported by a McDonnell-Pew Postdoctoral Fellowship to the author, and is currently supported by Office of Naval Research Contract N00014-89-J-1236 and NIH Grant EY07492, both to Richard Andersen.
References

Adelson, E. H., and Bergen, J. R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A 2(2), 284-299.
Daugman, J. G. 1985. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2, 1160-1169.
DeAngelis, G. C., Ohzawa, I., and Freeman, R. D. 1991. Depth is encoded in the visual cortex by a specialized receptive field structure. Nature (London) 352, 156-159.
DeAngelis, G. C., Ohzawa, I., and Freeman, R. D. 1993. Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. I. General characteristics and postnatal development. J. Neurophysiol. 69, 1091-1117.
Freeman, R. D., and Ohzawa, I. 1990. On the neurophysiological organization of binocular vision. Vision Res. 30, 1661-1676.
Freeman, W. T., and Adelson, E. H. 1991. The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 13(9), 891-906.
Grzywacz, N. M., and Yuille, A. L. 1990. A model for the estimate of local image velocity by cells in the visual cortex. Proc. R. Soc. London B 239, 129-161.
Heeger, D. J. 1987. Model for the extraction of image flow. J. Opt. Soc. Am. A 4(8), 1455-1471.
Jones, J. P., and Palmer, L. A. 1987. The two-dimensional spatial structure of simple receptive fields in the cat striate cortex. J. Neurophysiol. 58, 1187-1211.
Julesz, B. 1971. Foundations of Cyclopean Perception. University of Chicago Press, Chicago.
Lehky, S. R., and Sejnowski, T. J. 1990. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neurosci. 10, 2281-2299.
LeVay, S., and Voigt, T. 1988. Ocular dominance and disparity coding in cat visual cortex. Visual Neurosci. 1, 395-414.
Marcelja, S. 1980. Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70, 1297-1300.
Marr, D. 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco.
Marr, D., and Poggio, T. 1976. Cooperative computation of stereo disparity. Science 194, 283-287.
Marr, D., and Poggio, T. 1979. A computational theory of human stereo vision. Proc. R. Soc. London B 204, 301-328.
Maunsell, J. H. R., and Van Essen, D. C. 1983. Functional properties of neurons in middle temporal visual area of the macaque monkey. II. Binocular interactions and sensitivity to binocular disparity. J. Neurophysiol. 49, 1148-1167.
Nomura, M., Matsumoto, G., and Fujiwara, S. 1990. A binocular model for the simple cell. Biol. Cybern. 63, 237-242.
Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. 1990. Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science 249, 1037-1041.
Poggio, G. F., and Fischer, B. 1977. Binocular interaction and depth sensitivity in striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol. 40, 1392-1405.
Poggio, G. F., Motter, B. C., Squatrito, S., and Trotter, Y. 1985. Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vision Res. 25, 397-406.
Pollard, S. B., Mayhew, J. E., and Frisby, J. P. 1985. PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception 14, 449-470.
Prazdny, K. 1985. Detection of binocular disparities. Biol. Cybern. 52, 93-99.
Qian, N., and Sejnowski, T. J. 1988. Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. Proceedings of the 1988 Connectionist Models Summer School, 435-443.
Qian, N., Andersen, R. A., and Adelson, E. H. 1994a. Transparent motion perception as detection of unbalanced motion signals: Psychophysics. Submitted.
Qian, N., Andersen, R. A., and Adelson, E. H. 1994b. Transparent motion perception as detection of unbalanced motion signals: Modeling. Submitted.
Quam, L. H. 1984. Hierarchical warp stereo. Proceedings of the DARPA Image Understanding Workshop, 149-155.
Sanger, T. D. 1988. Stereo disparity computation using Gabor filters. Biol. Cybern. 59, 405-418.
Watson, A. B., and Ahumada, A. J. 1985. Model of human visual-motion sensing. J. Opt. Soc. Am. A 2, 322-342.
Westheimer, G. 1990. Detection of disparity motion by the human observer. Optom. Vision Sci. 67, 627-630.
Yeshurun, Y., and Schwartz, E. L. 1989. Cepstral filtering on a columnar image architecture: A fast algorithm for binocular stereo segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 11, 759-767.
Received April 29, 1993; accepted August 16, 1993.
Communicated by Richard Andersen
Integration and Differentiation in Dynamic Recurrent Neural Networks Edwin E. Munro Larry E. Shupe Eberhard E. Fetz Department of Physiology and Biophysics and Regional Primate Research Center, University of Washington, Seattle, WA 98195 USA

Dynamic neural networks with recurrent connections were trained by backpropagation to generate the differential or the leaky integral of a nonrepeating frequency-modulated sinusoidal signal. The trained networks performed these operations on arbitrary input waveforms. Reducing the network size by deleting ineffective hidden units and combining redundant units, and then retraining the network, produced a minimal network that computed the same function and revealed the underlying computational algorithm. Networks could also be trained to compute simultaneously the differential and integral of the input on two outputs; the two operations were performed in a distributed, overlapping fashion, and the activations of the hidden units were dominated by the integral. Incorporating units with time constants into model networks generally enhanced their performance as integrators and interfered with their ability to differentiate.

Neural Computation 6, 405-419 (1994) © 1994 Massachusetts Institute of Technology

1 Introduction
Dynamic neural networks, which incorporate time-varying activity and recurrent connections, can be trained to generate a variety of transforms between spatio-temporal input and output patterns (Fetz 1993). Biologically motivated examples include networks that simulate the reflex responses of the leech (Lockery and Sejnowski 1992), oscillatory activity of pattern generators (Tsung et al. 1990; Rowat and Selverston 1991; Williams and Zipser 1989), performance of a manual step-tracking task in primates (Fetz et al. 1990; Fetz and Shupe 1990), the vestibulo-ocular reflex (Anastasio 1991; Arnold and Robinson 1991; Lisberger and Sejnowski 1992a,b), and short-term memory (Zipser 1991). Dynamic recurrent networks can be trained to compute specific transforms from examples by using a modified form of the backpropagation algorithm (Watrous and Shastri 1986; Williams and Zipser 1989). To explore their ability to compute analytical functions, we trained recurrent networks to generate the
differential or the leaky integral of a continuously changing nonrepeating input; the resulting networks appropriately transformed any subsequent input pattern. These networks were not intended to model any particular biological system, although functions analogous to differentiation and integration do appear in physiological networks. Many neurons in the visual, auditory, and somatosensory systems respond preferentially to changes in peripheral stimulation (Patton et al. 1989), with transient responses resembling the differential of the stimulus. In the motor system, neural integrators have been postulated to transform transient commands into sustained activity, and to mediate the vestibulo-ocular reflex [as modeled by Anastasio (1991), Arnold and Robinson (1991), Cannon and Robinson (1985), Fuchs (1981), Lisberger and Sejnowski (1992a,b), and Robinson (1989)]. The purpose of our networks is not to model specific biological circuits but to investigate the mechanisms of neural computation of integration and differentiation in networks of sigmoidal units. Biological neurons have history-dependent ionic and membrane properties that may help to generate these functions, but for simplicity we did not incorporate such mechanisms. However, we did examine the capacities of networks whose hidden units had intrinsic time constants. A further purpose of this study was to demonstrate that the underlying neural algorithm can be elucidated by reducing these networks to their minimal size (Mozer and Smolensky 1989; LeCun et al. 1990).

2 Methods
Our models utilized a recurrent network trained by a dynamic backpropagation algorithm (Watrous and Shastri 1986; Williams and Zipser 1989). A single input unit provided the time-varying input signal to the hidden units, and a bias unit provided a source of constant input. The hidden units had either excitatory or inhibitory connections to each other and to the output(s). Self-connections and recurrent connections among the inhibitory units were sometimes omitted for efficiency, when their inclusion did not alter the basic results. The output layer consisted of one or two units with a linear input-output characteristic, representing the function(s) to be computed. The total input to a given unit was the sum of the activity of its input units and bias, weighted by the respective connection strengths. The output of "sigmoidal" hidden units (appearing at the next timestep) was derived from the sigmoid squashing function, shifted along the abscissa by a constant of 4 (equivalent to a nonshifted sigmoid with a constant bias of −4). As described below, we also investigated units with intrinsic time constants. Each network was trained with a nonrepeating quasisinusoidal input signal whose frequency was randomly varied (periods between 2 and 60
timesteps). The desired target output waveform was calculated as the differential or the leaky integral, delayed by two timesteps to accommodate the delays between layers. Each training cycle consisted of an epoch spanning a fixed number of timesteps (normally 50). The difference between the actual and desired network outputs was summed over the entire epoch to calculate the net error, and the weights were changed to reduce this error. Both the duration of the training cycle and the number of previous timesteps included in the backpropagation could be adjusted to optimize the training. Varying the frequency of the training signal ensured that the network performed accurately over a broad range of frequencies.
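A sketch of the training data just described follows (our reconstruction; the exact waveform generator is not specified beyond the text, so the period-resampling scheme and amplitude scaling below are assumptions).

import numpy as np

rng = np.random.default_rng(0)

def training_signal(n_steps):
    """Frequency-modulated sinusoid: the period is resampled each cycle."""
    phase, out = 0.0, []
    period = rng.uniform(2, 60)
    for _ in range(n_steps):
        phase += 2 * np.pi / period
        if phase >= 2 * np.pi:          # new random period each full cycle
            phase -= 2 * np.pi
            period = rng.uniform(2, 60)
        out.append(0.5 + 0.4 * np.sin(phase))   # keep within unit activations
    return np.array(out)

def targets(x, delay=2, tau=20.0):
    diff = np.diff(x, prepend=x[0])     # differential: x(t) - x(t-1)
    leak = np.zeros_like(x)             # leaky integral with decay constant tau
    for t in range(1, len(x)):
        leak[t] = (1 - 1 / tau) * leak[t - 1] + x[t]
    # np.roll is a simple stand-in for the two-timestep delay; its wraparound
    # at the epoch boundary is harmless for illustration
    return np.roll(diff, delay), np.roll(leak, delay)

x = training_signal(50)                 # one 50-timestep epoch
d_target, i_target = targets(x)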
3 Differentiation

Figure 1a illustrates a representative network with 16 hidden units that differentiates an arbitrary time-varying input. The units are identified by number and shown with their activation patterns; from top to bottom along the left-hand column, these are the bias, the input (i1), excitatory hidden units (a1-a8), inhibitory hidden units (b1-b8), and the output unit (o1). Each square of the matrix represents the connection weight from the unit in the left-hand column to the unit in the upper row. The area of the square represents the magnitude of the weight, and the shading designates excitatory (black) or inhibitory (gray) connections. [In some cases weights that exceeded the scale (at the top) are given numerically.] The activity of each unit in response to the illustrated input signal is also displayed along the left-hand column and along the top. The illustrated input signal is representative of the training waveforms. The activity of the output unit is the derivative of the input, shifted forward by two timesteps to accommodate propagation delays. For this network, the nature of the solution can be deduced in part from the dominant connections in the weight matrix. Most of the excitatory units (a2-a7) receive strong input from the input unit and relay it directly to the output, with a net delay of two timesteps. Many inhibitory units produce a time-shifted contribution to the output by receiving a delayed signal from the excitatory units and relaying it to the output for a total delay of three timesteps (e.g., units b1-b4). The function of other inhibitory units (b6-b8) is less obvious, since they also receive input directly from i1 and provide negative feedback to the excitatory layer. In general the activations of the hidden units resemble the input signal. To elucidate the essential operation of differentiating networks we reduced them to their minimal form. In incremental stages, we removed inactive or ineffective units (e.g., a1 and b5) and combined units with similar activity and connectivity (e.g., a5 and a6), each time retraining the network.
Figure 1: Differentiating networks. Each box represents the strength of the connection from the unit on the left to the unit at the top of the diagram. The scale boxes calibrate connection strengths for the networks below. The waveform displayed next to each unit's name shows that unit's activity over the time course of the input pattern. Unit i1 is the input unit, a1 through a8 are excitatory hidden units, b1 through b8 are inhibitory hidden units, and o1 is the output unit. (a) Network trained with 16 hidden units and a randomly varying sinusoidal input. (b,c) Reduced networks with minimal hidden units and no direct connection from the input to the output unit. The diagrams depict units and their connections. (d) Reduced network with a direct connection from the input to the output unit.
In some cases, this process was facilitated by implementing an automatic uniform weight decay during the training period. Differentiating networks could be reduced to one of two basic networks, each with a single excitatory and a single inhibitory hidden unit, as illustrated in Figure 1b and c and shown with a pulse input. The circuit in Figure 1b relayed the input serially via the excitatory unit to the inhibitory unit, calculating the difference between them at the output. The circuit in Figure 1c fed the input simultaneously to the excitatory and inhibitory units, calculating the difference at the excitatory unit. The net result of both pathways is that the output represents the input signal minus the input delayed by one timestep, which produces the derivative: o(t+2) = i(t) − i(t−1). These reduced networks are the smallest that produce an output mediated by the hidden layer and delayed by two timesteps. An even smaller differentiating network can be constructed if the input is connected directly to the linear output; this network consists of the essential differentiating triad in Figure 1d. These networks have the familiar configurations of feedforward and collateral inhibitory circuits commonly found in biological sensory systems, which probably contribute to transient sensory responses. The larger differentiating networks appear to implement the same time-shift algorithms, but in a distributed and intermingled form. Thus, network 1a implements primarily the first algorithm (1b), but also incorporates a component of the second. A large number of differentiating networks were obtained, using different starting weights and networks with more complete connectivity (including self-connections and recurrent inhibitory connections); all solutions implemented essentially the same time-shift subtraction algorithm, with variable combinations of these two basic circuits. The algorithm in Figure 1b was usually more prominent and more often survived weight reduction.
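In code, the reduced circuit of Figure 1b amounts to nothing more than a delayed copy and a subtraction. The sketch below is our idealized linear rendering (the trained networks use shifted sigmoidal units, so the real circuits implement this only approximately).

import numpy as np

def minimal_differentiator(i):
    a = np.zeros_like(i)   # excitatory hidden unit
    b = np.zeros_like(i)   # inhibitory hidden unit
    o = np.zeros_like(i)   # linear output unit
    for t in range(1, len(i)):
        a[t] = i[t - 1]              # input -> excitatory, one-step delay
        b[t] = a[t - 1]              # excitatory -> inhibitory, one more step
        o[t] = a[t - 1] - b[t - 1]   # output: difference, total delay of two
    return o

i = np.sin(np.linspace(0, 4 * np.pi, 100))
o = minimal_differentiator(i)
print(np.allclose(o[3:], i[1:-2] - i[:-3]))   # o(t+2) = i(t) - i(t-1) -> True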
4 Integration

Recurrent networks of sigmoidal units could also be trained to simulate leaky integrators with various output decay constants. For the networks with longer decays it was necessary to scale down the output to prevent saturation with low frequency inputs. Figure 2 illustrates a network trained to simulate a leaky integrator with an output decay constant of τ = 20 timesteps, and it exemplifies the mode of computation for integrating networks. The inhibitory units remained essentially unused, since they did not develop effective connections or activations. The excitatory units were strongly interconnected, and the activity of each resembled the output, viz., the leaky integral of the input. A pulse input generated rising and decaying activations whose decay constants τ (estimated by fitting the falling phase to an exponential) were found to be similar for the hidden units and were comparable to that of the output.
Deleting individual hidden units resulted in a uniform reduction of these decay constants across the hidden and output units. These observations indicate that the recurrent excitatory connections within the hidden layer perform the integration. Again, solutions obtained from different starting weights differed in their detailed connections, but all involved recurrent excitatory connections. The integrator networks could also be reduced to a smaller "essential" network containing two hidden units with reciprocal excitatory connections if self-connections are excluded, as shown in Figure 2b.
Figure 2: Integrating networks. (a) Full network with 16 hidden units and a pulse input; decay constant τ = 20 timesteps. (b) Minimal network which excludes self-connected units (τ = 20). (c) Minimal network with one self-connected hidden unit (τ = 100).
When self-connections are allowed, integration can be performed with a single excitatory hidden unit (Fig. 2c). Indeed, if the linear output unit is provided with self-excitation, no hidden unit is necessary, and a pure integrator can be constructed by making the product of the self-connection weight and the slope of the linear activation function equal to one.
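The single-unit integrator is equally compact in its idealized linear form (our sketch): a self-connection of weight w < 1 yields a leaky integrator with decay constant τ ≈ 1/(1 − w), and w = 1 gives a pure integrator.

import numpy as np

def recurrent_integrator(i, w=0.95):     # tau = 1/(1 - 0.95) = 20 timesteps
    o = np.zeros_like(i)
    for t in range(1, len(i)):
        o[t] = w * o[t - 1] + i[t - 1]   # self-excitation plus new input
    return o

pulse = np.zeros(100)
pulse[5] = 1.0
o = recurrent_integrator(pulse)
# after the pulse the activity decays as w**n, i.e., exp(-n/tau) for small 1-w
print(o[6], o[26] / o[6])   # 1.0 and roughly exp(-1) = 0.37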
5 Simultaneous Differentiation and Integration

We also trained networks with sigmoidal units to produce both the differential and the leaky integral as two simultaneous outputs for a single arbitrary input (Fig. 3). While networks quickly learned to compute either the derivative or the leaky integral, generally reaching an error level of <5% within 1000-2000 training cycles, it proved more difficult to train networks to compute both functions simultaneously. Such networks generally learned to integrate with ease but failed to develop the inhibitory connections necessary for the differentiation, probably because the smaller differential signal contributed less to the error used for backpropagation. We overcame this problem by increasing the scale of the differential output (and therefore the differential error signal) relative to the integral output until an acceptable solution was obtained.
Figure 3: Networks that simultaneously differentiate and integrate the input. (a) Full network. (b) Reduced network.
The number of hidden units and the overall number of training cycles required to reduce the error below the 5% level were larger for the networks performing both functions than for the single-function networks. Figure 3a illustrates a full network that integrates and differentiates the input. The dual-function networks perform this simultaneous computation in a distributed manner. In general, a larger number of hidden units were actively involved in the dual computation, and many of these contributed significantly to both outputs, as seen by comparing the vertical columns of input weights to o1 and o2. The recurrent connections of the excitatory units are similar to those obtained for the simple integrator network. The activity patterns of most hidden units (both excitatory and inhibitory) carry a component that resembles the integral output, and deletion of individual units produced a reduction of activation decay constants across the network, suggesting an integrating mechanism similar to that of the pure integrator. The computation of the differential within the dual-function network bears some resemblance to the pure differentiator as well. The activation patterns of many hidden units carry a component resembling the input. Within this class, excitatory units receive a strong direct input whereas inhibitory units receive little (cf. Fig. 1b). However, the extensive cross-connections among the hidden units and their dominant integral signal preclude a clean implementation of the same time-shift algorithm used by the pure differentiator. The algorithms found in the pure networks appear to be used in a distributed and overlapping fashion in the dual-function network. Reduced networks could also be derived from dual-function networks. In general, the hidden units involved in computing the integral and the differential tended to decouple as the networks became smaller. The overlapping nature of the computation could be preserved in these networks only at the expense of some residual error. The smallest possible reduced network contained four hidden units, three excitatory and one inhibitory (Fig. 3b). Units contributing largely to the integral (a1 and a2) are clearly distinguished by their activation patterns and connectivity from those contributing to the differential (a3 and b1).
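The training trick described above can be stated compactly. The fragment below is our formulation of it (the scale factor of 10 is an assumed value, not one reported in the text): the differential target is multiplied up before the epoch error is accumulated, so that its otherwise small error signal contributes comparably to the weight changes.

import numpy as np

def epoch_error(net_diff, net_int, x, diff_scale=10.0, tau=20.0):
    """Summed squared epoch error with the differential target scaled up."""
    tgt_diff = diff_scale * np.diff(x, prepend=x[0])   # scaled differential
    tgt_int = np.zeros_like(x)                          # leaky integral target
    for t in range(1, len(x)):
        tgt_int[t] = (1 - 1 / tau) * tgt_int[t - 1] + x[t]
    return np.sum((net_diff - tgt_diff)**2) + np.sum((net_int - tgt_int)**2)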
6 Network Size

Since the small "reduced" networks can compute analytic functions as well as the large ones can, it is interesting to consider the rationale for having networks with a larger number of units. Larger networks are essential for training, since smaller networks rarely converged on acceptable solutions. During training, small networks usually became stuck in local error minima that are avoided by the larger degrees of freedom provided by additional connections. Second, for units with sigmoidal input-output functions that saturate at the upper and lower levels, additional units
are required to extend the linear range of the network operation. For example, in the multiunit integrator (Fig. 2a) the activations of the involved hidden units have a variety of baseline offsets. The sum of these activations provides a linear output over a wider range than could be achieved by fewer units. Third, the operation of the larger networks would obviously be less affected by the loss of any particular units, since many would be redundant. Fourth, it seems significant that larger networks can use the same units in several different elemental algorithms simultaneously. For example, in Figure 1a the inhibitory hidden units b6-b8 are involved in both of the differentiating algorithms of Figure 1b and c, and the hidden units in Figure 3a are involved in both differentiation and integration. Thus the larger networks can have significantly increased computational power, including the ability to compute different functions simultaneously.

7 Unit Time Constants
Prolonged responses of biological neurons are often modeled by incorporating intrinsic time constants into the hidden units. To see how our networks would perform when the hidden units retained some of their activity, we redefined the output value of unit $i$ at a given time step, $X_i(t)$, to be a weighted sum of its value at the previous time step, $X_i(t-1)$, and the output of the sigmoidal transfer function ($F$) defined in Section 2. Thus:

$$X_i(t) = \alpha X_i(t-1) + (1 - \alpha) F[\text{weighted sum of inputs to } X_i]$$

where $\alpha$ varies between 0 and 1. With this definition, the intrinsic time constant of unit $i$ in time steps is given by $\tau_i = 1/(1-\alpha)$. We chose values of $\alpha$ from 0 to 0.8, giving time constants in the range of 1-5 time steps.
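A minimal sketch of this leaky unit update (in Python; the weight vector and input handling are illustrative, not taken from the paper):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def leaky_unit_step(x_prev, w, inputs, alpha):
    """One update of a unit with intrinsic time constant tau = 1/(1 - alpha):
    X(t) = alpha * X(t-1) + (1 - alpha) * F(weighted sum of inputs)."""
    return alpha * x_prev + (1.0 - alpha) * sigmoid(np.dot(w, inputs))
```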
7.1 Integrator. As the time constants of the hidden units were increased, the networks generally integrated with a lower level of error for a given number of training iterations. Moreover, networks incorporating larger time constants accomplished the same integration using smaller net recurrent weights (i.e., a smaller sum of self- and feedback connections). The minimal leaky integrator with one self-recurrent unit required a lower recurrent weight when the time constant was larger, as expected from the equivalent effects of these two parameters. Similarly, in larger integrating networks the sum of recurrent weights was inversely related to intrinsic time constants. To further test the frequency responses of networks we documented their output response to input sinusoids of different frequencies with the
Bode amplitude plot [the log of the output-input amplitude ratio vs. log of frequency] and Bode phase plot [the phase of the output relative to input vs. log of frequency]. The performance of leaky integrator networks was equally good in the training range, whether the hidden units had time constants or not; however, outside the training range the networks with intrinsic time constants had Bode plots that approximated the ideal leaky integrator slightly better.

7.2 Differentiator. Incorporating units with time constants into networks substantially impaired their ability to differentiate. Networks with the same architecture as those in Section 3, but with modest time constants (e.g., $\tau_i = 2$), could no longer be trained to differentiate from random initial conditions. In general, as the time constants of individual units increased, the networks took longer to learn to differentiate and performed at a greater absolute error level. This error was disproportionately due to the higher frequency component of the input signal. For networks with pure sigmoidal units without decay ($\alpha = 0$) the Bode amplitude plots approximated the ideal relation fairly closely, but fell off at a high-frequency limit determined by the time step duration. For differentiating networks with units having time constants $\tau_i > 1$, the Bode amplitude plot reached a maximum at a lower frequency and then began to decrease. This inflection frequency was inversely related to the time constant of the hidden units. We explored several methods to improve the training and performance of differentiator networks with time constants, including (1) increasing $\tau_i$ by small increments, and retraining between increases; (2) relaxing the sign constraint on hidden unit outputs; and (3) increasing the delay between the network's input and output signals from 2 time steps to 4. These methods produced additive improvements of network performance for a given time constant, as reflected by decreases in absolute error levels and increases in the high-frequency limit of Bode plots. Like differentiators composed of sigmoidal units without time constants, these networks appeared to generate a distributed version of the time-shift algorithm described above, although they tended to use only a fraction of the units available. Titrating $\tau_i$ by small increments may improve learning performance by allowing the learning network to "track" a local error minimum as it varies parametrically with the time constant. Relaxing the sign constraint on hidden unit outputs allows more degrees of freedom for gradient descent, and gives networks a greater range for solutions. [Of course a network with unsigned units can be transformed to one with twice as many signed units.] The greater delay between input and output signals may provide two computational advantages: The larger delay could increase the number of paths through which a time shift and subtraction could be implemented [e.g., by varying the size of the time shift, the length of the path from input to output, and the time step at which the subtraction occurs]. It also makes it possible to produce
multiple estimates of the derivative in less than 4 time steps, allowing additional time to combine these estimates. We have not explored these possibilities systematically.

7.3 Combined Differentiator-Integrator. Networks that were previously trained to simultaneously differentiate and integrate an input signal as described above could be further trained (by successively incrementing $\tau_i$ and retraining) to both differentiate and integrate at reasonably low error levels with $\tau_i$ as high as 4.0. These dual-function networks differentiated more accurately than the pure differentiator networks obtained as described above, particularly at higher frequencies. For the dual-function networks without time constants, the Bode amplitude plots for the differential output approximated the ideal as well as plots for pure differentiators without time constants. Increasing $\tau_i$ up to 4.0 for dual-function networks reduced the linearity and the high-frequency cutoff of the Bode plot only slightly. For $\tau_i > 4.0$ the high-frequency cutoff dropped markedly and the ability to differentiate high-frequency signals deteriorated. The greater ease with which dual-function networks could accommodate increases in time constants compared to pure differentiators may reflect the fact that hidden units already had effective time constants imposed by their simultaneous participation in the integration. The combined differentiator/integrator networks tend to use most or all of their units in the computation, whereas pure differentiators with time constants used as few as 10-20% of their units actively. Training networks to both integrate and differentiate may force them to distribute their computation of the derivative, which may allow the dual-function networks to deal better with the introduction of time constants.

8 Conclusions
These simulations have provided several insights into the capacity of recurrent dynamic neural networks to compute the analytic functions of differentiation and integration.

1. Using the backpropagation algorithm with nonrepeating input patterns produced neural networks that differentiated and/or integrated any subsequent test signal. By eliminating unnecessary units and combining redundant units it was possible to derive reduced networks that performed the same function. The reduced networks often revealed the computational algorithm more clearly. However, networks this small could not be trained from random starting weights.

2. The differentiating networks implemented a strategy of delayed subtraction of the input, and the hidden units often carried signals resembling the input. The integrator networks used recurrent
connections between the excitatory hidden units, whose activations all resembled the output (i.e., the integral of the input).
3. Networks could also be trained to produce the integral and differential simultaneously on two separate outputs. These networks combined the strategies used by the pure differentiator and integrator in a distributed and overlapping manner. Reduced networks performed these two operations best by dissociating into separate subnetworks for each function.

4. Integrating networks incorporating units with intrinsic time constants $\tau_i$ learned faster and performed better as $\tau_i$ were increased. By contrast, performance of differentiating networks deteriorated as $\tau_i$ were increased, particularly at high frequencies. Combined differentiator-integrator networks that had been trained without time constants could be trained to maintain their performance with incremental increases of $\tau_i$ over a limited range.
5. These circuits have some analogues in biological systems. Many neurons in sensory systems exhibit transient responses to stimuli (resembling differentiation) that may be sharpened by ubiquitous recurrent and feedforward inhibitory connections. The intrinsic membrane time constants of neurons (ca. 5 msec) would not compromise differentiation of signals changing at lower rates. In addition to network mechanisms, intrinsic receptor mechanisms can also generate transient sensory responses. Cells in motor systems, notably the oculomotor system, exhibit sustained responses to transient input, resembling integration; the network time constants are many times longer than intrinsic neural time constants, probably due to recurrent connections. In contrast to the recurrent excitation of our minimal integrator networks, the oculomotor system also uses inhibition in order to integrate push-pull signals without integrating baseline activity (Cannon and Robinson 1985; Anastasio 1991; Arnold and Robinson 1991).
Acknowledgments

This work was supported by the Office of Naval Research (contract number N00018-89-J-1240), and NIH Grants RR00166 and NS12542. E.E.M. was supported in part by the Graduate Program in Neurobiology at the University of Washington.
References

Anastasio, T. J. 1991. Neural network models of velocity storage in the horizontal vestibulo-ocular reflex. Biol. Cybern. 64, 187-196.
Arnold, D. B., and Robinson, D. A. 1991. A learning network of the neural integrator of the oculomotor system. Biol. Cybern. 64, 447-454.
Cannon, S. C., and Robinson, D. A. 1985. An improved neural-network model for the neural integrator of the oculomotor system: More realistic neuron behavior. Biol. Cybern. 53, 93-108.
Fetz, E. E. 1993. Dynamic recurrent neural network models of sensorimotor behavior. In The Neurobiology of Neural Networks, D. Gardner, ed., pp. 165-190. MIT Press, Cambridge, MA.
Fetz, E. E., and Shupe, L. E. 1990. Neural network models of the primate motor system. In Advanced Neural Computers, R. Eckmiller, ed., pp. 43-50. Elsevier North Holland, Amsterdam.
Fetz, E. E., Shupe, L. E., and Murthy, V. 1990. Neural networks controlling wrist movements. Proc. IJCNN-90 II, 675-679.
Fuchs, A. F. 1981. Eye-head coordination. In Handbook of Behavioral Neurobiology, Vol. 5: Motor Coordination, A. L. Towe and E. S. Luschei, eds., pp. 303-366. Plenum, NY.
LeCun, Y., Denker, J., Solla, S., Howard, R. E., and Jackel, L. D. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
Lisberger, S. G., and Sejnowski, T. J. 1992a. Computational analysis suggests a new hypothesis for motor learning in the vestibulo-ocular reflex. Tech. Rep. INC-9201, Institute for Neural Computation, U.C. San Diego.
Lisberger, S. G., and Sejnowski, T. J. 1992b. A novel mechanism of motor learning in a recurrent network model based on the vestibulo-ocular reflex. Nature (London) 360, 159-161.
Lockery, S. R., and Sejnowski, T. J. 1992. Distributed processing of sensory information in the leech: A dynamical neural network model of the local bending reflex. J. Neurosci. 12, 3877-3895.
Mozer, M. C., and Smolensky, P. 1989. Using relevance to reduce network size automatically. Connection Sci. 1, 3-16.
Patton, H. D., Fuchs, A. F., Hille, B., Scher, A. M., and Steiner, R. (eds.) 1989. Textbook of Physiology. W. B. Saunders, Philadelphia, PA.
Robinson, D. A. 1989. Integrating with neurons. Ann. Rev. Neurosci. 12, 33-45.
Rowat, P. F., and Selverston, A. I. 1991. Learning algorithms for oscillatory networks with gap junctions and membrane currents. Network 2, 17-41.
Tsung, F.-S., Cottrell, G. W., and Selverston, A. I. 1990. Experiments on learning stable network oscillations. Proc. IJCNN-90 I, 169-174.
Watrous, R. L., and Shastri, L. 1986. Learning phonetic features using connectionist networks: An experiment in speech recognition. Tech. Rep. MS-CIS-86-78, Linc Lab 44, University of Pennsylvania.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term active memory. Neural Comp. 3, 179-193.
Received June 18, 1992; accepted July 15, 1993.
Communicated by Fernando Pineda
A Convergence Result for Learning in Recurrent Neural Networks

Chung-Ming Kuan, University of Illinois Urbana-Champaign, Champaign, IL USA
Kurt Hornik Technical University of Vienna, Vienna, Austria
Halbert White University of California, San Diego, CA USA
We give a rigorous analysis of the convergence properties of a backpropagation algorithm for recurrent networks containing either output or hidden layer recurrence. The conditions permit data generated by stochastic processes with considerable dependence. Restrictions are offered that may help assure convergence of the network parameters to a local optimum, as some simulations illustrate.

1 Introduction
Artificial neural network models are a class of flexible nonlinear functions developed by cognitive scientists that have been used in forecasting, pattern recognition, signal processing, and process control applications. In "feedforward" networks, inputs activate "hidden" units, which in turn determine output activation. In these networks signals flow in only one direction, without feedback. Applications in forecasting, signal processing, and control require explicit treatment of dynamics. Feedforward networks can accommodate dynamics by including lagged (past) input and target values in an augmented set of inputs. However, a richer dynamic representation results from also allowing for internal network feedbacks. Such "recurrent" structures were used by Jordan (1986, 1992) for controlling and learning smooth robot movements, and by Elman (1990) for learning and representing temporal structure in linguistics. In Jordan's network, lagged values of network output feed back into hidden units; in Elman's network, lagged values of hidden units feed back into themselves. A leading learning method for feedforward networks is "backpropagation" (Werbos 1974; Parker 1982; Rumelhart et al. 1986). Its convergence properties have been analyzed rigorously by White (1989) for
independent identically distributed (i.i.d.) training examples, and by Kuan and White (1993) for a class of (time dependent) mixingale processes. Learning methods for recurrent networks are extensions of the method of backpropagation. A very general algorithm is that of Williams and Zipser (1988). Convergence properties of such learning methods for recurrent networks have not yet been rigorously analyzed. This paper provides a rigorous convergence analysis of an extension of backpropagation for recurrent networks containing Jordan and Elman networks as special cases. This method is a special case of the Williams-Zipser algorithm. The recurrent networks treated here are related to but distinct from the recurrent networks considered by Pineda (1987a,b) and Almeida (1987). These authors treat networks with instantaneous feedback; here we only permit feedback with a time lag. Their network recurrence structures also differ somewhat from those considered here. Consequently, our results do not cover Almeida's and Pineda's recurrent versions of backpropagation. Our results follow from a result of Kuan and White (1992) (KW), derived from fundamental results of Kushner and Clark (1978) (KC). The conditions under which convergence holds suggest some restrictions relevant in practice. We present simulations showing that networks imposing these restrictions can perform uniformly better in terms of mean squared error than networks not imposing these restrictions. The recurrent networks considered here are but one possible approach to nonlinear time series prediction. For an entrée to the considerable literature in this area, the interested reader is referred to Priestley (1988). We make no claims as to whether recurrent networks are in any way superior to other, better understood methods, and leave investigation of the usefulness of recurrent networks relative to other methods to subsequent research. Our attention to recurrent neural networks is motivated by the increasing interest and use they are receiving in the neural network community.

2 Heuristics and the Method of Recurrent Backpropagation
Suppose that we observe a realization of a sequence $\{Z_t\} = \{Z_t : t = 0, 1, \ldots\}$ of random vectors, where $Z_t = (Y_t, X_t^\top)^\top$ (with $\top$ denoting the transposition operator), $Y_t$ is (for simplicity) a scalar, and $X_t$ is a $v \times 1$ vector, $v \in \mathbb{N} = \{1, 2, \ldots\}$. We interpret $Y_t$ as a target value at time $t$, and $X_t$ as a vector of input variables influencing $Y_t$ and generated by nature. $X_t$ may contain lagged values of $Y_t$ (e.g., $Y_{t-1}, Y_{t-2}, \ldots$) as well as lagged values of other variables. For convenience, we assume throughout that the first element of $X_t$ (i.e., $X_{t1}$) is always equal to one. Let $X^t = (X_0, \ldots, X_t)$ denote the history of the $X$ process from time zero through time $t$. [Similarly, for any sequence $\{a_t\}$, $a^t = (a_0, \ldots, a_t)$.]
Suppose we are interested in approximating $E(Y_t \mid X^t)$, the conditional expectation of $Y_t$ given $X^t$, by a parametric function of $X^t$, so that $f_t : \mathbb{R}^{v(t+1)} \times \Theta \to \mathbb{R}$ (say) defines a family of approximations $f_t(X^t, \theta)$ as $\theta$ ranges over the parameter space $\Theta \subset \mathbb{R}^s$, $s \in \mathbb{N}$, say. In this situation we define the approximation error $e_t(\theta) = Y_t - f_t(X^t, \theta)$ and select $\theta^*$ such that

$$\theta^* = \operatorname{min!} \lim_{t \to \infty} E(e_t(\theta)^2)/2$$

where min! designates a local minimizer of its argument, we assume limits exist, and $E(\cdot)$ denotes mathematical expectation. To see why this is natural, note that

$$E(e_t(\theta)^2) = E([Y_t - E(Y_t \mid X^t)]^2) + E([E(Y_t \mid X^t) - f_t(X^t, \theta)]^2)$$

It follows that $\theta^*$ also satisfies

$$\theta^* = \operatorname{min!} \lim_{t \to \infty} E([E(Y_t \mid X^t) - f_t(X^t, \theta)]^2)$$

and thus indexes a locally mean-square optimal approximation to the limit of $E(Y_t \mid X^t)$. Given the validity of an interchange of limit, derivative, and expectation, we have

$$\lim_{t \to \infty} \nabla E(e_t(\theta)^2)/2 = \lim_{t \to \infty} E[\nabla e_t(\theta) \cdot e_t(\theta)] = 0$$

as the necessary first-order conditions for $\theta^*$, where $\nabla$ is the gradient operator with respect to $\theta$, producing an $s \times 1$ vector. The expectation above is usually unknown; however, the method of stochastic approximation (Robbins and Monro 1951; Kushner and Clark 1978) can approximate a solution to the first-order conditions. In general, stochastic approximation estimates a solution to the equations $M(\theta) = 0$ (with $M : \Theta \to \mathbb{R}^s$) as
$$\hat\theta_{t+1} = \hat\theta_t - \eta_t m_t(Z^t, \hat\theta_t), \qquad t = 0, 1, \ldots$$

where $\lim_{t\to\infty} E(m_t(Z^t, \theta)) = M(\theta)$, and $\{\eta_t \in \mathbb{R}^+\}$ is a "learning rate" sequence. For our application, we have
$$m_t(Z^t, \theta) = \nabla e_t(\theta)\, e_t(\theta)$$

We take $f_t$ as the output function of a recurrent neural network with output given in time period $t$ by

$$O_t = F(a + A_t^\top \beta) \tag{2.1a}$$

with $A_{ti} = G(X_t^\top \gamma_i + R_t^\top \delta_i)$, $i = 1, \ldots, q$,
Figure 1: Jordan network.
where $F : \mathbb{R} \to \mathbb{R}$, $G : \mathbb{R} \to \mathbb{I}$ ($\mathbb{I} = [0,1]$) are given functions (e.g., the logistic function $G(\lambda) = (1 + e^{-\lambda})^{-1}$); $A_t$ is the $q \times 1$ vector of hidden unit activations $A_{ti}$; parameters are $a$ ($1 \times 1$), $\beta$ ($q \times 1$), $\gamma = (\gamma_1^\top, \ldots, \gamma_q^\top)^\top$ ($qv \times 1$), and $\delta = (\delta_1^\top, \ldots, \delta_q^\top)^\top$ ($qp \times 1$), collected together in the $s \times 1$ network weight vector $\theta = (a, \beta^\top, \gamma^\top, \delta^\top)^\top$, with $s = 1 + q + q(v + p)$; and $R_t$ is the $p \times 1$ vector of recurrent variables $R_{ti}$, determined from previous inputs ($X_{t-1}$), previous recurrent values ($R_{t-1}$), and network weights ($\theta$) through $\rho_i$, $i = 1, \ldots, p$. When $R_t = O_{t-1}$, we have a Jordan (1986, 1992) network (see Fig. 1). Here
$$\rho(X_{t-1}, R_{t-1}, \theta) = F\left(a + \sum_{j=1}^{q} \beta_j G(X_{t-1}^\top \gamma_j + R_{t-1}^\top \delta_j)\right)$$

When $R_t = A_{t-1}$, we have the Elman (1990) network (see Fig. 2). In this case

$$\rho_i(X_{t-1}, R_{t-1}, \theta) = G(X_{t-1}^\top \gamma_i + R_{t-1}^\top \delta_i), \qquad i = 1, \ldots, q$$

Substituting for $A_t$ in 2.1a gives

$$O_t = F\left(a + \sum_{j=1}^{q} \beta_j G(X_t^\top \gamma_j + R_t^\top \delta_j)\right)$$
Figure 2: Elman network.
for the single hidden layer recurrent net. Because $e_t = Y_t - O_t$, network error depends on $Z_t$, $R_t$, and $\theta$. Thus, this net is a particular case of a generic class of models with errors

$$e_t = u(Z_t, R_t, \theta)$$

where the function $u$ results from the assumed network output function, and $R_t$ is determined by network recurrence. Above, the recurrent variables were generated as $R_t = \rho(X_{t-1}, R_{t-1}, \theta)$, with $\rho = (\rho_1, \ldots, \rho_p)^\top$. However, much flexibility is gained by including $Y_{t-1}$ as a determinant of $R_t$, so we write $R_t = \rho(Z_{t-1}, R_{t-1}, \theta)$. This incorporates "teacher forcing." Because of $R_t$, network error is a function of the entire history of targets and inputs, $Z^t$. For a given $\theta$ and a given initial recurrent value, say $R_0$, the recurrent variables are given in time period $t$ as

$$R_t = l_t(Z^{t-1}, \theta)$$

where we have suppressed the dependence of $l_t$ on $R_0$. Network error is then

$$e_t(\theta) = u(Z_t, l_t(Z^{t-1}, \theta), \theta)$$
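To fix ideas, here is a minimal sketch of one period of the Elman net in this notation (Python with NumPy; the identity $F$ and logistic $G$ follow the examples above, and the array shapes are our illustrative assumptions):

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def elman_step(x_t, r_t, a, beta, gamma, delta):
    """One period of the Elman net: hidden activations A_t feed back
    as the next period's recurrent variables (R_t = A_{t-1}).
    gamma: q x v input weights (rows gamma_i); delta: q x q recurrent
    weights (rows delta_i); x_t: input vector with first element 1."""
    A_t = logistic(gamma @ x_t + delta @ r_t)  # hidden layer
    O_t = a + beta @ A_t                       # output, with F = identity
    return O_t, A_t
```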
The gradient $\nabla e_t$ needed for learning is

$$\nabla e_t(\theta) = u_\theta(Z_t, l_t(Z^{t-1}, \theta), \theta)^\top + \nabla l_t(Z^{t-1}, \theta)\, u_r(Z_t, l_t(Z^{t-1}, \theta), \theta)^\top$$
where $u_\theta$ is the $1 \times s$ derivative of $u$ with respect to $\theta$, $u_r$ is the $1 \times p$ derivative of $u$ with respect to recurrent variables, and $\nabla l_t$ is the $s \times p$ gradient matrix of $l_t$ with respect to $\theta$. Any learning algorithm based directly on $e_t$ and $\nabla e_t$ will be computationally intensive, as the effect of any change in $\theta$ must be propagated through time from period zero up to period $t$. The required computations grow as $t$ increases, and the entire history $Z^t$ must be kept in memory. A computationally convenient alternative results from exploiting the recursive structure of $R_t$. Because

$$R_t = l_t(Z^{t-1}, \theta) = \rho(Z_{t-1}, l_{t-1}(Z^{t-2}, \theta), \theta)$$

it follows that

$$\nabla l_t(Z^{t-1}, \theta) = \rho_\theta(Z_{t-1}, R_{t-1}, \theta)^\top + \nabla l_{t-1}(Z^{t-2}, \theta)\, \rho_r(Z_{t-1}, R_{t-1}, \theta)^\top$$

where $\rho_\theta$ is the $p \times s$ Jacobian matrix of $\rho$ with respect to $\theta$ and $\rho_r$ is the $p \times p$ Jacobian matrix of $\rho$ with respect to recurrent variables. With $\Lambda_t = \nabla l_t(Z^{t-1}, \theta)$, we have a recursion,

$$\Lambda_t = \rho_\theta(Z_{t-1}, R_{t-1}, \theta)^\top + \Lambda_{t-1}\, \rho_r(Z_{t-1}, R_{t-1}, \theta)^\top$$
The recursions for $R_t$ and $\Lambda_t$ suggest a learning algorithm that updates $\hat R_t$ and $\hat\Lambda_t$ with the weight update in time $t$ but neglects the effect of weight updates on past values. If the system does not have "too long" a memory and if we eventually get "close" to $\theta^*$, then sufficiently little may be lost by ignoring the update effects that we still obtain the desired convergence to $\theta^*$. Thus, we begin by picking arbitrary initial weights $\hat\theta_0$, recurrent variables $\hat R_0$, and $s \times p$ gradient matrix $\hat\Lambda_0$. To update network weights we compute network error

$$\hat e_0 = u(Z_0, \hat R_0, \hat\theta_0)$$

and form

$$\nabla \hat e_0 = u_\theta(Z_0, \hat R_0, \hat\theta_0)^\top + \hat\Lambda_0\, u_r(Z_0, \hat R_0, \hat\theta_0)^\top$$

in order to get period 1 weights

$$\hat\theta_1 = \hat\theta_0 - \eta_0 \nabla\hat e_0\, \hat e_0$$

The recurrent variables and gradient matrix are updated for use in period 1 to

$$\hat R_1 = \rho(Z_0, \hat R_0, \hat\theta_0)$$
and

$$\hat\Lambda_1 = \rho_\theta(Z_0, \hat R_0, \hat\theta_0)^\top + \hat\Lambda_0\, \rho_r(Z_0, \hat R_0, \hat\theta_0)^\top$$

Now we may compute

$$\hat e_1 = u(Z_1, \hat R_1, \hat\theta_1)$$

and

$$\nabla\hat e_1 = u_\theta(Z_1, \hat R_1, \hat\theta_1)^\top + \hat\Lambda_1\, u_r(Z_1, \hat R_1, \hat\theta_1)^\top$$

to obtain period 2 weights

$$\hat\theta_2 = \hat\theta_1 - \eta_1 \nabla\hat e_1\, \hat e_1$$
At time $t$ we have targets and inputs $Z_t$, recurrent variables $\hat R_t$, weights $\hat\theta_t$, and gradient matrix $\hat\Lambda_t$, permitting us to compute

$$\hat e_t = u(Z_t, \hat R_t, \hat\theta_t), \qquad \nabla\hat e_t = u_\theta(Z_t, \hat R_t, \hat\theta_t)^\top + \hat\Lambda_t\, u_r(Z_t, \hat R_t, \hat\theta_t)^\top$$

$$\hat\theta_{t+1} = \hat\theta_t - \eta_t \nabla\hat e_t\, \hat e_t, \qquad \hat R_{t+1} = \rho(Z_t, \hat R_t, \hat\theta_t)$$

and

$$\hat\Lambda_{t+1} = \rho_\theta(Z_t, \hat R_t, \hat\theta_t)^\top + \hat\Lambda_t\, \rho_r(Z_t, \hat R_t, \hat\theta_t)^\top$$
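A compact sketch of one pass of these recursions (Python; $u$, $\rho$, and their partial derivatives are user-supplied callables standing in for the network's error and recurrence functions, which the text leaves generic):

```python
import numpy as np

def rbp_step(z_t, R_t, Lam_t, theta_t, eta_t,
             u, u_theta, u_r, rho, rho_theta, rho_r):
    """One step of recurrent backpropagation.
    Shapes: theta_t (s,), R_t (p,), Lam_t (s, p);
    u_theta returns (1, s), u_r returns (1, p),
    rho_theta returns (p, s), rho_r returns (p, p)."""
    e_t = u(z_t, R_t, theta_t)                                   # network error
    grad_e = u_theta(z_t, R_t, theta_t).T + Lam_t @ u_r(z_t, R_t, theta_t).T
    theta_next = theta_t - eta_t * e_t * grad_e.ravel()          # weight update
    R_next = rho(z_t, R_t, theta_t)                              # recurrent variables
    Lam_next = rho_theta(z_t, R_t, theta_t).T + Lam_t @ rho_r(z_t, R_t, theta_t).T
    return theta_next, R_next, Lam_next, e_t
```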
Note the modest memory and computation requirements of this algorithm. We refer to this as "recurrent backpropagation," as it generalizes backpropagation to certain recurrent networks. It is a special case of the Williams-Zipser (1988) algorithm. Our main goal is to obtain conditions under which recurrent backpropagation converges as $t \to \infty$ to a desired value, $\theta^*$. A potential difficulty is that nothing prevents $\hat\theta_t \to \infty$. To avoid this, we employ a projection operator $\pi : \mathbb{R}^s \to \Theta$, where $\Theta$ is a compact subset of $\mathbb{R}^s$. The projected process $\{\pi(\hat\theta_t)\}$ is bounded, and $\hat\theta_t = \pi(\hat\theta_t)$ whenever $\hat\theta_t \in \Theta$; $\{\hat\theta_t\}$ will also denote the projected process for notational convenience.

3 Main Results
In order to state our assumptions, we introduce the notion of a stochastic process near epoch dependent on an underlying mixing process. Such processes were studied by Billingsley (1968, Section 21). McLeish (1975) provides convergence results relied on here. Gallant and White (1988) introduced the terminology "near epoch dependent," and studied estimation of nonrecurrent nonlinear dynamic models of such processes.
Let $\{V_t\}$ be a stochastic process on a probability space $(\Omega, \mathcal{F}, P)$ and define the mixing coefficients

$$\phi_m = \sup_t \sup_{\{F \in \mathcal{F}_{-\infty}^t,\, G \in \mathcal{F}_{t+m}^{\infty} :\, P(F) > 0\}} |P(G \mid F) - P(G)|$$

$$\alpha_m = \sup_t \sup_{\{F \in \mathcal{F}_{-\infty}^t,\, G \in \mathcal{F}_{t+m}^{\infty}\}} |P(G \cap F) - P(G)P(F)|$$

where $\mathcal{F}_\tau^t = \sigma(V_\tau, \ldots, V_t)$, the $\sigma$-field generated by $V_\tau, \ldots, V_t$. When $\phi_m \to 0$ or $\alpha_m \to 0$ as $m \to \infty$ we say that $\{V_t\}$ is $\phi$-mixing or $\alpha$-mixing. When $\phi_m = O(m^\lambda)$ for some $\lambda < -a$ we say that $\{V_t\}$ is $\phi$-mixing of size $-a$, and similarly for $\alpha_m$. Mixing processes have an asymptotic independence property, although dependence in the short run may be considerable. The class of mixing processes is still restrictive, however. For example, a stable linear first-order autoregressive [AR(1)] process need not be $\alpha$-mixing (Andrews 1984). Processes formed as functions of infinite histories of mixing processes have longer memories, and constitute a much broader class of processes. As long as these functions depend mainly on the "near epoch" of the mixing process, they are still well-behaved enough for our purposes. Because of the potentially long memories created by the recurrent variables of the network, the near epoch dependence conditions we adopt provide a convenient and powerful context for analysis. Let $\|Z_t\|_2 = (E|Z_t|^2)^{1/2}$ and let $L_2(P)$ denote the class of random variables with $\|Z_t\|_2 < \infty$. Let $E_{t-m}^{t+m}(Z_t) = E(Z_t \mid \mathcal{F}_{t-m}^{t+m})$. We express the dependence of $\{Z_t\}$ on an underlying process $\{V_t\}$ in the following way.
Definition 3.1. Let $\{Z_t\}$ be a sequence of random variables belonging to $L_2(P)$, and let $\{V_t\}$ be a stochastic process on $(\Omega, \mathcal{F}, P)$. Then $\{Z_t\}$ is near epoch dependent (NED) on $\{V_t\}$ of size $-a$ if $\nu_m \equiv \sup_t \|Z_t - E_{t-m}^{t+m}(Z_t)\|_2$ is of size $-a$.

McLeish (1975, Theorem 3.1) establishes that NED functions of mixing processes are "mixingales," which possess convergence properties that suffice for the convergence conditions of Kushner and Clark (1978). We may now describe the data generating process.

Assumption A.1. $(\Omega, \mathcal{F}, P)$ is a complete probability space on which is defined the sequence of $\mathcal{F}$-measurable functions $\{Z_t : \Omega \to \mathbb{R}^{v+1}, t = 0, 1, 2, \ldots\}$, $v \in \mathbb{N}$, with $|Z_t| \le \epsilon^{-1} < \infty$. $\{Z_t\}$ is NED on $\{V_t\}$ of size $-1/2$, where $\{V_t, t = 0, \pm 1, \pm 2, \ldots\}$ is a mixing process on $(\Omega, \mathcal{F}, P)$ with $\phi_m$ of size $-1/2$ or $\alpha_m$ of size $-1$. For each $t = 0, 1, \ldots$, $Z_t$ is measurable-$\mathcal{F}_t = \sigma(\ldots, V_{t-1}, V_t)$. Partition $Z_t$ as $Z_t = (Y_t, X_t^\top)^\top$, $X_t : \Omega \to \mathbb{R}^v$, with $X_{t1} = 1$, $t = 0, 1, \ldots$.

The process generating the input and target sequences is thus bounded and may have a moderately long memory. By convention,
$\epsilon$ is a generic small constant. Let $\mathrm{supp}\, Z_t$ denote the support of $Z_t$, i.e., the closure of the complement of the largest Borel set $B$ such that $P[Z_t \in B] = 0$, and let $\mathrm{jsupp}\{Z_t\} = \mathrm{cl}(\cup_{t=0}^{\infty} \mathrm{supp}\, Z_t)$ denote the "joint support" of $\{Z_t\}$. Assumption A.1 implies that $\mathrm{jsupp}\{Z_t\} \subset K_z = \times_{i=1}^{v+1} [-\epsilon^{-1}, \epsilon^{-1}]$.
The following condition restricts the network error function.

Assumption A.2. Let $D_z$, $D_r$, and $D_\theta$ be Borel subsets of $\mathbb{R}^{v+1}$, $\mathbb{R}^p$, and $\mathbb{R}^s$, respectively, $p, s \in \mathbb{N}$, with $K_z \subset D_z$. Then $u : D_z \times D_r \times D_\theta \to \mathbb{R}$ is continuously differentiable of order 2 on $D_z \times D_r \times D_\theta$.
We let $u_\theta$ and $u_r$ denote the $1 \times s$ and $1 \times p$ partial derivative functions of $u$ with respect to $\theta$ and $r$. The next condition restricts network recurrence.

Assumption A.3. With $D_z$, $D_r$, and $D_\theta$ as in Assumption A.2, let $K_r$ be a compact subset of $D_r$ and let $\Theta$ be a compact subset of $D_\theta$.

(i) $\rho : D_z \times D_r \times D_\theta \to K_r$ is continuously differentiable of order 2 on $D_z \times D_r \times D_\theta$.

(ii) For each $(z, \theta)$ in $K_z \times \Theta$, $\rho(z, \cdot, \theta)$ is a contraction mapping on $K_r$, i.e., $|\rho(z, r_1, \theta) - \rho(z, r_2, \theta)| \le c_0 |r_1 - r_2|$, $c_0 < 1$, $r_1, r_2 \in K_r$.

We let $\rho_\theta$ and $\rho_r$ denote the $p \times s$ and $p \times p$ Jacobian matrices of $\rho$ with respect to $\theta$ and $r$. The contraction property keeps the internal network feedbacks from exploding or becoming chaotic. Explosive behavior is undesirable, and it is appropriate to rule it out. In contrast, treatment of the chaotic case is particularly important, as chaos can generate rich dynamics, while the present contraction mapping condition is uncomfortably restrictive. Indeed, as the number of recurrent units increases, the conditions placed on the weights to ensure contraction become more and more stringent (see Assumptions B.3 below). If chaotic recurrence is of value, then networks for which the contraction condition is enforced may behave poorly. We leave treatment of the chaotic case to future work. We now state formally the learning recursions.

Assumption A.4. (i) Let $K_\Lambda$ be a compact subset of $\mathbb{R}^{s \times p}$, and let $\hat R_0 \in K_r$, $\hat\Lambda_0 \in K_\Lambda$, and $\hat\theta_0 \in \Theta$ be chosen arbitrarily and independently of $\{Z_t\}$. For $t = 0, 1, 2, \ldots$, define
$$\hat e_t = u(Z_t, \hat R_t, \hat\theta_t), \qquad \nabla\hat e_t = u_\theta(Z_t, \hat R_t, \hat\theta_t)^\top + \hat\Lambda_t\, u_r(Z_t, \hat R_t, \hat\theta_t)^\top$$

$$\hat\theta_{t+1} = \pi(\hat\theta_t - \eta_t \nabla\hat e_t\, \hat e_t), \qquad \hat R_{t+1} = \rho(Z_t, \hat R_t, \hat\theta_t)$$

and

$$\hat\Lambda_{t+1} = \rho_\theta(Z_t, \hat R_t, \hat\theta_t)^\top + \hat\Lambda_t\, \rho_r(Z_t, \hat R_t, \hat\theta_t)^\top$$

where $\pi : \mathbb{R}^s \to \Theta$ is a projection operator restricting $\{\hat\theta_t\}$ to the compact set $\Theta$; and (ii) $\{\eta_t\}$ is a sequence of positive real numbers such that $\sum_{t=0}^{\infty} \eta_t^2 < \infty$ and $\sum_{t=0}^{\infty} \eta_t = \infty$.
An important condition is the restriction on the learning rate sequence $\{\eta_t\}$. This condition holds whenever $\eta_t \propto t^{-p}$, $1/2 < p \le 1$. The larger values for $p$ lead to faster convergence. The projection device applied to $\{\hat\theta_t\}$ ensures that $\{\hat\theta_t\}$ is bounded. Assumption A.3 ensures that $\{\hat\Lambda_t\}$ is bounded. One more condition is required to state KW's convergence result; it guarantees the existence of the limit of $E(\nabla e_t(\theta) \cdot e_t(\theta))$. We define a function $h$ as

$$h(\chi, \theta) = -[u_\theta(z, r, \theta)^\top + \Lambda\, u_r(z, r, \theta)^\top]\, u(z, r, \theta)$$

where $\chi = (z^\top, r^\top, \mathrm{vec}\,\Lambda^\top)^\top$, and we define $\chi_t(\theta) = (\chi_t^1(\theta)^\top, \chi_t^2(\theta)^\top, \chi_t^3(\theta)^\top)^\top$, where $\chi_t^1(\theta) = Z_t$, $\chi_t^2(\theta) = l_t(Z^{t-1}, \theta)$, and $\chi_t^3(\theta) = \mathrm{vec}\, \nabla l_t(Z^{t-1}, \theta)$. Our final condition is:

Assumption A.5. For each $\theta \in \Theta$, $\bar h(\theta) = \lim_{t\to\infty} E(h(\chi_t(\theta), \theta))$ exists.
With this assumption, $-\bar h(\theta) = \lim_{t\to\infty} E(\nabla e_t(\theta)\, e_t(\theta))$, the least squares gradient function. KC's results establish certain properties of the piecewise linear interpolations of $\{\hat\theta_t\}$ with interpolation intervals $\{\eta_t\}$. Define $\tau_t \equiv \sum_{i=0}^{t-1} \eta_i$, $t \ge 1$, $\tau_0 = 0$. The interpolated process is defined as

$$\hat\theta^0(\tau) = \eta_t^{-1}(\tau_{t+1} - \tau)\,\hat\theta_t + \eta_t^{-1}(\tau - \tau_t)\,\hat\theta_{t+1}, \qquad \tau \in [\tau_t, \tau_{t+1})$$
and its leftward shifts are defined as

$$\hat\theta^t(\tau) = \hat\theta^0(\tau_t + \tau), \qquad \tau \ge -\tau_t, \qquad t = 0, 1, 2, \ldots$$
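A sketch of the interpolation, for intuition only (the helper below is our illustration, valid for $\tau$ inside the covered range):

```python
import numpy as np

def interpolated_path(thetas, etas, tau):
    """Piecewise-linear interpolation theta^0(tau) of the iterates
    {theta_t} on the time scale tau_t = eta_0 + ... + eta_{t-1}."""
    taus = np.concatenate(([0.0], np.cumsum(etas)))        # tau_0, tau_1, ...
    t = int(np.searchsorted(taus, tau, side='right')) - 1
    lam = (tau - taus[t]) / etas[t]                        # position within [tau_t, tau_{t+1})
    return (1.0 - lam) * thetas[t] + lam * thetas[t + 1]
```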
We thus have a sequence $\{\hat\theta^t(\cdot)\}$ of continuous functions on $(-\infty, \infty)$. In stating the result, we write $\hat\theta_t \to \Theta^*$ as $t \to \infty$ if $\inf_{\theta \in \Theta^*} |\hat\theta_t - \theta| \to 0$ as $t \to \infty$. Also, for a continuous vector field $v(\cdot)$ on $\Theta$, define the vector field $\tilde\pi[v(\cdot)]$ as

$$\tilde\pi[v(\theta)] = \lim_{\delta \to 0} [\pi(\theta + \delta v(\theta)) - \theta]/\delta, \qquad \theta \in \Theta$$

when the limit is unique. When $\theta$ is in $\Theta$ but not on its boundary, then for $\delta$ sufficiently small $\theta + \delta v(\theta)$ is in $\Theta$, so that $\tilde\pi[v(\theta)] = v(\theta)$.
Theorem 3.2 (Kuan and White, 1992). Suppose that Assumptions A.1-A.5 hold. Then

(a) there exists a P-null set $\Omega_0$ such that for $\omega \notin \Omega_0$, $\{\hat\theta^t(\cdot)\}$ is bounded and equicontinuous on bounded intervals, and $\{\hat\theta^t(\cdot)\}$ has a convergent subsequence whose limit $\bar\theta(\cdot)$ satisfies the ODE $\dot\theta = \tilde\pi[\bar h(\theta)]$.

Let $\Theta^*$ be the set of locally asymptotically stable (in the sense of Liapunov) equilibria in $\Theta$ for this ODE, with domain of attraction $d(\Theta^*) \subset \mathbb{R}^s$.

(b) If $\Theta \subset d(\Theta^*)$, then $\hat\theta_t \to \Theta^*$ as $t \to \infty$ with probability 1 (w.p.1).

(c) If $\Theta$ is not contained in $d(\Theta^*)$, but for each $\omega \notin \Omega_0$, $\hat\theta_t(\omega)$ enters a compact subset of $d(\Theta^*)$ infinitely often, then $\hat\theta_t \to \Theta^*$ as $t \to \infty$ w.p.1.

(d) Given the conditions in (c), if $\Theta^*$ contains only finitely many points, then there exists a measurable mapping $\theta^* : \Omega \times \Theta \times K_r \times K_\Lambda \to \Theta^*$ such that $\hat\theta_t - \theta^*(\cdot, \hat\theta_0, \hat R_0, \hat\Lambda_0) \to 0$ as $t \to \infty$ w.p.1.
The path of the recurrent backpropagation algorithm behaves asymptotically like the solution trajectory of an appropriate ODE. Thus, $\theta^*$ satisfies $\bar h(\theta^*) = 0$ when $\bar h$ has a zero in $\Theta$. These conclusions are identical to those of Kuan and White (1993) using Theorem 2.4.2 of KC for single hidden layer feedforward networks, except that there $\theta^*$ indexes a locally mean square optimal approximation to $E(Y_t \mid X_t)$, while here the approximation is to $E(Y_t \mid X^t)$. Note that $\theta^*$ is explicitly a function of both the data (through $\omega$) and the starting values $\hat\theta_0$, $\hat R_0$, and $\hat\Lambda_0$. Different starting values can lead to different equilibria in general; similarly, different realizations of the data generating process will lead to different equilibria, as noise in the system may push $\hat\theta_t$ from the domain of attraction of one equilibrium into that of another. Nevertheless, cycling between isolated equilibria is ruled out by conclusion (d). Because output in single hidden layer recurrent networks is given by

$$O_t = F\left(a + \sum_{j=1}^{q} \beta_j G(X_t^\top \gamma_j + R_t^\top \delta_j)\right) \tag{3.1}$$

the network error function is

$$u(z, r, \theta) = y - F\left(a + \sum_{j=1}^{q} \beta_j G(x^\top \gamma_j + r^\top \delta_j)\right) \tag{3.2}$$

For Jordan nets, network recurrence is

$$\rho(z, r, \theta) = F\left(a + \sum_{j=1}^{q} \beta_j G(x^\top \gamma_j + r^\top \delta_j)\right) \tag{3.3}$$
For Elman nets, network recurrence is

$$\rho_i(z, r, \theta) = G(x^\top \gamma_i + r^\top \delta_i), \qquad i = 1, \ldots, q \tag{3.4}$$
It is now simple to state conditions sufficient for those of Theorem 3.2; we maintain Assumptions A.1, A.4, and A.5, and choose $F$, $G$, and $\Theta$ so that Assumptions A.2 and A.3 hold. The following suffices for Assumption A.2.
Assumption B.2. Network output is given by 3.1 and network error by 3.2, where $F : \mathbb{R} \to \mathbb{R}$ and $G : \mathbb{R} \to \mathbb{I}$ are twice continuously differentiable on $\mathbb{R}$.

For example, $F$ may be the identity function, or $F$ and $G$ may be the logistic squasher or tanh squasher. We denote the first derivatives of $F$ and $G$ as $F'$ and $G'$. The differentiability conditions of Assumption A.3(i) are satisfied for both Jordan and Elman nets under Assumption B.2. It remains to guarantee that in each case $\rho(z, \cdot, \theta)$ is a contraction mapping. First consider the Jordan net. Define the compact sets $K_F = \{b \in \mathbb{R} : b = a + \sum_{j=1}^{q} \beta_j a_j,\ a_j \in \mathbb{I},\ \theta \in \Theta\}$ and $K_G = \{a \in \mathbb{R} : a = x^\top \gamma_i + r^\top \delta_i,\ z \in K_z,\ r \in K_r,\ \theta \in \Theta\}$, where here $K_r = \mathrm{co}\, F(K_F)$, the convex hull of the image of $K_F$ under $F$. The mean value theorem ensures that in the convex compact set $K_r$

$$|\rho(z, r_1, \theta) - \rho(z, r_2, \theta)| \le \left(\sup_{z \in K_z,\, r \in K_r,\, \theta \in \Theta} |\rho_r(z, r, \theta)|\right) |r_1 - r_2|$$
The continuity of $F'$ and $G'$ and the compactness of $K_F$ and $K_G$ imply the existence of constants $C_F$ and $C_G$ bounding $|F'(b)|$ and $|G'(a)|$ for all $b \in K_F$, $a \in K_G$, so

$$|\rho_r(z, r, \theta)| \le C_F C_G \sum_{j=1}^{q} |\beta_j||\delta_j|$$

This is less than 1 as we require if $\sum_{j=1}^{q} |\beta_j||\delta_j| < (C_F C_G)^{-1}$, so we impose the following condition.
Assumption B.3(a) (Jordan). Network recurrence is determined by 3.3. Put $C_F = \sup_{b \in K_F} |F'(b)|$, $C_G = \sup_{a \in K_G} |G'(a)|$. Then $\Theta$ is such that $\sum_{j=1}^{q} |\beta_j||\delta_j| \le (C_F C_G)^{-1}(1 - \epsilon)$ for some $\epsilon > 0$.
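In practice one can verify B.3(a) numerically; a sketch (the function name and interface are our illustration):

```python
import numpy as np

def satisfies_b3a(beta, delta, C_F, C_G, eps=0.0):
    """Check the Jordan contraction condition B.3(a):
    sum_j |beta_j| * |delta_j| <= (C_F * C_G)^(-1) * (1 - eps),
    where |delta_j| is the Euclidean norm of the jth recurrent weight vector."""
    total = sum(abs(b) * np.linalg.norm(d) for b, d in zip(beta, delta))
    return total <= (1.0 - eps) / (C_F * C_G)
```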
For example, if $F(\lambda) = G(\lambda) = (1 + e^{-\lambda})^{-1}$ (logistic squashing at both hidden and output layers) then $C_F = C_G = 1/4$, so contraction is ensured by imposing $\sum_{j=1}^{q} |\beta_j||\delta_j| \le 16(1 - \epsilon)$. The theory thus provides a concrete benefit, insofar as this restriction aids practical implementations. For the Elman net, set $K_r = \mathbb{I}^q$ in defining $K_G$. Now $\rho$ is a vector-valued function. The mean value theorem for such functions again ensures
~ p ( z , r l ,-P(Z,rzr6)1 ~) I
(
sup
1
IPr(Z,r,O)l
zEKz,rEK,,OEQ
-r21
1r1
where now $|\rho_r(z, r, \theta)|$ is the square root of the maximum eigenvalue of $\rho_r(z, r, \theta)^\top \rho_r(z, r, \theta)$. Now

$$\rho_{ri}(z, r, \theta) = G'(x^\top \gamma_i + r^\top \delta_i)\, \delta_i^\top, \qquad i = 1, \ldots, q$$
We obtain the contraction property using
Assumption B.3(b) (Elman). Network recurrence is determined by 3.4. Put $C_G = \sup_{a \in K_G} |G'(a)|$. Then $\Theta$ is such that $(\sum_{i=1}^{q} \delta_i^\top \delta_i)^{1/2} \le C_G^{-1}(1 - \epsilon)$ for some $\epsilon > 0$.

The desired convergence now follows immediately.
Corollary 3.3. Given Assumptions A.1, B.2, B.3(a) or B.3(b), A.4, and A.5, the conclusions of Theorem 3.2 hold.

Thus, recurrent backpropagation in the Jordan or Elman nets converges in the precise sense established by Theorem 3.2, provided that network weights are sufficiently restricted as to ensure a contraction mapping property for network recurrence.

4 Simulations
In this section we present the results of simulations designed to assess the effectiveness of our recursions, and in particular the suggested constraints. The target variables $y_t$ are generated from the following data generating processes (DGPs): (i) a bilinear DGP (Granger and Andersen 1978)
and (ii) a nonlinear moving average (MA) DGP, where the $\epsilon_t$ are independent N(0,1) and $T = 2000$. It is not difficult to see that targets generated from these nonlinear models depend on their own entire history. We estimate the Elman network with 4-7 hidden units and use two lags of the target variable, $y_{t-1}$ and $y_{t-2}$, as network inputs. This permits recurrent variables to capture additional nonlinear and dynamic structure. The activation function $F$ between the hidden and output layers is the identity function, and the activation function $G$ between the input and hidden layers is the logistic function. These choices are standard. The network connection weights are estimated using the proposed recurrent backpropagation algorithm with and without the contraction mapping constraint. The learning rates are a constant (0.02) in the first 1500 recursive steps and tend to zero at the rate $1/t$ from then on. In view of Assumption B.3(b), $C_G = 1/4$ because $G$ is the logistic function; a constraint sufficient for the desired contraction mapping property is that $|\delta_{ij}| \le 4/q - 0.0001$ for all $i, j$, where $q$ is the number of hidden units. The initial feedforward connection weights (the $\gamma$s and $\beta$s) are generated from N(0,1); the initial recurrent connection weights (the $\delta$s) are generated from N(0,25) so that the imposed constraints can be binding. The number of replications is 200. We record the last 1000 mean squared errors (MSEs) from the learning process and average over 200 replications. For $y_t$ generated from the bilinear DGP, the MSEs of networks with 4-7 hidden units are plotted in Figures 3-6; for $y_t$ generated from the nonlinear MA DGP, the resulting MSEs are plotted in Figures 7-10. It can be seen from the figures that MSEs improve noticeably after the learning rates start decreasing, as they should. The figures also clearly indicate that if the algorithm is employed without the constraint, the resulting MSEs are always larger than the MSEs from the algorithm with the constraint. This shows that the imposed constraint helps to deliver better convergence results. Table 1 summarizes the average of the last 1000 MSEs and the final MSE. From Table 1 we see that for the algorithm without the constraint, the average and last MSEs may increase when the number of hidden units increases, while the average and last MSEs from the algorithm with the constraint gradually decline as the number of hidden units increases. This suggests that the networks "learned" by the algorithm without the constraint may be quite misleading. Of course, there is no guarantee that the restricted learning algorithm is itself converging to the globally optimal predictor for recurrent networks with up to seven hidden units. For this reason, we recommend training recurrent networks from a large number of different starting values, with and without the contraction constraint imposed.
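A sketch of the elementwise constraint used here, applied (say) after each weight update (the clipping helper is our illustration, not the authors' code):

```python
import numpy as np

def clip_recurrent_weights(delta, q, margin=0.0001):
    """Enforce |delta_ij| <= 4/q - margin, which suffices for B.3(b)
    with logistic G (C_G = 1/4): the Frobenius norm of delta is then
    at most q * max|delta_ij| < 4 = C_G^{-1}."""
    bound = 4.0 / q - margin
    return np.clip(delta, -bound, bound)
```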
Figure 3: Bilinear DGP: Elman network with 4 hidden units.
Figure 4: Bilinear DGP: Elman network with 5 hidden units.
Figure 5: Bilinear DGP: Elman network with 6 hidden units.
Figure 6: Bilinear DGP: Elman network with 7 hidden units.
Figure 7: Nonlinear-MA DGP: Elman network with 4 hidden units.
Figure 8: Nonlinear-MA DGP: Elman network with 5 hidden units.
Figure 9: Nonlinear-MA DGP: Elman network with 6 hidden units.
Figure 10: Nonlinear-MA DGP: Elman network with 7 hidden units.
Table 1: Summary of Simulation Results.*

Bilinear DGP
                  With constraint               Without constraint
Hidden units      Avg. MSE           Last MSE   Avg. MSE           Last MSE
4                 1.399 (.0006307)   1.366      1.456 (.0004065)   1.429
5                 1.392 (.0009698)   1.351      1.449 (.0006545)   1.416
6                 1.375 (.0013630)   1.325      1.462 (.0008569)   1.423
7                 1.369 (.0016600)   1.315      1.457 (.0011570)   1.412

Nonlinear-MA DGP
                  With constraint               Without constraint
Hidden units      Avg. MSE           Last MSE   Avg. MSE           Last MSE
4                 1.280 (.0003911)   1.253      1.332 (.0003667)   1.307
5                 1.273 (.0005717)   1.241      1.338 (.0005720)   1.306
6                 1.268 (.0008101)   1.229      1.339 (.0006874)   1.304
7                 1.265 (.0009178)   1.222      1.348 (.0007850)   1.310

*Standard errors are in parentheses; averages are over 200 replications.
5 Summary and Concluding Remarks
We have generalized backpropagation to recurrent networks and applied a result of Kuan and White (1992) to establish the almost sure convergence of recurrent backpropagation for the Jordan and Elman nets. The proposed algorithm is a nonlinear extension of the recursive prediction error algorithm studied by Ljung and Soderstrom (1983) in the context of system identification. Our result can also be applied directly to analyze those recursive algorithms. The analytic and simulation results suggest that recurrent networks constructed according to B.3(a) and (b) can be effectively trained via the proposed algorithm. Other recurrent structures are also readily handled by applying KW's result, for example, recurrent networks with several hidden layers feeding back into one another in various ways. A key restriction is that network recurrence has a contraction mapping property so that recurrent variables are eventually stable. Relaxing this restriction is the subject of future research. For simplicity, we did not permit control or manipulation of the system generating the data; however, KW's results apply directly to this case also. Specifically, their convergence results apply to situations in which a recurrent network is learning while controlling an unknown system
with internal feedback and output subject to exogenous noise. Some interesting difficulties arise in this context due to the network's ignorance of the recurrence structure of the system. In particular, the convergence is no longer necessarily to a locally mean square optimal approximation to expected system behavior. The interested reader is referred to KW for additional details.
Acknowledgments

This work was supported by NSF Grants SES-8806989 and SES-8921382. The authors have benefited from conversations with Jose Ferreira Machado, Boyan Jovanovich, Tung Liu, and Max Stinchcombe. Halbert White wishes especially to express his deep appreciation to the Faculty of Economics, New University of Lisbon (Lisbon, Portugal), whose setting and collegiality provided essential inspiration.
References

Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the IEEE First International Conference on Neural Networks, pp. II:609-618. IEEE Press, New York.
Andrews, D. W. K. 1984. Non-strong mixing autoregressive processes. J. Appl. Prob. 21, 930-934.
Billingsley, P. 1968. Convergence of Probability Measures. John Wiley, New York.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Gallant, A. R., and White, H. 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Granger, C. W. J., and Andersen, A. P. 1978. An Introduction to Bilinear Time Series Models. Vandenhoeck and Ruprecht, Göttingen.
Jordan, M. 1986. Serial order: A parallel distributed processing approach. University of California San Diego, Institute for Cognitive Science, ICS Report 8604.
Jordan, M. 1992. Constrained supervised learning. J. Math. Psychol. 36, 396-425.
Kuan, C.-M., and White, H. 1993. Artificial neural networks: An econometric perspective. Econometric Rev., forthcoming.
Kuan, C.-M., and White, H. 1992. Adaptive learning with nonlinear dynamics driven by dependent processes. University of California San Diego, Department of Economics Discussion Paper.
Kushner, H. J., and Clark, D. S. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.
Ljung, L., and Soderstrom, T. 1983. Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
McLeish, D. L. 1975. A maximal inequality and dependent strong laws. Ann. Prob. 3, 829-839.
Parker, D. B. 1982. Learning logic. Invention Report 581-64 (File 1), Stanford University, Office of Technology Licensing.
Pineda, F. J. 1987a. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Pineda, F. J. 1987b. Generalization of back-propagation to recurrent and higher order neural networks. In Proceedings of the IEEE Conference on Neural Information Processing Systems. IEEE Press, New York.
Priestley, M. B. 1988. Non-linear and Non-stationary Time Series Analysis. Academic Press, San Diego.
Robbins, H., and Monro, S. 1951. A stochastic approximation method. Ann. Math. Statist. 22, 400-407.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished Ph.D. dissertation, Harvard University, Department of Applied Mathematics.
White, H. 1989. Some asymptotic results for learning in single hidden layer feedforward network models. J. Am. Statist. Assoc. 84, 1003-1013.
Williams, R. J., and Zipser, D. 1988. A learning algorithm for continually running fully recurrent neural networks. University of California San Diego, Institute for Cognitive Science, ICS Report 8805.

Received April 1, 1991; accepted July 20, 1993.
Communicated by Harry Barrow
Topology Learning Solved by Extended Objects: A Neural Network Model

Csaba Szepesvári
László Balázs¹
András Lőrincz
Department of Photophysics, Institute of Isotopes of the Hungarian Academy of Sciences, P.O. Box 77, Budapest, Hungary H-1525

It is shown that local, extended objects of a metrical topological space shape the receptive fields of competitive neurons to local filters. Self-organized topology learning is then solved with the help of Hebbian learning together with extended objects that provide unique information about neighborhood relations. A topographical map is deduced and is used to speed up further adaptation in a changing environment with the help of Kohonen-type learning that teaches the neighbors of winning neurons as well.

1 Introduction

Self-organized learning is a most attractive feature of certain artificial neural network paradigms (Grossberg 1976; Kohonen 1984; Carpenter and Grossberg 1987; Földiák 1990). It is considered as a means of solving problems in an unknown environment. The architecture of the neural network, however, is in general "inherited," in other words is prewired, and may severely limit the adaptation possibilities of the net. An example is the Kohonen-type topographical map that has a built-in neighborhood relation. Various attempts have been made to resolve this problem, such as the self-building model of Fritzke (Fritzke 1991) and the neural gas network of Martinetz and Schulten (Martinetz and Schulten 1991). The closest to the present work is the neural gas network. It is based on the idea that topology can be learned on the basis of joint similarity. It determines adjacent neurons based on the distance measured in the metric of the input vector space. Consider, however, the example of a maze embedded in a two-dimensional Euclidean space. The two sides of one of the walls in the maze have very close input vectors in the metric of the Euclidean space though they may be very far from each other in the

¹Permanent address: János Bolyai Institute of Mathematics, Attila József University of Szeged, Szeged, Hungary H-6720.
metric space brought about by the topology of the maze. The question then arises if and how and under what conditions a given network is capable of determining the topology of the external world. The model we present here relies on the extended nature of objects in the external world. This method can take advantage of the same idea of joint similarity and provides a simple route for exploring the neighborhood relations, as the objects themselves bring the information. The receptive fields of the neurons are local filters, and the neural network may be considered as a dimension-reducing system that provides position information and neglects object details. In order to take advantage of neighboring relations and the possibility of Kohonen-type neighbor training, a distance function may be established with the help of Hebbian learning: extended objects may overlap with receptive fields of different neurons and may excite more than one neuron at the same time. These neurons then assume activities different from zero, and Hebbian learning may be used to set up connectivity that matches the topology. The strength of the connection may be related to the distance in further processing. The greater the strength, the smaller the distance. In this way a topographical map is established and Kohonen-type training becomes a feasible means of speeding up further adaptation in a changing environment.

2 Dimensionality Reduction with Spatial Filters
First, we define local, extended objects. Let us assume that the external world is a metrical topological space equipped with a measure and is embedded in a bounded region of Euclidean space. Let us then consider a mapping from the subsets of the bounded region of the Euclidean space into a finite-dimensional vector space. This mapping could, for example, be defined by choosing two vectors of the subsets randomly. Another type of mapping may, for example, spatially digitize the external world, and form a digital image. Hereinafter we shall use this mapping and call the elements of the digitized image pixels. A vector of the vector space will be considered a local, extended object if (1) there exists a connected open set of the metrical topological space that after mapping is identical with the said vector, (2) the measure of that open set is not zero, and (3) the open set's convex hull taken in the vector space is in the topological space. Let us further assume that our inputs are local, extended objects and our task is to provide the approximate position of the corresponding real object with no reference to its form. In order to be more concrete, let us take the example of a three-dimensional object mapped onto two two-dimensional retinas (i.e., onto a many-dimensional vector space). The vector that describes the digitized image on the retinas is the extended object. The task is to determine the position of the original, real object,
Figure 1: Digitized spatial gaussian filters. Gaussian filters of a two-dimensional box corresponding to the expressions "upper left," "upper middle," "upper right," etc. The figure was drawn by generating pixels randomly with probabilities that correspond to the gray-scale values.

with no reference to its form and with no a priori knowledge of the dimensionality, nor even of the topology of the external world. This problem will not be considered here; however, these tools are general enough to solve it. In the following we restrict our investigations to the case of a single retina. We may say, for example, that an object is "in the middle" or "in the upper left corner" or that it is "down and right." This task may be considered as a dimensionality reduction problem since if the image is given in the form of $n \times n$ pixels, having continuous gray-scale intensities, then one maps an $n \times n$ input matrix having elements on the $[0,1]$ real interval into the world of $m$ expressions that denote the possible different positions. Let us assume that the spatial filters that correspond to our position expressions already exist, and let us list the expressions and organize the filters in a way that the $i$th filter corresponds to the $i$th expression. For example, in the case of a two-dimensional image the expression "middle" would correspond to a spatial filter that transforms the image by causing no change in pixel intensities around the middle of the image, but the farther the pixels are from the center the more the filter decreases the pixel intensities. As a demonstration, Figure 1 shows spatial filters of a nine-expression set. These are gaussian filters but other filters could
serve just as well. The filters are digitized by replacing the center value of every pixel by the closest digitized gray-scale value. Let $G^{(i)}$ denote the $i$th digitized gaussian spatial filter, and let $S$ denote an input (image) vector. Let $g_{pq}^{(i)}$, $s_{pq}$ ($1 \le p, q \le n$) denote the values in pixels $(p,q)$ of the digitized $i$th gauss filter and the input vector, respectively. Now, the searched position estimation may be given in the following fashion: First, let the input vector pass all of the digitized gauss filters. Let us denote the output of the $i$th filter by $G_i$ (examples will be given later) and denote the mapping by $d$:

$$G_i = d(G^{(i)}, S), \qquad (i = 1, 2, \ldots, m) \tag{2.1}$$

The mapping $d$ is to be engineered in such a way that it can be considered as a "distance" function $\mathbb{R}^{n \times n} \to \mathbb{R}^+ \cup \{0\}$ providing distance-like quantities between the pattern inputting the network and the gaussian filters. With such a $d$ function one might choose the smallest $G_i$ value. If that has index $j$, then we say the position of the object is the $j$th expression. There are various possibilities for function $d$; here we list three of them:

- Conventional filtering is defined by multiplying the input values by the filter values and then integrating over the input space. In order to fulfil our requirements for the "distance" function $d$, let us define it in the following fashion:

$$d_1(G^{(i)}, S) = 1 - \frac{1}{n^2} \sum_{p,q=1}^{n} g_{pq}^{(i)} s_{pq} \tag{2.2}$$

From here onward it is assumed that $0 \le x_{ij}, y_{ij} \le 1$. This "distance" function has the form $1 - (1/n^2) X \cdot Y$, where $X \cdot Y$ is the inner product, or spherical distance. Since this "distance" definition is normalized, we might define an "input-to-filter similarity measure," or measure of similarity, $S$ in short, as $S = 1 - d$, where $d$ is the "distance." The smaller the distance, the larger the similarity between input and filter vectors.

- One might try to use the usual Euclidean distance in $\mathbb{R}^{n \times n}$, that is
$$d_2(G^{(i)}, S) = \frac{1}{n} \left[\sum_{p,q=1}^{n} (g_{pq}^{(i)} - s_{pq})^2\right]^{1/2} \tag{2.3}$$

- Another form that is not a metric but may be used here is (2.4)
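A minimal sketch of the first two "distance" functions for $n \times n$ arrays with entries in $[0,1]$ (the normalization of $d_2$ follows our reconstruction of equation 2.3 above):

```python
import numpy as np

def d1(G, S):
    """Spherical 'distance' of eq. 2.2: one minus the normalized inner
    product of filter G and input image S."""
    n = G.shape[0]
    return 1.0 - (G * S).sum() / n**2

def d2(G, S):
    """Euclidean distance of eq. 2.3 between filter and input images."""
    n = G.shape[0]
    return np.sqrt(((G - S)**2).sum()) / n
```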
Figure 2: Architecture of the artificial neural network. The ANN has a set of input nodes. Inputs connected to the network are denoted by $x_i$, $i = 1, 2, \ldots, n$. Every input node is connected to every neuron of the network. Every neuron stores a vector of the input space; neuron $j$ stores $(w_1^j, w_2^j, \ldots, w_n^j)$. Neurons develop another set of connections, the $(q_{kl})$ topology connections, an internal representation of the topology of the external world.
3 Forming Spatial Filters by Competitive Learning
Special competitive networks may form equiprobabilistic digitization of the input space according to the density distribution of input vectors (Desieno 1988). Networks of pure competitivity are known to solve the problem of equiprobabilistic digitization for uniform distributions (Kohonen 1984). Let us define a competitive neural network of m neurons. The external world provides inputs to input nodes. Every input node is connected to every neuron of the network (see Fig. 2). Every neuron stores a vector of the input space. The stored vector of the ith neuron is denoted by w_i. Training modifies the stored vectors. The procedure of modification is as follows:
- An input is presented to the network in accordance with the density distribution. Input nodes forward inputs to the neurons.
- Neurons process their inputs and develop activities in accordance with the equation

    D_i = d(x, w_i) = 1 - S_i    (3.1)

  where x is the forwarded input vector, d is a "distance" function, and S_i is the measure of similarity.
- Competition starts. The winner of the competition is the neuron whose stored vector is the "closest" to the input vector, that is, the neuron having the smallest "distance" (or largest similarity). The stored vector of the winning neuron i is then modified with the help of the update rule:

    \Delta w_i = \alpha (x - w_i)

where \alpha is the so-called learning rate; 0 < \alpha \leq 1. Here we apply a constant learning rate during the whole training. It was found that a time-dependent learning rate did not improve training results. A time-independent learning rate has the advantage that it keeps adaptivity.

In the numerical simulations, we presented two-dimensional objects of a two-dimensional space to the network. The input space and one of the input vectors that was presented are illustrated in Figure 3. Input vectors were derived by computing the overlap of the local, extended, randomly positioned objects and the pixels of digitization. Two different objects were used in these runs, an X-shaped object (shown in Fig. 3) and an O-like object (not shown). In the first set of runs a single object was presented to the network at random positions. In other runs two or three objects were presented simultaneously to the network at random positions. The training procedure resulted in spatial filters for "distance" functions d_1, d_2, and d_3 defined in the previous section. We tried single objects for "distance" functions d_2 and d_3. For spherical distance d_1 up to three objects were presented simultaneously.

First, we tried the Euclidean distance function d_2. Judging from our experience, the network was able to learn only if the stored vectors were set close to zero prior to training. The noise resistance of the network equipped with the Euclidean distance was rather small. The heuristic reasoning for this finding is given in the Appendix.
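A compact end-to-end sketch of the training procedure of this section (our assumptions: inputs come from a user-supplied callable producing n x n arrays, and the winner moves straight toward the input, which is the standard competitive update the text describes):

    import numpy as np

    def train_competitive(next_input, m, n, alpha=0.05, steps=20000, seed=0):
        rng = np.random.default_rng(seed)
        # one stored vector (flattened n x n image) per neuron, small initial noise
        w = rng.uniform(0.0, 0.1, size=(m, n * n))
        for _ in range(steps):
            x = next_input().ravel()          # input drawn from the external world
            D = 1.0 - w @ x / n**2            # spherical "distance" d1 for every neuron
            k = int(np.argmin(D))             # winner: smallest "distance"
            w[k] += alpha * (x - w[k])        # move the winner toward the input
        return w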
Figure 3: Typical input. The box on the left-hand side shows an object at a given position that was transmitted to the input nodes. The middle box shows the outputs that the input nodes developed. Input nodes developed activities according to their overlaps with the inputted object. The upper and the lower boxes on the right-hand side show the outputs D_i^{(N)} of the neurons and the stored vector of the winning neuron, respectively.

The term \sum_{(p,q) \in I} (w_{pq}^{(i)})^2 of equation 6.2 in the Appendix (the indices correspond to the digitization of the two-dimensional space, and I denotes the set of zero elements of input vector x of a given extended object) is responsible for the poor performance of the Euclidean distance function. This term manifests itself in large distance values for inputs with a fair amount of noise. In this way one single neuron that has small w_{pq}^{(i)} components can always win the competition. Training, as the analysis in the Appendix shows, tends to lead to this attractor. The simplest way of eliminating that term is to modify the Euclidean distance function to d_3 of 2.4. "Distance" function d_3 has a strong resemblance to the spherical distance function d_1. Both of these functions solve the problem. There are other possible solutions to this problem, such as trying to decrease the mean value of the initial noise, or the learning rate, or to start the learning rate from 1 and change it in an appropriate fashion (Desieno 1988). Analysis shows, however, that the spherical distance works better under more demanding conditions. Single-object training results are shown for the spherical distance function in Figure 4. The results we present from now on were produced with this distance function.
[Figure 4 panels: snapshots of the filters at learning steps 0, 2000, 5000, 10000, 15000, and 20000.]
Figure 4: Training results on self-organized filter formation during training. The numbers show the training steps. Filters are formed during the first 5000 steps. At later steps the configuration undergoes minor modifications in accordance with the random object generation, but stays stable. The figure was drawn by generating pixels randomly with probabilities that correspond to the gray-scale values.

One of the results of the competition is that if one increases the size of the local, extended objects, one or more neurons may lose their receptive fields; in other words, they may have near-zero stored vectors. Neurons lose
their receptive fields by first approaching a corner of the two-dimensional region. The number of neurons having nonzero receptive fields depends on the ratio of the bounded Euclidean region and the average area of the local, extended objects. The bounded Euclidean region is shared by the neurons: it is divided into nearly nonoverlapping regions that correspond to the average object size.

In another set of training runs, when two or three (more than one) randomly positioned objects were simultaneously presented to the network, the results were very similar: filters were formed just like before; however, the rest of the filter vectors of the neurons were noisy. In other words, the filters (the receptive fields) were surrounded by a low-noise homogeneous background, showing that winning neurons learned of the presence of other objects as well and represented those as a random background. This is an attractive property of the algorithm; our strong competition forces the neurons to learn the most important correlations (that being the locality of single objects), and thus the neurons can neglect the correlations between two or more randomly positioned objects. To improve their winning chances, neurons develop a random-like background if more than one object is inputted to the network simultaneously. The background was considerably larger for the three-object case than for the two-object case. There was no noise for the single-object case. The two- and three-object filters are shown in Figure 5. In the following only the single-object case will be studied.

It is worth noting that in the general case some neurons may be sentenced to have very small, but nonzero, receptive fields. As the receptive field of these neurons never becomes exactly zero, one may hope that these neurons are only "sleeping" or "not needed at present" or "of small role," but not dead neurons. As is shown in the paper, these "small role neurons" may recover and assume an equal role if adaptivity is kept and the external world changes.

4 Topology by Hebbian Learning
It is a relatively easy task to build up the internal representation of the topology of the external world when the spatial filters are given. Let us introduce connections between neurons. These connections can represent the topology of the given metrical topological space in the following fashion: the closer the stored vectors of two neurons are in the metric of the topological space, the stronger should be the connection between the two neurons, and vice versa: if the connecting weight between two neurons is larger than zero, then the vectors of the two neurons should be close in the metric of the topological space. To form these connections one needs to note that a local, extended object may overlap with two spatial filters and may excite two neurons simultaneously. This means that the closer two spatial filters are, the
more often the corresponding neurons are excited simultaneously.

[Figure 5: filters obtained in the two- and three-object training runs.]
Figure 6: Learned one- and two-dimensional topologies. Connection thicknesses show the strengths of topology connections q_{kl}. In the left-hand side figure, objects were generated everywhere in the two-dimensional box. No line means approximately zero strength connections. In the right-hand side figure, objects were generated along three horizontal strips in the two-dimensional box with arbitrary ordinates. No line means zero strength connections.

Topology connection strengths q_{kl} are confined to the interval [0,1]. The best results were achieved when only the winning neuron could update its connections:
    \Delta q_{ij} = \beta (y_i + y_j) (S_i^{(N)} S_j^{(N)} - q_{ij})    (4.2)
where y_i is the output of the ith neuron after competition: the output is 1 for the winning neuron and 0 for the others. In this way y_i + y_j is not zero if and only if either the ith or the jth neuron was winning. Connection strengths are shown in the left-hand side of Figure 6. Connection strengths are depicted by the thicknesses of the connecting lines between neurons. The position and the size of the circles represent the position and the size of the spatial filters, respectively. A nonconnected topology was also produced by showing local, extended objects along three horizontal strips only (see the right-hand side of Fig. 6). Figure 6 shows well-developed connections between neighboring filters in both the one-dimensional and the two-dimensional topologies. It is worth noting that connections between neurons that are farther apart (i.e., connections that would represent medium-range topology properties) did not develop in this model. In the present training examples, filters are formed according to the object size, and thus the object may excite only neighboring neurons. It is reasonable to expect, however, that if the objects have a distribution of sizes, filters will develop according to the average size. Larger-than-average objects would then develop connections between nonneighboring neurons as well, if the topology allows it.
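In code, rule 4.2 (in the form reconstructed above; the symmetry of q and all names are our assumptions) touches only connections incident on the winner:

    import numpy as np

    def update_topology(q, S, k, beta=0.01):
        # q: (m, m) topology connections in [0, 1]; S: normalized similarities
        # S_i^(N); k: index of the winning neuron. Since y_i is 1 only for the
        # winner, (y_i + y_j) is nonzero exactly when i or j equals k.
        m = q.shape[0]
        for j in range(m):
            if j != k:
                dq = beta * (S[k] * S[j] - q[k, j])
                q[k, j] += dq
                q[j, k] += dq        # keep q symmetric (our assumption)
        return q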
The neural gas model of Martinetz and Schulten (1991) could not build up the correct topology for this case, as it is not based on the neighborhood relations of the topological space provided by our local, extended objects, but on a closeness relation in the metric of the Euclidean space into which the topology is embedded. As an example, assume that the input is such that the closest neuron has the top-left receptive field. The second closest neuron is then either the middle-left or the top-middle neuron or both, according to the exact position of the object. That is, the neural gas model would develop connections between the top-left and the middle-left neurons too, and the connection structure would become a two-dimensional grid. The neural gas algorithm in its present form is capable of representing the topology of only those worlds in which the closeness relation belonging to the topology and the closeness relation belonging to the Euclidean distance are identical. Slight modifications (modification of inputs and modification of the distance function) can make the neural gas model work for all cases too.

It has been shown for the case of single-object training that a lateral weight q_{ij} is nonzero if and only if the presented local object series has an infinite subseries such that the objects of this subseries overlap with the outer inverse image of both the w_i and the w_j vectors (Szepesvári 1992). Since the presented objects are local objects, one may conclude that the sets represented by the w_i and w_j vectors are locally connected (i.e., the nonzero lateral weights represent the topology). The necessary and sufficient condition of the proof is that both the digitization of the topological space and the distance function of the neural network should satisfy a separability condition (Szepesvári 1992). The separability condition generalizes the view (naive in mathematical terms) that the filter response should be zero if and only if the filter does not overlap with the input. This is the very point where the Euclidean distance fails. Figure 7 shows the connection strengths as a function of "distance" d_3 of the spatial filters.

5 Topographical Map and Kohonen Training
The Hebbian connections allow us to utilize Kohonen type neighbor training (i.e., to introduce a cooperative learning scheme and to speed up the adaptivity of the network in a changing environment). The closeness or connection strengths of the neurons in the Kohonen map are predefined. Here we develop connection strengths in a dynamic fashion, and that leads to a new problem when introducing Kohonen type neighbor training: cooperative training may win over filter-forming competition. The original Kohonen neighbor training may be written as

    \Delta w_i = \alpha H(q_{ik}) (x - w_i)    (5.1)
Figure 7: Monotonicity of topological connection strengths. Strength of topology connection q_{kl} as a function of overlap of filters. The overlap of the kth and lth filters is defined as \sum_{i,j} w_{ij}^{(k)} w_{ij}^{(l)}.
In equation 5.1, index k denotes the winning neuron, and the connection strengths may be considered as predefined time-dependent functions. Their initial values determine an inherited closeness between neurons. The said closeness is a slowly decreasing function of time, whereas function H is a strictly decreasing monotonic function of distance; in other words, it is a strictly increasing monotonic function of connection strength q_{ik} with H(0) = 0 and H(1) = 1. If one tries to establish an adaptive cooperative neighbor training, then first the inherited rule of closeness should be replaced by rule 4.1 in order to determine the closeness relations. However, this simple replacement and the usual function H lead to the loss of competition between neurons: the receptive fields of different neurons grossly overlap and become identical asymptotically. This is due to the fact that if the distance of the weights of two neurons is small, then they efficiently teach each other, and the connection strength between them further increases as they are often active simultaneously. A solution to this problem is to choose another function H~. Such a function should have the following properties: (1) it should be positive, (2) start from zero, (3) increase toward a maximum, and (4) decrease for larger arguments down to zero. Condition (4) ensures that neurons cannot share learning if they are too close to each other. The other conditions ensure that neurons far from each other cannot learn the same input. A function of the following form

    H~(x) = \xi x (e^{-\gamma x} - e^{-\gamma})

satisfies these conditions if its parameters are appropriately chosen, and competition persists. Parameters for our case were chosen to be \xi = 100, \gamma = 10. In these runs, just as in the other experiments to be discussed later, lateral connection development and the neighbor training through these connections were applied from the very beginning (i.e., the development of lateral connections and the development of feedforward connections were both on from the very beginning).

It is quite surprising that there is a large family of training rules that does not utilize the arbitrary function H~ and keeps the cooperative properties: the idea is that one may try to use the activity of the neurons in the learning rule. The point to remember here is that in the fully developed neural network we hope that "far away neurons" will have disjoint receptive fields and a given input will give rise to no activity of most of the neurons. The learning rule may now be expressed as

    \Delta w_i = \alpha (S_i^{(N)})^a (S_k^{(N)})^b (1 - q_{ik}) (x - w_i)    (5.2)
where k denotes the index of the winning neuron, and a and b are fixed positive powers. In this learning rule it is the dependence on the activities that results in no simultaneous learning for remote neurons. The factor (1 - q_{ik}) decreases cooperativity for neurons coming too close to each other. In this way dynamic balance is ensured for cooperative learning. Based on our numerical experiments, powers a and b should both be larger than 2 to keep competitivity. Integers between 2 and 4 were tried and all of them succeeded. If the limits of a and b go to infinity, the neighbor training of 5.2 disappears and one is left with a simple competitive network. It is then expected that the training rule 5.2 is stable for a and b values both larger than two. This family of learning rules seems appropriate as a means of setting up adaptive cooperative Kohonen type neighbor training. The advantages of such training are dealt with below.
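A quick numerical check of the neighborhood function H~ introduced above (its closed form is our reconstruction of a garbled expression; \xi = 100 and \gamma = 10 are the quoted parameters) confirms properties (1)-(4):

    import numpy as np

    def H_tilde(x, xi=100.0, gamma=10.0):
        # Zero at x = 0 and x = 1, positive in between, with a single maximum.
        return xi * x * (np.exp(-gamma * x) - np.exp(-gamma))

    x = np.linspace(0.0, 1.0, 101)
    h = H_tilde(x)
    assert abs(h[0]) < 1e-12 and abs(h[-1]) < 1e-9   # starts and ends at zero
    assert (h[1:-1] > 0).all()                       # positive in between
    peak = int(np.argmax(h))                         # rises to a maximum ...
    assert (np.diff(h[:peak + 1]) > 0).all()
    assert (np.diff(h[peak:]) < 0).all()             # ... then falls back to zero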
It may be expected that cooperative neighbor training helps adaptivity. Since in our model the distinct learning rules may be compared in a relatively unambiguous fashion, we tried separate runs so that we could compare the adaptivity of a competitive network and the network that utilized neighbor training 5.2, by applying a sudden change in the average object size. Networks respond to the change by changing filter sizes and creating or destroying filters. The time evolutions of filter sizes are shown in Figure 8. In the numerical experiment, object size was decreased to one-half of its original size. The competitive network responded with a sudden decrease in the size of active filters and developed a new filter much later.

Figure 8: Adaptivity of ANNs. After a sudden change in the external world or, here, the average object size, networks try to adapt. Adaptation means a change of filter size and creation or death of filters. The graphs show the evolution of filter sizes for a competitive network and for the network that developed topology connections and Kohonen type neighbor training. The graphs depict points from 100,000 learning steps prior to and 100,000 learning steps after the sudden change. Step number zero is the time of the sudden change. The region of the first 20,000 steps after the change is enlarged and enclosed with dashed lines. Size is defined as \sum_{i,j} (w_{ij}^{(k)})^2. [Panels: Size versus Learning Steps (in thousands), from -100 to 100.]
The network that utilized neighbor training did not allow the activity of any of its neurons to decrease to very low levels, and both the decrease in the size of the active filters and the increase of the small-activity filters took place at a high rate. This was followed by a slow decrease of the activity of one neuron, the only one that could not play a role in the new situation and was sentenced to remain silent. The comparison clearly shows that adaptivity increases with cooperative learning.

6 Conclusions
Competitive neural networks having local, extended objects as inputs can be used to form spatial filters, are able to discover the topology of the external world, and offer a means of designing neighbor training, which significantly improves adaptivity. The use of local, extended objects helps in reducing the necessary a priori information about the external world built into self-organizing neural networks.

Appendix: Problems with the Robustness of the Euclidean Distance Function

The activity of the ith neuron may be expressed as

    D_i = \sum_{p,q} (x_{pq} - w_{pq}^{(i)})^2    (6.1)

where the indices correspond to the digitization of the two-dimensional space. Let us denote the set of zero elements of the input vector x of a given extended object by I. The number of elements of set I is denoted by |I|. The sum of equation 6.1 may be divided into two parts:

    D_i = \sum_{(p,q) \notin I} (x_{pq} - w_{pq}^{(i)})^2 + \sum_{(p,q) \in I} (w_{pq}^{(i)})^2    (6.2)
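The split in equation 6.2 is easy to see numerically (a toy sketch of ours; sizes and distributions are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 16
    x = np.zeros(n * n)
    x[:20] = rng.uniform(0.5, 1.0, 20)            # a small "object": few nonzero pixels
    w = rng.uniform(0.0, 0.2, n * n)              # a noisy stored vector

    I = (x == 0.0)                                # zero elements of the input
    support_term = ((x[~I] - w[~I]) ** 2).sum()   # first sum of equation 6.2
    noise_term = (w[I] ** 2).sum()                # second sum, over the set I
    assert np.isclose(support_term + noise_term, ((x - w) ** 2).sum())

For small objects |I| is close to n^2, so the noise term, which has nothing to do with the object, dominates the activity.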
Let us set the components of the initial stored vectors around \bar{w}(0) with a small noise content. Without loss of generality one may assume that the first neuron wins for the first presented input vector x. Let us assume as well that |I| is typically large (i.e., the extended objects are small). Now, we may approximate the average updating as

    \bar{w}^{(1)}(1) = (1 - \alpha) \bar{w}(0)  and  \bar{w}^{(i)}(1) = \bar{w}(0), if i \neq 1    (6.3)

where the bar denotes averaging over the components of the stored vector of a neuron. Let us examine the case that a neuron has won and a new randomly positioned object is shown to the neural network. We are interested in the probability that the same neuron shall win. To this end
let us give upper and lower estimates for the activities of the previously winning and the other neurons, respectively, in the new presentation:

    D_1 \leq (n^2 - |I|) + |I| (1 - \alpha)^2 \bar{w}(0),   D_i \geq |I| \bar{w}(0), i \neq 1    (6.4)

bearing in mind that the weights always fall into the [0,1] interval. If

    D_1 < D_i,   i \neq 1    (6.5)
then in the next training step it is the first neuron that wins again. This inequality, however, is easily fulfilled. The inequalities 6.4 and 6.5 lead to
    |I| / n^2 > 1 / [1 + (1 - (1 - \alpha)^2) \bar{w}(0)]    (6.6)

The larger \bar{w}(0) and \alpha, the easier it is to fulfil this condition. Let us assume that the first neuron won t times in a row. If the inequality

    |I| / n^2 > 1 / [1 + (1 - (1 - \alpha)^{2t}) \bar{w}(0)]    (6.7)

is fulfilled, then it wins again. Inequality 6.7 shows that the first neuron's chance of winning keeps growing. Taking the limit of t -> \infty we have

    |I| / n^2 > 1 / (1 + \bar{w}(0))    (6.8)
and this expression is independent of \alpha. This gives rise to an upper limit for |I|/n^2. Above that limit (i.e., for small objects), the Euclidean distance function cannot solve the problem or, at least, one may say that the probability of having only one winning neuron is larger than zero: according to our experience it is close to 1. Having more than one neuron, however, does not mean that more than one spatial filter will be formed. The question is how to form separate filters. As shown in the paper, the "spherical distance" is an appropriate solution to this problem.
Acknowledgments

Acknowledgment is made to the referees for their constructive criticism.
References

Carpenter, G. A., and Grossberg, S. 1987. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comp. Vision, Graphics Image Proc. 37, 54-115.
Desieno, D. 1988. Adding conscience to competitive learning. Proc. Int. Conf. Neural Networks, pp. 117-124. IEEE Press, New York.
Földiák, P. 1990. Forming sparse representation by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Fritzke, B. 1991. Let it grow: self-organizing feature maps with problem dependent cell structure. In Artificial Neural Networks, T. Kohonen, M. Makisara, O. Simula, and J. Kangas, eds., Vol. 1, pp. 403-408. Elsevier Science Publishers B.V., Amsterdam.
Grossberg, S. 1976. Adaptive pattern classification and universal recoding. I & II. Biol. Cybern. 23, 121-134.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Martinetz, T., and Schulten, K. 1991. A "neural-gas" network learns topologies. In Artificial Neural Networks, T. Kohonen, M. Makisara, O. Simula, and J. Kangas, eds., Vol. 1, pp. 397-402. Elsevier Science Publishers B.V., Amsterdam.
Szepesvári, Cs. 1992. Competitive neural networks with lateral connections: Collective learning and topology representation. TDK Thesis (in Hungarian).

Received October 19, 1992; accepted June 21, 1993.
Communicated by Françoise Fogelman-Soulié
Dynamics of Discrete Time, Continuous State Hopfield Networks

Pascal Koiran
Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France

The dynamics of discrete time, continuous state Hopfield networks is driven by an energy function. In this paper, we use this tool to prove under mild hypotheses that any trajectory converges to a fixed point for the sequential iteration, and to a cycle of length 2 or a fixed point for the parallel iteration. Perhaps surprisingly, it seems that no rigorous proof of these results was published before.

1 Introduction
A fundamental property of discrete time, discrete state Hopfield networks is that their dynamics is driven by an energy function (Hopfield 1982). This allows the length of a limit cycle to be bounded: the parallel iteration has cycles of length 1 or 2 only, and the sequential iteration has only fixed points. These results describe rather completely the asymptotic behavior of the network, since any trajectory enters a limit cycle after a transient period. Discrete time, continuous state Hopfield networks are also driven by an energy function (Marcus and Westervelt 1989; Fogelman-Soulié et al. 1989). However, a trajectory will generally not enter a cycle, so that the discrete-case argument does not apply here, and the question of the convergence to a cycle arises. In this paper, we prove under mild hypotheses that any trajectory converges to a fixed point for the sequential iteration, and to a cycle of length 2 or a fixed point for the parallel iteration. This is a very fundamental result, which seemingly has not been established previously. Papers showing the existence of a Lyapunov function (Marcus and Westervelt 1989; Fogelman-Soulié et al. 1989) prove that this function converges, but do not prove the same thing for the state of the network; other works deal with local convergence [e.g., in the neighborhood of an attractive fixed point (Michel et al. 1990)], and their methods cannot yield a global convergence result. Our proof is based on an intermediate result stating that Hopfield networks generically have a finite number of fixed points. Hence the requirements in Michel et al. (1990) and other papers that the fixed points
be isolated are not very restrictive. This result is also of interest in the case of continuous time Hopfield networks (Hopfield 1984), because these networks have the same fixed points as the discrete time networks we consider. It implies the same convergence property as in discrete time networks (Hirsch 1989). The generic finiteness of the set of fixed points is stated as an open problem in Hirsch (1989). This paper is nearly self-contained since we give proof sketches for most of the results we need; we refer the reader to Fogelman-Soulié et al. (1989) or Goles and Martinez (1990) for more details. Our results are theoretical in nature, but are related to two questions of practical interest:

- What is the maximum number of fixed points (or length-2 cycles) for a network of a given size?
- What is the speed of convergence to a fixed point (or a length-2 cycle)?
If the network is used for instance as an associative memory, the number of fixed points can be interpreted as the number of items that can be stored. The convergence speed gives an estimate of the time it takes to retrieve an item from a corrupted version. These questions have been studied in depth in the discrete case, but hardly anything is known for continuous networks. This is not too surprising since the simpler problems considered in this paper were not addressed successfully before. One can hope that refinements of the methods presented here will yield explicit bounds of practical interest.

2 Preliminaries
We consider a network of n interconnected neurons, whose states x_1, ..., x_n belong to [-1,1]. The transition function of neuron i is x_i -> f(A_i). A_i is the activation of neuron i, defined by:

    A_i = \sum_{j=1}^{n} w_{ij} x_j - b_i
b_i is the threshold of neuron i, and w_{ij} is the weight of the connection between neurons i and j. b = (b_i)_{1 \leq i \leq n} is the vector of thresholds. The matrix of weights W = (w_{ij}) is assumed to be symmetric, with a nonnegative diagonal. f is continuous, strictly increasing on an interval [\alpha, \beta] (\alpha < \beta and, possibly, \alpha = -\infty or \beta = +\infty), and constant outside: \forall x \leq \alpha, f(x) = -1; \forall x \geq \beta, f(x) = 1. It seems that Fogelman-Soulié et al. (1989) assumes that \alpha and \beta are finite, but this is in fact not necessary. If \alpha = -\infty or \beta = +\infty, we ask that \lim_{x \to \pm\infty} f(x) = \pm 1.
The hypotheses listed up to this point will be assumed throughout the paper, and will not be repeated. We will sometimes assume that f is piecewise C^1. This means that there is an increasing sequence c_1, ..., c_p such that f is C^1 on the intervals I_0 = ]-\infty, c_1], I_1 = [c_1, c_2], ..., I_p = [c_p, +\infty[. In the parallel iteration mode, all neurons change state simultaneously: for t \in N and 1 \leq i \leq n,

    x_i(t + 1) = f(A_i(t))    (2.1)
P denotes the function associated to this iteration mode: x(t+1) = P(x(t)). In the sequential iteration mode, the neurons are updated in increasing order:

    x_i(t + i/n) = f(A_i(t + (i-1)/n))    (2.2)
S denotes the function associated to this iteration mode: x(t+1) = S(x(t)). Hence all neurons are updated during one unit of time, like in the parallel case. Assuming a specific update order for sequential iterations is in fact not necessary. It is sufficient to update each neuron an infinite number of times. Note that these iteration modes have the same fixed points. Let F be the function associated to a given iteration mode (F = P or F = S). A cycle of length T is a sequence (y^0, ..., y^{T-1}) of distinct states such that F(y^i) = y^{(i+1) mod T}. It is naturally identified to the shifted cycle (y^1, ..., y^{T-1}, y^0). We say that a sequence (x(t)) of iterates converges to this cycle if for any i such that 0 \leq i \leq T-1, \lim_{t \to +\infty} x(Tt + i) = y^i. The ball of center a and radius \epsilon is B(a, \epsilon) = {x \in R^n, \|x - a\| < \epsilon}, where \|.\| denotes the Euclidean norm on R^n.
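A minimal sketch of the two iteration modes (our own illustration, assuming f = tanh, which is admissible here with \alpha = -\infty and \beta = +\infty, and using the activation A_i = \sum_j w_ij x_j - b_i as reconstructed above):

    import numpy as np

    def f(x):
        return np.tanh(x)           # continuous, strictly increasing, limits +/- 1

    def parallel_step(x, W, b):
        # One parallel iteration (2.1): all neurons change state simultaneously.
        return f(W @ x - b)

    def sequential_step(x, W, b):
        # One sequential sweep (2.2): neurons updated one by one, in increasing order.
        x = x.copy()
        for i in range(len(x)):
            x[i] = f(W[i] @ x - b[i])
        return x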
3 Sequential Iterations
The existence of a Lyapunov function for the sequential iteration and its consequence on the length of cycles are stated in Theorem 1 and Corollary 1. The first result can be found in Fogelman-Soulié et al. (1989). We outline its proof, because it will be useful for establishing Theorem 2, which is with Theorem 4 the main result of this section.
Theorem 1. Let E be defined by

    E(x) = -x^T W x / 2 + b^T x + \sum_{j=1}^{n} \int_0^{x_j} f^{-1}(\xi) d\xi

E is a Lyapunov function of the sequential iteration (2.2), i.e., if x(t + 1/n) \neq x(t), E(x(t + 1/n)) < E(x(t)).
Proof. Let us first assume that some integrals \int_0^{x_j} f^{-1}(\xi) d\xi diverge. This implies that \beta = +\infty or \alpha = -\infty, and x_j(t) = \pm 1. This is possible only
for t = 0: if \beta = +\infty, 1 \notin f(R), hence \forall i, \forall t > 0, x_i(t) \neq 1 (the same is true if \alpha = -\infty). It follows that E(x(1)) < E(x(0)) = +\infty. Let us now assume that the integrals converge. If x_k is updated at time t, the energy variation \Delta E = E(x(t + 1/n)) - E(x(t)) is [see Fogelman-Soulié et al. (1989) for details]
    \Delta E = -w_{kk} (x_k(t + 1/n) - x_k(t))^2 / 2 + \int_{x_k(t)}^{x_k(t+1/n)} [f^{-1}(\xi) - A_k(t)] d\xi    (3.1)
The first term is clearly negative.

- If A_k(t-1) and A_k(t) are both smaller than \alpha or both greater than \beta, x_k(t-1+1/n) = x_k(t+1/n). But x_k has not been updated between time t-1+1/n and time t. Hence x_k(t+1/n) = x_k(t).
- If the previous condition is not true, either A_k(t) = A_k(t-1) and x_k(t+1/n) = x_k(t) again, or A_k(t) \neq A_k(t-1). In the latter case, x_k(t+1/n) \neq x_k(t) and E(x(t+1/n)) < E(x(t)) since f is strictly increasing on [\alpha, \beta].
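Theorem 1 is easy to check numerically. For f = tanh, f^{-1} = artanh and the integral in E has the closed form x artanh(x) + (1/2) ln(1 - x^2); the random test network below is our own construction, not from the paper:

    import numpy as np

    def f(x):
        return np.tanh(x)

    def sequential_step(x, W, b):
        x = x.copy()
        for i in range(len(x)):
            x[i] = f(W[i] @ x - b[i])
        return x

    def energy(x, W, b):
        xc = np.clip(x, -1 + 1e-12, 1 - 1e-12)   # guard against arctanh(+-1)
        integral = (xc * np.arctanh(xc) + 0.5 * np.log1p(-xc * xc)).sum()
        return -0.5 * x @ W @ x + b @ x + integral

    rng = np.random.default_rng(2)
    n = 8
    A = rng.normal(size=(n, n))
    W = (A + A.T) / 2                            # symmetric ...
    np.fill_diagonal(W, np.abs(np.diag(W)))      # ... with a nonnegative diagonal
    b = rng.normal(size=n)
    x = rng.uniform(-0.9, 0.9, n)

    E_prev = energy(x, W, b)
    for _ in range(50):
        x = sequential_step(x, W, b)
        E_now = energy(x, W, b)
        assert E_now <= E_prev + 1e-12           # E never increases (Theorem 1)
        E_prev = E_now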
Corollary 1. Any cycle of the sequential iteration is a fixed point.

This is a standard consequence of the existence of a Lyapunov function; the proof will be omitted. In order to establish our main result, we need the following hypothesis:
(H) The network has a finite number of fixed points.

Theorem 2. Under the additional hypothesis (H), the sequential iteration converges to a fixed point from any starting point x^0 \in [-1,1]^n.

Proof. Since E is a nonincreasing function, we can define E_min = \lim_{t \to +\infty} E(x(t)). Let K be the set of the accumulation points of (x(t))_{t \in N}, and let y^0 \in K, y^1 = S(y^0). Since E(x) is a continuous function of x, E(y^0) = E_min. y^1 also belongs to K, hence E(y^1) = E_min. According to Theorem 1, y^0 = y^1: the elements of K are fixed points. Thus K is finite according to hypothesis (H). If K = {x^1, ..., x^k}, the following property holds:

    \forall \epsilon > 0, \exists N, \forall t \geq N, x(t) \in \bigcup_{i=1}^{k} B(x^i, \epsilon)
Assume instead that an infinite sequence of terms x(t) does not belong to this union. This sequence has at least one accumulation point, which cannot belong to {x^1, ..., x^k}. This is a contradiction.
Assume that k > 1, and choose \epsilon so that the sets B(x^i, \epsilon) are disjoint. Let t \geq N be such that x(t) \in B(x^1, \epsilon). There is a t' \geq t such that x(t') \in B(x^1, \epsilon) and x(t'+1) \notin B(x^1, \epsilon) (otherwise, K = {x^1}). Since t'+1 \geq N, x(t'+1) \in B(x^i, \epsilon) for some i \neq 1. This is impossible for \epsilon small enough, because S is continuous and x^1 is a fixed point of S. Hence k = 1, and \lim_{t \to +\infty} x(t) = x^1.

This theorem is in fact a general result on dynamic systems driven by a Lyapunov function: the specific form of the iterated function or of the energy function is not important. We now give a second proof of Theorem 2, assuming that f is piecewise C^1. The first one is simpler and more general; however, the second one is not completely useless, because of the following lemma, which is of independent interest.
Lemma 1. Let M = \|f'\|_\infty, with f piecewise C^1. If x_k is updated at time t,

    |\Delta x| \leq 2 \sqrt{M |\Delta E|}

with \Delta x = |x(t + 1/n) - x(t)| and \Delta E = |E(x(t + 1/n)) - E(x(t))|.
This result can be read as "\Delta E small => \Delta x small," which is a generalization of the well-known property "\Delta E = 0 => \Delta x = 0."
Proof of Lemma 1. Since w_{kk} \geq 0, a partial integration in 3.1 yields (set u' = 1 and v = f(A_k(t-1)) - f(\xi) in the integral):

    |\Delta E| \geq | \int_{A_k(t-1)}^{A_k(t)} [f(\xi) - f(A_k(t-1))] d\xi |

Assume for instance that A_k(t) > A_k(t-1). Let \epsilon \in (0, A_k(t) - A_k(t-1)]:

    |\Delta E| \geq \epsilon [f(A_k(t)) - f(A_k(t-1)) - M\epsilon] = \epsilon [x_k(t + 1/n) - x_k(t) - M\epsilon]
hence 0 \leq x_k(t + 1/n) - x_k(t) \leq M\epsilon + |\Delta E| / \epsilon. As a function of \epsilon, the right side is greater than 2\sqrt{M |\Delta E|} (value obtained for \epsilon = \sqrt{|\Delta E| / M}). On the one hand, if \sqrt{|\Delta E| / M} \leq A_k(t) - A_k(t-1), then this value of \epsilon is admissible and x_k(t + 1/n) - x_k(t) \leq 2\sqrt{M |\Delta E|}. On the other hand, if \sqrt{|\Delta E| / M} > A_k(t) - A_k(t-1), then x_k(t + 1/n) - x_k(t) = f(A_k(t)) - f(A_k(t-1)) \leq M (A_k(t) - A_k(t-1)) < \sqrt{M |\Delta E|}. In both cases, |x_k(t + 1/n) - x_k(t)| \leq 2\sqrt{M |\Delta E|}. This result still holds if A_k(t) < A_k(t-1), and of course if A_k(t) = A_k(t-1).
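The bound of Lemma 1 can also be tested directly; with f = tanh we have M = sup |f'| = 1 (the test setup is, again, our own):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 6
    A = rng.normal(size=(n, n))
    W = (A + A.T) / 2
    np.fill_diagonal(W, np.abs(np.diag(W)))      # w_kk >= 0, as the proof requires
    b = rng.normal(size=n)
    M = 1.0                                      # sup |f'| for f = tanh

    def energy(x):
        xc = np.clip(x, -1 + 1e-12, 1 - 1e-12)
        return (-0.5 * x @ W @ x + b @ x
                + (xc * np.arctanh(xc) + 0.5 * np.log1p(-xc * xc)).sum())

    x = rng.uniform(-0.9, 0.9, n)
    for t in range(200):
        k = t % n                                # sequential: one neuron at a time
        x_new = x.copy()
        x_new[k] = np.tanh(W[k] @ x - b[k])
        dx = abs(x_new[k] - x[k])
        dE = abs(energy(x_new) - energy(x))
        assert dx <= 2.0 * np.sqrt(M * dE) + 1e-8    # Lemma 1
        x = x_new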
Proof of Theorem 2. E_min > -\infty, because inf{E(x), x \in [-1,1]^n} > -\infty. This is due to the fact that:

- -x^T W x / 2 + b^T x is bounded on [-1,1]^n.
- either f^{-1} is bounded in a neighborhood of 1, or \lim_{x \to 1} f^{-1}(x) = +\infty.
- either f^{-1} is bounded in a neighborhood of -1, or \lim_{x \to -1} f^{-1}(x) = -\infty.
Consequently, \lim_{t \to +\infty} |E(x(t+1)) - E(x(t))| = 0, hence \lim_{t \to +\infty} \|x(t+1) - x(t)\| = 0 according to Lemma 1. It follows that K is connected. K is finite (see first proof) and connected, therefore it has a single element.

Hypothesis (H) seems to be highly natural, and we are indeed going to prove that for "almost every" network, the set of fixed points is finite. We will make use of an elementary version of the parametric transversality theorem (Hirsch 1976):
Theorem 3. Let U and V be two open sets of R^q and R^n, respectively. Let F: U x V -> R^n be a C^1 function. If 0 is a regular value of F, then it is a regular value of F(\lambda, .) for \lambda in an open dense set.

Recall that 0 is a regular value of F if for any y \in U x V such that F(y) = 0, the Jacobian matrix DF(y) is of rank n. If 0 is a regular value of F(\lambda, .), all zeros of F(\lambda, .) are isolated.
Theorem 4. When f is piecewise C^1, the network has a finite number of fixed points for (W, b) in an open dense set.

Proof. Showing that the fixed points are isolated is sufficient, because they all belong to the compact [-1,1]^n. Assume first that f is C^1 on R. Theorem 3 is applied to F(\lambda, x) = P_\lambda(x) - x, where P_\lambda is the parallel iteration of network \lambda. We take V = R^n and U = R^q, with q = n + n(n+1)/2 ("thresholds plus weights"). We are in fact going to prove that for any (\lambda, x), DF(\lambda, x) is of rank n, i.e.,
all values of F are regular (not only 0). For a fixed (\lambda, x), let us find a regular n x n submatrix of DF(\lambda, x). F_i being the ith component of F,

    \partial F_i / \partial x_j = f'(A_i) w_{ij} - \delta_{ij},   \partial F_i / \partial b_j = -f'(A_i) \delta_{ij}

Without loss of generality, we may assume that f'(A_i) = 0 for 1 \leq i \leq r, for a given r (possibly, r = 0). The submatrix composed of the columns (\partial F_i / \partial x_j)_{1 \leq i \leq n} for 1 \leq j \leq r and (\partial F_i / \partial b_j)_{1 \leq i \leq n} for r+1 \leq j \leq n is lower triangular, and its diagonal elements are nonzero. It is thus regular. Note that this argument remains valid if each neuron i uses a different output function f_i. This remark will be helpful at the end of the proof.

In the general piecewise C^1 case, we divide the space of activation vectors A = (A_i)_{1 \leq i \leq n} in (p+1)^n "boxes." Each box is a product B = \prod_{i=1}^{n} I_{k_i} of intervals I_{k_i} on which f is C^1. A_\lambda(x) is the value of the activation vector A of network \lambda in state x. For a given box B and \lambda in an open dense set O_B, any element of X_B = {x, x = P_\lambda(x) and A_\lambda(x) \in B} is isolated in X_B. Indeed, let (f_i)_{1 \leq i \leq n} be a sequence of functions of class C^1 on R such that f_i(x) = f(x) for x \in I_{k_i}. An element of X_B is a fixed point of the network using the output functions f_i instead of f. According to our intermediate result, these fixed points are isolated for \lambda in an open dense set. It follows that for \lambda in the open dense set O = \bigcap_B O_B, any fixed point in a box B is isolated in X_B. In order to prove that it is in fact isolated in the set of all fixed points X = \bigcup_B X_B, the following remark is sufficient: If a fixed point is not isolated, X is infinite. Hence there is a box B such that X_B is infinite. We have just proved that this is impossible for \lambda \in O.

4 Parallel Iterations
The existence of a Lyapunov function for the parallel iteration and its consequence on the length of cycles are stated in Theorem 5 and Corollary 2. These results can also be found in Theorems 1 and 2 of Fogelman-Soulié et al. (1989). We outline the proof of Theorem 5 because it will be useful for establishing Theorem 6, which is the main result of this section. Note that we do not use the same Lyapunov function as Theorem 1 of Fogelman-Soulié et al. (1989). However, the two functions are equal up to a constant, according to Corollary 1 of the same paper.
Theorem 5. Let V be defined by

    V(x, y) = -x^T W y + b^T (x + y) + \sum_{j=1}^{n} \int_0^{x_j} f^{-1}(\xi) d\xi + \sum_{j=1}^{n} \int_0^{y_j} f^{-1}(\xi) d\xi

E(x) = V(x, P(x)) is a Lyapunov function of the parallel iteration (2.1), i.e., if x(t+2) \neq x(t), then E(x(t+1)) < E(x(t)).

Proof. One of the integrals may be divergent only at t = 0, as for sequential iterations. In this case, E(x(1)) < E(x(0)) = +\infty. Let us now assume that all integrals are convergent. At time t, the energy variation \Delta E = E(x(t+1)) - E(x(t)) is (see Fogelman-Soulié et al. 1989 for details):

    \Delta E = \sum_{i=1}^{n} \int_{x_i(t)}^{x_i(t+2)} [f^{-1}(\xi) - A_i(t+1)] d\xi    (4.1)
- If i is such that A_i(t-1) and A_i(t+1) are both smaller than \alpha or both greater than \beta, x_i(t+2) = x_i(t).
- If i is such that the previous condition is not true, either A_i(t+1) = A_i(t-1) and x_i(t+2) = x_i(t) again, or A_i(t+1) \neq A_i(t-1). In the latter case, x_i(t+2) \neq x_i(t) and E(x(t+1)) < E(x(t)) since f is increasing on [\alpha, \beta].
Corollary 2. Any cycle of the parallel iteration is of length 1 or 2.

This is a standard consequence of the existence of a Lyapunov function; the proof will be omitted. We now proceed to the parallel counterpart of Theorem 2. We need the following hypothesis:

(H') The network has a finite number of cycles.

Theorem 6. Under the additional hypothesis (H'), the parallel iteration converges to a cycle of length 1 or 2 from any starting point x^0 \in [-1,1]^n.

Sketch of Proof. The proof is exactly the same as in the sequential case, with S replaced by P \circ P and K replaced by the set K_0 of the accumulation points of (x(2t))_{t \in N}. The counterpart of Lemma 1 is as follows:

Lemma 2. Let M = \|f'\|_\infty, with f piecewise C^1. |\Delta x| \leq 2 \sqrt{M |\Delta E|}, with \Delta x = |x(t+2) - x(t)| and \Delta E = |E(x(t+1)) - E(x(t))|.
Sketch of Proof. A partial integration in 4.1 yields:

    |\Delta E| = \sum_{i=1}^{n} | \int_{A_i(t-1)}^{A_i(t+1)} [f(\xi) - f(A_i(t-1))] d\xi |

In the same way as for Lemma 1, one can prove that |x_i(t+2) - x_i(t)| \leq 2 \sqrt{M |\Delta E|} for every i, whence the result.
It follows that K_0 is connected; as in the sequential case, this gives a second proof of the convergence theorem.
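Theorem 6 is easy to observe in simulation: with a symmetric weight matrix the parallel iteration settles, after a transient, into a state that is (numerically) periodic with period 1 or 2. The toy setup below is our own; convergence speed is not guaranteed, hence the generous iteration count:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 10
    A = rng.normal(size=(n, n))
    W = 2.0 * (A + A.T)                  # symmetric; large gains favor 2-cycles
    np.fill_diagonal(W, 0.0)             # zero diagonal is nonnegative
    b = rng.normal(size=n)
    x = rng.uniform(-0.5, 0.5, n)

    for _ in range(20000):
        x = np.tanh(W @ x - b)           # parallel iteration (2.1)

    x1 = np.tanh(W @ x - b)
    x2 = np.tanh(W @ x1 - b)
    print(np.abs(x - x2).max())          # ~0: a fixed point or a length-2 cycle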
5 Five Open Problems

A number of interesting questions related to the results of this paper deserve more study. We propose below five problems that are still unresolved to the best of our knowledge. As mentioned in the introduction, problems 3 and 5 have the greatest practical significance.

1. Prove that the number of length-2 cycles for the parallel iteration is finite for almost every network. We were able to solve this problem for fixed points (in Theorem 4) because the Jacobian matrix has a very special structure, which it no longer has when length-2 cycles are considered.
2. Give a condition on f ensuring that every network has a finite number of fixed points. This property does not hold for any f. For example, the identity function of [-1,1]^n can be realized by a network using an output function f such that f(x) = x for -1 < x < 1, f(x) = 1 for x \geq 1, and f(x) = -1 for x < -1. In Theorem 4 we showed only that the finiteness property holds for almost every network.

3. Give an upper bound on the number of fixed points or length-2 cycles (when it is finite). In view of possible applications to associative memories, it would also be interesting to estimate the size of the basins of attraction.
4. Do the iterations still converge for a (hypothetical) network having an infinite number of cycles? In Theorems 2 and 6, convergence was proved for networks having finitely many cycles.
5. Study the speed of convergence to a limit cycle. For discrete state networks, this amounts to bounding the transient length (Fogelman-Soulié et al. 1989); for continuous state networks, this has been done by Michel et al. (1990) in the neighborhood of an attractive fixed point, but no global result is known.
References

Fogelman-Soulié, F., Mejia, C., Goles, E., and Martinez, S. 1989. Energy functions in neural networks with continuous local functions. Complex Syst. 3, 269-293.
Goles, E., and Martinez, S. 1990. Neural and Automata Networks: Dynamical Behavior and Applications. Mathematics and Its Applications. Kluwer Academic Publishers, Boston.
Hirsch, M. W. 1976. Differential Topology. Springer-Verlag, Berlin.
Hirsch, M. W. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2, 331-349.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Marcus, C. M., and Westervelt, R. M. 1989. Dynamics of iterated-map neural networks. Phys. Rev. A 40(1), 501-504.
Michel, A. N., Farrell, J. A., and Sun, H.-F. 1990. Analysis and synthesis techniques for Hopfield type synchronous discrete time neural networks with application to associative memory. IEEE Transact. Circuits Syst. 37(11), 1356-1366.
Received November 30, 1992; accepted July 12, 1993
Communicated by Andrew Barto
Alopex: A Correlation-Based Learning Algorithm for Feedforward and Recurrent Neural Networks

K. P. Unnikrishnan*
Computer Science Department, GM Research Laboratories, Warren, MI 48090 USA
K. P. Venugopal
Medical Image Processing Group, University of Pennsylvania, Philadelphia, PA 19104 USA

We present a learning algorithm for neural networks, called Alopex. Instead of error gradient, Alopex uses local correlations between changes in individual weights and changes in the global error measure. The algorithm does not make any assumptions about transfer functions of individual neurons, and does not explicitly depend on the functional form of the error measure. Hence, it can be used in networks with arbitrary transfer functions and for minimizing a large class of error measures. The learning algorithm is the same for feedforward and recurrent networks. All the weights in a network are updated simultaneously, using only local computations. This allows complete parallelization of the algorithm. The algorithm is stochastic and it uses a "temperature" parameter in a manner similar to that in simulated annealing. A heuristic "annealing schedule" is presented that is effective in finding global minima of error surfaces. In this paper, we report extensive simulation studies illustrating these advantages and show that learning times are comparable to those for standard gradient descent methods. Feedforward networks trained with Alopex are used to solve the MONK's problems and symmetry problems. Recurrent networks trained with the same algorithm are used for solving temporal XOR problems. Scaling properties of the algorithm are demonstrated using encoder problems of different sizes and advantages of appropriate error measures are illustrated using a variety of problems.

1 Introduction
Artificial neural networks are very useful because they can represent complex classification functions and can discover these representations using powerful learning algorithms. Multilayer perceptrons using sigmoidal nonlinearities at their computing nodes can represent large classes

*Also with the Artificial Intelligence Laboratory, University of Michigan, Ann Arbor, MI 48109 USA
of functions (Hornik et al. 1989). In general, an optimum set of weights in these networks is learned by minimizing an error functional. But many of these functions (that give error as a function of weights) contain local minima, making the task of learning in these networks difficult (Hinton 1989). This problem can be mitigated by (1) choosing appropriate transfer functions at individual neurons and an appropriate error functional for minimization and (2) using powerful learning algorithms.

Learning algorithms for neural networks can be categorized into two classes.¹ The popular backpropagation (BP) and other related algorithms calculate explicit gradients of the error with respect to the weights. These require detailed knowledge of the network architecture and involve calculating derivatives of transfer functions. This limits the original version of BP (Rumelhart et al. 1986) to feedforward networks with neurons containing smooth, differentiable, and nonsaturating transfer functions. Some variations of this algorithm (Williams and Zipser 1989, for example) have been used in networks with feedback; but these algorithms need nonlocal information and are computationally expensive.

A general purpose learning algorithm, without these limitations, can be very useful for neural networks. Such an algorithm, ideally, should use only locally available information, impose no restrictions on the network architecture, error measures, or transfer functions of individual neurons, and should be able to find global minima of error surfaces. It should also allow simultaneous updating of the weights and hence reduce the overhead on hardware implementations. Learning algorithms that do not require explicit gradient calculations may offer a better choice in this respect. These algorithms usually estimate the gradient of the error by local measurements. One method is to systematically change the parameters (weights) to be optimized and measure the effect of these changes (perturbations) on the error to be minimized. Parameter perturbation methods have a long history in adaptive control, where they were commonly known as the "MIT rule" (Draper and Li 1951; Whitaker 1959). Many others have recently used perturbations of single weights (Jabri and Flower 1991), multiple weights (Dembo and Kailath 1990; Alspector et al. 1993), or single neurons (Widrow and Lehr 1990). A set of closely related techniques in machine learning is Learning Automata (Narendra and Thathachar 1989) and Reinforcement Learning (Barto et al. 1981).

In this paper we present an algorithm called "Alopex"² that is in this general category. Alopex has had one of the longest histories of such methods, ever since its introduction for mapping visual receptive fields (Harth and Tzanakou 1974). It has subsequently been modified

¹Methods that are not explicitly based on gradient concepts have also been used for training layered networks (Minsky 1954; Rosenblatt 1962). These methods are limited in their performance and applicability and hence are not considered in our discussions.
²Alopex is an acronym for Algorithm for pattern extraction, and refers to the alopecic performance of the algorithm.
and used in models of visual perception (Harth and Unnikrishnan 1985; Harth et al. 1987, 1990), visual development (Nine and Unnikrishnan 1993; Unnikrishnan and Nine 1993), for solving combinatorial optimization problems (Harth et al. 1986), for pattern classification (Venugopal et al. 1991, 1992), and for control (Venugopal et al. 1994). In this paper we present a very brief description of the algorithm and show results of computer simulations where it has been used for training feedforward and recurrent networks. Detailed theoretical analysis of the algorithm and comparisons with other closely related algorithms such as reinforcement learning will appear elsewhere (Sastry and Unnikrishnan 1994).

2 The Alopex Algorithm
Here, learning in a neural network is treated as an optimization problem.³ The objective is to minimize an error measure, E, with respect to network weights w, for a given set of training samples. The algorithm can be described as follows: consider a neuron i with an interconnection strength w_{ij} from neuron j. During the nth iteration, the weight w_{ij} is updated according to the rule:⁴
    w_{ij}(n) = w_{ij}(n-1) + \delta_{ij}(n)    (2.1)
where \delta_{ij}(n) is a small positive or negative step of size \delta with the following probabilities:⁵

    \delta_{ij}(n) = -\delta with probability p_{ij}(n); +\delta with probability 1 - p_{ij}(n)    (2.2)
The probability p_{ij}(n) for a negative step is given by the Boltzmann distribution:

    p_{ij}(n) = 1 / (1 + e^{-C_{ij}(n) / T(n)})    (2.3)
where C_{ij}(n) is given by the correlation

    C_{ij}(n) = \Delta w_{ij}(n) \Delta E(n)    (2.4)
and T(n) is a positive "temperature." \Delta w_{ij}(n) and \Delta E(n) are the changes in weight w_{ij} and the error measure E over the previous two iterations:
    \Delta w_{ij}(n) = w_{ij}(n-1) - w_{ij}(n-2)    (2.5a)
    \Delta E(n) = E(n-1) - E(n-2)    (2.5b)

³Earlier versions of this have been presented at conferences (Unnikrishnan and Pandit 1991; Unnikrishnan and Venugopal 1992).
⁴For the first two iterations, weights are chosen randomly.
⁵In simulations, this is done by generating a uniform random number between 0 and 1 and comparing it with p_{ij}(n).
The "temperature" T in equation 2.3 is updated every N iterations using the following "annealing schedule":

    T(n) = (1 / NM) \sum_{ij} \sum_{n' = n-N}^{n-1} |C_{ij}(n')|   if n is a multiple of N    (2.6a)
    T(n) = T(n-1)   otherwise    (2.6b)
M in the above equation is the total number of connections. Since the magnitude of \Delta w is the same for all weights, equation (2.6a) reduces to

    T(n) = (\delta / N) \sum_{n' = n-N}^{n-1} |\Delta E(n')|    (2.6c)
2.1 Behavior of the Algorithm. Equations 2.1-2.5 can be rewritten to make the essential computations clearer:

    w_{ij}(n) = w_{ij}(n-1) + \delta \cdot x_{ij}(n-1)    (2.7)
\delta is the step size and x_{ij} is either +1 or -1 (randomly assigned for the first two iterations).

    x_{ij}(n-1) = x_{ij}(n-2) with probability p_{ij}(n); -x_{ij}(n-2) with probability 1 - p_{ij}(n)    (2.8)

where

    p_{ij}(n) = 1 / (1 + e^{\delta \Delta E(n) / T(n)})    (2.9)

From equations 2.7-2.9 we can see that if \Delta E is negative, the probability of moving each weight in the same direction is greater than 0.5. If \Delta E is positive, the probability of moving each weight in the opposite direction is greater than 0.5. In other words, the algorithm favors weight changes that will decrease the error E. The temperature T in equation 2.3 determines the stochasticity of the algorithm. With a nonzero value for T, the algorithm takes biased random walks in the weight space toward decreasing E. If T is too large, the probabilities are too close to 0.5 and the algorithm does not settle into the global minimum of E. If T is too small, it gets trapped in local minima of E. Hence the value of T for each iteration is chosen very carefully. We have successfully used the heuristic "annealing schedule" shown in equation 2.6. We start the simulations with a large T, and at regular intervals, set it equal to the average absolute value of the correlation C_{ij} over that interval. This method automatically reduces T when the correlations are small (which is likely to be near minima of error surfaces) and increases T in regions of large correlations. The correlations need to be averaged over a sufficiently large number of iterations so that the annealing does not freeze the algorithm at local minima. Toward the end, the step size \delta can also be reduced for precise convergence.
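One Alopex iteration, in the reformulated form of equations 2.7-2.9 together with the annealing schedule 2.6c, might look as follows (a sketch of ours: the error evaluation, history handling, and all names are placeholders, not the authors' implementation):

    import numpy as np

    rng = np.random.default_rng(4)

    def alopex_step(w, w_prev, E, E_prev, T, delta=0.01):
        # Correlation of equation 2.4 from the changes 2.5a and 2.5b.
        C = (w - w_prev) * (E - E_prev)
        z = np.clip(C / T, -50.0, 50.0)       # numerical guard for the exponential
        p = 1.0 / (1.0 + np.exp(-z))          # probability of a negative step (2.3)
        step = np.where(rng.random(w.shape) < p, -delta, +delta)
        return w + step                       # every weight moves by +/- delta

    def anneal(abs_dE_window, delta, N):
        # Annealing schedule (2.6c): average absolute error change over N steps.
        return delta * np.sum(abs_dE_window) / N

A full training loop would evaluate the network error E(w) on the training set once per iteration, keep the previous weights and error, and reset T every N iterations with anneal.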
The use of a controllable "temperature" and the use of probabilistic parameter updates are similar to the method of simulated annealing (Kirkpatrick et al. 1983). But Alopex differs from simulated annealing in three important aspects: (1) the correlation (\Delta E \cdot \Delta w) is used instead of the change in error \Delta E for weight updates, (2) all weight changes are accepted at every iteration, and (3) during an iteration, all weights are updated simultaneously.

2.2 "Universality" of the Algorithm. The algorithm makes no assumptions about the structure of the network, the error measure being minimized, or the transfer functions at individual nodes. If the change in the error measure is broadcast to all the connection sites, then the computations are completely local and all the weights can be updated simultaneously. The stochastic nature of the algorithm can be used to find the global minimum of the error function. The above features allow the use of Alopex as a learning algorithm in feedforward and recurrent networks, and for solving a wide variety of problems. In this paper we demonstrate some of these advantages through extensive simulation experiments. Convergence times of Alopex for solving XOR, parity, and encoder problems are shown to be comparable to those taken by backpropagation. The learning ability of Alopex is demonstrated on the MONK's problems (Thrun et al. 1991) and on the mirror symmetry problem (Peterson and Hartman 1989), which have been used extensively for benchmarking. Scaling properties of Alopex are investigated using encoder problems of different sizes. The utility of the annealing schedule for overcoming local minima of error surfaces is demonstrated while solving the XOR problem. Since Alopex allows the usage of different error measures, we show that the use of an information theoretic error measure (Hopfield 1987; Baum and Wilczek 1988; Unnikrishnan et al. 1991), instead of the customary squared error, results in smoother error surfaces and improved classifications. Finally we demonstrate its ability to train recurrent networks for solving temporal XOR problems. It should be stressed that in all these experiments, the same learning module was used for these diverse network architectures and problems.

3 Simulation Results
In this section we present results from an extensive set of simulation experiments. The algorithm has three main parameters: the initial temperature T, the step-size δ, and the number of iterations N over which the correlations are averaged for annealing. The initial temperature is usually set to a large value of about 1000. This allows the algorithm to get an estimate of the average correlation in the first N iterations and reset T to an appropriate value according to equation 2.6. Hence this parameter does not affect the simulations substantially. N is chosen empirically and usually has a value between 10 and 100. Again, this is not a very critical parameter and, for most of the runs, was not optimized. On the other hand, δ is a critical parameter and is chosen with care. We have found that a good initial value is about 0.001 to 0.01 times the dynamic range of the weights. We terminate learning when the output-neuron responses are within 0.1 of their targets for the entire training set.
Table 1: Average Number of Iterations Taken by Backpropagation and Alopex to Solve Three Different Problems.(a)

Problem                     BP      Alopex
XOR (2-2-1 network)         1175    478
Parity (4-4-1 network)      595     353
Encoder (4-2-4 network)     2676    3092

(a) The average was taken over 100 trials with different initial weights. For backpropagation, the learning rate and momentum were 0.9 and 0.7, respectively, for XOR, 0.5 and 0.8, respectively, for parity, and the same for encoder. For Alopex, δ and N were 0.35 and 20, respectively, for XOR, 0.1 and 30 for parity, and 0.05 and 100 for encoder.
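The parameter heuristics described before Table 1 translate directly into code; a small sketch (function names ours, thresholds as quoted in the text):

```python
import numpy as np

def initial_step_size(w_min=-1.0, w_max=1.0, factor=0.005):
    """delta of about 0.001-0.01 times the dynamic range of the weights."""
    return factor * (w_max - w_min)

def converged(outputs, targets, tol=0.1):
    """Stop when every output is within 0.1 of its target over the training set."""
    return bool(np.all(np.abs(outputs - targets) < tol))
```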
3.1 Comparisons with Other Learning Algorithms. The first set of experiments was done to compare the convergence time of Alopex with that of backpropagation. Alopex was used to train multilayer perceptrons with sigmoidal transfer functions, using the mean-squared error measure. Table 1 shows the performance of Alopex and a standard version of backpropagation on the XOR, parity, and encoder problems. A 2-2-1 network was used for solving XOR, a 4-4-1 network for the (4-bit) parity, and a 4-2-4 network for the (4-bit) encoder problem. The average number of iterations taken by the two algorithms over 100 trials is given in Table 1. We can see that the average number of iterations taken by Alopex is comparable to that taken by backpropagation. It should be pointed out that in Alopex all the weights are updated simultaneously; hence, with a parallel implementation, the computation time per update would be much less than that of backpropagation.

The next set of experiments was done to compare Alopex with reinforcement learning and learning automata. The multiplexer task, which involves learning a six-input boolean function, has been solved using both of these methods (Barto 1985; Mukhopadhyay and Thathachar 1989). Of the six input lines, four carry data and two carry addresses. The task is to transmit the appropriate data bit, as specified by the address, to the output line. Following Barto (1985), we chose a network with six linear input units, four sigmoidal hidden units, and a sigmoidal output unit, with 39 parameters (34 weights and five thresholds) to adjust. The training data were fed continuously into the network and the parameters were updated after every 64 examples. The training was stopped when 1000 consecutive examples were correctly classified. Following Mukhopadhyay and Thathachar (1989), we created three tasks with three different sets of address lines. Table 2 shows the average number of updates (over 10 trials, each starting with a different set of weights) needed for solving each of the tasks.⁶ From Table 2 we can see that Alopex compares favorably with these algorithms. Since the updating and stopping criteria are slightly different in the three studies, the numbers cannot be compared directly. Table 3 shows the number of iterations taken (from one initial set of weights) for different step-sizes, using the mean-squared error and the log error (see Section 3.4).

⁶Mukhopadhyay and Thathachar (1989) specify the convergence criterion as the correct classification of the 64 training examples. With this criterion, the numbers of iterations are lower.
Table 2: Number of Iterations Taken by Alopex, Learning Automata (LA), and Reinforcement Learning (ARP) to Solve the Four-Bit Switching Problem.(a)

            Task 1                      Task 2
Algorithm   Lo      Hi      Av          Lo      Hi      Av
Alopex      6306    14532   9851        7141    16619   11249
LA          10659   15398   12628       10748   15398   12641
ARP         -       -       -           -       -       -

            Task 3                      Overall
Algorithm   Lo      Hi      Av          Lo      Hi      Av
Alopex      6206    13872   9948        6206    16619   10349
LA          10403   12939   11917       10403   15398   12395
ARP         -       -       -           37500   350000  133149

(a) See Mukhopadhyay and Thathachar (1989) for a description of the three tasks. The data for LA are taken from the above paper and the ARP data are taken from Barto (1985). Slightly different updating and stopping criteria are used in each method and hence the three cannot be compared directly. For Alopex and LA, each task was run 10 times and for ARP one of the tasks was run 30 times. For Alopex, δ was 0.0025 and N was 10.
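For reference, the six-input multiplexer (switching) task itself is easy to reproduce; a sketch under the convention that the two address bits index one of the four data bits (the paper's three tasks differ only in which lines carry the address):

```python
import numpy as np

rng = np.random.default_rng(1)

def multiplexer_example():
    """One example of the 6-input multiplexer task: four data lines,
    two address lines; the target is the addressed data bit."""
    data = rng.integers(0, 2, size=4)
    addr = rng.integers(0, 2, size=2)
    target = data[2 * addr[0] + addr[1]]   # the address selects a data line
    return np.concatenate([data, addr]).astype(float), float(target)

x, y = multiplexer_example()
```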
The third set of experiments was done to compare Alopex with weight perturbation methods. Figure 1 shows the mean square error as a function of iterations for the XOR problem. A 2-2-1 network was used. The data for weight perturbation and backpropagation are taken from Jabri and Flower (1991). For a small step-size (δ = 0.008), the error decrement for Alopex is fairly smooth and it takes about the same number of iterations as the other two methods to converge. The convergence can be speeded up by using larger steps, as shown by the plot for δ = 0.03, but the error decrement is then no longer smooth.
Table 3: Number of Iterations Taken by Alopex to Solve the Switching Task for Different Step-Sizes (δ) and Error Measures.(a)

Step-size   Sqr error   Log error
0.01        12,382      7,802
0.0075      10,421      6,327
0.005       9,416       15,385
0.0025      6,341       12,910
0.001       11,666      9,289
0.0005      13,657      14,714
Average     10,647      11,071

(a) All networks were started with the same initial conditions.
Figure 1: The mean square error as a function of iterations for learning the XOR problem with a 2-2-1 network. Back-prop, backpropagation algorithm; Wt-pert, weight perturbation algorithm; Alpx (sml), Alopex with a small step size (δ = 0.008); Alpx (lrg), Alopex with a large step size (δ = 0.03). The plots for backpropagation and weight perturbation are reproduced from Jabri and Flower (1991).
Table 4: Performance of Different Gradient Descent Methods and Alopex on MONKs Problems.(a)

Learning algorithm       Problem 1 (%)   Problem 2 (%)   Problem 3 (%)
BP                       100             100             93.1
BP with weight decay     100             100             97.2
Cascade correlation      100             100             97.2
Alopex                   100             100             100

(a) For Alopex, δ was 0.01 and N was 10. Training was terminated when the network had learned to correctly classify all the test samples. (Data for backpropagation and cascade-correlation are from Thrun et al. 1991.)
3.2 The MONKs Problems. These are a set of three classification problems used for extensive benchmarking of machine learning techniques and neural network algorithms (see Thrun et al. 1991 for details). Samples are represented by six discrete-valued attributes, and each problem involves learning a binary function defined over this domain. Problem 1 is in standard disjunctive normal form. Problem 2 is similar to parity problems and combines different attributes in a way that makes it complicated to describe in disjunctive or conjunctive normal form using only the given attributes. Problem 3 is again in disjunctive normal form, but contains about 5% misclassifications. In the database, 124 randomly chosen samples are designated for training the first problem, 169 for training the second problem, and 122 for training the third problem. The entire set of 432 samples is used for testing. A feedforward network with 15 input units, 3 hidden units, and an output unit was trained to solve these problems. The network contained sigmoidal nonlinearities, and Alopex was used to minimize the mean-squared error. The network learned to classify the first test set with 100% accuracy after 5,000 iterations and the second test set after 10,000 iterations. The third test set was correctly classified after 1,000 iterations, and Figure 2 shows the network output for the 432 samples. Table 4 compares the performance of feedforward perceptrons trained using standard backpropagation, backpropagation with weight decay, the cascade-correlation technique, and Alopex on these problems. We can see that Alopex is the only method capable of correctly learning all three problems. It should be noted that about 25 learning methods were compared in Thrun et al. (1991), but none of them achieved 100% accuracy on all three test sets. These experiments show that Alopex can be used as a powerful, general learning algorithm.
Figure 2: Network responses for the 432 test samples of the third MONKs problem. The network was trained on 122 training samples for 1000 iterations. The test samples are shown on an 18 x 24 grid, following the convention of Thrun et al. (1991). The height of the blocks represents the magnitude of the network responses, with a maximum of 1.0 (deep red) and a minimum of 0.0 (deep blue). The six marked samples were deliberately misclassified in the training set. The responses to these samples show the robustness of the algorithm. At this point in training, the network correctly classifies all 432 samples. See text and Thrun et al. (1991) for details of the MONKs problems.
3.3 The Mirror Symmetry Problem. The mirror symmetry problem has also been used for benchmarking learning algorithms (see Peterson and Hartman 1989; Sejnowski et al. 1986; Barto and Jordan 1987). The inputs are N x N-bit patterns with either a horizontal, a vertical, or a diagonal axis of symmetry, and the task of the network is to classify them accordingly.
Table 5: Average Number of Training Iterations and Generalization Accuracies for the 4 x 4 Mirror Symmetry Problem.(a)

Probability 0.5
Learning algorithm    Average number of iterations    Generalization accuracy (%)
Alopex (log)          4417                            74.0
Alopex (square)       8734                            73.3
Mean field theory     -                               70
Backpropagation       -                               71

Probability 0.4
Learning algorithm    Average number of iterations    Generalization accuracy (%)
Alopex (log)          3966                            79.3
Alopex (square)       7735                            73.7
Mean field theory     -                               63
Backpropagation       -                               64

Probability 0.6
Learning algorithm    Average number of iterations    Generalization accuracy (%)
Alopex (log)          4412                            79.8
Alopex (square)       6141                            76.1
Mean field theory     -                               62
Backpropagation       -                               63

(a) Data for backpropagation (BP) and mean field theory learning (MFT) are taken from Peterson and Hartman (1989). For Alopex, δ was 0.003 and N was 10.
For comparing numerical generalization accuracies, we used the fixed training set paradigm described in Peterson and Hartman (1989). Ten sets of 4 x 4-bit data, each set containing 100 training samples, were used in the experiments. A feedforward network with 16 input units, 12 hidden units, and 3 output units was trained on each one of these data sets, and the training was terminated when all the training samples were correctly classified according to the "mid-point" criterion.⁷ The generalization accuracy was determined on the remaining 9 sets of data, using the same criterion. Experiments were done using patterns whose elements had probabilities of 0.4, 0.5, and 0.6 of being on. Table 5 shows the generalization accuracies and average numbers of training iterations. Alopex was used to minimize the mean-squared error measure and the log error measure (see below). The accuracies for mean field theory learning (MFT) and backpropagation (BP) are also shown. The generalization accuracy for Alopex is slightly better in one case and considerably better in the other two cases.⁸

⁷The responses of "correct" output units should be greater than 0.5 and the responses of "incorrect" output units should be less than 0.5.
⁸The average number of iterations cannot be compared, as Peterson and Hartman update the weights after 5 patterns are presented, while we update the weights after all 100 patterns are presented.
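A sketch of how such training patterns can be generated is given below; the encoding of the three symmetry classes is ours and follows the verbal description above rather than Peterson and Hartman's exact data sets.

```python
import numpy as np

rng = np.random.default_rng(2)

def symmetric_pattern(n=4, axis="horizontal", p_on=0.5):
    """An n x n binary pattern (n even) with the requested mirror symmetry;
    each free element is on with probability p_on."""
    half = (rng.random((n // 2, n)) < p_on).astype(float)
    if axis == "horizontal":   # bottom half mirrors the top half
        return np.vstack([half, half[::-1, :]])
    if axis == "vertical":     # right half mirrors the left half
        return np.hstack([half.T, half.T[:, ::-1]])
    # diagonal: reflect an upper-triangular draw about the main diagonal
    x = (rng.random((n, n)) < p_on).astype(float)
    return np.triu(x) + np.triu(x, 1).T

pattern = symmetric_pattern(axis="diagonal", p_on=0.4)
```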
Table 6: Comparison of Experiments Using Different Error Measures.(a)

XOR (2-2-1 network)
Error measure    Average number of iterations    Percentage of times "stuck"
Log              1484                            4
Square           478                             19
BP (square)      1175                            15

Parity (4-4-1 network)
Error measure    Average number of iterations    Percentage of times "stuck"
Log              297                             0
Square           353                             0
BP (square)      595                             0

Encoder (4-2-4 network)
Error measure    Average number of iterations    Percentage of times "stuck"
Log              2996                            0
Square           3092                            1
BP (square)      2676                            0

(a) An experiment was categorized as "stuck" if it did not converge after 20,000 iterations. The learning parameters used for backpropagation (BP) and Alopex are the same as in Table 1.
3.4 Usefulness of Different Error Measures. In most of the studies reported above, we used the mean-squared error measure. When the output nodes are sigmoidal, this error function has upper and lower bounds and may contain multiple minima even for a single-layer network (no hidden units). Alopex can be used to minimize arbitrary error measures. In this section we demonstrate the advantage of using an information-theoretic (log) error measure. The classification error in this case is defined as
E = Σ_i [ target_i log(target_i / output_i) + (1 - target_i) log((1 - target_i) / (1 - output_i)) ]    (3.1)
where the targets for the output units are either 0 or 1.⁹ For a network with one layer of connections (no hidden units) and sigmoid nonlinearities at the output nodes, this error function has been shown to contain only a single minimum (Unnikrishnan et al. 1991). Table 6 shows the average number of iterations (over 100 trials) taken by networks using the squared and log errors to solve the XOR, parity, and encoder problems. The number of times these networks failed to converge after 20,000 iterations is also shown in the table.

⁹Since derivatives of transfer functions are not explicitly calculated in Alopex, targets for learning can be 1.0 or 0.0.
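For 0/1 targets, the target log target terms in equation 3.1 vanish, and E reduces to the familiar cross-entropy. A minimal sketch (our function name):

```python
import numpy as np

def log_error(targets, outputs, eps=1e-12):
    """Equation 3.1 for 0/1 targets, where it reduces to the cross-entropy
    -sum[t*log(o) + (1 - t)*log(1 - o)]; outputs are clipped away from 0 and 1."""
    o = np.clip(outputs, eps, 1.0 - eps)
    t = np.asarray(targets, dtype=float)
    return float(-np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o)))
```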
For the XOR problem, a network using the log error got "stuck" during 4% of the trials, while a network using the squared error got stuck during 19% of the trials. A network using backpropagation, and hence the squared error, got stuck during 14% of the trials. The improved performance of networks using the log error is due to the fact that these error surfaces are much smoother and contain fewer local minima. Figure 3a shows the network used for the XOR problem and Figure 3b-e shows the error surfaces around the solution point. The surfaces are plotted with respect to pairs of weights, holding the other weights at their final, converged values. We can see that the surfaces for the log error are much smoother than those for the squared error. Networks using the log error always converged faster in our experiments. For example, the third MONKs problem was solved by a network using the log error after only 665 iterations, while a network using the squared error took 1000 iterations. This is also evident in the data shown in Table 5 for the symmetry problem. Networks using the log error consistently converged faster (and generalized a little better).

3.5 Using the "Annealing Schedule" to Reach the Global Minimum. The annealing schedule described in equation 2.6 automatically controls the randomness of the algorithm, and it has been used successfully on many occasions to reach global minima of error surfaces. Figure 4 illustrates a case for the XOR network shown in Figure 3a. Alopex was used to minimize the log error. The path taken by the algorithm to reach the solution point is plotted over the error surface with respect to two of the weights. The algorithm had to overcome several local minima to reach the global minimum. (These minima are not completely evident in the figure, as the other weights are held at their optimum values for plotting the error surface; these weights were changing during learning.)
3.6 Scaling Properties of Alopex. The ability of Alopex to learn in networks with a large number of output classes was investigated using encoder problems of different sizes. Table 7 shows the average number of iterations over 25 trials. A network using the squared error could not solve problems bigger than 8 bits, but one using the log error successfully learned problems up to 32 bits long, the largest we attempted. The error per bit during these learning experiments is shown in Figure 5. These results show that with appropriate error measures, Alopex can be used in networks with large numbers of output nodes.
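The encoder data sets used here are simply one-hot patterns that must be reproduced through a narrow hidden layer; a sketch:

```python
import numpy as np

def encoder_data(n_bits=8):
    """Training set for the n-bit encoder problem: each one-hot input must
    be reproduced at the output through a narrow hidden layer (e.g., an
    8-3-8 network for n_bits = 8)."""
    patterns = np.eye(n_bits)
    return patterns, patterns.copy()   # inputs and targets coincide

inputs, targets = encoder_data(8)
```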
3.7 Learning in Networks with Feedback. Conventional feedforward networks have a limited ability to process real-time temporal signals, or to model and control dynamic systems. We investigated the ability of the Alopex algorithm to train recurrent networks that could be used more effectively for such applications. Three-layered networks with totally interconnected hidden layers (including self loops) were used to solve temporal XOR problems with various delays.
Figure 3: (a) Schematic of the network used for solving the XOR problem. The weights in the network are labeled w1 through w9. Individual neurons contain sigmoidal nonlinearities. (b-e) Error surfaces around the solution point (center) for square (b and d) and log (c and e) errors. b and c show the surfaces with respect to w3 and w6; d and e show them with respect to w4 and w9. All the other weights (and biases) were held at their optimum values for plotting the surfaces.
Figure 4: Path taken by the Alopex algorithm to reach the global minimum of error for XOR. It is superimposed on the final error surface with respect to w3 and w6. The actual error surface encountered by the algorithm during learning is different from this final one, since the other weights, which are held at their optimum values for this plot, change during learning.

Table 7: Average Number of Iterations (over 25 Trials) for Encoder Problems of Different Sizes.(a)

Size of input and network    Square error        Log error
4 Bits (4-2-4 net)           1734                1775
8 Bits (8-3-8 net)           6771                13,077
16 Bits (16-4-16 net)        (No convergence)    20,280
32 Bits (32-5-32 net)        (No convergence)    75,586

(a) δ was 0.005 for the 4-bit and 8-bit networks, 0.004 for the 16-bit network, and 0.001 for the 32-bit network. N was 10 for all the networks.
The task is to make the network output at time t the XOR of the input at time t - τ and the input at time t - (τ + 1). For this, the network needs to store values from τ + 1 time-steps in the past.
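The training sequences for this task can be generated as follows; this is a sketch based on the task definition above (the 3000-bit training string and 100-bit test string mentioned with Figure 5 are used as defaults):

```python
import numpy as np

rng = np.random.default_rng(3)

def temporal_xor_sequence(length=3000, tau=2):
    """Random bit stream x and targets y(t) = x(t - tau) XOR x(t - tau - 1).
    The first tau + 1 targets are undefined and left as zero."""
    x = rng.integers(0, 2, size=length)
    y = np.zeros(length, dtype=int)
    y[tau + 1:] = x[1:length - tau] ^ x[:length - tau - 1]
    return x, y

x_train, y_train = temporal_xor_sequence()           # 3000-bit training string
x_test, y_test = temporal_xor_sequence(length=100)   # 100-bit test string
```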
Figure 5: Error per bit for different encoder problems as a function of learning iterations. Alopex was used to minimize the log error.

A randomly generated, 3000-bit-long string was used for training and another 100-bit-long string was used for testing. Alopex was used to minimize the squared error. A network with two hidden units (a 1-2-1 network) was able to learn the τ = 0 problem in 6,000 iterations. The τ = 1 problem was learned by a 1-4-1 network in 4,668 iterations, and the τ = 2 problem was learned by a 1-6-1 network in 27,000 iterations. Figure 6b shows the output of the last network along with the test data, and Figure 7 shows the average error per pattern for the three networks during learning.

4 Neurobiological Connection
In this paper, we have presented Alopex as a learning algorithm for artificial neural networks. It was originally developed for modeling aspects of brain function and the following three characteristics make it ideal for these purposes:
1. it is able to handle hierarchical networks with feedback;
2. it is a correlation-based algorithm; and
3. it is a stochastic algorithm.
Figure 6: (a) Schematic of the networks used to solve the temporal XOR problems. They contain a totally interconnected (including self loops) hidden layer. (b) Output of a network (solid line) trained to solve the τ = 2 temporal XOR problem. It is superimposed on the target sequence (dotted line). The two plots are almost identical: the network has learned to solve this problem with high accuracy.
Figure 7: The average error per bit for the three networks trained to solve temporal XOR problems, as a function of learning iterations. The three networks are of different sizes. δ was 0.005 and N was 10 for all three networks.

The mammalian sensory systems are organized in a hierarchic fashion, and there are extensive interconnections between neurons within a layer and between neurons in different layers (Van Essen 1985). During development, some of these feedback connections are established even before the feedforward connections (Shatz et al. 1990). We have extensively used simulations of multilayer networks with feedback to investigate the dynamics of sensory information processing and development (Harth and Unnikrishnan 1985; Harth et al. 1987, 1990; Unnikrishnan and Nine 1991; Janakiraman and Unnikrishnan 1992, 1993; Unnikrishnan and Janakiraman 1992; Nine and Unnikrishnan 1993; Unnikrishnan and Nine 1994). In all these studies, Alopex is used as the underlying computational algorithm. In the nervous system, mechanisms such as NMDA receptors are capable of computing temporal correlations between inputs in a natural fashion (Brown et al. 1990). Computer simulations of known neural circuitry in the mammalian visual system have demonstrated the capability of these mechanisms for carrying out Alopex (Sekar and Unnikrishnan 1992; Unnikrishnan and Sekar 1993).
There is considerable randomness in the responses of neurons (Verveen and Derksen 1965). Stochastic algorithms like Alopex take this aspect of the nervous system into consideration.

5 Discussion

Our simulations show that Alopex is a robust, general-purpose learning algorithm. By avoiding explicit gradient calculations, it overcomes
many of the difficulties of current methods. We have used the same algorithm to train feedforward and recurrent networks and to solve a large class of problems such as XOR, parity, encoder, and temporal XOR. With appropriate error measures, it is able to learn encoder problems of up to 32 bits. Our results on the MONKs problems are the best ones reported in the literature. The generalization results on the 4 x 4 symmetry problems, using a fixed training set, are better than the ones quoted for BP and MFT. Results on the switching problem show that the network takes a comparable number of iterations to solve this task as reinforcement learning or learning automata. Other recent studies, reported elsewhere, have shown the applicability of Alopex to diverse problems such as the recognition of underwater sonar targets and handwritten digits (Venugopal et al. 1991, 1992) and the control of the nonlinear dynamics of an underwater vehicle (Venugopal et al. 1994).

A continuous-time version of Alopex, using differentials instead of differences and integrals instead of sums, has been developed recently (E. Harth, personal communication). It has been implemented in analog hardware and used for a variety of adaptive control applications. Analog VLSI implementations of these circuits may make real-time learning in neural networks possible.

The algorithm uses a single scalar error signal to update all the weights. This may pose a problem in networks with a large number (hundreds) of output units and in networks with a large number of hidden layers. Learning may have to be subdivided in these networks. A preliminary mathematical analysis of the convergence properties of Alopex has been done and will be presented elsewhere (Sastry and Unnikrishnan 1994).

Finally, we would like to say that it was the long history of Alopex in brain modeling that prompted us to investigate it as a learning algorithm for artificial neural networks. We believe that, after all, knowledge of biological neural function will be useful in developing effective learning algorithms for artificial neural networks.

Acknowledgments

The authors wish to thank P. S. Sastry for many useful discussions and the action editor for pointing out some of the relevant literature. The clarity of this manuscript has been greatly improved by comments from Jim Elshoff, Erich Harth, Don Jones, P. S. Sastry, R. Uthurusamy, Wayne Wiitanen, and Dick Young. Milind Pandit and Harmon Nine helped us with some of the simulations and Tom Boomgaard helped us with some of the figures. Part of this work was done when K. P. V. was a visitor at GM Research Labs.
References

Alspector, J., Meir, R., Yuhas, B., Jayakumar, A., and Lippe, D. 1993. A parallel gradient descent method for learning in analog VLSI neural networks. In Advances in Neural Information Processing Systems, pp. 836-844. Morgan Kaufmann, San Mateo, CA.
Barto, A. G. 1985. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiol. 4, 229-256.
Barto, A. G., and Jordan, M. I. 1987. Gradient following without back-propagation in layered networks. Proc. IEEE First Ann. Intl. Conf. Neural Nets., II-629-II-636.
Barto, A. G., Sutton, R. S., and Brouwer, P. S. 1981. Associative search network: A reinforcement learning associative memory. Biol. Cybern. 40, 201-211.
Baum, E. B., and Wilczek, F. 1988. Supervised learning of probability distributions by neural networks. In Neural Information Processing Systems, D. Z. Anderson, ed. American Institute of Physics, New York.
Brown, T. H., Kairiss, E. W., and Keenan, C. L. 1990. Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci. 13, 475-512.
Dembo, A., and Kailath, T. 1990. Model-free distributed learning. IEEE Trans. Neural Networks 1, 58-70.
Draper, C. S., and Li, Y. T. 1951. Principles of Optimalizing Control Systems and an Application to the Internal Combustion Engine. ASME Publications.
Harth, E., Pandya, A. S., and Unnikrishnan, K. P. 1990. Optimization of cortical responses by feedback modification and synthesis of sensory afferents. A model for perception and REM sleep. Concepts Neurosci. 1, 53-68.
Harth, E., Pandya, A. S., and Unnikrishnan, K. P. 1986. Perception as an optimization process. Proc. IEEE Conf. CVPR, 662-665.
Harth, E., and Tzanakou, E. 1974. Alopex: A stochastic method for determining visual receptive fields. Vision Res. 14, 1475-1482.
Harth, E., and Unnikrishnan, K. P. 1985. Brainstem control of sensory information: A mechanism for perception. Int. J. Psychophysiol. 3, 101-119.
Harth, E., Unnikrishnan, K. P., and Pandya, A. S. 1987. The inversion of sensory processing by feedback pathways: A model of visual cognitive functions. Science 237, 187-189.
Hinton, G. E. 1989. Connectionist learning procedures. Artif. Intell. 40, 185-234.
Hopfield, J. J. 1987. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. U.S.A. 84, 8429-8433.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Jabri, M., and Flower, B. 1991. Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Comp. 3, 546-565.
Janakiraman, J., and Unnikrishnan, K. P. 1992. A feedback model of visual attention. Proc. IJCNN, III-541-III-546.
Janakiraman, J., and Unnikrishnan, K. P. 1993. A model for dynamical aspects of visual attention. In Computation and Neural Systems, pp. 215-219. Kluwer Academic, Boston, MA.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Minsky, M. L. 1954. Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain-Model Problem. Princeton University, Princeton, NJ.
Mukhopadhyay, S., and Thathachar, M. A. L. 1989. Associative learning of boolean functions. IEEE Trans. SMC 19, 1008-1015.
Narendra, K. S., and Thathachar, M. A. L. 1989. Learning Automata: An Introduction. Prentice Hall, Englewood Cliffs, NJ.
Nine, H. S., and Unnikrishnan, K. P. 1993. The role of subplate feedback in the development of ocular dominance columns. In Computation and Neural Systems, pp. 389-393. Kluwer Academic, Boston, MA.
Peterson, C., and Hartman, E. 1989. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Sastry, P. S., and Unnikrishnan, K. P. 1994. On the convergence properties of the Alopex algorithm. Manuscript in preparation.
Sejnowski, T. J., Kienker, P. K., and Hinton, G. E. 1986. Learning symmetry groups with hidden units: Beyond the perceptron. Physica D 22, 260-275.
Sekar, N. S., and Unnikrishnan, K. P. 1992. The Alopex algorithm is biologically plausible. In Abstr. Learning and Memory Meeting, p. 50. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
Shatz, C. J., et al. 1990. Pioneer neurons and target selection in cerebral cortical development. Cold Spring Harbor Symp. Quant. Biol. 55, 469-480.
Thrun, S., et al. 1991. The MONKs problems: A performance comparison of different learning algorithms. Carnegie Mellon University, CMU-CS-91-197, December 1991.
Unnikrishnan, K. P., and Janakiraman, J. 1992. Dynamical control of attention through feedback pathways: A network model. Soc. Neurosci. Abstr. 18, 741.
Unnikrishnan, K. P., and Nine, H. S. 1991. Cortical feedback to LGN may play a major role in ocular dominance column development. Soc. Neurosci. Abstr. 17, 1135.
Unnikrishnan, K. P., and Nine, H. S. 1994. Role of subplate "feedback" in the formation of ocular dominance columns: A model. Submitted.
Unnikrishnan, K. P., and Pandit, M. S. 1991. Learning in hierarchical neural networks with feedback. In Abstr. Neural Nets. Comp. Conf., Snowbird, UT.
Unnikrishnan, K. P., and Sekar, N. S. 1993. A biophysical model of the Alopex algorithm. Soc. Neurosci. Abstr. 19, 241.
Unnikrishnan, K. P., and Venugopal, K. P. 1992. Learning in connectionist networks using the Alopex algorithm. Proc. IJCNN, I-926-I-931.
Unnikrishnan, K. P., Pandya, A. S., and Harth, E. 1987. The role of feedback in visual perception. Proc. IEEE First Ann. Intl. Conf. Neural Nets, IV-259-IV-267.
Unnikrishnan, K. P., Hopfield, J. J., and Tank, D. W. 1991. Connected-digit speaker-dependent speech recognition using a neural network with time-delayed connections. IEEE Trans. Signal Proc. 39, 698-713.
Van Essen, D. C. 1985. Functional organization of primate visual cortex. In Cerebral Cortex, Vol. 3, A. Peters and E. G. Jones, eds. Plenum Press, New York.
Venugopal, K. P., Pandya, A. S., and Sudhakar, R. 1991. Continuous recognition of sonar targets using neural networks. In Automatic Target Recognition, F. Sadjadi, ed. Proc. SPIE 1471, 43-54.
Venugopal, K. P., Pandya, A. S., and Sudhakar, R. 1992. Invariant recognition of 2-D objects using Alopex neural networks. In Applications of Artificial Neural Networks, S. K. Rogers, ed. Proc. SPIE 1708.
Venugopal, K. P., Pandya, A. S., and Sudhakar, R. 1992b. A recurrent network controller and learning algorithm for the on-line learning control of autonomous underwater vehicles. Neural Networks (in press).
Verveen, A. A., and Derksen, D. W. 1965. Fluctuations in membrane potential and the problem of coding. Kybernetik 2, 152-160.
Whitaker, H. P. 1959. An adaptive system for control of aircraft and spacecraft. Institute for Aeronautical Sciences, paper 59-100.
Widrow, B., and Lehr, M. A. 1990. 30 years of adaptive neural networks: Perceptron, madaline, and back-propagation. Proc. IEEE 78, 1415-1442.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Received December 9, 1992; accepted September 13, 1993.
Communicated by David MacKay
Duality Between Learning Machines: A Bridge Between Supervised and Unsupervised Learning

J.-P. Nadal
Laboratoire de Physique Statistique,* Ecole Normale Supérieure, 24, rue Lhomond, F-75231 Paris Cedex 05, France
N. Parga
Departamento de Física Teórica, Universidad Autónoma de Madrid, Canto Blanco, 28049 Madrid, Spain
We exhibit a duality between two perceptrons that allows us to compare the theoretical analysis of supervised and unsupervised learning tasks. The first perceptron has one output and is asked to learn a classification of p patterns. The second (dual) perceptron has p outputs and is asked to transmit as much information as possible on a distribution of inputs. We show in particular that the maximum information that can be stored in the couplings for the supervised learning task is equal to the maximum information that can be transmitted by the dual perceptron.

1 Introduction
Supervised and unsupervised learning are the two main research themes in the study of formal neural networks. In the first case, one is given a set of input-output pairs that has to be learned by a neural network (usually of a given architecture). One may be interested in the performance of the network as an associative memory, or one may be interested in the ability of the network to generalize: a rule is assumed to be hidden behind the examples (the input-output pairs to be learned), and one asks whether the net will give a correct output for a new input. In the case of an associative memory, the emphasis is usually put on the fact that the memory is distributed: the memory is distributed among the synapses, but also the output patterns (or attractors for an autoassociative memory) are made of features distributed among the neurons (the best studied case is the one of random patterns) (Hopfield 1982; Peretto 1992). It is generally considered that such encoding should facilitate associative recall with a high noise tolerance.

*Laboratoire associé au C.N.R.S. (U.R.A. 1306), à l'E.N.S. et aux Universités Paris VI et Paris VII.

Neural Computation 6, 491-508 (1994) © 1994 Massachusetts Institute of Technology
In the second case, no desired output is given, and one is asking the network to classify the data (input patterns). Typically one would like two patterns to be put in the same class if they are nearby in input space. Such a constraint is either implicit in the heuristic chosen for modifying the couplings, or explicit in the choice of a cost function. One of the most famous algorithms is the Kohonen map algorithm (Kohonen 1984), where a topology is introduced in the output space. In some approaches one puts the emphasis on discriminating between patterns rather than on clustering. For example, it has been shown that unsupervised Hebbian learning with a single linear output neuron leads to a principal component analysis (Hertz et al. 1990). For a gaussian input distribution this is equivalent to maximizing the amount of information that the output gives on the input. In fact, a particular strategy is to define a cost function based on information-theoretic criteria (Barlow 1961, 1989; Linsker 1988; Atick 1992), the justification being general considerations of what type of neural representations (or "codes") of the environment should be useful for the brain. Unsupervised learning often leads to "grandmother"-type cells: each neuron tends to respond specifically to a given type of stimuli, or one particular feature. For example, with some unsupervised algorithms based on Hebbian learning (Hertz et al. 1990) each output unit becomes specific to one principal component; in clustering algorithms one gets cluster-specific cells. One is thus confronted by two completely opposite approaches, differing not only in the type of issues that they address but also in the type of neural codes that they use or construct. What we propose in this paper is a framework that might allow a better understanding of the differences between a supervised and an unsupervised learning task. We will show that one can establish a relationship between the questions that are relevant for each task. This will be done via a duality between two neural architectures. Moreover, this duality is interesting in itself: in the context of supervised learning, the Bayesian approach tells one how to derive the parameters from the data by relating the probability of the parameters (the model) knowing the data to the probability of the data knowing the parameters. The duality that we introduce is nothing but an explicit implementation of this exchange between model and data. The paper is organized as follows. In Section 2 we present the duality between two perceptrons, and show how this allows us to relate the study of a supervised learning task to that of an unsupervised learning task. In particular we show the identity between various capacities that have been defined in each context. In Section 3 we emphasize the differences between the two tasks, showing, however, the deep relationship between the two problems. We show in particular how the statistical mechanics approach to learning is related to the study of the quantity of information that is relevant in the context of unsupervised learning. We show also that the first perceptron can be thought of as a decoder if one considers the
second one as a neural encoder. Perspectives are given in the Conclusion, and a generalization to other learning machines (other than the simple perceptron) is given in the Appendix.

2 From Supervised to Unsupervised Learning
2.1 The Dual Perceptrons. Let us consider a simple perceptron with one binary output (whose state σ takes, say, the values 0 or 1), N input neurons, and couplings J = {J_1, ..., J_N}. We consider continuous inputs unless otherwise specified. In a supervised learning task, one is given a set Ξ of p input patterns,

Ξ = {ξ^μ, μ = 1, ..., p}    (2.1)

and the set of desired outputs,

τ = {τ^μ = 0, 1, μ = 1, ..., p}

that have to be learned by the perceptron. For a given choice of the couplings, the output σ^μ when the μth pattern is presented is given by

σ^μ = Θ(Σ_{j=1}^{N} J_j ξ_j^μ)    (2.2)

where Θ(h) is 1 for h > 0 and 0 otherwise. For simplicity we assume zero threshold, and we will consider only the above deterministic rule (no synaptic noise). Now one can interpret formula 2.2 in two ways. One is, as above, that we have p input-output pairs realized by a perceptron with a single output unit, whose couplings are the J_j. But one can as well say that we have a perceptron with p output units, where J is now an input pattern and the ξ^μ, μ = 1, ..., p, are the p coupling vectors (Fig. 1). Let us call our initial perceptron with a unique output A, and the dual perceptron, with p output units as just explained, A*. In the following we show how useful this duality can be, in particular for the comparison between supervised and unsupervised learning. To avoid confusion when considering one of the dual perceptrons, we will append an asterisk to each ambiguous word whenever we are considering A*: in particular we will write "pattern*" and "couplings*," the asterisk being a reminder that for A* these denominations refer to J and to the ξ^μ, respectively.
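In code, the duality is nothing more than two readings of the same matrix-vector product; a sketch (our notation):

```python
import numpy as np

rng = np.random.default_rng(4)

N, p = 10, 5
J = rng.standard_normal(N)         # couplings of A, or the input pattern* of A*
Xi = rng.standard_normal((p, N))   # the p patterns of A, or the couplings* of A*

# Reading 1 (A): the outputs of a one-output perceptron on the p patterns.
# Reading 2 (A*): the p-bit codeword that the dual perceptron assigns to J.
sigma = (Xi @ J > 0).astype(int)   # identical under both readings
```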
2.2 The Number of Domains. Let us recall some important results concerning the supervised learning task for A. Of particular interest for what follows is the geometric approach (Cover 1965) to the computation of the maximal storage capacity: one considers the space of couplings, a coupling vector J = {J_j, j = 1, ..., N} being considered as a point in an N-dimensional space.
Figure 1: The dual perceptrons.
Figure 2: Partition of J space in domains.
Then each pattern μ defines a hyperplane, and the output σ^μ is 1 or 0 depending on which side of the hyperplane the point J lies. Hence the p hyperplanes divide the space of couplings into domains (Fig. 2), each domain being associated with one specific set σ = {σ^1, ..., σ^p} of outputs. Let us call Δ(Ξ) the number of domains:

Δ(Ξ) = number of domains    (2.3)

Since each σ^μ is either 0 or 1, there are at most 2^p different output configurations σ, that is,

Δ(Ξ) ≤ 2^p    (2.4)

If the patterns are "in a general position," then Δ(Ξ) is in fact independent of Ξ and a function only of p and N. One has the basic result (Cover 1965):

Δ(Ξ) = Δ(N, p) = 2 Σ_{k=0}^{min(N,p)-1} C^k_{p-1}    (2.5)

where C^k_p = p!/[k!(p - k)!]. In particular,

Δ(N, p) = 2^p for p ≤ N    (2.6)

This means that N is the "Vapnik-Chervonenkis dimension" (Vapnik 1982; Vapnik and Chervonenkis 1971) of the perceptron (that is, N + 1 is the first value of p for which Δ is smaller than 2^p):

d_VC = N    (2.7)

If the task is to learn a rule from examples, the VC dimension plays a crucial role: generalization will occur if the number of examples p is large compared to d_VC (Vapnik 1982). Another important parameter is the asymptotic capacity. In the large N limit, for a fixed ratio

α = p/N    (2.8)

the fraction of output configurations that are not realized remains vanishingly small for α up to the "critical storage capacity" (Cover 1965; Gardner 1988) α_c,

α_c = 2    (2.9)
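Equation 2.5, as reconstructed above, can be checked numerically; a sketch:

```python
from math import comb

def n_domains(N, p):
    """Cover's count of domains (equation 2.5):
    Delta(N, p) = 2 * sum_{k=0}^{min(N,p)-1} C(p-1, k)."""
    return 2 * sum(comb(p - 1, k) for k in range(min(N, p)))

assert n_domains(10, 8) == 2 ** 8    # Delta = 2^p for p <= N (equation 2.6)
assert n_domains(10, 22) < 2 ** 22   # beyond p = N, some dichotomies are lost
```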
2.3 The Number of Domains: The Dual Point of View. Now let us reconsider the geometric argument from the point of view of the dual perceptron A* as defined in 2.1. What we have just said is that, for a given choice of the couplings*, Ξ, one explores all the possible different output states σ that can be obtained when the input pattern* J varies. If J represents, say, the light intensities on a retina, σ is the first neural representation of a visual scene in the visual pathway. Since all visual scenes falling into a same domain are encoded with the same neural representation, Δ(Ξ) is the maximal number of visual scenes that can be distinguished. This can be said in terms of coding of information: to specify one domain out of Δ(Ξ) represents log Δ(Ξ) bits of information. Hence the maximum amount of information, or "information capacity" C, that σ can convey on the inputs* is

C(Ξ) = log Δ(Ξ)    (2.10)
In what sense is C(Ξ) the maximal amount of information that can be gained? Let us consider again the retina analogy. Each visual scene is a vector in an N-dimensional space, but not every vector of that space may correspond to a possible visual scene. Hence some of the domains might be empty (no stimulus ever falls inside these domains), so that some of the output codes may not be used. More generally, the statistics of visual scenes will typically be such that the input* domains are not visited with equal frequency. The amount of information I actually transmitted is thus smaller than (and at best equal to) C (we will come back to the study of I in Section 3). In the language of information theory, C is the channel capacity of the perceptron A* if used as a memoryless channel in a communication system (Blahut 1988). In that case the input alphabet is the set of all possible Js, and the output alphabet is the 2^p possible output configurations. If the Ξ are in general position, the capacity is only a function of p and N, and is given by 2.5; note that otherwise the capacity is smaller than (or equal to) log Δ(N, p). From 2.5 one sees that up to p = N each output neuron gives one bit of information (C = p), and for p > N one gains less and less information by adding new units*. More precisely, one has the asymptotic behavior

c(α) ≡ lim_{N→∞} C/N = α for α ≤ 2,    c(α) = α S(1/α) for α ≥ 2    (2.11)
Here (and throughout this paper) logarithms are expressed in base 2, and S(x) is the entropy function (measured in bits):

S(x) = -[x log x + (1 - x) log(1 - x)]    (2.12)
The information capacity c(α) is shown in Figure 3. We are thus led to consider the dual perceptron as what we will call a "neural encoder," a device that associates a neural representation (or codeword) with each input* signal, and whose performance is evaluated with tools coming from information theory. This point of view corresponds to an approach developed recently, in particular for modeling the sensory pathways in the brain (Linsker 1988; Atick 1992). In that context one wants the system to perform an efficient coding, according to some cost function derived from information theory concepts and general considerations on what type of coding might be useful for the brain (Barlow 1961, 1989). The algorithmic counterpart, that is, the modification of the couplings* in order to minimize such a cost function, results in unsupervised learning schemes: the cost function specifies an average quality of the code, but not a desired output for a given input* (we will
come back to this later on). The duality between the two perceptrons is thus a bridge between the study of supervised and unsupervised learning tasks.

Figure 3: The asymptotic information capacity/content c(α) of the perceptron A*/A (in bits per input* neuron/coupling) as a function of α = p/N.

2.4 The Information Content. We have seen that log Δ [c(α) in the large N limit] is an information quantity relevant for A*. What is its meaning for A? Since it is the number of bits needed for specifying one domain out of Δ, it is the amount of information stored in the couplings when learning an association (Ξ, τ) whenever this particular configuration τ corresponds to an existing domain. This gives the obvious result that below α_c the amount of information stored (in bits per synapse) is equal to α. But for α > α_c, with probability one (in the large N limit) no domain exists for a configuration τ chosen at random, and errors will result. However, it has been shown by G. Toulouse (Brunel et al. 1992) that even above α_c, c(α), as given by equation 2.11, remains the maximal amount of information that can be stored in the synapses. Hence we can use the term "information capacity" with its dual meaning of information content or of capacity for transmitting information. The rest of this paper will detail the comparison between the study of A for a supervised learning task and of A* as a neural encoder (as defined above).
3 Statistical Mechanics and the Mutual Information
Although the information capacity of the perceptron* is equal to the information storage capacity of the perceptron as an associative memory, there are important differences between the analysis of the two tasks. To see this, we have to be more specific about the relevant questions for each perceptron.

3.1 Supervised Learning. We start with the supervised learning task for A. The statistical physics (or Bayesian) approach to supervised learning (Gardner 1988; Levin et al. 1990; Grassberger and Nadal 1994) forces us to study a statistical ensemble of machines, the couplings being taken from some prior distribution ρ(J). For example, if one looks for discrete couplings, ρ(J) may give equal weight to every possible choice of couplings. Another example, the best studied case, is the one of spherical couplings:

Σ_{j=1}^{N} J_j² = N    (3.1)

with ρ(J) being the uniform measure on the sphere. One is interested in the probability that a given set of outputs σ, chosen at random, is realizable. According to the deterministic rule 2.2, the probability of having σ has the expression

P_σ = ∫ dJ ρ(J) Π_{μ=1}^{p} δ_{σ^μ, Θ(J·ξ^μ)}    (3.2)
In other words, P_σ is the fractional volume of the couplings that implement the particular set of associations (Ξ, σ). After Δ(Ξ), P_σ is the most important quantity relevant to our discussion. The typical probability that a random σ is learnable has been computed (Gardner 1988) for patterns drawn from a statistical ensemble. In principle one has to compute P_σ for a given choice of the patterns. However, in the large N limit the log-probability

L_σ = L(σ) = log P_σ    (3.3)

is "self-averaging": the limit l = L/N when N goes to infinity exists and is only a function of the distribution ρ*. It is then also given by the limit of the averaged value of L:

l = lim_{N→∞} L/N = lim_{N→∞} << log P_σ >> / N    (3.4)
(3.4)
where << . >> means the average over the patterns:
/ l-I “
<< f >>=
P
fi=l
dtp
p*(E)f(E)
(3.5)
Duality Between Learning Machines
499
Statistical mechanics tools such as the ”replica technique” or the ”cavity method” (Mezard et al. 1987) have made possible the computation of I for various choices of the patterns distributions [uncorrelated, with and without bias, and very recently correlated patterns (Monasson 199211, and of the space of couplings (continuous or discrete). For each case one gets in particular the critical asymptotic capacity nC. 3.2 Unsupervised Learning. We turn now to the dual perceptron. Having specified the distribution p(J) for A, we have thus to consider that at each instant the dual perceptron receives a new input* J, a particular pattern* J occurring with probability p(J). To be more concrete we will talk of J as an ”image,” for which the neural encoder has to give a neural representation, or code, 0 . As much as possible different images should have different neural representations: as already mentioned, one is interested in having the largest variety of available outputs. This variety is measured by the entropy of the output:
(3.6) U
Since we are considering a deterministic system, this entropy is equal to the mutual information I(cr, 1) between the input* and the output:
I(n.1) = H ( P u )
(3.7)
Indeed, when a configuration CT is observed, the gain of information is equal to - l n P u = -L(n), and I(cr.1) = H(P,) is the average gain of information. The mutual information is the main quantity of interest for the study of A’. Its study, that is of the entropy 3.6, is to be contrasted with that of L for a randomly chosen u. Note that I is a function of the distribution p and of the couplings* 2 I = I ( p . Z). What is its relationship with the capacity C? If p gives the same weight to every domain (in J space), then for any CT that corresponds to a domain, P, = l/A(Z), and the entropy is at its maximal possible value, In A(:). Hence one has
I(..])
5 C = maxI(cr,])
(3.8)
P
and C - I(.]) is the redundancy of the code u. Now for a given distribution p, one would like to optimize the performance of the network by a proper choice of the parameters* Z. A simple idea is that the network should extract as much information as possible from the environment, which means maximizing the mutual information: Optimization principle: find Z* that realizes m_axI(o.1)
-
(3.9)
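Because the channel is deterministic, I(σ, J) can be estimated by sampling inputs* from the prior and histogramming the resulting codewords; a Monte Carlo sketch, assuming the spherical gaussian prior of Section 3.1 (the plug-in entropy estimate below is biased for small samples):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

def output_entropy(Xi, n_samples=20000):
    """Plug-in estimate of H(P_sigma) = I(sigma, J) (equations 3.6-3.7),
    sampling inputs* J from a spherical gaussian prior (uniform directions)
    and histogramming the codewords sigma."""
    counts = Counter()
    for _ in range(n_samples):
        J = rng.standard_normal(Xi.shape[1])
        counts[tuple((Xi @ J > 0).astype(int))] += 1
    P = np.array(list(counts.values()), dtype=float) / n_samples
    return float(-np.sum(P * np.log2(P)))   # in bits; at most C = log2(Delta)

Xi = rng.standard_normal((5, 10))
I_est = output_entropy(Xi)
```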
This optimization strategy has been used by Linsker (1988) for modeling the first stages in the visual pathway, and is related to other strategies such as the minimization of redundancy (Atick 1992; Barlow 1961). In this framework
most of the analytical studies have been done for networks with linear neurons. Here we limit our study to the extreme opposite case of binary neurons. We will not detail the various proposed strategies here [for a general discussion see Barlow (1961, 1989) and Atick (1992), and for the case of the perceptron see Bialek and Zee (1988) and Nadal and Parga (1992, 1993)], and consider only the above optimization principle. What would be the meaning of this principle in the context of A? Given that the couplings would be taken from some prior distribution, the optimal patterns Ξ* are those that are the easiest to learn in the following sense: in the ideal case, that is, if I = C, all the learnable sets of associations (Ξ*, τ) are equiprobable. This might be useful if we are free to choose the patterns as "random" addresses, the perceptron being used as a random access memory for storing p-bit strings (the τ). In the context of learning a rule by example, the network is fully unbiased for this particular choice of input patterns: every learnable rule has the same weight. Conversely, is there any typical quantity of interest in the vein of 3.4? In fact it is indeed interesting to consider the typical mutual information ι that results if the couplings* are taken from some statistical ensemble ρ*(Ξ):

ι = lim_{N→∞} I/N = lim_{N→∞} << H(P_σ) >> / N    (3.10)
where << · >> denotes the average over the distribution ρ*(Ξ), as in 3.5. The motivation is the following. One considers a given input* distribution ρ(J). First, one would like to know what information is transmitted in the absence of any optimization: this can be obtained by considering couplings* taken at random, with each component being an independent unbiased random variable. This tells us how much can be gained by optimization. Furthermore, instead of trying to find the optimal couplings*, one may consider a statistical ensemble of coupling* vectors characterized by the correlations Γ in their components. Then one looks for the correlations that maximize ι. We have carried out this program for the perceptron, and we present the details in a separate paper (Nadal and Parga 1992, 1993). The main result is that, for gaussian inputs* with correlation matrix G, the optimal correlation matrix Γ is equal to the inverse of G. This result is very similar to the one obtained for linear units by Linsker, but with two main differences. First, as explained above, we are computing the optimal statistical properties of the couplings* instead of the exact optimal couplings*. Second, we are not considering translationally invariant couplings*, that is,
ξ_j^μ = ξ(x_μ − x_j)   (3.11)
where x_μ and x_j are the locations of the output* neuron μ and of the input* unit j, as is the case in the works of Linsker and Atick et al. In fact the study of binary units with the restriction 3.11 is much more difficult [see Bialek and Zee (1988) for the study of a binary perceptron*
used for a discrimination task under the condition 3.11]. In particular, one does not know the capacity in that case (all we can say is that it is smaller than the one we computed without restricting the couplings* by 3.11). However, in our statistical approach we have a statistical invariance under translation: the correlations within the couplings* do not depend on the index μ.

3.3 Decoding and Learning. When considering the perceptron as a neural encoder, it is natural to ask for the possibility of decoding, that is, of reconstructing the input image that generated a particular codeword. This aspect may not be of any biological relevance, first because it is likely that the system does not need to perform such a reconstruction, and second because the decoding process as considered below has no reason to be biologically plausible. However, it is a legitimate question from the information processing point of view. Of course one cannot recover exactly the input image, since many different inputs give the same output configuration; indeed we have at our disposal only a finite amount of information about J knowing the codeword σ, this information being precisely equal to −ln P_σ. However, one can produce one input configuration among those that produce this same codeword: a prototype of all the images considered as identical by the coding system. This can be done by considering precisely the perceptron A, as we now explain. From the knowledge of the codeword σ and of the couplings* Ξ, we want to generate a pattern* J that satisfies the p equations 2.2. Hence one can conveniently come back to the first interpretation of these equations, by saying that we are looking for couplings realizing the p associations (ξ^μ, σ^μ), μ = 1, …, p. This problem can be solved algorithmically by the use of perceptron type algorithms (Minsky and Papert 1988). Such algorithms are known to converge whenever a solution exists, which is the case here since the true input pattern is of course one particular solution. One may look for the coupling vector (that is, the input pattern*) most likely to have produced the codeword. This is the standard strategy in the task of learning a rule by example (Levin et al. 1990; Grassberger and Nadal 1994). Hence, one sees that maximum likelihood decoding for A* [with p(J) as input* distribution] is equivalent to Bayesian learning for A [with p(J) as prior distribution].
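As a concrete illustration of the decoding step, the sketch below (ours, with hypothetical sizes) runs the classical perceptron rule with the roles exchanged, producing one pattern* J consistent with a given codeword.

import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 40
Xi = rng.standard_normal((p, N))
J_true = rng.standard_normal(N)
sigma = np.sign(Xi @ J_true)           # codeword actually produced by J_true

J = np.zeros(N)                        # the unknown input* now plays the coupling role
for _ in range(100_000):
    wrong = np.flatnonzero(np.sign(Xi @ J) != sigma)
    if wrong.size == 0:
        break                          # all p sign constraints (equations 2.2) hold
    mu = wrong[0]
    J += sigma[mu] * Xi[mu]            # standard perceptron update

print(np.all(np.sign(Xi @ J) == sigma), np.allclose(J, J_true))  # True, (almost surely) False

The recovered J is a prototype of all inputs mapped to this codeword; it generally differs from J_true, consistent with the finite information −ln P_σ available about the input.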
3.4 Source Entropy and Storage Resources. The last point of comparison that we will make bears on another aspect of the efficiency of the perceptrons. In the context of supervised learning, one is interested in comparing the amount of information stored to the number of bits used for storing the couplings. It is thus convenient to assume here a finite (possibly very large) number of bits per synapse, K. There are thus NK bits available, but since the couplings are taken from the prior distribution p, the total number of bits that is effectively available may be smaller, being given by the entropy of the source,

I_0 = −Σ_J p(J) log₂ p(J)   (3.12)

Clearly the amount of information I that can be stored in the couplings cannot be larger:

I ≤ I_0   (3.13)
In the language of the dual perceptron, 3.13 states that the mutual information cannot be larger than the information content of the source, I_0. How does this relate to the information capacity? The information capacity that we have computed was for continuous couplings (K infinite). If one limits the number of bits to K, then there are only 2^{NK} different coupling vectors. In the partition of the J space induced by the patterns, some of the domains may be empty. If we call A_K(Ξ) the number of nonempty domains, the capacity is now

C_K(Ξ) = ln A_K(Ξ) ≤ C   (3.14)
and this capacity is clearly bounded above by NK (more generally, I_0 is the upper bound of the capacity if restricted to distributions of maximum entropy I_0). Few analytical results are known for the case of discrete couplings, apart from the critical storage capacities α_c for various choices of discrete couplings (Krauth and Mézard 1989; Gutfreund and Stein 1990). For example, if one takes binary couplings (J_j = ±1) (which corresponds to binary inputs*), the asymptotic information capacity c is equal to α ln 2 up to α_c ≃ 0.83 (Krauth and Mézard 1989) (hence for binary inputs* the capacity is equal to α ln 2 only up to α_c ≃ 0.83).

4 Conclusion
We have shown that the existence of a duality property between two perceptrons allows comparison of the theoretical analysis of supervised and unsupervised learning tasks. The questions that are relevant in one case are intimately related to those relevant for the other perceptron. In particular the information capacity has a nice dual meaning. But there are important differences, expressed mainly in the fact that in a supervised learning task one is interested in the performance of one given choice of outputs, whereas in the unsupervised task it is the average properties over all possible outputs that matter. We have shown also that statistical physics tools can be used for studying the typical properties of the dual perceptron A*. We present elsewhere (Nadal and Parga 1992, 1993) the detailed analysis. We have considered the simplest neural architecture. However, we stress that the duality can be extended to a general learning machine, as
shown in detail in the Appendix. In the introduction we mentioned that the duality can be viewed as an implementation of the exchange between model and data that appears in the Bayesian approach to learning: this is made explicit in the Appendix. Considering the extension to a general learning machine is useful in particular in order to identify the specific roles of the number of couplings, the VC dimension, and the number of inputs, which are all equal in the case of the perceptron. The main result is that essentially all that we have said for the perceptron remains valid in the general case, provided one interprets α as the ratio of p to the VC dimension (instead of the ratio of p to the number of couplings). This result is based on an upper bound for the number of domains, which is given by Vapnik (1982) in his book, and we point out in the Appendix that this bound is optimal. In fact it appears that the perceptron plays a special role: among all learning machines having the same VC dimension, d_VC = d, the perceptron (with N = d) is the one that has the largest information capacity C = ln A(Ξ) for p larger than d. Finally we note that we have restricted our study to a deterministic perceptron. We are presently working on noisy systems: most of what we have said remains valid, although noise introduces additional (and interesting) differences between the two types of learning tasks.

Appendix: Dual Learning Machines

A.1 Statement of the Problem. Results obtained for the perceptron may be misleading: the number of couplings, the number of inputs, and the VC dimension are all equal. Moreover, the number of domains is independent of the choice of the patterns (if they are in general position or if chosen at random). It is thus useful to consider the case of a general learning machine. We will see that all that is valid for the perceptron remains nearly valid in the general case, provided one identifies the specific role of the various parameters. Let us thus consider a machine defined by a given architecture A, with a set of N adjustable parameters (couplings) J = {J_j, j = 1, …, N}. If M is the dimension of the input space, the machine A associates with each input ξ = {ξ_i, i = 1, …, M} a binary output σ (Fig. 4). One wants to study the heteroassociative task, where for p input patterns
Ξ = {ξ^μ, μ = 1, …, p}   (A.1)
one is given the set of the desired outputs,

τ = {τ^μ = 0, 1; μ = 1, …, p}   (A.2)
For a given choice of couplings the outputs are σ = {σ^1, …, σ^p}, with σ^μ = A(J, ξ^μ). The input patterns are chosen from some distribution p*, and the desired outputs are chosen at random.
Figure 4: The learning machine A.
Figure 5: The dual machine A*.
The main questions that one asks are: What is the storage capacity (the maximal number of input-output pairs that can be learned)? What is the maximal amount of information that can be stored in the couplings? And if the task is to learn a rule by example, what is the probability of making an error on a new pattern as a function of the fraction of errors made on the training set (Ξ, τ)? The statistical physics (or Bayesian) approach to learning (Gardner 1988; Levin et al. 1990; Grassberger and Nadal 1994) forces us to study a statistical ensemble of machines, the couplings being taken from some prior distribution p(J). For example, if one looks for discrete couplings, p(J) may give equal weight to every possible choice of couplings. Having specified the machine A and the distributions p* and p, we can now introduce the dual statistical ensemble of learning machines, as shown in Figure 5. The inputs to A* are N-dimensional patterns* J, occurring with probability p(J). We want to study A* as a neural encoder, or as a module in a communication system, as briefly explained above. Here the main questions are: For a given distribution p(J), what is the information that the output conveys about the input* (the mutual information between σ and J)? What is the maximal amount of information that can be conveyed, irrespective of p(J) (or what is the channel capacity of A* if used as a channel)? What is the choice of the parameters* Ξ [or the choice
of p*(Ξ)] that optimizes the performance of A* (according to some cost function to be specified)?

A.2 The Number of Domains. One can relate the questions listed for the study of A to those for A* exactly as for the case of the perceptron. The first step is made by considering the number of domains A(Ξ). For a general machine, this number will depend on the data Ξ. Note also that one domain, which is the set of points in J space associated with one given configuration σ, need not be connected. The information capacity for a given Ξ is thus
C(Ξ) = ln A(Ξ)   (A.3)
For a large system, we may expect C(Ξ) to be a self-averaging quantity, so that one is interested in the average value

C̄ = ⟨ln A(Ξ)⟩   (A.4)
where ⟨·⟩ means the average with respect to p*(Ξ). C̄ is thus the typical amount of information that can be stored in the couplings, or the typical information capacity of the neural encoder. In the context of learning a rule by example, it has been shown by Vapnik (1982) that generalization is guaranteed (that is, the probability of making an error on a new input pattern will tend toward the fraction of errors made on the training set) if

lim_{p→∞} C̄/p = 0   (A.5)
One may define a "typical" VC dimension d_t as the first value of p for which C̄ is smaller than p, and the critical storage capacity as the maximal value of α_t = p/d_t for which lim C̄/p = 1. This storage capacity should correspond to what is computed by statistical mechanics tools. A sufficient condition for A.5 to be true is that the VC dimension is finite. The VC dimension is defined relative to the worst case (or the best case, it depends on the point of view): one considers the maximal value of the number of domains,

A_m = max_Ξ A(Ξ)   (A.6)
In Vapnik (1982) and Vapnik and Chervonenkis (1971) A_m is called the growth function. It depends on the number of patterns, p, and on the architecture A. Again A_m is at most equal to 2^p, and the VC dimension d_VC is equal to the first value of p for which A_m < 2^p. In 1968 Vapnik and Chervonenkis (1971, 1982) showed the remarkable result that when d_VC is finite,

A_m ≤ A(d_VC, p)   (A.7)
where A(d_VC, p) is the number of domains for a perceptron with N inputs and p patterns in general position, that is (see 2.5),

A(d_VC, p) = Σ_{k=0}^{d_VC} C_p^k   for p > d_VC   (A.8)
where C_p^k = p!/[k!(p − k)!]. We have thus an upper bound C(d_VC, p) = ln A(d_VC, p) for the information capacity C. Note that this bound is optimal: the bound is valid for all learning machines having the same value d of the VC dimension, and the bound is saturated for at least one of these machines, the perceptron for which N = d. For a large network, d_VC is large. It is then convenient to define α by

α ≡ p/d_VC   (A.9)
and the curve shown in Figure 3 appears as a universal curve. One has in the large size limit (d_VC → ∞ for a given ratio α)

lim_{d_VC→∞} C̄/d_VC = C̄(α) ≤ lim_{d_VC→∞} C(d_VC, p)/d_VC = c(α)   (A.10)

c(α) = α ln 2   if α ≤ α_c = 2   (A.11)
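The bound A.8 and the universal curve A.10-A.11 are easy to evaluate numerically; the following sketch (ours, not from the paper) checks that ln A(d_VC, p)/d_VC approaches α ln 2 for α ≤ 2 and departs from that value only on the α > 2 branch.

import numpy as np
from math import lgamma, log, exp

def log_A(d_vc, p):
    # ln of sum_{k=0}^{d_vc} C(p, k), evaluated stably via log-sum-exp
    logs = [lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
            for k in range(min(d_vc, p) + 1)]
    m = max(logs)
    return m + log(sum(exp(v - m) for v in logs))

for alpha in (0.5, 1.0, 2.0, 3.0):
    for d_vc in (100, 400):
        p = int(alpha * d_vc)
        print(f"alpha={alpha}  d_vc={d_vc}  C/d_vc={log_A(d_vc, p)/d_vc:.3f}"
              f"  alpha*ln2={alpha*log(2):.3f}")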
It is also interesting to note the meaning of C as an information content in the context of learning a rule: generalization occurs when adding a new pattern does not bring much new information. To conclude this section, one sees that one can extrapolate the results obtained for the perceptron to more general learning machines, provided one interprets α as the ratio of the number of patterns to the VC dimension, and one interprets C = ln A(N, p) as an upper bound for the information capacity if N is replaced by d_VC.

Acknowledgments

We would like to thank M. Mézard, R. Monasson, N. Sourlas, and G. Toulouse for fruitful discussions. We also thank G. Toulouse for a critical reading of the manuscript. We thank S. Verdú for discussions on information theory. We thank T. Watkin for checking the English correctness. One of us (N.P.) would like to warmly thank the Laboratoire de Physique Statistique of the École Normale Supérieure (Paris), and the Laboratorio di Fisica and the INFN of the Istituto Superiore di Sanità (Rome) for the hospitality received. This work was partly supported by the programme Cognisciences of C.N.R.S.
References

Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. Rosenblith, ed., p. 217. MIT Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Bialek, W., and Zee, A. 1988. Understanding the efficiency of human perception. Phys. Rev. Lett. 61, 1512-1515.
Blahut, R. E. 1988. Principles and Practice of Information Theory. Addison-Wesley, Cambridge, MA.
Brunel, N., Nadal, J.-P., and Toulouse, G. 1992. Information capacity of a perceptron. J. Phys. A: Math. Gen. 25, 5017-5037.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326.
Gardner, E. 1988. The space of interactions in neural networks models. J. Phys. A: Math. Gen. 21, 257.
Grassberger, P., and Nadal, J.-P., eds. 1994. From Statistical Physics to Statistical Inference and Back. Kluwer Academic Publishers, Dordrecht.
Gutfreund, H., and Stein, Y. 1990. Capacity of neural networks with discrete synaptic couplings. J. Phys. A: Math. Gen. 23, 2613.
Hertz, J., Krogh, A., and Palmer, R. G. 1990. Introduction to the Theory of Neural Computation. Addison-Wesley, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer, Berlin.
Krauth, W., and Mézard, M. 1989. Storage capacity of memory networks with binary couplings. J. Phys. (France) 50, 3057.
Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered networks. In Proceedings of the 1989 Workshop on Computational Learning Theory, COLT'89.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21, 105-117.
Mézard, M., Parisi, G., and Virasoro, M. 1987. Spin Glass Theory and Beyond. World Scientific, Singapore.
Minsky, M. L., and Papert, S. A. 1988. Perceptrons. MIT Press, Cambridge, MA.
Monasson, R. 1992. Properties of neural networks storing spatially correlated patterns. J. Phys. A: Math. Gen. 25, 3701-3720.
Nadal, J.-P., and Parga, N. 1992. Information processing by a perceptron. In Neural Networks from Biology to High Energy Physics, Elba 1992. Int. J. Neural Syst. 3 (Suppl. 1992), 41-50.
Nadal, J.-P., and Parga, N. 1993. Information processing by a perceptron in an unsupervised learning task. Network 4, 295-312.
Peretto, P. 1992. An Introduction to the Modeling of Neural Networks. Cambridge University Press, Cambridge.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer, New York.
Vapnik, V. N., and Chervonenkis, A. Ya. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Received December 16, 1992; accepted July 15, 1993.
Communicated by Steven J. Nowlan
Finding the Embedding Dimension and Variable Dependencies in Time Series
Hong Pi  Carsten Peterson
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-223 62 Lund, Sweden
We present a general method, the δ-test, which establishes functional dependencies given a sequence of measurements. The approach is based on calculating conditional probabilities from vector component distances. Imposing the requirement of continuity of the underlying function, the obtained values of the conditional probabilities carry information on the embedding dimension and variable dependencies. The power of the method is illustrated on synthetic time series with different time-lag dependencies and noise levels and on the sunspot data. The virtue of the method for preprocessing data in the context of feedforward neural networks is demonstrated. Also, its applicability for tracking residual errors in output units is stressed.
1 Introduction
The behavior of a dynamic system is often modeled by analyzing a time series record of certain system variables. Using artificial neural networks (ANN) to model such systems has recently attracted much attention. The success of such models relies heavily upon identifying the underlying structure in the time series; it is advantageous to know in advance the embedding dimension, the most relevant inputs, the noise level, etc. In this paper we devise a simple and easy-to-use method, based on continuity requirements on statistical measures, for identifying such essential properties in a time series record. Even though the language is that of time series, the approach applies to any continuous function mapping problem. Time series can have a wide range of behavior, ranging from being entirely random and uncorrelated to being completely deterministic. In reality one is often in between these two extremes. Existing approaches to determine dependencies are either based on entropy measures (Kolmogorov 1959; Farmer 1982) or on elaborate autocorrelation measures (Russell et al. 1980; Grassberger and Procaccia 1983; Brock et al. 1988; Savit and Green 1991). Our approach, which has its roots in the latter philosophy, aims at determining the embedding dimension, pinning down the sensitivity to the various variables, and establishing noise levels.

Neural Computation 6, 509-520 (1994) © 1994 Massachusetts Institute of Technology

Brock et al. (1988) have devised a method (the BDS test) to test the null hypothesis of whether or not a sequence of numbers is IID (independently and identically distributed random numbers). This test was further developed by Savit and Green (1991) into a conditional probability approach in which the degree of variable dependence may be quantified. Although the latter method has its merit in brevity, the message it gives is not without ambiguities. Inspired by the work of Savit and Green (1991), we propose a method (the δ-test) from a different viewpoint, which exploits the definition of function continuity. This definition, which can be easily connected to the behavior of the conditional probabilities, gives clear signatures (apart from ambiguities arising from insufficient statistics) with respect to variable dependencies and the embedding dimensionality. To our knowledge the proposed method does not exist in the literature, despite its conceptual simplicity. The existing approaches mentioned above in general aim at establishing some invariant fractal dimensional measure. The numerical implementations usually involve some box counting algorithm. Meaningful interpretation of the outcome of these algorithms as the box size is reduced relies heavily on a scaling assumption. The estimate of the embedding dimension can be viewed as a byproduct of estimating fractal dimensions with these algorithms. In this paper we do not attempt to establish yet another invariant nonlinearity measure. Rather, our aim is to pick out variable dependencies and identify the minimum embedding dimension directly from the data. By exploiting the properties of continuous functions we need no scaling assumption. Whereas the traditional line of approach leads naturally to the BDS statistic, which tests against the null hypothesis of an IID sequence, the δ-test tests the hypotheses at both extremes: IID or a deterministic map. In a large class of models, in particular the neural network models for time series prediction and system identification problems, the existence of a function mapping is inherently assumed. The δ-test provides good measures of the truthfulness of such assumptions and gives an estimate of how successful these models can be in reproducing the sequence of data. This is a strength the traditional approaches are lacking. Successful explorations are made on different maps with and without noise and with a variety of time-lag dependencies. Also, the underlying dynamics of a sunspot series is studied. The relevance of the method for feedforward network training is illustrated with the sunspot series, where it is shown that feeding the network with the established minimum embedding dimension vectors gives rise to state-of-the-art generalization performance. Also, the method can be used to track residual dependencies of the output errors in a Multilayer Perceptron (MLP).
2 General Formulation
Consider a discrete-time system, manifested as a time series x_t, t = 1, 2, 3, …, N, where we are interested in knowing if there exists a continuous map relating future values to the past ones (i.e., if it is possible to identify a state equation)¹

x_t = f(x_{t−1}, x_{t−2}, …, x_{t−d}) + r_t   (2.1)
The "noise"-term Y, represents an indeterminable part that originates either from insufficient dimension of the measurements or from real noise. In general rf should decrease with d. If the system is completely deterministic, Tr should vanish entirely as d exceeds the minimum embedding dimension d,,,. For a map given as a sequence of measurements, one wants to know (1) the minimum embedding dimension dmin,(2) the sensitivjty of Xr with respect to each of the dependent variables, and (3) an estimate of the size of the noise. By "variable dependence" we mean the primary dependence, not the induced ones. If xf = f ( x t - 1 ) = f ( f ( x f - 2 ) ) we say that xt-lis the primary dependent variable, the induced dependence on x t - 2 is of no interest. Variables with no primary dependence are denoted "irrelevant." We approach the problem by constructing conditional probabilities in embedding spaces of various dimensions d. The time series in equation 2.1 is represented as a series of N points z(i) in a ( d + l)-dimensional space (d = 0,1,2, . . .I
z(i) = (z_0(i), z_1(i), …, z_k(i), …, z_d(i))   (2.2)
where z_k(t) = x_{t−k}. The distances between the kth components of two vectors z(i) and z(j) are defined as

l_k(i, j) = |z_k(i) − z_k(j)|,   k = 0, 1, …, d   (2.3)
Given a set of positive numbers, ε and δ = (δ_1, …, δ_d), one can construct the following joint probabilities from the data:

P(l_0 ≤ ε, l ≤ δ) = n(l_0 ≤ ε, l ≤ δ)/N_pair   (2.4)

P(l ≤ δ) = n(l ≤ δ)/N_pair   (2.5)
where N_pair is the total number of vector pairs, and n(l_0 ≤ ε, l ≤ δ) and n(l ≤ δ) are the numbers of pairs satisfying the corresponding distance constraints. Throughout this paper we freely use the notation l ≤ δ for ((l_1 ≤ δ_1), (l_2 ≤ δ_2), …, (l_d ≤ δ_d)). Also we set δ_i = δ for all i.

¹The series is assumed to be bounded and stationary, and x_t can take either real or complex values.
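For concreteness, a brute-force evaluation of these probabilities might look as follows (our sketch, not the authors' code; it is O(N²) in the number of pairs and so only suitable for short series).

import numpy as np

def P_cond(x, d, eps, delta):
    N = len(x)
    # delay vectors z(i) = (x_i, x_{i-1}, ..., x_{i-d}), one row per usable index
    z = np.stack([x[d - k : N - k] for k in range(d + 1)], axis=1)
    i, j = np.triu_indices(len(z), k=1)            # every vector pair counted once
    diff = np.abs(z[i] - z[j])                     # component distances l_k(i, j)
    cond = np.all(diff[:, 1:] <= delta, axis=1)    # l_1, ..., l_d <= delta
    joint = cond & (diff[:, 0] <= eps)             # ... and additionally l_0 <= eps
    return joint.sum() / max(cond.sum(), 1)

x = np.random.default_rng(2).random(400)           # IID test series
print(P_cond(x, d=1, eps=0.1, delta=0.05))         # close to P_0(eps) for IID data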
Next we form the conditional probabilities

P_d(ε | δ) ≡ P(l_0 ≤ ε | l ≤ δ) = P(l_0 ≤ ε, l ≤ δ)/P(l ≤ δ)   (2.6)

How is P(l_0 ≤ ε | l ≤ δ) expected to vary under different conditions? The following important observations can be made:

1. For a completely random time series one has

P_d(ε | δ) = P_0(ε)   (2.7)

This identity, which should be understood in a statistical sense, holds for any choice of positive ε and δ.

2. If a continuous map exists as in equation 2.1 with no intrinsic noise, then for any ε > 0 there exists a δ_ε such that

P_d(ε | δ) = 1   for δ ≤ δ_ε and d ≥ d_0   (2.8)
The smallest integer d_0 for which equation 2.8 holds is identified with (d_min − 1).

3. In the presence of noise r_t, P_d(ε | δ) will no longer saturate to 1 as ε becomes smaller than the width Δr_max of the noise.
Equation 2.8 is a direct consequence of the definition of function continuity, which states that if z_0 = f(z_1, …, z_d), then for any ε > 0 there exists a δ > 0 such that the conditions |z_1 − z'_1| < δ, …, |z_d − z'_d| < δ guarantee |z_0 − z'_0| < ε. With the presence of noise, however, |z_0 − z'_0| = |f(z_1, …) − f(z'_1, …) + r − r'|, which as δ → 0 cannot be made smaller than Δr = |r − r'|. This justifies statement 3. If we assume a flat noise distribution extending from −r to r, with standard deviation σ_r = r/√3, we get Δr_max = 2r = 2√3 σ_r, which gives an upper limit estimate on σ_r knowing Δr_max. How does P_d(ε | δ) vary as a function of δ for fixed ε? For δ → ∞ the conditions have no effect. Hence one has P_d(ε | δ)|_{δ→∞} = P_0(ε). As δ → 0, P_d(ε | δ) should increase monotonically and saturate to 1 for d ≥ d_0. This behavior is shown schematically in Figure 1a. The approach of Savit and Green (1991) was based on the identity P_d(ε | δ)|_{δ=ε} = P_{d−1}(ε | δ)|_{δ=ε} and establishes variable dependence when this identity is violated. However, ambiguities arise whenever irrelevant variables induce sizable changes in P_d − P_{d−1} at δ = ε. This effect often occurs due to nonuniform curvatures of trajectories. For this reason we examine the maxima

P_d(ε) = max_{δ>0} P_d(ε | δ) = P_d(ε | δ)|_{δ≤δ_ε}   (2.9)
Figure 1: (a) P_d(ε | δ) as a function of δ for fixed ε. (b) The maxima P_d(ε) as a function of ε. Saturation to 1 would be observed for d ≥ d_0. In the presence of noise the saturation deviates from 1 around ε_0 ∼ Δr_max.
Saturation of the maxima as d increases singles out the irrelevant variables. How the maxima P_d(ε) change with d and ε provides basically all the information we need (see Fig. 1b). P_d(ε) measures how well the dynamics can be modeled in terms of the d variables. To quantify the dependence on each of the variables, it is convenient to define a dependability index²

Λ_d(ε) = (P_d(ε) − P_{d−1}(ε))/(1 − P_0(ε))   (2.10)

and its average over ε,

λ_d = ⟨Λ_d(ε)⟩_ε   (2.11)

For a noise-free deterministic map, P_d(ε) saturates to 1 for d ≥ d_0 and one has

Σ_{d=1}^{d_0} λ_d = 1   (2.12)

²This is similar to the index defined by Savit and Green (1991), but with a different definition of P_d(ε) and normalization.
A variable is considered irrelevant if its inclusion in the condition does not raise the conditional probability to a higher plateau (λ_d ≈ 0). In the approach of Savit and Green (1991), negative indices of statistical significance complicate the issue of identifying dependent variables. In our case it can be shown that Λ_{d=1} ≥ 0, Λ_{d_0} > 0, and Λ_{d>d_0} = 0. Thus negative indices for d > d_0 can arise only from statistical fluctuations. For 1 < d < d_0 we expect negative Λ_d to be largely due to limited statistics in the maximization procedures. Anything beyond statistical problems can easily be clarified by inspecting whether the saturation to 1 is affected by treating the variable as irrelevant. So far the formalism has assumed an infinite amount of data. With limited statistics, very low ε values or large d may give rise to a picture not as crisp as the one in Figure 1. To estimate the errors we use the standard estimator

ΔP_d(ε | δ) = [P_d(1 − P_d)/n(l ≤ δ)]^{1/2}   (2.13)
This error expression is not entirely adequate when correlations exist in the time series. It nevertheless serves the purpose of signaling whenever a statistically unreliable region is crossed into. Theoretically P_d(ε | δ) is expected to be a smooth and monotonically decreasing function of δ. It is generally flat near δ = 0 and has a plateau that extends to a finite δ_ε. Finding its approximate maximum is not difficult, provided that there is a reasonable amount of statistics to probe the plateau region. In cases with very limited statistics it is advantageous to set an irrelevant variable k inactive, which means that the condition l_k ≤ δ is omitted when computing P_d(ε | δ) for d > k. By removing the unnecessarily restrictive conditions in this way, improved statistics are obtained without affecting the evaluation of the P_d's in higher dimensions.

3 Implementation Issues
Next we give a systematic procedure, following the above, on how to read off the key properties of P_d(ε | δ) given a data set. We propose the following scheme:

0. Compute P_0(ε).

1. Starting with d = 1, compute P_d(ε | δ) and find the maxima P_d(ε).

2. Evaluate the dependability index λ_d.

3. (Optional) Variable elimination: if the dth variable is irrelevant (λ_d ≈ 0), set it inactive.

4. Increment d by 1 and repeat steps 1-3 until P_d(ε) saturates.

5. Identify d_0 for which P_d(ε) begins to saturate, and also the point ε_0 for which P_{d_0}(ε) begins to deviate from 1. The minimum embedding dimension is then d_0 + 1, and the noise width is estimated as ε_0.
If option 3 is used, the dependability index in equation 2.10 is modified according to Λ_d(ε) = (P_d(ε) − P_{d'}(ε))/(1 − P_0(ε)), where d' ≤ d − 1 is the nearest active variable, and the summations in equation 2.12 are restricted to active variables only. In what follows, ε and δ are expressed in units of σ, the standard deviation of the data set. To obtain a map of P(l_0 ≤ ε, l ≤ δ) one can go through the data set and recount the statistics every time ε and δ are changed, or one can discretize the ε-δ plane, record the statistics in each bin, and then sum up the contents of all bins progressively, such that P(l_0 ≤ ε, l ≤ δ) is obtained for a grid of ε-δ values by going through the data only once. We adopted the latter approach for the sake of computing speed and discretized the ln ε-ln δ plane by 30 × 30 bins with ln 10⁻⁴ ≤ ln ε ≤ ln 4 and ln 10⁻⁴ ≤ ln δ ≤ ln 4. A lower cut ε_min = 0.1 is used for the ε-integrations in equation 2.11. These parameters are not critical to the outcome of the method. It is possible to design algorithms without them.
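A sketch of this one-pass binning bookkeeping (our reading of the procedure, not the authors' code) follows; with δ_i = δ for all i, a pair satisfies l ≤ δ exactly when max_k l_k ≤ δ, so one 2D histogram plus cumulative sums yields the joint probability on the whole grid.

import numpy as np

def P_joint_grid(x, d, edges):
    N = len(x)
    z = np.stack([x[d - k : N - k] for k in range(d + 1)], axis=1)
    i, j = np.triu_indices(len(z), k=1)
    diff = np.abs(z[i] - z[j])
    l0, lmax = diff[:, 0], diff[:, 1:].max(axis=1)
    H, _, _ = np.histogram2d(l0, lmax, bins=[edges, edges])
    n = H.cumsum(axis=0).cumsum(axis=1)      # n(l_0 <= eps, l <= delta) per grid corner
    return n / len(l0)                        # P(l_0 <= eps, l <= delta) on the grid

# 30 logarithmic bins between 1e-4 and 4, plus an underflow bin so no pair is dropped
edges = np.concatenate(([0.0], np.exp(np.linspace(np.log(1e-4), np.log(4.0), 31))))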
4 Applications

We apply the method to two synthetic time series and to the sunspot data. For the logistic map we generate 4000 time-steps according to

x_t = q x_{t−1}(1 − x_{t−1}) + r_t   (4.1)

with q = 4, giving σ = 0.35. The iterative noise r_t consists of random numbers uniformly distributed in (−r, r). Two data sets are generated, with r = 0 (1) and r = 0.28σ (2), respectively. In order to keep the series bounded, r_t is constrained such that x_t ∈ (0, 1). In Figure 2a we show P_d(ε | δ) versus δ at ε = 0.085 for various d. We observe a saturation of P_d(ε | δ) to 1 for d ≥ 1 in case (1), as expected for a one-dimensional noise-free map. The conditional probabilities still saturate in case (2), but only to 0.55, which indicates the presence of noise. In Figure 2b the maximized conditional probability P_d(ε) is shown as a function of ε for various dimensions d. For the noise-free data P_d(ε) saturates to 1 for d ≥ 1 in the entire ε range. For the noisy data P_d(ε) still saturates approximately for d ≥ 1, but the saturation value starts to fall off from 1 somewhere between 0.3 and 0.7, implying a noise level r = |Δr_max|/2 between 0.15σ and 0.35σ, which is consistent with the generated noise level. Calculating the dependability indices gives λ_1 = 1 for the r = 0 data, λ_1 = 0.97 for the noisy data, and λ_{d≥2} ≈ 0.
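A data set of this kind is simple to generate; in the sketch below (ours), the constraint keeping x_t inside (0, 1) is implemented by redrawing the noise term, which is our reading of the description above.

import numpy as np

rng = np.random.default_rng(3)
q, sigma, T = 4.0, 0.35, 4000
r0 = 0.28 * sigma                       # noise amplitude of data set (2)
x = np.empty(T)
x[0] = rng.random()
for t in range(1, T):
    clean = q * x[t - 1] * (1.0 - x[t - 1])
    while True:
        x_new = clean + rng.uniform(-r0, r0)
        if 0.0 < x_new < 1.0:           # constrain r_t so the series stays bounded
            break
    x[t] = x_new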
Figure 2: Logistic map: (a) P_d(ε | δ) as a function of δ for noise-free (r = 0) and noisy (r = 0.28σ) data. The curves represent d = 0 (solid), d = 1 (dashed), d = 2 (dash-dotted), and d = 3 (dotted), respectively. (b) P_d(ε) as a function of ε. The curves correspond to noise-free data (same notation as in a) and the points to the noisy data. The upper curve displays the fall-off of P_d(ε) from unity on an enlarged scale.
For the Hénon map we generate 4000 time-steps according to

x_t = 1 − a(x_{t−2} − r_{t−2})² + b(x_{t−4} − r_{t−4}) + r_t   (4.2)

with a = 1.4 and b = 0.3, giving σ = 0.723. This is the usual Hénon map with the dependencies stretched to larger lags and with the noise additively applied. Again we use two data sets, with r = 0 (1) and r = 0.14σ (2). The variable elimination option is used here. We obtain λ_{1-4} = 0.002, 0.886, −0.023, 0.114 for (1) and 0.052, 0.728, 0.004, 0.128 for (2). The dependencies on x_{t−2} and x_{t−4} emerge as large values of λ_2 and λ_4. For (1), λ_2 + λ_4 = 1, indicating a noise-free map, whereas for (2) the value is 0.86, signaling the presence of noise.

The sunspot data (Priestley 1988) contain the annually averaged sunspot activities from 1700 to 1979 (280 points). This is a very limited statistics data set. The resolution limit, from which to extrapolate the ε → 0 behavior, is set by ε ≈ 0.95. For that reason we explore two different input representations, linear and logarithmic; different sensitivities may give rise
Figure 3: P_d(ε | δ) for the sunspot data, together with estimated errors, shown as a function of δ for ε = 0.95σ and various dimensions d (marked on the curves); (a) is for the embedding (x_t, x_{t−1}, …) and (b) is for the embedding (x_t, ln(1 + x_{t−1}), …). The d's for which the plateau fails to rise are omitted.

to a more complete picture. Using the variable elimination option, P_d(ε | δ) is shown in Figure 3 for the two cases. From Figure 3a we see that the probability with conditional variables x_{t−1}, x_{t−2}, x_{t−3}, x_{t−9}, and x_{t−10} approximately saturates to 1, whereas in the logarithmic representation (Fig. 3b) the variables x_{t−1}, x_{t−2}, x_{t−3}, and x_{t−4} show up as relevant. Probing into smaller ε with more statistics would clarify the situation. Based on the results of Figure 3 we approximately determine the embedding dimension to be around 7, with x_{t−1}, x_{t−2}, x_{t−3}, x_{t−4}, x_{t−9}, and x_{t−10} as the most important variables, and we find that the sunspot data are masked by large noise with amplitude on the order of 0.47σ.
5 Impact on Neural Network Learning

Feeding an MLP with redundant variables should be avoided, since fitting to noise increases the difficulty of learning and may give rise to poor generalization. In Weigend et al. (1990) an MLP with a layer of 8 hidden nodes is trained on the sunspot series data. The authors experiment
with different numbers of time-lags as inputs, 6, 12, and 24, and conclude on the basis of generalization performance that 12 is optimal and hence that this number reflects the embedding dimension. We instead use the δ-test results above as a guide to the relevant input variables. Following the procedures (without weight elimination) of Weigend et al. (1990), we have trained MLPs with 8 sigmoidal hidden units, 1 linear output unit, and various numbers of input units. With x_{t−1}, x_{t−2}, x_{t−3}, x_{t−4}, x_{t−9}, x_{t−10} as inputs, we find for a 6-8-1 MLP a ratio of the mean square error to the variance³ (ARV) of 0.073 on the test data from 1921 to 1955. This performance is as good as the one achieved by Nowlan and Hinton (1992) using more sophisticated algorithms. In Figure 4 we show the learning curves of the 6-input network together with those of a network using 12 lag variables, as used by Weigend et al. (1990). It is also interesting to notice that after training the network with weight elimination, Weigend et al. (1990) find large connections to the hidden nodes from the inputs x_{t−1}, x_{t−2}, and x_{t−9}, consistent with the δ-test findings. The δ-test is also very powerful when it comes to analyzing the residual error of output units in MLP learning. With perfect training the δ-test should identify the residual series as an independent random series. If this is not the case, the test singles out the relevant input units from which information has not been fully extracted by the network.
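The 6-8-1 experiment is straightforward to reproduce in outline. The sketch below (ours, not the original code; the variable `series` and the plain batch gradient descent are assumptions) builds the lagged input vectors and trains such a network.

import numpy as np

def make_patterns(series, lags=(1, 2, 3, 4, 9, 10)):
    t0 = max(lags)
    X = np.stack([series[t0 - l : len(series) - l] for l in lags], axis=1)
    return X, series[t0:]

def train(X, y, hidden=8, eta=0.1, epochs=20_000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden); b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)
        err = h @ W2 + b2 - y                    # linear output unit
        g = np.outer(err, W2) * (1.0 - h * h)    # error backpropagated to hidden layer
        W2 -= eta * h.T @ err / len(y); b2 -= eta * err.mean()
        W1 -= eta * X.T @ g / len(y);   b1 -= eta * g.mean(axis=0)
    arv = np.mean(err ** 2) / np.var(y)          # ARV on the training set
    return (W1, b1, W2, b2), arv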
6 Summary

We have devised a general method, the δ-test, for identifying dependencies in continuous functions. It is not limited to linear correlations, and it determines the embedding dimensions, dependencies, and noise levels fairly accurately even in cases of low statistics. Automated procedures for setting bin sizes, cutoffs, etc., and for error analysis are feasible. Being based on conditional probabilities, our approach at first sight appears very similar to that of Savit and Green (1991). However, conceptually the two are rather distinct. The latter is based on the Grassberger and Procaccia correlation integral, using δ = ε. In contrast, our method is based on the fundamental property of functional continuity (δ → 0). With the Savit and Green approach there are mainly two problems (Pi and Peterson 1994): (1) induced dependencies can show up in the index if nonuniform curvature is present in the function map; (2) there is no saturation measure to indicate the critical embedding dimension and no indication to quantify the noise level. We are able to extract information unambiguously from the region where P_d(ε | δ) is maximized. There are also similarities between the Kolmogorov entropy method (Kolmogorov 1959) and ours, in that both methods examine the behavior of a certain set of conditional probability distributions. However, evaluating the entropy of a high-dimensional joint probability distribution requires a huge amount of statistics, in contrast to the present method.
Figure 4: Learning curves for the sunspot data, shown as ARV versus training epochs for a 12-8-1 network (a) and a 6-8-1 network (b). The solid lines are for the training set (1700-1920), the lower dotted lines are for test set I (1921-1955), and the upper dotted lines are for test set II (1956-1979). Large fluctuations on the curves have been filtered out.
Acknowledgments

We are indebted to Richard Blankenbecler for bringing the work of Savit and Green (1991) to our attention.
References

Brock, W. A., Dechert, W. D., Scheinkman, J. A., and LeBaron, B. 1988. A test for independence based on the correlation dimension. University of Wisconsin preprint.
Farmer, J. D. 1982. Information dimension and the probabilistic structure of chaos. Z. Naturforschung 37A, 1304-1325.
Grassberger, P., and Procaccia, I. 1983. Measuring the strangeness of strange attractors. Physica D9, 189-208.
Kolmogorov, A. N. 1959. Dokl. Akad. Nauk SSSR 124, 754-755 [Entropy per unit time as a metric invariant of automorphisms. Math. Rev. 21, 386 (1960)].
Nowlan, S. J., and Hinton, G. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493.
Pi, H., and Peterson, C. 1994. To be published.
Priestley, M. B. 1988. Non-linear and Non-stationary Time Series Analysis, p. 223. Academic Press, New York.
Russell, D. A., Hanson, J. D., and Ott, E. 1980. Dimension of strange attractors. Phys. Rev. Lett. 45, 1175-1178.
Savit, R., and Green, M. 1991. Time series and dependent variables. Physica D50, 95-116.
Weigend, A., Huberman, B. A., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Received April 19, 1993; accepted September 13, 1993.
Communicated by Richard Lippmann
Comparison of Some Neural Network and Scattered Data Approximations: The Inverse Manipulator Kinematics Example
Dimitry Gorinevsky*  Thomas H. Connolly
Lehrstuhl B für Mechanik, Technische Universität München, D-80333 Munich 2, Germany

This paper compares the application of five different methods for the approximation of the inverse kinematics of a manipulator arm from a number of joint angle/Cartesian coordinate training pairs. The first method is a standard feedforward neural network with error backpropagation learning. The next two methods are derived from an extended Kohonen Map algorithm that we combine with Shepard interpolation for the forward computation. We compare the method of Ritter et al. for the learning of the extended Kohonen Map to our own scheme based on gradient descent optimization. We also study three scattered data approximation algorithms. They include two variants of the Radial Basis Function (RBF) method: Hardy's multiquadrics and gaussian RBF. We further develop our own Local Polynomial Fit method, which could be considered a modification of McLain's method. We propose extensions to the considered scattered data approximation algorithms to make them suitable for vector-valued multivariable functions, such as the mapping of Cartesian coordinates into joint angle coordinates.

1 Introduction

Although various artificial neural network (ANN) approaches are being increasingly used, their comparative benefits and drawbacks are not quite understood. In many applications, an ANN performs an approximation of some multivariable input-output mapping. However, it is not entirely clear whether the ANN approaches are advantageous compared to other methods for multivariable function approximation when implemented on a sequential computer. In this paper we compare the performance of several such methods on a rather basic problem related to robot control, that of inverse kinematics computation.

*Presently at Robotics and Automation Laboratory, Department of Mechanical Engineering, University of Toronto, Toronto, Canada M5S 1A4.

Neural Computation 6, 521-542 (1994) © 1994 Massachusetts Institute of Technology
In this paper we put the emphasis on the accuracy of the studied methods. For a given approximation domain, we study the dependence of the accuracy on the training set size. This differs from many related ANN papers, which try to obtain the best approximation precision for a fixed network structure and suppose that the training set is as large as needed. Recently, several papers on applications of ANNs to the approximation of inverse manipulator kinematics were published. Two methods commonly used for manipulator kinematics approximation are multilayered feedforward networks with error backpropagation learning and Kohonen self-organizing maps. Multilayered feedforward networks have been applied to estimate kinematics mappings of robotic devices ranging from 2DOF planar manipulators (Guez and Ahmad 1988; Nguyen et al. 1990; Watanabe et al. 1992) to 6DOF manipulators (Kozakiewicz et al. 1991; Watanabe et al. 1992). Application of the extended Kohonen Map approach developed by Ritter et al. (1991) to the inverse kinematics problem for a 3DOF manipulator was studied by Brause (1990) and Kieffer et al. (1990) and was implemented on a manipulator with a vision system by Ritter et al. (1991). Although the method of Ritter et al. does not completely match the framework we use for the comparison of the approximation methods, we consider it in this study because it is specific to the inverse kinematics approximation problem. We also derive our own version of the extended Kohonen Map approach that matches the framework and is demonstrated to provide significantly better precision. The method of Ritter et al. is biologically inspired and somewhat lacks a formal mathematical background. Our version of the method is derived from optimization of a standard loss function. In modern mathematics, much work on approximation exists, but only a small portion of it is applicable to the problem we consider. In the problem we examine, as in many other applications, the data to be approximated are multivariable and not available on a regular grid. Such a problem is called a Scattered Data Approximation problem. In the past 20 years a significant amount of research in this area, mostly related to computer graphics applications and experimental data processing (e.g., geophysical data), has been done. Some of the related work and surveys can be found in Franke and Nielson (1980), Franke (1982, 1987), Renka (1988), Alfeld (1989), and Powell (1992). Most Scattered Data Approximation methods compute the approximation from the available data directly, in a single step. On the other hand, most ANN methods are based on relatively simple computational algorithms that can be used only after a lengthy and computationally intensive optimization of the network internal weights (learning). One should, however, mention that the distinction between these two groups of methods is simply a question of terminology. Many recent papers consider modifications of Scattered Data Approximation algorithms, such as Radial Basis Function approximation, as ANN methods (Broomhead and
Lowe 1988; Moody and Darken 1989; Bishop 1991; Park and Sandberg 1991; Platt 1991). Generally, the ANN approximations could be considered a special class of methods for Scattered Data Approximation (Poggio and Girosi 1990; Sanner and Slotine 1992). We share the latter viewpoint. A Scattered Data Approximation problem is typically stated as follows. Let us consider a smooth multivariable function with values known at N points in a given argument domain:

y = φ(x) : R^K → R^M,  where x^{(j)} and y^{(j)} = φ(x^{(j)}), (j = 1, …, N), are known   (1.1)
The problem is to build an approximation ŷ of the function value φ(x) for an arbitrary x in the argument domain. The pairs x^{(j)}, y^{(j)} are called training set pairs. Although relatively extensive literature on ANN and Scattered Data Approximation methods exists, only a few recent papers attempt to compare the performance of some of these methods in application to approximation problems. The comparisons are performed on such problems as learning manipulator dynamics mappings (Atkeson 1991; Moody and Yarvin 1992), prediction of chaotic time series (Hartman and Keeler 1991; Platt 1991; Moody and Darken 1992), and some other problems (Mischo et al. 1991; Carlin et al. 1992). The goal of this paper is to present an extensive comparison of several methods on an inverse manipulator kinematics problem using various training set sizes. We consider three ANN methods: a multilayered feedforward network, also called a multilayered perceptron or MLP, and two versions of the extended Kohonen Map algorithm; and two Scattered Data Approximation methods: Radial Basis Function approximation and a Local Multivariable Polynomial Fit. The parameters compared are precision, sensitivity to noise in the data, and computation speed. Where such a comparison could be made, our results generally confirm the observations of other researchers. Some of this paper's results were preliminarily published in conference proceedings (Gorinevsky and Connolly 1992b, 1993).
2 Inverse Kinematic Approximation Problem
Let us consider an inverse kinematics problem for a three link anthropomorphic manipulator. This basic problem in robotics is a convenient testbed for the comparison of the approximation methods since the precise analytic expressions are readily available to check the accuracy of the approximation.
Let us consider a three link manipulator with joint angles θ_1, θ_2, and θ_3 that define the coordinates X_1, X_2, and X_3 of the manipulator tip as

X_1 = (l_2 cos θ_2 + l_3 cos(θ_2 + θ_3)) cos θ_1
X_2 = (l_2 cos θ_2 + l_3 cos(θ_2 + θ_3)) sin θ_1
X_3 = l_1 + l_2 sin θ_2 + l_3 sin(θ_2 + θ_3)   (2.1)
where l_1, l_2, and l_3 are the manipulator link lengths. The equations in 2.1 and their solutions with respect to θ_1, θ_2, and θ_3 could be written in a vector form as

X = f(Θ);   Θ = col(θ_1, θ_2, θ_3)   (2.2)
Θ = g(X);   X = col(X_1, X_2, X_3)   (2.3)
Of course, the inverse kinematics solution Θ = g(X) is defined only for ‖X‖ ≤ l_1 + l_2 + l_3, where ‖·‖ denotes the Euclidean norm of a vector. Generally, for a given vector X two solutions exist. We assume that the function g(X) represents the one with θ_3 ≥ 0. In our numerical study the link lengths are l_1 = 0.0 and l_2 = l_3 = 1.0, and the joint angles Θ belong to the domain

D = {Θ = col(θ_1, θ_2, θ_3) : (0.0 ≤ θ_1, θ_2 ≤ 0.5); (0.5 ≤ θ_3 ≤ 1.0)}   (2.4)
The approximation problem is as follows. We consider N joint angle space points Θ^{(j)}, (j = 1, …, N) randomly placed in the domain 2.4 and N respective Cartesian coordinate points

X^{(j)} = f(Θ^{(j)}),   Θ^{(j)} ∈ D;   (j = 1, …, N)   (2.5)
We assume that the function f(·) is unknown. The problem is, given an arbitrary vector X ∈ f(D) and the data 2.5, to compute an approximation Θ̂ for the inverse kinematics solution 2.3. We will further refer to the set 2.5 as the training set. We check the accuracy of the approximation in the following way. Let us consider a test set of N_t points Θ_t^{(k)} randomly placed in the domain 2.4 and the respective set of Cartesian coordinates X_t^{(k)},
X_t^{(k)} = f(Θ_t^{(k)}),   Θ_t^{(k)} ∈ D;   (k = 1, …, N_t)   (2.6)
Using a chosen approximation method, we compute an approximation Θ̂_t^{(k)} = ĝ(X_t^{(k)}) for each of the points 2.6. Since the mapping f(·) in 2.1 is in fact known, we compute the joint angle approximation errors e_Θ^{(k)} and the Cartesian space errors e_X^{(k)} as

e_Θ^{(k)} = ‖Θ̂_t^{(k)} − Θ_t^{(k)}‖   (2.7)
e_X^{(k)} = ‖f(Θ̂_t^{(k)}) − X_t^{(k)}‖   (2.8)

By averaging the errors in 2.7 and 2.8 over the test data set, and by finding their maximum values over the set, one can get an impression of the approximation method's accuracy.
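This protocol is compact enough to state in code. In the sketch below (ours; `g_hat` stands for whichever inverse approximator of the following sections is under test), equations 2.1 and 2.4-2.8 appear directly.

import numpy as np

l1, l2, l3 = 0.0, 1.0, 1.0

def f(theta):                                    # equation 2.1; theta has shape (..., 3)
    t1, t2, t3 = theta[..., 0], theta[..., 1], theta[..., 2]
    r = l2 * np.cos(t2) + l3 * np.cos(t2 + t3)
    return np.stack([r * np.cos(t1),
                     r * np.sin(t1),
                     l1 + l2 * np.sin(t2) + l3 * np.sin(t2 + t3)], axis=-1)

rng = np.random.default_rng(4)
def sample_domain(n):                            # random points in the domain 2.4
    return rng.uniform([0.0, 0.0, 0.5], [0.5, 0.5, 1.0], size=(n, 3))

Theta, Theta_t = sample_domain(500), sample_domain(200)
X, X_t = f(Theta), f(Theta_t)                    # training set 2.5 and test set 2.6

def errors(g_hat):                               # joint and Cartesian errors 2.7-2.8
    Th_hat = g_hat(X_t)
    e_theta = np.linalg.norm(Th_hat - Theta_t, axis=1)
    e_X = np.linalg.norm(f(Th_hat) - X_t, axis=1)
    return e_theta.mean(), e_theta.max(), e_X.mean(), e_X.max()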
In addition to the training data set 2.5, we use a training data set with added noise,

X^{(j)} = f(Θ^{(j)}) + η^{(j)},   Θ^{(j)} ∈ D;   (j = 1, …, N)   (2.9)

where the η^{(j)} are independent random variables uniformly distributed over the interval [−η_0, η_0]. In the numerical experiments we choose a noise intensity η_0 that is less than the approximation error of the considered methods. Thus, the noise influence represents the robustness of the methods and not a degree of data degradation. In the next two sections we first formulate each of the approximation methods generally for the problem 1.1. Then, if necessary, we describe peculiarities in the application of each of the methods to the inverse kinematics approximation problem stated in Section 2. Section 5 presents the application results.
3 Artificial Neural Network Approximations
In this section we consider approaches that use an iterative (learning) procedure to obtain the parameters of the approximating function. These methods are also suitable for a massively parallel hardware implementation.

3.1 Multilayered Perceptron Network. In this most popular ANN method the network nodes are organized in several layers, so that the input signal is processed consecutively from the input to the output layer. The input layer of K nodes (neurons) supplies the components x_i of the input vector x; the output layer of M neurons gives the components y_j of the output vector y. The most common network design is with one or two hidden layers of nodes placed between the input and output layers. Sontag (1990) proved that at least two hidden layers are generally needed for inverse function approximation. Further, Watanabe and coauthors (1992) tried various Multilayered Perceptron Network (MLP) architectures for inverse manipulator kinematics approximation and obtained the best accuracy with two-hidden-layer networks. Therefore, we also used two-hidden-layer networks. We denote by x̄_i, (i = 1, …, K̄) and x̂_i, (i = 1, …, K̂) the outputs of the neurons of the first and second hidden layers, respectively. The network defines an input-output mapping of the form

x̄_i = σ(Σ_{j=1}^{K} w_{ij} x_j − h_i)   (first hidden layer)   (3.1)

x̂_i = σ(Σ_{j=1}^{K̄} w̄_{ij} x̄_j − h̄_i)   (second hidden layer)   (3.2)

y_i = Σ_{j=1}^{K̂} w̃_{ij} x̂_j − h̃_i   (output layer)   (3.3)
where the weights w_{ij}, w̄_{ij}, and w̃_{ij} are usually called synaptic coefficients, and the weights h_i, h̄_i, and h̃_i the activation thresholds of the neurons. The sigmoidal function has the form σ(h) = tanh(βh), where the scalar parameter β is the same for all neurons. As usual, no sigmoidal transformation is applied to the output layer signals. In order for the network node state equations 3.1-3.3, which map the input vector x into the output vector y, to approximate the mapping y = φ(x), the correct values of the weights and activation thresholds in equations 3.1-3.3 must be determined. The most natural way to determine them is to minimize the mean square approximation error for the training set 1.1,

E = Σ_{μ=1}^{N} Σ_{j=1}^{M} (ỹ_j^{(μ)} − y_j^{(μ)})²   (3.4)
where ỹ_j^{(μ)} denotes the components of the network output vector 3.3 when the vector x^{(μ)} of 1.1 is given as input. The performance index 3.4 is most commonly minimized with a form of gradient descent method called error backpropagation learning (Rumelhart, Hinton, and Williams 1986; Müller and Reinhardt 1990). We used this procedure in our numerical study. The network design could include various numbers of neurons in the hidden layers. The two hidden layer networks implemented in our study have three input and three output neurons, corresponding to the components of the vectors X and Θ in the mapping 2.3. We used two different network architectures: 3-6-6-3 and 3-15-15-3. This means the two hidden layers contained 6 and 15 neurons, respectively [as in Kozakiewicz et al. (1991) and Watanabe et al. (1992)]. For more neurons the iterative minimization (learning) process does not converge in a reasonable time. Furthermore, as shown by MacKay (1992), there exists a range of optimal numbers of neurons in the hidden layers that will provide the most accurate solution. Networks with fewer neurons will be less accurate, and networks with more neurons will break down since they have too many weights to adjust as compared with the number of data points. For instance, a 3-15-15-3 network has 348 adjustable weights, and each training pair defines 6 scalar values. Thus, at least 58 points are needed to determine the weights. Normally, the number of training points should be several times larger, so one cannot expect further performance improvement for a larger network in the considered numerical experiments. The forward computations 3.1-3.3, written in C, require 0.13 msec on a DECstation 5000/33 for the network with six neurons in the hidden layers and 0.43 msec for 15 neurons in the hidden layers. This is acceptable for real-time computation.
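For illustration, the forward pass 3.1-3.3 of a 3-6-6-3 network can be written in a few lines (a sketch of ours; the weights shown are random placeholders rather than trained values).

import numpy as np

def mlp_forward(x, W1, h1, W2, h2, W3, h3, beta=1.0):
    xb1 = np.tanh(beta * (W1 @ x - h1))      # first hidden layer, equation 3.1
    xb2 = np.tanh(beta * (W2 @ xb1 - h2))    # second hidden layer, equation 3.2
    return W3 @ xb2 - h3                     # linear output layer, equation 3.3

rng = np.random.default_rng(5)
W1, W2, W3 = rng.normal(size=(6, 3)), rng.normal(size=(6, 6)), rng.normal(size=(3, 6))
h1, h2, h3 = np.zeros(6), np.zeros(6), np.zeros(3)
y = mlp_forward(np.array([0.3, 0.2, 0.7]), W1, h1, W2, h2, W3, h3)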
3.2 Kohonen Map. Along with multilayered feedforward networks, Kohonen's self-organizing mapping algorithm is one of the most popular in robotics applications (see Brause 1990; Ritter et al. 1991). In this subsection we first formulate our own version of the method, which we compare to the version developed by Ritter et al. The numerical study shows that our version is superior in several respects. We begin by describing the forward computations of the method, which are the same for the two versions. What differs is the procedure for acquiring (tuning) the network parameters such as the weights.
3.2.1 Forward Computations. We continue to consider the approximation problem 1.1. The Kohonen Map network consists of L nodes (neurons) with a vector z_r ∈ R^K, a vector \tilde{y}_r ∈ R^M, and a gradient matrix W_r ∈ R^{M×K} associated with each node r. Suppose that a vector x is input to the network. Following the method proposed by Ritter et al. (1991), one finds the node s for which \|x - z_s\| is minimal over all the nodes and approximates the value y = \phi(x) of the function 1.1 as
y_s = \tilde{y}_s + W_s\,(x - z_s) \qquad (3.5)
Let us note that the linear approximation 3.5 of the mapping 2.3 is only piecewise continuous. That is, if the input vector x experiences an infinitesimally small variation such that the closest node s changes, the output value of 3.5 will generally experience a finite jump. Therefore, we propose to use the extended Kohonen Map algorithm 3.5 together with Shepard interpolation, which is commonly applied when one knows the function and its gradient at some scattered points (Franke and Nielson 1980; Franke 1982; Renka 1988), which is just the case here. With Shepard interpolation we compute an approximation \hat{y} of y = \phi(x) as a weighted sum of the form

\hat{y} = \frac{\sum_r y_r\, h(z_r, x)}{\sum_r h(z_r, x)} \qquad (3.6)
where the summation is performed over the network nodes (y_r is the approximation of the form 3.5 computed at node r, and the h(z_r, x) are scalar weighting functions). We used functions h of the form

h(z, x) = \exp\big(-\|x - z\|^2 / d^2\big) \qquad (3.7)
where d is the interpolation radius. Since the gaussian function 3.7 vanishes far from the coordinate origin, in computing 3.7 we take into account only the neighboring nodes for which \|x - z_r\| is not much greater than d. The approximations 3.5-3.7 can be computed once the network parameters z_r, \tilde{y}_r, W_r are known; in Subsections 3.2.2 and 3.2.3 we consider two methods for determining them.
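To make the forward pass concrete, the following C sketch evaluates 3.5-3.7 for a trained network. It is a minimal illustration, not the authors' implementation: the node arrays and the radius D are placeholders, and the neighbor search described in the text is reduced to a full scan with a comment marking where it would go.

```c
/* Sketch of the forward computations 3.5-3.7: local linear models at
 * the nodes, blended by gaussian Shepard weights.  Node data (z, ytil,
 * W) and the radius D are placeholders for tuned network parameters. */
#include <math.h>

#define L 100      /* number of nodes      */
#define K 3        /* input dimension      */
#define M 3        /* output dimension     */
#define D 0.1      /* interpolation radius */

static double z[L][K];       /* node input-space positions z_r  */
static double ytil[L][M];    /* node output values ~y_r         */
static double W[L][M][K];    /* node gradient matrices W_r      */

void km_forward(const double x[K], double y[M])
{
    double num[M] = {0}, den = 0.0;
    int r, i, j;

    for (r = 0; r < L; r++) {
        double d2 = 0.0, h;
        for (j = 0; j < K; j++) {
            double t = x[j] - z[r][j];
            d2 += t * t;
        }
        /* nodes with ||x - z_r|| >> D contribute negligibly; a real
         * implementation would skip them after a neighbor search */
        h = exp(-d2 / (D * D));                    /* eq. 3.7 */
        for (i = 0; i < M; i++) {
            double yr = ytil[r][i];                /* eq. 3.5 */
            for (j = 0; j < K; j++)
                yr += W[r][i][j] * (x[j] - z[r][j]);
            num[i] += yr * h;                      /* eq. 3.6 numerator   */
        }
        den += h;                                  /* eq. 3.6 denominator */
    }
    for (i = 0; i < M; i++) y[i] = num[i] / den;
}
```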
The forward computations, according to 3.5 and 3.6, include a search for the neighboring nodes, making it necessary to compute the distances \|x - z_r\| for all nodes. For a large network this is the most computationally intensive part of the forward algorithm. The computations according to 3.5-3.7 are then done only for a relatively small number of the nodes closest to x and do not require much computation time. Our implementation of this algorithm, programmed in C, required 5.2 msec for a network of 100 nodes on a DECstation 5000/33 for the forward computation. This is much slower than the forward computation time for the networks described in Section 3.1. However, we feel that by optimizing the search for the neighboring nodes in the algorithm [e.g., as by Moody and Darken (1989)] or by using parallel hardware one can use the algorithm for real-time control. Note that expressions 3.6 and 3.7 coincide with those used by Moody and Darken (1989) in a modification of an RBF network with "normalized response functions." Unlike these, we use a local linear approximation 3.5 in the considered scheme. Some statistical approaches resulting in equations of the same form as Shepard interpolation 3.6, 3.7 are considered by Nowlan (1989), Lowe (1991), and Specht (1991).

3.2.2 Gradient Descent Optimization. One can determine the network parameters in a quite natural way by requiring that they minimize the mean square approximation error for the training set 1.1. Similarly to 3.4, let us demand
E = \sum_{\mu=1}^{N} \big\| y^{(\mu)} - \hat{y}^{(\mu)} \big\|^2 \qquad (3.8)

where \hat{y}^{(\mu)} denotes the network output 3.5-3.7 when the input vector x^{(\mu)} of the training set pair \{x^{(\mu)}, y^{(\mu)}\} 1.1 is given as the input. To find the optimum network parameters we use an iterative minimization of the cost function 3.8 with the steepest descent method, as in Section 3.1. In accordance with 3.5-3.7, the resulting update rules 3.9-3.11 take gradient steps in the node parameters z_r, \tilde{y}_r, and W_r, respectively; the correction applied at node r for training pair \mu is weighted by the factor h(z_r, x^{(\mu)})/H^{(\mu)}, where H^{(\mu)} = \sum_s h(z_s, x^{(\mu)}).
The updating rules 3.9-3.10 have a certain similarity to the Widrow-Hoff learning rule used by Ritter et al. (1991). However, the updating rule 3.11 differs from the perceptron learning rule used in Kohonen Maps; therefore, strictly speaking, our method cannot use this name. Nevertheless, it was inspired by the extended Kohonen Map algorithm of Ritter et al. By using the gradient descent optimization we determine the network parameters that are best for the minimization of the loss function 3.8. This method, as shown in Section 5, demonstrated a twofold improvement in accuracy over the method of Ritter et al. considered in the next subsection. Let us note that the optimality condition 3.8 unambiguously determines the parameters z_r, \tilde{y}_r, and W_r stored for each network node only if the number of training pairs exceeds the number of nodes (neurons) at least (1 + K) times. In our numerical study, the number of nodes is always four times smaller than the number of training pairs. The decay steepness of the weighting function 3.7 is characterized by the parameter d. In the numerical study of Section 5 we gradually decrease the values of d and the step size \varepsilon in 3.9-3.11 with iteration number n as d = d_0 \cdot 0.99^n and \varepsilon = \varepsilon_0 \cdot 0.99^n.
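Since the update rules 3.9-3.11 are steepest-descent steps on 3.8, the following C sketch shows the shape of one such step in a deliberately simplified case: only the node values ~y_r are adapted and the linear term of 3.5 is dropped, so the h(z_r, x)/H weighting of the corrections is visible. This is an assumption-laden illustration, not the authors' full update; the arrays and constants mirror the forward-computation sketch above.

```c
/* One stochastic gradient step on the cost 3.8, simplified so that only
 * the node values ~y_r are adapted (the full rules 3.9-3.11 also update
 * z_r and W_r).  All arrays and the step size eps are placeholders. */
#include <math.h>

#define L 100
#define K 3
#define M 3
#define D 0.1

static double z[L][K];       /* node positions z_r */
static double ytil[L][M];    /* node values ~y_r   */

void km_adapt(const double x[K], const double ytrue[M], double eps)
{
    double h[L], H = 0.0, yhat[M] = {0};
    int r, i, j;

    for (r = 0; r < L; r++) {               /* weights h(z_r, x), eq. 3.7 */
        double d2 = 0.0;
        for (j = 0; j < K; j++) {
            double t = x[j] - z[r][j];
            d2 += t * t;
        }
        h[r] = exp(-d2 / (D * D));
        H += h[r];
        for (i = 0; i < M; i++) yhat[i] += ytil[r][i] * h[r];
    }
    for (i = 0; i < M; i++) yhat[i] /= H;   /* simplified output, eq. 3.6 */

    /* steepest descent on 3.8: the correction at node r is weighted by
     * h(z_r, x)/H, as in the update rules 3.9-3.11 */
    for (r = 0; r < L; r++)
        for (i = 0; i < M; i++)
            ytil[r][i] += eps * (ytrue[i] - yhat[i]) * h[r] / H;
}
```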
3.2.3 Method of Ritter et al. We compared the method proposed in Subsection 3.2.2 to the extended Kohonen Map method of Ritter et al., which makes use of a Widrow-Hoff type error correction scheme. The other considered methods build the approximations from the training set data 2.6. The method of Ritter et al. is specific to inverse manipulator kinematics approximation and requires additional training data in the learning process. It assumes that we are able to move the manipulator to the configuration with given joint angles Θ and then measure the Cartesian tip coordinates X, thereby producing additional input/output pairs for the mapping 2.2. The learning procedure is described in detail in Ritter et al. (1991). Each input vector X from the training set is presented to the network. Then a coarse approximation of the joint angles corresponding to the input vector X is computed according to 3.5 and the manipulator is moved to this position. The coarse approximation is further refined by moving the manipulator to a new, finer approximation of the corresponding joint angles computed using the data from the coarse approximation. Lastly, the data obtained for the manipulator configurations corresponding to the coarse and fine approximations are used to update the network internal parameters \tilde{y}_r, W_r, and z_r. After learning, the accuracy is tested by presenting each of the test input vectors to the network and comparing it with the approximation 3.5-3.7. The details of our study can be found in Gorinevsky and Connolly (1992a,b).
In the numerical study, the network size was changed depending on the number of training pairs; it was the same as for the network of Subsection 3.2.2.

4 Scattered Data Approximations
Unlike the methods of Section 3, the methods we consider in this section compute the approximation in one pass, without an iterative optimization procedure. In most problems related to robot control, a vector-valued mapping \phi(x) is to be approximated. A group of Scattered Data Approximation methods can be used for a vector-valued function without the need for multiple solutions of a scalar-valued function approximation problem. These methods compute an estimate of the form
\hat{y}(x) = \sum_{i=1}^{N} w_i\, y^{(i)} \qquad (4.1)
where w_i = w_i(x, \{x^{(j)}\}), (i = 1, ..., N) are scalar weights that depend on the relative position of the points x and x^{(j)} and do not depend on the known function values y^{(j)}. Since the estimate 4.1 can be applied to each component of a vector-valued function, one can simply consider the y^{(i)} in 4.1 as vectors. The triangulation-based methods do not belong to this group, because the triangulation usually depends on the values of the approximated function (Franke and Nielson 1980; Franke 1982; Alfeld 1989). The approximation has the form 4.1 for Radial Basis Function methods, including the widely used Multiquadrics methods (Dyn 1987; Kansa 1990). For the inverse kinematics approximation problem that we consider, the data sets are relatively large. Therefore, we consider local versions of the mentioned Scattered Data Approximation methods. In this study we are not concerned with whether the approximation obtained is smooth or even continuous; we rather put the emphasis on its accuracy.

4.1 Radial Basis Functions. The Radial Basis Functions (RBF) approximation is one of the most commonly used groups of Scattered Data Approximation methods. Although initially some of the methods belonging to this group were applied just empirically, presently they are acknowledged to give an approximation that minimizes a certain regularization performance index (Dyn 1987; Poggio and Girosi 1990; Powell 1992) describing the roughness of the surface being approximated. We consider here only so-called exact interpolation versions of the methods. Unlike the methods of Section 3, exact RBF interpolation requires no iterative minimization. However, some papers consider Radial Basis Function approximation as an ANN algorithm (Moody and Darken
1989; Bishop 1991; Hartman and Keeler 1991; Park and Sandberg 1991; Platt 1991) and compute the approximation as a result of an iterative minimization process.

4.1.1 General Scheme. For a scalar-valued function 1.1 (M = 1) the RBF methods use an approximation of the form

\tilde{y}(x) = \sum_{i=1}^{N} c_i\, k\big(\|x - x^{(i)}\|\big) \qquad (4.2)

where k(\cdot) is a function of the distance, or the radius (thus the name of the method). The weights c_i are chosen to ensure zero approximation error at the points x^{(i)}. The latter condition is easily obtained by substituting x = x^{(i)} into the right-hand side of 4.2 and y = y^{(i)} into the left-hand side. It can be written in vector form as

H c = Y \qquad (4.3)
where c = col(c_1, ..., c_N), Y = col(y^{(1)}, ..., y^{(N)}), and H is a symmetric N × N matrix with entries H_{ij} = k(\|x^{(i)} - x^{(j)}\|). If the radial function k(\|\cdot\|) is chosen properly, the matrix H is invertible. By solving 4.3 for c and substituting into 4.2, we find that the approximation 4.2 can be written in the form 4.1, where the weights w_i are computed as
w = col\big(\{w_i\}_{i=1}^{N}\big) = H^{-1} R(x), \qquad R(x) = col\big(\{k(\|x - x^{(i)}\|)\}_{i=1}^{N}\big) \qquad (4.4)
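The following C sketch illustrates the exact-interpolation scheme 4.2-4.4 under stated assumptions: a gaussian radial function (one of the choices considered below), placeholder data arrays, and a plain Gaussian-elimination routine standing in for whatever solver the authors used. It fits the coefficients c by solving 4.3 once and then evaluates 4.2 at query points.

```c
/* Sketch of exact RBF interpolation: build the symmetric matrix H,
 * solve H c = Y (eq. 4.3), then evaluate eq. 4.2.  Data arrays, the
 * radial function, and D are illustrative placeholders. */
#include <math.h>

#define N 25            /* number of data points (local version) */
#define K 3             /* input dimension                       */
#define D 0.2           /* radius parameter                      */

static double xs[N][K]; /* data sites  x^(i) */
static double ys[N];    /* data values y^(i) */
static double c[N];     /* interpolation coefficients            */

static double radial(double r) { return exp(-r * r / (D * D)); }

static double dist(const double a[K], const double b[K])
{
    double s = 0.0;
    int j;
    for (j = 0; j < K; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
    return sqrt(s);
}

/* Solve A v = b by Gaussian elimination with partial pivoting;
 * H is invertible for distinct points (Micchelli 1986). */
static void solve(double A[N][N], double b[N], double v[N])
{
    int i, j, k, p;
    for (k = 0; k < N; k++) {
        p = k;
        for (i = k + 1; i < N; i++)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (j = 0; j < N; j++) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        { double t = b[k]; b[k] = b[p]; b[p] = t; }
        for (i = k + 1; i < N; i++) {
            double m = A[i][k] / A[k][k];
            for (j = k; j < N; j++) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    for (i = N - 1; i >= 0; i--) {
        double s = b[i];
        for (j = i + 1; j < N; j++) s -= A[i][j] * v[j];
        v[i] = s / A[i][i];
    }
}

void rbf_fit(void)                     /* eq. 4.3: H c = Y */
{
    static double H[N][N];
    double Y[N];
    int i, j;
    for (i = 0; i < N; i++) {
        Y[i] = ys[i];
        for (j = 0; j < N; j++)
            H[i][j] = radial(dist(xs[i], xs[j]));
    }
    solve(H, Y, c);
}

double rbf_eval(const double x[K])     /* eq. 4.2 */
{
    double s = 0.0;
    int i;
    for (i = 0; i < N; i++) s += c[i] * radial(dist(x, xs[i]));
    return s;
}
```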
For the approximation of a vector-valued function y = \phi(x), one can still use expression 4.4 to find the weights w_i and compute the linear combination 4.1 of the known vectors y^{(i)}. If all the available data are taken into account (the approximation is global), the matrix H in 4.3 is the same for any point x. Therefore, H^{-1} in 4.3 can be computed beforehand, reducing the volume of the computations for each new approximation 4.2. For a local version of the method, the set of the neighboring points depends on the considered vector x, making this precomputation impossible.

4.1.2 Gaussian RBF. One possible choice of the radial function k(\|\cdot\|) in 4.2 is a gaussian function
k(r) = \exp(-r^2/d^2) \qquad (4.5)
where d is the interaction radius (k(r) \ll 1 for r \gg d). For the function 4.5, the matrix H in 4.3 can be proved to be invertible for distinct points x^{(i)} (Micchelli 1986). Thus the solution 4.4 to the problem 4.3 always exists. For a large number of data points N, inversion of an N × N matrix H could be difficult since it could be ill-conditioned. Furthermore, since
the gaussian function 4.5 goes quickly to zero as its argument increases, the terms with \|x - x^{(i)}\| \gg d in 4.2 have little influence on the result. Therefore, we use a local version of the method; that is, we take into account only the points x^{(i)} closest to x. In our numerical study, the L = 25 closest data points provided the best precision. It is known that the parameter d in 4.5 should be related to the mean distance between the data points. We found that a good choice is
d = \kappa\,(V/N)^{1/3} \qquad (4.6)
where V = 0.5^3 is the volume of the data domain and \kappa = 5 is an empirical coefficient. This is consistent with the findings of Botros and Atkeson (1991), which were obtained for approximation in a six-dimensional domain. For the considered local versions of the RBF methods, the bottleneck of the forward computations is the search for the closest points in the data set (this includes computing the distances to each of the points) and the solution of the linear equation system Hw = R(x) in 4.4. The system is of order L, where L is the number of neighboring points in the local approximation.

4.1.3 Reverse Multiquadrics. We used the so-called reverse multiquadrics method due to Hardy. It is one of the most popular Scattered Data Approximation methods (e.g., see Franke 1982). In this method the radial function k(\|\cdot\|) in 4.2 is
k(r) = \big(1 + r^2/d^2\big)^{-1/2} \qquad (4.7)
where d has the same physical meaning as the parameter d in 4.6. Micchelli (1986) proved that for the radial function 4.7 the matrix H in 4.3 is invertible if the points x^{(i)} are distinct. We used a local version of the multiquadrics method with the same number L = 25 of neighboring data points as for the gaussian radial function. The parameter d in 4.7 was chosen using the same expression 4.6 that was used for determining d in 4.5.

4.2 Local Multivariable Polynomial Fit. This method in its present form was introduced by Gorinevsky (1991, 1992). It can be considered a modification of McLain's method (McLain 1976). For simplicity of presentation, we consider a second-order polynomial approximation here. We compute an estimate of the form 4.1 for the function \phi(x) 1.1 at the point x = x^{(0)} by solving a classical regression problem. We assume that the points x^{(i)}, (i = 1, ..., L) lie "in the vicinity" of the point x^{(0)}, so that we can write the Taylor expansion
y^{(i)} = \phi(x^{(0)}) + \sum_{q=1}^{K} \frac{\partial\phi}{\partial x_q}(x^{(0)})\, h_q^{(i)} + \frac{1}{2} \sum_{q,r=1}^{K} \frac{\partial^2\phi}{\partial x_q \partial x_r}(x^{(0)})\, h_q^{(i)} h_r^{(i)} + e^{(i)} + v^{(i)} \qquad (4.8)
where h^{(i)} = x^{(i)} - x^{(0)}, the v^{(i)} are random measurement errors of y^{(i)}, and e^{(i)} is the mismatch of the Taylor expansion. To compute an estimate for y^{(0)} = \phi(x^{(0)}), we consider \phi(x^{(0)}), \partial\phi/\partial x_q(x^{(0)}), \partial^2\phi/\partial x_q \partial x_r(x^{(0)}), v^{(i)}, and e^{(i)} as independent zero-mean random variables with covariances of the form

E\big(\phi^2(x^{(0)})\big) = \bar\phi, \quad E\Big(\big(\tfrac{\partial\phi}{\partial x_q}\big)^2\Big) = \bar\phi/a^2, \quad E\Big(\big(\tfrac{\partial^2\phi}{\partial x_q \partial x_r}\big)^2\Big) = \bar\phi/a^4, \quad E\big((v^{(i)})^2\big) = \bar\psi \qquad (4.9)
(q, r = 1, ..., K), where E(\cdot) denotes mathematical expectation and the parameter a has the meaning of a "wavelength" of the function \phi(x). That is, for a variation \Delta x of the input vector in 1.1 such that \|\Delta x\| = a, the output value changes significantly. The parameter a can be assigned a value by considering the physical meaning of the input and output variables. The expression for E\big((e^{(i)})^2\big) in 4.9 follows from the estimate of the Taylor expansion residual in 4.8. Let us write 4.8 in the form

y^{(i)} = F^{T} H^{(i)} + e^{(i)} + v^{(i)} \qquad (4.10)

where

F = col\Big(\phi(x^{(0)}),\ \Big\{\tfrac{\partial\phi}{\partial x_q}(x^{(0)})\Big\},\ \Big\{\tfrac{1}{2}\tfrac{\partial^2\phi}{\partial x_q \partial x_r}(x^{(0)})\Big\}_{q \le r}\Big), \qquad H^{(i)} = col\Big(1,\ \big\{h_q^{(i)}\big\},\ \big\{h_q^{(i)} h_r^{(i)}\big\}_{q \le r}\Big) \qquad (4.11)
By introducing a matrix H = [H^{(1)}, ..., H^{(L)}] and vectors Y = col(y^{(1)}, ..., y^{(L)}) and \mathcal{E} = col(e^{(1)} + v^{(1)}, ..., e^{(L)} + v^{(L)}), we represent our regression problem in the form
Y^{T} = F^{T} H + \mathcal{E}^{T} \qquad (4.12)
where H and Y are known matrices and \mathcal{E} and F are zero-mean vectors with known covariances. The problem is to estimate y^{(0)} = \phi(x^{(0)}) = F^{T}\mathbf{1}, where \mathbf{1} = col(1, 0, ..., 0). We search for the least-covariance estimate of the form 4.1, \hat{y}^{(0)} = Y^{T} w, w = col(\{w_i\}_{i=1}^{L}). Taking into account 4.12, we can write 4.1 in the form

\hat{y}^{(0)} = Y^{T} w = F^{T} H w + \zeta \qquad (4.13)
where \zeta is the estimation error. Solving 4.13 and 4.9 for a least-covariance zero-mean \zeta results in

w = \big(\Sigma + H^{T} \Phi H\big)^{-1} H^{T} \Phi\, \mathbf{1}, \qquad \Sigma = E\big(\mathcal{E}\mathcal{E}^{T}\big), \quad \Phi = E\big(F F^{T}\big) \qquad (4.14)
where \Sigma and \Phi are diagonal matrices with entries defined by 4.9 and 4.12.
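As an illustration of 4.14, the C sketch below assembles the L × L system (Σ + HᵀΦH)w = HᵀΦ1 for the second-order case (K = 3, P = 2, so 10 monomials). The covariance entries PHI and SIG are placeholders standing in for the model 4.9; any dense solver (e.g., the elimination routine in the RBF sketch above) then yields w, after which the estimate is ŷ⁽⁰⁾ = Yᵀw.

```c
/* Assemble the linear system of eq. 4.14 for a second-order (P = 2,
 * K = 3) local polynomial fit.  PHI and SIG are placeholder diagonal
 * covariance entries standing in for the model 4.9. */
#define LP 25                  /* neighboring points            */
#define KP 3                   /* input dimension               */
#define NM 10                  /* monomials: (P+K)!/(P!K!) = 10 */

static double PHI[NM];         /* diag of E(F F^T), from 4.9    */
static double SIG;             /* diag entry of E(EE^T)         */

/* Monomial vector H^(i) of the second-order Taylor basis at
 * h = x^(i) - x^(0): [1, h1, h2, h3, h1^2, h1 h2, h1 h3, h2^2, h2 h3, h3^2]. */
static void monomials(const double h[KP], double m[NM])
{
    int p, q, n = 0;
    m[n++] = 1.0;
    for (p = 0; p < KP; p++) m[n++] = h[p];
    for (p = 0; p < KP; p++)
        for (q = p; q < KP; q++) m[n++] = h[p] * h[q];
}

/* Build A = SIGMA + H^T PHI H and rhs = H^T PHI 1; solving A w = rhs
 * gives the weight vector of eq. 4.14. */
void polyfit_system(const double h[LP][KP], double A[LP][LP], double rhs[LP])
{
    double Hm[LP][NM];
    int i, j, n;
    for (i = 0; i < LP; i++) monomials(h[i], Hm[i]);
    for (i = 0; i < LP; i++) {
        for (j = 0; j < LP; j++) {
            double s = (i == j) ? SIG : 0.0;      /* SIGMA is diagonal */
            for (n = 0; n < NM; n++) s += Hm[i][n] * PHI[n] * Hm[j][n];
            A[i][j] = s;
        }
        rhs[i] = Hm[i][0] * PHI[0];   /* H^T PHI 1, with 1 = col(1,0,...,0) */
    }
}
```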
We have formulated a method for fitting a quadratic polynomial 4.8 to the data. However, the method can easily be generalized to any order of the polynomial, and in the numerical study we use a third-order polynomial fit. For a vector- or matrix-valued function 1.1, we suppose that the stochastic model 4.9 is the same for each vector component or matrix entry. Therefore, the weight vector w in 4.14 is the same for all components and the approximation has the form 4.1. The most computationally intensive part of the algorithm is the solution of the L × L linear equation system (\Sigma + H^{T}\Phi H)w = H^{T}\Phi\mathbf{1} and the calculation of the matrix H^{T}\Phi H. Although according to 4.9 and 4.10 the matrix \Phi = E(FF^{T}) is diagonal, the matrix H has dimension L × [(P + K)!/(P!K!)], where P is the order of the polynomial approximation 4.8. In the inverse kinematics interpolation problem of Section 2, K = 3, and in the numerical study we used a cubic approximation, P = 3. Therefore, the dimension of the matrix H (4.11) was L × 20. A similar solution to the polynomial fitting problem can be found in Atkeson (1991). However, in Atkeson (1991) an expression of the form 4.14 is obtained just as a regularized solution to the ill-posed polynomial fitting problem, and the solution contains many parameters that are chosen empirically or as a result of extensive numerical optimization. Unlike Atkeson (1991), the above derivation puts a certain physical meaning into the matrix \Phi in expression 4.14. The presented approximation method requires knowledge of some parameters that describe the function \phi(x). The result of 4.14 depends on two such parameters of 4.9, i.e., on a and the ratio \bar\psi/\bar\phi. In the numerical study of Section 5 we set the "wavelength" a = 1 and fixed the relative measurement noise intensity \bar\psi/\bar\phi.
5 Numerical Results and Discussion
5.1 Numerical Study. We compare the precision of the approximation methods of Sections 3 and 4 as described in Section 2. First, a large initial training data set was generated by a random number generator so that each component of the vector Θ is independent and uniformly distributed in the domain 2.4. Next, a random test data set of 100 more independent vectors Θ^{(\mu)} was generated in the same domain V. The training and test sets were the same for all methods. The Cartesian vectors corresponding to the randomly generated joint-space angles were determined from the manipulator's forward kinematics. We checked the approximation error as defined by 2.7 and 2.8 for training sets of different sizes. Each time we took N consecutive points from the initial data set. To get representative results, the procedure was repeated with different training data sets of the same size.
Figure 1: Test set joint space errors (no noise). KM-Ritter et al.: Kohonen Map method of Ritter et al., see Subsection 3.2.3. KM-Grad.Des.Optim.: Kohonen Map with Gradient Descent Optimization, see Subsection 3.2.2. MLP: 3-15-15-3: Multilayered Perceptron Network 3-15-15-3, see Subsection 3.1. Local Polynomial Fit: Local Multivariable Polynomial Fit method, see Subsection 4.2. RBF-Multiquadrics: Local Radial Basis Function Approximation with Reverse Multiquadrics radial function, see Subsection 4.1.3.

For N = 25 training points we used M = 8 training sets; for N = 50, M = 8; for N = 100 and N = 200, M = 4; for N = 400, M = 2; for N = 800, M = 1. Figure 1 displays the mean values of the joint-space error 2.7 as a function of the training set size N. The comparative Cartesian-space error 2.8 basically follows the joint-space error pattern, except for some deviations that we specifically mention later. Figure 2 illustrates the respective results obtained for training data contaminated with noise according to 2.9. An analysis of the results is summarized in Table 1. In the description of the computation volume, L denotes the number of neighboring data points used for the local version of the method. For an explanation see Sections 3 and 4. The two ANN schemes have the advantage of relatively fast forward computation.
Table 1: Performance Comparison of the Approximation Methods.

Method | Accuracy | Noise sensitivity | Forward computation time (bottleneck) | Preliminary optimization (learning)
Multilayered Perceptron Network | Good | Low | Very small | Long
Kohonen Map, Grad. Des. Opt.^a with Shepard interpolation | Good | Low | Small (search of the neighboring neurons) | Long
Kohonen Map, Ritter^b | Moderate | Low | Small (search of the neighboring neurons) | Moderate (extra input/output pairs are needed)
Local Radial Basis Function Approximation (gaussian and Reverse Multiquadrics) | Very good | High | Large (search of the neighboring data points, inversion of L × L matrix) | No
Third-order local polynomial fit | Very good | Low | Large (search of the neighboring data points, multiplication of two L × 20 matrices, and an inversion of an L × L matrix) | No

^a Gradient descent optimization as in Subsection 3.2.2.
^b Learning scheme of Ritter et al. with Widrow-Hoff error correction (Subsection 3.2.3).
Figure 2: Test set joint space errors (with noise). KM-Ritter et al.: Kohonen Map method of Ritter et al., see Subsection 3.2.3. KM-Grad.Des.Optim.: Kohonen Map with Gradient Descent Optimization, see Subsection 3.2.2. MLP: 3-15-15-3: Multilayered Perceptron Network 3-15-15-3, see Subsection 3.1. Local Polynomial Fit: Local Multivariable Polynomial Fit method, see Subsection 4.2. RBF-Multiquadrics: Local Radial Basis Function Approximation with Reverse Multiquadrics radial function, see Subsection 4.1.3.
Our modifications of the extended Kohonen Map method and the 3-15-15-3 MLP network have about 10 times better accuracy than Ritter's extended Kohonen Map method. It is interesting to note that the Cartesian-space error 2.8 for the 3-6-6-3 MLP network (results not plotted in Fig. 1) is almost two times larger than that for the 3-15-15-3 network, whereas the joint-space errors 2.7 for the two MLP networks are about the same. Also, though our modification of the Kohonen Map algorithm appears to give a somewhat bigger joint error than the 3-15-15-3 MLP network, it provides a Cartesian error 2.8 smaller by 30-40%. All four networks are rather robust to noise in the data. The considered Scattered Data Approximation methods have approximately the same accuracy, much better than that of the ANN methods,
and require no learning. However, in their implemented form they have a much larger forward computation time. The two radial basis functions, Reverse Multiquadrics and gaussian, provide similar accuracy in the absence of noise; therefore, we show plots only for the former. With noise present, the error for the gaussian function becomes 25% larger than for the Reverse Multiquadrics. In the absence of noise, the Radial Basis Function approximations are most accurate for a high density of training data. However, the Local Polynomial Fit is much more robust in the presence of noise in the data. One possible method for reducing the sensitivity of the Radial Basis Function methods to noise is to use a regularization procedure (Bishop 1991) and modify the matrix H in 4.3 by adding a small diagonal matrix to it. We found that this indeed improves robustness but, at the same time, can significantly decrease approximation accuracy in the absence of noise in the data. The Local Polynomial Fit is rather computationally expensive and presently can practically be used only for off-line computations or when the approximation needs to be computed at a small number of points. We should, however, mention a real-time implementation of a Local Polynomial Fit version by Atkeson (1991), which was achieved with the Connection Machine supercomputer.
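A minimal sketch of the regularization just mentioned, with N and LAMBDA as placeholder values; as noted above, this trades some no-noise accuracy for robustness.

```c
/* Perturb the diagonal of H in 4.3 before solving (Bishop 1991).
 * N and LAMBDA are placeholders for the system size and the
 * regularization constant. */
#define N 25
#define LAMBDA 1e-3

void regularize(double H[N][N])
{
    int i;
    for (i = 0; i < N; i++)
        H[i][i] += LAMBDA;   /* H <- H + LAMBDA * I: trades exactness
                                at the data points for noise robustness */
}
```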
5.2 Discussion. Generally, our results confirm the findings of other authors concerning both the accuracy of inverse manipulator kinematics approximation and the comparative performance of the methods. The accuracy of inverse manipulator kinematics approximation with Multilayered Perceptron (MLP) networks was studied by Guez and Ahmad (1988) and by Kozakiewicz et al. (1991) for 3DOF manipulators. For a comparable number of training pairs, though over a larger domain of approximation, they obtained about 10 times lower precision than we did. Our results are also consistent with those of Watanabe and coauthors (1992), who studied the problem under conditions more equivalent to ours. Few papers compare the accuracy of the considered approximation methods across different problems. Houslander and Taylor (1989), Moody and Darken (1989), and Hartman and Keeler (1991) report better performance for different modifications of the Radial Basis Function approach in such applications as stochastic time series prediction and pattern classification. Atkeson (1991), Carlin et al. (1992), Kavli (1992), Mischo et al. (1991), and Moody and Yarvin (1992) all found that various forms of polynomial approximation provide superior accuracy compared to MLP in both low-noise and high-noise problems. The reasons why RBF and polynomial approximations perform well seem to be related. The local polynomial fit neglects the influence of higher
derivatives of the approximated function while trying to capture the polynomial behavior described by the lower-order derivatives (see Section 4.2). On the other hand, as pointed out by Powell (1992) and Sanner and Slotine (1992), approximation with Radial Basis Functions is equivalent to low-pass spatial filtering. Therefore, the high approximation accuracy of the RBF methods means that the high-frequency part of the spatial spectrum of the function is small. This requirement is equivalent to the decay of the norms of the function's derivatives with derivative order. One suggestion that readily results from our study is to combine the Local Polynomial Fit method with the ANN methods at the learning stage to boost their accuracy. For instance, if the training examples are scarce, additional input/output pairs could be generated with the Local Polynomial Fit method and used in the error backpropagation learning procedure to tune an MLP network. A preliminary numerical study shows that the accuracy of the MLP network could be improved twofold in this way.
Acknowledgments This study was supported by an Alexander von Humboldt Research Fellowship held by the first author and German BMFT Grant 413-5839-01 IN 104 D/7 for the second author.
References

Alfeld, P. 1989. Scattered data interpolation in three or more variables. In Mathematical Methods in Computer Aided Geometric Design, T. Lyche and L. L. Schumaker, eds. Academic Press, New York.
Atkeson, C. G. 1991. Using locally weighted regression for robot learning. Proc. IEEE Int. Conf. Robot. Automation 3, 958-963.
Bishop, C. 1991. Improving the generalization properties of radial basis function neural networks. Neural Comp. 3, 579-588.
Botros, M., and Atkeson, C. G. 1991. Generalization properties of radial basis functions. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 707-713. Morgan Kaufmann, San Mateo, CA.
Brause, R. 1990. Optimal information distribution and performance in neighbourhood-conserving maps for robot control. Proc. 2nd Int. IEEE Conf. Tools Artif. Intell. 451-456.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Carlin, M., et al. 1992. A Comparison of Four Methods for Nonlinear Data Modelling. Center for Industrial Research, Oslo, Norway, SI-Report No. 9107102-2, Aug. 1992.
Dyn, N. 1987. Interpolation of scattered data by radial functions. In Topics in Multivariable Approximation, L. L. Schumaker, C. K. Chui, and F. I. Utreras, eds., pp. 47-61. Academic Press, Boston.
Franke, R. 1982. Scattered data interpolation: Test of some methods. Math. Comp. 38(157), 181-200.
Franke, R. 1987. Recent advances in the approximation of surfaces from scattered data. In Topics in Multivariable Approximation, L. L. Schumaker, C. K. Chui, and F. I. Utreras, eds., pp. 79-98. Academic Press, Boston.
Franke, R., and Nielson, G. 1980. Smooth interpolation of large sets of scattered data. Int. J. Numerical Methods Eng. 15, 1691-1704.
Gorinevsky, D. M. 1991. Learning and approximation in database for feedforward control of flexible-joint manipulator. Proc. '91 ICAR: 5th Int. Conf. Adv. Robot. 1, 688-692.
Gorinevsky, D. M. 1992. Experiments in direct learning of feedforward control for manipulator path tracking. Robotersysteme 8, 139-147.
Gorinevsky, D. M., and Connolly, T. H. 1992a. Comparison of neural network and scattered data approximations for inverse manipulator kinematics. Sys. Control Artif. Intell. Conf. 1, ix-xvi.
Gorinevsky, D. M., and Connolly, T. H. 1992b. Comparison of neural network and scattered data approximations for inverse manipulator kinematics. Tech. Rep., Lehrstuhl B für Mechanik, Technische Universität München.
Gorinevsky, D. M., and Connolly, T. H. 1993. Comparison of inverse manipulator kinematics approximations from scattered input-output data using ANN-like methods. American Control Conference, pp. 751-755. San Francisco, CA, June 1993.
Guez, A., and Ahmad, Z. 1988. Solution to the inverse kinematics problem in robotics by neural networks. Proc. IEEE Int. Joint Conf. Neural Networks 2, 617-624.
Hartman, E., and Keeler, D. 1991. Predicting the future: Advantages of semilocal units. Neural Comp. 3, 566-578.
Houslander, P. K., and Taylor, J. T. 1989. On the use of predefined regions to minimize the training and complexity of multilayer neural networks. First IEE Int. Conf. Artif. Neural Networks 383-386. IEE, London, UK.
Kansa, E. J. 1990. Multiquadrics-A scattered data approximation scheme with applications to computational fluid dynamics-I. Comput. Math. Appl. 19(8), 127-145.
Kavli, T. 1992. ASMOD-Algorithm for adaptive spline modelling of observation data. Center for Industrial Research, Oslo, Norway, SI-Report No. 910710-2-1, Aug. 1992.
Kieffer, S., Morellas, V., and Donath, M. 1991. Neural network learning of the inverse kinematic relationships for a robot arm. Proc. IEEE Int. Conf. Robot. Automation 3, 2418-2425.
Kozakiewicz, C., Ogiso, T., and Miyake, N. 1991. Partitioned neural network architecture for inverse kinematic calculation of a 6 DOF robot manipulator. Proc. IEEE Int. Joint Conf. Neural Networks 3, 2001-2006.
Lowe, D. 1991. On the iterative inversion of RBF networks: A statistical interpretation. 2nd Int. Conf. Artificial Neural Networks 29-33. IEE, Bournemouth, UK.
MacKay, D. J. C. 1992. A practical Bayesian framework for backpropagation networks. Neural Comp. 4(3), 448-472.
McLain, D. H. 1976. Two dimensional interpolation from random data. Comput. J. 19, 178-181.
Micchelli, C. A. 1986. Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constr. Approx. 2, 11-22.
Mischo, W. S., Hormel, M., and Tolle, H. 1991. Neurally inspired associative memories for learning and control. Proc. ICANN-91, Int. Conf. Artif. Neural Networks, Espoo, Finland, pp. 1241-1244.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Moody, J., and Yarvin, N. 1992. Networks with learned unit response functions. In Advances in Neural Information Processing Systems 4, J. E. Moody, R. P. Lippmann, and D. S. Touretzky, eds., pp. 1048-1055. Morgan Kaufmann, San Mateo, CA.
Müller, B., and Reinhardt, J. 1990. Neural Networks. Springer-Verlag, Berlin.
Nguyen, L., Patel, R. V., and Khorasani, K. 1990. Neural network architectures for the forward kinematics problem in robotics. Proc. IEEE Int. Joint Conf. Neural Networks 3, 393-399.
Nowlan, S. J. 1989. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., pp. 574-582. Morgan Kaufmann, San Mateo, CA.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3, 246-257.
Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comp. 3, 213-225.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9), 1481-1497.
Powell, M. J. D. 1992. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis, W. Light, ed., Vol. 1, pp. 105-210. Clarendon Press, Oxford.
Renka, R. J. 1988. Multivariable interpolation of large sets of scattered data. ACM Trans. Math. Software 14(2), 139-148.
Ritter, H., Martinetz, T., and Schulten, K. 1991. Neuronale Netze. Addison-Wesley, Bonn.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by back-propagating errors. Nature (London) 323, 533-536.
Sanner, R. M., and Slotine, J.-J. E. 1992. Gaussian networks for direct adaptive control. IEEE Trans. Neural Networks 3(6), 837-863.
Sontag, E. D. 1990. Feedback stabilization using two-hidden-layer nets. Tech. Rep. SYCON-90-11, Rutgers Center for Systems and Control, Rutgers Univ., New Brunswick, NJ.
Specht, D. F. 1991. A general regression neural network. IEEE Trans. Neural Networks 2(6), 568-576.
Watanabe, T., et al. 1992. The calibration of position and orientation of robot manipulators using a neural network. Japan/USA Symp. on Flexible Automation, pp. 219-225. ASME.

Received September 8, 1992; accepted July 9, 1993.
Communicated by Halbert White
Functionally Equivalent Feedforward Neural Networks

Věra Kůrková
Institute of Computer Science, P.O. Box 5, 182 07 Prague 8, Czechia
Paul C. Kainen
Industrial Math, 3044 N St., N.W., Washington, D.C. 20007 USA
For a feedforward perceptron-type architecture with a single hidden layer but with a quite general activation function, we characterize the relation between pairs of weight vectors determining networks with the same input-output function.

1 Introduction
In the last several years, capabilities of multilayer perceptron-type networks to approximate within an arbitrary accuracy quite general mappings from one finite-dimensional space into another were confirmed by several authors. But it seems that in most practical applications a large number of units in hidden layers, and hence a large dimension for the space of parameters (weight space), would be needed. Parallel processing can overcome some of these problems, but it appears sensible to also study possibilities of reducing the size of the weight spaces within which the search for an optimal weight vector must be done. Hecht-Nielsen (1990) emphasized the importance of understanding the structure of equivalence relations on weight spaces relating pairs of weight vectors that determine the same input-output function for the network. Subsets of weight spaces containing exactly one representative from each class of these equivalences can occupy only a small fraction of the whole weight space. Some learning algorithms (like graded and genetic learning) may thus achieve a considerable reduction in the volume of a search set. Recently, several authors studied these equivalences for perceptron-type networks with hyperbolic tangent as an activation function (Chen and Hecht-Nielsen 1991; Chen et al. 1993; Sussmann 1992). Albertini and Sontag (1993) extended this analysis to any infinitely differentiable function σ satisfying σ(0) = 0, σ′(0) ≠ 0, and σ″(0) = 0. Our analysis requires neither differentiability nor continuity. It needs only the activation function to be asymptotically constant (both limits in +∞ and −∞ exist and are finite), and so covers any sigmoid or gaussian. Moreover, we do not restrict ourselves only to subspaces of irreducible weight vectors

Neural Computation 6, 543-558 (1994)
@ 1994 Massachusetts Institute of Technology
(i.e., weight vectors where no hidden units can be removed). Although it makes our definitions less simple to state, they are more useful for reducing the size of search sets. When networks are trained to perform a desired function without knowing in advance the minimal number of hidden units sufficient for its implementation, our results can be applied. We introduce two key properties of an asymptotically constant activation function that turn out to be sufficient to characterize the nontrivial pairs of equivalent weight vectors. An activation function is self-affine if it is nontrivially affinely self-equivalent; it is affinely recursive if it is a weighted sum of its affine equivalents. We show that currently popular activation functions are self-affine but not affinely recursive. For such activation functions that are also bounded and nonconstant but asymptotically constant, we show that equivalent weight vectors can always be derived by composition of two basic relations called "interchange" and "conjugation." Our argument needs very few conditions on the activation function. Although we have to explicitly describe several properties (bounded, nonconstant, and asymptotically constant), much more severe constraints such as continuity or even smoothness are avoided. The methods used are elementary affine geometry, so our theoretical development is self-contained. The paper is organized as follows. In Section 2, we introduce the required definitions and notations. Section 3 presents our theoretical results. We discuss in Section 4 the applicability of our theorems to neural networks from an engineering standpoint and we also consider connections with wavelets. The mathematical proofs are given in an appendix.

2 Preliminaries
In this paper we study weight vectors determining the same input-output function for one-hidden-layer single-output perceptron-type architectures with various activation functions. Recall that a multilayer architecture is called perceptron-type if units in each hidden layer sum up weighted inputs from the preceding layer, add to this sum a bias, and then apply a common activation function, while output units compute only a weighted sum of their inputs. Typical choices of an activation function include: the discontinuous threshold function, gaussians (\exp(-t^2)), the logistic sigmoid ((1 + e^{-t})^{-1}), and the hyperbolic tangent. An activation function σ: R → R is asymptotically constant if both its limits in +∞ and −∞ exist and are finite. A function σ is sigmoidal if it is monotonically increasing and asymptotically constant. A bounded, nonconstant, asymptotically constant activation function is called regular. Consider a one-hidden-layer single-output perceptron-type architecture with n input and k hidden units with an activation function
σ: R → R (R denotes the set of real numbers) as a mapping from its space of parameters R^{k(n+2)} into the set S(R^n, R) of all input-output mappings from R^n to R. Denote it by P_σ(n, k): R^{k(n+2)} → S(R^n, R). If u = (u_1, ..., u_k), where for each i, u_i = (w_i, v_i, b_i) with w_i, b_i ∈ R, v_i ∈ R^n, so u ∈ R^{k(n+2)}, then the I/O function in S(R^n, R) corresponding to the weight vector u and architecture P_σ(n, k) is

P_\sigma(n, k)(u)(x) = \sum_{i=1}^{k} w_i\, \sigma(v_i \cdot x + b_i)
where x ∈ R^n is an input vector, v_i · x denotes the inner product with a vector v_i ∈ R^n corresponding to the input weights of the ith hidden unit, b_i is the bias of the ith hidden unit, and w_i is the weight corresponding to the connection between the ith hidden unit and the output unit. We shall call u_i = (w_i, v_i, b_i) the parameter vector for the ith hidden unit; so this unit has bias b_i, input weights v_i, and output weight w_i. A hidden unit participates if w_i ≠ 0 and is constant if v_i = 0. For any architecture P_σ(n, k) we call two weight vectors u, u′ ∈ R^{k(n+2)} functionally equivalent with respect to P_σ(n, k), denoted u ∼_F u′, if P_σ(n, k)(u) = P_σ(n, k)(u′). A subset S of the weight space R^{k(n+2)} is called a search set for P_σ(n, k) if for every u in weight space there exists at least one representative u′ in S such that u ∼_F u′. To describe search sets, we need functions φ_i: R^{k(n+2)} → R, i = 1, ..., k, defined for those u with v_i ≠ 0 by φ_i(u) = v_{ij}, where v_{ip} = 0 for every p < j (i.e., v_{ij} is the first nonzero coordinate of v_i), and for u with v_i = 0 by φ_i(u) = 0. To describe functional equivalence we define the composition ∼_1 ∘ ∼_2 of two relations ∼_1 and ∼_2 on the same set X as the set of all pairs (x, z) such that there exists y ∈ X with x ∼_1 y and y ∼_2 z. We will prove that ∼_F is the composition of two basic equivalence relations on the set of weight vectors. The first of these involves permutation of certain hidden units, while the second generalizes the notion of sign flips. It may happen that more than one hidden unit is constant, or that for a couple of hidden units i, j the sum of their contributions is constant. In such a case, these units can be replaced by one hidden unit with a constant affine transformation and the number of hidden units can be reduced. We call a pair of vectors (w, v, b), (w′, v′, b′) ∈ R^{n+2} compensating (with respect to σ) if wσ(v · x + b) + w′σ(v′ · x + b′) is a constant function of x. If in a weight vector u ∈ R^{k(n+2)} every hidden unit participates, there is no compensating pair u_i, u_j, and there is at most one constant hidden unit, then we call u irreducible. In this paper, however, we do not restrict ourselves to irreducible weight vectors. For applications, networks are trained to approximate
a desired function although one does not know the minimal number of hidden units. A subset E of the set of hidden units is called essential for a weight vector u ∈ R^{k(n+2)} [with respect to the architecture P_σ(n, k)] if the complementary set {1, ..., k} − E contains only constant hidden units and pairs of hidden units with compensating parameter vectors. We denote adj(u, E) = \sum_{i \notin E} w_i\, σ(v_i · x + b_i) (a constant, by the choice of E) and call it the adjustment of u with respect to E (and the architecture). For two weight vectors u, u′ ∈ R^{k(n+2)} we say that u can be obtained from u′ by an interchange, denoted u ∼_I u′, if there exist essential sets E for u and E′ for u′ and a bijection π: E → E′ such that u_i = u′_{π(i)} for every i ∈ E, and adj(u, E) = adj(u′, E′). It is easy to check that ∼_I is an equivalence and u ∼_I u′ implies P_σ(n, k)(u) = P_σ(n, k)(u′), so ∼_I ⊆ ∼_F. Another source of nonuniqueness of weights determining the same input-output function of a network is changes within the weights corresponding to some hidden unit. For instance, odd or even activation functions allow sign flips of the weights and biases corresponding to any hidden unit. More generally, any activation function that is affinely equivalent to itself in a nontrivial way allows analogous changes. Two functions σ, τ ∈ S(R, R) are called affinely equivalent if there exist invertible affine transformations α, β: R → R such that σ = β^{-1} ∘ τ ∘ α; i.e., if there exist real numbers v, w, b, d such that σ(t) = wτ(vt + b) + d with neither w nor v equal to 0. If in addition d = 0, we call σ and τ strongly affinely equivalent. Check that both of these relations are indeed equivalences. A function σ: R → R is (strongly) self-affine if there is a (strong) affine equivalence of σ to itself other than the identity. For example, the logistic sigmoid λ(t) = (1 + e^{-t})^{-1} is self-affine since λ(t) = 1 − λ(−t) for every t ∈ R. However, λ is not strongly self-affine. Recall that a function σ: R → R is odd if σ(−t) = −σ(t) and even if σ(−t) = σ(t); σ is translated odd (even) if there exist an odd (even) function τ: R → R and b ∈ R such that σ(t) = τ(t + b) for every t. A function σ is shifted translated odd (even) if there is a translated odd (even) function τ and d ∈ R such that σ(t) = τ(t) + d for every t. Thus, translation amounts to moving the graph of the function along the horizontal axis, while shifting is moving parallel to the vertical axis. It is trivial to check that shifted translated odd or translated even functions are self-affine. We will show later (Corollary 3.5) that the converse is true for any regular activation function σ, and moreover any self-affine regular activation function satisfies an equation σ(t) = wσ(−t + b) + d, where the real parameters w, b, d are unique and |w| = 1. For an odd or even activation function, it is clear that any u is functionally equivalent to a u′ obtained by "sign flipping" some of the hidden units, i.e., taking the weights and biases for a hidden unit and multiplying by −1 and then multiplying the output weight of this unit by −1 or +1 (in the odd or even case, respectively). We can now extend this simple
idea to self-affine regular activation functions. We call the resulting equivalence relation on weight vectors conjugation. So, for an activation function σ satisfying an equation σ(t) = wσ(−t + b) + d with |w| = 1, we call two weight vectors u, u′ ∈ R^{k(n+2)} conjugation equivalent with respect to P_σ(n, k), denoted u ∼_C u′, if there exists a joint essential set E for both u and u′ such that for every i ∈ E either u_i = u′_i or u′_i = (ww_i, −v_i, −b_i + b), where u_i = (w_i, v_i, b_i), and adj(u, E) − adj(u′, E) = md, where m = #{i ∈ E: u_i ≠ u′_i}. It is easy to verify that ∼_C is an equivalence relation and u ∼_C u′ implies P_σ(n, k)(u) = P_σ(n, k)(u′), so ∼_C ⊆ ∼_F. If w = 1, we call ∼_C even conjugation; if w = −1, we call ∼_C odd conjugation. There is still one more way that a perceptron-type network can yield a constant function. The ramp sigmoid, for example, which is defined by ρ(t) = 0 for t ≤ 0, ρ(t) = t for 0 < t < 1, and ρ(t) = 1 for t ≥ 1, satisfies the following functional equation: for any 0 < p < 1, ρ(t) = pρ(t/p) + (1 − p)ρ((t − p)/(1 − p)). The discontinuous threshold function also has the property that it can be expressed as a finite linear combination of dilated and translated copies of itself. Thus, for an activation function σ that is not self-affine, we say that σ is affinely recursive if there exist a positive integer m ≥ 3 and nonzero real numbers w_i and nonconstant affine transformations γ_i: R → R, i = 1, ..., m, such that \sum_{i=1}^{m} w_i\, σ ∘ γ_i is a constant function and for no pair i, j is w_i σ ∘ γ_i + w_j σ ∘ γ_j constant. So σ = −\sum_{i=1}^{m-1} (w_i/w_m)\, σ ∘ γ_i ∘ γ_m^{-1} + c/w_m is a shifted sum of at least m − 1 ≥ 2 dilated and translated copies of itself. Recall that a finite family of real functions {f_i, i = 1, ..., m} is called linearly independent if for every family {w_i, i = 1, ..., m} of real numbers, \sum_{i=1}^{m} w_i f_i = 0 implies w_i = 0 for every i = 1, ..., m. Notice that in studying functionally equivalent weight vectors we are actually considering linearly dependent families of the form {σ ∘ α_i}, where the α_i are affine maps from R^n to R. If u ∼_F u′ with respect to P_σ(n, k), then
\sum_{i=1}^{k} w_i\, \sigma \circ \alpha_i = \sum_{i=1}^{k} w'_i\, \sigma \circ \alpha'_i
So it is natural to study the problem of whether there exists a finite family {α_i} of affine maps from R^n to R such that {σ ∘ α_i} is linearly dependent, and this will be much easier if we can reduce to the case n = 1. For when n = 1, such a linearly dependent family implies that either σ is self-affine or σ is affinely recursive. To do this reduction, we use some tools from affine geometry [see, e.g., Birkhoff and MacLane (1965)]. Recall that an affine subspace of R^n is defined as a subset containing, with any two of its points, the entire line through these points, where lines in R^n are subsets of the form {at + b, t ∈ R} for some a, b ∈ R^n with a ≠ 0. Two affine subspaces H, K of R^n are called parallel if there exists u ∈ R^n such that H = {x + u, x ∈ K}. Every affine subspace is parallel to a unique linear subspace. The
dimension of an affine subspace is defined to be the dimension of the parallel linear subspace. A hyperplane in R^n is an affine subspace of dimension n − 1. Denote by A(R^n) = {α: α(x) = v · x + b, v ∈ R^n, b ∈ R} the set of all affine functions from R^n to R, by A the group of all nonconstant affine transformations from R to R, and by A_+ the subset of A containing all γ ∈ A such that γ(t) = vt + b with v > 0. Consider an equivalence relation ≡ on A(R^n) defined by α ≡ β if and only if there exists γ ∈ A with α = γ ∘ β. It is easy to verify that α ≡ β if and only if their cozero hyperplanes H(α) = α^{-1}(0) and H(β) = β^{-1}(0) are parallel. Denote by χ_1: R^n → R the constant function assigning 1 to every x ∈ R^n. A function σ: R → R is distinguishing (with respect to ≡) if for every finite family {α_i, i = 1, ..., m} ⊂ A(R^n), linear dependence of {σ ∘ α_i, i = 1, ..., m} ∪ {χ_1} implies linear dependence of {σ ∘ α_i, i ∈ P} ∪ {χ_1} for every class P of the equivalence ≡, and moreover if \sum_{i=1}^{m} w_i\, σ ∘ α_i = c, then for every P there exists c_P such that \sum_{i \in P} w_i\, σ ∘ α_i = c_P. By N we denote the set of natural numbers.
3 Theoretical Results
Our first theorem gives conditions on an activation function σ allowing only interchange-equivalent weight vectors to be functionally equivalent.

Theorem 3.1. Let σ: R → R be distinguishing, not affinely recursive, and not self-affine. Then for any positive integers n, k, ∼_F = ∼_I with respect to the architecture P_σ(n, k).

This allows for the possibility of reducing the number of inputs to 1.

Corollary 3.2. Let σ: R → R be distinguishing and n, k be positive integers such that ∼_F ≠ ∼_I with respect to the architecture P_σ(n, k). Then there exists a positive integer k_0 such that for every k ≥ k_0, ∼_F ≠ ∼_I with respect to the architecture P_σ(1, k).

Thus, if we have nontrivially functionally equivalent weight vectors at all, then they must exist even in the case of only a single input. Our next result, proved using techniques of affine geometry, guarantees that functions currently used as activation functions are, in fact, distinguishing.

Theorem 3.3. Every asymptotically constant function σ: R → R is distinguishing.

So all sigmoidal and gaussian activation functions are distinguishing. To investigate functionally equivalent weight vectors, we must study asymptotically constant functions that are either self-affine or affinely recursive. First notice that many typical choices of an activation function are self-affine: hyperbolic tangent, gaussian, logistic sigmoid, ramp sigmoids,
and discontinuous threshold functions. All of them are shifted translated (possibly by 0) odd or even functions. Our next result shows that these are the only possibilities for a reasonable activation function.

Theorem 3.4. Let σ be a regular activation function. If σ is self-affine, then there exist unique real numbers b, d, w such that |w| = 1 and for every t ∈ R, σ(t) = wσ(−t + b) + d.
Corollary 3.5. Let σ be a regular activation function. Then σ is self-affine if and only if it is either translated even or shifted translated odd.

While most activation functions are self-affine, they may not be affinely recursive. We need the following technical criterion.

Lemma 3.6. Let σ: R → R be a self-affine regular activation function with different limits in +∞ and −∞. Then σ is not affinely recursive if and only if {σ ∘ γ_i, i = 1, ..., m} is linearly independent for any finite family of distinct γ_i in A_+.

Theorem 3.7. The logistic sigmoid is not affinely recursive.

Sussmann (1992) used a similar linear independence condition to characterize functionally equivalent irreducible weight vectors with respect to the hyperbolic tangent. Since the hyperbolic tangent is affinely equivalent to the logistic sigmoid (tanh(t) = 2λ(2t) − 1), it follows from our result that the hyperbolic tangent is not affinely recursive, so our results extend Sussmann's. So, typical activation functions are self-affine and not affinely recursive. The following theorem characterizes equivalent weight vectors for one-hidden-layer perceptron-type networks with such activation functions.

Theorem 3.8. Let σ: R → R be a self-affine regular activation function that is not affinely recursive, and let n, k be any positive integers. Then ∼_F = ∼_I ∘ ∼_C with respect to the architecture P_σ(n, k), where ∼_C is an odd or an even conjugation equivalence.

The reader may wish to interpret these results in terms of search sets. The following proposition describes search sets for self-affine regular activation functions.
Proposition 3.9. Let n, k be any positive integers. If σ is a translated even function, then S_E(n, k) = {u ∈ R^{k(n+2)}: 0 ≤ φ_1(u) ≤ φ_2(u) ≤ ... ≤ φ_k(u)} is a search set for P_σ(n, k). If σ is a translated odd function, then S_O(n, k) = {u ∈ R^{k(n+2)}: 0 < w_1 ≤ w_2 ≤ ... ≤ w_k} is a search set for P_σ(n, k). If σ is a shifted translated odd function, then S_S(n, k) = {u ∈ R^{k(n+2)}: [(∃i ∈ {1, ..., k})(v_i = 0) and (0 ≤ w_1 ≤ w_2 ≤ ... ≤ w_k)] ∨ [(∀i = 1, ..., k)(v_i ≠ 0) and (w_1 ≤ w_2 ≤ ... ≤ w_k)]} is a search set for P_σ(n, k).
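As a standalone sanity check (not part of the paper's development), the following C program verifies numerically the self-affinity λ(t) = 1 − λ(−t) that underlies odd conjugation for the logistic sigmoid: flipping a hidden unit (w, v, b) to (−w, −v, −b) changes the I/O function only by the constant w, which the adjustment term absorbs. All constants are arbitrary test values.

```c
/* Numeric check of odd conjugation for the logistic sigmoid: the
 * conjugate unit plus the constant w reproduces the original unit. */
#include <math.h>
#include <stdio.h>

static double lam(double t) { return 1.0 / (1.0 + exp(-t)); }

int main(void)
{
    double w = 0.7, v = -1.3, b = 0.4, t;
    for (t = -3.0; t <= 3.0; t += 0.5) {
        double f  = w * lam(v * t + b);            /* original unit       */
        double fc = -w * lam(-v * t - b) + w;      /* conjugate + const w */
        printf("t = % .2f   |f - fc| = %.2e\n", t, fabs(f - fc));
    }
    return 0;
}
```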
4 Discussion
We have analyzed functionally equivalent weight vectors for single-output one-hidden-layer perceptron-type networks with asymptotically constant, nonconstant, bounded activation functions. The last two of these conditions, bounded and nonconstant, are sufficient for a class of networks of this type to possess the "universal approximation property" (Hornik 1991). However, extension of our results to activation functions that are not asymptotically constant seems to be difficult. Asymptotic constancy is a rather natural requirement from an engineering perspective. Our reduction of the problem to the case of single-input networks uses asymptotic constancy. This reduction of input dimension seems to be typical for perceptron-type networks. For example, Stinchcombe and White (1989) gave an elegant proof that the problem of universal approximation by perceptron-type networks can be similarly reduced to the single-input case. In applications, we are looking for weight vectors generating the same "error" with respect to the function being approximated. Any two functionally equivalent weight vectors clearly do have the same error with respect to any function being approximated and to any error measure. There exist affinely recursive functions recently studied in connection with wavelets (scaling functions used to construct the mother wavelet); see, e.g., Strang (1989) and Daubechies (1992). The possibility of utilizing wavelets as activation functions has been explored by Kreinovitch et al. (1992) and by Pati and Krishnaprasad (1991). The latter constructed a mother wavelet function from three translated and dilated copies of the logistic sigmoid. However, scaling functions have not been considered as activation functions.
Appendix: Mathematical Proofs

Proof of Theorem 3.1. Let u, u′ ∈ R^{k(n+2)} be such that u ∼_F u′ with respect to P_σ(n, k). Putting α_i(x) = v_i · x + b_i and α′_i(x) = v′_i · x + b′_i for every i = 1, ..., k, we have

\sum_{i=1}^{k} w_i\, \sigma \circ \alpha_i = \sum_{i=1}^{k} w'_i\, \sigma \circ \alpha'_i
Let P, P′, and S be the partitions of the sets {α_i, i = 1, ..., k}, {α′_i, i = 1, ..., k}, and {α_i, i = 1, ..., k} ∪ {α′_i, i = 1, ..., k}, respectively, with respect to ≡. For every S ∈ S we have S = P_S ∪ P′_S, where P_S ∈ P or P_S = ∅ and P′_S ∈ P′ or P′_S = ∅. Put M = {i ∈ {1, ..., k}: w_i ≠ 0}, M′ = {i ∈ {1, ..., k}: w′_i ≠ 0}, M_S = {i ∈ M: α_i ∈ P_S}, and M′_S = {i ∈ M′: α′_i ∈ P′_S}. Since σ is distinguishing, for every S ∈ S there exists c_S ∈ R such that

\sum_{i \in M_S} w_i\, \sigma \circ \alpha_i - \sum_{i \in M'_S} w'_i\, \sigma \circ \alpha'_i = c_S
Denote by S_0 the class of ≡ containing all constant affine functions. For every S ∈ S with S ≠ S_0, choose some α_S ∈ S. Then for every i ∈ M_S and i ∈ M′_S there exist γ_i ∈ A, γ′_i ∈ A, respectively, such that α_i = γ_i ∘ α_S and α′_i = γ′_i ∘ α_S. Since α_S is onto R, we have \sum_{i \in M_S} w_i\, σ ∘ γ_i − \sum_{i \in M'_S} w'_i\, σ ∘ γ'_i = c_S. Since σ is not affinely recursive, we can verify by induction that there exist sets I_S, D_S, T_S ⊆ M_S and I′_S, D′_S, T′_S ⊆ M′_S and one-to-one onto mappings π_S: I_S → I′_S, τ_S: D_S → T_S, τ′_S: D′_S → T′_S, such that M_S = I_S ∪ D_S ∪ T_S, M′_S = I′_S ∪ D′_S ∪ T′_S, and for every i ∈ I_S ∪ D_S ∪ D′_S there exists c_i ∈ R such that w_i σ ∘ γ_i − w′_{π_S(i)} σ ∘ γ′_{π_S(i)} = c_i for every i ∈ I_S, w_i σ ∘ γ_i + w_{τ_S(i)} σ ∘ γ_{τ_S(i)} = c_i for every i ∈ D_S, and w′_i σ ∘ γ′_i + w′_{τ′_S(i)} σ ∘ γ′_{τ′_S(i)} = c_i for every i ∈ D′_S. Since σ is not self-affine, c_i = 0 for every i ∈ I_S ∪ D_S ∪ D′_S; w_i = w′_{π_S(i)} and γ_i = γ′_{π_S(i)} for every i ∈ I_S; w_i = −w_{τ_S(i)} and γ_i = γ_{τ_S(i)} for every i ∈ D_S; and w′_i = −w′_{τ′_S(i)} and γ′_i = γ′_{τ′_S(i)} for every i ∈ D′_S. Put I = ∪{I_S: S ∈ S, S ≠ S_0}, I′ = ∪{I′_S: S ∈ S, S ≠ S_0}, and define π: I → I′ by π(i) = π_S(i) for i ∈ I_S. Since for every i ∈ I, u_i = u′_{π(i)}, and for every i ∉ I either α_i is constant or there exists j ∉ I such that w_i σ ∘ α_i + w_j σ ∘ α_j = c for some c ∈ R, and for every i ∉ π(I) either α′_i is constant or there exists j ∉ π(I) such that w′_i σ ∘ α′_i + w′_j σ ∘ α′_j = c for some c ∈ R, we conclude u ∼_I u′.
Proof of Corollary 3.2. Suppose that $\sim_F \neq \sim_I$. Then by Theorem 3.1, $\sigma$ is self-affine or affinely recursive. If $\sigma$ is self-affine, then there exist $w, v, b, d \in \mathbf{R}$ such that $\sigma(t) = w\sigma(vt + b) + d$ for every $t \in \mathbf{R}$ and at least one of the affine transformations $wt + d$ or $vt + b$ is not the identity. Then $w_1\sigma(v_1x_1 + b_1) = w w_1 \sigma(v v_1 x_1 + v b_1 + b) + w_1 d$. So for every $k \geq 2$, $\mathcal{P}_\sigma(1,k)(u) = \mathcal{P}_\sigma(1,k)(u')$, where $u_1 = (w_1, v_1, b_1)$ and $u_i = 0$ for every $i = 2, \ldots, k$, while $u'_1 = (w w_1, v v_1, v b_1 + b)$, $u'_2 = (w', 0, b')$ with $w_1 d = w'\sigma(b')$, and $u'_i = 0$ for every $i = 3, \ldots, k$. So for every $k \geq 2$ (for every $k \geq 1$ if $\sigma$ is strongly self-affine) there exist $u, u' \in \mathbf{R}^{3k}$ such that $u \sim_F u'$ with respect to $\mathcal{P}_\sigma(1,k)$ but $u$ and $u'$ are not interchange equivalent.

If $\sigma$ is affinely recursive, then $\sigma(t) = \sum_{i=1}^{m} p_i\sigma(q_it + r_i) + c$. So for every $k \geq m+1$, $\mathcal{P}_\sigma(1,k)(u) = \mathcal{P}_\sigma(1,k)(u')$, where $u_1 = (w_1, v_1, b_1)$ and $u_i = 0$ for every $i = 2, \ldots, k$, while $u'_i = (p_iw_1, q_iv_1, q_ib_1 + r_i)$ for every $i = 1, \ldots, m$, $u'_{m+1} = (w', 0, b')$ with $w_1 c = w'\sigma(b')$, and $u'_i = 0$ for every $i = m+2, \ldots, k$. So for every $k \geq m+1$ there exist $u, u' \in \mathbf{R}^{3k}$ such that $u \sim_F u'$ with respect to $\mathcal{P}_\sigma(1,k)$ but $u$ and $u'$ are not interchange equivalent.
Proof of Theorem 3.3. If $n = 1$, there are only two equivalence classes of $\approx$: one containing all nonconstant affine functions and the other containing all constant affine functions. So the only possible decomposition of a one-variable function $f$ of the form 2.1 is $f = f_P + c$, and if $f$ is constant, $f_P$ must be constant too. If $n \geq 2$, take the decomposition $\{f_P, P \in \mathcal{P}\}$ of the representation $f = \sum_{i=1}^{k} w_i\, \sigma \circ \alpha_i$ of a constant function $f(x) = c$. If a hyperplane $H$ is parallel with a cozero hyperplane $H_i = \alpha_i^{-1}(\{0\})$, then $\sigma \circ \alpha_i$ is constant on $H$. Hence, for every $P \in \mathcal{P}$, $f_P$ is constant on every hyperplane parallel with the family of mutually parallel hyperplanes $\{H_i, i \in P\}$.
Suppose that for some $P \in \mathcal{P}$, $f_P$ were not constant; we derive a contradiction. There exist two hyperplanes $H$ and $K$ parallel with the cozero hyperplanes $H_i$, $i \in P$, such that $f_P|K = c_K \neq c_H = f_P|H$. Without loss of generality, $0 \notin H$. In the case $n = 2$, hyperplanes are lines, and it may happen that the unique line contained in $H$ is orthogonal to a cozero hyperplane $H_i$ for some $i \notin P$; we shall return to this case in a moment. When $n \geq 3$, the dimension of any hyperplane is at least 2, so there are infinitely many possible directions of lines in $H$. Since only finitely many directions of lines in $H$ are orthogonal to $H_i$ for some $i \notin P$, there exists a line in $H$ that is not orthogonal to any of the cozero hyperplanes $H_i$ for $i \notin P$. Take such a line and represent it as $\{at + b, t \in \mathbf{R}\}$ for some $a, b \in \mathbf{R}^n$ such that $v_i \cdot a \neq 0$ for every $i \notin P$ (recall that $\alpha_i(x) = v_i \cdot x + b_i$). Consider the affine mapping $\varphi: \mathbf{R} \to H$ defined by $\varphi(t) = at + b$. Take any $d \in \mathbf{R}^n$ such that $K = H + d$ and put $\psi(t) = at + b + d$; then $\psi$ is an affine mapping from $\mathbf{R}$ to $K$. We now verify that
$$\lim_{t \to \infty} (f - f_P) \circ \varphi(t) = \lim_{t \to \infty} (f - f_P) \circ \psi(t). \tag{A.1}$$
Put $\ell_+ = \lim_{t \to \infty} \sigma(t)$, $\ell_- = \lim_{t \to -\infty} \sigma(t)$, $P_+ = \{i \notin P: v_i \cdot a > 0\}$, $P_- = \{i \notin P: v_i \cdot a < 0\}$, $w_+ = \sum_{i \in P_+} w_i$, and $w_- = \sum_{i \in P_-} w_i$. Since $v_i \cdot a \neq 0$ for every $i \notin P$, we have
$$\lim_{t \to \infty} (f - f_P) \circ \varphi(t) = w_+\ell_+ + w_-\ell_- = \lim_{t \to \infty} (f - f_P) \circ \psi(t).$$
So by A.1 there exists $t \in \mathbf{R}$ such that $|(f - f_P) \circ \varphi(t) - (f - f_P) \circ \psi(t)| < |c_K - c_H|$. However, since $\varphi(t) \in H$ and $\psi(t) \in K$, we have $(f - f_P) \circ \varphi(t) - (f - f_P) \circ \psi(t) = c_K - c_H$, which is a contradiction.
Return to the case $n = 2$. There exists at most one $S \in \mathcal{P}$ such that for every $i \in P$ and every $j \in S$, $H_i$ and $H_j$ are orthogonal. Check that there exist $a, b \in \mathbf{R}^n$ with $b \neq 0$ and $s \in \mathbf{R}$ such that (i) $H = \{at + b, t \in \mathbf{R}\}$, (ii) $K = \{at + bs, t \in \mathbf{R}\}$, and (iii) $\{bt, t \in \mathbf{R}\}$ is parallel with $H_j$ for $j \in S$. Put $\varphi(t) = at + b$ and $\psi(t) = at + bs$. Since for every $R \in \mathcal{P}$ with $R \neq S$ we have $a \cdot v_i \neq 0$ for every $i \in R$, we can, analogously to A.1 above, verify that
$$\lim_{t \to \infty} (f - f_P - f_S) \circ \varphi(t) = \lim_{t \to \infty} (f - f_P - f_S) \circ \psi(t).$$
So there exists $t \in \mathbf{R}$ such that
$$|(f - f_P - f_S) \circ \varphi(t) - (f - f_P - f_S) \circ \psi(t)| < |c_K - c_H|.$$
Since $f_S$ is constant on every hyperplane (line) parallel with $H_i$ for $i \in S$, we have for every $t \in \mathbf{R}$
$$f_S \circ \varphi(t) = f_S(at + b) = f_S(at + bs) = f_S \circ \psi(t).$$
Hence $|(f - f_P - f_S) \circ \varphi(t) - (f - f_P - f_S) \circ \psi(t)| = |c_K - c_H|$, which is a contradiction.
The following lemmas will be used in our proof of Theorem 3.4.
Lemma A.1. Let $\sigma: \mathbf{R} \to \mathbf{R}$ be a bounded self-affine function. Then there exist a strongly self-affine function $\tau: \mathbf{R} \to \mathbf{R}$ and $d \in \mathbf{R}$ such that $\sigma = \tau + d$.
Proof. Since $\sigma$ is self-affine, there exist $w, v, b, c \in \mathbf{R}$ such that $\sigma(t) = w\sigma(vt + b) + c$ for every $t \in \mathbf{R}$. If $w \neq 1$, put $d = c/(1 - w)$ and $\tau = \sigma - d$. Then $\tau(t) = w\tau(vt + b)$ for every $t \in \mathbf{R}$, so $\tau$ is strongly self-affine. If $w = 1$, we have $\sigma(t) = \sigma(v^nt + (v^{n-1} + \cdots + 1)b) + nc$ for every $t \in \mathbf{R}$ and every $n \in \mathbf{N}$ with $n \geq 1$. If $c = 0$, $\sigma$ is strongly self-affine and our statement holds. If $c \neq 0$, we have
$$\sigma(t)/c - \sigma\bigl(v^nt + b(v^n - 1)/(v - 1)\bigr)/c = n$$
for every $t \in \mathbf{R}$ and every positive integer $n$, which contradicts our assumption that $\sigma$ is bounded.
Lemma A.2. Let $\sigma: \mathbf{R} \to \mathbf{R}$ be a bounded nonconstant function and $v, w, b \in \mathbf{R}$. If $\sigma(t) = w\sigma(vt + b)$ for every $t \in \mathbf{R}$, then $|w| = 1 = |v|$.

Proof. Since $\sigma$ is not constant, neither $v$ nor $w$ is zero. Moreover, it is easy to verify that for any positive integer $n$, there are functional equations (as functions of $t \in \mathbf{R}$):
$$\sigma(t) = w^n\sigma\bigl(v^nt + (v^n - 1)b/(v - 1)\bigr) \tag{A.2}$$
and
$$\sigma(t) = (1/w^n)\,\sigma\bigl(t/v^n - (1 - 1/v^n)b/(v - 1)\bigr) \tag{A.3}$$
Suppose that $|w| \neq 1$. Since $\sigma$ is not constant, there exists some $t_0 \in \mathbf{R}$ such that $\sigma(t_0) \neq 0$. If $|w| < 1$, $\lim_{n\to\infty} \sigma(t_0)/w^n$ is $+\infty$ or $-\infty$, and if $|w| > 1$, $\lim_{n\to\infty} w^n\sigma(t_0)$ is $+\infty$ or $-\infty$. However, by A.2, $\sigma(t_0)/w^n = \sigma\bigl(v^nt_0 + (v^n - 1)b/(v - 1)\bigr)$, and by A.3, $w^n\sigma(t_0) = \sigma\bigl(t_0/v^n - (1 - 1/v^n)b/(v - 1)\bigr)$. So whether $|w|$ were greater or less than 1, in either case we would find a sequence $u_n$ with $\sigma(u_n)$ unbounded. This proves that $|w| = 1$.
If $|v| < 1$, then by A.2 for every $t \in \mathbf{R}$ we have
$$\sigma(t) = \lim_{n\to\infty} \sigma\bigl(v^nt + b(v^n - 1)/(v - 1)\bigr) = \sigma\bigl(b/(1 - v)\bigr),$$
but $\sigma$ was supposed to be nonconstant. If $|v| > 1$, then by A.3 for any $t \in \mathbf{R}$,
$$\sigma(t) = \lim_{n\to\infty} \sigma\bigl(t/v^n - (1 - 1/v^n)b/(v - 1)\bigr) = \sigma\bigl(b/(1 - v)\bigr),$$
again a contradiction. Hence $|v| = 1$.
Proof of Theorem 3.4. By Lemma A.1, $\sigma = \tau + c$, where $\tau$ is strongly self-affine. So there exist $v, w, b \in \mathbf{R}$ such that $v \neq 0$, $w \neq 0$, and $\tau(t) = w\tau(vt + b)$ for every $t \in \mathbf{R}$. By Lemma A.2, $|v| = 1$ and $|w| = 1$. Suppose that $v = 1$. Then $\tau(t) = \tau(t + 2b)$ for every $t \in \mathbf{R}$, so $\tau$ is periodic. However, the only periodic functions that are asymptotically constant are constant functions. Since we suppose that $\tau$ is not constant, $v = -1$. Hence $\tau(t) = w\tau(-t + b)$, where $|w| = 1$, and so $\sigma(t) = w\sigma(-t + b) + d$, where $d = c(1 - w)$.

Suppose that $\sigma(t) = w\sigma(-t + b) + d = w\sigma(-t + b') + d'$. Then $\lim_{t\to\infty} w\sigma(-t + b) - \lim_{t\to\infty} w\sigma(-t + b') = 0 = d - d'$. So $d = d'$, and $w\sigma(-t + b) = w\sigma(-t + b')$ implies $\sigma(t) = \sigma(t + b - b')$ for every $t$. Since $\sigma$ is not periodic, $b = b'$.
Proof of Corollary 3.5. If $\sigma(t) = \tau(t + b) + d$, where $\tau$ is either an odd or an even function, then $\sigma$ is self-affine. Conversely, if $\sigma$ is self-affine, then by Theorem 3.4, $\sigma(t) = w\sigma(-t + b) + d$, where $w = 1$ or $-1$. If $w = 1$, applying $\sigma(t) = \sigma(-t + b) + d$ twice gives $\sigma(t) = \sigma(-(-t + b) + b) + 2d = \sigma(t) + 2d$; hence $d = 0$. Put $\rho(t) = \sigma(t + b/2)$; then $\rho(t) = \rho(-t)$, so $\rho$ is even and, since $\sigma(t) = \rho(t - b/2)$, $\sigma$ is a translated even function. For $w = -1$, put $\rho(t) = \sigma(t + b/2) - d/2$; then $\rho(-t) = -\rho(t)$, so $\rho$ is odd and, since $\sigma(t) = \rho(t - b/2) + d/2$, $\sigma$ is a shifted translated odd function.
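As a concrete instance of the second alternative (our illustration, consistent with Corollary 3.5): the logistic sigmoid $\lambda(t) = 1/(1 + e^{-t})$ satisfies
$$\lambda(t) + \lambda(-t) = \frac{1}{1 + e^{-t}} + \frac{1}{1 + e^{t}} = \frac{e^{t}}{e^{t} + 1} + \frac{1}{1 + e^{t}} = 1,$$
so $\lambda(t) = -\lambda(-t) + 1$; that is, $\lambda$ is self-affine with $w = -1$, $b = 0$, $d = 1$, a shifted odd function.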
Proof of Lemma 3.6. Since $\sigma$ has different limits at $+\infty$ and $-\infty$, it follows from Theorem 3.4 that $\sigma(t) = -\sigma(-t + b) + d$. We first prove that if $\{\sigma \circ \gamma_i, i = 1, \ldots, m\}$ is linearly independent for every positive integer $m$ and for every $\{\gamma_i, i = 1, \ldots, m\} \subseteq A_+$ with $\gamma_i \neq \gamma_j$ for $i \neq j$, then $\sigma$ is not affinely recursive. Suppose that $\sum_{i=1}^{m} w_i\, \sigma \circ \gamma_i = c$, where $m \geq 3$ and $\gamma_i \in A$ with $\gamma_i \neq \gamma_j$ for $i \neq j$. Let $\gamma_i(t) = v_it + b_i$. Put $M = \{1, \ldots, m\}$, $M_+ = \{i \in M: v_i > 0\}$, and $M_- = \{i \in M: v_i < 0\}$; for $i \in M_+$ put $\tilde{v}_i = v_i$, $\tilde{b}_i = b_i$, $\tilde{w}_i = w_i$, while for $i \in M_-$ put $\tilde{v}_i = -v_i$, $\tilde{b}_i = -b_i + b$, and $\tilde{w}_i = -w_i$. Putting $\tilde{\gamma}_i(t) = \tilde{v}_it + \tilde{b}_i$ and $\tilde{c} = c - d\sum_{i \in M_-} w_i$, we have $\sum_{i=1}^{m} \tilde{w}_i\, \sigma \circ \tilde{\gamma}_i = \tilde{c}$, with $\tilde{\gamma}_i \in A_+$ for every $i \in M$.
Letting $t \to +\infty$ and $t \to -\infty$ gives $\bigl(\sum_{i=1}^{m} \tilde{w}_i\bigr)\lim_{t\to\infty}\sigma(t) = \tilde{c} = \bigl(\sum_{i=1}^{m} \tilde{w}_i\bigr)\lim_{t\to-\infty}\sigma(t)$. Since $\lim_{t\to-\infty}\sigma(t) \neq \lim_{t\to\infty}\sigma(t)$, $\sum_{i=1}^{m} \tilde{w}_i = 0$, and hence $\tilde{c} = 0$. So $\sum_{i=1}^{m} \tilde{w}_i\, \sigma \circ \tilde{\gamma}_i = 0$. Hence for every $i \in M$, either $\tilde{w}_i = 0$ or there exists $j \in M$ such that $\tilde{\gamma}_i = \tilde{\gamma}_j$ and $\tilde{w}_i + \tilde{w}_j = 0$. In the first case $w_i = 0$, while in the second we have $\tilde{w}_i\, \sigma \circ \tilde{\gamma}_i + \tilde{w}_j\, \sigma \circ \tilde{\gamma}_j = 0$, which implies that $w_i\, \sigma \circ \gamma_i + w_j\, \sigma \circ \gamma_j$ is constant. So $\sigma$ cannot be affinely recursive.

Now for the other direction. If $\sigma$ is not affinely recursive, then $\{\sigma \circ \gamma_i, i = 1, \ldots, m\}$ is linearly independent for every $m \in \mathbf{N}$ and for every $\{\gamma_i, i = 1, \ldots, m\} \subseteq A_+$ with $\gamma_i \neq \gamma_j$ for $i \neq j$. Let $\sum_{i=1}^{m} w_i\, \sigma \circ \gamma_i = 0$. We shall verify that $w_i = 0$ for every $i = 1, \ldots, m$. Put $I = \{i \in \{1, \ldots, m\}: w_i \neq 0\}$.

Case (i): $\#I = 1$; trivial.

Case (ii): $\#I = 2$. Suppose $I = \{1, 2\}$. Then $w_1\sigma \circ \gamma_1 + w_2\sigma \circ \gamma_2 = 0$, and hence $\sigma = -(w_2/w_1)\,\sigma \circ \gamma_2 \circ \gamma_1^{-1}$. Put $w = -w_2/w_1$ and $vt + b = \gamma_2 \circ \gamma_1^{-1}(t)$, so that $\sigma(t) = w\sigma(vt + b)$ for every $t \in \mathbf{R}$. By Lemma A.2, $|w| = 1$ and $|v| = 1$. Since $\sigma$ has different limits at $+\infty$ and $-\infty$, $v = -1$. So $\gamma_2 \circ \gamma_1^{-1}(t) = -t + b$, and hence $\gamma_2(t) = \gamma_1(-t + b)$. This contradicts the assumption that $\gamma_1, \gamma_2 \in A_+$.

Case (iii): $\#I \geq 3$. The assumption that $\sigma$ is not affinely recursive implies that there exist $i, j \in I$ and $c \in \mathbf{R}$ such that $w_i\sigma \circ \gamma_i + w_j\sigma \circ \gamma_j = c$. Putting $w = -w_j/w_i$ and $vt + b = \gamma_j \circ \gamma_i^{-1}(t)$, we obtain $\sigma(t) = w\sigma(vt + b) + c/w_i$ for every $t \in \mathbf{R}$. Analogously to the previous case, we get a contradiction with the assumption that both $\gamma_i, \gamma_j \in A_+$.

In the following verification of the conditions of Lemma 3.6 for the logistic sigmoid, we use some tricks invented by Sussmann (1992), but our argument is simpler.
Proof of Theorem 3.7. By Lemma 3.6, it is sufficient to verify that $\{\lambda \circ \alpha_i, i = 1, \ldots, m\}$ is linearly independent for every $m \in \mathbf{N}$ with $m \geq 1$ and for every $\{\alpha_i, i = 1, \ldots, m\} \subseteq A_+$ with $\alpha_i \neq \alpha_{i'}$ for every $i \neq i'$. Put $\alpha_i(t) = v_it + b_i$. Then $v_i > 0$ for every $i = 1, \ldots, m$, and for every $i \neq i'$, $v_i \neq v_{i'}$ or $b_i \neq b_{i'}$. Let
$$\sum_{i=1}^{m} w_i\,\lambda(v_it + b_i) = 0. \tag{A.4}$$
Expressing $\lambda$ as an infinite series, $\lambda(s) = \sum_{j=0}^{\infty} (-1)^j e^{-js}$ for $s > 0$, and putting $u_n = \sum_{\{(i,j):\, jv_i = n\}} (-1)^j w_i e^{-jb_i}$ for each $n$ in the countable set $N = \{jv_i: j \in \mathbf{N}, i = 1, \ldots, m\}$, we obtain from A.4
$$\sum_{n \in N} u_n e^{-nt} = 0 \tag{A.5}$$
for all sufficiently large $t \in \mathbf{R}$.
We shall verify by induction that $u_n = 0$ for every $n \in N$. Let $n = 0$. Since $v_i \neq 0$ for every $i = 1, \ldots, m$, $jv_i = 0$ implies $j = 0$; hence $u_0 = \sum_{i=1}^{m} w_i$. Since $v_i > 0$ for every $i = 1, \ldots, m$,
$$\lim_{t\to+\infty} \sum_{i=1}^{m} w_i\,\lambda(v_it + b_i) = \sum_{i=1}^{m} w_i \lim_{t\to\infty}\lambda(t) = \sum_{i=1}^{m} w_i = u_0.$$
Since $\sum_{i=1}^{m} w_i\,\lambda(v_it + b_i) = 0$, we have $u_0 = 0$. Suppose that $u_n = 0$ for every $n < n_0$. Then by A.5
$$u_{n_0}e^{-n_0t} + \sum_{n > n_0} u_ne^{-nt} = 0$$
for all sufficiently large $t \in \mathbf{R}$. Hence $u_{n_0} = -\lim_{t\to\infty}\sum_{n > n_0} u_ne^{-(n - n_0)t} = 0$.

Let $i_0 \in \{1, \ldots, m\}$. To verify that $w_{i_0} = 0$, consider $I = \{i \in \{1, \ldots, m\}: v_i = v_{i_0}\}$. For every $j \in \mathbf{N}$ put $n_j = jv_{i_0}$. Since $u_{n_j} = 0$ for every $j \in \mathbf{N}$, we have
$$0 = u_{n_j} = \sum_{i \in I} (-1)^j w_i e^{-jb_i}.$$
Putting $c_i = e^{-b_i}$, we have $\sum_{i \in I} w_ic_i^j = 0$ for every $j \in \mathbf{N}$. Since $b_i \neq b_{i'}$ for every $i \neq i'$ in $I$, we have $c_i \neq c_{i'}$, too. Denote by $k$ the number of elements of $I$ and let $\phi: \{1, \ldots, k\} \to I$ be a mapping such that $c_{\phi(1)} > \cdots > c_{\phi(k)}$. We shall prove by induction that $w_{\phi(i)} = 0$ for every $i = 1, \ldots, k$. First,
$$w_{\phi(1)} = -\lim_{j\to\infty} \sum_{i=2}^{k} w_{\phi(i)}\bigl(c_{\phi(i)}/c_{\phi(1)}\bigr)^j = 0,$$
since $c_{\phi(i)}/c_{\phi(1)} < 1$ for every $i = 2, \ldots, k$. If $w_{\phi(p)} = 0$ for every $p < q$, then
$$w_{\phi(q)} = -\lim_{j\to\infty} \sum_{i=q+1}^{k} w_{\phi(i)}\bigl(c_{\phi(i)}/c_{\phi(q)}\bigr)^j = 0,$$
since $c_{\phi(i)}/c_{\phi(q)} < 1$ for every $i = q+1, \ldots, k$. So for every $i_0 \in \{1, \ldots, m\}$, $w_{i_0} = 0$.
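The linear independence established above can also be probed numerically (a sanity check of ours, not part of the proof): sample the functions $\lambda(v_it + b_i)$ on a grid and confirm that the resulting matrix has full column rank.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
m = 6
v = rng.uniform(0.5, 3.0, size=m)      # distinct positive slopes (maps in A+)
b = rng.uniform(-2.0, 2.0, size=m)     # biases

t = np.linspace(-10.0, 10.0, 400)
A = logistic(np.outer(t, v) + b)       # column i samples logistic(v_i * t + b_i)

# Full column rank on the sample grid is consistent with linear
# independence of the functions themselves (nearly equal parameter
# pairs would make the numerical rank borderline).
print(np.linalg.matrix_rank(A))        # expected: 6
```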
Proof of Theorem 3.8. First proceed as in the first two paragraphs of the proof of Theorem 3.1. However, now we suppose that $\sigma$ is self-affine. By Theorem 3.4 there exist $w, b, d \in \mathbf{R}$ such that $|w| = 1$ and $\sigma(t) = w\sigma(-t + b) + d$ for every $t \in \mathbf{R}$. Hence, putting $I = \bigcup\{I_S: S \in \mathcal{S}, S \neq S_0\}$, we have $w'_{\tau(i)} = ww_i$, $v'_{\tau(i)} = -v_i$, $b'_{\tau(i)} = -b_i + b$, and $c_i = d$ for every $i \in I$. So, for every $i \in I$, $u'_{\tau(i)} = (ww_i, -v_i, -b_i + b)$, where $u_i = (w_i, v_i, b_i)$. Putting $u''_i = u'_{\tau(i)}$ for $i \in I$ and $u''_i = u_i$ otherwise, $u''$ is obtained from $u$ by conjugation and from $u'$ by an interchange of hidden units; that is, $u \sim_C u''$ and $u'' \sim_I u'$. By Corollary 3.5, $\sigma$ is either a translated even or a shifted translated odd function, so the conjugation $\sim_C$ is even or odd accordingly.
Proof of Proposition 3.9. If $\sigma(t) = \sigma(-t + b)$ ($\sigma$ is translated even), then for $u \in \mathbf{R}^{k(n+2)}$ put $u'_i = (w_i, -v_i, -b_i + b)$ if $p_1(u_i) < 0$, and $u'_i = u_i$ if $p_1(u_i) \geq 0$. Let $\pi$ be a permutation of the set $\{1, \ldots, k\}$ such that $q_{\pi(1)} \leq q_{\pi(2)} \leq \cdots \leq q_{\pi(k)}$. Putting $u''_i = u'_{\pi(i)}$, we have $u \sim_C u' \sim_I u''$, so $u \sim_F u''$ and hence $S_E(n,k)$ is a search set.

If $\sigma(t) = -\sigma(-t + b)$ ($\sigma$ is translated odd), then for $u \in \mathbf{R}^{k(n+2)}$ put $u'_i = (-w_i, -v_i, -b_i + b)$ if $w_i < 0$, and $u'_i = u_i$ if $w_i \geq 0$. Let $\pi$ be a permutation of the set $\{1, \ldots, k\}$ such that $w_{\pi(1)} \leq w_{\pi(2)} \leq \cdots \leq w_{\pi(k)}$. Putting $u''_i = u'_{\pi(i)}$, we have $u \sim_C u' \sim_I u''$, so $u \sim_F u''$ and hence $S_O(n,k)$ is a search set.

If $\sigma(t) = -\sigma(-t + b) + d$ ($\sigma$ is shifted translated odd), then for $u \in \mathbf{R}^{k(n+2)}$ containing a constant hidden unit choose such a unit, $j$, and for $i \neq j$ put $u'_i = (-w_i, -v_i, -b_i + b)$ if $w_i < 0$, and $u'_i = u_i$ if $w_i \geq 0$. Put $m = \#\{i \in \{1, \ldots, k\}: w_i < 0\}$ and $u'_j = (w'_j, 0, b'_j)$, where $b'_j$ is any real number such that $\sigma(b'_j) \neq 0$ and $w'_j = (w_j\sigma(b_j) - md)/\sigma(b'_j)$. If $u$ contains no constant hidden unit, put $u' = u$. Let $\pi$ be a permutation of the set $\{1, \ldots, k\}$ such that $w_{\pi(1)} \leq w_{\pi(2)} \leq \cdots \leq w_{\pi(k)}$. Putting $u''_i = u'_{\pi(i)}$, we have $u \sim_C u' \sim_I u''$, so $u \sim_F u''$ and hence $S_S(n,k)$ is a search set.
Acknowledgment
V. Kůrková thanks E. Sontag for a stimulating discussion.

References

Albertini, F., and Sontag, E. D. 1994. For neural networks, function determines form. Neural Networks (in press).

Birkhoff, G., and MacLane, S. 1965. A Survey of Modern Algebra. Macmillan, New York.

Chen, A. M., and Hecht-Nielsen, R. 1991. On the geometry of feedforward neural network weight spaces. In Proceedings of the 2nd IEE Conference on Artificial Neural Networks, pp. 1-4. IEE Press, London.

Chen, A. M., Lu, H., and Hecht-Nielsen, R. 1993. On the geometry of feedforward neural network error surfaces. Neural Computation 5(6), 910-927.

Daubechies, I. 1992. Ten Lectures on Wavelets. SIAM, Philadelphia.

Hecht-Nielsen, R. 1990. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pp. 129-135. Elsevier, Amsterdam.

Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251-257.

Kreinovitch, V., Sirisaengtaksin, O., and Cabrera, S. 1992. Wavelet neural networks are optimal approximators for functions of one variable. Tech. Rep. UTEP-92.

Pati, Y. C., and Krishnaprasad, P. S. 1991. Discrete affine wavelet transforms for analysis and synthesis of feedforward neural networks. In Advances in Neural Information Processing Systems III, pp. 743-749. Morgan Kaufmann, San Mateo, CA.

Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, vol. III, pp. 7-16. IEEE, New York.

Strang, G. 1989. Wavelets and dilation equations: A brief introduction. SIAM Rev. 31(4), 614-627.

Sussmann, H. J. 1992. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5(4), 589-594.
Received June 10, 1992; accepted June 25, 1993.
ARTICLE
Communicated by Ted Adelson
What Is the Goal of Sensory Coding?

David J. Field
Department of Psychology, Cornell University, Ithaca, NY 14850 USA
A number of recent attempts have been made to describe early sensory coding in terms of a general information processing strategy. In this paper, two strategies are contrasted. Both strategies take advantage of the redundancy in the environment to produce more effective representations. The first is described as a "compact" coding scheme. A compact code performs a transform that allows the input to be represented with a reduced number of vectors (cells) with minimal RMS error. This approach has recently become popular in the neural network literature and is related to a process called Principal Components Analysis (PCA). A number of recent papers have suggested that the optimal "compact" code for representing natural scenes will have units with receptive field profiles much like those found in the retina and primary visual cortex. However, in this paper, it is proposed that compact coding schemes are insufficient to account for the receptive field properties of cells in the mammalian visual pathway. In contrast, it is proposed that the visual system is near to optimal in representing natural scenes only if optimality is defined in terms of "sparse distributed" coding. In a sparse distributed code, all cells in the code have an equal response probability across the class of images but have a low response probability for any single image. In such a code, the dimensionality is not reduced. Rather, the redundancy of the input is transformed into the redundancy of the firing pattern of cells. It is proposed that the signature for a sparse code is found in the fourth moment of the response distribution (i.e., the kurtosis). In measurements with 55 calibrated natural scenes, the kurtosis was found to peak when the bandwidths of the visual code matched those of cells in the mammalian visual cortex. Codes resembling "wavelet transforms" are proposed to be effective because the response histograms of such codes are sparse (i.e., show high kurtosis) when presented with natural scenes. It is proposed that the structure of the image that allows sparse coding is found in the phase spectrum of the image. It is suggested that natural scenes, to a first approximation, can be considered as a sum of self-similar local functions (the inverse of a wavelet). Possible reasons for why sensory systems would evolve toward sparse coding are presented. Neural Computation 6, 559-601 (1994)
@ 1994 Massachusetts Institute of Technology
1 Introduction
Although we know a great deal about how sensory systems code information, there remains considerable debate regarding the goal of this coding. In many studies, there is an implicit assumption that there is no single goal. It is assumed that sensory systems solve a wide range of tasks important to the animal, and since the range of tasks varies widely, one would not expect to see any common "theme" across the different coding strategies. A second approach, which serves as the basis for the ideas presented in this paper, proposes that it is possible to describe sensory coding in terms of a general information processing strategy. By this tradition, it is presumed that redundancy in different sensory environments can be represented within a single framework and that the goal of sensory coding is to transform the redundancy to provide some advantage to later stages of processing.

In this paper, two information processing strategies are contrasted. Both of these approaches take advantage of the redundancy in the input to produce more effective representations of the environment. However, the two approaches achieve different goals and depend on different forms of redundancy. The first of these, which will be described as "compact coding," has gained considerable attention in the neural network literature and serves as the basis of much of the work in image compression. This approach suggests that the principal goal of visual coding is to reduce the redundancy of the visual representation. Many of these ideas can be traced back to Barlow's theories of redundancy reduction (e.g., Barlow 1961). Recently, a number of studies have proposed that spatial coding by the mammalian visual system is well described by codes that make use of the correlations to reduce the redundancy of the sensory representation (e.g., Atick and Redlich 1990, 1992; Atick 1992; Barlow and Foldiak 1989; Daugman 1988, 1991; Linsker 1988; Foldiak 1990; Sanger 1989). This approach to coding is illustrated in Figure 1A. In a compact code, the goal is to represent all the likely inputs with a relatively small number of vectors (e.g., cells) with minimal loss in the description of the input. In such a code, the dimensionality of the representation is reduced, resulting in a code where only a subset of the possible inputs can be accurately represented. The code is effective when this subset is capable of representing the probable inputs to the code. In the next section, we will see how this approach reduces redundancy.

The second approach suggests that the principal goal of sensory coding is to produce a sparse-distributed representation of the sensory input. This approach has been specifically proposed with respect to the visual code and the representations of natural scenes (Field 1987, 1989, 1993; Zetzsche 1990). However, several authors have noted that codes that produce sparse outputs may provide several advantages for representing sensory information (e.g., Barlow 1972, 1985; Palm 1980; Baum et al. 1988).
[Figure 1 graphic. Panel A, "Compact Coding: represents data with minimum number of units"; panel B, "Sparse Distributed Coding: represents data with minimum number of active units."]
Figure 1: Two methods of taking advantage of redundancy in a sensory environment. In the compact coding approach (A), the code transforms the vector space to allow the data to be represented by a smaller number of vectors with only minimal loss in the representation (i.e., dimensionality is reduced). In a sparse coding scheme (B), the code transforms the vector space to allow the input to be represented with a minimum number of active cells. In a sparse coding scheme, the dimensionality is not reduced. Rather, the redundancy in the input is transformed into the redundancy in the firing rate of the cells (i.e., the response histogram) to produce a code where the response probability for any particular cell is relatively low.
Unfortunately, in much of this work, the distinction between sparse and compact codes has not been clear. Therefore, in this paper, much of the discussion will be devoted to the differences and requirements for each type of coding. In a sparse-distributed code, the dimensionality of the representation is maintained (the number of cells remains roughly constant and may even increase). However, the number of cells responding to any particular instance of the input is minimized. Over the population of likely inputs, every cell has the same probability of producing a response (i.e., distributed), but that probability is low for any given cell (i.e., sparse). In sparse-distributed coding, the goal is to obtain a code where only a few cells respond to any given input. However, across the population of images the information is distributed across all of the cells. As we will see in the next section, this coding scheme does not reduce redundancy.
Rather, high-order redundancy is transformed into redundancy of the firing patterns of the cells (i.e., the response histograms), as was proposed previously (Field 1987). By this approach, the goal of the coding is to maximize the redundancy of the response histograms by minimizing the statistical dependencies between units.

In the following sections, these two approaches will be analyzed and contrasted. We will begin by looking at the type of redundancy required for producing a compact code and discuss the relations between this code and Principal Components Analysis (PCA). We will then look at the type of redundancy and the type of transform required for a sparse code. This will be followed by an experiment that models the response of visual neurons to natural scenes. Throughout this paper, we will concentrate on the response properties of cells in the mammalian visual system and the statistical relations found in the natural environment. However, the general ideas can be applied to any sensory system and any sensory environment.

2 State Space
One effective method for describing redundancy in a data set is to consider the "state space" of possible inputs. The state space (sometimes called vector space or phase space) describes the space of all possible states. This approach provides a useful means of showing how different coding schemes take advantage of the various forms of redundancy present in a population of inputs. Although the notion of state space has proved popular in descriptions of chaotic interactions and learning in neural networks (see, e.g., Churchland and Sejnowski 1992 for an introduction), it is not widely discussed in theories of visual coding.

To describe the state space of an image set, one can use the pixel amplitudes to represent the coordinate axes of the space. For any n-pixel image, one requires an n-dimensional space to represent the set of all possible images. Every possible image (e.g., a particular face, tree, etc.) is represented in terms of its unique location in the space. For example, white noise patterns (i.e., patterns with random pixel intensities) represent random locations in that n-dimensional space. Therefore, if we let the space become filled with examples of such white noise patterns, then all regions of the space would be filled equally; that is, the probability density will be uniform throughout the state space.

However, most naturally occurring phenomena are not random. Any redundancy that occurs in a data population implies that the population does not fill the state space uniformly. Natural scenes, for example, are statistically very different from white noise patterns, and we would therefore not expect the state space of natural scenes to have the same probability density. Consider the case of a 256 x 256 pixel image. The state space of all possible images at this resolution requires a 65,536-dimensional space where the amplitudes of each of the pixels represent the axes of the space. As one can imagine, the probability of seeing something resembling a natural scene when presented with white noise patterns is extremely low. This implies that in the state space of all possible 256 x 256 images, natural scenes occupy an extremely small area of the space.

To develop an understanding of the functional properties of the mammalian visual system, a number of recent studies have proposed that one must understand the statistical regularity of the mammalian visual environment and its relation to the visual code (e.g., Field 1987, 1989, 1993; Atick and Redlich 1990, 1992; Bialek et al. 1991; Eckert and Buchsbaum 1993; Hancock et al. 1992; van Hateren 1992; Kersten 1987). In terms of the state space, every statistical regularity that one finds in the visual environment provides a clue about the location and shape describing the probability density of natural scenes within the state space. Our visual environment is highly structured. The physics of how objects and surfaces reflect light forces the probability density into highly constrained forms. We will see that much of the debate about the goal of visual coding depends on the particular shape that is produced with data from the sensory system's typical environment. It will be argued that the state space describing the probability density of natural scenes is highly predictable but does not have the shape that is widely presumed.

Similarly, the spatial response properties of a cell (e.g., the receptive field) can also be described in terms of locations in the state space that will produce a response. For example, a cell with a linear response can be represented by a vector extending from the origin of the state space in a direction that depends uniquely on the particular receptive field profile. Every distinguishable receptive field profile is represented in terms of a unique direction in the state space. Thus, an array of cells can be considered as an array of vectors each pointing to unique locations in the state space. If the local amplitudes of a waveform (e.g., the pixels) represent the coordinate axes of the state space, then an ortho-normal transform (e.g., a Fourier transform) can be represented by a rotation and/or translation of the coordinate axes.¹ This concept of treating a transform as a rotation of the coordinate axes will prove important in the discussions below. Ortho-normal transforms represent the special class of transforms where the vectors remain of normal length, $|V_i| = 1$, and the vectors remain orthogonal, $V_i \cdot V_j = 0$ for $i \neq j$.

¹Technically, an ortho-normal transform performs an "isometric" transform on the state space (i.e., it is form preserving).
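This point is easy to verify numerically (a sketch of ours, not from the original text): an orthonormal matrix, here obtained from a QR decomposition, rotates the coordinate axes of the state space without changing the length of any image vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # a 16-pixel "image"

# Random orthonormal basis: the rows of Q are the new coordinate axes.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

x = rng.normal(size=n)                   # pixel amplitudes
y = Q @ x                                # coefficients in the rotated basis

print(np.allclose(Q @ Q.T, np.eye(n)))                     # orthonormality
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))   # lengths preserved
```

Each row of Q plays the role of a receptive field vector; the transform merely re-describes the same point in the state space.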
It is unlikely that the visual system's transform is either completely orthogonal or normal, but this will not affect the main points of the discussion presented in this paper. To a first approximation, a sensory code can be thought of as a rotation of the coordinate axes of the state space. The response properties of any particular cell (e.g., the receptive field) describe the direction of the vector in the state space. The collection of cells mapping out the visual field forms an array of vectors mapping out the state space. To understand the goal of sensory coding, it is proposed that one must determine the relation between the directions of the vectors (i.e., the response properties of the cells) and the "state space" of typical inputs for that sensory system.

Throughout this paper, terms such as "the form of the state space" or "the state space of input" will be used. This represents a shorthand for "the regions within the state space where images have a relatively high probability." The state space always represents the space of all possible images. However, "state space of natural scenes" will refer to the region within this space that describes where natural scenes are likely to fall. This study will investigate the relation between the spatial response properties of some of the cells in the visual pathway (e.g., the vectors described by the receptive field profiles) and the state space populated by natural scenes.

The redundancy in the environment can take several forms. In the next sections, two forms of redundancy will be considered and we will explore two types of transform that can take advantage of this redundancy.
2.1 Correlations in State-Space

Figure 2: Facing page. (a) An example of a two-dimensional state space. In this example, we have assumed that the axes represent the intensities of two pixels. The ellipse represents a population of probable inputs showing that Pixel A and Pixel B are highly correlated. However, in this example, there is little redundancy in the response histograms of the two pixels. Each shows a gaussian distribution with a relatively large variance. The histogram of activity for the two pixels is found by projecting the state space onto the two axes. (b) The transform where the new basis functions are represented by a rotation of the coordinate axes. In this new coordinate system, most of the variance in the data can be represented with only a single coefficient (A'). Removing B' from the code produces only minimal loss. This rotation of the coordinate system to allow the vectors to be aligned with the principal axes of the data is what is achieved with the process called Principal Component Analysis (PCA).

Let us consider a very simple example where a data set consists of only two independent variables (e.g., the intensities of two neighboring pixels). Figure 2 shows an example of the state space of a collection of two-pixel images where the horizontal axis represents the intensity of Pixel A and the vertical axis represents the intensity of Pixel B. Any two-pixel image is represented as a unique point in the two-dimensional state space. Figure 2a provides an example of one form of redundancy. In this case, the population of images produces a correlation in the outputs of the two pixels. Although the pixels are correlated, each pixel is contributing equal information about the image. In this case, it has been assumed that
the response behavior is normally distributed about the two axes. The histogram of activity (i.e., the probability distribution function) for the two pixels is found by projecting the state space onto the two axes. In this case, the activity of each pixel is normally distributed and the two have equal variance. It is possible to transform the coordinate system to take advantage of the redundancy. Figure 2b shows the transform where the new basis functions are represented by a simple rotation of the coordinate axes, where $A' = A + B$ and $B' = A - B$. To keep the basis vectors of normal length, the transform is
$$\begin{bmatrix} A' \\ B' \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix}$$
In this new coordinate system, most of the variance in the data can be represented with only a single coefficient (A'). Removing B' from the code produces only minimal loss in the description of the input. This rotation of the coordinate system to allow the vectors to be aligned with the principal axes of the data is what is achieved with a process called Principal Components Analysis (PCA), sometimes called the Karhunen-Loeve transform. Principal Components Analysis computes the eigenvalues of the covariance matrix (e.g., the covariance of the pixels in an image). The corresponding eigenvectors represent a hierarchy of orthogonal coefficients where the highest-valued vectors account for the greatest part of the covariance. By using only these highest-valued vectors, the state space can be represented with a subset of the vectors and only minimal loss in the representation as measured in RMS error. As shown in Figure 2, the variance in A' has increased and the variance in B' has decreased markedly; now, most of the variance of the input is coded in A'. In general, by removing those vectors with low variance, we end up with a state space that is "more packed." Thus, the new state space has less redundancy. Redundancy reduction is achieved by removing regions of the state space where the probability density is low. One can either remove low-variance vectors from the representation or reduce the range that a vector can cover (i.e., reduce the dynamic range of the basis vector). If we think of the data as forming an ellipse, then PCA provides a method of finding the axes of the ellipse.

Of course, this two-point transform is a rather restricted example. If one hopes to model the redundancy of natural scenes, then one needs a high-dimensional space to describe the possible states: an n-pixel image requires an n-dimensional state space. Although the different forms of redundancy can become quite complex, they are not necessarily difficult to describe. It is possible to generalize the ellipse shown in Figure 3 to high dimensions, where the region of high-probability images (i.e., high density) is described by
$$a x_1^2 + b x_2^2 + c x_3^2 + d x_4^2 + e x_5^2 + \cdots \le k \tag{2.2}$$
where $a \ge b \ge c \ge d \ge e \ge \cdots$ and $x_1, x_2, x_3, \ldots$ represent the axes of the ellipsoid.

Figure 3: Ellipse and two filtered images. Examples of two images created by multiplying the spectrum of white noise (flat spectrum) by 1/f. With images like this, all the redundancy is captured in the Fourier amplitude spectrum. The state space of images created in this fashion forms a high-dimensional ellipse as described in equation 2.2. The axes of the ellipse are aligned with the vectors of the Fourier transform. Thus, the "principal components" and the vectors of the Fourier transform are equivalent.

Whatever direction the axes are pointing, PCA can use the correlations in the data to find the directions of these axes and produce the vectors that correspond to the axes of the ellipsoid. The advantage of these particular vectors is that they provide a means of compressing image data with minimal RMS error.
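Returning to the two-pixel example of Figure 2, this behavior is easy to simulate (a sketch of ours; the pixel correlation of 0.9 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated two-pixel "images" as in Figure 2a: equal variances, r = 0.9.
cov = [[1.0, 0.9], [0.9, 1.0]]
data = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# The rotation A' = (A + B)/sqrt(2), B' = (A - B)/sqrt(2).
R = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
rotated = data @ R.T

print(data.var(axis=0))       # ~[1.0, 1.0]: each pixel carries equal variance
print(rotated.var(axis=0))    # ~[1.9, 0.1]: variance concentrated in A'

# PCA recovers the same axes from the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(data.T))
print(eigvals)                # ~[0.1, 1.9]
```

Discarding B' here retains roughly 95% of the total variance, which is exactly the compact-coding move described above.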
Just as the axes of the ellipse are ordered in equation 2.2, the eigenvectors are ordered in a sequence where the first few vectors account for the highest proportion of the variance. If the goal is to compress an n-pixel image set using only m vectors, where m < n, then the principal components (i.e., the principal axes) represent the optimal vectors for describing the data. As will be suggested in the next section, the state space representing natural scenes may not be well described by a high-dimensional ellipse. Nevertheless, no matter what the actual form of the state space, if the goal is to transmit a data set with a reduced number of vectors and with minimal RMS error, then PCA provides a means of finding the optimal set of vectors.²

²It should be noted that PCA finds only the optimal linear solution. It is always possible that for a given data set there exists a nonlinear transform that will provide better compression than the vectors provided by PCA.

2.2 Principal Components and the Amplitude Spectra of Natural Scenes

An interesting and important idea involves PCA when the statistics of a data set are stationary. Stationarity implies that over the population of images in the data set (e.g., all natural scenes), the statistics at one location are no different than at any other location: across all images, $P(x_i \mid x_{i+1}, x_{i+2}, \ldots) = P(x_j \mid x_{j+1}, x_{j+2}, \ldots)$ for all $i$ and $j$. This is a fairly reasonable assumption with natural scenes since it implies that there are no "special" locations in an image where the statistics tend to be different (e.g., the camera does not have a preferred direction). It should be noted that stationarity is not a description of the presence or lack of local features in an image. Rather, stationarity implies that over the population, all features have the same probability of occurring in one location versus another. When the statistics of an image set are stationary, the amplitudes of the Fourier coefficients of the image must be uncorrelated (e.g., Field 1989). Indeed, any two filters that are orthogonal over translation,
$$g(x) \cdot h(x - x_0) = 0 \quad \text{for all } x_0,$$
will have uncorrelated outputs in the presence of data with stationary statistics (Field 1993). Since this holds for the Fourier coefficients, the amplitudes of the Fourier coefficients will be uncorrelated. This means that when the statistics of a data set are stationary, all the redundancy reflected in the correlations between pixels is captured by the amplitude spectra of the data. This should not be surprising, since the Fourier transform of the autocorrelation function is equal to the power spectrum. Therefore, with stationary statistics, the amplitude spectrum describes the principal axes (i.e., the principal components) of the data in the state space (Pratt 1978). With stationary data, the phase spectra of the data are irrelevant to the directions of the principal axes.
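The link between the autocorrelation function and the power spectrum is easy to confirm in one dimension (a sketch of ours, using a smoothed noise signal as a stand-in for stationary data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
x = np.convolve(x, np.ones(5) / 5.0, mode="same")   # introduce correlations

power = np.abs(np.fft.fft(x)) ** 2

# Circular autocorrelation computed directly.
autocorr = np.array([np.dot(x, np.roll(x, s)) for s in range(n)])

# Wiener-Khinchin relation: the Fourier transform of the
# autocorrelation equals the power spectrum.
print(np.allclose(np.fft.fft(autocorr).real, power))   # True
```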
As noted previously (Field 1987), an image that is scale invariant will have a well-ordered amplitude spectrum. For a two-dimensional image, the amplitudes will fall inversely with frequency (i.e., a 1/f spectrum). Natural scenes have been shown to have spectra that fall as roughly 1/f (Burton and Moorhead 1987; Field 1987, 1993; Tolhurst et al. 1992). If we accept that the statistics of natural images are stationary, then the 1/f amplitude spectrum provides a complete description of the correlations in natural scenes. The amplitude spectrum certainly does not provide a complete description of the redundancy in natural scenes, but it does describe the relative amplitudes of the principal axes.

We can ask what an image would look like if all the redundancy in the image set was described by a 1/f amplitude spectrum. Or, in terms of the state space, we can ask what images would look like if the probability density was described by a high-dimensional ellipsoid where the principal axes of the ellipsoid are the Fourier transform. Figure 3 shows examples of two such images. They are created by multiplying the spectrum of white noise (flat spectrum) by 1/f. Although these images have amplitude spectra similar to natural scenes (and therefore the same principal components), they clearly do not look like real scenes. They do not contain any of the local structure like edges and lines found in the natural environment. As proposed in the next section, images that have such local structure are not described well by a high-dimensional ellipsoid.

A number of recent studies have discussed the similarities between the principal components of natural scenes and the receptive fields of cells in the visual pathway (Bossomaier and Snyder 1986; Atick and Redlich 1990, 1992; Atick 1992; Hancock et al. 1992; Baddeley and Hancock 1991; MacKay and Miller 1990; Derrico and Buchsbaum 1991). And there have been a number of studies showing that, under the right constraints, units in competitive networks can develop large oriented receptive fields (e.g., Lehky and Sejnowski 1990; Linsker 1988; Sanger 1989). Indeed, it has been noted that networks like Linsker's work off the induced correlations caused by the overlapping receptive fields in the network, and that such networks produce results similar to the principal components (MacKay and Miller 1990). And it has been noted that Hebbian learning can, under the right conditions, find the principal components (Oja 1982; Foldiak 1989; Sanger 1989).

This appears to pose a dilemma. If the principal components of images with stationary statistics are equivalent to the Fourier transform, shouldn't the derived receptive field profiles of the appropriate Hebbian network look like the Fourier transform if presented with natural scenes? Not necessarily. The two-dimensional Fourier transform of natural scenes shows considerable symmetry. The amplitude spectra of natural scenes fall as approximately 1/f at all orientations. Therefore, at any frequency there exists a range of orientations that are likely to have similar amplitude.
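The construction behind Figure 3 can be reproduced in a few lines (a sketch of ours; the 128 x 128 size is arbitrary): shape the flat amplitude spectrum of white noise to 1/f while keeping its random phases.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 128

noise = rng.normal(size=(size, size))
spectrum = np.fft.fft2(noise)              # flat amplitude, random phase

# Radial frequency of each 2-D Fourier coefficient.
fy = np.fft.fftfreq(size)[:, None]
fx = np.fft.fftfreq(size)[None, :]
f = np.hypot(fx, fy)
f[0, 0] = 1.0                              # avoid dividing by zero at DC

image = np.fft.ifft2(spectrum / f).real    # ~1/f amplitude spectrum
```

Such an image shares the pairwise correlations (and hence the principal components) of natural scenes, but with random phases it contains none of their edges and lines.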
When a number of Fourier coefficients have the same amplitude, there will exist a wide range of linear combinations that will account for equal amounts of the covariance (i.e., the solution is degenerate). Therefore, linear combinations of these equivariant vectors will also account for equal amounts of the covariance. Recently, Baddeley and Hancock (1991) and Hancock et al. (1992) calculated the principal components of a number of natural scenes directly. When gaussian windowed, the first few coefficients do show some resemblance to cortical receptive fields, but this is expected since a gaussian-windowed low-frequency sinusoid (the first Fourier coefficients with the greatest amplitude) will produce the popular "Gabor function" shown to provide good models of cortical receptive fields (e.g., Field and Tolhurst 1986). Past the first 3 or 4 Fourier coefficients, the receptive field profiles no longer look like cortical receptive field profiles. Instead, the profiles are substantially different from those found in the mammalian visual system.

Figure 4B shows an example of the two-dimensional amplitude spectrum of a natural scene. On average, a ring of frequencies around the origin will have the same amplitude. This produces degeneracy in the set of possible solutions. The principal components can consist of any linear combination of these frequencies. That is, an equivalent solution for the principal components may consist of an ortho-normal transformation of these vectors. Figure 4C shows four spatial frequencies that would be expected to account for equal amounts of the covariance in natural scenes. The four lower "receptive fields" represent a rotation of the other receptive fields. For a natural scene with a 1/f spectrum and stationary statistics, the vectors in either 4C or 4D will serve equally well as principal components. Thus, even if the statistics of the input are stationary, the Fourier transform is not necessarily the only solution.

It is important to recognize that if the statistics of the input are stationary, then there is no reason for Principal Components Analysis to produce localized receptive fields. In Fourier terms, a function is localized because the phases of the different frequencies are aligned at the point where the function is localized. Cortical simple cells, for example, have been shown to have a high degree of phase alignment at the center of the receptive field (Field and Tolhurst 1986). The large low-frequency receptive fields may contain only a few spectral components. However, the small high-frequency receptive fields have relatively broad bandwidths and hence have a large number of spectral components, all aligned near the center of the receptive field. Randomizing the phases of a localized function will distribute the energy across the image. Since the covariance matrix does not capture the phases (if the statistics are stationary), Principal Components Analysis cannot produce localized receptive fields. If the primary constraints of the code are to represent the input with reduced dimensionality, then the codes should not be capable of producing the small high-frequency orientation-selective receptive fields.
Figure 4: An example of a scene (A) and its two-dimensional amplitude spectrum (B). For a population of images with stationary statistics, the amplitude spectrum describes the principal components. However, in the two-dimensional amplitude spectrum of natural scenes there is considerable symmetry, with similar frequencies at different orientations likely to have the same amplitude (C). This means that appropriate combinations of the Fourier coefficients with equal amplitude will account for equal amounts of the covariance. (D) An example of a rotation of the Fourier vectors that should account for equal amounts of the covariance. With stationary statistics, the phase spectrum of the input is irrelevant to the principal components. Thus, Principal Components Analysis (PCA) cannot result in localized receptive fields without forcing constraints on the phase.

Atick and Redlich (1990, 1992) have suggested that the spatial frequency tuning of retinal ganglion cells is well matched to the combination of the amplitude spectra of natural scenes and the high-frequency quantal limitations found in the natural environment. This is an important finding with regard to
the amplitude spectrum of the receptive field profile. However, a receptive field will be localized only if its phase spectrum is aligned, so this approach cannot explain why the receptive fields are localized. It has been proposed that it is the phase spectrum of natural scenes that must be considered if one is to account for the localized nature of receptive fields (Field 1989, 1993). Indeed, although there are numerous reports of networks that produce hidden units with oriented receptive fields, I have found no published report of a network that can produce the small localized receptive fields without specifically forcing these locality constraints on the network (e.g., forcing a small gaussian window). It has even been suggested that competitive learning is, in general, inappropriate for producing the small high-frequency receptive fields (Webber 1991). The receptive field profiles of the hidden units are always the same size as the window of the network. In the next section it will be proposed that the principal components of natural images do not describe the important forms of redundancy.

To summarize, PCA and compact coding have several problems in accounting for the receptive field profiles of cells in the retina and primary visual cortex.

1. If one accepts that the statistics of natural scenes are stationary, then the principal components are dependent on only the amplitude spectra of the input. The phase spectra of the inputs are irrelevant.

2. Because of the symmetry in the amplitude spectra of natural scenes, the principal components are likely to have a degenerate solution. The resulting "receptive fields" will consist of an unconstrained collection of orientation components with random phase. This will be most apparent at higher frequencies, where there exists a wide range of orientation components at each frequency.

3. A function is localized only when the phases of the Fourier coefficients are aligned. Thus, when the statistics are stationary, Principal Components Analysis will not produce localized receptive fields because the phase spectrum is not constrained.

4. Since the principal components are not dependent on the phase spectra of the input, the principal components will not reflect the presence of local structure in the data.
In the next section, it will be proposed that the shape of the state space describing natural scenes is not elliptical, and this nonelliptical state space provides a clue as to why the mammalian visual system represents spatial data as it does. To account for receptive field profiles in primary visual cortex, we consider a different theory regarding the goal of visual coding and we consider a different form of redundancy found in natural scenes. However, before discussing sparse coding techniques, it should be noted that PCA represents an important technique for reducing the
number of vectors describing the input, when it is possible. For example, in the color domain, it has been found that the first three principal components can account for much of the variance in the spectra of naturally occurring reflectances. This has led investigators to suggest that the three cone types found in primate vision provide an efficient description of our chromatic environment (e.g., Buchsbaum and Gottschalk 1983; Maloney 1986). In general, the principal components can be used to determine which dimensions of the state space are needed to represent the data. If a reduced dimensionality is possible, the principal components will be able to tell you that. However, once the dimensions have been decided, it is proposed that other factors must be considered to determine the best choice of vectors to describe that space.

2.3 State Spaces That Allow Sparse Coding

Consider a two-pixel data set with redundancy like that shown in Figure 5. The data are redundant since the state space is not filled uniformly. However, this data set has some interesting properties. We can think of this data set as a collection of two types of images: one set with pixels that have high positive correlation and one set with high negative correlation. For this data set, there will be no correlation between Pixel A and Pixel B. Furthermore, there is no "high-order" redundancy since there are only two vectors. This is an example of a type of second-order redundancy that is not captured by correlations. However, the same transformation performed as before (i.e., a rotation) produces a marked change in the histograms of the basis functions A' and B'.

This particular data set allows a "sparse" response output. Although the variance of each basis function remains constant, the histogram describing the output of each basis function has changed considerably. After the transformation, vector A' is high or vector B' is high, but they are never high at the same time. The histograms of each vector show a dramatic change. Relative to a normal distribution, there is a higher probability of no response and a higher probability of a high response, but a reduction in the probability of a mid-level response. This change in shape can be represented in terms of the kurtosis of the distribution, where the kurtosis is defined as the fourth moment according to
$$K = \frac{1}{n}\sum_{i}\left[\frac{(x_i - \bar{x})^4}{\sigma^4}\right] - 3 \tag{2.3}$$
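Equation 2.3 is implemented directly below, together with a toy version of the two-cluster state space of Figure 5 (a sketch of ours; the cluster geometry and noise level are invented for illustration):

```python
import numpy as np

def kurtosis(x):
    # Equation 2.3: normalized fourth moment; ~0 for a gaussian.
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 4) - 3.0

rng = np.random.default_rng(0)
n = 20_000

# Two-pixel images from two equally likely sets: pixels highly
# positively correlated, or highly negatively correlated (Figure 5).
t = rng.normal(size=n)
sign = rng.choice([-1.0, 1.0], size=n)
data = np.column_stack([t, sign * t]) + 0.05 * rng.normal(size=(n, 2))

print(kurtosis(data[:, 0]))     # ~0: each pixel alone looks gaussian

# Rotate the axes onto the diagonals: A' = (A+B)/sqrt(2), B' = (A-B)/sqrt(2).
R = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
rotated = data @ R.T

print(kurtosis(rotated[:, 0]))  # > 0: sparse, heavy-tailed histogram for A'
print(kurtosis(rotated[:, 1]))  # > 0: likewise for B'
```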
Figure 6 provides an example of distributions with various degrees of kurtosis. In a sparse code, any given input can be described by only a subset of cells, but that subset changes from input to input. Since only a small number of vectors describe any given image, any particular vector should have a high probability of no activity (when other vectors describe the image) and a higher probability of a large response (when the vector is part of the family of vectors describing the image).
Figure 5: As in Figure 2, this shows an example of the state space of a population of two-pixel images. The data set is redundant since the state space is not filled uniformly. However, with these data, there are no correlations in the data and, therefore, there are no principal axes that can account for a major component of the variance. In such a data set, there is no way to take advantage of the redundancy by reducing the dimensionality. The right of the figure shows how the response distribution changes if the vector space is transformed to allow the vectors to be aligned with the axes of the data. In this case, each of the vectors maintains the same variance. However, the shape of the distribution is no longer gaussian (high entropy). Instead, the response distribution shows a high degree of kurtosis (lower entropy, higher redundancy).
Figure 6: Examples of three levels of kurtosis. Each of the distributions has the same variance. A gaussian distribution has minimal redundancy (highest entropy) for a fixed variance. The higher the kurtosis, the higher the redundancy. With high kurtosis, there is a higher probability of a low response or a high response with a reduced probability of a mid-level response.
Thus, a sparse code should have response distributions with high kurtosis. Although kurtosis appears to capture this property of a distribution with both high and low variances, one should not presume that kurtosis necessarily represents the optimal measure for defining a sparse code. At this time, we consider it a "useful measure." We will return to this point later.

As we move to higher dimensions (e.g., images with a larger number of pixels), we might consider the case where only one basis vector is active at a time (e.g., vector 1 or vector 2 or vector 3, etc.):
$$a x_1 \cup a x_2 \cup a x_3 \cup a x_4 \cup \cdots \tag{2.4}$$
In this case, each image can be described by a single vector and the number of images equals the number of vectors. However, this is a rather extreme case and is certainly an unreasonable description of most data sets, especially natural scenes. When we go to higher dimensions, there exists a wide range of possible shapes that allow sparse coding. Overall, the shape must require a large
set of vectors to describe the entire population of possible inputs, but require only a subset of vectors to describe any particular input:
$$\mathrm{Image} = \sum_{i=1}^{n} a_iV_i, \quad n < m \tag{2.5}$$
where m is the number of dimensions required to represent all images in the population (e.g., all natural scenes). For example, with a three-pixel image where only two pixels are nonzero at a time, it is possible to have
$$a x_1 + b x_2 \;\cup\; a x_2 + b x_3 \;\cup\; a x_1 + b x_3 \tag{2.6}$$
This state space consists of three orthogonal planes. If the data fall in these three planes with equal probability, then there will be no correlations in the data, and therefore the principal components will not provide the axes of the planes (indeed, there are no principal axes of the state space). However, by choosing vectors aligned with the planes (e.g., x_1, x_2, x_3), it is possible to have a code in which only two vectors are nonzero for any input. In some situations, the principal components may even point in the wrong direction for achieving a sparse code. Figure 7A shows an example where the data fall along slightly nonorthogonal lines. In this case, there is a positive correlation in the pixels. As with the ellipse shown in 7B, the principal components lie along the diagonals rather than the axes of the data. However, selecting vectors aligned with the data can produce histograms with positive kurtosis even though these vectors are nonorthogonal. Indeed, it is important to recognize that the optimal sparse representation of a data set is not necessarily an ortho-normal representation. Figure 7C shows a three-dimensional variation that has some interesting properties. If the state space forms a hollow three-dimensional cone like that shown, the first principal component will fall along the major axis of the cone. Indeed, Figure 7D shows an ellipsoid with the same principal axes as 7C. However, in the case of the ellipsoid, all the redundancy is captured by the principal components (i.e., the principal axes define the ellipse). In the case of the cone, the principal components fail to exploit some of the most interesting aspects of the data. The reason for using the cone as an analogy is that the vectors along the tangents of the cone extending from the origin can allow a sparse representation while the probability distribution remains locally continuous. Also, we have found that the cone provides a reasonable description of how a localized function (e.g., a gaussian blob) will be distributed within the state space when the feature is varied in position and amplitude. However, a three-dimensional state space cannot begin to describe the full complexity of the redundancy in natural scenes, so the cone should be considered only a crude example. The precise form within the state space is not critical to the argument.
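The three-plane example can be checked numerically. Below is a minimal sketch (the sampling scheme, names, and parameter values are our own illustration, not from the original study): data confined to the three planes of equation 2.6 show essentially zero pairwise correlations, yet in the pixel-aligned basis every sample has exactly one zero coordinate, so each coordinate's histogram shows positive kurtosis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000
# Sample equation 2.6: pick one of the three planes at random and
# fill its two pixel coordinates with gaussian values; the remaining
# pixel stays at zero.
pairs = [(0, 1), (1, 2), (0, 2)]
data = np.zeros((n, 3))
for i in range(n):
    j, k = pairs[rng.integers(3)]
    data[i, j], data[i, k] = rng.normal(size=2)

# Off-diagonal covariances are near zero, so the principal components
# do not pick out the planes.
print(np.round(np.cov(data.T), 3))

# Yet each pixel-aligned coordinate has positive kurtosis (about 1.5
# in theory for this mixture), the signature of a sparse code.
for j in range(3):
    x = data[:, j]
    print(np.mean((x - x.mean()) ** 4) / x.std() ** 4 - 3)
```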
Figure 7: (A) An example of a data distribution where the principal components lie along the diagonal but where such components are ineffective at producing a sparse code. (B) A data set with the same principal components (an ellipse). (C, D) Two three-dimensional examples of state spaces that have the same principal components. State spaces like those shown in A and C allow sparse coding; state spaces like B and D do not. It is suggested that the local structure available in natural scenes produces a high-dimensional state space analogous to this cone rather than the ellipse.
For a sparse code to be possible, a large set of vectors must be required to describe all points on the form (all possible images), but the form must require only a subset of vectors to describe any particular point in the space (any particular image), as in equation 2.5. There will always exist a wide range of shapes of the probability density function that can result in the same principal components. The possibility of finding a sparse code depends on this shape. For an ellipsoid, it is not possible to produce a sparse output. In this paper, it is proposed that the signature of a sparse code is found in the kurtosis of the response distribution. A high kurtosis signifies that a large proportion of the cells is inactive (low variance) with a small proportion of the cells describing the contents of the image (high variance). However, an effective sparse code is not determined solely by the data or solely by
the vectors but by the relation between the data and the vectors. In the next section we take a closer look at the vectors described by the receptive field profiles of two types of visual cells and look at their relation to natural scenes. Before we begin this description, it should be noted that the results of several studies suggest that cells with properties similar to those in the mammalian visual cortex will show high kurtosis in response to natural scenes. In Field (1987), visual codes with a range of different bandwidths were studied to determine how populations of cells would respond when presented with natural scenes. It was found that when the parameters of the visual code matched the properties of simple cells in the mammalian visual cortex, a small proportion of cells could describe a high proportion of the variance in a given image. When the parameters of the code differed from those of the mammalian visual system, the response histograms for any given image were more equally distributed. The published response histograms of both Zetzsche (1990) and Daugman (1988) also suggest that codes based on the properties of the mammalian visual system will show positive kurtosis in response to natural scenes. Burt and Adelson (1983) noted that the histograms of their "Laplacian pyramids" showed a concentration near zero when presented with their images and suggested that this property could be used for an efficient coding strategy. In this paper, it is emphasized that these histograms have these shapes because of the particular relation between the code and the properties of the images. In particular, when the input has stationary statistics, it is the phase spectrum that describes the redundancy required for sparse coding. In previous work by the author (Field 1993), complex correlations in natural scenes were studied to determine the extent to which the phases were aligned across the different frequency bands. The extent of the phase alignment across neighboring frequency bands was found to be proportional to frequency (i.e., at high frequencies, the phases were aligned across a broader range of frequencies). In other words, at low frequencies, local structure tends to be spatially extended, while at high frequencies the local structure tends to be more spatially limited. It was noted that this particular alignment in phases would be expected if these scenes were scale invariant with regard to both their amplitude spectra and their phase spectra. It was proposed that, to a first approximation, natural scenes should be considered as a sum of bandlimited, self-similar "phase structures" (points of phase alignment). The spectral extent of this phase alignment was found to be well matched to the bandwidths of cortical simple/complex cells (Field 1989, 1993), and it was suggested that this property is what allows the visual system to produce a sparse output. We will return to this discussion when we consider synthetic images that produce high kurtosis in wavelet transforms. First, however, this study considers a more direct test of sparse coding by looking at the kurtosis
of the response distributions of two types of visual cells when presented with natural scenes.

3 Receptive Fields in the Mammalian Visual Pathway as Vectors
Just as it is possible to rotate the coordinate axes in a wide variety of ways, there exists a wide range of transforms capable of representing the information in an n-dimensional data space. The Fourier transform represents one example of a rotation. Gabor (1946) described a family of transforms that were capable of representing one-dimensional waveforms. These ideas were extended by Kulikowski et al. (1982) to include transforms in which the bandwidths increase proportionally to frequency. Such transforms, which have recently become known as "wavelet transforms" (e.g., Mallat 1989), consist of arrays of self-similar basis functions that differ only by translations, dilations, and rotations of a single function. Although much of the work on wavelet transforms has been devoted to the development of ortho-normal bases (e.g., Adelson et al. 1987; Daubechies 1988), coding with these transforms has found a wide variety of applications (e.g., Farge et al. 1993), from image processing to the representation of turbulence. For our purposes, however, it is important to recognize that such transforms can be considered as rotations of the coordinate system. Figure 8 shows one part of a two-dimensional implementation of a wavelet transform. Wavelets based on the gaussian-modulated sinusoid (i.e., Gabor functions) have proved to be popular models of the visual cortex (Watson 1983; Daugman 1985; Field 1987). Although the Gabor function has found some support in the physiology (Webster and DeValois 1985; Field and Tolhurst 1986; Jones and Palmer 1987), other functions have been proposed, such as the Cauchy (Klein and Levi 1985) and the log-Gabor used in this study (Field 1987). Transforms based on these functions capture some of the basic properties of cells in the mammalian visual cortex. (1) The receptive fields are localized in space and are bandpass in frequency but overlap in both space and frequency. (2) The spatial frequency bandwidths are constant when measured on logarithmic axes (in octaves), resulting in a set of self-similar receptive fields. (3) They are orientation selective. These basic properties provide the basis for a number of models of visual coding, and the model used in this study does not have any major components that differ from previous models (e.g., Watson 1983; Daugman 1985; Field 1987). It is important to recognize that these transforms are not ortho-normal. First, the basis functions are not quite orthogonal. Second, in this study (as in Field 1987) the vector length of the transform increases with frequency (i.e., the peak of the spatial frequency response is constant and the bandwidth increases). As previously noted, this results in a code with distributed activity in the presence of images with 1/f spectra like
that found in natural scenes (Field 1987; see Brady and Field 1993 for a discussion of vector length). This means that the variance of the filtered images will be roughly the same magnitude at the different frequencies. It must also be emphasized that the functions used in this study to model visual neurons are highly idealized versions of actual cells. Both the models of retinal neurons and cortical simple cells involve codes in which all the cells have the same logarithmic bandwidth. Visual neurons are known to show a range of bandwidths, which average around 1.5 octaves but become somewhat narrower at higher frequencies (e.g., Tolhurst and Thompson 1982; DeValois et al. 1982). However, the most important difference is that the "cells" in this study do not have many of the known nonlinearities found in cortical simple cells (e.g., end-stopping, cross-orientation inhibition, etc.). Indeed, the response histograms are allowed to go negative in our modeled cells, while actual cortical cells can produce only the positive component of the histogram. We will return to this discussion of nonlinearities later. However, it should be remembered that the simple cells that are modeled are only rough approximations to the cells that are actually found in the visual cortex. The codes in this study represent images with arrays of basis functions that are localized in the two-dimensional frequency plane as well as the two-dimensional image plane. Figure 8 shows one example of the division of the two-dimensional frequency plane and the corresponding
Figure 8: Two-dimensional wavelet. Three-dimensional information diagrams for a two-dimensional wavelet. The information diagrams are actually four-dimensional (u, v, x, y), but we have limited the diagram to the representation of a single orientation to allow a graphic description.
representation in space for one orientation and one phase. Each basis function is thus localized in the four dimensions of x, y, u, v. In this particular description, the spatial sampling grid is rectangular, but this is not a requirement. One requires a four-dimensional plot to cover the full 2-D space by 2-D frequency trade-off, but this is difficult to depict graphically. If we consider only a single orientation for the purposes of the display, it is possible to show the space-frequency trade-off using the representations shown in the figure. A comment should also be made with regard to the phase spectra of these filters. In line with previous studies (Field 1987, 1989, 1993), the oriented wavelet transform uses a pair of even- and odd-symmetric filters (i.e., filters in quadrature) at each location. Since these receptive fields are localized, they can be defined in Fourier terms by an alignment in the phase at the center of the receptive field (e.g., Field and Tolhurst 1986; Field 1993). Even complex cells have localized receptive fields, so to some extent, they must also be phase selective. One recent model of complex cells describes the response of a complex cell as a vector sum of two quadrature-phase cortical cells (Field 1987; Morrone and Burr 1988). By this model, complex cells detect an alignment of phases at particular locations of the visual field, but the response does not depend on the absolute phase of that alignment (e.g., sine versus cosine). However, in this study the intention is not to model all of the various forms of nonlinearity found in the mammalian visual system. Rather, the goal is to show that the basic properties of visual neurons (i.e., orientation tuning, spatial frequency tuning, and position selectivity) produce a sparse representation of natural scenes.

4 Kurtosis in the Response to Natural Scenes
In the previous sections, it was proposed that if the visual system is producing a sparse representation of the environment, then the histograms describing the response of the visual system should have a high kurtosis. In this section, the two approaches (compact and sparse coding) will be contrasted by investigating the histograms of the cells in several visual codes.

4.1 Method

Images. The images used in the following sections consist of digitized photographs of the natural environment. They consist of 55 scenes, six of which are shown in Figure 9. The only photographic restriction was that the images have no man-made features (buildings, signs, etc.), since these tend to have different statistical structures (e.g., a higher probability of long straight edges). Images were photographed with Ilford XP1 film using a 35-mm camera. Photographic negatives were scanned using a Barneyscan digitizer that provided a resolution of 512 x 512 pixels per
picture with 256 gray levels.

Figure 9: Examples of six images used in this study. Each of the images was calibrated for the intensity response of the film.

The images were calibrated for luminance using Munsell swatches, allowing the pixel values to have a linear relation to the image intensities in the original scene. The optics of the camera were taken into account by determining the response to thin lines and correcting for the changes produced in the amplitude spectrum (i.e., the modulation transfer function). Before analysis, the log of the image was determined, and the calculations were based on this simple nonlinear transform of the image. This allows the units to respond in terms of contrast (i.e., ratios of intensities) rather than intensity differences (i.e., amplitude), since

\log(a) - \log(b) = \log(a/b)

This is believed to produce a more accurate representation of cells in the visual pathway, since cells are known to produce a more linear response to contrast (e.g., see Shapley and Lennie 1985) rather than to amplitude. It was also found that the pixel histograms of some of the images had spuriously high kurtosis values (several images had K > 20 before the log transform) because of a few bright points in the image, such as bright sky filtering through a tree. One should note that film normally uses a compressive gamma, so studies that do not calibrate their film end up with
similar high-intensity compression without intending to do so. Also, this collection of images did not contain large blank regions (e.g., an image of half sky). Such images result in a large number of inactive cells, which will also increase the kurtosis for any code using local operators. These efforts were made to minimize any biases that might have been present in our image collection and to allow a more accurate comparison of the different codes.

4.2 Population Activity

Population measures were based on the outputs of arrays of cells determined by convolving the images with the appropriate filters. The methodology is similar to that of Field (1987). However, in this study results are provided for a fixed number of spatial frequencies and orientations. Two types of filters are compared. The first is a center-surround operator made from a difference of gaussians (DOG), where the surround has three times the width of the center:

g(x, y) = 9e^{-(x^2+y^2)/2\sigma^2} - e^{-(x^2+y^2)/2(3\sigma)^2} \qquad (4.2)
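A sketch of this operator follows (the kernel-size rule and function name are our own; the 9:1 amplitude ratio follows from requiring the kernel of equation 4.2 to integrate to approximately zero). Convolving an image with the kernel, for example with scipy.signal.fftconvolve(image, dog_kernel(sigma), mode='same'), gives the retina-like code whose histograms are analyzed below.

```python
import numpy as np

def dog_kernel(sigma, size=None):
    """Difference-of-gaussians operator of equation 4.2: a center
    gaussian of width sigma minus a surround of width 3*sigma. The
    9:1 amplitude ratio makes the kernel integrate to roughly zero,
    so it has no DC response."""
    if size is None:
        size = int(6 * 3 * sigma) | 1          # odd width covering the surround
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    r2 = x ** 2 + y ** 2
    return 9 * np.exp(-r2 / (2 * sigma ** 2)) - np.exp(-r2 / (2 * (3 * sigma) ** 2))
```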
For the DOG calculations, each image was convolved with two filter sizes (the spectrum peaked at 20 and 40 cycles/picture). The second type is the oriented "log-Gabor" described previously (Field 1987). Radially the function has a log-normal spectrum:

G(k) = e^{-[\ln(k/k_0)]^2 / 2[\ln(\sigma/k_0)]^2} \qquad (4.3)
and is gaussian along the orthogonal axis. As in Field (1987), the local phase was represented by a pair of filters in quadrature (i.e., even-symmetric and odd-symmetric). Thus, each image was convolved with filters at two spatial frequencies (k_0 = 20 and 40 cycles/picture), four orientations (10, 55, 100, and 145°), and two phases. For each bandwidth selected, this requires 2 x 4 x 2 = 16 convolutions per image to obtain the total response histograms at the 8 spectral locations. Although these codes do not provide a complete representation of the image, the results provide a relatively direct method of comparing the population responses of different codes and eliminate problems of filters near the edges of the spectrum. With the wavelet codes, different spatial frequency bandwidths are compared. With the orientation tuning set to 40° (full width at half height), the spatial frequency bandwidth was set to one of 8 values ranging from 0.5 to 8.0 octaves in logarithmic steps, where the bandwidth is defined as
B_{\text{oct}} = \log_2(k_2/k_1) \qquad (4.4)
where k_1 and k_2 are the lower and upper frequencies at half-height. For each image and each bandwidth, the kurtosis of the distribution was determined by calculating the histogram of the pixels from the 16 filtered images. Near the edges of each of the filtered images, the response
of any basis function can produce spurious results. To remove these effects, only the data from the central 180 x 180 region were used in the analysis. Thus for each image, the histograms were based on a total of 16 x 180 x 180 = 518,400 samples for the wavelet and 2 x 180 x 180 = 64,800 samples for the DOG. One should note that these are not completely independent samples, since the images are not sampled in proportion to the size of the basis function as in Field (1987). However, this will have minimal effect on the overall histograms, but it implies that the low-frequency channel (e.g., at 20 cycles/picture) will have histograms based on fewer independent samples than the high-frequency channels with smaller receptive fields (e.g., 40 cycles/picture).

4.3 Results

Figure 10A shows the histogram for image 1 with a single condition (spatial frequency bandwidth of 1.4 octaves, 20° orientation bandwidth). With these filters, the response of any function is as likely to be positive as negative. One can see that the general form of the histogram is much like that shown in the kurtosis plots described previously. Indeed, this pattern has a kurtosis of 6.2. Figure 10B shows the results describing the kurtosis for the original image, the DOG function, and the wavelet when the bandwidth was fixed at 1.4 octaves. These results show that both the DOG function and the oriented wavelet show increased kurtosis. Figure 10C shows the results for 20 1/f noise patterns like those shown in Figure 3. These images have random phase spectra but amplitude spectra similar to natural scenes (and therefore the same principal components). Figure 11A shows the results describing the kurtosis of the response histograms for the wavelet as a function of the spatial frequency bandwidth. Results are shown for the six images shown in Figure 9. Figure 11B provides a histogram of the bandwidths that produce peak kurtosis for all 55 of our natural scenes. These results show that the bandwidth that produces the maximum kurtosis typically falls in the range of 1.0 to 3.0 octaves. This falls within the range of bandwidths most commonly found in the mammalian visual system (e.g., Tolhurst and Thompson 1982).
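The method of this section can be summarized in a short sketch. The helper below builds the radial log-Gabor of equation 4.3 directly in the frequency domain, and image_kurtosis pools the central 180 x 180 responses of a set of filters and returns the kurtosis of equation 2.3. All names, the square-image assumption, the bandwidth value, and the omission of orientation tuning and the even/odd phase pair are our own simplifications, not the original code.

```python
import numpy as np

def log_gabor_radial(n, k0, sigma_ratio=0.66):
    """Radial log-Gabor of equation 4.3 for an n x n image, built in
    the frequency domain. sigma_ratio = sigma/k0 sets the bandwidth;
    about 0.66 corresponds to roughly 1.4 octaves (illustrative)."""
    f = np.fft.fftfreq(n) * n                  # frequencies in cycles/picture
    k = np.hypot(f[:, None], f[None, :])
    k[0, 0] = 1.0                              # placeholder; DC is zeroed below
    g = np.exp(-np.log(k / k0) ** 2 / (2 * np.log(sigma_ratio) ** 2))
    g[0, 0] = 0.0
    return g

def image_kurtosis(image, filters):
    """Pool the central 180 x 180 responses of every frequency-domain
    filter and return the kurtosis (equation 2.3) of the pooled
    histogram, as in Section 4.2."""
    F = np.fft.fft2(np.log(image + 1.0))       # log intensities, as in Section 4.1
    c = image.shape[0] // 2
    lo, hi = c - 90, c + 90
    pooled = []
    for h in filters:
        resp = np.real(np.fft.ifft2(F * h))
        pooled.append(resp[lo:hi, lo:hi].ravel())
    x = np.concatenate(pooled)
    return np.mean((x - x.mean()) ** 4) / x.std() ** 4 - 3.0
```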
5 Discussion
The results shown above support the notion that codes with properties similar to those found in the mammalian visual system are effective at producing a sparse representation of natural scenes. The results in Figure 10 suggest that as one moves from the retina to the cortex, the kurtosis of the distribution increases. Higher levels appear to be more capable of taking advantage of the redundancy in natural scenes. The redundancy that is captured by these codes is not due to the amplitude spectra of the images. Figure 10B shows that images with
Figure 10: (A) The response histogram for a single image using the wavelet. The histogram is based on the response at four orientations, at spatial frequencies of 20 and 40 cycles/picture, and at two phases (even- and odd-symmetric). (B) The mean kurtosis for the images calculated from the pixels of the original image, after convolution with the difference of gaussians, and after convolution with the wavelet. The bottom right shows the same processes with 20 1/f noise patterns that have approximately the same amplitude spectra as the natural scenes.
amplitude spectra similar to those of natural scenes but random phase spectra produce histograms with kurtosis values of 0.0 (normal distributions). Since these images differ only in their phase spectra, it is clear that the
Figure 11: Distribution of peak kurtosis as a function of bandwidth. (A) The kurtosis of the response distributions for the six images shown in Figure 9 as a function of the spatial frequency bandwidth (orientation bandwidth is fixed at 40°). For each of the 55 images, the kurtosis of the response distribution was determined and the bandwidth that produced the highest kurtosis was calculated. (B) The distribution of spatial frequency bandwidths that produced the highest kurtosis.
phase spectra of natural scenes play a major role in determining whether the wavelet transform will produce a sparse output.
5.1 Synthetic Images That Produce High Kurtosis

In Field (1993), it was proposed that self-similar transforms are effective at producing sparse representations with natural scenes because such scenes have sparse, scale-invariant phase spectra. That is, the phases in such scenes are aligned at a relatively small number of locations in space/frequency, and the spectral extent of the alignment is proportional to frequency. It is a relatively simple process to synthesize images that have such properties. Figure 12 provides examples of two such images. These particular images were created using the equation

I(x, y) = \sum_{m=0}^{a} \sum_{n=1}^{\beta k^2} w_{mn}\, g\bigl(\tau k (x - x_{mn}),\; \tau k (y - y_{mn})\bigr), \quad k = \alpha^m \qquad (5.1)

where a is the number of scales in the image, \beta controls the density of the elements at each scale, \alpha controls the spectral distance between scales, and \tau controls the relative spatial extent of the function at each scale. Figure 12A and B show two densities where g(x, y) is a gaussian-modulated sinusoid. Each of the functions is added with the same average spatial amplitude independent of scale (w_{mn} has the same average amplitude at all scales). If each function is treated as a vector, then the vector length of each function decreases as the scale increases (proportional to 1/f). By increasing the number of vectors in proportion to the square of the frequency and adding them at random positions, this technique produces a scale-invariant 1/f amplitude spectrum in which the phases are nonetheless locally aligned (see Appendix). Each of these images can be thought of as an "inversion" of the visual code. In a sparse-distributed code, only a subset of the cells is active, but each cell has the same probability of activity. The method described in equation 5.1 is roughly equivalent to assigning a low probability to each vector in a wavelet code and summing the sparse set of vectors together. The state space of these images is analogous to the state space shown in Figure 4 but, of course, in higher dimensions. We are currently attempting to provide a better description of the high-dimensional state space and believe that a set of high-dimensional cones of different diameters is a useful analogy. A subset of vectors from the wavelet code can represent such an image because it was created from a subset of vectors of the wavelet code. Although these images do not have all the redundancy of natural scenes, they provide a good first approximation and certainly a better approximation than the 1/f noise patterns shown in Figure 3.³

³As the density approaches 0.5, the images show the same structure as the 1/f noise patterns shown earlier in Figure 3.

The results in Figure 10 suggest that although the DOG patterns show a significant increase in kurtosis, the oriented wavelet results in higher kurtosis. This suggests that, in natural scenes, when local structure is present (i.e., the phases are aligned), the structure tends to be oriented. This is not mathematically necessary. It is possible to create images
like those shown in Figure 12 using DOG functions instead of oriented wavelets. Under such circumstances, an oriented wavelet will not produce an increase in kurtosis over the DOG.

Figure 12: Images that produce high kurtosis can be created by "inverting" the visual code. The top images were created by distributing a set of self-similar functions as described in equation 5.1. The images were created by producing a random sum of the oriented wavelet basis functions. The images shown were created by assigning a probability of either (A) 0.01, (B) 0.08, or (C) 0.3 to each of the members of the wavelet basis set (e.g., Fig. 8). The lower right image (D) was created using nonoriented difference-of-gaussians (DOG) functions. It is proposed that images such as these provide a first approximation to the statistical structure of natural scenes. The state space is analogous to the cone (E) with the principal axes aligned with the Fourier vectors.
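The synthesis of equation 5.1 is easy to sketch. The version below scatters gaussian-modulated sinusoids of random orientation and phase at random positions, with the element count growing as k² and the element size shrinking as 1/(tau k); all parameter values and names are our own illustration, not those used to make Figure 12.

```python
import numpy as np

def synthesize(size=256, scales=4, beta=2.0, alpha=2.0, tau=8.0, seed=0):
    """Sketch of equation 5.1: at each scale k = alpha**m, add about
    beta*k**2 gaussian-modulated sinusoids of envelope width 1/(tau*k)
    at random positions, all with the same average amplitude."""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:size, 0:size] / size     # unit-square coordinates
    img = np.zeros((size, size))
    for m in range(1, scales + 1):
        k = alpha ** m
        width = 1.0 / (tau * k)                # envelope shrinks with scale
        f_c = 4.0 * k                          # carrier frequency grows with scale
        for _ in range(int(beta * k ** 2)):    # element count grows as k**2
            x0, y0 = rng.random(2)
            th = rng.random() * np.pi          # random orientation
            ph = rng.random() * 2 * np.pi      # random local phase
            u = (x - x0) * np.cos(th) + (y - y0) * np.sin(th)
            env = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * width ** 2))
            img += env * np.cos(2 * np.pi * f_c * u + ph)
    return img
```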
It is proposed that because natural scenes have regions where the phases are locally aligned (i.e., the images have local structure), both ganglion cell receptive fields and cortical receptive fields have the shape that they do. The amplitude spectra of natural scenes and the limits in quantal catch may certainly be important in understanding the spatial frequency tuning curves of ganglion cells and the contrast sensitivity function, as suggested by Atick and Redlich (1991, 1992). However, it is proposed here that it is the phase spectrum of natural scenes that one must understand to account for the phase spectra of receptive fields in the visual pathway. Indeed, it is likely that many natural phenomena show this self-similar phase structure, and this may be why the wavelet transform has found so much success in applied mathematics.

5.2 Higher-Order Redundancy in Natural Scenes

Although the images in Figure 12 may provide a good first approximation to the statistics of the environment, these images certainly do not capture all the redundancy that is present in real scenes and do not look that much like real scenes. Consider the analogy of looking at the redundancy in musical symphonies. One might try to find the properties of the typical notes that are played. One might then try to create synthetic symphonies by playing those notes randomly. Although these sounds would be more like symphonies than random noise, they would not sound like symphonies because they would not contain any of the rules of music. Similarly, the images in Figure 12 are more like natural scenes than 1/f noise, but they do not contain any of the combinatorial rules found in real scenes. As previously noted (Field 1993), natural scenes show a significant degree of continuity across space and frequency. Figure 13 shows a simple example of this continuity. A "fractal" edge is split into three frequency bands. Across the different frequency bands and across the length of the edge, the orientation of the edge twists and turns. Locally, the shift in orientation is continuous. The wavelet code does not directly take advantage of this continuity. However, it has recently been proposed (Field et al. 1993) that the visual system may make use of this redundancy by using connections between locally oriented units, along the lines of the lateral connections found in the primary visual cortex (e.g., Rockland and Lund 1983; Blasdel et al. 1985; see Gilbert 1992 for a review). Computational studies by Zucker et al. (1989) suggest that such an approach may be an effective strategy for representing continuous features in natural scenes. It should also be emphasized that the "cells" modeled in this study do not have the spatial nonlinearities that are often found in visual neurons. Such nonlinearities come in various forms. First, in primary visual cortex, "simple cells" do not represent the majority of cells. Both complex and hypercomplex cells certainly represent a major component of visual coding and show highly nonlinear behavior. However, the basic findings described here will at least apply to complex cells. The bandwidths
and spatial extents of complex cells are similar to those of simple cells. As noted earlier, one model of complex cells suggests that they detect an
alignment of phases, like simple cells, but do not differentiate the absolute phase of that alignment. Since they are localized in space/frequency in a manner similar to simple cells, it would be expected that complex cells will show a sparse representation similar to that of the modeled simple cells studied here. It should also be noted that even classically defined simple cells exhibit a number of important nonlinearities (e.g., end-stopping and cross-orientation inhibition). We currently believe that these nonlinearities may help to increase the "sparseness" of the code by allowing the code to play a local "winner-take-all" strategy. This is beyond the scope of this paper, but it should be mentioned that the kurtosis values shown in Figures 10 and 11 are likely to be significantly lower than the kurtosis values actually found in the mammalian visual cortex. It is important to treat the results described in this study with some caution. We have not added the nonlinearities known to exist in the mammalian visual system, and we have searched through only a small variety of linear transforms. There may well be linear transforms that produce higher kurtosis than the wavelets tested here, and it is quite likely that there exist more effective nonlinear codes. Our main goal in this paper is to demonstrate that the bandpass, localized receptive fields found in the mammalian visual system can take advantage of the redundancy in the environment without compressing the representation (reducing redundancy) and without using the principal components.

5.3 Why Sparse Coding?

In this paper, evidence was provided that codes that share properties with the mammalian visual system produce a sparse-distributed output to natural scenes. But the question remains as to why a sensory system would evolve a strategy like sparse coding. As was noted, there are a number of reasons that one might want to produce a compact code. A compact code allows the data to be stored and transmitted with a smaller population of total cells. But what are the advantages of sparse codes? Under some conditions, it is possible to use the sparse output to produce a compressed image. Techniques such as run-length coding are designed to do just that. There is also a considerable literature on compression techniques with sparse matrices (e.g., Evans 1985; Schendel
Figure 13: Facing page. Unlike the scenes shown in Figure 12, natural scenes have structures that are continuous across space and frequency. The orientation of the edge shifts across the length of the edge as well as across the different scales. The edge has a well-defined orientation only at a given scale and position along the length of the edge. This oriented structure that is localized in space/frequency makes the wavelet code an effective representation. However, the continuity between scales and across the length of the edge at each scale represents a form of redundancy that is not directly captured by the wavelet and requires more complex processing.
1989). However, it is proposed here that sparse coding serves purposes other than compression. Barlow (1972, 1985) has proposed that codes that minimize the number of active neurons can be useful for the representation and detection of "suspicious coincidences." Zetzsche (1990) has suggested that it is the work on associative memories that provides the most biologically plausible reasons that sensory systems would use sparse coding schemes. Three interrelated advantages are described below.
Signal-to-Noise. First, a sparse coding scheme can increase the signal-to-noise ratio. As was noted in Field (1987), if most of the variance of a data set is represented by a small subset of cells, then that subset must have a high response relative to the cells that are not part of the subset. The smaller the subset, the higher the response of each member of the subset, given an image with constant variance. If we consider the response of the subset as "the signal," and if all the cells in the population are subject to additive noise, then by increasing the response of the subset of cells relative to the population, it is possible to increase the probability of detecting the correct subset of cells that represent the image. However, it should be emphasized that arguments of this form depend critically on the properties of the noise (e.g., correlated versus uncorrelated) and the location of the noise (e.g., photon versus neural transduction noise).

Correspondence and Feature Detection. Although signal-to-noise considerations may be important, it is proposed that the main reason for sparse coding is that it assists in the process of recognition and generalization. In an ideal sparse code, the activity of any particular basis function has a low probability. Because the response of each cell is relatively rare, tasks that require matching of features should be more successful. Consider the problem of identifying a corresponding structure (e.g., an edge) across two frames of a movie or across two images of a stereo-pair. If a cell's response has probability p and there are n comparable cells of that type within some region of the visual field, then, assuming independent probabilities of response, the probability of detecting the correct correspondence depends on the probability that only the corresponding cell responds, which is

P(\text{only corresponding cell responds}) = (1 - p)^n
The lower the probability p (i.e., the more sparse the code), the more likely a correct correspondence. As noted above, we would not expect cortical cells to be completely independent of their neighbors. Nonetheless, the general rule should still hold. As a code becomes more sparse, the probability of detecting a correct correspondence increases. This suggests that a sparse code should be of assistance in tasks requiring solutions to a correspondence problem and can be related to what Barlow calls a suspicious coincidence (e.g., Barlow 1989). If the probability that any cell
responds is low (p \ll 0.5), then the probability of two cells responding is p^2 (p^2 \ll 0.25), assuming response independence. Higher-order relations (pairwise, triad, etc.) become increasingly rare and therefore more informative when they are present in the stimulus. This implies that the unique pattern of activity found with sparse codes may also be of assistance with the general problem of recognition. If relations among units are used to recognize a particular view of an object (e.g., a face), then with a sparse code, any particular high-order relation is relatively rare. In a compact code, a few cells have a relatively high probability of response. Therefore, any particular high-order relation among this group is relatively common. Overall, in a compact code different objects are represented in terms of the differential firing of the same subset of cells. With a sparse-distributed code, a large population has a relatively low probability of response. Different objects are represented by a unique subset of cells. That is, different objects are represented by which cells are active rather than by how much they are active. The primary difficulty with this line of thinking is that implementing a process that detects sparse structure in neural architecture may not be so straightforward. Higher-order relations may be relatively rare in a sparse code, but detecting all possible nth-order relations among m cells requires on the order of m^n detectors. It is likely that the visual system looks only for "probable" relations by looking at combinations only within local regions and looking only for probable structure (e.g., continuity). This is certainly an interesting problem, but further study of the higher-order structure of natural scenes would be required to determine whether the visual system is using an effective strategy.
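A quick numeric illustration of the correspondence formula (n = 100 comparable cells is our own arbitrary choice):

```python
# Probability that only the corresponding cell responds, for n = 100
# comparable cells at several response probabilities p (equation above).
for p in (0.5, 0.1, 0.01):
    print(p, (1 - p) ** 100)
# p = 0.5  -> ~8e-31  (dense code: the match is hopelessly ambiguous)
# p = 0.1  -> ~2.7e-5
# p = 0.01 -> ~0.37   (sparse code: the match is usually unambiguous)
```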
Storage and Retrieval with Associative Memory. The third advantage of sparse coding comes from work on networks with associative memories and is related to the above discussion. Several authors have noted that when the inputs to the networks are sparse, the networks can store more memories and provide more effective retrieval with partial information (Palm 1980; Baum et al. 1988). Baum et al. (1988) suggest that the advantages of sparsity for cell efficiency are ”so great that it can be useful to artificially ‘sparsify’ data or responses which are not sparse to begin with.” Indeed, it is not surprising that many types of networks will solve problems more efficiently if the inputs are first ”sparsified.” Since the sparse representation will have fewer higher-order relations, learning to classify or discriminate inputs should require less computation. Therefore, “sparsifying” the input should help to simplify many of the problems that the network is designed to face. This work with associative memory will hopefully lead to a better understanding of the advantages of sparse codes and help us to understand whether transforming the data to create a sparse input is generally a useful strategy to help networks solve problems.
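A minimal sketch in the spirit of these results, assuming a Willshaw-style binary associative memory (all parameter values and names are our own illustration): with sparse patterns, a Hebbian matrix of coactivations recalls a stored pattern from a partial cue essentially without error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, n_pat = 256, 8, 40                       # n cells, k active per pattern
patterns = np.zeros((n_pat, n), dtype=bool)
for p in patterns:
    p[rng.choice(n, size=k, replace=False)] = True

# Binary Hebbian matrix: a synapse is "on" if its two cells were ever
# coactive in a stored pattern (Willshaw/Palm style).
W = np.zeros((n, n), dtype=bool)
for p in patterns:
    W |= np.outer(p, p)

# Recall from a partial cue: drop half the active bits, then switch on
# every cell that receives input from all remaining cue bits.
cue = patterns[0].copy()
cue[np.flatnonzero(cue)[: k // 2]] = False
sums = W.astype(np.int32) @ cue.astype(np.int32)
recalled = sums >= cue.sum()
print(np.array_equal(recalled, patterns[0]))   # True with overwhelming probability
```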
In this study, there is no attempt to provide a complete description of the possible uses of sparse codes. Rather, the goal is to demonstrate that sparse coding represents one important method for taking advantage of redundancy and that sensory systems show evidence that they make use of this method.

5.4 Factorial Codes and Projection Pursuit

A gaussian distribution has the lowest redundancy (highest entropy) of any distribution given a fixed variance. Thus, a distribution with high kurtosis has a lower entropy than a gaussian distribution. The results shown in Figure 10 show that the visual code has relatively redundant first-order histograms. A sparse-distributed code converts high-order redundancy (relations between units) into first-order redundancy (the response distributions of the basis vectors). Therefore, a transform that produces highly redundant histograms has decreased the higher-order redundancy. Is it possible to find a code that completely removes higher-order redundancy? In such a case, the responses of the vectors would be independent of one another. For example, if the responses of two vectors are independent, then

P(V_i \mid V_j) = P(V_i) \quad \text{and} \quad P(V_j \mid V_i) = P(V_j)

and in general, if all the vectors are independent, then the probability of any given image can be determined by multiplying the probabilities of each of the vectors:

P(\text{image}) = \prod_i P(V_i)

In this case, the image is described as having a "factorial code" (Barlow et al. 1989; Barlow 1987; Schmidhuber 1992; Atick et al. 1993). A population of images like those shown in Figure 12 can have nearly factorial codes, since these were actually generated by combining nearly orthogonal vectors with independent probabilities. That is, the probability that any function was added to the image did not depend on the probability that any other function was added to the image. The differences between the images in Figures 12 and 13 point out the importance of sources of redundancy that remain after coding by the wavelet. Indeed, the fact that natural scenes do not look like the images in Figure 12 demonstrates that after natural scenes are coded by the wavelet, the responses of the wavelet basis functions will not be independent. Thus the individual probabilities of units are unlikely to predict the probability of a particular natural scene.
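As a toy illustration of the product rule (the probabilities and names are our own): for units that are switched on independently, the empirical probability of a joint pattern matches the product of the individual probabilities.

```python
import numpy as np

rng = np.random.default_rng(3)
p = np.array([0.10, 0.20, 0.05])               # independent unit probabilities
X = rng.random((200_000, 3)) < p               # each row: one "image"
pattern = np.array([True, False, True])
joint = np.mean((X == pattern).all(axis=1))
product = np.prod(np.where(pattern, p, 1 - p))
print(joint, product)                          # nearly equal: a factorial code
```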
Whether one is searching for a sparse code or a factorial code, the goal is to find a set of units that are as independent from each other as possible. However, the sparse coding approach provides a guide for achieving this goal. By maximizing the kurtosis (i.e., the redundancy) of the response histograms, one effectively minimizes the statistical dependencies between units (Field 1987). In this paper, there has been no discussion of how one might find the optimal sparse code for a given data set. An effective sparse code must have two properties. It must span the space of inputs (i.e., preserve information) and show high kurtosis in the response histograms. At this time, there do not appear to be any learning rules that can achieve these goals. There is a related problem, described as "projection pursuit" (e.g., Friedman 1987; Huber 1985; Intrator 1992, 1993). In many domains where one must deal with high-dimensional data, one is interested in finding interesting projections of the data. The "projections" refer to the response histograms of the vectors describing the data. Intrator (1992, 1993) has noted that one should look for projections (histograms) that are as far from gaussian as possible. Indeed, in line with the above discussion, the more non-gaussian the histogram, the more independent the units should be. High kurtosis represents just one way to deviate from a gaussian distribution. It is not clear whether natural scenes have redundancy that will allow other forms of non-gaussian behavior in the response histograms. It is also not clear that the visual system can take advantage of other forms of non-gaussian behavior. In this paper, it is noted only that the visual system appears to have found one type of non-gaussian distribution (high kurtosis), and this particular distribution results in a sparse representation. However, at this time, there is no known technique for finding the optimal sparse code. Techniques such as those of Intrator (1992, 1993), Foldiak (1990), Schmidhuber (1992), and Linsker (1993) may ultimately provide insights into this problem, but their work demonstrates that the solution may not be a simple one.

6 Summary

In this paper, we have compared two approaches to sensory coding: compact and sparse-distributed coding. When the statistics of the inputs are stationary, it was noted that compact coding depends primarily on the amplitude spectra of the data (i.e., the correlations). When effective dimensionality reduction is possible, compact coding provides a good first step. Indeed, some aspects of sensory coding (e.g., trichromacy) require a consideration of the efficiency achieved by dimensionality reduction. However, many redundant data sets do not allow effective compression. Furthermore, even if compact coding is possible, there will be a variety of codes that will produce equal compression. If the goal of a code is to assist higher-level processing (e.g., recognition), then other coding strategies and other forms of redundancy must
be considered. To account for the primary receptive field properties of cells found in the mammalian visual system (i.e., localized, bandpass, self-similar), it was proposed that one must consider a sparse-distributed representation of natural scenes. It was noted that when the statistics of the data are stationary, sparse coding depends primarily on the phase spectra of the data. In this paper, we have concentrated on the visual system and the statistics of natural scenes. However, the main ideas presented here can be applied to any sensory system. Natural sounds, for example, are likely to have local structure similar to that found in natural scenes. We are currently working on the question of whether the sparse coding approach to natural sounds can provide an account of the frequency selectivity found in auditory neurons. It is proposed that the evidence for this selectivity will be found in the kurtosis of the response distribution. If a sensory system is designed for sparse coding, then we would expect cells in that sensory system to show high kurtosis in the presence of the typical sensory environment. If the general goal of sensory coding is to produce a sparse representation of the environment, then we would expect that recording from any sensory neuron in any animal will produce a histogram with high kurtosis, as long as the recording is performed in an awake, behaving animal in its natural environment. This is a general claim that is left for future work to test.
Appendix
An image consisting of the appropriate functions distributed across the image in the appropriate manner will be scale invariant in both contrast and structure. As noted previously (Field 1987), a two-dimensional image that is scale invariant in contrast will have an amplitude spectrum that falls with frequency as 1/f. In this appendix, it is demonstrated that images that obey the rules described in equation 5.1 will be scale invariant in their contrast (have 1/f amplitude spectra) and will also have a scale-invariant structure. Consider a simple function localized in space and assigned a scale k. In equation 5.1, the scale is k = \alpha^m. By the scaling theorem, it follows that

g(kx, ky) \leftrightarrow \frac{1}{k^2}\, G\!\left(\frac{u}{k}, \frac{v}{k}\right)
Now consider a sum of scaled functions that are placed at random positions relative to one another. Functions in random positions will, on average, have phases that are orthogonal. In line with the sum of intensities in incoherent optics, the power spectrum (square of the amplitude spectrum) is equal to the sum of the power spectra of each of the functions. Thus, for a sum of n randomly placed copies,

\text{Space:}\quad \sum_{n} g(kx + x_n,\; ky + y_n) \qquad\qquad \text{Amplitude spectrum:}\quad \sqrt{n}\,\left|G(u/k, v/k)\right| / k^2

Figure 14: The spectra of the summed functions at the different scales, plotted against log frequency; see text.
In equation 5.1, the images are created by summing \beta k^2 functions at each scale, where k \propto \alpha^m. Thus at each scale, the spectrum is proportional to

G(u, v, k) = \sqrt{\beta k^2}\; G(u/k, v/k)/k^2 = \sqrt{\beta}\; G(u/k, v/k)/k
Thus at each scale, the peak of the spectrum falls by a factor of 1/k. Figure 14 shows an example of the spectra at the different scales. The spacing in the frequency domain is proportional to \alpha^m (m = 1, 2, 3, \ldots). On a log frequency axis, the spacing is

\log(\alpha^m) \propto m \qquad (m = 1, 2, 3, \ldots)
With this sum of scaled functions, the amplitude spectra of the synthetic image will fall as 1/f; that is, the contrast is scale invariant. The
local structure is also scale invariant. At each scale k (i.e., within each frequency band), the image consists of \beta k^2 elements of amplitude w_{mn} and size 1/\tau k. If we magnify the image by some factor r, then we shift to a scale with \beta (r k)^2 elements of size 1/(r k) and amplitude w_{mn}. However, if we scale our window with the magnification, then the area is reduced by a factor of 1/r^2. Thus with a window of constant angular size, the number of elements at any scale is independent of magnification:

\beta (r k)^2 / r^2 = \beta k^2
Thus, at all scales there remain \beta k^2 randomly positioned elements of size 1/\tau k. The images created using the method described in Figure 12 have local structure that is scale invariant. As noted in the text, this provides a good first approximation to natural scenes but fails to capture the combinatorial rules found in real scenes.
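The 1/f claim can be checked numerically with the synthesize() sketch given in Section 5.1 above (again, an illustration under our own parameter choices): the radially averaged amplitude, multiplied by frequency, should be roughly constant over the band covered by the elements.

```python
import numpy as np

img = synthesize(size=256)                     # the sketch from Section 5.1
amp = np.abs(np.fft.fftshift(np.fft.fft2(img)))
yy, xx = np.mgrid[0:256, 0:256] - 128
r = np.hypot(xx, yy).astype(int)               # radial frequency in cycles/picture
for f in (8, 16, 32, 64):
    print(f, amp[r == f].mean() * f)           # roughly constant if amp ~ 1/f
```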
Acknowledgments

This work was supported by NIH Grant R29MR50588. I would like to thank Peter Foldiak, Nuala Brady, and the six reviewers for their helpful comments.
References

Adelson, E. H., Simoncelli, E., and Hingorani, R. 1987. Orthogonal pyramid transforms for image coding. Proc. SPIE Visual Communications and Image Processing II 845.

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.

Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251.

Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.

Atick, J. J., Li, Z., and Redlich, A. N. 1993. What does post-adaptation color appearance reveal about cortical color coding? Vision Res. 33, 123-129.

Baddeley, R. J., and Hancock, P. J. 1991. A statistical analysis of natural images matches psychophysically derived orientation tuning curves. Proc. Roy. Soc. London B 246, 219-223.

Barlow, H. B. 1961. The coding of sensory messages. In Current Problems in Animal Behaviour. Cambridge University Press, Cambridge.

Barlow, H. B. 1972. Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1, 371-394.

Barlow, H. B. 1985. The Twelfth Bartlett Memorial Lecture: The role of single neurons in the psychology of perception. Q. J. Exp. Psychol. 37A, 121-145.

Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.

Barlow, H. B., and Foldiak, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 54-72. Addison-Wesley, Reading, MA.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.

Baum, E. B., Moody, J., and Wilczek, F. 1988. Internal representations for associative memory. Biol. Cybern. 59, 217-228.

Bialek, W., Ruderman, D. L., and Zee, A. 1991. Optimal sampling of natural images: A design principle for the visual system? In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 363-369. Morgan Kaufmann, San Mateo, CA.

Blasdel, G. G., Lund, J. S., and Fitzpatrick, D. 1985. Intrinsic connections of macaque striate cortex: Axonal projections of cells outside lamina 4C. J. Neurosci. 5, 3350-3369.

Bossomaier, T., and Snyder, A. W. 1986. Why spatial frequency processing in the visual cortex? Vision Res. 26, 1307-1309.

Brady, N., and Field, D. J. 1993. What's constant in contrast constancy?: A vector length model of suprathreshold sensitivity. Vision Res., submitted.

Buchsbaum, G., and Gottschalk, A. 1983. Trichromacy, opponent colors coding and optimum color information transmission in the retina. Proc. Roy. Soc. London B 220, 89-113.

Burt, P. J., and Adelson, E. H. 1983. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31, 532-540.

Burton, G. J., and Moorhead, I. R. 1987. Color and spatial structure in natural scenes. Appl. Opt. 26, 157-170.

Churchland, P. S., and Sejnowski, T. J. 1992. The Computational Brain. MIT Press, Cambridge, MA.

Daubechies, I. 1988. Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41, 909-996.

Daugman, J. 1985. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Amer. A 2(7), 1160-1169.

Daugman, J. G. 1988. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transact. Acoustics, Speech Signal Process. 36(7), 1169-1179.

Daugman, J. G. 1991. Self-similar oriented wavelet pyramids: Conjectures about neural non-orthogonality. In Representations of Vision, A. Gorea, ed. Cambridge University Press, Cambridge.

Derrico, J. B., and Buchsbaum, G. 1991. A computational model of spatiochromatic coding in early vision. J. Visual Commun. Image Process. 2, 31-38.

DeValois, R. L., Albrecht, D. G., and Thorell, L. G. 1982. Spatial frequency selectivity of cells in macaque visual cortex. Vision Res. 22, 545-559.

Eckert, M. P., and Buchsbaum, G. 1993. Efficient coding of natural time varying images in the early visual system. Phil. Trans. R. Soc. London B 339, 385-395.

Evans, D. 1985. Sparsity and Its Applications. Cambridge University Press, Cambridge.

Farge, M., Hunt, J., and Vassilicos, J. C., eds. 1992. Wavelets, Fractals and Fourier Transforms: New Developments and New Applications. Oxford University Press, Oxford.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Amer. A 4, 2379-2394.

Field, D. J. 1989. What the statistics of natural images tell us about visual coding. Proc. SPIE 1077, 269-276.

Field, D. J. 1993. Scale-invariance and self-similar 'wavelet' transforms: An analysis of natural scenes and mammalian visual systems. In Wavelets, Fractals and Fourier Transforms, M. Farge, J. Hunt, and J. C. Vassilicos, eds. Oxford University Press, Oxford.

Field, D. J., and Tolhurst, D. J. 1986. The structure and symmetry of simple-cell receptive field profiles in the cat's visual cortex. Proc. Roy. Soc. London B 228, 379-400.

Field, D. J., Hayes, A., and Hess, R. F. 1993. Contour integration by the human visual system: Evidence for a local "association field." Vision Res. 33, 173-193.

Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Vol. 1, pp. 401-405. IEEE Press, New York.

Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.

Friedman, J. H. 1987. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82, 249-266.

Gabor, D. 1946. Theory of communication. J. IEE London 93(III), 429-457.

Gilbert, C. D. 1992. Horizontal integration and cortical dynamics. Neuron 9, 1-13.

Hancock, P. J., Baddeley, R. J., and Smith, L. S. 1992. The principal components of natural images. Network 3, 61-70.

Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13, 435-475.

Intrator, N. 1992. Feature extraction using an unsupervised neural network. Neural Comp. 4, 98-107.

Intrator, N. 1993. Combining exploratory projection pursuit and projection pursuit regression with application to neural networks. Neural Comp. 5, 443-455.

Jones, J., and Palmer, L. 1987. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58(6), 1233-1258.

Kersten, D. 1987. Predictability and redundancy of natural images. J. Opt. Soc. Amer. A 4, 2395-2400.

Klein, S. A., and Levi, D. M. 1985. Hyperacuity thresholds of 1 sec: Theoretical predictions and empirical validation. J. Opt. Soc. Amer. A 2(7), 1170-1190.

Kulikowski, J. J., Marcelja, S., and Bishop, P. O. 1982. Theory of spatial position and spatial frequency relations in the receptive fields of simple cells in the visual cortex. Biol. Cybern. 43, 187-198.

Lehky, S. R., and Sejnowski, T. J. 1990. Network model of shape-from-shading: Neural function arises from both receptive and projective receptive fields. Nature (London) 333, 452-454.

Linsker, R. 1988. Self-organization in a perceptual network. Computer 21, 105-117.

Linsker, R. 1993. Deriving receptive fields using an optimal encoding criterion. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 953-960. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J., and Miller, K. D. 1990. Analysis of Linsker's simulation of Hebbian rules. Neural Comp. 1, 173-187.

Mallat, S. G. 1989. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transact. Pattern Anal. Machine Intelligence 11(7), 674-693.

Maloney, L. T. 1986. Evaluation of linear models of surface spectral reflectance with small numbers of parameters. J. Opt. Soc. Amer. A 3, 1673-1683.

Morrone, M. C., and Burr, D. C. 1988. Feature detection in human vision: A phase-dependent energy model. Proc. Roy. Soc. London B 235, 221-245.

Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.

Palm, G. 1980. On associative memory. Biol. Cybern. 36, 19-31.

Pratt, W. K. 1978. Digital Image Processing. Wiley, New York.

Rockland, K., and Lund, J. S. 1983. Intrinsic laminar lattice connections in primary visual cortex. J. Comp. Neurol. 216, 303-318.

Sanger, T. D. 1989. Optimal unsupervised learning in a single layer network. Neural Networks 2, 459-473.

Schendel, U. 1989. Sparse Matrices. Wiley, New York.

Schmidhuber, J. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4, 863-879.

Shapley, R. M., and Lennie, P. 1985. Spatial frequency analysis in the visual system. Annu. Rev. Neurosci. 8, 547-583.

Tolhurst, D. J., and Thompson, I. D. 1982. On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proc. Roy. Soc. London Ser. B 213, 183-199.

Tolhurst, D. J., Tadmor, Y., and Tang, C. 1992. The amplitude spectra of natural images. Ophthal. Physiol. Opt. 12, 229-232.

van Hateren, J. H. 1992. Real and optimal neural images in early vision. Nature (London) 360, 68-69.

Watson, A. B. 1983. Detection and recognition of simple spatial forms. In Physical and Biological Processing of Images, O. J. Braddick and A. C. Sleigh, eds. Springer-Verlag, Berlin.

Webber, C. J. St. C. 1991. Competitive learning, natural images and cortical cells. Network 2, 169-187.

Webster, M. A., and DeValois, R. L. 1985. Relationship between spatial-frequency and orientation tuning of striate-cortex cells. J. Opt. Soc. Amer. A 2(2), 1124-1132.

Zetzsche, C. 1990. Sparse coding: The link between low level vision and associative memory. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds. North-Holland, Amsterdam.

Zucker, S. W., Dobbins, A., and Iverson, L. 1989. Two stages of curve detection suggest two styles of visual computation. Neural Comp. 1, 68-81.
Received January 8, 1993; accepted October 20, 1993.
This article has been cited by: 1. Ming-Jun Chen, Alan C. Bovik. 2010. Fast structural similarity index algorithm. Journal of Real-Time Image Processing . [CrossRef] 2. Laurent U. Perrinet. 2010. Role of Homeostasis in Learning Sparse RepresentationsRole of Homeostasis in Learning Sparse Representations. Neural Computation 22:7, 1812-1836. [Abstract] [Full Text] [PDF] [PDF Plus] 3. Evgenia Rubinshtein, Anuj Srivastava. 2010. Optimal linear projections for enhancing desired data statistics. Statistics and Computing 20:3, 267-282. [CrossRef] 4. Julien Chauveau, David Rousseau, François Chapeau-Blondeau. 2010. Fractal capacity dimension of three-dimensional histogram from color images. Multidimensional Systems and Signal Processing 21:2, 197-211. [CrossRef] 5. Ying-Ying ZHANG, Xin JIN, Hai-Qing GONG, Pei-Ji LIANG. 2010. Temporal and Spatial Patterns of Retinal Ganglion Cells in Response to Natural Stimuli*. PROGRESS IN BIOCHEMISTRY AND BIOPHYSICS 37:4, 389-396. [CrossRef] 6. Aapo Hyvärinen. 2010. Statistical Models of Natural Images and Cortical Visual Representation. Topics in Cognitive Science 2:2, 251-264. [CrossRef] 7. Henrikas Vaitkevicius, Vilius Viliunas, Remigijus Bliumas, Rytis Stanikunas, Algimantas Svegzda, Aldona Dzekeviciute, Janus J. Kulikowski. 2009. Influences of prolonged viewing of tilted lines on perceived line orientation: the normalization and tilt after-effect. Journal of the Optical Society of America A 26:7, 1553. [CrossRef] 8. Yu. E. Shelepin, V. N. Chikhman, N. Foreman. 2009. Analysis of the studies of the perception of fragmented images: global description and perception using local features. Neuroscience and Behavioral Physiology 39:6, 569-580. [CrossRef] 9. Dongyue Chen, Liming Zhang, Juyang Weng. 2009. Spatio–Temporal Adaptation in the Unsupervised Development of Networked Visual Neurons. IEEE Transactions on Neural Networks 20:6, 992-1008. [CrossRef] 10. G. A. Alvarez, A. Oliva. 2009. Spatial ensemble statistics are efficient codes that can be represented with reduced attention. Proceedings of the National Academy of Sciences 106:18, 7345-7350. [CrossRef] 11. R. Pashaie, N.H. Farhat. 2009. Self-Organization in a Parametrically Coupled Logistic Map Network: A Model for Information Processing in the Visual Cortex. IEEE Transactions on Neural Networks 20:4, 597-608. [CrossRef] 12. Ruiming Liu. 2009. Eigentargets Versus Kernel Eigentargets: Detection of Infrared Point Targets Using Linear and Nonlinear Subspace Algorithms. Journal of Infrared, Millimeter, and Terahertz Waves 30:3, 278-293. [CrossRef]
13. JAN WILTSCHUT, FRED H. HAMKER. 2009. Efficient coding correlates with spatial frequency tuning in a model of V1 receptive field organization. Visual Neuroscience 26:01, 21. [CrossRef] 14. Sen Jia, Yuntao Qian. 2009. Constrained Nonnegative Matrix Factorization for Hyperspectral Unmixing. IEEE Transactions on Geoscience and Remote Sensing 47:1, 161-173. [CrossRef] 15. Le Li, Yu-Jin Zhang. 2009. FastNMF: highly efficient monotonic fixed-point nonnegative matrix factorization algorithm with good applicability. Journal of Electronic Imaging 18:3, 033004. [CrossRef] 16. Joschka Boedecker, Oliver Obst, N. Michael Mayer, Minoru Asada. 2009. Initialization and self-organized optimization of recurrent neural network connectivity. HFSP Journal 3:5, 340. [CrossRef] 17. K. Labusch, E. Barth, T. Martinetz. 2008. Simple Method for High-Performance Digit Recognition Based on Sparse Coding. IEEE Transactions on Neural Networks 19:11, 1985-1989. [CrossRef] 18. Christopher J. Rozell, Don H. Johnson, Richard G. Baraniuk, Bruno A. Olshausen. 2008. Sparse Coding via Thresholding and Local Competition in Neural CircuitsSparse Coding via Thresholding and Local Competition in Neural Circuits. Neural Computation 20:10, 2526-2563. [Abstract] [PDF] [PDF Plus] 19. Ben Willmore, Ryan J. Prenger, Michael C.-K. Wu, Jack L. Gallant. 2008. The Berkeley Wavelet Transform: A Biologically Inspired Orthogonal Wavelet TransformThe Berkeley Wavelet Transform: A Biologically Inspired Orthogonal Wavelet Transform. Neural Computation 20:6, 1537-1564. [Abstract] [PDF] [PDF Plus] 20. Ali Yoonessi, Frederick A. A. Kingdom, Samih Alqawlaq. 2008. Is color patchy?. Journal of the Optical Society of America A 25:6, 1330. [CrossRef] 21. Y. Hel-Or, D. Shaked. 2008. A Discriminative Approach for Wavelet Denoising. IEEE Transactions on Image Processing 17:4, 443-457. [CrossRef] 22. Steven K. Shevell, Frederick A. A. Kingdom. 2008. Color in Complex Scenes. Annual Review of Psychology 59:1, 143-166. [CrossRef] 23. Wilson S. Geisler. 2008. Visual Perception and the Statistical Properties of Natural Scenes. Annual Review of Psychology 59:1, 167-192. [CrossRef] 24. Pierre Chainais. 2007. Infinitely Divisible Cascades to Model the Statistics of Natural Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:12, 2105-2119. [CrossRef] 25. Ling-Zhi Liao, Si-Wei Luo, Mei Tian. 2007. “Whitenedfaces” Recognition With PCA and ICA. IEEE Signal Processing Letters 14:12, 1008-1011. [CrossRef] 26. Joseph F. Murray, Kenneth Kreutz-Delgado. 2007. Visual Recognition and Inference Using Dynamic Overcomplete Sparse LearningVisual Recognition and
Inference Using Dynamic Overcomplete Sparse Learning. Neural Computation 19:9, 2301-2352. [Abstract] [PDF] [PDF Plus] 27. Sheng Li, Si Wu. 2007. Robustness of neural codes and its implication on natural image processing. Cognitive Neurodynamics 1:3, 261-272. [CrossRef] 28. Odelia Schwartz, Anne Hsu, Peter Dayan. 2007. Space and time in visual context. Nature Reviews Neuroscience 8:7, 522-535. [CrossRef] 29. Leonardo Franco, Edmund T. Rolls, Nikolaos C. Aggelopoulos, Jose M. Jerez. 2007. Neuronal selectivity, population sparseness, and ergodicity in the inferior temporal visual cortex. Biological Cybernetics 96:6, 547-560. [CrossRef] 30. Denis Mareschal, Michael S. C. Thomas. 2007. Computational Modeling in Developmental Psychology. IEEE Transactions on Evolutionary Computation 11:2, 137-150. [CrossRef] 31. L. Perrinet. 2007. Dynamical neural networks: Modeling low-level vision at short latencies. The European Physical Journal Special Topics 142:1, 163-225. [CrossRef] 32. Jordi Vitri, Marco Bressan, Petia Radeva. 2007. Bayesian Classification of Cork Stoppers Using Class-Conditional Independent Component Analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 37:1, 32-38. [CrossRef] 33. Barbara L. Finlay. 2007. Endless minds most beautiful. Developmental Science 10:1, 30-34. [CrossRef] 34. Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan. 2006. Soft Mixer Assignment in a Hierarchical Generative Model of Natural Scene StatisticsSoft Mixer Assignment in a Hierarchical Generative Model of Natural Scene Statistics. Neural Computation 18:11, 2680-2718. [Abstract] [PDF] [PDF Plus] 35. Manuel F. Casanova, Imke A. J. Kooten, Andrew E. Switala, Herman Engeland, Helmut Heinsen, Harry W. M. Steinbusch, Patrick R. Hof, Juan Trippe, Janet Stone, Christoph Schmitz. 2006. Minicolumnar abnormalities in autism. Acta Neuropathologica 112:3, 287-303. [CrossRef] 36. Michael C.-K. Wu, Stephen V. David, Jack L. Gallant. 2006. COMPLETE FUNCTIONAL CHARACTERIZATION OF SENSORY NEURONS BY SYSTEM IDENTIFICATION. Annual Review of Neuroscience 29:1, 477-505. [CrossRef] 37. Yasser Roudi, Alessandro Treves. 2006. Localized activity profiles and storage capacity of rate-based autoassociative networks. Physical Review E 73:6. . [CrossRef] 38. Melchi M. Michel , Robert A. Jacobs . 2006. The Costs of Ignoring High-Order Correlations in Populations of Model NeuronsThe Costs of Ignoring High-Order Correlations in Populations of Model Neurons. Neural Computation 18:3, 660-682. [Abstract] [PDF] [PDF Plus]
39. Benjamin J. Balas , Pawan Sinha . 2006. Receptive Field Structures for RecognitionReceptive Field Structures for Recognition. Neural Computation 18:3, 497-520. [Abstract] [PDF] [PDF Plus] 40. Tatyana O. Sharpee, Hiroki Sugihara, Andrei V. Kurgansky, Sergei P. Rebrik, Michael P. Stryker, Kenneth D. Miller. 2006. Adaptive filtering enhances information transmission in visual cortex. Nature 439:7079, 936-942. [CrossRef] 41. Long Xie, Masatoshi Ogawa, Youichi Kigawa, Harutoshi Ogai. 2006. Intelligent Surveillance System Design Based on Independent Component Analysis and Wireless Sensor Network. IEEJ Transactions on Electronics, Information and Systems 126:12, 1543-1550. [CrossRef] 42. Zhou Wang, Alan C. Bovik. 2006. Modern Image Quality Assessment. Synthesis Lectures on Image, Video, and Multimedia Processing 2:1, 1-156. [CrossRef] 43. Dario Floreano , Mototaka Suzuki , Claudio Mattiussi . 2005. Active Vision and Receptive Field Development in Evolutionary RobotsActive Vision and Receptive Field Development in Evolutionary Robots. Evolutionary Computation 13:4, 527-544. [Abstract] [PDF] [PDF Plus] 44. Gidon Felsen, Yang Dan. 2005. A natural approach to studying vision. Nature Neuroscience 8:12, 1643-1646. [CrossRef] 45. Denis Mareschal, Daisy Powell, Gert Westermann, Agnes Volein. 2005. Evidence of rapid correlation-based perceptual category learning by 4-month-olds. Infant and Child Development 14:5, 445-457. [CrossRef] 46. Zhe Chen. 2005. Stochastic correlative firing for figure-ground segregation. Biological Cybernetics 92:3, 192-198. [CrossRef] 47. Kun Guo, Robert G. Robertson, Sasan Mahmoodi, Malcolm P. Young. 2005. Centre-surround interactions in response to natural scene stimulation in the primary visual cortex. European Journal of Neuroscience 21:2, 536-548. [CrossRef] 48. Gidon Felsen, Jon Touryan, Feng Han, Yang Dan. 2005. Cortical Sensitivity to Visual Features in Natural Scenes. PLoS Biology 3:10, e342. [CrossRef] 49. József Fiser, Richard N. Aslin. 2005. Encoding Multielement Scenes: Statistical Learning of Visual Feature Hierarchies. Journal of Experimental Psychology: General 134:4, 521-537. [CrossRef] 50. Hyun-Jin Park, Te-Won Lee. 2005. Unsupervised learning of nonlinear dependencies in natural images. International Journal of Imaging Systems and Technology 15:1, 34-47. [CrossRef] 51. Yizhou Wang, Song-Chun Zhu. 2004. Analysis and synthesis of textured motion: particles and waves. IEEE Transactions on Pattern Analysis and Machine Intelligence 26:10, 1348-1363. [CrossRef] 52. L. Perrinet, M. Samuelides, S. Thorpe. 2004. Coding Static Natural Images Using Spiking Event Times: Do Neurons Cooperate?. IEEE Transactions on Neural Networks 15:5, 1164-1175. [CrossRef]
53. N. Vasconcelos. 2004. Minimum Probability of Error Image Retrieval. IEEE Transactions on Signal Processing 52:8, 2322-2336. [CrossRef] 54. S. Haykin, Z. Chen, S. Becker. 2004. Stochastic Correlative Learning Algorithms. IEEE Transactions on Signal Processing 52:8, 2200-2209. [CrossRef] 55. C. Liu. 2004. Enhanced Independent Component Analysis and Its Application to Content Based Face Image Retrieval. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:2, 1117-1127. [CrossRef] 56. P.-F. Ruedi, P. Heim, F. Kaess, E. Grenet, F. Heitger, P.-Y. Burgi, S. Gyger, P. Nussbaum. 2003. A 128 x 128 pixel 120-db dynamic-range vision-sensor chip for image contrast and orientation extraction. IEEE Journal of Solid-State Circuits 38:12, 2325-2333. [CrossRef] 57. Chengjun Liu, H. Wechsler. 2003. Independent component analysis of gabor features for face recognition. IEEE Transactions on Neural Networks 14:4, 919-928. [CrossRef] 58. Guilherme de A. Barreto , Aluizio F. R. Araújo , Stefan C. Kremer . 2003. A Taxonomy for Spatiotemporal Connectionist Networks Revisited: The Unsupervised CaseA Taxonomy for Spatiotemporal Connectionist Networks Revisited: The Unsupervised Case. Neural Computation 15:6, 1255-1320. [Abstract] [PDF] [PDF Plus] 59. Song-Chun Zhu. 2003. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 25:6, 691-712. [CrossRef] 60. Jarmo Hurri , Aapo Hyvärinen . 2003. Simple-Cell-Like Receptive Fields Maximize Temporal Coherence in Natural VideoSimple-Cell-Like Receptive Fields Maximize Temporal Coherence in Natural Video. Neural Computation 15:3, 663-691. [Abstract] [PDF] [PDF Plus] 61. Kenneth Kreutz-Delgado , Joseph F. Murray , Bhaskar D. Rao , Kjersti Engan , Te-Won Lee , Terrence J. Sejnowski . 2003. Dictionary Learning Algorithms for Sparse RepresentationDictionary Learning Algorithms for Sparse Representation. Neural Computation 15:2, 349-396. [Abstract] [PDF] [PDF Plus] 62. Eizaburo Doi , Toshio Inui , Te-Won Lee , Thomas Wachtler , Terrence J. Sejnowski . 2003. Spatiochromatic Receptive Field Properties Derived from Information-Theoretic Analyses of Cone Mosaic Responses to Natural ScenesSpatiochromatic Receptive Field Properties Derived from Information-Theoretic Analyses of Cone Mosaic Responses to Natural Scenes. Neural Computation 15:2, 397-417. [Abstract] [PDF] [PDF Plus] 63. Xiuwen Liu, Lei Cheng. 2003. Independent spectral representations of images for recognition. Journal of the Optical Society of America A 20:7, 1271. [CrossRef] 64. Brian Potetz, Tai Sing Lee. 2003. Statistical correlations between two-dimensional images and three-dimensional structures in natural scenes. Journal of the Optical Society of America A 20:7, 1292. [CrossRef]
65. Yury Petrov, L. Zhaoping. 2003. Local correlations, information redundancy, and sufficient pixel depth in natural images. Journal of the Optical Society of America A 20:1, 56. [CrossRef] 66. Aapo Hyvärinen, Jarmo Hurri, Jaakko Väyrynen. 2003. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A 20:7, 1237. [CrossRef] 67. M.S. Bartlett, J.R. Movellan, T.J. Sejnowski. 2002. Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13:6, 1450-1464. [CrossRef] 68. M. Plumbley. 2002. Conditions for nonnegative independent component analysis. IEEE Signal Processing Letters 9:6, 177-180. [CrossRef] 69. Denis Mareschal, Scott P. Johnson. 2002. Learning to perceive object unity: a connectionist account. Developmental Science 5:2, 151-172. [CrossRef] 70. Chengjun Liu, H. Wechsler. 2002. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11:4, 467-476. [CrossRef] 71. A. Turiel, A. del Pozo. 2002. Reconstructing images from their most singular fractal manifold. IEEE Transactions on Image Processing 11:4, 345-350. [CrossRef] 72. Norberto M. Grzywacz , Rosario M. Balboa . 2002. A Bayesian Framework for Sensory AdaptationA Bayesian Framework for Sensory Adaptation. Neural Computation 14:3, 543-559. [Abstract] [PDF] [PDF Plus] 73. Dávid Bálya, Botond Roska, Tamás Roska, Frank S. Werblin. 2002. A CNN framework for modeling parallel processing in a mammalian retina. International Journal of Circuit Theory and Applications 30:2-3, 363-393. [CrossRef] 74. W.T. Freeman, T.R. Jones, E.C. Pasztor. 2002. Example-based super-resolution. IEEE Computer Graphics and Applications 22:2, 56-65. [CrossRef] 75. Te-Won Lee, M.S. Lewicki. 2002. Unsupervised image classification, segmentation, and enhancement using ICA mixture models. IEEE Transactions on Image Processing 11:3, 270-279. [CrossRef] 76. M. Barbaro, P.-Y. Burgi, A. Mortara, P. Nussbaum, F. Heitger. 2002. A 100×100 pixel silicon retina for gradient extraction with steering filter capabilities and temporal output coding. IEEE Journal of Solid-State Circuits 37:2, 160-172. [CrossRef] 77. Randall C. O'Reilly . 2001. Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian LearningGeneralization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning. Neural Computation 13:6, 1199-1241. [Abstract] [PDF] [PDF Plus] 78. Rufin Van Rullen , Simon J. Thorpe . 2001. Rate Coding Versus Temporal Order Coding: What the Retinal Ganglion Cells Tell the Visual CortexRate Coding
Versus Temporal Order Coding: What the Retinal Ganglion Cells Tell the Visual Cortex. Neural Computation 13:6, 1255-1283. [Abstract] [PDF] [PDF Plus] 79. Chris J. S. Webber . 2001. Predictions of the Spontaneous Symmetry-Breaking Theory for Visual Code Completeness and Spatial Scaling in Single-Cell Learning RulesPredictions of the Spontaneous Symmetry-Breaking Theory for Visual Code Completeness and Spatial Scaling in Single-Cell Learning Rules. Neural Computation 13:5, 1023-1043. [Abstract] [PDF] [PDF Plus] 80. A. R. Gardner-Medwin , H. B. Barlow . 2001. The Limits of Counting Accuracy in Distributed Neural RepresentationsThe Limits of Counting Accuracy in Distributed Neural Representations. Neural Computation 13:3, 477-504. [Abstract] [PDF] [PDF Plus] 81. Gilles Laurent, Mark Stopfer, Rainer W Friedrich, Misha I Rabinovich, Alexander Volkovskii, Henry DI Abarbanel. 2001. ODOR ENCODING AS AN ACTIVE, DYNAMICAL PROCESS: Experiments, Computation, and Theory. Annual Review of Neuroscience 24:1, 263-297. [CrossRef] 82. Eero P Simoncelli, Bruno A Olshausen. 2001. NATURAL IMAGE STATISTICS AND NEURAL REPRESENTATION. Annual Review of Neuroscience 24:1, 1193-1216. [CrossRef] 83. M. Thomson. 2001. Sensory Coding and the Second Spectra of Natural Signals. Physical Review Letters 86:13, 2901-2904. [CrossRef] 84. Elizabeth Thomas, Marc M. Van Hulle, Rufin Vogel. 2001. Encoding of Categories by Noncategory-Specific Neurons in the Inferior Temporal CortexEncoding of Categories by Noncategory-Specific Neurons in the Inferior Temporal Cortex. Journal of Cognitive Neuroscience 13:2, 190-200. [Abstract] [PDF] [PDF Plus] 85. Ulrich Hillenbrand , J. Leo van Hemmen . 2001. Does Corticothalamic Feedback Control Cortical Velocity Tuning?Does Corticothalamic Feedback Control Cortical Velocity Tuning?. Neural Computation 13:2, 327-355. [Abstract] [PDF] [PDF Plus] 86. Norbert Krüger . 2001. Learning Object Representations Using A Priori Constraints Within ORASSYLLLearning Object Representations Using A Priori Constraints Within ORASSYLL. Neural Computation 13:2, 389-410. [Abstract] [PDF] [PDF Plus] 87. Thomas Wachtler, Te-Won Lee, Terrence J. Sejnowski. 2001. Chromatic structure of natural scenes. Journal of the Optical Society of America A 18:1, 65. [CrossRef] 88. Christoph Zetzsche, Gerhard Krieger. 2001. Nonlinear mechanisms and higher-order statistics in biological vision and electronic image processing: review and perspectives. Journal of Electronic Imaging 10:1, 56. [CrossRef] 89. James A. Bednar , Risto Miikkulainen . 2000. Tilt Aftereffects in a Self-Organizing Model of the Primary Visual CortexTilt Aftereffects in a
Self-Organizing Model of the Primary Visual Cortex. Neural Computation 12:7, 1721-1740. [Abstract] [PDF] [PDF Plus] 90. Rosario M. Balboa , Norberto M. Grzywacz . 2000. The Minimal Local-Asperity Hypothesis of Early Retinal Lateral InhibitionThe Minimal Local-Asperity Hypothesis of Early Retinal Lateral Inhibition. Neural Computation 12:7, 1485-1517. [Abstract] [PDF] [PDF Plus] 91. Aapo Hyvärinen , Patrik Hoyer . 2000. Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature SubspacesEmergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces. Neural Computation 12:7, 1705-1720. [Abstract] [PDF] [PDF Plus] 92. Michael S. Lewicki , Terrence J. Sejnowski . 2000. Learning Overcomplete RepresentationsLearning Overcomplete Representations. Neural Computation 12:2, 337-365. [Abstract] [PDF] [PDF Plus] 93. R.W. Buccigrossi, E.P. Simoncelli. 1999. Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing 8:12, 1688-1701. [CrossRef] 94. Aapo Hyvärinen . 1999. Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood EstimationSparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation. Neural Computation 11:7, 1739-1768. [Abstract] [PDF] [PDF Plus] 95. G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, T.J. Sejnowski. 1999. Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence 21:10, 974-989. [CrossRef] 96. Marco Budinich , Renato Frison . 1999. Adaptive Calibration of Imaging Array DetectorsAdaptive Calibration of Imaging Array Detectors. Neural Computation 11:6, 1281-1296. [Abstract] [PDF] [PDF Plus] 97. Satoshi Maekawa, Hajime Kita, Yoshikazu Nishikawa, Hidefumi Sawai. 1999. Self-organizing formation of receptive fields. Systems and Computers in Japan 30:8, 1-10. [CrossRef] 98. Sepp Hochreiter , Jürgen Schmidhuber . 1999. Feature Extraction Through LOCOCODEFeature Extraction Through LOCOCODE. Neural Computation 11:3, 679-714. [Abstract] [PDF] [PDF Plus] 99. S. F. Cotter, B. D. Rao, K. Kreutz-Delgado, J. Adler. 1999. Forward sequential algorithms for best basis selection. IEE Proceedings - Vision, Image, and Signal Processing 146:5, 235. [CrossRef] 100. Michael S. Lewicki, Bruno A. Olshausen. 1999. Probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A 16:7, 1587. [CrossRef] 101. Mitchell G. A. Thomson. 1999. Higher-order structure in natural scenes. Journal of the Optical Society of America A 16:7, 1549. [CrossRef]
102. Stefan Schaal , Christopher G. Atkeson . 1998. Constructive Incremental Learning from Only Local InformationConstructive Incremental Learning from Only Local Information. Neural Computation 10:8, 2047-2084. [Abstract] [PDF] [PDF Plus] 103. Brian S. Blais , N. Intrator , H. Shouval , Leon N. Cooper . 1998. Receptive Field Formation in Natural Scene Environments: Comparison of Single-Cell Learning RulesReceptive Field Formation in Natural Scene Environments: Comparison of Single-Cell Learning Rules. Neural Computation 10:7, 1797-1813. [Abstract] [PDF] [PDF Plus] 104. Marc M. Van Hulle . 1998. Kernel-Based Equiprobabilistic Topographic Map FormationKernel-Based Equiprobabilistic Topographic Map Formation. Neural Computation 10:7, 1847-1871. [Abstract] [PDF] [PDF Plus] 105. D.L. Donoho, M. Vetterli, R.A. DeVore, I. Daubechies. 1998. Data compression and harmonic analysis. IEEE Transactions on Information Theory 44:6, 2435-2476. [CrossRef] 106. J. Homigo, G. Cristobal. 1998. High resolution spectral analysis of images using the pseudo-Wigner distribution. IEEE Transactions on Signal Processing 46:6, 1757-1763. [CrossRef] 107. Song Chun Zhu , Ying Nian Wu , David Mumford . 1997. Minimax Entropy Principle and Its Application to Texture ModelingMinimax Entropy Principle and Its Application to Texture Modeling. Neural Computation 9:8, 1627-1660. [Abstract] [PDF] [PDF Plus] 108. Juan K. Lin, David G. Grier, Jack D. Cowan. 1997. Faithful Representation of Separable DistributionsFaithful Representation of Separable Distributions. Neural Computation 9:6, 1305-1320. [Abstract] [PDF] [PDF Plus] 109. Reiner Lenz, Mats Österberg, Jouni Hiltunen, Timo Jaaskelainen, Jussi Parkkinen. 1996. Unsupervised filtering of color spectra. Journal of the Optical Society of America A 13:7, 1315. [CrossRef] 110. Jürgen Schmidhuber, Martin Eldracher, Bernhard Foltin. 1996. Semilinear Predictability Minimization Produces Well-Known Feature DetectorsSemilinear Predictability Minimization Produces Well-Known Feature Detectors. Neural Computation 8:4, 773-786. [Abstract] [PDF] [PDF Plus] 111. Christopher W. Lee , Bruno A. Olshausen . 1996. A Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot StereogramsA Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot Stereograms. Neural Computation 8:3, 545-566. [Abstract] [PDF] [PDF Plus] 112. Anthony J. Bell , Terrence J. Sejnowski . 1995. An Information-Maximization Approach to Blind Separation and Blind DeconvolutionAn Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7:6, 1129-1159. [Abstract] [PDF] [PDF Plus]
113. Karen K. De Valois, Russell L. De ValoisHuman Visual System-Spatial Visual Processing . [CrossRef]
Communicated by Terrence Sejnowski
NOTE
Design Principles of Columnar Organization in Visual Cortex

Ernst Niebur
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
Florentin Worgotter
Institut fur Physiologie, Ruhr-Universitat Bochum, D-4630 Bochum, Germany
Visual space is represented by cortical cells in an orderly manner. Little variation in cell behavior is found with changing depth below the cortical surface; that is, all cells in a column with axis perpendicular to the cortical plane have approximately the same properties (Hubel and Wiesel 1962, 1963, 1968). Therefore, the multiple features of the visual space (e.g., position in visual space, preferred orientation, and orientation tuning strength) are mapped onto a two-dimensional space, the cortical plane. Such a dimension reduction leads to complex maps (Durbin and Mitchison 1990) that so far have evaded an intuitive understanding. Analyzing optical imaging data (Blasdel 1992a,b; Blasdel and Salama 1986; Grinvald et al. 1986) with a theoretical approach, we will show that the most salient features of these maps can be understood from a few basic design principles: local correlation, modularity, isotropy, and homogeneity. These principles can be defined in a mathematically exact sense in the Fourier domain by a rather simple annulus-like spectral structure. Many of the models that have been developed to explain the mapping of the preferred orientations (Cooper et al. 1979; Legendy 1978; Linsker 1986a,b; Miller 1992; Nass and Cooper 1975; Obermayer et al. 1990, 1992; Soodak 1987; Swindale 1982, 1985, 1992; von der Malsburg 1973; von der Malsburg and Cowan 1982) are quite successful in generating maps that are close to experimental maps. We suggest that this success is due to these principles, which are common properties of the models and of biological maps.
Recently it became possible to extract features of cortical cell behavior using optical imaging techniques (Blasdel 1992a,b; Blasdel and Salama 1986; Grinvald et al. 1986; Frostig et al. 1990; Ts'o et al. 1990; Bonhoeffer and Grinvald 1991). A map of the preferred orientations in the visual cortex of monkey obtained this way is shown in Figure 2a (Blasdel 1992b).
One of the obvious features of the map shown is its periodicity (Hubel and Wiesel 1968; Albus 1975): for most points, the preferred orientation is repeated at a certain distance, which we call Λ. Generally, cells with all preferred orientations are found in an area of linear dimension Λ around any point, and visual space appears to be mapped in a repeating pattern onto the cortex. A region in which "all" features (e.g., preferred orientation, ocular dominance, color, velocity, etc.) of visual space are represented at least once is called a hypercolumn (Hubel and Wiesel 1968). Hypercolumns seem to be arranged roughly following a "module concept" (Szentagothai 1975) in which adjacent x,y-locations in the visual field are projected onto adjacent hypercolumns in the cortex. This leads to a complete representation of all features of one location in visual space in a locally confined cortical module, while adjacent locations are represented in adjacent modules. There are many ways to achieve a modular organization. It is neither required to arrange the features periodically with one predominant frequency, nor is it necessary to have an orderly arrangement within the individual modules; it is only necessary that each module contain at least one feature detector of each characteristic. Therefore, periodicity does not seem to be an a priori concept of cortical design but rather a derived quantity. What then are the basic cortical design principles that, together with modularity, engender the observed periodicity of the maps? We propose that these principles are (positive) local correlation, homogeneity, and isotropy, all with respect to a length scale Λ, which is the only parameter in our framework. Let us introduce a coordinate system with coordinates¹ x = (x₁, x₂) in the cortical plane and let φ(x) be the preferred orientation of the column at location x. Following Swindale et al. (1987) we define the complex orientation angle f by

\[ f(\mathbf{x}) = e^{2i\phi(\mathbf{x})}. \tag{1.1} \]
We introduced a factor 2 because the angles of preferred orientation take on values only between 0° and 180° (not 360°), and two angles differing by 180° are equivalent [see Swindale (1982) for details]. The preferred orientation at point x is then represented by the complex number f, which can be interpreted as a vector of unit length in the complex plane. The correlation between the preferred orientations at two points is given by the scalar product of the respective vectors at these points. When we define the complex two-point autocorrelation function of f(x) as

\[ C(\mathbf{y}) = \left\langle f(\mathbf{x})\, \overline{f(\mathbf{x}+\mathbf{y})} \right\rangle_{\mathbf{x}} \tag{1.2} \]
(see, e.g., Champeney 1973), it is seen that the real part of C(y) corresponds to the mentioned scalar product,

\[ \mathrm{Re}\, C(\mathbf{y}) = \left\langle \cos 2\left[\phi(\mathbf{x}) - \phi(\mathbf{x}+\mathbf{y})\right] \right\rangle_{\mathbf{x}}. \tag{1.3} \]
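The step from equation 1.2 to equation 1.3, and the role of the factor 2, can be made explicit in one line (a worked check added here, not present in the original text):

\[
f(\mathbf{x})\,\overline{f(\mathbf{x}+\mathbf{y})} = e^{2i[\phi(\mathbf{x})-\phi(\mathbf{x}+\mathbf{y})]},
\qquad
e^{2i(\phi+\pi)} = e^{2i\phi},
\]

so the real part of the product is the cosine of twice the orientation difference, which is exactly the scalar product of the two unit vectors, and two orientations differing by 180° receive the same representation.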
¹We use bold characters to designate vectors.
Isotropy implies the absence of systematic differences with respect to direction in the cortical plane. Therefore, φ(x) − φ(x + y) depends only on r² = y₁² + y₂², and the autocorrelation function is circularly symmetric. We may therefore write

\[ C(\mathbf{y}) = C(r). \tag{1.4} \]

We define a map as being locally correlated if, on average, variations of the preferred orientation on a length scale much smaller than Λ are significantly smaller than those on a length scale Λ. The autocorrelation function then has high values at small distances (r ≪ Λ). A system is homogeneous if all its locations are equivalent, that is, if no systematic differences can be observed between different locations. Such a system is devoid of long-range correlations and therefore has a vanishing correlation function at long distances (r ≫ Λ). Having expressed homogeneity, local correlation, and isotropy in terms of properties of the autocorrelation function, what are the consequences of modularity?² Within every module of linear dimension ≈ Λ, all preferred orientations are represented. Close to a given point (r ≪ Λ), orientations similar to that at the point itself preponderate. Since modularity requires that within the distance Λ from this point (on average) all orientations occur with comparable frequencies, orientations other than the one at the given point have to occur with above-average frequency between the central peak and r = Λ. From equation 1.3 and the lines following it, it is then seen that the autocorrelation function is characterized by a central positive peak, a surrounding negative "valley," and zero values at large distances. A simple model for such a function is a difference of gaussians,
\[ C(r) = e^{-(4r/\Lambda)^2} - \frac{1}{4}\, e^{-(2r/\Lambda)^2} \tag{1.5} \]
(the imaginary part of C vanishes identically in our model). This function, shown in Figure 1a, has a peak of width ≈ Λ/4 around r = 0, a minimum at r ≈ Λ/2, and decays to zero for r → ∞. According to the Wiener-Khintchine theorem (Champeney 1973), the power spectrum of the map is obtained from the Fourier transform of its autocorrelation function as \( P(\mathbf{k}) = \int e^{-i\mathbf{k}\cdot\mathbf{x}}\, C(\mathbf{x})\, d\mathbf{x} \). From equation 1.5, we therefore obtain

\[ P(k) = \frac{\pi\Lambda^2}{16}\left[ e^{-(\Lambda k/8)^2} - e^{-(\Lambda k/4)^2} \right], \tag{1.6} \]

where k² = k₁² + k₂². This spectrum, shown in Figure 1b, has a large amplitude on an annulus of radius ≈ 2π/Λ and small or vanishing amplitudes elsewhere.

²Modularity here is understood in a restricted sense, that is, we suppose modularity only with respect to the distribution of preferred orientation. Swindale (1990) has given arguments against the realization of a strong form of modularity in cortex, in the sense that one module contains all represented features; see also Bartfeld and Grinvald (1992). These arguments do not exclude modularity in the more restricted sense used here.
Figure 1: (a) Radial component of the model autocorrelation function (equation 1.5). The function is circularly symmetric around the origin. The horizontal axis is in units of Λ. (b) Radial component of the power spectrum corresponding to the autocorrelation function shown in (a), as computed in equation 1.6. The horizontal axis in (b) is in units of 2π/Λ. Because the power spectrum is the Fourier transform of a function with circular symmetry, it is also circularly symmetric, and the function shown therefore corresponds to an annulus around the origin.
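For readers who wish to verify the reconstructed forms of equations 1.5 and 1.6, the following short Python check (our addition, not part of the original article; Λ is set to 1) confirms the stated minimum of C near Λ/2 and a spectral peak at a radius of the order of 2π/Λ:

import numpy as np

# C(r) should dip near r = Lambda/2 and P(k) should peak near k = 2*pi/Lambda.
r = np.linspace(0.0, 2.0, 2001)
C = np.exp(-(4.0 * r) ** 2) - 0.25 * np.exp(-(2.0 * r) ** 2)              # equation 1.5
k = np.linspace(0.01, 20.0, 2000)
P = (np.pi / 16.0) * (np.exp(-(k / 8.0) ** 2) - np.exp(-(k / 4.0) ** 2))  # equation 1.6
print(r[np.argmin(C)])   # ~0.48, close to Lambda/2
print(k[np.argmax(P)])   # ~5.4, of the order of 2*pi/Lambda ~ 6.3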
Similar power spectra have been observed experimentally for orientation column structures of monkeys (Obermayer et al. 1991, 1992). Local correlation leads to the absence of spectral components for frequencies much larger than the radius of the annulus (i.e., for spatial frequencies ≫ Λ⁻¹). Significant nonzero components at high spatial frequencies would lead to short-range variations in the map, which are generically not observed. Such variations do occur at isolated points, the so-called singularities (Swindale et al. 1987), but these singularities are necessary for topological reasons (they correspond to common zeroes of the numerator and denominator in equation 1.8, and only in very unusual cases are such common zeroes absent), and they are not caused by high frequencies in the spectrum of f. This can be seen clearly from the singularities in maps generated from Fourier spectra with vanishing high-frequency components (e.g., Fig. 2D). Homogeneity leads to vanishing amplitudes of the low spatial frequency components inside the annulus, because nonzero components at low spatial frequencies would lead to systematic differences (inhomogeneities) between distant hypercolumns, which are not
observed. Isotropy is reflected in the spectrum by the fact that the statistical distribution of the nonzero components is the same in all directions around the origin. Local correlation and homogeneity (i.e., missing high- and low-frequency components) thus lead to a bandpass characteristic and consequently to a periodicity with only one predominant frequency.

In the preceding, we have shown that an annular spectrum is necessary for modular orientation maps that are homogeneous, isotropic, and locally correlated. Is an annulus spectrum also sufficient to produce realistic column structures? To test whether this is the case, or whether more information is hidden in the details of the amplitudes or phases of the spectra of orientation column structures, we performed an inverse Fourier transform of simple annulus spectra (Fig. 2C), which have zero amplitudes everywhere except on an annulus of radius ≈ 2π/Λ. On this annulus, the amplitude has random values, and the phases were either all set to zero or set to random values (with similar results).
Figure 2: Facing page. Analysis of an observed orientation preference map of monkey (A,B) and synthesis of a model map (C,D). (A) Map of preferred orientations in area 18 of the macaque monkey measured with optical imaging by Blasdel (1992b). Color circle for the preferred orientations: 0° = dark blue, 22.5° = purple, 45° = red, 67.5° = orange, 90° = yellow, 112.5° = green, 135° = light blue, 157.5° = sky blue, 180° = dark blue. (B) Power spectrum obtained from (A). Except for a DC component (orange pixel in the center of the annulus) that we attribute to a bias caused by the finite size of the map, the spectrum has significant power only on an annulus of a radius given by the inverse period of the preferred orientation in (A). (C) Model power spectrum. The power vanishes everywhere except on an annulus, where it takes on random values (uniform distribution). (D) Model map of preferred orientations, color coding as in (A). Defining F(k_x, k_y) as the Fourier transform of the complex angle f (see equation 1.1),

\[ F(k_x, k_y) = \int e^{-i(k_x x + k_y y)}\, f(x,y)\, dx\, dy, \tag{1.7} \]

and using tan 2φ = Im[f]/Re[f], we can trivially compute the preferred orientations φ(x,y) as follows:

\[ \phi(x,y) = \frac{1}{2}\arctan\frac{\mathrm{Im}\left[\mathrm{IFT}\, F(k_x,k_y)\right]}{\mathrm{Re}\left[\mathrm{IFT}\, F(k_x,k_y)\right]}. \tag{1.8} \]

In this equation, Re and Im denote the real and imaginary parts of their arguments, and IFT denotes the inverse Fourier transform (i.e., the inverse of equation 1.7). In order to show that the amplitude information presented in (C) is sufficient to generate a realistic-looking map, we replace F(k_x, k_y) in equation 1.8 by the square root of the power spectrum in (C), that is, we set all phases identically to zero (similar results were obtained by choosing random phases). The organization of the resulting map, shown partially in (D), is very similar to that of experimentally found maps (e.g., in A).
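To make the synthesis procedure concrete, the following Python sketch (our illustration, not code from the article; the grid size, wavelength, annulus width, and random seed are assumed values) builds a random "rectangular" annulus spectrum, applies the inverse transform of equation 1.7, and recovers the preferred orientations via equation 1.8:

import numpy as np

def synthesize_orientation_map(n=256, wavelength=32.0, rel_width=0.2, seed=0):
    # Annulus radius ~ 2*pi/Lambda, with Lambda = `wavelength` in pixels;
    # `rel_width` sets the relative thickness of the annulus.
    rng = np.random.default_rng(seed)
    freqs = 2.0 * np.pi * np.fft.fftfreq(n)
    kx, ky = np.meshgrid(freqs, freqs)
    radius = np.hypot(kx, ky)
    ring = 2.0 * np.pi / wavelength
    # Random amplitudes on the annulus, zero phases (as in Fig. 2C);
    # the text reports similar results with random phases.
    F = np.where(np.abs(radius - ring) < rel_width * ring,
                 rng.uniform(0.0, 1.0, (n, n)), 0.0)
    f = np.fft.ifft2(F)                     # inverse of equation 1.7
    phi = 0.5 * np.arctan2(f.imag, f.real)  # equation 1.8
    return np.degrees(phi) % 180.0          # preferred orientation in degrees

Rendering the returned array with a cyclic color map yields iso-orientation domains and pinwheel singularities qualitatively like those described for Figure 2D.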
In Figure 2D, we show the cortical map obtained from this spectrum by a standard procedure (Swindale 1985); it shows a striking resemblance to experimentally observed maps. Furthermore, we found that this scheme is very robust: we obtain realistic-looking maps for a wide range of variations in amplitudes, phases, and the width of the annulus, which can be varied by about a factor of 10 without significant disturbance of the maps. We also varied the form of the annular probability distribution and found that artificial maps generated with an annulus constructed as a difference of gaussians (as in equation 1.6) are realistic, as are annuli generated from a gaussian distribution whose mean is the annulus radius, and "rectangular" annuli with sharp borders such as the one shown in Figure 2C. This robustness might explain why so many different developmental models are capable of producing "good-looking" maps: this is expected as long as their "products" (the maps they generate) are consistent with the basic properties of local correlation, homogeneity, modularity, and isotropy. Note, however, that in this study we are not concerned with the properties of developmental models but only with the properties of maps.

Nevertheless, care has to be taken to avoid the introduction of artifacts in this procedure. Rojer and Schwartz (1990) obtained orientation columns by bandpass-filtering two-dimensional noise. This procedure is mathematically equivalent to our Fourier transformation of a noisy annulus. Their method differs, however, from ours in the next step. Since preferred orientation is a vectorial quantity, they differentiated the output of the bandpass filter and interpreted the resulting gradient as a vector field representing the map of preferred orientations. The obtained maps share many properties with measured orientation column maps: cells with similar orientation preference are clustered together, and the maps have singularities as well as fractures. Closer inspection (Erwin et al. 1993) reveals, however, that the maps generated in this way differ from experimentally observed data. For instance, certain types of physiologically quite frequently observed singularities can never occur in these maps. While loop singularities ("hairpin bend" shaped) with the opening to the left or right can occur, the same singularity cannot be obtained if it is turned by 90°, for example, if it has the opening to the top. Erwin et al. (1993) show that this deficiency is due to the fact that a gradient field is conservative, which limits the class of patterns that can be generated when using gradient fields. The described singularity (loop open at the top) would require a vector field with nonvanishing curl; such a field is not conservative and can therefore never be obtained as the gradient of a scalar field. Experimentally observed orientation maps do not have this restriction, and neither do the maps generated from annulus spectra by the procedure introduced by Swindale (1982) and used in this work (see caption of Fig. 2D).
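The conservativeness argument can be stated compactly (a standard identity, spelled out here for clarity): if the vector field is a gradient, \(\mathbf{v} = \nabla g\) for a twice-differentiable scalar field g, then its two-dimensional curl vanishes,

\[
\frac{\partial v_2}{\partial x_1} - \frac{\partial v_1}{\partial x_2}
= \frac{\partial^2 g}{\partial x_1\, \partial x_2} - \frac{\partial^2 g}{\partial x_2\, \partial x_1} = 0,
\]

so by Green's theorem the circulation of \(\mathbf{v}\) around any closed loop is zero; a singularity that forces a nonzero circulation can therefore never appear in a map defined as a gradient field.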
Unfortunately, no complete scheme has yet been found for a quantitative characterization of cortical orientation maps. A first attempt toward the development of such a "fingerprint" has been made by Obermayer et al. (1992), who compared orientation and ocular dominance maps obtained from optical imaging data with maps generated from self-organizing feature maps. We have applied their methods to show that our simple model achieves quantitative agreement with experimental data. The power spectra of experimental and artificial maps are shown in Figure 3a, the circular autocorrelation functions of the maps in Figure 3b, and the distributions of the orientations of the maps in Figure 3c. By all three³ measures, the model maps generated with our simple procedure are strikingly similar to experimentally obtained maps. Furthermore, we counted the density of singularities in the measured map and in the artificial map and found very similar values (3.3 singularities per squared hypercolumn length Λ² in the real map and 3.4 in the artificial map).

³Obermayer et al. (1992) introduced a fourth measure that characterizes the interaction between orientation columns and ocular dominance columns. Since we do not model binocularity, this is not applicable to our maps.

Homogeneity, local correlation, and isotropy are properties of natural images (Field 1987), that is, properties of the input to the visual system, and it might be advantageous for an information-processing system if its structure reflects the properties of the input signals. For instance, if the incoming signals are, on average, homogeneous (no systematic variation across the visual field), similar signals have to be treated in all parts of the cortical representation of the visual field. It is therefore plausible that similar structures are to be found in different parts of the topographic map. Two remarks are in order here: (1) Neither the visual input (Switkes et al. 1978) nor the human visual system (Mitchell et al. 1967) is completely isotropic, and similar statements are probably true for homogeneity and local correlation. Our results should rather be taken as a general framework for leading-order effects than as a detailed model of particular features. (2) We neglect distortions by the "complex-logarithm transformation" of visual images, which emphasizes the foveal region with respect to the periphery (Schwartz 1977). Independent of the properties of the visual input, one expects homogeneity to be a useful feature in any parallel system, since it allows one module to be replicated many times for parallel information processing. Local correlation is found in all cortical areas, reflecting the tendency of neurons to work in an environment in which they are surrounded by other neurons whose properties vary in a smooth, orderly manner (Legendy 1978). There seem to be less compelling reasons for strict isotropy, apart from conceptual and developmental simplicity, and indeed this property is not always found in perfect form. Ocular dominance columns in monkey and orientation columns in cat are better described by an anisotropic spectrum (Obermayer et al. 1991; Rojer and Schwartz 1990). It is possible
that interactions between sensory features (like ocularity and orientation) induce corresponding interactions between the feature maps, which allow only one of these maps to be described by these simple principles.
Figure 3: Facing page. Statistical analysis of the real and artificial maps and spectra shown in Figure 2. (a) Normalized distribution of the energy of the spectra in Figure 2B,C as a function of the radial spatial frequency. The DC component (i.e., at radius = 0) is omitted. The solid line belongs to the spectrum obtained from the experimentally measured map. The dotted line is computed directly from the annulus, while the dashed line results from Fourier analysis of the artificial map (Fig. 2D). Note the significant background "noise" in the spectra, which is introduced by the numerical Fourier analysis. Due to the finite size of the map, most frequencies are not integer multiples of the map size and therefore do not correspond to sharp peaks in the spectrum but to rather broad structures. This background is also present in the spectrum of the experimentally determined map and highlights the question of the limits of resolution of Fourier methods applied to cortical maps. (b) Circular autocorrelation as defined in the text for the maps in Figure 2A,D, computed and averaged at 500 randomly chosen map locations (solid line, experimental map; dashed line, artificial map). (c) Distribution of the preferred orientations for 18,000 randomly chosen map locations (solid line, experimental map; dashed line, artificial map).
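As an illustration of the measure in (b), the circular autocorrelation can be estimated from an orientation map as sketched below (our reading of the caption, not the authors' code; the published estimator may differ in details, and the periodic boundary treatment is an assumption):

import numpy as np

def circular_autocorrelation(phi_deg, radii, n_centers=500, n_angles=16, seed=0):
    # phi_deg: square orientation map in degrees. For each radius r, average
    # cos 2[phi(x) - phi(x + r e(theta))] over random centers x and equally
    # spaced directions theta, using periodic wraparound at the map edges.
    rng = np.random.default_rng(seed)
    phi = np.radians(phi_deg)
    n = phi.shape[0]
    centers = rng.integers(0, n, size=(n_centers, 2))
    thetas = 2.0 * np.pi * np.arange(n_angles) / n_angles
    result = []
    for r in radii:
        dx = np.rint(r * np.cos(thetas)).astype(int)
        dy = np.rint(r * np.sin(thetas)).astype(int)
        vals = [np.cos(2.0 * (phi[cx, cy] - phi[(cx + a) % n, (cy + b) % n]))
                for cx, cy in centers
                for a, b in zip(dx, dy)]
        result.append(np.mean(vals))
    return np.array(result)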
In previous reports we have shown that rather unspecific isotropic connections, similar to those observed in cortex (Bonds 1989), can produce complex, anisotropic behavior in cortical cells (Niebur and Worgotter 1990; Worgotter et al. 1992). The circular connections we used there were embedded in the cortical column structure, which was the major topic of the current study. Here we conclude that the complicated-looking column system might be based on only a few design principles and that these very simple principles are sufficient to explain the essential features of the column system. It is certainly an oversimplification to neglect major anatomical details, but it appears that the combination of rather unspecific connections on an unspecifically designed column system could already explain the robustness of the cortical network during development and while suffering from damage. Very little structural information could be at the basis of highly complex performance.
Acknowledgments

We thank Gary Blasdel for sharing his data with us prior to publication and Ed Erwin, Ken Miller, and Klaus Obermayer for helpful discussions. E. N. is supported by the Office of Naval Research and F. W. by the Deutsche Forschungsgemeinschaft. We also acknowledge support by the Air Force Office of Scientific Research, the James S. McDonnell Foundation, and an NSF Presidential Young Investigator Award to Christof Koch.
References
Albus, K. 1975. A quantitative study of the projection area of the central and the paracentral visual field in area 17 of the cat. II: The spatial organization of the orientation domain. Exp. Brain Res. 24, 181-202.
Bartfeld, E., and Grinvald, A. 1992. Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular-dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A. 89, 11905-11909.
Blasdel, G. G. 1992a. Differential imaging of ocular dominance and orientation selectivity in monkey striate cortex. J. Neurosci. 12, 3115-3138.
Blasdel, G. G. 1992b. Orientation selectivity, preference and continuity in monkey striate cortex. J. Neurosci. 12, 3139-3161.
Blasdel, G. G., and Salama, G. 1986. Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature (London) 321, 579-585.
Bonds, A. B. 1989. Role of inhibition in the specification of orientation selectivity of cells in the cat striate cortex. Visual Neurosci. 2, 41-55.
Bonhoeffer, T., and Grinvald, A. 1991. Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature (London) 353, 429-431.
Champeney, D. C. 1973. Fourier Transforms and Their Physical Applications. Academic Press, London.
Cooper, L. N., Liberman, F., and Oja, E. 1979. A theory for the acquisition and loss of neuron specificity in visual cortex. Biol. Cybern. 33, 9-28.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for understanding cortical maps. Nature (London) 343, 644-647.
Erwin, E., Obermayer, K., and Schulten, K. 1993. A comparison of models of visual cortical map formation. In Computation and Neural Systems, F. Eeckman and J. Bower, eds., pp. 395-402. Kluwer, Boston.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4(12), 2379-2394.
Frostig, R. D., Lieke, E., Ts'o, D. Y., and Grinvald, A. 1990. Cortical functional architecture and local coupling between neuronal activity and the microcirculation revealed by in vivo high-resolution optical imaging of intrinsic signals. Proc. Natl. Acad. Sci. U.S.A. 87, 6082-6086.
Grinvald, A., Lieke, E., Frostig, R. D., Gilbert, C. D., and Wiesel, T. N. 1986. Functional architecture of cortex revealed by optical imaging of intrinsic signals. Nature (London) 324, 361-364.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.
Hubel, D. H., and Wiesel, T. N. 1963. Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. J. Neurophysiol. 26, 994-1002.
Hubel, D. H., and Wiesel, T. N. 1968. Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195, 215-243.
Legendy, C. R. 1978. Cortical columns and the tendency of neighboring neurons to act similarly. Brain Res. 158, 89-105.
Linsker, R. 1986a. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Linsker, R. 1986b. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.
Miller, K. D. 1992. Development of orientation columns via competition between ON- and OFF-center inputs. NeuroReport 3, 73-76.
Mitchell, D. E., Freeman, R. D., and Westheimer, G. 1967. Effect of orientation on the modulation sensitivity for interference fringes on the retina. J. Opt. Soc. Am. 57, 246-249.
Nass, M. M., and Cooper, L. N. 1975. A theory for the development of feature detecting cells in visual cortex. Biol. Cybern. 19, 1-18.
Niebur, E., and Worgotter, F. 1990. Circular inhibition: A new concept in long-range interactions in the mammalian visual cortex. In Proc. of the International Joint Conference on Neural Networks, San Diego, pp. II-367-II-372. IEEE, Piscataway, NJ.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. U.S.A. 87, 8345-8349.
Obermayer, K., Blasdel, G. G., and Schulten, K. 1991. A neural network model for the formation and for the spatial structure of retinotopic maps, orientation- and ocular dominance columns. ICANN-91, Helsinki.
Obermayer, K., Blasdel, G. G., and Schulten, K. 1992. A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45, 7568-7589.
Rojer, A. S., and Schwartz, E. L. 1990. Cat and monkey cortical columnar patterns modeled by bandpass-filtered 2D white noise. Biol. Cybern. 62, 381-391.
Schwartz, E. L. 1977. Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception. Biol. Cybern. 25, 181-194.
Soodak, R. E. 1987. The retinal ganglion cell mosaic defines orientation columns in striate cortex. Proc. Natl. Acad. Sci. U.S.A. 84, 3936-3940.
Swindale, N. V. 1982. A model for the formation of orientation columns. Proc. R. Soc. London B 215, 211-230.
Swindale, N. V. 1985. Iso-orientation domains and their relationship with cytochrome oxidase patches. In Models of the Visual Cortex, D. Rose and V. G. Dobson, eds., pp. 452-461. Wiley, New York.
Swindale, N. V. 1990. Is the cerebral cortex modular? Trends Neurosci. 13(12), 487-492.
Swindale, N. V. 1992. A model for the coordinated development of columnar systems in primate striate cortex. Biol. Cybern. 66, 217-230.
Swindale, N. V., Matsubara, J. A., and Cynader, M. S. 1987. Surface organization of orientation and direction selectivity in cat area 18. J. Neurosci. 7, 1414-1427.
Switkes, E., Mayer, M. J., and Sloan, J. A. 1978. Spatial frequency analysis of the visual environment: Anisotropy and the carpentered environment hypothesis. Vision Res. 18, 1393-1399.
Szentagothai, J. 1975. The module-concept in cerebral cortex architecture. Brain Res. 95, 475-496.
Ts'o, D. Y., Frostig, R. D., Lieke, E., and Grinvald, A. 1990. Functional organization of primate visual cortex revealed by high resolution optical imaging of intrinsic signals. Science 249, 417-423.
von der Malsburg, C. 1973. Self-organization of orientation selective cells in the striate cortex. Kybernetik 14, 85-100.
von der Malsburg, C., and Cowan, J. D. 1982. Outline of a theory for the ontogenesis of iso-orientation domains in visual cortex. Biol. Cybern. 45, 49-56.
Worgotter, F., Niebur, E., and Koch, C. 1992. Generation of direction selectivity by isotropic intracortical connections. Neural Comp. 4(2), 332-340.
Received April 3, 1992; accepted November 4, 1993.
Communicated by Graeme Mitchison
Elastic Net Model of Ocular Dominance: Overall Stripe Pattern and Monocular Deprivation

Geoffrey J. Goodhill
David J. Willshaw
Centre for Cognitive Science, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
The elastic net (Durbin and Willshaw 1987) can account for the development of both topography and ocular dominance in the mapping from the lateral geniculate nucleus to primary visual cortex (Goodhill and Willshaw 1990). Here it is further shown for this model that (1) the overall pattern of stripes produced is strongly influenced by the shape of the cortex: in particular, stripes with a global order similar to that seen biologically can be produced under appropriate conditions, and (2) the observed changes in stripe width associated with monocular deprivation are reproduced in the model.

1 Introduction

Two well-documented phenomena associated with ocular dominance stripe formation in primary visual cortex are as follows. First, for the macaque monkey there is a global order to the stripe pattern (LeVay et al. 1985). Second, the relative widths of left- and right-eye stripes can change following monocular deprivation in the cat (Shatz and Stryker 1978) and in the macaque (Hubel et al. 1977). In this paper, empirical results are presented for the elastic net model of ocular dominance, showing that it can reproduce both these phenomena under appropriate conditions.

The elastic net is an algorithm for finding neighborhood-preserving mappings between spaces of different dimensionalities (Durbin and Willshaw 1987). It was originally developed from a model for retinotopic map formation (Willshaw and von der Malsburg 1979), and it has been applied to the problem of finding mappings that are both striped and topographic, such as the ocular dominance map (Goodhill and Willshaw 1990). The algorithm finds a mapping between a "feature" space and a "cortical" space. For the topography and ocular dominance problem, the feature space consists of all positions in both eyes, and here we refer to these feature points as LGN (lateral geniculate nucleus) units. Distances between LGN units in the space encode the "similarity" of units to each other, such that similar units lie close to each other. Similarity could be
interpreted as the degree to which the activity of units is correlated in the feature space (Yuille et al. 1991). Each LGN is represented as a two-dimensional sheet of points, and the two sheets lie atop one another separated by a small gap. The third dimension represents ocularity: distances in the ocularity dimension represent similarities according to the same metric as in the other two dimensions. The images of cortical points under the mapping are envisaged as an elastic sheet moving in the feature space: these points are referred to as cortical units. For a fuller discussion of this formulation of the problem see Goodhill (1991), Yuille et al. (1991). Refer to the positions of LGN units as $x_i$, and cortical units as $y_j$. The change in the position $\Delta y_j$ of each cortical unit at each time step is given by

$$\Delta y_j = \alpha \sum_i w_{ij} (x_i - y_j) + \beta k \sum_{j' \in N(j)} (y_{j'} - y_j) \tag{1.1}$$

The first term is a matching term that represents the "pull" of LGN units for cortical units, which is traded off with ratio $\alpha/\beta k$ against a regularization term representing a "tension" in the sheet, that is, a desire for neighboring cortical units to represent neighboring points in the feature space. $N(j)$ refers to the set of points in the sheet that are neighboring to $j$. The "weights" $w_{ij}$ are defined as follows:

$$w_{ij} = \frac{\Phi(|x_i - y_j|, k)}{\sum_{j'} \Phi(|x_i - y_{j'}|, k)}$$

where

$$\Phi(d, k) = e^{-d^2/2k^2}$$
Over the course of a simulation, the scale parameter $k$ is gradually reduced, so that the matching term comes to dominate the regularization term. These equations can be interpreted as saying that each cortical unit has a gaussian receptive field at position $y_j$ in the feature space (Durbin and Mitchison 1990). The amount by which cortical unit $j$ responds to input $i$ at position $x_i$ is given by $w_{ij}$. The normalization of $w_{ij}$ by the response of all other cortical units to input $i$ implements a form of soft competition between cortical units. Although all cortical units are adapted toward input $i$, those that respond most strongly are adapted the most. The first term can therefore be seen as Hebbian. The second term says that cortical units are also adapted toward inputs that their neighbors respond to. It was shown in Goodhill and Willshaw (1990) that stripe width is controlled by the ratio of the separation of corresponding units between the LGN sheets to the separation of neighboring units within each LGN sheet. A simple analysis in one dimension suggests that stripe width increases linearly with this ratio (Goodhill 1991). A much deeper analysis
of stripe width in the elastic net model can be found in Dayan (1993), where the relative influences of the input similarities and the cortical interaction function are determined. Interpreting similarity as correlation, stripe width for the elastic net increases as the degree of correlation between the two eyes decreases. Similar behavior is found in the competitive model of Goodhill (1993), where the prediction was made that stripes should thus be wider in strabismic cats than in normal cats. This prediction has recently been confirmed experimentally (Löwel and Singer 1993).
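To make the update concrete, here is a minimal NumPy sketch of equation 1.1 and the weight normalization. This is our illustration, not the authors' code; the neighbor structure is an assumption, and the parameter values ($\alpha$ = 0.2, $\beta$ = 2.0, $k$ = 0.2, taken from the Figure 1 caption) are only illustrative.

```python
import numpy as np

def elastic_net_step(x, y, neighbors, alpha=0.2, beta=2.0, k=0.2):
    """One update of equation 1.1; x holds LGN positions, y cortical positions.

    x : (n_lgn, d) array, y : (n_cortex, d) array,
    neighbors : list of index arrays, neighbors[j] = N(j).
    """
    # Gaussian responses Phi(|x_i - y_j|, k) = exp(-d^2 / 2k^2)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)     # (n_lgn, n_cortex)
    phi = np.exp(-d2 / (2.0 * k * k))
    # Soft competition: normalize each input's response over all cortical units
    w = phi / phi.sum(axis=1, keepdims=True)
    # Matching term: pull of LGN points on cortical units, weighted by w_ij
    match = (w[:, :, None] * (x[:, None, :] - y[None, :, :])).sum(axis=0)
    # Regularization ("tension") term: pull of each unit toward its sheet neighbors
    tension = np.stack([(y[nbrs] - y[j]).sum(axis=0)
                        for j, nbrs in enumerate(neighbors)])
    return y + alpha * match + beta * k * tension
```

Over a run, $k$ would be annealed (multiplied by 0.99 at each iteration in Figure 1), so that the matching term gradually comes to dominate the tension term.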
2 Overall Stripe Pattern
Naturally occurring ocular dominance stripes in the monkey exhibit global order: stripes remain roughly parallel over large distances along the dorsal-ventral axis. This result might not be expected from the operation of the purely local mechanisms that are present in most models. LeVay et al. (1985) provide a review of theoretical ideas regarding possible sources of an anisotropic ordering influence, and hence how such global order might arise. Such ideas include elongation of geniculocortical arborizations (von der Malsburg 1979), anisotropic growth of the cortex (Swindale 1980), and differences in the strength of two orthogonal gradients of adhesiveness (Fraser 1985). LeVay et al. (1985) suggest that the effect could be due to the shape of primary visual cortex as compared to the shape of the LGNs. The basic geometry of the mapping in the macaque is that of two roughly circular LGN layers projecting to a roughly elliptical region with a ratio of major to minor axes of 2:1. Assume that the map is interdigitated into parallel stripes. Then, in order to fit both of the circular LGN regions into the elliptical region, each circular region will be less "stretched" if stripes are formed running parallel to the short axis of the ellipse, compared to the case of stripes running parallel to the long axis. This hypothesis was tested for the elastic net algorithm by comparing the mapping formed between two disks of LGN units and a sheet of cortical units that is (1) a disk and (2) an elliptical region as described above. The dimensions of the cortex were chosen such that the number of units was about the same as the total in both LGNs. Results are shown in Figure 1. It can be seen that the shape of the cortex exerts a strong influence on the overall stripe pattern: for a circular boundary there is no preferred stripe orientation, whereas for an elliptical boundary stripes are indeed aligned with the short axis, as seen biologically. This boundary effect could in general provide another test of the adequacy of models for ocular dominance, for instance those of Miller et al. (1989) and Obermayer et al. (1992). It would not be expected to occur in models where there is no consideration of topography [such as Swindale (1980)], since
the effect relies on the undesirability of stretching, that is, deformations in topography in certain directions. The cortical shape effect hypothesized by LeVay et al. (1985) has also recently been demonstrated in a different computational scheme by Jones et al. (1991). They defined a particular cost function measuring topographic distortion, and then exactly minimized this for the mapping between two LGNs and an elliptical cortex. However, their result was obtained by a brute-force minimization of the cost function, without a particular biological mechanism in mind.
3 Monocular Deprivation
If one eye is occluded or sewn shut during the critical period for ocular dominance development in the cat or monkey, it is found that substantially more of the cells in layer IV of primary visual cortex can be driven by the normal eye as compared to the deprived eye. The anatomical correlate of this is that ocular dominance stripes receiving input from the normal eye expand at the expense of the stripes from the deprived eye; however, stripe periodicity remains the same [see, e.g., for the cat Shatz and Stryker (1978) and for the macaque Hubel et al. (1977)]. It is the matching term in the elastic net that determines the amount of "pull" each LGN unit exerts on the cortical sheet. In equation 1.1 each LGN unit exerts the same total pull. We modeled monocular deprivation by a procedure analogous to that used in other models [e.g., Miller et al. (1989)]. The pull of all the units in one LGN was reduced by a fixed amount, corresponding to a decrease in competitive strength for that eye. This is most simply achieved by redefining the constant $\alpha$ so that it has a different value for each eye. The amount of deprivation, or "deprivation parameter," was defined to be the ratio of $\alpha$ for the deprived eye to that for the normal eye. The variation of stripe pattern with the size of this deprivation parameter is shown for the elastic net algorithm in Figure 2. It can be seen that the deprived eye takes over less of the cortex than the normal eye: the deprived-eye stripes become thinner. However, the stripe periodicity remains the same as in the normal case. Note that we have only changed $\alpha$, all other parameters remaining the same; a minimal sketch of this per-eye scaling is given below.
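Monocular deprivation then amounts to giving the matching term a per-unit pull $\alpha_i$. The following is a hedged sketch; the array layout (first block of LGN units = normal eye) and the function name are our assumptions.

```python
import numpy as np

def deprived_alpha(n_per_eye, alpha_normal=0.2, deprivation=0.5):
    """One pull value per LGN unit; deprivation = alpha(deprived) / alpha(normal)."""
    alpha = np.full(2 * n_per_eye, alpha_normal)
    alpha[n_per_eye:] *= deprivation        # weaken the deprived eye's pull
    return alpha

# In elastic_net_step above, the matching term would become
#   match = (alpha[:, None, None] * w[:, :, None]
#            * (x[:, None, :] - y[None, :, :])).sum(axis=0)
# and the update y + match + beta*k*tension, with alpha now inside the sum over i.
```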
Figure 1: Facing page. The effect of the shape of the cortex on overall stripe pattern. Each LGN was a unit disk of approximately 1750 units arranged in a hexagonal array. The disks were separated by a gap of 0.08. A small random component was added to each 3-D position to prevent any artefacts that might arise from complete regularity. The cortical sheet contained approximately 3500 units in a hexagonal array, bounded by either a circle (a) or an ellipse (b). Results are shown after 250 iterations: a slightly more efficient optimization procedure than steepest descent was used to reduce computation time [see, e.g., Durbin and Mitchison (1990); Goodhill (1991)]. The ocularity of cortical units is represented: each is colored white or black depending on which LGN is closer. All units have become monocular. In the circular boundary case (a) there is no preferred stripe direction, whereas in the elliptical boundary case (b) stripes tend to be aligned parallel to the short axis. Other parameters: $\alpha$ = 0.2, $\beta$ = 2.0, initial value of $k$ = 0.2, factor by which the annealing parameter $k$ is reduced at each iteration = 0.99. Initial conditions were defined by assigning each cortical unit an arbitrary ocularity value, and a topographic position was chosen randomly from a uniform distribution within 0.5 (i.e., half the width of the LGN sheets) of its "ideal" location.
Figure 2: Effects of monocular deprivation. For the "black" eye, $\alpha$ = 0.2. (a) Deprivation parameter = 0.75. (b) Deprivation parameter = 0.5. All other parameters (including initial conditions) as in Figure 1(b). Stripes representing the deprived eye become increasingly narrow as the size of the deprivation parameter is increased.
Acknowledgments

Funding: G. J. G. by an SERC postgraduate studentship (at Sussex University) and a Joint Councils Initiative in CogSci/HCI postdoctoral training fellowship; D. J. W. by MRC Programme Grant PG9119632. We thank the anonymous referees for useful suggestions, and Martin Simmen for helpful discussions.
References

Dayan, P. S. 1993. Arbitrary elastic topologies and ocular dominance. Neural Comp. 5, 392-401.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for understanding cortical maps. Nature (London) 343, 644-647.
Durbin, R., and Willshaw, D. J. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Fraser, S. E. 1985. Cell interactions involved in neural patterning. In Molecular Bases of Neural Development, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 481-507. Wiley, New York.
Goodhill, G. J. 1991. Correlations, competition and optimality: Modelling the development of topography and ocular dominance. Ph.D. thesis, University of Sussex.
Goodhill, G. J. 1993. Topography and ocular dominance: A model exploring positive correlations. Biol. Cybernet. 69, 109-118.
Goodhill, G. J., and Willshaw, D. J. 1990. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network 1, 41-59.
Hubel, D. H., Wiesel, T. N., and LeVay, S. 1977. Plasticity of ocular dominance columns in monkey striate cortex. Phil. Trans. R. Soc. London B 278, 377-409.
Jones, D. G., Van Sluyters, R. C., and Murphy, K. M. 1991. A computational model for the overall pattern of ocular dominance. J. Neurosci. 11, 3794-3808.
LeVay, S., Connolly, M., Houde, J., and Van Essen, D. C. 1985. The complete pattern of ocular dominance stripes in the striate cortex and visual field of the macaque monkey. J. Neurosci. 5, 486-501.
Löwel, S., and Singer, W. 1993. Strabismus changes the spacing of ocular dominance columns in the visual cortex of cats. Soc. Neurosci. Abstr. 19, 867.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Obermayer, K., Blasdel, G. G., and Schulten, K. 1992. Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45, 7568-7589.
Shatz, C. J., and Stryker, M. P. 1978. Ocular dominance in layer IV of the cat's visual cortex and the effects of monocular deprivation. J. Physiol. 281, 267-283.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. R. Soc. London B 208, 243-264.
von der Malsburg, C. 1979. Development of ocularity domains and growth behaviour of axon terminals. Biol. Cybernet. 32, 49-62.
Willshaw, D. J., and von der Malsburg, C. 1979. A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Phil. Trans. Roy. Soc. B 287, 203-243.
Yuille, A. L., Kolodny, J. A., and Lee, C. W. 1991. Dimension reduction, generalized deformable models and the development of ocularity and orientation. Harvard Robotics Laboratory Tech. Rep. no. 91-3.
Received April 20, 1993; accepted November 17, 1993.
Communicated by Laurence Abbott
The Effect of Synchronized Inputs at the Single Neuron Level

Öjvind Bernander, Christof Koch

Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
Marius Usher

Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Neural Computation 6, 622-641 (1994)
It is commonly assumed that temporal synchronization of excitatory synaptic inputs onto a single neuron increases its firing rate. We investigate here the role of synaptic synchronization for the leaky integrate-and-fire neuron as well as for a biophysically and anatomically detailed compartmental model of a cortical pyramidal cell. We find that if the number of excitatory inputs, $N$, is on the same order as the number of fully synchronized inputs necessary to trigger a single action potential, $N_t$, synchronization always increases the firing rate (for both constant and Poisson-distributed input). However, for large values of $N$ compared to $N_t$, "overcrowding" occurs and temporal synchronization is detrimental to firing frequency. This behavior is caused by the conflicting influence of the low-pass nature of the passive dendritic membrane on the one hand and the refractory period on the other. If both the temporal synchronization and the fraction of synchronized inputs (Murthy and Fetz 1993) are varied, synchronization is only advantageous if either $N$ or the average input frequency, $f_{in}$, is small enough.

1 Introduction
It has long been postulated that the synchronous firing activity of cortical neurons is a crucial stage underlying perception. The psychologist Milner (1974) first proposed that neurons responding to a "figure" fire synchronously in time, while neurons responding to another figure or to the "ground" fire randomly: the "primitive unity of a figure" would be defined at the neuronal level by synchronized firing activity. Several years later, von der Malsburg (1981) formulated his influential correlation theory of brain function on the basis of the importance of synchronized activity. How this theory could be used to temporally segregate patterns was demonstrated by von der Malsburg and Schneider
(1986). Using computer simulations, they showed that from an initially totally interconnected set of tonotopic neurons two distinct groups of neurons, corresponding to two distinct voices, arise. The mechanism of this segmentation is the temporal synchronization of simultaneously active cells using a fast Hebbian synaptic modulation mechanism. This idea has been extended by Crick and Koch (1990, 1992), who postulated that synchronized and oscillatory firing activity in a subset of cortical neurons constitutes the neuronal correlate of visual attention and awareness. Over the last 10 years, a small but growing community of electrophysiologists has focused on the synchronized electrical activity among two or more simultaneously recorded neurons in the cortex of cats and monkeys (Toyama et al. 1981a,b; Abeles 1982, 1990; Ts'o et al. 1986; Aertsen et al. 1989; Nelson et al. 1992; Kreiter and Singer 1992; Murthy and Fetz 1993). Remarkably, in some of these studies, cross-correlation among two cortical cells reveals a central peak with a width of less than 1 msec. These studies were brought to the forefront with the discovery (based on Freeman's earlier work; Freeman 1975) of 40 Hz oscillations in the visual cortex of cats (Eckhorn et al. 1988; Gray and Singer 1989) and the crucial demonstration that these oscillatory responses can become temporally synchronized in a stimulus-dependent manner (Gray et al. 1989). In the cat, the oscillations can be phase-locked with a phase-shift of ±3 msec around the origin at distances up to 7 mm (Engel et al. 1992). The existence and strength of the 40 Hz oscillations in the awake monkey appear to be highly variable (some laboratories routinely observe them while others do not). In consequence, recent theoretical studies focus on synchronized neuronal activity, rather than oscillations, as the basis for figure-ground segregation (Bush and Douglas 1991; Koch and Schuster 1992; Tononi et al. 1992). The principal idea underlying these studies is the belief that synchronized neuronal firing in large populations of pyramidal cells causes a higher firing rate in postsynaptic target cells (after suitably accounting for axonal and synaptic delays; Manor et al. 1992; Abeles 1982). This, in fact, is already inherent in the McCulloch and Pitts (1943) neuron: if one such binary "unit" has a threshold of two, the simultaneous activity of two presynaptic neurons is required to bring the unit above threshold. However, it has rarely been asked to what extent more realistic and biophysically plausible models of neurons prefer synchronized to desynchronized, excitatory synaptic input. Are there physiologically meaningful conditions under which temporally synchronized input leads to less effective postsynaptic firing than less synchronized activity? This is the question we address here, and we find that under many conditions synchronized firing is not optimal for the cell in terms of generating the largest number of spikes. We shall leave aside the interesting question of how synchronized activity arises in neural populations (see Abbott 1990,
1993; Mirollo and Strogatz 1990; Sompolinsky et al. 1990; Hansel and Sompolinsky 1992; Usher et al. 1993). To our knowledge, only a single paper has investigated the possible "negative" effect of synaptic synchrony on postsynaptic firing frequency (Murthy and Fetz 1993). Their numerical study varies the fraction of cells that is perfectly synchronized among each other, concluding that synchronization increases the postsynaptic firing frequency only under certain conditions. In our more general investigation, we use both an analytically treatable neuron model (from the family of integrate-and-fire models) as well as computer simulations of a biophysically very realistic cortical pyramidal cell to investigate the effect of single-shot and repetitive synaptic input at various synchronization levels. The principal, and most limiting, assumption we make here is that the dendritic tree of cells is passive and does not contain special, fast voltage-dependent nonlinearities, which are limited to the cell body. The degree of synaptic correlation in the input varies in accordance with two independent factors: the temporal spread of synchronization, $T$, referred to as the desynchronization interval, and the fraction of neurons that is synchronized, $r$. As we shall see, these two factors affect the postsynaptic firing frequency in different ways. In the first two sections of the paper, we will investigate the effect of the desynchronization interval on the firing rate, assuming that $N$ synapses are each activated only once, under two extreme assumptions: (1) the input is Poisson-distributed throughout $T$ with, on average, $N$ synaptic inputs, and (2) the input is constant, approximating the situation of regular input activity every $T/N$ msec. In Section 4, we will deal with the added complication arising from repetitive synaptic input.

2 Synchronicity in Integrate-and-Fire Models
We will first consider different variants of the integrate-and-fire (I&F) model neuron (Knight 1972; Fig. 1a), under the assumption that $N$ synapses are activated only once (single-shot case). In its simplest version, discrete synaptic inputs arriving at times $t_i$ place an identical charge $Q_0$ onto a capacitance $C$, charging up the membrane potential across the capacitance by $\Delta V = Q_0/C$. When the voltage reaches a fixed threshold value, $V_t$, a point-like pulse is generated and the potential $V(t)$ is reset to 0. Two important modifications to this model include a membrane leak conductance $G$ and an absolute refractory period $T_{rp}$. The leaky or forgetful integrate-and-fire model has finite memory: since the membrane potential decays exponentially between synaptic inputs (with time constant $\tau_m = C/G$), events that occurred in the past are less effective than more recent ones. The effect of $T_{rp}$ is to hold the potential $V(t)$ to 0 for the duration $T_{rp}$ after the model has generated a spike, rendering all synaptic inputs ineffective during this time.
Figure 1: The two neuronal models used in our study. (a) The leaky integrate-and-fire (I&F) model with optional membrane leak $G$ and refractory period $T_{rp}$. Synaptic input can be modeled either as a conductance change $G_{syn}$ (shown) or as a current source $I_{syn}$ (not shown). (b) Compartmental model of a morphologically reconstructed layer V pyramidal cell from cat cortex (Bernander et al. 1991), using 400 passive dendritic compartments and a soma containing eight voltage- and time-dependent Hodgkin-Huxley-like currents.

The main virtue of this family of models is their simplicity, allowing us to study some of their properties analytically.

2.1 Regular Synaptic Input. We here assume that the synaptic input arrives at a constant rate $\lambda = N/T$; in other words, during the interval $T$, $N$ synaptic inputs arrive in a regular manner, spaced $T/N$ msec apart. Let $T_{spike}$ be the time required to charge up the membrane from rest ($V = 0$) to $V_t$. The total number of output spikes, $N_{sp}$, generated during this interval $T$ will be the largest integer $n$ for which

$$n\,(T_{spike} + T_{rp}) \le T + T_{rp} \tag{2.1}$$

that is,

$$N_{sp} = \mathrm{Floor}\!\left[\frac{T + T_{rp}}{T_{spike} + T_{rp}}\right] \tag{2.2}$$

where $\mathrm{Floor}[x]$ is the largest integer smaller than or equal to $x$. For an analytical treatment it is more convenient to use a continuous approximation:

$$N_{sp} = \frac{T + T_{rp}}{T_{spike} + T_{rp}} \tag{2.3}$$
Table 1: Analytical Expressions for the Time $T_{spike}$ Required for Constant Synaptic Input Arriving at Rate $\lambda = N/T$ to Reach Threshold, Assuming a Continuous Approximation of the Discrete Input.$^a$

| Case | Model | Synaptic input | $T_{spike}$ |
|------|-------|----------------|-------------|
| 1 | I&F | Current $I_0 = C\lambda\Delta V$ | $N_t/\lambda$ |
| 2 | I&F | Conductance | $-\frac{C}{\lambda G_{syn}}\,\ln\!\left(1 - \frac{V_t}{E_{syn}}\right)$ |
| 3 | Leaky I&F | Current $I_0 = C\lambda\Delta V$ | $-\tau_m \ln\!\left(1 - \frac{N_t}{\tau_m \lambda}\right)$ |
| 4 | Leaky I&F | Conductance | $-\frac{C}{G + \lambda G_{syn}}\,\ln\!\left(1 - \frac{V_t\,(G + \lambda G_{syn})}{\lambda E_{syn} G_{syn}}\right)$ |

$^a$See Appendix A for derivation.
We also assume that $N_t = V_t/\Delta V$, that is, the number of simultaneous current inputs required to reach threshold, is an integer. The only quantity that needs to be evaluated in equation 2.3 is $T_{spike}$. Case 1 is an integrate-and-fire model with a constant rate $\lambda = N/T$ of identical synaptic input pulses, each one dumping the charge $Q_0 = C\Delta V$ onto the capacitance. This is equivalent to injecting the constant current $I_0 = C\lambda\Delta V$ onto the capacitance. In the third case, the same current $I_0$ is injected into the leaky integrate-and-fire model. In the other two cases, the input is treated as a conductance input $G_{syn} > 0$ in series with a synaptic battery $E_{syn} = 70$ mV, for the standard (case 2) or for the leaky I&F model (case 4). This is equivalent to a single effective conductance of value $\lambda G_{syn}$ that is activated during the interval $T$. We derive in Appendix A an analytical expression for $T_{spike}$ in the fourth and most general case (see Table 1). If the synaptic conductance is small compared to the membrane leak conductance $G$, the input can be treated as a fixed synaptic input current $I_{syn} = \lambda C\Delta V = \lambda E_{syn} G_{syn}$, and we arrive at $T_{spike}$ for current input (case 3). The time to spike for the standard I&F model can be obtained by setting the leak term $G = 0$. While conductance inputs are more relevant to the physiological situation where massive synaptic input fires the pyramidal cell at high rates (Bernander et al. 1991), further analysis is simplified if current inputs are used. We evaluated equation 2.3 for the case of conductance inputs to the I&F model with $N = 1000$ inputs and either no or a fixed refractory period ($T_{rp} = 2$ msec; see Fig. 2), and used values for $G$, $C$, $T_{rp}$, and $N_t$ that mimic the values observed in cortical pyramidal cells (see below). For the standard integrator (with $T_{rp} = 0$ and $G = 0$; top curve), the number of output spikes, $N_{sp}$, is independent of $T$. In fact, $N_{sp}$ is always independent of the arrival times of the input but only depends on the total number of inputs: $N_{sp} = N/N_t$ for the I&F model with an infinite memory. When a membrane leak $G$ is introduced, the number of output spikes decreases with $T$ because earlier inputs leak away with a time constant $\tau_m = C/G$.
Figure 2: Temporal dispersion of synaptic input and its effect on firing rate for different single cell models. A fixed number of identical excitatory inputs $N$ is evenly distributed along the interval $T$ and the number of output spikes $N_{sp}$ is computed as a function of $T$. (a) Leaky integrate-and-fire model with $N = 1000$ conductance inputs. The leak conductance $G$ is either 0 or 58.8 nS, and the refractory period $T_{rp}$ is either 0 or 2 msec. The membrane leakage "pulls down" the right end of the curve, while the refractory period pulls down the left end of the curve. $\Delta V = 0.25$ mV, $V_t = 15$ mV, $\tau_m = 17$ msec, and $N_t = 60$. These parameters are similar to those of the detailed model. The optimal value of $T$, $T_{opt}$, is marked on the bottom curve. (b) Compartmental model of layer V pyramidal cell. $N$ = 200-1000 excitatory, fast AMPA synapses were distributed throughout the cell. $N_t = 66$ synchronized somatic synaptic inputs are required to trigger one action potential. For $N = 1000$ the I&F model with refractory period and conductance inputs is in good qualitative agreement with the compartmental model. The principal result of our study is that for $N \gg N_t$, synchronization of synaptic input causes the cell to fire fewer spikes than if the synaptic input is temporally dispersed (i.e., the optimal $T > 0$).
This is at the heart of the traditional argument for the advantage of synchronizing synaptic input in terms of eliciting the maximum number of postsynaptic spikes: temporal dispersion of synaptic input reduces its effectiveness (e.g., Abeles 1982). However, when a refractory period, $T_{rp}$, is introduced into the leaky I&F neuron (lower curve in Fig. 2a), the initial part of the curve is "pulled down," so that for small desynchronization intervals $T$, $N_{sp}$ will increase with $T$. The reason for this "overcrowding" effect is that synaptic input in excess of $N_t$ will be "wasted." Synaptic inputs arriving during the refractory period do not contribute to the excitability of the cell. Thus, for $N > N_t$, desynchronized synaptic input increases the spiking rate, or high synchronicity of massive synaptic input reduces firing rate.
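This trade-off can be checked directly from equation 2.3 with $T_{spike}$ taken from case 3 of Table 1 (current input). The following sketch is ours and only illustrative, since Figure 2a itself was computed with conductance inputs; the parameter values follow the Figure 2a caption.

```python
import numpy as np

def n_spikes(T, N=1000, Nt=60, tau_m=17.0, T_rp=2.0):
    """Equation 2.3 with the case-3 T_spike; all times in msec."""
    lam = N / T                          # input rate lambda = N/T
    arg = 1.0 - Nt / (tau_m * lam)       # = 1 - Nt*T/(tau_m*N)
    if arg <= 0.0:                       # beyond T_cutoff the threshold is never reached
        return 0.0
    T_spike = -tau_m * np.log(arg)
    return (T + T_rp) / (T_spike + T_rp)

Ts = np.linspace(1.0, 300.0, 300)
T_best = max(Ts, key=n_spikes)           # nonzero optimum: the overcrowding effect
```

Scanning $T$ reproduces the nonmonotonic shape of the bottom curve in Figure 2a: few spikes for very small $T$ (refractory period) and a leak-limited response for large $T$.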
The optimal $T$ is a compromise between the effects of $G$ and $T_{rp}$, and is about 60 msec for $N = 1000$ as shown in Figure 2. Substituting current inputs for conductance inputs had only a minor effect on the I&F models. If the parameters were adjusted to give similar values of $N_{sp}$ for small $T$, then current inputs give slightly larger values for large $T$ (graphs not shown). It is conceivable that the peaked form of $N_{sp}$ could be due to synaptic saturation. Rather than inputs being "wasted" during the refractory period, an increased resting potential would reduce the driving force $E_{syn} - V(t)$ for the excitatory synapses. Such saturation effects exist only if the synaptic input is treated as a conductance change and could, in principle, reduce $N_{sp}$ for high input rates $\lambda = N/T$ and small values of $T$. However, $dN_{sp}/dT < 0$ for case 4 with $T_{rp} = 0$ msec, independent of $E_{syn}$, and therefore no peak can occur (see monotonically decreasing curve in Figure 2a).

2.2 Optimal Desynchronization Interval. How does $T_{opt}$, the optimal value of $T$, that is, the desynchronization interval at which $N_{sp}$ is maximized, depend on the various parameters? To find $T_{opt}$, we compute $dN_{sp}/dT$ for the leaky I&F model with current inputs and set this derivative to zero (see Appendix B). This expression cannot be solved in closed form since it is of the form $f(y) = \log(1 - y)$. Instead we show the numerically obtained value of $T_{opt}$ in Figure 3. Note that all three axes are in dimensionless variables: $T_{opt}/T_{rp}$, $\tau_m/T_{rp}$, and $N/N_t$. As can be seen, $T_{opt}$ is almost exactly linear in $N/N_t$, except for values of $N/N_t$ in the neighborhood of 1. This is not surprising, since at high input rates $T_{opt}$ becomes much larger than $\tau_m$ and a dynamic steady-state condition prevails during most of the single-shot. In other words, for $N \gg N_t$ there exists an optimal input rate $f_{in} = N/T_{opt}$ (see Appendix B). For our parameter values in the leaky I&F model, this rate is approximately 10 inputs/msec. If $f_{in}$ is increased above this 10 kHz rate, the refractory period will reduce $N_{sp}$, while if $f_{in}$ is decreased below 10 kHz, the membrane leak causes a reduction in $N_{sp}$. $T_{opt}$ increases with $\tau_m = C/G$, but not in a linear fashion, due to the opposing effects of $T_{rp}$ (favoring larger values of $T_{opt}$) and $G$ (favoring smaller values).

2.3 Synchronicity for Poisson-Distributed Input. Up to now we considered synaptic input that arrives at a constant rate. Let us now analyze the more realistic situation of Poisson-distributed current inputs with a mean rate $\lambda = N/T$ to the leaky integrate-and-fire model over a fixed temporal interval of duration $T$. Due to the stochastic nature of the input, the membrane potential $V$ executes a random walk-like trajectory. When the potential reaches $V_t$ a pulse is generated and $V(t)$ is reset to zero.
Figure 3: The optimal value of $T$, $T_{opt}$, in units of $T_{opt}/T_{rp}$, in the leaky integrate-and-fire model for current input as a function of the normalized number of inputs $N/N_t$ and normalized time constant $\tau_m/T_{rp}$. $T_{opt}$ is almost linear in $N$ for $N \gg N_t = 60$; for example, if the number of inputs $N$ doubles, they should be spread out over twice as long an interval $T$ as before in order to maximize firing frequency.

While the probability distribution function for the entire interspike interval distribution is not known for the leaky I&F model, it is possible to compute the mean time to threshold, $T_{spike}$ (known as the first-passage-time problem). Following Ricciardi (1977; see Appendix C for details), we compute the mean value of $T_{spike}$ from the equilibrium probability distribution of the membrane potential by using Siegert's recursion formula:
$$T_{spike} = \int_{0}^{V_t} \frac{2}{A_2(z)\,W(z)} \left[\int_{-\infty}^{z} W(y)\,dy\right] dz \tag{2.4}$$

where the limits of integration are the reset potential $V = 0$ and the threshold $V_t$, and $A_2(V)$ and the steady-state distribution $W(V)$ are defined in Appendix C.
The average number of output spikes $N_{sp}$ can now be calculated from equation 2.3 and is plotted in Figure 4 as a function of the desynchronization interval $T$. Comparing $N_{sp}$ for this stochastic input with the
Figure 4: Number of output spikes (from equation 2.4) for Poisson-distributed (thin lines) and for constant synaptic input (thick lines) for the leaky integrate-and-fire model with current input (see case 3 in Table 1) as a function of the desynchronization interval $T$. The different curves are for different values of $N$, that is, the average number of synaptic inputs arriving during the interval $T$. For $N < N_t = 60$ or for $T > T_{cutoff}$, fluctuations in the random input can always push the potential above threshold, while the model ceases to respond to constant input. For $T < T_{cutoff}$ and $N > N_t$, that is, for large input rates $\lambda = N/T$, the two models agree closely. For $N \le N_t$, the analytical approximation deviates significantly from the numerical one due to truncation error, and only the latter is shown for $N$ = 55 and 60.
superimposed functions for the constant input, we can see two important differences at the two extremes of the axis. One major difference is when on average fewer than $N_t$ inputs are present. For constant synaptic input, no spikes are generated for $N < N_t = 60$, and $N_{sp} \ne 0$ only for $T = 0$ at $N = 60$. However, for Poisson-distributed input, there always exists a nonzero probability that stochastic fluctuations in the input will carry the potential above threshold. Likewise, Poisson input can in principle, for large values of $T$, always exceed the threshold, while this is not possible for regularly timed input (where the cutoff value of $T$ is given by $T_{cutoff} = \tau_m N/N_t$), resulting in a "tail" for large values of $T$. For all other values of $T < T_{cutoff}$ and $N > N_t$, the constant synaptic input closely approximates Poisson-distributed input. In other words, for large enough values of the synaptic input rate $\lambda = N/T$, Poisson input can be approximated by constant input (case 3 in Table 1).
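The diffusion approximation can also be checked against a direct simulation. The sketch below is our construction, with parameters as in Figure 4: a leaky I&F unit is driven by Poisson current inputs and the output spikes are counted.

```python
import numpy as np

rng = np.random.default_rng(0)

def spikes_poisson(T, N, Vt=15.0, dV=0.25, tau_m=17.0, T_rp=2.0, dt=0.01):
    """Count output spikes for Poisson input at mean rate N/T (times in msec)."""
    lam = N / T
    V, t, n_sp, t_spk = 0.0, 0.0, 0, -np.inf
    while t < T:
        V *= np.exp(-dt / tau_m)            # membrane leak
        if t - t_spk >= T_rp:               # inputs are ineffective when refractory
            V += rng.poisson(lam * dt) * dV
            if V >= Vt:                     # threshold crossing: spike and reset
                n_sp, t_spk, V = n_sp + 1, t, 0.0
        t += dt
    return n_sp
```

Averaged over many trials, this reproduces the qualitative features of the thin curves in Figure 4, including the nonzero response for $N < N_t$ and the tail at large $T$.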
3 Synchronicity in a Detailed Model of a Pyramidal Cell
To what extent are our results due to the very simple neuronal model we have been using? After all, I&F neurons have a fixed threshold, no dendritic tree, no voltage- and time-dependent conductances, and no synaptic dynamics. In order to address this issue, we simulated a detailed compartmental model of a morphologically reconstructed layer 5 pyramidal cell from visual cortex in the anesthetized adult cat (Douglas et al. 1991; see Fig. 1b; for details of the model see Bernander et al. 1991, 1992). Briefly, the dendritic tree consists of about 400 passive compartments (with $C_m$ = 1 µF/cm², $R_m$ = 100 kΩ-cm², $R_a$ = 200 Ω-cm, and $E_{rest}$ = -66 mV). The membrane resistance and reversal potential for each compartment is adjusted to mimic a 0.5 Hz "background" activation of 4000 excitatory, fast AMPA (or non-NMDA) synapses and 1000 GABAergic inhibitory synapses. Since each synapse corresponds to a small conductance change, the effective membrane resistance is reduced to 10-50 kΩ-cm². The soma contains 8 active currents, including two sodium currents ($I_{Na}$ and $I_{NaP}$), four potassium currents, one of which is calcium-dependent ($I_{DR}$, $I_A$, $I_M$, and $I_{AHP}$), one calcium current ($I_{Ca}$), and an anomalous rectifier ($I_{AR}$). The A current allows the cell to spike at low frequencies over an extended range, and the combination of $I_{Ca}$ and $I_{AHP}$ causes the firing rate to adapt by a factor 2-3 (for medium and high firing rates). The input resistance in the presence of synaptic background activity is 15 MΩ and the time constant 17 msec. The peak voltage of the spikes drops below 0 mV at approximately 500 Hz, suggesting an absolute refractory period of about 2 msec. A simulation was run on the full model for $N$ = 200 to 1000 fast, excitatory, voltage-independent AMPA synapses distributed throughout the dendritic arbor in accordance with the known anatomical distribution (Fig. 2b). For $N = 1000$ the same basic effect is observed as in the leaky I&F model with refractory period. In the case of total synchronization, $T = 0$, only two spikes are produced due to overcrowding. A maximum of 5 spikes is obtained for $T$ = 25-65 msec and the response decreases to 0 spikes for $T = 200$ msec. When $N$ is reduced to 300 synapses or less, the peak in $N_{sp}$ disappears. Note, however, that now $N_{sp}$ is essentially flat around the origin, implying that the cell is not highly tuned to small $T$ (for the $\tau_m = 17$ msec used here). To assess to what extent this behavior is due to the fact that synaptic input increases the membrane conductance, rather than injecting a current into the cell, we approximated this condition by reducing the synaptic conductance change $G_{syn}$ for each synapse by a factor 10, while increasing the driving force $E_{syn} - V$ by a factor 10. This removed any saturation in the dendritic tree, and thus more current was injected during stimulation (curves not shown). The main difference was that a few
more spikes were obtained at every $T$, while $N_{sp}$ still peaked for approximately the same values of $T$. Substituting NMDA synapses for half of the AMPA synapses had the effect of broadening the peak, as well as making it less pronounced. This can be explained by the much slower time course of the NMDA synapses ($\tau_{peak}$ = 40 msec), which is conceptually similar to desynchronizing the much faster AMPA synapses. No obvious cooperative effects were seen due to the negative input conductance of the NMDA synapses.

4 Correlated Synaptic Input
In the previous sections, we assumed that $N$ independent synapses were each activated only once (single-shot case). However, synaptic input is repetitive (cells fire more than once) and can be correlated. How are our previous results affected by such correlated activity? The degree of correlation in the input may vary in accordance with two factors: the temporal spread of synchronization ($T$, as expressed by the width of the cross-covariance function between input neurons), and the fraction of neurons, $r$, that is synchronized. By varying each of the factors independently, one can interpolate between a fully synchronized and a fully desynchronized input. If each of the $N$ input neurons is firing with a Poisson probability distribution with mean rate $f_{in}$, two extreme situations can be considered:

- If none of the neurons is correlated ($r = 0$), the input consists of a single Poisson process of events of height $\Delta V$ and rate $\lambda = N f_{in}$. The mean interspike interval and output firing rate can then be estimated from equation 2.4 and will be denoted by $f_{out,0}$.

- When all ($r = 1$) neurons are perfectly correlated ($T = 0$), the input is equivalent to a single Poisson stream of events of height $N\Delta V$ and rate $\lambda = f_{in}$. Assuming $N > N_t$ (otherwise the neuron will rarely fire), each synchronized event triggers one spike (the refractory period prevents multiple spikes for such synchronized input currents) and the output frequency is equal to the input frequency $f_{in}$.
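An input ensemble interpolating between these two extremes can be generated as follows. The sketch is ours; the jitter model (common volley times spread uniformly over the desynchronization interval) is the stated assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def correlated_input(N=200, r=0.5, f_in=0.005, T=10.0, duration=1000.0):
    """Spike times for N inputs; a fraction r shares volleys jittered over T.
    Rates in events/msec (0.005 = 5 Hz), durations in msec."""
    n_corr = int(r * N)
    # Common Poisson volley times for the synchronized subpopulation
    gaps = rng.exponential(1.0 / f_in, size=int(3 * f_in * duration) + 10)
    volleys = np.cumsum(gaps)
    volleys = volleys[volleys < duration]
    trains = []
    for i in range(N):
        if i < n_corr:   # jitter each volley within the desynchronization interval
            trains.append(np.sort(volleys + rng.uniform(0.0, T, volleys.size)))
        else:            # the rest fire as independent Poisson processes
            n = rng.poisson(f_in * duration)
            trains.append(np.sort(rng.uniform(0.0, duration, n)))
    return trains
```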
The intermediate situation ($rN$ neurons perfectly correlated with $T = 0$, and the rest independent) interpolates between these two limiting cases, as shown in Figure 5. Thus increasing the number of perfectly synchronized and correlated neurons is advantageous only if the response in the uncorrelated case is lower than $f_{in}$, that is, if $f_{out,0}(N, f_{in}) < f_{in}$. We derive the border between these two domains by finding those values of $N$ and $f_{in}$ where this inequality turns into an equality. As we saw in Section 2.3, if $\lambda$ is large enough, the value of $T_{spike}$ for Poisson input is well approximated by that of constant input given by case 3 in Table 1. Using this
latter result as well as $f_{out,0} = 1/(T_{spike} + T_{rp})$ and $\lambda = N f_{in}$, we arrive at

$$N = \frac{N_t}{\tau_m f_{in}\left[1 - e^{-(1/f_{in} - T_{rp})/\tau_m}\right]} \tag{4.1}$$

Figure 5: Simulation results for the response rate $f_{out}$ of a leaky I&F neuron with current input ($\tau_m = 17$ msec, $T_{rp} = 2$ msec, and $N_t = 60$) to a varying fraction $rN$ of correlated inputs out of a total population of $N = 200$ input neurons. The bold curves in (b) and (c) highlight the response when the correlated neurons are perfectly synchronized ($T = 0$), while the thin lines correspond to nonzero values of the desynchronization interval $T$ as indicated. (a) Border in the $f_{in}$-$N$ plane delimiting the parameter range for which perfect synchronization at $T = 0$ reduces the firing rate from the opposing situation (see equation 4.1; diamonds represent simulation results). For small values of either $N$ (as long as $N > N_t$) or $f_{in}$, synchronization always increases $f_{out}$. (b) For $f_{in} = 5$ Hz, synchronization always increases the firing frequency. (c) For larger values of the input frequency, here $f_{in} = 20$ Hz, too much or too little temporal synchronization decreases $f_{out}$.
This expression, then, demarcates the two domains. For values of $N$ and $f_{in}$ below and to the left of this border (Fig. 5a), increasing the number $rN$ of correlated neurons enhances the output rate, while above and to the right of this curve increased synchronization reduces the output rate. The latter effect is due to the fact that at higher $N$ and $f_{in}$, input spikes are rendered ineffective due to the refractory period. For low values of the input frequency relative to the leak term, that is, when $f_{in} \ll 1/\tau_m$, the term in the exponent on the right-hand side of equation 4.1 can be neglected, leading to an inverse relationship between $N$ and $f_{in}$ and to a hyperbolic curve for small values of $f_{in}$ in Figure 5a. What effect does the temporal width of the cross-covariance function $T$ have on postsynaptic firing frequency? We approached this question by numerically evaluating $f_{out}$ for a variety of different settings in the relevant four-dimensional space spanned by $r$, $T$, $N$, and $f_{in}$. In the simulations shown in Figure 5b,c, we compute $f_{out}$ as a function of the number of synchronized inputs using the leaky I&F model with current inputs. Figure 5b shows the response rate for a population of $N = 200$ input neurons, all firing with a mean input rate $f_{in} = 5$ Hz. We are here well in the domain (see the lower cross in Fig. 5a) where perfect ($T = 0$) synchronization will lead to an increase in the firing rate. If none of the inputs is correlated, we are in the first of the limiting cases discussed at the beginning of the section, with $f_{out,0} = 0$. As $rN$ increases, the unit starts to fire. At perfect levels of temporal synchronization, $f_{out}$ steeply rises in the neighborhood of $rN \approx N_t$ and saturates for large values of $rN$ at the firing rate $f_{in}$ (second limiting case discussed above), since all 200 inputs fire at once, causing only a single postsynaptic spike per input volley. For a finite desynchronization interval, here $T$ = 10 and 20 msec, the steep rise in $f_{out}$ occurs at somewhat larger values of $rN$ than for perfect synchronization. However, as $r$ continues to increase, the firing rate increases to almost twice the frequency compared to perfectly synchronized input ($T = 0$), expressing the fact that two spikes are fired on average per input volley (due to the temporal spread of all 200 inputs over 10 or 20 msec). For larger desynchronization intervals (here 50 and 200 msec), the leaky membrane limits the response of the I&F unit and the postsynaptic response remains small. Figure 5c illustrates the reverse case, when the input rate is so high (here $f_{in} = 20$ Hz) that temporal synchronization leads to a reduction in postsynaptic response (see the upper cross in panel a). Here, increasing the fraction of perfectly synchronized neurons $rN$ causes a drop in $f_{out}$, except when a $T = 10$ msec desynchronization interval is being used for large values of $rN$. The fact that when all neurons are correlated at the endpoint $rN = 200$ of Figure 5b,c the optimal desynchronization interval is between 10 and 20 msec can be understood from our analysis of the single-shot case in Section 2.
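Equation 4.1, as reconstructed above, is easy to evaluate numerically, and the values it gives are consistent with Figure 5: for $f_{in}$ = 5 Hz the border lies near $N \approx 700$ (so $N = 200$ is in the regime where synchronization helps), while for $f_{in}$ = 20 Hz it lies near $N \approx 190$ (so $N = 200$ is in the regime where it hurts). The function name below is ours.

```python
import numpy as np

def border_N(f_in, Nt=60, tau_m=17.0, T_rp=2.0):
    """Largest N (equation 4.1) for which f_out,0 still reaches f_in.
    f_in is in spikes/msec (0.02 = 20 Hz), times in msec."""
    return Nt / (tau_m * f_in * (1.0 - np.exp(-(1.0 / f_in - T_rp) / tau_m)))

print(border_N(0.005), border_N(0.02))   # roughly 706 and 188
```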
5 Discussion

A number of proposals for linking neuronal firing with higher-level "emergent" properties explicitly or implicitly assume that synaptic input synchronization always leads to an increase in postsynaptic firing frequency compared to the desynchronized case (see the Introduction). We here investigate this hypothesis in detail. Before we summarize and interpret our results, let us state the principal limitations of our study. We investigated the firing properties of two distinct neuronal models: the analytically treatable integrate-and-fire (I&F) family of integrator models (Knight 1972) as well as a biophysically detailed compartmental model of an anatomically reconstructed cortical pyramidal cell (Bernander et al. 1991, 1992). This model assumes that no voltage-dependent membrane currents are present in the dendrites (with the exception of the voltage-dependent NMDA synaptic input) and that the normal complement of ionic currents (eight in our model) generates the responses seen in a typical, regular-firing pyramidal cell (McCormick et al. 1985; Connors and Gutnick 1990). We did not consider bursting cells that can generate two or more fast spikes in response to an appropriate synaptic input, nor voltage-dependent sodium or calcium currents in the dendritic tree (Llinás 1988). Both situations render any analysis such as the one carried out here considerably more complex. Sufficiently fast and strong dendritic nonlinearities, such as postulated by Softky (1994), can in principle render the cell susceptible to specific temporal arrangements of synaptic input (i.e., specific values of $T$) and would invalidate our analysis. Given that Softky and Koch (1993) postulate such dendritic nonlinearities to explain the high variability of the firing rate of cortical pyramidal cells, we plan to investigate the dependency of such a cell on synchronized synaptic input in a later study. We here provide the baseline against which the performance of more complex neuronal models needs to be evaluated. We assume that synaptic input is either constant or distributed according to a Poisson process. Our detailed analysis of the power spectrum, interspike interval distribution, and firing variability of nonbursting cortical cells in the awake and behaving monkey firing at high rates supports the Poisson hypothesis (Softky and Koch 1993; Bair et al. 1993). Finally, we here consider only the effect of excitatory synaptic input, neglecting the effect of synchronization of inhibitory synaptic input. However, insofar as steady-state conditions are met (i.e., for large enough values of $\lambda$), the current due to the inhibitory synaptic input can be subtracted from the excitatory current, yielding a net effective input current (or input rate), and all of our arguments apply. Both the I&F models as well as the detailed biophysical model display the same behavior: if the entire excitatory synaptic input is correlated ($r = 1$; single-shot case), temporal synchronization (small values of $T$) increases the output firing rate only if the average number of spikes (as characterized by a Poisson process with rate $\lambda = N/T$) is lower than
the number of inputs $N_t$ needed to reach threshold (Fig. 2).¹ For rates significantly larger than $N_t/T$, there will be a nonzero, optimal desynchronization interval. As witnessed in Figure 3, this optimum interval increases linearly with $N/N_t$ and sublinearly with $\tau_m/T_{rp}$. For desynchronization intervals larger than $T_{opt}$, the response is reduced due to temporal dispersion induced by the membrane leak, while the refractory period limits the usefulness of high synchronization. Any neuronal model with a refractory period will display such a tendency against overcrowding of synaptic inputs. It should be noted that such overcrowding can occur at what is believed to be physiological levels of synaptic input; 1000 synaptic inputs (constituting about 10% of all excitatory synaptic inputs; Larkman 1991) impinging onto our pyramidal cell within 50 msec give rise to twice the number of spikes than the same number of inputs applied instantaneously ($T = 0$). Because of the similarity between the I&F and the detailed models, we only use the former when we investigate the more complex situation arising during repetitive input when only a fraction $r$ of the $N$ input synapses is correlated (with the width of the central peak in the cross-covariance function characterized by $T$). This is the situation Murthy and Fetz (1993) studied, assuming always $T = 0$. They conclude that synchronization is useful only when $N$, $\Delta V$, or $f_{in}$ are not too large. We explicitly found (Fig. 5a) the domain in $N$-$f_{in}$ space for which correlated inputs enhance the response. For values of $f_{in}$ and $N$ below the border displayed in Figure 5a, and if the number of correlated inputs $rN < N_t$ ($N_t$ is about 60 as estimated by biophysical parameters from our reconstructed neuron), perfect temporal synchronization (with a zero-width peak) is advantageous (Fig. 5a). In this regime, the assumption that high levels of firing synchronization, as expressed by sharp peaks in the cross-covariance function, play a significant role in various perceptual processes is valid (Milner 1974; von der Malsburg 1981; Abeles 1982, 1990; Gray et al. 1989; Crick and Koch 1990; Kreiter and Singer 1992). If $rN > N_t$, perfect synchronization ceases to be optimal due to overcrowding. In this regime, small enough values of the average input frequency $f_{in}$ in combination with small desynchronization intervals ($T$ = 10-20 msec; Fig. 5b,c) enhance the response rate compared to perfect or no temporal synchronization. Thus, cross-covariance functions with pronounced but wide peaks can indeed be more advantageous than extremely narrow central peaks in the cross-covariance (e.g., cell pairs of the T type in Nelson et al. 1992). We conclude that (in the absence of fast and powerful active dendritic conductances) if the synchronization of the firing of cortical cells is indeed a crucial signal underlying higher-level perceptual processes, the brain must take care to ensure that only some minimal number of neurons are simultaneously active.

¹We would like to note in passing that both in this study and in our analysis of the firing variability, characterized in terms of both the variability of the number of spikes and the variability of the interspike intervals (Softky and Koch 1993), the leaky integrate-and-fire model (Knight 1972) is in good qualitative agreement with the biophysically detailed pyramidal cell model with passive dendrites.
Appendix A: $T_{spike}$ for Constant Inputs in the Modified I&F Model
We derive here the time $T_{spike}$ it takes for a constant conductance input, $\lambda G_{syn}$, to charge the membrane from 0 to $V_t$ in the leaky integrate-and-fire model (case 4 in Table 1). The input is approximated by a constant synaptic conductance increase in series with a synaptic battery $E_{syn}$ (here, $E_{syn} G_{syn} = C\Delta V$). Replacing the two parallel conductances $G$ and $\lambda G_{syn}$ with a single equivalent conductance $G' = G + \lambda G_{syn}$, and replacing the battery $E_{syn}$ with $E' = E_{syn}\lambda G_{syn}/G'$, we arrive at a first-order ordinary differential equation:

$$C\,\frac{dV}{dt} + (V - E')\,G' = 0$$
Solving this and setting $V$ to $V_t$ and $t = T_{spike}$ leads to

$$V_t = E'\left(1 - e^{-G' T_{spike}/C}\right)$$

which can be rewritten as

$$T_{spike} = -\frac{C}{G + \lambda G_{syn}}\,\ln\!\left(1 - \frac{V_t\,(G + \lambda G_{syn})}{\lambda E_{syn} G_{syn}}\right)$$
The expression for current inputs (case 3) can be obtained as a limiting case by setting $I_0 = \lambda C\Delta V = \lambda E_{syn} G_{syn}$ and letting $G_{syn} \to 0$ and $E_{syn} \to \infty$, keeping $G_{syn} E_{syn}$ constant. $T_{spike}$ for the nonleaky I&F model (cases 1 and 2) can be simply obtained by setting the membrane leak $G$ to 0 [and exploiting $\ln(1 + dx) = dx$ for small values of $dx$].
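As a sanity check, the closed-form $T_{spike}$ for case 4 agrees with direct integration of the differential equation. The parameter values below are our assumption, chosen so that $E_{syn}G_{syn} = C\Delta V$ with $\Delta V$ = 0.25 mV and $\tau_m$ = 17 msec.

```python
import numpy as np

C, G, E_syn, Vt = 1.0, 1.0 / 17.0, 70.0, 15.0   # nF, uS, mV (tau_m = 17 msec)
G_syn = 0.25 / E_syn                             # so that E_syn*G_syn = C*dV

def t_spike_case4(lam):
    Gp = G + lam * G_syn                         # G' = G + lambda*G_syn
    Ep = E_syn * lam * G_syn / Gp                # E' = E_syn*lambda*G_syn / G'
    return -(C / Gp) * np.log(1.0 - Vt / Ep)

def t_spike_euler(lam, dt=1e-3):
    Gp = G + lam * G_syn
    Ep = E_syn * lam * G_syn / Gp
    V, t = 0.0, 0.0
    while V < Vt:
        V += dt * (Ep - V) * Gp / C              # C dV/dt = -(V - E')G'
        t += dt
    return t

print(t_spike_case4(10.0), t_spike_euler(10.0))  # both about 8.9 msec
```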
Appendix B: Optimal Desynchronization Interval

For current inputs, the number of output spikes is

$$N_{sp}(T) = \frac{T + T_{rp}}{T_{rp} - \tau_m \log\!\left[1 - (N_t T/\tau_m N)\right]} \tag{B.1}$$

where $N_t = V_t/\Delta V$ is the number of simultaneous EPSPs necessary to reach threshold. If we define the dimensionless variables $a_1 = N/N_t$, $a_2 = \tau_m/T_{rp}$, and $y = N_t T/(N\tau_m)$, take the derivative of equation B.1, and set the resulting expression to zero, we obtain

$$\log(1 - y) = \frac{1}{a_2} - \frac{y + 1/(a_1 a_2)}{1 - y} \tag{B.2}$$
Note that $y$ depends only on the two dimensionless quantities $a_1$ and $a_2$. No closed-form expression exists for the solution of $y(a_1, a_2)$, which was therefore solved numerically. When $y(a_1, a_2)$ is graphed (not shown), it is almost independent of $a_1$ except for small values of $a_1$. Because of the linear relationship between $T_{opt}$ and $y$, the optimal desynchronization interval $T_{opt}$ is linear in $N$ for most parameter values, given constant $\tau_m$ and $T_{rp}$.
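The root of equation B.2 (as reconstructed above) is easily found numerically; the following sketch, with scipy's brentq as our choice of solver, recovers $T_{opt} = y\,a_1\tau_m$.

```python
import numpy as np
from scipy.optimize import brentq

def T_opt(N, Nt=60, tau_m=17.0, T_rp=2.0):
    """Solve equation B.2 for y, then T_opt = y * (N/Nt) * tau_m (msec)."""
    a1, a2 = N / Nt, tau_m / T_rp
    f = lambda y: np.log(1.0 - y) - 1.0 / a2 + (y + 1.0 / (a1 * a2)) / (1.0 - y)
    y = brentq(f, 1e-9, 1.0 - 1e-9)              # unique root in (0, 1)
    return y * a1 * tau_m

print(T_opt(1000))   # about 100 msec for current inputs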
Appendix C: Poisson Inputs

We here present a brief derivation of the mean first-passage time in equation 2.4 (for details see Ricciardi 1977). If a leaky integrator with time constant $\tau_m$ receives a stream of Poisson-distributed EPSPs of rate $\lambda$ and depolarization $\Delta V$, the transition probability density function satisfies the following equation:
$$f(V, t + \Delta t \mid u, t) = [1 - \lambda\Delta t]\,\delta(u_0 - V) + \lambda\Delta t\,\delta(u_1 - V) \tag{C.1}$$

with

$$u_0 = u\,e^{-\Delta t/\tau_m}, \qquad u_1 = \left(u\,e^{-\Delta t_1/\tau_m} + \Delta V\right) e^{-\Delta t_2/\tau_m}$$
The first term is associated with relaxation of the membrane potential if no input reaches the cell during time $\Delta t$, and the second with one EPSP reaching the cell at time $\Delta t_1$ (with $\Delta t_1 + \Delta t_2 = \Delta t$). In the limit of $\Delta t \to 0$ one can manipulate these expressions to obtain the following differential-difference equation:

$$\frac{\partial f(V,t)}{\partial t} = \frac{\partial}{\partial V}\left[\frac{V}{\tau_m}\,f(V,t)\right] + \lambda\left[f(V - \Delta V, t) - f(V,t)\right] \tag{C.2}$$
Assuming that the amplitude of a single EPSP is small compared to the threshold $V_t$ ($V_t/\Delta V \approx 60$), equation C.2 is expanded in a Taylor series. Keeping the first two terms leads to the Fokker-Planck equation:

$$\frac{\partial f(V,t)}{\partial t} = -\frac{\partial}{\partial V}\left[A_1(V)\,f(V,t)\right] + \frac{1}{2}\,\frac{\partial^2}{\partial V^2}\left[A_2(V)\,f(V,t)\right] \tag{C.3}$$
where $A_1(V) = -V/\tau_m + \lambda\Delta V$ in the first term accounts for the deterministic part (constant mean current and leakage), while $A_2(V) = \sigma^2 = \lambda\Delta V^2$ is the variance of the input stream and accounts for the fluctuations. The associated steady-state distribution $W(V)$ is then

$$W(V) = c\,\exp\!\left[-\frac{(V - \lambda\Delta V\tau_m)^2}{\lambda\Delta V^2\,\tau_m}\right] \tag{C.4}$$

where $c$ is a normalization constant.
The mean interspike interval (first-passage-time) is then obtained using Siegert's recursion formula (Ricciardi 1977):

$$T_{spike} = \int_{0}^{V_t} \frac{2}{A_2(z)\,W(z)}\left[\int_{-\infty}^{z} W(y)\,dy\right] dz \tag{C.5}$$
Joint application of equations C.3-C.5 leads to equation 2.4.
Acknowledgments

We wish to thank Ernst Niebur and William Softky for helpful comments. The research was supported by the Air Force Office of Scientific Research, the Office of Naval Research, and the National Institute of Mental Health. Marius Usher was supported by a Myron A. Bantrell Research Fellowship.
References

Abbott, L. F. 1990. A network of oscillators. J. Phys. A 23(16), 3835-3859.
Abbott, L. F., and van Vreeswijk, C. 1993. Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E 48(2), 1483-1490.
Abeles, M. 1982. Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci. 18, 83-92.
Abeles, M. 1991. Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., and Palm, G. 1989. Dynamics of neuronal firing correlation: Modulation of effective connectivity. J. Neurophysiol. 61(5), 900-917.
Bair, W., Koch, C., Newsome, W., and Britten, K. 1993. Power spectrum analysis of MT neurons in the awake monkey. In Computation and Neural Systems 92, F. Eeckman, ed., pp. 495-502. Kluwer Academic Publishers, Boston.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Bernander, Ö., Douglas, R. J., and Koch, C. 1992. A model of cortical pyramidal neurons. CNS Memo 16, California Institute of Technology, Pasadena, CA 91125.
Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13(3), 99-104.
Crick, F., and Koch, C. 1990. Towards a neurobiological theory of consciousness. Semin. Neurosci. 2, 263-275.
Crick, F., and Koch, C. 1992. The problem of consciousness. Sci. Am. 267(3), 152-159.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1991. An intracellular analysis of the visual responses of neurones in cat visual cortex. J. Physiol. 440, 659-696.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitböck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., König, P., Kreiter, A. K., and Schillen, T. B. 1992. Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci. 15(6), 218-226.
Freeman, W. J. 1975. Mass Action in the Nervous System. Academic Press, New York.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hansel, D., and Sompolinsky, H. 1992. Synchronization and computation in a chaotic neural network. Phys. Rev. Lett. 68(5), 718-721.
Knight, B. 1972. Dynamics of encoding in a population of neurons. J. Gen. Physiol. 59, 734-766.
Koch, C., and Schuster, H. 1992. A simple network showing burst synchronization without frequency locking. Neural Comp. 4, 211-223.
Kreiter, A. K., and Singer, W. 1992. Oscillatory neuronal responses in the visual cortex of the awake macaque monkey. Euro. J. Neurosci. 4, 369-375.
Larkman, A. U. 1991. Dendritic morphology of pyramidal neurones of the visual cortex of the rat: I. Branching patterns. J. Comp. Neurol. 306, 307-319.
Llinás, R. 1988. The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science 242, 1654-1664.
Manor, Y., Koch, C., and Segev, I. 1991. Effect of geometrical irregularities on propagation delay in axonal trees. Biophys. J. 60, 1424-1437.
McCormick, D. A., Connors, B. W., Lighthall, J. W., and Prince, D. A. 1985. Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol. 54(4), 782-806.
Milner, P. M. 1974. A model for visual shape recognition. Psychol. Rev. 81, 521-535.
Mirollo, R. E., and Strogatz, S. H. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50(6), 1645-1662.
Murthy, V. N., and Fetz, E. E. 1993. Effects of input synchrony on the response of a model neuron. In Computation and Neural Systems 92, F. Eeckman, ed., pp. 475-479. Kluwer Academic Publishers, Boston.
Nelson, J. I., Salin, P. A., Munk, M. H.-J., Arzi, M., and Bullier, J. 1992. Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Vis. Neurosci. 9, 21-37.
Ricciardi, L. M. 1977. Diffusion Processes and Related Topics in Biology. Springer-Verlag, Berlin.
Softky, W. 1994. Sub-millisecond coincidence detection in active dendritic trees. Neuroscience 58(1), 13-41.
Softky, W., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13(1), 334-350.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1991. Cooperative dynamics in visual processing. Phys. Rev. A 43(12), 6990-7011.
Tononi, G., Sporns, O., and Edelman, G. M. 1992. Reentry and the problem of integrating multiple cortical areas: Simulation of dynamic integration in the visual system. Cerebral Cortex 2, 310-335.
Toyama, K., Kimura, M., and Tanaka, K. 1981a. Cross-correlation analysis of interneuronal connectivity in cat visual cortex. J. Neurophysiol. 46(2), 191-201.
Toyama, K., Kimura, M., and Tanaka, K. 1981b. Organization of cat visual cortex as investigated by cross-correlation technique. J. Neurophysiol. 46(2), 202-214.
Ts'o, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6(4), 1160-1170.
Usher, M., Schuster, J., and Niebur, E. 1993. Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Comp. 5(4), 570-586.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Received March 4, 1993; accepted July 29, 1993.
Communicated by Peter König
Segmentation by a Network of Oscillators with Stored Memories

H. Sompolinsky
M. Tsodyks
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem, Israel 91904

We propose a model of coupled oscillators with noise that performs segmentation of stimuli using a set of stored images, each consisting of objects and a background. The oscillators' amplitudes encode the spatial and featural distribution of the external stimulus. The coherence of their phases signifies their belonging to the same object. In the learning stage, the couplings between phases are modified in a Hebb-like manner. By mean-field analysis and simulations, we show that an external stimulus whose local features resemble those of one or several of the stored objects generates a selective phase coherence that represents the stored pattern of segmentation.

1 Introduction
In most associative memory neural network models, information is encoded by the time-averaged levels of the neurons' activity. Thus, in these models, it is difficult to incorporate global relationships between parts of the memories, such as binding subsets of the network into distinct objects and separating them from the background. von der Malsburg and Schneider (1986) proposed that the temporal firing patterns of different neurons encode the global properties of the stimulus. This idea gained support from recent experiments (Gray et al. 1989; Eckhorn et al. 1988) that showed that neurons in the visual cortex of a cat exhibit oscillatory responses that are coherent over large distances in a manner that depends on the global properties of the stimulus. Several network models of oscillatory neurons that are capable of segmentation of external stimuli have been studied. In these models, neurons are synchronized if their local stimuli have similar features (e.g., similar orientation or color). Thus the boundaries between objects are defined by the variation of the spatial distribution of features. This, however, will be inappropriate in cases where the features of the external stimulus do not lend themselves to segmentation based on a simple criterion. This may occur for stimuli that incorporate multiple features or
complex textured objects. It is thus of interest to study models in which the segmentation criteria are affected by a learning process. Segmentation of coactive memorized patterns by an oscillatory network has been studied by Wang et al. (1990) and by Horn and Usher (1991). In their models, groups of neurons representing different memories inhibit each other so that they oscillate out of phase. These models are not capable of separating memorized images from their surrounding backgrounds. In addition, the use of nonlinear oscillatory units makes it difficult to assess the performance of these networks analytically. The present model is improved in these areas. Another advantage of the present model is its "cortical-like" architecture, which can incorporate both hard-wired connections and stored memory.

2 Architecture and Dynamics of the Network
Our model of coupled phase oscillators is based on the phenomenological oscillator model of Sompolinsky et al. (1990, 1991). Neurons are modeled as interacting oscillators whose amplitudes are determined completely by the local external stimulus. The phases are dynamic variables that evolve cooperatively. In the previous models of binding by oscillatory networks (König and Schillen 1991; Sporns et al. 1991; Sompolinsky et al. 1990), the degree of synchronization between a pair of neurons is a simple function of the difference between the local features encoded by the neurons. Instead, in the present model the connections are determined through Hebb-like learning, based on a set of memorized images, and may have no simple relation to the similarity of the local features of the corresponding neurons. Here, a memorized image consists of a two-dimensional array of local features. In the learning phase, part of the local features is labeled as an object (or objects), and the remainder is labeled as the background. Presentation of an image during the learning stage causes a modification of the connections between pairs of neurons that belong to the same object. In the recalling phase, the system is activated by an input that overlaps one or more stored objects. The pattern of connections causes the system to settle in a state where all the neurons coding for the same memorized object are coherent. The relative phases between the oscillators that represent different objects, as well as those belonging to the background, remain asynchronous. In this model, segmentation relies only on the stored information and does not require the existence of natural boundaries either in the stored images or in the recalling stimulus. Following Sompolinsky et al. (1990), we describe the neural activity by its instantaneous rate $P_r(t)$, which is the probability for the neuron at site r to emit a spike at time t. We assume that this probability oscillates with time according to the equation
\[ P_r(t) = V(r)\,[1 + A\cos\phi(r,t)] \tag{2.1} \]
The amplitude V(r) is taken as independent of time during a given stimulus. It is a fixed monotonic function of the time-averaged level of activity of the neuron at site r, and is determined by the properties of the current stimulus. The phases $\phi(r,t)$, which govern the temporal aspects of the neural activity, are assumed to obey the following equations:
\[ \dot{\phi}(r,t) = \omega + \eta(r,t) - \sum_{r' \neq r} J(r,r')\,\sin[\phi(r,t) - \phi(r',t)] \tag{2.2} \]
where $\omega$ is the frequency of the neural oscillations and $\eta(r,t)$ represents an external noise. The matrix elements $J(r,r')$ are the connection strengths between the phases of neurons at locations r and r'. As in the preceding work (Sompolinsky et al. 1990), they are postulated to obey a fast Hebb rule
\[ J(r,r') = V(r)\,W(r,r')\,V(r') \tag{2.3} \]
Equation 2.3 determines the effect of the current stimulus on the connection between the phases. The terms $W(r,r')$ specify the architecture of the connections, independent of the current stimulus. In the present model, they encode the memory patterns and are therefore modified, through a slow learning process, by past experience. The architecture of the model is reminiscent of that of the visual cortex. Neurons are labeled by $r \equiv (R, \theta)$, where R is the spatial two-dimensional coordinate and $\theta$ is the feature encoded by the neuron, taken here to be its preferred orientation. Neurons with the same R are called a "cluster." The preferred orientations $\theta$ are assumed to be uniformly distributed between 0 and $\pi$ within each cluster. The stimulus consists of an array of short, oriented bars. They are described by $\theta_R^0$, specifying the orientation of the external input at the cluster site R. The amplitude of activity of a neuron at $(R, \theta)$ induced by the stimulus is given by a "tuning curve" $V(\theta - \theta_R^0)$. The function $V(\theta)$ has a period of $\pi$. It is assumed to have a narrow positive maximum at $\theta = 0$. In order to have an unambiguous assignment of each local stimulus to an object, the phases of all the neurons in a cluster that are activated by a single bar should be strongly synchronized. This condition can be secured if the intracluster connections $W_{R,R}(\theta, \theta')$ are sufficiently strong [see Sompolinsky et al. (1990) for a formal proof]. We therefore write
\[ \phi_R(\theta, t) = \omega t + \psi_R(t) \tag{2.4} \]
where the cluster phase $\psi_R(t)$ represents the deviation of the phase within a cluster from the simple rotator $\omega t$. Likewise, the external noise is assumed to be spatially uniform within each cluster so that it depends only on R. Averaging equation 2.2 over the intracluster coordinates $\theta$, $\theta'$, one obtains
\[ \dot{\psi}_R(t) = \eta_R(t) - \sum_{R' \neq R} J_{R,R'}\,\sin[\psi_R(t) - \psi_{R'}(t)] \tag{2.5} \]
Figure 1: Stored images containing three geometric shapes as objects. The bars represent the local orientations of edges in the objects. The blank sites are the backgrounds.

where $J_{R,R'}$ is the effective interaction between the cluster phases. The effective interactions are given by
\[ J_{R,R'} = \sum_{\theta, \theta'} V(\theta - \theta_R^0)\, W_{R,R'}(\theta, \theta')\, V(\theta' - \theta_{R'}^0) \tag{2.6} \]
The cluster noise $\eta_R(t)$ has a variance $\langle \eta_R(t)\,\eta_{R'}(t') \rangle = 2T\,\delta_{R,R'}\,\delta(t - t')$, where T is the amplitude of the noise (the analog of temperature in statistical physics). The competition between $J_{R,R'}$ and the cluster noise determines the coherence between the different clusters. Note that we have assumed here that the external stimulus excites all the clusters in the network. The values of $W_{R,R'}(\theta, \theta')$ are determined by the learning rules detailed below.

3 The Learning Rule
For the sake of concreteness we will consider memorized images, each of which consists of a single object and a background. We define a set of P images of the form $\{\xi_R^\mu, \theta_R^\mu\}$, $\mu = 1, \ldots, P$. The variables $\xi_R^\mu$ encode the segmentation of the image. They take the values 0, 1, where $\xi_R^\mu = 1$ stands for "object" and 0 for "background." The orientations of the local edges within the object are encoded by the angles $0 < \theta_R^\mu < \pi$. Note that the $\theta_R^\mu$ are defined only on the $\mu$th object, that is, only for R such that $\xi_R^\mu = 1$. Examples are shown in Figure 1a-c. The presentation of the P images leads to Hebbian connections of the form
\[ W_{R,R'}(\theta, \theta') = \frac{1}{nN} \sum_{\mu=1}^{P} \xi_R^\mu\, \xi_{R'}^\mu\, \tilde{V}(\theta - \theta_R^\mu)\, \tilde{V}(\theta' - \theta_{R'}^\mu) \tag{3.1} \]
where $\tilde{V}(\theta)$ is a function of the amplitude $V(\theta)$; its form will be discussed below. The integer n is the number of neurons in a cluster and N is the number of clusters. Substituting equation 3.1 in equation 2.6 yields
\[ J_{R,R'} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_R^\mu\, \xi_{R'}^\mu\, A(\theta_R^0 - \theta_R^\mu)\, A(\theta_{R'}^0 - \theta_{R'}^\mu) \tag{3.2} \]
where
\[ A(\theta) = \frac{1}{n} \sum_{\theta'} V(\theta' - \theta)\, \tilde{V}(\theta') \tag{3.3} \]
The sum in equation 3.3 is over the n preferred orientations within a cluster. The form of $A(\theta)$ (or equivalently that of $\tilde{V}$) determines the stability of the memories. Therefore it has to be chosen according to the statistical distribution of the embedded images. In the present work we will focus on the simple case of memories with random orientation distribution; that is, we will assume that each $\theta_R^\mu$ is drawn at random from a uniform distribution between 0 and $\pi$. For this case, stability of the retrieved patterns requires that
\[ \sum_{\theta} \tilde{V}(\theta) = 0 \tag{3.4} \]
This equation guarantees that the local field (see equation 4.7) in the background that is generated by the neurons in the object sums up to zero and does not synchronize the background. Equation 3.4 implies that the angular average of $\tilde{V}(\theta)$ vanishes. An appropriate choice for $\tilde{V}(\theta)$ is, therefore,
\[ \tilde{V}(\theta) = V(\theta) - \frac{1}{n} \sum_{\theta'} V(\theta') \tag{3.5} \]
where again the summation is over all the n preferred orientations within the Rth cluster. Another important feature is the width of $A(\theta)$, denoted by $\sigma$, which is defined using the normalization $A(0) = 1$.
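To make equations 3.1, 3.4, and 3.5 concrete, here is a small Python sketch (our own construction; the image statistics, lattice size, cluster size, and the box-shaped tuning curve are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

n, N, P = 20, 64, 3                    # neurons per cluster, clusters, images
thetas = np.arange(n) * np.pi / n      # n preferred orientations in [0, pi)

def V(theta, a=0.1 * np.pi):
    """A periodic (period pi) tuning curve with a narrow peak at theta = 0."""
    d = np.angle(np.exp(2j * theta)) / 2.0      # wrap angle into (-pi/2, pi/2]
    return np.where(np.abs(d) < a, 1.0, 0.0)

def V_tilde(theta):
    """Zero-mean version of V (equation 3.5), so that equation 3.4 holds."""
    return V(theta) - V(thetas).mean()

# Random stored images: xi[mu, R] in {0, 1}; theta_mu[mu, R] random on objects.
rng = np.random.default_rng(1)
xi = (rng.random((P, N)) < 0.25).astype(float)
theta_mu = rng.uniform(0.0, np.pi, size=(P, N))

# Hebbian connections (equation 3.1): W[R, R', i, j] over orientation indices.
M = xi[:, :, None] * V_tilde(thetas[None, None, :] - theta_mu[:, :, None])
W = np.einsum('uRi,uSj->RSij', M, M) / (n * N)
```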
4 Recalling Stage: Mean-Field Theory

The segmentation of an external stimulus by the network is performed as follows. The external stimulus activates all the oscillators with local amplitudes that are determined by the local external angles $\theta_R^0$. The oscillators evolve according to the phase equations (equation 2.2). The strong excitatory connections within a cluster rapidly synchronize the phases within each cluster and generate the effective dynamics of the cluster phases, given by equations 2.4-2.5 with $J_{R,R'}$ of equation 3.2. If
part of the external stimulus matches a stored object, the clusters corresponding to this object will be synchronized, whereas the rest will not. To show that this is indeed the case, we analyze the performance of the system in the limit of large N and finite P using mean-field theory. We first consider the segmentation of a stimulus containing only one stored object. The generalization to the case of several objects in the stimulus is straightforward and is illustrated in the numerical example. For the sake of simplicity we will further assume that each of the objects in the memories occupies a fraction f, 0 < f < 1, of the N clusters. The retrieving stimulus corresponding to the object, say number 1, has an orientation distribution of the form
\[ \theta_R^0 = \begin{cases} \theta_R^1 & \text{if } \xi_R^1 = 1 \\ \text{random} & \text{otherwise} \end{cases} \tag{4.1} \]
Because the connection matrix (equation 3.2) is symmetric, the dynamics of the cluster phases (equation 2.5) may be written in the form of a Langevin equation with noise:
\[ \dot{\psi}_R = -\frac{\partial E\{\psi_R\}}{\partial \psi_R} + \eta_R(t) \tag{4.2} \]
where E is the following energy function:
\[ E\{\psi_R\} = -\frac{1}{2} \sum_{R \neq R'} J_{R,R'}\, \cos(\psi_R - \psi_{R'}) \tag{4.3} \]
Thus, the equilibrium properties of the system can be described by a Gibbs distribution $\exp(-\beta E\{\psi_R\})$ with $\beta = 1/T$; the amplitude of the noise T plays the role of the temperature in statistical physics. It is useful to introduce a spin $\mathbf{S}_R$, which is the two-dimensional unit vector $(\cos\psi_R, \sin\psi_R)$. The resemblance between the equilibrium state during retrieval and the stored objects is measured by the order parameters
\[ \mathbf{m}_\mu = \frac{1}{N} \sum_R \rho_R^\mu\, \langle \mathbf{S}_R \rangle \tag{4.4} \]
where the angular bracket $\langle \cdots \rangle$ denotes a thermal average and
\[ \rho_R^\mu = \xi_R^\mu\, A(\theta_R^0 - \theta_R^\mu) \tag{4.5} \]
From equation 4.1 it is seen that $\rho_R^1 = \xi_R^1$. The free energy per cluster can be calculated by integrating over the two-dimensional spins. The result is (see Appendix for details)
\[ F\{\mathbf{m}_\mu\} = \frac{1}{2} \sum_\mu \mathbf{m}_\mu^2 - T\, \langle\langle \ln I_0(\beta h) \rangle\rangle \tag{4.6} \]
where $I_0(x)$ is the Bessel function of zeroth order. The quantity h is the magnitude of the random two-dimensional vector $\mathbf{h}$ representing the local fields. It is equal to
\[ h = \left| \sum_{\mu=1}^{P} \rho^\mu\, \mathbf{m}_\mu \right| \tag{4.7} \]
where $\rho^\mu$ denotes a random variable with the statistics of $\rho_R^\mu$ (equation 4.5). The double angular brackets $\langle\langle \cdots \rangle\rangle$ denote averages over the $\rho^\mu$. The order parameters are determined by minimizing the free energy (equation 4.6) with respect to $\mathbf{m}_\mu$, which yields the following self-consistency equations:
\[ \mathbf{m}_\mu = \left\langle\!\left\langle \rho^\mu\, I_1(\beta h)\, \frac{\mathbf{h}}{h} \right\rangle\!\right\rangle \tag{4.8} \]
where $I_1(x) \equiv I_0'(x)/I_0(x)$ and h is given by equation 4.7. Solving these equations for $\mathbf{m}_\mu$, we find the following three states:
• $T/f > 0.5$. All $\mathbf{m}_\mu = 0$. The system is in an ergodic state, where all the phases fluctuate rapidly in time, in an asynchronous manner, leading to the vanishing of $\langle \mathbf{S}_R \rangle$ for all R as well as the vanishing of all pair correlations.

• $0.5\,\sigma(1-f)/(1-\sigma f) < T/f < 0.5$. Here $\mathbf{m}_\mu = \mathbf{m}\,\delta_{\mu,1}$, where the magnitude of $\mathbf{m}$ is nonzero, as shown in Figure 2, and the direction of $\mathbf{m}$ is arbitrary. Equation 4.8 yields the following equation for m:
\[ m = f\, I_1(m/T) \tag{4.9} \]
which has a nonzero solution for all T/f < 0.5 (a numerical sketch of equation 4.9 follows this list). By calculating the Hessian matrix of the free energy, we find that this solution is stable only for $T/f > 0.5\,\sigma(1-f)/(1-\sigma f)$. This state is the interesting segmentation state. In this phase all spins on sites corresponding to the object of image 1 are parallel, that is, $\langle \mathbf{S}_R \rangle = \mathbf{m}$, signaling their synchronization. The phases in the background remain asynchronous, that is, $\langle \mathbf{S}_R \rangle = 0$ for all sites such that $\xi_R^1 = 0$. Note that the synchronized object does not entrain the background, because the individual couplings between neurons in the object and the background (due to the other images) are weak and random in sign.
Figure 2: (a) External input, in which the triangle and the square of Figure 1 are embedded. (b,c) Results of simulation of the cluster dynamics (equation 2.5). The arrow at each site represents the two-dimensional vector $\langle \mathbf{S}_R \rangle$. (b) Long-time averaging. (c) Short-time averaging. For details see text.
• $T/f < 0.5\,\sigma(1-f)/(1-\sigma f)$. In this range the segmentation state is unstable to the appearance of other overlaps, in addition to $\mathbf{m}_1$. This state is denoted as a frozen state. The object remains well synchronized, but the spins of the background no longer fluctuate rapidly in time; they freeze in random directions that are determined by the realizations of all the images and the local orientations in the stimulus. This causes fixed phase relationships between various neurons in the background as well as between them and the object.
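The segmentation transition can be located numerically. The sketch below is our own illustration: it assumes the fixed-point form $m = f\,I_1(m/T)/I_0(m/T)$ reconstructed in equation 4.9, with $I_0$, $I_1$ the modified Bessel functions from scipy:

```python
import numpy as np
from scipy.special import iv   # modified Bessel functions I_nu(x)

def order_parameter(T, f, tol=1e-10, max_iter=10000):
    """Solve m = f * I1(m/T) / I0(m/T) (equation 4.9) by fixed-point iteration."""
    m = f   # start from the fully synchronized value
    for _ in range(max_iter):
        m_new = f * iv(1, m / T) / iv(0, m / T)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# The object synchronizes (m > 0) only below the critical point T/f = 0.5:
f = 0.25
for T_over_f in (0.3, 0.45, 0.55):
    print(T_over_f, order_parameter(T_over_f * f, f))
```

For T/f > 0.5 the iteration collapses to m = 0 (the ergodic state), while below it a nonzero m appears, consistent with the phase boundaries listed above.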
To summarize, we have found that as long as the width of the tuning curve is not too big (i.e., $\sigma$ is sufficiently small), there is a significant range of temperatures at which an external stimulus containing an object similar to one of the memorized objects will generate a segmentation state in which the neurons corresponding to the object are synchronized while the background is completely asynchronous. The relation between the configuration of $\langle \mathbf{S}_R \rangle$ and quantitative measures of the oscillating clusters' coherence will be given later in the context of a numerical example. So far we have discussed figure/ground segmentation. Segmentation of several nonoverlapping objects is also possible. It can be seen from equations 4.6 and 4.7 that the $\mathbf{m}_\mu$ that belong to nonoverlapping objects do not interact. Hence, if the external stimulus contains the nonoverlapping objects of several memorized images, the system will exhibit a segmentation state, similar to that described above, but with all the $\mathbf{m}_\mu$ that correspond to the objects that appear in the stimulus having the same magnitude and arbitrary directions. This means that the phases within each of the objects will be synchronized, while the relative phases between the objects will not, as will be demonstrated in the next section. Note that the restriction to zero overlap between the objects applies only to the objects that are present in the same stimulus, but not to the whole set of stored images, which may be overlapping. The available range of temperatures within which segmentation is possible and the required processing times depend on the criteria adopted for the segmentation. In principle, segmentation can be based either on the temporal regularity of the oscillations of the local clusters within the object, or on the spatial coherence of these oscillations. If the temporal coherence is used as a criterion, segmentation can be performed only in the intermediate temperature range of the segmentation state phase, and may require significant time averaging. If, however, the spatial coherence is taken as the dominant criterion, then segmentation can be performed very rapidly and even at low temperatures. This is because the phases in the background are spatially disordered, although they are temporally ordered. This issue will be demonstrated below in a concrete numerical example.

5 Numerical Example
An example of the performance of the system is shown in Figures 1 and 2. Three geometric objects were embedded as memories in a 16 × 16 lattice. They are shown in Figure 1. The sites with bars constitute the objects and hence have $\xi_R^\mu = 1$. The bars represent the local orientation of the memorized objects. The angles of the bars define $\theta_R^\mu$. The blank areas are the backgrounds, and hence correspond to $\xi_R^\mu = 0$. These three memorized
images define the Hebbian connections (equation 3.1) between pairs of neurons. We have chosen a "square" tuning curve $V(\theta)$ of the form $V = 1$ for $|\theta| < a$ and $V = -2a/(\pi - 2a)$ for $a < |\theta| < \pi/2$, with $a = 0.1\pi$. This yields a function $A(\theta)$ with a width $\sigma = 0.1\pi$. The external input in the retrieval phase is shown in Figure 2a. It contains two of the three embedded objects, the square and the triangle, as can be checked by comparing with Figure 1. The orientations at locations that do not correspond to the two objects are random. Note that because of the random orientations of local edges in the input, there is no simple criterion of segmentation other than the information stored in the network. We have assumed that the interaction between pairs within a cluster is sufficiently strong so that their phases become locked very rapidly. We have therefore simulated only the cluster dynamics (equations 2.5 and 3.2) at T = 0.2, which is within the range of the segmentation state. The results of the simulations are shown in Figure 2b. The arrow at each cluster represents a time average of $\mathbf{S}_R(t)$, which is equivalent to the thermal average $\langle \mathbf{S}_R \rangle$. The length of the arrow represents the magnitude of $\langle \mathbf{S}_R \rangle$, that is, the degree of the temporal coherence at this site. The orientation of the arrow represents the average phase of the oscillator at this site. Figure 2b shows the temporal and spatial coherence of the oscillators corresponding to each of the two memorized objects that are present in the external stimulus. The temporal coherence is indicated by the considerable length of the arrows, which implies that the phase of the oscillators (relative to the average motion $\omega t$) is relatively fixed in time. The spatial coherence is indicated by the fact that the angles of the arrows (i.e., the phases of the oscillators) corresponding to an object are all very similar. The relative phase between the two objects is arbitrary and is determined by their random initial conditions. On the other hand, the oscillators in the background have very little temporal order, as signaled by the small magnitudes of their local magnetization. Furthermore, the random directions of the small arrows in the background indicate that the weak, short-time local order is not spatially synchronized. It should be noted that because of the finite size of the system, the true thermal averages $\langle \mathbf{S}_R \rangle$ are always zero, which means that if the time of the simulations is very long (proportional to the size of the system), $\langle \mathbf{S}_R \rangle = 0$. However, for any reasonable network size there is a wide range of simulation times for which the results are robust. The results presented in Figure 2b are obtained by averaging over a time of approximately 200 time units. This relatively long averaging time is required for establishing the temporal disorder of the background oscillators. The temporal and spatial coherence of oscillators within the object appear faster, as demonstrated in Figure 2c, which presents the results of averaging over a window of 20 units of time, following a transient of 10 time units. Although the background displays a considerable temporal order at short times, it is distinguished from the object by its spatial disorder. Thus the actual time required for segmentation depends on the particular method
Figure 3: (a) Autocorrelograms computed through equation 5.1 using the values of the phases obtained from the numerical simulations. Full line corresponds to an oscillator (with $\omega = 0.1$) on a site of an object (in Fig. 2b); dashed line corresponds to a background site. (b) Crosscorrelogram of two oscillators belonging to the same object of Figure 2b. Continued next page.
of higher processing used by the system to "read out" the coherence pattern of neural oscillations. In addition, relating the time units in the phase dynamics to the frequency of the underlying oscillations requires modeling the full dynamics, including that of the amplitudes, which is beyond the scope of the present model. We now relate the results for $\langle \mathbf{S}_R \rangle$, as displayed in Figure 2b, to quantitative measures of the coherence of the underlying oscillatory clusters. According to Sompolinsky et al. (1990), the time-dependent part of the autocorrelogram $\langle P_R(t)\,P_R(t+\tau) \rangle$ is proportional to the autocorrelation of the phase
\[ C_{RR}(\tau) = \langle \cos[\psi_R(t) - \psi_R(t+\tau)] \rangle \tag{5.1} \]
Figure 3: (c) Crosscorrelogram of oscillators belonging to two different objects. (d) Crosscorrelogram of an oscillator in an object and one in the background. All times are in units of $1/\omega$.
Thus the long-time limit of $C_{RR}(\tau)$ measures the temporal phase coherence, and is related to the length of the displayed arrows through
\[ \lim_{\tau \to \infty} C_{RR}(\tau) = |\langle \mathbf{S}_R \rangle|^2 \tag{5.2} \]
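The arrow lengths and angles of Figure 2b,c, and the plateau relation reconstructed in equation 5.2, can be estimated from a simulated trajectory as in the following sketch (our own illustration; the window boundaries and the transient length are assumptions taken from the values quoted in the text):

```python
import numpy as np

def temporal_coherence(psi, dt=0.01, t0=10.0):
    """Arrow length/angle of Figure 2b and the plateau of C_RR(tau).

    psi : 1D array of one cluster's phase psi_R(t); the first t0 time units
    are discarded as a transient. Returns (|<S_R>|, mean phase, |<S_R>|^2),
    the last value being the long-time level of C_RR per equation 5.2.
    """
    w = psi[int(t0 / dt):]
    c, s = np.cos(w).mean(), np.sin(w).mean()
    length = np.hypot(c, s)
    return length, np.arctan2(s, c), length ** 2
```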
Hence, we conclude from Figure 2b that the autocorrelograms of the neurons belonging to an object will show oscillatory modulations for relatively long times, whereas those of the background neurons will decay rapidly to constant values. This is shown in Figure 3a. The spatial coherence is measured by the time-dependent part of the crosscorrelogram between neurons in different clusters, $\langle P_R(t)\,P_{R'}(t+\tau) \rangle$. It is proportional to
\[ C_{RR'}(\tau)\, \cos[\omega\tau + \chi_{RR'}(\tau)] \]
where $C_{RR'}(\tau)$ and $\chi_{RR'}(\tau)$ represent the amplitude of the coherence between the two oscillators and their phase shift, respectively. The definitions of $C_{RR'}(\tau)$ and $\chi_{RR'}(\tau)$ in terms of the time-dependent phases are $C_{RR'}(\tau) = \sqrt{a^2 + b^2}$ and $\chi_{RR'}(\tau) = \arctan(a/b)$, where $a = \langle \sin[\psi_R(t) - \psi_{R'}(t+\tau)] \rangle$ and $b = \langle \cos[\psi_R(t) - \psi_{R'}(t+\tau)] \rangle$. If the number of clusters is large, both quantities are time-independent; they are given by $C_{RR'} = |\langle \mathbf{S}_R \rangle|\,|\langle \mathbf{S}_{R'} \rangle|$ and by the angle between $\langle \mathbf{S}_R \rangle$ and $\langle \mathbf{S}_{R'} \rangle$, respectively.
Thus, the crosscorrelograms of neurons belonging to the same object will show pronounced oscillatory modulations with zero phase shifts, as shown in Figure 3b. The crosscorrelograms for neurons belonging to two different objects will show phase-shifted oscillatory modulation (which will be averaged to zero if different initial conditions are summed). This is demonstrated in Figure 3c. Finally, crosscorrelograms involving neurons in the background will be essentially constant in time (Fig. 3d).
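A sketch of the coherence measures just defined (our own illustration; np.arctan2 replaces the arctan(a/b) of the text so that the quadrant of the phase shift is preserved):

```python
import numpy as np

def cross_coherence(psi_R, psi_Rp, lag):
    """Coherence amplitude C and phase shift chi between two phase traces.

    Implements C = sqrt(a^2 + b^2), chi = arctan(a/b) with
    a = <sin[psi_R(t) - psi_R'(t+tau)]>, b = <cos[psi_R(t) - psi_R'(t+tau)]>,
    for a lag given in samples.
    """
    d = psi_R[:len(psi_R) - lag] - psi_Rp[lag:]
    a, b = np.sin(d).mean(), np.cos(d).mean()
    return np.hypot(a, b), np.arctan2(a, b)
```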
6 Discussion
It is important to stress the qualitative differences between the present model and networks of associative memory based on the Hopfield model. In the Hopfield model, the system flows into an attractor that specifies the state of all neurons in the system. Thus the whole image, including the background, is part of the memorized pattern. Even when memories are sparsely encoded, a faithful retrieval requires that the neurons outside the object be quiescent. This is also true of the oscillatory network models (Wang et al. 1990; Horn and Usher 1991), which were proposed recently for the simultaneous recall of several memories. These models are therefore inappropriate for retrieval of memories in cases where the background is activated by stimuli that are unrelated to the memorized image. In order to achieve retrieval of objects that are embedded in an active background, the network architecture must be more complex than the conventional models of associative memory. In order to distinguish the background from the collection of other stored patterns, one has to introduce an additional internal degree of freedom at each node. In our case these degrees of freedom are the preferred orientations $\theta$ within a cluster. We have focused here on the dynamics of the phases and assumed that the amplitudes of the oscillatory responses are fixed by the external stimulus. One deficiency of this model is that it does not have error-correcting power. In order to incorporate desired features such as object
completion, the model has to be extended so that the amplitudes can also evolve in time. In the present formulation of the model, segmentation is performed exclusively on the basis of memory, and does not depend on criteria based on the spatial distribution of features in the object or in the figure/ground boundaries. However, if the distribution of features in the memorized objects is not random, as was assumed here for simplicity, the pattern of connections that emerges via learning will encode some aspects of the statistical structure of the memorized objects. For example, if the local edges within the stored objects tend to be parallel to each other, the couplings between the neurons with similar preferred orientations will be reinforced, and hence any new object with parallel edges will be segmented, as was demonstrated by König et al. (1992). Thus, on the basis of a common learning rule, the present model is capable of segmentation using both regularities in the input as well as matching to memorized objects. The theoretical analysis of the model was limited to relatively few memories, that is, $P \ll N$. A preliminary study of the model in the case of $P \propto N$ reveals that the capacity of the model is roughly proportional to the number of clusters, N, divided by the width of the tuning curve (or equivalently, of A). This is to be expected, since this is the number of effective independent degrees of freedom in the network. The simple Hebb rule (equation 3.1) is adequate only for images with random objects. We have derived modified pseudoinverse learning rules that are appropriate for storing more interesting, correlated objects (Sompolinsky and Tsodyks 1993). In the present model the role of the external input during recall is not to determine the initial conditions but rather to modify the connectivity between the phases. This feature of the model is expressed in the form of the effective connection matrix between clusters (equation 3.2). The role of the intracluster degrees of freedom is to generate such an effective connectivity, starting from a simple Hebb rule; see equations 2.3 and 2.6. Thus the model incorporates two stages of "synaptic modification." One is responsible for the long-term storage of the memories (equation 3.1). The other occurs on a fast time scale and incorporates the effect of the recalling external stimulus (equation 2.6). Numerical and analytical studies have shown that the fast modification of the interactions between the phases of neuronal oscillators can also arise from the dynamics of the network and may not require a physical change in the underlying connections between the oscillators. For this to occur, one has to resort to the full dynamics of the system, where not only the phases but also the amplitudes of the oscillations are dynamic variables, as discussed in Sompolinsky et al. (1990), Grannan et al. (1993), and Hansel and Sompolinsky (1992). It would be interesting to study extensions of the present model that generate dynamically both the modification of the effective connections by the external stimulus and the desynchronizing noise.
Appendix

The free energy F of the spin system with the energy function of equation 4.3 is given by the following expression:
\[ e^{-\beta N F} = \int \prod_R d\mathbf{S}_R\, e^{-\beta E\{\mathbf{S}_R\}} \tag{A.1} \]
\[ E\{\mathbf{S}_R\} = -\frac{N}{2} \sum_\mu \left( \frac{1}{N} \sum_R \rho_R^\mu\, \mathbf{S}_R \right)^2 \tag{A.2} \]
where we have used equations 3.2 and 4.5, with $\mathbf{S}_R$ standing for a two-dimensional unit vector with an angle $\psi_R$. For convenience we have included in the energy the terms corresponding to R = R'. The integration over $\mathbf{S}_R$ means an integration over its angle $\psi_R$. Using a gaussian transformation to linearize the quadratic term in equation A.2, we obtain
\[ e^{-\beta N F} = \int \prod_\mu D\mathbf{m}_\mu \int \prod_R d\mathbf{S}_R\, \exp\left[ -\frac{\beta N}{2} \sum_\mu \mathbf{m}_\mu^2 + \beta \sum_\mu \mathbf{m}_\mu \cdot \sum_R \rho_R^\mu\, \mathbf{S}_R \right] \tag{A.3} \]
where $D\mathbf{m}_\mu$ is an integral over the space of two-dimensional vectors $\mathbf{m}_\mu$. The integration over the spins can now be performed independently for each site R, yielding
\[ e^{-\beta N F} = \int \prod_\mu D\mathbf{m}_\mu\, \exp\left[ -\frac{\beta N}{2} \sum_\mu \mathbf{m}_\mu^2 + \sum_R \ln I_0\!\left( \beta \left| \sum_\mu \rho_R^\mu\, \mathbf{m}_\mu \right| \right) \right] \tag{A.4} \]
where $I_0(x) = \int_0^{2\pi} (d\theta/2\pi)\, e^{x\cos\theta}$ is the Bessel function of zeroth order. The second sum in equation A.4 is a sum of N independent random variables. By the central limit theorem, this sum can be replaced by N times its average value in the limit of large N. The argument of the exponent is thus equal to $-N\beta F\{\mathbf{m}_\mu\}$, where F is given by equation 4.6. In the limit of large N the integrals over $\mathbf{m}_\mu$ are dominated by the value of the integrand at the saddle point, leading to equation 4.8.

Acknowledgments

We have benefited from helpful discussions with L. Abbott and D. Kleinfeld. The work of M. T. was supported in part by the Ministry of Science and Technology of Israel, and by a fellowship from Intel Electronics Ltd. H. S. is partially supported by the US-Israel Binational Science Foundation.
References

Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Spin glass models of neural networks. Phys. Rev. A 32, 1007-1018.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Grannan, E. R., Kleinfeld, D., and Sompolinsky, H. 1993. Stimulus dependent synchronization of neural assemblies. Neural Comp. 5, 550-569.
Hansel, D., and Sompolinsky, H. 1992. Synchronization and computation in a chaotic neural network. Phys. Rev. Lett. 68, 718-721.
Horn, D., and Usher, M. 1991. Parallel activation of memories in an oscillatory neural network. Neural Comp. 3, 31-43.
König, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp. 3, 155-166.
König, P., Janosh, B., and Schillen, T. B. 1992. Stimulus-dependent assembly formation of oscillatory responses: III. Learning. Neural Comp. 4, 666-681.
Sompolinsky, H., and Tsodyks, M. 1993. Unpublished.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1991. Cooperative dynamics in visual processing. Phys. Rev. A 43, 6990-7011.
Sporns, O., Tononi, G., and Edelman, G. M. 1991. Modeling perceptual grouping and figure-ground segmentation by means of active reentrant connections. Proc. Natl. Acad. Sci. U.S.A. 88, 129-135.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Wang, D. L., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Received January 11, 1993; accepted December 14, 1993.
Communicated by Haim Sompolinsky
Numerical Bifurcation Analysis of an Oscillatory Neural Network with Synchronous/Asynchronous Connections

Yukio Hayashi
ATR Human Information Processing Research Laboratories, 2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
One of the advantages of oscillatory neural networks is their dynamic links among related features; the links are based on input-dependent synchronized oscillations. This paper investigates the relations between synchronous/asynchronous oscillations and the connection architectures of an oscillatory neural network with two excitatory-inhibitory pair oscillators. Through numerical analysis, we show synchronous and asynchronous connection types over a wide parameter space for two different inputs and one connection parameter. The results are not only consistent with the classification of synchronous/asynchronous connection types in König's model (1991), but also offer a useful guideline on how to construct a network with local connections for a segmentation task.

1 Introduction
Recently, stimulus-specific synchronized oscillations have been found in the visual cortex of cats (Eckhorn et al. 1988; Gray and Singer 1989). These synchronized oscillations are interesting not only biologically, but also computationally, as a dynamic feature-binding mechanism of the brain. One dynamic feature-binding hypothesis states that synchronized oscillations quickly create dynamic links among related features, which are segmented from other features with desynchronized oscillations by means of either spatial (Shimizu and Yamaguchi 1989; Sompolinsky et al. 1991) or temporal (Wang et al. 1990; Horn et al. 1991; von der Malsburg and Buhmann 1992) segmentation. However, achieving dynamic information synthesis by synchronization requires bifurcation analysis to be discussed quantitatively as well as qualitatively, because the existing complex behaviors may be influenced both by different inputs and by different topological connection architectures. In a general case involving coupled oscillator models with E-I (excitatory-inhibitory) pair oscillators, a theoretical analysis of the synchronization is almost impossible. One reason is that the external input bias changes not only the number and position of equilibrium points, but also the characteristic frequency of each oscillator. In the case of two
oscillators with no input bias, theoretical analyses have been reported on in-phase/anti-phase oscillations (Kawato and Suzuki 1980; Nagashino and Kelso 1991). Khibnik et al. (1992) have reported a numerical bifurcation analysis similar to that found in this paper. Since their model is constructed using complex Wilson-Cowan oscillators (Borisyuk and Kirillov 1992), the bifurcation routes are very complex, even though they are two homogeneous oscillators with an "identical input bias." It should be noted that Hopf bifurcation (limit cycle ↔ torus) is common to every case of their connection architectures. It is important to note that having different input values is crucial to dynamic information synthesis by synchronization, because, as in feature-binding, we want to create dynamic synchronized clusters according to the inputs or the relations among them. König and Schillen (1991) have stated that there are two types of connections in a coupled neural network with different inputs: synchronous and asynchronous. They showed that the connection from an E-cell to the nearest-neighbor I-cell causes synchronism, and that the connection from an E-cell to the next-nearest-neighbor E-cell causes asynchronism. Their qualitative discussion is almost entirely based on several examples. As a guideline to constructing a network for a segmentation task, further quantitative analysis is required. By using numerical analysis, we have investigated the relations between synchronous/asynchronous oscillations and the connection architectures in a model with two E-I pairs. We show that, over a wide parameter space for two different inputs and one connection parameter, the excitatory-excitatory and inhibitory-inhibitory connections between two E-I pairs tend to be asynchronous, while the excitatory-inhibitory and inhibitory-excitatory connections tend to be synchronous. We also discuss briefly how to use the results to construct a network with local connections for a segmentation task.

2 Four Types of Connection Architectures
In this paper, we consider oscillatory neural networks with two symmetrically connected homogeneous E-I pairs. The four types of connection architectures are shown in Figure 1. The dynamic equations are as follows ($i, j = 1$ or 2, $j \neq i$):
(Case 1)
\[ \dot{x}_i = -x_i + G(W_{EE}\,x_i - K_{EI}\,y_i + I_i + W_{EE2}\,x_j), \qquad \dot{y}_i = -y_i + G(K_{IE}\,x_i) \tag{2.1} \]
(Case 2)
\[ \dot{x}_i = -x_i + G(W_{EE}\,x_i - K_{EI}\,y_i + I_i - K_{EI2}\,y_j), \qquad \dot{y}_i = -y_i + G(K_{IE}\,x_i) \tag{2.2} \]
(Case 3)
\[ \dot{x}_i = -x_i + G(W_{EE}\,x_i - K_{EI}\,y_i + I_i), \qquad \dot{y}_i = -y_i + G(K_{IE}\,x_i - W_{II2}\,y_j) \tag{2.3} \]
(Case 4)
\[ \dot{x}_i = -x_i + G(W_{EE}\,x_i - K_{EI}\,y_i + I_i), \qquad \dot{y}_i = -y_i + G(K_{IE}\,x_i + K_{IE2}\,x_j) \tag{2.4} \]
Figure 1: Four types of connection architectures, grouped into asynchronous cases (Cases 1 and 3) and synchronous cases (Cases 2 and 4). Symmetrical connections between homogeneous E-I pairs (excitatory cell: $x_i$, inhibitory cell: $y_i$, input bias: $I_i$) are denoted by a bold line. ⊣ and → denote inhibitory connections and excitatory connections, respectively.
Here, $x_i$ and $y_i$ denote the ith excitatory (E) and inhibitory (I) cells, respectively. $G(z)$ is a sigmoid function, $G(z) = (2/\pi)\arctan(z/a)$, where a controls the slope of $G(z)$. Each E-cell has the self-excitatory connection weight $W_{EE}$. For each E-I pair, the connection from the I-cell to the E-cell is inhibitory with connection weight $-K_{EI}$, and that from the E-cell to the I-cell is excitatory with connection weight $K_{IE}$. The connection weights have the same constant values and are set to $W_{EE} = 1.0$, $K_{EI} = K_{IE} = 2.0$ (the parameter a is set to a = 0.1) to satisfy the oscillatory condition of a one E-I pair oscillator (Hayashi 1992). The symmetrical connection between the two E-I pairs is excitatory with connection weight $W_{EE2}$ (Case 1) or $K_{IE2}$ (Case 4). On the other hand, it is inhibitory with connection weight $-K_{EI2}$ (Case 2) or $-W_{II2}$ (Case 3). $I_i$ is the input bias for the ith E-cell ($I_1 \neq I_2$). The variables $I_1$, $I_2$, and $W_{EE2}$ (or $K_{EI2}$, $W_{II2}$, and $K_{IE2}$) are the bifurcation parameters.
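As an illustration, here is a minimal Python sketch of the Case 1 dynamics and the integration procedure described in Section 3 (our own reconstruction of equation 2.1 from the scan; the state layout and helper names are assumptions):

```python
import numpy as np

W_EE, K_EI, K_IE, a = 1.0, 2.0, 2.0, 0.1   # weights and sigmoid slope from the text

def G(z):
    """Sigmoid G(z) = (2/pi) * arctan(z/a)."""
    return 2.0 * np.arctan(z / a) / np.pi

def deriv(s, I, W_EE2):
    """Case 1: E-E coupling W_EE2 between the two pairs; s = (x1, y1, x2, y2)."""
    x, y = s[0::2], s[1::2]
    xj = x[::-1]                        # the other pair's E-cell
    dx = -x + G(W_EE * x - K_EI * y + I + W_EE2 * xj)
    dy = -y + G(K_IE * x)
    out = np.empty(4)
    out[0::2], out[1::2] = dx, dy
    return out

def simulate(I1, I2, W_EE2, n_steps=60000, dt=0.01):
    """Fourth-order Runge-Kutta; initial values x_i(0) = I_i, y_i(0) = 0."""
    I = np.array([I1, I2])
    s = np.array([I1, 0.0, I2, 0.0])
    traj = np.empty((n_steps, 4))
    for t in range(n_steps):
        k1 = deriv(s, I, W_EE2)
        k2 = deriv(s + 0.5 * dt * k1, I, W_EE2)
        k3 = deriv(s + 0.5 * dt * k2, I, W_EE2)
        k4 = deriv(s + dt * k3, I, W_EE2)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[t] = s
    return traj
```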
3 Results of Numerical Analysis
This section shows that Cases 1 and 3 tend to be asynchronous, while Cases 2 and 4 tend to be synchronous, over a wide parameter space for two inputs and one connection weight. We also explain briefly the differences and similarities of bifurcation structures among the four cases. First, the two input biases are assumed to satisfy $I_1 > 0$, $I_1 > I_2$. Each input is varied from −1.0 to 1.0 in 0.2 increments. By exchanging cell numbers, the result is the same for $I_1 < I_2$ as for $I_1 > I_2$ (i.e., the result has symmetry), and, by a simple variable change ($\tilde{x}_i = -x_i$, $\tilde{y}_i = -y_i$, $\tilde{I}_i = -I_i$) using the fact that $G(z)$ is an odd function, the result is the same for $I_1 < 0$ as for $I_1 > 0$. For each combination of the input biases, each connection parameter is varied from 0.01 to 2.6 in 0.001 increments. With the fourth-order Runge-Kutta approximation ($\Delta t = 0.01$), there are 30,000 more iterations after 30,000 iterations for the transition phase from the initial values $x_i(0) = I_i$, $y_i(0) = 0$. Though the model has multistable solutions for initial states different from $x_i(0) = I_i$, $y_i(0) = 0$, only the case employing the above initial states is discussed below. We obtained the bifurcation diagrams according to whether the oscillations were frequency-locked or not, using both Fourier analysis of the waveform $x_i(t)$ and a Poincaré section on the projection plane $x_i - y_i$. In this paper, we do not discriminate between chaos and a quasiperiodic orbit. Figure 2 shows 3D diagrams, each with a roof on the start points of stable points (by the primary Hopf bifurcation) and a curved surface on the start points of limit cycles (by frequency-locking from quasiperiodic orbits). Figure 3 shows the 2D sections at $I_1$ = constant value for each case. In the figures, $(\alpha, \beta)$ denotes the winding numbers of a harmonic limit cycle (torus) on the two projection planes $x_1 - y_1$ and $x_2 - y_2$. The bifurcation structures of Cases 1 and 3 (or of Cases 2 and 4) are qualitatively similar. The quantitatively different points are the range of the frequency-locking areas (HLC regions) and the slope of the roof on the primary Hopf bifurcation points. However, the bifurcation structures of Cases 1 and 2 (or of Cases 3 and 4) differ under the roof. In Case 1 (or 3), there is only one SLC region between the roof and the curved surface. Under the surface, there are various orbits: several harmonic limit cycles with high winding numbers and quasiperiodic or chaotic orbits. If the two input values are very different (e.g., $|I_1| \approx 1$, $|I_2| \approx 0$), or both of the two input values are small (e.g., $|I_1| \approx |I_2| \approx 0$), the region is almost entirely occupied by the QPO region. In Case 2 (or 4), there are various orbits in the region between the roof and the curved surface, and there is a QPO region under the surface. Only the harmonics (1, 2), (2, 2) and the simple limit cycle coexist in the HLC regions. Since the QPO and CM regions become small as the input value $I_1$ decreases, the oscillations tend to be frequency-locked. Here, the synchronous and asynchronous oscillations are defined as a
Figure 2: Three-dimensional bifurcation diagrams. Each case is taken from its corresponding case in Figure 1. The roof and the curved surface of each diagram denote starting points of stable points and limit cycles, respectively.
Figure 3: 2D sections of each diagram ($I_1$ = const). The regions are SP, stable point; SLC, simple limit cycle; HLC, harmonic limit cycle; QPO, quasiperiodic orbit or chaos from two independent limit cycles; QPO', quasiperiodic orbit or chaos between the SLC and HLC regions; CM, complex mix-mode.
frequency-locking limit cycle and a frequency-unlocking quasiperiodic orbit or chaos, respectively. Thus, as the main result of this paper, we obtained the following difference between Cases 1 and 3 and Cases 2 and 4 in the weakly connected area (0 < each connection parameter value < 1).
[Cases 1 and 3] Since there are only several thin, isolated HLC regions in the wide QPO region, the oscillations tend to be asynchronous as the difference between the two input values increases.

[Cases 2 and 4] Since the QPO region is comparatively small within the SLC and HLC regions, the oscillations tend to be synchronous.

We should mention the points omitted from Figure 3. In Cases 1 and 3, the thin HLC regions are not drawn in the left and right wings of the QPO region, because the figures have symmetry. When the input bias $I_1$ is less than 0.8 in Case 3, the HLC region of the harmonics (3, 4) vanishes, although this is not shown. Unfortunately, we cannot analytically show this difference between Cases 1 and 3 and Cases 2 and 4. For nonperturbative coupling, a theoretical analysis of the difference in the measure of locking regions is very difficult (perhaps even impossible at present). Further study is required to investigate this difference. In addition, we show two other results as follows. The primary Hopf bifurcation points are approximately on a line (Case 1: $W_{EE2,hopf}$, Case 2: $K_{EI2,hopf}$), or are almost constant (Case 3: $W_{II2,hopf}$, Case 4: $K_{IE2,hopf}$), in the strongly connected area. The lines can be approximately analyzed, but this paper does not discuss these topics. As each connection parameter increases from zero, the common typical bifurcation route is: almost independent two cycles ↔ harmonic limit cycle (frequency locking or unlocking by saddle-node birth-death bifurcation on a Poincaré section) ↔ quasiperiodic orbit (secondary Hopf bifurcation) or chaos (breaking of the T² torus) ↔ simple limit cycle (primary Hopf bifurcation) ↔ stable point. Figure 4a shows an example of a typical bifurcation route from a torus (harmonic limit cycle) to a stable point through a simple limit cycle without harmonics. Figure 4b, on the other hand, shows an example of a complex bifurcation route from a complex orbit to a stable point (as in Cases 2 and 4). As shown in Figure 4b, the CM region has simple limit cycles for the two synchronized pairs, limit cycles on a line by another destabilized pair, and complex orbits. In order to discriminate these complex orbits, a more detailed analysis will be required.
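The frequency-locking test used for the diagrams can be sketched as follows (our own simplified criterion: it only checks 1:1 agreement of the two E-cells' dominant Fourier frequencies, not the general (α, β) winding numbers, and it assumes the trajectory layout of the earlier simulation sketch):

```python
import numpy as np

def dominant_freq(x, dt=0.01):
    """Return the frequency of the largest non-DC peak of the power spectrum."""
    spec = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(len(x), d=dt)
    return freqs[np.argmax(spec[1:]) + 1]

def is_frequency_locked(traj, dt=0.01, transient=30000, rtol=1e-2):
    """Crude locking test after discarding the 30,000-step transient phase."""
    x1, x2 = traj[transient:, 0], traj[transient:, 2]
    return np.isclose(dominant_freq(x1, dt), dominant_freq(x2, dt), rtol=rtol)
```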
4 Discussion
In the following, we discuss briefly how to use the bifurcation diagrams to construct a network with local connections for a segmentation task. Based on the results of the bifurcation diagrams of the oscillatory neural network with two E-I pairs, we investigated an example of orientation feature extraction from a pattern image of a handwritten number (Hayashi 1993). In order to synchronize the line segment pixels (corresponding to the E-I pairs) with similar input values in one direction on the drawing, the connection parameter $W_{EE2}$ between the nearest-neighbor
Figure 4: Trajectories on the projection plane $x_1 - y_1$. (a) Typical bifurcation route: QPO → (locking) → HLC → (unlocking) → QPO → (second Hopf) → SLC → (first Hopf) → SP. (b) Complex route: CM (simple limit cycle) → CM (complex orbit) → CM (limit cycle on line) → (first Hopf) → SP.
E-cells (Case 1) in the same direction was set to a value in a synchronous area of the bifurcation diagram. As a result, the synchronizations grew larger than the size of the nearest-neighbor connections, and the cross-
correlation between them became high with increasing connection parameter value. Also, the synchronization between pixels depended on the difference in input values. For the other connection architectures (Cases 2-4), a similar tendency in terms of synchronization existed. These results suggest that for multipair oscillators with local connections, the synchronous area in the parameter space has a qualitatively similar range. However, since such a nonlinear system is not additive, the general case of multipair oscillators without local connections may not have a similar result. For the general case of multipair oscillators, future studies, both theoretical and numerical, are required. The existence of chaos is also an open question (even in the case of two pair oscillators). While the initial states were assumed to be $x_i(0) = I_i$, $y_i(0) = 0$ in this paper, the model has multistable solutions for other initial states that differ from our assumptions. This is interesting from a theoretical point of view. However, from an application point of view, multiple solutions may be too complex for use in a segmentation task.

5 Conclusion
This study investigated the relations between synchronous/asynchronous oscillations and the connection architectures of an oscillatory neural network. Through numerical analysis, we have shown that over a wide parameter space for two different inputs and one connection weight, the excitatory-excitatory and inhibitory-inhibitory connections tend to be asynchronous, while the excitatory-inhibitory and inhibitory-excitatory connections tend to be synchronous. To conclude, the results are consistent with the classification of synchronous/asynchronous connection types in König's model. Furthermore, we have discussed how our results may be useful in the construction of a network with local connections for a segmentation task.

Acknowledgments
I would like to thank Dr. Peter Davis for his helpful discussions and improvements to this paper, and the reviewers for their valuable comments.

References
Borisyuk, R. M., and Kirillov, A. B. 1992. Bifurcation analysis of a neural network model. Biol. Cybern. 66, 319-325.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analyses in the cat. Biol. Cybern. 60, 121-130.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Hayashi, Y. 1992. Learning of continuously transformed pattern cycles by an oscillatory neural network. Proc. IJCNN '92, Baltimore IV, 122-127.
Hayashi, Y. 1993. Dependency of the connection architectures of oscillatory neural networks on synchronization. Proc. IJCNN '93, Nagoya II, 1434-1438.
Horn, D., Sagi, D., and Usher, M. 1991. Segmentation, binding, and illusory conjunctions. Neural Comp. 3(4), 510-525.
Kawato, M., and Suzuki, R. 1980. Two coupled neural oscillators as a model of the circadian pacemaker. J. Theor. Biol. 86, 547-575.
Khibnik, A. I., Borisyuk, R. M., and Roose, D. 1992. Numerical bifurcation analysis of a model of coupled neural oscillators. Int. Series Num. Math. 104.
König, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization and II. Desynchronization. Neural Comp. 3(2), 155-177.
Nagashino, H., and Kelso, J. A. 1991. Bifurcation of oscillatory solutions in a neural oscillator network model. Proc. of NOLTA '91, 119-122.
Shimizu, H., and Yamaguchi, Y. 1989. How animals understand the meaning of indefinite information from environments. Prog. Theor. Phys. Suppl. 99, 404-424.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1991. Phase coherence and computation in a neural network of coupled oscillators. In Nonlinear Dynamics and Neuronal Networks, H. G. Schuster, ed., pp. 113-130. VCH, Germany.
von der Malsburg, C. P., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242.
Wang, D., von der Malsburg, C. P., and Buhmann, J. 1990. Pattern segmentation in associative memory. Neural Comp. 2(1), 94-106.
Received December 14, 1992; accepted November 22, 1993.
Communicated by Wyeth Bair
Analysis of the Effects of Noise on a Model for the Neural Mechanism of Short-Term Active Memory

J. Devin McAuley
Joseph Stampfli
Department of Computer Science and Department of Mathematics, Indiana University, Bloomington, IN 47405 USA
Zipser (1991) showed that the hidden unit activity of a fully recurrent neural network model, trained on a simple memory task, matched the temporal activity patterns of memory-associated neurons in monkeys performing delayed saccade or delayed match-to-sample tasks. When noise, simulating random fluctuations in neural firing rate, is added to the unit activations of this model, the effect on the memory dynamics is to slow the rate of information loss. In this paper, we show that the dynamics of the iterated sigmoid function, with gain and bias parameters, is qualitatively very similar to the tonic response properties of Zipser's multiunit model. Analysis of the simpler system provides an explanation for the effect of noise that is missing from the description of the multiunit model.

1 Introduction
Single-unit recording studies conducted over the past 20 years have consistently suggested that information pertaining to a cued stimulus can be stored temporarily as the tonic activity of a population of neurons (Fuster and Alexander 1971; Gottlieb et al. 1989). These firing patterns are similar across modality-specific cortical regions, suggesting a general underlying memory mechanism. Zipser (1991) showed that the activity of hidden units in a fully recurrent neural network, trained on a very simple memory task, matched the qualitative temporal activity patterns of some of these memory-associated neurons. He also described a seemingly paradoxical property of this model: simulated random fluctuations in neural firing rate (noise) can slow the rate of information loss. In the present work, we show that the dynamics of an iterated sigmoid function, with gain and bias parameters, can mimic the temporal activation patterns of units in the Zipser model, which show tonic response to a cued stimulus. A mathematical analysis of the single-unit model is tractable and provides an explanation for why noise (on average) can improve stimulus retention. We believe that this explanation extends to
Zipser's multiunit model because both models have qualitatively similar dynamics.

2 Zipser Model
Zipser's short-term active memory model consists of two linear input units and a number of fully connected sigmoid units representing short-term memory. The input units, which have connections to all memory nodes, are a binary cue $x_c$ and a real-valued stimulus $x_s$. The unit equation for all memory nodes is

$$y_i(t+1) = \phi\left( w_{ic}\,x_c(t) + w_{is}\,x_s(t) + \sum_j w_{ij}\,y_j(t) + \theta_i + X_i(t) \right)$$

where the sigmoid function is

$$\phi(x) = \frac{1}{1 + e^{-x}}$$

The activation of each memory unit $i$ is a weighted sum of the inputs, the activations of all of the memory units, and a bias term $\theta_i$, squashed by the sigmoid-shaped function $\phi(x)$. The cue and stimulus inputs and weights are subscripted with $c$ and $s$, respectively. A gaussian noise term $X_i(t)$ (with zero mean and standard deviation $\nu$) is added during testing trials only, to simulate random neural excitation. The training task, shown in Figure 1A, is to store a cued intensity value in memory for an unspecified duration. During a training sequence, the cue is initialized to 1.0 and a stimulus is selected from the interval [0, 1]. The "output" unit of the network is trained using the real-time recurrent learning algorithm of Williams and Zipser (1989) to autoassociate the cued stimulus value for a random number of time steps. The biases $\theta_i$ remain fixed at negative values, typically between -3.5 and -1.0. Between cued stimuli, the cue unit is set to 0.0 and the stimulus unit varies randomly. The typical response of the network after training is to produce a brief peak to each event of stimulus-plus-cue, to remain at the approximate stimulus value, and then to decay slowly (see Fig. 1B). As the interstimulus interval becomes sufficiently large, the activations approach a stable equilibrium, indicating that the network has "forgotten" the stimulus value. The basis for memory in this model is the slow relaxation to an attractor following presentation of the stimulus. This contrasts with Hopfield memory models (Hopfield 1982) that use attractors to store memory items; that is, input pattern processing converges to a stable activation pattern that is most closely associated with the input. McAuley et al. (1992) have investigated the behavior of the Zipser model by testing its response on a same-different (roving-level) intensity discrimination task (Durlach and Braida 1969). They replicated Zipser's
Figure 1: (A) Cue, stimulus, and teacher values for a hypothetical training sequence. (B) The output unit peaks in response to each stimulus and cue, and is able to roughly hold the stimulus value by slowly relaxing to an attractor.

observation that a noise term $X_i(t)$, added to unit activations during testing trials, can improve (on average) the retention of input by slowing the decay to an attractor.
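As a concrete illustration of the unit equation above, the following sketch runs the forward dynamics only. The weights are random placeholders rather than trained values (in the paper they are obtained with the real-time recurrent learning algorithm of Williams and Zipser 1989), so the sketch shows the form of the update and where the noise term $X_i(t)$ enters, not a working memory.

```python
# Minimal sketch of the memory-node update; random (untrained) weights.
import numpy as np

rng = np.random.default_rng(0)
n = 9                                  # memory units (cf. Section 4)
W = rng.normal(0.0, 0.5, (n, n))       # recurrent weights w_ij
w_c = rng.normal(0.0, 0.5, n)          # cue weights w_ic
w_s = rng.normal(0.0, 0.5, n)          # stimulus weights w_is
theta = np.full(n, -2.5)               # biases, fixed as in Section 4

def step(y, x_c, x_s, nu=0.0):
    net = w_c * x_c + w_s * x_s + W @ y + theta
    noise = rng.normal(0.0, nu, n)     # gaussian X_i(t), testing only
    return 1.0 / (1.0 + np.exp(-(net + noise)))

y = np.zeros(n)
y = step(y, x_c=1.0, x_s=0.6)          # cue on: present stimulus 0.6
for _ in range(10):                    # delay period: cue off, noise on
    y = step(y, x_c=0.0, x_s=rng.random(), nu=0.05)
```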
3 Single-Unit Model
To better understand this behavior, we study a single-unit approximation of this system. Let

$$y(t+1) = \phi[y(t)] = \frac{1}{1 + e^{-\gamma[y(t) + \theta]}} \eqno(3.1)$$

where $\gamma$ is a gain term and $\theta$ is a bias term. By initializing $y(0)$ as the "to-be-remembered" stimulus, this equation is a primitive model of short-term memory that we can compare directly to the performance of the Zipser model. Finding equilibria for this system for different values of gain and bias requires numerical techniques such as Newton's method for approximating roots (Atkinson 1978). However, without "number crunching" we can investigate the qualitative attractor dynamics (number and type of equilibria) by comparing a graph of $y(t+1)$ versus $y(t)$ and the line $y(t+1) = y(t)$, for different gain and bias values (see Fig. 2). Parameter values for gain and bias that form a boundary between one- and two-attractor systems can be found explicitly by observing that systems on this boundary have a saddle equilibrium $\bar{x}$ that is tangent to the diagonal line $y(t+1) = y(t)$ (see Fig. 2B and C). At such tangent equilibrium points, $\phi'(\bar{x}) = 1$. This information, combined with the equilibrium
Figure 2: Attractor dynamics of the iterated sigmoid function for different values of gain and bias. Each panel specifies the system's equilibria as intersections between $y(t+1)$ versus $y(t)$ and the line $y(t) = y(t+1)$. In (A) ($\gamma = 5$, $\theta = -0.319$), the system has one attractor. In (B) ($\gamma = 10$, $\theta = -0.319$), the system has one attractor near 1.0 and a saddle point. In (C) ($\gamma = 5$, $\theta = -0.531$), the system has an attractor that is near 0.0 and a saddle point. In (D) ($\gamma = 10$, $\theta = -0.531$), the system has two attractors and one unstable equilibrium, which acts as a threshold.

equation, can be used to define an expression for the bias as a function of the gain:

$$\theta = -\frac{1 \pm \sqrt{1 - 4/\gamma}}{2} - \frac{1}{\gamma}\,\ln\left[\frac{2}{1 \pm \sqrt{1 - 4/\gamma}} - 1\right] \eqno(3.2)$$
This curve in gain-bias space is shown in Figure 3. Points outside the curve configure one-attractor models and points inside the curve configure two-attractor models. The two-attractor systems also have an unstable equilibrium, which acts as a threshold. Stimuli $y(0)$ above this threshold converge to an upper attractor, while stimuli below this threshold converge to a lower attractor.
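Equation 3.2 and the attractor counts in Figure 2 are easy to verify numerically. In the sketch below the helper names are ours; equilibria are counted from sign changes of $\phi(y) - y$ on a grid rather than with Newton's method.

```python
# Sketch: evaluate the boundary curve 3.2 and count equilibria of the
# iterated map y(t+1) = phi(gamma * (y(t) + theta)).
import numpy as np

def phi(y, gamma, theta):
    return 1.0 / (1.0 + np.exp(-gamma * (y + theta)))

def boundary_biases(gamma):
    # the two branches of equation 3.2 (real only for gamma > 4)
    for sign in (+1.0, -1.0):
        ybar = 0.5 * (1.0 + sign * np.sqrt(1.0 - 4.0 / gamma))
        yield -ybar - np.log(1.0 / ybar - 1.0) / gamma

def n_equilibria(gamma, theta, grid=200001):
    y = np.linspace(0.0, 1.0, grid)
    g = phi(y, gamma, theta) - y
    return int(np.sum(np.sign(g[:-1]) != np.sign(g[1:])))

print(sorted(boundary_biases(10.0)))  # approx [-0.681, -0.319]; cf. Fig. 2B
print(n_equilibria(5.0, -0.319))      # panel A: 1 (a single attractor)
print(n_equilibria(10.0, -0.531))     # panel D: 3 (two attractors + threshold)
```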
Figure 3: Bifurcations in the dynamics of the single-unit sigmoid model as a function of the gain and bias parameters. The four equilibrium cases described in the previous figure are labeled as A, B, C, and D.
As the gain increases, the upper and lower attractors approach 1.0 and 0.0, respectively.

4 Comparing Models
In this section, we show that the single-unit sigmoid model, with appropriately chosen gain and bias parameters, can approximate the “output” response of a nine-processing-unit version of Zipser’s memory model. The single-unit model can also approximate the tonic response properties of the hidden units, which are similar to the output unit, and which Zipser has linked to the temporal activity patterns of memory-associated neurons. The Zipser model was trained using the procedure described in Section 2, with biases fixed at -2.5. Figure 4 compares memory response, with and without noise, for the Zipser model at time step 10 (left panel),
Figure 4: Memory response of the Zipser model at time step 10 (left) compared with the memory response of the single-unit sigmoid model at time step 7 (right). Each cross symbol represents a data point and shows the noisy response to a fixed stimulus. For comparison, the diagonal line (stimulus = trace) shows perfect memory. The performance of both memory models is qualitatively very similar, although the memory traces decay faster in the single-unit case.

and the single-unit model at time step 7 (right panel). Stimulus value is plotted along the abscissa and the corresponding memory trace (activation of the output unit at time step $n$) is plotted along the ordinate. Perfect memory is depicted by the diagonal line (stimulus = trace). Both systems have two attractors. For the single-unit case, $\gamma = 6.0$ and $\theta = -0.5$. These gain and bias parameters were selected only to qualitatively match the memory response of Zipser's model. Although a better fit to the data was certainly possible, it was not the goal of this comparison. For both models, memory traces for stimuli below a threshold relax toward an attractor located near 0.0. Above this threshold, memory traces relax toward an attractor near 1.0. To examine the effect of noise, we added a gaussian random variable $X_i(t)$ with a mean of 0.0 and a standard deviation of 0.05 to the output of each unit $i$ on each time step. Memory response was sampled 20 times for each stimulus value after 10 time steps (7 for the single-unit model), and then averaged across trials. Each cross symbol in Figure 4 represents a data point, indicating the average effect of noise on memory for a fixed stimulus value after a fixed number of time steps. With noise, both models maintain a better approximation of a range of stimulus values than without noise; that is, for fixed stimulus regions, the noisy-memory data are closer to the diagonal (perfect memory) than the curves showing memory performance without noise. In the next section, we analyze the
single-unit system to explain the effect of noise on memory. This will enable us to specify the stimulus regions for which noise will improve retention.

5 Analysis
The average effect of noise is described here as

$$\bar{\phi}_\nu(x) = \frac{1}{n}\sum_{i=1}^{n} \phi[x + X_i(t)] \eqno(5.1)$$

where $n$ is the number of trials that are averaged over and $X_i(t)$ is the noise value on trial $i$. For this analysis, we assume that $X_i(t)$ is a Bernoulli probability function on the discrete set $\{-\nu, \nu\}$; that is, $X_i(t)$ is either $\nu$ or $-\nu$ with probability 0.5. There are two cases to consider:

$$\phi_{+\nu}(x) = \phi(x + \nu) \eqno(5.2)$$
and

$$\phi_{-\nu}(x) = \phi(x - \nu) \eqno(5.3)$$

As $n$ becomes large, the ratio of the $\phi_{+\nu}(x)$ instances to the $\phi_{-\nu}(x)$ instances will approach 1.0. Consequently, equation 5.1 can be simplified to

$$\bar{\phi}_\nu(x) = \frac{\phi(x + \nu) + \phi(x - \nu)}{2} \eqno(5.4)$$

Suppose that the single-unit model $\phi(x)$ is a linear function; then, by the principle of superposition, averaging will exactly cancel the effect of noise:

$$\bar{\phi}_\nu(x) = \phi(x) \eqno(5.5)$$

However, for the single-unit sigmoid model, the principles of linearity do not apply and, consequently, the effect of noise is not necessarily cancelled by averaging. A summary of all six cases is provided in Table 1. In this table, we let
$$\Delta = \phi(x) - x \eqno(5.6)$$

and

$$\Omega = \bar{\phi}_\nu(x) - \phi(x) \eqno(5.7)$$

where $x$ is the stimulus and $\phi(x)$ is its memory trace after one iteration. The sign of $\Delta$ indicates whether successive iterations of $\bar{\phi}_\nu(x)$ are moving toward an attractor that is above or below the initial stimulus $x$. Positive $\Delta$ implies that iterations of $\bar{\phi}_\nu(x)$ are converging toward an attractor that has a value larger than the stimulus. Negative $\Delta$ implies the opposite.
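The signs of $\Delta$ and $\Omega$ for a particular stimulus are straightforward to compute. The sketch below uses the large-$n$ simplification 5.4 for the noise average, with $\gamma = 6$, $\theta = -0.5$, and $\nu = 0.15$ as in the two-attractor example discussed later.

```python
# Sketch: compute Delta (5.6) and Omega (5.7) for a stimulus x, using
# the simplification 5.4 for the Bernoulli-noise average.
import numpy as np

def phi(x, gamma=6.0, theta=-0.5):
    return 1.0 / (1.0 + np.exp(-gamma * (x + theta)))

def delta_omega(x, nu=0.15):
    avg = 0.5 * (phi(x + nu) + phi(x - nu))  # equation 5.4
    return phi(x) - x, avg - phi(x)          # Delta and Omega

for x in (0.3, 0.6, 0.9):
    d, o = delta_omega(x)
    effect = "faster" if d * o > 0 else "slower"  # same vs. opposite signs
    print(x, round(d, 4), round(o, 4), effect)
```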
Table 1: The Nonlinear Effect of Noise on Iterations of the Single-Unit Sigmoid Model.

Case   Delta   Omega   Rate of information loss
1      +       +       Faster
2      +       -       Slower
3      -       +       Slower
4      -       -       Faster
5      +       0       Same
6      -       0       Same
The sign of $\Omega$ indicates the direction noise (on average) pushes iterations of $\bar{\phi}_\nu(x)$. Positive $\Omega$ increases $\bar{\phi}_\nu(x)$; negative $\Omega$ decreases $\bar{\phi}_\nu(x)$. For each case, we show the average effect of noise on the rate of information loss (memory retention). If $\Delta$ and $\Omega$ have the same sign, then noise degrades the memory trace of stimulus $x$ because it relaxes the system toward an attractor at a faster rate than without noise (cases 1 and 4). If $\Delta$ and $\Omega$ have opposite signs, then noise sustains the memory trace by slowing down the relaxation rate (cases 2 and 3). When $x = -\theta$, it can be shown that $\Omega = 0$ (cases 5 and 6); this is the point at which $\Omega$ switches sign. The $\Delta$ term changes sign at stable and unstable equilibria. Above and below attractors, $\Delta$ is negative and positive, respectively. The opposite is true for unstable equilibria. A point in gain-bias space fixes the number and location of equilibria and hence determines the stimulus ranges for which noise (on average) will improve retention. The six different configurations are enumerated in Figure 5A. For one-attractor models (cases 1, 2, and 3), noise improves retention of stimuli that are between this attractor and $x = -\theta$. For two-attractor models (cases 4, 5, and 6), noise improves retention for stimuli between these attractors, except for the stimulus region bounded by the unstable equilibrium and $x = -\theta$. In Figure 5B, we have fixed the gain and bias of the single-unit model at 6 and -0.5, respectively. This model is an instance of case 6, but serves to summarize cases 4 and 5 as well. If we choose to "load" a stimulus value of 0.6 into the memory of this model, then noise should improve retention of this stimulus because the value is between the two attractors. Figure 5B compares 10 iterations of the functions $\phi(x)$ and $\bar{\phi}_\nu(x)$ for stimulus (initial $x$) = 0.6 and $\nu = 0.15$. As expected, the model with noise (dotted line) converges to the upper attractor at a slower rate than the model without noise (solid line). In Figure 5C, we illustrate the opposite effect. The gain and the bias are fixed at 3.8 and -0.5, respectively. This model is an instance of case 1, but also illustrates the properties of cases 2 and 3. In contrast to Figure 5B, noise added to this model after loading a stimulus value of 0.6 should
Figure 5: (A) Enumeration of the stimulus intervals for which noise slows the rate of information loss, as a function of memory dynamics: A indicates an attractor, U indicates an unstable equilibrium, and B indicates the point $x = -\theta$. Noise improves retention for stimuli within the dark shaded regions. The model in (B) is an instance of case 6. It compares memory performance with and without noise for a stimulus of 0.6. The model in (C) is an instance of case 1. It compares memory performance with and without noise for a stimulus of 0.8.
speed up the trace decay. Moreover, as an instance of case 1 models, noise hurts the retention of values in the entire stimulus range. Figure 5C compares 10 iterations of $\phi(x)$ and $\bar{\phi}_\nu(x)$ for stimulus = 0.8 and $\nu = 0.15$. As expected, the model with noise (dotted line) converges to the attractor at a faster rate than the model without noise (solid line).
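The two comparisons in Figure 5B and 5C can be reproduced qualitatively by direct iteration. In the sketch below the noise is the Bernoulli $\{-\nu, +\nu\}$ term of the analysis, and the noisy trace is averaged over many trials.

```python
# Sketch: iterate the single-unit model with and without Bernoulli
# noise and compare the (trial-averaged) traces after 10 steps.
import numpy as np

def phi(y, gamma, theta):
    return 1.0 / (1.0 + np.exp(-gamma * (y + theta)))

def trace(y0, gamma, theta, nu=0.0, steps=10, trials=500, seed=1):
    rng = np.random.default_rng(seed)
    ys = np.full(trials, y0)
    for _ in range(steps):
        noise = nu * rng.choice([-1.0, 1.0], size=trials)
        ys = phi(ys + noise, gamma, theta)
    return ys.mean()

# case 6 (gamma = 6, theta = -0.5): stimulus 0.6 lies between the two
# attractors, so the noisy trace approaches the upper attractor more slowly
print(trace(0.6, 6.0, -0.5), trace(0.6, 6.0, -0.5, nu=0.15))
# case 1 (gamma = 3.8, theta = -0.5): a single attractor; noise speeds decay
print(trace(0.8, 3.8, -0.5), trace(0.8, 3.8, -0.5, nu=0.15))
```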
6 Conclusions
We have shown that a single sigmoid unit, with gain and bias parameters, can approximate the tonic response properties of processing units in a fully recurrent network model for the neural mechanism of short-term active memory (Zipser 1991). The stimulus regions for which noise slows the rate of information loss have been shown to vary predictably as a function of the gain and bias parameters. Thus, the surprising effect of noise has a rather straightforward explanation in the nonlinear dynamics of the sigmoid function. For one-attractor models (low gain), the stimulus region for which noise improves retention is small or nonexistent (as in case 1). For two-attractor models (high gain), the stimulus region for which noise improves retention is much larger (between the two attractors) and continues to increase in size with further increases in gain. However, the memory of large-gain models is inherently poor. This suggests that an "optimal" balance between good inherent retention and sizable stimulus regions that are helped by noise may lie near the bifurcation boundaries between one-attractor and two-attractor models. We have yet to quantify a measure of "how much" noise improves retention. In general, this research suggests one way that a nervous system may take advantage of noise inherent in the system, rather than be hindered by it as generally assumed, to better represent and process information.
Acknowledgments

The authors are grateful to Sven Anderson, Fred Cummins, Jason Holt, Gary Kidd, Robert Port, Catherine Rogers, James Townsend, and Charles Watson for their critical comments on earlier versions of this manuscript and for helpful discussions. This research was supported by ONR Grant N00014-91-J-1261.
References

Atkinson, K. 1978. An Introduction to Numerical Analysis. Wiley, New York.
Durlach, N., and Braida, L. 1969. Intensity perception. I. Preliminary theory of intensity resolution. J. Acoust. Soc. Am. 46(2), 372-383.
Fuster, J., and Alexander, G. 1971. Neuron activity related to short-term memory. Science 173, 652-654.
Gottlieb, Y., Vaadia, E., and Abeles, M. 1989. Single unit activity in the auditory cortex of a monkey performing a short term memory task. Exp. Brain Res. 74, 139-148.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
McAuley, J. D., Anderson, S., and Port, R. 1992. Sensory discrimination in a short-term trace memory. In Proceedings of the Fourteenth Annual Meeting of the Cognitive Science Society, pp. 136-140. Erlbaum, Hillsdale, NJ.
Williams, R., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270-280.
Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term active memory. Neural Comp. 3, 179-193.
Received June 16, 1993; accepted November 5, 1993.
Communicated by Laurence Abbott
Reduction of Conductance-Based Models with Slow Synapses to Neural Nets

Bard Ermentrout
Dept. of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260 USA
The method of averaging and a detailed bifurcation calculation are used to reduce a system of synaptically coupled neurons to a Hopfield type continuous time neural network. Due to some special properties of the bifurcation, explicit averaging is not required and the reduction becomes a simple algebraic problem. The resultant calculations show one how to derive a new type of “squashing function” whose properties are directly related to the detailed ionic mechanisms of the membrane. Frequency encoding as opposed to amplitude encoding emerges in a natural fashion from the theory. The full system and the reduced system are numerically compared. 1 Introduction
The appearance of large scale modeling tools for ”biophysically realistic” models in neurobiology has led to increasingly complicated systems of nonlinear equations that defy any type of mathematical or heuristic analysis (Wilson and Bower 1989; Traub and Miles 1991). Supporters of this approach argue that all of the added complexity is necessary in order to produce results that can be matched to experimental data. In contrast are the very simple models that are based on firing rate (Hopfield and Tank 1986; Wilson and Cowan 1972), which are easily analyzed and computationally simple. These models inevitably invoke a sigmoid nonlinear transfer function for the current to firing rate relationship. The motivation for this is generally heuristic. In a recent paper, Rinzel and Frankel (1992) show that for some types of membrane models, one can explicitly derive neural network-like equations from the biophysics by averaging and assuming slow synapses. Our approach is similar with two important exceptions that we outline in detail below. Consider the following system of coupled neurons:
$$C\,\frac{dV_j}{dt} = -I_{ion}(V_j, w_j) + I_j^{app} + \sum_k \lambda_{jk}\,s_k\,(V_r^k - V_j) \eqno(1.1)$$

$$\frac{dw_j}{dt} = q_j(V_j, w_j) \eqno(1.2)$$

$$\frac{ds_j}{dt} = \epsilon\,[S_j(V_j) - s_j] \eqno(1.3)$$
The $j$th cell of the network is represented by its potential, $V_j$, and all of the auxiliary channel variables that make up the dynamics for the membrane, $w_j$. (For the Hodgkin-Huxley equations, these would be $m$, $n$, $h$.) The term $I_j^{app}$ is any tonic applied current. Finally, the synapses are modeled by simple first-order dynamics and act to hyperpolarize, shunt, or depolarize the postsynaptic cell. Associated with each neuron is a synaptic channel whose dynamics is governed by the variable $s_j$, which depends in a (generally nonlinear) manner on the somatic potential. Thus, one can think of $s_j$ as being the fraction of open channels due to the presynaptic potential. The functions $S_j$ have maxima of 1 and minima of 0. The effective maximal conductances of the synapses between the cells are the nonnegative numbers $\lambda_{jk}$, and the reversal potentials of each of the synapses are $V_r^k$. Our goal is to derive equations that involve only the $s_j$ variables and thus reduce the complexity of the model yet retain the qualitative (and perhaps quantitative) features of the original. (Note that we are explicitly assuming that the synapses from a given neuron, $k$, share the same synaptic dependencies, $s_k$. If this is unreasonable for a particular model, then one must include a different variable for each of the different synaptic dependencies of a given neuron. Nevertheless, the techniques of this article can still be used.) The main idea is to exploit the smallness of $\epsilon$ and thus invoke the averaging theorem on the slow synaptic equations. Each of the slow synapses $s_1, \ldots, s_n$ is held constant and the membrane dynamics equations are solved for the potentials, $V_j(t; s_1, \ldots, s_n)$. The potentials, of course, naturally depend on the values of the synapses. We will assume that the potentials are either constant or periodic with period $T(s_1, \ldots, s_n)$. (If they are constant, one can take $T = 1$, for example.) Once the potentials are found, one then averages the slow synaptic equations over one period of the membrane dynamics, obtaining

$$\frac{ds_j}{dt} = \epsilon\,[\bar{S}_j(s_1, \ldots, s_n) - s_j] \eqno(1.4)$$

where

$$\bar{S}_j(s_1, \ldots, s_n) = \frac{1}{T_j(s_1, \ldots, s_n)} \int_0^{T_j(s_1, \ldots, s_n)} S_j[V_j(t; s_1, \ldots, s_n)]\,dt \eqno(1.5)$$
Our goal is to explicitly derive expressions for these dependencies and to then compare the simplified or reduced system with the full model that includes all of the fast membrane dynamics. Many synapses in the nervous system are not slow, and others that are said to be slow are slow only in the sense that they have long-lasting effects (see, e.g., Kandel et al. 1991, Chapter 11). Thus, one must view
this work as an approximation of what really happens and as a means of converting full spiking models to "firing rate" models while maintaining quantitative features of the latter. While we have explicitly assumed that the synapses are slow in order to derive the relevant equations, one does not need to be so specific as to the separation of time scales between spiking and synaptic activity. The main assumption is that the detailed phase and timing of individual spikes are not important; that is, one is free to average over many spiking events. If events are occurring far out on the dendritic tree, then the low-pass filtering properties of long dendrites act in a manner similar to "slow synapses." Thus, one should regard the assumption of slowness as sufficient but not necessary for the present reduction. Rinzel and Frankel (1992) apply the same averaging methods to derive equations for a pair of mutually coupled neurons that was motivated by an experimental preparation. In their paper, they require only cross connections with no self-self interactions. Thus, they arrive at equations where $\bar{V}_j$ depends only on $s_k$, where $j = 1, 2$; $k = 2$ for $j = 1$, and $k = 1$ for $j = 2$.
The key to their analysis is that they are able to numerically determine the potential as a function of the strength of the synapses. They use a class of membrane models that are called "class II" (see Rinzel and Ermentrout 1989), where the transition from rest to repetitive firing occurs at a subcritical Hopf bifurcation and is typical of the Hodgkin-Huxley model (Hodgkin and Huxley 1952). This latter assumption implies that the average potential exhibits hysteresis as a function of the synaptic drive. That is, the functions $\bar{S}_j(s_k)$ are multivalued for some interval of values of $s_k$. Because of this hysteresis, they are able to combine an excitatory cell and an inhibitory cell in a network as shown in Figure 1 and obtain oscillations. It is well known that for smooth single-valued functions $\bar{S}$, oscillations are impossible without self-excitatory interactions (see, e.g., Rinzel and Ermentrout 1989). If, however, one generalizes their approach to include more than one type of synapse per cell or the addition of applied currents, then it is necessary to compute the voltages for each set of possible synaptic values. This is a formidable numerical task. The other feature of the Rinzel-Frankel analysis is that the nonlinearity $\bar{S}_j$ does not look like a typical Hopfield or Wilson-Cowan squashing function, due to the hysteresis phenomenon. This hysteresis can be avoided if the bifurcation to periodic solutions is supercritical, but then there are very delicate problems as the synapses slowly pass through the bifurcation point (Baer et al. 1989). (In fact, this slow passage problem occurs in the subcritical case as well; the authors avoid its consequences by adding a small amount of noise to the simulations.) Finally, in class II
Figure 1: Two-cell network with reciprocal connections. No self-connections are allowed.

membranes, the dependence of the frequency on the level of depolarization is discontinuous and essentially piecewise quadratic. The amplitude of the potential varies in a piecewise linear fashion. (See Figure 5.2 in Rinzel and Ermentrout 1989.) Thus, the values of $s_j$ in the Rinzel-Frankel model are not explicitly tied to the frequency of the impulses as they are to the averaged magnitude of the potential. In this paper, we consider the use of "class I" models of membranes instead of "class II" membranes as the basis for the averaging. We show that this will lead to a simpler form for the synaptic equations with the added benefits of (1) obviating the complicated mathematical difficulties while passing through the transition from rest to repetitive firing; (2) suggesting a natural form for the squashing function that is monotone increasing and that is very similar to the firing rate curves of some cortical neurons; and (3) making very explicit the frequency-encoding properties of the active membrane. Note that this reduction requires that the membrane actually exhibit class I dynamics; in those systems in which class II dynamics occurs, the Rinzel-Frankel reduction is appropriate. In Section 2, we review the membrane dynamics and show how the necessary averages can be approximated by a simple algebraic expression. We then apply this to a two-layer model with self-excitation and self-inhibition as well as crossed connections. Comparisons between the full and the reduced models are given in the last section.
2 Class I Dynamics and Firing Rates
Rinzel and Ermentrout (1989) contrast the difference between class II and class I membranes. The former include the well-known Hodgkin-Huxley model, while the latter include the model of Connor as well as models of cortical cells with A currents (L. Abbott, personal communication). Class II axons have a transition from rest to repetitive firing that arises from a Hopf bifurcation. As a consequence, the frequency goes from 0 spikes per second to some finite nonzero rate, and the average of the potential (in the case of Rinzel and Frankel) has a discontinuous jump. As we discussed previously, there is hysteresis, so that the average value of the potential is as shown in Figure 2a. In contrast to the dynamics of class II membranes, class I membranes have a continuous transition from nonspiking to spiking, both in the average potential and in the firing rate of the membrane (see Fig. 2b). A general analysis of this type of bifurcation is given in Ermentrout and Kopell (1986), with particular attention paid to slowly perturbed neural
Figure 2: Average potential and frequency of firing as a function of current for (a) class II membranes; (b) class I membranes.
models (precisely what we have here). In particular, we show that the frequency of the periodic orbit is

$$\omega = C(p)\sqrt{[p - p^*]^+} \eqno(2.1)$$

where $p$ is any parameter such that as $p$ increases from below $p^*$, the membrane switches into oscillations. (Here, $[z]^+ = z$ if $z > 0$ and vanishes for $z < 0$.) There are several important advantages of this type of relation insofar as coding of information by firing rate is concerned. First, the network is extremely sensitive near threshold, and thus a very large range of firing rates is possible. This behavior is typical in continuous but nondifferentiable firing rate models. Second, slowly varying the parameter $p$ yields the intuitively expected behavior of slowly varying the firing rate, even as one crosses the critical regime (cf. Ermentrout and Kopell 1986, Lemma 1). In contrast, with a class II membrane, there is bistability and thus an instant jump from no firing to nearly maximal firing, and a ramping effect whereby the system may remain very close to rest even though the slowly moving parameter is well past criticality (see Baer et al. 1989). One can object that the computational problem of finding the average of the potential as a function of many possible parameters still exists. However, as we will see, it turns out that the parametric computation of the periodic solutions for the membrane can actually be reduced to an algebraic computation of equilibria near criticality. Thus, the computation of hundreds of periodic orbits is no longer necessary. Because of the special form of the membrane potential, the algebraic calculation is one of finding fixed points of a scalar equation! Recall from 1.3 that $S$ is the function that determines the fraction of open channels as a function of the presynaptic potential. This is generally a function of the potential, for example, $S(V) = [\tanh[\sigma(V - V_{th})]]^+$. Here $\sigma$ is the sharpness of the nonlinearity. Consider the averaged synaptic function:

$$\bar{S}(p) = \frac{1}{T(p)}\int_0^{T(p)} S[V(t; p)]\,dt \eqno(2.2)$$

where $p$ is a critical parameter on which the potential depends. Suppose that $S$ is 0 if the membrane is at rest (and thus below threshold) and 1 if the voltage is above threshold (e.g., $\sigma$ is very large). Let $\xi(p)$ denote the amount of time during one cycle that the potential is above threshold. Then, 2.2 is simplified to

$$\bar{S}(p) = \frac{\xi(p)}{T(p)} = \frac{\xi(p)\,\omega(p)}{2\pi} \eqno(2.3)$$
where we have used the fact that $T = 2\pi/\omega$. A fortuitous property of class I membranes is that the time for which the spike is above threshold is largely independent of the period of the spike, so that $\xi(p)$ is essentially constant. (This is certainly true near threshold, and we have found it to
be empirically valid in a wide parameter range.) Thus, combining this with 2.1, we obtain the very simple squashing function:

$$\bar{S}(p) = \frac{\xi}{2\pi}\,C(p)\sqrt{[p - p^*]^+} \eqno(2.4)$$
where $C(p)$ and of course $p^*$ depend on the other parameters in the model. In deriving 2.4 we have made two simplifications: (1) the actual frequency is of the form 2.1 only near criticality, and (2) the synaptic function $S$ may be more graded than a simple on/off. The first simplification seems to be pretty reasonable; prior calculations (see Rinzel and Ermentrout 1989) indicate the square-root relationship holds over a wide range of the parameter. The second simplification is not particularly important, as the main effect is to slightly change the constant $C(p^*)$. The squashing function is completely determined if we can compute $p^*$ and $C(p^*)$ as a function of all of the remaining parameters in the membrane. As we remarked in the above paragraph, the actual constant $C(p^*)$ is not that important, so that the crucial calculation is to find the critical value $p^*$ as a function of the remaining parameters. Since only the injected currents and the other synapses will be variable, we can limit the number of parameters to vary. In the next section, we compute $p^*$ and thus obtain a closed-form equation for the synaptic activities.

3 Computation of the Squashing Function for a Simple Model
In this section, we detail the calculations of $p^*$ and $C(p^*)$ for the dimensionless Morris-Lecar membrane model (see Rinzel and Ermentrout 1989; Morris and Lecar 1981) with excitatory and inhibitory synapses and injected current. The explicit dimensionless equations are

$$\frac{dV}{dt} = I_{ion}(V, W) + I + g_e(V_e - V) + g_i(V_i - V), \qquad \frac{dW}{dt} = \phi\,\frac{w_\infty(V) - W}{\tau_w(V)} \eqno(3.1)$$

where

$$\begin{aligned} I_{ion}(V, W) &= m_\infty(V)(1 - V) + 0.5(-0.5 - V) + 2W(-0.7 - V) \\ m_\infty(V) &= 0.5(1 + \tanh[(V + 0.01)/0.15]) \\ w_\infty(V) &= 0.5(1 + \tanh[(V - 0.05)/0.1]) \\ \tau_w(V) &= 1/(3\cosh[(V - 0.05)/0.2]) \end{aligned}$$
and $\phi = 1$, $V_i = -0.7$. The transition from rest to oscillation occurs in class I membranes via a saddle-node bifurcation of an equilibrium. Thus, if the differential equation is

$$\frac{dx}{dt} = f(x, p)$$
we want to solve $f(x, p^*) = 0$, where $x = x(p)$ is a vector of rest states depending on $p$, such that the determinant of the Jacobi matrix of $f$ with respect to $x$ evaluated at $x(p^*)$ is zero (see Guckenheimer and Holmes 1983). For membrane models, this is very simple. Let $V$ denote the membrane potential and $w$ denote the gating variable(s). Each of the gating variables is set to its steady-state value, $w = w_\infty(V)$, and thus the equilibrium for the voltage satisfies
$$0 = I_{ion}[V, w_\infty(V)] + g_e(V_e - V) + g_i(V_i - V) + I \eqno(3.3)$$
where $g_e$ and $g_i$ are the total excitatory and inhibitory conductances, $V_e$ and $V_i$ are the reversal potentials, and $I$ is the total applied current. For each $(g_e, g_i, I)$ we find a rest state, $V$. Next, we want this to be a saddle-node point, so we want the derivative of this function to vanish:

$$0 = \frac{dI_{ion}[V, w_\infty(V)]}{dV} - g_e - g_i \eqno(3.4)$$
Suppose we want to view $g_e$ as the parameter that moves us from rest to repetitive firing. Then, we can solve 3.3 for $g_e(V, I, g_i)$, substitute it into 3.4, and use a root finder to get the critical value $V^*$ and thus $g_e^*$. This is a simple algebraic calculation and results in $g_e^*(g_i, I)$, a two-parameter surface of critical values of the excitatory conductance. A local bifurcation analysis at this point enables us to compute $C(p^*)$ as well. (The local bifurcation calculation is tedious but routine (see Guckenheimer and Holmes 1983) and involves no more than a few Taylor series terms of the nonlinearities and some trivial linear algebra. In particular, no numerical calculations of periodic orbits are required.) The end result of this calculation is shown in Figure 3a,b. The figure for the critical value of the excitatory conductance suggests that the relationship between it and the parameters $g_i$ and $I$ is almost linear. We have used a least squares fit of this and find that

$$g_e^*(g_i, I) \approx a + bI + cg_i + dg_iI \eqno(3.5)$$
where $a = 0.02316$, $b = -0.7689$, $c = 0.3468$, $d = -0.1694$. Note that the dependence is not strictly linear; there is a small interaction term of the inhibitory conductance with the current. Using this we can obtain the slow equations for a Wilson-Cowan type two-cell network of excitatory and inhibitory cells with self-excitatory connections (in contrast to Fig. 1). The conductances are related to the synapses by the relations $g_{uv} = \lambda_{uv}s_v$, where $u, v$ are either $e$ or $i$. Thus, we arrive at the coupled network:

$$\tau_e\frac{ds_e}{dt} + s_e = C_e\sqrt{[\lambda_{ee}s_e - g_e^*(\lambda_{ie}s_i, I_e)]^+} \eqno(3.6)$$

$$\tau_i\frac{ds_i}{dt} + s_i = C_i\sqrt{[\lambda_{ei}s_e - g_e^*(\lambda_{ii}s_i, I_i)]^+} \eqno(3.7)$$
Figure 3: Critical surface (a) and bifurcation coefficient (b) for the dimensionless Morris-Lecar model (3.1).

Note that $C_e$ and $C_i$ also depend on $g_e^*$, so that the dependence is not strictly a square root. However, as illustrated in Figure 3b, the bifurcation coefficient is close to constant, so throughout the rest of this paper we have chosen it as such. Note that the equations are not strictly additive, as would be the case in a "pure" Hopfield network.
This is because the inhibitory interactions are not just the negative of the excitatory interactions. Both are multiplied by their respective reversal potentials. Also, note that one does not have to compute the critical point with respect to $g_e$ and could just as easily compute it with respect to some other parameter such as $g_i$. The function $g_e^*$ is the critical value of excitatory conductance as a function of all of the other parameters, and thus there is no "symmetry" in the notation for the two equations (3.6, 3.7). Since the computation of $g_e^*$ is purely local (that is, we need only compute rest states and their stability) for any given model, we can find $g_e^*$ as a function of $g_i$ and $I$. Once this is done, a curve-fitting algorithm can be used to approximate the surface. Finally, the results can be substituted into 3.6-3.7 to arrive at a neural network model that is directly derived from the biophysical parameters. If one wishes to investigate a large network of such neurons, then the calculation remains simple as long as the only types of synapses are excitatory and inhibitory. As different synaptic types are added, one must compute $g_e^*$ as a function of these additional parameters. Because the global rhythmic behavior of class I membranes is so stereotypical (at least near threshold, and for many models, well beyond threshold), it is unnecessary to compute the dynamic periodic solutions numerically, and so the computational complexity is reduced considerably. There is no obvious manner in which the present computation could be extended to neurons that intrinsically burst or that have slow modulatory processes such as spike adaptation. However, such slow modulatory processes could be incorporated into the dynamic equations along with the firing rate equations. The reduction would then no longer lead to a single equation for each neuron, but rather several, corresponding to each of the modulatory dynamic variables.

4 Some Numerical Comparisons
In this section, we compare the simplified dynamics with the membrane models. In particular, the simple models predict the onset of oscillations in the firing rate of the full system (bursting) as well as transient excitation and bistability. As a final calculation, we illustrate spatial patterning in a two-layer multineuron network and compare it to a reduced model of the same type. For the present calculations, we have set the Morris-Lecar parameters as in Figure 3 and varied only the synaptic time constants $\tau_e, \tau_i$, currents $I_e, I_i$, and weights $\lambda_{jk}$. In Figure 4, a typical nullcline configuration for 3.6 and 3.7 is shown when there is a unique steady state. If $\tau_i$ is not too small, then it is unstable and, as can be seen in the figure, there is a stable periodic solution. Using the same values of the network parameters, we integrate 1.1 with $\epsilon = 0.01$. Figure 5 shows the phase-portrait of the averaged and the full models. There is good agreement in the shape of the oscillation.
Figure 4: Nullclines and trajectory for 3.6-3.7. Nullclines are shown in dark lines, trajectory in light lines. Parameters are $\tau_e = \tau_i = 1$, $I_e = 0.05$, $I_i = -0.1$, and $\lambda_{ee} = 0.5$, $\lambda_{ei} = 0.6$, $\lambda_{ie} = 1$, $\lambda_{ii} = 0$.
Figure 5: Phase-plane comparing full and averaged models for parameters of Figures 3 and 4.
Figure 6: Time course of (a) full model showing excitatory potential and synaptic activity; (b) reduced model showing excitatory synaptic activity.
Figure 6b shows the time course of $s_e$ for the averaged equations, and Figure 6a shows that of $V_e$ and $s_e$ for the full model. The slow oscillation of the synaptic activities is reflected by the bursting activity of the potential. One point that is evident is that the period of the full oscillation is roughly the same as that of the averaged. Indeed, the period of the full model is about 5.405 and that of the averaged is 5.45 (in slow time units).
Figure 7: (a) Nullclines and trajectory for 3.6-3.7 in the excitable regime. Parameters as in Figure 4 except that $I_e = 0.0$. Circle indicates the initial condition. (b) Comparison of phase-plane of full and reduced models.
Thus, not only is there qualitative agreement, but quantitative agreement between the full and reduced models. Figure 7a shows the nullclines and phase-plane for a set of parameters for which the reduced model is either excitable ($\tau_i \approx 1$) or bistable ($\tau_i \ll 1$). Figure 7b compares the phase-planes of the full and reduced models in the excitable regime. In the bistable regime, one finds that the potentials exist either silently or in an oscillatory mode corresponding to the upper steady state of the synaptic activities. The reduced model is a good predictor of transient activity as well as of the sustained bursting shown in the previous example.
Figure 8: Schematic of a network of connected cells in a ring.
As a final simulation that truly illustrates the power of the method, we consider a two-layer network of 20 cells coupled in a ring of length $2\pi$, as shown in Figure 8. The network equations for the reduced model have the same form as 3.6-3.7, with each cell's excitatory and inhibitory conductances given by weighted sums of the synaptic activities around the ring. Here $w_e$ and $w_i$ are periodized gaussian weights,

$$w_\alpha(n) = B_\alpha \exp(-n^2/\sigma_\alpha^2)$$

where $\alpha$ is either $e$ or $i$, and the constants $B_\alpha$ are chosen so that $\sum_n w_\alpha(n) = 1$. Setting $\sigma_e = 0.5$ and $\sigma_i = 2.0$, we obtain a network with lateral inhibitory properties. The parameters $I_e$ and $I_i$ are externally applied stimuli to the network. This network is integrated numerically starting with initial data with two peaks and a random perturbation. The steady-state spatial profiles of the reduced and the full models are shown in Figure 9a. The spacing, width, and height of the peaks agree well between the two regimes. Figure 9b shows a space-time plot of the potentials of the excitatory cells in the fast time scale (thus showing the individual spikes of the active region). The strip is broken into spatially distinct regimes of oscillatory and silent cells. These simulations show that there is very good qualitative and even quantitative agreement between the reduced system and the full model
Figure 9: Simulation of the ring. Parameters are $\tau_e = \tau_i = 1$, $I_e = 0.1$, $I_i = 0$, and $\lambda_{ee} = 0.2$, $\lambda_{ei} = 0.6$, $\lambda_{ie} = 0.8$, $\lambda_{ii} = 0$. (a) Spatial profile of the reduced and full excitatory synaptic activity. (b) Potential of the excitatory cell as a function of space and time. (Note that this is in the fast time scale so that individual spikes can be seen in the active regions.)

for simple two-cell networks as well as for spatially extended systems. There is a huge increase in computational speed for the simplified models, and thus one can use them to find "interesting" parameter ranges before attempting to simulate the full model. The main feature of this method is that it makes explicit the dependence of the frequency encoding of a neuron on its biophysical properties. By using membranes with so-called class I excitability, the actual computation of the reduced models is reduced to an algebraic problem in
combination with a curve-fitting algorithm. We suggest that to the usual types of squashing functions seen in the neural network literature, one add the square-root model, since it can be explicitly derived from the conductance properties of nerve membranes. We finally want to point out that this reduction justifies the use of so-called connectionist models under the assumption that the synaptic interactions are slow compared to the ionic flows in the membrane. Furthermore, the time scales of the present reduced models are not those of the membrane time constant to which many of the Hopfield models refer. Nevertheless, many neuronal processes occur at time scales slower than the action potentials; thus our technique can be useful in understanding these phenomena.
Acknowledgments

Supported in part by NIMH-47150, NSF (DMS-9002028).
References

Baer, S. M., Erneux, T., and Rinzel, J. 1989. The slow passage through a Hopf bifurcation: Delay, memory effects, and resonance. SIAM J. Appl. Math. 49, 55-71.
Connor, J. A., Walter, D., and McKown, R. 1977. Neural repetitive firing: Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J. 18, 81-102.
Ermentrout, G. B., and Kopell, N. K. 1986. Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math. 46, 233-253.
Guckenheimer, J., and Holmes, P. 1983. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, New York.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Hopfield, J. J., and Tank, D. W. 1986. Computing with neural circuits: A model. Science 233, 625-633.
Kandel, E., Schwartz, J. H., and Jessel, T. M. 1991. Principles of Neural Science. Elsevier, New York.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods of Neuronal Modeling, C. Koch and I. Segev, eds., pp. 135-171. MIT Press, Cambridge, MA.
Rinzel, J., and Frankel, P. 1992. Activity patterns of a slow synapse network predicted by explicitly averaging spike dynamics. Neural Comp. 4, 534-545.
Traub, R., and Miles, R. 1991. Neuronal Networks of the Hippocampus. Cambridge University Press, Cambridge, UK.
Wilson, M. A., and Bower, J. M. 1989. The simulation of large-scale neural networks. In Methods of Neuronal Modeling, C. Koch and I. Segev, eds., pp. 295-341. MIT Press, Cambridge, MA.
Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
Received July 9, 1993; accepted November 4, 1993.
Communicated by Laurence Abbott
Dimension Reduction of Biological Neuron Models by Artificial Neural Networks

Kenji Doya* Allen I. Selverston
Department of Biology, University of California, San Diego, La Jolla, CA 92093-0322 USA
An artificial neural network approach to dimension reduction of dynamical systems is proposed and applied to conductance-based neuron models. Networks with bottleneck layers of continuous-time dynamical units could make a two-dimensional model from the trajectories of the Hodgkin-Huxley model and a three-dimensional model from the trajectories of a six-dimensional bursting neuron model. Nullcline analysis of these reduced models revealed the bifurcations of the dynamical system underlying firing and bursting behaviors.

1 Introduction
The dynamics of neural membrane potential is well described by conductance-based models (Hodgkin and Huxley 1952; Hille 1992), which are usually fourth or higher order differential equation systems. Lower dimensional models that approximate the behaviors of those higher order models have been proposed for faster simulations, rigorous mathematical analyses, and better intuitive understanding of the dynamics (FitzHugh 1961; Nagumo et al. 1962; Rinzel 1985; Kepler et al. 1992). In this paper, we propose a new method for reducing the dimensionality of dynamical systems using artificial neural networks. We applied this method to constructing a two-dimensional model that approximates the firing behavior of the four-dimensional Hodgkin-Huxley (HH) model (Hodgkin and Huxley 1952) and a three-dimensional model that approximates a six-dimensional model of bursting neurons recently proposed by Guckenheimer et al. (1993). In a typical conductance-based model, the dynamics of the membrane potential v is given by the following form:

$$C \frac{dv}{dt} = I - \sum_j \bar{g}_j\, a_j^{p_j}\, b_j^{q_j}\, (v - v_j) \tag{1.1}$$
*Present address: The Salk Institute, Computational Neurobiology Laboratory, 10010 North Torrey Pines Road, La Jolla, CA 92037 USA.

Neural Computation 6, 696-717 (1994)
© 1994 Massachusetts Institute of Technology
where C is the membrane capacitance and I is the externally injected current. The subscript j denotes one kind of ionic current, and ḡ_j and v_j are the maximal conductance and the reversal potential of that ionic current, respectively. The variables a_j and b_j are the normalized activation and inactivation of the ionic current, which represent the opening probabilities of the gating elements of the ionic channels, and the exponents p_j and q_j represent the number of gating elements in a channel, which are usually integers between 0 and 4. The dynamics of the activation and inactivation variables are typically given by the form
$$\dot{a} = k(v)\,[a_\infty(v) - a] \tag{1.2}$$
where k(v) is a rate constant and a_∞(v) is a steady-state activation or inactivation level, which is a sigmoidal function of the membrane potential v. The HH model incorporated sodium, potassium, and leak currents, resulting in a four-dimensional differential equation system. As more kinds of ionic currents are discovered and characterized by physiological experiments, the corresponding conductance-based models become quite complex. For example, a recent model of the LP neuron in the crab stomatogastric ganglion is a 14-dimensional differential equation system (Buchholtz et al. 1992). Such high-dimensional models are not suitable for mathematical analyses and numerical simulations, especially in multicellular or multicompartmental studies. We propose an artificial neural network approach to the reduction of conductance-based models. Learning in artificial neural networks is a practical method for constructing a model from examples. We designed a special kind of recurrent network to generate a lower dimensional model from examples of the trajectories of a higher dimensional dynamical system.
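To make the generic form above concrete, the following minimal sketch (ours, not from the original paper) integrates a single gating variable of the form 1.2 with forward Euler; the particular rate function k(v) and steady-state curve a_inf(v) are illustrative assumptions, not the fitted HH functions.

```python
import numpy as np

# Minimal sketch (not from the paper): forward-Euler integration of one
# gating variable, da/dt = k(v) * (a_inf(v) - a), as in equation 1.2.

def a_inf(v):
    # sigmoidal steady-state level as a function of membrane potential (mV)
    return 1.0 / (1.0 + np.exp(-(v + 40.0) / 10.0))

def k(v):
    # voltage-dependent rate constant (1/msec), illustrative values only
    return 0.5 + 4.5 / (1.0 + np.exp((v + 40.0) / 15.0))

def integrate_gate(v_trace, dt=0.01, a0=0.0):
    """Integrate the gate along a prescribed membrane-potential trace."""
    a, out = a0, np.empty(len(v_trace))
    for i, v in enumerate(v_trace):
        a += dt * k(v) * (a_inf(v) - a)
        out[i] = a
    return out

# Example: gate response to a voltage step from -65 mV to -20 mV
v = np.concatenate([np.full(1000, -65.0), np.full(2000, -20.0)])
a = integrate_gate(v)
```

2 Dimension Reduction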
Trajectories of a dissipative dynamical system usually fall into some low-dimensional submanifold of the state space after some transient time. In such cases, it is convenient to define a lower dimensional coordinate system and a compact form of the vector field on this manifold. The dimension required for such a reduced state space depends on the dimensionality of the attractor trajectory. Simple limit cycle trajectories, such as the waveforms of tonically firing neurons, can be well approximated by two-dimensional systems, whereas some of the complex limit cycle trajectories, such as those of bursting neurons, require a three-dimensional state space. Chaotic oscillation requires at least a three-dimensional state space, and the dimension must be larger than the Lyapunov dimension of the attractor (Takens 1980; Brown and Bryant 1991). Dimension reduction of conductance-based neuron models has been performed by heuristically finding some constraints between the time
courses of the dynamical variables (FitzHugh 1961; Nagumo et al. 1962; Krinsky and Kokos 1973; Rinzel 1985; Rose and Hindmarsh 1989). Most often, the fastest activation variables are regarded as instantaneous functions of the membrane potential. The slower variables are grouped by the similarity of their time courses and replaced by their linear combination. Kepler et al. (1992) have proposed a systematic dimension reduction method that utilizes linear combinations of the "equivalent potential" of activation variables, defined as v(a) = a_∞^{-1}(a). They reduced the four-dimensional HH model to a two-dimensional system and a six-dimensional neuron model with A-current (Connor et al. 1977) to a three-dimensional model. Recently, Golomb et al. (1993) applied their method to a 13-dimensional model of the stomatogastric LP neuron (Buchholtz et al. 1992) and derived a six-dimensional model, which is still high dimensional for a tonically firing neuron. This suggests that the nonlinear coordinate transformation by equivalent potentials is not always the best way to find reduced variables. We propose the use of artificial neural networks to allow arbitrary nonlinear mappings between the original and the reduced dynamical variables. We use a multilayer neural network to implement a mapping from the reduced variables to the original variables and another multilayer network to represent the vector field of the reduced system. The parameters of the networks are optimized to approximate the sample trajectories of the original neuron model.
2.1 Dynamical Bottleneck Network. Multilayer neural networks with a "bottleneck" layer have been used for compression of high dimensional data vectors, or nonlinear principal component analysis (Kramer 1991; DeMers and Cottrell 1993). This architecture is also suitable for applying the "teacher forcing" technique, which has been used to train a recurrent network as an autonomous oscillator (Doya and Yoshizawa 1989; Williams and Zipser 1989). Figure 1 shows the basic architecture of the "dynamical bottleneck network" that we designed for dimension reduction of dynamical systems. When we train the network, a sample trajectory x*(t) = [x_1*(t), ..., x_n*(t)] of the original n-dimensional system is given to the input layer, compressed into an r-dimensional trajectory y(t) = [y_1(t), ..., y_r(t)] in the bottleneck layer, and then reconstructed as an n-dimensional trajectory x(t) = [x_1(t), ..., x_n(t)] in the output layer (teacher forcing mode; Fig. 1a). After the actual output x(t) has become very close to the target trajectory x*(t) by error gradient descent learning, the output x(t) is fed back to the input so that the network evolves as an r-dimensional autonomous system (autonomous mode; Fig. 1b). Because our purpose is to reduce the dimensionality of the state space, only the units in the bottleneck layer have continuous-time dynamics. The units in the other layers are supposed to operate instantaneously.
Figure 1: The architecture of the dynamical bottleneck network. (a) Teacher forcing mode. (b) Autonomous mode.

We used a three-layer network to implement the mapping from the input layer to the bottleneck layer so that arbitrary nonlinear vector fields can be implemented with enough hidden units (Irie and Miyake 1988; Hornik 1991).
$$z_i(t) = \begin{cases} x_i^*(t) & \text{teacher forcing mode} \\ x_i(t) & \text{autonomous mode} \end{cases} \qquad (i = 1, \ldots, n) \tag{2.1}$$

$$h_i(t) = f\!\left(\sum_{j=1}^{n} w^H_{ij}\, z_j(t) + \sum_{k} w^S_{ik}\, s_k + b^H_i\right) \qquad (i = 1, \ldots, m) \tag{2.2}$$

$$\tau_i\, \dot{y}_i(t) = -y_i(t) + \sum_{j=1}^{m} w^Y_{ij}\, h_j(t) + b^Y_i \qquad (i = 1, \ldots, r) \tag{2.3}$$

where z(t) = [z_1(t), ..., z_n(t)] is the output of the input layer, (s_1, ..., s_{n_s}) is a modulatory input that represents the parameter of the model, [h_1(t), ..., h_m(t)] is the output of the hidden layer, and f is the sigmoid function f(x) = 1/(1 + e^{-x}). The advantage of the continuous-time model (2.3) is that each unit i can be assigned a distinct time constant τ_i so that events on the time scale near τ_i will be represented primarily by unit i. Furthermore, the time constants can be fine-tuned by the learning algorithm described below. The mapping from the bottleneck layer to the output layer was also implemented by a three-layer network to allow arbitrary nonlinear mappings between the reduced and the original variables:

$$g_i(t) = f\!\left(\sum_{j=1}^{r} w^G_{ij}\, y_j(t) + b^G_i\right) \qquad (i = 1, \ldots, l) \tag{2.4}$$

$$x_i(t) = \sum_{j=1}^{l} w^X_{ij}\, g_j(t) + b^X_i \qquad (i = 1, \ldots, n) \tag{2.5}$$

where [g_1(t), ..., g_l(t)] is the output of another hidden layer. The instantaneous mapping x(y) determined by equations 2.4 and 2.5 represents the mapping between the reduced and original state variables. The mapping from the input layer to the bottleneck layer is not the inverse of the mapping from the bottleneck layer to the output layer as in the static case (Kramer 1991; DeMers and Cottrell 1993). The composite mapping from the bottleneck layer through the output and input layers back to the bottleneck layer determines the r-dimensional vector field of the reduced system in autonomous mode.
$$E = \frac{1}{T} \int_0^T e(t)\, dt \tag{2.6}$$
where T is the period of the target trajectory x*(t) and e(t) is the instantaneous error
$$e(t) = \frac{1}{2} \sum_{i=1}^{n} \left[x_i(t) - x_i^*(t)\right]^2 \tag{2.7}$$
The error gradients with respect to the weights w^X and w^G were derived by the standard backpropagation algorithm (Rumelhart et al. 1986). The error gradients with respect to the weights w^Y and w^H and the time
constants τ were derived using a continuous-time version of the real-time recurrent learning algorithm (Doya and Yoshizawa 1989; Rowat and Selverston 1991), as shown in the Appendix. We tried two parameter update schemes. In the batch update scheme, the average error gradient
was integrated for each training run and then the parameters were updated by
$$\Delta w(k) = \left(1 - \frac{1}{a}\right) \Delta w(k-1) + \frac{1}{a}\, \frac{\partial E}{\partial w}, \qquad w(k+1) = w(k) - \varepsilon\, \Delta w(k) \tag{2.8}$$
where Δw(k) is the smoothed gradient, k is the index of iterations, a is the time constant of the weight update, and ε is the learning rate. In the corresponding real-time update scheme, the parameters were updated while the network was running by the running average Δw of the error gradient as follows:
$$a\, \frac{d\,\Delta w(t)}{dt} = -\Delta w(t) + \frac{\partial e(t)}{\partial w}, \qquad \frac{dw}{dt} = -\varepsilon\, \Delta w(t) \tag{2.9}$$

We could use larger learning rates in the real-time update scheme since the parameter change was smoother and the learning process was less likely to become unstable. All biases b were changed by regarding them as weights for a unitary input. Because the time constants τ_i should always stay positive, it was convenient to use the exponential expression τ_i = e^{σ_i} and to perform gradient descent with respect to the exponent σ_i (Rowat and Selverston 1991). Since ∂τ_i/∂σ_i = τ_i, this is equivalent to performing gradient descent on τ_i with the learning rate scaled by τ_i².

2.3 Teacher Forcing. It has been demonstrated that teacher forcing is essential in training a recurrent network into an autonomous oscillator (Doya and Yoshizawa 1989; Williams and Zipser 1989). Without teacher forcing, it is very difficult to cross bifurcation boundaries in the parameter space, where the trajectory changes qualitatively, for example, from a fixed point to a limit cycle (Doya 1992). With the bottleneck architecture above, the network has only first-order linear feedback systems in teacher forcing mode, and therefore no bifurcation can happen as long as the time constants are kept positive. Although teacher forcing avoids problems in learning, it can lead to a problem when the network is run autonomously. Even if the output error x(t) - x*(t) is very small under teacher forcing, the existence and
the stability of the autonomous solution are not guaranteed. This was not a major problem in learning simple trajectories such as sinusoidal ones (Doya and Yoshizawa 1989; Williams and Zipser 1989). However, in the case of bursting trajectories, the network often remained quiescent or tonically firing in autonomous mode, even if the error under teacher forcing was very small. For the trajectory to be reproduced under autonomous dynamics, it has to be an attractor, to which all neighboring trajectories converge (Guckenheimer and Holmes 1983; Wiggins 1990). However, if a network is trained only on the attractor trajectory, the resulting trajectory under autonomous dynamics can be a repellor or a saddle instead of an attractor. To ensure that the trajectory becomes an attractor, we should impose an additional constraint on learning so that neighboring trajectories converge to the target trajectory (Tsung and Cottrell 1993). We used a partial, noisy teacher forcing scheme
$$z_i(t) = \alpha\, x_i^*(t) + (1 - \alpha)\, x_i(t) + \beta\, \nu(t) \tag{2.10}$$
instead of equation 2.1, where ν(t) is normalized gaussian random noise. The parameter α specifies the rate of teacher forcing (Williams and Zipser 1990; Toomarian and Barhen 1992) and β the amplitude of the noise. This imposes the additional constraint that the perturbed trajectory z(t) in the input layer is mapped to the target trajectory x*(t) in the output layer.
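A direct transcription of the forcing rule 2.10 might look like the following sketch (the function name and defaults are ours):

```python
import numpy as np

def forced_input(x_star, x, alpha, beta, rng=np.random):
    """Partial, noisy teacher forcing (equation 2.10):
    z(t) = alpha * x*(t) + (1 - alpha) * x(t) + beta * nu(t),
    with nu unit gaussian noise. alpha = 1, beta = 0 recovers pure
    teacher forcing; alpha = 0 gives fully autonomous operation."""
    nu = rng.standard_normal(np.shape(x_star))
    return (alpha * np.asarray(x_star)
            + (1.0 - alpha) * np.asarray(x)
            + beta * nu)
```

3 Reduction of the Hodgkin-Huxley Model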
We tested whether our method can reduce the four-dimensional HH model to a two-dimensional model. The HH model has the following form (Hodgkin and Huxley 1952):

$$C \frac{dv}{dt} = I - \bar{g}_L (v - v_L) - \bar{g}_{Na}\, m^3 h\, (v - v_{Na}) - \bar{g}_K\, n^4\, (v - v_K)$$
where the subscripts L, Na, and K denote the leak, sodium, and potassium currents, respectively. The activation m and the inactivation h of the sodium current and the activation n of the delayed-rectifier potassium current obey differential equations of the form 1.2. This equation system was numerically solved with five levels of depolarizing current: I = 0, 10, 20, 40, and 60 μA/cm². The system was quiescent at I = 0 and oscillatory at higher levels of I. These solutions were used as the target trajectory x*(t) = [v(t), m(t), n(t), h(t)] after rescaling v(t) to the range between 0 and 1. We represented the level of current injection I by the modulatory input s = 0.01 I (μA/cm²). One sweep of training consisted of learning five different waveforms (50 msec for the quiescent solution and one cycle for the oscillatory solutions) associated with the modulatory input levels s = 0, 0.1, 0.2, 0.4, and 0.6. The connection weights were initialized randomly
Figure 2: The output of the reduced HH model in autonomous mode. (a) I = 20 μA/cm². (b) I = 60 μA/cm². The upper four traces show the reconstructed output x(t) (black) in the output layer and the target trajectory x*(t) = (v, m, n, h) (gray). The bottom two traces show the output of the bottleneck layer y(t), which is the two-dimensional state of the reduced model. Scales are [0, 1] for x and [-2, 2] for y. The membrane potential v was rescaled from [-90 mV, 50 mV] to [0, 1] in x_1.
(mean = 0, SD = 0.5). The time constants were initially set as τ_1 = 0.2 msec and τ_2 = 5 msec so that y_1 captures the faster component of the neuron dynamics. The numbers of hidden units were m = l = 8 and the learning parameters were a = 5, ε = 1, α = 0.2, and β = 0.01. Figure 2 illustrates the behavior of the two-dimensional network model. After 3000 sweeps of learning, the output of the network x(t) became very close to the target waveform x*(t) under teacher forcing (E < 10⁻³). When the teacher forcing was cut off and the output x(t) was fed back to the input, the network kept on oscillating autonomously on the two-dimensional state vector (y_1, y_2) of the bottleneck layer. The time courses of the original four-dimensional system (v, m, n, h) at different levels of current injection were well approximated. Figure 3 shows the comparison of the frequency-current curves of the original and the reduced systems. Although the network was trained only at five levels of the input current I (marked by the small squares in the graph), its firing frequency changed smoothly within and outside the trained parameter range.
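A frequency-current curve like the one in Figure 3 can be estimated from the autonomous model by sweeping the injected current. The sketch below is hypothetical: it reuses the `BottleneckNet.step` interface from the earlier sketch and adopts the scaling s = 0.01 I described in the text; the 0.5 spike threshold on the rescaled potential is an assumption.

```python
import numpy as np

def firing_frequency(net, I, dt=0.01, t_settle=50.0, t_measure=200.0):
    """Estimate the firing frequency (Hz) of the autonomous reduced model
    at injected current I (uA/cm^2), counting upward crossings of an
    assumed 0.5 threshold on the rescaled membrane potential x[0].
    Times are in msec; `net` follows the BottleneckNet sketch above."""
    s = np.array([0.01 * I])          # modulatory input scaling from the text
    x = np.zeros(4)                   # autonomous mode: output fed back
    crossings, prev = 0, 0.0
    n_settle, n_meas = int(t_settle / dt), int(t_measure / dt)
    for i in range(n_settle + n_meas):
        x = net.step(x, s, dt)
        if i >= n_settle:
            if prev < 0.5 <= x[0]:    # upward threshold crossing
                crossings += 1
            prev = x[0]
    return crossings / (t_measure * 1e-3)

# e.g., sweep the f-I curve: [firing_frequency(net, I) for I in range(0, 61, 2)]
```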
Figure 3: The frequency-current curves of the reduced two-dimensional model (solid line) and the original HH model (dotted line). The points used for training are marked by the small squares. The threshold of firing was 9.7 μA/cm² for the reduced model and 7 μA/cm² for the HH model.

The purpose of model reduction is not only to decrease the amount of computation but also to provide good insight into the dynamical system. In two- or three-dimensional systems, we can graphically investigate the structure of the state space, for example, using the nullclines of the vector field. In a two-dimensional state space, we can locate equilibrium points from the intersections of the two nullclines ẏ_1 = 0 and ẏ_2 = 0, and their stabilities can be estimated from the slopes of the nullclines at the intersections. Figure 4 illustrates the nullclines of the network model in autonomous mode. The horizontal axis represents the fast variable y_1, whose time constant became τ_1 = 0.06 msec after learning, and the vertical axis represents the slow variable y_2, whose time constant was τ_2 = 5.8 msec. The nullcline ẏ_1 = 0 of the fast variable has an "N-shape," which is reminiscent of the FitzHugh-Nagumo model (FitzHugh 1961; Nagumo et al. 1962). On the right-hand side of this nullcline, trajectories head to the left, and vice versa. Thus the section of the nullcline shown by the dotted line is "unstable" and the trajectories head to either of the sections of the nullcline shown by the solid lines. At a low level of current injection (Fig. 4a; I = 5 μA/cm²), the two nullclines intersect on the stable section of the fast nullcline and therefore the equilibrium point is stable. As the level of the current is increased (Fig. 4b; I = 9.7 μA/cm²), the intersection shifts to the unstable section of the fast
nullcline. This gives rise to a stable limit cycle that goes back and forth between the two sections of the fast nullcline. At this critical level of current injection, the equilibrium point remains locally stable. An unstable limit cycle, shown as the smaller loop, separates the attractor basins for the firing and quiescent behaviors. Such bistability at the onset of firing is also seen in the original HH system (Rinzel 1978). As the current is further increased, the unstable limit cycle shrinks, merges with the stable equilibrium, and becomes an unstable equilibrium point. This is a typical process of "subcritical Hopf bifurcation" (Guckenheimer and Holmes 1983; Wiggins 1990). After this bifurcation, the stable limit cycle becomes the global attractor (Fig. 4c; I = 60 μA/cm²); therefore firing is the only possible behavior.

Figure 4: The phase portrait of the reduced HH model at different levels of current injection I. (a) I = 5 μA/cm². (b) I = 9.7 μA/cm². (c) I = 60 μA/cm². The stable and unstable sections of the nullcline ẏ_1 = 0 of the fast variable are shown by the solid and dotted lines, respectively. The nullclines ẏ_2 = 0 of the slow variable are shown by the dashed lines. The stable and unstable limit cycle trajectories are shown by the thick and thin loops with arrows. The stable and unstable fixed points are shown by the solid and hollow dots.

Figure 5: Representation of the biophysical variables of the HH model in the two-dimensional state space of the reduced model.

Figure 5 shows how the physiological variables of the HH model are represented in the reduced two-dimensional state space. The mapping was approximately linear. The fast variable y_1 mainly represents v and m, and the slow variable y_2 was related to n and h, which is similar to the construction of the FitzHugh-Nagumo model from the HH model (FitzHugh 1961; Krinsky and Kokos 1973; Rinzel 1985).
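Nullclines of the reduced two-dimensional system, such as those in Figure 4, can be traced numerically from the learned vector field. The sketch below is ours: it assumes a helper `ydot(y, s)` that returns (ẏ_1, ẏ_2) of the autonomous composite mapping, and simply brackets sign changes on a grid.

```python
import numpy as np

def nullcline_points(ydot, y1_grid, y2_grid, s, component):
    """Grid points where one component of the reduced vector field changes
    sign along y1: a crude bracketing of the nullcline ydot[component] = 0.
    `ydot(y, s)` is assumed to return (dy1/dt, dy2/dt) of the trained
    network's autonomous composite mapping."""
    pts = []
    for y2 in y2_grid:
        prev = None
        for y1 in y1_grid:
            v = ydot(np.array([y1, y2]), s)[component]
            if prev is not None and np.sign(v) != np.sign(prev):
                pts.append((y1, y2))   # sign change brackets the nullcline
            prev = v
    return np.array(pts)
```

4 Reduction of the AB Neuron Model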
The crustacean stomatogastric ganglion is a well identified neural network that controls the movement of the stomach (Selverston and Moulins
1987; Harris-Warrick et al. 1992). Some of the stomatogastric neurons are conditional bursters that endogenously alternate between "firing" and "resting" phases under the effect of neuromodulators. Recently, Guckenheimer et al. (1993) proposed a conductance-based model of the anterior burster (AB) neuron in the stomatogastric ganglion. We applied our method to reduce their six-dimensional model (GGH model) to a three-dimensional model. Note that a three-dimensional state space is sufficient for reconstructing any limit cycle attractor (Takens 1980; Brown and Bryant 1991) and is also the minimum for reproducing the chaotic behaviors (Guckenheimer and Holmes 1983; Wiggins 1990) that have been observed in some parameter regions of the GGH model. The GGH model incorporates a transient potassium current (A current) (Connor et al. 1977), a calcium-activated potassium current (KCa current), and a calcium current in addition to the currents in the HH model. The dynamics of the membrane potential v and the normalized intracellular calcium concentration c are given by the following equations:
$$C \frac{dv}{dt} = I - \bar{g}_L (v - v_L) - \bar{g}_{Na}\, m_\infty(v)^3 h\, (v - v_{Na}) - \bar{g}_K\, n^4 (v - v_K) - \bar{g}_A\, a_\infty(v)^3 b\, (v - v_A) - \bar{g}_{KCa}\, \frac{c}{0.5 + c}\, (v - v_K) - \bar{g}_{Ca}\, z\, (v - v_{Ca})$$

$$\frac{dc}{dt} = \epsilon \left[ -\bar{g}_{Ca}\, z\, (v - v_{Ca}) - c \right]$$
The dynamics of the sodium current inactivation h, the delayed-rectifier potassium current activation n, the A current inactivation b, and the calcium current activation z are given by the form 1.2. The sodium current activation m and the A current activation a are regarded as instantaneous functions of the membrane potential v; therefore it is already a partly reduced model. The input and the output layers had six units that represent the state of the GGH model, x = (v, c, n, h, z, b). The membrane potential v and the calcium concentration c were scaled between 0 and 1 to match the range of the other variables. The bottleneck layer had r = 3 units and their time constants were set initially as τ_1 = 0.2 msec, τ_2 = 6.32 msec, and τ_3 = 200 msec. The numbers of hidden units were m = l = 12 and the learning parameters were a = 1, ε = 1, α = 0.2, and β = 0.01. After training for 50,000 cycles (period T = 921.6 msec), the output x(t) was almost identical to the target output x*(t) in teacher forcing mode. Figure 6 shows an example of the output of the reduced model in autonomous mode. Its bursting waveforms had 6 spikes per burst as in the original model, although the interspike and interburst intervals were shorter in this case. Figure 7 shows the three nullclines and the bursting trajectory in the three-dimensional state space of the reduced model. The fast nullcline ẏ_1 = 0 (Fig. 7a) forms an arch with a U-shaped cross section. The medium speed nullcline ẏ_2 = 0 (Fig. 7b, medium gray) is fairly flat and intersects
Figure 6: The output of the reduced GGH model in autonomous mode. The upper six traces show the reconstructed output x(t) (black) in the output layer and the target trajectory x*(t) = (v, c, n, h, z, b) (gray). The bottom three traces show the output of the bottleneck layer y(t), which is the three-dimensional state of the reduced model. Scales are [0, 1] for x and [-1.5, 1.5] for y. The membrane potential v was rescaled from [-70 mV, 30 mV] to [0, 1] in x_1.
the fast nullcline in two regions, which correspond to the two quasistable states, one firing and one quiescent. The slow nullcline ẏ_3 = 0 (Fig. 7b, dark gray) is also flat and separates the two quasistable states; the trajectory spirals up on the left-hand side and falls down on the right-hand side. The time constants for the three variables were τ_1 = 0.13 msec, τ_2 = 7.5 msec, and τ_3 = 201 msec after learning. Because y_3 changes very slowly compared to the other two variables, we can investigate the short-term behavior of the system by taking two-dimensional slices of the state space at different levels of y_3, as shown in Figure 8. At a low level of y_3 (Fig. 8a), there is only one equilibrium point (α) in the y_1-y_2 subspace. This equilibrium is unstable since it is located in the folded section of the fast nullcline ẏ_1 = 0, and there is a limit cycle around it, which corresponds to the firing behavior. As the slow variable y_3 is increased (Fig. 8b), the fast nullcline ẏ_1 = 0 makes new intersections with the medium nullcline ẏ_2 = 0 and two more equilibrium points (β and γ) emerge. One of the new equilibria, β, is stable
Figure 7: The nullclines and the limit cycle trajectory of the reduced GGH model. (a) The nullcline ẏ_1 = 0 of the fast variable (light gray). (b) The nullclines ẏ_2 = 0 of the medium variable (medium gray) and ẏ_3 = 0 of the slow variable (dark gray). The bursting trajectory is shown as the thick lines with arrows.
and the other, γ, is a saddle point that divides the two-dimensional state space into two attractor basins. Therefore at this level of y_3, there are two distinct behaviors: one is firing with the limit cycle around the unstable equilibrium α, and another is resting at the stable equilibrium β. These two attractors are located on opposite sides of the nullcline ẏ_3 = 0 so that the slow variable y_3 increases when the neuron is firing and decreases when it is resting. When y_3 is further increased, the limit cycle collides with the saddle point γ and makes a "homoclinic orbit" (Fig. 8c; the trajectory from the saddle point γ to itself) and then disappears. This event is known as "homoclinic bifurcation" (Guckenheimer and Holmes 1983; Wiggins 1990), in which the period of the limit cycle becomes infinitely long. After the homoclinic bifurcation, the equilibrium β becomes the global attractor and the trajectory heads toward β through the narrow channel between the two nullclines ẏ_1 = 0 and ẏ_2 = 0. Since β is located on the right-hand side of the slow nullcline (dotted line) where ẏ_3 < 0, the slow variable y_3 then decreases slowly, which corresponds to the resting phase. As y_3 is decreased, the equilibria β and γ move closer and they finally collide and disappear (Fig. 7a) by a "saddle-node bifurcation" (Guckenheimer and Holmes 1983; Wiggins 1990). Then the trajectory heads toward the stable limit cycle around α, which is located on the left-hand side of the slow nullcline where ẏ_3 > 0. Thus, the system goes into a new firing phase and the above process is repeated.
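The two-dimensional slices of Figure 8 amount to a standard slow-fast dissection: freeze y_3 and study the equilibria of the (y_1, y_2) subsystem. A hypothetical sketch of that computation follows; it assumes a helper `ydot3(y, s)` returning the reduced three-dimensional vector field and uses SciPy's root finder.

```python
import numpy as np
from scipy.optimize import fsolve

def fast_equilibria(ydot3, y3, s, seeds):
    """Equilibria of the (y1, y2) fast subsystem with the slow variable y3
    frozen, the construction behind the slices in Figure 8. `ydot3(y, s)`
    is assumed to return the reduced three-dimensional vector field;
    `seeds` is a list of (y1, y2) initial guesses for the root finder."""
    def fast_field(y12):
        d = ydot3(np.array([y12[0], y12[1], y3]), s)
        return d[:2]                   # drop the frozen y3 component
    found = []
    for seed in seeds:
        sol, _, ier, _ = fsolve(fast_field, seed, full_output=True)
        if ier == 1 and not any(np.allclose(sol, e, atol=1e-4) for e in found):
            found.append(sol)          # keep distinct converged roots only
    return found
```

Scanning y_3 over a range of values and classifying the roots by the Jacobian of the fast field reproduces the saddle-node and homoclinic transitions described above.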
Figure 8: Two-dimensional sections of the nullclines of the reduced GGH model at different levels of the slow variable y_3. (a) y_3 = -0.36. (b) y_3 = -0.28. (c) y_3 = -0.2. The nullclines ẏ_1 = 0, ẏ_2 = 0, and ẏ_3 = 0 are shown by the solid, dashed, and dotted lines, respectively. The equilibrium points of the faster two-dimensional system (with y_3 fixed) are marked as solid dots (sink), hollow dots (source), and small squares (saddle). Trajectories of the two-dimensional system are shown by the thick lines with arrows.
Figure 9 shows the representation of the physiological variables of the GGH model in the three-dimensional state space. While the original variables v, c, and b are mapped fairly linearly in the reduced state space, n, h, and z are mapped highly nonlinearly. This is a remarkable result from the use of multilayer neural networks. The membrane potential depends mainly on the fast variable y_1 but also partly on y_2. The slow variable y_3 is strongly correlated with the intracellular calcium concen-
Figure 9: Representation of the original GGH state variables in the reduced three-dimensional state space. Three surfaces, for x_i = 0.25, 0.5, and 0.75, are shown in dark, medium, and light grays, respectively.
tration c. Thus we can see that the slow time course of bursting is mainly due to the intracellular calcium dynamics. The potassium activation n and the sodium inactivation h are not complementary to each other as in the case of the HH model. The calcium activation z and the A-current inactivation b are mapped in the region where the two attractor basins are separated, which suggests that they contribute to the bistability in the y_1-y_2 subspace.

5 Discussion
5.1 What Was Found by the Network Models. In our reduced HH model, the nullcline of the fast variable had an "N-shape," which is the primary feature of the FitzHugh-Nagumo and other relaxation oscillator neuron models (FitzHugh 1961; Nagumo et al. 1962; Krinsky and Kokos 1973; Hindmarsh and Rose 1982; Rinzel 1985; Rinzel and Ermentrout 1989; Rowat and Selverston 1993). As the current I is increased, the intersection of the two nullclines shifts into the middle section of the fast nullcline and a limit cycle is generated through a subcritical Hopf bifurcation. This mechanism has been found in the original HH model (Rinzel 1978) and also in two-dimensional models with "N-shape" nullclines (Rinzel 1985; Rinzel and Ermentrout 1989). Although our result was mainly a rediscovery of these prior observations, an important point is that such an insight into the neuronal dynamics was derived simply from the sample trajectories of the original system, without heuristic processes. In our reduced GGH model, the coexistence of two attractors in the y_1-y_2 subspace at an intermediate level of the slow variable was essential for the bursting behavior. Without this bistability, slow negative feedback on the variable y_3, which corresponds to the intracellular calcium dynamics, results in either quiescent or tonically firing behavior. For the system to alternate between the two quasistable states, one of them must become unstable at a higher level of the slow variable and the other at a lower level. This hysteretic mechanism for bursting has been found in a conductance-based model of the pancreatic β-cells (Chay and Keizer 1983; Rinzel and Lee 1986) and was also proposed in a conceptual model of bursting (Hindmarsh and Rose 1984). The bursting behavior of the GGH model is similar to that of the Chay-Keizer β-cell model in that the spike interval is shorter at the onset of bursting and becomes longer at the end of the burst. Such characteristics are due to the bifurcations we have seen in Figure 8; a burst starts with a transition to a limit cycle after a saddle-node bifurcation and ends with a homoclinic bifurcation (Rinzel and Lee 1986). Note that this mechanism is different from the bursting of the aplysia R-15 neuron models (Plant and Kim 1976; Plant 1981; Rinzel and Lee 1987), although the GGH model was partly based on these models. In the R-15 models, both the onset and the end of a burst are associated with "degenerate homoclinic bifurcation" (Rinzel and Ermentrout 1989) of the fast system
and accordingly the spike interval is longer both at the onset and at the end of the burst (Rinzel and Lee 1987). Although it is theoretically assured that any limit cycle trajectory can be embedded in a three-dimensional state space (Takens 1980; Brown and Bryant 1991), reduction of the bursting model to a three-dimensional model would have been very difficult with conventional linear combination schemes (Rinzel 1985; Kepler et al. 1992). As we have seen in Figure 9, the mapping between the original and the reduced variables was highly nonlinear. We have experimented with networks that do not have a hidden layer between the bottleneck and output layers. This worked well for the reduction of the HH model but not for the GGH model.
5.2 Dynamical Bottleneck Architecture. The dynamical bottleneck network architecture proposed in this paper provides a general method for constructing a low dimensional dynamical system from sample trajectories. A typical neural network approach to modeling temporal sequences is the use of feedforward networks with tapped delay lines in the input layer (Widrow and Stearns 1985; Waibel 1989; Principe et al. 1992). This approach is not suitable for dimension reduction because it requires a very long array of delay units to model a dynamical system with multiple time scales and thus leads to a high dimensional model. For example, in order to model a bursting neuron, the network must respond to fast spikes on a time scale of less than 1 msec and also capture the very slow bursting rhythm on a time scale of seconds. The use of continuous-time units with different time constants was crucial for successful learning. The use of fully connected recurrent neural networks has been another typical connectionist approach to temporal tasks (Doya and Yoshizawa 1989; Pearlmutter 1989; Williams and Zipser 1989). However, error gradient descent is not as reliable a strategy for recurrent networks as it is for feedforward networks (Kuan et al. 1990; Doya 1992). Our successful results depend on the special architecture: the network is nearly feedforward during training, and the continuous-time units are provided with different time constants.
6 Conclusions
Analysis of the phase portraits of the lower dimensional system gives deeper insight into the dynamics of the higher dimensional system. In particular, the dynamics of a six-dimensional model of the AB neuron could be approximated by a three-dimensional network model and it enabled us to investigate the bifurcation mechanisms underlying the complex behavior.
Appendix: Error Gradient Calculation
The error gradients with respect to the weights w^X and w^G and the bottleneck layer outputs y_i are given by the standard backpropagation algorithm (Rumelhart et al. 1986). In the following, g' and h' denote the derivatives of the sigmoid function f(·) at the output values g and h, i.e., g' = g(1 - g) and h' = h(1 - h).
The effect of a small change in the weights w^Y and w^H and the time constants τ_i can be estimated by the following variational equations (Doya and Yoshizawa 1989; Rowat and Selverston 1991):
$$\tau_i \frac{d}{dt}\left(\frac{\partial y_i}{\partial w^Y_{jk}}\right) = -\frac{\partial y_i}{\partial w^Y_{jk}} + \delta_{ij}\, h_k, \qquad \frac{\partial y_i}{\partial w^Y_{jk}}(0) = 0$$

$$\tau_i \frac{d}{dt}\left(\frac{\partial y_i}{\partial w^H_{jk}}\right) = -\frac{\partial y_i}{\partial w^H_{jk}} + w^Y_{ij}\, h'_j\, z_k, \qquad \frac{\partial y_i}{\partial w^H_{jk}}(0) = 0$$

$$\tau_i \frac{d}{dt}\left(\frac{\partial y_i}{\partial \tau_i}\right) = -\frac{\partial y_i}{\partial \tau_i} - \dot{y}_i, \qquad \frac{\partial y_i}{\partial \tau_i}(0) = 0$$

From their solutions, we have the error gradients as follows:
$$\frac{\partial e(t)}{\partial w^Y_{ij}} = \frac{\partial e(t)}{\partial y_i} \frac{\partial y_i(t)}{\partial w^Y_{ij}}, \qquad \frac{\partial e(t)}{\partial w^H_{ij}} = \sum_k \frac{\partial e(t)}{\partial y_k} \frac{\partial y_k(t)}{\partial w^H_{ij}}, \qquad \frac{\partial e(t)}{\partial \tau_i} = \frac{\partial e(t)}{\partial y_i} \frac{\partial y_i(t)}{\partial \tau_i}$$
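In practice the variational equations are integrated forward in time alongside the network state. The following sketch (ours) shows the idea for the simplest case, the sensitivity of a single leaky-integrator bottleneck unit τ ẏ = -y + u to its own time constant; the same pattern applies to the weight sensitivities.

```python
def leaky_step_with_sensitivity(y, p_tau, u, tau, dt):
    """One Euler step of a bottleneck unit tau * dy/dt = -y + u together
    with its sensitivity p_tau = dy/dtau, which obeys the variational
    equation tau * d(p_tau)/dt = -p_tau - dy/dt with p_tau(0) = 0."""
    ydot = (-y + u) / tau
    p_tau = p_tau + dt * (-p_tau - ydot) / tau
    y = y + dt * ydot
    return y, p_tau
```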
Acknowledgments
We thank Peter Rowat for his detailed comments on the manuscript. We are grateful to Gary Cottrell, Dave DeMers, Fu-Sheng Tsung, and Mike Casey for helpful discussions. We appreciate the comments from two anonymous reviewers. This research was supported by a grant from the Office of Naval Research, N00014-91-J-1720.
References

Brown, R., and Bryant, P. 1991. Computing the Lyapunov spectrum of a dynamical system from an observed time series. Phys. Rev. A 43, 2787-2806.
Buchholtz, F., Golowasch, J., Epstein, I. R., and Marder, E. 1992. Mathematical model of an identified stomatogastric ganglion neuron. J. Neurophysiol. 67, 332-340.
Chay, T. R., and Keizer, J. 1983. Minimal model for membrane oscillations in the pancreatic β-cell. Biophys. J. 42, 181-190.
Connor, J. A., Walter, D., and McKown, R. 1977. Neural repetitive firing, modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J. 18, 81-102.
DeMers, D., and Cottrell, G. 1993. Non-linear dimensionality reduction. In Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson, and J. D. Cowan, eds., pp. 580-587. Morgan Kaufmann, San Mateo, CA.
Doya, K. 1992. Bifurcations in the learning of recurrent neural networks. Proc. 1992 IEEE Int. Symp. Circuits Syst. 6, 2777-2780.
Doya, K., and Yoshizawa, S. 1989. Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks 2, 375-386.
FitzHugh, R. 1961. Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1, 445-466.
Golomb, D., Guckenheimer, J., and Gueron, S. 1993. Reduction of a channel-based model for a stomatogastric ganglion LP neuron. Biol. Cybernet. 68, 129-137.
Guckenheimer, J., and Holmes, P. 1983. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, New York.
Guckenheimer, J., Gueron, S., and Harris-Warrick, R. M. 1993. Mapping the dynamics of a bursting neuron. Phil. Transact. Roy. Soc., Series B 341, 345-358.
Harris-Warrick, R. M., Marder, E., Selverston, A. I., and Moulins, M. 1992. Dynamic Biological Networks: The Stomatogastric Nervous System. MIT Press, Cambridge, MA.
Hille, B. 1992. Ionic Channels of Excitable Membranes. Sinauer, Sunderland, MA.
Hindmarsh, J. L., and Rose, R. M. 1982. A model of the nerve impulse using two first-order differential equations. Nature (London) 296, 162-164.
Hindmarsh, J. L., and Rose, R. M. 1984. A model of neuronal bursting using three coupled first order differential equations. Proc. Roy. Soc. London Ser. B 221, 87-102.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane currents and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. Proc. Int. Conf. Neural Networks 1, 641-648.
Kepler, T. B., Abbott, L. F., and Marder, E. 1992. Reduction of conductance-based neuron models. Biol. Cybernet. 66, 381-387.
Kramer, M. A. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37, 233-243.
Krinsky, V. I., and Kokos, Y. M. 1973. Analysis of the equations of excitable membranes. Biofizika 18, 506-511.
Kuan, C., Hornik, K., and White, H. 1990. Some convergence results for learning in recurrent neural networks. Discussion Paper 90-42, Department of Economics, University of California, San Diego.
Nagumo, J., Arimoto, S., and Yoshizawa, S. 1962. An active pulse transmission line simulating nerve axon. Proc. IRE 50, 2061-2070.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Plant, R. E. 1981. Bifurcation and resonance in a model for bursting neurons. J. Math. Biol. 11, 15-32.
Plant, R. E., and Kim, M. 1976. Mathematical description of a bursting pacemaker neuron by a modification of the Hodgkin-Huxley equations. Biophys. J. 16, 227-244.
Principe, J. C., Rathie, A., and Kuo, J.-M. 1992. Prediction of chaotic time series with neural networks and the issue of dynamic modeling. Int. J. Bifurcation Chaos 2, 989-996.
Rinzel, J. 1978. On repetitive activity in nerve. Fed. Proc. 37, 2793-2802.
Rinzel, J. 1985. Excitation dynamics: Insights from simplified membrane models. Fed. Proc. 44, 2944-2946.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 135-169. MIT Press, Cambridge, MA.
Rinzel, J., and Lee, Y. S. 1986. On different mechanisms for membrane potential bursting. In Nonlinear Oscillations in Biology and Chemistry, Vol. 66 of Lecture Notes in Biomathematics, H. G. Othmer, ed., pp. 19-83. Springer-Verlag, New York.
Rinzel, J., and Lee, Y. S. 1987. Dissection of a model for neuronal parabolic bursting. J. Math. Biol. 25, 653-675.
Rose, R. M., and Hindmarsh, J. L. 1989. The assembly of ionic currents in a thalamic neuron, I-III. Proc. R. Soc. London Ser. B 237, 267-334.
Rowat, P. F., and Selverston, A. I. 1993. Modeling the gastric mill central pattern generator of the lobster with a relaxation-oscillator network. J. Neurophysiol. 70, 1030-1053.
Rowat, P. F., and Selverston, A. I. 1991. Learning algorithms for oscillatory networks with gap junctions and membrane currents. Network 2, 17-41.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, J. L. McClelland, D. E. Rumelhart, and the PDP research group, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Selverston, A. I., and Moulins, M. 1987. The Crustacean Stomatogastric System. Springer-Verlag, New York.
Takens, F. 1980. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, D. A. Rand and L. Young, eds., Vol. 898 of Springer Lecture Notes in Mathematics, pp. 366-381. Springer-Verlag, New York.
Toomarian, N. B., and Barhen, J. 1992. Learning a trajectory using adjoint functions and teacher forcing. Neural Networks 5, 473-484.
Tsung, F.-S., and Cottrell, G. W. 1993. Phase-space learning in recurrent networks. Tech. Rep. CS93-285, Dept. of Computer Science and Engineering, University of California, San Diego.
Waibel, A. 1989. Modular construction of time-delay neural networks for speech recognition. Neural Comp. 1, 39-46.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice Hall, Englewood Cliffs, NJ.
Wiggins, S. 1990. Introduction to Applied Nonlinear Dynamical Systems and Chaos. Springer-Verlag, New York.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Williams, R. J., and Zipser, D. 1990. Gradient based learning algorithms for recurrent connectionist networks. Tech. Rep. NU-CCS-90-9, College of Computer Science, Northeastern University.
Received April 28, 1993; accepted September 6, 1993.
Communicated by Robert Jacobs
Neural Network Process Models Based on Linear Model Structures

Gary M. Scott* W. Harmon Ray
Department of Chemical Engineering, 1435 Johnson Drive, University of Wisconsin, Madison, WI 53706 USA

The KBANN (Knowledge-Based Artificial Neural Networks) approach uses neural networks to refine knowledge that can be written in the form of simple propositional rules. This idea is extended by presenting the MANNIDENT (Multivariable Artificial Neural Network Identification) algorithm, by which the mathematical equations of linear dynamic process models determine the topology and initial weights of a network, which is further trained using backpropagation. This method is applied to the task of modeling a nonisothermal chemical reactor in which a first-order exothermic reaction is occurring. This method produces statistically significant gains in accuracy over both a standard neural network approach and a linear model. Furthermore, using the approximate linear model to initialize the weights of the network produces statistically less variation in model fidelity. By structuring the neural network according to the approximate linear model, the model can be readily interpreted.

1 Introduction
Research into the design of neural networks for process modeling has often ignored existing knowledge about the task at hand. One form this knowledge (often called the "domain theory") can take is embodied in traditional modeling paradigms. The recently developed KBANN approach (Towell et al. 1990) addresses this issue for tasks for which a domain theory (written using simple, nonrecursive propositional rules) is available. The basis of this approach is to use the existing knowledge to determine an appropriate network topology and initial weights, such that the network begins its learning process at a "good" starting point. One extension of this approach uses the mathematical form of a PID controller to determine the structure and initial weights of a neural network controller (Scott et al. 1992).

*Current address: Fiber Processes and Products Group, Forest Products Laboratory, USDA, One Gifford Pinchot Drive, Madison, WI 53705.
Neural Computation 6, 718-738 (1994)
© 1994 Massachusetts Institute of Technology
This paper describes the MANNIDENT algorithm, a method of using a traditional modeling paradigm to determine the topology and initial weights of a network. The use of linear models in this way eliminates network-design problems such as the choice of network topology (i.e., the number of hidden units) and reduces the sensitivity of the network to the initial values of the weights. Furthermore, the initial configuration of the network is closer to its final state than it would normally be in a randomly configured network. Thus, the MANNIDENT networks perform better and more consistently than the standard, randomly initialized three-layer approach. The task we examine here is learning to model a nonlinear Multiple-Input, Multiple-Output (MIMO) system. There are a number of reasons to investigate this task using the MANNIDENT neural network modeling approach. First, many processes involve nonlinear input-output relationships, which can be handled by the nonlinear nature of neural networks. Second, there have been a number of earlier successful applications of neural networks to this task (Bhat and McAvoy 1990; Bhat et al. 1990; Donat et al. 1990; Haesloop and Holt 1990; Jordan and Jacobs 1990; Narendra and Parthasarathy 1990). Finally, the resulting MANNIDENT topology is much easier to interpret than the topology resulting from the standard network modeling approach. In what follows, we introduce the principles of the MANNIDENT algorithm and then present an application of this technique to the modeling of the temperature and concentration in a nonisothermal continuous stirred-tank reactor (CSTR) that is highly nonlinear. The concluding sections describe some related work and some extensions and applications of the algorithm.

2 Design of Modeling Networks
In the modeling methodology presented here, we wish to create a neural network structure and initialize the weights using as much prior knowledge as is available. In addition, we would like the resulting neural network model to be capable of interpretation in terms of more traditional model forms. In this way, we can create efficient neural network models which can be readily understood in physical terms. To accomplish these goals, we describe in this section the MANNIDENT system, outlined in Figure 1. MANNIDENT (Multivariable Artificial Neural Network Identification) is a construct that uses knowledge-based neural networks for the task of process modeling. There are several forms that the initial knowledge can take, depending on the type of mathematical model structure chosen for the system. In our case, the knowledge used to configure and initialize the network is embodied in approximate linear models that can be easily identified with traditional identification techniques. This technique is
easily extended to other more general model forms. This section gives an overview of this approach. Note that there are essentially three phases to the process of creating a model: knowledge insertion, network training, and network interpretation. The three phases are described in detail below.

Figure 1: Overview of the MANNIDENT approach to process modeling showing the three phases of the process. I = Knowledge Insertion; II = Network Training; III = Network Interpretation. Many of the programs named in the figure are part of the CONSYD package (Holt et al. 1987).
2.1 Knowledge Insertion. The first step of the MANNIDENT process is the knowledge insertion phase. This step is important for several reasons. First, it defines the architecture of the network, thus eliminating the need for trial-and-error techniques to determine the proper number of hidden units. Second, the initialization of some of the weights to non-near-zero values gives the network a "good" starting place from which to continue learning, thus decreasing the learning time. Also, the use of knowledge-based networks tends to decrease the size of the network necessary to learn a given mapping. The initialization also tends to reduce the variability between runs, since the effect of the initial weight randomization is reduced. Finally, since the network structure and weights correspond to parameters in a traditional linearized model, a physical interpretation of the trained ANN model is possible. Consider a process that has N_in inputs, u_1, ..., u_{N_in}, and N_out outputs, y_1, ..., y_{N_out}, with a nonlinear, dynamic relationship between them. Given a time series of the inputs and the resulting outputs, a first-order approximation to the relationship can be determined. Using, for example, the MIDENT program of the CONSYD CAD package (Holt et al. 1987), or another linear identification technique, a transfer function matrix
$$G(s) = \begin{bmatrix} g_{11}(s) & \cdots & g_{1N_{in}}(s) \\ \vdots & & \vdots \\ g_{N_{out}1}(s) & \cdots & g_{N_{out}N_{in}}(s) \end{bmatrix}$$

can be written, where each element of this matrix is in the form of a Laplace transform

$$g_{ij}(s) = \frac{K_{ij}\, e^{-t_{d_{ij}} s}}{\tau_{ij}\, s + 1}$$
where K_{ij}, τ_{ij}, and t_{d_{ij}} are the steady-state gain, time constant, and time delay of each element, respectively. This represents a common form often used in empirical modeling (Stephanopoulos 1984). Greater dynamic flexibility of the ANN model can be achieved by representing the elements of this matrix by a sum of such Laplace transforms, which would result in additional hidden units in the network. A realization of this transfer function uses N_out N_in states, x_1, x_2, ..., x_{N_out N_in}, where one state is used for each element of the matrix. This would lead to the equivalent model in the time domain of
$$\dot{x}_k(t) = a_{kk}\, x_k(t) + b_{ki}\, u_i(t - t_{d_{ij}}), \qquad y_j(t) = \sum_k c_{jk}\, x_k(t) \tag{2.1}$$

where

a_{kk} = -1/τ_{ij} where k = (j - 1)N_{in} + i, and a_{lk} = 0 otherwise;
b_{ki} = K_{ij}/τ_{ij} where k = (j - 1)N_{in} + i, and b_{kl} = 0 otherwise;
c_{jk} = 1 where k = (j - 1)N_{in} + i, and c_{lk} = 0 otherwise;

for all i = 1 ... N_{in} and j = 1 ... N_{out}.
where k = ( j - l)Nil, + i. Taking the finite difference approximation, the model becomes
at
= akkxk(n)
+ bklu, ( n
-
x k ( n ) = x k ( n - 1) + A x k ( n )
2)
where n now indexes the discrete time steps. Note that this approximation is most appropriate when At << T ~ , .This results in (2.2) k
Now consider using equations 2.2 and 2.3 to configure and initialize the network shown in Figure 2, which depicts the network created for a two-input, two-output system. Note that an "Elman" style of network results (Elman 1990). The important distinction of this network is that the values of the hidden layer are recurrent rather than the values of the output layer. In the figure, solid lines indicate connections initialized with process information that involve the nonlinearity of the hidden layer. All other weights connecting a layer of units to all subsequent layers are initialized to small random numbers (shown as dotted lines). Table 1 explains the various layers of units in this network, their sizes, and interpretations. Table 2 summarizes how the weights of the network are initialized. Note that in the case of time delays, the weight from the appropriate element in the u_aug(n) layer is initialized in order to account for this delay. In the case that the delay is not an exact multiple of Δt, the weights corresponding to the units with the two nearest integer delays are proportionally initialized. The ANN resulting from this process is referred to as an ANN-CT model because it is initially based on a continuous-time transfer function model.
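Since the entries of Table 2 did not survive scanning, the following sketch shows one initialization consistent with equations 2.2 and 2.3: each state unit gets a recurrent weight 1 + Δt a_kk = 1 - Δt/τ_ij, an input weight Δt b_ki = Δt K_ij/τ_ij attached to the appropriately delayed element of u_aug(n), and a unit output weight. The function and argument names are ours.

```python
import numpy as np

def init_ann_ct_weights(K, tau, d_steps, dt):
    """Initialize ANN-CT state-update weights from an identified
    first-order transfer function matrix (a sketch consistent with
    equations 2.2 and 2.3; not the paper's verbatim Table 2).

    K, tau: (N_out, N_in) arrays of gains and time constants.
    d_steps: (N_out, N_in) integer time delays in sampling periods."""
    n_out, n_in = K.shape
    n_states = n_out * n_in
    max_d = int(d_steps.max())
    W_xx = np.zeros((n_states, n_states))            # x(n-1) -> x(n)
    W_xu = np.zeros((n_states, n_in * (max_d + 1)))  # u_aug(n) -> x(n)
    W_yx = np.zeros((n_out, n_states))               # x(n) -> y(n)
    for j in range(n_out):
        for i in range(n_in):
            k_idx = j * n_in + i
            W_xx[k_idx, k_idx] = 1.0 - dt / tau[j, i]
            # the delayed input u_i(n - d) lives in block d of u_aug(n)
            W_xu[k_idx, int(d_steps[j, i]) * n_in + i] = dt * K[j, i] / tau[j, i]
            W_yx[j, k_idx] = 1.0
    return W_xx, W_xu, W_yx
```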
Figure 2: Recurrent feedforward network with topology based on a continuous transfer function (ANN-CT model).

Table 1: Sizes and Interpretations of the Units in the ANN-CT Network.
Layer u_aug(n) (size N_in[max(d_ij) + 1]): The current input to the model and a number of past inputs depending on the maximum time delay in the transfer function. The first N_in units in the layer contain the current values, the next N_in contain the values from one time step past, etc.
Layer x(n-1) (size N_out N_in): The past states of the model.
Layer x(n) (size N_out N_in): The new states of the model calculated by the model.
Layer y(n) (size N_out): The new outputs of the model.
Table 2: Initial Values of the Weights in the ANN-CT Network. (Columns: weight, from, to, and initial value.)
2.2 Network Training. The second major step in the process is the training of the configured network. This step contains two tasks: the first is network learning using the backpropagation algorithm; the second, optional task is pruning, which removes connections (and units) that do not significantly contribute to the performance of the network. These two tasks are discussed below.
2.2.1 Backpropagation Learning. Although more sophisticated learning algorithms are available, backpropagation was used because of its simplicity and ease of implementation. Good values for the learning parameters were found empirically and used for all trials with no further modification. Training was terminated when the difference between the training error and the testing error began to diverge. This was done to prevent "memorization" of the training data, which causes a degradation of the ability of the network to generalize. The network was then pruned (see below) and retrained. Retraining was allowed to proceed for at least twice as long as the initial training in order to allow enough time for the network to reorganize, and was stopped when the termination criterion was again met after that point.

2.2.2 Weight Pruning. A relatively simple pruning technique was used here to remove unnecessary connections from the network. Since weight decay was used in training, the assumption is made that "low" valued weights are such because they have decayed to this value and are thus insignificant to the network's performance. For this reason, weights smaller than a certain fraction (usually 0-10%) of the average weight in that group of connections were set to zero and clamped to remain there. In this way, weights that contribute minimally were removed from the network.
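A minimal sketch of this pruning rule (our reading of the text, with the clamping during retraining left to the training loop):

```python
import numpy as np

def prune_group(weights, fraction=0.1):
    """Prune a group of connections as described in section 2.2.2: weights
    whose magnitude falls below `fraction` of the group's mean magnitude
    are set to zero; the returned mask can be used to clamp them there
    during retraining."""
    w = np.asarray(weights, dtype=float)
    threshold = fraction * np.mean(np.abs(w))
    mask = np.abs(w) >= threshold        # True = keep trainable
    return w * mask, mask
```

3 Modeling Example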
Here we provide an example illustrating three methods of creating a model for a process using a time series of input-output data. The time
series data was generated by a first principles model of a Continuous Stirred-Tank Reactor (CSTR) with an irreversible reaction (Uppal et al. 1976). The system consists of a well-stirred reactor in which an exothermic, first-order reaction is taking place. The parameters of this system were chosen such that it exhibited strong parametric sensitivity in the range to be modeled. In order to simplify the differential equations resulting from material and energy balances of the system, several dimensionless quantities are defined, and the inputs, disturbances, and outputs of the system are also made dimensionless. Also, since the threshold function of the units had a range of (-1, 1), the inputs and outputs of the network were scaled to be within this range. These scaled values also represent deviation variables from steady-state values. Table 3 summarizes these quantities and the constants of the system as well as the scaling used. The symbols used for this model are defined in the notation section. The resulting differential equations (after substituting for the dimensionless quantities) are given below.
3 = 1[-(u2 + l ) ( x l dt r dX2
-=
dt
1[ ( L I Z + l ) ( d l T
y1
= x1
y2
= x2
-
-
d2)
x2)
+ Da(1 - xl)exp
BDa + -(l Y
-
xl)exp
(3.1)
(-)7x2+ 1 x2
-
4x2
- UI)
1
(3.2)
The data consisted of a training set of 1000 points representing the input and the output of the CSTR sampled at one time unit intervals. The inputs to the CSTR were a linear random distribution in the ranges of the scaled values. Furthermore, the time between changes in the inputs was exponentially and independently distributed. A testing set (labeled as testing set l),distinct from the training set but similarly determined, also consisted of 1000 data points. A second testing set (labeled as testing set 2), consisting of 450 data points, consisting of individual step changes in each of the two inputs. Three models were created based on these data sets: A continuous linear model with first-order elements, a nonlinear ANN-ARMA model of the type found in the literature (Bhat and McAvoy 1990; Bhat etal. 1990; Donat et al. 1990; Haesloop and Holt 1990;Jones et al. 1989; Jordan and Jacobs 1990; Narenda and Parthasarathy 1990; Pineda 1989; Waibel 1989), and a nonlinear ANN-CT model of the architecture described above. For each of the network models, results are averaged over 10 runs each. Reported in the following section are the mean quadratic error for each of the data sets, as well as the 95% confidence intervals based on the multiple runs.
Gary M. Scott and W. Harmon Ray
726
Table 3: Summary of Quantities in CSTR Model.
Symbol
Value or range
Description
Model Parameters Nominal spacetime o f reactor 1.o 0.11 Damkiililer number 20 Activation energy 7.0
Heat o f reaction
0.5 Heat transfer coefficient 1.0 Nominal feed flow 300 Nominal feed temperature 1.Q Nominal feed concentration Inputs (Control variables) T, E 1250.3501
Coolant temperature Input feed rate
F t [0.5.1.5] Scaled Inputs Scaled coolant temperature i l l E [-1.11 Scaled input feed rate 112 E [-1,1] Outputs (Measured variables) OLItflow conccntration Outflow temperature Scaled Outputs Scaled outflow concentration !/I E 1-1.11 Scaled outflow temperature !/2 E [-I. I ] Disturbances (Noise) T, E [295.305] C A ~E
4 Modeling Results
10.9.1.11
Feed temperature Feed concentration
~
4.1 Linear Continuous Model. Using the MIDENTprogram of the CONSYDpackage, a continuous transfer function model was identified from the data set. The structure of each element of the model was assumed to be first-order with time delay, which resulted in a model with 12 parameters (4 steady-state gains, 4 time constants, and 4 time delays).
Neural Network Process Models
727
Figure 3: Concentration and temperature predictions of a continuous transfer function model with first-order elements and the ANN-CT model for testing set 1. The solid line is the ANN-CT model output, the dotted-dashed line is the linear model output, and the dotted line is the actual process output. For the input graph, 111 is represented by the solid line and 112 is represented by the dotted line. Note that the ANN-CT response and the actual process response are overlapping. The performance of this model on testing set 1 is shown in Figure 3 (shown by the dotted-dashed lines) and represents the performance with which the MANNIDENT network discussed below was initialized. As can be seen, this type of model was unable to correctly account for the differing steady-state gains for positive and negative step changes in the inputs. Rather the model seemed to average the two effects, thus undershooting the correct value in one direction while overshooting the correct value in the other direction. Also, the identified time constants seemed to be too high; that is, the identified model outputs seemed to react too sluggishly to the shown changes in the inputs. These effects are the result of the identification software attempting to find the best fit linear model to a nonlinear process. 4.2 ANN-CT Model. A nonlinear network model using the MANNIDENT method developed here was created. First, MIDENT,using the net-
work training data, identified a continuous first-order model of the pro-
728
Gary M. Scott and W. Harmon Ray
cess as given in equation 4.1. This model was then used to create and initialize a network with the structure depicted in Figure 2. Figure 3 shows its performance on testing set 1, where the solid line represents this model’s output which follows the desired output (dotted line) to a high degree of accuracy. Note that unlike the linear model, the ANN-CT model was able to represent the changing gains for positive and negative steps in the inputs. The use of additional hidden units [increasing the size of the xaL,,,(n- 1 ) and x ( n ) layers] did not improve the performance of the network, but did in fact slow down the learning process. These additional ”state” neurons were not initialized with any model information, but instead had all of their associated weights assigned small random values. This failure to improve performance with the addition of more units is an indication that the chosen architecture and network size were appropriate to this task. 4.3 ANN-ARMA Model. For comparison purposes, ANN-ARMA models were also created for the example system. For a nonlinear ARMA network model, a standard three-layer recurrent feedforward that included past outputs of the network in the input layer of the network was configured. The initial weights were chosen as small random numbers. For comparison purposes, networks containing three through nine units in the hidden layer were created; this requires 35 to 101 weight parameters which covers the range of weights in the networks for the ANN-CT MANNIDENT models. Figure 4 shows the performance of the model on the testing set for the network with nine hidden units. Note that this model was also able to show the different gains resulting from positive and negative steps in the input. However, the model did show a noticeably greater offset from the actual process than the ANN-CT model.
4.4 Discussion. Table 4 summarizes the performance of all the models described above. For each of the models, the table gives the mean training error, the mean testing error on both test sets, and the number of adjustable parameters in the model (weights and biases for the network models). For the network models, these averages were taken over 10 runs for each configuration, with each run only differing in the random initialization of the weights. In the case of the ANN-CT the random initialization refers only to those weights not initialized with model information. Also given for each of the mean errors are the 95% confidence intervals for which this mean is the true value. These data are used as a measure of the variance in the models between runs. Note that while the ANN models used anywhere from three to six times the number of parameters as the transfer function models, their applicability over a wider range of conditions means that a single ANN
Neural Network Process Models
729
Figure 4: Concentration and temperature predictions of the three-layer ANNARMA network model for the second testing set. The solid line is the network output and the dotted line is the process output. For the inputs, 111 is represented by the solid line and 112 is represented by the dotted line.
model would be sufficient where several linear models (at different operating points) would be needed. Over the range studied in the previous examples, both the ANN-ARMA models and the ANN-CT models performed significantly better than either the linear model with first-order elements or the linear model with first-over-second-order elements (99.99% confidence). A single linear model was not sufficient to adequately model the process over this range. Comparisons can also be made between the ANN-ARMA models and the ANN-CT (pruned) models. While the improvement in training errors for the ANN-CT model is not as highly significant (90% confidence for the ANN-ARMA model with 7 hidden nodes, 99.5% confidence for the ANNARMA model with 9 hidden nodes), the difference in the testing errors is significant (99.99% confidence for all ANN-ARMA models). This is an indication again that the ANN-CT models are better able to generalize to unseen testing data. However, of more importance is the statistical analysis using an F-test of the amount of variation between different runs of the same model. In all cases, the ANN-CT models showed significantly less variation between runs as compared to the ANN-ARMA models (99.99'70
Testing set I1 error
Testing set I error
*
Continuous Transfer Function Models (Linear) 0.05790 0.01426 0.05387 0.01242 0.02764 0.03165 ANN-ARMA Models 0.00222 i 0.000242 0.00178 f 0.000075 0.00356 f 0.000141 0.00188 i0.000161 0.00330 f 0.000157 0.00175 f 0.000081 0.00318 f 0.000114 0.00158 5 0.000132 0.00175 f 0.000050 0.00303 f0.000119 0.00139 f 0.000152 0.00177 f 0.000086 0.00111 f 0.000137 0.00287 f 0.000102 0.00182 f0.000066 ANN-CT Models 0.00072 f 0.000002 0.00222 f0.000001 0.00277 f0.000002 0.00032 f 0.000003 0.00218 f 0.000014 0.00168 0.000004
Training set error
2.02 2.16 2.33 2.91 3.39 0.45 3.38
46 40
Time (min)
35 46 57 79 101
12 20
Model size
"The model size refers to the number of parameters in the model. The number in parentheses for the ANN-ARMA model is the size of the hidden layer.
Unpruned Pruned
First First/Second
Model
Table 4: Comparison of Final Integral Square Error Values of ANN Models and Traditional Models."
tl
2:
Y
Y
n
2
n
5
Y
. I .
R
Ff
Y
Neural Network Process Models
731
confidence). The implication of this is that with the ANN-CT models, there is greater confidence that the model developed is close to the ”true” model. From a computational point of view, fewer trials are needed in order to have confidence in the model that is produced. Also, the ANN-CT model, on subsequent runs, converged to near the same local minimum, allowing a comparison of individual weight values, which is not possible in the case of the randomly-initialized ANN-ARMA model. Thus, the ANN-CT models show significant performance improvements over the ANN-ARMA model, so that the ANN-CT formulation would be a good model structure for nonlinear modeling. The training times between the various ANN models can also be compared. As can be seen from Table 4, the unpruned ANN-CT model trains approximately five times faster than all of the ANN-ARMA models. Also, this network is better able to generalize to the testing sets as shown by its better performance on those sets. The pruning of the ANNCT model results in a network that takes approximately the same time to develop as the best ANN-ARMA model, but which has significantly better performance.
5 Knowledge Extraction
The final step in the modeling process is the interpretation of the trained network. This is important since the ANN is essentially a “black box” and the ability to interpret the model greatly enhances confidence in the results. Also, the capability of extracting traditional models from the ANN model allows traditional controller design techniques to be used based on the extracted model. A sensitivity analysis of the network at a particular operating point is the basis for the interpretation of the network. From this analysis, a discrete state space model of the process is created which is further manipulated to create a model of the desired form. The details of each step are given below. 5.1 Sensitivity Analysis. Since the interpretation of the network’s connections produces an approximate linear model, the point of linearization must be chosen. After the network has achieved steady state at this point, a sensitivity analysis can be performed. Once this analysis is complete, the information can be interpreted to develop a discrete state space model of the process at the chosen point. The model takes the form of
x(n) y(n)
where
= =
@x(n- 1) + Pu(n) Cx(r2)
a, @. and C are constant matrices. These matrices can be seen to
Gary M. Scott and W. Harmon Ray
732
take the following values resulting from the sensitivity analysis: (1) =
ax(n ) dx'l,lg( ) I - 1)
=
f'(a')"'
which are easily calculated from the weights of the network (W", W"', and W"') and the current activation of the units (mr and a").The resulting model is in the form of a discrete state space model. 5.2 Approximate Linear Models. The result of the above analysis results in a discrete state space model of the process at a particular operating point. The model can then be transformed into transfer function models (either continuous or discrete) of the desired order. The details of the steps of the transformation are given in Scott (1993). The CONSYD package has several utilities that allow this model to be converted into other forms also useful for modeling (Holt ct a/. 1987). 5.3 Interpretation of a Trained ANN-CT Model. The trained ANNCT model from the previous section was used for the extraction of a linear model around the center operating point. The method described above resulted in an approximate linear model of the following form around the steady state in Table 3
where the degree of each element was determined to match the degree of the elements of the linear model that results from an exact local linearization of the original differential equations that describe the CSTR. This exact local linearization model is G e x a c t ( ~ )=
-1.24 2.21s2+2 H%+l (O.H4Ys+l.l21) 2.21s'fZ RSs+1
(0.416stO.411) 2.21s2t2.HSs t l -(0.2YlstO 244) 2.21s2+2.H51t-1
I
(5.2)
As can be seen, the model extracted from the network agreed very well with the local linear model in terms of the steady state gain. However, the denominator of each element shows some differences. The response of each of these models to step changes in the input is shown in Figure 5. From the figure, it is apparent that the response of the extracted model was very similar to that of the linearized model. This is an indication
Neural Network Process Models
733
0.21
-0.2
0
I
I
20
40
1
I
I
60
80
I 100
0.2 1
-0.2
0
I
I
I
I
20
40
60
80
I 100
Time
Figure 5: Comparison of response of linear model extracted from ANN-CT model and a linearized model of the actual process. The solid line is the model extracted from the ANN-CT model and the dotted line is the exact linearized model. that the extracted model is an adequate representation of the process at this operating point for small deviations from steady state. To demonstrate this further, the order of both the extracted model and the linearized model were reduced (using the CoNsYD program REDUCE) to have elements consisting of first-order responses with time delay. The extracted model and reduced linearized model are shown in equations 5.3 and 5.4, respectively.
(5.3) (5.4) L
1 Y%+l
A comparison of these two transfer functions shows that they are in good agreement, with the single exception of the time constant in the (1.2)element. This again supports the idea that the extracted model was a good representation of the process at this point. The comparisons above show that the network model is a adequate linear approximation to the process. Since large amplitude input changes
Gary M. Scott and W. Harmon
734
QZ1
-1
0
i
Figure 6: Extracted steady state gains from ANN-CT model of the CSTR as the first input is varied and the second input is held constant at zero. The solid line shows the steady state gain extracted from the network model; the dotted line shows the actual gain of the process at that point; and the dotted-dashed line shows the linear gain identified by MIDENT. were used for training the network, the linearization reflects this. This approximatation could be made better if more training set data were chosen closer to the the desired steady state. To demonstrate the fidelity of the network model over a larger range of input amplitudes, plots of the steady state gain as the first input is changed are shown in Figure 6. I n this figure, the horizontal dotted-dashed line is the constant gain from the linear model that was used to initialize the network. Through training, the steady-state gain of the network (the solid line) much more closely approximates the actual gains of the process (the dotted line). Thus the ANN-CT model captures the nonlinear gain behavior for the process and represents the response to dynamic inputs. 6 Related Work
Although many ANN applications in the literature do not consider prior knowledge, there have been other techniques developed for this purpose. As previously mentioned, the KBANN approach is one such method in which information in the form of propositional rules was used as the
Neural Network Process Models
735
basis for the structuring and initialization of the network (Towel1 et a / . 1990). It was this idea that formed the basis of the MANNIDENT system described here. The method of Haesloop and Holt uses linear information in the form of direct links from the input layer to the output layer, in addition to the standard pathways for activation through the hidden layers (Haesloop and Holt 1990). In this case, the output of the network would be calculated as the sum of the linear model and the nonlinear network, which is trained to account for only the residual nonlinear behavior. Foster et al. (1990) use an ANN to combine the predictions of traditional forecasting schemes to produce a combined forecast. Psichogios and Ungar (1992a,b) use an ANN to estimate unmeasured process parameters of a first principles model, creating a hybrid model with an ANN. Note that this typically requires the backpropagation of the error signals through the first principles model to the network. The multivariate statistical methods such a Principal Components Analysis (PCA) and Projection to Latent Structures (PLS) can easily be used in conjunction with artificial neural networks. First, PCA and PLS can be used to preprocess the data in order to form a more meaningful set of inputs and outputs from which the ANN can learn. Removing highly correlated variables in this manner should potentially reduce the size of the network as well as the time necessary to train it. The method described here has the effect of replacing the regression step in PLS algorithm, with an ANN in order to introduce nonlinearities into the model (MacGregor et al. 1991; Qin 1991; Qin and McAvoy 1991). Second, PCA and PLS can be used to assign initial weights to an ANN, which a learning algorithm such a backpropagation can further refine (Piovoso and Owens 1991). Using this method, each of the units in the hidden layer represent one of the principal components of the input data.
7 Conclusions
The MANNIDENT algorithm for network design and initial weight specification significantly improves the performance of the nonlinear network in several ways. The algorithm quickly determines a relevant network architecture without resorting to trial-and-error methods, and through initialization of the weights with prior information, gives the learning algorithm an appropriate direction in which to continue learning, thus decreasing the training time and less variability between runs. Also, since the units and some of the weights initially have physical meaning, the MANNIDENT networks are easier to interpret after training, allowing classical models to be extracted from the network. Finally, since the network structure is based on a linear model of a type often used in modeling, the MANNIDENT algorithm is able to build on what is familiar.
Gary M. Scott and W. Harmon Ray
736
Further enhancements using the MANNIDENT method are possible and are discussed more fully in Scott (1993) and Scott and Ray (1993a). Exploring different operating regions of the nonlinear CSTR, the ANN-CT network was able to model multiple steady state and limit cycle oscillations when the training data set contained these phenomena. Further enhancements expand the generality of the ANN-CT model. For example, one may use higher order continuous models or a discrete linear model of the process as the basis for the network. Since these models can be of arbitrary order, more complex network structures result. An example is given in Scott (1993) and Scott and Ray (1993a). Finally, the ANN-CT models can be incorporated into nonlinear model-based controllers, which show excellent performance in the face of setpoint changes and process disturbances and have excellent robustness properties (Scott and Ray 1993b). 8 Notation
time domain model coefficients network threshold function subscript indices transfer function models linear model parameters number of process inputs and outputs discretization time step of model linear model inputs network inputs scaled deviation inputs weight matrix from x to y linear model states network states (hidden units) linear model outputs network outputs scaled deviation outputs discrete state space parameters total input to unit CSTR model nomenclature A = heat transfer area CA = outlet concentration c A ~ = feed concentration C A ~ = nominal feed concentration C,, = heat capacity E = reaction activation energy F = volumetric feed rate
Neural Network Process Models
737
nominal feed rate heat transfer coefficient heat of reaction kinetic preexponential factor gas constant time feed temperature coolant temperature feed temperature nominal feed temperature reactor volume densify
References Bhat, N., and McAvoy, T. 1. 1990. Use of neural nets for dynamic modeling and control of chemical process systems. Compirt. CIie~ii.Eiig. 14, 573-583. Bhat, N. V., Minderman, P. A., McAvoy, T., and Wang, N. S. 1990. Modeling chemical process systems via neural computation. I € € € Coiitrol Syst. Mag. 10, 24-29. Donat, J. S., Bhat, N., and McAvoy, T. J. 1990. Optimizing neural net based predictive control. In Aniericari Control Confereiicc, Vol. 3, pp. 2466-2471. IEEE, Sail Diego, CA. Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211. Foster, B., Collopy, F., and Ungar, L. 1990. Neural network forecasting of short, noisy time series. Tech. Rep., University of Pennsylvania. Haesloop, D., and Holt, B. R. 1990. A neural network structure for system identification. In Aiiiericuri Control Coi!fcrencr, Vol. 3, pp. 2460-2465. IEEE, San Diego, CA. Holt, B. et al. 1987. CONSYD: Integrated software for computer aided control system design and analysis. Cotripiit. Cheiri. Eiig. 11(2), 187-203. Jones, R. D., Lee, Y. C., Barnes, C. W., Flake, G. W., Lee, K., Lewis, P. S., and Qian, S. 1989. Function approximation and time series predication with neural networks. Tech. Rep. LA-UR 90-21, Los Alamos National Laboratory. Jordan, M. I., and Jacobs, R. A. 1990. Learning to control an unstable system with forward modeling. In Advances ill Nriirnl Inforirintioii Procc~ssingSystems, Vol. 2, pp. 325-331. Morgan Kaufmann, San Mateo, CA. MacGregor, J. F., Marlin, T. E., Kresta, J. V., and Skagerberg, B. 1991. Multivariate statistical methods in process analysis and control. In Foirrtli Interntional Confercricc 011 C/imiicnl Process Coiitrol, Y. Arkun and W. H. Ray, eds., pp. 7999. CACHE, AIChE, Padre Island, TX. Narendra, K. S., and Parthasarathy, D. 1990. Identification and control of dynamical systems using neural networks. / € € € Tmtisnct. N[wral Net7c~orks1(1), 4-27. Pineda, F. J. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neiiral Conip. 1, 161-172.
738
Gary M. Scott and W. Harmon Ray
Piovoso, M. J., and Owens, A. J. 1991. Sensor data analysis using nrtifical neural networks. In Foirrtli Iiiferntiorinl Cofercwcc7m Clir~iiiicnlProccss Corrtrol, Y. Arkun and W. H. Ray, eds., pp. 101-118. CACHE, AIChE, Padre Island, TX. Psichogios, D. C., and Ungar, L. H. 1992a. A hybrid neural network-first principles approach to process modeling. A K h € /. 38(10), 1499-1511. Psichogios, D. C., and Ungar, L. H. 1992b. Process modeling using structured neural networks. In Proceediiigs of t l i ~2992 Airrcricmi Coirtrol Cotlfi.r~wc pp. 1917-1921. IEEE, Piscataway, NJ. Qin, S. J. 1991. Neural net PLS approach to dynamic modeling: Method and application. Tech. Rep., Department of Chemical Engineering, University of Maryland. Qin, S. J., and McAvoy, T. J. 1991. Nonlinear PLS modeling using neural networks. Tech. Rep., Department of Chemical Engineering, University of Maryland. Scott, G. M. 1993. Knowledge-based artificial neural networks for process modelling and control. Ph.D. thesis, University of Wisconsin, Madison, WI. Scott, G. M., and Ray, W. H. 1993a. Creating efficient nonlinear neural network process models that allow model interpretation. I. Proccss Coiitrol 3(3), 163178. Scott, G. M., and Ray, W. H. 1993b. Experiences with model-based controllers based on neural network process models. /. Procc.ss Coritrol 3(3), 179-196. Scott, G. M., Shavlik, J. W., and Ray, W. H. 1992. Refining PID controllers using neural networks. Nciirnl Comp. 4(5), 746-757. Stephanopoulos, G. 1984. Climicnl Process Control: Ati Iiifrodirctiori to Tlicwy orid Practicc. Prentice Hall, Englewood Cliffs, NJ. Towell, G., Shavlik, J., and Noordewier, M. 1990. Refinement of approximate domain theories by knowledge-base neural networks. In €ig/rt/r Nntioiml Corlfcrmcr 0 1 1 Artificinl I n t e l l i q ~ n c ~pp. , 861-866. AAAl Press, Menlo Park, CA. Uppal, A., Ray, W. H., and Poore, A. B. 1976. The classification of the dynamic behavior of continuous stirred tank reactors-influence of reactor residence time. Chrrri. E t i g . Sci. 31, 205-214. Waibel, A. 1989. Modular construction of time-delay neural networks for speech recognition. Nciirnl Coiiip 1, 3946. ~
Received May 6, 1993, accepted November 4, 1993
This article has been cited by: 2. Richard D. De Veaux, Rod Bain, Lyle H. Ungar. 1999. Hybrid neural network models for environmental process control (The 1998 Hunter Lecture). Environmetrics 10:3, 225-236. [CrossRef]
Communicated by Erkki Oja
Stability of Oja’s PCA Subspace Rule Juha Karhunen Hrlsirzki Utiiversity of Edirzology, Lnbomtory of Corrrpirfcr nrid Rnkcnfajnnnukio 2 C, FIN-02150 Espoo, Firilnrid
/ ~ $ w r i i n f i m i Scicncr,
This paper deals with stability of Oja’s symmetric algorithm for estimating the principal component subspace of the input data. Exact conditions are derived for the gain parameter on which the discrete algorithm remains bounded. The result is extended for a nonlinear version of Oja’s algorithm. 1 Introduction
Principal eigenvectors of the data covariance matrix or the subspace spanned by them, called PCA subspace, provide optimal solutions to several information representation tasks. Recently, many neural approaches have been proposed for learning them (see, e.g., Hertz r t a/.1991; Oja 1992). A well-known algorithm for learning the PCA subspace of the input vectors is so-called Op’s subspace rule (Oja 1989; Hertz et a / . 1991): Wk+l = Wk
+ /rA[I wkwl]xkxlwA -
(1.1)
In the symmetric algorithm 1.1 the columns of the L x M-matrix Wk = [wk(l).. . . , wk(M)],L 2 M are the weight vectors of the M neurons after k iterations. A Hebbian type term xkx[wI(, product of the linear output xlwk(i) and the L-dimensional kth input vector XA for the ith neuron, is mainly responsible for the learning. The gain parameter 2 0 controls the learning rate. The additive nonlinear constraint WI(W[xkxlWk prevents different weight vectors from becoming too similar, and stabilizes the algorithm. In 1.1, the constraint roughly orthonormalizes the weight vectors: WiWk E I. The special case M = 1 yields the standard Op’s single neuron rule. Several authors (e.g., Hertz et a/.1991; Oja 1992) have shown that the averaged differential equation corresponding to 1.1 converges to the Mdimensional PCA subspace of the input vectors. This kind of asymptotic analysis yields the limiting values of 1.1, but is not alone sufficient for an exact convergence proof. One must in addition prove the convergence of the differential equation globally, and show that 1.1 itself is stable, that is, the weight vectors must remain bounded on some realistic conditions. Because of the nonlinearities, both these tasks are difficult though Ntvtrol Coriiyirtotiori 6, 739-747 (1994)
@ 1994 Massachusetts Institute of Technology
Juha Karhunen
740
definitely worthwhile (Hornik and Kuan 1992). X u (1993) has recently provided some such global analysis for 1.1. The only known boundedness condition for discrete PCA algorithms is given in Lemma 5 in O p and Karhunen (1985) for Op’s single neuron rule and is not exact. In the following, a stability theorem is proved for the discrete algorithm 1.1. It is noteworthy that the analysis is accurate, yielding an exact upper bound for the gain parameter. This is demonstrated by a simple example. The theorem is extended to a nonlinear generalization of 1.1, and the results are discussed shortly. 2 The Main Theorem
Theorem 1. Oja’s subsyaccalgoritlinz 1.1 is stable, that is, the iioriii ofthe nratrix W&is boutidid for all k, if the gain yaranreter satisfies at e71cry iteratioii tlzc coiiditioii 05
//L
L 2/ I1 X k (I2
(2.1)
and the iizitial zocight nratrix W1 is chosen so that the largest eigt~iiz~alue of W1WF is at niost 2. I f the largest eigcnvaliic. of the inatrix wlwf or imrL’ generally wAw[ is X I > 2, / / A itlust satisjij the coiiditiori
Proof. We prove the theorem by deriving for / / , k the conditions on which the norm of the matrix Wk remains bounded for all k. Note that the gain is the only parameter in 1.1 that can be chosen freely at each iteration after initialization. We use the 2-norm, which is defined as the square root of the largest eigenvalue XI of the matrix WTWk, and is compatible with the usual Euclidean vector norm.
It seems easier to analyze the boundedness of the matrix Yk = WkWl’, since Yk is always a square matrix while Wk is generally an L x M matrix. From the singular value decomposition theorem it follows that the matrices WlWk and WkWT have the same nonzero eigenvalues X I 2 2 . . . AM. Denote the corresponding normalized eigenvectors of Yk by el. . . . . eM. The additional zero eigenvalues AM+, . . . . . XI. o f Yk have no significance in our analysis. Clearly, the norms obey the simple relationship llYkll = X I = llWk1I2. We have recently shown (Karhunen and Joutsensalo 1993) that 1.1 is actually a stochastic gradient ascent algorithm for maximizing the criterion \(W) = tr(W’R,,W) under the constraint that the weight vectors of the neurons must be mutually orthonormal: WTW = I. Here R,, = E(xxT) is the correlation matrix of the input vectors, I is unit matrix, and tr denotes trace. If the orthonormality constraint WTWk = I holds exactly,
Stability of Oja's PCA Subspace Rule
741
all the nonzero eigenvalues of Yk equal to unity. Generally, W[Wk # I in 1.1. In its stability region, 1.1 tends then to decrease the eigenvalues of YI; (or equivalently the norms 11 wk(i) 11) that are greater than unity, and increase the eigenvalues that are smaller than unity. Both the cases must be discussed in a complete stability analysis. YI;, and consequently Wk, will be stable if there exists a constant H such that (1 YI; 11 5 H for all k. We derive the stability condition by requiring that 1) YI;+~11 5 (1 Yk 11 at each iteration. One method of determining the norm I( Yk+l 11 is to find the unit vector a that maximizes the quadratic is the largest eigenvalue of form Q k + l (a) = a'Yk+,a. The maximum of QI;,.~ Yk+l, and it is achieved when a is the corresponding principal eigenvector. Thus, we can equivalently require that Qk+l(a) I XI = II Yk
11
I1 a II=
for all
1.
(2.3)
From 1.1,
+ //,k[I Yk]xkxlYI;+ //kYI;xkxl[I - YI;] + /!;[I Yk]XkX$YyXI;XL[I Yk]
Yk+l = Yk
-
-
-
(2.4)
and Qk+l(a) = a'rYka
+ 2//k[arxk
+ ,r;[arxk
-
-
a'Ykxk]x[Yka
a'~~xk]*x,T~kx~
(2.5)
Now we will find quite generally the vectors a and xk that yield the possible minima and maxima of Qk+l(a). The problem is meaningful . we require only if a constraint is imposed on the norms of a and x ~ Thus and consider the that a'a = 1 and XTXI; = c, where the constant c = criterion Ik+l(a.xk) = Qk+l(a) + //1(ara- 1)+ / / ~ ( X ~ X I ; c)
(2.6)
where //1 and rl2 are Lagrange multipliers. The extremizing values can be found in a standard way by computing the gradients of JI;+l with respect to a and xk and equating the results to zero. This can be done exactly in a straightforward manner, but here it is sufficient to observe that the gradients of all the involved scalar terms x[x~, a'a, aTxk, a'YkxI; = x,TYka, a'Yka, and x ~ Y ~ xare I ; proportional to the , Yka due to the symmetry of the matrix Yk. Thus vectors a, Xk, Y ~ x Aand we end up in two nonredundant equations having the general form
+ +
rrlYka 02a rt5Yka oha
= =
ct3YI;xk fi7YkXk
+ 04Xk + (t8xk
(2.7) (2.8)
Here r t 1 . . . . , rr8 are scalar coefficients that have in some cases a rather complicated form. One can always eliminate the vectors Ykxk and Yka from the above equations, resulting in an equivalent pair of equations (2.9) (2.10)
Juha Karhunen
742
where ,j,s are again scalars depending on u p . Solving xk from 2.10 and inserting it into 2.9 yields finally an equation having the general form
+
(2.11)
[-/lyZ 2 r ~ d a=
where 31, 22, and 27 are scalar coefficients. But 2.11 can hold for a # 0 only if a is an eigenvector of + 1 2 Y k . Since this matrix has the same eigenvectors as Yk, a must be one of the (unnormalized) eigenvectors of Yk. Finally, the unit norm constraint yields for the possible extremizing M. A similar procedure shows that XI. must also values a = &el,i = 1, be an eigenvector of Yk with possible values xk = 11 XI 11 e l . It follows easily that
*
* 11
lk+i(*ef5
XA
11 el) =
Qk+i(e,)= !&(el) = A,
if I # I since then e:e, = 0. Thus I( Yn+l 11 5 11 Yk 11 holds always in this case. If a and xk correspond to the same eigenvector e, of Yk, a’xk = 11 XI11, aTYkxa = &A, 11 xk 1 , aTYAa = A,, and X : Y ~ X ~ = A, 11 xk 1 ’. Inserting these quantities into 2.5 and 2.6 yields now
*
IA+ ) ( * e l . I( XA 11 e , ) =
QA+l(e,)= A,[1 t /!A(l
~
A,)
II xi 11212
(2.12)
independently of the chosen signs. The boundedness requirement Q ~ + l ( e 5, ) A1 must be satisfied for all nonzero A,, which gives the inequalities
We consider first the eigenvalues A, > 1. The inequality 2.13 is tightest for the largest eigenvalue XI, and gives the condition 2.2. This is natural, since it corresponds to the most critical direction e l that yields the norm 1) Yk 11 = XI. Consider then the eigenvahes A, < 1. From 2.13, we get other meaningful upper bounds for //k: (2.14) Clearly, this may yield a tighter bound than 2.2 if, for example, XI is slightly larger than unity. We conclude that the general stability condition is the tightest of the upper bounds of 1.1 is 0 5 //(k)5 /I,,,, where,,,(/ 2.2 and 2.14, which must be evaluated for all nonzero A, < 1. This general condition is difficult to use in a practical neural algorithm, since it requires knowledge of the eigenvalues of the matrix YA at every iteration k. A simpler stability condition can be derived in the following way. First, we seek the A, < 1 that maximizes 2.12. Differentiating 2.12 with respect to A, and equating the result to zero yields the xk 112) and roots A, = 1 + l / ( / /11 ~ A,
==
1/3 f 1/(3//A11
XA
\I2)
(2.15)
Stability of qa’s PCA Subspace Rule
743
The first root is not interesting, since it is greater than unity and yields the minimum. The latter root yields the maximum of 2.12 and belongs to the desired interval (0.1) if / / A > 0.5 11 Xk 11-2. Observe that if / / A < 0.5 11 XI 11 -2, the maximum of 2.12 for A, 5 1 is 1 and is achieved when A, = 1. This means that unstability cannot occur for A, 5 1 if / / A < 0.5 11 XA We continue our analysis by using the upper bound / / A = 2/[(X1- 1) 11 XA 1’1 from 2.2 in 2.15, which yields A, = ( A 1 - 1j/6. Now the smallest possible value of A1 can be found by computing the maximum of 2.12 for these values of / i k and A, and equating the result to the norm A1 of Yk. This procedure leads to the third order equation 2( A1 1)’ = 27x1( A 1 - 1j2, which has the roots A1 = 2, 0.2, and 0.2. The only meaningful solution is x1 = 2. Hence = 2 11 xk 11-2 is the largest value of the gain parameter for which 1.1 remains bounded (provided that 11 Y1 115 2). For this i l k , both the cases A, < 1 and A, > 1 can yield (1 Yk+l 11 = 1) Yk 11. Thus we have derived and justified the condition 2.1. From the analysis above it follows also that if A1 > 2, the upper bound of / i k is given by 2.2; the conditions 2.14 need not be taken into account 0 in this case. This concludes the proof.
+
The analysis in the proof gives a good insight into the behavior of 1.1. The stability condition 2.1 depends on the squared norm llxk)1’ = X,’XA of the input vector only. This can be computed easily or replaced by a suitable upper bound. A fast initial convergence is usually achieved when / i k is roughly in the range 0 5/11xk11’. . l/llxk11’. For improving the estimation accuracy, / i k can gradually be made smaller later on. The upper bound 2.2 has a natural form: it is inversely proportional to (1 xk 11’ and to the ”distance” A, - 1 of the squared norm XI of Wk from its stable value unity. 3 Simulation Example
The accuracy of the stability bound 2.1 is demonstrated in a simple but illustrative example. In this case, the data vectors xk were generated randomly from the uniform distribution defined by the bounding parallelogram in Figure 1. In addition, Figure 1shows the learning trajectories of the two weight vectors Wk(1) and WA(2) computed using 1.1. The final values of the weight vectors are marked by the straight lines starting from the origin. In Figure 1, the gain parameter was = 1/ I( xk [ I 2 ; Figures 2 and 3 show the respective trajectories for //k = 1.3,’ )I xk 11’ and (2 + (1 xk /I2. In Figure 1, 1.1 converges quickly, and the weight vectors stay in a relatively small region after initial convergence. Figure 2 shows that a somewhat larger gain parameter makes the weight vectors highly variable, and they do not actually converge to any final values even in the mean sense. In Figure 3, the weight vectors first move a long time about on the boundary of the limiting stability circle with the radius 11 w 11 = Since the gain parameter is slightly larger than the upper
a.
Juha Karhunen
744
,
1.5
1
1-
0.5 -
0-0.5 -
-1 -
-1.5 -2
-1.5
-1
-0.5
0
0.5
1
1.5
2
Figure 1: The data vectors in the example are uniformly distributed inside the parallelogram. The figure shows also the learning trajectories of the weight vectors given by oja’s subspace rule 1.1 when the gain sequence was / / A = 1/ 11 XI, l12. The final values are marked by the straight lines starting from the origin.
bound 2/ 11 XI, the weight vectors eventually escape from this circle and 1.1 “explodes.” If / r k is slightly smaller than 2/ 11 XI( I 2, the weight vectors remain just inside the limiting stability circle, but the update at each iteration is clearly too large for achieving any kind of convergence. 4 Stability of a Nonlinear Generalization
O p ’ s PCA subspace algorithm 1.1 can be modified in several ways to
include explicit nonlinearities (Karhunen and Joutsensalo 1993). A direct generalization of 1.1 is WA+1= WA
+ / l A [ I - wkw:’]xkg(x:wA)
(4.1)
Here the function g ( t ) is applied separately to each component of the argument vector. For stability reasons, g(t ) is usually a monotonic odd function, for example, tanh(t ) suitably scaled. In Karhunen and Joutsensalo (1993), we have related 4.1 to an optimization criterion, showing that it is a kind of robust PCA subspace algorithm.
Stability of Oja’s PCA Subspace Rule
745
1.5
-2
-1.5
-1
-0.5
0
0.5
1
1.5
2
Figure 2: The learning trajectories of the weight vectors given by q a ’ s subspace rule 1.1 when the gain sequence was 1i.k = 1.3/ 11 xk 112. Theorem 1 can be generalized to the nonlinear algorithm 4.1 as follows. Corollary 1. The algorithm 4.1 is stable if the conditions of Theorem 1 hold and I g ( t ) I 5 I t lfor all t, that is, the oddfunction g ( t )grows at most lineurly. Proof. From 4.1, we get for the ith weight vector wk(i)the update formula
wk+l(i) = wk(i) +
- ~~~:Ix&kTwk(i)I
(4.2)
The only difference in 4.2 with the respective formula for 1.1is the nonlinearity g ( t ) . If we define a new gain parameter 11; = pkg[x[wk(i)]/x[wk(i), the two update formulas become exactly the same. Hence we can apply Theorem 1 for p;, and conclude that 4.2 is stable (bounded) if 05
/Lk
5 /1;x:wk(i)/g[xkTwk(i)]
(4.3)
This is different for each wk(i).Generally, one should compute 4.3 for i = 1... . . M , and take the smallest interval for the stability region of the matrix algorithm 4.1. But if g ( t ) grows at most linearly, or Ig(t)l 5 Itl, the upper bound in 4.3 is always at least as high as for 1.1 and 4.1 is guaranteed to be stable whenever 1.1 is. This verifies Corollary 1. o
Juha Karhunen
746
1.5 -
1-
0.5 -
0-0.5 -
-1 -
-1.5 -
-2
I_
-1.5
-1
-0.5
0
0.5
1
1.5
2
Figure 3: The learning trajectories of the weight vectors given by 1.1 when the gain sequence was = (2+10-’)/ 11 x i 1 ’. The figure shows that Oja’s subspace rule 1.1 becomes eventually unstable as the weight vectors grow outside the stability region, which is a circle with radius 11 w 1) = fi and centered at the origin.
If g(t ) grows less than linearly with t, 4.3 is actually more robust than 1.1, for example, against impulsive noise and outliers. 5 Concluding Remarks
The theorem proved in this paper guarantees that O p ’ s PCA subspace rule 1.1 remains stable on reasonable conditions. Combined with earlier results, it justifies convergence of 1.1 to the PCA subspace of the input vectors with standard assumptions. The piece still lacking from a strict convergence theorem is a complete global analysis of the corresponding averaged differential equation. The analysis presented in the proof helps to understand the properties of 1.1 and is therefore itself useful. The stability theorem can easily be generalized to a nonlinear (robust) variant of 1.1. It would be worthwhile to derive similar stability results for other PCA type learning algorithms, but this is still an open research problem.
Stability of Op’s PCA Subspace Rule
747
Acknowledgment The author is grateful to Jyrki Joutsensalo for providing the simulation example.
References Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introdiictioii to the Theory of Neirral Computation. Addison-Wesley, Reading, MA. Hornik, K., and Kuan, C.-M. 1992. Convergence analysis of local feature extraction algorithms. Neural Networks 5, 229-240. Karhunen, J., and Joutsensalo, J. 1993. Representation and separation of signals using nonlinear PCA type learning. Neural Networks, in press. Qa, E., and Karhunen, J. 1985. On stochastic approximation of the eigenvectors and eigenvalues of a random matrix. Int. 1. Math. Andy. Appl. 106, 69-84. Oja, E. 1989. Neural networks, principal components, and subspaces. Int. 1. Neiiral Syst. 1, 61-68. Oja, E. 1992. Principal components, minor components, and linear neural networks. Neiiral Networks 5, 927-935. Xu, L. 1993. Least mean square error reconstruction principle for self-organizing neural-nets. Neural Networks 6, 627-648. _ _ _~ __ ~ ~ Received July 9, 1993; accepted October 15, 1993. ~.
This article has been cited by: 2. Y. Miao, Y. Hua. 1998. Fast subspace tracking and neural network learning by a novel information criterion. IEEE Transactions on Signal Processing 46:7, 1967-1979. [CrossRef] 3. C. Chatterjee, V.P. Roychowdhury, E.K.P. Chong. 1998. On relative convergence properties of principal component analysis algorithms. IEEE Transactions on Neural Networks 9:2, 319-329. [CrossRef] 4. J. Karhunen, E. Oja, L. Wang, R. Vigario, J. Joutsensalo. 1997. A class of neural networks for independent component analysis. IEEE Transactions on Neural Networks 8:3, 486-504. [CrossRef]
Communicated by Andrew Barto
Supervised Training of Neural Networks via Ellipsoid Algorithms Man-Fung Cheung Kevin M. Passino Stephen Yurkovich D y r t r i i e r i t qf Ekctricnl Etzgiiiwriti‘y, The Ohio Stntc Uriizvrsity, 21115 Nril
Aveiilie, C o l i i i i i b i ~ s OH ,
43210 USA
In this paper we show that two ellipsoid algorithms can be used to train single-layer neural networks with general staircase nonlinearities. The ellipsoid algorithms have several advantages over other conventional training approaches including (1) explicit convergence results and automatic determination of linear separability, (2) an elimination of problems with picking initial values for the weights, (3) guarantees that the trained weights are in some ”acceptable region,” (4) certain “robustness” characteristics, and (5)a training approach for neural networks with a wider variety of activation functions. We illustrate the training approach by training the MAJ function and then by showing how to train a controller for a reaction chamber temperature control problem. 1 Introduction
In this paper we will introduce two ellipsoid algorithms that have been used in system identification and parameter estimation to the training of artificial neural networks (ANN) (Widrow and Lehr 1990; Barto 1989; Lippmann 1987; Beale and Jackson 1990; Antognetti and Milutinovic 1991) with general staircase nonlinearities. The utility of the ellipsoid algorithm [some of the earliest uses for parameter set estimation appear in Fogel and Huang (1982)l is motivated by its ease in use and implementation, where its recursive nature makes the training process very attractive computationally. The staircase nonlinearity is a generalization of the hard limiter and such a generalization can be useful in classifying patterns into several linearly separable regions (such as parallel strips in two dimensions). The two ellipsoid algorithms that we propose to use for training are the “Optimal Volume Ellipsoid” (OVE) (Cheung 1991; Cheung rt al. 1993) algorithm and the ”Underbounding Ellipsoid” (UBE) algorithm introduced here. The OVE algorithm results in an ellipsoidal set overbounding the feasible set of weights while the UBE algorithm results in an underbounding ellipsoid inscribed inside the feasible set of Nritrol Corrrjwtntioir 6, 748-760 (1Y94)
@ 1994 Massachusetts Institute o f Technology
Neural Networks via Ellipsoid Algorithms
749
weights consistent with the training data set. It is guaranteed that the center of the ellipsoid using the OVE algorithm is a feasible solution after the algorithm converges (that is, convergence in the weight estimates if there exists a solution). In fact, all weights inside the final ellipsoid from the UBE algorithm are feasible if the UBE ellipsoids' final volume is nonzero. Several applications are studied and the paper closes with a discussion on the advantages/disadvantages of OVE/UBE. 2 The OVE/UBE Algorithms
The objective in parameter set estimation is to identify a feasible set of parameters that is consistent with the measurement data and the model structure used. One can interpret the set estimate as some nominal parameter estimate accompanied by a quantification of the uncertainty parametrically around the nominal model. An important feature in parameter set estimation is the guaranteed inclusion of the true mapping that is not exactly known. In this section we will provide a brief introduction to the OVE algorithm introduced in Cheung (1991) and Cheung et al. (1993) for parameter set estimation in system identification and a discussion on finding an underbounding ellipsoid for the feasible parameter set consistent with the training data set. The underbounding ellipsoid (UBE) algorithm is similar to the overbounding ellipsoid algorithm (OVE), except that the underbounding ellipsoid has the feature that all points in the ellipsoid are feasible parameters. In the next section we will show how the algorithms can be used to train neural networks. Consider the following kth pair of parallel linear constraints
Iyk - WTXkl 5 7 where yk E Y?, 2 E \R, Xk E CR' are known, and W E 'K' is the unknown parameter (weight) vector. Let 3kc 8' be the set of feasible parameters given k constraints. That is, ;Fk = { W : lyl - W'X,I 5 7 . i
=
1.. . ..k}
(2.1)
Define also
FA
=
{ W : Iyk
Ea
=
{W: (W-
-
WTXaI I 7 )
(2.2)
wh)Tp['(w-Wi)}
(2.3)
and where W,, the kth estimate of the unknown parameters, is the center of an ellipsoid €a and Pk' is a positive definite matrix that characterizes the size and shape of the ellipsoid. Suppose the ellipsoid E k is an overbounding ellipsoid for 3A.The OVE algorithm finds the smallest volume ellipsoid € ~ + 1 containing the
M.-F. Clieung, K. M. Passino, and S. Yurkovich
750
intersection of .Fk,.I and Ek. The intersection is essentially the portion of El cut out by the two parallel hyperplanes defined in S k . Theorem 1. The OVE algorithm is comprised of the following recursive equations:
where if
(a)
#
and
!fA,
Tk
where
then
is the real solution of
(II
and
/jk
are defined as
Proof. The proof appears in Cheung (1991) and Cheung t>t nl. (1993) and is available from the authors on request. Suppose now that the ellipsoid E k is an undevbounding ellipsoid for .Fk. The underbounding ellipsoid algorithm finds the largest volume ellipsoid Ek+1 underbounded in the intersection of &+I and E l such that (1) the center of the new ellipsoid is located midway between the parallel hyperplanes defined in 3k+, and (2) the new ellipsoid touches both hyperplanes. Theorem 2. The UBE algorithm is comprised of the following recursive equations:
Neural Networks via Ellipsoid Algorithms
751
where
and
hk
is the solution of
0;
+ (Tt
-
/j;
-
1)hk
+ /j;
=0
such that h k T k / ( h k - /j?) 5 1; ( i k and / j A are defined as in Theorem 1. If o k > 1, reset i l k to 1 4 - [ ( q- 1)/2] and then reset (rk to one; on the other hand, if 2jj - (?k > 1, reset / j h to (1 + o k ) / 2 . Proof. In this proof, the same approach in deriving the OVE algorithm is used here and therefore one has the same form of updating equations for w k and I$. Using the affine transformation w = w k + ] W where PA = ]IT, the ellipsoid E k is transformed to a unit radius hypersphere denoted as Ek
=
{ w : wTw5 1)
and the new ellipsoid may be parameterized as
such that the Wl is perpendicular to the transformed parallel hyperplanes. Since Tk [given by ( u k - I & ) ] is the midpoint between the hyperplane on zul axis and since the ellipsoid touches both hyperplanes, the semiaxial length of the new ellipsoid along axis is therefore /jk as the hyperplanes are 2/& apart in the transformed coordinate. Finding the largest volume &+I is equivalent to finding the largest h, such that & + I is still an underbounding ellipsoid. The largest allowable h k is achieved when the surfaces of the two ellipsoids E k and .&+] touch each other. The surfaces are
6:+ w;+ . + zu; ' '
=
1
and
After some manipulation, one gets
Since the surfaces touch each other, then the discriminant in 2.4 must vanish and as a result, one gets
752
M.-F. Cheung, K. M. Passim, and S. Yurkovich
I
L
Figure 1: Examples of staircase nonlinearities. I t is obvious that the solution for ZUI when the surfaces touch must be less than (or equal to) one, and therefore only the hi that results in 7Ul = h k ~ i / ( h k- Iff)5 1 is valid. This completes the proof of Theorem 2.
The OVE and UBE algorithms can be initialized with a sufficiently large Eo containing the feasible parameter set, where W,, = 0 and Po = !I (with 0 < f << 1) are typical starting values. Note that for the OVE algorithm, if E,, contains the parameter set, so does all other €A and it is therefore important to make sure that Eo is large enough so that the new E k computed are meaningful. For the UBE algorithm, after iterating through the entire data set once, the resulting ellipsoid will be an underbounding ellipsoid for the data set in which every point in the ellipsoid is a feasible parameter vector consistent with the training set. However, the underbounding ellipsoid may vanish even though the feasible parameter set is nonempty. 3 Supervised Training of a Single Perceptron via OVE/UBE
Consider a single-layer perceptron with one neuron that has n inputs, XI... . .x,,, and one output y; w o is the weight on the fixed bias input SO = 1; Z U I . . . . .zu,, are the weights on the inputs xl.. . . .x,,, respectively; z is the output of the summer and the input t o f ( . ) wheref(.) is a fixed ”staircase nonlinearity function” that has nr + 1 distinct quantized steps such that f(a,) = b, as shown in Figure la. As a matter of fact, the staircase nonlinearity shown in Figure 1 can have arbitrary shape as long as the steps have distinct levels, and it can also be used to approximate a sigmoid function and other nonlinearities (for example, Fig. l b is another possible function).
Neural Networks via Ellipsoid Algorithms
753
Without loss of generality, assume the nonlinear functionf(.) is completely known, and that a, and b, are all given. Let N be the number of training data pairs in the data set
s = {(Xk.Yk)
: k E [1.N]}
and let
x = [X,)
XI
' '
'
x,,]T
be the input vector. (Note: the parameter r in Section 2 is equal to n + 1 here.) The generic problem of training a single perceptron is to find a weight vector
w = [ZU"
zu1
' ' '
ZU,,]T
such that the perceptron can regenerate the input-output patterns in the training set, and generalize so as to capture the complete input-output mapping of the underlying process. In this paper, our goals are (1) to find an ellipsoidal set using the OVE algorithm to overbound the feasible set of weights consistent with the training data set S (the motivation for using the OVE algorithm is that intrinsic in the algorithm, the center of the resulting ellipsoid is guaranteed to be a feasible solution, after convergence, for designing a perceptron to implement the mapping defined by the available input-output patterns), and (2) to find an ellipsoidal set that underbounds the feasible set of weights consistent with the training data set S using the UBE algorithm. The UBE ellipsoid characterizes a feasible set of weights that shows what variations in the weights can occur for which we are still guaranteed to implement the proper mapping. Consequently, the final UBE ellipsoid provides a characterization of "robustness" of the neural network mapping with respect to variations in the weights (in case of uncertainties in the implementation process or in case they vary after implementation). The motivation for using the UBE algorithm is that the algorithm must be executed only once for the training set to give a feasible set for designing a perceptron to implement the mapping in interest (the center of the final UBE ellipsoid can be used for the weights). Suppose a data pair (Xk+l.yk+l) is given and that y k + l = b,; then the ). interval in which z k [given by (X:,, H ) ] lies is known to be [ ~ 7 , . u , + lThat is
where a, and u , , ~ are known. However, the inequalities in 3.1 define a set that is convex but not closed. In order to utilize OVE and UBE that
M.-F. Cheung, K. M. Passino, and S. Yurkovich
754
work on convex and closed sets, it is desired to make the set closed by relaxing the right-hand inequality in 3.1, adding very little conservatism, as
xi+~w
a; 5 5 11,+1 (3.2) which is a superset of that in 3.1. As a result, the feasible set of weights can be succinctly defined by the following 2N inequality constraints: a, 5 X:W 5 n l i l for k = 1... . . N , and i E ( O , m ] is such yk = b;. For & + I which corresponds to the feasible region for the data pair (Xktl.yk+~), the constraints are Ijkt-1 -
I
-1
xk+1w5 y k i l + 7
(3.3)
Theorem 3. Given that ykt1 = 6, or a feasible set of weights defined by equation 3.2 and a previous bounding ellipsoid defined in equation 2.3, the OVE/UBE algorithm can be used to find a new overbounding/underbounding ellipsoid with the following definitions of ( I , and ,jA : (a) i # 0 or i # nr (k!,
=
ditl -
x;+,w
Ch
a, -
x,7',.1 w, Gk
-/,
1.
where Gk = Wk and Pk are the parameters associated with the previous ellipsoid that is obtained through training the data up to the kth pair of data. (b) i = 0 or i = 111 That is when y k S 1 = bo (or b,,,), the first (or the last) quantized output, then
< 01 (or > a,,,) Incorporating a small degree of conservatism as zk 5 01 (or 2 a,,,) results in Zk
Sketch of the Proof. The result in (a) is obtained by comparing the lefthand sides and the right-hand sides of 3.2 and 3.3, respectively, and then determining the new parametrizations for and h. In case (b), essentially only one hyperplane cuts the previous ellipsoid. Therefore, one can move the noncutting hyperplane to touch the ellipsoid corresponding to either one of the following conditions: o k = 1 or O.A - 211, = -1 depending on which hyperplane intersects. For details, see Cheung (1991) and Cheung rt al. (1993).
Neural Networks via Ellipsoid Algorithms
755
4 Applications
The OVE algorithm has been shown to be convergent in Cheung (1991) and Cheung et al. (1993), in that the volume at each iteration is nonincreasing and the center of the ellipsoid will converge as the size of the data set tends to infinity. In our application to ANNs, where the number of data is finite, the convergence property must be maintained. By repeatedly applying the OVE algorithm over the entire finite training set, it is easy to show that OVE retains its convergence properties. This is the case because the OVE algorithm views the repetition of the finite training set as a long (possibly infinite) data sequence. Therefore the convergence properties are guaranteed, and the center of the ellipsoid, after convergence, must be inside the feasible set and be a feasible solution to the mapping of the finite data set. If this is not true, then a contradiction results since, as shown in Cheung (1991) and Cheung et al. (1993), if the center of the ellipsoid does not satisfy one of the parallel hyperplane constraints, one is guaranteed that a smaller volume ellipsoid can be found, implying that the ellipsoids have not yet converged. The notion of convergence is difficult to quantitatively guarantee for finite sets. For the examples studied below, the following stopping criterion is used along with the OVE algorithm. The OVE algorithm ceases if

$$V(l) - V(l+1) \le \epsilon\, V(l)$$

where $V(l)$ denotes the volume of an ellipsoid after sweeping through the entire finite data set $l$ times, and $\epsilon$ is a small positive number. Note that although the choice of $\epsilon$ is heuristic, the algorithm is guaranteed to meet this condition (for some finite $l$ for some $\epsilon$). However, this does not guarantee that the center of the last ellipsoid will be a feasible solution; a feasibility test must be conducted for validation, and a smaller $\epsilon$ may be necessary if the ellipsoid center turns out to be a nonfeasible solution. Often $V(l) - V(l+1) = 0$ for some $l$, implying that the algorithm has actually converged with respect to the available data and that repeatedly applying OVE over the data set will give no new information about the bounding ellipsoid. In other words, the center of the final ellipsoid must be a feasible solution. In the following two examples, the initial ellipsoid is set to be a sphere centered at the origin with radius 10 units in each case. For the OVE implementation, $\epsilon = 0.001$ is used, and the feasibility test is passed for all examples. For the UBE implementation, the algorithm is iterated through the training data set once to give an ellipsoidal set of feasible weights.
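The sweep-and-stop logic just described can be sketched as follows. This is a minimal illustration only; `ove_update` is a hypothetical placeholder for one OVE iteration (the actual update formulas are given in the theorems above and in Cheung et al. 1993), and the volume is monitored through $\det(P)$, since an ellipsoid's volume is proportional to $\sqrt{\det P}$.

```python
import numpy as np

def train_with_ove(ove_update, data, center0, P0, eps=0.001, max_sweeps=1000):
    """Sweep the finite training set repeatedly through an OVE-style update
    until the relative volume decrease over one full sweep falls below eps,
    i.e. V(l) - V(l+1) <= eps * V(l).

    ove_update(center, P, x, y) is a placeholder for one OVE iteration and
    must return the updated ellipsoid (center, P)."""
    center, P = center0, P0
    vol = np.sqrt(np.linalg.det(P))
    for _ in range(max_sweeps):
        for x, y in data:
            center, P = ove_update(center, P, x, y)
        new_vol = np.sqrt(np.linalg.det(P))
        if vol - new_vol <= eps * vol:   # stopping criterion
            break
        vol = new_vol
    # The returned center must still pass a feasibility test on the data.
    return center, P
```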
4.1 Perceptron with Hard-Limiter as Nonlinearity. It is known (Widrow and Lehr 1990; Lippmann 1987) that a single perceptron can be used to linearly classify input patterns into two different groups. Essentially,
Table 1: I/O Table for MAJ Logic

 x1   x2   x3    y
  1    1    1    1
  1    1   -1    1
  1   -1    1    1
  1   -1   -1   -1
 -1    1    1    1
 -1    1   -1   -1
 -1   -1    1   -1
 -1   -1   -1   -1
the perceptron with a hard-limiter as the nonlinearity divides the input space into two regions separated by a hyperplane (a line in two-dimensional space). Here, the ellipsoid algorithms with the parametrizations given in Theorem 1 and Theorem 2 are used to train a perceptron to realize linearly separable logic functions. The example to be studied is the MAJ logic function (OR, XOR, and AND logic were also implemented, but are not included here). The functional mapping table is given in Table 1, where the inputs are denoted as $x_i$ and the output as $y$. For training with the OVE algorithm, the entire data set is swept through five times before satisfying the stopping criterion, and the following weight vector was obtained:

$$W(\mathrm{MAJ}) = [-0.0671 \;\; \cdots \;\; 3.7799]^T$$

For training with the UBE algorithm, the following results were obtained:

$$W(\mathrm{MAJ}) = [1.6296 \;\; 3.6854 \;\; 3.7950 \;\; \cdots]^T$$
and the singular values of the associated $P_k$ matrix, which correspond to the squares of the semiaxial lengths, are (1.0607, 0.8389, 0.6511, 0.3884). Without knowing the orientation of the final ellipsoid, the minimum amount of variation allowed in each weight around the center $W_k$ is governed by $\underline{\sigma}(P_k)$, the smallest singular value of $P_k$. Hence the UBE algorithm successfully trained the perceptron, and the amount of variation allowed in each weight is at least 0.3116. This provides a range (consistent with the training data) over which the weights may vary in the implementation of an ANN when the center estimate is used to implement the weights (that is, it provides a characterization of the robustness of the ANN map).
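The quoted margin can be reproduced numerically. This short sketch assumes (since the exact formula is garbled in the scanned source) that the guaranteed per-weight variation is half the smallest semiaxis of the ellipsoid, i.e. $\sqrt{\underline{\sigma}(P_k)}/2$, which matches the value 0.3116 reported above:

```python
import numpy as np

# Reported singular values of P for the UBE-trained MAJ perceptron;
# these are the squares of the ellipsoid's semiaxial lengths.
sv = np.array([1.0607, 0.8389, 0.6511, 0.3884])

smallest_semiaxis = np.sqrt(sv.min())   # about 0.6232
margin = smallest_semiaxis / 2          # assumed half-semiaxis margin
print(margin)                           # 0.3116...
```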
4.2 Perceptron with a Staircase Nonlinearity: A Control Application. Consider the reaction chamber temperature-following control
problem shown in Figure 2.

Figure 2: Temperature following control problem.

$T_{RC}$ is the temperature of the reaction chamber, which is desired to follow the reference temperature $T_{Ref}$. The temperature inside the reaction chamber can be controlled by appropriately switching on the heater/cooler unit. The following rules for activating the heater/cooler unit are given:

1. If $T_{RC} < T_{Ref} - 3$, turn on the heater;
2. If $T_{RC} > T_{Ref} + 3$, turn on the cooler;
3. If $|T_{RC} - T_{Ref}| \le 3$, neither the heater nor the cooler is on.
Assume that the heater/cooler unit is under a single control $y$, the output of a neural net controller or the perceptron, so that when

$y = 1$, the heater is on;
$y = -1$, the cooler is on;
$y = 0$, the heater/cooler unit is idle.
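The three rules define the target label for any input pair, so a training set of the kind used below can be generated directly from them. A minimal sketch (the function name and the default dead band are illustrative, not from the paper):

```python
def control_target(t_ref, t_rc, band=3.0):
    """Desired controller output under rules 1-3:
    +1 -> heater on, -1 -> cooler on, 0 -> idle."""
    if t_rc < t_ref - band:
        return 1      # rule 1: chamber too cold, heat
    if t_rc > t_ref + band:
        return -1     # rule 2: chamber too hot, cool
    return 0          # rule 3: inside the dead band

# e.g. control_target(5.0, 0.0) -> 1 and control_target(0.0, 5.0) -> -1
```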
A training data set of 12 data pairs is used, generated according to the rules above and shown in Table 2. Figure 3 shows the training data pattern on the $T_{RC}$-$T_{Ref}$ plane. The solid parallel lines separate the plane into three regions: Region A corresponds to case (1) with $y = 1$; Region B corresponds to case (2) with $y = -1$; and Region C corresponds to case (3) with $y = 0$. The problem here is to use a perceptron with a staircase nonlinearity to classify the input pattern into three different classes. The staircase nonlinearity used here has the following parameters: $m = 2$, $a_1 = -3$, $a_2 = 3$, $b_0 = -1$, $b_1 = 0$, and $b_2 = 1$; $x_1$
Table 2: Training Set for the Heater/Cooler Unit Activation (12 pairs of $T_{Ref}$ and $T_{RC}$ values with target outputs $y$ assigned according to rules 1-3).
Figure 3: Training data pattern for temperature control problem.
and $x_2$ correspond to $T_{Ref}$ and $T_{RC}$, respectively. The training data set is swept through 48 times until the OVE algorithm stops (convergence is achieved); the following result was obtained:
$$W_{576} = [-1.0296 \;\; -1.5859 \;\; 1.4492]^T$$
The parallel separating lines from the perceptron using $W_{576}$ are indicated as dash-dot lines in Figure 3. Clearly, the perceptron can correctly separate the training set. In fact, if a larger set of training instances is used, the classification boundaries from the perceptron will closely match the desired ones (solid parallel lines).
For the training with UBE in this example, no feasible solution was obtained. This can happen because each time the UBE algorithm is used, part of the feasible parameter set may be discarded, and eventually an intermediate underbounding ellipsoid may not contain any feasible solutions.

5 Discussion
In this paper we have examined how two ellipsoid algorithms that are useful in general system identification studies can be used for the training of neural networks. Each ellipsoid algorithm brings its own distinctive features to neural net training and implementation. In particular:
1. The choice of the initial ellipsoid $E_0$ in both the OVE and UBE algorithms can be used as an instrument to confine consideration to physically realizable weights in an ANN. This may bear some practical significance when hardware implementation is considered. The ellipsoid algorithm approach to training a perceptron has another distinct advantage over other training algorithms in that it solves the problem of choosing initial weights: one merely needs to pick a large enough initial ellipsoid that guarantees the overbounding of the feasible set.

2. The OVE algorithm gives a convergent estimate and can be used as an automatic test for linear separability of input-output mappings. However, this is not necessarily true for the UBE algorithm, as indicated in the control example of the last section.

3. The results from the UBE algorithm give a characterization of the feasible weights that indicates the flexibility available in implementing a perceptron to realize a certain mapping; this may have a strong bearing on the robustness of a perceptron with respect to disturbances in the inputs as well. The center of the UBE ellipsoidal set may be the best choice of weights, since the weights may vary slightly during implementation.
4. The UBE algorithm needs to be executed only as many times as there are data patterns available for training to give a feasible set. However, the UBE algorithm may fail to characterize a feasible ellipsoid even though the feasible set is nonempty (again, as indicated in the control application example). Nevertheless, this can be complemented by using the OVE algorithm, since after the algorithm has converged the center of the overbounding ellipsoid is guaranteed to be a feasible solution if the feasible set is nonempty.
5. For the training of multilayer perceptrons, because of the nondifferentiability of the activation function, a heuristic approach has been used (but is not reported here due to space constraints) to train each perceptron independently. This required assigning input-output patterns to each perceptron appropriately so that the entire ANN works in the manner desired.

6. Finally, we note that the complexity of the OVE and UBE algorithms is discussed in detail in Cheung et al. (1993).
In this initial investigation into using ellipsoid algorithms for training ANNs we have shown several advantages of OVE/UBE; however, much work remains. For instance, there is the need to extend the results (including the desirable convergence and robustness properties) to the training of general multilayer perceptrons with general staircase nonlinearities.

Acknowledgments
K. Passino was supported in part by an Engineering Foundation Research Initiation Grant.

References

Antognetti, P., and Milutinovic, V., eds. Neural Networks: Concepts, Applications, and Implementation. Prentice Hall, New York.
Barto, A. G. 1989. Connectionist learning for control: An overview. COINS Tech. Rep. 89-89, University of Massachusetts, Amherst.
Beale, R., and Jackson, T. 1990. Neural Computing: An Introduction. Adam Hilger, New York.
Cheung, M. F. 1991. On optimal algorithms for parameter set estimation. Ph.D. thesis, The Ohio State University, Columbus, OH.
Cheung, M. F., Yurkovich, S., and Passino, K. M. 1993. An optimal volume ellipsoid algorithm for parameter set estimation. In Proc. IEEE Conf. Decision and Control (Brighton, UK), pp. 969-974. An expanded version will appear in IEEE Transact. Automatic Control, Vol. 38, No. 8, pp. 1292-1296.
Fogel, E., and Huang, Y. F. 1982. On the value of information in system identification: Bounded noise case. Automatica 18(2), 229-238.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag., 4-22.
Widrow, B., and Lehr, M. A. 1990. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. IEEE 78.
Received April 14, 1992; accepted September 14, 1993.
Communicated by Halbert White
Why Some Feedforward Networks Cannot Learn Some Polynomials

N. Scott Cardell
Wayne Joerding
Ying Li
Department of Economics, Washington State University, Pullman, WA 99164 USA
It seems natural to test feedforward networks on deterministic functions. Yet, some simple functions, notably polynomials, present some difficult problems for approximation by feedforward networks. The estimated parameters become unbounded and fail to follow any unique pattern. Furthermore, as the fit to the specified functions becomes closer, numerical problems may develop in the algorithm. This paper explains why these problems occur for polynomials of order less than or equal to the number of hidden units of a feedforward network. We show that other examples occur for functions mathematically related to the network's squashing function. These difficulties do not indicate problems with the training algorithm, but occur as an inherent consequence of the role of the connection weights in feedforward networks.

1 Introduction
Developers of new training algorithms and feedforward network architectures frequently test their creations by fitting a polynomial or other simple relationship [see, for example, Almeida (1987), Barton (1991), Haley and Soloway (1992), and Webb and Lowe (1988)]. However, oral tradition has alerted practitioners to training difficulties in such circumstances. Training algorithms have trouble converging to a particular feedforward network when trained on artificial data generated by a polynomial relationship in which the order of the polynomial is less than or equal to the number of hidden units in the network.

2 The Problem with Polynomials
Define $P_H$ as the $H$th-order polynomial mapping $x \mapsto P_H(x) = \sum_{h=0}^{H} b_h x^h$. Let $\psi_H$ represent a single hidden layer feedforward network output function with $H$ hidden units, a single input unit, and a single output unit. In this
section we show that sequences of connection weights exist such that, in a sense to be made precise, $\psi_H \to P_H$. Define the network $\psi_H$ by the mapping $x \mapsto \psi_H(x)$, with real analytic nonpolynomial squashing function $f$. Let $\omega$ represent the weights connecting the input layer to the hidden layer and $\beta$ the weights connecting the hidden layer to the output unit. Assume bias terms for each hidden unit and for the output unit, and no squashing at the output unit. Then

$$\psi_H(x) = \beta_0 + \sum_{h=1}^{H} \beta_h\, f(\omega_{h0} + \omega_{h1} x)$$

provides a convenient vector representation for the network evaluated at the point $x$. In this context, training seeks values for the connection weights that minimize some objective criterion expressed as a norm, $\|\cdot\|$, of the difference between $P_H$ and $\psi_H$, that is, minimize $\|P_H - \psi_H\|$. This notation accommodates, for example, the common choice to minimize the root mean square error, in which case

$$\|P_H - \psi_H\| = \left( \int [P_H(x) - \psi_H(x)]^2\, g(x)\, dx \right)^{1/2}$$

or

$$\|P_H - \psi_H\| = \left( \sum_i [P_H(x_i) - \psi_H(x_i)]^2\, g(x_i) \right)^{1/2}$$

where $g(\cdot)$ represents the density of a continuously distributed input variable or the discrete probability of a discretely distributed input variable. For example, if $x_1, \ldots, x_N$ represents a fixed finite training set, then $g(x_i) = 1/N$ for $i = 1, \ldots, N$ and zero otherwise. The notation also allows for choosing to minimize the expected absolute deviation, or any other norm. Of course, minimizing any monotonic transformation of $\|P_H - \psi_H\|$ is equivalent to minimizing $\|P_H - \psi_H\|$; for instance, the sum of squared errors equals $N \|P_H - \psi_H\|^2$, where $\|\cdot\|$ represents the Euclidean norm. Later we point out the implications for implementable training algorithms from results using this abstract problem. In the following presentation we also overload the notation $\|\cdot\|$ by letting it denote any norm that takes vector arguments. In the subsequent development, because the relative ratios of the $\omega_{h1}$ parameters remain fixed, we substitute $\lambda \theta_h = \omega_{h1}$, $h = 1, \ldots, H$. We can take $\theta$ as normalized so that $\theta^T \theta = 1$. Next, forming an $H$th-order Maclaurin series expansion, factoring out $(\lambda x)^h$ terms, and defining $f_i^{(h)} = f^{(h)}(\omega_{i0})/h!$,
we obtain, more compactly,

$$\psi_H(x) = \beta^T A z + \beta^T R[(\lambda x)^{H+1}] \tag{2.3}$$

where $z = [1, \lambda x, (\lambda x)^2, \ldots, (\lambda x)^H]^T$ and $A$ collects the Maclaurin coefficients $f_i^{(h)} \theta_i^h$.
From the Taylor Expansion Theorem, $\|R[(\lambda x)^{H+1}]\|$ is $O[(\lambda x)^{H+1}]$. Therefore, for any fixed $x$, $\|R[(\lambda x)^{H+1}]\| \to 0$ as $\lambda \to 0$. Choose any weights $\omega$ and $\theta$ such that $A^{-1}$ exists. ($A$ is nonsingular for almost all $\omega \ne 0$.) Then choose $\beta^T = (b_0, b_1/\lambda, \ldots, b_H/\lambda^H)\, A^{-1}$ and substitute into 2.3 to obtain

$$\psi_H(x) = \sum_{h=0}^{H} b_h x^h + \beta^T R[(\lambda x)^{H+1}] \tag{2.4}$$

$$= P_H(x) + \beta^T R[(\lambda x)^{H+1}] \tag{2.5}$$

Recall that $b_0, \ldots, b_H$ represent the fixed real coefficients of the polynomial $P_H$ and $\beta$ represents the weights connecting the hidden layer to the output layer. We note that as $\lambda \to 0$ then $\beta_0, \ldots, \beta_H \to \infty$, $\omega_{11}, \ldots, \omega_{H1} \to 0$, and $\theta$ and $\omega_{10}, \ldots, \omega_{H0}$ are fixed.
Proposition 1. For any given $x$, $\psi_H(x)$ converges to $P_H(x)$ as $\lambda \to 0$.
Proof: Let $\Lambda_m$ be the maximum singular value of $A^{-1}$ [i.e., the square root of the largest eigenvalue of $A^{-1}(A^{-1})^T$], and let $b = (b_0, \ldots, b_H)^T$. Without loss of generality take $0 < \lambda < 1$. Then,

$$|P_H(x) - \psi_H(x)| = \left| \beta^T R[(\lambda x)^{H+1}] \right| \le \|\beta\| \cdot \left\| R[(\lambda x)^{H+1}] \right\| \le \frac{\|b\|}{\lambda^H}\, \Lambda_m \left\| R[(\lambda x)^{H+1}] \right\|$$
The first inequality follows from the Cauchy-Schwarz Inequality, and the second from the Courant-Fischer Minimax Theorem (see Golub and Van Loan 1989). But $R[(\lambda x)^{H+1}]$ is $O[(\lambda x)^{H+1}]$; thus, there exists a fixed finite $K$ (which may depend on $x$) such that $\|R(w^{H+1})\| < K w^{H+1}$ for all $w$ such that $|w| \le |x|$. Since $|\lambda x| < |x|$, we have $|P_H(x) - \psi_H(x)| \le (\|b\|/\lambda^H) \cdot \Lambda_m \cdot K \cdot \lambda^{H+1} \cdot |x|^{H+1} = \lambda \cdot \|b\| \cdot \Lambda_m \cdot K \cdot |x|^{H+1}$. Thus, for any fixed $x$, $\lim_{\lambda \to 0} |P_H(x) - \psi_H(x)| = 0$, and $\psi_H(x)$ converges to $P_H(x)$.

Corollary 1. If the support of $g$ is contained in any fixed compact set, then $\psi_H \to P_H$ under the norm $\|\cdot\|$, that is, $\lim_{\lambda \to 0} \|P_H - \psi_H\| = 0$.

Proof. Let $S$ denote a compact set that includes the support of $g$. Then

$$\|\psi_H - P_H\| \le \max_{x \in S} |P_H(x) - \psi_H(x)| \tag{2.9}$$

$$\le \lambda \cdot \|b\| \cdot \Lambda_m \cdot \max_{x \in S}\left( \|K x^{H+1}\| \right) \tag{2.10}$$

But $S$ is a compact set; therefore $K' = \max_{x \in S}(|K|) < \infty$ exists. Let $M = \max_{x \in S}(|x|) < \infty$. Hence $\|\psi_H - P_H\| \le \lambda\, \|b\| \cdot \Lambda_m \cdot K' \cdot M^{H+1}$, and thus $\lim_{\lambda \to 0} \|\psi_H - P_H\| = 0$.
Thus, by letting $\lambda \to 0$, we can produce ever closer approximations to the polynomial $P_H$ over any compact set. However, at $\lambda = 0$, we have $\omega_1$ equal to a zero vector, the hidden unit outputs equal to a vector of constants, and $\beta$ undefined.
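The construction in equations 2.3 through 2.5 can be checked numerically. The following is a minimal sketch, not from the paper: it assumes $f = \tanh$, $H = 2$, and arbitrarily chosen values for $b$, $\omega_{10}$, $\omega_{20}$, and $\theta$. It builds $A$ from the Maclaurin coefficients, solves $\beta^T A = (b_0, b_1/\lambda, b_2/\lambda^2)$, and shows the approximation error shrinking while $\|\beta\|$ blows up as $\lambda \to 0$:

```python
import numpy as np
from math import factorial

def tanh_derivs(a):
    """tanh and its first two derivatives evaluated at a."""
    t = np.tanh(a)
    return np.array([t, 1.0 - t**2, -2.0 * t * (1.0 - t**2)])

H = 2
b = np.array([1.0, -2.0, 0.5])                # target P_2(x) = 1 - 2x + 0.5x^2
w0 = np.array([0.3, -0.7])                    # fixed hidden biases omega_h0
theta = np.array([1.0, 1.0]) / np.sqrt(2.0)   # fixed direction, theta^T theta = 1

# A: row 0 handles the output bias; row h holds f^(k)(omega_h0) theta_h^k / k!
A = np.zeros((H + 1, H + 1))
A[0, 0] = 1.0
for h in range(1, H + 1):
    d = tanh_derivs(w0[h - 1])
    for k in range(H + 1):
        A[h, k] = d[k] * theta[h - 1]**k / factorial(k)

xs = np.linspace(-1.0, 1.0, 201)
P = b[0] + b[1] * xs + b[2] * xs**2

for lam in [0.5, 0.1, 0.02]:
    c = b / lam**np.arange(H + 1)             # (b0, b1/lam, b2/lam^2)
    beta = np.linalg.solve(A.T, c)            # beta^T = c^T A^{-1}
    psi = beta[0] + sum(beta[h] * np.tanh(w0[h - 1] + lam * theta[h - 1] * xs)
                        for h in range(1, H + 1))
    print(f"lambda={lam:5.2f}  max|psi-P|={np.abs(psi - P).max():.2e}"
          f"  max|beta|={np.abs(beta).max():.1e}")
```

The error decreases roughly linearly in $\lambda$, while the $\beta$ weights grow like $1/\lambda^H$, exactly the nonconvergent behavior the proposition predicts.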
3 Discussion
Implementable training algorithms must use a finite fixed data set and cannot depend on an unknown distribution for the input variable. A common training approach seeks optimal connection weights that minimize the sum-of-squared errors over a fixed data set. Thus, a researcher testing an algorithm on the polynomial $P_H$ would use a data set of the form $\{[x_i, P_H(x_i)] \mid i = 1, \ldots, N\}$. Any finite set of points is compact and, as discussed above, the sum of squared errors is $N \|\psi_H - P_H\|^2$, with $\|\cdot\|$ the Euclidean norm with respect to the discrete PDF $g(x_i) = 1/N$, $g(x) = 0$ for $x \ne x_i$, any $i$, $1 \le i \le N$. Therefore the algorithm can find ever smaller values of the sum of squared errors by decreasing $\omega_1$ to zero along the direction $\omega_1 = \lambda \theta$ and by increasing $\beta$ toward infinity along the direction $\beta^T = (b_0, b_1/\lambda, \ldots, b_H/\lambda^H)\, A^{-1}$. Note that the above easily generalizes to networks with a number of hidden units equal to or greater than the polynomial order. In practice, for restricted classes of squashing functions, $K'$ can exist independent of $x$. If the support of $g$ is not contained in a compact set, but $K' < \infty$ exists and $\|x^{H+1}\| < \infty$, then $\psi_H \to P_H$ as $\lambda \to 0$ still holds. However, as the problem with polynomials will be encountered by any researcher on the first finite training set used, this further result does not seem particularly interesting.
Generally, $A$ will be nonsingular except on a set of measure zero (in the $2H - 1$ dimensional space spanned by $\omega_0, \theta$). For example, for $H = 1$, $A$ is nonsingular for all $\omega_{10}$ such that $f'(\omega_{10}) \ne 0$. Thus, for almost all $(\omega_0, \theta)$, we can arbitrarily closely approximate $P_H(x)$ with $\psi_H(x)$. Therefore, repeated estimation attempts with different starting values will yield completely different estimated $\omega$ and $\beta$ weights, each approximating the polynomial arbitrarily accurately. Other functions can present difficulty to training algorithms. For example, a feedforward network output function having two hidden units and squashing function $f$ can approximate $g = f'$ arbitrarily closely if the training algorithm increases the connection weights from the two hidden units toward positive and negative infinity. To see this, note that we can write

$$f'(x) \approx \frac{f(x + \lambda) - f(x)}{\lambda} \tag{3.1}$$

for small $\lambda$. Observe that 3.1 represents a feedforward network output function with two hidden units and weights $\beta_0 = 0$, $\beta_1 = 1/\lambda$, and $\beta_2 = -1/\lambda$ ($\omega_{10} = \lambda$, $\omega_{11} = 1$, $\omega_{20} = 0$, and $\omega_{21} = 1$). As $\lambda \to 0$, the approximation error declines toward zero and the connection weights $\beta_1 \to +\infty$ and $\beta_2 \to -\infty$. Other examples exist, as the essential problem concerns the ability of networks with a fixed number of hidden units and unbounded connection weights to approximate, arbitrarily accurately, functions that are not themselves feedforward network output functions. Let $\Sigma_H$ represent the set of feedforward network output functions with $H$ hidden units and $\bar\Sigma_H$ the closure of $\Sigma_H$. For example, $P_H$ (and $g$) represent functions in $\bar\Sigma_H$ that are not contained in $\Sigma_H$. We have described a problem in which an algorithm training a network to approximate an element of $\bar\Sigma_H \setminus \Sigma_H$ can find a sequence of connection weights that produces ever more accurate approximations but never converges. One solution uses a convergence criterion that stops the algorithm when the sum-of-squared errors fails to decline by some small amount. Naturally, machine limitations impose an automatic convergence criterion of this sort. However, as pointed out above, this will not guarantee a unique solution in terms of $\omega, \beta$. A better solution would impose bounds on the connection weights. White (1990) shows that if the $\omega$ and $\beta$ weights satisfy an absolute value bound, then one can find a sequence of estimated single hidden layer feedforward network output functions that converges to any square integrable relation as the number of hidden units increases to infinity. In other words, networks can consistently approximate arbitrary square integrable functions, including polynomials. Stinchcombe and White (1990) show that networks satisfying weight bounds retain their universal approximation characteristics. Finally, we note that Hornik et al. (1990) impose an absolute value bound on the connection weights to obtain
their results on convergence in Sobolev norm for feedforward network output functions. Finally, the problems described in this paper with using polynomial data to test algorithms cannot be eliminated by simply adding noise to the simulated data. While adding noise would permit a n algorithm without constraints on the connection weights to converge to finite weight values, those estimated weights would depend completely on the realized values of the added noise. That is, trained weight values will converge to very different values for data sets that differ only in the realized values of the noise added to the data.
References

Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the International Joint Conference on Neural Networks, pp. 609-618. SOS Printing, San Diego.
Barton, S. A. 1991. A matrix method for optimizing a neural network. Neural Comp. 3, 450-459.
Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations. Johns Hopkins University Press, Baltimore.
Haley, P. J., and Soloway, D. 1992. Extrapolation limitations of multilayer feedforward neural networks. In Proceedings of the International Joint Conference on Neural Networks, pp. 25-30. SOS Printing, San Diego.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3(5), 551-560.
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, Vol. III, pp. 7-16. IEEE Press, New York.
Webb, A., and Lowe, D. 1988. A hybrid optimization strategy for adaptive feedforward layered networks. Tech. Rep. 4193, Royal Signals and Radar Establishment Memorandum, Ministry of Defence, Malvern, UK.
White, H. 1990. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks 3(5), 535-549.
Received February 8, 1993; accepted October 22, 1993.
Communicated by Graeme Mitchison
ARTICLE
A Bayesian Analysis of Self-organizing Maps

Stephen P. Luttrell
Adaptive Systems Theory Section, Defence Research Agency, St. Andrews Rd., Malvern, Worcestershire, WR14 3PS, United Kingdom
In this paper Bayesian methods are used to analyze some of the properties of a special type of Markov chain. The forward transitions through the chain are followed by inverse transitions (using Bayes' theorem) backward through a copy of the same chain; this will be called a folded Markov chain. If an appropriately defined Euclidean error (between the original input and its "reconstruction" via Bayes' theorem) is minimized with respect to the choice of Markov chain transition probabilities, then the familiar theories of both vector quantizers and self-organizing maps emerge. This approach is also used to derive the theory of self-supervision, in which the higher layers of a multilayer network supervise the lower layers, even though overall there is no external teacher.

1 Introduction

A self-organizing map (SOM) is an adaptive function that transforms (or maps) from an input vector space to an output vector space, where the adaptation is driven entirely by signals derived from the input space. In the context of neural networks an SOM would therefore be realized as an unsupervised network whose training algorithm acts to minimize a suitably defined error function in the network's input space. The aim of this paper is to develop a theoretical framework that unifies several different strands of unsupervised network theory, and for this purpose it is best to develop the theory anew, rather than to build a hybrid theory out of an assortment of existing theories.

1.1 Variational Formulation of SOM Optimization. In order to create a theoretically clean framework it is necessary to express the optimization of an SOM in terms of a variational principle. Thus define a scalar functional $D$ as follows
$$D = \int dx\, P(x)\, d[x, y(x), x'(y)] \tag{1.1}$$
where the vectors $x$ and $y$ sit in the input and output spaces, respectively, and the functions $y(x)$ and $x'(y)$ transform from $x$-space to $y$-space and
vice versa, respectively. These are the functions that must be optimized so as to minimize the value of $D$. The scalar function $d[x, y(x), x'(y)]$ contributes to the functional $D$ an amount that depends on the currently selected input vector $x$, and on the functions $y(x)$ and $x'(y)$. $P(x)$ is a probability density (or measure) that weights the $x$ integral nonuniformly. In practice $P(x)$ might be used to represent the relative frequency of occurrence of each input vector $x$ in a representative set of input vectors, in which case $D$ would be the average of $d[x, y(x), x'(y)]$ over that set of vectors. In equation 1.1 the functional dependence of $d[x, y(x), x'(y)]$ can be simplified to $d\{x, x'[y(x)]\}$, so $D$ simplifies to
$$D = \int dx\, P(x)\, d\{x, x'[y(x)]\} \tag{1.2}$$
This is possible because in $d[x, y(x), x'(y)]$ the vector $y$ in the function $x'(y)$ is a placeholder that receives the output of the function $y(x)$. The variational problem is then to find functions $y(x)$ and $x'(y)$ that minimize $D$ as defined in equation 1.2. Equation 1.2 can readily be extended to include the effects of an additive noise process acting in $y$-space. Thus $D$ becomes
$$D = \int dx\, P(x) \int dn\, \pi(n)\, d\{x, x'[y(x) + n]\} \tag{1.3}$$
where $\pi(n)$ is the probability density of the noise vector $n$. The variational problem is then to find functions $y(x)$ and $x'(y)$ that minimize this augmented expression for $D$. A side effect of including the noise process is that the information stored in the output vector $y$ [produced by the action of $y(x)$ transforming the input vector $x$] is represented in such a way that it is robust with respect to the damaging effects of the noise process.

1.2 Variational Principle versus Unsupervised Network Notation. By making the replacement $d(x, x') = \|x - x'\|^2$ (i.e., a Euclidean distance) in equation 1.3, this variational formulation can be shown to lead to a type of SOM that is similar to, but not precisely the same as, the unsupervised network known as the Kohonen map (see Kohonen 1984), with the noise probability density $\pi(n)$ playing the role of the SOM neighborhood function. This type of analysis was also introduced in a nonneural context to design an optimal vector quantizer (VQ) codebook for encoding data for transmission along a noisy channel (Kumazawa et al. 1984; Farvardin 1990; Farvardin and Vaishampayan 1991). The above variational principle can be related to a corresponding unsupervised network as shown in Table 1. Note that in practice the output $y$ of an unsupervised network is usually the index of the "winning node," which is a discrete-valued quantity.
Table 1: Variational Principle and Unsupervised Networks.

Term                    Variational principle                Unsupervised network
x                       Input vector                         Training/test vector
y(x)                    Input to output transformation       Encoding prescription
x'(y)                   Output to input transformation       Reference vector
d[x, y(x), x'(y)]       Function                             Error function
P(x)                    Integration measure                  Probability density of training/test vectors
D                       Functional                           Average error over the training/test set
Whether the output is continuous or discrete is unimportant to the variational approach, but in this paper a continuum notation is used because it is more compact. The basic derivations of the variational formulation are in Luttrell (1989a,c, 1990), a simple application to time series compression is in Luttrell (1989b), the compression of synthetic aperture radar images is in Luttrell (1989d), an analysis of the density of reference vectors is in Luttrell (1991a), and an extension of unsupervised networks to multilayer networks in which higher layers supervise lower layers (self-supervised networks) is in Luttrell (1991b, 1992).
1.3 Encoding Prescriptions. It has been noted (Luttrell 1989a,c, 1990; Farvardin and Vaishampayan 1991) that the above Euclidean error function is not minimized when the nearest neighbor encoding prescription is used. Exact minimization requires a new type of winner-take-all encoding prescription; the nearest neighbor prescription must be replaced by the so-called minimum distortion prescription, in which the choice of winner is influenced by the reference vectors to which it is connected by the SOM neighborhood function. Figure 1 shows graphically how nearest neighbor and minimum distortion encoding are related. The minimum distortion prescription was used in Luttrell (1991a) to study the equilibrium density of reference vectors in a one-dimensional input space, where it was reported that the result was insensitive to the choice of neighborhood function used in the SOM, provided that it was a monotonically decreasing symmetric function. The corresponding results using a standard SOM with the nearest neighbor prescription appeared in Ritter (1991), where it was shown that a neighborhood-sensitive density of reference vectors emerged.
Figure 1: Assume that the input vector $x$ and the function $x'(y)$ are such that the error function $d[x, x'(y)]$ has the $y$ dependence shown above by the solid curve. There is a well-defined minimum, and the shape of the minimum is skewed so that $d[x, x'(y)]$ increases more rapidly to the right than to the left of the minimum. Assume that the SOM neighborhood function $\pi[y - y(x)]$ is the symmetric function denoted by the dashed curve. The error function must be averaged over $y$ (weighted by the SOM neighborhood function) to determine the expected error for the particular choice of $x$, $x'(y)$, and $y(x)$. In effect, $d[x, x'(y)]$ is convolved with $\pi(y)$ to produce a smeared error function. The encoding prescription is one's choice of $y(x)$, and the optimum choice is not the position of the minimum of the original error function with respect to $y$ (nearest neighbor encoding), but rather the position of the minimum of the smeared error function with respect to $y$ (minimum distortion encoding). Because of the skewed $d[x, x'(y)]$, the minimum distortion encoding of $x$ is thus displaced a little to the left of the nearest neighbor encoding.

1.4 Self-Supervised Multilayer SOMs. Hierarchical multilayer SOMs were studied in Luttrell (1991b, 1992). For instance, a 2-layer version of this type of network splits the input space into a number of lower dimensional subspaces, and trains a SOM in each subspace. Simultaneously, another SOM is trained on the outputs of the above SOMs to produce the final network output. This approach can be cascaded to more stages if required. This type of multilayer SOM can be refined by introducing an error function that measures the average Euclidean error between the input vector and a reference vector that is constructed from the final network output (Luttrell 1991b, 1992). This forces the optimization of the layers to be tied together in such a way that they have to be trained cooperatively. For instance, in a 2-layer network the outputs
from the SOMs in the first layer have to be matched to the capabilities of the SOM in the second layer; this is achieved by sending backpropagation signals from the second layer to the first layer. However, these backpropagation signals are not derived from an external supervisor, so this type of training algorithm is called "self-supervised."

1.5 Purpose of This Paper. The purpose of this paper is to embed the above variational formulation of SOMs in a more general framework in which the transformations $y(x)$ and $x'(y)$ are replaced by probabilistic transformations $P_1(y \mid x)$ and $P_2(x' \mid y)$. Equation 1.2 then becomes
$$D = \int dx\, P(x) \int dy\, dx'\; P_2(x' \mid y)\, P_1(y \mid x)\, d(x, x') \tag{1.4}$$
The integration over $y$ and $x'$ performs a weighted average of $d\{x, x'[y(x)]\}$ over a variety of alternative transformations $y(x)$ and $x'(y)$, rather than just a single pair of transformations as would have been the case in the basic variational formulation. Similarly, equation 1.3 becomes
$$D = \int dx\, P(x) \int dn\, \pi(n) \int dy\, dx'\; P_2(x' \mid y + n)\, P_1(y \mid x)\, d(x, x') \tag{1.5}$$
This "sum over alternatives" is useful for a number of reasons:

1. Theoretical manipulations are easier to perform on "soft" probabilities than on "hard" deterministic functions.

2. The probabilistic approach lends itself well to a simulated annealing approach in numerical simulations.

3. Contact with standard results can be made by making the replacement $d(x, x') = \|x - x'\|^2$ (i.e., a Euclidean distance) in equation 1.5. This leads to a probabilistic generalization of standard SOM theory.

4. If a single pair of transformations $y(x)$ and $x'(y)$ was being used, but one was uncertain about which particular pair, then a probabilistic formulation would be necessary.
1.6 Structure of This Paper. The principal new result in this paper is a Bayesian derivation of the properties of SOMs starting from a generalization of equation 1.4. Thus a Markov chain of probabilistic transformations of the input vector is inverted by sending the (probabilistic) output of the chain back through the inverse probabilistic transformations (derived from Bayes' theorem) to eventually reemerge as a (probabilistic) reconstructed version of the input vector. This type of structure will be called a folded Markov chain (FMC). This FMC is then optimized by minimizing the average Euclidean error between the input vector and its reconstruction. Various constraints can be placed on this optimization. For instance, if a 2-stage FMC (i.e., 2 stages of probabilistic transformation) is considered, and only its first stage is optimized, then the theory
of SOMs emerges (as in Luttrell 1989a,c, 1990). Alternatively, if the state space of each stage of a 2-stage FMC is split into two (or more) lower dimensional subspaces, then the theory of self-supervision emerges (as in Luttrell 1991b, 1992). The structure of this paper is as follows. In Section 2 the idea of an FMC is introduced, where Bayes' theorem is used to invert a Markov chain of probabilistic transformations. In Section 3 the relationship between FMCs and VQs and SOMs is derived, and it is shown how a 1-stage (or 2-stage) FMC contains a VQ (or SOM) as a special case. In Section 4 these results are extended to the case of a pair of coupled FMCs, and it is shown how self-supervision emerges naturally. In the Appendix the relationship between the continuum notation used in this paper and the more usual discrete notation is explained (Section 6.1), and a more complete and technically rigorous derivation of the results of Section 3 is presented (Section 6.2). The main new results are contained in Sections 3 and 4.
2 Basic Theory
The notation that is used for probabilities is very carefully chosen so as to avoid ambiguities that could arise if the notation $P(\ldots)$ were used blindly to denote "the probability density of $\ldots$." However, occasional use of the ambiguous $P(\ldots)$ notation is made where there is no possibility of an ambiguity arising. Also, a careful distinction is drawn between the notation used for inverse probabilities (obtained from Bayes' theorem) and the notation used for forward probabilities; the former always appear with a tilde over the $P$, which thus appears as $\tilde P$. Also, the state space(s) will be assumed to be continuous, except where otherwise noted. This is because the corresponding discrete space results can be written down by inspection of the continuum results, and because the meaning of a continuum calculation is usually more transparent than its discrete counterpart. The type of Markov chain that will be considered is shown in Figure 2. This is a folded Markov chain (FMC), which performs an $L$-stage transformation of an input vector $x_0$ to an output vector $x_L$ (via $L - 1$ intermediate vectors $x_1, x_2, \ldots, x_{L-1}$), and then performs the Bayes inverse transformation to arrive eventually at a reconstructed input vector $x'_0$. The delta function $\delta(x'_L - x_L)$ is used to ensure that $x'_L = x_L$. The conditional probabilities are related by Bayes' theorem as follows

$$\tilde P_{k,k+1}(x_k \mid x_{k+1})\, P_{k+1}(x_{k+1}) = P_{k+1,k}(x_{k+1} \mid x_k)\, P_k(x_k) \tag{2.1}$$

where $P_k(x_k)$ denotes the marginal probability (density) of $x_k$. To construct any joint probability in the FMC system it is both necessary and
Figure 2: A folded Markov chain (i.e., both the forward and backward directions are represented). The top (bottom) half of the diagram represents the forward (backward) pass through the chain. The conditional probability $P_{k+1,k}(x_{k+1} \mid x_k)$ is used as a probabilistic transformation that generates the probable states of layer $k + 1$ from the state of layer $k$, and the conditional probability $\tilde P_{k,k+1}(x'_k \mid x'_{k+1})$ is similarly used to generate the probable states of layer $k$ from the state of layer $k + 1$; this is also referred to generically as stage $k$ of the FMC. These two conditional probabilities are related by Bayes' theorem.

sufficient to specify $P_0(x_0)$ and the $L$ transformations $P_{1,0}(x_1 \mid x_0), P_{2,1}(x_2 \mid x_1), \ldots, P_{L,L-1}(x_L \mid x_{L-1})$. Thus

$$P(x_0, x_1, \ldots, x_L; x'_L, \ldots, x'_1, x'_0) = P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, P_{2,1}(x_2 \mid x_1) \cdots P_{L,L-1}(x_L \mid x_{L-1})\, \delta(x'_L - x_L)\, \tilde P_{L-1,L}(x'_{L-1} \mid x'_L) \cdots \tilde P_{1,2}(x'_1 \mid x'_2)\, \tilde P_{0,1}(x'_0 \mid x'_1) \tag{2.2}$$

The marginal probability $P_k(x_k)$ is obtained by integrating over all variables other than $x_k$, which yields

$$P_k(x_k) = \int dx_0\, dx_1 \cdots dx_{k-1}\; P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, P_{2,1}(x_2 \mid x_1) \cdots P_{k,k-1}(x_k \mid x_{k-1}) \tag{2.3}$$

Note that Bayes' theorem guarantees that the marginal probabilities of $x_k$ and $x'_k$ are the same. In this paper our attention will be restricted to FMCs with 1 or 2 stages only (i.e., $L = 1$ or 2). Furthermore, whenever an FMC is to be optimized, only its first stage will be optimized.
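For a discrete state space, the Bayes inversion of one stage of the chain is a simple matrix computation. This is a minimal sketch, not from the paper (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def bayes_inverse(P_fwd, p_k):
    """Inverse transition for one FMC stage via Bayes' theorem (eq. 2.1).
    P_fwd[j, i] = P_{k+1,k}(x_{k+1}=j | x_k=i); p_k is the marginal of x_k.
    Returns P_inv[i, j] = P~_{k,k+1}(x_k=i | x_{k+1}=j).
    Assumes every state j has nonzero marginal probability."""
    joint = P_fwd * p_k[None, :]      # joint[j, i] = P(j | i) p(i)
    p_k1 = joint.sum(axis=1)          # marginal of x_{k+1}
    return (joint / p_k1[:, None]).T  # P(i | j) = joint[j, i] / p(j)
```

As equation 2.3 notes, chaining the forward matrices propagates the marginals, and applying `bayes_inverse` stage by stage yields the backward half of the folded chain.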
3 1- and 2-Stage Folded Markov Chains
FMCs are especially interesting in adaptive network design theory, because they turn out to contain some well-known systems as special cases: a VQ is a special case of a 1-stage FMC, and an SOM is a special case of
Figure 3: A 1-stage folded Markov chain. When $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$, and $D$ is minimized with respect to $x_1(x_0)$, this reduces to a vector quantizer with an infinite number of reference vectors (i.e., the continuum limit). See the Appendix for a detailed discussion of the relationship between the continuum and discrete cases. Bayes' theorem ensures that $\tilde P_{0,1}(x'_0 \mid x'_1)$ cannot be varied independently of $P_{1,0}(x_1 \mid x_0)$.

a 2-stage FMC. The success of these derivations relies on the similarities that exist between the following two situations:

1. Direct/inverse probabilities occur in Bayes' theorem as applied to a Markov chain, which might be used in the analysis of scattering problems in layered media, for instance. The sequence of processing operations is source $\to$ scattering $\to$ inverse scattering $\to$ reconstruction.
2. Encoding/decoding operations occur in VQs and SOMs, which might be used in the analysis of information transmission down a noisy communication channel, for instance. The sequence of processing operations is source $\to$ encode $\to$ decode $\to$ reconstruction.

3.1 1-Stage Folded Markov Chain: Vector Quantizer. A 1-stage FMC is shown in Figure 3. The derivation starts from the definition of the average Euclidean error $D$ between $x_0$ and $x'_0$ in a 1-stage FMC.
$$D = \int dx_0\, dx_1\, dx'_1\, dx'_0\; P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, \delta(x'_1 - x_1)\, \tilde P_{0,1}(x'_0 \mid x'_1)\, \|x'_0 - x_0\|^2 \tag{3.1}$$
After integrating out the (irrelevant) $x'_1$ using the delta function $\delta(x'_1 - x_1)$, using Bayes' theorem in the form $\tilde P_{0,1}(x_0 \mid x_1)\, P_1(x_1) = P_{1,0}(x_1 \mid x_0)\, P_0(x_0)$ (i.e., equation 2.1 with $k = 0$), and rearranging the order of the integrations, this reduces to

$$D = \int dx_1\, P_1(x_1) \int dx_0\, dx'_0\; \tilde P_{0,1}(x_0 \mid x_1)\, \tilde P_{0,1}(x'_0 \mid x_1)\, \|x'_0 - x_0\|^2 \tag{3.2}$$
The next step is to expand the norm as $\|x_0\|^2 + \|x'_0\|^2 - 2 x_0 \cdot x'_0$ and to perform the integrations where possible to obtain

$$D = 2 \int dx_1\, P_1(x_1) \left[ \int dx_0\, \tilde P_{0,1}(x_0 \mid x_1)\, \|x_0\|^2 - \left\| \int dx_0\, \tilde P_{0,1}(x_0 \mid x_1)\, x_0 \right\|^2 \right] \tag{3.3}$$

which may be rewritten as
$$D = 2 \int dx_1\, P_1(x_1) \int dx_0\, \tilde P_{0,1}(x_0 \mid x_1)\, \left\| x_0 - \int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0 \right\|^2 \tag{3.4}$$
The derivation of equation 3.4 from equation 3.2 is well known; it says that the average Euclidean error between pairs of vectors drawn independently from $\tilde P_{0,1}(x_0 \mid x_1)$ is twice the variance of vectors drawn from $\tilde P_{0,1}(x_0 \mid x_1)$. Finally, Bayes' theorem (i.e., equation 2.1 with $k = 0$) can be used to obtain the required result.
$$D = 2 \int dx_0\, P_0(x_0) \int dx_1\, P_{1,0}(x_1 \mid x_0)\, \left\| x_0 - \int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0 \right\|^2 \tag{3.5}$$
Equation 3.5 has all of the right structure to relate FMCs to VQs. It has a source of input vectors $P_0(x_0)$, a "soft" encoder $P_{1,0}(x_1 \mid x_0)$, and a reference vector $\int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0$ attached to each $x_1$ with which to compare the input vector $x_0$ to compute a Euclidean distortion. The only differences between this FMC and a standard VQ are:

1. $P_{1,0}(x_1 \mid x_0)$ is not a winner-take-all encoder. Each input vector $x_0$ is transformed into each possible output vector $x_1$ with probability $P_{1,0}(x_1 \mid x_0)$. In the language of neural networks, it is as if each possible output vector had an "activity" specified by $P_{1,0}(x_1 \mid x_0)$. A winner-take-all would result if $P_{1,0}(x_1 \mid x_0)$ were replaced by a probability whose mass was concentrated all at one point; this would be a delta function $\delta[x_1 - x_1(x_0)]$.

2. The reference vector $\int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0$ is dependent on the encoder $P_{1,0}(x_1 \mid x_0)$; they are related by Bayes' theorem (i.e., equation 2.1 with $k = 0$). In a VQ the reference vector and the encoder are also related, because the encoder is usually a nearest neighbor prescription, which in turn depends on the location of the reference vectors. It is not at all obvious that these two pictures (the FMC and the VQ) are related in a simple way.
Now consider a modified form of equation 3.5 in which $D$ becomes

$$D = 2 \int dx_0\, P_0(x_0) \int dx_1\, P_{1,0}(x_1 \mid x_0)\, \|x_0 - x'_0(x_1)\|^2 \tag{3.6}$$
where $\int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0$ has been replaced by the function $x'_0(x_1)$. Functionally differentiate this expression for $D$ with respect to $x'_0(x_1)$ to obtain

$$\frac{\delta D}{\delta x'_0(x_1)} = 4 \int dx_0\, P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, [x'_0(x_1) - x_0] \tag{3.7}$$

The stationary point $\delta D / \delta x'_0(x_1) = 0$ is obtained when $x'_0(x_1)$ satisfies

$$x'_0(x_1) = \frac{\int dx_0\, P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, x_0}{\int dx_0\, P_0(x_0)\, P_{1,0}(x_1 \mid x_0)} \tag{3.8}$$

By using Bayes' theorem (i.e., equation 2.1 with $k = 0$) this stationarity condition reduces to

$$x'_0(x_1) = \int dx_0\, \tilde P_{0,1}(x_0 \mid x_1)\, x_0 \tag{3.9}$$
So the modified expression for $D$ in equation 3.6 reduces to the original expression for $D$ in equation 3.5 provided that $x'_0(x_1)$ is chosen to minimize $D$. This is a major simplification, because the coupling (via Bayes' theorem) between $P_{1,0}(x_1 \mid x_0)$ and $\int du_0\, \tilde P_{0,1}(u_0 \mid x_1)\, u_0$ that appeared in equation 3.5 can now safely be ignored by the simple trick of using equation 3.6 instead [with the proviso that $x'_0(x_1)$ should always be optimized so as to minimize $D$]. At this point the minimization of $D$ (as written in equation 3.6) with respect to $P_{1,0}(x_1 \mid x_0)$ should be considered. However, this derivation is messy, so it is presented in the Appendix. Instead, a simplified (and strictly incomplete) version of this derivation is presented below. Make the following replacement in equation 3.6

$$P_{1,0}(x_1 \mid x_0) \to \delta[x_1 - x_1(x_0)] \tag{3.10}$$
which converts $P_{1,0}(x_1 \mid x_0)$ into a winner-take-all encoder. A winner-take-all encoder might appear intuitively to be the obvious solution to the problem of minimizing $D$ with respect to $P_{1,0}(x_1 \mid x_0)$. However, other solutions may also be possible in general. A detailed derivation is given in the Appendix to show how this result emerges in the case of a Euclidean error function. Also, a couple of simple counterexamples are presented in the Appendix to show how non-winner-take-all encoders can also be valid solutions. With these replacements $D$ becomes
$$D = 2 \int dx_0\, P_0(x_0)\, \|x_0 - x'_0[x_1(x_0)]\|^2 \tag{3.11}$$
Figure 4: A discrete VQ represented as a network. The bottom layer is the input space (or $x_0$), the top layer is the output space (or $x_1$), and the connections between the two represent a soft encoding operation $P_{1,0}(x_1 \mid x_0)$, akin to that used in Yair et al. (1992). A winner-take-all VQ uses an encoder of the form $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$. The input and output layers of this network are represented in different ways: the input layer is the vector $x_0$ represented as in Figure 3, whereas each node of the output layer corresponds to exactly one possible state of the vector $x_1$. Note also that the connections are not to be interpreted as weights in the conventional sense; rather, they merely indicate the functional interdependence of the various parts of the network. An example of a winner in the output layer is represented by the open circle.

which is exactly what would be written for the continuum version of a VQ (apart from the trivial overall factor of 2). The network representation of a VQ is shown in Figure 4 for a discrete-valued output, which should be compared with the FMC representation shown in Figure 3. The gradient of $D$ with respect to $x'_0(x_1)$ is given by the functional derivative

$$\frac{\delta D}{\delta x'_0(x_1)} = 4 \int dx_0\, P_0(x_0)\, \delta[x_1 - x_1(x_0)]\, [x'_0(x_1) - x_0] \tag{3.12}$$
By inspecting the dependence of equation 3.11 on $x_1(x_0)$, and by setting $\delta D / \delta x'_0(x_1)$ in equation 3.12 to zero, the following batch training prescription for minimizing $D$ can be obtained

$$x_1(x_0) = \arg\min_{x_1} \|x_0 - x'_0(x_1)\|^2, \qquad x'_0(x_1) = \frac{\int dx_0\, P_0(x_0)\, \delta[x_1 - x_1(x_0)]\, x_0}{\int dx_0\, P_0(x_0)\, \delta[x_1 - x_1(x_0)]} \tag{3.13}$$

These results merit the following remarks:
1. The function $x'_0(x_1)$ can be interpreted as the continuum version of a VQ codebook, where $x_1$ is the (continuum) code index and $x'_0(x_1)$ is the code vector (or reference vector) associated with that index.
Figure 5: A 2-stage folded Markov chain. This is basically the same as Figure 3, except that $P_{2,1}(x_2 \mid x_1)$ causes the information that flows through the folded Markov chain to be further corrupted before it can begin its return journey.
2. The result for $x'_0(x_1)$ corresponds to equation 3.6 with $P_{1,0}(x_1 \mid x_0) \to \delta[x_1 - x_1(x_0)]$, and it is the "centroiding" prescription for updating the code vectors after a batch of training data has been presented to a VQ, as used in the LBG algorithm (Linde et al. 1980).
3. The result for $x_1(x_0)$ is the "nearest neighbor" encoding prescription for encoding the input of a VQ.

An on-line training prescription can also be obtained to implement updates to $x'_0(x_1)$ after each input vector $x_0$ is selected at random from $P_0(x_0)$. The on-line prescription is

$$x'_0(x_1) \to x'_0(x_1) + \epsilon\, \delta[x_1 - x_1(x_0)]\, [x_0 - x'_0(x_1)] \tag{3.14}$$
Note that the delta function permits nonzero updates only for $x_1 = x_1(x_0)$; this is the continuum version of updating the nearest neighbor code vector toward the input vector. The relationship between this continuum result and the corresponding discrete result is discussed in the Appendix. This completes the demonstration that an optimal 1-stage FMC is a VQ. Note that the use of a Euclidean error is sufficient (but not necessary) for this result to emerge. There are choices of error function for which this result does not emerge.
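The discrete counterpart of prescriptions 3.13 and 3.14 is ordinary on-line vector quantizer training. This is a minimal sketch, not from the paper; the codebook size, learning rate, and random initialization are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_vq(samples, n_codes=8, eps=0.05, n_steps=20000):
    """Discrete analog of eq. 3.14: draw x0 at random, find the nearest
    neighbor code index (the winner selected by the delta function), and
    move only that code vector toward x0."""
    codebook = rng.standard_normal((n_codes, samples.shape[1]))
    for _ in range(n_steps):
        x0 = samples[rng.integers(len(samples))]
        winner = np.argmin(((codebook - x0) ** 2).sum(axis=1))  # nearest neighbor
        codebook[winner] += eps * (x0 - codebook[winner])        # centroiding update
    return codebook

# e.g. codebook = train_vq(rng.standard_normal((1000, 2)))
```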
3.2 2-Stage Folded Markov Chain: Self-organizing Map. The above derivation may be extended to a 2-stage FMC, as shown in Figure 5. Because the derivation is so similar to the case of a 1-stage FMC, only an abbreviated derivation will be given. For simplicity, the following
notation will be used

$$P_{2,0}(x_2 \mid x_0) = \int dx_1\, P_{2,1}(x_2 \mid x_1)\, P_{1,0}(x_1 \mid x_0), \qquad \tilde P_{0,2}(x_0 \mid x_2)\, P_2(x_2) = P_{2,0}(x_2 \mid x_0)\, P_0(x_0) \tag{3.15}$$

$P_{2,0}(x_2 \mid x_0)$ is the probability (density) that $x_2$ will be generated by $x_0$, taking into account all of the possible values that the intermediate state $x_1$ might take. $\tilde P_{0,2}(x_0 \mid x_2)$ is the corresponding inverse probability that is obtained from Bayes' theorem. Using this notation the expression for $D$ becomes (compare equation 3.1)

$$D = \int dx_0\, dx_2\, dx'_0\; P_0(x_0)\, P_{2,0}(x_2 \mid x_0)\, \tilde P_{0,2}(x'_0 \mid x_2)\, \|x'_0 - x_0\|^2 \tag{3.16}$$

which, by the same steps that led to equation 3.5, reduces to

$$D = 2 \int dx_0\, P_0(x_0) \int dx_2\, P_{2,0}(x_2 \mid x_0)\, \left\| x_0 - \int du_0\, \tilde P_{0,2}(u_0 \mid x_2)\, u_0 \right\|^2 \tag{3.17}$$

Consider a modified form of $D$ (compare equation 3.6)
$$D = 2 \int dx_0\, P_0(x_0) \int dx_2\, P_{2,0}(x_2 \mid x_0)\, \|x_0 - x'_0(x_2)\|^2 \tag{3.18}$$
Set $\delta D / \delta x'_0(x_2) = 0$ to obtain (compare equation 3.8 and equation 3.9)

$$x'_0(x_2) = \int dx_0\, \tilde P_{0,2}(x_0 \mid x_2)\, x_0 \tag{3.19}$$
The modified expression for $D$ in equation 3.18 reduces to the original expression in equation 3.17 provided that $x'_0(x_2)$ is chosen to minimize $D$, so equation 3.18 will be used in preference to equation 3.17 [with the proviso that $x'_0(x_2)$ should always be chosen to minimize $D$]. Make the replacement $P_{1,0}(x_1 \mid x_0) \to \delta[x_1 - x_1(x_0)]$, which converts $P_{1,0}(x_1 \mid x_0)$ into a winner-take-all encoder (see the discussion following equation 3.10, and the Appendix for more details), and note that $P_{2,0}(x_2 \mid x_0) = \int dx_1\, P_{2,1}(x_2 \mid x_1)\, P_{1,0}(x_1 \mid x_0)$, to obtain $D$ as (compare equation 3.11)

$$D = 2 \int dx_0\, P_0(x_0) \int dx_2\, P_{2,1}[x_2 \mid x_1(x_0)]\, \|x_0 - x'_0(x_2)\|^2 \tag{3.20}$$
The functional derivative $\delta D / \delta x'_0(x_2)$ is thus (compare equation 3.12)

$$\frac{\delta D}{\delta x'_0(x_2)} = 4 \int dx_0\, P_0(x_0)\, P_{2,1}[x_2 \mid x_1(x_0)]\, [x'_0(x_2) - x_0] \tag{3.21}$$
which leads to the following batch training prescription for minimizing $D$ (compare equation 3.13)

$$x_1(x_0) = \arg\min_{x_1} \int dx_2\, P_{2,1}(x_2 \mid x_1)\, \|x_0 - x'_0(x_2)\|^2, \qquad x'_0(x_2) = \frac{\int dx_0\, P_0(x_0)\, P_{2,1}[x_2 \mid x_1(x_0)]\, x_0}{\int dx_0\, P_0(x_0)\, P_{2,1}[x_2 \mid x_1(x_0)]} \tag{3.22}$$

and the following on-line training prescription (compare equation 3.14)

$$x'_0(x_2) \to x'_0(x_2) + \epsilon\, P_{2,1}[x_2 \mid x_1(x_0)]\, [x_0 - x'_0(x_2)] \tag{3.23}$$
These results correspond to the results that were reported in Luttrell (1989a,c, 1990). They can be interpreted as the generalization of the VQ results in the previous section to the case where the output of the VQ is corrupted by the action of $P_{2,1}(x_2 \mid x_1)$ before Bayes' theorem is then used in an attempt to reconstruct the input vector. This winner-take-all version of a 2-stage FMC turns out to be an SOM, whose network representation is shown in Figure 6 for a discrete-valued output; this should be compared with the FMC representation shown in Figure 5. The SOM interpretation of these results for optimizing a 2-stage FMC is as follows:

1. The function $x'_0(x_2)$ can be interpreted as the continuum version of the SOM reference vectors, where $x_2$ is the (continuum) index and $x'_0(x_2)$ is the reference vector associated with that index. The batch update prescription for $x'_0(x_2)$ is a generalization of the LBG "centroiding" prescription (Linde et al. 1980) that accounts for the effect of $P_{2,1}(x_2 \mid x_1)$.
2. The result for $x_1(x_0)$ is not a "nearest neighbor" encoding prescription. Rather, it says that $x_1(x_0)$ is the value that $x_1$ must take in order to ensure that the distortion $D$ is minimized after taking into account the effect of $P_{2,1}(x_2 \mid x_1)$. Thus the nearest neighbor encoding prescription has become a minimum distortion encoding prescription. This reduces to the nearest neighbor encoding prescription when $P_{2,1}(x_2 \mid x_1) \to \delta(x_2 - x_1)$, as expected.

3. The on-line training prescription is the continuum version of the standard SOM training prescription, where $P_{2,1}(x_2 \mid x_1)$ plays the role of the SOM neighborhood function. $P_{2,1}(x_2 \mid x_1)$ also has this interpretation in the batch training prescription.

This completes the demonstration that an optimal 2-stage FMC is an SOM. Note that minimum distortion encoding is used, rather than nearest neighbor encoding, so this type of SOM is only an approximation to the standard SOM that was discussed in Kohonen (1984).
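A discrete sketch of prescription 3.23, with minimum distortion encoding, follows. This is not from the paper: it assumes a one-dimensional output lattice and models $P_{2,1}$ as a Gaussian neighborhood over node indices; the lattice size, width, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_som(samples, n_nodes=20, sigma=2.0, eps=0.05, n_steps=20000):
    """Discrete analog of eq. 3.23 with minimum distortion encoding:
    the winner minimizes the neighborhood-weighted squared error, not
    the plain nearest distance, and every reference vector is updated
    in proportion to P_{2,1}(x2 | winner)."""
    refs = rng.standard_normal((n_nodes, samples.shape[1]))
    idx = np.arange(n_nodes)
    # neigh[x2, x1] plays the role of P_{2,1}(x2 | x1), normalized over x2
    neigh = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    neigh /= neigh.sum(axis=0, keepdims=True)
    for _ in range(n_steps):
        x0 = samples[rng.integers(len(samples))]
        sq_err = ((refs - x0) ** 2).sum(axis=1)             # ||x0 - x0'(x2)||^2
        winner = np.argmin(neigh.T @ sq_err)                # minimum distortion encoding
        refs += eps * neigh[:, winner, None] * (x0 - refs)  # on-line update 3.23
    return refs
```

Replacing the winner selection by `np.argmin(sq_err)` would recover a standard nearest-neighbor SOM, illustrating the distinction drawn in remark 2.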
Figure 6: A discrete SOM represented as a network. This is the same as the VQ network in Figure 4 with an additional stage of processing applied to its output layer. Each node in the hidden and output layers of this SOM network corresponds to exactly one possible state of the vector $x_1$ and $x_2$, respectively. $P_{2,1}(x_2 \mid x_1)$ serves as the SOM neighborhood function by connecting together states of $x_1$ so that they become ordered (across the page in this case). Note that the VQ network in Figure 4 does not have this ordering property, although the states of $x_1$ (or nodes) are still drawn in an ordered fashion, for convenience. An example of a winner in the hidden layer is drawn as an open circle, as are each of the corresponding soft winners in the output layer. An example of the degree to which each node in the output layer is activated is indicated by a histogram, which records $P_{2,1}(x_2 \mid x_1)$ for each possible state that $x_2$ might take.
4 Coupled 2-Stage Folded Markov Chains

In Luttrell (1991b, 1992) some interesting results were reported where the behavior of a multilayer SOM could be interpreted as if the higher network layers were supervising the lower layers, and the term "self-supervision" was introduced to describe this effect. The purpose of this section is to show how these results can be derived from the theory of FMCs.
4.1 Splitting the Markov Chain State Spaces. The derivation starts by splitting each state of a 2-stage FMC into two lower dimensional pieces, as shown in Figure 7.
Figure 7: A 2-stage folded Markov chain with each state split into two lower dimensional pieces. This is basically the same as Figure 5, except that the states have been exploded to reveal their internal structure.

Figure 7 can be obtained by making the following changes to the notation in Figure 5

$$x_k \to (x_k^a, x_k^b), \qquad k = 0, 1, 2 \tag{4.1}$$

so the FMC in Figure 7 is the same as the FMC in Figure 5, apart from the notation that is used to describe its state spaces. This tautology is motivated by the need to break up high dimensional spaces, such as might occur in image processing problems, into a number of coupled lower dimensional spaces. For instance, if the coupling between the FMCs is weak, then a series expansion about zero coupling strength could be attempted. The lowest order term in this expansion would correspond to a pair of uncoupled FMCs, and the higher order terms would correspond to interactions between the FMCs.
If the new notation in equation 4.1 is used together with the generalization of equation 3.10

$$P_{1,0}(x_1^a, x_1^b \mid x_0^a, x_0^b) \to \delta[x_1^a - x_1^a(x_0^a, x_0^b)]\, \delta[x_1^b - x_1^b(x_0^a, x_0^b)] \tag{4.2}$$

in the expression for the distortion $D$ in a 2-stage FMC (see equation 3.17), and noting that $P_{2,0}(x_2 \mid x_0) = \int dx_1\, P_{2,1}(x_2 \mid x_1)\, P_{1,0}(x_1 \mid x_0)$, then $D$ reduces to

$$D = 2 \int dx_0^a\, dx_0^b\, P_0(x_0^a, x_0^b) \int dx_2^a\, dx_2^b\, P_{2,1}[x_2^a, x_2^b \mid x_1^a(x_0^a, x_0^b), x_1^b(x_0^a, x_0^b)] \left( \left\| x_0^a - \int du_0^a\, \tilde P_{0,2}(u_0^a \mid x_2^a, x_2^b)\, u_0^a \right\|^2 + \left\| x_0^b - \int du_0^b\, \tilde P_{0,2}(u_0^b \mid x_2^a, x_2^b)\, u_0^b \right\|^2 \right) \tag{4.3}$$

where the Euclidean error has been split into a sum of contributions from the $a$ and the $b$ subspaces of $x_0$. The Euclidean error terms have an interesting structure. For instance, in FMC $a$ the Euclidean distance between $x_0^a$ and $\int du_0^a\, \tilde P_{0,2}(u_0^a \mid x_2^a, x_2^b)\, u_0^a$ is computed, which explicitly depends via $\tilde P_{0,2}(u_0^a \mid x_2^a, x_2^b)$ on the outputs $(x_2^a, x_2^b)$ of both FMC $a$ and FMC $b$. In the previous section terms like $\int du_0\, \tilde P_{0,2}(u_0 \mid x_2)\, u_0$ turned out to correspond to SOM reference vectors, so by analogy it is expected that $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ ($k = a, b$) will also turn out to be the reference vectors for a pair of coupled SOMs corresponding to FMC $a$ and FMC $b$. Unfortunately, because of the coupling between FMC $a$ and FMC $b$, $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ ($k = a, b$) depend on both $x_2^a$ and $x_2^b$. The purpose of the following derivation is to find an approximate way of optimizing $D$ that does not involve these simultaneous dependencies on both $x_2^a$ and $x_2^b$, but uses quantities like $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^k)\, u_0^k$ rather than $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ for $k = a, b$. Thus each term can be split in the following manner for $k = a, b$
where the distinction between $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ and $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^k)\, u_0^k$ has been carefully used to separate the $x_0^k$ and $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ terms. The following two definitions are introduced for $k = a, b$
and used together with Bayes' theorem in the form
The first two terms in equation 4.7 have a structure that is simpler than in equation 4.3. For instance, $D_0(a)$ depends on the Euclidean distance between $x_0^a$ and $\int du_0^a\, \tilde P_{0,2}(u_0^a \mid x_2^a)\, u_0^a$, which depends only on the output of FMC $a$ and not on the output of FMC $b$. An analogous remark applies to $D_0(b)$. The last two terms in equation 4.7 contain the undesirable terms such as $\int du_0^a\, \tilde P_{0,2}(u_0^a \mid x_2^a, x_2^b)\, u_0^a$. In the following derivation these terms will be discarded to obtain an approximate scheme for minimizing $D$.

4.2 Least Upper Bound Optimization Scheme. Note that the following inequalities hold

$$D_0(a) \ge 0, \quad D_0(b) \ge 0, \quad D_1(a) \ge 0, \quad D_1(b) \ge 0, \quad D \ge 0 \tag{4.8}$$

These lead to the following constraint on $D$

$$0 \le D \le D_0(a) + D_0(b) \tag{4.9}$$
Although ideally $D$ itself should be minimized, it turns out to be much simpler to minimize its upper bound $D_0(a) + D_0(b)$. This least upper bound prescription achieves what is required, namely the elimination of the undesirable $D_1(a)$ and $D_1(b)$ terms that depend on $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^a, x_2^b)\, u_0^k$ for $k = a, b$. This approximate prescription becomes exact in the limit where the coupling between the FMCs in Figure 7 tends to zero. This approximate approach to optimizing the FMC is shown in Figure 8. At this point it is appropriate to introduce a modified form of $D_0(k)$ in which $\int du_0^k\, \tilde P_{0,2}(u_0^k \mid x_2^k)\, u_0^k$ in equation 4.5 is replaced by the function $x_0^k(x_2^k)$ for $k = a, b$
$$D_0(k) = 2 \int dx_0^a\, dx_0^b\, P_0(x_0^a, x_0^b) \int dx_2^k\, P_{2,0}(x_2^k \mid x_0^a, x_0^b)\, \|x_0^k - x_0^k(x_2^k)\|^2 \tag{4.10}$$
Figure 8: An approximation to a 2-stage folded Markov chain with each state split into two lower dimensional pieces. Only the $\tilde P_{0,2}(x_0^a \mid x_2^a)$ part is indicated on the reverse part of the chain. The approximation arises because $x_2^a$, rather than $(x_2^a, x_2^b)$, is used as the initial state on the reverse part of the chain. An analogous discussion holds for $\tilde P_{0,2}(x_0^b \mid x_2^b)$.

The functional derivative $\delta D_0(k) / \delta x_0^k(x_2^k)$ then becomes for $k = a, b$
and the cross derivatives $\delta D_0(a) / \delta x_0^b(x_2^b)$ and $\delta D_0(b) / \delta x_0^a(x_2^a)$ are zero. The stationary point $\delta D_0(k) / \delta x_0^k(x_2^k) = 0$ is obtained when $x_0^k(x_2^k)$ satisfies, for $k = a, b$,
Note that this result can readily be generalized to an arbitrary choice of $P_{1,0}(x_1^a, x_1^b \mid x_0^a, x_0^b)$ (i.e., not assuming equation 4.2).
By using Bayes' theorem in the form shown in equation 4.6 this reduces to
Thus the modified expression for $D_0(k)$ in equation 4.10 reduces to the original expression for $D_0(k)$ in equation 4.5 provided that $x_0^{k\prime}(x_2^k)$ is chosen to minimize $D_0(k)$, so equation 4.10 will be used in preference to equation 4.5 [with the proviso that $x_0^{k\prime}(x_2^k)$ should always be chosen to minimize $D_0(k)$]. When the upper bound on $D$ in equation 4.9 is minimized with respect to $x_1^k(x_0^a, x_0^b)$ using equation 4.10 it yields for $k = a, b$
and the following on-line training prescription for $k = a, b$
These results correspond to the self-supervised training scheme that was proposed in Luttrell (1991b, 1992), which was the result of a detailed study of the problem of designing an encoder/decoder for a pair of communication channels whose transmitted information was degraded by a noise process (both external noise and noisy coupling between the channels). The improvement in performance when channel coupling is taken into account was shown to be significant, so the least upper bound approximation is justified in hindsight for this type of system. The following remarks can be made about these results:

1. The functions $x_0^{a\prime}(x_2^a)$ and $x_0^{b\prime}(x_2^b)$ are the continuum versions of the reference vectors of a pair of SOMs corresponding to FMC $a$ and FMC $b$, respectively.

2. The results for $x_1^a(x_0^a, x_0^b)$ and $x_1^b(x_0^a, x_0^b)$ are modified forms of the minimum distortion encoding prescription, in which the coupling between FMC $a$ and FMC $b$ manifests itself through $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$.
3. In both the batch and the on-line training prescriptions $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$ play the role of neighborhood functions for SOM $a$ and SOM $b$, respectively. However, the coupling between the SOMs causes these neighborhood functions to be dependent on the input data, as will be discussed in more detail below.
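To make the structure of these coupled prescriptions concrete, the following is a minimal sketch of on-line training for a pair of coupled discrete SOMs. It is not the exact scheme derived above: the Gaussian neighborhood, the scalar coupling kappa that biases each SOM's neighborhood toward the other SOM's winner, and all parameter names are illustrative assumptions.

```python
import numpy as np

# Sketch of self-supervised training of two coupled SOMs (after
# Luttrell 1991b, 1992). The Gaussian neighborhoods and the bias
# coupling "kappa" stand in for the data-dependent marginals
# P_{2,1}(x_2^k | x_1^a, x_1^b); they are illustrative assumptions.
rng = np.random.default_rng(0)
M, dim = 20, 2                      # nodes per SOM, input dimension
ref_a = rng.random((M, dim))        # reference vectors x_0^a'(x_2^a)
ref_b = rng.random((M, dim))        # reference vectors x_0^b'(x_2^b)
sigma, kappa, eps = 2.0, 0.5, 0.05  # neighborhood width, coupling, step

def neighborhood(center):
    """Discrete Gaussian neighborhood on the index lattice 0..M-1."""
    k = np.arange(M)
    h = np.exp(-0.5 * ((k - center) / sigma) ** 2)
    return h / h.sum()

for t in range(10000):
    xa, xb = rng.random(dim), rng.random(dim)   # joint input (x_0^a, x_0^b)
    # Winners by nearest-neighbor encoding (a stand-in for the full
    # minimum distortion prescription, which couples the two SOMs).
    wa = np.argmin(((ref_a - xa) ** 2).sum(axis=1))
    wb = np.argmin(((ref_b - xb) ** 2).sum(axis=1))
    # Data-dependent neighborhoods: each SOM's neighborhood is biased
    # toward the other SOM's winner, mimicking the coupled marginals.
    ha = neighborhood((1 - kappa) * wa + kappa * wb)
    hb = neighborhood((1 - kappa) * wb + kappa * wa)
    ref_a += eps * ha[:, None] * (xa - ref_a)
    ref_b += eps * hb[:, None] * (xb - ref_b)
```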
Figure 9: A pair of coupled discrete SOMs represented as a pair of coupled networks of the type shown in Figure 6. As derived above, the coupling $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$ between networks $a$ and $b$ expresses itself as a pair of data-dependent neighborhood functions $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$. When the first stage of each of these coupled SOMs is optimized, taking account of $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$, the encoding functions $x_1^a(x_0^a, x_0^b)$ and $x_1^b(x_0^a, x_0^b)$ become mutually coupled.
4.3 Network Representation of Self-supervision. In Figure 9 the network representation of the pair of coupled SOMs is shown for a discrete-valued output. The detailed interpretation of equation 4.14 and equation 4.15 is shown in Figure 10. The marginal probabilities $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$ are obtained by projecting $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$ onto $x_2^a$ and $x_2^b$, respectively. The data dependence of $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$ causes these marginal probabilities to be data dependent. In particular, they can be biased as shown in Figure 10, which causes the neighborhood functions (in the SOM interpretation) in $x_2^a$ and $x_2^b$ space to be biased. The data dependence has another subtle side effect in Figure 10. The input transformations $x_1^a(x_0^a, x_0^b)$ and $x_1^b(x_0^a, x_0^b)$ depend on the marginal probabilities $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$, because the input transformations satisfy a minimum distortion criterion which depends on these marginal probabilities (see equation 4.14). However, the marginal probabilities themselves are data dependent, because they depend on $x_1^a$ and $x_1^b$, which in turn depend on the input transformations. Overall, the marginal probabilities and the input transformations are mutually dependent, which makes the minimum distortion encoding prescription quite subtle to implement in this case. Further details on self-supervision can be found in Luttrell (1991b, 1992), where a detailed discussion and numerical simulation of the consequences of using a particular type of $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$ are presented, a comparison is made between nearest neighbor and minimum distortion encoding, and a comparison is made between using mutually dependent neighborhood functions $P_{2,1}(x_2^k \mid x_1^a, x_1^b)$ ($k = a, b$) and independent neighborhood functions $P_{2,1}(x_2^a \mid x_1^a)$ and $P_{2,1}(x_2^b \mid x_1^b)$.
Figure 10: Diagram showing the detailed operation of a pair of coupled FMCs. For simplicity, $(x_1^a, x_1^b)$ is assumed to sit in the same vector space as $(x_0^a, x_0^b)$. Examples of the winners $x_1^a(x_0^a, x_0^b)$ and $x_1^b(x_0^a, x_0^b)$ are indicated by the open circles. These are then jointly smeared into the distribution $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$, which is drawn as a 2-dimensional histogram. As derived above, the optimization depends on a pair of neighborhood functions $P_{2,1}(x_2^a \mid x_1^a, x_1^b)$ and $P_{2,1}(x_2^b \mid x_1^a, x_1^b)$, which are the marginal probabilities of $P_{2,1}(x_2^a, x_2^b \mid x_1^a, x_1^b)$, and which are drawn as one-dimensional histograms in the diagram. These data-dependent neighborhood functions not only influence the optimization of the FMC, but also determine the winners that should have been used in the first place.

5 Conclusions
In this paper it has been demonstrated that VQ theory, SOM theory, and the theory of self-supervision all emerge naturally when an FMC is optimized so as to minimize the expected Euclidean error between an input vector and its attempted reconstruction (using Bayes' theorem).
FMC theory can be used to facilitate many computations that would otherwise be theoretically and/or numerically intractable. The "soft" probabilities that are used in the FMC are easier to compute with than the "hard" delta functions in the corresponding winner-take-all VQs and SOMs. The results contained in this paper guarantee that these "soft" computations reduce to the required "hard" computations when the first stage of the FMC is optimized.
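As a small illustration of this point (not taken from the paper), the following sketch shows a "soft" posterior-weighted VQ update collapsing to the "hard" winner-take-all update as the smearing width shrinks; the exponential weighting and the parameter T are illustrative assumptions.

```python
import numpy as np

# Sketch: a "soft" VQ update based on smeared, posterior-like weights
# degenerates to the "hard" winner-take-all update as the smearing
# width T -> 0. All parameters are illustrative.
rng = np.random.default_rng(1)
refs = rng.random((8, 2))           # reference vectors
x = rng.random(2)                   # one input vector
d2 = ((refs - x) ** 2).sum(axis=1)  # squared Euclidean errors

def soft_weights(d2, T):
    """Soft assignment: exp(-d2/T) normalized; T -> 0 gives a delta."""
    w = np.exp(-(d2 - d2.min()) / T)
    return w / w.sum()

for T in (1.0, 0.1, 0.001):
    w = soft_weights(d2, T)
    soft_step = w[:, None] * (x - refs)   # moves every reference vector
    print(T, np.round(w, 3))              # weights concentrate on winner

# The hard update, by contrast, moves only the winner:
winner = np.argmin(d2)
hard_step = np.zeros_like(refs)
hard_step[winner] = x - refs[winner]
```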
6 Appendix

6.1 Relationship between Continuum and Discrete Vector Quantizers. Throughout this paper continuum notation is used. In the case of a VQ this has the effect that the index that is used to select the winning reference vector is assumed to be a continuous-valued quantity, rather than a discrete-valued quantity. The purpose of this appendix is to relate the continuum case to the discrete case. It is sufficient to discuss the meaning of the on-line training prescription in equation 3.13 and equation 3.14, which is presented again here for convenience.
where x1 is assumed to lie in the unit N-dimensional hypercube. The discrete counterpart of this on-line prescription is
$$x'_{0,k_1} \to x'_{0,k_1} + \epsilon\, \delta_{k_1, k_1(x_0)}\, (x_0 - x'_{0,k_1}) \qquad (6.2)$$
where $k_1$ is a point on an $N$-dimensional cubic lattice. The relationship between the nearest neighbor encoding prescriptions $x_1(x_0)$ and $k_1(x_0)$ is simple to understand, so no further comment is required. On the other hand, the interpretation of the update prescription for $x_0'(x_1)$ is quite subtle, so it will now be discussed in some detail. In order to interpret the continuum prescription correctly it is necessary to integrate over a small region $S$ with volume $\delta V$ that encloses the point $x_1 = x_1(x_0)$, and to assume that $x_0'(x_1)$ is a smooth function of $x_1$ (for a justification of this assumption, see the last paragraph of this appendix). This leads to the result
$$x_0'(x_1) \to x_0'(x_1) + \frac{\epsilon}{\delta V}\, [x_0 - x_0'(x_1)] \quad \text{where } x_1 \in S \text{ and } x_1(x_0) \in S \qquad (6.3)$$
which can be compared directly with the update prescription for $x'_{0,k_1}$. Note that $\epsilon/\delta V$, rather than $\epsilon$, determines the size of the updates of the reference vectors that are attached to points $x_1$ inside $S$. The smaller the volume $\delta V$ is, the smaller $\epsilon$ has to be to keep the updates the same size.
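Equation 6.2 is the familiar on-line VQ update, and equation 6.3 amounts to updating a whole neighborhood $S$ of the winner, which is the SOM update. A minimal sketch of both follows; the lattice size, step size, and neighborhood radius are illustrative assumptions.

```python
import numpy as np

# Sketch of the discrete on-line prescription (equation 6.2) and of
# the neighborhood version suggested by equation 6.3. Lattice size,
# step size, and neighborhood radius are illustrative.
rng = np.random.default_rng(2)
M, dim, eps = 64, 2, 0.05
refs = rng.random((M, dim))    # x'_{0,k1}: one reference vector per k1

def train(x0, radius=0):
    k_win = np.argmin(((refs - x0) ** 2).sum(axis=1))  # k1(x0)
    if radius == 0:
        # Pure VQ (equation 6.2): only the winner moves.
        refs[k_win] += eps * (x0 - refs[k_win])
    else:
        # SOM-like update: move all reference vectors inside the
        # neighborhood S of the winner (cf. equation 6.3).
        lo, hi = max(0, k_win - radius), min(M, k_win + radius + 1)
        refs[lo:hi] += eps * (x0 - refs[lo:hi])

for _ in range(5000):
    train(rng.random(dim), radius=2)
```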
The most natural prescription is to impose a lower bound $\delta L$ on the length scale of $x_1$ of interest, which then imposes a natural size on the volume of $S$ such that $\delta V = \delta L^N$. Since $x_1$ is assumed to lie in the unit hypercube, this prescription leads to a finite effective number of reference vectors given by $\delta L^{-N}$, which would correspond to an $N$-dimensional cubic lattice whose size was $1/\delta L$ lattice spacings in each dimension. An interesting side effect of introducing this inner length scale $\delta L$ (which is effectively a regularization constant) is that it automatically generates an SOM on-line training prescription. To see this, note that in equation 6.3 a whole volume $\delta V$ of reference vectors $x_1$ in the neighborhood $S$ of the point $x_1 = x_1(x_0)$ is updated. This is precisely the type of update prescription that creates an SOM. Furthermore, this justifies with hindsight the assumption made earlier that $x_0'(x_1)$ is a smooth function of $x_1$.

6.2 Full Optimization of 2-Stage Folded Markov Chains. In this appendix it will be shown that if a 2-stage FMC with a Euclidean error function is optimized with respect to $P_{1,0}(x_1 \mid x_0)$, then it reduces to $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$. Because a 2-stage FMC contains a 1-stage FMC as a special case, it is necessary to present the derivation only for a 2-stage FMC. The result that is obtained depends on certain properties of the Euclidean error function; it is not a general property of FMCs. There are two constraints that must be respected during the optimization of $P_{1,0}(x_1 \mid x_0)$. Total probability must be conserved

$$\int dx_1\, P_{1,0}(x_1 \mid x_0) = 1 \quad \text{for all } x_0 \qquad (6.4)$$
and $P_{0,1}(x_0 \mid x_1)$ must vary in response to variations of $P_{1,0}(x_1 \mid x_0)$ in such a way that Bayes' theorem is respected (i.e., equation 2.1 with $k = 0$)

$$P_{0,1}(x_0 \mid x_1) \left[ \int du_0\, P_{1,0}(x_1 \mid u_0)\, P_0(u_0) \right] = P_{1,0}(x_1 \mid x_0)\, P_0(x_0) \quad \text{for all } x_0, x_1 \qquad (6.5)$$
The first constraint will be imposed for all $x_0$ by using a Lagrange multiplier function $\lambda(x_0)$, whereas the second constraint will be imposed by using Bayes' theorem to eliminate $P_{0,1}(x_0 \mid x_1)$ from $D$. The overall quantity $D$ to be minimized is therefore
Integration over $x_2$ and use of Bayes' theorem on both $P_{1,2}(x_1' \mid x_2')$ and $P_{0,1}(x_0' \mid x_1')$ transforms $D$ into a form that is suitable for differentiation with respect to the sought after quantity $P_{1,0}(x_1 \mid x_0)$
$$D = \int dx_0\, dx_1\, dx_0'\, dx_1'\, dx_2'\, P_0(x_0)\, P_{1,0}(x_1 \mid x_0)\, P_0(x_0')\, P_{1,0}(x_1' \mid x_0')\, \frac{P_{2,1}(x_2' \mid x_1)\, P_{2,1}(x_2' \mid x_1')}{P_2(x_2')}\, \| x_0 - x_0' \|^2 - \int dx_0\, \lambda(x_0) \left[ \int dx_1\, P_{1,0}(x_1 \mid x_0) - 1 \right]$$

where $P_2(x_2') = \int du_0\, du_1\, P_0(u_0)\, P_{1,0}(u_1 \mid u_0)\, P_{2,1}(x_2' \mid u_1)$.
Functional differentiation of $D$ with respect to $P_{1,0}(x_1 \mid x_0)$, and using the symmetry of $D$ under interchange of $x_0$ and $x_0'$, then leads to
Using Bayes’ theorem this simplifies to
Expansion and factorization eventually lead to the result
$$2\, P_0(x_0) \int dx_2'\, P_{2,1}(x_2' \mid x_1)\, \left\| \int dx_0'\, dx_1'\, P_{0,1}(x_0' \mid x_1')\, P_{1,2}(x_1' \mid x_2')\, x_0' - x_0 \right\|^2 - \lambda(x_0) \qquad (6.10)$$
The stationarity condition $\delta D/\delta P_{1,0}(x_1 \mid x_0) = 0$ is therefore

$$2\, P_0(x_0) \int dx_2'\, P_{2,1}(x_2' \mid x_1)\, \left\| \int dx_0'\, P_{0,2}(x_0' \mid x_2')\, x_0' - x_0 \right\|^2 = \lambda(x_0) \qquad (6.11)$$
The right-hand side of equation 6.11 is a function only of $x_0$, whereas the left-hand side is a function of both $x_0$ and $x_1$, where $x_1$ appears only in the $P_{2,1}(x_2' \mid x_1)$ factor. Because $x_1$ appears only on the left-hand side of this equation, its effect must somehow vanish. There are two types of solution in which the dependence on $x_1$ vanishes:

1. $P_0(x_0) = 0$. The dependence of the left-hand side of equation 6.11 on $x_1$ is suppressed by the zero-valued $P_0(x_0)$ factor. Any legal probability $P_{1,0}(x_1 \mid x_0)$ is thus permitted for values of $x_0$ that have $P_0(x_0) = 0$. This solution may be eliminated because for those $x_0$ that have $P_0(x_0) = 0$ there is no need to define $P_{1,0}(x_1 \mid x_0)$ anyway.
2. $P_0(x_0) > 0$. Note that the variance-like factor

$$\left\| \int dx_0'\, P_{0,2}(x_0' \mid x_2')\, x_0' - x_0 \right\|^2 \ge 0$$

so there are two situations to consider:

a. $\| \cdots \|^2 > 0$. In order to suppress the $x_1$ dependence of the left-hand side of equation 6.11, $x_1$ must be uniquely determined by the value of $x_0$. So $x_1 = x_1(x_0)$ and $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$.

b. $\| \cdots \|^2 = 0$. Apparently, any legal probability $P_{1,0}(x_1 \mid x_0)$ is permitted. However, any $P_{1,0}(x_1 \mid x_0) \ne \delta[x_1 - x_1(x_0)]$ will guarantee that the variance-like factor $\| \cdots \|^2 > 0$, so the only $P_{1,0}(x_1 \mid x_0)$ that can possibly survive are $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$. Note that with $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$ it is still possible that $\| \cdots \|^2 > 0$, in which case this solution can be eliminated.
The only solution that remains is $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$. This result establishes the fact that the replacement of $P_{1,0}(x_1 \mid x_0)$ by $\delta[x_1 - x_1(x_0)]$ used in equation 3.10 (and in equation 4.2, in the case of coupled FMCs) emerges naturally from minimizing $D$ in the function space of probabilities $P_{1,0}(x_1 \mid x_0)$.
It is important to note that the choice of a Euclidean error function is sufficient (but not necessary) for $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$ to emerge. Also, note that it is not true in general that optimization of $P_{1,0}(x_1 \mid x_0)$ leads to $P_{1,0}(x_1 \mid x_0) = \delta[x_1 - x_1(x_0)]$. For instance, this is the case when $\|x_0 - x_0'\|^2$ is replaced by either of the following two functional forms

$$\|x_0 - x_0'\|^2 + \begin{cases} A(x_0) + B(x_0') & \text{counterexample 1} \\ A(x_0)\, B(x_0') & \text{counterexample 2} \end{cases} \qquad (6.12)$$
Although these might not be considered to be sensible error functions, the fact that counterexamples exist is in itself important. These counterexamples may be described briefly as follows:
1. Counterexample 1 leads to a $D$ that has no dependence on $P_{1,0}(x_1 \mid x_0)$. Optimization of $P_{1,0}(x_1 \mid x_0)$ will allow any legal probability, so $P_{1,0}(x_1 \mid x_0) \ne \delta[x_1 - x_1(x_0)]$ is permitted.

2. Counterexample 2 is more complicated to analyze. However, it is possible to show that $P_{1,0}(x_1 \mid x_0)$, when viewed as a matrix, has a block diagonal structure. This type of $P_{1,0}(x_1 \mid x_0)$ does not imply a deterministic relationship between $x_0$ and $x_1$, so $P_{1,0}(x_1 \mid x_0) \ne \delta[x_1 - x_1(x_0)]$ is permitted.

Acknowledgments

The author is indebted to the following people for critically reading this paper: Eric Jakeman, David Lowe, and Graeme Mitchison.

References

Farvardin, N. 1990. A study of vector quantisation for noisy channels. IEEE Trans. IT 36, 799-809.
Farvardin, N., and Vaishampayan, V. 1991. On the performance and complexity of channel-optimised vector quantisers. IEEE Trans. IT 37, 155-160.
Kohonen, T. 1984. Self Organisation and Associative Memory. Springer-Verlag, Berlin.
Kumazawa, H., Kasahara, M., and Namekawa, T. 1984. A construction of vector quantisers for noisy channels. Electr. Eng. Jpn. 67B, 39-47.
Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantiser design. IEEE Trans. COM 28, 84-95.
Luttrell, S. P. 1989a. Self-organisation: A derivation from first principles of a class of learning algorithms. Proc. 3rd IEEE Int. Joint Conf. Neural Networks, Washington, DC, 2, 495-498.
Luttrell, S. P. 1989b. Hierarchical vector quantisation. Proc. IEE Part I, 136, 405-413.
Luttrell, S. P. 1989c. Hierarchical self-organising networks. Proc. 1st IEE Conf. Artificial Neural Networks, London, 2-6.
Luttrell, S. P. 1989d. Image compression using a multilayer neural network. Patt. Recog. Lett. 10, 1-7.
Luttrell, S. P. 1990. Derivation of a class of training algorithms. IEEE Trans. NN 1, 229-232.
Luttrell, S. P. 1991a. Code vector density in topographic mappings: Scalar case. IEEE Trans. NN 2, 427-436.
Luttrell, S. P. 1991b. Self-supervised training of hierarchical vector quantisers. Proc. 2nd IEE Conf. Artificial Neural Networks, Bournemouth, 5-9.
Luttrell, S. P. 1992. Self-supervision in multilayer adaptive networks. Proc. IEE Part F 139(6), 371-377.
Ritter, H. 1991. Asymptotic level density for a class of vector quantisation processes. IEEE Trans. NN 2, 173-175.
Yair, E., Zeger, K., and Gersho, A. 1992. Competitive learning and soft competition for vector quantiser design. IEEE Trans. SP 40, 294-309.

Received May 19, 1993; accepted September 7, 1993.
ARTICLE
Communicated by Laurence Abbott
Network Amplification of Local Fluctuations Causes High Spike Rate Variability, Fractal Firing Patterns and Oscillatory Local Field Potentials

Marius Usher, Martin Stemmler, Christof Koch
Computation and Neural Systems, 139-74, California Institute of Technology, Pasadena, CA 91125 USA
Zeev Olami
Department of Chemical Physics, Weizmann Institute of Science, Rehovot 76100, Israel
We investigate a model for neural activity in a two-dimensional sheet of leaky integrate-and-fire neurons with feedback connectivity consisting of local excitation and surround inhibition. Each neuron receives stochastic input from an external source, independent in space and time. As recently suggested by Softky and Koch (1992, 1993), independent stochastic input alone cannot explain the high interspike interval variability exhibited by cortical neurons in behaving monkeys. We show that high variability can be obtained due to the amplification of correlated fluctuations in a recurrent network. Furthermore, the cross-correlation functions have a dual structure, with a sharp peak on top of a much broader hill. This is due to the inhibitory and excitatory feedback connections, which cause "hotspots" of neural activity to form within the network. These localized patterns of excitation appear as clusters or stripes that coalesce, disintegrate, or fluctuate in size while simultaneously moving in a random walk constrained by the interaction with other clusters. The synaptic current impinging upon a single neuron shows large fluctuations at many time scales, leading to a large coefficient of variation ($C_V$) for the interspike interval statistics. The power spectrum associated with single units shows a 1/f decay for small frequencies and is flat at higher frequencies, while the power spectrum of the spiking activity averaged over many cells (equivalent to the local field potential) shows no 1/f decay but a prominent peak around 40 Hz, in agreement with data recorded from cat and monkey cortex (Gray et al. 1990; Eckhorn et al. 1993). Firing rates exhibit self-similarity between 20 and 800 msec, resulting in 1/f-like noise, consistent with the fractal nature of neural spike trains (Teich 1992).

Neural Computation 6, 795-836 (1994) © 1994 Massachusetts Institute of Technology
1 Introduction

A puzzling conflict between standard biophysical theories and the characteristics of spike trains recorded from cortical cells responding at high rates to visual input has recently been pointed out (Softky and Koch 1993). Experimental evidence shows that the amplitude of an individual excitatory postsynaptic potential (EPSP) is on the order of 0.1 mV, about two orders of magnitude smaller than the threshold depolarization from rest necessary for a pyramidal cell to spike (Komatsu et al. 1988; Mason et al. 1991) [for a review see also Fetz et al. (1991)]. Based on this, Softky and Koch showed that the neural firing pattern will be highly regular if the neuronal membrane acts as a leaky integrator summing over a train of stochastic, uncorrelated EPSPs. In an integrator model, the time to spike is determined by the total time in which a critical number of EPSPs accumulate. Since the interspike interval is the sum of random variables representing the intervals between EPSP inputs, the central limit theorem predicts that the output spikes will be highly regular. In other words, the shape of the interspike interval histogram will become highly peaked as measured by the coefficient of variation, $C_V$, defined as the standard deviation over the mean of the interspike interval (ISI) distribution. Softky and Koch also showed that this central limit result holds for a detailed biophysical compartmental model (including seven voltage-dependent somatic currents) of a cortical pyramidal cell in the presence of independent synaptic input. However, recordings from cells in V1 and MT cortex in the behaving monkey show that the discharge at high rates (up to 200 Hz) is highly variable in the length of interspike intervals, with a coefficient of variation of around one. Softky and Koch (1993) suggested two possible solutions to this dilemma: the first one requires fast and powerful active Na+ conductances sensitive to inputs at the millisecond time scale in the dendrites (as discussed in depth by Softky 1993). In this framework, neurons act as coincidence detectors, firing only if many synaptic inputs arrive simultaneously (Abeles 1982, 1991). A second approach, which we will develop here, is to solve the discrepancy by challenging the assumption of uncorrelated inputs while preserving the standard biophysical model of a temporally integrating membrane summing over small EPSPs. Here, interspike interval variability is a direct consequence of the global network dynamics, which create and amplify correlated fluctuations in an irregular fashion. This approach is motivated by the fact that extensive axon collaterals of pyramidal cells in cortex permit massively recurrent connections between neurons, which of necessity leads to strong correlations in the spike output of cells in the same proximity; however, recurrent connections alone do not lead to an increase in variability. In fact, depending on the pattern of connections, the opposite may occur. For example, in a recurrent network cells may entrain themselves into a steady state fixed point or to a limit cycle (oscillation) (Amit and Tsodyks 1991a,b; Koch and Schuster
1992; van Vreeswijk and Abbott 1993), where the interspike variability would be even lower than for uncorrelated input. We find that the generation of high variability depends critically on the spatial extent of inhibition relative to excitation. One connectivity pattern that robustly results in high variability is local excitation surrounded by inhibition, isotropically distributed across the entire network. Theoretical investigations of such connectivity using continuous firing rate models have shown that the translational and rotational symmetry of an isotropic network can break spontaneously, leading to localized patterns of excitations (Willshaw and von der Malsburg 1976; Amari 1977; Wilson and Cowan 1973; Ermentrout and Cowan 1980; Chernjavsky and Moody 1990) [see also Cowan (1982) for a bifurcation analysis predicting various geometric patterns, such as hexagonal or square lattices]. Generalizing this approach to a spiking model with noisy input, we find that the patterns of excitation exhibit metastability with large temporal fluctuations. In particular, we find that in the presence of homogeneous and independent spatiotemporal input to all cells, firing patterns display fluctuations on many temporal scales, leading to 1/f components in the power spectra, and high variability in the number of events. Furthermore, correlation functions between neighboring units display a sharp peak around zero, riding on a much broader hill. Such fluctuations of the firing behavior across many different time scales have recently been reported in neurons from the auditory pathway in mammals and the mesencephalic reticular formation (Teich 1989, 1992; Grueneis et al. 1990), where it was shown that they imply temporal clustering of events and a fractal firing pattern. In the following, we first review different aspects of the variability problem. In Section 3, our model is presented, followed by the results in Section 4. Finally, we discuss the relationship between our results and typical electrophysiological experiments in monkey cortex as well as the implications of our model in the last section.
2 The Variability Problem
2.1 Interspike Variability. The interspike interval histogram and the coefficient of variation can be easily obtained for a pure integrator that receives a stream of Poisson inputs. Consider first the case of pure excitation in which an integrate-and-fire unit receives Poisson distributed EPSPs at rate $\lambda$. Assuming that the membrane potential is reset to zero after each emitted spike and that $N_\theta$ synaptic inputs are required in order for the unit to reach threshold, the output interspike interval distribution $P_{N_\theta}(T)$ is equal to the probability that $N_\theta$ events will arrive during an interval $T$. If the arrival times of the incoming EPSPs are exponentially distributed, $P(T) = \lambda \exp(-\lambda T)$, then the probability distribution for the
Figure 1: Probability density functions $P_{N_\theta}$ for an integrate-and-fire unit requiring $N_\theta$ Poisson-distributed synaptic inputs to fire. The interspike interval (ISI) distribution becomes relatively narrower as the number of inputs required to reach threshold increases, a consequence of the Law of Large Numbers. The x-axis is in units of the mean time to spike $\langle T \rangle$. Functions are rescaled so that integrals under curves are all equal to unity.
sum of $N_\theta$ independent identically distributed random variables, $P_{N_\theta}(T)$, is

$$P_{N_\theta}(T) = \frac{\lambda (\lambda T)^{N_\theta - 1}}{(N_\theta - 1)!}\, e^{-\lambda T} \qquad (2.1)$$
If we denote the mean time between spikes as $\langle T \rangle = N_\theta/\lambda$ (where $\langle\ \rangle$ indicates the temporal average), we can introduce the normalized probability distribution $P_{N_\theta}(T/\langle T \rangle)$. Although $P_{N_\theta}(T)$ becomes broader with increasing $N_\theta$, the normalized distribution $P_{N_\theta}(T/\langle T \rangle)$ becomes narrower. In Figure 1, we display $P_{N_\theta}(T/\langle T \rangle)$ for several values of $N_\theta$, assuming a constant rate $\lambda$ of inputs. From equation 2.1, we obtain the variance and the standard deviation divided by the mean (the coefficient of variation $C_V$):

$$\mathrm{Var}(T) = \langle (T - \langle T \rangle)^2 \rangle = \frac{N_\theta}{\lambda^2} \qquad (2.2)$$

$$C_V = \frac{\mathrm{Var}(T)^{1/2}}{\langle T \rangle} = \frac{1}{\sqrt{N_\theta}} \qquad (2.3)$$
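Equation 2.3 is easy to confirm by a quick Monte Carlo experiment; the following sketch (with an illustrative EPSP rate) draws interspike intervals as sums of $N_\theta$ exponential waiting times:

```python
import numpy as np

# Sketch: verify C_V = 1/sqrt(N_theta) for a perfect integrator driven
# by Poisson EPSPs. The rate lam and spike count are illustrative.
rng = np.random.default_rng(3)
lam, N_theta, n_spikes = 1000.0, 100, 20000
# Each interspike interval is the sum of N_theta iid exponentials.
isi = rng.exponential(1.0 / lam, size=(n_spikes, N_theta)).sum(axis=1)
cv = isi.std() / isi.mean()
print(cv, 1.0 / np.sqrt(N_theta))   # both close to 0.1
```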
We observe that the coefficient of variation decreases with the square root of $N_\theta$. Thus for $N_\theta = 100$, corresponding to a reasonable estimate of the total number of EPSPs required to trigger a cortical pyramidal cell (Komatsu et al. 1988; Mason et al. 1991), one obtains $C_V = 0.1$, much lower than the experimental values for cortical cells firing at high rates, for which $C_V \approx 1$. This result is simply a consequence of the Law of Large Numbers: the sum of a sufficiently large number of independent random variables has a gaussian distribution. In principle, the variability can increase due to inhibition, which may cancel part of the mean rate of the input signal while keeping or increasing the fluctuation. However, a significant contribution of inhibition to variability can be ruled out for biophysical and mathematical reasons. First, compartmental simulations (Softky and Koch 1993) show that the degree of inhibition required for this is much larger than observed in in vivo intracellular recordings in cells in cat striate cortex (Douglas et al. 1988; Berman et al. 1991). In the following, we briefly outline a mathematical argument. Consider a cell receiving a superposition of an excitatory Poisson stream (of rate $\lambda_e$) and an inhibitory stream (of rate $\lambda_i$) of synaptic inputs, of equal but opposite magnitude. Under these conditions, the interspike interval's mean and variance can be written as (Tuckwell 1988):
$$\langle T \rangle = \frac{N_\theta}{\lambda_e - \lambda_i} \qquad \mathrm{Var}(T) = \frac{N_\theta (\lambda_e + \lambda_i)}{(\lambda_e - \lambda_i)^3}$$

The coefficient of variation is:

$$C_V = \left[ \frac{\lambda_e + \lambda_i}{N_\theta (\lambda_e - \lambda_i)} \right]^{1/2} \qquad (2.4)$$
Equation 2.4 shows that in order to obtain a $C_V$ of order one, the relation between excitation and inhibition rates should satisfy

$$\frac{\lambda_e + \lambda_i}{\lambda_e - \lambda_i} \approx N_\theta$$

or in terms of the rate of inhibition, $\lambda_i$:

$$\lambda_i \approx \lambda_e\, \frac{N_\theta - 1}{N_\theta + 1} \qquad (2.5)$$
For $N_\theta = 100$, the inhibition rate should equal $\lambda_i = 0.98\, \lambda_e$. In other words, both excitation and inhibition need to be roughly 50 times larger than the net resulting current proportional to $\lambda_e - \lambda_i$. No experimental evidence exists for such extremely high background currents. This scheme requires an unusually high degree of balancing, since a mere 2% higher inhibitory rate would totally cancel the excitatory input. Finally,
the output rate of such a neuron would be 50 times lower than the input rate, since the mean output rate is given by $(\lambda_e - \lambda_i)/N_\theta = 0.02\, \lambda_e/N_\theta$. When membrane leakage is taken into account, the interspike interval histogram associated with the leaky integrate-and-fire unit cannot be derived analytically. However, as shown by Softky and Koch (1993) in simulations of both integrate-and-fire and a detailed compartmental model, leakage leads to an increase in variability only for very low firing rates (relative to the reciprocal of the membrane time constant). At high output rates (relative to the inverse of the membrane time constant), no significant decrement occurs between spikes, and the $C_V$ is not affected. Because cortical recordings show large coefficients of variation even at high firing rates (Softky and Koch 1993), the problem of variability requires a different solution.

2.2 Variability in the Number of Events. A different measure of the "unpredictability" of a cell's discharge is the variability in the number of action potentials the cell fires in response to specific input. If action potentials are distributed according to a Poisson point process, the simplest of all stochastic processes, the mean number of spikes should be identical to the variance: $\mathrm{Var}(N) = N$. Yet, experimentally, neurons frequently show larger fluctuations, indicating clustering of the spike trains. Two different experimental paradigms have been used to evaluate this. In the first paradigm, the mean discharge rate associated with one cell responding numerous times to one particular stimulus for a fixed interval is computed along with the variance about the mean, resulting in one $[N_i, \mathrm{Var}(N_i)]$ pair (Snowden et al. 1992; Vogels et al. 1989; Tolhurst et al. 1983). This experiment is repeated for different stimuli and different cells (always for the same duration). These pairs are then plotted in log-log coordinates. For a Poisson process, the slope of the line passing through these points should be 1. Yet, different experiments in both V1 and in MT of the anesthetized or the behaving monkey using bars, gratings, or random-dot stimuli have consistently found a scaling law of the type $\mathrm{Var}(N) \propto N^{5/4}$. In other words, the variance in the number of spikes is greater than expected for a pure Poisson process. A clue to the origin of this high variability comes from a related experiment, in which long spike trains from neurons responding either spontaneously or to stationary sensory stimuli are partitioned into nonoverlapping time windows $T$, for which then the mean and variance in the number of events are calculated as a function of $T$. When the spike discharge in response to a constant input in different noncortical neurons is evaluated in this manner, the neurons show high variability, leading to a power law of the form:

$$\mathrm{Var}[N(T)] \propto \langle N(T) \rangle^{\nu} \qquad (2.6)$$
Again, for a pure Poisson process, $\nu = 1$. Sometimes (Teich 1992) an equivalent exponent, called the Fano number, equal to $\nu - 1$, is used instead. For regular spike trains, whose mean interspike interval does not diverge, the exponent $\nu$ has an upper bound of 2 (exponents larger than two lead to large and nonconvergent fluctuations in the firing rate). On the other hand, exponents $\nu < 1$ imply high regularity in the firing pattern (e.g., a purely periodic spike train has zero variance and hence $\nu = 0$). Typical values for $\nu$ range from 1.2 to 2.0 for spike trains from auditory nerve fiber (Teich 1989, 1992) and the cochlear nucleus (Shofner and Dye 1989),¹ while the firing of vestibular neurons is much more regular, with $\nu = 1.03$ (Teich 1992). What are the implications of these power laws? Furthermore, what is the relationship between assessing the mean and variability in a neuron's response by keeping the time window $T$ constant while varying the stimulus intensity (or contrast) and, in a different experiment, varying $T$ while keeping the stimulus constant? The implication of $\nu > 1$ in equation 2.6 can be understood by examining the fluctuations in the firing rate $f = N(T)/T$, which can be described in terms of $\Delta N(T) = \sqrt{\mathrm{Var}[N(T)]}$:

$$\Delta f = \frac{\Delta N(T)}{T} \propto N^{\nu/2 - 1}$$

For a Poisson process, where $\nu = 1$, the rate fluctuations decay with increasing time windows as $N^{-1/2}$. That is, recording four times longer doubles the signal-to-noise ratio. Values of $\nu$ bigger than 1 imply persistent firing rate fluctuations and clustering of the spiking process: fluctuations in the instantaneous firing rate will be partially preserved when the rate is averaged over increasingly longer intervals. In the extreme case of $\nu = 2$, the ratio of signal to noise remains constant no matter how long one averages. The random walk described by the deviation from the mean number of spikes in an interval $T$, that is $N(T) - \langle N(T) \rangle$, leads to a different way to understand the significance of the $\nu$ exponent. The standard deviation of this walk is proportional to $T^{\nu/2}$. Thus, multiplying the time interval by a factor of 2 scales the walk deviation by a factor of $2^{\nu/2}$. Such processes are called self-affine fractals, and $\nu/2$ is called the roughness (or Hurst) exponent (Feder 1988). For the case of $\nu = 2$, the walk scales by the same factor as the time interval, showing self-similarity. Two important statistical measures of spike trains related to the existence of self-similarity and persistent fluctuations, as reflected in the power law discussed above, are the interspike interval distribution $P(t)$

¹In a preliminary study, we have found similar power-law exponents for long spike trains recorded from the parietal cortex of monkeys performing delayed matching to sample tasks in the laboratory of J. Fuster (Zhou and Fuster 1992).
and the autocorrelation function $A(t)$. These two measures relate complementary statistical properties of spike trains: while $P(t)$ measures statistics of intervals between consecutive spikes and is order independent, $A(t)$ measures the fraction of spikes separated by time $t$ and is related to statistics of intervals of all orders. While a power law decay (with exponent between 0 and 1) in the autocorrelation function implies that the spike train fluctuations in $N(t)$ are persistent and self-affine (characterized by $\nu$ exponents between 1 and 2), we shall show that a power law decay (with an exponent between 1 and 2) in the interspike interval distribution indicates that the point process is a true fractal with a noninteger dimension $D$ between 0 and 1. Consider first the interval distribution $P(t)$. Following Mandelbrot (1983), the fractal dimension of an infinite recursive point process (dust) is defined via the number of covering intervals of length $\delta$, $N(\delta)$, needed to cover all the events in a finite interval of length $T$:

$$N(\delta) \propto \delta^{-D} \qquad (2.7)$$
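The covering count in equation 2.7 can be estimated directly by box counting; the following sketch uses a surrogate Poisson train, and the rate and range of $\delta$ are illustrative assumptions:

```python
import numpy as np

# Sketch: box-counting estimate of the fractal dimension D in
# equation 2.7. n_cover(delta) counts covering intervals of length
# delta that contain at least one spike.
rng = np.random.default_rng(8)
spikes = np.sort(rng.uniform(0.0, 1000.0, 20000))   # surrogate train

def n_cover(delta):
    """Number of occupied intervals of length delta."""
    return np.unique(np.floor(spikes / delta)).size

deltas = np.logspace(-1, 1, 10)
counts = [n_cover(d) for d in deltas]
D = -np.polyfit(np.log(deltas), np.log(counts), 1)[0]
print(D)   # ~1 for a Poisson train; D < 1 indicates clustering
```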
Since actual spike trains have finite length, equation 2.7 can hold only over a limited temporal range. Clearly, when $\delta$ is much smaller than the mean interspike interval, the number of covering intervals saturates at the total number of spikes in the train, imposing a lower cutoff on the range over which the process exhibits power law behavior. We show in Appendix A that a point process whose interspike interval distribution decays as a power law with exponent $-\gamma$, satisfying $1 \le \gamma \le 2$, has a fractal dimension $D = \gamma - 1$. Thus processes whose interval distribution decays faster than $t^{-2}$, such as the Poisson process whose interval distribution decays exponentially, have a dimension of one, implying that at time scales longer than the mean interval there are very few empty covering intervals. On the other hand, for a power law distribution with $1 < \gamma < 2$, the process is much more clustered, resulting in a large number of empty covering intervals. The dimension $D$ for such clustered processes is a noninteger number that lies between zero and one. The second statistical measure $A(t)$ can be related to the variance in the number of events $N(T)$ and the $\nu$ exponent (Cox and Lewis 1966; Teich 1989). For any stationary point process of mean rate $\lambda$, the variance in the number of events in time $T$ (see Appendix B) is

$$\mathrm{Var}(N) = N + 2 \int_0^T (T - t)\, [A(t) - A(\infty)]\, dt \qquad (2.8)$$
In the absence of correlations [$A(t) = A(\infty)$ for all values of $t$], this equation implies that $\mathrm{Var}(N) = N$, as for a Poisson process. If the autocorrelation function decays to chance levels in finite time $t_c$, the variance will be proportional to $N$ for $t > t_c$. The power law behavior for $N(T)$ will be observed, therefore, only if there are long-range temporal correlations in the data.
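The windowed-count measure behind equation 2.6 can be estimated directly from a spike train. The following sketch (with illustrative rate and window sizes) fits the exponent $\nu$ for a Poisson surrogate, for which $\nu \approx 1$:

```python
import numpy as np

# Sketch: estimate nu in Var[N(T)] ~ T^nu by partitioning a spike
# train into nonoverlapping windows of size T (equation 2.6). A
# Poisson surrogate should give nu close to 1; nu > 1 indicates
# clustering. Duration, rate, and window sizes are illustrative.
rng = np.random.default_rng(4)
duration, rate = 2000.0, 20.0                 # seconds, Hz
spikes = np.sort(rng.uniform(0, duration, int(duration * rate)))

Ts = np.array([0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0])
var_n = []
for T in Ts:
    counts = np.histogram(spikes, bins=np.arange(0, duration + T, T))[0]
    var_n.append(counts.var())
nu = np.polyfit(np.log(Ts), np.log(var_n), 1)[0]
print(nu)
```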
Table 1: Scaling Relations.

Var(N):             $\mathrm{Var}(N) \propto N^{\nu}$
Autocorrelation:    $A(t) \propto t^{\nu - 2}$
Power spectrum:     $S(f) \propto f^{-\nu + 1}$
ISI distribution:   $P(t) \propto t^{-\gamma}$
Since the number of spikes $N$ depends linearly on $T$, equation 2.8 will lead to $\mathrm{Var}[N(T)] \propto T^{\nu}$ (for large values of $N$) only if the autocorrelation function satisfies

$$A(t) - A(\infty) \propto t^{\nu - 2} \qquad (2.9)$$
for large values of $t$.² By the Wiener-Khinchin theorem, the power spectrum of a process is the Fourier transform of the autocorrelation. Assuming that $A(t) \gg A(\infty)$, equation 2.9 leads to
$$S(f) \propto f^{-\nu + 1} \qquad (2.10)$$
for small frequencies. For $1 \le \nu \le 2$, the exponent of the power spectrum at low frequencies will be between minus one and zero. Such exponents in the power spectrum are generally called 1/f noise. For the special case of a renewal process, in which consecutive intervals are independent of each other, a power law with exponent $-\gamma$ in $P(t)$ leads to a power law with exponent $\gamma - 2$ in the autocorrelation function $A(t)$ (Lowen and Teich 1993). For such processes, the exponent $D$ in equation 2.7 is related to the Hurst or roughness exponent $\nu/2$ by

$$\nu = D + 1 = \gamma \qquad (2.11)$$
In the strictest sense, these power law relationships hold only in the limit for which $P(t)$ behaves as a power law for all $t$. If the power law behavior extends only over a finite range, the prediction of exponents becomes approximate. For renewal processes characterized by a power law between two temporal cut-offs, full analytic solutions of $A(t)$ given $P(t)$ are difficult to obtain [but see Lowen and Teich (1993) for the special case of $D = 0.5$]. A restricted range of power law behavior is in general accompanied by a significant baseline autocorrelation $A(\infty)$, so that if $A(t)$ decreases as a power law, $A(t) - A(\infty)$ will, of course, decrease more quickly than $A(t)$. By the same token, $\mathrm{Var}(N)$ will increase more slowly with $N$ than predicted. We summarize the main scaling relationships in Table 1.

²If $A(t) \propto e^{-t/\tau'}$, then $\mathrm{Var}(N) \propto N$ in the asymptotic limit of long $T$. Only $A(t) \propto t^{\nu - 2}$ leads to a power law for $\mathrm{Var}(N)$ valid at large $N$.
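The spectral entry of Table 1 can likewise be checked numerically. The following sketch (bin width, rate, and frequency band are illustrative assumptions) estimates $S(f)$ from a binned surrogate spike train and fits the low-frequency slope, which Table 1 relates to $-\nu + 1$:

```python
import numpy as np

# Sketch: estimate the power spectrum S(f) of a binned spike train
# and fit the low-frequency slope (cf. equation 2.10 and Table 1).
rng = np.random.default_rng(5)
dt, n_bins = 0.001, 2 ** 20                   # 1-msec bins
counts = rng.poisson(0.02, n_bins)            # surrogate spike train
counts = counts - counts.mean()
S = np.abs(np.fft.rfft(counts)) ** 2 / n_bins
f = np.fft.rfftfreq(n_bins, d=dt)
lo = (f > 0.1) & (f < 10.0)                   # low-frequency band
slope = np.polyfit(np.log(f[lo]), np.log(S[lo]), 1)[0]
print(slope)   # ~0 for Poisson; approaches -1 as nu -> 2
```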
These power law relationships will also hold, under certain conditions, for the first type of experiment (with fixed recording time $T$, but variable stimulus intensity leading to corresponding mean rates $\lambda$). In this case, the result

$$\mathrm{Var}(N) \propto N^{\nu'} \quad \text{with } 1 \le \nu' \le 2 \qquad (2.12)$$

can be explained in terms of equation 2.8 if

$$A(t, \lambda) \propto \lambda^{\nu'} A(t) \qquad (2.13)$$

in the simultaneous presence of long-range temporal correlations. In other words, if the correlation function scales in this manner with the mean firing frequency and the temporal correlations are long-lasting, equation 2.8 implies $\mathrm{Var}(N) \propto N^{\nu'}$ for large values of $N$ (for fixed time intervals $T$). Note that there is no a priori reason for the exponent $\nu'$ obtained in the first experimental paradigm to be equal to $\nu$ measured in the second paradigm. If, however, the normalized correlation between spikes separated by time $t$ is only a function of the expected number of intervening spikes in time $t$, $\lambda t$, the two exponents will, in fact, be equal. This condition translates into

$$A(t, \lambda) \propto \lambda^{\nu}\, t^{\nu - 2} \qquad (2.14)$$
We searched for the simplest network that explains high output variability while still adhering to the fundamental constraints of cortical anatomy
Network Amplification of Local Fluctuations
805
and electrophysiology. Consequently, we used as our standard setup a model based on leaky integrate-and-fire neurons with semilocal connectivity. Modifications of this standard setup have also been studied and will be described below. We first describe the neural dynamics and subsequently the connectivity patterns used in the simulations. Although not necessary, it is helpful for the remainder of the paper to think of the simulated network as a sheet of cortical neurons in primary visual cortex receiving unstructured visual input from the lateral geniculate nucleus. 3.1 Neural Dynamics. The model consists of a two-dimensional lattice of units, connected within the layer by local excitatory and inhibitory synapses. Each unit integrates the inputs with a time constant r characterizing the passive neuronal membrane; once the potential reaches the threshold voltage V,, the unit emits a spike that is transmitted to synaptically connected neighboring units, and the potential is reset to zero. The dynamics of the integrate-and-fire model are given by
+ + C&O[Vj(t)- v,]
I;(t) = CJiO[Vj(t)- v,] i
1
with JE
=
a/N,,,
I'
=
PIE
excitatory synaptic coupling strength inhibitory coupling strength ( p < 1)
where the indices i and j denote the units, fd&y is a transmission delay, 0 is the Heaviside step function [O(x) = 1 for x > 0 and O ( x ) = 0 otherwise], and Iext is the external current impinging upon the cell. Some simulations used a biophysically correct model of synaptic input as conductance changes in series with ionic reversal batteries. The effective driving potential is thus given by the difference between the voltage and the reversal potential, that is, E,,, - Vi(t).E,,, was set at five times the threshold voltage, corresponding to fast voltage-independent AMPA excitatory input, while E , h was set to be shunting or silent, that is, E i h = 0, corresponding to GABAA-likeinhibition. We also ran some simulations using a current approximation for synaptic input, with no apparent qualitative difference. The Heaviside function reflects the fact that outputs are transmitted only when a cell's voltage exceeds its threshold. The external input is modeled independently for each cell as a Poisson process of pulses of width tspike, arriving at a mean rate Xext (in Hz) and of amplitude Vo/N,,,. All inputs to a unit are scaled by the number N,,, of connections a unit makes with other units; the parameter a is of order Vo, here set without loss of generality to 1. Furthermore, the input resistance in the update equation 3.1 is also set (without loss of generality) to 1.
M. Usher
806
et al.
Equation 3.1 is supplemented by a reset mechanism to model spike generation. If Vi(t)= V HVi , is reset to zero after a delay corresponding to the width of the action potential tspike, and kept at this value for a period representing an absolute refractory period. In our standard setup we have chosen, for simplicity, tdelay = tspike = fief = 1 msec. Under these assumptions, equation 3.1 can be integrated over the characteristic delay time of 1 msec, resulting in a discrete approximation for the subthreshold domain:
+
V ( t 1) = [kV(t)-t r ( t ) ]H[VH- V ( t ) ]
(3.3)
where k = exp( - I / T )is the decay factor of the membrane potential. This approximation assumes that the distribution of current at time scales smaller than 1 msec will not change the dynamics, which is characterized by a time constant 7 >> 1. In our standard model, T = 20 msec. It should be noted that the discretization equation 3.3 is natural in this context, since the original differential equation is not meant to capture the actual time course of the action potential during fspikt.. The reset after the spike is the simplest model of a rectifying potassium current. Since the unit needs time to recharge, the reset leads to an effective "refractory period." While it is true that the physiological refractory period is determined by the time course of a variety of voltage and calcium-dependent currents (Yamada et al. 1989), we are interested only in the temporal dynamics of spike times. When we henceforth refer to a refractory period in the model, we mean the effect of the reset. 3.2 Connectivity. As shown below, two aspects of the connectivity are crucial for high spike rate variability: locality of the connections and the range of inhibition. In the absence of local connections, for instance using all-to-all or sparse random connections, the population quickly reaches a steady or an oscillatory state (Amit and Tsodyks 1991a,b; Koch and Schuster 1992; van Vreeswijk and Abbott 1993; Tsodyks et al. 1993). Such states are characterized by low variability (except in the regime of very low firing rates). With local excitation but no inhibition, waves of excitation originate from random centers on the lattice, but variability does not increase. If the excitatory coupling becomes too large, the activity in the network explodes, reaching a steady state with high activity, but low variability. The optimal connectivity pattern that leads to high variability consists of local excitation and inhibition (but with inhibition more distant than the excitation). Our "standard model" was based on center-surround connectivity (see Fig. 2) on a rectangular array. Each unit is excitatorily connected to N,,,, = 50 units chosen from a Gaussian probability distribution of CT = 2.5 (in terms of the lattice constant), centered at the unit's position (square symbols in Fig. 2). The extent of excitatory connections was limited to a circular region of diameter 10 lattice units. Each unit was also connected
Network Amplification of Local Fluctuations
807
Gawlan ConnenMy
A
A
A
A
A
A
A
A A
A A
A A
A
E
E E E
A A
A
0
A E
E I E E E
0
E
0
E E E E
O D E
M E
I €I
A
B A A
D D E E
I
D
A
A E
A
0
0 8 8
A A
~a
A
A
B
A
E
E
B
E
E
A
A
A A A A
E
A A A A
A
A A
A
A
& A A A A
Figure 2: Basic gaussian connectivity pattern for the standard model. The cell (not shown) at the center of the rectangular array is connected in a probabilistic manner to units within a given distance determined by a gaussian distribution with cr = 2.5 lattice constants. These short-range connections are excitatory (squares). The center cell also inhibits a fixed fraction of cells on an annulus 8 and 9 lattice constants away (triangles). During a particular simulation, the connectivity pattern is fixed, although the exact synaptic weight varies stochastically.
in an inhibitory manner to N,,,units chosen from a uniform probability distribution on a ring eight to nine lattice constants away (triangular symbols Fig. 2). No cell was allowed to make more than one connection to any other cell. Each cell's connections were generated independently at the start of the simulation and remained fixed thereafter, so that the geometric pattern of excitation and inhibition is not uniform across the lattice. We occasionally use a sparse random connectivity. In this alternative setup, each unit makes N,,, excitatory and N,,, inhibitory connections to randomly chosen other units on the lattice, independent of distance. The amplitude or weight of the excitatory connection is = ct/N,,,, and that of the mhibitory ones is 1' = / j J E and is the same for all cells and synapses in the network. The normalization of the connection strengths by l/N,,,, in equation 3.2 takes into account the common physiological assumption that 50 to 100 summed EPSPs are needed to elicit a spike. For
M. Usher et al.
808
a 2 1 in the model, simultaneous firing of all local excitatorily connected units is sufficient to cause the receiving unit to spike. To mimic the well-known stochastic character of synaptic transmission (Stevens 1993), we add an independent random offset to each synapse throughout the network at every iteration in equation 3.2. This corresponds to choosing the excitatory weight at each iteration from a uniform probability distribution ( u - A)/N,, < IE < ( 0 A)/Nco,. Inhibitory weights were treated in the same fashion, but rescaled by B. The possibility that a spike may fail to activate a synapse in an allor-none fashion, as often happens in slice preparations of cortical cells (Stevens 1993) was also investigated in control simulations. In this last case, spike transmission at any given synapse was likely to fail with a fixed probability Pj. We used either cyclic wraparound or null-flux boundary conditions on a 100 by 100 unit lattice. In the latter case, activity beyond the boundary was treated as the mirror image of activity within the borders. Cyclic boundary conditions were preferred, since edge effects were absent.
+
3.3 The Standard Model. Unless specified otherwise, we always use in the following our “standard model” with gaussian center-surround connectivity on a 100 by 100 unit rectangular lattice with cyclic boundary conditions. Each cell excites N,,, = 50 excitatory and 50 inhibitory other cells and has a passive time-constant T = 20 msec. The Poisson distributed external input rate is X = 2.3 kHz, while the excitatory weight (Y is drawn from the uniform distribution [1.15,1.4]. The inhibition has two-thirds the strength of excitation, that is, [lr = 0.67 (EPSPs were renormalized by the magnitude of the driving potential from rest to make the IPSP and EPSP amplitude at the threshold voltage comparable). These parameters were chosen to achieve the maximal correspondence between model spike trains and those recorded from cells in monkey cortex. We usually compare our standard model against spike trains from isolated units, that is, from units in a network with no lateral connections (OY = 0). To mimic the observed arrival of EPSPs and IPSPs, we used a combination of excitatory and inhibitory external Poisson input. The conductance change induced by each individual input, whether excitatory or inhibitory, was identical, but the rate of the inhibitory input was set to be 0.67 of the excitatory rate. Both inputs were in series with the appropriate synaptic battery, of E,,, = 5 x V S = 5 and Vih = 0. To obtain approximately similar output rates, we had to increase the input rate to 15 kHz. 4 Results
We first discuss the overall dynamics of the entire network, before we turn toward properties of individual cells and, finally, of the local activity of small ensembles of neurons.
Network Amplification of Local Fluctuations
809
4.1 Pattern Formation and Metastability. We investigated the dynamic behavior of the system under the influence of three parameters: the excitation coefficient a, the ratio of synaptic inhibition to synaptic excitation /3 = J‘/JE, and the rate of external input A. High variability in the spike discharge results for cy > VO= 1, when the total excitation contributed by the generation of a spike, Nco,(cr/Nco,,),is larger than the decrease in potential caused by the resetting of the spiking neuron V O . In particular, 1.2 < CY < 1.6 and 0.5 < /3 are optimal for obtaining high variability. If the input rate X is high enough? one observes the emergence of clusters, or localized ”hotspots” of neural activity (as in Fig. 3 ) . The spatial range of inhibition effectively sets the bound on the radius of the clusters; for N = 50 excitatory and inhibitory connections, the clusters’ size is relatively insensitive to other parameter settings. Within this radius, cells fire at a higher rate (due to recurrent excitation that is strongest at the center of the cluster and decreases toward the boundaries), while cells in regions between clusters are inhibited. For very high external input rates, the clusters become stable and merge into stripes or hexagonal patterns, while at intermediate rates, the system of clusters is metastable, as characterized by a high degree of mobility. In this regime, the behavior of the clusters is dominated by two conflicting forces: diffusion (cells at the edge send activation to cells in the cluster’s vicinity) and the tendency to compactness, since the ring of inhibition acts as a ”fire wall.” As a result, the clusters fluctuate in size, move in a self-avoiding random walk-like fashion, and occasionally disintegrate or coalesce. The formation of localized metastable patterns is illustrated in Figure 3, where each frame displays the number of spikes emitted by all 10,000 units within the previous 50 msec. Regions of high activity do not remain fixed, but move from frame to frame. The motion of a typical cluster of high activity is shown in Figure 5. In some frames, elongated blobs/stripes appear and then disappear or change orientation in other frames. If one tracks the activity of individual fixed units during a long sequence of such frames, one observes large fluctuations leading to high variability in interspike intervals and total number of spikes. More interestingly, the fluctuations take place not only on a scale of 50 msec, but on many time scales. Figure 4 shows a similar display, but using a 2-sec time window. Evidently even on this time-scale the fluctuations are not completely averaging out. However, on this extended time scale, the activation patterns show more elongated forms, suggesting some degree of averaging over the patterns’ trajectories. For low inhibition values, /3 < 0.33, one obtains traveling waves (long fronts of activity) that are not fully periodic. Under such conditions the ‘Empirically, the asymptotic subthreshold voltage V = X7/Ncon must be at least 0.6 V s .
Under such conditions the individual cells show low C_V values. This parameter regime will not be further discussed in this work.

³Empirically, the asymptotic subthreshold voltage V = λτ/N_con must be at least 0.6 V_s.

4.2 Single Cell Properties.
4.2.1 Interspike Interval Variability. We computed the coefficient of variation C_V (without attempting to renormalize C_V for the existence of a refractory period) in our population of 10,000 cells over a 400-sec long simulation using a stationary input frequency of λ = 2.3 kHz to all cells in the standard model (when the initial 10 sec following the onset of the stimulus were eliminated, no noticeable difference resulted). The resulting values of C_V for individual cells are displayed in Figure 6 as a function of the mean interspike interval. Higher λs (or weaker inhibition) give rise to higher spiking frequencies and shorter interspike intervals, while maintaining high variability. Note that almost all values of C_V are on the order of one or larger. The pattern of observed C_V values reproduces qualitatively the C_V values measured for cells in cortical areas V1 and MT in the awake monkey responding to bars and to clouds of moving dots (Softky and Koch 1993). As discussed above, this is surprising given that, for an integrate-and-fire model at high output rates (when the effect of the membrane leak can be neglected), C_V = 1/√N_θ. For our parameter range, N_θ ≈ 50 and therefore C_V ≈ 0.14. We computed the C_V of the cells without any lateral connections in the network, that is, when the cells are only responding to the external input (lower cloud of dots in Fig. 6), and find that the C_V values are on average reduced by a factor of two. Qualitatively, a similar reduction by a factor of three in C_V occurs in a network with sparse nonlocal connections, as discussed in the "Model" section. At the same mean spiking frequency, that is, the same mean time between spikes, the isolated and the nonlocal networks show much lower variability than the network with the center-surround connectivity.
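As a minimal illustration of the statistic being reported, C_V here is simply the standard deviation of the interspike intervals divided by their mean, computed per cell; the function and variable names below are ours.

```python
import numpy as np

def cv_of_train(spike_times_msec):
    """C_V = std/mean of the interspike intervals; no refractory-period
    renormalization is attempted, matching the analysis in the text."""
    isi = np.diff(np.sort(np.asarray(spike_times_msec, float)))
    return isi.std() / isi.mean()

# Sanity check: a Poisson train gives C_V close to 1, a clock train 0.
rng = np.random.default_rng(1)
print(cv_of_train(np.cumsum(rng.exponential(50.0, 5000))))  # ~1 at 20 Hz
print(cv_of_train(np.arange(0.0, 1e5, 50.0)))               # 0
```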
Figure 3: Facing page. Spontaneous symmetry breaking in neural pattern formation. Each frame represents the summed activity over 50 msec in a simulation with random external input. These "snapshots" of neural activity are shown at 100 msec intervals from each other in a clockwise arrangement starting with the frame in the upper left-hand corner. Lighter colors denote higher firing rates (maximum firing rate 120 Hz). The direction of motion for selected clusters is shown by an arrow; the position of the moving cluster in the next frame is indicated by a diamond shape. Parameters for this and all subsequent figures pertaining to the model were external Poisson input rate λ = 2.3 kHz, connection strength α uniformly distributed between 1.15 and 1.4, and membrane time constant τ = 20 msec. Inhibition equals 2/3 of excitation (β = 0.67).

Figure 4: Facing page. Here each frame represents the summed neuronal activity within a 2-sec long time window in the presence of random external input. All other parameters are as in the previous figure. The self-similarity of the neuronal activity across different time scales is evident in the emergence of clusters at the 50-msec as well as at the 2-sec time scale. The highest number of spikes in these frames for any unit is 145, i.e., 72 Hz.
Figure 5: To illustrate the motion of a typical cluster seen in Figure 3, the center of a cluster is tracked over 10 sec of simulation. Each vertex in the graph represents the cluster's position averaged over 50 msec. Repulsive interactions with surrounding clusters generally constrain the motion to remain within a certain radius. This vibratory motion of a cluster is occasionally punctuated by longer-range diffusion.
4.2.2 Interspike Interval Histogram and Power Spectra. The power spectra of spike trains from individual units in our standard model (Fig. 7) are similar to those published in the literature for nonbursting cells in area MT in the behaving monkey (Bair et al. 1994). Power spectra were generally flat for all frequencies above 100 Hz. The effective refractory period introduces a dip at low frequencies (Bair et al. 1994). Given the long duration of individual spike trains, here 400 sec, the frequency resolution is high enough to observe the 1/f^0.8 decay at low frequencies (Fig. 7).
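A hedged sketch of the estimator: averaging periodograms of a 1-msec binned spike train over disjoint 4-sec segments reproduces the 0.25 Hz frequency resolution quoted below for Figure 7. The normalization is one common convention only, and the segment length is our choice.

```python
import numpy as np

def spike_power_spectrum(spike_times_msec, t_total_msec, seg_msec=4000):
    """Average periodogram of a 1-msec binned train over disjoint
    segments; 4-sec segments give 0.25 Hz resolution."""
    bins = np.zeros(int(t_total_msec))
    idx = np.asarray(spike_times_msec, float).astype(int)
    bins[idx.clip(0, len(bins) - 1)] = 1.0
    segs = bins[: len(bins) // seg_msec * seg_msec].reshape(-1, seg_msec)
    segs = segs - segs.mean(axis=1, keepdims=True)   # remove the DC peak
    spec = (np.abs(np.fft.rfft(segs, axis=1)) ** 2).mean(axis=0)
    freqs_hz = np.fft.rfftfreq(seg_msec, d=1e-3)     # 1-msec sampling
    return freqs_hz, spec / (seg_msec * 1e-3)

# The low-frequency exponent then follows from a log-log regression of
# the spectrum against frequency below the ~8 Hz cutoff.
```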
Figure 6: Coefficient of variability, C_V, of a representative range of cells shown against the average time between spikes, that is, the inverse of the mean firing rate. The solid dots in the upper cloud are from our standard model. The crosses in the middle cloud represent the behavior when all network effects are eliminated (i.e., α = 0) and units receive both an excitatory and an inhibitory stream of external Poisson input, with λ_i = 0.67 λ_e. Without inhibition, the associated C_V values are reduced by a factor of about 2. The diamonds in the lowest cloud are from cells in a random network with sparse, nonlocal connections that have no organized topography. The same number of inhibitory and excitatory connections, 50, was used as in the local center-surround connection scheme. All connections are reciprocal but otherwise random. Parameters for this run were otherwise the same as for the standard model.
Since Bair et al. used a frequency resolution of 4 Hz (instead of the 0.25 Hz used for Fig. 7), they could not have observed a 1/f component of the type shown for the model, even if 1/f noise had been present in the real data. Notice that the power spectrum associated with a pure Poisson process of rate λ is flat at all frequencies (no particular frequency is preferred), except for a delta function peak at the origin:
S(f) = λ + 2πλ²δ(f)

If the point process is Poisson with an absolute refractory period drawn from a gaussian distribution of temporal width σ, the power spectrum develops a dip at low frequencies:
Figure 7: Power spectra associated with the spiking activity of single units. The spectra for 19 units were computed individually from a 400-sec long run for the standard model and averaged in the uppermost spectrum. The average spiking frequency of the cells throughout the run is 18.1 Hz. At low frequencies, the power spectrum behaves as f^(−0.8±0.017) up to a cut-off frequency of ≈ 8 Hz (see superimposed solid line, and inset, which displays the same graph on a log-log scale). For comparison, the lower spectrum represents the behavior for the same parameters, but without any lateral connections. Note the absence of 1/f noise here. To obtain reliable estimates of the spectrum's low-frequency components, very long spike trains are required.
S(f) = λ[1 − √(2π) λσ e^(−(2πfσ)²/2)]   (4.2)

with λ ≤ 1/(√(2π)σ) (Bair et al. 1994). Figure 7 also shows the power spectrum in the disconnected network, that is, when α = 0. Individual units only receive external input. The major difference from the spectrum of the standard model in Figure 7 is the lack of a 1/f dependency around zero. Note the very weak peak around 50 Hz due to the more regular firing pattern. The ISI histogram of the standard model, averaged over 19 units, is shown in Figure 8. Because of the effective refractory period, interspike intervals shorter than 4 msec are not observed, while the clustering nature of the model leads to occasional long intervals.
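A sketch of the tail fit used for the ISI histogram that follows: a log-log least-squares fit of the interval density between 25 and 300 msec. The bin layout and the plain least-squares fit are our own choices.

```python
import numpy as np

def isi_tail_exponent(spike_times_msec, t_lo=25.0, t_hi=300.0, n_bins=20):
    """Log-log slope of the ISI density between t_lo and t_hi msec;
    the text reports -1.70 +/- 0.02 over this range."""
    isi = np.diff(np.sort(np.asarray(spike_times_msec, float)))
    edges = np.logspace(np.log10(t_lo), np.log10(t_hi), n_bins)
    dens, _ = np.histogram(isi, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    keep = dens > 0
    slope, _ = np.polyfit(np.log(centers[keep]), np.log(dens[keep]), 1)
    return slope      # ~ -gamma; Appendix A then gives D = gamma - 1
```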
Figure 8: Interspike interval (ISI) histogram of single unit activity averaged over 19 units in the standard model. The inset displays the same graph on a log-log scale. The best fit to the power-law decay exponent of the ISI histogram between 25 and 300 msec is −1.70 ± 0.02, implying a fractal dimension of the underlying point process of 0.7. The tail of the ISI histogram for isolated units (i.e., for α = 0; not shown) decays exponentially.
The slow decay of the ISI histogram for long intervals has an associated power-law exponent in the trailing edge of the histogram of 1.7 over one decade (from 25 to 300 msec). The corresponding fractal covering dimension D is 0.7 (equation 2.7). In contrast, the ISI distribution for units in the disconnected network (α = 0, not shown) has a tail that decays exponentially, with τ = 15.8 msec.

4.3 Local Fluctuations and the Field Potential. To show the fluctuations in the input to individual units, we display in Figure 9 the total excitatory input received by a single cell from other cells in the population (excluding I_ext). The lateral excitatory input to a cell is equivalent to the total activity in an area of radius 5 covered by excitation. The power spectrum of this input signal (Fig. 10) has a small peak at about 40 Hz (see the inset in Fig. 10) due to the internal dynamics of hotspots, which oscillate in size. Increasing the area over which the total activity is measured to a disk with a radius of 9 lattice units leads to an increase in the periodicity, as observed by the enhancement of the 40-70 Hz component and the disappearance of the 1/f components around zero (Fig. 11). The existence of a peak in the power spectrum of the ensemble activity is robust to rescaling the size of clusters (by changing the number of connections), the size of the network, the addition of noise, and the introduction of time-varying synaptic inputs (here, decaying exponentials). Since there are no (excitatory) long-range connections to link clusters, different clusters oscillate in size independently with different phases. The activity over the entire network is thus the sum of n incoherent quasiperiodic oscillators, where n is the average number of clusters on the lattice.
Figure 9: The total number of excitatory inputs into one particular neuron as a function of time. Given our connection geometry, this input comes from cells within 5 units distance. The external input is not included here and, for our standard input rate of λ = 2.3 kHz, corresponds to a mean input level of 1.8 lateral inputs per msec. Thus, local feedback connections dominate the network, in agreement with the canonical microcircuit hypothesis (Douglas and Martin 1990). Note the quasiperiodicity around 25 msec in the strength of this signal, which arises from the internal dynamics of a cluster (see Fig. 10).
4.3.1 Cross-Correlation Among Cells. Other well-known measures of spike train analysis are the auto- and cross-correlation functions. In Figure 12 we display some typical correlation functions obtained for spike trains from the same 400-sec long simulation as in Figure 7. The correlation functions were computed according to
C_ij(t') = [T/(T − t')] Σ_t x_i(t) x_j(t + t')

where x_i and x_j represent two spike trains (with either no or one spike event per msec bin), and T is the total duration of the recorded train. The multiplicative factor T/(T − t') is necessary for normalization because we use finite trains to compute the correlations.
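A direct transcription of this correlogram, with the finite-train normalization applied at each lag; the function and argument names are ours.

```python
import numpy as np

def cross_correlation(xi, xj, max_lag=300):
    """C_ij(t') = T/(T - |t'|) * sum_t xi(t) xj(t + t') for two 1-msec
    binned trains; the prefactor compensates for the shrinking overlap
    of finite trains, as discussed above."""
    T = len(xi)
    lags = np.arange(-max_lag, max_lag + 1)
    out = np.empty(len(lags))
    for k, lag in enumerate(lags):
        if lag >= 0:
            s = np.dot(xi[: T - lag], xj[lag:])
        else:
            s = np.dot(xi[-lag:], xj[: T + lag])
        out[k] = s * T / (T - abs(lag))
    return lags, out

# With xi and xj the same train, this reduces to the autocorrelation A(t).
```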
Figure 10: Power spectrum of the summed spiking activity over a circular area of radius 5, recorded from a fixed point on the lattice for 400 sec. This spectrum is roughly equivalent to the power spectrum of the lateral excitatory input (see Fig. 9), since short-distance connections are solely excitatory. At low frequencies, S(f) ∝ f^−ν, where ν = 0.69 ± 0.02 (see the solid line). The inset shows an enlarged part of the spectrum, revealing a small peak around 40 Hz.
Figure 11: Power spectrum of the summed spiking activity over a circular area the size of a single cluster (with a radius of 9 lattice constants), recorded from a fixed point on the lattice for 400 sec. The signal is the total number of units spiking at any given iteration. Compared to the power spectrum of the spiking activity within a radius of 5 units (previous figure), the peak, at 43 Hz, becomes much more noticeable, while the 1/f component disappears. Similar power spectra were obtained for the total activity on small lattices (20 by 20) that contained only a single cluster and for interacting clusters whose positions were tracked on the full-scale lattice (100 by 100).
Figure 12: Auto- and cross-correlation functions between cells separated by d units in the standard model. Each graph represents the average correlation functions over 400 sec of four pairs of cells that were randomly chosen (the baselines, therefore, are different). The autocorrelation A(t) behaves as 1/t^0.21, consistent with a power spectrum that behaves as 1/f^0.8. The graph of the best power-law fit to the autocorrelation is shown (numerical goodness of fit ±0.004), raised slightly for reasons of legibility. Cross-correlograms similar to these have been observed in cat visual cortex (Nelson et al. 1992).
When the two spike trains are identical, i = j, the result is the autocorrelation function A(t). The autocorrelation (Fig. 12a) shows the effect of a refractory period, followed by a period of enhanced firing probability that decays slowly to an asymptotic value via a power law with exponent −0.21. This behavior takes place up to a temporal cutoff of 300 msec, after which
the autocorrelation function reaches the baseline of chance coincidence. We should note that a power law decay is not the only possible model that will fit the data. The power law decay in the autocorrelation spans roughly one decade, which is not long enough to rule out the hypothesis of an exponential decay. The difference between the best power law fit a t^−b and the best exponential fit a exp(−t/τ) + c does not reach the level of statistical significance, based on a χ² test assuming Poisson-distributed errors. In Figure 12b,c,d we display cross-correlation functions for cells at three different spatial separations. The excitatory cross-correlations exhibit three main features: a sharp central peak, termed a "castle" in the neurophysiological literature (Nelson et al. 1992), flanked by small secondary peaks, and a slow decline to an asymptotic value, termed a "hill" by Nelson et al. (1992), characterized by the same power exponent as for the autocorrelation (i.e., −0.21). The cross-correlation for d = 9 shows a central dip with a slow recovery (governed by the same exponent) to the asymptotic level. This dip around the origin is caused by the action of inhibition located 8 to 9 units away (see Fig. 2).
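The model comparison just mentioned can be sketched as follows: fit both a power law a t^−b and an exponential a exp(−t/τ) + c to the decaying flank of a correlogram and compare χ² under Poisson (variance ≈ mean) errors. The initial guesses and the error model details are our assumptions; the significance threshold itself is not implemented here.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, a, b):
    return a * t ** (-b)

def exp_decay(t, a, tau, c):
    return a * np.exp(-t / tau) + c

def chi2(y, y_fit):
    # Poisson-distributed errors: variance approximated by the fitted mean
    return np.sum((y - y_fit) ** 2 / np.maximum(y_fit, 1e-9))

def compare_fits(t, y):
    """Return (chi2 of power-law fit, chi2 of exponential fit)."""
    p_pow, _ = curve_fit(power_law, t, y, p0=(y[0] * t[0] ** 0.2, 0.2))
    p_exp, _ = curve_fit(exp_decay, t, y, p0=(y[0] - y[-1], 100.0, y[-1]))
    return chi2(y, power_law(t, *p_pow)), chi2(y, exp_decay(t, *p_exp))
```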
4.3.2 Variability and Fractal Firing Patterns. Figures 3 and 4 illustrate that the activity of the neural population undergoes fluctuations on several time scales. Self-affine correlated fluctuations occur on all scales between 20 and 300 msec, based on the power law behavior of the auto- and cross-correlations, as well as that of the ISI distribution over this range. To obtain a quantitative measure of these fluctuations, we calculated the variance-mean curve by dividing a long simulated spike train into nonoverlapping periods of time T, and then computing the mean and variance in the number of events in intervals ranging from 20 to 5000 msec. The results are displayed in Figure 13 on a logarithmic plot. The slope of the curve in Figure 13 is 1.402 ± 0.007 between N = 0.5 and 20, corresponding roughly to intervals between 20 and 800 msec (at an average rate of 23 Hz). For t > 1 sec, temporal correlations in the spike train have decayed to chance level, so the variance in N behaves once again in a Poisson manner; that is, the slope for the standard model is equal to the slope of the Poisson process. For times shorter than the minimum spike interval, Var(N) = N, since there can be at most one spike in these intervals. We also simulated the second experimental paradigm for evaluating the variability in spiking, in which many trials of fixed duration are repeated. For this we performed a set of 25 simulations of 2.2 sec each (but only used the last 2 sec) for each afferent input level (changing the initial conditions and the random noise fluctuations, but keeping the connectivity pattern the same), and computed the mean number of spikes N and variance Var(N) for 60 randomly chosen cells in the network. The procedure was repeated with different values of stimulus strength to extend the range of the mean number of spikes. These data points, plotted on a log-log scale, are shown in Figure 14. The best linear fit (see solid line) has a slope of 1.54 ± 0.02, higher than the slope in the previous figure using a different paradigm. Average slopes of 1.21 and 1.10 were measured in similar experiments carried out in cells in cortical areas V1 and MT, respectively, in the monkey responding to random dots (Snowden et al. 1992). Softky and Koch (1993) find the slope for MT to be 1.25. For comparison, we also plotted the response of 60 isolated units, that is, with α = 0, subject to the same stimulation protocol. Here the slope of the best linear fit is 0.436 ± 0.004.
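The counting-statistics procedure of this subsection reduces to a few lines; the window set and function names below are ours.

```python
import numpy as np

def variance_vs_mean(bins, windows=(20, 50, 100, 200, 500, 1000, 2000, 5000)):
    """Counting statistics of a 1-msec binned train: for each window
    length T, count spikes in disjoint windows and record (mean, var)."""
    pts = []
    for T in windows:
        counts = bins[: len(bins) // T * T].reshape(-1, T).sum(axis=1)
        pts.append((counts.mean(), counts.var()))
    return np.asarray(pts)

def counting_exponent(pts):
    """Slope of log Var(N) against log N, the exponent nu in Var ~ N^nu;
    nu = 1 for a Poisson train, ~1.4 for the standard model."""
    return np.polyfit(np.log(pts[:, 0]), np.log(pts[:, 1]), 1)[0]
```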
Figure 13: Log-log plot illustrating how the variance in the number of spikes varies with the mean number of spikes N for variable sampling intervals T. The exponent in the power law Var(N) ∝ N^ν is ν = 1.4 for the standard set of parameter values. For comparison purposes, results from a Poisson train with a refractory period of 5 msec are superimposed, showing a slope of 1. The base firing rate for both processes is roughly 23 Hz. Self-affine fractal behavior of the spike train extends over the range from 20 to 800 msec and is evident in the fact that the variance increases faster than the mean. For t > 1 sec, temporal correlations in the spike train have decayed to chance level, so the variance in N behaves again in a Poisson manner (as discussed in Section 2.2). Typical power law exponents ν for real spike trains from the peripheral auditory nerve (Teich 1989) and parietal cortex (unpublished data) range from 1.2 to 1.7 for spontaneous activity.
Figure 14: Log-log plot illustrating how the variance in the number of spikes Var(N) varies with the mean number of spikes N for a fixed sampling interval T. Spike trains from 60 randomly chosen cells in a 56 by 56 network were recorded. The external input rates were ramped from 1.9 to 2.3 kHz in increments of 50 Hz, with 25 trials at each stimulus level lasting 2.2 sec. The last 2 sec of each trial were used for this analysis. The slope of the best linear fit on the log-log plot is 1.54 ± 0.02 (filled circles). The initial slope for the first four stimulus levels (1.9-2.05 kHz) is lower: 1.36 ± 0.04. Similar plots for cells in visual cortices V1 and MT responding to moving bars and random dots show slopes of around 1.2 (Snowden et al. 1992). We also plotted the same data for 60 units in a disconnected network, i.e., α = 0 (crosses). Here, the best linear fit has slope 0.436 ± 0.004.
5 Discussion
Our goal has been to forge a theoretical link between the statistics of spike trains from single cells and the dynamics of the entire network. Furthermore, we want to propose a solution to the dilemma posed by Softky and Koch (1992, 1993) of how to obtain high variability in networks of rapidly firing neurons that integrate over large numbers of synaptic inputs. Our approach here is to simulate a simple network of spiking cells receiving external input, reminiscent of a cortex receiving input from a dynamic random dot display, and to relate our findings to experimental findings in the cortex of cats and monkeys. Although Softky and Koch did offer a solution to the problem of high variability (neurons that act like coincidence detectors; see also Abeles 1982, 1991), we here offer
another solution, more in line with standard physiological thinking about the biophysics of pyramidal cells. Our simple network of integrate-and-fire units not only achieves high variability, but also shows several other noteworthy features that are shared with experimental data from cat and monkey visual cortex: the power spectrum of the local field potential frequently shows a peak around 40 Hz while the spectrum of single unit recordings very often does not (in particular in the primate), the cross-correlation has a typical castle-upon-hill structure, and the ratio of the variance to the mean of the number of spikes increases faster than expected of a Poisson process. Various measures of the spiking dynamics in our network show self-similar (fractal) behavior. It is at present not known to what extent such scaling laws can be found in cortical cells. Before we go on, let us discuss the fundamental limitations and assumptions of our modeling effort.

5.1 Limitations and Assumptions of Our Model. Our model of a single cell is a leaky integrate-and-fire unit receiving conductance inputs (Knight 1972). One could argue that such cells do not show the temporal dynamics of cells with a variety of voltage- and time-dependent Hodgkin-Huxley-like currents acting over different time scales. However, our previous research (Softky and Koch 1993; Bernander et al. 1994) has provided ample evidence that an anatomically very detailed compartmental model of a neocortical pyramidal cell, with seven voltage-dependent currents at the cell body and a passive dendritic tree, has temporal dynamics very similar to those of a leaky integrate-and-fire unit. The fundamental assumption critical to the functioning of the model is the center-surround pattern of connectivity, with short-range excitation and longer-range inhibition. Note that this constraint must only hold on average. If both excitation and inhibition are short-range, we fail to reproduce the observed spike statistics such as power spectra and ISI histograms, since the lattice dynamics are then dominated by waves of neural activation. The resulting spike trains have much lower C_V values. Our model is consistent with the "canonical microcircuit" hypothesis (Douglas and Martin 1990), in which massive excitatory recurrent feedback dominates the behavior of cortex: for our standard model with an input rate of 2.3 kHz (see Section 3.3), the average sum of lateral currents is at least 50% larger than the external, sensory driven, current. Examining Figure 9 reveals that for short periods the lateral excitatory current exceeds the external current at a 2.3 kHz input frequency by an even greater margin. Higher input rates lead to higher amplification factors. The circuit in the model thus serves to amplify the afferent input signal, particularly locally through the formation of "hotspots" (clusters). While anatomical and physiological evidence does not provide unequivocal support for this connection scheme, some physiological evidence for the validity of our assumption can be found.
Hess and coworkers (Hess et al. 1975) found that iontophoretic application of the excitatory agonist glutamate to visual cortex of anesthetized cats induces excitation of neurons within 100 μm of the application site and distant inhibition at distances between 100 and 500 μm. Similar studies in rat somatosensory cortical slice preparations (Silva and Connors 1987) confirm the general pattern of an inhibitory surround enclosing the excited region for all layers of cortex with one exception: layer IV, which receives direct afferent sensory input. A second critical assumption for our model is the existence of fast inhibition. Simulations performed with exponentially decaying synaptic inputs show that the cluster scenario fails if the inhibitory synapses are much slower than the excitatory synapses. When, for instance, the inhibitory synapses are five times slower, the inhibitory "fire wall" comes into play too late, after the excitation has already spread outside the cluster's domain. In this case the population is entrained into periodic oscillations, which radically changes the ISI histogram of single cells. Microstimulation studies give us an estimate of the speed with which inhibitory effects take hold: Asanuma and Rosen (1973) showed that inhibition outdistances excitation in motor cortex, with the onset of all excitatory and inhibitory interactions occurring within 3 msec of stimulation. Although inhibition must act through interneurons, the inhibitory response to stimulation can be almost as fast as the excitatory response in cortex. Also, spike-triggered averaging of EPSPs and IPSPs elicited synaptically in slice preparations shows identical rise times for IPSPs and EPSPs of ≈ 1.5 msec (Komatsu et al. 1988). We should note, however, that the coexistence of long-term and short-term inhibition is not ruled out; in fact, a slower GABA_B-type current in the model would lead to greater mobility of the clusters of excitation and thus to greater variability. At the moment, our model does not contain any explicit interneurons; that is, individual units can both excite one set of postsynaptic targets while simultaneously inhibiting another set. Our future work will do away with this unrealistic feature of our model, but at the price of an increased number of neurons and therefore an increased computation time. This constraint also held the connectivity to N_con = 50, a small fraction of the divergence seen in cortex. The connectivity scheme of Figure 2 is meant to approximate a more realistic connectivity in which both the inhibition and excitation are probabilistically spread over broader areas. To investigate the effect of rescaling the model in a preliminary manner, N_con = 150 connections per neuron (instead of 50) were used in short control runs, yielding similar results to the N_con = 50 case.⁴

⁴Although larger N_con leads to more stable (thus less variable) firing patterns, "metastability" is restored once a synaptic "failure" process is taken into account, even for very low failure probabilities P_f < 5%. See Section 3.
We did not attempt to model cells that burst, that is, that discharge 2-4 spikes within 10 msec or less. A large fraction of cortical cells responding to sensory events in the awake monkey frequently fires bursts (Werner and Mountcastle 1963; Bair et al. 1994). It is easy for bursting cells, with more complicated internal dynamics, to show arbitrarily high values of C_V. Instead, we here assume the standard reset mechanism of integrate-and-fire neurons; therefore, short interspike intervals were rare. What we consider the most important limitation of our current study is that we assumed stationary and uncorrelated input to all cells of the network. In particular, we did not model the bars or gratings that are the most common form of visual stimulus in the various experimental studies we cite. However, we were surprised by the rich dynamic behavior and its qualitative relationship to experimental data. Yet, this present study is clearly but a starting point for more realistic and sophisticated models of cortical networks. In the following, the main results of the model will be discussed.
5.2 Pattern Formation. We have shown that "hotspots," that is, patterns of excitation, emerge even for homogeneous Poisson input to our network (Figs. 3 and 4). We carried out similar simulations using completely isotropic, circular connectivity patterns with fixed synaptic weights and again observed the emergence of circular excitation patterns (not shown). In our network, this pattern formation process implies spontaneous symmetry breaking due to random spatial correlations in the external input. The emergence of hexagonal and stripe-like patterns through symmetry breaking in continuous firing-rate models with similar connectivity has been previously demonstrated (Cowan 1982). In our model, however, the patterns of excitation are not stable; rather, they are subject to a stochastic diffusive process, leading to spatiotemporal fluctuations on many time scales (Fig. 3). It should be possible to directly visualize these moving "hotspots" using high-resolution optical imaging of the intact cortex in primates based on voltage-dependent dyes or intrinsic signals (Grinvald 1992). Indeed, moving circular or elliptic regions of high neuronal activity have been reported in visual cortex of the anesthetized squirrel monkey and the rat using voltage-sensitive dyes in response to electrode stimulation. Orbach et al. (1992) observed motions of such clusters on the millisecond time scale; multiple clusters were sometimes elicited in response to stimulation at a single point, and, in one instance, a single cluster was seen to split into two separate centers of high activity.
5.3 High Interspike Variability. We have shown that neural spike trains from our standard model have large coefficients of variation, that is, C_V values between 1 and 1.5 for rates up to 50 Hz (Fig. 6). Increasing the external Poisson input rate or lowering β, the ratio of inhibition to excitation, results in higher firing rates, which we did not explore here.
These large C_V values are reflected in the slow decay in the tail of the ISI histogram (Fig. 8), here as 1/t^1.7. This is in sharp contrast to the ISI for units in the disconnected network, which decays exponentially. We do not know at present whether such power-law decay in the ISI has been seen in cortical cells, but are studying this issue systematically. In their study of C_V from nonbursting cells in cortical areas V1 and MT, Softky and Koch (1993) observed equally high values. The C_V associated with the disconnected units receiving only excitatory and inhibitory external Poisson input is about a factor of two lower (Fig. 6). Without the inhibitory Poisson input, the associated C_V would be further reduced by a factor of 6 (see equation 2.4). C_V values are even further reduced in a network with random connectivity. In our model, the high degree of randomness observed in the single unit firing statistics arises from the amplification of correlated fluctuations, resulting in clusters of excitation moving through the cortical layer. This can be compared to a recent model of short-term memory that was proposed to account for the observed interspike interval histogram of spikes in IT cortex of behaving monkeys (Zipser et al. 1993). Intrinsic variability in this model is the result of stochastic transitions between two firing rates (attractors in the language of dynamical systems) inside a probabilistic network of McCulloch-Pitts neurons. Our model provides an actual mechanism for such transitions while generalizing the approach of Zipser and colleagues to include the membrane leak (with time constant τ) and noise in the form of fluctuating Poisson external input to the cells. To be precise, the model has not only two fundamental firing rates, but rather a continuum of rates related to the cell's position relative to the cluster centers; moreover, as opposed to Zipser et al. (1993), who fitted interspike interval distributions with two exponentials of different time constants, in our model the interval distribution and correlation functions decay with a power law. Consequently, we find that the firing pattern is fractal up to a time scale of about 800 msec. It should be noted, however, that a power law decay can arise as the superposition of many exponentials with different time constants. If the majority of the input EPSPs is distributed as an infinite-range power law 1/t^α with α < 3, the central limit theorem does not apply to the sum of EPSPs, since the variance of the distribution is not finite. Consequently, for finite-range power laws, the C_V of a sum of N_θ pulses will decrease (initially) more slowly than 1/√N_θ as the number of pulses increases; even the sum of a large number of pulses can still have a high C_V. As we mentioned previously, alternative solutions to the variability problem are either very large hyperpolarizing (inhibitory) currents or an active dendritic mechanism for coincidence detection at the millisecond level. To achieve the same high variability as in the model on the basis of concurrent excitatory and inhibitory streams of input alone (without network effects or coincidence detectors), the component inhibitory and excitatory currents must be on the order of 50 times larger than the net current.
Physiological experiments rule out the existence of such extremely large hyperpolarizing or shunting currents in cortical cells (e.g., Nelson 1991; Douglas et al. 1988; Berman et al. 1991). A very fast, sodium-based⁵ spiking mechanism working at the millisecond scale and located in the dendritic tree can, in principle, enable cortical cells to respond as coincidence detectors, resulting in high variability (Softky and Koch 1992, 1993). As of yet, no solid experimental evidence exists for such fast, powerful, all-or-none dendritic phenomena in cortical cells, although, given the relative inaccessibility of distal dendrites, they cannot be ruled out.

⁵Calcium dendritic spikes are too slow to affect the neuronal variability at the time scale observed.
5.4 Correlation Functions. It is remarkable that the cross-correlation functions show a dual process: a sharp peak at small delay intervals followed by a much slower decay, characterized by a power law with exponent −0.21, to an asymptotic level. Exactly this form of cross-correlation was found to be the most common correlation structure in physiological recordings from cat visual cortex by Nelson and his colleagues (Nelson et al. 1992), who termed it a "castle on a hill" structure. Similar cross-correlograms were reported among cells in the macaque inferotemporal cortex by Gochin et al. (1991). Our cross-correlograms are more peaked for units that are near each other and receive similar input from the surround; as the distance between the units increases, the central peak becomes wider and smaller, as reported in physiological studies (Nelson et al. 1992). Following Toyama et al. (1981), we computed the correlation coefficient defined as the fraction of spikes from two separate trains that are within ±9 msec of each other (the width of the central peak in the cross-correlogram of Fig. 12b is 9 msec). This coefficient, measuring the degree to which two spike trains entrain each other, is 0.59 for adjacent cells and 0.29 for cells four lattice sites apart. When the distance between units corresponds to the characteristic inhibition range (here d = 9; see Fig. 2), the cross-correlation shows a slight dip at short time scales. It should be noted that the cross-correlation functions are peaked at zero time delay, in agreement with experimental results (Nelson et al. 1992). However, as opposed to classical "common input" interpretations, in which neurons are hypothesized to share a common external input, the zero-delay peaks here are generated by recurrent excitation, leading to local clusters of excitation. The form of the cross-correlation functions (the "castle on a hill" structure) suggests that two processes are taking place: on a fast time scale (of less than 10 msec), the presence of excitatory clusters leads to synchronization, while on a slower time scale (up to 300 msec) the correlations are due to the clusters' diffusive trajectory.
5.5 Local Field Potentials. As illustrated in Figure 9, the excitatory current from the lateral connections impinging upon a cell shows strong fluctuations on a scale that is much faster than the characteristic time scale for the diffusion of clusters. These fluctuations are due to the internal dynamics of the activity inside a cluster (which oscillates, nonperiodically, in size). We measured a related signal, the total number of active cells within a circular region of radius 5, and computed its power spectrum (Fig. 10). This signal can be thought of as the local field potential; it has a small peak around 40 Hz. The power spectrum associated with the total spiking activity within a neighborhood of 9 units radius shows a much enhanced peak at 43 Hz (Fig. 11), corresponding to the γ domain of slow wave activity in electrophysiology. Interestingly, the power spectra of spike trains of individual cells within the network, as well as those of disconnected units, show little evidence of a peak in this frequency band (Fig. 7). Furthermore, the "single unit" power spectrum shows a prominent 1/f decay, while such a component (caused by the long-range fluctuations due to the network connectivity) is totally absent in the spectrum computed for the simulation of disconnected units (Fig. 7). Notice that in order to see such a decay at very low frequencies in the spectrum, single units must be recorded for on the order of 10-100 sec. This explains why they are not evident in the power spectra computed from 2-sec long spike trains (Bair et al. 1994). Not only does the 1/f decay disappear as one moves from the power spectra of single units to the spectra of local field potentials (Figs. 10 and 11), but a peak in the γ domain asserts itself. This change in the shape of the power spectra can be understood in terms of the autocorrelation function associated with the activity of the clusters. Since the cluster activity is the sum of all single-unit spiking activity within a cluster of N cells, the autocorrelation of the cluster spiking activity will be the sum of N autocorrelation functions of the individual cells and N × (N − 1) cross-correlation functions among individual cells within the cluster (N = 261 for our standard model). Thus, the autocorrelation of the entire cluster activity is dominated by the cross-correlations among individual units. It can be shown with the help of the Wiener-Khinchin theorem and the additivity of the Fourier transform that the power spectrum of the cluster activity will be dominated by the sum of the Fourier transforms of the cross-correlations between cells within the cluster (see Fig. 12). For pairs of cells at distances of 4 or less (such that they excite each other), the Fourier transforms of the cross-correlation functions (the second and third plots in Fig. 12) have a 1/f peak at the origin, a small, wide peak at about 40 Hz (due to the secondary peaks in the excitatory cross-correlations), and a decline to zero for frequencies larger than 100 Hz (the reciprocal of the width of the central peak of the cross-correlations). The amplitude of the Fourier transform of the correlation function of pairs of cells at d = 9, that is, units that directly inhibit each other, shows an inverted 1/f (negative) component, which decays asymptotically to zero.
Thus, when the area over which the activity is averaged is sufficiently large to include such "inhibitory" interactions, here a circle of minimal radius equal to nine, many such inhibitory cell pairs are included. As a result, the 1/f component in the power spectrum of the radius-5 activity is canceled. Two factors lead to a power spectrum peak in the 30-70 Hz range (Fig. 11). The first is the existence of secondary peaks around 25 msec in the excitatory cross-correlations. The second is the width of the excitatory cross-correlation peaks relative to the inhibitory troughs. In general, the excitatory "castles" are sharp relative to the broad dip in the cross-correlation due to inhibition. In Fourier space, these relationships are reversed: broader Fourier transforms of excitatory cross-correlations are paired with narrower Fourier transforms of inhibitory cross-correlations. Superposition of such transforms leads to a peak in the 30-70 Hz range. Note that the peak around 40 Hz in the global activity develops in the absence of any explicit oscillators. The period of the "oscillations" we observe reflects the delay between the buildup of excitation in a cluster and the inhibitory response. This agrees with previous simulations of oscillations in cortex (Wilson and Bower 1991; Bush and Douglas 1991) and analytical results (Wilson and Cowan 1972; Koch and Schuster 1992) outlining the crucial role of inhibitory interactions in generating neuronal oscillations at the population level. While cortical oscillations in the 30- to 90-Hz range are commonly found in local field potential or multi-unit activity measurements in both cat and monkey visual cortex (Gray et al. 1990; Kreiter and Singer 1992), these oscillations are much less evident in single-unit data (Eckhorn et al. 1993; Eeckman and Freeman 1990; Young et al. 1992; Bair et al. 1994). We here offer a general explanation for this phenomenon.

5.6 Long-Range Fluctuations. We find that the firing of integrate-and-fire units, embedded in a network with specific connectivity rules, shows clustering, long-term fluctuations, and self-similarity (fractal behavior) over time scales from 20 to 800 msec (based on the behavior of the variance versus mean curve for the number of events). This fractal behavior is not a trivial result in our model but depends critically on the ratio of the weight of the inhibitory synaptic input to its excitatory counterpart. Thus, for β = 0.67, the exponent ν in the law relating the mean number of spikes to the variability in this mean rate (equation 2.6) is ν = 1.4, while for β = 0.50, ν = 1.0, that is, identical to that expected for a Poisson process. Moreover, our model accounts for the power law increase (with exponents ν between 1 and 2) for two different, and previously unrelated, experimental paradigms: repetitive short trials with varying sensory stimuli (Fig. 14) and trials with a stationary input and very long spike trains (Fig. 13).
The similarity of the results (both showing power law relations) suggests that these two types of fluctuations may have a common origin. For the model, one can assume that fluctuations in space across the network are equivalent to temporal fluctuations, that is, that a spatial average of single cell properties is equivalent to a temporal average over a long spike train from one cell. This assumption is known as ergodicity. However, as this procedure involves a variation in the stimulus strength (affecting the mean rate λ) in a fixed time interval, it indicates a power law dependency of the correlation function, A(λ, t), on λ (equation 2.14), suggesting that the assumption that the normalized correlation is a function of the number of events, λt, is a reasonable approximation. The best power law fit t^−0.2 to the autocorrelation function predicts, via the Wiener-Khinchin theorem, that the power spectrum will decay as a power law 1/f^0.8. Indeed, the power spectrum (see Fig. 7) does behave as 1/f^0.8 at low frequencies. Two implications of these results are worth noticing. First, these correlations decay slowly in time due to the small exponent of the power law, leading to long temporal fluctuations. Practically, this means that these point processes do not behave as a Poisson process, and averaging over spike trains of duration T will not increase the signal-to-noise ratio by √T. In the extreme case of ν = 2, the signal-to-noise ratio is independent of the duration of the signal. For the nervous system this would imply that there is no penalty in performing computations based on the first few tens of milliseconds of spikes, rather than averaging over much longer times. Tovee et al. (1993) find that 50 msec intervals from spike trains in visual cortex are sufficient to recover most of the information in the spike code, and that longer intervals carry little additional information. Second, the theory of renewal processes (Lowen and Teich 1993) outlined in Section 2 predicts an exponent of −0.3 for the autocorrelation, which was confirmed by numerical simulations using the interval distribution of Figure 8 as the source for a renewal process. Higher order correlations between spikes lead to a slightly slower decay of the true autocorrelation than expected for a renewal process. In fact, Lowen and Teich (1992) suggest that such higher order correlations between interspike intervals are sufficient to produce 1/f-type power spectra in recordings from the auditory nerve; these spike trains, however, are no longer fractal point processes. While the temporal range of correlations is limited by an upper cutoff of 300 msec, it is still remarkable that such a long time scale emerges from a system whose longest intrinsic time constant (the membrane time constant) is 20 msec. Changing model parameters can lead to an extended temporal range of correlation effects, accompanied by lower firing rates. The underlying ISI distribution that gives rise to the autocorrelation is clearly power law (see Fig. 8), even though no clear distinction between power law and exponential decay can be made for the autocorrelation itself, since the range is limited. The exact nature of correlations in the discharge patterns of various cortical cells in the behaving monkey
should be the subject of future experimental investigation, to test to what extent and over which time scales they show self-similar behavior. The existence of 1/f noise is generally considered a puzzle in most physical systems, which tend to exhibit temporal correlations on some characteristic time scale, leading to faster decay in the power spectrum. Recently a scenario leading to generic formation of 1/f noise was proposed via systems that exhibit self-organized criticality, such as sand piles and earthquakes (Bak et al. 1987; Olami et al. 1992). Poised at the brink of a dynamic phase transition point, self-organized critical systems can display many long-range behaviors. From the computational point of view, such systems have the advantage of very sensitive responses to small fluctuations. Our model is exquisitely sensitive to small local variations in the external input rate, a feature that is general to all models of self-organized criticality. The brain may employ such dynamic behavior to adapt to low signal-to-noise ratios of the afferent input to cortex. Whether the brain exhibits such self-organized criticality is an open question that should be explored in future research.

Appendix A: Fractal Dimension of a Point Process

The fractal dimension of a point process (dust) (Mandelbrot 1983) is defined through the number n(δ) of covering intervals of length δ needed to cover all the events occurring during a process of length T:

n(δ) ∝ δ^−D   (A.1)
The covering set n(δ) is obtained by the following procedure:

1. Divide a long process of length L into N(δ) = L/δ equal time intervals of length δ.

2. Define n(δ) as the number of intervals among the N(δ) that contain at least one event (i.e., the process is totally covered by n(δ) intervals of length δ).

3. The number of coverage intervals n(δ) is

n(δ) = (L/δ)[1 − P_empty(δ)]   (A.2)

where P_empty(δ) is the probability that a randomly chosen covering interval is empty.

4. P_empty(δ) can be calculated from the forward recurrence time distribution, F_w(x). (For a randomly chosen sampling point, the forward recurrence time w is the time duration between the sampling point and the first following event; F_w(x) is the probability distribution of w.)

P_empty(δ) = Prob(w > δ) = 1 − ∫_0^δ F_w(x) dx   (A.3)

(every coverage interval whose left edge has a forward recurrence time w > δ is necessarily empty).

5. For a general stationary point process, F_w(x) is (Cox and Lewis 1966)

F_w(x) = A R(x) = A ∫_x^∞ P(x') dx'   (A.4)

where P(x) is the interval distribution probability, A is a normalization factor for F_w, and R(x), the integral of P(x), is the survivor probability for x [Prob(T > x)].

In the following we calculate D for a point process characterized by an interval probability distribution, P(t), which decays as a power law with exponent 1 < γ < 2:

P(t) = B t^−γ

for A < t < T and zero otherwise (the cut-off T → ∞ is assumed for the normalization). The corresponding survivor function is

R(x) = 1 for x < A,  R(x) = (x/A)^(1−γ) for A < x < T   (A.5)

The normalized F_w (equation A.4) satisfies

∫_0^δ F_w(x) dx = (δ/T)^(2−γ)

From this, P_empty can be obtained (equation A.3):

P_empty(δ) = 1 − (δ/T)^(2−γ)

The number of coverage intervals n(δ) is (equation A.2)

n(δ) = (L/δ)(δ/T)^(2−γ)

In conjunction with equation A.1, this implies that the fractal dimension is D = γ − 1. Following the same steps, the dimension of a Poisson process (exponential interval distribution) can be shown to be D = 1.
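A direct numeric counterpart of this covering construction, assuming the δ values are chosen inside the scaling regime; the function name and interface are ours.

```python
import numpy as np

def covering_dimension(event_times, deltas):
    """Estimate D from n(delta) ~ delta^-D (equation A.1): count the
    intervals of length delta that contain at least one event."""
    t = np.asarray(event_times, float)
    n = [len(np.unique(np.floor(t / d).astype(int))) for d in deltas]
    # slope of log n(delta) against log delta is -D
    return -np.polyfit(np.log(deltas), np.log(n), 1)[0]

# A Poisson train returns D close to 1; an ISI power law with exponent
# gamma gives D = gamma - 1, as derived above.
```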
Appendix B: Relating the Variance to the Autocorrelation

The proof of equation 2.8 follows Cox and Lewis (1966). To determine the variance in the number of events in time T, partition the window T into n bins (of width one millisecond). Let X_i be the random variable describing the number of events in bin number i. Then the variance is

Var(T) = Var(X_1 + X_2 + ... + X_n)
       = E[(X_1 + X_2 + ... + X_n)²] − [E(X_1 + X_2 + ... + X_n)]²
       = Σ_{i=1}^{n} [E(X_i²) − E(X_i)²] + 2 Σ_{l=1}^{n−1} Σ_{k>l} [E(X_k X_l) − E(X_k) E(X_l)]

If the bin size is so small that the possibility of multiple spikes in the same bin is excluded, then the variance in the number of spikes in a bin is equal to the mean, Var(X_i) = E(X_i). Replacing Var(X_i) and changing the indices on the sum,

Var(T) = 2 Σ_{l=1}^{n−1} Σ_{k=1}^{n−l} Cov(X_k, X_{k+l}) + n E(X_i)
       = 2 Σ_{l=1}^{n−1} (n − l) Cov(X_i, X_{i+l}) + n E(X_i)

Since we are interested in the limit for which T >> 1 msec, the covariance will tend toward the autocorrelation function minus the chance level of correlation,

Cov(X_i, X_{i+τ}) → A(τ) − A(∞)

In this limit, one can replace the sum with an integral over the correlation. With n E(X_i) = N(T), we obtain

Var(T) = N(T) {1 + (2/T) ∫_0^T (T − τ)[A(τ) − A(∞)] dτ}

which is equation 2.8.
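A numeric sanity check of this identity under the derivation's assumptions (a stationary, 1-msec binned binary train with at most one spike per bin); here A(τ) is taken as the conditional spike probability τ bins after a spike and A(∞) as the chance level p, and the function name is ours.

```python
import numpy as np

def check_variance_formula(x, T=200, max_lag=1000):
    """Compare the direct variance of spike counts in windows of length
    T with N(T) * {1 + (2/T) sum_tau (T - tau)[A(tau) - A(inf)]}."""
    x = np.asarray(x, float)
    p = x.mean()                       # spike probability per bin
    A = np.array([np.dot(x[:-tau], x[tau:]) / x[:-tau].sum()
                  for tau in range(1, max_lag + 1)])
    taus = np.arange(1, T)
    predicted = p * T * (1.0 + (2.0 / T)
                         * np.sum((T - taus) * (A[taus - 1] - p)))
    counts = x[: len(x) // T * T].reshape(-1, T).sum(axis=1)
    return counts.var(), predicted    # should agree for long trains
```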
Acknowledgments

We thank William Softky for many invaluable comments and ideas regarding the origin and the function of the high variability in cortical cells. We are also indebted to Terry Sejnowski, Wyeth Bair, and Ernst Niebur for insightful discussions. Our research was supported by a Myron A. Bantrell Research Fellowship, the Howard Hughes Medical Institute, the National Science Foundation, the Office of Naval Research, and the Air Force Office of Scientific Research.
References
Abeles, M. 1982. Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci. 18, 83-92.
Abeles, M. 1991. Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge.
Amari, S.-I. 1977. Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern. 27, 77-87.
Amit, D. J., and Tsodyks, M. V. 1991a. Quantitative study of attractor neural network retrieving at low rates: 1. Substrate spikes, rates and neuronal gain. Network: Comp. Neural Syst. 2(3), 259-273.
Amit, D. J., and Tsodyks, M. V. 1991b. Quantitative study of attractor neural network retrieving at low rates: 2. Low-rate retrieval in symmetrical networks. Network: Comp. Neural Syst. 2(3), 275-294.
Asanuma, H., and Rosen, I. 1973. Spread of mono- and polysynaptic connections within cat's motor cortex. Exp. Brain Res. 16, 507-520.
Bair, W., Koch, C., Newsome, W., and Britten, K. 1994. Power spectrum analysis of MT neurons in the behaving monkey. J. Neurosci., in press.
Bak, P., Tang, C., and Wiesenfeld, K. 1987. Self-organized criticality: An explanation of 1/f noise. Phys. Rev. Lett. 59, 381-384.
Berman, N. J., Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1991. Mechanisms of inhibition in cat visual cortex. J. Physiol. 440, 697-722.
Bernander, Ö., Koch, C., and Usher, M. 1994. The effect of synchronized inputs at the single neuron level. Neural Comp. 6, 622-641.
Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.
Chernjavsky, A., and Moody, J. 1990. Spontaneous development of modularity in simple cortical models. Neural Comp. 2(3), 334-354.
Cowan, J. D. 1982. Spontaneous symmetry breaking in large scale nervous activity. Int. J. Quant. Chem. 22, 1059-1082.
Cox, D., and Lewis, P. A. W. 1966. The Statistical Analysis of Series of Events. Chapman and Hall, London.
Douglas, R. J., and Martin, K. A. C. 1990. Neocortex. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 389-438. Oxford Univ. Press, New York.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1988. Selective responses of visual cortical cells do not depend on shunting inhibition. Nature (London) 332(6165), 642-644.
Eckhorn, R., Frien, A., Bauer, R., Woelbern, T., and Harald, K. 1993. High frequency (60-90 Hz) oscillations in primary visual cortex of awake monkey. Neuroreport 4, 243-246.
Eeckman, F., and Freeman, W. 1990. Correlations between unit firing and EEG in the rat olfactory system. Brain Res. 528(2), 238-244.
Ermentrout, G. B., and Cowan, J. D. 1980. Large scale spatially organized activity in neural nets. SIAM J. Appl. Math. 38, 1-21.
Feder, J. 1988. Fractals. Plenum Press, New York.
Fetz, E., Toyama, K., and Smith, W. 1991. Synaptic interactions between cortical neurons. In Cerebral Cortex, Vol. 9, A. Peters and E. G. Jones, eds., pp. 1-48. Plenum Press, New York.
Gochin, P., Miller, E., Gross, C., and Gerstein, G. 1991. Functional interactions among neurons in inferior temporal cortex of the awake macaque. Exp. Brain Res. 84, 505-516.
Gray, C. M., Engel, A. K., König, P., and Singer, W. 1990. Stimulus dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci. 2, 607-619.
Grinvald, A. 1992. Optical imaging of architecture and function in the living brain sheds new light on cortical mechanisms underlying visual perception. Brain Topogr. 5(2), 71-75.
Grueneis, F., Nakao, M., and Yamamoto, M. 1990. Counting statistics of 1/f fluctuations in neuronal spike trains. Biol. Cybern. 62, 407-413.
Hess, R., Negishi, K., and Creutzfeldt, O. 1975. The horizontal spread of intracortical inhibition in the visual cortex. Exp. Brain Res. 22, 415-419.
Knight, B. 1972. Dynamics of encoding in a population of neurons. J. Gen. Physiol. 59, 734-766.
Koch, C., and Schuster, H. 1992. A simple network showing burst synchronization without frequency-locking. Neural Comp. 4, 211-223.
Komatsu, Y., Nakajima, S., Toyama, K., and Fetz, E. E. 1988. Intracortical connectivity revealed by spike-triggered averaging in slice preparations of cat visual cortex. Brain Res. 442(2), 359-362.
Kreiter, A. K., and Singer, W. 1992. Oscillatory neuronal responses in the visual cortex of the awake macaque monkey. Eur. J. Neurosci. 4, 369-375.
Lowen, S. B., and Teich, M. C. 1992. Auditory-nerve action potentials form a nonrenewal point process at short as well as long time scales. J. Acoust. Soc. Am. 92(2), 803-806.
Lowen, S. B., and Teich, M. C. 1993. Fractal renewal processes generate 1/f noise. Phys. Rev. E 47(2), 992-1001.
Mandelbrot, B. B. 1983. The Fractal Geometry of Nature. W. H. Freeman, New York.
Mason, A., Nicoll, A., and Stratford, K. 1991. Synaptic transmission between individual neurons of the rat visual cortex in vitro. J. Neurosci. 11, 72-84.
Nelson, J. I., Salin, P. A., Munk, M. H.-J., Arzi, M., and Bullier, J. 1992. Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Visual Neurosci. 9, 21-38.
Nelson, S. B. 1991. Temporal interactions in the cat visual system. III. Pharmacological studies of cortical suppression suggest a presynaptic mechanism. J. Neurosci. 11, 369-380.
Olami, Z., Feder, H. J. S., and Christensen, K. 1992. Self-organized criticality in a continuous, nonconservative cellular automaton modeling earthquakes. Phys. Rev. Lett. 68(8), 1244-1247.
Orbach, H. S., Felleman, D. J., Ribak, E. N., and Van Essen, D. C. 1992. Visualization of cortical connections with voltage sensitive dyes. In Analysis and Modeling of Neural Systems, F. H. Eeckman, ed., pp. 15-28. Kluwer Academic, Norwell, MA.
Network Amplification of Local Fluctuations
835
acteristic analysis of empirical spike-count distributions: Quantifying the ability of cochlear nucleus units to signal intensity changes. 1. Acoust. SOC. Am. 86, 2172-2184. Silva, L. R., and Connors, B. W. 1987. Spatial distribution of intrinsic cortical neurons that excite or inhibit layer II/III cells: A physiological study of the neocortex in vitro. Soc. Neurosci. Abst. 12, 1435. Snowden, R. J., Treue, S., and Andersen, R. A. 1992. The response of neurons in areas V1 and MT of alert rhesus monkey to moving dot patterns. Exp. Brain Res. 88, 389400. Softky, W. 1993. Sub-millisecond coincidence detection in active dendritic trees. Neuroscience 58(1), 13-21. Softky, W. R., and Koch, C. 1992. Cortical cells should fire regularly, but d o not. Neural Comp. 4(5), 643-645. Softky, W. R., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13(l), 334-350. Stevens, C. F. 1993. What form should a cortical theory take. In Large-Scale Neirronal Theories ofthe Brain, C. Koch and J. Davis, eds. MIT Press, Boston. Teich, M. C. 1989. Fractal character of the auditory neural spike train. IEEE Trans. Biomed. Eng. 36(1), 150-160. Teich, M. C. 1992. Fractal neuronal firing patterns. In Single Neuron Computation, T. McKenna, J. Davis, and S. F.Zornetzer, eds., pp. 589-625. Academic Press, San Diego, CA. Tolhurst, D., Movshon, J., and Dean, A. 1983. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res. 23, 775-785. Tovee, M. J., Rolls, E. T., and Treves, A. 1993. Information encoding and the responses of single neurons in the primate temporal visual-cortex. J. Neurophysiol. 70(2), 640-654. Toyama, K., Kimura, M., and Tanaka, K. 1981. Organization of cat visual cortex as investigated by cross-correlation technique. J. Neurophysiol. 46(2), 202214. Tsodyks, M., Mitkov, I., and Sompolinsky, H. 1993. Pattern of synchrony in inhomogeneous networks of oscillaters with pulse interactions. Phys. Rev. Lett. 71(8), 1280-1283. Tuckwell, H. C. 1988. Introduction to Theoretical Neurobiology. Cambridge University Press, New York. van Vreeswijk, C., and Abbott, L. F. 1993. Self-sustained firing in populations of integrate-and-fire neurons. SlAM J. Appl. Math. 53(1), 253-264. Vogels, R., Spileers, W., and Orban, G. A. 1989. The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res. 77,432436. Werner, G., and Mountcastle, V. B. 1963. The variability of central neural activity in a sensory system and its implications for the central reflection of sensory events. J. Neurophysiol. 26, 958-977. Willshaw, D., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R . Sac. London B 194, 431445. Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
836
M. Usher et al.
Wilson, H. R., and Cowan, J. D. 1973. A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. B i d . Cybern. 13, 55-80. Wilson, M. A., and Bower, J. M. 1991. A computer simulation of oscillatory behavior in primary visual cortex. Neural Cornp. 3, 498-509. Yamada, W. M., Koch, C., and Adams, P. R. 1989. Multiple channels and calcium dynamics. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 97-133. MIT Press, Cambridge, MA. Young, M., Tanaka, K., and Yamane, S. 1992. Oscillatory neuronal responses in the visual cortex of the monkey. 1.Neurophysiol. 67, 1464-1474. Zhou, Y., and Fuster, J. 1992. Unit discharge in monkey’s parietal cortex during perception and mnemonic retention of tactile features. SOC.Neurosci. Abstract 18(1), 706. Zipser, D., Kehoe, B., Littlewort, G., and Fuster, J. 1993. A spiking network model of short-term active memory. 1. Neurosci., in press. Received August 24, 1993; accepted November 5, 1993.
This article has been cited by: 1. Hideo Hasegawa. 2007. Generalized rate-code model for neuron ensembles with finite populations. Physical Review E 75:5. . [CrossRef] 2. A. N. Burkitt. 2006. A review of the integrate-and-fire neuron model: II. Inhomogeneous synaptic input and network properties. Biological Cybernetics 95:2, 97-112. [CrossRef] 3. A. N. Burkitt. 2006. A Review of the Integrate-and-fire Neuron Model: I. Homogeneous Synaptic Input. Biological Cybernetics 95:1, 1-19. [CrossRef] 4. G. S. Bhumbra, R. E. J. Dyball. 2005. Spike coding from the perspective of a neurone. Cognitive Processing 6:3, 157-176. [CrossRef] 5. Tim P. Vogels, Kanaka Rajan, L.F. Abbott. 2005. NEURAL NETWORK DYNAMICS. Annual Review of Neuroscience 28:1, 357-376. [CrossRef] 6. Naoki Masuda, Hiroyoshi Miwa, Norio Konno. 2005. Geographical threshold graphs with small-world and scale-free properties. Physical Review E 71:3. . [CrossRef] 7. B. Scott Jackson . 2004. Including Long-Range Dependence in Integrate-and-Fire Models of the High Interspike-Interval Variability of Cortical NeuronsIncluding Long-Range Dependence in Integrate-and-Fire Models of the High Interspike-Interval Variability of Cortical Neurons. Neural Computation 16:10, 2125-2195. [Abstract] [PDF] [PDF Plus] 8. George N. Reeke , Allan D. Coop . 2004. Estimating the Temporal Interval Entropy of Neuronal DischargeEstimating the Temporal Interval Entropy of Neuronal Discharge. Neural Computation 16:5, 941-970. [Abstract] [PDF] [PDF Plus] 9. Hideo Hasegawa. 2001. An Associative Memory of Hodgkin-Huxley Neuron Networks with Willshaw-Type Synaptic Couplings. Journal of the Physics Society Japan 70:7, 2210-2219. [CrossRef] 10. David L. Gilden. 2001. Cognitive emissions of 1/f noise. Psychological Review 108:1, 33-56. [CrossRef] 11. Marius Usher, James L. McClelland. 2001. The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review 108:3, 550-592. [CrossRef] 12. Giacomo Indiveri . 2000. Modeling Selective Attention Using a Neuromorphic Analog VLSI DeviceModeling Selective Attention Using a Neuromorphic Analog VLSI Device. Neural Computation 12:12, 2857-2880. [Abstract] [PDF] [PDF Plus] 13. P. C. Bressloff , N. W. Bressloff , J. D. Cowan . 2000. Dynamical Mechanism for Sharp Orientation Tuning in an Integrate-and-Fire Model of a Cortical HypercolumnDynamical Mechanism for Sharp Orientation Tuning in an
Integrate-and-Fire Model of a Cortical Hypercolumn. Neural Computation 12:11, 2473-2511. [Abstract] [PDF] [PDF Plus] 14. Jan Karbowski , Nancy Kopell . 2000. Multispikes and Synchronization in a Large Neural Network with Temporal DelaysMultispikes and Synchronization in a Large Neural Network with Temporal Delays. Neural Computation 12:7, 1573-1606. [Abstract] [PDF] [PDF Plus] 15. Paul C. Bressloff , S. Coombes . 2000. Dynamics of Strongly Coupled Spiking NeuronsDynamics of Strongly Coupled Spiking Neurons. Neural Computation 12:1, 91-129. [Abstract] [PDF] [PDF Plus] 16. Hideo Hasegawa. 2000. Responses of a Hodgkin-Huxley neuron to various types of spike-train inputs. Physical Review E 61:1, 718-726. [CrossRef] 17. Michael A. Kisley , George L. Gerstein . 1999. The Continuum of Operating Modes for a Passive Model NeuronThe Continuum of Operating Modes for a Passive Model Neuron. Neural Computation 11:5, 1139-1154. [Abstract] [PDF] [PDF Plus] 18. Stefano Fusi , Maurizio Mattia . 1999. Collective Behavior of Networks with Linear (VLSI) Integrate-and-Fire NeuronsCollective Behavior of Networks with Linear (VLSI) Integrate-and-Fire Neurons. Neural Computation 11:3, 633-652. [Abstract] [PDF] [PDF Plus] 19. Boris S. Gutkin , G. Bard Ermentrout . 1998. Dynamics of Membrane Excitability Determine Interspike Interval Variability: A Link Between Spike Generation Mechanisms and Cortical Spike Train StatisticsDynamics of Membrane Excitability Determine Interspike Interval Variability: A Link Between Spike Generation Mechanisms and Cortical Spike Train Statistics. Neural Computation 10:5, 1047-1065. [Abstract] [PDF] [PDF Plus] 20. Charles F. Stevens, Anthony M. Zador. 1998. Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience 1:3, 210-217. [CrossRef] 21. David Horn , Irit Opher . 1997. Solitary Waves of Integrate-and-Fire Neural FieldsSolitary Waves of Integrate-and-Fire Neural Fields. Neural Computation 9:8, 1677-1690. [Abstract] [PDF] [PDF Plus] 22. Todd W. Troyer, Kenneth D. Miller. 1997. Physiological Gain Leads to High ISI Variability in a Simple Model of a Cortical Regular Spiking CellPhysiological Gain Leads to High ISI Variability in a Simple Model of a Cortical Regular Spiking Cell. Neural Computation 9:5, 971-983. [Abstract] [PDF] [PDF Plus] 23. Guido Bugmann, Chris Christodoulou, John G. Taylor. 1997. Role of Temporal Integration and Fluctuation Detection in the Highly Irregular Firing of a Leaky Integrator Neuron Model with Partial ResetRole of Temporal Integration and Fluctuation Detection in the Highly Irregular Firing of a Leaky Integrator Neuron Model with Partial Reset. Neural Computation 9:5, 985-1000. [Abstract] [PDF] [PDF Plus] 24. D.L. Gilden. 1997. FLUCTUATIONS IN THE TIME REQUIRED FOR ELEMENTARY DECISIONS. Psychological Science 8:4, 296-301. [CrossRef]
25. Sean Hill, Alessandro Villa. 1997. Dynamic transitions in global network activity influenced by the balance of excitation and inhibition. Network: Computation in Neural Systems 8:2, 165-184. [CrossRef] 26. Paul Bressloff. 1996. New Mechanism for Neural Pattern Formation. Physical Review Letters 76:24, 4644-4647. [CrossRef] 27. D. Hansel, H. Sompolinsky. 1996. Chaos and synchrony in a model of a hypercolumn in visual cortex. Journal of Computational Neuroscience 3:1, 7-34. [CrossRef] 28. André Longtin , Karin Hinzer . 1996. Encoding with Bursting, Subthreshold Oscillations, and Noise in Mammalian Cold ReceptorsEncoding with Bursting, Subthreshold Oscillations, and Noise in Mammalian Cold Receptors. Neural Computation 8:2, 215-255. [Abstract] [PDF] [PDF Plus] 29. K. Schmoltzi, H. Schuster. 1995. Introducing a real time scale into the Bak-Sneppen model. Physical Review E 52:5, 5273-5280. [CrossRef] 30. Christof Koch, �jvind Bernander, Rodney J. Douglas. 1995. Do neurons have a voltage or a current threshold for action potential initiation?. Journal of Computational Neuroscience 2:1, 63-82. [CrossRef] 31. Marius Usher, Martin Stemmler. 1995. Dynamic Pattern Formation Leads to 1/ f Noise in Neural Populations. Physical Review Letters 74:2, 326-329. [CrossRef]
Communicated by Shun-ichi Amari
NOTE
Statistical Analysis of an Autoassociative Memory Network A. M. N. Fu School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
A statistical method for analyzing the dynamic behavior of a synchronous autoassociative memory network is presented based on an extension of Amari-Maginu theory. Through computer simulations it is shown that when the memorized pattern ratio is small (Y < 0.15) or large (r > 0.2), the derived theory shows good agreement.
Various methods based on statistical theories have been developed to analyze the dynamics of neural networks (see, e.g., Amari and Maginu 1988; Patrick and Zagrebnov 1991; Nishimori and Ozeki 1993). In all these studies a common assumption is that the main overlap A ( t ) is taken to be a deterministic function. Our computer simulations have shown that the variance of the main overlap is not close to zero except for a small pattern ratio ( r < 0.15) (see Fu 1993) and therefore, the above assumption is valid only for a small pattern ratio. A statistical method based on an extension of Amari and Maginu theory (1988) for analyzing the dynamics of a synchronous autoassociative memory network composed of n neurons will be presented, in which the effect of the fluctuation of A ( t ) is taken into account. The current state of the system is represented by X(t) = [Xl(t),. . . X,(t)] and the next state of the system is X(t 1)whose ith component is
+
X ; ( t + 1) = Sgn
1:C~;jXj(f) ' J
= Sgn
+
[s,'A(t) N ; ( t ) ]
where
wii
=
0,
i = 1!2, . . . , n
lfl
A(t)= - Cs;Xj(t)
'
j=1
is the main overlap and S' is the target memorized pattern; Neural Computation 6, 837-841 (1994) @ 1994 Massachusetts Institute of Technology
A. M. N. Fu
838
1 " Ni(t) = -
1C sps;lX,(t)
(1.4)
n#l j # l , i
denotes the noise. Here sp, i = 1,2,. . . , n, randomly with probability distribution
(Y =
2 , . . . nr, are generated ~
If A ( t ) and N , ( t ) ( i = 1,2,. . . , n ) are independent random variables, then we can state the following theorems: Theorem A. lf(i) N i ( t )isa normal random variable with mean p N ( f ) and variance $,(t), (ii) A ( f )is a normal random variable with mean a ( t ) and variance o i ( t ) , and (iii) P [ A ( t )= a ( t ) ] = 1, then the governing equationsfor the macroscopic state of the system are given by a(t
+ 1)
=
@[T/(t))
nk(t
+ 1)
=
r + 4p [U(t)]' + 4rp [a(!)] u ( t ) a ( t + 1)
(1.7)
pN(t
+ 1)
=
2rn~'(t)a(t)p[-~/(t)]
(1.8)
- @(-rj(f)]
(1.6)
where
@(Y)
-1' J27; 1
=
-m
p(x)dx.
The assumption (i) in Theorem A is different from that in Theorem 3 of Amari and Maginu (1988). When p N ( f ) = 0 Theorem A reduces to Theorem 2 in Amari and Maginu (1988), therefore Theorem A is a generalization of Theorem 2. Note that the condition p N ( f ) = 0 is satisfied only for a small pattern system with an initial expectation of the main overlap larger than a certain critical value (see Fu 1993), hence Theorem A reduces to Theorem 2 only in this case. In fact, for a system with a large pattern ratio ( r > 0.2) the theoretical analysis based on Theorem A is good agreement with the computer simulation, but that based on Theorem 2 is not (see Fig. lc). The reason is that / i N ( f ) # 0. Note that the assumption (iii) in Theorem A will not necessarily hold and a more general case has to be considered. Theorem B. lfassumptions (i) and (ii) in Theorem A are satisfied and (iii) N , ( t ) ( i = 1.2, . . . n ) are independent identical random variables, then the governing ~
Autoassociative Memory Network
839
The behaviour of Theorem A t
I
'
1
'
I
'
1
'
I
'
I
00 2
b
4
densltytovsrlap)Sx
0
Time
d
Figure 1: Main overlap and its density.
equationsfor the macroscopic state of the system are
(1.15)
A. M. N. Fu
840
In Figure la the behavior of Theorem A illustrates the limit expectation of the main overlap' as a function of pattern ratio r. The graph shows that there is a critical value r, of r (r, z 0.16), which is close to the value obtained by Hopfield (1982),Amit (1985),Amari and Maginu (1988),and Zagrebnov and Chvyrov (1989). In Figure l b and lc A ( t ) is shown resulting from the theoretical analysis and computer simulation of the system with [n = 400, a(0) = 0.7, r = 0.21 and [n = 400, a(0) = 0.7, r = 0.31, respectively. The crosses and solid line denote the 50 simulations of A ( t ) and their expectation, respectively. The other curves are the results of Theorems A, B, 2, and 3, where Theorems 2 and 3 were given in Amari and Maginu (1988). Computer simulations have shown that A ( t ) is approximately a normal random variable at the earlier several steps and the noise is approximately normal up to third step for a large pattern ratio, but not for an intermediate pattern ratio (0.15 5 r 5 0.2) (see Fu 1993). In Figure Id the density of A ( t ) for the system [ n = 400, r = 0.2, a ( 0 ) = 0.71 at the third step is shown not to be normal. Hence, the assumption (i) and (ii) in Theorem A and B are approximately satisfied (at earlier several steps) for a large r but not for an intermediate r system. Moreover, computer simulation also confirms that there is a strong dependence between A ( t ) and N ; ( t ) in the system with an intermediate pattern ratio, but a weak dependence for a large r system (see Fu 1993). Hence, the assumption of A ( t )and N ; ( t )to be independent is not satisfied for an intermediate r system, but approximately satisfied for a large r system. It is clear why Theorems A and B are efficacious for a large r system (see Fig. lc), but not for an intermediate r system (see Fig. lb).
Acknowledgments I would like to thank Professor J. Robinson and Dr. W. G. Gibson for introducing me to this topic and also for their advice and encouragement. I am also grateful to Dr. R. Poznanski for reading the original draft of this paper. References Amari, S., and K. Maginu. 1988. Statisticalneurodynamics of associative memory. Neural Networks 1, 63-73. Amit, D. J., Guttreund, H., and Sompolinsky, H. 1985. Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett. 55,
1530-1533. 'The limit expectation of main overlap is defined as a, = limt-.m E [A(t)]for a given r and when the expectation of the initial main overlap is larger than a certain critical value.
Autoassociative Memory Network
841
Fu, A. M. N. 1993. Statistical analysis of an autoassociative memory network. Internal Report 93-12, School of Mathematics and Statistics, The University of Sydney. Hopfield, J. J. 1982. Neural networks and physical systems with emergent collection computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558. Nishimori, H., and Ozaki, T. 1993. Retrieval dynamics of associative memory of the Hopfield type. I. Phys. A: Math. Gen. 26, 859471. Patrick, A. E., and Zagrebnov, V. A. 1991a. On the parallel dynamics for the Little-Hopfield model. 1. S. Phys. 63(1/2), 59-71. Patrick, A. E., and Zagrebnov, V. A. 1991b. A probabilistic approach to parallel dynamics for the Little -Hopfield model. I. Phys. A:Math. Gen. 24, 3413-3426. Zagrebnov, V. A., and Chvyrov, A. S. 1989. The Little-Hopfield model: Recurrence relations for retrieval-pattern errors. Sou. Phys.-IETP 68, 153-157.
Received April 22, 1993; accepted November 4, 1993.
Communicated by Stephen Judd
Loading Deep Networks Is Hard Jifi &ma Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod vodhenskou vP2i 2,182 07 Praha 8, Czech Republic
The loading problem formulated by J. S. Judd seems to be a relevant model for supervised connectionist learning of the feedforward networks from the complexity point of view. It is known that loading general network architectures is NP-complete (intractable) when the (training) tasks are also general. Many strong restrictions on architectural design andor on the tasks do not help to avoid the intractability of loading. Judd concentrated on the width expanding architectures with constant depth and found a polynomial time algorithm for loading restricted shallow architectures. He suppressed the effect of depth on loading complexity and left as an open prototypical computational problem the loading of easy regular triangular architectures that might capture the crux of depth difficulties. We have proven this problem to be NP-complete. This result does not give much hope for the existence of an efficient algorithm for loading deep networks. 1 The Relevance of the Loading Problem
Research of the neural network models for the computational exploitation has been rapidly developed in the last years. This fact is confirmed by the existence of many journals and conferences concerning this topic. One of the basic reasons for this exciting activity in this area is the successful practical use of the learning algorithm backpropagation (Rumelhart et al. 1986) for the multilayered neural network, and it is not by chance that this is the most widely applied neural network architecture. The defect of this algorithm, likewise of many others, is that it is very time consuming and learning larger networks can be intractable. The effort to speed up this algorithm has brought only limited results. Difficult problems cannot be solved easily. The task of theoretical computer scientists is to discover intrinsic causes of these difficulties. The interpretation and the explanation of such negative results are ungrateful and unpopular tasks but it is very needed. It can help in understanding learning problems and in proposing new techniques and methods for its solution. Rather than testing improved neural learning algorithms on XOR or parity tasks, the adequate tool for exploring the above-mentioned issues Neirrnl Computation 6, 842-850 (1994)
@ 1994 Massachusetts Institute of Technology
Loading Deep Networks
843
is the theory of complexity. From this point of view the practically solvable problems are those for which there is a polynomial time algorithm. On the other hand, so-called NP-complete problems are generally considered to be practically unsolvable. This approach leads to Valiant’s theory of the learnable (Blumer et al. 1986; Kearns et al. 1987; Pitt and Valiant 1988; Svaytser 1989; Valiant 1984, 1985; Wiedermann 1991). This theory takes into consideration the complexity issues of learning and generalization from examples, from the general artificial intelligence point of view. Our contribution issues from Judd’s work (1990), which formulated the loading problem as a computational decision problem, to be a relevant model for supervised connectionist learning of the feedforward neural networks from the complexity point of view. Judd presented many trustworthy arguments for the convenience of this model. The polemic of the relevance of the loading problem in the book review by Baum (1991) is also very inspiring. In our opinion his most relevant argument is that in contradistinction to the loading problem, in application of neural networks we typically have control over the net we are loading. Thus, we can choose one that makes our problem easy. If we want to formalize the classical connectionist learning from the complexity point of view we must somehow qualify the set of possible network architectures from which one is chosen to be loaded. In the opposite case, when the network architecture can be arbitrary, then it would be very simple to create a sufficiently large architecture with an appropriate configuration (synaptic weights) that performs the required task. But in this case the generalization does not make good sense, and the classical von Neuman’s computer architecture is more convenient for the memorization task. So, we can generalize the loading problem for the set of possible network architectures from which one is chosen to be loaded. But the original loading problem is a special case of this generalized loading problem for the one-element set of the network architectures. Therefore the generalized loading problem must be at least as hard as the original one. This consideration weakens Baum’s argument. Judd proved the general loading problem to be NP-complete and added many negative corollaries concerning the special cases of this problem, with strong restrictions on architectural design and/or on the (training) tasks that do not help to avoid the intractability of loading. Then he concentrated on the width expanding network architectures with constant depth and found a polynomial time algorithm for loading strongly restricted shallow architectures. His algorithm suppressed the effect of depth on loading complexity using shallow architectures that have a bounded size (by a constant) of the set of all partial configurations (i.e., assignments of functions to each computational node in the network that can potentially affect the given output). The possibility of there being an efficient method to search for correct partial configuration was completely ignored. In spite of this Judd stated the importance
Jiri %ma
844
t
Figure 1: An example of the triangular architecture ( d = 7). of the "depth issue" for loading complexity. He offered an open prototypical computational problem of the loading of easy regular triangular architectures that might capture the crux of depth difficulties. We prove this problem to be NP-complete. This result does not give much hope for the existence of an efficient algorithm for loading deep networks. 2 The Loading Problem for Triangular Architectures
Here we remember the definition of the loading problem for triangular architectures by Judd. The instance of this problem (i.e., the input for the loading problem) consists of an integer d and a task T . The integer gives the depth (and width) of an architecture depicted in Figure 1. The nodes are arranged in a triangle with each layer having one fewer node than the previous layer until the last layer has one node in it. Each node receives an input from both of the two below it. The task is a
Loading Deep Networks
845
collection of items, each with d + 1 stimulus binary bits and 1 response binary bit (i.e., T = { ( o , , p , ) / cE~ ,(0, p, E {O,l}, i = 1 , .. . , T } ) . The question of the decision form of the loading problem (i.e., the yes/no output of the loading algorithm) follows: Is there a configuration for this architecture that performs the given task? The configuration of a network is a list of functions corresponding one to one with the set of nodes of the net, meaning that the node computes the corresponding function. The configured architecture performs the task if the network function, composed of the node functions according to the configuration, computes for every item stimulus of the task the output value that agrees with the response of this item. In our case we say that the instance ( d . T) is performable if there is a relevant configuration. The node functions are taken from the so-called nodefunction set. Typically, connectionists have used the set of linearly separable functions. Due to the simplicity we consider for example NAOFns = {OR, NAND} as the node function set. Then we generalize our result for nontrivial subset of symmetric and monotone functions. More exact, we consider the node function sets .F that satisfy the following conditions: (i) {OR, NAND} C F or {AND. NOR} C F (ii)
F L {0,1,OR, AND. NOR. NAND}.
In our case for two-input gates it hardly matters. Moreover, Judd proved many theorems indicating that the difficulty in the loading problem has very little to do with the choice of node function sets (e.g., the loading problem is NP-complete also for the real-valued logistic-linear functions used in the backpropagation learning algorithm). Meanwhile, our result, like that of Blum and Rivest (1992),depends on this choice and it should be generalized for more complex node function sets. 3 The Intractability of Loading Triangular Architectures
In this section we will prove the main result implying the intractability of loading triangular architectures. For this purpose we will employ the technique of polynomial time reduction from satisfiability problem (Balcazar et al. 1988). An instance ( Z ,C) of the general satisfiability problem (SAT) is an expression in Boolean variables from Z given in conjunctive normal form (i.e., a conjunction of disjunctions) in which all disjunctions from C (called clauses) consist of literals. A literal is a logical variable or its negation. With respect to this we will distinguish the positive and the negative literal. Without loss of generality any clause will not contain two literals derived from the same variable. We will use the modified satisfability problem (MSAT) of which any instance ( Z ,C) satisfies the following condition. We consider a fixed ordering of the set of variables Z according to which the literals in clauses are ordered. Then the positive and negative literals alternate periodically in the clauses of an
Jifi %ma
846
instance (Z,C) of MSAT. It means that the positive literal always follows after the negative one and conversely. The instance (Z,C) is said to be satisfiable if the variables can all be given values such that the whole logical expression is true. First, we prove the following lemma concerning the complexity of the decision problem of whether the instance (Z,C) of MSAT is satisfiable. Lemma. MSAT is NP-complete.
Proof. We outline the basic idea of the proof. For every instance (Z,C) of SAT we construct in a polynomial time the instance (Z’,C’) of MSAT so that both logical expressions are equivalent. This means that (Z. C) is satisfiable iff (Z’,C’) is satisfiable. Suppose that there is a clause from C in which the positive literal r ~ , follows after the positive literal m,,i < j (similarly for negative literals), that is, . . . (rfV (r, . . . . In this case we add a new variable (1, to Z. We insert (kk in the given ordering of Z between the variables derived from these literals, that is, i < k < j . We put the negative literal - m k derived from this new variable into the considered clause, that is, . . . o, v V ( t , . . .. To preserve the satisfiability of the newly created logical expression we must add one more clause to C, which contains only one positive literal ( t k derived from the new variable. This is sufficient to prevent the influence of this variable on the original clause. Repeating this process we obtain the equivalent instance (Z’,C’) of MSAT from the instance of SAT in a polynomial time. 70,
Now, we are ready to prove the following theorem: Theorem. The loading problemfor triangular architectures (LTA)is NP-complete for the nodefunction set NAOFs = {OR, NAND}. Proof. First, we prove this loading problem to be NP-hard by polynomial time reduction from MSAT, which is NP-complete according to the previous lemma. Let (Z,C) be an instance of MSAT where Z = {o,,(t2, . . . ,(Y,,} is the set of n Boolean variables and C = {C,,C2, . . . ,C,} is the set of k clauses. For any clause C, E C we will use the following simplified notation omitting the index j of the clause
where 1 I il < i2 < . . . < i,, 5 n and L ( q P ) p, = 1,.. . , m, is the literal ). remember that for any derived from the variable alp(i.e., alpor y,We L(ofp+,) is positive and the other one possible p one from the literals L(a,,,)? is negative because it is an MSAT instance. For (Z, C) to be satisfiable, there must be a n assignment n: Z 4{0,1} such that at least one literal in each clause has value 1. For any instance (Z, C) of MSAT we construct the corresponding instance ( d , T ) for LTA in a polynomial time with respect to the size of (Z,C) so that (Z,C) is satisfiable iff ( d , T ) is performable. The formal
Loading Deep Networks
a47
construction follows: Put d = 2n - 1 and T = TOLJ Tl where
To Ti
= {([101",0)1, = { ( ~ , , l ) /Eg {O,l}d+l, ~
j = 1,.. .k}
The item stimuli uI of the partial task TI are constructed gradually according to the form of the clause C, E C. Due to the notation simplicity we again omit the index j of the clause.
I.
a. L(a,,)= a,, go = [10]11-1 b. L(a,,)= ~ a , , go
=
[Ol]J1-1
11. p = 1 , 2,.. . ,m - 1
a. L(aJp) = aJpLe., L(aIp+l)= ~ gP = g P - 1
~fp+ll
.1l[ol]lPtl-lP--l
b. L(a,,) = -aIPhe., L(aIp+,)= ~ , p + , l gP
111.
= gP-1
. oo[~o]~P+l-JP-~
a. L(cr,,) = afm 0I -- 0 - 1
b. L(aI,)= 0I -- a m - '
.ll[Ol]n-lm XI,^
. oO[lo]n-1m
+
The task size is IT[ = k 1 so this construction can be done in a polynomial time with respect to the size of the instance (Z, C). To make the preceding formal notation clearer we present an easy example of the above-mentioned construction. Let us consider the instance (Z, C) of MSAT where
z= { f l l r a 2 . @ 3 , a 4 } #
c1 = {a1
3
7a3},
c = {cl,cZ}? c 2=
{-%
a3, l a 4 1
so n = 4, k = 2. Then the corresponding instance ( d ,T) of LTA will take the following form: d = 7 : (10101010,0) T1 : (11010010,1) (01001100,1) 7-0
The semantics of this construction will be clear from the following. We will denote by
. . ,fn-l,gn-l,fnE NAOFns = {OR, NAND}
fl,g1,f2,g2,.
Jifi Sfma
848
the relevant node functions for the first-layer nodes in the triangular architecture determined by d = 2n - 1. For t = (s,r) E T denote by v(t) E { 0 , q d
v ( t )= (fi
S2)rgl(S21 S3)r
. . . 1gn-l (Sd-1,
Sd)rfn(Sd, sd+l))
the vector composed of the first-layer node function values and let vo = v ( t ) ,t E To. First, we prove that if ( d , T ) is performable then (Z,C) is satisfiable. Assume that the triangular architecture determined by d performs the task T = To U TI. It means that there is a relevant configuration for this architecture, especially the node functions for the first-layer nodes. It must hold vo # v(t) for all t E TI because the relevant resulting responses of the item of TO and the items of TI are different. We know that for both functions f E NAOFns = {OR, NAND} f ( 0 , l ) is equal to f ( 1 , O ) . Therefore by comparing TO and TI we can see that the node functions gl. i = 1,. . . , n - 1 do not help to differentiate between '00 and v(t), t E TI. Coming back to the above-mentioned example of the construction, we ask when vo # v ( t ) for all t E TI? It is just in the case when the following condition is satisfied: Ifi(1,O) # f i ( l , l ) or f3(1,0) #f3(0,0)]
and [f2(lr
0) #f2(03 0) Or
f3(I1 0)
#f3(1,1)
or
f4(lr
0) #f4(0, O ) ]
that is, (fi = NAND or
f3
$ NAND)
f3
=
and (f2
+
NAND or
NAND or
f4
f NAND)
This corresponds with the semantics of the original instance (Z,C) of MSAT when a, meansfi E NAND. Generally, we define an assignment
n:z
--$
{0,1}
II((tl) = 1 iffi = NAND 0 otherwise, i = 1,. . . , n
which confirms the satisfiability of (Z, C). Conversely, we prove that if (Z, C) is satisfiable then ( d , T) is performable. Assume that the instance (Z,C) is satisfiable, which is confirmed by an assignment II: Z ( 0 , l ) . We choose the following firstlayer node functions
fi =
NAND OR
g i = NAND
if II(a,)= 1 ifII(a,)=O, i = l , . . . ,n - l
i = l , . . . ,n
Loading Deep Networks
849
All second-layer node functions be NAND functions and all node functions in remaining layers be OR functions. We check this configuration of the triangular architecture determined by d to perform the task T = TOUTI. Using the preceding consideration we obtain v0 # D(t ) for all t E T1 because (Z, C) is satisfiable. In this case we know that DO = Id and the vector composed of the second-layer node function values is Od-' because of the NAND functions so the relevant resulting response for t E To will be 0 due to the OR functions in the remaining layers. Because DO differs from v(t),t E T I ,that contains at least one 0 the resulting response for t E T1will be 1 . Then ( d ,T ) is performable. This completes the proof that LTA is NP-hard. Finally, it must be demonstrated that LTA E NP, that is, that there is a nondeterministic machine that can decide the LTA problem in time polynomial in the length of ( d , T ) . Writing a complete configuration of NAOFns takes at most one bit for each node in the triangular architecture determined by d . Whether the configuration is correct can be checked by evaluating each node function once for each item in T; this takes O ( d ( d 1 ) / 2 . \TI), which is polynomial as desired. This completes the proof of the whole 0 theorem.
+
Corollary. LTA is NP-completealsofor the nodefunction sets Fof symmetric and monotonefunctions (i.e.,F C {0,1,OR, AND, NOR, NAND}) containingat least {OR, NAND} or {AND, NOR}functions. Proof. We outline the generalization of the proof of the preceding theorem. In the case when {AND, NOR} G F we modify the construction of the instance ( d , T ) in the way that we exchange the resulting response bits of the partial tasks TO,TI (i.e., TO= {([lo]",l)}, TI = { ( g j , O ) / g j E (0, l}df', j = 1,. . . ,k}). In this case we replace NAND by AND and OR by NOR in the rest of the proof. The part of the proof that (Z,C) is satisfiable if ( d , T ) is performable requires the node function set F containing only symmetric If(0,l) = f ( l , O ) , f E F]and monotone [ f ( O , O ) # f(1,l)for nonconstant f E F] functions. For such a general F we define the assignment II : Z {0,1} --f
II(ai) = 1 i f j = NAND or$ = AND 0 otherwise, i = 1,.. . ,n
0
References Balcdzar, J. L., Diaz, J., and Garabarr6, J. 1988. Structural Complexity I . SpringerVerlag, Berlin. Baum, E. B. 1991. Book reviews. IEEE Trans. Neural Networks 2, 181-182. Blum, A,, and Rivest, R. L. 1992. Training a 3-node neural network is NPcomplete. Neural Networks 5, 117-127.
850
Jifi %ma
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1986. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. Proc. 28th Annu. STOC, New York, 273-282. Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. The MIT Press, Cambridge. Kearns, M., Li, M., Pitt, L., and Valiant, L. G. 1987. On the learnability of Boolean formulae. Proc. 29th STOC, 285-295. Pitt, L., and Valiant, L. G. 1988. Computational limitations on learning from examples. 1.ACM 35, 965-984. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536. Svaytser, H. 1989. Even simple neural nets cannot be trained reliably with a polynomial number of examples. Proc. IICNN, Washington, 11, 141-145. Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142. Valiant, L. G. 1985. Learning disjunctions of cojunctions. Proc. 9th ZJCAI, Los Angeles, 560-566. Wiedermann, J. 1991. Some afterthoughts on computational learning theory. Proc. SOFSEM, Jasnd pod Chopkom (Slovakia), 271-274.
Received January 4, 1993; accepted October 6, 1993.
This article has been cited by: 2. David Windisch . 2005. Loading Deep Networks Is Hard: The Pyramidal CaseLoading Deep Networks Is Hard: The Pyramidal Case. Neural Computation 17:2, 487-502. [Abstract] [PDF] [PDF Plus] 3. Jiří Šíma . 2002. Training a Single Sigmoidal Neuron Is HardTraining a Single Sigmoidal Neuron Is Hard. Neural Computation 14:11, 2709-2728. [Abstract] [PDF] [PDF Plus] 4. Paul E. Utgoff , David J. Stracuzzi . 2002. Many-Layered LearningMany-Layered Learning. Neural Computation 14:10, 2497-2529. [Abstract] [PDF] [PDF Plus] 5. Y. Takahashi. 2000. A mathematical solution to a network construction problem. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 47:2, 166-184. [CrossRef] 6. Tin-Yau Kwok, Dit-Yan Yeung. 1997. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks 8:3, 630-645. [CrossRef]
Communicated by Eric Baum
Measuring the VC-Dimension of a Learning Machine Vladimir Vapnik Esther Levin Yann Le Cun ATbT Bell Laboratories, Holmdel, N / 07733 USA
A method for measuring the capacity of learning machines is described. The method is based on fitting a theoretically derived function to empirical measurements of the maximal difference between the error rates on two separate data sets of varying sizes. Experimental measurements of the capacity of various types of linear classifiers are presented. 1 Introduction
Many theoretical and experimental studies have shown the influence of the capacity of a learning machine on its generalization ability (Vapnik 1982; Baum and Haussler 1989; Le Cun et al. 1990; Weigend et al. 1991; Guyon et al. 1992; Abu-Mostafa 1993). Learning machines with a small capacity may not require large training sets to approach the best possible solution (lowest error rate on test sets). High-capacity learning machines, on the other hand, may provide better asymptotical solutions (i.e., lower test error rate for very large training sets), but may require large amounts of training data to reach acceptable test performance. For a given training set size, the difference between the training error and the test error will be larger for high-capacity machines. The theory of learning based on the VC-dimension predicts that the behavior of the difference between training error and test error as a function of the training set size is characterized by a single quantity-the VC-dimension-which characterizes the machine’s capacity (Vapnik 1982). In this paper, we introduce an empirical method for measuring the capacity of a learning machine. The method is based on a formula for the maximum deviation between the frequency of errors produced by the machine on two separate data sets, as a function of the capacity of the machine and the size of the data sets. The main idea is that the capacity of a learning machine can be measured by finding the capacity that produces the best fit between the formula and a set of experimental measurements of the frequency of errors on data sets of varying sizes. In the paradigm of learning from examples, the learning machine must learn to approximate as well as possible an unknown target rule, Neural Computation 6, 851-876 (1994)
@ 1994 Massachusetts Institute of Technology
V. Vapnik, E. Levin, and Y. Le Cun
852
or input-output relation, given a training set of labeled examples. The 1 input-output pairs composing the training set (x,w), x E X c R", w E {0,1}, (XI,~I),...r(X',~')
are assumed to be drawn independently from an unknown distribution function P(x,w )= P(w 1 x)P(x). Here P(x) describes the region of interest in the input space, and the distribution P(w I x) describes the target input-output relation. The learning machine is characterized by the set of binary classification functions f ( x , a ) , CY E A ((1 is a parameter that specifies the function, and A is the set of all admissible parameters) that it can realize (indicator functions). The goal of learning is to choose a function f ( x , a') within that set that minimizes the probability of error, that is, the probability of disagreement between the value of w and the output of the learning machinef(x, a ) ~ ( ( 1= )
Elw -f(x,
.)I
where the expectation is taken with respect to the probability distribution P ( x . w). The problem is that this distribution is unknown, and the only way to assess p ( a ) is through the frequency of errors computed on the training set vda) =
1 '
1
Id, -f(x,,a)I
I=1
Many learning algorithm are based on the so-called "principle of empirical risk minimization," which consists in picking the function f ( x , a,) that minimizes the number of errors on the training set. In Vapnik (1982) it is shown that for algorithms that minimize the empirical risk, the knowledge of three quantities allows us to establish an upper bound on the probability of error p ( a ) .The first quantity is the capacity of the learning machine, as measured by the VC-dimension h of the set of functions it can implement. The second one is the frequency of errors on the training set (empirical risk) .(a). The third one is the size of the training set 1. With probability 1 - 71, the bound
p ( 0 ) 5 4 c r ) + D[Lh, ~ ( a 711) ,
(1.1)
is simultaneously valid for all a E A [including the CYI that minimizes the empirical risk ~ ( a ) The ] . function D is of the form
D[I,h,4 L . = C
771
+
h[ln(2l/h)21 11 - l n q
{j
=
=
+
+ c[h[ln(21/h)+ 11- 1nv1 I} (1.2)
VC-Dimension of a Learning Machine
853
where c is a universal constant less than 1. It can be interpreted as a confidence interval on the difference between the training error and the "true" error. There are two regimes in which the behavior of the function D simplifies. The first regime is when the training error happens to be small (large capacity, or small training set size, or easy task), then D can be approximated by h(ln(2llh)+ 1) - lnq 1
W , h ,4 Q I ) , 7 7 1
When the training error is large (near 1/2), D can be approximated by
D[Lh, 4w), a]
N
/
+
h(ln(2llh) 1)- lna 1
Unfortunately, theoretical estimates of the VC-dimension have been obtained for only a handful of simple classes of functions, most notably the class of linear discriminant functions. The set of linear discriminant functions is defined by
f ( x , a ) = e [ ( x . a ) + aO1 where ( x . a ) is dot-product of the vectors x and a, and 0 is the threshold function: O(u) =
ifu>O otherwise
1 0
The VC-dimension of the set of linear discriminant functions with n inputs is equal to n 1 (Vapnik 1982). Attempts to obtain the exact value of the VC-dimension for other classes of decision rules encountered substantial difficulties. Most of the estimated values appear to exceed the real one by a large amount. Such poor estimates result in rather crudely overestimated evaluations of the confidence interval D. This article considers the possibility of empirically estimating the VCdimension of a learning machine. The idea that this might be possible was inspired by the following observations. It was shown in Vapnik and Chervonenkis (1989) that a learning algorithm that minimizes the empirical risk (i.e., minimizes the error on the training set) will be consistent' if and only if the following one-sided uniform convergence condition holds:
+
'A learning algorithm is said to be consistent if the probability of error on the training set converges to the true probability of error when the size of the training set goes to infinity.
V. Vapnik, E. Levin, and Y. Le Cun
854
In other words the one-sided uniform convergence (within a given set of function) of frequencies to probabilities is a necessary and sufficient condition for the consistency of the learning process. In the 1930s Kolmogorov and Smirnov found the distribution law for the maximal deviation between a distribution function and an empirical distribution function for any random variable. This result can be formulated as follows. For the set of functions
f.(x, a ) = B(x - a ) ,
aE
(-00,
m)
the equality
{
m
P sup[~,(ru) - u.(a)l> & }
= exp{-2€21)
-
2~(-1)"exp{-~n21) n=2
LlEA
holds for sufficiently large I, where p*(a) = Ef,(a) and v,(a) = 1/l C!=,f+(xi, a ) . This equality is independent of the probability measure on x. Note that the second term is very small compared to the first one. In Vapnik and Chervonenkis (1971) it was shown that for any set of indicator functionsf(x, a ) ,(Y E A with finite VC-dimension h, the rate of uniform convergence is bounded as follows
where c1 and c2 are universal constants such that c1 5 1 and c2 > 1/4. As above, this inequality is independent of the probability measure. A direct consequence is that for any given 6 there exists an 10 = lo(h,6, E ) such that for any 1 > lo, the inequality
{
}
P s u p [ ~ ( o-) .(a)] > E < exp{-(c2 OEA
-~)E21)
holds. In Vapnik and Chervonenkis (1971) it was shown that c2 is at least 1/4. The result was later improved to c2 = 2 by Devroye (1982). Interestingly, for c2 = 2, the above inequality is close to the KolmogorovSmimov equality obtained for a simple set of functions. This means that, although the above is an upper bound, it is asymptotically close to the exact value. Tightening the upper bound on the value of c1 is theoretically very difficult, because the term it multiplies is not the main term in the exponential for large values of llh. Now, suppose that there exists a value for the constant cl, 0 < c1 L 1, for which the above upper bound (which is independent of the probability measure) is tight. Furthermore, suppose that the bound is tight not only for large numbers of observations, but also for smaller numbers (statisticians commonly use the asymptotic Kolmogorov-Smirnov test starting with 1 = 30 and the learning processes usually involves sev-
VC-Dimension of a Learning Machine
855
era1 hundreds of observations). In this case, one can expect the function @*(l/h)defined by
to be independent of the probability measure, even for rather small values of llh. Now, assuming we know a functional form for @*, then we can experimentally estimate the expected maximal deviation between the empirical risk and expected risk for various values of I , and measure h by fitting @* to the measurements. The remainder of the paper is concerned with finding an appropriate functional form for @*, and applying it to measuring the VC-dimension of various learning machines. In practice, rather than measuring the maximum difference between the training set error and the error on an infinite test set, it is more convenient to measure the maximum difference between the error rates measured on two separate sets
where v I ( a )and y ( o )are frequencies of error calculated on two different samples of size 1. We will introduce an approximate functional form for the right-hand side @ ( I / h )of the relation
which we will use for a wide range of values of llh. To construct this approximation we will determine in Section 2 two different bounds for E
V. Vapnik, E. Levin, and Y. Le Cun
856
2 Estimation of the Maximum Deviation
In this section we first establish the formal definition of the VC-dimension, and then give the bounds on the maximum deviation between frequencies of error on two separate sets.
Definition. The VC-dimension of the set of indicator functions f ( x . N ) , N E A, is the maximal number h of vectors
. . 3xh from the set X that can be shattered byf(x, a ) , (Y E A. The vectors XI, . . . , xh are said to be shattered by f ( x , o), (r E A if, for any possible partition of these vectors into two classes A and B, (there are 2' such partitions) there exists a functionf(x, a * )that implements the partition, in the sense thatf(x, a*)= 0 if x E A and f ( x , t i * ) = 1 if x E B. To estimate a bound on the expectation of the random variable (1 we need to define
z
21
= XI. w1;.. . ;X Z I , w21
as a random independent sample of vectors (the xs) and their class [the u s , (w = 0. l)].We denote by Z(21) the set of all samples of size 21. Let us denote by ( Z2', 0 ) the frequency of erroneous classifications of the vectors XI,.
., ,XI
of the first half-sample of Z2', obtained by using the decision rulef(x, a ) : 1
1
Let us denote by v;(Z2',a ) the frequency of erroneous classification of the vectors x1+13
. . '
7
x21
obtained using the same decision rulef(x, o): 1
'
vi(z2',&)= -
21
1 Iwi
-f(xita)I
r=l+l
We study the properties of the random variable (21 =
((Z2')
= sup
[v;(z2', u ) - v{(Z2',4 1
(2.1)
L?EA
which is the maximal deviation of error frequencies on the two halfsamples over a given set of functions. There exists an upper bound for the expectation of this random value [ ( Z 2 ' ) for varying sample size 1. Consider three cases.
VC-Dimension of a Learning Machine
Case 1. (I/h
857
I 0.5). For this case we use the trivial bound:
Esup [v;(Z2',n)- v;(Z2',a)]5 1 &A
Case 2. (0.5 < l/h 5 g, g is small2).Let the supremum of the difference v;(z2', a ) - v;(z2', a )
be attained on the function f ( x , a'), where a+ = a*(Z2'). Consider the event B6 =
[v;(221, a+)- v;(z2',a*)]
{Z2' :
(2.2)
where v(Z2',a+)=
vi(Z2',a*)+ vi(Z2',a + ) 2
We denote the probability of this event by P(B6). In Appendix 1we prove the following theorem: Theorem 1. Let the set of indicatorfunctionsf ( x , a ) , a E A have VC-dimension h. Then the following bound of the conditional expectation of ((Z") is valid:
E { S035~.4 p [ ~ ~ ( Z " , a ) - ~ ~ ( ZI 2B'6,}o ) j 4 ln(21/h) l/h
[
+1
1
1 - lnP(B6)
+
I
From 2.3 and from the fact that the deviation of the frequencies on two half-samples does not exceed 1 we obtain
{
E sup[vf(Z2', a€A a ) - v;(Z21?a ) ]
1
This bound should be used when 6 is not too small (e.g., 6 > 0.5) and when the probability P(B6) is close to one. Such a situation seems to arise when the ratio l/h is not too large and when I is large. In this case, when P(B6) is close to one, l/h is not large and 1 is large, the bound ln(21/h) + 1 E{sup[v;(Z",a) a€A - v;(Z",a)]} I CI Vh is valid, where C1 is a constant. In the following, we shall make use of this bound for small l / h when constructing an empirical estimate of the expectation. 2From the experiments described in Section 6 we find that g = 8.
V. Vapnik, E. Levin, and Y. Le Cun
858
Case 3. (Jlh is large). In Appendix 1 we show that for the set of indicator functions with VC-dimension h the following bound holds: (2.5) where C2 is a constant. This bound is true for all Jlh. We shall use it only for large llh, where the bound of case 2 is not valid. 3 Effective VC-Dimension
According to the definition, the VC-dimension does not depend on the input probability measure P ( x ) . Our purpose here is to introduce a concept of eflective VC-dimension that reflects a weak dependence on the properties of the probability measure. Let the probability measure P be given on X and let the subset X*of the set X have a probability measure that is close to one, i.e.,
P ( X + )= 1 - 7) for small 17. Let the set of indicator functions f ( x , a ) , (Y E A with VC-dimension h be defined on X, and let the same set of functions on x* c X have VC-dimension k*. In Appendix 2 we prove the following theorem.
Theorem 2. For all J > k for which the inequality
isfulfilled, the following bounds are valid:
< 416
(
+
ln(2J/h*) 1
Ilk'
+
1 - lnP(B6) J
(3.2)
Remark. If 71 + 0 the inequalities are true for all 1 > h. Note that in this case the left-hand side of inequalities 3.2 and 3.3 do not depend on
VC-Dimension of a Learning Machine
859
x', but the right-hand side does. Therefore the tightest inequality will be achieved for the x' with the smallest h'.
Definition. The effective VC-dimension of the set f ( x , a ) , Q E A (for the given measure P ) is the minimal VC-dimension of this set of functions defined on all subsets X* c X whose measure is almost one [p(X*)> 1 - q*, where q* > 0 is a small value]. As in the previous section, from 3.2 we obtain that for the effective VC-dimension h* the following inequality is true
E (sup[v~(Z',n) LXEA - v;(Z",a
)I}
<
+
4P(Bs) h(21/h*) 1 6 l/h*
[
+ [1
-
P(B6)I
+
1 - hP(B6) 1
1
(3.4)
For the case when P(B6) is close to one, the quantity l/h* is small, and 1 is large we obtain
{
E sup[vi(Z2',a ) - vi(Z2',a ) ] &A
where C is a constant. Since the bounds 3.2 and 3.3 have the same form as the bounds 2.3 and 2.5, to simplify our notation we use h in the next sections to denote the efective VC-dimension.
4 The Law of the Largest Deviations In Section 2 we gave several bounds, for different cases, on the expectation of the largest deviation over the set of decision rules. In this section we first give a single functional form that reflects the properties of the bounds. Then we conjecture that the same functional form approximates the expectation of the largest deviation. The bounds given in the previous section are summarized as follows:
Consider the following continuous approximation of the right side of the bound 4.1
where T = l/h. The function (a(') has two free parameters u and b. The third parameter k < 0.5 is chosen from the conditions of continuity at point 7 = 0.5: Q(0.5) = 1. Note that the function @ ( r )has the same structure as the confidence interval (1.2), except that the frequency .(a) is replaced by a constant.
V. Vapnik, E. Levin, and Y. Le Cun
860
For small r, (0.5 < r < g), Q,(T) behaves as
';
ln(27) + 1
r-k
and for large r as
The constants u and b determine the regions of "large" and "small" values of T . We make the hypothesis that there exist constants u , b that are only weakly dependent upon the properties of the input probability measure, such that the function Q,( 7 ) approximates the expectation sufficiently well for all 1 > h:
{
4
E sup[v~(Z*',cx) - vi(Z*',ct atA
= @(r)
(4.3)
The constants can be found empirically by fitting Q, to experimental data using a machine whose VC-dimension is known. If we assume that the constants obtained are universal, we can then use the function Q, obtained this way to measure the capacity of other learning machines. 5 Measuring the Effective VC-Dimension of a Classifier
If the function @( r ) approximates the expectation with sufficient accuracy, then the effective VC-dimension of the set of indicator functions realizable by a given classifier can be estimated on the basis of the following experiment. First, we generate a random independent set of size 21, .
.
z;'= x i , w;;. . .;x ; , w;;
.
.
.
w;+l; . . . ;X i ' , w;'
(5.1)
using a generator of random vectors P ( x ) and a (possibly nondeterministic) generator of labels P(w I x ) . Using this sample, we measure the quantity
((z;'=)sup[v;(z;', a ) - v;(z;',a ) ] &A by approximating the expectation by an average over N independently generated sets of size 21.
Repeating the above procedure for various values of I, we obtain the set of estimates ((11)~. . . ,< ( l k ) .
VC-Dimension of a Learning Machine
861
The effective VC-dimension of the set of functions f ( x , a ) ,a E A, can then be approximated by finding the integer parameter h* that provides the best fit between @ ( l / h )and the ((1;)s: k
h* = argminC(((Ii)- @(l,/h)]’
’
i=l
The accuracy of the obtained VC-dimension estimate, and in fact, the
validity of the approach presented in this paper, depend crucially on how well the function @ ( l / h )describes the expectation 4.3. To estimate the expectation of the largest deviation empirically one must be able to define for every fixed sample 5.1 the value of the largest deviation 5.2. This can be done by considering a modified training set where the labels of the first half of the set have been reversed XI 1
m; . . . ;XI, ml; X l f l ,w + 1 ; . . .;X’lr
w21
where a denotes the opposite class of w. As shown in Appendix 3, evaluating 5.2 can be done by minimizing the functional
6 Empirical Measurement of the Effective VC-Dimension
In this section, we illustrate the method of measuring the VC-dimension by applying it to various types of linear classifiers. Our method relies on several assumptions that must be checked experimentally:

• The expected deviation E{ξ(l)} is largely independent of the generator distribution P(ω | x).
• The expected deviation E{ξ(l)} is well described by the function Φ(l/h).
• The constants a and b of Φ are universal, that is, largely independent of the learning machine.
Experiments with learning machines that implement various subsets of the set of linear classification functions were conducted to check these hypotheses. The VC-dimension of the set of linear classifiers (without bias) is known from theory to be equal to the dimension of the input space n. The experiments described here were conducted with n = 50, x = (x^1, …, x^50). Additional experiments with various values of n ranging from 25 to 200 were performed with similar results.
Since no efficient algorithm is known for minimizing the classification error of linear threshold units, we trained the machine by minimizing the mean squared error between the labels and the output of a unit where the hard threshold was replaced by a sigmoid function. This procedure does not ensure that the classification error is minimized, but it is known empirically to give good approximate solutions in the separable and nonseparable cases. After training, the output labels were computed by thresholding the output of the sigmoid.

6.1 Independence of Average Deviations from the Task Difficulty. A set of experiments was performed to assess the influence of the difficulty of the task [an important property of the conditional probability distribution P(ω | x)] on the deviation. Three ensembles of training sets were generated such that the expected frequency of errors using a linear classifier would be, respectively, 0, 0.25, and 0.5. First, we randomly selected a set of 50-dimensional input vectors with coordinates independently drawn with a uniform distribution over the [−1, 1] interval. Then, the labels for the three experiments were generated by picking a 50-dimensional vector of coefficients α* at random, and by generating the labels according to the following three conditional probability distributions:

    P1(ω | x) = { 1 for ω = θ(α*·x);  0 for ω = 1 − θ(α*·x) }
    P2(ω | x) = { 0.75 for ω = θ(α*·x);  0.25 for ω = 1 − θ(α*·x) }
    P3(ω | x) = { 0.5 for ω = θ(α*·x);  0.5 for ω = 1 − θ(α*·x) }
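For concreteness, a minimal sketch of the three label generators (our illustration; θ is implemented as the 0/1 step function, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
alpha_star = rng.uniform(-1.0, 1.0, n)          # random coefficient vector

def make_labels(X, p_correct):
    """Labels equal theta(alpha* . x) with probability p_correct, else flipped.
    p_correct = 1.0, 0.75, 0.5 gives P1, P2, P3, respectively."""
    clean = (X @ alpha_star > 0).astype(int)    # theta(alpha* . x)
    flip = rng.uniform(size=len(X)) >= p_correct
    return np.where(flip, 1 - clean, clean)

X = rng.uniform(-1.0, 1.0, (1000, n))           # inputs uniform over [-1, 1]^50
w1, w2, w3 = (make_labels(X, p) for p in (1.0, 0.75, 0.5))
```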
P1 corresponds to a linearly separable task (no noise), P3 to a random classification task (random labels; no generalization is expected), and P2 to a task of intermediate difficulty. Figure 1 shows the average values of the deviations as a function of l/h, the size of the half-set normalized by the VC-dimension, for the three conditional probabilities. Each average value was obtained from 20 realizations of the deviations. As can be seen in the figure, the results for the three cases practically coincide, demonstrating the independence of E{ξ} from the difficulty of the task. In further experiments only the conditional probability P3(ω | x) was used.

Figure 1: The measured values for average deviation ξ(l) are shown as a function of l/h, the size of the half-sample normalized by the VC-dimension. These values were obtained in experiments with three different conditional probabilities P(ω | x). The symbol △ denotes values measured for P1, ○ denotes values measured for P2, and • denotes values measured for P3.

6.2 Estimation of the Free Parameters a and b in Φ(l/h). The values of the free parameters a and b in Φ(l/h) (equation 4.2) can be determined if we fit equation 4.2 to a set of experimentally obtained values of the average maximal deviation produced by a learning machine with a known capacity h.
Assuming the optimal values a* and b* thereby obtained are universal, we can simply consider them constant, and use them for the measurement of the capacity of other learning machines. Figure 2 shows the approximation of the empirical data by Φ(l/h), given in 4.2 with parameters set to a* = 0.16, b* = 1.2. Note how well the function describes the average maximal deviation throughout the full range of training set sizes used in our experiments (0.5 < l/h < 32). Experiments with various input sizes yielded consistent values for the parameters. A simpler functional form for the deviation, Φ1(l/h),
    Φ1(l/h) = d √[(ln(2l/h) + 1)/(l/h − 0.5)]
inspired by the bound for small l/h, describes the data well for small l/h (up to l/h < 8). Figure 3 shows the approximation of the empirical data by Φ1 using d = 0.39. The values a = 0.16 and b = 1.2 obtained with the experiment of Figure 2 were used for all the experiments described below.

Figure 2: The values for average deviation ξ(l) are fitted by Φ(l/h) with parameter values a = 0.16, b = 1.2.

6.3 Control Experiments. For further validation of the proposed method, a series of control experiments was conducted. In these experiments we used the above-described method to measure the effective VC-dimension of several learning machines, and compared it to their theoretically known values. The learning machine we used was composed of a fixed preprocessor which transformed the n-dimensional input vector x into another n-dimensional vector y using a linear projection onto a subspace of dimension k.
Figure 3: The values for average deviation ξ(l) are fitted by Φ1(l/h) with d = 0.39.
Table 1: The true and the measured values of VC-dimension in 4 control experiments.

    Effective VC dim.   40   30   20   10
    Estimated VC dim.   40   31   20   11
The resulting vector y was then fed to a linear classifier with n inputs. It is easy to see that the theoretical value of the effective VC-dimension is k, as this machine can be reduced to a linear classifier with k inputs. Table 1, which shows the estimated VC-dimension for k = 10, 20, 30, 40, indicates that there is a strong agreement between the estimated effective VC-dimension and the theoretically predicted dimension. Figure 4a-d demonstrates how well the function Φ(l/h*), where h* is the estimate of the effective VC-dimension, describes the experimental data.
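A sketch of the control-experiment machine under these stated assumptions (the particular projection construction and helper names are ours):

```python
import numpy as np

def projection_preprocessor(n, k, seed=0):
    """Fixed linear projection of R^n onto a random k-dimensional subspace;
    the output y stays n-dimensional, so the classifier keeps n inputs."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal n x k
    P = basis @ basis.T                                   # rank-k projector
    return lambda X: X @ P

preprocess = projection_preprocessor(n=50, k=20)
# Running the measurement procedure of Section 5 on preprocess(X)
# should return an estimate h* close to 20 (cf. Table 1).
```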
Figure 4: The values of average deviation ξ(l), measured in the four control experiments (k = 10, 20, 30, 40), are fitted by Φ(l/h*) and shown as a function of l/h*.
6.4 Smoothing Experiments. In this section, we measure the effect of "smoothing" the input on the capacity. Contrary to the previous section, this effect was not predicted theoretically. As in the previous section, the learning machine incorporated a linear preprocessor which transformed x into y by a smoothing operation:

    y^j = Σ_{i=1}^{n} x^i exp{−β |i − j|}

where β is the smoothing parameter.
As described previously, the classifier was trained with gradient descent, but a very small "weight decay" term was added (the reason for the weight decay will be explained below). The parameter β determines the amount of smoothing: for very large β, the components of the vector y are virtually independent, and the VC-dimension of the learning machine is close to n. As the value of β decreases, strong correlations between input variables start to appear. This causes the distribution of preprocessed vectors to have a very small variance in some directions. Intuitively, an appropriate weight decay term will make these dimensions effectively "invisible" to the linear classifier, thereby reducing the measured effective VC-dimension. The measured VC-dimension decreases down to 1 for β = 0. When β takes nonzero values, there is no theoretically known evaluation of the VC-dimension, and the success of the method can be measured only by how well the experimental data are described by the function Φ. Figure 5a-d shows the average maximal deviation as a function of l, for different smoothing parameters β, and demonstrates how well the function Φ(l/h*), where h* is the estimate of the effective VC-dimension, approximates the experimental data. Figure 6 shows the estimated value of the VC-dimension as a function of the smoothing parameter β.
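A minimal sketch of the smoothing preprocessor (our illustration): the matrix S with entries S[j, i] = exp(−β|i − j|) implements the transformation, and its singular values show how decreasing β concentrates the variance in fewer directions:

```python
import numpy as np

def smoothing_matrix(n, beta):
    """S[j, i] = exp(-beta * |i - j|); the preprocessor computes y = S @ x."""
    idx = np.arange(n)
    return np.exp(-beta * np.abs(idx[:, None] - idx[None, :]))

for beta in (0.1, 0.05, 0.02, 0.01):
    s = np.linalg.svd(smoothing_matrix(50, beta), compute_uv=False)
    # count the directions carrying non-negligible variance: shrinks with beta
    print(beta, int(np.sum(s > 0.01 * s[0])))
```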
7 Conclusion

We have shown that there exist three different regimes for the behavior of the maximum possible deviation between the frequencies of error on two different sets of examples. For very small values of l/h (< 1/2), the maximal deviation is ≈ 1; for small values of l/h (from 1/2 up to about 8), it behaves like √[ln(2l/h)/(l/h)]; for large values it behaves like √[1/(l/h)].
We have introduced the concept of effective VC-dimension, which takes into account some weak properties of the probability distribution over the input space. We prove that the effective VC-dimension can be used in place of the VC-dimension in all the formulas obtained previously. This provides us with tighter bounds on the maximal deviation. Based on the functional forms of the bounds for the three regimes, we propose a single formula that contains two free parameters. We have shown how the values of these parameters can be evaluated experimentally. We show how the formula can be used to estimate the effective VC-dimension. We illustrate the method by applying it to various learning machines based on linear classifiers. These experiments show good agreement between the model and the experimental data in all cases. Interestingly, it appears that, at least within the set of linear classifiers, the values of the parameters are universal. This universality was confirmed by recent results obtained with linear classifiers trained on real-life tasks (image classification).
Figure 5: The values of average deviation ξ(l), measured in experiments with different values of the smoothing parameter β, are fitted by Φ(l/h*) (h* is the measured value of VC-dimension), and are shown as a function of l/h*. (a) For β = 0.1, the estimated VC-dimension is 40. (b) For β = 0.05, the estimated VC-dimension is 33. (c) For β = 0.02, the estimated VC-dimension is 12. (d) For β = 0.01, the estimated VC-dimension is 6.
Excellent fits with the same values of the constants were obtained even when the classifiers were trained by minimizing the empirical risk subject to various types of constraints. These included simple constraints, such as a limit on the norm of the weight vector (equivalent to weight decay), and more complex ones which improve the invariance of the classifier to distortions of the input image by limiting the norm of some Lie derivatives (Cortes 1993; Guyon et al. 1992).
Figure 6: The estimated effective VC-dimension as a function of the smoothing parameter β.

The extension of the present work to multilayer networks faces the following difficulties. First, the theory is derived for methods that minimize the empirical risk. However, existing learning algorithms for multilayer nets cannot be viewed as minimizing the empirical risk over the entire set of functions implementable by the network. Because the energy surface has multiple local minima, it is likely that, once the initial parameter value is picked, the search will be confined to a subset of all possible functions realizable by the network. The capacity of this subset can be much less than the capacity of the whole set (which may explain why neural nets with large numbers of parameters work better than theoretically expected). Second, because the number of local minima may change with the number of examples, the capacity of the search subset may change with the number of observations. This may require a theory which considers the notion of nonconstant capacity associated with an "active" subset of functions.
Appendix 1. Proof of Theorem 1

Let the random independent sample

    Z^{2l} = x_1, ω_1; …; x_{2l}, ω_{2l}    (A.1)

be given. Call the classes of equivalence of the set f(x, α), α ∈ Λ, on the sample A.1 the subsets of indicator functions f(x, α), α ∈ Λ', that take the same values on the vectors x from A.1. Denote the number of different classes of equivalence given on the sample A.1 by Δ^Λ(x_1 ω_1, …, x_{2l} ω_{2l}).
Consider the event

    A_ε = {Z^{2l} : sup_{α∈Λ} [ν'_l(Z^{2l}, α) − ν''_l(Z^{2l}, α)] > ε}
        = {Z^{2l} : ν'_l(Z^{2l}, α*) − ν''_l(Z^{2l}, α*) > ε}

and the event B_δ defined as in equation 2.2. The following lemma is true.

Lemma.

    P(A_ε B_δ) ≤ E Δ^Λ(x_1 ω_1, …, x_{2l} ω_{2l}) exp{−δεl/4}
The proof of this lemma will follow the same scheme as the proof of Theorem A2 in Vapnik (1982, pp. 170-172). Write the probability in the evident form

    P(A_ε B_δ) = ∫_{Z(2l)} θ{[ν'_l(Z^{2l}, α*) − ν''_l(Z^{2l}, α*)] − ε} θ_{B_δ}(Z^{2l}) dP(Z^{2l})    (A.2)

where Z(2l) is the space of samples Z^{2l}, θ_{B_δ} is the indicator of the event B_δ,

    ν(Z^{2l}, α*) = (1/2)[ν'_l(Z^{2l}, α*) + ν''_l(Z^{2l}, α*)]

and f(x, α*) is the function which attains the maximum deviation of frequencies in the two half-samples. Denote by T_i [i = 1, 2, …, (2l)!] the set of all possible permutations of the elements of the sample A.1. Denote also

    χ(T_i Z^{2l}, α*) = ν'_l(T_i Z^{2l}, α*) − ν''_l(T_i Z^{2l}, α*)
    σ²(T_i Z^{2l}, α*) = [ν(T_i Z^{2l}, α*) + 1/(2l)][1 + 1/(2l) − ν(T_i Z^{2l}, α*)]
It is obvious that the equality

    P(A_ε B_δ) = ∫_{Z(2l)} [1/(2l)! Σ_{i=1}^{(2l)!} θ{χ(T_i Z^{2l}, α*) − ε} θ_{B_δ}(T_i Z^{2l})] dP(Z^{2l})    (A.3)

is true, since the measure P is invariant under permutations of the elements of the sample. Note that for any fixed Z^{2l} and T_i the integrand does not exceed the sum of the corresponding terms taken over representatives α_j of the classes of equivalence (inequality A.4). Using this bound, one can estimate the integrand of the right-hand side of the equality A.3. It does not exceed the value

    Δ^Λ(Z^{2l}) max_j [1/(2l)! Σ_{i=1}^{(2l)!} θ{χ(T_i Z^{2l}, α_j) − ε} θ_{B_δ}(T_i Z^{2l})]    (A.5)

The expression in the brackets is the fraction of the number of permutations T_i for which the two events

    ν'_l(T_i Z^{2l}, α_j) − ν''_l(T_i Z^{2l}, α_j) > ε

and the event defining B_δ hold simultaneously, among all possible (2l)! permutations. It is equal to the value

    Γ = Σ_k C_m^k C_{2l−m}^{l−k} / C_{2l}^l

where m is the total number of errors of f(x, α_j) on the sample, and the summation is performed over all k satisfying the inequalities

    k/l − (m − k)/l > ε,    k/l − m/(2l) > δ q(1 − q),  q = m/(2l)    (A.6)
We now estimate the value Γ. The derivation of the estimate of Γ repeats the derivation of similar estimates in Vapnik (1982, pp. 173-176):

    ln Γ/2 < −[4(l + 1)/((m + 1)(2l − m + 1))] (k* − m/2 + 1)²  if m is an even number    (A.7)

with an analogous expression if m is an odd number, where k* is the least integer satisfying the condition

    k*/l − m/(2l) ≥ δ/3    (A.8)

From A.6 and A.7 we obtain

    ln Γ/2 < −(δ/2)(k* − m/2)

and from this inequality and A.8 we obtain

    ln Γ/2 < −δεl/4

Substitute now the estimate of the value Γ into the right-hand side of the equality A.3. We obtain

    P(A_ε B_δ) ≤ E Δ^Λ(x_1 ω_1, …, x_{2l} ω_{2l}) exp{−δεl/4}
The lemma is proved. Note now that the inequality

    E Δ^Λ(Z^{2l}) ≤ max_{Z^{2l}} Δ^Λ(Z^{2l}) = m^Λ(2l)

is true. The function m^Λ(2l), known as the growth function, is estimated by means of the VC-dimension (Vapnik 1982):

    m^Λ(2l) < (2el/h)^h  for 2l > h

Therefore

    P(A_ε B_δ) < (2el/h)^h exp{−δεl/4}

Then

    P(A_ε | B_δ) < (2el/h)^h exp{−δεl/4} / P(B_δ)    (A.9)
Estimate now the conditional expectation

    E{sup_{α∈Λ} [ν'_l(Z^{2l}, α) − ν''_l(Z^{2l}, α)] | B_δ}

For this estimation divide the integration interval into two parts: (0, u), where we use the trivial estimate of the conditional probability P(A_ε | B_δ) = 1, and the area (u, ∞), where we use the estimate A.9. We obtain

    E{· | B_δ} ≤ u + (2el/h)^h (4/(δl)) exp{−δul/4} / P(B_δ)

The minimum of the right-hand side of the inequality is attained when

    u = 4 (h[ln(2l/h) + 1] − ln P(B_δ)) / (δl)

In such a way we obtain the estimate

    E{· | B_δ} < 4 (h[ln(2l/h) + 1] − ln P(B_δ) + 1) / (δl)
The theorem is proved. Now we want to obtain the estimate of the expectation of the random value 2.2 for an arbitrary l/h. For this we use the inequality (Vapnik 1982, p. 172)

    P{sup_{α∈Λ} [ν'_l(α, Z^{2l}) − ν''_l(α, Z^{2l})] > ε} < 3 m^Λ(2l) exp{−ε²l}

As in the case above we get

    E{sup_{α∈Λ} [ν'_l(α, Z^{2l}) − ν''_l(α, Z^{2l})]} < ∫_0^u dε + 3 (2el/h)^h ∫_u^∞ exp{−ε²l} dε
        < u + 3 (2el/h)^h ∫_u^∞ exp{−εul} dε = u + 3 (2el/h)^h exp{−u²l} / (ul)

For

    u = √[(h[ln(2l/h) + 1] + ln 3) / l]

we obtain the estimate

    E{sup_{α∈Λ} [ν'_l(α, Z^{2l}) − ν''_l(α, Z^{2l})]} < u + 1/(ul)
Appendix 2. Proof of Theorem 2

We first prove the first part of the theorem. By the lemma proved in Appendix 1, the inequality

    P(A_ε B_δ) ≤ E Δ^Λ(Z^{2l}) exp{−δεl/4}    (A.10)

is true. Then it follows that

    P(A_ε B_δ) ≤ exp{−δεl/4} ∫_{Z(2l)} Δ^Λ(Z^{2l}) dP(Z^{2l})    (A.11)

Divide the integration domain Z(2l) into two parts: Γ_{2l} and Z(2l)\Γ_{2l}, where Γ_{2l} is the sample space all of whose elements belong to X*. Then

    exp{−δεl/4} ∫_{Z(2l)} Δ^Λ(Z^{2l}) dP(Z^{2l})
        = exp{−δεl/4} [∫_{Γ_{2l}} Δ^Λ(Z^{2l}) dP(Z^{2l}) + ∫_{Z(2l)\Γ_{2l}} Δ^Λ(Z^{2l}) dP(Z^{2l})]    (A.12)

According to the conditions of the theorem, in the area Γ_{2l} the bound

    Δ^Λ(Z^{2l}) < (2el/h*)^{h*}    (A.13)

holds, and in the area Z(2l)\Γ_{2l} the bound

    Δ^Λ(Z^{2l}) < (2el/h)^h    (A.14)

holds. Thus from A.12, A.13, and A.14 it follows that

    P(A_ε B_δ) < exp{−δεl/4} [(2el/h*)^{h*} + μ (2el/h)^h]    (A.15)

where μ is the measure of Z(2l)\Γ_{2l}. Note that

    μ < 1 − (1 − γ_l)^{2l} < 2lγ_l

where 1 − γ_l is the measure of the set X*. According to the condition of the theorem the inequality

    2lγ_l (2el/h)^h < (2el/h*)^{h*}

holds. We then derive

    μ (2el/h)^h < (2el/h*)^{h*}    (A.16)

Substituting A.16 into A.15 we obtain

    P(A_ε B_δ) < 2 (2el/h*)^{h*} exp{−δεl/4}

Then, as in Appendix 1, we obtain the estimate of the expectation 3.3. We can derive the estimate 3.4 similarly.

Appendix 3. Estimation of the Maximum Error Rate Difference on Two Half-Samples

We would like to estimate the value of the maximal error rate difference of a learning machine on two half-sets (the maximal deviation). Denote by Z̄^{2l} a sequence obtained from the set Z^{2l} of equation 5.1 by changing the values of ω_i for the first half-set:
    Z̄^{2l} = x_1, ω̄_1; x_2, ω̄_2; …; x_l, ω̄_l; x_{l+1}, ω_{l+1}; …; x_{2l}, ω_{2l},  ω̄_i = 1 − ω_i    (A.17)

We will use Z̄^{2l} as a training set for the learning machine. In this case, training results in the minimization of

    R(α) = (1/(2l)) Σ_{i=1}^{2l} |ω̄_i − f(x_i, α)|    (A.18)

It is easy to see that the minimum of A.18 is obtained for the same function f(x, α*) that maximizes the deviation. Since ω and f(x, α) have binary values,

    |ω̄_i − f(x_i, α)| = 1 − |ω_i − f(x_i, α)|  for i = 1, …, l    (A.19)
From A.18 and A.19 it follows that

    R(α) = (1/2){1 − [ν'_l(α, Z^{2l}) − ν''_l(α, Z^{2l})]}    (A.20)

Therefore, minimizing R(α) is equivalent to maximizing

    ξ(α) = ν'_l(α, Z^{2l}) − ν''_l(α, Z^{2l})    (A.21)

To summarize, the maximization of the deviation is obtained by training the learning machine with the training set Z̄^{2l}. The value of the maximal deviation is related to the minimum of the training functional through ξ(Z^{2l}) = 1 − 2R(α*).
References

Abu-Mostafa, Y. S. 1993. Hints and the VC dimension. Neural Comp. 5(2), 278-288.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Cortes, C. 1993. Personal communication.
Devroye, L. 1982. Bounds for the uniform deviation of empirical measures. J. Multivar. Anal. 12, 72-79.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 471-479. Morgan Kaufmann, Denver, CO.
Le Cun, Y., Denker, J. S., Solla, S., Howard, R. E., and Jackel, L. D. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed. Morgan Kaufmann, Denver, CO.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Vapnik, V., and Chervonenkis, A. 1989. The necessary and sufficient conditions for consistency of empirical risk minimization method. Pattern Rec. Image Anal. 1(3), 283-305.
Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weight elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds. Morgan Kaufmann, Denver, CO.

Received August 31, 1992; accepted November 5, 1993.
Communicated by Eric Baum and Vladimir Vapnik
Neural Nets with Superlinear VC-Dimension

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria
It has been known for quite a while that the Vapnik-Chervonenkis dimension (VC-dimension) of a feedforward neural net with linear threshold gates is at most O(w·log w), where w is the total number of weights in the neural net. We show in this paper that this bound is in fact asymptotically optimal. More precisely, we exhibit for any depth d ≥ 3 a large class of feedforward neural nets of depth d with w weights that have VC-dimension Ω(w·log w). This lower bound holds even if the inputs are restricted to Boolean values. The proof of this result relies on a new method that allows us to encode more "program-bits" in the weights of a neural net than previously thought possible.
The Vapnik-Chervonenkis dimension VC-dimension(N) of a neural net N with n input nodes is defined as the size of the largest set S ⊆ R^n which is "shattered" by N in the sense that every function F : S → {0,1} can be computed by N with some assignment of real numbers to its weights. The VC-dimension of a neural net N is an important measure for the expressiveness of N, that is, for the variety of functions that can be computed by N with different choices for its weights. In particular it has been shown in Blumer et al. (1989) and Ehrenfeucht et al. (1989) that the VC-dimension of N essentially determines the number of training examples that are needed to train N in Valiant's model (Valiant 1984) for probably approximately correct learning ("PAC-learning"). It has been known for quite a while that the VC-dimension of a neural net with linear threshold gates and w edges (respectively w weights) is at most O(w·log w). This result, which holds for arbitrary real-valued input patterns, was first shown by Cover (1964) (see also Cover 1968), and later by Baum and Haussler (1989). It has frequently been conjectured that the "true" upper bound is O(w). This conjecture is quite plausible, since a single linear threshold gate with w edges has VC-dimension w + 1. Furthermore it is hard to imagine that the VC-dimension of a network of linear threshold gates can be larger than the sum of the VC-dimensions of the individual linear threshold gates in the network.

Neural Computation 6, 877-884 (1994)
@ 1994 Massachusetts Institute of Technology
We disprove this popular conjecture by showing that for any depth d ≥ 3 quite a number of neural nets N of depth d have a VC-dimension that is superlinear in the number w of edges in N. In particular, we exhibit for arbitrarily large w ∈ N neural nets N of depth 3 (i.e., with 2 hidden layers) with w weights that have VC-dimension Ω(w·log w). This shows that the quoted upper bound of O(w·log w) is in fact asymptotically optimal. It is of some interest to note that the upper bound O(w·log w) for the VC-dimension of a neural net with w weights holds even for the case of real-valued inputs, whereas our matching lower bound Ω(w·log w) for the VC-dimension of certain neural nets N_w with w weights holds already for the restriction of N_w to Boolean inputs. Our lower bound also shows that the well-known upper bound 2w·log(eN) for the VC-dimension of a neural net with w weights and N computation nodes (due to Baum and Haussler 1989) is asymptotically optimal. The result of this paper may also be viewed as mathematical evidence for a certain type of "connectionism thesis": that a network of neuronlike elements is more than just the sum of its elements. We show that in a large neural net a single weight may add more than a constant to the VC-dimension of the neural net: its contribution may increase with the logarithm of the total size of the neural net. Although we consider in this paper only neural nets with linear threshold gates, it is obvious that the same lower bound can also be derived for neural nets with other activation functions such as σ(y) = 1/(1 + e^{−y}) (see Rumelhart and McClelland 1986) or piecewise linear (respectively, polynomial) functions of a similar type (see Sontag 1992; Maass et al. 1991; Maass 1993). This paper improves our earlier results from Maass (1992) (see Maass 1993 for an extended abstract), where we had exhibited neural nets of depth 4 with superlinear VC-dimension. Both our preceding results and the proof of our new result employ classical circuit construction methods due to Neciporuk (1964) and Lupanov (1972). Bartlett (1993) has independently derived lower bounds for the VC-dimensions of various neural nets of depth 2 and 3 that are linear in the number w of weights. The neural nets that are considered in this paper are feedforward neural nets with linear threshold gates (or simpler: threshold gates), that is, gates that apply the heaviside activation function to the weighted sum Σ_{i=1}^{m} α_i y_i + α_0 of their inputs y_1, …, y_m. The parameters α_1, …, α_m and α_0 are the weights of such a gate. We will consider in this paper only neural nets with Boolean inputs and one Boolean output. The depth of a neural net is the length of the longest path from an input node to the output node (= output gate). The depth of a gate in a neural net is the length of the longest path from an input node to that gate. We refer to all gates of depth d as "level d" of a neural net. We will focus our attention on neural nets that are layered in the sense that for each gate all paths from an input node to that gate have the same
length. This means that only gates on successive levels are connected by an edge, and input nodes are connected by an edge only with gates on level 1. This is not a serious restriction, since for i + 1 < j one may replace an edge between nodes on levels i and j by a path of length j − i (by introducing j − i − 1 "dummy" gates on the intermediate levels). It is obvious that a layered neural net of depth d has exactly d − 1 hidden layers. One calls a layered neural net fully connected if any two nodes on successive levels (including input nodes and gates on level 1) are connected by an edge. We use the standard notation f = Θ(g) for arbitrary functions f, g : N → N to indicate that both f = O(g) and f = Ω(g).

Theorem 1. Assume that (N_n)_{n∈N} is any sequence of fully connected layered neural nets of depth d ≥ 3. Furthermore assume that N_n has n input nodes and Θ(n) gates, of which Ω(n) gates are on the first hidden layer, and at least 4 log n gates are on the second hidden layer of N_n. Then N_n has Θ(n²) edges and VC-dimension(N_n) = Θ(n²·log n).

The proof of Theorem 1 proceeds by "embedding" into the given neural nets N_n of Theorem 1 the special neural nets M_ñ that are constructed in the following Theorem 2. The precise derivation of Theorem 1 from the construction of Theorem 2 is given after the proof of Theorem 2.

Theorem 2. Assume that n is some arbitrary power of 2. Then one can construct a neural net M_n of depth 3 with n input nodes and at most 17n² edges such that VC-dimension(M_n) ≥ n²·log n.

Proof of Theorem 2. One may view the weights of a neural net M_n collectively as the "program" of M_n. It is obviously no problem to employ large weights in a neural net. But the question is how many bits of the weights are actual "program-bits" in the sense that they contribute to the VC-dimension of the neural net. We will exhibit in this proof a method for "efficient programming" of a neural net that allows us to encode an unusually large number of "program-bits" in the weights. More precisely we show that on average each of the Θ(n²) weights can be used to store Ω(log n) "program-bits." Furthermore our construction requires only integer weights whose absolute value is polynomial in n. Hence all weights have bit-length O(log n), and our construction shows that a constant fraction of all weight-bits can be used to store the "program" of M_n. Before we describe the precise construction of the desired neural net M_n, we first illustrate this method for "efficient programming" of a neural net in a simpler example. This "example" will turn out to be an essential building block for the construction of M_n. We construct a neural net M with n² integer weights of size polynomial in n that can be programmed to compute any n-tuple of permutations of {e_1, …, e_n} (where e_i ∈ {0,1}^n is the ith unit vector). Since (n!)^n = 2^{Θ(n²·log n)} different n-
tuples of permutations of {e_1, …, e_n} exist, this construction of M may be viewed as an example for "efficient programming" of a neural net [in the sense that on average Ω(log n) bits of each weight are relevant for determining the function that is computed by M]. The neural net M will have 2·n inputs and n outputs. For the construction of M we assume that h is an arbitrary given function from {e_1, …, e_n}² into {e_1, …, e_n} such that for any j ∈ {1, …, n} the function h(·, e_j) is a permutation of {e_1, …, e_n}. Obviously any n-tuple of permutations of {e_1, …, e_n} can be represented by such a function h. We define for i, j ∈ {1, …, n} weights w_{i,j} by
    w_{i,j} = r  ⟺  h(e_r, e_j) = e_i

This implies that for any p, q ∈ {1, …, n} one has h(e_p, e_q) = e_i ⟺ w_{i,q} = Σ_{j=1}^{n} w_{i,j}·(e_q)_j = p [we write (x)_j for the jth coordinate of a vector x]. Hence one may take as the ith output gate of M a gate that checks (for example via the AND of two threshold gates) whether Σ_{j=1}^{n} w_{i,j}·(e_q)_j = Σ_{r=1}^{n} r·(e_p)_r (i.e., w_{i,q} = p). It is easy to verify that this neural net M computes the given function h.

The above method for "efficient programming" of a neural net M cannot be used directly to show that a neural net has a large VC-dimension, because for that we have to be able to compute many 0-1 valued rather than vector-valued functions F on a neural net. However, the subsequent construction shows that any 0-1 valued function F (with a suitable domain S) can be encoded by four functions g_1, …, g_4, which are each n-tuples of permutations of {e_1, …, e_n}. Hence we can employ in the construction of M_n the preceding method for "efficient programming" of n-tuples of permutations in a neural net. The only difference is that in M_n the output of each function g_k will be considered in binary representation (i.e., g_k outputs bin(i) ∈ {0,1}^{log n} instead of e_i ∈ {0,1}^n, where bin(i) is the binary string that represents the natural number i − 1). This will be necessary since each bit of bin(i) will be used to encode the value of F for a different input.

We will now describe the precise construction of a neural net M_n with O(n²) edges and VC-dimension(M_n) ≥ n²·log n. We assume that n is of the form 2^{n'} for some nonzero n' ∈ N. This implies that n is even and that log n ∈ N. We construct a neural net M_n with 2n + log n binary inputs and O(n²) weights that shatters the following set S ⊆ {0,1}^{2n+log n} of size n²·log n:

    S := {e_p e_q ê_m : p, q ∈ {1, …, n} and m ∈ {1, …, log n}}

where e_r ∈ {0,1}^n denotes the rth unit vector of length n (r = 1, …, n), and ê_m ∈ {0,1}^{log n} denotes the mth unit vector of length log n (m = 1, …, log n). Thus each u ∈ S contains exactly three 1s. Fix any map F : S → {0,1}. One can encode F by a function g : {e_1, …, e_n}² → {0,1}^{log n} where the mth output bit of g(e_p, e_q) equals 1 if
and only if F(e_p e_q ê_m) = 1. It is straightforward to show (as follows) that for any function g : {e_1, …, e_n}² → {0,1}^{log n} there exist for k = 1, …, 4 functions g_k : {e_1, …, e_n}² → {0,1}^{log n} such that g_k(·, e_q) is 1-1 for every fixed q ∈ {1, …, n}, and such that for all p, q ∈ {1, …, n}:

    g(e_p, e_q) = g_1(e_p, e_q) ⊕ g_2(e_p, e_q)  if p ≤ n/2
    g(e_p, e_q) = g_3(e_p, e_q) ⊕ g_4(e_p, e_q)  if p > n/2

The symbol ⊕ denotes here the bit-wise exclusive OR (i.e., parity) on bit-strings of length log n. To justify this claim, one chooses for each fixed e_q simultaneously for k = 1, 2 values for g_k(e_1, e_q), g_k(e_2, e_q), … in such a way that earlier assigned values for g_k(·, e_q) are avoided. After g_1(e_p, e_q) and g_2(e_p, e_q) have been defined for p = 1, …, l < n/2, one can choose for g_1(e_{l+1}, e_q) any string in {0,1}^{log n} that is not in the set

    {g_1(e_p, e_q) : p ≤ l} ∪ {g(e_{l+1}, e_q) ⊕ g_2(e_p, e_q) : p ≤ l}
Then one sets g_2(e_{l+1}, e_q) := g(e_{l+1}, e_q) ⊕ g_1(e_{l+1}, e_q). One continues the definition of g_1(e_p, e_q) and g_2(e_p, e_q) for p > n/2 in an arbitrary manner so that g_1(·, e_q) and g_2(·, e_q) become 1-1. The definition of g_3(·, e_q) and g_4(·, e_q) is analogous.

The neural net M_n computes F in the following way. The output gate on level 3 is an OR of 4·log n threshold gates. These threshold gates consist of log n blocks of 4 threshold gates, such that for any b ∈ {1, …, log n} some threshold gate in the bth block outputs 1 for network input e_p e_q ê_m if and only if m = b and (g(e_p, e_q))_b = 1 [i.e., F(e_p e_q ê_m) = 1]. More precisely, the ath threshold gate in block b outputs 1 if and only if the ath one of the following 4 conditions is satisfied:

1. m = b ∧ p ≤ n/2 ∧ (g_1(e_p, e_q))_b = 1 ∧ (g_2(e_p, e_q))_b = 0
2. m = b ∧ p ≤ n/2 ∧ (g_1(e_p, e_q))_b = 0 ∧ (g_2(e_p, e_q))_b = 1
3. m = b ∧ p > n/2 ∧ (g_3(e_p, e_q))_b = 1 ∧ (g_4(e_p, e_q))_b = 0
4. m = b ∧ p > n/2 ∧ (g_3(e_p, e_q))_b = 0 ∧ (g_4(e_p, e_q))_b = 1
The subcondition "m = b" is satisfied if and only if (e_p e_q ê_m)_{2n+b} = 1. The subcondition "p ≤ n/2" is satisfied if and only if Σ_{r=1}^{n} r·(e_p)_r ≤ n/2, hence it can be tested by a threshold gate on level 1 (analogously for "p > n/2"). The remaining subconditions are tested with the help of 8n
threshold gates on level 1 that involve weights w_{k,i,j} ∈ {1, …, n}, which are defined by the condition

    w_{k,i,j} = r  ⟺  g_k(e_r, e_j) = bin(i)

Among these there are 4n threshold gates G¹_{k,i}(e_p, e_q) on level 1 that output 1 if and only if Σ_{r=1}^{n} r·(e_p)_r ≥ Σ_{j=1}^{n} w_{k,i,j}·(e_q)_j (i.e., p ≥ w_{k,i,q}), and 4n threshold gates G²_{k,i}(e_p, e_q) on level 1 that output 1 if and only if Σ_{r=1}^{n} r·(e_p)_r ≤ Σ_{j=1}^{n} w_{k,i,j}·(e_q)_j (i.e., p ≤ w_{k,i,q}), for k = 1, …, 4 and i = 1, …, n. These are the only weights in the neural net M_n that depend on the function F : S → {0,1}. By definition one has that for each k, i at least one of the two gates G¹_{k,i}(e_p, e_q), G²_{k,i}(e_p, e_q) outputs 1. Furthermore for any k ∈ {1, …, 4} and any e_p e_q ê_m ∈ S there is exactly one i ∈ {1, …, n} such that both of these gates output 1. This index i is characterized by the equality g_k(e_p, e_q) = bin(i). Hence one can check whether (g_k(e_p, e_q))_b = 1 by testing whether

    Σ_{i : (bin(i))_b = 1} [G¹_{k,i}(e_p, e_q) + G²_{k,i}(e_p, e_q)] ≥ n/2 + 1

and one can check whether (g_k(e_p, e_q))_b = 0 by testing whether

    Σ_{i : (bin(i))_b = 1} [G¹_{k,i}(e_p, e_q) + G²_{k,i}(e_p, e_q)] ≤ n/2

Furthermore the sums on the left-hand side of both inequalities can only assume the values n/2 or (n/2) + 1. Therefore one can test the AND of two subconditions of this type and of the subconditions "m = b" and "p ≤ n/2" ("p > n/2") by a single threshold gate on level 2 of M_n. Hence one can test each of the conditions (1), …, (4) by a separate threshold gate in the bth block on level 2. Altogether the network M_n has 2n + log n Boolean inputs and 8n + 2 threshold gates on level 1 (the first hidden layer), which are connected by 16n² + 2n edges to level 0 (the input level). Among these are the 8n² edges with the weights w_{k,i,j} that depend on the function F : S → {0,1} (all other weights and thresholds in M_n are independent of F). On level 2 (the second hidden layer) the network M_n has 4·log n threshold gates that test the conditions (1), …, (4) as described before. Each of these gates is connected by 2n + 1 edges to gates on level 1, and by one edge to some input node. On level 3 the network M_n has an OR that is connected by 4·log n edges to the gates on level 2. In order to change M_n into a layered neural net we add 4·log n threshold gates on level 1 and replace the former 4·log n edges between input nodes and gates on level 2 by paths of length 2 (with the new 4·log n threshold gates on level 1 as intermediate nodes). Thus as a layered neural net M_n has 8n + 4·log n + 2 gates on level 1 and 4·log n gates on level 2, 16n² + 2n + 4·log n edges between input nodes and gates on level 1, and (2n + 2)·4·log n edges between gates on levels 1 and 2.
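To make the "efficient programming" building block tangible, here is a small sketch in software (ours, not part of the original construction; all helper names are hypothetical). It stores an n-tuple of permutations in the integer weights w_{i,j} and recovers the ith output as the AND of two threshold comparisons:

```python
import numpy as np

def program_weights(perms):
    """Encode an n-tuple of permutations into integer weights:
    w[i, j] = r  iff  h(e_r, e_j) = e_i.  perms[j] is a permutation of
    {1, ..., n}, with perms[j][r-1] = i meaning h(e_r, e_j) = e_i."""
    n = len(perms)
    w = np.zeros((n, n), dtype=int)
    for j, perm in enumerate(perms):
        for r, i in enumerate(perm, start=1):
            w[i - 1, j] = r
    return w

def output(w, p, q):
    """Output of M on input (e_p, e_q): gate i fires iff w[i, q] = p,
    i.e., the AND of the threshold tests w[i, q] >= p and w[i, q] <= p."""
    col = w[:, q - 1]
    return ((col >= p) & (col <= p)).astype(int)  # equals e_i with h(e_p, e_q) = e_i
```

For an n-tuple of permutations this stores one of (n!)^n possible "programs" in n² integer weights of absolute value at most n, which is the source of the Ω(log n) program-bits per weight.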
Altogether the constructed layered neural net M_n consists of 2n + log n input nodes, 8n + 8·log n + 3 computation nodes, and 16n² + (8·log n + 2)n + 16·log n edges. Obviously the nodes and edges of M_n are independent of the given function F : S → {0,1}, whereas the weights on 8n² of the edges depend on F (these weights range over {1, …, n}). Since the function F : S → {0,1} that is computed by M_n was chosen arbitrarily, the construction implies that the set S is shattered by M_n. Hence VC-dimension(M_n) ≥ |S| = n²·log n. □

Proof of Theorem 1. Since N_n has Θ(n) gates, Ω(n) gates on level 1, and n input nodes, it is obvious that N_n has Θ(n²) edges. Let K ≥ 1 be some natural number such that each of the given neural nets N_n from Theorem 1 has at least n/K gates on level 1. For n ≥ 28K we choose ñ ∈ N maximal of the form 2^{n'} for some n' ∈ N such that

    (*)  8ñ + 4·log n + 2 ≤ n/K

It is obvious that the rational number n/14K in place of ñ satisfies (*). Hence for n ≥ 28K we can find within the interval [n/28K, n/14K] some power of 2 that satisfies (*). This implies that the previously defined natural number ñ satisfies ñ ≥ n/28K. Furthermore we have 4·log ñ ≤ 4·log n, since ñ ≤ n. Thus the given neural net N_n has at least 8ñ + 4·log ñ + 2 gates on level 1, at least 4·log ñ gates on level 2, and at least 1 gate on each level l with 3 ≤ l ≤ d. Furthermore N_n has n ≥ 2ñ + log ñ input nodes. Since N_n is by assumption fully connected, the preceding considerations imply that the graph of M_ñ is contained as a subgraph in the graph of N_n (where each node of M_ñ occurs in N_n on the same layer as in M_ñ). Hence we can simulate M_ñ on N_n by assigning the value 0 to all superfluous input nodes of N_n, and by assigning the weight 0 to all superfluous edges of N_n. Furthermore we select in the case d > 3 one node on each of the levels 4, …, d of N_n and set its weights so that it computes the identity function on the output from the selected gate on the preceding level. Since N_n can simulate any computation of M_ñ, it follows that

    VC-dimension(N_n) ≥ VC-dimension(M_ñ) ≥ ñ²·log ñ ≥ (n/28K)²·log(n/28K) = Ω(n²·log n). □
Acknowledgments

I would like to thank György Turán for inspiring discussions and Eric Baum for helpful comments.
References

Bartlett, P. L. 1993. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. Proc. 6th Annu. ACM Conf. Comput. Learning Theory, 144-150.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36(4), 929-965.
Cover, T. M. 1964. Geometrical and statistical properties of linear threshold devices. Stanford Ph.D. Thesis 1964, Stanford SEL Tech. Rep. No. 6107-1, May.
Cover, T. M. 1968. Capacity problems for linear machines. In Pattern Recognition, L. Kanal, ed., pp. 283-289. Thompson Book Co.
Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. 1989. A general lower bound on the number of examples needed for learning. Inform. Comput. 82, 247-261.
Lupanov, O. B. 1972. On circuits of threshold elements. Dokl. Akad. Nauk SSSR 202, 1288-1291; English translation in Sov. Phys. Dokl. 17, 91-93.
Maass, W., Schnitger, G., and Sontag, E. D. 1991. On the computational power of sigmoid versus Boolean threshold circuits. Proc. 32nd Annu. IEEE Symp. Found. Comput. Sci., 767-776.
Maass, W. 1992. Bounds for the computational power and learning complexity of analog neural nets. IIG-Report 349 of the Technische Universitaet Graz (October).
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets (Extended Abstract). Proc. 25th Annu. ACM Symp. Theory Comput., 335-344.
Neciporuk, E. I. 1964. The synthesis of networks from threshold elements. Probl. Kibern. 11, 49-62; English translation in Autom. Expr. 7(1), 35-39.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Sontag, E. D. 1992. Feedforward nets for interpolation and classification. J. Comput. Syst. Sci. 45, 20-48.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142.

Received June 21, 1993; accepted November 17, 1993.
Communicated by Ramamohan Paturi
An Internal Mechanism for Detecting Parasite Attractors in a Hopfield Network

Jean-Dominique Gascuel, Bahram Moobed, Michel Weinfeld
Laboratoire d'Informatique de l'École Polytechnique, CNRS URA 1439, F-91128 Palaiseau Cedex, France
This paper presents a built-in mechanism for automatic detection of prototypes (as opposed to parasite attractors) in a Hopfield network. It has a good statistical performance and avoids host computer overhead for this purpose. This mechanism is based on an internal coding of the prototypes during learning, using cyclic redundancy codes, and leads to an efficient implementation in VLSI. The immediate separation of prototypes from parasite attractors can be used to enhance the autonomy of the networks, aiming at hierarchical multinetwork structures. As an example, the use of such an architecture for the classification of handwritten digits is described.

1 Introduction

Many biological and psychological facts show that natural information processing is performed in a very large number of functional units, each highly specialized in some particular simple task, connected together to form higher-level processing systems. One emerging trend in neural network research is to break down big networks into independent subunits, to implement many lower level and concurrent networks, each having a rather high connectivity but more loosely connected among them to form some kind of hierarchical network (Weinfeld 1990; de Bollivier et al. 1991). Among various candidates for the basic building block of these assemblies, totally connected or Hopfield neural networks (Hopfield 1982) are very interesting because, in addition to their intrinsic associative properties, they are simple, have an architecture not depending on the problem to solve (except for their size), and can be integrated rather easily in silicon. In these networks, the learning process consists of encoding some chosen stable states, called prototypes, into the synaptic weights in a distributed fashion. This process introduces some spurious stable states, called parasites, in the network. To know that the network has converged toward a parasite can be useful: it may mean that the stimulus was

Neural Computation 6, 902-915 (1994)
@ 1994 Massachusetts Institute of Technology
far enough from all the prototypes to be considered as not recognized, or rejected. The focus of this paper lies on the associative memory applications of Hopfield neural networks. The state of such a network is defined by a "state vector" σ = (σ_i)_{i=1…N}, where σ_i ∈ {−1, +1} is the binary value associated with the ith neuron's activation. The popular Hebb learning rule has the disadvantage of rather low capacity, which is lower if the learned patterns are even moderately correlated (Hopfield 1982). On the contrary, the Widrow-Hoff learning rule enables the weight matrix W to be built iteratively and has been proven to converge to the so-called pseudoinverse rule (projection rule), with better capacity and tolerance to the patterns' correlation (Personnaz et al. 1986; Diederich and Opper 1987). We intend to build multinetwork architectures using basic associative VLSI components that we have designed (Johannet et al. 1992), to avoid the strong overhead of software simulations. Hence, it is necessary to give these components as much autonomy as possible. The first requirement is to include the learning mechanism in the chip itself. The second, at least as important, is to identify a stable state as a prototype or a parasite, with as little overhead as possible. Straightforward methods are usually used for parasite identification. The most obvious one is to compare the current network attractor with the prototypes: this causes little overhead in a software simulation, but would have induced hardware and software overhead in the VLSI. Another method is to compute the energy of the attractor, knowing that the parasites have generally higher energy levels than the prototypes (Personnaz et al. 1986). But the architecture of our chip is such that the neurons' potentials are computed locally in each of them, without external broadcasting. Therefore an energy computation would have required a substantial increase of the design complexity. We propose a new way to implement what we call the Parasite Detection Mechanism (PDM). Our method requires the addition of a simple device that increases very slightly the size and the complexity of the network, with a good compromise between this increase of complexity and the statistical performance of the detection. The rest of the paper develops as follows: Section 2 introduces the basic concepts of this technique. Simulation results are presented in Section 3. A hardware implementation of PDM is briefly described in Section 4. Section 5 explains how to use this mechanism to improve retrieval. Finally, Section 6 describes a multinetwork classifier for handwritten character recognition, composed of several Hopfield networks that use PDM.

2 Automatic Parasite Detection Mechanism
The basic idea of PDM is to store some redundant data in the network during the learning phase and to verify it after each convergence. Storing
redundant data in a Hopfield network has already been done, but with significantly different goals and strategies (Personnaz 1986). The important improvement of our proposal is to automatically generate and check this redundant data. More precisely, let a be a prototype to be stored, and C() a so-called "code function." We propose to store the tuple (a, C(a)) in the network. During retrievals, when the network converges to a tuple (a, b), a will be declared to be a genuine prototype if b = C(a). We can already notice that, if d bits are devoted to the code size, the probability that a parasite has the good code, and hence is recognized as a prototype, is approximately 2^{−d}, corresponding to a random drawing of a d-bit code. Therefore, d must be large enough to make this probability as low as needed. On the other hand, free evolution in Hopfield networks can be viewed as a nearest-prototype search; the more neurons are devoted to C(a), the more the search can be disturbed by these nonsignificant bits. Furthermore, increasing the network size might become expensive. Therefore, a compromise has to be found on d.
2.1 Choice of the Code Function. Ideally, the code function C should provide unrelated codes over different stimuli. In practice, the code function cannot be assured to be injective because d must be notably smaller than the size of the input vector (to conserve space). Hence, C() must have a behavior similar to an error detection code function, meaning that changes in the vector state v must cause a major change in the C(v) value. Moreover, somewhat depending on the learning rule used, parasites are often formed of thresholded linear combinations of prototypes (Amit et al. 1985), so PDM should be checked against this property. The polynomial remainder functions are good candidates. The idea, based on the well-known Cyclic Redundancy Codes (Peterson and Weldon 1961), is to associate a polynomial in Z/2Z[X] of degree at most N − 1 to each N-sized state vector. The ψ function maps a −1 (respectively +1) state of the neuron i to X^{i−1} (respectively 0). We can define ψ by

    ψ_N(v) = ⊕_{i : v_i = −1} X^{i−1}
One can note that ψ_N is a homomorphism from the state vector space {−1, +1}^N with the term-to-term multiplication (noted ⊗) to the polynomial space (Z/2Z)_{N−1}[X] with the addition modulo 2 (noted ⊕). The code function C, for some fixed polynomial Q[X] of degree d, is defined as

    C(v) = ψ_d^{−1}(ψ_N(v) mod Q[X])
Q[X] can easily be chosen so that the mechanism rejects the prototype's opposite (always a stable attractor in Hopfield nets), that is, such that C(v) ≠ −C(−v). Let (−1) be the N-sized vector of all −1, and

    R[X] = ψ_N((−1)) = Σ_{i=0}^{N−1} X^i

Since −v = v ⊗ (−1), we have

    C(−v) = ψ_d^{−1}(R[X] mod Q[X]) ⊕ C(v)

We just have to choose some Q[X] such that (R[X] mod Q[X]) ≠ Σ_{i=0}^{d−1} X^i.
As ψ_N is a homomorphism, we have C(a ⊗ b) = C(a) ⊕ C(b). So our mechanism cannot distinguish genuine prototypes from products of prototypes (with the ⊗ product). However, there is no direct relation between the ⊗ combinations and the thresholded linear combinations of prototypes. We have made simulations establishing that none (or very few) of the multiplications of prototypes is a stable attractor, provided that we are working with a prototype set allowing the iterated Widrow-Hoff learning rule to work, that is, not too many prototypes, not too strongly correlated. The following results confirm this assertion.
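A minimal software sketch of the code function (our illustration; helper names are ours, and polynomials over Z/2Z are represented as Python integers). Q[X] = X^6 + X + 1 is the polynomial used in the authors' chip (Section 4):

```python
import numpy as np

def psi(v):
    """Map a ±1 state vector to a polynomial over Z/2Z, represented as an
    integer whose bit i is the coefficient of X**i (bit i = 1 iff v[i] == -1)."""
    return sum(1 << i for i, s in enumerate(v) if s == -1)

def poly_mod(p, q):
    """Remainder of p modulo q over Z/2Z (integers as coefficient vectors)."""
    dq = q.bit_length() - 1
    while p.bit_length() - 1 >= dq:
        p ^= q << (p.bit_length() - 1 - dq)
    return p

def code(v, q=0b1000011):   # Q[X] = X^6 + X + 1, giving a 6-bit code
    """C(v) = psi_d^{-1}(psi_N(v) mod Q[X]), returned as a ±1 vector of length d."""
    d = q.bit_length() - 1
    r = poly_mod(psi(v), q)
    return np.array([-1 if (r >> i) & 1 else 1 for i in range(d)])
```

One can check the homomorphism property numerically: the code of the term-to-term product of two state vectors equals the term-to-term product of their codes.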
3 Statistical Results

The success rate of PDM has been assessed by computer simulations on a Connection Machine 2 and on an Alliant FX-2800. Results are obtained by the following modus operandi: A set of p prototypes π¹, …, π^p is chosen at random and learned by the network, with the Widrow-Hoff learning rule iterated to approximate the pseudoinverse coefficients. Subsequently, we select a random vector σ at a known Hamming distance D_i from the nearest prototype π. The network relaxes from this stimulus (σ, C(σ)) up to an attractor (a, b).
• If a = π, the retrieval is said to be successful: this is a "user" point of view.
• If C(a) = b, the attractor is said to be accepted: this is the output of PDM.
• If both conditions do not agree, an error is reported: this is either an erroneous rejection or an erroneous acceptance.
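A minimal NumPy sketch of this modus operandi (our illustration, reusing the code() helper from the previous sketch; the iterated Widrow-Hoff update approximates the projection rule):

```python
import numpy as np

rng = np.random.default_rng(0)
N_DATA, P = 58, 15                                  # data neurons, prototypes

def train_widrow_hoff(patterns, epochs=200, eta=0.5):
    """Iterated Widrow-Hoff rule; iterating approximates the pseudoinverse rule."""
    N = patterns.shape[1]
    W = np.zeros((N, N))
    for _ in range(epochs):
        for xi in patterns:
            W += eta * np.outer(xi - W @ xi, xi) / N
    return W

def relax(W, state, max_iters=100):
    """Synchronous relaxation until a fixed point (or the iteration limit)."""
    for _ in range(max_iters):
        new = np.where(W @ state >= 0, 1, -1)
        if np.array_equal(new, state):
            break
        state = new
    return state

protos = rng.choice([-1, 1], size=(P, N_DATA))
full = np.hstack([protos, [code(p) for p in protos]])    # tuples (pi, C(pi))
W = train_widrow_hoff(full)

stim = full[0].copy()
stim[rng.choice(N_DATA, size=5, replace=False)] *= -1    # Hamming distance 5
a = relax(W, stim)
success = np.array_equal(a[:N_DATA], protos[0])          # the user's point of view
accepted = np.array_equal(code(a[:N_DATA]), a[N_DATA:])  # PDM's output
```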
Several simulations have been performed to explore the success, acceptance, and error rates, for different values of N, p, d and different ways of choosing a prototype set. Figure 1 shows typical data, obtained from the simulations. The results are displayed as follows: D_i is the normalized distance of the stimulus to the nearest prototype, Success is the rate at which the network relaxes to that prototype, and Accept is the rate at which the mechanism accepts the attractor as a prototype.
Figure 1: Typical success curve, for N = 100, p = 20, and d = 5.
The curve is the average of the relaxations obtained from 16 randomly chosen prototype sets. The total success rate of one prototype set learned by the network can be assessed by the integral of this curve. Figure 2 shows the total success rates for several values of the load factor α = p/N of the network (each point is the average over several prototype sets). The aspect of this diagram can be viewed as a good semiquantitative characterization of the retrieval behavior of a Hopfield network in normal operation. We will use it in what follows to check possible degradation of the network properties due to the introduction of the code. Figure 3 shows the total retrieval success, without PDM (○, N = 100), with PDM (•, N = 100, d = 8), and with PDM and a reduced synaptic matrix¹ (△, N = 100, d = 8, and a matrix of size 108 × 100). One can see that the mechanism does not seem to cause any significant changes on the retrieval performances, since the various results lie approximately on the same curve. This result is important because it means that adding a mechanism such as PDM to an existing associative network does not change its retrieval performance. Figure 4 shows the average identification error measured over several simulations, for values of d ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, N = 100, α = 0.3. It is difficult to carry out simulations much further because the meaningful events (i.e., the errors of the mechanism) tend to become too sparse.

¹We call "reduced matrix" the synaptic matrix with the d last columns zeroed, after each learning step. This means that there is no influence of the code on the data parts of the full prototypes or stimuli.
Figure 2: Total success rates versus the load factor α.
Figure 3: Comparing the total success rates: (○) without code, (•) with a code of size d = 8, (△) with a code of size d = 8 and a reduced matrix.
The diagrams show that the maximum error observed decreases like 2^{−d}. These results show that PDM effectively makes the distinction between prototypes and parasites, that one can tailor the size of the system to the quality needed by an application, and that there is no significant effect on the basic associative properties of the Hopfield network.
Figure 4: Variations of the PDM identification error rate with the normalized initial distance D_i, for values of the code size d ∈ {1, …, 10} (note the logarithmic scale).

4 The VLSI Implementation
We have implemented the Hopfield network with PDM in a VLSI chip, for the first time in 1989. In Gascuel et al. (1991) and Gascuel (1991), we describe this implementation (CMOS 1.2 μm, 1 cm², 420,000 transistors) that holds 64 binary neurons, 4096 synaptic weights, the relaxation, and the iterated Widrow-Hoff learning rule. We have chosen Q[X] = X^6 + X + 1, leading to a 6-bit code and 58 data neurons when PDM is activated (so the erroneous identification or rejection rate is less than 2^{−6}, or ≈ 2%). A systolic linear ring (Kung and Hwang 1988; Weinfeld 1989) is used to connect the neurons, each one having an arithmetic unit (12 bits wide), 64 weights (9 bits wide), and some convergence detection glue. With a cycle time of 80 nsec, a typical prototype set (15 moderately correlated vectors) is learned in ≈ 1 msec (assuming off-chip prototypes arrive without wait state), and one retrieval (3 iterations on average) takes ≈ 20 μsec. On this chip, neural states +1 and −1 are coded by the sign bit (0 for +1, and 1 for −1), directly using the mapping function ψ used for C(). Polynomial remainders are easily implemented in hardware using "linear feedback shift registers" (Peterson and Weldon 1961). These intrinsically serial structures match very well the systolic loop semiparallel architecture of the network.
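A sketch of how the remainder is computed serially, in the spirit of a linear feedback shift register (our illustration; the hardware details of the chip are more involved):

```python
def lfsr_remainder(bits, q=0b1000011):
    """Bit-serial polynomial remainder modulo Q[X] = X^6 + X + 1, computed the
    way a linear feedback shift register does, one bit per clock cycle.
    `bits` are polynomial coefficients fed highest degree first (with the
    sign-bit coding of the chip: bit 1 for state -1, bit 0 for state +1)."""
    d = q.bit_length() - 1
    reg = 0
    for b in bits:
        reg = (reg << 1) | b
        if (reg >> d) & 1:      # register overflow: feed back (XOR with Q)
            reg ^= q
    return reg                  # the d remaining bits are the code
```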
5 Using PDM to Improve Retrieval Success Rate
A possible strategy to increase the retrieval success rate is to add a random offset to the potential of each neuron in the network. This has been shown to improve the performance of the network (Peretto and Niez 1986; Amit et al. 1985). But it is necessary to compute a pseudorandom number for each neuron at each step of the relaxation process, which would have made the chip design much more complex. The ability to detect parasites has led to a retry strategy when a relaxation pass leads to a parasite. As we have already mentioned, a Hopfield network always finds an attractor (provided that a correct learning rule is used). When a parasite is detected, it is possible to restart a relaxation with a slightly different stimulus. This may be viewed as a random search in the neighborhood of the input stimulus. It succeeds if there is an attractor basin close enough. The search field should not be made too large: first, because a larger neighborhood takes more time to cover reasonably, and second, because a large neighborhood may contain patterns leading to other prototypes. This strategy using PDM offers good performance and needs few random bits. Figure 5 shows the success rates obtained with such a strategy when allowing no retry or one to five retries. The gain can reach up to 25%. This result shows that PDM may indeed be used to significantly improve the retrieval success rate of a Hopfield neural network. The overall cost of this strategy is low: a small CRC generator for PDM and a low number of retries are sufficient. Furthermore, the computations used have simple and fast implementations both in hardware and software.
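The retry loop itself is simple. The sketch below assumes two hypothetical helpers, relax() (running the network to an attractor) and is_parasite() (the PDM check), and perturbs at most 4 bits per retry, as in Figure 5; all names are illustrative.

```python
import random

def retrieve_with_retries(stimulus, relax, is_parasite,
                          max_retries=5, max_flips=4):
    """Relax; if PDM flags a parasite, retry from a slightly perturbed stimulus."""
    attractor = relax(stimulus)
    for _ in range(max_retries):
        if not is_parasite(attractor):
            return attractor                  # converged to a prototype
        perturbed = list(stimulus)
        for i in random.sample(range(len(stimulus)), max_flips):
            perturbed[i] *= -1                # flip a few of the ±1 neuron states
        attractor = relax(perturbed)
    return attractor                          # may still be a parasite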
6 An Application of PDM: Handwritten Digit Recognition
We propose a multinetwork architecture for classification, composed of binary Hopfield networks using PDM. In this approach, the number of Hopfield networks used is equal to the number of classes of the problem (N_ω); there is a Hopfield network H_i dedicated to each class ω_i. To classify an input pattern o, it is fed to all the N_ω networks, and each delivers a Boolean response signal d_i to indicate whether or not it considers o a member of its goal class ω_i. For this purpose, each network H_i learns a prototype set Π_i composed of some representative patterns of its goal class ω_i. Let o'_i be the output attractor when o is presented to H_i. If we suppose that convergence toward one of the prototypes in Π_i is a good indication that the pattern belongs to the class ω_i, that is to say

o'_i ∈ Π_i  ⟺  o ∈ ω_i    (6.1)
then the desired output d_i can be defined as the Boolean signal d_i = (o'_i ∈ Π_i). PDM is the key feature that allows us to generate this signal d_i efficiently, with a low error rate. Some ambiguous input patterns may be recognized by more than one network. In this case, an algorithm called the arbiter is needed to take an appropriate action: either determine the class of the pattern or start another process that gives more information about the competing classes.

Figure 5: The effect of retries on the success curve, for a maximum of 0 to 5 retries (64 neurons, no more than 4 perturbed bits).

6.1 Artificial Parasites. Learning only some prototypes of the goal class ω_i causes some problems:
• Usually, the prototypes belonging to the same class are correlated, and it is known that correlated prototypes decrease the performance of the Hopfield network (Personnaz et al. 1986), even when the pseudoinverse rule is used.
• As each Hopfield network is responsible only for its goal class, only the patterns of the goal class should converge toward one of the prototypes (relation 6.1). Therefore the patterns of the other classes should converge to some parasite attractors. As there is no direct control on the number and location of the parasites created during learning, their attraction basins may only partially cover the space of the other classes.
The tests realized with different prototype sets show that these problems cut down the performance of the networks. To overcome this, it is necessary to create some other parasite attractors that can attract the patterns of the other classes. We propose to train each network H_i with a pattern set Θ_i, composed of some patterns not belonging to the goal class, in addition to the prototypes in Π_i. These new attractors should not activate the recognition signal d_i and must be detected as parasites. As a CRC code is used in the parasite detection mechanism, the easiest way to differentiate these attractors from prototypes is to append a false CRC code to these attractors (called artificial parasites hereafter) during learning.

6.2 Prototypes and Artificial Parasites Selection. The performance of this classifier depends strongly on the prototypes and artificial parasites chosen for each network. In fact, the prototype set Π_i must roughly cover the space of the goal class, and the artificial parasite set Θ_i must cover the space of all the other classes, as well as possible. As this is essentially a vector quantization task, we employ an LVQ network (Kohonen 1988) to determine the Π_i and Θ_i sets. After choosing N_Πi (the number of prototypes in Π_i) and N_Θi (the number of artificial parasites in Θ_i), an LVQ network with N_Πi + N_Θi neurons is trained with the patterns of the learning database. We have tested the two following strategies for choosing Π_i and Θ_i for a network H_i (the second strategy is sketched in code after the list):
1. Using LVQ with N_ω classes. The N_Πi prototypes are taken from the neurons corresponding to class ω_i, and the N_Θi artificial parasites are taken from the neurons corresponding to all the other classes.
2. Using LVQ with only two classes: the goal class ω_i and a class containing all the other patterns. Here, the N_Θi artificial parasites are taken from the neurons corresponding to the second class.
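For concreteness, here is a minimal sketch of the second strategy, assuming a standard LVQ1 update and 0/1 image patterns binarized to ±1 for the Hopfield network; all names and parameter values are illustrative.

```python
import numpy as np

def lvq_select(patterns, labels, goal, n_proto, n_para,
               epochs=20, lr=0.05, rng=np.random.default_rng(0)):
    """Two-class LVQ1 (goal class vs. rest); returns (prototypes, parasites)."""
    is_goal = (labels == goal).astype(int)            # 1 = goal class, 0 = rest
    refs = np.vstack([patterns[labels == goal][:n_proto],
                      patterns[labels != goal][:n_para]]).astype(float)
    ref_lab = np.array([1] * n_proto + [0] * n_para)
    for _ in range(epochs):
        for j in rng.permutation(len(patterns)):
            w = np.argmin(np.linalg.norm(refs - patterns[j], axis=1))
            sign = 1.0 if ref_lab[w] == is_goal[j] else -1.0
            refs[w] += sign * lr * (patterns[j] - refs[w])   # LVQ1 update rule
    binarized = np.where(refs >= 0.5, 1, -1)          # binarize reference vectors
    return binarized[:n_proto], binarized[n_proto:]
```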
After this procedure, the N_Πi + N_Θi patterns given by LVQ are binarized and learned by the Hopfield network H_i, and the process is repeated for all networks H_i. The necessary binarization of the LVQ reference vectors may cause some loss of performance. The task of the artificial parasites is to attract the patterns of all the other classes, but no distinction is needed among them. Therefore the second method, being less constrained, seems to be better suited to this problem, which is confirmed by the tests. The choice of N_Πi and N_Θi is important. If they are chosen too small, there may not be enough attractors in the space of the classes. On the other hand, if they are chosen too large, the patterns can overload the Hopfield network and cut down its performance.

6.3 The Arbiter. When several networks activate their recognition signals simultaneously, an arbiter is needed to select a winner among them. Two simple arbiters are:
Figure 6: Architecture of the multinetwork classifier used for handwritten digit recognition.
• The network that gives the smallest Hamming distance between the input pattern and the converged output pattern is considered the winner.

• Several slightly noised versions of the pattern, as well as the original one, are presented to the networks, and the network that recognizes it most frequently wins. The noise added must not change too many bits; otherwise the disturbed pattern may become too different from the original to be recognized. This mechanism was described in Section 5.
In the tests, a combination of these two arbiters is used: the network that has most frequently given the nearest output pattern wins. A more appropriate choice of distorted versions (such as slightly translated, thinned, or rotated images) might produce better results.
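A hedged sketch of the resulting classification loop, combining the noised-versions vote with the Hamming-distance criterion; relax, is_parasite, and perturb are hypothetical helpers in the spirit of the retry sketch of Section 5.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def classify(pattern, networks, relax, is_parasite, perturb, n_votes=5):
    """networks maps a class label to its Hopfield network H_i."""
    votes = Counter()
    trials = [pattern] + [perturb(pattern) for _ in range(n_votes - 1)]
    for trial in trials:
        best, best_dist = None, None
        for label, net in networks.items():
            out = relax(net, trial)
            if is_parasite(net, out):
                continue                      # this network rejects the pattern
            d = hamming(trial, out)
            if best_dist is None or d < best_dist:
                best, best_dist = label, d
        if best is not None:
            votes[best] += 1                  # nearest accepting network votes
    return votes.most_common(1)[0][0] if votes else None   # None = rejected
```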
6.4 Experiments. As an example, we have tested this classifier on handwritten digit recognition (Fig. 6). Efficient methods for handwritten digit recognition have already been suggested (Le Cun et al. 1989; Simard et al. 1992); we do not intend to suggest a better method, but only to show that ours gives fairly acceptable results with rather low overhead on this classical problem. This makes us confident about the general approach that we use for implementing multinetwork systems based on our integrated network. The database² contained 8700 patterns, each one being a 16 × 16 binary image of a size-normalized digit. The database was divided into a learning set of 4000 patterns and a test set of 4700 patterns. The classifier included 10 Hopfield networks of 266 neurons (256 bits for the image + 10 bits for the CRC) trained by the Widrow-Hoff learning rule. The results given are all for the combined arbiter, which gave the best results. With only 8 prototypes and 6 artificial parasites per class, 92.1% of the characters were recognized correctly and 3.4% were rejected. With 16 prototypes and 14 artificial parasites, 93.1% of the characters were recognized correctly and 3.1% were rejected. The results are satisfying, taking into account the relatively low number of prototypes and artificial parasites stored in each network. The number of prototypes and artificial parasites cannot be increased much because of the limited capacity of the Hopfield networks.
7 Conclusion
The Parasite Detection Mechanism presented in this paper is rather easy to control and cheap to implement in hardware. Furthermore, the information that it provides (convergence toward a prototype or a parasite) may be used to construct enhanced relaxation strategies, as shown in Section 5. This mechanism has also been the key feature in the use of Hopfield networks in a multinetwork architecture. By using artificial parasites and an LVQ network for the selection of the prototypes and the artificial parasites, this architecture has been applied to the recognition of handwritten digits. This multinetwork architecture has the advantage of modularity: modifications or optimizations of a particular network, or in some cases the addition of a new class, can be done separately without much disturbing the other networks. Other applications may also use PDM to trigger reprocessing or even learning in other kinds of multinetwork architectures. We believe that a mechanism such as PDM is an efficient way to help implement autoadaptivity in neural chips and, further, in higher-level neural architectures.
²Supplied by ESPCI (Knerr et al. 1992).
Acknowledgments

This work received support from DRET contract 87-187, the Groupement Circuits Intégrés Silicium, and the Projet de Recherche Coordonné Architectures Nouvelles de Machines. We want to thank A. de Maricourt for the preliminary work done on this subject, D. Descause and E. Lemoine for the simulations done on the Connection Machine, and L. Montoliu for the simulations on the Alliant computer. We are grateful to the referees for their fruitful comments.
References

Amit, D., Gutfreund, H., and Sompolinsky, H. 1985. Spin glass model of neural networks. Phys. Rev. A 32, 1007-1018.
de Bollivier, M., Gallinari, P., and Thiria, S. 1991. Cooperation of neural nets and task decomposition. Int. Joint Conf. Neural Networks II, 573-576.
Diederich, S., and Opper, M. 1987. Learning of correlated patterns in spin-glass networks by local learning rules. Phys. Rev. Lett. 58, 949-952.
Gascuel, J-D. 1991. Intégration VLSI d'un réseau de Hopfield de 64 neurones: modèle formel, conception et réalisation. Ph.D. dissertation, Université Paris-Sud Orsay, École Polytechnique.
Gascuel, J-D., Weinfeld, M., and Chakroun, S. 1991. A digital CMOS fully connected neural network with in-circuit learning capability and automatic identification of spurious attractors. IEEE Conf. Euro ASIC, 247-250.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Johannet, A., Personnaz, L., Dreyfus, G., Gascuel, J-D., and Weinfeld, M. 1992. Specification and implementation of a digital Hopfield-type associative memory, with on-chip training. IEEE Trans. Neural Networks 3(4), 529-539.
Knerr, S., Personnaz, L., and Dreyfus, G. 1992. Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. Neural Networks 3(6), 962-968.
Kohonen, T. 1988. Learning vector quantization. Neural Networks 1, 303.
Kung, S. Y., and Hwang, J. N. 1988. Parallel architectures for artificial neural nets. IEEE Int. Conf. Neural Networks II, 165-172.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Back-propagation applied to handwritten zip code recognition. Neural Comp. 1(4), 541-557.
Peretto, P., and Niez, J. J. 1986. Stochastic dynamics of neural networks. IEEE Trans. Syst. Man Cybern. 16, 73-83.
Personnaz, L. 1986. Étude de réseaux de neurones formels: conception, propriétés et applications. Ph.D. dissertation, Université Pierre et Marie Curie, Paris 6.
Personnaz, L., Guyon, I., and Dreyfus, G. 1986. Collective computational properties of neural networks: New learning mechanisms. Phys. Rev. A 34, 4217-4228.
Peterson, W. W., and Weldon, E. J., Jr. 1961. Error-Correcting Codes. MIT Press, Cambridge, MA.
Simard, P., Victorri, B., Le Cun, Y., and Denker, J. 1992. Tangent prop, a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 895-903. Morgan Kaufmann.
Weinfeld, M. 1989. A fully digital CMOS integrated Hopfield network including the learning algorithm. In VLSI for Artificial Intelligence, J. G. Delgado and W. R. Moore, eds., pp. 769-778. Kluwer Academic Publishers. International Workshop on VLSI for Artificial Intelligence, Oxford, July 1988.
Weinfeld, M. 1990. Integrated artificial neural networks: components for higher level architectures with new properties. In Neurocomputing: Algorithms, Architectures and Applications, NATO Advanced Workshop on Neurocomputing, F. Fogelman and J. Herault, eds., pp. 723-730. Springer-Verlag, Berlin.
Received April 6, 1993; accepted December 15, 1993.
Communicated by Rodney Goodman
A Novel Design Method for Multilayer Feedforward Neural Networks

Jihong Lee
Department of Mechatronics Eng., Chungnam National University, Kung-dong, Taejon 305-764, Korea

A multilayer feedforward neural network and a design method that takes the distribution of given training patterns into consideration are proposed. The size of the network, initial values of interconnection weights, and parameters defining the nonlinearities of processing elements are determined from a specially selected portion of the given training patterns, which is called a set of feature points. With these initial settings, the performance of the network is further improved by a modified error backpropagation learning process. It is shown in several examples that the proposed model and the design method are capable of rapidly learning the training patterns compared to conventional multilayer feedforward neural networks with random initialization techniques.

1 Introduction

Since artificial neural networks were devised, the multilayer feedforward neural network has emerged as one of the most active and fruitful areas for research in theory and application. This popularity primarily revolves around the ability of the backpropagation network to learn complicated multidimensional mappings (Rumelhart et al. 1986) without intervention of operators. Hecht-Nielsen (1989), Stinchcombe and White (1989), and Hornik et al. (1989) showed theoretically that multilayer feedforward networks with as few as one hidden layer, no squashing at the output layer, and arbitrary sigmoid activation function or general nonlinear function at the hidden layer are universal approximators: they are capable of arbitrarily accurate approximation to arbitrary mappings, provided sufficiently many hidden units are available. If one observes the basic idea of the analysis of Hecht-Nielsen (1989), one can see that the analysis is based on approximation by Fourier series: a function can be approximated by a linear combination of base functions (sine and cosine functions in the Fourier series) defined on the corresponding domain of the function. Also, Hecht-Nielsen showed that

Neural Computation 6, 885-901 (1994) © 1994 Massachusetts Institute of Technology
the weights between the input layer and hidden layer determine the characteristics of the activation functions of the processing units in the hidden layer, and each weight between the hidden layer and output layer determines the strength of the output of a hidden unit in the summation at the output units. Extending these concepts, Sartori and Antsaklis (1992) recently devised a multilayer feedforward network that systematically stores given information without generalization, provided the prior information is available in a certain form (namely, input-output data points and relations between the data points). In their work, two sigmoidal hidden units are combined to give a bell-type activation function that gives nonzero values for a portion of the input domain, and the outputs of the bell-type activation units pass through multiplying units to be summed in output units. The required number of hidden units of the network, however, is the same as the number of elements in a given training set, since they did not utilize the advantages of learning. In many practical cases, the network may not be able to have such a large number of units at the hidden layer. In this paper, we propose a multilayer feedforward neural network, each layer of which has a unique role in signal processing. While the proposed structure keeps the basic concepts of the original error backpropagation network (Rumelhart et al. 1986), the role of each unit is newly assigned for the network to be more suitable in approximating the given input-output relations than conventional multilayer networks. Under the assumption that the prior information is available in appropriate form, special points called feature points are selected from the given input-output pairs, which play important roles in organizing the whole training pattern. The feature points are composed of five classes: primary points, secondary points, extension points, boundary points, and corner points. The idea in this method is motivated by the observation that when studying large amounts of information or data, one usually sorts the given information according to its importance and correlations among the data and then memorizes the information with graded attention. Compared with the method of Denoeux and Lengellé (1993), in which initializing backpropagation networks with one hidden layer was also proposed, our method deals with initialization as well as the structure of the neural network and hence can handle more general cases. Our method differs from their work in the following points: first, the structure of the network is designed according to the data structure (dimension) of the input values to be handled, and the size of the network is determined by the number of feature points (the relations between the size and structure and the training pattern are explicitly described in our work); second, a multiplication layer is introduced to improve the function-duplicating capability; third, the concept of feature points, part of which was defined and called reference points in Denoeux and Lengellé (1993), is generalized according to their roles in organizing the training pattern; and fourth, all parameters (i.e., not only weights but also thresholds and slope values) are initialized before training.
Note that reference points in their work are the same as primary points in our work, which is only one of the five classes of feature points. To be specific, the size of the network, that is, the number of layers as well as the number of processing units in the layers, is determined from the number of feature points, while the initial values of the interconnecting weights and the characteristics of the sigmoid activation functions of processing units are determined from the relations between feature points. As will be shown later, the learning speed is drastically increased and the generalization capability is improved by the proposed method compared to conventional multilayer feedforward neural networks with random initialization techniques. In Section 2, we briefly review the concept of the design method for a multilayer feedforward network without learning. Single-input cases of the novel design method are described in detail in Section 3, while two-input cases are handled in Section 4. In Section 5, we extend the concept to multidimensional cases, and offer concluding remarks in Section 6.

2 Network Design without Generalization
For completeness, we review the concept of the design method for a class of neural network without generalization proposed in Sartori and Antsaklis (1992), assuming m inputs and one output. Denote an input of the training patterns as v(j₁, j₂, ..., j_m) = [v₁(j₁), v₂(j₂), ..., v_m(j_m)], and assume that they are arranged in ascending order as follows:

v_k(j_k) ≥ v_k(j_k − 1),  j_k = 2, ..., p_k,  k = 1, ..., m    (2.1)
where p_k is the number of all different values of the kth component in the training set. Also, let d(j₁, ..., j_m) be the desired output corresponding to the input v(j₁, ..., j_m). Then, each element of the training pattern, which has P = ∏_{k=1}^{m} p_k elements, can be described as [v(j₁, ..., j_m), d(j₁, ..., j_m)]. Here, we define a gaussian-type function by combining two sigmoid functions as follows:

g_kj(u) = f_k,j−1(u) − f_kj(u)    (2.2)

where

f_kj(u) = 1 / (1 + exp[−s_kj(u − c_kj)])

and the thresholds, c_kj's, are determined by

c_kj = [v_k(j) + v_k(j + 1)]/2,  j = 1, ..., p_k − 1
For boundary functions, we define

f_k0(u) = 1,  f_kp_k(u) = 0

for all u. Note that the term s is related to the slope of the semilinear part of the sigmoid function, and the term c is related to the point of symmetry of the sigmoid function. By combining the two sigmoid functions, we can construct a function that responds as a gaussian function. Assigning such functions as 2.2 to each hidden unit, we sum the output of each hidden unit at the output unit.
Let all the weights between the input and hidden layer be set to one, and the weights between the hidden layer and output unit be set by

w_{j₁...j_m} = d(j₁, ..., j_m)
Then, choosing appropriate values for the slopes s_kj's, we have a network that reconstructs or approximates the given mapping described by the input-output pairs. Extending the concept described in this section to the multiple-output case is straightforward, hence it is omitted here.
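To make the review concrete, here is a minimal sketch of the one-input case of this construction, assuming the difference-of-sigmoids form of the bell units given above (with boundary functions fixed at 1 and 0): one bell per training point, and output weights equal to the desired outputs. All names and the slope value are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_network(v, d, s=10.0):
    """v: sorted 1-D training inputs, d: desired outputs; returns a callable net."""
    c = (v[:-1] + v[1:]) / 2.0                  # midpoint thresholds
    def net(u):
        left = np.concatenate(([1.0], sigmoid(s * (u - c))))   # boundary value 1
        right = np.concatenate((sigmoid(s * (u - c)), [0.0]))  # boundary value 0
        bells = left - right                    # one bell per training point
        return np.dot(d, bells)                 # weighted sum at the output unit
    return net

v = np.linspace(0.0, 3.1, 32)
d = np.sin(2 * v) * np.cos(v) + v / 3.0         # the pattern of equation 3.14
net = build_network(v, d)                       # net(v[j]) is close to d[j]
```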
3 Design for a Single-Input Neural Network Incorporating Training Patterns

We have reviewed the concept of a network that does not require any learning in approximating the given input-output relations. The required size of the network is, however, proportional to the number of elements of the training pattern, which may be a critical problem for cases where a huge number of input-output training pairs are given. Combining the concept of the network of Section 2 and the learning capability of the general backpropagation network (BPN), we propose a method of designing a class of multilayer feedforward neural networks incorporating prior knowledge about the training patterns. The design procedure includes determining the size of the network, initializing the interconnecting weights, and defining the characteristic of each processing element. At first, the whole set of training patterns is examined, and points called feature points are collected. With the feature points, the network for one input and one output is designed, which is composed of one input layer, two hidden layers, and one output layer. The number of units in the first hidden layer equals the number of feature points, while the second hidden layer has two units. Note that the number of input units, the number of output units, and the number of units in the second hidden layer are independent of the training patterns. In the network, each
component plays a unique role: (1) the interconnections between the input layer and the first hidden layer determine the virtual slopes of the semilinear parts of the sigmoid functions in the first hidden layer, (2) the threshold of each hidden unit determines where to locate each sigmoid function in the input domain, (3) the weights between the hidden layers combine two sigmoid functions to give bell-shaped functions, and (4) the weights between the second hidden layer and the output unit determine the weighted summing operation at the output unit.

3.1 Feature-Point Extraction. Let the training input-output pairs [v(j), d(j)] be given and arranged as follows:
v(j) > v(j − 1),  j = 2, ..., p    (3.1)

To make the explanation clear, assume the training patterns are equidistant, as

v(j) − v(j − 1) = constant,  j = 2, ..., p    (3.2)
Then, call the kth points satisfying the following condition the feature points of R¹ → R¹, where R¹ is one-dimensional Euclidean space:

|d(k + 1) + d(k − 1) − 2d(k)| ≥ δ > 0,  k = 2, ..., p − 1    (3.3)
Note that the left-hand side of 3.3 calculates the absolute value of the approximated second derivative of the pattern at [v(k), d(k)]. Including the first training pair [v(1), d(1)] and the last training pair [v(p), d(p)] in the feature points, we denote each feature-point pair as

[v̄(i), d̄(i)],  i = 1, ..., p̄    (3.4)
where p̄ is the number of feature points. Note that the following inequalities are satisfied:

v̄(i) > v̄(i − 1),  i = 2, ..., p̄    (3.5)
An example of such a searching process and its result is shown in Figure 1a, where five feature points are selected. Note that adjacent training points are interconnected with straight lines in the figure for illustration.
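A minimal sketch of this extraction step (condition 3.3 plus the two endpoints), assuming sorted, equidistant samples; the threshold value δ is illustrative.

```python
import numpy as np

def feature_points(v, d, delta=0.05):
    """Keep points with a large approximate second derivative, plus both endpoints."""
    keep = [0]                                   # always keep the first pair
    for k in range(1, len(v) - 1):
        if abs(d[k + 1] + d[k - 1] - 2 * d[k]) >= delta:
            keep.append(k)
    keep.append(len(v) - 1)                      # always keep the last pair
    return v[keep], d[keep]

v = np.linspace(0.0, 3.1, 31)
d = np.sin(2 * v) * np.cos(v) + v / 3.0          # the pattern of equation 3.14
v_bar, d_bar = feature_points(v, d)
```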
3.2 Design for the Network. From now on, we describe the method of constructing a feedforward network approximating the given training input-output relations. The network considered in this section is composed of an input layer, two hidden layers, and an output layer. Since we assume in this section that input and output are scalar values, there is one input unit and one output unit. There are two hidden layers: the number of processing units in the first hidden layer is p̄, that is, the number of feature points, and the number of
Figure 1: Example of one-dimensional input: (a) selected feature points from training pattern, (b) effects of slope value s, (c) output of the network, (d) error histories during learning.

processing units in the second hidden layer is two. Note that only p̄ − 1 units of the first hidden layer are connected to the input unit, while all of the processing units of the first hidden layer are interconnected toward the second hidden layer. The activation functions of all processing units, except the units on the first hidden layer, are linear functions with slope one, say

f(net) = net    (3.6)
The activation functions of the units on the first hidden layer are described by

f_j(net) = 1 / (1 + exp[−s_j(net − c_j)])    (3.7)
where

c_j = [v̄(j) + v̄(j + 1)]/2,  j = 1, ..., p̄ − 1    (3.8)
The weights between the input unit and the first hidden layer are all set to one, that is,

w¹_i = 1,  i = 1, 2, ..., p̄ − 1    (3.9)
The weights between the first hidden layer and the second hidden layer are initialized according to rules 3.10-3.12, where the subscript "0" denotes the unit that is not connected to the input unit. The weights between the second hidden layer and the output layer are initialized to the value of one, that is,

w³_j = 1,  j = 1, 2    (3.13)
With the weights determined by equations 3.9-3.13, the two hidden layers constitute gaussian-shaped activation functions that act as gate functions for the input. Even though the conditions that the slopes s_j of the sigmoid activation functions in the first hidden layer should satisfy are summarized in Sartori and Antsaklis (1992), the values of the slope can easily be determined by a simple trial and error procedure. The effect of the slope on the mapping is shown in Figure 1b for several values of the slope. Note that the outputs of the network shown in Figure 1b are obtained without training. Since the weights between the input layer and the first hidden layer determine the actual slopes of the sigmoidal activation functions of the units in the first hidden layer, and they are adjusted in the learning phase, the initial selection of the slope values does not cause a critical problem for the performance of the network. The learning process with all training input-output patterns follows the initialization process described so far. An illustrative example with the training pattern described by

f(x) = sin(2x) cos(x) + x/3    (3.14)

and appearing in Figure 1a is shown in the following. Note that adjacent points in the figure are interconnected with straight lines for illustration. The result after 10,000 iterations of general backpropagation learning with
Figure 2: Example with noisy pattern: (a) output of the networks, (b) error histories during learning.

an initial slope of 5 and a learning rate of 0.07 is shown in Figure 1c. In this simulation, 31 uniform samples between 0 and 3.1 were used for training. If we compare the result with the case where a random initialization method is applied to the same network (see Figure 1d), we can see that the learning speed is drastically increased, which is a natural result because the initial values of the parameters are appropriately selected in our method. The final values after 10,000 iterations are given as {w¹_i} = (1.25, 1.29, 1.36, 1.25), {w²_1j} = (−0.25, 0.16, 0.33, −0.40, −0.93), {w²_2j} = (−0.39, 1.73, −0.84, 1.52, −2.26), w³₁ = −0.03, w³₂ = 0.82, and {c_j} = (0.23, 1.42, 2.75, 4.03). Also, as shown in Figure 1d, we can observe a slightly lower asymptotic error rate than for the standard backpropagation network, which suggests that the proposed network also has a generalization capability as good as the standard backpropagation network. To examine the generalization capability of the proposed network, we show another simulation with a pattern contaminated with noise: the noisy pattern was obtained by adding random noise in (−0.1, +0.1) to every sample of the pattern of Figure 1a. With this pattern, the pattern is first smoothed to find the feature points (by simple smoothing techniques, similar locations of feature points were obtained compared with the noise-free pattern), and then the network is trained with the noisy patterns. The result is shown in Figure 2, compared with the standard backpropagation network with random initialization.

4 Design for Two-Input Neural Network Incorporating Training Pattern
The procedure for designing a network carrying R² → R¹ with bounded input (mapping from two-dimensional Euclidean space to
one-dimensional Euclidean space) is described here; this cannot be set up by a simple extension of the R¹ → R¹ case.

4.1 Extracting Feature Points. Denote each element of the training patterns as [v_x(j_x), v_y(j_y), d(j_x, j_y)]. We assume that the training patterns are arranged so that we may count the index of elements in ascending or descending order, as in two-dimensional gray-level image data. Then, the following are satisfied:
v_x(j_x) − v_x(j_x − 1) = constant,  v_x(j_x) > v_x(j_x − 1),  j_x = 2, ..., p_x    (4.1)
v_y(j_y) − v_y(j_y − 1) = constant,  v_y(j_y) > v_y(j_y − 1),  j_y = 2, ..., p_y    (4.2)
where p_x and p_y are the numbers of the first and the second elements in the training patterns, respectively. The feature points are composed of five classes of points: primary points, secondary points, extension points, boundary points, and corner points. We next define each class. The primary points are the points where the heights of the training patterns change sharply in both the x and y directions.

Definition 1. The point [v_x(j_x), v_y(j_y), d(j_x, j_y)] is called a primary point in two-dimensional input, if the point satisfies the following two conditions:

|d(j_x + 1, j_y) + d(j_x − 1, j_y) − 2d(j_x, j_y)| > δ    (4.3)
|d(j_x, j_y + 1) + d(j_x, j_y − 1) − 2d(j_x, j_y)| > δ    (4.4)
Let Φ_p be the set of primary points. For the case of Figure 3a, the elements of Φ_p are marked as small triangles. After the primary points are determined, the secondary points are searched along the planes that are perpendicular to the x-y plane and pass through primary points.

Definition 2. Let [v_x(ĵ_x), v_y(ĵ_y), d(ĵ_x, ĵ_y)] be a primary point in Φ_p. Then, the points [v_x(j_x), v_y(ĵ_y), d(j_x, ĵ_y)] satisfying
|d(j_x + 1, ĵ_y) + d(j_x − 1, ĵ_y) − 2d(j_x, ĵ_y)| > δ    (4.5)
and the points [v_x(ĵ_x), v_y(j_y), d(ĵ_x, j_y)] satisfying

|d(ĵ_x, j_y + 1) + d(ĵ_x, j_y − 1) − 2d(ĵ_x, j_y)| > δ    (4.6)
are called secondary points (meaning secondary primary points) in two-dimensional input. If a cross-sectional curve formed along a line parallel to the x-axis and passing through a primary point is examined, one can easily see that the curve is the same as the curve in the one-dimensional case. So, for the same reason as in the one-dimensional case, a set of feature points called secondary points is selected in the two-dimensional case.
Figure 3: Network for multidimensional input: (a) feature points of two-dimensional example pattern, (b) a multilayer neural network carrying R² → R¹, (c) output of a selected unit in the third hidden layer of (b), (d) part of multilayer neural networks for m inputs and n outputs.
Denote the collection of the secondary points of all primary points as Φ_s. The elements of Φ_s for the case of Figure 3a are marked as small circles in the figure.
Definition 3. Let [v_x(j_x1), v_y(j_y1), d(j_x1, j_y1)] and [v_x(j_x2), v_y(j_y2), d(j_x2, j_y2)] be secondary points. Then the two points [v_x(j_x1), v_y(j_y2), d(j_x1, j_y2)] and [v_x(j_x2), v_y(j_y1), d(j_x2, j_y1)] are extension points in two-dimensional input.

The word extension is used because these points are not selected by a second-derivative criterion, which is the basic motivation of feature points, but instead by manipulating previously selected feature points (primary points and secondary points). The extension points for the case of Figure 3a are shown as small rectangles in the figure, and the set of extension points is denoted as Φ_e.
Boundary points are the points at which the lines parallel to each axis and passing through primary, secondary, or extension points intersect the boundaries.

Definition 4. For a point [v_x(j_x), v_y(j_y), d(j_x, j_y)] ∈ Φ_p ∪ Φ_s ∪ Φ_e, the points [v_x(1), v_y(j_y), d(1, j_y)], [v_x(p_x), v_y(j_y), d(p_x, j_y)], [v_x(j_x), v_y(1), d(j_x, 1)], and [v_x(j_x), v_y(p_y), d(j_x, p_y)] are called boundary points in two-dimensional input.

We denote the set of boundary points as Φ_b. For the example of Figure 3a, the boundary points are marked as solid dots.

Definition 5. The points [v_x(1), v_y(1), d(1, 1)], [v_x(p_x), v_y(1), d(p_x, 1)], [v_x(1), v_y(p_y), d(1, p_y)], and [v_x(p_x), v_y(p_y), d(p_x, p_y)] are called corner points in two-dimensional input.

For the example of Figure 3a, the corner points are marked as plus signs. We denote the set of corner points as Φ_c. Then, the set of feature points, Φ, of a given training pattern is the collection of primary, secondary, extension, boundary, and corner points, namely

Φ = Φ_p ∪ Φ_s ∪ Φ_e ∪ Φ_b ∪ Φ_c    (4.7)
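A hedged sketch of Definition 1, under the assumption that conditions 4.3 and 4.4 are the second-difference tests along the x and y directions, analogous to 4.5 and 4.6.

```python
import numpy as np

def primary_points(D, delta=0.05):
    """D[jx, jy] holds desired outputs on an equidistant grid; returns Φ_p indices."""
    px, py = D.shape
    points = []
    for jx in range(1, px - 1):
        for jy in range(1, py - 1):
            ddx = abs(D[jx + 1, jy] + D[jx - 1, jy] - 2 * D[jx, jy])
            ddy = abs(D[jx, jy + 1] + D[jx, jy - 1] - 2 * D[jx, jy])
            if ddx > delta and ddy > delta:      # sharp change in both directions
                points.append((jx, jy))
    return points
```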
4.2 Design for the Network. Let Φ be the set of feature points of a given set of training patterns, and let the elements be denoted as

[v̄_x(j_x), v̄_y(j_y), d̄(j_x, j_y)],  j_x = 1, ..., p̄_x,  j_y = 1, ..., p̄_y    (4.8)

which satisfy

v̄_x(j_x) > v̄_x(j_x − 1),  j_x = 2, ..., p̄_x    (4.9)
v̄_y(j_y) > v̄_y(j_y − 1),  j_y = 2, ..., p̄_y    (4.10)
Then the network is designed as shown in Figure 3b (Fig. 3d for m-dimensional input). Note that the number of units in each layer is determined according to the number of elements of the feature points: the numbers of units in the first hidden layers of the x-group and y-group are p̄_x and p̄_y, the numbers of units in the second hidden layers of the x-group and y-group are p̄_x and p̄_y, and the number of units in the third hidden layer is p̄_x × p̄_y. The total inputs of the units in the first hidden layer for the qth training pattern are

net_qxj = w^x_j1 u_qx    (4.11)
net_qyj = w^y_j1 u_qy    (4.12)

and all the weights w^x_j1 and w^y_j1 are initialized to one:

w^x_j1 = 1,  j = 1, 2, ..., p̄_x − 1    (4.13)
w^y_j1 = 1,  j = 1, 2, ..., p̄_y − 1    (4.14)
The semilinear activation functions of the first hidden units are given by

f_j(net) = 1 / (1 + exp[−s_j(net − c_j)])    (4.15)

where the parameters c_j's are determined as follows:

c^x_j = [v̄_x(j) + v̄_x(j + 1)]/2,  j = 1, ..., p̄_x − 1    (4.16)
c^y_j = [v̄_y(j) + v̄_y(j + 1)]/2,  j = 1, ..., p̄_y − 1    (4.17)
The parameters s_j's can be determined by trial and error, as will be discussed in an illustrative example. The net total inputs of the units in the second hidden layer are

net_qxj = w^x_ji o_qxi    (4.18)
net_qyj = w^y_ji o_qyi    (4.19)

and the weights between the first hidden layer and the second hidden layer are initialized according to the following rules:

w^x_10 = 1    (4.20)
w^y_10 = 1    (4.21)
where the subscript "0" indicates the constant input to the first processing unit in the second hidden layer. Then, each unit in the second hidden layer combines two sigmoid activation functions to make a gaussian-shaped transfer function. All the units in the third hidden layer are multiplication units, whose net total input is given by equations 4.22 and 4.23, with all the weights in these equations initialized to one. Note that only two units (one from each group) in the second hidden layer are actually connected to the ith unit in the third layer, satisfying

i = (j_x − 1) p̄_y + j_y    (4.24)

to make a bell-shaped transfer function (for example, see Fig. 3c). The
weights between the third hidden layer and the output unit are initialized as follows:

w_i = d̄(j_x, j_y)    (4.25)

where j_x and j_y are the indices of the units in the second hidden layer that are connected to the ith unit in the third layer according to 4.24. From the forward propagation rules, the error backpropagation rule for the qth training pattern is derived and summarized as follows:

Δ_q w_ij = η δ_qi o_qj    (4.26)

where

δ_qj = (t_qj − o_qj) f′_j(net_qj)   for output units
δ_qj = f′_j(net_qj) Σ_k δ_qk w_kj o_qyk   for units of group x feeding multiplying units    (4.27)
δ_qj = f′_j(net_qj) Σ_k δ_qk w_kj o_qxk   for units of group y feeding multiplying units
δ_qj = f′_j(net_qj) Σ_k δ_qk w_kj   for other units

Learning rules for the c_x's and c_y's are derived by the same procedure as the ones for the w's, hence they are omitted here. We apply the proposed network and training algorithm to the example training pattern of Figure 4a, where 31 samples between (0, 3.1) along the x and y directions were used as training data. The equation of the training pattern is

f(x, y) = [sin(2x) cos(x) + x/3] [sin(2y) cos(y) + y/3]    (4.28)

From the pattern, the number of feature points is calculated to be 25: 9 primary points, 0 secondary points, 0 extension points, 12 boundary points, and 4 corner points. The effect of slope initialization is shown in Figure 4b, where the outputs of the network for two different slopes, 10 and 100, are compared; we can see that the smaller slope gives the smoother response of the network. After only 300 iterations of learning with an initial slope of 10 and η = 0.0015, the network gives the output shown in Figure 4d. The maximum difference between the result and the original training pattern is below 0.15. As shown in the result, several hundred iterations are sufficient for the network to duplicate the given training patterns.

5 Extension to the General Case
Even though the cases with more than two inputs are difficult to visualize, the basic concept can be established by extending the two-input case. Here, the method described in the previous section is extended to the general case of m inputs and n outputs. Denote an element of the training pattern as [v₁(j₁), ..., v_m(j_m) : d₁(j₁, ..., j_m), ..., d_n(j₁, ..., j_m)], which satisfies

v_k(j_k) − v_k(j_k − 1) = constant,  v_k(j_k) > v_k(j_k − 1),  j_k = 2, ..., p_k,  k = 1, ..., m    (5.1)
Figure 4: Example of two-dimensional input: (a) an example training pattern of R² → R¹, (b, c) effects of slope at initialization stage, (d) output of the network after 300 iterations of learning.
where p_k is the number of the kth component of the training patterns. Note that ":" is introduced to separate the input part and the output part of the training patterns. Then, feature points of R^m → R^n are defined by separating [v₁(j₁), ..., v_m(j_m) : d₁(j₁, ..., j_m), ..., d_n(j₁, ..., j_m)] into n points, [v₁(j₁), ..., v_m(j_m) : d₁(j₁, ..., j_m)], ..., and [v₁(j₁), ..., v_m(j_m) : d_n(j₁, ..., j_m)].

Definition 6. The point [v₁(j₁), ..., v_k(j_k), ..., v_m(j_m) : d_r(j₁, ..., j_k, ..., j_m)] is called a primary point for the rth output, if the point satisfies the second-difference conditions (analogous to 4.3 and 4.4) along every input direction.
Definition 8. Let Φ_s^r be the set of the secondary points for the rth output, and V_k^r, k = 1, ..., m, be the subset of Φ_s^r containing the secondary points selected by searching along the kth input component. Then, all the new points whose kth input components are copied from randomly selected elements of V_k^r, and whose output is assigned from the whole training pattern in accordance with the input, are called extension points of the rth output.
Definition 9. Let [v₁(j₁), ..., v_m(j_m) : d_r(j₁, ..., j_m)] be a primary point, secondary point, or extension point for the rth output. Then, the points resulting from changing the kth element of the point to 1 or p_k are called boundary points.

Definition 10. All the points [v₁(j₁), ..., v_m(j_m) : d_r(j₁, ..., j_m)] having j_k = 1 or p_k, k = 1, ..., m, are called corner points.

Then, feature points are the collection of primary points, secondary points, extension points, boundary points, and corner points. With the feature points, the corresponding network is designed for the m-input and n-output case. In Figure 3d, a network for the rth output is depicted. The organization and initialization of the network in each group are exactly the same as in the two-input case. Note that only one input from each group is connected to one of the multiplication units, determined from the following rule:
i = (j₁ − 1) ∏_{k=2}^{m} p_k + (j₂ − 1) ∏_{k=3}^{m} p_k + ... + (j_{m−1} − 1) p_m + (j_m − 1) + 1    (5.4)
By combining such networks of Figure 3d for each output, we have a network that is capable of approximating a given mapping R^m → R^n. By the same procedure as in the two-input case, the error backpropagation rule for the qth training pattern is derived,
where t_qj and o_qj are the desired and actual outputs of an output unit, respectively. Note that the above equations are written abstractly, and the subscripts should be modified in an actual design. So far we have handled the case of uniformly spaced and sorted input values. Next, we consider the case where the given input values are not sorted and not uniformly spaced. Even for this case the proposed method is applicable, by sorting the input values before determining the feature points (componentwise sorting should be applied in the case of multidimensional input values), and by taking the distance between adjacent inputs into consideration when calculating the second derivative at adjacent example points.

6 Conclusion
The backpropagation network was reviewed from the viewpoint that it actually performs an approximation of given input-output relations. Under the assumption that a priori information is available, a new design method for a class of feedforward neural networks and its initialization method were proposed. A slightly modified error backpropagation rule for the proposed network was derived and applied to the training procedure with the novel initialization method. To be specific, a set of points called feature points is selected according to the spatial characteristics of the training patterns. The feature points are composed of five classes: primary points, secondary points, extension points, boundary points, and corner points. The structure and the size of the network (i.e., the number of layers and the number of units in each layer) are determined from the number of feature points. The initial characteristics of the activation functions of each unit are determined from the relations between feature points, while the initial values of the weights between layers are determined according to the contents of the feature points. The case R¹ → R¹ and the case R² → R¹ were given as examples. With the proposed method, networks trained much faster than with the standard random initialization method. In addition, the proposed initialization method may be combined with any other technique for speeding up the learning procedure. It is noteworthy that if a classification problem is considered with this network, the locations of the sigmoid functions are more crucial than their shapes (slopes). Generally, in classification problems, the initial slope value should be chosen larger than in the case of function fitting (see Fig. 1b).
The methods presented here, however, need further refinement, especially extensive generalization studies, to handle nonequidistant training patterns and patterns containing large amounts of noise.
Acknowledgment

The author would like to thank Mr. Byoungho Kim for his assistance in simulation and printing results.
References

Denoeux, T., and Lengellé, R. 1993. Initializing back propagation networks with prototypes. Neural Networks 6, 351-363.
Hecht-Nielsen, R. 1989. Theory of the backpropagation neural network. Proc. Int. Joint Conf. Neural Networks 1, 593-606.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Sartori, M. A., and Antsaklis, P. J. 1992. Implementations of learning control systems using neural networks. IEEE Control Syst. 12(2), 49-57.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with nonsigmoid hidden layer activation functions. Proc. Int. Joint Conf. Neural Networks 1, 613-617.
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. Proc. Int. Joint Conf. Neural Networks 3, 7-16.
Received June 18, 1993; accepted November 24, 1993
Communicated by Stephen Hanson
On Langevin Updating in Multilayer Perceptrons

Thorsteinn Rognvaldsson
Department of Theoretical Physics, University of Lund, Sölvegatan 14 A, S-223 62 Lund, Sweden
The Langevin updating rule, in which noise is added to the weights during learning, is presented and shown to improve learning on problems with initially ill-conditioned Hessians. This is particularly important for multilayer perceptrons with many hidden layers, which often have ill-conditioned Hessians. In addition, Manhattan updating is shown to have a similar effect.

1 Introduction
There are numerous examples in the literature of cases where the performance of artificial neural networks (ANN) has been improved by clever use of noise in the training. For instance, their generalization ability can be improved by the use of noise-corrupted training patterns (Yau and Wallace 1991; Sietsma and Dow 1991; Krogh and Hertz 1992; Holmstrom and Koistinen 1992; Bishop 1993), and they can escape local minima with the help of synaptic noise (Hanson 1990; Murray 1992; Murray and Edwards 1993). This paper presents and analyzes such performance improvements achieved with a very simple method, the so-called Langevin updating (LV), where gaussian noise is added to the weights during training. This technique has two advantages over other algorithms:
• Initially, the noise helps the network search difficult regions of the error surface.

• Asymptotically, the noise helps the network escape local minima.
Asymptotic properties of LV updating have previously been studied in the field of simulated annealing, and results can be transferred directly to ANN learning (see Guillerm and Cotter 1991, and references therein). Consequently, our focus will instead be on its positive effects on the initial phase of learning. We show that LV updating, due to its ability to explore flat regions on the error surface, often outperforms on-line backpropagation (BP) both in convergence speed and in quality of the final network solutions. It is also pointed out, and experimentally verified, that Manhattan updating does this too.

Neural Computation 6, 916-926 (1994) © 1994 Massachusetts Institute of Technology
2 The Langevin Updating Rule
The Langevin updating rule with a discrete time step η is given by

Δw(t) = −η∇E(w) + √(2ηT) ξ(t)    (2.1)

where Δw is the weight update, ∇E the gradient of the error function, and ξ a gaussian noise term with unit variance. This equation can also be modified to allow for other noise correlations, for instance if it is desired to vary the noise with the position in weight space (Soderberg 1988). The effect of the Langevin equation is to place the network in a "heat bath" with temperature T and a Gibbs equilibrium distribution. However, since the goal of ANN learning is not to achieve a Gibbs probability distribution, we use the simpler form

Δw(t) = −η∇E(w) + ξ(t)    (2.2)
where the variance of the gaussian noise ξ is varied during learning. To make sure the system escapes all local minima and ends up in the global minimum as t → ∞, the temperature in equation 2.1 should decrease as

T(t) ∝ 1/log(t)    (2.3)

for large t (Kushner 1987). The optimal constant of proportionality, the "critical depth," can be calculated from knowledge about the deepest local minima. This annealing schedule is unfortunately prohibitively slow for practical purposes, and we instead use a geometric annealing schedule

T(t) = kT(t − 1)    (2.4)
where k < 1. This is selected purely for its simplicity and works fine here, since we study changes during the initial stage of learning. However, a different annealing schedule would be needed for problems where local minima, and not saddle points, are a serious problem. On-line BP updating, where patterns are drawn at random from the training set and weights are updated after each pattern presentation, also introduces noise to the weights. The noise level depends on the variation of the training patterns, the procedure by which they are chosen, and the location in weight space (Heskes et al. 1993; Leen and Moody 1993). In this case, the learning rate η takes the role of the temperature, and an annealing procedure similar to equation 2.3 is needed to guarantee convergence to the global minimum (Heskes et al. 1993). Furthermore, to achieve optimal convergence, the learning rate for the weights should be tuned to the curvature of the energy surface, which is computationally expensive and practically impossible in most applications. The method used in this paper is actually an on-line LV: the gradient is evaluated over a subset of the training set before the updating.
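A minimal sketch of this scheme (equation 2.2 with the geometric annealing of equation 2.4), using a toy quadratic error in place of a network; grad_E and all constants are illustrative.

```python
import numpy as np

def grad_E(w):                           # toy quadratic error, E = ||w||^2 / 2
    return w

def langevin_step(w, eta, sigma, rng):
    # equation 2.2: gradient step plus gaussian weight noise of amplitude sigma
    return w - eta * grad_E(w) + sigma * rng.standard_normal(w.shape)

rng = np.random.default_rng(0)
w = rng.uniform(-0.01, 0.01, size=100)   # small initial weights
eta, sigma, k = 0.1, 0.01, 0.999         # initial noise level 0.01, k < 1
for t in range(10_000):
    w = langevin_step(w, eta, sigma, rng)
    sigma *= k                           # geometric annealing of the noise
```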
3 Hessian Analysis
ANN learning can be extremely slow if the problem is "stiff" or if the error surface is very flat. The first difficulty, stiffness, is usually handled by using second-order training algorithms (see Battiti 1992 for a review), dynamic learning rates (see Schiffmann et al. 1993 and references therein), or simply by proper rescaling of the inputs. The second difficulty, flatness, is on the other hand a serious and common problem in ANN applications (Saarinen et al. 1993) that cannot be handled by resorting to second-order minimization algorithms. However, LV updating takes care of this problem. The curvature of the error surface is given by the eigenvalues λ_i of the Hessian matrix H(w) = ∇²E(w). The absolute ratio of the largest to the smallest eigenvalue is called the condition number of H, and H is ill-conditioned if the reciprocal of the condition number is very small ("small" meaning close to the floating point precision of the computer). The nullity of H is usually defined as the dimension of the weight subspace spanned by the eigenvectors with eigenvalue zero, but is used here as the dimension of the subspace spanned by eigenvectors with |λ_i/λ_max| < ε, where ε is the floating point precision of the computer. An ill-conditioned Hessian with a large "nullity" corresponds to an error surface with very little or no information on the step sizes needed to converge. A normal learning algorithm therefore stalls when it encounters the "nullity." Langevin updating, on the other hand, probes the "nullity" by Brownian motion, looking for signs of any curvature. As soon as some curvature is found, the gradient term takes control and forces the motion downhill.
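The quantities used in this discussion are straightforward to compute for a given Hessian; the following is a hedged sketch using the paper's loose definition of "nullity."

```python
import numpy as np

def rcond_and_nullity(H):
    """Reciprocal condition number and loose 'nullity' of a symmetric Hessian H."""
    mags = np.abs(np.linalg.eigvalsh(H))     # eigenvalue magnitudes
    rcond = mags.min() / mags.max()          # ill-conditioned if this is tiny
    eps = np.finfo(float).eps                # floating point precision
    nullity = int(np.sum(mags / mags.max() < eps))
    return rcond, nullity
```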
It is instructive to inspect the Hessian for a multilayer perceptron (MLP) to understand why small eigenvalues are a general problem. We consider an MLP with one hidden layer with N_h units, N_i inputs, one output unit, and the error measure

E = (1/2M) Σ_{p=1}^{M} (o − t)²    (3.1)

where M is the number of training patterns used in one updating, o the network output, and t the target training value. To simplify the Hessian as much as possible and still keep the essential features, we assume that Σ_p (o − t)/M ≈ 0, which is not uncommon in ANN problems, and that the initial weight values are small. This gives an initial Hessian of the general form

H = [ c²    O(c³)    O(c³w)     O(c³w⟨x⟩)
      ...   O(c⁴)    O(c⁴w)     O(c⁴w⟨x⟩)
      ...   ...      O(c⁴w²)    O(c⁴w²⟨x⟩)
      ...   ...      ...        O(c⁴w²⟨xx⟩) ]    (3.2)

where ⟨ ⟩ denotes averaging over the M training patterns and x is the input signal (indices have been suppressed). The scaling constant c comes
from the transfer function and equals 0.5 for a [0, 1] sigmoid. With two hidden layers in the MLP we get additional terms of O(w³), and similarly for more hidden layers. The right-most column is N_i × N_h elements wide and the two middle columns are N_h elements wide. The eigenvalue spectrum of a matrix of the form 3.2 typically has one, or very few, large eigenvalues and several groups of eigenvalues of decreasing magnitudes (see Fig. 3a). The sizes of the small eigenvalues decrease with the elements in the right-most columns. Hence, if the initial weights are small, the risk of having an ill-conditioned Hessian increases with the number of layers in the MLP. Also, since MLP networks tend to have a "fan-in" structure, that is, more input units than output units, the subspace corresponding to small eigenvalues (the "nullity") will be larger than the subspace associated with large eigenvalues.

3.1 Note on the Manhattan Algorithm. As a digression, we note that random search of the "nullity" is also achieved with the Manhattan (MH) updating rule (Peterson and Hartman 1989):
(3.3)
which could explain why the MH algorithm has been more successful than BP on some problems (Peterson and Hartman 1989; Lonnblad et al. 1992). To see this we rewrite equation 3.3 as
Aw(t) = - 7 ) ~ sign [VE(t)] = -vVE(t)
+ 6w(t)
(3.4)
and study the distribution of the "corrections" δw. The gradient η∇E(t) can be considered a random variable for weights inside the "nullity." Denote the distribution of −η∇E at time t by P[−η∇E(t)] and the distribution of the corresponding corrections δw by p[δw(t)]. If the same learning rate is used for all the weights inside the "nullity," the MH updating rule collapses P[−η∇E(t)] into two spikes located at ±η_M, resulting in the following relation between P and p:

p[δw] = P[−η_M − δw]                      if δw > η_M
p[δw] = P[−η_M − δw] + P[η_M − δw]        if |δw| < η_M    (3.5)
p[δw] = P[η_M − δw]                       if δw < −η_M
The distribution P[−η∇E(t)] is usually symmetric and centered around ∇E(t) = 0. Assuming a gaussian shape for P[−η∇E(t)] with a width of σ, we get two probable scenarios: for η_M > σ, p[δw] will be composed of two peaks at ±η_M, and for η_M < σ, p[δw] looks like a truncated gaussian. The first scenario corresponds to a random walk with fixed step lengths (as expected), whereas the latter boils down to LV updating. Both result in a random search of the "nullity."
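For comparison with the LV sketch of Section 2, here is a minimal sketch of Manhattan updating (equation 3.3) with the geometrically decreased η_M used later in Section 4; the error function and all constants are again illustrative.

```python
import numpy as np

def grad_E(w):                     # toy quadratic error, E = ||w||^2 / 2
    return w

def manhattan_step(w, eta_m):
    # equation 3.3: fixed step against the sign of each gradient component
    return w - eta_m * np.sign(grad_E(w))

rng = np.random.default_rng(1)
w = rng.uniform(-0.1, 0.1, size=50)
eta_m = 0.005
for t in range(1000):
    w = manhattan_step(w, eta_m)
    eta_m *= 0.999                 # monotonic geometric decrease of eta_M
```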
4 Monte Carlo Experiments

4.1 Methods. Four different problems are used to test LV and MH versus BP updating. They are chosen as representatives of different problem areas in ANN applications: classification with real-valued inputs and with binary inputs, prediction, and one real-world classification task. The performance is measured on an ensemble of 100 networks to reduce the effects of statistical fluctuations. Equation 2.2 is used for LV and BP, with zero noise in the latter case, and equation 3.3 is used for MH updating; a momentum term is added in all three cases. We use a "block" updating scheme where the error gradient is summed over a randomly chosen subset, typically 10 patterns, of the training set before each weight updating. A mean square error, equation 3.1, is used so that the learning rate η is independent of the block size. Before training, a coarse search is done to find optimal learning parameters. During training, η is adjusted with a simple "bold driver" technique: we increase η geometrically if the error is decreasing, and decrease it otherwise. In the MH case, the learning rate η_M is monotonically decreased geometrically. The LV noise level is initially set to 0.01 for all the problems and is decreased geometrically to practically zero at the end of training. These heuristic parameter adjustment schemes are all chosen for their simplicity, not for optimality. All simulations are done with the feedforward network simulation package JETNET 2.2 (Lonnblad et al. 1994).
4.2 Results. The first problem is to separate two overlapping 10-dimensional gaussian distributions with the same mean but different standard deviations. Earlier studies indicate that LV or MH updating is essential to achieve high-quality results on this problem (Peterson and Hartman 1989; Rognvaldsson 1993). Networks with 10 inputs, 20 hidden, and one output unit are trained for 100 epochs, with 1000 pattern presentations per epoch. Initial weight values are w ∈ [-0.01, 0.01] and momentum is set to α = 0.5. Figure 1a shows the classification performances for different values of the learning rate. The LV solutions are better and have a smaller variance than the BP ones.

The second problem is to predict the Mackey-Glass time series at time x(t + 85) given x(t), x(t - 6), x(t - 12), and x(t - 18). This problem has been used in several benchmark tests on learning the dynamics of time series (see Hartman and Keeler 1991 and references therein). The data are shifted by their mean value, which speeds up learning. The networks have 4 inputs, two hidden layers with 10 and 5 sigmoidal units, and one linear output unit. The training and test set contain 500 patterns each and the networks are trained for 20,000 epochs. The weights are initialized with w ∈ [-0.01, 0.01] and momentum is α = 0.3. Adding noise, by means of LV or MH, significantly shortens the convergence time, although η_M is tricky to tune and MH therefore deteriorates later
Figure 1: Classification performances on the overlapping gaussian problem. (a) Average, best, and worst final performances using BP and LV for different values of the learning rate η. (b) Average classification performance during training using MH with η_M = 0.005, LV with σ = 0.01 and η = 1, and BP with η = 1. The line at 93.4% indicates the maximum classification possible on this problem.
on (Fig. 2a). After 20,000 epochs BP catches up with LV as the errors reach their lower limit (Fig. 2c). By this time the noise level in LV is zero and there is no difference between LV and BP.

The third problem, n-parity, has previously been used in benchmark studies where Conjugate Gradient algorithms greatly outperformed BP (Johansson et al. 1992). We study 4-, 5-, and 6-dimensional parity problems, with the architecture n inputs, 8 hidden units, and one output. Using more hidden units than inputs decreases the risk of getting stuck in a local minimum. Each network is trained for 10,000 epochs or until 100% correct classification. The weights are initialized with w ∈ [-0.1, 0.1]. A large momentum term is used with α = 0.9. The results are listed in Table 1. Adding noise does not improve the convergence in any significant way, and the MH performance is outright miserable.

The fourth problem is to determine whether a patient is hypothyroid or not (Quinlan 1987). It is a real-world data set that has recently been used as "a very hard practical classification task" in a benchmark test of different ANN training algorithms (Schiffmann et al. 1993), with the conclusion that a variation of MH updating with dynamic individual learning rates is the quickest learning algorithm. The inputs are normalized to zero mean and a root mean square deviation of unity. This normalization shortens the convergence time by an order of magnitude compared to the results in Schiffmann et al. (1993). The same network architecture and weight initialization procedure is used as in Schiffmann et al. (1993), and the momentum term is set to zero. Table 2 shows the average training and generalization errors for networks trained 1000 epochs each. No
Figure 2: Relative error (rms error/rms deviation of data) for the Mackey-Glass prediction task. (a) Average values during the first 5000 epochs for η = 0.7 and η_M = 0.007. (b) Average, best, and worst relative error for BP and LV updating, after 5000 epochs, for different values of the learning rate η. (c) The same after 20,000 epochs. (d) Distribution of final errors after 20,000 epochs for η = 0.6. The solid line in (b) and (c) marks the lowest error reported in Hartman and Keeler (1991).
Table 1: Average Convergence Times ⟨T⟩ for Those Networks That Learned the Parity Problem within 10,000 Epochs.ᵃ

               4-Parity                5-Parity                6-Parity
Upd.     η       ⟨T⟩    conv.    η       ⟨T⟩    conv.    η        ⟨T⟩    conv.
BP       0.5     1088   99%      0.5     2202   83%      0.5      4111   49%
MH       0.05    393    42%      0.07    390    4%       0.005    -      0%
LV       0.5     1034   97%      0.5     2196   87%      0.5      3396   48%

ᵃThe right-most column of each section shows the percentage of 100 networks that managed to converge.
Table 2: Performance of MLP on Classification of Hypothyroid Functioning.ᵃ

                        Training set             Generalization
Updating    η        ⟨E⟩     E_min       ⟨E⟩      E_min    ⟨Class.⟩    Class._best
BP          20.0     11.5    2.0         108.8    88.0     98.2%       98.6%
LV          20.0     15.6    5.4         110.7    89.5     98.2%       98.7%

ᵃAngle brackets denote averages. The right-most column shows the best classification performance achieved. The latter should be compared with 98.5%, which is the best classification performance reported in Schiffmann et al. (1993).
improvement is made with LV updating (MH was not tested on this problem).

5 Discussion
5.1 General Results. The initial Hessian (i.e., evaluated in the initial weight configuration) eigenvalue spectra in Figure 3, together with the experimental results, verify that LV and MH updating outperform BP on problems with an initial flatness problem. Both problems where LV updating helps have Hessians that could be termed "ill-conditioned," whereas the other two problems do not. Furthermore, the dynamics of the eigenvalue spectrum, shown in Figure 4, strikingly demonstrates the difference between BP and LV updating. The LV noise immediately helps the network find regions with larger curvature, while on-line BP has a very hard time leaving the "nullity," if it ever leaves it! The LV training is also more robust than BP and less sensitive to the specific values of the training parameters, as can be seen from Figures 1 and 2. The average performance of LV is less sensitive than that of BP to different choices of learning rates. The results are also quite insensitive to the initial noise level (the same initial noise level was used in all four problems), and the final result is the same for noise levels within two orders of magnitude, downward, from the one used here.

When evaluating the condition of the Hessian H, it is of course the eigenvalue spectrum determined for the whole training set that is important. Different data subsets may produce different "subnullities," but if these "subnullities" have no overlap, the effect on the training will be negligible (see Fig. 3d). We have verified these results on a fifth problem: predicting the solar insolation onto a building using a two hidden layer MLP (Ohlsson et al. 1994). The initial Hessian for this problem has a condition number of O(10¹¹) and LV updating outperforms all other attempts on the problem.
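Spectra of the kind shown in Figures 3 and 4 can be reproduced in outline with a finite-difference Hessian. The sketch below is ours, for a single-hidden-layer [0,1]-sigmoid MLP under a mean square error; it is not the paper's code, and for two-hidden-layer networks the gradient would have to be extended accordingly.

    import numpy as np

    def mlp_grad(w, ni, nh, no, x, y):
        # Gradient of the mean square error for a [0,1]-sigmoid MLP with
        # one hidden layer; w is the flattened (W1, W2) weight vector.
        W1 = w[: ni * nh].reshape(ni, nh)
        W2 = w[ni * nh :].reshape(nh, no)
        h = 1.0 / (1.0 + np.exp(-x @ W1))       # hidden activations
        out = 1.0 / (1.0 + np.exp(-h @ W2))     # network outputs
        d_out = (out - y) * out * (1.0 - out) / len(x)
        d_h = (d_out @ W2.T) * h * (1.0 - h)
        return np.concatenate([(x.T @ d_h).ravel(), (h.T @ d_out).ravel()])

    def hessian_spectrum(w, ni, nh, no, x, y, eps=1e-5):
        # Central-difference Hessian at the initial weights; the small
        # eigenvalues of the returned spectrum reveal the "nullity".
        n = len(w)
        H = np.zeros((n, n))
        for i in range(n):
            dw = np.zeros(n); dw[i] = eps
            H[:, i] = (mlp_grad(w + dw, ni, nh, no, x, y)
                       - mlp_grad(w - dw, ni, nh, no, x, y)) / (2.0 * eps)
        return np.linalg.eigvalsh(0.5 * (H + H.T))   # symmetrize first

Histogramming the spectra returned for an ensemble of random initial weight vectors gives plots analogous to Figure 3.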
Figure 3: Initial Hessian eigenvalue spectra for an ensemble of 100 networks for (a) the overlapping gaussian distributions, (b) the Mackey-Glass time series, (c) the six-dimensional parity problem, and (d) the thyroid classification problem. The solid line shows the eigenvalue spectrum evaluated over the complete training set and the dashed line shows the spectrum evaluated over randomly chosen subsets of patterns. All eigenvalues are computed with double precision.
Figure 4: Time evolution of the eigenvalue spectra for the overlapping gaussian problem (one network) when using (a) backpropagation and (b) Langevin updating. Eigenvalues are computed with single precision.
5.2 Previous Results. At first glance, these results seem to contradict previous studies where learning of the parity problem is improved with synaptic noise (Hanson 1990; Murray 1992). The parity problem is not ill-conditioned and should consequently not be easier to learn for LV updating. There is, however, no disagreement. In both these studies the effect comes from escaping local minima that are encountered late in the training: in Murray (1992) by keeping a constantly high noise level (20%), and in Hanson (1990) by increasing the noise level if the weight makes mistakes. This effect is not present in the results reported here; the noise is decreased too fast to aid escape from local minima encountered late in the learning process.
5.3 Conjugate Gradients. We have also tested the Conjugate Gradient algorithms on the overlapping gaussians and the Mackey-Glass time series, but they never reach final performances comparable with BP or LV. Similar results on the thyroid classification problem are reported in Schiffmann et al. (1993).

Acknowledgments

The author thanks Tom Heskes and two anonymous reviewers for useful pointers to references. This research was sponsored in part by the Goran Gustafsson Foundation for Research in Natural Sciences and Medicine.

References

Battiti, R. 1992. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comp. 4, 141-166.
Bishop, C. 1993. Training with noise is equivalent to Tikhonov regularization. Neural Comp., submitted.
Guillerm, T., and Cotter, N. 1991. A diffusion process for global optimization in neural networks. Proc. Int. Joint Conf. Neural Networks. IEEE Press.
Hanson, S. 1990. A stochastic version of the delta rule. Physica D 42, 265-272.
Hartman, E., and Keeler, J. 1991. Predicting the future: Advantages of semilocal units. Neural Comp. 3, 566-578.
Heskes, T., Slijpen, E., and Kappen, B. 1993. Cooling schedules for learning in neural networks. Phys. Rev. E 47, 4457-4464.
Holmstrom, L., and Koistinen, P. 1992. Using additive noise in back-propagation training. IEEE Trans. Neural Networks 3, 24-38.
Johansson, E., Dowla, F., and Goodman, D. 1992. Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient algorithm. Int. J. Neural Syst. 2, 291-301.
Krogh, A., and Hertz, J. 1992. Generalization in a linear perceptron in the presence of noise. J. Phys. A 25, 1135-1147.
Kushner, H. 1987. Asymptotic global behavior for stochastic approximation and diffusion with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185.
Leen, T., and Moody, J. 1993. Weight space probability densities in stochastic learning: I. Dynamics and equilibria. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 451-458. Morgan Kaufmann, San Mateo, CA.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1992. Mass reconstruction with a neural network. Phys. Lett. B 278, 181-186.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1994. JETNET 3.0 - A versatile artificial neural network package. Comput. Phys. Commun., to appear.
Murray, A. 1992. Multilayer perceptron learning optimized for on-chip implementation: A noise-robust system. Neural Comp. 4, 366-381.
Murray, A., and Edwards, P. 1993. Synaptic weight noise during MLP learning enhances fault-tolerance, generalisation and learning trajectory. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 491-498. Morgan Kaufmann, San Mateo, CA.
Ohlsson, M., Peterson, C., Pi, H., Rognvaldsson, T., and Soderberg, B. 1994. Predicting system loads with artificial neural networks - Methods and results from the great energy predictor shootout. 1994 ASHRAE Trans. 100(2), in press.
Peterson, C., and Hartman, E. 1989. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Quinlan, J. 1987. Simplifying decision trees. Int. J. Man-Machine Stud., 221.
Rognvaldsson, T. 1993. Pattern discrimination using feed-forward networks. Neural Comp. 5, 483-491.
Saarinen, S., Bramley, R., and Cybenko, G. 1993. Ill-conditioning in neural network training problems. SIAM J. Sci. Comp. 14, 693-714.
Schiffmann, W., Joost, M., and Werner, R. 1993. Comparison of optimized backpropagation algorithms. Proc. ESANN'93, Brussels.
Sietsma, J., and Dow, R. 1991. Creating artificial neural networks that generalize. Neural Networks 4, 67-79.
Soderberg, B. 1988. On the complex Langevin equation. Nucl. Phys. B295, 396-408.
Yau, H., and Wallace, D. 1991. Enlarging the attractor basins of neural networks with noisy external fields. J. Phys. A 24, 5639-5650.
Received June 25, 1993; accepted December 15, 1993.
Communicated by Steve Nowlan
Probabilistic Winner-Take-All Learning Algorithm for Radial-Basis-Function Neural Classifiers

Hossam Osman
Moustafa M. Fahmy
Department of Electrical and Computer Engineering, Queen's University, Kingston, Ontario, Canada, K7L 3N6

This paper proposes a new adaptive competitive learning algorithm called "the probabilistic winner-take-all." The algorithm is based on a learning scheme developed by Agrawala within the statistical pattern recognition literature (Agrawala 1970). Its name stems from the fact that, for a given input pattern, once each competitor computes the probability of being the one that generated this pattern, the computed probabilities are utilized to probabilistically choose a winner. Then, only this winner is permitted to learn. The learning rule of the algorithm is derived for three different cases. Its properties are discussed and compared to those of two other competitive learning algorithms, namely the standard winner-take-all and the maximum-likelihood soft competition. Experimental comparison is also given. When all three algorithms are used to train the hidden layer of radial-basis-function classifiers, experiments indicate that classifiers trained with the probabilistic winner-take-all outperform those trained with the other two algorithms.

1 Introduction
Investigating links between artificial neural networks (ANNs) and statistical pattern recognition (SPR) is of enormous importance, since the latter relies on a large body of theoretical tools and results that can be incorporated into the ANN discipline. This incorporation provides us with an additional source of inspiration that helps in developing new neural architectures and learning algorithms. For instance, two popular competitive learning algorithms, namely the winner-take-all (WTA) (Grossberg 1976, 1987; Rumelhart and Zipser 1985) and the maximum-likelihood soft competition (MLSC) (Nowlan 1990a,b; Abbas and Fahmy 1994), are based on two clustering schemes well-known in the SPR literature. Specifically, the WTA is based on the k-means (MacQueen 1967; Lloyd 1982), and the MLSC is based on the maximum-likelihood estimation (MLE) (Duda and Hart 1973).
Neural Computation 6, 927-943 (1994)
© 1994 Massachusetts Institute of Technology
In this paper, we propose a new competitive learning algorithm, called "the probabilistic winner-take-all (PWTA)," that is also based on a learning scheme found within the SPR literature. Specifically, it is based on Agrawala's "learning with a probabilistic teacher" scheme (Agrawala 1970). In the next section, Agrawala's scheme is reviewed, and in Section 3, the PWTA algorithm is proposed and its learning rule is derived for three different cases. In Section 4, properties of the PWTA are discussed and compared to those of the WTA and the MLSC. Experimental comparison is also made with emphasis on the application of training the hidden layer of radial-basis-function (RBF) classifiers. Finally, Section 5 contains the conclusion of the given work.

2 Learning with a Probabilistic Teacher
Before reviewing Agrawala's scheme, we begin with an explicit statement of the employed definition of the clustering problem. Suppose that a collection of m clusters is needed to model the distribution of an n-dimensional vector¹ x. Suppose that the jth cluster, H^j, has a known a priori probability P(H^j). Let p(x | H^j, v_j) denote the jth cluster-conditional probability density function, where v_j is an unknown parameter vector. Finally, suppose that we are given a training set T_N = {x^1, x^2, ..., x^N} of N unlabeled samples drawn independently from the mixture density p(x | v), where v^t = (v_1^t, v_2^t, ..., v_m^t), and

p(x | v) = Σ_{j=1}^{m} p(x | H^j, v_j) P(H^j)   (2.1)
Given the above assumptions, our problem is to estimate v from T_N. This unsupervised learning problem can be approached in several ways, one of which is the Bayesian scheme (Duda and Hart 1973). The Bayesian scheme assumes that part of our knowledge about v is contained in a known a priori density function p(v). By observing k patterns, p(v) is converted to the a posteriori density function p(v | X_k), where X_k = {x^1, x^2, ..., x^k}. Let S denote the space over which v is defined. Then, the a posteriori density p(v | X_k) is computed from p(v | X_{k-1}) using Bayes' theorem as follows

p(v | X_k) = p(x^k | v) p(v | X_{k-1}) / p(x^k | X_{k-1})   (2.2)

where

p(x^k | X_{k-1}) = ∫_S p(x^k | v) p(v | X_{k-1}) dv   (2.3)
¹Throughout the paper, vectors are denoted by small boldfaced symbols, while components of vectors are denoted by small symbols that are not boldfaced.
and p(v | X_0) = p(v). Unfortunately, the Bayesian approach is computationally infeasible for most practical choices of p(x | H^j, v_j). For the Bayesian scheme to be feasible, this unsupervised clustering problem has to be converted to a supervised one, since in this case, for most practical choices of p(x | H^j, v_j), a sufficient statistic vector exists for v (Spragins 1965). Based on this argument, Agrawala has proposed his "learning with a probabilistic teacher" scheme (Agrawala 1970). In his scheme, when a training pattern x^k is presented, the a posteriori probabilities of all clusters are computed using all available information. Using these probabilities, x^k is probabilistically assigned to a cluster. In other words, using a probabilistic teacher, the unsupervised clustering problem is converted to a supervised one. The Bayesian approach is then applied. Mathematically, when x^k comes in, the a posteriori probabilities are computed using

P(H^j | x^k, X_{k-1}, L_{k-1}) = p(x^k | H^j, X_{k-1}, L_{k-1}) P(H^j) / Σ_{i=1}^{m} p(x^k | H^i, X_{k-1}, L_{k-1}) P(H^i),   j = 1, ..., m   (2.4)
where

p(x^k | H^j, X_{k-1}, L_{k-1}) = ∫_S p(x^k | H^j, v) p(v | X_{k-1}, L_{k-1}) dv   (2.5)
and Lk-1 = {l’, 1 2 , . . . ,f-’} with I’ being the probabilistically chosen label for x’, that is, I’ = H’if x’ is probabilistically assigned to HI. It should be mentioned that the notation (&I, Ck-1) is employed to emphasize the role of the probabilistic teacher. A label lk is then probabilistically chosen for xk such that
lk
= HI,with
probability P(H’ I d,Xk-l,L~k-l),j = 1,.. . ,rn
(2.6)
Suppose that l^k = H^ℓ. The pattern x^k is finally used to update p(v | X_{k-1}, L_{k-1}) according to Bayes' theorem as follows

p(v | X_k, L_k) = p(x^k | H^ℓ, v) p(v | X_{k-1}, L_{k-1}) / ∫_S p(x^k | H^ℓ, v) p(v | X_{k-1}, L_{k-1}) dv   (2.7)

For a proper p(x^k | H^ℓ, v), such as any member of the exponential family, the density p(v | X_k, L_k) is a reproducing one, and for each computation of 2.7 we only need to update a sufficient statistic vector s (Spragins 1965). Hence, at the kth iteration, we have p(x^k | H^j, X_{k-1}, L_{k-1}) = p(x^k | H^j, s(k-1)), P(H^j | x^k, X_{k-1}, L_{k-1}) = P(H^j | x^k, s(k-1)), and p(v | X_k, L_k) = p(v | s(k)). This makes Agrawala's learning scheme computationally feasible. Agrawala has proved that if the assumptions are valid and the mixture density p(x | v) is identifiable, then as k → ∞, p(v | X_k, L_k) converges in probability to a Dirac delta function centered at the true value of v (Agrawala 1970). In the next section, the neural implementation of Agrawala's learning scheme is discussed.
3 Probabilistic Winner-Take-All Competitive Learning
Upon examination of 2.2 and 2.7, we see that Agrawala's learning scheme can be viewed as an approximation to the Bayesian one in the sense that the probabilistically chosen cluster H^p has the product p(x | H^p, v_p) P(H^p) much higher than those of the others and, therefore, instead of evaluating p(x | v) using the summation involved in 2.1, Agrawala's scheme uses

p(x | v) ≈ probabilistically chosen_{j=1,...,m} p(x | H^j, v_j) P(H^j)   (3.1)
It is worth remembering that when the p(x | H^j, v_j) are spherical gaussians of known equal variances and the a priori probabilities are equal, the k-means scheme is an approximation to the MLE in the sense that, instead of evaluating p(x | v) using the summation in 2.1, it uses (Nowlan 1990a,b)

p(x | v) ≈ maximum_{j=1,...,m} p(x | H^j, v_j) P(H^j)   (3.2)
Now, assume that the v_j, j = 1, ..., m, are statistically independent, and let s^t = (s_1^t, s_2^t, ..., s_m^t), where s_j is the sufficient statistic of the jth cluster. Then

p(v | s) = ∏_{j=1}^{m} p(v_j | s_j)   (3.3)
Thus, 2.7 can be rewritten as

p[v_j | s_j(k)] = p[v_j | s_j(k-1)],   j = 1, ..., m,   j ≠ ℓ   (3.4)
and

p[v_ℓ | s_ℓ(k)] = p(x^k | H^ℓ, v_ℓ) p[v_ℓ | s_ℓ(k-1)] / p(x^k | H^ℓ, s_ℓ(k-1))   (3.5)

where

p(x^k | H^ℓ, s_ℓ(k-1)) = ∫_S p(x^k | H^ℓ, v_ℓ) p(v_ℓ | s_ℓ(k-1)) dv_ℓ   (3.6)
As with the k-means and the MLE, Agrawala's learning scheme can be implemented using a single-layer neural network. The jth unit of the layer corresponds to the jth cluster and is fully connected to the n-dimensional input x. The jth-unit parameters include s_j and any other variable that may be required by the learning scheme. In general, the jth-unit activation function, a_j(x), has two different forms: one while training and the other once training is complete. Specifically, while training, a_j(x) = p(x | H^j, s_j) P(H^j), where the form of p(x | H^j, s_j) is determined using 3.6; once training is complete, a_j(x) = p(x | H^j, v̂_j) P(H^j), where p(x | H^j, v̂_j) is of the same form as p(x | H^j, v_j) but with the unknown vector v_j replaced by v̂_j, its estimate in s_j. Under the assumption 3.3, in view of 3.4 and 3.5, it is noted that for each input presentation the
units, by evaluating their posterior probabilities, compete for the right to learn, and only one of them, the winner, is permitted to do so. Hence, this learning algorithm is a WTA competitive one. However, from 3.2, it follows that in the standard WTA, the unit with the highest a posteriori probability is chosen as the winner. But here, in view of 3.1, the winner is probabilistically chosen using the m posterior probabilities. Therefore, from now on, under the assumption 3.3, the neural implementation of Agrawala's learning scheme is called the "probabilistic WTA (PWTA)."

At the kth iteration, the PWTA proceeds as follows. Given x^k, the jth unit (competitor) computes its activation a_j(x^k). The m computed activation values are then normalized. Hence, now, a_j(x^k) corresponds to the a posteriori probability P(H^j | x^k, s_j(k-1)). The kth winner, l^k, is probabilistically chosen such that l^k = H^j with probability a_j(x^k), j = 1, ..., m. A random number generator is required for this step. Suppose that l^k = H^ℓ. The scheme learning rule is finally employed to update the ℓth-unit parameters. To derive this learning rule, we have to restrict our attention to special forms of p(x | H^j, v_j). Agrawala has assumed the gaussian form and has derived the learning rule for a case of two clusters and one-dimensional pattern space with the unknown parameter being the mean of one of the two clusters (Agrawala 1970). In what follows, we also assume the gaussian form, but we derive the learning rule for three more general cases.

3.1 PWTA-e. Here, the jth unit has an axes-aligned elliptical gaussian receptive field with unknown center c_j and unknown covariance matrix whose diagonal is σ_j = (σ_{j1}, σ_{j2}, ..., σ_{jn})^t, that is,

v_j = (c_{j1}, c_{j2}, ..., c_{jn}, σ_{j1}, σ_{j2}, ..., σ_{jn})^t   (3.7)
and

p(x | H^j, v_j) = ∏_{i=1}^{n} G_{x_i}(c_{ji}, σ_{ji})   (3.8)

where, for a random parameter ρ, the gaussian density is defined by

G_ρ(a, b) ≜ (2πb)^{-1/2} exp{-0.5(ρ - a)²/b}   (3.9)
Let S_ρ be the space on which a random parameter ρ is positive. Define the Wishart density W_ρ(e, d) for ρ to be the density 3.10, which is positive on S_ρ and zero otherwise; its normalization constant is given by 3.11, where Γ{·} is the known gamma function and d > 1. In view of 3.8, when c_j and σ_j are estimated by the sample mean and sample variance,
respectively, p(c_j, σ_j | s_j) is known to be a composite Gaussian-Wishart density (Spragins 1965), that is,

s_j = (ĉ_j^t, p_j, σ̂_j^t, N_j)^t   (3.12)

and

p(c_j, σ_j | ĉ_j, p_j, σ̂_j, N_j) = ∏_{i=1}^{n} G_{c_{ji}}(ĉ_{ji}, σ_{ji}/p_j) W_{σ_{ji}}(σ̂_{ji}, N_j)   (3.13)
The interpretation of 3.13 is clear. The parameter p_j reflects the confidence in ĉ_j as the estimate of c_j. Similarly, the parameter N_j reflects the confidence in σ̂_j as the estimate of σ_j. Substituting 3.8 and 3.13 into 3.6 yields
the predictive density 3.14; integrating over c_{ji} and σ_{ji}, using the fact that ∫_{S_ρ} W_ρ(e, d) dρ = 1, reduces 3.14 to 3.15. Define

φ_j ≜ Γ{0.5 N_j} / Γ{0.5(N_j - 1)}   (3.16)

Then, 3.15 can be rewritten as 3.17.
Upon examination of 3.8 and 3.17, it is noted that the operation performed on x in 3.17 is exactly the same as the one performed on it in 3.8. Thus, while training, the jth unit uses ĉ_j and σ̂_j as if they were the true values of the unknown parameters. Suppose that at the kth iteration, for an input x^k, the probabilistically chosen winner is H^ℓ. Substituting 3.8, 3.13, and 3.17 into 3.5 yields
p(c_ℓ, σ_ℓ | ĉ_ℓ(k), p_ℓ(k), σ̂_ℓ(k), N_ℓ(k)) = ∏_{i=1}^{n} G_{c_{ℓi}}(ĉ_{ℓi}(k), σ_{ℓi}/p_ℓ(k)) W_{σ_{ℓi}}( [N_ℓ(k-1) σ̂_{ℓi}(k-1) + (p_ℓ(k-1)/(1 + p_ℓ(k-1))) (x_i^k - ĉ_{ℓi}(k-1))²] / (N_ℓ(k-1) + 1), N_ℓ(k-1) + 1 )   (3.18)
In view of 3.13 and 3.18, the updating rule for s_ℓ is as follows

ĉ_ℓ(k) = [p_ℓ(k-1) ĉ_ℓ(k-1) + x^k] / [1 + p_ℓ(k-1)]   (3.19)

p_ℓ(k) = p_ℓ(k-1) + 1   (3.20)

σ̂_{ℓi}(k) = [N_ℓ(k-1) σ̂_{ℓi}(k-1) + (p_ℓ(k-1)/(1 + p_ℓ(k-1))) (x_i^k - ĉ_{ℓi}(k-1))²] / [N_ℓ(k-1) + 1],   i = 1, ..., n   (3.21)

N_ℓ(k) = N_ℓ(k-1) + 1   (3.22)

In view of 3.16, 3.22, and the fact that Γ{ρ + 1} = ρ Γ{ρ}, we get

φ_ℓ(k) = 0.5 (N_ℓ(k-1) - 1) / φ_ℓ(k-1)   (3.23)
Thus, 3.16 is only needed to compute φ_ℓ(0). Equations 3.19-3.23 constitute the learning rule of the PWTA-e. In view of 3.22, by applying Stirling's approximation to 3.16, we get

φ_j(k) ≈ [0.5 N_j(k)]^{1/2} for large N_j(k)   (3.24)

From 3.17, 3.20, 3.22, and 3.24, we have

lim_{k→∞} p(x^k | H^ℓ, ĉ_ℓ(k), p_ℓ(k), σ̂_ℓ(k), N_ℓ(k)) = ∏_{i=1}^{n} G_{x_i^k}(ĉ_{ℓi}(∞), σ̂_{ℓi}(∞))   (3.25)

Equation 3.25 is of the same form as 3.8. If the assumptions are valid and the input distribution is identifiable, ĉ_ℓ(∞) and σ̂_ℓ(∞) are the true values of c_ℓ and σ_ℓ, respectively.
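A minimal sketch of one PWTA-e iteration is given below (ours, not the authors' code, and with equal a priori probabilities, which cancel in the normalization). For brevity it approximates the training-time activation by the asymptotic gaussian form 3.25 rather than the exact form 3.17; the parameter updates follow 3.19-3.22.

    import numpy as np

    def pwta_e_step(x, c, p, var, N, rng):
        # Each unit j carries the sufficient statistic (c[j], p[j], var[j], N[j])
        # of 3.12. Activations are gaussian with the current estimates.
        act = np.exp(-0.5 * np.sum((x - c) ** 2 / var, axis=1)) \
              / np.sqrt(np.prod(2.0 * np.pi * var, axis=1))
        post = act / act.sum()              # normalized activations
        l = rng.choice(len(c), p=post)      # probabilistically chosen winner
        pl, Nl = p[l], N[l]
        d2 = (x - c[l]) ** 2                # deviation from the old center, as in 3.21
        c[l] = (pl * c[l] + x) / (1.0 + pl)                         # 3.19
        p[l] = pl + 1.0                                             # 3.20
        var[l] = (Nl * var[l] + pl / (1.0 + pl) * d2) / (Nl + 1.0)  # 3.21
        N[l] = Nl + 1.0                                             # 3.22
        return l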
3.2 PWTA-s. Here, the jth unit has a spherical gaussian receptive field with unknown center c_j and unknown variance σ_j, that is,

v_j = (c_{j1}, c_{j2}, ..., c_{jn}, σ_j)^t   (3.26)

and

p(x | H^j, v_j) = ∏_{i=1}^{n} G_{x_i}(c_{ji}, σ_j)   (3.27)

In view of 3.27, when c_j and σ_j are estimated by the sample mean and sample variance, respectively, p(c_j, σ_j | s_j) is known to be a composite Gaussian-Wishart density (Spragins 1965), that is,

s_j = (ĉ_j^t, p_j, σ̂_j, N_j)^t   (3.28)

and

p(c_j, σ_j | ĉ_j, p_j, σ̂_j, N_j) = W_{σ_j}(σ̂_j, N_j) ∏_{i=1}^{n} G_{c_{ji}}(ĉ_{ji}, σ_j/p_j)   (3.29)

Since the analysis is identical to that of the PWTA-e, in what follows, we only give the final results; the analog of 3.16 is the gamma-function ratio φ_j defined in 3.30, and the analog of 3.17 is 3.31.
Since the analysis is identical to that of the PWTA-e, in what follows, we only give the final results. Define (3.30) Then
Upon examination of 3.27 and 3.31, it is noted that the operation performed on x in 3.27 is exactly the same as the one performed on it in 3.31. The learning rule of the PWTA-s is as follows (3.32) (3.33)
(N,(k
-
+ 1 + Pr(k- - 1)& - t a ( k ) ) 2 ) (3.34)
l)&p(k - 1)
(3.35) (3.36)
In addition, the analogs of 3.23 and 3.25 hold (equations 3.37 and 3.38).
3.3 PWTA1 and PWTA2. Here, for the PWTA1, the jth unit has a spherical gaussian receptive field with unknown center c_j and unknown variance σ_j, and for the PWTA2, the jth unit has an axes-aligned elliptical gaussian receptive field with unknown center c_j and unknown covariance matrix whose diagonal is σ_j. However, for both algorithms, before training, the variances of all units are set to a fixed value σ, and thus, only the centers are to be trained. Hence,

v_j = c_j   (3.39)

and

p(x | H^j, v_j) = ∏_{i=1}^{n} G_{x_i}(c_{ji}, σ)   (3.40)

In view of 3.40, when c_j is estimated by the sample mean, p(c_j | s_j) is known to be a gaussian density (Spragins 1965), that is,

s_j = (ĉ_j^t, σ_j)^t   (3.41)

and

p(c_j | ĉ_j, σ_j) = ∏_{i=1}^{n} G_{c_{ji}}(ĉ_{ji}, σ_j)   (3.42)

where the variance σ_j reflects the confidence in ĉ_j as the estimate of c_j. Following the analysis of the PWTA-e, we get the training-time activation 3.43, and if the ℓth unit probabilistically wins the competition, its sufficient statistic s_ℓ is updated as follows
(3.43) and if the Pth unit probabilistically wins the competition, its sufficient statistic sp is updated as follows (T
Cdk)
=
v
+ Of(k
-
Cc(k - 1) 1)
+ + a&
1)
0
l) -
Xk
1)
(3.44) (3.45)
From 3.43 and 3.45, we have

lim_{k→∞} p(x^k | H^ℓ, ĉ_ℓ(k), σ_ℓ(k)) = ∏_{i=1}^{n} G_{x_i^k}(ĉ_{ℓi}(∞), σ)   (3.46)

Define ε_ℓ ≜ σ/σ_ℓ(0) + N_ℓ, where N_ℓ is the number of patterns so far assigned to the ℓth unit. Thus, 3.44 and 3.45 can be rewritten as

ε_ℓ(k) = ε_ℓ(k-1) + 1   (3.47)

ĉ_ℓ(k) = ĉ_ℓ(k-1) + [1/ε_ℓ(k)] (x^k - ĉ_ℓ(k-1))   (3.48)
Equations 3.47 and 3.48 constitute the learning rule of both the PWTA1 and the PWTA2. However, for the PWTA1, once training is complete, the unknown variance σ_j is estimated by the sample variance 3.49 over the patterns assigned to H^j, where N_j is the number of patterns that belong to H^j, and

x ∈ H^j,   if P(H^j | x, ĉ_j) ≥ P(H^i | x, ĉ_i),   i = 1, ..., m,   i ≠ j   (3.50)
For the PWTA2, once training is complete, the unknown variance vector σ_j is estimated by the per-dimension sample variances 3.51. It should be noted that the learning rule of the standard WTA can be written as (MacQueen 1967)

ĉ_ℓ(k) = ĉ_ℓ(k-1) + η_ℓ(k-1) (x^k - ĉ_ℓ(k-1))   (3.52)

together with the corresponding learning-rate update 3.53,
where η_ℓ is the learning rate of the ℓth unit. In view of 3.47, 3.48, 3.52, and 3.53, the learning rule of the PWTA1 and the PWTA2 is exactly that of the WTA. Actually, the only difference between these two algorithms and the WTA is the manner in which the winner is selected, as the sketch below illustrates.
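A sketch of the contrast (ours): the center update is a running mean in both cases, and only the winner-selection line differs. The gain 1/(number of wins) corresponds to 3.47 and 3.52 with the initial gain absorbed into the count.

    import numpy as np

    def select_winner(posteriors, probabilistic, rng):
        # PWTA1/PWTA2: sample the winner from the posteriors (via 3.1);
        # standard WTA: pick the highest posterior (via 3.2).
        if probabilistic:
            return int(rng.choice(len(posteriors), p=posteriors))
        return int(np.argmax(posteriors))

    def update_center(c, n_wins, x):
        # Running-mean center update shared by the WTA and the PWTA1/PWTA2.
        n_wins += 1
        return c + (x - c) / n_wins, n_wins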
4 Probabilistic Winner-Take-All versus Other Competitive Learning Algorithms

In what follows, properties of the WTA, the MLSC, and the PWTA are discussed, and then experimental comparison is provided, with special attention given to the application of training the hidden layer of RBF classifiers. The WTA is the most popular competitive learning algorithm. Its popularity is mainly due to its simplicity. A problem with the WTA
is that the ability to represent the input distribution by the hidden unit parameters is limited by the fact that the algorithm only adapts the unit centers. A second problem is that some units may never win and, therefore, never learn. Thus, these units do not contribute to modeling the data. This problem is referred to as the unit underutilization problem (Grossberg 1976, 1987; Rumelhart and Zipser 1985). The WTA can be implemented either in batch or in adaptive (pattern) mode. Theoretically speaking, in contrast to the batch WTA algorithm, the adaptive one does not guarantee an optimal clustering, where optimality is defined as reaching either a global or a local minimum of the minimized cost function.

The MLSC adapts the centers as well as the variances of the units. As a matter of fact, it can also adapt the a priori probabilities of the units (Duda and Hart 1973). Therefore, relative to the WTA, the MLSC gives rise to a better representation of the input distribution. This is done at the expense of requiring more computational cost while training. The computational cost difference is significant when both algorithms are implemented on serial machines, since, for each pattern presentation, the MLSC updates the parameters of all units rather than just those of the winner. Since the parameters of all units are updated, all units learn. However, it is known that, in general, using the MLSC there is no guarantee that the maximum likelihood solution will be nonsingular (Träven 1991). Singular solutions emerge as a consequence of the fact that a solution with a unit centered on a single training pattern has a likelihood that approaches infinity as the variance of the unit approaches zero. Therefore, while training, there will be a tendency for a unit to be dominated by a single training pattern or, in other words, for the posterior probability of a unit to approach 1 for one pattern and 0 for the others, until finally the unit is centered on this pattern and its variance is set to 0. This results in a singular unit that memorizes only a single training pattern. Singular units do not contribute to representing the input distribution and, therefore, the unit underutilization problem is again encountered. The MLSC can be implemented either in batch or in adaptive mode, but again, in contrast to the batch version, the adaptive one may not find a maximum likelihood solution. Nevertheless, through experience, some heuristics have been developed to give reasonably good solutions (Abbas and Fahmy 1994).

Like the MLSC, the PWTA adapts the centers and the variances of the units. The PWTA, in general, has the problem of requiring two different forms for the unit activation function: one while training and the other during actual operation. This problem is somewhat reduced by noting that, in the addressed cases, the operations performed on an input pattern are the same in the two forms. In view of the addressed cases, it is noted that, in general, while training, the computational cost required to evaluate the unit activation function in the PWTA algorithm is higher than the cost needed to evaluate the gaussian activation of the MLSC. The computational cost difference is negligible on parallel machines.
On serial machines, it is compensated by the fact that, for each pattern presentation, the PWTA updates only the winner parameters. By probabilistically choosing the winner and not maximizing the likelihood function, the PWTA overcomes the possibility of encountering the unit underutilization problem. Also, in contrast to the WTA and the MLSC, the PWTA is adaptive in nature. In addition, as previously said, if the assumptions are valid and the input distribution is identifiable, the algorithm asymptotically converges in probability to the true values of the model parameters.

Three experiments were conducted to compare the performance of the PWTA to those of the WTA and the MLSC. In all experiments, gaussian units were used and the a priori probabilities of all units were assumed to be equal. Two cases were considered for the WTA scheme, namely the WTA1 and the WTA2. Once training is complete, for the WTA1, the variances of all units were evaluated using 3.49, while for the WTA2, they were evaluated using 3.51. Similarly, two cases were considered for the MLSC, namely the MLSC-e and the MLSC-s. The MLSC-e uses axes-aligned elliptical gaussian units, while the MLSC-s uses spherical gaussian units (Nowlan 1990a). With the exception of the first experiment, 20 runs were implemented and the one that yielded the best result was selected for comparison. Since the batch versions of the WTA and the MLSC always outperformed the adaptive ones, or at least yielded results of similar quality, only the results of the batch versions are presented. In what follows, each of the three experiments is described.
Experiment I. The aim of this experiment was to verify that the PWTA is capable of overcoming the unit underutilization problem. The data were the four two-dimensional gaussian clusters shown in Figure 1. A one-layer ANN with four units was trained using the PWTA1 scheme. The vectors c_j(0), j = 1, ..., 4, were all set to (1, 0)^t. The variance σ was set to 0.05, and the variances σ_j(0), j = 1, ..., 4, were all set to 0.25. The variances σ_j(0) were chosen much greater than σ so that initially all units had an approximately equal probability of being a winner, and once they became winners, they jumped to the cluster to which the input pattern belonged. The movement of the centers of all units while training is shown in Figure 1. As shown, all four units learned and their centers moved to the means of the four gaussian clusters. Thus, the PWTA network did not suffer from the unit underutilization problem. For the MLSC, the same result was obtained using the same c_j(0) and large initial unit variances. However, using the WTA, starting from the same c_j(0), only one of the four units learns (Ahalt et al. 1990).

Experiment II. The purpose of this experiment was to compare the performances of an RBF classifier when its hidden layer is trained using the following four competitive algorithms: the WTA1, the PWTA1, the PWTA-s, and the MLSC-s. The data were a two-dimensional bullseye with three different classes; 750 patterns were equally drawn from the
Figure 1: Movement of the centers of all units after each iteration.

three classes. The whole data was then divided into 375 training and 375 testing patterns. An RBF classifier with 2 input units and 3 output units was used. The number of hidden units was varied by increments of 3. For all algorithms, the vectors c_j(0) were randomly picked training patterns. The whole training data was assumed to have a covariance matrix given by δI, where I is the identity matrix. The variance δ was estimated from the data and then divided by the number of hidden units to yield a value for the variances σ, σ_j(0), and σ̂_j(0). For the PWTA-s, the p_j(0) were set to 1, and the N_j(0) were set to 2, where, from 3.30, N_j(0) must be greater than 1. Thus, φ_j(0) = Γ{1.5}/Γ{0.5} = 0.5. The hidden layer was first trained using one of the four algorithms. The output layer was then trained using the least-square algorithm. Figure 2 shows the performance of the RBF classifier for each of the four algorithms. The middle dashed line corresponds to the average classification rate over the four algorithms. The two solid lines are one binomial standard deviation above and below this average. The binomial deviation was evaluated as in Ng and Lippmann (1991). Figure 2 illustrates that the PWTA-s significantly outperforms the other algorithms, and that
Figure 2: The correct classification rate on the test set versus the number of hidden units for four algorithms: the WTA1, the PWTA1, the PWTA-s, and the MLSC-s. The middle dashed line corresponds to the average classification rate over the four algorithms. The two solid lines are one binomial standard deviation above and below this average.

the performance of the PWTA1 is better than that of the WTA1. It should be mentioned that for 19 and more hidden units, the MLSC-s gave many singular solutions.
Experiment III. The purpose of this experiment was to compare the performances of an RBF classifier when its hidden layer is trained using the following four competitive algorithms: the WTA2, the PWTA2, the PWTA-e, and the MLSC-e. The data were four-dimensional vowel data that belong to 10 classes² (Peterson and Barney 1952). The four dimensions correspond to four formant frequencies and the 10 classes

²We would like to thank Richard Schwartz, Ann Syrdal, Raymond Watrous, and Richard Lippmann for supplying us with these data.
Figure 3: The correct classification rate on the test set versus the number of hidden units for four algorithms: the WTA2, the PWTA2, the PWTA-e, and the MLSC-e. The middle dashed line corresponds to the average classification rate over the four algorithms. The two solid lines are one binomial standard deviation above and below this average.
correspond to 10 distinct vowel sounds. The data were arbitrarily divided into 894 training and 600 testing patterns. An RBF classifier with 4 input units and 10 output units was used. The number of hidden units was varied by increments of 20. The initial conditions were set as in Experiment II, with the exception that, for the PWTA-e, from 3.16, φ_j(0) = Γ{1.0}/Γ{0.5} = 0.564. Figure 3 shows the test set performance of the RBF classifier for each of the four algorithms. Figure 3 illustrates that the PWTA-e network significantly outperforms the other networks, and that the performance of the PWTA2 network is similar to that of the WTA2 network. It should be mentioned that for 80 and more hidden units, the MLSC-e gave many singular solutions.
5 Conclusion
A new competitive learning algorithm called "the probabilistic winner-take-all (PWTA)" has been proposed. It is based on Agrawala's "learning with a probabilistic teacher" scheme (Agrawala 1970). Its name refers to the fact that, for each input pattern presentation, the a posteriori probabilities of all competitors are computed and used to probabilistically choose a winner. This winner is then allowed to learn. Three different cases for the algorithm have been analyzed and their learning rules have been derived. Furthermore, properties of the algorithm have been discussed and compared to those of the WTA and the MLSC. It has been shown that the PWTA is adaptive in nature, overcomes the unit underutilization problem, and is theoretically justified. While training, in general, the computational cost of the PWTA on serial machines is higher than that of the WTA and similar to that of the MLSC. Experimental results have indicated that the performance of RBF classifiers trained with the PWTA is better than the performances of those trained with the WTA and the MLSC.

This improvement of performance can be understood through the following argument. Osman and Fahmy (1993) have proved that, once training is complete, the sample mean-square error (MSE) at the output of an RBF classifier equals the difference between the value of a discriminant criterion evaluated in the target space and its value evaluated in the hidden space. This criterion measures the separation of the different classes normalized by the total variance. The components of the Bayes risk vector are the optimum features with respect to this criterion. Relative to the WTA, by adapting the centers and variances of the units, the MLSC and the PWTA give a better representation of the input distribution and, hence, a higher value for this discriminant criterion in the hidden space. This results in a smaller sample MSE and a better performance. On the other hand, for the MLSC, all units adapt for each input pattern, whereas, for the PWTA, only the winner adapts. This results in a higher class separation for the PWTA and again a higher discrimination and a better performance. This argument has been confirmed using experimental observations.

Acknowledgments

The authors would like to thank the reviewers for many valuable comments. This work was supported by the Canadian Natural Science and Engineering Research Council (NSERC) under research grant no. A4149.

References

Abbas, H., and Fahmy, M. M. 1994. Neural networks for maximum likelihood clustering. Signal Process. 36, 111-126.
Agrawala, A. K. 1970. Learning with a probabilistic teacher. IEEE Trans. Inform. Theory IT-16:4, 373-379.
Ahalt, S. C., Krishnamurthy, A. K., Chen, P., and Melton, D. E. 1990. Competitive learning algorithms for vector quantization. Neural Networks 3, 277-290.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Grossberg, S. 1976. Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybernet. 23, 121-134.
Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cog. Sci. 11, 23-63.
Lloyd, S. P. 1982. Least-squares quantization in PCM. IEEE Trans. Inform. Theory IT-28:2, 129-137.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics, and Probability, L. M. LeCam and J. Neyman, eds., pp. 281-297. Berkeley, CA.
Ng, K., and Lippmann, R. P. 1991. A comparative study of the practical characteristics of neural network and conventional pattern classifiers. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 970-976. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1990a. Maximum likelihood competition in RBF networks. Tech. Rep. CRG-TR-90-2, Connectionist Research Group, University of Toronto.
Nowlan, S. J. 1990b. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 574-582. Morgan Kaufmann, San Mateo, CA.
Osman, H., and Fahmy, M. M. 1993. Back-propagation and radial-basis-function networks for pattern discrimination. In Proceedings of the World Congress on Neural Networks WCNN, pp. IV.181-185. Portland, OR.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of vowels. J. Acoust. Soc. Am. 24, 175-184.
Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Spragins, J. 1965. A note on the iterative application of Bayes' rule. IEEE Trans. Inform. Theory IT-11:4, 544-549.
Träven, H. G. C. 1991. A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. IEEE Trans. Neural Networks 2:3, 366-377.
Received July 20, 1993; accepted December 15, 1993.
Communicated by Mitsuo Kawato
Realization of the "Weak Rod" by a Double Layer Parallel Network

T. Matsumoto
K. Kondo
Department of Electrical Engineering, Waseda University, Tokyo 169, Japan

The weak rod can be realized by a very natural double layer network with feedback. While the second layer is linear, the first layer is significantly nonlinear. Elements associated with the first layer are called "synaptic fuses." An MRF argument is used to eliminate the line variables. The network makes what is cut transparent when a crease or a step edge is present.

1 Introduction
The weak rod (Blake and Zisserman 1987) in its discrete form refers to the minimization of

E(v, l, d) := Σ_{k=1}^{N} (v_k - d_k)² + λ Σ_{k=2}^{N-1} (v_{k-1} + v_{k+1} - 2v_k)² (1 - l_k) + λ_l Σ_{k=2}^{N-1} l_k   (1.1)

over v = (v_1, ..., v_N) ∈ ℝ^N, l = (l_2, ..., l_{N-1}) ∈ {0, 1}^{N-2}, where d = (d_1, ..., d_N) ∈ ℝ^N is given. Speaking roughly, given noisy data d, one wants to smooth out the noise while maintaining (1) step edges and (2) creases inherent to the original data. If l is absent, 1.1 defines a one-dimensional version (rod energy) of the well-known thin-plate energy. Namely, one can interpolate noisy data d with the second-order smoothness penalty. For an obvious reason, the resulting solution loses step edges as well as creases inherent to the original data. If l is present, on the other hand, when the second-order spatial difference (v_{k-1} + v_{k+1} - 2v_k)² gets too large, it is cheaper to let l_k = 1 and pay the price λ_l than to pay λ(v_{k-1} + v_{k+1} - 2v_k)². This enables one to detect creases as well as step edges, because at these points (v_{k-1} + v_{k+1} - 2v_k)² is large. The adjective "weak" comes from the fact that the rod tends to "break" at a crease or a step edge. If one replaces the second term of 1.1 by
λ Σ_{k=2}^{N} (v_k - v_{k-1})² (1 - l_k)
Neural Computation 6, 944-956 (1994)
© 1994 Massachusetts Institute of Technology
then one has the weak string, with which one cannot detect creases although one can find step edges. The weak string also suffers from the gradient limit, where discontinuity may arise even if the original data have no discontinuity. The purpose of this paper is to show that the minimization of 1.1 can be done by the temporal dynamics of the very natural double layer network given in Figure 1a, where each element on the first layer is called a "synaptic fuse," described by Figure 1c. The minimization of 1.1 being done by the temporal dynamics means that the trajectory of

C (dv/dt) = f(v, d)

converges to a stable limit point, which is a solution to the minimization problem, where d is the given data and C represents the parasitic capacitance matrix. Major consequences are

1. The network makes it transparent that a synaptic fuse is cut when a crease or a step edge is present.

2. The network reproduces the disappearance of the Mach band as the intensity gradient exceeds a threshold when the synaptic fuse cuts off. The network also reproduces the Craik-O'Brien-Cornsweet phenomenon, where the network extrapolates very sparse data.
3. If one designs a synaptic fuse using transistors, one can implement a parallel analog network that solves 1.1 in an extremely fast manner.
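As a concrete illustration of the energy 1.1 (ours, in NumPy): minimizing over each binary l_k separately replaces the kth smoothness term by min{λ(v_{k-1} + v_{k+1} - 2v_k)², λ_l}, which is the elimination of the line variables used later in Section 3.

    import numpy as np

    def weak_rod_energy(v, d, lam, lam_l):
        # Energy 1.1 with the line variables minimized out: each interior
        # node either pays the smoothness penalty or "breaks" at price lam_l.
        sq = lam * (v[:-2] + v[2:] - 2.0 * v[1:-1]) ** 2   # second differences
        return np.sum((v - d) ** 2) + np.sum(np.minimum(sq, lam_l))

    def optimal_lines(v, lam, lam_l):
        # The minimizing line process: l_k = 1 exactly where breaking is cheaper.
        return (lam * (v[:-2] + v[2:] - 2.0 * v[1:-1]) ** 2 > lam_l).astype(int)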
2 Related Works

Blake and Zisserman (1987) discuss capabilities and computational algorithms of the weak rod among other paradigms. Kawato et al. (1991) and Hongo et al. (1992) show that the weak rod reproduces several visual illusions extremely well. Specifically, they show that the weak rod reproduces the disappearance of the Mach band as the intensity gradient exceeds a certain limit. They also show that the weak rod reproduces the Craik-O'Brien-Cornsweet illusion. More specific comments on this will be given in Section 3. Lumsdaine et al. (1989) give a parallel resistive network that naturally realizes the weak string. The idea of the double layer network comes from Kobayashi et al. (1993) and Shimmi et al. (1992).

3 The Double Layer Network
Minimization of 1.1 is nontrivial, since there is a product term (v_{k-1} + v_{k+1} - 2v_k)²(1 - l_k), and since l_k is binary instead of real. One of the means of overcoming this difficulty is to eliminate l, the line variable. A
Figure 1: A double layer network implementing the weak rod. (a) The network architecture. (b) The voltage-current characteristic g_T(y). (c) Voltage-controlled current source: the current injected into node k is a function of v¹_{k±1} instead of v¹_k.
simple way of doing this is to "minimize out" l:

E′(v, d) = min_{l ∈ {0,1}^{N-2}} E(v, l, d)   (3.1)
A problem with 3.1 is that E′(v, d) is not differentiable in v anymore, although it is still continuous, so that one cannot differentiate it to derive necessary conditions for a minimum. In addition, one cannot use a gradient-type algorithm to compute a minimum. To overcome this difficulty, we will use an MRF argument and "integrate out" l.
Main Claim. Consider (v, l) as a pair of random variables and assume that their joint probability distribution given d is Gibbsian:

p_T(v, l | d) = (1/Z) exp{-E(v, l, d)/T}   (3.2)

where E(v, l, d) is defined by 1.1, T > 0, and Z is the normalization constant. Then the double layer network given in Figure 1a, with the first layer element characteristic 3.3 (see Fig. 1b), realizes the weak rod in the sense that for every T > 0, the second layer voltage v² maximizes the marginal probability given d:
p_T(v | d)   (3.4)

where d = (1/f₂)u, λ_s = (1/f₂)g₁, and λ_l is fixed analogously by f₁, f₂, and g₂.
Remarks.

1. The symbol given in Figure 1c stands for a voltage-controlled current source. To explain the difference between a resistor and a voltage-controlled current source, note that a resistor is characterized by v = f(i) or i = g(v), where v and i represent the voltage across a particular resistor and the current flowing into the resistor. If a resistor is linear, v = Ri, which is Ohm's law. The voltage-controlled current source, on the other hand, is characterized by i = g(y), where y is the voltage of some other element. At node k of the first layer in Figure 1, for instance, g_T(v¹_{k±1}) injects current specified by Figure 1b depending on the value v¹_{k±1}, not v¹_k. Similarly, another first layer current, u_k - f₂v²_k, depends not only on the input u_k but on v²_k, the second layer voltage. The current injected into the second layer, f₁v¹_k, depends on the first layer voltage v¹_k. Since 2g_T(v¹_k) is connected between node k and the ground, it can also be regarded as a nonlinear resistor between node k and the ground.
2. While the first layer is significantly nonlinear, the second layer is linear.

3. A resistor with the characteristic given in Figure 1b is called a resistive fuse (Harris 1991). Note that g_T(v¹_{k±1}) is not a resistor. It can be called a "synaptic fuse" for an obvious reason.
Proof of the Main Claim. We will first give an explicit formula for the marginal of v given d, namely

p_T(v | d) = Σ_{l ∈ {0,1}^{N-2}} p_T(v, l | d)   (3.5)
Our argument follows Lumsdaine et al. (1989) for the weak string. It follows from 1.1 that 3.2 can be factored into two parts:

p_T(v, l | d) = (1/Z) exp{-(1/T) Σ_{k=1}^{N} (v_k - d_k)²} exp{-(1/T) Σ_{k=2}^{N-1} [λ(v_{k-1} + v_{k+1} - 2v_k)²(1 - l_k) + λ_l l_k]}   (3.6)

Since the first factor in 3.6 is independent of l, we will compute the second factor:
Let

x_k := [λ_l - λ(v_{k-1} + v_{k+1} - 2v_k)²]/T   (3.7)

Then

Σ_{l ∈ {0,1}^{N-2}} exp{-Σ_{k=2}^{N-1} x_k l_k} = 1 + e^{-x_2} + ... + e^{-(x_2 + x_3 + ... + x_{N-1})} = ∏_{k=2}^{N-1} (1 + e^{-x_k})   (3.8)

and

∏_{k=2}^{N-1} (1 + e^{-x_k}) = exp{-Σ_{k=2}^{N-1} x_k + Σ_{k=2}^{N-1} ln(1 + e^{x_k})}   (3.9)
Using 3.6, 3.8, and 3.9, one has

p_T(v | d) = (1/Z′) exp{-(1/T) Σ_{k=1}^{N} (v_k - d_k)² + Σ_{k=2}^{N-1} ln(1 + e^{x_k})}   (3.10)

where 1/Z′ absorbs 1/Z and the constant factor exp{-(N - 2)λ_l/T}.
Given a T > 0, 3.10 is differentiable (infinitely many times), so that the maximum a posteriori probability (MAP) estimate satisfies ∇_v p_T(v | d) = 0, which amounts to

∇_v [ -(1/T) Σ_{k=1}^{N} (v_k - d_k)² + Σ_{k=2}^{N-1} ln(1 + e^{x_k}) ] = 0   (3.11)

The kth component of 3.11 reads

(v_k - d_k) + g_T(v_{k-2} + v_k - 2v_{k-1}) - 2g_T(v_{k-1} + v_{k+1} - 2v_k) + g_T(v_k + v_{k+2} - 2v_{k+1}) = 0   (3.12)

where

g_T(y) := g_T(λ_s, λ_l; y) = λ_s y / (1 + e^{[λ_s y² - λ_l]/T})   (3.13)
In order to show that the double layer network given in Figure 1 solves
3.12, first note that the KCL (Kirchhoff Current Law) of the second layer reads

g₂(v²_{k-1} + v²_{k+1} - 2v²_k) + f₁v¹_k = 0   (3.14)

so that

v¹_k = -(g₂/f₁)(v²_{k-1} + v²_{k+1} - 2v²_k)   (3.15)

Substituting 3.15 into the KCL of the first layer gives exactly 3.12 and 3.13, with the parameters identified as in 3.4.
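A software caricature of this equilibrium condition is the following gradient flow for 3.12 with the fuse characteristic 3.13 (our sketch, not a simulation of the circuit dynamics; the step size and parameter values are illustrative):

    import numpy as np

    def g_T(y, lam_s, lam_l, T):
        # Synaptic-fuse characteristic 3.13: roughly linear (lam_s * y) for
        # small |y|, cutting off once lam_s * y**2 exceeds lam_l (Fig. 1b).
        return lam_s * y / (1.0 + np.exp((lam_s * y ** 2 - lam_l) / T))

    def relax(d, lam_s, lam_l, T, dt=0.05, steps=4000):
        # Euler integration of a gradient flow whose equilibria satisfy the
        # MAP condition 3.12: v_k - d_k balances the fuse currents driven
        # by the second differences.
        v = d.copy()
        for _ in range(steps):
            y = np.zeros_like(v)
            y[1:-1] = v[:-2] + v[2:] - 2.0 * v[1:-1]     # second differences
            g = g_T(y, lam_s, lam_l, T)
            dv = -(v - d)
            dv[1:-1] -= g[:-2] + g[2:] - 2.0 * g[1:-1]   # fuse feedback
            v += dt * dv
        return v

For example, relax(d, lam_s=1.0, lam_l=0.05, T=0.01) smooths the noise in d while the fuses cut at creases and step edges, qualitatively reproducing the behavior shown in Figure 3.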
Remarks.

1. The first layer voltage v¹_k is the negative of the second-order spatial difference (v²_{k-1} + v²_{k+1} - 2v²_k) of the second layer voltage, up to the factor g₂/f₁, because of the second layer KCL 3.15. The network given in Figure 1a has an extremely natural mechanism of detecting discontinuity. Since the first layer elements are synaptic fuses, if v¹_{k±1} exceeds a threshold, then g_T(v¹_{k±1}) cuts off, and hence the first layer voltage stops "propagating" in space, so that it signals the presence of a "discontinuity"
in the second-order spatial difference via f₁v¹_k. Since a discontinuity in the first-order spatial difference implies a discontinuity in the second-order spatial difference, the network detects not only creases but also step edges (see Section 4). It should be noted, however, that v¹_k is not really the second derivative, because the space variable k takes discrete values instead of continuous values. When the space variable takes continuous values, the second derivative does not exist when a crease or a step edge is present.

2. The second layer "processes an image" according to the input f₁v¹_k from the first layer, and feeds its value back to the first layer via f₂v²_k.
3. It is rather important to observe that each layer of the network given in Figure la needs only immediate neighbor connections. If one tries to realize the weak rod using a single layer network, each node must be connected with its second nearest neighbors. In the implementation of the thin plate (Kobayashiet al. 1991)the second nearest connections made the layout phase extremely difficult. It is argued in Kobayashi et al. (1993) that a double layer network with immediate neighbor connections has significantly smaller wiring complexity than a single layer network with second nearest-neighbor connections. A single layer realization also demands negative conductances (Kobayashi et al. 1991; Harris 1988) so that one has to worry about stability (Matsumoto et al. 1992). 4. Let
G2(v2):= @vZTLv2 2
-*
1 . 1 - 2 1 . 1 - 2
'
1
.
L :=
.
1 - 2 1 . 1 - 2 1 1 * '
'
where the asterisk stands for the boundary effect to be discussed elsewhere due to space limitation. Moreover, let C' and C2 be the parasitic capacitance matrices (not necessarily diagonal) of the first and the second layer networks, respectively. Then the temporal
T. Matsumoto and K. Kondo
952
dynamics of the network is described by
(3.18)
This network is not reciprocal (Matsumoto 1976) due to the presence of flv' and f2v2, fi, f2 > 0, and hence there is no cocontent for the network. It should be noted, however, that at an equilibrium, v2 satisfies 3.11. Interpretation of 3.18 in terms of optics-inverse optics explored in Kawato et al. (1991) will be very interesting. 5. In Kawato et al. (1991), the energy to be minimized is given by
E(f,l I d )
:=
{
+
/ u ( x ) - d(x)}*dx X /{l - I ( x ) } F } 2 d x
+ r / l ( x ) { l- l(x)}dx+ c/l(x)dx
(3.19)
where f and 1 correspond to our v and 1, respectively, and the space variable x is continuous. There are two distinctions between the approaches taken by Kawato et al. and ours. First, in Kawato et al. (1991), the line variable takes real values instead of binary values so that the third term of 3.19 is necessary to force I(x) to tend to binary as t + 00. Second, l(x) is not eliminated so that one needs to solve
dm -2cf(x) - d(x)} =
- 2X{1
-
l(x)}
=A{%}
{ 3) (3.20)
2
-y{1-2l(x)} - c
(3.21) Although it is argued in Kawato et al. (1991) that 3.20 describes a double layer network it is in a different sense from ours.
6. We are not claiming that the issue of local minima is resolved. 7. CMOS circuit design of the synaptic fuse is rather challenging and is under way. The fact that in a recent CMOS circuit (Yu et al. 1992) for resistive fuse only four transistors are arranged in a very interesting configuration suggests that a synaptic fuse may also be designed by a simple circuit because the two elements have exactly the same functional form. The only difference is that a synaptic
Realization of "Weak Rod"
953
Noisy data
1
1 0
20
30
10
50
6"
node
Noiseless data
n
0
1 0
20
30
10
30
node
Figure 2: (a) Input data with noise. (b) Noiseless data.
50
T. Matsumoto and K. Kondo
954
1st layer voltage
?
' 0
I
I
I
I
,
1 0
20
30
40
50
60
50
60
node
2nd layer voltage
U "l N
0
1 0
20
30
40
node
Figure 3: The double layer network nicely restores the creases while smoothing out the noise. (a) The first layer response vi. (b) The second layer response n:.
Realization of "Weak Rod"
955
fuse is a voltage-controlled current source, that is, the current is controlled by the voltage of another node, whereas the current of a resistive fuse is controlled by the voltage of the particular node that current flows into. 8. One of the major advantages of massively parallel analog networks is their speed. This is due to the fact that the computation in such a network is done by the temporal dynamics induced by the parasitic capacitors associated with transistors and the processed image is given as a stable limit point of the dynamics. A typical thin oxide parasitic capacitance in MOS transistors is less than 1fF (10-15 F)/pm2. If one has a resistor in the order of 100 kR, for instance, with an area of several hundred pm2, the "time constant" is only in the order of 10 ps. Even though no precise definition of time constant exists for nonlinear circuits, this gives us an intuitive idea of how fast the computation can be done. A measurement is done in Kobayashi et al. (1991) for the thin plate. Even though it is difficult to separate the dynamics of the photo transistors, which are slow, from the dynamics of the thin plate per se, a very conservative estimate is 5 ps for convergence of the thin plate. The convergence time of the weak rod will not be too different.
4 Numerical Experiments Due to space limitation, we will only give one numerical experiment. Many other results will be reported elsewhere. Figure 2a is the input, where the unit is the current of the current source of Figure la in 0.1 PA. Figure 2b is the noise-free original data. The network nicely restores the U-shape as the second layer voltage distribution 4 shown in Figure 3b, f2 = 1.0 x lop6, where l/gl = 100 kR, l/g2 = 1 MR,fi = 1.0 x T = 5.0 x The first layer voltage ZJ; (Fig. 3a) signals N = 4.0 x a large value whenever a crease is present. Acknowledgments We benefited a great deal from various discussions with M. Kawato of ATR. Discussions with J. Harris of MIT and S. Hongo of NTT were very helpful. Thanks are also due to the reviewers for comments. References Blake, A., and Zisserman, A. 1987. Visual Reconstruction. MIT Press, Cambridge, MA.
Harris, J. 1988. An analog VLSI chip for thin-plate surface interpolation. l E E E Conf. Neural Inform.Process. Syst. Natural Synthetic 1, 687-694.
T. Matsumoto and K. Kondo
956
Harris, J. 1991. Analog models for early vision. Ph.D. Thesis, California Institute of Technology, Pasadena, CA. Hongo, S., h i , T., and Kawato, M. 1992. A computational theory and a neural network model on the brightness perception-A Markov random field model for filling-in process. Trans. ZEZCE 11, 1959-1967. Kawato, M., Inui, T., Hongo, S., and Hayakawa, H. 1991. Computational theory and neural network models of interaction between visual cortical areas. ATR Tech. Rep. 3.22. Kobayashi, H., Matsumoto, T., Yagi, T., and Shimmi, T. 1993. Image processing regularization filters on layered architecture. Neural Networks 6, 327-350. Kobayashi, H., White, J. L., and Abidi, A. A. 1991. An active resistor network for Gaussian filtering of images. ZEEE J.Solid-State Circuits 26, 738-748. Lumsdaine, A., Wyatt, J., and Elfadel, I. 1989. Parallel distributed networks for image smoothing and segmentation in analog VLSI. Proc. 28th IEEE Conf. Decision Control 272-279. Matsumoto, T. 1976. On the dynamics of electrical networks. 1. Diferential Equations 21, 179-196. Matsumoto, T., Kobayashi, H., and Togawa, Y. 1992. Spatial vs. temporal stability issues in image processing neuro chips. ZEEE Trans. Neural Networks 3, 540-569.
Shimmi, T., Kobayashi, H., Yagi, T., Sawaji, T., Matsumoto, T., and Abidi, A. 1992. A parallel analog CMOS signal processor for image contrast enhancement. Proc. 1992 European Solid State Circuit Conference, Copenhagen. Yu, P., Decker, S., Lee, H., Sodini, C., and Wyatt, J. 1992. CMOS resistive fuses for image smoothing and segmentation. I E E E J. Solid-state Circuits 27, 54.5553.
Received May 19,1993; accepted December 15, 1993.
This article has been cited by: 2. T. Matsumoto , T. Sawaji , T. Sakai , H. Nagai . 1998. A Floating-Gate MOS Implementation of Resistive FuseA Floating-Gate MOS Implementation of Resistive Fuse. Neural Computation 10:2, 485-498. [Abstract] [PDF] [PDF Plus]
Communicated by David Willshaw
Learning in Neural Networks with Material Synapses Daniel J. Amit* INFN, Sezione di Roma, Istituto di Fisica, Universitd di Roma, La Sapienza, P.le Aldo Moro, Roma, Italy
Stefan0 Fusi INFN, Sezione Sanitd, Viale Regina Elena, 299, Roma, Italy
We discuss the long term maintenance of acquired memory in synaptic connections of a perpetually learning electronic device. This is affected by ascribing each synapse a finite number of stable states in which it can maintain for indefinitely long periods. Learning uncorrelated stimuli is expressed as a stochastic process produced by the neural activities on the synapses. In several interesting cases the stochastic process can be analyzed in detail, leading to a clarification of the performance of the network, as an associative memory, during the process of uninterrupted learning. The stochastic nature of the process and the existence of an asymptotic distribution for the synaptic values in the network imply generically that the memory is a palimpsest but capacity is as low as logN for a network of N neurons. The only way we find for avoiding this tight constraint is to allow the parameters governing the learning process (the coding level of the stimuli; the transition probabilities for potentiation and depression and the number of stable synaptic levels) to depend on the number of neurons. It is shown that a network with synapses that have two stable states can dynamically learn with optimal storage efficiency, be a palimpsest, and maintain its (associative) memory for an indefinitely long time provided the coding level is low and depression is equilibrated against potentiation. We suggest that an option so easily implementable in material devices would not have been overlooked by biology. Finally we discuss the stochastic learning on synapses with variable number of stable synaptic states. 1 Introduction 1.1 Memory Maintenance on Long Time Scales. A material neural network that is supposed to learn dynamically receives an uninterrupted flow of uncorrelated stimuli to be learned. The stimuli impinge on neural elements connected by synapses. An incoming stimulus imposes a certain activity distribution on the neural elements and each pair of neurons 'On leave of absence from Fbcah Institute of Physics. Neural Computation 6, 957-982 (1994) @ 1994 Massachusetts Institute of Technology
Daniel J. Amit and Stefan0 Fusi
958
generates a source for the learning by the synapse connecting them. On a short time scale it may be reasonable to assume that a synapse can modify its efficacy in an analog way, as would be the case for a capacitor. On long time scales, if memory is to be maintained even in the absence of stimuli and of neural activity, it is more likely that a synapse can preserve only a relatively small set of stable values. These we would identify with LTP. For the capacitor this is implemented by an asynchronous, continuous, stochastic threshold controlled refresh mechanism (Amit et al. 1992; Badoni et al. 1992). The discretized long-term synaptic values achieved this way must allow the network to act as an associative memory.
1.2 Learning as a Stochastic Process and Palimpsest Memory. Learning is a stochastic process either due to the nature of the data or due to the dynamics of synaptic modification. A stimulus in the sequence presented to the network is represented by a set of activity levels imposed on the neurons during its presentation. Since the stimuli are assumed uncorrelated each synapse will see a random sequence of pairs of activities on the two neurons connected by it. This is one source of stochasticity. We denote the information arriving on a given neuron by a binary variable (, indicating whether the corresponding neuron does or does not carry information. The second source of stochasticity is due to two possible factors. The actual coding of information on the neurons may be analog and hence the effect on the synapse may not be the same when the presented pattern has information represented with different amplitudes (such as different spike rates). Moreover, even given the same incoming pair of neural activities, it may still be the case that the transition from one stable synaptic state to another may not be deterministic (there may be noise in the threshold for the synaptic transition from one stable synaptic state to another). In other words, even upon the arrival of the same pair of information coding discrete variables a synapse will undergo the implied transition with probability that may be lower than unity. As a consequence the presentation of a sequence of uncorrelated stimuli induces a Markovian process on the set of values of the N ( N - 1) synapses. More formally, the probability that a synapse makes a transition J ---t J’ is a product of P I ( ( , f ) ,the probability of the arrival of the pair <‘( on the two neurons connected by the synapse, and the probability that given that pair the transition takes place, p z ( J + J’ I t . 0 . We shall further assume that a given pair (, f can produce a transition between a single pair of neighboring synaptic states, or no transition at all. The resulting Markovian process is a walk on the finite set of stable synaptic values and will be described by the probability distribution function of the synaptic values. In particular, the conditional distribution function i )of obtaining the value J following the presentation
4((,
Learning in Neural Networks with Material Synapses
959
of p patterns the first of which imposed E . i on the synapse satisfies the evolution equation:
in which MK,is the transition matrix whose elements are determined by the probabilities discussed above and the index j runs over all the stable synaptic states. The first conclusion is that this type of dynamics is generically ergodic (see, e.g., Section 3). When the number of presented patterns becomes very large
which is independent of <.<. This makes a memory of this type a palimpsest (Nadal et al. 1986). In other words, patterns learned far in the past are erased by new patterns learned subsequently in sharp contrast to memories of the Hopfield or the Willshaw types (Hopfield 1982; Willshaw 1969). In the latter, when considered as a learning dynamics, following a large number of presentations all memory is destroyed (Amit ef al. 1987). An immediate implication of the existence of an asymptotic distribution of synaptic values, for a network that is to be available for learning for indefinitely long periods, is that the generic initial distribution on top of which learning new patterns is to take place is the asymptotic distribution pi”. Having an asymptotic distribution is a necessary condition for palimpsest behavior. It is not sufficient. The asymptotic distribution must be such as to allow the learning process to imprint new stimuli upon it. A counterexample is provided by the Willshaw (1969) model, in which the asymptotic distribution is a synaptic distribution for which all synapses have the value + I with probability 1. The presentation of any subsequent pattern will leave this distribution invariant and no retrieval is possible (see also Section 5). To have a functioning learning network with a finite number of synaptic states, the presentation of a gizien new stimulus must change the conditional distribution p,(<. i). Following the presentation of a given pattern consecutive presentations drive the conditional distribution back toward the asymptotic form, making the effect of the initial pattern progressively weaker. The question of the number of patterns that can be retrieved reduces therefore to the question about the age (distance into the past) of the oldest pattern that can still be retrieved, despite the effect of the subsequent patterns. Given the palimpsestic nature of the process, younger patterns can be retrieved a fortiori. 1.3 The Findings. We analyze the learning process as described above for a wide variety of cases. One main conclusion, already noticed in
960
Daniel J. Amit and Stefan0 Fusi
Amit and Fusi (1992), is that if all parameters such as number of states per synapse; coding level in stimuli, and transition probabilities of a synapse for a given pair of neural activity variables, are independent of the number of neurons in the network, then at most logN patterns can be retrieved. Making some of these parameters N-dependent one can do better. If the number of synaptic states increases with N, as fast as then one can reach a storage of order N. This was also observed in Nadal et al. (1986) and Parisi (1986). Going beyond fiin the number of states destroys the palimpsest behavior. Special initial synaptic conditions become required and the network suffers from the blackout effect, that is, all memories disappear together. We then study a network with two states per synapse. In this case we find that if the coding level in the arriving stimuli is as low as log N/N one can reach storage capacities as high as N/log N.For this type of patterns it was found (Willshaw 1969) that a network with two state synapses could have the optimal storage of N2/log2Npatterns. This is the price paid here for uninterrupted learning. Yet, when the intrinsic synaptic transition probabilities compensate for the coding level to make the mean number of up transitions (potentiation) of the same order as the number of down transitions (depression), one recovers optimal storage and enjoys continual learning. This additional requirement finds an interesting echo in recent experiments on potentiation and depression in hippocampal slices (Stanton and Sejnowski 1989).
m,
2 Criteria for Retrieval
In the simple case of autoassociative memory the possibility of retrieving a memory is determined by the distribution of depolarizations among the neurons in the network upon the presentation of one of the previously memorized patterns. If that distribution is such that there exists a threshold that separates the depolarization of neurons that had been active in the learned pattern from those that had been quiescent, retrieval is in principle possible. The situation is even better if one can show that the relevant threshold can be plausibly generated by the neural dynamics. Retrieval is impossible, without errors if the two distributions of depolarizations overlap significantly (see, e.g., Weisbuch and Fogelman-Soulie 1985). The distribution of postsynaptic inputs is determined in turn by the collection of synaptic values. The conditional distribution, equation 1.1, allows for the computation of the (conditional) mean of the synaptic input to a neuron that p patterns into the past had activity (. Similarly, we can compute the fluctuations of the postsynaptic input. If the sequence of afferent stimuli to be learned is (,!, ( f . . . then the synaptic input to neuron i upon presentation of
Learning in Neural Networks with Material Synapses the oldest memory
961
(’, following the learning of the entire sequence is
where I,,(p) is the synaptic efficacy following the learning of the p patterns, and 1/N is a normalizing constant introduced for convenience. The synaptic inputs, hp, can be classified according to the value that was imposed on neuron i during the imposition of pattern number 3 . If the neural activity is coded by a binary variable ( t i = (1, ( 2 ) , there will be two distributions of synaptic inputs: one for neurons that saw the value (1 when was presented and another for those that saw (2. The values of the input in each class have a conditional mean:
WE = 1P i CIiA(Li) (
(2.1)
1
where the conditional expectation (. . .)( is defined as
V)( = EV I F!
=
0
and pE is the probability that a neuron had activity 4 when the network was stimulated by tl. The expectation is over all the with p > 1 and j = 1,. . . ,N , and on [/ with j different from i. In other words, this is the mean input to a neuron in the population that had activity f upon the presentation of (l. The signal, for the binary case can be defined as
s=
( V h l
- oIP)c2
(2.2)
If IS( is significantly greater than the sum of the noises around each of the mean hp for the two values of t, then a threshold can be found that will separate correctly the two outcomes <1,(2 to reproduce the retrieved pattern. S can be written in terms of the conditional probabilities as
(2.3)
i
where the sum on extends over the two possible values of the activity (,! and \ runs over all n values of the stable synaptic states. The fluctuations of the two h; are estimated by
R2(,9 = (ChP - (hP)d2)E If the random variables hp are gaussian, then total noise is
The computation of each of the current variances is complicated by the fact that it involves means of products like: \;,J;k, in which the efferent neuron i is the same in both synaptic efficacies. In general, the variables ]i, and \;k are correlated. In Appendix A we show that the variances are
Daniel J. Amit and Stefano Fusi
962
minimal when the two sets of Is are assumed independent. In that case, it is shown in the appendix that
Retrieval is possible if the ratio S I R is large enough. If one requires that the probability of an error on any neuron tends to zero with increasing N, then the square of the signal-to-noise ratio must grow at least as log N (see, e.g., Weisbuch and Fogelman-Soulie 1985). 3 The Logarithmic Constraint
In the wide class of learning processes we consider below, there is always a sequence of synaptic transitions, on any given synapse, that can bring the synapse from any one of its stable states to any other state. The corresponding stochastic process is, therefore, irreducible (see, e.g., Cox and Miller 1965). In that case the matrix M has a single eigenvalue 1. Writing equation 1.1 in terms of the eigenvalues, ,A, of M,one has
[(C,
i )=
c
c))((C,
E)M&'
= f),"
K
+
ok(Cqi)uFv;
A!-' lt>l
(3.1)
K
where u" and v'' are, respectively, the right and the left eigenvectors associated to eigenvalues A<". For an ergodic process, we have A1 = 1 > A2 = AM 2 A3 2 ...A, (Cox and Miller 1965). Note that the terms multiplying AL-' for (r > 1 depend on the initial conditional distribution and on the eigenvectors of M, corresponding to A., They are independent of p and N. Substituting p in equation 2.1 one finds
where h , is the term due to the asymptotic part of the distribution p r :
bx = UC)m =
c{it /'i c \if {I]
The coefficients F,, can be read by substituting equation 3.1 in equation 2.1. They are independent of N and of p . When A,(= AM) dominates, that is,
Learning in Neural Networks with Material Synapses
963
Calculating S by substituting equation 3.2 in equation 2.2, the asymptotic part h, cancels, leading to
where C,(P) are differences of the corresponding coefficients F , in equation 3.2, and P represents, schematically, their dependence on the set of parameters describing the learning dynamics. For fixed P , AM dominates and
s = [XM(P)]’-’
‘
c2(P)
in which C2 and AM depend on N only via an eventual dependence of one of the parameters that affect the learning dynamics (e.g., coding level of patterns, transition probabilities, presentation rate, number of stable synaptic states). The uncorrelated part (the lower bound, see Appendix A) of the variances of the two distributions of neuronal inputs, h, are given by equation (2.4). Each of the variances has two contributions:
and
(hy): = h k
+ 2h, C X:--’F,(<) + C X!-’F,(<) u>l
L
Again, if X2 = AM dominates, then
(r,:C;,c
=
(IZC2)m + w L - 1 ) G 2 ( 0
(/I;):
=
hk
1
1
2
(3.5)
+ 2XK1F2(()h,
The dependence on ( is contained in the functions G2, F2. For p + 00 all the terms that are multiplied by XK’ disappear and only the asymptotic part survives. So the signal-to-noise ratio behaves as
in which
If we impose that in the limit N 4 00 the ratio S 2 / R 2 grows at least as log N , then we obtain a bound on p: (3.7)
Daniel J. Amit and Stefan0 Fusi
964
This result makes sense, of course, only if the argument of the logarithm is greater than unity. Or that N C ( P ) > log N (3.8) Setting p=1 in equation 3.6, this condition is seen to be equivalent to the condition that the ratio of signal to noise will allow the recall of the most recently learned pattern (AM is strictly less than 1). In fact, the result 3.8 is a gross overestimate. The correlations mentioned above and discussed in Appendix A can make the variances remain finite as N becomes large. The increase in pc with N is all due to the fact that the noise decreases as N-'. Moreover, when the noise does not decrease with increasing N,the product N C ( P )does not increase with N . Hence the condition 3.8 can never be satisfied. As we proceed to show in what follows, the escapes from the tight storage constraint on pc are effective also when the correlations are included. 4 Possible Escapes
The logarithmic constraint on the number of retrievable patterns concerns a very wide class of networks with dynamic synapses. However, the form of pc, equation 3.7, suggests possible escapes. If one allows the parameters, P, contained in AM to vary with the size of the network, so that AM 1, then it is possible to go beyond the logarithmic constraint. The corresponding variation of C ( P ) ,limits the space of variation of the parameters P. Specifically, if AM has the form -+
AM =
1- x ( P )
and the dependence of P on N makes x the Constraint 3.8 is respected, then
(4.1) -+
0 in the limit of large N, while
pc x-'. (4.2) As mentioned at the end of the last section, if the constraint is not satisfied, there is no way to improve memory. Making AM tend toward unity can, at best, prolong the trace of the first imprinted pattern. But when the constraint is violated, there is no trace to maintain. Fortunately, in the memory optimizing cases to be considered the correlations contribute a negligible amount to the variances (see, e.g., Appendix A). We have considered four types of parameters P that affect the learning dynamics and that may depend on N: N
0
Speed of pattern presentation. If the number of stimuli presented to the network in the interval of a single transition between the synaptic states increases to rn, the storage capacity is multiplied by rn (Amit and Fusi 1992). Imposing a minimal rate of presentation seems rather artificial so we shall look for those types of remedies which improve the worst case: low rate presentation (rn = 1).
Learning in Neural Networks with Material Synapses 0
0
0
965
Stochastic refresh mechanism. The transition probabilities of a synapse for given input can be made to decrease with N. Coding level of the patterns. The fraction of information carrying bits per stimulus can be made to decrease with increasing N. Number of synaptic states. The number of stable states per synapse, n, can increase with N.
-
But when p O ( x - ’ ) we have (AM)? + Const # 0 as x -+ 0. In that case one must reexamine the dominance of AM = A2 in the expansion equation 3.1. In fact, usually a whole set of eigenvalues A, -, 1 and (A,)? -, K, # 0 in this limit (see, e.g., Section 7). The part of d(<,() corresponding to A1 remains distinguished from the contributions due to the other A, + 1, because it is the only part that is independent of the first learned pattern. The appearance of several eigenvalues for which (A,)P + K , # 0 implies that sums over eigenvalues, such as in equations 3.3, 3.4, and 3.5 separate into two parts: one part running over all the eigenvalues which tend to 1 and a part that includes all the lower eigenvalues and hence tends to zero. Since we have taken p x - ’ , the remaining sum may depend on x and effectively change the factor C ( P ) in equation 3.7 or 3.8, thus possibly modifying the constraint on the range of variation of the learning parameters P. In at least one such case, the case studied in Section 7, we find that no such change is induced by the degeneracy of the eigenvalues in the limit.
-
5 Stochastic Learning of Sparsely Coded Patterns
The most interesting results appear in the case of 0-1 neurons, with a low mean fractionf of Is and synapses with 2 stable states (J-,J+). An imposed stimulus can produce the following transitions at a synapse: 0
0
If If, = J- and the new stimulus activates the associated pair of neuthen a transition]- + ]+ occurs with probability rons (i.e., t:=<j’’=l), 9+. So upon each presentation of a new pattern the probability of potentiation isf9+. If If, = J+ and the stimulus contains a mismatched pair of activities, then the transition probabilities for a depression J+ + ]- are 9- (10) = O,(; = 1. The transition for
<:
0
A pair of inactive neurons leaves the corresponding synapse unchanged.
Daniel J. Amit and Stefan0 Fusi
966
The resulting transition matrix is
)
f(1-f)9- 1- a a l-f2q+ 1-(b I-b where a = f ( l-f)9-, b =f29+. The two eigenvalues are 1 and: 1 -f(l -f)9-
(5.1)
AM = 1 -f29+ - f ( l -f)9(5.2) The asymptotic distribution, the left eigenvector belonging to the eigenvalue 1, is
(5.3) where p+ = b / ( a + b ) is the fraction of synapses with value I+. Note that the Willshaw (1969) model has I+ = 1,j- = 0 +=O. Hence, a = 0 and consequently p" = (1,O): all synapses become 1. For the present case, since = 1equation 2.3 becomes
,(+,
4-
(I+ - I-Y/(+(L 1)+ I-f ( W O = (I+- I-Y4+(0?1)+ I-f
(5.4) (5.5)
(W+l =
The conditional probabilities are given by equation 3.1 as { ( t ( l , l ) A=' , - ' ~ P ~ ( 1 . 1 ) U K u , ,
+/Jc=AL'[/'4+(1,1)-p+] + p c
K
/$t(o- 1)= AL' K
p~(o~1)uKu,+ + /JC = AS'
[/,ti (0,1) - p + ] f ;0
where the eigenvectors corresponding to AM are given by U K = ( p - , - p + ) and V K = (1,-1). Assuming that one starts from asymptotic distribution, the conditional probabilities following the presentation of the oldest pattern E' are 1
=p++p-4+ * P ; + ( o J =p+(1-9-) (5.6) When calculating the signal, the asymptotic parts cancel and the leading term, is proportional to .A'-; P,t(w
s = A L 1 ( l + - 1-)(9+P- + 9-P+)f (5.7) Note again that for the Willshaw model p' = p" and S=O, that is, no learning is possible on top of the asymptotic distribution. For the calculation of the uncorrelated part of the noise, equation 2.4, we need ( h 2 ) ,which is the same as (h) with J+ replaced by 1: and I - by 12. One finds that
f
R2 = 2 [P+l:
+ p-ll
+ p - I - 17+ W K ' )]
- (P+l+
(5.8) For smallf we keep only terms of leading order inf and, for large p , the signal-to-noise ratio is (5.9)
Learning in Neural Networks with Material Synapses
967
5.1 Extremal Cases and the Return of Optimal Storage.
-
5.1.1 Lowest Coding Lmel. First we take the coding levelf logN/N (as in Willshaw 1969) keeping the transition probabilities 9+ and 9- fixed and both different from zero. From equation 5.2 we read that AM 1 - x (equation 4.1) with
-
and, according to equation 4.2,
-
In fact, f log N/N is as low as f is allowed to become without violating the bound (3.8). Moreover, even the above result for p , is too high. The reason is that when 9+ and 9- are fixed, the correlation term, Appendix A, overpowers the leading uncorrelated part when pE goes above N/(logN)’. In other words, this network performs much worse than Willshaw’s (1969), which for the same coding level gives pc N2/ log2N. This is a price for continual learning.
-
-
5.1.2 Optimal Storage Recovered. The optimal performance can be recuperated if we takef log N/N and the transition probability 9- =f9+. Provided the bound (3.8) is not violated, according to equation 4.2, since now x of equation 4.1, is (logN/N)2,one has the optimal storage
-
if 9+ does not tend to zero.
To verify that the retrieval bound, (3.8), is respected one first notes that in this case the part of the noise due to correlations is negligible. It is of magnitude pf3 relative to the uncorrelated part (see, e.g., Appendix A). It is therefore sufficient to read C ( P ) from equation 5.9 and to substitute it in equation 3.8. In the present case the asymptotic fractions p+ and p of I+ and I-, respectively, are finite. The only strong N dependence in C ( P ) is inf and hence the constraint reduces to Nf = O(1ogN). 5.1.3 Intermediate Cases. One could attempt to trade off some of the N dependence off for an N dependence of 9+, which has been assumed finite in the limit of large N. The constraint on C ( P ) implies that if
Daniel J. Amit and Stefan0 Fusi
968
then 9:
1-8
log N
=
(7)
with E [0,1] (f < 1 implies that p > 0 and 9+ < 1 gives the upper bound /3 < 1). For x of equation 4.1 we have x
-
logN
'+'o
-f29+ (7)
and hence p c < *1- = (lo;N)i+:p ~
The discussion in Appendix A shows that in the part of the intermediate regime in which P > the correlated part of the noise is negligible. reproduces the result of The case ij=O, that is, fixed finite Tsodyks (1990), with a capacity
i,
6 Simulations
We have carried out extensive simulations to test the predictions of the theoretical estimates in the most extreme case, that of optimal storage in 2-state synapses and 0-1 neurons. In fact, to make the test of the theory more stringent, we have tested separately the asymptotic behavior of the signal and the noise. In the simulations the parameters were set as follows: log N I+ = 1, I- = 0. f ( N ) =A- N (6.1) 9+ = 1, 9- =f 1
with fixed A = 4. The signal and the noise are estimated for each choice of parameters N and p in the following way: A sequence of N p = 500 + p random N-bit words is generated, the stimuli to be learned. p is the maximal age of a pattern to be tested. Each word is generated by assigning l's, at random, with probabilityf. All 500+p patterns are presented consecutively to the network. Upon the arrival of each pattern ( p , learning takes place, modifying the synaptic matrix according to the learning rule described at the beginning of Section 5. Following the leaming of ( p ( p < / L < p 500) the state of the network is set to the pattern of age p , si = Jp-P, that is, the stimulus to be retrieved. Then, with the new synaptic matrix I;, we calculate the average of the postsynaptic input over the foreground and the background neurons in order to estimate the conditional mean of equation 2.3, that is,
+
(hP)C(P) =
1 -
Nc
c
I:(%=<)
,#I
Learning in Neural Networks with Material Synapses
969
-1
-1.5 -8
-8.5 -9 -9.5
-7
-7.5 -8
-8.5
-9 -9.5 50
100
150
200
5u
100
150
200
Figure 1: Logarithm of signal vs. number of memories p for fixed N: (A) N = 600, (B)N = 800, (C)N = 1000, (D) N = 1400. The lines are a linear fit of the mean signals. The slopes a l ( N ) are reported in Table 1. Error bars are rms in measured signals. where the index i runs over all neurons for which (‘f”-P = C(= 0 , l ) ; Nc is the number of neurons with si = C(=O,l) in the pattern presented. From these data we compute the square of the signal as the average over all presentations, that is,
And the noise R2 is calculated as half the sum of the standard deviations of h around and ( h p ) ~ . These results are then compared to the theoretical estimates. In particular we have tested the dependence of S2 on N for fixed p and its dependence on p for fixed N. The theoretical expectations for the square of the signal, equation 5.7, are
Daniel J. Amit and Stefan0 Fusi
970
wheref(N) is defined in equation 6.1. With the present choice of parameters AM
=
1 - 3 f 2 ( ~+ ) 2j3(~)
The theoretical upper bound estimate for the noise can be obtained from equation 5.8. One has
(6.3)
If equations 6.2 and 6.3 are verified in the regime of the asymptotic behavior in N, then the number of storable and retrievable patterns can grow as N 2 / log2N. Indeed, as long as p is bounded by this value, there exists a threshold that separates the depolarization of neurons that should be active from those that should be quiescent. Equation 6.2 is written in the form
yl
+ b, ( N )
= log S2 1U I ( N ) p
(6.4)
with Q ( N )= 2log[l - 3f2(N)
+ 2f"N)l
(6.5)
The four insets in Figure 1 present logs2 vs p for N = 600, 800, 1000, and 1400. The straight lines are a fit of the mean signals by equation 6.4. From these fits we find values for u l ( N ) that are compared in Table 1 to the theoretical values given by equation 6.5 for several values of N. The agreement represented in the table implies that in the entire range of values of N and of p tested in the simulations one is already in the asymptotic regime for the behavior in N and p . Hence the fact that in this region S 2 / R 2 > logN implies, in turn, storage capacity quadratic in N. The behavior of S2 vs N, for the same value of A, is presented in Figure 2 where S2 is plotted as function of N for four different values of p (20, 30, 40, 60). The continuous line represents the theoretical estimate while the points are simulation results. The agreement improves with increasing N although, even for small N, the theoretical lines pass through the errorbars. It is worth noting that in case D the nonmonotonic behavior around N = 400 is captured by the theory. Finally the upper bound on R2 is tested in Figure 3. In particular, R2 is plotted as a function of p for N = 600, 800, 1000, and 1400. The noise tends to its asymptotic value, and is always below the straight dotted line which represents the upper bound (equation 6.3). The value of the upper bound decreases with increasing N and R2 approaches its asymptotic limit more slowly when N is larger. This is due to the fact that for large N AM is closer to 1, and the correction to asymptotic distribution goes to zero more slowly (see equation 6.3).
971
Learning in Neural Networks with Material Synapses Table 1: Testing the Asymptotic Regime."
N
theoretical a1
400 500 600 700 800 900 1000 1100 1200
-0.0208 -0.0144 -0.0106 -0.0082 -0.0066 -0.0054 -0.0045 -0.0038 -0.0033
a, from simulations -0.0218 -0.0148 -0.0110 -0.0086 -0.0068 -0.0056 -0.0045 -0.0040 -0.0033
0.0029 0.0022 0.0021 0.0019 0.0019 0.0019 0.0017 0.0017 0.0016
Tomparison between theoretical n, (N), equation 6.4, and the value measured in simulations.
0.0014
0.001
0,0006
0,0002
0.0014
0.001
0. OOOf
0.000; 400
700
1000
1300
1600
700
1000
1300
1600
Figure 2: S2 vs N for several values of p : (A) p = 20, (B) y = 30, (C) p = 40, (D) p = 60. Dots are simulations results. The continuous line is the theoretical prediction (equation 6.2). Note the improvement of agreement with increasing N.
Daniel J. Amit and Stefan0 Fusi
972
5e-05 7e-05 5e-05 3e-05 le-05
II
Se-05
D
7e-05 5e-05
....._.._...___.______._____.____.______._.__ 3e-05 le-05 50
100
150
200
50
100
150
200
Figure 3: Simulation results for R2 vs. p : (A) N = 600, (B) N = 800, (C) N = 1O00, = 1400. The horizontal lines are the theoretical upper bound (equation 6.3). When p grows then R2 approaches exponentially its asymptotic value.
(D)N
7 Multistate Synapses and fl Neurons The second example we consider, to demonstrate the dependence of the performance of an autoassociative network on the number of stable synaptic states, is a network of f l neurons (G = 1, (2 = -1) and synapses with n stable states:
., n - 1
(7.1)
Each pattern (,"is a random word of N f l bits chosen independently and with equal probability [Pr(J = 1) = Pr(( = -1) = 1/21. Upon presentation of a pattern a synapse is potentiated Urn + ] m + l ) with probability q if the source (f"(,!=lor depressed with the same probability -+ ]rn-l) if (,"(,!'= -1. If a synapse is at one of its extreme limits and is pushed on, its value is unchanged. So in the process of the presentation of patterns a synapse undergoes a random walk between two reflecting barriers. Note that in this model a stimulus communicates information to all N neurons and qN2 synapses are modified by each stimulus.
urn
Learning in Neural Networks with Material Synapses
973
The stochastic transition matrix MK/is tridiagonal with 9/2 along the two side diagonals: 1- 9 along the main diagonal, except in the first and last positions where it is 1 - 9/2. Since this matrix is symmetric, its right and left eigenvectors are identical and hence the asymptotic distribution is uniform, that is
and hence h, = 0 due to the f l symmetry. The full set of eigenvalues of M is
xa=1-29sin with
LY
2 77.Q
-
2n
= 0,. . . , n - 1 (A, =
1). Its second largest eigenvalue XI
=
AM is
and the corresponding eigenvector is
k = 0,.. ., n - 1
(7.3)
where, for large n, the constant c behaves as 42/n. The contribution (h)+l to the signal is @)+I
=
1 1 [&It 1)-
-111
/
where the index J runs over all the stable synaptic values J = J m , m= 0 , . . . , n - 1 (equation 7.1). Recall that (h)-l = - @ ) + I . Substituting the expression for the conditional distributions one finds
The difference of the two conditional distributions, following the presentation of the oldest pattern t', is
c
-1)= -29,0,. . . .
n
(7.5)
This form follows from the observation that when = 1 is presented to the uniform asymptotic distribution it leaves the probability of all states invariant except the two extreme ones: The lowest state that loses a fraction 9 of its occupation, becoming (1- 9)/n and the top one which gains the balance and becomes (1 + 9)/n. = -1 does the opposite, producing the following two conditional di.&butions:
=
1 -(I + 9 , 1 , . . . ,1.1 - 9)
n
Daniel J. Amit and Stefan0 Fusi
974
Hence, taking 2(h)+l of equation 7.4 and substituting equations 7.5 and 7.3 the signal becomes
where the constant C is independent of q, n, N , p . Note that the signal decreases with increasing n. This is a consequence of the fact that the process has a uniform asymptotic distribution of values. Even after the presentation of a single pattern, on the background of the asymptotic distribution, the signal will decrease as l / n . The noise can be calculated using equations 3.4 and 3.5:
In this case, due to the f symmetry of the process, the contributions of the synaptic correlations to the noise vanish identically. The dependence on ( is contained in the term O(AL-'), which vanishes as p + 00. Hence
where J m is given by equation 7.1. Substituting Jm one has
which does not depend on n, because the sum grows linearly in n. Hence the final signal-to-noise ratio behaves like (7.9)
where K is the ratio of C2 to the part of R2 that does not depend on N . Hence pc
1 q2N < -2 log AM log (nzlogN)
(7.10)
Again, if q and n are fixed, the capacity is logarithmic. On the other hand, if q and n are chosen so that q/n2 + 0 then AM tends to 1 and x of equation 4.1 is
Equation 4.2 gives (7.11)
Learning in Neural Networks with Material Synapses
975
7.1 Extremal Case. The constraint on the allowed variation of n and 9, equation 3.8, is equivalent to the requirement that the log in equation 7.11 be positive. Writing the probability q in the form
logN
92 = ( 7) We must have /3 2 0, since q must lie in the interval [0,1].Moreover, the constraint implies that
and since n > 1, /3 < 1. Substituting n and q in equation 7.11 we find (7.12) Since /3 is restricted to the interval [OJ], the number of stable synaptic If n becomes larger, the network is no longer states cannot surpass a palimpsest and all memory is destroyed together. When n reaches this limit (p = 0) the number of retrievable memories is proportional to N (see, e.g., Parisi 1986). The price is that the number of synaptic states is not a property of the synapse, it increases with the size of the network. If one introduces a stochastic transition mechanism with q # 1 (P > 0) then it is possible to store more than patterns. When P varies from 1 to 0 the process interpolates between p N fito p N N.
a.
7.2 The Role of the Other Eigenvalues. As was discussed at the end of Section 4, when AM 1 the contribution of the other eigenvalues must be reexamined. Equation 7.2 for the eigenvalues implies that all n of them tend to 1 as n 00 or q + 0. Nevertheless, we show in Appendix B that the dependence of both the signal and the noise on n and on q remains unchanged. -+
-+
8 Discussion
We have tried to open a discussion of the consequences of synaptic dynamics that may be taking place in a network that receives a temporally unconstrained stream of stimuli and maintains the same neural and synaptic dynamics whether the network is engaged in computation or in learning new memories. The requirement that memory be preserved in
976
Daniel J.Amit and Stefan0 Fusi
the synapses for long times has induced us to postulate that synapses have finite sets of states that are stable. In between such states the synapse is assumed to be able to vary continuously, but the analog values can be maintained only for short times and their main role is to allow a synapse, based on the neural activity in the neurons connected by it, to cross thresholds for transitions between neighboring stable states. Whether biological synapses have such ladders of stable states is a question of neurophysiology and biochemistry. Given recent progress in measuring synaptic efficacies between single pyramidal neurons (Mason et al. 1991), the neurophysiological test may soon be feasible. On the other hand, in electronic implementations of unsupervised neural networks this mechanism has proved very natural and effective. The stable states of a synapse are essentially (see, e.g., Amit et al. 1992; Badoni et al. 1992) an asynchronous refresh mechanism operating on some capacity associated with the synapse. The simplicity of implementation, the economy in means, and the accessibility to analysis should make this scheme rather attractive. In studying the retrievability of learned patterns only the existence of a potential threshold has been considered. We have ignored the possibility that for different stimuli this threshold may differ. In fact, it does, mostly due to fluctuations in the number of active neurons from pattern to pattern. We have noticed elsewhere (Amit and Brunel 1993) that this problem is naturally overcome by an unstructured inhibition reacting in proportion to the total excitatory activity. Another issue mentioned but not developed concerns the possible analog nature of the information coding in the stimuli. It has been raised in Section 1 in connection with the origin of the nondeterministic nature of the synaptic transition given the same pair of digitally coded information bits. In the discussion of the retrieval we have considered only the digital representation of the stimuli that had been learned. If in fact the fluctuating nature of the transition probabilities is related to the fluctuations of neuronal activity variables, such as spike rates, one must test also the retrieval of patterns that have fluctuating variables. We have not done this, either theoretically or in the simulations, but we believe that the modification should be minor. The reason is that for the digital coding to make sense it must represent approximately the analog variables. In other words, a neuron with a 0 digital code will have low frequency and one with a digital code of 1 will have high frequency. Thus the difference between the presentation of the analog vs. the digital pattern for retrieval can be considered as noise on the incoming stimulus to be retrieved. We have emphasized the issue of the palimpsest behavior of the networks. In the present context this type of behavior is quite natural. One may wonder whether experience indicates that brain functions as a palimpsest. We are not familiar with any direct evidence that this is the case, yet the issue is not moot. First, if the storage capacity of any
Learning in Neural Networks with Material Synapses
977
cortical module is of relevance, clearly the behavior of that module near capacity becomes important. At that point it is pertinent to raise the question of whether it behaves as a palimpsest or not. Experience does not produce the impression that old memories are replaced by new ones. In fact, often one has the opposite impression, that is, that old memories never die. Yet it should be kept in mind that the theoretical treatment presented here has dealt with strings of stimuli that are uncorrelated. The repetition of some subclasses of stimuli in the process of learning may create privileged memories. How this is included in a theoretical framework we leave as an open question. What seems important to realize in this context is that it is quite possible that a module will receive a very rich stream of stimuli. Since there is no dynamic distinction between those that should be learned and those that are transient, all make some modification of the synaptic structure. It may be the case that most of what enters the module is noise and hence that what is learned is learned on the background of an asymptotic distribution of synaptic values. This is the deeper sense of palimpsest behavior in our context. This connects with another question: what is the dynamics for leaming correlations in the input stream? Such correlation may be of two types: there may be correlations in the spatial activity distribution of patterns in the afferent sequence. Or it may be that the system manages to learn temporal correlations in the sequence, as is implied by the experiments of Miyashita (1988; Griniasty et al. 1992). In both cases there is a need for an extension of the techniques presented here. It appears that in some cases such extensions are not unsurmountable (Brunel 1993). Finally, one may be struck by the discrepancy between the tight storage bound that we find for networks with fixed parameters and the results on the capacity of networks with f l synapses of Amaldi and Nicolis (1989), Gutfreund and Stein (1990), and Krauth and Mezard (1989), which give a capacity linear in the number of neurons. Our conclusion is that there is no local learning algorithm that can lead to those matrices. Nonlocality is invoked in a double sense; it is spatial as well as temporal-spatial, because one needs more than the two activities imposed by the stimulus on the pair of neurons connected by the synapse to be modified and temporal, because one must know all the stimuli simultaneously while deciding if a modification is acceptable or not.
Acknowledgments We are indebted to Profs. Giorgio Parisi and Fabio Martinelli for advice concerning random walks between reflecting barriers and to Nicolas Brunel for discussions. We have a special debt to Prof. Yali Amit for pointing out to us the role of synaptic correlations that we overlooked in a previous version of this article.
978
Daniel J. Amit and Stefano Fusi
Appendix A The full expression for the variance of the neuronal input about its mean in one of the classes is
The first term on the last line is the term that ignores correlations between the distributions of ]i, and ];k. The second one, which is of order 1 as N + 00, is due to the correlations. At this stage we can conclude that the uncorrelated contribution to the variance, the first term in equation A.l, gives a lower bound on the total. The reason is that since in the large N limit the second term dominates the variance, it must necessarily be positive. Otherwise the total variance may become negative. Hence the second term can only increase the total variance. To calculate the first term in the second square brackets requires the conditional probability distribution P(];jJ;k I t&&), where [ is the value of (! common to both synapses. Thus we need the difference
To obtain the first term one has to use an equation of the type of equation 1.1, for the conditional probability that a pair of synapses with a common neuron has a given pair of values, conditioned on the values of the three neurons-i (the common one), j and k, in upon the presentation of pattern number 1. The distribution p ( I l J 2 I [J1[2;2), for fixed t, has four values, since each of the two Js can have two values. Hence, the transition matrix, corresponding to M in equation 1.1,is a 4 x 4 matrix. We have computed this matrix as well as the difference with the matrix driving the two synaptic values in the uncorrelated case for the model described in Section 5. The latter matrix is simply the outer product of the two 2 x 2 matrices of equation 5.1. We skip the details, which are straightforward but tedious, and summarize the results: The difference between the correlated transition matrix and the uncorrelated one is again a 4 x 4
Learning in Neural Networks with Material Synapses
979
matrix. For small values off, 9-, and 9+ its elements are all dominated by the largest of the terms:
fq5
f29-9+7
f39:
(A.3)
When the transition matrix is raised to the power p, the contribution to the difference 6pP can be estimated by
pMP-'GM where 6M is the differenceof the transition matrices in the correlated and the uncorrelated cases and Mp-l is the uncorrelated transition matrix iterated p - 1 times. Both terms in the correlated contribution to the variance are proportional t o p , since in the averages they contain two independent sums over variables E. Hence the estimate of the correlated contribution isf multiplied by the largest of the three terms in A.3. On the other hand, the uncorrelated part of the variance is dominated by f / N for small f and large N (see, e.g., equation 5.8). Example 1. if 9- and 9+ are fixed, as N becomes large, the leading term in the correlated part of the variance comes from the term linear in f and the uncorrelated part will dominate as long as
f
Pf3
Hence, in particular, when f log N / N and p =f -I, the correlated term takes over, leading to a violation of the retrieval criterion. N
Example 2. The intermediate cases discussed in Section 5. Taking q- = fq+ all three terms in A.3 become of the same order: f"9:. Multiplying by pf and comparing to f / N gives for the leading terms in the variance
f + PNf493 -0 N where the second term in the parentheses is due to the correlations. With the notation of Section 5, that is, logN
1/2+3/2P
the correlation term becomes
For this term to become negligible compared to 1, we must have /3 > 1/3.
Daniel J. Amit and Stefan0 Fusi
980
Appendix B Here we verify that the result (7.12) remains unchanged when one includes the contribution of all the eigenvalues in the calculation of the signal-to-noise ratio in the limit n -+ 03 and 9 0 when N --* 03. The eigenvalues of M are given by equation 7.12 as -+
7r a
A, = 1 - 2qsin2 -
2n
with N = 0, . . . , n - 1, and hence all go to 1 in the limit we are discussing. Denoting by S, the contribution to the signal from the eigenvalue A:,
The corresponding eigenvectors are, for n large nkrv
2
Hence, for a even, S,=O, and for S,
N
AP,-'-
8q
n2 n i = o
Im cos
N
odd, with Im given by equation 7.1,
(2) = A'-:
(1 - 2x) cos(7rox)dx (B.l)
when n is large. Carrying out the integration one has
S,
N
329 Ag"
-n7r2
a2
for a an odd integer and S,=O otherwise. The total signal is
The sum in the above expression for S neither diverges nor vanishes as n 03. It cannot diverge since all the eigenvalues are less than 1, so the series in a converges to a finite value. It cannot vanish because, when p x-' as in equation 7.11, there is at least one A! that tends to a constant different from zero. Furthermore, its sum cannot vanish by a cancellation, since the number of positive terms is greater (4 < 1) or equal (9 = 1) to the number of negative ones and for each negative term there is a positive one with a greater absolute value. As a consequence the inclusion of all the eigenvalues leaves the dependence of the signal on the learning parameters unchanged in the limit of large n and small 9, it has the form equation 7.7. -+
-
Learning in Neural Networks with Material Synapses
n
981
The noise around +1 is equal to the noise around -1. In the limit and 9 -, 0, it can be written as
+ 00
The second term on the right hand side of the first equality is the subtraction of (I$): in equation 2.4. The sum of the two conditional distributions p', given by equation 7.6, gives a uniform vector proportional to the asymptotic distribution. So the vector pk(1,l) + p i ( l , -1) is invariant when multiplied by matrix M and hence
So the asymptotic behavior of the noise, as a function of n and 9, preserves its form in the case of a single dominant eigenvalue, equation 7.8. References Amaldi, E., and Nicolis, S. 1989. Stability-capacitydiagram of a neural network with king bonds. J. Phys. France 50, 2333. Amit, D. J., and Brunel, N. 1993. Adequate input for learning in attractor neural network. NETWORK 4, 177. Amit, D. J., and Fusi, S. 1992. Constraints on Learning in Dynamic Synapses. NETWORK 3,443. Amit D. J., Fusi, S., Genovese, S., Badoni, D., Riccardi, R., and Salina, G. 1992. LANN: Learning attractor neural network, model and hardware implementation, (INFN internal report, in Italian). Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1987. Statistical mechanics of neural networks near saturation. Ann. Phys. 173, 30. Badoni, D., Riccardi, R., and Salina, G. 1992. Learning attractor neural network The electronic implementation. International Journal of Neural Systems, Vol. 3, pp. 13-24. Brunel, N. 1993. Private communication. Nadal, J.-P., Toulouse, G., Changeux, J. P., and Dehaene, S. 1986. Networks of formal neurons and memory palimpsests. Europhys. Lett. 1, 535. Cox, D. R., and Miller, H. D. 1965. Theory of Stochastic Processes. Methuen, London. Griniasty, M., Tsodyks, M. V., and Amit, D. J. 1992. Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural COWIP.5, 1-17. Gutfreund, H., and Stein, Y. 1990. Capacity of neural networks with discrete couplings. J. Phys A: Math. Gen. 23, 2613.
Received May 19, 1993; accepted December 15, 1993.
Communicated by Raimond L. Winslow
Model Based on Extracellular Potassium for Spontaneous Synchronous Activity in Developing Retinas

Pierre-Yves Burgi
Norberto M. Grzywacz
The Smith-Kettlewell Eye Research Institute, 2232 Webster Street, San Francisco, CA 94115 USA

Waves of action-potential bursts propagate across the ganglion-cell surface of isolated developing retinas. It has been suggested that the rise of extracellular potassium concentration following a burst of action potentials in a cell may underlie these waves by depolarizing neighbor cells. This suggestion is sensible for developing tissues, since their glial system is immature. We tested whether this extracellular-potassium suggestion is feasible. For this purpose, we built a realistic biophysical model of the ganglion-cell layer of the developing retina. Simulations with this model show that increases of extracellular potassium are sufficiently high (about fourfold) to mediate the waves, consistently with experimental physiological and pharmacological data. Even if another mechanism mediates the waves, these simulations indicate that extracellular potassium should significantly modulate the waves' properties.

1 Introduction
Recent multielectrode and optical-recording studies provided evidence that waves of bursts of activity propagate across the ganglion-cell surface (Meister et al. 1991) and inner plexiform layer (IPL) of the developing retina (Wong et al. 1992), correlating the activity of neighbor cells (Maffei and Galli-Resta 1990). Although the role of these waves has not yet been assessed, their possible contribution to forming layers of ocular dominance in the lateral geniculate nucleus (LGN), as well as to the refinement of a topographic map between retina and LGN, has been hypothesized (Jeffery 1989; Meister et al. 1991). What is the mechanism underlying these waves of bursts of activity? Because the waves propagate slowly (approximately 200 msec delay between neighbor ganglion cells in mammals), Meister et al. (1991) argued that it is unlikely that the waves are mediated by a fast synaptic mechanism or gap junctions. Further evidence against a gap-junction mechanism came from the preliminary evidence that octanol and dopamine, gap-junction blockers, have no significant effect on the correlation of
firing between neighbor ganglion cells (Sernagor and Grzywacz 1993a, 1994). Below we discuss more evidence that synaptic transmission is not likely to be the sole mechanism for wave propagation in mammals. Instead of synaptic and gap-junction mechanisms, Meister et al. (1991) hypothesized that an extracellular agent, such as K+, might underlie the waves. Under this hypothesis, the ejection of K+ by a cell's action potentials increases this ion's concentration in the extracellular space ([K+]out) and thus excites the neighbor cells. The significant reduction of the correlation of firing between neighbor cells, and the increase of interspike intervals, when one blocks K+ conductances with Cs+ and tetraethylammonium (TEA) (Sernagor and Grzywacz 1993a) are consistent with (although do not prove) this hypothesis. Moreover, it is quite plausible that [K+]out increases more in developing retinas than in adult retinas, where waves of bursts do not occur (Masland 1977; Meister et al. 1991). The reason is that uptake of K+ from the extracellular space by Müller cells (see Dowling 1987 for a textbook treatment) might be slower in developing retinas (Sernagor and Grzywacz, personal communication). This immaturity of the retinal glial-cell system at early stages of development has already been reported (Rager 1979). The aim of this paper is to build a realistic neural network model of the ganglion-cell layer of the developing retina to test whether modulation of [K+]out could in principle mediate the waves (comparison of anatomical (McArdle et al. 1977; Maslim and Stone 1986; Horsburgh and Sefton 1987) and physiological (Masland 1977; Maffei and Galli-Resta 1990; Meister et al. 1991) data shows that the correlated burst activity does not require other retinal cells). This model includes voltage- and calcium-dependent conductances, extracellular space, and synaptic inputs. Constraints on this model came from several studies in a variety of species during embryonic or postnatal periods. By using these constraints, we could build a model that is at least qualitatively consistent with the data from all species where correlated bursts have been studied. Above, we described speed-of-propagation and some pharmacological constraints. Other studies obtained values for the silent interval between bursts, the rate of action potentials inside bursts, and the burst duration (Maffei and Galli-Resta 1990; Meister et al. 1991; Sernagor and Grzywacz, unpublished data). In turtle, the burst duration increases upon application of Cs+ and TEA (Sernagor and Grzywacz 1993a). Furthermore, a dramatic reduction of activity with low Ca2+ and high Mg2+ (or in the presence of Co2+, a calcium channel blocker) and the increase in activity with neostigmine (a cholinesterase inhibitor) suggested a synaptic role, at least partially cholinergic, for the propagation of waves in turtles (Sernagor and Grzywacz 1993a). A caveat is that such a cholinergic mechanism may not be mediated by conventional synapses (Zucker and Yazulla 1982; Lipton 1988). In any event, synaptic transmission is not likely to be the sole mechanism for wave propagation in mammals, as correlated activity occurs in the embryonic rat's retina (Maffei and Galli-Resta 1990),
despite the absence of conventional synapses in the IPL (Horsburgh and Sefton 1987). Similarly, in cat, waves appear on or before embryonic day 52 (E52) (Meister et al. 1991), whereas the first conventional synapses onto ganglion cells appear on E56 (Maslim and Stone 1986). Moreover, waves still exist under low Ca2+ without (Meister et al. 1991) and with (R. O. L. Wong, personal communication) high Mg2+, and thus in the absence of conventional synaptic transmission. Although synapses might not be the main mechanism for wave propagation early in mammalian development, it has been suggested that they play a role in mammalian waves (Wong et al. 1992), perhaps later in development. The new biophysical model is first presented in Section 2 and then described in mathematical detail in Section 3. The model's behavior has been studied by computer simulations, of which wave-propagation results are presented in Section 4 (simulations of pharmacological results will be presented elsewhere; Burgi and Grzywacz 1994). The general discussion in Section 5 includes, among other things, a discussion of the choice of parameters and of alternative models evaluated in terms of biological constraints. The material described in this paper appeared in abstract form elsewhere (Burgi and Grzywacz 1993a,b).

2 Description of the Model
In our model, K+ is extruded from a ganglion-cell soma during action-potential activity. This extrusion increases [K+]out, leading to the depolarization of neighbor cells. Consequently, excitation propagates from cell to cell in a wave-like manner. The wave is prevented from propagating backwards by the activation of a K+ conductance in the ganglion cells' dendrites. This K+ conductance increases as intracellular Ca2+ builds up during a burst. Müller cells remove K+ from the extracellular space to reset the retina. Our goal was not to build a model that included all the intricacies of the developing retina, such as all ionic channels and a detailed description of the dendrites. For instance, we did not use the detailed model for adult amphibian ganglion cells of Fohlmeister, Coleman, and Miller (1990), which uses five distinct channels in the soma alone (but no dendrite). That model predicts action-potential trains with high accuracy, but it is unclear whether its five channels exist in developing retinas and whether other channels are expressed in developing, but not adult, retinas. Thus, rather than using an overly intricate model with elements unknown to exist in developing retinas, we used a simple model that included only the elements necessary to account for the experimental observations on correlated spontaneous activity. Nevertheless, we modeled these elements as realistically as possible to test the feasibility of whether ejection of K+ into the extracellular space within the vicinity of ganglion cells may underlie the waves of bursts of action potentials.
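Before turning to Figure 1, here is a minimal sketch (our own construction, not code from the paper) of the lattice bookkeeping this geometry implies: cells sit on a hexagonal grid, each triple of mutually neighboring cells shares one extracellular K+ pool, and each interior soma borders six such pools. The names (Cell, radius R) are ours, and a 217-cell hexagon only approximates the paper's 241-cell network.

```cpp
// Sketch of the hexagonal cell lattice of Figure 1, with one shared
// extracellular space per triple of mutually neighboring cells.
#include <algorithm>
#include <array>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <set>
#include <vector>

struct Cell { int q, r; bool border; };   // axial hex coordinates

int main() {
    const int R = 8;  // hexagon "radius": 217 cells (the paper used 241)
    std::vector<Cell> cells;
    std::map<std::pair<int,int>, int> index;  // (q, r) -> cell id
    for (int q = -R; q <= R; ++q)
        for (int r = std::max(-R, -q - R); r <= std::min(R, -q + R); ++r) {
            bool border = (std::abs(q) == R || std::abs(r) == R ||
                           std::abs(q + r) == R);
            index[{q, r}] = (int)cells.size();
            cells.push_back({q, r, border});
        }

    // Six neighbor offsets; consecutive offsets k and k+1, together with
    // the central cell, delimit one triangular extracellular space.
    const int DQ[6] = {1, 1, 0, -1, -1, 0};
    const int DR[6] = {0, -1, -1, 0, 1, 1};
    std::set<std::array<int,3>> spaces;  // each space stored once, ids sorted
    for (int i = 0; i < (int)cells.size(); ++i)
        for (int k = 0; k < 6; ++k) {
            auto a = index.find({cells[i].q + DQ[k], cells[i].r + DR[k]});
            auto b = index.find({cells[i].q + DQ[(k + 1) % 6],
                                 cells[i].r + DR[(k + 1) % 6]});
            if (a == index.end() || b == index.end()) continue;
            std::array<int,3> t = {i, a->second, b->second};
            std::sort(t.begin(), t.end());
            spaces.insert(t);
        }
    std::printf("%zu cells, %zu extracellular spaces\n",
                cells.size(), spaces.size());
}
```

An interior cell touches six of these spaces, as the model description requires; border cells touch fewer, consistent with the assumption that K+ ejected outward at the border is dissipated.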
Figure 1: Model of ganglion cell layer. In this model, the cells are in a hexagonal arrangement. The circles delimit the area in which the cell is located, not the cell's size; the closest membrane-to-membrane distance between two neighbor cells is 12 μm. Wave initiation occurs at random border cells, shown in dark gray, where an external current is injected at times randomly drawn from a Poisson distribution. The three "electrodes" show cells whose spike activity is presented in Figure 2. Excitation propagates from cell to cell via extracellular K+.

We built a hexagonal-geometry network of ganglion cells (Fig. 1), whose somas had Hodgkin-Huxley-like Na+ and K+ conductances, besides leak conductance and capacitance. This network approximates one layer of the three-dimensional ganglion cell architecture. Such an approximation assumes that during a wave's propagation [K+]out is equal in all the layers and thus there is no diffusion of K+ from layer to layer. The assumption would fail near the extreme layers, where loss of K+ would slow down wave propagation. The mechanism used to prevent waves from propagating backward is a calcium-activated potassium conductance (g_AHP), known to be slowly activated and sensitive to intracellular calcium concentrations ([Ca2+]in) in the range of 0.1-1 μM (Blatz and Magleby 1986). (This range of concentrations has been observed in dendrites of nerve cells; Regehr et al. 1989.) Support for such a mechanism comes from the elongation of burst duration during application of Cs+ and TEA (Sernagor and Grzywacz 1993a) and from the elevation of [Ca2+]in during a burst (Wong et al. 1992). However, neither the effect of Cs+ and TEA nor the elevation
of [Ca2+]in is direct evidence for g_AHP. Furthermore, only a few papers have reported the presence of calcium-activated potassium currents in ganglion cells (Lukasiewicz and Werblin 1988; Fohlmeister et al. 1990). Therefore, our confidence in the role of g_AHP in waves is small. We only used this conductance because its slow kinetics is consistent with the long interval between bursts. Instead, other slow conductances might mediate the bursts' stoppage. In a subsequent paper (Burgi and Grzywacz 1994) modeling the effects of various pharmacological agents, we will address the likelihood of a calcium-dependent potassium conductance. In our model, Ca2+ entrance into the intracellular space is mediated through a fast Ca2+ conductance (g_Ca), activated by cell depolarization (Borg-Graham 1991). This Ca2+-dependent K+ conductance mechanism is also consistent with the effects of neostigmine, as an increase in the cell's excitability by this drug should shorten the interspike interval, resulting in a faster increase in [Ca2+]in and, thus, a shorter burst duration. Using the same substance (K+) for mediating excitation and inhibition may appear contradictory. We solved this contradiction by spatially separating the two processes. Ejection of K+ resulting from depolarization of the cell is confined to the extracellular space around the soma, whereas ejection of K+ resulting from g_AHP (and the associated Ca2+ conductance) occurs around the dendrite. At present, there is no evidence for the confinement of AHP and Ca2+ conductances to the dendrite. Exploratory computer simulations showed that if the K+ ejected from the soma were to affect the reversal potential of g_AHP, then it would be harder, although still possible under limited conditions, to stop the bursts. This result provides a computational justification for our choice of the confinement of AHP and Ca2+ conductances to the dendrite. Finally, the dendrite also receives an excitatory synaptic input (at least partially cholinergic) from amacrine cells to account for the low Ca2+ (or high Co2+) and neostigmine effects in turtles (Sernagor and Grzywacz 1993a). (This input may not involve conventional synapses; Zucker and Yazulla 1982; Lipton 1988.) Although the model uses only one type of synapse, it is conceivable that other synapses, including possibly inhibitory ones, modulate the activity in the IPL. Hence, a limitation of our model may be the exclusion of other synapses involved directly or indirectly in the modulation of the waves' properties.

3 Methods
Each ganglion cell was modeled using two compartments, representing the dendrite and soma, connected by an axial conductance. Action potentials at the soma were determined by integrating the general membrane equation

\[ C\,\frac{dV}{dt} = I_{\mathrm{axial}} - I_{\mathrm{Na}} - I_{\mathrm{K}} - I_{l\mathrm{K}} - I_{l-\mathrm{K}} - I_{\mathrm{border}} \tag{3.1} \]
where V is the somatic voltage, C is the membrane capacitance, I_axial is the axial current flowing from the dendrite, I_Na and I_K are the currents generated by the Na+ and K+ voltage-dependent conductances, respectively, I_lK is the portion of the soma's leak current dependent on K+, I_l-K is the portion of the soma's leak current not dependent on K+, and I_border is a current injected into cells situated at the retinal border to initiate a wave (see below). The currents I_Na and I_K were determined by using a minimal biophysical model that condenses the Hodgkin-Huxley equations (Av-Ron et al. 1991; see Appendix). The current I_axial was given by the contribution of the conductances and Nernst potentials at the dendrite. After eliminating the dendritic voltage from the Kirchhoff equations, one gets

\[ I_{\mathrm{axial}} = \frac{g_{\mathrm{axial}}\left(\sum_j g_j E_j + i_{\mathrm{syn}} - g_t V\right)}{g_{\mathrm{axial}} + g_t} \tag{3.2} \]

where i_syn is the (constant) synaptic current and g_t = Σ_j g_j = g_AHP + g_Ca + g_d, where g_AHP is the calcium-dependent potassium conductance, g_Ca is the transient voltage-dependent Ca2+ conductance, and g_d is the leak membrane conductance at the dendrite. (The assumption of constant synaptic current may be a limitation of the model, as the introduction of a stochastic current might modulate some of the bursts' properties.) The conductance g_AHP was expressed as a function of the fraction w(t) of a gating particle in the open state as follows:

\[ g_{\mathrm{AHP}}(t) = \bar{g}_{\mathrm{AHP}}\, w(t) \tag{3.3} \]
where ḡ_AHP is the maximal conductance. The differential equation for w, with forward and backward rate constants α and β, respectively, was (Borg-Graham 1991)

\[ \dot{w}(t) = \alpha\,[1 - w(t)]\,[\mathrm{Ca}^{2+}]_{\mathrm{in}}^{3} - \beta\, w(t) \tag{3.4} \]
which means that the binding of three calcium ions is required to reach the open state. Variations in [Ca2+]in as a function of the calcium current I_Ca were described by the differential equation

\[ \frac{d[\mathrm{Ca}^{2+}]_{\mathrm{in}}}{dt} = -\frac{I_{\mathrm{Ca}}}{2 F v_d} - \frac{[\mathrm{Ca}^{2+}]_{\mathrm{in}} - [\mathrm{Ca}^{2+}]_{\mathrm{rest}}}{\tau_{\mathrm{Ca}}} \tag{3.5} \]

where F is the Faraday constant, the factor of 2 is for the Ca2+ valence, v_d is the dendritic volume, τ_Ca is a decay time constant, and [Ca2+]rest is the resting concentration in the dendrite. The calcium current that flows into the dendrite (negative by convention) was determined by using a two-state gate model (Borg-Graham 1991), which is described in the Appendix.
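To make equations 3.2-3.5 concrete, here is a minimal sketch (our own illustration, not the authors' C++ code) of one forward-Euler update of the dendritic variables and the resulting axial current. Parameter values follow our reconstruction of Tables 1 and 2, where some exponents in the scanned tables are uncertain; the names Dendrite and dendriteStep are hypothetical.

```cpp
// Sketch of the dendritic-compartment update, equations 3.2-3.5.
// Units: nS, mV, pA, mM, ms.
#include <cstdio>

struct Dendrite {
    double w  = 0.0;      // AHP gating particle (equation 3.4)
    double Ca = 150e-6;   // [Ca2+]in in mM (150 nM at rest)
};

const double gAHPbar = 4.0, gCa = 0.017, gd = 1.0, gAxial = 30.0; // nS
const double EAHP = -88.0, Ed = -70.0;                            // mV
const double alphaW = 2.0e6, betaW = 0.002;    // ms^-1 mM^-3, ms^-1
const double tauCa = 3300.0, CaRest = 150e-6;  // ms, mM
const double Fconst = 96485.0;                 // Faraday constant, C/mol
const double vd = 5.0e-14;                     // dendritic volume, liters
const double iSyn = -2.0;                      // constant synaptic current, pA

// One Euler step of duration dt (ms); V is the somatic voltage, ICa and
// ECa come from the gate model of the Appendix. Returns I_axial in pA.
double dendriteStep(Dendrite& d, double V, double ICa, double ECa, double dt) {
    // Equation 3.4: three Ca2+ ions must bind to open the AHP gate.
    double Ca3 = d.Ca * d.Ca * d.Ca;
    d.w += dt * (alphaW * (1.0 - d.w) * Ca3 - betaW * d.w);
    // Equation 3.5: influx (ICa < 0 by convention) plus first-order decay.
    // 1e-12 converts pA to A; A / (2 F vd) is mol/(liter sec) = M/sec,
    // numerically equal to mM/ms.
    d.Ca += dt * (-ICa * 1e-12 / (2.0 * Fconst * vd)
                  - (d.Ca - CaRest) / tauCa);
    // Equations 3.2 and 3.3 with g_t = g_AHP + g_Ca + g_d.
    double gAHP  = gAHPbar * d.w;
    double gt    = gAHP + gCa + gd;
    double sumGE = gAHP * EAHP + gCa * ECa + gd * Ed;  // sum_j g_j E_j
    return gAxial * (sumGE + iSyn - gt * V) / (gAxial + gt);  // pA = nS * mV
}

int main() {
    Dendrite d;
    double Ia = dendriteStep(d, -65.0, -1.0, 120.0, 0.1);
    std::printf("I_axial = %.2f pA, [Ca2+]in = %.3e mM\n", Ia, d.Ca);
}
```

With these values, a sustained 1 pA calcium current raises [Ca2+]in by roughly 0.1 μM per second, which is consistent with the stated 0.1-1 μM operating range of g_AHP over a multisecond burst.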
Model Based on Extracellular Potassium
989
Extracellular K+ is critical for the wave propagation in the model. This ion's extracellular concentration was considered homogeneous within the extracellular space between every three ganglion cells; that is, we neglected diffusion within each such space. We justify this simplification because cell body surfaces are separated by approximately 10 μm (Meister et al. 1991) and K+ ions diffuse in a three-dimensional aqueous medium over this distance in about 8 msec (Hille 1984; 21 msec if one takes into account the tortuosity due to the extracellular matrix, Nicholson et al. 1979). This period is much shorter than the time a wave takes to go from one ganglion cell to the next (approximately 200 msec; Meister et al. 1991).
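The 8 msec figure follows from the standard three-dimensional diffusion estimate t ≈ d²/6D. As a check (ours, assuming the free-solution K+ diffusion coefficient D ≈ 2 × 10⁻⁵ cm²/sec of the kind tabulated in Hille 1984):

\[ t \approx \frac{d^2}{6D} = \frac{(10^{-3}\,\mathrm{cm})^2}{6 \times 2\times 10^{-5}\,\mathrm{cm^2/sec}} \approx 8\,\mathrm{msec}. \]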
We modeled active uptake of K+ from the extracellular space by retinal Müller cells (Karwoski and Proenza 1980) by the following second-order removal differential equation:

\[ \frac{d[\mathrm{K}^+]_{\mathrm{out}}}{dt} = \frac{I_{\mathrm{K}}}{F v_0} - \frac{[\mathrm{K}^+]_{\mathrm{out}} - [\mathrm{K}^+]_{\mathrm{rest}}}{\tau_{\mathrm{K}}} - K_K \left([\mathrm{K}^+]_{\mathrm{out}} - [\mathrm{K}^+]_{\mathrm{rest}}\right)^2 \tag{3.6} \]

where v_0 is the extracellular volume between every three cells, I_K is the potassium current that flows into each of the six extracellular spaces surrounding the soma (except at the retinal border, where K+ flowing outward from the retina is assumed to be dissipated and not to contribute to the waves), τ_K is a decay time constant, K_K is a constant, and [K+]rest is the resting extracellular potassium concentration. Evidence that K+ uptake may be present concomitantly with waves of action potentials in developing retinas comes from recordings of b-waves in the turtle's electroretinogram (Sernagor and Grzywacz, unpublished data). Variations in [K+]out change the Nernst potential E_K, which affects the delayed rectifier current (I_K) and the portion of the soma leak current (I_lK) that is dependent on K+. The expression of the total soma leak current I_l = I_lK + I_l-K as a function of E_K is given by

\[ I_l = g_{l\mathrm{K}}\,(V - E_{\mathrm{K}}) + g_{l-\mathrm{K}}\,(V - E_{l-\mathrm{K}}) \tag{3.7} \]
where g_l = g_lK + g_l-K, and g_lK and g_l-K refer to the leak conductances whose Nernst potentials are K+ dependent and independent, respectively. A limitation in building our model is that no information is currently available on how waves begin. We assumed that cells are more excitable near the retinal border. In normal physiological conditions, this is probable, as neurogenesis occurs at the border (Polley et al. 1989), creating cells with small somata, and small or no dendrites, and thus cells that have high input resistance. Moreover, in experimental conditions, possible cuttings of dendritic trees at the border would depolarize cells. Therefore, we modeled border cells by reducing by half the dendritic volume and conductances. Waves were initiated by injecting a current I_border over a period of time T_border into a randomly chosen border-cell pair at times determined by a Poisson distribution. A numerical method based on exponential prediction (as described in Ekeberg et al. 1991) was used to solve all differential equations. The computer simulations were performed on a SUN SPARC 2 using an object-oriented language (C++). A network composed of 241 ganglion cells was simulated over a period of 200 sec.
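The exponential-prediction scheme is not spelled out in the text; the sketch below shows the variant we assume from Ekeberg et al. (1991), applied to equation 3.6 and the Nernst update of E_K. Each equation of the form dx/dt = A - Bx is advanced by holding A and B fixed over the step, which behaves well even for stiff membrane equations. Function names and the K_K value are our assumptions.

```cpp
// Sketch of an exponential-prediction (exponential Euler) step applied to
// the [K+]out equation 3.6 and the Nernst potential E_K. Units: mM, mV, ms.
#include <cmath>
#include <cstdio>

// Advance dx/dt = A - B*x by dt with A, B frozen over the step:
// exact solution x(t + dt) = A/B + (x - A/B) * exp(-B*dt).
double expEuler(double x, double A, double B, double dt) {
    double xinf = A / B;
    return xinf + (x - xinf) * std::exp(-B * dt);
}

const double Fconst = 96485.0;          // C/mol
const double v0 = 2.0e-14;              // extracellular volume, liters
const double tauK = 2000.0;             // ms
const double KK = 5.0e-4;               // ms^-1 mM^-1 (reconstructed value)
const double Krest = 2.6, Kin = 100.0;  // mM

// Equation 3.6 rewritten as dK/dt = A - B*K, with the quadratic removal
// term linearized around the current deviation from rest.
double stepKout(double Kout, double IK_pA, double dt) {
    double dev = Kout - Krest;
    double src = IK_pA * 1e-12 / (Fconst * v0);  // pA -> mM/ms (see text)
    double B = 1.0 / tauK + KK * dev;
    double A = src + Krest * B;
    return expEuler(Kout, A, B, dt);
}

double nernstEK(double Kout) {           // E_K = K ln([K+]out/[K+]in)
    return 25.0 * std::log(Kout / Kin);  // mV, with K = 25 mV at 20 C
}

int main() {
    double Kout = Krest;
    for (int i = 0; i < 1000; ++i)       // 100 ms of a steady 0.5 pA influx
        Kout = stepKout(Kout, 0.5, 0.1);
    std::printf("[K+]out = %.3f mM, E_K = %.1f mV\n", Kout, nernstEK(Kout));
}
```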
Over this period of time, the average spike frequency, [K+]out, and the I_AHP current were recorded in all cells over successive 0.5 sec intervals. Furthermore, [K+]out, I_AHP, and the soma potential were also recorded over successive 0.2 msec intervals for three neighbor cells situated in the middle of the network (shown in Fig. 1). The parameters used in the present simulations appear in the Appendix. Other simulations with a wide range of parameters and networks of up to 1921 cells were also performed for exploratory purposes, obtaining results similar to those reported in this paper.

4 Results
The activity of single cells during wave propagation is shown in Figure 2. Figure 2A shows the burst activity of three neighbor cells resulting from seven sweeping waves. Although the activity of these three cells is correlated, the firing order can change from one burst to the next, as the waves can start at different locations. This is apparent when comparing, for instance, the activity resulting from the first and second waves. The first wave hits the upper-trace cell first, then propagates to the middle-trace cell, and finally reaches the bottom-trace cell. The second wave hits these cells in reversed order. Moreover, because of differing propagation directions, the delay separating the activation of two neighbor cells can vary. The trace with extended time scale in Figure 2B shows individual spikes, within a burst resulting from the first wave, whose front propagated almost perpendicularly to the line connecting the three cells. This case yielded a delay from cell to cell of about 500 msec in our model. This speed is about a factor of 2.5 lower than that observed in cat and ferret (Meister et al. 1991). This discrepancy could be due to our choice of parameters to fit bursts' properties in the turtle retina. This cold-blooded tissue is less excitable than mammalian retinas, resulting in slower interspike intervals and thus, in our model, slower waves. Variations of the interspike interval within a burst can also be seen in Figure 2B. The initial decrease in this interval is explained by the depolarization resulting from the ejection of K+ from neighbor cells. This depolarization is apparent as an approximately 8 mV envelope on which the action potentials ride. Triggering of I_AHP-mediated self-inhibition by calcium accumulation counteracts this positive feedback, making the interspike interval increase again until the cell stops firing. (Despite the K+-dependent inhibition at the end of the bursts, the voltage is not hyperpolarized, because the high [K+]out causes a compensatory depolarization.) Overall, with the present simulation parameters, the average interspike interval and burst duration were 190 and 2300 msec, respectively. These values are well within the experimentally observed range. The interplay between [K+]out, firing frequency, and AHP current is illustrated in Figure 3. In this figure, the activity of a cell situated in the center of the network, and the variations in K+ concentration of an
Figure 2: Bursts of action potentials. (A) The activity of three cells situated in the middle of the network, as shown in Figure 1, is plotted over a 200-sec period. Spikes, represented by individual lines, cannot be dissociated because of the low temporal resolution used in this figure. Although temporally correlated, the bursts do not start at the same time (emphasized by the dashed line) because of the waves' directionality. (B) Action potentials from 0 to 10 sec, expanded along the time axis to show the time courses of individual spikes. Extracellular K+ causes about 8 mV of depolarization, on which the spikes ride.
associated individual extracellular space are shown. The arrival of a wave causes the elevation of [K+]out. At a level of about 4.7 mM, the cell becomes so depolarized that a spike is generated. This spike, and the following ones, contribute to further [K+]out elevation, which can be about 4-fold (from 2.6 to about 10 mM).
Figure 3: Time course of variables associated with a burst. The top trace shows the accumulation of K+ that excites the cell, whose membrane voltage is shown in the middle trace. The bottom trace shows the AHP current that is caused by membrane depolarization and that eventually inhibits the cell.

The level of K+ fluctuates rapidly, since it is increased by action potentials from the three neighbor cells surrounding the extracellular space where the measurement is made. At the same time, due to Ca2+ spikes (apparent in the I_AHP trace as transient spikes), the intracellular Ca2+ concentration starts building up. This rise in [Ca2+]in activates the AHP conductance, which exerts a significant inhibitory effect on the cell's activity after a period of about 1.5 sec (visible in this figure as an elongation of the interspike interval). Once the inhibition is strong enough to stop the burst, [K+]out enters a period of slow decrease. The slow I_AHP decay imposes a refractory period during which new waves are prevented from propagating. The long tails of I_AHP and [K+]out are due to the relatively sluggish processes of removal of Ca2+ and K+ from the intracellular and extracellular spaces, respectively. The overall spatiotemporal pattern of the network activity corresponding to the first wave is presented in Figure 4, where [K+]out, the average spike frequency (in 500 msec windows), and the AHP current are shown over successive 1 sec frames for a period of 15 sec. The wave, triggered in the bottom left corner of the network, started propagating with a circular wave front. After the front reached the upper border of the network, the wave front became linear because of the higher wave speed at the border cells, due to their higher excitability.
Figure 4: Spatiotemporal evolution of a wave. This wave swept through a network composed of 241 ganglion cells (Fig. 1). Successive frames show [K+]out (top row), average firing rate (middle row), and the logarithm of I_AHP (lower row) over successive 1 sec intervals covering the time interval from 1 to 15 sec; the serial number of the first frame of each row is indicated in the left column. The gray-level scale is shown on the left and corresponds to the ranges 0-10 mM for [K+]out, 0-10 Hz for firing rate, and log(0.1)-log(35) pA for I_AHP. The [K+]out wave precedes (causes) the action-potential firing wave, which in turn precedes (is terminated by) the AHP wave.

The degree of linearization is probably much higher than what one would predict for the real retina, since the number of cells in our simulations is relatively small, making the wave front more sensitive to border effects. These results suggest what would happen to
the wave front if waves began in the center of the retina. The wave front would be circular until it reached the borders. Spatial anisotropies due to the refractory periods from previous waves would rarely affect the wave front, because the interwave interval is very long. Therefore, the simulation variables would be essentially reset from wave to wave. Also visible in this figure is the wave order: the extracellular potassium wave leads the spike-frequency wave (indicating that potassium triggers excitation), whereas the AHP wave lags spike frequency (indicating the refractory role of the calcium-dependent potassium conductance).

5 Discussion
5.1 Extracellular Potassium. Computer simulations of our biophysical model show that the propagation of waves of bursts of action potentials in the ganglion cell layer of developing retinas can in principle be explained by accumulation of K+ in the extracellular space. Are the model parameters determining [K+]out reasonable? This concentration depends on the somatic K+ currents, the extracellular volume, and the mechanism for K+ removal. The parameters for the Hodgkin-Huxley K+ channels (as well as for all other channels) were adapted from those of other preparations, as they are not currently available for the embryonic retina, except for the sodium channel in dissociated ganglion cells (Skaliora et al. 1993). In this adaptation, the total value of each conductance and capacitance was reduced 3-fold to take into account the small soma diameter of embryonic neurons (Ramoa et al. 1988) and their reduced channel density (McCormick and Prince 1987). The only exception was g_Ca, for which dendritic values were not available and which thus had to be adjusted to give [Ca2+]in in the range of g_AHP operation. As such, these conductances and capacitance are reasonable. To estimate the extracellular volume, we used an 8 μm soma diameter (Ramoa et al. 1988), which occupies a volume of 2.7 × 10⁻¹³ liters. What fraction of the total volume is extracellular space? In postnatal developing neocortex, the volume fraction was estimated to range between 20 and 40% according to the animal's age (Lehmenkühler et al. 1993). (In adult cerebellum, the extracellular volume fraction is on average about 20%; Van Harreveld 1972; Nicholson et al. 1979.) In our simulations, we used an extracellular volume fraction of 30%, a value in the middle of the cortical postnatal range. Let us assume that the volume occupied by the extracellular matrix and the Müller cells is negligible. (This assumption is validated, since gliogenesis appears to be the major factor in the reduction of the extracellular volume fraction during development (Lehmenkühler et al. 1993), and since Müller cells are not well developed at the stage when waves occur.) Hence, the volume is shared by ganglion cells and extracellular space. From the 30% fraction, one gets 1.16 × 10⁻¹³ liters for the extracellular space associated with one cell.
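Spelling out the volume arithmetic behind these numbers (our own check of the quoted values):

\[ V_{\mathrm{soma}} = \tfrac{4}{3}\pi\,(4\,\mu\mathrm{m})^3 \approx 268\,\mu\mathrm{m}^3 = 2.7\times 10^{-13}\,\mathrm{L}, \qquad V_{\mathrm{ext}} = \frac{0.3}{0.7}\,V_{\mathrm{soma}} \approx 1.16\times 10^{-13}\,\mathrm{L}, \]

and dividing V_ext among the six spaces surrounding each soma gives about 1.9 × 10⁻¹⁴ L per space, the v₀ of Table 2.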
Because the K+ current is ejected into 6 extracellular spaces in our model, the volume of each space is 1.9 × 10⁻¹⁴ liters. Finally, the parameters of K+ removal must be considered. In the central nervous system of adult mammals, [K+]out rarely rises above 4 mM under physiological conditions (Somjen 1979). Concentrations of 10 mM extracellular potassium and higher have been recorded in preparations following repetitive activity (Somjen 1979). To take into account the immaturity of the glial system in developing retinas and the repetitive action-potential activity during bursts, we allowed a conservative elevation in [K+]out up to 10 mM in the control condition. Also, it has been suggested that a possible mechanism for K+ active uptake into glial cells is the Na+-K+-ATPase exchanger (Hertz 1990). The Na+-K+-ATPase exchanger requires two K+ molecules and thus, consistently with our model (equation 3.6), has a second-order stoichiometry. (In exploratory simulations, removal of the second-order term resulted in such a large increase of [K+]out that cells underwent inactivation.) An active uptake mechanism based on the Na+-K+-ATPase exchanger is consistent with studies on the cerebral cortex of the rat during early postnatal development (Medzihradsky et al. 1972), where it has been shown that this exchanger in glial cells is immature during the initial stages of development. How do the extracellular volume and K+ removal change during development? There are no direct data on these issues for developing retinas. However, in rat neocortex the extracellular volume fraction appears to decrease during development, possibly as a consequence of the emergence of glial cells (Lehmenkühler et al. 1993). Because in the retina glial (Müller) cells also emerge relatively late in development (Polley et al. 1989), one might expect a similar reduction of the extracellular volume fraction in the retina as in the neocortex. Reduction in volume would make a [K+]out mechanism for waves more efficient. On the other hand, an increase in the number of mature Müller cells would tend to compensate for the effect of the reduction of volume. If our model is correct, one should expect the effect of increased K+ buffering by Müller cells to dominate the effect of the change in extracellular volume, since waves disappear at later stages of development.

5.2 Possible Role of Waves in the Refinement of Topographic Maps. Synchronization is much more localized in developing retinas than in other tissues where bursty activity has been observed, such as the pancreas (Santos et al. 1991), the heart (Winfree 1990), and aggregates of cultured cortical neurons (Murphy et al. 1992). The reason for this difference is probably qualitative, rather than quantitative: Because pharmacology performed on developing retinas of turtles (Sernagor and Grzywacz 1993a) suggested that gap junctions are not required to synchronize the activity of neighbor ganglion cells, we did not include such connections in our model. This is probably the main difference from previous models of cell synchronization in the heart and pancreas, where coupling is
apparently done through gap junctions (Chay and Kang 1988; Sherman and Rinzel 1991; Winslow et al. 1993). In those models, coupling is so fast that a global synchronization of the network is reached, in the sense that all cells fire simultaneously at some point. This global synchronization contrasts with the locality of developing retinas, where the slow propagation of the activity restricts synchronization to a small portion of the network. Local synchronization is essential in developing retinas if its role is to refine a topographic map between retina and LGN (Jeffery 1989; Meister et al. 1991; Wong et al. 1993). Imagine that immature ganglion-cell axons sprout over a relatively large area and connect to random LGN neurons (with a bell-shaped spatial distribution), forming a rough topographic map. Then, simultaneous activity in neighbor ganglion cells coupled to a Hebbian mechanism in the LGN would tend to reinforce connections to common, nearby LGN neurons, while weakening connections in the fringes of the axonal sprouting. This process would refine the receptive fields in the LGN and, consequently, refine the topographic continuity of the retina-LGN map. The refinement would be limited by the wave's spread. Hence, if waves were too broad, causing global synchronization, there would be no topographic refinement.

5.3 Alternative Models. Our model and the models for other tissues emphasize two main requirements for waves: (1) local lateral excitation, which allows wave propagation, and (2) inhibition, which prevents waves from propagating backward (refractory period). Figure 5 displays three possible alternative mechanisms for inhibition. The first, self-inhibition (for example, via g_AHP), has been used in our and other models (Fig. 5A). The second is a neurotransmitter that is depleted, or a receptor that is desensitized, following strong activity (Fig. 5B). The third, feedforward inhibition, is a delayed lateral synaptic inhibition between neighbor units (Fig. 5C). We argue that the second alternative, namely depletion or desensitization, is unlikely to be the main refractory mechanism stopping the bursts in developing retinas. These mechanisms take place at synapses and thus require that lateral excitatory synapses mediate wave propagation. However, in mammals, correlated activity and waves have been recorded in retinas so young that they did not yet have synapses onto ganglion cells (Maslim and Stone 1986; Horsburgh and Sefton 1987; Maffei and Galli-Resta 1990; Meister et al. 1991). Furthermore, the requirement that lateral excitatory synapses mediate wave propagation raises serious difficulties in explaining the Cs+ and TEA result in turtle. Removing the inhibitory effect of K+ would tend to increase excitation and prolong spikes, and thus should cause faster transmitter depletion or receptor desensitization. Hence, these inhibitory mechanisms would cause shorter burst durations, contrary to experimental evidence (Sernagor and Grzywacz 1993a). We argue that not only depletion and desensitization are unlikely, but also feedforward inhibition (Fig. 5C). In mammals, the IPL synaptic circuitry is not developed at the time when synchronous burst firing first occurs.
Figure 5: Three computational models for waves of activity. In all the models, there are excitatory (not necessarily synaptic) couplings between neighboring cells. Delayed activity suppression, critical to the waves' refractory period, can be accomplished by using (A) self-inhibition (for example, through I_AHP), where the cell's activity determines its own level of inhibition, (B) attenuation of excitatory synaptic transmission through transmitter depletion or receptor desensitization, or (C) feedforward connections through an inhibitory interneuron.

In turtles, there is a serious time constraint, as the synaptic inhibition onset would have to be at least as slow as the slowest burst duration, that is, about 20 sec (Sernagor and Grzywacz, unpublished data). In our model, we chose [K+]out to mediate lateral excitation, but two alternative mechanisms, gap junctions and synapses, cannot be completely ruled out. Although Sernagor and Grzywacz's (1993a, 1994) studies provide evidence against gap junctions, it is possible that the gap-junction blockers used in these studies, namely octanol and dopamine, are not
appropriate in developing retinas. The evidence against the involvement of synapses as a mechanism for wave propagation in mammals is strong. Anatomical studies performed in postnatal rat show that IPL conventional synapses are first observed on postnatal day 11 (P11) (Horsburgh and Sefton 1987), whereas spontaneous synchronized discharges can be recorded before birth (Maffei and Galli-Resta 1990). Moreover, in cat, conventional synapses onto ganglion cells first appear on E56 (Maslim and Stone 1986), while waves have been recorded from an E52 retina (Meister et al. 1991). In cats and ferrets, activity and synchronization remained (and even increased) under Ca2+-free conditions (without (Meister et al. 1991) and with (R. O. L. Wong, personal communication) high Mg2+) and, thus, presumably, without conventional synaptic transmission. (The increase in activity is perhaps due to a reduction of the Ca2+-dependent, K+-mediated inhibition, a mechanism suggested by our model.) Finally, Meister et al. (1991) argued against fast synapses, because they would have to have a very slow integration time to account for the slow propagation between neighbor cells (approximately 200 msec in mammals; Meister et al. 1991). In view of the strong evidence against synaptic mediation of waves, the result of Wong et al. (1992), who found that waves activate both ganglion and amacrine cells of developing retinas, is surprising. One way to interpret our model to explain this result is to postulate that K+ waves simultaneously propagate across amacrine and ganglion cells. However, if this is the case, then the inhibitory mechanism preventing the waves from propagating backwards might not be I_AHP, since its effectiveness would be reduced by the increase in [K+]out. Another K+-independent self-inhibition process might be used instead.

5.4 Synapses. In contrast to mammals, there is direct evidence in turtles for a role of synapses in correlated burst activity. Neostigmine, an anticholinesterase, increases burst activity, indicating release of acetylcholine (ACh) and the functionality of ACh receptors in the IPL (Sernagor and Grzywacz 1993a). In addition, there is a large reduction of activity in low Ca2+ and high Mg2+ (or in the presence of Co2+) (Sernagor and Grzywacz 1993a). Although Co2+ or neostigmine experiments have not as yet been performed in mammals, it is possible that results similar to those of turtles would be found at late stages of development. (As mentioned above, in early stages of mammalian development waves have been recorded in the absence of conventional synapses.) The role of synapses might actually be to transmit excitation laterally to mediate propagation itself. Alternatively, our model suggests that the role of synapses might be one of neuromodulation of the speed or spatial spread of waves. A potential function of such neuromodulation could be the regulation of spatial interaction during development. For instance, increases in ganglion cell excitability due to strengthening of excitatory synapses might compensate for the decrease in [K+]out resulting from the development of Müller cells, as discussed above.
Synapses might not only modulate wave propagation, but might also be modulated by it. Because of Wong et al.'s (1992) evidence that at some point during development waves propagate simultaneously in the IPL and ganglion cells, waves might modify amacrine-ganglion synaptic strength in a Hebbian manner (Borg-Graham and Grzywacz 1992) to form the experimentally observed prenatal orientation selectivity of turtles (Sernagor and Grzywacz 1993b) and pre-eye-opening directional selectivity of rabbits (Masland 1977). This idea is similar to that proposed by Linsker (1986) and Miller (1994) to explain the emergence of cortical orientation selectivity from random spontaneous activity.
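The Hebbian refinement invoked in Sections 5.2 and 5.4 is only described verbally in the paper; the toy sketch below (entirely our own, a one-dimensional caricature) shows the mechanism: waves co-activate neighboring retinal cells, a Hebbian rule with weight normalization strengthens connections to common LGN targets, and broad initial arbors shrink toward a retinotopic map.

```cpp
// Toy Hebbian refinement of a 1D retina -> LGN map driven by local waves.
// Our illustration of the verbal argument in Section 5.2, not a simulation
// reported in the paper.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int N = 32;         // retinal and LGN cells (1D for simplicity)
    const int WAVE = 3;       // half-width of a wave (local synchronization)
    const double ETA = 0.05;  // learning rate
    std::vector<std::vector<double>> w(N, std::vector<double>(N));

    // Broad initial arbors: bell-shaped connection strengths.
    for (int l = 0; l < N; ++l)
        for (int r = 0; r < N; ++r)
            w[l][r] = std::exp(-0.5 * std::pow((l - r) / 8.0, 2));

    for (int epoch = 0; epoch < 2000; ++epoch) {
        int c = std::rand() % N;                       // wave center
        std::vector<double> x(N, 0.0), y(N, 0.0);
        for (int r = 0; r < N; ++r)
            if (std::abs(r - c) <= WAVE) x[r] = 1.0;   // co-active neighbors
        for (int l = 0; l < N; ++l)                    // linear LGN response
            for (int r = 0; r < N; ++r) y[l] += w[l][r] * x[r];
        for (int l = 0; l < N; ++l) {
            double sum = 0.0;
            for (int r = 0; r < N; ++r) {              // Hebb + normalization
                w[l][r] += ETA * y[l] * x[r];
                sum += w[l][r];
            }
            for (int r = 0; r < N; ++r) w[l][r] /= sum;
        }
    }
    int peak = 0;   // sanity check: the central LGN cell stays retinotopic
    for (int r = 0; r < N; ++r) if (w[N/2][r] > w[N/2][peak]) peak = r;
    std::printf("LGN cell %d peaks at retinal cell %d\n", N / 2, peak);
}
```

Because the normalization makes synapses compete, receptive fields contract toward the wave width, which is the point of the argument: waves broad enough to synchronize the whole retina would leave nothing for the competition to resolve.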
6 Conclusions

Modulations in extracellular K+ concentration might be sufficiently high to carry the spontaneous waves of action potentials in developing retinas. Even if another mechanism underlies the waves, these modulations are so large that they would affect the waves' properties. We argue that the refractory mechanism preventing the waves from propagating backward is unlikely to be neurotransmitter depletion, receptor desensitization, or feedforward synaptic inhibition. Rather, this mechanism might be self-inhibition, such as that mediated by an AHP or another slow conductance. Finally, it is proposed that excitatory synapses might have a neuromodulatory role in the shaping of the waves' properties.

Appendix

The condensed Hodgkin-Huxley equations at the soma for I_Na and I_K are described by a minimal cell model proposed by Av-Ron et al. (1991),
whose equations are

\[ I_{\mathrm{Na}} = \bar{g}_{\mathrm{Na}}\, m_\infty^3(V)\,(1 - W)\,(V - E_{\mathrm{Na}}) \tag{A.1} \]
\[ I_{\mathrm{K}} = \bar{g}_{\mathrm{K}}\,(W/s)^4\,(V - E_{\mathrm{K}}) \tag{A.2} \]
\[ m_\infty(V) = \left\{1 + \exp\!\left[-2a^{(m)}\left(V - V_{1/2}^{(m)}\right)\right]\right\}^{-1} \tag{A.3} \]
\[ \frac{dW}{dt} = \frac{W_\infty(V) - W}{\tau_W(V)} \tag{A.4} \]
\[ W_\infty(V) = \left\{1 + \exp\!\left[-2a^{(W)}\left(V - V_{1/2}^{(W)}\right)\right]\right\}^{-1} \tag{A.5} \]
\[ \tau_W(V) = \left\{\lambda \cosh\!\left[a^{(W)}\left(V - V_{1/2}^{(W)}\right)\right]\right\}^{-1} \tag{A.6} \]

where g_Na and g_K are maximal conductances, E_Na and E_K are reversal potentials, and s, a^(m), V_1/2^(m), a^(W), V_1/2^(W), and λ are positive channel parameters.
The current I_Ca at the dendrite (given the dendritic membrane potential V_d) is described by a two-state gate model (Borg-Graham 1991):

\[ I_{\mathrm{Ca}} = \bar{g}_{\mathrm{Ca}}\, y^2\,(1 - w^4)\,(V_d - E_{\mathrm{Ca}}) \tag{A.7} \]
where g_Ca is the maximal membrane conductance, E_Ca is the reversal potential, and the gating particles y and 1 - w represent activation and inactivation, respectively. The general equations for these gating particles (hereinbelow denoted by x) are given by

\[ \dot{x}(t) = \alpha_x(V)\,[1 - x(t)] - \beta_x(V)\,x(t) \tag{A.10} \]
\[ \alpha_x(V) = \alpha_0 \exp\!\left[\frac{z\gamma\,(V - V_{1/2})}{K}\right] \tag{A.11} \]
\[ \beta_x(V) = \beta_0 \exp\!\left[\frac{-z(1-\gamma)\,(V - V_{1/2})}{K}\right] \tag{A.12} \]

where V is the membrane potential; z, α₀, β₀, γ, and V_1/2 are parameters of the channel; and K = 25 mV at 20°C. All parameters used in the simulations described in this paper are adapted from Yamada et al. (1989), Borg-Graham (1991), and Ekeberg et al. (1991). This adaptation takes into account the small size of the cells and the relatively low channel density of immature neurons. For the synaptic current, its value is comparable to external currents injected into dissociated fetal retinal ganglion cells (Skaliora et al. 1993). The parameters are summarized in Tables 1 and 2. Table 1 describes parameters related to somatic and dendritic conductances. Table 2 describes parameters related to intra- and extracellular concentrations, as well as other miscellaneous parameters.
Acknowledgments

We thank Dr. Evelyne Sernagor for the many discussions during the course of this project. Also, we are grateful to Drs. Sernagor, Lyle Borg-Graham, and Markus Meister for their fruitful comments on this paper. This work was supported by a grant from the Swiss National Fund for Scientific Research (8220-37180) to P.-Y. B., by grants from the National Eye Institute (EY-08921) and the Office of Naval Research (N00014-91-J-1280) and by an award from the Paul L. and Phyllis C. Wattis Foundation to N. M. G., and by a core grant from the National Eye Institute to Smith-Kettlewell (EY-06883).
Table 1: Somatic and Dendritic Conductance Parameters.

Soma
  Leak: g_l-K = 0.12 nS, E_l-K = -88 mV; g_lK = 0.88 nS, E_K = variable
  Na: g_Na = 0.33 μS, E_Na = 55 mV
  K: g_K = 0.067 μS, E_K = variable, s = 1.85
  State variable m: a^(m) = 0.047 mV⁻¹, V_1/2^(m) = -34 mV
  State variable W: a^(W) = 0.059 mV⁻¹, V_1/2^(W) = -57 mV, λ = 2.9 Hz

Dendrite
  Leak: g_d = 1 nS, E_d = -70 mV
  Ca: g_Ca = 0.017 nS, E_Ca = variable
    State variable y: α₀ = 0.1 ms⁻¹, β₀ = 0.1 ms⁻¹, γ = 0.5, V_1/2 = -24 mV, z = 4
    State variable w: α₀ = 0.008 ms⁻¹, β₀ = 0.008 ms⁻¹, γ = 0.2, V_1/2 = -35 mV, z = 12
  AHP: g_AHP = 4 nS, E_AHP = -88 mV, α = 2.0 × 10⁶ ms⁻¹ mM⁻³, β = 0.002 ms⁻¹

Table 2: Intracellular and Extracellular Concentration Parameters.

  [K+]out: [K+]rest = 2.6 mM, [K+]in = 100 mM, v₀ = 2.0 × 10⁻¹⁴ liters, τ_K = 2.0 sec, K_K = 5.0 × 10⁻¹ sec⁻¹ mM⁻¹
  [Ca2+]in: [Ca2+]rest = 150 nM, [Ca2+]out = 4 mM, v_d = 5.0 × 10⁻¹⁴ liters, τ_Ca = 3.3 sec
  Miscellaneous: I_border = -25 pA, T_border = 250 msec, i_syn = -2 pA, C = 3 pF, g_axial = 0.03 μS
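As a concrete reading of equations A.1-A.6 and Table 1, the soma currents can be computed as in the sketch below (ours; the parameter pairings follow our reconstruction of the scanned table and should be checked against the original print, and the function names are hypothetical):

```cpp
// Sketch of the condensed Hodgkin-Huxley soma model (equations A.1-A.6)
// with the Table 1 parameters as reconstructed above. Units: nS, mV, pA, ms.
#include <cmath>
#include <cstdio>

const double gNa = 330.0, gK = 67.0;     // nS (0.33 and 0.067 uS)
const double ENa = 55.0;                 // mV; E_K varies with [K+]out
const double s = 1.85, lambda = 2.9e-3;  // lambda: 2.9 Hz = 2.9e-3 ms^-1
const double am = 0.047, Vm = -34.0;     // a^(m), V_1/2^(m)
const double aW = 0.059, VW = -57.0;     // a^(W), V_1/2^(W)

double mInf(double V) { return 1.0 / (1.0 + std::exp(-2.0 * am * (V - Vm))); }
double WInf(double V) { return 1.0 / (1.0 + std::exp(-2.0 * aW * (V - VW))); }
double tauW(double V) { return 1.0 / (lambda * std::cosh(aW * (V - VW))); }

// Sodium and delayed-rectifier currents, equations A.1 and A.2 (pA).
double INa(double V, double W) {
    double m = mInf(V);
    return gNa * m * m * m * (1.0 - W) * (V - ENa);
}
double IK(double V, double W, double EK) {
    double n = W / s;
    return gK * n * n * n * n * (V - EK);
}

int main() {
    double V = -65.0, W = WInf(V), EK = -88.0, dt = 0.1;  // ms
    // One exponential-Euler step of the recovery variable W (equation A.4).
    W = WInf(V) + (W - WInf(V)) * std::exp(-dt / tauW(V));
    std::printf("INa = %.2f pA, IK = %.2f pA, W = %.3f\n",
                INa(V, W), IK(V, W, EK), W);
}
```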
References

Av-Ron, E., Parnas, H., and Segel, L. A. 1991. A minimal biophysical model for an excitable and oscillatory neuron. Biol. Cybern. 65, 487-500.
Blatz, A. L., and Magleby, K. L. 1986. Single apamin-blocked Ca-activated K+ channels of small conductance in cultured rat skeletal muscle. Nature (London) 323, 718-720.
Borg-Graham, L. J. 1991. Modelling the non-linear conductances of excitable membranes. In Cellular and Molecular Neurobiology: A Practical Approach,
H. Wheal and J. Chad, eds., pp. 247-275. Oxford University Press, Oxford, UK.
Borg-Graham, L. J., and Grzywacz, N. M. 1992. A model of the directional selectivity circuit in retina: Transformations by neurons singly and in concert. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 347-375. Academic Press, Boston.
Burgi, P.-Y., and Grzywacz, N. M. 1993a. A biophysical model of synchronous bursts of activity in ganglion cells of the developing retina. Invest. Ophthalmol. Vis. Sci. 34, 1155.
Burgi, P.-Y., and Grzywacz, N. M. 1993b. Model for initiation and propagation of waves of action potentials in developing retinas. Neurosci. Abstr. 19, 239.
Burgi, P.-Y., and Grzywacz, N. M. 1994. Model for the pharmacological basis of spontaneous synchronous activity in developing retinas. J. Neurosci., in press.
Chay, T. R., and Kang, H. S. 1988. Role of single-channel stochastic noise on bursting clusters of pancreatic β-cells. Biophys. J. 54, 427-435.
Dowling, J. E. 1987. The Retina. Harvard University Press, Cambridge, MA.
Ekeberg, Ö., Wallén, P., Lansner, A., Trävén, H., Brodin, L., and Grillner, S. 1991. A computer based model for realistic simulations of neural networks. Biol. Cybern. 65, 81-90.
Fohlmeister, J. F., Coleman, P. A., and Miller, R. F. 1990. Modeling the repetitive firing of retinal ganglion cells. Brain Res. 510, 343-345.
Hertz, L. 1990. Regulation of potassium homeostasis by glial cells. In Differentiation and Functions of Glial Cells, pp. 225-234. A. R. Liss, New York.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer, Sunderland, MA.
Horsburgh, G. M., and Sefton, A. J. 1987. Cellular degeneration and synaptogenesis in the developing retina of the rat. J. Comp. Neurol. 263, 553-566.
Jeffery, G. 1989. Shifting retinal maps in the development of the lateral geniculate nucleus. Dev. Brain Res. 46, 187-196.
Karwoski, C. J., and Proenza, L. M. 1980. Neurons, potassium, and glia in proximal retina of Necturus. J. Gen. Physiol. 75, 141-162.
Lehmenkühler, A., Syková, E., Svoboda, J., and Nicholson, C. 1993. Extracellular space parameters in the rat neocortex and subcortical white matter during postnatal development determined by diffusion analysis. Neuroscience 55, 339-351.
Linsker, R. 1986. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Lipton, S. A. 1988. Spontaneous release of acetylcholine affects the physiological nicotinic responses of rat retinal ganglion cells in culture. J. Neurosci. 8, 3857-3868.
Lukasiewicz, P., and Werblin, F. 1988. A slowly inactivating potassium current truncates spike activity in ganglion cells of the tiger salamander retina. J. Neurosci. 8, 4470-4481.
McArdle, C. B., Dowling, J. E., and Masland, R. H. 1977. Development of outer segments and synapses in the rabbit retina. J. Comp. Neurol. 175, 253-274.
McCormick, D. A., and Prince, D. A. 1987. Post-natal development of elec-
Model Based on Extracellular Potassium
1003
trophysiological properties of rat cerebral cortical pyramidal neurones. J. Physiol. (London) 393, 743-762. Maffei, L.,and Galli-Resta, L. 1990. Correlation in the discharges of neighboring rat retinal ganglion cells during prenatal life. Proc. Natl. Acad. Sci. U.S.A. 87, 2861-2864. Masland, R. H. 1977. Maturation of function in the developing rabbit retina. 1. Comp. Neurol. 175,275-286. Maslim, J.,and Stone, J. 1986. Synaptogenesis in the retina of the cat. Brain Res. 373, 35-48. Medzihradsky, F., Sellinger, 0. Z., Nandhasri, P. S., and Santiago, J. C. 1972. ATPase activity in glial cells and in neuronal perikarya of rat cerebral cortex during early postnatal development. 1.Neurochem. 19, 543-545. Meister, M., Wong, R. 0. L., Baylor, D. A., and Shatz, C. J. 1991. Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science 252, 939-943. Miller, K. D. 1994. A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activitydependent competition between ON- and OFF-center inputs. J. Neurosci. 14, 409-441. Murphy, T. H., Blatter, L. A., Wier, W. G., and Baraban, J. M. 1992. Spontaneous synchronous synaptic calcium transients in cultured cortical neurons. J. Neurosci. 12, 4834-4845. Nicholson, C., Phillips, J. M., and Gardner-Medwin, A. R. 1979. Diffusion from an iontophoretic point source in the brain: Role of tortuosity and volume fraction. Brain Res. 169, 580-584. Polley, E. H., Zimmerman, R. P., and Fortney, R. L. 1989. Neurogenesis and maturation of cell morphology in the development of the mammalian retina. In Development ofthe VertebrateRetina, B. L. Finlay and D. R. Sengelaub, eds., pp. 3-29. Plenum Press, New York. Rager, G. 1979. The cellular origin of the b-wave in the electroretinogram-a developmental approach. J. Comp. Neurol. 188, 225-244. Ramoa, A.S., Campbell, G., and Shatz, C. J. 1988. Dendritic growth and remodelling of cat retinal ganglion cells during fetal and postnatal development. J. Neurosci. 8, 42394261. Regehr, W. G., Connor, J. A., and Tank, D. W. 1989. Optical imaging of calcium accumulation in hippocampal pyramidal cells during synaptic activation. Nature (London) 341, 533-536. Santos, R. M., Rosario, L. M., Nadal, A., Garcia-Sancho,J., Soria, B., and Valdeolmillos, M. 1991. Widespread synchronous [CaZf]ioscillationsdue to bursting electrical activity in single pancreatic islets. Pflugers Arch. 418,417422. Semagor, E., and Grzywacz, N. M. 1993a. Cellular mechanisms underlying spontaneous correlated activity in the turtle embryonic retina. Invest. Ophthalmol. Vis. Sci. 34, 1156. Sernagor, E., and Grzywacz, N. M. 1993b. Emergence of isotropic and anisotropic receptive-field properties in the developing turtle retina. Neurosci. Abst. 19, 53. Sernagor, E., and Grzywacz, N. M. 1994. Synaptic connections involved in
1004
Pierre-Yves Burgi and Norbert0 M. Grzywacz
the spontaneous correlated bursts in the developing turtle retina. Invest. Opthalmol. Vis. Sci. 35, 2125. Sherman, A., and Rinzel, J. 1991. Model for synchronization of pancreatic D-cells by gap junction coupling. Biophys. J. 59, 547-559. Skaliora, I., Scobey, R. P., and Chalupa, L. M. 1993. Prenatal development of excitability in cat retinal ganglion cells: Action potentials and sodium currents. J. Neurosci. 13, 313-323. Somjen, G. G. 1979. Extracellular potassium in the mammalian central nervous system. Annu. Rm. Physiol. 41, 159-177. Van Harreveld, A. 1972. The extracellular space in the vertebrate central nervous system. In The Structure and Function of Netvous Tissue, G. H. Bourne, ed., pp. 447-511. Academic, New York. Winfree, A. T. 1990. Vortex action potential in normal ventricular muscle. Ann. N . Y. Acad. Sci. 591, 190-207. Winslow, R. L., Kimball, A. L., Varghese, A,, and Noble, D. 1993. Simulating cardiac sinus and atrial network dynamics on the Connection Machine. Pkysica D 64, 281-298. Wong, R. 0. L., Chernjavsky, A., Smith, S. J., and Shatz, C. J. 1992. Correlated spontaneous calcium bursting in the developing retina. Neurosci. Abstr. 18, 923.
Wong, R. 0.L., Meister, M., and Shatz, C. J. 1993. Transient period of correlated bursting activity during development of the mammalian retina. Neuron 11, 923-938.
Yamada, W. M., Koch, C., and Adams, P. R. 1989. Multiple channels and calcium dynamics. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 97-133. MIT Press, Cambridge, MA. Zucker, C., and Yazulla, S. 1982. Localization of synaptic and nonsynaptic nicotinic-acetylcholine receptors in the goldfish retina. J. Comp. Neurol. 204, 188-195.
Communicated by Scott Kirkpatrick
Bayesian Modeling and Classification of Neural Signals
Michael S. Lewicki
Computation and Neural Systems Program, California Institute of Technology 216-76, Pasadena, CA 91125 USA
Identifying and classifying action potential shapes in extracellular neural waveforms have long been the subject of research, and although several algorithms for this purpose have been successfully applied, their use has been limited by some outstanding problems. The first is how to determine the shapes of the action potentials in the waveform; the second is how to decide how many shapes are distinct. A harder problem is that action potentials frequently overlap, making difficult both the determination of the shapes and the classification of the spikes. In this report, a solution to each of these problems is obtained by applying Bayesian probability theory. By defining a probabilistic model of the waveform, the probability of both the form and number of spike shapes can be quantified. In addition, this framework is used to obtain an efficient algorithm for the decomposition of arbitrarily complex overlap sequences. This algorithm can extract many times more information than previous methods and facilitates the extracellular investigation of neuronal classes and of interactions within neuronal circuits.

1 Introduction

Waveforms of extracellular neural recordings often contain action potentials (APs) from several different neurons. Each voltage spike in the waveform shown in Figure 1 is the result of APs from one or more neurons. An individual AP typically has a fast positive component and a fast negative component and may have additional slower components depending on the type of neuron and where the electrode is positioned with respect to the cell. Determining what cell fired when is a difficult, ill-posed problem and is compounded by the fact that cells frequently spike simultaneously, which results in large variations in the observed shapes.

Identifying and classifying the APs in a waveform, which is commonly referred to as "spike sorting," have three major difficulties. The first is determining the AP shapes, the second is deciding the number of distinct shapes, and the third is decomposing overlapping spikes into their component parts. In general, these cannot be solved independently, since the solution of one will affect the solution of the others. Algorithms
Figure 1: The extracellular waveform shows several different action potentials (APs) generated by an unknown number of neurons. Note the frequent presence of overlapping APs, which can, in the case of the right-most group, completely obscure individual spikes. The waveform was recorded with a glass-coated platinum-iridium electrode in zebra finch nucleus lMAN (courtesy of Allison Doupe, Caltech).
for identifying and classifying APs (see Schmidt 1984 for a review) fall into two main categories: feature clustering and template matching. Feature clustering involves describing features of APs, such as the peak value, spike width, slope, etc., and using a clustering algorithm to determine distinct classes in the set of features. Using a small set of features, although computationally efficient, is often sufficient only to discriminate the cells with the largest APs. Increasing the number of features in the clustering often yields better discrimination, but there still remains the problem of how to choose the features, and it is difficult with such techniques to handle overlapping spikes.

In template matching algorithms, typical action potential shapes are determined, either by an automatic process or by the user. The waveform is then scanned and each event classified according to how well it fits each template. Template matching algorithms are better suited for classifying overlaps, since some underlying APs can be correctly classified if the template is subtracted from the waveform each time a fit is found. The main difficulty in template matching algorithms is in choosing the templates and in decomposing complex overlap sequences.

The approach demonstrated in this paper is to model the waveform directly, obtaining a probabilistic description of each action potential and, in turn, of the whole waveform. This method allows us to compute the class conditional probabilities of each AP, which quantify the certainty with which an AP is assigned to a given class. In addition, it will be possible to quantify the certainty of both the form and number of spike
shapes. Finally, we can use this description to decompose overlapping APs efficiently and to assign probabilities to alternative spike model sequences.

2 Modeling Action Potentials
First we consider the problem of fitting a model to events from a single cell. Let us assume that the data from the event we observe (at time zero) are the result of a fixed underlying spike function, $s(t)$, plus noise:

$$d_i = s(t_i) + \eta \tag{2.1}$$

A computationally convenient form for $s(t)$ is a continuous piece-wise linear function:

$$s(t) = y_j + \frac{v_j}{h}\,(t - x_j), \qquad x_j \le t < x_{j+1} \tag{2.2}$$

where $h = x_{j+1} - x_j$, $j = 1, \ldots, R$, and $v_j = y_{j+1} - y_j$. We will treat $R$ and the $x_j$'s as known. The noise, $\eta$, is modeled as gaussian with zero mean and standard deviation $\sigma_\eta$.
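For concreteness, the piecewise-linear model of equation 2.2 is easy to evaluate directly; the following is a minimal sketch (ours, with an illustrative toy shape, not the paper's code), assuming evenly spaced knots:

```python
import numpy as np

def spike_model(t, y, x0, h):
    """Evaluate the piecewise-linear spike function s(t) of equation 2.2.

    y  : knot values y_1 .. y_{R+1} at evenly spaced knot times
    x0 : time of the first knot
    h  : knot spacing, so x_j = x0 + (j - 1) * h
    """
    t = np.asarray(t, dtype=float)
    # Index of the segment containing each t, clipped to valid segments.
    j = np.clip(((t - x0) // h).astype(int), 0, len(y) - 2)
    xj = x0 + j * h
    vj = y[j + 1] - y[j]              # v_j = y_{j+1} - y_j
    return y[j] + (vj / h) * (t - xj)

# Example: a crude biphasic shape on R = 75 segments spanning -1 to 4 msec.
knots = np.linspace(-1.0, 4.0, 76)
y = 100.0 * np.exp(-(knots / 0.3) ** 2) - 40.0 * np.exp(-((knots - 0.8) / 0.6) ** 2)
s = spike_model(np.linspace(-1.0, 4.0, 500), y, x0=-1.0, h=knots[1] - knots[0])
```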
2.1 The Posterior for the Model Parameters. From the Bayesian perspective, the task is to infer the posterior distribution of the parameters, $\mathbf{v} = \{v_1, \ldots, v_R\}$, given the data from the observed events, $D$, and our prior assumptions of the spike model, $M$. Applying Bayes' rule we have

$$P(\mathbf{v} \mid D, \sigma_\eta, \sigma_w, M) = \frac{P(D \mid \mathbf{v}, \sigma_\eta, M)\, P(\mathbf{v} \mid \sigma_w, M)}{P(D \mid \sigma_\eta, \sigma_w, M)} \tag{2.3}$$

$P(D \mid \mathbf{v}, \sigma_\eta, M)$ is the probability of the data for the model given in 2.2 and is assumed to be gaussian:

$$P(D \mid \mathbf{v}, \sigma_\eta, M) = \prod_i Z_D(\sigma_\eta)\, \exp\!\left[ -\frac{(d_i - s(t_i))^2}{2\sigma_\eta^2} \right] \tag{2.4}$$

where $Z_D(\sigma_\eta) = 1/(2\pi\sigma_\eta^2)^{1/2}$. The time of the $i$th data point, $d_i$, is taken to be relative to the corresponding event, i.e., $t_i = t_i^{(a)} - \tau^{(n)}$. By convention, $\tau^{(n)}$ is the time of the inferred AP peak. The data range over the predetermined extent of the action potential.¹

¹For the examples shown here, this range is from 1 msec before the spike peak to 4 msec after the peak.

$P(\mathbf{v} \mid \sigma_w, M)$ specifies prior assumptions about the structure of $s(t)$. Ideally, we want a distribution over $\mathbf{v}$ from which typical samples result only in shapes that are plausible APs. Conversely, this space should not be so
restrictive that legitimate AP shapes are excluded. We adopt a simple approach and use a prior of the form

$$P(s(t) \mid \sigma_w, M) \propto \exp\!\left\{ -\frac{1}{2\sigma_w^2} \int \left[ s^{(m)}(t) \right]^2 dt \right\} \tag{2.5}$$

where the superscript $(m)$ denotes differentiation; $m = 1$ corresponds to linear splines, $m = 2$ to cubic splines, etc. The smoothness of $s(t)$ is controlled through the parameter $\sigma_w$, with small values of $\sigma_w$ penalizing large fluctuations. A prior simply favoring smoothness ensures minimal restrictions on the kinds of functions we can interpolate, but it does not buy us anything either. If we had a more informative prior, we would require less data to reach the same conclusions about the form of $s(t)$. Any reasonable prior should have little effect on the shape of the final spike function if there are abundant data. Even though the prior may have little effect on the shape, it still plays an important role in model comparison, which will be discussed in Section 4.

The components of the posterior distribution for $\mathbf{v}$ are now defined. There still remains, however, the problem of determining $\sigma_\eta$ and $\sigma_w$. An exact Bayesian analysis requires that we eliminate the dependence of the posterior on $\sigma_\eta$ and $\sigma_w$ by integrating them out:

$$P(\mathbf{v} \mid D, M) = \int d\sigma_\eta\, d\sigma_w\; P(\mathbf{v} \mid D, \sigma_\eta, \sigma_w, M)\, P(\sigma_\eta, \sigma_w \mid M) \tag{2.6}$$

In this paper, we use the approximation $P(\mathbf{v} \mid D, M) \approx P(\mathbf{v} \mid D, \sigma_\eta^{MP}, \sigma_w^{MP}, M)$. The most probable values of $\mathbf{v}$, $\sigma_w$, and $\sigma_\eta$ were obtained using the methods of MacKay (1992), which we briefly summarize here. First, we transform $\mathbf{v}$ to a basis in which the Hessian of $\log P(\mathbf{v} \mid \sigma_w, M)$ is the identity. For splines, this is the Fourier representation (2.7), with coefficients $\mathbf{w} = \{\mathbf{a}, \mathbf{b}\}$ excluding the constant term $a_0$, using the prior

$$P(\mathbf{w} \mid \sigma_w, M) \propto \exp\!\left( -\frac{1}{2\sigma_w^2} \sum_r \tilde{w}_r^2 \right) \tag{2.8}$$

The term $a_0$ is set to the known DC level (the offset of the A/D converters). In the limit $R \to \infty$, $\sum_r \tilde{w}_r^2/2 = \int [s^{(m)}(u)]^2\, du$ (Wahba 1990), which is the spline regularizer. We take $m = 1$ for linear splines.

The most probable parameter values, $\mathbf{w}^{MP}$, were determined as follows. Let $E_D = \sum_i [d_i - s(t_i)]^2/2$ and $E_w = \sum_r \tilde{w}_r^2/2$. Letting $B = \nabla\nabla E_D$ and $C = \nabla\nabla E_w$ (around $\mathbf{w}^{ML}$), we obtain $\mathbf{w}^{MP} = \sigma_\eta^{-2} A^{-1} B\, \mathbf{w}^{ML}$, where $A = \sigma_w^{-2} C + \sigma_\eta^{-2} B$. The maximum likelihood values, $\mathbf{v}^{ML}$, can be determined efficiently by inverting a tridiagonal matrix. The Fourier coefficients can be computed efficiently with the fast Fourier transform. The most probable values of $\sigma_\eta$ and $\sigma_w$ were obtained using the re-estimation formulas $\sigma_\eta^2 = 2E_D/(N - \gamma)$ and $\sigma_w^2 = 2E_w/\gamma$, where $\gamma = \sum_r \lambda_r/(\lambda_r + \sigma_w^{-2})$ and $\lambda_r$ is the $r$th eigenvalue of $\sigma_\eta^{-2} B$. In terms of $\lambda$, $w_r^{MP} = \lambda_r\, w_r^{ML}/(\lambda_r + \sigma_w^{-2})$.
Note that we could at this point apply the methods described by MacKay (1992) and discussed later in Section 4 to compare alternative spike models, in essence to determine the most probable spike model given the data. For example, we might choose cubic splines instead of piece-wise linear functions or choose priors that better represented our knowledge about spike shapes. The piece-wise linear spike models discussed here can be made to fit any fixed shape, since they can contain arbitrarily many segments. With 75 segments, the spike models have been descriptively sufficient for all the data we have observed. Situations for which this is not the case will be discussed in Section 9. Figure 2a shows the result of fitting one spike model to data consisting of 40 APs.
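As a concrete illustration of this fitting procedure, here is a minimal sketch (our own, with illustrative names) of the evidence-framework iteration for a generic linear model $d = \Phi\mathbf{w} +$ noise with prior $\mathbf{w} \sim N(0, \sigma_w^2 I)$; the tridiagonal and FFT shortcuts mentioned above are omitted:

```python
import numpy as np

def evidence_fit(Phi, d, n_iter=20):
    """Minimal sketch of MacKay's (1992) evidence-framework fit for a linear
    model d = Phi @ w + noise with prior w ~ N(0, sigma_w^2 I).  Names are
    illustrative; this is not the paper's implementation."""
    N, R = Phi.shape
    B = Phi.T @ Phi                        # B = Hessian of E_D
    lam = np.linalg.eigvalsh(B)            # eigenvalues of B
    sig_eta2, sig_w2 = np.var(d), 1.0      # crude initial hyperparameters
    for _ in range(n_iter):
        A = np.eye(R) / sig_w2 + B / sig_eta2
        w_mp = np.linalg.solve(A, Phi.T @ d) / sig_eta2
        lam_eff = lam / sig_eta2           # eigenvalues of sigma_eta^-2 B
        gamma = np.sum(lam_eff / (lam_eff + 1.0 / sig_w2))  # good parameters
        E_D = 0.5 * np.sum((d - Phi @ w_mp) ** 2)
        E_w = 0.5 * np.sum(w_mp ** 2)
        sig_eta2 = 2.0 * E_D / (N - gamma)  # re-estimation formulas
        sig_w2 = 2.0 * E_w / gamma
    return w_mp, sig_eta2, sig_w2, gamma
```

At convergence, gamma reports how many parameters the data actually determine, a quantity that reappears in the Ockham factors of Section 4.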
2.2 Checking the Assumptions. Before proceeding to the more complicated cases of multiple spike models and overlapping spikes, we must check our assumptions on real data. Equation 2.1 assumes that the noise process is invariant throughout the duration of the AP, but in principle this need not be the case. For example, the noise might show larger variation at the extremes. The spike model residuals, $\eta_i = d_i - s(t_i)$, shown in Figure 2a, give no indication of an amplitude-dependent noise process.

A second assumption we have made is that the noise is gaussian. Figure 2b shows a gaussian distribution with the inferred width $\sigma_\eta$, overlaid on a normalized histogram of the residuals from Figure 2a. The most significant deviation is in the tails of the distribution, which reflects the presence of overlapping spikes. In this case, the overlaps are evenly distributed over the range of the fitted event, so they have little effect on the model's form in the limit of large amounts of data. The model would be poorly inferred, however, if the overlaps were not uniformly distributed over the interval, for example if one cell tended to fire within a few milliseconds of another. This is a common problem in practice and will be addressed in Section 5.

An assumption that has not been tested is whether the residuals are independent. Figure 2c and d show that the noise in these data is slightly correlated. This has little effect on the fit of the models but does affect the accuracy of the probabilities discussed in the later sections. A convenient way of reducing the correlation is to sample close to the Nyquist rate to avoid correlation introduced by the amplifier filters.
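These checks are straightforward to script; the sketch below (ours, with illustrative names) computes the quantities behind Figure 2b-d:

```python
import numpy as np

def residual_checks(d, s_t):
    """Diagnostics of Section 2.2 on residuals eta_i = d_i - s(t_i):
    spread, serial correlation (cf. Figure 2c,d), and heavy tails."""
    eta = d - s_t                                    # model residuals
    sigma = eta.std()
    r1 = np.corrcoef(eta[:-1], eta[1:])[0, 1]        # lag-1 correlation
    r2 = np.corrcoef(eta[:-2], eta[2:])[0, 1]        # lag-2 correlation
    tail = np.mean(np.abs(eta) > 3.0 * sigma)        # outlier fraction
    return sigma, r1, r2, tail
```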
Figure 2: (a) Spike model fit to data consisting of 40 APs. The solid line is a 75-segment piece-wise linear model. Each AP is aligned with respect to the inferred spike peak. Each dot is one sample point. The residual error for each sample, $\eta_i = d_i - s(t_i)$, is offset by $-200\ \mu\mathrm{V}$ and plotted below. The flat residuals indicate that the data are well fit by the model. (b) Normalized histogram of the residuals from a. The curve is the gaussian inferred with the methods discussed in the text. The outliers result from overlapping APs, which can be seen in the data in a. (c and d) Lagged scatter plots of a sample of the residuals in a. (c) $\eta_i$ vs. $\eta_{i+1}$. (d) $\eta_i$ vs. $\eta_{i+2}$. These graphs indicate that there is some correlation between $\eta_i$ and $\eta_{i+1}$ (c), but little between $\eta_i$ and $\eta_{i+2}$ (d). This is expected for these data because the sampling rate (20 kHz) was higher than the Nyquist rate (14 kHz).
3 Multiple Spike Shapes
When a waveform contains multiple types of APs, determining the spike shapes is more difficult because the classes are not known a priori. We cannot infer the parameters for one spike model if we don't know what data is representative of its class. Furthermore, if two spike models are similar, it is possible that an observed event could have come from either class with equal probability. The uncertainty of which class an event belongs to can be incorporated with a mixture distribution (Duda and Hart 1973). The probability of a particular event, $D_n$, given all spike models, $M_{1:K}$, is

$$P(D_n \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, M_{1:K}) = \sum_{k=1}^{K} \pi_k\, P(D_n \mid \mathbf{v}_k, \sigma_\eta, M_k) \tag{3.1}$$

where $\pi_k$ is the a priori probability that a spike will be an instance of $M_k$ ($\sum_k \pi_k = 1$). The joint probability for $D_{1:N} = \{D_1, \ldots, D_N\}$ is simply the product

$$\mathcal{L} = P(D_{1:N} \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, M_{1:K}) = \prod_{n=1}^{N} P(D_n \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, M_{1:K}) \tag{3.2}$$

The posterior for multiple spike models is then

$$P(\mathbf{v}_{1:K}, \boldsymbol{\pi} \mid D_{1:N}, \sigma_\eta, \sigma_w, M_{1:K}) = \frac{P(D_{1:N} \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, M_{1:K})\, P(\mathbf{v}_{1:K} \mid \sigma_w, M_{1:K})\, P(\boldsymbol{\pi} \mid M_{1:K})}{P(D_{1:N} \mid \sigma_\eta, \sigma_w, M_{1:K})} \tag{3.3}$$

We use $P(\mathbf{v}_{1:K} \mid \sigma_w, M_{1:K}) = \prod_k P(\mathbf{v}_k \mid \sigma_{w_k}, M_k)$ and take $P(\boldsymbol{\pi} \mid M_{1:K})$ to be flat over $[0,1]^K$ subject to the constraint $\sum_k \pi_k = 1$. Note that we have implicitly assumed that the spike occurrence times are Poisson in nature with mean firing rates proportional to $\pi_k$. This assumes as little as possible about the temporal structure of the spikes. A more powerful description, for example, modeling the distribution of the interspike interval, would be obtained by incorporating this information into 3.2.
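In code, the mixture likelihood of 3.1 and the posterior class probabilities used for the soft clustering of Section 3.1 can be sketched as follows (our illustration, with gaussian event likelihoods and illustrative names, assuming events are already aligned on a common sample grid):

```python
import numpy as np

def responsibilities(D, models, pi, sig_eta2):
    """Posterior class probabilities P(M_k | D_n) under the mixture of
    equation 3.1.

    D      : (N, T) array, one extracted event per row
    models : (K, T) array, each spike model sampled at the event times
    pi     : (K,) mixing proportions
    """
    # Log likelihood of each event under each model (up to a shared constant).
    sq = ((D[:, None, :] - models[None, :, :]) ** 2).sum(axis=2)
    logp = np.log(pi)[None, :] - sq / (2.0 * sig_eta2)
    logp -= logp.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)          # P(M_k | D_n)
```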
3.1 Maximizing the Posterior. We proceed as before to find the maxima of the posterior that will give us the most probable values for the whole set of spike models. The conditions satisfied at the maxima of $\mathcal{L}$ given in 3.2 are obtained by differentiating $\log \mathcal{L}$ with respect to $\mathbf{v}_k$ and equating the result to zero:

$$\sum_{n=1}^{N} P(M_k \mid D_n, \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta)\, \frac{\partial}{\partial \mathbf{v}_k} \log P(D_n \mid \mathbf{v}_k, \tau_n, \sigma_\eta, M_k) = 0 \tag{3.4}$$

where $\tau_n$ is the occurrence time of $D_n$. Thus we obtain a soft clustering procedure in which the error for each event, $D_n$, is weighted by the probability that it is an instance of $M_k$:

$$P(M_k \mid D_n, \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta) = \frac{\pi_k\, P(D_n \mid \mathbf{v}_k, \sigma_\eta, M_k)}{\sum_{j=1}^{K} \pi_j\, P(D_n \mid \mathbf{v}_j, \sigma_\eta, M_j)} \tag{3.5}$$

Although 3.4 can be solved exactly, it is still expensive to compute, because it uses all of the data. We adopt the approach of estimating each $\mathbf{v}_k$ by fitting each model to a reduced event list, allowing the possibility of an event being in the lists of multiple models. These lists are obtained by sampling events from the whole data set and including an event in a model's reduced event list with probability proportional to $P(M_k \mid D_n, \mathbf{v}_{1:K}, \sigma_\eta)$. We apply the techniques used in the previous section to determine the values for $\sigma_w$ and in turn the most probable values of $\mathbf{v}_{1:K}$. Differentiating 3.2 and finding the condition satisfied at the maximum, we obtain the re-estimation formula

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} P(M_k \mid D_n, \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta) \tag{3.6}$$

For each model, $\sigma_\eta$ can be estimated using the methods of the previous section. The mixture model estimate for $\sigma_\eta$ is obtained by a weighted average of the individual estimates using weight $\pi_k$.

3.2 Selecting Events from the Data. For these demonstrations, any peak in the waveform that deviated from the DC level by more than 4 times the estimated RMS noise level was labeled as an event, $D_n$. Once an event is located, it is important to obtain accurate estimates of the occurrence time (with each spike model) by maximizing 2.4 over $\tau_n$. For the largest models, deviations from the optimal value as little as one-tenth the sampling period will introduce misfit errors greater than $4\sigma_\eta$. The $\tau_n$'s must be re-estimated as the spike models change for optimal results. An efficient way to perform this optimization is to use the k-d trees discussed in Section 5.

3.3 Initial Conditions. Since the re-estimation formulas derived here will find local maxima, it is critical to use good initial conditions for the spike models. Poor fits will result if there are too few spike models representing what are in fact several distinct APs. Conversely, if there are more spike models than distinct APs, not only will there be excess computational overhead, but there is no guarantee that each AP will be represented, since some spike functions may converge to represent the same AP class. Ideally, we want all potential spike shapes to be represented in the initial spike function set, $s_{1:K}(t)$. One approach toward obtaining an even representation of the AP shapes is to initialize each
spike function to single events so that $\max_t s(t) - \min_t s(t)$ is evenly distributed, with a separation proportional to the estimated waveform RMS noise. This approach works well for present purposes, because the height of an AP captures much of the variability among classes. By erring on the side of starting with too many spike models, we can obtain a good initial representation of the AP shapes. There is still a need to decide if two different models should be combined and if one class should be split into two. How to choose the number of spike models objectively will be demonstrated in the next section.

4 Determining the Number of Spike Models
If we were to choose a set of spike models that best fit the data, we would wind up with a model for each event in the waveform. We might think of heuristics that would tell us when two spike models are distinct and when they are not, but ad hoc criteria are notoriously dependent on particular circumstances, and it is difficult to state precisely what information the rules take into account. A solution to this dilemma is provided by probability theory (Jeffreys 1939; Jaynes 1979; Gull 1988). To determine the most probable number of spike models, we need to derive the probability of a set of spike models, denoted by $S_j = \{M_k^{(j)}\}$, conditioned only on the data and information known a priori, which we denote by $H$. From Bayes' rule, we obtain

$$P(S_j \mid D_{1:N}, H) = \frac{P(D_{1:N} \mid S_j, H)\, P(S_j \mid H)}{P(D_{1:N} \mid H)} \tag{4.1}$$

The only data-dependent term is $P(D_{1:N} \mid S_j, H)$, which is called the evidence for $S_j$. If we assume all the hypotheses $S_j$ under consideration are equally probable, $P(D_{1:N} \mid S_j, H)$ ranks alternative spike sets, since it is proportional to $P(S_j \mid D_{1:N}, H)$. With equal priors, the ratio $P(D \mid S_i, H)/P(D \mid S_j, H)$ is equal to the Bayes factor in favor of hypothesis $S_i$ over hypothesis $S_j$, which is the standard way to compare hypotheses in the Bayesian literature. The evidence for $S_j$ is obtained by integrating out the nuisance parameters in 3.3:

$$P(D_{1:N} \mid S_j) = \int d\mathbf{v}_{1:K}\, d\boldsymbol{\pi}\, d\sigma_\eta\, d\sigma_w\; P(D_{1:N} \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, S_j)\, P(\mathbf{v}_{1:K} \mid \sigma_w, S_j)\, P(\boldsymbol{\pi} \mid S_j)\, P(\sigma_\eta, \sigma_w \mid S_j) \tag{4.2}$$
This integral is analytically intractable, but it is often well approximated with a gaussian integral, which for a function $f(\mathbf{w})$ is given by

$$\int d\mathbf{w}\, f(\mathbf{w}) \approx f(\hat{\mathbf{w}})\, (2\pi)^{d/2}\, \left| -\nabla\nabla \log f(\hat{\mathbf{w}}) \right|^{-1/2} \tag{4.3}$$

where $d$ is the dimension of $\mathbf{w}$, $\hat{\mathbf{w}}$ is a (local) maximum of $f(\mathbf{w})$, $|A|$ denotes the determinant of $A$, and the derivatives are evaluated at $\hat{\mathbf{w}}$. With this
we obtain the evidence for spike set $S_j$,

$$P(D_{1:N} \mid S_j, H) = P(D_{1:N} \mid \hat{\mathbf{v}}_{1:K}, \hat{\boldsymbol{\pi}}, \hat{\sigma}_\eta, S_j)\, P(\hat{\mathbf{v}}_{1:K} \mid \hat{\sigma}_w, S_j)\, P(\hat{\boldsymbol{\pi}} \mid S_j)\, P(\hat{\sigma}_w, \hat{\sigma}_\eta \mid S_j)\; (2\pi)^{d/2} \left| -\nabla\nabla \log P(D_{1:N} \mid \hat{\mathbf{v}}_{1:K}, \hat{\boldsymbol{\pi}}, \hat{\sigma}_\eta, S_j) \right|^{-1/2} \Delta\log\hat{\sigma}_w\, \Delta\log\hat{\sigma}_\eta \tag{4.4}$$

where $\Delta\log\hat{\sigma}_w = \prod_k (2/\gamma_k)^{1/2}$, $\Delta\log\hat{\sigma}_\eta = [2/(N - \gamma)]^{1/2}$, and $d = KR + K + 1$. $\gamma_k$ is the number of good degrees of freedom for $M_k$ (MacKay 1992), which can be thought of as the number of parameters that are well determined by the data, and $\gamma = \sum_k \gamma_k$. $P(\sigma_w, \sigma_\eta \mid S_j)$ is assumed to be separable and flat over $\log\sigma_w$ and $\log\sigma_\eta$. Since the labeling of the models is arbitrary, an additional factor of $1/K!$ must be included to estimate the posterior volume accurately. The Hessian $-\nabla\nabla \log P(D_{1:N} \mid \mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, S_j)$ (with respect to $\mathbf{v}_{1:K}$ and $\boldsymbol{\pi}$) was evaluated both analytically and using a diagonal approximation. Both methods produced similar results, and the latter, being much faster to compute, was used for these demonstrations.

Notice that the approximation for the evidence decomposes into the likelihood for the best-fit parameters times the other terms, which collectively constitute a complexity penalty called the Ockham factor (MacKay 1992). Since this factor is the ratio of the posterior accessible volume in parameter space to the prior accessible volume, it is smaller for more complicated models. Overly broad priors will introduce a bias toward simpler models. Unless the best-fit likelihood for complex models is sufficiently larger than the likelihood for simple ones, the simple models will be more probable.

A convenient way of collapsing the spike set is to compare spike models pairwise. Two models in the spike set are selected along with a sampled set of events fit by each model. We then evaluate $P(D \mid S_1)$ and $P(D \mid S_2)$. $S_1$ is the hypothesis that the data are modeled by a single spike shape; $S_2$ says there are two spike shapes. Included in the list of spike models should be a "null" model, which is simply a flat line at the DC level. This hypothesis says that there are no events and that the data are a result of only the noise. Examples of this comparison are illustrated in Figure 3. If $P(D \mid S_1) > P(D \mid S_2)$, we replace both models in $S_2$ by the one in $S_1$. The procedure terminates when no more pairs can be combined to increase the evidence.
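The bookkeeping in 4.3 and 4.4 reduces to adding a log Ockham factor to the best-fit log likelihood; a minimal sketch (ours, with illustrative names) of that comparison:

```python
import numpy as np

def laplace_log_evidence(log_like_best, hess_logdet, dim, log_prior_volume):
    """Laplace (gaussian-integral) approximation in the spirit of 4.3/4.4:
    log P(D|S) ~= best-fit log likelihood + log Ockham factor, where the
    Ockham factor is the posterior accessible volume (2*pi)^(d/2) |H|^(-1/2)
    divided by the prior accessible volume."""
    log_occam = 0.5 * dim * np.log(2.0 * np.pi) - 0.5 * hess_logdet - log_prior_volume
    return log_like_best + log_occam

# Pairwise collapsing: merge two spike models whenever the one-shape
# hypothesis S1 has greater evidence than the two-shape hypothesis S2.
def should_merge(logev_S1, logev_S2):
    return logev_S1 > logev_S2
```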
5 Decomposing Overlapping Events
The method of inferring the spike models we have discussed thus far is valid if the event occurrence times can be accurately determined and if the noise is gaussian and stationary. Often these conditions cannot be met without identifying and decomposing overlapping events. Even if
[Figure 3 panels, left to right and top to bottom: S2: {Model 1, Model 2}; S1: {Model 1+2}; S2: {Model 3, Model 4}; S1: {Model 3+4}. Time axes in msec.]
Figure 3: The most probable number of distinct spike models is determined by evaluating the evidence for alternative hypotheses for a given set of data. Simple hypotheses are generated by selecting similar shapes in a spike set. S2 is the hypothesis that there are two distinct spike models; the fits of two such models to a sampled set of data are shown in a and b. S1 is the hypothesis that there is only one spike model; the fit of this model is shown in c. In this case, even though the total misfit is less for S2, the simpler hypothesis, S1, is more probable by exp(111) to 1. In the second row, S2 (d and e) is more probable than S1 (f) by exp(343) to 1. Note the increase in residual error with the model shown in f. The difference between models 3 and 4 is better illustrated in Figure 8 (where they are labeled M2 and M3, respectively). The large log probability ratios reported here result mainly from the abundance of data and the nongaussian outliers in the noise. A more realistic noise model, such as a heavy-tailed gaussian, would result in more accurate estimates of the true probability ratios.
the spike models are good, overlap decomposition is necessary to detect and classify individual events with accuracy. For a given sequence of overlapping APs, there are potentially many spike model sequences that could account for the same data. An example is shown in Figure 4. We can calculate the probability of each alternative,
Figure 4: Overfitting also occurs in the case of decomposing overlapping events. Shown are three of many well-fitting solutions for a single region of data. Thick lines are drawn between the data samples. The thin lines are the spike functions (note that these examples were taken from the first iteration of the algorithm, so the spike functions are noisy estimates of the underlying AP shapes). The best-fitting overlap solution in this case is not the most probable: the solution with four spike functions shown in a is more than eight times more probable than either b (five spike functions) or c (six spike functions), even though these fit the data better. The simple approach of using the best-fitting overlap solution actually increases the classification error, especially in the number of false positives for the smaller models. To minimize classification error, it is necessary to find the most probable overlap solution.

but there are an enormous number of sequences to consider, not only all possible models for each event but also all possible event times. A brute-force approach to this problem is to perform an exhaustive search of the space of overlapping spike functions and event times to find the sequence with maximum probability. This approach was used by Atiya (1992) in the case of two overlapping spikes with the times optimized to one sample period. Unfortunately, for many realistic situations this method is computationally too demanding even for off-line analysis. For overlap decomposition to be practical, we need an efficient way to fit and rank a large number of potential spike model sequences. In addition, we would like to state precisely what hypothesis subspace is searched, so we can say what model combinations cannot account for a given region of overlapping events. We can obtain a more efficient decomposition algorithm by employing
Figure 5: As the peaks of two action potentials get closer together, it becomes more difficult to classify either one with accuracy. It is necessary in this case (b and c) to fit multiple models simultaneously.
two techniques. The first is to consider only AP sequences that occur with non-negligible probability. This allows us to obtain a large, but manageable, hypothesis space in which to search. The second is to make the search itself efficient using appropriate data structures and dynamic programming.

5.1 Restricting the Overlap Hypothesis Space. The main difficulty with overlapping APs is that there is no simple way to determine the event times. For many overlaps, such as the one in Figure 5a, the event times can be determined directly, because the APs are separated enough so that the models can be fit independently. As the degree of overlap increases, as in Figure 5b and c, accurate classification of one event depends on accurate classification of the surrounding events. In this case, the overlapping models must be fit simultaneously. Moreover, since small misalignments of the model with respect to the event can introduce significant residual error, each model in the overlap sequence must be precisely aligned. The continuum of possible event times is the major factor contributing to the multitude of potential overlap models. We can reduce this space significantly if we consider to what precision the $\tau_n$'s must be optimized. For a given spike model, $s_k(t)$, the maximum error resulting from a misalignment of $\delta_k$ is given by²

$$\epsilon = \delta_k \max_t \left| s_k'(t) \right| \tag{5.1}$$

From this we obtain the precision necessary to ensure that the error introduced by the model alone is less than $\epsilon$, and we only need to choose among a discrete set of points.³

²We ignore the discontinuities in the derivative of the piece-wise linear model.
³For these demonstrations we use $\epsilon = 0.5\sigma_\eta$, which results in $\delta_k$'s ranging from 0.05 to 0.3 sampling periods.
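Under this reading of 5.1 (maximum error proportional to the steepest slope), the required $\tau$-resolution follows directly from the fitted model; a minimal sketch with illustrative names:

```python
import numpy as np

def alignment_precision(y, h, sig_eta, c=0.5):
    """Tau-resolution delta_k for a piecewise-linear spike model with knot
    values y and spacing h, so that misalignment error stays below
    epsilon = c * sigma_eta (using our reading of equation 5.1)."""
    max_slope = np.max(np.abs(np.diff(y))) / h   # max |s_k'(t)| between knots
    return c * sig_eta / max_slope               # delta_k
```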
Even with this reduction, the number of possible sequences is still exponential in the number of overlapping models. This space can be reduced by considering only sequences that are likely to occur. For example, if there are five units with a Poisson firing rate of 20 Hz, the probability of observing three events within 0.5 msec is about 0.1%. Eliminating sequence models with more than two peaks within 0.5 msec of each other will introduce about 0.1% error. In this manner, the desired trade-off between classification accuracy and computational cost can be determined. In practice, however, spikes often do not fire in a Poisson manner but fire in bursts. The firing rate model in this case should be adapted accordingly so that the expected number of missed events is estimated accurately.

5.2 Searching the Overlap Hypothesis Space. Let us first outline the decomposition algorithm. To fit general model sequences, we use the methodology of dynamic programming. The event data is fit in sections from left to right. At every stage, a list is maintained of all plausible sequences⁴ from the restricted hypothesis space determined by the methods described above. The length of data fit is extended by computing for each sequence on the list all plausible models that result by fitting the residual structure in the next region. The probabilities for all sequences are then recomputed, discarding any sequences below the probability threshold. The search terminates when no further overlaps are encountered in the most probable sequence model.

We now discuss each step in more detail. The primary operation in the algorithm is that of determining the most probable sequence models for a region of data. For efficiency, we precompute all possible waveform segments and store the set in a k-d tree (Bentley 1975), with which a fixed-radius nearest-neighbor search can be performed in time logarithmic in the number of models (Friedman et al. 1977; Ramasubramanian and Paliwal 1992). $O(N \log N)$ time is required to construct the tree, but once it is set up, each nearest-neighbor search is very fast. The set of overlap functions for a region from $a$ to $b$ around the spike peak is defined by

$$A_{k_{1:L},\, n_{1:L}}(t) = \sum_{l=1}^{L} s_{k_l}(t - n_l\,\delta_{k_l}), \qquad k_l = 1, \ldots, K, \quad k_1 < \cdots < k_L, \quad a < t - n_l\,\delta_{k_l} < b, \quad n_l \text{ integer} \tag{5.2}$$

where $L$ is the maximum number of overlapping spike function segments in the peak region $[a, b]$, and $\delta_k$ is the $\tau$-resolution for $s_k(t)$ defined in 5.1. The size of the peak region is somewhat arbitrary; the larger the region, the larger the number of waveform segments that must be considered, but the smaller the number of plausible overlap sequences found. In practice, the size of the peak region is largely limited by the memory required for the k-d tree.

⁴By plausible sequences we mean sequences with probability greater than a specified threshold.
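The fixed-radius search itself is standard; the sketch below illustrates the idea with SciPy's k-d tree (our illustration: the segment set here is a random stand-in for the overlap functions of equation 5.2):

```python
import numpy as np
from scipy.spatial import cKDTree

# Precompute candidate overlap segments sampled in the peak region, then
# find all segments within a fixed misfit radius of the observed data.
segments = np.random.randn(50000, 5)      # e.g., 5 samples per peak region
tree = cKDTree(segments)                  # O(N log N) construction

data_peak = np.random.randn(5)            # observed samples in a peak region
radius = 3.0                              # misfit radius in noise units
candidates = tree.query_ball_point(data_peak, r=radius)
# 'candidates' indexes the plausible overlap segments for this peak region.
```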
For these demonstrations, we take $L = 2$ (up to two overlapping spike function segments within a peak region) with a peak region of 0.25 msec, and include a "noise" model $A_0$ that has constant value equal to the DC voltage level. The number of waveform segments in the set can be reduced by eliminating overlapping spike functions for which the peak would have been (with high probability) detected at a sample position other than that of the data. Even with this reduction, an 11-model spike set results in about 50,000 waveform segments.

Once the best-fitting waveform segments for the first peak region are obtained, each segment is extended up to the next peak in the residuals for that segment. This peak is then fit using the k-d tree, which in turn generates additional overlap sequences. As long as the introduction of new waveform segments does not alter our conclusions about the ordering of the sequence list, for example, by fitting structure in a preceding region, we ensure either that one of the overlap sequences is true or that the sequences we are considering cannot account for the data. After each sequence from the original list has been extended, the probability of each sequence model, $c_j$, is recomputed. The exact relation is given by

$$P(c_j \mid D, S) = \frac{\int d\boldsymbol{\tau}_j\; P(D \mid c_j, \boldsymbol{\tau}_j, S)\, P(c_j, \boldsymbol{\tau}_j \mid S)}{P(D \mid S)} \tag{5.3}$$

where $D$ is the subset of data common to all sequences, and $S = \{\mathbf{v}_{1:K}, \boldsymbol{\pi}, \sigma_\eta, \sigma_w\}$. The form of the probability density function, $P(D \mid c_j, \boldsymbol{\tau}_j)$, is the same as 2.4. Equation 5.3 can be approximated with a gaussian integral by treating each peak region as a separable component,
$$P(c_j \mid D, S) \approx \frac{P(D \mid c_j, \hat{\boldsymbol{\tau}}_j, S)\, (2\pi)^{C/2} \prod_i d_i^{-1/2}\; P(c_j \mid S)\, P(\hat{\boldsymbol{\tau}}_j \mid S)}{P(D \mid S)} \tag{5.4}$$

where $C$ is the total number of spike functions in the sequence, and $d_i$ is the determinant of the Hessian of the $\tau$'s for the $i$th peak region. The values needed to compute the Hessians can be obtained directly from the k-d tree. Note that integrating over $\boldsymbol{\tau}_j$ performs the function of Ockham's razor by penalizing sequences with many spike models. Omitting this would reduce the solution to one of maximum likelihood, which chooses the sequence that best fits the data. For example, the solutions shown in Figure 4b and c both fit the data better than in 4a, but by 5.4, 4a is more than eight times more probable than the others. Use of the best-fitting solutions would result in an increase in the classification error due to the introduction of too many models. Classification error is minimized by using the most probable overlap sequences.

$P(c_j, \boldsymbol{\tau}_j \mid S)$ describes the a priori probability of the sequence of models in $c_j$ with associated occurrence times $\boldsymbol{\tau}_j$. For this discussion, we assume $P(c_j \mid S)$ to be Poisson with rate proportional to $\pi_k$ and $P(\boldsymbol{\tau}_j \mid S)$ to be proportional to $1/\pi_k$. Useful alternatives for $P(c_j, \boldsymbol{\tau}_j \mid S)$ include models which take into account a refractory period or describe different types of spiking patterns.
Figure 6: Example overlap solutions. Thick lines are drawn between the data samples. The thin lines are the spike models. The overlap sequence in (a) has three spike functions, (b) contains four spike functions, and (c) contains five spike functions.

Once the probabilities for the sequence models have been computed, the improbable models are discarded. The decomposition algorithm iterates until no overlapping structure is found in the most probable model. The search can fail if an outlier is encountered or if the true sequence is outside the hypothesis space. Otherwise, upon termination the search results in a list of all plausible sequence models of the given data along with their associated probabilities. Example decompositions are shown in Figure 6.

6 Performance on Real Data
The algorithm was first tested on real data, a section of which was shown in Figure 1. The whole waveform consisted of 40 sec of data, filtered from 300 to 7000 Hz and sampled at 20 kHz. Three iterations of the algorithm were performed with overlap decomposition after the second (with L = 1) and third (with L = 2) iterations. Spike models that occurred fewer than 10 times were discarded for efficiency, and the remaining events were reclassified. The inferred spike models are shown in Figure 7. The residuals indicate that these spike models account for almost all events
in the 40 sec waveform. Out of about 1500 total events, only 6 were not fit to within $5\sigma_\eta$. By eye, these events looked very noisy and had no obvious composition in terms of the spike models. One possibility is that they resulted from animal movement. Such events were not present in the synthesized data set described in Section 7, where all the events were fit with the inferred spike models.

By eye, all the models look distinct except perhaps for M2 and M3. One way to see the difference between these two models is to fit the data from model 3 with model 2, as shown in Figure 8. With a single electrode it is difficult to determine whether or not these two shapes result from different neurons, but they are clearly two types of events. One possibility is that these are different states of the same neuron; another is that the shape in model 3 results from a tight coupling between two neurons. Recording with multiple electrodes from a local region of tissue would help resolve issues like this.

In spite of all the math, the algorithm is fast. Inferring the spike set with overlap decomposition takes a few minutes on a Sun Microsystems Sparc IPX. Classification of the 40 sec test waveform with overlap decomposition (using $L = 1$) takes about 10 sec.

7 Performance on Synthesized Data
The accuracy of the algorithm was tested by generating an artificial data set composed of the six inferred shapes shown in Figure 7. The event times were Poisson distributed with frequency equal to the inferred firing rate of the real data set. Gaussian noise was then added with standard deviation equal to $\sigma_\eta$. The algorithm was run under the same conditions as above. The algorithm chose 14 initial spike models, which were subsequently collapsed to 6 using the methods discussed in the previous section. Note that in this case, the number of inferred models matches the number of true models, but this need not be the case if some true models are too similar to be resolved, or if there are insufficient data to identify two distinct classes. The six-model spike set was preferred over the most probable five-model spike set by exp(34):1 and over the most probable seven-model spike set by exp(19):1.

A summary of the accuracy of the spike shapes is shown in Table 1. The results of inferring and classifying the synthesized data set are shown for the nonoverlapping spikes in Table 2 and for the overlapping spikes in Table 3. An event was considered an overlap if its extent⁵ overlapped the extent of another event. Perfect performance would have all zeros in the off-diagonal entries and no undetected events. An event can be missed if it is not detected in an overlap sequence or if all its sample values fall below the threshold for event detection ($4\sigma_\eta$).

⁵The extent of an event is defined as the minimum and maximum values in time at which the best-fitting spike function differs from DC by more than $0.5\sigma_\eta$.
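Constructing such a test set is straightforward; a minimal sketch (ours, with illustrative names) that superimposes Poisson-timed copies of each inferred template on gaussian noise:

```python
import numpy as np

def synthesize(models, rates, duration, dt, sig_eta, rng=np.random):
    """Generate a test waveform as in Section 7: Poisson event times for
    each spike shape, plus gaussian background noise."""
    n = int(duration / dt)
    w = rng.randn(n) * sig_eta                     # gaussian background noise
    for s_k, rate in zip(models, rates):
        t = 0.0
        while True:
            t += rng.exponential(1.0 / rate)       # Poisson interspike times
            i = int(t / dt)
            if i + len(s_k) > n:
                break
            w[i:i + len(s_k)] += s_k               # add the AP shape
    return w
```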
Figure 7: The results of applying the algorithm to a real data set. The solid lines are the inferred spike models. The data overlying each model are a sample of at most 40 events. The residual errors are plotted below each model.
Figure 8: One way to see the difference between the spike models M2 and M3 is to fit the data from M3 (points) with M2 (solid line). The residual errors are plotted below. All the data from both spike models are plotted. If the noise level is constant throughout the duration of the AP, the large deviation in the residuals indicates that there are two distinct classes.

Table 1: Results of the Spike Model Inference Algorithm on the Synthesized Data Set.

Model                        1      2      3      4      5      6
Δmax/σ_η                  0.44   0.36   1.07   0.78   0.84   0.40
max_t s_k(t)/σ_η          17.9   11.1   10.6    7.4    4.4    5.0
Number of occurrences       39     63     45    238    155   1055

Both the form and number of spike models were determined by the algorithm. The inferred number of spike models matched the true number (six models). The second row is the maximum absolute difference between the true spike model and the inferred model, normalized by σ_η. The third row is the normalized peak of the inferred spike models, which is an indication of how far each type of AP is above the noise level. The last row shows the number of times each model occurred in the synthesized data.
The tables indicate that for the largest four spikes, the performance is nearly perfect, even including the overlapping cases. Performance is worst in the two smallest spike models, where there are a large number of missed events. For these models, there are typically
Table 2: Classification Results for the Nonoverlapping Events of the Synthesized Data Set.

True                    Inferred models                  Missed    Total
models       1      2      3      4      5      6        events    events
1           17      0      0      0      0      0           0        17
2            0     25      1      0      0      0           0        26
3            0      0     15      0      0      0           0        15
4            0      0      0    116      0      0           1       117
5            0      0      0      0     56      0          17        73
6            0      0      0      0      0    393         254       647

Each matrix component indicates the number of times true model i was classified as inferred model j. Events were missed if the true spikes were not detected in an overlap sequence or if all sample values for the spike fell below the event detection threshold ($4\sigma_\eta$). There was one false positive for M5 and seven for M6. See text for additional comments.
Table 3: Classification Results for the Overlapping Events of the Synthesized Data Set.

True                    Inferred models                  Missed    Total
models       1      2      3      4      5      6        events    events
1           22      0      0      0      0      0           0        22
2            0     36      1      0      0      0           0        37
3            0      0     20      0      0      0           0        20
4            0      1      0    116      0      1           3       121
5            0      0      0      1     61      1          19        82
6            0      0      0      3      2    243         160       408

Each matrix component indicates the number of times true model i was classified as inferred model j. Events were missed if the true spikes were not detected in an overlap sequence or if all sample values for the spike fell below the event detection threshold ($4\sigma_\eta$). There was one false positive for M5 and seven for M6. See text for additional comments.
only two or three samples that would be expected to exceed the noise level. As the threshold for event detection is lowered, there is a trade-off between the number of real spikes missed and the number of false positives resulting from the common instance in which the noise contains a spike-like shape. The number of below-threshold missed events can be minimized (with additional computational expense) by computing the probabilities at every sample point instead of only those that cross threshold. It is worth noting that this situation often does not pose
a problem in practice, since observed spikes just above the noise level frequently correspond to many different neurons.
8 Comparisons with Other Approaches
It is instructive to contrast the spike sorting algorithm presented here with other methods by comparing their performances on the synthesized data set used in the previous section. The most common method of classifying APs is through use of a hardware level detector, which detects an AP if the voltage exceeds a user-determined level. For the synthesized data set, a level detector is sufficient only to classify the largest AP (M1) with accuracy. Another common hardware approach is a window discriminator, with which APs are detected only if the peak value is within a voltage window. A window discriminator can classify M1 accurately and classify M4 with some error, since the distribution of the M4 peak voltages overlaps somewhat with other models, but it is not sufficient to discriminate between M2 and M3 or between M5 and M6. These discriminations demand more sophisticated methods.

A common software-based method for spike sorting is a feature clustering algorithm such as the one used in the commercial physiological data collection system Brainwave. The synthesized waveform was classified independently by an experienced Brainwave user (Matt Wilson). The features used to perform the classification were maximum spike amplitude, minimum spike amplitude, and time from the spike maximum to the spike minimum. Brainwave generates a list of occurrence times for each cluster but not explicit spike functions, so it was not possible to see how close the "inferred spike functions" were to the true spike functions. The occurrence times were compared to the known AP positions. Two separate classifications were performed (one using four clusters and another using six clusters), and the results of the most accurate classification (six clusters) are reported here.

Tables 4 and 5 show the classification results for the synthesized data set for the nonoverlapping and overlapping action potentials, respectively. A total of six clusters were found, but not all of these correspond to the true underlying clusters. The tables show that true models M1 and M4 were accurately identified and classified. True models M2 and M3, however, were collapsed into a single cluster. This discrimination is difficult to make without accurately estimating the occurrence time of the APs. Brainwave uses the spike peak for the occurrence time, which is accurate only to within one sample period and introduces a significant amount of noise into the features. In contrast, the Bayesian approach estimates the spike occurrence times with subsampling-period accuracy. Note also that with no overlap decomposition, there are significantly more missed events for the larger APs.
Table 4: Brainwave Classification Results for the Nonoverlapping Events of the Synthesized Data Set.

True                     Cluster number                  Missed    Total
models       1      2      3      4      5      6        events    events
1           17      0      0      0      0      0           0        17
2            0     26      0      0      0      0           0        26
3            0     15      0      0      0      0           0        15
4            0      0    116      0      0      0           1       117
5            0      0      1     24      0      6          42        73
6            0      0      0     22     13    188         424       647

Each matrix component indicates the number of times true model i was classified as belonging to Brainwave cluster j. An event was missed if a true AP did not correspond to any of the APs identified by Brainwave. The false positive counts were 2, 3, 4, and 2 for Brainwave clusters 3, 4, 5, and 6, respectively.
True models M5 and M6 were described with three clusters, with clusters 4 and 5 roughly corresponding to M5 and cluster 6 corresponding to M6. For these models, the features used make it difficult to choose the correct clusters, since the smaller models are not well separated in the three-dimensional feature space. There is less separation because the occurrence times are not estimated accurately and no overlap decomposition is done. Even if the cluster centers were accurate, we would expect the Brainwave classification to be less accurate than the Bayesian approach. Using spike functions to perform the classification utilizes all significant sample points in the waveform, which for the smallest two models is between four and eight. In contrast, only three features are used by Brainwave.

9 Extensions
There are a number of possible directions for improvements to the general waveform model we have described. At the lowest level there are possibilities for alternative noise models. For example, real extracellular noise tends to be correlated and slightly nongaussian. Incorporating this information would make the probabilities more accurately reflect the real world.

The piece-wise linear model we have described is general enough to fit almost arbitrary shapes, but that generality is also part of its shortcoming. Since the algorithm places minimal restrictions on the form of the spike model, more data are required to infer the shape. Incorporating knowledge about the spike shapes would result in more accurate conclusions with the same amount of data. Overly weak spike
Table 5: Brainwave Classification Results for the Overlapping Events of the Synthesized Data Set.

[The entries of this confusion matrix are not legible in the source scan; the row totals are those of Table 3: 22, 37, 20, 121, 82, and 408 events for true models 1-6.]

Each matrix component indicates the number of times true model i was classified as belonging to Brainwave cluster j. An event was missed if a true AP did not correspond to any of the APs identified by Brainwave. The false positive counts were 2, 3, 4, and 2 for Brainwave clusters 3, 4, 5, and 6, respectively.
shape priors will also result in overly strong Ockham factors, which will bias the results of model comparisons toward simpler models.

For some types of neurons the shape of an action potential is not constant. Bursting neurons, for example, have spikes that decay dramatically during a burst. Modeling the resulting shape is complicated because the interspike intervals during a burst are not constant over different bursts, and the degree of attenuation depends on the intervals. Another way in which APs can change their shape is due to electrode drift, which results in a slow change of the spike shapes over time. This can be handled readily by the algorithms, since re-estimating previously inferred shapes is very fast.

Another limitation stems not from the algorithm but from the method of recording. Since a single electrode gives little information about a neuron's position, decisions about whether two shapes constitute two neurons must be made based on shape and firing frequency alone. The use of multiple electrodes in a local area resolves this issue by recording the same group of neurons from different sites. Thus even if two neurons have identical shapes when recorded from one electrode, it is unlikely that those two neurons will generate the same AP shape when observed simultaneously from a different electrode. A trivial extension of the algorithm would be to run it on each electrode and then look for cross-correlations in the event times, but better results could be obtained by incorporating the information about multiple electrodes into a single model.

10 Discussion
Formulating the task as the inference of a probabilistic model made clear what was necessary to obtain accurate spike models. Optimizing the $\tau_n$'s
is crucial for both inference and classification, but this step is commonly ignored by algorithms that cluster the sample points or derive spike shapes from principal components. The soft clustering procedure makes it possible to determine the spike shapes with accuracy even when they are highly overlapping. Unless the spike shapes are well separated, hard clustering procedures such as k-means will lead to inaccurate estimates of the spike shapes.

Probability theory also provided an objective means of determining the number of spike models, which is an essential reason for the success of this algorithm. With the incorrect number of spike models, overlap decomposition becomes especially difficult. If there are too few spike models, the overlap data cannot be fit. If there are too many, decomposition becomes a very expensive computation. The evidence has proved to be a sensitive indicator of when two classes are distinct, as was shown in Figure 8. Previous approaches have relied on ad hoc criteria or the user to make this decision, but such approaches cannot be relied upon to work under varying circumstances, since their inherent assumptions are not explicit. An advantage of probability theory is that the assumptions are explicit, and given those assumptions, the answer provided by the evidence is optimal.

One might wonder if the user, having much more information than has been incorporated into the model, can make better decisions than the evidence about what constitutes distinct spike models. Probability theory provides a calculus for stating precisely what can be inferred from the data given the model. When the conclusions reached through probability theory do not fit our expectations, it is due to a failure of the model or a failure of the approximations (if approximations are made). From the performance on the synthesized data, however, the approximations appear to be reasonable. Thus when the conclusions reached through the evidence are at variance with the user's, information is at hand about possible shortcomings of the current model. In this manner, new models can be constructed and, moreover, compared objectively using the evidence.

Probability theory is also essential for accurate overlap decomposition. It is not sufficient just to fit data with compositions of spike models. That leads to the same overfitting problem encountered in determining the number of spike models and in determining the spike shapes. The Ockham penalty introduced by integrating out the $\tau$'s was a key to finding the most probable fits and consequently to achieving accurate classification. Previous approaches have been able to handle only a limited class of overlaps, mainly due to the difficulty in making the fit efficient. The algorithm we have described can fit an overlap sequence of virtually arbitrary complexity in milliseconds.

In practice, the algorithm we have described allows us to extract much more information from an experiment than with previous methods. Moreover, this information is qualitatively different from a simple
list of spike times. Having reliable estimates of the action potential shapes makes it possible to study the properties of these classes, since distinct neuronal types can have distinct spike shapes (Connors and Gutnick 1990). With stereotrodes this advantage would be amplified, since it is then possible to estimate somatic size, which is another distinguishing characteristic of cell type. Finally, accurate overlap decomposition makes it possible to investigate interactions among local neurons, which were previously very difficult to observe.
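To make the earlier soft- versus hard-clustering point concrete, the following toy example (not from this paper; the data and all names are purely illustrative) estimates the means of two heavily overlapping one-dimensional classes. The responsibility-weighted (soft) update recovers the true means, while hard nearest-mean assignment truncates each class at the decision boundary and biases the estimates apart:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
# Two heavily overlapping classes with true means 0.0 and 1.5.
x = np.concatenate([rng.normal(0.0, sigma, 500), rng.normal(1.5, sigma, 500)])

mu_soft = np.array([-0.5, 2.0])   # deliberately poor initial guesses
mu_hard = mu_soft.copy()
for _ in range(100):
    # Soft assignment: each point contributes to every class in proportion
    # to the posterior probability (responsibility) of that class.
    logp = -0.5 * ((x[:, None] - mu_soft[None, :]) / sigma) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    mu_soft = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

    # Hard assignment (k-means style): each point counted wholly for the
    # nearest mean, which excludes each class's far tail and pushes the
    # two estimates away from each other.
    labels = np.argmin(np.abs(x[:, None] - mu_hard[None, :]), axis=1)
    mu_hard = np.array([x[labels == k].mean() for k in range(2)])

print("soft:", mu_soft, "hard:", mu_hard)
```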
Acknowledgments

I thank David MacKay for helpful discussions and encouragement during the early stages of this work, Jamie Mazer for many conversations and extensive help with the development of the software, and Matt Wilson for classifying the synthesized data set with Brainwave. Thanks also to Allison Doupe and Ken Miller for helpful feedback on the manuscript. This work was supported by Caltech fellowships and an NIH Research Training Grant.
References

Atiya, A. F. 1992. Recognition of multiunit neural signals. IEEE Trans. Biomed. Eng. 39(7), 723-729.
Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509-517.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. TINS 13(3), 99-104.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3(3), 209-226.
Gull, S. F. 1988. Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering, Vol. 1: Foundations, G. J. Erickson and C. R. Smith, eds. Kluwer.
Jaynes, E. T. 1979. Review of Inference, Method, and Decision (R. D. Rosenkrantz). J. Am. Stat. Assoc. 74, 140.
Jeffreys, H. 1939. Theory of Probability (3rd rev. ed. 1961). Oxford University Press, New York.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4(3), 415-445.
Ramasubramanian, V., and Paliwal, K. K. 1992. Fast k-dimensional tree algorithms for nearest-neighbor search with application to vector quantization encoding. IEEE Trans. Signal Proc. 40(3), 518-531.
Schmidt, E. M. 1984. Computer separation of multi-unit neuroelectric data: A review. J. Neurosci. Methods 12, 95-111.
Wahba, G. 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Received June 29, 1993; accepted January 11, 1994.
REVIEW
Communicated by Idan Segev
Information Processing in Dendritic Trees
Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, University Park, Los Angeles, CA 90089 USA

This review considers the input-output behavior of neurons with dendritic trees, with an emphasis on questions of information processing. The parts of this review are (1) a brief history of ideas about dendritic trees, (2) a review of the complex electrophysiology of dendritic neurons, (3) an overview of conceptual tools used in dendritic modeling studies, including the cable equation and compartmental modeling techniques, and (4) a review of modeling studies that have addressed various issues relevant to dendritic information processing.

1 Introduction
Dendritic trees come in many shapes and sizes, and are among the most beautiful structures in nature (Fig. 1). They account for more than 99% of the surface area of some neurons (Fox and Barnard 1957), are studded with up to 200,000 synaptic inputs (Llinás and Walton 1990), are the largest volumetric component of neural tissue (Sirevaag and Greenough 1987), and consume more than 60% of the brain's energy (Wong-Riley 1989). Most importantly, though, they are the computing workhorses of the brain. This review concerns the question of how information is processed in dendritic trees.

In order to achieve focus, the scope of this review is limited in several ways. First, we bias our discussion toward studies that bear directly on questions of information processing. Less attention is devoted to work that is primarily biophysical in nature, such as studies aimed at matching dendritic neuron models to experimental data; for a thorough recent review see Rall et al. (1992). Second, we focus on questions that pertain specifically to dendritic computation, i.e., for which the spatially extended structure of the dendritic tree is essential. Questions of relevance to general neuronal computation are not emphasized here, such as the nature of information coding in spike trains (Atick and Redlich 1990; Bialek et al. 1991; McClurkin et al. 1991; Theunissen and Miller 1991; Softky and Koch 1993), or the mechanisms underlying complex neuronal spiking behavior (Traub and Llinás 1979; Traub 1982; Traub et al. 1985; Borg-Graham 1987; Lytton and Sejnowski 1991; Traub et al. 1991). Third, we consider computation on fast time scales only. Thus, the central topic
of this review concerns how dendritic trees transduce complex patterns of synaptic input into a stream of action potentials at the cell body, over periods of tens of milliseconds. Computations on longer time scales are not considered, such as those involving second-messenger systems (Waterhouse et al. 1991), ion channel migration or regulation (Popov and Poo 1992; Bell 1992; LeMasson et al. 1993), ultrastructural changes (Greenough and Bailey 1988), or any consequence of gene expression (Esterle and Sanders-Bush 1991). Fourth, the emphasis of this review is primarily on dendritic function in vertebrate neurons. This is a consequence of the fact that the bulk of dendritic modeling studies to date have dealt with vertebrate neurons. Finally, specific discussion of the computational significance of dendritic spines is left to a number of excellent reports and reviews available elsewhere (Koch and Poggio 1983; Miller et al. 1985; Perkel and Perkel 1985; Rall and Segev 1987; Segev and Rall 1988; Koch and Poggio 1987; Shepherd and Greer 1988; Zador et al. 1990; Koch et al. 1992).

The remainder of this section is devoted to a few historical notes regarding the role of dendrites in neuronal function. In Section 2, experimental evidence is reviewed that proves dendrites to be physiologically highly complex objects. In Section 3, the main conceptual and computational tools that have been applied in the study of dendritic function are introduced, including the cable equation, compartmental modeling, and several useful rules of thumb regarding the flow of current in dendritic trees. Finally, modeling studies relating directly to dendritic information processing are reviewed in Section 4.
Figure 1: Dendrites and other neural tree structures, reproduced with permission. Lengths given are approximate and correspond to direction of maximal extent. (A) Alpha motorneuron in spinal cord of cat (2.6 mm); from Cullheim et al. (1987). (B) Spiking interneuron in mesothoracic ganglion of locust (540 μm); courtesy G. Laurent. (C) Layer 5 neocortical pyramidal cell in rat (1030 μm); from Amitai et al. (1993). (D) Retinal ganglion cell in postnatal cat (390 μm); from Maslim et al. (1986). (E) Amacrine cell in retina of larval tiger salamander (160 μm); from Yang and Yazulla (1986). (F) Cerebellar Purkinje cell in human; from Ramón y Cajal (1909), v. 1, p. 61. (G) Relay neuron in rat ventrobasal thalamus (350 μm); from Harris (1986). (H) Granule cell from olfactory bulb of mouse (260 μm); from Greer (1987). (I) Spiny projection neuron in rat striatum (370 μm); from Penny et al. (1988). (J) Nerve cell in the Nucleus of Burdach in human fetus; from Ramón y Cajal (1909), v. 2, p. 902. (K) Purkinje cell in mormyrid fish (420 μm); from Meek and Nieuwenhuys (1991). (L) Golgi epithelial (glial) cell in cerebellum of normal-reeler mutant mouse chimera (150 μm); from Terashima et al. (1986). (M) Axonal arborization of isthmotectal neurons in turtle (460 μm); from Sereno and Ulinski (1987).
1.1 History of Ideas about Dendrites. Historical notes that trace the progression of ideas about dendritic physiology and information processing are available elsewhere and may be consulted by the interested reader (Brazier 1959; Lorente de Nó and Condouris 1959; Purpura 1967; Jack et al. 1975; Spencer 1977; Rall 1977; Kandel and Schwartz 1985; Shepherd and Koch 1990; Segev 1992). A few historically significant conceptual landmarks have been collected below.

Dendritic trees, like terrestrial trees, root systems, and vascular systems, are a class of branched structures well suited for the penetration of a volume, such as for the extraction or delivery of nutrients. Before the advent of electrophysiology, it was reasonable to conclude, as was done by the great neuroanatomist Golgi (1886), that dendrites played a strictly nutritive role in neuronal function. The modern idea, that a dendritic tree exists to extract information from the volume in which it sits, has led some to propose that the need to increase neuronal surface area to make room for more synapses has been the main driving force in the evolution of the vertebrate brain (Purpura 1967). Beyond simply increasing a cell's connectional "fan-in," though, it has also been frequently observed that spatially extended dendritic arborizations make possible the physical segregation of functionally distinct input pathways to a cell, a pattern seen in many areas of the vertebrate brain including the olfactory bulb, hippocampus, and cerebellum (Shepherd 1990).

What of the physiological functions of dendritic trees? The "neuron doctrine" of Cajal (1909), and in particular the "principle of dynamic polarization," entailed that signals flowed from the input surfaces of the cell (dendrites and soma) toward the axon, where a wave of nervous energy was then propagated to other neurons through synaptic contacts.¹ How the effects of distinct synaptic inputs were to be combined within the dendrites was at that time a matter beyond the resolution of either experiment or theory. When the all-or-none nature of nervous impulses was first discovered (Gotch 1902; Adrian 1914), the idea that dendrites, too, might conduct nondecremental impulses prevailed for decades. In a dissenting opinion, Lorente de Nó (1934) considered the idea of all-or-none impulse conduction in dendrites to be problematic, pointing out that dendritic integration would in this case degenerate to the propagation of any strong synaptic event directly to the cell's output; the possibility of combining the effects of widely separated synaptic inputs would thus be eliminated (Lorente de Nó and Condouris 1959). In the 1930s and 1940s, a "classical" picture emerged based in large part on the study of spinal motorneurons, which held that the exclusive spike trigger zone in a neuron was the axon initial segment, while the dendrites simply collected and summated synaptic inputs, in essence reducing the single neuron to a McCulloch-Pitts-type "morpho-less" unit (see Segev 1992). In a refinement of this idea, it was commonly believed that inputs onto distal
dendrites, which were thought to be both weak and slow, acted primarily to modulate the responsiveness of a cell to more powerful, more proximal soma-dendritic inputs (Adrian 1937; Chang 1952; Grundfest 1957; Eccles 1957; Bishop 1958; Purpura 1959). In the intervening years, a considerable body of data has accumulated showing that the dendrites of many types of neurons in the vertebrate CNS are replete with complex, interacting voltage-dependent membrane conductances, often capable of generating full-blown action potentials and other highly nonlinear behaviors. These data are reviewed in the next section.

¹Cajal came to this conclusion based solely on careful inspection of tissue sections under the microscope.

2 Electrophysiology of Dendrites
As this review is centrally concerned with the question of dendritic information processing, especially the possibility that dendritic trees are intrinsically sophisticated computing devices, the experimental studies cited in the following sections have been chosen to emphasize interesting nonlinear voltage behavior in dendritic trees. For a review of experimental work relevant to the passive cable properties of dendritic trees, see Rall et al. (1992).

2.1 Motorneurons. The earliest experiments suggesting active behavior in dendrites were carried out in "chromatolyzed" motorneurons, i.e., spinal motorneurons whose axons to the peripheral musculature had been cut 2-3 weeks prior to recording (Eccles et al. 1958). These authors recorded small, spike-like, excitatory potentials several millivolts in height, which they termed partial responses (Fig. 2). Partial responses were distinguished from conventional synaptic EPSPs in three ways: (1) they were brief and spike-like in character, (2) they occurred with a greater range of latencies in response to synaptic stimulation, and (3) they were blocked in all-or-none fashion by hyperpolarizing current injection at the soma; EPSPs would normally increase in size under these conditions. A distal dendritic origin was inferred since the partial responses could only be blocked with large hyperpolarizing currents at the cell body. Kuno and Llinás (1970) provided further evidence for a distal dendritic origin of partial responses in chromatolyzed motorneurons by showing that they were more easily blocked by synaptic inhibition delivered to the distal dendrites than by hyperpolarizing currents at the soma. They emphasized that somatic action potentials could arise from partial responses of different sizes, suggesting multiple sites of spike origin in the dendritic tree. In spite of the profoundly abnormal anatomical and physiological condition of these axotomized cells (see Eccles et al. 1958), the idea that dendrites could contain discrete excitable "hot spots" first gained popularity with these early results. The methodological problems associated with the study of chromatolyzed motorneurons were later highlighted
(Sernagor et al. 1986), where it was suggested that dendritic spikes in these cells may result from a pathological redistribution of excitable Na+ channels into the dendrites that might otherwise have been destined for the severed axon.

Figure 2: Partial responses in chromatolyzed motor neurons (Eccles et al. 1958). (A) Response to depolarizing current pulse that generated an action potential. (B-J) Responses to afferent synaptic volleys. Traces in E-J were selected to show the wide range of variability of partial responses that are superimposed on the EPSP shown in E. Arrows indicate initiation of full spikes.

2.2 Cerebellar Purkinje Cells. Cerebellar Purkinje cells have yielded one of the most clearcut examples of dendritic spiking among vertebrate neurons (Fig. 3). In early work in cerebellar slices, Hild and Tasaki (1962) recorded spike-like blips from an extracellular electrode pressed
Figure 3: Composite picture showing the relationship between somatic and dendritic action potentials following DC current injection through the recording electrode. A clear shift in amplitude of the somatic spike against the dendritic Ca2+-dependent potentials is seen when comparing the more superficial recording in B with the somatic recording in E. Note that at increasing distances from the soma the fast spikes are reduced in amplitude and are barely noticeable in the more peripheral recordings. Reprinted with permission from Llinás and Sugimori (1980).
up against a dendritic branch of a Purkinje cell. Intradendritic recordings from Purkinje cells both in vitro and in vivo have since confirmed the existence of active dendritic spike generation in mammalian (Fujita 1968; Llinás and Sugimori 1980; Ross et al. 1990; Sugimori and Llinás 1990), avian (Llinás and Hess 1976), reptilian (Llinás et al. 1968; Llinás and Nicholson 1971), and amphibian (Hounsgaard and Midtgaard 1988) cerebella. That dendritic spikes in Purkinje cells were due to Ca2+ and not Na+ currents was first demonstrated by Llinás and Hess (1976). Unlike fast somatic sodium spikes that peak in 1-2 msec, dendritic calcium spikes were slow rising, typically reaching a peak of 30-60 mV at 5-10 msec, and could be evoked by a 10 mV depolarizing intradendritic current injection (Llinás and Nicholson 1971; Llinás and Sugimori 1980b; Fig. 3). A subsequent detailed study of the electrophysiological properties of dendritic membrane in these cells also revealed
voltage-dependent calcium "plateau potentials" of between 10 and 30 mV, i.e., noninactivating excitatory potentials that could outlast a small depolarizing current stimulus by hundreds of milliseconds (Llinás and Sugimori 1980). In these experiments, calcium spiking and plateau potentials were ubiquitous in Purkinje cell dendrites; more recent optical recording experiments have established that calcium influx occurs across essentially the entire dendritic tree (Tank et al. 1988; Sugimori and Llinás 1990; Ross et al. 1990). In contrast, Na+-mediated spiking and plateau channels were found to be localized near the soma (Llinás and Sugimori 1980; Hounsgaard and Midtgaard 1988).

2.3 Hippocampal Pyramidal Cells. Spencer and Kandel (1961) first observed "fast prepotentials" (FPPs) in hippocampal pyramidal cells, citing similarities to the partial responses described by Eccles et al. (1958). Somatic or axonal origins of the spikes were again ruled out since they were not blocked by hyperpolarizing current injection. Following Eccles (1957), the authors conjectured that the spike-like prepotentials may have originated at the main bifurcation of the apical dendritic tree, where a patch of excitable membrane could act as a booster to otherwise ineffective distal synaptic inputs (Fig. 4). Subsequent studies using a variety of techniques have provided extensive further evidence for dendritic spiking mechanisms in hippocampal pyramidal cells (Andersen and Lomo 1966; Purpura et al. 1966; Schwartzkroin and Slawsky 1977; Wong et al. 1979; Schwartzkroin and Prince 1980; Benardo et al. 1982; Masukawa and Prince 1984; Herreras 1990; Poolos and Kocsis 1990; Turner and Richardson 1991; Wong and Stewart 1992). In the first direct intradendritic recordings in CA1 and CA3 hippocampal cells in vitro, Wong et al. (1979) showed that bursts, consisting first of fast Na+-dependent spikes followed by slow (presumably Ca2+-dependent) spikes, could occur either spontaneously or in response to depolarizing current pulses. A subsequent study on CA1 dendrites that had been surgically isolated from their cell bodies (Benardo et al. 1982) showed similar mixtures of fast Na+ and slow Ca2+ spikes (as well as spikes of intermediate duration) in response to intradendritic current injections (Fig. 5A). The dependence of fast and slow dendritic spikes on sodium and calcium channels, respectively, was demonstrated in CA1 pyramidal cells by Poolos and Kocsis (1990), as illustrated in Figure 5B. The occurrence of multiple points of inflection and mixtures of spikes of differing amplitudes in these studies was interpreted as evidence of multiple dendritic sites of spike generation; other workers have reached similar conclusions (Masukawa and Prince 1984; Jones et al. 1989; Turner and Richardson 1991; Wong and Stewart 1992). The relative independence of dendritic and somatic action potentials was explicitly demonstrated by impaling the same neuron with two electrodes (Wong and Stewart 1992), as illustrated in Figure 5C. Finally, in a recent optical recording study, Jaffe et al. (1992) reported that when potassium currents were blocked with TEA, voltage-dependent
calcium entry could be recorded across the entire apical and basilar dendritic tree of a CA1 cell, whereas rapid sodium entry indicative of active penetration of sodium spikes was present only in proximal dendrites up to a distance of 200 μm from the cell body.

Figure 4: The idea that fast prepotentials recorded at the cell body arise from a patch of excitable membrane at the main apical bifurcation of a hippocampal pyramidal cell was illustrated by Spencer and Kandel (1961); reprinted with permission.

In addition to dendritic spikes, subthreshold voltage-dependent conductances have been observed in intact and isolated CA1 dendrites that give rise to larger responses to depolarizing (excitatory) than to hyperpolarizing (inhibitory) current pulses (Benardo et al. 1982). This anomalous rectification was considered to be due either to voltage-dependent Ca2+ channels or to background activation of NMDA channels, whose negative slope conductance near resting potential (Mayer and Westbrook 1987) can result in an apparent increase in input resistance with depolarization (Sutor and Hablitz 1989). In hippocampal dentate granule cells depolarized to -50 mV, Keller et al. (1991) have shown that when synaptic inhibition has been blocked, the NMDA (vs. non-NMDA)
Figure 5: (A) Recordings from intact (upper trace) and surgically isolated (lower trace) dendrites of CA1 pyramidal cells reveal complex spiking traces in response to current injections (Benardo et al. 1982); recordings were made between 100 and 350 μm from the cell body (reprinted with permission). (B) Intradendritic recordings at more than 200 μm from the cell body in CA1 pyramidal cells in response to 0.9 nA current injections are shown for the same cell under three conditions: normal saline (upper), 0-calcium (middle), and STX (lower). Note disappearance of underlying slow spike when calcium was removed from the bath, and disappearance of fast spikes in the presence of a sodium channel blocker (reprinted with permission from Poolos and Kocsis 1990). (C) Simultaneous recording in apical dendrites and soma in CA1 pyramidal cell in response to direct dendritic depolarization. Somatic response (lower trace) to dendritic burst (like that in upper trace) reveals independence of dendritic and somatic spiking activity. Reprinted with permission from Wong and Stewart (1992).
component of the EPSP accounts for half the peak current and three quarters of the total injected synaptic charge. This experiment argues that the voltage dependence of the NMDA channels is likely to be an important nonlinearity influencing subthreshold synaptic integration (see also Salt 1986). An NMDA-dependent potentiation of the dendritic spiking response in CA1 cells following a tetanus has also been shown (Poolos and Kocsis 1990), revealing an intricate interplay between at least two kinds of excitatory, voltage-dependent nonlinearities in these cells.
Interestingly, neither fast nor slow spikes were generated in the dendrites of dentate granule cells in response to large depolarizing current pulses (Benardo et al. 1982), highlighting the fact that significant differences are found in the distribution of membrane nonlinearities in different types of neurons within the same brain area, even among closely related neuron types.

2.4 Neocortex. Unlike their hippocampal and cerebellar counterparts, the dendritic trees of neocortical pyramidal cells are not confined to specific neural laminae that are essentially free of cell bodies (such as stratum radiatum in hippocampus or stratum moleculare in cerebellum). It is therefore technically difficult to systematically record from neocortical dendrites in order to study their electrophysiological properties directly. Early evidence for active dendritic responses in neocortical pyramidal cells was mostly indirect (Chang 1952; Cragg and Hamlyn 1955; Purpura and Shofer 1964; Spear 1972; Arikuni and Ochs 1973; Spencer 1977). Deschenes found fast prepotentials in 40% of the fast-conducting pyramidal tract neurons of the primary motor cortex in anesthetized cats, of a form very similar to those previously described in the hippocampus (Spencer and Kandel 1961) and elsewhere (see Spencer 1977). However, the possibility that the FPPs were electrotonically attenuated spikes from other neurons transmitted through gap junctions was not entirely ruled out (Deschenes 1981). In cultured neonatal pyramidal cells from rat sensorimotor cortex, Huguenard et al. (1989) demonstrated the existence of voltage-dependent Na+ channels on the proximal apical dendrites up to at least 80 μm from the cell body, using cell-attached patch and whole cell recordings. Evidence for distal dendritic calcium spikes has also been provided in mature neocortical neurons in vitro (Stafstrom et al. 1985). In these experiments, depolarizing clamp voltages at the cell body were frequently seen to initiate large uncontrolled Ca2+ spike currents, suggesting that the Ca2+ spikes were occurring in an electrotonically remote dendritic location. Also in somatic recordings, Reuveni et al. (1993) blocked Na+ spikes using TTX and K+ currents using TEA to reveal prolonged Ca2+ spike-plateaus. Since repolarization of the calcium spike was often seen to occur in several discrete steps (Fig. 6A), and based on the results of modeling studies, the calcium spikes were concluded to originate from several discrete dendritic Ca2+ hot spots separated by passive membrane. In the most direct evidence for dendritic excitability in these cells, intradendritic recordings from layer 5 pyramidal cells have been reported by Pockberger (1991) and Amitai et al. (1993; Fig. 6B), in both cases showing complex superpositions of spikes of varying widths and amplitudes in response to constant current injections. In another vein, Cauller and Connors (1992) have shown in vitro that stimulation of layer 1 afferents alone is sufficient to drive strong responses at the cell bodies of layer 5 pyramidal cells. In an attempt to understand this result using compartmental modeling techniques, they subsequently
Figure 6: (A) Stepwise repolarization in neocortical pyramidal cells of calcium plateaus induced by small intracellular current pulses in TTX-TEA medium. Duration of plateaus was variable from trial to trial, but breakpoint voltages remained relatively constant. Two plots are from two different neurons. Reprinted with permission from Figure 1C,D, Reuveni et al. (1993). (B) Intradendritic recording near main apical branchpoint (200-300 μm from the cell body) also in a neocortical pyramidal cell (reprinted with permission from Amitai et al. 1993). Complex superpositions of fast and slow spikes were seen in response to a 2 nA current injection. Two plots are the same trace at different time scales.
demonstrated that in a cell with a purely passive dendritic tree, stimulation of distal apical synapses was never sufficient to generate cell responses of the magnitude observed in their experiments. Their conclusion: active dendritic conductances are involved in the amplification of distal dendritic input in these neurons. As in hippocampus, another type of voltage-dependent nonlinear membrane mechanism known to play an important role in dendritic integration in neocortex is the NMDA channel (Mayer and Westbrook 1987). In pharmacological blocking studies, NMDA channels have been shown to account for a major proportion of the excitatory synaptic drive onto neocortical pyramids (Miller et al. 1989; Fox et al. 1990), consistent with histochemical labeling studies that reveal relatively high concentrations of NMDA receptor binding sites in the superficial, synapse-rich layers
of cerebral cortex where the dendrites of these cells receive much of their synaptic input (Cotman et al. 1987). The voltage dependence of the NMDA channel (Mayer and Westbrook 1987) has in modeling studies proven to be capable of significantly influencing the integrative behavior of a pyramidal cell's dendritic tree (Mel 1992b, 1993b).

2.5 Other Cells. Evidence for voltage-dependent dendritic membrane mechanisms has been acquired in other types of vertebrate neurons, including cells in the thalamus (Maekawa and Purpura 1967; Jahnsen and Llinás 1984), inferior olive (Llinás and Yarom 1981), and substantia nigra (Llinás et al. 1984). Maekawa and Purpura reported fast depolarizing potentials in response to synaptic stimulation, distinct from slow EPSPs, which they interpreted as possible partial spikes of dendritic origin that could account for the "extraordinary responsiveness" of these cells to synaptic input. In another study of thalamic neurons, Jahnsen and Llinás (1984) reported high-threshold Ca2+ spikes lasting 18-22 msec in presumed intradendritic recordings that were similar to those observed in Purkinje cell dendrites. In the inferior olive, Llinás and Yarom (1981) have described a number of voltage-dependent conductances, including multiple all-or-none high-threshold Ca2+ spikes of presumed dendritic origin. Finally, in the substantia nigra, Llinás et al. (1984) provided evidence for two types of dendritic Ca2+ spikes, one low- and the other high-threshold with respect to somatic current injection. In these cells, the Ca2+ spikes are thought to be involved in the dendritic release of dopamine.

2.6 Summary. While it is difficult to achieve a meaningful summary of the experimental work discussed above, certain general tendencies in the experimental data may be identified. For several important classes of vertebrate neurons, the conception of the dendritic tree as an essentially passive collector of synaptic inputs has not been borne out in the results of 30 years of electrophysiological work. First, in several major types of output neurons in the cerebellum and mammalian forebrain, mechanisms capable of generating dendritic spikes of one or more ionic varieties have been either positively demonstrated to exist through intradendritic recordings, or are strongly suggested in intrasomatic recordings or through a variety of other techniques. Dendritic calcium spikes are a particularly widespread phenomenon, having been observed in cerebellar Purkinje cells, hippocampal and neocortical pyramidal cells, and cells in the thalamus and inferior olive. Second, in many of these same neuron types, especially hippocampal and neocortical pyramidal cells, but also neurons in the pyriform cortex, thalamus, basal ganglia, midbrain, spinal cord, and other areas (Nicoll et al. 1990; Mayer and Westbrook 1987; Cotman et al. 1987), a component of the excitatory synaptic input to the dendrites is carried by
voltage-dependent NMDA-type channels; in some cases the NMDA component may predominate, such as in response to high-frequency stimulation and/or "natural" sensory input (Jones and Baughman 1988; Salt 1986; Miller et al. 1989; Fox et al. 1990; Keller et al. 1991). In the next section, we review the mathematical and conceptual tools that have been used to model the electrical behavior of dendritic trees.

3 Conceptual and Computational Tools
In order to understand computation in dendritic trees, it is necessary first to understand the principles that govern the flow of electric current in dendrites. For example, when an excitatory synapse is activated on a dendritic spine, where does the injected current flow, how long does it take, and what is its effect on the membrane potential both locally and elsewhere in the dendritic tree? In this section we review conceptual and computational tools and basic results relevant to current flow in dendrites. There are excellent introductory reviews of the biophysical mechanisms underlying membrane resistance and capacitance, membrane potential, time constants, and basic steady-state and transient responses of passive and active membranes (Kandel and Schwartz 1985; Hille 1984; McCormick 1990; Shepherd and Koch 1990; Jack et al. 1975; Rall 1977). A working knowledge of these concepts is assumed in the following.

One of the most important early developments in the study of neural information processing was the application of the one-dimensional "cable" equation to problems of current flow in branched, passive neuronal structures (Rall 1959, 1964). To begin, we present the cable equation, consider the physical interpretation of its terms, and discuss an important analytical solution. We then enumerate several useful rules regarding the electrical behavior of passive dendritic trees. Finally, we introduce the enterprise of compartmental modeling, a discretized numerical approach to the solution of the cable equation that allows treatment of arbitrary neuronal geometries and voltage-dependent membrane mechanisms.

3.1 The Cable Equation. Due to their physical construction, neurites (dendrites and axons) have been likened to electrical cables. A cable consists of a long, thin, electrically conducting core surrounded by a thin membrane whose resistance to transmembrane current flow is much greater than that of either the internal core or the surrounding medium. In the case of a dendrite, both the internal cytoplasm and the extracellular space are thought to conduct nearly as well as seawater (see Jack et al. 1975, ch. 1). Because the resistance to current flow through the cytoplasm is relatively low, injected current can travel long distances down the dendritic core before a significant fraction leaks out across the highly
resistive membrane. The fundamental equation used to describe current flow in passive dendrites is the "cable equation":

$$\frac{1}{r_i}\frac{\partial^2 V}{\partial x^2} = c_m\frac{\partial V}{\partial t} + \frac{V}{r_m} \qquad (3.1)$$
where V = V(x, t) is the transmembrane voltage at location x and time t, and for a unit length of cable, r_i (in kΩ/cm) is the resistance to internal current flow along the core, r_m (in kΩ·cm) is the transmembrane resistance, and c_m (in μF/cm) is the membrane capacitance (see Jack et al. 1975; Shepherd and Koch 1990; Rall et al. 1992 for discussion of units). The cable equation was developed in 1855 as part of a mathematical theory with practical applications for transatlantic telegraph lines (Kelvin 1855; see Rall 1977 for historical perspective). Derivations and assumptions of the cable equation can be found elsewhere (Jack et al. 1975; Rall 1977, 1989; Shepherd and Koch 1990).

Figure 7: Section x of a one-dimensional cable. Arrows show axial current flowing into and out of section x, and the ionic (resistive) and capacitive transmembrane currents. The cable equation (equation 3.1) specifies that, for all x, these currents must always be in balance.

The basic cable equation may be easily interpreted as a balance of three kinds of electrical current. Consider a short length of a dendritic branch labeled x (Fig. 7). A fundamental law of conservation of electrical current (Kirchhoff's current law) tells us that the net accumulation of "axial" current at x (i.e., axial current entering minus axial current leaving) must be equalled by the net current flowing out of the cell across the membrane at location x (i.e., ionic membrane current plus capacitive membrane current). This equality is represented directly in equation 3.1. The single term on
the left represents the net axial current into node x through the cytoplasm from neighboring "compartments" to the left and right of x. The second-derivative form of the term indicates that the axial current flowing into x is, roughly speaking, related to the "curvature" of the voltage profile along the dendritic branch centered at x. For example, when the voltage at x is less than the average of its neighbors (positive voltage curvature), then there is a net accumulation of axial current at x. The two terms on the right represent the capacitive and ionic currents, respectively, that must flow across the membrane at x in order to balance the net axial current influx. These terms simply state that (1) the capacitive current at x is proportional to the rate of change of the local transmembrane voltage (rapid changes in voltage are associated with large capacitive currents), and (2) the resistive, or ionic, current at x is, according to Ohm's law, directly proportional to the membrane voltage at x. To summarize, there are three types of current flowing in and around point x in a dendritic branch: (1) net axial current into x through the cytoplasm, related to the local membrane voltage "curvature," (2) capacitive current, proportional to the rate of change of membrane voltage, and (3) ionic current, proportional to the membrane voltage itself in the case of a passive dendrite. The cable equation simply says that these three quantities must always be in balance.
3.2 A Classical Solution to the Cable Equation. One historically important solution gives the voltage of a uniform infinite cable in response to a current step I_0 injected at the origin X = 0:
$$V(X,T) = \frac{r_i I_0 \lambda}{4}\left[e^{-X}\,\mathrm{erfc}\!\left(\frac{X}{2\sqrt{T}}-\sqrt{T}\right) - e^{X}\,\mathrm{erfc}\!\left(\frac{X}{2\sqrt{T}}+\sqrt{T}\right)\right] \qquad (3.2)$$
where X = x/λ measures distance from the origin in space constants in a cylinder of infinite extent, and T = t/τ measures time in time constants. The space constant λ = (r_m/r_i)^{1/2} is the distance at which the voltage has fallen off to V_0/e, i.e., to about one-third its value at the site of stimulation; the membrane time constant τ = r_m c_m is the time required for an isopotential patch of membrane to charge to within 1/e of its steady-state value. The charging of the infinite cable about the origin during one time constant is plotted in Figure 8. If attention is restricted to the transient voltage change at X = 0,

$$V(0,T) = \frac{r_i I_0 \lambda}{2}\,\mathrm{erf}\!\left(\sqrt{T}\right) \qquad (3.3)$$

which represents growth that is significantly faster than a single exponential (bold contour at X = 0).
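These limiting forms are easy to check numerically. The following sketch (not part of the review; voltage is expressed in units of r_i I_0 λ, and the function name is illustrative) evaluates equation 3.2 with the complementary error function and verifies the X = 0 charging curve (equation 3.3) and the long-time exponential profile (equation 3.4):

```python
import numpy as np
from scipy.special import erf, erfc

def v_cable(X, T, ri=1.0, I0=1.0, lam=1.0):
    # Equation 3.2: step response of an infinite passive cable.
    # X: distance in space constants; T: time in membrane time constants.
    A = ri * I0 * lam / 4.0
    X = np.abs(np.asarray(X, dtype=float))  # solution is symmetric about the origin
    s = np.sqrt(T)
    return A * (np.exp(-X) * erfc(X / (2 * s) - s)
                - np.exp(X) * erfc(X / (2 * s) + s))

# At X = 0 the charging curve should match equation 3.3 ...
assert np.isclose(v_cable(0.0, 1.0), 0.5 * erf(1.0))
# ... and at long times the profile should decay as exp(-X) (equation 3.4).
assert np.isclose(v_cable(2.0, 200.0), 0.5 * np.exp(-2.0), rtol=1e-3)
```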
If attention is restricted to the steady-state voltage along the entire cable at long times, we see that the voltage profile is given by

$$V(X) = \frac{r_i I_0 \lambda}{2}\,e^{-X} \qquad (3.4)$$

a simple exponential decay with distance (dashed contour at T ≫ τ).

Figure 8: Plot of infinite cable charging in response to current step I_0 at X = 0. Transient voltage change at X = 0 is faster than exponential, given by equation 3.3. At long times, voltage decays symmetrically and exponentially about X = 0, as given by equation 3.4.

For derivations and solutions of the cable equation under these and a number of other boundary conditions see Jack et al. (1975) and Rall (1977), such as the response to step currents for finite or semi-infinite cables, with either sealed or open ends, response to voltage steps for finite or infinite cables, response to noninstantaneous voltage changes, and step changes in membrane conductance; a variety of other quantities, such as input resistance, electrotonic length, and velocity of the propagating decremental voltage wave in response to a stimulus, have also been derived explicitly.

For branched passive cables, an early and historically important contribution was the idea of an "equivalent cylinder" representation of a
dendritic tree. Rall (1959) showed that when (1) all branch points in a dendritic tree obey the d^{3/2} law (see item 4 below), (2) all terminal tips have identical boundary conditions, and (3) all terminal tips lie at the same electrotonic distance from the soma, then the entire dendritic tree may be replaced by a single "equivalent" cylinder for certain restricted input conditions, greatly simplifying the calculation of dendritic responses; see Rall (1977) for details. An algorithm for solving the case of arbitrary passive dendritic trees was also first provided by Rall (1959). Butz and Cowan (1974) later developed a graphic method to compute the Laplace transform for arbitrary passive trees, and Horwitz (1981, 1983) extended and actually applied this method for arbitrary trees. Abbott et al. (1991) showed how the path integral can be computed for arbitrary dendritic trees, using a method borrowed from statistical physics that is both computationally efficient and allows the assumption of time- and/or spatially varying membrane resistivities. Most recently, Evans et al. (1992) and Major et al. (1993) have treated arbitrary passive dendritic trees in the most complete way. Another recent paper describes a method for computing explicit signal delays between arbitrary pairs of stimulus and recording sites (Agmon-Snir and Segev 1993), and examines the consequences for a cell's sensitivity to synchronous synaptic inputs. Some interesting analytical extensions of cable theory have also been developed for dendritic trees containing active membrane, including a linearized analysis of active neuronal cables valid for voltage perturbations of a few millivolts (Koch 1984), and a continuum-limit analysis of the propagation of signals in dendrites with active spine heads (Baer and Rinzel 1991). One of the most important applications of the cable equation and its extensions to branched structures has been to the estimation of biophysical parameters of real neurons based on experimental data, especially membrane and cytoplasmic resistivity (R_m and R_i), and electrotonic length L. An excellent recent review of this large body of work is available (Rall et al. 1992).

3.3 Dendritic Electrotonus: Rules of Thumb. One of the most significant legacies of cable theory lies in the rules it has provided relating to the electrical behavior of passive dendritic trees. A paradigmatic assumption common to many electrophysiologists and modelers of single-neuron function has been that a thorough understanding of the electrical behavior of passive neural processes is desirable, even in cases when active voltage-dependent nonlinearities are known to be present en force. We thus close our overview of cable theory by consolidating several rules regarding the flow of current and distribution of voltage in passive dendrites and dendritic trees.

1. Voltage signals attenuate with distance. In the reference case of a cylinder with infinite extent, a steady-state input decays
exponentially with distance, with λ = (r_m/r_i)^{1/2} = [(d/4)(R_m/R_i)]^{1/2}, where d is the branch diameter. Hence, voltage attenuation is more severe for thinner dendrites (Fig. 9A) and/or leakier membrane (Fig. 9B). The upper graphs in both A and B show the voltage at T ≫ τ along the length of the cable in response to a current step I_0 at the origin; in the lower graphs the three curves are normalized to allow comparison of electrotonic lengths.

Figure 9: Steady-state spread of voltage from a stimulus origin (e.g., a voltage clamp at X = 0) in an infinite cable, adapted from Shepherd and Koch (1990). Default cable parameters are R_m = 8000 Ω·cm², R_i = 80 Ω·cm. Larger diameters (A) and higher specific membrane resistance (B) give longer space constants. (C) Steady-state spread of voltage under 3 branching conditions: (a) terminated, (b) connection to 2 or more daughter branches that satisfy the d^{3/2} law, and (c) connection at a branch point to one equivalent and one thicker branch. Adapted and extended from material in Shepherd and Koch (1990).

2. Voltage attenuation is more severe for high-frequency components of an input waveform. For ω ≫ 1/τ, λ(ω) ∝ 1/√ω. Thus, the peaks of signals that are fast relative to τ, such as spikes or fast synaptic
inputs, are much more strongly attenuated with distance than are steady-state inputs (Rinzel and Rall 1974; Zador 1993).
3. The input resistance R_in of a neurite is the magnitude of the voltage response at T = ∞ to a unit DC current step. For an infinitely long uniform neural process, R_in = (r_m r_i)^{1/2}/2 grows as 1/d^{3/2} (see the sketch following this list). As a consequence, inputs to distal dendritic branches, which are often of very small diameter, can result in large local depolarizations in comparison to identical inputs delivered to large branches, which are often found closer to the cell body. The upper graph in Figure 9A illustrates the variation in R_in that results from changes in branch diameter: the relatively large voltage deflection at X = 0 for the 1 μm diameter case is indicative of its relatively high input resistance; the same is true for the high R_m case in B.

4. Voltage attenuation depends on boundary conditions, that is, what
a branch is connected to (Fig. 9C). For example, when a stimulus is applied near to a branch end (a), the voltage attenuation is significantly reduced in the direction of the closed end, since the charge that would have flowed past the closed end "piles up" locally and causes a relative increase in membrane potential (see Jack et al. 1975; Shepherd and Koch 1990). When a dendritic "parent" branch connects to a set of k smaller daughter branches (b), where d_1^{3/2} + ... + d_k^{3/2} = d_parent^{3/2}, then the voltage attenuation is uninterrupted through the branch point, as if the daughter tree were a simple continuation of the parent branch (see Rall 1977). When a stimulus is applied to a thin branch that connects to a thick branch (c), then the voltage attenuation is exaggerated in the direction of the thick branch, since some of the charge that would have depolarized the thinner branch near the branch point is "sucked" into the relatively low-resistance pathway offered by the thicker branch.
5. Voltage attenuation is directionally asymmetric in a dendritic tree, as illustrated for an idealized neuron in Figure 10A. If a constant current stimulus is applied at distal tip I, then the steady-state voltage response is strongly attenuated in the direction of the cell body (upper solid curve), whereas if the same stimulus is applied at the cell body, the voltage attenuation from the cell body to the distal tip is modest (lower dashed curve; figure from Rall and Rinzel 1973). This asymmetry is due to the difference in cable boundary conditions looking toward or away from the cell body, as discussed in item 4, and has been frequently treated in the literature, e.g., Rall and Rinzel (1973), Koch et al. (1982), Brown et al. (1988). An excellent graphic representation of this asymmetry is provided by the morphoelectrotonic transform (MET) introduced by Zador (1993). As shown in Figure 10B for a hippocampal CA1 pyramidal cell, the length of each section of dendrite is scaled by the log
steady-state voltage attenuation from the soma to that section. (The log-attenuation is defined as L_ij = log|A_ij|, where A_ij = V_i/V_j is the voltage attenuation from i to j for a stimulus at i; see Zador 1993.) The entire tree is highly reduced, particularly the basal dendrites and small apical side branches, due to their closed-end boundary conditions. By contrast, in Figure 10C the distance from every point to the soma is made proportional to the log-attenuation from that point to the soma. In this case, small side branches and thin basal dendrites are exaggerated in length, reflecting the strong attenuation of voltage signals in the direction of the cell body.

Figure 10: (A) Diagram of idealized neuron and plot of steady-state voltage for different stimulus conditions. Solid curve shows voltage profile in dendritic tree for constant current stimulus I at a single distal tip. Note steep voltage attenuation along trajectory to soma vs. gradual attenuation outbound along sister (S) and cousin (C-1,2) branches. Curve with short dashes shows voltage profile when same input I is delivered to soma. Reprinted with permission from Rall and Rinzel (1973). (B) Morphoelectrotonic transform (MET) of hippocampal CA1 pyramidal cell. Distance from cell body to every dendritic section is proportional to the log DC voltage attenuation from the soma to that dendritic section. Reprinted with permission from Zador (1993); morphometric data courtesy Brenda Claiborne. (C) Different MET of same cell; now the distance to the cell body from each dendritic section is proportional to the log-attenuation from the dendritic section inward to the soma.

6. Voltage and current attenuation are reciprocal: voltage attenuation from i to j is exactly equal to current attenuation from j to i in a passive dendritic tree (i.e., A^V_ij = A^I_ji), for any locations i and j (Koch et al. 1982). Thus, current attenuation from a distal site to the cell body can be modest, implying that a large fraction of the charge injected at a distal dendritic site flows to the cell body; compare responses at cell body due to somatic vs. distal stimulus in Figure 10A.
7. Speed, delay, and input synchronization. A transient input to a dendritic branch, such as a synaptic current, is reduced in size and smoothed out in time as it propagates away from the site of stimulation. In the case of an infinitely long unbranched passive dendrite, the centroid of the wave propagates at a speed of 2λ/τ, i.e., two space constants per time constant. More generally, the total signal delay is symmetric between any two points in a passive dendritic tree, and is independent of the shape of the input signal. Delays from dendrites to soma are on the order of one membrane time constant in morphologically realistic dendritic trees (Agmon-Snir and Segev 1993). Importantly, local charging times on thin dendritic branches may be an order of magnitude faster than the membrane time constant τ. Consequently, distal dendritic arbors may function more as coincidence detectors for local synaptic inputs, whereas the soma functions more as an integrator.
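The quantities in items 1 and 3 are straightforward to compute. The sketch below (not from the review; it assumes the standard expressions r_m = R_m/(πd) and r_i = 4R_i/(πd²) for a cylinder of diameter d, with the default parameter values of Figure 9) evaluates the space constant of item 1 and the input resistance of item 3 for several branch diameters:

```python
import numpy as np

def cable_constants(d_um, Rm=8000.0, Ri=80.0):
    # d_um: diameter in microns; Rm in ohm*cm^2; Ri in ohm*cm
    # (the default values are those used in Figure 9).
    d = d_um * 1e-4                      # diameter in cm
    rm = Rm / (np.pi * d)                # membrane resistance, ohm*cm
    ri = 4.0 * Ri / (np.pi * d ** 2)     # axial resistance, ohm/cm
    lam = np.sqrt(rm / ri)               # space constant, cm (item 1)
    Rin = 0.5 * np.sqrt(rm * ri)         # input resistance, ohm (item 3)
    return lam * 1e4, Rin                # lambda in microns, Rin in ohms

for d in (0.5, 1.0, 4.0):
    lam, Rin = cable_constants(d)
    print(f"d = {d:3.1f} um: lambda = {lam:6.0f} um, Rin = {Rin/1e6:6.1f} Mohm")
# lambda grows as sqrt(d) while Rin falls as 1/d^(3/2), so thin distal
# branches see large local depolarizations for the same injected current.
```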
3.4 Compartmental Modeling. For all of their considerable conceptual appeal, analytic solutions to the cable equation become increasingly cumbersome to the extent that the case under study diverges from a passive unbranched uniform cable stimulated with a constant current or voltage source. When a cell has a complex irregular branching structure, nonuniform passive membrane properties, contains voltage- or concentration-dependent membrane channel conductances, or is driven by synaptic conductance changes in lieu of current inputs, then "compartmental" modeling is the technique of choice. Originally introduced by Rall (1964), compartmental modeling represents a finite-difference approximation of the linear cable equation, or its nonlinear extensions. Compartmental modeling entails that the dendritic tree, axonal tree, or other cable-based structure be broken into a branched network of discrete isopotential compartments. Each compartment consists of a set of lumped circuit elements representing the biophysical properties of the corresponding length of neuronal cable, and the compartments are connected together via lumped axial resistances (Fig. 11). The time evolution of voltages and other variables within this "equivalent circuit" structure in response to an arbitrary pattern of synaptic or other input is computed using standard numerical integration techniques. The advantage of such a representation is that the biophysical properties of the membrane and the cytoplasm can vary arbitrarily from compartment to compartment if so desired, and the membrane or synaptic conductances within a compartment can be defined to have complex dependencies on voltage, time, and other variables; Hodgkin-Huxley channels are one example.

The nuts and bolts of compartmental modeling are available elsewhere (for example, see Perkel et al. 1981; Segev et al. 1989; Claiborne et al. 1992 for various treatments of the method). See also Traub et al. (1991), Borg-Graham (1991), Brown et al. (1991b), and Mel (1993b) for examples including modeling of nonlinear membrane conductances, Mascagni (1989) for a lucid discussion of numerical issues, Shepherd (1992) for an interesting historical perspective, and De Schutter (1992) for an overview of currently available software for creating and running compartmental models.
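A minimal sketch of the compartmental idea follows (this illustrates only the finite-difference scheme, not any of the published models cited above; all parameter values are arbitrary). A chain of N passive isopotential compartments is integrated with the forward Euler method, and a steady current injected at the distal end produces the expected exponential steady-state profile:

```python
import numpy as np

N = 50
g_l, g_a, c = 1.0, 50.0, 1.0      # leak and axial conductances, capacitance
dt, steps = 1e-3, 10000           # about 10 membrane time constants in total
V = np.zeros(N)                   # voltage of each compartment, relative to rest
I_ext = np.zeros(N)
I_ext[-1] = 10.0                  # steady current into the distal compartment

for _ in range(steps):
    # Net axial current into each compartment: a discrete version of the
    # second-derivative term on the left side of equation 3.1.
    axial = np.zeros(N)
    axial[1:] += g_a * (V[:-1] - V[1:])
    axial[:-1] += g_a * (V[1:] - V[:-1])
    # Balance axial, leak (ionic), and capacitive currents (equation 3.1).
    V += dt * (axial - g_l * V + I_ext) / c

# The steady-state profile falls off roughly e-fold every sqrt(g_a/g_l)
# compartments away from the injection site (compare item 1 above).
print(V[::10])
```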
4 Computational Studies of Dendritic Function

A variety of modeling studies have been carried out over the past three decades to explore various aspects of dendritic function beyond simple summation of synaptic inputs. In the following, we discuss this work in the context of four main ideas that have dominated the conceptual landscape. These are:
1. The spatially extended nature of a dendritic tree permits useful spatiotemporal interactions among active synapses.
2. Dendritic trees can have multiple pseudoindependent processing subunits.

3. Passive dendritic structure may be modulated by external influences to alter the input-output behavior of the cell as a whole, or of individual subunits.

4. Nonlinear membrane mechanisms, appropriately deployed, can allow the dendritic tree of a single neuron to act as a powerful multilayer computational (e.g., logical) network.
Figure 11: (A) Simplified compartmental representation of the dendritic tree of a hippocampal pyramidal cell (figure adapted from Brown et al. 1992). (B) Blowup of equivalent circuit for a single dendritic compartment with attached spine. Main dendritic compartment is depicted with voltage-dependent Na+ and K+ conductances for fast Hodgkin-Huxley spiking, as well as a slow voltage-dependent Ca2+ conductance and a Ca2+-dependent K+ channel.

4.1 Spatiotemporal Integration. Wilfred Rall (1964) first demonstrated that a passive dendritic branch, by virtue of its spatial extension, can act as a spatiotemporal filter that selects for specific temporal sequences of synaptic inputs. Since time is required for signals to propagate along a dendritic branch, it matters what part of the dendrite gets stimulated at what time. For example, the largest superposition of signals at the cell body occurs when distal synapses are activated before proximal synapses.
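Rall's sequence-selectivity effect can be reproduced in a few lines with the same sort of passive compartmental chain discussed in Section 3.4. The sketch below is hypothetical in all of its parameter values; it delivers four brief current pulses at sites A-D in the two opposite temporal orders and compares peak depolarization at the "soma" (compartment 0).

import numpy as np

N, C_m, g_L, g_ax, E_L = 10, 100e-12, 5e-9, 50e-9, -70e-3
dt, T = 20e-6, 0.08
t = np.arange(0.0, T, dt)
sites = [2, 4, 6, 8]                      # A, B, C, D: proximal -> distal

def run(order):
    """Integrate the passive chain; pulses follow the given site order."""
    V = np.full(N, E_L)
    peak = 0.0
    onsets = 0.005 + 0.004*np.arange(4)   # one pulse every 4 ms
    for k, tk in enumerate(t):
        I = np.zeros(N)
        for onset, site in zip(onsets, order):
            if onset <= tk < onset + 0.002:
                I[site] = 100e-12         # 100 pA, 2 ms pulse
        lap = np.zeros(N)                 # axial (cable) coupling term
        lap[:-1] += g_ax*(V[1:] - V[:-1])
        lap[1:]  += g_ax*(V[:-1] - V[1:])
        V = V + dt*(g_L*(E_L - V) + lap + I)/C_m
        peak = max(peak, V[0] - E_L)      # track soma depolarization
    return peak

# The distal-to-proximal sweep (DCBA) yields the larger somatic peak.
print("ABCD (away from soma): %.2f mV" % (1e3*run(sites)))
print("DCBA (toward soma):    %.2f mV" % (1e3*run(sites[::-1])))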
Figure 12: (A) Passive 10-compartment model demonstrates directional difference for input sequence ABCD vs. DCBA; reprinted with permission from Rall (1964). (B) Single-branch schematic of the Koch-Poggio-Torre (1983) model of direction selectivity as measured at the cell body. Fast-acting excitatory inputs (black circles) are effectively "vetoed" by slow-activating shunting inhibition (open circles) only when the direction of sweep is away from the cell body. (C) Reconstructed direction-selective cell from the rabbit retina (reprinted with permission from Koch et al. 1986). (D) Cell in C was modeled assuming passive dendrites and on-path inhibition as schematized in B. Responses to preferred-direction stimulus are shown at left, null-direction at right. Lower traces include small DC current injection at cell body. Units are mV above rest (ordinate) vs. msec (abscissa).

This principle is illustrated in Figure 12A: the peak of the voltage waveform at the cell body is twice as large when inputs are activated in a sweep toward the cell body (DCBA) as when they are activated in a sweep away from the cell body (ABCD). Poggio and Torre (1977) and Koch et al. (1982, 1983) pursued this basic idea further in an effort to explain direction-selective (DS) responses in retinal ganglion cells. They amplified Rall's basic idea by showing that the relative placement and timing of excitatory and inhibitory synapses on the same dendrites could lead to a much more pronounced directional
difference than in Rall's case of excitation alone. Essentially, a large synaptic conductance increase whose reversal potential is close to the resting potential, usually called "silent" or "shunting" inhibition, acts like a hole in the membrane that shunts a large fraction of any passing current directly to the extracellular ground. While a shunting synapse cannot by itself alter the potential at the cell body, it can effectively short out the path to the cell body for any more distal depolarizing or hyperpolarizing influences (Poggio and Torre 1977; Koch et al. 1982, 1983); see Shepherd and Koch (1990) for an explanation of shunting inhibition. In the simplest instance, the Koch-Poggio-Torre model for retinal direction selectivity entails that photoreceptors are topographically mapped onto each ganglion cell dendrite, and each photoreceptor is assumed to activate both an excitatory and an inhibitory synapse at approximately the same dendritic locus (Fig. 12B). The inhibitory conductances are assumed to activate with slower kinetics than the excitatory conductances, however, such that the excitatory input has time to begin propagating toward the cell body before the colocalized shunting inhibition is sufficiently activated to exert its "veto" effect. Thus, if the photoreceptor-induced stimulus sweeps along the branch toward the cell body, the slowly activating inhibitory conductances consistently exert their influence distal to the snowballing excitatory wave bound for the cell body, and are therefore ineffective at blocking it. If the photoreceptor stimulus sweeps along the branch away from the cell body, then the inhibitory conductances consistently exert their influence on the direct path to the cell body for all subsequently activated, more distal excitatory inputs. This elemental nonlinear synaptic interaction was shown to produce strongly direction-selective responses in a modeled retinal ganglion cell (Koch et al. 1986; Fig. 12D). This modeling study depended critically, however, on a neuroanatomical assumption that has remained unsubstantiated: namely, that inputs to retinal ganglion cell dendrites are precisely "wired" such that each inhibitory synapse is positioned to veto subsequently activated excitatory synapses when the stimulus sweep is in the "null" direction, but not when the stimulus sweep is in the preferred direction. An example of such an asymmetric wiring diagram is illustrated in Figure 13A. An interesting alternative source of the necessary asymmetry for DS responses, which does not depend on precise asymmetric deployment of excitatory and inhibitory synapses onto the dendrites of retinal ganglion cells, is discussed by Borg-Graham and Grzywacz (1992; see also Vaney et al. 1989). The authors point out that even in the case of entirely symmetric, topographically mapped mixed excitatory and inhibitory input onto a circularly symmetric dendritic tree, the tip of each dendritic branch is direction selective when considered as an output, i.e., responds more strongly to photoreceptor sweeps toward the tip (Fig. 13B). Evidence that amacrine cells in the rabbit retina preferentially make contacts onto retinal ganglion cells via their branch tips has led these authors to the
Figure 13: (A) One possible asymmetrical system of input connections giving rise to direction selectivity as measured at the cell body (adapted from Koch et al. 1982). (B) Tip of long dendritic branch is always direction selective (adapted from Borg-Graham and Grzywacz 1992).

conjecture that the DS of retinal ganglion cells is due to DS inputs from amacrine cell branch tips rather than to internal processing in the retinal ganglion cells themselves. Consistent with this hypothesis, Borg-Graham and Grzywacz (1990) demonstrated that retinal ganglion cells remained direction selective even when their inhibitory inputs were blocked. For an elegant demonstration of the probably passive dendritic basis of direction selectivity in an invertebrate, the blowfly, see Haag et al. (1992). Another issue relevant to dendritic spatiotemporal integration is the question of whether synchronously activated synapses scattered about a dendritic tree are more effective at driving a cell than the same number of inputs activated asynchronously, as has been commonly postulated (e.g., Abeles 1982). This question was investigated in Bernander et al. (1994) for passive dendrites (though NMDA synapses were considered in one condition), where it was shown that when the number of active synapses is less than the number of fully synchronized inputs needed to fire a single action potential, then synchronous inputs are al-
ways more effective than asynchronous ones. When the number of activated synapses exceeds this threshold, then the saturating nonlinearity associated with excitatory synaptic action tips the balance gradually in favor of desynchronized inputs. With regard to precise timing of synaptic inputs, Softky and Koch (1992) and Softky (1993) have discussed the plausibility and consequences of submillisecond coincidence of synaptic inputs to dendritic trees that contain very large, very brief synaptic conductances and/or the potential for fast spike generation.

4.2 Dendritic Subunits. A second important idea relevant to dendritic information processing is that of "dendritic subunits," i.e., the idea that pseudo-independent computations can be carried out simultaneously in different dendritic subregions. An early discussion of dendritic subunits can be found in Llinás and Nicholson (1971), where it was proposed that synaptic integration and consequent local spiking activity could occur pseudo-independently in different branches of the Purkinje cell dendritic tree. Koch et al. (1982) first formally defined a dendritic subunit as a region within which the voltage attenuation is small between any pair of synapses i and j in the subunit, but for which the voltage attenuation is large between any subunit synapse and the soma s. More precisely, a subunit consisted of any group of synapses such that $A_{is}/A_{ij} > c$ (with $c > 1$) for all i and j in the subunit. (The original definition was expressed in terms of transfer resistances instead of attenuations.) For a particular choice of membrane parameters ($R_m = 2500\ \Omega\,\mathrm{cm}^2$, $R_i = 70\ \Omega\,\mathrm{cm}$, $C_m = 2\ \mu\mathrm{F}/\mathrm{cm}^2$) and the value c = 4, and under the assumption that subunits should not overlap, a set of subunits was determined as shown in Figure 14A for an α-type retinal ganglion cell. The implication of this result is that an α-type retinal ganglion cell does indeed have a considerable capacity for independent processing within its dendritic tree. A β-type ganglion cell with a much smaller dendritic tree (~100 μm) and relatively thick branches had very small subunits. The subunit boundaries of Figure 14A and B should not, however, be taken too literally, as they depend on c, which was arbitrarily chosen, on the cell morphology, on the algorithm used to grow subunits beginning at the dendritic tips, and on membrane parameters; they disappear almost completely when $R_m > 8000\ \Omega\,\mathrm{cm}^2$. It is also important to emphasize that the definition of dendritic subunits from Koch et al. (1982) stresses the electrotonic independence of subunits from the cell body, but not the electrotonic independence of subunits from each other. Woolf et al. (1991) carried out a somewhat different analysis of subunit structure in granule cells of the olfactory bulb. In this case a subunit was defined to be a neighborhood in the dendrites about an arbitrarily chosen reference spine, for example, consisting of all spines at which the steady-state voltage attenuation was less than 5% relative to the stimulated reference spine (Fig. 14C). When the subunit criterion was changed to include all neighboring spines that were depolarized by more than
Figure 14: (A,B) Subunit structure of two types of retinal ganglion cells based on passive cable properties (reproduction of Figs. 2, p. 240, and 3, p. 241, with permission from Koch et al. 1982). (A) Large subunits in an α retinal ganglion cell. (B) Much smaller subunits are seen in a β-type ganglion cell in the same study. (C,D) Study of subunit structure of a granule cell of the mouse olfactory bulb (reprinted with permission from Woolf et al. 1991). (C) Subunits around 7 reference spines are illustrated, where a subunit is defined as the region around an input spine with no more than 5% steady-state voltage attenuation. (D) Same cell, where the subunit criterion is the region of greater than 10 mV depolarization in response to a 4 nS input at the reference spine.
10 mV in response to a 4 nS transient synaptic conductance input at the reference spine, some subunits remained essentially unchanged, oth-
ers grew dramatically, and still others essentially disappeared (Fig. 14D). When the EPSP rise time was slowed from 0.2 to 1 msec, subunits grew so large as to encompass the entire dendritic tree. In these two subunit studies, the observed sensitivity of subunit structure to biophysical parameter assumptions and to subunit definitions has double-edged significance. Though the concept of dendritic subunits is heuristically useful, and has guided important questions as to the passive integrative properties of dendritic trees, the marked sensitivity of subunit size to changes in modeling assumptions makes their explicit graphic enumeration less informative than might be hoped. Any attempt to characterize a dendritic tree in terms of the locations of a fixed number of discrete subunits is thus necessarily misleading. On the other hand, the sensitivity of subunit size to parameters and assumptions in these studies is informative in and of itself, as it makes explicit the notion that the effective electrotonic structure of a dendritic tree depends strongly both on biophysical membrane parameters (see next section) and on the specific type of intradendritic voltage communication under consideration. In the latter case, for example, synaptic interactions that have sharp voltage thresholds may be expected to operate within a radically different virtual subunit structure than those that depend smoothly on voltage (Fig. 14C,D). Given the virtual impossibility of counting discrete subunits or assigning their boundaries in a meaningful way, an alternative statistical approach may be used to quantify the relative electrotonic independence of dendritic synapses from each other. In a study of the input-output behavior of neocortical pyramidal cell dendrites, a histogram was generated to quantify the steady-state voltage and current attenuation between randomly chosen pairs of synapses in the passive dendritic tree (Fig. 15; from Mel 1992c). The histogram shows that the average steady-state attenuation factor for voltage or current between randomly chosen synapse pairs is nearly 70; for about half the pairs of synapses in the dendritic tree, the attenuation factor is greater than 25. The histogram representation of a dendritic tree is weak in that it quantifies the electrotonic independence of each dendritic locus from every other in only a probabilistic sense, but it is immune to the dramatic parameter sensitivity characteristic of efforts to define and discretely label dendritic subregions.
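The statistical characterization just described can be sketched as follows for a toy passive tree solved at steady state; the topology, conductances, and sampling scheme are arbitrary stand-ins for the detailed morphology used in the published study. For DC inputs the compartmental equations reduce to a linear system G v = i, whose inverse gives the transfer resistances from which pairwise attenuations follow.

import numpy as np

rng = np.random.default_rng(0)
N = 200
# random tree: each node attaches to a recent ancestor (assumed topology)
parent = [None] + [rng.integers(max(0, i-10), i) for i in range(1, N)]

g_L, g_ax = 0.5e-9, 20e-9              # leak and axial conductances (S)
G = np.zeros((N, N))                   # node-conductance matrix: G v = i
for i in range(N):
    G[i, i] += g_L
    if parent[i] is not None:
        j = parent[i]
        G[i, i] += g_ax; G[j, j] += g_ax
        G[i, j] -= g_ax; G[j, i] -= g_ax

R = np.linalg.inv(G)                   # transfer-resistance matrix
ratios = []
for _ in range(2000):
    i, j = rng.integers(0, N, 2)
    if i != j:
        # attenuation from i to j for steady current at i: V_i / V_j
        ratios.append(R[i, i] / R[j, i])

hist, edges = np.histogram(ratios, bins=20)
print("mean attenuation factor over random pairs: %.1f" % np.mean(ratios))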
4.3 Modulation of Passive Membrane Properties. The fact that dendritic subunit structure is sensitive to biophysical membrane parameters leads directly to the suggestion that intradendritic information processing could be modulated by any outside influence acting on passive membrane properties. A third idea relevant to dendritic integration thus entails that outside modulating influences can act to alter the cable properties of part or all of a dendritic tree, thereby changing its integrative behavior in response to patterns of synaptic input.
Figure 15: Voltage or current attenuation histogram for DC current inputs to an electrically passive pyramidal cell dendritic tree, with $R_m = 10{,}000\ \Omega\,\mathrm{cm}^2$, $R_i = 200\ \Omega\,\mathrm{cm}$, $C_m = 1\ \mu\mathrm{F}/\mathrm{cm}^2$. Pairs of input and recording locations were chosen at random, uniformly in dendritic length. Average steady-state attenuation factor was 67.7. Pyramidal cell morphology courtesy of Rodney Douglas and Kevan Martin.
Holmes and Woody (1989) first demonstrated that different spatially nonuniform patterns of background synaptic activity impinging onto a modeled cortical pyramidal dendritic tree give rise to different nonuniform resting membrane potential distributions, various distortions in the effective length constants of dendritic branches, location-dependent changes in the ability of distal synapses to influence the cell body, and pronounced changes in membrane time constants (see Fig. 16). Essentially, the spontaneous low-frequency openings of 10,000-20,000 synapses in the dendritic tree of the modeled cell yielded, in the aggregate, a significant change in effective membrane resistivity which in turn induced the observed changes in electrotonic structure and time constants of the cell. Though these authors considered only the passive membrane case, the straightforward inference could be made from this work that induced inhomogeneities in the dendritic voltage environment under variable patterns of background synaptic activity could lead to variable operating regimes for any voltage-dependent membrane mechanisms residing in the dendritic tree. Bernander et al. (1991) further explored the effects of background synaptic activity on the passive cable structure of a layer 5 neocortical pyramidal cell. A fixed spatial distribution of 4000 excitatory and 1000
Figure 16: Steady-state response of a passive neuron for different synaptic activity distributions. The resting potential (upper number) and electrotonic distance (lower number) to selected dendritic locations in the cortical pyramidal cell were computed for each of the three conditions (A: baseline; B, C: apical excitation). Reprinted with permission from Holmes and Woody (1989).
inhibitory synapses was modeled, while the frequency of background activity was varied from 0 to 7 Hz. Over this range, the time constant and input resistance of the cell measured at the cell body were reduced by a factor of 10, while the electrotonic length of the cell grew by a factor of 3 (Fig. 17). The authors further demonstrated that the reduction in membrane time constant associated with more vigorous background activity could lead to an increased selectivity for synchronous vs. asynchronous activation of other synaptic inputs (see also Bernander et al. 1993; Rapp et al. 1992). This study thus demonstrates that the activity of the intrinsic cortical network is likely to exert a powerful influence on the integrative behavior of individual neurons. In a different vein, Laurent and Burrows (1989) and Laurent (1990) have proposed that nonspiking interneurons in the metathoracic ganglion of the locust may have independently modulable input-output regions within their dendritic trees. In these invertebrate neurons, input and output synapses intermingle along the same dendritic branches (Fig. 18A), in contrast to most vertebrate neurons, for which dendrites are input structures exclusively (see Shepherd 1990 for rules and exceptions).
Figure 17: (A) Impact of synaptic background firing frequency from 0 to 7 Hz on cell parameters, including input resistance (upper left), time constant (upper right), resting potential (lower left), and electrotonic length (lower right). (B) Electrotonic size of the cell at two levels of synaptic background activity: 0 Hz (upper cell), representative of conditions in a slice, and 2 Hz (lower cell), representative of low-level background activity. Reprinted with permission from Bernander et al. (1991).
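The qualitative effect of background activity can be captured in a back-of-the-envelope single-compartment sketch (all values are hypothetical, not those of the published model): tonic synaptic bombardment adds a mean conductance in parallel with the leak, which divides down the input resistance and time constant and stretches the cell electrotonically.

import numpy as np

C_m, g_leak = 250e-12, 10e-9           # whole-cell capacitance, leak (assumed)
N_syn, g_peak, t_syn = 5000, 0.5e-9, 1.5e-3   # synapse count, peak g, duration

for f in [0.0, 1.0, 2.0, 5.0, 7.0]:    # background rate (Hz), as in the study
    g_syn = N_syn * f * g_peak * t_syn # time-averaged synaptic conductance
    g_tot = g_leak + g_syn
    tau_eff = C_m / g_tot              # effective membrane time constant
    R_in = 1.0 / g_tot                 # somatic input resistance
    L_rel = np.sqrt(g_tot / g_leak)    # electrotonic length grows ~ sqrt(g)
    print("f=%3.1f Hz: tau=%5.1f ms, R_in=%6.1f Mohm, L/L0=%.2f"
          % (f, 1e3*tau_eff, 1e-6*R_in, L_rel))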
A biophysical mechanism was proposed whereby a system of intersegmental control axons could modulate a sensorimotor reflex arc. In the reflex circuit of Figure 18B, afferent inputs from mechanosensory receptors on the legs are spatially intermingled with outputs to motor neurons in the dendrites of a nonspiking interneuron. Intersegmental control inputs make additional synaptic contacts onto these same branches. While their precise function is unknown, these inputs seem capable of locally modulating membrane properties in such a way as to enhance or suppress the afferent-to-motor reflex connection within a restricted region of the dendritic tree. One putative mechanism involves large shunting synaptic conductances activated by intersegmental inputs that simply lower the local input resistance, reducing the size, spread, and effectiveness of afferent EPSPs (Laurent and Burrows 1989). The subunit structure of the dendritic tree has also been shown in modeling studies to permit the input resistance to be pseudoindependently modulated in different dendritic regions (Laurent and Haiyun 1993), consistent with the idea that different reflex arcs may indeed be controllable by different intersegmental inputs.
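A sketch of the arithmetic behind the proposed shunting mechanism, with invented conductance values: in steady state a single compartment sits at the conductance-weighted average of the open reversal potentials, so a large conductance with reversal at rest drags the local EPSP toward zero without moving the resting potential itself.

g_leak, E_rest = 10e-9, -70e-3
g_exc,  E_exc  = 2e-9, 0.0              # afferent excitatory synapse (assumed)
for g_shunt in [0.0, 20e-9, 100e-9]:    # intersegmental shunt, reversal = rest
    # steady-state voltage of the compartment with all conductances open
    g_tot = g_leak + g_exc + g_shunt
    V = (g_leak*E_rest + g_exc*E_exc + g_shunt*E_rest) / g_tot
    print("g_shunt=%5.0f nS: EPSP = %.2f mV" % (1e9*g_shunt, 1e3*(V - E_rest)))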
Figure 18: Possible input-output function of an intersegmental interneuron dendritic tree in the locust. (A) Reconstruction of a portion of a nonspiking local interneuron in the thoracic nervous system of the locust. Twenty-two input (circles) and 64 output (triangles) synapses are spatially intermingled. Reprinted with permission from Watson and Burrows (1988). (B) Model of a nonspiking interneuron that receives input from 3 local afferents (a', b', c') and from 3 intersegmental interneurons (a, b, c); output connections (A, B, C) project to three motor neurons. In this model, the 3 local circuits (a', a, A), (b', b, B), and (c', c, C) can be modulated separately. Reprinted with permission from Laurent and Burrows (1989).
4.4 Nonlinear Processing in Dendrites. One of the questions of greatest interest in the study of neuronal information processing regards the limits of computational power of the single neuron. In this vein, the fourth idea we consider here is that nonlinear membrane mechanisms, if appropriately deployed in a dendritic tree, can allow the single neuron to act as a powerful multilayer computational network. The most common instantiation of this idea has been the proposal that individual neurons may implement a hierarchy of logical operations within their dendritic trees, consisting of AND, OR, AND-NOT, and XOR operations (Lorente de Nó and Condouris 1959; Llinás and Nicholson 1971; Poggio and Torre 1977; Koch et al. 1982, 1983; Shepherd et al. 1985, 1989; Rall and Segev 1987; Shepherd and Brayton 1987; Zador et al. 1992; see Fig. 19).
4.4.1 Synaptic Nonlinearities. One version of this idea was based on the mathematical observation that synaptic conductance changes interact in a nonlinear way on a dendritic branch (Rall 1964; Poggio and Torre 1977);
Figure 19: Representation of a Boolean network within a dendritic tree; reprinted with permission from Shepherd (1990).

the emphasis of this latter study was on the second-order multiplicative interaction between excitatory and shunting inhibitory synapses underlying the so-called AND-NOT operation (Koch et al. 1982, 1983, 1986); see Koch and Poggio (1987, 1992) for discussions of multiplicative and other nonlinear mechanisms in neuronal computation.

4.4.2 Voltage-Dependent Membrane and Logic Operations. A number of other modeling studies have further pursued the neuron-as-logic-network metaphor, primarily by demonstrating that dendrites appropriately configured with voltage-dependent membrane can approximately implement two-input logical operations, such as AND, OR, and XOR. For example, Shepherd and Brayton (1987) showed that simultaneous synaptic input to two neighboring spines with excitable Hodgkin-Huxley spine heads could, once both spines fired action potentials, result in sufficient depolarization in the underlying dendritic branch to fire off two additional nearby spines; a single synaptic input was presumably insufficient (Fig. 20A). They argued that this behavior was AND-like in that the output of the dendritic region, signaled by the activation of the entire cluster of four spines, depended on the simultaneous activation of two inputs. By increasing the synaptic conductance onto each spine
Figure 20: Dendritic implementation of two-input logic functions. (A) A group of 4 neighboring spine heads contained active (Hodgkin-Huxley) membrane. Synaptic conductances were chosen such that a single synaptic input was insufficient to fire any of the spines, whereas simultaneous activation of spines 1 and 2 led to firing of all 4 spine heads (voltage traces for each spine head are numbered). This thresholding behavior was likened to a two-input logical AND (this and part B reprinted with permission from Shepherd and Brayton 1987). (B) When synaptic conductances were doubled, input to spine 1 alone led to firing of all 4 spine heads. This condition was likened to a logical OR.
head, a single presynaptic event could be made to trigger suprathreshold activity in all four spines; this was termed an OR-gate, since only one of many possible inputs was needed to generate an output for the region (Fig. 20B). While the existence of Hodgkin-Huxley membrane in spine heads has not been demonstrated experimentally, results of this kind generalize well when the excitable membrane resides instead on dendritic shafts (Shepherd et al. 1989), or when altogether other excitatory voltage-dependent mechanisms are assumed (Mel 1992b, 1993b). In analogy to a logical XOR function, Zador et al. (1992) demonstrated that a combination of two voltage-dependent membrane mechanisms could produce a nonmonotonic output from a dendritic region in response to monotonically increasing synaptic input (Fig. 21). Thus, low levels of synaptic input to dendritic sites A and B (corresponding to a logical 0,0) were insufficient to cause a somatic spike. At intermediate levels of net synaptic input delivered to the two sites (corresponding to logical 1,0 or 0,1), action potentials could be elicited. However, at higher levels of combined synaptic input to A and B (corresponding to logical 1,1), membrane depolarization in the dendritic tree was sufficient to activate voltage-dependent calcium channels, leading to an influx of calcium ions, followed by activation of calcium-dependent potassium
Figure 21: (A) In a reconstructed hippocampal pyramidal cell, a "cold spot" was placed in the dendritic membrane, consisting of a high density of Ca2+-dependent K+ channels and a low density of voltage-dependent Ca2+ channels. Synapses were activated on dendritic branches A and B. Nonmonotonic responses at the cell body with increasing input at branches A and B are shown in plot B (bold dot for single spike); the nonmonotonicity was likened to that of a logical XOR. Reprinted with permission from Zador et al. (1992).

channels that gave rise to a strong inhibitory (outward) current. In this case, action potentials at the cell body were suppressed (Fig. 21B). In an attempt to explore the complexity of interactions among many synaptic inputs in a dendritic tree, Rall and Segev (1987) tested the input-output behavior of dendritic branches containing passive and excitable dendritic spines (Fig. 22). Several synaptic input conditions are shown at left, where black spines are excitable and white spines are passive. The corresponding output condition shows all those spines that fired action potentials as a result of the given synaptic input. The resulting somatic depolarization is also given for each case. Based on the complexity of interactions seen in these demonstrations, the authors concluded that dendritic trees with excitable spine clusters (and presumably other varieties of excitable membrane nonlinearities) afford rich possibilities for pseudoindependent logic-like computations (Rall and Segev 1987).
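The logic-gate behaviors reviewed above can be caricatured by a toy threshold unit (not any of the published biophysical models): a lower threshold for local spike initiation yields AND- or OR-like behavior depending on unitary EPSP size, and an upper threshold, standing in for the Ca2+-activated K+ "cold spot," converts strong combined input into suppression, giving the XOR-like nonmonotonicity. All thresholds and EPSP sizes below are invented.

def region_output(n_inputs, epsp=12.0, theta_spike=10.0, theta_ca_k=20.0):
    """Return 1 if the dendritic region fires, 0 otherwise (units: mV)."""
    depol = n_inputs * epsp
    if depol < theta_spike:        # too little drive: no output (0,0)
        return 0
    if depol >= theta_ca_k:        # strong drive recruits Ca-dependent K+
        return 0                   # outward current suppresses output (1,1)
    return 1                       # intermediate drive fires: (1,0) or (0,1)

for a in (0, 1):
    for b in (0, 1):
        print("inputs (%d,%d) -> output %d" % (a, b, region_output(a + b)))

# With epsp=6.0 and theta_ca_k set very high, the same unit computes AND;
# with epsp=12.0 and theta_ca_k very high, it computes OR.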
4.4.3 Problems with the Logic-Network Metaphor. The dendritic-tree-as-logic-network metaphor has been a useful concept that has motivated a variety of analyses of the nonlinear computational properties of single neurons. However, two issues may be raised that suggest a potential misfit between logical computation and dendritic computation. First, logical computation is inherently "ill behaved": changes in any and every input variable must in general be capable of altering or not altering
Figure 22: Summary of 5 different cases with active spine heads on distal dendritic branches; reprinted with permission from Rall and Segev (1987). (A) Location of subtree relative to soma of a passive cell; each branch had 50 spines, 45 passive (open), 5 active (filled). (B) Left column shows spines receiving synchronous excitatory (open synapses) or inhibitory (filled) synaptic input. Center column shows active spine heads that fired action potentials. Right column shows peak of the passively propagated wave as measured at the cell body.

the output of the logic function, depending on the values of all other inputs. Second, inputs to a logic function are in general unrelated to each other in terms of their effects on the output. Both of these aspects of logical computation are at odds with the fact that the electrotonic structure of a dendritic tree gives rise to extensive voltage sharing and smooth neighborhood relations among synaptic inputs (see Figs. 9 and 10A). Furthermore, logic functions require precise wiring diagrams, where a single erroneous input connection or malfunctioning computational element can corrupt the input-output behavior of an entire circuit. A crucial accompaniment of any logic-network theory of dendritic processing, therefore, is a mechanism, such as some form of learning, that can drive the appropriate microorganization of individual dendrites, synapses, and membrane channels. Such a process would, for example, need to explain the development of the precise spatial juxtaposition of each afferent synapse both with other specific afferent synapses and with the appropriate type of nonlinear membrane, as suggested by Figure 19. Such a theory must also cope with accumulating evidence that significant dendritic remodeling
occurs in the mammalian brain, even during adulthood (Greenough and Bailey 1988), implying that a dendritic logic circuit must continuously adapt to "on line" changes in its basic computing architecture. A literal interpretation of the dendritic-tree-as-logic-network metaphor would be significantly bolstered by a demonstration in which a biologically relevant nontrivial (e.g., multiple input) logic function is mapped onto a realistically modeled dendritic tree, such that the output of the cell follows the specified truth table. The case would be particularly strong if accompanied, as suggested above, by a biologically plausible account for the establishment of the dendritic logic circuit. No such demonstration has yet appeared in the published literature.
4.4.4 Low-Order Polynomial Functions. An alternative metaphor for nonlinear dendritic computation, which may be viewed as a smooth, analog version of the "logical dendrites" hypothesis, is the idea that a dendritic tree acts as an approximate low-order polynomial function with many terms: in short, a big sum of little products. A number of authors have fielded conjectures along these lines (Poggio and Torre 1977; Feldman and Ballard 1982; Durbin and Rumelhart 1990; Mel 1990; Mel and Koch 1990; Poggio and Girosi 1990), where the requisite multiplicative nonlinearity has typically been assumed to derive from some nonlinear membrane mechanism of an excitatory nature. The Hodgkin-Huxley thresholding mechanism previously used in demonstrations of AND-like synaptic interactions is one example (Shepherd and Brayton 1987; Shepherd et al. 1989); NMDA channels and a variety of other mechanisms have also been considered good candidates (see Koch and Poggio 1992; Wilson 1992; Mel 1992a,b, 1993a,b). As a nonlinear approximator, the low-order polynomial representation is highly constrained in that (1) only two levels of computation are involved (a sum of products), (2) only the simplest nonlinear interaction is allowed (multiplication), and (3) the number of terms in each product is small (e.g., 2 or 3). An example of a biologically relevant computation of this order of complexity is a correlation between two high-dimensional input patterns. An abstract model neuron called a "clusteron" has recently been introduced that maps low-order polynomial functionality onto a dendritic tree in a way that is directly testable within a detailed biophysical model (Mel 1992a, 1993a,b). The clusteron consists of a "cell body" where the global output of the unit is computed, and a dendritic tree, which for present purposes is visualized as a single long branch attached to the cell body (Fig. 23). The dendritic tree receives a set of N excitatory weighted synaptic contacts from a set of afferent "axons." The output of the clusteron is given by
$$y = g\left(\sum_{i=1}^{N} w_i a_i\right) \qquad (4.1)$$
Figure 23: The clusteron is a limited second-order generalization of a thresholded linear unit, in which the excitatory effect of each synaptic input depends on the activity of other synapses in the neighborhood (active inputs designated by arrows). A cluster-sensitive Hebb-type learning rule says that synapses that are frequently coactivated with their neighbors should be stabilized; those that tend to act alone are destabilized and allowed to reestablish connections at new dendritic loci.

where $a_i$ is the net excitatory activity due to synapse i, $w_i$ is its weight, and g is an optional thresholding nonlinearity. Unlike the thresholded linear unit, in which the net input due to the ith synapse is $w_i x_i$, the net input at the ith clusteron synapse is its weight $w_i$ times its activity $a_i$, where
$$a_i = x_i \sum_{j \in \mathcal{N}_i} x_j \qquad (4.2)$$

$x_i$ is the direct input stimulus intensity at synapse i, and $\mathcal{N}_i = \{i-r, \ldots, i, \ldots, i+r\}$ represents the neighborhood of radius r around synapse i. The clusteron as defined includes only pairwise interactions among synapses in the same neighborhood, denoted by the $x_i x_j$ terms in equation 4.2. Thus, the underlying "biophysical" assumption implicit in clusteron "physiology" is that the output of a dendritic neighborhood grows quadratically, i.e., expansively, with increasing input.
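The clusteron equations reconstructed above (4.1, 4.2) can be implemented directly; in the sketch below the weights, neighborhood radius, and input patterns are arbitrary choices for illustration. Delivering the same number of active inputs diffusely vs. contiguously exhibits the cluster preference.

import numpy as np

rng = np.random.default_rng(1)
N, r, k = 100, 2, 10                   # synapses, neighborhood radius, active

def clusteron_response(active, w=None):
    x = np.zeros(N)
    x[active] = 1.0
    w = np.ones(N) if w is None else w
    # a_i = x_i * sum of x_j over the neighborhood i-r..i+r (includes i)
    a = np.array([x[i] * x[max(0, i-r):i+r+1].sum() for i in range(N)])
    return (w * a).sum()               # identity output nonlinearity g

diffuse   = rng.choice(N, size=k, replace=False)   # scattered active inputs
clustered = np.arange(40, 40 + k)                  # contiguous active inputs
print("diffuse:   %.1f" % clusteron_response(diffuse))
print("clustered: %.1f" % clusteron_response(clustered))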
Figure 24: Plots of average cell response vs. cluster size under three biophysical conditions (from Mel 1993a). 100 synapses were activated at 100 Hz for 100 msec and the resulting somatic spikes were counted. (A) Response of a cell with passive dendrites and AMPA synapses falls off monotonically due to the classical synaptic (saturation) nonlinearity as cluster size is increased from 1 to 15. (B) Combination of high-NMDA, slow-spiking membrane distributed distally, and fast-spiking membrane distributed at branch points gave strong cluster sensitivity. (C) In two cases when no NMDA was present and dendritic spiking conductances were sparse (at branch points only), cluster sensitivity was abolished. Reprinted with permission from Mel (1993a).
We note that as a fixed number of synaptic inputs are delivered to a clusteron, first in a diffuse spatial pattern, and then in progressively more clustered spatial patterns, the response of the clusteron steadily increases. Experiments of exactly this kind were carried out in a detailed biophysical model of a layer 5 neocortical pyramidal cell containing various complements of excitatory voltage-dependent mechanisms in its dendrites (Mel 1992a, 1993a; Fig. 24). Under a variety of conditions in which either NMDA synapses, slow calcium spikes, fast sodium spikes, or combinations of the three were placed in the dendrites in differing spatial distributions, an initial positive-slope regime in response to increasingly clustered synaptic input was indeed observed (Fig. 24); results for a passive dendritic tree as a control condition are shown in A. Since the biophysically modeled cell was subject to saturation effects, unlike its abstract clusteron counterpart, responses to highly clustered synaptic input patterns were gradually diminished. The positive-slope "cluster-sensitivity" regime was observed to be present as long as the dendrites contained a sufficiently powerful and widely distributed complement of expansive nonlinear membrane mechanisms. No significant dependence on the kinetics, voltage dependence, or localization of the voltage-dependent membrane channels was observed in these steady-state stimulus-response experiments. Figure 24C shows two conditions in which a sparse, patchy distribution of either sodium or calcium spikes was insufficient to yield a cluster-sensitive regime.

4.4.5 Functional Significance of Dendritic Cluster Sensitivity. The functional significance of dendritic cluster sensitivity was demonstrated in Mel (1992a), where it was shown that the nonlinear input-output behavior of an NMDA-rich dendritic tree could provide a capacity for nonlinear pattern discrimination (Fig. 25A). In a subsequent study, it was estimated that a 5 x 5 mm slab of neocortex containing cluster-sensitive cells has the capacity to represent on the order of 100,000 sparse input-output pattern associations with high accuracy (Mel 1993a). Another recent study
Figure 25: (A) Discriminating familiar from unfamiliar patterns within a cluster-sensitive dendritic tree. Visual input patterns drive a layer of visual feature-selective cells whose axons terminate on the cluster-sensitive dendrites. Patterns that drive synapses in clusters elicit relatively strong cell responses. Reprinted with permission from Mel (1992a). The spatial ordering of afferent synaptic connections onto the dendritic tree may thus be of crucial importance for information storage. (B) Implementation of a "tuned excitatory" disparity-selective binocular cell in a schematized visual cortex. Axons from corresponding left and right monocular units, which are strongly correlated during normal vision, make synaptic connections onto neighboring patches of the binocular dendritic tree. Zero-disparity binocular images thus tend to activate synapses in pairs, a relatively effective (i.e., clustered) stimulus condition. Nonmatching inputs to the left and right receptive fields activate synapses diffusely, and are thus relatively ineffective stimuli. Reprinted with permission from Mel (1993b).
has shown that a cluster-sensitive neuron can implement an approximate correlation operation entirely internal to its dendritic tree, with possible relevance to the establishment of nonlinear disparity tuning in binocular visual cells (Fig. 25B; see Ohzawa et al. 1990). Interestingly, a sum-of-products computation has been proposed in various forms as a
crucial nonlinear operation in other types of visual cell responses, including responses to illusory contours (Peterhans and von der Heydt 1989), responses to periodic gratings (von der Heydt et al. 1991), and velocity-tuned cell responses (Nowlan and Sejnowski 1993). The underlying idea in both examples of Figure 25 is that groups of frequently coactivated afferent axons represent prominent higher-order "features" in an input stream. These features may be encoded as groups of neighboring synaptic contacts onto a cluster-sensitive dendritic tree (see Brown et al. 1991 for discussion of related ideas). In the context of pattern memory, a "prominent" higher-order feature is any group of frequently coactivated input lines corresponding to a frequently observed conjunction of elemental sensory features in the input stream. If the prominent higher-order features accumulated from a set of training patterns are dendritically encoded in this way, then training patterns, which contain relatively many of these higher-order features, will (1) activate relatively many clustered pairs of synapses in the dendritic tree, and, hence, (2) produce stronger cell responses than unfamiliar control patterns. The memory capacity of a cluster-sensitive dendritic tree is studied empirically in Mel (1993a). In the case of stereopsis, a prominent higher-order feature is any pair of afferents from corresponding locations in the left- and right-eye receptive fields, which have a high probability of being coactivated during normal visual behavior. As illustrated in Figure 25B, if these features are dendritically encoded, a zero-disparity visual stimulus contains relatively many higher-order "correspondence" features, activates many clustered pairs of synapses in the dendritic tree, and produces a stronger cell response than a noncorresponding stereo stimulus. Such a preference for stimuli at fixed disparity across the receptive field is a characteristic of many binocularly drivable complex cells in the primate visual system (Ohzawa et al. 1990). Both of the scenarios of Figure 25 entail that the ordering of synaptic connections onto a dendritic tree be manipulated by some form of learning process, such that frequently coactivated input lines tend ultimately to form neighboring synaptic connections. An abstract learning rule with these properties is discussed in detail elsewhere (Mel 1992a, 1993a). Biophysically detailed modeling studies of Hebbian learning mechanisms in dendritic trees have also been carried out (Holmes and Levy 1990; Brown et al. 1991a; Pearlmutter 1994).
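For concreteness, the following is a loose sketch, not the published rule, of the cluster-sensitive rearrangement process described in the Figure 23 caption: synapses scoring poorly on coactivation with their dendritic neighbors are periodically withdrawn and allowed to reconnect elsewhere. All details here (scoring, batch size, swap policy) are assumptions.

import numpy as np

rng = np.random.default_rng(2)
N, r, epochs = 64, 2, 200
n_pat = 20
patterns = rng.random((n_pat, N)) < 0.15   # fixed set of binary input lines

loci = rng.permutation(N)                  # loci[i] = input line at position i
for _ in range(epochs):
    x = patterns[:, loci]                  # activity laid out along dendrite
    # each position's mean coactivation with its radius-r neighborhood
    nbr = np.array([x[:, max(0, i-r):i+r+1].sum(1) for i in range(N)]).T
    score = (x * nbr).mean(0)
    worst = np.argsort(score)[:4]          # least-clustered positions...
    for i in worst:                        # ...swap to random new loci
        j = rng.integers(N)
        loci[[i, j]] = loci[[j, i]]
# After training, frequently coactivated input lines tend to occupy
# neighboring dendritic positions, the precondition for strong clusteron
# responses to familiar patterns.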
5 Conclusions

We are at present in an awkward period, in which many of the new ideas relating to dendritic function are the products of modeling studies, but limited experimental access to dendritic trees means that modeling work proceeds mostly without direct experimental support. Recent
progress in the development of optical recording methods portends an unprecedented period in which modeling techniques and experimental hypothesis testing can proceed in concert. The secrets of dendritic information processing may then be fully told.

Acknowledgments

Thanks to Idan Segev and Tony Zador for many helpful comments on the manuscript. This review would not have been possible without the generous working environment provided by Christof Koch at Caltech. This work was supported by the McDonnell-Pew Foundation and the Office of Naval Research.

References

Abeles, M. 1982. Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci. 18, 83-92.
Adrian, E. D. 1914. The "all-or-none" principle in nerve. J. Physiol. 47, 460.
Adrian, E. D. 1937. The spread of activity in the cerebral cortex. J. Physiol. London 88, 127-161.
Agmon-Snir, H., and Segev, I. 1993. Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol. 70, 2066-2085.
Amitai, Y., Friedman, A., Connors, B. W., and Gutnick, M. J. 1993. Regenerative electrical activity in apical dendrites of pyramidal cells in neocortex. Cerebral Cortex 3, 26-38.
Andersen, P., and Lomo, T. 1966. Mode of activation of hippocampal pyramidal cells by excitatory synapses on dendrites. Exp. Brain Res. 2, 247-260.
Arikuni, T., and Ochs, S. 1973. Slow depolarizing potentials and spike generation in pyramidal tract cells. J. Neurophysiol. 36, 1-12.
Atick, J. J., and Redlich, A. N. 1990. Toward a theory of early visual processing. Neural Comp. 2, 308-320.
Baer, S. M., and Rinzel, J. 1991. Propagation of dendritic spikes mediated by excitable spines: A continuum theory. J. Neurophysiol. 65(4), 874-890.
Bell, A. J. 1992. Self-organisation in real neurons: Anti-Hebb in 'channel space'. In Advances in Neural Information Processing Systems, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., Vol. 4, pp. 59-66. Morgan Kaufmann, San Mateo, CA.
Benardo, L. S., Masukawa, L. M., and Prince, D. A. 1982. Electrophysiology of isolated hippocampal pyramidal dendrites. J. Neurosci. 2, 1614-1622.
Bernander, O., Douglas, R., Martin, K., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Bernander, O., Koch, C., and Usher, M. 1994. The effects of synchronized inputs at the single neuron level. Neural Comp. 6, 622-641.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., and Warland, D. 1991. Reading a neural code. Science 252, 1854-1857.
Bishop, G. H. 1958. The dendrite: receptive pole of the neurone. Clin. Neurophysiol. Suppl. 10, 12-21.
Borg-Graham, L. J. 1987. Modelling the somatic electrical response of hippocampal pyramidal neurons. Master's thesis, MIT.
Borg-Graham, L. J. 1991. Modelling the non-linear conductances of excitable membranes. In Cellular and Molecular Biology: A Practical Approach, H. Wheal and J. Chad, eds., pp. 247-275. Oxford University/IRL Press, Oxford.
Borg-Graham, L., and Grzywacz, N. M. 1990. An isolated turtle retina preparation allowing direct approach to ganglion cells and photoreceptors, and transmitted-light microscopy. Invest. Ophthalmol. Visual Sci. 31, 1039.
Borg-Graham, L. J., and Grzywacz, N. 1992. A model of the direction selectivity circuit in retina: Transformations by neurons singly and in concert. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, Cambridge, MA.
Brazier, M. A. B. 1959. The historical development of neurophysiology. In Handbook of Physiology, Sec. I: Neurophysiology, J. Field, H. W. Magoun, and V. E. Hall, eds., pp. 1-58. American Physiological Society, Washington, DC.
Brown, T. H., Chang, V. C., Ganong, A. H., Keenan, C. L., and Kelso, S. R. 1988. Biophysical properties of dendrites and spines that may control the induction and expression of long-term synaptic potentiation. In Long-Term Potentiation: From Biophysics to Behavior, pp. 201-264. Alan R. Liss, New York.
Brown, T. H., Mainen, Z. F., Zador, A. M., and Claiborne, B. J. 1991a. Self-organization of Hebbian synapses in hippocampal neurons. In Advances in Neural Information Processing Systems, R. Lippmann, J. Moody, and D. Touretzky, eds., Vol. 3, pp. 39-45. Morgan Kaufmann, Palo Alto, CA.
Brown, T. H., Zador, A. M., Mainen, Z. F., and Claiborne, B. J. 1991b. Hebbian modifications in hippocampal neurons. In Long-Term Potentiation: A Debate of Current Issues, J. Davis and M. Baudry, eds., pp. 357-389. MIT Press, Cambridge, MA.
Brown, T. H., Zador, A. M., Mainen, Z. F., and Claiborne, B. J. 1992. Hebbian computations in hippocampal dendrites and spines. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 88-116. Academic Press, Boston, MA.
Butz, E. G., and Cowan, J. D. 1974. Transient potentials in dendritic systems of arbitrary geometry. Biophys. J. 14, 661-689.
Cauller, L. J., and Connors, B. W. 1992. Functions of very distal dendrites: Experimental and computational studies of layer I synapses on neocortical pyramidal cells. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 199-229. Academic Press, Boston, MA.
Chang, H. T. 1952. Cortical neurons and spinal neurons. Cortical neurons with particular reference to apical dendrites. Cold Spring Harbor Symp. Quant. Biol. 17, 189-202.
Claiborne, B. J., Zador, A. M., Mainen, Z. F., and Brown, T. H. 1992. Computational models of hippocampal neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 61-80. Academic Press, Boston, MA.
Cotman, C. W., Monaghan, D. T., Ottersen, O. P., and Storm-Mathisen, J. 1987.
Anatomical organization of excitatory amino acid receptors and their pathways. TINS 10(7), 273-280.
Cragg, B. G., and Hamlyn, L. H. 1955. Action potentials of the pyramidal neurones in the hippocampus of the rabbit. J. Physiol. London 130, 326-373.
Cullheim, S., Fleshman, J. W., and Burke, R. E. 1987. Three-dimensional architecture of dendritic trees in type-identified alpha motoneurones. J. Comp. Neurol. 255, 82-96.
De Schutter, E. 1992. A consumer guide to neuronal modeling software. TINS 15(11), 462-464.
Deschenes, M. 1981. Dendritic spikes induced in fast pyramidal tract neurons by thalamic stimulation. Exp. Brain Res. 43, 304-308.
Durbin, R., and Rumelhart, D. E. 1990. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Comp. 1, 133-142.
Eccles, J. C. 1957. The Physiology of Nerve Cells. The Johns Hopkins Press, Baltimore.
Eccles, J. C., Libet, B., and Young, R. R. 1958. The behavior of chromatolysed motorneurones studied by intracellular recording. J. Physiol. 143, 11-40.
Esterle, T. M., and Sanders-Bush, E. 1991. From neurotransmitter to gene: Identifying the missing links. Trends Pharmacol. Sci. 12(10), 375-379.
Evans, J. D., Kember, G. C., and Major, G. 1992. Techniques for obtaining analytical solutions to the multicylinder somatic shunt cable model for passive neurons. Biophys. J. 63, 350-365.
Feldman, J. A., and Ballard, D. H. 1982. Connectionist models and their properties. Cog. Sci. 6, 205-254.
Fox, C. A., and Barnard, J. W. 1957. A quantitative study of the Purkinje cell dendritic branchlets and their relationship to afferent fibers. J. Anat. (London) 91, 299-313.
Fox, K., Sato, H., and Daw, N. 1990. The effect of varying stimulus intensity on NMDA-receptor activity in cat visual cortex. J. Neurophysiol. 64, 1413-1428.
Fujita, Y. 1968. Morphological and physiological properties of neurons and glial cells in tissue culture. J. Neurophysiol. 31, 131-141.
Golgi, C. 1886. Sulla Fina Anatomia degli Organi Centrali del Sistema Nervoso. Hoepli, Milan.
Gotch, F. 1902. The sub-maximal electrical response of nerve to a single stimulus. J. Physiol. 28, 395.
Greenough, W. T., and Bailey, C. H. 1988. The anatomy of a memory: Convergence of results across a diversity of tests. TINS 11, 142-147.
Greer, C. A. 1987. Golgi analyses of dendritic organization among denervated olfactory bulb granule cells. J. Comp. Neurol. 257, 442-452.
Grundfest, H. 1957. Electrical inexcitability of synapses and some consequences in the central nervous system. Physiol. Rev. 37, 337-361.
Haag, J., Egelhaaf, M., and Borst, A. 1992. Dendritic integration of motion information in visual interneurons of the blowfly. Neurosci. Lett. 140, 173-176.
Harris, R. M. 1986. Morphology of physiologically identified thalamocortical relay neurons in the rat ventrobasal thalamus. J. Comp. Neurol. 254, 382-402.
Herreras, O. 1990. Propagating dendritic action potential mediates synaptic transmission in CA1 pyramidal cells in situ. J. Neurophysiol. 64, 1429-1441.
Hild, W., and Tasaki, I. 1962. Morphological and physiological properties of neurons and glial cells in tissue culture. J. Neurophysiol. 25, 277-304.
Hille, B. 1984. Ionic Channels of Excitable Membranes, 1st ed. Sinauer Associates, Sunderland, MA.
Holmes, W. R., and Levy, W. B. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Holmes, W. R., and Woody, C. D. 1989. Effects of uniform and non-uniform synaptic 'activation-distributions' on the cable properties of modeled cortical pyramidal neurons. Brain Res. 505, 12-22.
Horwitz, B. 1981. An analytical method for investigating transient potentials in neurons with branching dendritic trees. Biophys. J. 36, 155-192.
Horwitz, B. 1983. Unequal diameters and their effects on time-varying voltages in branched neurons. Biophys. J. 41, 51-66.
Hounsgaard, J., and Midtgaard, J. 1988. Intrinsic determinants of firing pattern in Purkinje cells of the turtle cerebellum in vitro. J. Physiol. 402, 731-749.
Huguenard, J. R., Hamill, O. P., and Prince, D. A. 1989. Sodium channels in dendrites of rat cortical pyramidal neurons. Proc. Natl. Acad. Sci. U.S.A. 86, 2473-2477.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electric Current Flow in Excitable Cells. Oxford University Press, Oxford.
Jaffe, D. B., Johnston, D., Lasser-Ross, N., Lisman, J. E., Miyakawa, H., and Ross, W. N. 1992. The spread of Na+ spikes determines the pattern of dendritic Ca2+ entry into hippocampal neurons. Nature (London) 357, 244-246.
Jahnsen, H., and Llinás, R. 1984. Ionic basis for the electroresponsiveness and oscillatory properties of guinea-pig thalamic neurons in vitro. J. Physiol. 349, 227-247.
Jones, K. A., and Baughman, R. W. 1988. NMDA- and non-NMDA-receptor components of excitatory synaptic potentials recorded from cells in layer V of rat visual cortex. J. Neurosci. 8, 3522-3534.
Jones, O. T., Kunze, D. L., and Angelides, K. J. 1989. Localization and mobility of ω-conotoxin-sensitive Ca++ channels in hippocampal CA1 neurons. Science 244, 1189-1193.
Kandel, E. R., and Schwartz, J. H. 1985. Principles of Neural Science, 2nd ed. Elsevier Science Publishing, New York.
Keller, B. U., Konnerth, A., and Yaari, Y. 1991. Patch clamp analysis of excitatory synaptic currents in granule cells of rat hippocampus. J. Physiol. London 435, 275-293.
Kelvin, W. T. 1855. On the theory of the electric telegraph. Proc. Roy. Soc. 7, 382-399.
Koch, C. 1984. Cable theory in neurons with active, linearized membranes. Biol. Cybern. 50, 15-33.
Koch, C., and Poggio, T. 1983. A theoretical analysis of electrical properties of spines. Proc. R. Soc. Lond. B 218, 455-477.
Koch, C., and Poggio, T. 1987. Biophysics of computation: Neurons, synapses, and membranes. In Synaptic Function, G. E. Edelman, W. F. Gall, and W. M. Cowan, eds., pp. 637-697. Wiley, New York.
Koch, C., and Poggio, T. 1992. Multiplying with synapses and neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 315-345. Academic Press, Cambridge, MA.
Koch, C., Poggio, T., and Torre, V. 1982. Retinal ganglion cells: A functional interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B 298, 227-264.
Koch, C., Poggio, T., and Torre, V. 1983. Nonlinear interaction in a dendritic tree: Localization, timing and role of information processing. Proc. Natl. Acad. Sci. U.S.A. 80, 2799-2802.
Koch, C., Poggio, T., and Torre, V. 1986. Computations in the vertebrate retina: Gain enhancement, differentiation and motion discrimination. TINS May 1986, 204-211.
Koch, C., Zador, A., and Brown, T. H. 1992. Dendritic spines: Convergence of theory and experiment. Science 256, 973-974.
Kuno, M., and Llinás, R. 1970. Enhancement of synaptic transmission by dendritic potentials in chromatolysed motorneurones of the cat. J. Physiol. 210, 807-821.
Laurent, G. 1990. Voltage-dependent nonlinearities in the membrane of locust nonspiking local interneurones, and their significance for synaptic integration. J. Neurosci. 10, 2268-2280.
Laurent, G., and Burrows, M. 1989. Intersegmental interneurons can control the gain of reflexes in adjacent segments of the locust by their action on nonspiking local interneurons. J. Neurosci. 9, 3030-3039.
Laurent, G., and Haiyun 1993. A modeling study of voltage-dependent integration of synaptic potentials by locust non-spiking local neurons. Western Nerve-Net Conference, Seattle, WA.
LeMasson, G., Marder, E., and Abbott, L. F. 1993. Activity-dependent regulation of conductances in model neurons. Science 259, 1915-1917.
Llinás, R., and Hess, R. 1976. Tetrodotoxin-resistant dendritic spikes in avian Purkinje cells. Proc. Natl. Acad. Sci. U.S.A. 73, 2520-2523.
Llinás, R., and Nicholson, C. 1971. Electrophysiological properties of dendrites and somata in alligator Purkinje cells. J. Neurophysiol. 34, 534-551.
Llinás, R., and Walton, K. D. 1990. Cerebellum. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 214-245. Oxford University Press, Oxford.
Llinás, R., and Yarom, Y. 1981. Properties and distribution of ionic conductances generating electroresponsiveness of mammalian inferior olivary neurones in vitro. J. Physiol. 315, 560-584.
Llinás, R., and Sugimori, M. 1980. Electrophysiological properties of in vitro Purkinje cell dendrites in mammalian cerebellar slices. J. Physiol. (London) 305, 197-213.
Llinás, R., Nicholson, C., Freeman, J. A., and Hillman, D. E. 1968. Dendritic spikes and their inhibition in alligator Purkinje cells. Science 160, 1132-1135.
Lorente de Nó, R. 1934. Studies on the structure of the cerebral cortex. II. Con-
tinuation of the study of the ammonic system. J. Psychol. Neurol. Leipzig 46, 113-177.
Lorente de Nó, R., and Condouris, G. A. 1959. Decremental conduction in peripheral nerve. Integration of stimuli in the neuron. Proc. Natl. Acad. Sci. U.S.A. 45, 592-617.
Lytton, W. W., and Sejnowski, T. J. 1991. Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol. 66, 1059-1079.
Maekawa, K., and Purpura, D. P. 1967. Properties of spontaneous and evoked synaptic activities of thalamic ventrobasal neurons. J. Neurophysiol. 30, 360-381.
Major, G., Evans, J. D., and Jack, J. J. B. 1993. Solutions for transients in arbitrarily branching cables: I. Voltage recording with a somatic shunt. Biophys. J. 65, 423-449.
Mascagni, M. V. 1989. Numerical methods for neuronal modeling. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 439-484. Bradford, Cambridge, MA.
Maslim, J., Webster, M., and Stone, J. 1986. Stages in the structural differentiation of retinal ganglion cells. J. Comp. Neurol. 254, 382-402.
Masukawa, L. M., and Prince, D. A. 1984. Synaptic control of excitability in isolated dendrites of hippocampal neurons. J. Neurosci. 4, 217-227.
Mayer, M. L., and Westbrook, G. L. 1987. The physiology of excitatory amino acids in the vertebrate central nervous system. Prog. Neurobiol. 28, 197-276.
McClurkin, J. W., Optican, L. M., Richmond, B. J., and Gawne, T. J. 1991. Concurrent processing and complexity of temporally encoded neuronal messages in visual perception. Science 253, 675-677.
McCormick, D. A. 1990. Membrane properties and neurotransmitter actions. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 32-66. Oxford University Press, Oxford.
Meek, J., and Nieuwenhuys, R. 1991. Palisade pattern of mormyrid Purkinje cells: A correlated light and electron-microscopic study. J. Comp. Neurol. 306, 156-192.
Mel, B. W. 1990. The sigma-pi column: A model of associative learning in cerebral neocortex. CNS Memo 6, Computation and Neural Systems Program, California Institute of Technology.
Mel, B. W. 1992a. The clusteron: Toward a simple abstraction for a complex neuron. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 35-42. Morgan Kaufmann, San Mateo, CA.
Mel, B. W. 1992b. NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp. 4, 502-516.
Mel, B. W. 1992c. Information processing in an excitable dendritic tree. CNS Memo 17, Computation and Neural Systems Program, California Institute of Technology, pp. 1-69.
Mel, B. W. 1993a. Memory capacity of an excitable dendritic tree. In revision.
Mel, B. W. 1993b. Synaptic integration in an excitable dendritic tree. J. Neurophysiol. 70(3), 1086-1101.
Mel, B. W., and Koch, C. 1990. Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 2, pp. 474-481. Morgan Kaufmann, San Mateo, CA.
Miller, J. P., Rall, W., and Rinzel, J. 1985. Synaptic amplification by active membrane in dendritic spines. Brain Res. 325, 325-330.
Miller, K. D., Chapman, B., and Stryker, M. P. 1989. Visual responses in adult cat visual cortex depend on N-methyl-D-aspartate receptors. Proc. Natl. Acad. Sci. U.S.A. 86, 5183-5187.
Nicoll, R. A., Malenka, R. C., and Kauer, J. A. 1990. Functional comparison of neurotransmitter receptor subtypes in mammalian central nervous system. Physiol. Rev. 70, 513-565.
Nowlan, S. J., and Sejnowski, T. J. 1993. Filter selection model for generating visual motion signals. In Advances in Neural Information Processing Systems, S. Hanson, J. Cowan, and L. Giles, eds., Vol. 5, pp. 369-376. Morgan Kaufmann, San Mateo, CA.
Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. 1990. Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science 249, 1037-1041.
Pearlmutter, B. A. 1993. Time-skew Hebb rule in a nonisopotential neuron. Neural Comp., submitted.
Penny, G. R., Wilson, C. J., and Kitai, S. T. 1988. Relationship of the axonal and dendritic geometry of spiny projection neurons to the compartmental organization of the neostriatum. J. Comp. Neurol. 269, 275-289.
Perkel, D. H., and Perkel, D. J. 1985. Dendritic spines: Role of active membrane in modulating synaptic efficacy. Brain Res. 325, 331-335.
Perkel, D. H., Mulloney, B., and Budelli, R. W. 1981. Quantitative methods for predicting neuronal behavior. Neuroscience 6, 823-837.
Peterhans, E., and von der Heydt, R. 1989. Mechanisms of contour perception in monkey visual cortex. II. Contours bridging gaps. J. Neurosci. 9, 1749-1763.
Pockberger, H. 1991. Electrophysiological and morphological properties of rat motor cortex neurons in vivo. Brain Res. 539, 181-190.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Poggio, T., and Torre, V. 1977. A new approach to synaptic interactions. In Lecture Notes in Biomathematics: Theoretical Approaches to Complex Systems, R. Heim and G. Palm, eds., Vol. 21, pp. 89-115. Springer, Berlin.
Poolos, N. P., and Kocsis, J. D. 1990. Dendritic action potentials activated by NMDA receptor-mediated EPSPs in CA1 hippocampal pyramidal cells. Brain Res. 524, 342-346.
Popov, S., and Poo, M.-M. 1992. Diffusional transport of macromolecules in developing nerve processes. J. Neurosci. 12(1), 77-85.
Purpura, D. P. 1959. Nature of electrocortical potentials and synaptic organizations in cerebral and cerebellar cortex. Intern. Rev. Neurobiol. 1, 47-163.
Purpura, D. P. 1967. Comparative physiology of dendrites. In The Neurosciences: A Study Program, G. C. Quarton, T. Melnechuk, and F. O. Schmitt, eds., pp. 373-392. Rockefeller Univ. Press, New York.
Purpura, D. P., and Shofer, R. J. 1964. Cortical intracellular potentials during augmenting and recruiting responses. I. Effects of injected hyperpolarizing currents on evoked membrane potential changes. J. Neurophysiol. 27, 117-132.
Purpura, D. P., McMurtry, J. G., Leonard, C. F., and Malliani, A. 1966. Evidence for dendritic origin of spikes without depolarizing prepotentials in hippocampal neurons during and after seizure. J. Neurophysiol. 29, 954-979.
Rall, W. 1959. Branching dendritic trees and motoneuron membrane resistivity. Exp. Neurol. 1, 491-527.
Rall, W. 1964. Theoretical significance of dendritic trees for neuronal input-output relations. In Neural Theory and Modeling, R. F. Reiss, ed., pp. 73-97. Stanford University Press, Stanford, CA.
Rall, W. 1977. Core conductor theory and cable properties of neurons. In Handbook of Physiology: The Nervous System, E. R. Kandel, J. M. Brookhardt, and V. B. Mountcastle, eds., Vol. 1, pp. 39-98. Williams & Wilkins, Baltimore, MD.
Rall, W. 1989. Cable theory for dendritic neurons. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., chapter 2. MIT Press, Cambridge, MA.
Rall, W., and Rinzel, J. 1973. Branch input resistance and steady attenuation for input to one branch of a dendritic neuron model. Biophys. J. 13, 648-688.
Rall, W., and Segev, I. 1987. Functional possibilities for synapses on dendrites and on dendritic spines. In Synaptic Function, G. E. Edelman, W. F. Gall, and W. M. Cowan, eds., pp. 605-636. Wiley, New York.
Rall, W., Burke, R. E., Holmes, W. R., Jack, J. J. B., Redman, S. J., and Segev, I. 1992. Matching dendritic neuron models to experimental data. Physiol. Rev. 72(4), S159-S186.
Ramón y Cajal, S. 1909. Histologie du système nerveux de l'homme et des vertébrés, L. Azoulay, trans. Maloine, Paris.
Rapp, M., Yarom, Y., and Segev, I. 1992. The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Comp. 4, 518-533.
Reuveni, I., Friedman, A., Amitai, Y., and Gutnick, M. J. 1993. Stepwise repolarization from Ca2+ plateaus in neocortical pyramidal cells: Evidence for non-homogeneous distribution of HVA Ca2+ channels in dendrites. J. Neurosci. 13, 4609-4621.
Rinzel, J., and Rall, W. 1974. Transient response in a dendritic neuron model for current injected at one branch. Biophys. J. 14, 759-789.
Ross, W. N., Lasser-Ross, N., and Werman, R. 1990. Spatial and temporal analysis of calcium-dependent electrical activity in guinea pig Purkinje cell dendrites. Proc. R. Soc. London B 240, 173-185.
Salt, T. E. 1986. Mediation of thalamic sensory input by both NMDA receptors and non-NMDA receptors. Nature (London) 322, 263-265.
Schwartzkroin, P. A., and Prince, D. A. 1980. Changes in excitatory and inhibitory synaptic potentials leading to epileptogenic activity. Brain Res. 183, 61-76.
Schwartzkroin, P. A., and Slawsky, M. 1977. Probable calcium spikes in hippocampal neurons. Brain Res. 135, 157-161.
Segev, I. 1992. Single neurone models: Oversimple, complex, and reduced. TINS 15, 414-421.
Segev, I., Fleshman, J. W., and Burke, R. E. 1989. Compartmental models of complex neurons. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 63-96. MIT Press, Cambridge, MA.
Segev, I., and Rall, W. 1988. Computational study of an excitable dendritic spine. J. Neurophysiol. 60, 499-523.
Sereno, M. I., and Ulinski, P. S. 1987. Caudal topographic nucleus isthmi and the rostral nontopographic nucleus isthmi in the turtle, Pseudemys scripta. J. Comp. Neurol. 261, 319-346.
Sernagor, E., Yarom, Y., and Werman, R. 1986. Sodium-dependent regenerative responses in dendrites of axotomized motoneurons in the cat. Proc. Natl. Acad. Sci. U.S.A. 83, 7966-7970.
Shepherd, G. 1992. Canonical neurons and their computational organization. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 27-60. Academic Press, Boston, MA.
Shepherd, G. M. 1990. The Synaptic Organization of the Brain. Oxford University Press, Oxford.
Shepherd, G. M., and Brayton, R. K. 1987. Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neuroscience 21, 151-166.
Shepherd, G. M., and Greer, C. A. 1988. The dendritic spine: Adaptation of structure and function for different types of synaptic integration. In Intrinsic Determinants of Neuronal Form and Function, R. Lasek and M. Black, eds., pp. 245-262. Alan R. Liss, New York.
Shepherd, G. M., and Koch, C. 1990. Dendritic electrotonus and synaptic integration. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 439-473. Oxford University Press, Oxford.
Shepherd, G. M., Brayton, R. K., Miller, J. P., Segev, I., Rinzel, J., and Rall, W. 1985. Signal enhancement in distal cortical dendrites by means of interactions between active dendritic spines. Proc. Natl. Acad. Sci. U.S.A. 82, 2192-2195.
Shepherd, G. M., Woolf, T. B., and Carnevale, N. T. 1989. Comparisons between active properties of distal dendritic branches and spines: Implications for neuronal computations. Cog. Neurosci. 1, 273-286.
Sirevaag, A. M., and Greenough, W. T. 1987. Differential rearing effects on rat visual cortex synapses. III. Neuronal and glial nuclei, boutons, dendrites, and capillaries. Brain Res. 424, 320-332.
Softky, W., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13, 334-350.
Softky, W. R. 1993. Submillisecond coincidence detection in active dendritic trees. Neuroscience 58, 13-41.
Spear, P. J. 1972. Evidence for spike propagation in cortical dendrites. Exp. Neurol. 35, 111-121.
Spencer, W. A. 1977. The physiology of supraspinal neurons in mammals. In Handbook of Physiology, Sec. 1: The Nervous System, J. M. Brookhart, V. B. Mountcastle, and E. R. Kandel, eds., Vol. 1, pp. 969-1021. American Physiological Society, Bethesda, MD.
Spencer, W. A., and Kandel, E. R. 1961. Electrophysiology of hippocampal neurons. IV. Fast prepotentials. J. Neurophysiol. 24, 272-285.
Stafstrom, C. E., Schwindt, P. C., Chubb, M. C., and Crill, W. E. 1985. Properties of persistent sodium conductance and calcium conductance of layer V neurons from cat sensorimotor cortex in vitro. J. Neurophysiol. 53, 153-170.
Sugimori, M., and Llinás, R. R. 1990. Real-time imaging of calcium influx in mammalian cerebellar Purkinje cells in vitro. Proc. Natl. Acad. Sci. U.S.A. 87, 5084-5088.
Sutor, B., and Hablitz, J. J. 1989. EPSPs in rat neocortical neurons in vitro. II. Involvement of N-methyl-D-aspartate receptors in the generation of EPSPs. J. Neurophysiol. 61(3), 621-634.
Tank, D. W., Sugimori, M., Connor, J. A., and Llinás, R. 1988. Spatially resolved calcium dynamics of mammalian Purkinje cells in cerebellar slice. Science 242, 773-777.
Terashima, T., Inoue, K., Inoue, Y., Yokoyama, M., and Mikoshiba, K. 1986. Observations on the cerebellum of normal-reeler mutant mouse chimera. J. Comp. Neurol. 252, 264-278.
Theunissen, F. E., and Miller, J. P. 1991. Representation of sensory information in the cricket cercal sensory system. II. Information theoretic calculation of system accuracy and optimal tuning-curve widths of four primary interneurons. J. Neurophysiol. 66, 1690-1703.
Traub, R. D. 1982. Simulation of intrinsic bursting in CA3 hippocampal neurons. Neuroscience 7, 1233-1242.
Traub, R. D., Dudek, F. E., Taylor, C. P., and Knowles, W. D. 1985. Simulation of hippocampal afterdischarges synchronized by electrical interactions. J. Neurosci. 4, 1033-1038.
Traub, R. D., Wong, R. K. S., Miles, R., and Michelson, H. 1991. A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol. 66, 635-650.
Traub, R. D., and Llinás, R. 1979. Hippocampal pyramidal cells: Significance of dendritic ionic conductances for neuronal function and epileptogenesis. J. Neurophysiol. 42, 476-496.
Turner, R. W., and Richardson, T. L. 1991. Apical dendritic depolarizations and field interactions evoked by stimulation of afferent inputs to rat hippocampal CA1 pyramidal cells. Neuroscience 42, 125-135.
Vaney, D. I., Collin, S. P., and Young, H. M. 1989. Dendritic relationships between cholinergic amacrine cells and direction-selective retinal ganglion cells. In Neurobiology of the Inner Retina, R. Weiler and N. N. Osborne, eds., pp. 157-168. Springer-Verlag, Berlin.
von der Heydt, R., Peterhans, E., and Dursteler, M. R. 1991. Grating cells in monkey visual cortex: Coding texture? In Channels in the Visual Nervous System: Neurophysiology, Psychophysics, and Models, B. Blum, ed., pp. 53-73. Freund, London.
Waterhouse, B. D., Sessler, F. M., Liu, W., and Lin, C. S. 1991. Second messenger-mediated actions of norepinephrine on target neurons in central circuits: A new perspective on intracellular mechanisms and functional consequences. Prog. Brain Res. 88, 351-362.
Watson, A. H. D., and Burrows, M. 1988. Distribution and morphology of synapses on nonspiking local interneurones in the thoracic nervous system of the locust. J. Comp. Neurol. 272, 605-616.
Wilson, C. J. 1992. Dendritic morphology, inward rectification and the functional properties of neostriatal neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 141-171. Academic Press, Boston, MA.
Wong-Riley, M. T. T. 1989. Cytochrome oxidase: An endogenous metabolic marker for neuronal activity. Trends Neurosci. 12(3), 94-101.
Wong, R. K. S., and Stewart, M. 1992. Different firing patterns generated in dendrites and somata of CA1 pyramidal neurones in guinea-pig hippocampus. J. Physiol. 457, 675-687.
Wong, R. K. S., Prince, D. A., and Basbaum, A. I. 1979. Intradendritic recordings from hippocampal neurons. Proc. Natl. Acad. Sci. U.S.A. 76, 986-990.
Woolf, T. B., Shepherd, G. M., and Greer, C. A. 1991. Local information processing in dendritic trees: Subsets of spines in granule cells of the mammalian olfactory bulb. J. Neurosci. 11, 1837-1854.
Yang, C.-Y., and Yazulla, S. 1986. Neuropeptide-like immunoreactive cells in the retina of the larval tiger salamander: Attention to the symmetry of dendritic projections. J. Comp. Neurol. 248, 105-118.
Zador, A., Koch, C., and Brown, T. H. 1990. Biophysical model of a Hebbian synapse. Proc. Natl. Acad. Sci. U.S.A. 87, 6718-6721.
Zador, A. M. 1993. Biophysics of computation in single hippocampal neurons. Ph.D. thesis, Yale University, Interdepartmental Neuroscience Program.
Zador, A. M., Claiborne, B. J., and Brown, T. H. 1992. Nonlinear pattern separation in single hippocampal neurons with active dendritic membrane. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 51-58. Morgan Kaufmann, San Mateo, CA.
Received March 29, 1993; accepted January 11, 1994.
Communicated by John Huguenard
Simulations of Intrinsically Bursting Neocortical Pyramidal Neurons

Paul A. Rhodes* and Charles M. Gray
The Salk Institute for Biological Studies, La Jolla, CA 92037 USA

Neocortical layer 5 intrinsically bursting (IB) pyramidal neurons were simulated using compartment model methods. Morphological data as well as target neurophysiological responses were taken from a series of published studies on the same set of rat visual cortex pyramidal neurons (Mason, A. and Larkman, A. J., 1990. J. Neurosci. 9, 1440-1447; Larkman, A. J. 1991. J. Comp. Neurol. 306, 307-319). A dendritic distribution of ion channels was found that reproduced the range of in vitro responses of layer 5 IB pyramidal neurons, including the transition from repetitive bursting to the burst/tonic spiking mode seen in these neurons as input magnitude increases. In light of available data, the simulation results suggest that in these neurons bursts are driven by an inward flow of current during a high-threshold Ca2+ spike extending throughout both the basal and apical dendritic branches.

*Present address: c/o Center for Neuroscience, University of California at Davis, Davis, CA.

Neural Computation 6, 1086-1110 (1994) © 1994 Massachusetts Institute of Technology

1 Introduction
Neocortical pyramidal neurons have been classified into two categories, regular spiking (RS) and intrinsically bursting (IB), based upon their response to input in vitro (Connors et al. 1982; McCormick et al. 1985). RS pyramidal neurons respond to a depolarizing step of current injected into the soma with a train of spikes that starts at a rapid rate and immediately begins to slow, reaching a steady rate in several hundred milliseconds. In contrast, IB pyramidal neurons produce an all-or-none burst consisting of an after-depolarizing potential (ADP), lasting 10-25 msec, atop which several fast spikes are emitted at a high rate (> 200 Hz), followed by a regular train of single spikes. However, low levels of input often produce repetitive bursts at < 10 Hz (Agmon and Connors 1989; Mason and Larkman 1990). Though it is not clear what role intrinsically bursting neocortical pyramidal neurons play in normal cortical function, several studies have suggested IB pyramidal neuron involvement in the establishment of synchrony in neocortical (Connors 1984; Chagnac-Amitai and Connors 1989;
Silva et al. 1991b) as well as hippocampal (Miles and Wong 1983) tissue slices. Further, the nature of burst output, a short fast series of spikes, suggests IB pyramidal neurons could play a special role in the establishment of LTP-mediated synaptic change (Gamble and Koch 1987), inviting more intensive study of their function in the formation of neocortical circuits. Compartment models with distributed active membrane were pioneered 15 years ago for motorneurons (Traub and Llinás 1977), cerebellar Purkinje neurons (Pellionisz and Llinás 1979), and hippocampal pyramidal neurons (Traub and Llinás 1979). The mechanisms by which neocortical IB pyramidal neurons produce their pattern of output have not, however, been simulated in similar detail. In the present study, we have used data on neocortical layer 5 IB pyramidal neuron responses from rat visual cortex in vitro (Mason and Larkman 1990), along with morphological information from the same set of neurons (Larkman 1991a) and information about the ionic conductances present (Connors et al. 1982; Stafstrom et al. 1982, 1984; McCormick et al. 1985; Friedman and Gutnick 1987, 1989; Schwindt et al. 1988a,b; Huguenard et al. 1988, 1989; Hamill et al. 1991), to guide the construction of a compartment model of this neuronal type. Our goal was to search for distributions of ion channels that resulted in a model neuron that responded to current injection in the same manner as measured in vitro, over as wide a range of inputs and perturbations as possible. A preliminary report of this work has been presented (Rhodes and Gray 1991).

2 Materials and Methods

2.1 Construction of the Compartment Model. Compartment simulation methodology (Traub and Llinás 1977; Perkel et al. 1981; MacGregor 1987) was used to reconstruct a layer 5 IB pyramidal neuron. Averages for morphological quantities such as number of basal branches, number of tips, number of apical tuft end segments, and their average length and diameter were taken from a study of rat visual cortex layer 5 IB pyramidal neurons (Larkman 1991a). These were used to construct a "prototypical" layer 5 IB pyramidal neuron (Fig. 1 and Table 1) in which each of these measures was maintained. Each morphological segment, delineated by dendritic branching points, was assigned a compartment, resulting in 128 compartments in total, ranging from 15 to 129 µm in length. To confirm that the longer compartments did not introduce inaccuracies into the compartment simulation due to changes in voltage along their length (of particular concern when voltage-gated channels are present in the compartment), we carried out trials in which all compartments of length > 25 µm were broken down into smaller subcompartments. Comparison of simulations revealed only a few percent difference in output in response to current injection as measured at the soma, indicating that
simulation with the long compartment lengths was acceptable for our purposes.
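The subdivision check just described is simple to express in code. Below is a minimal sketch (ours, not the authors' simulation code; the class and function names are hypothetical) of splitting every compartment longer than 25 µm into equal subcompartments:

# Minimal sketch (not the authors' code) of the accuracy check described
# above: every compartment longer than 25 um is split into equal
# subcompartments so voltage gradients along its length are resolved.
from dataclasses import dataclass

@dataclass
class Compartment:
    length_um: float   # segment length (um)
    diam_um: float     # segment diameter (um)

def subdivide(comps, max_len_um=25.0):
    """Split each compartment into n equal pieces, none exceeding max_len_um."""
    out = []
    for c in comps:
        n = max(1, int(-(-c.length_um // max_len_um)))  # ceiling division
        out.extend(Compartment(c.length_um / n, c.diam_um) for _ in range(n))
    return out

# Example using two entries from Table 1 (compartments 4 and 6):
tree = [Compartment(117.0, 0.8), Compartment(45.0, 5.0)]
print(len(subdivide(tree)))  # 117 um -> 5 pieces, 45 um -> 2 pieces: prints 7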
Table 1: Specification of the Simulated Pyramidal Neuron.
Number  Length  Diameter    Na   K(DR)    Ca  K(Ca)  K(mAHP)  K(A)  K(M)
 1        26.0     1.5    100.0   80.0   6.7   2.7     1.6    0.9   0.9
 2        26.0     1.5      8.6    3.6   6.7   2.7     1.6    0.6   0.6
 3        15.0     1.5      8.6    3.6   6.7   2.7     1.6    0.3   0.3
 4       117.0     0.8      8.6    3.6   6.7   2.7     1.6    0.3   0.3
 5       104.0     0.8      8.6    3.6   6.7   2.7     1.6    0.3   0.3
 6        45.0     5.0    100.0   80.0   4.3   3.3     1.6    0.9   0.9
 7        20.0     1.3      8.6    3.6   4.3   3.3     1.6    0.3   0.3
 8       104.0     0.7      8.6    3.6   4.3   3.3     1.6    0.3   0.3
 9        40.0     4.5      8.6    3.6   4.3   3.3     1.6    0.6   0.6
10        40.0     4.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
11        35.0     3.5      8.6    3.6   4.3   3.3     1.6    0.3   0.3
12        30.0     3.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
13        30.0     3.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
14        30.0     3.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
15        30.0     3.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
16        30.0     3.0      8.6    3.6   4.3   3.3     1.6    0.3   0.3
17        97.5     3.0      0.0    0.0   0.0   0.0     0.0    0.0   0.0
18        97.5     2.5      0.0    0.0   0.0   0.0     0.0    0.0   0.0
19        97.5     2.5      0.0    0.0   0.0   0.0     0.0    0.0   0.0
20        97.5     2.0      0.0    0.0   0.0   0.0     0.0    0.0   0.0
21        39.0     1.0      0.0    0.0   0.0   0.0     0.0    0.0   0.0
22       110.0     0.5      0.0    0.0   0.0   0.0     0.0    0.0   0.0
Soma        -     25.0    200.0  150.0  10.0  10.0    12.5    2.0   1.5
Note: Lengths, diameters, and channel densities of each species of ion channel by compartment, given in µm and mS/cm2. Compartment numbers correspond to the segment numbers indicated in Figure 1.
Figure 1: Facing page. Dendritic tree structure of the simulated neuron based upon data reported in Larkman (1991a) for rat visual cortex layer 5 intrinsically bursting pyramidal neurons. For simplicity, only 2 of 6 basal branches are depicted. Density distributions used in simulations for Na+ and high-threshold Ca2+ channels are coded in color. The Na+ channel distribution is most concentrated in the soma at > 100 mS/cm2. Ca2+ and Na+ channels are both present at approximately 5-10 mS/cm2 throughout the basal and apical dendrites. Recent imaging data suggest the distribution may extend even farther distally in the apical trunk. Though not shown here, K+ channels [K(DR), K(A), K(M), K(Ca), and K(mAHP)] accompanied the depolarizing channels. Scale bar, 100 µm for lengths and 50 µm for diameters. Soma somewhat smaller than to scale. Values for lengths, diameters, and channel densities are tabulated in Table 1.
Simulations were programmed by us and were integrated using a second-order Taylor series method with a 10-µsec integration step, and carried out on 80486-based PCs.

2.2 Passive Properties. Membrane capacitance C_m and cytoplasmic resistivity were set to 1 µF/cm2 and 150 Ω-cm, respectively. Since the neurophysiological data used to constrain the simulations were taken from a study using sharp electrodes (Mason and Larkman 1990), a somatic leakage conductance (25 nS) was included. This somatic shunt was maintained throughout all the simulations, and as discussed below we found it had little effect on response properties. Membrane resistance R_m was set to 100,000 Ω-cm2, as suggested by recent studies using patch electrodes (Pongracz et al. 1991; Staley et al. 1992). Input resistance R_in was measured (after active conductance distributions were in place) by setting DC input bias to reach voltage equilibrium, and then applying a small (0.05-nA) positive current at the soma, which was divided into the resulting steady-state voltage deflection. The time constant τ_in was measured assuming a single exponential decay of voltage from this steady state. With the above parameters, R_in and τ_in measured 28 MΩ and 10 msec, respectively, in the simulated neuron, in good accord with the average values of 18 MΩ and 10.4 msec reported in slice for the same set of neurons from which morphological data were taken (Mason and Larkman 1990). R_in and τ_in increased to 100 MΩ and 25 msec, respectively, when the 25-nS leakage conductance was eliminated.
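The measurement protocol is worth making concrete. Here is a minimal sketch, applied to a single passive isopotential compartment rather than the full 128-compartment model; the capacitance and leak values are illustrative assumptions, chosen only to give roughly the input resistance quoted above:

# Sketch (assumption: one passive isopotential compartment) of the
# R_in / tau measurement described above: settle to rest, inject a small
# current step, divide the steady-state deflection by the injected current.
C = 1.0e-9        # total capacitance (F), illustrative
g_leak = 35.7e-9  # total leak conductance (S), illustrative (~28 Mohm)
E_leak = -71e-3   # resting potential (V), as in the paper
dt = 10e-6        # 10-us integration step, as in the paper

def step(v, i_inj):
    # forward Euler on C dV/dt = -g_leak (V - E_leak) + I_inj
    return v + dt * (-g_leak * (v - E_leak) + i_inj) / C

v = E_leak
for _ in range(20000):            # settle 200 ms at rest
    v = step(v, 0.0)
for _ in range(20000):            # 0.05-nA step, run to steady state
    v = step(v, 0.05e-9)
R_in = (v - E_leak) / 0.05e-9     # input resistance from the deflection
tau = R_in * C                    # single-exponential time constant
print(R_in / 1e6, "Mohm;", tau * 1e3, "ms")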
2.3 Spines. Spine membrane was taken to be passive. The effect of spines was accounted for by the "area insertion" method (Stratford et al. 1989; Cauller and Connors 1992), whereby effective membrane capacitance is increased and effective membrane resistance decreased in each compartment in proportion to the additional membrane area provided by spines. This is a sensible simplification if current can flow into a spine much more readily than it can exit across its membrane. With spines modeled with a 0.1 × 1.0-µm cylindrical neck and a 0.6-µm-diameter spherical head, neck axial resistance is thousands of times lower than neck membrane resistance (even with R_m reduced to 10,000 Ω-cm2), comfortably satisfying this condition. Larkman (1991b) reported that this set of layer 5 pyramidal neurons had approximately 2 spines per µm length per µm diameter. This density of spines of this size resulted in a 92% increase in dendritic membrane area, which was taken into account by multiplying C_m and dividing R_m by 1.92.
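The 1.92 factor can be reproduced from the quoted spine dimensions and density. A short sketch (our helper names, not the authors' code):

# Sketch of the "area insertion" correction described above. Spine
# dimensions and the 2 spines per um length per um diameter density are
# from the text; the function names are ours.
import math

def spine_area_um2(neck_d=0.1, neck_len=1.0, head_d=0.6):
    # cylindrical neck lateral area + spherical head area (um^2)
    return math.pi * neck_d * neck_len + math.pi * head_d**2

def area_factor(seg_len, seg_diam, spines_per_um_len_per_um_diam=2.0):
    shaft = math.pi * seg_diam * seg_len                     # smooth membrane
    n_spines = spines_per_um_len_per_um_diam * seg_len * seg_diam
    return 1.0 + n_spines * spine_area_um2() / shaft

# Effective passive parameters of a spiny compartment:
f = area_factor(seg_len=100.0, seg_diam=2.0)
Cm_eff = 1.0 * f          # uF/cm^2, multiplied by the factor
Rm_eff = 100_000.0 / f    # ohm-cm^2, divided by the factor
print(round(f, 2), Cm_eff, Rm_eff)   # factor comes out ~1.92, as in the text

Note that the factor is independent of segment size here, since spine count scales with the same length × diameter product as shaft area.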
2.4 Ion Channel Properties. Hodgkin-Huxley type formulas for the fast Na+ current I_Na, the fast delayed rectifier I_K(DR), the A-current I_A, the fast Ca2+-gated K+ current I_K(Ca), and the L-type high-threshold Ca2+ current I_Ca were taken from a simulation study of hippocampal pyramidal neurons (Traub et al. 1991). Of the Na+ channels, 50% were modified to contain a slowly (τ_h ≈ 10 msec) inactivating component (Huguenard et al. 1988). Deactivation of I_K(DR) was sped up by a factor of 1.5. This provides an n_∞ curve that better matches that reported in a study of I_K(DR) kinetics in dissociated neocortical pyramidal neurons (Hamill et al. 1991). Also, we have separately found that an increase in the rate of I_K(DR) deactivation allows a realistically fast first interspike interval in simulations of RS pyramidal neurons (Rhodes and Gray, in preparation). [Ca2+]i-gated deactivation of I_Ca (discussed in Hille 1984) was incorporated as in Yamada et al. (1989). As specified in Table 2, we included a simple kinetics for voltage- and Ca2+-gated activation of the medium afterhyperpolarizing K+ current I_mAHP, based on records from neocortical layer 5 pyramidal neurons (Schwindt et al. 1988a,b). Kinetics for the muscarinic K+ current I_M were taken from Yamada et al. (1989). The low-threshold Ca2+ current I_T (Coulter et al. 1989; Fox et al. 1987) and the persistent sodium current I_NaP (French et al. 1990) were in the set of currents included in trials, but were not ultimately included in the simulated neuron (see below). Formulas for currents used are summarized in Table 2. In all cases, rate constants were adjusted when necessary to correct temperature to 36°C, using a Q10 of 2.5. Resting membrane potential V_r = -71 mV. The reversal potentials for Na+ and K+ currents were +55 and -75 mV, respectively. The reversal potential for Ca2+ was dynamically recalculated with the Nernst equation (Hille 1984), with the maximum level capped at +100 mV. Capping the Ca2+ reversal potential at +60 mV did not alter the behavior of the simulations, after a compensating increase in Ca2+ channel density was made.

2.5 Internal Ca2+ Concentration. [Ca2+]i was simulated assuming two independent saturable buffer systems, and decay via a membrane-bound pump of constant density (Yamada et al. 1989; Sala and Hernandez-Cruz 1990; Zador et al. 1990). Table 3 summarizes the kinetic equations and constants governing this system; rate and dissociation constants were taken directly from the references, while initial buffer concentration was varied somewhat to ensure that [Ca2+]i remained within physiological levels (< 2 µM) in all dendritic compartments. Radial diffusion gradients were neglected (see Discussion).
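Two of the bookkeeping steps above, Q10 temperature scaling and the capped Nernst reversal for Ca2+, are compact enough to sketch. The constants below are standard; the recording temperature and extracellular [Ca2+] are illustrative assumptions:

# Sketch of two adjustments described above: Q10 scaling of rate
# constants to 36 C, and a dynamically recalculated, capped Ca2+ reversal
# potential via the Nernst equation. Recording temperature and [Ca2+]o
# are assumptions, not values stated in the paper.
import math

def q10_scale(rate, temp_c, temp_rec_c=21.0, q10=2.5):
    """Scale a rate constant measured at temp_rec_c up to temp_c."""
    return rate * q10 ** ((temp_c - temp_rec_c) / 10.0)

def e_ca_mV(ca_in_M, ca_out_M=2e-3, temp_c=36.0, cap_mV=100.0):
    """Nernst potential for Ca2+ (z = 2), capped at cap_mV."""
    RT_zF = 8.314 * (273.15 + temp_c) / (2 * 96485.0)   # volts
    e = 1000.0 * RT_zF * math.log(ca_out_M / ca_in_M)   # mV
    return min(e, cap_mV)

print(q10_scale(1.0, 36.0))       # rates ~3.95x faster at 36 C vs. 21 C
print(e_ca_mV(ca_in_M=50e-9))     # ~141 mV at rest, capped to +100 mV
print(e_ca_mV(ca_in_M=2e-6))      # ~92 mV once [Ca2+]i rises toward 2 uM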
2.6 Distribution of Channel Densities. To initiate the search for channel density distributions, we first sought order-of-magnitude estimates for conductance density per unit membrane area, quantities that have not yet been directly measured in neocortical pyramidal neurons. We adapted data on the maximal apparent conductance density per unit membrane area measured for the Na+, K(DR), and Ca2+ channels in a study of dissociated neocortical pyramidal neurons (Hamill et al. 1991). These measurements provide information about the average maximal conductance density in the soma and proximal dendritic membrane only, since the dissociation procedure strips away the fine-caliber basal and distal apical dendritic branches (cf. Huguenard et al. 1989, Fig. 2). The dissociated pyramidal neurons were from young (< P30) rats, and presumably contained an unspecified mixture of RS and IB types. In this study maximum conductance was derived by isolating the current of interest and dividing the maximal current measured during a voltage clamp step by the appropriate driving force. Conductance density was then estimated by dividing by the neuron's area, estimated by measuring neuronal capacitance and assuming C_m = 1 µF/cm2 (Hamill et al. 1991). This provided a measure of the maximum conductance density reached during the voltage clamp step, but the conductance density used in compartment equations is the theoretical maximum density when all channels are open, and only a voltage-dependent fraction of any channel population actually opens during a voltage clamp step (e.g., Hille 1984). To make the necessary correction, we divided by the maximum percentage of channels open reached during a simulated voltage clamp, starting as they did from a holding potential of -100 mV to command potentials from -30 to 0 mV. With this adjustment, we found that the data in Hamill et al. (1991) implied average densities for Na, K(DR), Ca2+(L), and Ca2+(LTC) channels of approximately 30-40, 15-70, 10-100, and 3 mS/cm2, respectively, depending upon command potential. The Ca2+(L) channel's high threshold makes the corrected conductance density quite sensitive to command potential. These estimates were used to provide an order-of-magnitude range for channel conductance densities; in most cases conductance densities chosen for the final simulations turned out to be somewhat lower. Spike height and slope were sensitive to both the absolute and relative density of Na+ and K(DR) channels at and immediately proximal to the soma. We chose somatic and proximal dendritic distributions for Na+ and K(DR) channels that produced spike heights (≈ 100 mV) as well as maximum and minimum slopes (≈ 500 and -100 V/sec) consistent with those measured in vitro (Mason and Larkman 1990). In seeking the distribution of channels away from the soma, systematic methods are unavailable. The problem of determining channel distribution, given as constraint only responses to current inputs, is not analytically tractable, and gradient descent methods for searching channel density space are problematic because of the severe nonlinearity of the interaction between currents and the lack of a smooth objective function to compute error. Following the approach of Traub and co-workers (e.g., Traub et al. 1991), we searched parameter space by hand: we added or subtracted channel density in the model neuron incrementally, shifting a few mS/cm2 of conductance density for 1 or 2 channel types at a time, at a given level in the basal or apical branches, reran the simulated response to a current step, and compared the output with that of the previous run and with the physiologically measured response to the same input. Responses measured in vitro over a range of current steps (see Fig. 2a-c, from Mason and Larkman 1990) were taken as targets. It was found that the model neuron channel distributions thus constructed also provided for other physiological properties, such as the all-or-none intrinsic nature of the burst (see Discussion). We were unable to find instances where addition of CaT or NaP channels improved the output, and so excluded these channel types from the final distribution. This does not exclude these channels, which are indeed known to be present in rat neocortical pyramidal neurons (Hamill et al. 1991; Schwindt et al. 1988a), but rather indicates that their inclusion was not found to improve the performance of the simulated neuron in reproducing IB pyramidal responses.
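A sketch of the order-of-magnitude estimate described above (our function; the example numbers are illustrative and are not values from Hamill et al. 1991):

# Sketch of the density estimate: peak clamp current -> apparent
# conductance via driving force; membrane area from capacitance at
# 1 uF/cm^2; then correction by the fraction of channels open at the
# command potential (taken from a simulated voltage clamp).
def gbar_density_mS_cm2(i_peak_nA, v_cmd_mV, e_rev_mV, cap_pF, open_fraction):
    g_apparent_uS = abs(i_peak_nA) / abs(v_cmd_mV - e_rev_mV)  # nA/mV = uS
    area_cm2 = cap_pF * 1e-12 / 1e-6          # capacitance / (1 uF/cm^2)
    density_mS_cm2 = (g_apparent_uS * 1e-3) / area_cm2         # uS -> mS
    return density_mS_cm2 / open_fraction     # all-channels-open maximum

# Illustrative numbers only: a 5-nA Na+ current at a -10 mV command,
# E_Na = +55 mV, a 20-pF cell, 60% of channels open at that command:
print(round(gbar_density_mS_cm2(5.0, -10.0, 55.0, 20.0, 0.6), 1))  # ~6.4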
Table 2: Formulas for Currents Used.

Note: Equations determining the activation and inactivation variables m and h, using standard Hodgkin-Huxley notation for forward and backward rates α and β, as functions of membrane potential V (millivolts), internal calcium concentration [Ca2+]i (µmol/liter), or both, with time in seconds. Figure T2 plots m and, where applicable, h along with their time constants (indicated by open and filled circles for τ_m and τ_h, respectively) as a function of voltage. For the Na+ channel the slower component of inactivation is labeled with triangles. The heavier line denotes activation, and so in some cases plots m². Time constant scales are indicated in the upper right in msec, with τ_h below τ_m.
Table 2 Continued.
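The Table 2 formulas were typeset as display equations in the original and do not survive extraction here, but they follow the standard Hodgkin-Huxley form described in the table note, so the bookkeeping can be illustrated generically. In the sketch below, the rate functions are placeholders, NOT the paper's fitted kinetics:

# Generic sketch of Hodgkin-Huxley gating: each gate x relaxes toward
# x_inf(V) with time constant tau_x(V), where x_inf = alpha/(alpha+beta)
# and tau_x = 1/(alpha+beta). Placeholder rates, not the paper's kinetics.
import math

def alpha_m(v_mV):   # placeholder forward rate (1/s)
    return 1000.0 / (1.0 + math.exp(-(v_mV + 30.0) / 9.0))

def beta_m(v_mV):    # placeholder backward rate (1/s)
    return 1000.0 / (1.0 + math.exp((v_mV + 30.0) / 9.0))

def gate_step(x, v_mV, alpha, beta, dt=10e-6):
    a, b = alpha(v_mV), beta(v_mV)
    x_inf, tau = a / (a + b), 1.0 / (a + b)
    # exponential-Euler update: exact for a voltage held fixed over dt
    return x_inf + (x - x_inf) * math.exp(-dt / tau)

m = 0.0
for _ in range(1000):                 # 10 ms at a clamped -20 mV
    m = gate_step(m, -20.0, alpha_m, beta_m)
print(round(m, 3))                    # approaches m_inf(-20 mV), ~0.75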
Figure T2: Plots of the activation variable m and, where applicable, the inactivation variable h, along with their time constants, as functions of membrane potential for each current in Table 2 (see the Table 2 note).
Table 3: Equations Governing [Ca2+]i Dynamics.
d[B1]/dt = b_B1 [B1-Ca2+] - f_B1 [Ca2+][B1]
d[B2]/dt = b_B2 [B2-Ca2+] - f_B2 [Ca2+][B2]
d[Ca2+]/dt = I_Ca × 10^6/(2FV) - d[B2]/dt - d[B1]/dt - A · D_pump · Rate_pump · [Ca2+]/([Ca2+] + K_m)
[B1]: concentration of slow buffer
[B1-Ca2+]: concentration of slow buffer/Ca2+ complex
[B2]: concentration of fast buffer
[B2-Ca2+]: concentration of fast buffer/Ca2+ complex
Initial [B1]: 2 × 10^3
Initial [B2]: 1 × 10^3
f_B1: 5.0 × 10^5, forward rate for B1 → B1-Ca2+ (M^-1 s^-1)
b_B1: 2.0 × 10^-1, backward rate for B1-Ca2+ → B1 (s^-1)
f_B2: 1.0 × 10^8, forward rate for B2 → B2-Ca2+ (M^-1 s^-1)
b_B2: 1 × 10^2, backward rate for B2-Ca2+ → B2 (s^-1)
I_Ca: net inward Ca2+ current due to ion channels (amps)
F: Faraday's number
N_A: Avogadro's number
V: volume of compartment (liters)
A: membrane area of compartment (µm2)
D_pump: 2 × 10^2, density of pumps in membrane (µm^-2)
Rate_pump: 3, rate of Ca2+ extrusion per pump (s^-1)

Note: Equations involve two buffers, a faster buffer B1 corresponding to cytosolic buffering molecules such as calmodulin, and a slower buffer B2 corresponding to Ca2+ pumps in the endoplasmic reticulum. Extrusion was via a membrane-bound pump of constant speed. Inward leakage was simplified by maintaining a minimum concentration of 50 nM. All concentrations in µmol/liter.
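A sketch of one derivative evaluation under this scheme, using the rate constants of Table 3; the compartment geometry, K_m, and initial free-buffer levels are illustrative assumptions, and the signs are written so that buffer binding removes free Ca2+:

# Sketch of the Table 3 scheme: two saturable buffers plus a saturable
# membrane pump. Geometry, K_m, and buffer initial conditions are ours.
import math

F = 96485.0                 # Faraday's number (C/mol)
NA = 6.022e23               # Avogadro's number
f1, b1 = 5.0e5, 0.2         # B1 binding (M^-1 s^-1) / unbinding (s^-1)
f2, b2 = 1.0e8, 100.0       # B2 binding / unbinding
diam_um, length_um = 2.0, 30.0                         # assumed compartment
area_um2 = math.pi * diam_um * length_um
vol_L = math.pi * (diam_um / 2) ** 2 * length_um * 1e-15   # um^3 -> liters
D_pump, rate_pump, K_m_uM = 200.0, 3.0, 1.0            # K_m is assumed

def d_ca_dt(ca_uM, b1_free, b2_free, b1_bound, b2_bound, i_ca_A):
    # buffer fluxes in uM/s (f* are per M, hence the 1e-6 for uM inputs)
    j1 = b1 * b1_bound - f1 * 1e-6 * ca_uM * b1_free
    j2 = b2 * b2_bound - f2 * 1e-6 * ca_uM * b2_free
    influx = i_ca_A * 1e6 / (2 * F * vol_L)            # current -> uM/s
    pump = (area_um2 * D_pump * rate_pump * ca_uM / (ca_uM + K_m_uM)
            / (NA * vol_L) * 1e6)                      # ions/s -> uM/s
    return influx + j1 + j2 - pump

# one forward-Euler step from 0.05 uM rest with a 50-pA Ca2+ current;
# buffering initially outpaces the influx, so free [Ca2+]i dips
dt = 10e-6
ca = 0.05
print(ca + dt * d_ca_dt(ca, 2000.0, 1000.0, 0.0, 0.0, 50e-12))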
Figure 2: (a-c) Response of a layer 5 intrinsically bursting pyramidal neuron in vitro to current step injection in the soma, reproduced with permission from Mason and Larkman (1990). Input increases from a to c. (d-g) Output of the simulated neuron to a range of current steps, increasing from d to g. (d) At low (0.4 nA) levels of input, repetitive burst output is produced at < 10 Hz. (e) At 0.5 nA a transitional mode is seen in which an initial burst is followed by doublets. (f) At 0.7 nA, the initial burst is followed by a slightly longer interval, then a series of single spikes at a steady rate. Prominent ADPs appear after the singlets. (g) With increasing input (1.0 nA), the burst/single spike train pattern does not change, but the ADPs after single spikes become less pronounced and the tonic rate increases linearly with input current. Time scale bar, 30 msec.
3 Results

3.1 Response of Simulated IB Pyramidal Neurons. Figure 2a-c shows the response of an IB pyramidal neuron in vitro (Mason and Larkman 1990) to three levels of injected current. At low levels of input (Fig. 2a) the response is slow (< 10 Hz) repetitive bursting; at higher inputs (Fig. 2b and c) the response consists of a burst followed by a train of single spikes, with visible ADPs. Increasing input increases the single spike firing rate with a linear relationship (Mason and Larkman 1990), but does not further alter the burst/tonic pattern of response. Results for the simulated pyramidal neuron are compared in Figure 2d-g. Both repetitive bursting at low levels of input and the transition to the burst/tonic mode with larger inputs were reproduced by the simulated model neuron. At an intermediate input level, the simulation also produced a transition response pattern not often reported in vitro (but see Agmon and Connors 1989, Fig. 2), consisting of an initial burst followed by a mixture of doublets and single spikes with pronounced ADPs (Fig. 2e). As is seen in vitro (Mason and Larkman 1990), in the simulations there was a linear relationship between input magnitude and single spike rate (data not shown). Thus the simulated neuron reproduced physiological responses to current step input across a range of input levels. Figure 3 shows voltage traces from a dendritic compartment in the central apical trunk, the area of impalement for dendritic recordings from these neurons (Amitai et al. 1993, see, e.g., their Fig. 4; Kim and Connors 1993). Both traces show a depolarization to about -10 mV, maintained for ≈ 15 msec during the burst.
Figure 3: The upper plot shows voltage in a compartment of the mid apical trunk during a 0.7-nA current step input to the soma. I_Ca and I_Na (dotted) density are also plotted. The lower plot is of voltage in the soma during the simulation. During the burst, I_Ca is prominent in the dendrites, associated with a sustained (≈ 10-15 msec) depolarization. The subsequent dendritic spikes originate from single spikes in the soma. While comparatively little I_Ca flows in the dendrites during these events, I_Na remains large, indicating that it is primarily the sodium current that carries the somatic spike depolarization outward into the dendrites. Voltage and time scales apply to both plots.
This depolarization was found to be a calcium spike, driven by sustained I_Ca. Subsequent single spikes in the soma triggered brief accompanying depolarizations in the dendrites, which depended upon an outwardly propagating Na+ spike; during these dendritic spikes I_Na (rather than I_Ca) is the primary membrane current activated. In vitro, a threshold pulse of current often (though not always) triggers the generation of a full burst (McCormick et al. 1985), suggesting that the burst is a self-generated intrinsic event, rather than one dependent on continuing external input.
Figure 4: Response of the simulated neuron to a short (5-msec) threshold square pulse of current. Even though input is stopped shortly after the first spike begins, the form of the entire burst is similar to those in Figure 2d-g, in which current was injected continuously. During the development of the burst, depolarizing activity is driven by membrane currents and is independent of external input. The burst is thus an intrinsic, self-generative phenomenon in the simulated neuron, as it is in vitro.
This all-or-none property was reproduced by the simulated neuron (Fig. 4). A short (5 msec) threshold current pulse triggered production of the full burst complex, even though the ADP and last spikes of the burst were produced with external input entirely absent. This demonstrated that after the first spike the burst in the simulated neuron was driven by membrane currents rather than external input, indicating that the simulated bursts were intrinsically generated.
3.2 Transient Response Properties Insensitive to R_m. To test the sensitivity of simulated output to membrane resistance R_m, we reduced R_m from 100,000 to 10,000 Ω-cm2. Surprisingly, transient responses to current injection were insensitive to this change; the response to injection of 0.7 nA with R_m = 100,000 Ω-cm2 was nearly identical to the response to 1.0 nA with R_m = 10,000 Ω-cm2. Neither doubling spine dimensions and density nor complete removal of the somatic shunt had an appreciable effect on transient intrinsic responses (data not shown). These results indicated clearly that responses to current step inputs were insensitive to R_m, somatic leakage, and spine dimensions.
3.3 Channel Distribution. The ion channel distribution that best generated the response properties of IB neocortical pyramidal neurons was characterized by a ≈ 5 mS/cm2 density of high-threshold Ca2+ channels extending throughout the dendritic branches. Wherever Ca2+ channels were present, the [Ca2+]i-gated K+ channels K(Ca) and K(mAHP) were included, at a similar conductance density. We found that including Na+ and K(DR) channels in the dendrites at ≈ 1 mS/cm2 or higher was necessary to trigger the dendritic Ca2+ spikes. These dendritic Na+ channels permitted the somatic spike to propagate outward, raising dendritic voltage to the I_Ca activation threshold. When removed, the model neuron did not burst at all, even with Ca2+ channel density unaltered. Thus we are led to predict the widespread presence of Na+ channels along with Ca2+ channels in IB pyramidal neurons. We found that the model neuron performance could be equally well obtained with a slightly more concentrated but otherwise identical distribution of channels extended throughout the basal branches only, with the apical dendrite entirely passive. In order to obtain spikes of typical height (≈ 100 mV) and slopes (+500 and -100 V/sec for the rising and falling slopes of the action potential, respectively), fast Na+ and K(DR) channels were concentrated at > 100 mS/cm2 in the soma and abutting dendritic segments. Additional hyperpolarizing currents, such as I_A and I_M, were needed to lower spiking rates toward those reported in slice (McCormick et al. 1985; Mason and Larkman 1990). These K+ currents were added to all compartments at low (≈ 1 mS/cm2) levels. The specification of all channel types and densities for the model neuron is given in Table 1 and illustrated in Figure 1.

3.4 [Ca2+]i in the Dendrites Controlled Burst vs. Tonic Spike Mode. We found a direct relationship between the level of [Ca2+]i in the dendrites at the initiation of the somatic spike and the number of spikes generated. When [Ca2+]i in the dendrites was near rest at the initiation of a somatic spike, a burst was generated, while high [Ca2+]i correlated with single spike production (data not shown). To consider whether the effect of [Ca2+]i was mediated by I_K(Ca), we increased the level of [Ca2+]i needed for I_K(Ca) activation. This manipulation led to longer bursts for a given level of [Ca2+]i. Conversely, enhancing I_K(Ca) activation by reducing the [Ca2+]i activation threshold abbreviated the number of spikes produced and shortened the ADP (data not shown). These perturbations of the model neuron indicated a role for I_K(Ca) in controlling bursting similar to that suggested previously (Friedman and Gutnick 1987) based upon manipulations of conditions in slice.

4 Discussion

4.1 A Model for Burst Generation. The foregoing simulations suggest a model for the generation of intrinsic bursts in neocortical IB pyra-
midal neurons, with dynamics illustrated in the following series of steps: (1) Current injected into the soma triggers a Na+ spike there (Fig. 5a), and
depolarization propagates outward into the dendrites, boosted by dendritic Na+ channels. (2) As high-threshold Ca2+ channels are activated in the dendrites, an I_Ca-driven depolarization (the Ca2+ spike) begins in the dendritic branches (Fig. 5b). Since [Ca2+]i in the dendrites starts out at its low resting level, activation of [Ca2+]i-gated I_K(Ca) is not immediate, and the Ca2+ spike is sustained. (3) I_K(DR) at the cell body repolarizes the somatic spike, establishing a pronounced (> 50 mV) potential difference between the depolarized distal branches and the repolarized soma. During this period, ohmic current flows axially from the dendrites into the soma (Fig. 5c). (4) This inward current depolarizes the soma, driving the ADP and additional Na+ spikes at short (several millisecond) latency (Fig. 5d). The inward drive from the proximal dendrites thereby produces the burst, accounting for the burst's independence from continuing current injection. (5) As Ca2+ influx in the dendrites continues, buildup of [Ca2+]i leads to increasing activation of I_K(Ca), which repolarizes the membrane potential in the dendrites, eliminating the potential difference with the soma (Fig. 5e). The current flowing from the dendrites into the soma ceases, ending the ADP and the burst.

Figure 5: Facing page. Snapshots of voltage activity in the simulated neuron during current step injection. The lettered arrows on the inset indicate the voltage in the soma at points in time corresponding to the snapshots. (a) A 0.8-nA current step injected into the soma triggers a Na+ spike there. (b) Depolarization spreads to activate high-threshold Ca2+ channels in dendritic membrane proximal to the soma, leading to the initiation of a sustained Ca2+ spike in the basal and apical dendrites. (c) As the somatic spike repolarizes, a potential difference between the dendrites and the repolarized soma is established, causing current to flow axially into the soma from the surrounding dendrites (arrows). (d) This inward current from the dendrites causes the ADP in the soma, driving another spike, thereby producing the burst. (e) As Ca2+ influx continues in the dendrites, the [Ca2+]i-gated K+ current I_K(Ca) is activated, repolarizing the dendrites, eliminating the potential difference with the soma, and ending the ADP and the burst. (f) At this input, the first postburst spike is triggered after a 40-msec interval. Because [Ca2+]i takes 100-200 msec to pump down, its level is still elevated in the dendrites, leading to the rapid reactivation of I_K(Ca), which quickly quenches the Ca2+ spike in the dendritic branches, so (g) only a small ADP and a single spike are produced. With continued input this cycle repeats, leading to a train of single spikes. Inset shows the voltage trace (measured at soma) during simulation, with arrows indicating approximate points in time corresponding to a-g. With 0.4-nA input, there is a longer (150-msec) interval after the burst. By the time of the first postburst spike, [Ca2+]i has declined sufficiently to slow the onset of I_K(Ca), permitting a sustained dendritic Ca2+ spike to drive another burst; this sequence repeats, leading to repetitive bursting (not shown).
After the initial burst, there are two alternatives that depend upon the magnitude of the input current and the rate at which [Ca2+]i returns to resting levels: (6) With continuing application of current, a postburst spike is triggered (Fig. 5f). If at this time enough [Ca2+]i remains in the dendrites to activate I_K(Ca), this current quickly quenches the dendritic Ca2+ spike, so the dendrites hyperpolarize immediately (Fig. 5g); little axial current flows inward to drive the soma, and only the single spike is produced, with a small ADP. Compare the basal dendritic depolarization in Figure 5c and g, which are both snapshots immediately after the first spike of a cycle. With continuing external input to the soma this cycle continues, and a train of single spikes is produced. Alternatively, (6') if injected current magnitude is low, so that the interval to the first spike after the burst is long compared to the time necessary for internal processes to pump down [Ca2+]i to near resting levels, initial conditions are reset and a new burst is generated in the same manner as the first; repetitive bursting ensues. This description of the processes involved in bursting accounts for the typical series of IB pyramidal neuron outputs (e.g., Fig. 2a-c), and for the intrinsic nature of the burst. A predicted intermediate transition state (Fig. 2e) arises in simulation when the input applied is such that, after the first interval postburst, [Ca2+]i in the dendrites remains at a level sufficient to rapidly, but not immediately, begin I_K(Ca) activation, permitting a short period of inward axial drive resulting in doublets and/or single spikes with large ADPs. The model is consistent with the finding that Ca2+ currents are required for burst generation in both hippocampal CA3 (Wong and Prince 1978; Johnston and Hablitz 1980) and neocortical (Friedman and Gutnick 1987) pyramidal neurons, and that reducing [Ca2+]i promotes bursting (Friedman and Gutnick 1987, 1989), presumably by reducing the activation of I_K(Ca). A related prediction arising from the above model is that increasing the rate at which cytoplasmic Ca2+ is pumped down will increase the maximal rate at which repetitive bursting may be sustained. The model is similar to that proposed by Traub and Llinás (1979) for hippocampal pyramidal neurons, in that the burst and ADP are driven by currents flowing toward the soma from dendrites depolarized by I_Ca. In a model of neocortical pyramidal neurons, Traub (1979) concluded the apical dendrite was the key driver of the ADP and the burst; we find, however, that the basal dendrites are equally well suited for that role in this neuronal type.

4.2 Bursting Neurons Have Depolarizing Channels Distributed in Both Apical and Basal Dendrites. As noted above, the soma-based current response data presently available can be reproduced with simulations in which the Ca2+ spike driving the burst is produced only in basal dendrites, only in the apical dendrite, or both. Further, recent experiments in which the apical dendrite of layer 5 pyramidal neurons was truncated at the layer 4/5 border in slice have shown that generation of the intrinsic response properties of neocortical IB pyramidal neurons does not require the apical dendrite (Telfeian et al. 1991; B. W. Connors, personal communication).
does not require the apical dendrite (Telfeian et al. 1991; B. W. Connors, personal communication). We performed the same manipulation in simulation, truncating the simulated dendritic tree just above the soma, and found that with the channel distribution we chose, all the characteristics of IB pyramidal neuron responses were intact after apical dendrite truncation (Fig. 6).

Figure 6: Response of the simulated layer 5 IB pyramidal neuron after truncation of the apical dendrite just above the soma. Diagram to the right indicates the surviving dendritic tree. The simulated truncated neuron showed a set of responses similar to those shown in Figure 2 for the intact simulated neuron.

Also, bursting neurons appear in layer 2 in the inside-out reeler cortex (Silva et al. 1991a; see their Fig. 2). Due to their proximity to the pia, some of these neurons have no distinguishable apical dendrite, yet they produce conventional bursting responses, further supporting the notion that an apical dendrite is unnecessary for burst generation. Thus, while recent results (Cauller and Connors 1989, 1990, 1992; Amitai et al. 1991, 1993) have provided direct evidence that Na+ and Ca2+ channels are present in the apical dendrite of layer 5 pyramidal neurons, both simulations and other data indicate that basal dendrites are capable of driving bursts, and that the apical dendrite is not required for burst generation. Very recently, Ca2+ imaging in neocortical pyramidal neurons has indicated Ca2+ channels in the basal and apical dendrites (R. Yuste, personal communication), supporting the conclusions of this modeling study. It should be noted, however, that a recent study (Westenbroek et al. 1992) has immunohistochemically visualized L- and N-type Ca2+ channels in layer 5 neocortical pyramidal neurons, and finds them restricted to the soma and proximal apical main trunk. If some of these neurons were of the IB type (their physiology was not determined) and if the comparatively fine basal processes permit adequate visualization
of label, then this study contradicts our conclusions and the recent optical Ca2+ imaging results regarding the presence of channels in the basal dendrites. There are a number of limitations inherent in the present study that should be emphasized. First, we have not (and with present techniques cannot) fully explored the vast parameter space of possible channel distributions. The main feature that we found must be retained is the distribution of Ca2+ and K(Ca) channels, as well as a lower density of Na+ and K(DR) channels, throughout much of the dendrites, coupled with sodium spike generation in the soma. Another potential shortcoming is that radial [Ca2+]i diffusion gradients were neglected. This simplification depends upon the assumption that Ca2+ equilibrates throughout the compartment more quickly than any phenomenon affected by [Ca2+]i, such as activation of IK(Ca). If Ca2+ diffuses in cytoplasm at a rate similar to that applicable in water, it takes a Ca2+ ion approximately 200 μsec to diffuse into the core of a 2 μm-diameter dendritic segment (Hille 1984); this is fast compared to the millisecond time scale relevant to IK(Ca) activation. On the other hand, if diffusion in cytoplasm is much slower, or if the presence of relatively immobile buffers renders the effective diffusion of [Ca2+]i slow (Sala and Hernandez-Cruz 1990), then diffusion gradients exist over longer time scales. Under these circumstances, a better model would treat radial calcium and buffer gradients explicitly. We believe that the current kinetics we employed also require refinement; we were unable to eliminate the overly pronounced hyperpolarization after the burst in the simulations, and suspect that the kinetics we have used for IK(DR) need modification for neocortical pyramidal neurons. As more detailed information becomes available regarding current kinetics and [Ca2+]i dynamics, we expect that compensating modifications in the channel density distributions will in turn be required. The question arises whether it would be possible to reproduce the range of input/output relationships in Figure 2 with a single compartment neuron, or with a compartment model in which channels are restricted to the soma (Lytton and Sejnowski 1991). We have been unable to find distributions of channels restricted to the soma that reproduced the range of IB pyramidal neuron outputs in Figure 2, though it is easy to get only repetitive bursting with just a single compartment model (MacGregor 1987). While we have not attempted to address this question rigorously, it appears to us that burst generation integrally involves interplay between high threshold Ca2+ currents in the dendrites and Na+ spikes in the soma. We believe that to obtain bursts of > 300 Hz spikes it is necessary to separate the fast Na+ spike in the soma from the longer Ca2+ spike in nearby dendrites (Traub and Llinas 1979). Our results, as well as prior work on hippocampal pyramidal neurons, suggest that the presence of a significant dendritic distribution of depolarizing currents is a key characteristic of bursting pyramidal neurons.
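The radial equilibration estimate quoted above can be checked with a one-line calculation. This is our own order-of-magnitude check, not taken from Hille (1984): it assumes a free-solution diffusion coefficient for Ca2+ of roughly 0.8 × 10⁻⁵ cm²/s and the prefactor r²/6D; other reasonable prefactors change the answer only by a small factor.

```python
# Order-of-magnitude diffusion time into the core of a 2-um dendrite.
# D and the r^2/(6D) prefactor are assumptions made for this check.
D = 0.8e-5            # cm^2/s; approximate free Ca2+ diffusion coefficient
r = 1e-4              # cm; radius of a 2-um-diameter segment
t = r**2 / (6.0 * D)  # seconds
print(f"t ~ {t*1e6:.0f} microseconds")  # ~200 usec, fast vs. msec IK(Ca) kinetics
```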
In addition to driving bursts, dendritic depolarizing membrane also confers special properties on the integration of synaptic inputs. For example, a dendritic distribution of depolarizing channels may support dendritic spikes that boost the efficacy of distal EPSPs (Spencer and Kandel 1961; Llinás and Nicholson 1968, 1969). Further, voltage-sensitive membrane could gate cooperativity between N-methyl-D-aspartate (NMDA)-mediated synapses (Mel 1993), enriching the signal-processing capabilities of the dendritic tree. We note further that depolarizing dendritic channels can do more than boost the inward flow of information from dendritic synapse to soma. The presence of depolarizing channels, particularly Na+, in the dendrites enhances the outward propagation of depolarization through the dendritic tree after the somatic spike. This becomes of particular importance when NMDA receptor-mediated synapses are present on the dendrites (Rhodes and Gray 1993), because of the voltage-gated property of the NMDA receptor conductance. In this regard, dendritic depolarizing channels may have multiple functions in bursting neurons: they drive the generation of the burst, they amplify the efficacy of dendritic synaptic inputs, and they provide these neurons with a powerful means to correlate somatic spiking with increased NMDA conductance at synapses throughout the dendritic tree. It remains to determine the joint implications of these properties for information processing in neurons and neuronal circuits.

Acknowledgments

We would like to acknowledge valuable discussions with Drs. B. W. Connors, F. H. C. Crick, J. R. Huguenard, C. Koch, and C. F. Stevens. This work was supported by a grant from the Office of Naval Research.

References

Agmon, A., and Connors, B. W. 1989. Repetitive burst-firing neurons in the deep layers of mouse somatosensory cortex. Neurosci. Lett. 99, 137-141.
Amitai, Y., Friedman, A., Connors, B., and Gutnick, M. J. 1991. Dendritic electrogenesis in neocortical neurons in vitro. Soc. Neurosci. Abstr. 17, 311.
Amitai, Y., Friedman, A., Connors, B. W., and Gutnick, M. J. 1993. Regenerative electrical activity in apical dendrites of pyramidal cells in neocortex. Cerebral Cortex 3, 26-38.
Cauller, L. J., and Connors, B. W. 1989. Origin and function of horizontal layer I afferents to rat SI neocortex. Soc. Neurosci. Abstr. 15, 281.
Cauller, L. J., and Connors, B. W. 1990. Horizontal layer I inputs to primary somatosensory (barrelfield) neocortex of rat. Soc. Neurosci. Abstr. 16, 242.
Cauller, L. J., and Connors, B. W. 1992. Functions of very distal dendrites: Experimental and computational studies of layer 1 synapses on neocortical pyramidal cells. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 199-230. Academic Press, San Diego.
Chagnac-Amitai, Y., and Connors, B. W. 1989. Synchronized excitation and inhibition driven by intrinsically bursting neurons in neocortex. J. Neurophysiol. 62, 1149-1162.
Connors, B. W. 1984. Initiation of synchronized neuronal bursting in neocortex. Nature (London) 310, 685-687.
Connors, B. W., Gutnick, M. J., and Prince, D. A. 1982. Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol. 48, 1302-1320.
Coulter, D. A., Huguenard, J. R., and Prince, D. A. 1989. Calcium currents in rat thalamocortical relay neurons: Kinetic properties of the transient low-threshold current. J. Physiol. 414, 587-604.
Fox, A. P., Nowycky, M. C., and Tsien, R. W. 1987. Kinetic and pharmacological properties distinguishing three types of calcium currents in chick sensory neurones. J. Physiol. 394, 149-172.
French, C. R., Sah, P., Buckett, K. J., and Gage, P. W. 1990. A voltage dependent persistent sodium current in mammalian hippocampal neurons. J. Gen. Physiol. 95, 1139-1157.
Friedman, A., and Gutnick, M. J. 1987. Burst generation in EGTA-injected neocortical neurons. Neuroscience 22, 5400.
Friedman, A., and Gutnick, M. J. 1989. Intracellular calcium and control of burst generation in neurons of guinea-pig neocortex in vitro. Eur. J. Neurosci. 1, 374-381.
Gamble, E., and Koch, C. 1987. The dynamics of free calcium in dendritic spines in response to repetitive input. Science 236, 1311-1315.
Hamill, O. P., Huguenard, J. R., and Prince, D. A. 1991. Patch clamp studies of voltage-gated currents in identified neurons of the rat cerebral cortex. Cerebral Cortex 1, 48-61.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer Associates, Sunderland, MA.
Huguenard, J. R., Hamill, O. P., and Prince, D. A. 1988. Developmental changes in Na+ conductances in rat neocortical neurons: Appearance of a slowly inactivating component. J. Neurophysiol. 59, 778-795.
Huguenard, J. R., Hamill, O. P., and Prince, D. A. 1989. Sodium channels in dendrites of rat cortical pyramidal neurons. Proc. Natl. Acad. Sci. U.S.A. 86, 2473-2477.
Johnston, D., and Hablitz, J. J. 1980. Voltage clamp discloses slow inward current in hippocampal burst-firing neurones. Nature (London) 286, 391-393.
Kim, H. G., and Connors, B. W. 1993. Apical dendrites of the neocortex: Correlation between sodium- and calcium-dependent spiking and pyramidal cell morphology. J. Neurosci. 13, 5301-5311.
Larkman, A. U. 1991a. Dendritic morphology of pyramidal neurones of the visual cortex of the rat: I. Branching patterns. J. Comp. Neurol. 306, 332-343.
Larkman, A. U. 1991b. Dendritic morphology of pyramidal neurones of the visual cortex of the rat: III. Spine distributions. J. Comp. Neurol. 306, 307-319.
Llinás, R., Nicholson, C., Freeman, J. A., and Hillman, D. E. 1968. Dendritic spikes and their inhibition in alligator Purkinje cells. Science 160, 1133-1135.
Llinás, R., Nicholson, C., and Precht, W. 1969. Preferred centripetal conduction of dendritic spikes in alligator Purkinje cells. Science 163, 184-187.
Lytton, W. W., and Sejnowski, T. J. 1991. Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol. 66, 1059-1079.
MacGregor, R. J. 1987. Neural and Brain Modeling. Academic Press, San Diego.
Mason, A., and Larkman, A. 1990. Correlations between morphology and electrophysiology of pyramidal neurons in slices of rat visual cortex. J. Neurosci. 10, 1415-1428.
McCormick, D. A., Connors, B. W., Lighthall, J. W., and Prince, D. A. 1985. Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol. 54, 782-806.
Mel, B. 1993. Synaptic integration in an excitable dendritic tree. J. Neurophysiol. 70, 1086-1101.
Miles, R., and Wong, R. K. S. 1983. Nature (London) 306, 371-373.
Pellionisz, A., and Llinás, R. 1977. A computer model of cerebellar Purkinje cells. Neuroscience 2, 37-48.
Perkel, D. H., Mulloney, B., and Budelli, R. W. 1981. Quantitative methods for predicting neuronal behavior. Neuroscience 6, 823-837.
Pongracz, F., Firestein, S., and Shepherd, G. M. 1991. Electrotonic structure of olfactory sensory neurons analyzed by intracellular and whole cell patch techniques. J. Neurophysiol. 65, 747-758.
Rhodes, P. A., and Gray, C. M. 1991. Compartment models of neocortical intrinsic bursting and regular spiking pyramidal neurons. Soc. Neurosci. Abstr. 17, 1113.
Rhodes, P. A., and Gray, C. M. 1993. Effects of electrically active dendrites and NMDA-type synaptic conductances. Soc. Neurosci. Abstr. 19, 242.
Sala, F., and Hernandez-Cruz, A. 1990. Calcium diffusion modeling in a spherical neuron. Biophys. J. 57, 313-324.
Schwindt, P. C., Spain, W. J., Foehring, R. C., Stafstrom, C. E., Chubb, M. C., and Crill, W. E. 1988a. Multiple potassium conductances and their functions in neurons from cat sensorimotor cortex in vitro. J. Neurophysiol. 59, 424-449.
Schwindt, P. C., Spain, W. J., Foehring, R. C., Chubb, M. C., and Crill, W. E. 1988b. Slow conductances in neurons from cat sensorimotor cortex in vitro and their role in slow excitability changes. J. Neurophysiol. 59, 450-467.
Silva, L. R., Gutnick, M. J., and Connors, B. W. 1991a. Laminar distribution of neuronal membrane properties in neocortex of normal and reeler mouse. J. Neurophysiol. 66, 2034-2040.
Silva, L. R., Amitai, Y., and Connors, B. W. 1991b. Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science 251, 432-435.
Spencer, W. A., and Kandel, E. R. 1961. Electrophysiology of hippocampal neurons. IV. Fast prepotentials. J. Neurophysiol. 24, 272-285.
Stafstrom, C. E., Schwindt, P. C., Flatman, J. A., and Crill, W. E. 1982. Properties of subthreshold response and action potential recorded in layer V neurons from cat sensorimotor cortex in vitro. J. Neurophysiol. 52, 244-263.
Stafstrom, C. E., Schwindt, P. C., and Crill, W. E. 1984. Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol. 52, 264-277.
Staley, K. J., Otis, T. S., and Mody, I. 1992. Membrane properties of dentate gyrus granule cells: Comparison of sharp microelectrode and whole-cell recordings. J. Neurophysiol. 67, 1346-1358.
Stratford, K. J., Mason, A. J. R., Larkman, A. U., Major, G. M., and Jack, J. J. B. 1989. The modeling of pyramidal neurones in the visual cortex. In The Computing Neuron, R. M. Durbin, R. C. Miall, and G. J. Mitchison, eds., pp. 296-321. Addison-Wesley, Wokingham.
Telfeian, A. E., Cauller, L. J., and Connors, B. W. 1991. Soc. Neurosci. Abstr. 17, 311.
Traub, R. D. 1979. Neocortical pyramidal cells: A model with dendritic calcium conductance reproduces repetitive firing and epileptic behavior. Brain Res. 173, 243-257.
Traub, R. D., and Llinás, R. 1977. The spatial distribution of ionic conductances in normal and axotomized motoneurons. Neuroscience 2, 829-849.
Traub, R. D., and Llinás, R. 1979. Hippocampal pyramidal cells: Significance of dendritic ionic conductances for neuronal function and epileptogenesis. J. Neurophysiol. 42, 476-496.
Traub, R. D., Wong, R. K. S., Miles, R., and Michelson, H. 1991. Model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol. 66, 635-650.
Westenbroek, R. E., Hell, J. W., Warner, C., Dubel, S. J., Snutch, T. P., and Catterall, W. A. 1992. Neuron 9, 1099-1115.
Wong, R. K. S., and Prince, D. A. 1978. Participation of calcium spikes during intrinsic burst firing in hippocampal neurons. Brain Res. 159, 385-390.
Yamada, W. M., Koch, C., and Adams, P. R. 1989. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 97-134. MIT Press, Cambridge, MA.
Zador, A., Koch, C., and Brown, T. H. 1990. Biophysical model of a Hebbian synapse. Proc. Natl. Acad. Sci. U.S.A. 87, 6718-6722.
Received July 13, 1993; accepted March 15, 1994.
Communicated by William Lytton
Effects of Input Synchrony on the Firing Rate of a Three-Conductance Cortical Neuron Model

Venkatesh N. Murthy*
Eberhard E. Fetz
Department of Physiology and Biophysics and Regional Primate Research Center, University of Washington, Seattle, WA 98195 USA

*Current address: The Salk Institute, CNL, 10010 N. Torrey Pines Road, La Jolla, CA 92037 USA.
For a model cortical neuron with three active conductances, we studied the dependence of the firing rate on the degree of synchrony in its synaptic inputs. The effect of synchrony was determined as a function of three parameters: number of inputs, average input frequency, and the synaptic strength (maximal unitary conductance change). Synchrony alone could increase the cell's firing rate when the product of these three parameters was below a critical value. But for higher values of the three parameters, synchrony could reduce the firing rate. Instantaneous responses to time-varying input firing rates were close to predictions from steady-state responses when input synchrony was high, but fell below steady-state responses when input synchrony was low. Effectiveness of synaptic transmission, measured by the peak area of cross-correlations between input and output spikes, increased with increasing synchrony.

1 Introduction

Synchronous activity in the millisecond time range has been observed in many regions of the brain. Cross-correlation histograms between spike trains from neocortex of mammals often exhibit central peaks on this time-scale (Fetz et al. 1991; Nelson et al. 1992). The possible function of synchronous neural activity in the mammalian brain has attracted renewed interest in light of recent reports that many neurons are synchronously coactivated in the visual cortex (Eckhorn et al. 1988; Gray and Singer 1989; Engel et al. 1991) and sensorimotor cortex (Bouyer et al. 1981; Murthy and Fetz 1992; Llinas and Ribary 1993; Sanes and Donoghue 1993). While synchronous activity could simply reflect the consequences of convergent neural circuitry, it could also play a significant role in neural transmission. The present study explores the role of synchrony in determining firing rates of a model neuron. Postsynaptic potentials have finite
time constants, and their amplitudes are usually a small fraction of the depolarization necessary to evoke a spike. Therefore, near-simultaneous arrival of spikes is likely to be more effective in depolarizing a cell to threshold than asynchronous inputs, especially when the interspike intervals are greater than the membrane time constant. Relatively few studies have investigated this issue parametrically (Abeles 1982, 1991; Kenyon et al. 1990; Reyes and Fetz 1993; Murthy and Fetz 1993; Bernander et al. 1994). A primary aim of our simulations was to describe quantitatively how the output of a biophysically based model neuron depends on the degree of synchrony in its inputs. Further, we wanted to determine the parameter regime in which synchrony could play a significant role. We systematically explored the effect of the following parameters on the average output firing rate (fo): (1) N, the number of inputs to the model neuron; (2) fi, the average firing rate of each input; (3) w, the synaptic efficacy, defined as the maximal unitary synaptic conductance; and (4) s, the degree of synchrony among the inputs (varied from 0 to 100%). We found that for low values of the first three parameters, synchrony increased fo, whereas for higher values of these parameters, synchrony could in fact reduce fo. Preliminary results were presented at the Computation and Neural Systems 1992 meeting (Murthy and Fetz 1993).

2 Methods
The driven neuron was modeled to represent a cortical pyramidal cell. It had five compartments: a soma and four apical dendritic compartments (Fig. 1a). We chose a simplified biophysical model because it has been shown to reproduce many of the features of more detailed compartmental models (Lytton and Sejnowski 1991; Bush and Sejnowski 1993). Three Hodgkin-Huxley style membrane currents were included in the soma: fast sodium, delayed rectifier potassium, and an A-like potassium current (the A-current was included to generate low-frequency firing in response to injected current). The parameters for the kinetics of the currents were similar to those of the three-channel model of Lytton and Sejnowski (1991), and are listed in Table 1. The dendritic compartments did not have any nonsynaptic active currents. The excitatory synaptic inputs occurred equally on the second and the third dendritic compartments and were modeled with a conductance change whose time course was similar to that of the fast glutamate synapses in the cortex (Hestrin 1992). The conductance change was modeled as a sum of two exponentials, with a rise time constant of 0.5 msec and a decay time constant of 1 msec:

gsyn(t) = g(e^(-t/1.0) - e^(-t/0.5))   (2.1)
where t is time in milliseconds and g is a scaling factor that determines the maximal conductance w in units of Siemens. The strength of the input connections was varied by changing g.
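A minimal sketch of this synaptic conductance is given below. It is our reading of equation 2.1, not the GENESIS code used in the study; the normalization step (choosing g so that the peak equals the desired w) follows the procedure the text describes, and the function names are ours.

```python
import numpy as np

def g_syn(t_ms, g):
    """Dual-exponential conductance of equation 2.1:
    rise time constant 0.5 msec, decay time constant 1 msec.
    t_ms is time in milliseconds; g is the scaling factor in Siemens."""
    return g * (np.exp(-t_ms / 1.0) - np.exp(-t_ms / 0.5))

def scale_for_peak(w):
    """Return g such that the peak of g_syn(t, g) equals w (Siemens)."""
    t = np.linspace(0.0, 10.0, 10001)   # msec grid covering the transient
    return w / np.max(g_syn(t, 1.0))    # peak of the unit-amplitude kernel

g = scale_for_peak(3.75e-9)             # w = 3.75 nS, one value used in the text
t_peak = np.log(2.0)                    # analytic peak time, ~0.69 msec
print(f"peak conductance = {g_syn(t_peak, g)*1e9:.2f} nS at t = {t_peak:.2f} msec")
```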
Figure 1: Responses of the model cell. (a) Steady-state firing rate vs. injected current, determined by injecting 500-msec current steps into the soma. Insets show a schematic of the model cell and an action potential evoked by injecting 0.5 nA current for 1.5 msec. (b) The membrane potential (Vm) in the soma and the total synaptic conductance (gsyn) at the second dendritic compartment, for two levels of synchrony in the inputs. For both panels, N = 100, fi = 25 Hz, w = 3.75 nS (200 μV). In the bottom panel (s = 0.3) the large transients in the membrane potential directly cause spikes. Action potentials are truncated.
Figure 2: (a) Dependence of fo on synchrony and synaptic strength w, for N = 100, fi = 25 Hz. Each curve represents the relation between fo and s for a particular value of w. The coefficient of variation of fo (variance divided by mean) increased nearly linearly with synchrony. (b) Dependence of fo on synchrony and input frequency, for N = 100, w = 5 nS (250 μV). Each curve represents the relation between fo and s for a particular value of fi.

For each w, the peak of the EPSP from resting membrane potential was determined, since this is the most common experimentally measured quantity. Corresponding values of w and EPSP amplitudes are given in Figure 2. Inhibitory inputs were not systematically explored in these simulations. The input spike trains were modeled as Poisson processes, and all inputs in a given simulation had the same mean rate. The total number of inputs was either 50, 100, or 200. These numbers were chosen so that for EPSP sizes and spontaneous firing rates seen in the cortex, the model neuron's firing rate would be in the physiological range. For example, if the inputs had a mean rate of 10 Hz and produced EPSPs of 200 μV, then about 100 inputs would give
rise to a firing rate of 10 Hz in the postsynaptic cell. This number is intended to represent not the total number of anatomical input connections to a cortical neuron, but the mean number of active inputs. No synaptic input other than that specified by N, nor any other source of background activity, was included. Synchronization was simulated by lumping multiple inputs into one, whose strength was increased proportionately. When all inputs were independent of each other, the value of the synchrony factor s was defined as 0. The synchrony among the inputs was varied in steps of 0.1 up to a maximum value of 1, for 100% synchrony. For example, at "0.3 synchrony," 30% of the afferents had identical spike trains; the other 70% had independent spike trains. The average number of spikes per second (fo) was used as the measure of the output. All simulations were performed using the computer program GENESIS running on a Sun 4 workstation (Wilson and Bower 1989). A sketch of this input-generation scheme is given below.
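This is a hypothetical reconstruction of the input-generation procedure described above, not the GENESIS script used in the paper; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_train(rate_hz, dur_s):
    """Homogeneous Poisson spike times (sec) over [0, dur_s)."""
    n = rng.poisson(rate_hz * dur_s)
    return np.sort(rng.uniform(0.0, dur_s, n))

def make_inputs(N, fi, s, dur_s):
    """Fraction s of the N afferents share one spike train, delivered as a
    single lumped input whose weight is increased proportionately; the
    remaining (1 - s)*N afferents get independent trains."""
    n_sync = int(round(s * N))
    sources = []
    if n_sync > 0:
        sources.append((poisson_train(fi, dur_s), n_sync))  # lumped input
    for _ in range(N - n_sync):
        sources.append((poisson_train(fi, dur_s), 1))       # independent inputs
    return sources  # list of (spike_times, weight_multiplier)

inputs = make_inputs(N=100, fi=25.0, s=0.3, dur_s=5.0)
print(len(inputs), "sources; lumped-input weight =", inputs[0][1])
```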
3 Results

The maximal conductances of the fast sodium, delayed rectifier potassium, and A channels were adjusted to reproduce action potential waveforms and f-I curves seen in cortical pyramidal neurons (Stafstrom et al. 1984). With the cell at rest (-65 mV), the passive membrane time constant was 15 msec and the total input resistance was 40 MΩ. The passive time constant of the model was not systematically varied. However, the effective time constant would vary implicitly as a function of synaptic inputs and the activity of the cell (Bernander et al. 1991). Figure 1a shows the responses of the model, whose intrinsic parameters were fixed at the values in Table 1.

3.1 Steady-State Activity. For each combination of the four parameters N, fi, w, and s, 5 sec of activity was simulated and the average firing frequency of the cell over this time was calculated. The values of fo as a function of s were determined with the other three parameters fixed. For the sake of clarity, the results are discussed in three sections; each section describes the effect of increasing synchrony on fo when one variable was varied systematically, with the other two fixed.

1. N and fi held constant, w varied: Figure 1b shows the membrane potential trajectories and the total synaptic conductance for two levels of input synchrony with the other parameters kept constant [N = 100, fi = 25 Hz, and w = 3.75 nS (200 μV)]. With minimal synchrony (s = 0), the membrane potential fluctuations were relatively small and the cell fired intermittently. When the level of synchrony was raised to 0.3, the membrane potential was strongly influenced by the synchronized input, which produced large EPSPs. The average firing rate was higher for s = 0.3 than for s = 0. Figure 2a shows a typical family of curves depicting the dependence of fo on w and s, for N = 100 and fi = 25 Hz.
Table 1: Parameters of the Model

Passive parameters
  Global variables: r_a = 0.2 Ω-m; r_m = 1.5 Ω-m²; c_m = 0.01 F m⁻²
  Dimensions (length × diameter):
    Soma: 100 × 10 μm
    Apical 1: 50 × 2 μm
    Apical 2: 50 × 1.5 μm
    Apical 3: 50 × 1.0 μm
    Apical 4: 50 × 1.0 μm

Active parameters (one row per channel: fast Na+, delayed rectifier K+, A-like K+)
  Columns: Channel, g (S/cm²), α_act, β_act, α_inact, β_inact

Synapse
  gsyn(t) = g(e^(-t/1.0) - e^(-t/0.5))
  w = maximum[gsyn(t)] = 1.25 to 6.25 nS
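As a quick consistency check on the passive entries recovered above (the units of r_m and c_m are as we read them from the scan), the product r_m c_m reproduces the 15-msec membrane time constant reported in Section 3:

```python
# tau_m = r_m * c_m for the passive parameters of Table 1 (units as read).
r_m = 1.5    # ohm m^2, specific membrane resistance
c_m = 0.01   # F / m^2, specific membrane capacitance
print(f"tau_m = {r_m * c_m * 1e3:.0f} msec")  # -> 15 msec, matching Section 3
```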
For low values of w, the output increased with increasing synchrony. For instance, for an EPSP of 150 μV, no spike was evoked in the model neuron at s < 0.3. At s = 0.3, an average of four spikes was evoked per second, and at 100% synchrony (s = 1), the average output was 10 Hz. Unitary EPSPs of 250 μV (w = 5 nS) evoked an average output frequency of about 20 Hz at all levels of synchrony. With even larger EPSPs, fo actually decreased with increasing synchrony. This occurred when the synaptic strength was large enough to make fo greater than fi at low synchrony. At high synchrony all inputs were effectively lumped into one source, and since each EPSP evoked at most one spike in the model, fo could not be greater than fi. The excess depolarizing input beyond the amount required to reach threshold was effectively wasted. fo was actually less than fi for the highly synchronized inputs, since some of the input spikes arrived during the cell's refractory period (see below). At low levels of synchrony, the individual synaptic currents summed more smoothly and led to a relatively regular firing rate, which could be greater than fi. As a simple analogy, injecting 2 nA for 20 msec into the soma with a microelectrode can evoke multiple spikes, while injecting a
20 nA current pulse for 2 msec would evoke only 1 spike, even though the total charge is the same. For most of the parameters considered, the coefficient of variation of fo (variance/mean) increased with increasing synchrony. When the inputs were highly synchronized, the output spikes were more tightly linked with the inputs and therefore tended to have the same statistics (Poisson) as the inputs. Simulations were performed for other values of N and fi, and the results were qualitatively similar to those in Figure 2a.

2. N and w held constant, fi varied: Figure 2b shows the dependence of fo on s for different values of fi [N = 100 and w = 5 nS (250 μV)]. At low values of fi, increasing synchrony led to an increase in fo. For example, for fi = 10 Hz, no spike was evoked until s = 0.2. For s > 0.2, increasing the synchrony led to a steady increase in fo. As discussed above, this is because for low values of fi, the membrane potential of the cell seldom reached threshold if the inputs were independent. Synchronizing the inputs led to more frequent threshold crossings and to a greater fo. For fi = 25 Hz, fo was relatively independent of synchrony, and for fi > 40 Hz, fo decreased with increasing synchrony. This behavior is due to the cumulative effect of the refractory period. If the refractory period after each spike is r, then the total amount of refractory time per second is rfo, where r represents the entire postspike period during which the fully synchronized input is incapable of producing a spike (r depends on w in addition to the intrinsic parameters of the cell). When input spikes are fully synchronized (s = 1) and are sufficient to evoke a postsynaptic spike from resting level, the output firing rate at s = 1 (denoted fo′) is given by
fo′ = fi(1 − rfo′)   (3.1)

Therefore,

fo′ = fi/(1 + rfi)   (3.2)
At low values of fi, rfi << 1 and fo′ is almost equal to fi. As fi increases, the rfi term causes fo′ to become significantly less than fi (Fig. 3). At the other extreme of zero synchrony, the output frequency (fo⁰) can be determined from the steady-state f-I curve (Fig. 1a), with the injected current approximated by the sum of all the asynchronous synaptic inputs. For the parameters considered in Figure 2b, the fo⁰ and fo′ curves crossed at fi ≈ 40 Hz (Fig. 3). Above that value of fi, fo decreased with increasing s. Although this crossover point was different for other values of N and w, qualitatively similar results were obtained.

3. w and fi held constant, N varied: Changing N was essentially equivalent to changing fi, because each of the inputs had the same mean rate and statistics, and the sum of Poisson processes is another Poisson process. However, by keeping the number of inputs and the mean rate of each distinct, synchrony could be interpreted physically by assigning one
Figure 3: Dependence of fo on fi at the two extremes of input synchrony, for N = 100, w = 5 nS (250 μV). fo′ is the curve for s = 1, calculated using equation 3.2 with r = 3 msec (the refractory period for the specific N and w used). fo⁰ plots the output frequency for s = 0, as estimated from the f-I curve in Figure 1a, assuming that the total injected current was N fi V̄s ∫ gsyn(t) dt, where V̄s is the average potential difference driving the synaptic currents and gsyn(t) is the unitary synaptic conductance change (equation 2.1). The two curves cross over at fi ≈ 40.
presynaptic neuron to each input. For this reason, N and fi were treated explicitly as two separate parameters. In the parameter range we have considered, increasing N had an effect similar to that of increasing fi. That is, for a given w and fi, for low values of N, fo increased with s, and above a certain value of N, fo decreased with s. It is of interest to determine how the three factors N, w, and fi can be combined to predict when synchrony among the inputs will increase the firing rate of the cell. Synchrony was considered to have a significant effect on fo if fo increased by at least 10% when synchrony was changed from 0 to 1. With this criterion, synchrony was found to have a significant effect on fo if the product of the three factors N, w, and fi, denoted by P, was less than 9.4 μS sec⁻¹. That is,
P = Nwfi < 9.4 μS sec⁻¹   (3.3)
If w is expressed as EPSP amplitude, this value would correspond to 0.5 V sec⁻¹ (assuming a linear relation between the conductance and the
resultant EPSP at rest). This particular cutoff for P depends on the choice of model cell and parameters. For instance, if the resting membrane time constant is halved, synchrony will be effective for P up to 18.8 μS sec⁻¹.
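Equations 3.2 and 3.3 are easy to evaluate numerically. The sketch below is our own worked example, using the r = 3 msec refractory period quoted for Figure 3 and parameter values discussed in the text:

```python
# Worked example of equations 3.2 and 3.3.
r = 3e-3  # sec; refractory period for N = 100, w = 5 nS (Fig. 3)

def fo_full_sync(fi):
    """Equation 3.2: fully synchronized output rate fo' = fi / (1 + r*fi)."""
    return fi / (1.0 + r * fi)

for fi in (10.0, 25.0, 40.0, 100.0):
    print(f"fi = {fi:5.1f} Hz -> fo' = {fo_full_sync(fi):5.1f} Hz")
# fo' falls progressively below fi as r*fi grows, producing the crossover
# with the zero-synchrony curve near fi = 40 Hz (Fig. 3).

# Criterion of equation 3.3: synchrony raises fo when P = N*w*fi is small.
N, w, fi = 100, 3.75e-9, 25.0        # 200-uV EPSPs at 25 Hz, as in the text
P = N * w * fi                       # Siemens per second
print(f"P = {P*1e6:.2f} uS/sec (criterion: P < 9.4 uS/sec)")
```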
3.2 Time-Varying Inputs. The above results pertain to measures of average steady-state firing rates of the neuron. To determine whether these results would also hold for changing rates, we measured responses to step and sinusoidal variations in input rates.

1. Step changes in fi: For a particular value of N, w, and s, a step increase in fi from 25 to 50 Hz was introduced at t = 0 sec. A histogram was constructed from the spike trains showing the firing rates before and after this change, as illustrated in Figure 4a. At t = 0 sec the firing rate increased to a new level within 15 msec for s = 0.4, comparable to the time constant of the membrane. This transient phase was shorter for s = 0.8, because the synchronous inputs dominated the response and caused the increase in firing rate to occur faster than the membrane time constant. Similar results were obtained for other step changes in rate. In addition to the transient due to the membrane time constant, a phasic component in the response would be expected to occur if the model had included membrane currents that cause adaptation of firing frequency. For parameter values in the ranges considered in the previous section, the output frequencies were predicted well by the values of fo obtained from the steady-state simulations.

2. Sinusoidal variation of fi: The input firing rate was modulated sinusoidally about a mean value f̄i, as determined by the formula
fi(t) = f̄i[1 + δ sin(2πνt)]   (3.4)
The modulation frequency ν was arbitrarily set to 10 Hz, and δ, the depth of modulation about the mean, was set to 0.5. For a simulation of 40-sec duration, a cycle-triggered time histogram was compiled by aligning the responses on a specific phase. Figure 4b shows one such compilation for two values of synchrony, s = 0 and s = 0.5. For low synchrony, the output was a distorted sinusoid, with sharp responses at the peaks and cutoff in the troughs. The distortion was produced by modulation around threshold and the low-pass characteristics of the membrane, which led to a slow depolarization during the positive phase of the input sinusoid, producing output spikes near the peak. Moderate synchrony in the inputs produced more jitter in the spike times, making the fo sinusoids smoother. The response to sinusoidal modulations of input frequency was close to that predicted by steady-state curves such as those in Figure 2a for values of s ≥ 0.3 (Fig. 4b, bottom). For s < 0.3, the sinusoidal inputs led to a greater peak fo than predicted by the steady state and a smaller fo at the troughs (Fig. 4b, top). This occurred because the synaptic currents during the troughs of the input sinusoids were insufficient to evoke spikes by themselves but helped to slowly depolarize the cell. In effect, the spiking near the peaks of the sinusoids included contributions
Figure 4: Effect of synchrony on responses to modulated inputs. (a) Histograms of fo when the input frequency was stepped from 25 to 50 Hz at t = 0 sec. N = 100, w = 3.75 nS (200 μV), and s was 0.4 (black bars) or 0.8 (light bars). The binwidth was 3 msec. (b) Histograms of spikes from sinusoidally modulated inputs (fi), from the model cell (fo), and the predicted steady-state response (fss), for two values of s. N = 100, w = 2.5 nS (150 μV), and fi was varied sinusoidally about a mean frequency of 50 Hz. With all independent inputs (s = 0), the spikes were clustered near the peak of the input frequency (top). When s = 0.5, the spikes were distributed more evenly and were predicted well by steady-state rates.
of synaptic currents from the troughs. Therefore, the fo obtained from steady-state curves underestimated the firing rates at the peaks of the sinusoids.

3.3 Cross-Correlations. The results above show that for certain parameters, synchrony does not alter the output firing rate. For those parameters, we examined the cross-correlation histogram (CCH) between the synchronous input and the output spikes; the CCH peak area measures the probability of evoking a postsynaptic spike in association with the input spikes. Figure 5 shows a simulation in which N = 100, fi = 50 Hz, w = 2.5 nS (150 μV), and the levels of synchrony were 0.1, 0.3, 0.5, 0.7, and 0.9. Figure 5a shows the CCH for three levels of synchrony in the input (aligned on the spikes in synchronous units). For these parameters, the average firing rate of the postsynaptic cell did not change with increasing synchrony, as seen by the similar baselines in the histograms. However, increasing synchrony produced an increase in the CCH peak. The cumulative sum (cusum) of the CCH was calculated as the integral of the CCH values minus the average pre-trigger counts (to the left of the origin). The resultant cusum levels show that the probability of evoking a postsynaptic spike for every synchronized input spike increased with increasing input synchrony (Fig. 5b). However, the probability normalized to the number of inputs that were synchronized (the area under the CCH peak divided by sN) was actually smaller for higher values of s, because the excess depolarization above that required to fire the cell increased with synchrony. Therefore, the effective contribution of an individual synchronized input decreased as other synchronized inputs were added.
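The CCH and cusum measures described in this section can be sketched as follows. This is a generic reimplementation of the analysis as we read it, not the authors' code; the 40-msec half-width and 3-msec binwidth are assumptions taken from the delay axes of Figure 5 and the binwidth mentioned for Figure 4.

```python
import numpy as np

def cch(trigger_times, spike_times, halfwidth=0.040, binwidth=0.003):
    """Cross-correlation histogram: counts of output spikes at each delay
    (sec) relative to the synchronized-input spike times."""
    edges = np.arange(-halfwidth, halfwidth + binwidth, binwidth)
    counts = np.zeros(len(edges) - 1)
    for t0 in trigger_times:
        d = spike_times - t0
        counts += np.histogram(d[(d >= -halfwidth) & (d < halfwidth)],
                               bins=edges)[0]
    return edges, counts

def cusum(counts, n_pre_bins):
    """Integral of the CCH minus the mean pre-trigger count, as described
    in the text; its final level estimates spikes evoked per input spike."""
    baseline = counts[:n_pre_bins].mean()
    return np.cumsum(counts - baseline)
```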
4 Discussion

A major objective of this study was to determine parametrically the conditions under which synchrony in the inputs affected the average output firing rate of a biophysical model of a cortical neuron. An interesting finding was that increased synchrony could decrease as well as increase the average firing rate. For small values of input firing rate, number of inputs, and EPSP amplitude, synchrony could play an important role in increasing the output firing rate. In this parameter range, the membrane potential trajectory was relatively smooth for asynchronous inputs, whereas synchronized inputs caused large deviations in the membrane potential, which often led to spikes (Fig. 1b). The size of unitary EPSPs in cortical pyramidal neurons is in the range of a few hundred microvolts (Thomson et al. 1988; Fetz et al. 1991; Mason et al. 1991; Chen and Fetz 1993). Using a value of 3.75 nS (200 μV) for w, equation 3.3 predicts that synchrony will increase the firing rate when the total rate at which input spikes arrive is less than 2500 sec⁻¹. When N, fi, and w were large, asynchronous inputs could produce output rates higher than the firing rates
Figure 5: (a) Cross-correlation between the synchronized spikes and the output spikes at three levels of synchrony [N = 100, w = 2.5 nS (150 μV), fi = 50 Hz]. The background frequency of the model cell was around 30 Hz for all three cases. The EPSP (E) is shown on the same time-scale for comparison. (b) The cumulative sum of the correlograms for five levels of synchrony, indicated at the right-hand side of each line. The small rise in the cusum at around -30 msec is due to the rhythmic firing of the postsynaptic cell.
of the individual inputs. In contrast, highly synchronized inputs caused at most one spike for each large synchronized EPSP. Therefore, although there is an efficient one-to-one transfer of spiking, fo could not exceed fi. So, in this parameter regime (large Nwfi), synchrony could actually
reduce fo. In agreement with our findings, Bernander et al. (1994) have recently shown that the number of output spikes evoked by a fixed number of inputs to an integrate-and-fire model increases with synchrony up to a point, after which further synchrony reduces the number of output spikes. They primarily measured the effectiveness of a fixed number of inputs in activating the postsynaptic cell from rest as a function of the temporal dispersion of the inputs. For repetitive inputs to an integrate-and-fire model, they determined analytically the parameter space in which synchrony could increase the output firing rate. Their equation for the boundary between the parameter domains in which synchrony increased and decreased output firing rates (equation 5 in Bernander et al. 1994) can be approximated by the equation Nwfi = 0.6 V sec⁻¹ for fi < 100 (using τm = 17 msec and a threshold 10 mV above resting membrane potential), which is close to our equation 3.3. The fact that the behavior of our biophysical model is in close quantitative agreement with calculations from their integrate-and-fire model further strengthens the overall conclusions. The range of parameters in which synchrony could increase firing rates would be altered if the model neuron were made more complex in various ways: (1) Multiple spikes per EPSP could be produced by active membrane currents such as the persistent sodium current, calcium currents that cause bursts of spikes, and long NMDA currents in the postsynaptic response. (2) If membrane currents that exhibit strong inactivation (known to be present in cortical neurons) are included in the model, the neuron will be more sensitive to transient depolarizations, such as those arising from synchronous inputs, than to steady depolarization. It has been demonstrated experimentally that EPSPs and transient current pulses injected into cat neocortical neurons in vitro increase firing rates more effectively than similar net currents injected steadily (Reyes and Fetz 1993). (3) If inputs are distributed over the dendrites, synchronized inputs can occur on different parts of the dendritic tree, resulting in less cancellation. Further, dendrites with active currents can alter the effective time constant locally and make the neuron more sensitive to specific patterns of inputs (Softky and Koch 1992). Further simulations using models that include these features would be necessary to address these issues. The results from our simulations of the steady-state condition were compared to the instantaneous response to modulated inputs. For step changes in input frequency and sinusoidally modulated input frequency, the membrane time constant dominated the transient response under conditions of low input synchrony. The cell behaved as an integrator of synaptic inputs for low input synchrony, and produced spikes when threshold was reached. For highly synchronized inputs, the cell produced spikes predominantly in response to synchronized inputs, and the time taken for the membrane potential to reach threshold became much smaller. Thus synchrony overcame the slow transient due to the mem-
brane time constant, thereby mitigating the low-pass behavior of the neuron. Instead of using fo, the effect of synchrony could also be assessed by the probability of evoking a spike in the output cell for each input spike; the net probability increased with increasing synchrony. In our simulations this was observed as an increase in the peak of the cross-correlation between the synchronous input and the output spike trains. However, the probability of evoking an output spike normalized to the number of inputs that were synchronized decreased with increasing synchrony. Nevertheless, the reliability of evoking a spike within a short interval following a particular event (such as the occurrence of synchronous input spikes) could be important for information coding and transmission, and increasing input synchrony would clearly assure this. Such temporal precision in spiking could also be important for evoking synchronous firing in target neurons. Kenyon et al. (1990) showed that when a rate-modulated signal is propagated across many layers of leaky integrate-and-fire neurons, adding a degree of synchrony to the inputs increases the fidelity of transmission of the signal when the firing rates are low. They also found that if the inputs to the first layer had low degrees of synchrony, successive layers had diminishing synchrony, as measured by the autocorrelation of spikes pooled from all neurons in a layer. However, for input synchrony ≥ 30%, the synchrony between cells of a layer was enhanced in successive layers. Abeles (1982, 1991) also studied synchronous transmission in networks as a function of various parameters and found that for low values of unitary EPSPs and input frequencies, transmission of activity through networks was more effective when neurons within layers were synchronized. The existence of synchronous activity in the cortex has been documented extensively (Eckhorn et al. 1988; Gray and Singer 1989; Fetz et al. 1991; Murthy and Fetz 1992; Nelson et al. 1992). In addition to extracellular recording studies, intracellular recordings of cortical cells show that the membrane potentials exhibit large fluctuations that are synchronized with waves in local field potentials (Jagadeesh et al. 1992; Chen and Fetz 1993). Is this synchronous activity likely to be useful in increasing firing rates in cortical neurons, as compared to asynchronous activity with similar mean rates? Our model would predict that firing rates during oscillatory episodes would increase compared to periods just prior to the episodes if the prior rates were low, and decrease if the prior rates were high. The transition would depend on the total input rate (Nfi) to the neuron. Preliminary analysis of single neurons recorded in the sensorimotor cortex of monkeys indicates that for units that were synchronized with LFP oscillations, firing rates during the oscillatory episodes increased for low prior rates and decreased for high prior rates, as predicted. This effect was not observed in neurons that were not synchronized with the LFP oscillations (Murthy and Fetz, unpublished observations).
In summary, for a simple biophysical model of a cortical neuron with three active conductances, the spike rate could be increased significantly by increasing the synchrony among the inputs without necessarily altering the mean firing rates of the inputs. The range of parameters for which this effect is significant includes biologically plausible values (Nwfi < 9.4 μS sec⁻¹, or 0.5 V sec⁻¹). At higher firing rates, synchrony in the inputs leads to a decrease in the average rates.
Acknowledgments This work was supported by NIH Grants NS12542 and RR00166. V. N. Murthy was supported in part by a predoctoral fellowship from the W. M. Keck Foundation.
References

Abeles, M. 1982. Role of cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci. 18, 83-92.
Abeles, M. 1991. Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge.
Bernander, O., Douglas, R. J., Martin, K. A. C., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Bernander, O., Koch, C., and Usher, M. 1994. The effect of synchronous inputs at the single neuron level. Neural Comp. 6, 622-641.
Bouyer, J. J., Montaron, M. F., and Rougeul, A. 1981. Fast fronto-parietal rhythms during combined focused attentive behavior and immobility in cat: Cortical and thalamic localizations. Electroenceph. Clin. Neurophysiol. 51, 244-252.
Bush, P. C., and Sejnowski, T. J. 1993. Simulations of synaptic integration in neocortical pyramidal cells. In Computation and Neural Systems, F. H. Eeckman and J. M. Bower, eds., pp. 97-101. Kluwer Academic Publishers, Boston.
Chen, D.-F., and Fetz, E. E. 1993. Effect of synchronous neural activity on synaptic transmission in primate cortex. Soc. Neurosci. Abstr. 19, 781 (319.7).
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., Kreiter, A. K., König, P., and Singer, W. 1991. Synchronization of oscillatory neuronal responses between striate and extrastriate visual cortical areas of the cat. Proc. Natl. Acad. Sci. U.S.A. 88, 6048-6052.
Fetz, E. E., Toyama, K., and Smith, W. S. 1991. Synaptic interactions between cortical neurons. In Cerebral Cortex, A. Peters, ed., Vol. 9, pp. 1-47. Plenum Press, New York.
Freeman, W. J., and Van Dijk, B. W. 1987. Spatial patterns of visual cortical fast EEG during conditioned reflex in a rhesus monkey. Brain Res. 422, 267-276.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in the cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Hestrin, S. 1992. Activation and desensitization of glutamate-activated channels mediating fast excitatory synaptic currents in the visual cortex. Neuron 9, 991-999.
Jagadeesh, B., Gray, C. M., and Ferster, D. 1992. Visually evoked oscillations of membrane potential in cells of cat visual cortex. Science 257, 552-554.
Kenyon, G. T., Fetz, E. E., and Puff, R. D. 1990. Effect of firing synchrony on signal propagation in layered networks. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 141-148. Morgan Kaufmann, San Mateo, CA.
Llinas, R., and Ribary, U. 1993. Coherent 40-Hz oscillation characterizes dream state in humans. Proc. Natl. Acad. Sci. U.S.A. 90, 2078-2081.
Lytton, W. W., and Sejnowski, T. J. 1991. Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol. 66, 1059-1079.
Mason, A., Nicoll, A., and Stratford, K. 1991. Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J. Neurosci. 11, 72-84.
Murthy, V. N., and Fetz, E. E. 1992. Coherent 25-35 Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Natl. Acad. Sci. U.S.A. 89, 5670-5674.
Murthy, V. N., and Fetz, E. E. 1993. Effects of input synchrony on the response of a model neuron. In Computation and Neural Systems, F. H. Eeckman and J. M. Bower, eds., pp. 475-479. Kluwer Academic Publishers, Boston.
Nelson, J. I., Salin, P. A., Munk, M. H.-J., Arzi, M., and Bullier, J. 1992. Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Vis. Neurosci. 9, 21-37.
Reyes, A. D., and Fetz, E. E. 1993. Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol. 69, 1673-1683.
Sanes, J. N., and Donoghue, J. P. 1993. Oscillations in local field potentials of the primate motor cortex during voluntary movement. Proc. Natl. Acad. Sci. U.S.A. 90, 4470-4474.
Softky, W. R., and Koch, C. 1992. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13, 334-350.
Stafstrom, C. E., Schwindt, P. C., and Crill, W. E. 1984. Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol. 52, 264-277.
Thomson, A. M., Girdlestone, D., and West, D. C. 1988. Voltage-dependent currents prolong single-axon postsynaptic potentials in layer III pyramidal neurons in rat neocortical slices. J. Neurophysiol. 60, 1896-1907.
Wilson, M. A., and Bower, J. 1989. The simulation of large-scale neural networks. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 291-334. MIT Press, Cambridge, MA.

Received July 9, 1993; accepted January 24, 1994.
Communicated by Idan Segev
The Functional Role of Excitatory and Inhibitory Interactions in Chopper Cells of the Anteroventral Cochlear Nucleus

Ying-Cheng Lai
Raimond L. Winslow
Murray B. Sachs
Department of Biomedical Engineering and Center for Hearing Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA

Chopper cells in the anteroventral cochlear nucleus of the cat maintain a robust rate-place representation of vowel spectra over a broad range of stimulus levels. This representation resembles that of low threshold, high spontaneous rate primary auditory-nerve fibers at low stimulus levels, and that of high threshold, low spontaneous rate auditory-nerve fibers at high stimulus levels. This has led to the hypothesis that chopper cells in the anteroventral cochlear nucleus selectively process inputs from different spontaneous rate populations of primary auditory-nerve fibers at different stimulus levels. We present a computational model, making use of shunting inhibition, for how this level-dependent processing may be performed within the chopper cell dendritic tree. We show that this model (1) implements level-dependent selective processing, (2) reproduces detailed features of real chopper cell post-stimulus-time histograms, and (3) reproduces nonmonotonic rate-versus-level functions measured in response to single tones.

1 Introduction

The neural encoding of acoustic stimuli begins with the discharge patterns of primary auditory-nerve fibers projecting from the cochlea to the central auditory system. Two distinct populations of auditory-nerve fibers can be defined on the basis of acoustic response properties. The first population exhibits low stimulus thresholds, high spontaneous discharge rates, and a narrow dynamic range. These fibers are referred to as high spontaneous rate fibers (spontaneous rate > 20 spikes/sec). The second population exhibits higher stimulus thresholds, medium to low spontaneous rates (0 < spontaneous rate ≤ 20), and a broad dynamic range. These fibers are referred to as low spontaneous rate fibers. Rate-place representations (discharge rate plotted as a function of auditory-nerve fiber most sensitive, or best, frequency) derived from responses of auditory-nerve fibers are capable of encoding stimulus spectra
over a wide range of sound levels. In particular, rate-place representations of vowel spectra computed from responses of low-threshold, high spontaneous rate auditory-nerve fibers encode vowel formant frequencies at low stimulus levels (Sachs and Young 1979). At high stimulus levels, formant peaks present in rate-place representations derived from responses of high spontaneous rate auditory-nerve fibers disappear because of the limited dynamic range of this population, as well as effects of a cochlear nonlinearity known as two-tone suppression (Sachs and Kiang 1968). In this case, formant peaks are maintained in representations derived from the responses of the high threshold, low spontaneous rate fibers (Sachs and Young 1979).

The cochlear nucleus is the first central nucleus of the mammalian auditory system. As such, it is the initial site at which acoustic information from primary auditory-nerve fibers is divided into parallel streams, each of which undergoes processing and encoding for transmission to higher auditory centers. In this paper we investigate the acoustic information processing performed by one of these parallel pathways, the stellate cells of the anteroventral cochlear nucleus. Studies have demonstrated that, unlike auditory-nerve fibers, stellate cells exhibit highly regular discharge patterns in response to short best-frequency tone bursts (Young et al. 1988; Blackburn and Sachs 1989). These regular discharges are a result of significant temporal integration of excitatory poststimulus potentials generated by auditory-nerve fiber inputs. This gives rise to multiple peaks in post-stimulus-time histograms in response to such tone bursts, hence the name "chopper cells." Three classes of chopper cells have been defined on the basis of the mean interspike interval μ_isi(t), standard deviation σ_isi(t), and coefficient of variation C_v(t). These classes are (1) sustained choppers [Chop-S; average C_v(t) of 0.05-0.35, constant μ_isi(t) plots], (2) slowly adapting choppers [average C_v(t) 0.05-0.7, linearly increasing μ_isi(t) plots], and (3) transient choppers [Chop-T; average C_v(t) > 0.35, rapidly adapting μ_isi(t) plots].

The data of Blackburn and Sachs (1990) on rate-place representations of steady-state vowel spectra derived from responses of chopper cells in the anteroventral cochlear nucleus indicate that unlike primary auditory-nerve fibers, these cells maintain peaks in discharge rate corresponding to vowel formant frequencies over a broad range of stimulus levels. This finding is especially true for Chop-T cells. At low stimulus levels, rate-place representations computed from responses of Chop-T cells appear similar in shape to those computed from responses of low threshold high spontaneous rate primary auditory-nerve fibers. At high stimulus levels, rate-place profiles for Chop-T cells are similar to those computed from high threshold low spontaneous rate fibers (Blackburn and Sachs 1990). These data suggest that chopper cells perform a level-dependent selective processing of auditory-nerve fiber inputs, a hypothesis presented previously on the basis of theoretical considerations (Winslow et al. 1987). The mechanism proposed by Winslow et al. (1987) is based on the principle of excitatory and inhibitory interactions (Koch et al. 1982), as shown in Figure 1.
Figure 1: A schematic illustration of the hypothesized neural circuit for performing level-dependent selective processing of primary auditory-nerve fiber inputs by chopper cells in the anteroventral cochlear nucleus.

In its initial form, high spontaneous rate auditory-nerve fibers with best frequency equal to that of the target chopper cell are assumed to make excitatory synapses on the distal regions of the dendritic tree. Low-medium spontaneous rate auditory-nerve fibers from the same best-frequency region are assumed to form excitatory inputs on the proximal dendrites and/or on the soma. Off best-frequency auditory-nerve fibers are assumed to project to interneurons that form inhibitory synapses on the target chopper cell. These inputs are positioned on the direct path that current must take when flowing from the distal synapses to the soma (on-path inhibition). Under these assumptions, at low stimulus levels, only the high spontaneous rate auditory-nerve fibers will be activated. Distal inputs from the low threshold fibers will therefore drive the chopper cell at low stimulus levels. As stimulus intensity increases, spread of excitation along the basilar membrane would result in activation of off best-frequency fibers, which in turn provide on-path inhibitory inputs via interneurons to the chopper cell. On-path inhibition can be very effective in vetoing distal excitation, but highly ineffective in reducing somatic response to more proximal excitation. Thus, at high stimulus levels, synaptic current from distal high spontaneous rate auditory-nerve fiber inputs will be shunted, allowing low-medium spontaneous rate auditory-nerve fibers innervating the proximal region to drive the chopper cell.

In this paper we show that the above model of auditory-nerve fiber synaptic innervation within the chopper cell dendritic tree and soma supports level-dependent selective processing, and can successfully account for post-stimulus-time histograms and rate-versus-level functions
(plots of discharge rate versus best-frequency tone level) measured in real chopper cells (Blackburn and Sachs 1990).
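The veto asymmetry invoked above can be checked with a small steady-state calculation. The following Python sketch is ours and is not part of the model described in this paper: it solves the steady-state voltages of a toy three-compartment passive chain (distal, middle, soma) with arbitrary round-number conductances, and compares the somatic response to distal versus somatic excitation when a shunting synapse (E_i = E_r) is placed on or off the path.

import numpy as np

# Toy 3-compartment chain: distal (0) -- middle (1) -- soma (2).
# All conductance values are arbitrary demo numbers (assumptions),
# not the Banks and Sachs (1991) parameters used in Section 2.
E_r, E_e = -60.0, 0.0          # resting and excitatory reversal potentials (mV)
E_i = E_r                      # shunting inhibition reverses at rest
g_leak, g_axial = 1.0, 5.0     # leak and axial conductances (arbitrary units)

def soma_voltage(exc_at, inh_at, g_exc=3.0, g_inh=40.0):
    """Steady-state somatic voltage; exc_at/inh_at list compartment indices."""
    n = 3
    G = np.zeros((n, n)); b = np.zeros(n)
    for i in range(n):                     # leak and axial coupling
        G[i, i] += g_leak; b[i] += g_leak * E_r
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                G[i, i] += g_axial; G[i, j] -= g_axial
    for i in exc_at:                       # maintained excitatory conductances
        G[i, i] += g_exc; b[i] += g_exc * E_e
    for i in inh_at:                       # maintained shunting conductances
        G[i, i] += g_inh; b[i] += g_inh * E_i
    return np.linalg.solve(G, b)[2]        # somatic voltage

print("distal exc alone:            %.1f mV" % soma_voltage([0], []))
print("distal exc + on-path shunt:  %.1f mV" % soma_voltage([0], [1]))
print("somatic exc alone:           %.1f mV" % soma_voltage([2], []))
print("somatic exc + distal shunt:  %.1f mV" % soma_voltage([2], [0]))

With these numbers the on-path shunt collapses the response to distal excitation (from about -39 mV to about -58 mV, nearly back to rest), while the same shunt placed distally only mildly reduces the response to somatic excitation; this is exactly the asymmetry the selective processing hypothesis exploits.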
2 A Compartmental Model for Chopper Cells

We use a simple cable model of chopper cells formulated originally by Banks and Sachs (1991). The model consists of a passive uniform cylinder of diameter 7.0 μm (equivalent to 6 identical 2.12-μm-diameter dendrites by Rall's 3/2 power law), space constant λ = 1066.5 μm, and electrotonic length L = 1. The cylinder is divided into 20 compartments, where the distal compartments are numbered C1-C16, the proximal compartments are C17-C20, and the soma is C21. The soma membrane conductance G_s is set to 1.96 nS. This value was computed assuming a spherical soma diameter of 25 μm and a specific membrane resistance of 10,000 Ω-cm². The resulting input resistance (510 MΩ) and membrane time constant of 10 msec (specific membrane capacitance C_m = 1 μF/cm²) are within the range measured by whole-cell recording in enzymatically dissociated chopper cells from guinea pig VCN (447 ± 265 MΩ; Manis and Marx 1991). Axial resistance R_i was set to 150 Ω-cm.

Synaptic inputs are modeled as maintained conductances in series with batteries representing reversal potentials. Inhibition in chopper cells of the cochlear nucleus slice preparation is known to be mediated by glycine-evoked changes in membrane permeability to chloride ions. The reversal potential of this inhibition (E_i) is equal to or near the cell resting potential E_r (Oertel 1983; Wu and Oertel 1986). We therefore model inhibition as the shunting type (E_i = E_r). The excitatory conductance change G_e is chosen to be 6G_0, where G_0 = 1.26 nS is the membrane conductance for a single compartment (Wang 1991). The inhibitory conductance is G_i = 80G_0. The time to peak of the excitatory conductance change, t_e, is set to 0.25 msec (Banks and Sachs 1991) so that the excitatory poststimulus potentials have half-widths comparable with those reported in chopper cells of the cochlear nucleus slice preparation (3-4 msec) (Wu and Oertel 1984, 1986). There are no precise estimates of the half-width of inhibitory poststimulus potentials, but they are clearly larger than those for excitatory poststimulus potentials (Wu and Oertel 1984). The time to peak of the inhibitory conductance change, t_i, is therefore chosen in the range of 0.25-2 msec in our studies.

The spike generator was modeled by the Hodgkin-Huxley-type voltage-gated membrane currents (Banks and Sachs 1991) incorporating the modifications of Wang (1991); e.g., the thresholds of the sodium and potassium channels are shifted in the depolarizing direction by 9 mV. This shift is required to fit the relationship between input current magnitude and chopper cell firing rate measured by Oertel (Oertel 1983; Wang 1991). A consequence of this shift is that the spontaneous discharge rate of the chopper cell can be made negligible despite the presence of high spontaneous rate auditory-nerve fiber inputs. This is consistent with the
observation that the majority of chopper cells have a spontaneous rate less than 1 spike/sec (Blackburn and Sachs 1989).

Occurrence times of synaptic inputs from auditory-nerve fibers to chopper cells are modeled using dead-time-modified inhomogeneous Poisson counting processes (Barta and Young 1986), with the dead time set to 700 μsec. The process is characterized by an instantaneous rate function that matches the rate adaptation pattern of real auditory-nerve fibers (Rothman 1991). High spontaneous rate and low-medium spontaneous rate auditory-nerve fiber inputs can be obtained by specifying different parameter values for the rate function (Rothman 1991). We use spontaneous rates of 68 spikes/sec and 0.5 spike/sec for high and low-medium spontaneous rate fibers, respectively. Firing patterns of the inhibitory interneurons are assumed to be identical to those of high spontaneous rate auditory-nerve fibers. Synaptic conductance changes evoked by each input were modeled by α-functions (Rall 1964).

To calculate the post-stimulus-time histogram, the mean interspike interval μ_isi(t) and its standard deviation σ_isi(t), and the corresponding coefficient of variation C_v(t) = σ_isi(t)/[μ_isi(t) − τ_d], where τ_d = 0.7 msec is the dead time, of the model chopper cell, 800 statistically independent trials are run for each auditory-nerve fiber input configuration. Post-stimulus-time histograms and plots of μ_isi(t), σ_isi(t), and C_v(t) are smoothed using a moving triangular filter with a window of 1.68 msec.
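To make the input model concrete, here is a minimal Python sketch of a dead-time-modified inhomogeneous Poisson spike generator and a smoothed PSTH. The 700-μsec dead time and the 1.68-msec triangular smoothing window are taken from the text; the exponential onset-adaptation rate profile, the trial count, and the bin width are our stand-in assumptions for the fitted rate functions of Rothman (1991), which are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
DEAD = 0.7e-3                     # dead time: 700 microseconds (from the text)

def rate(t, r_peak=300.0, r_ss=160.0, tau_adapt=10e-3):
    # Assumed exponential onset adaptation; the paper instead uses rate
    # functions fitted to real auditory-nerve data (Rothman 1991).
    return r_ss + (r_peak - r_ss) * np.exp(-t / tau_adapt)

def spike_train(t_end=0.08, dt=1e-5):
    # Dead-time-modified inhomogeneous Poisson process via time-step thinning.
    spikes, last = [], -np.inf
    for t in np.arange(0.0, t_end, dt):
        if t - last > DEAD and rng.random() < rate(t) * dt:
            spikes.append(t); last = t
    return np.array(spikes)

# PSTH over independent trials (the paper runs 800; 200 keeps the demo fast),
# smoothed with a moving triangular filter of roughly 1.68 msec width.
trials, bin_w = 200, 0.42e-3
edges = np.arange(0.0, 0.08 + bin_w, bin_w)
counts = np.zeros(len(edges) - 1)
for _ in range(trials):
    counts += np.histogram(spike_train(), bins=edges)[0]
psth = counts / (trials * bin_w)                      # spikes/sec
tri = np.array([1.0, 2.0, 3.0, 2.0, 1.0]); tri /= tri.sum()
psth_smooth = np.convolve(psth, tri, mode="same")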
3 Model Results

According to the selective processing hypothesis, at low stimulus levels, responses of chopper cells are determined by inputs from distally positioned high spontaneous rate auditory-nerve fibers. To simulate this situation, a single high spontaneous rate auditory-nerve fiber input with a maintained rate of 160 spikes/sec is applied to each of the 16 distal compartments. The resulting post-stimulus-time histogram is shown in Figure 2A. The post-stimulus-time histogram exhibits an initial period of chopping, leading to a steady-state discharge rate of 124 spikes/sec. Plots of μ_isi(t) and σ_isi(t) (Fig. 2B) as well as C_v(t) (Fig. 2C) also increase abruptly during the initial stimulus interval to reach plateau values, with an average C_v(t) value of 0.41 during the time interval 60-80 msec. This response can be classified as Chop-T (Blackburn and Sachs 1989).

Figure 2D illustrates the effect of on-path shunting inhibition, where one inhibitory input is placed at each of the distal compartments. The inhibition rate is set to 50 spikes/sec. For small excitatory conductance changes, inhibition co-located with excitation is highly effective at eliminating the response to the excitation (Rall 1964; Koch et al. 1982). Clearly, on-path inhibition has a powerful effect on the response of the model chopper cell. The chopping response seen subsequent to stimulus onset for the data of Figure 2A is completely eliminated.
Figure 2: Post-stimulus-time histogram (A), mean interspike interval μ_isi(t) (solid line) and standard deviation σ_isi(t) (dotted line) (B), and coefficient of variation C_v(t) (C) of a model chopper cell receiving one high spontaneous rate auditory-nerve fiber input in each of its 16 distal compartments. This simulates a possible synaptic input configuration for chopper cells at low stimulus levels. The average discharge rate and the averaged C_v(t) value in (60-80) msec are 124 spikes/sec and 0.41, respectively. This response can be classified as Chop-T. (D) The effect of one inhibitory input co-located with the high spontaneous rate auditory-nerve fiber input. The inhibition rate is 50 spikes/sec and t_i is 1 msec. It is clear that on-path inhibition is highly effective in reducing the cell response. The chopping pattern seen in (A) is completely eliminated.
Other combinations of the inhibition rate and t_i within physiologically reasonable ranges yield similar results. The general conclusion is that on-path shunting inhibition applied to the distal dendritic region can effectively veto excitation in that region.

At high stimulus levels, the selective processing hypothesis predicts that both excitatory synaptic inputs from low-medium spontaneous rate fibers to the proximal dendrite, and inhibitory synaptic inputs from interneurons to the distal dendrites, are activated. Discharge rates of high spontaneous rate auditory-nerve fibers providing distal excitatory synaptic inputs are saturated. To simulate this situation, we apply 8 low-medium fibers (with a maintained rate of 100 spikes/sec) in each of the four proximal compartments (C17-C20) and on the soma (C21), one inhibitory input (rate = 50 spikes/sec, t_i = 1 msec) in each distal compartment, and one high spontaneous rate fiber (rate = 160 spikes/sec) in each distal compartment. The choice of the number "8" is based on Liberman's data (Liberman 1991) indicating that chopper cells receive substantially more synaptic input from low-medium spontaneous rate fibers than from high spontaneous rate fibers in the proximal dendrites and soma. The resulting post-stimulus-time histogram is shown in Figure 3A, and the corresponding μ_isi(t) (solid line) and σ_isi(t) (dotted line) plots are shown in Figure 3B. The average discharge rate and average C_v(t) in the time period (60-80) msec are 231 spikes/sec and 0.383, respectively. This response can be classified as Chop-T (Blackburn and Sachs 1989). We have tested 15 other combinations of the inhibition rate and t_i; the chopping pattern exhibited in Figure 3A and B is robust. Since high spontaneous rate auditory-nerve fiber inputs in the distal region are shunted by inhibitory inputs in the same region (Fig. 2D), the main driving force responsible for the chopping pattern in Figure 3A is the low-medium spontaneous rate fibers in the proximal region. This can be seen in plots of the post-stimulus-time histogram (Fig. 3C) and the corresponding μ_isi(t) and σ_isi(t) (Fig. 3D) where there are only 8 low-medium fibers in the proximal/somatic region. Figures 3A-D thus indicate that inhibition in the distal dendritic region is ineffective in reducing somatic response to more proximal excitation. This conclusion holds for 15 other combinations of the inhibition rate and t_i. The results of Figures 2 and 3 are precisely what are required for chopper cells to perform selective processing. The post-stimulus-time histograms at low (Fig. 2A) and high (Fig. 3A) stimulus levels are similar to those observed in real chopper cells (Blackburn and Sachs 1989).

We next simulate the response of a model chopper cell, with synaptic inputs arranged according to the selective processing hypothesis, to best-frequency tones of varying level. The resulting model rate-versus-level functions can be compared to experimental data. To compute rate-versus-level functions for the chopper cell, it is necessary to compute discharge rates for populations of primary auditory-nerve fibers with best frequency near that of the model chopper cell. To do this we use a phenomenological model of the auditory periphery (Sachs and Abbas 1974; Sachs et al. 1989; Neti and Young 1992).
Figure 3: Plots of the post-stimulus-time histogram (A) and μ_isi(t) and σ_isi(t) (B) of a model chopper cell receiving 8 low-medium spontaneous rate fibers in each of 4 proximal compartments and on the soma, and one inhibitory input (with the same rate and t_i as in Fig. 2D) and one high spontaneous rate auditory-nerve fiber input in each of its 16 distal compartments. This simulates a possible input configuration at high stimulus levels. The average discharge rate and the averaged C_v(t) value in (60-80) msec are 231 spikes/sec and 0.383, respectively. This response can be classified as Chop-T. (C and D) Plots of the post-stimulus-time histogram (C) and μ_isi(t) and σ_isi(t) (D) of a model chopper cell receiving only 8 low-medium fibers in the proximal/somatic compartments. The average discharge rate is 282 spikes/sec. Comparison between A and C indicates that the model chopper cell is mainly driven by low-medium fibers in the proximal/somatic region at high stimulus levels.
Assuming the model chopper cell has best frequency = 5.5 kHz, inhibitory inputs to the chopper cell are assumed to be generated by interneurons receiving inputs from two above-best-frequency high spontaneous rate fibers at 6.6 kHz. Inhibitory sidebands of the model chopper cell are then similar to those shown for real chopper cells (see Fig. 5 of Blackburn and Sachs 1992). Rate-versus-level functions for the high spontaneous rate fibers, low-medium spontaneous rate fibers, and inhibitory interneurons are shown in Figure 4A with solid, dotted, and dashed lines, respectively. Chopper cell rate-versus-level functions are generated by computing the discharge rate to best-frequency tones with level ranging from 0 to 100 dB in steps of 2 dB. The synaptic input configuration is identical to that used in Figure 3A-C, except that the relative sizes of the distal and proximal dendritic regions are varied. We use t_i = 0.25 msec. For each stimulus level, 50 statistically independent trials are run and a 5-point moving triangular window is used to smooth the rate-versus-level function.

Figure 4B shows three types of chopper cell rate-versus-level functions, with relative sizes of the distal and proximal regions of 0.8:0.2 (solid line), 0.9:0.1 (dotted line), and 0.95:0.05 (dashed line). In all three cases, the discharge rate increases rapidly when the stimulus level is increased from 0 to 25 dB. This range is consistent with the dynamic range of high spontaneous rate auditory-nerve fibers (Fig. 4A, solid line). After this initial increase, as the stimulus level is increased further, chopper cell rate-versus-level functions either increase slowly (solid line), maintain the discharge rate at a plateau (dotted line), or decrease slowly (dashed line). These are three typical types of rate-versus-level functions measured experimentally in chopper cells (see Fig. 17, Blackburn and Sachs 1990).

The relative size of the distal and proximal dendritic regions is clearly a sensitive parameter for determining the shape of chopper cell rate-versus-level functions. This can be qualitatively understood as follows. Chopper cells are driven by distal high spontaneous rate fibers at low stimulus levels and by proximal low-medium fibers at high stimulus levels, according to the selective processing hypothesis. The initial rapid increase in rate-versus-level functions is due to the increase in the input rate of high spontaneous rate auditory-nerve fibers in the distal region, since for stimulus levels below 25 dB, the input rates of low-medium spontaneous rate auditory-nerve fibers and inhibitory inputs are close to zero. As tone level increases beyond 25 dB, the discharge rates of low-medium spontaneous rate auditory-nerve fibers as well as inhibitory inputs begin to increase. On-path inhibitory inputs veto the high spontaneous rate auditory-nerve fiber inputs in the distal region. Hence, when the stimulus level is greater than about 30 dB, the main driving source for the chopper cell is the low-medium spontaneous rate auditory-nerve fibers in the proximal region.
Figure 4: (A) Rate-versus-level functions for best-frequency high spontaneous rate fibers (solid line), low-medium spontaneous rate fibers (dotted line), and two off best-frequency fibers providing inputs to a model chopper cell (dashed line). These rate-versus-level functions are used to compute rate-versus-level functions for the model chopper cell. (B) Three types of chopper cell rate-versus-level functions computed with a relative size of the distal to proximal dendritic region of 0.8:0.2 (solid line), 0.9:0.1 (dotted line), and 0.95:0.05 (dashed line). These rate-versus-level functions are typical of real chopper cells (Blackburn and Sachs 1990). (D) Rate-versus-level functions of a model chopper cell when there is only 1 high spontaneous rate fiber in the distal region (solid line) and when there are only 8 low-medium fibers in the proximal region (dotted line). Distal inhibition is present in both cases.
Figure 4D shows rate-versus-level functions for the case where inhibitory inputs are applied distally, and where there are either distal excitatory inputs from high spontaneous rate auditory-nerve fibers (solid line), or proximal/somatic inputs from low-medium spontaneous rate fibers (dotted line). The relative size of the distal and proximal regions is identical to that for the dotted line in Figure 4B. At low stimulus levels, excitation from high spontaneous rate auditory-nerve fibers drives the model chopper cell. At high stimulus levels, inhibitory inputs veto excitation generated by the distal inputs (solid line), allowing more proximal/somatic inputs from low-medium spontaneous rate fibers to drive the model cell. This is precisely the behavior required for selective processing.

On-path inhibition does have some effect on more proximal excitatory inputs (Lai et al. 1994). The extent to which low-medium spontaneous rate fibers drive the chopper cell at high stimulus levels is determined by the relative size of the inhibitory region (distal) and excitatory region (proximal). The larger the distal region receiving inhibition, the greater are the effects of that inhibition on the discharge rate in response to the off-path excitatory inputs. For the solid line in Figure 4B, the size of the distal region is relatively small. This causes a continuous (though slow) increase in the discharge rate at high stimulus levels, since the input rate of low-medium spontaneous rate auditory-nerve fibers keeps increasing as the stimulus level is increased. For the dotted line in Figure 4B (moderately sized distal region), the effect of off-path inhibition balances the increase in the low-medium spontaneous rate auditory-nerve fiber input rate, giving rise to a plateau in the discharge rate of the chopper cell. For the dashed line in Figure 4B, the distal region is relatively large. In this case, the effect of off-path inhibition is strong. This leads to a slow decrease in the chopper cell discharge rate in spite of a steady increase in the low-medium spontaneous rate auditory-nerve fiber input rate. It should be noted that this nonmonotonic behavior (the dashed line in Fig. 4B) occurs often in chopper cells (Blackburn and Sachs 1990).
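The arithmetic behind this trade-off fits in a few lines. In the toy Python calculation below (ours; R_prox and k are hand-picked illustration values, not quantities fitted to the compartmental model), the high-level discharge rate is taken to be the proximal low-medium fiber drive reduced by off-path inhibition whose strength scales with the distal fraction s_d, and is compared with the 124 spikes/sec low-level rate of Figure 2A.

R_low = 124.0            # low-level rate from Fig. 2A (spikes/sec)
R_prox, k = 460.0, 0.8   # assumed proximal drive scale and inhibition gain
for s_d in (0.8, 0.9, 0.95):           # distal fractions, as in Fig. 4B
    r_high = R_prox * (1.0 - k * s_d)  # off-path inhibition grows with s_d
    trend = ("rising" if r_high > 1.05 * R_low
             else "falling" if r_high < 0.95 * R_low else "plateau")
    print(f"distal fraction {s_d:.2f}: high-level rate {r_high:3.0f} spikes/sec ({trend} limb)")

With these numbers the three distal fractions yield high-level rates of roughly 166, 129, and 110 spikes/sec, reproducing in order the rising, plateau, and falling limbs of Figure 4B.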
4 Discussion

Inhibitory inputs to chopper cells are of the shunting type (Oertel 1983; Wu and Oertel 1986). The model presented here indicates that shunting inhibition can play an important role in enabling chopper cells in the anteroventral cochlear nucleus to selectively process inputs from primary auditory-nerve fiber populations with different acoustic thresholds and dynamic ranges, thereby enabling these cells to maintain a robust rate-place representation of acoustic spectra over a wide range of sound levels. The model reproduces important dynamic characteristics of real chopper cells such as post-stimulus-time histograms and rate-versus-level functions in response to best-frequency tones.

At present there is no detailed information on the projection pattern of high versus low-medium spontaneous rate auditory-nerve fibers, or of inhibitory inputs, within the chopper cell dendritic tree.
Therefore, the spatial arrangement of these excitatory and inhibitory inputs hypothesized in our model is speculative. Nonetheless, it may be possible to determine the innervation pattern of high versus low-medium spontaneous rate fibers. While there are estimates of the number and type of auditory-nerve fiber synaptic inputs to chopper cell somas (Liberman 1991), little is known regarding the innervation pattern of the dendrites. Efforts are under way to correlate the morphology of auditory-nerve fiber synaptic endings on chopper cell dendrites with the spontaneous discharge rate of these fibers (Berglund et al. 1993; Ryugo et al. 1993). If such a correlation can be established, the hypothesis put forth in this paper can be tested.

As noted previously, the majority of best-frequency tone rate-level functions measured from cat chopper cells are nonmonotonic at high stimulus levels. These nonmonotonic rate-level functions can be reproduced by our model (Fig. 4B). In terms of the model, nonmonotonic behavior reflects the influence of distal inhibitory inputs on more proximal excitatory inputs from low-medium spontaneous rate auditory-nerve fibers. Clearly, nonmonotonic rate responses are of little value for discriminating changes in tone intensity; there are sound levels at which intensity increments cannot be detected as being different from intensity decrements. The importance of these responses lies in the fact that at higher stimulus levels (levels greater than about 30 dB SPL for the model chopper cell in Fig. 4B), responses are determined solely by inputs from low-medium spontaneous rate auditory-nerve fibers. As a result, chopper cell responses are well suited for encoding the spectral envelope of broadband acoustic signals at very high stimulus levels, as observed by Blackburn and Sachs (1990).

There are two particularly important parameters that affect the dynamic characteristics of chopper cells. The first is t_i, the time to peak of the inhibitory conductance change. This parameter critically determines the effectiveness of the on-path inhibition. It can also influence the shape of the post-stimulus-time histograms (Lai et al. 1994). The second parameter is the size of the distal dendritic region, which is heavily innervated by high spontaneous rate fibers, relative to the proximal region, in which low-medium spontaneous rate fibers make heavy synapses. This relative size crucially determines the shape of the rate-versus-level function of chopper cells. These two parameters should therefore be accurately estimated.
Acknowledgments

We thank E. D. Young for many valuable discussions. This work was supported by NIH Grant DC00979-03 and ONR Grant N00014-92-J-1134, the Whitaker Foundation, and the W. M. Keck Foundation.
References
Barta, P. E., and Young, E. D. 1986. Rate responses of auditory nerve fibers to tones in noise near masked threshold. J. Acoust. Soc. Am. 79, 426-442.

Berglund, A. M., Jacob, K., and Liberman, M. C. 1993. Morphometry of synaptic vesicles in anteroventral cochlear nucleus: Some correlations with terminal origins. In 26th Midwinter Research Meeting, Association for Research in Otolaryngology, p. 118. St. Petersburg Beach, FL.

Blackburn, C. C., and Sachs, M. B. 1989. Classification of unit types in the anteroventral cochlear nucleus: Post-stimulus-time histograms and regularity analysis. J. Neurophysiol. 62(6), 1303-1329.

Blackburn, C. C., and Sachs, M. B. 1990. The representations of the steady-state vowel sound /e/ in the discharge pattern of cat anteroventral cochlear nucleus neurons. J. Neurophysiol. 63(5), 1191-1212.

Blackburn, C. C., and Sachs, M. B. 1992. Effects of off-BF tones on responses of chopper units in ventral cochlear nucleus. I. Regularity and temporal adaptation patterns. J. Neurophysiol. 68(1), 124-143.

Koch, C., Poggio, T., and Torre, V. 1982. Retinal ganglion cells: A functional interpretation of dendritic morphology. Phil. Trans. R. Soc. London B 298, 227-264.

Lai, Y-C., Winslow, R. L., and Sachs, M. B. 1994. A model of selective processing of auditory-nerve inputs by stellate cells of the anteroventral cochlear nucleus. J. Comp. Neurosci. 1, 167-194.

Liberman, M. C. 1991. Central projections of auditory-nerve fibers of differing spontaneous rate. I. Anteroventral cochlear nucleus. J. Comp. Neurol. 313, 240-258.

Manis, P. B., and Marx, S. O. 1991. Outward currents in isolated ventral cochlear nucleus neurons. J. Neurosci. 11(9), 2865-2880.

Neti, C., and Young, E. D. 1992. Neural network models of sound localization based on directional filtering by the pinna. J. Acoust. Soc. Am. 92(6), 3140-3156.

Oertel, D. 1983. Synaptic responses and electrical properties of cells in brain slices of the mouse anteroventral cochlear nucleus. J. Neurosci. 3, 2043-2053.

Rall, W. 1964. Theoretical significance of dendritic trees for neuronal input-output relations. In Neural Theory and Modeling, R. F. Reiss, ed., pp. 73-97. Stanford University Press, Stanford, CA.

Rothman, J. S. 1991. An electrophysiological model of bushy cells of the anteroventral cochlear nucleus. Master's thesis, The Johns Hopkins University, Department of Biomedical Engineering, Baltimore, MD.

Ryugo, D. K., Wright, D. D., and Pongstaporn, T. 1993. Ultrastructural analysis of synaptic endings of auditory nerve fibers in cats: Correlations with spontaneous discharge rate. In The Mammalian Cochlear Nuclei: Organization and Function, M. Merchan, J. M. Juiz, D. A. Godfrey, and E. Mugnaini, eds., pp. 65-74. Plenum Press, New York.

Sachs, M. B., and Abbas, P. J. 1974. Rate versus level functions for auditory-nerve fibers in cats: Tone-burst stimuli. J. Acoust. Soc. Am. 56, 1835-1847.
Sachs, M. B., and Kiang, N. Y. S. 1968. Two-tone inhibition in auditory-nerve fibers. J. Acoust. Soc. Am. 43, 1120-1128.

Sachs, M. B., and Young, E. D. 1979. Encoding of steady-state vowels in the auditory-nerve: Representation in terms of discharge rate. J. Acoust. Soc. Am. 66, 470-479.

Sachs, M. B., Winslow, R. L., and Sokolowski, B. H. A. 1989. A computational model for rate-level functions from cat auditory-nerve fibers. Hearing Res. 41, 61-70.

Wang, X. 1991. Neural encoding of single-formant stimuli in auditory-nerve and anteroventral cochlear nucleus of the cat. Ph.D. thesis, The Johns Hopkins University, Department of Biomedical Engineering, Baltimore, MD.

Winslow, R. L., Barta, P. E., and Sachs, M. B. 1987. Rate coding in the auditory nerve. In Auditory Processing of Complex Sounds, C. S. Watson and W. Yost, eds., pp. 212-224. Lawrence Erlbaum, Hillsdale, NJ.

Wu, S. H., and Oertel, D. 1984. Intracellular injection with horseradish peroxidase of physiologically characterized stellate and bushy cells in slices of mouse anteroventral cochlear nucleus. J. Neurosci. 4, 1577-1588.

Wu, S. H., and Oertel, D. 1986. Inhibitory circuitry in the ventral cochlear nucleus is probably mediated by glycine. J. Neurosci. 6, 2691-2706.

Young, E. D., Robert, J. M., and Shofner, W. P. 1988. Regularity and latency of units in the ventral cochlear nucleus: Implications for unit classification and generation of response properties. J. Neurophysiol. 60, 1-29.
Received September 20, 1993; accepted January 24, 1994.
Communicated by Jacques Belair
Control of Chaos in Networks with Delay: A Model for Synchronization of Cortical Tissue

C. Lourenço, A. Babloyantz
Service de Chimie-Physique, Université Libre de Bruxelles, CP 231 - Campus Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium

Neural Computation 6, 1141-1154 (1994) © 1994 Massachusetts Institute of Technology

The unstable periodic orbits of chaotic dynamics in systems described by delay differential equations are considered. An orbit is stabilized successfully, using a method proposed by Pyragas. The system under investigation is a network of excitatory and inhibitory neurons of moderate size, describing cortical activity. The relevance of the results for synchronized cortical activity is discussed.

1 Introduction
Synchronized activity of neuronal populations has been shown to appear in several physiological processes (Rougeul-Buser et al. 1983; Eckhorn et al. 1988; Gray et al. 1989). Usually this phenomenon has been modeled by networks of interacting units. In these models a parameter of the system, or equivalently an external synchronizing agent mimicking a sensory input, is allowed to take values in appropriate ranges (Kurrer et al. 1991; Eckhorn et al. 1989; Sompolinsky et al. 1990; Schuster and Wagner 1990; Schillen and Konig 1990). The change in the parameter brings about temporal correlation in the network. Usually the parameter change is not small, and moreover it assumes a transition between two different dynamics. A change in a given parameter of the neurons or their connectivity assumes the synthesis or degradation of key physiological molecules, which requires some length of time. However, some of the behavioral changes of cerebral activity, for example, the switch between α and β waves, can take place extremely rapidly. Therefore it is difficult to imagine a very rapid, back and forth structural change in the network that could underlie the synchronization process.

This point may be seen more clearly when examining the field potentials recorded from the surface of the cortex of an attentive cat (Rougeul-Buser et al. 1983). The α waves appear as short episodes of rapid synchronized activity followed by equally short episodes of much less synchronized behavior. If synchronized states appear under the action of external
stimuli, why does the neuronal network desynchronize episodically if the input requiring attention is still present?

We believe that the classical theory of synchronization of oscillators in terms of nonlinear dynamics (Kuramoto 1984) and bifurcation theory is not adequate for the description of rapid changes in cortical activity such as, for example, attention shifts. The framework of the theory of chaos seems more promising for the treatment of such problems. Indeed, from the analysis of EEG, deterministic chaotic dynamics was shown to appear in several behavioral states of cerebral cortex as well as in a few neurological pathologies (Babloyantz et al. 1985; Babloyantz and Destexhe 1986; Babloyantz 1991). Later on, drawing on model systems, it was suggested that the EEG as a temporal signal is a spatial average of a spatiotemporal chaotic network of excitatory and inhibitory neurons (Destexhe and Babloyantz 1991).

If the cerebral cortex is seen as a network endowed with spatiotemporal chaotic activity, then the above cited rapid transition between states of different synchronized activity could be explained. Indeed, because of the extreme sensitivity to small perturbations, chaotic motions are not predictable. Systems exhibiting chaotic attractors are endowed with an infinite variety of behaviors such as unstable fixed points, or unstable periodic orbits (UPOs) (Ruelle 1985; Auerbach et al. 1987; Pawelzik and Schuster 1991). If one of these orbits is stabilized under the action of a very small fluctuation, a synchronized state may appear spontaneously in the network. No structural changes are needed and the synchronization could be switched on and off rapidly.

In recent years there has been a great deal of interest in techniques that attempt to select stable motions from the infinite variety of unstable behaviors of chaotic dynamics (Ott et al. 1990a,b; Dressler and Nitsche 1992; Auerbach et al. 1992; Pyragas 1992). The important fact is that stabilization must arise under the action of tiny perturbations, without large excursions in the parameter space of the system. Although in principle the techniques of stabilization of UPOs are applicable in a large class of problems, in practice their use was limited to systems with few degrees of freedom. Thus only the temporal evolution of the phenomena was considered. However, very recently it was shown (Sepulchre and Babloyantz 1993) that the method proposed by Ott, Grebogi, and Yorke (OGY) (1990a) could be extended to the stabilization of UPOs in spatiotemporal chaotic networks of moderate size. Stabilization gives rise to bulk oscillations, standing waves, and rotating waves in the network.

In biological neural networks propagation delays are important elements of the dynamics. Such dynamics are usually described in terms of delay differential equations (DDEs). Therefore, stabilization into a synchronized state must be achieved in infinite dimensional systems. The aim of this paper is to show that stabilization may also be achieved in DDEs. The model describing the cortex consists of a two-dimensional network of excitatory and inhibitory neurons (Destexhe and Babloyantz
1991). The spatiotemporal chaotic activity of the network is stabilized into bulk oscillations of the system, without any reference to intrinsic parameter changes of the original system or bifurcation schemes. The system switches from chaotic dynamics into stable oscillations as a result of small fluctuations. In this example stabilization was achieved with the help of the method of Pyragas (1992). Section 2 is devoted to the description of the stabilization technique proposed by Pyragas, in the context of DDEs. In Section 3 we introduce the simple cortical model. In the last section we discuss the relevance of the stabilization of UPOs in spatiotemporal chaotic networks for information processing in cerebral cortex.

2 Stabilization of Orbits in DDEs
Let us consider the set of n nonlinear delay differential equations (DDEs)

dX_i/dt = f_i(X, X_τ)    (2.1)

where X ∈ R^n, X_τ = X(t − τ), and τ is the delay. We assume that for a range of parameter values μ and τ these equations exhibit deterministic chaotic dynamics. The chaotic attractor comprises an infinite number of unstable periodic orbits, which could be stabilized by the addition of small perturbations ε_Xi(t) to one or more of the equations (2.1) (Pyragas 1992). To this end we define "distance functions"

D_Xi(t) = X_i(t − T) − X_i(t)    (2.2)

Here T is an artificially introduced delay that should not be confused with the delay τ entering the original differential equations (2.1). Ideally, T should coincide with the period T_i of one of the unstable orbits, u_i. If T = T_i and X(t) = u_i(t), then all D_Xi(t) vanish. The perturbations ε_Xi(t) and the functions D_Xi(t) are related in the following manner:

ε_Xi(t) = K D_Xi(t),    |ε_Xi(t)| ≤ ε_0    (2.3)

The constant K is the weight of the perturbation. Its value is determined numerically. ε_0 > 0 is a saturating value that ensures that the perturbations remain small at all times. The search for the unstable orbits is conducted in the two-dimensional parameter space (K, T). We vary K and T and use the D_Xi(t) as probe functions. The vanishing of the ⟨D²_Xi(t)⟩ is an indication of the fact that stabilization has been achieved under the action of the chosen ε_Xi(t). Let us note that the method can stabilize fixed points as well as limit cycles. For the case of limit cycles, a further confirmation of the fact that a true stabilization is achieved is obtained when the period of the
stabilized orbit coincides with the parameter T. In general there is no obvious relation between the delay τ and the characteristic frequencies of the system as seen in the chaotic power spectrum, and there is also no explicit relation between any of these quantities and the periods T_i of the unstable periodic orbits. Counterexamples can however be found, as is the case for the system we will present in Section 3. More details regarding the method can be obtained from the original paper (Pyragas 1992).

When applying the stabilization method as discussed in this paper, a number of difficulties may arise. For example, when nonoptimal values of K and/or T are used, the corresponding values ⟨D²_Xi(t)⟩ are not sufficiently low and side effects may appear, such as a slight modulation superimposed on the basic stabilized orbit. Typically this modulation has a much longer period than that of the basic orbit and may not be apparent from the observation of short time series. The partial character of the control may generate other problems. Control is usually performed over a fraction of the variables of a multivariable system, and the ones that are not being controlled are expected to adapt adiabatically. In the set of DDEs (2.1) only the variables X_i at the present time t are perturbed directly. Due to this fact the "ideal" (K, T) might not even be found with the set of variables available for direct control. In the case of finite-dimensional dynamical systems, such a problem can be solved with multivariable control. More serious difficulties may arise because of the infinite dimensional nature of the DDEs. Although the chaotic attractor may be embedded in a space with few degrees of freedom, during the stabilization process the system again evolves in an infinite dimensional space. So in the case of failure of the control when one or several variables are perturbed directly at the present time t only, it is difficult to know a priori how many more variables must be perturbed for effective control.

In order to investigate numerically the robustness of the stabilization procedure in the presence of additive random white noise, we point out that in systems that can be described by equations 2.1, it seems reasonable to use two distinct noise terms per equation so as to simulate the many independent small perturbations arising in physical problems. One term accounts for the perturbation of the main variable at time t and for the variability resulting from the very existence of the "delay line" associated with the controlling method. Delay line is another name for the infinite dimensional array that memorizes all values of the variable X_i between t − T and t. Some degradation of the information along this delay line is to be expected. A "transfer" error may also arise when the system reads the value of the main variable t − T time units earlier, via some unspecified physical process. All these effects are summed up and assumed to obey a gaussian distribution. One can also consider the already existing delay line of the unperturbed system, which ranges from t − τ to t and which we take as independent from the one mentioned above. More specifically,
the noise terms are viewed as derivatives of Wiener random processes W obeying the relations ⟨W(t₁) − W(t₂)⟩ = 0, ⟨[W(t₁) − W(t₂)]²⟩ = χ²|t₁ − t₂|. Equations 2.1 are then modified into

X_i(t + dt) = X_i(t) + [f_i(X(t), X_τ(t)) + ε_Xi(t)] dt + χ_Xi G_Xi (dt)^(1/2)

X_i,T(t + dt) = X_i,T(t) + χ_Xi,T G_Xi,T (dt)^(1/2)

where G_Xi and G_Xi,T represent independent gaussian distributions with zero average and unit variance.
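The mechanics of equations 2.2 and 2.3 can be tried out on any chaotic DDE. The Python sketch below is ours and uses the Mackey-Glass equation purely as a stand-in for equations 2.1; the (K, T) pair and the integration step are illustrative assumptions, and in practice one scans the (K, T) plane for minima of ⟨D²(t)⟩, exactly as described above. A nonzero noise amplitude mimics the white-noise terms just introduced.

import numpy as np

# Stand-in chaotic DDE: Mackey-Glass, dx/dt = beta*x_tau/(1 + x_tau^10) - g*x.
beta, g, tau = 0.2, 0.1, 17.0      # classic chaotic parameter set
K, T, eps0 = 0.2, 50.0, 0.05       # feedback weight, probe delay, saturation (assumed)
noise = 0.0                        # set > 0 for the additive white-noise term
dt, t_end = 0.05, 4000.0

rng = np.random.default_rng(1)
n = int(t_end / dt)
n_tau, n_T = int(tau / dt), int(T / dt)
x = np.empty(n + 1)
x[: n_T + 1] = 1.2 + 0.01 * rng.standard_normal(n_T + 1)   # random history
D2 = np.empty(n - n_T)
for i in range(n_T, n):
    D = x[i - n_T] - x[i]                   # distance function (eq. 2.2)
    eps = np.clip(K * D, -eps0, eps0)       # small clipped perturbation (eq. 2.3)
    x_tau = x[i - n_tau]
    dxdt = beta * x_tau / (1.0 + x_tau ** 10) - g * x[i] + eps
    x[i + 1] = x[i] + dxdt * dt + noise * rng.standard_normal() * np.sqrt(dt)
    D2[i - n_T] = D * D
print("<D^2> over final quarter:", D2[-len(D2) // 4:].mean())
# A value near zero signals a stabilized orbit of period T; scanning (K, T)
# for minima of <D^2> is the probe procedure described in the text.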
3 Stabilization of Orbits in a Neuronal Network

It is customary to model the cerebral cortex as a network of excitatory and inhibitory neurons in interaction. The information transfer between neurons requires considering a propagation delay that is a function of the distance between interacting units. Thus the description of cortical dynamics necessitates the use of DDEs. It has been shown that the following model may represent satisfactorily several behavioral states of the cortex (Destexhe and Babloyantz 1991):
dX_i/dt = −γ(X_i − V_L) − (X_i − E_1) Σ_{k≠i} u_ik^(1) F_X[X_k(t − τ_ik)]
                        − (X_i − E_2) Σ_{l≠i} u_il^(2) F_Y[Y_l(t − τ_il)]

dY_i/dt = −γ(Y_i − V_L) − (Y_i − E_1) Σ_{k≠i} u_ik^(3) F_X[X_k(t − τ_ik)]
                        − (Y_i − E_2) Σ_{l≠i} u_il^(4) F_Y[Y_l(t − τ_il)]    (3.1)
Here X_i and Y_i represent, respectively, the postsynaptic potentials of the N_ex excitatory and N_in inhibitory neurons. The resting potential V_L takes the value −60 mV, whereas the excitatory and inhibitory transmitter equilibrium potentials are E_1 = 50 mV and E_2 = −80 mV. γ = 0.25 msec⁻¹ is the inverse membrane constant. The constant synaptic weights u_ik^(1), u_il^(2), u_ik^(3), and u_il^(4) refer, respectively, to the excitatory-to-excitatory, inhibitory-to-excitatory, excitatory-to-inhibitory, and inhibitory-to-inhibitory connections. The sigmoidal firing function is

F(V) = 1 / (1 + exp[−α(V − V_c)])

The constants α are different for the firing functions of excitatory and inhibitory neurons. Their values are α_X = 0.09 mV⁻¹ and α_Y = 0.2 mV⁻¹, respectively. The common firing threshold is V_c = −25 mV. τ_ij is the propagation delay between neurons i and j. We assume that the sum
Ω_j = Σ_k u_ik^(j) of the synaptic weights corresponding to the same type of interaction remains constant, and that τ_ij = τ. In a given range of parameter values, the network (3.1) may exhibit spatiotemporal chaotic activity. In these states irregular patches of coherent chaotic activity appear and disappear in time (Destexhe and Babloyantz 1991). The aim of this section is to show that a uniform periodic solution X_i(t) = X(t) and Y_i(t) = Y(t) exists and can be stabilized when the network exhibits spatiotemporal chaotic activity.

Let us first consider the case when the network has a uniform solution X_i(t) = X(t) and Y_i(t) = Y(t). In this case equations 3.1 reduce to two coupled nonlinear DDEs
dX/dt = −γ(X − V_L) − (X − E_1) Ω_1 F_X[X(t − τ)] − (X − E_2) Ω_2 F_Y[Y(t − τ)]

dY/dt = −γ(Y − V_L) − (Y − E_1) Ω_3 F_X[X(t − τ)] − (Y − E_2) Ω_4 F_Y[Y(t − τ)]    (3.2)
These equations can be seen as a model describing the interaction between an excitatory and an inhibitory neuron. Moreover, the excitatory neuron feeds back into itself with a propagation delay τ. For the parameter values Ω_1 = 6.3, Ω_2 = Ω_3 = 5, Ω_4 = 0, and τ = 16 msec, the reduced system (3.2) exhibits deterministic chaos. Figure 1a and b shows the time evolution of the variable X and the phase portrait of the dynamics.

Before considering the case of the full network, we stabilize an unstable periodic orbit embedded in the attractor of the reduced system (3.2). As we shall see in the sequel, the determination of periodic orbits of this system is of great value for the stabilization of UPOs of the entire network. We perturb the dynamics as explained in Section 2. In equations 3.2, we add the small perturbations ε_X(t) and ε_Y(t) to the right-hand sides of the X and the Y equations, respectively. Here D_X(t) = X(t − T) − X(t) and D_Y(t) is defined similarly. The common saturating value for ε_X(t) and ε_Y(t) is ε_0 = 0.3 mV msec⁻¹. We remember that both functions D_X(t) and D_Y(t) are multiplied by the same parameter K, and that the same delay T enters both terms. Fixing K = 0.275 msec⁻¹, we display in Figure 2 the dependence of ⟨D²_X(t)⟩ on the parameter T, as computed from equations 3.2 in the presence of the perturbations ε_X(t) and ε_Y(t). The values of T corresponding to the minima of this figure are the best choices to use in the ε(t) terms in order to achieve stabilization of orbits with the chosen value of K. A strong resonance is found at T = 105.3 msec, along with a weaker one when T has twice this value. The two resonances are found to correspond to the same periodic orbit. Figure 3a and b displays the time variation of the variable X for this periodic orbit and the phase portrait of the dynamics, which indicates a very complex periodic motion. This figure was obtained by integrating the perturbed system with the use of T = 105.3 msec and K = 0.275 msec⁻¹ in the ε(t) terms.
Figure 1: (a) Chaotic time series of the activity of the excitatory neuron in the reduced system. (b) Phase portrait of the reduced system in the chaotic regime. Parameters used: Ω_1 = 6.3, Ω_2 = Ω_3 = 5, Ω_4 = 0, τ = 16 msec.

As it happens, the 105.3 msec periodicity of the stabilized orbit can be associated with a major peak of the power spectrum of the unperturbed chaotic system. As we mentioned in Section 2, this is not a general feature. The other major peak of that spectrum can also be retrieved from Figure 2 in the form of a marked 15 msec periodicity of the ⟨D²_X(t)⟩(T) dependence.

Presently we turn to the network as described by equations 3.1. In order to avoid costly computations we concentrate on a small network.
Figure 2: Dependence of ⟨D²_X(t)⟩ on the parameter T, for the reduced system. K = 0.275 msec⁻¹, ε_0 = 0.3 mV msec⁻¹.
In this network N excitatory and N inhibitory neurons are arranged in a two-dimensional regular lattice. We consider only first-neighbor interactions and the same delay τ for all neurons. Moreover, the synaptic strengths of all neurons have the unique values u_ik^(j) = u^(j), and the sums Ω_j = Σ_k u_ik^(j) are the same as for the reduced system (3.2). Two networks of size N_ex = 9, N_in = 9, and N_ex = 16, N_in = 16, with zero flux boundary conditions, are considered. In this region of parameter values both networks showed spatiotemporal chaotic activity. However, the spatial distribution of activity shows a chessboard-type structure. That is, the time evolution of the excitatory as well as the inhibitory neurons is divided into two sets. The activities of adjacent chaotic neurons are not identical, but those of second neighbors are the same. Thus the network shows four different chaotic activities. The remarkable fact is that groups of the excitatory and inhibitory neurons show synchronized chaotic activity. Due to the sensitivity of chaotic dynamics to initial conditions this fact may seem paradoxical. Such behavior has already been studied and discussed in the literature (Pecora and Carroll 1990; Fujisaka and Yamada 1983; Badola et al. 1991; Anishchenko et al. 1992). We illustrate the overall chaotic dynamics of the network by plotting in Figure 4a the simultaneous evolution of two first-neighbor excitatory neurons. The comparison of the dynamics of the network with that of the reduced system (3.2) with equivalent parameters (see Fig. 1a) indicates that chaos is more developed in the network.

In order to stabilize a periodic orbit X_i(t) = X(t) and Y_i(t) = Y(t) in the entire network, we add the perturbation terms ε_Xi(t) and ε_Yi(t) to the right-hand sides of the equations for the X_i and the Y_i variables, respectively.
Figure 3: (a) Time series of the activity of the excitatory neuron after stabilization. T = 105.3 msec and K = 0.275 msec⁻¹. (b) Phase portrait of the dynamics after stabilization of the reduced system. The values of Ω_i and τ are as in Figure 1.

The definition of the distance functions takes the form
D_Xi(t) = X̄_i(t − T) − X_i(t)

where X̄_i denotes the average postsynaptic potential of the four first excitatory neighbors of neuron X_i. A similar definition holds for D_Yi(t), where the average is taken over the four first inhibitory neighbors of Y_i. Again, the same value of K multiplies all the D_Xi(t) and D_Yi(t), and the parameter T is the same for all controlling terms.
Figure 4: (a) Simultaneous variation of two adjacent excitatory neurons when the network is in the chaotic regime. N_ex = N_in = 9. (b) The same neurons as in (a), after stabilization of the network. The values of Ω_i and τ are as in Figure 1.

Thus, a variable is not perturbed by its own value some time earlier, but by the average delayed value of its first neighbors. Other choices could be made for the functions D_Xi(t) and D_Yi(t). With τ fixed at the value 16 msec, which is the same as was used in the reduced equations (3.2), we profit from the preliminary studies performed with the reduced system and take the value 105.3 msec as the period of the uniform stabilizable oscillations of the network. A further search for adequate values of K in the network case is still required. The 3 × 3 and 4 × 4 networks turn out to be stabilizable by choosing for K the values 4.6 and 2.7 msec⁻¹, respectively. The net result is the destruction of the chessboard structure and the passage from the chaotic regime to uniform regular oscillations of period 105.3 msec. The shape of the characteristic signal thus obtained is identical to the one obtained with the stabilization of the reduced equations 3.2 and is illustrated in Figure 4b.
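For concreteness, a minimal Python integration of the reduced system 3.2 under this control is sketched below, using only parameter values quoted in the text (γ = 0.25 msec⁻¹, V_L = −60 mV, E_1 = 50 mV, E_2 = −80 mV, Ω_1 = 6.3, Ω_2 = Ω_3 = 5, Ω_4 = 0, τ = 16 msec, K = 0.275 msec⁻¹, T = 105.3 msec, ε_0 = 0.3 mV msec⁻¹). The Euler scheme, the step size, and the initial history are our own choices, and a transient must elapse before the trajectory settles onto the orbit of Figure 3.

import numpy as np

g, VL, E1, E2 = 0.25, -60.0, 50.0, -80.0      # 1/msec and mV, from the text
O1, O2, O3, O4 = 6.3, 5.0, 5.0, 0.0           # Omega_1 .. Omega_4
aX, aY, Vc = 0.09, 0.2, -25.0                 # firing-function constants
tau, K, T, eps0 = 16.0, 0.275, 105.3, 0.3     # delays (msec), control parameters
dt, t_end = 0.05, 4000.0                      # Euler step and run length (our choice)

FX = lambda V: 1.0 / (1.0 + np.exp(-aX * (V - Vc)))
FY = lambda V: 1.0 / (1.0 + np.exp(-aY * (V - Vc)))

n, n_tau, n_T = int(t_end / dt), int(tau / dt), int(T / dt)
X = np.full(n + 1, -60.0); Y = np.full(n + 1, -60.0)
X[: n_T + 1] += np.random.default_rng(2).uniform(-1.0, 1.0, n_T + 1)  # random history
for i in range(n_T, n):
    epsX = np.clip(K * (X[i - n_T] - X[i]), -eps0, eps0)  # clipped feedback terms
    epsY = np.clip(K * (Y[i - n_T] - Y[i]), -eps0, eps0)
    dX = (-g * (X[i] - VL) - (X[i] - E1) * O1 * FX(X[i - n_tau])
          - (X[i] - E2) * O2 * FY(Y[i - n_tau]) + epsX)
    dY = (-g * (Y[i] - VL) - (Y[i] - E1) * O3 * FX(X[i - n_tau])
          - (Y[i] - E2) * O4 * FY(Y[i - n_tau]) + epsY)
    X[i + 1] = X[i] + dX * dt
    Y[i + 1] = Y[i] + dY * dt
# After the transient, X(t) should trace the complex periodic orbit of period
# T = 105.3 msec (Fig. 3); with K = 0 it remains on the chaotic attractor of Fig. 1.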
Stabilization of orbits in the full network is more difficult to achieve than in the case of the reduced system, because the dynamics now evolves in a "higher" dimensional phase space. The stabilized orbits, although analogous, are not strictly equal in the two cases. They have the same shape and period, but their Floquet multipliers are different. Moreover, the fraction of the phase space occupied by the basin of attraction of the periodic orbit (with the perturbed system) is smaller in the full network case. Therefore, longer transient times are observed before stabilization is achieved, and a somewhat lengthy search procedure with varying initial conditions and K values must be followed. Nonetheless, even when a perfect stabilization is not attained, one observes an overall increase in temporal and spatial coherence of the network, thus a decrease in attractor dimension.

With (K, T) parameters allowing for the stabilization of the network in the noiseless case, we examined the effect of adding random white noise to the dynamics. As discussed at the end of Section 2, we perturb the variables at present time t with the terms χ_X G_X and χ_Y G_Y, and the variables at time t − T with χ_X,T G_X,T and χ_Y,T G_Y,T. In the following, we assume that the χ values do not depend on the network site. Here χ_X = 4.5 × 10⁻¹ mV sec⁻¹/², χ_Y = 7.2 × 10⁻¹ mV sec⁻¹/², χ_X,T = 2.2 × 10⁻¹ mV sec⁻¹/², and χ_Y,T = 3.6 × 10⁻¹ mV sec⁻¹/². We verify that stabilization does occur in the presence of the noise. However, the presence of noise tends to decrease the time during which UPOs remain stabilized. But even after stabilization is destroyed, spatially and temporally more coherent states are seen as compared to the original nonstabilized chaotic regime.

4 Conclusions
In this paper we have shown that, under the action of tiny fluctuations, the stabilization of unstable periodic orbits embedded in attractors resulting from delay differential equations is possible. The stabilization was performed in moderate size networks describing cortical activity. The results are important, especially in the context of biological modeling, since delays arise at various levels in living matter and more specifically in cerebral cortex.

The possibility of stabilization of UPOs in cortical networks gives us for the first time a framework that enables us to propose a mechanism for understanding some aspects of brain dynamics. Behavioral states of the brain change extremely rapidly, thus the underlying dynamics must change states under the action of very small fluctuations. The events arise on such short time scales that it seems reasonable to think that they are the result of small fluctuations in ionic fluxes rather than de novo protein synthesis leading to synchronization via a bifurcation scheme. Thus one may think that these small fluctuations stabilize the network momentarily into a more coherent state, more appropriate
for information processing. Our simulations indicated that in larger and noisier networks stabilization is of short duration. This behavior may explain why, for example, in an attentive cat short episodes of synchronization are followed by desynchronized states until another episode of coherent activity arises.

Contrary to other models explaining network synchronization, we do not assume a priori that all neurons are in an oscillatory state and that system parameters are such that bulk synchronization occurs. Here synchronization arises only from the action of a small external input in an otherwise desynchronized state. If the cortical activity is of a spatiotemporal chaotic nature, such processes are more relevant from a physiological point of view.

In our network only the bulk oscillation of the system was stabilized. However, in a previous paper (Sepulchre and Babloyantz 1993) considering a network of oscillators with no delay term, it was possible to stabilize four different orbits giving rise to spatial structures such as rotating and standing waves. It can be shown that in a system formed from two interconnected chaotic networks of oscillating units, the stabilization of UPOs makes possible the construction of a chaotic categorizer (Babloyantz and Sepulchre 1993). Work is in progress for the construction of cortical networks where one considers several interconnected cortical layers. Such systems, we hope, will be able to have the same performances as the ones seen in chaotic networks of oscillators.

Acknowledgments

We thank J. A. Sepulchre for interesting discussions. C. Lourenço acknowledges the support of the Portuguese Government (JNICT-Programa CIENCIA, BD/1765/91-IA). This work was also supported in part by the Belgian Government (IMPULSE project RFO A1 10, Calcul de Puissance project IT/SC/25) and the E.E.C. (ESPRIT, Basic Research, 3234).

References

Anishchenko, V., Vadivasova, T., Postnov, D., and Safonova, M. 1992. Synchronization of chaos. Int. J. Bif. Chaos 2, 633-644.

Auerbach, D., Cvitanović, P., Eckmann, J.-P., Gunaratne, G., and Procaccia, I. 1987. Exploring chaotic motion through periodic orbits. Phys. Rev. Lett. 58, 2387-2389.

Auerbach, D., Grebogi, C., Ott, E., and Yorke, J. 1992. Controlling chaos in high dimensional systems. Phys. Rev. Lett. 69, 3479-3482.

Babloyantz, A. 1991. Evidence for slow brain waves: A dynamical approach. Electroencephalogr. Clin. Neurophysiol. 78, 402-405.

Babloyantz, A., and Destexhe, A. 1986. Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. U.S.A. 83, 3513-3517.
Babloyantz, A., and Sepulchre, J. A. 1993. In Proceedings of the 1993 International Conference on Artificial Neural Networks (ICANN93), Amsterdam, S. Gielen and B. Kappen, eds., pp. 670-675. Springer, London.
Babloyantz, A., Salazar, J., and Nicolis, C. 1985. Evidence of chaotic dynamics of brain activity during the sleep cycle. Phys. Lett. A 111, 152-156.
Badola, P., Kumar, V., and Kulkarni, B. 1991. Effects of coupling nonlinear systems with complex dynamics. Phys. Lett. A 155, 365-372.
Destexhe, A., and Babloyantz, A. 1991. Pacemaker-induced coherence in cortical networks. Neural Comp. 3, 145-154.
Dressler, U., and Nitsche, G. 1992. Controlling chaos using time delay coordinates. Phys. Rev. Lett. 68, 1-4.
Eckhorn, R., Bauer, R., Brosch, M., Jordan, W., Kruse, W., Munk, M., and Reitboeck, H. 1988. Functionally related modules of cat visual cortex show stimulus-evoked coherent oscillations: A multiple electrode study. Invest. Ophthalmol. Vis. Sci. 29, 331-342.
Eckhorn, R., Reitboeck, H., Arndt, M., and Dicke, P. 1989. A neural network of feature linking via synchronous activity. In Models of Brain Function. Cambridge University Press, Cambridge.
Fujisaka, H., and Yamada, T. 1983. Stability theory of synchronized motion in coupled-oscillator systems. Progr. Theor. Phys. 69, 32-47.
Gray, C., König, P., Engel, A., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Kuramoto, Y. 1984. Chemical Oscillations, Waves, and Turbulence. Springer-Verlag, Berlin.
Kurrer, C., Nieswand, B., and Schulten, K. 1991. A model for synchronous activity in the visual cortex. In Self-Organization, Emerging Properties, and Learning, A. Babloyantz, ed. NATO ASI Series B, Physics, Volume 260.
Ott, E., Grebogi, C., and Yorke, J. 1990a. Controlling chaos. Phys. Rev. Lett. 64, 1196-1199.
Ott, E., Grebogi, C., and Yorke, J. 1990b. Controlling chaotic dynamical systems. In Chaos: Soviet-American Perspectives on Nonlinear Science, D. Campbell, ed. American Institute of Physics, New York.
Pawelzik, K., and Schuster, H. 1991. Unstable periodic orbits and prediction. Phys. Rev. A 43, 1808-1812.
Pecora, L., and Carroll, T. 1990. Synchronization in chaotic systems. Phys. Rev. Lett. 64, 821-824.
Pyragas, K. 1992. Continuous control of chaos by self-controlling feedback. Phys. Lett. A 170, 421-428.
Rougeul-Buser, A., Bouyer, J., Montaron, M., and Buser, P. 1983. Patterns of activities in the ventrobasal thalamus and somatic cortex SI during behavioral immobility in the awake cat: Focal waking rhythms. Experimental Brain Research, Suppl. 7. Springer-Verlag, Berlin.
Ruelle, D. 1985. Thermodynamic Formalism. Addison-Wesley, Reading, MA.
Schillen, T., and König, P. 1990. Coherency detection by coupled oscillatory responses: Synchronizing connections in neural oscillator layers. In Parallel Processing in Neural Systems and Computers. North Holland, Amsterdam.
Schuster, H., and Wagner, P. 1990. A model for neuronal oscillations in the visual cortex. In Parallel Processing in Neural Systems and Computers. North Holland, Amsterdam.
Sepulchre, J. A., and Babloyantz, A. 1993. Controlling chaos in a network of oscillators. Phys. Rev. E 48, 945-950.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Received July 16, 1993; accepted February 22, 1994.
Communicated by Raymond Watrous
First-Order Recurrent Neural Networks and Deterministic Finite State Automata

Peter Manolios¹
Department of Computer Science, Brooklyn College, Brooklyn, NY 11210 USA, and Department of Computer Science, CUNY Graduate School and University Center, New York, NY 10036 USA

Robert Fanelli
Department of Physics, Brooklyn College, Brooklyn, NY 11210 USA

We examine the correspondence between first-order recurrent neural networks and deterministic finite state automata. We begin with the problem of inducing deterministic finite state automata from finite training sets that include both positive and negative examples, an NP-hard problem (Angluin and Smith 1983). We use a neural network architecture with two recurrent layers, which we argue can approximate any discrete-time, time-invariant dynamic system, with computation of the full gradient during learning. The networks are trained to classify strings as belonging or not belonging to the grammar. The training sets used contain only short strings, and the sets are constructed in a way that does not require a priori knowledge of the grammar. After training, the networks are tested using various test sets with strings of length up to 1000, and are often able to correctly classify all the test strings. These results are comparable to those obtained with second-order networks (Giles et al. 1992; Watrous and Kuhn 1992a; Zeng et al. 1993). We observe that the networks emulate finite state automata, confirming the results of other authors, and we use a vector quantization algorithm to extract deterministic finite state automata after training and during testing of the networks, obtaining a table listing the start state, accept states, reject states, and all transitions from the states, as well as some useful statistics. We examine the correspondence between finite state automata and neural networks in detail, showing two major stages in the learning process. To this end, we use a graphics module, which graphically depicts the states of the network during the learning and testing phases. We examine the networks' performance when tested on strings much longer than those in the training set, noting a measure based on clustering that is correlated to the stability of the networks. Finally, we observe that with sufficiently long training times, neural networks can become true finite state automata, due to the attractor structure of their dynamics.

¹Current address: Department of Computer Sciences, Taylor Hall, University of Texas at Austin, Austin, TX 78712-1188 USA.

Neural Computation 6, 1155-1173 (1994) © 1994 Massachusetts Institute of Technology
1 Introduction
The NP-hard problem of inferring deterministic finite state automata (DFA) from finite training sets that include both positive and negative examples has been addressed by many researchers. For an up-to-date listing of algorithmic inference methods see Miclet (1990). Researchers have also trained both first- and second-order recurrent neural networks to induce grammars. When speaking of "grammars," we will refer only to regular grammars, which are equivalent to finite state automata. For a proof of this equivalence, and for an introduction to formal language theory, see Harrison (1978). We use recurrent networks to infer DFA because recurrent networks have a memory and can process strings of any length. For a more detailed discussion of the advantages of recurrent networks, see Elman (1990). Furthermore, we argue that the recurrent architecture we use is a universal approximator for discrete-time, time-invariant dynamic systems, and can therefore approximate any DFA. The researchers who have used first-order recurrent networks have concentrated on the Reber (1967) grammar, and have trained networks to predict the next symbol of a string generated from the grammar. Cleeremans et al. (1989) used a simple recurrent network architecture proposed by Elman (1990), computing a truncated gradient during learning, and succeeded in training networks to learn the Reber grammar. They had some difficulty with the embedded Reber grammar, for which Smith and Zipser (1989), who used real-time recurrent learning that computes the full gradient, were able to successfully train networks. The researchers who used second-order recurrent networks concentrated on the Tomita (1982) grammars, and instead of training networks to predict the next symbol of a string, they trained networks to decide whether a string is grammatical. Watrous and Kuhn (1992a), using a learning algorithm that computes the full gradient, were able to train networks that can generalize for a subset of the Tomita grammars. Giles et al. (1992), using a real-time forward training algorithm that also computes the full gradient, had successful results for Tomita grammar 4, and reported that similar results were obtained for the other Tomita grammars. Giles et al. (1992) and Omlin et al. (1992) also presented a finite state automaton extraction algorithm used to extract the network's conception of the finite state automaton it is learning. Fanelli (1993) has trained Elman-type nets to infer some Tomita grammars using full gradient computation. We consider the Tomita grammars over {0,1}*, since work in this area has tended to focus on these grammars. Specifically, we consider the
following subset of the seven Tomita (1982) grammars:

L1. 1*
L2. (10)*
L4. no more than two 0s in a row
L6. number of 1s - number of 0s = 0 (mod 3)
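For concreteness, the four grammars can be written as simple membership predicates. The following sketch (in Python, used for all illustrative code in this rewrite; it is our illustration, not the authors' implementation) also constructs the 31-string training set of all binary strings of length ≤ 4 used later in the text:

    def tomita1(s):  # L1: strings of 1s only (the language 1*)
        return '0' not in s

    def tomita2(s):  # L2: repetitions of "10" (the language (10)*)
        return s == '10' * (len(s) // 2)

    def tomita4(s):  # L4: no more than two 0s in a row
        return '000' not in s

    def tomita6(s):  # L6: number of 1s - number of 0s = 0 (mod 3)
        return (s.count('1') - s.count('0')) % 3 == 0

    # All 31 binary strings of length <= 4, including the NULL (empty) string.
    training_set = [''] + [format(i, 'b').zfill(n)
                           for n in range(1, 5) for i in range(2 ** n)]
    assert len(training_set) == 31
    assert sum(map(tomita1, training_set)) == 5  # five positive examples for L1
    assert sum(map(tomita2, training_set)) == 3  # three positive examples for L2

The two assertions match the counts of grammatically correct training strings reported for Tomita 1 and 2 in Section 3.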
The architecture, learning, and training procedures are discussed in the next two sections. A finite state automaton extraction algorithm similar to that of Giles et al. (1992) is presented in Section 4. In order to better understand the behavior of the networks, we develop a graphics algorithm (Section 5) that allows us to depict various aspects of the networks' behavior. In the results section, we show that the architecture is capable of learning the Tomita grammars studied, with only a restricted number of examples. Also in the results section, we analyze the learning process in detail, examining clustering during and after training and testing, noting a feature of the networks based on clustering, which is correlated to their stability, and giving an example of a neural network that becomes a DFA.
2 Architecture

Motivated by Hornik et al. (1989), who proved that feedforward networks with one hidden layer are universal approximators, we argue that the architecture we consider is the simplest first-order architecture with two recurrent layers that is a universal approximator for discrete-time, time-invariant dynamic systems. Such a system can be described by the following formulas:

    y(t) = \Omega[x(t-1), u(t-1)]
    x(t) = \Phi[x(t-1), u(t-1)]

where y(t) \in \Re^m is the output of the system at time t, x(t) \in \Re^n is the state of the system at time t, u(t) \in \Re^p is the input of the system at time t, and t is a nonnegative integer. With two feedforward networks, one of which approximates \Omega and the other \Phi, we can approximate any discrete-time, time-invariant dynamic system. The resulting architecture is shown in Figure 1a and b. If we connect Y to H2, H2 to X, Y to H1, and H1 to Y, the resulting architecture (shown in Fig. 1c) loses no generality, because with weights of 0 for all the added connections it is the same as before, i.e., Figure 1b is a specific case of Figure 1c. Note that layers H1 and H2 of Figure 1c have the same connections and can be combined into one hidden layer, and similarly, layers X and Y can be combined into a single state-output layer. The resulting three-layer recurrent network architecture, shown in Figure 1d and e, consists of input, hidden, state,
Figure 1: (a) and (b) are two views of an architecture that is a universal approximator for the two functions \Omega and \Phi that describe a discrete-time, time-invariant dynamic system. By adding connections from Y to H2, H2 to X, Y to H1, and H1 to Y, we obtain the structure (c), which reduces to the architecture of (d) and (e). In (e), the three layers are spread out and shown during successive time steps t, t+1, ..., t+n. The input units, together with the state units, feed into the hidden units. The hidden units in turn modify the state units. A nonempty subset of the state units is called the output units. These are treated exactly as the state units, except that their activations are considered the output of the network. A bias unit connected to all the other units is also part of the architecture, but is not shown.
and bias units, and is the architecturally simplest first-order network with two recurrent layers that is a universal approximator for discrete-time, time-invariant dynamic systems. A subset of the state units corresponds to the output units, which are treated exactly as the state units, except that their activations are considered the output of the network. The activations of units can be any value in the closed interval [0,1].
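As an illustration of the combined state-output layer, here is a minimal sketch of the forward dynamics. The sigmoid activation, zero initial state, and weight initialization are our assumptions for the example; the unit counts match those reported below for the Tomita 1 and 2 networks (one input unit, two state units, two hidden units):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class StateOutputRNN:
        """Input and state units feed the hidden units; the hidden units
        in turn update the state units. A designated subset of the state
        units serves as the output (here, state unit 0)."""

        def __init__(self, n_in=1, n_hidden=2, n_state=2, seed=0):
            rng = np.random.RandomState(seed)
            # each weight matrix has an extra column for the bias unit
            self.W_h = 0.5 * rng.randn(n_hidden, 1 + n_in + n_state)
            self.W_s = 0.5 * rng.randn(n_state, 1 + n_hidden)
            self.n_state = n_state

        def step(self, state, u):
            """x(t) = Phi[x(t-1), u(t-1)], with activations in [0, 1]."""
            h = sigmoid(self.W_h @ np.concatenate(([1.0], u, state)))
            return sigmoid(self.W_s @ np.concatenate(([1.0], h)))

        def classify(self, string):
            state = np.zeros(self.n_state)  # fixed network start state
            for ch in string:
                state = self.step(state, np.array([float(ch)]))
            return state[0]  # near 1: accept; near 0: reject

    net = StateOutputRNN()
    print(net.classify('1010'))  # untrained output, some value in (0, 1)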
3 Learning Algorithm and Training

We train the architecture with full gradient computation, using backpropagation through time (Rumelhart et al. 1986). An epoch consists of the
presentation of all strings in the training set once. Batch updating is used, i.e., the error is accumulated and the weights of the network are modified at the end of each training epoch, so that the order in which the training set is presented is immaterial. The architecture is trained on a subset of the Tomita grammars, specifically grammars one, two, four, and six. No attempt is made to train the architecture on the other Tomita grammars. One input unit, two state units with one designated as output, and two hidden units are used for Tomita 1 and 2, while three hidden units are used for Tomita 4 and 6. Therefore, for Tomita 1 and 2 there are 14 trainable weights, and for Tomita 4 and 6 there are 20, including the biases. The learning rates and momentum terms used for Tomita grammars 1, 2, and 4 are held constant during training at 0.1 and 0.5, respectively. The learning rates and momentum terms used for Tomita grammar 6 are 0.1 and 0.5, respectively, for the first 1K epochs, and are then set to 0.01 and 0.05, respectively, for the rest of training. For this application, an activation of 1 at the output unit is used to indicate a grammatically correct string, and an activation of 0 a grammatically incorrect string. The set of training strings used is small and their length short, so that we can establish a tight upper bound on the information required by a network in order to achieve good generalization ability. The training sets used for all the Tomita grammars consist only of all strings up to length four, including the NULL (empty) string. This means that the training sets of Tomita 1 and Tomita 2 contain only five and three grammatically correct strings, respectively. In spite of this, our networks are not only able to correctly classify all the strings in the training set, but are also able to correctly classify strings of arbitrary length, as discussed in Section 6. In addition, the networks can induce the minimal DFA that correspond to the grammars. The networks had some difficulty with Tomita 6 because there are many local minima. To avoid the local minima, we train networks for 1K epochs and select for further training the first four networks whose RMS error is less than 0.16. Other work in the field has tended to use much longer training strings, and more elaborate procedures for modifying the frequency with which strings in the training set are presented. Both of these procedures help keep the network out of local minima, thereby making it easier for the network to converge, but in our case we find that such procedures are not necessary.
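The weight update just described is the standard batch gradient step with a momentum term. A minimal sketch of one epoch's update, assuming the full BPTT gradient has already been accumulated over the training set (the names and conventions are ours, not the authors'):

    def epoch_update(w, grad, velocity, lr=0.1, momentum=0.5):
        """Apply one batch update at the end of an epoch. Because the
        error gradient is accumulated over all training strings first,
        the order of presentation within the epoch is immaterial."""
        velocity = momentum * velocity - lr * grad
        return w + velocity, velocity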
4 Deterministic Finite State Automaton Extraction Algorithm
Motivated by Cleeremans et al. (1989), who used cluster analysis to determine the internal representations of their networks, and by Giles et al. (1992) and Watrous and Kuhn (1992b), who present processes for extracting FSA from second-order networks, we have developed a vector-quantization-based DFA extraction algorithm similar to that of Giles et al. (1992),
Watrous and Kuhn (1992b), and Zeng et al. (1993). This algorithm is used to determine the DFA that the network is approximating. The networks we obtain are usually able to induce the minimal DFA for all the Tomita grammars we consider. Some DFA induced are not minimal, but are reducible to the minimal DFA using a standard DFA minimization algorithm (Harrison 1978). Following the spirit of the discussion of the architecture, we use the state units alone to represent the state of the system being approximated. If the cardinality of the set of state units is β, and the activations of units can be any value in the closed interval [0,1], then the activations of the state units can be mapped to a point in a β-dimensional unit hypercube. We will call such points network states, and we will call the hypercube the state space. When all network states corresponding to the activations of the set of state units in response to a set of test strings are plotted, it is observed that the network states are not uniformly distributed, and in fact tend to cluster together. The DFA extraction algorithm can confirm that these clusters correspond to the states of a DFA that is being approximated. The DFA extraction algorithm is given as input the sequence of network states visited by the network in the β-dimensional hypercube in response to a test set, and analyzes the network states, trying to identify the clusters. This is done by randomly distributing n markers within the hypercube. For every network state, the closest marker is moved toward the network state a certain distance d, which is equal to the distance between the marker and the network state divided by the number of times the marker has been moved plus one. This guarantees that any marker moved will lie exactly at the centroid of the network states it is closest to. Now, every marker moved should represent one cluster of network states. The DFA extraction algorithm also verifies that the clusters found can be identified with states in a DFA; i.e., as a network state is associated with a certain cluster, the algorithm checks that the possible transitions out of the network state are exactly the same as those out of the other network states in the cluster. For example, if one marker represents two clusters, then the network states of one cluster will have different transitions than the network states of the other cluster, and it will seem that we have a nondeterministic finite state automaton. The algorithm will reject this result and will restart itself with a different set of randomly distributed markers. Several such consistency checks are performed, and statistics are compiled giving the centroid, standard deviation, and distance to the farthest point in each cluster. If a deterministic automaton can be extracted, then a state table that identifies the start state, all accept states, and all transitions into and out of states, as well as all the above-mentioned statistics, is printed.
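The marker-moving step at the heart of the extraction algorithm is a simple vector quantization. Here is a sketch under our reading of the update rule (the consistency checks against transition structure and the restarts are omitted):

    import numpy as np

    def place_markers(network_states, n_markers, rng=np.random):
        """Cluster network states in the unit hypercube. Each state pulls
        its nearest marker toward it by dist/(moves so far + 1), so a
        moved marker ends up exactly at the centroid of the states it
        claimed."""
        dim = len(network_states[0])
        markers = rng.rand(n_markers, dim)
        moves = np.zeros(n_markers, dtype=int)
        for s in network_states:
            k = int(np.argmin(np.linalg.norm(markers - s, axis=1)))
            moves[k] += 1
            markers[k] += (s - markers[k]) / moves[k]  # running centroid
        return markers[moves > 0], moves[moves > 0]

The division by the incremented move count is exactly the running-mean update, which is why each moved marker finishes at the centroid of its assigned network states.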
Note that, in general, it is not possible to explore the state space completely, since there are a countably infinite number of possible test strings in {0,1}*, and different test sets will cause the network to visit different sets of network states. Hence, it is possible that given two different test sets, the algorithm will succeed in extracting a DFA for one of the sets and fail for the other. We have obtained such results with networks that have limited generalization ability. Thus, it is useful that a DFA extraction algorithm allow for alternate explorations of the state space, and this is an important feature of our algorithm.

5 Graphics

In order to further understand the behavior of the networks, we have developed a graphics algorithm that we use to view the behavior of the networks during training and testing. The graphics algorithm displays only a two-dimensional view of the activation space of the state units (the hypercube), but since the state space of our networks is only two-dimensional, this does not pose any difficulties. To view the behavior of the networks during training, the algorithm requires several input sequences that depend on the training set. The first sequence is provided by the DFA extraction algorithm, which partitions into clusters all the network states in the state space of the fully trained network in response to the training set, and outputs a sequence of pairs consisting of network states and the clusters they belong to. The remaining sequences consist of the coordinates of network states resulting from the presentation of the training set to the network at varying degrees of training. The first input sequence is used to plot all network states belonging to the same final cluster with the same color and to plot each cluster with a different color. The other sequences are used to show the evolution of states visited by the network as training progresses. That is, initially the state space of the network without training is presented, then the state space after a few epochs, and so on, until at last the state space of the network after all learning has been completed is presented. The result is an animated graphic presentation of the learning process. Using the graphics algorithm to view networks during training, we observe that the networks go through two stages, which we call the decide and reinforce stages. We will have more to say about the two stages in the next section, so we will only give a quick overview now. During the decide stage, the network decides which DFA it will approximate, and during the reinforce stage, the network strengthens its approximation of that DFA. The reinforce stage consists of tightening the clusters, or equivalently, forcing every network state in a cluster toward the cluster centroid. To view the behavior of the networks during testing, the algorithm requires input sequences that depend on the training set and several test sets. All sequences are provided by the DFA extraction algorithm, with the requirement that for all the test sets the extracted DFA are isomorphic, a requirement met by all the data considered. The first sequence
is the same as the first sequence discussed above. Each of the other sequences consists of the coordinates of network states resulting from the presentation of one of the test sets to the fully trained network. As before, the first input sequence is used to plot all network states belonging to the same cluster with the same color and to plot each cluster with a unique color. The remaining sequences are used to show the network states in the hypercube visited by the network as longer strings are considered. That is, initially the network states visited while processing the training set are presented, then the network states visited as some new strings are considered, and so on, until at last the state space of the network after all test strings have been considered is presented. Using the graphics algorithm with test sets consisting of many more strings than the training set allows us to graphically view the performance of the network on strings that it has never before encountered. Our training sets consist of all 31 strings of length ≤ 4, and the test sets we use with the graphics algorithm consist of all strings of length ≤ 5, ≤ 6, ..., ≤ 12. The result is an animated graphic presentation of the testing process.

6 Results
For all the Tomita grammars we consider, we train randomly initialized networks on the training set (all strings of length ≤ 4), test the networks' generalization by using various test sets, and extract DFA for each test set using the DFA extraction algorithm. For Tomita grammars 1, 2, and 4, we use eight test sets, namely all strings of length ≤ 5, ≤ 6, ..., ≤ 12. Many of our networks can correctly classify all strings in the test sets and infer a correct DFA, which retains its structure for all the test sets. For Tomita 6, we use the eight previously mentioned test sets as well as two additional test sets: 1000 random strings of length ≤ 100, and 1000 random strings of length ≤ 1000. Our networks have the most difficulty learning Tomita 6 because there are many local minima. Given the difficulty of Tomita 6, our results will focus on this grammar. One consequence of using a training set as small as ours is the possibility that there is more than one minimal DFA that accepts all the positive and rejects all the negative examples. This is the case with Tomita 4, where from two different networks trained on the same training set, we extract two different minimal DFA (Fig. 2). This is an important point, because if a network is not given enough examples to constrain the grammar in question, then the network may not be able to generalize. One would then conclude that the network is unable to learn the grammar, but this is not a valid conclusion, since with a different training set it is possible that the network will learn the grammar. Watrous and Kuhn (1992a) suggest that the reason why some of their networks are not able to generalize is that the grammar is not sufficiently constrained by their training sets. However, knowing whether a training set contains enough
Figure 2: The two DFA shown (a, b) were obtained from two different networks trained on Tomita 4 with the same training set, using the DFA extraction algorithm described previously. The heavy arrow indicates the start state, and accept states are indicated by two circles. Notice that both will correctly classify all strings of length ≤ 4, but (a) will not be able to correctly classify the string 00100.
strings to uniquely constrain the set of possible minimal grammars requires some knowledge of the grammar in question, and in many cases such knowledge is not available. See Fanelli (1993) for further discussion. Tomita 6 is the hardest grammar for our networks to learn, but even so, network 1 correctly classifies all strings in the test sets to within 0.02 of the target output. Note that the networks are trained using only the training set and that after all learning is over, the test sets are used to
Figure 3: This graph shows the RMS errors for network 1 on the training set and on various test sets. The network is trained for 2^19 epochs on all strings whose length is ≤ 4. The state of the network at various points in training is saved, and later tested on ten test sets. The test sets are all strings of length ≤ 5, ≤ 6, ..., ≤ 12, 1000 random strings of length ≤ 100, and 1000 random strings of length ≤ 1000. The results for the training set and each test set are shown from left to right, respectively. The fully trained network is able to correctly classify all the strings in all the test sets, with a maximum RMS error of 0.0075.

gauge the generalization ability of the networks. Figure 3 shows the root mean squared error for network 1. The DFA extracted from network 1 is shown in Figure 4 along with the minimal DFA for Tomita 6, to which it can be reduced by standard methods. Each of the six states of the extracted DFA corresponds to a cluster, a set of network states in the β-dimensional hypercube. Recall that the DFA extraction algorithm determines the centroids and standard deviations of these clusters. We introduce the term "tightness" of a cluster, which is inversely related to the standard deviation of the geometric distances of the network states from their centroid.
Figure 4: DFA 1 is the actual DFA extracted from network 1, using the DFA extraction algorithm. Using a standard minimization algorithm, we can reduce it to the minimal DFA for Tomita 6.

The smaller this measure, the "tighter" the cluster, and the larger this measure, the "looser" the cluster. Using the graphics algorithm of the previous section to view network 1 (Figs. 3 and 4) during training, we see (Fig. 5) that initially all network states are in one cluster, but as training continues, the network states start to move around, even traveling from one end of the hypercube to the other as they form clusters. After 2^10 training epochs, the clusters stay fixed and start to tighten, i.e., converge to the cluster centroid. Note that cluster 1 of network 1 contains only the fixed network start state and can therefore never change. All our networks exhibit similar behavior, and to analyze these results, we use the following procedure. We extract the DFA from the trained network, noting which network states belong to which clusters. We then look at the network at various points in the training process, grouping network states that will be in the same cluster into sets, and calculate the standard deviations of the resulting sets at these points. The results of this procedure on network 1 are presented in Figure 6, where we notice that the standard deviations of the clusters behave erratically at first, and then start to approach zero. From these results, we conclude that there are two stages during learning: the decide and reinforce stages. We observe that during the decide stage, the cluster centroids move around in the space. Once the network has decided on the DFA it will approximate, there is a transition to the reinforce stage, where it strengthens its approximation of the DFA.
Figure 5: This figure shows the network states of network 1 during training, showing the development of clusters that correspond to network states. We find that initially all points cluster together, and that eventually they spread apart, so that both network states and clusters travel through the state space until they settle on a certain location. This is part of the decide stage and in this case lasts about 2^10 (1K) epochs, as can be seen in the diagram.

During the reinforce stage, individual network states converge to the cluster centroid, hence the standard deviation approaches zero. This transition can be observed when the standard deviations of the clusters stop behaving erratically and start to get smaller (at around 2^10 epochs in Fig. 6). Having seen what happens during training, we consider what happens after the training is over, when networks are tested on strings they have not previously encountered. For this phase of analysis, we test the networks first on the training set, and then on all strings up to length 5, length 6, ..., length 12, and, for Tomita 6, 1000 random strings of length ≤ 100 and 1000 random strings of length ≤ 1000. Using the DFA extraction algorithm, we find that networks that learn the training set are approximating DFA and that many of these DFA remain stable for all the test sets. The graphics algorithm is used to confirm this (see Fig. 7 for the example of network 1).
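For reference, the per-cluster statistics plotted in Figure 6 are straightforward to compute. A small sketch of our formulation of the "tightness" statistics defined above:

    import numpy as np

    def cluster_stats(states):
        """Centroid, standard deviation of the distances to the centroid
        (the inverse notion of 'tightness'), and distance to the farthest
        point in the cluster."""
        states = np.asarray(states)
        centroid = states.mean(axis=0)
        dists = np.linalg.norm(states - centroid, axis=1)
        return centroid, dists.std(), dists.max()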
Figure 6: This graph shows the standard deviations of the six clusters of network 1 (Figs. 3, 4, and 5) during training. From left to right we have clusters 1, 5, 4, 2, 6, and 3. Large changes in the standard deviations up to 2^10 epochs reflect the movement of network states and clusters during the decide stage. The 2^10 epoch marks the transition from the decide to the reinforce stage, within which the network strengthens the approximation of the DFA it has decided on. During the reinforce stage, the clusters stay fixed in the state space and the graph shows the network states in the clusters converging.

We find that for network 1, the DFA structure not only remains stable, but the clusters do not grow in size, even when tested on strings with up to 1000 characters. To confirm this performance, we test network 1 using two different simulators written independently by each of the authors. The simulators run on different computer architectures, and their results for a test set containing 1000 random strings of length ≤ 100 are identical. Analysis of the networks' behavior on test sets that include very long strings suggests that as clusters are revisited, new network states are generated, but at a certain point the clusters become absolutely stable, neither growing nor moving, perhaps shrinking.
Figure 7: These graphs show how network 1 behaves during testing. Test Set 12 contains all strings of length ≤ 12. Test Set 1000 contains 1000 random strings of up to length 1000. Note that Test Set 1000 does not contain cluster 1, because cluster 1 contains only the fixed network start state and the empty string is not in the test set. For similar reasons, Test Set 1000 1s does not generate clusters 5 or 6. The DFA extraction algorithm is used to partition the network states resulting from the presentation of the Training Set and Test Set 12.
At that point, the network is in effect a DFA. We suspect that some of this is due to numerical round-off inherent in limited computer precision, but that the primary effect is due to the attractor structure of the network considered as a dynamic system. An example of this behavior can be seen in Figure 7. When the network is presented with a string of 1000 1s, after the twenty-third 1 the same three points are revisited in succession, within the precision of the floating-point processor. Thus we conclude that the simulation has entered a limit cycle, but due to machine round-off we cannot say exactly what kind of attractor the ideal network would have entered.
Figure 8: These graphs show how network 2, which performs flawlessly on the training set but is not able to generalize well, behaves during testing. Using the DFA extraction algorithm, we can extract the clusters shown for the Training Set (all strings of length ≤ 4). However, the DFA extraction algorithm fails to find a consistent DFA for Test Set 5 (all strings of length ≤ 5) because clusters 1 and 3 have loosened. For Test Sets 6 and 12 (all strings of length ≤ 6, ≤ 12), there are no identifiable clusters, but as Test Set 1000 1s (all strings in the grammar 1* of length ≤ 1000) shows, the network can retain its automaton structure for a subset of the grammar.
All four networks trained on Tomita 6 can correctly classify all strings in the training set; however, only two become DFA. An example of a network that does not become a DFA is network 2, for which we can extract a DFA only for the training set. When we view the behavior of this network on various test sets, we see that the network's clusters quickly lose their cohesion (Fig. 8). However, when tested on a string of 1000 1s, we see that network 2's performance is flawless. Recall that the network is emulating the two functions \Omega and \Phi, which map the current state and input to the next output and state, respectively. Any particular DFA corresponds to a class of such functions that emulate it either approximately or, through their attractor structure, exactly. What Figure 8 shows is that network 2 approximates these two functions exactly for only a proper subset of their domains.
Figure 9: The two graphs show the standard deviations of the clusters for network 4, trained on Tomita 6 for 2^18 (256K) and 2^19 (512K) epochs, when tested on all strings of length ≤ 4, ..., ≤ 12. Both versions of the network induced equivalent DFA, and corresponding clusters are shown in the same order. These graphs illustrate that there is a threshold, and when the standard deviations of network clusters are below the threshold, as is the case with the clusters of network 4 trained for 2^19 (512K) epochs, they do not grow further. The standard deviations of the clusters of network 4 trained for 2^18 (256K) epochs are not below the threshold, and therefore expand to the point where the DFA extraction algorithm cannot extract a DFA for test sets containing all strings of length 8 and above.
We observe that as training is increased, network clusters tighten and converge to the cluster centroid. Figure 9 shows the standard deviations of the clusters for network 4 when fully trained and when trained for only 2^18 epochs. When clusters are sufficiently tight, they do not grow further; i.e., there is a threshold, and when cluster standard deviations are below the threshold, the clusters do not loosen. On the other hand, when cluster deviations are above the threshold, the clusters loosen as the network is presented with longer strings. As Figure 9 shows, the clusters of network 4 trained for 2^19 epochs are tight, and the network is able to retain its automaton structure (Fig. 10). However, the clusters of network 4 trained for 2^18 epochs are loose (Fig. 9); they start to overlap with other clusters, and the network's state structure becomes poorly defined (Fig. 10).
Figure 10: These graphs show how two versions of network 4, one trained for 2^18 (256K) epochs and the other for 2^19 (512K) epochs, behave when tested on various test sets. The clusters of the 256K network are not tight enough to be below the threshold, and the network is not able to retain its automaton structure. After additional training, the standard deviations of the resulting 512K network's clusters are below the threshold, and the network is able to retain its automaton structure indefinitely. The Training Set contains all strings of length ≤ 4, and Test Sets 7 and 12 contain all strings of length ≤ 7 and ≤ 12, respectively.
7 Discussion
Given that our architecture is arguably a universal approximator for discrete-time, time-invariant dynamic systems, we expect, in principle, that it can learn the Tomita grammars, and we see, in practice, that using very small networks containing only a few units, it has been able to learn all of the Tomita grammars attempted. The size of the training sets has been kept small (31 strings), and the training strings used have been short (length ≤ 4), in comparison to those used in other studies. Many networks can generalize from the training sets to strings of length
12. Some can become DFA and therefore remain stable and generalize perfectly on strings of any length. We use the DFA extraction algorithm, which has been implemented as a C program, to determine whether the networks are inducing deterministic finite state automata. The algorithm gives an excellent indication of how well the network approximates a particular DFA when processing a given test set of strings, but as we point out in Figures 9 and 10, two networks that have induced equivalent DFA on the training set can generalize to varying degrees. By comparing the standard deviations of clusters, our DFA extraction algorithm allows us to distinguish between such networks. Note that once a DFA is extracted, one can write a program to simulate the DFA, with the obvious advantages that the program will always be correct and will also be more efficient. We use the DFA algorithm in combination with the graphics algorithm to view the development and modification of the clusters during training and testing of the networks. We observe two stages during learning: the decide and reinforce stages. During the decide stage, the network settles on a specific DFA. During the reinforce stage, the clusters shrink and the network strengthens its approximation of the DFA. We use the standard deviation of the network states of a cluster as a quantitative measure of its tightness. The smaller this measure, the tighter the cluster. We have found that in certain networks, when the standard deviation of the clusters reaches a threshold, the network becomes a DFA and is capable of correctly classifying strings of arbitrary length. The networks are therefore capable of absolute stability and flawless generalization ability. The ability of some extensively trained (2^19 epochs) networks to do this is a notable result of this work. We note that Hirsch (1989) has discussed the possibility of the correspondence of the attracting equilibria of convergent recurrent networks to the states of a finite automaton, and of the induction of such a correspondence through training. The details and some further instances of this phenomenon are currently under investigation. We think that the above work helps to clarify the relationship between DFA and recurrent neural networks trained to emulate them.
References

Angluin, D., and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Computing Surv. 15(3), 237-269.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1(3), 372-381.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Fanelli, R. 1993. Grammatical Inference and Approximation of Finite Automata by Simple Recurrent Neural Networks Trained with Full Forward Error Propagation.
Tech. Rep. NNRG 930811A, Dept. of Physics, Brooklyn College of the City University of New York.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4(3), 393-405.
Harrison, M. H. 1978. Introduction to Formal Language Theory. Addison-Wesley, Reading, MA.
Hirsch, M. W. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2(5), 331-349.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Miclet, L. 1990. Grammatical inference. In Syntactical and Structural Pattern Recognition: Theory and Applications, H. Bunke and A. Sanfeliu, eds., Chap. 9. World Scientific, Singapore.
Omlin, C. W., Giles, C. L., and Miller, C. B. 1992. Heuristics for the extraction of rules from discrete-time recurrent neural networks. Int. Joint Conf. Neural Networks I, 33-38.
Reber, A. S. 1967. Implicit learning of artificial grammars. J. Verbal Learning Verbal Behav. 6, 855-863.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by backpropagating errors. Nature (London) 323, 533-536.
Smith, A. W., and Zipser, D. 1989. Encoding sequential structure: Experience with the real-time recurrent learning algorithm. Proc. Int. Joint Conf. Neural Networks I, 645-648.
Tomita, M. 1982. Dynamic construction of finite automata from examples using hill-climbing. Proc. Fourth Int. Cog. Sci. Conf., pp. 105-108.
Watrous, R. L., and Kuhn, G. M. 1992a. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4(3), 406-414.
Watrous, R. L., and Kuhn, G. M. 1992b. Induction of finite-state automata using second-order recurrent networks. In Advances in Neural Information Processing Systems 4, J. Moody et al., eds., pp. 309-316. Morgan Kaufmann, San Mateo, CA.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270-280.
Zeng, Z., Goodman, R. M., and Smyth, P. 1993. Self-clustering recurrent networks. IEEE Int. Conf. Neural Networks I, 33-38.
Received September 7, 1993; accepted February 10, 1994.
Communicated by Radford M. Neal
Learning in Boltzmann Trees

Lawrence Saul
Department of Physics, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

We introduce a large family of Boltzmann machines that can be trained by standard gradient descent. The networks can have one or more layers of hidden units, with tree-like connectivity. We show how to implement the supervised learning algorithm for these Boltzmann machines exactly, without resort to simulated or mean-field annealing. The stochastic averages that yield the gradients in weight space are computed by the technique of decimation. We present results on the problems of N-bit parity and the detection of hidden symmetries.

Neural Computation 6, 1174-1184 (1994) © 1994 Massachusetts Institute of Technology

1 Introduction

Boltzmann machines (Ackley et al. 1985) have several compelling virtues. Unlike simple perceptrons, they can solve problems that are not linearly separable. The learning rule, simple and locally based, lends itself to massive parallelism. The theory of Boltzmann learning, moreover, has a solid foundation in statistical mechanics. Unfortunately, Boltzmann machines, as originally conceived, also have some serious drawbacks. In practice, they are relatively slow. Simulated annealing (Kirkpatrick et al. 1983), though effective, entails a great deal of computation. Finally, compared to backpropagation networks (Rumelhart et al. 1986), where weight updates are computed by the chain rule, Boltzmann machines lack a certain degree of exactitude. Monte Carlo estimates of stochastic averages (Binder and Heermann 1988) are not sufficiently accurate to permit further refinements to the learning rule, such as quasi-Newton or conjugate-gradient techniques (Press et al. 1986). There have been efforts to overcome these difficulties. Peterson and Anderson (1987) introduced a mean-field version of the original Boltzmann learning rule. For many problems, this approximation works surprisingly well (Hinton 1989), so that mean-field Boltzmann machines learn much more quickly than their stochastic counterparts. Under certain circumstances, however, the approximation breaks down, and the
mean-field learning rule works badly if at all (Galland 1993). Another approach (Hopfield 1987) is to focus on Boltzmann machines with architectures simple enough to permit exact computations. Learning then proceeds by straightforward gradient descent on the cost function (Yair and Gersho 1988), without the need for simulated or mean-field annealing. Hopfield (1987) wrote down the complete set of learning equations for a Boltzmann machine with one layer of noninterconnected hidden units. Freund and Haussler (1992) derived the analogous equations for the problem of unsupervised learning. In this paper, we pursue this strategy further, concentrating on the case of supervised learning. We exhibit a large family of architectures for which it is possible to implement the Boltzmann learning rule in an exact way. The networks in this family have a hierarchical structure with tree-like connectivity. In general, they can have one or more layers of hidden units. We call them Boltzmann trees; an example is shown in Figure 1. We use a decimation technique from statistical physics to compute the averages in the Boltzmann learning rule. After describing the method, we give results on the problems of N-bit parity and the detection of hidden symmetries (Sejnowski et al. 1986). We also compare the performance of deterministic and true Boltzmann learning. Finally, we discuss a number of possible extensions to our work.

Figure 1: Boltzmann tree with two layers of hidden units. The input units (not shown) are fully connected to all the units in the tree.

2 Boltzmann Machines
We briefly review the learning algorithm for the Boltzmann machine (Hertz et al. 1991). The Boltzmann machine is a recurrent network with
binary units S_i = \pm 1 and symmetric weights w_{ij} = w_{ji}. Each configuration of units in the network represents a state of energy

    H = -\sum_{ij} w_{ij} S_i S_j    (2.1)
The network operates in a stochastic environment in which states of lower energy are favored. The units in the network change states with probability

    p(S_i \to -S_i) = \frac{1}{1 + e^{\Delta H/T}}    (2.2)

Once the network has equilibrated, the probability of finding it in a particular state obeys the Boltzmann distribution from statistical mechanics:

    P = \frac{e^{-H/T}}{Z}    (2.3)
The partition function Z = \sum e^{-H/T} is the weighted sum over states needed to normalize the Boltzmann distribution. The temperature T determines the amount of noise in the network; as the temperature is decreased, the network is restricted to states of lower energy. We consider a network with I input units, H hidden units, and O output units. The problem to be solved is one of supervised learning. Input patterns are selected from a training set with probability P^*(I_c). Likewise, target outputs are drawn from a probability distribution P^*(O_t | I_c). The goal is to teach the network the desired associations. Both the input and output patterns are binary. A particular example is said to be learned if, after clamping the input units to the selected input pattern and waiting for the network to equilibrate, the output units are in the desired target states. A suitable cost function for this supervised learning problem is

    E = \sum_c P^*(I_c) \sum_t P^*(O_t | I_c) \ln \frac{P^*(O_t | I_c)}{P(O_t | I_c)}    (2.4)

where P^*(O_t | I_c) and P(O_t | I_c) are the desired and observed probabilities that the output units have pattern O_t when the input units are clamped to pattern I_c. The Boltzmann learning algorithm attempts to minimize this cost function by gradient descent. The calculation of the gradients in weight space is straightforward. The final result is the Boltzmann learning rule
    \Delta w_{ij} = \frac{\eta}{T} \sum_c P^*(I_c) \left[ \langle S_i S_j \rangle_c^{+} - \langle S_i S_j \rangle_c \right]    (2.5)
where the brackets \langle \cdots \rangle indicate expectation values over the Boltzmann distribution. The gradients in weight space depend on two sets of correlations: one in which the O output units are clamped to their desired
targets, the other in which they are allowed to equilibrate. In both cases, the I input units are clamped to the pattern being learned. The differences in these correlations, averaged over the examples in the training set, yield the weight updates \Delta w_{ij}. An on-line version of Boltzmann learning is obtained by foregoing the average over input patterns and updating the weights after each example. Finally, the parameter \eta sets the learning rate. The main drawback of Boltzmann learning is that, in most networks, it is not possible to compute the gradients in weight space directly. Instead, one must resort to estimating the correlations \langle S_i S_j \rangle by Monte Carlo simulation (Binder et al. 1988). The method of simulated annealing (Kirkpatrick et al. 1983) leads to accurate estimates but has the disadvantage of being very computation-intensive. A mean-field version of the algorithm (Peterson and Anderson 1987) was proposed to speed up learning. It makes the approximation \langle S_i S_j \rangle \approx \langle S_i \rangle \langle S_j \rangle in the learning rule and estimates the magnetizations \langle S_i \rangle by solving a set of nonlinear equations. This is done by iteration, combined when necessary with an annealing process. So-called mean-field annealing can yield an order-of-magnitude improvement in convergence. Clearly, however, the ideal algorithm would be one that computes expectation values exactly and does not involve the added complication of annealing. In the next section, we investigate a large family of networks amenable to exact computations of this sort.
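To make equations 2.4 and 2.5 concrete, the following brute-force sketch computes the exact correlations entering the learning rule for a tiny network by enumerating all states. This is feasible only for a handful of units (the decimation technique of the next section computes the same averages efficiently for trees), and the function names and conventions are ours:

    import itertools
    import numpy as np

    def pair_correlations(w, T, clamped):
        """Exact <S_i S_j> for a small Boltzmann machine with energy
        H = -sum_{i<j} w_ij S_i S_j, by enumerating the free units.
        `clamped` maps unit index -> fixed value (+1 or -1)."""
        n = w.shape[0]
        free = [i for i in range(n) if i not in clamped]
        corr, Z = np.zeros((n, n)), 0.0
        for vals in itertools.product([-1.0, 1.0], repeat=len(free)):
            S = np.zeros(n)
            for i, v in clamped.items():
                S[i] = v
            for i, v in zip(free, vals):
                S[i] = v
            p = np.exp(np.sum(np.triu(w, 1) * np.outer(S, S)) / T)  # e^{-H/T}
            Z += p
            corr += p * np.outer(S, S)
        return corr / Z

    # One term of eq. 2.5 for a given input pattern: clamp the inputs
    # alone, then the inputs together with the targets, and subtract:
    #   dw = (eta / T) * (pair_correlations(w, T, inputs_and_targets)
    #                     - pair_correlations(w, T, inputs_only))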
3 Boltzmann Trees
A Boltzmann tree is a Boltzmann machine whose hidden and output units have a special hierarchical organization. There are no restrictions on the input units, and, in general, we will assume them to be fully connected to the rest of the network. For convenience, we will focus on the case of one output unit; an example of such a Boltzmann tree is shown in Figure 1. Modifications to this basic architecture and the generalization to many output units will be discussed later. The key technique used to compute partition functions and expectation values in these trees is known as decimation (Eggarter 1974; Itzykson and Drouffe 1991). The idea behind decimation is the following. Consider three units connected in series, as shown in Figure 2a. Though not directly connected, the end units S_1 and S_2 have an effective interaction that is mediated by the middle one, S. Define the temperature-rescaled weights J_{ij} = w_{ij}/T. We claim that the combination of the two weights J_1 and J_2 in series has the same effect as a single weight J. Replacing the weights in this way, we have integrated out, or "decimated," the degree of freedom represented by the intermediate unit. To derive an expression for J, we require that the units S_1 and S_2 in both systems obey the same Boltzmann
Lawrence Saul and Michael I. Jordan
1178
distribution. This will be true if
where C is a constant prefactor, independent of S1 and S 2 . Enforcing equality for the possible values of S1 = f l and S2 = kl, we obtain the constraints
fiei1
= 2 cosh(Jl f J2)
It is straightforward to eliminate C and solve for the effective weight J. Omitting the algebra, we find tanh J
= tanh J1 tanh J2
(3.2)
Choosing J in this way, we ensure that all expectation values involving S1 and/or S2 will be the same in both systems. Decimation is a technique for combining weights "in series." The much simpler case of combining weights "in parallel" is illustrated in Figure 2b. In this case, the effective weight is simply the additive sum of 11 and 12, as can be seen by appealing to the energy function of the network, equation 2.1. Note that the rules for combining weights in series and in parallel are valid if either of the end units S1 or S 2 happen to be clamped. They also hold locally for weight combinations that are embedded in larger networks. The rules have simple analogs in other types of networks (e.g., the law for combining resistors in electric circuits). Indeed, the strategy for exploiting,these rules is a familiar one. Starting with a complicated network, we iterate the rules for combining weights until we have a simple network whose properties are easily computed. Clearly, the rules do not make all networks tractable; networks with full connectivity between hidden units, for example, cannot be systematically reduced. Hierarchical networks with tree-like connectivity, however, lend themselves naturally to these types of operations. Let us see how we can use these rules to implement the Boltzmann learning rule in an exact way. Consider the two-layer Boltzmann tree in Figure 1. The effect of clamping the input units to a selected pattern is to add a bias to each of the units in the tree, as in Figure 3a. Note that these biases depend not only on the input weights, but also on the pattern distributed over the input units. Having clamped the input units, we must now compute expectation values. For concreteness, we consider the case where the output unit is allowed to equilibrate. Correlations between adjacent units are computed by decimating over the other units in the tree; the procedure is illustrated in Figure 3b for the lower leftmost hidden units. The final, reduced network consists of the two adjacent
Learning in Boltzmann Trees
1179
Figure 2: (a) Combining weights in series: the effective interaction between units S1 and S2 is the same as if they were directly connected by weight J, where tanhJ = tanhJ1 tanhfz. (b) Combining weights in parallel: the effective weight is. simply the additive sum. The same rules hold if either of the end units is clamped. '
units with weight and effective biases (SIS2) =
elcosh(hl e'cosh(hl
( h l , h2).
A short calculation gives
+ h2) - e-lcosh(hl - h2) + h2) + e-lcosh(hl - h2)
(3.3)
The magnetization of a tree unit can be computed in much the same way. We combine weights in series and parallel until only the unit of interest remains, as in Figure 3c. 'In terms of the effective bias h, we then have the standard result (S,) = tanhh
(3.4)
The rules for combining weights thus enable us to compute expectation values without enumerating the 213 = 8192 possible configurations of units in the tree. To compute the correlation (S1S2) for two adjacent units in the tree, one successively removes all "outside" fluctuating units until only units S1 and S2 remain. To compute the magnetization ( S , ) , one removes unit 52 as well. Implementing these operations on a computer is relatively straightforward, due to the hierarchical organization of the output and hidden units. The entire set of correlations and magnetizations can be computed by making two recursive sweeps through the tree, storing effective weights as necessary to maximize the efficiency of the algorithm. Having to clamp the output unit to the desired target does not introduce any difficulties. In this case, the output unit merely contributes (along with the input units) to the bias on its derivative units. Again, we use recursive decimation to compute the relevant stochastic averages. We are thus able to implement the Boltzmann learning rule in an exact way.
Lawrence Saul and Michael I. Jordan
1180
(b2)
?
(b3)
?
Learning in Boltzmann Trees
1181
Table 1: Boltzmann Tree Performance on N-bit Parity.a N
Hidden units
emax
Success %
eavg
2 3 4
1 1 3 4
50 250 1000 1000
97.2 (89.3) 96.1 (88.5) 95.1 (69.2) 92.9 (84.2)
25.8 42.1 281.1 150.0
5
“The results in parentheses are for mean-field learning.
4 Results
We tested Boltzmann trees on two familiar problems: N-bit parity and the detection of hidden symmetries (Sejnowski et al. 1986). We hope our results demonstrate not only the feasibility of the algorithm, but also the potential of exact Boltzmann learning. Table 1 shows our results on the N-bit parity problem, using Boltzmann trees with one layer of hidden units. In each case, we ran the algorithm 1000 times. All 2N possible input patterns were included in the training set. A success indicates that the tree learned the parity function in less than emaxepochs. We also report the average number of epochs eavgper successful trial; in these cases, training was stopped when P(O* I I,L)2 0.9 for each of the 2N inputs, with 0’ = parity(1,). The results show Boltzmann trees to be competitive with standard backpropagation networks (Merller, 1993). We also tested Boltzmann trees on the problem of detecting hidden symmetries. In the simplest version of this problem, the input patterns are square pixel arrays that have mirror symmetry about a fixed horizontal or vertical axis (but not both). We used a two-layer tree with the architecture shown in Figure 1 to detect these symmetries in 10 x 10 square arrays. The network learned to differentiate the two types of patterns from a training set of 2000 examples. After each epoch, we tested the network on a set of 200 unknown examples. The performance on these patterns measures the network‘s ability to generalize to unfamiliar inputs. The results, averaged over 100 separate trials, are shown in Figure 4. After 100 epochs, average performance was over 95% on the training set and over 85% on the test set.
Figure 3: Facing page. Reducing Boltzmann trees by combining weights in series and parallel. Solid circles represent clamped units. (a) Effect of clamping the input units to a selected pattern. (b) Computing the correlation between adjacent units. (c).Computing the magnetization of a single unit.
Lawrence Saul and Michael I. Jordan
1182
0.9
0.8
$
8
0.7
0.6
0.5
0
20
40
60
80
100
epoch
Figure 4: Results on the problem of detecting hidden symmetries for true Boltzmann (TB)and mean-field (MF) learning. Finally, we investigated the use of the deterministic, or mean-field, learning rule (Peterson and Anderson 1987) in Boltzmann trees. We repeated our experiments, substituting (S,)(S,) for (S,S,) in the update rule. Note that we computed the magnetizations (S,) exactly using decimation. In fact, in most deterministic Boltzmann machines, one does not compute the magnetizations exactly, but estimates them within the mean-field approximation. Such networks therefore make two approximations-first, that (S,S,) x (S#)(S,) and second, that (S,) x tanh(CJ,(S,) + h , ) . Our results speak to the first of these approximations. At this level alone, we find that exact Boltzmann learning is perceptibly faster than mean-field learning. On one problem in particular, that of N = 4 parity (see Table I), the difference between the two learning schemes was quite pronounced.
5 Extensions
In conclusion, we mention several possible extensions to the work in this paper. Clearly, a number of techniques used in backpropagation networks, such as conjugate-gradient and quasi-Newton methods (Press
Learning in Boltzmann Trees
1183
et al. 1986), could also be used to accelerate learning in Boltzmann trees. In this paper, we have considered the basic architecture in which a single output unit sits atop a tree of one or more hidden layers. Depending on the problem, a variation on this architecture may be more appropriate. The network must have a hierarchical organization to remain tractable; within this framework, however, the algorithm permits countless arrangements of hidden and output units. In particular, a tree can have one or more output units, and these output units can be distributed in an arbitrary way throughout the tree. One can incorporate certain intralayer connections into the tree at the expense of introducing a slightly more complicated decimation rule, valid when the unit to be decimated is biased by a connection to an additional clamped unit. There are also decimation rules for q-state (Potts) units, with q > 2 (Itzykson and Drouffe 1991). The algorithm for Boltzmann trees raises a number of interesting questions. Some of these involve familiar issues in neural network designfor instance, how to choose the number of hidden layers and units. We would also like to characterize the types of learning problems best suited to Boltzmann trees. A recent study by Galland (1993) suggests that meanfield learning has trouble in networks with several layers of hidden units and/or large numbers of output units. Boltzmann trees with exact Boltzmann learning may present a viable option for problems in which the basic assumption behind mean-field learning-that the units in the network can be treated independently-does not hold. We know of constructive algorithms (Frean 1990) for feedforward nets that yield tree-like solutions; an analogous construction for Boltzmann machines has obvious appeal, in view of the potential for exact computations. Finally, the tractability of Boltzmann trees is reminiscent of the tractability of tree-like belief networks, proposed by Pearl (1986,1988); more sophisticated rules for computing probabilities in belief networks (Lauritzen and Spiegelhalter, 1988) may have useful counterparts in Boltzmann machines. These issues and others are left for further study. Acknowledgments The authors thank Mehran Kardar for useful discussions. This research was supported by the Office of Naval Research and by the MIT Center for Materials Science and Engineering through NSF Grant DMR-90-22933. References Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169. Binder, K., and Heerman, D. W. 1988. Monte Carlo Simulation in Statistical Mechanics. Springer-Verlag, Berlin.
1184
Lawrence Saul and Michael I. Jordan
Eggarter, T. P. 1974. Cayley trees, the Ising problem, and the thermodynamic limit. Phys. Rev. B 9, 2989-2992. Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209. Freund, Y., and Haussler, D. 1992. Unsupervised learning of distributions on binary vectors using two layer networks. In Advances in Neural lnforrnation Processing SystemsIV (Denver 1992),J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 912-919. Morgan Kaufmann, San Mateo, CA. Galland, C. C. 1993. The limitations of deterministic Boltzmann machine learning. Network: Cornp. Neural Syst. 4, 355-379. Hertz, J., Krogh, A., and Palmer, R. G. 1991. lntroduction to the Theory of Neural Computation. Addison-Wesley, Redwood City. Hinton, G. E. 1989. Deterministic Boltzmann learning performs steepest descent in weight space. Neural Comp. 1, 143-150. Hopfield, J. J. 1987. Learning algorithms and probability distributions in feedforward and feed-back networks. Proc. Natl. Acad. Sci. U.S.A. 84, 84294433. Itzykson, C., and Drouffe, J. 1991. Statistical Field Theory. Cambridge University Press, Cambridge. Kirkpatrick, S., Gellatt Jr, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680. Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities on graphical structures and their application to expert systems. 1.R. Stat. SOC.B 50, 157-224. Moller, M. E 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6, 525-533. Pearl, J. 1986. Fusion, propagation, and structuring in belief networks. Artif. Intelligence 19,241-288. Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA. Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1, 995-1019. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes. Cambridge University Press, Cambridge. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536. Sejnowski, T. J., Kienker, P. K., and Hinton, G. E. 1986. Learning symmetry groups with hidden units. Physica 22D, 260-275. Yair, E., and Gersho, A. 1988. The Boltzmann perceptron network: A multilayered feed-forward network equivalent to the Boltzmann machine. In Advances in Neurallnformation Processing Systems 1 (Denver 1988),D. s.Touretzky, ed., pp. 116-123. Morgan Kaufmann, San Mateo, CA.
Received October 27, 1993; accepted January 24, 1994.
This article has been cited by: 2. E. Farguell, F. Mazzanti, E. Gomez-Ramirez. 2008. Boltzmann Machines Reduction by High-Order Decimation. IEEE Transactions on Neural Networks 19:10, 1816-1821. [CrossRef] 3. K. Wong, David Saad. 2007. Inference and optimization of real edges on sparse graphs: A statistical physics perspective. Physical Review E 76:1. . [CrossRef] 4. Jonathan S Yedidia, Jean-Philippe Bouchaud. 2003. Renormalization group approach to error-correcting codes. Journal of Physics A: Mathematical and General 36:5, 1267-1288. [CrossRef] 5. B. G. Giraud, Alan S. Lapedes. 1999. Superadditive correlation. Physical Review E 59:5, 4983-4991. [CrossRef] 6. H. J. Kappen , F. B. Rodríguez . 1998. Efficient Learning in Boltzmann Machines Using Linear Response TheoryEfficient Learning in Boltzmann Machines Using Linear Response Theory. Neural Computation 10:5, 1137-1156. [Abstract] [PDF] [PDF Plus] 7. Laurene V. FausettBoltzmann Machines . [CrossRef]
Communicated by Steven Whitehead
O n the Convergence of Stochastic Iterative Dynamic Programming Algorithms Tommi Jaakkola Michael I. Jordan Department of Brain and Cognitive Sciences, Massachusetts lnstitute of Technology, Cambridge, M A 02139 U S A
Satinder P. Singh Department of Computer Science, University of Massachusetts, Amherst, M A 01003 U S A
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TDW algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(X) and Q-learning belong. 1 Introduction
An important component of many real world learning problems is the temporal credit assignment problem-the problem of assigning credit or blame to individual components of a temporally-extended plan of action, based on the success or failure of the plan as a whole. To solve such a problem, the learner must be equipped with the ability to assess the long-term consequences of particular choices of action and must be willing to forego an immediate payoff for the prospect of a longer term gain. Moreover, because most real world problems involving prediction of the future consequences of actions involve substantial uncertainty, the learner must be prepared to make use of a probability calculus for assessing and comparing actions. There has been increasing interest in the temporal credit assignment problem, due principally to the development of learning algorithms based on the theory of dynamic programming (DP) (Barto et al. 1990; Werbos, 1992). Sutton’s (1988) TD(X) algorithm addressed the problem of learning to predict in a Markov environment, utilizing a temporal difference Ncural Coriiputafian 6, 1185-1201 (1994) @ 1994 Massachusetts Institute of Technology
1186
T. Jaakkola, M. I. Jordan,and S. P. Singh
operator to update the predictions. Watkins' (1989) Q-learning algorithm extended Sutton's work to control problems, and also clarified the ties to dynamic programming. In the current paper, our concern is with the stochastic convergence of DP-based learning algorithms. Although Watkins (1989) and Watkins and Dayan (1992) proved that Q-learning converges with probability one, and Dayan (1992) observed that TD(0) is a special case of Q-learning and therefore also converges with probability one, these proofs rely on a construction that is particular to Q-learning and fail to reveal the ties of Q-learning to the broad theory of stochastic approximation (e.g., Wasan 1969). Our goal here is to provide a simpler proof of convergence for Q-learning by making direct use of stochastic approximation theory. We also show that our proof extends to TD(X) for arbitrary A. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski (1993) for TDW, Peng and Williams (1993) for TD(A), and Tsitsiklis (1993) for Q-learning. Our results appear to be closest to those of Tsitsiklis (1993). We begin with a general overview of Markov decision problems and DP. We introduce the Q-learning algorithm as a stochastic form of DP. We then present a proof of convergence for a general class of stochastic processes of which Q-learning is a special case. We then discuss TD(N and show that it is also a special case of our theorem. 2 Markov Decision Problems
A useful mathematical model of temporal credit assignment problems, studied in stochastic control theory (Aoki 1967) and operations research (Ross 19701, is the Markov decision problem. Markov decision problems are built on the formalism of controlled Markov chains. Let S = 1,2, . . . ,N be a discrete state space and let U ( i ) be the discrete set of actions available to the learner when the chain is in state i. The probability of making a transition from state i to state j is given by p,,(u), where u E U(i). The learner defines a policy p , which is a function from states to actions. Associated with every policy p is a Markov chain defined by the state transition probabilities pij[p(i)]. There is an instantaneous cost ci(u) associated with each state i and action u, where c,(u) is a random variable with expected value Z;(u). We also define a valuefunction V I L ( i )which , is the expected sum of discounted future costs given that the system begins in state i and follows policy p:
(2.1)
where Sr E S is the state of the Markov chain at time t. Future costs are discounted by a factor y', where y E (0,l). We wish to find a policy that
Convergence of DP Algorithms
1187
minimizes the value function:
V * ( i )= min V,(i) P
(2.2)
Such a policy is referred to as an optimal policy and the corresponding value function is referred to as the optimal value function. Note that the optimal value function is unique, but an optimal policy need not be unique. (For example, if the costs are independent of the actions and states all policies are optimal.) Markov decision problems can be solved by dynamic programming (Bertsekas 1987). The basis of the DP approach is an equation that characterizes the optimal value function. This equation, known as Bellman's equation, characterizes the optimal value of the state in terms of the optimal values of possible successor states: (2.3) To motivate Bellman's equation, suppose that the system is in state i at time t and consider how V * ( i )should be characterized in terms of possible transitions out of state i. Suppose that action u is selected and the system transitions to state j . The expression ci(u) + r V ' u ) is the cost of making a transition out of state i plus the discounted cost of following an optimal policy thereafter. The minimum of the expected value of this expression, over possible choices of actions, seems a plausible measure of the optimal cost at i and by Bellman's equation is indeed equal to V * ( i ) . There are a variety of computational techniques available for solving Bellman's equation. The technique that we focus on in the current paper is an iterative algorithm known as value iteration. Value iteration solves for V * ( i )by setting up a recurrence relation for which Bellman's equation is a fixed point. Denoting the estimate of V * ( i )at the kth iteration as V ( k()i ) ,we have (2.4)
This iteration can be shown to converge to V * ( i )for arbitrary initial V(O)(i) (Bertsekas 1987). The proof is based on showing that the iteration from Vck)(i)to V ( k + l ) (is i ) a contraction mapping. That is, it can be shown that - V*(i)l5 Ymax I V k ) ( i ) V*(i)l max IV(k+l)(i) I
(2.5)
I
which implies that Vck)(i)converges to V ( i )and also places an upper bound on the convergence rate. Watkins (1989). utilized an alternative notation for expressing Bellman's equation that is particularly convenient for deriving learning al-
T. Jaakkola, M. I. Jordan, and S. P. Singh
1188
gorithms. Define the function Q‘(i, u ) to be the expression appearing inside the ”min” operator of Bellman’s equation: j€S
Using this notation Bellman’s equation can be written as follows:
V * ( i )= min Q*(i,u)
(2.7)
IiEU(I)
Moreover, value iteration can be expressed in terms of Q-functions Q@+’)(i, U ) = Ci(U)
+~ C ~ j j ( ~ ) V ‘ ~ ’ ( j )
(2.8)
jeS
where V @ ) ( iis) defined in terms of Q(k)(i, u ) as follows: (2.9)
The mathematical convenience obtained from using Qs rather than V s derives from the fact that the minimization operator appears inside the expectation in equation 2.8, whereas it appears outside the expectation in equation 2.4. This fact plays an important role in the convergence proof presented in this paper. The value iteration algorithm in equation 2.4 or equation 2.8 can also be executed asynchronously (Bertsekas and Tsitsiklis 1989). In an asynchronous implementation, the update of the value of a particular state may proceed independently of the updates of the values of other states. Bertsekas and Tsitsiklis (1989) show that as long as each state is updated infinitely often and each action is tried an infinite number of times in each state, then the asynchronous algorithm eventually converges to the optimal value function. Moreover, asynchronous execution has the advantage that it is directly applicable to real-time Markov decision problems (RTDP; Barto et al. 1993). In a real-time setting, the system uses its evolving value function to choose control actions for an actual process and updates the values of the states along the trajectory followed by the process. Dynamic programming serves as a starting point for deriving a variety of learning algorithms for systems that interact with Markov environments (Barto et al. 1993; Sutton 1988; Watkins 1989). Indeed, real-time dynamic programming is arguably a form of learning algorithm as it stands. Although RTDP requires that the system possesses a complete model of the environment [i.e., the probabilities pi,( u ) and the expected costs Ci(u) are assumed known], the performance of a system using RTDP improves over time, and its improvement is focused on the states that are actually visited. The system “learns” by transforming knowledge in one format (the model) into another format (the value function). A more difficult learning problem arises when the probabilistic structure of the environment is unknown. There are two approaches to dealing
Convergence of DP Algorithms
1189
with this situation (cf. Barto et al. 1993). An indirect approach acquires a model of the environment incrementally, by estimating the costs and the transition probabilities, and then uses this model in an ongoing DP computation. A direct method dispenses with constructing a model and attempts to estimate the optimal value function (or the optimal Q-values) directly. In the remainder of this paper, we focus on direct methods, in particular the Q-learning algorithm of Watkins (1989) and the TD(X) algorithm of Sutton (1988). The Q-learning algorithm is a stochastic form of value iteration. Consider equation 2.8, which expresses the update of the Q-values in terms of the Q-values of successor states. To perform a step of value iteration requires knowing the expected costs and the transition probabilities. Although such a step cannot be performed without a model, it is nonetheless possible to estimate the appropriate update. For an arbitrary V-function, the quantity ClES p,(u)V(j) can be estimated by the quantity Vfj), if successor state j is chosen with probability p l , ( u ) . But this is assured by simply following the transitions of the actual Markov environment, which makes a transition from state i to state j with probability p,,(u). Thus the sample value of V at the successor state is an unbiased estimate of the sum. Moreover c,(u) is an unbiased estimate of Cl(u). This reasoning leads to the following relaxation algorithm, where we use Qf(i,u ) and Vf(i)to denote the learner’s estimates of the Q-function and V-function at time t, respectively: Qi+i(Sir
u i ) = [I -.t(Si.
ui)]Qi(si.u i ) + O i ( s i , u i ) [ ~ , , ( u i ) + ~ V , ( s i + ~ ) I ( 2 . 1 0 )
where The variables cuf(st, u t ) are zero except for the state that is being updated at time t. The fact that Q-learning is a stochastic form of value iteration immediately suggests the use of stochastic approximation theory, in particular the classical framework of Robbins and Monro (1951). Robbins-Monro theory treats the stochastic convergence of a sequence of unbiased estimates of a regression function, providing conditions under which the sequence converges to a root of the function. Although the stochastic convergence of Q-learning is not an immediate consequence of Robbins-Monro theory, the theory does provide results that can be adapted to studying the convergence of DP-based learning algorithms. In this paper we utilize a result from Dvoretzky’s (1956) formulation of Robbins-Monro theory to prove the convergence of both Q-learning and TDW. 3 Convergence Proof for Q-Learning
Our proof is based on the observation that the Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic ap-
T. Jaakkola, M. I. Jordan, and S. P. Singh
1190
proximation are generally applicable. Due to the lack of a formulation of stochastic approximation for the maximum norm, however, we need to slightly extend the standard results. This is accomplished by the following theorem, the proof of which is given in Appendix A.
+
Theorem 1. A random iterative process A , , + l ( x ) = [l - ~ i , , ( x ) ] A , , ( x ) p n ( x ) F , , ( x )converges to zero with probability one (w.p.1) under the following assumptions: 1 , x E S , where S is a finite set.
2. C , , W , ( X )= 00, Z , N ; ( X < ) 00, C , , B , , ( x ) = 00, C , , / $ ( X<) E{[j,,(,x)I PI,} 5 E { a n ( x ) I P,,} uniformly over x w.p.1.
00, and
3. I1 E { F l , ( x ) I PI13ij,,} IIw Iy (1 A,, JIw, where y E (0.1). 4. Var{F,,(x)I P,,, [I,,} I C(1+ 11 A,, l l ~ ) where ~ , C is some constant.
HereP,, = {X,,,X,,-I , . . . .F,,-I , . . . , N , , - I , . ..,ij,l-l,...}P,,stepn.F , , ( x ) , ~ ~ , , ( x ) , and D,,(X)are allowed to depend on the past insofar as the cv,,(x)and /j,,(x) are assumed to be nonnegative and mutually independent given P,,. The notation 11 . IIw refers to some weighted maximum norm. In applying the theorem, the A,, process will generally represent the difference between a stochastic process of interest and some optimal value (e.g., the optimal value function). The formulation of the theorem therefore requires knowledge to be available about the optimal solution to the learning problem before it can be applied to any algorithm whose convergence is to be verified. In the case of Q-learning the required knowledge is available through the theory of DP and Bellman's equation in particular. The convergence of the Q-learning algorithm now follows easily by relating the algorithm to the converging stochastic process defined by Theorem 1.' In the form of the theorem we have: Theorem 2. The Q-learning algorithm given by
Qt+i(sr.ur)= [I - t r t ( ~ t , u , ) ] Q r ( ~ r + , ~~i )r ( ~ r , u i ) [ +yvr(sr+i)] ~,(~r)
converges to the optimal Q*(s,u ) values if 1. The state and action spaces are finite. N ~ ( su , ) = 00 and Cr o:(s, u ) < 00 uniformly overs and u w.p.l. 2. 3. Var{c,(u)}isfinite. 4. If y = 1 all policies lead to a cost free terminal state w.p.1. 'We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its generality, however, allows it to be applied to other algorithms as well [see the following section on TD(X)I.
Convergence of DP Algorithms
1191
Proof. By subtracting Q * ( s , u ) from both sides of the learning rule and by defining &(s, u ) = Qt(s,u ) - Q*(s.u ) together with
Ft(s7 u ) = ~ ( u+) ~Vt(snext) - Q*(s,u )
(3.1)
the Q-learning algorithm can be seen to have the form of the process in Theorem 1 with /j,(s,u ) = N , ( s , u ) . To verify that F,(s, u ) has the required properties we begin by showing that it is a contraction mapping with respect to some maximum norm. This is done by relating F , to the DP value iteration operator for the same Markov chain. More specifically, max I€{F,(i,u ) } l
=
where we have used the notation V*U) = max, IQ,U. v) - Q*U,v)l and T is the DP value iteration operator for the case where the costs associated with each state are zero (cf. equation 2.4). If y < 1 the contraction property of € { F , ( i , u ) } can be seen from the fourth formula by bounding & pi,(u)V*(j) by maxi V*U) and then including the y factor. When the future costs are not discounted (y = 1) but the chain is absorbing and all policies lead to the terminal state w.p.1 there still exists a weighted maximum norm with respect to which T is a contraction mapping (see, e.g., Bertsekas and Tsitsiklis 1989) thereby forcing the contraction of €{F,(i, u ) } . The variance of F,(s, u ) given the past is within the bounds of Theorem 1 as it depends on Qt(s,u) at most linearly and the variance of c,(u) is bounded. Note that the proof covers both the on-line and batch versions.
4 T h e TD(N Algorithm The TD(X) (Sutton 1988) is also a DP-based learning algorithm that is naturally defined in a Markov environment. Unlike Q-learning, however, TD does not involve decision-making tasks but rather predictions about the future costs of an evolving system. TD(N converges to the same predictions as a version of Q-learning in which there is only one action available at each state, but the algorithms are derived from slightly different grounds and their behavioral differences are not well understood. In this section we introduce the algorithm and its derivation. The proof of convergence is given in the following section. Let us define V,(i) to be the current estimate of the expected cost incurred during the evolution of the system starting from state i and let
T. Jaakkola, M. I. Jordan, and S. P. Singh
1192
denote the instantaneous random cost at state i. As in the case of Qlearning we assume that the future costs are discounted at each state by a factor y. If no discounting takes place (y = 1) we need to assume that the Markov chain is absorbing, that is, there exists a cost-free terminal state to which the system converges with probability one. We are concerned with estimating the future costs that the learner has to incur. One way to achieve these predictions is to simply observe n consecutive random costs weighted by the discount factor and to add the best estimate of the costs thereafter. This gives us the estimate
ci
The expected value of this can be shown to be a strictly better estimate than the current estimate is (Watkins 1989). In the undiscoudted case this holds only when n is larger than some chain-dependent constant. To demonstrate this let us replace V, with V’ in the above formula giving E{V:‘”)(if)}= V*(if)which implies maxIE{Vi“)(i)}- V*(i)l 5 y”maxPr{m; 2 n}maxlVf(i)- V*(i)( (4.2) I
I
I
where m; is the number of steps in a sequence that begins in state i (infinite in the nonabsorbing case). This implies that if either y < 1 or n is large enough so that the chain can terminate before n steps starting from an arbitrary initial state then the estimate Vj”’ is strictly better than V,. In general, the larger n the more unbiased the estimate is as the effect of incorrect V, vanishes. However, larger n increases the variance of the estimate as there are more (independent) terms in the sum. Despite the error reduction property of the truncated estimate it is difficult to calculate in practice as one would have to wait n steps before the predictions could be updated. In addition it clearly has a huge variance. A remedy to these problems is obtained by constructing a new estimate by averaging over the truncated predictions. TDU) is based on taking the geometric average: (4.3)
As a weighted average it is still a strictly better estimate than Vf(i)with the additional benefit of being better in the undiscounted case as well (as the summation extends to infinity). Furthermore, we have introduced a new parameter X that affects the trade-off between the bias and variance of the estimate (Watkins 1989). An increase in X puts more weight on less biased estimates with higher variances and thus the bias in V: decreases at the expense of a higher variance.
Convergence of DP Algorithms
1193
The mathematical convenience of using the geometric average can be seen as follows. Given the estimates V:(i) the obvious way to use them in a learning rule is
In terms of prediction differences, that is
the geometric weighting allows us to write the correction term in the learning rule as
Note that up to now the prediction differences that need to be calculated in the future depend on the current V,(i). If the chain is nonabsorbing this computational implausibility can, however, be overcome by updating the predictions at each step with the prediction differences calculated by using the current predictions. This procedure gives the on-line version of TDO): (4.7) where x ; ( k ) is the indicator variable of whether state i was visited at kth step (of a sequence). Note that the sum contains the effect of the modifications or activity traces initiated at past time steps. Moreover, it is important to note that in this case the theoretically desirable properties of the estimates derived earlier may hold only asymptotically (see the convergence proof in the next section). In the absorbing case the estimates V,(i) can also be updated offline, that is, after a complete sequence has been observed. The learning rule for this case is derived simply from collecting the correction traces initiated at each step of the sequence. More concisely, the total correction is the sum of individual correction traces illustrated in equation 4.6. This results in the batch learning rule (4.8) t=l
k=l
where the ( m + 1)th step is the termination state. We note that the above derivation of the TDW algorithm corresponds to the specific choice of a linear representation for the predictors V,(i) (see, e.g., Dayan 1992). Learning rules for other representations can be obtained using gradient descent but these are not considered here. In practice TD(X) is usually applied to an absorbing chain thus allowing the use of either the batch or the on-line version but the latter is usually preferred.
T. Jaakkola, M. I. Jordan, and S. P. Singh
1194 5 Convergence of TD(N
As we are interested in strong forms of convergence we need to modify the algorithm slightly. The learning rate parameters all are replaced by crll(i)/ which satisfy Cnal,(i) = 00 and Ellai(i)< 00 uniformly w.p.1. These parameters allow asynchronous updating and they can, in general, be random variables. The convergence of the algorithm is guaranteed by the following theorem, which is an application of Theorem 1.
Theorem 3. For any finiteabsorbing Markov chain, for any distribution of starting states with no inaccessible states, and for any distributions of the costs with finite variances the TD(N algorithms given by Ill
1. Vll+l(i)= V,(i)
+ c r l l ( i ) ~ [ c+; yV,.(it+l) l - V l 1 ( i fk(?.X)"X;(k) )] r=i
k= 1
El,a,(i) = 00 and Ella f ( i ) < 00 uniformly over i w.p.1. 2. vt+l(i)= v t ( i )+ ar(i)[C;,
+ yvt(ir+i)- vt(it)l k ( r ~ ) r - k x l ( k ) k=l
Er:I(')
and Et a;(i) < 00 uniformly w.p.1and within sequences ar(l)/maxrEsar(i) -,1 uniformly over i w.p.1 =
00
converge to the optimal predictions w.p.1 provided 7 ,X E [0,1]with yX < 1. Proof for (1). Using the ideas described in the previous section the learning rule can be written as
where V:(i;k) is an estimate calculated at the kth occurrence of state i in a sequence and for mathematical convenience we have made the transformation all(i)-, E{rn(i)}cxll(i), where m(i) is the number of times state i was visited during the sequence. To apply Theorem 1 we subtract V . ( i ) , the optimal predictions, from both sides of the learning equation. By identifying all(i):= an(i)m(i)/ E{m(i)}, a , ( i ) := all(i)/and F,(i) := Gll(i) - V * ( i ) m ( i ) / E { m ( i we ) } need to show that these satisfy the conditions of Theorem 1. For a l l ( i )and pll(i)this is obvious. We begin here by showing that Fll(i) indeed is a contraction mapping. To this end,
Convergence of DP Algorithms
1195
which can be bounded above by using the relation lE{Vt(i;k) - V*(i)I Vll}l
I I
5 where Q ( x ) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V; (i) is a contraction mapping independent of possible discounting. As CkP{m(i)2 k} = E { m ( i ) }we finally get
The variance of F,(i) can be seen to be bounded by E{m4}max lV,(i)12 I
For any absorbing Markov chain the convergence to the terminal state C ( k ) , implying that the is geometric and thus for every finite k, E{mk}I variance of F,,(i) is within the bounds of Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD(N converges to 0 the optimal predictions w.p.1. Proof for (2). The proof for the on-line version is achieved by showing that the effect of the on-line updating vanishes in the limit thereby forcing the two versions to be equal asymptotically. We view the on-line version as a batch algorithm in which the updates are made after each complete sequence but are made in such a manner so as to be equal to those made on-line. Define G:,(i) = G,,(i) + Gf(i) to be a new batch estimate taking into account the on-line updating within sequences. Here G,(i) is the batch estimate with the desired properties [see the proof for (111 and Gf (i) is the difference between the two. We take the new batch learning parameters to be the maxima over a sequence, that is u,,(i) = maxfESa,(i). As all the a,(i) satisfy the required conditions uniformly w.p.1 these new learning parameters satisfy them as well. To characterize the new batch estimate we consider the on-line updating iteration and decompose it into three parallel processes. Let V;(i)be the value function due to the real batch updating with uIl(i) as learning parameters, let Vp (i) be the difference between this batch value function and the value function resulting from the on-line updating, and let R,(i) be the change in the value function after t on-line updating steps. These
T. Jaakkola, M. I. Jordan, and S. P. Singh
1196
processes can be written as VF+l(i) = V,B(i)+a,(i)[C,,+ yV,(ir+l) - Vtl(ir)]Pt(i) VP,,(i) = VP(i) + .r(i)[c,, + ~V,,(it+l) - Vll(it)+ yRr(it+i) - Rt(ir)]Pr(i)- a i i ( i ) [ ~ + i , yV,(it+i) - Vn(ir)]Pt(i) Rt+l(i) = Rt(i) at(i)[c,,+ yVIl(ir+i)- Vll(it) + yRr(it+l)- Rr(it)]Pr(i)
+
with initial conditions V,B(i) = V,(i)/ Vf(i) = Ro(i) = 0. By denoting C; = lci, yV*(it+l)- V*(ir)land A,, =[I V,, - V* 11 and bounding the learning parameters by their norms and by the largest member of the sequence [a,(i)] we readily obtain
+
E{lV$l(ill} IE{IV,a(i)l}+ x
E{ll Rr+i
2
aIl(i)
{ II 2 - 1 I1 (An + E{C;)) + E{ll RI ll}} 2
I l l IE{ll Rt I l l + j - - q II
all
II
(All
+ E{C;) + E{ll Rr 11))
Let us define C = 2/(1 - yX) and C' 2 E{C;}. Note that such a C' can always be found as the optimal value function is bounded under the assumptions of the theorem. By iterating we get E{ll R m
11)
I c(Aii + C*)[(l+ C 11 ail
11)"
- 11
and inserting this into the E{lVp(i)l} iteration and noting that it is only a sum we get an upper bound E{IVt(i)l} 5 a,(i)mC2(A,,+ C*) x {mpx
11 a n
-1
11 +(1 + C 11 a,, 11) "
- 1)
(5.1)
As the new batch updates are made using a,(i) parameters the quantity of interest to us is E{E{ lV$(i)l}/u,,(i)}mwhich is a bound for E{Gf(i)}. In order to use Theorem 1 we need to show that the estimates G i have the desired properties. In other words, as the G,(i) have these properties we need to show that the effect of Gf(i) is vanishing. Let us first consider the contraction property of the estimate. In any absorbing Markov chain the probability of absorption increases geometrically and thus the probability of having m steps in a sequence can be bounded by Cpy," where yp < 1. Due to the convergence of a,($ for n sufficiently large [a,(i) small enough], the expectation of E{ 11 R,, 11) over m is bounded and goes to zero in the limit w.p.1 (all the terms contain ak, where k 2 1). As also 11 (atla,) - 1 11 converges to zero w.p.1, E{Gf(i)} vanishes in the limit (see equation 5.1). Using equation 5.1 we can write these results as
II EW:,
-
V')
II III E{Gn
I1 + II GR II I(7'+ CL) 11 v, - v* 11 +c; - V'}
Convergence of DP Algorithms where C;, and CE go to zero w.p.1. This implies that for any 1) V , - V' (I>> t there exists y < 1 such that
1197 t
> 0 and
(I E{G:, - V * }115 Y I1 Vt1 - V' I1 for n large enough. The variance of the estimate satisfies condition (4) of Theorem 1. To see this, note that in deriving the bound for E{Gf(i)} the only random variables were C;, the variances of which are bounded, and m, the variance of which can be bounded exactly in the same way as the expectation; it too vanishes in the limit. Theorem 1 now guarantees that for any t the value function in the on-line algorithm converges w.p.1 into some t-bounded region of V* and therefore the algorithm itself converges to V' w.p.1. 0 6 Conclusions
In this paper we have extended results from stochastic approximation theory to cover asynchronous relaxation processes that have a contraction property with respect to some maximum norm (Theorem 1). This new class of converging iterative processes is shown to include both the Q-learning and TD(X) algorithms in either their on-line or batch versions. We note that the convergence of the on-line version of TD(X) has not been shown previously. We also wish to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD(X) utilize only high-level statistical properties of the estimates used in these algorithms and do not rely on constructions specific to the algorithms. Our approach also sheds additional light on the similarities between Q-learning and TD(X). Although Theorem 1 is readily applicable to DP-based learning schemes, the theory of dynamic programming is important only for its characterization of the optimal solution and for a contraction property needed in applying the theorem. The theorem can be applied to iterative algorithms of different types as well. Finally we note that Theorem 1 can be extended to cover processes that do not show the usual contraction property, thereby increasing its applicability to algorithms of possibly more practical importance. 7 Proof of Theorem 1
In this section we provide a detailed proof of the theorem on which the convergence proofs for Q-learning and TD(X) were based. We introduce and prove three essential lemmas, which will also help to clarify ties to the literature and the ideas behind the theorem, followed by the proof of Theorem 1. The notation 11 . JIw= max, I . /W(x)I will be used in what follows.
T. Jaakkola, M. I. Jordan, and S. P. Singh
1198
Lemma 1. A random process
wJI+l(x)=
-
+
crJl(x)]wll(x)
pll(x)r~l(x)
converges to zero with probability one if the following conditions are satisfied: 2 . CJIn J I ( X ) = O 0 t Ell at(x)< O0, CJI/ ' f l ( ' ) = O0! CJl/:'()' < O0, PI,} I E { a , , ( x ) I PN}uniformly over x w.p.1. 2. E{rJl(x)I P J l , P , }= Oand E { < ( x ) I PJl,/~ll} I C w.p.1, where PJI= {
~
J
l
~
~
~
-
~
~
~
~
~
~
r
~
~
-
~
~
r
J
l
-
~
~
E{411(x)1
~
~
~
~
a
J
All the random variables are allowed to depend on the past PJl. a J , ( x )and & ( x ) are nonnegative and mutually independent given Pll. Proof. Except for the appearance of i&(x) this is a standard result. With the above definitions convergence follows from Dvoretzky's extended theorem (Dvoretzky 1956). 0 Lemma 2. Consider a stochastic iteration
Xn+l(X) = G,(Xii, Yir,x)
where G,, is a sequence offunctions and Y,,is a random process. Let (a,3,P ) be a probability space and assume that the process is scale invariant, that is, w.p.2 for all w E R G(/jXli.y i i ( w ) , x ) = P G ( X , i , Y , ( w ) , x )
Assumefurther that if we kept )I X,, 11 bounded by scaling, then X , would converge to zero w.p.2. These assumptions are sufficient to guarantee that the original process converges to zero w.p.1. Proof. Note that multiplying X I , by 13 corresponds to having initialized the process with [ ~ X ONow . fix some constant C. If during the iteration, 11 X I , 11 increases above C, then X , is scaled so that 11 X f l ]I= C. By the second assumption then this process must converge w.p.1. To show that the net effect of the corrections must stay finite w.p.1 we note that if 11 X n 11 converges then for any 6 > 0 there exists M, such that 11 X , [I< 6 < C for all n > M, with probability at least 1 - 6 . But this implies that the iteration stays below C after M, and converges to zero without any further corrections. 0 Lemma 3. A stochastic process XI1+1(x)= [I - @17(X)]Xll(X) + converges to zero w.p.2 provided
2 . x E S, where S is a finite set. 2. C,,N,(X) = 00, C n a 2 , ( x ) < 00, C , P , , ( X ) = 00, C,,P,2(X) < € { p , , ( x ) I P,} I E { a , ( x ) I P,,} uniformly over x w.p.1
11 x n 11
00,
and
l
-
~
~
a
~
l
-
Convergence of DP Algorithms
1199
where P,
= {XnrXn-l?...ra,,-l,an-2
,...,ijj,i-l,D,l-2....)
a,@) and P,,(x) are nonnegative and mutually independent given P,,.
Proof. Essentially the proof is an application of Lemma 2. To this end, assume that we keep 11 X, 115 C1 by scaling, which allows the iterative process to be bounded by
IX,r+l(X)l I [1- & ( X ) I I X n ( X ) I
+ rP,,(x)C1
This is linear in IX,(x)l and can be easily shown to converge w.p.1 to some X . ( x ) , where 11 X' 115 yC1. Hence, for small enough c, there exists M I ( € )such that 11 X,, 115 C1/(1 + c) for all n > M l ( c ) with probability at least pl(c). With probability p l ( 6 ) the procedure can be repeated for C2 = C 1 / ( l + 6 ) . Continuing in this manner and choosing p k ( c ) so that n k p k ( c ) goes to one as E + 0 we obtain the w.p.1 convergence of the 0 bounded iteration and Lemma 2 can be applied.
+
Theorem 1. A random iterative process A,,+I(x)= [l - (i,,(x)]A,,(x) ~~,,(x)F,,(x) converges to zero w.p.1under the following assumptions: 1. x E S , where S is a finite set.
c,&)
2. E,afl(x) = m, < 0 , E , l D , J ( X ) = 00, Z,I&X) < E{,O,,(x)lP,} 5 E{n,(x)lP,}uniformlyover x w.p.1.
and
3. 1) E{Fn(x)JP,t,P,i}l l w l Y II An Ilwtwhere~E (031). where C is some constant. 4. Var{F,(x)lP,,,P,} 5 C(1+ 11 A,,
Here P, = { X , l r X n - ~. ,. . ,F,-l.. . . , a l l - l , .. . , /I,,-,, . . .} stands for the past at step n. F,(x), a l l ( x ) and , & ( x ) are allowed to depend on P,,. ( Y , , ( x and ) P,,(X) are assumed to be nonnegative and mutually independent given PI,. The notation 11 IIw refers to some weighted maximum norm. +
Proof. By defining r,(x) = F , ( x ) - E{F,(x)lP,, a,}we can decompose the iterative process into two parallel processes given by
b,,+l(x) = [1- tr,(x)]&(x)+ A(x)E{F,l(x)I P,,. A } = [1- ~ , ( X ) I W , , ( X )+ ijMx)rfl(x)
Wn+l(X)
(7.1)
where A,,(x)= S,,(x) + w,,(x).Dividing the equations by W ( x ) for each x and denoting 6A(x) = b,,(x)/W(x),w:,(x) = w , ( x ) / W ( x ) , and r:,(x) = r , ( x ) / W ( x ) we can bound the S:, process by assumption (3) and rewrite the equation pair as l6b+,(x)l
I 11 - a n ( X ) I I S ; ( X ) I + Y P , l ( X ) I1 16'1 + wl, [1- 4 x ) l w l ( x )+ Yo~l(X)r~,(x)
WI,+I(X) =
II
T. Jaakkola, M. I. Jordan, and S. P. Singh
1200
Assume for a moment that the An process stays bounded. Then the variance of r n ( x ) is bounded by some constant C and thereby w:, converges to zero w.p.1 according to Lemma 1. Hence, there exists M such that for all n > M )I wk \I< E with probability at least 1 - E . This implies that the 6; process can be further bounded by l6;,+1(x)l
I 11 - ~ ~ f ( ~ ) l l ~ ; f ( ~ ) l + Y M X ) II 4, + 6 II
with probability > 1 - E . If we choose C such that r ( C for 11 S:, [I> C E
Y II 4 + 6 1 1 1Y(C + 1)/C II s:,
+ 1)/C < 1 then
II
and the process defined by this upper bound converges to zero w.p.1 by Lemma 3. Thus 11 S:, 11 converges w.p.1 to some value bounded by Cf, which guarantees the w.p.1 convergence of the original process under the boundedness assumption. By assumption (4) Y:,(x) can be written as (I+ 1) 6,, w,, l l ) S n ( X ) , where E{s2,(x)IP,,,/$,} 5 C. Let us now decompose w,, as u,, u,, with
+
+
U,,+l(X)
+
= 11 - ~Yn(x)Iu,,(x) rion(x) II
4,+ u,, + uf, I1 s,,(x)
and u,, converges to zero w.p.1 by Lemma 1. Again by choosing C such that y(C + 1)/C < 1 we can bound the hi, and u,, processes for 11 6:, u,, )I> Cc. The pair (b:,, u,,) is then a scale invariant process whose bounded version was proven earlier to converge to zero w.p.1 and therefore by Lemma 2 it too converges to zero w.p.1. This proves the w.p.1 convergence of the triple 6:,,u,,, and u,, bounding the original process. 0
+
Acknowledgments This project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by Grant N00014-90-J-1942 from the Office of Naval Research. The project was also supported by NSF Grant ASC-9217041 in support of the Center for Biological and Computational Learning at MIT, including funds provided by DARPA under the HPCC program. Michael I. Jordan is an NSF Presidential Young Investigator.
References Aoki, M. 1967. Optimization of Stochastic Systems. Academic Press, New York. Barto, A. G., Sutton, R. S., and Watkins, C. J. C. H. 1990. Sequential decision
problems and neural networks. In Advances in Neural Information Processing
Convergence of DP Algorithms
1201
Systems, 2, D. Touretzky, ed., pp. 686-693. Morgan Kaufmann, San Mateo, CA. Barto, A. G., Bradtke, S. J., and Singh, S. P. 1994. Learning to act using real-time dynamic programming. A1 I., in press. Bertsekas, D. P. 1987. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ. Bertsekas, D. I?, and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ. Dayan, P. 1992. The convergence of TD(N for general A. Machine Learning 8, 341-362. Dayan, P., and Sejnowski, T. J. 1993. TD(N converges with probability 1. CNL, The Salk Institute, San Diego, CA. Dvoretzky, A. 1956. On stochastic approximation. Proceedingsof the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. Peng, J., and Williams, R. J. 1993. TD(A) converges with probability 1. Department of Computer Science preprint, Northeastern University. Robbins, H., and Monro, S. 1951. A stochastic approximation model. Ann. Math. Stat. 22, 400-407. Ross, S. M. 1970. Applied Probability Models with Optimization Applications. Holden-Day, San Francisco. Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3 , 9 4 4 . Tsitsiklis, J. N. 1994. Asynchronous stochastic approximation and Q-learning. Machine Learn., in press. Wasan, M. T. 1969. Stochastic Approximation. Cambridge University Press, London. Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Thesis, University of Cambridge. Watkins, C. J. C. H, and Dayan, P. 1992. Q-learning. Machine Learn. 8, 279-292. Werbos, P. 1992. Approximate dynamic programming for real-time control and neural modeling. In Handbookoflntelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, eds., pp. 493-525. Van Nostrand Reinhold, New York.
Received July 26, 1993; accepted March 23, 1994.
This article has been cited by: 2. Kazuyuki Hiraoka, Manabu Yoshida, Taketoshi Mishima. 2009. Parallel reinforcement learning for weighted multi-criteria model with adaptive margin. Cognitive Neurodynamics 3:1, 17-24. [CrossRef] 3. L. Busoniu, R. Babuska, B. De Schutter. 2008. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38:2, 156-172. [CrossRef] 4. Daniel Lockery, James F. Peters. 2008. Adaptive learning by a target-tracking system. International Journal of Intelligent Computing and Cybernetics 1:1, 46-68. [CrossRef] 5. Zhicong Zhang, Li Zheng, Michael X. Weng. 2007. Dynamic parallel machine scheduling with mean weighted tardiness objective by Q-Learning. The International Journal of Advanced Manufacturing Technology 34:9-10, 968-980. [CrossRef] 6. Olivier Buffet, Alain Dutech, François Charpillet. 2007. Shaping multi-agent systems with gradient reinforcement learning. Autonomous Agents and Multi-Agent Systems 15:2, 197-220. [CrossRef] 7. He-Sheng Tang, Songtao Xue, Tadanobu Sato. 2007. H[sub ∞] Filtering in Neural Network Training and Pruning with Application to System Identification. Journal of Computing in Civil Engineering 21:1, 47. [CrossRef] 8. D. Srinivasan, M.C. Choy, R.L. Cheu. 2006. Neural Networks for Real-Time Traffic Signal Control. IEEE Transactions on Intelligent Transportation Systems 7:3, 261-272. [CrossRef] 9. Vladislav B. Tadić. 2006. Asymptotic analysis of temporal-difference learning algorithms with constant step-sizes. Machine Learning 63:2, 107-133. [CrossRef] 10. Benjamin Van Roy. 2006. Performance Loss Bounds for Approximate Value Iteration with State Aggregation. Mathematics of Operations Research 31:2, 234-244. [CrossRef] 11. Min Chee Choy, Dipti Srinivasan, Ruey Long Cheu. 2006. Neural Networks for Continuous Online Learning and Control. IEEE Transactions on Neural Networks 17:6, 1511-1531. [CrossRef] 12. CHRISTOPHER J. FONNESBECK. 2005. SOLVING DYNAMIC WILDLIFE RESOURCE OPTIMIZATION PROBLEMS USING REINFORCEMENT LEARNING. Natural Resource Modeling 18:1, 1-40. [CrossRef] 13. Florentin Wörgötter , Bernd Porr . 2005. Temporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological MechanismsTemporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological Mechanisms. Neural Computation 17:2, 245-319. [Abstract] [PDF] [PDF Plus]
14. A. Potapov, M. Ali. 2003. Convergence of reinforcement learning algorithms and acceleration of learning. Physical Review E 67:2. . [CrossRef] 15. V. S. Borkar. 2002. Q-Learning for Risk-Sensitive Control. Mathematics of Operations Research 27:2, 294-311. [CrossRef] 16. J. Abounadi, D. Bertsekas, V. S. Borkar. 2001. Learning Algorithms for Markov Decision Processes with Average Cost. SIAM Journal on Control and Optimization 40:3, 681. [CrossRef] 17. V. S. Borkar, S. P. Meyn. 2000. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization 38:2, 447. [CrossRef] 18. Csaba Szepesvári , Michael L. Littman . 1999. A Unified Analysis of Value-Function-Based Reinforcement-Learning AlgorithmsA Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation 11:8, 2017-2060. [Abstract] [PDF] [PDF Plus] 19. B Bharath, V S Borkar. 1999. Stochastic approximation algorithms: Overview and recent trends. Sadhana 24:4-5, 425-452. [CrossRef] 20. L. Jouffe. 1998. Fuzzy inference system learning by reinforcement methods. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 28:3, 338-355. [CrossRef] 21. Vivek S. Borkar. 1998. Asynchronous Stochastic Approximations. SIAM Journal on Control and Optimization 36:3, 840. [CrossRef] 22. Fernando J. Pineda . 1997. Mean-Field Theory for Batched TD(λ)Mean-Field Theory for Batched TD(λ). Neural Computation 9:7, 1403-1419. [Abstract] [PDF] [PDF Plus] 23. J.N. Tsitsiklis, B. Van Roy. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:5, 674-690. [CrossRef]
Communicated by Scott Fahlman
A Dynamic Neural Network Architecture by Sequential Partitioning of the Input Space R. S. Shadafan M. Niranjan Cambridge University Engineering Department, Trumpington St., Cambridge CB2 1PZ, England
We present a sequential approach to training multilayer perceptrons for pattern classification applications. The network is presented with each item of data only once and its architecture is dynamically adjusted during training. At the arrival of each example, a decision whether to increase the complexity of the network, or simply to train the existing nodes, is made based on three heuristic criteria. These criteria measure the position of the new item of data in the input space with respect to the information currently stored in the network. During the training process, each layer is treated as an independent entity with its particular input space. By adding a node to a layer, the algorithm is effectively adding a hyperplane, and hence a partition, to the input space of that layer. When existing nodes are sufficient to accommodate the incoming input, the corresponding hidden nodes are trained accordingly. Each hidden unit in the network is trained in closed form by means of a recursive least-squares (RLS) algorithm. A local covariance matrix of the data is maintained at each node and the closed-form solution is recursively updated. The three criteria are computed from these covariance matrices to keep the computational cost low. The performance of the algorithm is illustrated on two problems. The first problem is the two-dimensional Peterson and Barney vowel data. The second problem is a 33-dimensional data set derived from a vision system for classifying wheat grains. The sequential nature of the algorithm lends itself to an efficient hardware implementation in the form of systolic arrays, and the incremental training idea has better biological plausibility compared with iterative methods.

Neural Computation 6, 1202-1222 (1994) © 1994 Massachusetts Institute of Technology

1 Introduction

Neural networks have been successfully applied to many pattern recognition problems. Usually, the classification problem is cast as an interpolation problem by assigning real-valued "targets." The network attempts to interpolate these targets at locations defined by a set of training data in a multidimensional input space or feature space.
In this input space, the network may be seen as approximating a nonlinear function. Often, a contour on the network output function is treated as a discriminant function. Sometimes, with proper normalization, network outputs are also interpreted as posterior probabilities of class membership. Training, or parameter estimation, is usually done by minimizing the total squared interpolation error over the collection of training examples, using a gradient descent type procedure. In this paper, we consider an algorithm that is suitable for data that arrive sequentially. As each item of data arrives, the classifier has to be trained incrementally so as to satisfy the new arrival and at the same time remain consistent with past observations. Past observations are not retained. Additionally, even in situations where all data are available before training, one might be able to achieve a computational advantage with sequential learning procedures, as such algorithms see each item of data only once. A further motivation for this work is the observation that in a multilayer perceptron classifier the contribution of each unit in hidden layers is essentially "local." If we view the classification boundary as being approximated by segments of hyperplanes, the positions of these segments are determined primarily by data that lie close to them. Data far away from the segment boundaries contribute little to the error function. In this paper, we exploit this observation by sequentially partitioning the input space so that training of nodes in the multilayer perceptron is local. We present each example to the network only once and dynamically increase the size of the network. As every example arrives, the net decides whether to add a new node or simply train the existing nodes, according to some criteria. There are many advantages to this approach. The network can be expanded to a multilayer network by introducing new layers whenever they are needed and training them separately and sequentially with the set of outputs from the previous layer. In addition, this network allows continuous learning (i.e., learning every new example even outside the training set), which means the net will dynamically forget and remember examples according to their distributions and occurrences. Such on-line learning is especially useful when the data arrive sequentially and have a slowly drifting pattern of statistical behavior. Every node in the network is trained by the recursive least-squares (RLS) algorithm, a technique that is widely known in the area of adaptive filters (Haykin 1984). Azimi-Sadjadi and Sheedvash (1991) and Sin and de Figueiredo (1992) have used the RLS algorithm in training multilayer perceptrons (MLP). Other related work is the use of the extended Kalman filter (EKF) algorithm, which is similar in form to RLS, but allows one to incorporate knowledge or estimates of noise variances in the data. Singhal and Wu (1989) have used the EKF algorithm to train an MLP. The EKF algorithm has also been used for estimating radial basis function models by Kadirkamanathan et al. (1991, 1992). The common
theme underlying these methods, which we borrow from adaptive signal processing, is that the inverse of the matrix appearing in the closed-form expression for the least-squares solution of a system of equations can be evaluated recursively, using the matrix inversion lemma. In the following sections we will show how a perceptron can be trained by the RLS method, and construct the criteria that control network growth. Then, the sequential machine will be introduced and illustrated by a simple example. Finally, we will demonstrate the performance of this classifier on two real-life problems, namely the vowel classification as a low, 2-dimensional problem (Peterson and Barney 1952), and the wheat classification as a large, 33-dimensional problem (Keefe and Draper 1986; Keefe 1990).

2 Recursive Least Squares
Given a set of $N$ training data $\{x_i, t_i\}$, $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^{p+1}$ and the $t_i$ are real values known as targets, the interpolation conditions for a perceptron model are a set of simultaneous equations given by

$$t_i = f(x_i^T w), \qquad i = 1, \ldots, N \tag{2.1}$$

For input data of dimension $p$, we work in a $(p+1)$-dimensional space to absorb the bias term in the perceptron model and to make the notation easier; the $(p+1)$th component of each input vector is set to 1. The nonlinear function $f(\cdot)$ is usually a sigmoid, given by

$$f(a) = \frac{1}{1 + \exp(-a)} \tag{2.2}$$

The system of equations becomes

$$x_i^T w = \log \frac{t_i}{1 - t_i} \tag{2.3}$$

In matrix notation, we rewrite this as

$$X w = C \tag{2.4}$$

where $X$ is an $N \times (p+1)$ matrix of the input data vectors, $w$ is the unknown parameter vector of the perceptron, and $C$ is computed from the target values by the inverse of the nonlinearity. Because of the nature of the log function, for pattern recognition problems we use 0.1 and 0.9 as targets, instead of 0 and 1. The above is usually an overdetermined set of linear equations, i.e., $N \gg (p+1)$. The least-squares solution that minimizes the total squared error

$$E = \sum_{i=1}^{N} e_i^2 \tag{2.5}$$

$$e_i = t_i - f(x_i^T w) \tag{2.6}$$

can be obtained from the pseudoinverse as

$$w = [X^T X]^{-1} X^T C \tag{2.7}$$

Letting $X^T X = B$ and $X^T C = G$ reduces this to

$$w = B^{-1} G \tag{2.8}$$

Here, the matrix $B$ is the autocorrelation matrix of the input patterns, and the vector $G$ is the cross-correlation between the inputs and their designated targets (Gardner 1986); both quantities can be found recursively at the introduction of a new example $x_i$ as follows:

$$B_i = B_{i-1} + x_i x_i^T \tag{2.9}$$

$$G_i = G_{i-1} + x_i t_i \tag{2.10}$$

In sequential training, we use the matrix-inversion lemma (Haykin 1984) to compute the inverse of these terms recursively. The inverse of $B_i$ can be computed from

$$B_i^{-1} = B_{i-1}^{-1} - \frac{B_{i-1}^{-1} x_i x_i^T B_{i-1}^{-1}}{1 + x_i^T B_{i-1}^{-1} x_i} \tag{2.11}$$
For sequential learning we prefer the above RLS approach to a gradient descent type algorithm, because the network sees each example only once and additional information is retained in the form of inverse covariance matrices. A simpler approach would be a Widrow-Hoff type algorithm [the least mean squares (LMS) algorithm of adaptive signal processing], which performs an approximate gradient descent using $\partial e_i^2 / \partial w$ at the arrival of each example $\{x_i, t_i\}$. In that case, the information we extract from each item of data is minimal. With the RLS algorithm, we accumulate data covariance information in addition to the gradient information. Though the amount of computation at each iteration is much higher than for the LMS type approach, convergence is much faster (see Fig. 1). Further, in a practical implementation where the data arrival is sequential (e.g., time series prediction or on-line control), we can pipeline these computations on a systolic array type architecture (McCanny and White 1987; Kung 1988).

3 The Sequential Machine
We now look at how to incorporate the RLS training of a single perceptron into building a multilayer network of perceptrons sequentially. This is done essentially by partitioning the input space of each layer into local areas and applying the RLS algorithm to the nodes local to the newly presented datum. As each example arrives, we apply three criteria to either associate the example with an existing node and re-estimate its parameters, or expand the network if the existing nodes cannot properly account for the data.
Figure 1: A comparison, on a simple vowel classification problem, of the LMS (solid line) and RLS (dashed line) convergence rates as a function of the number of examples sequentially presented to the network. The problem here is linearly separable and has only two dimensions.
3.1 Criterion for Linear Separability. The main function of a hyperplane is to separate two classes by linearly dividing the input space into two regions; training this boundary with a linearly nonseparable example using the RLS algorithm will therefore shift the boundary to an undesired place. We refer to this situation as the training being unrelaxed. We introduce a criterion that checks whether the new input example will cause an unrelaxed training situation with any node in the current layer. If such a situation is detected with a particular node, then training this node with the current example will be discouraged. This criterion is based on the correlation between the outputs $Y = f(Xw)$ and their designated targets $T$ (the quantity $T^T Y$). To make use of the quantities $B_i^{-1}$ and $G_i$ computed by the RLS algorithm, we consider the approximate corresponding linear version of the same quantity, $z = C^T X w$. If the RLS solution were exact, which is the case when $Xw = C$ (hence $Y = T$), then $z$ would reach its maximum value of $C^T C$. On the other hand, if the RLS solution for $w$ failed totally, the correlation between $Xw$ and $C$ would be zero. This idea can be developed into a criterion figure that equals one if the network is totally successful and zero if it fails totally when training the new example with the existing $B^{-1}$ and $G$. This can be expressed in the normalized correlation

$$Z = \frac{C^T X w}{C^T C} \tag{3.1}$$

$$= \frac{G^T B^{-1} G}{C^T C} \tag{3.2}$$
The normalized quantity $Z$ can take any value between 0 (least correlation) and 1 (highest correlation). The degree of correlation depends mostly on whether the new example maintains linear separability in the input space with respect to the boundary under test. We can set a threshold value $0 \le Z_o \le 1$ to specify the permitted degree of linear separability that must be maintained to allow training. For every example $x_i$ presented to the system, $B_i^{-1}$ and $G_i$ for each node are temporarily updated using equations 2.10 and 2.11, and $Z$ is calculated using equation 3.2. $C^T C$ can easily be updated at each new example by

$$C_i^T C_i = C_{i-1}^T C_{i-1} + c_i^2 \tag{3.3}$$

If $Z > Z_o$ for a particular node, then training this node with that example will keep the situation linearly separable, and the temporarily updated $B_i^{-1}$ and $G_i$ are kept; otherwise the training is unrelaxed or the new example is linearly inseparable, so a new node is introduced to ease the training process, and $B_i^{-1}$ and $G_i$ revert to their previous values. Figure 2 shows different values of $Z$ for a newly presented example from either class around the boundary, with respect to its location in the input space. The threshold $Z_o$ is set according to how relaxed the user wants the network to be. However, a more relaxed network will create more nodes and hence a larger network.
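In code, this test is inexpensive because every quantity in equation 3.2 is already stored at the node. A minimal sketch (our function and variable names) is:

```python
def separability_z(B_inv, G, CtC):
    """Normalized correlation Z = G' B^{-1} G / C' C of equation 3.2."""
    return float(G @ B_inv @ G) / CtC

# Typical use: temporarily apply the RLS update and the C'C update (3.3),
# compute Z, and keep the update only if Z exceeds the threshold Z_o.
```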
3.2 Criterion for Remoteness. There might be cases where the new example is linearly separable (i.e., has a high $Z$ value) but is too remote to be associated with its class cluster. In this case, we associate the input with a new cluster of the same class. To measure remoteness, we introduce a second criterion, again determined by quantities that can be computed by the RLS algorithm. We compute the quantities $B^{-1}$ and $G$ for each class the node is classifying; each new example is then trained against the tested cluster, but with the opposite target, using the RLS algorithm for a new hyperplane. The error $|c_i - x_i^T w|$ is an indication of how near $x_i$ is to that cluster, where $c_i$ is the corresponding inverse of the nonlinearity of the target $t_i$. This value swings from near zero to $|2 c_i|$ as $x_i$ travels from a remote distance to the center of that cluster. Normalizing this quantity gives a measure of how remote the input is from that cluster:

$$D = \frac{|c_i - x_i^T w|}{|2 c_i|} \tag{3.4}$$
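As a sketch (our names), with $w$ the trial hyperplane obtained by training the example against the tested cluster with the opposite linearized target $c$:

```python
def remoteness_d(w, x, c):
    """Remoteness measure D of equation 3.4: D is near 0 for a remote
    input and near 1 when x lies at the center of the tested cluster."""
    return abs(c - x @ w) / abs(2.0 * c)
```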
Figure 2: A system with two classes in two-dimensional space separated by one hyperplane. New examples are introduced from both classes at different locations, labeled with their corresponding Z value.

The advantage of this criterion is that it takes the geometry of the tested cluster into consideration by applying the RLS algorithm on that cluster. However, separate $B^{-1}$ and $G$ quantities for each side of the hyperplane must be maintained and updated each time the node is trained with a new input. Figure 3 shows different values of $D$ for newly introduced examples in a system of two classes. In each case, $D$ is computed for both classes and the lower of the two values is chosen.

3.3 Criterion for Locality. Locality in this context is defined as follows: the input $x_i$ is local to a boundary if there is no other boundary between the location of the input and that boundary. When testing the present input for locality to a boundary, the input is orthogonally projected onto the boundary, producing a new point $x_i'$ in the input space, which is then tested by all the other nodes in the same layer. The present example is local only if the outputs of all the other nodes maintain their status within the same class. Since the weight vector $w$ is always perpendicular to the boundary, the difference between $x_i'$ and $x_i$ must lie in the same or opposite direction as $w$, i.e.,

$$x_i' - x_i = q\, w \tag{3.5}$$

where $q$ is a scalar, and all the vectors in this equation are taken without the bias term.
Figure 3: A two-class system similar to Figure 2, with new examples labeled by their D values. The target of the new example does not affect D, since each example is trained with the opposite target of the cluster under test. D is calculated for both classes and the lower value is always chosen.

Moreover, $x_i'$ lies on the boundary; hence

$$x_i'^T w = 0 \tag{3.6}$$

where the vectors now include the bias term. Using these two equations we can solve for $x_i'$ and carry out the test on the other nodes. A change of output status for a node $k$ is detected when the quantity $x_i'^T w_k$ has a different sign from $x_i^T w_k$. Computationally, this can be done by multiplying the two quantities:

$$L_k = [x_i'^T w_k][x_i^T w_k] \tag{3.7}$$

If $L_k$ is negative, then a change in status has occurred; otherwise $L_k$ is positive and the status is unchanged.
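A sketch of the locality test follows (our names; the projection formula is the standard orthogonal projection implied by equations 3.5 and 3.6):

```python
import numpy as np

def project_onto_boundary(w_full, x):
    """Orthogonal projection of x onto the hyperplane w.x + b = 0 (eqs. 3.5, 3.6).
    w_full stacks the weights w and the bias b; x excludes the bias input."""
    w, b = w_full[:-1], w_full[-1]
    q = -(w @ x + b) / (w @ w)
    return x + q * w

def is_local(w_full, x, other_nodes):
    """x is local to w_full if no other boundary separates x from its
    projection, i.e., L_k is nonnegative for every other node k (eq. 3.7)."""
    xp = project_onto_boundary(w_full, x)
    xb, xpb = np.append(x, 1.0), np.append(xp, 1.0)   # re-append bias inputs
    return all((xpb @ wk) * (xb @ wk) >= 0.0 for wk in other_nodes)
```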
3.4 The Final Classifier and an Illustrative Example. The machine developed here classifies only two classes, but expansion to more than two classes is straightforward. For $r$-class problems, we build $r$ machines, each of which classifies one class against the other $r - 1$ classes; hence the final classifier has $r$ outputs, each mapping to only one class. In a two-class problem, the machine has only one output, which fires when the example belongs to one of the classes.
Each layer generated within the network is treated as an entity and trained separately. The role of every layer is to successfully partition its input space so as to resolve any nonseparable (or unrelaxed) training situations. The algorithm for partitioning the input space is summarized as follows; snapshots of the process are illustrated in Figure 4:

1. Start the machine with one node, and train it with the first example using the RLS algorithm.

2. Present a new example, and for every node in this layer check remoteness by computing D for both sides of the boundary, check locality by computing L, and check separability by computing Z.

3. If the example is remote from all boundaries, create a new node and train it with this example using the RLS algorithm; go back to step 2.

4. If the example is local and linearly separable with a set of nodes, train these nodes with the example using the RLS algorithm; go back to step 2.

5. If the example is local but linearly nonseparable with a node, then create a new node and use $B^{-1}$ and $G$ of the opposite class of that node, together with the present class, to train the newly created node; go to step 2.

In order to resolve the nonseparability created by the presence of a new example, a new node is created (as in step 5) to correctly place a hyperplane between this example and its opposite class, in the locality of an already existing hyperplane. In this case, the newly created node should inherit certain knowledge from the nearby local node. The necessary information can be obtained by sequentially calculating $B_A^{-1}$, $B_B^{-1}$ and $G_A$, $G_B$ of each class (classes A and B) separately, in addition to the quantities $B^{-1}$ and $G$ for the two classes combined. Moreover, the new node requires the quantity $C^T C$ of the old examples belonging to the opposite class in order to normalize the new $Z$; hence the quantities $C_A^T C_A$ and $C_B^T C_B$ (of classes A and B) also need to be preserved and updated. Training a node as in step 1 involves updating all these quantities using the RLS algorithm and equation 3.3. This ensures that at any time, any node has all the relevant knowledge about its local examples and is ready to share it with any newly created node within its surroundings. Thus the basic data structure of each node that has to be preserved in memory is as follows:

1. $B^{-1}$ and $G$, to be updated on every example considered local to the node.

2. $B_A^{-1}$ and $G_A$, to be updated on every local example belonging to class A.
Figure 4: Space partitioning for one layer of a 2-D input space. New hyperplanes are introduced by creating new nodes when necessary. Old hyperplanes are trained when training is relaxed enough. Both training and creation occur upon the arrival of a new example.
3. $B_B^{-1}$ and $G_B$, to be updated on every local example belonging to class B.

4. $C^T C$ (used to normalize $Z$), to be updated on every local example.

5. $C_A^T C_A$, to be updated on every local example belonging to class A.

6. $C_B^T C_B$, to be updated on every local example belonging to class B.
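This fixed data structure can be written down directly. The following Python sketch (our naming; the initialization of the inverse matrices is our assumption) holds exactly the quantities listed above:

```python
import numpy as np

class Node:
    """Per-node storage: 3 [(p+1)^2 + (p+1)] + 3 memory elements,
    independent of the number of training examples."""
    def __init__(self, p):
        d = p + 1
        self.B_inv   = 1e3 * np.eye(d)   # combined inverse autocorrelation
        self.B_inv_A = 1e3 * np.eye(d)   # class-A inverse autocorrelation
        self.B_inv_B = 1e3 * np.eye(d)   # class-B inverse autocorrelation
        self.G   = np.zeros(d)           # combined cross-correlation
        self.G_A = np.zeros(d)
        self.G_B = np.zeros(d)
        self.CtC = self.CtC_A = self.CtC_B = 0.0   # scalar target energies
```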
Apart from $B^{-1}$, which is a square matrix of dimension $p+1$, and $G$, which is a vector of $p+1$ elements ($p$ is the number of features), the other parameters are scalars. Not only does this scheme consume relatively little memory (depending on the size of the network) compared to other neural network techniques, it also presents a fixed data structure that depends only on the input space dimensionality $p$ and, more importantly, is independent of the number of training examples. For instance, in a layer with $p$ inputs, each node requires $3 \times [(p+1)^2 + (p+1)] + 3$ memory elements, compared to $N \times p$ fixed memory elements to hold the example batch, plus at least $p+1$ memory elements per node to store its parameter vector $w$, in backpropagation. For example, if $p = 2$ then each node in our scheme requires 39 memory elements, which is equivalent to storing only 18 examples plus $w$ in a one-node network trained by backpropagation. Since we are looking at a 2-class problem with targets either 0 or 1, it is sufficient to have only one output, which fires for one class and stays unfired to denote the other classes. This idea of keeping only one final output at all times can be used to control the addition of new layers to the system, as described in the following algorithm. Start with one layer containing only one node (hence one output).
1. Present a new example to the system and apply the partitioning algorithm on the first layer.
2. Apply the partitioning algorithm on subsequent layers with its inputs fed from the outputs of the previous layers, using the same original targets. 3. Check the number of nodes in the final layer: If greater than 1, then create a new layer with only one node and train it with the outputs of the previous layer.
4. If equal to 1, then go to step 1.

This scheme ensures that after each example is presented, the machine converges to only one output. Since $Z_o$ is the most crucial factor in deciding when to expand in a localized region, it can also be used to control the number of layers. In the problem shown in Figure 5, we started with one node and increased the value of $Z_o$ in order to relax the training. Applying the partitioning algorithm expanded the first layer into three nodes (hence three hyperplanes). To prevent the network from expanding to more than two layers, $Z_o$ in the output layer was reduced to always enforce training and prevent the creation of extra nodes. Hence, the machine converged to only two layers after training. In many practical problems, two layers are enough to classify one class against all other classes; however, one extra layer can be added to solve more complicated classes. To illustrate the idea of expanding to more than two layers, we use the same example as in Figure 5, but this time $Z_o$
(Figure 5 panels: net with 1 node; net with 2 nodes; net with 3 nodes; axes are the 1st and 2nd features.)
Figure 5: Simulation of the algorithm on a simple 2-D problem. The dotted lines denote the class boundaries contributed by each of the three nodes in the hidden layer; the solid line is the effective class boundary of the whole network.
was set to enable the network to expand into a 2-2-1 configuration. We limited the number of nodes to 2 in the two hidden layers in order to visualize the inputs of each layer (see Fig. 6). Clearly, this scheme allows expansion to many layers. However, it is known that a three-layer network is adequate to solve any problem, and adding more layers may not improve network performance. One fact to point out is that since the size and parameters of the network are modified sequentially with every example, without anticipating future examples, the final network architecture depends on the order in which the examples are presented during training. This might affect the final performance of the network. However, if the number of examples is large enough for the network architecture to converge to a proper solution, then its performance is expected to show only small variations for different sequences. We will show that this is true for the vowel data (see Fig. 9). Moreover, one might anticipate that noisy examples will shape the network to a poorer solution; but because of the localization of training and the stability of the RLS learning rule, the network maintains good generalization under noisy conditions, although noise affects the network size and performance, as noise added to the examples will expand the input space and introduce more complexity to the shape of the final boundary. This point is illustrated on one of the vowel classes in Section 4.1, where we added noise to the original training data and examined the final architecture and its performance on a clean testing set of data (see Fig. 10).
Figure 6: Simulation of the algorithm on the previous example. The network converged into a 2-2-1 format. (A) Partitioning the input space in layer 1. (B) Partitioning the output space of the first layer using the two nodes in layer 2. (C) Partitioning the output space of the second layer using the node in layer 3. (D) A diagram of the final network.

4 Experimental Work
We now show the performance of the model on two pattern recognition problems. The first is a 2-dimensional vowel recognition problem. The second is a more complicated 33-dimensional wheat classification problem.

4.1 Vowel Recognition. The data here are based on the Peterson and Barney database (Peterson and Barney 1952). The classes are vowel sounds characterized by the first four formant frequencies, produced by people of different ages and genders. To visualize the problem, we reduced it to two features by taking the first two formant frequencies. The database consists of 1494 examples, randomly split into a training set of 1000 and a test set of 494 examples. The examples are mapped to 10 vowels (10 classes). Therefore, we built 10 one-output machines; each machine has a hidden layer and one output that fires in response to only one specific class. The hidden layer was allowed to evolve to a sufficient number of nodes to partition the input space using our algorithm. Table 1 shows the performance of each machine and its final size. These results also indicate good post-training generalization to the testing data set. Further, the outputs of the above 10 networks were tested collectively to recognize the corresponding classes. The scheme was to choose the
Table 1: Results of the Vowel Problem.

Vowel     IY    IH    EH    AE    AH    AA    AO    UH    UW    ER    Ave.
Train     97.6  95.5  94.4  96.8  95.3  96.0  95.0  93.4  95.9  92.8  95.3
Test      98.0  96.0  94.3  96.4  94.7  97.0  95.5  91.5  94.7  91.3  94.9
Net size  9     19    16    16    41    17    15    15    11    28    18.7

Each percentage corresponds to individual class recognition. Net size corresponds to the number of nodes in the hidden layer.
highest output to designate the correct class. Accordingly, the networks correctly classified 78% of the training data and 74.5% of the testing data. This modest result can be attributed to the overlap between classes, bearing in mind that only the first two formant frequencies were considered in the training data (see Fig. 7). Nevertheless, our results are comparable to other sequential and nonsequential methods applied to the same problem, as reported in Kadirkamanathan and Niranjan (1992), Lowe (1989), and Bridle (1990). Table 2 shows these reported results and the results of our algorithm. It should be noted that the work of Kadirkamanathan and Niranjan (1992) is the nearest to ours in terms of the size of the training and testing sets; Lowe (1989) used only 338 examples as the training set and 333 examples as the testing set. Other results, not mentioned here, were reported by Prager (1992) and Jacobs et al. (1991). However, Prager used the first four formant frequencies instead of two, and Jacobs used only 4 of the 10 vowel classes. The overall picture of how the classes were finally bounded using our scheme is shown in Figure 8. To illustrate the effect of the input data sequence on the architecture and its final performance, we trained one of the machines with different sequences by randomly changing the order of the first 250-300 examples in the training set, without any repetition of examples, as these examples contribute to the shaping of the network. The dynamics of the hidden layer for different sequences for the vowel UH are shown in Figure 9. We chose this vowel because it is a difficult problem, as seen from its low performance, and it is surrounded by other classes from all directions (see Fig. 8). Although different sequences resulted in different network sizes, the performances, on both training and testing sets, were comparable (within a ±3% difference). Table 3 shows the performances for 10 different sequences and the corresponding hidden layer sizes. On the noise issue, we again selected the vowel (UH) data, added different levels of noise to the training examples, and tested performance on the clean testing set. During these experiments, the hidden layer size was not allowed to exceed 100 units. The results are shown in Figure 10 for both normal and uniform added noise. The network size increased with the noise level to compensate for the added complexity. The performance was maintained up to a noise level of about 25%, after which the drop
Table 2: Percentage Correct of Different Reported Classifiers and the SISP Algorithm.

Classifier                                                        Train   Test
Optimum linear transformation (Lowe 1989)                         41.12   40.24
Distance-to-class mean (Euc.) (Lowe 1989)                         68.05   68.77
Distance-to-class mean (Mah.) (Lowe 1989)                         78.40   80.18
Full gaussian classifier (Lowe 1989)                              76.63   74.85
Nearest neighbor (Lowe 1989)                                      82.25   77.48
K-nearest neighbor (K=5) (Lowe 1989)                              na      81.38
Gaussian RBF (36 hidden) (Lowe 1989)                              na      80.18
Thin plate spline (32 hidden) (Lowe 1989)                         na      81.08
Gaussian RBF (softmax) (Bridle 1990)                              100.0   78
Dynamic network (85 hidden) (Kadirkamanathan and Niranjan 1991)   76.58   75.40
Sequential input space partitioning                               78.0    74.5

na, not available.
Figure 7: Initial boundaries made by the 10 networks at the output level 0.5; notice the overlap between the different classes. Symbols may represent more than one class.
Figure 8: The final classification of the 10 classes in the vowel problem using the "highest output denotes the correct class" scheme. Class (UH) is denoted by 'o' in the middle of the lower half of the graph.

Table 3: Percentage Correct Results for 10 Different Training Input Sequences of the Vowel (UH).

Sequence  1     2     3     4     5     6     7     8     9     10    Ave.  SD
Train     93.9  93.7  93.8  93.8  93.0  93.5  93.3  93.4  93.8  93.3  93.6  0.3
Test      91.7  91.9  91.3  90.9  91.7  91.7  89.7  91.3  92.1  90.9  91.3  0.7
Net size  12    18    13    14    21    13    12    16    31    15    16.5  5.8

Net size corresponds to the number of nodes in the hidden layer.
in performance on both training and testing sets was significant. Generalization was maintained, as the classification trend on the testing set followed that on the training set.

4.2 The Wheat Problem. The wheat problem provides a 33-dimensional database. The task here is to classify one strain of wheat grain, called mercia, against 21 other varieties. Mercia wheat grain is important for good quality bread. The database originates from a special imaging system developed at the National Institute for Agricultural Botany in Cambridge (Keefe and Draper 1986; Keefe 1990) for wheat identification.
Figure 9: Growth patterns of a hidden layer for the class (vowel UH) during training, for different presentation sequences of the input examples.

Table 4: Results on the Wheat Problem Using the SISP Algorithm.

Z_o           0.9   0.45  0.35
Net size      17    3     1
Train-2000    81.7  79.1  83.2
Train-23006   84.2  80.9  81.9
Test-2000     82.1  83.5  84.5
It contains 23,006 training examples and a separate set of 2000 examples for testing. Each example is a vector of 33 elements. Out of the entire training set we selected 2000 examples (1151 mercia and 849 of other classes) as our training set. The experiment involved three networks using three different $Z_o$ values (see Section 3.1), leading to different network sizes. Since we are dealing with a one-class problem, two layers were enough to solve it. Hence, all networks were made to converge to only one node in the second layer. The network sizes and their performances are summarized in Table 4. The same problem was tackled by Prager (1992) using different kinds of training methods. For the sake of comparison, the performances of
(Figure 10 panels: normal noise and uniform noise; x-axis, percentage noise level; legend: o, training; +, testing; dotted, net size.)
Figure 10: Performance of the algorithm in a noisy environment. The problem used is the vowel (UH). The y-axis represents percentage correct for training and testing data; it also represents the number of nodes in the hidden layer. At any time, the number of nodes was not allowed to exceed the 100-unit limit.
these networks are listed in Table 5. Our results are comparable to those obtained by the networks investigated by Prager. However, emphasis is drawn to the number of data cycles each of those networks needed to converge, compared to the single data cycle in our case.
Table 5: Performance of Prager's Networks (1992) on the Wheat Problem and Our Network Performance.

Network type                          Training cycles  Train   Test
Single layer network                  42818            83.3%   86.9%
500 random locations                  3714             72.8%   77.4%
Approx. 500 pruned locations          3476             78.4%   82.8%
Sequential input space partitioning   1                81.9%   84.5%
5 Conclusions
A sequential approach to designing a multilayer perceptron pattern classifier is presented. The architecture of the MLP (i.e., the number of nodes)
is increased dynamically according to the complexity required to classify the incoming data. Training each node in the network is achieved by a closed-form solution using the recursive least-squares algorithm applied to local covariance matrices. Thus each node contributes a segment of the class boundary and makes local classification decisions. The approach has the advantage that training is sequential, hence fast, and it is particularly useful when the nature of the problem is essentially sequential too; iterative methods, by contrast, require the whole training set to be passed through the network a large number of times, making them slower and harder to implement. In situations where there are a large number of data items, the computations may be pipelined and implemented on a systolic array type architecture. We are currently working on a simulation of such an architecture. The performance of the algorithm is illustrated on the simple two-dimensional vowel classification problem and on the larger dimension wheat problem. The results are comparable to those of other methods on the same benchmark problems, although fewer computations were required. One might expect RLS computations to be complex, but since effectively no matrix inversion is involved at any stage, the mathematical operations are limited to simple matrix manipulations. Hence, even under ordinary serial programming, the degree of computational complexity is comparable to many other neural net algorithms. This algorithm, moreover, has great potential to benefit from parallelism, in the sense that the computational modules can be broken into simpler ones and the data pipelined through these simple modules, increasing the processing efficiency. As expected, the order of the input data affected the network architecture; but provided that the training data set is large enough, these effects can be minor and may have a negligible effect on the final performance and generalization of the network. The algorithm also shows encourag-
ing signs of maintaining its performance even in a noisy environment. However, the increase in complexity due to the noisy data increased the size of the network. We believe that this increase in network size does not contribute much to performance, and many of these boundaries could be redundant. We are currently working on a pruning algorithm to remove such redundant units.
Acknowledgments

Part of this work was published in the proceedings of the IEEE International Conference on Neural Networks, March 28-April 1, 1993, San Francisco, CA, Vol. 1, pp. 226-231. R. S. S. was supported by a grant from Trinity College, Cambridge, and the Karim Rida Said Foundation.
References

Azimi-Sadjadi, M. R., and Sheedvash, S. 1991. Recursive node creation in backpropagation neural networks using orthogonal projection method. ICASSP 91, pp. 2181-2184. Toronto.
Bridle, J. S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications, F. Fogelman-Soulie and J. Herault, eds., pp. 227-236. Springer-Verlag, Berlin.
Chatfield, C., and Collins, A. J. 1980. Introduction to Multivariate Analysis, pp. 190-193. Chapman and Hall, London.
Draper, N. R., and Smith, H. 1966. Applied Regression Analysis. John Wiley, New York.
Gardner, W. A. 1986. Introduction to Random Processes. Macmillan, New York.
Haykin, S. 1984. Introduction to Adaptive Filters. Macmillan, New York.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. Adaptive mixtures of local experts. Neural Comp. 3(1), 79-87.
Kadirkamanathan, V., and Niranjan, M. 1991. Nonlinear adaptive filtering in nonstationary environments. ICASSP 91, pp. 1713-1715. Toronto.
Kadirkamanathan, V., and Niranjan, M. 1992. Application of an architecturally dynamic network for speech pattern classification. Proc. Inst. Acoust. 14(6), 343-350.
Keefe, P. D., and Draper, S. R. 1986. The measurement of new characters for cultivar identification in wheat using machine vision. Seed Sci. Technol. 14, 715-724.
Keefe, P. D. 1990. Observations concerning shape variation in wheat grains. Seed Sci. Technol. 18, 629-640.
Kung, S. Y. 1988. VLSI Array Processors. Prentice Hall, Englewood Cliffs, NJ.
Lowe, D. 1989. Adaptive radial basis function nonlinearities and the problem of generalisation. Proc. IEE Conf. Artificial Neural Networks, pp. 171-175.
McCanny, J. V., and White, J. C. 1987. VLSI Technology and Design. Academic Press, New York.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of the vowels. JASA 24, 175-184.
Prager, R. W. 1992. Some experiments with fixed non-linear mappings and single layer networks. Tech. Rep. CUED/F-INFENG/TR.106, Cambridge University Engineering Department.
Sin, S.-K., and de Figueiredo, R. J. P. 1992. An evolution-oriented learning algorithm for the optimal interpolative net. IEEE Trans. Neural Networks 3(2), 315-323.
Singhal, S., and Wu, L. 1989. Training feed-forward networks by the extended Kalman algorithm. ICASSP 89, Glasgow.
Received May 19, 1993; accepted January 11, 1994.
Communicated by David MacKay
Pruning from Adaptive Regularization Lars Kai Hansen Carl Edward Rasmussen CONNECT, Electronics Institute, Technical University of Denmark, B 349, DK-2800 Lyngby, Denmark
Inspired by the recent upsurge of interest in Bayesian methods we consider adaptive regularization. A generalization-based scheme for adaptation of regularization parameters is introduced and compared to Bayesian regularization. We show that pruning arises naturally within both adaptive regularization schemes. As a model example we have chosen the simplest possible: estimating the mean of a random variable with known variance. Marked similarities are found between the two methods in that they both involve a "noise limit," below which they regularize with infinite weight decay, i.e., they prune. However, pruning is not always beneficial. We show explicitly that both methods in some cases may increase the generalization error. This corresponds to situations where the underlying assumptions of the regularizer are poorly matched to the environment.

1 Introduction

We believe in Ockham's Razor: the generalization error of a model with estimated parameters is decreased by constraining capacity to the minimum needed for capturing the rule (see, e.g., Thodberg 1991; Solla 1992). However, this minimum may be hard to define for nonlinear noisy systems where the rule is ill-defined. Pruning is a popular tool for reducing model capacity, and pruning schemes have been successfully applied to layered neural networks (Le Cun et al. 1990; Thodberg 1991; Svarer et al. 1993). While pruning is a discrete decision process, regularization introduces soft constraints such as weight decay (Moody 1991). A common feature of these techniques is the need for control parameters: stop criteria for pruning and weight decays for regularization. In Svarer et al. (1993) a statistical stop criterion was developed for pruning of networks for regression problems. Recently, MacKay reviewed a Bayesian approach to adaptive regularization in the context of neural networks, demonstrating that the evidence-based method can improve the generalization properties, and compared it to cross-validation (MacKay 1992a,b). Cross-validation is known to be rather noisy; hence methods based on statistical arguments

Neural Computation 6, 1223-1232 (1994) © 1994 Massachusetts Institute of Technology
are recommended (see, e.g., Akaike 1969; Moody 1991; Hansen 1993). In this presentation we define such an alternative approach to adaptive regularization, and we test it on a case of ultimate simplicity, namely that of estimating the mean of a gaussian variable of known variance. Detailed insight can be obtained for this case. In the course of comparing the two schemes, we have discovered a new feature of adaptive regularization. We find that both approaches involve a "noise limit," below which they regularize with infinite weight decay, i.e., they prune. This finding unifies the pruning and regularization approaches to capacity control.

2 Specification of the Toy Problem
The problem is defined as follows: consider a student-teacher setup based on a teacher (with parameter $\bar{w}$) providing $N$ examples of the form

$$y_m = \bar{w} + \nu_m, \qquad \nu_m \sim N(0, \sigma^2), \qquad m = 1, \ldots, N \tag{2.1}$$
where the normal noise contributions are independent with zero mean and common known variance $\sigma^2$. Based on the training data set $D = \{y_m \mid m = 1, \ldots, N\}$, the task of the student (with parameter $w$) is to infer the mean $\bar{w}$. The measure of success for the student is the generalization error,

$$E_G(w) = \int d\nu\, P(\nu)\, (\bar{w} + \nu - w)^2 = (\bar{w} - w)^2 + \sigma^2 \tag{2.2}$$

This is the expected error on a random new test example for the specific student weight as estimated on the given training set. In evaluating an estimation scheme, our measure will be the average generalization error obtained by averaging over all possible training sets of the given size $N$. In the next three sections we consider three such estimators: the usual "empirical mean" obtained from maximum likelihood, an estimator based on MacKay's maximum evidence, and a novel estimator based on explicit minimization of the expected generalization error. In Section 6 we compute the average generalization error of each of the schemes, as a function of the teacher parameter.

3 Maximum Likelihood Estimation
The likelihood of the student parameter associated with the training set $D$ is

$$P(D \mid w) = \prod_{m=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_m - w)^2}{2\sigma^2}\right] \tag{3.1}$$
Maximizing this with respect to $w$ produces the standard maximum likelihood estimator, the empirical mean

$$w_{ML} = \frac{1}{N} \sum_{m=1}^{N} y_m \tag{3.2}$$

This estimator is unbiased, i.e., its average value, averaging over all possible training sets, is the true mean $\bar{w}$. For a specific training set, the generalization error is obtained using 2.2. However, as mentioned above, we are not so much interested in the value for the particular training set. Rather we are interested in the expected error, obtained by averaging over all possible training sets of size $N$. To compute this quantity, we note that $w_{ML} \sim N(\bar{w}, \sigma^2/N)$; hence the average generalization error is

$$\langle E_G(w_{ML}) \rangle = \frac{\sigma^2}{N} + \sigma^2 \tag{3.3}$$

The first term is the average excess error made by the student due to the finite training set. The second term is due to the imperfectness of the data; there is noise in the test examples used for grading the student. The minimal error, approached for large training sets, is simply the noise level.

4 Bayesian Regularization
A Bayesian scheme, recently suggested to the neural net community by MacKay, adopts two levels of inference. The first level of inference consists of estimating the teacher parameters from the data, conditioned on a parameterized prior distribution of the teacher mean value. The second level of inference consists of estimating the parameters of the prior, with the purpose of maximizing the evidence. The probability $P(w \mid D)$ of the parameter conditioned on the data can be obtained using Bayes' rule,

$$P(w \mid D) = \frac{P(D \mid w)\, P(w)}{P(D)} \tag{4.1}$$

by specifying a prior distribution $P(w)$ of the parameter. We follow MacKay and employ a parameterized prior, $P(w) = P(w \mid \alpha)$, with a parameter $\alpha$ playing the role of a "weight decay," to be determined at the second level of inference. The prior takes the form

$$P(w \mid \alpha) = \sqrt{\frac{\alpha}{2\pi}} \exp\left(-\frac{\alpha}{2} w^2\right) \tag{4.2}$$
Using the likelihood from equation 3.1, we arrive at the posterior distribution

$$P(w \mid D, \alpha) \propto \exp\left[-\frac{1}{2\sigma^2} \sum_{m=1}^{N} (y_m - w)^2 - \frac{\alpha}{2} w^2\right] \tag{4.3}$$

To find the most probable teacher parameter, we maximize $P(w \mid D, \alpha)$ with respect to $w$, to get

$$w_{MP} = \frac{N}{N + \alpha\sigma^2}\, w_{ML} \tag{4.4}$$

In the following we will suppress the explicit dependence on the training data. The second level of inference is to estimate the regularization parameter, and we do this by again invoking Bayes' rule:

$$P(\alpha \mid D) = \frac{P(D \mid \alpha)\, P(\alpha)}{P(D)} \tag{4.5}$$
We assume the prior on $\alpha$ to be flat,¹ indicating that we do not know what value $\alpha$ should take. Hence, the most probable regularization is obtained by maximizing its likelihood. MacKay dubbed this quantity the evidence:
$$P(D \mid \alpha) = \int dw\, P(D \mid w, \alpha)\, P(w \mid \alpha)$$

$$= \int_{-\infty}^{\infty} dw\, \sqrt{\frac{\alpha}{2\pi}} \exp\left(-\frac{\alpha}{2} w^2\right) \prod_{m=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(y_m - w)^2\right] \tag{4.6}$$

$$\Longrightarrow\ \log P(D \mid \alpha) = \frac{1}{2}\log(\alpha/2\pi) - \frac{1}{2}\log\left(\frac{N/\sigma^2 + \alpha}{2\pi}\right) + \frac{(N w_{ML})^2}{2\sigma^2 (N + \alpha\sigma^2)} + \text{const} \tag{4.7}$$

where we have lumped all terms not depending on $\alpha$ in the constant. Maximizing the evidence with respect to $\alpha$ gives

$$\alpha_{ML} = \frac{1}{w_{ML}^2 - \sigma^2/N} \qquad \text{for } w_{ML}^2 > \sigma^2/N \tag{4.8}$$

¹By flat we mean flat over $\log(\alpha)$, since $\alpha$ is a scale parameter.
which may be substituted into the expression for $w_{MP}$ to find

$$w_{MP} = \begin{cases} w_{ML} - \dfrac{\sigma^2}{N w_{ML}} & \text{for } w_{ML}^2 > \sigma^2/N \\ 0 & \text{otherwise} \end{cases} \tag{4.9}$$

Hence there is a sharp "noise limit" below which the scheme tells us to "prune," that is, to regularize with infinite strength. The noise limit has a straightforward interpretation: prune the parameter if the signal-to-noise ratio is less than unity. Note that the estimator is biased; its mean value is not $\bar{w}$. It is consistent, since $w_{MP} \to \bar{w}$ for large samples, while the noise limit shrinks to zero. To illustrate the power of the method, we compute in Section 6 the (training set) ensemble average of the generalization error.

In the above derivation we have used the "flat in $\log(\alpha)$ space" prior. This prior is improper, since it cannot be normalized. However, this is not important for the evidence approximation. To see this we introduce limits: $\alpha \in [\alpha_1, \alpha_2]$. The value of $w_{MP}$ will not be affected by these limits as long as $\alpha_{ML} \in [\alpha_1, \alpha_2]$. Choosing $\alpha_1 \ll N/\sigma^2$ and $\alpha_2 \gg N/\sigma^2$, the effect of the limits will be negligible for all $\alpha$. This is seen by introducing the limits in equation 4.4. In particular, we note that qualitative pruning still takes place: $|w_{MP}| = |w_{ML}|\, N/(N + \alpha_2 \sigma^2) \ll |w_{ML}|$ for $w_{ML}^2 < \sigma^2/N + 1/\alpha_2$.

The evidence framework is an approximation to the ideal Bayesian approach. Instead of using the maximum likelihood value of $\alpha$, we should ideally integrate over the distribution of $\alpha$, i.e., evaluate $P(w) = \int P(w \mid \alpha)\, P(\alpha)\, d\alpha$, and then use the posterior mean for our prediction: $w_{PM} = \int w\, P(w)\, dw$. In contrast to the evidence framework, the ideal approach is quite sensitive to the upper limit $\alpha_2$. Indeed, if $\alpha_2 = \infty$ then $w_{PM} = 0$ regardless of the value of $w_{ML}$. For a range of $\alpha_2$ we find that $w_{PM}$ shows a qualitative pruning behavior similar to that of $w_{MP}$.
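A one-line sketch of the resulting estimator (our function name) makes the decision rule explicit:

```python
def w_most_probable(w_ml, sigma2, N):
    """Evidence-based estimate of equations 4.8-4.9: prune (infinite weight
    decay) when the signal-to-noise ratio w_ml^2 N / sigma^2 is below one."""
    if w_ml ** 2 <= sigma2 / N:
        return 0.0
    return w_ml - sigma2 / (N * w_ml)
```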
5 Generalization Error Minimization
While the evidence is an interesting statistical quantity, it is not obvious what the relation is between maximizing the evidence and minimizing test error (MacKay 1992a). Since the latter often is the basic modeling objective, we propose here to base the optimization of $\alpha$ on the expected generalization error (cf. equation 2.2). A similar approach was mentioned in Moody (1991). In order to be able to compare directly with the Bayesian approach, we use a regularized least-squares procedure to find the student weight:

$$E_T = \frac{1}{2} \sum_{m=1}^{N} (y_m - w)^2 + \frac{1}{2} \alpha \sigma^2 w^2 \tag{5.1}$$

Minimizing with respect to $w$, we recover 4.4:

$$w_G = \frac{N}{N + \alpha\sigma^2}\, w_{ML} \tag{5.2}$$

Our aim is to minimize the average generalization error (averaged again with respect to training sets). To this end we note that the distribution of $w_G(D, \alpha)$ can be computed using simple manipulations of random variables. Noting again that $w_{ML} \sim N(\bar{w}, \sigma^2/N)$, we have $w_G \sim N\left(\frac{N}{N+\alpha\sigma^2}\bar{w},\ \left(\frac{N}{N+\alpha\sigma^2}\right)^2 \frac{\sigma^2}{N}\right)$.² Consequently the averaged generalization error becomes

$$\langle E_G \rangle = \left(\frac{\alpha\sigma^2}{N+\alpha\sigma^2}\right)^2 \bar{w}^2 + \left(\frac{N}{N+\alpha\sigma^2}\right)^2 \frac{\sigma^2}{N} + \sigma^2 \tag{5.3}$$

You might be uncomfortable with the appearance of the unknown teacher parameter $\bar{w}$; however, it will shortly be replaced by an estimate based on the data. Minimizing the generalization error 5.3 with respect to $\alpha$, we find the simple expression

$$\alpha = \frac{1}{\bar{w}^2} \tag{5.4}$$

which is inserted in equation 5.2 to obtain

$$w_G = \frac{\bar{w}^2}{\bar{w}^2 + \sigma^2/N}\, w_{ML} \tag{5.5}$$

To proceed we need to insert an estimate of the teacher parameter $\bar{w}$. We will consider two cases; first we try setting $\bar{w} = w_{ML}$ in equation 5.5, hence

$$w_{G1} = \frac{w_{ML}^2}{w_{ML}^2 + \sigma^2/N}\, w_{ML} \tag{5.6}$$

This estimator is biased, but consistent, and it does not call for pruning even if the data are very noisy. Secondly, being more brave, we could argue that the best estimate we have of $\bar{w}$ is $w_G$ and accordingly set $\bar{w} = w_{G2}$ in equation 5.5, to obtain a self-consistent equation³ for $w_{G2}$:

$$w_{G2} = \frac{w_{G2}^2}{w_{G2}^2 + \sigma^2/N}\, w_{ML} \tag{5.7}$$

By substituting $\bar{w} = w_{G2}$, the student is not operating with the true cost function 5.3, but with a modified, self-consistent cost function depicted in Figure 1. We note that by this substitution, a potentially dangerous global minimum is created at $w_{G2} = 0$.

²Using $\xi = a + b\psi$ and $\psi \sim N(c, d^2)$ implies $\xi \sim N(a + bc, b^2 d^2)$.
³Interestingly, this equation is recovered from the Bayesian approach if we optimize $w$ and $\alpha$ simultaneously (i.e., if we do not separate inference into two distinct levels).
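For comparison with the Bayesian rule, the two generalization-error-based estimators can be sketched as follows (our names; equation 5.8, derived below, gives the closed form of the stable fixed point of 5.7):

```python
import numpy as np

def w_gen1(w_ml, sigma2, N):
    """Equation 5.6: plug w_bar = w_ml into 5.5; this variant never prunes."""
    return w_ml ** 3 / (w_ml ** 2 + sigma2 / N)

def w_gen2(w_ml, sigma2, N):
    """Stable fixed point of the self-consistent equation 5.7 (eq. 5.8);
    prunes below the noise limit w_ml^2 = 4 sigma^2 / N."""
    if w_ml == 0.0:
        return 0.0
    disc = 1.0 - 4.0 * sigma2 / (N * w_ml ** 2)
    return 0.0 if disc < 0.0 else 0.5 * w_ml * (1.0 + np.sqrt(disc))
```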
Figure 1: The estimated average excess generalization error as a function of the normalized estimator. The graphs are obtained by introducing equation 5.2 in 5.3 and setting $\bar{w} = w_G$. The estimator is scaled by $w_{ML}$, and the estimated excess error is scaled by $\sigma^2/N$. The three graphs correspond to different noise levels (dashed, low; solid, critical; dotted, high). Note that if the noise is low, a local minimum is found; otherwise pruning takes place.
We could envision equation 5.7 solved by iteration, e.g., starting from $w_{G2} = w_{ML}$. Iterating the function on the right-hand side, we see that besides the solution $w_{G2} = 0$ there may be two more fixed points (solutions), depending on the parameters. Analyzing the iteration scheme for this simple case is straightforward: one of the possible fixed points is stable, the other unstable. If it exists, the stable fixed point (corresponding to the local minimum in Fig. 1) is found by the iteration scheme, and it is given by

$$w_{G2} = \frac{w_{ML}}{2}\left(1 + \sqrt{1 - \frac{4\sigma^2}{N w_{ML}^2}}\right) \qquad \text{for } w_{ML}^2 > \frac{4\sigma^2}{N} \tag{5.8}$$

Note that the noise limits are different for the generalization error method and Bayes' method. The generalization error method is more conservative than Bayes' (if we associate conservatism with setting $w = 0$).
(Figure 2 legend: Max Likelihood; Bayes; Generalization method 1; Generalization method 2. x-axis: Normalized Teacher Weight.)
Figure 2: Average excess generalization error $\Delta E = E - \sigma^2$ as a function of teacher magnitude for the two adaptive regularization schemes used to estimate the mean of a gaussian variable. The teacher weight has been scaled by $\sqrt{\sigma^2/N}$, and the average excess error is scaled by the excess error of the maximum likelihood estimator ($\sigma^2/N$).
Its noise limit is a factor 4 larger, meaning that pruning or super-regularization will happen more often. Note also that this estimator, like the Bayesian, is biased but consistent.
6 Comparing the Regularization Schemes
As discussed above, the quantity to compare is the average (with respect to training sets) generalization error of the estimators. Since each estimator is a function of the empirical mean $w_{ML}$ only, the average involves averaging with respect to the distribution of this quantity, and we already noted that $w_{ML} \sim N(\bar{w}, \sigma^2/N)$. The generalization error of the maximum likelihood estimator itself was given in 3.3; it is independent of the teacher magnitude.
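The averaging can also be checked by simulation; the following Monte Carlo sketch (ours, not the paper's numerical method) estimates the average excess error of an estimator at a given teacher weight:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_excess_error(estimator, w_bar, sigma2=1.0, N=10, trials=200_000):
    """Estimate Delta E = <E_G> - sigma^2 by sampling w_ML ~ N(w_bar, sigma^2/N)."""
    w_ml = rng.normal(w_bar, np.sqrt(sigma2 / N), size=trials)
    w_e = estimator(w_ml, sigma2, N)
    return float(np.mean((w_e - w_bar) ** 2))

# evidence-based estimator of equation 4.9, vectorized; safe at w_ml = 0
def bayes(w_ml, sigma2, N):
    safe = np.where(w_ml == 0.0, 1.0, w_ml)
    return np.where(w_ml ** 2 > sigma2 / N, w_ml - sigma2 / (N * safe), 0.0)

for wb in (0.0, 0.5, 1.0, 3.0):
    print(wb,
          avg_excess_error(lambda w, s, n: w, wb),  # maximum likelihood
          avg_excess_error(bayes, wb))
```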
Since the regularized estimators depend only on the training set value of $w_{ML}$, the average excess generalization errors $[\Delta E(\bar{w}) = E(\bar{w}) - \sigma^2]$ are functions of $\bar{w}/\sqrt{\sigma^2/N}$:

$$\Delta E(\bar{w}) = \int dw_{ML}\, P(w_{ML})\, \left[w_E(w_{ML}) - \bar{w}\right]^2$$

where $w_E$ is any of the three biased estimators. The integral is easily evaluated numerically. In Figure 2 we picture these functions, and we note that adaptation of the regularization parameter leads to a decreased error only if the teacher is small. If the teacher is very large, the adaptive schemes lead to a negligible increase in error, as indeed they should. However, if the teacher is of intermediate size, all schemes produce seriously increased errors. The reason for this increased error is simply that the adaptive schemes are too conservative; there is a certain probability, even for a unit magnitude teacher, that a scheme will prune and hence introduce a large error. We conclude that the benefit of the adaptive schemes depends critically on the distribution of teachers, i.e., the extent to which the domain lives up to the assumptions of the regularizer. In fact, distributions of teachers can be given for which each of the estimation schemes would be preferred. We note that for a teacher distribution that has significant mass at small teachers, like the $P_0(\bar{w}) \sim 1/|\bar{w}|$ distribution corresponding to the prior of the Bayesian scheme (Buntine and Weigend 1991), the brave generalization student is superior for the present problem. This fact is most easily appreciated by noting that integrating the functions in Figure 2 with respect to $P_0(\bar{w})$ corresponds to integrating the functions on a logarithmic abscissa with a uniform measure, hence stretching the region around 0 to infinity. We reiterate that this is no big surprise, since that student precisely minimizes the generalization error under the correct (implicit) prior.

Our experience with adaptive regularization is generally positive. A detailed account of multivariate applications, where pruning is achieved by utilizing individual weight regularization, will be given elsewhere (Rasmussen and Hansen 1994). The idea of using the estimated test error as a cost function for determination of regularization parameters is quite general. We are currently pursuing this in the context of time series processing with feedforward networks, for which the corresponding generalization error estimate is well known (see, e.g., Svarer et al. 1993).

In conclusion, we have shown that the use of adaptive regularization for the simple task of estimating the mean of a gaussian variable of known variance leads to nontrivial decision processes. We have defined a competitive scheme for identification of the regularization parameter, based on minimization of the expected generalization error. We have shown that pruning can be viewed as infinite regularization, and that this is a natural consequence of adaptive regularization.
Acknowledgments

We thank the participants in the 1992 CONNECT Journal Club for discussions on Bayesian methods. We thank Anders Krogh, Jan Larsen, David MacKay, Peter Salamon, Claus Svarer, and Hans Henrik Thodberg for valuable discussions on this work. Also, we are grateful to the anonymous reviewer for constructive criticism. This research is supported by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT).

References

Akaike, H. 1969. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 21, 243-247.
Buntine, W., and Weigend, A. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.
Hansen, L. 1993. Stochastic linear learning: Exact test and training error averages. Neural Networks 6, 393-396.
Le Cun, Y., Denker, J., and Solla, S. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 598-605. Morgan Kaufmann, San Mateo, CA.
MacKay, D. 1992a. Bayesian interpolation. Neural Comp. 4, 415-447.
MacKay, D. 1992b. A practical Bayesian framework for backpropagation networks. Neural Comp. 4, 448-472.
Moody, J. 1991. Note on generalization, regularization and architecture selection in nonlinear systems. In Neural Networks for Signal Processing I, S. Kung, B. Juang, and C. Kamm, eds., pp. 1-10. IEEE, Piscataway, NJ.
Rasmussen, C., and Hansen, L. 1994. In preparation.
Solla, S. 1992. Capacity control in classifiers for pattern recognizers. In Neural Networks for Signal Processing II, S. Y. Kung et al., eds., pp. 255-266. IEEE, Piscataway, NJ.
Svarer, C., Hansen, L., and Larsen, J. 1993. On design and evaluation of tapped-delay neural network architectures. In IEEE International Conference on Neural Networks, H. R. Berenji et al., eds., pp. 46-51. IEEE, Piscataway, NJ.
Thodberg, H. 1991. Improving generalization of neural networks through pruning. Int. J. Neural Syst. 1, 317-326.

Received May 14, 1993; accepted January 24, 1994.
This article has been cited by: 2. Ana S. Lukic, Miles N. Wernick, Dimitris G. Tzikas, Xu Chen, Aristidis Likas, Nikolas P. Galatsanos, Yongyi Yang, Fuqiang Zhao, Stephen C. Strother. 2007. Bayesian Kernel Methods for Analysis of Functional Neuroimages. IEEE Transactions on Medical Imaging 26:12, 1613-1624. [CrossRef] 3. J. Li, M.T. Manry, P.L. Narasimha, C. Yu. 2006. Feature Selection Using a Piecewise Linear Network. IEEE Transactions on Neural Networks 17:5, 1101-1115. [CrossRef] 4. Y. Xu, K.-W. Kwok-WoWong, C.-S. Leung. 2006. Generalized RLS Approach to the Training of Neural Networks. IEEE Transactions on Neural Networks 17:1, 19-34. [CrossRef] 5. Miroslaw Galicki, Lutz Leistritz, Ernst Bernhard Zwick, Herbert Witte. 2004. Improving Generalization Capabilities of Dynamic Neural NetworksImproving Generalization Capabilities of Dynamic Neural Networks. Neural Computation 16:6, 1253-1282. [Abstract] [PDF] [PDF Plus] 6. Ping Guo, M.R. Lyu, C.L.P. Chen. 2003. Regularization parameter estimation for feedforward neural networks. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 33:1, 35-44. [CrossRef] 7. A.D. Back, T.P. Trappenberg. 2001. Selecting inputs for modeling using normalized higher order statistics and independent component analysis. IEEE Transactions on Neural Networks 12:3, 612-617. [CrossRef] 8. Chi-Sing Leung, Ah-Chung Tsoi, Lai Wan Chan. 2001. Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Transactions on Neural Networks 12:6, 1314-1332. [CrossRef]
Communicated by Vera Ktrkovi
Approximation Capability of Layered Neural Networks with Sigmoid Units on Two Layers Yoshifusa Ito Toyohashi University of Technology, Toyohashi, Japan
Using only an elementary constructive method, we prove the universal approximation capability of three-layered feedforward neural networks that have sigmoid units on two layers. We regard the Heaviside function as a special case of sigmoid function and measure accuracy of approximation in either the supremum norm or in the Lp-norm. Given a continuous function defined on a unit hypercube and the required accuracy of approximation, we can estimate the numbers of necessary units on the respective sigmoid unit layers. In the case where the sigmoid function is the Heaviside function, our result improves the estimation of Khkovh (1992). If the accuracy of approximation is measured in the LP-norm, our estimation also improves that of Kbrkovi (1992), even when the sigmoid function is not the Heaviside function.
1 Introduction
It is well known that three-layered feedforward neural networks can uniformly approximate continuous functions if they have sigmoid units on the hidden layer. Several papers discuss this topic in detail: Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik (1991), Hornik et al. (1989, 1990), Ito (1991a,b, 1992, 1993, 1994), and Stinchcombe and White (1989, 1990). In these papers, the approximation is in the supremum norm or Lp-norm or others. The universal approximation capability of feedforward neural networks of another type, which have two sigmoid unit layers, has also been investigated by several authors: Hecht-Nielsen (19871, Funahashi (1989), Girosi and Poggio (19891, and Kdrkovd (1991, 1992). The theory on networks of this type has been based on Kolmogorov’s representation theorem of continuous functions (Kolmogorov 1957). Although the inner functions of Kolmogorov’s representation formula are too complicated to be regarded as input-output functions of neural units (Girosi and Poggio 1989), they can be well approximated by a finite number of sigmoid functions in principle as they are continuous functions on finite intervals Neural Computation 6,1233-1243 (1994) @ 1994 Massachusetts Institute of Technology
1234
Yoshifusa Ito
(Funahashi 1989; Kdrkova 1991, 1992). Approximation formulas they have obtained are of the form:
where x, is the jth component of the input x E Rd. If, following tradition, their networks are regarded as four-layered, then they must have two hidden sigmoid unit layers. However, a formula of this form can be realized by a three-layered neural network with sigmoid units on the first and second layers, if several first layer units are provided for each component x,. Kdrkova (1992) has recently estimated the numbers of units on the respective sigmoid unit layers of such networks. Her paper is the first paper based on Kolmogorov’s theorem which explicitly estimates the number of necessary sigmoid units of four-layered networks. In the case of an ordinary three-layered neural network, Ito (1993) has estimated the number of nonlinear units under a condition that the activation function can be differentiable several times, although the main purpose of the paper is to extend the approximation capability to derivatives. The purpose of this paper is to remark that both an approximation formula of the form (1.1) and a bound on the number of sigmoid units can be obtained also by an elementary method. Our method is simple and differs from the traditional one used by Hecht-Nielsen (19871, Funahashi (19891, and Kirkova (1991, 1992). We do not use Kolmogorov’s representation theorem. We measure the accuracy of approximation in both the supremum norm and in the LP-norm, which was considered important by Kdrkova (1992) and used by many authors. From the formula we obtain, we can estimate the numbers of sigmoid units on the respective sigmoid unit layers. We regard the Heaviside function as a special type of sigmoid function. Our estimation of the number is tighter than those of KdrkovP when the sigmoid function is the Heaviside function or when the accuracy of approximation is measured by the LP-norm. Moreover, our simple constructive method naturally leads to a simple concrete learning rule, which we shall describe in Section 3. 2 Main Theorem
We call a function h sigmoid if it is monotone increasing, h ( t ) + 1 as t +.cc and h ( t ) + 0 as t + -cc. The Heaviside function is a sigmoid function in our sense. We denote by H the right continuous Heaviside function. Let E = [0,1]and Ed = [O,lld. The latter is the unit hypercube in the &dimensional Euclidean space Rd. Further, let Z i = (0,. . . , m for a positive integer m. For a point k E Zfn, we write k = (kl,. . . kd), x ( ~= ) ( l / m ) ( k l , . . . , k d ) and Ef = x ( ~+) [0, l / n ~ ) ~ .
Neural Networks with Sigmoid Units
1235
Let f be a function defined on Ed and set = sup{If(y) -f(x)llx?Y E
Ed, (Y - 4 E 10, 6Id)
(2.1)
This quantity is called the modulus of continuity by K ~ r k o v (1992). i We first treat the case where the sigmoid function is the Heaviside function. In the proof of the theorem below, we use functions IF, k = (kl,. . . ,kd) E Zd,, defined by $'(x)
:[C H (
=H
X;
- -k;
1 + :] -d
-
(2.2)
This function is the indicator function of a set IT$l [( 1/m)k,,0 0 ) .
Theorem 1. For any continuousfunction f defined on Edand any positive integer m, there are constants a k , k E Zd,, for which (2.3)
satisfies
on Ed
(2.4)
k , t E Zk
(2.5)
Proof. Set
These equations can be regarcd as simultaneous equations wit.. respect to aks with coefficients CkP = I kH ( x
(o),
k . f E z:
(2.6)
We can show that ( C k P ) is a triangle matrix when an appropriate order is introduced to Zd,. Define a mapping r: k + Cf=lk,m'-l. If r ( k ' ) < I'(k") for k', k" E Zd,, we write k' < Id'. Then, the "<" is a total order in Zd,. It is obvious that Cke = 1 for k = e and Cke = 0 for k > e. Put the elements of the matrix (ckt) so that its columns and rows are in orders of the respective suffixes. Then, it is a triangle matrix, where the diagonal elements are 1 and the lower off-diagonal elements are 0. Hence, the determinant ICkPl is equal to 1, which implies that there are constants aks for which the equations 2.5 hold. For the aks, define f by 2.3. Then, by 2.1, 2.2, and 2.5, 2.4 holds. 0 A sigmoid function can approximate the Heaviside function in a sense when it is sufficiently scaled and appropriately shifted. For simplicity, we represent both scaling and shifting constants simultaneously by a single number c in the lemma below.
Yoshifusa Ito
1236
Lemma 1. Let h be a sigmoid function. Then,
H(t) = c-m lim h [ ( c+ ~ )+ c] ~ t
(2.7)
Moreover, for any q > 0, the convergence is uniform on (-m, -171
U [0,m).
Proof. We have that min{h[(c + 1)2t+ c] 1 t 2 0) = h(c), max{h[(c 1)*t c] I t 2 -17) = h[-(c
+
+
Since lim,,,h(c) = 1 and lim,,,h[-(c the lemma holds.
+ 112q + c] + 1)217 + c] = 0, the statement of 0
Note that when c = 0 the sigmoid function is neither scaled nor shifted. We write hc(t) = h[(c 1)2t + c]. Since the convergence h' -+ H is not uniform, we cannot replace the Heaviside function H in Theorem 1 by hC without relaxing the bound q(l/rn)in the inequality 2.4. To show this, we introduce a partial order in Zd,:we write j 5 k if j , 5 k, for all i = 1,.. . ,d, and write j + k if j 5 k and j , < k, for some i. This partial order is related to the total order defined in the proof of Theorem 1. If j + k, then j < k. If j < k, then j ?fk. We first observe how the bound in Theorem 1 is obtained when the sigmoid function is the Heaviside function. For k = (k,,. . . ,kd) E Zd,-,, set k' = (kl, . . . k,+1,. . . ,kd) and kA = (kl+I, . . . kd +I). The components of k' other than ith are the same as those of k. Set Z1 = fi E Z i I j + kA} and 22 = Z i - Z1. Then, k' E Z1. Take a function f such that ~ , ( l / r n ) = E . For a while, we denote by f H the function defined by 2.3 and define f: and f y by
+
Tf'(x)
=
c
akIF(l(x), k~2,
i = 1,2
f H ( x ) for x = x ( ~ ) ,x('), i = 1,.. . ,d. By 2.1, I f H ( x ( k ' ) )-A f H ( ~ ( k ) )5l E. These differences can be piled up at x ( ~ " )so that If ( ~ ( ~ " 1-) -A -A f (x("))l 5 (d - 1 ) for ~ all i. On the other hand, we have that If ( x ( ~ " ) ) -A f ( x ( ~ ' ) )5[ E. Hence, we have that Then, f y ( x )
=
Thus, the differenceat ~ ( ~ "is1 up to dc.. This difference is subtracted away by so that +fY(xCkA))= f H ( ~ ( k A ) ) . However, among the summands of one which compensates for the difference is the term u ~ A I FHence, ~. ( u ~ A5 ( d E . When the sigmoid function is the Heaviside function, the difference is exactly compensated for.
~Y(x(~")) ~Y(x(~")) fy,
Neural Networks with Sigmoid Units
1237
Replace the Heaviside function H by a scaled version h' of a nonHeaviside sigmoid function h. Then, the function 1: is replaced by 1: defined by
As before, we define fh' using I:(.). Since 1: are not exactly step functions, the value of f K can be close to f h r ( ~ ( k A ) ) at sofne point of Ef.Similarly to the above case, we obtain that, for any 17 > 0, -fhc I < dE+ q for sufficiently large c, applying Lemma 2. The difference lf:(x) -f'(x)l is also bounded by d E 7 on E$ if c is sufficiently large. However, this bound of the difference is the maximum. When the activation function is the Heaviside function, the bound is E . Hence, we obtain that
[~Y(X(~~))
+
where y, 1 5 y 5 d, is a coefficient which depends on the sigmoid function h, the dimension d and the scaling constant c, but not on k. It is meaningful to use the coefficient y because its value can be obtained if the parameters are given and it makes the bound tighter. It is obvious that y q ( l / m ) v is a bound for the difference irrespective of the values off ( x ( ~ )f) (, x ( ~ ~and ) ) f (x'~')).The problem of the difference of the values does not take place along the surfaces {x E Ed I x; = l}, i = 1,.. . , d . Hence, the inequality above holds on all cubicles E;. Thus, we have that
+
(2.8) as f ; ( x )
= f H ( x ) on
Et. Using 2.8, we prove the theorem below.
Theorem 2. Let h be a sigmoid function. Then, for any continuous function f defined on Ed, any positive number vand any positive integer m, therearea scaling constant c, a coefficient y,0 5 y 5 1 , and constants a k , k E Zd,, for which (2.9)
satisfies
If(x)-f(x)l I ( Y + ~ ) E I
(2.10)
The coefficient y depends on the sigmoid function h, the dimension d , and the scaling constant c. If the sigmoid function is the Heaviside function H and c = 0, then y = 0 and 17 can be zero.
Yoshifusa Ito
1238
(i) +v
5 (-y+l)q
on Ed
(2.11)
Hence, the theorem is obvious.
11 f l i p
We define the LP-norm
0
of a function f defined on Ed by
In Theorem 3 below, we treat the approximation in the Lp-norm. Set
i=l r=l
m
rn
rn
Then, the volume of B, is less than dma.
Theorem 3. Let h be a sigrnoid function. Then, for any continuous function f defined on Ed, any positive integer m and any E > 0, there are a scaling constant c and coefficients ak, k E Z;, for which (2.12)
satisfies
llf-fllPI+J
(2.13)
+E
Proof. In this proof, we denote again by f" the function f defined by 2.3. Then, we have
I1 f - ThPIIP I11 f
-fHllP
+ 117"
-
fh' IIP
(2.14)
By Theorem 1, the first term of the right-hand side of 2.14 can be bounded straightforwardly:
In regard to the second term, we have
Neural Networks with Sigmoid Units
1239
By Lemma 1, we have IIF(x) - IF(x)l
I 17
on Ed - B,
and @(x) - IF(x)l 5 1
on B,
hold for a sufficiently large c. Hence, the quantity 2.15 is bounded:
/IfH - f h p l l ~
< dmdM($
+ IB,I)
< dmdM($
+ dm71)
(2.16)
for a large c, where M = maxkEzf, lukl. One may notice that the number of the coefficients ak is md. Since the right-hand side of 2.16 converges to 0 0 as 7 + 0, we obtain the theorem. Remark. By the proofs of Theorems 2 and 3, it is obvious that useful in the cases of these theorems are the coefficients aks obtained under an assumption that the sigmoid function is the Heaviside function and the accuracy of approximation is measured in the supremum norm. Thus, in the case where the accuracy of approximation is measured by Lp-norm, it is almost retained even when the Heaviside function is replaced by a non-Heaviside sigmoid function. Kbrkov6 (1992) emphasized the importance of measuring accuracy of approximation in Lp-norm with respect to specified measures. Although we have used the Lebesgue measure only in Theorem 3, it is obvious that the result can be extended to other measures. 3 Implementation of Approximations
Our simple constructive proof of the theorems can be regarded as an algorithm for implementing the approximation by a neural network with two sigmoid unit layers. Although Kbrkovd’s result (Kbrkova, 1992) may be applied, our result can be applied more conveniently to the approximation by neural networks because of its simplicity. In this section, the implementation of approximation by learning will also be discussed. In our networks, the connection weights between the first and second layers are fixed simply at 1. If the function q(6) of 6 defined by 2.1 is available, we can determine the integer m depending on the accuracy we need. By the remark following Theorem 4, the coefficients aks are useful determined under the assumption that the sigmoid function is the Heaviside function and the topology is in the supremum norm. Hence, we formulate the method of implementing approximation only for the case where the sigmoid function is the Heaviside function. If necessary, the Heaviside function can be replaced by a scaled sigmoid function later. Case 1: Case of simultaneously available data. If all the data C ~ ( X ( ~I ) ) can be deterE Zk} are available simultaneously, the coefficients mined at once by solving the simultaneous equations 2.5 as stated in
k
1240
Yoshifusa Ito
Section 2. Thus, implementation of the approximation is straightforward. Even in the case where the data arrive as a sequence, the determination of coefficients uks can be performed at once if the data can be memorized until all the data are present out.
Case 2 Case without memory of data. There is an alternative method of determining the coefficients uks. Let {k"}$l be a sequence of points of Zd,where k"1 >I k"* for any nl < n2. We call the sequence a minimal sequence of ks. By the condition, a minimal sequence contains all the elements of Zd,. The order "<" decided in the proof of Theorem 1 is an example of order that decides a minimal sequence. Recalling the function ZF, divide the sum 2.3 into three parts:
Then, irrespective of the coefficients u,s may be for j 2. k, the last term on the right-hand side of 3.1 is zero at x ( ~ ) zjfka,ljH(x(k)) : = 0. Hence, if the data f ( ~ ( ~arrive ) ) in the state where u,s have been already correctly decided for all j 4 k, ak can be correctly decided by the equation
because the left-hand side of 3.1 is equal to Ejsk u , ~ / ( x ( ~ ) Hence, ). it is obvious that if the data arrive as a minimal sequence all the coefficients a,s can be correctly decided after finishing adjustment of all coefficients uks according to 3.2. Even in the case where the data arrive as a nonminimal sequence, the coefficients are finally correctly adjusted if the sequence contains a minimal subsequence. This is also obvious by the discussion above. This process may be called "learning."
Case 3: Case of random sample points. Let
Define bk by
Let p be a probability measure on Ed such that p(E$) # 0 for all k E Zd,. It is obvious that even if the coefficientsULS are replaced by bks, respectively, the inequality 2.4 holds. Let cf(x")}glbe a sequence of data where the points x" are taken randomly according to the probability measure p. We describe a useful learning rule for such a random sequence of data.
Neural Networks with Sigmoid Units
1241
When the nth datum f ( x " ) arrives, decide the coefficient a; in a way that
C
a,Zr(xn)+ a#(x") = j ( x " )
(3.3)
j&,,,j#k
Then, define the coefficient a k by
where n N = n and ne, l = 1,.. . ,N - 1, is the time in the past at which the sampling point x''t belonged to E:. In other words, the coefficients are decided as the averages of the past data. It is not difficult to prove that aks converge to bks, respectively, This learning rule stands up to noise. If ,LL is a measure supported by (l/m)Zd,, the sequence consists of the data f ( d k ) ) sk, E Z",. 4 Discussion
Hecht-Nielsen (1987) regarded the inner and outer functions of Kolmogorov's representation formula as activation functions of a layered neural network. Both Funahashi (1989) and Kbrkova (1991, 1992) used Kolmogorov's representation theorem in order to prove that a formula of the form 1.1 can approximate continuous functions. However, we have obtained the similar result by an elementary method without using the theorem. Our idea for obtaining the approximation formula 2.3 is simple. Let IH = I$,,,.,,). This function is the indicator function of the first orthant. We can construct indicator functions of half open d-dimensional intervals E$ as combinations of shifts of lH.Such indicator functions can be used as local basis functions; we can approximate any continuous function by a combination of the indicator functions. The combination has an expression of the form of 2.3 where the numbers of the inner and outer Heaviside functions are, respectively, dm and md. When Kolmogorov's representation theorem is used for deriving the approximation formula, we cannot explicitly describe the formula without having the inner and outer functions of Kolmogorov's formula, which may be difficult to obtain. In contrast, our method avoids this difficulty because the parameters in our approximation formula 2.3 can be obtained explicitly. This leads naturally to the simple method of implementing the approximation or the simple learning rule described in Section 3. Our estimation of numbers of units on the first and second sigmoid unit layers is considerably smaller than Kbrkovh's. Our estimations of numbers of units on the first and second sigmoid layers are dm and md, respectively. These numbers correspond, respectively, to the numbers dm(m + 1) and m2(m+ l)dobtained by Kfirkovh [Theorem 2 of Kbrkova (1992)l. If a high accuracy of approximation is needed, m must be a large
1242
Yoshifusa Ito
number, which makes Kbrkovd's estimations notably larger than ours. However, in Theorem 2 the accuracy of our approximation is worse than hers in higher dimensional cases. Although Kdrkovi (1992) mentioned the importance of the LP-norm, she did not use the norm. This is because the supremum norm is strongest in the case of approximation on compact sets. However, the bound of the approximation error depends on the norm in our method. By this reason, we used the supremum norm as well as the Lp-norm, which has been used by many authors mentioned in Section 1. We may call the functions IH and Ih' orthant basis functions for convenience. Since an orthant basis function is not a local basis function, it is convenient for covering regions where the data are thin. By simple use of an orthant basis function, for example, we can construct the indicator function of any d-dimensional interval, which can be used as a local basis function as mentioned above. An indicator function of a large interval can be used in regions where the data are thin. Moreover, we can even thin out the data by using an orthant basis function without significantly affecting the accuracy of approximation. Although this avoids the use of an overwhelming number of units in the case of higher dimensions, we do not go into further detail as the purpose of this paper is different.
Acknowledgments The research was supported in part by a grant (04246221) from the ministry of education and culture, Japan, and was done at the Department of Statistics and Applied Probability, UC Santa Barbara, in July 1992. The author owes the last paragraph of Section 4 to a talk with Professor Rumelhart.
References Carroll, B. W., and Dickinson, B. D. 1989. Construction of neural nets using the Radon transform. '89 IJCNN Proc. I, 607-611. Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signal System 2, 303-314. Funahashi, K. 1989. On the approximate realization of continuous mapping by neural networks. Neural Networks 2, 183-192. Girosi, F., and Poggio, T. 1989. Representation properties of network: Kolmogorov's theorem is irrelevant. Neural Cornp. 1, 465-469. Hecht-Nielsen, R. 1987. Kolmogorov's mapping neural network existence theorem. I€€€ First Intl. Conf. Neural Networks 111, 11-13. Hecht-Nielsen, R. 1989. Theory of the back propagation neural network. 89 IJCNN Proc. I, 593-605. Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Neural Networks with Sigmoid Units
1243
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366. Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560. Ito, Y. 1991a. Representation of functions by superposition of a step or sigmoid function and their applications to neural network theory. Neural Networks 4, 385-394. Ito, Y. 1991b. Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks 4,817-826. Ito, Y. 1992. Approximation of continuous functions of Rd by linear combinations of shifted rotations of a sigmoid function with and without scaling. Neural Networks 5, 105-115. Ito, Y. 1993. Approximations of differentiable functions and their derivatives on compact set by neural networks. Math. Scient. 18, 11-19. Ito, Y. 1994. Differentiable approximation by means of the Radon transformation and its applications to neural networks. J. Comp. Appf. Math., in press. Kolmogorov, A. N. 1957. On the representations of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Dokf.Akad. Nauk USSR 114(5), 953-956. Kbrkovi, V. 1991. Kolmogorov’s theorem is relevant. Neural Comp. 3,617-622. Kbrkovi, V. 1992. Kolmogorov’s theorem and multilayer neural networks. Neural Networks 5, 501-506. Stinchcornbe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. ’89 IJCNN Proc. I, 613-617. Stinchcornbe, M., and White, H. 1990. Approximating and learning unknown mapping using multilayer feedforward networks with bounded weights. ’90 IJCNN Proc. 111, 7-16.
Received September 10, 1992; accepted January 19, 1994.
This article has been cited by: 2. G. Liang, K. Chandrashekhara. 2006. Cure kinetics and rheology characterization of soy-based epoxy resin system. Journal of Applied Polymer Science 102:4, 3168-3180. [CrossRef] 3. Yoshifusa Ito . 2003. Activation Functions Defined on Higher-Dimensional Spaces for Approximation on Compact Sets with and without ScalingActivation Functions Defined on Higher-Dimensional Spaces for Approximation on Compact Sets with and without Scaling. Neural Computation 15:9, 2199-2226. [Abstract] [PDF] [PDF Plus] 4. Allan Pinkus. 1999. Approximation theory of the MLP model in neural networks. Acta Numerica 8, 143. [CrossRef] 5. Guang-Bin Huang, H.A. Babri. 1998. Comments on "Approximation capability in C(R/sup n/) by multilayer feedforward networks and related problems". IEEE Transactions on Neural Networks 9:4, 714-715. [CrossRef] 6. Yoshifusa Ito. 1996. Nonlinearity creates linear independence. Advances in Computational Mathematics 5:1, 189-203. [CrossRef]
Communicated by Halbert White
Estimation of Network Parameters in Semiparametric Stochastic Perceptron Motoaki Kawanabe Shun-ichi Amari Department of Mathematical Engineering and Information Physics, University of Tokyo, Bunkyo-ku, Tokyo 213, Japan
It was reported (Kabashima and Shinomoto 1992)that estimators of a binary decision boundary show asymptotically strange behaviors when the probability model is ill-posed or semiparametric. We give a rigorous analysis of this phenomenon in a stochastic perceptron by using the estimating function method. A stochastic perceptron consists of a neuron that is excited depending on the weighted sum of inputs but its probability distribution form is unknown here. It is shown that there exists no fi-consistent estimator of the threshold value h, that is, no estimator h that converges to h in the order of l / f i as the number n of observations increases. Therefore, the accuracy of estimation is much worse in this semiparametric case with an unspecified probability function than in the ordinary case. On the other hand, it is shown that there is a fi-consistent estimator w of the synaptic weight vector. These results elucidate strange behaviors of learning curves in a semiparametric statistical model. 1 Introduction Learning of neural networks, especially of stochastic networks, can be regarded from the statistical point of view as sequential estimation of network parameters from randomly chosen examples. Neural networks give good nonlinear models for analyzing nonlinear multivariate statistical phenomena and conversely statistical analysis gives a good insight on the learning behaviors of neural networks. The present paper treats the simple stochastic perceptron (stochastic neuron) to study strange asymptotic behaviors of estimators of network parameters (synaptic weights and thresholds) in the semiparametric situation to be explained later, cf. Kabashima and Shinomoto (1992). A simple stochastic perceptron (a stochastic neuron) classifies rn-dimensional input signals x into two categories C+ and C- stochastically with probabilities depending on the inputs. More definitely, let w be the synaptic weight vector and let h be the threshold of a perceptron, where Ncrtral Cnmpututioii 6, 1244-1261 (1994) @ 1994 Massachusetts Institute of Technology
Network Parameters in Stochastic Perceptron
1245
we assume IwI = 1. Then, the signal space is divided into two parts by the separating hyperplane of the perceptron
H:
w.x-h=O
(1.1)
The signed distance of a signal x from H is given by d(x) = w . x - h
(1.2)
where d(x) > 0 when x is in the positive side of H and d(x) I 0 otherwise. A stochastic perceptron emits an output z = 1 or -1 with a probability depending on the distance d ( x ) . Furthermore, the conditional distribution (the input-output relation) can be written Prob [z = 1 I x] = 0.5 + $ ( w . x - h ) Prob [z = -1 I x] = 0.5 - $(w. x -h) or Prob [z I x]
= 0.5 +z$(w . x - h )
(1-5)
where the function $ ( u ) is assumed to be a monotonically increasing differentiable function with
4(0) = 0 supI4(u)l < 0.5 Id
A specific sigmoidal function such as 1 #J(u)= - tanhu 2
(1.8)
is usually assumed for convenience, but the actual function 4 is unknown in the real biological situation. For this reason, we impose only weak conditions such as monotonicity and smoothness on the function 4, and consider 4 also as an unknown functional parameter besides w and h. We also assume that the true function #* satisfies these weak conditions. The training examples are assumed to be produced by the following manner. Each input vector x; is generated subject to a known or unknown probability density q(x,) independently. For given input x;, an output z; is generated by the true perceptron specified by (w*,h,, $*). From a given training set
of n independent examples from this perceptron, we want to estimate w, and h, under the condition that +* is unknown. The present paper proves that the asymptotic behaviors of estimators w? are quite different in the semiparametric situation. The same strange behaviors emerge also in learning w and h from examples.
Motoaki Kawanabe and Shun-ichi Amari
1246
When & is known, it is an ordinary statistical problem to estimate w and h from the training set D,,. When n is large, the maximum likelihood estimator is asymptotically best, and its expected squared error is easily calculated from the Fisher information matrix F of the underlying statistical model. The estimation error decreases in proportion inversely to the square root of the number n of examples, (1.10)
h-h,
=
OP(-&)
(1.11)
We call such an estimator an fi-consistent estimator. This shows that the squared error decrease in the order of l / n , as is the case with learning curves of generalization error (Haussler et al. 1988; Amari and Murata 1993). When q5* is unknown, the situation changes drastically. Because, in addition to the parameters w and h of interest, the statistical model includes the parameter 4, which has infinite- or function-degrees of freedom. This additional parameter is called a nuisance parameter, because we do not have any interest in estimating it. Such a model is called a semiparametric statistical model and statisticians have been searching for effective methods of inference in such a model (see, for example, Bickel et al. 1993; Amari and Kawanabe 1993; etc.). We show that there is a severe difficulty in estimating the threshold h rather than in estimating the weights w. To show this, we study the following models separately: 1. Threshold model (dose-response model): In this case, an input x is a scalar and w = 1, so that d ( x ) = x - h. Here, the threshold h is the only parameter to be estimated, and the probability model is Prob{z I x }
= 0.5
+ z q5(x - h )
(1.12)
This model is also called the (semiparametric) dose-response model. In this case, an amount x of dose is applied to a subject and we observe success (2 = 1) and failure ( z = -1) of the result of the treatment. The probability of the success ( z = 1) increases with the amount x of the dose as in equation 1.12, and we want to estimate the amount x = h of the dose where the success probability balances with the failure probability. Kabashima and Shinomoto (1992) studied this model, and found that the maximum likelihood estimator does not work in this case. They proposed an estimator whose squared error converges to 0 with the order of n-2/3. In other words, the estimator is consistent with the stochastic order n-1/3. '
(1.13)
Network Parameters in StochasticPerceptron
1247
This convergence is slower than 1 fi of the ordinary one. Their estimator is known as the maximum score estimator (see Manski 1975; Kim and Pollard 1990). We prove in the present paper that no estimators which converge to the true value with the stochastic order 1 fi exist. This shows that the learning curve has a slower characteristic in this case. We then prove that there exists an estimator better than the maximum score estimator which converges in order of n-p/(*p+') where p is an arbitrary integer. This result is also known by statisticians (Nawata 19891, but the estimator that we newly propose here is much simpler and easier to construct. 2. Synaptic weights model with h = 0 (orientation detection): Here, x is m-dimensional (m 2 2), w is to be estimated, and h = 0. In this case, the separating hyperplane passes through the origin, and the probability is determined by w . x, Prob{z 1 x}
= 0.5
+ z + ( w .x)
(1.14)
We show that there exists an estimator w converging to the true value w, in stochastic order 1 fi when q ( x ) is known. We also construct the estimator explicitly. 3. The general case: Combining the above two results, we propose an estimator w and h applicable to the general case. Here w is fi-consistent but h is np/(2pf1)-consistent. Our method is based on the geometric consideration on the estimating functions (Amari and Kumon 1988; Amari and Kawanabe 1993). An estimating function gives a consistent estimator of order fi when it exists, and the estimator can be derived easily by solving the estimating equation. However, we show that no estimating functions exist for the threshold model (dose-response model). We instead propose an asymptotic estimating function, by using the Parzen window for the nonparametric curve estimation (Parzen 1962). If we arrange the window appropriately to balance the bias and the variance term (cf. Geman et al. 19921, this gives a consistent estimator of order r ~ p / ( ~ P + ' ) . 2 Estimating Function
We briefly explain estimating functions. Given a statistical model { p ( x , 0)) where the probability distribution p ( x ,8) of random variable x is specified by a scalar parameter 0, a function g(x,0) is said to be an estimating function when it satisfies (2.1)
(2.2) where Es is the expectation with respect to p ( x , 8) and 80= d/df?. Given II independent observations xl, . . . ,x,,, C g ( x ; ,0) is the empirical substitute
Motoaki Kawanabe and Shun-ichi Amari
1248
to the expectation Es [g(x,O ) ] , so that it is plausible that n
Cg(xi,8) = 0
(2.3)
i=l
gives a good estimate 8. The score function u ( x , O ) = dslogp(x,O) (that is the derivative of the log likelihood) satisfies the above conditions 2.1, 2.2, and the estimating function U(X,
0) = aslogp(x, 6)
(2.4)
gives the maximum likelihood estimator. The idea of the estimating function was introduced by Godambe (1960) as a generalization of the maximum likelihood method. It is known that the estimating equation 2.3 gives an asymptotically normally distributed fi-consistent estimator 8. The estimating function method can be applicable to the semiparametric model {p(x,O,4)} where the probability distribution is specified by a parameter 0 that is to be estimated and also by a nuisance parameter 4 of infinite dimensions in which we do not have any interest. It is in general very difficult to estimate 4 from a finite number of observations. However, if there exists a function g(x,O ) , not depending on the nuisance parameter 4,such that Ee,b [g(x, 011 = 0
(2.5)
Ee,+ [aeg(xlO)l # 0
(2.6)
~ the expectation with respect to hold for any 4, where E B , denotes p(x, O,4), we can avoid the tedious procedure of estimating 4 and a good estimator is obtained by the simple estimating equation (2.7) Moreover, the estimating function method directly leads to a learning procedure. The stochastic approximation method suggests the following learning algorithm,
8,)
e n + , = e n - c n g(xn+ll
(2.8)
en+,
where 8, is the estimator obtained from n previous data XI, . . . ,x,, is the new estimator to be obtained from 8, and the new data x ~ +and ~, c,, is a constant satisfying
(2.9) It is possible to study the accuracy of learning in a similar way as in the present statistical analysis (see Kabashima and Shinomoto 1994).
Network Parameters in Stochastic Perceptron
1249
However, it is in general not easy to find an estimating function in a semiparametric case. The score function does not in general satisfy the condition 2.5 and the maximum likelihood estimator is not necessarily unbiased. It is even not certain if an estimating function ever exists or not. Amari and Kumon (1988) analyzed this problem by generalizing the dual information geometry (Amari 19851, and gave a definite answer to this problem. Amari and Kawanabe (1993) extended the results to be applicable to general semiparametric models. The result becomes simpler if the probability distribution is linear in 4 as it is in the present case. In this case, the projected score or the effective score denoted by u E ( x ,8,4) gives an estimating function for any q5 even if 4 does not coincide with the true &. We will explain about the projected , and related results in the appendix. score u E ( x 8,+) 3 Dose-Response Model
We first show that no estimating functions exist for estimating h. We assume that the signal x is subject to an unknown probability density q ( x ) that is analytic around h and q(h) > 0. Such a q ( x ) is said to be regular. Then, the joint probability distribution of x and z is given by
p ( x , z ; h ,4) = q ( x ) (0.5 + 2 4b - A))
(3.1)
where the nuisance function 4(s) is monotonically increasing, differentiable, and analytic around the origin, and it satisfies (3.2) (3.3) (3.4) Let
(9
be the class of such nuisance functions d. The score function is
u ( x , z ; h ,4) =
-2 4’(x - h ) 0.5 z $ ( x - h )
+
(3.5)
On the other hand, a change in $(s) is written as
4r(s)= 4(s) + t E ( s )
(3.6)
where perturbation [(s) is an arbitrary smooth function satisfying E(0) = 0
because of
(3.7)
Motoaki Kawanabe and Shun-ichi Amari
1250
The score in the direction € is given by (3.9) Because of $'(O) > 0, the score u is not represented as a linear combination of v's. However, the u is proved to be included in TN,which is the closure of the set spanned by v's. This leads to the following result (see Theorem 3 in the appendix). Theorem 1. In the semiparametric dose-response model with the class Q, of nuisance functions and q ( x ) known, there exist no estimating functions nor any fi-consisten t est imators. Remark. Consider the case that the class @ of the nuisance functions is limited further to the subclass @odd consisting of the functions $(-s)
= -4b)
(3.10)
in an interval [-sO,so]. If 9 ( x ) is known, there exist estimating functions of the form g ( x ,z , h ) = z f ( x - h ) / 9 ( x ) where f ( s ) is an even function in the interval [-SO, S O ] , and is otherwise 0. So, there exists a fi-consistent estimator. But in the case that the density 9 ( x ) is unknown, there exist no estimating functions nor any fi-consistent estimators again even if the additional condition 3.10 is assumed. Even though no fi-estimator exists, we can obtain a consistent estimator with slower convergence. To obtain this, we go back to the original idea of the estimating equation, n
C f ( x i , z i ; h )= 0
(3.11)
i=l
and analyze the behavior of b carefully. When the solution equation is close to the true value h,, we have by expansion I1
of this
11
C f (xi.2;; h , ) + Cahf(xi,zi; h,) (b - h,) + 0 (lh - h, 12) i=l
=0
(3.12)
i=l
where a,, = d/dh. By neglecting the term of
lh - h,I2, we have (3.13)
By the law of large numbers, the denominator converges to nu, where
a=E[&f(x.z;h,)]
(3.14)
Let us put b = E V(X, Z;h,)]
(3.15)
Network Parameters in Stochastic Perceptron
1251
which is 0 when f ( x , z; h ) is an estimating function. It is not 0 because no estimating functions exist in the present case. We put
v2 = v [f(x.z;h,)]
(3.16)
where V denotes the variance. When b is small, the central limit theorem shows that the numerator is asymptotically normally distributed random variable, which can be approximated by
nb + five
(3.17)
where e is the standard normal random variable subject to N(0 ,l). This simple but rough analysis shows that 'z - h, is distributed approximately as (3.18)
and the expected squared error is
E [('z
v2
- h,)'] = -
nu2
+ a*b2
-
(3.19)
For large n, the first variance term goes to 0 but the second bias term remains finite. Therefore, we cannot obtain a consistent estimator. In order to minimize the squared error, we need to minimize both the bias and variance terms. To this end, we consider the following window function (3.20)
where w is a smooth rapidly decreasing function satisfying the normalization condition
1
W ( S )ds = 1
(3.21)
When r is small, the function rP1w{( x - h ) / r } is almost 0 outside'a small neighborhood of x = h. On the other hand, the probability of t = 1 and z = -1 is fifty-fifty at x = h, whatever the true response curve 4* is, and is almost so in a small neighborhood of x = h,. Hence, by choosing the above fT, the bias term b becomes small. However, the variance term v becomes large, because this estimator takes only those data in a neighborhood at x = 'z into account and discards all the other data outside the small window. This is the dilemma explained in Geman et al. (1992). The problem is how to compromise the bias term and the variance term by choosing a good window function. It is also important how small r we choose depending on the number n of observations.
Motoaki Kawanabe and Shun-ichi Amari
1252
When a window function w ( s ) satisfies
k = l , ...,p- 1
J w ( s ) s k d s = 0, Jw(s)sPds
(# 0)
b
=
(3.22) (3.23)
it is called a pth-order function. Window functions of the above type were introduced and applied to the nonparametric curve and density estimation by Parzen (1962). He also analyzed the order of consistency of the estimators therefrom. About the consistency of an estimator of h, we can derive the following, which is the same as his result. Theorem 2. Under thesemiparametricsetting with nuisancefunction q5 E @and an unknown regular distribution q ( x ) ,there exists an estimator h that copverges to the true value in the order ofn-P/(*p+')for any integer p > 0. This estimator is given by using a pth-order window function. Proof. We first calculate the necessary terms. 1. Bias term:
(3.24) =
2
J
w
(+) d,(x - h,)q(x)dx
(3.25)
Now expanding 4, as (3.26)
and q as m
(3.27)
then the bias can be written (3.28)
where (3.29)
Network Parameters in Stochastic Perceptron
1253
By using a pth-order window function, then the bias is
b,
+ 0 (r”’)
= 2 b PprP
(3.30)
converging to 0 in the order of rP when
T
is small.
2. Variance term: (3.31) where u2 =
/ {w(s)}~ ds
(3.32)
This shows that the variance term diverges to
00
in the order of
T-1.
3. Information term:
/
a = E[&f7] = 2alqo w(s)ds
=
+ O(T)
2 d ( O ) q ( h * )> 0
(3.33) (3.34)
This term does not depend on r asymptotically. The overall error term is now written as
E
[(i- h,)’]
=
$[
+ 4b2p;
r2P
1
(3.35)
when we use a pth-order function w. It is easily seen that, as r becomes small, the variance term increases while the bias term decreases. The best compromise is to choose 7 depending on n such that 1 -
r2P
nr
(3.36)
or n-1/(2P+l)
(3.37)
The overall error is then given by
E
[(i- h , ) 2 ] = c n-2P/(2Pf1)
(3.38)
proving that there exists a np/(2p+’)-consistent estimator. The last but important problem is to construct a pth-order function wp(s)explicitly. Although many other possibilities may exist, we use the Hermite polynomials defined by (3.39)
Motoaki Kawanabe and Shun-ichi Amari
1254
They form an orthonormal system,
J
1
e-s2/2hp(s)h,(s) ds =
s,,
(3.40)
and hp(s)is a polynomial of degree p. By assuming that windows w(s) can be expanded in the form of (3.411 and by taking the conditions 3.22 into account, we have (3.42) when p is even (see Fig. 1). It is possible to obtain a similar one for an odd p but it is not useful. 4 Orientation Detection
In this section, we consider the m-dimensional case where the separating hyperplane passes through the origin. The results derived in this section are closely related to the results by Rudd (19831, Li and Duan (1989),and W a n and Li (1991). In order to get the concrete form of the effective score, we assume that x is isotropically distributed. More definitely, 9(x) can be written
d x ) = r (Ilxl12)
(4.1)
where 11. (1 is the Euclid norm. Provided that q(x) is an isotropic distribution, the following discussion is true whether q(x) is known or unknown. For instance, a convenient choice is normal distribution with mean 0 and with the unit covariance matrix,
The probability distribution is P(X,Z;W) = 9 ( ~ () O . ~ + Z ~ ( W . X ) }
(4.3)
We now calculate the score function, U =
ZX+’(W. x)
0.5
+Z$(W.
X)
(4.4)
Network Parameters in Stochastic Perceptron
1255
-0.2'
(a) second order
-0.2'
(b) fourth order
Figure 1: Window functions.
The score function of the nuisance $ in the direction of ((w.x) is similarly given by
(4.5)
Now let us put s=w.x
(4.6)
Motoaki Kawanabe and Shun-ichi Amari
1256
Then u [ [ ] is a function of s = w . x and z. Since ((s) is an arbitrary smooth function satisfying ((0) = 0, the nuisance tangent space T N is included in the set of random variables expressed as a function of s and z . The projection of a random variable u to this space is given by the conditional expectation
E[u I s,zl
(4.7)
The projection of the score function u is then given by (4.8) where 4’is the derivative of 4 and
m(s) = E[x 1 w . x = s]
(4.9)
Since this is included in T N ,the effective score uE is given by
uE(x,z;w,(b)=
z {x - m(w . x)} @ ( w x) .
0.5 + 2 #(w. X)
(4.10)
This gives an estimating function whatever 4 is chosen (see Theorem 3 in the appendix). In practical situation, we can employ an adaptive method, that is, at first we compute a rough estimator 4 of the true nuisance parameter $*, and then use the effective score uE at $ to derive estimator w.For any isotropic distribution 9(x), the conditional expectation is explicitly written as E[x I w . x = S] = s w = (W. X) w
(4.11)
Therefore, the optimal estimating function is U E ( X , 2; w,
4*)=
2
{x - (w , x) w} 4; (w . x) 0.5 z &(w.X)
+
(4.12)
The asymptotic variance of the estimator w derived by this estimating function is
showing that the w is a &-consistent estimator and moreover is efficient in this semiparametric situation.
Remark. Without the isotropic assumption of q ( x ) we can calculate the effective score uE, which may be more complicated, if we know q(x). Besides the isotropic distributions, the effective score takes also entirely the same form for any elliptic distribution with the same mean and covariance matrix.
Network Parameters in Stochastic Perceptron
1257
5 General Case
In the general case of a stochastic neuron, no estimating function exists because of the threshold term. However, we can construct a good asymptotic estimating function by combining the above two results. More definitely, we can use the combined function
fn(x,2;w,h3 4)= [u:(x,z; w,h,417 fi(x,z;w,h)] 7
(5.1)
having two components where f.(X,
2; w, h ) =
-
w
7
(“:-”I
(5.2)
is an asymptotic estimating function constructed with a pth-prder window function, and
u:(xJ;W,h,4)
(w . x) w} 4’(w.x - h ) 0.5 + Z ~ ( W * x - h )
2 {x -
=
(5.3)
is the effective score function of the direction w. In the same way as in the previous sections, we can derive asymptotic properties of the estimator (w,i)which is defined as a solution of the estimating equation of f,. It can be shown that (w,i) is a consistent estimator of the stochastic order
(5.4)
(5.5) We remark that the convergence speed of does not influence that of w.
h is slower than fi,but this
6 Conclusion In this paper, we discussed estimation of parameters of stochastic new rons in the semiparametric situation. It is shown that there is no estimating function for the threshold parameter h, while there exist estimating functions for the orientation parameter w. We defined the pth-order asymptotic estimating functions of window type for threshold h. We also proposed an estimator (w, h) derived from the combination of the asymptotic estimating function for h and the effective score function for w. The estimator & of threshold is consistent with the order nP/(@+l)where p is an arbitrary integer. The convergence speed of the orientation estimator w is of stochastic order 1 fi.Although we dealt with estimation of the 50%point and hyperplane in this paper, it is easy to extend these results to estimation of ( a x loo)% point and hyperplane.
Motoaki Kawanabe and Shun-ichi Amari
1258
There remain some important problems unsolved in this paper. The first is how we choose the optimal order p,, of the asymptotic estimating function for given sample size n. The cross-validation method may be employed to choose an appropriate order p and window width 7. The second is how we can construct a consistent estimating function when the distribution of the input x is unknown. We cannot use the effective score :u in this case, because the conditional expectation of x for given s = w . x is included in it. Although we treated only one stochastic neuron in this paper, we need to consider the more general situation where the decision boundary is more complicated surface and should be approximated by networks of such stochastic neurons. Extending these results and also incorporating investigations of projection pursuit (Friedman and Tukey 1974; Huber 19851, we want to study semiparametric estimation of such general problems in the future. Appendix Let @ be the space of the nuisance function 4,and fix an arbitrary function 4. Let us add a small perturbation in the direction of ( to the nuisance function 4 and construct a curve starting from 4 in @
We consider any perturbation (direction of the curve) ( where there exists a positive constant c(<) such that the curve q$[E] is included in for 0 5 t < E ( ( ) . Let E be the class of such functions ( that may depend on the starting point $0 = 4.We then have a parametric model P(Xlf) = p
(x, 6,4f“1)
(A.2)
parameterized by t (0 being fixed). The score function of the nuisance parameter in the direction F is given by
This random variable v indicates how the log likelihood changes as the nuisance function 4 changes in the direction (. Let T&, be the closure of linear space spanned by random variables v[E]in all the directions E Z of the change in the nuisance function. We call TN the nuisance subspace. Let d (A.4) 0,6,4)= logp(x, 634)
<
be the score function of the parameter B of interest, which includes information on how the log likelihood changes as B changes. However, in general, this change might contain components that can be produced by
Network Parameters in Stochastic Perceptron
1259
adding some perturbations to 4. From the geometric point of view, these changes can be regarded as tangent vectors of the manifold of probability distributions. So, it can be said that the change in 6, might have common directions to a change in the nuisance parameter 4. These directions cannot be used for estimating 6, because they can be produced by changing 4. Therefore, we project u to the space orthogonal to the nuisance subspace T N . Here, the orthogonal projection is defined by the inner product of two random variables a ( x ) and b ( x ) given by
( a ,b) = E8,4 b ( x )b(x)l
(A.5)
The projected score uE is this orthogonal component of the score u ( x ,19,4). The main results along this line are summarized in the following theorem, where 1 and 2 are due to Amari and Kumon (1988) and Amari and Kawanabe (1993),3 to Godambe (1960) and 4 to Begun et al. (1983).
Theorem 3. 1. A n estimating function exists when T N does not include u for all 6, and 4,
that is uE is not null. 2 . For any fixed
4,
g(x. 6,) = U E ( X . d , $ )
(A.6)
is an estimatingfunction satisfying E8.4 [ U E ( X , 8, $)] = 0
(A.7)
for any 4. 3. Theestimator 6 isasymptotically normally distributed and is fi-consistent, with the asymptotic variance
where 6, and 4* are the true parameters. 4. Theoptimal estimatingfunction is u E ( x 0, , 4*),which ingeneral depends on the unknown 4*, The estimator therefrom is efficient, that is, its asymptotic variance is the inverse of the effective Fisher information
If we can choose the true +* or one close to it, the estimator 8 is asymptotically the best. However, the point is that even if we misspecify 4* and use a wrong 4, the estimator 8, is still unbiased and fi-consistent. This is a very attractive feature of the estimating function method.
1260
Motoaki Kawanabe and Shun-ichi Amari
The result can easily be extended to the vector parameter case where g(x,0 ) is a vector function having the same dimensions as 8. Estimating functions satisfy
(A.11) where I . I means the determinant of a matrix. In the present case, the random variable is a pair (x, z ) and the parameters are (w,h).
References Amari, S. 1985. Differential-GeometricalMethod in Statistics, Lecture Note in Statistics, Vol. 28. Springer-Verlag, New York. Amari, S., and Kawanabe, M. 1993. Differential geometry of estimating functions in semiparametric statistical models. Tech. Rep. METR 93-20, Department of Mathematical Engineering, University of Tokyo. Amari, S., and Kumon, M. 1988. Estimation in the presence of infinitely many nuisance parameters-geometry of estimating functions. Ann. Statist. 16, 1044-1068. Begun, J.M., Hall, W. J., Hung, W. M., and Wellner, J. A. 1983. Information and asymptotic efficiency in parametric-nonparametric model. Ann. Statist. 11, 432452. Bickel, P. J., Klaassen, D. A. J., Ritov, V., and Wellner, J. A. 1993. Eficient and Adaptive Estimation in Semiparametric Models. Johns Hopkins University Press, Baltimore. Duan, N., and Li, K.-C. 1991. Slicing regression: A link-free regression method. Ann. Statist. 19, 505-530. Friedman, J. H., and Tukey, J. W. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C-23, 881-889. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58. Godambe, V. I? 1960. An optimum property of regular maximum likelihood estimation. Ann. Math, Statist. 31, 1208-1212. Godambe, V. P., ed. 1991. Estimating Functions. Oxford University Press, New York. Haussler, D., Littlestone, N., and Warmuth, K. 1988. Predicting 0 , l functions of randomly drawn points. In Proceedings of COLT'88, pp. 280-295, Morgan Kaufmann, San Mateo. Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13, 435-475. Kabashima, Y.,and Shinomoto, S. 1992. Learning curves for error minimum and maximum likelihood algorithms. Neural Comp. 4, 712-719. Kabashima, Y., and Shinomoto, S. 1994. Learning a decision boundary from stochastic examples: Incremental algorithms with and without queries. Neural Comp., in press. Kim, J., and Pollard, D. 1990. Cube root asymptotics. Ann. Statist. 18, 191-219.
Network Parameters in Stochastic Perceptron
1261
Li, K.-C., and Duan, N. 1989. Regression analysis under link violation. Ann. Statist. 17, 1009-1052. Manski, C.F. 1975. Maximum score estimation of the stochastic utility model of choice. J. Economet. 3, 205-228. Nawata, K. 1989. Semiparametric estimation and efficiency bounds of binary choice models when the models contain one continuous variable. Econ. Lett. 31, 21-26. Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Statist. 33, 1065-1076. Rudd, P. 1983. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica 51, 225-228.
Received February 8, 1993; accepted March 15, 1994.
This article has been cited by: 2. Koji Tsuda, Shotaro Akaho, Motoaki Kawanabe, Klaus-Robert Müller. 2004. Asymptotic Properties of the Fisher KernelAsymptotic Properties of the Fisher Kernel. Neural Computation 16:1, 115-137. [Abstract] [PDF] [PDF Plus]
Communicated by Eduardo Sontag
Degree of Approximation Results for Feedforward Networks Approximating Unknown Mappings and Their Derivatives Kurt Hornik Technische Universitat, Vienna, Austria
Maxwell Stinchcombe Halbert White University of California, San Diego, C A U S A
Peter Auer Technische Universitut, Graz, Austria
Recently Barron (1993)has given rates for hidden layer feedforward networks with sigmoid activation functions approximating a class of functions satisfying a certain smoothness condition. These rates do not depend on the dimension of the input space. We extend Barron’s results to feedforward networks with possibly nonsigmoid activation functions approximating mappings and their derivatives simultaneously. Our conditions are similar but not identical to Barron’s, but we obtain the same rates of approximation, showing that the approximation error decreases at rates as fast as n-‘I2, where n is the number of hidden units. The dimension of the input space appears only in the constants of our bounds. 1 Introduction
Recent results of Barron (1993), building on key work of Jones (1992), have established general results describing the degree of approximation properties of single hidden layer feedforward networks with sigmoid hidden layer activation function. Barron's results follow from a fact about convex combinations in a Hilbert space attributed to B. Maurey in Pisier (1981). Barron shows that the best network approximation to an arbitrary function belonging to a class satisfying a certain smoothness condition has a root mean square error that decreases as the inverse square root of $n$, the number of hidden nodes. Perhaps surprisingly, this rate is independent of the dimension $d$ of the input space. This advantage is gained by the freedom permitted in the choice of the input to hidden weights. Very significantly, Barron shows that series approximations, such as Fourier
series and splines (which effectively act as feedforward networks with weights hard-wired between input and hidden units), have degree of approximation rates bounded below by those obtained with unconstrained feedforward networks. Moreover, the rates for series approximations depend on the dimension of the input space despite the smoothness of the class of functions considered.

Although we have followed the convention familiar to readers of this journal by referring to Barron's results as pertaining to the approximation properties of artificial neural networks, it is in fact more accurate to say instead that they pertain generally to approximation by convex combinations of families of functions. By viewing matters in this light, we hope to avoid contributing to the mystique associated with artificial neural networks. The improvements in rates over Fourier series and splines that Barron obtains arise because of differences in the richness of the families of functions over which convex combinations are taken. Standard Fourier series, for example, are combinations of functions with fixed frequencies. A sigmoid network analog constructed by using the "cosine squasher" of Gallant and White (1988) as the hidden unit activation function (which contains the Fourier series as a special case) is a combination of functions of variable frequencies. Barron's results apply to the latter networks, so the only special feature of the "neural network" is the freedom in choice of frequency. From this viewpoint, the improvements offered by Barron's results thus appear less mystical. Given the audience of this journal, we shall continue to refer to Barron's results and similar results of others (including ourselves) as pertaining to artificial neural networks. Nevertheless, in so doing we do not wish to imply that there is anything special about neural networks in this context; fundamentally, we are concerned with the approximation properties of combinations (in our case convex) of families of functions. Consequently, when we speak of the output of a neural net, it should be understood throughout as referring to the value of a function obtained as a (convex) combination of some family of functions.

Other degree of approximation results are available besides those of Barron. Making use of available degree of approximation results for Fourier series (e.g., Edmunds and Moscatelli 1977) and Gallant and White's (1988) Fourier series networks, Gallant and McCaffrey (1991) obtain degree of approximation results in spaces larger than those considered by Barron (1993). However, Gallant and McCaffrey's rates are slower than those known for Fourier series, a consequence of the additional parameters represented by the input to hidden unit biases appearing in standard feedforward networks, but absent in Fourier series.

Our purpose here is to extend the approach and results of Barron (1993) to situations in which interest attaches to simultaneously approximating an unknown function and its derivatives. Situations where this is important arise in robotics, economics, engineering, chemistry, physics, and many other fields in which dynamics and/or sensitivity analysis
are of interest. Hornik et al. (1990) (hereafter HSW) and Hornik (1991) have shown that single hidden layer feedforward networks are capable of providing arbitrarily accurate approximations to arbitrary functions and their derivatives by showing that such networks are dense in certain Sobolev spaces (spaces of differentiable functions where distance between functions is measured by taking into account distance between the functions and their derivatives). Our present results supplement prior work by describing how quickly the best approximation can improve (in Sobolev norm) as network complexity increases. The present results further extend Barron's work by treating nonsigmoid as well as sigmoid activation functions, and by giving results for other approximations using the embedding properties of Sobolev spaces established by the Rellich-Kondrachov Theorem. Examples of nonsigmoid hidden unit activation functions permitted by our results include sine and cosine functions, certain "bump" functions (i.e., probability density functions), and certain wavelet functions (e.g., derivatives of certain probability density functions). Results are stated and discussed in the following sections; all proofs are deferred to the appendix.

2 Main Results
For our purposes, it is most convenient to focus attention on approximating functions belonging to the function spaces $\mathcal{B}^m_d$ defined as follows. $\mathcal{B}^0_d$ is the space of all bounded measurable functions on $\mathbb{R}^d$, $d \in \mathbb{N}$, and for $m \ge 1$, $\mathcal{B}^m_d$ is the space of all functions on $\mathbb{R}^d$ that have continuous and uniformly bounded (partial) derivatives up through order $m$. For all $f \in \mathcal{B}^m_d$, define the norm
$$\|f\|_{\mathcal{B}^m_d} := \max_{0 \le |\alpha| \le m} \sup_x |D^\alpha f(x)|$$
The input environment measure, a probability measure on $\mathbb{R}^d$ representing the relative frequency with which input patterns occur, will be denoted by $\mu$. When equipped with the inner product
$$\langle f, g \rangle_{m,\mu} := \sum_{0 \le |\alpha| \le m} \int D^\alpha f(x)\, D^\alpha g(x)\, d\mu(x)$$
$\mathcal{B}^m_d$ becomes a pre-Hilbert space. To see this, let $c_{m,d} = (1 + d + \cdots + d^m)^{1/2}$ and note that if $f \in \mathcal{B}^m_d$, then $\|f\|_{m,\mu} \le c_{m,d}\|f\|_{\mathcal{B}^m_d}$, because $\|f\|^2_{m,\mu} = \sum_{0 \le |\alpha| \le m} \int_{\mathbb{R}^d} [D^\alpha f(x)]^2\, d\mu(x) \le \|f\|^2_{\mathcal{B}^m_d} \sum_{0 \le |\alpha| \le m} \int_{\mathbb{R}^d} d\mu(x) \le \|f\|^2_{\mathcal{B}^m_d}\,(1 + d + \cdots + d^m)$. With a slight abuse of notation, denote by $\bar{\mathcal{B}}^m_d = \bar{\mathcal{B}}^m_d(\mu)$ the Hilbert space that completes $(\mathcal{B}^m_d, \langle \cdot, \cdot \rangle_{m,\mu})$, where $\|\cdot\|_{m,\mu}$ denotes the norm induced by this inner product ($\rho_{m,\mu}$ denotes the associated metric). In particular, if $f \in \mathcal{B}^m_d$ and $G \subset \mathcal{B}^m_d$, we measure the distance between $f$ and $G$ by $\rho_{m,\mu}(f, G) := \inf_{g \in G} \rho_{m,\mu}(f, g)$.
We write $a'x$ to denote the inner product of $a$ and $x$ in $\mathbb{R}^d$, and $|a| = (a'a)^{1/2}$. Further, let $\ell(a) = \max(|a|, 1)$, and, for functions $\psi$ from $\mathbb{R}$ to $\mathbb{R}$, let $G^m_d(\psi, B, n)$ be
$$\left\{ g: \mathbb{R}^d \to \mathbb{R} \;\middle|\; g(x) = \sum_{i=1}^{n} \beta_i\, \ell(a_i)^{-m}\, \psi(a_i'x + \theta_i),\ a_i \in \mathbb{R}^d,\ \theta_i, \beta_i \in \mathbb{R},\ \sum_{i=1}^{n} |\beta_i| \le B \right\}$$
$G^m_d(\psi, B, n)$ is the collection of output functions for neural networks with $d$-dimensional input vector $x$, a single hidden layer comprising $n$ hidden units with common activation function $\psi$, real-valued input-to-hidden unit weights ($a_i$), biases ($\theta_i$), and hidden-to-output weights $\beta_i$, scaled by $\ell(a_i)^{-m}$. [This scaling ensures that for all $g \in G^m_d(\psi, B, n)$ we have $\|g\|_{\mathcal{B}^m_d} \le B\,\|\psi\|_{\mathcal{B}^m_1}$, regardless of the number of hidden units comprising the networks.] Let $\mathcal{M}^m_d$ be the space of all signed measures $\nu$ on $\mathbb{R}^d \times \mathbb{R}$ for which
$$\|\nu\|_{\mathcal{M}^m_d} := \int \ell(a)^m\, d|\nu|(a, \theta) < \infty$$
where $|\nu|$ denotes the variation of $\nu$, $|\nu|(A) := \sup \sum_l |\nu(E_l)|$, where the supremum is taken over all finite sequences $\{E_l\}$ of disjoint measurable subsets of $A$. Observe that the sequence $\|\nu\|_{\mathcal{M}^m_d}$ is nondecreasing in $m$. For $\psi \in \mathcal{B}^m_1$ and $\nu \in \mathcal{M}^m_d$, define the function $T_\psi\nu$ on $\mathbb{R}^d$ by
$$T_\psi\nu(x) := \int \ell(a)^{-m}\, \psi(a'x + \theta)\, d\nu(a, \theta)$$
then $f = T_\psi\nu \in \mathcal{B}^m_d$ and in fact $\|f\|_{\mathcal{B}^m_d} \le \|\psi\|_{\mathcal{B}^m_1}\|\nu\|_{\mathcal{M}^m_d}$. Observe that $T_\psi\nu(x)$ is the output of a single hidden layer feedforward network with input vector $x$ and a continuum of hidden units having common activation function $\psi$ (cf. Irie and Miyake 1988). The distribution of hidden-to-output weights (over input-to-hidden weights and biases) is given by the measure $\nu$. Our first main result describes how well the class of "discrete networks" $G^m_d(\psi, B, n)$ can approximate the "continuous networks" with output function $T_\psi\nu$.
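To make the class $G^m_d(\psi, B, n)$ concrete, the following sketch (ours, not from the paper; the choices $\psi = \tanh$, $m = 1$, $d = 2$, and all parameter values are purely illustrative) evaluates one member of the class, including the $\ell(a_i)^{-m}$ damping that keeps the network and its derivatives uniformly bounded however large the input-to-hidden weights become.

```python
import numpy as np

def ell(a):
    """l(a) = max(|a|, 1), the scaling factor used in G^m_d."""
    return max(np.linalg.norm(a), 1.0)

def g(x, A, thetas, betas, m=1, psi=np.tanh):
    """Evaluate one member of G^m_d(psi, B, n) at input x.

    A:      (n, d) array of input-to-hidden weights a_i
    thetas: (n,) biases theta_i
    betas:  (n,) hidden-to-output weights beta_i, with sum |beta_i| <= B
    Each hidden unit is damped by ell(a_i)**(-m), which keeps the network
    (and its first m derivatives) bounded by B * ||psi||_{B^m_1}.
    """
    out = 0.0
    for a_i, th_i, b_i in zip(A, thetas, betas):
        out += b_i * ell(a_i) ** (-m) * psi(a_i @ x + th_i)
    return out

# Example: a random 3-unit network on R^2 with total weight budget B = 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
thetas = rng.normal(size=3)
betas = np.array([0.5, -0.3, 0.2])          # sum of |beta_i| = 1
print(g(np.array([0.4, -1.0]), A, thetas, betas))
```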
Theorem 2.1. For $\psi \in \mathcal{B}^m_1$ and $\nu \in \mathcal{M}^m_d$, we have
$$\rho_{m,\mu}\left[T_\psi\nu,\; G^m_d(\psi, B, n)\right] \le c_{m,d}\,\|\psi\|_{\mathcal{B}^m_1}\,\|\nu\|_{\mathcal{M}^m_d}\; n^{-1/2}$$
for all $B \ge \|\nu\|_{\mathcal{M}^m_d}$.
Observe that the degree of approximation to $T_\psi\nu$ by our feedforward networks is already of order $n^{-1/2}$.
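The $n^{-1/2}$ rate can be illustrated with a small Monte Carlo experiment (our own sketch under arbitrary choices: $m = 0$, $\psi = \tanh$, $\nu$ a standard normal distribution over $(a, \theta)$, and $\mu$ uniform on $[-1, 1]^2$): sampling the $n$ hidden units of a discrete network i.i.d. from $\nu$ gives an $L^2(\mu)$ error against the "continuous network" that shrinks roughly like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2

def sample_units(n):
    """Draw hidden units (a, theta) i.i.d. from a fixed nu: standard normals."""
    return rng.normal(size=(n, d)), rng.normal(size=n)

def net_mean(X, A, th, chunk=5000):
    """Equal-weight network output: average of tanh(a'x + theta) over units."""
    out = np.zeros(len(X))
    for i in range(0, len(A), chunk):
        out += np.tanh(X @ A[i:i + chunk].T + th[i:i + chunk]).sum(axis=1)
    return out / len(A)

X = rng.uniform(-1, 1, size=(1000, d))      # 1000 draws from mu
A_big, th_big = sample_units(100_000)       # large proxy for T_psi nu
target = net_mean(X, A_big, th_big)

for n in [10, 100, 1000, 10_000]:
    A, th = sample_units(n)
    err = np.sqrt(np.mean((net_mean(X, A, th) - target) ** 2))
    print(n, err)   # the L2(mu) error shrinks roughly like n ** -0.5
```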
We proceed by considering classes of functions having an exact representation of the form $T_\psi\nu$ for some choice of $\psi$ and $\nu$. A rather broad class occurs by taking $\psi = \cos$ and letting $\nu$ vary. In particular, we consider the class $\mathcal{F}^m_d$ of all real-valued functions on $\mathbb{R}^d$ represented as
$$f(x) = \int e^{ia'x}\, d\sigma_f(a)$$
where $\sigma_f$ is a complex measure on $\mathbb{R}^d$ satisfying
$$\|\sigma_f\|_m := \int \ell(a)^m\, d|\sigma_f|(a) < \infty$$
When $\sigma_f$ is absolutely continuous with respect to Lebesgue measure, its density is the Fourier transform of $f$ (in the gaussian summability sense). Note that we do not require that $\sigma_f$ be absolutely continuous with respect to Lebesgue measure. Consequently, for $m = 0$, it suffices that $f$, e.g., be represented by an absolutely summable Fourier series.

To see that when $f$ belongs to $\mathcal{F}^m_d$ it can be expressed as $f = T_{\cos}\nu_f$ for a particular choice of $\nu_f$, we can employ Rudin (1974, Theorem 6.12) to obtain a measurable function $h(a)$ satisfying $d\sigma_f = h\, d|\sigma_f|$ and $|h(a)| = 1$ for all $a \in \mathbb{R}^d$; hence, letting $\delta_t$ denote point mass at $t$, we may write $h(a) = e^{i\theta(a)}$ and rewrite the representation for $f$ as
$$f(x) = \int \cos(a'x + \theta)\, d\nu_f(a, \theta)$$
where $d\nu_f(a, \theta) = d|\sigma_f|(a)\, d\delta_{\theta(a)}(\theta)$ is now a (positive) measure in $\mathcal{M}^m_d$. The next Theorem follows immediately from Theorem 2.1 by putting $\psi = \cos$.

Theorem 2.2. Let $f \in \mathcal{F}^m_d$. Then for all probability measures $\mu$ on $\mathbb{R}^d$ and $B \ge \|\sigma_f\|_m$,
$$\rho_{m,\mu}\left[f,\; G^m_d(\cos, B, n)\right] \le c_{m,d}\,\|\sigma_f\|_m\; n^{-1/2}$$
To obtain more general results, we investigate how well functions in $\mathcal{F}^m_d$ can be approximated by neural networks with noncosine activation function $\psi$, i.e., by functions in $G^m_d(\psi, B, n)$ with $\psi \ne \cos$. For this purpose, we restrict our attention to functions $\psi \in \mathcal{B}^m_1$ for which all relevant derivatives $D^k\psi$, $0 \le k \le m$, are absolutely integrable with respect to one-dimensional Lebesgue measure (i.e., $\|D^k\psi\|_{L^1(\mathbb{R})} := \int_{\mathbb{R}} |D^k\psi(t)|\, dt < \infty$). These assumptions are convenient, but not necessary.
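To see which activations meet this integrability condition, a quick numerical check (our own sketch, not from the paper) contrasts a Gaussian bump with the logistic squasher; the relaxation in Section 3 exists precisely to admit the latter.

```python
import numpy as np

t = np.linspace(-50, 50, 200001)
dt = t[1] - t[0]

gauss = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)   # a "bump" activation
logistic = 1.0 / (1.0 + np.exp(-t))              # a squasher

for name, psi in [("gaussian", gauss), ("logistic", logistic)]:
    l1 = np.abs(psi).sum() * dt                   # approximates int |psi(t)| dt
    d1 = np.abs(np.gradient(psi, dt)).sum() * dt  # approximates int |psi'(t)| dt
    print(name, l1, d1)
# The gaussian integrals are finite; the logistic's int |psi| grows with the
# window width (it diverges), so it fails the condition already at k = 0.
```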
We relax them in the next section. For such $\psi$, let
$$\varepsilon_\psi(\lambda) := \sum_{0 \le k \le m} \int_{|t| \ge \lambda} |D^k\psi(t)|\, dt$$
and let $S_r = \{x \in \mathbb{R}^d : |x| \le r\}$. We then have the following result.
Theorem 2.3. Suppose that $f \in \mathcal{F}^{m+1}_d$, that $\psi$ is in $\mathcal{B}^{m+1}_1$, and that all derivatives of $\psi$ up to order $m$ are integrable. Let $\omega \ne 0$ be chosen in a way that $\hat\psi(\omega) \ne 0$. Then for all positive $r$ and $\lambda$, $\rho_{m,\mu}[f, G^m_d(\psi, B, n)]$ is bounded by $\|\sigma_f\|_{m+1}$ times an expression that collects a term of order $n^{-1/2}$, a tail term governed by $\varepsilon_\psi(\lambda)$, and a term reflecting the mass $\mu$ places outside $S_r$, provided that $B \ge \ell(1/\omega)^{m+1}(r + \lambda)\,\|\sigma_f\|_{m+1}/\pi|\hat\psi(\omega)|$.

The bound given by this result looks like a monster. However, notice that the expression in parentheses no longer depends on $\sigma_f$ but only on $n$ and $\mu$, and can now be fine-tuned by choosing $r$ and $\lambda$ suitably. Clearly, such a procedure cannot be cast into a complete theorem covering all possible cases; we cover three in the next three Corollaries. To start with, suppose (as Barron does) that $\mu$ is compactly supported and that $\psi$ is compactly supported, too. We have the following corollary.
Corollary 2.4. In addition to the conditions of Theorem 2.3, assume that $\mu$ and $\psi$ are compactly supported. Then there exist constants $B_0$ and $C$ depending only on $m$, $d$, $\mu$, and $\psi$ such that
$$\rho_{m,\mu}\left[f,\; G^m_d(\psi, B, n)\right] \le C\,\|\sigma_f\|_{m+1}\; n^{-1/2}$$
provided that $B \ge B_0\|\sigma_f\|_{m+1}$.

This case is the best possible in the sense that it is the only one where we can guarantee that the approximation error is of order $n^{-1/2}$. For applications, one might then use basis splines of desired smoothness order $m$, because these are compactly supported. Also notice that suitably differenced versions of the Heaviside or ramp function (with $m = 0$) and the cosine squasher ($m = 1$) (Gallant and White 1988) fall into this category of activation functions.

If the tails of $\psi$ decay exponentially fast in the sense that for some constant $\gamma > 0$, $\|e^{\gamma|t|} D^k\psi(t)\|_{L^1(\mathbb{R})} < \infty$ for all $0 \le k \le m$, then $\varepsilon_\psi(\lambda)$ decays like $e^{-\gamma\lambda}$; hence by choosing $\lambda = \log n/2\gamma$, we obtain
Corollary 2.5. In addition to the conditions of Theorem 2.3, assume that $\mu$ is compactly supported and that $\|e^{\gamma|t|} D^k\psi(t)\|_{L^1(\mathbb{R})}$ is finite for some $\gamma > 0$ and all $0 \le k \le m$. Then there exist constants $B_0$ and $C$ depending only on $m$, $d$, $\mu$, and $\psi$ such that
$$\rho_{m,\mu}\left[f,\; G^m_d(\psi, B, n)\right] \le C\,\|\sigma_f\|_{m+1}\,(\log n)\; n^{-1/2}$$
provided that $B \ge B_0\|\sigma_f\|_{m+1}\log n$.

Observe that suitably differenced versions of the logistic and the arctangent squasher are covered by the above corollary. Finally, suppose that the tails of the relevant derivatives of $\psi$ decay even more slowly, in the sense that for some positive constant $\gamma$, $\| |t|^\gamma D^k\psi(t)\|_{L^1(\mathbb{R})} < \infty$ for all $0 \le k \le m$. In this case, an optimal rate is achieved by taking $\lambda = n^{1/[2(\gamma+1)]}$, yielding the following.

Corollary 2.6. In addition to the conditions of Theorem 2.3, assume that $\mu$ is compactly supported and that $\| |t|^\gamma D^k\psi(t)\|_{L^1(\mathbb{R})}$ is finite for some $\gamma > 0$ and all $0 \le k \le m$. Then there exist constants $B_0$ and $C$ depending only on $m$, $d$, $\mu$, and $\psi$ such that
$$\rho_{m,\mu}\left[f,\; G^m_d(\psi, B, n)\right] \le C\,\|\sigma_f\|_{m+1}\; n^{-(1/2)[\gamma/(\gamma+1)]}$$
provided that $B \ge B_0\|\sigma_f\|_{m+1}\, n^{1/[2(\gamma+1)]}$.
Observe that the larger we may choose $\gamma$, the closer the exponent gets to the compactly supported value $-1/2$. If $\mu$ is not compactly supported, the tail behavior of $\mu$ can be taken into account in an identical manner using the general inequality provided by Theorem 2.3 and optimally adjusting $r$. For the normal distribution (which has exponential tails) we obtain an extra (additive) $\log n$ term.

3 Further Results
We begin this section by relaxing the conditions required of $\psi$. We then state some degree of approximation results for Sobolev norms different from those of the preceding section that follow from the embedding properties of Sobolev spaces.

In Theorem 2.3, we assumed that all derivatives $D^k\psi$, $0 \le k \le m$, are absolutely integrable. Now we relax these conditions to require only that $\psi$ is a nonzero element of $\mathcal{B}^m_1$ and that for some $l$ ($l \ge m$), $0 < \int |D^l\psi(t)|\, dt < \infty$ (i.e., $\psi$ is $l$-finite in the terminology of HSW). In particular, the latter condition permits treatment of sigmoid activation functions (such as the Heaviside, logistic, and tanh functions) ruled out by our earlier condition. In this case, we can follow Lemma 3.3 of HSW to construct a nonzero activation function $\psi_h$ in $\mathcal{B}^m_1$ such that $D^k\psi_h$ is integrable for $0 \le k \le l$ in the following way. Pick a positive scalar $h$ and put $\psi_h = \Delta^l_h\psi$, where $\Delta^l_h$
is the $l$th iterate of the operator $\Delta_h$, which is given by
$$\Delta_h\psi(t) = \tfrac{1}{2}\left[\psi(t + h) - \psi(t)\right]$$
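The differencing construction is easy to exhibit in code (our own sketch; the logistic activation, $h = 1$, and $l = 1$ are arbitrary choices): a single application of $\Delta_h$ already turns the non-integrable logistic squasher into an integrable bump.

```python
import numpy as np

def delta_h(psi, h):
    """The differencing operator (Delta_h psi)(t) = [psi(t + h) - psi(t)] / 2."""
    return lambda t: 0.5 * (psi(t + h) - psi(t))

def psi_h(psi, h, l):
    """l-fold iterate Delta_h^l psi, used to build an integrable activation."""
    for _ in range(l):
        psi = delta_h(psi, h)
    return psi

logistic = lambda t: 1.0 / (1.0 + np.exp(-t))
bump = psi_h(logistic, h=1.0, l=1)

# The logistic itself is not integrable (it tends to 1 at +inf), but the
# differenced version decays at both tails:
t = np.linspace(-30, 30, 60001)
print(np.trapz(np.abs(bump(t)), t))   # finite, approximately h/2 = 0.5
```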
Our next result follows because $\psi_h$ can be shown to satisfy the conditions required by Theorem 2.3 and $G^m_d(\psi_h, B, n) \subseteq G^m_d(\psi, B, 2^l n)$.
Corollary 3.1. Suppose that $f \in \mathcal{F}^{m+1}_d$ and that $\psi$ is in $\mathcal{B}^m_1$ and $l$-finite for some $l \ge m$, and let $h > 0$. Then there exist finite constants $B_0$ and $C$ that depend only on $m$, $d$, $h$, and $\psi$ such that for all positive $r$ and $\lambda$, the bound of Theorem 2.3 holds with these constants, provided that $B \ge B_0(r + \lambda)\|\sigma_f\|_{m+1}$.

Analogs of Corollaries 2.4-2.6 follow from this result by application of identical arguments. Thus far, all of our degree of approximation results have been based on the Sobolev metric $\rho_{m,\mu}$. Other useful Sobolev distance measures can be defined as
$$\rho_{m,q,\mu}(f, g) := \max_{0 \le |\alpha| \le m} \left(\int |D^\alpha f(x) - D^\alpha g(x)|^q\, d\mu(x)\right)^{1/q}$$
with the usual modification for $q = \infty$. By the embedding properties of Sobolev spaces, for suitable $k = k(q, d)$ we have
$$\rho_{m,q,\mu}(f, g) \le C\,\rho_{m+k,\lambda_r}(f, g)$$
where $\lambda_r$ is the normalized Lebesgue measure on $S_r$. The following corollary is now straightforward.
Corollary 3.2. Let $2 \le q \le \infty$, $h > 0$, and $k = k(q, d)$. Assume that $\psi$ is in $\mathcal{B}^{m+k}_1$ and is $l$-finite for some $l \ge m + k$, and that $\mu$ is supported in $S_r$ for some $0 < r < \infty$ and has a bounded density with respect to Lebesgue measure. Then there exist constants $B_0$ and $C$ that depend on $m$, $d$, $q$, $h$, $\psi$, $\mu$, and $S_r$ such that for all $f \in \mathcal{F}^{m+k+1}_d$ and $\lambda > 0$, the analog of the bound in Corollary 3.1 holds in the metric $\rho_{m,q,\mu}$.
In particular, we now have a result for uniform approximation on compacts under certain smoothness assumptions expressed in terms of the moments of the spectral distribution of $f$. Again, further corollaries can be derived by choosing particular values of $\lambda$ according to the tail behavior of $\psi$.

Finally, let us point out that if $\psi$ is a squasher (i.e., nondecreasing with limits 0 and 1 at $-\infty$ and $+\infty$, respectively), then for $m = 0$ the $\varepsilon_\psi(\lambda)$ term in Theorem 2.3 can be suppressed. To see this, introduce the functions
$$\psi_{\kappa,h}(t) := \psi(\kappa t) - \psi(\kappa(t - h))$$
As $\kappa \to \infty$, $\psi_{\kappa,h}$ converges pointwise to the indicator function of $(0, h)$, with the possible exception of the endpoints 0 and $h$. As a simple application of Fubini's Theorem yields that $\int_{\mathbb{R}} |\psi_{\kappa,h}(t)|\, dt = h$, the dominated convergence Theorem implies that for all bounded functions $u$ on $\mathbb{R}$, $\lim_{\kappa\to\infty} \int_{\mathbb{R}} u(t)\,\psi_{\kappa,h}(t)\, dt = \int_0^h u(t)\, dt$. Hence in particular, $\lim_{\kappa\to\infty} |\hat\psi_{\kappa,h}(1)| = 2|\sin(h/2)|$ and $\lim_{\kappa\to\infty} \varepsilon_{\psi_{\kappa,h}}(\lambda) = \int_{|t| \ge \lambda} 1_{(0,h)}(t)\, dt = \max(h - \lambda, 0)$. Applying Theorem 2.3 with $m = 0$, $\omega = 1$, and $\psi$ replaced by $\psi_{\kappa,h}$, and then letting $\kappa \to \infty$, we thus obtain the following.
Corollary 3.3. Let $\psi$ be a squasher and $h > 0$. Then for all $f \in \mathcal{F}^1_d$, $r > 0$, and $\lambda > h$, the bound of Theorem 2.3 holds with the $\varepsilon_\psi$ term suppressed (note that $\max(h - \lambda, 0) = 0$ for $\lambda > h$), provided that $B \ge (r + \lambda)\|\sigma_f\|_1/(2\pi|\sin(h/2)|)$.
This result is roughly equivalent to Theorem 5 in Barron (1993) (our condition $\|\sigma_f\|_1 < \infty$ is a little stronger than Barron's $\int |a|\, d|\sigma_f|(a) < \infty$). Notice, however, that in both the above Corollary (as the derivatives of our $\psi_{\kappa,h}$ tend to infinity as $\kappa \to \infty$) and in Barron's theorems (as his approach is based on the Heaviside function and sigmoidal approximations to it), the rates are obtained only for nonsmooth approximators.

4 Concluding Remarks
We have obtained a variety of results establishing that single hidden layer feedforward networks can approximate unknown functions and their derivatives with error decreasing at rates as fast as $n^{-1/2}$. As in the case treated by Barron (1993), the dimension of the input space, $d$, does not affect the rate of approximation, but only the constants of proportionality. This is in sharp contrast to the properties of standard kernel and series approximants (cf. Cox 1984; Edmunds and Moscatelli 1977).
Indeed, the lower bounds for degree of approximation of series estimators obtained by Barron (1993) hold a fortiori in our context, because of the nesting relations obeyed by Sobolev metrics. Thus, the network functions considered here offer the same potential benefits for approximating functions and their derivatives in spaces of more than a few dimensions as they do when derivatives are not of concern.

Appendix: Mathematical Proofs

We begin with two results from which Theorem 2.1 follows directly. Recall that for a finite signed measure $\tau$ on a separable Banach space $\mathcal{X}$, we say that the integral of $\tau$ in the weak sense, if it exists, is that point $\bar{x} \in \mathcal{X}$ such that for all $h^*$ in $\mathcal{X}^*$, the dual of $\mathcal{X}$, $\int h^*(x)\, d\tau(x) = h^*(\bar{x})$. In this case, we denote $\bar{x}$ by $\int x\, d\tau(x)$. If the integral exists in the strong sense, then it agrees with the integral in the weak sense.

Lemma A.1. Suppose that $S$ is a nonempty norm-bounded subset of a separable Banach space $\mathcal{X}$, that $S = -S$, that $\tau$ is a finite signed measure on $S$, and that $\bar{x} = \int x\, d\tau(x)$. Then for any $B \ge |\tau|(S)$, $\bar{x} \in B \cdot \overline{co}(S)$, where $\overline{co}(S)$ denotes the closed convex hull of $S$.

Proof. If $\tau$ is the 0 measure, simply note that $\bar{x} = 0$ and, under the given assumptions, $0 \in \overline{co}(S)$. Suppose now that $\tau$ is not the 0 measure. Using the Hahn-Jordan decomposition, express $\tau$ as $\tau^+ - \tau^-$, where $\tau^+$ and $\tau^-$ are positive measures. In particular, $\bar{x} = \int x\, d\tau^+(x) - \int x\, d\tau^-(x)$. Because $S = -S$, $\bar{x}$ is the integral over $S$ of a positive measure $\tau'$ with $\tau'(S) \le |\tau|(S)$. Dividing $\bar{x}$ and $\tau'$ by $\tau'(S)$, it is sufficient to examine the case where $\tau$ is a probability measure. Let $h^*$ be an element of $\mathcal{X}^*$, the dual of $\mathcal{X}$. By definition, $h^*(\bar{x}) = \int h^*(x)\, d\tau(x)$. Therefore, $h^*(\bar{x}) \le \sup_{x \in S} h^*(x) \le \sup_{x \in \overline{co}(S)} h^*(x)$, where the second inequality follows because $S \subseteq \overline{co}(S)$; both suprema are finite because $S$ is norm bounded. But this implies that $\bar{x}$ belongs to all of the closed half-spaces containing $S$, and by the Hahn-Banach theorem, this implies that $\bar{x} \in \overline{co}(S)$. $\square$

Lemma A.2. Suppose we are given a family of functions $\phi_w$ in $\bar{\mathcal{B}}^m_d$ parameterized by $w \in W$, with $\sup_{w \in W} \|\phi_w\|_{m,\mu} \le b$, and a finite signed measure $\tau$ on $W$ such that $f = \int \phi_w\, d\tau(w)$. Then for all $B \ge \|\tau\| := |\tau|(W)$,
$$\rho_{m,\mu}\left[f,\; \left\{\textstyle\sum_{i=1}^{n} \beta_i \phi_{w_i} : w_i \in W,\ \sum_{i=1}^{n} |\beta_i| \le B\right\}\right] \le b\,\|\tau\|\; n^{-1/2}$$
Proof. Let $G(B) = \{\beta\phi_w : |\beta| \le B,\ w \in W\}$. If we can show that for all $B \ge \|\tau\|$, $f$ is in the $\rho_{m,\mu}$ closure of the convex hull of $G(B)$, the assertion follows from Theorem 5 in Barron (1993). Let $S = G(B)$ in Lemma A.1. $\square$
Proof of Theorem 2.1. Let $w = (a, \theta)$, $\phi_w(x) = \ell(a)^{-m}\psi(a'x + \theta)$, and $d\tau(w) = \ell(a)^m\, d\nu(a, \theta)$, so that $\|\tau\| = \|\nu\|_{\mathcal{M}^m_d}$. As $|D^\alpha\phi_w(x)| = |\ell(a)^{-m}\, a^\alpha\, D^{|\alpha|}\psi(a'x + \theta)| \le \|\psi\|_{\mathcal{B}^m_1}$ for all $0 \le |\alpha| \le m$, Theorem 2.1 follows immediately from Lemma A.2. $\square$
Proof of Theorem 2.3. The assumptions on $\psi$ and $f$ imply that an Irie-Miyake type integral representation of $D^\alpha f$ (cf. HSW, p. 558) holds for all multi-indices $0 \le |\alpha| \le m$, where $\omega \ne 0$ is chosen in such a way that $\hat\psi(\omega) \ne 0$. (In the notation of HSW, the term $e^{-i\omega\theta}/2\pi\hat\psi(\omega)$ is replaced by $e^{-2\pi i\omega\theta}/\hat\psi(\omega)$; the difference occurs because here we define Fourier transforms in the same way that Barron does.) Truncate the representing measure by setting
$$d\nu_M(a, \theta) = 1_{[-\ell(a)M,\,\ell(a)M]}(\theta)\, d\nu(a, \theta)$$
(here, $1_S(\theta) = 1$ if $\theta \in S$ and zero otherwise). Applying Theorem 2.1 to the truncated representation then bounds the approximation error, provided that $B$ dominates the $\mathcal{M}^m_d$-norm of $\nu_M$.
Summing over $0 \le |\alpha| \le m$, using the trivial inequality $u^2 + v^2 \le (u + v)^2$ for $u, v \ge 0$, taking the square root, and replacing $\ell(1/\omega)^m\|\sigma_f\|_m$ by $\ell(1/\omega)^{m+1}\|\sigma_f\|_{m+1}$ (as we are not seeking the best possible constants anyway), we finally obtain the asserted bound for the truncated function $f_{r+\lambda}$. The Theorem now follows upon using the triangle inequality
$$\rho_{m,\mu}\left[f,\; G^m_d(\psi, B, n)\right] \le \rho_{m,\mu}(f, f_{r+\lambda}) + \rho_{m,\mu}\left[f_{r+\lambda},\; G^m_d(\psi, B, n)\right]$$
$\square$
For $T \subseteq \mathbb{R}$ and $\epsilon > 0$, let $T_\epsilon$ denote the $\epsilon$-neighborhood of the set $T$, i.e., the set of all $s$ such that $|s - t| \le \epsilon$ for some $t \in T$. Further, let $L_h$ be the operator $L_h u(t) = \int_t^{t+h} u(s)\, ds$.

Lemma A.3. Let $u$ be integrable on $\mathbb{R}$. Then for all $T \subseteq \mathbb{R}$ and $h > 0$,
$$\int_T |L_h u(t)|\, dt \le h \int_{T_h} |u(t)|\, dt$$

Proof. By an application of Fubini's Theorem. $\square$
Proof of Corollary 3.1. It follows from the proof of Lemma 3.3 of HSW that $\psi_h$ is nonzero and has integrable derivatives of all orders up to $l$. Choose $\omega \ne 0$ in such a way that $\hat\psi_h(\omega) \ne 0$. Then, as obviously $G^m_d(\psi_h, B, n) \subseteq G^m_d(\psi, B, 2^l n)$, an application of Theorem 2.3 gives
$$\rho_{m,\mu}\left[f,\; G^m_d(\psi, B, 2^l n)\right] \le \rho_{m,\mu}\left[f,\; G^m_d(\psi_h, B, n)\right]$$
provided that $B \ge \ell(1/\omega)^{m+1}(r + \lambda)\|\sigma_f\|_{m+1}/\pi|\hat\psi_h(\omega)|$. As clearly $\|\psi_h\|_{\mathcal{B}^m_1} \le \|\psi\|_{\mathcal{B}^m_1}$, the proof can be completed by bounding $\varepsilon_{\psi_h}(\lambda)$ using Lemma A.3. Now $\Delta^k_h\psi(t) = 2^{-k}\sum_{j=0}^{k} c_{kj}\,\psi(t + jh)$, where the $c_{kj}$ are the signed binomial coefficients of order $k$. Clearly, $\Delta_h = L_h D/2$ and hence $\Delta^l_h = 2^{-l} L^l_h D^l$. Thus by the above Lemma,
$$\int_T |\Delta^{l-k}_h D^k\psi(t + jh)|\, dt = 2^{k-l} \int_T |L^{l-k}_h D^l\psi(t + jh)|\, dt \le (h/2)^{l-k} \int_{T_{jh+(l-k)h}} |D^l\psi(t)|\, dt$$
As for $0 \le k \le l$, $D^k\Delta^l_h\psi(t) = \Delta^k_h\Delta^{l-k}_h D^k\psi(t) = 2^{-k}\sum_j c_{kj}\,\Delta^{l-k}_h D^k\psi(t + jh)$, we obtain with $T = \{t : |t| \ge \lambda\}$ that
$$\int_T |D^k\psi_h(t)|\, dt \le (h/2)^{l-k} \int_{|t| \ge \lambda - lh} |D^l\psi(t)|\, dt$$
for all $0 \le k \le l$, and thus in particular $\varepsilon_{\psi_h}(\lambda) \le \ell(h)^l \int_{|t| \ge \lambda - lh} |D^l\psi(t)|\, dt$, as asserted, and the Corollary follows. $\square$
Acknowledgments

White's participation was supported by National Science Foundation Grants SES-8921382 and SES-9209023. We are grateful to Andrew Barron and to an anonymous referee for helpful comments.

References

Adams, R. A. 1975. Sobolev Spaces. Academic Press, New York.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39, 930-945.
Cox, D. D. 1984. Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21, 789-813.
Edmunds, D. E., and Moscatelli, V. B. 1977. Fourier approximations and embeddings in Sobolev spaces. Diss. Math. 145, 1-46.
Gallant, A. R., and White, H. 1988. There exists a neural network that does not make avoidable mistakes. In Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego, pp. I:657-664. IEEE Press, New York.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown function and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560.
Irie, B., and Miyake, S. 1988. Capabilities of three-layer perceptrons. In IEEE Second Conference on Neural Networks, pp. I:641-648. IEEE Press, New York.
Jones, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural networks. Ann. Stat. 20, 608-613.
McCaffrey, D. F., and Gallant, A. R. 1994. Convergence rates for single hidden layer feedforward networks. Neural Networks 7, 147-158.
Pisier, G. 1981. Remarques sur un résultat non publié de B. Maurey. Séminaire d'Analyse Fonctionnelle 1980-1981, École Polytechnique, Palaiseau.
Rudin, W. 1974. Real and Complex Analysis. Tata McGraw-Hill, New Delhi.
Received April 30, 1993; accepted February 7, 1994.
Communicated by Nathan Intrator
Influence Function Analysis of PCA and BCM Learning

Yong Liu
Department of Physics, Institute for Brain and Neural Systems, Box 1843, Brown University, Providence, RI 02912 USA

Based on the influence function (Hampel et al. 1986), we analyze several forms of learning rules under the influence of abnormal inputs (outliers). Principal component analysis (PCA) learning rules (Oja 1982, 1989, 1992) are shown to be sensitive to the influence of outliers. Biologically motivated robust PCA learning rules are proposed. The variants of BCM learning (Bienenstock et al. 1982; Intrator 1990b) with linear neurons are shown to be subject to the influence of outliers; in contrast, they are shown to be robust to outliers with sigmoidal neurons.

1 Introduction
In addition to learning dynamics, there are at least two important issues related to an unsupervised learning rule. The first is the stability of the learning rule (Oja 1982; Bienenstock et al. 1982; Intrator and Cooper 1992; Miller and MacKay 1994), which concerns the stability of the learned weight versus a weight perturbation. The second is the robustness of the learning rule, which concerns the stability of the learned weight versus a perturbation of the environment. This second issue has been ignored in connectionist research. However, the environment to which a neuron is exposed is very noisy, and thus abnormal inputs may be present. In statistical terms, these abnormal inputs are called outliers. We require that outliers do not have a dramatic influence on synaptic modification. A learning rule that is not sensitive to the influence of outliers is regarded as robust. In this paper, the issue of robustness is addressed. A general discussion on robustness is given in Huber (1981).

In order to understand how the learned weight is influenced when the environment is perturbed, consider the following scenario: An input pattern $d$ is added with a small probability $\tau$ to the environment. Observe the weight change. The ratio between the weight change and $\tau$ measures the sensitivity of the weight versus the environment perturbation. This is a heuristic definition of the concept of the influence function (Hampel et al. 1986); the exact definition is given in Appendix A. If a learning rule is robust, the change of the weight due to a small perturbation should be proportionally small, and thus the influence function should be bounded.

Neural Computation 6, 1276-1288 (1994) © 1994 Massachusetts Institute of Technology
For the learning rules discussed in this paper the influence functions are continuous. Consequently the boundedness of an influence function is determined by its behavior at large values of $d$. The inputs that have a large value of $d$ are outliers, since they are far away from the rest of the data points. For a learning rule, these outliers can have a great influence on the synaptic modification. We shall assume that the mean of the input $d$ is zero. This causes no problem since we can always shift $d$ to $d - E[d]$ when the mean is not zero.

Two major types of learning rules have been proposed. One is based on second-order statistics of the environment (Sejnowski 1977; Oja 1982; Linsker 1986; Miller 1990), and the other is based on higher order statistics (Bienenstock et al. 1982; Intrator and Cooper 1992) (BCM). For the first type of learning, we shall concentrate on Oja's version of principal component analysis (PCA) learning (Oja 1982, 1989), and for the second type on Intrator's variant of BCM learning (Intrator and Cooper 1992). It is found that the learning rules with linear neurons are subject to the influence of outliers, and that the sigmoidal property of biological neurons may help the learning to overcome the influence of outliers.

The influence functions and the robustness conditions of three forms of learning rules are given in Appendix A. We formulate the PCA and BCM learning rules as special cases of these three forms in the relevant sections. The robustness properties of the PCA and BCM learning rules are then examined. In Section 2, we discuss Oja's version of PCA learning. In Section 2.1.1, Oja's learning rule for a single neuron (Oja 1982) is found to be sensitive to outliers. A robust learning rule is proposed, based on maximizing the variance of the output of a sigmoidal neuron¹ subject to the constraint that the length of the weight equals one (Section 2.1.2). In Section 2.2.1, Oja's learning rule for a multineuron network (Oja 1989) is also found to be nonrobust. Analogous to the case of a single neuron, a robust learning rule for a multineuron network is proposed (Section 2.2.2). In Intrator and Cooper (1992), it is claimed that for both a single neuron and a multineuron network, in the case of sigmoidal neurons the learning rules are robust, whereas in the case of linear neurons the learning rules are not. In Section 3, we verify this claim.

¹For example, a choice of the sigmoidal function is $\sigma(x) = (e^x - 1)/(e^x + 1)$.

2 PCA Learning
2.1 Single Neuron.

2.1.1 Sensitivity of Oja's Rule to Outliers. Hebb's postulate (Hebb 1949) is the cornerstone of most unsupervised learning rules based on the correlation function of the input. The naive Hebbian rule, however, is unstable. This problem is often overcome by adding constraints to this rule
(von der Malsburg 1973; Sejnowski 1977; Linsker 1986; Kohonen 1989; Miller and MacKay 1994). One such constraint is that the length of the weight equals one (Kohonen 1989). This is equivalent to adding a decay term to the naive Hebbian rule. This modified rule has been proven to extract the principal component (Oja 1982). The learning rule (Oja 1982) is
$$\dot{m} = \eta c(d - mc) \tag{2.1}$$
where $m$ and $c = m \cdot d$ are the weight and the output of the neuron, respectively, and $\eta$ is the learning rate. It can be obtained by maximizing the variance of the output of a linear neuron, $E[(m \cdot d)^2]$, subject to the constraint $m \cdot m = 1$. According to Theorem 1 in Appendix A, the learning is not robust. This is largely due to the influence of the data present at a large value of $d$.

2.1.2 Robust PCA Learning Rule. A biological neuron is sigmoidal; that is, the output of the neuron is $c = \sigma(m \cdot d)$. Instead of maximizing the variance of the output of a linear neuron subject to the constraint $m \cdot m = 1$, as has been done in Oja's rule, we maximize the variance of the output of a sigmoidal neuron, $E[\sigma(m \cdot d)^2]$, subject to the same constraint. Under this optimization criterion, the stochastic learning rule, as derived in Appendix B, is
tiz = qa’(m.d ) a ( m .d ) [ d- m ( m . d ) ] (2.2) If m . d is in the linear region of the sigmoidal function, equation 2.2 can be written approximately as Oja’s rule in equation 2.1. Moreover, if m . d is out of the linear region, the weight changes little. This is desirable, since the nonlinear region is where the outliers may be found. Actually, according to Theorem 1 in Appendix A, this rule is robust due to the effect of the derivative of the sigmoidal function, which controls the growth of the influence of the data far away from the origin. 2.2 Multineuron Network.
2.2 Multineuron Network.

2.2.1 Oja's Rule for a Multineuron Network. Oja has generalized his rule to the case of a network with $k$ neurons. This rule makes the $k$ weights converge to $k$ unit vectors. These $k$ unit vectors are mutually independent; every unit vector is a combination of the first $k$ principal components (Oja 1989). Denote $M = (m_1, \ldots, m_k)$ and $c = (c_1, \ldots, c_k)^T$, where $m_i$ and $c_i = m_i \cdot d$ are the weight and the output of the $i$th neuron, respectively. In this case the optimization criterion is to maximize $E[c^2] = \sum_i E[(m_i \cdot d)^2]$ subject to the constraint $MM^T = I$. The learning rule is (Oja 1989)
$$\dot{m}_i = \eta c_i\Big(d - \sum_j c_j m_j\Big), \quad \forall i \tag{2.3}$$
or
$$\dot{M} = \eta c(d^T - c^T M) \tag{2.4}$$
This learning rule is not robust, according to Theorem 1 in Appendix A. This is again due to the influence of the data with a large value of $d$.
2.2.2 Robust PCA Learning for a Multineuron Network. A robust rule can be obtained if the criterion is to maximize the total variance of the outputs of sigmoidal neurons, i.e., to maximize $E[c^2] = E[\sum_i c_i^2]$, subject to the constraint $MM^T = I$, where $M = (m_1, \ldots, m_k)$, $c = (c_1, \ldots, c_k)^T$ with $c_i = \sigma(m_i \cdot d)$. As derived in Appendix C, the stochastic learning rule is
$$\dot{m}_i = \eta\Big\{\sigma'(m_i \cdot d)\,c_i\,d - \sum_j \tfrac{1}{2}\big[\sigma'(m_i \cdot d)\,c_i\,m_j \cdot d + \sigma'(m_j \cdot d)\,c_j\,m_i \cdot d\big]m_j\Big\}, \quad \forall i \tag{2.5}$$
or
$$\dot{M} = \eta(g\,d^T - BM) \tag{2.6}$$
where $B = (b_{ij})_{k \times k}$ with $b_{ij} = \frac{1}{2}[\sigma'(m_i \cdot d)\,c_i\,m_j \cdot d + \sigma'(m_j \cdot d)\,c_j\,m_i \cdot d]$ and $g = (g_1, \ldots, g_k)^T$ with $g_i = \sigma'(m_i \cdot d)\,c_i$. In the linear region of the sigmoidal function, the equation has the same form as Oja's in equation 2.3, while in the nonlinear region, where the outliers will possibly appear, a neuron learns little from the data due to the effect of the $\sigma'$ term. According to Theorem 1 in Appendix A, this modified learning rule is robust.

3 BCM Learning
A linear neuron output is equivalent to a projection of the input onto a one-dimensional output. The weight is the direction of the projection. For a learning rule that extracts principal components, all it tells us is in which directions in feature space the projected inputs have large variance. Other forms of learning rules based upon second-order statistics convey similar information about the shape of the covariance matrix of the input. However, deviation from normality is more interesting (Friedman 1987). This is especially true when the input dimension is very high, because a random projection is approximately normally distributed due to the central limit theorem (Diaconis and Freedman 1984). In particular, if the object of the neuron is to distinguish patterns, multimodality is more interesting. BCM learning is such an unsupervised learning algorithm (Bienenstock et al. 1982). We shall base our discussion on Intrator's variant (Intrator 1990b), which implements a projection pursuit algorithm (Huber 1985). This variant was discussed in a biological context as well
(Intrator and Cooper 1992). Based on projection indices that are some combinations of second- and third-order statistics of the distribution of the neuronal outputs, the learning rules focus on the directions with nonnormal distribution. However, projection indices that are of polynomial type are known to be sensitive to outliers (Huber 1985). Friedman (1987) suggested the use of a transformation that compresses the region from $(-\infty, \infty)$ to $(-1, 1)$ to overcome the sensitivity. Intrator (1990b) proposed using a sigmoidal function. In fact, this function has a shape similar to that of the transformation used by Friedman (1987). Here we study the effect of such a sigmoidal function on reducing the sensitivity to outliers.

3.1 Single Neuron. In the linear BCM theory (Intrator 1990b), the neuronal activity is $c = m \cdot d$, and the learning rule is
$$\dot{m} = \eta c\left(c - E[c^2]\right)d \tag{3.1}$$
This learning rule is sensitive to outliers according to Theorem 1 in Appendix A. The robust method replaces the neuronal activity by $c = \sigma(m \cdot d)$ (Intrator and Cooper 1992). The learning rule is
$$\dot{m} = \eta\,\sigma'\,c\left(c - E[c^2]\right)d \tag{3.2}$$
This learning rule satisfies the condition for robustness specified in Theorem 1 in Appendix A. This is because $\sigma'(x)\,\sigma^2(x)\,|x|$ vanishes and $\sigma'(x)$ is bounded when the value of $x = m \cdot d$ becomes large.
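The same single-update comparison used for the PCA rules (our own sketch; the threshold is frozen at an arbitrary value for a single step, whereas in the rule it is the sliding average $E[c^2]$) makes the contrast between equations 3.1 and 3.2 concrete:

```python
import numpy as np

m = np.array([1.0, 0.0])
theta = 1.0        # threshold; in the rule this is the running average E[c^2]

def bcm_linear_step(d, eta=0.01):
    c = m @ d                                # equation 3.1
    return eta * c * (c - theta) * d

def bcm_sigmoidal_step(d, eta=0.01):
    c = np.tanh(m @ d)                       # equation 3.2, with sigma = tanh
    ds = 1.0 - c ** 2                        # sigma'(m . d)
    return eta * ds * c * (c - theta) * d

for scale in [1, 10, 100, 1000]:
    d = scale * np.array([0.6, 0.8])
    print(scale,
          np.linalg.norm(bcm_linear_step(d)),     # grows ~ |d|^3
          np.linalg.norm(bcm_sigmoidal_step(d)))  # vanishes for outliers
```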
3.2 Network with Feedforward Inhibition. The analysis can be easily extended to a network with feedforward inhibition. For the case of linear neurons, the inhibited activity of the $i$th neuron is $\tilde{c}_i = m_i \cdot d - \gamma \sum_{j \ne i} m_j \cdot d = m \cdot \tilde{u}_i$ (Intrator and Cooper 1992), in which the vector $m$ (a concatenation of the $m_i$) is $m = (m_1^T, \ldots, m_k^T)^T$, and $\tilde{u}_i = (b_i^{(1)T}, \ldots, b_i^{(k)T})^T$ with $b_i^{(j)} = d$ for $j = i$ and $b_i^{(j)} = -\gamma d$ for $j \ne i$. The stochastic learning rule can be written as
$$\dot{m} = \eta \sum_i \tilde{c}_i\left(\tilde{c}_i - E[\tilde{c}_i^2]\right)\tilde{u}_i \tag{3.3}$$
or, more compactly, in matrix form (3.4). Similar to the linear single neuron case, this learning rule is subject to the influence of outliers according to Theorem 1 in Appendix A.
In the nonlinear case, they defined the inhibited activity to be $\tilde{c}_i = \sigma(m_i \cdot d - \gamma \sum_{j \ne i} m_j \cdot d)$. The learning rule is
$$\dot{m} = \eta \sum_i \sigma'\,\tilde{c}_i\left(\tilde{c}_i - E[\tilde{c}_i^2]\right)\tilde{u}_i \tag{3.5}$$
or, more compactly, in matrix form (3.6).
The conditions for robustness specified in Theorem 2 in Appendix A are satisfied for this learning rule. This is also due to the vanishing property of $\sigma'(x)\,\sigma^2(x)\,x$ and the boundedness of $\sigma^2(x)$ for a large value of $x$.

4 Summary
We use the influence function (Hampel et al. 1986) to evaluate several learning rules versus the influence of outliers. This is important because a real neuron can receive abnormal inputs. We discuss both Oja’s rules (Oja 1982, 1989, 1992) and their robust versions. We also examine Intrator’s rules (Intrator 1990b; Intrator and Cooper 1992). It is suggested that the learning rules with linear neurons are sensitive to the influence of outliers, whereas the sigmoidal property of biological neurons may help the learning to overcome the sensitivity.
Appendix A: Influence Functions and Robustness of Learning Rules

In this appendix, we discuss the technical issues related to the robustness of a learning rule. The definition of the influence function (Hampel et al. 1986) is introduced first. In Lemma 1 and Lemma 2, the influence functions for the learning rules $\dot{m} = \eta\psi(d, m)$ and $\dot{m} = \eta\psi(d, m, \theta)$ are calculated, respectively, where $m$ can be either a matrix or a vector, $\psi$ and $\varphi$ are smooth vector or matrix functions, and $\theta = E[\varphi(m, d)]$. In Theorem 1, the robustness conditions for these two learning rules are given. In Lemma 3, the influence function for the learning rule $\dot{m} = \eta\sum_k \psi[a_k(d), m, \theta_k]$ is calculated, where $a_k(d)$ is a function of $d$. Finally, the robustness conditions for this learning rule are given in Theorem 2.

In these lemmas and theorems, we assume that the weight $m(t)$ determined by a stochastic learning rule converges with probability one to the stable point $\hat{m}$ of the deterministic version of the learning rule (obtained by taking expectation on both sides of the stochastic learning rule), i.e.,
$$\lim_{t \to \infty} m(t) = \hat{m} \tag{A.1}$$
and the stable point $\hat{m}$ does exist in the finite domain. The conditions for the convergence have been given by Geman (1979) and extended by
Intrator (1990a), which conclude that the learning rate should be sufficiently small and some smoothness conditions should be imposed on the second derivative of the differential equations. For details, see Geman (1979) and Intrator (1990a).

Definition 1 (Hampel et al. 1986). The influence function of a learned weight $m$ at data point $d$ in feature space with data distribution $F$ is defined as
$$IF(m, d, F) = \lim_{\tau \to 0^+} \frac{m(G) - m(F)}{\tau} = \left.\frac{\partial m(G)}{\partial \tau}\right|_{\tau=0} \tag{A.2}$$
where $G = (1 - \tau)F + \tau\Delta_d$ and the distribution $\Delta_d(d')$ has mass 1 at $d$, i.e., the distribution density is $d\Delta_d(d') = \delta(d' - d)$.

Remark. Since a set of data points are realizations of the data distribution $F$, the learned weight $m$ is an implicit function of $F$. In the definition above, the numerator is the amount of change in the learned weight after a data perturbation at point $d$ in the input space. The perturbation is equivalent to adding a pattern $d$ with probability $\tau$ to the input environment. Like the local slope of a curve, which measures the local rate of change in the function value versus a change in the function argument, the above definition measures the sensitivity of the learned weight versus a data perturbation.
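Definition 1 can be probed numerically. The sketch below (ours, not from the paper) exploits the fact that the stable weight of Oja's rule is the principal eigenvector of the input covariance, so $m(G)$ under the contaminated mixture $G = (1 - \tau)F + \tau\Delta_d$ can be computed exactly; the finite-$\tau$ difference quotient grows without bound in $|d|$, anticipating the nonrobustness results for equation 2.1.

```python
import numpy as np

def pc(C):
    """Top eigenvector of covariance C: the stable weight of Oja's rule."""
    _, vecs = np.linalg.eigh(C)
    v = vecs[:, -1]
    return v if v[0] >= 0 else -v            # fix the sign for comparison

C_F = np.diag([4.0, 1.0])                    # covariance of the clean environment F
m_F = pc(C_F)
tau = 1e-6                                   # small contamination probability

for scale in [1, 10, 100, 1000]:
    d = scale * np.array([0.6, 0.8])         # the perturbing pattern
    C_G = (1 - tau) * C_F + tau * np.outer(d, d)   # covariance under G
    IF_est = (pc(C_G) - m_F) / tau           # finite-tau estimate of (A.2)
    print(scale, np.linalg.norm(IF_est))     # grows ~ |d|^2: unbounded influence
```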
Lemma 1. If the weight $m(t)$ determined by the stochastic learning rule $\dot{m} = \eta\psi(d, m)$ converges to the stable point $\hat{m}$ of the deterministic version of this learning rule, in which $\eta$ is the learning rate, the influence function of the optimal weight $\hat{m}$ at point $d'$ and data distribution $F$ is
$$IF(\hat{m}, d', F) = -\left\{E\left[\nabla_m \psi(d, \hat{m})\right]\right\}^{-1} \psi(d', \hat{m}) \tag{A.3}$$

Proof. The stable point $\hat{m}$ of the deterministic learning rule satisfies
$$E[\psi(d, \hat{m})] = 0 \tag{A.4}$$
or
$$\int \psi[d, \hat{m}(F)]\, dF(d) = 0 \tag{A.5}$$
The equation is satisfied for any distribution $F$. Substituting $G = (1 - \tau)F + \tau\Delta_{d'}$, we have
$$(1 - \tau) \int \psi[d, \hat{m}(G)]\, dF(d) + \tau\psi[d', \hat{m}(G)] = 0 \tag{A.6}$$
Taking a derivative with respect to $\tau$ and setting $\tau = 0$, we have
$$-\int \psi[d, \hat{m}(F)]\, dF(d) + \int \nabla_m \psi[d, \hat{m}(F)]\, dF(d)\, \left.\frac{\partial \hat{m}(G)}{\partial \tau}\right|_{\tau=0} + \psi[d', \hat{m}(F)] = 0 \tag{A.7}$$
Using equation (A.4) and solving for $\partial\hat{m}(G)/\partial\tau|_{\tau=0}$, we have equation (A.3). $\square$
Lemma 2. If the weight $m(t)$ determined by the stochastic learning rule $\dot{m} = \eta\psi(d, m, \theta)$ converges to the stable point $\hat{m}$ of the deterministic version of this learning rule, in which $\eta$ is the learning rate and $\theta = E[\varphi(m, d)]$, the influence function of the optimal weight $\hat{m}$ at point $d'$ and data distribution $F$ is
$$IF(\hat{m}, d', F) = -D^{-1}\left\{\psi(d', \hat{m}, \theta) + E[\nabla_\theta\psi(d, \hat{m}, \theta)]\,\varphi(\hat{m}, d')\right\} \tag{A.8}$$
in which
$$D = E[\nabla_m\psi(d, \hat{m}, \theta)] + E[\nabla_\theta\psi(d, \hat{m}, \theta)]\, E[\nabla_m\varphi(\hat{m}, d)] \tag{A.9}$$
Proof. The stable point $\hat{m}$ of the deterministic learning rule satisfies
$$E[\psi(d, \hat{m}, \theta)] = 0 \tag{A.10}$$
or
$$\int \psi[d, \hat{m}(F), \theta(F)]\, dF(d) = 0 \tag{A.11}$$
with
$$\theta(F) = \int \varphi[\hat{m}(F), d]\, dF(d) \tag{A.12}$$
Substituting $G = (1 - \tau)F + \tau\Delta_{d'}$, we have
$$(1 - \tau) \int \psi[d, \hat{m}(G), \theta(G)]\, dF(d) + \tau\psi[d', \hat{m}(G), \theta(G)] = 0 \tag{A.13}$$
in which
$$\theta(G) = (1 - \tau) \int \varphi[\hat{m}(G), d]\, dF(d) + \tau\varphi[\hat{m}(G), d'] \tag{A.14}$$
Taking a derivative with respect to $\tau$ and setting $\tau = 0$, we have
$$\int \nabla_m\psi[d, \hat{m}(F), \theta]\, dF(d)\, \frac{\partial \hat{m}}{\partial \tau} + \int \nabla_\theta\psi[d, \hat{m}(F), \theta]\, dF(d)\left\{\int \nabla_m\varphi(\hat{m}, d)\, dF(d)\, \frac{\partial \hat{m}}{\partial \tau} + \varphi(\hat{m}, d')\right\} + \psi[d', \hat{m}(F), \theta] = 0 \tag{A.15}$$
Using equation (A.10) and solving for $\partial\hat{m}(G)/\partial\tau|_{\tau=0}$, we have equation (A.8). $\square$

Theorem 1. For the stochastic learning rule $\dot{m} = \eta\psi(d, m)$, the robustness condition is that $\psi(d, \hat{m})$, as a function of $d$, is bounded. For the stochastic learning rule $\dot{m} = \eta\psi(d, m, \theta)$, in which $\theta = E[\varphi(m, d)]$, the robustness conditions are that $\psi(d, \hat{m}, \theta)$ and $\varphi(\hat{m}, d)$, as functions of $d$, are bounded.
Proof. According to Lemma 1 and Lemma 2, the theorem follows from the boundedness of the influence functions for both cases under the conditions specified. $\square$

Lemma 3. If the weight $m(t)$ determined by the stochastic learning rule $\dot{m} = \eta\sum_k \psi(a_k, m, \theta_k)$ converges to the stable point $\hat{m}$ of the deterministic version of this learning rule, in which $\eta$ is the learning rate, $\theta_k = E[\varphi(m, a_k)]$, and $a_k = a_k(d)$ is a function of $d$, the influence function of the optimal weight $\hat{m}$ at point $d'$ and data distribution $F$ is the analog of equation (A.8), with sums over $k$ replacing the single $\psi$ and $\varphi$ terms (equation A.16).

Proof. The argument parallels that of Lemma 2. The stable point $\hat{m}$ satisfies the stationarity condition (A.18), the analog of equation (A.10). Substituting $G = (1 - \tau)F + \tau\Delta_{d'}$, taking a derivative with respect to $\tau$, and setting $\tau = 0$, we have the analog of equation (A.15).
Using equation (A.18) and solving for $\partial\hat{m}(G)/\partial\tau|_{\tau=0}$, we have equation (A.16). $\square$

Theorem 2. For the stochastic learning rule $\dot{m} = \eta\sum_k\psi(a_k, m, \theta_k)$, in which $\eta$ is the learning rate, $\theta_k = E[\varphi(m, a_k)]$, and $a_k = a_k(d)$ is a function of $d$, the robustness conditions are that for each $k$, $\psi(a_k, \hat{m}, \theta_k)$ and $\varphi(m, a_k)$, as functions of $d$, are bounded.

Proof. It follows from Lemma 3. If the conditions specified in the theorem are satisfied, the influence function is bounded. $\square$

Appendix B: Derivation of the Robust PCA Rule for a Single Neuron

In this appendix, we derive the robust PCA learning rule for a single neuron (equation 2.2). Since the optimization criterion is to maximize $E[\sigma(m \cdot d)^2]$ subject to the constraint $m \cdot m = 1$, the stochastic version of the learning under this modified criterion is, for the $(n+1)$th step weight,
$$m^{(n+1)} = \frac{m^{(n)} + \eta\,\sigma(m^{(n)} \cdot d)\,\sigma'(m^{(n)} \cdot d)\,d}{\left|m^{(n)} + \eta\,\sigma(m^{(n)} \cdot d)\,\sigma'(m^{(n)} \cdot d)\,d\right|} \tag{B.1}$$
in which $\eta$ is the learning rate and generally should be small so that the learning can converge. Since $\eta$ is small, we can use the following expansion:
$$\frac{1}{\left|m^{(n)} + \eta\,\sigma(m^{(n)} \cdot d)\,\sigma'(m^{(n)} \cdot d)\,d\right|} = \frac{1}{\sqrt{1 + 2\eta\,\sigma(m^{(n)} \cdot d)\,\sigma'(m^{(n)} \cdot d)\,m^{(n)} \cdot d + O(\eta^2)}} = 1 - \eta\,\sigma(m^{(n)} \cdot d)\,\sigma'(m^{(n)} \cdot d)\,m^{(n)} \cdot d + O(\eta^2) \tag{B.2}$$
Combining equation (B.1) with (B.2), we have the robust Hebbian rule
$$m^{(n+1)} = m^{(n)} + \eta\,\sigma'(m^{(n)} \cdot d)\,\sigma(m^{(n)} \cdot d)\left[d - m^{(n)}(m^{(n)} \cdot d)\right] + O(\eta^2) \tag{B.3}$$
In the continuous limit, we have equation 2.2.

Appendix C: Derivation of the Robust PCA Rule for a Multineuron Network

In this appendix, we derive the robust PCA learning rule for a multineuron network (equation 2.5). Now the optimization criterion is to maximize $E[c^2] = E[\sum_i c_i^2]$ subject to the constraint $MM^T = I$, in which
+ g'(m!") I . d)cjm/"). d] mi")}
(C.6)
In the continuous limit, we have equation 2.5.
Acknowledgments
I thank all the members of the Institute for Brain and Neural Systems, in particular, Nathan Intrator for helpful discussion and for clarifying a lot
of places in the article, and Mark Chamness, Mike Perrone, and my wife Cong for improving the English. I also thank two anonymous referees for their invaluable suggestions. This research was supported by grants from NSF, ONR, and ARO.
References

Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Diaconis, P., and Freedman, D. 1984. Asymptotics of graphical projection pursuit. Ann. Stat. 12(3), 793-815.
Friedman, J. H. 1987. Exploratory projection pursuit. J. Am. Stat. Assoc. 82(397), 249-267.
Geman, S. 1979. A method of averaging equations with stability and stochastic approximations. In Approximate Solution of Random Equations, A. T. Bharucha-Reid, ed., pp. 49-86. Elsevier North Holland, Amsterdam.
Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Huber, P. J. 1981. Robust Statistics. Wiley, New York.
Huber, P. J. 1985. Projection pursuit. Ann. Stat. 13, 435-475.
Intrator, N. 1990a. An average result for random differential equations. Tech. Rep., Institute for Brain and Neural Systems, Brown University, Providence, RI.
Intrator, N. 1990b. A neural network for feature extraction. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 719-726. Morgan Kaufmann, San Mateo, CA.
Intrator, N., and Cooper, L. N. 1992. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks 5, 3-17.
Kohonen, T. 1989. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Berlin.
Linsker, R. 1986. From basic network principles to neural architecture. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Miller, K. D. 1990. Correlation-based models of neural development. In Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, eds., pp. 267-354. Lawrence Erlbaum, Hillsdale, NJ.
Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6, 98-124.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Oja, E. 1989. Neural networks, principal components, and subspaces. Intl. J. Neural Syst. 1, 61-68.
Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5, 927-935.
Sejnowski, T. J. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
von der Malsburg, C. 1973. Self-organization of orientation selective cells in the striate cortex. Kybernetik 14, 85-100.
Received June 11, 1993; accepted March 23, 1994.
Communicated by Gerald Tesauro
Boosting and Other Ensemble Methods

Harris Drucker*
Corinna Cortes
L. D. Jackel
Yann LeCun
Vladimir Vapnik
AT&T Bell Laboratories, Holmdel, NJ 07733 USA
We compare the performance of three types of neural network-based ensemble techniques to that of a single neural network. The ensemble algorithms are two versions of boosting and committees of neural networks trained independently. For each of the four algorithms, we experimentally determine the test and training error curves in an optical character recognition (OCR) problem as both a function of training set size and computational cost, using three architectures. We show that a single machine is best for small training set size while for large training set size some version of boosting is best. However, for a given computational cost, boosting is always best. Furthermore, we show a surprising result for the original boosting algorithm: namely, that as the training set size increases, the training error decreases until it asymptotes to the test error rate. This has potential implications in the search for better training algorithms.

*Corresponding address: Monmouth College, West Long Branch, NJ 07763 USA.

Neural Computation 6, 1289-1301 (1994) © 1994 Massachusetts Institute of Technology

Introduction

There is interest in using a committee of learning machines to improve performance over that of a single learning machine. Perrone and Cooper (1992) did a study of what they termed ensemble methods (and others call committee machines), and give extensive references to other publications. In addition, there has been work done on boosting by Drucker et al. (1993a,b), and other work on committees by Srihari (1992), Suen et al. (1992), Benediktsson and Swain (1992), and Hansen and Salamon (1990). In many of these cases, the committee members are neural networks trained on the same data, but initialized with different weights. The expectation is that the different networks converge to different local minima and by somehow combining the outputs, performance can be improved. Boosting is different in that each learning machine in an
ensemble is trained sequentially on patterns that have been filtered by the previously trained members of the ensemble. Most ensemble methods have a training set with identical statistics for each member of the ensemble, while each member in the ensemble of a boosting machine has a training set whose statistics are different because they have been altered by filtering them through the previously trained machines.

After training an ensemble, one usually has to decide how to weight the outputs. One may use a straight voting scheme, which leads to problems in other than the two-class problem. (How does one make a decision in a committee of three if each member is voting for a different class?) Alternatively, one may use a weighted linear addition of the outputs of each machine. However, this question usually comes up after the networks in the ensemble have been trained and is not explicitly part of the training process. This is not true in boosting. Moreover, in recent work on "mixtures of experts" done by Jordan and Jacobs (1992) based on the expectation-maximization algorithm, both the networks and what they call a gating network are trained. The basic idea is to build up a committee of experts, each of which is good on a subset of the problem, and then gate the experts together. However, these techniques have not yet been applied to OCR problems.

In assessing performance on learning machines, one usually has a training and test set. The test set performance is also termed the generalization ability, and a plot of test error rate versus training set size is usually called the learning curve. For a finite capacity network (Vapnik 1982), one expects the difference between the test and training error rate to approach zero as training set size is increased. We have to carefully distinguish between the size of the training set and the actual number of patterns trained on. Assume we have a training set of size N, which may consist of what we may call (but find hard to define) typical patterns, hard patterns, easy patterns, mislabeled patterns, and outliers (not all mutually exclusive). By judiciously picking a subset of size Nc to actually train on, we may get better generalization performance than if we had trained on all N patterns. Of course one has to have some procedure to pick this subset, but if we have some "smart" procedure we obtain two advantages: generalization is better and time to train is less. Of course, there is some computational cost in examining patterns and then discarding them, but that cost is typically much less than using them in training, which requires a forward propagation, backward propagation, and weight updates for each pattern for each epoch. There are a number of procedures that either discard data or pick data intelligently, including the references above concerning boosting and other work done by Seung et al. (1992), Matic et al. (1992), Freund (1990), Baum and Lang (1991), and Atlas et al. (1990). To distinguish between N and Nc, we shall call the former the training set size and the latter the computational cost. When we do not explicitly train on all N patterns, the computational cost is a fair method to compare algorithms.
The interest here is in comparing ensemble learning algorithms on an OCR problem as a function of the training set size and the computational cost. We therefore experimentally answer the following two questions for the algorithms investigated: given a training set size, which algorithm is best? Also, for a given compute time, which algorithm is best? We have also determined that the training error curve (plotted against training set size) for the boosting algorithm has a surprising characteristic that has implications in the search for better algorithms.
1 Theory of Boosting

Boosting is an algorithm invented by Robert Schapire (1990) that, under certain conditions, allows one to improve the performance of any learning machine. It was first designed in the context of the so-called probably approximately correct (PAC) learning model. In the standard PAC model (sometimes called the strong learning model), the learner must be able to produce a hypothesis with error at most ε, for arbitrarily small positive values of ε. Since the learner is receiving random examples, there is always the chance that it will receive a highly unrepresentative sample that prevents it from learning anything meaningful about the target concept. We therefore ask only that the learner succeed in finding a good approximation to the target concept with probability at least 1 − δ, where δ is an arbitrarily small positive number. In a variation of the PAC model called the weak learning model, this requirement is relaxed dramatically. In this model, which was introduced by Kearns and Valiant (1989), the learner is required only to produce hypotheses with error rates slightly less than 1/2. Since a hypothesis that guesses entirely at random on every example will achieve an error rate of exactly 1/2, the weak learning model essentially asks only that the learner be able to produce hypotheses that perform slightly better than random guessing.

The main result of Schapire's paper was a proof that the strong and weak learning models are actually equivalent. That is, Schapire gave a provably correct technique for converting any learning algorithm that performs only slightly better than random guessing into one that produces hypotheses with arbitrarily small error rates. The technique is described below (in the Procedure section) for creating a composite or ensemble hypothesis from three subhypotheses that were trained on three different distributions. Schapire proves that if these three weak subhypotheses each have an error rate of α < 1/2 (with respect to the distribution on which they were trained), then the resulting ensemble hypothesis will have an error rate of 3α² − 2α³, which is significantly less than α. Applying this technique recursively, Schapire was able to show how the error rate can be made arbitrarily small.
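To see numerically how fast the recursion shrinks the error, the bound can be iterated in a few lines of Python (our illustration, not part of Schapire's paper; the starting error of 0.4 is arbitrary):

    def boosted_error(a, levels):
        # One level of Schapire's construction maps error a to 3a^2 - 2a^3.
        for _ in range(levels):
            a = 3 * a**2 - 2 * a**3
        return a

    for level in range(4):
        print(level, round(boosted_error(0.4, level), 4))
    # prints 0.4, 0.352, 0.2845, 0.1967: each level drives the error lower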
2 Procedure
A database of 120,000 handwritten and segmented digits from the National Institute of Standards and Technology (NIST) was used for evaluation: one set of 2000 was used for testing, and randomly chosen subsets of the remaining 118,000 were used for training. The pixel arrays were subsampled to give a 10 by 10 input space so that the algorithms could run in a reasonable time. For each of the four algorithms, each of the three architectures, and a particular training set size, the training and test performance were evaluated 10 times (with different initializations of the weights and different sets of training patterns) to obtain a mean for that training set size and computational cost. The average test and training performance were then plotted as functions of the training set size and the computational cost. The objective here was not to get the best performance by choosing good architectures but to compare the algorithms. The choices of architecture and input space size were also dictated by the size of the database: we wanted to be sure that, for some training set size less than 118,000, the error rates on both the test and training sets would be approximately equal, indicating that the best performance had been reached within the limits of the capacity of the network (Vapnik 1982).

Training of a single machine proceeds as follows: N1 patterns are chosen randomly and the weights are initialized to small random values. The network is trained using backpropagation (with stochastic gradient descent) until it reaches a minimum of the mean square error. The minimum is defined to be the epoch at which the relative difference between the mean square error of the present epoch and that of the previous epoch is less than 0.25%. The test performance is then evaluated. This is repeated 10 times for the same size training set (but with randomly chosen patterns and initial weights). We then iterate over training set sizes until the test and training error rates are approximately equal. Here N = N1 = Nc.

For what we term a parallel machine, we randomize three machines and train each with a different training set, each of size N1. Each is trained separately. In previous work (Drucker et al. 1993a), we determined that addition of the respective outputs (the digit-0 outputs of the three machines added, etc.) gives better performance than simple voting (verified here also). Thus, to evaluate the training performance we present the entire N = 3N1 patterns to each of the three machines and add the respective 10 outputs to get one set of 10 outputs. The maximum output corresponds to the chosen label. Both the training set size and the computational cost are 3N1. We also tried a committee with five members.

In the original form of the boosting algorithm, a first machine is trained with N1 patterns. We then use the first machine to filter another set of training patterns in the following manner: flip a fair coin. If heads, pass new patterns through the first machine and discard correctly classified patterns until the first machine misclassifies a pattern. That misclassified pattern is added to the training set for the second
machine. If tails, pass new patterns through the first machine and discard incorrectly classified patterns until the first machine correctly classifies a pattern, then add that pattern to the training set. Iterate this procedure until there are a total of N1 examples, which become the training set for the second machine in the ensemble. The coin flipping ensures that the second set of N1 training examples is such that, if passed through the first machine, it would show a 50% error rate. Thus these new N1 patterns have a different distribution than the set used to train the first machine. This is critically important in the proof of Schapire's algorithm, because the second machine must now learn a different distribution; this should be contrasted with a parallel machine, where each network learns the same distribution. Now define N2 as the number of examples that must be filtered to obtain the N1 patterns that train the second machine. Note that N1 is fixed and N2 depends on the generalization error rate of the first machine.

Once the second machine is trained in the usual manner, a third training set is formed as follows: pass a new pattern through the first two machines. If the two machines agree, discard the pattern. If the two machines disagree, that pattern becomes part of the training set for the third machine. Continue until there are a total of N1 patterns. Say there are N3 patterns that have to be examined to find the N1 patterns that form the training set for the third machine. The total size of the training set is N = N1 + N2 + N3, because this many patterns have to be examined (although not all contribute to the computational cost), while the computational cost is Nc = 3N1, because these are the patterns actually trained on. The algorithm is therefore "smart" in the sense that there is a large pool of training examples, but test performance is better if one selectively uses a subset (of size 3N1) for the actual training.

In the original theoretical derivation of the algorithm, evaluation of the test performance was as follows: present a test pattern to the three machines. If the first two machines agree, use that label; otherwise use the label assigned by the third machine. As stated before, we have experimentally shown that adding the three sets of outputs is much better than voting, and that is the approach we take. The training performance for the boosting machine (called the original boosting algorithm here) is determined by presenting the N patterns to the three machines and adding the respective outputs. The test performance is evaluated by presenting the 2000 test patterns to the ensemble. In obtaining learning curves for this algorithm we increment N1 but plot training and test error rates versus both the training set size N and the computational cost (3N1). Because the boosting machine uses so many patterns, it was not possible to construct an ensemble with more than three members within the constraint of the size of the NIST database.
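The two filtering steps described above can be summarized in the following sketch (ours, not the authors' code; `stream` is assumed to be an iterator yielding an ample supply of fresh labeled patterns, and `machine1` and `machine2` map a pattern to a predicted label):

    import random

    def second_training_set(stream, machine1, n1):
        # Flip a fair coin; on heads keep the next pattern machine 1 gets
        # wrong, on tails the next pattern it gets right.
        train2, n2 = [], 0
        while len(train2) < n1:
            want_mistake = random.random() < 0.5      # the fair coin
            for x, y in stream:                       # fresh patterns
                n2 += 1
                if (machine1(x) != y) == want_mistake:
                    train2.append((x, y))
                    break
        return train2, n2    # n2 patterns examined to keep n1 of them

    def third_training_set(stream, machine1, machine2, n1):
        # Keep only the patterns on which the first two machines disagree.
        train3, n3 = [], 0
        for x, y in stream:
            n3 += 1
            if machine1(x) != machine2(x):
                train3.append((x, y))
                if len(train3) == n1:
                    break
        return train3, n3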
The original boosting algorithm always uses new examples for each machine. If the error rate of the first machine is low, then many patterns may have to be examined until N1 patterns are obtained to train the second machine. If a large training set is not available, then one might consider a modified algorithm. The modified boosting algorithm starts out with a first training set of size N4. After training the first machine, we reuse the N4 patterns, passing all of these old patterns through the first machine and keeping a subset such that the new training set consists of patterns of which 50% are incorrectly classified by the first machine and 50% are classified correctly. Call this training set size N5. Note that here N4 is fixed and N5 depends on the training error of the first machine: if the error rate of the first machine is very low, then there may be very few examples for the second machine to train on. After the second machine is trained, the N4 patterns are reused by presenting them to both trained machines. If the two machines disagree, the pattern is added to the training set; otherwise it is discarded. Call the number of patterns that the two machines disagree on N6. The total training set size is fixed at N4, but the computational cost is N4 + N5 + N6.

We first show (Fig. 1) the training and test performance for a boosted network using the original boosting algorithm (each constituent network is 100-10-10). Error bars show plus and minus one standard deviation.

Figure 1: Training and test error rate for a boosting network using three 100-10-10 networks. (The test error rate asymptotes at 6.3%; the horizontal axis is the training set size.)
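The data reuse that distinguishes the modified algorithm can be sketched as follows (our illustration; `train4` is the fixed list of N4 labeled patterns, and the machines map a pattern to a predicted label):

    import random

    def second_set_reusing(train4, machine1):
        # Keep a subset that machine 1 classifies 50% wrong and 50% right;
        # its size (N5) is limited by the number of training mistakes.
        wrong = [(x, y) for x, y in train4 if machine1(x) != y]
        right = [(x, y) for x, y in train4 if machine1(x) == y]
        k = min(len(wrong), len(right))
        return random.sample(wrong, k) + random.sample(right, k)

    def third_set_reusing(train4, machine1, machine2):
        # Reuse the N4 patterns once more, keeping the N6 disagreements.
        return [(x, y) for x, y in train4 if machine1(x) != machine2(x)]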
The negative slope of the boosted training error rate is somewhat surprising, because a single 100-10-10 network has a training error rate that monotonically increases until it asymptotes to the test error rate. The explanation is as follows. Recall that the training set of size N consists of a subset of size 3N1 actually used for training; the remainder are examined but filtered out. As an example, suppose the training set for the first network (N1) is of size 100. The training performance is very good (approximately 2%), but the generalization is very poor (23%). This means that the trained first network has to sort through only approximately 217 (50/0.23) examples never seen before to find 50 patterns in error, and approximately 65 (50/0.77) examples to find 50 patterns classified correctly, giving N2 = 217 + 65 = 282 examples. The 182 examples examined but filtered out have an approximately zero error rate. Furthermore, since the second network will have poor generalization based on its 100 training examples, not many samples will have to be examined to find 100 examples on which the first two networks disagree. When the entire training set (including the patterns filtered out) is presented to the three networks to find the overall training error rate, the performance is expected to be poor, because one really has three networks with poor generalization.

Now consider what happens when we increase the training set size. The first network has a worse training error, but the generalization is better. Therefore more examples have to be sorted through to find the second and third training sets. As a consequence, the training set includes proportionately more examples that are filtered out and not used for actual training. These filtered-out examples have close to zero error rate and tend to drive down the training error rate, which includes both the error rate on the filtered-out examples and that on the samples actually trained on (the computational cost). In summary, as the training set size increases (Fig. 2), both N and Nc increase, but not at the same rate, so the ratio (3N1)/N decreases. In the limit, this algorithm uses only 20% of the training set to actually train on.

The parallel machines and the single machine use all the patterns in the training set to train on; therefore the ratio of computational cost to training set size is 1 for these algorithms. The modified boosting algorithm is more expensive: the computational cost is larger than the training set size.

Figure 3 shows the test performance of the four algorithms. Generally, each algorithm has a range where it is superior. For small training set size a single machine is better, while for large training set size both the modified and the original boosting algorithms give similar results (with differences statistically insignificant at the largest training set size). However, the modified boosting algorithm is better over a broader range. We did not want to clutter this figure with error bars for each curve; however, we do report in Table 1 the smallest test error rate and standard deviation, which occur when the training and test performance are approximately equal.
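The counts in the example above follow from a simple expectation, sketched here for an arbitrary generalization error rate e of the first machine (our illustration):

    def expected_examined(n1, e):
        # Finding n1/2 mistakes at error rate e takes about (n1/2)/e fresh
        # patterns; finding n1/2 correct ones takes about (n1/2)/(1 - e).
        return (n1 / 2) / e + (n1 / 2) / (1 - e)

    print(round(expected_examined(100, 0.23)))   # about 282, as in the text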
Figure 2: (Computational cost)/(training set size) versus training set size. For single and parallel machines, this ratio is 1. For the original boosting algorithm, not all the patterns in the training set are used, and the ratio is less than 1. For the modified boosting algorithm the ratio is greater than 1.

For the 100-10-10 network, we ask which algorithm is better if one has a given time to train. This is shown as a plot of test error rate versus computational cost (Fig. 4), which clearly shows that the original boosting algorithm is better. This is not unexpected considering Figure 2, which shows that a smaller and smaller part of the training set contributes to the computational cost as the training set size increases.

We performed the same experiments on a single-layer fully connected 100-10 network and a weight-sharing network (LeCun et al. 1990) with 5400 connections but only 1014 free weights. The associated curves have the same general appearance as those in Figures 3 and 4, with the asymptotic characteristics given in Table 1. An unusual characteristic of the 100-10 network is that the parallel and single machines have the same training and test performance for large training set size. This is not surprising in light of the fact that single-layer networks have one minimum, and all the machines in the parallel set eventually obtain the same decision boundary. For this network, a single machine is as good as a
parallel machine. The training error curves for all these architectures would have shown the following characteristics: a negative slope for the original boosting algorithm; an initial positive slope, then turning flat or negative, for the modified algorithm; and positive slopes for the remaining algorithms.

Figure 3: Test performance of four algorithms using a 100-10-10 network.

As pointed out previously, increasing the number of members in the ensemble is useless for a single-layer network. For the original boosting algorithm there are not enough potential training patterns in the NIST database to enlarge the ensemble, and it is too computationally expensive to do so with the modified boosting algorithm. We therefore examined the possibility of using five members in the parallel machine for the 100-10-10 and weight-sharing networks. Table 1 shows the best results. As can be seen, this does not give a statistically significant improvement in performance. The study by Perrone and Cooper (1993) shows that beyond five members in an ensemble there is little improvement in what they term the "figure-of-merit," which includes both the error rate and the ability to reject patterns (i.e., make no decision).
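As one illustration of the rejection idea (ours, not Perrone and Cooper's procedure), an ensemble can withhold a decision whenever its summed outputs are nearly tied:

    import numpy as np

    def classify_with_reject(outputs, margin=0.5):
        # Sum the members' outputs; refuse to decide when the top two
        # summed scores are closer than `margin` (an arbitrary threshold),
        # trading a higher reject rate for a lower error rate.
        total = np.sum(outputs, axis=0)
        ranked = np.sort(total)
        if ranked[-1] - ranked[-2] < margin:
            return None                  # no decision
        return int(np.argmax(total))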
Table 1: Minimum Test Error Rate and Standard Deviation (in Percent) for the Three Networks and Four Algorithms

Network              Algorithm                       Mean   Standard deviation
100-10-10            Single network                   9.0   0.61
                     Original boosting                6.3   0.34
                     Modified boosting                6.1   0.28
                     Parallel machine (3 members)     7.7   0.31
                     Parallel machine (5 members)     7.6   0.38
100-10               Single network                  12.1   0.37
                     Original boosting                8.9   0.47
                     Modified boosting                8.8   0.50
                     Parallel machine (3 members)    11.9   0.23
Structured network   Single network                   8.9   0.55
                     Original boosting                5.2   0.87
                     Modified boosting                5.2   0.28
                     Parallel machine (3 members)     7.1   0.51
                     Parallel machine (5 members)     6.8   0.58
3 Conclusions
Figures 3 and 4 allow us to conclude that, for a given computational cost, the original form of boosting is best, although both boosting algorithms achieve the same asymptotic performance. It may seem that using a subset of the training set for the actual training is "throwing away data," but by using the networks to select the data to be trained on, one can do better than by training on all the data.

Figure 4: Test error rate versus computational cost for a 100-10-10 network.

In this investigation we constrained the architectures and the input space so that the networks could be trained to their best performance within the constraints of the size of the database. One can do much better if one is willing to synthetically enlarge the database: one can then use all the training data, with "new data" arising from random deformations of the original data. Deforming data (sometimes called image defect modeling) has been used extensively to improve the performance of non-neural-network-based printed document readers (Baird 1993a,b). In Drucker et al. (1993a,b), we detail the deformation procedures. Using deformations, 60,000 training examples, a full 28 x 28 input space, 10,000 test examples, and a very large shared-weight network with approximately 13,000 neurons, 17,000 weights, and 260,000 connections, we are able to obtain a test error rate of 0.70% using boosting. For comparison purposes, results were obtained for the following networks (in increasing order of error rate): the single shared-weight network with a 1.13% error rate, a fully connected 784-300-10 network with a 1.6% error rate, and a single-layer 784-10 network with an 8.4% test error rate.

An important observation made in this study is that the training error curve for the original boosting algorithm has a negative slope until it asymptotes to some fixed value of training error. We make the following conjectures: (1) the capacity (in some sense that we cannot yet define) increases with increasing training set size until it reaches an asymptotic value. (2) "Good" algorithms will have this same negative slope characteristic. We know that the difference between the test and training error rates should go to zero as we increase the size of the training set; therefore, it is highly desirable that the training error curve have this negative slope, so that the test error rate follows the training error rate down to some small value. (3) Constructive algorithms such as cascade architectures (Littman and Ritter 1993), cascade-correlation (Fahlman and Lebiere 1990), and tiling (Mezard and Nadal 1989), which build networks incrementally, and the algorithms referenced above that use queries, filtering, or hierarchical structures, or that in general adapt the structure of the network or discard data intelligently, have (or should have) this same negative slope. We are attempting to prove or disprove these conjectures.
References

Atlas, L., Cohn, D., Ladner, R., El-Sharkawi, M. A., Marks II, R. J., Aggoune, M. E., and Park, D. C. 1990. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 566-573. Morgan Kaufmann, San Mateo, CA.

Baird, H. S. 1993a. Calibration of document image defect models. Proc. 2nd Ann. Symp. Document Anal. Inform. Retrieval (available from Information Science Research Center, University of Nevada, Las Vegas).

Baird, H. S. 1993b. Document image defect models and their uses. Proc. IAPR 2nd Int. Conf. Document Anal. Recognition, Tsukuba Science City, Japan (available from Computer Society Press), 20-22.

Baum, E. B., and Lang, K. J. 1991. Constructing hidden units using examples and queries. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 904-910. Morgan Kaufmann, San Mateo, CA.

Benediktsson, J. A., and Swain, P. H. 1992. Consensus theoretic classification methods. IEEE Trans. Syst. Man Cybern. 22(4), 688-704.

Drucker, H., Schapire, R., and Simard, P. 1993a. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 42-49. Morgan Kaufmann, San Mateo, CA.

Drucker, H., Schapire, R., and Simard, P. 1993b. Boosting performance in neural networks. Int. J. Pattern Recognition Artif. Intelligence 7(4), 704-709.

Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 524-532. Morgan Kaufmann, San Mateo, CA.

Freund, Y. 1990. Boosting a weak learning algorithm by majority. Proc. Third Annu. Workshop Comput. Learning Theory, 202-206.

Hansen, L. K., and Salamon, P. 1990. Neural network ensembles. IEEE Trans. Pattern Anal. Machine Intelligence 12(10), 993-1001.

Jordan, M. I., and Jacobs, R. A. 1992. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 985-993. Morgan Kaufmann, San Mateo, CA.
Kearns, M., and Valiant, L. G. 1989. Cryptographic limitations on learning Boolean formulae and finite automata. Proc. Nineteenth Annu. ACM Symp. Theory of Computing, 443-444.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 396-404. Morgan Kaufmann, San Mateo, CA.

Littman, E., and Ritter, H. 1993. Generalization abilities of cascade network architectures. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 188-195. Morgan Kaufmann, San Mateo, CA.

Matic, N., Guyon, I., Bottou, L., Denker, J., and Vapnik, V. 1992. Computer aided cleaning of large databases for character recognition. In Proceedings of the 12th International Conference on Pattern Recognition, vol. 2, pp. 330-333. IEEE Computer Society Press, Los Angeles, CA.

Mezard, M., and Nadal, J.-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A 22, 2191-2204.

Perrone, M. P., and Cooper, L. N. 1993. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing, R. J. Mammone, ed. Chapman-Hall, London.

Schapire, R. 1990. The strength of weak learnability. Machine Learn. 5(2), 197-227.

Seung, H. S., Opper, M., and Sompolinsky, H. 1992. Query by committee. In Proceedings of the 1992 Conference on Learning Theory, pp. 287-294. ACM.

Srihari, S. 1992. High-performance reading machines. Proc. IEEE 80(7), 1120-1132.

Suen, C. Y., et al. 1992. Computer recognition of unconstrained handwritten numerals. Proc. IEEE 80(7), 1162-1180.

Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.

Received November 5, 1993; accepted March 15, 1994.
Index Volume 6 By Author
Amari, S. I. - See Kawanabe, M.
Amit, D. and Fusi, S. Learning in Neural Networks with Material Synapses (Letter)
6(5):957-982
Armentrout, S. L. - See Sutton III, G. G.
Atick, J. J. - See Li, Z.
Auer, P. - See Hornik, K.
Babloyantz, A. - See Lourenço, C.
Balázs, L. - See Szepesvári, C.
Baldi, P. and Chauvin, Y. Smooth On-Line Learning Algorithms for Hidden Markov Models (Letter)
6(2):307-318
Barrow, H. G. - See Goodhill, G. J.
Beaufays, F. and Wan, E. A. Relating Real-Time Backpropagation and Backpropagation-Through-Time: An Application of Flow Graph Interreciprocity (Letter)
6(2):296-306
Benaim, M. On Functional Approximation with Normalized Gaussian Units (Letter)
6(2):319-333
Bernander, O., Koch, C., and Usher, M. The Effect of Synchronized Inputs at the Single Neuron Level (Letter)
6(4):622-641
Buonomano, D. V. and Mauk, M. D. Neural Network Model of the Cerebellum: Temporal Discrimination and the Timing of Motor Responses (Letter)
6(1):38-55
Burgi, P. Y. and Grzywacz, N. M. Model Based on Extracellular Potassium for Spontaneous Synchronous Activity in Developing Retinas (Letter)
6(5):983-1004
Cardell, N. S., Joerding, W., and Li, Y. Why Some Feedforward Networks Cannot Learn Some Polynomials (Letter)
6(4):761-766
Chauvin, Y. - See Baldi, P.
Cheung, M. F., Passino, K. M., and Yurkovich, S. Supervised Training of Neural Networks via Ellipsoid Algorithms (Letter)
6(4):748-760
Connolly, T. H. - See Gorinevsky, D.
Cortes, C. - See Drucker, H.
Cugliandolo, L. F. Correlated Attractors from Uncorrelated Stimuli (Note)
6(2):220-224
D'Autrechy, C. L. - See Sutton III, G. G.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. An Efficient Method for Computing Synaptic Conductances Based on a Kinetic Model of Receptor Binding (Note)
6(1):14-18
Doya, K. and Selverston, A. I. Dimension Reduction of Biological Neuron Models by Artificial Neural Networks (Letter)
6(4):696-717
Drucker, H., Cortes, C., Jackel, L. D., LeCun, Y., and Vapnik, V. Boosting and Other Ensemble Methods (Letter)
6(6):1289-1301
Ermentrout, B. Reduction of Conductance-Based Models with Slow Synapses to Neural Nets (Letter)
6(4):679-695
Ermentrout, B. and Kopell, N. Learning of Phase Lags in Coupled Neural Oscillators (Letter)
6(2):225-241
Fahmy, M. M. - See Osman, H.
Fanelli, R. - See Manolios, P.
Fetz, E. E. - See Munro, E. E.
Fetz, E. E. - See Murthy, V. N.
Field, D. J. What is the Goal of Sensory Coding? (Article)
6(4):559-601
Finnoff, W. Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima (Letter)
6(2):285-295
Fu, A. M. N. Statistical Analysis of an Autoassociative Memory Network (Note)
6(5):837-841
Fusi, S. - See Amit, D.
Gascuel, J. D., Moobed, B., and Weinfeld, M. An Internal Mechanism for Detecting Parasite Attractors in a Hopfield Network (Letter)
6(5):902-915
Gee, A. H. and Prager, R. W. Polyhedral Combinatorics and Neural Networks (Letter)
6(1):161-180
Georgopoulos, A. P. - See Lukashin, A. V.
Goodhill, G. J. and Barrow, H. G. The Role of Weight Normalization In Competitive Learning (Letter)
6(2):255-269
Goodhill, G. J. and Willshaw, D. J. Elastic Net Model of Ocular Dominance: Overall Stripe Pattern and Monocular Deprivation (Letter)
6(4):615-621
Gorinevsky, D. and Connolly, T. H. Comparison of some Neural Network and Scattered Data Approximations: The Inverse Manipulator Kinematics Example (Letter)
6(3):521-542
Gottschalk, A., Ogilvie, M. D., Richter, D. W., and Pack, A. I. Computational Aspects of the Respiratory Pattern Generator (Letter)
6(1):56-68
Gray, C. M. - See Rhodes, P. A. Grzywacz, N. M. - See Burgi, P. Y. Hansen, L. K. and Rasmussen, C. E. Pruning from Adaptive Regularization (Letter) Hayashi, Y. Numerical Bifurcation Analysis of an Oscillatory
6(6):1222-1231
Index
1306
Neural Network with Synchronous/Asynchronous Connections (Letter)
6(4):658-667
Hooper, S. L. - See LoFaro, T. Hornik, K. - See Kuan, C.-M. Hornik, K., Stinchcombe, M., White, H., and Auer, P. Degree of Approximation Results for Feedforward Networks Approximating Unknown Mapping and Their Derivatives (Letter) 6(6):1261-1274 Ito, Y.
Approximation Capability of Layered Neural Networks with Sigmoid Units on Two Layers (Letter)
6(6):1232-1242
Jaakkola, T., Jordan, M. I., and Singh, S. P. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms (Letter)
6(6):1184-1200
Jackel, L. D. - See Drucker, H.
Jacobs, R. A. - See Jordan, M. I.
Joerding, W. - See Cardell, N. S.
Johnson, M. H. - See O'Reilly, R. C.
Jordan, M. I. - See Jaakkola, T.
Jordan, M. I. - See Saul, L.
Jordan, M. I. and Jacobs, R. A. Hierarchical Mixtures of Experts and the EM Algorithm (Article)
6(2):181-214
Kainen, P. C. - See Kůrková, V.
Karhunen, J. Stability of Oja's PCA Subspace Rule (Letter)
6(4):739-747
Kawanabe, M. and Amari, S. I. Estimation of Network Parameters in Semiparametric Stochastic Perceptron (Letter)
6(6):1243-1260
Koch, C. - See Bernander, O.
Koch, C. - See Usher, M.
Koiran, P. Dynamics of Discrete Time, Continuous State Hopfield Networks (Letter)
6(3):459-468
Kondo, K. - See Matsumoto, T.
Kopell, N. - See Ermentrout, B.
Kopell, N. - See LoFaro, T.
Kosowsky, J. J. - See Yuille, A. L.
Kuan, C.-M., Hornik, K., and White, H. A Convergence Result for Learning in Recurrent Neural Networks (Letter)
6(3):420-440
Kůrková, V. and Kainen, P. C. Functionally Equivalent Feedforward Neural Networks (Letter)
6(3):543-558
Lai, Y. C., Winslow, R. L., and Sachs, M. B. The Functional Role of Excitatory and Inhibitory Interactions in Chopper Cells of the Anteroventral Cochlear Nucleus (Letter)
6(6):1126-1139
LeCun, Y. - See Drucker, H.
LeCun, Y. - See Vapnik, V.
Lee, J. A Novel Design Method for Multilayer Feedforward Neural Networks (Letter)
6(5):885-901
Levin, E. - See Vapnik, V.
Levy, W. B. - See Minai, A. A.
Lewicki, M. S. Bayesian Modeling and Classification of Neural Signals (Letter)
6(5):1005-1030
Li, Y. - See Cardell, N. S.
Li, Z. and Atick, J. J. Toward a Theory of the Striate Cortex (Letter)
6(1):127-146
Liu, Y. Influence Function Analysis of PCA and BCM Learning (Letter)
6(6):1275-1287
LoFaro, T., Kopell, N., Marder, E., and Hooper, S. L. Subharmonic Coordination in Networks of Neurons with Slow Conductances (Letter)
6(1):69-84
Lorincz, A. - See Szepesvári, C.
Lourenço, C. and Babloyantz, A. Control of Chaos in Networks with Delay: A Model for Synchronization of Cortical Tissue (Letter)
6(6):1140-1153
Lukashin, A. V. and Georgopoulos, A. P. A Neural Network for Coding of Trajectories by Time Series of Neuronal Population Vectors (Letter)
6(1):19-28
Luttrell, S. P. A Bayesian Analysis of Self-organizing Maps (Article)
6(5):767-794
Maass, W. Neural Nets with Superlinear VC-Dimension (Letter)
6(5):877-884
MacKay, D. J. C. - See Miller, K. D.
Mainen, Z. F. - See Destexhe, A.
Manolios, P. and Fanelli, R. First-Order Recurrent Neural Networks and Deterministic Finite State Automata (Letter)
6(6):1154-1172
Marder, E. - See LoFaro, T.
Matsumoto, T. and Kondo, K. Realization of the "Weak Rod" by a Double Layer Parallel Network (Letter)
6(5):944-956
Mauk, M. D. - See Buonomano, D. V.
McAuley, J. D. and Stampfli, J. Analysis of the Effects of Noise on a Model for the Neural Mechanism of Short-Term Active Memory (Letter)
6(4):668-678
Mel, B. W. Information Processing in Dendritic Trees (Review)
6(6):1031-1085
Miller, K. D. and MacKay, D. J. C. The Role of Constraints in Hebbian Learning (Letter)
6(1):100-126
Minai, A. A. and Levy, W. B. Setting the Activity Level in Sparse Random Networks (Letter)
6(1):85-99
Moobed, B. - See Gascuel, J. D.
Munro, E. E., Shupe, L., and Fetz, E. E. Integration and Differentiation in Dynamic Recurrent Neural Networks (Letter)
6(3):405-419
Murthy, V. N. and Fetz, E. E. Effects of Input Synchrony on the Firing Rate of a Three-Conductance Cortical Neuron Model (Letter)
Nadal, J. P. and Parga, N. Duality Between Learning Machines: A Bridge Between Supervised and Unsupervised Learning (Letter)
6(3):491-508
Nelson, M. E. A Mechanism for Neuronal Gain Control by Descending Pathways (Letter)
6(2):242-254
Niebur, E. and Worgotter, F. Design Principles of Columnar Organization in Visual Cortex (Note)
6(4):602-614
Niranjan, M. - See Shadafan, R. S.
Olami, Z. - See Usher, M.
Ogilvie, M. D. - See Gottschalk, A.
O'Reilly, R. C. and Johnson, M. H. Object Recognition and Sensitive Periods: A Computational Analysis of Visual Imprinting (Article)
6(3):357-389
Osman, H. and Fahmy, M. M. Probabilistic Winner-Take-All Learning Algorithm for Radial-Basis-Function Neural Classifiers (Letter)
6(5):927-943
Pack, A. I. - See Gottschalk, A. Parga, N. - See Nadal, J. P. Passino, K. M. - See Cheung, M. F. Pearlmutter, B. A. Fast Exact Multiplication by the Hessian (Letter)
6(1):147-160
Peterson, C. - See Pi, H.
Pi, H. and Peterson, C. Finding the Embedding Dimension and Variable Dependences in Time Series (Letter)
6(3):509-520
Prager, R. W. - See Gee, A. H. Qian, N. Computing Stereo Disparity and Motion with Known Binocular Cell Properties (Letter)
6(3):390-404
Rasmussen, C. E. - See Hansen, L. K.
Ray, W. H. - See Scott, G. M.
Reggia, J. A. - See Sutton III, G. G.
Rhodes, P. A. and Gray, C. M. Simulations of Intrinsically Bursting Neocortical Pyramidal Neurons (Letter)
6(6):1086-1109
Richter, D. W. - See Gottschalk, A. Roberts, S. and Tarassenko, L. A Probabilistic Resource Allocating Network for Novelty Detection (Letter)
6(2):270-284
Rognvaldsson, T. On Langevin Updating in Multilayer Perceptrons (Letter)
6(5):916-926
Sachs, M. B. - See Lai, Y. C.
Sanger, T. D. Theoretical Considerations for the Analysis of Population Coding in Motor Cortex (Letter)
6(1):29-37
Saul, L. and Jordan, M. I. Learning in Boltzmann Trees (Letter)
6(6):1173-1183
Scott, G. M. and Ray, W. H. Neural Network Process Models Based on Linear Model Structures (Letter)
6(4):718-738
Sejnowski, T. J. - See Destexhe, A.
Selverston, A. I. - See Doya, K.
Shadafan, R. S. and Niranjan, M. A Dynamic Neural Network Architecture by Sequential Partitioning of the Input Space (Letter)
6(6):1201-1221
Shupe, L. - See Munro, E. E.
Sima, J. Loading Deep Networks is Hard (Letter)
6(5):842-850
Singh, S. P. - See Jaakkola, T.
Sompolinsky, H. and Tsodyks, M. Segmentation by a Network of Oscillators with Stored Memories (Letter)
6(4):642-657
Stampfli, J. - See McAuley, J. D.
Stemmler, M. - See Usher, M.
Stinchcombe, M. - See Hornik, K.
Stolorz, P. - See Yuille, A. L.
Sutton III, G. G., Reggia, J. A., Armentrout, S. L., and D'Autrechy, C. L. Cortical Map Reorganization as a Competitive Process (Article)
6(1):1-13
Szepesvári, C., Balázs, L., and Lorincz, A. Topology Learning Solved by Extended Objects: A Neural Network Model (Letter)
6(3):441-458
Tarassenko, L. - See Roberts, S.
Tesauro, G. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play (Note)
6(2):215-219
Tsodyks, M. - See Sompolinsky, H.
Unnikrishnan, K. P. and Venugopal, K. P. Alopex: A Correlation-Based Learning Algorithm for Feed-Forward and Recurrent Neural Networks (Letter)
6(3):469-490
Usher, M. - See Bernander, O.
Usher, M., Stemmler, M., Koch, C., and Olami, Z. Network Amplification of Local Fluctuations Causes High Spike Rate Variability, Fractal Firing Patterns and Oscillatory Local Field Potentials (Article)
6(5):795-836
Utans, J. - See Yuille, A. L.
Vapnik, V. - See Drucker, H.
Vapnik, V., Levin, E., and LeCun, Y. Measuring the VC-Dimension of a Learning Machine (Letter)
6(5):851-876
Venugopal, K. P. - See Unnikrishnan, K. P.
Wan, E. A. - See Beaufays, F.
Weinfeld, M. - See Gascuel, J. D.
White, H. - See Hornik, K.
White, H. - See Kuan, C.-M.
Willshaw, D. J. - See Goodhill, G. J.
Winslow, R. L. - See Lai, Y. C.
Worgotter, F. - See Niebur, E.
Yuille, A. L. and Kosowsky, J. J. Statistical Physics Algorithms that Converge (Review)
6(3):341-356
Yuille, A. L., Stolorz, P., and Utans, J. Statistical Physics, Mixtures of Distributions, and the EM Algorithm (Letter)
6(2):334-340
Yurkovich, S. - See Cheung, M. F.