REVIEW
Communicated by Walter Heiligenberg
Deciphering the Brain’s Codes
Masakazu Konishi
Division of Biology, California Institute of Technology, Pasadena, California 91125 USA
The two sensory systems discussed in this review use similar algorithms for the synthesis of the neuronal selectivity for the stimulus that releases a particular behavior, although the neural circuits, the brain sites involved, and even the species are different. This stimulus selectivity emerges gradually in a neural network organized according to parallel and hierarchical design principles. The parallel channels contain lower order stations with special circuits for the creation of neuronal selectivities for different features of the stimulus. Convergence of the parallel pathways brings these selectivities together at a higher order station for the eventual synthesis of the selectivity for the whole stimulus pattern. The neurons that are selective for the stimulus are at the top of the hierarchy, and they form the interface between the sensory and motor systems or between sensory systems of different modalities. The similarities of these two systems at the level of algorithms suggest the existence of rules of signal processing that transcend different sensory systems and species of animals.
1 Introduction
Neither peripheral nor central sensory neurons respond to all stimuli or stimulus variables; each responds to a certain modality, range, configuration, or pattern of stimuli or stimulus variables. This property of sensory neurons will be called stimulus selectivity or, simply, selectivity. Some neurons are selective for primary stimulus variables, such as frequency and wavelength, and others are selective for more complex patterns of stimulus variables, such as direction, velocity, and disparities. In some systems, lower order neurons show selectivity for simpler stimuli and higher order neurons for more complex stimuli.
Neural Computation 3, 1-18 (1991)
© 1991 Massachusetts Institute of Technology
Such a correlation between stimulus selectivities and the anatomical levels of a sensory system suggests the possibility of finding where and how selectivities for complex stimuli are derived from the integration of selectivities for simpler stimulus variables. Of particular interest are sensory systems that contain higher order neurons selective for the same stimuli that cause specific behaviors or percepts. If these systems are amenable to analysis of successive stages of processing leading to the selectivity of these neurons, we may
understand how the whole system is designed to analyze behaviorally relevant stimuli. For various reasons, many complex sensory systems are not amenable to this form of analysis. However, in the auditory system of the barn owl and the electrosensory system of the electric fish, Eigenmannia, such an analysis has been successfully carried out (Heiligenberg 1986; Konishi et al. 1988). In this review, I shall discuss what we can learn from these examples about neural codes, neural algorithms, and network organization. 2 Behavioral Analysis
The sense organs and the brain of an animal must be designed for the processing of the stimuli that are relevant for its survival and reproduction. It is, therefore, important to determine what stimulus to use in the analysis of neuronal selectivities. This section describes first the characteristics of the stimulus for sound localization by the barn owl and then the stimulus for the electric fish. On hearing a sound, the owl turns its head in the direction of the sound source (Knudsen et al. 1979). Experiments show that the owl uses interaural time differences for localization in azimuth and interaural amplitude differences for localization in elevation (Moiseff and Konishi 1981; Moiseff 1989). Of two possible sources of interaural time differences, namely, stimulus onset time and phase disparities, the owl extracts and uses the interaural phase differences of all audible frequencies contained in the signal. The vertical asymmetries of the barn owl's ears enable the owl to use interaural amplitude differences for localization in elevation. A higher sound level in the right ear and left ear causes the owl to turn its head upward and downward, respectively. The owl obtains both binaural disparities simultaneously from a single sound signal. Each location in the owl's two-dimensional auditory space is thus uniquely defined by a combination of interaural time and amplitude differences. The second example is the electric fish, Eigenmannia. This species creates electrical field potentials around its body to detect objects having conductivities different from the conductivity of the water. The electrical potential varies almost sinusoidally over time, and the fish can change its frequency. When the electrical fields of two fish have only slightly dissimilar frequencies, the fish try to increase the differences in signal frequency. This behavior is called the jamming avoidance response (Heiligenberg 1986). 
The main problem that a fish must solve in encountering another fish is to determine whether its own frequency is higher or lower than that of the other fish. The pacemaker cells that drive the electrical organ cycle by cycle could, in theory, provide a copy of the efferent command for comparison with the frequency of the incoming signal. Eigenmannia does not, however, use this method. Instead, the fish determines the sign of frequency differences from the waveform created by the mixing
of its own and the neighbor's signals. The phase and amplitude of the waveform at one locus on the body surface are different from those at another locus, because the sources of the two electrical fields are located within the bodies of the two fish that are separated in space. The fish uses these phase and amplitude differences between many loci on its body surface to determine the sign of frequency differences (see Fig. 1 for further explanation).
3 Successive Stages of Signal Processing
Complex sensory pathways have input, intermediate, and output stages. In the "bottom-up" approach, one starts with the sense organ and proceeds to higher order stations in the ascending sequence. The study of neuronal selectivities need not start with the input stage. In the "top-down" approach, one starts with output or other higher order neurons and goes downward through intermediate stages to the sense organ. Neither approach is easy when the neural network is complex. The bottom-up approach is difficult because of nonlinear properties of most neural systems. The top-down approach is difficult because the output neurons of the network may not be easily found. The output neurons of a hierarchically organized neural network occupy the anatomically highest stage of the hierarchy and project to other functionally identifiable networks such as the motor system and other sensory systems. Under favorable circumstances, the point of transition from one network to the next can be recognized by combinations of anatomical and physiological methods. Starting with higher order neurons has distinct advantages because the investigator has the defined goal of explaining the stimulus selectivity of the higher order neurons. In the owl, we were lucky to start with what has turned out to be the output neurons. The following description of the owl's auditory system similarly starts with the output neurons and explains how their stimulus selectivity is established.
3.1 The Top-Down Approach in the Owl. We looked for higher order auditory neurons that responded only when sound came from a restricted area in space. A cluster of such cells, which are called space-specific neurons, occurs in the external nucleus of the inferior colliculus (Knudsen and Konishi 1978; Moiseff and Konishi 1983). This nucleus is the highest station in that part of the owl's auditory system that processes the stimulus for sound localization (cf. Fig. 3).
Experiments with earphones show that these neurons are selective for a combination of interaural time and amplitude differences (Moiseff and Konishi 1981; Olsen et al. 1989). The neurons are selective for a particular spatial location, because they are tuned to the combination of interaural time and amplitude differences that results when the sound source is located at that site. We also know that a neuron's selectivity for interaural time and amplitude
Figure 1: Determination of the sign of frequency differences by the electric fish, Eigenmannia. (A) Electrical signals. Eigenmannia generates nearly sinusoidal electrical signals for navigation and orientation. When an individual (S1) encounters another individual (S2), they avoid jamming each other by changing the frequency of their signals. The fish uses the beat waveform (S1+S2) to determine whether its own frequency is higher or lower. (B) The fish uses differences in the phase and amplitude of the beat waveform between different body loci to determine the sign of frequency differences. In this figure, the solid-line and dotted-line waveforms show different degrees of contamination of S1 by S2; the solid-line waveform registered at one body locus is more contaminated and the dotted-line waveform registered at another locus is less contaminated. The small arrowheads indicate the phase relationships between the two waveforms. The left-slanted arrowheads indicate that the phase of the solid-line waveform is advanced relative to that of the dotted-line waveform. When these phase relationships and the rise and fall of amplitude are considered jointly, the sign of frequency differences can be determined unambiguously. Thus, the sequence, a fall in amplitude with a phase advance followed by a rise in amplitude with a phase delay, indicates that the fish's own frequency is lower than that of the other fish.
differences determines, respectively, the azimuthal and elevational centers and widths of its receptive field. It is reasonable to assume that a neuron’s selectivity for a complex stimulus is due both to its intrinsic morphological and biophysical properties and to the integration of information conveyed by the input channels converging on it. Thus, the next step in the top-down approach is to determine what circuits and processes underlie the stimulus selectivity of space-specific neurons. A survey of all binaural stations below the level of the external nucleus of the inferior colliculus showed that they could be classified into two groups, one containing neurons sensitive to interaural amplitude differences and the other containing neurons sensitive to interaural phase differences (Moiseff and Konishi 1983; Sullivan and Konishi 1984; Takahashi et al. 1984). Subsequent anatomical studies established separate pathways from the cochlear nuclei, the first auditory stations of the brain, to the inferior colliculus in the midbrain (Takahashi and Konishi 1988a,b). These findings led to the hypothesis that the owl’s auditory system uses parallel pathways for separate processing of interaural phase and amplitude differences. A more direct test of this assumption came from an experiment in which the response of space-specific neurons to interaural time and amplitude differences was observed while one of the two pathways was partially inactivated by injection of a local anesthetic. The beginning stage of the “time pathway” is one of the cochlear nuclei, nucleus magnocellularis, and that of the “amplitude pathway” is the other cochlear nucleus, nucleus angularis. Partial inactivation of the nucleus magnocellularis drastically changed the response of space-specific neurons to interaural time differences without affecting their response to interaural amplitude differences. The converse was observed when the nucleus angularis was partially anesthetized.
These cochlear nuclei are both anatomically and physiologically different from each other. Neurons of the nucleus magnocellularis are sensitive to stimulus phase but insensitive to variation in stimulus amplitude. By contrast, neurons of the nucleus angularis are sensitive to variation in stimulus amplitude but insensitive to stimulus phase. The phase sensitivity means that the neuron fires at or near a particular phase angle during the tonal period. This phenomenon, phase-locking, occurs at frequencies as high as 8.3 kHz in the owl. Neurons do not fire during every tonal period of such a high frequency, but whenever they fire, they phase-lock to the stimulus. When the stimulus is noise, neurons phase-lock to the phase of the spectral components to which they are tuned. The next step in this research was to determine where and how the neuronal selectivities for interaural phase and amplitude differences are established in the two pathways. The third-order nucleus in the time pathway, nucleus laminaris, is the first station that contains neurons selective for interaural phase differences. The owl uses phase-locked spikes from the left and right ears to measure interaural phase differences from
Figure 2: Neural circuits for the detection of interaural time differences. The inset shows a model of neural circuits for the detection of interaural time differences. It uses the principles of coincidence detection and delay lines. Binaural neurons, A, B, C, D, and E, fire maximally when impulses from the two sides arrive simultaneously. Except for C, the paths for impulse transmission to each neuron are different between the two sides. These asymmetries cause interaural differences in the arrival time of impulses. A neuron fires maximally when an imposed interaural time difference compensates for the asymmetry in impulse transmission time. This array of neurons thus encodes different azimuthal locations of sound systematically. The main figure shows the neural circuits. Nucleus magnocellularis is one of the first brain stations in the owl's auditory system. Nucleus laminaris receives inputs from both the ipsilateral and contralateral magnocellular nuclei. The figure shows axon collaterals from single ipsilateral and contralateral neurons projecting into nucleus laminaris, which contains binaural neurons. For the sake of clarity, the ipsilateral and contralateral axons are shown separately, although they interdigitate in reality. These interdigitating axons serve as delay lines, and the laminaris neurons as coincidence detectors. Interaural phase differences are computed separately for each frequency band.
which it eventually derives interaural time differences. The circuits that compute interaural phase differences use the principles of delay lines and coincidence detection (Fig. 2), corresponding to a model first proposed by Jeffress (1948). Laminaris neurons are innervated by axons from both
ipsilateral and contralateral magnocellular neurons. The parts of these axons that lie within the boundaries of the nucleus laminaris act as delay lines, and laminaris neurons themselves as coincidence detectors (Carr and Konishi 1988, 1990). A binaural disparity in the arrival time ($\Delta t_a$) of impulses at a laminaris neuron includes the difference in the acoustic transmission time to the two ears (ITD) and the difference in the impulse conduction time ($\Delta t_c$) from the two ears to the neuron, hence $\Delta t_a = \mathrm{ITD} + \Delta t_c$. Both ITD and $\Delta t_c$ vary, but the delay lines are organized such that for each neuron $\Delta t_c$ equals a particular ITD in magnitude but opposite in sign (i.e., $\mathrm{ITD} = -\Delta t_c$); impulses from the two sides then arrive simultaneously and the laminaris neuron fires maximally. Laminaris neurons are, however, not perfect coincidence detectors, because they pass monaural signals. Interestingly, an unfavorable phase difference elicits a smaller number of impulses than that triggered by either of the monaural signals. Nonlinear processes such as inhibition may thus contribute to the computation of interaural phase differences. All of these processes occur in each audible frequency band. Laminaris neurons thus convey their selectivity for interaural phase differences to higher order nuclei in separate frequency channels. A single laminaris neuron responds to multiple ITDs that are separated by integer multiples of the stimulus period. This phenomenon occurs because phase is a circular variable. Thus, if an interaural time difference ITD corresponds to an interaural phase difference, IPD, then all ITD + nT also correspond to IPD, where n is an integer and T is the period of the stimulus tone. Laminaris neurons respond to all ITD + nT as long as they are within their physiological range. Laminaris neurons send their axons, in separate frequency channels, to two higher order nuclei, the central nucleus of the inferior colliculus and one of the lemniscal nuclei.
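The delay-line arithmetic described above can be illustrated with a small sketch. It assumes an idealized cosine tuning of one coincidence detector to the arrival-time disparity for a pure tone; that tuning function, the function names, and the parameter values (a 5 kHz channel, a -50 µs conduction-time difference, a ±250 µs search range) are illustrative assumptions and are not taken from the review.

```python
import math


def laminaris_response(itd_us, conduction_delay_us, freq_hz):
    """Idealized response (0..1) of one coincidence detector to a pure tone.

    Assumption: cosine tuning to the arrival-time disparity, which is the
    sum of the acoustic ITD and the internal conduction-time difference.
    """
    arrival_disparity_s = (itd_us + conduction_delay_us) * 1e-6
    phase = 2.0 * math.pi * freq_hz * arrival_disparity_s
    return 0.5 * (1.0 + math.cos(phase))


def equivalent_itds(conduction_delay_us, freq_hz, max_itd_us):
    """All ITDs within +/- max_itd_us that drive the detector maximally."""
    period_us = 1e6 / freq_hz
    best_itd_us = -conduction_delay_us  # ITD that cancels the internal delay
    n_max = int((max_itd_us + abs(best_itd_us)) // period_us) + 1
    candidates = (best_itd_us + n * period_us for n in range(-n_max, n_max + 1))
    return [itd for itd in candidates if abs(itd) <= max_itd_us]


if __name__ == "__main__":
    # Hypothetical neuron: 5 kHz channel, conduction-time difference of -50 us.
    print(equivalent_itds(-50, 5000, 250))
    print(round(laminaris_response(150, -50, 5000), 3))
```

In this toy model the detector responds equally to its best ITD and to every ITD one stimulus period away, which is the phase ambiguity that a single frequency channel cannot resolve.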
The inputs from the nucleus laminaris endow the neurons of these nuclei with selectivity for interaural phase differences. Consequently, these neurons also respond to multiple ITDs. The neurons of the central nucleus of the inferior colliculus are, however, more sharply tuned to interaural phase differences. These neurons project to the lateral shell of the central nucleus of the inferior colliculus, and the neurons of this nucleus project, in turn, to the external nucleus of the inferior colliculus where space-specific neurons reside. Unlike the lower order neurons, space-specific neurons are broadly tuned to frequency and respond only to one ITD when a broad-band signal is used. This fact indicates that space-specific neurons receive inputs from the frequency channels that are selective for the same ITD and its phase equivalents (ITD + nT) (Wagner et al. 1988). Space-specific neurons or their immediate precursors in the lateral shell get rid of the frequency dependent variable ITD + nT. This ability of space-specific neurons to respond exclusively to one interaural time difference is due to excitatory and inhibitory interactions between the different frequency channels that they receive (Takahashi and Konishi 1986; Fujita 1989).
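A minimal sketch of why convergence across frequency channels can remove the ITD + nT ambiguity: each channel peaks at the shared ITD, but its phase equivalents fall at different ITDs for different frequencies. The plain averaging used here is a simplifying assumption that stands in for the excitatory and inhibitory interactions mentioned above, and the cosine tuning, function names, and frequencies are hypothetical.

```python
import math


def channel_response(itd_us, best_itd_us, freq_hz):
    """Idealized (assumed cosine) tuning of one frequency channel to ITD."""
    phase = 2.0 * math.pi * freq_hz * (itd_us - best_itd_us) * 1e-6
    return 0.5 * (1.0 + math.cos(phase))


def space_specific_response(itd_us, best_itd_us, freqs_hz):
    """Average over channels that share one best ITD (assumed combination rule)."""
    total = sum(channel_response(itd_us, best_itd_us, f) for f in freqs_hz)
    return total / len(freqs_hz)


def preferred_itd(best_itd_us, freqs_hz, max_itd_us=250):
    """ITD (integer microseconds) with the largest combined response."""
    candidates = range(-max_itd_us, max_itd_us + 1)
    return max(
        candidates,
        key=lambda itd: space_specific_response(itd, best_itd_us, freqs_hz),
    )


if __name__ == "__main__":
    broadband = [4000, 5000, 6000, 7000]  # illustrative channel frequencies (Hz)
    print(preferred_itd(50, broadband))
    # A single 5 kHz channel responds fully at the phase equivalent -150 us,
    # whereas the combined response there is clearly reduced.
    print(space_specific_response(-150, 50, [5000]))
    print(space_specific_response(-150, 50, broadband))
```

With a single channel the phase-equivalent ITD is indistinguishable from the true one; with several channels only the shared ITD receives a maximal input from all of them.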
The last issue in the synthesis of the stimulus selectivity of space-specific neurons concerns their ability to respond selectively to combinations of interaural time and amplitude differences. This capacity derives from the convergence of the two pathways in the lateral shell of the central nucleus of the inferior colliculus. Interaural amplitude differences are first encoded, however, in one of the lemniscal nuclei (Manley et al. 1988). Stimulation of the contralateral ear excites and that of the ipsilateral ear inhibits the neurons of this nucleus. The response of these neurons is, therefore, determined by interaural amplitude differences. These neurons are, however, not exclusively selective for interaural amplitude differences, because the contralateral ear alone can drive them. The outputs of these neurons are eventually used to produce the ability of space-specific neurons to tune to interaural amplitude differences. The convergence of the two pathways is not a simple addition but involves another nonlinear operation, which endows space-specific neurons with the ability to respond only to a combination of interaural time and amplitude differences.
3.2 The Bottom-Up Approach in the Electric Fish. In the electric fish, Heiligenberg and his associates have used the bottom-up approach to discover the neural mechanisms for the determination of the sign of frequency differences in the jamming avoidance response. It should be noted, however, that they used the results of behavioral analysis to guide their search for relevant neuronal stimulus selectivities. I shall briefly review the steps by which they discovered the output neurons. The selectivities for primary stimulus variables, phase and amplitude, are established in the sense organs themselves. The electric fish has two kinds of electroceptive sensory cells in the skin (Scheich et al. 1973). One of them fires a single impulse at each positive zero-crossing of the nearly sinusoidal electrical signal.
The fish uses these impulses to convey to the brain information about the phase angles of the signal over the entire body surface. The other type of sensory cell is sensitive to variation in the amplitude of the electrical signal. The phase-sensitive and amplitude-sensitive cells are mixed over the body surface, but they project to different layers of the first brain station. These layers constitute the starting points of separate pathways for phase and amplitude (Carr et al. 1982; Heiligenberg and Dye 1982). Each layer contains three separate maps of the electrical field variables over the body surface, and the phase and amplitude maps are in register (Shumway 1989a,b). This nucleus, the electrosensory lateral line lobe, contains neurons and their circuits that are sensitive to the rise and fall of signal amplitude (Saunders and Bastian 1984; Shumway and Maler 1989). The electrosensory lateral line lobe projects to the multilayered torus semicircularis, presumably the homolog of the inferior colliculus. Like the owl, the electric fish uses special neural circuits for the computation of phase differences between different body loci. These circuits are found
in lamina 6 of the torus (Carr et al. 1986). Of the two classes of output neurons of these circuits, one responds to phase advance and the other to phase delay. Neurons selective for the rise or fall of amplitude occur in other laminae of the torus. As in the owl, the phase and amplitude pathways converge on each other in specific layers of the torus. This convergence gives rise to four classes of neurons that are selective for four different combinations of amplitude and phase, that is, amplitude fall-phase advance, amplitude rise-phase delay, amplitude fall-phase delay, and amplitude rise-phase advance (Heiligenberg and Rose 1986; Rose and Heiligenberg 1986). The first two amplitude-phase combinations indicate that the fish’s own frequency is lower, and the second two that the fish’s own frequency is higher. These four neuron types are, however, not exclusively selective for particular amplitude-phase combinations, because they show some responses to other combinations. Also, the response of these neurons depends on the relative orientation of the fish’s own electrical field and that of its neighbor, because their receptive fields are restricted to small body surface areas (Heiligenberg 1986). The next stage of processing takes place in the nucleus electrosensorius where sensory channels from different body surface loci converge on single neurons (Keller and Heiligenberg 1989). The response of these neurons to amplitude-phase combinations becomes largely independent of the relative orientation of overlapping electrical fields. The final stage of processing is the prepacemaker nucleus in the diencephalon (Rose et al. 1988). Its neurons unambiguously discriminate between the signs of frequency differences.
4 The Output Neurons
The owl’s space-specific neurons are the output neurons of the network involved in sound localization, because they occupy the top of the hierarchy of the brainstem and pontine auditory nuclei and project to the optic tectum. These neurons are selective for the same stimulus that induces the sound localizing response in the owl. This stimulus selectivity is a result of all parallel and serial computations that are carried out by lower order neurons in the pathways leading to the output neurons. Space-specific neurons form a map of auditory space in the external nucleus of the inferior colliculus (Knudsen and Konishi 1978). This map projects to the optic tectum where an auditory-visual map of space is found (Knudsen and Knudsen 1983). This bimodal map appears to be linked to the motor map of head orientation. Electrical stimulation of the optic tectum elicits saccadic head movements, which are similar to those released by natural sound stimuli. The spatial locus to which the owl orients corresponds to the receptive fields of auditory-visual neurons located at the site of electrical stimulation (Du Lac and Knudsen 1990;
Masino and Knudsen 1990). The exact mechanisms of translation from sensory codes to motor codes are, however, not yet known. In the electric fish, neurons of the prepacemaker nucleus in the diencephalon are the output neurons of the system for the determination of the sign of frequency differences, because they occupy the top of the hierarchy of nuclei involved in this behavior and project directly to the premotor nucleus that controls the electrical organ (Rose et al. 1988). The stimulus selectivity of prepacemaker neurons is a result of all parallel and serial computations that take place in lower order nuclei in the pathways leading to them. These neurons fire more when the fish’s own frequency is higher and less when the fish’s own frequency is lower. This response pattern is exactly what the fish shows in response to the sign of frequency differences; when its own frequency is higher, the fish raises it, and when its own frequency is lower, the fish lowers it so that the frequency difference between the two fish becomes larger. Moreover, just as the fish’s response is rather independent of the relative orientation of the other fish, so is the response of prepacemaker neurons. A rise and fall of the discharge rate in these neurons, respectively, raises and lowers the frequency of firing in pacemaker neurons that trigger each discharge cycle of the electrical organ.
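The rule that amplitude changes and phase relationships jointly give the sign of the frequency difference can be sketched numerically. The sketch assumes that the fish's own signal is a unit phasor, that the neighbor's signal adds a smaller rotating phasor at one body locus (the `contamination` parameter, which must be below 1), and that the reference locus is uncontaminated; these modelling choices, the function names, and the example frequencies are illustrative assumptions rather than details given in the review.

```python
import cmath
import math


def beat_samples(own_hz, neighbor_hz, contamination, n=200):
    """Amplitude and differential phase of the mixed signal over one beat cycle.

    Assumes own_hz != neighbor_hz and 0 < contamination < 1.
    Positive phase means the contaminated locus is advanced relative to
    the (assumed uncontaminated) reference locus.
    """
    df = neighbor_hz - own_hz
    beat_period = 1.0 / abs(df)
    samples = []
    for k in range(n):
        t = k * beat_period / n
        z = 1.0 + contamination * cmath.exp(2j * math.pi * df * t)
        samples.append((abs(z), cmath.phase(z)))
    return samples


def own_frequency_is_lower(samples):
    """Combine amplitude change and phase sign, as in the four neuron classes.

    Fall with advance or rise with delay contributes negatively (own
    frequency lower); fall with delay or rise with advance contributes
    positively (own frequency higher).
    """
    score = 0.0
    for (a0, p0), (a1, p1) in zip(samples, samples[1:]):
        amplitude_change = a1 - a0
        mean_phase = 0.5 * (p0 + p1)
        score += amplitude_change * mean_phase
    return score < 0


if __name__ == "__main__":
    print(own_frequency_is_lower(beat_samples(400.0, 404.0, 0.3)))
    print(own_frequency_is_lower(beat_samples(404.0, 400.0, 0.3)))
```

Neither the amplitude modulation nor the phase modulation alone distinguishes the two cases in this model, since both are periodic with the same beat; only their joint sign does, which mirrors the need for convergence of the phase and amplitude pathways.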
5 Stimulus Selectivities and Neural Codes
Neural codes are pieces of information that neurons convey to other neurons. This section discusses first neural codes in the above sense and then the relationships between these codes and behavior. In the owl and the electric fish, we see how the timing and rate of impulses in the input stage are directly correlated with the phase and amplitude of the stimulus, respectively. Furthermore, we know how the selectivities for phase and amplitude disparities are derived from these inputs. Thus, phase-locked and rate-variable impulses are the neural codes for phase and amplitude, respectively. However, neither impulse timing nor rate is uniquely correlated with amplitude and phase disparities. Nevertheless, convergence of neurons selective for phase and amplitude disparities gives rise to neurons selective for combinations of the two disparities, indicating that the disparity-sensitive neurons convey relevant information to other neurons. The only code for this information is the neuron’s place or address. “Place-coding” is the most universal signaling method in all neural systems. It is, therefore, justifiable to equate stimulus selectivities with neural codes, and the study of stimulus selectivities in successive stages of a sensory network can show how complex stimuli are encoded. The two examples show that the neurons at the top of a hierarchically organized system represent the final result of all computations that are carried out by lower order neurons. Representation of a large network by a small number of output neurons is an interesting problem
from the point of view of network organization and coding. When a study of intermediate stages of a sensory system uncovers only neurons selective for simple stimulus features, this observation does not necessarily allow the investigator to conclude that the output of the system is encoded by a large ensemble of simple neurons. For example, Heiligenberg and associates interpreted the results of their early studies in terms of activities of a large ensemble of neurons. They found that the fish would fail to determine reliably the sign of frequency differences, if sensory inputs from a large part of its body surface were eliminated. Their models also indicated that comparisons of phase and amplitude between many pairs of body loci are necessary for the determination of the sign of frequency differences. In addition, their early studies of lower order nuclei uncovered only those neurons selective for separate lower order features of the stimulus for jamming avoidance. These observations prompted them to write papers with titles such as “The jamming avoidance response revisited: The structure of a neuronal democracy” (Heiligenberg et al. 1978) and “The neural basis of a sensory filter in the jamming avoidance system: No grandmother cells in sight” (Partridge et al. 1980). However, when they studied the diencephalon, which contains the nucleus electrosensorius and the prepacemaker nucleus, they found extensive convergence of inputs from the body surface onto single neurons. These neurons, particularly prepacemaker neurons, unambiguously discriminated the sign of frequency differences. This finding resulted in a paper entitled “‘Recognition units’ at the top of a neuronal hierarchy? Prepacemaker neurons code the sign of frequency differences unambiguously” (Rose et al. 1988). Thus, one cannot predict either from behavioral analyses alone or from the study of intermediate stages of a network how its output stage encodes relevant stimuli.
In both the electric fish and the owl, we see extensive convergence of lower order neurons onto the output neurons that are selective for the behaviorally relevant stimulus. The ratio of the output neurons to lower order neurons has not been determined in either system, but the volume of tissue containing the output neurons appears to be much smaller than that of lower order nuclei in both systems. The prepacemaker nucleus, being about 100 µm in diameter, is the smallest nucleus in the electrosensory system (Keller et al. 1990). Moreover, this nucleus consists of two parts, one for the control of the jamming avoidance response and the other for “chirps,” which occur in courtship and aggression. The number of neurons in the “chirp” area is estimated to be about 200 on each side of the brain and the jamming avoidance area contains perhaps twice as many neurons (Heiligenberg, personal communication). These numbers are small in a system in which most lower order nuclei contain thousands of neurons. Intracellular stimulation of a single “chirp” neuron can induce weak “chirps” and stimulation of many neurons with glutamate induces strong “chirps” (Kawasaki and Heiligenberg 1988; Kawasaki et al. 1988). These
neurons are somewhat similar to the command fibers of invertebrates. Recent reports show that a group of command fibers contributes to the control of several different motor output patterns, but the amount and nature of the contributions by different fibers vary in different patterns (Larimer 1988). Some of the oscillator circuits of invertebrates also show elements of combinatorial control of multiple output patterns by a small group of neurons (Getting 1989). Thus, in some systems, a small number of neurons represents the outputs of a network and controls relatively complex behaviors.
6 Similarities in Algorithms
In the present context, an algorithm refers to steps and procedures in signal processing. Figure 3 compares the algorithms for the processing of the signals for sound localization by the owl and for the jamming avoidance response by the electric fish. Both systems use parallel pathways for the processing of different stimulus features. Signal processing within each of the pathways occurs in a hierarchical sequence of nuclei. First, the codes for the primary stimulus variables, phase and amplitude, are sorted out at an early stage and routed to appropriate pathways, then different stimulus features are detected and encoded in each pathway by special neural circuits. Further processing in higher order stations makes the neural representations of the stimulus features more accurate and less ambiguous. Finally, the codes for these features are brought together by convergence of the parallel pathways. The result of convergence is not simply the addition of the codes from the input channels but the creation of a new code. In both examples, the inputs to the output neurons carry the codes for phase and amplitude disparities, but the output neurons do not respond to either feature alone but only to a combination of the two features. The output neurons of the entire network can be recognized in both the owl and the electric fish. These neurons occur at the top of the hierarchy of processing stages, and they encode the signals for sound localization and jamming avoidance unambiguously. The output neurons serve as the interface between the signal processing and motor system or between signal processing networks of different modalities such as the auditory and visual systems in the owl. There are thus remarkable similarities between the owl and fish algorithms (Konishi 1991). Both the auditory and electrosensory systems are thought to have evolved from the lateral line system, which the fish uses to detect disturbances in the surrounding water. 
[Figure 3 diagram not reproduced. Two panels, BARN OWL and ELECTRIC FISH, each pair a Network Hierarchy column with a Neural Algorithm column. Legible owl labels: phase and amplitude pathways; nucleus magnocellularis, nucleus laminaris, lateral lemniscus; encoding of frequency, amplitude, and phase; convergence of different frequency channels; elimination of phase ambiguity; formation of a map of auditory space; formation of a bimodal map; motor map for head orienting response.]
Figure 3: Similarities in neural algorithms. The owl’s auditory system computes the combinations of interaural time and amplitude differences that uniquely define separate loci in auditory space. The electric fish’s electrosensory system determines the sign of frequency differences between the fish’s own and a neighbor’s signal by detecting the differences between body loci in the phase and amplitude of the waveform resulting from the mixing of the two signals. Both systems compute phase and amplitude disparities to synthesize the codes for phase-amplitude combinations. This figure shows where in the brain and in what steps the two systems carry out the synthesis of the codes. The boxes show the brain nuclei and the arrows indicate the direction of connections. The process that takes place in each nucleus is posted on the right of the corresponding box. The multiple processes performed by one nucleus are listed as a group. The arrowheads between processes indicate the sequence in which the different processes occur.
This explanation does not account for specific aspects of the algorithms, such as the separation of the phase and amplitude codes in two different nuclei in the owl and two different layers of a nucleus in the electric fish. The fact that both
animals deal with sinusoidal signals may be the reason for the similar algorithms, because the primary stimulus variables are the same in both sound and electrical signals. The goals of the two systems are also similar, because both systems ultimately encode combinations of phase and amplitude disparities. The neural implementations of the algorithms are, however, different between the systems. The electric fish uses different sensory cells to encode phase and amplitude, whereas the owl uses the same auditory neurons to encode both phase and amplitude. The electric fish uses electrical synapses to transmit phase-locked spikes in all relay stations below the stage where phase differences are computed. The owl uses chemical synapses for the same purpose, although they are specialized synapses. The electric fish uses the differences in the arrival time of phase-locked spikes between somata and dendrites to detect phase disparities between different body loci, whereas the owl uses axonal delay lines. Both animals use the convergence of different input channels to eliminate ambiguity in neuronal stimulus selectivity, but the convergence occurs in different parts of the brain, the midbrain in the owl and the diencephalon in the fish. Do similar algorithms occur in other complex sensory systems? The answer to this question is difficult to obtain, because few studies of complex sensory systems have investigated successive stages of signal processing. The visual system of the macaque monkey is the only other system that has been studied well enough for the discussion of algorithms. 
This complex system is also organized according to parallel and hierarchical design principles (Van Essen 1985; Maunsell and Newsome 1987; Hubel and Livingstone 1987; Livingstone and Hubel 1987, 1988; De Yoe and Van Essen 1988); parvocellular and magnocellular pathways are physiologically and anatomically distinct and the way stations in each pathway within the extrastriate cortex are hierarchically organized. These network hierarchies appear to be correlated with the processing hierarchies. Lower order features such as stimulus orientation are encoded in the striate cortex, whereas relatively higher order features, such as velocity and geometric patterns like faces, are encoded in higher stations, the middle temporal area encoding velocity and the inferotemporal area encoding faces (Gross et al. 1972; Perret et al. 1982; Maunsell and Van Essen 1983). However, much remains to be explored before we can understand the mechanisms and functional significance of feature extraction in this system as well as we do in the electric fish and the owl.
7 Concluding Remarks
Neuroethology, which studies the neural bases of natural behavior, has something to offer to the students of computational and neural systems. The tenet of neuroethology states that the brain is designed to process biologically relevant stimuli and control behavior essential for the
survival and reproduction of the animal. Only behavioral observations and analyses can identify biologically significant stimuli. The two systems discussed above could not have been analyzed adequately and understood without use of and reference to such stimuli. These examples also show that the study of successive stages of signal processing is essential for the understanding of both the algorithm and its neural implementation. These examples are relevant to computational neuroscience, the aim of which is to understand the workings of the brain. This field is, however, theory-rich and data-poor. To achieve its goal, the field needs benchmark neural systems in which both the algorithm and its neural implementation have been worked out. The electric fish and the owl provide such frames of reference for those who explore or model sensory networks.
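For readers who model sensory networks, the convergence step summarized in Section 6 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that two parallel channels carry Gaussian-tuned codes for phase and amplitude disparity and that the output unit multiplies them, so that it responds to the combination and not to either feature alone. The function names, tuning widths, arbitrary units, and the multiplicative rule are assumptions of this sketch and are not taken from the owl or fish data.

```python
import math


def tuning(value, preferred, width):
    """Gaussian tuning curve with peak 1.0 at the preferred value (illustrative)."""
    return math.exp(-0.5 * ((value - preferred) / width) ** 2)


def output_response(phase_disp, amp_disp, pref_phase, pref_amp,
                    phase_width=20.0, amp_width=3.0):
    """Response of a hypothetical output unit fed by two parallel channels.

    The product (an assumption of this sketch) is large only when BOTH
    the phase disparity and the amplitude disparity are near their
    preferred values; it is not a simple sum of the two input codes.
    """
    phase_channel = tuning(phase_disp, pref_phase, phase_width)
    amp_channel = tuning(amp_disp, pref_amp, amp_width)
    return phase_channel * amp_channel
```

With both disparities at their preferred values the unit responds maximally, whereas a matching phase disparity paired with a strongly mismatched amplitude disparity (or the reverse) gives a response near zero.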
Acknowledgments
I thank Jack Gallant and Walter Heiligenberg for reading the manuscript. This work was supported by NIH Grant DC00134.
References
Carr, C. E., and Konishi, M. 1988. Axonal delay lines for time measurement in the owl's brainstem. Proc. Natl. Acad. Sci. U.S.A. 85, 8311-8315.
Carr, C. E., and Konishi, M. 1990. A circuit for detection of interaural time differences in the brainstem of the barn owl. J. Neurosci. 10, 3227-3246.
Carr, C. E., Maler, L., and Sas, E. 1982. Peripheral organization and central projections of the electrosensory nerves in gymnotiform fish. J. Comp. Neurol. 211, 139-153.
Carr, C. E., Maler, L., and Taylor, B. 1986. A time-comparison circuit in the electric fish midbrain. II. Functional morphology. J. Neurosci. 6, 1372-1383.
De Yoe, E. A., and Van Essen, D. C. 1988. Concurrent processing streams in monkey visual cortex. Trends Neurosci. 11, 219-226.
Du Lac, S., and Knudsen, E. I. 1990. Neural maps of head movement vector and speed in the optic tectum of the barn owl. J. Neurophysiol. 63, 131-146.
Fujita, I. 1989. The role of GABA mediated inhibition in the formation of auditory receptive fields. Uehara Mem. Res. Found. Rep. 2, 159-161.
Getting, P. A. 1989. Emergent principles governing the operation of neural networks. Annu. Rev. Neurosci. 12, 158-204.
Gross, C. G., Rocha-Miranda, C. E., and Bender, D. B. 1972. Visual properties of neurons in inferotemporal cortex of the macaque. J. Neurophysiol. 35, 96-111.
Heiligenberg, W. 1986. Jamming avoidance responses. In Electroreception, T. H. Bullock and W. Heiligenberg, eds., pp. 613-649. Wiley, New York.
Heiligenberg, W., Baker, C., and Matsubara, J. 1978. The jamming avoidance response revisited: The structure of a neuronal democracy. J. Comp. Physiol. 127, 267-286.
Heiligenberg, W., and Dye, J. 1982. Labelling of electroreceptive afferents in a gymnotoid fish by intracellular injection of HRP: The mystery of multiple maps. J. Comp. Physiol. 148, 287-296.
Heiligenberg, W., and Rose, G. 1986. Gating of sensory information: Joint computations of phase and amplitude data in the midbrain of the electric fish, Eigenmannia. J. Comp. Physiol. 159, 311-324.
Hubel, D. H., and Livingstone, M. S. 1987. Segregation of form, color and stereopsis in primate area 18. J. Neurosci. 7, 3378-3415.
Jeffress, L. A. 1948. A place theory of sound localization. J. Comp. Physiol. Psych. 41, 35-39.
Kawasaki, M., and Heiligenberg, W. 1988. Individual prepacemaker neurons can modulate the pacemaker cycle in the gymnotiform electric fish, Eigenmannia. J. Comp. Physiol. 162, 13-21.
Kawasaki, M., Maler, L., Rose, G. J., and Heiligenberg, W. 1988. Anatomical and functional organization of the prepacemaker nucleus in gymnotiform electric fish: The accommodation of two behaviors in one nucleus. J. Comp. Neurol. 276, 113-131.
Keller, C. H., and Heiligenberg, W. 1989. From distributed sensory processing to discrete motor representations in the diencephalon of the electric fish, Eigenmannia. J. Comp. Physiol. 164, 565-576.
Keller, C. H., Maler, L., and Heiligenberg, W. 1990. Structural and functional organization of a diencephalic sensory-motor interface in the Gymnotiform fish, Eigenmannia. J. Comp. Neurol. 293, 347-376.
Kendrick, K. M., and Baldwin, B. A. 1987. Cells in temporal cortex of conscious sheep can respond preferentially to the sight of faces. Science 236, 448-450.
Knudsen, E. I., Blasdel, G. G., and Konishi, M. 1979. Sound localization by the barn owl (Tyto alba) measured with the search coil technique. J. Comp. Physiol. 133, 1-11.
Knudsen, E. I., and Knudsen, P. F. 1983. Space-mapped auditory projections from the inferior colliculus to the optic tectum in the barn owl (Tyto alba). J. Comp. Neurol. 218, 187-196.
Knudsen, E. I., and Konishi, M. 1978. A neural map of auditory space in the owl. Science 200, 795-797.
Konishi, M. 1991. Similar algorithms in different sensory systems and animals. Cold Spring Harbor Symp. Quant. Biol. (in press).
Konishi, M., Takahashi, T. T., Wagner, H., Sullivan, W. E., and Carr, C. E. 1988. Neurophysiological and anatomical substrates of sound localization in the owl. In Auditory Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 721-745. Wiley, New York.
Larimer, J. L. 1988. The command hypothesis: A new view using an old example. Trends Neurosci. 11, 506-510.
Livingstone, M. S., and Hubel, D. H. 1987. Psychophysical evidence for separate channels for the perception of form, color, movement, and depth. J. Neurosci. 7, 3416-3468.
Livingstone, M. S., and Hubel, D. H. 1988. Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science 240, 740-749.
Manley, G. A., Koeppl, C., and Konishi, M. 1988. A neural map of interaural intensity difference in the brainstem of the barn owl. J. Neurosci. 8, 2665-2676.
Masino, T., and Knudsen, E. I. 1990. Horizontal and vertical components of head movement are controlled by distinct neural circuits in the barn owl. Nature (London) 345, 434-437.
Maunsell, J. H. R., and Newsome, W. T. 1987. Visual processing in monkey extrastriate cortex. Annu. Rev. Neurosci. 10, 363-401.
Maunsell, J. H. R., and Van Essen, D. C. 1983. The connections of the middle temporal visual area (MT) and their relationship to a cortical hierarchy in the macaque monkey. J. Neurosci. 3, 2526-2586.
Moiseff, A. 1989. Bi-coordinate sound localization by the barn owl. J. Comp. Physiol. 164, 637-644.
Moiseff, A., and Konishi, M. 1981. Neuronal and behavioral sensitivity to binaural time difference in the owl. J. Neurosci. 1, 40-48.
Moiseff, A., and Konishi, M. 1983. Binaural characteristics of units in the owl's brainstem auditory pathway: Precursors of restricted spatial receptive fields. J. Neurosci. 3, 2553-2562.
Olsen, J. F., Knudsen, E. I., and Esterly, S. D. 1989. Neural maps of interaural time and intensity differences in the optic tectum of the barn owl. J. Neurosci. 9, 2591-2605.
Partridge, B. L., Heiligenberg, W., and Matsubara, J. 1980. The neural basis of behavioral filter: No grandmother cells in sight. J. Comp. Physiol. 145, 153-168.
Perret, D. I., Rolls, E. T., and Caan, W. 1982. Visual neurons responsive to faces in the monkey temporal cortex. Exp. Brain Res. 42, 319-330.
Rose, G. J., and Heiligenberg, W. 1986. Neural coding of frequencies in the midbrain of the electric fish Eigenmannia: Reading the sense of rotation in an amplitude-phase plane. J. Comp. Physiol. 158, 613-624.
Rose, G. J., Kawasaki, M., and Heiligenberg, W. 1988. 'Recognition units' at the top of a neuronal hierarchy? Prepacemaker neurons in Eigenmannia code the sign of frequency differences unambiguously. J. Comp. Physiol. 162, 759-772.
Saunders, J., and Bastian, J. 1984. The physiology and morphology of two types of electrosensory neurons in the weakly electric fish Apteronotus leptorhynchus. J. Comp. Physiol. 154, 199-209.
Scheich, H., Bullock, T. H., and Hamstra, R. H. 1973. Coding properties of two classes of efferent nerve fibers: High frequency electroreceptors in the electric fish, Eigenmannia. J. Neurophysiol. 36, 39-60.
Shumway, C. A. 1989a. Multiple electrosensory maps in the medulla of weakly electric Gymnotiform fish. I. Physiological differences. J. Neurosci. 9, 4388-4399.
Shumway, C. A. 1989b. Multiple electrosensory maps in the medulla of weakly electric Gymnotiform fish. II. Anatomical differences. J. Neurosci. 9, 4400-4415.
Shumway, C. A., and Maler, L. 1989. GABAergic inhibition shapes temporal and spatial response properties of pyramidal cells in the electrolateral line lobe of gymnotoid fish. J. Comp. Physiol. 164, 391-407.
Sullivan, W. E., and Konishi, M. 1984. Segregation of stimulus phase and intensity in the cochlear nuclei of the barn owl. J. Neurosci. 4, 1787-1799.
Takahashi, T. T., Moiseff, A., and Konishi, M. 1984. Time and intensity cues are processed independently in the auditory system of the owl. J. Neurosci. 4, 1781-1786.
Takahashi, T. T., and Konishi, M. 1986. Selectivity for interaural time difference in the owl's midbrain. J. Neurosci. 6, 3413-3422.
Takahashi, T. T., and Konishi, M. 1988a. Projections of the cochlear nuclei and nucleus laminaris to the inferior colliculus of the barn owl. J. Comp. Neurol. 274, 191-211.
Takahashi, T. T., and Konishi, M. 1988b. Projections of nucleus angularis and nucleus laminaris to the lateral lemniscal nuclear complex of the barn owl. J. Comp. Neurol. 274, 212-238.
Van Essen, D. C. 1985. Functional organization of primate visual cortex. In Cerebral Cortex, A. Peters and E. G. Jones, eds., pp. 259-329. Plenum, New York.
Wagner, H., Takahashi, T. T., and Konishi, M. 1987. Representation of interaural time difference in the central nucleus of the barn owl's inferior colliculus. J. Neurosci. 7, 3105-3116.
Received 30 August 1990; accepted 20 September 1990
Communicated by Christof Koch
Synchronization of Bursting Action Potential Discharge in a Model Network of Neocortical Neurons
Paul C. Bush
Rodney J. Douglas
Salk Institute, La Jolla, CA 92037 USA,
and Department of Physiology, University of Cape Town Medical School, Cape Town, South Africa 7925
and MRC Anatomical Neuropharmacology Unit, Department of Pharmacology, Oxford OX1 3QT, United Kingdom
We have used the morphology derived from single horseradish peroxidase-labeled neurons, known membrane conductance properties and microanatomy to construct a model neocortical network that exhibits synchronized bursting. The network was composed of interconnected pyramidal (excitatory) neurons with different intrinsic burst frequencies, and smooth (inhibitory) neurons that provided global feedback inhibition to all of the pyramids. When the network was activated by geniculocortical afferents the burst discharges of the pyramids quickly became synchronized with zero average phase-shift. The synchronization was strongly dependent on global feedback inhibition, which acted to group the coactivated bursts generated by intracortical reexcitation. Our results suggest that the synchronized bursting observed between cortical neurons responding to coherent visual stimuli is a simple consequence of the principles of intracortical connectivity.
1 Introduction
Recently there have been a number of reports that neurons in the visual cortex that respond to related features in the visual scene tend to synchronize their action potential discharge (Gray and Singer 1989; Gray et al. 1990; Engel et al. 1990). This finding has attracted considerable attention because it may reflect a process whereby the cortex binds coherent visual features into objects (Crick 1984; von der Malsburg and Schneider 1986). Typically, the synchronization is observed between neurons that have complex receptive field responses, and that have bursting rather than regular (Connors et al. 1982) discharge patterns. Theoretical analyses (Kammen et al. 1990) and simulations of connectionist networks (Sporns et al. 1989) have examined the conditions required for coherent neural activity, but the detailed neuronal mechanism of synchronization has not been studied. In this article we address this problem using a network of cortical neurons with realistic morphology and excitability.
Neural Computation 3, 19-30 (1991) © 1991 Massachusetts Institute of Technology
2 Model Cortical Network
The network was composed of model pyramidal and smooth cortical neurons. Each neuron was represented by a compartmental model that consisted of a series of cylindrical dendritic segments and an ellipsoidal soma (Fig. 1A,B). The dimensions of these compartments were obtained by simplification of the detailed morphology of a pyramidal neuron and a basket neuron that had been intracellularly labeled with horseradish peroxidase (Douglas and Martin 1990a). Each of the compartments contained an appropriate profile of passive, voltage-dependent, calcium-dependent, and synaptic conductances. Active conductances had Hodgkin-Huxley-like dynamics, except that time constants were independent of voltage. Passive properties of the model cell were obtained from intracellular recordings made in the real cell. The magnitudes and dynamics of the conductances and the implementation of the compartmental simulation were similar to those described elsewhere in the literature (Traub et al. 1987; Getting 1989; Douglas and Martin 1990a). The relevant parameters are listed in Table 1. The cortical network consisted of 10 bursting pyramidal neurons and one basket neuron (Fig. 1C). The bursting behavior of the pyramids was dependent on a small, transient delayed rectifier (gKd) and a large calcium-dependent potassium conductance (gKCa). The reduced spike after hyperpolarization that resulted from a small fast gKd encouraged a high-frequency burst of action potentials and rapid accumulation of intracellular calcium. The burst was terminated by the hyperpolarization induced by gKCa. The interburst interval depended on the rate of calcium removal (buffering) from the intracellular compartment. Each of the 10 pyramidal cells was assigned a slightly different intracellular calcium decay rate so that their natural burst frequencies ranged between 18 and 37 Hz for a 1 nA intrasomatic current injection. The active conductances were located in the somatic compartment.
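The statement that active conductances had Hodgkin-Huxley-like dynamics with voltage-independent time constants can be made concrete with a short Python sketch. It assumes a single gating variable that relaxes toward a sigmoidal steady state with a fixed time constant. The half-activation voltage, slope, and function names are hypothetical, and the 0.5 msec time constant in the usage note is only a value of the order listed in Table 1. This is an illustration, not the CANON implementation.

```python
import math


def steady_state(v_mv, v_half_mv=-40.0, slope_mv=5.0):
    """Sigmoidal steady-state activation; v_half_mv and slope_mv are illustrative."""
    return 1.0 / (1.0 + math.exp(-(v_mv - v_half_mv) / slope_mv))


def step_gate(m, v_mv, tau_ms, dt_ms):
    """Advance gate m by dt_ms toward its steady state.

    Only the steady state depends on voltage; tau_ms is a constant, so the
    update is the exact solution of dm/dt = (m_inf(V) - m) / tau for fixed V.
    """
    m_inf = steady_state(v_mv)
    return m_inf + (m - m_inf) * math.exp(-dt_ms / tau_ms)


def relax(m0, v_mv, tau_ms, dt_ms, n_steps):
    """Apply step_gate n_steps times at a clamped voltage and return the gate value."""
    m = m0
    for _ in range(n_steps):
        m = step_gate(m, v_mv, tau_ms, dt_ms)
    return m
```

For example, `relax(0.0, -40.0, tau_ms=0.5, dt_ms=0.01, n_steps=50)` follows the gate for one time constant at the half-activation voltage, where it has covered a fraction 1 - 1/e of the distance to its steady state of 0.5.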
Figure 1: (A) Layer 5 pyramidal cell from cat primary visual cortex intracellularly labeled with HRP and reconstructed in three dimensions (Douglas and Martin 1990a). (B) Simplified compartmental model of the pyramidal cell. (C) Cortical network composed of model neurons. Each of 10 populations (4 shown as rectangular boxes) composed of pyramidal cells (filled shapes) receives input from the LGN. Each pyramidal population sends afferents to all nine other pyramidal populations, and also to the common smooth cell population (box containing open stellate shape). The smooth cell population feeds back to all 10 pyramidal populations.
Table 1: Model Parameters.
Cells / Parameter                                   Value
Pyramidal
  Resting potential                                 -66 mV
  Axial resistance                                  200 Ω-cm
  Specific membrane capacitance                     2 µF/cm2
  Leak conductance                                  0.1 mS/cm2
  Calcium decay time constant                       7-20 msec
  Spike Na conductance                              400 mS/cm2
    τm                                              0.05 msec
    τh                                              0.5 msec
  Delayed rectifier K conductance                   80 mS/cm2
    τm                                              0.5 msec
  Persistent Na conductance                         2 mS/cm2
    τm                                              2 msec
  Calcium-dependent K conductance                   15 mS/cm2
    τm                                              2 msec
  "A-current" K conductance                         2 mS/cm2
    τm                                              20 msec
    τh                                              100 msec
  Calcium conductance                               0.5 mS/cm2
    τm                                              2 msec
  EPSP synaptic conductance (300 spikes/sec)        0.5 nS
    EPSP τm                                         5 msec
    EPSP τh                                         10 msec
  IPSP synaptic conductance (300 spikes/sec)        1 nS
    IPSP τm                                         2 msec
    IPSP τh                                         3 msec
Smooth
  Resting potential                                 -66 mV
  Axial resistance                                  100 Ω-cm
  Specific membrane capacitance                     2 µF/cm2
  Leak conductance                                  0.1 mS/cm2
  Spike Na conductance                              700 mS/cm2
    τm                                              0.05 msec
    τh                                              0.5 msec
  Delayed rectifier K conductance                   400 mS/cm2
    τm                                              0.2 msec
  EPSP synaptic conductance (300 spikes/sec)        0.5 nS
    EPSP τm                                         0.1 msec
    EPSP τh                                         0.1 msec
Axon conduction delay + synaptic delay              2 msec
Smooth cells have shorter spike durations, higher discharge rates, and show less adaptation than regular firing pyramidal cells (Connors et al. 1982). These characteristics were achieved in the model smooth cell by retaining only the spike conductances, gNa and gKd, both of which were large and fast. Each neuron represented the activity of a population of neurons of that type. The activities of these populations were measured as their average action potential discharge rates. The individual spikes of the representative neurons were used to estimate the average discharge rates
of their respective populations. This was done by convolving each of the spikes of the representative neuron with a gamma interspike interval distribution. The shape parameter of the distribution was held constant (a = 2). The mean interspike interval of the distribution was defined as the previous interspike interval of the representative neuron. Thus the interspike interval distribution became more compact at higher discharge frequencies.
The lateral geniculate input to the pyramidal cells was modelled as a continuous discharge rate. The form of the input was a step-like function, and in some cases a noise component was added (Fig. 4B). The synaptic effect of a given population on its target neuron was computed from the average population discharge rate, maximum synaptic conductance, and synaptic conductance time constants (Table 1).
The distributions of inputs from various sources onto visual cortical cells are not accurately known. However, both asymmetric and symmetric synapses tend to cluster on the proximal dendrites of cortical neurons (White 1989), and so in this simplified model we assigned all contacts to the proximal basal dendrites. We assumed that one excitatory synaptic input would contribute a somatic EPSP with a peak amplitude of about 100 µV. Thus, roughly 200 synchronous inputs are required to drive the postsynaptic cell to threshold, and about 600 to reach maximum discharge. In preliminary simulations we confirmed that this range of inputs could effectively drive a postsynaptic neuron if the maximum single synapse excitatory conductance was set to about 0.5 nS. This and all other maximum synaptic conductances were determined at a presynaptic discharge rate of 300 spikes/sec. Anatomical studies indicate (for review of neocortical circuitry see Martin 1988; Douglas and Martin 1990b) that any particular cortical pyramid makes only about one contact with its postsynaptic target.
This means that a reasonable size for the coactivating population is about 600 pyramids, which is about 10% of the total excitatory input to a typical pyramidal cell. In the final simulations the population of 600 pyramidal cells comprised 10 subpopulations of 60 neurons, each population having a different characteristic burst frequency. Each single thalamic afferent supplies only about one synapse to any single postsynaptic neuron. We found that the input of about 40-80 such LGN afferents was suitable for activating the network, if the maximum single thalamic synaptic conductance was also set to about 0.5 nS. This number of cells represents roughly 10% of the total number of LGN contacts received by a pyramidal neuron. The inhibitory population consisted of 100 neurons, each of which made 5 synapses onto each pyramidal target. The maximum single inhibitory synaptic conductance was 1 nS. The model network was simulated using the program CANON (written in TurboPascal by RJD; Douglas and Martin 1990a), which executes on an AT-type microcomputer running under DOS. Simulation of 1 sec of model time required about 3 hr of computation on a 16 MHz 286AT.
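The population-rate estimate described in this section can be sketched in Python as follows. The sketch assumes a causal gamma kernel with shape a = 2 whose mean is set to the interspike interval preceding each spike. Skipping the first spike (which has no preceding interval), the function names, and the units (msec and spikes/msec per neuron) are assumptions of this illustration rather than details of CANON.

```python
import math


def gamma_pdf(t_ms, mean_ms, shape=2.0):
    """Gamma density with the given shape and mean; zero for t_ms <= 0."""
    if t_ms <= 0.0:
        return 0.0
    scale = mean_ms / shape  # mean of a gamma distribution = shape * scale
    return (t_ms ** (shape - 1.0)) * math.exp(-t_ms / scale) / (
        math.gamma(shape) * scale ** shape
    )


def population_rate(spike_times_ms, t_ms, shape=2.0):
    """Estimated average rate (spikes/msec per neuron) at time t_ms.

    Each spike of the representative neuron contributes one gamma kernel
    whose mean equals the interspike interval that preceded that spike.
    The first spike is skipped because it has no preceding interval
    (an assumption of this sketch).
    """
    rate = 0.0
    for previous, current in zip(spike_times_ms, spike_times_ms[1:]):
        interval = current - previous
        rate += gamma_pdf(t_ms - current, interval, shape)
    return rate
```

Because the kernel mean tracks the preceding interval, shorter intervals give a narrower and taller kernel, which is the sense in which the distribution becomes more compact at higher discharge frequencies.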
3 Results and Discussion
Our initial simulations examined the bursting behavior of pyramidal populations in the absence of either excitatory or inhibitory intracortical connections (Fig. 2A,B). All of the pyramidal populations received the same constant thalamic input (Fig. 4B, half amplitude of dashed trace). As anticipated, each of the pyramidal populations displayed bursting activity, and their burst frequencies differed according to their intrinsic characteristics. For example, the characteristic burst frequencies of the two cells shown in Figure 2A,B were 22 and 37 Hz, respectively, when they were stimulated directly using a 1 nA intrasomatic current injection. The same two pyramids displayed burst frequencies of 9 and 15 Hz when activated by this particular geniculate input, and these frequencies are reflected in their power spectra (Fig. 2A,B adjacent to voltage traces). The cross-correlation between these two cells (Fig. 2E, upper trace) has very little power near zero time, confirming the lack of burst synchronization apparent from the time traces. Introduction of excitatory intracortical connections between the pyramidal populations did not improve synchronization. On the contrary, the intracortical reexcitation implicit in these connections drove all of the pyramids to very high discharge rates (compare time traces and power spectra of Fig. 2A,B with C,D). The higher frequency intrinsic bursters fired continuously (Fig. 2D). The cross-correlations between pyramids confirmed the lack of synchronization (Fig. 2E, lower trace; no zero peak). Introduction of a common inhibitory population (Fig. 1C) led to a marked improvement in synchronization of pyramidal burst discharges (compare Fig. 3A,B with Fig. 2A-D). This is reflected in the marked increase in the zero peak of the cross-correlation (Fig. 3D). Comparison of the power spectra of the synchronized cells (Fig. 3A,B) with their uncoupled, uninhibited counterparts (Fig.
2A,B) shows that the synchronization process forces neurons with quite different burst frequencies (9 and 15 Hz in these examples) toward a common burst frequency (averaging 17 Hz in this example). The excitatory connections between the pyramidal populations provide a strong intracortical excitatory component that combines with the geniculate input (Douglas and Martin 1990a). This enhanced average excitation rapidly initiates global bursting, while the common inhibitory feedback truncates the bursts that occur in each population and so improves the synchronization of subsequent burst cycles. The synchronization of bursts is more robust than the periodicity of the bursts (Figs. 3,4). This explains why the cross-correlations have a prominent zero peak, but relatively small side lobes. The interburst interval is dependent on both the postburst hyperpolarization, and the strength of inhibition from the inhibitory population. The latter is in turn dependent on the average size of the previous burst in all the pyramids. This complex interdependence between events in many cells causes the
[Figure 2 image not reproduced: panels A-E, with scale labels 500 ms and 100 Hz.]
Figure 2: Response of partially connected model network to constant thalamic input (half amplitude of dashed trace in Fig. 4B). In this and following figures the power spectra are shown to the right of time traces; cross-correlations are shown at the foot of the figure; the amplitudes of the cortical power spectra are all to the same arbitrary scale, the LGN spectra (Fig. 4B,D) are to a separate scale. (A) Pyramidal cell with no intracortical connections bursts rhythmically at 9 Hz (fundamental frequency in adjacent power spectrum of voltage trace). (B) A different cell oscillates at 15 Hz to same input. Cross-correlation of the output of these two cells (E, upper trace) exhibits no peak at zero time, indicating no correlation between these two signals. (C) Response of the same model cell as in A, but now including reciprocal excitatory connections to all 9 other populations. The consequent enhanced excitation results in higher frequency discharge (50-60 Hz, see adjacent power spectrum). (D) Enhanced excitation causes the intrinsically higher frequency cell shown in B to latch up into continuous discharge. In this example all the power is at 200-300 Hz, which is off-scale. Cross correlation of latter two traces (lower trace in E) indicates that their discharge is not synchronized.
[Figure 3 image not reproduced: panels A-D, with scale labels 500 ms and 50 Hz and a cross-correlation axis running from -500 to 500 ms.]
Figure 3: Response of fully connected model network to constant thalamic input (half amplitude of dashed trace in Fig. 4B). (A,B) Response of same cells shown in Figure 2A,C, but now incorporating common feedback inhibition (Fig. 1C). The bursts in the two populations are synchronized, as indicated by the prominent zero peak in their cross-correlation (upper trace in D; for comparison lower trace is same as Fig. 2E). Their common inter-burst frequency (17 Hz) is reflected in their power spectra. Notice that each cell fires only in synchrony with the others. If a cell misses a burst (arrowed in B) then it fires again only on the next cycle. (C) Response of real complex cell in cat primary visual cortex to optimally orientated moving bar (Douglas and Martin unpublished). Compare with model responses in Figure 3B,C and Figure 4A,C. Note missed burst (arrowed).
Figure 4: Synchronization is not dependent on identical, constant thalamic input. In this example five pyramidal populations were driven by LGN input shown in B, the other five by input shown in D. Inputs are the sum of a constant signal (dashed trace in B) plus noise with 24% power of signal. Comparison of the outputs of the highest and lowest intrinsic frequency cells (A and C) shows that the burst synchronization is not diminished by this procedure. As in Figure 3, superposition of peaks of power spectra and prominent zero peak on the cross-correlogram (E, upper trace) confirm synchronization of pyramidal population discharges. The synchronization is lost if intracortical connections are removed (E, lower trace). Notice that the peaks of power spectra of LGN input (B and D) do not coincide with those of the model output (A and C), indicating that the cortical interburst frequency (averaging 23 Hz) is insensitive to the spectral characteristics of the LGN input.
interburst interval within a particular population to vary chaotically, even in the presence of constant thalamic input. This behavior is very similar to that seen in real cortical neurons. For example, compare the response of the model pyramids (Figs. 3A,B; 4A,C) with the intracellular signal derived from a real complex cell during presentation of a preferred visual stimulus (Fig. 3C). This in vivo recording was made in a layer 3 neuron of cat primary visual cortex (Douglas and Martin unpublished). Figure 4 shows the results of a simulation in which noise was added to the output of the geniculate populations. Comparison of Figure 4A,C with Figure 3A,B and the presence of a strong peak at zero time in the cross-correlation (Fig. 4E, upper trace) indicate that burst synchronization was remarkably resistant to the noise superimposed on the geniculate input. The higher average burst frequency (23 Hz) compared with the noise-free case (17 Hz) (Fig. 3A,B) is due to a larger geniculate signal. The synchronization evolved rapidly, and was well established within about 100 msec (2 bursts) of the onset of the pyramidal response. No particular population led the bursting of the network. Instead the phase relations between any two populations changed chaotically from cycle to cycle so that the average phase between the cells remained zero (Fig. 4, cross-correlation), as is seen in vivo (Engel et al. 1990). The onset of discharge in the inhibitory population necessarily lags behind the onset of the earliest bursts in pyramidal populations. However, the onset of inhibitory discharge occurred within about 5 msec of the onset of the earliest pyramidal bursts and before the onset of the latest burst. Thus we do not expect that the phase shift of inhibitory cells with respect to pyramidal cells will be easy to detect experimentally.
We found that the performance of the model was rather insensitive to the detailed cellular organization of the network as specified by the number of cells per population, number of synaptic contacts, and magnitude of synaptic effect. The crucial organizational principle was the presence of cortical reexcitation and adequate global feedback inhibition. Evidence for these circuits has been found in intracellular recordings from cat visual cortex (Douglas and Martin 1990a). This finding is consistent with the mathematical proof of Kammen et al. (1990) that a number of oscillatory units driving a common feedback comparator can converge to a common oscillatory solution. However, our results indicate that synchronization occurs even in the presence of chaotic bursting discharge, when oscillation is not a prominent feature. Our results bear a qualitative similarity to those of Traub et al. (1987). The main difference is that in the case presented here fast, concentrated inhibition produces tightly coherent, high-frequency (20 rather than 2-3 Hz) bursting across the pyramidal populations. Much has been written recently concerning oscillations in the neocortex. However, burst synchronization is the most compelling feature of our model cortical network. The interburst intervals were not regular; instead they varied chaotically. Consequently the power spectra of the discharges
were broad, as has been noted in vivo (Freeman and van Dijk 1987). Synchronous bursts from large populations converging on a postsynaptic target cell will produce very large transient depolarizations, which would be optimal for activating NMDA receptors. This suggests the possibility that learning occurs at times of synchronization. Coherent bursting may permit selective enhancement of synapses of common target neurons. If all inputs to the target cell are bursting rather than constant the chances of false correlations between different coherent populations are reduced. Moreover, varied interburst intervals could help to avoid phase locking between different sets of rhythmic signals impinging on the common target.
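The zero-time peak used above as the signature of burst synchronization can be illustrated on synthetic data. The following is a minimal sketch and not the model simulation itself: the helper names `burst_train` and `cross_correlogram`, the 1 ms bin width, the 10 ms burst length, the 30-80 ms irregular interburst intervals, and the 2 ms onset jitter are all assumptions chosen only to give burst rates of the same order as those quoted in the text.

```python
import numpy as np

def burst_train(onsets, n_bins, burst_len=10):
    """Binary signal (1 ms bins, assumed) that is 1 during each burst."""
    x = np.zeros(n_bins)
    for s in onsets:
        x[s:s + burst_len] = 1.0
    return x

def cross_correlogram(x, y, max_lag):
    """Mean-subtracted cross-correlation for lags -max_lag..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    full = np.correlate(x, y, mode="full")   # lags -(n-1) .. (n-1)
    mid = len(x) - 1                         # index of zero lag
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, full[mid - max_lag: mid + max_lag + 1]

rng = np.random.default_rng(0)
n_bins = 5000

# Irregular interburst intervals, so there is no strict rhythm.
onsets = np.cumsum(rng.integers(30, 80, size=60))
# Second population: same burst onsets, jittered by at most 2 bins.
jittered = onsets + rng.integers(-2, 3, size=onsets.size)
# Third population: independent burst onsets.
other = np.cumsum(rng.integers(30, 80, size=60))

x = burst_train(onsets, n_bins)
y = burst_train(jittered, n_bins)
z = burst_train(other, n_bins)

lags, cc_sync = cross_correlogram(x, y, 200)
_, cc_indep = cross_correlogram(x, z, 200)
```

In this sketch the correlogram of the two populations that share burst onsets peaks within a few bins of zero lag, whereas the independent pair has no comparable central peak. Because the intervals are irregular, the side peaks tend to be smeared out, which is one way to see why a sharp central peak can coexist with a broad power spectrum.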
4 Conclusion

Our results suggest that the synchronized bursting observed in vivo between cortical neurons responding to coherent visual stimuli is a simple consequence of the known principles of intracortical connectivity. Two principles are involved. First, intracortical reexcitation by pyramidal collaterals amplifies the geniculate input signal and drives the coactivating pyramidal cells into strong coherent discharge. Second, global feedback inhibition converts the integrated burst discharges into a global reset signal that synchronizes the onset of the subsequent cycle in all the bursting pyramidal cells. Future work must investigate the processes that dynamically connect and disconnect populations of neurons to form coherent networks, the elements of which are then synchronized by the mechanisms outlined in this article.

Acknowledgments

We thank John Anderson for technical assistance. P.C.B. acknowledges the support of the McDonnell-Pew Foundation. R.J.D. acknowledges the support of the Mellon Foundation, the UK MRC, and the SA MRC.

References

Connors, B. W., Gutnick, M. J., and Prince, D. A. 1982. Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol. 62, 1149-1162.
Crick, F. 1984. Function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. U.S.A. 81, 4586-4590.
Douglas, R. J., and Martin, K. A. C. 1990a. A functional microcircuit for cat visual cortex. J. Physiol., submitted.
Douglas, R. J., and Martin, K. A. C. 1990b. Neocortex. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 389-438. Oxford University Press, New York.
Engel, A. K., König, P., Gray, C. M., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 586-606.
Freeman, W. J., and Dijk, B. W. v. 1987. Spatial patterns of visual cortical fast EEG during conditioned reflex in a rhesus monkey. Brain Res. 422, 267-276.
Getting, P. A. 1989. Reconstruction of small neural networks. In Methods in Neuronal Modelling, C. Koch and I. Segev, eds., pp. 171-194. MIT Press/Bradford Books, Cambridge, MA.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Gray, C. M., Engel, A. K., König, P., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci. 2, 607-619.
Kammen, D., Holmes, P., and Koch, C. 1990. Collective oscillations in neuronal networks. In Advances in Neural Information Processing Systems, Vol. 2, D. Touretzky, ed., pp. 76-83. Morgan Kaufmann, San Mateo, CA.
Malsburg, C. v. d., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Martin, K. A. C. 1988. The Wellcome Prize Lecture: From single cells to simple circuits in the cerebral cortex. Q. J. Exp. Physiol. 73, 637-702.
Sporns, O., Gally, J. A., Reeke, G. N., Jr., and Edelman, G. M. 1989. Reentrant signalling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Traub, R. D., Miles, R., Wong, R. K. S., Schulman, L. S., and Schneiderman, J. H. 1987. Models of synchronized hippocampal bursts in the presence of inhibition. II. Ongoing spontaneous population events. J. Neurophysiol. 58, 752-764.
White, E. L. 1989. Cortical Circuits: Synaptic Organization of the Cerebral Cortex - Structure, Function and Theory. Birkhäuser, Boston.
Received 5 October 1990; accepted 26 October 1990.
Communicated by Christoph von der Malsburg
Parallel Activation of Memories in an Oscillatory Neural Network

D. Horn
M. Usher
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel

Neural Computation 3, 31-43 (1991) © 1991 Massachusetts Institute of Technology

We describe a feedback neural network whose elements possess dynamic thresholds. This network has an oscillatory mode that we investigate by measuring the activities of memory patterns as functions of time. We observe spontaneous and induced transitions between the different oscillating memories. Moreover, the network exhibits pattern segmentation, by oscillating between different memories that are included as a mixture in a constant input. The efficiency of pattern segmentation decreases strongly as the number of the input memories is increased. Using oscillatory inputs we observe resonance behavior.

1 Introduction

Attractor neural networks perform the task of pattern retrieval. This is usually carried out in the following way. One incorporates the pattern in question as a memory in the connections of a feedback network. Starting the network dynamics from an initial condition that is a distorted version of one of the memories, the network should flow into a fixed point corresponding to that memory. Once the fixed point is reached, pattern retrieval is accomplished. In a recent article, Wang, Buhmann and von der Malsburg (1990) suggested a modification of this procedure that addresses a slightly different question. Suppose one is given an input that is a composition of several of the memories. How can the network recognize all the individual components in parallel? This requires the network to retrieve all memory components while conserving the holistic property of the input. Their solution is based on a network of oscillators that are constructed by appropriately connected neurons. Their network, when presented with a continuous input that is a superposition of memories, oscillates in patterns determined by the mixed input, shifting from one to another as time goes on. In other words, it achieves pattern segmentation. This new type of behavior can also be easily obtained in a model that we have proposed (Horn and Usher 1989). In fact, it is a special case of a general phenomenon of associative transitions that exists in this model.

The model consists of binary neurons that possess a new degree of freedom, a dynamic threshold. This threshold changes according to the firing history of the neuron to which it belongs. Such behavior was recently studied by Abbott (1990). A feedback network composed of these neurons can develop oscillatory behavior as a cooperative effect. Memories are defined in a standard fashion, for example, as in the Hopfield model (Hopfield 1982). When the network is in its oscillatory mode, an activated memory pattern may oscillate indefinitely. The synaptic connections can be defined so as to include couplings between different memories that serve as associations or pointers. These lead to transitions from one memory to another, creating motion in pattern space, the space of all memories. (Since many pointers can emerge from every memory pattern, the orbit of this motion is unpredictable.) Using a continuous input composed of several memory patterns we find similar behavior: oscillation of the relevant memories and motion between them, this time caused by the external input.

The interest in oscillating networks has increased after the observation of frequencies in the range of 40-70 Hz in correlations between neurons in the visual cortex (Eckhorn et al. 1988; Gray et al. 1989). In our model we distinguish between oscillations of the neurons and those of the patterns. The activity of a memory pattern is a global variable driven by the behavior of many neurons. This is the relevant variable in our analysis; it is readily available in numerical simulations, and, under some approximations, it can be studied analytically. A model that contains a small number of random memory patterns can be approximated by a set of differential equations for the activities of the patterns. When their number is bigger than four, we have to resort to numerical simulations.
By investigating pattern activities, we demonstrate the properties described above. We define a criterion for optimal segmentation efficiency, requiring a high activity level and equal time sharing of the different memories included in the input, and show that it decreases fast as the number of such memories is increased. We find better performance for random patterns with negative asymmetry. Of particular interest is the case when one of the patterns in the input is modulated by an oscillatory function. The result is that the system displays resonating behavior. If the input frequency matches that of the neural network it will activate just the one relevant memory pattern.
2 Neural Networks with Dynamic Thresholds
We study feedback neural networks whose binary elements are denoted by $S_i = \pm 1$ for $i = 1, \ldots, N$. These neurons interact with one another through the synaptic weights $J_{ij}$. In the Hopfield model (Hopfield 1982) the equation of motion for such a system is given by

$$S_i(t+1) = F_T\Big[\sum_j{}' J_{ij} S_j(t)\Big] \qquad (2.1)$$

where the prime designates the fact that $j \neq i$. $F_T$ is the appropriate statistical choice for temperature $T$:

$$F_T(x) = \pm 1 \quad \text{with probability} \quad \left(1 + e^{\mp 2x/T}\right)^{-1} \qquad (2.2)$$

$t$ is the discrete time variable used in simulations. The updating procedure can be either random, sequential, or parallel. In our generalization (Horn and Usher 1989) we have introduced threshold parameters $\theta_i$ that change with time in a fashion that depends on the history of the neuron $S_i$ at the same location. The equations of motion of our system are then

$$S_i(t+1) = F_T\left[h_i(t) - \theta_i(t)\right], \qquad h_i(t) = \sum_j{}' J_{ij} S_j(t) \qquad (2.3)$$
Let us choose the dynamic threshold as

$$\theta_i(t) = b R_i(t) \qquad (2.4)$$

where $R_i$ is defined in the following recursive fashion:

$$R_i(t+1) = \frac{R_i(t)}{c} + S_i(t+1) \qquad (2.5)$$

This is an effective integration of the neuron activity over time that saturates at the value $\pm c/(c-1)$ if the neuron stays constant at $\pm 1$. $c$ is chosen to be larger than 1. For appropriate values of the combination $g = bc/(c-1)$, the threshold destabilizes the tendency of the system to stay in a fixed point and leads to oscillatory behavior. In the Hopfield model the weights are chosen in a factorized symmetric fashion

$$J_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu \qquad (2.6)$$

The binary vectors $\xi_i^\mu$ are the input patterns (memories) of the model that, under appropriate conditions (Amit et al. 1985), form the fixed points into which the system flows. To characterize the dynamics it is useful to define overlap functions, or activities:

$$m^\mu(t) = \frac{1}{N} \sum_i \xi_i^\mu S_i(t) \qquad (2.7)$$

These are the global variables that measure to what extent the firing pattern of the neural network coincides with specific memory patterns $\mu$. In a typical network without threshold dynamics one of these overlaps will flow to 1 (or -1) and the rest to 0. With threshold dynamics as described above one obtains oscillatory behavior for these variables. To study pattern segmentation we follow Wang et al. (1990) and allow for continuous input into the network. If the input consists of an equal mixture of the first $n$ memories, the equations of motion become

$$S_i(t+1) = F_T\left[h_i(t) - \theta_i(t) + I_i(t)\right], \qquad I_i(t) = \epsilon \sum_{\mu=1}^{n} \xi_i^\mu \qquad (2.8)$$
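The dynamics defined by equations 2.3-2.5, together with the constant input of equation 2.8, can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' simulation program: the function name `simulate`, parallel updating, the initial conditions ($S$ equal to the first memory, $R_i = 0$), Hopfield weights of the standard factorized form, and the deterministic branch used when `T = 0` are all choices made here for illustration.

```python
import numpy as np

def simulate(xi, b, c, T, eps=0.0, n_input=0, steps=100, seed=0):
    """Parallel-update sketch of a network with dynamic thresholds.

    xi      : (p, N) array of +/-1 memory patterns
    b, c    : threshold parameters; g = b*c/(c-1)
    T       : temperature (T = 0 is treated as the deterministic limit)
    eps     : strength of the constant input
    n_input : number of leading memories mixed into the input
    Returns the overlaps m^mu(t) as an array of shape (steps + 1, p).
    """
    rng = np.random.default_rng(seed)
    p, N = xi.shape
    J = xi.T @ xi / N                     # factorized symmetric weights
    np.fill_diagonal(J, 0.0)              # primed sum: no self-coupling
    I = eps * xi[:n_input].sum(axis=0)    # constant input to every neuron
    S = xi[0].astype(float).copy()        # assumed start: memory 1
    R = np.zeros(N)                       # assumed start: no firing history
    m = [xi @ S / N]
    for _ in range(steps):
        x = J @ S - b * R + I             # h_i - theta_i + I_i
        if T > 0:
            prob_up = 1.0 / (1.0 + np.exp(-2.0 * x / T))
            S = np.where(rng.random(N) < prob_up, 1.0, -1.0)
        else:
            S = np.where(x >= 0, 1.0, -1.0)
        R = R / c + S                     # running integration of activity
        m.append(xi @ S / N)
    return np.array(m)

# Deterministic example with one stored pattern and g = 0.25*1.2/0.2 = 1.5.
xi_demo = np.where(np.arange(200) % 2 == 0, 1.0, -1.0)[None, :]
m_demo = simulate(xi_demo, b=0.25, c=1.2, T=0.0, steps=40)
m_static = simulate(xi_demo, b=0.0, c=1.2, T=0.0, steps=40)
```

In the deterministic example the threshold $bR_i$ grows until it exceeds the local field, at which point every neuron flips and the overlap jumps from $+1$ to $-1$; with $b = 0$ the network simply stays in the stored memory. A stochastic run with an input mixture could be started with, for instance, `simulate(xi, b=0.15, c=1.2, T=0.4, eps=0.25, n_input=3, steps=125)` for a `(4, N)` array `xi` of random patterns.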
Here we chose for simplicity a constant input regulated by one parameter. It contributes to the excitations of all the memories with labels $1 \le \mu \le n$. We will search for conditions under which all these memories perform oscillations driven by this input. We will also discuss the situation in which the input has one element with sinusoidal time behavior that causes the system to resonate.

A small modification allows the treatment of asymmetric patterns (Amit et al. 1987) for which

$$a = \langle \xi^\mu \rangle < 0 \qquad (2.9)$$

We will use examples where $a = -0.6$, that is, only 20% of the neurons in the memories are positive (firing). As a result there exists considerable overlap between the various patterns, 0.36 on the average. The dynamics of the model have to be modified by changing the local field $h_i$ of (2.3) into

$$h_i = \sum_\mu \xi_i^\mu \left(m^\mu - aM\right) - G\,(M - a) \qquad (2.10)$$

$M$ is defined by $M = \frac{1}{N}\sum_i S_i$, and the new term with $G$ is needed to implement the constraint on $\langle \xi^\mu \rangle$. In our examples it will have the value $G = 2$. Whereas for $a = 0$ the activities can oscillate between $\pm 1$, once $a < 0$ the amplitude of the activity will have asymmetric oscillations. The reason is that in this formulation, the negative of a memory is no longer an attractor.

3 Oscillatory Mean Field Equations
In our study of the problem of several memories we have shown that a simple analytic approximation leads to a correct qualitative description of our system. This is a mean-field approximation, in which one obtains equations of motion for the global variables of interest, the activities (or overlaps) $m^\mu$. Assuming the existence of $p$ memory patterns, and using the simplified representation of the $R_i$ variables as an expansion in the basis of these patterns,

$$R_i = \sum_\mu r^\mu \xi_i^\mu \qquad (3.1)$$

we obtained (Horn and Usher 1989) the relations

(3.2)

(3.3)
where the prime designates the fact that when $\nu = \mu$ only the factor $\eta^\mu = +1$ should be used. The last equation contains on the right-hand side all $2^{p-1}$ possible combinations of the $\nu \neq \mu$ amplitudes with weights of $\pm 1$. These mean-field equations are a valid qualitative description of our system using a parallel updating procedure. Random updating can be approximated by a set of differential equations (where time becomes continuous)

(3.4)
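A sketch of how mean-field relations of this type can arise from equations 2.3-2.5 may help in reading this section. It rests on several assumptions: unbiased random patterns ($a = 0$), parallel updating, large $N$ so that the excluded self-coupling is negligible and different patterns are effectively orthogonal, Hopfield weights so that $h_i \approx \sum_\nu \xi_i^\nu m^\nu$, and the expansion 3.1. The shorthand $A^\nu$ is introduced here for illustration only, and the result need not coincide in form or notation with the relations of Horn and Usher (1989). Projecting the recursion 2.5 onto pattern $\mu$, and averaging 2.3 over the thermal noise using $\langle F_T(x) \rangle = \tanh(x/T)$, which follows from 2.2, gives

$$r^\mu(t+1) = \frac{r^\mu(t)}{c} + m^\mu(t+1)$$

$$m^\mu(t+1) = \Big\langle \xi^\mu \tanh\Big[\frac{1}{T} \sum_\nu \xi^\nu A^\nu(t)\Big] \Big\rangle_\xi, \qquad A^\nu(t) = m^\nu(t) - b\, r^\nu(t)$$

where $\langle \cdot \rangle_\xi$ is the average over the $2^p$ equally likely sign combinations of $(\xi^1, \ldots, \xi^p)$. Because $\tanh$ is odd and the signs of the $\nu \neq \mu$ terms are symmetric, the two values of $\xi^\mu$ contribute equally, leaving a sum over the remaining $2^{p-1}$ combinations:

$$m^\mu(t+1) = 2^{-(p-1)} \sum_{\eta^\nu = \pm 1,\ \nu \neq \mu} \tanh\Big[\frac{1}{T}\Big(A^\mu(t) + \sum_{\nu \neq \mu} \eta^\nu A^\nu(t)\Big)\Big]$$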
We have solved such equations numerically for $p \le 4$ and compared the results with simulations of corresponding networks with random memory patterns. For higher $p$ values we resorted to simulations. The characteristic behavior we find is dominant oscillations of one pattern accompanied by small oscillations of all other activities. Starting with an initial condition that is a mixture of several memories we can obtain large oscillations of two pattern activities but not more than that. Introducing an external input as in (2.8), the equations are modified in a simple way. Using a constant input for the first $n$ memories, (3.4) changes into

(3.5)
Figure 1 shows the results for $p = 4$, $n = 1$. The parameters of the network are chosen to be in the oscillatory mode. Equation 3.5 is solved with initial conditions of $m^2 = 1$ and all other $m^\mu$ vanishing. At time $t = 30$ the input field $\epsilon$ of the first memory pattern is turned on. Pattern number 1 becomes the dominant oscillator and the strength of pattern 2 is reduced. As a matter of fact, the amplitudes of the two oscillations are comparable yet pattern 1 is larger because of the positive bias of the input. The oscillations of pattern 1 before $t = 30$ are characteristic of the oscillations of all other patterns, which are neither in the initial condition nor in the input.

Figure 1: Solution of equation 3.5 for the case $p = 4$, $n = 1$. The parameters were $b = 0.15$, $c = 1.2$ (leading to $g = 0.9$), and $T = 0.4$, $\epsilon = 0.25$. The initial condition was $m^2 = 1$ and all other $m^\mu = 0$. The activity of pattern 2 is represented by a dashed line. The input was activated at $t = 30$, leading to a dominating $m^1$, which is represented by a solid line.

In Figure 2 we investigate a situation in which more memories are activated by the input. Shown here are the activities of all the $p = 4$ patterns in a case in which the first $n = 3$ are used as input. The activity of pattern 4 dies out while the three other patterns oscillate in a phase-locked fashion. Similar behavior is obtained in simulations of a neural network updated sequentially by the set of equations 2.8 with the same parameters. We see that after a transitory period the system moves into a phase-locked behavior. This regularity is lost in the cases discussed in the next section, where the number of input patterns $n$ is increased and the memory patterns are asymmetric.

4 Optimal Segmentation Efficiency
Figure 2: Using $\epsilon = 0.25$ and $n = 3$ we show the results for the same system as in the previous figure. Here we plot the activities of all the memory patterns as functions of time $t$. The one pattern that is not included in the input is represented by a dotted curve. It tends to zero, whereas all other activities display phase-locked oscillations.

Numerical simulations of networks based on random symmetric memories, that is, such that $a = \langle \xi^\mu \rangle = 0$, have shown very good segmentation efficiency for $n = 3$ patterns but quick deterioration as the number of input patterns is increased. This means that the amplitudes of the oscillations decreased or only one memory dominated over all others. We find better behavior for larger $n$ in studies of asymmetric patterns. A typical example for a system with $a = -0.6$ is shown in Figure 3. This is a case where $n = 6$ out of $p = 14$ memory patterns of the model were mixed in the input. All six memories get activated but not in a uniform fashion. To be able to compare different cases we introduce the following order parameter, or figure of merit, which puts the emphasis on retrieval and uniformity:
$$S = \lim_{t \to \infty} \Big[\prod_{\mu=1}^{n} n F^\mu\Big]^{1/n} \qquad (4.1)$$

$F^\mu$ stands for the fraction of the time $t$ in which the neural network is in memory pattern $\mu$, for example, obeying $m^\mu > A = 0.75$. Clearly the cutoff $A$ is arbitrary, and could be chosen higher or lower. The particular form of $S$ could also be modified. We find this one to be the simplest to measure segmentation efficiency. It has the following important properties: it vanishes if any of the patterns does not reach the activity $A$, it is maximal ($S = 1$) if the network finds itself always in one of the selected memories and all share time equally (i.e., $F^\mu = 1/n$), and it maintains the same value for different $n$ if the fractions $F$ scale like $1/n$. $S$ depends quite strongly on the different parameters of our system. We are interested in finding the conditions under which $S$ is large and maintains a reasonable value for as many $n$ as possible. Some of our best results are shown in Figure 4. Note first the sharp decrease in the case of symmetric memories ($a = 0$). Much better results can be obtained for asymmetric ones. We see in this figure that the increase of the total number of memories, $p$, leads to only a small decrease in $S$; but changing the threshold parameter $g$ has a very strong effect. Our best results for $a = -0.6$ were obtained with $g = 0.9$. Even then, to obtain sizable $S$ one is limited to a small range of $n$. We conclude that pattern segmentation is efficient only if the input contains a small number of memories. It is tempting to speculate that this feature can provide an explanation for the known limits on attention capacity. This can be the case if attention is understood as a mechanism that differentiates several patterns present in an external input. Such a process is represented here by oscillating activities of memory patterns with different phases.
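The properties listed for the figure of merit can be checked numerically. The sketch below assumes the form $S = [\prod_\mu n F^\mu]^{1/n}$, estimates the limit by a finite-time average, and takes as input an array of recorded activities $m^\mu(t)$ for the $n$ input memories; the function name `segmentation_efficiency` and the toy activity traces are illustrative assumptions.

```python
import numpy as np

def segmentation_efficiency(m, A=0.75):
    """Finite-time estimate of the figure of merit S.

    m : array of shape (time_steps, n) with the activities of the n
        memories that are included in the input.
    A : activity cutoff above which the network counts as being
        "in" a memory pattern.
    """
    m = np.asarray(m, dtype=float)
    n = m.shape[1]
    F = (m > A).mean(axis=0)        # fraction of time spent in each memory
    return float(np.prod(n * F) ** (1.0 / n))

# Perfect time sharing among n = 3 memories: F = 1/3 each.
equal = np.zeros((6, 3))
equal[np.arange(6), np.arange(6) % 3] = 1.0

# One memory never reaches the cutoff.
missing = equal.copy()
missing[:, 2] = 0.0

# Fractions that scale like 1/n: F = 1/4 for n = 2 and F = 1/8 for n = 4.
two = np.zeros((8, 2))
two[[0, 1], 0] = 1.0
two[[2, 3], 1] = 1.0
four = np.zeros((8, 4))
four[np.arange(4), np.arange(4)] = 1.0

S_equal = segmentation_efficiency(equal)
S_missing = segmentation_efficiency(missing)
S_two = segmentation_efficiency(two)
S_four = segmentation_efficiency(four)
```

Under these assumptions the equal-sharing trace gives $S = 1$, the trace with an inactive memory gives $S = 0$, and the last two traces give the same value although $n$ differs, in line with the three properties stated in the text.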
Figure 3: The six activities of the patterns included in the input of a $p = 14$ network are shown as a function of time. This is a simulation of a system of $N = 1000$ neurons, using random patterns with $a = -0.6$ and parameters $T = 0.2$, $\epsilon = 0.2$, $g = 0.9$. The scale of the activities varies between 0 and 1.

Figure 4: Segmentation efficiency for three sets of parameters. Solid lines are systems with $p = 14$ patterns, dashed lines denote $p = 24$, and dot-dash lines $p = 34$. All are simulations of $N = 1000$ neurons with $T = 0.2$. The $a = 0$ system had input strength $\epsilon = 0.25$ while for the $a = -0.6$ cases we used $\epsilon = 0.2$. The two $a = -0.6$ cases differ by their $g$ values.
5 Resonances

As in any oscillatory system it is natural to expect that this neural network has interesting resonance properties. Looking, for example, at Figure 3, we see that the system displays a natural period of approximately 10 time units. How will it then react if one of the memories is introduced into the input not as a constant in time but as a sinusoidal amplitude with a period of 10? The answer is striking. Choosing the first memory to be the one which is sinusoidal,

(5.1)

we find that in spite of its relative inactivity in Figure 3 it becomes the dominant pattern in the behavior of the system. In fact, it is the only one with activity larger than $A$. In other words, the system chooses to resonate in tandem with the one input that oscillates in time. The resonance effect depends on matching the frequency of the input signal to that of the system. Varying the frequency we find, indeed, a decrease of the resonance effect. As a quantitative measure we use our figure of merit for segmentation, and apply it to all the $n - 1$ memories included as constants in the input. In other words, we measure

$$S = \lim_{t \to \infty} \Big[\prod_{\mu=2}^{n} n F^\mu\Big]^{1/(n-1)} \qquad (5.2)$$

The result is shown in Figure 5. This quantity tends to zero as the frequency hits the resonance, signifying that only the resonating memory is activated. As one goes to lower or to higher frequencies the usual segmentation qualities reappear.

Figure 5: The segmentation efficiency $S$ of the $\mu = 2, \ldots, 6$ patterns in the system discussed in Section 5 is shown here as a function of the frequency of the sine wave modulating the input of pattern 1.

6 Discussion
The advantage of oscillatory neural networks over the conventional dissipative ones resides in their complex temporal behavior. This allows for new ways of handling information. Thus one can activate several memory patterns in parallel, each one being represented by an oscillation with a different phase. In our model the oscillations come about as cooperative effects of the neural network that are triggered by dynamic thresholds. The type of time-varying response that we assign to the thresholds could also be associated with synaptic connections, which
would make the analysis more complicated. We view our model as representing a class of models in which the elements of the neural network have variable properties that depend on the history of their own activity. Once the network is in its oscillatory phase, the individual neurons develop periodic correlations with themselves as well as with other neurons. We can measure such effects in our network by calculating (6.1) Such results are displayed in Figure 6, which shows the correlations between three neurons in the n = 6 system studied in Figure 3. Stronger periodic correlations can be found in n = 3 phase-locked systems. The relevant information, as far as retrieval of memories is concerned, is not in the oscillations of the neurons but in the oscillations of the pattern activities. The latter may be difficult to extract from neurophysiological experiments but are readily available in model calculations. The ability of pattern segmentation is, as pointed out by Wang et al. (1990), a very interesting property of oscillating networks. We have described its characteristics and limitations within our model. Clearly it works well only if a small number of patterns is involved in the
Figure 6: Correlations between three neurons in a p = 14, n = 6 system with parameters T = 0.2, g = 0.9, E = 0.2, n = -0.6. Time is the parameter T of equation 6.1.
segmentation task. We have speculated that this phenomenon can provide an explanation for the limits on attention capacity. Within our model one can easily combine segmentation with association. The latter is implemented (Horn and Usher 1989) by adding associative transitions to the synaptic weights:
(6.2)
The coefficients d_μν serve as pointers. For simplicity we may consider λ as an overall association strength and choose d_μν as coefficients that are either 1 or 0. Whenever d_μν > 0 it introduces a bias toward transition from pattern ν to pattern μ. Operating such a network with a given input one introduces a bias not only toward the memories that exist in the input, but also toward the families of memory patterns connected to them by pointers. Both the input and the pointers can be viewed as sources of induced transitions in our system.

Especially interesting is the resonance property of our model. This is to be expected from such networks, and may have a special role in neurobiology (Eckhorn et al. 1988; Stryker 1989; Crick and Koch 1990). There are two aspects of a resonance that should be interesting. One is the ability of a weak signal to induce a strong effect, as is manifested in the example discussed in the previous section. We may imagine a signal generated in one cortical area and transmitted to another area. If the signal is periodic and it strikes a resonance at the receiving end, the transmission can be easily achieved. The other aspect is the fact that a resonance may be regarded as an element with special identity and meaning. This goes beyond the scope of our model. In particular, the frequency of our resonance does not bear any information. The latter is carried only by the spatial distribution of the pattern that oscillates. We may, however, imagine an ensemble of networks of the kind we discussed, sending signals to one another, such as a group of cortical areas. Using the resonances in each of them one can envisage the creation of a resonating wave. This can have special cognitive meaning depending on the nature of the networks that are activated and the environment to which they belong.

References

Abbott, L. F. 1990. A network of oscillators. J. Phys. A 23, 3835-3859.
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Spin glass models of neural networks. Phys. Rev. A 32, 1007-1018.
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Storing infinite numbers of patterns in a spin glass model of neural networks. Phys. Rev. Lett. 55, 1530-1533.
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1987. Information storage in neural networks with low levels of activity. Phys. Rev. A 35, 2293-2303.
Crick, F., and Koch, C. 1990. Towards a neurobiological theory of consciousness. Sem. Neurosci. 2, 263-275.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Horn, D., and Usher, M. 1989. Neural networks with dynamical thresholds. Phys. Rev. A 40, 1036-1044.
Stryker, M. P. 1989. Is grandmother an oscillation? Nature (London) 338, 297-298.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Received 23 July 90; accepted 12 November 90.
Communicated by Oliver Braddick
Organization of Binocular Pathways: Modeling and Data Related to Rivalry

Sidney R. Lehky
Laboratory of Neuropsychology, National Institute of Mental Health, Building 9, Room 1N-207, Bethesda, MD 20892 USA

Randolph Blake
Department of Psychology, Vanderbilt University, Nashville, TN 37240 USA
It is proposed that inputs to binocular cells are gated by reciprocal inhibition between neurons located either in the lateral geniculate nucleus or in layer 4 of striate cortex. The strength of inhibitory coupling in the gating circuitry is modulated by layer 6 neurons, which are the outputs of binocular matching circuitry. If binocular inputs are matched, the inhibition is modulated to be weak, leading to fused vision, whereas if the binocular inputs are unmatched, inhibition is modulated to be strong, leading to rivalrous oscillations. These proposals are buttressed by psychophysical experiments measuring the strength of adaptational aftereffects following exposure to an adapting stimulus visible only intermittently during binocular rivalry.

Neural Computation 3, 44-53 (1991) © 1991 Massachusetts Institute of Technology

1 Introduction

Binocular rivalry refers to the alternating periods of dominance and suppression that occur when unmatched stimuli are presented to the two eyes. For example, if a vertical grating is presented to one eye and a horizontal grating to the other, then vertical and horizontal stripes are seen successively in an oscillating manner, and not both simultaneously in the form of a plaid. This distinctive response to uncorrelated images may help us understand how images are matched during stereopsis, and in general shed light on the organization of binocular vision. In this article we attempt to connect psychophysically based models of rivalry with the anatomy and physiology of early visual pathways, in particular the microcircuitry of the striate cortex and lateral geniculate nucleus.

Neural models of binocular rivalry (Matsuoka 1984; Lehky 1988; Blake 1989; Mueller 1990), while differing in emphasis and detail, have agreed that the suppressive circuitry by which the signal from one eye blocks that of the other involves monocular neurons organized to form reciprocal feedback inhibition between left and right sides, prior to binocular convergence. Figure 1 shows one form of such a circuit. The inhibitory neurons are assumed to be monocularly driven because that is the most parsimonious arrangement by which inputs from one eye could selectively suppress inputs from the other. The circuit is believed to involve feedback rather than feedforward inhibition in order to produce oscillations.

Figure 1: Schematic neural circuit for binocular rivalry. It shows the "gating circuitry," consisting of reciprocal inhibition between left and right sides prior to binocular convergence. Inhibitory neurons are indicated by filled circles. Not shown is the "matching circuitry" that controls the gate by modulating the strength of reciprocal inhibition, depending on whether binocular inputs match or not. It is proposed that the gating circuitry is located in either the lateral geniculate nucleus or layer 4 of striate cortex, and that the effects of the matching circuitry on the gating circuitry are mediated by outputs of striate layer 6.

However, the psychophysical data outlined below cause problems for this class of model. These data compare the strength of various adaptational aftereffects when the adapting stimulus is either continuously visible or only intermittently visible during rivalry. One might think that since a stimulus is visible for less time during rivalry, it would cause less
adaptation than a continuously visible one. However, all studies to date indicate that this is not so. These include measurements of

1. Contrast threshold elevation aftereffect (Blake and Fox 1974a; Blake and Overton 1979)
2. Spatial frequency shift aftereffect (Blake and Fox 1974b)
3. Tilt aftereffect (Wade and Wenderoth 1978)
4. Motion aftereffect (Lehmkuhle and Fox 1975; O'Shea and Crassini 1981).

In every case the strength of the aftereffect is the same whether or not the adapting stimulus is undergoing rivalry. The implication is that all adaptation to all classes of stimuli occurs early in visual pathways, prior to the site of the monocular suppressive circuitry. This seems implausible given the complexity of the aftereffects (orientation specific, direction specific). Both Lehky (1988) and Blake (1989) have highlighted this set of adaptation data as troublesome for binocular models.

It happens, however, that all the above studies measured aftereffects to rivalrous stimuli visible 50% of the time. Although no decrease in adaptation was apparent under that condition, perhaps a decrease could be found under a more extreme condition, comparing a continuously visible stimulus with a rivalrous stimulus visible, say, only 10% of the time. In that situation we find that there is indeed a significant difference in adaptation strength, as measured by contrast threshold elevations to gratings. Before discussing this experiment and its implication for binocular vision, we shall outline a simple model that allows quantitative interpretation of the data.

2 Data Analysis Model
The basic assumption is that inhibition during rivalry has one particular locus, whereas the potential for adaptation may be distributed over a number of sites. If that is so, three possible situations are:

1. All adapting neurons are located before the site of rivalry suppression.
2. All adapting neurons are located after the site of rivalry suppression.
3. Adapting neurons are distributed both before and after the site of rivalry suppression.

Under the first mode, adaptation strength is constant regardless of the fraction of time the rivalrous stimulus is visible, and equal to the adaptation caused by a continuously visible stimulus. Under the second mode, adaptation strength is proportional to the fraction of time a stimulus is visible. It is not linearly proportional, because adaptation strength as a function of adaptation time shows a compressive nonlinearity (Magnussen and Greenlee 1986; Rose and Lowe 1982). Finally, the third mode is intermediate to the two extremes described above. Adaptation is still proportional to predominance, but follows a flatter curve than the second mode. All this can be expressed by equation 2.1:

\[ c = [1 - c_0]\,[x + (1 - x) f]^{p} + c_0 \qquad (2.1) \]

where:
c = threshold contrast after adaptation
c_0 = threshold contrast before adaptation
x = fraction of time stimulus is visible during rivalry
f = fraction of adaptation located prior to site of suppression
p = exponent defining power law time course of adaptation
The term inside the first square bracket is a normalization constant keeping c from exceeding 1.0. The first and second terms in the second square bracket indicate relative adaptation rates during dominant and suppressed phases of rivalry, respectively. The equation requires that the values of c and c_0 be normalized so that threshold contrast = 1.0 following adaptation to a continuously visible stimulus. In applying the model, the time course of adaptation is assumed to follow a square root law, so p is set to 0.5. This appears to be a reasonable estimate based on data in the literature (Magnussen and Greenlee 1986; Rose and Lowe 1982). The parameter c_0 is set from the data. The goal is to estimate the value of the parameter f, based on the shape of the experimental c versus x curve. A flatter curve implies a larger value of f.
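To make the behavior of equation 2.1 concrete, the following minimal Python sketch evaluates the model for the three modes described above (f = 0, an intermediate f, and f = 1). The function name `adapted_threshold` and the value `C0_EXAMPLE = 0.5` for the unadapted threshold are assumptions introduced only for illustration; they are not values from the experiment.

```python
def adapted_threshold(x, f, c0, p=0.5):
    """Normalized threshold contrast c after adaptation (equation 2.1).

    x  : fraction of time the adapting stimulus is visible (predominance)
    f  : fraction of adaptation located before the site of suppression
    c0 : normalized threshold contrast before adaptation
    p  : power-law exponent of the adaptation time course (0.5 in the text)
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("x and f must lie in [0, 1]")
    return (1.0 - c0) * (x + (1.0 - x) * f) ** p + c0


# Hypothetical unadapted threshold, chosen only to illustrate curve shapes.
C0_EXAMPLE = 0.5

for f in (0.0, 0.5, 1.0):
    row = [round(adapted_threshold(x, f, C0_EXAMPLE), 3) for x in (0.1, 0.5, 1.0)]
    print(f"f = {f:.1f}: c at predominance 0.1, 0.5, 1.0 -> {row}")
```

Under these assumptions the curve for f = 1 is flat at c = 1.0, while smaller values of f give progressively steeper dependence on predominance, which is the property used to estimate f from the data.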
3 Methods

The general procedure was to induce rivalry by presenting orthogonal gratings to the two eyes, and afterward measure threshold contrast for the eye viewing the adapting grating. The predominance of the adapting grating (fraction of time it was visible) was varied for different runs by changing the grating contrast to the opposite eye, while holding the contrast of the adapting grating constant.

Stimuli were 3.0 c deg^-1 sinusoidal gratings, horizontal to the left eye and vertical to the right, within circular apertures 1.0° in diameter. Nonius lines were present at the perimeter of the apertures. These stimuli, presented on a pair of Tektronix 608 oscilloscopes, were viewed through a Wheatstone stereoscope. The left eye grating was the adapting stimulus, and its contrast was held constant at 0.15. The right eye grating contrast was either 0.0, 0.15, or 0.53 during different runs. This resulted in rivalry in which the fraction of time that the left eye predominated was either 1.0, 0.5, or 0.1, respectively. There were two subjects, SL, the author, and MB, who was unaware of the purposes of the experiment.

The method of adjustment was used to determine contrast thresholds. Following a 1 min adaptation period during which rivalry ran freely, the screens went blank to mean luminance. The subject immediately adjusted a 10-turn logarithmic potentiometer controlling left eye contrast to what was judged as threshold, and pushed a button to indicate this decision to the computer. The grating used to measure postadaptation threshold was identical in spatial frequency and orientation to the adapting grating. Unadapted thresholds were measured in the same way, following binocular viewing of blank screens for a duration equal to the adaptation period of the other conditions. All conditions were replicated 10 times for each subject, with at least 15 min between trials to allow recovery from adaptation.

4 Results
Figure 2 shows contrast sensitivities (the inverse of contrast thresholds) following adaptation to stimuli with different predominances. There is a difference in sensitivity following adaptation to gratings with 0.5 and 1.0 predominances, but it is not statistically significant. This is compatible with previous failures to find an effect of rivalry on adaptation strength. It is only when comparing the most extreme conditions, with predominances of 0.1 and 1.0, that the effect becomes significant (p < 0.05). Even though only the most extreme conditions are significantly different, an overall trend is apparent in the data: increased predominance of the adapting stimulus during rivalry leads to decreased contrast sensitivity. The data in Figure 2 are replotted in Figure 3 as contrast threshold versus predominance, where threshold has been normalized to equal 1.0 when predominance = 1.0.

5 Discussion
The basic experimental observation is that an adapting stimulus that is intermittently visible during rivalry produces a weaker adaptational aftereffect than a continuously visible stimulus. Since adaptation strength is a decelerating function of adaptation time, the claim is that this effect of rivalry is apparent only when the predominance of the adapting stimulus is small, much less than the 0.5 predominance used by studies in the past. These conclusions, drawn from the single experiment presented here, should be regarded as preliminary until subject to a number of confirmatory studies. However, if one accepts that rivalry does affect the strength of adaptation, as these data indicate, then this has implications for the organization of binocular pathways as will be detailed below.

Figure 2: Psychophysical data for two subjects, showing that contrast sensitivity declines as a function of the fraction of time that an adapting grating is visible during rivalry. Dashed lines show unadapted sensitivities. (Panels: subjects MB and SL; horizontal axis: predominance.)

Predictions of the data analysis model (equation 2.1) are superimposed on the data in Figure 3. It shows c plotted as a function of x for three values of f, where these variables were defined above. Inspection shows that the curve for which rivalry suppression precedes all adaptation (f = 0) corresponds best to the data. Accepting this, one can suggest specific anatomical sites for rivalry suppression in accord with the following argument. First, it should be noted that adaptation to gratings is orientation-specific (Blakemore and Nachmias 1971). Given that information, the two premises of the argument are:

1. Suppression occurs prior to adaptation, which is orientation specific (based on data presented here).
2. The site of orientation-specific psychophysical adaptation coincides with the site of orientation-specific neurons.

From these it follows that suppression precedes the appearance of orientation specificity in the visual system. Thus to the earlier requirement for a valid model that the suppressive neurons be monocular, we are now adding the requirement that they be nonoriented as well.
Retinal ganglion cells fit these requirements but can be excluded because there are no opportunities for binocular interactions there. Neurons in the striate layer receiving direct magnocellular input, 4Cα, although monocular, are oriented (Hawken and Parker 1984; Livingstone and Hubel 1984), and are therefore also excluded. These considerations leave either the lateral geniculate nucleus, or those layers of striate cortex receiving direct parvocellular input (4A and 4Cβ) as the location of suppression during binocular rivalry to unmatched spatial patterns. The location of suppression may be different for unmatched motion or color. The inability to further localize suppression to either the LGN or parvodriven layer 4 reflects the lack of known physiological differences among neurons in those structures. (Note that if there were a significant subpopulation of nonadaptable orientation-specific neurons, that could render these localization arguments less secure.)
Figure 3: Data of Figure 2 are replotted to show normalized contrast thresholds (threshold = 1/sensitivity) as a function of predominance for subjects SL and MB. Squares on the vertical axes indicate unadapted thresholds. The lines are plots of equation 2.1 for three values of f (f = 0.0, f = 0.5, and f = 1.0), where f is the fraction of adaptation occurring before rivalry suppression. In all cases the parameter p was fixed at 0.5. The data correspond best to f = 0, showing that rivalry suppression precedes adaptation. Since adaptation to gratings is orientation specific, it is argued that suppression precedes orientation specificity in the visual system.
If rivalry suppression occurs in the LGN, a plausible substrate would be inhibitory interneurons between adjacent layers of that structure. Such inhibitory interactions were reviewed by Singer (1977), who also suggested they may be involved in rivalry. An alternative possibility, reciprocal inhibition mediated by the feedback loop to the LGN from layer 6 of V1 cortex, must be rejected because layer 6 outputs are binocular and oriented, while, again, units mediating suppression are postulated here to be monocular and non-oriented. On the other hand, if the locus of suppression is in layer 4, then it would likely involve inhibitory interneurons between adjacent ocular dominance columns. In addition to this "gating circuitry," which can selectively block signals originating from one eye during rivalry, there must be "matching circuitry," which controls the state of the gate depending on whether inputs to the eyes match or not. Lehky (1988) proposed a model specifying the interactions between the gating and matching circuits. Without repeating the underlying reasoning, the essence of the model is that the same reciprocal inhibitory gating circuitry (Fig. 1) underlies both binocular fusion and rivalry. The difference between the two states depends on the strength of inhibitory coupling. Weak coupling leads to stable fusion while strong coupling produces rivalrous oscillations. Under this model, the output neurons of the matching circuitry (not pictured in Fig. 1) act to modulate the strength of reciprocal inhibition in the gating circuitry as a function of the correlation between left and right eye signals. It seems reasonable to believe that the output neurons (though not necessarily the intrinsic neurons) of the binocular matching circuitry are binocularly driven. If that is the case, then to find these outputs one must look for binocular units that feed back on monocular units (of the gating circuitry).
According to the available data, neurons of layer 6 of area V1 have this property uniquely. Layer 6 sends major projections only to structures with preponderantly monocular cells, namely the LGN as well as layers 4Cα and 4Cβ (but not 4B, which is binocular), and none of those monocular regions receives major binocular inputs other than from layer 6. This relationship becomes apparent on matching the physiology (reviewed by Livingstone and Hubel 1987) with the anatomy (Blasdel et al. 1985; Fitzpatrick et al. 1985; Lund and Boothe 1975). Therefore, whether binocular gating occurs in the LGN or layer 4 (or both), it is the layer 6 output that is the most likely candidate for controlling the gate. Figure 4 summarizes the proposed organization, illustrating the case in which gating may be occurring in the LGN. The other possibility, that of gating occurring in layer 4, could be illustrated analogously with ocular dominance columns replacing LGN layers. It is possible that the circuitry discussed here is important not only for suppression during rivalry, but also during other binocular processes, such as pathological suppression during amblyopia or the elimination of false matches during stereopsis.
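As a rough illustration of how coupling strength alone can switch a reciprocal-inhibition gate between fusion and rivalry, the sketch below simulates two rate units, one per eye, with mutual inhibition and slow self-adaptation. It is not the model of Lehky (1988): the rate equations, the function names `simulate_gate` and `count_switches`, and every parameter value are assumptions chosen only for illustration.

```python
def simulate_gate(w, g=3.0, drive=1.0, tau=0.01, tau_a=1.0, dt=0.001, t_end=10.0):
    """Euler simulation of two mutually inhibiting rate units.

    w     : strength of reciprocal inhibition (the quantity being modulated)
    g     : strength of each unit's slow self-adaptation (assumed value)
    drive : constant input to both units (assumed value)
    tau, tau_a : fast rate and slow adaptation time constants, in seconds
    Returns a list of (r_left, r_right) firing-rate pairs, one per time step.
    """
    r = [0.1, 0.0]   # small initial asymmetry so one unit can take the lead
    a = [0.0, 0.0]   # adaptation variables
    trace = []
    for _ in range(int(round(t_end / dt))):
        # Rectified net input: drive minus cross-inhibition minus adaptation.
        in_left = max(0.0, drive - w * r[1] - g * a[0])
        in_right = max(0.0, drive - w * r[0] - g * a[1])
        new_left = r[0] + dt / tau * (-r[0] + in_left)
        new_right = r[1] + dt / tau * (-r[1] + in_right)
        # Adaptation slowly tracks each unit's own firing rate.
        a[0] += dt / tau_a * (-a[0] + r[0])
        a[1] += dt / tau_a * (-a[1] + r[1])
        r = [new_left, new_right]
        trace.append((r[0], r[1]))
    return trace


def count_switches(trace, min_diff=0.05):
    """Count changes of the dominant unit, ignoring near-equal activity."""
    dominant = None
    switches = 0
    for left, right in trace:
        if abs(left - right) < min_diff:
            continue
        current = 0 if left > right else 1
        if dominant is not None and current != dominant:
            switches += 1
        dominant = current
    return switches


weak = simulate_gate(w=0.5)     # weak coupling
strong = simulate_gate(w=2.0)   # strong coupling
print("switches with weak coupling:  ", count_switches(weak))
print("switches with strong coupling:", count_switches(strong))
```

With these assumed parameters, weak coupling settles into a state in which both units remain equally active (analogous to fusion), whereas strong coupling produces repeated alternations of dominance (analogous to rivalry). The slow adaptation term is one common way to obtain alternation in such circuits and is an assumption of this sketch.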
[Figure 4 diagram labels: left retinal input, right retinal input, gating circuitry, layer 6.]
Figure 4: Summary diagram of the model, in essence a more elaborate version of Figure 1. Signals from the left and right eyes are binocularly gated by reciprocal inhibition, either between adjacent layers of the LGN (as illustrated here) or between adjacent ocular dominance columns in those parts of striate layer 4 receiving direct parvocellular inputs (not shown). The restriction to parvodriven parts of layer 4 is based on data presented here involving rivalry induced by unmatched spatial patterns, and the situation may be different for unmatched motion. Whatever the case, the strength of inhibitory coupling is modulated by layer 6 units, which are the outputs of binocular matching circuitry. If the stimuli to the two eyes match, layer 6 neurons modulate reciprocal inhibition in the gating circuitry to be weak, leading to fused binocular vision. If the stimuli do not match, inhibition is modulated to be strong, producing rivalry. (The modulatory feedback loop is not included in Figure 1.)
Acknowledgments Portions of this work were presented at the 1989 annual meeting of the Association for Research in Vision and Ophthalmology. Supported by a Sloan Foundation grant to Terrence Sejnowski and Gian F. Poggio, a McDonnell Foundation grant to Terrence Sejnowski, and NIH Grant EY07760 to Randolph Blake. We thank Mary Bravo for serving as a subject.
References

Blake, R. 1989. A neural theory of binocular rivalry. Psychol. Rev. 96, 145-167.
Blake, R., and Fox, R. 1974a. Binocular rivalry suppression: Insensitive to spatial frequency and orientation change. Vision Res. 14, 687-692.
Blake, R., and Fox, R. 1974b. Adaptation to invisible gratings and the site of binocular rivalry suppression. Nature (London) 249, 488-490.
Blake, R., and Overton, R. 1979. The site of binocular rivalry suppression. Perception 8, 143-152.
Blakemore, C., and Nachmias, J. 1971. The orientation specificity of two visual after-effects. J. Physiol. (London) 213, 157-174.
Blasdel, G. G., Lund, J. S., and Fitzpatrick, D. 1985. Intrinsic connections of macaque striate cortex: Axonal projections of cells outside lamina 4C. J. Neurosci. 5, 3350-3369.
Fitzpatrick, D., Lund, J. S., and Blasdel, G. G. 1985. Intrinsic connections of macaque striate cortex: Afferent and efferent connections of lamina 4C. J. Neurosci. 5, 3329-3349.
Hawken, M. J., and Parker, A. J. 1984. Contrast sensitivity and orientation selectivity in lamina IV of the striate cortex of Old World monkeys. Exp. Brain Res. 54, 367-372.
Lehky, S. R. 1988. An astable multivibrator model of binocular rivalry. Perception 17, 215-228.
Lehmkuhle, S. W., and Fox, R. 1975. Effect of binocular rivalry suppression on the motion aftereffect. Vision Res. 15, 855-859.
Livingstone, M. S., and Hubel, D. H. 1987. Psychophysical evidence for separate channels for the perception of form, color, movement, and depth. J. Neurosci. 7, 3416-3468.
Lund, J. S., and Boothe, R. G. 1975. Interlaminar connections and pyramidal neuron organisation in the visual cortex, Area 17, of the macaque monkey. J. Comp. Neurol. 159, 305-334.
Magnussen, S., and Greenlee, M. W. 1986. Contrast threshold elevation following continuous and interrupted adaptation. Vision Res. 26, 673-675.
Matsuoka, K. 1984. The dynamic model of binocular rivalry. Biol. Cybernet. 49, 201-208.
Mueller, T. J. 1990. A physiological model of binocular rivalry. Visual Neurosci. 4, 63-73.
O'Shea, R., and Crassini, B. 1981. Interocular transfer of the motion after-effect is not reduced by binocular rivalry. Vision Res. 21, 801-804.
Rose, D., and Lowe, I. 1982. Dynamics of adaptation to contrast. Perception 11, 505-528.
Singer, W. 1977. Control of thalamic transmission by corticofugal and ascending visual pathways in the visual system. Physiol. Rev. 57, 386-420.
Wade, N., and Wenderoth, P. 1978. The influence of colour and contour rivalry on the magnitude of the tilt after-effect. Vision Res. 18, 827-835.
Received 8 August 1990; accepted 20 September 90.
Communicated by Shun-ichi Amari
Dynamics and Formation of Self-organizing Maps

Jun Zhang
Neurobiology Group, 3210 Tolman Hall, University of California, Berkeley, CA 94720 USA
Amari (1983, 1989) proposed a mathematical formulation of the self-organization of synaptic efficacies and neural response fields under the influence of external stimuli. The dynamics as well as the equilibrium properties of the cortical map were obtained analytically for neurons with binary input-output transfer functions. Here we extend this approach to neurons with arbitrary sigmoidal transfer functions. Under the assumption that both the intracortical connection and the stimulus-driven thalamic activity are well localized, we are able to derive expressions for the cortical magnification factor, the point-spread resolution, and the bandwidth resolution of the map. As a highlight, we show analytically that the receptive field size of a cortical neuron in the map is inversely proportional to the cortical magnification factor at that map location, the experimentally well-established rule of inverse magnification in retinotopic and somatotopic maps.
1 Introduction
The self-organization of the nervous system and the consequential formation of cortical maps have been studied quite extensively (von der Malsburg 1973; Swindale 1980; Kohonen 1982; Linsker 1986; Miller et al. 1989). A cortical map, or more generally, a computational map refers to the neural structure of representing a continuous stimulus parameter by a place-coded populational response, whose peak location reflects the mapped parameter (Knudsen et al. 1987). The cortical neurons in the map, each with a slightly different range of stimulus selectivity established during the course of development, operate as preset parallel filters on the afferent stimulus almost simultaneously. The stimulus parameter, now coded as the location of the most active neuron(s), can be accessed by higher processing centers via relatively simple neural connections.

Neural Computation 3, 54-66 (1991) © 1991 Massachusetts Institute of Technology

The network models for such cortical maps are usually composed of several layers of neurons from sensory receptors to cortical units, with feedforward excitations between the layers and lateral (or recurrent) connection within the layer. Standard techniques include (1) Hebbian rule and its variations for modifying synaptic efficacies, (2) lateral inhibition (in the general sense) for establishing topographical organization of the cortex as well as sharpening the cells' tuning properties, and (3) adiabatic approximation in decoupling the dynamics of relaxation (which is on the fast time scale) and the dynamics of learning (which is on the slow time scale) of the network. However, in most cases, only computer simulation results were obtained and therefore provided limited mathematical understanding of the self-organizing neural response fields.

In Takeuchi and Amari (1979) and Amari (1983, 1989), a general mathematical formulation was presented to study analytically the existence conditions, the resolution and magnification properties, as well as the dynamic stability of cortical maps. This rather rigorous approach yielded very illuminating results. In particular, they suggested by perturbation analysis that, in the presence of periodic boundary conditions of the mapping, the relative values of the afferent spread size and the receptive field size will determine the emergence of a block-like, columnar structure as opposed to a continuous, topographic organization. Since their analysis was restricted to binary-valued neurons only, that is, neurons with step-function as their input-output transfer function, it is certainly desirable to extend this approach to the more general case of neurons with arbitrary sigmoidal transfer functions.
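The place-coded representation described in this introduction can be illustrated with a toy example: a bank of neurons with slightly different preferred stimulus values acts as a set of parallel filters, and the stimulus is read out as the preferred value of the most active neuron. The Gaussian tuning curve, the tuning width, the grid of preferred values, and the function names below are all assumptions made for this sketch, not part of the model analyzed in the article.

```python
import math


def population_response(stimulus, preferred, width=0.05):
    """Response of each neuron, assuming Gaussian tuning around its preferred value."""
    return [math.exp(-0.5 * ((stimulus - c) / width) ** 2) for c in preferred]


def decode_peak(responses, preferred):
    """Read out the stimulus as the preferred value of the most active neuron."""
    best = max(range(len(responses)), key=responses.__getitem__)
    return preferred[best]


# Hypothetical map: 51 neurons with preferred values evenly spaced on [0, 1].
preferred = [i / 50 for i in range(51)]
stimulus = 0.437
estimate = decode_peak(population_response(stimulus, preferred), preferred)
print(f"stimulus = {stimulus}, decoded from peak location = {estimate}")
```

The readout error in this sketch is bounded by half the spacing between preferred values, which loosely mirrors the link between how densely a map samples a parameter and how finely that parameter can be resolved.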
2 Dynamics of Self-organization Revisited
The network that Amari and colleagues investigated consists of three layers, a sensory receptor layer, a thalamic layer, and a cortical layer, with feedforward connections between the layers and lateral inhibition within the cortical layer only (Fig. 1). Following Takeuchi and Amari (1979), the activity of the cortical neuron at location $x$ (a 2D vector in general) and time $t$ may be described by its net input $u(x, t)$ (postsynaptic membrane potential with respect to the resting state) and output $v(x, t)$ (average firing rate of the spike train), interrelated via some monotone increasing (sigmoidal) input-output function: $v = f(u)$, $u \in (-\infty, \infty)$, $v \in (0, 1)$. To further indicate that these variables are functions of the stimulus parameter $y$ (a vector) and the time parameter of the learning dynamics $\tau$, we shall write in this article $u(x, y, t, \tau)$ and $v(x, y, t, \tau)$ instead. The receptors have delta-function tuning to the stimulus parameter, and they feed into the thalamic layer with localized afferent spread. Notice that $y$ is used to denote the stimulus variable as well as to index cells in the thalamic layer according to their optimal stimulus parameter (i.e., according to their connections from the receptor layer).

Jun Zhang

Figure 1: The three-layered network model for the self-organizing cortical maps. The receptors, having delta-function tuning for the stimulus $y$, feed into thalamic neurons, with the stimulus-driven thalamic activity denoted by $a(y', y)$. The cortical neurons receive both the intracortical interaction characterized by the weighting function $w(x, x')$ and the thalamocortical input with synaptic connection $s(x, y, \tau)$ modifiable (based on the Hebbian learning rule) during development. Note that each thalamic neuron is indexed by its optimal driving stimulus (according to the connections from the receptor layer).

The cortical neurons in the model receive both thalamocortical afferent projections and intracortical lateral connections. The relaxation of the system is dictated, on the fast time scale $t$, by the equation

$$\frac{\partial}{\partial t} u(x, y, t, \tau) = -u(x, y, t, \tau) + \int w(x, x')\, f[u(x', y, t, \tau)]\, dx' + \int s(x, y', \tau)\, a(y', y)\, dy' \tag{2.1}$$

where $w(x, x')$ characterizes the weighting of lateral connections within the cortex from location $x'$ to location $x$, assumed to be unchanged with time; $a(y', y)$ represents the thalamocortical afferent activity at $y'$ [the first argument in the function $a(\cdot, \cdot)$] on the presentation of the stimulus $y$ [the second argument in $a(\cdot, \cdot)$]; and $s(x, y, \tau)$ is the synaptic efficacy from the thalamocortical afferent $y$ to the cortical neuron $x$, which varies on a slow time scale $\tau$ and is thus treated as constant on the fast time scale $t$. This "adiabatic approximation" allows Amari (1989) to construct a global Lyapunov function $L[u]$ that is a function of $y, t, \tau$ ($x$ having been integrated) and thus represents the overall pattern of cortical
activity. It was proved that, on the presentation of stimulus $y$ at time $\tau$, the value of $L[u]$ monotonically decreases, on the fast time scale $t$, as $u(x, y, t, \tau)$ evolves according to equation 2.1, until $L[u] = L(y, t, \tau)$ reaches a minimum value $L_{\min}(y, \tau)$ while $u(x, y, t, \tau)$ reaches its "equilibrium" solution $\bar{U}[x, y, \tau; s(\cdot), a(\cdot)]$ [$\bar{U}$ is a functional of $s(x, y, \tau)$ and $a(y', y)$, and the bar denotes the equilibrium of the relaxation phase]. This establishes a cortical response field $\bar{U} = \bar{U}(x, y, \tau)$ relating the external stimulus $y$ to the cortical activity at $x$ at time $\tau$. To study the self-organization of this mapping, the synaptic efficacy $s(x, y, \tau)$ is assumed to be modifiable, on the slow time scale $\tau$, according to the following equation of learning (Hebbian rule):

$$\frac{\partial}{\partial \tau} s(x, y, \tau) = -s(x, y, \tau) + \eta \int a(y, y')\, f[\bar{U}(x, y', \tau)]\, p(y')\, dy' \tag{2.2}$$

Note that, for the dynamics of learning, stimulus presentations are considered to be stochastically independent at each time $\tau$ with some prescribed probability distribution $p(y')$. Here we set $p(y')$ = constant (and thus absorb it into the constant $\eta$) to indicate the normal developmental course. Note that the integration is with respect to the second slot of $a(\cdot, \cdot)$, the argument representing the stimulus parameter. At the end of the network learning, synapses get "matured" so that $s(x, y, \tau)$ becomes the time-independent $S(x, y)$:

$$S(x, y) = \eta \int a(y, y')\, f[U(x, y')]\, dy' \tag{2.3}$$

whereas $\bar{U}(x, y, \tau)$ becomes the time-independent $U(x, y)$:

$$U(x, y) = \int w(x, x')\, f[U(x', y)]\, dx' + \int S(x, y')\, a(y', y)\, dy'$$
$$\phantom{U(x, y)} = \int w(x, x')\, f[U(x', y)]\, dx' + \int \kappa(y, y')\, f[U(x, y')]\, dy' \tag{2.4}$$

Here $\kappa(y, y')$ is the autocorrelation between the thalamocortical afferents, defined as

$$\kappa(y, y') = \eta \int a(y'', y)\, a(y'', y')\, dy'' \tag{2.5}$$

Equivalently, in terms of the output $V(x, y) = f[U(x, y)]$, we may write

$$f^{-1}[V(x, y)] = \int w(x, x')\, V(x', y)\, dx' + \int \kappa(y, y')\, V(x, y')\, dy' \tag{2.6}$$
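As a rough numerical illustration of the fast relaxation phase, the following sketch integrates a discretized, one-dimensional version of the relaxation equation (equation 2.1) with forward Euler steps and checks that the state settles to an equilibrium. The grid size, the difference-of-Gaussians lateral weights, the Gaussian afferent drive, and the function names are all assumptions chosen for illustration; they are not taken from Amari's or the author's work, and the slow Hebbian learning is left out.

```python
import math

def f(u):
    """Sigmoidal transfer function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def lateral_weights(n, exc=0.5, inh=0.2, s_exc=1.0, s_inh=3.0):
    """Hypothetical difference-of-Gaussians weights w(x - x') on a 1D grid."""
    return [[exc * math.exp(-(i - j) ** 2 / (2 * s_exc ** 2))
             - inh * math.exp(-(i - j) ** 2 / (2 * s_inh ** 2))
             for j in range(n)] for i in range(n)]

def rhs(u, w, drive):
    """Right-hand side of the relaxation equation: -u + sum over x' of w f(u) + drive."""
    v = [f(ui) for ui in u]
    n = len(u)
    return [-u[i] + sum(w[i][j] * v[j] for j in range(n)) + drive[i]
            for i in range(n)]

def relax(w, drive, dt=0.1, steps=2000):
    """Forward-Euler integration on the fast time scale t, starting from rest."""
    u = [0.0] * len(drive)
    for _ in range(steps):
        u = [ui + dt * dui for ui, dui in zip(u, rhs(u, w, drive))]
    return u

n = 20
w = lateral_weights(n)
# Afferent drive (sum over y' of s(x, y') a(y', y)) for one stimulus, assumed Gaussian.
drive = [math.exp(-(i - 8) ** 2 / 8.0) for i in range(n)]
u_eq = relax(w, drive)
# At equilibrium the right-hand side vanishes, which is the fixed-point form of the dynamics.
residual = max(abs(r) for r in rhs(u_eq, w, drive))
```

The weights here are deliberately small so that the update is a contraction and a single equilibrium exists; stronger lateral interactions could produce multiple equilibria, which this sketch does not explore.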
3 Reduced Equation of Cortical Map Formation
The master equation (equation 2.4 or 2.6) developed by Amari and colleagues describes the formation of cortical maps as equilibrium solutions to the dynamic self-organization of layered neural systems. In Amari (1989), the resolution and magnification properties were studied for neurons with a binary input-output transfer function, where $f(\cdot)$ assumes the value of either 1 or 0 and thus the integral in equation 2.4 can be explicitly evaluated. In Chernjavsky and Moody (1990), the case of the linear transfer function $f(u) = au + b$ was studied. Here we relax these restrictions and consider an arbitrary transfer function $f(u)$. We will derive approximations of equation 2.6 for well-localized, translation-invariant functions of the intracortical connection $w(x, x')$ and the stimulus-driven thalamic activity $a(y, y')$:

$$w(x, x') = w(x - x') \tag{3.1}$$
$$a(y, y') = a(y - y') \tag{3.2}$$

It follows from equations 2.5 and 3.2 that the afferent autocorrelation $\kappa(y, y')$ must also be translation invariant:

$$\kappa(y, y') = \kappa(y - y') \tag{3.3}$$

Now we consider the first integral term in equation 2.6. For simplicity, $x$ and $y$ are taken as real numbers here (i.e., the mapping is one-dimensional). Expanding $V(x', y)$ into the Taylor series around the point $(x, y)$

$$V(x', y) = \sum_{k=0}^{\infty} \frac{1}{k!} \frac{\partial^k V(x, y)}{\partial x^k} (x' - x)^k \tag{3.4}$$

we have

$$\int w(x - x')\, V(x', y)\, dx' = \sum_{k=0}^{\infty} a_k \frac{\partial^k V(x, y)}{\partial x^k} \tag{3.5}$$

where

$$a_k = \frac{1}{k!} \int w(x - x')\, (x' - x)^k\, dx' \tag{3.6}$$

Similarly,

$$\int \kappa(y - y')\, V(x, y')\, dy' = \sum_{k=0}^{\infty} b_k \frac{\partial^k V(x, y)}{\partial y^k} \tag{3.7}$$

with

$$b_k = \frac{1}{k!} \int \kappa(y - y')\, (y' - y)^k\, dy' \tag{3.8}$$
Therefore, the master equation 2.6 is transformed into¹

$$f^{-1}[V(x, y)] = \sum_{k=0}^{\infty} \left( a_k \frac{\partial^k}{\partial x^k} + b_k \frac{\partial^k}{\partial y^k} \right) V(x, y) \tag{3.9}$$

By assuming that $w(t)$ and $a(t)$ are well localized, we imply that $a_k$, $b_k$ converge rapidly to 0 as $k \to \infty$. Taking only a few leading terms in the expansion, and further assuming that $w(x, x') = w(|x - x'|)$ and $a(y, y') = a(|y - y'|)$ are both even functions of their arguments, therefore making $a_1 = 0$ and $b_1 = 0$, we obtain

$$f^{-1}[V(x, y)] = (a_0 + b_0)\, V + \left( a_2 \frac{\partial^2 V}{\partial x^2} + b_2 \frac{\partial^2 V}{\partial y^2} \right) \tag{3.10}$$

or

$$a_2 \frac{\partial^2 V}{\partial x^2} + b_2 \frac{\partial^2 V}{\partial y^2} = G(V) \tag{3.11}$$

with

$$G(V) = f^{-1}(V) - (a_0 + b_0)\, V \tag{3.12}$$

If the cortical lateral connection is balanced in its total excitation and total inhibition, $a_0 = 0$. If the afferent autocorrelation is normalized, $b_0 = \eta > 0$. Equation 3.11 is a semilinear second-order partial differential equation. When $a_2 b_2 > 0$, it is of elliptic type; when $a_2 b_2 < 0$, it is of hyperbolic type. The standard techniques for solving equation 3.11 can be found in mathematical textbooks, such as Chester (1971). In particular, 3.11 may be linearized and transformed into the canonical forms of (when $a_2 b_2 < 0$) $\partial_{xy} V + cV = 0$, known as the telegraph equation, or (when $a_2 b_2 > 0$) $\nabla^2 V + cV = 0$, known as the Helmholtz equation. These linear second-order partial differential equations have closed-form solutions when given appropriate boundary conditions.
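To make the truncation step concrete, the sketch below computes the moments $a_k$ of a narrow kernel numerically and compares the full convolution $\int w(x - x') V(x')\,dx'$ with the two-term approximation $a_0 V + a_2\, \partial^2 V/\partial x^2$. The normalized Gaussian kernel of width 0.1, the test function $V = \sin$, and the helper names are assumptions made only for this illustration.

```python
import math

def w(t, sigma=0.1):
    """Assumed well-localized, even lateral kernel: a normalized Gaussian."""
    return math.exp(-t * t / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def moment(k, h=0.001, half_width=1.0):
    """a_k = (1/k!) * integral of w(x - x') (x' - x)^k dx', as a Riemann sum over t = x' - x."""
    n = int(round(half_width / h))
    total = sum(w(-i * h) * (i * h) ** k for i in range(-n, n + 1)) * h
    return total / math.factorial(k)

def convolve_at(x, V, h=0.001, half_width=1.0):
    """Integral of w(x - x') V(x') dx', evaluated directly at one point x."""
    n = int(round(half_width / h))
    return sum(w(-i * h) * V(x + i * h) for i in range(-n, n + 1)) * h

x0 = 0.7
a0, a1, a2 = moment(0), moment(1), moment(2)
exact = convolve_at(x0, math.sin)
# For V = sin, the second derivative is -sin, so the truncated series is (a0 - a2) sin(x0).
approx = a0 * math.sin(x0) + a2 * (-math.sin(x0))
```

For this kernel the odd moment vanishes by symmetry and the two leading even terms already reproduce the convolution to within about one part in $10^5$, which is the sense in which only a few leading terms are needed when the kernel is well localized.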
4 Resolution and Magnification of the Map
¹ This simplified derivation of equation 3.9 was suggested to the author by Dr. S. Amari. This equation was earlier obtained by expanding $w(x - x')$ into the sum of $\delta(x - x')$ and its derivatives $\delta^{(n)}(x - x')$, where the delta-function is envisioned as the limiting case (i.e., with zero width) of a normalized gaussian function and its successive derivatives represent derivatives of gaussians that become less and less localized (Zhang 1990).

The solution of equation 3.11, $V = V(x, y)$, represents the response of the neuron at location $x$ due to stimulus parameter $y$ after the cortical map matures. When $x$ is fixed, that is, at a particular cortical location $x_0$, the neuron's response is a function of the stimulus parameter $y$. Maximal response is achieved for some optimal stimulus $y_0$, which is determined by

$$\left. \frac{\partial V(x_0, y)}{\partial y} \right|_{y = y_0} = 0 \tag{4.1}$$

Obviously the optimal stimulus thus obtained is different for each location $x_0$. The optimal stimulus parameter $y$ as a function of cortical location $x$ may be written as

$$y = \mu(x) \tag{4.2}$$

so that equation 4.1 holds identically for all $x$:

$$V_2[x, \mu(x)] = 0 \tag{4.3}$$

Here and in the sequel, we use the subscript(s) 1, 2 of $V$ to denote partial derivative(s) with respect to the first and/or second argument(s) in $V(\cdot, \cdot)$. Upon the presentation of an optimal stimulus, the maximal response of the neuron at $x$ is

$$V_{\max}(x) = V[x, \mu(x)] \tag{4.4}$$

Suppose that this maximal response is everywhere the same (i.e., cortical neurons are indistinguishable)

$$V_{\max}(x) = \text{constant} \tag{4.5}$$

or

$$\frac{d}{dx} V[x, \mu(x)] = V_1[x, \mu(x)] + V_2[x, \mu(x)]\, \mu'(x) = 0 \tag{4.6}$$

It follows from equations 4.6 and 4.3 that

$$V_1[x, \mu(x)] = 0 \tag{4.7}$$

or equivalently

$$V_1[\mu^{-1}(y), y] = 0 \tag{4.8}$$

Hence $x = \mu^{-1}(y)$ defines the location of maximal response (i.e., the center of the cortical map) as a function of the stimulus variable. Differentiating equations 4.7 and 4.3 yields, respectively,

$$\frac{d}{dx} V_1[x, \mu(x)] = V_{11}[x, \mu(x)] + V_{12}[x, \mu(x)]\, \mu'(x) = 0 \tag{4.9}$$

$$\frac{d}{dx} V_2[x, \mu(x)] = V_{21}[x, \mu(x)] + V_{22}[x, \mu(x)]\, \mu'(x) = 0 \tag{4.10}$$
Remembering that the order of partial differentiations is interchangeable, $V_{12} = V_{21}$, we immediately have

$$V_{11}[x, \mu(x)] - V_{22}[x, \mu(x)]\, [\mu'(x)]^2 = 0 \tag{4.11}$$

On the other hand, equation 3.11 should always be satisfied:

$$G(V_{\max}) = a_2 V_{11}[x, \mu(x)] + b_2 V_{22}[x, \mu(x)] \tag{4.12}$$

From 4.11 and 4.12, we finally obtain

$$V_{11}[x, \mu(x)] = \frac{G(V_{\max})\, [\mu'(x)]^2}{a_2 [\mu'(x)]^2 + b_2} \tag{4.13}$$

$$V_{22}[x, \mu(x)] = \frac{G(V_{\max})}{a_2 [\mu'(x)]^2 + b_2} \tag{4.14}$$

The above results can be understood intuitively. Recall that the cortical magnification factor (CMF) is defined as the ratio of the resulting shift of the mapped location in the cortex to a change in the stimulus parameter. In the present context, it is simply equal to $[\mu'(x)]^{-1} = [d\mu(x)/dx]^{-1}$, the reciprocal of the derivative of the function $y = \mu(x)$, which is solvable from equation 4.3 or 4.7. The cortical magnification factor is in general a function of cortical location $x$. The resolution of the map can be described by two related measures. For a fixed stimulus parameter, the extent of cortical regions being excited is a measure of the stimulus localization in a populational response ("point-spread resolution"). At a particular cortical location, the range of effective stimuli is a measure of the stimulus selectivity of a single cell ("bandwidth resolution"). To get the intuitive picture, we draw a family of "isoclines" of $V(x, y)$ in the $(x, y)$-coordinates, whereby $V(x, y)$ = constant along each curve (Fig. 2a). The variation of $V(x, y)$ in the vertical direction corresponds to looking at the cell's response at a fixed cortical location ($x = x_0$) while changing the stimulus parameter; the vertical bar measures, in reciprocal, the bandwidth resolution. The variation of $V(x, y)$ in the horizontal direction corresponds to fixing the stimulus parameter while looking at responses of cells at different cortical locations; the horizontal bar measures, in reciprocal, the point-spread resolution. In both cases, of course, one needs to specify a criterion (in terms of percentage of maximal response, for instance) to discuss the magnitude of each resolution measure. From the graph, it is obvious that these two measures are interrelated. If we take a slice (cross-section) along the vertical direction, the value of $V(x_0, y)$ may be schematically plotted (Fig. 2b).
Figure 2: (a) The "isoclines" of $V$ are plotted on the $(x, y)$-coordinates, whereby along each curve $V(x, y)$ = constant. In particular, the curve $y = \mu(x)$ corresponding to $V(x, y) = V_{\max}$ is labeled. The vertical bar is a measure of the bandwidth resolution of the cortical map, while the horizontal bar is a measure of the point-spread resolution of the map (see text for details). (b) Taking a slice (cross-section) along the vertical direction in (a), the value of $V(x_0, y)$ is plotted as a function of the stimulus parameter $y$, representing the tuning curve of the cell at a particular cortical location $x_0$.

Note that this is a plot of response amplitude $V$ versus the stimulus parameter $y$, with the cortical location $x_0$ fixed. The peak location of this curve corresponds to $V_{\max} = V(x_0, y_0)$, with $y_0 = \mu(x_0)$ representing the optimal stimulus for the cell located at $x_0$. The "width" of this tuning curve represents the extent of stimulus selectivity, or the reciprocal of the bandwidth resolution of the map. If the cell's tuning curve is symmetric about its peak, it may be approximated by a quadratic function (at least near the peak location $y_0 = \mu(x_0)$, where $\partial V(x_0, y_0)/\partial y = 0$ according to equation 4.3)

$$V(x_0, y) = V(x_0, y_0) + \frac{1}{2} \frac{\partial^2 V(x_0, y_0)}{\partial y^2} (y - y_0)^2 + \cdots \tag{4.15}$$

The "width" of this parabola is inversely related to the quadratic coefficient $\partial^2 V(x_0, y_0)/\partial y^2$, or simply $V_{22}[x_0, \mu(x_0)]$. We may replace $x_0$ with $x$ to indicate that this analysis applies to all cortical locations. Therefore $V_{22}[x, \mu(x)]$ as calculated in 4.14 is nothing but the bandwidth resolution of the map. Similarly, $V_{11}[x, \mu(x)]$ can be viewed as the point-spread resolution of the cortical map. These two resolutions are linked to the cortical magnification factor via equation 4.11. It is interesting to note that equations 4.9 and 4.10 yield

$$V_{11}[x, \mu(x)]\, V_{22}[x, \mu(x)] - V_{12}[x, \mu(x)]\, V_{21}[x, \mu(x)] = 0 \tag{4.16}$$

or that the graph $z = V(x, y)$ is parabolic at its peak points $[x, \mu(x)]$. This, along with the restrictions that $V_{11}[x, \mu(x)]$ and $V_{22}[x, \mu(x)]$ are both negative, constitutes the conditions of a continuous, homogeneous map.
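The relations among the second partial derivatives along the ridge $y = \mu(x)$ can be checked numerically on a toy response field. The sketch below assumes a Gaussian-shaped field $V(x, y) = \exp\{-[y - \mu(x)]^2 / 2\sigma^2\}$ with a hypothetical $\mu(x) = x + 0.3x^2$; this particular $V$ is chosen only because its peak height is constant along the ridge, and it is not claimed to solve equation 3.11.

```python
import math

SIGMA = 0.5  # assumed tuning width of the toy response field

def mu(x):
    """Hypothetical optimal-stimulus function y = mu(x)."""
    return x + 0.3 * x * x

def V(x, y):
    """Toy response field with constant peak height along y = mu(x)."""
    return math.exp(-(y - mu(x)) ** 2 / (2 * SIGMA ** 2))

def second_partials(x, y, h=1e-3):
    """Central-difference estimates of V11, V22 and V12 at (x, y)."""
    v11 = (V(x + h, y) - 2 * V(x, y) + V(x - h, y)) / h ** 2
    v22 = (V(x, y + h) - 2 * V(x, y) + V(x, y - h)) / h ** 2
    v12 = (V(x + h, y + h) - V(x + h, y - h)
           - V(x - h, y + h) + V(x - h, y - h)) / (4 * h ** 2)
    return v11, v22, v12

x0 = 0.5
mu_prime = 1 + 0.6 * x0                      # analytic derivative of mu at x0
v11, v22, v12 = second_partials(x0, mu(x0))  # evaluated on the ridge
ridge_relation = v11 - v22 * mu_prime ** 2   # should vanish (equation 4.11)
hessian_det = v11 * v22 - v12 * v12          # should vanish (equation 4.16)
```

Both quantities vanish up to finite-difference error, and both $V_{11}$ and $V_{22}$ come out negative, which matches the conditions stated for a continuous, homogeneous map.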
Similarly, the occurrence of a discontinuous, columnar-like structure of the map corresponds to having isolated peak points at which the elliptic graph $z = V(x, y)$ attains the maximal value:

$$V_{11} V_{22} - V_{12} V_{21} > 0 \tag{4.17}$$
5 Rule of Inverse Magnification
The continuous and homogeneous cortical map discussed in the previous section is a topographic map that uniformly associates a cortical location with each stimulus. The retinotopic map is an example where the stimulus parameter being mapped is the position in the frontal-parallel visual field. The somatotopic map is another example where the stimulus parameter is the location on the skin surface. In both cases, the receptive field size (RF) is a synonym of our previously defined "bandwidth" of a cortical neuron, be it an effective area of the visual space or an effective patch of the skin surface. In terms of the square-root of areal measurement, RF is simply $(-V_{22})^{-1/2}$, or

$$\mathrm{RF}(x) = \left\{ \frac{a_2 [\mu'(x)]^2 + b_2}{-G(V_{\max})} \right\}^{1/2} \tag{5.1}$$

If $b_2 = 0$ or $b_2$ is very small (for a discussion, see Appendix), then 5.1 becomes

$$\mathrm{RF}(x) \propto \mu'(x) = \mathrm{CMF}^{-1}(x) \tag{5.2}$$

or finally

$$\mathrm{RF}(x) \cdot \mathrm{CMF}(x) = \text{constant} \tag{5.3}$$
The product of the receptive field size and the cortical magnification factor is nothing but the size (in terms of cortical distance) of the region from which a cell receives its total input and by which it can be activated. From equation 4.11, this product also equals $(-V_{11})^{-1/2}$, the "point-image" of a stimulus (McIlwain 1986). That the total cortical distance over which a cell can be influenced (driven) and the overall size of the cortical point-image are constant implies that the cortex is uniform in the neuronal connections that implement its computations. The physiological uniformity of the cortex has long been observed experimentally. In monkey striate visual cortex, Hubel and Wiesel (1974) reported that, despite the large scatter of receptive field sizes at each eccentricity (now believed to correspond to functionally different cell groups), the average size (in square-root of areal measurement) is roughly proportional to the inverse of the cortical magnification factor. This inverse magnification rule was also revealed in monkey somatosensory cortex (Sur et al. 1980), and was demonstrated most convincingly in the studies of reorganization of the hand-digit representation under various surgical and behavioral manipulations (Jenkins et al. 1990). This remarkable relationship $\mathrm{RF}(x) \cdot \mathrm{CMF}(x)$ = constant is compatible with the anatomical uniformity of the cortex, in particular the uniform dendritic field size (the anatomical substrate of the receptive field) of each cell type across the cortex. In Grajski and Merzenich (1990), the self-organization of the somatotopic map was simulated using a three-layered network essentially the same as the one being discussed here. These authors demonstrated that the general principles of the Hebbian rule, lateral inhibition, and adiabatic approximation are sufficient to account for the inverse relationship between the receptive field size and the cortical magnification factor. A similar result was also obtained by a probabilistic analysis of the Kohonen-type network (Obermayer et al. 1990). These empirical and computer studies are all consistent with our analytical result, and the approaches therefore complement each other in helping us understand the principles and properties, such as the inverse magnification rule, of self-organizing cortical maps.
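The inverse magnification rule can be illustrated with a few lines of arithmetic. The sketch below assumes, as follows from combining equations 4.11 and 4.12, that $V_{22} = G(V_{\max}) / (a_2 [\mu'(x)]^2 + b_2)$, and takes $\mathrm{RF} = (-V_{22})^{-1/2}$ and $\mathrm{CMF} = 1/\mu'(x)$. The numerical constants and the list of slopes are hypothetical values chosen only so that $V_{22}$ is negative.

```python
def receptive_field(mu_prime, a2, b2, g_max):
    """RF = (-V22)^(-1/2), with V22 = G(Vmax) / (a2 * mu'^2 + b2)."""
    v22 = g_max / (a2 * mu_prime ** 2 + b2)
    return (-v22) ** -0.5

def cmf(mu_prime):
    """Cortical magnification factor: reciprocal of the slope of y = mu(x)."""
    return 1.0 / mu_prime

slopes = [0.5, 1.0, 2.0, 4.0]   # hypothetical values of mu'(x) at four cortical sites
a2, g_max = 1.0, -2.0           # assumed constants, chosen so that V22 < 0

# With b2 = 0 the product RF * CMF is the same at every site.
products_b2_zero = [receptive_field(m, a2, 0.0, g_max) * cmf(m) for m in slopes]
# With a sizable b2 the product depends on the local slope.
products_b2_one = [receptive_field(m, a2, 1.0, g_max) * cmf(m) for m in slopes]
```

In this toy setting the product equals $(a_2 / |G(V_{\max})|)^{1/2}$ at every site when $b_2 = 0$, whereas a nonzero $b_2$ makes it vary with the local slope, which is why the condition on $b_2$ matters for the rule.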
6 Conclusions
The analytic power of this approach toward a unified description of the self-organization of cortical maps, as developed by Amari and colleagues and extended here, greatly facilitates a mathematical appreciation of the dynamics as well as the equilibrium behavior of the neural system. The present formulation embodies the general scheme of layered neural networks with feedforward (thalamocortical) excitations and lateral (intracortical) connections, and takes into account features such as the autocorrelation in the stimulus-driven activities of the thalamic afferents and the Hebbian rule of synaptic modification. The magnification and the resolution of the map are derived analytically to allow comparisons with experimental data. In particular, the linear relationship between the receptive field size and the inverse cortical magnification factor (namely the inverse magnification rule) as derived under this formulation is consistent with both experimental observations and results from computer simulations.
Appendix A

We discuss the condition

$$b_2 = 0$$

in this appendix. From equation 3.8,

$$b_2 = \frac{1}{2} \int_{-\infty}^{\infty} t^2\, \kappa(t)\, dt$$

According to equations 2.5 and 3.2 (omitting the constant factor $\eta$, which does not affect the condition),

$$\kappa(y - y') = \int a(y'' - y)\, a(y'' - y')\, dy'' = \int a(y'')\, a[y'' + (y - y')]\, dy''$$

which is simply the autocorrelation operation

$$\kappa(t) = \int_{-\infty}^{\infty} a(t')\, a(t' + t)\, dt'$$

so

$$b_2 = \frac{1}{2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} t^2\, a(t')\, a(t + t')\, dt'\, dt = \frac{1}{2} \int_{-\infty}^{\infty} a(t')\, dt' \int_{-\infty}^{\infty} (y - t')^2\, a(y)\, dy$$

where we put $y = t + t'$. Denoting

$$A_k = \int_{-\infty}^{\infty} t^k\, a(t)\, dt$$

we have

$$b_2 = \frac{1}{2} \int_{-\infty}^{\infty} a(t')\, dt' \int_{-\infty}^{\infty} (y^2 - 2yt' + t'^2)\, a(y)\, dy = \frac{1}{2} (A_0 A_2 - 2A_1^2 + A_2 A_0) = A_0 A_2 - A_1^2 \tag{A.6}$$

For an even-symmetric $a(t)$, $A_1 = 0$. We finally obtain

$$b_2 = A_0 A_2$$

Therefore, the condition $b_2 = 0$ implies that the integral of $a(t)$, either weighted by $t^2$ or not, should be zero. Physiologically, the ON/OFF regions in the response fields of the thalamic (geniculate) neurons must be balanced in their total excitation and total inhibition.
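The identity $b_2 = A_0 A_2 - A_1^2$ can be verified numerically for any sampled afferent profile. The sketch below uses a hypothetical difference-of-Gaussians profile $a(t)$ on a finite grid, computes the autocorrelation $\kappa$ by direct summation (with the factor $\eta$ omitted), and compares $\frac{1}{2}\int t^2 \kappa(t)\,dt$ with the moment expression. The profile, grid spacing, and function names are assumptions made for illustration.

```python
import math

H = 0.1                 # grid spacing (assumed)
GRID = range(-60, 61)   # sample points t = i * H, covering [-6, 6]

def a(t):
    """Hypothetical ON-center/OFF-surround afferent profile (difference of Gaussians)."""
    return math.exp(-t * t / 2.0) - 0.4 * math.exp(-t * t / 8.0)

def moment_A(k):
    """A_k = integral of t^k a(t) dt, as a Riemann sum on the grid."""
    return sum((i * H) ** k * a(i * H) for i in GRID) * H

def kappa(m):
    """Autocorrelation at lag t = m * H: integral of a(t') a(t' + t) dt' on the grid."""
    return sum(a(i * H) * a((i + m) * H) for i in GRID if -60 <= i + m <= 60) * H

# b2 = (1/2) * integral of t^2 kappa(t) dt, summed over all lags the grid supports.
b2 = 0.5 * sum((m * H) ** 2 * kappa(m) for m in range(-120, 121)) * H
A0, A1, A2 = moment_A(0), moment_A(1), moment_A(2)
```

Because the profile is even, $A_1$ vanishes and $b_2$ reduces to $A_0 A_2$; for this unbalanced profile the value is nonzero, and it would vanish only if the excitatory and inhibitory parts were tuned so that $A_0$ or $A_2$ is zero.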
Acknowledgments

This work was supported by PHS Grant EY-00014. The author especially thanks Dr. S. Amari for his helpful comments and for simplifying proofs that have enhanced this manuscript. Thanks are also extended to Drs. K. K. De Valois and R. L. De Valois for their generous support and constant encouragement.

² The simplified proof in the Appendix was provided by Dr. S. Amari, and replaces a previous proof using Fourier transform techniques.
References

Amari, S. 1983. Field theory of self-organizing neural nets. IEEE Trans. SMC SMC-13, 741-748.
Amari, S. 1989. Dynamical study of formation of cortical maps. In Dynamic Interactions in Neural Networks: Models and Data, M. A. Arbib and S. Amari, eds., pp. 15-34. Springer-Verlag, New York.
Chernjavsky, A., and Moody, J. 1990. Spontaneous development of modularity in simple cortical models. Neural Comp. 2, 334-350.
Chester, C. R. 1971. Techniques in Partial Differential Equations. McGraw-Hill, New York.
Grajski, K. A., and Merzenich, M. M. 1990. Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Comp. 2, 71-84.
Hubel, D. H., and Wiesel, T. N. 1974. Uniformity of monkey striate cortex: A parallel relationship between field size, scatter, and magnification factor. J. Comp. Neurol. 158, 295-306.
Jenkins, W. M., Merzenich, M. M., Ochs, M. T., Allard, T., and Guic-Robles, E. 1990. Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophys. 63, 82-104.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Knudsen, E. I., du Lac, S., and Esterly, S. D. 1987. Computational maps in the brain. Annu. Rev. Neurosci. 10, 41-65.
Linsker, R. 1986. From basic network principles to neural architecture. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Malsburg, Ch. von der 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
McIlwain, J. T. 1986. Point images in the visual system: New interest in an old idea. Trends Neurosci. 9, 354-358.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A neural network model for the formation of topographic maps in the CNS: Development of receptive fields. In Proc. Int. Joint Conf. Neural Networks (IJCNN'90), San Diego, II, 423-429.
Sur, M., Merzenich, M. M., and Kaas, J. H. 1980. Magnification, receptive-field area, and "hypercolumn" size in area 3b and 1 of somatosensory cortex in owl monkeys. J. Neurophys. 44, 295-311.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. R. Soc. London Ser. B 208, 243-264.
Takeuchi, A., and Amari, S. 1979. Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern. 35, 63-72.
Zhang, J. 1990. Dynamical self-organization and formation of cortical maps. In Proc. Int. Joint Conf. Neural Networks (IJCNN'90), San Diego, III, 487-492.

Received 23 July 90; accepted 12 November 90.
Communicated by John Moody
A Tree-Structured Algorithm for Reducing Computation in Networks with Separable Basis Functions
Terence D. Sanger
MIT E25-534, Cambridge, MA 02139 USA
I describe a new algorithm for approximating continuous functions in high-dimensional input spaces. The algorithm builds a tree-structured network of variable size, which is determined both by the distribution of the input data and by the function to be approximated. Unlike other tree-structured algorithms, learning occurs through completely local mechanisms and the weights and structure are modified incrementally as data arrive. Efficient computation in the tree structure takes advantage of the potential for low-order dependencies between the output and the individual dimensions of the input. This algorithm is related to the ideas behind k-d trees (Bentley 1975), CART (Breiman et al. 1984), and MARS (Friedman 1988). I present an example that predicts future values of the Mackey-Glass differential delay equation.

1 Introduction
Networks consisting of linear combinations of nonlinear basis functions have proven to be useful for approximating functions in a variety of different domains. Such networks are represented by equations of the form:
$$\hat{y} = \sum_i w_i\, \varphi_i(x) \approx y \tag{1.1}$$

where $y$ is the desired (scalar) output for input $x \in R^p$, $\hat{y}$ is the output approximated by the network, and the $w_i$'s are learned scalar weights. If the $\varphi_i$'s are radial basis functions [for reviews, see Poggio and Girosi (1989), Powell (1987), and Klopfenstein and Sverdlove (1983)], then they take the form

$$\varphi_i(x) = \varphi(\|x - \xi_i\|) \tag{1.2}$$

whose value depends only on the distance of the input $x$ from the "center" $\xi_i$. When the dimension $p$ of the input space is high, the work required to calculate the output $\varphi_i(x)$ of any basis function increases. In addition, since the size of the space increases geometrically with $p$, an excessively large number of basis functions may be required to approximate arbitrary functions adequately. This problem has been called the "curse of dimensionality." One might attempt to avoid this by noting that in some regions of the input space, the desired output function can be approximated using only a few dimensions of the input. This situation would occur if, for instance, the data were to lie on the line $x_2 = f(x_1)$ for any one-to-one function $f$, in which case the desired output $y$ could be estimated from either $x_1$ or $x_2$. In this report, I describe one method for reducing computational work that makes use of this idea. It is applicable in the case where the basis functions are separable, in the sense that they can be written

$$\varphi_i(x) = \prod_{d=1}^{p} \phi_{r_d^i}(x_d) \tag{1.3}$$

where $x_d$ is the $d$th component of $x$, $\phi_{r_d^i}(x_d) = \varphi(|x_d - (\xi_i)_d|)$ is a scalar function of scalar input, and $r_d^i$ specifies which one-dimensional basis function $\phi$ should be chosen along the $d$th dimension to form $\varphi_i$. (An example of separable basis functions is the radially symmetric gaussian basis.) In this equation and in the rest of this report I assume for simplicity that the centers of the basis functions are located on a fixed regular lattice with $N$ points along each dimension, so that there is a total of $N^p$ basis functions $\varphi_i$. However, the technique can also be applied to basis functions with arbitrarily fixed centers, in which case each function $\phi_{r_d^i}(x_d)$ is the projection of $\varphi_i(x)$ along the $x_d$ dimension.

Neural Computation 3, 67-78 (1991)
© 1991 Massachusetts Institute of Technology

2 Network Structure
To understand the structure of the proposed network, I will build the approximation up one dimension at a time. If the output can be determined from only the $x_1$ dimension, then it can be written

$$\hat{y}^{[1]} = \sum_{n=1}^{N} \alpha_n^{[1]}\, \phi_n(x_1) \tag{2.1}$$

where the superscript $[1]$ indicates that this is an approximation along only a single dimension, and the $\alpha_n^{[1]}$'s are the weights for the basis functions $\phi_n(x_1)$ along this dimension. To train the weights, we can use the LMS learning algorithm (Widrow and Hoff 1960) to reduce the mean-squared output error $E[(y - \hat{y}^{[1]})^2]$ by adding a weight change

$$\Delta\alpha_n^{[1]} = \gamma\, (y - \hat{y}^{[1]})\, \phi_n(x_1) \tag{2.2}$$

to $\alpha_n^{[1]}$ at each time sample, where $\gamma$ is a small rate term. Given sufficient input samples $x_1$, this algorithm will converge until the average weight change $E[\Delta\alpha_n^{[1]}] = 0$ for all $n$. The output $\hat{y}^{[1]}$ will then be the linear least mean-squared error approximation to $y$ based on the values $\phi_n(x_1)$.
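The single-dimension LMS rule can be sketched in a few lines. The example below assumes Gaussian one-dimensional basis functions on a regular lattice over $[0, 2\pi]$ and a target $y = \sin(x_1)$; the number of basis functions, their width, the rate $\gamma = 0.1$, and the function names are illustrative choices, not values from the paper.

```python
import math
import random

N = 10                                               # basis functions along x1 (assumed)
CENTERS = [2 * math.pi * n / (N - 1) for n in range(N)]
WIDTH = 0.4                                          # assumed Gaussian width

def phi(n, x1):
    """One-dimensional Gaussian basis function phi_n(x1)."""
    return math.exp(-((x1 - CENTERS[n]) ** 2) / (2 * WIDTH ** 2))

def predict(alpha, x1):
    """Network output: sum over n of alpha_n * phi_n(x1)."""
    return sum(alpha[n] * phi(n, x1) for n in range(N))

def lms_step(alpha, x1, y, gamma=0.1):
    """Add gamma * (y - yhat) * phi_n(x1) to every weight alpha_n."""
    err = y - predict(alpha, x1)
    for n in range(N):
        alpha[n] += gamma * err * phi(n, x1)

def mse(alpha, target, points=200):
    """Mean squared error on a fixed evaluation grid over [0, 2*pi]."""
    xs = [2 * math.pi * k / (points - 1) for k in range(points)]
    return sum((target(x) - predict(alpha, x)) ** 2 for x in xs) / points

rng = random.Random(0)
alpha = [0.0] * N
mse_before = mse(alpha, math.sin)
for _ in range(20000):
    x1 = rng.uniform(0.0, 2 * math.pi)
    lms_step(alpha, x1, math.sin(x1))
mse_after = mse(alpha, math.sin)
```

Each weight update uses only the local basis output and the shared output error, which is the property the tree construction relies on when the same rule is reused at deeper levels.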
Figure 1: Two-layer tree constructed by the algorithm. $x$'s are inputs, $\phi$'s are basis functions, and $\alpha$'s are weights. See text for explanation.

If $y$ does not depend only on $x_1$, then there will be some residual error $(y - \hat{y}^{[1]})$. Although this error will be uncorrelated with $\phi_n(x_1)$ for all $n$ (since $E[\Delta\alpha_n^{[1]}] = 0$), the variances $E[(\Delta\alpha_n^{[1]})^2]$ will be nonzero, indicating pressure to change the weights for certain examples. We can improve the approximation of $y$ by allowing the weights to vary based on information from the $x_2$ dimension. Suppose we pick $r_1$ to be the value of $n$ for which $E[(\Delta\alpha_n^{[1]})^2]$ is largest. We now want to modify the weight $\alpha_{r_1}^{[1]}$ by adding a term

$$\hat{\alpha}_{r_1}^{[1]} = \sum_{n=1}^{N} \alpha_{r_1,n}^{[2]}\, \phi_n(x_2) \tag{2.3}$$

(see Fig. 1). We can use the LMS rule again to learn the weights $\alpha_{r_1,n}^{[2]}$ from the error $\Delta\alpha_{r_1}^{[1]}$ by

$$\Delta\alpha_{r_1,n}^{[2]} = \Delta\alpha_{r_1}^{[1]}\, \phi_n(x_2) \tag{2.4}$$
$$\phantom{\Delta\alpha_{r_1,n}^{[2]}} = \gamma\, (y - \hat{y})\, \phi_{r_1}(x_1)\, \phi_n(x_2) \tag{2.5}$$
and we see that the error term for the product basis function $\phi_{r_1}(x_1)\, \phi_n(x_2)$ is exactly the term that the regular LMS rule would supply. The approximation at the output is now

$$\hat{y}^{[2]} = \sum_{n=1}^{N} \alpha_n^{[1]}\, \phi_n(x_1) + \hat{\alpha}_{r_1}^{[1]}\, \phi_{r_1}(x_1) \tag{2.6}$$
$$\phantom{\hat{y}^{[2]}} = \sum_{n=1}^{N} \alpha_n^{[1]}\, \phi_n(x_1) + \sum_{n=1}^{N} \alpha_{r_1,n}^{[2]}\, \phi_{r_1}(x_1)\, \phi_n(x_2) \tag{2.7}$$

This procedure can be followed for every value of $n$ at each of the $p$ dimensions. This leads to the following recursive learning rule:

$$\Delta\alpha^{[0]} = \gamma\, (y - \hat{y})$$
$$\Delta\alpha_{r_1}^{[1]} = \gamma\, (y - \hat{y})\, \phi_{r_1}(x_1) = (\Delta\alpha^{[0]})\, \phi_{r_1}(x_1)$$
$$\Delta\alpha_{r_1,r_2}^{[2]} = \gamma\, (y - \hat{y})\, \phi_{r_1}(x_1)\, \phi_{r_2}(x_2) = (\Delta\alpha_{r_1}^{[1]})\, \phi_{r_2}(x_2)$$

This formula makes it clear that the weight correction term $\Delta\alpha_{r_1,\ldots,r_d}^{[d]}$ functions like the error term $\gamma(y - \hat{y})$ for the next "layer," since the update equation for $\alpha_{r_1,\ldots,r_d,n}^{[d+1]}$ can be thought of as performing LMS learning with the output error being given by $\Delta\alpha_{r_1,\ldots,r_d}^{[d]}$. The weights from successive basis functions are being trained to correct the weights of previous ones based on the context specified by additional dimensions of the input. Modification of each weight depends only on its input and the error from the weight above, so training is completely local. New subtrees are grown below the leaf with maximal "weight variance" $E[(\Delta\alpha)^2]$, so the placement of a subtree depends on the variance of the local weight changes that are automatically updated by the network. The resulting network forms a tree of depth $p$ where each node has $N$ children. If the tree is completed, then the approximation $\hat{y}^{[p]}$ will contain all terms of the form $\alpha_{r_1,\ldots,r_p}^{[p]}\, \phi_{r_1}(x_1) \cdots \phi_{r_p}(x_p)$ and thus has at least as much approximation power as equation 1.1. $\hat{y}^{[p]}$ also contains additional terms involving combinations of fewer than $p$ dimensions, and thus may have more approximating power. However, if the tree is grown to its full size, there will be more coefficients than in equation 1.1, and this leads to more computational work. The work is decreased only if a sufficiently good approximation can be gained without growing the full tree. It thus makes sense to grow a subtree to modify a weight only when changing that weight would make a significant contribution to decreasing the output error for the network.
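The recursive rule and the variance-driven growth can be sketched as a small tree of nodes, each holding the weights for one input dimension. This is a minimal illustration under stated assumptions, not the author's implementation: the Gaussian bases, the exponentially weighted estimate of $E[(\Delta\alpha)^2]$, the fixed dimension order, the target $\sin(x_1)\cos(x_2)$, and all names (`Node`, `grow`, `train`) are hypothetical.

```python
import math
import random

N = 6                                                # basis functions per dimension (assumed)
CENTERS = [2 * math.pi * n / (N - 1) for n in range(N)]
WIDTH = 0.8                                          # assumed Gaussian width

def phi(n, x):
    return math.exp(-((x - CENTERS[n]) ** 2) / (2 * WIDTH ** 2))

class Node:
    """Weights alpha_n for one input dimension, with an optional subtree below each weight."""
    def __init__(self, depth):
        self.depth = depth              # index of the input dimension used by this node
        self.alpha = [0.0] * N
        self.var = [0.0] * N            # running estimate of E[(delta alpha_n)^2]
        self.child = [None] * N

    def output(self, x):
        total = 0.0
        for n in range(N):
            weight = self.alpha[n]
            if self.child[n] is not None:
                weight += self.child[n].output(x)   # correction from the next dimension
            total += weight * phi(n, x[self.depth])
        return total

    def update(self, x, err, decay=0.99):
        """err is gamma*(y - yhat) at the root, or the parent's delta alpha below it."""
        for n in range(N):
            delta = err * phi(n, x[self.depth])
            self.alpha[n] += delta
            self.var[n] = decay * self.var[n] + (1 - decay) * delta ** 2
            if self.child[n] is not None:
                self.child[n].update(x, delta)

    def leaves(self, p):
        """Yield (variance, node, n) for every weight that can still grow a subtree."""
        for n in range(N):
            if self.child[n] is not None:
                yield from self.child[n].leaves(p)
            elif self.depth + 1 < p:
                yield (self.var[n], self, n)

def grow(root, p):
    """Add a subtree below the leaf with the largest weight-change variance."""
    candidates = list(root.leaves(p))
    if candidates:
        _, node, n = max(candidates, key=lambda c: c[0])
        node.child[n] = Node(node.depth + 1)

def train(root, target, p, steps, gamma, rng):
    for _ in range(steps):
        x = [rng.uniform(0.0, 2 * math.pi) for _ in range(p)]
        root.update(x, gamma * (target(x) - root.output(x)))

def mse(root, target, grid=15):
    pts = [2 * math.pi * k / (grid - 1) for k in range(grid)]
    return sum((target([u, v]) - root.output([u, v])) ** 2
               for u in pts for v in pts) / grid ** 2

def target(x):
    return math.sin(x[0]) * math.cos(x[1])

rng = random.Random(0)
root = Node(0)
train(root, target, 2, 3000, 0.1, rng)
mse_before = mse(root, target)          # first layer alone cannot capture the product
for _ in range(N):
    grow(root, 2)
    train(root, target, 2, 3000, 0.1, rng)
mse_after = mse(root, target)
```

Here the tree is grown until every first-layer weight has a subtree, which is the complete two-dimensional case; in the setting described in the text one would stop growing once the error is acceptable, since the computational saving comes from leaving most subtrees ungrown.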
3 Growing the Tree
Since the ability of this type of network to save computation depends on selectively growing the tree structure, the rule for determining when to create a subtree will determine the overall performance. Unfortunately, there is no general way to determine the optimal next subtree to grow. In the algorithm above, it was suggested that a tree be grown beneath the leaf of the current tree with maximum error variance given by

$$\max_{d;\, r_1, \ldots, r_d} E\left[ (\Delta\alpha_{r_1,\ldots,r_d}^{[d]})^2 \right] \tag{3.1}$$

where the maximization is carried out over leaves at all depths $d$. This growth rule has the advantage of being simple and inexpensive, since it depends only on values that are already computed as part of the LMS learning rule. If we assume that the existing tree is fixed and the new subtree is made complete down to $p - d$ levels, then the decrease in expected error at the output $E[(y - \hat{y})^2]$ will be related to equation 3.1. In practice, it may not be practical to complete the subtree so that the weight error variance (3.1) is reduced to zero, but this error nevertheless is proportional to the maximal effect on the expected output error that a subtree at this node could have. It thus provides an upper bound on the usefulness of a partially grown subtree. Unfortunately, because this heuristic is determined by the effectiveness of a theoretically perfect complete subtree, it does not tell us where to place the next single-layer subtree. In addition, if the desired $y$ cannot in fact be approximated using this set of basis functions (or if there is significant noise), equation 3.1 will not predict the maximum effect of a subtree on the output. The subtree selection heuristic described here is thus intended only as a suggestion, and further research is necessary to determine more reliable selection methods. One possibility, which is used in the MARS algorithm (Friedman 1988), is to exhaustively test all possible new subtrees using cross-validation. Another possibility, used in both CART (Breiman et al. 1984) and MARS (Friedman 1988), is to grow a tree that is larger than necessary and then prune it back to yield a smaller and more efficient structure. Unfortunately, both methods can be computationally expensive and are difficult to implement using local operations within the network.

4 Implementation Issues
As described here, the algorithm imposes an order on the dimensions, and if the dimension $x_p$ is the most useful, the entire tree will have to be grown merely to access it. To avoid this problem, one can provide the entire basis set (for all $p$ dimensions) at each level. This increases the size of the network by a factor of $p$, but it eliminates the need to choose an
ordering of the dimensions, and hopefully will reduce the depth of the required tree. Note that this network structure is not limited to any particular type of basis function. Any basis at all can be used, and the choice of basis will determine the approximation ability of the network and the number of nodes needed to attain a given accuracy. Possible bases include the radially symmetric gaussian basis (as in radial basis function techniques), a Fourier basis along each dimension (the nodes will represent diagonally oriented filters), the eigenvectors of the input distribution (the nodes will represent cross-products of orthogonal outputs), the analog value $x_d$ along each dimension (the nodes will represent monomials, and the network will find a polynomial approximation to $y$), or even the bit values of the binary representation of the input. If the basis is formed by dividing each dimension into disjoint regions along sharp boundaries, then the algorithm is exactly equivalent to a "k-d tree" (Bentley 1975). The algorithm can thus be thought of as a generalization of the k-d tree to arbitrary overlapping or nonorthogonal bases.

5 Example
As a simple example of performance, I will attempt to predict future values of the Mackey-Glass chaotic differential delay equation (Mackey and Glass 1977)

$$\dot{x}(t) = \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\,x(t)$$
for $\tau = 30$, as suggested by Farmer and Sidorowich (1989), Lapedes and Farber (1988), and others. I use the same parameters and method of error estimation. The network has six-dimensional input $x(t - 6n)$ for $n = 0, \ldots, 5$. Each input value is coded using 20 elements from the Fourier basis $\sin(\omega x), \cos(\omega x)$, $\omega = 1, \ldots, 10$, as suggested in Sanger (1990). The entire set of $6 \times 20 = 120$ basis functions is available at each level. The task is for the network to learn to predict $x(t + 6)$ while observing the continuously evolving time series. A new subtree was added (below the leaf with maximal error variance) every 400 time steps. Although the predictions and the true values are visually indistinguishable after only 20 subtrees have been added, the network was grown to 106 subtrees (about 40 min on a SPARCstation) so that iterated predictions could be made. The 6-step normalized mean-squared error (NMSE, defined in Lapedes and Farber (1988) to be the root mean squared error divided by the data standard deviation) was 0.025, and iterated predictions 400 time steps into the future had NMSE < 0.5. Figure 2 shows the time series, 6-step predictions, and 600-step iterated prediction time series. Figure 3 shows NMSE as a function of iterated prediction time.
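As a rough illustration of the setup described above, the following sketch generates a Mackey-Glass series, builds the six delayed inputs, codes them with the 20-element Fourier basis, and evaluates the NMSE as defined in the text. The unit-step Euler integration, the constant initial history of 1.2, and all function names are assumptions made for this sketch; they are not taken from the experiment reported here.

```python
import math

def mackey_glass(n_steps, tau=30, x0=1.2):
    """Crude Euler integration (step 1) of
    dx/dt = 0.2 x(t - tau) / (1 + x(t - tau)^10) - 0.1 x(t)."""
    xs = [x0] * (tau + 1)                  # assumed constant initial history
    for _ in range(n_steps):
        delayed = xs[-1 - tau]             # x(t - tau)
        xs.append(xs[-1] + 0.2 * delayed / (1.0 + delayed ** 10) - 0.1 * xs[-1])
    return xs[tau + 1:]

def fourier_code(value, n_freq=10):
    """20-element code: sin(w x), cos(w x) for w = 1..10."""
    feats = []
    for w in range(1, n_freq + 1):
        feats.append(math.sin(w * value))
        feats.append(math.cos(w * value))
    return feats

def input_features(xs, t, n_delays=6, spacing=6):
    """Concatenate the Fourier codes of x(t - 6n), n = 0..5 (6 x 20 = 120 features)."""
    feats = []
    for n in range(n_delays):
        feats.extend(fourier_code(xs[t - spacing * n]))
    return feats

def nmse(targets, predictions):
    """Root mean squared error divided by the standard deviation of the targets."""
    mean = sum(targets) / len(targets)
    var = sum((y - mean) ** 2 for y in targets) / len(targets)
    mse = sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)
    return math.sqrt(mse / var)

series = mackey_glass(2000)
t = 100
features = input_features(series, t)       # network input at time t
target = series[t + 6]                     # value to be predicted, x(t + 6)
print(len(features))                       # 120
```

Under this definition a predictor that always outputs the mean of the series has an NMSE of 1, which gives a reference point for the values of 0.025 and 0.5 quoted above.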
Figure 2: Mackey-Glass time series, 6-step ahead predictions (which are indistinguishable from the time series here), and iterated prediction time series up to 600 time steps into the future. The network has converged for 42,400 samples.

Figure 3: Iterated predictions for the Mackey-Glass equation, showing normalized mean-squared error as a function of prediction time up to 600 steps into the future.

6 Discussion
This formulation of basis function training has several advantages over more standard methods. It was motivated by an attempt to save computational work when using basis function techniques to approximate functions that locally can be calculated from only a few dimensions, and in this case both the learning time and the time required to compute the output $\hat{y}$ are reduced. If a minimum error is specified for approximation, then just enough terms can be calculated to achieve this criterion, and further terms do not need to be computed even if they do contribute. An additional use for this network structure occurs when new dimensions may be added by the addition of, for example, new sensors. It is possible to construct the network so that the weights that have already
been learned do not need to be relearned to incorporate the new sensors. (Further improvement may be gained by modifying existing weights, but it will not be necessary to start over from scratch.) There are several other network algorithms that are related to the one proposed here. Basis function approximation is a well-known technique in statistics, as is approximation by polynomials of increasing order (Gabor 1961). Stone (1985) used a basis set consisting of univariate spline functions, which is equivalent to a single layer network of the type proposed here when the entire basis set is provided as in Section 4. Learning is often accomplished using variants of the LMS rule (Widrow and Hoff 1960) such as the Perceptron algorithm (Rosenblatt 1962) or Backpropagation (Rumelhart et al. 1986). The GMDH algorithm (Barron et al. 1984; Ikeda et al. 1976; Ivakhnenko 1971) and SONN (Tenorio and Lee 1989) construct polynomial approximations of increasing order by forming pairwise products of previously computed features that correlate well with the desired output. Successive combinations build a tree structure upward starting with the leaves. Similarly, in Cascade-Correlation (Fahlman and Lebiere 1990) and Tiling (Mezard and Nadal 1989) small subnetworks are trained to provide useful outputs that serve as inputs to succeeding networks. Note that in the algorithm proposed here the lower levels of the tree are used to modify the weights of the higher levels, while in GMDH and Cascade-Correlation the lower levels provide the input to the higher levels. Other algorithms in which one part of a network controls the weights of another are found in (Jacobs et al. 1990, 1991; Hinton et al. 1986). The Loess algorithm (Cleveland and Devlin 1988) can also be seen as a local basis function network that controls the weights being used to form a global regression. Many authors have used neural networks to perform time-series prediction, including Weigend et al. 
(1990), Farmer and Sidorowich (1989), Moody (1989), Moody and Darken (1989), and Lapedes and Farber (1988). Moody and Darken (1988, 1989) used radial basis functions with adaptive centers to predict the Mackey-Glass time series with a delay of $\tau = 17$ (Mackey and Glass 1977) at 85 time steps into the future. They achieved NMSE of 0.06 after 30 min of processing. (Iterated prediction errors are not available.) Farmer and Sidorowich (1989) used a similar method based on nearest-neighbor approximation and fitting local polynomials, and they achieved a 6-step prediction error less than after approximately $10^4$ examples. Moody (1989) used a multiresolution system based on CMAC (Miller et al. 1987) and a multivariate B-spline basis to predict 50 time steps into the future on-line for $\tau = 17$ and achieved NMSE of 0.04 after 9900 examples. In off-line mode, the system achieved a prediction error of 0.05 on only 500 exemplars. (Iterated prediction errors are not available.) Lapedes and Farber (1987, 1988) used a backpropagation network to achieve an iterated NMSE prediction error of 0.5 at 260 steps into the future and 1.0 at 400 steps. Weigend et al. (1990) also used
backpropagation to predict time-series data, but they have not applied their method to the Mackey-Glass equation. There are several other algorithms that grow similar tree structures that locally use only a few input dimensions at a time. k-d trees (Bentley 1975) form hard boundaries between regions of data by "splitting" at the median point along dimensions with high variance. This splits the input space into disjoint rectangular regions that can be used for assigning classifications to data or for piecewise constant function approximation. A similar idea based on single-layer network learning is described in Knerr et al. (1990). CART (Breiman et al. 1984; Sun et al. 1988) splits along dimensions that maximize an information-theoretic measure designed to increase classification ability. Similarly, AID (Morgan and Sonquist 1963; Sonquist et al. 1971) adds new dimensions wherever there is high prediction error variance. When used for function approximation these methods lead most naturally to piecewise constant approximations, although it is possible to fit linear or higher order functions within the classification regions. In all cases, tree-growing and weight adaptation are done using the entire data set, rather than incrementally as data arrives. Perhaps the most closely related algorithm is MARS (Friedman 1988), in which the basis functions along each dimension are truncated polynomials with variable offsets. This basis function set is a compromise between the hard region boundaries of k-d trees and CART, and the soft overlapping regions defined by gaussians or sinusoids, as used here. It leads to piecewise-smooth approximations, while the algorithm proposed here is capable of using arbitrary (separable) basis functions and thereby achieving arbitrary smoothness properties. However, since the method I propose could be adapted to use the MARS basis functions, the principal difference lies in the training and tree-growing methods.
In MARS, the coefficients of the basis functions are learned by least-squares fitting to the complete data set, whereas I propose the use of the gradient descent LMS algorithm, which operates incrementally as data samples arrive. In MARS, splits (subtrees) are chosen by trying all possibilities and picking the one that minimizes a "lack-of-fit" criterion based on cross-validation, whereas I propose a method based on the weight variance, which is already available from LMS. Each split in MARS adds two new basis functions, so the tree has outdegree 2, whereas the trees proposed here have considerably larger degree. MARS finds an optimal tree by growing a tree that is larger than necessary and then pruning it back, whereas the algorithm proposed here does not necessarily find optimal tree structures, sacrificing optimality for simplicity of computation and learning which occurs completely within the local network structure. Perhaps the most important difference lies in the fact that MARS requires the complete data set for finding coefficients and choosing subtrees. Although this is preferable if all the data are available, it is less useful for a network that must continuously update itself to handle new data as they arrive. It is also difficult to implement using simple local learning rules.
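The contrast drawn above between batch least-squares fitting and incremental LMS can be illustrated with a minimal sketch of the Widrow-Hoff update on a fixed one-dimensional Fourier basis. The target function, step size, and number of samples are arbitrary choices made for illustration, and the sketch leaves out the tree structure entirely.

```python
import math
import random

def basis(x, n_freq=3):
    """Fourier basis ordered as sin(x), cos(x), sin(2x), cos(2x), ..."""
    return [f(w * x) for w in range(1, n_freq + 1) for f in (math.sin, math.cos)]

def lms_step(weights, phi, y, rate):
    """One LMS update: move the weights along phi in proportion to the output error."""
    error = y - sum(w * p for w, p in zip(weights, phi))
    return [w + rate * error * p for w, p in zip(weights, phi)]

def target(x):
    # Hypothetical target chosen to lie in the span of the basis.
    return math.sin(2 * x) + 0.5 * math.cos(x)

rng = random.Random(0)
weights = [0.0] * 6
for _ in range(5000):                       # samples arrive one at a time
    x = rng.uniform(0.0, 2.0 * math.pi)
    weights = lms_step(weights, basis(x), target(x), rate=0.1)

print([round(w, 3) for w in weights])
```

Each update uses only the current sample and the current output error, so no data need be stored. In this noise-free case the weights approach the least-squares solution (0.5 on the cos(x) term and 1 on the sin(2x) term) without ever forming the full data matrix.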
So although both algorithms build tree structures, the learning in MARS is not performed locally within the network and requires considerable external machinery. MARS also depends on the truncated polynomial (spline) basis and cannot be easily adapted to arbitrary basis functions. However, if we use the most general definitions, then MARS, CART, k-d trees, and the algorithm proposed here are very similar, and the differences are due to the choices of basis function, tree-growing heuristic, and weight-modification rule. In summary, the algorithm I have presented combines elements from basis function networks and tree-structured regression techniques. It takes advantage of the possibility that a desired function can be computed from only a few dimensions of the input data. It can be implemented as a simple extension to the LMS algorithm. Since weight modification and tree-growing depend on only local information, it can be parallelized completely, and a Connection Machine implementation is currently being developed. Although it does not solve the “curse of dimensionality,” in applications it may make the use of basis function networks practical for high-dimensional input spaces. Much work remains to be done to improve the tree-growing heuristic. The use of pruning and statistical methods such as cross-validation might be able to improve robustness significantly, although perhaps at the expense of speed, parallelism, and ease of implementation.
Acknowledgments

I would like to thank Terry Sejnowski and John Moody for their extensive and helpful comments on the manuscript. This work started during a course taught at MIT by Chris Atkeson, Michael Jordan, and Marc Raibert. The report describes research done within the laboratory of Dr. Emilio Bizzi in the Department of Brain and Cognitive Sciences at MIT. The author was supported by the division of Health Sciences and Technology, and by NIH Grants 5R37AR26710 and 5R01NS09343 to Dr. Bizzi.
References

Barron, R. L., Mucciardi, A. N., Cook, F. J., Craig, J. N., and Barron, A. R. 1984. Adaptive learning networks: Development and application in the United States of algorithms related to GMDH. In Self-Organizing Methods in Modeling, S. J. Farlow, ed. Marcel Dekker, New York.
Bentley, J. H. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509-517.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83(403), 596-610.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. Tech. Rep. CMU-CS-90-100, Carnegie Mellon School of Computer Science, Pittsburgh.
Farmer, J. D., and Sidorowich, J. J. 1989. Predicting chaotic dynamics. In Dynamic Patterns in Complex Systems, J. A. S. Kelso, A. J. Mandell, and M. F. Shlesinger, eds., pp. 265-292. World Scientific, New Jersey.
Friedman, J. H. 1988. Multivariate adaptive regression splines. Tech. Rep. 102, Stanford Univ. Lab for Computational Statistics.
Gabor, D. 1961. A universal nonlinear filter, predictor, and simulator which optimizes itself by a learning process. Proc. IEE 108B, 422-438.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing, J. L. McClelland, D. E. Rumelhart, and The PDP Research Group, eds., pp. 77-109. MIT Press, Cambridge, MA.
Ikeda, S., Ochiai, M., and Sawaragi, Y. 1976. Sequential GMDH algorithm and its application to river flow prediction. IEEE Trans. Systems, Man, Cybern. SMC-6(7), 473-479.
Ivakhnenko, A. G. 1971. Polynomial theory of complex systems. IEEE Trans. Systems, Man, Cybern. SMC-1(4), 364-378.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. 1990. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Tech. Rep. COINS TR 90-27, Univ. of Massachusetts, Amherst.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Klopfenstein, R. W., and Sverdlove, R. 1983. Approximation by uniformly spaced gaussian functions. In Approximation Theory IV, C. K. Chui, L. L. Schumaker, and J. D. Ward, eds., pp. 575-580. Academic Press, New York.
Knerr, S., Personnaz, L., and Dreyfus, G. 1990. Single-layer learning revisited: A stepwise procedure for building and training a neural network. Manuscript.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks. Los Alamos National Laboratory LA-UR-87-2662. Proc. IEEE, submitted.
Lapedes, A., and Farber, R. 1988. How neural nets work. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 442-456. Am. Inst. Physics, NY, Proceedings of the Denver, 1987 Conference.
Mackey, M. C., and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science 197, 287-289.
Mezard, M., and Nadal, J.-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A 22, 2191-2203.
Miller, W. T., Glanz, F. H., and Kraft, L. G. 1987. Application of a general learning algorithm to the control of robotic manipulators. Int. J. Robotics Res. 6(2), 84-98.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 29-39. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1988. Learning with localized receptor fields.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Morgan, J. N., and Sonquist, J. A. 1963. Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58, 415-434.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. MIT AI Memo 1140.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-167. Clarendon Press, Oxford.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Chap. 8, pp. 318-362. MIT Press, Cambridge, MA.
Sanger, T. D. 1990. Learning nonlinear features using eigenvectors of radial basis functions. Abstracts of the Neural Networks for Computing conference, Snowbird, UT.
Sonquist, J. A., Baker, E. L., and Morgan, J. N. 1971. Searching for Structure. Institute for Social Research, Univ. Michigan, Ann Arbor.
Stone, C. J. 1985. Additive regression and other nonparametric models. Ann. Stat. 13(2), 689-705.
Sun, G. Z., Lee, Y. C., and Chen, H. H. 1988. A novel net that learns sequential decision process. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 760-766. American Institute of Physics, New York.
Tenorio, M. F., and Lee, W.-T. 1989. Self organizing neural network for optimum supervised learning. Tech. Rep. TR-EE 89-30, Purdue Univ. School of Electrical Engineering.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. In IRE WESCON Conv. Record, Part 4, pp. 96-104.

Received 20 March 1990; accepted 12 October 1990.
Communicated by Jeffrey Elman
Adaptive Mixtures of Local Experts

Robert A. Jacobs
Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Steven J. Nowlan
Geoffrey E. Hinton
Department of Computer Science, University of Toronto, Toronto, Canada M5S 1A4
We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.

1 Making Associative Learning Competitive
If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different "expert" networks plus a gating network that decides which of the experts should be used for each training case.¹ Hampshire and Waibel (1989) have described a system of this kind that can be used when the division into subtasks is known prior to training, and Jacobs et al. (1990) have described a related system that learns how to allocate cases to experts. The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these experts (and the gating network).

¹ This idea was first presented by Jacobs and Hinton at the Connectionist Summer School in Pittsburgh in 1988.
Neural Computation 3, 79-87 (1991)
© 1991 Massachusetts Institute of Technology
So there is no interference with the weights of other experts that specialize in quite different cases. The experts are therefore local in the sense that the weights in one expert are decoupled from the weights in other experts. In addition they will often be local in the sense that each expert will be allocated to only a small local region of the space of possible input vectors. Unfortunately, both Hampshire and Waibel and Jacobs et al. use an error function that does not encourage localization. They assume that the final output of the whole system is a linear combination of the outputs of the local experts, with the gating network determining the proportion of each local output in the linear combination. So the final error on case c is
$$E^c = \left\| d^c - \sum_i p_i^c \, o_i^c \right\|^2 \qquad (1.1)$$

where $o_i^c$ is the output vector of expert $i$ on case $c$, $p_i^c$ is the proportional contribution of expert $i$ to the combined output vector, and $d^c$ is the desired output vector in case $c$. This error measure compares the desired output with a blend of the outputs of the local experts, so, to minimize the error, each local expert must make its output cancel the residual error that is left by the combined effects of all the other experts. When the weights in one expert change, the residual error changes, and so the error derivatives for all the other local experts change.² This strong coupling between the experts causes them to cooperate nicely, but tends to lead to solutions in which many experts are used for each case. It is possible to encourage competition by adding penalty terms to the objective function to encourage solutions in which only one expert is active (Jacobs et al. 1990), but a simpler remedy is to redefine the error function so that the local experts are encouraged to compete rather than cooperate. Instead of linearly combining the outputs of the separate experts, we imagine that the gating network makes a stochastic decision about which single expert to use on each occasion (see Fig. 1). The error is then the expected value of the squared difference between the desired and actual output vectors

$$E^c = \langle \| d^c - o_i^c \|^2 \rangle = \sum_i p_i^c \, \| d^c - o_i^c \|^2 \qquad (1.2)$$
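As a small numerical illustration of the difference between the two error measures, the sketch below assumes the blended error takes the form $\| d - \sum_i p_i o_i \|^2$ and the competitive error the form $\sum_i p_i \| d - o_i \|^2$, with gating proportions produced by a softmax. The two-expert example and all function names are hypothetical.

```python
import math

def softmax(xs):
    """Normalized gating outputs p_j = exp(x_j) / sum_i exp(x_i)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def blended_error(d, outputs, p):
    """Cooperative error: compare d with the gate-weighted blend of expert outputs."""
    blend = [sum(pi * o[k] for pi, o in zip(p, outputs)) for k in range(len(d))]
    return sq_dist(d, blend)

def competitive_error(d, outputs, p):
    """Expected squared error when a single expert is picked with probability p_i."""
    return sum(pi * sq_dist(d, o) for pi, o in zip(p, outputs))

d = [1.0, 0.0]                             # desired output
outputs = [[2.0, 0.0], [0.0, 0.0]]         # each expert is wrong by 1, in opposite directions
p = softmax([0.0, 0.0])                    # equal gating inputs give [0.5, 0.5]

print(blended_error(d, outputs, p))        # 0.0
print(competitive_error(d, outputs, p))    # 1.0
```

The blended error is zero here even though neither expert produces the desired output on its own, which is the cooperative behavior the text describes. The competitive error stays at 1 until at least one expert fits the case by itself.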
² For Hampshire and Waibel, this problem does not arise because they do not learn the task decomposition. They train each expert separately on its own preassigned subtask.

Figure 1: A system of expert and gating networks. Each expert is a feedforward network and all experts receive the same input and have the same number of outputs. The gating network is also feedforward, and typically receives the same input as the expert networks. It has normalized outputs $p_j = \exp(x_j) / \sum_i \exp(x_i)$, where $x_j$ is the total weighted input received by output unit $j$ of the gating network. The selector acts like a multiple input, single output stochastic switch; the probability that the switch will select the output from expert $j$ is $p_j$.

Notice that in this new error function, each expert is required to produce the whole of the output vector rather than a residual. As a result, the goal of a local expert on a given training case is not directly affected by the weights within other local experts. There is still some indirect coupling because if some other expert changes its weights, it may cause the gating network to alter the responsibilities that get assigned to the experts, but at least these responsibility changes cannot alter the sign of the error that a local expert senses on a given training case. If both the gating network and the local experts are trained by gradient descent in this new error function, the system tends to devote a single expert to each training case. Whenever an expert gives less error than the weighted average of the errors of all the experts (using the outputs of the gating network to decide how to weight each expert's error) its responsibility for that case
will be increased, and whenever it does worse than the weighted average its responsibility will be decreased. The error function in equation 1.2 works in practice but in the simulations reported below we used a different error function which gives better performance:

$$E^c = -\log \sum_i p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2} \qquad (1.3)$$
The error defined in equation 1.3 is simply the negative log probability of generating the desired output vector under the mixture of gaussians model described at the end of the next section. To see why this error function works better, it is helpful to compare the derivatives of the two error functions with respect to the output of an expert. From equation 1.2 we get

$$\frac{\partial E^c}{\partial o_i^c} = -2\, p_i^c \, (d^c - o_i^c) \qquad (1.4)$$

while from equation 1.3 we get

$$\frac{\partial E^c}{\partial o_i^c} = -\left[ \frac{p_i^c \, e^{-\frac{1}{2} \| d^c - o_i^c \|^2}}{\sum_j p_j^c \, e^{-\frac{1}{2} \| d^c - o_j^c \|^2}} \right] (d^c - o_i^c) \qquad (1.5)$$

In equation 1.4 the term $p_i^c$ is used to weight the derivative for expert $i$. In equation 1.5 we use a weighting term that takes into account how well expert $i$ does relative to other experts. This is a more useful measure of the relevance of expert $i$ to training case $c$, especially early in the training. Suppose, for example, that the gating network initially gives equal weights to all experts and $\| d^c - o_i^c \| > 1$ for all the experts. Equation 1.4 will adapt the best-fitting expert the slowest, whereas equation 1.5 will adapt it the fastest.

2 Making Competitive Learning Associative
It is natural to think that the "data" vectors on which a competitive network is trained play a role similar to the input vectors of an associative network that maps input vectors to output vectors. This correspondence is assumed in models that use competitive learning as a preprocessing stage within an associative network (Moody and Darken 1989). A quite different view is that the data vectors used in competitive learning correspond to the output vectors of an associative network. The competitive network can then be viewed as an inputless stochastic generator of output vectors and competitive learning can be viewed as a procedure for making the network generate output vectors with a distribution that matches the distribution of the "data" vectors. The weight vector of each competitive hidden unit represents the mean of a multidimensional gaussian
distribution, and output vectors are generated by first picking a hidden unit and then picking an output vector from the gaussian distribution determined by the weight vector of the chosen hidden unit. The log probability of generating any particular output vector $o^c$ is then

$$\log P^c = \log \sum_i p_i \, k \, e^{-\frac{1}{2} \| \mu_i - o^c \|^2} \qquad (2.1)$$

where $i$ is an index over the hidden units, $\mu_i$ is the "weight" vector of the hidden unit, $k$ is a normalizing constant, and $p_i$ is the probability of picking hidden unit $i$, so the $p_i$ are constrained to sum to 1. In the statistics literature (McLachlan and Basford 1988), the $p_i$ are called "mixing proportions." "Soft" competitive learning modifies the weights (and also the variances and the mixing proportions) so as to increase the product of the probabilities (i.e., the likelihood) of generating the output vectors in the training set (Nowlan 1990a). "Hard" competitive learning is a simple approximation to soft competitive learning in which we ignore the possibility that a data vector could be generated by several different hidden units. Instead, we assume that it must be generated by the hidden unit with the closest weight vector, so only this weight vector needs to be modified to increase the probability of generating the data vector. If we view a competitive network as generating output vectors, it is not immediately obvious what role input vectors could play. However, competitive learning can be generalized in much the same way as Barto (1985) generalized learning automata by adding an input vector and making the actions of the automaton be conditional on the input vector. We replace each hidden unit in a competitive network by an entire expert network whose output vector specifies the mean of a multidimensional gaussian distribution. So the means are now a function of the current input vector and are represented by activity levels rather than weights. In addition, we use a gating network which allows the mixing proportions of the experts to be determined by the input vector. This gives us a system of competing local experts with the error function defined in equation 1.3.
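The mixture view can be made concrete with a short sketch. It assumes unit-variance gaussians with the normalizing constant dropped, so the error for a case is $-\log \sum_i p_i \exp(-\frac{1}{2}\|d - o_i\|^2)$. Under the same assumption the derivative with respect to an expert's output is weighted by the posterior probability of that expert, whereas the expected squared error $\sum_i p_i \|d - o_i\|^2$ weights it by $2 p_i$. The three-expert example and the function names are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def neg_log_likelihood(d, outputs, p):
    """-log sum_i p_i exp(-0.5 ||d - o_i||^2): unit variance, constant k dropped."""
    return -math.log(sum(pi * math.exp(-0.5 * sq_dist(d, o))
                         for pi, o in zip(p, outputs)))

def responsibilities(d, outputs, p):
    """Posterior probability that each expert generated the desired output d."""
    scores = [pi * math.exp(-0.5 * sq_dist(d, o)) for pi, o in zip(p, outputs)]
    total = sum(scores)
    return [s / total for s in scores]

d = [0.0]
outputs = [[1.5], [2.0], [3.0]]            # expert 0 fits best, yet every ||d - o_i|| > 1
p = [1.0 / 3.0] * 3                        # gating network gives equal weights

# Magnitude of the error derivative with respect to each expert's output.
grad_expected_sq = [2.0 * pi * math.sqrt(sq_dist(d, o))
                    for pi, o in zip(p, outputs)]
grad_log_lik = [r * math.sqrt(sq_dist(d, o))
                for r, o in zip(responsibilities(d, outputs, p), outputs)]

print(grad_expected_sq)   # grows with distance: the best expert adapts slowest
print(grad_log_lik)       # shrinks with distance: the best expert adapts fastest
print(neg_log_likelihood(d, outputs, p))
```

With equal gating weights, the squared-error gradient is smallest for the best-fitting expert, while the log-likelihood gradient is largest for it. This is the qualitative behavior one would want early in training, when the gating network has not yet learned to discriminate between experts.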
We could also introduce a mechanism to allow the input vector to dynamically determine the covariance matrix for the distribution defined by each expert network, but we have not yet experimented with this possibility.

3 Application to Multispeaker Vowel Recognition
Figure 2: Data for vowel discrimination problem, and expert and gating network decision lines. The horizontal axis is the first formant value, and the vertical axis is the second formant value (the formant values have been linearly scaled by dividing by a factor of 1000). Each example is labeled with its corresponding vowel symbol. Vowels [i] and [I] form one overlapping pair of classes, vowels [a] and [A] form the other pair. The lines labeled Net 0, 1, and 2 represent the decision lines for 3 expert networks. On one side of these lines the output of the corresponding expert is less than 0.5, on the other side the output is greater than 0.5. Although the mixture in this case contained 4 experts, one of these experts made no significant contribution to the final mixture since its mixing proportion $p_i$ was effectively 0 for all cases. The line labeled Gate 0:2 indicates the decision between expert 0 and expert 2 made by the gating network. To the left of this line $p_2 > p_0$, to the right of this line $p_0 > p_2$. The boundary between classes [a] and [A] is formed by the combination of the left part of Net 2's decision line and the right part of Net 0's decision line. Although the system tends to use as few experts as it can to solve a problem, it is also sensitive to specific problem features such as the slightly curved boundary between classes [a] and [A].

The mixture of experts model was evaluated on a speaker independent, four-class, vowel discrimination problem (Nowlan 1990b). The data consisted of the first and second formants of the vowels [i], [I], [a], and [A] (usually denoted [ʌ]) from 75 speakers (males, females, and children) uttered in a hVd context (Peterson and Barney 1952). The data forms two pairs of overlapping classes, and different experts learn to concentrate on one pair of classes or the other (Fig. 2). We compared standard backpropagation networks containing a single hidden layer of 6 or 12 units with mixtures of 4 or 8 very simple experts. The architecture of each expert was restricted so it could form only a linear decision surface, which is defined as the set of input vectors for which the expert gives an output of exactly 0.5. All models were trained with data from the first 50 speakers and tested with data from the remaining 25 speakers.
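The restriction to linear decision surfaces can be illustrated with a toy sketch in which each expert is a single logistic unit and the gating network is a softmax over linear units. All weights below are invented for illustration, the two-class output is a simplification of the four-class task, and nothing here reproduces the networks actually trained in the experiment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expert_output(weights, bias, x):
    """Single logistic unit; its output is exactly 0.5 on the line weights . x + bias = 0."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def gate(gate_units, x):
    """Softmax over linear gating units: mixing proportions for input x."""
    totals = [sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in gate_units]
    m = max(totals)
    exps = [math.exp(t - m) for t in totals]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_output(experts, gate_units, x):
    """Gate-weighted average of the expert outputs, used to pick the class."""
    props = gate(gate_units, x)
    return sum(p * expert_output(w, b, x) for p, (w, b) in zip(props, experts))

# Hypothetical parameters; inputs are (first formant, second formant) in kHz.
experts = [([0.0, 4.0], -8.0),        # decision line at second formant = 2.0
           ([0.0, 4.0], -4.8)]        # decision line at second formant = 1.2
gate_units = [([-10.0, 0.0], 6.0),    # favors expert 0 when first formant < 0.6
              ([10.0, 0.0], -6.0)]    # favors expert 1 when first formant > 0.6

for x in ([0.3, 2.4], [0.9, 1.0]):
    y = mixture_output(experts, gate_units, x)
    print(x, round(y, 3), "class A" if y > 0.5 else "class B")
```

Each expert contributes one straight decision line, and the gating network decides which line applies in which region of the formant plane, so the combined boundary can be piecewise linear even though every component is linear.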
The small number of parameters for each expert allows excellent generalization performance (Table 1), and permits
System      Train % correct   Test % correct   Average number of epochs   SD
4 Experts   88                90               1124                       23
8 Experts   88                90               1083                       12
BP 6 Hid    88                90               2209                       83
BP 12 Hid   88                90               2435                       124
Table 1: Summary of Performance on Vowel Discrimination Task. Results are based on 25 simulations for each of the alternative models. The first column of the table indicates the system simulated. The second column gives the percent of training cases classified correctly by the final set of weights, while the third column indicates the percent of testing cases classified correctly. The last two columns contain the average number of epochs required to reach the error criterion, and the standard deviation of the distribution of convergence times. Although the squared error was used to decide when to stop training, the criterion for correct performance is based on a weighted average of the outputs of all the experts. Each expert assigns a probability distribution over the classes and these distributions are combined using proportions given by the gating network. The most probable class is then taken to be the response of the system. The identical performance of all the systems is due to the fact that, with this data set, the set of misclassified examples is not sensitive to small changes in the decision surfaces. Also, the test set is easier than the training set.
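The classification rule described in the caption of Table 1 can be sketched as follows. The expert distributions and mixing proportions are made-up numbers chosen only to show the arithmetic, and the function names are hypothetical; in the actual system the proportions come from the gating network.

```python
def mixture_distribution(expert_dists, mixing_proportions):
    """Weighted average of the experts' class distributions."""
    n_classes = len(expert_dists[0])
    return [
        sum(p * dist[c] for p, dist in zip(mixing_proportions, expert_dists))
        for c in range(n_classes)
    ]


def predict_class(expert_dists, mixing_proportions):
    """Index of the most probable class under the combined distribution."""
    combined = mixture_distribution(expert_dists, mixing_proportions)
    return max(range(len(combined)), key=lambda c: combined[c])


# Hypothetical case: two experts, two classes.
expert_dists = [
    [0.9, 0.1],   # expert 0 favours class 0
    [0.2, 0.8],   # expert 1 favours class 1
]
mixing_proportions = [0.25, 0.75]   # assumed gating output; sums to 1

combined = mixture_distribution(expert_dists, mixing_proportions)
response = predict_class(expert_dists, mixing_proportions)
print(combined, response)
```

Here the expert with the larger mixing proportion dominates, so the combined distribution favours class 1 even though expert 0 is more confident about class 0.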
a graphic representation of the process of task decomposition (Figure 3). The number of hidden units in the backpropagation networks was chosen to give roughly equal numbers of parameters for the backpropagation networks and mixture models. All simulations were performed using a simple gradient descent algorithm with fixed step size ε. To simplify the comparisons, no momentum or other acceleration techniques were used. The value of ε for each system was chosen by performing a limited exploration of the convergence from the same initial conditions for a range of ε. Batch training was used with one weight update for each pass through the training set (epoch). Each system was trained until an average squared error of 0.08 over the training set was obtained. The mixtures of experts reach the error criterion significantly faster than the backpropagation networks (p >> 0.999), requiring only about half as many epochs on average (Table 1). The learning time for the mixture model also scales well as the number of experts is increased: The mixture of 8 experts has a small, but statistically significant (p > 0.95), advantage in the average number of epochs required to reach the error criterion. In contrast, the 12 hidden unit backpropagation network requires more epochs (p > 0.95) to reach the error criterion than the network with 6
Figure 3: The trajectories of the decision lines of some experts during one simulation. The horizontal axis is the first formant value, and the vertical axis is the second formant value. Each trajectory is represented by a sequence of dots, one per epoch, each dot marking the intersection of the expert's decision line and the normal to that line passing through the origin. For clarity, only 5 of the 8 experts are shown and the number of the expert is shown at the start of the trajectory. The point labeled T0 indicates the optimal decision line for a single expert trained to discriminate [i] from [I]. Similarly, T1 represents the optimal decision line to discriminate [a] from [A]. The point labeled X is the decision line learned by a single expert trained with data from all 4 classes, and represents a type of average solution.

hidden units (Table 1). All statistical comparisons are based on a t test with 48 degrees of freedom and a pooled variance estimator. Figure 3 shows how the decision lines of different experts move around as the system learns to allocate pieces of the task to different experts. The system begins in an unbiased state, with the gating network assigning equal mixing proportions to all experts in all cases. As a result, each expert tends to get errors from roughly equal numbers of cases in all 4 classes, and all experts head towards the point X, which represents the optimal decision line for an expert that must deal with all the cases. Once one or more experts begin to receive more error from cases in one class pair than the other, this symmetry is broken and the trajectories begin to diverge as different experts concentrate on one class pair or the other. In this simulation, expert 5 learns to concentrate on discriminating classes [i] and [I] so its decision line approaches the optimal line for this discrimination (T0). Experts 4 and 6 both concentrate on discriminating classes [a] and [A], so their trajectories approach the
optimal single line (T1) and then split to form a piecewise linear approximation to the slightly curved optimal decision surface (see Fig. 2). Only experts 4, 5, and 6 are active in the final mixture. This solution is typical: in all simulations with mixtures of 4 or 8 experts all but 2 or 3 experts had mixing proportions that were effectively 0 for all cases.
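The statistical comparisons of convergence times can be reproduced approximately from the rounded figures in Table 1. The sketch below assumes that the SD column holds sample standard deviations and that each group has 25 simulations, which gives the stated 48 degrees of freedom; the function name is hypothetical, and the rounded table values mean the resulting statistics are only indicative.

```python
import math


def pooled_t(mean_a, sd_a, mean_b, sd_b, n):
    """Two-sample t statistic for equal group sizes n, using a pooled variance.

    Returns (t, degrees_of_freedom), with t positive when mean_b > mean_a.
    """
    dof = 2 * n - 2
    pooled_var = ((n - 1) * sd_a ** 2 + (n - 1) * sd_b ** 2) / dof
    std_err = math.sqrt(pooled_var * 2.0 / n)
    return (mean_b - mean_a) / std_err, dof


N_SIMULATIONS = 25  # simulations per system, from the caption of Table 1

# Average epochs and SD taken from Table 1 (rounded values).
t_mix_vs_bp, dof = pooled_t(1124, 23, 2209, 83, N_SIMULATIONS)   # 4 experts vs BP 6 Hid
t_8_vs_4, _ = pooled_t(1083, 12, 1124, 23, N_SIMULATIONS)        # 8 experts vs 4 experts
t_bp6_vs_bp12, _ = pooled_t(2209, 83, 2435, 124, N_SIMULATIONS)  # BP 6 Hid vs BP 12 Hid

print(dof, t_mix_vs_bp, t_8_vs_4, t_bp6_vs_bp12)
```

For reference, the one-sided 0.95 critical value of the t distribution with 48 degrees of freedom is roughly 1.68, so each statistic can be compared against that threshold.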
Acknowledgments
Jordan and Jacobs were funded by grants from Siemens and the McDonnell-Pew program in Cognitive Neuroscience. Hinton and Nowlan were funded by grants from the Ontario Information Technology Research Center and the Canadian Natural Science and Engineering Research Council. Hinton is a fellow of the Canadian Institute for Advanced Research.
References
Barto, A. G. 1985. Learning by statistical cooperation of self-interested neuronlike computing elements. Human Neurobiol. 4, 229-256.
Hampshire, J., and Waibel, A. 1989. The meta-pi network: Building distributed knowledge representations for robust pattern recognition. Tech. Rep. CMU-CS-89-166, Carnegie Mellon University, Pittsburgh, PA.
Jacobs, R. A., and Jordan, M. I. 1991. Learning piecewise control strategies in a modular connectionist architecture, in preparation.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. 1991. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cog. Sci., in press.
McLachlan, G. J., and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Nowlan, S. J. 1990a. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 574-582. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1990b. Competing experts: An experimental investigation of associative mixture models. Tech. Rep. CRG-TR-90-5, University of Toronto, Toronto, Canada.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24, 175-184.
Received 27 July 1990; accepted 1 November 1990.
Comparison of Artificial Intelligence Techniques with UKTRISS for Estimating Probability of Survival after Trauma. The Journal of Trauma: Injury, Infection and Critical Care 51:1, 123-133. [CrossRef] 146. Jefferson T. Davis, Athanasios Episcopos, Sannaka Wettimuny. 2001. Predicting direction shifts on Canadian-US exchange rates with artificial neural networks. International Journal of Intelligent Systems in Accounting, Finance & Management 10:2, 83-96. [CrossRef] 147. J Svensson, M von Hellermann, R W T König. 2001. Plasma Physics and Controlled Fusion 43:4, 389-403. [CrossRef]
148. H. Yin, N.M. Allinson. 2001. Self-organizing mixture networks for probability density estimation. IEEE Transactions on Neural Networks 12:2, 405-411. [CrossRef] 149. Hsin-Chia Fu, Yen-Po Lee, Cheng-Chin Chiang, Hsiao-Tien Pao. 2001. Divide-and-conquer learning and modular perceptron networks. IEEE Transactions on Neural Networks 12:2, 250-263. [CrossRef] 150. R. Polikar, L. Upda, S.S. Upda, V. Honavar. 2001. Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems Man and Cybernetics Part C (Applications and Reviews) 31:4, 497. [CrossRef] 151. X. Dai. 2001. CMA-based nonlinear blind equaliser modelled by a two-layer feedforward neural network. IEE Proceedings - Communications 148:4, 243. [CrossRef] 152. Qiang Gan, C.J. Harris. 2001. A hybrid learning scheme combining EM and MASMOD algorithms for fuzzy local linearization modeling. IEEE Transactions on Neural Networks 12:1, 43. [CrossRef] 153. Tao Hong, M.T.C. Fang. 2001. Detection and classification of partial discharge using a feature decomposition-based modular neural network. IEEE Transactions on Instrumentation and Measurement 50:5, 1349. [CrossRef] 154. A.A. Ilumoka. 2001. Efficient and accurate crosstalk prediction via neural net-based topological decomposition of 3-D interconnect. IEEE Transactions on Advanced Packaging 24:3, 268. [CrossRef] 155. C.J. Harris, X. Hong. 2001. Neurofuzzy mixture of experts network parallel learning and model construction algorithms. IEE Proceedings - Control Theory and Applications 148:6, 456. [CrossRef] 156. R. Feraund, O.J. Bernier, J.-E. Viallet, M. Collobert. 2001. A fast and accurate face detector based on neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 23:1, 42. [CrossRef] 157. Naonori Ueda. 2001. Transactions of the Japanese Society for Artificial Intelligence 16, 299-308. [CrossRef] 158. Y. Bengio, V.-P. Lauzon, R. Ducharme. 2001. 
Experiments on the application of IOHMMs to model financial returns series. IEEE Transactions on Neural Networks 12:1, 113. [CrossRef] 159. Dirk Husmeier . 2000. The Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural NetworksThe Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural Networks. Neural Computation 12:11, 2685-2717. [Abstract] [PDF] [PDF Plus] 160. S. Gutta, J.R.J. Huang, P. Jonathon, H. Wechsler. 2000. Mixture of experts for classification of gender, ethnic origin, and pose of human faces. IEEE Transactions on Neural Networks 11:4, 948-960. [CrossRef] 161. Andreas S. Weigend, Shanming Shi. 2000. Predicting daily probability distributions of S&P500 returns. Journal of Forecasting 19:4, 375-392. [CrossRef]
162. Yannick Marchand , Robert I. Damper . 2000. A Multistrategy Approach to Improving Pronunciation by AnalogyA Multistrategy Approach to Improving Pronunciation by Analogy. Computational Linguistics 26:2, 195-219. [Abstract] [PDF] [PDF Plus] 163. Wenxin Jiang . 2000. The VC Dimension for Mixtures of Binary ClassifiersThe VC Dimension for Mixtures of Binary Classifiers. Neural Computation 12:6, 1293-1301. [Abstract] [PDF] [PDF Plus] 164. Ichiro Takeuchi, Takeshi Furuhashi. 2000. A description of dynamic behavior of sensory/motor systems with fuzzy symbolic dynamic systems. Artificial Life and Robotics 4:2, 84-88. [CrossRef] 165. Wenxin Jiang, M.A. Tanner. 2000. On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models. IEEE Transactions on Information Theory 46:3, 1005-1013. [CrossRef] 166. Zoubin Ghahramani , Geoffrey E. Hinton . 2000. Variational Learning for Switching State-Space ModelsVariational Learning for Switching State-Space Models. Neural Computation 12:4, 831-864. [Abstract] [PDF] [PDF Plus] 167. Fu-Lai Chung, Ji-Cheng Duan. 2000. On multistage fuzzy neural network modeling. IEEE Transactions on Fuzzy Systems 8:2, 125-142. [CrossRef] 168. A. Khotanzad, H. Elragal, T.-L. Lu. 2000. Combination of artificial neural-network forecasters for prediction of natural gas consumption. IEEE Transactions on Neural Networks 11:2, 464-473. [CrossRef] 169. Masa-aki Sato , Shin Ishii . 2000. On-line EM Algorithm for the Normalized Gaussian NetworkOn-line EM Algorithm for the Normalized Gaussian Network. Neural Computation 12:2, 407-432. [Abstract] [PDF] [PDF Plus] 170. Shiro Ikeda. 2000. Acceleration of the EM algorithm. Systems and Computers in Japan 31:2, 10-18. [CrossRef] 171. M.A. Carreira-Perpinan. 2000. Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22:11, 1318. [CrossRef] 172. T. Higuchi, Xin Yao, Yong Liu. 2000. 
Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation 4:4, 380. [CrossRef] 173. A. Karniel, G.F. Inbar. 2000. Human motor control: learning to control a time-varying, nonlinear, many-to-one system. IEEE Transactions on Systems Man and Cybernetics Part C (Applications and Reviews) 30:1, 1. [CrossRef] 174. J. L. Castro, M. Delgado, C. J. Mantas. 2000. MORSE: A general model to represent structured knowledge. International Journal of Intelligent Systems 15:1, 27-43. [CrossRef] 175. Azriel Rosenfeld, Harry Wechsler. 2000. Pattern recognition: Historical perspective and future directions. International Journal of Imaging Systems and Technology 11:2, 101-116. [CrossRef]
176. Qun Zhao, Jose C. Principe, Victor L. Brennan, Dongxin Xu, Zheng Wang. 2000. Synthetic aperture radar automatic target recognition with three strategies of learning and representation. Optical Engineering 39:5, 1230. [CrossRef] 177. G. Deng, H. Ye, L.W. Cahill. 2000. Adaptive combination of linear predictors for lossless image compression. IEE Proceedings - Science, Measurement and Technology 147:6, 414. [CrossRef] 178. L.C. Jain, L.I. Kuncheva. 2000. Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation 4:4, 327. [CrossRef] 179. A.K. Jain, P.W. Duin, Jianchang Mao. 2000. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22:1, 4. [CrossRef] 180. Wenxin Jiang , Martin A. Tanner . 1999. On the Approximation Rate of Hierarchical Mixtures-of-Experts for Generalized Linear ModelsOn the Approximation Rate of Hierarchical Mixtures-of-Experts for Generalized Linear Models. Neural Computation 11:5, 1183-1198. [Abstract] [PDF] [PDF Plus] 181. S.A. Rizvi, L.-C. Wang, N.M. Nasrabadi. 1999. Rate-constrained modular predictive residual vector quantization of digital images. IEEE Signal Processing Letters 6:6, 135-137. [CrossRef] 182. Ori Rosen, Martin Tanner. 1999. Mixtures of proportional hazards regression models. Statistics in Medicine 18:9, 1119-1131. [CrossRef] 183. A. Sarajedini, R. Hecht-Nielsen, P.M. Chau. 1999. Conditional probability density function estimation with sigmoidal neural networks. IEEE Transactions on Neural Networks 10:2, 231-238. [CrossRef] 184. Ran Avnimelech , Nathan Intrator . 1999. Boosted Mixture of Experts: An Ensemble Learning SchemeBoosted Mixture of Experts: An Ensemble Learning Scheme. Neural Computation 11:2, 483-497. [Abstract] [PDF] [PDF Plus] 185. Suzanna Becker . 1999. Implicit Learning in 3D Object Recognition: The Importance of Temporal ContextImplicit Learning in 3D Object Recognition: The Importance of Temporal Context. 
Neural Computation 11:2, 347-374. [Abstract] [PDF] [PDF Plus] 186. Ran Avnimelech , Nathan Intrator . 1999. Boosting Regression EstimatorsBoosting Regression Estimators. Neural Computation 11:2, 499-520. [Abstract] [PDF] [PDF Plus] 187. V. Ramamurti, J. Ghosh. 1999. Structurally adaptive modular networks for nonstationary environments. IEEE Transactions on Neural Networks 10:1, 152. [CrossRef] 188. D.J. Miller, Lian Yan. 1999. Critic-driven ensemble classification. IEEE Transactions on Signal Processing 47:10, 2833. [CrossRef] 189. Bao-Liang Lu, H. Kita, Y. Nishikawa. 1999. Inverting feedforward neural networks using linear and nonlinear programming. IEEE Transactions on Neural Networks 10:6, 1271. [CrossRef]
190. Sun-Yuan Kung, J. Taur, Shang-Hung Lin. 1999. Synergistic modeling and applications of hierarchical fuzzy neural networks. Proceedings of the IEEE 87:9, 1550. [CrossRef] 191. B. Apolloni, I. Zoppis. 1999. Sub-symbolically managing pieces of symbolical functions for sorting. IEEE Transactions on Neural Networks 10:5, 1099. [CrossRef] 192. C. Di Natale, E. Proietti, R. Diamanti, A. D'Amico. 1999. Modeling of APCVD-doped silicon dioxide deposition process by a modular neural network. IEEE Transactions on Semiconductor Manufacturing 12:1, 109. [CrossRef] 193. I.A. Taha, J. Ghosh. 1999. Symbolic interpretation of artificial neural networks. IEEE Transactions on Knowledge and Data Engineering 11:3, 448. [CrossRef] 194. Chin-Teng Lin, I-Fang Chung. 1999. A reinforcement neuro-fuzzy combiner for multiobjective control. IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 29:6, 726. [CrossRef] 195. A.N. Srivastava, R. Su, A.S. Weigend. 1999. Data mining for features using scale-sensitive gated experts. IEEE Transactions on Pattern Analysis and Machine Intelligence 21:12, 1268. [CrossRef] 196. Bao-Liang Lu, M. Ito. 1999. Task decomposition and module combination based on class relations: a modular neural network for pattern classification. IEEE Transactions on Neural Networks 10:5, 1244. [CrossRef] 197. A.D. Wilson, A.F. Bobick. 1999. Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21:9, 884. [CrossRef] 198. M A Wani, D T Pham. 1999. Efficient control chart pattern recognition through synergistic and distributed artificial neural networks. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture 213:2, 157-169. [CrossRef] 199. Tom Ziemke . 1998. Adaptive Behavior in Autonomous AgentsAdaptive Behavior in Autonomous Agents. Presence: Teleoperators and Virtual Environments 7:6, 564-587. [Abstract] [PDF] [PDF Plus] 200. 
Rainer Dietrich, Manfred Opper. 1998. Journal of Physics A: Mathematical and General 31:46, 9131-9147. [CrossRef] 201. Stefan Schaal , Christopher G. Atkeson . 1998. Constructive Incremental Learning from Only Local InformationConstructive Incremental Learning from Only Local Information. Neural Computation 10:8, 2047-2084. [Abstract] [PDF] [PDF Plus] 202. Chi-Hang Lam, F. Shin. 1998. Formation and dynamics of modules in a dual-tasking multilayer feed-forward neural network. Physical Review E 58:3, 3673-3677. [CrossRef] 203. James A. Reggia , Sharon Goodall , Yuri Shkuro . 1998. Computational Studies of Lateralization of Phoneme Sequence GenerationComputational Studies
of Lateralization of Phoneme Sequence Generation. Neural Computation 10:5, 1277-1297. [Abstract] [PDF] [PDF Plus] 204. Sun-Yuan Kung, Jenq-Neng Hwang. 1998. Neural networks for intelligent multimedia processing. Proceedings of the IEEE 86:6, 1244-1272. [CrossRef] 205. A.J. Zeevi, R. Meir, V. Maiorov. 1998. Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory 44:3, 1010-1025. [CrossRef] 206. Y. Shimshoni, N. Intrator. 1998. Classification of seismic signals by integrating ensembles of neural networks. IEEE Transactions on Signal Processing 46:5, 1194-1201. [CrossRef] 207. R. Rae, H.J. Ritter. 1998. Recognition of human head orientation based on artificial neural networks. IEEE Transactions on Neural Networks 9:2, 257-265. [CrossRef] 208. M.M. Poulton, R.A. Birken. 1998. Estimating one-dimensional models from frequency-domain electromagnetic data using modular neural networks. IEEE Transactions on Geoscience and Remote Sensing 36:2, 547-555. [CrossRef] 209. David J. Miller , Hasan S. Uyar . 1998. Combined Learning and Use for a Mixture Model Equivalent to the RBF ClassifierCombined Learning and Use for a Mixture Model Equivalent to the RBF Classifier. Neural Computation 10:2, 281-293. [Abstract] [PDF] [PDF Plus] 210. C.L. Fancourt, J.C. Principe. 1998. Competitive principal component analysis for locally stationary time series. IEEE Transactions on Signal Processing 46:11, 3068. [CrossRef] 211. C. Ornes, J. Sklansky. 1998. A visual neural classifier. IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 28:4, 620. [CrossRef] 212. J. Yen, Liang Wang, C.W. Gillespie. 1998. Improving the interpretability of TSK fuzzy models by combining global learning and local learning. IEEE Transactions on Fuzzy Systems 6:4, 530. [CrossRef] 213. K. Rose. 1998. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. 
Proceedings of the IEEE 86:11, 2210. [CrossRef] 214. R. Sun, T. Peterson. 1998. Autonomous learning of sequential tasks: experiments and analyses. IEEE Transactions on Neural Networks 9:6, 1217. [CrossRef] 215. I-Cheng Yeh. 1998. Modeling Concrete Strength with Augment-Neuron Networks. Journal of Materials in Civil Engineering 10:4, 263. [CrossRef] 216. S.K.T. Kriebel, W. Brauer, W. Eifler. 1998. Coastal upwelling prediction with a mixture of neural networks. IEEE Transactions on Geoscience and Remote Sensing 36:5, 1508. [CrossRef] 217. Yoram Singer . 1997. Adaptive Mixtures of Probabilistic TransducersAdaptive Mixtures of Probabilistic Transducers. Neural Computation 9:8, 1711-1733. [Abstract] [PDF] [PDF Plus]
218. Athanasios Kehagias , Vassilios Petridis . 1997. Time-Series Segmentation Using Predictive Modular Neural NetworksTime-Series Segmentation Using Predictive Modular Neural Networks. Neural Computation 9:8, 1691-1709. [Abstract] [PDF] [PDF Plus] 219. Kukjin Kang, Jong-Hoon Oh, Chulan Kwon. 1997. Learning by a population of perceptrons. Physical Review E 55:3, 3257-3261. [CrossRef] 220. Robert A. Jacobs . 1997. Bias/Variance Analyses of Mixtures-of-Experts ArchitecturesBias/Variance Analyses of Mixtures-of-Experts Architectures. Neural Computation 9:2, 369-383. [Abstract] [PDF] [PDF Plus] 221. V. Petridis, A. Kehagias. 1997. Predictive modular fuzzy systems for time-series classification. IEEE Transactions on Fuzzy Systems 5:3, 381. [CrossRef] 222. A.L. McIlraith, H.C. Card. 1997. Birdsong recognition using backpropagation and multivariate statistics. IEEE Transactions on Signal Processing 45:11, 2740. [CrossRef] 223. J.A. Benediktsson, J.R. Sveinsson, O.K. Ersoy, P.H. Swain. 1997. Parallel consensual neural networks. IEEE Transactions on Neural Networks 8:1, 54. [CrossRef] 224. Yu Hen Hu, S. Palreddy, W.J. Tompkins. 1997. A patient-adaptable ECG beat classifier using a mixture of experts approach. IEEE Transactions on Biomedical Engineering 44:9, 891. [CrossRef] 225. Anders Krogh, Peter Sollich. 1997. Statistical mechanics of ensemble learning. Physical Review E 55:1, 811-825. [CrossRef] 226. A.V. Rao, D. Miller, K. Rose, A. Gersho. 1997. Mixture of experts regression modeling by deterministic annealing. IEEE Transactions on Signal Processing 45:11, 2811. [CrossRef] 227. Chuanyi Ji, Sheng Ma. 1997. Combinations of weak classifiers. IEEE Transactions on Neural Networks 8:1, 32. [CrossRef] 228. Sung-Bae Cho. 1997. Neural-network classifiers for recognizing totally unconstrained handwritten numerals. IEEE Transactions on Neural Networks 8:1, 43. [CrossRef] 229. Pierre Baldi, Yves Chauvin. 1996. 
Hybrid Modeling, HMM/NN Architectures, and Protein ApplicationsHybrid Modeling, HMM/NN Architectures, and Protein Applications. Neural Computation 8:7, 1541-1565. [Abstract] [PDF] [PDF Plus] 230. Christopher M. Bishop, Ian T. Nabney. 1996. Modeling Conditional Probability Distributions for Periodic VariablesModeling Conditional Probability Distributions for Periodic Variables. Neural Computation 8:5, 1123-1133. [Abstract] [PDF] [PDF Plus] 231. A. Khotanzad, J.J.-H. Liou. 1996. Recognition and pose estimation of unoccluded three-dimensional objects from a two-dimensional perspective view by banks of neural networks. IEEE Transactions on Neural Networks 7:4, 897-906. [CrossRef]
232. J.del.R. Millan. 1996. Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 26:3, 408-420. [CrossRef] 233. Peter M. Williams. 1996. Using Neural Networks to Model Conditional Multivariate DensitiesUsing Neural Networks to Model Conditional Multivariate Densities. Neural Computation 8:4, 843-854. [Abstract] [PDF] [PDF Plus] 234. D. Sarkar. 1996. Randomness in generalization ability: a source to improve it. IEEE Transactions on Neural Networks 7:3, 676-685. [CrossRef] 235. E. Alpaydin, M.I. Jordan. 1996. Local linear perceptrons for classification. IEEE Transactions on Neural Networks 7:3, 788-794. [CrossRef] 236. Vassilios Petridis , Athanasios Kehagias . 1996. A Recurrent Network Implementation of Time Series ClassificationA Recurrent Network Implementation of Time Series Classification. Neural Computation 8:2, 357-372. [Abstract] [PDF] [PDF Plus] 237. Klaus Pawelzik , Jens Kohlmorgen , Klaus-Robert Müller . 1996. Annealed Competition of Experts for a Segmentation and Classification of Switching DynamicsAnnealed Competition of Experts for a Segmentation and Classification of Switching Dynamics. Neural Computation 8:2, 340-356. [Abstract] [PDF] [PDF Plus] 238. Y. Bengio, P. Frasconi. 1996. Input-output HMMs for sequence processing. IEEE Transactions on Neural Networks 7:5, 1231. [CrossRef] 239. V. Petridis, A. Kehagias. 1996. Modular neural networks for MAP classification of time series and the partition algorithm. IEEE Transactions on Neural Networks 7:1, 73. [CrossRef] 240. Masahiko Shizawa. 1996. Multivalued regularization network-a theory of multilayer networks for learning many-to-h mappings. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 79:9, 98-113. [CrossRef] 241. Robert A. Jacobs . 1995. Methods For Combining Experts' Probability AssessmentsMethods For Combining Experts' Probability Assessments. Neural Computation 7:5, 867-888. 
[Abstract] [PDF] [PDF Plus] 242. Younès Bennani . 1995. A Modular and Hybrid Connectionist System for Speaker IdentificationA Modular and Hybrid Connectionist System for Speaker Identification. Neural Computation 7:4, 791-798. [Abstract] [PDF] [PDF Plus] 243. Peter Dayan, Richard S. Zemel. 1995. Competition and Multiple Cause ModelsCompetition and Multiple Cause Models. Neural Computation 7:3, 565-579. [Abstract] [PDF] [PDF Plus] 244. G. Deco , D. Obradovic . 1995. Decorrelated Hebbian Learning for Clustering and Function ApproximationDecorrelated Hebbian Learning for Clustering and Function Approximation. Neural Computation 7:2, 338-348. [Abstract] [PDF] [PDF Plus]
245. Tor A. Johansen, Bjarne A. Foss. 1995. Semi-empirical modeling of non-linear dynamic systems through identification of operating regimes and local models. Modeling, Identification and Control: A Norwegian Research Bulletin 16:4, 213-232. [CrossRef] 246. Steven J. Nowlan, Terrence J. Sejnowski. 1994. Filter selection model for motion segmentation and velocity integration. Journal of the Optical Society of America A 11:12, 3177. [CrossRef] 247. S Sathiya Keerthi, B Ravindran. 1994. A tutorial survey of reinforcement learning. Sadhana 19:6, 851-889. [CrossRef] 248. R. S. Shadafan , M. Niranjan . 1994. A Dynamic Neural Network Architecture by Sequential Partitioning of the Input SpaceA Dynamic Neural Network Architecture by Sequential Partitioning of the Input Space. Neural Computation 6:6, 1202-1222. [Abstract] [PDF] [PDF Plus] 249. Robert A. Jacobs, Stephen M. Kosslyn. 1994. Encoding Shape and Spatial Relations: The Role of Receptive Field Size in Coordinating Complementary Representations. Cognitive Science 18:3, 361-386. [CrossRef] 250. Michael I. Jordan , Robert A. Jacobs . 1994. Hierarchical Mixtures of Experts and the EM AlgorithmHierarchical Mixtures of Experts and the EM Algorithm. Neural Computation 6:2, 181-214. [Abstract] [PDF] [PDF Plus] 251. Alan L. Yuille , Paul Stolorz , Joachim Utans . 1994. Statistical Physics, Mixtures of Distributions, and the EM AlgorithmStatistical Physics, Mixtures of Distributions, and the EM Algorithm. Neural Computation 6:2, 334-340. [Abstract] [PDF] [PDF Plus] 252. ANNETTE KARMILOFF-SMITH, ANDY CLARK. 1993. What's Special About the Development of the Human Mind/Brain?. Mind & Language 8:4, 569-581. [CrossRef] 253. WILLIAM BECHTEL. 1993. The Path Beyond First-Order Connectionism. Mind & Language 8:4, 531-539. [CrossRef] 254. William Bechtel. 1993. Currents in connectionism. Minds and Machines 3:2, 125-153. [CrossRef] 255. Suzanna Becker , Geoffrey E. Hinton . 1993. 
Learning Mixture Models of Spatial CoherenceLearning Mixture Models of Spatial Coherence. Neural Computation 5:2, 267-277. [Abstract] [PDF] [PDF Plus] 256. Léon Bottou , Vladimir Vapnik . 1992. Local Learning AlgorithmsLocal Learning Algorithms. Neural Computation 4:6, 888-900. [Abstract] [PDF] [PDF Plus] 257. Robert A. Jacobs, , Michael I. Jordan . 1992. Computational Consequences of a Bias toward Short ConnectionsComputational Consequences of a Bias toward Short Connections. Journal of Cognitive Neuroscience 4:4, 323-336. [Abstract] [PDF] [PDF Plus]
258. Robert A. Jacobs, Michael I. Jordan, Andrew G. Barto. 1991. Task Decomposition Through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science 15:2, 219-250. [CrossRef] 259. Terence D. Sanger . 1991. A Tree-Structured Algorithm for Reducing Computation in Networks with Separable Basis Functions. Neural Computation 3:1, 67-78. [Abstract] [PDF] [PDF Plus] 260. Marcus FreanConnectionist Architectures: Optimization . [CrossRef] 261. Yair BartalDivide-and-Conquer Methods . [CrossRef] 262. Paul J. WerbosNeurocontrollers . [CrossRef] 263. Peter DayanReinforcement Learning . [CrossRef] 264. Geoffrey HintonArtificial Intelligence: Neural Networks . [CrossRef] 265. Robi PolikarPattern Recognition . [CrossRef]
Communicated by Dana Ballard
Efficient Training of Artificial Neural Networks for Autonomous Navigation
Dean A. Pomerleau
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
The ALVINN (Autonomous Land Vehicle In a Neural Network) project addresses the problem of training artificial neural networks in real time to perform difficult perception tasks. ALVINN is a backpropagation network designed to drive the CMU Navlab, a modified Chevy van. This paper describes the training techniques that allow ALVINN to learn in under 5 minutes to autonomously control the Navlab by watching the reactions of a human driver. Using these techniques, ALVINN has been trained to drive in a variety of circumstances including single-lane paved and unpaved roads, and multilane lined and unlined roads, at speeds of up to 20 miles per hour.
1 Introduction
Artificial neural networks sometimes require prohibitively long training times and large training data sets to learn interesting tasks. As a result, few attempts have been made to apply artificial neural networks to complex real-world perception problems. In those domains where connectionist techniques have been applied successfully, such as phoneme recognition (Waibel et al. 1988) and character recognition (LeCun et al. 1989; Pawlicki et al. 1988), results have come only after careful preprocessing of the input to segment and label the training exemplars. In short, artificial neural networks have never before been successfully trained using sensor data in real time to perform a real-world perception task. The ALVINN (Autonomous Land Vehicle In a Neural Network) system remedies this shortcoming. ALVINN is a backpropagation network designed to drive the CMU Navlab, a modified Chevy van (see Fig. 1). Using real-time training techniques, the system quickly learns to autonomously control the Navlab by watching a human driver's reactions. ALVINN has been trained to drive in a variety of circumstances including single-lane paved and unpaved roads, and multilane lined and unlined roads, at speeds of up to 20 miles per hour.
Figure 1: The CMU Navlab autonomous navigation testbed.
Neural Computation 3, 88-97 (1991) © 1991 Massachusetts Institute of Technology
2 Network Architecture
ALVINN's current architecture consists of a single hidden layer backpropagation network (see Fig. 2). The input layer of the network consists of a 30 x 32 unit "retina" onto which a video camera image is projected. Each of the 960 units in the input retina is fully connected to the hidden layer of 5 units, which in turn is fully connected to the output layer. The output layer consists of 30 units and is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road. The centermost output unit represents the "travel straight ahead" condition, while units to the left and right of center represent successively sharper left and right turns. To drive the Navlab, a video image from the onboard camera is reduced to a low-resolution 30 x 32 pixel image and projected onto the input layer. After completing a forward pass through the network, a steering command is read off the output layer. The steering direction dictated by the network is taken to be the center of mass of the "hill" of activation surrounding the output unit with the highest activation level. Using the center of mass of activation instead of the most active output unit when determining the direction to steer permits finer steering corrections, thus improving ALVINN's driving accuracy.
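The center-of-mass decoding step described above can be illustrated with a minimal sketch. The function name `decode_steering` and the `half_width` parameter are hypothetical: the text does not say how many units on each side of the peak make up the "hill," so the value 4 is an assumption made only for illustration.

```python
def decode_steering(activations, half_width=4):
    """Return a fractional output-unit index for the steering direction.

    The result is the center of mass of the activation "hill" around the
    most active output unit. half_width (units taken on each side of the
    peak) is an assumed value, not one given in the text.
    """
    # Index of the most active output unit (first one wins on ties).
    peak = max(range(len(activations)), key=lambda i: activations[i])

    # Window around the peak, clipped to the ends of the output layer.
    lo = max(0, peak - half_width)
    hi = min(len(activations), peak + half_width + 1)
    window = activations[lo:hi]

    total = sum(window)
    if total <= 0.0:
        # Degenerate case: no positive activation, fall back to the peak.
        return float(peak)

    # Activation-weighted mean of the unit indices inside the window.
    return sum(i * a for i, a in zip(range(lo, hi), window)) / total


# Example: a symmetric hill straddling units 10 and 11 of a 30-unit layer.
outputs = [0.0] * 30
outputs[9], outputs[10], outputs[11], outputs[12] = 0.5, 0.9, 0.9, 0.5
direction = decode_steering(outputs)  # falls between units 10 and 11
```

Because the result is a fractional index rather than a whole unit number, the decoded direction can fall between two output units, which is what allows the finer steering corrections mentioned in the text.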
Figure 2: ALVINN architecture. [Diagram: 30x32 video input retina, hidden units, and output units ranging from Sharp Left through Straight Ahead to Sharp Right.]
3 Training
To train ALVINN, the network is presented with road images as input and the corresponding correct steering direction as the desired output. The weights in the network are altered using the backpropagation algorithm so that the network's output more closely corresponds to the correct steering direction. The only modifications to the standard backpropagation algorithm used in this work are a weight change momentum factor that is steadily increased during training, and a learning rate constant for each weight that is scaled by the fan-in of the unit to which the weight projects. ALVINN's ability to learn quickly results from the output representation and the exemplar presentation scheme. Instead of training the network to activate only a single output unit, ALVINN is trained to produce a gaussian distribution of activation centered around the steering direction that will keep the vehicle centered on the road. As in the decoding stage, this steering direction may fall between the directions represented by two output units. The following approximation to a gaussian equation is used to precisely interpolate the correct output activation levels:

$$x_i = e^{-d_i^2/10}$$

where $x_i$ represents the desired activation level for unit $i$ and $d_i$ is the $i$th unit's distance from the correct steering direction point along the output vector. The constant 10 in the above equation is an empirically determined scale factor that controls the number of output units the gaussian encompasses. As an example, consider the situation in which the correct steering direction falls halfway between the steering directions represented by output units $j$ and $j + 1$. Using the above equation, the desired output activation levels for the units successively farther to the left and the right of the correct steering direction will fall off rapidly with the values 0.98, 0.80, 0.54, 0.29, 0.13, 0.04, 0.01, etc. This gaussian desired output vector can be thought of as representing the probability density function for the correct steering direction, in which a unit's probability of being correct decreases with distance from the gaussian's center. By requiring the network to produce a probability distribution as output, instead of a "one of N" classification, the learning task is made easier since slightly different road images require the network to respond with only slightly different output vectors. This is in contrast to the highly nonlinear output requirement of the "one of N" representation in which the network must significantly alter its output vector (from having one unit on and the rest off to having a different unit on and the rest off) on the basis of fine distinctions between slightly shifted road scenes.

3.1 Original Training Scheme. The source of training data has evolved substantially over the course of the project. Training was originally performed using simulated road images designed to portray roads under a wide variety of weather and lighting conditions.
The network was repeatedly presented with 1200 synthetic road scenes and the corresponding correct output vectors, while the weights between units in the network were adjusted with the backpropagation algorithm (Pomerleau et 01. 1988). The network required between 30 and 40 presentations of these 1200 synthetic road images in order to develop a representation capable of accurately driving over the single-lane Navlab test road. Once trained, the network was able to drive the Navlab at up to 1.8 m/sec (3.5 mph) along a 400-m path through a wooded area of the CMU campus under a variety of weather conditions including snowy, rainy, sunny, and cloudy situations. Despite its apparent success, this training paradigm had serious drawbacks. From a purely logistical standpoint, generating the synthetic road
scenes was quite time consuming, requiring approximately 6 hr of Sun-4 CPU time. Once the road scenes were generated, training the network required an additional 45 min of computation time using the Warp systolic array supercomputer onboard the Navlab. In addition, differences between the synthetic road images on which the network was trained and the real images on which the network was tested often resulted in poor performance in actual driving situations. For example, when the network was trained on synthetic road images that were less curved than the test road, the network would become confused when presented with a sharp curve during testing. Finally, while effective at training the network to drive under the limited conditions of a single-lane road, it became apparent that extending the synthetic training paradigm to deal with more complex driving situations like multilane and off-road driving, would require prohibitively complex artificial road generators.
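The gaussian target encoding described before Section 3.1 is used by both training schemes, so a small sketch may help fix ideas. It assumes the desired activation has the form exp(-d^2 / 10), which is consistent with the scale factor of 10 and with the first few activation values quoted in the text. The function name, the 30-unit output size in the usage line, and the fractional-position argument are illustrative assumptions, not details taken from the text.

```python
import math


def target_vector(correct_position, n_units, scale=10.0):
    """Return gaussian desired activations for a row of output units.

    correct_position: location of the correct steering direction, measured
        in output-unit indices (may be fractional). Hypothetical parameter.
    n_units: number of output units (an assumption of this sketch).
    scale: the empirically chosen constant; 10 in the text.
    """
    return [
        math.exp(-((i - correct_position) ** 2) / scale)
        for i in range(n_units)
    ]


# Correct direction halfway between units 14 and 15 of a 30-unit layer.
activations = target_vector(14.5, 30)
print([round(a, 2) for a in activations[15:20]])
```

Because neighbouring steering directions produce almost identical target vectors, a small shift in the road image asks for only a small change in the output, which is the property the text contrasts with a "one of N" encoding.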
3.2 Training "On-the-fly". To deal with these problems, I have developed a scheme, called training "on-the-fly," that involves teaching the network to imitate a human driver under actual driving conditions. As a person drives the Navlab, backpropagation is used to train the network with the current video camera image as input and the direction in which the person is currently steering as the desired output. There are two potential problems associated with this scheme. First, since the human driver steers the vehicle down the center of the road during training, the network will never be presented with situations where it must recover from misalignment errors. When driving for itself, the network may occasionally stray from the road center, so it must be prepared to recover by steering the vehicle back to the center of the road. The second problem is that naively training the network with only the current video image and steering direction runs the risk of overlearning from repetitive inputs. If the human driver takes the Navlab down a straight stretch of road during part of a training run, the network will be presented with a long sequence of similar images. This sustained lack of diversity in the training set will cause the network to "forget" what it had learned about driving on curved roads and instead learn to always steer straight ahead. Both problems associated with training on-the-fly stem from the fact that backpropagation requires training data that are representative of the full task to be learned. To provide the necessary variety of exemplars while still training on real data, the simple training on-the-fly scheme described above must be modified. Instead of presenting the network with only the current video image and steering direction, each original image is laterally shifted in software to create 14 additional images in which the vehicle appears to be shifted by various amounts relative to the road center (see Fig. 3). 
The shifting scheme maintains the correct perspective by shifting nearby pixels at the bottom of the image more than far away pixels at the top of the image as illustrated in Figure 3.
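The perspective-preserving shift can be illustrated with a minimal sketch. It assumes that the shift grows linearly from zero at the top row to a maximum at the bottom row, and that vacated pixels are filled with the nearest edge value. Neither choice is specified in the text, so both are placeholders, as are the function and parameter names.

```python
def shift_image(image, bottom_shift):
    """Laterally shift a 2-D image (list of rows), bottom rows the most.

    image: list of equal-length lists of pixel values, top row first.
    bottom_shift: shift of the bottom row in pixels; positive shifts the
        scene to the right, negative to the left (assumed convention).
    """
    rows = len(image)
    shifted = []
    for r, row in enumerate(image):
        # Linear interpolation (assumed): 0 at the top, full shift at the bottom.
        frac = r / (rows - 1) if rows > 1 else 1.0
        s = round(bottom_shift * frac)
        s = max(-len(row), min(len(row), s))  # never shift past the row width
        if s > 0:
            new_row = [row[0]] * s + row[:-s]      # pad on the left edge
        elif s < 0:
            new_row = row[-s:] + [row[-1]] * (-s)  # pad on the right edge
        else:
            new_row = list(row)
        shifted.append(new_row)
    return shifted


tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
for row in shift_image(tiny, 2):
    print(row)
```

In a full system the desired steering direction would also have to be adjusted for each shifted image; that correction is not sketched here because the text does not give its formula.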
[Figure 3 panels: Original Image; Shifted Images]
Figure 3: The single original video image is laterally shifted to create multiple training exemplars in which the vehicle appears to be at different locations relative to the road.

The correct steering direction as dictated by the driver for the original image is altered for each of the shifted images to account for the extra lateral vehicle displacement in each. The use of shifted training exemplars eliminates the problem of the network never learning about situations from which recovery is required. Also, overtraining on repetitive images is less of a problem, since the shifted training exemplars add variety to the training set. However, as additional insurance against the effects of repetitive exemplars, the training set diversity is further increased by maintaining a buffer of recently encountered road scenes.

In practice, training on-the-fly works as follows. A video image is digitized and reduced to the low resolution image required by the network. This single original image is shifted 7 times to the left and 7 times to the right in 0.25-m increments to create, together with the unshifted original, 15 new training exemplars. Fifteen old patterns from the current training set of 200 road scenes are chosen and replaced by the 15 new exemplars. The 15 patterns to be replaced in the training set are chosen in the following manner. The 10 tokens in the training set with the lowest error are replaced in order to prevent the network from overlearning frequently encountered situations such as straight stretches of road. The other 5 exemplars to be replaced are chosen randomly from the training set. This random replacement is done to prevent the training set from becoming filled with erroneous road patterns that the network is unable to correctly learn. These erroneous exemplars result from occasional momentary incorrect steering directions by the human driver. After this replacement process, one forward and one backward sweep of the backpropagation algorithm is performed on these 200 exemplars
to incrementally update the network's weights, and then the process is repeated. The network requires approximately 50 iterations through this digitize-replace-train cycle to learn to drive on the roads that have been tested. Running on a Sun-4, this takes approximately 5 min during which a person drives at about 4 miles per hour over the test road. After this training phase, not only can the network imitate the person's driving along the same stretch of road, it can also generalize to drive along parts of the road it has never encountered, under a wide variety of weather conditions. In addition, since determining the steering direction from the input image merely involves a forward sweep through the network, the system is able to process 25 images per second, allowing it to drive at up to the Navlab's maximum speed of 20 miles per hour.¹ This is over twice as fast as any other sensor-based autonomous system has driven the Navlab (Crisman and Thorpe 1990; Kluge and Thorpe 1990).
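The replacement policy in the digitize-replace-train cycle (10 lowest-error patterns plus 5 random ones) can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, and it assumes the 5 random slots are drawn from the patterns not already selected by the lowest-error rule, which the text does not state explicitly.

```python
import random


def choose_replacements(errors, n_low=10, n_random=5, rng=random):
    """Pick the training-set slots to overwrite with new exemplars.

    errors: per-pattern error of the current training set (200 in the text).
    Returns n_low indices with the lowest error followed by n_random
    indices drawn at random from the remaining slots (assumed disjoint).
    """
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    lowest = order[:n_low]
    remaining = order[n_low:]
    return lowest + rng.sample(remaining, n_random)


def update_training_set(patterns, errors, new_patterns, rng=random):
    """Overwrite the chosen slots of `patterns` in place; return the slots."""
    slots = choose_replacements(errors, rng=rng)
    for slot, pattern in zip(slots, new_patterns):
        patterns[slot] = pattern
    return slots
```

Replacing the best-learned patterns keeps the buffer from filling with easy, repetitive scenes, while the random replacements give mislabeled exemplars a chance to be flushed even though the network never learns them well.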
4 Discussion
The training on-the-fly scheme gives ALVINN a flexibility that is novel among autonomous navigation systems. It has allowed me to successfully train individual networks to drive in a variety of situations, including a single lane dirt access road, a single-lane paved bicycle path, a two-lane suburban neighborhood street, and a lined two-lane highway (see Fig. 4). ALVINN networks have driven in each of these situations for up to 1/2 mile, until reaching the end of the road or a difficult intersection. The development of a system for each of these domains using
Figure 4: Video images taken on three of the test roads ALVINN has been trained to drive on. They are, from left to right, a single-lane dirt access road, a single-lane paved bicycle path, and a lined two-lane highway.

¹The Navlab has a hydraulic drive system that allows for very precise speed control, but that prevents the vehicle from driving over 20 miles per hour.
the "traditional approach" to autonomous navigation would require the programmer to (1) determine what features are important for the particular task, (2) program detectors (using statistical or symbolic techniques) for finding these important features, and (3) develop an algorithm for determining which direction to steer from the location of the detected features. An illustrative example of the traditional approach to autonomous navigation is the work of Dickmanns (Dickmanns and Zapp 1987) on high-speed highway driving. Using specially designed hardware and software to track programmer-chosen features such as the lines painted on the road, Dickmanns' system is capable of driving at up to 60 miles per hour on the German autobahn. However, to achieve these results in a hand-coded system, Dickmanns has had to sacrifice much in the way of generality. Dickmanns emphasizes accurate vehicle control in the limited domain of highway driving, which, in his words, "put relatively low requirements on image processing." In contrast, ALVINN is able to learn for each new domain what image features are important, how to detect them, and how to use their position to steer the vehicle. Analysis of the hidden unit representations developed in different driving situations shows that the network forms detectors for the image features that correlate with the correct steering direction. When trained on multilane roads, the network develops hidden unit feature detectors for the lines painted on the road, while in single-lane driving situations, the detectors developed are sensitive to road edges and road-shaped regions of similar intensity in the image. Figure 5 illustrates the evolution of the weights projecting to the 5 hidden units in the network from the input retina during training on a lined two-lane highway. For a more detailed analysis of ALVINN's internal representations see Pomerleau (1989, 1990).
As a result of this flexibility, ALVINN has been able to drive in a wider variety of situations than any other autonomous navigation system. ALVINN has not yet achieved the impressive speed of Dickmanns’ system on highway driving, but the primary barrier preventing faster driving is the Navlab’s physical speed limitation. In fact, at 25 frames per second, ALVINN cycles twice as fast as Dickmanns’ system. A new vehicle that will allow ALVINN to drive significantly faster is currently being built at CMU. Other improvements I am developing include connectionist and nonconnectionist techniques for combining networks trained for different driving situations into a single system. In addition, I am integrating symbolic knowledge sources capable of planning a route and maintaining the vehicle’s position on a map. These modules will allow ALVINN to make high-level, goal-oriented decisions such as which way to turn at intersections and when to stop at a predetermined destination.
[Figure 5 panel labels: Hidden Unit 1 through Hidden Unit 5; Epoch 10, Epoch 20, Epoch 50]
Figure 5: The weights projecting from the input retina to the 5 hidden units in an ALVINN network at four points during training on a lined two-lane highway. Black squares represent inhibitory weights and white squares represent excitatory weights. The diagonal black and white bands on weights represent detectors for the yellow line down the center and the white line down the right edge of the road.
Acknowledgments

This work would not have been possible without the input and support provided by Dave Touretzky, John Hampshire, and especially Charles Thorpe, Omead Amidi, Jay Gowdy, Jill Crisman, James Frazier, and the rest of the CMU ALV group. This research was supported by the Office of Naval Research under Contracts N00014-87-K-0385, N00014-87-K-0533, and N00014-86-K-0678, by National Science Foundation Grant EET-8716324, by the Defense Advanced Research Projects Agency (DOD) monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251, and by the Strategic Computing Initiative of DARPA, through contracts DACA76-85-C-0019, DACA76-85-C-0003, and DACA76-85-C-0002, which are monitored by the U.S. Army Engineer Topographic Laboratories.
References

Crisman, J. D., and Thorpe, C. E. 1990. Color vision for road following. In Vision and Navigation: The CMU Navlab, Charles Thorpe, ed., pp. 9-23. Kluwer Academic Publishers, Boston, MA.
Dickmanns, E. D., and Zapp, A. 1987. Autonomous high speed road vehicle guidance by computer vision. In Proc. 10th World Congress Automatic Control, Vol. 4, Munich, West Germany.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1(4), 541-551.
Kluge, K., and Thorpe, C. E. 1990. Explicit models for robot road following. In Vision and Navigation: The CMU Navlab, Charles Thorpe, ed., pp. 25-38. Kluwer Academic Publishers, Boston, MA.
Pawlicki, T. F., Lee, D. S., Hull, J. J., and Srihari, S. N. 1988. Neural network models and their application to handwritten digit recognition. Proc. IEEE Int. Conf. Neural Networks, San Diego, CA.
Pomerleau, D. A. 1989. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 305-313. Morgan Kaufmann, San Mateo, CA.
Pomerleau, D. A. 1990. Neural network based autonomous navigation. In Vision and Navigation: The CMU Navlab, Charles Thorpe, ed., pp. 83-92. Kluwer Academic Publishers, Boston, MA.
Pomerleau, D. A., Gusciora, G. L., Touretzky, D. S., and Kung, H. T. 1988. Neural network simulation at Warp speed: How we got 17 million connections per second. Proc. IEEE Int. Joint Conf. Neural Networks, San Diego, CA.
Thorpe, C., Herbert, M., Kanade, T., Shafer, S., and the members of the Strategic Computing Vision Lab. 1987. Vision and navigation for the Carnegie Mellon Navlab. In Annual Review of Computer Science, Vol. 11, Joseph Traub, ed., pp. 521-556. Annual Reviews, Palo Alto, CA.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1988. Phoneme recognition: Neural networks vs. hidden Markov models. Proc. Int. Conf. Acoustics, Speech and Signal Process., New York, NY.
Received 23 April 1990; accepted 8 October 1990.
Communicated by Gary Dell
Sequence Manipulation Using Parallel Mapping Networks

David S. Touretzky
Deirdre W. Wheeler
School of Computer Science, Carnegie Mellon, Pittsburgh, PA 15213-3890 USA

We describe a parallel mapping matrix that performs several types of sequence manipulations that are the building blocks of well-known phonological processes. Our results indicate that human phonological behavior can be modeled by a highly constrained connectionist architecture, one that uses purely feedforward circuitry and imposes tight limits on depth of derivations.

1 Introduction
We have constructed a connectionist model based on a parallel mapping scheme to perform arbitrary combinations of insertions, deletions, and mutations on input sequences. These primitive operations describe the effects of human phonological processes, which derive surface phonetic forms from underlying phonemic ones. In the architecture discussed here, only a small amount of computation is required to perform derivations. This is important for constructing cognitively plausible theories of grammar. Derivations in classical phonological theories sometimes require long sequences of ordered rule applications, which appears to be incompatible with known constraints on neural implementation.

The mapping matrix is the central component of M3P, our "Many Maps Model of Phonology" currently under development (Touretzky and Wheeler 1990a,b; Wheeler and Touretzky in press). In addition to performing sequence manipulation, the matrix can also compute efficient projections of sequences, such as extracting all the vowels in an utterance. Projections are useful for implementing processes such as umlaut and vowel harmony, which operate on a series of vowels, ignoring intervening consonants.¹ Another component of the M3P architecture recognizes clusters of adjacent segments that meet some feature specification; the cluster may then be operated on as a unit. Like the projection operation, this has special significance for phonology. It is also possible to apply clustering to the output of a projection.

¹In autosegmental phonology terms, such processes are said to look only at selected "tiers," with vowels and consonants occupying separate tiers.

Neural Computation 3, 98-109 (1991)
© 1991 Massachusetts Institute of Technology
M3P is labeled a connectionist model because its primitive operations are constrained to be efficiently implementable by feedforward circuits constructed from simple, neuron-like elements. This constraint strongly influences the phonological theory. The long-term goal of our work is to devise an architecture that is not only a computationally plausible model of human phonological behavior, but also a source of novel predictions and insights into its nature.

2 Sequence Manipulation Via a Change Buffer
A phoneme sequence may be viewed as a string of segments, each of which is a vector of binary features. Consider the word "cats," whose underlying phonemic form² is /kæt + z/. The phoneme /z/ is described by the feature vector [-syllabic, +consonantal, -sonorant, +anterior, +coronal, +continuant, +strident, +voice]. In the surface form, [kæts], the final segment has been changed to have the feature [-voice], giving [s]. Two other types of changes that may occur when deriving surface from underlying forms are insertion and deletion of segments.

M3P uses a "change buffer" to explicitly represent the changes a sequence is to undergo. An input sequence plus change buffer are input to the parallel mapping matrix, which derives the output sequence by applying the requested changes. Figure 1 shows examples of a mutation, an insertion, and a deletion accomplished via this change buffer mechanism. (The matrix itself is omitted from the figure to save space; it will be described in more detail in the next section.) Although each of these examples involves only a single change in the change buffer, the matrix ensures that any number of insertions, deletions, and mutations can be processed simultaneously.

Rules in M3P are implemented by binary threshold units that look at some region of the input buffer, and if they fire, write some pattern of changes into the change buffer. Rule units are replicated at each input buffer position so that they may apply to the entire utterance in parallel.

The change buffer approach is fundamentally different from other connectionist models that manipulate phoneme sequences. For example, the Rumelhart and McClelland (1986) verb-learning model employed a single layer of trainable weights to directly map words, represented as sets of Wickelfeatures, from input to output form.
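The effect of a change buffer can be illustrated with a short sequential sketch. It is only a functional stand-in: the matrix computes the same result in parallel, and segments here are plain symbols rather than feature vectors. The `MUTATE` lookup table, the `Change` record, and the convention that an inserted segment lands to the right of its slot are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy lookup standing in for feature-vector arithmetic (an assumption):
# (segment, feature change) -> resulting segment.
MUTATE = {("z", "-voice"): "s", ("a", "-back"): "æ"}


@dataclass
class Change:
    """One change-buffer slot, aligned with one input segment."""
    mutation: Optional[str] = None  # feature change, e.g. "-voice"
    delete: bool = False            # deletion bit
    insert: Optional[str] = None    # segment inserted to the right (assumed side)


def apply_changes(segments: List[str], changes: List[Change]) -> List[str]:
    """Apply every requested mutation, deletion, and insertion in one pass."""
    output = []
    for segment, change in zip(segments, changes):
        if not change.delete:
            if change.mutation:
                segment = MUTATE[(segment, change.mutation)]
            output.append(segment)
        if change.insert:
            output.append(change.insert)
    return output


# "cats": the final /z/ receives the change [-voice].
cats = apply_changes(
    list("kætz"),
    [Change(), Change(), Change(), Change(mutation="-voice")],
)
print("".join(cats))
```

Because each slot is handled independently of the others, several changes of different kinds can be requested for the same utterance without any ordering between them.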
MacWhinney's model of Hungarian verb inflection (personal communication) uses a syllabically structured sequence representation and a layer of hidden units, but it too maps inputs directly to outputs. Both these models can learn much more complex transformations than can be expressed in M3P's change buffer formalism. This fact is crucial to the Rumelhart and McClelland model, since it combines morphological knowledge with phonological processing to derive exceptional past tense forms such as "go"/"went" using the same mechanism as for regular forms like "kiss"/"kissed." But the direct mapping approach has several disadvantages, as noted by Pinker and Prince (1988). Direct mapping nets behave only approximately correctly, and sometimes make highly unnatural errors. Largely unstructured, they are also underconstrained, allowing them to easily learn functions outside the range of human behavior.

²The tradition in linguistics is that each morpheme be represented by a single underlying form. For reasons having to do with the overall simplicity of the analysis, the underlying representation of the English plural morpheme is assumed to be /z/.

Figure 1: Examples of (a) a mutation in the derivation of "cats," (b) an insertion in the derivation of "buses," (c) a deletion in the derivation of "iced tea."

3 Operation of the Mapping Matrix
The mapping matrix is designed to produce an appropriately changed string, right justified in the output buffer, containing neither gaps nor collisions. Without this device, gaps in the string could arise due to deletions requested by the change buffer, and collisions could occur as a result of insertions. Figure 2 shows how the matrix handles the derivation of the word "parses" (underlyingly /pars + z/), in some New England dialects. Three processes apply in this example: fronting of the /a/ vowel before /r/, r-deletion, and insertion of an /ɪ/ as part of the regular process of English plural formation, producing the surface form [pæsɪz].

Figure 2: Derivation of the word "parses": (a) copying input segments into the matrix; (b) state of the matrix and output buffer after settling.

Each square of the mapping matrix is a register that can hold one segment (phoneme). The first step in the operation of the matrix is to
Connection type                                       Weight
Small excitatory weights
  Input buffer units to map units                     +1
  Change buffer insertion units to map units          +1
  Map units to output buffer units                    +1
Moderate excitatory/inhibitory weights
  Change buffer mutation units to map units           ±5
Large inhibitory weights
  Change buffer deletion units to map units           -50
  Map units to other map units                        -50

Table 1: Connection Types and Weight Values.
copy each segment of the input buffer down to all the squares in its column, as shown in Figure 2a. If an input segment is to undergo a mutation then it is the mutated segment that is copied down to the matrix. For example, the change buffer indicates that the /a/ is to undergo the change [-back]; it therefore shows up as /æ/ in the matrix. If an input segment is to be deleted, the deletion bit being turned on in the change buffer disables all the squares in that column of the matrix. Hence the /r/ does not appear at all in the matrix. Finally, if an insertion is indicated in the change buffer, the inserted segment is copied down to all the squares in its column. Input segments are assigned to every other column of the mapping matrix to leave room for possible insertions. Note that a segment to be inserted cannot simultaneously undergo a change, because it is not present in the input buffer. However, it is possible for an input segment to undergo multiple changes. For example, a vowel could be simultaneously rounded and nasalized by independent rules; the changes would combine in the mutation slot of the change buffer.

The next step in the operation of the matrix is to read off the output string. There are several ways this might be done, with differing cost/performance characteristics. In the simplest approach, which uses $O(n^2)$ units each with $O(n)$ connections, every active square in the matrix inhibits all the squares to its left in the same row, and all the squares below it in the same column. After the matrix settles, there will be at most one square active in any row or column, as shown in Figure 2b. One can then read out the string by OR'ing together all the squares in each row. A practical advantage of this scheme is that it uses uniform weights. Essentially there are only three weight values and six connection types,
as shown in Table 1. (We assume all thresholds are zero.) The only problem is that the settling time of the matrix is linear in the length of the string. Consider the /z/ column of Figure 2a; let $z_i$ refer to the cell in row $i$ of that column, with row 1 being at the top. Initially both $z_1$ and $z_2$ are active, so $z_2$ inhibits all the squares to its left, including $ɪ_2$. But $z_1$ inhibits $z_2$, so when $z_2$ turns off, $ɪ_2$ can turn back on again, which causes $ɪ_3$ to turn off, and so on. The rightmost squares stabilize first; the leftmost ones may flip on and off several times before settling into their final states.

The matrix can be made to settle in constant time by using slightly more complex circuitry. In this scheme, each square tallies the number of active squares to its right in the topmost row. If a square is in row $i$ and there are exactly $i - 1$ active squares to the right of it in the top row, then that square will remain active; otherwise it will shut itself off. After a single update cycle, all squares will have reached their final states. This scheme still requires just $O(n^2)$ units, each with $O(n)$ connections, but the tally units in different rows require different thresholds (bias connections), so the wiring pattern is less uniform.

A reduction in circuit complexity is possible if we assume a constant bound on change in length of any contiguous region in the string. This is a reasonable assumption for human languages. Let $\mathrm{Ins}_i$ and $\mathrm{Del}_i$ be the total number of insertions and deletions, respectively, that apply to columns $i$ through $n$. Suppose that for every $i$, $|\mathrm{Ins}_i - \mathrm{Del}_i| \le k$ for some value $k$. In this case, the upper triangular matrix can be replaced by a band of width $2k + 1$, which requires only $O(kn)$ units and settles in constant time with uniform weights.

4 Projections

The projection of a sequence is the subsequence consisting of only those segments satisfying a specified predicate.
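This definition, combined with the constant-time readout rule of Section 3 (a square in row i stays active only if exactly i - 1 squares to its right in the top row are active), can be sketched in a few lines. The sketch simulates the single update cycle with ordinary loops and a full square matrix rather than the upper triangular one. The function names and the returned list of source columns are illustrative assumptions.

```python
def settle(active):
    """One update cycle of the tally rule.

    active[c] is True when column c holds a segment (top-row activity).
    Returns a matrix `on` where on[row][c] is True for the surviving
    squares; index 0 is the top row, which feeds the rightmost output slot.
    """
    n = len(active)
    on = [[False] * n for _ in range(n)]
    for c in range(n):
        tally = sum(active[c + 1:])  # active top-row squares to the right
        for row in range(n):
            # Row number is row + 1, so it survives when tally == row.
            on[row][c] = active[c] and tally == row
    return on


def project(segments, predicate):
    """Return (projection, source columns) for segments satisfying predicate."""
    active = [bool(predicate(s)) for s in segments]
    on = settle(active)
    columns = []
    for row in reversed(on):  # bottom rows hold the leftmost output slots
        hits = [c for c, is_on in enumerate(row) if is_on]
        if hits:              # at most one square survives per row
            columns.append(hits[0])
    return [segments[c] for c in columns], columns


vowels, columns = project(list("utterance"), lambda s: s in "aeiou")
print(vowels, columns)  # ['u', 'e', 'a', 'e'] [0, 3, 5, 8]
```

Keeping the source column of every projected segment is what allows a change computed on the projection to be routed back to the corresponding position in the original string.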
The most common example of projection in phonology is the vowel projection. Some phonological processes operate on sequences of vowels, ignoring any intervening consonants. If rules are implemented by single binary-threshold units, as is the case in M3P, those rules that are required to skip over variable numbers of consonants cannot look directly at the input buffer; they must look at its vowel projection. A mapping matrix without a change buffer can be used to take projections, as shown in Figure 3. Columns that do not contain segments of the appropriate type (e.g., vowels) are shut off; the remaining segments are then collected together to form a contiguous sequence in the output buffer.

In order for rules to operate on the segments of a projection, they need to be able to backproject to the original string. For example, consider the umlaut rule in Modern Icelandic that changes /a/ to /ö/ if the following vowel is /u/.³ This rule has to look at the vowel projection to determine adjacency of vowels separated by an arbitrary number of consonants. As shown in Figure 3, the vowel projection of the input string /meðalum/ "medicine" (dative plural) is /eau/. The derived surface form will be [meðölum]. The rule applies to the vowel /a/ because the vowel to its right is /u/. Therefore the rule should write [+round, -back] into the mutation slot of the /a/ in the change buffer. The problem is how to find this slot. The /a/ is the second nonnull segment in the vowel projection buffer, but it is the fourth segment in the input buffer. By backprojecting its changes through the map, the rule can deposit them in the appropriate change buffer slot. Backprojection uses the active cells of the settled mapping matrix to translate changes relative to projection buffer positions into changes relative to input buffer positions, so that the changes can be deposited in the correct change buffer slot. The backprojection is shown as a dotted line in the figure; the change buffer itself is omitted to save space. Only mutations are backprojected; the operation is underconstrained for insertions, and to the best of our knowledge unnecessary for deletions.

Figure 3: Taking the vowel projection of a string, and backprojecting to apply the umlaut rule's change to the proper segment.

³Such rules exist in many of the world's languages, but not English. See Anderson (1974) for one analysis of the Modern Icelandic data. Our own analysis differs somewhat.

5 Clustering

Phonological phenomena commonly involve iterative processing. Consider voicing assimilation in Russian, where the voicing quality of the rightmost member of a consonant cluster spreads to the entire cluster. One way to describe this process is to have a rule that proceeds right-to-left through an utterance, voicing (or devoicing) consonants if their right neighbor is a consonant and is voiced (or voiceless, respectively):
    /mcensk bi/
        ↓ voicing assimilation
     mcensg bi
        ↓ voicing assimilation
     mcenzg bi
        ↓ voicing assimilation (null effect)
    [mcenzg bi]    "if Mcensk"
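The iterative formulation just shown can be written out as a short right-to-left loop. This is a minimal sketch of the classical rule as worded above, not of the M3P clustering circuitry. The voicing-pair table, the vowel and sonorant sets, and the treatment of segments without a counterpart (left unchanged, giving the "null effect") are simplifying assumptions.

```python
# Hypothetical voiced -> voiceless obstruent pairs (not from the text).
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}
VOICE = {voiceless: voiced for voiced, voiceless in DEVOICE.items()}
VOWELS = set("aeiou")
SONORANTS = set("mnlr")


def is_consonant(segment):
    return segment not in VOWELS


def is_voiced(segment):
    return segment in DEVOICE or segment in SONORANTS


def assimilate_voicing(segments):
    """Scan right-to-left, copying each consonant's voicing from its right neighbor."""
    out = list(segments)
    for i in range(len(out) - 2, -1, -1):
        segment, right = out[i], out[i + 1]
        if not (is_consonant(segment) and is_consonant(right)):
            continue
        if is_voiced(right):
            out[i] = VOICE.get(segment, segment)    # no counterpart: null effect
        else:
            out[i] = DEVOICE.get(segment, segment)  # no counterpart: null effect
    return out


# The word boundary is omitted, so /mcensk bi/ is entered as one string.
print("".join(assimilate_voicing("mcenskbi")))  # mcenzgbi
```

Each step depends on the result of the step to its right, which is exactly the sequential dependency that the spreading and clustering treatments below are meant to avoid.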
More modern theories, such as autosegmental phonology (see Goldsmith 1990 for a nice review), treat iteration as a spreading phenomenon: the rule applies only once, but that application may spread the feature [+voice] to any number of adjacent segments. Iterative processes in M3P are modeled using an approach that is similar in spirit. There is special clustering circuitry for recognizing sequences of adjacent segments that should be treated as a unit. The Russian voicing assimilation rule is represented this way:

    Cluster type: [-syllabic]
    Direction: right-to-left
    Trigger: [+consonant, -sonorant, αvoice]
    Element: [+consonant]
    Change: [αvoice]

The trigger of a cluster is a consonant; the elements are the preceding consonants. The result of the rule is that the elements of the cluster all take on the trigger's value for [voice], so that they agree in voicing with the trigger. Figure 4 shows how the Trigger and Element bits are set for the utterance /mcensk bi/ "if Mcensk." When a cluster rule writes its changes into the change buffer, the change is recorded only for those segments whose Element bit is set.

Another process commonly described using iterative rule application is vowel harmony, whereby properties of a trigger vowel such as height or roundness spread to one or more succeeding vowels, ignoring any intervening consonants. Vowel harmony is implemented in M3P by applying clustering to the vowel projection. See Touretzky and Wheeler (1990a) for an example.

6 M3P: The Big Picture
The M3P model is organized as a collection of maps. Based on proposals by Goldsmith (1990) and Lakoff (1989), the model utilizes three levels of representation, called M (morphophonemic), P (phonemic), and
David S. Touretzky and Deirdre W. Wheeler
106
Input
m
c
e
n
m
c
e
n
s
k
b
i
‘
b
i
Trigger Element
mut
Change Buffer
del
Ins
output
z
g
Figure 4: Modeling Russian voicing assimilation via clustering.
F (phonetic). M is the input level and holds the underlying form of an utterance; P is an intermediate level; and F is the output level, containing the surface form. The model supports just two levels of derivation: M-P and P-F, as shown in Figure 5. The clustering and projection circuitry is omitted from this diagram. Cluster rules and rules that operate on projections write their requested changes into the appropriate change buffer slots in parallel with the ordinary rules. Another portion of the M3P model, also not shown in the diagram, deals with syllabification. M-level strings that do not meet a language’s syllabic constraints trigger insertions and deletions via the M-P mapping matrix, so that the P-level string is always syllabically well formed (Touretzky and Wheeler 1990b). Several types of phonological processes are sensitive to syllabic structure, the most notable being stress assignment. Syllabic information is available to rules through a set of onset, nucleus, and coda bits set at M-level by the syllabifier, and transmitted to P-level via the M-P map. Some phenomena that are said to require a computationally expensive process of cyclic rule application, such as u-epenthesis in Modern Icelandic (Kiparsky 19851, appear to be
Parallel Mapping Networks
107
M-P Rules
Figure 5: Overview of the M3P model. implementable in M3P by extending syllabification to be sensitive to morphological boundaries. This is presently under investigation. 7 Discussion
This work was inspired by Goldsmith's and Lakoff's earlier proposals for three-level mapping architectures where rules could apply in parallel. The crucial feature distinguishing our work from theirs is their reliance on intralevel well-formedness constraints in addition to interlevel rules. In their models, constraints can interact with rules and with each other in complicated ways. Their proposals therefore require some form of parallel relaxation process, perhaps even simulated annealing search (Goldsmith makes explicit reference to Smolensky's Harmony Theory; Smolensky 1986). Neither has been implemented to date, due to the complexity of the computation involved. We were able to implement M3P using purely feed-forward circuitry because our model permits only interlevel rules. But denying ourselves the power of an unconstrained relaxation process forced us to drastically rethink the model's structure. Specialized clustering and syllabification primitives were introduced in compensation.

M3P is thus a highly constrained architecture. It cannot, to use a now famous example, reverse all the segments of an utterance (Pinker and Prince 1988), or perform many other elaborate and unnatural sequence manipulations, because it is restricted to two levels of insertion, deletion, and mutation operations on sequences and their projections. (A fourth phonological operation, metathesis of adjacent segments, is not presently supported but could easily be added.) This parallel mechanism, with its strictly limited derivational depth, appears to be sufficient for human languages.

Implementing the theory as a connectionist model forces us to choose our phonological primitives carefully, since they must be efficiently realizable as threshold logic circuitry. In this way, the model provides constraints on the theory: for example, derivations should be short, and intralevel well-formedness constraints should be avoided. Yet the theory also shapes the model by suggesting linguistically appropriate features of the input that the model ought to be recognizing, such as syllable structure.

Several lengthy chains of ordered rules from classical analyses have been successfully reformulated into M3P's two-level scheme using devices such as parallel ordering (putting two rules at the same level, which prevents them from feeding or bleeding each other) or syllabically motivated epenthesis. But if derivational depth is truly limited, we should expect to run out of levels at some point. Consider vowel harmony, which must be treated as a P-F process when epenthetic vowels undergo harmony, as is the case in many languages. Since there are no derivational levels beyond P-F, the theory predicts that in all human languages that harmonize on epenthetic vowels, vowel harmony will not be seen feeding other phonological processes. To the best of our knowledge this is true.

We do not wish to suggest that the circuitry of M3P corresponds directly to some bit of human neural tissue. It might perhaps be a highly simplified approximation to some portion of a cortical language area, but that remains to be seen.
We use the circuitry only as an existence proof that our linguistic theory has an efficient implementation, and, therefore, that phonology need not be computationally expensive. Whatever the brain's phonological processor actually does, it appears to do no more than a simple parallel rule system with some added structuring primitives that permits two levels of derivation. So even if people do not explicitly represent rules and mapping matrices in their heads, our model contributes to our understanding of language by suggesting strict limits on the computational power of whatever it is they do have in their heads.

To date the model has been applied successfully to limited sets of data drawn from a variety of languages (English, Mohawk, Yawelmani, Turkish, Slovak, Gidabal, Russian, Korean, and Icelandic). We are currently extending it to deal with stress and tone, and planning a more comprehensive analysis of a single language as a further test of the model's validity.
Acknowledgments

This research was supported by a contract from Hughes Research Laboratories, by the Office of Naval Research under contract number N00014-86-K-0678, and by National Science Foundation Grant EET-8716324. We thank Gillette Elvgren III for his work on implementing the simulations.
References

Anderson, S. 1974. The Organization of Phonology. Academic Press, New York.
Goldsmith, J. A. 1990. Autosegmental and Metrical Phonology. Basil Blackwell, Oxford, UK.
Kiparsky, P. 1985. Some consequences of lexical phonology. Phonol. Yearbook 2, 83-186.
Lakoff, G. 1989. Cognitive phonology. Draft of paper presented at the UC-Berkeley Workshop on Constraints vs Rules, May 1989.
Pinker, S., and Prince, A. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition 28, 73-193.
Rumelhart, D. E., and McClelland, J. L. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2, J. L. McClelland and D. E. Rumelhart, eds., pp. 216-271. The MIT Press, Cambridge, MA.
Smolensky, P. 1986. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds., pp. 194-281. The MIT Press, Cambridge, MA.
Touretzky, D. S., and Wheeler, D. W. 1990a. A computational basis for phonology. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 372-379. Morgan Kaufmann, San Mateo, CA.
Touretzky, D. S., and Wheeler, D. W. 1990b. Two derivations suffice: The role of syllabification in cognitive phonology. In The MIT Parsing Volume, 1989-1990, C. Tenny, ed., pp. 21-35. MIT Center for Cognitive Science, Parsing Project Working Papers 3.
Wheeler, D. W., and Touretzky, D. S. In press. A connectionist implementation of cognitive phonology. In The Last Phonological Rule, J. Goldsmith, ed. University of Chicago Press, Chicago.
Received 14 September 1990; accepted 8 October 90.
Communicated by Jeffrey Elman
Parsing Complex Sentences with Structured Connectionist Networks
Ajay N. Jain
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

A modular, recurrent connectionist network is taught to incrementally parse complex sentences. From input presented one word at a time, the network learns to do semantic role assignment, noun phrase attachment, and clause structure recognition, for sentences with both active and passive constructions and center-embedded clauses. The network makes syntactic and semantic predictions at every step. Previous predictions are revised as expectations are confirmed or violated with the arrival of new information. The network induces its own "grammar rules" for dynamically transforming an input sequence of words into a syntactic/semantic interpretation. The network generalizes well and is tolerant of ill-formed inputs.

Neural Computation 3, 110-120 (1991) © 1991 Massachusetts Institute of Technology

1 Introduction

Traditional methods employed in parsing natural language have focused on developing powerful formalisms to represent syntactic and semantic structure along with rules for transforming language into these formalisms. The builders of such systems must accurately anticipate and model all of the language constructs that their systems will encounter. Spoken language complicates matters even further in several ways. It is more strictly sequential than written language (one cannot look ahead). Spoken language also has a loose structure that is not easily captured in formal grammar systems. This is compounded by phenomena such as ungrammaticality, stuttering, and interjections. Errors in word recognition are also possible. Independent of these factors, systems that can produce predictive information for speech recognition are desirable. Parsing methodologies designed to cope with these requirements are needed.

Connectionist networks have three main computational strengths that may be useful in such domains. First, they learn and can generalize from examples. This offers a potential solution to the difficult problem of constructing grammars for spoken language. Second, by virtue of the learning algorithms they employ, connectionist networks can potentially exploit statistical regularities across different modalities (e.g., syntactic information and prosodic information). Lastly, connectionist networks tend to be tolerant of noisy input as is present in real speech. The work presented here is a step toward a connectionist parsing system that demonstrates these benefits in the context of a speech processing system.

Many connectionist architectures have been devised for processing natural language. Several of these architectures have implemented formal syntactic grammar systems (e.g., Charniak and Santos 1987; Selman and Hirst 1985; Fanty 1986). Others have modeled semantic phenomena but have paid less attention to parsing (e.g., Waltz and Pollack 1985; McClelland and Kawamoto 1986). These systems, as with standard formal grammar systems, do not acquire grammar. In contrast, this article describes a connectionist network that learns to parse complex sentences presented one word at a time by acquiring a statistical grammar based on a combination of semantic and syntactic cues.¹

The goals of this work were threefold: first, to show that connectionist networks can learn to incrementally parse nontrivial sentences; second, to show how modularity and structure can be exploited in building complex networks with relatively little training data; and third, to show generalization ability and noise tolerance suggestive of application to more substantial problems.
2 Incremental Parsing

Language processing is particularly difficult for connectionist systems in part because of its sequential nature. As input tokens are received, it is not generally possible to immediately determine how to process them. Complex temporal behavior is required to parse effectively. Earlier work produced a connectionist architecture that learned to parse a small set of sentences, including some with passive constructions (Jain 1989). This article describes an extension to the architecture that processes grammatically complex sentences and requires a substantial scale increase. A set of sentences with up to three clauses, including sentences with center-embedding and passive constructions, formed the training corpus.² Here are some example sentences:

• Fido dug up a bone near the tree in the garden.
• I know the man who John says Mary gave the book.
• The dog who ate the snake was given a bone.

¹ A lengthier presentation of this work appears in Jain and Waibel (1990).
² The training set contained over 200 sentences. They were a subset of the sentences that form the example set of a parser based on a left associative grammar developed by Roland Hausser (Hausser 1989). These sentences are grammatically interesting, but they do not reflect the statistical structure of common speech.
[Clause 1: [Phrase Block 1: The dog (RECIPIENT)]
           [Phrase Block 2: was given (ACTION)]
           [Phrase Block 3: a bone (PATIENT)]]
[Clause 2: [Phrase Block 1: who (AGENT)]
           [Phrase Block 2: ate (ACTION)]
           [Phrase Block 3: the snake (PATIENT)]
           (RELATIVE: "who" refers to Clause 1, Phrase Block 1)]

Figure 1: Parser's representation of "The dog who ate the snake was given a bone." The sentence is represented as two clauses made up of phrase blocks to which role labels are assigned. The embedded relative clause is also labeled.
Given the input one word at a time, the network's task is to incrementally build a representation of the sentence that includes the following information: phrase block structure,³ clause structure, semantic role assignment, and interclause relationships. Figure 1 shows a representation of the desired parse of "The dog who ate the snake was given a bone."
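As a concrete picture of the target representation, the parse in Figure 1 can be written down as a small data structure. The sketch below is illustrative only: the class names `PhraseBlock` and `Clause`, the `relative_to` field, and the helper `role_structure` are hypothetical and are not part of the network itself, which encodes this information as unit activations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PhraseBlock:
    words: List[str]   # contiguous, tightly related words
    role: str          # semantic role label assigned to the block


@dataclass
class Clause:
    blocks: List[PhraseBlock]
    # (clause number, phrase block number), 1-based, that a relative
    # clause refers to; None for a clause with no such relationship.
    relative_to: Optional[Tuple[int, int]] = None


# Desired parse of "The dog who ate the snake was given a bone." (Figure 1)
parse = [
    Clause([
        PhraseBlock(["The", "dog"], "RECIPIENT"),
        PhraseBlock(["was", "given"], "ACTION"),
        PhraseBlock(["a", "bone"], "PATIENT"),
    ]),
    Clause([
        PhraseBlock(["who"], "AGENT"),
        PhraseBlock(["ate"], "ACTION"),
        PhraseBlock(["the", "snake"], "PATIENT"),
    ], relative_to=(1, 1)),
]


def role_structure(clause: Clause) -> str:
    """Summarize a clause as a role sequence such as Agent/Action/Patient."""
    return "/".join(block.role.capitalize() for block in clause.blocks)


for number, clause in enumerate(parse, start=1):
    print(number, role_structure(clause), clause.relative_to)
```

Each of the four kinds of information listed above has a place in this structure: phrase blocks, their grouping into clauses, a role per block, and the interclause link.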
3 Network Architecture
The approach to temporal context taken in this work is different from that of the simple recurrent network (Elman 1990) or the time-delay paradigm (Waibel et al. 1989). In the former approach, a network must learn to capture complex contextual information through somewhat indirect means. In the latter approach, time is represented spatially, and units have direct access to portions of past history and have no need to learn to capture temporal information. The approach taken here lies somewhere in the middle. Networks are given the computational hardware to use storage buffers that can atomically manipulate activation patterns. The process of capturing temporal context is integrated into the task to be learned. The network formalism is described in Jain (1989). There are four major features of this formalism:

• Well-behaved symbol buffers are constructed using groups of units whose connections are gated by other units.
• Units have temporal state; they integrate their inputs over time, and decay toward zero. Units produce a standard sigmoidal output value and a velocity output value. Units are responsive to both the static activation values of other units and their dynamic changes.
• The formalism supports recurrent networks.
• Networks learn using gradient descent via error backpropagation.

³ The term phrase block denotes a contiguous sequence of tightly related words. It does not correspond to the classical grammatical notion of phrase.

Figure 2: Parsing architecture with an example sentence. [Levels labeled in the figure: Interclause, Clause Roles, Clause Structure, Word.]

Figure 2 shows the detailed network architecture. Information flows through the network as follows. A word is presented by stimulating its associated word unit for a short time. This produces a pattern of activation across the feature units that represents the meaning of the word.⁴ The Phrase level uses the sequence of word representations from the Word level to build contiguous phrase blocks. Connections from the Word level to the Phrase level are modulated by gating units that learn the required conditional assignment behavior to capture word feature activation patterns in the phrase blocks. The Clause Structure level assembles phrase blocks into clauses. For example, "[The dog] [who] [ate] [the snake] [was given] [a bone]" is mapped into "[The dog] [was given] [a bone]" and "[who] [ate] [the snake]." The Clause Roles level produces labels for the roles and relationships of the phrase blocks in each clause of the sentence (e.g., Agent, Action, and Patient). The final level, Interclause, represents the interrelationships among the clauses making up the sentence (e.g., clause 2 is relative to the first phrase block of clause 1).

The parser was constructed from three separately trained modules. The Phrase level formed one module, the Clause Roles level another, and the Clause Structure and Interclause levels together formed the third. Each module's hidden units received recurrent connections from the output units (those units with specified targets) to provide contextual information (similar to Jordan 1986). The recurrent connections also provided a means for competitive effects to develop among output units.

The Phrase and Clause Roles modules were constructed by replication. A subnetwork capable of assembling a single phrase block was trained to process all the phrase blocks in the corpus and was replicated to produce the 10 phrase blocks making up the Phrase level. Thus, even if a particular construction only appeared in one position in the training set, the full Phrase level module is able to properly process it at any position. Similarly, at the Clause Roles level, a single subnetwork was trained to process all of the clauses in the corpus. This subnetwork was also replicated.
The replication process is similar to "weight slaving" in TDNNs (Waibel et al. 1989), where equality constraints are placed on weights serving analogous functions in different input positions. Target values were set at the beginning of pattern presentation for all units with static target values. This encouraged predictive behavior since it was advantageous for units to achieve their target values as early as possible during the presentation of an input pattern to avoid accumulating error. Gating units have changing targets. They must become active during the time course of a single word and then become inactive. Their target values were computed dynamically during the presentation of each training sentence.

⁴ The connections from the word units to the feature units, which encode semantic and syntactic information about words, are compiled into the network and are fixed. Connectionist networks have previously been used for acquiring semantic features of words (Miikkulainen and Dyer 1989), but in building large systems, it makes sense to precompile as much lexical information as possible - especially if one does not have a surfeit of training data. By making use of existing lexical knowledge, one can avoid the expense of acquiring such information through training and ensure that the lexicon is uniform and general.
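Two ingredients of the formalism described in this section, units with temporal state and gated symbol buffers, can be sketched as follows. This is a rough illustration under stated assumptions, not the formalism of Jain (1989): the class `TemporalUnit`, the multiplicative `decay` parameter, the definition of velocity as the change in output between updates, and the interpolating `gated_copy` function are all hypothetical choices made for the example.

```python
import math


class TemporalUnit:
    """A unit that integrates its input over time and decays toward zero.

    Assumption: the internal state is a leaky sum, the output is a standard
    sigmoid of the state, and the velocity is the change in output since
    the previous update cycle.
    """

    def __init__(self, decay=0.9):
        self.decay = decay
        self.state = 0.0
        self.output = 0.5      # sigmoid(0)
        self.velocity = 0.0

    def step(self, net_input):
        self.state = self.decay * self.state + net_input
        new_output = 1.0 / (1.0 + math.exp(-self.state))
        self.velocity = new_output - self.output
        self.output = new_output
        return self.output, self.velocity


def gated_copy(slot, features, gate):
    """Move a buffer slot toward a feature pattern in proportion to a gate.

    With gate = 1 the slot captures the pattern; with gate = 0 it keeps
    its previous contents (assumed behavior of a gated connection).
    """
    return [(1.0 - gate) * s + gate * f for s, f in zip(slot, features)]


unit = TemporalUnit(decay=0.5)
print(unit.step(1.0))   # stimulated: output rises, velocity is positive
print(unit.step(0.0))   # input removed: output falls, velocity is negative
print(gated_copy([0.0, 0.0], [1.0, 0.5], gate=1.0))
```

In the network, the gate value would itself be the activation of a trained gating unit that opens during one word and closes as the next word arrives.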
It is important to note that while the parsing architecture is fixed for any particular parser, in principle there are no limits on the number of constituents, or number of labels and relationships that a parser can contain. If the training set contained sentences with conditional clauses, this would simply require an additional set of Interclause units to denote the conditional relationships between clauses. The Clause Structure level would not require additional units, but the existing units would have to learn the clausal structure of conditional sentences. The architecture supports manipulation of symbols, building of structures, and labeling (and attachment) of structures, and is thus quite general.

4 Parsing Performance
The network learned to parse a large, diverse training set. This section discusses three aspects of the network's performance: dynamic behavior of the integrated modules, generalization, and tolerance of noisy input.

4.1 Dynamic Behavior. The dynamic behavior of the network will be illustrated on the example sentence from Figures 1 and 2: "The dog who ate the snake was given a bone." This sentence was not in the training set. Initially, all of the units in the network are at their resting values. The units of the phrase blocks all have low activation. The word unit corresponding to "the" is stimulated, causing its word feature representation to become active across the feature units of the Word level. The hidden layer causes the gating unit associated with slot 1 of phrase block 1 to become active, which in turn causes the feature representation of "the" to be assigned to the slot. The gate closes as the next word is presented. The remaining words of the sentence are processed similarly, resulting in the final Phrase level representation shown in Figure 1.

While this is occurring, the higher levels of the network are processing the evolving Phrase level representation. The behavior of some of the output units of the Clause Structure level is shown in Figure 3. Early in the presentation of the first word, the Clause Structure level hypothesizes that the first four phrase blocks will belong to the first clause - reflecting the dominance of single clause sentences in the training set. After "the" is processed, this hypothesis is revised. The network then believes that there is an embedded clause of three (possibly four) phrase blocks following the first phrase block. This predictive behavior emerged spontaneously from the training procedure (a large majority of sentences in the training set beginning with a determiner had embedded clauses after the first phrase block). The next two words ("dog who") confirm the network's expectation. The word "ate" allows the network to firmly decide on an embedded clause of three phrase blocks within the main clause. This is the correct clausal structure of the sentence and is confirmed by the remainder of the input. The Interclause level (not shown in the figure) indicates that the embedded clause is relative to the first phrase block of the main clause during the initial hypothesis of the embedded clause.

Figure 3: Example of Clause Structure dynamic behavior.

The Clause Roles level processes the individual clauses as they are mapped through the Clause Structure level. The output units for clause 1 initially hypothesize an Agent/Action/Patient role structure with some competition from a Recipient/Action/Patient role structure (the Agent and Recipient units' activation traces for clause 1, phrase block 1 are shown in Fig. 4). This prediction occurs because active constructs outnumbered passive ones during training. The final decision about role structure is postponed until just after the embedded clause is presented. The input tokens "was given" immediately cause the Recipient/Action/Patient role structure to dominate. The network also indicates that a fourth phrase block (e.g., "by Mary") is expected to be the Agent (not shown). For clause 2 ("[who] [ate] [the snake]"), an Agent/Action/Patient role structure is again predicted; this time the prediction is borne out.
Figure 4: Example of Clause Roles dynamic behavior. [Activation traces for clause 1, phrase block 1 ("the dog") as the words of "The dog who ate the snake was given a bone" are presented.]

4.2 Generalization and Noise Tolerance. One type of generalization is implicit in the architecture. Word feature patterns have two parts: a syntactic/semantic part and an ID (identity) part. The representations of "John" and "Peter" differ only in their ID parts. Units in the network that learn do not have any input connections from the ID portions of the word units. Thus, when the network learns to parse "John gave the apple to the boy," it will know how to parse "Peter promised the cookie to the girl." This type of generalization is extremely useful, both for addition of new words to the network and for processing sentences for which the net was not explicitly trained.

The network also generalizes to correctly process truly novel sentences - sentences that are distinct (beyond ID features) from those in the training set. The weight sharing techniques at the Phrase and Clause Structure levels have an impact here. Although it is difficult to measure generalization quantitatively, some statements can be made about the types of novel sentences that are correctly processed relative to the training sentences. Substitution of single words resulting in a meaningful sentence is tolerated almost without exception. Substitution of entire phrase blocks by different phrase blocks causes some errors in structural parsing on sentences that have few similar training exemplars. However, the network does quite well on sentences that can be formed from major components of familiar sentences (e.g., interchanging clauses). More training data, especially for multiclause sentences, would improve the performance.

Noise tolerance is particularly important in processing spoken language. The effects of noise were simulated by testing the network on sentences that had been corrupted in several ways. Note that during training the parser was exposed only to well-formed sentences. Sentences in which verbs were made ungrammatical were processed without difficulty (e.g., "We am happy."). Sentences in which verb phrases were badly corrupted produced reasonable interpretations. For example, the sentence "Peter was gave a bone to Fido," received an Agent/Action/Patient/Recipient role structure as if "was gave" was supposed to be either "gave" or "has given." Interpretation of corrupted verb phrases was context dependent. Single clause sentences in which determiners were randomly deleted to simulate speech recognition errors were processed correctly 85% of the time. Multiple clause sentences corrupted in a similar manner produced more parsing errors. There were fewer examples of multiclause sentence types, and this hurt performance. Deletion of function words such as prepositions beginning prepositional phrases produced few errors, but deletions of critical function words such as "to" in infinitival constructions introducing subordinate clauses caused serious problems.

The network was somewhat sensitive to variations in word presentation "speed,"⁵ but tolerated interword silences. Interjections of "ahh," which were simulated by inserting "a" in the word sequence, and partial phrase repetitions were also tested. The network did not perform as well on these sentences as other networks trained for less complex parsing tasks. One possibility is that the modular replication technique is preventing the formation of strong attractors for the training sentences. There appears to be a tradeoff between generalization and noise tolerance.
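Two of the corruptions described in this section, random deletion of determiners and insertion of "a" to simulate an interjection, are easy to sketch. The code below is a hypothetical reconstruction of such a test harness, not the procedure actually used: the function names, the determiner list, and the per-word probability `p` are assumptions.

```python
import random

DETERMINERS = {"a", "an", "the"}   # assumed determiner set


def delete_determiners(words, p, rng):
    """Drop each determiner independently with probability p."""
    return [w for w in words
            if not (w.lower() in DETERMINERS and rng.random() < p)]


def insert_interjections(words, p, rng, filler="a"):
    """Insert the filler word before each word with probability p."""
    corrupted = []
    for w in words:
        if rng.random() < p:
            corrupted.append(filler)
        corrupted.append(w)
    return corrupted


rng = random.Random(0)             # fixed seed so a test set is reproducible
sentence = "the dog ate a bone".split()
print(delete_determiners(sentence, 0.5, rng))
print(insert_interjections(sentence, 0.2, rng))
```

Passing an explicit random generator keeps the corrupted test sentences repeatable across evaluation runs.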
5 Conclusion

This project shows that a connectionist network can acquire a statistical grammar for an interesting fraction of English. It predictively applies its knowledge as input tokens are processed. This differs from attempts to add stochastic components to rule-based grammars (e.g., Seneff 1989). The stochastic component is beneficial for disambiguation and prediction, but in such systems, probabilities are applied at a single level (e.g., along arcs in a transition network). The connectionist approach can model stochastic effects at varying degrees of coarseness: anything from a single word to a complex partially complete syntactic structure can be the (statistically trained) trigger of some action. The training procedure forces the network to "search" efficiently, to apply likely "rules" before less likely ones. To minimize error, the trained network must make decisions about sentence structure as early as possible.

The connectionist approach also offers advantages over conventional parsers in terms of noise tolerance. Ungrammatical near-misses can be processed sensibly in many cases in the connectionist approach whereas grammar-based approaches often include no error correction (Hausser 1989). Other grammar-based approaches rely on complex, handcrafted rules to cope with foreseeable input variations (Young et al. 1989). A connectionist parser can potentially be trained to cope with expected input variations, but it will also be tolerant of other variations that were not explicitly modeled.

The modular technique permitted the three component modules of the network to be constructed and trained independently - an important advantage when designing large networks. In addition, replicative procedures that remove positional sensitivities were an efficient way to maximize generalization from an unbalanced training set. However, replication prevented the network from modeling position-specific regularities that may have enhanced noise tolerance. More systematic work needs to be done to understand the effects of the various training procedures on generalization and noise tolerance.

Work is in progress applying this type of network to a spoken language system for a conversational domain with a limited vocabulary. The connectionist approach should prove useful because tight coupling is desired between the parsing system and the underlying speech system. The predictive nature of this type of parser (its outputs can help drive the word hypothesizer of a speech system), its robustness, and the potential to integrate multiple input modalities (e.g., pitch and stress cues) should benefit the system. The suggestive results presented here will be more fully explored in this ongoing work.

⁵ Speed refers to the number of network update cycles during the presentation of each word. The network was trained on a constant speed.
Acknowledgments
This research was funded by grants from ATR Interpreting Telephony Research Laboratories, the National Science Foundation under Grant EET-8716324, and the Office of Naval Research under contract number N00014-86-K-0678. I thank Dave Touretzky and Alex Waibel for helpful comments and discussions.
References

Charniak, E., and Santos, E. 1987. A connectionist context-free parser which is not context-free but then it is not really connectionist either. In Proc. Ninth Ann. Conf. Cog. Sci. Soc., 70-77.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14(2), 179-212.
Fanty, M. 1986. Context-free parsing with connectionist networks. In AIP Conference Proceedings number 151, J. S. Denker, ed. American Institute of Physics, New York.
Hausser, R. 1989. Computation of Language: An Essay on Syntax, Semantics, and Pragmatics in Natural Man-Machine Communication. Springer-Verlag, Berlin.
Jain, A. N. 1989. A Connectionist Architecture for Sequential Symbolic Domains. Tech. Rep. CMU-CS-89-187, School of Computer Science, Carnegie Mellon University.
Jain, A. N., and Waibel, A. H. 1990. Incremental parsing by modular recurrent connectionist networks. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 364-371. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. Eighth Ann. Conf. Cog. Sci. Soc., pp. 531-546.
McClelland, J. L., and Kawamoto, A. H. 1986. Mechanisms of sentence processing: Assigning roles to constituents. In Parallel Distributed Processing, Vol. 2, J. L. McClelland and D. E. Rumelhart, eds., pp. 273-331. The MIT Press, Cambridge, MA.
Miikkulainen, R., and Dyer, M. G. 1989. Encoding input/output representations in connectionist cognitive systems. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 347-356. Morgan Kaufmann, San Mateo, CA.
Selman, B., and Hirst, G. 1985. A rule-based connectionist parsing system. In Proc. Seventh Annu. Conf. Cog. Sci. Soc., 212-221.
Seneff, S. 1989. TINA: A probabilistic syntactic parser for speech understanding systems. In Proc. 1989 IEEE Conf. Acoustics, Speech Signal Process., pp. 711-714.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Process. 37(3), 328-339.
Waltz, D., and Pollack, J. 1985. Massively parallel parsing: A strongly interactive model of natural language interpretation. Cog. Sci. 9, 51-74.
Young, S. R., Hauptmann, A. G., Ward, W. H., Smith, E. T., and Werner, P. 1989. High level knowledge sources in usable speech recognition systems. Commun. ACM 32(2), 183-193.
Received 14 May 1990; accepted 23 October 1990.
Communicated by Garrison Cottrell
Rules and Variables in Neural Nets

Venkat Ajjanagadde
Lokendra Shastri
Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA 19104 USA

A fundamental problem that must be addressed by connectionism is that of creating and representing dynamic structures (Feldman 1982; von der Malsburg 1985). In the context of reasoning with systematic and abstract knowledge, this problem takes the form of the variable binding problem. We describe a biologically plausible solution to this problem and outline how a knowledge representation and reasoning system can use this solution to perform a class of predictive inferences with extreme efficiency. The proposed system solves the variable binding problem by propagating rhythmic patterns of activity wherein dynamic bindings are represented as the synchronous firing of appropriate nodes.

1 Introduction
One of the fundamental problems that must be addressed by connectionism is that of creating and representing dynamic structures (Feldman 1982; von der Malsburg 1985). In the context of reasoning, this problem takes the form of the variable binding or the role binding problem. Assume that the following systematic and general knowledge is part of an agent’s model of its environment:
1. If someone (say, x) flies from a source y to a destination z, then x moves from y to z.
2. If someone moves from a source to a destination z, then it reaches z.

The above knowledge may be expressed succinctly in the form of the first-order rules given below, wherein fly is a three-place predicate with roles fly-agent, fly-source, and fly-destination; move is also a three-place predicate with roles move-agent, move-source, and move-destination; and reach is a binary predicate with roles reach-agent and reach-location. Observe that the use of variables such as x, y, and z allows the expression of general - instantiation independent - knowledge and helps in the specification of the correspondence between the roles of these predicates.

$$\mathit{fly}(x, y, z) \Rightarrow \mathit{move}(x, y, z) \tag{1.1}$$

$$\mathit{move}(x, y, z) \Rightarrow \mathit{reach}(x, z) \tag{1.2}$$

Neural Computation 3, 121-134 (1991) © 1991 Massachusetts Institute of Technology
Given the above rules, if an agent is told fly(tweety, tree1, tree2) (i.e., "Tweety flew from tree1 to tree2"), it should be able to infer move(tweety, tree1, tree2) and reach(tweety, tree2). A connectionist reasoning system that encodes the above rules should also be capable of making the same inferences, and hence should exhibit the following behavior: If the network's pattern of activity is initialized to represent the fact fly(tweety, tree1, tree2), then after some time, its activity should evolve to include the representation of the facts move(tweety, tree1, tree2) and reach(tweety, tree2).

This raises two questions. First, how should a novel and dynamic fact such as fly(tweety, tree1, tree2) be represented as a pattern of activity? Observe that such a representation should be free of cross-talk, that is, the representation of fly(tweety, tree1, tree2) should be distinct from the representation of other facts such as fly(tweety, tree2, tree1) or fly(tweety, tree1, tree1). Second, how should the initial representation of this fact cause the network to create the dynamic representations of the inferred facts move(tweety, tree1, tree2) and reach(tweety, tree2)? Once again, such a propagation should occur without cross-talk. Whereas the first problem concerns the representation of variable or role bindings, the second problem concerns the propagation of such bindings.

In the following sections, we describe a solution to these problems and outline how a connectionist system can use this solution to perform predictive inferences. Such a reasoning system has several desirable features:

1. It is parallel at the knowledge level, that is, it can execute a large number of rules simultaneously.¹
2. It can represent fairly complex rules.
3. It performs inferences extremely fast: The time taken by the system to draw an inference is only proportional to the length of its shortest derivation and is independent of the overall size of the rule base.
4. The size of the network is only linear in the number of rules encoded in the system.
5. In view of items (3) and (4), the system scales and can potentially encode a very large number of rules and yet perform systematic inferences with extreme efficiency.
6. Neurophysiological evidence suggests that the mechanism proposed for solving the variable binding problem is biologically plausible.

¹ Most extant connectionist reasoning systems impose the restriction that only one rule may fire at a time (Barnden 1989; Touretzky and Hinton 1988). A notable exception is the ROBIN system (Lange and Dyer 1989).
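As a point of reference for what the network is expected to compute, the inferences licensed by rules (1.1) and (1.2) can be written as a purely symbolic sketch. This is not the connectionist encoding; the names `RULES` and `forward_chain` and the tuple layout are assumptions made only for illustration. Each rule lists, for every consequent role, the position of the antecedent role it copies.

```python
# Hypothetical symbolic sketch of rules (1.1) and (1.2); not the connectionist encoding.
# Each rule is (antecedent predicate, consequent predicate, role_map), where
# role_map gives, for every consequent role, the index of the antecedent role it copies.
RULES = [
    ("fly", "move", (0, 1, 2)),   # fly(x, y, z) => move(x, y, z)
    ("move", "reach", (0, 2)),    # move(x, y, z) => reach(x, z)
]

def forward_chain(fact):
    """Return every fact derivable from one initial fact, in derivation order."""
    derived = [fact]
    frontier = [fact]
    while frontier:
        next_frontier = []
        for pred, args in frontier:
            for antecedent, consequent, role_map in RULES:
                if pred != antecedent:
                    continue
                new_fact = (consequent, tuple(args[i] for i in role_map))
                if new_fact not in derived:
                    derived.append(new_fact)
                    next_frontier.append(new_fact)
        frontier = next_frontier
    return derived

facts = forward_chain(("fly", ("tweety", "tree1", "tree2")))
for pred, args in facts:
    print(f"{pred}({', '.join(args)})")
```

Each pass over the frontier corresponds to one step in the chain of inference, so the two-step chain fly, move, reach needs two passes regardless of how many unrelated rules are added to `RULES`.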
The reader is encouraged to refer to Shastri and Ajjanagadde (1990) for a detailed treatment of our solution to the dynamic binding problem and its use in the design of the connectionist reasoning system. Therein we also discuss several extensions and compare our system to other connectionist reasoning systems (Barnden 1989; Lange and Dyer 1989; Smolensky 1987; Touretzky and Hinton 1988).

2 Representing Role Bindings
A fact may be viewed as a specification of role bindings. For example, the fact fly(tweety, tree1, tree2) specifies that the three roles of the relation fly, namely, fly-agent, fly-source, and fly-destination, are bound to tweety, tree1, and tree2, respectively. Hence, the problem of representing a newly available fact amounts to representing - in a dynamic fashion - the bindings between the appropriate roles and fillers.

One way of doing this may be to physically interconnect the nodes representing the appropriate roles and fillers. This suggestion can take two forms. One possibility is to assume that the link required to represent a role-filler binding is created dynamically as and when the system needs to represent a fact. This solution is implausible because the growth of new physical structures cannot take place at the speed expected of a reasoning system.² A second possibility is to assume the prior existence of interconnections between all possible pairs of role-filler bindings (von der Malsburg 1985; Feldman 1982). These connections normally remain "inactive" but become "active" selectively to represent dynamic bindings. The problem with such a solution is that there will usually be an extremely large number of possible role-filler bindings, and having permanent structures corresponding to the representation of all such bindings will require an implausibly large number of nodes and/or links. Other techniques (based on the von Neumann architecture), such as passing names or pointers, cannot be used to represent bindings in connectionist networks because they violate the basic connectionist constraint that nodes be simple and messages be scalar levels of activation.

² We expect such a system to perform inferences within a few hundred milliseconds.

We propose to represent and propagate dynamic role bindings by exploiting the temporal structure of a network's pattern of activity. The use of the temporal dimension makes it possible to represent and propagate bindings without requiring the growth of new links or the existence of prior interconnections between all possible role and filler pairs. The possible use of temporally organized activity for representing dynamic structures in neural nets has been suggested by several researchers including Hebb (1949), von der Malsburg (1985, 1986), and Abeles (1982), and the use of the temporal dimension for encoding simple forms of conceptual knowledge may be found in Clossman (1988) and Fanty (1988).

In our proposed solution we assume that each role and filler is represented by a distinct phase-sensitive binary threshold unit (a p-btu, for short). A p-btu is like a btu except that the timing of its output spike depends on the timing of its input. In particular, if a p-btu receives an oscillatory input, it also oscillates in response and maintains a precise phase relationship with the input. The binding of a role to a filler is represented by the synchronous firing of the p-btu nodes representing the role and the filler.

As an example, consider the dynamic representation of the fact fly(tweety, tree1, tree2). A representation of this fact requires the representation of the bindings between the fillers tweety, tree1, and tree2 and the roles fly-agent, fly-source, and fly-destination, respectively. The rhythmic firing pattern shown in Figure 1 corresponds to the network's representation of these bindings. Observe that the nodes representing different fillers are firing in distinct phases and a role node is firing in synchrony with the filler node to which it is bound.

The proposed representation of bindings is filler-centered. With reference to the pattern of activity in Figure 2, observe that the firing of the role nodes fly-agent, move-agent, and reach-agent is synchronized with the firing of the filler node "tweety." This represents that tweety is bound to the three roles fly-agent, move-agent, and reach-agent.

3 Propagation of Bindings
Having discussed the representation of bindings, let us consider the problem of propagating bindings. Recall that if the system encodes rules (1.1) and (1.2), we would expect it to infer move(tweety, tree1, tree2) and reach(tweety, tree2) on being given the initial fact fly(tweety, tree1, tree2). Alternatively, we would expect the system to infer move(pigeon5, building3, tree6) and reach(pigeon5, tree6) on being given the initial fact fly(pigeon5, building3, tree6). Notice that the bindings of roles in the inferred facts must be determined dynamically and systematically based on (1) the bindings of roles in the initial fact and (2) the correspondence between roles specified in the rules encoded in the system. The reasoning system realizes this by propagating the pattern of activity corresponding to the initial fact in accordance with the role correspondence specified in the rules. Below, we describe how this is realized.

The proposed system can deal with rules involving constants and existentially quantified variables as predicate arguments, and the details of the complete realization may be found in Shastri and Ajjanagadde (1990). In this letter, however, we limit ourselves to rules that have the following form:

$$\forall x_1, \ldots, x_m \; [P_1(\ldots) \wedge P_2(\ldots) \wedge \cdots \wedge P_n(\ldots) \Rightarrow Q(\ldots)] \tag{3.1}$$

where the $P_i$ and $Q$ are distinct predicates.

[Figure 1 rows: sail-dest, sail-source, sail-agent, building3, pigeon5, reach-loc, reach-agent, move-dest, move-source, move-agent, fly-dest, fly-source, fly-agent, tree2, tree1, tweety]

Figure 1: Rhythmic pattern of activation representing the role bindings fly-agent = tweety, fly-source = tree1, and fly-destination = tree2. These bindings constitute the fact fly(tweety, tree1, tree2). All active filler nodes occupy a distinct phase and the binding between a role and a filler is represented by the in-phase firing of the associated role and filler nodes. Some role names have been abbreviated.
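The phase-based representation of Section 2, as pictured in Figure 1, can be sketched as a mapping from node names to phase slots. This is only an illustration under simplifying assumptions: phases are treated as integer slot numbers rather than spike times, and the names `ROLES`, `represent`, and `bindings` are hypothetical.

```python
# Hypothetical sketch: a dynamic fact as a phase assignment (slot numbers stand in for phases).
ROLES = {"fly": ("fly-agent", "fly-source", "fly-destination")}

def represent(pred, fillers):
    """Give each distinct filler its own phase slot and place every role node
    in the slot of the filler it is bound to."""
    phase = {}
    for filler in fillers:
        phase.setdefault(filler, len(phase))   # one new slot per distinct filler
    for role, filler in zip(ROLES[pred], fillers):
        phase[role] = phase[filler]            # role fires in synchrony with its filler
    return phase

def bindings(phase, pred):
    """Read the role bindings back by matching role phases to filler phases."""
    role_names = set(ROLES[pred])
    filler_at = {p: name for name, p in phase.items() if name not in role_names}
    return {role: filler_at[phase[role]] for role in ROLES[pred]}

a = bindings(represent("fly", ("tweety", "tree1", "tree2")), "fly")
b = bindings(represent("fly", ("tweety", "tree2", "tree1")), "fly")
c = bindings(represent("fly", ("tweety", "tree1", "tree1")), "fly")
print(a)
print(a != b and a != c and b != c)  # the three facts stay distinct (no cross-talk)
```

No link is created and no per-pair structure is consulted: the only thing that changes between the three facts is which phase slot each role node occupies.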
[Figure 2 rows: sail-dest, sail-source, sail-agent, building3, pigeon5, reach-loc, reach-agent, move-dest, move-source, move-agent, fly-dest, fly-source, fly-agent, tree2, tree1, tweety]

Figure 2: Pattern of activation representing the role bindings fly-agent = tweety, move-agent = tweety, reach-agent = tweety, fly-source = tree1, move-source = tree1, fly-destination = tree2, move-destination = tree2, and reach-location = tree2. These bindings constitute the facts fly(tweety, tree1, tree2), move(tweety, tree1, tree2), and reach(tweety, tree2). The bindings between tweety and the roles fly-agent, move-agent, and reach-agent are represented by the synchronous firing of the three role nodes in-phase with tweety.
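The transition from the pattern of Figure 1 to the pattern of Figure 2 can be illustrated with a small noise-free simulation of spike times. It anticipates the timing assumption detailed in Section 3.1, namely that a role node first fires one period minus the link delay after activation arrives. The numerical values of `PERIOD` and `DELAY`, the filler phases, and the function names are all assumptions chosen for illustration; only the role-to-role links follow rules (1.1) and (1.2).

```python
PERIOD = 15.0   # period of oscillation in msec (illustrative value)
DELAY = 2.0     # link propagation delay d in msec (illustrative value)

# Role-to-role links implied by rules (1.1) and (1.2).
LINKS = {
    "fly-agent": ["move-agent"],
    "fly-source": ["move-source"],
    "fly-destination": ["move-destination"],
    "move-agent": ["reach-agent"],
    "move-destination": ["reach-location"],
}

def propagate(first_spike, periods):
    """first_spike maps each initially active node to the time of its first spike.
    Returns the first spike time of every node active after the given number of periods."""
    first = dict(first_spike)
    frontier = list(first_spike)
    for _ in range(periods):
        next_frontier = []
        for node in frontier:
            for target in LINKS.get(node, []):
                if target in first:
                    continue
                arrival = first[node] + DELAY
                first[target] = arrival + (PERIOD - DELAY)  # fires (period - d) after arrival
                next_frontier.append(target)
        frontier = next_frontier
    return first

def synchronous_with(phase, filler, fillers):
    """Role nodes firing in the same phase as the given filler node."""
    return sorted(n for n, p in phase.items() if p == phase[filler] and n not in fillers)

FILLERS = {"tweety": 0.0, "tree1": 5.0, "tree2": 10.0}
initial = dict(FILLERS)
initial.update({"fly-agent": 0.0, "fly-source": 5.0, "fly-destination": 10.0})

spikes = propagate(initial, periods=2)
phase = {node: t % PERIOD for node, t in spikes.items()}
for filler in FILLERS:
    print(filler, synchronous_with(phase, filler, FILLERS))
```

Because the delay and the waiting time always sum to exactly one period, every inferred role inherits the phase of the role that drives it. Real propagation delays would be noisy, which this sketch ignores.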
We will first describe how single antecedent rules, that is, rules with a single predicate in their antecedent, are encoded. Next we will outline the encoding needed to handle multiple antecedent rules.
3.1 Encoding and Reasoning with Single Antecedent Rules. An n-ary predicate is represented by a pred node (drawn as a pentagon) and a cluster of n role nodes (depicted as diamonds). Thus, the ternary predicate fly is represented by the pred node labeled FLY and the three role nodes fly-agent, fly-source, and fly-destination (refer to Fig. 3). Each entity is represented by a filler node (depicted as a circle). The role and filler nodes are p-btu nodes whose behavior was described in Section 2. The exact behavior of a pred node is not relevant to our present discussion and for simplicity it may be assumed that pred nodes are simple binary threshold units. A rule is encoded by linking the pred nodes of the antecedent and consequent predicates and connecting the role nodes of the antecedent predicate to the role nodes of the consequent predicate in accordance with the correspondence between roles specified in the rule. Compare the interconnections between role nodes in Figure 3 and the occurrence of variables in rules (1.1) and (1.2).

Figure 3: Encoding of predicates, fillers, and rules. The rules encoded are fly(x, y, z) ⇒ move(x, y, z), sail(x, y, z) ⇒ move(x, y, z), and move(x, y, z) ⇒ reach(x, z). Links between role nodes reflect the correspondence between roles of the antecedent and consequent predicates of rules. Circular nodes represent fillers (concepts) and diamond nodes represent roles. Both these nodes are p-btu nodes. The pentagon-shaped nodes are pred nodes and there is one such node for each predicate. The behavior of p-btu and pred nodes is explained in the text. Role names have been abbreviated. For example, the role node labeled "agent" in the cluster of nodes associated with the predicate FLY is the role node "fly-agent."

A dynamic fact is represented by activating the pred node corresponding to the associated predicate and activating the appropriate filler and role nodes so that each filler is oscillating in a distinct phase and all the roles bound to a given filler are oscillating in-phase with the filler. Thus the pattern of activation depicted in Figure 1 represents the fact fly(tweety, tree1, tree2). Note that the figure does not show the activity of the pred node FLY, which will also be active, though not in a phase-sensitive manner. We want this pattern of activation to lead to the pattern of activation shown in Figure 2, which represents the facts fly(tweety, tree1, tree2), move(tweety, tree1, tree2), and reach(tweety, tree2) (the activity of pred nodes is not shown in Fig. 2).

Let us see how the desired propagation of bindings takes place. First consider the pred nodes. As the pred node FLY is active, it sends activation to the pred node MOVE and turns it on. The pred node MOVE in turn causes the pred node REACH to become active. Thus, after two steps, the three pred nodes FLY, MOVE, and REACH become active.

Now, consider the activity of role and filler nodes. The following pairs of nodes are firing in synchrony in the initial pattern of activation: (fly-agent, tweety), (fly-source, tree1), and (fly-destination, tree2). Over the next period of oscillation, the nodes fly-agent, fly-source, and fly-destination send activation to the nodes move-agent, move-source, and move-destination, respectively. If we assume that a role node becomes active $(\pi - d)$ time after it receives activation - where $\pi$ is the period of oscillation and $d$ equals the link propagation delay from one role node to another - then the pairs of role nodes (fly-agent, move-agent), (fly-source, move-source), and (fly-destination, move-destination) will begin to fire in synchrony.³ In effect, the nodes move-agent, move-source, and move-destination will begin firing in synchrony with tweety, tree1, and tree2, respectively, thereby representing the fact move(tweety, tree1, tree2). The activations from the nodes move-agent and move-destination will in turn reach the nodes reach-agent and reach-location and cause them to fire in synchrony with tweety and tree2, respectively. Thus, within time $2 \times \pi$ after the network is initialized to the pattern of activity shown in Figure 1, its pattern of activity will evolve to that shown in Figure 2, and the system would have inferred move(tweety, tree1, tree2) and reach(tweety, tree2), given fly(tweety, tree1, tree2).

³ The model described here may be generalized so that each role is represented by an ensemble of nodes rather than a single node. Such a model exhibits tightly synchronized activity in roles in spite of noisy internodal propagation delays (Mandelbaum and Shastri 1990). Also see Section 4.

Conceptually, the encoding of rules corresponds to creating an inferential dependency graph. In terms of this graph it should be easy to see that the propagation of role bindings and rule firings corresponds to a parallel breadth-first traversal of the rule base. Thus the time taken to perform an inference is independent of the total number of rules and just equals $l \times \pi$, where $l$ is the length of the chain of inference.

3.2 Multiple Antecedent Rules. The encoding of single antecedent rules described above can be extended to handle multiple antecedent rules. We illustrate the encoding of such rules with the help of an example. Figure 4 depicts the encoding of the rule:

$$P_1(x_1, x_2, x_3) \wedge P_2(x_4, x_2) \wedge P_3(x_1, x_4) \Rightarrow Q(x_1, x_4) \tag{3.2}$$
Notice that the above rule should fire only if all the predicates in the antecedent are "satisfied." This also requires that the fillers bound to the roles of the antecedent predicates satisfy certain constraints. For example, the second role of P1 and the second role of P2 should be bound to the same filler. Similarly, the first role of P2 and the second role of P3 should also be bound to the same object. The proposed encoding of dynamic role bindings makes it extremely easy to enforce such constraints: Checking that two or more roles are bound to the same filler simply involves checking that the corresponding role nodes are firing in synchrony!

Figure 4: Encoding of the multiple antecedent rule P1(x1, x2, x3) ∧ P2(x4, x2) ∧ P3(x1, x4) ⇒ Q(x1, x4). The encoding of a multiple antecedent rule makes use of "coincidence detector nodes" (square nodes), "conjunctive nodes" (triangular nodes), and "enabling" modifiers (links terminating with dark arrows). The behavior of these nodes is explained in the text.

With reference to Figure 4, the triangular nodes K and L are simple conjunctive nodes: a conjunctive node becomes active if it receives activation along all its inputs. The inputs need not be in-phase, and it suffices that they arrive within a reasonable window of time. The square nodes g1, g2, and g3 are coincidence detectors. A coincidence detector becomes active if it receives synchronous (or in-phase) activation along all its inputs. Links terminating with solid arrows act as "enabling" modifiers. Any link that is impinged on by an enabling modifier propagates activation only if the enabling modifier is also active. Observe that the output of the conjunctive node L will be high if and only if the role bindings of the antecedent predicates satisfy the necessary equality constraints. The activation of L enables the propagation of bindings from the roles of the antecedent predicates to the roles of the consequent predicate.
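The firing condition for rule (3.2) can be sketched functionally as follows. This is an illustration of the logic only, not a reproduction of the wiring in Figure 4: the names `ANTECEDENT`, `CONSEQUENT`, `TOLERANCE`, `roles_of`, `in_sync`, and `fire` are hypothetical, the synchrony window is an arbitrary value, and phase wraparound at the period boundary is ignored. Every variable that occurs in more than one antecedent role yields one synchrony check.

```python
# Rule (3.2): P1(x1, x2, x3) & P2(x4, x2) & P3(x1, x4) => Q(x1, x4)
# A role is written (predicate, position). Repeated variables become equality
# constraints, each checked by one coincidence test.
ANTECEDENT = {"P1": ("x1", "x2", "x3"), "P2": ("x4", "x2"), "P3": ("x1", "x4")}
CONSEQUENT = ("Q", ("x1", "x4"))
TOLERANCE = 0.5  # msec window treated as synchronous (illustrative value)

def roles_of(var):
    """All antecedent roles in which the variable occurs."""
    return [(p, i) for p, vs in ANTECEDENT.items() for i, v in enumerate(vs) if v == var]

def in_sync(phases):
    """Coincidence test; ignores wraparound at the period boundary."""
    return max(phases) - min(phases) <= TOLERANCE

def fire(role_phase):
    """role_phase maps (predicate, position) to the firing phase of that role node and
    omits the roles of inactive predicates. Returns the phases of Q's roles, or None
    if the rule does not fire."""
    # Conjunctive condition: every antecedent predicate must be active.
    for pred, vs in ANTECEDENT.items():
        if any((pred, i) not in role_phase for i in range(len(vs))):
            return None
    # Coincidence detection: all roles sharing a variable must fire in synchrony.
    for var in {v for vs in ANTECEDENT.values() for v in vs}:
        if not in_sync([role_phase[r] for r in roles_of(var)]):
            return None
    # Enabled links: each consequent role inherits the phase of a matching antecedent role.
    pred, vs = CONSEQUENT
    return {(pred, i): role_phase[roles_of(v)[0]] for i, v in enumerate(vs)}

satisfied = {("P1", 0): 0.0, ("P1", 1): 5.0, ("P1", 2): 10.0,
             ("P2", 0): 3.0, ("P2", 1): 5.0,
             ("P3", 0): 0.0, ("P3", 1): 3.0}
violated = dict(satisfied)
violated[("P2", 1)] = 7.0   # second role of P2 no longer in phase with second role of P1

print(fire(satisfied))
print(fire(violated))
```

In rule (3.2) exactly three variables (x1, x2, and x4) occur in more than one antecedent role, so the sketch performs three non-trivial synchrony checks; x3 occurs once and is trivially in phase with itself.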
4 Biological Plausibility
The computational properties of nodes used in our system fall strictly within the constraints imposed by the core features of connectionism. But the mechanisms for representing bindings proposed here are also biologically plausible in a stronger sense: There exists neurophysiological evidence that (1) the basic mechanisms required for sustaining synchronous patterns of activity exist in the brain, and (2) such patterns may in fact be used by the animal brain to encode and process information. The fact that EEG recordings display rhythmic activity - even though they measure gross electrical activity summed over millions of neurons - strongly suggests that there exists significant temporal synchronization in neuronal activity (Sejnowski 1981). The existence of oscillatory activity in the olfactory bulb, hippocampus, and cerebellum has also been documented by several researchers (Gerstein 1970; Freeman 1981; MacVicar and Dudek 1980). The most compelling evidence, however, comes from recent findings of stimulus-related synchronous oscillations in the cat visual cortex (Gray et al. 1989; Gray and Singer 1989; Eckhorn et al. 1988, 1990). These findings support the hypothesis that the dynamic binding of all features related to a single object may be realized by the synchronous firing of the cells encoding these features. This hypothesis is analogous to our proposal for representing role bindings: Just as all features of an object are linked together by virtue of their synchronous activation, all the roles filled by the same object (i.e., filler) are represented by virtue of the synchronous activation of the appropriate role nodes. In Shastri and Ajjanagadde (1990) we discuss a generalization of the system wherein roles and fillers are represented by ensembles of nodes instead of single nodes.
Using a neurally plausible model of interensemble and intraensemble interaction it has been shown that interconnected role ensembles can exhibit tightly synchronized activity in spite of noisy internodal propagation delays and spontaneous node firings (Mandelbaum and Shastri 1990). This model is based on three principles: (1) the basic unit of representation is an ensemble of nodes, (2) the threshold-time characteristic of a node during its relative refractory period may be used to modulate the interspike interval, and (3) there exists a weak but fast coupling between immediate neighbors within an ensemble (cf. Eckhorn et al. 1990).

In Shastri and Ajjanagadde (1990) we also argue that the computational power of a system that uses temporal synchrony and phase-locked oscillations is comparable to that of a marker passing system (Fahlman 1979). This can be seen by recognizing that each phase in a rhythmic pattern of activation may be thought of as a transient marker. This correspondence is of significance to neural computation. Marker passing systems are not biologically plausible - they depend on a central controller that must direct every node at each step of computation and require the use of fairly complex nodes that must store and selectively propagate markers (Fahlman 1979). In contrast, a system based on temporal synchrony requires extremely simple nodes and operates without a central controller. Observe that once an input is presented to our reasoning system, it performs the requisite computations automatically.
5 Discussion

We have described a biologically plausible solution to the dynamic (variable) binding problem and outlined how a connectionist knowledge representation and reasoning system may use this solution to perform predictive inferences with efficiency. Below we discuss the limitations of the binding mechanism and provide pointers to future work.

As stated in Section 3, the proposed mechanism for representing and propagating dynamic bindings requires that each active filler fire in a distinct phase within a period of oscillation. As there is no limit on the number of role nodes that can be in synchrony with a filler node, there is no limit on the number of roles to which a filler may get bound. The number of fillers participating in bindings at any given time,⁴ however, is limited and is bounded by the ratio $\pi/w$, where $\pi$ is the period of oscillation and $w$ is the spike width. If we pick $\pi$ to be 15 msec and $w$ to be 1 msec, and if we allow for some variation in propagation delays, firing frequency, and spike width, we find that about 10 fillers can participate in role bindings at the same time. In other words, the number of objects the reasoning system can simultaneously deal with is about 10. It is perhaps not coincidental that such a limitation relates well with the "magic number" 7, which is often proposed as the "capacity" of human short-term memory and has been found to be a robust measure of the human ability to deal with dynamic information (Miller 1956).

⁴ The relevant unit of time here is the period of oscillation.

A second limitation of the encoding described above is that during any reasoning episode, the system can represent only one dynamic fact per predicate. Note, however, that any number of dynamic facts can be represented simultaneously, as long as they involve different predicates. It was pointed out in Shastri and Ajjanagadde (1990) that the proposed scheme can be extended to represent several, but a bounded number of, dynamic facts involving the same predicate. This extension is described in Mani and Shastri (1990).

A third limitation of the binding mechanism concerns the multiple occurrence of variables in the consequent of a rule. It is required that any variable occurring more than once in the consequent of a rule must also occur in its antecedent and get bound during the reasoning process. Note that no such restriction exists for variables occurring in the antecedent.

We have shown how our system can propagate an initial set of bindings over time to create the representation of inferred facts. What remains unspecified is a mechanism that would create an initial set of bindings in response to linguistic or visual input. We are interested in determining how simple linguistic inputs such as "Tweety flew from tree1 to tree2" can lead to the appropriate oscillatory activity. We are also interested in mechanisms that would allow the system to shift its (internal) "focus of attention" from one set of objects to another.

The system outlined in this letter is a forward reasoner; it makes predictions based on its long-term knowledge (rules) and newly available facts. The same binding mechanism can be used to build a system that can store long-term facts, answer queries, and perform explanatory inferences. Such a system applies rules in the backward direction to generate explanations and matches these explanations with stored facts. Such a system is described in Shastri and Ajjanagadde (1990).

Another important extension of the reasoning system involves the use of function symbols in rules. This requires the ability to represent dynamically created objects. Such an extension is described in Ajjanagadde (1990).

The reasoning and expressive power of the system can be enhanced by interfacing it with specialized reasoning modules such as a semantic network or an IS-A hierarchy (Shastri 1988).
Such a system would provide a natural framework for representing and reasoning with rules, facts, and queries that refer to typed (or sorted) variables. The use of typed variables facilitates the expression of conditions under which a causal relation may hold.

The rules in the reasoning system described above are assumed to be hard, logical rules. There is nothing inherent in the proposed solution, however, that precludes the representation of probabilistic or defeasible rules. The proposed system makes use of the phase of activation to encode binding information. This leaves open the possibility of using the amplitude of activation and weighted links to encode the strength of probabilistic rules and the "degree of belief" in the dynamic bindings.

A crucial question that we did not address is that of learning. The system represents new information as a transient trace of activation, but how are some of these traces converted into synaptically encoded long-term structures? How are new rules learned from experience? Although we do not offer any solution to this problem at this time, we would like to emphasize that the problem of learning the encoding of rules described in Section 3 is no more difficult than the problem of learning other structured representations within the connectionist framework.
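The capacity and timing figures quoted in this section and in Section 3.1 can be checked with a few lines of arithmetic. The bound on the number of fillers is the ratio of the period of oscillation to the spike width; the 0.5 msec jitter allowance used below is purely an assumption, chosen to show how "some variation" could bring the raw bound of 15 down to about 10, and the function names are hypothetical.

```python
def max_fillers(period_ms, spike_width_ms, jitter_ms=0.0):
    """Upper bound on the number of distinct phases in one period.
    jitter_ms widens each slot to allow for timing variation (an assumed model)."""
    return int(period_ms // (spike_width_ms + jitter_ms))

def inference_time_ms(chain_length, period_ms):
    """Time to draw an inference: chain length times the period of oscillation."""
    return chain_length * period_ms

print(max_fillers(15.0, 1.0))             # raw bound, period / spike width
print(max_fillers(15.0, 1.0, jitter_ms=0.5))
print(inference_time_ms(2, 15.0))         # fly => move => reach is a chain of length 2
```

With these values a two-step inference takes 30 msec, and even a chain of ten rules stays within the few hundred milliseconds the text expects of a reasoning system.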
Acknowledgments

We wish to thank M. Abeles, E. Bienenstock, G. W. Cottrell, M. Dyer, J. A. Feldman, M. Fanty, G. L. Gerstein, P. J. Hayes, G. E. Hinton, C. von der Malsburg, and S. J. Thorpe for their comments, suggestions, and criticism. Thanks to D. R. Mani for drawing the figures. This research was supported by NSF Grants IRI 88-05465, MCS-8219196-CER, MCS-83-05211, DARPA Grants N00014-85-K-0018 and N00014-85-K-0807, and ARO Grant ARO-DAA29-84-9-0027.
References
Abeles, M. 1982. Local Cortical Circuits: Studies of Brain Function, Vol. 6. Springer, New York.
Ajjanagadde, V. G. 1990. Reasoning with function symbols in a connectionist system. Proc. Cog. Sci. Conf., pp. 285-292.
Barnden, J. 1989. Neural-net implementation of complex symbol-processing in a mental model approach to syllogistic reasoning. Proc. IJCAI-89, pp. 568-573.
Clossman, G. 1988. A model of categorization and learning in a connectionist broadcast system. Ph.D. Dissertation, Department of Computer Science, Indiana University.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybernet. 60, 121-130.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1990. Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Comp. 2, 293-307.
Fahlman, S. E. 1979. NETL: A System for Representing Real-World Knowledge. MIT Press, Cambridge, MA.
Fanty, M. A. 1988. Learning in structured connectionist networks. Ph.D. Dissertation, Computer Science Department, University of Rochester, Rochester, NY.
Feldman, J. A. 1982. Dynamic connections in neural networks. Biol. Cybernet. 46, 27-39.
Freeman, W. J. 1981. A physiological hypothesis of perception. Perspect. Biol. Med. 24(4), 561-592.
Gerstein, G. L. 1970. Functional association of neurons: Detection and interpretation. In The Neurosciences: Second Study Program, F. O. Schmitt, ed., pp. 648-661. The Rockefeller Univ. Press, New York.
Gray, C. M., and Singer, W. 1989. Stimulus specific neuronal oscillations in orientation specific columns of cat visual cortex. Proc. Natl. Acad. Sci. 86, 1698-1702.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Lange, T., and Dyer, M. 1989. High-level inferencing in a connectionist network. Connection Sci. 1(2), 181-217.
MacVicar, B., and Dudek, F. E. 1980. Dye-coupling between CA3 pyramidal cells in slices of rat hippocampus. Brain Res. 196, 494-497.
von der Malsburg, C. 1985. Nervous structures with dynamical links. Ber. Bunsenges. Phys. Chem. 89, 703-710.
von der Malsburg, C. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.
Mandelbaum, R., and Shastri, L. 1990. A robust model for temporal synchronization of distant neurons. Working paper, Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA.
Mani, D. R., and Shastri, L. 1990. Representing multiple dynamic instantiations of a predicate in a connectionist system. Tech. Rep., Dept. of Computer Science, Univ. of Pennsylvania.
Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev. 63(2), 81-97.
Sejnowski, T. J. 1981. Skeleton filters in the brain. In Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, eds., pp. 189-212. Erlbaum, Hillsdale, NJ.
Shastri, L. 1988. A connectionist approach to knowledge representation and limited inference. Cog. Sci. 12(3), 331-392.
Shastri, L., and Ajjanagadde, V. 1990. From simple associations to systematic reasoning: A connectionist encoding of rules, variables, and dynamic binding. Tech. Rep. MS-CIS-90-05, Dept. of Computer Science, Univ. of Pennsylvania. Behav. Brain Sci., submitted.
Smolensky, P. 1987. On variable binding and the representation of symbolic structures in connectionist systems. Tech. Rep. CU-CS-355-87, Department of Computer Science, University of Colorado at Boulder.
Touretzky, D., and Hinton, G. 1988. A distributed connectionist production system. Cog. Sci. 12(3), 423-466.
Received 17 April 1989; accepted 23 October 1990.
Communicated by Nabil Farhat
TAG: A Neural Network Model for Large-Scale Optical Implementation
Hyuek-Jae Lee, Soo-Young Lee, Sang-Yung Shin
Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, P.O. Box 150 Chongryangni, Seoul, Korea
Bo-Yun Koh
Agency for Defense Development, P.O. Box 35, Daejon, Korea
TAG (Training by Adaptive Gain) is a new adaptive learning algorithm developed for optical implementation of large-scale artificial neural networks. For fully interconnected single-layer neural networks with $N$ input and $M$ output neurons, TAG contains two different types of interconnections, i.e., $NM$ global fixed interconnections and $N + M$ adaptive gain controls. For two-dimensional input patterns the former may be achieved by multifacet holograms, and the latter by spatial light modulators (SLMs). For the same number of input and output neurons TAG requires far fewer adaptive elements, and offers a route to large-scale optical implementation at some sacrifice in performance compared to the perceptron. The training algorithm is based on gradient descent and error backpropagation, and is easily extensible to multilayer architectures. Computer simulation demonstrates reasonable performance of TAG compared to perceptron performance. An electrooptical implementation of TAG is also proposed.

1 Introduction
Neural networks have been widely recognized as having good potential to solve complicated classification and adaptive control problems. Although adaptive trainability with simple learning algorithms provides the flexibility to perform complex tasks, special hardware is required to take advantage of the massive parallelism and analog asynchronous operation of neural networks. However, adaptive elements cost more than fixed elements, and this has become a limiting factor for large-scale implementations. After the first optical implementation of the one-dimensional Hopfield model (Farhat et al. 1985) had been reported, extensive research was conducted on two-dimensional neural networks. For optical implementation, multifacet holograms (Jang et al. 1988a) may achieve fairly large fixed interconnections. Volume holograms (Brady et al. 1988) or lenslet arrays with spatial light modulators (Jang et al. 1989) may also achieve adaptive interconnections. However, the former still require further research, especially on fixing and copying, and the latter require SLM adaptive elements beyond current availability for large-scale implementation. In this letter a new adaptive learning algorithm, TAG (Training by Adaptive Gain), is developed to train neural networks with fewer adaptive elements.

Neural Computation 3, 135-143 (1991) © 1991 Massachusetts Institute of Technology
2 Network Architecture

Let us consider a neural network with $N$ input neurons and $M$ output neurons. For fully connected adaptive neural networks one has $NM$ adaptive elements. In this model the interconnections are composed of $NM$ global fixed interconnections and $N + M$ local adaptive gain controls. Figure 1 shows this architecture in a simple form.

Figure 1: Proposed network architecture (input X, adaptive local gain controls, fixed global interconnects T, adaptive local gain controls, output Y).

In mathematical notation each output $y_i$ is represented as

$$y_i = S\Big(\sum_{j=1}^{N} v_i T_{ij} w_j x_j\Big) \qquad (2.1)$$

where $x_j$ is the activation of the $j$th input neuron, $v_i T_{ij} w_j$ is the interconnection between the $j$th input neuron and the $i$th output neuron, and $S(\cdot)$ is a sigmoid function. The interconnection consists of the fixed global interconnection $T_{ij}$ and the adaptive local gain controls $v_i$ and $w_j$. We have previously shown that adaptive learning of the input gain controls $w_j$ greatly increases storage capacity and error-correction performance for the Hopfield model (Lee et al. 1989). Unlike in the previous model, the $T_{ij}$ are predetermined in this new model. They may be randomly generated, or obtained from any learning algorithm for standard input/output patterns. It is worth mentioning that the Hopfield interconnections $T_{ij}$ look like random numbers for a huge set of independent stored patterns. For handwritten character recognition applications one may obtain the $T_{ij}$ for typed characters, and use the new learning algorithm for the $v_i$ and $w_j$ for handwritten characters. This combination is a compromise between global interconnections and adaptability for large-scale implementation.

3 Adaptive Learning Algorithm
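Before turning to the gradients, the forward pass of equation 2.1, which the learning rule differentiates, can be sketched in a few lines. This is only an illustrative sketch: the logistic form of the sigmoid, the random choice of $T_{ij}$, and the names `tag_forward` and `W_eff` are assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    # Logistic function; the text only specifies "a sigmoid function".
    return 1.0 / (1.0 + np.exp(-a))

def tag_forward(x, T, w, v):
    """Return (y, y_hat) with y_i = S(v_i * sum_j T_ij * w_j * x_j)."""
    y_hat = v * (T @ (w * x))      # argument of the sigmoid for each output
    return sigmoid(y_hat), y_hat

rng = np.random.default_rng(0)
N, M = 64, 10                          # 8 x 8 inputs, 10 outputs
T = rng.standard_normal((M, N))        # fixed global interconnections (random here)
w = rng.uniform(0.2, 2.0, size=N)      # adaptive input gains
v = rng.uniform(0.2, 2.0, size=M)      # adaptive output gains
x = rng.integers(0, 2, size=N).astype(float)

y, y_hat = tag_forward(x, T, w, v)

# The same result follows from the explicit effective weights v_i * T_ij * w_j.
W_eff = v[:, None] * T * w[None, :]
assert np.allclose(y_hat, W_eff @ x)
```

The effective weight matrix `W_eff` has $NM$ entries but only $N + M$ of its degrees of freedom are adaptive, which is the trade-off the model makes.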
We have adopted a gradient-based least-square-error minimization algorithm for the adaptive learning. The total error $E$ is defined as

$$E = \frac{1}{2}\sum_{s}\sum_{i}\left(y_i^s - t_i^s\right)^2 \qquad (3.1)$$

where $s$ is an index over classes (input-output pairs), $i$ is an index over output neurons, $y$ is the actual state of an output neuron, and $t$ is its desired state (Rumelhart et al. 1986). To minimize $E$ by the steepest descent method it is necessary to compute the partial derivatives of $E$ with respect to each of the adaptive elements, $v_i$ and $w_j$. By applying the chain rule, one obtains

$$\frac{\partial E}{\partial v_i} = \sum_{s}\delta_i^s \sum_{j} T_{ij} w_j x_j^s = \sum_{s}\delta_i^s\,\frac{\hat{y}_i^s}{v_i} \qquad (3.2)$$

and

$$\frac{\partial E}{\partial w_j} = \sum_{s}\gamma_j^s x_j^s \qquad (3.3)$$

where $\hat{y}_i^s$ is the argument of the sigmoid function in equation 2.1 with input $x_j^s$, and $\delta_i^s$ and $\gamma_j^s$ are the output and input errors, respectively, defined as

$$\delta_i^s = \left(y_i^s - t_i^s\right) S'(\hat{y}_i^s), \qquad \gamma_j^s = \sum_{i}\delta_i^s v_i T_{ij} \qquad (3.4)$$

It is worth noting that the input error $\gamma_j^s$ may be calculated by backpropagation of the output error $\delta_i^s$. This error backpropagation allows us to extend this model to multilayer architectures, which are quite similar to multilayer perceptrons. However, unlike the multilayer perceptron, the gradient calculation of this model does not involve any vector-matrix multiplication; point-to-point scalar multiplication is enough. It therefore requires much less learning time than the perceptron.

4 Simulation Results
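The learning rule of Section 3 that the simulations below rely on can be sketched as follows, together with a finite-difference check of the gradients. This is a sketch under assumptions: the sigmoid is taken to be logistic, so that $S'(a) = S(a)(1 - S(a))$, the data are random placeholders, and the helper names `tag_loss_and_grads` and `numeric_grad` are hypothetical. The backpropagated input error is computed here with a matrix product, standing in for what the backward optical path would deliver.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tag_loss_and_grads(X, Tgt, T, w, v):
    """X: (P, N) inputs, Tgt: (P, M) targets. Returns E, dE/dv, dE/dw."""
    Y_hat = ((X * w) @ T.T) * v            # (P, M) sigmoid arguments, eq. 2.1
    Y = sigmoid(Y_hat)
    E = 0.5 * np.sum((Y - Tgt) ** 2)       # total error, eq. 3.1
    delta = (Y - Tgt) * Y * (1.0 - Y)      # output errors
    gamma = (delta * v) @ T                # input errors, backpropagated through T
    grad_v = np.sum(delta * Y_hat, axis=0) / v   # only elementwise products
    grad_w = np.sum(gamma * X, axis=0)           # only elementwise products
    return E, grad_v, grad_w

def numeric_grad(f, p, k, eps=1e-6):
    """Central finite difference of f with respect to component k of p."""
    hi, lo = p.copy(), p.copy()
    hi[k] += eps
    lo[k] -= eps
    return (f(hi) - f(lo)) / (2.0 * eps)

rng = np.random.default_rng(1)
P, N, M = 5, 12, 4
X = rng.integers(0, 2, size=(P, N)).astype(float)
Tgt = np.eye(M)[rng.integers(0, M, size=P)]      # one-hot targets
T = rng.standard_normal((M, N))
w = rng.uniform(0.5, 1.5, size=N)
v = rng.uniform(0.5, 1.5, size=M)

E, grad_v, grad_w = tag_loss_and_grads(X, Tgt, T, w, v)
num_w = numeric_grad(lambda p: tag_loss_and_grads(X, Tgt, T, p, v)[0], w, 3)
num_v = numeric_grad(lambda p: tag_loss_and_grads(X, Tgt, T, w, p)[0], v, 2)
```

Comparing `num_w` with `grad_w[3]` and `num_v` with `grad_v[2]` is a quick way to confirm that the analytic expressions match the error function being minimized.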
The performance of this new architecture was tested by computer simulation of a classifier. We generated two kinds of input patterns, one standard set and three slightly deformed sets, as shown in Figure 2. A perceptron-based number classifier with 8 × 8 binary input neurons and 10 classifying output neurons was first trained on the standard number set in Figure 2(a), and the three slightly deformed number sets in Figure 2(b) were introduced later. Without retraining, the network does not correctly classify all 10 deformed patterns, whose Hamming distances from the original patterns range from 4 to 17. After retraining of the local gain-control parameters it classifies all of them correctly and even shows good error-correction performance. To further increase the error-correcting performance we add a MAXNET (Lippmann 1987) at the output layer to obtain a "winner-take-all" function. The performance then improves greatly, as shown in Figure 3(a). It is also worth noting that the MAXNET has fixed interconnections only, so adding it does not increase the number of adaptive elements. In Figure 3(b) we further extend our model to untrained random interconnection weights. Naturally the performance with random interconnections is much worse than with pretrained interconnections, but the model still shows adaptive learning capability. Also, instead of sequential output neuron assignment, we were free to select the output neuron for each input pattern, and obtained an additional performance improvement. In these cases output activations for each input pattern are first calculated with equal local gain controls, that is, $w_j = 1$ and $v_i = 1$ for all input and output neurons, and the output neuron with maximum activation is selected to be "1" while the other output neurons are set to "0" for the pattern. The TAG learning algorithm is applied only after output neurons are assigned to all stored patterns. Performance per adaptive element of the TAG model is regarded as much better than that of perceptrons. These random interconnections allow us to use predetermined hardwired interconnections for a wide variety of applications, and are very useful for practical implementations.

Figure 2: Input patterns for simulation: (a) standard set; (b) three slightly modified sets.

Two important issues remain to be discussed. For scalability to larger nets, our simulation shows that the number of stored patterns is about the number of input neurons divided by the number of output neurons. For 300 × 300 input neurons with 300 output neurons about 300 patterns
may be classified. Also, the dynamic range of analog SLMs tends to limit system performance. In our simulation the SLM dynamic range is assumed to be 10:1, that is, the minimum and maximum values of $w_j$ and $v_i$ are set to 0.2 and 2, respectively.

Figure 3: Error correction probabilities of the perceptron and the TAG model as a function of Hamming distance. (a) Pretrained interconnections (○, perceptron; × and +, TAG model with and without MAXNET, respectively). (b) Random interconnections (○, perceptron; × and +, TAG model with and without MAXNET, respectively). Dashed lines show improved performance by selecting output nodes for each input pattern. In these cases output activations for an input pattern are first calculated with equal local gain controls, that is, $w_j = 1$ and $v_i = 1$ for all input and output neurons, and the output neuron with maximum activation is selected to be "1" for the pattern. The TAG learning algorithm is implemented only after assigning output neurons for all input patterns.

5 Optical Implementation
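The SLM dynamic range assumed in the simulations carries over to an optical system, where the gains must stay within what the modulator can display. A minimal sketch of how a steepest-descent step might respect the 0.2 to 2 range, together with the capacity rule of thumb quoted in the text, is given below. Enforcing the range by clipping, the learning rate `lr`, the example gradient, and the function names are all assumptions made for illustration; the paper does not state how the range was enforced.

```python
import numpy as np

GAIN_MIN, GAIN_MAX = 0.2, 2.0      # 10:1 dynamic range quoted for the simulation

def clipped_update(gains, grad, lr=0.05):
    """One steepest-descent step, then clip to the displayable gain range."""
    return np.clip(gains - lr * grad, GAIN_MIN, GAIN_MAX)

def estimated_capacity(n_inputs, n_outputs):
    """Rule of thumb from the simulations: patterns ~ inputs / outputs."""
    return n_inputs // n_outputs

w = np.array([0.25, 1.0, 1.95])
grad_w = np.array([2.0, 0.0, -2.0])              # made-up gradient for illustration
w_next = clipped_update(w, grad_w)               # -> [0.2, 1.0, 2.0]
capacity = estimated_capacity(300 * 300, 300)    # -> 300
```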
The TAG model is actually designed for optical implementation of large-scale artificial neural networks. The resolution of available SLMs has been one of the most critical limitations on the achievable number of adaptive interconnections. In globally connected neural networks such as the perceptron and the Hopfield model, it directly limits the achievable number of input neurons multiplied by the number of output neurons. In our model, however, only the sum of the numbers of input and output neurons is limited by SLM resolution. Figure 4 shows a schematic illustration of an electrooptical implementation of the TAG model. It has two paths, both controlled by a personal computer (P.C.). At the recall stage only the upper path works. The lower path is designed for error backpropagation at the adaptive learning stage. The local gain controls $w_{kl}$ for the forward path and $v_{ij}$ for the backward path are combined with the input $x_{kl}$ and the output error $\delta_{ij}$, respectively, and implemented by two-dimensional SLMs with gray levels. The output gain controls $v_{ij}$ and $w_{kl}$ for the forward and backward paths are implemented by the P.C. Multifacet holograms store the fixed global interconnections. It is worth noting that the forward and backward paths use different $N^4$ interconnection schemes (Jang et al. 1988b) to utilize the same multifacet holograms for both paths. Calculation of error gradients requires only scalar multiplication, and can easily be done in the P.C. One may also put two-dimensional SLMs in front of the detectors for these calculations. The P.C. may also be replaced by parallel electronic hardware or incorporated in sophisticated electrooptic devices.

Figure 4: Schematic illustration for electrooptical implementation.
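The resolution argument can be made concrete by counting adaptive elements for the network size used in the scalability example of Section 4. The sketch below simply evaluates the two counts; the function name `adaptive_elements` is a hypothetical helper.

```python
def adaptive_elements(n_in, n_out):
    """Adaptive element counts: fully adaptive network vs. TAG."""
    fully_adaptive = n_in * n_out   # one adaptive weight per interconnection
    tag = n_in + n_out              # one adaptive gain per neuron
    return fully_adaptive, tag

full, tag = adaptive_elements(300 * 300, 300)
print(full, tag)   # 27000000 90300
```

For this size the SLMs need to provide roughly 9 × 10^4 gray-level pixels rather than 2.7 × 10^7, which is the practical motivation for restricting adaptation to the gains.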
6 Conclusions
In this article we have proposed a new adaptation algorithm to train fully interconnected neural networks with local gain controls only. The error correction performance of this model has been investigated and found to be reasonable. With fewer adaptive elements this model is easy to implement, and has a wide range of practical applications.

Acknowledgments

This research was supported by the Korea Science and Engineering Foundation.
References
Brady, D., Gu, X.-G., and Psaltis, D. 1988. Photorefractive crystals in optical neural computers. SPIE Proc. 882 Neural Network Models for Optical Computing, 132-136.
Farhat, N. H., Psaltis, D., Prata, A., and Paek, E. 1985. Optical implementation of the Hopfield model. Appl. Optics 24, 1469-1475.
Jang, J. S., Jung, S. W., Lee, S. Y., and Shin, S. Y. 1988a. Optical implementation of the Hopfield model for two-dimensional associative memory. Optics Lett. 13, 248-250.
Jang, J. S., Shin, S. Y., and Lee, S. Y. 1988b. Parallel $N^4$ weighted optical interconnections: Comments. Appl. Optics 27, 4364.
Jang, J. S., Shin, S. Y., and Lee, S. Y. 1989. Programmable quadratic associative memory using holographic lenslet arrays. Optics Lett. 14, 838-840.
Lee, S. Y., Jang, J. S., Park, J. S., Shin, S. Y., and Shim, C. S. 1989. Modification of the Hopfield model and its optical implementation for correlated images. SPIE Proc. 963 Optical Computing, 504-511.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4(2), 4-22.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Received 20 September 1990; accepted 22 October 1990.
Communicated by Christof Koch
Stimulus-Dependent Assembly Formation of Oscillatory Responses: I. Synchronization
Peter König, Thomas B. Schillen
Max-Planck-Institut für Hirnforschung, Deutschordenstraße 46, 6000 Frankfurt 71, Germany
Current concepts in neurobiology of vision assume that local object features are represented by distributed neuronal populations in the brain. Such representations can lead to ambiguities if several distinct objects are simultaneously present in the visual field. Temporal characteristics of the neuronal activity have been proposed as a possible solution to this problem and have been found in various cortical areas. In this paper we introduce a delayed nonlinear oscillator to investigate temporal coding in neuronal networks. We show synchronization within two-dimensional layers consisting of oscillatory elements coupled by excitatory delay connections. The observed correlation length is large compared to coupling length. Following the experimental situation, we then demonstrate the response of such layers to two short stimulus bars of varying gap distance. Coherency of stimuli is reflected by the temporal correlation of the responses, which closely resembles the experimental observations.

1 Introduction

Current theories of visual processing assume as a first step the extraction of local object features like color, velocity, disparity, etc. (Treisman and Gelade 1980; Julesz 1981; Marr 1982; Ballard et al. 1983; Malsburg and Singer 1988). This processing is considered to occur in parallel through corresponding feature detectors involving spatially separated populations of neurons in the brain. Simultaneous processing of several objects in a natural scene will then elicit superposed responses in each of the detectors. This leads to the problem of uniquely binding responding cells into the correct assemblies that code for the different objects in the visual field (Malsburg 1986; Damasio 1989). A solution to this problem by conjunction of all possible feature constellations to dedicated cardinal neurons is prohibited by the ensuing combinatorial explosion.
As a consequence, it has been suggested that temporal structure of neuronal activity would allow the unique definition of assemblies. In particular, temporal correlation of responses to the same object would provide a solution to the binding problem (Malsburg 1981; Abeles 1982; Crick 1984; Malsburg 1986; Damasio 1989). Stimulus-driven oscillations of neuronal activity have been found in various cortical areas (Freeman 1975; Gray and Singer 1987; Eckhorn et al. 1988; Gray and Singer 1989). Furthermore, stimulus-dependent synchronization and assembly formation of these oscillations have recently been demonstrated in cat visual cortex (Gray et al. 1989; Engel et al. 1990). As a consequence, first attempts have been made to include oscillatory behavior into models of visual processing (Malsburg and Schneider 1986; Sporns et al. 1989; Reitboeck et al. 1989; Wilson and Bower 1990; Hartmann and Drüe 1990; Sompolinsky et al. 1990; Kammen et al. 1990). In this paper, we investigate the temporal structure of responses in two-dimensional layers of delayed nonlinear oscillators. We demonstrate the use of excitatory delay connections for the synchronization of oscillatory responses. Closely following experimental observations, we show that the coherence of stimuli can be coded by synchronizing the oscillatory responses of spatially distributed cell assemblies.

Neural Computation 3, 155-166 (1991) © 1991 Massachusetts Institute of Technology

2 Simulation of Delayed Nonlinear Oscillators
In order to investigate temporal coding in neuronal activity, we have implemented a delayed nonlinear oscillator as a basic oscillatory element (Fig. 1A).
Figure 1: (A) Basic oscillatory element implemented by coupling an excitatory unit with an inhibitory unit using delay connections. An additional unit allows for external input of a stimulus. $t$, time; $x(t)$, unit activity; $F(x)$, output function; $w$, coupling constant; $\tau$, delay time; $i_e(t)$, external input. Subscripts: e, excitatory unit; i, inhibitory unit. For details see text. (B) Synchronization between oscillators is achieved by coupling the excitatory unit of one oscillator to the inhibitory unit of another (dashed lines). The coupling delay is chosen to be of the order of the oscillator's intrinsic delays.

An excitatory unit $u_e$ is coupled with delay $\tau_{ei}$ to an inhibitory unit $u_i$, which in turn projects back to unit $u_e$ with delay $\tau_{ie}$. An additional unit allows for external input of a stimulus. The dynamics of the system is determined by the following delay differential equations:

$$\tau_0\,\dot{x}_e(t) = -\alpha_e x_e(t) - w_{ie} F[x_i(t - \tau_{ie})] + i_e(t) + \eta_e(t)$$

$$\tau_0\,\dot{x}_i(t) = -\alpha_i x_i(t) + w_{ei} F[x_e(t - \tau_{ei})] + \eta_i(t)$$

where $t$ is time, $x(t)$ is unit activity, $\alpha$ is a damping constant, $w$ is the coupling strength ($w > 0$), $\tau$ is delay time, $i_e(t)$ is external stimulus input, and

$$F(x) = \frac{1}{1 + e^{\sigma(\Theta - x)}}$$

is a Fermi function output nonlinearity with slope $\sigma$ and threshold $\Theta$. Here, $\tau_0 = 0.5$ msec corresponds to our unit of time, and with our standard set of parameters, $\tau_0 \ll T = 20$ msec, where $T$ is the oscillation period length. $\eta(t)$ introduces white noise with variance $V[\eta(t)] = \frac{1}{12}\beta^2\tau_0$, where $\beta$ is a measure of the noise level. These differential equations evolve naturally from the description of a simple harmonic oscillator by the introduction of a nonlinear and delayed coupling. Aspects of a related system have been analyzed by Wilson and Cowan (1973). Assuming that not single neurons but rather ensembles of neurons are the essential elements of cortical information processing, this formulation represents "neuronal" activity as a real-valued function. In this context, we therefore interpret each of our units $u_e$ and $u_i$ to represent a neuronal population with a combined firing probability reflected by the output nonlinearity. The inclusion of transmission delays into the system's dynamic description is motivated by two reasons: (1) Delays are naturally present in biological networks through synaptic transmission and finite conduction velocity, and (2) in two-dimensional layers of coupled delayed oscillators, delays have a profound influence on the phase relations within the layer. The effect of varying delay time $\tau$ on the oscillatory behavior of the two coupled units is shown in Figure 2A. With no or too little delay the system relaxes to a stable fixed point determined by coupling parameters and external input. Increasing delay time sufficiently transfers the system to a stable limit cycle. Note that with the specified parameter set the minimum delay time necessary to facilitate oscillation is of the order of 0.1 of the oscillation period length, well compatible with physiological data (Gray and Singer 1989). A similar dependence of oscillatory behavior as for delay also holds for input amplitude $i_e(t)$ (Fig. 2B). Depending on the level of input
activity the system is located at a fixed point or exhibits a limit cycle oscillation. Increasing input amplitude from zero increases the amplitude of the oscillation, while leaving the frequency fairly constant. Note that the input activity is stationary and, thus, is not driving the oscillator by itself. This input dependence, therefore, allows a stimulus-dependent transition of units between a nonoscillatory and an oscillatory state. It also avoids problems of frequency coding of stimulus intensity as in Kammen et al. (1989), which interferes with the synchronization between coupled oscillators. Figure 2C demonstrates the influence of varying coupling strength $w$. As can be seen, the oscillation frequency depends only modestly on the exact value of $w$. This is physiologically reasonable if synchronization of oscillatory activity is to be employed for information processing. The brain's synaptic efficacy will vary by biological variance and modifications through learning. If oscillation frequency were sensitively dependent on synaptic coupling strength, learning might easily destroy the very synchronization that was the cause of the learned synaptic modification. The restriction to a limited frequency band thus renders synchronization less sensitive to exact biological parameters.

Figure 2: Dependence of a single oscillator's activity $x_e(t)$ on delay time $\tau$, input level $i_e$, and coupling strength $w$. Our standard set of parameters is $\alpha_e = \alpha_i = 0.1$, $w \equiv w_{ei} = w_{ie} = 1.0$, $\tau \equiv \tau_{ei} = \tau_{ie} = 4\tau_0$, $i_e(t) = 0.8$, $\sigma = 1.0$, $\Theta = 2.0$, $\beta = 0\,\tau_0^{-1/2}$, with exceptions where noted. $T = 40\tau_0$ is the period length of an oscillator with standard parameters. (A) Effect of varying delay time $\tau \equiv \tau_{ei} = \tau_{ie}$. Dotted, $\tau = 0\tau_0$; solid thin, $\tau = 2.5\tau_0$; solid thick, $\tau = 5\tau_0$; dashed thin, $\tau = 10\tau_0$; dashed thick, $\tau = 20\tau_0$. (B) Effect of varying input level $i_e \equiv i_e(t) = \mathrm{const}$. Dotted, $i_e = 0.0$; solid thin, $i_e = 0.3$; solid thick, $i_e = 0.6$; dashed thin, $i_e = 0.9$; dashed thick, $i_e = 1.2$. (C) Effect of varying coupling strength $w \equiv w_{ei} = w_{ie}$. Dotted, $w = 0.5$; solid thin, $w = 1.5$; solid thick, $w = 2.5$; dashed thin, $w = 3.5$; dashed thick, $w = 4.5$. Horizontal axes: time, from 0 to $6T$.
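The single-oscillator behavior described in this section can be explored numerically with a simple forward-Euler scheme for an excitatory unit and an inhibitory unit coupled through delayed Fermi-function outputs. The sketch below is an illustration under stated assumptions rather than the authors' simulation code: time is measured in units of $\tau_0$, noise is omitted ($\beta = 0$, as in the standard parameter set), activities before $t = 0$ are taken to be zero, and the step size `h` and the name `simulate_oscillator` are choices made here.

```python
import numpy as np

def fermi(x, sigma=1.0, theta=2.0):
    """Fermi-function output nonlinearity with slope sigma and threshold theta."""
    return 1.0 / (1.0 + np.exp(sigma * (theta - x)))

def simulate_oscillator(tau=4.0, i_e=0.8, w=1.0, alpha=0.1,
                        t_end=400.0, h=0.05):
    """Forward-Euler integration of one delayed oscillator.

    Time is in units of tau_0. Both delays equal tau, both couplings equal w.
    """
    n = int(round(t_end / h))
    d = int(round(tau / h))                 # delay expressed in time steps
    x_e = np.zeros(n + 1)
    x_i = np.zeros(n + 1)
    for k in range(n):
        # Delayed activities; history before t = 0 is assumed to be zero.
        xe_del = x_e[k - d] if k >= d else 0.0
        xi_del = x_i[k - d] if k >= d else 0.0
        dx_e = -alpha * x_e[k] - w * fermi(xi_del) + i_e   # inhibited by u_i
        dx_i = -alpha * x_i[k] + w * fermi(xe_del)         # excited by u_e
        x_e[k + 1] = x_e[k] + h * dx_e
        x_i[k + 1] = x_i[k] + h * dx_i
    return np.arange(n + 1) * h, x_e, x_i

t, x_e, x_i = simulate_oscillator(tau=4.0)      # standard delay, 4 tau_0
t0, x_e0, x_i0 = simulate_oscillator(tau=0.0)   # no delay
late = t >= 200.0
print(np.std(x_e[late]), np.std(x_e0[late]))    # late-time variability of x_e
```

Comparing the late-time variability for the two delay settings is one way to see the transition from fixed point to sustained oscillation described above; varying `i_e` or `w` in the same way corresponds to panels B and C of Figure 2.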
3 Synchronizing an Oscillatory Layer by Excitatory Delay Connections
As a next step, we investigate the behavior of two-dimensional layers of coupled oscillators of the described type. Aiming at synchronization of oscillatory activities within the layer, we introduce a coupling of the following type (Fig. 1B, dashed lines): Each oscillator's excitatory unit $u_e$ is coupled to the inhibitory units $u_i$ of all its nearest-neighbor oscillators. The coupling weights $w_{ei}^{(1)}$ are chosen to be isotropic. The delay time $\tau_{ei}^{(1)}$ is of the order of the oscillator's intrinsic delays. With this type of delay coupling every oscillator will excite all its neighboring inhibitory units at the same time as its own. By this arrangement every oscillator is promoting synchronized oscillations of its neighboring oscillators with zero phase lag, as required by experimental evidence (Gray et al. 1989). Figure 3 demonstrates synchronization within a 14 × 7 oscillatory layer. Figure 3A shows activity traces of 20 units arbitrarily selected from the layer. Throughout the simulation all oscillators receive identical constant input $i_e(t)$ corresponding to a limit cycle oscillation. For $t < 0$ all the oscillators are isolated and desynchronized by a high noise level. For $t \geq 0$ the $w_{ei}^{(1)}$-connections are enabled, which are able to rapidly synchronize the layer within very few oscillation cycles. The top part of Figure 3B represents the oscillation phases of all oscillators in the layer at $t = 8T$. The apparent homogeneity reflects the layer's synchronized state. The bottom of Figure 3B shows oscillation phases at $t = 8T$ for a control simulation, in which the synchronizing connections were not enabled. Note that with the specified coupling, the layer's correlation length is much larger than the implemented coupling length. This is achieved without the use of a mean field comparator as proposed to be necessary by Kammen et al. (1990).

Note also that the coupling delay $\tau_{ei}^{(1)} = 0.1T$ is small compared to the oscillation period length, in correspondence to physiological observations (Gray and Singer 1989). The synchronization of the layer does not critically depend on the exact value of the coupling delay, in agreement with the observations for a system of two oscillators reported by Schuster and Wagner (1989). In particular, synchronization was verified for this model for uniform coupling delays $\tau_{ei}^{(1)} = 0.1T, \ldots, 0.5T$ as well as for a rectangular delay distribution $\tau_{ei}^{(1)} \in [0.1T, 0.5T]$. In order not to increase the number of parameters in the model unnecessarily we use in the following only a single uniform coupling delay.

Figure 3: Synchronizing an oscillatory layer by excitatory delay connections. (A) Activity traces of 20 excitatory units arbitrarily selected from a layer of 14 × 7 delayed nonlinear oscillators. $t < 0$, isolated oscillators desynchronized by high noise level. $t > 0$, synchronizing the entire layer by enabling nearest-neighbor excitatory delay connections ($w_{ei}^{(1)}$). High noise level maintained throughout. Cyclic boundary conditions. $T$, period length of isolated oscillator. Time axis from $-5T$ to $15T$. Notation: Throughout this paper, $w_{ei}^{(r)}$ denotes the (isotropic) coupling weights, with which an oscillator is coupled to its $8r$ neighboring oscillators located on the surrounding square of edge length $2r + 1$ oscillators ($r$-nearest-neighbor coupling). (B, top) Activity-phase map of all oscillators at $t = 8T$. Each circle represents a single oscillator. Activity is coded by circle radius, oscillation phase by shading ($0 \ldots 2\pi$). (B, bottom) Activity-phase map at $t = 8T$ from a control simulation that did not enable $w_{ei}^{(1)}$-connections. Parameters: $t < 0$, standard set; $t > 0$, standard set and $w_{ei}^{(1)} = 0.08$, $w_{ei} = 0.8$, $w_{ie} = 1.0$, $\tau_{ei}^{(1)} = \tau_{ei} = \tau_{ie} = 4\tau_0$; $\beta = 0.4\,\tau_0^{-1/2}\ \forall t$.
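The coupling scheme of this section, in which each excitatory unit drives the inhibitory units of the $8r$ oscillators on the surrounding square ring, can be sketched with cyclic array shifts. The following is a hedged illustration, not the authors' implementation: noise is replaced by random initial conditions as the source of desynchronization, the history before $t = 0$ is held at the initial state, only the $r = 1$ ring is coupled, and the names `ring_offsets`, `neighbor_sum`, and `simulate_layer` are hypothetical. Whether and how fast the layer synchronizes under these simplifications is left for the reader to inspect.

```python
import numpy as np

def fermi(x, sigma=1.0, theta=2.0):
    return 1.0 / (1.0 + np.exp(sigma * (theta - x)))

def ring_offsets(r):
    """Offsets (dy, dx) of the 8r sites on the square ring of edge 2r + 1."""
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if max(abs(dy), abs(dx)) == r]

def neighbor_sum(a, r):
    """Sum of a over the 8r ring neighbors of every site (cyclic boundaries)."""
    total = np.zeros_like(a)
    for dy, dx in ring_offsets(r):
        total += np.roll(a, shift=(dy, dx), axis=(0, 1))
    return total

def simulate_layer(shape=(7, 14), w1=0.08, w_ei=0.8, w_ie=1.0, alpha=0.1,
                   i_e=0.8, tau=4.0, t_end=200.0, h=0.05, seed=0):
    """Forward-Euler integration of a layer; time in units of tau_0."""
    rng = np.random.default_rng(seed)
    n, d = int(round(t_end / h)), int(round(tau / h))
    x_e = np.zeros((n + 1,) + shape)
    x_i = np.zeros((n + 1,) + shape)
    x_e[0] = rng.uniform(-2.0, 8.0, shape)   # desynchronized starting state
    x_i[0] = rng.uniform(0.0, 10.0, shape)
    for k in range(n):
        j = max(k - d, 0)                    # delayed index, constant history
        f_e, f_i = fermi(x_e[j]), fermi(x_i[j])
        # Each inhibitory unit is excited by its own and its 8 neighboring
        # excitatory units, all with the same delay.
        drive_i = w_ei * f_e + w1 * neighbor_sum(f_e, 1)
        x_e[k + 1] = x_e[k] + h * (-alpha * x_e[k] - w_ie * f_i + i_e)
        x_i[k + 1] = x_i[k] + h * (-alpha * x_i[k] + drive_i)
    return x_e, x_i

x_e, x_i = simulate_layer()
spread_start = x_e[0].std()     # spatial spread of activity at t = 0
spread_end = x_e[-1].std()      # spatial spread at the end of the run
```

Setting `w1=0.0` gives the corresponding control run with isolated oscillators, analogous to the bottom panel of Figure 3B.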
4 Coherency Detection by Coupled Oscillators
We now demonstrate the response of a two-dimensional layer of delayed nonlinear oscillators to stimulus bar segments.
A 10 × 20 layer of oscillators is configured with nearest, next nearest, and double next nearest-neighbor coupling of the described type ($w_{ei}^{(1)}$, $w_{ei}^{(2)}$, $w_{ei}^{(3)}$). Coupling weights are again isotropic and represent a gaussian distribution of synaptic connectivity. Some level of noise is maintained to represent fluctuations in oscillatory activity and to allow symmetry breaking. Each oscillator in the layer is interpreted to represent an entire retinal receptive field (RF). In this example, we restrict ourselves to cells that show no direction selectivity and that are all of identical orientation preference. With this interpretation an oscillator's external input $i_e(t)$ reflects the presence of an appropriate light bar stimulus moving across the pertaining RF. Correspondingly, movement of a light bar on the retina will provide a stimulus to the covering map of RFs and their pertaining oscillators. Following the experimental situation (Gray et al. 1989, Fig. 3), Figure 4 depicts the simulated response of the layer to two short light bars separated by varying gap distances (4, 2, and 0 oscillator positions). Each single bar segment provides homogeneous input $i_e(t)$ to an area of 2 × 5 oscillators. The data for each stimulus condition are presented in separate columns of the figure. Figure 4A shows the distribution of external input to the layer. The oscillators analyzed for the cross correlograms of activities shown in (B) and (C) are marked by numbered white dots. Panel (B) depicts cross correlations (2-3) between stimulus segments for 20 epochs of 20 cycles each. The average of these correlations as compared to cross correlations within stimulus segments (1-2, 3-4) is shown in (C). As demonstrated in the previous section, oscillators within every single bar segment are tightly coupled and cross correlations show zero phase lag. In the case of no gap distance (Fig. 4, right column) the two bar segments form a continuous long bar, which then is completely coupled without phase lag across its entire area. Cross correlation (2-3) coincides with correlations (1-2) and (3-4). In the other extreme, with the gap distance exceeding the range of synchronizing connections (left column), coupling is restricted to each bar segment's area. Between segments, the oscillators' activities relative to each other shift through all phases, resulting in a minimum cross correlation (2-3). With an intermediate gap distance (middle column) coupling between bar segments is still established but is less stringent. Phase differences between segments vary somewhat around zero, leading to a reduced amplitude in the cross correlogram. Note, however, that as in the case of the continuous long bar there is no phase lag between the oscillatory responses induced by the two segments, as required by experimental evidence (Gray et al. 1989; Engel et al. 1990).
Peter Konig and Thomas B. Schillen
162
B-~T
c
t 3T
0
t 3T
Time
3T
I
0
Time
r
1
L
I
Figure 4: Effect of stimulus coherency on cross correlations in a two-dimensional layer of delayed nonlinear oscillators. (A) Stimulus configurations for two short light bars with gap distances of 4 and 2 RFs, and one continuous long bar. The oscillators analyzed for cross correlations of activities are marked by numbered white dots. (B) Normalized cross correlations (2-3) between stimulus bar segments for 20 epochs of 20 T. Normalization by geometric mean of the two auto correlations. (C) Mean normalized cross correlations within (1-2, 3-4) (dashed) and between (2-3) (solid) stimulus bar segments. Mean of 20 epochs of 20 T. Cross correlations (2-3) between stimulus segments correspond to stimulus coherency, in agreement with experimental observations (Gray et al. 1989). Parameters: standard set and $i_e(t) = 0.8$ where depicted (black boxes), $i_e(t) = 0$ elsewhere, $w_{ie}^{(1)} = 0.05$, $w_{ie}^{(2)} = 0.035$, $w_{ie}^{(3)} = 0.01$, $\tau_{ie}^{(1)} = \tau_{ie}^{(2)} = \tau_{ie}^{(3)} = 4\tau_0$, $\beta = 0.1\,\tau_0^{-1/2}$.
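The normalization named in the caption of Figure 4 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the analysis code used for the figure: the function names, the mean subtraction, the use of the zero-lag autocorrelations, and the sinusoidal test signals are all introduced here. Only the division by the geometric mean of the two autocorrelations and the averaging over 20 epochs of 20 T come from the caption.

```python
import numpy as np

def normalized_cross_correlogram(x, y, max_lag):
    """Cross correlation of x and y for lags -max_lag..max_lag, divided by
    the geometric mean of the two zero-lag autocorrelations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()  # assumption: correlate fluctuations around the mean
    y = y - y.mean()
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for k, lag in enumerate(lags):
        if lag >= 0:
            cc[k] = np.dot(x[:n - lag], y[lag:])
        else:
            cc[k] = np.dot(x[-lag:], y[:n + lag])
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return lags, cc / norm

def mean_correlogram(x, y, n_epochs, max_lag):
    """Average of the per-epoch normalized correlograms."""
    xs = np.array_split(np.asarray(x, dtype=float), n_epochs)
    ys = np.array_split(np.asarray(y, dtype=float), n_epochs)
    per_epoch = [normalized_cross_correlogram(a, b, max_lag)[1]
                 for a, b in zip(xs, ys)]
    return np.mean(per_epoch, axis=0)

# Hypothetical test signals: period T = 20 samples, 20 epochs of 20 T each.
T = 20
t = np.arange(20 * 20 * T)
in_phase = np.sin(2 * np.pi * t / T)
quadrature = np.cos(2 * np.pi * t / T)
c_same = mean_correlogram(in_phase, in_phase, n_epochs=20, max_lag=3 * T)
c_quad = mean_correlogram(in_phase, quadrature, n_epochs=20, max_lag=3 * T)
```

With this normalization a pair of identical signals has a correlogram value of 1 at zero lag, while a pair held a quarter period apart has a value near 0 at zero lag. Averaging over epochs then reduces the peak whenever the phase difference drifts from epoch to epoch, which is the effect described for the intermediate gap distance.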
5 Conclusions
The results presented in this paper demonstrate that neighbor coupling by the described excitatory delay connections ($w_{ie}^{(r)}$) is well suited to establish zero phase lag synchronization within two-dimensional oscillatory layers (Fig. 3). This synchronization exhibits a correlation length that is large compared to the employed coupling length. Synchronization does not critically depend on the exact value of the coupling delay and it is robust against noise. The finding that synchronization by neighbor coupling necessarily leads to phase lags (Kammen et al. 1990) cannot be confirmed within our system.
In the formulation of our model, each oscillator is meant to represent an entire neuronal population and the oscillator's activity is, therefore, specified as a continuous function. In this context, the oscillator's output function reflects the combined firing probability of all the neurons in the ensemble. This approach follows from the assumption that not single neurons but rather ensembles of neurons are essential for information processing in the brain. A conversion of the present model into one using a detailed spike description should pose no major problems. That essential characteristics of the oscillatory neuronal behavior can be formulated in a continuous model is shown by the results of our simulations.
The inclusion of delays into the analysis of temporal coding by oscillatory activity extends the approaches presented by others (Sporns et al. 1989; Reitboeck et al. 1989; Sompolinsky et al. 1990; Kammen et al. 1990). Considering a synaptic delay of 1 msec, an intracortical conduction velocity of the order of 1 mm msec$^{-1}$ (Luhmann et al. 1990), and an oscillation period in cat visual cortex of about 20 msec (Gray and Singer 1989), intracortical transmission delays amount to approximately 0.1 of the oscillation period length. Delays may therefore have a substantial influence on the temporal characteristics of oscillatory activity in the brain.
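The estimate of the delay as a fraction of the oscillation period can be retraced with a few lines of Python. This is only a sketch of the arithmetic: the synaptic delay, conduction velocity, and period are the values quoted above, whereas the connection length of 1 mm and the function name `delay_fraction` are assumptions made here, since the text does not state a distance.

```python
def delay_fraction(distance_mm, synaptic_delay_ms=1.0,
                   velocity_mm_per_ms=1.0, period_ms=20.0):
    """Transmission delay (synaptic plus conduction) as a fraction of the period."""
    conduction_ms = distance_mm / velocity_mm_per_ms
    return (synaptic_delay_ms + conduction_ms) / period_ms

# Assumption: a horizontal connection spanning about 1 mm of cortex.
fraction = delay_fraction(distance_mm=1.0)
print(f"delay / period = {fraction:.2f}")
```

Under this assumption the total delay is 2 msec, that is 0.1 of a 20 msec period. Even a connection of negligible length would still contribute 0.05 of the period through the synaptic delay alone.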
The results reported in this and the following paper (Schillen and König 1991) demonstrate the effects coupling delays can have on the temporal structure of oscillatory responses in layers of delayed nonlinear oscillators. In particular, the simulations presented in this paper show synchronization in layers of this type by an appropriate choice of excitatory delay connections ($w_{ie}^{(r)}$). We verified that synchronization occurs for a wide range of coupling delays as well as for a distribution of coupling delays within the same layer. At this stage, we did not want to increase the model's number of parameters unnecessarily. Therefore, we used only one delay for all connections within a layer for the current simulations. The inverse of our damping constant $\alpha$ was chosen to be compatible with ranges of physiological membrane time constants ($\alpha^{-1} = 10\,\tau_0 = 5$ msec) (Connors et al. 1982; McCormick et al. 1985). We also checked our results for a parameter set using $\alpha^{-1} = 10$ msec. For the current study we did not further extend the range of investigated values of $\alpha$.
The described model represents stimulus intensity by oscillation amplitude and codes for stimulus coherence by the phase of the oscillation. This avoids the problems of frequency coding of stimulus intensity as used by Kammen et al. (1990). In particular, the layer exhibits oscillatory activity only at locations where a stimulus is applied, and the stimulus response, therefore, need not be segregated from background oscillations. This agrees with experimental evidence, which demonstrates the nonoscillatory character of spontaneous neuronal activity. Furthermore, the model qualitatively shows the same temporal coherence relations in response to stimulus bar segments (Fig. 4) as the physiological data (Gray et al. 1989). This also includes the observed residual coupling between responses to two coherently moving stimuli separated by a small gap. This residual correlation comes closer to the experimental observations (Gray et al. 1989) than the behavior exhibited by the model of Sporns et al. (1989). In addition, the local structure of the employed coupling allows sufficiently separated stimuli of identical intensity to generate independent oscillatory patterns. This agrees with experimental data and contrasts with the effect of a global mean field comparator as proposed by Kammen et al. (1990). With the restriction to cells without direction selectivity, the current simulation cannot show the loss of synchronicity in response to stimulus bars moving in opposite directions (Gray et al. 1989). What is also missing so far is the interaction of cells of different orientation preferences. These issues will be addressed in the following paper (Schillen and König 1991).
Acknowledgments
We would like to thank Wolf Singer for valuable discussions on the physiological background. We thank H. Sompolinsky and D. Kleinfeld for useful discussions. Jan C. Vorbrüggen helped us with his outstanding expertise in computer operation. We are grateful to Wolf Singer, Jan C. Vorbrüggen, and Julia Delius for comments on the first draft of this paper. Renate Ruhl provided excellent graphical assistance.
References
Abeles, M. 1982. Local Cortical Circuits. An Electrophysiological Study. Springer-Verlag, Berlin.
Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. 1983. Parallel visual computation. Nature (London) 306, 21-26.
Connors, B. W., Gutnick, M. J., and Prince, D. A. 1982. Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol. 48, 1303-1320.
Crick, F. 1984. Function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. U.S.A. 81, 4586-4590.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1990. Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Comp. 2, 293-307.
Engel, A. K., König, P., Gray, C. M., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 588-606.
Freeman, W. J. 1975. Mass Action in the Nervous System. Academic Press, New York.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Gray, C. M., and Singer, W. 1987. Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Soc. Neurosci. Abstr. 13(404.3).
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Hartmann, G., and Drüe, S. 1990. Self organization of a network linking features by synchronization. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, ed., pp. 361-364. Elsevier, Amsterdam.
Julesz, B. 1981. Textons, the elements of texture perception and their interaction. Nature (London) 290, 91-97.
Kammen, D. M., Holmes, P. J., and Koch, C. 1990. Origin of oscillations in visual cortex: Feedback versus local coupling. In Models of Brain Function, R. M. J. Cotterill, ed., pp. 273-284. Cambridge University Press, Cambridge.
König, P., and Schillen, T. B. 1990. Segregation of oscillatory responses by conflicting stimuli - Desynchronizing connections in neural oscillator layers. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, ed., pp. 117-120. Elsevier, Amsterdam.
Luhmann, H. J., Greuel, J. M., and Singer, W. 1990. Horizontal interactions in cat striate cortex: II. A current source-density analysis. Eur. J. Neurosci. 2, 358-368.
Marr, D. 1982. Vision. Freeman, New York.
McCormick, D. A., Connors, B. W., Lighthall, J. W., and Prince, D. A. 1985. Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol. 54, 782-806.
Reitboeck, H. J., Eckhorn, R., Arndt, M., and Dicke, P. 1989. A model of feature linking via correlated neural activity. In Synergetics of Cognition, H. Haken and M. Stadler, eds., pp. 112-125. Springer-Verlag, Berlin.
Schillen, T. B. 1990. Simulation of delayed oscillators with the MENS general purpose modelling environment for network systems. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, ed., pp. 135-138. Elsevier, Amsterdam.
Schillen, T. B., and König, P. 1990. Coherency detection by coupled oscillatory responses - Synchronizing connections in neural oscillator layers. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, ed., pp. 139-142. Elsevier, Amsterdam.
Schillen, T. B., and König, P. 1990. Coherency detection and response segregation by synchronizing and desynchronizing delay connections in a neuronal oscillator model. In International Joint Conference on Neural Networks, IEEE Neural Networks Council, ed., pp. II-387-II-395, San Diego, CA. IEEE.
Schillen, T. B., and König, P. 1991. Stimulus-dependent assembly formation of oscillatory responses: II. Desynchronization. Neural Comp. 3, 167-177.
Schuster, H. G., and Wagner, P. 1989. Mutual entrainment of two limit cycle oscillators with time delayed coupling. Prog. Theor. Phys. 81(5), 939-945.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Sporns, O., Gally, J. A., Reeke, G. N. Jr., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Treisman, A. M., and Gelade, G. 1980. A feature-integration theory of attention. Cogn. Psychol. 12, 97-136.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen, F.R.G.
von der Malsburg, C. 1986. Am I Thinking Assemblies? In Brain Theory, G. Palm and A. Aertsen, eds., pp. 161-176. Springer-Verlag, Berlin.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
von der Malsburg, C., and Singer, W. 1988. Principles of cortical network organization. In Neurobiology of Neocortex, P. Rakic and W. Singer, eds., pp. 69-99. John Wiley & Sons, New York. Dahlem Konferenzen.
Wilson, M. W., and Bower, J. M. 1990. Computer simulation of oscillatory behavior in cerebral cortical networks. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 84-91. Morgan Kaufmann, San Mateo, CA.
Wilson, H. R., and Cowan, J. D. 1973. A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik 13, 55-80.
Received 6 July 1990; accepted 12 November 1990.
Communicated by Christof Koch
Stimulus-Dependent Assembly Formation of Oscillatory Responses: II. Desynchronization
Thomas B. Schillen, Peter König
Max-Planck-Institut für Hirnforschung, Deutschordenstraße 46, 6000 Frankfurt 72, Germany
Recent theoretical and experimental work suggests a temporal structure of neuronal spike activity as a potential mechanism for solving the binding problem in the brain. In particular, recordings from cat visual cortex demonstrate the possibility that stimulus coherency is coded by synchronization of oscillatory neuronal responses. Coding by synchronized oscillatory activity has to avoid bulk synchronization within entire cortical areas. Recent experimental evidence indicates that incoherent stimuli can activate coherently oscillating assemblies of cells that are not synchronized among one another. In this paper we show that appropriately designed excitatory delay connections can support the desynchronization of two-dimensional layers of delayed nonlinear oscillators. Closely following experimental observations, we then present two examples of stimulus-dependent assembly formation in oscillatory layers that employ both synchronizing and desynchronizing delay connections: First, we demonstrate the segregation of oscillatory responses to two overlapping but incoherently moving stimuli. Second, we show that the coherence of movement and location of two stimulus bar segments can be coded by the correlation of oscillatory activity.
1 Introduction
As outlined in the preceding paper (König and Schillen 1991), current theories of visual processing lead to the problem of binding distributed feature responses into unique representations for several distinct objects in the visual field (Malsburg 1986; Malsburg and Schneider 1986; Damasio 1989). As a potential solution to this problem it has been proposed that the temporal structure of neuronal activities serves to define cell assemblies that code for particular objects (Malsburg 1981; Abeles 1982; Malsburg and Schneider 1986; Damasio 1989). Meanwhile, this concept of temporal coding has received support from physiological evidence from cat visual cortex (Gray and Singer 1987; Eckhorn et al. 1988; Gray and Singer 1989; Gray et al. 1989; Engel et al. 1990b).
Neural Computation 3, 167-178 (1991) © 1991 Massachusetts Institute of Technology
The preceding paper (König and Schillen 1991) addressed the topic of coding stimulus coherency by synchronization of oscillatory activity in two-dimensional layers of delayed nonlinear oscillators. Coding by coupled oscillations requires that synchronization is selective and does not lead to bulk synchronization of entire cortical areas. Utilization of the available phase space requires uncorrelated oscillation of different neuronal assemblies. Assemblies coding for two partially overlapping but distinct objects in the visual field should segregate by engaging in independent oscillatory patterns. These considerations are now also supported by recent experimental observations (Engel et al. 1990a). In order to allow differentiating features, like different velocities, disparities, etc., to segregate assemblies representing different objects, a desynchronizing mechanism must be present. In this paper we describe a second type of excitatory delay connection suitable for this task. Closely following experimental observations, we then present two examples of stimulus-dependent assembly formation in oscillatory layers. The first is a simulation of the experiment by Engel et al. (1990a), which demonstrates the segregation of oscillatory responses to two overlapping but incoherently moving stimuli. The second extends the model described in the preceding paper to the experimental condition where synchronization depends on the coherence of movement and location of two collinear stimulus segments (Gray et al. 1989).
2 Desynchronizing an Oscillatory Layer by Excitatory Delay Connections
In order to provide a desynchronizing mechanism within layers of delayed nonlinear oscillators (König and Schillen 1991), we introduce a second type of excitatory delay connection (Fig. 1, dotted lines): Each oscillator's excitatory unit $u_e$ is coupled to the excitatory units $u_e'$ of all its next nearest-neighbor oscillators. The coupling weights $w_{ee}^{(2)}$ are chosen to be isotropic. The delay time $\tau_{ee}^{(2)}$ is of the order of the oscillator's intrinsic delays ($\tau_{ee}^{(2)} = 2\tau_{ie} = 8\tau_0$), compatible with physiological delay times. [For a description of symbols refer to König and Schillen (1991).] This type of delay coupling tends to establish a nonzero phase relation between coupled oscillators. In particular, with $\tau_{ee}^{(2)} = \tau_{ie} = 4\tau_0$ one oscillator drives the other into a phase lag of $\pi/2$. Within a two-dimensional layer the local solutions cannot all be reconciled with each other simultaneously. This leads to a frustrated system, which in the presence of some noise exhibits quickly varying phase relations of all oscillators. Figure 2 demonstrates this desynchronizing behavior for a 14 x 7 oscillatory layer. Figure 2A shows activity traces of 20 units arbitrarily selected from the layer. Throughout the simulation all oscillators receive identical constant input $i_e(t)$ corresponding to a limit cycle oscillation. For t < 0 all the oscillators are isolated and initialized in a synchronized
Figure 1: Investigated types of delay coupling within layers of delayed nonlinear oscillators. Dashed, (short-range) synchronizing connections (König and Schillen 1991); dotted, (long-range) desynchronizing connections.
state. A small amount of noise is applied to break the symmetry of the system. At t = 0 the $w_{ee}^{(2)}$ connections are enabled and noise is switched off. For t > 0, the activity traces show that the layer desynchronizes within a few oscillation cycles. The top of Figure 2B represents the oscillation phases of all oscillators in the layer at t = 12 T. The desynchronization of the layer is shown by the heterogeneity of the distribution of the phases. The bottom part of Figure 2B shows oscillation phases at t = 12 T for a control simulation in which the desynchronizing connections were not enabled. This simulation demonstrates that suitably chosen connections between excitatory units are able to desynchronize different "neuronal" oscillators. We verified that this desynchronization does not critically depend on the exact value of the coupling delay.
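One way to put a number on the heterogeneity of a phase map such as the one in Figure 2B is the length of the mean unit phasor. The sketch below is an illustration only: this coherence measure and the two hypothetical phase maps are assumptions introduced here, not an analysis reported in the paper, which shows the phases graphically. Only the layer size of 14 x 7 is taken from the text.

```python
import numpy as np

def phase_coherence(phases):
    """Length of the mean unit phasor: 1 for identical phases,
    close to 0 for phases spread evenly over the cycle."""
    phases = np.asarray(phases, dtype=float)
    return float(np.abs(np.mean(np.exp(1j * phases))))

n_oscillators = 14 * 7  # layer size from the text

# Hypothetical phase maps standing in for the two panels of Figure 2B.
synchronized = np.zeros(n_oscillators)  # control: all phases equal
spread = np.linspace(0.0, 2.0 * np.pi, n_oscillators, endpoint=False)

r_sync = phase_coherence(synchronized)
r_spread = phase_coherence(spread)
```

A value near 1 would correspond to the control simulation without desynchronizing connections, and a value near 0 to a layer whose phases cover the whole cycle. How phases are read off the activity traces is left open here.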
3 Stimulus-Dependent Segregation of Oscillatory Responses
We now want to demonstrate the stimulus-dependent segregation of oscillators into different "neuronal" assemblies, as suggested by experimental evidence (Engel et al. 1990a). For this purpose we use a one-dimensional chain of 8 oscillators, which is now coupled by both synchronizing (König and Schillen 1991) and desynchronizing delay connections as shown in Figure 1. The coupling length of the desynchronizing connections (next nearest neighbor) is chosen to be larger than that of the synchronizing ones (nearest neighbor). The desynchronizing coupling weights $w_{ee}^{(2)}$ are set to about half the
Figure 2: Desynchronizing an oscillatory layer by excitatory delay connections. (A) Activity traces of 20 excitatory units arbitrarily selected from a layer of 14 x 7 delayed nonlinear oscillators (König and Schillen 1991). t < 0, isolated oscillators initialized in a synchronized state at low noise level for symmetry breaking. t > 0, desynchronizing the entire layer by enabling next nearest-neighbor excitatory delay connections ($w_{ee}^{(2)}$). No noise. Cyclic boundary conditions. T, period length of isolated oscillator. (B, top) Activity-phase map of all oscillators at t = 12 T. Each circle represents a single oscillator. Activity is coded by circle radius, oscillation phase by shading (0 to 2π). (B, bottom) Activity-phase map at t = 12 T from a control simulation that did not enable $w_{ee}^{(2)}$ connections. Parameters: t < 0, standard set (cf. König and Schillen 1991) and $\beta = 0.2\,\tau_0^{-1/2}$; t > 0, standard set and $w_{ee}^{(2)} = 0.01$, $\tau_{ee}^{(2)} = 8\tau_0$, $\beta = 0\,\tau_0^{-1/2}$.
synchronizing weights $w_{ie}^{(1)}$, and the delay times are $\tau_{ee}^{(2)} \approx 0.2\,T$. To allow for fluctuations in neuronal activity, some level of noise is maintained throughout the simulation. Each of the oscillators is meant to represent a receptive field (RF) of a different preferred orientation at identical "retinal" location. For this simulation we assume a continuous sequence of 8 orientations in steps of 22.5° (Fig. 3A). All RFs are considered to exhibit a Gaussian orientation tuning.
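The connectivity of this chain can be written down as two weight matrices, one per connection type. The following Python sketch is a minimal illustration under stated assumptions: the helper `ring_coupling`, the matrix representation, and the cyclic closure of the chain (chosen here because orientation wraps around at 180°) are not taken from the paper. The weights 0.1 and 0.04 and the delays of 4 and 8 in units of τ0 are the values listed in the caption of Figure 3.

```python
import numpy as np

def ring_coupling(n, offset, weight, cyclic=True):
    """n x n matrix with `weight` between oscillators `offset` positions apart."""
    w = np.zeros((n, n))
    for i in range(n):
        for j in (i - offset, i + offset):
            if cyclic:
                w[i, j % n] = weight   # wrap around the ends of the chain
            elif 0 <= j < n:
                w[i, j] = weight       # open chain: drop missing neighbors
    return w

N_OSCILLATORS = 8

# Synchronizing connections: nearest neighbors, delay 4 (units of tau_0).
W_SYNC = ring_coupling(N_OSCILLATORS, offset=1, weight=0.1)
TAU_SYNC = 4

# Desynchronizing connections: next nearest neighbors, delay 8 (units of tau_0).
W_DESYNC = ring_coupling(N_OSCILLATORS, offset=2, weight=0.04)
TAU_DESYNC = 8
```

Keeping the two connection types in separate matrices makes it easy to switch the desynchronizing connections off for a control run, by passing a zero matrix in place of `W_DESYNC`.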
As in the experiment (Engel et al. 1990a1, we "record" from two oscillators, whose preferred orientations differ by 45" (112.5", 157.5') (Fig. 3A, hatched). We present two distinct stimulus paradigms: (1) a single stimulus bar of intermediate orientation (135") (Fig. 3B-E, left column) and (2) two superimposed stimuli oriented at 0" and 90" (Fig. 3B-E, right column). Panels (C) depict the corresponding external inputs to the oscillator chain in accordance with the assumed orientation tuning. Input to the two monitored oscillators is identical for both stimulus conditions. Panels (D) show the resulting activity traces. In the case of the single stimulus both oscillators are well synchronized and thus belong to the same oscillatory assembly. With the superimposed stimuli each of the monitored oscillators couples to one of the two assemblies representing the two presented stimulus orientations. Because of the desynchronizing connections these assemblies are driven out of phase, while the amplitudes of the oscillators' activities remain unchanged as compared to the corresponding single stimulus conditions. The phase relationship between the two monitored oscillators is indicated by the cross correlograms in panels (El. The simulation shows that two oscillators can couple to different "neuronal" assemblies in a stimulus-dependent manner, as demonstrated by physiological experiments (Engel et al. 1990a). The synchronizing connections enable an oscillator to couple also to assemblies representing suboptimal orientation preferences, again consistent with experimental evidence (Gray et al. 1989). The inclusion of desynchronizing connections with a coupling length greater than that of the synchronizing ones establishes stimulus-dependent correlation lengths. Thus, the correlation length of an assembly activated by a single stimulus is larger than that found in one of the ensembles coding for the two superimposed stimuli. 
This is the origin of the decoupling in the case of the two conflicting stimuli. Without desynchronizing connections every sufficiently overlapping input configuration would readily synchronize completely. In particular, as predicted, the oscillatory responses to the two superimposed stimuli (Fig. 3D, right) become synchronized if the desynchronizing connections are eliminated (data not shown). Note that the stimulus-specific variation of correlation length cannot be achieved by simply choosing an appropriate noise level in a system containing only synchronizing connections. Note also that with the superimposed stimuli the desynchronizing excitatory connections affect only the phase relation, not the activity amplitudes, of the monitored oscillators, as compared to the single stimulus condition. The interpretation of the described oscillator chain as orientation-selective cells only serves to demonstrate a principle. The above results extend canonically to other stimulus modalities. The oscillator chain could, for example, be equally well interpreted as a sequence of RFs having different velocity preferences but identical preferred orientation.
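The two stimulus conditions described above can be made concrete with a small sketch of the external input to the eight-oscillator chain. This is only an illustration under stated assumptions: the tuning width `sigma`, the unit peak amplitude, and the rule that superimposed stimuli combine by a pointwise maximum are all choices made here and are not given in the text. With these choices, the two monitored units (112.5° and 157.5°) receive the same input in both conditions.

```python
import numpy as np

# Preferred orientations of the eight oscillators, in steps of 22.5 degrees.
PREFERRED = np.arange(8) * 22.5

def orientation_distance(a, b):
    """Smallest angular difference between two orientations (period 180 deg)."""
    d = np.abs(a - b) % 180.0
    return np.minimum(d, 180.0 - d)

def stimulus_input(stimuli, sigma=20.0):
    """Gaussian-tuned input to each oscillator for a list of bar orientations.

    Assumptions of this sketch (not from the paper): sigma = 20 degrees,
    peak input of 1.0, and pointwise maximum over superimposed stimuli.
    """
    drive = [np.exp(-0.5 * (orientation_distance(PREFERRED, s) / sigma) ** 2)
             for s in stimuli]
    return np.max(drive, axis=0)

single = stimulus_input([135.0])        # one bar of intermediate orientation
double = stimulus_input([0.0, 90.0])    # two superimposed bars
monitored = [5, 7]                      # indices of the 112.5 and 157.5 degree units
print(single[monitored], double[monitored])
```

Under these assumptions the difference between the two conditions lies only in the input to the remaining oscillators, which is what allows the monitored units to join one assembly or two.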
Figure 3: Stimulus-dependent assembly formation in a one-dimensional chain of delayed nonlinear oscillators. (A) Eight oscillators representing orientation-selective cells with identical receptive field location. (B) Stimulus conditions of one (left) and two (right) light bars. The corresponding stimulus input to the oscillators is shown in (C). (D) Activity traces from the 112.5° and 157.5° units [(A), hatched]. (E) Mean normalized auto (dashed) and cross (solid) correlations of the units shown in (D). Mean of 20 epochs of 20T. Normalization by geometric mean of the two auto correlations. Parameters: standard set and s e ( t ) as specified in 0, wb',' = 0.1, wi?J = 0.04, 7:;' = 4.r0, 7:: = 870, p = 0.2~;''~.
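The normalization named in the caption, dividing the cross correlation by the geometric mean of the two autocorrelations, can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the removal of the mean before correlating, and the use of the zero-lag autocorrelations for the normalization are assumptions made here, and averaging over epochs is left out.

```python
import numpy as np

def normalized_cross_correlogram(x, y, max_lag):
    """Cross correlation of x and y for lags -max_lag..max_lag.

    Normalized by the geometric mean of the two zero-lag autocorrelations,
    so identical signals give 1.0 at lag 0. Mean removal is an assumption.
    """
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y))
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for idx, k in enumerate(lags):
        if k >= 0:
            cc[idx] = np.dot(x[:n - k], y[k:])      # y shifted forward by k
        else:
            cc[idx] = np.dot(x[-k:], y[:n + k])     # x shifted forward by -k
    return lags, cc / norm

# Two toy traces: one pair in phase, one pair in antiphase.
t = np.arange(400)
a = np.sin(2 * np.pi * t / 20.0)
lags, in_phase = normalized_cross_correlogram(a, a, max_lag=20)
_, anti_phase = normalized_cross_correlogram(a, -a, max_lag=20)
print(in_phase[20], anti_phase[20])   # values at lag 0
```

A peak at zero lag indicates synchronization without phase lag, while a trough at zero lag indicates that the two traces are driven out of phase.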
4 Coding Stimulus-Coherency in Oscillatory Multilayers
Using the concepts of the previous section we now extend our model of temporal coding (Konig and Schillen 1991) to include the direction of stimulus movement (Gray et al. 1989). Three two-dimensional layers of delayed nonlinear oscillators are used to represent "neuronal" populations having different preferences with respect to direction of stimulus movement (Fig. 4A): two layers with forward and backward direction selectivity (forward layer, backward layer) and one layer lacking direction preference (neutral layer). Each layer is of the type described in Section 4 of Konig and Schillen (1991, Fig. 4). Accordingly, all three layers are again interpreted as retinotopic representations of RFs, where corresponding RFs are taken to represent matched "retinal" locations. The RFs of all three layers are assumed to have the same orientation preferences. Within each layer synchronizing connections (wz:,wz:,wf:) are implemented as described before (Konig and Schillen 1991). The connections between layers generalize the concepts of the previous section into three dimensions (cf. Fig. 4A): A particular layer is coupled to its nearest-neighbor layer by synchronizing connections and to its next-nearest-neighbor layer by desynchronizing ones. Because of computational limitations and without loss of generality, we have simulated only an appropriate subset of this coupling. In particular, the forward and backward layers synchronize the neutral layer by , : w w:;, and wk2,) connections and mutually desynchronize each other by means of ,;w: w;;, and w:; couplings. As with the single oscillatory layer, $w^{(r)}$ denotes the coupling strength of an oscillator to its r-nearest-neighbor oscillators, which are, as before, specified by retinotopic coordinates. All connection weights within and between layers are chosen to be isotropic with respect to retinotopy.
In correspondence to the experiment, the input to the three layers emulates the different stimulus conditions of forward and backward moving light bars (Fig. 4B). Within each layer input is applied as described in Konig and Schillen (1991). As in the experiment, we "record" from the population lacking direction selectivity. We compute cross correlations within and between stimulus segments, again as detailed in Konig and Schillen (1991). The resulting cross correlograms are shown in Figure 4C: Within each stimulus segment all oscillators are synchronized with zero-phase lag independent of stimulus condition (Fig. 4C, dashed). This defines the oscillatory assemblies that code for each particular stimulus bar. Coupling between the two assemblies representing the two bar segments depends on the direction of movement of and the gap distance between the stimuli (Fig. 4C, solid). With no gap distance (Fig. 4, right column) the two segments form one long stimulus bar, responses to which are completely synchronized without phase lag within its entire area. If the two stimulus bars move
Figure 4: Temporal coding of stimulus coherency with respect to direction of movement and location of stimuli. (A) Three oscillatory layers for the representation of "neuronal" populations with different preferences for the direction of stimulus movement: selectivity for forward (top) and backward (bottom) direction and no direction preference (middle layer). Coupling by synchronizing (dashed) and desynchronizing (dotted) delay connections as described in the text. (B) Stimulus conditions: two short light bars moving in opposite directions, two short light bars moving in the same direction, and one continuous long bar. (C) Mean normalized cross correlations within (dashed) and between (solid) stimulus bar segments, computed for the middle layer. Mean of 20 epochs of 20T. For correlation details see Konig and Schillen (1991, Fig. 4). Cross correlations between stimulus segments reflect stimulus coherency, in agreement with experimental observations (Gray et al. 1989). Parameters: within each layer: parameter set and input as with Konig and Schillen (1991, Fig. 4). Input applied to layers according to stimulus condition; from top to middle layer: ((114) = 0.05, .uib',' = 0.05, w::: = 0.035, uii? = 0.01, WLZ = 0.01, ~ $ 2 = 0.01, 7:;' = T::' = T~Z,' = 4 ~ 0 ,T;, = = ~t','= 8 ~ 0 ; from bottom to middle layer: correspondingly; /Ir = 0.1 T~-"*.
in the same direction but are separated by a small gap (middle column), a somewhat reduced cross correlation between the two corresponding assemblies results. If, with the same gap distance, the two bar segments now move in opposite directions (left column) the resulting oscillatory activities become decoupled, as indicated by only a residual cross correlation. The model is thus capable of evaluating direction of stimulus movement as an additional coherency criterion besides stimulus location. Accordingly, the model desynchronizes oscillatory responses to stimulus bars moving in opposite directions, while synchronizing responses to coherently moving stimuli if they are located sufficiently close to each other. This coding of stimulus coherency closely resembles the experimental observations (Gray et al. 1989).
5 Conclusions
The simulations presented in this paper show that an appropriate choice of excitatory delay connections ($w_{ee}^{(r)}$) can provide desynchronization in layers of delayed nonlinear oscillators. This desynchronization does not critically depend on the exact value of the coupling delay. Desynchronization by excitatory connections is particularly interesting with respect to the preponderance of non-GABAergic neurons in cortex (Beaulieu and Somogyi 1990). In the following we want to discuss some aspects of the different delay connections employed in our model (Fig. 4).
Connections within a Layer: Each layer corresponds to a retinotopic map of a neuronal population having identical feature preferences. The synchronizing connections within a layer couple oscillatory responses to extended stimuli having the appropriate features. All oscillatory activity at neighboring "retinal" locations is recruited into the same assembly defined by zero-phase-lag synchronization. The local character of the coupling allows responses to sufficiently separated stimuli to engage into different oscillatory patterns, the correlation of which reflects the separation of the stimuli. In this case, noise is the cause for the segregation of distinct assemblies within a layer.
Connections between Layers: At each "retinal" location, the oscillators of the different layers form an oscillatory column similar to the chain of oscillators shown in Figure 3. If we visualize the multilayer arrangement of oscillators as a three-dimensional oscillatory module then the dimension of the column corresponds to the module's spectrum of preferred features (e.g., preferred orientations, directions) while the other two dimensions represent the retinotopic map. Within the column, presentation of a single stimulus will elicit oscillatory responses of neighboring oscillators in accordance with the oscillators' feature tuning (cf. Fig. 3). These responses are coupled by short-range synchronizing connections into an
assembly coding for the single stimulus. Long-range desynchronizing connections provide segregation of responses to partially overlapping but distinct stimuli into independently oscillating ensembles. The properties of the desynchronizing connections go beyond desynchronization by noise. Desynchronizing connections actively dephase different neighboring oscillatory assemblies while noise impairs only the synchronizing interaction between such assemblies. Furthermore, desynchronizing connections affect stimulus responses in a specific way as opposed to the effects of noise. It would also be possible, for example, to achieve the segregation of assemblies shown in Figure 3 by a suitable choice of noise level. However, increasing intensity or overlap of the two stimuli at other locations of the module would render this choice inappropriate. As a consequence, the noise level would have to be increased, but this would then pose the problem of synchronizing responses to low intensity stimuli. In contrast, desynchronization by delay connections is based on the oscillatory activity itself and therefore scales with increased activity. Furthermore, a system containing both long-range desynchronizing and short-range synchronizing connections exhibits stimulus-dependent variations of correlation lengths: the size of an assembly synchronized by a single stimulus will be reduced if a second overlapping stimulus is presented. This facilitates segregation of the two pertaining assemblies further. Another argument in favor of desynchronizing connections involves the specificity of active desynchronization. Sompolinsky et al. (1990) present a study of a network of coupled oscillators applying mean field theory to a continuous phase model. In their description of the direction-selective stimulus response by use of synchronizing connections, these authors have to exclude synchronization within the layer of neutral direction selectivity.
This is necessary as otherwise the oppositely moving stimulus bars would also elicit synchronized responses in the neutral layer. This implies exclusion of an entire population of cells from cooperative interactions, which is physiologically implausible. This concept also appears counterintuitive with respect to learning. Cells lacking direction selectivity are particularly likely to respond simultaneously, and this should facilitate the development of connections within this population. Furthermore, without specific desynchronizing connections, synchronizing mechanisms will, in general, have to be avoided unless they are selective with respect to a particular feature dimension. However, implementing only feature-selective synchronizing connections becomes more and more difficult as the number of feature dimensions increases. This would lead to just the combinatorial problems that we want to solve by the introduction of temporal coding. These considerations suggest that active desynchronization might also occur in natural cortical networks. If this is indeed the case we would expect stimulus-dependent variations of correlation length to be found in physiological experiments.
Extending these concepts to several feature dimensions, we propose (1) synchronizing connections for the formation of assemblies corresponding to coherent features and (2) desynchronizing connections for the segregation of responses to differentiating features of an object.
Acknowledgments
We would like to thank the same people that helped us with the preceding paper: Wolf Singer for discussions of the physiological background, H. Sompolinsky and D. Kleinfeld for useful discussions, Jan C. Vorbruggen for his support on computer operation, Wolf Singer, Jan C. Vorbruggen, and Julia Delius for comments on the first draft of this paper, and Renate Ruhl for her excellent graphical assistance.
References
Abeles, M. 1982. Local Cortical Circuits. An Electrophysiological Study. Springer-Verlag, Berlin.
Beaulieu, C., and Somogyi, P. 1990. Targets and quantitative distribution of GABAergic synapses in the visual cortex of the cat. Eur. J. Neurosci. 2(4), 296-303.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., Konig, P., Gray, C. M., and Singer, W. 1990a. Synchronization of oscillatory responses: A mechanism for stimulus-dependent assembly formation in cat visual cortex. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, ed., pp. 105-108. Elsevier, Amsterdam.
Engel, A. K., Konig, P., Gray, C. M., and Singer, W. 1990b. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 588-606.
Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Gray, C. M., and Singer, W. 1987. Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Soc. Neurosci. Abstr. 13(404.3).
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Konig, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp. 3, 155-166.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany.
von der Malsburg, C. 1986. Am I Thinking Assemblies? In Brain Theory, G. Palm and A. Aertsen, eds., pp. 161-176. Springer-Verlag, Berlin.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Received 6 July 1990; accepted 12 November 1990.
Communicated by Fernando Pineda
Recurrent Network Model of the Neural Mechanism of Short-Term Active Memory
David Zipser
Department of Cognitive Science, 0515, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0515

Two decades of single unit recording in monkeys performing short-term memory tasks has established that information can be stored as sustained neural activity. The mechanism of this information storage is unknown. The learning-based model described here demonstrates that a mechanism using only the dynamic activity in recurrent networks is sufficient to account for the observed phenomena. The temporal activity patterns of neurons in the model match those of real memory-associated neurons, while the model's gating properties and attractor dynamics provide explanations for puzzling aspects of the experimental data.

Neural Computation 3, 179-193 (1991) © 1991 Massachusetts Institute of Technology

1 Introduction
There are many definitions of short-term memory. Here we will be concerned only with that subset of memory phenomena lasting up to a few tens of seconds that have been studied using single unit microelectrode recording in awake behaving monkeys starting with Fuster and Alexander (1971) and continuing to the present. An example of this kind of experiment is the delayed saccade task where a spatial location must be remembered (Niki and Watanabe 1976; Gnadt and Andersen 1988; Shintaro et al. 1990). The subject fixates one light while another light, the target, is briefly flashed at a variable location in the periphery. After a delay the fixation light goes out, signaling that a saccade must be made to the location of the remembered target. Another example is the "delayed match to sample" task in which an initial stimulus, typically a tone or a color, must be remembered so that it can be compared to a final stimulus of the same kind presented after a delay (Bauer and Fuster 1976; Quintana et al. 1988; Gottlieb et al. 1989). Match is typically indicated by pushing a button. Lesion and brain cooling studies have identified several cortical areas that are specifically required for these short-term memory tasks, but not required for versions of the same tasks performed without the delay (Fuster 1985; Fuster et al. 1985; Quintana et al. 1989; Colombo et al. 1990; Goldman-Rakic 1987). When single unit recordings are made in these areas, a rich spectrum of memory-associated neural
firing patterns is found. The spectrum of firing is similar in the different cortical areas, and its properties, together with other evidence, indicate that information is being stored as patterns of neural activity. The similarity of the firing patterns observed in different cortical areas suggests that a single general mechanism may be used for active information storage. The model described here is concerned with the nature of this general mechanism. The model must account for the observed single unit firing patterns. The pattern most obviously related to information storage is seen in sustained firing neurons whose activity increases abruptly at the start of a memory task and returns to background at the end. An example of this type of neuron is shown in Figure 1A. The magnitude of the sustained response depends on the direction and magnitude of the saccade, showing that this neuron is coding quantitative information in its firing rate. Neurons that have similar sustained firing patterns, but store different modalities of information, are found in the other cortical areas. Neurons with other kinds of activity patterns can be identified as involved in short-term memory by the effect of task performance on their activity. In the case of the auditory unit shown in Figure 1B the animal performs a match to sample task only when a reward tube that delivers juice is in its mouth. The illustrated neuron shows a sustained response when the memory task is performed and none when it is not, even though the same set of stimuli is presented. A frontal memory unit identified by its task sensitivity is shown in Figure 1C. Here the task/no-task distinction depends on the color of the initial stimulus. Red, blue, green, and yellow initial stimuli are indicative of tasks with rewards while a violet initial stimulus is never followed by a reward. Only rewarded stimuli produce activity in this neuron.
It differs from sustained activity neurons in that it fires briskly only during both the initial stimulus and final cue period. The model described here was configured by training a recurrent neural network to mimic the basic features of short-term memory. This strategy has been shown in previous studies to produce networks whose hidden units behave very much like real neurons computing the same relation as the model (Zipser and Andersen 1988). In the case of recurrent networks trained on dynamical problems, the network dynamics often simulates the experimentally observed dynamics. This makes it possible to generate models with a close functional homology to real systems. Particularly nice results have been obtained recently with recurrent network models of the dynamics of the vestibuloocular system (Arnold and Robinson 1989; Anastasio 1991).
2 The Model
A system that can mimic short-term memory needs at least one input to carry the analog value to be stored and a second input to indicate when
Figure 1: (A) Spike histograms of an intended movement cell in area LIP of the Rhesus monkey. Each histogram includes responses from 8-10 trials. Trials are grouped according to increasing response delay times. The horizontal bar below each histogram indicates the stimulus presentation. The arrow indicates the time at which the fixation spot was extinguished. Eye movements occurred from 150 to 400 msec following offset of spot. Bin size = 50 msec. From Gnadt and Andersen (1988). (B) Histogram showing the activity of a unit in the supratemporal gyrus of baboon auditory cortex during a tone matching task. Dark bars show the times of presentation of the first and second tones. Solid line is the task performance case and dotted line the no task case. From Gottlieb et al. (1989). (C) Spike discharge histograms of a prefrontal unit during short-term memory experiments; bin size 1 sec. The horizontal bar indicates stimulus presentation. Red, green, yellow, or blue presented during the stimulus period indicate a memory task with reward. Violet presented during the stimulus period is not rewarded and serves as the no task cue. Neuron C responds primarily to the initial stimulus and the final cue. From Yajeya et al. (1988).
new values are loaded. This second input, or gate, is required because the single neuron firing data show that new values are loaded only at appropriate times during a memory task. The analog input to the model corresponds to what in the brain would be an activity representing information to be remembered. The gate corresponds to a hypothesized control signal generated at the start and the end of the delay period. While the model hypothesizes gate signals, it does not address the question of where and how they are generated. I have found that a model that accounts for much of the observed single neuron firing data can be generated by training a recurrent neural network having just these two inputs and some hidden units to store analog values. The model consists of a fully connected recurrent network of discrete time logistic units updated by the following equations:
$$y_i(t+1) = \frac{1}{1 + e^{-\mathrm{net}_i(t+1)}}$$

$$\mathrm{net}_i(t+1) = \sum_j w_{ij}(t)\, y_j(t), \qquad j \in I \cup N \cup B,\; i \in N$$
where I is the set of subscripts indexing inputs, N is the set of subscripts indexing unit output activities, and B is a set with the index of the bias activity that is fixed at 1.0. The bias weights in the instances described here were fixed typically at values from -1.0 to -2.5 and not trained. There are two input lines to the model, one for the analog value to be stored and the other for the gate. The input lines go to all the units in the network. The output of one neuron in the network represents the stored value; the rest of the neurons are hidden units. At the start of training all weights are set to small random values. The network is trained so that the output unit maintains the value present at the analog input whenever the gate goes from active to inactive. This output value must be maintained despite any changes on the analog input until the next time the gate becomes active. More details about the pattern of input-output signals used to train and test the model are shown in Figure 2. Many instances of the model were generated by training networks with 6 to 20 units to do the task described in Figure 2 using a fairly standard learning algorithm for recurrent networks called "Backpropagation Through Time," which is reviewed in Williams and Zipser (1991). During training the analog input was set to a different randomly chosen value between 0.0 and 1.0 on each time step. The gate input was set to 1.0 for one time step and then returned to 0 for randomly chosen intervals averaging 4 time steps. The procedure used to generate the temporal activation patterns of the model neurons was designed to resemble the situation existing in typical short-term memory tasks (bottom panel of Figure 2). At the beginning of the test the model networks are set to a "resting" level by gating in a low analog value. After a delay corresponding to the intertrial interval, an analog value representing the activity to be remembered is gated in.
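The update rule and the training signals described above can be sketched in a few lines. This is a minimal forward-pass illustration only and does not implement Backpropagation Through Time. Several details are assumptions made here and are labeled in the code: the column ordering of the weight matrix, the geometric distribution of the gaps between gate pulses, the scale of the initial weights, and the convention that the target latches the analog value on the step at which the gate is active.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(weights, y, inputs, bias=1.0):
    """One update: net_i = sum_j w_ij * y_j, then y_i = logistic(net_i).

    weights has shape (n_units, n_inputs + n_units + 1). The column order
    [inputs, unit activities, bias] is a choice made for this sketch.
    """
    source = np.concatenate([inputs, y, [bias]])
    return logistic(weights @ source)

def make_training_signals(n_steps, mean_gap=4, rng=None):
    """Random analog input, one-step gate pulses, and the value to be held.

    Assumptions: gaps between pulses are geometric with mean mean_gap, and
    the target takes the analog value present while the gate is active.
    """
    rng = np.random.default_rng() if rng is None else rng
    analog = rng.uniform(0.0, 1.0, n_steps)
    gate = np.zeros(n_steps)
    target = np.zeros(n_steps)
    held, next_pulse = 0.0, 0
    for t in range(n_steps):
        if t == next_pulse:
            gate[t] = 1.0
            held = analog[t]
            next_pulse = t + 1 + rng.geometric(1.0 / mean_gap)
        target[t] = held
    return analog, gate, target

rng = np.random.default_rng(0)
n_units = 9
weights = rng.normal(0.0, 0.1, (n_units, 2 + n_units + 1))  # small random values
weights[:, -1] = -2.5                                       # fixed bias weight
analog, gate, target = make_training_signals(50, rng=rng)
y = np.zeros(n_units)
for a, g in zip(analog, gate):
    y = step(weights, y, np.array([a, g]))
```

In a full implementation the activity of the designated output unit would be compared with `target` at every time step and the error propagated back through the unrolled network; the untrained network above only demonstrates the signal flow.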
Figure 2: The input-output structure of the model with diagrams of the training and testing paradigms.
After a period corresponding to the memory delay, the network is reset by gating in the initial resting value. During the delay the analog input is held fixed to simulate the fact that stimulation is generally not given during the delay period. The temporal activation patterns of a typical trained network are shown in Figure 3 as graphs of activity vs. time. Each unit has its own temporally changing activity pattern. The output unit has a moderately stable sustained activity reflecting the stored value. The spectrum of hidden unit activity patterns can be roughly divided into three major classes: storage units, as in lines 1, 3, and 6; gating units, as in lines 2 and 8; and units that mix storage and gating, as in lines 4 and 5 of Figure 3. Other instances of the model generated in different training runs show hidden
Figure 3: Temporal activity patterns of units in a network of 9 logistic units trained to be a system with the input-output characteristics described in Figure 2 and the text. The network was simulated by a Common LISP program that implemented the Backpropagation Through Time algorithm and was run on a Symbolics MacIvory co-processor installed in a Mac II computer. The inputs are labeled A and G, the analog input and the gate, respectively. The bias weights are fixed at -2.5 in this model instance. Training was for 200,000 time steps with an average of 4 time steps between gate pulses. The patterns were generated by first setting the activities to their basal levels by gating in an analog value of 0.0 (this gate is not shown). Then a value of 1.0 is gated and held for 17 time steps at which time 0.0 is again gated into the register.
units with the same three kinds of activity patterns, but with differences in detail. These hidden unit activity patterns closely resemble those of real neurons as will be shown below by comparing them to experimental data. Recurrent networks with logistic units of the kind modeled cannot actually store arbitrary analog values indefinitely as would be possible if linear units were used. Rather these networks exhibit a dynamic behavior such as decay to stable attractors, oscillation, or chaos. All instances of the model described here were found to decay to at most two stable attractors. No stable oscillation or chaos has been observed. The instances with two attractors, the most common case, have a threshold for the stored analog value that determines to which attractor they eventually settle. This is illustrated in the curves without noise in Figure 4. Each instance of the model has its own characteristic threshold and settling time. Some instances show damped oscillations as the network settles to an attractor.
3 Model vs. Experiment
One way to decide if the model accounts for the experimental data is to compare the temporal activity patterns of model and real neurons. This task is complicated by the large number of different patterns found in both real and model neurons. Most of the model temporal activity patterns can be classified as having either sustained activity during the delay, elevated activity only at the time of gating, or a mixture of these two patterns. Most of the published short-term memory neurons also fall into these three categories if we interpret activity during the initial cue and during the final cue or action as corresponding to activity during gating. It is also possible to compare the temporal activity patterns of individual model and real neurons. The model provides no basis for comparing real time, so in these comparisons the number of time steps used for the model neuron is adjusted to give a good match with the temporal pattern of the real neuron's activity. Some examples of comparisons between model hidden units and real neurons are shown in Figure 5. The model neurons were selected from about 70 hidden units in 10 instances of the model, but in fact a match between a real and model unit can be found for the majority of hidden units and for most of the published real neurons. Some of the important details of the model's activity patterns are found in the experimental data. For example, the sustained activity model units differ as to whether or not they are strongly inhibited while the gate is active. This difference is mirrored in the experimental data as shown in lines A and B of Figure 5. Another feature found in both model hidden units and real neurons is the tendency of the sustained activity to drift up or down during the delay period, as seen in Figure 5C and D. This has been attributed by experimenters either to a decay of the stored
Figure 4: Attractor dynamics of a model instance with six hidden units. The threshold for the model instance shown here is 0.5715. When values above this threshold are gated into the network all units settle to their upper attractors; when values below the threshold are stored they settle to their lower attractors. The figure superimposes graphs of the temporal activity patterns obtained for a pair of starting values just below and just above the threshold. The gate is set back to 0.0 at time step 0 and the time course of activity in the network is displayed for 60 more time steps. Solid curve: no noise. Dotted curve: with noise having the characteristics described in the text.

information or to an anticipation of the upcoming action. In the case of the model these changes are due to the network moving toward stable attractors. A characteristic firing pattern that has elevated activity only at the times of the initial stimulus and the final action is seen in Figure 1C. This pattern corresponds to model units of the type seen in lines 2 and 8 of Figure 3. In another class of activity pattern, cue-related and sustained firing are combined. Model units with this kind of pattern are compared to real neurons in Figure 5E and F. Another way to validate the model is to compare its behavior to the real system as some parameter is varied. Unfortunately there are very
Figure 5: Comparison of the temporal activity patterns of cortical neurons with hidden units from model networks during real and simulated delay memory experiments. The experimental data have been copied from published sources using a Hewlett Packard ScanJet Plus. The histograms have been redrawn to the same physical size and format to facilitate comparison. The horizontal axis represents time, in seconds for the real neurons and in time steps for the model units. The vertical axis represents activity in spikes per second for the real neurons and a scale of 0 to 1 for the model units. The horizontal bar is the time of presentation of the first stimulus in the case of the experimental data and indicates the period immediately after the offset of the first gate in the case of the model. The arrow indicates the start of the cue ending the delay period in the experimental case and the offset of the final gate in the case of the model. (A) A neuron from posterior parietal area LIP during a delay saccade task from Gnadt and Andersen (1988). (B) An inferotemporal neuron during a visual delay match to sample task from Fuster et al. (1985). (C,F) Frontal neurons in the principal sulcus during delay match to sample experiments from Fuster (1984). (D) A frontal neuron during a delay choice task experiment from Quintana et al. (1989). (E) A composite of 33 principal sulcus neurons that all have cue-period and delay-period activity in a delay saccade task from Shintaro et al. (1990).
Figure 6: Comparison of a neuron from the auditory cortex with a model neuron during a delay match to sample task. The dynamics of neuron firing at two frequencies of initial stimulus are compared to the activity patterns of the model unit at two different analog input values. From Gottlieb et al. (1989).

few experimental data of this kind. One example of this approach, taken from a study of short-term tone memory, is given in Figure 6. The monkey's task is to determine if the second of two tones spaced 1 sec apart is the same as the first. The parameter varied is the frequency of the first tone, which, the experimenters hypothesize, actually determines the magnitude of the input to the neuron because it is tuned to a particular frequency. At the optimal frequency of 7071 Hz the activity rises abruptly after the first tone and then remains constant until the final tone, whereas at a suboptimal frequency there is only a gradual rise in activity during the delay period (left side of Figure 6). A similar kind of dynamics is seen in the model unit when the magnitude of the analog input is varied, as shown on the right of Figure 6. Another important point illustrated in Figure 6 is that real and model neurons also respond in similar ways in the no-task situation. The no-task situation is brought about in the experimental case by removing the reward tube from the
monkey’s mouth before presenting the stimulus tones. In the case of the model the no-task situation is simulated by omitting the gate pulses. In both the experimental and model case the units respond weakly to the stimuli but have no sustained activity. These results demonstrate that the firing patterns of many real neurons involved in active memory can be matched in detail to the activity patterns in the model. Of particular importance is the homology between the gate-related activity in the model and the cue-related firing in real neurons. This homology shows that the experimental data are consistent with the existence of gate-like signals in the brain that are present only when memory tasks are performed and which load and clear the memory.
4 The Model with Noise
The firing patterns of single neurons recorded anywhere in the cortex are very noisy. Even in the most rigidly controlled experiments there is a large variance between trials so that many trials must be averaged to get a clear picture of a neuron’s characteristic activity pattern. This is particularly true for the cortical neurons found to have sustained activity during the delay period of short-term memory experiments (Villa and Fuster 1990). This variability raises the question of how these neurons could accurately store information represented in terms of their firing rates. This problem was studied by analyzing the effect of noise on the model described above. As we will see, when noise is added to the model networks the attractor dynamics are affected in a way that can slow the rate of information loss. Low-level noise also produces large erratic shifts in activity that lead to testable predictions about the statistics of spiking in real memory neurons. The effect of adding noise during a test of the model network is shown in Figure 4. The two solid traces are from runs without noise where the value stored was just above and just below the threshold. These two traces show the network relaxing to its attractors. The third, dotted, trace is one example of what happens when an approximately gaussian distributed random number with variance of 0.004 was added to the output activity of each unit on every time step following the loading of the memory with its threshold value of 0.5715. With noise the large scale behavior of all the sustained activity units, that is, units 1, 2, and 4, and the output unit is very erratic, but in any one example they are all strongly correlated. In the example shown they spend some time near the upper attractor and then switch to the lower attractor. Each noisy run will be different, but all have certain features in common.
The large-scale behavior of all the units is strongly correlated and they spend most of their time near one or the other attractor. Another way to inject noise is to add it to the input before application of the logistic squashing function
rather than adding it to the outputs. The qualitative results of these two ways of adding noise are quite similar. The tendency of activities in noisy models to cluster near their attractors is seen clearly in Figure 7, which shows the distribution function of the activities for unit 2 of the network from Figure 4 after 40 time steps. The distribution is bimodal with peaks at the positions of the two attractors. By 40 time steps the limit distribution is being approached. The limit distribution is not affected much by noise level over a wide range,
Figure 7: The activity distribution function for unit 2 of the same model instance shown in Figure 4. A single trained instance of the model was run 2000 times for 40 time steps starting with the threshold as the initial stored value. The activity value found at the 40th time step was put into one of 20 bins of size 0.05. The number of occurrences in each bin is plotted as a function of bin size.
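The procedure behind Figure 7 can be imitated with a toy system. The sketch below is not the trained network: it is a single self-connected logistic unit whose weight (`W_SELF = 8.0`) and bias (`BIAS = -4.0`) are assumed values, chosen only so that the unit has two stable attractors separated by an unstable threshold at 0.5. The noise variance (0.004), the number of runs (2000), the number of time steps (40), and the 20 bins of size 0.05 are taken from the text; clipping the final activity to [0, 1] before binning is a further assumption.

```python
import math
import random

W_SELF = 8.0       # hypothetical self-connection weight (not from the paper)
BIAS = -4.0        # hypothetical bias; places the unstable threshold at 0.5
NOISE_VAR = 0.004  # noise variance quoted in the text
N_RUNS = 2000      # number of runs, as in the Figure 7 caption
N_STEPS = 40       # time steps per run, as in the Figure 7 caption
N_BINS = 20        # bins of size 0.05 covering [0, 1]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_once(start, n_steps, noise_sd, rng):
    """Iterate the one-unit map, adding Gaussian noise to the output each step."""
    a = start
    for _ in range(n_steps):
        a = logistic(W_SELF * a + BIAS) + rng.gauss(0.0, noise_sd)
    return a


def activity_histogram(start=0.5, seed=0):
    """Histogram of the activity at the last step over many noisy runs."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(NOISE_VAR)
    counts = [0] * N_BINS
    for _ in range(N_RUNS):
        a = run_once(start, N_STEPS, noise_sd, rng)
        a = min(max(a, 0.0), 1.0)  # clip only for binning (assumption)
        counts[min(int(a * N_BINS), N_BINS - 1)] += 1
    return counts


counts = activity_histogram()
for i, c in enumerate(counts):
    print(f"{i * 0.05:4.2f}-{(i + 1) * 0.05:4.2f} {'#' * (c // 20)}")
```

Starting every run at the threshold, the printed histogram should show the activity piling up near the two attractors rather than near the starting value, which is the qualitative behavior the caption describes. Calling `run_once` with `noise_sd=0.0` and a start slightly above or below 0.5 gives the noise-free relaxation to the upper or lower attractor.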
Figure 8: The effect of averaging noisy runs on the input-output relation of the same model instance shown in Figure 4. The solid line shows the relation without noise. The horizontal axis is the initially stored value; the vertical axis is the activity of the unit being graphed at the indicated time step. Each dot in the dotted curve is an average of 30 runs for the number of time steps indicated.

but does depend on the relative "strength" of the attractors, which differs between model instances. The location and strength of the attractors are properties of the weights and of the value held on the analog input during the memory period. While some theoretical work has been done on very simple noisy networks of this kind (Cowan 1968), it is not yet known analytically how the various parameters determine the distributions observed here. The limiting activity distribution is not reached immediately unless
the noise level is very large. For low noise levels there is an initial period when the stored values are well represented. This is followed by an intermediate period during which the choice of attractor is more strongly influenced by the initially stored value than by the limit distribution. This phenomenon has the interesting consequence of making the value of the activity during the intermediate period obtained by averaging the output of many runs of the model a good measure of the stored value for a longer time than would be expected, even though most instances making up the average are near one of the attractors. This is illustrated in Figure 8 where the activity of all the units in the network after either 5, 20, or 45 time steps is plotted as a function of the initial stored value. The dark solid line indicates the input-output relation with no added noise, while each point in the dotted curves is generated by averaging 30 runs with added noise. At 5 time steps the remembered output is a good representation of the input with or without noise. By 20 time steps the network without noise has a near step function relating initial input to delayed output, but the noisy average still gives a fairly good input-output relation. By 45 time steps the limit distribution has been almost reached and the average output no longer represents the initially stored value. This result indicates that the apparently destructive effects on memory accuracy caused by decay to attractors might be overcome for a while by using the average output of many noisy networks to represent stored information.

Acknowledgments
I thank Joaquin Fuster and Jack Cowan for very helpful discussions. This work was supported by System Development Foundation Grant G359D, and National Institute of Mental Health Grant MH45271.

References

Anastasio, T. 1991. Neural network models of velocity storage in the horizontal vestibulo-ocular reflex. Biol. Cybern. 64, 187-196.
Arnold, D., and Robinson, D. A. 1989. A learning neural-network model of the oculomotor integrator. Soc. Neurosci. Abstr. 15, part 2, 1049.
Bauer, R. H., and Fuster, J. M. 1976. Delayed-matching and delayed-response deficit from cooling dorsolateral prefrontal cortex in monkeys. J. Comp. Physiol. Psychol. 3, 293-302.
Colombo, M., D'Amato, M. R., Rodmann, H. R., and Gross, G. C. 1990. Auditory association cortex lesions impair auditory short-term memory in monkeys. Science 247, 336-338.
Cowan, J. D. 1968. Statistical mechanics of nervous nets. In Neural Networks: Proceedings of the School on Neural Networks June 1967 in Ravello, E. R. Caianiello, ed. Springer-Verlag, Berlin.
Fuster, J. 1984. Behavioral electrophysiology of the prefrontal cortex. TINS 7, 408-414.
Fuster, J. M. 1985. The prefrontal cortex, mediator of cross-temporal contingencies. Human Neurobiol. 4, 169-179.
Fuster, J. M., and Alexander, G. E. 1971. Neuron activity related to short-term memory. Science 173, 652-654.
Fuster, J. M., Bauer, R. H., and Jervey, J. P. 1985. Functional interactions between inferotemporal and prefrontal cortex in a cognitive task. Brain Res. 330, 299-307.
Gnadt, J. W., and Andersen, R. A. 1988. Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res. 70, 216-220.
Goldman-Rakic, P. S. 1987. Circuitry of primate prefrontal cortex and regulation of behavior by representational memory. Handbook of Physiology - The Nervous System, V. B. Mountcastle and F. Plum, eds., pp. 373-417. American Physiological Society, Bethesda, MD.
Gottlieb, Y., Vaadia, E., and Abeles, M. 1989. Single unit activity in the auditory cortex of a monkey performing a short term memory task. Exp. Brain Res. 74, 139-148.
Niki, H., and Watanabe, M. 1976. Prefrontal unit activity and delayed response: Relation to cue location versus direction of response. Brain Res. 105, 79-88.
Quintana, J., Yajeya, J., and Fuster, J. M. 1988. Prefrontal representation of stimulus attributes during delay tasks. I. Unit activity in cross-temporal integration of sensory and sensory-motor information. Brain Res. 474, 211-221.
Quintana, J., Fuster, J. M., and Yajeya, J. 1989. Effects of cooling parietal cortex on prefrontal units in delay tasks. Brain Res. 503, 100-110.
Shintaro, F., Burce, C. J., and Goldman-Rakic, P. S. 1990. Visuospatial coding in primate prefrontal neurons revealed by oculomotor paradigms. J. Neurophysiol. 63, 814-831.
Villa, A. E. P., and Fuster, J. M. 1990. Temporal firing patterns of inferotemporal neurons in a visual memory task. Soc. Neurosci. Abstr. 16, part 1, 760.
Williams, R. J., and Zipser, D. 1991. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures, and Applications, Y. Chauvin and D. E. Rumelhart, eds. Erlbaum, Hillsdale, NJ.
Yajeya, J., Quintana, J., and Fuster, J. M. 1988. Prefrontal representation of stimulus attributes during delay tasks. II. The role of behavioral significance. Brain Res. 474, 222-230.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331(6158), 679-684.
Received 31 October 1990, accepted 24 January 1991.
Communicated by Geoffrey Hinton
Learning Invariance from Transformation Sequences
Peter Földiák
Physiological Laboratory, University of Cambridge, Downing Street, Cambridge CB2 3EG, U.K.
The visual system can reliably identify objects even when the retinal image is transformed considerably by commonly occurring changes in the environment. A local learning rule is proposed, which allows a network to learn to generalize across such transformations. During the learning phase, the network is exposed to temporal sequences of patterns undergoing the transformation. An application of the algorithm is presented in which the network learns invariance to shift in retinal position. Such a principle may be involved in the development of the characteristic shift invariance property of complex cells in the primary visual cortex, and also in the development of more complicated invariance properties of neurons in higher visual areas.

1 Introduction
How can we consistently recognize objects when changes in the viewing angle, eye position, distance, size, orientation, relative position, or deformations of the object itself (e.g., of a newspaper or a gymnast) can change their retinal projections so significantly? The visual system must contain knowledge about such transformations in order to be able to generalize correctly. Part of this knowledge is probably determined genetically, but it is also likely that the visual system learns from its sensory experience, which contains plenty of examples of such transformations. Electrophysiological experiments suggest that the invariance properties of perception may be due to the receptive field characteristics of individual cells in the visual system. Complex cells in the primary visual cortex exhibit approximate invariance to position within a limited range (Hubel and Wiesel 1962), while cells in higher visual areas in the temporal cortex show more complex forms of invariance to rotation, color, size, and distance, and they also have much larger receptive fields (Gross and Mishkin 1977, Perrett et al. 1982). The simplest model of a neuron, which takes a weighted sum of its inputs, shows a form of generalization in which patterns that differ on only a small number of input lines generate similar outputs. For such a unit, patterns are similar when they are close in Hamming distance. Any simple transformation, like a shift in position or a rotation, can cause a great difference in Hamming distance, so this simple unit tends to respond to the transformed image very differently and generalizes poorly across the transformation. The solution to this problem is therefore likely to require either a more complex model of a neuron, or a network of simple units.

Neural Computation 3, 194-200 (1991) © 1991 Massachusetts Institute of Technology

2 Shift Invariance
Fukushima (1980) proposed a solution to the positional invariance problem by a network consisting of alternating feature detector ("S" or simple) and invariance ("C" or complex) layers. Feature detectors in the "S" layer are replicated in many different positions, while the outputs of detectors of the same feature are pooled from different positions in the "C" layers. The presence of the feature in any position within a limited region can therefore activate the appropriate "C" unit. This idea is consistent with models of complex cells in the primary visual cortex (Hubel and Wiesel 1962; Spitzer and Hochstein 1985) in that they assume that complex cells receive their major inputs from simple cells or simple-cell-like subunits selective for the same orientation in different positions. In Fukushima's model, the pair of feature detecting and invariance layers is repeated in a hierarchical way, gradually giving rise to more selectivity and a larger range of positional invariance. In the top layer, units are completely indifferent to the position of the pattern, while they are still sensitive to the approximate relative position of its components. In this way, not only shift invariance, but some degree of distortion tolerance is achieved as well. This architecture has successfully been applied both by Fukushima (1980) and LeCun et al. (1989) in pattern recognition problems. LeCun et al. achieve reliable recognition of handwritten digits (zip codes) by using such architectural constraints to reduce the number of free parameters that need to be adjusted. Some of the principles presented in these networks may also be extremely helpful in modeling the visual system. The implementation of some of their essential assumptions in biological neural networks, however, seems very difficult.
Apart from the question of the biological plausibility of the backpropagation algorithm used in LeCun et al.'s model, both models assume that the feature detectors are connected to "complex" units in a fixed way, and that all the simple units that are connected to a complex unit have the same input weight vector (except for a shift in position). Therefore whenever the weights of one of the "simple" units are modified (e.g., by a Hebbian mechanism), the corresponding weights of all the other simple units connected to the same complex unit need to be modified in exactly the same way ("weight sharing"). This operation is nonlocal for the synapses of all the units except for the one that was originally modified. A "learn now" signal broadcast by the complex unit to all its simple units would not solve this problem either, as the shifted version of the input, which would be necessary for local learning, is not available for the simple units.
3 A Learning Rule
An arrangement is needed in which detectors of the same feature all connect to the same complex unit. However, instead of requiring simple units permanently connected to a complex unit (a "family") to develop in an identical way, the same goal can be achieved by letting simple units develop independently and then allowing similar ones to connect adaptively to a complex unit (form "clubs"). A learning rule is therefore needed to specify these modifiable simple-to-complex connections. A simple Hebbian rule, which depends only on instantaneous activations, does not work here as it only detects overlapping patterns in the input and picks up correlations between input units. If the input to the simple layer contains an example of the feature at only one spatial position at any moment then there will never be significant overlap between detectors of that feature in different positions. The absence of positive correlations would prevent those units being connected to the same output. The solution proposed here is a modified Hebbian rule in which the modification of the synaptic strength at time step $t$ is proportional not to the pre- and post-synaptic activity, but instead to the presynaptic activity ($x$) and to an average value, a trace of the postsynaptic activity ($\bar{y}$). A second, decay term is added in order to keep the weight vector bounded:

\[ \Delta w_{ij}^{(t)} = \alpha \, \bar{y}_i^{(t)} \left( x_j^{(t)} - w_{ij}^{(t)} \right), \]

where

\[ \bar{y}_i^{(t)} = (1 - \delta) \, \bar{y}_i^{(t-1)} + \delta \, y_i^{(t)}. \]
A similar trace mechanism has been proposed by Klopf (1982) and used in models of classical conditioning by Sutton and Barto (1981). A trace is a running average of the activation of the unit, which has the effect that activity at one moment will influence learning at a later moment. This temporal low-pass filtering of the activity embodies the assumption that the desired features are stable in the environment. As the trace depends on the activity of only one unit, the modified rule is still local. One possibility is that such a trace is implemented in a biological neuron by a chemical concentration that follows cell activity.

4 Simulation
The development of the connections between the simple and complex units is simulated in an example in which the goal is to learn shift invariance. In the simple layer there are position-dependent line detectors, one unit for each of 4 orientations in the 64 positions on an 8 x 8 grid. There are only 4 units in the complex layer, fully connected to the simple units. During training, moving lines selected at random from four orientations and two directions are swept across a model retina, which gives
Figure 1: Five consecutive frames from one of the sequences used as input. Each line segment represents a simple unit of the corresponding orientation and position. Thick segments are active ($x_j = 1$), thin ones are inactive units ($x_j = 0$). The trace is maintained between sweeps.

rise to activation of the simple units of the appropriate orientation in different positions at different moments in time (Fig. 1). Such activation can either be the result of eye movements, object motion in the environment, or it may even be present during early development as there is evidence for waves of activity in the developing mammalian retina (Meister et al. 1990). The activation of these simple units is the input to the network. If an active simple unit succeeds in exciting one of the four complex units, then the trace of that complex unit gets enhanced for a period of time comparable to the duration of the sweep across the receptive fields of the simple units. Therefore all the connections from the simple units that get activated during the rest of that sweep get strengthened according to the modified Hebb rule. Simple units of only one orientation get activated during a sweep, causing simple units of only one orientation to connect to any given complex unit. To prevent more than one complex unit from responding to the same orientation, some kind of competitive, inhibitory interaction is necessary between the complex units. In some previous simulations an adaptive competitive scheme, decorrelation, was used (Barlow and Földiák 1989; Földiák 1990), which is thought to be advantageous for other reasons. For the sake of clarity, however, the simplest possible competitive scheme (Rumelhart and Zipser 1985) was used in the simulation described here. Each unit took a sum of its inputs weighted by the connection strengths. The output $y_k$ of the unit with the maximal weighted sum was set to 1, while the outputs of the rest of the units were set to 0:

\[ y_k = \begin{cases} 1 & \text{if } \arg\max_i \left( \sum_j w_{ij} x_j \right) = k \\ 0 & \text{otherwise} \end{cases} \]
Figure 2a shows the initially random connections between the simple and the complex units, while Figure 2b shows the connections after training with 500 sweeps across the retina.
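The simulation described above can be sketched in a few lines. This is a reconstruction under stated assumptions, not the author's program. The grid size, the number of units, the 500 sweeps, the initial weight range, and the values 0.02 and 0.2 come from the text and the Figure 2 caption. The trace update `trace = (1 - delta) * trace + delta * y` is the standard form of the running average, consistent with the caption's statement that a value of 1 for the second parameter means no trace. The function names, the wrap-around geometry of the diagonal lines in `frame`, and the equal probability of the two sweep directions are hypothetical choices made for illustration.

```python
import random

GRID = 8                           # 8 x 8 retina, as in the text
N_ORIENT = 4                       # four line orientations
N_SIMPLE = N_ORIENT * GRID * GRID  # 256 simple units
N_COMPLEX = 4                      # four complex units


def simple_index(orient, row, col):
    return (orient * GRID + row) * GRID + col


def frame(orient, step):
    """Simple units activated by a line of one orientation at one sweep position.

    Assumption: diagonal lines wrap around the grid edge, so that every
    frame activates exactly GRID units.
    """
    active = []
    for k in range(GRID):
        if orient == 0:    # horizontal line, sweep over rows
            row, col = step, k
        elif orient == 1:  # vertical line, sweep over columns
            row, col = k, step
        elif orient == 2:  # diagonal: (row + col) % GRID == step
            row, col = k, (step - k) % GRID
        else:              # anti-diagonal: (row - col) % GRID == step
            row, col = k, (k - step) % GRID
        active.append(simple_index(orient, row, col))
    return active


def train(n_sweeps=500, alpha=0.02, delta=0.2, seed=0):
    rng = random.Random(seed)
    # Initial weights drawn uniformly from [0, 0.1].
    w = [[rng.uniform(0.0, 0.1) for _ in range(N_SIMPLE)]
         for _ in range(N_COMPLEX)]
    trace = [0.0] * N_COMPLEX      # maintained between sweeps
    for _ in range(n_sweeps):
        orient = rng.randrange(N_ORIENT)
        steps = list(range(GRID))
        if rng.random() < 0.5:     # two sweep directions
            steps.reverse()
        for step in steps:
            active = frame(orient, step)
            # Winner-take-all: the unit with the largest weighted sum outputs 1.
            sums = [sum(w[i][j] for j in active) for i in range(N_COMPLEX)]
            winner = sums.index(max(sums))
            for i in range(N_COMPLEX):
                y = 1.0 if i == winner else 0.0
                trace[i] = (1.0 - delta) * trace[i] + delta * y
                rate = alpha * trace[i]
                # w += rate * (x - w): decay every weight, then add rate where x = 1.
                w[i] = [wj * (1.0 - rate) for wj in w[i]]
                for j in active:
                    w[i][j] += rate
    return w


def orientation_profile(w):
    """Mean weight from each complex unit to the simple units of each orientation."""
    per = GRID * GRID
    return [[sum(row[o * per:(o + 1) * per]) / per for o in range(N_ORIENT)]
            for row in w]


weights = train()
for i, profile in enumerate(orientation_profile(weights)):
    print(i, " ".join(f"{p:.3f}" for p in profile))
```

Each printed row gives one complex unit's mean weight to the four orientations, so the degree of specialization can be inspected directly. Whether every unit ends up tied to a single orientation, as in Figure 2b, may depend on the seed and on the assumed sweep geometry, so the output should be checked rather than taken for granted. Calling `train(delta=1.0)` corresponds to the no-trace condition of Figure 2c.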
Figure 2: Connection patterns of the four complex units (a) before training and (b) after training on 500 line sweeps across the retina. The length of each segment indicates the strength of the connection from the simple unit of the corresponding position and orientation to the complex unit. Initial weights were chosen from a uniform distribution on [0, 0.1]. $\alpha = 0.02$, $\delta = 0.2$. (c) The result of training without trace ($\delta = 1$).

5 Discussion

The simple example given above is not intended to be a realistic model of complex cell development, since unoriented input to complex cells was ignored and simple units were considered merely as line detectors. By using a more realistic model of simple cells, the above principle would be able to predict that simple cells of the same spatial frequency and orientation but of different phase tuning (dark/bright line centre, even/odd symmetry) connect to the same complex cell, which would therefore lose sensitivity to phase. A further consequence would be that simple cells tuned to different spatial frequencies would segregate on different complex cells. The application of this algorithm to more complicated or abstract invariances (e.g., 3D rotations or deformations) would perhaps be even more interesting as it is even harder to see how they could be specified without some kind of learning; the way in which such invariance
properties could be wired in is much less obvious than in the case of positional invariance in Fukushima’s or LeCun’s networks. All that would be required by the proposed algorithm from previous stages of processing is that the transformation-dependent features should be available as input, and that the environment should generate sequences of the transformation causing the activation of these transformation-dependent detectors within a short period of time. Where no such detectors are available, other learning rules, based on temporal sequences or variation in form (Mitchison 1991, Webber 1991) may be able to find stable representations. If a supervision signal indicates the invariant properties, or self-supervision between successive time steps is applied, then backpropagation can also give rise to invariant feature detectors without explicit weight sharing (Hinton 1987). Nevertheless such learning is rather slow. Achieving a transformation-independent representation would certainly be very useful in recognizing patterns, yet the information that these invariance stages throw away may be vital in performing visual tasks. A “where” system would probably have to supplement and cooperate with such a “what” system in an intricate way.

Acknowledgments
I would like to thank Prof. H. B. Barlow, Prof. F. H. C. Crick, Dr. A. R. Gardner-Medwin, Prof. G. E. Hinton, and Dr. G. J. Mitchison for their useful comments. This work was supported by an Overseas Research Studentship, a research studentship from Churchill College, Cambridge, and SERC Grants GR/E43003 and GR/F34152.
References
Barlow, H. B., and Földiák, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. M. Durbin, C. Miall, and G. J. Mitchison, eds., Chap. 4, pp. 54-72. Addison-Wesley, Wokingham.
Földiák, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybernet. 64, 165-170.
Fukushima, K. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybernet. 36, 193-202.
Gross, C. G., and Mishkin, M. 1977. The neural basis of stimulus equivalence across retinal translation. In Lateralization in the Nervous System, S. Harnad, R. Doty, J. Jaynes, L. Goldstein, and G. Krauthamer, eds., pp. 109-122. Academic Press, New York.
Hinton, G. E. 1987. Learning translation invariant recognition in a massively parallel network. In PARLE: Parallel Architectures and Languages Europe, G. Goos and J. Hartmanis, eds., pp. 1-13. Lecture Notes in Computer Science, Springer-Verlag, Berlin.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106-154.
Klopf, A. H. 1982. The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Hemisphere, Washington, DC.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
Meister, M., Wong, R. O. L., Baylor, D. A., and Shatz, C. J. 1990. Synchronous bursting activity in ganglion cells of the developing mammalian retina. Invest. Ophthalmol. Visual Sci. (suppl.) 31, 115.
Mitchison, G. J. 1991. Removing time variation with the anti-Hebbian synapse. Neural Comp., in press.
Perrett, D. I., Rolls, E. T., and Caan, W. 1982. Visual neurons responsive to faces in the monkey temporal cortex. Exp. Brain Res. 47, 329-342.
Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Spitzer, H., and Hochstein, S. 1985. A complex-cell receptive-field model. J. Neurophysiol. 53, 1266-1286.
Sutton, R. S., and Barto, A. G. 1981. Toward a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88, 135-170.
Webber, C. J. St. C. 1991. Self-organization of position- and deformation-tolerant neural representations. Network 2, 43-61.
Received 20 September 1990; accepted 12 October 1990
Communicated by David Willshaw
A Biologically Supported Error-Correcting Learning Rule Peter J. B. Hancock Leslie S. Smith William A. Phillips Centre for Cognitive and Computational Neuroscience, Departments of Computing Science and Psychology, University of Stirling, Stirling, Scotland FK9 4LA
We show that a form of synaptic plasticity recently discovered in slices of the rat visual cortex (Artola et al. 1990) can support an error-correcting learning rule. The rule increases weights when both pre- and postsynaptic units are highly active, and decreases them when presynaptic activity is high and postsynaptic activation is less than the threshold for weight increment but greater than a lower threshold. We show that this rule corrects false positive outputs in feedforward associative memory, that in an appropriate opponent-unit architecture it corrects misses, and that it performs better than the optimal Hebbian learning rule reported by Willshaw and Dayan (1990).
1 Introduction
Learning rules that correct errors are most often used in cognitive simulations and in the technological applications of neural nets. The Delta rule (Widrow and Hoff 1960) is a typical example. Three terms are required to specify the weight change: presynaptic activity, the postsynaptic activity produced by the net, and the postsynaptic activity specified by the training signal. Performance improves gradually with repeated presentation of the whole training set. There is psychological evidence for such a rule (e.g., Sutton and Barto 1981), but no biological evidence has yet been presented for a rule of this kind. Learning rules based on biological evidence typically use just two terms to specify weight change: presynaptic activity and postsynaptic activity. They do not require multiple presentations of the training set to reach their optimum performance. The many forms of this kind of learning are collectively called Hebbian rules. It is well established that the computational power of error-correcting rules exceeds that of the Hebbian rules. Recently Artola et al. (1990) reported a new form of synaptic plasticity in slices of adult rat visual cortex. They show that tetanic presynaptic input produces long-term potentiation (LTP) if postsynaptic depolarization exceeds a high threshold, and long-term depression (LTD) if it does not
Neural Computation 3, 201-212 (1991) © 1991 Massachusetts Institute of Technology
[Figure 1 plot. Vertical axis marks: A+, 0, A−. Horizontal axis: postsynaptic activation, with thresholds θ− and θ+.]
Figure 1: Simple version of the ABS rule, showing weight change for a synapse from an active presynaptic unit.
exceed the high threshold but does exceed a lower threshold. The high threshold is related to NMDA receptor-gated conductances. At first sight this seems to be just another Hebbian rule, but it is unusual because LTD occurs when the postsynaptic unit is moderately active but not when it is less active. This nonmonotonic relationship of weight change to postsynaptic activation is the critical difference. A simple form of this rule is shown in Figure 1. We shall refer to it as the ABS rule. It resembles the proposal of Bienenstock et al. (1982), but it does not use the time averages of unit activity to specify weight change thresholds. To see a possible rationale for this rule, consider the development of a feedforward associative net learning a random set of pairs of binary patterns. The net consists of M input units fully connected to N output units. These output units compute a weighted sum of their inputs (including the training signal) and give a binary output determined by whether the activation is above or below their output threshold. Initially, all the weights are assumed to be sufficiently small that the only output units to be active are those driven by the training signal. We assume that this signal reaches the threshold for weight increase. As the loading increases, some of the units that should, according to the training signal, be OFF start to become active. This activity triggers the weight decrement: the rule thus reduces specifically the weights that are causing problems. This is a simple form of error correction. A few high weights from active input units to inactive output units can be tolerated, and indeed should be because of the other patterns that have been learned. Reducing all
such weights, as a simple two-term rule would, is likely to lead to other errors. Here we begin the computational study of this learning rule, and compare it with the Hebbian rule that Willshaw and Dayan (1990) have shown to be optimal within that class. They demonstrate the requirement for decreases in synaptic efficacy that on average match the increases. The optimal rule is the covariance rule (Sejnowski 1977), which they call Rule 1. Two simpler cases (Rules 2 and 3) are shown to give good but slightly less than optimal results. There is biological evidence for both of the simpler rules (Rauschecker and Singer 1979; Stanton and Sejnowski 1989).
2 ABS Rule Definition
This study of the ABS rule is designed for direct comparison with the results of Willshaw and Dayan (1990). They considered the storage in a single-layer feedforward associative net of Ω pattern pairs, each consisting of an input vector A(ω) and an output vector B(ω). The components of each A(ω) are set to 1 with probability s, and to a low value c with probability (1 − s) (we are substituting s for their p to avoid confusion with probability p later). Part of their conclusion is that the value of c is not important, given appropriate output thresholds and their rules, so we always set it to 0. Components of each B(ω) are set to 1 with probability r and to 0 with probability 1 − r. The activation of an output unit, X_j, is given by the weighted sum of its inputs:
$$X_j = \sum_{i=1}^{M} A_i(\omega)\, W_{ij}$$
If the activation is above the unit’s threshold θ_j, its output O_j is set to 1, otherwise to the low value c (0):
$$O_j = \begin{cases} 1 & \text{if } X_j > \theta_j \\ 0 & \text{otherwise} \end{cases}$$
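As a minimal sketch of the set-up just described, the following Python fragment draws a binary pattern with a given bit probability and computes the activation and thresholded output of one unit. The function names, the seed, and the example weights and threshold are illustrative assumptions, not values taken from the paper.

```python
import random

def random_pattern(n, prob, rng):
    """Binary vector: each component is 1 with probability `prob`, else 0 (c = 0)."""
    return [1 if rng.random() < prob else 0 for _ in range(n)]

def activation(a, w_j):
    """X_j: weighted sum of the input components a[i] and the weights w_j[i] into unit j."""
    return sum(a_i * w_ij for a_i, w_ij in zip(a, w_j))

def output(x_j, theta_j):
    """O_j: 1 if the activation exceeds the unit's threshold, otherwise 0."""
    return 1 if x_j > theta_j else 0

rng = random.Random(0)            # fixed seed, chosen only for reproducibility
a = random_pattern(8, 0.5, rng)   # one input pattern A(omega) with s = 0.5
w_j = [0.25] * 8                  # hypothetical weights into a single output unit
x_j = activation(a, w_j)
print(a, x_j, output(x_j, theta_j=1.0))
```

The strict inequality in `output` mirrors the condition X_j > θ_j above.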
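The simple form of the learning rule shown in Figure 1 may be defined by
$$\Delta W_{ij} = \begin{cases} A^{+} & \text{if } X_j \geq \theta^{+} \text{ and } A_i(\omega) = 1 \\ A^{-} & \text{if } \theta^{-} < X_j < \theta^{+} \text{ and } A_i(\omega) = 1 \\ 0 & \text{otherwise} \end{cases}$$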
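The three cases of the rule can be sketched as a small Python function. This is an illustration only: the name `abs_delta_w` is hypothetical, and it assumes that the decrement is passed as a positive magnitude `dec` and applied as a decrease, which is how the decrement sizes are tabulated later.

```python
def abs_delta_w(x_j, a_i, theta_minus, theta_plus, inc, dec):
    """Weight change for one synapse under the simple form of the ABS rule.

    x_j: postsynaptic activation; a_i: presynaptic component (0 or 1).
    inc: size of the increment A+; dec: magnitude of the decrement A-
    (assumed positive here and returned with a minus sign).
    """
    if a_i != 1:
        return 0.0            # inactive presynaptic unit: no change
    if x_j >= theta_plus:
        return inc            # high postsynaptic activation: increment
    if theta_minus < x_j < theta_plus:
        return -dec           # moderate activation: decrement
    return 0.0                # activation at or below the lower threshold

# Example with hypothetical thresholds 1.0 and 4.0
print(abs_delta_w(2.0, 1, 1.0, 4.0, inc=0.9, dec=0.3))
```

Note the nonmonotonic shape: the change is zero at low activation, negative in the middle band, and positive at or above the upper threshold.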
We do not need a specific value for θ+ in our simulations. We assume that the target output signal is strong enough to drive units into weight increment and that the signals from the adaptive weights are not.¹ Here
¹ Artola et al. show that, under bicuculline disinhibition, the internal signals from the adaptive weights can drive the cell sufficiently to cause weight increment. Since
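we also assume that the training inputs consist of binary signals. With these two assumptions we can reformulate the ABS rule as follows:
$$\Delta W_{ij} = \begin{cases} A^{+} & \text{if } B_j(\omega) = 1 \text{ and } A_i(\omega) = 1 \\ A^{-} & \text{if } B_j(\omega) = 0 \text{ and } X_j > \theta^{-} \text{ and } A_i(\omega) = 1 \\ 0 & \text{otherwise} \end{cases}$$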
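A minimal sketch of one training pass under this reformulated rule, for a single output unit, might look as follows. The function name, the in-place update, and the example values are assumptions made for illustration. The activation is computed from the adaptive weights only, which is sufficient here because the decrement case applies only when the target bit is 0 and the training signal therefore contributes nothing.

```python
def train_epoch(weights, patterns, theta_minus, inc, dec):
    """One pass over (a, b_j) pairs for a single output unit j.

    weights: list of W_ij for this unit (updated in place and returned).
    patterns: list of (input vector a, target bit b_j).
    inc / dec: increment size and decrement magnitude (both positive).
    """
    for a, b_j in patterns:
        x_j = sum(a_i * w for a_i, w in zip(a, weights))
        if b_j == 1:
            change = inc       # target ON: strengthen active inputs
        elif x_j > theta_minus:
            change = -dec      # target OFF but unit too active: weaken
        else:
            change = 0.0       # target OFF and unit quiet: leave alone
        for i, a_i in enumerate(a):
            if a_i == 1:
                weights[i] += change
    return weights

# Hypothetical two-pattern example
w = train_epoch([0.0, 0.0, 0.0],
                [([1, 1, 0], 1), ([1, 0, 1], 0)],
                theta_minus=0.5, inc=0.9, dec=0.3)
print(w)
```

In the example the second pattern shares an active input with the first, so the unit is falsely driven above the lower threshold and only the weights from that pattern's active inputs are reduced.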
These two forms of the ABS rule are equivalent if each unit is seen as having two different inputs, the modifiable connections of the associative memory and the training signal with which the associations are being made:
$$X_j = \sum_{i=1}^{M} A_i(\omega)\, W_{ij} + B_j(\omega)\, d$$
Here d is the strength of the training signal, set such that only it can drive the unit sufficiently to reach the weight increment threshold. This specification of the rule allows weights to change between being positive and being negative, which is not biologically plausible (Crick and Asanuma 1986). One of the implications of Willshaw and Dayan’s work is that negative weights are required for optimal storage. To allow direct comparison with their rules we have allowed negative weights in the first experiments reported below. This is then corrected in the section discussing an architecture with opponent units. Note that θ− < θ_j, in order to prevent falsely high outputs. The unit is active above θ−, but not sufficiently active to be counted as ON in binary terms. However, if the difference is too large, the rule resembles the simple binary rule and there is a danger of overcorrection.
3 The Effects of Error Correction
Hebbian rules with binary signals lead, after learning, to the distribution of activation levels illustrated in Figure 2a. The overlapping tails of the desired-high and desired-low distributions are where the errors occur. The ABS rule is able to cut the tail off the high end of the desired-low distribution, while the full three-term Delta rule is able to correct errors in both directions (Fig. 2b, c). Obviously there comes a point where the Delta rule will also fail, but it occurs at higher loadings than for two-term rules. Note that we are using threshold logic units, which may not be biologically plausible. However, it is clear that no form of output function could prevent errors if the two distributions overlap. With additional circuitry the ABS rule is also able to correct misses. The requirement is to replace the single output units with mutually
this would lead to a runaway self-association, with strong weights getting stronger, we assume that the threshold was reached because of the disinhibition, and that normally other inputs would also be required.
[Figure 2 plot. Curves: Target low, Target high. Horizontal axis: output unit activation.]
Figure 2: Idealized activation frequency distributions after learning for a single output unit, plotted separately for when it should have low and high outputs. (a) Result of simple binary two-term rules: the region of overlap indicates that errors will be made, wherever the threshold is put. (b) The ABS rule corrects false positive errors, reducing the region of overlap. (c) The Delta rule corrects errors in both directions.
inhibitory opponent pairs. This may be regarded as a simplification of the local inhibition that is common in cortex. Whenever a unit is trained to be ON, its opponent is trained to be OFF, and vice versa. False positive outputs are corrected as before. The way that misses are corrected can be seen by considering why a unit that is below its output threshold has too little activation. Part of the reason is that its own weights are too low, but it is also being inhibited by its opponent cell. By symmetry, this unit is responding too strongly and will reach the weight decrement threshold. Its activity will be decreased, reducing the inhibition and allowing the other unit to give a higher output. The simple two-term rules learn in a single pass: unless weights are limited in some way, further presentations of the training set will not affect the result, only the size of the weights and activation. As with the Delta rule, the ABS rule gives improved performance with additional presentations.
4 Simulation Experiments
4.1 Single Unit Architecture. The performance of the ABS rule was tested by repeating the experiments of Willshaw and Dayan (1990), who measured signal-to-noise ratios and the number of bit errors for a feedforward associative net. Their computations of signal to noise ratio assume
the distributions are gaussian. The ABS rule distribution (Fig. 2b) is not gaussian, so that the figures produced are not directly comparable. We therefore report only the actual numbers of errors produced, since minimizing this is more important. The bit errors were counted by two methods. Initially we used the threshold set by Willshaw and Dayan’s method, which is designed for gaussian distributions. As might be expected from Figure 2, the ideal threshold for the ABS rule is rather lower than for the simpler Hebbian rules. An optimal threshold can be found by searching through the actual responses of each unit in the region where desired low and desired high outputs overlap to minimize the number that are wrong. This procedure produced significantly better results. We have not yet looked for a method of setting something like the optimal threshold for each unit without recourse to such serial search procedures. The optimal Hebbian rules of Willshaw and Dayan (1990) specify the sizes of the weight change parameters for given bit probabilities. The ABS rule decrements the weight only when an error occurs, so that if Willshaw and Dayan’s conclusion that the expected value of the weight should be zero still holds, the size of the decrement has to be larger than is given by their binary homosynaptic depression Rule 3. This rule gives a weight increment of 1 − r and a decrement of r (r is the output bit probability). We therefore fixed the increment size at 1 − r, and experimented with a range of values for the decrement size. The initial value of all the weights was zero, there being no need here for the symmetry breaking required by some other methods. The results are shown in Table 1. A number of things are apparent from Table 1. The absolute level of performance is good, and improves as bit probability decreases.
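The serial threshold search mentioned above could be sketched as follows. This is a simplified illustration rather than the procedure actually used: the function name is hypothetical, it tries every midpoint between recorded activations instead of restricting the search to the overlap region, and it assumes at least one activation has been recorded.

```python
def best_threshold(low_acts, high_acts):
    """Return (threshold, errors) minimizing bit errors for one output unit.

    low_acts / high_acts: activations recorded when the target was 0 / 1.
    An error is a low-target activation above the threshold, or a
    high-target activation at or below it (the output is 1 only if X > theta).
    """
    values = sorted(set(low_acts) | set(high_acts))
    # Candidates: below every value, midpoints between neighbours, the maximum.
    candidates = [values[0] - 1.0]
    candidates += [(u + v) / 2.0 for u, v in zip(values, values[1:])]
    candidates.append(values[-1])
    best = None
    for theta in candidates:
        errors = (sum(x > theta for x in low_acts)
                  + sum(x <= theta for x in high_acts))
        if best is None or errors < best[1]:
            best = (theta, errors)
    return best

# Hypothetical activations with one overlapping pair (0.9 low, 0.7 high)
print(best_threshold([0.1, 0.4, 0.9], [0.7, 1.2, 1.5]))
```

When several thresholds give the same error count, this sketch keeps the lowest one, which is an arbitrary tie-breaking choice.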
The net learns 200 patterns when the bit probability is 0.1 with on average only 0.05 bits in error out of 20 per pattern. That is 0.05 × 200 = 10 bit errors in total, so at least 190 of the 200 output patterns will be completely correct. As predicted, for both bit probabilities, the optimal size of A− is larger than the value specified by Willshaw and Dayan’s Rule 3. Near the optimum, the precise value of A− is not critical. In both cases the average value of the weights is near zero at optimal performance. These results were used to set the sizes of the weight changes to their optimal value in the following experiments. We next compared the ABS rule with the optimal Hebbian rule (Rule 1) of Willshaw and Dayan (W&D) over a range of bit probabilities. The results are given in Table 2. The ABS rule does better at all bit probabilities and, in contrast to normal Hebbian rules, its performance improves with training. However, there is little room for improvement at low bit probabilities and the limit is quickly reached.
4.2 Opponent Architecture. In the simple architecture of the preceding experiments the ABS rule corrects false positives. In an architecture with twice as many output units arranged in mutually inhibitory pairs it also corrects misses. The internal activations for each unit are calculated
Table 1: Results from a net with 512 input units and 20 output units, with 200 patterns, averaged over 10 runs, with 5 training cycles. Weight increment is A+, weight decrement A-. Avg weight is the average of all weights to all 20 output units. Bit errs is the average number of errors per 20-bit pattern, counted using the threshold used by W&D. Min errs is the number of bit errors given by an optimal threshold for each unit.

A-:          0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

Bit prob s,r = 0.1, A+ = 0.9
Bit errs:    0.135  0.06   0.05    0.05    0.05   0.05    0.05   0.052  0.053
Min errs:    0.065  0.018  0.0125  0.0055  0.003  0.0025  0.002  0.001  0.0015
Avg weight:  1.8    0.43   0.19    0.06    -0.02  -0.03   -0.09  -0.16  -0.19

Bit prob s,r = 0.5, A+ = 0.5
Bit errs:    6.38  3.86  1.6   1.04  1.01  1.01  1.05  1.1
Min errs:    5.97  3.58  1.41  0.85  0.81  0.82  0.86  0.94
Avg weight:  50.3  26.4  9.3   3.26  1.61  1.32  1.09  0.82
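The "Min errs" figures rely on a per-unit threshold found by serial search. A minimal Python sketch of such a search is given here; it is an illustration, not the authors' code. The function name `optimal_threshold` is hypothetical, and for simplicity the sketch scans every cut point between sorted activations rather than only the region where low and high targets overlap.

```python
def optimal_threshold(activations, targets):
    """Return (threshold, errors) minimizing misclassified patterns for one unit.

    activations: internal activation of the unit for each pattern
    targets:     desired binary output (0 or 1) for each pattern
    A pattern is classed as 1 when its activation exceeds the threshold.
    """
    pairs = sorted(zip(activations, targets))
    # Threshold below every activation: everything is classed as 1,
    # so every pattern with target 0 is an error.
    errors = sum(1 for _, t in pairs if t == 0)
    best_errors = errors
    best_threshold = pairs[0][0] - 1.0
    for k, (a, t) in enumerate(pairs):
        # Moving the threshold just above `a` reclassifies this pattern as 0.
        errors += 1 if t == 1 else -1
        nxt = pairs[k + 1][0] if k + 1 < len(pairs) else a + 2.0
        # Only cuts between distinct activation values are realizable.
        if nxt > a and errors < best_errors:
            best_errors = errors
            best_threshold = (a + nxt) / 2.0
    return best_threshold, best_errors
```

Applied to each of the 20 output units in turn, the summed error counts divided by the number of patterns would give a per-pattern figure comparable in kind to the Min errs rows.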
Table 2: Results from a 512 input, 20 output net with 200 random input-output patterns, averaged over 10 runs with different pattern sets. Bit errors refers to the average number of errors per pattern, counted using the threshold used by W&D. Min errors is the number of bit errors given by an optimal threshold for each unit.

        W&D Rule 1        ABS 5 epochs      ABS 10 epochs     ABS 20 epochs
s,r     Bit      Min      Bit      Min      Bit      Min      Bit      Min
        errors   errors   errors   errors   errors   errors   errors   errors
0.5     1.07     0.89     1.03     0.86     0.71     0.34     0.77     0.50
0.4     0.97     0.82     0.82     0.63     0.61     0.13     0.64     0.29
0.3     0.72     0.56     0.54     0.32     0.47     0.044    0.48     0.12
0.2     0.35     0.25     0.24     0.06     0.23     0.005    0.24     0.015
0.1     0.08     0.027    0.05     0.004    0.05     0.004    0.05     0.004
0.05    0.03     0.003    0.02     0.0      0.02     0.0      0.02     0.0
P. J. B. Hancock, L. S. Smith, and W. A. Phillips
as before, the units then inhibit each other by subtracting some fraction K of the opponent unit's activation:
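One reading of this mutual inhibition is sketched below in Python. It assumes that each unit subtracts K times its opponent's pre-inhibition activation, with both subtractions applied simultaneously; the function name `opponent_activations` and the list-based layout of the pairs are hypothetical.

```python
def opponent_activations(x_pos, x_neg, k):
    """Apply mutual inhibition within each opponent pair.

    x_pos[j], x_neg[j]: activations of the two units of pair j before inhibition
    k:                  fraction of the opponent's activation that is subtracted
    Assumption: both units use the opponent's pre-inhibition value.
    """
    out_pos = [p - k * n for p, n in zip(x_pos, x_neg)]
    out_neg = [n - k * p for p, n in zip(x_pos, x_neg)]
    return out_pos, out_neg


# Two opponent pairs with K = 0.5.
pos, neg = opponent_activations([10.0, 2.0], [4.0, 6.0], 0.5)
```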
During training, for each unit where the target is 1, its opponent unit is set to 0, and vice versa. The weight change procedure for each unit is the same as for the single-sided architecture. We are not suggesting that such an orderly arrangement of pairs of units is biologically plausible. This design is a simplification that matches the assumption of binary training signals. However, local mutual inhibition is widespread in the cortex and a more realistic simulation might contain a layer of units such as that suggested by von der Malsburg (1973). Here we only wish to demonstrate the possibilities of the learning rule and have kept the architecture as simple as possible. The opponent architecture also allows the problem of negative weights to be addressed. Effectively, we are simply splitting each weight in two, and putting the inhibitory part on a separate unit. For this to work requires only that the weight decrement threshold $\theta^-$ be above zero. The value is not critical, since the weights and activations are automatically adjusted appropriately. A value of 50 proved satisfactory for the weight change parameters in use here. Results are given in Table 3. This system can learn 200 patterns without errors, though convergence to this accuracy is quite slow, requiring 30 or 40 training epochs. Performance is distinctly better than the single-sided architecture, which still makes about 0.1 errors per pattern after 40 epochs. The system is sensitive to the value of K (the strength of the mutual inhibition), with 1 giving distinctly worse performance than slightly lower values. The value of A- is less critical, a good value being five times the size of A+. Performance tails off as the bit probability decreases: precisely the opposite of the simpler architecture, although with bit probabilities as low as 0.2 this system still does better than the optimal Hebbian rule. The reason for the effect of bit probability is clear: if one unit is ON with a probability of 0.1, then its opponent is ON with a probability of 0.9. The required values of A+ and A- are very different for the two opponent units. Choosing appropriate values does allow all the patterns to be learned. Although a mechanism for adjusting weight change sizes to suit the measured bit probability is possible, we prefer a more biological solution with small groups of mutually inhibitory units (like a winner-takes-all cluster), each of which responds approximately equally often.

Table 3: Results from a net with 512 input units and 20 x 2 output units trained with 200 random input-output patterns, for a variety of parameters. In all cases A+ is 0.02, and there are 30 training cycles.

Bit probability   A-     K     Min bit errors per pattern
0.5               0.1    0.5   1.28
0.5               0.1    0.8   0.03
0.5               0.1    0.9   0
0.5               0.1    1.0   0.37
0.5               0.15   0.9   0
0.3               0.1    0.9   0
0.2               0.1    0.9   0.007
0.1               0.1    0.9   0.048

5 Stability
An important question concerning any learning rule is its stability and convergence, both in terms of errors and synaptic weights. Consider an individual weight $W_{ij}$. It will be incremented for those patterns where $A_i(\omega) = B_j(\omega) = 1$. Assuming input and output bit probabilities s and r are equal, weight increment would be expected on $\Omega r^2$ patterns. Weight decrement is expected for some of the cases where $A_i(\omega) = 1$ and $B_j(\omega) = 0$, specifically those when the unit activation $X_j > \theta^-$. Zero weight change is achieved if

$$\Omega r^2 A^+ = \Omega r (1 - r) A^- \, p[X_j > \theta^- \mid B_j(\omega) = 0]$$

This rearranges to give

$$\frac{A^+}{A^-} = \frac{(1 - r)}{r} \, p[X_j > \theta^- \mid B_j(\omega) = 0]$$
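The balance condition can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and `p_exceed` stands for the conditional probability that the activation exceeds the decrement threshold when the target is 0, which in a real run would have to be measured.

```python
def equilibrium_ratio(r, p_exceed):
    """Ratio A+/A- that gives zero expected weight change.

    r:        bit probability (inputs and outputs assumed equal)
    p_exceed: P[activation > decrement threshold | target is 0]
    """
    return (1.0 - r) / r * p_exceed


def expected_weight_change(r, a_plus, a_minus, p_exceed, n_patterns=1):
    """Expected net change of one weight over n_patterns presentations."""
    increments = n_patterns * r * r * a_plus
    decrements = n_patterns * r * (1.0 - r) * a_minus * p_exceed
    return increments - decrements


# With r = 0.5, A+ = 0.02 and A- = 0.1, the ratio A+/A- is 0.2, so the
# weights balance when about one fifth of the target-0 presentations
# exceed the decrement threshold.
drift = expected_weight_change(0.5, 0.02, 0.1, 0.2)
```

For r = 0.1 the same ratio is nine times `p_exceed`, which is one way to see why sparse patterns tolerate a much larger increment relative to the decrement.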
That this is at least moderately stable may be seen by considering the situation where the value of A+ is too high. Weights will tend to increase, leading to an increased probability of exceeding $\theta^-$ and provoking a weight decrement. Conversely, an overlarge value for A- will reduce the probability of exceeding $\theta^-$, allowing the weights to build up. Exceeding $\theta^-$ does not necessarily imply registering an error, provided there is a gap between $\theta^-$ and the binary output threshold. As with many systems a suitably pathological input sequence will break it; in practice with the runs reported here we saw just an occasional single bit error in the epochs following initial convergence. Overall convergence is smooth, as shown by Figure 3. As the error rate is reduced, the ratio of A+/A- required for weight stability will also decrease. Work in progress with adjustment of the ratio indicates that performance is indeed improved.

Figure 3: (a) Course of learning for optimal Hebbian and simple and opponent ABS rules, 512 input units, 20 output units/pairs, 200 random patterns of bit probability 0.5, average of 10 experiments. Both ABS rules had A+ = 0.02 and A- = 0.1. $\theta^-$ was 0 for simple, 50 for opponent ABS. Latter had weights constrained to be positive and initial weights 4.0. (b) Expanded scale, showing effects of adding an upper weight bound of 6 to opponent ABS. Other details as (a).

As the ABS rule contains a purely Hebbian increment component, it is clear that there is no upper limit on the weights: a single input repeatedly applied would cause all the active weights to grow indefinitely. Although frequently ignored in simulation work, any real synapse (in brains or silicon) will clearly have an upper limit on its strength. So the behavior of the ABS rule with an upper weight bound is important. It was checked by simply clipping any weight that exceeded a limit. This was arbitrarily set at 6, an intentionally very tight constraint given that the weights start at 4, and that some normally reach around 15 while learning 200 patterns in 30 epochs. As would be expected, the performance deteriorated noticeably, but the system still converges well and approaches zero errors (Fig. 3b). The weights were followed beyond 50 epochs and do not change significantly. In practice, therefore, the ABS rule is stable and tolerant of constraints.

6 Discussion
We have shown that a learning rule based on the form of synaptic plasticity reported by Artola et al. (1990) can correct false positives and misses.
It can learn more random paired associates than the optimal classical Hebbian rule, and its performance continues to improve with repeated presentations of the training set. In essence the rule assumes that during training the required outputs are signaled by distinctively high levels of postsynaptic activation. Lower levels of postsynaptic activation within a specified range can thus be treated as false positives and the weights from active input lines reduced. This entails two further assumptions. (1) The maximum sizes of the weights produced by this rule must be limited so that they cannot produce levels of postsynaptic activation that mimic the training signals. (2) The functionally effective level of postsynaptic activation must be less than the level required for weight increment, otherwise the learning would not be effective at test. Both assumptions are biologically plausible because the weight must be limited, and it is known that neural output activity can be functionally effective at activation levels well below the NMDA threshold. In our formulation of the rule for the opponent architecture we calculate the internal activations according to the modifiable weights, then use the output signal to decide the weight change. In reality, one of the units will be being driven hard on by the output signal. This should thoroughly inhibit the other unit, which would therefore never reach the decrement threshold, causing the rule to revert to a simple Hebbian. A plausible solution to this problem is given by the possibility of dendritic processing. The patch of the dendrite receiving the modifiable input may then reach decrement threshold, while the inhibition prevents the cell from firing. A simple biological mechanism could provide the predicted change in ratio of weight increment to decrement as error rate declines. The size of the decrement could be controlled by the concentration of some substance, an enzyme perhaps, at the synapse. 
Frequent weight decrement events would use up the stock of enzyme, reducing the size of the change. A low error rate would result in occasional, larger decrements. Further work on the ABS rule to be reported elsewhere (Hancock et al. 1991) shows that it compares favorably with the classical perceptron learning rule (PLR) in the early stages of learning. The PLR does not perform particularly well on the first pass of training data and there has traditionally been a divide between single-pass Hebb-like rules and multipass error-correcting rules. The ABS rule thus raises the possibility of obtaining the benefits of both, with a relatively good performance on a single pass, but continuing to improve with further training. The rule also works well in autoassociative architectures. Important unresolved issues on which we are currently working include the extension of the rule to nonbinary signals, and its role in multilayer architectures when combined with other biologically supported learning rules.
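The behavior summarized in this discussion (Hebbian increment on active inputs when the target is 1, decrement on active inputs when the target is 0 but the activation exceeds the decrement threshold, optional clipping at an upper bound) can be sketched for a single output unit as follows. This is a minimal illustration under those stated assumptions, not the authors' simulation code; the function name `abs_update` and its parameter names are hypothetical.

```python
def abs_update(weights, inputs, target, a_plus, a_minus, theta_minus, w_max=None):
    """One ABS-style update for a single output unit, applied in place.

    weights:     modifiable weights onto the unit
    inputs:      binary input pattern (0 or 1 per line)
    target:      desired binary output for this pattern
    theta_minus: activation level above which a target-0 pattern is
                 treated as a false positive
    w_max:       optional upper bound; weights above it are clipped
    Returns the activation computed before the update.
    """
    activation = sum(w * a for w, a in zip(weights, inputs))
    for i, a in enumerate(inputs):
        if a != 1:
            continue  # only active input lines are modified
        if target == 1:
            weights[i] += a_plus           # Hebbian increment
        elif activation > theta_minus:
            weights[i] -= a_minus          # correct a false positive
        if w_max is not None and weights[i] > w_max:
            weights[i] = w_max             # simple clipping
    return activation


w = [0.0, 0.0, 0.0]
abs_update(w, [1, 0, 1], 1, a_plus=0.5, a_minus=1.0, theta_minus=0.0)
abs_update(w, [1, 0, 1], 0, a_plus=0.5, a_minus=1.0, theta_minus=0.0)
```

Because the decrement fires only when the activation is too high for a target-0 pattern, repeated presentations keep correcting errors instead of accumulating unconditionally, which is the property that distinguishes the rule from a plain Hebbian update.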
Acknowledgments
This work was funded by the SERC of the UK, and by the BRAIN initiative of the EEC. We are very grateful to Wolf Singer and Alain Artola for helpful discussions and to Peter Dayan for help with setting up the simulations. Roland Baddeley and Peter Cahusac made helpful comments on earlier drafts of this paper.

References

Artola, A., Brocher, S., and Singer, W. 1990. Different voltage-dependent thresholds for the induction of long-term depression and long-term potentiation in slices of the rat visual cortex. Nature (London) 347, 69-72.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Crick, F. H. C., and Asanuma, C. 1986. Certain aspects of the anatomy and physiology of the cerebral cortex. In Parallel Distributed Processing, J. L. McClelland and D. E. Rumelhart, eds., Vol. 2, pp. 333-371. Bradford Books, MIT Press.
Hancock, P. J. B., Smith, L. S., and Phillips, W. A. 1991. Error correcting capabilities of a recently discovered form of cortical synaptic plasticity. In preparation.
Rauschecker, J. P., and Singer, W. 1979. Changes in the circuitry of the kitten's visual cortex are gated by post-synaptic activity. Nature (London) 280, 58-60.
Sejnowski, T. J. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Stanton, P., and Sejnowski, T. J. 1989. Associative long-term depression in the hippocampus: Induction of synaptic plasticity by Hebbian covariance. Nature (London) 339, 215-218.
Sutton, R. S., and Barto, A. G. 1981. Toward a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88-2, 135-170.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. IRE WESCON Convention Record, New York: IRE, 96-104.
Willshaw, D. J., and Dayan, P. 1990. Optimal plasticity from matrix memories: What goes up must come down. Neural Comp. 2, 85-93.
Received 30 August 1990; accepted 30 January 1991
Communicated by John Moody
A Resource-Allocating Network for Function Interpolation John Platt Synaptics, 2860 Zanker Road, Suite 206, San Jose, CA 95134 USA
We have created a network that allocates a new computational unit whenever an unusual pattern is presented to the network. This network forms compact representations, yet learns easily and rapidly. The network can be used at any time in the learning process and the learning patterns do not have to be repeated. The units in this network respond to only a local region of the space of input values. The network learns by allocating new units and adjusting the parameters of existing units. If the network performs poorly on a presented pattern, then a new unit is allocated that corrects the response to the presented pattern. If the network performs well on a presented pattern, then the network parameters are updated using standard LMS gradient descent. We have obtained good results with our resource-allocating network (RAN). For predicting the Mackey-Glass chaotic time series, RAN learns much faster than do backpropagation networks and uses a comparable number of synapses.

Neural Computation 3, 213-225 (1991) © 1991 Massachusetts Institute of Technology

1 Introduction

Judd (1988) has shown that the problem of loading a multilayer perceptron with binary units is NP-complete. Loading sigmoidal multilayer networks is computationally expensive for large sets of real data, with unknown bounds on the amount of computation required. Baum (1989) pointed out that the problem of NP-complete loading is associated only with a network of fixed resources. If a network can allocate new resources, then the problem of loading can be solved in polynomial time. Therefore, we are interested in creating a network that allocates new computational units as more patterns are learned. Traditional pattern recognition algorithms, such as Parzen windows and k-nearest neighbors, allocate a new unit for every learned example. The number of examples in real problems forces us to use fewer than one unit for every learning example: we must create and store an abstraction of the data. The network described here allocates far fewer units than the number of presented examples. The number of allocated units scales sublinearly
with the number of presented inputs. The network can be used either for on-line or off-line learning. Previous workers have used networks whose transfer function is a gaussian (Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990). The use of gaussian units was originally inspired by approximation theory, which describes algorithms that interpolate between irregularly spaced input-output pairs (Powell 1987). In fact, Lapedes discussed the hypothesis that multiple layers of sigmoidal units form gaussian-like transfer functions in order to perform interpolation (Lapedes 1987). Gaussian units are well-suited for use in a resource-allocating network because they respond only to a local region of the space of input values. When a gaussian unit is allocated, it explicitly stores information from an input-output pair instead of merely using that information for gradient descent. The explicit storage of an input-output pair means that this pair can be used immediately to improve the performance of the system in a local region of the input space. A unit with a nonlocal response needs to undergo gradient descent, because it has a nonzero output for a large fraction of the training data. The work of Moody and Darken (1988, 1989) is the closest to the work specified below. They use gaussian units, where the gaussians have variable height, variable centers, and fixed widths. The network learns the centers of the gaussians using the k-means algorithm (Lloyd 1957; Stark et al. 1962; MacQueen 1967), and learns the heights of the gaussians using the LMS gradient descent rule (Widrow 1960). The width of the gaussians is determined by the distance to the nearest gaussian center after the k-means learning. Moody has further extended his work by incorporating a hash table lookup (Moody 1989).
The hash table is a resource-allocating network where the values in the hash table become nonzero only if the entry in the hash table is activated by the corresponding presence of nonzero input probability. Our work improves on previous work in several ways:

1. Although it has the same accuracy, our network requires fewer weights than do networks in either Moody and Darken (1989) or in Moody (1989).

2. Like the hashing approach in Moody (1989), our network automatically adjusts the number of units to reflect the complexity of the function that is being interpolated. Fixed-size networks either use too few units, in which case the network memorizes poorly, or too many, in which case the network generalizes poorly.

3. We use units that respond to only a local region of input space, similar to Moody and Darken (1988, 1989), but unlike backpropagation. The units respond to only a small region of the space of inputs so that newly allocated units do not interfere with previously allocated units.

4. The RAN adjusts the centers of the gaussian units based on the error at the output, like Poggio and Girosi (1990). Networks with centers placed on a high-dimensional grid, such as Broomhead and Lowe (1988) and Moody (1989), or networks that use unsupervised clustering for center placement, such as Moody and Darken (1988, 1989), generate larger networks than RAN, because they cannot move the centers to increase the accuracy.

5. Parzen windows and k-nearest neighbors both require a number of stored patterns that grows linearly with the number of presented patterns. With our method, the number of stored patterns grows sublinearly, and eventually reaches a maximum.

2 The Algorithm
This section describes a resource-allocating network (RAN), which consists of a network, a strategy for allocating new units, and a learning rule for refining the network. 2.1 The Network. RAN is a two-layer network (Fig. 1). The first layer consists of units that respond to only a local region of the space of input values. The second layer aggregates outputs from these units and creates the function that approximates the input-output mapping over the entire space.

Figure 1: The architecture of the network. In parallel, the network computes the distances of the input vector I to the stored centers $c_j$. The distance is then exponentiated to yield a weight $x_j$. The output y is a weighted sum of the heights $h_j$ and an offset $\gamma$.

The units on the first layer store a particular region in the input space. When the input moves away from the stored region the response of the unit decreases. A simple function that implements a locally tuned unit is a gaussian:

(2.1)

We use a $C^1$ continuous polynomial approximation to speed up the algorithm, without loss of network accuracy:

(2.2)

where q = 2.67 is chosen empirically to make the best fit to a gaussian. The inputs to the synapses of the second layer are the outputs of the units of the first layer. The purpose of each second-layer synapse is to define the contribution of each first-layer unit to a particular output y of the network. Each output of the network y is the sum of the first-layer outputs $x_j$, each weighted by the synaptic strength $h_j$, plus a constant vector $\gamma$, which does not depend on the output of the first layer:

$$y = \sum_j h_j x_j + \gamma \qquad (2.3)$$
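A minimal Python sketch of this two-layer forward pass follows, for a scalar output. The exact form of the gaussian is an assumption of the sketch: each unit is taken to respond as exp(-z / w^2), where z is the squared distance from the input to the unit's center and w is the unit's width. The function names are hypothetical.

```python
import math


def unit_response(center, width, x):
    """Response of one locally tuned unit (assumed form: exp(-z / width**2))."""
    z = sum((c - v) ** 2 for c, v in zip(center, x))  # squared distance
    return math.exp(-z / width ** 2)


def network_output(centers, widths, heights, gamma, x):
    """Second layer: offset gamma plus height-weighted sum of unit responses."""
    y = gamma
    for center, width, height in zip(centers, widths, heights):
        y += height * unit_response(center, width, x)
    return y


# One unit centered at (1, 2): at the center the unit adds its full height,
# far from the center the output falls back to the offset gamma.
at_center = network_output([[1.0, 2.0]], [0.3], [2.0], 0.5, [1.0, 2.0])
far_away = network_output([[1.0, 2.0]], [0.3], [2.0], 0.5, [10.0, 10.0])
```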
The $\gamma$ is the default output of the network when none of the first-layer units is active. The $h_j x_j$ term can be thought of as a bump that is added or subtracted to the constant term $\gamma$ to yield the desired function. 2.2 The Learning Algorithm. The network starts with a blank slate: no patterns are yet stored. As patterns are presented to it, the network chooses to store some of them. At any given point the network has a current state, which reflects the patterns that have been stored previously. The allocator identifies a pattern that is not currently well represented by the network and allocates a new unit that memorizes the pattern. The output of the new unit extends to the second layer. After the new unit is allocated, the network output is equal to the desired output T. Let the index of this new unit be n. The peak of the response of the newly allocated unit is set to the novel input,

$$c_n = I \qquad (2.4)$$

The linear synapses on the second layer are set to the difference between the output of the network and the novel output,

$$h_n = T - y \qquad (2.5)$$

The width of the response of the new unit, $w_n$, is proportional to the distance from the nearest stored vector to the novel input vector,

$$w_n = \kappa \, \| I - c_{\text{nearest}} \| \qquad (2.6)$$

where $\kappa$ is an overlap factor. As $\kappa$ grows larger, the responses of the units overlap more and more. The RAN uses a two-part novelty condition. An input-output pair (I, T) is considered novel if the input is far away from existing centers,

$$\| I - c_{\text{nearest}} \| > \delta(t) \qquad (2.7)$$

and if the difference between the desired output and the output of the network is large

$$\| T - y(I) \| > \epsilon \qquad (2.8)$$

Typically, $\epsilon$ is a desired accuracy of output of the network. Errors larger than $\epsilon$ are immediately corrected by the allocation of a new unit, while errors smaller than $\epsilon$ are gradually repaired using gradient descent. The distance $\delta(t)$ is the scale of resolution that the network is fitting at the tth input presentation. The learning starts with $\delta(t) = \delta_{\max}$, which is the largest length scale of interest, typically the size of the entire input space of nonzero probability density. The distance $\delta(t)$ shrinks until it reaches $\delta_{\min}$, which is the smallest length scale of interest. The network will average over features that are smaller than $\delta_{\min}$. We used a function

$$\delta(t) = \max\left(\delta_{\max} \exp(-t/\tau), \; \delta_{\min}\right) \qquad (2.9)$$

where $\tau$ is a decay constant. At first, the system creates a coarse representation of the function, then refines the representation by allocating units with smaller and smaller widths. Finally, when the system has learned the entire function to the desired accuracy and length scale, it stops allocating new units altogether. The two-part novelty condition is necessary for creating a compact network. If only condition 2.7 is used, then the network will allocate units instead of using gradient descent to correct small errors. If only condition 2.8 is used, then fine-scale units may be allocated in order to represent coarse-scale features, which is wasteful. By allocating new units, RAN eventually represents the desired function ever more closely as the network is trained. Fewer units are needed for a given accuracy if the first-layer synapses $c_{jk}$, the second-level synapses $h_j$, and the thresholds are adjusted to decrease the error:

$$E = \| y - T \|^2 \qquad (2.10)$$
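The two-part novelty test and the shrinking resolution scale described in this section can be sketched in a few lines of Python. The closed form used for the resolution (an exponential decay from the largest scale, clamped at the smallest scale) is an assumption consistent with the description of a decay constant; the function and parameter names are hypothetical.

```python
import math


def resolution(t, delta_max, delta_min, tau):
    """Scale of resolution at the t-th presentation (assumed closed form)."""
    return max(delta_max * math.exp(-t / tau), delta_min)


def is_novel(dist_to_nearest, error_norm, delta, epsilon):
    """Both parts must hold: the input is far from every stored center
    AND the output error exceeds the desired accuracy."""
    return dist_to_nearest > delta and error_norm > epsilon


# Early in learning the scale is coarse; late in learning it has
# shrunk to the smallest length scale of interest.
early = resolution(0, 0.7, 0.07, 100.0)
late = resolution(10_000, 0.7, 0.07, 100.0)
```

Requiring both conditions is what keeps the network compact: a large error close to an existing center is left to gradient descent, and a distant input that is already predicted well needs no new unit.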
We use the Widrow-Hoff LMS algorithm (Widrow and Hoff 1960) to decrease the error whenever a new unit is not allocated:

(2.11)

In addition, we adjust the centers of the responses of units to decrease the error:

(2.12)

Equation 2.12 is derived from gradient descent and equation 2.1. Equation 2.12 also has an intuitive interpretation. Units whose outputs would cancel the error have their centers pulled toward the input. Units whose outputs would increase the error have their centers pushed away from the input. Empirically, equation 2.12 also works for the polynomial approximation 2.2. The structure of the algorithm is shown below as pseudocode, including initialization code:

    δ = δ_max
    γ = T_0 (from the first input-output pair)
    loop over presentations of input-output pairs (I, T)
    {
        evaluate output of network y = Σ_j h_j x_j(I) + γ
        compute error E = T − y
        find distance to nearest center d = min_j ||c_j − I||
        if ||E|| > ε and d > δ then
        {
            allocate new unit, c_n = I, h_n = E
            if this is the first unit to be allocated then
                width of new unit = κ δ
            else
                width of new unit = κ d
        }
        else
            perform gradient descent on γ, h_j, c_jk
        if δ > δ_min
            δ = δ × exp(−1/τ)
    }
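The pseudocode can be turned into a small runnable Python sketch for a scalar output. Several details are assumptions of this sketch rather than statements of the published method: the unit response is taken as exp(-z / w^2), the LMS step is alpha times error times response, and the center step uses a 2*alpha/w scaling (a literal gradient of the assumed gaussian would scale as 1/w^2 instead). The class name `RAN` and its attribute names are hypothetical.

```python
import math


class RAN:
    """Sketch of a resource-allocating network with a scalar output."""

    def __init__(self, eps, delta_max, delta_min, tau, kappa, alpha):
        self.eps, self.delta, self.delta_min = eps, delta_max, delta_min
        self.tau, self.kappa, self.alpha = tau, kappa, alpha
        self.centers, self.widths, self.heights = [], [], []
        self.gamma = None  # offset, set from the first target seen

    def _responses(self, x):
        # Assumed gaussian response of every allocated unit.
        return [math.exp(-sum((c - v) ** 2 for c, v in zip(ctr, x)) / w ** 2)
                for ctr, w in zip(self.centers, self.widths)]

    def predict(self, x):
        xs = self._responses(x)
        return self.gamma + sum(h * r for h, r in zip(self.heights, xs))

    def observe(self, x, target):
        """Present one input-output pair: allocate a unit or adapt."""
        if self.gamma is None:
            self.gamma = target
        xs = self._responses(x)
        err = target - (self.gamma + sum(h * r for h, r in zip(self.heights, xs)))
        d = min((math.dist(c, x) for c in self.centers), default=float("inf"))
        if abs(err) > self.eps and d > self.delta:
            # Novel pattern: store it. The first unit has no neighbor,
            # so its width is set from the current resolution scale.
            width = self.kappa * (d if self.centers else self.delta)
            self.centers.append(list(x))
            self.widths.append(width)
            self.heights.append(err)
        else:
            # Not novel: small corrective steps on offset, heights, centers.
            self.gamma += self.alpha * err
            for j, r in enumerate(xs):
                h, w = self.heights[j], self.widths[j]
                self.heights[j] += self.alpha * err * r
                step = 2.0 * (self.alpha / w) * err * h * r  # assumed scaling
                self.centers[j] = [c + step * (v - c)
                                   for c, v in zip(self.centers[j], x)]
        if self.delta > self.delta_min:
            self.delta = max(self.delta * math.exp(-1.0 / self.tau),
                             self.delta_min)


net = RAN(eps=0.02, delta_max=0.7, delta_min=0.07, tau=50.0, kappa=0.87, alpha=0.02)
net.observe([0.0], 1.0)   # first pair only sets the offset
net.observe([2.0], 3.0)   # large error, no centers yet: a unit is allocated
```

After an allocation the prediction at the stored input equals the target, because the new unit's response at its own center is 1 and its height is the error. The final clamp on the resolution scale is a small addition to the pseudocode so that the scale never drops below its minimum.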
3 Results
One application of an interpolating RAN is to predict complex time series. As a test case, a chaotic time series can be generated with a nonlinear algebraic or differential equation. Such a series has some short-range time coherence, but long-term prediction is very difficult. The need to predict such a time series arises in such real-world problems as detecting arrhythmias in heartbeats. The RAN was tested on a particular chaotic time series created by the Mackey-Glass delay-difference equation:

(3.1)

for $a = 0.2$, $b = 0.1$, and $\tau = 17$. The network is given no information about the generator of the time series, and is asked to predict the future of the time series from a few samples of the history of the time series. In our example, we trained the network to predict the value at time $T + \Delta T$, from inputs at time $T$, $T - 6$, $T - 12$, and $T - 18$. The network was tested using two different learning modes: off-line learning with a limited amount of data, and on-line learning with a large amount of data. The Mackey-Glass equation has been learned off-line, by other workers, using the backpropagation algorithm (Lapedes and Farber 1987), and radial basis functions (Moody and Darken 1989). We used RAN to predict the Mackey-Glass equations with the following parameters: $\alpha = 0.02$, 400 learning epochs, $\delta_{\max} = 0.7$, $\kappa = 0.87$, and $\delta_{\min} = 0.07$ reached after 100 epochs. RAN was simulated using $\epsilon = 0.02$ and $\epsilon = 0.05$. In all cases, $\Delta T = 85$. Figures 2 and 3 compare the RAN to the other learning algorithms. Figure 2 shows the normalized error rate on a test set versus the size of the learning set for various algorithms. The test set is 500 points of the output of the Mackey-Glass equation at T = 4000. The normalized error is the rms error divided by the square root of the variance of the output of the Mackey-Glass equation. When the RAN algorithm is optimized for accuracy ($\epsilon = 0.02$), then it attains accuracy comparable to hashing B-splines. Figure 3 shows the size of the network versus the size of the learning set. As the size of the learning set grows, the number of units allocated by RAN grows very slowly. The size of the network is measured via number of weights or parameters, which is an approximation to the complexity of the network. For backpropagation, the size is the number of synapses. For the RBF networks and for RAN, there are six parameters per unit: four to describe the location of the center, one for the width, and one for the height of the gaussian. For hashing B-splines, each unit has two parameters: the hash table index and its corresponding hash table value.
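The experimental setup above can be reproduced in outline with the Python sketch below. It assumes the commonly used discrete-time Mackey-Glass map, x(t+1) = (1 - b) x(t) + a x(t - tau) / (1 + x(t - tau)^10), together with a constant initial history of 1.2; both the exponent and the initial condition are assumptions of the sketch, and the function names are hypothetical.

```python
def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2):
    """Generate n samples of the (assumed) discrete Mackey-Glass map."""
    x = [x0] * (tau + 1)  # constant history for times -tau..0 (assumed)
    for _ in range(n):
        delayed = x[-1 - tau]
        x.append((1.0 - b) * x[-1] + a * delayed / (1.0 + delayed ** 10))
    return x[tau + 1:]


def make_pairs(series, horizon, lags=(0, 6, 12, 18)):
    """Inputs at T, T-6, T-12, T-18 paired with the value at T + horizon."""
    pairs = []
    for t in range(max(lags), len(series) - horizon):
        inputs = [series[t - lag] for lag in lags]
        pairs.append((inputs, series[t + horizon]))
    return pairs


def normalized_rms(predictions, targets):
    """rms error divided by the standard deviation of the targets."""
    n = len(targets)
    mean = sum(targets) / n
    variance = sum((v - mean) ** 2 for v in targets) / n
    mse = sum((p - v) ** 2 for p, v in zip(predictions, targets)) / n
    return (mse / variance) ** 0.5


series = mackey_glass(300)
pairs = make_pairs(series, horizon=85)
```

A normalized error of 1 corresponds to always predicting the mean of the series, so the values reported in the tables and figures can be read as fractions of that baseline.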
Figure 2: The normalized rms error on a test set for various off-line learning algorithms. Backpropagation, RAN, and hashing B-splines are all competitive in error rate. (Near the backpropagation symbol, the symbol for hashing B-splines is omitted for clarity.)

Figure 3: The number of weights in the network versus the size of the training set. RAN and backpropagation are competitive in the compactness of the network. Notice that as the training set size increases, the size of the RAN stays roughly constant.
Figure 4: The error on a test set versus the size of the network. Backpropagation stores the prediction function very compactly and accurately, but takes a large amount of computation to form the compact representation. RAN is as compact and accurate as backpropagation, but uses much less computation to form its representation.

Figure 4 shows the efficiency of the various learning algorithms: the smallest, most accurate algorithms are toward the lower left. When optimized for size of network ($\epsilon = 0.05$), the RAN has about as many weights as backpropagation and is just as accurate. The efficiency of RAN is roughly the same as backpropagation, but requires much less computation: RAN takes approximately 8 min of SUN-4 CPU time to reach the accuracy listed in Figure 4, while backpropagation took approximately 30-60 minutes of Cray X-MP time. The novelty criteria and the center adjustment are both important to the performance of the RAN algorithm. We tested off-line learning of Mackey-Glass predictions using three styles of network that share the same transfer function: a flat network whose centers are chosen with the k-means algorithm, a hierarchical network whose centers are chosen with the k-means algorithm, and a RAN. Each of these networks was tested with either center adjustment via gradient descent or no center adjustment at all. Table 1 shows the normalized rms error on a test set after training off-line on 500 examples. The nonhierarchical k-means network was formed with 100 units. The hierarchical k-means network was formed with three sets of centers: k-means was run separately for 75, 20, and 5 units. In both k-means networks, the widths of the units were chosen via equation 2.6, with $\kappa = 0.87$. Using the same parameters as used above, and with $\epsilon = 0.05$, RAN allocated 100 units without center adjustment, and 95 units with center adjustment.
Table 1: Normalized rms error for various substrategies of RAN.

                    Flat network   Hierarchical network   RAN
No center adjust    0.54           0.31                   0.17
Center adjust       0.20           0.15                   0.066

Table 2: Comparison between RAN and hashing B-splines.

Method                                     Number of units   Normalized rms error
RAN                                        143               0.054
Hashing B-spline, 1 level of hierarchy     284               0.074
Hashing B-spline, 2 levels of hierarchy    1166              0.044
Table 1 shows that the three substrategies of RAN are about equally important. Using hierarchy, adjusting the centers via gradient descent, and choosing units to allocate based on the novelty conditions all seem to improve the performance by roughly a factor of 1.5 to 2. The Mackey-Glass equation has been learned using on-line techniques by hashing B-splines (Moody 1989). We used on-line RAN using the following parameters: (Y = 0.05, t = 0.02,,,S = 0.7, S, = 0.07, K = 0.87, and b,,, reached after 5000 input presentations. Table 2 compares the on-line error versus the size of network for both RAN and the hashing B-spline (Moody, personal communication). In both cases, AT = 50. The RAN algorithm has similar accuracy to the hashing B-splines, but the number of units allocated is between a factor of 2 and 8 smaller. Table 3 shows the effectiveness of the c novelty condition for online learning. When c is set very low, the network performs very well, but is very large. Raising F decreases the size of the network without substantially affecting the performance of the network. For E > 0.05, the network becomes very compact, but the accuracy becomes poor. Figure 5 shows the output of the RAN after having learned the Mackey-Glass equation on-line. In the simulations, the network learns to roughly predict the time series quite rapidly. Notice in Figure 5a the
Table 3: Effectiveness of the $\epsilon$ novelty condition.

$\epsilon$    Number of units    Normalized rms error
0             189                0.055
0.01          174                0.050
0.02          143                0.054
0.05          50                 0.071
0.10          26                 0.102
Figure 5: The output of the RAN as it learns on-line. The thick line is the output from the Mackey-Glass equation, the thin line is the prediction by the network. (a) The beginning of the learning. Very quickly, RAN picks up the basic oscillatory behavior of the Mackey-Glass equation. (b) The end of the on-line learning. At T = 10,000, the predictions match the actual output very well.
sudden jumps in the output of the network, which show that a new unit has been allocated. As more examples are shown, the network allocates more units and refines its predictions.
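The interplay between the novelty conditions and the gradient adjustment discussed in this section can be sketched in a few lines. The code below is a simplified illustration, not the author's implementation. It assumes Gaussian units, an exponentially decaying distance threshold that reaches $\delta_{\min}$ after a fixed number of presentations, a new unit centered on the input with height equal to the current error and width $\kappa$ times the distance to the nearest center, and a scaled gradient step on heights, bias, and centers. The class name `RAN`, the method names, the handling of the very first unit, and the sine target are all hypothetical.

```python
import math
import random


class RAN:
    """Minimal resource-allocating network sketch with Gaussian units."""

    def __init__(self, eps=0.02, delta_max=0.7, delta_min=0.07,
                 kappa=0.87, alpha=0.05, decay_steps=5000):
        self.eps, self.kappa, self.alpha = eps, kappa, alpha
        self.delta_max, self.delta_min = delta_max, delta_min
        # Assumed schedule: exponential decay reaching delta_min after decay_steps.
        self.tau = decay_steps / math.log(delta_max / delta_min)
        self.units = []   # each unit is [center, width, height]
        self.bias = 0.0
        self.seen = 0

    def _activations(self, x):
        return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (w * w))
                for c, w, _ in self.units]

    def predict(self, x):
        z = self._activations(x)
        return self.bias + sum(u[2] * zj for u, zj in zip(self.units, z))

    def observe(self, x, target):
        """Present one example; return True if a new unit was allocated."""
        delta = max(self.delta_max * math.exp(-self.seen / self.tau), self.delta_min)
        self.seen += 1
        z = self._activations(x)
        err = target - (self.bias + sum(u[2] * zj for u, zj in zip(self.units, z)))
        nearest = min((math.dist(x, u[0]) for u in self.units), default=None)
        far = nearest is None or nearest > delta
        if abs(err) > self.eps and far:
            # Both novelty conditions hold: allocate a unit that cancels the error at x.
            # Assumption: the very first unit uses the current threshold as its distance.
            width = self.kappa * (delta if nearest is None else nearest)
            self.units.append([list(x), width, err])
            return True
        # Otherwise adjust heights, bias, and centers by a small gradient-like step.
        for u, zj in zip(self.units, z):
            center, width, height = u
            step = 2.0 * self.alpha / width * zj * err * height
            u[0] = [c + step * (xi - c) for c, xi in zip(center, x)]
            u[2] = height + self.alpha * err * zj
        self.bias += self.alpha * err
        return False


# Hypothetical usage: learn sin(x) on [0, 3] from a stream of random samples.
rng = random.Random(0)
net = RAN()
for _ in range(3000):
    x = [rng.uniform(0.0, 3.0)]
    net.observe(x, math.sin(x[0]))
print(len(net.units), "units allocated")
```

Each call to `observe` that returns `True` corresponds to one of the sudden jumps in the network output described in the text, since a newly allocated unit immediately cancels the error at the current input.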
4 Conclusions
There are various desirable attributes for a network that learns: it should learn quickly, it should learn accurately, and it should form a compact representation. Formation of a compact representation is particularly important for networks that are implemented in hardware, because silicon
area is at a premium. A compact representation is also important for statistical reasons: a network that has too many parameters can overfit data and generalize poorly. Many previous network algorithms either learned quickly at the expense of a compact representation, or formed a compact representation only after laborious computation. The RAN is a network that can find a compact representation with a reasonable amount of computation.
Acknowledgments

Thanks to Carver Mead, Carl Ruoff, and Fernando Pineda for useful comments on the paper. Thanks to Glenn Gribble for helping to put the paper together. Special thanks to John Moody, who not only provided useful comments on the paper, but also provided data on the hashing B-splines.
References
Baum, E. B. 1989. A proposal for more powerful learning algorithms. Neural Comp. 1(2), 201-207.
Broomhead, D., and Lowe, D. 1988. Multivariable function interpolation and adaptive networks. Complex Syst. 2, 321-355.
Judd, S. 1988. On the complexity of loading shallow neural networks. J. Complex. 4, 177-192.
Lapedes, A., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Tech. Rep. LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM.
Lloyd, S. P. 1957. Least Squares Quantization in PCM. Bell Laboratories Internal Tech. Rep.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics, and Probability, L. M. LeCam and J. Neyman, eds., p. 281. University of California Press, Berkeley.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 133-143. Morgan-Kaufmann, San Mateo.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems, I, D. Touretzky, ed., pp. 29-39. Morgan-Kaufmann, San Mateo.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-167. Clarendon Press, Oxford.
Stark, L., Okajima, M., and Whipple, G. H. 1962. Computer pattern recognition techniques: Electrocardiographic diagnosis. Commun. ACM 5, 527-532.
Widrow, B., and Hoff, M. 1960. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pp. 96-104. IRE, New York.
Received 8 June 1990; accepted 16 November 1990.
Communicated by Halbert White
On the Convergence of the LMS Algorithm with Adaptive Learning Rate for Linear Feedforward Networks

Zhi-Quan Luo
Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, L8S 4L7, Canada
We consider the problem of training a linear feedforward neural network by using a gradient descent-like LMS learning algorithm. The objective is to find a weight matrix for the network, by repeatedly presenting to it a finite set of examples, so that the sum of the squares of the errors is minimized. Kohonen showed that with a small but fixed learning rate (or stepsize) some subsequences of the weight matrices generated by the algorithm will converge to certain matrices close to the optimal weight matrix. In this paper, we show that, by dynamically decreasing the learning rate during each training cycle, the sequence of matrices generated by the algorithm will converge to the optimal weight matrix. We also show that for any given $\epsilon > 0$ the LMS algorithm, with decreasing learning rates, will generate an $\epsilon$-optimal weight matrix (i.e., a matrix of distance at most $\epsilon$ away from the optimal matrix) after $O(1/\epsilon)$ training cycles. This is in contrast to the $\Omega(1/\epsilon \log 1/\epsilon)$ training cycles needed to generate an $\epsilon$-optimal weight matrix when the learning rate is kept fixed. We also give a general condition for the learning rates under which the LMS learning algorithm is guaranteed to converge to the optimal weight matrix.

1 Introduction and Problem Formulation
The error backpropagation algorithm has been widely used for training feedforward neural networks and has shown much success in a number of important applications (Gorman and Sejnowski 1988; Rumelhart et al. 1986; Sejnowski and Rosenberg 1987). The popularity of the backpropagation learning algorithm is largely due to the fact that it is incremental (i.e., the examples are learned one at a time by the network) and has a relatively fast rate of convergence. In this paper, we analyze an error backpropagation algorithm (the LMS learning rule) for training a linear feedforward neural network. Such networks, albeit simple, have proven useful in a wide variety of applications (Rumelhart and McClelland 1986), and they have been found to make reasonable generalizations and perform reasonably well on patterns that have never before been presented.

Neural Computation 3, 226-245 (1991) © 1991 Massachusetts Institute of Technology
A linear feedforward neural network $G$ with no hidden units is a two-layered directed graph. The first layer of $G$, the input layer, consists of a set of $r$ input nodes, while the second, the output layer, has $s$ nodes. There are a total of $r \cdot s$ edges in $G$ connecting each input node with all the output nodes of $G$; there are no connections among the input (output) nodes themselves. In addition, there is a real-valued weight (or connection strength) $w_{ij}$ associated with the edge between the $i$th output node and the $j$th input node. For any input vector $x \in R^r$, the $j$th component of $x$ is given to the $j$th input node of $G$, which transmits it to the output nodes through the connecting edges. Each output node $i$ computes a weighted sum $y_i = \sum_{j=1}^{r} w_{ij} x_j$. We will call $y_i$ the output of node $i$ and call the $s$-vector $y = (y_1, \ldots, y_s)^T$ the output vector of $G$, where the superscript $T$ denotes the matrix transpose. In what follows, we use $W$ to denote the $s \times r$ matrix $[w_{ij}]$. Thus, in matrix notation, we have $y = Wx$.

The problem of training a feedforward neural network $G$ (with no hidden units) is to find a set of weights $w_{ij}$ for $G$, by using a given set of examples, so that a certain criterion of optimality is achieved. In this paper, we shall assume that examples are given by the pairs of real vectors $[A(p), B(p)]$, $p = 1, \ldots, m$, where each $A(p) \in R^r$ and $B(p) \in R^s$. The vectors $A(p)$, $B(p)$ are called, respectively, the $p$th input vector and the $p$th desired output vector. The minimization of the mean-square error

$$L(W) = \sum_{p=1}^{m} L_p(W) \tag{1.1}$$

is the most often used criterion of optimality. Here, each $L_p(W) = \|WA(p) - B(p)\|^2/2$ is the error corresponding to the $p$th example. There are other criteria of optimality, such as the minimization of "cross entropy," depending on the particular application under consideration. In this paper, we shall consider only the mean-square error criterion. Let $A$ be the $r \times m$ matrix whose $p$th column is equal to $A(p)$ and let $B$ be the $s \times m$ matrix whose $p$th column is given by $B(p)$. It can be easily seen that the weight matrix $W$ minimizing 1.1 is the least-squares solution to the (possibly inconsistent) linear system of equations $WA = B$. It then follows that $L(W)$ is minimized at the point $W = BA^+$, where $A^+$ is the usual Moore-Penrose pseudoinverse matrix of $A$ (see Golub and Van Loan 1983, p. 139).

Typically, the $m$ examples $\{[A(p), B(p)]\}_{p=1}^{m}$ are fed to the network many times and they are usually presented to the network in a cyclical fashion. The weight matrix $W$ of the network is updated each time after an example is given to the network. Let $W^n(p)$ be the weight matrix of the network just before the $p$th example is being learned in the $n$th training cycle. The following iterative learning algorithm is due to Kohonen (1974).
LMS Learning Algorithm. When the $p$th example is presented to the network, the weight matrix $W^n(p)$ is updated according to

$$W^n(p+1) = W^n(p) - \alpha_n \nabla L_p = W^n(p) - \alpha_n [W^n(p)A(p) - B(p)]A^T(p) \tag{1.2}$$
where $\alpha_n > 0$ is the learning rate (or stepsize) in the $n$th training cycle. [Here, we adopt the convention that $W^n(m+1) = W^{n+1}(1)$.] The criterion of termination is that the change of the weights is less than some desired threshold value. A couple of remarks about the above LMS learning algorithm are in order.

1. From the iteration formula 1.2, the LMS learning algorithm can be seen as a gradient descent-like algorithm that iteratively improves the weight matrix $W$ with every training example $[A(p), B(p)]$. In particular, the gradient of $L(W)$ with respect to $W$ is

$$\nabla L(W) = \sum_{p=1}^{m} \nabla L_p \tag{1.3}$$

where $\nabla L_p = [WA(p) - B(p)]A^T(p)$ is the contribution to the gradient of $L(W)$ from the $p$th example. When the $p$th example is being learned, the weight matrix $W^n(p)$ is updated (cf. 1.2) along the opposite of the gradient direction of $L_p$. Thus, the resulting weight matrix will decrease the value of $L_p$. Notice that with such an update the total error $L$ may not decrease (in some situations, it may even increase), since the weight matrix is moving along the negative gradient of $L_p$, rather than the negative gradient of $L$. Thus, the convergence of the LMS algorithm does not follow from the classical results of gradient descent methods and new analysis is needed to ensure that the weight matrices generated by the LMS algorithm will indeed converge to $BA^+$.

2. Notice that the above LMS learning procedure is "incremental" in the sense that the network does a gradient search one example at a time and there is no need for the network to remember any of the previously seen examples.

3. The well-known Widrow-Hoff procedure can be viewed as a special case of the above LMS learning algorithm where the network has exactly one output node (i.e., $s = 1$). Also, the commonly known backpropagation algorithm is a generalization of the LMS learning algorithm where the neural networks may have hidden units and nonlinear (smooth) sigmoidal functions (see Rumelhart et al. 1986). The backpropagation algorithm for training such general networks works in basically the same way as the LMS learning
algorithm for training the linear feedforward networks. Thus, remarks (1) and (2) also apply to the Widrow-Hoff algorithm and the backpropagation algorithm. 4. The LMS learning algorithm is very close in spirit to the classical stochastic gradient algorithms and adaptive filtering algorithms. In particular, all of these algorithms are, in some sense, certain approximate versions of the (deterministic) gradient descent algorithm for minimizing a differentiable function. Although the LMS algorithm can be viewed as an adaptive filtering algorithm with certain special signal sequences, its convergence cannot be inferred from the existing convergence theory for adaptive filtering algorithms. We shall make a comparison of these algorithms and their convergence properties in Section 4. Despite the simplicity and long history of the LMS algorithm, its convergence property remains largely unknown. Although there has been some heuristic analysis arguing the convergence of the backpropagation algorithm (see, e.g., Rumelhart et al. 1986, p. 4441, little rigorous work is documented in the literature. To the best of our knowledge, the first rigorous analysis of the convergence properties of the LMS algorithm was due to Kohonen (1974). In particular, Kohonen established the following: Theorem 1.1. Consider the LMS algorithm with each example being learned cyclically. Let Wn(p) be given by equation 1.7. If the learning rate a , is fixed fo a small constant N > 0, then
$$\lim_{n\to\infty} W^n(p) = W_\alpha(p), \qquad \forall p = 1, \ldots, m$$
where $W_\alpha(p)$ is some matrix depending on $p$ and $\alpha$ only. Moreover, there holds
$$\lim_{\alpha\to 0^+} W_\alpha(p) = BA^+, \qquad \forall p = 1, \ldots, m \qquad (1.4)$$
Theorem 1.1 implies that if the (fixed) learning rate is small, then the subsequence $\{W^n(p)\}$ of matrices generated by the LMS algorithm will converge to some weight matrix close to the optimal solution $BA^+$. In general, for each small but fixed learning rate $\alpha$, the limiting matrices $W_\alpha(p)$ are distinct for different $p$. Moreover, the sequence converges to these limit points by "jumping" around them cyclically. (This implies that with a fixed learning rate the sequence of matrices generated by the LMS algorithm will, in general, not converge to the desired solution $BA^+$.) In this paper, we show that, by dynamically decreasing the stepsize $\alpha_n$ during each training cycle, the LMS algorithm will converge to the optimal weight matrix $BA^+$. We also show that for any given $\epsilon > 0$ the LMS algorithm, with decreasing stepsizes, will generate an $\epsilon$-optimal weight matrix (i.e., a matrix of distance $\epsilon$ away from $BA^+$) after $O(1/\epsilon)$
training cycles. This is in contrast to the $\Omega((1/\epsilon)\log(1/\epsilon))$ training cycles needed to generate an $\epsilon$-optimal weight matrix if the stepsize is kept fixed. We also give a general condition for the learning rates under which the LMS algorithm is guaranteed to converge to the optimal weight matrix $BA^+$.

The effect of adaptively altering the learning rate in each training cycle has also been the subject of study by Jacobs (1988). In particular, Jacobs proposed a heuristic scheme in which, if the current gradient is in the same direction as the previous gradient, the learning rate increases linearly with time, while if the gradient has changed directions, the learning rate decays exponentially. It has been observed numerically that such a scheme can sometimes speed up the rate of convergence, although theoretically it is not known if such a heuristic scheme would yield a sequence of weight matrices converging to $BA^+$. The issue of rate of convergence was addressed by Tesauro et al. (1989). Finally, a rigorous analysis of the backpropagation algorithm from a statistical point of view was recently given by White (1989).

The rest of this paper is organized as follows. In Section 2, we demonstrate the effects of dynamically changing the learning rates during each training cycle by considering a simple example. We show for this simple example that if we use $1/n$ as the learning rate in the $n$th training cycle then the weights generated by the backpropagation algorithm will converge to the optimal weight vector. Moreover, the convergence speed of the quadratic error function is on the order of $O(1/n^2)$, and the weight iterates converge to the optimal weight vector like $1/n$. Finally, we show that $O(1/\epsilon)$ training cycles are sufficient to obtain a set of $\epsilon$-optimal weights if the learning rates are decreasing like $1/n$, whereas if the learning rate is kept fixed, then at least $\Omega((1/\epsilon)\log(1/\epsilon))$ training cycles are needed in order to produce a set of weights of the same quality.
Section 3 contains the main convergence analysis of the LMS algorithm for a general class of learning rate updating schemes. In Section 4, we compare our convergence results for the LMS algorithm with those of stochastic gradient algorithms and adaptive filtering algorithms. Finally, Section 5 contains some further discussion on our results as compared to the rate of convergence results by Tesauro and his colleagues (1989).

2 A Simple Example
Consider a simple two-layer linear network with one input node and one output node. Suppose that there are two training examples given by $(1, c_1)$ and $(1, c_2)$, where $c_1$, $c_2$ are two distinct real numbers. Thus, the problem of training such a network becomes that of minimizing the quadratic error function
$$L(w) = \frac{1}{2}(w - c_1)^2 + \frac{1}{2}(w - c_2)^2 \qquad (2.1)$$
It is easy to see that $L(w)$ is a strongly convex function and is minimized at $w^* = (c_1 + c_2)/2$. Under the assumption that the two examples are presented to the network in a cyclical manner, the LMS algorithm with a fixed learning rate $\alpha$ can be described as follows (cf. 1.2):
$$w^n(2) = w^n(1) - \alpha[w^n(1) - c_1] \qquad (2.2)$$
$$w^{n+1}(1) = w^n(2) - \alpha[w^n(2) - c_2] \qquad (2.3)$$
for $n = 0, 1, \ldots$, where the learning rate $\alpha$ is a small positive constant. [Recall that the notation $w^n(i)$, $i = 1, 2$, denotes the weight just before learning the $i$th example in the $n$th training cycle.] For simplicity, we initialize the weight by letting
$$w^0(1) = 0 \qquad (2.4)$$
From equations 2.2-2.3, we obtain
$$\begin{aligned} w^{n+1}(1) &= (1-\alpha)\,w^n(2) + \alpha c_2 \\ &= (1-\alpha)\left[(1-\alpha)\,w^n(1) + \alpha c_1\right] + \alpha c_2 \\ &= (1-\alpha)^2 w^n(1) + (1-\alpha)\alpha c_1 + \alpha c_2 \end{aligned} \qquad (2.5)$$
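The above recurrence equation is linear and can be solved explicitly by using the initial condition $w^0(1) = 0$. Moreover, if $0 < \alpha < 1$, then it can be shown that the sequence $\{w^n(1)\}$ is convergent and the limiting value, which is the fixed point of 2.5, is given by
$$\lim_{n\to\infty} w^n(1) = w_\alpha(1) = \frac{(1-\alpha)c_1 + c_2}{2-\alpha} \qquad (2.6)$$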
Similarly, we can show that
$$\lim_{n\to\infty} w^n(2) = w_\alpha(2) = \frac{(1-\alpha)c_2 + c_1}{2-\alpha} \qquad (2.7)$$
The above example indicates that if the learning rate $\alpha$ is kept fixed, then the sequence of weights generated by the LMS algorithm is not convergent. In fact, the sequence will oscillate between two limiting points $w_\alpha(1)$ and $w_\alpha(2)$. Notice that, by 2.6 and 2.7,
$$w_\alpha(1) - w^* = \frac{\alpha(c_2 - c_1)}{2(2-\alpha)}, \qquad w_\alpha(2) - w^* = \frac{\alpha(c_1 - c_2)}{2(2-\alpha)}$$
Therefore, when $\alpha$ approaches 0, both $w_\alpha(1)$ and $w_\alpha(2)$ will converge to the optimal solution $w^* = (c_1 + c_2)/2$, as predicted by Theorem 1.1. However, as long as the learning rate $\alpha$ is positive, the two limiting points $w_\alpha(1)$ and $w_\alpha(2)$ will never be the same, and they will never be equal to the desired solution $w^*$. This suggests that if the learning rate is kept fixed during the learning process, then a small learning rate is preferred since it will lead to a close-to-optimal solution. However, if the learning rate is too small, then the rate at which $w^n(1)$ converges to $w_\alpha(1)$ will be slow.
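The oscillation between two limiting points can be checked numerically. The following is a minimal sketch, not part of the original analysis: it iterates recursions 2.2-2.3 and compares the result with the fixed points of recurrence 2.5. The function names and the example values of $c_1$, $c_2$, and $\alpha$ are hypothetical choices made for illustration.

```python
def lms_fixed_rate(c1, c2, alpha, cycles):
    """Iterate recursions 2.2-2.3 with a fixed learning rate alpha.

    Returns (w_before_1, w_before_2): the weight just before example 1 is
    learned in the cycle after the last simulated one, and the weight just
    before example 2 is learned in the last simulated cycle.
    """
    w_before_1 = 0.0                 # initial condition 2.4
    w_before_2 = w_before_1
    for _ in range(cycles):
        w_before_2 = w_before_1 - alpha * (w_before_1 - c1)   # equation 2.2
        w_before_1 = w_before_2 - alpha * (w_before_2 - c2)   # equation 2.3
    return w_before_1, w_before_2


def fixed_rate_limits(c1, c2, alpha):
    """Fixed points of recurrence 2.5 and of its counterpart for w^n(2)."""
    limit_1 = ((1 - alpha) * c1 + c2) / (2 - alpha)
    limit_2 = ((1 - alpha) * c2 + c1) / (2 - alpha)
    return limit_1, limit_2


if __name__ == "__main__":
    c1, c2, alpha = 1.0, 3.0, 0.1    # hypothetical example values
    print("simulated :", lms_fixed_rate(c1, c2, alpha, cycles=500))
    print("predicted :", fixed_rate_limits(c1, c2, alpha))
    print("optimum w*:", (c1 + c2) / 2)
```

With these values the two limiting points sit on opposite sides of $w^*$, and shrinking `alpha` moves both toward it at the cost of slower convergence.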
We now consider dynamically decreasing the learning rate for each training cycle. Intuitively, if $\alpha_n$ is decreasing too slowly, then the subsequences $\{w^n(1)\}$ and $\{w^n(2)\}$ may still converge to different limiting points, similar to the case in which the learning rate is kept constant. On the other hand, if $\alpha_n$ is decreasing too fast, then the two subsequences, though they may converge to the same limiting point, will not converge to the desired point $w^*$ since the quickly shrinking stepsizes will not allow them to ever reach the point $w^*$. For simplicity, we shall consider only the case in which $\alpha_n = 1/n$. The more general learning rate updating schemes will be treated in Section 3. With $\alpha_n = 1/n$, we have from 1.2
$$w^n(2) = w^n(1) - \frac{1}{n}\left[w^n(1) - c_1\right] \qquad (2.8)$$
$$w^{n+1}(1) = w^n(2) - \frac{1}{n}\left[w^n(2) - c_2\right], \qquad \forall n \ge 2 \qquad (2.9)$$
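Let the weight still be initialized at 0. To see that $\lim_{n\to\infty} w^n(1) = \lim_{n\to\infty} w^n(2) = (c_1 + c_2)/2$, we first notice that
$$w^{n+1}(2) = \left(1 - \frac{1}{n+1}\right) w^{n+1}(1) + \frac{c_1}{n+1} = \frac{n-1}{n+1}\, w^n(2) + \frac{c_1 + c_2}{n+1}$$
where the second equality follows by substituting 2.9 for $w^{n+1}(1)$. Let $y_n = w^n(2) - (c_1 + c_2)/2$. Then, by replacing $w^n(2)$ and $w^{n+1}(2)$ with $y_n$ and $y_{n+1}$ in the above formula, we obtain
$$y_{n+1} + \frac{c_1 + c_2}{2} = \frac{n-1}{n+1}\left(y_n + \frac{c_1 + c_2}{2}\right) + \frac{c_1 + c_2}{n+1}$$
By rearranging the terms in the above equation, we see that
$$y_{n+1} = \left(1 - \frac{2}{n+1}\right) y_n$$
which implies that $y_{n+1}$ can be represented as an infinite product
$$y_{n+1} = y_2 \prod_{i=2}^{n}\left(1 - \frac{2}{i+1}\right)$$
where $y_2 = w^2(2) - (c_1 + c_2)/2$ is some finite constant. We now need to use the following fact from calculus:

Fact 1. There exist two positive constants $d_1$ and $d_2$ such that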
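$$\frac{d_1}{n^2} \le \prod_{i=2}^{n}\left(1 - \frac{2}{i+1}\right) \le \frac{d_2}{n^2}, \qquad \forall n \ge 2$$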
In other words, as $n$ tends to infinity, the infinite product $\prod_{i=2}^{n}[1 - 2/(i+1)]$ converges to zero more or less like the sequence $1/n^2$. Using Fact 1, we see that the sequence $y_n$ converges to zero with the speed of $1/n^2$. Thus, $w^n(2)$ converges to $w^* = (c_1 + c_2)/2$ at a rate $1/n^2$, which further implies $|L[w^n(2)] - L(w^*)| = O(1/n^4)$. To see that $w^n(1)$ also converges to $w^*$, we consider equation 2.9: the right-hand side of 2.9 converges to $w^* = (c_1 + c_2)/2$.
To determine the speed at which $w^n(1)$ converges to $w^*$, we consider the following relation (cf. 2.9):
$$w^{n+1}(1) - w^* = \left[w^n(2) - w^*\right] - \frac{1}{n}\left[w^n(2) - c_2\right] \qquad (2.10)$$
The first term on the right-hand side of 2.10 converges to zero like $1/n^2$, while the second term decreases like $1/n$ since $w^n(2) - c_2$ converges to $(c_1 - c_2)/2$, which is a nonzero constant. Thus, $w^n(1)$ converges to $w^*$ at a speed comparable to $1/n$. As a result, we see that $L[w^n(1)]$ converges to $L(w^*)$ at the rate $1/n^2$. We summarize the results obtained so far in the following proposition:

Proposition 2.1. If the learning rate for the $n$th training cycle is set to be $1/n$, then both subsequences $\{w^n(1)\}_{n=1}^{\infty}$ and $\{w^n(2)\}_{n=1}^{\infty}$ (defined by equations 2.8-2.9) converge to the optimal weight $w^* = (c_1 + c_2)/2$. Moreover, we have $|w^n(1) - w^*| = O(1/n)$, $|w^n(2) - w^*| = O(1/n^2)$ and $|L[w^n(1)] - L(w^*)| = O(1/n^2)$, $|L[w^n(2)] - L(w^*)| = O(1/n^4)$.
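The rates in Proposition 2.1 can be observed numerically. The sketch below is illustrative only: it iterates 2.8-2.9 with $\alpha_n = 1/n$, and it assumes (this is our choice, not stated in the text) that the recursion starts at cycle $n = 2$ with weight 0. The function name and the values of $c_1$, $c_2$ are hypothetical.

```python
def lms_one_over_n(c1, c2, last_cycle, first_cycle=2):
    """Iterate recursions 2.8-2.9 with learning rate 1/n in cycle n.

    Assumption for illustration: the weight is 0 just before example 1 is
    learned in cycle `first_cycle`.  Returns a list of tuples
    (n, w^n(1), w^n(2)) for n = first_cycle, ..., last_cycle.
    """
    w_before_1 = 0.0
    rows = []
    for n in range(first_cycle, last_cycle + 1):
        rate = 1.0 / n
        w_before_2 = w_before_1 - rate * (w_before_1 - c1)    # equation 2.8
        rows.append((n, w_before_1, w_before_2))
        w_before_1 = w_before_2 - rate * (w_before_2 - c2)    # equation 2.9
    return rows


if __name__ == "__main__":
    c1, c2 = 1.0, 3.0                  # hypothetical example values
    w_star = (c1 + c2) / 2
    for n, w1, w2 in lms_one_over_n(c1, c2, 1000)[::111]:
        # Scaled errors: both columns should level off if the rates hold.
        print(n, n * abs(w1 - w_star), n * n * abs(w2 - w_star))
```

If the proposition holds, the scaled errors $n\,|w^n(1) - w^*|$ and $n^2\,|w^n(2) - w^*|$ should both settle near constants as $n$ grows.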
We further compare the LMS algorithm with the adaptive learning rate $\alpha_n = 1/n$ to the LMS algorithm with a fixed learning rate. In particular, for any given $\epsilon > 0$, we wish to determine the number of training cycles needed by both algorithms to generate an $\epsilon$-optimal weight $w$ (i.e., $|w - w^*| \le \epsilon$). We first consider the case $\alpha_n = 1/n$. Notice that $|w^n(1) - w^*| = O(1/n)$ and $|w^n(2) - w^*| = O(1/n^2)$. It then follows that after a total of $O(1/\epsilon)$ training cycles the weights $w^n(1)$ and $w^n(2)$ will be within distance $\epsilon$ of $w^*$. (In fact, since $|w^n(2) - w^*| = O(1/n^2)$, $w^n(2)$ will be within $\epsilon$ distance from $w^*$ after only $O(1/\sqrt{\epsilon})$ training cycles.) On the other hand, recall that if the learning rate $\alpha_n$ is kept fixed at some $\alpha > 0$, then the two subsequences $\{w^n(1)\}_{n=1}^{\infty}$ and $\{w^n(2)\}_{n=1}^{\infty}$ will converge to the two distinct limiting points $w_\alpha(1)$ and $w_\alpha(2)$. Using equations 2.6-2.7, we see that $|w_\alpha(1) - w^*| = \Omega(\alpha)$ and $|w_\alpha(2) - w^*| = \Omega(\alpha)$. Thus, if an $\epsilon$-optimal weight $w$ is to be generated by the LMS algorithm with a fixed learning rate $\alpha$, then there must hold $\alpha = O(\epsilon)$. Now we let $z_n = w^n(1) - w_\alpha(1)$. Using the definition of $w_\alpha(1)$ (cf. 2.6) and equation 2.5, we obtain
$$z_{n+1} = (1 - \alpha)^2 z_n, \qquad \forall n \ge 0$$
It follows that $z_n = (1 - \alpha)^{2n} z_0$, where $z_0$ is a nonzero constant. Thus, in order for $|z_n| \le \epsilon$, we must have
$$\left|(1 - \alpha)^{2n} z_0\right| \le \epsilon$$
which implies that $n = \Omega((1/\alpha)\log(1/\epsilon))$. Since $\alpha = O(\epsilon)$, we see that $n = \Omega((1/\epsilon)\log(1/\epsilon))$. Thus, we have shown the following:

Proposition 2.2. Let $\epsilon$ be any positive real number. If $\alpha_n = 1/n$, then the LMS algorithm can generate an $\epsilon$-optimal weight $w$ after $O(1/\epsilon)$ training cycles. However, if the learning rate is kept fixed, then at least $\Omega((1/\epsilon)\log(1/\epsilon))$ training cycles are needed for the LMS algorithm to generate an $\epsilon$-optimal weight $w$.

Proposition 2.2 indicates that letting the learning rate decrease appropriately will not only make the LMS algorithm converge to the desired optimal solution but also reduce the number of training cycles needed to generate an $\epsilon$-optimal solution. In the following section, we shall see that all of the convergence results derived in this section for the simple example 2.1 can be extended to the general case in which the number of nodes in the output layer of a neural network is arbitrary.

3 Convergence Analysis of the LMS Algorithm with Decreasing Learning Rate

We first introduce our notations:

Notations. Let $A$ ($B$, respectively) be the $r \times m$ ($s \times m$, respectively) matrix whose $p$th column is equal to $A(p)$ [$B(p)$, respectively]. For any matrix $M$, we shall use $N(M)$ to denote the null space of $M$. In other words, $N(M)$ is the subspace of vectors that are orthogonal to the rows of $M$.
We make the following assumption on the training process.

Assumption A (Cyclic Learning Rule). The entire training process consists of infinitely many training cycles. During each cycle, each of the $m$ examples is learned exactly once according to some fixed order. For simplicity, let $\{1, 2, \ldots, m\}$ be the order in which the examples are learned in each training cycle.

Our main result is the following:

Theorem 3.1. Consider training a linear feedforward network using the LMS algorithm 2.2 with the cyclic learning rule (cf. Assumption A). Let the initial
weight matrix be given by $W^1(1) = 0$. If $\alpha_n$ ($n \ge 1$), the learning rate for the $n$th training cycle, satisfies the following conditions:
$$\sum_{n=1}^{\infty} \alpha_n = \infty, \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty \qquad (3.1)$$
then the sequence of weight matrices generated by the LMS algorithm (cf. equation 2.2) will converge to the optimal weight matrix $BA^+$.

Before proving Theorem 3.1, let us first establish the following lemma, related to Lemma 1 of Sacks (1958) and Lemma 2.1 of Fabian (1968).
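Lemma 3.1. Suppose that $\{z_n\}$ is a sequence of real numbers satisfying
$$z_{n+1} = (1 - \beta_n) z_n + O(\beta_n^2), \qquad z_1 = r_0 \qquad (3.2)$$
where $r_0$ is an arbitrary real number and $\{\beta_n : \beta_n \ge 0, n \ge 1\}$ satisfies
$$\sum_{n=1}^{\infty} \beta_n = \infty, \qquad \sum_{n=1}^{\infty} \beta_n^2 < \infty \qquad (3.3)$$
Then, we have $\lim_{n\to\infty} z_n = 0$.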
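As a numerical illustration of Lemma 3.1 (a sketch under our own assumptions, not part of the proof), the code below iterates recursion 3.2 with the remainder term taken to be exactly $\beta_n^2$. The function name, the starting value, and the two step sequences $\beta_n = 1/(n+1)$ and $\beta_n = 1/(n+1)^2$ are hypothetical choices; the second sequence violates the first half of condition 3.3.

```python
def iterate_lemma(beta, r0, steps):
    """Iterate z_{n+1} = (1 - beta_n) * z_n + beta_n**2 starting from z_1 = r0.

    The O(beta_n**2) term of recursion 3.2 is modeled, for illustration only,
    by beta_n**2 itself.  Returns z_{steps + 1}.
    """
    z = r0
    for n in range(1, steps + 1):
        b = beta(n)
        z = (1.0 - b) * z + b * b
    return z


if __name__ == "__main__":
    # beta_n = 1/(n+1): the sum diverges and the sum of squares converges,
    # so condition 3.3 holds and z_n should drift toward zero.
    print(iterate_lemma(lambda n: 1.0 / (n + 1), r0=5.0, steps=100_000))
    # beta_n = 1/(n+1)**2: the sum itself converges, so 3.3 fails and
    # z_n is expected to stall away from zero.
    print(iterate_lemma(lambda n: 1.0 / (n + 1) ** 2, r0=5.0, steps=100_000))
```

The contrast between the two runs shows why the divergence of $\sum \beta_n$ is needed: without it the factors $1 - \beta_n$ cannot erase the initial value.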
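Proof. Let $r_n$ denote the last term of 3.2. Thus, there exists some positive constant $C$ such that $|r_n| \le C\beta_n^2$, for all $n \ge 1$. By repeatedly applying 3.2, we see that
$$z_{n+1} = \sum_{i=0}^{n} \rho(n, i)\, r_i \qquad (3.4)$$
where
$$\rho(n, i) = \prod_{j=i+1}^{n} (1 - \beta_j), \qquad i = 0, \ldots, n-1$$
and $\rho(n, n) = 1$. Since $\sum_{n=1}^{\infty} \beta_n = +\infty$, it follows from a well-known fact in calculus (see Apostol 1957, p. 382) that $\lim_{n\to\infty} \rho(n, i) = 0$, for all fixed $i$. Since $|r_i| \le C\beta_i^2$ and $\sum_{i=1}^{\infty} \beta_i^2 < \infty$, we see that
$$\sum_{i=0}^{\infty} |r_i| \le C_1$$
for some positive constant $C_1$. Let $\epsilon$ be any positive number so that $\epsilon/2C_1 < 1$. It follows from the above equation that there exists an integer $I$ such that
$$\sum_{i=I+1}^{\infty} |r_i| \le \frac{\epsilon}{2} \qquad (3.5)$$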
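Since $\lim_{n\to\infty} \rho(n, i) = 0$ for all fixed integer $i$, we can choose a positive integer $N$ such that
$$|\rho(n, i)| \le \frac{\epsilon}{2C_1}, \qquad \forall n \ge N, \quad i = 0, \ldots, I \qquad (3.6)$$
Thus, for every $n \ge N$, we have $|\rho(n, i)| < 1$ and
$$|z_{n+1}| \le \sum_{i=0}^{I} |\rho(n, i)|\,|r_i| + \sum_{i=I+1}^{n} |r_i| \qquad (3.7)$$
$$\le \frac{\epsilon}{2C_1}\, C_1 + \frac{\epsilon}{2} = \epsilon \qquad (3.8)$$
where 3.7 follows from the observation that $|\rho(n, i)| < 1$ for all $n \ge N$ and for all $i$ (cf. 3.6) and 3.8 follows from 3.5 and 3.6. This shows that
$$\lim_{n\to\infty} z_n = 0$$
as desired. Q.E.D.

Proof of Theorem 3.1. We prove Theorem 3.1 by showing that for each $p$ ($1 \le p \le m$) the subsequence of matrices $\{W^n(p)\}$ converges to $BA^+$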
as $n$ approaches infinity. Due to symmetry, we only need to consider the subsequence $\{W^n(1)\}$. Let us first derive a recurrence equation for the subsequence $\{W^n(1)\}$. Notice that equation 1.2 can be rewritten as
$$W^n(p+1) = W^n(p)\left[I - \alpha_n A(p)A^T(p)\right] + \alpha_n B(p)A^T(p), \qquad 1 \le p \le m, \quad \forall n \ge 1 \qquad (3.9)$$
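It then follows that
$$W^{n+1}(1) = W^n(1)\prod_{p=1}^{m}\left[I - \alpha_n A(p)A^T(p)\right] + \alpha_n \sum_{i=1}^{m} B(i)A^T(i) \prod_{p=i+1}^{m}\left[I - \alpha_n A(p)A^T(p)\right] \qquad (3.10)$$
$$= W^n(1)\left[I - \alpha_n \sum_{p=1}^{m} A(p)A^T(p)\right] + \alpha_n \sum_{p=1}^{m} B(p)A^T(p) + \alpha_n^2 R(\alpha_n) \qquad (3.11)$$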
where 3.10 is a result of repeatedly applying equation 3.9 to all the examples in the $n$th training cycle, and 3.11 follows from expanding the products in equation 3.10 and collecting the second and higher order terms in
the expansion. Note that $R(\alpha_n)$ in 3.11 stands for a matrix whose entries are bounded as $\alpha_n$ approaches zero. Moreover, by a careful examination of the terms in the expansion of 3.10, the following can be verified:

Observation 1. $N(R(\alpha_n))$, the null space of the matrix $R(\alpha_n)$ in equation 3.11, contains all the vectors perpendicular to the input vectors $\{A(p) : p = 1, \ldots, m\}$. In other words, if a vector $u$ is in the null space of $A^T$, then $u$ is in the null space of $R(\alpha_n)$.
Using matrix notation, we see that equation 3.11 can be rewritten in the following simple form:
$$W^{n+1}(1) = W^n(1)\left(I - \alpha_n AA^T\right) + \alpha_n BA^T + \alpha_n^2 R(\alpha_n) \qquad (3.12)$$
It remains to show that the sequence of matrices $W^n(1)$ generated by 3.12, under the assumption that the $\alpha_n$'s satisfy the condition 3.1, will converge to $BA^+$. To show this, we shall use the Singular Value Decomposition of $A$ and $A^T$ (see, for example, Golub and Van Loan 1983, p. 18):
$$V^T A U = D, \qquad U^T A^T V = D^T \qquad (3.13)$$
where $V$, $U$ are two orthonormal matrices of sizes $r \times r$ and $m \times m$, respectively, and $D$ is an $r \times m$ diagonal matrix of the form
$$D = \mathrm{diag}[\sigma_1, \sigma_2, \ldots, \sigma_k, 0, \ldots, 0] \qquad (3.14)$$
Here, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k > 0$ are called the singular values of $A$. Moreover, we have
$$N(A^T) = \mathrm{span}\{v_{k+1}, \ldots, v_r\} \qquad (3.15)$$
where $N(A^T)$ denotes the null space of $A^T$ and $v_{k+1}, \ldots, v_r$ are the $(k+1)$st, $\ldots$, $r$th column vectors of $V$. From 3.13, we have
$$V^T A A^T V = DD^T = \mathrm{diag}[\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2, 0, \ldots, 0]$$
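The relations 3.13-3.15 can be checked numerically. The following sketch uses NumPy's `numpy.linalg.svd`, which returns factors with $A = V D U^T$ in the notation of this section (the first returned factor plays the role of $V$, the transpose of the last one plays the role of $U$). The test matrix is a hypothetical example of our own choosing with $r = 4$, $m = 3$, and full column rank; the diagonal matrix $C$ of reciprocal squared singular values is the one used at the end of the proof.

```python
import numpy as np

# Hypothetical 4 x 3 input matrix A (r = 4, m = 3) with full column rank.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
r, m = A.shape

# numpy returns A = V @ D @ U_T, i.e. V^T A U = D as in equation 3.13.
V, sigma, U_T = np.linalg.svd(A, full_matrices=True)
U = U_T.T
k = int(np.sum(sigma > 1e-12))           # number of positive singular values

D = np.zeros((r, m))
D[:k, :k] = np.diag(sigma[:k])           # equation 3.14

C = np.zeros((r, r))
C[:k, :k] = np.diag(1.0 / sigma[:k] ** 2)

# Candidate pseudoinverse built from the SVD factors: A^+ = U D^T C V^T.
A_pinv = U @ D.T @ C @ V.T

if __name__ == "__main__":
    print(np.allclose(V.T @ A @ U, D))               # equation 3.13
    print(np.allclose(V.T @ A @ A.T @ V, D @ D.T))   # diagonalization of A A^T
    print(np.allclose(A.T @ V[:, k:], 0.0))          # equation 3.15
    print(np.allclose(A_pinv, np.linalg.pinv(A)))    # agrees with numpy's pinv
```

The last check compares the SVD-based product with `numpy.linalg.pinv`, which is the Moore-Penrose property the proof relies on in its final step.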
Let $X^n = W^n(1)V$, for all $n \ge 1$. Multiplying 3.12 with $V$ from the right, we obtain
$$X^{n+1} = X^n\left(I - \alpha_n DD^T\right) + \alpha_n F + \alpha_n^2 R(\alpha_n)V \qquad (3.16)$$
where $F = BA^TV$. Let us pause for a moment to map out the directions for the most intricate part of our proof. Note that the convergence of $\{W^n(1)\}$ is equivalent to the convergence of $\{X^n\}$, since the sequence $\{X^n\}$ is obtained from $\{W^n(1)\}$ through an invertible linear transformation $V$. Thus, it suffices to show that the sequence $\{X^n\}$ is convergent. We shall prove this by arguing that each entry of $\{X^n\}$ is convergent. Finally, we complete the proof of Theorem 3.1 by showing that the limit matrix to which $\{W^n(1)\}$ converges is equal to $BA^+$.
Consider an arbitrary entry, say the $(i, j)$th entry, of the matrix equation 3.16. In the rest of the proof, we shall use the notations $X^{n+1}_{ij}$, $X^n_{ij}$, $F_{ij}$, and $R^n_{ij}$ to denote, respectively, the $(i, j)$th entry of $X^{n+1}$, $X^n$, $F$, and $R(\alpha_n)V$. There are two cases.

Case 1. $j \ge k + 1$. Then, equations 3.14 and 3.16 imply that
$$X^{n+1}_{ij} = X^n_{ij} + \alpha_n F_{ij} + \alpha_n^2 R^n_{ij}, \qquad \forall n \ge 1 \qquad (3.17)$$
Using Observation 1 and 3.15, we see that the $(k+1)$st, $\ldots$, $r$th columns of the matrices $BA^TV$ and $R(\alpha_n)V$ are equal to zero. Thus, $F_{ij}$ and $R^n_{ij}$ in equation 3.17 are both equal to zero. As a result, during the entire process of learning, $X^n_{ij}$ remains unchanged. Since $W^1(1)$ is initialized as the zero matrix and $X^n = W^n(1)V$, we see that $X^1_{ij} = 0$ and $X^n_{ij} = 0$ for all $n \ge 2$. Thus, $\lim_{n\to\infty} X^n_{ij} = 0$.

Case 2. $j \le k$. In this case, we have, using equations 3.14 and 3.16, that
$$X^{n+1}_{ij} = \left(1 - \alpha_n \sigma_j^2\right) X^n_{ij} + \alpha_n F_{ij} + \alpha_n^2 R^n_{ij}, \qquad \forall n \ge 1 \qquad (3.18)$$
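For $n \ge 1$, we introduce a new variable $y_n \stackrel{\mathrm{def}}{=} X^n_{ij} - F_{ij}/\sigma_j^2$. Then, equation 3.18 can be rewritten as
$$y_{n+1} = \left(1 - \alpha_n \sigma_j^2\right) y_n + \alpha_n^2 R^n_{ij} \qquad (3.19)$$
Since $R^n_{ij}$ is bounded and $\alpha_n$ satisfies the condition 3.1, it follows from equation 3.19 and Lemma 3.1 (with the correspondence $y_n \leftrightarrow z_n$, $\alpha_n \sigma_j^2 \leftrightarrow \beta_n$) that $\lim_{n\to\infty} y_n = 0$. In other words, $\lim_{n\to\infty} X^n_{ij} = F_{ij}/\sigma_j^2$.

The above analysis shows that the sequence of numbers obtained by considering an arbitrary entry of the matrix sequence $\{X^n\}$ is convergent. It then follows that $\{X^n\}$ is convergent, which further implies that $\{W^n(1)\}$ is convergent. Therefore, it only remains to show that the limit matrix to which $\{W^n(1)\}$ converges is equal to $BA^+$. Summarizing the above analysis, we have
$$\lim_{n\to\infty} X^n_{ij} = \begin{cases} F_{ij}/\sigma_j^2, & \text{if } j \le k \\ 0, & \text{if } j \ge k+1 \end{cases}$$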
Putting the above results in the matrix form, we see that
$$\lim_{n\to\infty} X^n = F \cdot \mathrm{diag}\left[1/\sigma_1^2, \ldots, 1/\sigma_k^2, 0, \ldots, 0\right] = FC = BA^TVC$$
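where $C = \mathrm{diag}[1/\sigma_1^2, \ldots, 1/\sigma_k^2, 0, \ldots, 0]$ is an $r \times r$ diagonal matrix. This, together with the fact that $W^n(1) = X^n V^T$, implies that
$$\lim_{n\to\infty} W^n(1) = \left(\lim_{n\to\infty} X^n\right) V^T = BA^TVCV^T = BUD^TCV^T = BA^+$$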
where the third step is due to 3.13 and the last step follows from a property of the Moore-Penrose pseudoinverse (see, e.g., Golub and Van Loan 1983, p. 139). This completes the proof of Theorem 3.1. Q.E.D.

A couple of remarks about Theorem 3.1 are in order.

1. Theorem 3.1 assumes that $W^1(1)$ is initialized at the zero matrix. Without this assumption, the argument for Case 1 cannot go through. In fact, if the initial weight matrix $W^1(1)$ is chosen arbitrarily, then it is generally not true that the weight matrices generated by the LMS algorithm will converge to $BA^+$. However, with some minor modifications, the above argument can be used to show that $\{W^n(p)\}$ still converges to some weight matrix $W$ that minimizes the error function $L$. (Note that there may be more than one matrix that minimizes the error function $L$.) Moreover, such a limit matrix $W$ is independent of $p$; $W$ depends solely on the problem data and the choice of the initial weight matrix $W^1(1)$.

2. There are many ways to choose the learning rates $\{\alpha_n : \alpha_n > 0, n \ge 1\}$ that satisfy condition 3.1. For example, we can let $\alpha_n = 1/n^\delta$, for some $\delta \in (1/2, 1]$. Though any sequence of learning rates satisfying 3.1 will guarantee global convergence, the choice of the learning rates has a direct effect on the speed of convergence. In practice, one would like to choose the learning rates $\{\alpha_n\}$ that render the fastest rate of convergence. Notice that the convergence speed of the LMS algorithm is determined by the speed at which the sequence $y_n$ converges to zero. Let us consider the learning rate updating scheme $\alpha_n = 1/n^\delta$, for all $n \ge 1$. For the convenience of analysis, let us assume that $y_n = O(1/n^\mu)$. The objective is to choose a $\delta$ that maximizes $\mu$. Using equation 3.19, we have
where { is some series uniformly bounded away from zero and infinity. Thus, we see
+
+
Notice that for $\delta \in (1/2, 1]$, we have $1 \ge \delta$. Thus, assuming no cancellations occur, we conclude from the above equation that $\delta + \mu = 2\delta$, which implies that $\mu = \delta$. In other words, $y_n$ converges to zero more or less like the sequence $1/n^\delta$. This line of reasoning suggests that we should choose the learning rates $\alpha_n = 1/n$, in which case the weight matrices generated by the LMS algorithm will converge to $BA^+$ at least like $1/n$. It then follows that $O(1/\epsilon)$ training cycles are sufficient for the LMS algorithm to generate an $\epsilon$-optimal weight matrix. Moreover, $L[W^n(1)] - L(BA^+)$ converges to zero like $1/n^2$.

3. Finally, we remark that the results of Theorem 3.1 remain valid even if the examples are learned in an almost cyclic manner.
Assumption B (Almost Cyclic Learning Rule). The entire training process consists of infinitely many training cycles. During each cycle, each of the $m$ examples is learned exactly once. The order in which the examples are learned in each training cycle may be different.

In fact, our proof of Theorem 3.1 can be easily generalized to the case in which the learning rule is almost cyclic. Specifically, only some notational changes in equations 3.9-3.11 are needed to account for the different ordering in each training cycle; and it can be seen that 3.12 and Observation 1, which are the key steps in the entire proof, still hold. The rest of the proof can be copied verbatim.
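The statement of Theorem 3.1 can be explored numerically with the following sketch, which is an illustration under our own assumptions rather than the author's implementation. It applies the per-example update 3.9 with $W^1(1) = 0$, either in a fixed order (Assumption A) or in a freshly shuffled order each cycle (Assumption B), and compares the result with $BA^+$ computed by `numpy.linalg.pinv`. The function name `lms_train`, the small data set ($r = 2$, $m = 3$, $s = 1$), and the cycle count are hypothetical.

```python
import numpy as np


def lms_train(A, B, cycles, rate, rng=None):
    """Run the LMS update 3.9 for a number of training cycles.

    A is r x m (inputs as columns), B is s x m (targets as columns),
    rate(n) is the learning rate used in cycle n.  If rng is given, the
    examples are visited in a new random order in every cycle (almost
    cyclic rule); otherwise the order 1, ..., m is used.
    """
    r, m = A.shape
    W = np.zeros((B.shape[0], r))             # zero initial weights
    for n in range(1, cycles + 1):
        step = rate(n)
        order = range(m) if rng is None else rng.permutation(m)
        for p in order:
            a, b = A[:, p], B[:, p]
            W = W - step * np.outer(W @ a - b, a)   # equation 3.9
    return W


if __name__ == "__main__":
    # Hypothetical data: three examples that cannot all be fit exactly.
    A = np.array([[2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0]])
    B = np.array([[1.0, 2.0, 0.0]])
    W_opt = B @ np.linalg.pinv(A)

    W_dec = lms_train(A, B, 2000, lambda n: 1.0 / n)
    W_fix = lms_train(A, B, 2000, lambda n: 0.05)
    print("optimal        :", W_opt)
    print("rate 1/n error :", np.linalg.norm(W_dec - W_opt))
    print("rate 0.05 error:", np.linalg.norm(W_fix - W_opt))
```

In this example all squared singular values of $A$ exceed 1, which is the regime in which the $1/n$ schedule is expected to be fast; with very small singular values the same schedule can converge much more slowly.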
4 Comparisons with Stochastic Gradient Descent Algorithms and Adaptive Filtering Algorithms
The LMS algorithm is very close in spirit to the classical stochastic gradient descent algorithms for minimizing a differentiable function (see Bertsekas and Tsitsiklis 1989, §7.8; Kushner and Clark 1978). In what follows, we shall compare the convergence properties of the LMS algorithm with those of the stochastic gradient algorithms and the adaptive filtering algorithms.

4.1 The LMS Algorithm vs. Stochastic Gradient Descent Algorithms. Let $F: \Re^n \to \Re$ be a differentiable cost function to be minimized. In the context of neural networks, $F$ corresponds to the total error function $L(W)$.
Under the assumption that only a noisy measurement of the gradient is available, stochastic gradient algorithms can be described as
$$x(n+1) = x(n) - \gamma\left\{\nabla F[x(n)] + w(n)\right\} \qquad (4.1)$$
where $w(n)$ is the noise in the measurement of $\nabla F[x(n)]$ and $\gamma > 0$ is the stepsize. In the convergence study of stochastic gradient algorithms, it is typically assumed that the noise $w(n)$ is independent of $x(n)$. Suppose that $w(n)$ has variance $\sigma^2 > 0$. Then, in light of 4.1, we see that $x(n)$ has variance at least $\gamma^2\sigma^2$. This implies that if the stepsize $\gamma$ is kept fixed throughout the computation, then $x(n)$ will never converge to a (local) minimum $x^*$ of $F$. In most cases, $x(n)$ will reach a neighborhood of $x^*$ and start moving randomly around $x^*$. Moreover, the radius of such a neighborhood is typically a linear function of $\gamma$; the smaller the stepsize $\gamma$, the closer $x(n)$ can reach $x^*$. Thus, in order to obtain a good estimate of $x^*$, $\gamma$ should be chosen small. On the other hand, if $\gamma$ is too small, then the stochastic algorithm may take too many steps to generate a good estimate of $x^*$. These phenomena are very similar to the behaviors of the LMS algorithm with fixed learning rate, where the iterates usually oscillate around a set of limiting points near the global optimum (see Theorem 1.1). To ensure the convergence of $x(n)$ to $x^*$, we can use a time-varying stepsize $\gamma_n$. As a result, equation 4.1 is replaced by
$$x(n+1) = x(n) - \gamma_n\left\{\nabla F[x(n)] + w(n)\right\} \qquad (4.2)$$
It was shown (Kushner and Clark 1978) that under the condition
$$\sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n^2 < \infty \qquad (4.3)$$
the stochastic algorithm 4.2 will converge to $x^*$. Intuitively, the first part of the condition 4.3 is necessary to ensure that $x(n)$ can move far enough to reach $x^*$, and it is needed even without the presence of noise. The second part of 4.3 is used to ensure that the variance of $x(n)$ converges to zero. Notice that 4.3 is exactly identical to our convergence condition 3.1 for the LMS algorithm. This should not come as a surprise, however, since both the LMS algorithm and the stochastic gradient algorithm are, in some sense, "inexact" versions of the (deterministic) gradient descent algorithm. The only difference is that for the LMS algorithm the gradient of the objective function $L$ is approximated by the gradient of an individual error function $L_p$, whereas for stochastic gradient algorithms the gradient $\nabla F$ is approximated by its noisy measurement. For both algorithms, the second part of the convergence condition 4.3, or 3.1, is used basically to ensure that the accumulated approximation errors of the gradient vector converge to zero.
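The contrast between a fixed stepsize in 4.1 and a decreasing stepsize in 4.2 can be illustrated with a small simulation. This is a sketch under assumptions of our own: the cost is taken to be $F(x) = x^2/2$ (so $\nabla F(x) = x$ and $x^* = 0$), the noise is standard Gaussian generated with NumPy's `default_rng`, and the function name, seed, and step counts are hypothetical.

```python
import numpy as np


def noisy_gradient_descent(steps, stepsize, seed=0, tail=1000):
    """Iterate x(n+1) = x(n) - gamma_n * (grad F[x(n)] + w(n)) for F(x) = x^2 / 2.

    stepsize(n) returns gamma_n; w(n) is standard Gaussian noise.
    Returns the mean of x(n)^2 over the last `tail` iterates, i.e. the
    average squared distance from the minimizer x* = 0.
    """
    rng = np.random.default_rng(seed)
    x = 1.0
    squared = []
    for n in range(1, steps + 1):
        noise = rng.standard_normal()
        x = x - stepsize(n) * (x + noise)     # grad F(x) = x for this F
        if n > steps - tail:
            squared.append(x * x)
    return float(np.mean(squared))


if __name__ == "__main__":
    fixed = noisy_gradient_descent(10_000, lambda n: 0.5)
    decreasing = noisy_gradient_descent(10_000, lambda n: 1.0 / n)
    print("fixed stepsize 0.5    :", fixed)
    print("decreasing stepsize 1/n:", decreasing)
```

With the fixed stepsize the iterates keep fluctuating around $x^*$ with a variance that does not shrink, while the $1/n$ schedule averages the noise out, in line with the roles of the two parts of condition 4.3.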
Note that we have assumed in our analysis that the training data presented to the neural network are deterministic. The stochastic gradient algorithms also permit deterministic training data. For example, the training data can be generated as a realization of a random sequence with certain specific properties. Therefore the convergence results (Theorem 3.1) for the LMS algorithm can be viewed as a property of the iteration matrices for a particular realization of some random sequence. The standard results for stochastic gradient algorithms study the convergence properties for all possible realizations of the random sequence and they typically assert the convergence of certain iterates with probability 1. In other words, the convergence of the iterates is not always guaranteed; there may exist certain noise patterns for which a stochastic gradient algorithm can fail to converge, although such an event has probability zero. Thus, one cannot use the standard convergence results for stochastic gradient algorithms to infer Theorem 3.1.

4.2 The LMS Algorithm vs. Adaptive Filtering Algorithms. Another type of stochastic optimization method that closely resembles the LMS algorithm is the class of adaptive filtering algorithms commonly used in adaptive control and system identification of stochastic systems. Let $\{d_n\}$ be a sequence of real-valued (vector) reference signals and let $\{y_n\}$ be a sequence of observable (vector) signals. Typically, certain stationarity or almost stationarity assumptions are made on the random sequences $\{d_n\}$ and $\{y_n\}$. The basic adaptive filtering problem is to calculate a sequence of weight matrices $\{W_n\}$ that converge to $W^*$, a minimizer of
(By using Gauss-Markov estimations, it can be shown that the minimizer $W^*$ does not depend on $n$.) Adaptive filtering algorithms for solving this problem can be described as
where $\alpha_n$ represents the stepsize. Notice that the above algorithm, if we ignore the randomness in the input data $\{d_n\}$ and $\{y_n\}$, is exactly identical to the LMS algorithm 1.2. In particular, $\{y_n\}$ and $\{d_n\}$ will correspond to, respectively, the input and the output vectors of the training examples. Thus, by regarding the deterministic training examples as random variables with variance zero, we can view the problem of training a feedforward neural network as an adaptive filtering problem where both the observable signal sequence $\{y_n\}$ and the reference signal sequence $\{d_n\}$ are independent but nonstationary. The adaptive filtering algorithms have been studied extensively by many researchers. In particular, Ljung (1982) treated a very similar problem where the observable signal sequence $\{y_n\}$ and the reference signal sequence $\{d_n\}$ are given by some periodic and deterministic sequences.
However, unlike our analysis where the step sizes $\{\alpha_n\}$ are chosen by 4.5-4.7, Ljung chose the step sizes $\{\alpha_n\}$ so as to guarantee a descent in the cost function. Moreover, he made no attempt to analyze the rate of convergence and his convergence results are weaker than ours. There have been many other convergence studies of adaptive filtering algorithms. Most of these results assert (under various assumptions on the noise pattern) that the iterates generated by 4.4 converge to the optimal weight matrix with probability 1. A typical assumption is that both the observable signal sequence $\{y_n\}$ and the reference signal sequence $\{d_n\}$ are independent and stationary. There has been some limited progress in the convergence study of the adaptive filtering algorithms with certain special nonstationary signal sequences. To the best of our knowledge, the weakest convergence condition was due to Zhu and Yin (1988), who established the almost sure convergence for a special class of correlated and nonstationary signals. [We shall consider only the uncorrelated signals here, since in our analysis training data are assumed to be deterministic (thus, independent) for neural networks.] In particular, it was shown that if there exist three constant nonnegative definite matrices $Q_1$, $Q_2$, $Q_3$ and a sequence $\{m_n\}$ of positive constants such that (4.5) (4.6) (4.7) then the iterates generated by the adaptive algorithm 4.4 will converge to an optimal weight matrix with probability 1, provided that the step sizes $\{\alpha_n\}$ satisfy the condition
$$\alpha_n > 0, \qquad \alpha_n m_n \to 0, \qquad \sum_{n=1}^{\infty} \alpha_n m_n = \infty \qquad (4.8)$$
Roughly speaking, condition 4.8 corresponds to the first part of the convergence condition 3.1 ($\sum_{n=1}^{\infty} \alpha_n = \infty$) for the LMS algorithm. Thus, the convergence condition 4.8 is much weaker than 3.1. This would not have been possible if it were not for the restrictive assumptions 4.5-4.7 on the signal sequences. In the context of neural networks, conditions 4.5-4.7 basically translate to the requirement that each example is a scalar multiple of any other example. In other words, these conditions imply that $A(p) = h_p A(1)$, $B(p) = h_p B(1)$, $p = 1, \ldots, m$, where each $h_p$ is some real scalar. [Recall that $\{[A(1), B(1)], \ldots, [A(m), B(m)]\}$ are the $m$ given training examples.] Under such a requirement, each individual error function $L_p$ is a scalar multiple of every other individual error function $L_q$, which further implies that $\nabla L_p$ is a scalar multiple of $\nabla L$. For this degenerate case, the LMS algorithm is really the same as the usual gradient descent algorithm. This explains why condition 4.8 alone is enough to imply convergence. In summary, our convergence results do not follow from
the existing convergence theory for the adaptive filtering algorithms because of certain restrictive assumptions used in those analyses (Zhu and Yin 1988).
5 Concluding Remarks
In this paper, we have provided a general condition on learning rates under which the weight matrices generated by the LMS algorithm will converge to the desired optimal matrix. Our results have strengthened the results of Kohonen (1974), which established some convergence properties of the LMS algorithm when the learning rate is kept constant during the entire training process. In fact, we have shown (Section 2), by way of an example, that a constant learning rate in general does not guarantee convergence to the optimal solution. In addition, we have seen that if the learning rate is kept constant during the entire training process then a total of $\Omega((1/\epsilon)\log(1/\epsilon))$ training cycles are required to find an $\epsilon$-optimal solution. This is in contrast to the $O(1/\epsilon)$ training cycles needed to generate an $\epsilon$-optimal solution when the learning rates are allowed to decrease gradually in the training process. Thus, by dynamically decreasing the learning rates during the training process we can reduce the total number of training cycles needed to generate an $\epsilon$-optimal solution by a factor of $\log(1/\epsilon)$.

Finally, we remark that our analysis is global in nature since we do not make any assumptions on the asymptotic behaviors of individual error functions $L_p(W)$. This is in contrast to the analysis of Tesauro et al. (1989), who analyzed the algorithm's behavior when the individual error $L_p$ is small. Tesauro et al. (1989) pointed out the need to analyze the LMS algorithm at the times earlier in the learning process, when not all the individual errors are small. Our paper can be viewed as a step in this direction. Tesauro and his colleagues also argued that the speed of convergence of the error function $L$ cannot be faster than $O(1/n)$, which contradicts our results in Sections 2 and 3. This discrepancy may be explained by the inappropriate assumptions (e.g., the individual error $L_p$ converges to zero even with a fixed learning rate) used in their analysis.
Acknowledgments
The author would like to thank Drs. Paul Tseng, Sanzheng Qiao, and Xiaoyuan Tu for their help during the preparation of this paper. The referee's insightful comments are also greatly appreciated. This research is supported by a grant from the Science and Engineering Research Board of McMaster University.
References
Apostol, T. M. 1957. Mathematical Analysis, A Modern Approach to Advanced Calculus. Addison-Wesley, Reading, MA.
Bertsekas, D. P., and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation, Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Fabian, V. 1968. On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39, 1327-1332.
Golub, G. H., and Van Loan, C. F. 1983. Matrix Computations. Johns Hopkins University Press, Baltimore, MD.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Kohonen, T. 1974. An adaptive associative memory principle. IEEE Transact. Comput. 444-445.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Kushner, H. J., and Clark, D. S. 1978. Stochastic Approximation Method for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Ljung, L. 1982. Recursive identification methods for off-line identification problems. IFAC Identification Syst. Parameter Estimation, pp. 555-560.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing - Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing - Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
Sacks, J. 1958. Asymptotic distribution of stochastic approximation. Ann. Math. Stat. 29, 373-405.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. J. Complex Syst. 1, 145-168.
Tesauro, G., He, Y., and Ahmad, S. 1989. Asymptotic convergence of backpropagation. Neural Comp. 1, 382-391.
White, H. 1989. Some asymptotic results for learning in single hidden layer feedforward network models. J. Am. Statist. Assoc. 84, 1003-1013.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, pp. 96-104.
Zhu, Y. M., and Yin, G. 1988. Adaptive filters with constraints and correlated non-stationary signals. Syst. Control Lett. 10, 271-279.
Received 10 July 1990; accepted 25 September 1990.
Communicated by Halbert White
Universal Approximation Using Radial-Basis-Function Networks
J. Park
I. W. Sandberg
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas 78712 USA
There have been several recent studies concerning feedforward networks and the problem of approximating arbitrary functionals of a finite number of real variables. Some of these studies deal with cases in which the hidden-layer nonlinearity is not a sigmoid. This was motivated by successful applications of feedforward networks with nonsigmoidal hidden-layer units. This paper reports on a related study of radial-basis-function (RBF) networks, and it is proved that RBF networks having one hidden layer are capable of universal approximation. Here the emphasis is on the case of typical RBF networks, and the results show that a certain class of RBF networks with the same smoothing factor in each kernel node is broad enough for universal approximation.
1 Introduction
There have been several recent studies concerning the capabilities of multilayered feedforward neural networks. Particularly pertinent to this paper are results that show that certain classes of neural networks are capable of providing arbitrarily good approximations to prescribed functionals of a finite number of real variables. From the theoretical point of view, these studies are important, because they address the question of whether a satisfactory solution is yielded by some member of a given class of networks. More specifically, suppose we have a problem that we want to solve using a certain type of neural network. Suppose also that there exists a decision function $f : \mathbb{R}^r \to \mathbb{R}^m$ whose implementation as a network plays a central role in the solution of the problem. Imagine that we have a family $G$ of functions mapping $\mathbb{R}^r$ to $\mathbb{R}^m$ characterized by a certain structure and having certain elements (e.g., one might consider a set of multilayered perceptrons), and that we hope to solve the problem by implementing some satisfactory member of $G$. The first question we need to consider might be: Is this family $G$ broad enough to contain $f$ or a good approximation of $f$? Obviously, attempts to solve the problem without considering this question might be very time-consuming and might even be fruitless. Several papers address this question for the case of multilayered perceptron models with sigmoidal nonlinearities, and affirmative answers have been obtained by showing that in a satisfactory sense the family $G$ considered can actually approximate any decision function drawn from a certain large class (Cybenko 1989; Hornik et al. 1989).

Neural Computation 3, 246-257 (1991) © 1991 Massachusetts Institute of Technology

At the present time, with the advantages and limitations of multilayered perceptron networks more transparent and with results containing comparative studies becoming available (e.g., Lippman 1989), research concerning different types of feedforward networks is very active. Among the various kinds of promising networks are the so-called radial-basis-function (RBF) networks (Lippman 1989). The block diagram of a version of an RBF classifier with one hidden layer is shown in Figure 1. Each unit in the hidden layer of this RBF network has its own centroid, and for each input $x = (x_1, x_2, \ldots, x_r)$, it computes the distance between $x$ and its centroid. Its output (the output signal at one of the kernel nodes) is some nonlinear function of that distance. Thus, each kernel node in the RBF network computes an output that depends on a radially symmetric function, and usually the strongest output is obtained when the input is near the centroid of the node.

Figure 1: A radial-basis-function network.

Assuming that there are $r$ input nodes and $m$ output nodes, the overall response function without considering nonlinearity in an output node has the following form:

$$q(x) = \sum_{i=1}^{M} w_i \cdot K\!\left(\frac{x - z_i}{\sigma_i}\right) = \sum_{i=1}^{M} w_i \cdot g\!\left(\frac{\|x - z_i\|}{\sigma_i}\right) \tag{1.1}$$

where $M \in \mathbb{N}$, the set of natural numbers, is the number of kernel nodes in the hidden layer, $w_i \in \mathbb{R}^m$ is the vector of weights from the $i$th kernel node to the output nodes, $x$ is an input vector (an element of $\mathbb{R}^r$), $K$ is a radially symmetric kernel function of a unit in the hidden layer, $z_i$ and $\sigma_i$ are the centroid and smoothing factor (or width) of the $i$th kernel node, respectively, and $g : [0, \infty) \to \mathbb{R}$ is a function called the activation function, which characterizes the kernel shape. A gaussian function is often used as an activation function, and the smoothing factors of kernel nodes may be the same or may vary across nodes.

In this paper, RBF networks having the representation 1.1 are studied. Strong results are obtained to the effect that, under certain mild conditions on the kernel function $K$ (or the activation function $g$), RBF networks represented by 1.1 with the same $\sigma_i$ in each kernel node have the capability of universal approximation. Cybenko (1989) also considers feedforward networks with a single hidden layer of kernel functions. However, only $L^1$ approximation is considered in the corresponding part of Cybenko (1989), and only the case in which the smoothing factors can vary across nodes is addressed. A detailed comparison is given in Section 3. This paper is organized as follows: In Section 2 our main results are presented, and in Section 3 a discussion of our results is given.
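The response function 1.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the form $q(x) = \sum_i w_i \, g(\|x - z_i\|_2 / \sigma_i)$ with the gaussian activation $g(d) = \exp(-d^2)$, and the function name `rbf_response` and the example centroids, widths, and weights are hypothetical.

```python
import math


def rbf_response(x, centroids, widths, weights):
    """Evaluate q(x) = sum_i w_i * g(||x - z_i||_2 / sigma_i) with the
    gaussian activation g(d) = exp(-d**2) (an assumed choice of g).
    Each w_i is a length-m sequence, so the result is a vector in R^m."""
    m = len(weights[0])
    out = [0.0] * m
    for z, sigma, w in zip(centroids, widths, weights):
        # Euclidean distance between the input and this node's centroid.
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
        activation = math.exp(-((dist / sigma) ** 2))
        for k in range(m):
            out[k] += w[k] * activation
    return out


# Two kernel nodes (M = 2), two inputs (r = 2), two outputs (m = 2).
centroids = [(0.0, 0.0), (1.0, 1.0)]
widths = [0.5, 0.5]
weights = [(1.0, 0.0), (0.0, 2.0)]

response = rbf_response((0.0, 0.0), centroids, widths, weights)
print(response)
```

Evaluating at the first centroid gives a first output component of 1, because that node fires at full strength, while the second component is close to zero, because the second node is far away relative to its width.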
2 Main Results

In this section, we consider the approximation of a function by some element of a specific family of RBF networks. Throughout the paper, we use the following notation and definitions, in which $\mathbb{N}$, $\mathbb{R}$, and $\mathbb{R}^r$ denote the set of natural numbers, the set of real numbers, and the set of real $r$-vectors, respectively. Let $L^p(\mathbb{R}^r)$, $L^\infty(\mathbb{R}^r)$, $C(\mathbb{R}^r)$, and $C_c(\mathbb{R}^r)$, respectively, denote the usual spaces of $\mathbb{R}$-valued maps $f$ defined on $\mathbb{R}^r$ such that $f$ is $p$th power integrable, essentially bounded, continuous, and continuous with compact support. The usual $L^p$ and $L^\infty$ norms are denoted by $\|\cdot\|_p$ and $\|\cdot\|_\infty$, respectively. The integral of $f \in L^1(\mathbb{R}^r)$ over a Lebesgue measurable set $A$ in $\mathbb{R}^r$ is written as $\int_A f(x)\,dx$, or, if $f$ is a function of several variables and, say, $f(a, \cdot) \in L^1(\mathbb{R}^r)$, we write $\int_A f(a, x)\,dx$ to denote the integral of $f(a, \cdot)$ over $A$. The convolution operation is denoted by "$*$," and the characteristic function of a Lebesgue measurable subset $A$ of $\mathbb{R}^r$ is written as "$1_A$."
The family of RBF networks considered here consists of functions $q : \mathbb{R}^r \to \mathbb{R}$ represented by

$$q(x) = \sum_{i=1}^{M} w_i \cdot K\!\left(\frac{x - z_i}{\sigma}\right) \tag{2.1}$$

where $M \in \mathbb{N}$, $\sigma > 0$, $w_i \in \mathbb{R}$, and $z_i \in \mathbb{R}^r$ for $i = 1, \ldots, M$. We call this family $S_K$. Note that 2.1 is the same as 1.1, with the exception that the smoothing factors in all kernel nodes are the same, and the output space is $\mathbb{R}$ instead of $\mathbb{R}^m$. It will become clear that the extension of our results to multidimensional output spaces is trivial, and so we consider only a one-dimensional output space. We will use the following result, which is a slight modification of a theorem in Bochner and Chandrasekharan (1949, p. 101).
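Lemma 1. Let $f \in L^p(\mathbb{R}^r)$, $p \in [1, \infty)$, and let $\phi : \mathbb{R}^r \to \mathbb{R}$ be an integrable function such that $\int_{\mathbb{R}^r} \phi(x)\,dx = 1$. Define $\phi_\epsilon : \mathbb{R}^r \to \mathbb{R}$ by $\phi_\epsilon(x) = (1/\epsilon^r)\,\phi(x/\epsilon)$ for $\epsilon > 0$. Then $\|\phi_\epsilon * f - f\|_p \to 0$ as $\epsilon \to 0$.

Proof. Note that $\phi_\epsilon \in L^1(\mathbb{R}^r)$. By a direct extension from $\mathbb{R}$ to $\mathbb{R}^r$ of a standard theorem in analysis (Bochner and Chandrasekharan 1949, p. 99), one has $\phi_\epsilon * f \in L^p(\mathbb{R}^r)$, which is used below. By a change of variable,

$$(\phi_\epsilon * f)(\alpha) = \int_{\mathbb{R}^r} \phi_\epsilon(\alpha - x) f(x)\,dx = \int_{\mathbb{R}^r} \phi(x) f(\alpha - \epsilon x)\,dx.$$

Thus,

$$(\phi_\epsilon * f)(\alpha) - f(\alpha) = \int_{\mathbb{R}^r} \left[ f(\alpha - \epsilon x) - f(\alpha) \right] \phi(x)\,dx.$$

With $q$ defined by $1/p + 1/q = 1$,

$$\|\phi_\epsilon * f - f\|_p^p \le \int_{\mathbb{R}^r} \left( \int_{\mathbb{R}^r} |f(\alpha - \epsilon x) - f(\alpha)|\,|\phi(x)|^{1/p}\,|\phi(x)|^{1/q}\,dx \right)^{p} d\alpha \le \|\phi\|_1^{p/q} \int_{\mathbb{R}^r} |\phi(x)|\,\|f(\cdot - \epsilon x) - f(\cdot)\|_p^p\,dx$$

by Fubini's theorem and Hölder's inequality. Since $\|f(\cdot - \alpha) - f(\cdot)\|_p \le 2\|f\|_p$ and translation is continuous in $L^p(\mathbb{R})$ (see Bochner and Chandrasekharan 1949, p. 98, and consider its direct extension to $\mathbb{R}^r$) we have

$$\|\phi_\epsilon * f - f\|_p \to 0 \quad \text{as } \epsilon \to 0$$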
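by Lebesgue's dominated convergence theorem. This proves the lemma.

Our Theorem 1 (below) establishes that, under certain mild conditions on the kernel function $K$, RBF networks represented by 2.1 are capable of approximating arbitrarily well any function in $L^p(\mathbb{R}^r)$.

Theorem 1. Let $K : \mathbb{R}^r \to \mathbb{R}$ be an integrable bounded function such that $K$ is continuous almost everywhere and $\int_{\mathbb{R}^r} K(x)\,dx \ne 0$. Then the family $S_K$ is dense in $L^p(\mathbb{R}^r)$ for every $p \in [1, \infty)$.

Proof. Let $p \in [1, \infty)$, $f \in L^p(\mathbb{R}^r)$, and $\epsilon > 0$. Since $C_c(\mathbb{R}^r)$ is dense in $L^p(\mathbb{R}^r)$ (Rudin 1986, p. 69), there exists an $f_c \in C_c(\mathbb{R}^r)$ such that $\|f_c - f\|_p < \epsilon/3$. We will assume below that $f_c$ is nonzero. Notice that this involves no loss of generality. Let $\phi : \mathbb{R}^r \to \mathbb{R}$ be defined by $\phi(x) = \left(1 / \int_{\mathbb{R}^r} K(\alpha)\,d\alpha\right) \cdot K(x)$, for $x \in \mathbb{R}^r$. Then $\phi$ satisfies the conditions on $\phi$ in Lemma 1. Thus, by defining $\phi_\sigma : \mathbb{R}^r \to \mathbb{R}$ as in Lemma 1, we obtain $\|\phi_\sigma * f_c - f_c\|_p \to 0$ as $\sigma \to 0$. Therefore, there is a positive $\sigma$ such that $\|\phi_\sigma * f_c - f_c\|_p < \epsilon/3$. Since $f_c$ has compact support, there exists a positive $T$ such that $\operatorname{supp} f_c \subset [-T, T]^r$. Note that $\phi_\sigma(\alpha - \cdot) f_c(\cdot)$ is Riemann integrable on $[-T, T]^r$, because it is continuous almost everywhere and is bounded by $\|\phi_\sigma\|_\infty \cdot \|f_c\|_\infty$. Define $v_n : \mathbb{R}^r \to \mathbb{R}$ by

$$v_n(\alpha) = \sum_{i=1}^{n^r} \phi_\sigma(\alpha - \alpha_i)\, f_c(\alpha_i) \left(\frac{2T}{n}\right)^{r}$$

where the set $\{\alpha_i \in \mathbb{R}^r : i = 1, 2, \ldots, n^r\}$ consists of all points in $[-T, T]^r$ of the form $[-T + (2 i_1 T/n), \ldots, -T + (2 i_r T/n)]$, $i_1, i_2, \ldots, i_r = 1, 2, \ldots, n$. Note that $v_n(\alpha)$ is a Riemann sum for $\int_{[-T,T]^r} \phi_\sigma(\alpha - x) f_c(x)\,dx$, and $\int_{[-T,T]^r} \phi_\sigma(\alpha - x) f_c(x)\,dx = \int_{\mathbb{R}^r} \phi_\sigma(\alpha - x) f_c(x)\,dx = (\phi_\sigma * f_c)(\alpha)$. Thus, for any $\alpha \in \mathbb{R}^r$, $v_n(\alpha) \to (\phi_\sigma * f_c)(\alpha)$ as $n \to \infty$. Since $\phi_\sigma * f_c \in L^p(\mathbb{R}^r)$, there is a positive $T_1$ such that

$$\int_{\mathbb{R}^r \setminus [-T_1, T_1]^r} |(\phi_\sigma * f_c)(\alpha)|^p\,d\alpha < (\epsilon/9)^p.$$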
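Since $\phi_\sigma$ is bounded and $\phi_\sigma \in L^1(\mathbb{R}^r)$, we have $\phi_\sigma \in L^p(\mathbb{R}^r)$. Thus, there exists $T_2 > 0$ such that

$$\int_{\mathbb{R}^r \setminus [-T_2, T_2]^r} |\phi_\sigma(x)|^p\,dx < \left( \frac{\epsilon}{9\,\|f_c\|_\infty\,(2T)^r} \right)^{p}.$$

Note that, by the convexity of $t \mapsto t^p$ on $[0, \infty)$,

$$|v_n(\alpha)|^p \le \left( \|f_c\|_\infty (2T)^r \cdot \frac{1}{n^r} \sum_{i=1}^{n^r} |\phi_\sigma(\alpha - \alpha_i)| \right)^{p} \le \|f_c\|_\infty^p\,(2T)^{rp}\,\frac{1}{n^r} \sum_{i=1}^{n^r} |\phi_\sigma(\alpha - \alpha_i)|^p.$$

Therefore, for any $T_0 > 0$,

$$\int_{\mathbb{R}^r \setminus [-T_0, T_0]^r} |v_n(\alpha)|^p\,d\alpha \le \|f_c\|_\infty^p\,(2T)^{rp}\,\frac{1}{n^r} \sum_{i=1}^{n^r} \int_{\mathbb{R}^r \setminus [-T_0, T_0]^r} |\phi_\sigma(\alpha - \alpha_i)|^p\,d\alpha.$$

Define $T_0 = \max(T_1, T_2 + T)$. Using $|\alpha_{ij}| \le T$ for all $j \in \{1, 2, \ldots, r\}$, where $\alpha_{ij}$ denotes the $j$th component of $\alpha_i$, we see that $\alpha - \alpha_i \notin [-T_2, T_2]^r$ whenever $\alpha \notin [-T_0, T_0]^r$, and so

$$\int_{\mathbb{R}^r \setminus [-T_0, T_0]^r} |v_n(\alpha)|^p\,d\alpha < (\epsilon/9)^p. \tag{2.2}$$

Also,

$$\int_{[-T_0, T_0]^r} |v_n(\alpha) - (\phi_\sigma * f_c)(\alpha)|^p\,d\alpha \to 0 \quad \text{as } n \to \infty$$

by the dominated convergence theorem. Thus, there is an $N \in \mathbb{N}$ for which

$$\int_{[-T_0, T_0]^r} |v_N(\alpha) - (\phi_\sigma * f_c)(\alpha)|^p\,d\alpha < (\epsilon/9)^p.$$

From the above, $\|v_N - \phi_\sigma * f_c\|_p < \epsilon/3$, and hence $\|v_N - f\|_p < \epsilon$. Since

$$v_N(\alpha) = \sum_{i=1}^{N^r} w_i \cdot K\!\left(\frac{\alpha - \alpha_i}{\sigma}\right)$$

with

$$w_i = \frac{f_c(\alpha_i)\,(2T/N)^r}{\sigma^r \int_{\mathbb{R}^r} K(x)\,dx},$$

the proof is complete.

By $K$ radially symmetric, we mean that $\|x\|_2 = \|y\|_2$ implies $K(x) = K(y)$. In this case, the activation function $g : [0, \infty) \to \mathbb{R}$ is obtained by defining $g(d) = K(z)$, where $z$ is any element of $\mathbb{R}^r$ such that $\|z\|_2 = d$. Therefore, in the case of radial symmetry, 2.1 can be written as

$$q(x) = \sum_{i=1}^{M} w_i \cdot g\!\left(\frac{\|x - z_i\|_2}{\sigma}\right).$$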
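The Riemann-sum construction used in the proof of Theorem 1 can be tried numerically. The sketch below is only an illustration under stated assumptions: it takes $r = 1$, the gaussian kernel $K(x) = \exp(-x^2)$ (whose integral is $\sqrt{\pi}$), a hat function as the compactly supported target $f_c$, and particular values of $T$, $N$, and $\sigma$; the names `hat`, `build_network`, and `evaluate` are hypothetical. It reports a pointwise error on a grid, which is a rough check and not the $L^p$ statement of the theorem.

```python
import math


def hat(x):
    """Continuous, compactly supported target f_c with support in [-1, 1]."""
    return max(0.0, 1.0 - abs(x))


def gaussian_kernel(x):
    """K(x) = exp(-x**2); its integral over the real line is sqrt(pi)."""
    return math.exp(-x * x)


def build_network(f_c, T, N, sigma):
    """Centroids alpha_i = -T + 2*i*T/N for i = 1..N and weights
    w_i = f_c(alpha_i) * (2T/N) / (sigma * integral of K), for r = 1."""
    k_integral = math.sqrt(math.pi)
    centroids = [-T + 2.0 * i * T / N for i in range(1, N + 1)]
    weights = [f_c(a) * (2.0 * T / N) / (sigma * k_integral) for a in centroids]
    return centroids, weights


def evaluate(x, centroids, weights, sigma):
    """q(x) = sum_i w_i * K((x - z_i) / sigma) with one shared sigma."""
    return sum(w * gaussian_kernel((x - z) / sigma)
               for z, w in zip(centroids, weights))


sigma = 0.05
centroids, weights = build_network(hat, T=1.0, N=400, sigma=sigma)

# Compare the network with the target on a grid covering [-2, 2].
grid = [-2.0 + 0.01 * k for k in range(401)]
max_error = max(abs(evaluate(x, centroids, weights, sigma) - hat(x)) for x in grid)
print(max_error)
```

In this setting the largest deviation is roughly 0.03 and occurs at the kink of the hat function, where the smoothing by $\phi_\sigma$ is most visible; a smaller $\sigma$ together with a larger $N$ reduces it.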
Note that there is no requirement of radial symmetry of the kernel function $K$ in the above theorem. Thus, the theorem is stronger than necessary for RBF networks, and might be useful for other purposes. Similarly, in the following theorem and corollaries, radial symmetry of the kernel function $K$ is not assumed, even though we are interested primarily in radial-basis-function networks. If we interpret the term "radially symmetric" more generally than literally, then we may say that $K$ is radially symmetric with respect to $\|\cdot\|$ if $\|x\| = \|y\|$ implies $K(x) = K(y)$, where $\|\cdot\|$ is some norm defined on $\mathbb{R}^r$. With this generalization in mind, we sometimes use $\|x - z_i\|$ for the distance between $x$ and $z_i$ instead of $\|x - z_i\|_2$.

A slight modification of Theorem 1 given below addresses the case in which the function $f$ we wish to approximate with an RBF network is not an element of $L^p(\mathbb{R}^r)$, but an element of $L^p_{loc}(\mathbb{R}^r)$ for some $p \in [1, \infty)$. Here the locally-$L^p$ space $L^p_{loc}(\mathbb{R}^r)$, $1 \le p < \infty$, is defined as the set of all measurable $f : \mathbb{R}^r \to \mathbb{R}$ such that $f \cdot 1_{[-N,N]^r} \in L^p(\mathbb{R}^r)$ for every $N \in \mathbb{N}$. One way to define a metric on $L^p_{loc}(\mathbb{R}^r)$ is by

$$\rho_{loc}(f, g) = \sum_{n=1}^{\infty} 2^{-n}\, \frac{\|(f - g) \cdot 1_{[-n,n]^r}\|_p}{1 + \|(f - g) \cdot 1_{[-n,n]^r}\|_p}.$$

The following is a direct corollary of Theorem 1.
Corollary 1. Let $K : \mathbb{R}^r \to \mathbb{R}$ be an integrable bounded function such that $K$ is continuous almost everywhere and $\int_{\mathbb{R}^r} K(x)\,dx \ne 0$. Then the family $S_K$ is dense in $L^p_{loc}(\mathbb{R}^r)$ for every $p \in [1, \infty)$.

Proof. Let $p \in [1, \infty)$, $f \in L^p_{loc}(\mathbb{R}^r)$, and $\epsilon > 0$. Choose $m \in \mathbb{N}$ such that $\sum_{n=m+1}^{\infty} 2^{-n} < \epsilon/2$. Since $f \cdot 1_{[-m,m]^r} \in L^p(\mathbb{R}^r)$, by Theorem 1 there is a $v \in S_K$ such that $\|f \cdot 1_{[-m,m]^r} - v\|_p < \epsilon/2$. Thus,

$$\rho_{loc}(f, v) \le \sum_{n=1}^{m} 2^{-n}\,\|(f - v) \cdot 1_{[-n,n]^r}\|_p + \sum_{n=m+1}^{\infty} 2^{-n} < \epsilon/2 + \|(f - v) \cdot 1_{[-m,m]^r}\|_p < \epsilon$$
which establishes the corollary.

Theorem 1 and Corollary 1 concern approximation with respect to the $L^p$ metric or a metric induced by the $L^p$ metric. We next give a theorem concerning the approximation of continuous functions with respect to a metric induced by the uniform metric.

Theorem 2. Let $K : \mathbb{R}^r \to \mathbb{R}$ be an integrable bounded function such that $K$ is continuous and $\int_{\mathbb{R}^r} K(x)\,dx \ne 0$. Then the family $S_K$ is dense in $C(\mathbb{R}^r)$ with respect to the metric $d$ defined by

$$d(f, g) = \sum_{n=1}^{\infty} 2^{-n}\, \frac{\|(f - g) \cdot 1_{[-n,n]^r}\|_\infty}{1 + \|(f - g) \cdot 1_{[-n,n]^r}\|_\infty}.$$
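Proof. Let $f : \mathbb{R}^r \to \mathbb{R}$ be any continuous function, and $\epsilon > 0$. Define $\phi : \mathbb{R}^r \to \mathbb{R}$ by normalizing $K$, and define $\phi_\sigma : \mathbb{R}^r \to \mathbb{R}$ for $\sigma > 0$ as in the proof of Theorem 1. Pick a natural number $m$ such that $2^{-m} < \epsilon/3$, and then choose a positive $T$ such that $T > m$. Since $f$ is continuous on the compact set $[-m, m]^r$, we can obtain a nonzero continuous function $\tilde{f} : \mathbb{R}^r \to \mathbb{R}$ with the property that $\tilde{f}(x) = f(x)$ for $x \in [-m, m]^r$, and $\tilde{f}(x) = 0$ for $x \in \mathbb{R}^r \setminus [-T, T]^r$. Note that $\tilde{f}$ is bounded and uniformly continuous. Using $\phi \in L^1(\mathbb{R}^r)$, pick a positive $T_0$ such that

$$\int_{\mathbb{R}^r \setminus [-T_0, T_0]^r} |\phi(x)|\,dx < \frac{\epsilon}{12\,\|\tilde{f}\|_\infty}. \tag{2.4}$$

Since $\tilde{f}$ is uniformly continuous, there is a $\delta > 0$ for which $\|x - y\|_2 < \delta$ implies

$$|\tilde{f}(x) - \tilde{f}(y)| < \frac{\epsilon}{6\,\|\phi\|_1}. \tag{2.5}$$

Choose $\sigma > 0$ such that $\|\sigma x\|_2 < \delta$ for all $x \in [-T_0, T_0]^r$. Let $\alpha \in [-m, m]^r$. Then using 2.4 and 2.5,

$$|(\phi_\sigma * \tilde{f})(\alpha) - \tilde{f}(\alpha)| < \epsilon/3. \tag{2.6}$$

Note that $(\phi_\sigma * \tilde{f})(\alpha) = \int_{[-T,T]^r} \phi_\sigma(\alpha - x)\,\tilde{f}(x)\,dx$. Define $v_n : \mathbb{R}^r \to \mathbb{R}$ by

$$v_n(\alpha) = \sum_{i=1}^{n^r} \phi_\sigma(\alpha - \alpha_i)\,\tilde{f}(\alpha_i) \left(\frac{2T}{n}\right)^{r}$$

where the set $\{\alpha_i \in \mathbb{R}^r : i = 1, 2, \ldots, n^r\}$ consists of all points in $[-T, T]^r$ of the form $[-T + (2 i_1 T/n), \ldots, -T + (2 i_r T/n)]$, $i_1, \ldots, i_r = 1, \ldots, n$. Since the map $(\alpha, x) \mapsto \phi_\sigma(\alpha - x)\,\tilde{f}(x)$ is uniformly continuous on $[-m, m]^r \times [-T, T]^r$, there is a $\delta_0 > 0$ such that $\alpha \in [-m, m]^r$, $x, y \in [-T, T]^r$ with $\|x - y\|_2 < \delta_0$ implies $|\phi_\sigma(\alpha - x)\,\tilde{f}(x) - \phi_\sigma(\alpha - y)\,\tilde{f}(y)| < \epsilon / (3 (2T)^r)$. It easily follows that for $n > 2\sqrt{r}\,T/\delta_0$,

$$|v_n(\alpha) - (\phi_\sigma * \tilde{f})(\alpha)| < \epsilon/3. \tag{2.7}$$

Choose $N \in \mathbb{N}$ such that $N > 2\sqrt{r}\,T/\delta_0$. Then using 2.6 and 2.7, $|v_N(\alpha) - \tilde{f}(\alpha)| < 2\epsilon/3$, in which $\alpha \in [-m, m]^r$ is arbitrary. Since $\tilde{f}(x) = f(x)$ for $x \in [-m, m]^r$,

$$d(f, v_N) \le \sum_{n=1}^{m} 2^{-n}\,\|(f - v_N) \cdot 1_{[-n,n]^r}\|_\infty + \sum_{n=m+1}^{\infty} 2^{-n} < 2\epsilon/3 + \epsilon/3 = \epsilon$$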
which finishes the proof.

The statement in Theorem 2 is equivalent to the statement that $S_K$ is uniformly dense on compacta in $C(\mathbb{R}^r)$ under the indicated conditions on $K$. That is, under the conditions on $K$ of Theorem 2, for any continuous function $f : \mathbb{R}^r \to \mathbb{R}$, for any $\epsilon > 0$, and for any compact subset $C \subset \mathbb{R}^r$, there exists a $q \in S_K$ such that $\|(q - f) \cdot 1_C\|_\infty < \epsilon$. Thus, by a useful relationship between uniform convergence on compacta and convergence in measure (Hornik et al. 1989, lemma 2.2), we have the following corollary:

Corollary 2. Let $\mu$ be a finite measure on $\mathbb{R}^r$. Then under the conditions on $K$ of Theorem 2, the family $S_K$ is dense in $C(\mathbb{R}^r)$ with respect to the metric $\rho_\mu$ defined by $\rho_\mu(f, g) = \inf\{\epsilon > 0 : \mu\{x \in \mathbb{R}^r : |f(x) - g(x)| > \epsilon\} < \epsilon\}$.
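To get a feel for the metric $\rho_\mu$ of Corollary 2, the following sketch evaluates it for a discrete finite measure. This is an illustrative assumption, since the corollary allows any finite measure on $\mathbb{R}^r$; the function name `rho_mu` and the sample numbers are hypothetical. Because $\mu\{|f - g| > \epsilon\}$ is nonincreasing in $\epsilon$, the condition inside the infimum switches from false to true exactly once, so bisection applies.

```python
def rho_mu(diffs, weights, iterations=60):
    """Approximate inf{eps > 0 : mu{x : |f(x) - g(x)| > eps} < eps}
    for a discrete finite measure: diffs[k] = |f(x_k) - g(x_k)| at atom x_k,
    weights[k] = mu({x_k})."""
    def holds(eps):
        mass = sum(w for d, w in zip(diffs, weights) if d > eps)
        return mass < eps

    # holds(0) is false and holds(total mass + 1) is true; bisect between them.
    lo, hi = 0.0, sum(weights) + 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if holds(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Five atoms of mass 0.2 each; one atom carries a large deviation of 3.0.
diffs = [0.0, 0.1, 0.2, 0.5, 3.0]
weights = [0.2] * 5
distance = rho_mu(diffs, weights)
print(distance)
```

Here the distance comes out at about 0.4 even though the largest pointwise deviation is 3.0, which reflects that $\rho_\mu$ measures closeness in measure: large deviations on sets of small measure contribute little.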
3 Conclusions and Discussion
The results in Section 2 establish that under certain mild conditions on the kernel function, radial-basis-function networks having one hidden layer and the same smoothing factor in each kernel are broad enough for universal approximation. This provides an analytical basis for the design of neural networks using radial basis functions.
To the extent that the results of this paper bear on the approximation of a function in $L^1(\mathbb{R}^r)$ with a finite sum $\sum_{i=1}^{M} w_i \cdot K((\cdot - z_i)/\sigma_i)$ of kernel functions, there is some overlap of a part of Cybenko (1989) and this study. Using a theorem due to Wiener (Rudin 1973, p. 210) and the pertinent argument used in Cybenko (1989), it can be shown that the set $\{\sum_{i=1}^{M} w_i \cdot K((\cdot - z_i)/\sigma_i) : M \in \mathbb{N},\ w_i \in \mathbb{R},\ z_i \in \mathbb{R}^r,\ \sigma_i \ne 0\}$ is dense in $L^1(\mathbb{R}^r)$, under the condition that $K \in L^1(\mathbb{R}^r)$ and $\int_{\mathbb{R}^r} K(x)\,dx \ne 0$. This certainly shows the capability of certain RBF networks with respect to approximating an arbitrary $L^1$ function. However, note that here the smoothing factor $\sigma_i$ in each kernel node has a full degree of freedom, that is, the $\sigma_i$s can have different values across the kernel nodes. Thus, the major differences between this $L^1$ approximation and the results given in Section 2 concern the class of RBF networks considered as well as the metrics used.¹ From the theoretical point of view, this condition concerning the same smoothing factor is often very important, because many studies are concerned with approximation using the functions $\sum_{i=1}^{M} w_i \cdot h(\|\cdot - z_i\|)$ (Broomhead and Lowe 1988; Powell 1985; Sun 1989), and radial basis functions with the same smoothing factor in each kernel node are often used in real applications (Broomhead and Lowe 1988).

¹In this connection, Wiener's theorem referred to above can also be used to give a direct proof that $L^1$ approximations can be achieved with linear combinations of translates of any element of $L^1(\mathbb{R}^r)$ whose Fourier transform never vanishes. The gaussians $\exp(-\alpha \|\cdot\|_2^2)$ are examples of such functions.

In connection with studies of approximation using radial basis functions, the recent results concerning the solvability of radial-function interpolation (Powell 1985; Sun 1989) are interesting, because they are directly applicable to the training of neural networks of the type we have focused attention on. These studies (Powell 1985; Sun 1989) are concerned with the interpolation of data by the $m$ functions $h(\|\cdot - z_i\|)$, $i = 1, \ldots, m$, when the data $(z_i, y_i)$ with $z_i \in \mathbb{R}^r$, $y_i \in \mathbb{R}$, $i = 1, \ldots, m$ are given. More precisely, the existence of a unique interpolant $\sum_{i=1}^{m} w_i \cdot h(\|\cdot - z_i\|)$ for distinct data $(z_i, y_i)$ with $z_i \in \mathbb{R}^r$, $y_i \in \mathbb{R}$, $i = 1, \ldots, m$ has been shown for a certain class of pairs of $h$ and $\|\cdot\|$. This existence leads us to an interesting observation: Suppose that training data $(z_i, y_i)$, $i = 1, \ldots, m$ are given, where $z_i \in \mathbb{R}^r$, $y_i = 1$ if $z_i \in A$, $y_i = -1$ if $z_i \in B$, and $A, B \subset \mathbb{R}^r$ with $A \cap B = \emptyset$. From the given data, construct a new data set $z_i^* \in \mathbb{R}^m$, $i = 1, \ldots, m$, by defining

$$z_i^* = \left[ h(\|z_i - z_1\|),\ h(\|z_i - z_2\|),\ \ldots,\ h(\|z_i - z_m\|) \right].$$

Note that $z_i^* \in \mathbb{R}^m$, while $z_i \in \mathbb{R}^r$. Then by the above existence property, for certain classes of $h$ and $\|\cdot\|$, there exist $\lambda_j \in \mathbb{R}$, $j = 1, \ldots, m$ such that

$$\sum_{j=1}^{m} \lambda_j\, h(\|z_i - z_j\|) = y_i$$

for each $i \in \{1, 2, \ldots, m\}$. Thus, with $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)^T$, $z_i^* \lambda > 0$ if $z_i \in A$, and $z_i^* \lambda < 0$ if $z_i \in B$. In other words, $\{(z_i^*, y_i) : i = 1, 2, \ldots, m\}$ is linearly separable in this case. Therefore, the perceptron learning rule suffices for the training of this network.

Additional related papers are Hartman et al. (1990) and Sandberg (1991). The work of Hartman et al. (1990), which appeared after this work was completed, considers gaussian functions and approximations on compact subsets of $\mathbb{R}^r$ that are convex. It is shown there that networks with a single layer of gaussian units are universal approximators. In Sandberg (1991) more general results for gaussian functions are given as a special case of propositions concerning the uniform approximation of functionals defined on compact subsets of spaces that need not be finite dimensional. Also, it is observed in Sandberg (1991) that (what might be called) "function-space feedforward neural networks" with an input layer of bounded linear functionals and just one hidden nonlinear layer are universal approximators of real continuous functionals on compact subsets of a normed linear space.

Acknowledgments
This work was supported in part by the National Science Foundation under Grant MIP-8915335.

References
Bochner, S., and Chandrasekharan, K. 1949. Fourier Transform. Princeton University Press, Princeton, NJ.
Broomhead, D. S., and Lowe, D. 1988. Multi-variable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals, Syst. 2, 303-314.
Hartman, E. J., Keeler, J. D., and Kowalski, J. M. 1990. Layered neural networks with gaussian hidden units as universal approximations. Neural Comp. 2, 210-215.
Hornik, K. M., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Lippman, R. P. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 27, 47-64.
Powell, M. J. D. 1985. Radial basis functions for multi-variable interpolation: A review. IMA Conference on Algorithms for the Approximation of Functions and Data, RMCS Shrivenham, UK.
Rudin, W. 1973. Functional Analysis. McGraw-Hill, New York.
Rudin, W. 1986. Real and Complex Analysis, 3rd ed. McGraw-Hill, New York.
Sandberg, I. W. 1991. Gaussian basis functions and approximations for nonlinear systems. Proceedings of the Ninth Kobe International Symposium on Electronics and Information Sciences, Kobe, Japan.
Sun, X. 1989. On the solvability of radial function interpolation. Approximation Theory VI 2, 643-646.
Received 17 September 1990; accepted 25 January 1991.
Chen. 2009. Adaptive backstepping dynamic surface control for systems with periodic disturbances using neural networks. IET Control Theory & Applications 3:10, 1383. [CrossRef] 35. Giuseppe Castaldi, Vincenzo Galdi, Giampiero Gerini. 2009. EVALUATION OF A NEURAL-NETWORK-BASED ADAPTIVE BEAMFORMING SCHEME WITH MAGNITUDE-ONLY CONSTRAINTS. Progress In Electromagnetics Research B 11, 1-14. [CrossRef] 36. Bor-Shyh Lin, Bor-Shing Lin, Fok-Ching Chong, Feipei Lai. 2009. Higher Order Statistics-Based Radial Basis Function Network for Evoked Potentials. IEEE Transactions on Biomedical Engineering 56:1, 93-100. [CrossRef]
37. K. Meng, Z.Y. Dong, K.P. Wong. 2009. Self-adaptive radial basis function neural network for short-term electricity price forecasting. IET Generation, Transmission & Distribution 3:4, 325. [CrossRef] 38. M. Alper Selver, Olcay Akay, Emre Ardali, A. Bahadir Yavuz, Okan Onal, Gurkan Ozden. 2009. Cascaded and Hierarchical Neural Networks for Classifying Surface Images of Marble Slabs. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) . [CrossRef] 39. Naveen Kumar Saxena, Mohd. Ayub Khan, P. K. S. Pourush, Nitendar Kumar. 2009. Neural network analysis of switchability of microstrip rectangular patch antenna printed on ferrite material. International Journal of RF and Microwave Computer-Aided Engineering NA-NA. [CrossRef] 40. Vandana Vikas Thakare, Pramod Singhal. 2009. Microstrip antenna design using artificial neural networks. International Journal of RF and Microwave Computer-Aided Engineering NA-NA. [CrossRef] 41. M. Baglietto, C. Cervellera, M. Sanguineti, R. Zoppoli. 2008. Management of water resource systems in the presence of uncertainties by nonlinear approximation techniques and deterministic sampling. Computational Optimization and Applications . [CrossRef] 42. Yoshifusa Ito. 2008. Simultaneous Approximations of Polynomials and Derivatives and Their Applications to Neural NetworksSimultaneous Approximations of Polynomials and Derivatives and Their Applications to Neural Networks. Neural Computation 20:11, 2757-2791. [Abstract] [PDF] [PDF Plus] 43. N. Hovakimyan, E. Lavretsky, Chengyu Cao. 2008. Adaptive Dynamic Inversion via Time-Scale Separation. IEEE Transactions on Neural Networks 19:10, 1702-1711. [CrossRef] 44. Ye. V. Bodyanskiy, N. Ye. Kulishova. 2008. Memory-based neuro-fuzzy system for interpolation of reflection coefficients of printing inks. Cybernetics and Systems Analysis 44:5, 625-632. [CrossRef] 45. Dong Nan, Wei Wu, Jin Ling Long, Yu Mei Ma, Lin Jun Sun. 2008. 
L p approximation capability of RBF neural networks. Acta Mathematica Sinica, English Series 24:9, 1533-1540. [CrossRef] 46. Michael M. Li, Brijesh Verma, Xiaolong Fan, Kevin Tickle. 2008. RBF neural networks for solving the inverse problem of backscattering spectra. Neural Computing and Applications 17:4, 391-397. [CrossRef] 47. D.A.G. Vieira, R.H.C. Takahashi, V. Palade, J.A. Vasconcelos, W.M. Caminhas. 2008. The $Q$ -Norm Complexity Measure and the Minimum Gradient Method: A Novel Approach to the Machine Learning Structural Risk Minimization Problem. IEEE Transactions on Neural Networks 19:8, 1415-1430. [CrossRef]
48. Miao HUANG. 2008. Fast algorithm for surface reconstruction from cloud data based RBF neural network. Journal of Computer Applications 28:2, 469-472. [CrossRef] 49. Shuyuan Yang, Licheng Jiao, Min Wang. 2008. A new directional multi-resolution ridgelet network. Frontiers of Electrical and Electronic Engineering in China 3:2, 198-203. [CrossRef] 50. S. Suresh, S. Narasimhan, N. Sundararajan. 2008. Adaptive control of nonlinear smart base-isolated buildings using Gaussian kernel functions. Structural Control and Health Monitoring 15:4, 585-603. [CrossRef] 51. Jianming Lian, Yonggon Lee, S.D. Sudhoff, S.H. Zak. 2008. Self-Organizing Radial Basis Function Network for Real-Time Approximation of Continuous-Time Dynamical Systems. IEEE Transactions on Neural Networks 19:3, 460-474. [CrossRef] 52. Weisheng Chen, Junmin Li. 2008. Decentralized Output-Feedback Neural Control for Systems With Unknown Interconnections. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38:1, 258-266. [CrossRef] 53. J.I. Mulero-Martinez. 2008. Best Approximation of Gaussian Neural Networks With Nodes Uniformly Spaced. IEEE Transactions on Neural Networks 19:2, 284-298. [CrossRef] 54. Jer-Guang Hsieh, Yih-Lon Lin, Jyh-Horng Jeng. 2008. . IEEE Transactions on Neural Networks 19:2, 201. [CrossRef] 55. Jian-Xun Peng, Kang Li, George W. Irwin. 2008. A New Jacobian Matrix for Optimal Learning of Single-Layer Neural Networks. IEEE Transactions on Neural Networks 19:1, 119-129. [CrossRef] 56. Senem Makal, Ahmet Kizilay, Lutfiye Durak. 2008. ON THE TARGET CLASSIFICATION THROUGH WAVELET-COMPRESSED SCATTERED ULTRAWIDE-BAND ELECTRIC FIELD DATA AND ROC ANALYSIS. Progress In Electromagnetics Research PIER 82, 419-431. [CrossRef] 57. Vahram Stepanyan, Naira Hovakimyan. 2008. Visual Tracking of a Maneuvering Target. Journal of Guidance, Control, and Dynamics 31:1, 66-80. [CrossRef] 58. A. Guillén, H. Pomares, I. Rojas, J. González, L. J. Herrera, F. Rojas, O. Valenzuela. 2007. 
Studying possibility in a clustering algorithm for RBFNN design for function approximation. Neural Computing and Applications 17:1, 75-89. [CrossRef] 59. Andrew P. Blake, George Kapetanios. 2007. Testing for Neglected Nonlinearity in Cointegrating Relationships. Journal of Time Series Analysis 28:6, 807-826. [CrossRef] 60. Amanda Young, Chengyu Cao, Vijay Patel, Naira Hovakimyan, Eugene Lavretsky. 2007. Adaptive Control Design Methodology for
Nonlinear-in-Control Systems in Aircraft Applications. Journal of Guidance, Control, and Dynamics 30:6, 1770-1782. [CrossRef] 61. A. Alessandri, M. Cuneo, S. Pagnan, M. Sanguineti. 2007. A recursive algorithm for nonlinear least-squares problems. Computational Optimization and Applications 38:2, 195-216. [CrossRef] 62. R. Sanjeev Kunte, R. D. Sudhaker Samuel. 2007. A simple and efficient optical character recognition system for basic symbols in printed Kannada text. Sadhana 32:5, 521-533. [CrossRef] 63. A. Alessandri, C. Cervellera, M. Sanguineti. 2007. Functional Optimal Estimation Problems and Their Solution by Nonlinear Approximation Schemes. Journal of Optimization Theory and Applications 134:3, 445-466. [CrossRef] 64. Yuanyuan Zhao, Jay A. Farrell. 2007. Self-Organizing Approximation-Based Control for Higher Order Systems. IEEE Transactions on Neural Networks 18:4, 1220-1231. [CrossRef] 65. Joaquin Sitte, Liang Zhang, Ulrich Rueckert. 2007. Characterization of Analog Local Cluster Neural Network Hardware for Control. IEEE Transactions on Neural Networks 18:4, 1242-1253. [CrossRef] 66. Shi Huading, Liu Jiyuan, Zhuang Dafang, Hu Yunfeng. 2007. Using the RBFN model and GIS technique to assess wind erosion hazard of Inner Mongolia, China. Land Degradation & Development 18:4, 413-422. [CrossRef] 67. Vahram Stepanyan, Naira Hovakimyan. 2007. Adaptive Disturbance Rejection Controller for Visual Tracking of a Maneuvering Target. Journal of Guidance, Control, and Dynamics 30:4, 1090-1106. [CrossRef] 68. Alberto Guillén, Ignacio Rojas, Jesús González, Héctor Pomares, L. J. Herrera, O. Valenzuela, F. Rojas. 2007. Output value-based initialization for radial basis function neural networks. Neural Processing Letters 25:3, 209-225. [CrossRef] 69. J. Bongard, H. Lipson. 2007. From the Cover: Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 104:24, 9943-9948. [CrossRef] 70. 
Xiao-hua Yang, Jing-feng Huang, Jian-wen Wang, Xiu-zhen Wang, Zhan-yu Liu. 2007. Estimation of vegetation biophysical parameters by remote sensing using radial basis function neural network. Journal of Zhejiang University SCIENCE A 8:6, 883-895. [CrossRef] 71. Bor-Shyh Lin, Bor-Shing Lin, Fok-Ching Chong, Feipei Lai. 2007. Higher-Order-Statistics-Based Radial Basis Function Networks for Signal Enhancement. IEEE Transactions on Neural Networks 18:3, 823-832. [CrossRef] 72. A. A. Pashilkar, N. Sundararajan, P. Saratchandran. 2007. Adaptive Nonlinear Neural Controller for Aircraft Under Actuator Failures. Journal of Guidance, Control, and Dynamics 30:3, 835-847. [CrossRef] 73. Zhongxu Hu, Robert Bicker, Chris Marshall. 2007. Prediction of depth removal in leather surface grit blasting using neural networks and Box-Behnken design
of experiments. The International Journal of Advanced Manufacturing Technology 32:7-8, 732-738. [CrossRef] 74. Sisil Kumarawadu, Keigo Watanabe, Tsu-Tian Lee. 2007. High-Performance Object Tracking and Fixation With an Online Neural Estimator. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 37:1, 213-223. [CrossRef] 75. Marios K. Karakasis, Dimitrios G. Koubogiannis, Kyriakos C. Giannakoglou. 2007. Hierarchical distributed metamodel-assisted evolutionary algorithms in shape optimization. International Journal for Numerical Methods in Fluids 53:3, 455-469. [CrossRef] 76. Angelo Alessandri, Cristiano Cervellera, Marcello Sanguineti. 2007. Design of Asymptotic Estimators: An Approach Based on Neural Networks and Nonlinear Programming. IEEE Transactions on Neural Networks 18:1, 86-96. [CrossRef] 77. Alexey Kononov, Dries Gisolf, Eric Verschuur. 2007. Application of neural networks to traveltime computation. SEG Technical Program Expanded Abstracts 26:1, 1785. [CrossRef] 78. N. Mai-Duy, R.I. Tanner. 2007. A collocation method based on one-dimensional RBF interpolation scheme for solving PDEs. International Journal of Numerical Methods for Heat & Fluid Flow 17:2, 165-186. [CrossRef] 79. M. Zhang, S. Xu, J. Fulcher. 2007. ANSER: ADAPTIVE NEURON ARTIFICIAL NEURAL NETWORK SYSTEM FOR ESTIMATING RAINFALL. International Journal of Computers and Applications 29:3. . [CrossRef] 80. Koldo Basterretxea, Jos Manuel Tarela, Ins del Campo, Guillermo Bosque. 2007. An Experimental Study on Nonlinear Function Computation for Neural/Fuzzy Hardware Design. IEEE Transactions on Neural Networks 18:1, 266-283. [CrossRef] 81. S. Padma, R. Bhuvaneswari, S. Subramanian. 2007. Application of soft computing techniques to induction motor design. COMPEL: The International Journal for Computation and Mathematics in Electrical and Electronic Engineering 26:5, 1324-1345. [CrossRef] 82. Jian-Xun Peng, Kang Li, George W. Irwin. 2007. 
A Novel Continuous Forward Algorithm for RBF Neural Modelling. IEEE Transactions on Automatic Control 52:1, 117-122. [CrossRef] 83. Puneet Singla, Kamesh Subbarao, John L. Junkins. 2007. Direction-Dependent Learning Approach for Radial Basis Function Networks. IEEE Transactions on Neural Networks 18:1, 203-222. [CrossRef] 84. G.-B. Huang, L. Chen, C.-K. Siew. 2006. Universal Approximation Using Incremental Constructive Feedforward Networks With Random Hidden Nodes. IEEE Transactions on Neural Networks 17:4, 879-892. [CrossRef]
85. Shiuh-Jer Huang, Kuo-Ching Chiou. 2006. An Adaptive Neural Sliding Mode Controller for MIMO Systems. Journal of Intelligent and Robotic Systems 46:3, 285-301. [CrossRef] 86. Heung-Fai Lam, Ka-Veng Yuen, James L. Beck. 2006. Structural Health Monitoring via Measured Ritz Vectors Utilizing Artificial Neural Networks. Computer-Aided Civil and Infrastructure Engineering 21:4, 232-241. [CrossRef] 87. L. Weruaga, B. Kieslinger. 2006. Tikhonov Training of the CMAC Neural Network. IEEE Transactions on Neural Networks 17:3, 613-622. [CrossRef] 88. Huang Qiu-hao, Cai Yun-long. 2006. Assessment of karst rocky desertification using the radial basis function network model and GIS technique: a case study of Guizhou Province, China. Environmental Geology 49:8, 1173-1179. [CrossRef] 89. Nan-Ying Liang, Guang-Bin Huang, P. Saratchandran, N. Sundararajan. 2006. . IEEE Transactions on Neural Networks 17:6, 1411. [CrossRef] 90. C. Wang, D.J. Hill. 2006. Learning From Neural Control. IEEE Transactions on Neural Networks 17:1, 130-146. [CrossRef] 91. Sisil Kumarawadu, Tsu Tian Lee. 2006. . IEEE Transactions on Intelligent Transportation Systems 7:4, 500. [CrossRef] 92. Teck Por Lim, Sadasivan Puthusserypady. 2006. Error criteria for cross validation in the context of chaotic time series prediction. Chaos: An Interdisciplinary Journal of Nonlinear Science 16:1, 013106. [CrossRef] 93. Y. Abe, Y. Iiguni. 2006. Interpolation capability of the periodic radial basis function network. IEE Proceedings - Vision, Image, and Signal Processing 153:6, 785. [CrossRef] 94. Jian Shi, Xing-Gao Liu. 2006. Product quality prediction by a neural soft-sensor based on MSA and PCA. International Journal of Automation and Computing 3:1, 17-22. [CrossRef] 95. Filiz Güneş, Nurhan Türker. 2005. Artificial neural networks in their simplest forms for analysis and synthesis of RF/microwave planar transmission lines. International Journal of RF and Microwave Computer-Aided Engineering 15:6, 587-600. 
[CrossRef] 96. Yevgeniy Bodyanskiy, Nataliya Lamonova, Iryna Pliss, Olena Vynokurova. 2005. An adaptive learning algorithm for a wavelet neural network. Expert Systems 22:5, 235-240. [CrossRef] 97. B. Kim, B. T. Lee, K. K. Lee. 2005. On the use of a neural network to characterize the plasma etching of SiON thin films. Journal of Materials Science: Materials in Electronics 16:10, 673-679. [CrossRef]
98. N. Mai-Duy, R. I. Tanner. 2005. Computing non-Newtonian fluid flow with radial basis function networks. International Journal for Numerical Methods in Fluids 48:12, 1309-1336. [CrossRef] 99. N. Mai-Duy, R. I. Tanner. 2005. Solving high-order partial differential equations with indirect radial basis function networks. International Journal for Numerical Methods in Engineering 63:11, 1636-1654. [CrossRef] 100. Nam Mai-Duy, Thanh Tran-Cong. 2005. An efficient indirect RBFN-based method for numerical solution of PDEs. Numerical Methods for Partial Differential Equations 21:4, 770-790. [CrossRef] 101. M.J. Er, W. Chen, S. Wu. 2005. High-Speed Face Recognition Based on Discrete Cosine Transform and RBF Neural Networks. IEEE Transactions on Neural Networks 16:3, 679-691. [CrossRef] 102. Michael Schmitt . 2005. On the Capabilities of Higher-Order Neurons: A Radial Basis Function ApproachOn the Capabilities of Higher-Order Neurons: A Radial Basis Function Approach. Neural Computation 17:3, 715-729. [Abstract] [PDF] [PDF Plus] 103. A. Krzyzak, D. Schafer. 2005. Nonparametric Regression Estimation by Normalized Radial Basis Function Networks. IEEE Transactions on Information Theory 51:3, 1003-1010. [CrossRef] 104. C. Li, H. Ye, G. Wang, J. Zhang. 2005. A Recursive Nonlinear PLS Algorithm for Adaptive Nonlinear Process Modeling. Chemical Engineering & Technology 28:2, 141-152. [CrossRef] 105. Yugang Niu, James Lam, Xingyu Wang, Daniel W. C. Ho. 2005. Adaptive H[sub ∞] Control Using Backstepping Design and Neural Networks. Journal of Dynamic Systems, Measurement, and Control 127:3, 478. [CrossRef] 106. D. Wang, J. Huang. 2005. Neural Network-Based Adaptive Dynamic Surface Control for a Class of Uncertain Nonlinear Systems in Strict-Feedback Form. IEEE Transactions on Neural Networks 16:1, 195-202. [CrossRef] 107. Shaoduan Ou, Luke E. K. Achenie. 2005. Artificial Neural Network Modeling of PEM Fuel Cells. Journal of Fuel Cell Science and Technology 2:4, 226. [CrossRef] 108. 
Slavisa Trajkovic. 2005. Temperature-Based Approaches for Estimating Reference Evapotranspiration. Journal of Irrigation and Drainage Engineering 131:4, 316. [CrossRef] 109. Yasar Becerikli. 2004. On three intelligent systems: dynamic neural, fuzzy, and wavelet networks for training trajectory. Neural Computing and Applications 13:4, 339-351. [CrossRef] 110. G. Ramakrishna, O.P. Malik. 2004. Radial Basis Function Identifier and Pole-Shifting Controller for Power System Stabilizer Application. IEEE Transactions on Energy Conversion 19:4, 663-670. [CrossRef] 111. L. Zhang, W. Zhou, L. Jiao. 2004. Hidden Space Support Vector Machines. IEEE Transactions on Neural Networks 15:6, 1424-1434. [CrossRef]
112. Rao Wen-bi, Zhang Xiang, Boström Henrik. 2004. Remote intelligent identification system of structural damage. Wuhan University Journal of Natural Sciences 9:5, 812-816. [CrossRef] 113. M. Sanchez-Fernandez, M. de-Prado-Cumplido, J. Arenas-Garcia, F. Perez-Cruz. 2004. SVM Multiregression for Nonlinear Channel Estimation in Multiple-Input Multiple-Output Systems. IEEE Transactions on Signal Processing 52:8, 2298-2307. [CrossRef] 114. R. Shavit, I. Taig. 2004. Comparison study of pattern-synthesis techniques using neural networks. Microwave and Optical Technology Letters 42:2, 175-179. [CrossRef] 115. Eduard J. Gamito, E. David Crawford. 2004. Artificial neural networks for predictive modeling in prostate cancer. Current Oncology Reports 6:3, 216-221. [CrossRef] 116. G. Yue, X. Zhou, X. Wang. 2004. Performance Comparisons of Channel Estimation Techniques in Multipath Fading CDMA. IEEE Transactions on Wireless Communications 3:3, 716-724. [CrossRef] 117. S.S. Ge, C. Wang. 2004. Adaptive Neural Control of Uncertain MIMO Nonlinear Systems. IEEE Transactions on Neural Networks 15:3, 674-692. [CrossRef] 118. Joshua Ashenberg, Enrico C Lorenzini. 2004. Analytical formulation of a complex mutual gravitational field. Classical and Quantum Gravity 21:8, 2089-2100. [CrossRef] 119. M.A. Moreno, J. Usaola. 2004. A New Balanced Harmonic Load Flow Including Nonlinear Loads Modeled With RBF Networks. IEEE Transactions on Power Delivery 19:2, 686-693. [CrossRef] 120. Feng Jiu-Chao, Qiu Yu-Hui. 2004. Identification of Chaotic Systems with Application to Chaotic Communication. Chinese Physics Letters 21:2, 250-253. [CrossRef] 121. S. Ferrari, M. Maggioni, N.A. Borghese. 2004. Multiscale Approximation With Hierarchical Radial Basis Functions Networks. IEEE Transactions on Neural Networks 15:1, 178-188. [CrossRef] 122. J. Gonzalez, I. Rojas, J. Ortega, H. Pomares, J. Fernandez, A. Diaz. 2003. 
Multiobjective evolutionary optimization of the size, shape, and position parameters of radial basis function networks for function approximation. IEEE Transactions on Neural Networks 14:6, 1478-1495. [CrossRef] 123. Yoshifusa Ito . 2003. Activation Functions Defined on Higher-Dimensional Spaces for Approximation on Compact Sets with and without ScalingActivation Functions Defined on Higher-Dimensional Spaces for Approximation on Compact Sets with and without Scaling. Neural Computation 15:9, 2199-2226. [Abstract] [PDF] [PDF Plus]
124. Fan Yang, M. Paindavoine. 2003. Implementation of an rbf neural network on embedded systems: real-time face tracking and identity verification. IEEE Transactions on Neural Networks 14:5, 1162-1175. [CrossRef] 125. E. Lavretsky, N. Hovakimyan, A.J. Calise. 2003. Upper bounds for approximation of continuous-time dynamics using delayed outputs and feedforward neural networks. IEEE Transactions on Automatic Control 48:9, 1606-1610. [CrossRef] 126. Irwin W. Sandberg. 2003. Gaussian radial basis functions and the approximation of input-output maps. International Journal of Circuit Theory and Applications 31:5, 443-452. [CrossRef] 127. Jau-Jia Guo, P.B. Luh. 2003. Selecting input factors for clusters of gaussian radial basis function networks to improve market clearing price prediction. IEEE Transactions on Power Systems 18:2, 665-672. [CrossRef] 128. R Roopesh Kumar Reddy, Ranjan Ganguli. 2003. Structural damage detection in a helicopter rotor blade using radial basis function neural networks. Smart Materials and Structures 12:2, 232-241. [CrossRef] 129. Nam Mai-Duy, Thanh Tran-Cong. 2003. Neural networks for BEM analysis of steady viscous flows. International Journal for Numerical Methods in Fluids 41:7, 743-763. [CrossRef] 130. P. Cerveri, C. Forlani, A. Pedotti, G. Ferrigno. 2003. Hierarchical radial basis function networks and local polynomial un-warping for X-ray image intensifier distortion correction: A comparison with global techniques. Medical & Biological Engineering & Computing 41:2, 151-163. [CrossRef] 131. Irwin W. Sandberg . 2003. Indexed Families of Functionals and Gaussian Radial Basis FunctionsIndexed Families of Functionals and Gaussian Radial Basis Functions. Neural Computation 15:2, 455-468. [Abstract] [PDF] [PDF Plus] 132. Xiaobo Zhou, Xiaodong Wang. 2003. Channel estimation for ofdm systems using adaptive radial basis function networks. IEEE Transactions on Vehicular Technology 52:1, 48-59. [CrossRef] 133. W Lin, M H Wu, S Duan. 2003. 
Engine Test Data Modelling by Evolutionary Radial Basis Function Networks. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 217:6, 489-497. [CrossRef] 134. Y. Abe, Y. Iiguni. 2003. Fast computation of RBF coefficients for regularly sampled inputs. Electronics Letters 39:6, 543. [CrossRef] 135. Zoran Vojinovic, Vojislav Kecman, Vladan Babovic. 2003. Hybrid Approach for Modeling Wet Weather Response in Wastewater Systems. Journal of Water Resources Planning and Management 129:6, 511. [CrossRef] 136. Michael Schmitt . 2002. Descartes' Rule of Signs for Radial Basis Function Neural NetworksDescartes' Rule of Signs for Radial Basis Function Neural Networks. Neural Computation 14:12, 2997-3011. [Abstract] [PDF] [PDF Plus]
137. Siu-Yeung Cho , Tommy W. S. Chow . 2002. A New Color 3D SFS Methodology Using Neural-Based Color Reflectance Models and Iterative Recursive MethodA New Color 3D SFS Methodology Using Neural-Based Color Reflectance Models and Iterative Recursive Method. Neural Computation 14:11, 2751-2789. [Abstract] [PDF] [PDF Plus] 138. C. Panchapakesan, M. Palaniswami, D. Ralph, C. Manzie. 2002. Effects of moving the center's in an RBF network. IEEE Transactions on Neural Networks 13:6, 1299-1307. [CrossRef] 139. N. Hovakimyan, F. Nardi, A.J. Calise. 2002. A novel error observer-based adaptive output feedback approach for control of uncertain systems. IEEE Transactions on Automatic Control 47:8, 1310-1314. [CrossRef] 140. Kah Phooi Seng, Zhihong Man, Hong Ren Wu. 2002. Lyapunov-theory-based radial basis function networks for adaptive filtering. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 49:8, 1215-1220. [CrossRef] 141. Meng Joo Er, Shiqian Wu, Juwei Lu, Hock Lye Toh. 2002. Face recognition with radial basis function (RBF) neural networks. IEEE Transactions on Neural Networks 13:3, 697-710. [CrossRef] 142. Michael Schmitt . 2002. Neural Networks with Local Receptive Fields and Superlinear VC DimensionNeural Networks with Local Receptive Fields and Superlinear VC Dimension. Neural Computation 14:4, 919-956. [Abstract] [PDF] [PDF Plus] 143. Byungwhan Kim, Sungjin Park. 2002. Characterization of inductively coupled plasma using neural networks. IEEE Transactions on Plasma Science 30:2, 698-705. [CrossRef] 144. P. Cerveri, C. Forlani, N. A. Borghese, G. Ferrigno. 2002. Distortion correction for x-ray image intensifiers: Local unwarping polynomials and RBF neural networks. Medical Physics 29:8, 1759. [CrossRef] 145. Ming Zhang, Shuxiang Xu, J. Fulcher. 2002. Neuron-adaptive higher order neural-network models for automated financial data modeling. IEEE Transactions on Neural Networks 13:1, 188. [CrossRef] 146. J. Gonzalez, H. Rojas, J. 
Ortega, A. Prieto. 2002. A new clustering technique for function approximation. IEEE Transactions on Neural Networks 13:1, 132. [CrossRef] 147. J. Q. Gong, Bin Yao. 2001. Neural network adaptive robust control with application to precision motion control of linear motors. International Journal of Adaptive Control and Signal Processing 15:8, 837-864. [CrossRef] 148. Abelardo Errejon, E. David Crawford, Judith Dayhoff, Colin O'Donnell, Ashutosh Tewari, James Finkelstein, Eduard J. Gamito. 2001. Use Of Artificial Neural Networks In Prostate Cancer. Molecular Urology 5:4, 153-158. [CrossRef]
149. Siu-Yeung Cho , Tommy W. S. Chow . 2001. Enhanced 3D Shape Recovery Using the Neural-Based Hybrid Reflectance ModelEnhanced 3D Shape Recovery Using the Neural-Based Hybrid Reflectance Model. Neural Computation 13:11, 2617-2637. [Abstract] [PDF] [PDF Plus] 150. Irwin W. Sandberg. 2001. Gaussian radial basis functions and inner product spaces. Circuits Systems and Signal Processing 20:6, 635-642. [CrossRef] 151. Axel Röbel . 2001. Synthesizing Natural Sounds Using Dynamic Models of Sound AttractorsSynthesizing Natural Sounds Using Dynamic Models of Sound Attractors. Computer Music Journal 25:2, 46-61. [Citation] [PDF] [PDF Plus] 152. Judith E. Dayhoff, James M. DeLeo. 2001. Artificial neural networks. Cancer 91:S8, 1615-1635. [CrossRef] 153. C Manzie, M Palaniswami, H Watson. 2001. Gaussian networks for fuel injection control. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 215:10, 1053-1068. [CrossRef] 154. Siu-Yueng Cho, T.W.S. Chow. 2001. Neural computation approach for developing a 3D shape reconstruction model. IEEE Transactions on Neural Networks 12:5, 1204. [CrossRef] 155. R.J. Schilling, J.J. Carroll, A.F. Al-Ajlouni. 2001. Approximation of nonlinear systems with radial basis function neural networks. IEEE Transactions on Neural Networks 12:1, 1. [CrossRef] 156. Koji Okuhara, Koji Sasaki, Shunji Osaki. 2000. Reproductive and competitive radial basis function networks adaptable to dynamical environments. Systems and Computers in Japan 31:13, 65-75. [CrossRef] 157. T. Knohl, H. Unbehauen. 2000. Nichtlineare adaptive Regelung von Hammerstein-Systemen mittels RBF-Netze (Adaptive Control of Hammerstein Systems using RBF Networks). at - Automatisierungstechnik 48:11/2000, 547. [CrossRef] 158. Willi Freeden, Oliver Glockner, Joachim Rang, Gebhard Schüler. 2000. Von forstmeteorologischen Punktmessungen zu räumlich aggregierten Daten. Forstwissenschaftliches Centralblatt 119:1-6, 332-349. [CrossRef] 159. L.S.H. 
Ngia, J. Sjoberg. 2000. Efficient training of neural nets for nonlinear adaptive filtering using a recursive Levenberg-Marquardt algorithm. IEEE Transactions on Signal Processing 48:7, 1915-1927. [CrossRef] 160. Ching-Sung Shieh, Chin-Teng Lin. 2000. Direction of arrival estimation based on phase differences using neural fuzzy network. IEEE Transactions on Antennas and Propagation 48:7, 1115-1124. [CrossRef] 161. T. Sigitani, Y. Iiguni, H. Maeda. 2000. Progressive cross-section display of 3D medical images. Medical & Biological Engineering & Computing 38:2, 140-149. [CrossRef]
162. T.W.S. Chow, Siu-Yeung Cho. 2000. Learning parametric specular reflectance model by radial basis function network. IEEE Transactions on Neural Networks 11:6, 1498. [CrossRef] 163. Mirko van der Baan, Christian Jutten. 2000. Neural networks in geophysical applications. Geophysics 65:4, 1032. [CrossRef] 164. M. Pappas, I. Pitas. 2000. Digital color restoration of old paintings. IEEE Transactions on Image Processing 9:2, 291. [CrossRef] 165. H.D. Patino, D. Liu. 2000. Neural network-based model reference adaptive control system. IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 30:1, 198. [CrossRef] 166. R.B. McLain, M.A. Henson, M. Pottmann. 1999. Direct adaptive control of partially known nonlinear systems. IEEE Transactions on Neural Networks 10:3, 714-721. [CrossRef] 167. N.B. Karayiannis. 1999. Reformulated radial basis neural networks trained by gradient descent. IEEE Transactions on Neural Networks 10:3, 657-671. [CrossRef] 168. T. Sigitani, Y. Iiguni, H. Maeda. 1999. Image interpolation for progressive transmission by using radial basis function networks. IEEE Transactions on Neural Networks 10:2, 381-390. [CrossRef] 169. Jie Zhang, A.J. Morris. 1999. Recurrent neuro-fuzzy networks for nonlinear process modeling. IEEE Transactions on Neural Networks 10:2, 313-326. [CrossRef] 170. N.W. Townsend, L. Tarassenko. 1999. Estimations of error bounds for neural-network function approximators. IEEE Transactions on Neural Networks 10:2, 217-230. [CrossRef] 171. A.G. Bors, I. Pitas. 1999. Object classification in 3-D images using alpha-trimmed mean radial basis function network. IEEE Transactions on Image Processing 8:12, 1744. [CrossRef] 172. G. De Nicolao, G.F. Trecate. 1999. Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control 44:11, 2045. [CrossRef] 173. E. Gelenbe, Zhi-Hong Mao, Yan-Da Li. 1999. Function approximation with spiked random networks. 
Communicated by Yann LeCun
Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning
Gale L. Martin
James A. Pittman
MCC, Austin, Texas 78759 USA
Neural Computation 3, 258-267 (1991) © 1991 Massachusetts Institute of Technology

We report on results of training backpropagation nets with samples of hand-printed digits scanned from bank checks and hand-printed letters interactively entered into a computer through a stylus digitizer. Generalization results are reported as a function of training set size and network capacity. Given a large training set, and a net with sufficient capacity to achieve high performance on the training set, nets typically achieved error rates of 4-5% at a 0% reject rate and 1-2% at a 10% reject rate. The topology and capacity of the system, as measured by the number of connections in the net, have surprisingly little effect on generalization. For those developing hand-printed character recognition systems, these results suggest that a large and representative training sample may be the single most important factor in achieving high recognition accuracy. Benefits of reducing the number of net connections, other than improving generalization, are discussed.

Practical interest in hand-printed character recognition is fueled by two current technology trends: one toward systems that interpret hand-printing on hard-copy documents and one toward notebook-like computers that replace the keyboard with a stylus digitizer. The stylus enables users to write and draw directly on a flat panel display. In this paper, we report on the results of applying multilayered neural nets trained through backpropagation (Rumelhart et al. 1986) to both cases.

Developing hand-printed character recognition systems is typically a two-stage process. First, intuition and lengthy experimentation are used to select a set of features to represent the raw input pattern. Then a variety of techniques are used to optimize the classifier system that assumes this featural representation. Most applications of backpropagation learning to character recognition use the learning capabilities only for this latter stage, developing the classifier system (Burr 1986; Denker et al. 1989; Mori and Yokosawa 1989; Weideman et al. 1989). We have come to believe that the strength of backpropagation techniques in this domain lies in automating the development process. We find that much of the hand-crafting involved in selecting features can be avoided by feeding the net presegmented, size-normalized gray scale arrays for the input characters. In addition, generalization performance is surprisingly insensitive to characteristics of the network architecture, as long as enough training samples are used and there is sufficient capacity to support training to high levels.

We report on results for both hand-printed digits and letters. The hand-printed digits come from a set of 40,000 hand-printed digits scanned from the numeric amount region of "real-world" bank checks. They were automatically presegmented and size-normalized to a 15 x 24 gray scale array, with pixel values ranging from 0 to 1.0. The test set consists of 4000 samples and training sets varied from 100 to 35,200 samples. Although it is always difficult to compare recognition rates arising from different pattern sets, some appreciation for the difficulty of categorization can be gained using human performance data as a benchmark. An independent person categorizing the test set of presegmented, size-normalized digits achieved an error rate of 3.4%. This accuracy falls considerably short of the near-perfect performance of operators keying in numbers directly from bank checks, because the segmentation algorithm is flawed.

Working with letters, as well as digits, enables tests of the generality of results on a different pattern set having more than double the number of output categories. The hand-printed letters come from a set of 8600 upper-case letters collected from over 110 people writing with a stylus input device on a flat panel display. The stylus collects a sequence of x-y coordinates at 200 points per second at a spatial resolution of 1000 points per inch. The temporal sequence for each character is first converted to a size-normalized bitmap array, keeping aspect ratio constant. We have found that recognition accuracy is significantly improved if these bitmaps are blurred through convolution with a Gaussian distribution.
Each pattern is represented as a 15 x 24 gray scale image, with pixel values ranging from 0 to 1.0. A test set of 2368 samples was extracted by selecting samples from 18 people, so that training sets were generated by people different from those generating the test set. Training set sizes ranged from 500 to roughly 6300 samples.

Nets were trained to error rates of 2-3%. Training began with a learning rate of 0.05 and a momentum value of 0.9. Toward the end of training, the learning rate was decreased when training accuracy began to oscillate or had stabilized for a large number of training epochs. Output vectors were evaluated on a winner-take-all basis, as we have found that this consistently improves accuracy and is less sensitive to variations in network parameters. The logistic activation function was used for all hidden and output nodes. Nets were run on two different simulators, with care taken to verify results across the simulators. One was an altered version of the backpropagation simulator developed by McClelland and Rumelhart (1988) running on a Sun 4 computer, and the second was a backpropagation simulator written by the second author in Lisp, running on a Symbolics 3640.
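The blurring step described above for the stylus bitmaps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the kernel width `sigma`, the kernel radius, the helper names, and the choice of 24 rows by 15 columns are all assumed, since the text gives only the 15 x 24 array size and the 0 to 1.0 pixel range.

```python
import math

def gaussian_kernel(sigma=1.0, radius=2):
    """1-D Gaussian weights normalized to sum to 1 (sigma and radius are assumed values)."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_bitmap(bitmap, sigma=1.0, radius=2):
    """Blur a binary bitmap (list of rows) with a separable Gaussian, then rescale to [0, 1]."""
    kernel = gaussian_kernel(sigma, radius)

    def convolve_rows(image):
        out = [[0.0] * len(image[0]) for _ in image]
        for r, row in enumerate(image):
            for c in range(len(row)):
                acc = 0.0
                for k, w in enumerate(kernel):
                    cc = c + k - radius
                    if 0 <= cc < len(row):      # pixels outside the array count as 0
                        acc += w * row[cc]
                out[r][c] = acc
        return out

    def transpose(image):
        return [list(col) for col in zip(*image)]

    # Separable 2-D blur: filter along rows, then along columns.
    horizontal = convolve_rows(bitmap)
    blurred = transpose(convolve_rows(transpose(horizontal)))

    peak = max(max(row) for row in blurred)
    if peak == 0.0:
        return blurred                          # empty bitmap: nothing to rescale
    return [[v / peak for v in row] for row in blurred]

# Example: a vertical stroke on a bitmap of 24 rows by 15 columns (orientation assumed).
bitmap = [[1 if c == 7 else 0 for c in range(15)] for _ in range(24)]
gray = blur_bitmap(bitmap)
```

The result is a gray scale array in which pixels next to a stroke carry intermediate values, so small shifts in stroke position change the input to the net gradually rather than abruptly.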
Table 1: Error rates of best nets trained on largest sample sets and tested on new samples. Rejections were based on placing a threshold for the acceptable distance between the highest and next highest activation values in the output vector.

Reject rate (%)   Digits (%)   Letters (%)
0                 4            5
5                 3            3
10                1            2
35                0.001        0.003
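The rejection rule in the table caption, combined with winner-take-all evaluation, might look like the sketch below. The function names, the `threshold` parameter, and the toy activations are hypothetical, and computing the error rate over accepted patterns only is an assumption; the text does not say which denominator was used.

```python
def classify_with_reject(outputs, threshold):
    """Winner-take-all decision with a margin-based reject option.

    outputs   -- output-node activations, one per category
    threshold -- minimum gap required between the two highest activations
    Returns the index of the winning category, or None if the pattern is rejected.
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if outputs[best] - outputs[runner_up] < threshold:
        return None
    return best

def error_and_reject_rates(output_vectors, labels, threshold):
    """Return (error rate among accepted patterns, reject rate), both as fractions."""
    decisions = [classify_with_reject(o, threshold) for o in output_vectors]
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    reject_rate = 1.0 - len(accepted) / len(decisions)
    errors = sum(1 for d, y in accepted if d != y)
    error_rate = errors / len(accepted) if accepted else 0.0
    return error_rate, reject_rate

# Toy activations for a three-category problem (made-up numbers, not data from the paper).
toy_outputs = [[0.9, 0.1, 0.0], [0.5, 0.45, 0.05], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]
toy_labels = [0, 1, 1, 1]
no_reject = error_and_reject_rates(toy_outputs, toy_labels, threshold=0.0)
with_reject = error_and_reject_rates(toy_outputs, toy_labels, threshold=0.1)
```

Raising the threshold trades a higher reject rate for a lower error rate on the patterns that remain, which is the trade-off the table reports.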
1 High Recognition Accuracy
We find relatively high recognition accuracy for both pattern sets. Table 1 reports error rates achieved on the test samples for both pattern sets, at various reject rates. In the case of the hand-printed digits, the 4% error rate (0% rejects) approaches the 3.4% errors made by the human judge. This suggests that further improvements to generalization will require improving segmentation accuracy. The fact that an error rate of 5% was achieved for letters is promising. Accuracy is fairly high, even though there are a large number of categories (26). This error rate may be adequate for applications where contextual constraints can be used to significantly boost accuracy at the word level. In general, these data suggest that, for this domain at least, it is not necessary to hand select an optimal feature set to represent the input characters.
2 Minimal Network Capacity and Topology Effects
The effects of network parameters on generalization have both practical and scientific significance. The practical developer of hand-printed character recognition systems is interested in such effects to determine whether limited resources should be spent on trying to optimize network parameters or on collecting a larger, more representative training set. For the scientist, the interest lies in discovering the strength of content-specific factors in determining training and generalization performance, or equivalently, the extent to which general models describe behavior in specific circumstances.

A central premise of most general models of learning-by-example is that the size of the initial search space - the capacity of the system - determines the number of training samples needed to achieve high generalization performance. Learning is conceptualized as a search for a function that maps all possible inputs to their correct outputs. Learning occurs by comparing successive samples of input-output pairs to functions in a search space. Functions inconsistent with training samples are rejected. Very large training sets narrow the search to a function that closely approximates the desired function and yields high generalization. The capacity of a learning system - the number of functions it can represent - determines generalization, since a larger initial search space requires more training samples to narrow the search sufficiently. This suggests that to improve generalization, capacity should be minimized.

Unfortunately, it is typically unclear how to minimize capacity without eliminating the desired function from the search space. An often-suggested heuristic is that simpler is better. It receives support from experience in curve fitting. Low-order polynomials typically extrapolate and interpolate better than high-order polynomials (Duda and Hart 1973). Extensions of the heuristic to neural net learning propose reducing capacity by reducing the number of connections or the number of bits used to represent each connection weight (Baum and Haussler 1989; Denker et al. 1987; LeCun et al. 1989). Predicting the size of generalization improvements to be achieved in any specific case is difficult, though. Due to the gradient descent nature of backpropagation learning, not all functions that can be represented will be visited during learning. Increases in network capacity may thus have little effect on the actual search space for backpropagation. In addition, content-specific factors, such as the proportion of representable functions that roughly match the desired function, may hide capacity effects for particular domains.
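The search-space picture above can be made concrete with a toy case. The sketch below enumerates every Boolean function of three binary inputs and discards those inconsistent with each training sample. The three-input setting, the parity target, and all names are chosen purely for illustration and are not a model from the paper.

```python
from itertools import product

N_INPUTS = 3
INPUTS = list(product([0, 1], repeat=N_INPUTS))      # all 8 possible input patterns

# A function is represented by its truth table: one output bit per input pattern.
# The capacity of this toy learner is the number of such tables, 2 ** 8 = 256.
search_space = list(product([0, 1], repeat=len(INPUTS)))

def target(x):
    """Hypothetical desired function: parity of the input bits."""
    return sum(x) % 2

def consistent(table, sample_input, sample_output):
    """True if the truth table reproduces the given input-output pair."""
    return table[INPUTS.index(sample_input)] == sample_output

remaining = search_space
history = [len(remaining)]
for x in INPUTS[:5]:                                  # five distinct training samples
    y = target(x)
    remaining = [t for t in remaining if consistent(t, x, y)]
    history.append(len(remaining))

# Fraction of surviving functions that agree with the target on an unseen input.
unseen_index = 7
agree = sum(1 for t in remaining if t[unseen_index] == target(INPUTS[unseen_index])) / len(remaining)
```

In this unconstrained space each distinct sample halves the set of surviving functions, so a larger starting space needs more samples to reach the same size, and the survivors remain evenly split on any input not yet seen. Restricting the initial space is what lets a learner generalize from fewer samples, provided the desired function is still inside it.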
We evaluated the extent to which reducing the capacity of a net improves generalization on hand-printed character recognition as a function of training set size. Net capacity was manipulated in three ways: (1) reducing the number of hidden nodes, (2) reducing the number of connections by limiting connectivity to local areas, and (3) sharing connection weights between hidden nodes. We found only marginal effects on generalization.

2.1 Number of Hidden Nodes. Figure 1 presents generalization results as a function of training set size for nets having one hidden layer and varying numbers of hidden nodes. The number of free parameters (i.e., number of connections and biases) in each case is presented in parentheses. Despite considerable variation in the number of free parameters, using nets with fewer hidden nodes did not improve generalization.
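For orientation, the number of free parameters in a fully connected net with one hidden layer can be counted as in this sketch. It assumes one bias per hidden and output node and one output node per category (10 for digits, 26 for letters); the function name is illustrative.

```python
def free_parameters(layer_sizes):
    """Connections plus biases for a fully connected feedforward net."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

N_INPUTS = 15 * 24    # pixels in the gray scale input array

for hidden in (50, 100):
    # 10 output nodes, one per digit category (assumed output coding)
    print(hidden, free_parameters([N_INPUTS, hidden, 10]))
```

Under these assumptions the count grows almost linearly with the number of hidden nodes, because the input-to-hidden connections dominate: doubling the hidden layer from 50 to 100 nodes roughly doubles the number of free parameters.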
[Plot not reproduced: % correct versus training set size (1000 to 100000), one curve per number of hidden nodes.]
Figure 1: Effect of number of hidden nodes and training set size on generalization.

2.2 Local Connectivity and Shared Weights. Another way to reduce the number of free parameters is to limit connectivity between layers to local areas and to predispose the net toward developing position-invariant feature detectors through the use of weight sharing (LeCun 1989; Rumelhart et al. 1986). The incoming weights are shared across a set of hidden nodes, in the sense that corresponding weights leading into these nodes are randomly initialized to the same values and constrained to have equivalent updates during learning.

Three types of network architectures were evaluated, each having two hidden layers. Global nets had 150 nodes in the first hidden layer and 50 nodes in the second. Each node was connected to all nodes in the preceding layer. In the local nets, 540 first hidden layer nodes received input from 5 x 8 local and overlapping regions (offset by 2 pixels) on the input array. The 100 nodes comprising the second hidden layer and the nodes in the output layer had global connections to the preceding layer.

For the local, shared nets, the first hidden layer consisted of 540 nodes, composed of 10 groups of 54 nodes. Each node received input from 5 x 8 local and overlapping regions (offset by 2 pixels) on the input layer. The 54 nodes making up a group comprised a 6 x 9 array of nodes. Within a group, the corresponding incoming weights to nodes were shared so that the same feature map developed. The ten different groups correspond to ten different feature maps. Thus, the structure of the first hidden layer can be visualized as a 6 x 9 x 10 cube of nodes. The second hidden layer consisted of 102 nodes, composed of 17 groups of 6 nodes. Each of these nodes received input from 4 x 5 x 10 local and overlapping (offset by 2) regions on the cube making up the first hidden layer. The 17 groups correspond to 17 different feature maps. Each node in the output layer was connected to all nodes in the second hidden layer. Experiments on nets with shared versus unique biases revealed no difference in performance, and so we adopted the use of shared biases as a standard.
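The layer sizes quoted above follow from the receptive-field geometry, and the same arithmetic gives a count of free parameters for each architecture. The sketch below is a reconstruction from the description in the text, not the authors' code; it assumes one output node per category (10 for digits, 26 for letters), one bias per node in unshared layers, and one bias per feature map in shared layers. All names are illustrative.

```python
def positions(size, field, offset):
    """Number of receptive-field placements along one axis."""
    return (size - field) // offset + 1

def dense(n_in, n_out):
    """Connections plus biases for a fully connected layer."""
    return (n_in + 1) * n_out

# First hidden layer: 5 x 8 fields, offset 2, on the 15 x 24 input array.
map_w, map_h = positions(15, 5, 2), positions(24, 8, 2)        # 6 x 9 placements
n_maps1 = 10
hidden1 = map_w * map_h * n_maps1                               # 540 nodes
field1 = 5 * 8

# Second hidden layer of the shared net: 4 x 5 x 10 fields, offset 2, on the 6 x 9 x 10 cube.
grp_w, grp_h = positions(map_w, 4, 2), positions(map_h, 5, 2)   # 2 x 3 placements
n_maps2 = 17
hidden2_shared = grp_w * grp_h * n_maps2                        # 102 nodes
field2 = 4 * 5 * n_maps1

def global_net(n_out):
    return dense(15 * 24, 150) + dense(150, 50) + dense(50, n_out)

def local_net(n_out):
    # Every first-layer node owns its 40 weights and a bias; later layers are global.
    return hidden1 * (field1 + 1) + dense(hidden1, 100) + dense(100, n_out)

def local_shared_net(n_out):
    # One weight set and one bias per feature map in both hidden layers.
    return n_maps1 * (field1 + 1) + n_maps2 * (field2 + 1) + dense(hidden2_shared, n_out)

for name, count in (("global", global_net), ("local", local_net), ("local, shared", local_shared_net)):
    print(name, count(10), count(26))    # digits, letters
```

Under these assumptions the local net with 26 outputs comes to 78,866 free parameters, the value that appears in the Figure 2 legend, while the local, shared net needs more than an order of magnitude fewer.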
[Plot not reproduced: % correct versus training set size (1000 to 100000), with a panel for letters and curves for local and local, shared nets; the legend gives free-parameter counts, including (78866).]
Figure 2: Effects of net capacity and topology on generalization.

As indicated in Figure 2, we found only negligible generalization improvements in moving from nets with global connectivity to nets with local receptive fields or to nets with local receptive fields and shared weights. This is true despite a substantial reduction in the number of free parameters. The positive effects that do occur are at relatively small training set sizes. We also found minimal differences in looking at the rejection rate data. Consider the nets trained on 35,200 digits. At a 4.6% rejection rate, the global net yielded 3.0% errors, the local net yielded 2.1% errors, and the local, shared net yielded 2.8% errors. At a 9.6% rejection rate, the corresponding data are as follows: global - 1.7% errors; local - 1.1% errors; and local, shared - 1.7% errors.

We have experimented with a large variety of different net architectures of this sort for hand-printed character recognition, varying the number of hidden nodes, the sizes and overlap of local receptive fields, and the use of local receptive fields with and without shared weights in one or both hidden layers. We find only marginal and inconsistent indications that constraining net capacity improves generalization. For the practical developer of hand-printed character recognition systems, with a focus only on optimizing generalization, these results suggest that it is probably better to devote limited resources to collecting a very large, representative training set than to extensive experimentation with different net architectures.

2.3 Comparisons with Similar Work. The present study has similarities and differences with the work of LeCun et al. (1989) on recognizing presegmented, hand-printed zip code digits. In both studies, presegmented character images were fed to nets with a local, shared architecture, and high generalization resulted. In both cases, analyses of the feature maps in the first hidden layer units revealed what appear to be oriented line and edge detectors (see Fig. 3), analogous to feature detectors found in visual cortex (Hubel and Wiesel 1979) and to Linsker's (1986) orientation-selective nodes, which emerge from a self-adaptive net exposed to random patterns. The similar findings in the two studies suggest that methodological differences between the studies, such as the use of unique versus shared biases and different receptive field sizes, do not significantly affect generalization or the nature of the solution found.

Figure 3: Receptive fields that evolved in first hidden layer nodes in nets with local receptive fields having shared weights. Each of the eight large, gray rectangles corresponds to the receptive field for a hidden node. The four on the left came from a net trained on digits; those on the right from a net trained on letters. Within each of these eight, the black rectangles correspond to negative weights and the white to positive weights. The size of the black and white rectangles reflects the magnitude of the weights.

The primary difference between the two studies lies in interpreting the importance of architectural constraints on achieving high generalization. LeCun and his colleagues argue that reducing the number of free parameters in a net is key to achieving high generalization in a hand-printed character recognition system. They report results that a global net architecture with one hidden layer of 40 nodes yielded inferior generalization rates. Our work with hand-printed character recognition has led us to the opposite interpretation: constraining the number of free parameters has negligible effects on generalization in this domain. For example, as reported in Figure 1, we found that global net architectures with 50 hidden nodes did yield inferior generalization, but that increasing the number of hidden nodes raised generalization to that of the more constrained architectures. This suggests that the problem was due to insufficient, rather than excess, capacity.

Certainly, there are likely to be architectural constraints that push performance up to human accuracy levels by biasing the net toward discovering the range of invariants that underlie human pattern recognition. The problem is that only a few of these invariants have been explicitly specified (e.g., position, size, rotation). It is not clear how to predispose a net toward discovering the full range. The encouraging aspect of these data is that, even without knowing how to specify such constraints, using training set sizes on the order of thousands to tens of thousands of samples enables high generalization for this domain.

2.4 Limiting the Generality of Interpretation. For a number of reasons, it is important to refrain from interpreting these results too broadly. In a related study (Martin et al. 1990), we found that network capacity effects on training and generalization depend quite strongly on the content of the material being learned. Increasing the order of complexity of a specialized variant of the parity function makes both training and generalization more sensitive to net capacity. Using too few hidden nodes substantially increased the likelihood of encountering apparent local minima. Using too many hidden nodes substantially increased the size of the training set required to achieve a given level of generalization.

Even in the domain of hand-printed character recognition, network capacity may have a stronger influence on generalization when different techniques are used. For example, we regularly train only to 2-3% error rates. This helps to avoid the possibility of overfitting the data, although we have seen no indication of this when we have trained to higher levels, as long as we use large training sets. Another possibility is that the number of connections may not be a good measure of capacity.
The amount of information that can be passed on by a given connection may be a better measure than the number of connections. Our failure to find more than marginal capacity effects on generalization may also be due to not using big enough nets. Hand-printed character recognition tasks that require larger input arrays are likely to require even larger nets. This might cause more substantial capacity effects to emerge, although some initial experiments in varying input size for hand-printed characters failed to indicate that generalization decreases with increased input pattern size (Martin et al. 1990).

Finally, it is important to point out that generalization performance on presegmented characters is not the only determinant of a successful hand-printed character recognition system. There are good practical and scientific reasons for using constrained architectures. One reason is to speed processing. Another is that using local receptive fields with shared weights predisposes a net toward position invariance (LeCun 1989; Rumelhart et al. 1986), and this may be important in developing nets that can segment, as well as recognize, characters. It is also the case that nets with a local, shared architecture develop internal representations that seem easier to interpret. Working with nets of this sort may lead to insights on useful features and on how biological vision systems use features such as oriented lines to develop powerful pattern recognition systems.

Acknowledgments

We would like to thank the NCR Corporation for loaning us the set of hand-printed digits, Peter Robinson and Dave Rumelhart for their helpful advice, and Joyce Conner, Janet Kilgore, and Kay Bauer for their invaluable help in collecting the set of hand-printed letters.
References

Baum, E., and Haussler, D. 1989. What size net gives valid generalization? In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 81-90. Morgan Kaufmann, San Mateo, CA.
Burr, D. J. 1986. A neural network digit recognizer. Proc. 1986 Int. Conf. Systems, Man, Cybernetics, Atlanta, GA, pp. 1621-1625.
Denker, J. S., Gardner, W. R., Graf, H. P., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., Baird, H. S., and Guyon, I. 1989. Neural network recognizer for hand-written zip code digits. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 323-331. Morgan Kaufmann, San Mateo, CA.
Denker, J. S., Schwartz, D., Wittner, B., Solla, S., Howard, R. E., Jackel, L. D., and Hopfield, J. 1987. Large automatic learning, rule extraction and generalization. Complex Syst. 1, 877-933.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, NY.
Hubel, D. H., and Wiesel, T. N. 1979. Brain mechanisms of vision. Sci. Am. 241, 150-162.
LeCun, Y. 1989. Generalization and Network Design Strategies. Tech. Rep. CRG-TR-89-4, Department of Computer Science, University of Toronto.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
Linsker, R. 1986. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Martin, G. L., Leow, W. K., and Pittman, J. A. 1990. Function Complexity Effects on Backpropagation Learning. MCC Tech. Rep. #ACT-HI-062-90.
McClelland, J. L., and Rumelhart, D. E. 1988. Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. MIT Press, Cambridge, MA.
Mori, Y., and Yokosawa, K. 1989. Neural networks that learn to discriminate similar kanji characters. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 332-339. Morgan Kaufmann, San Mateo, CA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA.
Weideman, W. E., Manry, M. T., and Yau, H. C. 1989. A comparison of a nearest neighbor classifier and a neural network for numerical handprint character recognition. IEEE International Conference on Neural Networks, Washington, DC, 1989.
Received 2 February 1990; accepted 26 November 1990.
Communicated by Richard Durbin
Constrained Nets for Graph Matching and Other Quadratic Assignment Problems Petar D. Simić Division of Physics, California Institute of Technology, Pasadena, CA 91125 USA
Some time ago Durbin and Willshaw proposed an interesting parallel algorithm (the "elastic net") for approximately solving some geometric optimization problems, such as the Traveling Salesman Problem. Recently it has been shown that their algorithm is related to the neural networks of Hopfield and Tank, and that both can be understood as the semiclassical approximation to the statistical mechanics of related physical models. The main point of the elastic net algorithm is seen to be in the way one deals with the constraints when evaluating the effective cost function (free energy in the thermodynamic analogy), and not in its geometric foundation emphasized originally by Durbin and Willshaw. As a consequence, the elastic net algorithm is a special case of the more general physically based computations and can be generalized to a large class of nongeometric problems. In this paper we further elaborate on this observation, and generalize the elastic net to the quadratic assignment problem. We work out in detail its special case, the graph matching problem, because it is an important problem with many applications in computational vision and neural modeling. Simulation results on random graphs, and on structured (hand-designed) graphs of moderate size (20-100 nodes), are discussed.
1 Introduction
Recently, Durbin and Willshaw (1987) proposed an interesting and very efficient algorithm for approximately solving some geometric optimization problems, such as the Euclidean Traveling Salesman Problem (TSP). Their algorithm - they call it the "elastic net" method - naturally lends itself to implementation in parallel hardware. Although it is very different from the neural network of Hopfield and Tank (1985), the two are in fact related (Simić 1990). They both can be understood as the semiclassical approximation to statistical mechanics of related physical models, and can be derived and systematically corrected (at least in principle) from "first principles." Durbin and Willshaw considered their algorithm to be fundamentally geometric - it could be applied to many geometric optimization problems, but not to more general nongeometric problems such as the TSP with an arbitrary matrix of distances, or the quadratic matching problems (Durbin and Willshaw 1987; Willshaw 1990).
We have observed, however, that within the probabilistic approach to optimization, the main point about the elastic net algorithm is in the way one handles the global constraints when evaluating the effective cost function (the free energy in the thermodynamic analogy). If all the constraints are enforced softly, the neural network is obtained. On the other hand, if about half of the constraints are enforced strongly, the elastic network is obtained. As a consequence, the elastic net algorithm is a special case of a class of more general, physics-based computations - we call them constrained neural nets - and can be generalized to a large class of optimization problems (Simić 1990). In related work, Peterson and Soderberg (1989) also considered the TSP and imposed global constraints on the Hopfield and Tank model by mapping it into a Potts glass. Although they did not establish the connection with the elastic net method, some of the formulas obtained in Simić (1990) and reviewed here were independently obtained by these authors. Further support for the idea of enforcing global constraints in generalizing the elastic net came from the work of Yuille (1990).
In this paper we generalize the elastic net algorithm to an important class of nongeometric optimization problems - the quadratic assignment problem and its specializations, such as the graph matching problem. These are representative of a general class of one-to-one (or many) matching problems characterized by quadratic, as opposed to linear (Yuille 1990), dependence of the objective function on the binary variables. Their potential relevance for invariant object recognition and model matching problems in vision was recently discussed in the context of neural style optimizations (von der Malsburg and Bienenstock 1986; von der Malsburg 1988; see also Mjolsness et al. 1989; Kree and Zippelius 1988). We discuss our experiments with the constrained net algorithm on random graphs, and on structured (hand-designed) graphs of moderate size (20-100 nodes), which are very encouraging.
Neural Computation 3, 268-281 (1991) © 1991 Massachusetts Institute of Technology
2 Constrained Mean-Field Nets for Graph Matching and Other Quadratic Assignment Problems
The important thing about a graph is its topology, that is, the way its nodes are linked together. The link connecting two nodes of a graph could represent a physical connection between, say, the hand and the shoulder - if the graph represents the human body - or it could represent a conceptual relation between the two objects, parts of some higher level (cognitive) structure - for example, object-class relationships. One can do all kinds of transformations on graphs; so long as the transformation does not involve breaking of the graph links, the resulting graph is considered identical to the original.
Figure 1: Two hand-designed isomorphic graphs as an example of a graph matching problem.
This invariance with respect to coordinate transformations is an important characteristic of graphs, and makes them attractive for describing invariant relationships between objects (von der Malsburg and Bienenstock 1986; von der Malsburg 1988). The graph matching problem is to determine if the two graphs have the same topology, that is, if the two graphs are in fact identical. In Figure 1, it is relatively easy to see that the two graphs are identical; they are just different drawings of the same structure. In the case of more complicated drawings and/or graphs, this becomes very difficult. It is not known if this problem is in the NP-complete class or not, but, except for some special cases, no polynomial algorithm is known (Garey and Johnson 1987). We discuss this problem in the rest of the paper, but our algorithm is applicable to the general quadratic assignment problem.
2.1 Objective Function for Graph Matching. Graphs allow a simple mathematical description. Consider two graphs, such as the ones in Figure 1. They can be described by the connectivity matrices, $G_{AB}$ and $D_{pq}$, elements of which take binary values. We call one graph (G) the model, and the other (D), the data. If the nodes A and B are connected, then $G_{AB} = 1$, otherwise $G_{AB} = 0$. If the two graphs are isomorphic (the technical term for "identical"), then there exists a permutation matrix, $\eta_A^p$, such that
$$D_{pq} = \sum_{A,B} \eta_A^p \, G_{AB} \, \eta_B^q \qquad (2.1)$$
The objective of our optimization problem is to find the optimal match, $\eta^{*}$, such that relation 2.1 is optimally satisfied. If the two graphs are not identical then relation 2.1 cannot be exactly obeyed. The objective function (Kree and Zippelius 1988) that in both cases measures, in the LMS sense, the degree of the mismatch between the two graphs is given by
$$C[\eta] = \frac{1}{2} \sum_{p,q} \Bigl( D_{pq} - \sum_{A,B} \eta_A^p \, G_{AB} \, \eta_B^q \Bigr)^2 \qquad (2.2)$$
The $\eta_A^p$ are similar to the neural variables introduced by Hopfield and Tank. We call them "events" in the following, to emphasize their probabilistic interpretation, which will be developed. If node A (A = 1, 2, ..., M), on the model side, is matched to node p (p = 1, 2, ..., N = M), on the data side, then the event $\eta_A^p$ is true, that is, $\eta_A^p = 1$, and all the other incompatible events are false, $\eta_B^p = \eta_A^q = 0$ ($B \neq A$, $q \neq p$). These constraints are conveniently expressed as
[Equations 2.3 and 2.4 are missing in the source.]
Adding the penalties for violation of the constraints 2.3-2.4, our objective function 2.2 can be written as follows:
$$C = \frac{1}{2} \sum_{A,B,p,q} G_{AB} \, (1 - 2 D_{pq}) \, \eta_A^p \eta_B^q + \cdots + \sum_{A,p,q} \alpha_{pq} \, \eta_A^p \eta_A^q + \sum_{A,B,p} \gamma_{AB} \, \eta_A^p \eta_B^p \qquad (2.5)$$
[The ellipsis stands for the single-index penalty terms, with coefficients $\alpha_p$ and $\gamma_A$, which are illegible in the source.]
where $\alpha_p$, $\gamma_A$, $\alpha_{pq}$ ($\alpha_{pp} = 0$), and $\gamma_{AB}$ ($\gamma_{AA} = 0$) are arbitrary parameters, penalties for violation of the constraints 2.3-2.4. The active part of C is the first term, since it expresses the matching objective of the problem. When the two nodes on the data side, p and q, are connected ($D_{pq} = 1$), then C will be decreased ($\delta C = -1/2$) if the two connected nodes, A and B ($G_{AB} = 1$) on the model side, are matched to them. If, on the other hand, p and q are not connected ($D_{pq} = 0$), then there is a penalty ($\delta C = +1/2$) if the two connected nodes, A and B, are matched to them. The rest of the terms in 2.5 are penalties. We can reduce the large number of arbitrary parameters in 2.5 by choosing $\gamma_A = \gamma$ and $\alpha_p = \alpha$, the least biased choice since all the nodes are equal. Likewise, the problem of choosing the two-index penalty parameters ( [the remainder of this sentence is missing in the source]
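The quantities of Section 2.1 are easy to check numerically. The following Python sketch is an illustration only, not the author's code. It assumes NumPy, and it assumes that the least-squares mismatch has the form $\frac{1}{2}\sum_{p,q}(D_{pq} - \sum_{A,B}\eta_A^p G_{AB}\eta_B^q)^2$; the function names and the 4-node example graph are hypothetical.

```python
import numpy as np

def mismatch_cost(G, D, eta):
    """Assumed least-squares mismatch between the data graph D and the model
    graph G relabeled by eta (eta[A, p] = 1 if model node A is matched to
    data node p)."""
    mapped = eta.T @ G @ eta  # entry [p, q] = sum_{A,B} eta[A,p] G[A,B] eta[B,q]
    return 0.5 * np.sum((D - mapped) ** 2)

def matching_term(G, D, eta):
    """The "active" first term of the penalized objective:
    1/2 * sum_{A,B,p,q} G[A,B] * (1 - 2 D[p,q]) * eta[A,p] * eta[B,q]."""
    return 0.5 * np.einsum("ab,pq,ap,bq->", G, 1 - 2 * D, eta, eta)

# Model graph: a path on 4 nodes (0-1-2-3).
G = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# Data graph: the same path with relabeled nodes (model node A -> perm[A]).
perm = np.array([2, 0, 3, 1])
eta_true = np.eye(4, dtype=int)[perm]   # eta_true[A, perm[A]] = 1
D = eta_true.T @ G @ eta_true

print(mismatch_cost(G, D, eta_true))    # zero for the true relabeling
print(matching_term(G, D, eta_true))    # -1/2 per ordered pair of matched links
```

For any permutation matrix the two functions differ only by the constant $\frac{1}{2}\sum_{p,q} D_{pq}^2$, which is why minimizing the first term of the penalized objective is equivalent to minimizing the mismatch over valid matches.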
2.2 The Effective Cost Theory as an Alternative to Stochastic Annealing. The probabilistic approach to optimization is based on an analogy between optimization and statistical mechanics (Kirkpatrick et al. 1983; Cerny 1982). Given the objective function, one introduces the probabilistic formulation by postulating a certain probability distribution over the space of configurations - in our case, the space of matching events. A convenient choice, familiar from physics, is the Boltzmann distribution, $P_\beta[\eta] = (1/Z_\beta)\, e^{-\beta C[\eta]}$. The central object one wants to evaluate is the partition function,
[Equation 2.6, defining the partition function $Z_\beta[H]$, is missing in the source.]
where $H_A^p$ is some matrix of external "fields." Once the partition function is evaluated - typically using some approximations - all the relevant statistical information about the system can be deduced from it. For example, differentiating the generating functional, $-(1/\beta) \ln Z_\beta[H]$, with respect to the external fields about $H = 0$, one generates all correlation functions of the $P_\beta$ distribution. At zero temperature ($\beta \to \infty$) and with no external fields present ($H = 0$), the partition function is dominated by the term of smallest cost, $Z_\beta \approx \exp(-\beta C[\eta^{*}])$. This has an important implication for optimization: as the temperature is decreased, the expectation value of the matching matrix, $\langle \eta \rangle_\beta$, converges toward the optimal match ($\eta^{*}$) between the two graphs, that is,
$$\lim_{\beta \to \infty} \langle \eta \rangle_\beta = \eta^{*} \qquad (2.7)$$
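The low-temperature limit can be illustrated by brute force on a graph small enough to enumerate every match. The Python sketch below is a hypothetical illustration, not part of the original work. It assumes NumPy, restricts the sum to permutation matrices, uses the least-squares mismatch as the cost, and uses a 6-node example graph chosen to have no nontrivial symmetry so that the optimal match is unique; all names are illustrative.

```python
import itertools
import numpy as np

def thermal_average(G, D, beta):
    """Boltzmann average <eta> over all n! permutation matrices, with
    weight exp(-beta * cost) and cost = 1/2 * sum (D - eta^T G eta)^2."""
    n = len(G)
    perms = list(itertools.permutations(range(n)))
    etas = np.array([np.eye(n)[list(p)] for p in perms])    # shape (n!, n, n)
    mapped = np.einsum("kap,ab,kbq->kpq", etas, G, etas)
    costs = 0.5 * np.sum((D - mapped) ** 2, axis=(1, 2))
    weights = np.exp(-beta * (costs - costs.min()))         # shift for stability
    weights /= weights.sum()
    return np.einsum("k,kap->ap", weights, etas)

# A 6-node graph with no nontrivial automorphism: a path 0-1-2-3-4-5
# plus the chord 1-3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
G = np.zeros((6, 6), dtype=int)
for a, b in edges:
    G[a, b] = G[b, a] = 1

eta_star = np.eye(6, dtype=int)[[3, 5, 0, 2, 1, 4]]  # the hidden relabeling
D = eta_star.T @ G @ eta_star

m_hot = thermal_average(G, D, beta=0.1)    # high temperature: diffuse
m_cold = thermal_average(G, D, beta=10.0)  # low temperature: sharply peaked
print(np.round(m_hot, 2))
print(np.round(m_cold, 2))
```

At high temperature every entry of the average stays well below one, while at low temperature the average is numerically indistinguishable from the hidden relabeling. If the graph had symmetries, the average would instead converge to the mean of all optimal matches. The factorial cost of the enumeration is what motivates the effective cost approach.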
This suggests that instead of minimizing the cost function C directly, one could attempt to calculate the thermal averages such as $\langle \eta \rangle$ and $\langle C[\eta] \rangle$ at a given temperature, and follow how these averages change as one decreases the temperature toward zero (Kirkpatrick et al. 1983; Cerny 1982). It is quite interesting that the expectation values $\langle \eta \rangle$ can be obtained by minimizing an effective cost function, known in physics as the thermodynamic free energy. It is defined as the Legendre transformation of the generating functional ($-(1/\beta) \ln Z_\beta[H]$):
[Equation 2.8 is missing in the source.]
where $m_A^p$ is the expectation value, that is, the probability of the event $\eta_A^p$ in the presence of some external fields (H),
[Equation 2.9 is missing in the source.]
Differentiating $C^{\mathrm{eff}}$ with respect to $m$, one obtains
[Equation 2.10 is missing in the source.]
where the last equality follows from the definition of $m$. We see that at $H = 0$ the $C^{\mathrm{eff}}[m]$ has a local extremum at $m$ equal to the expectation value of $\eta$ defined in 2.7. This suggests that instead of calculating $\langle \eta \rangle_\beta$ numerically, one could attempt to evaluate analytically the effective cost function - note that it depends on the temperature - and then, at successively decreasing values of temperature, minimize it using differential equations. These deterministic equations describe relaxation of probabilities and stochastic averages as a function of the temperature. They provide an effective deterministic description of the system, completely equivalent, in principle, to stochastic descriptions obtained by Monte-Carlo simulations.¹ Solving them at successively decreasing temperatures - a process sometimes called "mean-field" annealing - is a way to model simulated annealing within a deterministic framework, and it is equivalent to stochastic annealing.
¹ In practice, one has to use approximations in deriving the effective cost function and to this extent the effective deterministic equations will only approximate the stochastic description obtained by Monte-Carlo simulations.
From the point of view of combinatorial optimization, it is interesting to note that $C_\beta^{\mathrm{eff}}$ is a function of continuous variables ($m$), and minimizing it is a continuous optimization problem. In general, $C_\beta^{\mathrm{eff}} = C - (1/\beta) S$, where S is the entropy of the system. The functional form of the entropy function will depend on what states are available to the system, that is, what states we sum over in evaluating the partition function in 2.6. It follows that, depending on whether we enforce all, some, or none of the constraints in summing over configurations of our optimization problem, we will get different functional forms of $C_\beta^{\mathrm{eff}}$. It is known (Simić 1990) that in the semiclassical approximation, none leads in general to the neural networks of Hopfield and Tank (1985), and some, in the case of the TSP, leads to the elastic net algorithm of Durbin and Willshaw (1987).
2.3 The Effective Cost Function for Quadratic Assignment. To generalize the elastic net algorithm of Durbin and Willshaw (1987), along the lines proposed in Simić (1990), we should construct the entropy function enforcing a significant number of constraints 2.3-2.4 when summing over the configurations in 2.6. If we assume that the graph from the data side may have in general more nodes than the model graph, it makes sense to choose constraints 2.4 (nondiagonal part) and the diagonal part of 2.3 to be enforced, while enforcing softly, with the help of the penalty terms, the remaining constraints. These are the same constraints as in our formulation of the TSP, and many of the formulas we need to generalize the elastic net to graph matching have already been derived there, and then specialized to the Euclidean TSP. In the following, we just outline the basic elements of this derivation, which holds for the general quadratic assignment problem with an arbitrary matrix $J_{AB}^{pq}$; then, we specialize our results to the graph matching problem, with $J_{AB}^{pq}$ according to equation 2.5, and discuss the temperature dependence of the penalty terms - an important ingredient of the elastic net algorithm. For reasons that will become clear shortly, we will consider the following, regularized partition function:
[Equation 2.11, defining the regularized partition function $Z_\beta^{(\lambda)}[H]$, is illegible in the source.]
The third term in the exponent is a trivial configuration-independent constant, and its effect is just to multiply $Z_\beta[H]$, defined in 2.6, by a constant factor. This clearly does not affect the $\eta$-correlation functions - they are derivatives of $\ln Z$ and are invariant under the rescaling of the partition function. This means that we can evaluate the effective cost function and other results using $Z_\beta^{(\lambda)}[H]$, and at the end set $\lambda = 0$ to obtain $Z_\beta[H]$ ($= Z_\beta^{(0)}[H]$), and other final expressions independent of $\lambda$. Dropping the third and the last term from the objective function in 2.5 - these are identically zero when the subset of constraints is obeyed - and introducing the "potentials" $\phi_A^p$ dual to the neural variables ($\eta_A^p$), we can rewrite the partition function 2.11 as
[Equation 2.12 is illegible in the source.]
where the integration measure over $\phi$ is the scale invariant measure defined in Simić (1990), and the regularization parameter $\lambda$ is chosen such that the inverted matrix in the exponent has all the eigenvalues positive.² In the special case of the graph matching problem, we find from equation 2.5,
[Equation 2.13, giving $J_{AB}^{pq}$ for graph matching, is illegible in the source apart from a factor $(1 - \delta_{pq})$.]
² This choice is convenient and leads to cleaner looking mathematical expressions, but is not essential, because even in the $\lambda = 0$ case there is no real divergence in 2.11: the exponential divergence of the integrand is matched with the same divergence in the denominator of the measure. This is in agreement with the previously discussed fact that for any $\lambda$, $Z_\beta^{(\lambda)}[H] = Z_\beta[H] \exp[\lambda (\beta/2) M N]$.
Since the dependence on event variables is linear in the exponent of 2.12, we can sum over all $\eta$-configurations that obey 2.4 and the diagonal part of 2.3. Then, expanding the integral in 2.12 around its saddle point (this expansion is induced by letting $\phi = \phi^{cl} + (1/\beta)\,\delta\phi$), and neglecting the terms quadratic and higher in fluctuations, we obtain the following approximate expression for the partition function³:
[Equation 2.14 is missing in the source.]
Differentiating $\ln Z_\beta^{(\lambda)}[H]$ with respect to $H$, and setting $\lambda = 0$, one finds the following semiclassical relationship between $m$ and $\tilde\phi = \phi^{cl} + H$:
[Equation 2.15 is missing in the source.]
This relationship was written up previously in Peterson and Soderberg (1989) and also in Simić (1990). On the other hand, from the saddle point equation $\delta E / \delta \phi = 0$ obeyed by the semiclassical field $\phi^{cl}$, and equation 2.15, one obtains,
$$\phi^{cl} = J \, m(\tilde\phi) \qquad (2.16)$$
which justifies the name "potentials" for the $\phi$-variables. This expression is the mean-field version of the generally valid relationship, $\langle \phi \rangle = J m$, which can be proved from 2.12, and it shows the meaning of the semiclassical field $\phi^{cl}$. We now have all elements to derive the semiclassical approximation of the effective cost function. Using $E[\phi^{cl}]$ and the defining relation 2.8, we find (at $\lambda = 0$):
[Equation 2.17 is missing in the source.]
where we have used equation 2.16 and the previously defined $\tilde\phi = \phi^{cl} + H$. Since $\tilde\phi$ is an implicit function of $m$, one should consider $C^{\mathrm{eff}}$ to be a function of $m$; note that $H$ can be obtained from $C^{\mathrm{eff}}[m]$ using equation 2.10.
³ This is the well-known semiclassical (or "naive") mean-field approximation. It is the first term in a systematic expansion of the partition function based on order-by-order inclusion of the effects of thermal fluctuations, and it is formally a power-series expansion. Unfortunately, for spin-glass and similar optimization problems this approximation does not become exact in the thermodynamic (large N) limit (Thouless et al. 1977). The exact result is obtained (in the spin-glass case) by including the next term (quadratic in fluctuations). It is not clear, in general, how important these corrections are from the neural computation point of view, where the degree of improvement in the quality of approximate solutions should be measured relative to the computational complexity of the "corrected" algorithms.
The $C^{\mathrm{eff}}$ has the expected $C - TS$ form. The first term is just the original cost function 2.5, while the term in the brackets is the entropy function of the constrained net.⁴ One should note the second term in the brackets. Its sum-log-sum form is characteristic of the elastic net (Durbin and Willshaw 1987). In the elastic net case (originally applied to the TSP), a similar term was shown to be the consequence of enforcing strongly the constraints that each city be visited by one and only one node of the discretized elastic string (Simić 1990).⁵ Here, it is a consequence of enforcing strongly the constraint that each data node should be matched to one and only one model node. One could follow the analogy with Hopfield's equations (Hopfield 1984), and minimize $C_\beta^{\mathrm{eff}}$ by postulating the relaxation equations for the $\tilde\phi$-potentials, $\dot{\tilde\phi} = -\tilde\phi + J m(\tilde\phi)$, with $m_A^p(\tilde\phi)$ given by relation 2.15. Instead, we will write down the relaxation equations directly for $m_A^p(t)$.
[Equation 2.18 is missing in the source.]
It can be shown that this time evolution converges toward the minima of the effective cost function 2.17, that is, $m(t) \to \langle \eta \rangle$ as $t \to \infty$. To prove this, consider how $C^{\mathrm{eff}}$ changes with time under the time evolution specified by 2.18. Using the update equation 2.18, and the $m(\tilde\phi)$ relationship 2.15, we find,
[Equation 2.19 is missing in the source.]
where $z_p$, $y_p$, and $c_p^{AB}$ are manifestly positive expressions.⁶ Since the constrained net entropy function, given by the expression in the brackets of equation 2.17, is bounded - it is minimum for configurations obeying the constraints 2.3-2.4, and it is close to its maximal value $\propto N \ln N$ when the $\tilde\phi$ are all the same - equation 2.19 shows that the time evolution defined by 2.18 is converging toward the minima of $C^{\mathrm{eff}}$.
⁴ Had we enforced all the constraints softly (by penalty terms in 2.5), and summed over all possible configurations of the binary event variables, we would obtain the familiar neural network energy function (Hopfield 1984), and the bracket term could be reduced to the more familiar, information-theoretical entropy form (Simić 1990).
⁵ In the TSP case, one could reabsorb the neural variables (ms) and express this term in the original (Durbin and Willshaw 1987) form. This is not possible in the case of the general quadratic assignment problem, where J does not necessarily have some (low-dimensional) geometric interpretation. It is the form of the effective cost function, which "knows" about global constraints, that is in our view the essential ingredient of the elastic net algorithm, and not the ability to further eliminate the neural variables (ms), which rests on the presence of some underlying, low-dimensional geometric interpretation.
⁶ [The definitions of $z_p$, $y_p$, and $c_p^{AB}$ are illegible in the source.]
It is useful to summarize what has been achieved in this section. We have formulated the quadratic assignment problem probabilistically, using the partition function formalism. The allowed configurations obey the constraints 2.3-2.4, and we have enforced half of these constraints by summing only over the configurations that obey them. This was possible with the help of the $\phi$ fields, which are related to the matching probabilities ($m_A^p = \langle \eta_A^p \rangle$) by equations 2.15 and 2.16, and lead to an effective cost function 2.14 depending on $\phi$s and some external fields (H). Then by the Legendre transformation, we have eliminated the explicit H-dependence, and have constructed the effective cost function 2.17. Its minima correspond to the most probable matching events, and 2.18 is a simple relaxation formula without any $\phi$s that minimizes it. In the following, we will refer to the relaxation equations 2.18 as the constrained net equations. They are applicable to the most general quadratic assignment problem, which is characterized by an arbitrary symmetric matrix of couplings, $J_{AB}^{pq}$, and the matching constraints given by equations 2.3-2.4.
2.4 Graph Matching Algorithm and Simulation Results. The final step in generalizing the elastic net algorithm is in choosing the temperature dependence of the penalty parameters associated with the softly enforced constraints. Specializing here to the graph matching problem, with J given by equation 2.13, and after some experimentation, we decided on the following T-dependence of the penalty parameters: $\alpha(T) = 2\beta$ and $\gamma(T) = 0.1\beta$.
This purely phenomenological choice guarantees that at high T there will be a small penalty for violation of the constraints (2.4, diagonal part) and (2.3, nondiagonal part), but as the optimization temperature decreases toward zero, these penalties gradually increase, eventually becoming infinite, and all the constraints become exactly satisfied - an important characteristic of the elastic net method. The constrained net algorithm works as follows. One starts at some sufficiently high temperature - in our experiments $T \approx 0.2$-$0.3$ was a good starting point - from a random configuration $m(0)$, uniformly distributed within a small neighborhood around $1/N$, and then iterates equations 2.18 until equilibrium sets in; we take about 10-20 steps. Then, one decreases the temperature, typically by 1-2%, and continues with iterations for another 10-20 steps. This process is repeated until the network converges to its final low-temperature equilibrium state in which all the constraints 2.3-2.4 are satisfied.
The algorithm was applied to both structured (hand-designed) graphs, such as the one in Figure 1, and random graphs with up to 100 nodes. In all the graphs we have been studying, the constrained net was capable of solving the problem exactly in a significant fraction of the total number of trials. In general, either the optimal solution is discovered very quickly, say in the first 50-100 steps, or the algorithm runs for many iterations (a few thousand) before converging to a local minimum, often far from the optimal solution. Figure 1 presents two isomorphic graphs, typical of the hand-designed graphs we have studied. The probability of a successful match was 0.39, and convergence was achieved, in such cases, in fewer than 60 iteration steps. We observed that part of the strategy our algorithm is using is based on searching for the notable nodes, that is, nodes with the highest or the lowest connectivity, and matching them first; if this is done successfully, the network rapidly converges to the final, optimal match state.
In another set of experiments we generated an ensemble of graphs with a given number of nodes ($N$) and some expected connectivity ($p$). For example, in the experiments with $N = 50$ and $p = 0.1$ we generate 512 different (pairs of isomorphic) graphs, by randomly generating links between the nodes with probability 0.1 (the expected number of links per node is in this case 5, but the actual number is randomly distributed about this value). For each pair of graphs, we run the algorithm once and record whether the algorithm has found the optimal solution or not. At the end, we count how many times the exact solution has been found, and divide this number by the total number of runs. This gives us a probabilistic estimate, $P(N, p)$, of how successful, on the average, our algorithm is on randomly generated $(N, p)$ graphs. Figure 2 summarizes our results for N = 20, 30, 40, 50, and 75; for each N, p takes a number of values between 0.05 and 0.60.
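The random-graph experiment described above can be set up along the following lines. This Python sketch is an assumed reconstruction of the protocol, not the original experimental code. It assumes NumPy; the function names are hypothetical, and `solver` stands for any routine that takes the two connectivity matrices and returns a candidate 0/1 matching matrix.

```python
import numpy as np

def random_isomorphic_pair(n, p, rng):
    """Draw a random graph G (each link present with probability p) and an
    isomorphic copy D obtained by randomly relabeling the nodes."""
    upper = np.triu(rng.random((n, n)) < p, k=1)   # independent links, no self-loops
    G = (upper | upper.T).astype(int)              # symmetric connectivity matrix
    perm = rng.permutation(n)
    eta = np.eye(n, dtype=int)[perm]               # eta[A, perm[A]] = 1
    D = eta.T @ G @ eta
    return G, D, eta

def is_exact_match(G, D, eta):
    """True if eta is a permutation matrix that maps G exactly onto D."""
    eta = np.asarray(eta)
    is_perm = (np.isin(eta, (0, 1)).all()
               and (eta.sum(axis=0) == 1).all()
               and (eta.sum(axis=1) == 1).all())
    return bool(is_perm and np.array_equal(eta.T @ G @ eta, D))

def success_rate(solver, n, p, trials=512, seed=0):
    """Fraction of random isomorphic pairs that `solver` matches exactly,
    running the solver once per pair."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        G, D, _ = random_isomorphic_pair(n, p, rng)
        hits += is_exact_match(G, D, solver(G, D))
    return hits / trials

G, D, eta = random_isomorphic_pair(50, 0.1, np.random.default_rng(1))
print(G.sum(axis=1).mean())   # mean links per node, close to n * p = 5
```

Note that `is_exact_match` accepts any permutation that reproduces the data graph, not only the hidden relabeling, so matches that differ from it by a symmetry of the graph still count as successes.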
Notice the sigmoid-shaped curve indicating essentially 100% success on any size graph with connectivity within the interval 0.4-0.6, and a decreasing rate of success (but always significantly higher than zero) as the connectivity becomes smaller (or larger). An almost 100% matching success rate is somewhat surprising. It can be partially understood by noticing that the dispersion of the connectivity is proportional to $Np(1 - p)$ and takes the largest value around $p = 0.5$. This means that the distribution of the connectivities is rather broad in this case, with a few nodes on the extreme ends of its tail having widely different connectivities. An interesting matching heuristic would be to try to match these nodes first. As we have already mentioned, this appears to be the strategy adopted by our network; typically after just 20 to 30 iterations, the notable nodes are identified and matched; all the remaining nodes are then rapidly matched and the algorithm converges.
This strategy appears very useful in matching similar graphs. We find that if we start with two isomorphic graphs with expected connectivity around 0.5, and distort one of them by randomly deleting up to 10% of its links, there is no significant degradation in matching performance - the success rate is close to 1. As the distortion becomes larger, at some point locally stable states appear (corresponding to the wrong match), and the probability of success decreases rapidly. Basically, we believe that as long as the connectivity of the notable nodes is not considerably decreased (in the case of the high connectivity nodes) or increased (in the case of the low connectivity nodes) relative to the other nodes, our algorithm successfully converges toward the correct match. This appears to be the case in many of the examples we studied.
Figure 2: Probability that the constrained net finds the correct match between the two isomorphic graphs with N = 30, 40, 50, and 75 nodes, versus the expectation value of the (random) connectivity per node. (Expected connectivity normalized to one.)
3 Concluding Remarks
The essential ingredients for generalizing the elastic net algorithm to a broad range of constrained optimization problems are (Simić 1990): (1) the probabilistic formulation (formulation based on statistical mechanics), and (2) the explicit enforcement of a significant part of the global constraints when evaluating the effective cost function. In this work we have applied these ideas to the quadratic assignment problem. Its important special case, the graph matching problem, is studied in some detail. Our algorithm appears to be very efficient for matching isomorphic graphs as well as for matching similar graphs (in this case one graph is a simplified version of the other in that some fraction of its links are deleted). It spontaneously develops an interesting matching strategy, based on looking for the most notable nodes and matching them first. This strategy seems to be the type of approach that would be useful to humans, and that they might take, in solving similar model matching problems. Many problems in pattern recognition and computer vision, such as classification and clustering and model and graph matching, can be formulated as assignment problems quadratic in binary variables. The constrained neural networks should be applied to these problems since they are a "natural" and computationally efficient generalization of both the elastic net algorithm of Durbin and Willshaw and the Hopfield and Tank neural networks. One could simply add the constrained net entropy to the original cost function, and choose the penalties for softly enforced constraints to be increasing functions of the inverse temperature. Then the "neural style" relaxation equations 2.18, or any other iterative method, can be used at a sequence of successively decreasing temperatures to find the locally optimal configurations.
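The recipe in the preceding paragraph can be sketched in Python as a generic mean-field annealing loop. This is a minimal sketch under stated assumptions, not the relaxation equations 2.18 themselves. It assumes NumPy; it enforces "one model node per data node" strongly through a softmax over each column of $m$; it uses assumed quadratic forms for the two soft penalties; and it uses a damped fixed-point update. The function name, the damping parameter `step`, and the stopping temperature `T_min` are hypothetical. Only the starting temperature, the 1-2% cooling, the 10-20 steps per temperature, and the schedules $\alpha = 2\beta$, $\gamma = 0.1\beta$ are taken from the text.

```python
import numpy as np

def constrained_net_match(G, D, T0=0.25, T_min=0.02, cooling=0.98,
                          steps_per_T=15, step=0.5, seed=0):
    """Assumed mean-field annealing sketch; m[A, p] approximates the
    probability that model node A is matched to data node p."""
    rng = np.random.default_rng(seed)
    M, N = len(G), len(D)
    # Start in a small neighborhood of the uniform configuration 1/N.
    m = (1.0 / N) * (1.0 + 0.01 * rng.uniform(-1.0, 1.0, size=(M, N)))
    m /= m.sum(axis=0, keepdims=True)
    W = 1 - 2 * D
    T = T0
    while T > T_min:
        beta = 1.0 / T
        alpha, gamma = 2.0 * beta, 0.1 * beta        # penalties grow as T falls
        for _ in range(steps_per_T):
            row = m.sum(axis=1, keepdims=True)       # total mass on each model node
            # Potential = gradient of the assumed cost: matching term,
            # plus soft penalties keeping each model node on one data node.
            phi = G @ m @ W + 2 * alpha * (row - m) + 2 * gamma * (row - 1.0)
            z = -beta * phi
            z -= z.max(axis=0, keepdims=True)        # numerically stable softmax
            target = np.exp(z)
            target /= target.sum(axis=0, keepdims=True)
            m = (1 - step) * m + step * target       # damped update
        T *= cooling
    return m

# Hypothetical 6-node example: a path with one chord, and a relabeled copy.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
G = np.zeros((6, 6), dtype=int)
for a, b in edges:
    G[a, b] = G[b, a] = 1
eta = np.eye(6, dtype=int)[[3, 5, 0, 2, 1, 4]]
D = eta.T @ G @ eta

m = constrained_net_match(G, D)
print(np.round(m, 2))
```

Each column of the returned matrix sums to one by construction, while the row constraints are only encouraged by the growing penalties. No claim is made that this sketch reproduces the success rates reported above; it only shows the structure of annealing an effective cost at successively lower temperatures.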
Acknowledgments
I thank J. Hopfield and S. Judd for useful conversations. The indirect inspiration for this paper came also from some conversations with D. Willshaw. This work was supported in part by the DOE under Grants DE-AC03-81ER40050 and DE-FG03-85ER25009, the Office of the Program Manager of the Joint Tactical Fusion Office, and the NSF under Grant EET-8700064.

References
Cerny, V. 1982. A thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Inst. Phys. & Biophys., preprint, Comenius Univ., Bratislava.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the traveling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Garey, M. R., and Johnson, D. S. 1987. Computers and Intractability. W. H. Freeman, New York.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671.
Kree, R., and Zippelius, A. 1988. Recognition of topological features of graphs and images in neural networks. J. Phys. A: Math. Gen. 21, L813-L818.
Mjolsness, E., Gindi, G., and Anandan, P. 1989. Optimization in model matching and perceptual organization. Neural Comp. 1, 218-229.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3-22.
Simić, P. D. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Network 1, 89-104.
Thouless, D. J., Anderson, P. W., and Palmer, R. G. 1977. Solution of 'solvable model of a spin glass'. Phil. Mag. 35, 593.
von der Malsburg, C. 1988. Pattern recognition by labeled graph matching. Neural Networks 1, 141-148.
von der Malsburg, C., and Bienenstock, E. 1986. Statistical Coding and Short-Term Synaptic Plasticity: A Scheme for Knowledge Representation in the Brain, pp. 247-252. Springer-Verlag, Berlin.
Willshaw, D. 1990. Personal communication.
Yuille, A. 1990. Generalized deformable templates, statistical physics and matching problems. Neural Comp. 2, 1-24.
Received 2 August 1990; accepted 24 January 1991.
Communicated by David Touretzky
Symmetric Neural Networks and Propositional Logic Satisfiability
Gadi Pinkas
Department of Computer Science, Washington University, St. Louis, MO 63230 USA
Connectionist networks with symmetric weights (like Hopfield networks and Boltzmann Machines) use gradient descent to find a minimum for quadratic energy functions. We show an equivalence between the problem of satisfiability in propositional calculus and the problem of minimizing those energy functions. The equivalence is in the sense that for any satisfiable well formed formula (WFF) we can find a quadratic function that describes it, such that the set of solutions that minimizes the function is equal to the set of truth assignments that satisfy the WFF. We also show that in the same sense every quadratic energy function describes some satisfiable WFF. Algorithms are given to transform any propositional WFF into an energy function that describes it and vice versa. High-order models that use sigma-pi units are shown to be equivalent to the standard quadratic models with additional hidden units. An algorithm to convert high-order networks to low-order ones is used to implement a satisfiability problem-solver on a connectionist network. The results give better understanding of the role of hidden units and of the limitations and capabilities of symmetric connectionist models. The techniques developed for the satisfiability problem may be applied to a wide range of other problems, such as associative memories, finding maximal consistent subsets, automatic deduction, and even nonmonotonic reasoning.
1 Introduction
The problem of satisfiability is deciding whether a truth assignment for the variables of a given propositional WFF exists, so that the formula is evaluated to be true. In many cases the assignment needs also to be found. It is well known that any of the problems in NP can be reduced to the satisfiability problem and that satisfiability is NP-complete.

Neural Computation 3, 282-291 (1991) © 1991 Massachusetts Institute of Technology
Apart from theoretical importance, satisfiability has direct applications. A satisfiability problem-solver may be used, for example, in an inference engine and for solving other hard problems that have been reduced to satisfiability over the years.

In this paper we show an equivalence between the satisfiability search problem and the problem of connectionist energy minimization. For every WFF we can find a quadratic energy function such that the values of the variables of the function at the minimum can be translated into a truth assignment that satisfies the original WFF and vice versa. Also, any quadratic energy minimization problem may be described as a satisfiable WFF that is satisfied for the same assignments that minimize the function. More details and formal proofs can be found in Pinkas (1990).

Finding minima for quadratic functions is the essence of symmetric connectionist models (Hopfield 1982; Hinton and Sejnowski 1986; Hinton 1989). They are characterized by a recurrent network architecture, a symmetric weight matrix (with zero diagonal), and a quadratic energy function that should be minimized. Each unit asynchronously computes the gradient of the function and adjusts its activation value, so that energy decreases monotonically. The network eventually reaches equilibrium, settling on either a local or a global minimum. Hopfield and Tank (1985) demonstrated that certain complex optimization problems can be approximated by this kind of network, and some of the work done in connectionist reasoning and knowledge representation has used energy minimization models (Ballard 1986; Touretzky and Hinton 1988; Derthick 1988). There is a direct mapping between these models and quadratic energy functions, and most of the time we will not distinguish between the function and the network that minimizes it.
Thus, the equivalence between energy minimization and satisfiability means that everything that can be stated as satisfiability of some WFF and nothing else can also be expressed in symmetric models. The techniques described are used (in this paper) for the direct implementation of a satisfiability problem solver on connectionist networks; however they may also be applied to automatic deduction, abduction, construction of arbitrary associative memories, and more.
2 Satisfiability and Models of Propositional Formulas
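A WFF is an expression that combines atomic propositions (variables) and connectives ($\lor$, $\land$, $\lnot$, $\rightarrow$, $\leftrightarrow$, and parentheses). A model (truth assignment) is a vector of binary values that assigns 1 ("true") or 0 ("false") to each of the variables. A WFF $\varphi$ is satisfied by a model $x$ if its characteristic function $H_\varphi$ evaluates to "one" given the vector $x$.

The characteristic function is defined to be $H_\varphi : 2^n \rightarrow \{0, 1\}$ such that

$$
\begin{aligned}
H_{x_i}(x_1, \ldots, x_n) &= x_i \\
H_{\lnot\varphi} &= 1 - H_{\varphi} \\
H_{\varphi_1 \lor \varphi_2} &= H_{\varphi_1} + H_{\varphi_2} - H_{\varphi_1} H_{\varphi_2} \\
H_{\varphi_1 \land \varphi_2} &= H_{\varphi_1} H_{\varphi_2} \\
H_{\varphi_1 \rightarrow \varphi_2} &= H_{\lnot\varphi_1 \lor \varphi_2} \\
H_{\varphi_1 \leftrightarrow \varphi_2} &= H_{(\varphi_1 \rightarrow \varphi_2) \land (\varphi_2 \rightarrow \varphi_1)}
\end{aligned}
$$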
The satisfiability search problem for a WFF $\varphi$ is to find an $x$ (if one exists) such that $H_\varphi(x) = 1$.
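The satisfiability search problem can be illustrated with a brute-force sketch. The nested-tuple encoding of formulas and the function names below are assumptions made for illustration only; the arithmetic follows the usual 0/1 reading of the connectives (negation as $1 - H$, conjunction as a product).

```python
from itertools import product

# Hypothetical encoding (not from the paper):
#   "A"                     an atomic proposition
#   ("not", f)              negation
#   ("and", f, g), ("or", f, g), ("imp", f, g), ("iff", f, g)

def H(formula, model):
    """Characteristic function: 1 if `model` satisfies `formula`, else 0."""
    if isinstance(formula, str):
        return model[formula]
    op, *args = formula
    if op == "not":
        return 1 - H(args[0], model)
    a, b = (H(f, model) for f in args)
    if op == "and":
        return a * b
    if op == "or":
        return a + b - a * b
    if op == "imp":                      # same as (not a) or b
        return 1 - a + a * b
    if op == "iff":                      # both true or both false
        return 1 - a - b + 2 * a * b
    raise ValueError(f"unknown connective: {op}")

def variables(formula):
    """Set of atomic propositions that occur in `formula`."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(variables(f) for f in formula[1:]))

def satisfying_models(formula):
    """Enumerate every model x with H(formula, x) == 1 (exponential search)."""
    names = sorted(variables(formula))
    for bits in product((0, 1), repeat=len(names)):
        model = dict(zip(names, bits))
        if H(formula, model) == 1:
            yield model

print(list(satisfying_models(("and", "A", ("not", "B")))))
```

The enumeration is exponential in the number of variables, which is exactly the cost that the energy minimization formulation tries to approximate with a network search.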
3 Equivalence between WFFs
We call the atomic propositions that are of interest for a certain application "visible variables" (denoted by $x$). We can add additional atomic propositions called "hidden variables" (denoted by $t$) without changing the set of relevant models that satisfy the WFF. The set of models that satisfy $\varphi$ projected onto the visible variables is then called "the visible satisfying models" ($\{x \mid (\exists t)\, H_\varphi(x, t) = 1\}$). Two WFFs are equivalent if the set of visible satisfying models of one is equal to the set of visible satisfying models of the other. A WFF $\varphi$ is in conjunction of triples form (CTF) if $\varphi = \bigwedge_{i=1}^{m} \varphi_i$ and every $\varphi_i$ is a subformula of at most three variables.¹ Every WFF can be converted into an equivalent WFF in CTF by adding hidden variables. Intuitively, we generate a new hidden variable for every binary connective (e.g., $\lor$, $\rightarrow$) except for the topmost one, and we "name" the binary logical operation with a new hidden variable using the connective ($\leftrightarrow$).
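Example 3.1. Converting $\varphi = ((\lnot((\lnot A) \land B)) \rightarrow ((\lnot C) \rightarrow D))$ into CTF: From $(\lnot((\lnot A) \land B))$ we generate $((\lnot((\lnot A) \land B)) \leftrightarrow T_1)$ by adding a new hidden variable $T_1$, from $((\lnot C) \rightarrow D)$ we generate $(((\lnot C) \rightarrow D) \leftrightarrow T_2)$ by adding a new hidden variable $T_2$, and for the topmost connective ($\rightarrow$) we generate $(T_1 \rightarrow T_2)$. The conjunction of these subformulas is $((\lnot((\lnot A) \land B)) \leftrightarrow T_1) \land (((\lnot C) \rightarrow D) \leftrightarrow T_2) \land (T_1 \rightarrow T_2)$. It is in CTF and is equivalent to $\varphi$.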
¹CTF differs from the familiar conjunctive normal form (CNF). The $\varphi_i$s are WFFs of up to 3 variables that may include any logical connective and are not necessarily a disjunction of literals as in CNF. To put a bidirectional CTF clause into CNF we would have to generate two clauses, thus $(A \leftrightarrow B)$ becomes $(\lnot A \lor B) \land (A \lor \lnot B)$.
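The equivalence claimed in Example 3.1 can be checked by brute force. The sketch below is only an illustration (the function names are hypothetical): it compares the satisfying models of the original formula with the visible satisfying models of its CTF version, where the hidden variables T1 and T2 are existentially quantified.

```python
from itertools import product

def imp(p, q):
    """Material implication on Python booleans."""
    return (not p) or q

def phi(A, B, C, D):
    # ((not ((not A) and B)) -> ((not C) -> D))
    return imp(not ((not A) and B), imp(not C, D))

def phi_ctf(A, B, C, D, T1, T2):
    # Each conjunct mentions at most three variables.
    return (((not ((not A) and B)) == T1)
            and (imp(not C, D) == T2)
            and imp(T1, T2))

bools = (False, True)

# Models of the original WFF over the visible variables A, B, C, D.
visible = {x for x in product(bools, repeat=4) if phi(*x)}

# Visible satisfying models of the CTF version: some T1, T2 must exist.
visible_ctf = {
    x for x in product(bools, repeat=4)
    if any(phi_ctf(*x, t1, t2) for t1 in bools for t2 in bools)
}

print(visible == visible_ctf)
```

The two sets coincide because each hidden variable is forced to take the truth value of the subformula it names, so the final conjunct $(T_1 \rightarrow T_2)$ reproduces the topmost implication.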
4 Energy Functions
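A k-order energy function is a function $E : \{0,1\}^n \rightarrow \mathbb{R}$ that can be expressed in a sum of products form with product terms of up to $k$ variables:

$$E^k(x_1, \ldots, x_n) = \sum_{1 \le i_1 < \cdots < i_k \le n} w_{i_1 \ldots i_k}\, x_{i_1} \cdots x_{i_k} + \cdots + \sum_{1 \le i < j \le n} w_{ij}\, x_i x_j + \sum_{i \le n} w_i x_i + w$$

Quadratic energy functions are special cases:

$$\sum_{1 \le i < j \le n} w_{ij}\, x_i x_j + \sum_{i \le n} w_i x_i + w$$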
We can arbitrarily divide the variables of an energy function into two sets: visible variables and hidden variables. We call the set of minimizing vectors projected onto the visible variables "the visible solutions" of the minimization problem: $\mu(E) = \{x \mid (\exists t)\, E(x, t) = \min_{y,z} E(y, z)\}$. We can always translate back and forth (Hopfield 1982) between a quadratic energy function and a network with symmetric weights that minimizes it (see Fig. 1). Further, we can use high-order networks (Sejnowski 1986) to minimize high-order energy functions (see Fig. 2). In the extended model each node is assigned a sigma-pi unit that updates its activation value using

$$a_i = F\Bigl(\sum_{i_1, \ldots, i_k} -w_{i_1, \ldots, i_k} \prod_{1 \le j \le k,\; i_j \ne i} X_{i_j}\Bigr)$$

where the sum runs over the terms that contain $X_i$.
As in the quadratic case, there is a translation back and forth between k-order energy functions and symmetric high-order models with k-order sigma-pi units (see Fig. 2).
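The definitions of this section can be made concrete with a small sketch. The dictionary encoding and the function names are assumptions for illustration; the weights are those of the quadratic function given in the caption of Figure 1, with T as the hidden unit.

```python
from itertools import product
from math import prod

# Assumed encoding: {tuple of variable names: weight}; a 1-tuple is a linear term.
E_FIG1 = {
    ("N", "T"): -2, ("S", "T"): -2, ("W", "T"): -2, ("T",): 5,
    ("N", "S"): 1, ("R", "N"): 1, ("W", "N"): -1, ("W",): 1,
}

def energy(E, state):
    """Evaluate a sum-of-products energy function at a 0/1 state (a dict)."""
    return sum(w * prod(state[v] for v in term) for term, w in E.items())

def order(E):
    """Largest number of variables in any product term."""
    return max(len(term) for term in E)

def visible_solutions(E, visible, hidden):
    """mu(E): the minimizing states projected onto the visible variables."""
    names = list(visible) + list(hidden)
    states = [dict(zip(names, bits))
              for bits in product((0, 1), repeat=len(names))]
    e_min = min(energy(E, s) for s in states)
    return {tuple(s[v] for v in visible)
            for s in states if energy(E, s) == e_min}

solutions = visible_solutions(E_FIG1, visible=("N", "S", "W", "R"), hidden=("T",))
print(order(E_FIG1), len(solutions))
```

Exhaustive enumeration is used here only to make the definition of the visible solutions explicit; a network would instead search for the minimum by local updates.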
5 The Equivalence between High-Order Models and Low-Order Models
We call two energy functions equivalent if they have the same set of visible solutions. Any high-order energy function can be converted into an equivalent low-order one with additional hidden variables. In addition, any energy function with hidden variables can be converted into a (possibly) higher-order one by eliminating some or all of the hidden variables. These algorithms allow us to trade the computational power of sigma-pi units for additional simple units and vice versa.

Figure 1: A symmetric network that represents the function $E = -2NT - 2ST - 2WT + 5T + NS + RN - WN + W$, describing the WFF $(N \land S \rightarrow W) \land (R \rightarrow (\lnot N)) \land (N \lor (\lnot W))$. $T$ is a hidden unit.
• Any k-order term $\bigl(w \prod_{i=1}^{k} X_i\bigr)$ with NEGATIVE coefficient $w$ can be replaced by the quadratic terms $\sum_{i=1}^{k} 2wX_iT - (2k - 1)wT$, generating an equivalent energy function with one additional hidden variable $T$.

• Any k-order term $\bigl(w \prod_{i=1}^{k} X_i\bigr)$ with POSITIVE coefficient $w$ can be replaced by the terms $w \prod_{i=1}^{k-1} X_i - \bigl(\sum_{i=1}^{k-1} 2wX_iT\bigr) + 2wX_kT + (2k - 3)wT$, generating an equivalent energy function of order $k - 1$ with one additional hidden variable $T$.²
Example 5.1. The cubic function $E = -NSW + NS + RN - WN + W$ is equivalent to $-2NT - 2ST - 2WT + 5T + NS + RN - WN + W$ (introducing $T$). The corresponding high-order network appears in Figure 2 while the equivalent quadratic one appears in Figure 1.
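The two replacement rules of this section can be sketched in code and checked against Example 5.1. The dictionary encoding and all function names are hypothetical; the check relies on the fact that, once the hidden variable settles on its energy-minimizing value, the replacement terms contribute exactly the value of the original term.

```python
from itertools import product
from math import prod

def replace_term(variables, w, hidden):
    """Terms that replace w * prod(variables), using one new hidden variable."""
    k = len(variables)
    if w < 0:
        # Negative rule: sum_i 2w X_i T - (2k - 1) w T  (quadratic result).
        new = {(x, hidden): 2 * w for x in variables}
        new[(hidden,)] = -(2 * k - 1) * w
    else:
        # Positive rule: result has order k - 1.
        *first, last = variables
        new = {tuple(first): w}
        new.update({(x, hidden): -2 * w for x in first})
        new[(last, hidden)] = 2 * w
        new[(hidden,)] = (2 * k - 3) * w
    return new

def energy(E, state):
    """Evaluate a sum-of-products energy function at a 0/1 state (a dict)."""
    return sum(w * prod(state[v] for v in term) for term, w in E.items())

def min_over_hidden(E, state, hidden):
    """Minimum energy over the two values of the hidden variable."""
    return min(energy(E, {**state, hidden: t}) for t in (0, 1))

# Example 5.1: replace the cubic term -NSW and compare on all 16 visible states.
cubic = {("N", "S", "W"): -1, ("N", "S"): 1, ("R", "N"): 1,
         ("W", "N"): -1, ("W",): 1}
quadratic = {t: w for t, w in cubic.items() if len(t) < 3}
quadratic.update(replace_term(("N", "S", "W"), -1, "T"))

names = ("N", "S", "W", "R")
same = all(
    energy(cubic, s) == min_over_hidden(quadratic, s, "T")
    for s in (dict(zip(names, bits)) for bits in product((0, 1), repeat=4))
)
print(same)
```

For a positive term of order greater than three, the leading product that remains still has positive weight, so the rule would have to be applied repeatedly to reach a quadratic function.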
²A symmetric (but less efficient) transformation for the positive case also exists.

Figure 2: A cubic network that represents $E = -NSW + NS + RN - WN + W$ using sigma-pi units and a cubic hyperarc. (It is equivalent to the network of Figure 1 without hidden units.)

The symmetric transformation from low-order into high-order functions, by eliminating any subset of the variables, is also possible (of course we are interested in eliminating only hidden variables). To eliminate $T$, bring the energy function to the form $E = E' + \mathit{oldterm}$, where $\mathit{oldterm} = \bigl(\sum_{j=1}^{k} w_j \prod_{i=1}^{l_j} X_{j_i}\bigr) T$. Consider all assignments $S$ of the form $(x_{i_1}, \ldots, x_{i_l})$ for the variables $(X_{i_1}, \ldots, X_{i_l})$ in oldterm (not including $T$), such that

$$\beta_S = \sum_{j=1}^{k} w_j \prod_{i=1}^{l_j} x_{j_i} < 0.$$
Each negative $\beta_S$ represents an energy state of the variables $X_{i_1}, \ldots, X_{i_l}$ that pushes $T$ to become "one" and decreases the total energy by $|\beta_S|$. States with positive $\beta_S$ cause $T$ to become zero, do not reduce the total energy, and therefore can be ignored. Therefore, the only states that matter are those that reduce the energy, i.e., those for which $\beta_S$ is negative. Let

$$L^S_{X_{i_j}} = \begin{cases} X_{i_j} & \text{if } S(X_{i_j}) = 1 \\ (1 - X_{i_j}) & \text{if } S(X_{i_j}) = 0 \end{cases}$$

It is the expression "$X_{i_j}$" or "$(1 - X_{i_j})$" depending on whether the variable is assigned 1 or 0 in $S$.³

³$L^S_{X_{i_j}}$ is like a macro and is replaced by either "$X_{i_j}$" or "$1 - X_{i_j}$" once it is used in newterm.
The expression $\prod_{j=1}^{l} L^S_{X_{i_j}}$ therefore determines the state $S$, and the expression

$$\mathit{newterm} = \sum_{S \text{ such that } \beta_S < 0} \beta_S \prod_{j=1}^{l} L^S_{X_{i_j}}$$

represents the disjunction of all the states that cause a reduction in the total energy. The new function $E' + \mathit{newterm}$ is therefore equivalent to $E' + \mathit{oldterm}$ and does not include $T$. With this technique, we can convert any network with hidden units into an equivalent network without any such units.

6 Describing WFFs Using Penalty Functions
An energy function $E$ describes a WFF $\varphi$ if the set of visible satisfying models of $\varphi$ is equal to the set of visible solutions of the minimization of $E$.⁴ The penalty function $E_\varphi$ of a WFF $\varphi$ is an energy function $E_\varphi : \{0,1\}^n \rightarrow \mathbb{N}$ that penalizes subformulas of the WFF that are not satisfied. It computes the characteristic of the negation of every subformula $\varphi_i$ in the upper level of the WFF's conjunctive structure. If $\varphi = \bigwedge_{i=1}^{m} \varphi_i$ then

$$E_\varphi = \sum_{i=1}^{m} H_{\lnot\varphi_i} = \sum_{i=1}^{m} (1 - H_{\varphi_i})$$
If all the subformulas are satisfied, $E_\varphi$ gets the value zero; otherwise, the function computes how many are unsatisfied. It is easy to see that $\varphi$ is satisfied by $x$ iff $E_\varphi$ is minimized by $x$ (the global minima have a value of zero). Therefore, every satisfiable WFF $\varphi$ has a function $E_\varphi$ such that $E_\varphi$ describes $\varphi$.

Example 6.1.

$$
\begin{aligned}
E_{((N \land S) \rightarrow W) \land (R \rightarrow (\lnot N)) \land (N \lor (\lnot W))}
&= H_{\lnot((N \land S) \rightarrow W)} + H_{\lnot(R \rightarrow (\lnot N))} + H_{\lnot(N \lor (\lnot W))} \\
&= H_{N \land S \land (\lnot W)} + H_{R \land N} + H_{(\lnot N) \land W} \\
&= (NS(1 - W)) + (RN) + ((1 - N)W) \\
&= -NSW + NS + RN - WN + W
\end{aligned}
$$
The corresponding network appears in Figure 2.

⁴Note that it is only the minima (and not the details of the function's surface) that cause $E$ to describe $\varphi$.
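The derivation in Example 6.1 can be verified numerically. In this sketch (names are hypothetical) each top-level conjunct is written as its 0/1 characteristic function, the penalty counts the unsatisfied conjuncts, and the result is compared with the closed-form cubic on all sixteen assignments.

```python
from itertools import product

# Characteristic functions of the three top-level conjuncts of
# ((N and S) -> W) and (R -> (not N)) and (N or (not W)).
conjuncts = [
    lambda N, S, W, R: 1 - N * S * (1 - W),   # (N and S) -> W
    lambda N, S, W, R: 1 - R * N,             # R -> (not N)
    lambda N, S, W, R: 1 - (1 - N) * W,       # N or (not W)
]

def penalty(N, S, W, R):
    """Number of top-level conjuncts that the assignment leaves unsatisfied."""
    return sum(1 - h(N, S, W, R) for h in conjuncts)

def closed_form(N, S, W, R):
    """The cubic energy function obtained at the end of Example 6.1."""
    return -N * S * W + N * S + R * N - W * N + W

assignments = list(product((0, 1), repeat=4))
agree = all(penalty(*x) == closed_form(*x) for x in assignments)
zero_penalty = [x for x in assignments if penalty(*x) == 0]
print(agree, len(zero_penalty))
```

The assignments with zero penalty are exactly the satisfying models of the WFF, which is what it means for the penalty function to describe it.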
The following algorithm transforms a WFF $\varphi$ into a quadratic energy function that describes it, generating $O[\mathit{length}(\varphi)]$ hidden variables:

• Convert into CTF (Section 3).
• Convert CTF into a cubic energy function and simplify it to a sum of products form (Section 6).
• Convert cubic terms into quadratic terms. Each of the triples generates only one new variable (Section 5).
The algorithm generates a network whose size is linear in the number of binary connectives of the original WFF. The fan-out of the hidden units is bounded by a constant.

7 Every Energy Function Describes Some Satisfiable WFF

In Section 5 we saw that we can convert any energy function to contain no hidden variables. We show now that for any such function $E$ with no hidden variables there exists a satisfiable WFF $\varphi$ such that $E$ describes $\varphi$. The procedure is first to find the set $\mu(E)$ of minimum energy states (the vectors that minimize $E$). For each such state create an $n$-way conjunctive formula of the variables or their negations depending on whether the variable is assigned 1 or 0 in that state. Each such conjunction $\bigwedge_{i=1}^{n} L^S_{X_i}$, where

$$L^S_{X_i} = \begin{cases} X_i & \text{if } S(X_i) = 1 \\ (\lnot X_i) & \text{if } S(X_i) = 0 \end{cases}$$

represents a minimum energy state. Finally the WFF is constructed by taking the disjunction of all the conjunctions: $\varphi = \bigvee_{S \in \mu(E)} \bigl(\bigwedge_{i=1}^{n} L^S_{X_i}\bigr)$. The satisfying truth assignments of $\varphi$ correspond directly to the energy states of the net.

8 Conclusions
We have shown an equivalence between the search problem of satisfiability and the problem of minimizing connectionist energy functions. Only those problems that can be stated as satisfiability search problems (and every such problem) can be stated in symmetric neural networks. Any satisfiable WFF can be described by an n-order energy function with no hidden variables, or by a quadratic function with $O[\mathit{length}(\mathit{WFF})]$ hidden variables. Using the algorithms described we can efficiently determine the topology and the weights of a connectionist network that represents and approximates⁵ a given satisfiability problem.

⁵The networks are not always capable of escaping from local minima.
Several other applications may benefit from the techniques developed here. Two are associative memory and finding maximal consistent subsets. Given a set of binary vectors we wish to store in an associative memory, we can construct a WFF $\varphi$ that is satisfied for all and only these vectors. ($\varphi$ is the boolean implementation of the function that outputs "one" for all memory vectors and "zero" otherwise.) By implementing the network that describes $\varphi$, we get an associative memory that performs completion when only a portion of the input is supplied. As a second application consider an inconsistent set of beliefs. We can construct a network that will search for a maximal consistent subset [adding degree of belief or certainty as in Derthick (1988) is a simple extension of the penalty function]. Subsets of beliefs compete among themselves and some are defeated in favor of others (Pinkas 1991). The network searches for a maximal consistent subset of beliefs such that the total penalty is minimized.
Acknowledgments

I thank Bill Ball, Sally Goldman, Dan Kimura, Arun Kumar, Stan Kwasny, Ron Loui, and Dave Touretzky for helpful comments and suggestions. This work was supported by McDonnell Douglas, Southwestern Bell, and Mitsubishi America at the Center for Intelligent Computing Systems and by NSF Grant 22-1321 57136.

References
Ballard, D. H. 1986. Parallel logical inference and energy minimization. In Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia, PA, pp. 203-208.
Derthick, M. 1988. Mundane reasoning by parallel constraint satisfaction. Ph.D. thesis, CMU-CS-88-182, Carnegie Mellon University.
Hinton, G. E. 1989. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp. 1, 143-150.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and re-learning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, J. L. McClelland and D. E. Rumelhart, eds., Vol. I, pp. 282-317. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Pinkas, G. 1990. Energy Minimization and the Satisfiability of Propositional Calculus. Tech. Rep. WUCS-90-03, Department of Computer Science, Washington University.
Pinkas, G. 1991. Propositional Non-Monotonic Reasoning and Inconsistency in Symmetric Neural Networks. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, 1991, to appear.
Sejnowski, T. J. 1986. Higher-order Boltzmann machines. Neural Networks for Computing, Proceedings of the American Institute of Physics 151, Snowbird, Utah, p. 3984.
Touretzky, D. S., and Hinton, G. E. 1988. A distributed connectionist production system. Cog. Sci. 12(3), 423-466.
Received 29 August 1990; accepted 28 January 1991.
VIEW
Communicated by Gary Dell and Garrison Cottrell
A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks
Stephen I. Gallant
HNC, Inc., 49 Fenno Street, Cambridge, MA 02138 USA
Representing and manipulating context information is one of the hardest problems in natural language processing. This paper proposes a method for representing some context information so that the correct meaning for a word in a sentence can be selected. The approach is primarily based on work by Waltz and Pollack (1985, 1984), who emphasized neurally plausible systems. By contrast this paper focuses on computationally feasible methods applicable to full-scale natural language processing systems. There are two key elements: a collection of context vectors defined for every word used by a natural language processing system, and a context algorithm that computes a dynamic context vector at any position in a body of text. Once the dynamic context vector has been computed it is easy to choose among competing meanings for a word. This choice of definitions is essentially a neural network computation, and neural network learning algorithms should be able to improve such choices. Although context vectors do not represent all context information, their use should improve those full-scale systems that have avoided context as being too difficult to deal with. Good candidates for full-scale context vector implementations are machine translation systems and Japanese word processors. A main goal of this paper is to encourage such large-scale implementations and tests of context vector approaches. A variety of interesting directions for research in natural language processing and machine learning will be possible once a full set of context vectors has been created. In particular the development of more powerful context algorithms will be an important topic for future research.

Neural Computation 3, 293-309 (1991) © 1991 Massachusetts Institute of Technology

1 Introduction

A natural language processing system (for example one that performs machine translation) must be able to choose among several different meanings for a word. For example star might refer to a celestial object (noun), a Hollywood personality (noun), the act of writing * (verb), etc. The choice is determined by

1. Syntax: If syntactical analysis requires that star be a direct object then this would rule out all verb definitions.
2. Context: star's neighbors in its sentence and other (usually nearby) sentences often determine the correct interpretation.
3. Word Use Statistics: Common usages are preferred to infrequent usages. We would not interpret star as the "Kleene star" without strong contextual evidence for this usage.
4. Common Sense: Assumed knowledge of the world (including local social customs) can be the decisive aspect.
Here we examine a context vector method for representing (some) context information for use in performing word disambiguations. The method is specifically directed at points 2 and 3 above. It is important to note that this approach is readily implementable with existing natural language systems. While the context vector approach will not give human-level performance, it should improve the accuracy of systems that do not currently use context, for example machine (assisted) translation systems. To quantify its actual usefulness in such systems, however, a large-scale implementation of the context vector approach will be required. A major reason for writing this paper is to encourage such implementations.

Prior work in word sense disambiguation has been done by Wilks (1975, 1978), who represented individual word senses as templates that gave general preferences for sentences containing that word sense. For example a particular word sense template might prefer a subject that was animate. The word sense templates were compared, and the best fitting one chosen. Representing subtle influences between words, however, would require larger and more complex templates than would be practical for a large system.

We will be using a context vector approach that has many early roots. For example Smith and Medin (1978) give an overview of similar (feature vector) approaches for representing concepts. Our approach is based on work by Waltz and Pollack (1985, 1984), who described a network model for disambiguation that is intended to be neurally plausible.¹

In the following section we will consider some complexity considerations that motivate our approach to representing context. Section 3 describes how to use context vectors for word sense disambiguation, and Section 4 describes two practical applications that could benefit from context processing. Section 5 discusses the use of neural network learning algorithms for further improvement in performance, and Section 6 draws comparisons with the Waltz-Pollack and other models. The final sections give research directions and concluding remarks.

¹We do not address the question as to whether the Waltz-Pollack model is actually neurally plausible; there is debate on this issue.
2 Complexity Motivations

2.1 Linear Network Size versus Representing Subtle Influences. A typical full-scale natural language processing system contains a dictionary with 50,000 to 500,000 words. Any method for handling context (or any other task) that requires examination of pairs of words is therefore not practical, especially if information (even a single bit) must be entered by hand. Thus for full-scale systems we seem to be limited to using information that grows linearly with respect to the number of words. We can call this "The Linear Law for Full-Scale Natural Language Systems."

On the other hand we would like to be able to represent subtle context interactions between a sizable fraction of word pairs. For example, if we are trying to disambiguate star then the appearance of any of the following in previous sentences would have an effect: telescope, science, observatory, movies, VCR, popcorn, makeup, attractive, slender, chewing-gum-on-shoe, etc.

It is difficult to see how semantic networks (Quillian 1968) or frames (Minsky 1961) alone could deal simultaneously with these two seemingly mutually exclusive requirements. For example, it might be possible to represent subtle influences by using many different types of links, but then it would be hard to keep the effort involved in building the network linear in the number of nodes (we cannot examine pairs of nodes). Another approach would be to make heavy use of inheritance to reduce the numbers of other types of links. The problem here is that subtle context relationships cut across any conceivable inheritance hierarchy to a large extent. For example "popcorn" would not likely be in an inheritance class close to "movies," even though we might closely associate them. Popcorn would more likely be in the "food" branch and "movies" in the "entertainment" branch. Thus we would be forced to omit such subtle influences because adding special links between a large fraction of the nodes would require too much effort.² Because of these considerations, current full-scale frame-based systems capture only a small fraction of the context associations.
²Note that the limiting factor seems to be constructing a rich enough semantic network. Searching a big network with parallel hardware and marker passing (Fahlman 1979; Hendler 1989; Hirst 1988) or similar schemes seems less difficult.

2.2 Contradictory and Distant Influences. Another complexity problem is how to take context information from neighboring sentences into account. This can significantly increase the difficulty of our task; in fact it is not obvious how to do this at all. For example,

• How many previous sentences should we keep?
• How can more recent context information be given priority over less recent information, and how does a sentence that changes context override previous context?
• How is conflicting context information reconciled? ("The astronomer married a star.")
A related complexity problem involves representation. If conflicting contexts are present it seems natural to resolve them by taking some sort of weighted sum where the weights indicate the strength and recency of the context information. Some representations, such as Disjunctive Normal Form expressions, are cumbersome when used to represent weighted sum functions. For example, DNF representations of the majority function grow exponentially with respect to the number of inputs. All of the above complexity considerations impose significant constraints on full-scale implementations and motivate a context vector approach.
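The contrast between DNF and weighted-sum representations can be made concrete with a small calculation. The sketch below is illustrative only: the function names are ours, and it assumes an odd number of inputs, for which each term of the irredundant DNF of the majority function is a conjunction of (n + 1)/2 un-negated inputs.

```python
from math import comb


def majority_dnf_terms(n: int) -> int:
    """Number of terms in the irredundant DNF of majority over n inputs.

    Assumes n is odd. Every term is a conjunction of (n + 1) // 2 inputs,
    and every subset of that size gives one term.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    return comb(n, (n + 1) // 2)


def weighted_sum_parameters(n: int) -> int:
    """A weighted-sum (threshold) unit needs n weights plus one threshold."""
    return n + 1


if __name__ == "__main__":
    for n in (3, 5, 11, 21):
        print(n, majority_dnf_terms(n), weighted_sum_parameters(n))
```

The DNF term count grows roughly like 2^n divided by the square root of n, whereas the weighted-sum representation grows linearly, which is the point made in the text.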
3 The Context Vector Approach
Because it is too difficult to compile context information for pairs of words, we are limited to using information that is indexed by single words. This keeps data linear with respect to the number of words (and has the added advantage of allowing context information to be attached to a word's dictionary entry).
3.1 Feature Space and Context Vectors. The essence of the following representation scheme was previously proposed by Waltz and Pollack (using slightly different terminology). Their paper should be consulted for many interesting comments and plausibility arguments. In order to specify what information is attached to each word, we first define a feature space consisting of n common words (or concepts) that might be used to delineate contexts. Examples of such features are given in Figure 1. Initially an actual system might be based upon an estimated 200 to 500 such features (or more if an automated method can be used). There should be some overlap of features to help give robustness to the system, as described in Section 3.3 below.
Figure 1: Some typical features. For convenience some features that apply to star have been moved to the top of the list. Features shown include: human, man, woman, machine, politics, art, science, play, sex, entertainment, walk, lie-down, motion, speak, yell, research, fun, sad, exciting, boring, friend, family, baby, country, hot, cold, hard, sharp, heavy, light, big, small, red, black, blue, yellow, insect, plant, tree, animal, mammal, fruit, fragrant, stink, flower, bush, future, high, low, part, paper, metal, building, wood, plastic, work, early, late, house, factory, afternoon, morning, night, snow, humid, bright, smart, dumb, car, truck, bike, write, type, cook, eat, spicy.

For each word k we now define a context vector, $V_k$, to be an n-dimensional vector and interpret each component of $V_k$ as follows:
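$$
V_{kj} =
\begin{cases}
\text{\{strongly\} positive} & \text{if word } k \text{ is \{strongly\} associated with feature } j \\
0 & \text{if word } k \text{ is not associated with feature } j \\
\text{\{strongly\} negative} & \text{if word } k \text{ \{strongly\} contradicts feature } j
\end{cases}
$$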
As an example, $V_{\text{astronomer}}$ might be

<+2 +1 +1 -1 -1 0 0 0 +2 0 0 0 +1 +1 +1 +2 +1 -1 +1 -1 ...>

using the features in Figure 1. Note that the interpretation of components of context vectors is exactly the same as the interpretation of weights in neural networks.

Constructing context vectors for 100,000 words should be a feasible (though perhaps unpleasant) task. If the context vectors are manually entered then it might be sufficient to restrict $V_{kj}$ to values of {+2, +1, 0, -1, -2}. We can estimate the work to construct a minimal system of 200 features for 2000 words by assuming human entries at a rate of 10 entries per minute. This yields about 80 person-days for the task. Note that this task need only be done once and then shared with, or sold to, others. Also it would be easy to combine information from several context vector dictionaries by concatenating corresponding context vectors (and filling in with 0s if a context vector for a word is missing from one of the sets).
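The effort estimate and the concatenation of context vector dictionaries can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are ours, dictionaries are plain Python mappings from words to lists of integers, and an 8-hour working day is assumed when converting minutes to person-days (the text does not state the length of a day).

```python
def entry_effort_person_days(n_features, n_words,
                             entries_per_minute=10, hours_per_day=8):
    """Person-days needed to key in n_features * n_words entries by hand.

    hours_per_day=8 is an assumption made for this sketch.
    """
    minutes = n_features * n_words / entries_per_minute
    return minutes / 60 / hours_per_day


def concat_dictionaries(dict_a, n_a, dict_b, n_b):
    """Concatenate corresponding context vectors from two dictionaries.

    A word missing from one dictionary is filled in with 0s for that
    dictionary's n_a or n_b features.
    """
    merged = {}
    for word in dict_a.keys() | dict_b.keys():
        part_a = dict_a.get(word, [0] * n_a)
        part_b = dict_b.get(word, [0] * n_b)
        merged[word] = list(part_a) + list(part_b)
    return merged


if __name__ == "__main__":
    print(entry_effort_person_days(200, 2000))
    print(concat_dictionaries({"star": [2, 1]}, 2,
                              {"star": [1], "moon": [2]}, 1))
```

With 200 features, 2000 words, and 10 entries per minute the function gives a little over 83 eight-hour days, in line with the "about 80 person-days" quoted above.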
It is likely that assigning such context vectors in any reasonable fashion should be sufficient, i.e., that performance of the final system for commonly occurring situations should not be critically dependent on such choices because of the redundant and overlapping nature of the feature space. More on this point later.

The final part of the data base of contexts concerns words with multiple meanings such as star. For such words we define not only a context vector for the entire word, but also a context vector for every individual meaning. For example we might have

[Example context vectors for star, star-sky, and star-Hollywood, with $V_{\text{star}} = V_{\text{star-sky}} + V_{\text{star-Hollywood}}$; the numeric components are not legible.]
As illustrated above, it is convenient to take the context vector for the word to be the sum of the context vectors for its various meanings.³ Section 5 shows how to generate context vectors for different meanings (but not for entire words) by machine learning techniques. We are now ready to use our database of context vectors to dynamically define context and to disambiguate meanings.
3.2 Dynamic Context Vectors and Context Algorithms. To represent the overall context at any position in the input text we use an n-vector C called the dynamic context vector. The heart of our system for manipulating contexts is a context algorithm that computes C. Such a context algorithm must examine words and their context vectors from the current sentence, from previous sentences, and possibly from some future sentences, and use this information to produce the dynamic context vector C. For example a simple context algorithm is as follows: Let $\hat{C}$ be the dynamic context vector at the end of the previous sentence ($\hat{C}$ = 0-vector initially). Then define the dynamic context vector at every word in the current sentence to be

$$C = \tfrac{1}{2}\hat{C} + \sum \{ V_i : \text{word } i \text{ in current sentence} \}.$$

This is a very simple context algorithm because it does not take relative word position within a sentence into account. However, it does take into account all words in the current and previous sentences, with influence of previous contexts "decaying" gracefully with distance. Thus more recent context information will dominate less recent context information.

³It is possible that normalizing all word context vectors would improve performance. This should be determined experimentally.
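The simple context algorithm above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function name, the toy two-feature dictionary, and the choice to let words without a dictionary entry contribute nothing are all our assumptions.

```python
def update_context(prev_context, sentence, vectors, decay=0.5):
    """Dynamic context vector for every word of the current sentence.

    prev_context: dynamic context vector at the end of the previous
                  sentence (the 0-vector initially).
    sentence:     list of words in the current sentence.
    vectors:      mapping from word to its context vector.
    decay:        weight given to the previous context (1/2 in the text).
    """
    context = [decay * c for c in prev_context]
    for word in sentence:
        vector = vectors.get(word)
        if vector is None:
            continue  # assumption: unknown words contribute nothing
        for j, value in enumerate(vector):
            context[j] += value
    return context


if __name__ == "__main__":
    # Hypothetical two-feature space: (science, entertainment).
    vectors = {"telescope": [2, 0], "popcorn": [0, 2]}
    c1 = update_context([0, 0], ["the", "telescope"], vectors)
    c2 = update_context(c1, ["popcorn"], vectors)
    print(c1, c2)
```

Each call halves the previous context before adding the current sentence, so a sentence k steps back is weighted by (1/2)^k, which gives the graceful decay described above.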
A slightly more complicated context algorithm that takes relative word position into account could consider:

$d(i,k)$ = distance of word k from word i (measured by the number of intervening words)
$D_i = \max \{ d(i,k) : k \in \text{sentence} \}$ = distance of furthest word in the sentence from word i
$\hat{C}$ = context vector at the last word of the preceding sentence
We now compute the context C at word X as
For very high performance we will probably need even more complex context algorithms based upon the parse for a sentence. Also note the computation of context algorithms is ideally suited to parallel (or neural network) hardware, but such a speedup is not needed for the context algorithms we have examined.

3.3 Word Disambiguation. Once we have computed the dynamic context vector, the rest is easy. To choose between several competing meanings for a word (such as star-sky and star-Hollywood) we merely select the meaning whose corresponding context vector has the largest inner product with the dynamic context vector. Geometrically if the context vectors for each meaning are of the same length, this is equivalent to picking the meaning that is closest in direction to the dynamic context vector (see Fig. 2).⁴ For our star example, if the current dynamic context vector is
C = <+1 +2 -1 0 +1 0 -2 +1 +2 +1 0 0 ... 0>

(where the tail terms are all 0 for ease of illustration), then we would compare $V_{\text{star-sky}} \cdot C = -4$ with $V_{\text{star-Hollywood}} \cdot C = 5$ and select star-Hollywood as the meaning.

⁴This is similar to the method suggested by Waltz and Pollack, except that comparing inner products is much more computationally efficient than using lateral inhibition over several iteration steps.
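The selection rule of Section 3.3 amounts to an argmax over inner products. The sketch below uses a hypothetical three-feature space and made-up context vectors for the two meanings of star; these numbers are ours for illustration and are not the vectors of the example in the text.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def disambiguate(context, meanings):
    """Return the meaning whose context vector has the largest inner
    product with the dynamic context vector."""
    return max(meanings, key=lambda name: inner(meanings[name], context))


if __name__ == "__main__":
    # Hypothetical features: (science, entertainment, night).
    meanings = {
        "star-sky": [2, -1, 2],
        "star-Hollywood": [-1, 2, 0],
    }
    print(disambiguate([-1, 2, 1], meanings))  # entertainment-heavy context
    print(disambiguate([2, 0, 1], meanings))   # science-heavy context
```

The cost of one decision is one inner product per competing meaning, with no iteration.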
Figure 2: Geometric interpretation of dynamic context vector and context vectors for competing definitions.

As illustrated by this example, there are several features that contribute to the selection of star-Hollywood over star-sky in the context of C. Therefore small changes in the definitions of $V_{\text{star-sky}}$ or $V_{\text{star-Hollywood}}$ or in the computation of C should not change the eventual selection. Also note that the selection computation is very fast, even on conventional microcomputers.

3.4 Testing the Context Vector Approach. It is easy to construct examples to give some confidence in a context vector approach (see Waltz and Pollack 1985). It is almost as easy to give examples where the approach would fail. To really test a context vector approach would require a fairly large scale implementation consisting of

• definition of a set of features (not difficult)
• creation of context vectors (about 6 person-months of effort)
• integration into an existing natural language processing system
• testing on a corpus of text.
A main reason for publishing this paper is to encourage such tests.

4 Practical Applications
Two full-scale practical applications that would plausibly benefit from using context vectors for word disambiguation are machine (assisted) translation and Japanese word processors.
4.1 Machine (Assisted) Translation. A typical translation system (such as the one created by Sharp Corporation for translating English to Japanese) does not use context because of the reasons previously discussed. Instead it makes a choice for the meaning of a word based upon word-use statistics and allows the user to override this choice using a menu. No action is necessary when the machine makes a correct first choice. Implementation of a word disambiguation system using context should improve the translating system's first choices. This would reduce the amount of human editing required and improve throughput (and "user-friendliness") of the system.
4.2 Japanese Word Processors. Japanese word processors must deal with the English alphabet and with two Japanese alphabets (hiragana and katakana), as well as with several thousand Chinese-derived picture characters (kanji). Typically about half the characters in a text are kanji rather than alphabetic. In order to input kanji text, the corresponding pronunciation is keyed in using one of the alphabets. Unfortunately it is quite common for the same pronunciation to apply to many different kanji characters. Therefore the word processor guesses one possible kanji character (based on most recent correct choice or overall usage statistics) and allows the user to override using a menu. Evidently this is exactly the same setup as with machine translation, so that examining contexts should make entering Japanese text easier and faster.
5 Learning to Disambiguate Words

The context vector approach seems well suited for machine learning techniques, particularly connectionist (i.e., neural network) learning algorithms. These algorithms can use training examples to learn the context vectors for competing definitions of a word. Figure 3 shows how to transform this task into a neural network learning problem. Here an input to the neural network is a dynamic context vector at the time when a choice of definitions for some particular word must be made, and the output is that choice. Every choice corresponds to one output cell, and the corresponding context vector for each choice is precisely the weight vector for the output cell in question.

[Figure 3 diagram: input cells 1 to n carry the dynamic context vector and feed the output cells (word definitions); in the example shown, meaning 3 is chosen.]

Figure 3: Word disambiguation by neural networks. The output cells correspond to various definitions for one word, and they comprise a winner-take-all group [or choice group or linear machine (Nilsson 1965)].

Training data should be readily available from the two previously mentioned applications. For example with machine translation whenever a choice of definitions by the machine is either accepted or changed, then that choice and the corresponding dynamic context vector constitute one training example for that word's definitions. It should be easy to collect training examples for on-line or off-line updating of context vectors by learning algorithms. If sufficient training data is available then these automated techniques should produce better context vectors than could be produced by hand.

If learning is done on-line while the system is operating, a suitable learning algorithm would be the cross-correlation (outer product) method of Kohonen (1972), Nakano (1972), Anderson et al. (1977), and Amari (1972) (and others). This algorithm is very fast. If learning is done off-line, we can use more powerful algorithms that are slightly slower, such as the “pocket algorithm for linear machines” as described in Gallant (1988). This algorithm is a modification of Nilsson’s and Rosenblatt’s perceptron learning algorithm for linear machines (Nilsson 1965; Rosenblatt 1961). The pocket algorithm has the advantage of taking the distribution of training examples into account (Gallant 1988) and of being well behaved with noisy and nonseparable data.

Note that there is an independent learning problem for each group of competing definitions for a word; this nicely decomposes the learning task into many small problems. Machine learning is therefore ready to be tried on full-scale systems once the context vectors are constructed. Section 7 suggests several other ways to use machine learning.
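To make the learning setup of Figure 3 concrete, the sketch below trains one weight vector per competing definition with the plain perceptron rule for linear machines. It is deliberately not the pocket algorithm (it keeps no "pocket" of best weights and so is only appropriate for separable toy data); the function names and the training examples are hypothetical.

```python
def predict(weights, x):
    """Index of the output cell (definition) with the largest inner product.

    Ties are resolved in favor of the lowest index.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(weights)), key=scores.__getitem__)


def train_linear_machine(examples, n_meanings, n_features, epochs=20):
    """Perceptron learning for a linear machine (winner-take-all group).

    examples: list of (dynamic_context_vector, correct_meaning_index).
    Returns one learned context (weight) vector per meaning.
    """
    weights = [[0.0] * n_features for _ in range(n_meanings)]
    for _ in range(epochs):
        mistakes = 0
        for x, target in examples:
            guess = predict(weights, x)
            if guess != target:
                mistakes += 1
                for j, xj in enumerate(x):
                    weights[target][j] += xj  # pull correct meaning toward x
                    weights[guess][j] -= xj   # push wrong winner away from x
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights


if __name__ == "__main__":
    # Hypothetical training data: meaning 0 = star-sky, 1 = star-Hollywood.
    examples = [([2, 0, 1], 0), ([1, -1, 2], 0),
                ([-1, 2, 0], 1), ([0, 1, -1], 1)]
    learned = train_linear_machine(examples, n_meanings=2, n_features=3)
    print(learned)
```

Because each group of competing definitions has its own small weight matrix, one such learner can be run independently per ambiguous word.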
6 Comparison with the Waltz-Pollack and Other Models

It is interesting to compare the model proposed in Waltz and Pollack’s pioneering work with the approach presented here. Most differences arise from a difference of emphasis: plausible neural modeling for Waltz and Pollack, as contrasted with computational efficiency in full-scale systems for the approach here. The Waltz-Pollack model consists of 5 groups of cells:

• feature cells (referred to as “microfeatures” by Waltz and Pollack)
• higher level context cells: ideally only one (or a few at most) of these cells are highly active for any sentence
• lexical cells: cells for parts of speech (noun, verb, etc.) and various word sense cells for each word
• input cells: one for every word, but the only cells with activation are those present in the target text
• syntactic cells: corresponding to a parse tree of the sentence.
The model uses spreading activation over many cycles and lateral inhibition to disambiguate words. By contrast the model we have been examining works with a single cycle and directly compares inner products to do the same task more quickly. It is not clear how much, if any, the omission of spreading activation iterations would hinder performance. It is likely that the extra structure in the Waltz-Pollack model would be important for some tasks, such as understanding texts on a higher level, and not important for other tasks.

Waltz and Pollack represented context using not only feature nodes, but also many (“potentially on the order of hundreds of thousands”) higher level context nodes connected to the feature nodes and to the various word definitions. Clearly such a collection of higher level nodes would preclude a practical full-scale system from being built. We eliminate these higher level nodes (and the need to construct appropriate weights for their connections) and work just with context vectors indexed by features. Note that eliminating higher level “local” nodes makes for a more “distributed” system (Hinton 1986).

The paper by Waltz and Pollack did not address the issue of efficient context algorithms that could be used with arbitrarily long bodies of text. Efficiently computing the context is of course vital for a practical, full-scale system. Finally, we have seen how to use neural network learning algorithms to learn word disambiguations. This is another issue that Waltz and Pollack did not examine in their paper.
6.1 Other Models. Cottrell (1985), Cottrell and Small (1983), and Gigley (1988) investigated a relaxation technique similar to the work by Waltz and Pollack, but did not explore feature (=microfeature) representations. Other important work in this area was by Kawamoto (1985) and McClelland and Rumelhart (1986) (see also St. John and McClelland 1989). Kawamoto and McClelland explored feature-based representations for assigning roles (agent, patient, instrument, modifier) to words. Their representation for features differed from Waltz and Pollack's by having one weight for every pair of features. Such added redundancy helped make the model resistant to errors and (presumably) eased learning of case roles by increasing dimensionality. However, the representation size for each word was increased by a considerable amount; any full-scale system using over 100 features would be slowed correspondingly. Kawamoto and McClelland chose features to be as independent as possible, whereas the model we have examined employs overlapping features for a more distributed and (presumably) error-tolerant system.

All of the previous systems assume a preliminary parse of the sentence, although it is not clear how vital such a parse is. Also none of these approaches deals with the practical requirement of processing medium or large-sized texts. Such texts cannot be considered all at once, so some sort of scan must be performed. This latter problem was addressed by Bookman (1989), who proposed a system similar to Waltz and Pollack's that dynamically processes the words in a sentence. Bookman also employs both relaxation (30-50 iterations) and higher level features. The higher level features would require as many as $10^9$ weights for a full-scale system, thereby raising practicality questions. Bookman also includes useful suggestions for which features to include in a feature-based system.
More recently Miikkulainen (1990) and Miikkulainen and Dyer (1988) have proposed their FGREP system for learning to assign case roles. One very interesting aspect of their system is that it creates an artificial feature-based representation for words by learning from a training corpus. The basic idea is to use backpropagation to assign case roles, and then continue backpropagating through the inputs so that the word representations are also adjusted. It would be very interesting to see if such "extended backpropagation" could be scaled up to large problems, although the reliance on backpropagation learning makes this somewhat doubtful. See Dyer (1990) for an excellent overview of this and other related approaches for distributed symbol formation.
7 Research Directions
Once context vectors have been defined for a full-scale system there will be several interesting research areas to explore. Some are listed below, with easier and more promising projects given before those that seem more difficult and more speculative.
• Perform article search by overall context. First devise an overall context algorithm to compute a single context vector for any body of text (e.g., take the normalized sum of all context vectors). Next compute the overall context vectors for all articles in a library. We can now find the closest article to a set of keywords by computing the overall context vector of the keywords and selecting articles whose overall context vectors have the largest inner products with the keywords’ context vector. This technique could also be used to find the closest articles to a given article or passage of text.

• Use parse information in the context algorithm. For example, the context of a direct object should be affected more by the context vector of the verb than by the context vector of the subject (e.g., “The astronomer married a star”).

• Learn context vectors for new words (that are not yet in the dictionary). Assign them the dynamic context vector at the time of their appearance. (If they appear several times, use the average dynamic context vector.)

• For high-use, poor performance words add intermediate cells (“hidden units”) to the network in Figure 3 to improve performance. In such cases either algorithms described in Gallant (1990) or the backpropagation algorithm (Werbos 1974; Rumelhart et al. 1986) could be used to generate such networks. (This would mean storing a “context network” for competing definitions of such words rather than context vectors for those definitions.)

• Use learning algorithms to predict the context of the current word (or its correct meaning). The disparity between the predicted context and the actual context vector would indicate the novelty or information added by that word. This in turn could be used to trigger semantic routines or a recomputation of previous contexts and definition choices. Prediction could also be used by the context algorithm. (In fact such a prediction vector could be used directly for word disambiguation; in essence we would be learning a context algorithm.)

• Use the context information in other ways to help parsing algorithms or for semantic processing.

• Psychological modeling. How much is this context vector approach in accord with, or in contradiction to, psychological evidence? A number of experiments are suggestive of feature-based representations because humans make simple and unexpected errors that would be explained by the use of such representations at an early level of cognitive processing. The basic idea is that a simple feature-based representation does not keep track of relations among the features, and this would cause certain types of errors. Kanwisher (1990) contains a good brief summary of these results. For example Treisman and Schmidt (1982) give cases where subjects, when presented with a red “X,” a blue “T,” and a green “O,” confidently report seeing a green “T” and a red “O.” Mozer (1989) has shown that people often judge a letter array to have fewer items if the array contains repeated letters than if it does not. Similarly Kanwisher (1987) has demonstrated a repetition blindness effect. For example subjects were briefly shown the sentence “When she spilled ink there was ink all over” and reported the ungrammatical “When she spilled ink there was all over.” It would be interesting to see if repetition blindness extends to meanings of words. For example, when people incorrectly substitute remembered words from a previously seen sentence, are the substitutions less correlated than the original words with those features already represented by the other words in the sentence?

• Learn context vectors from a standard on-line dictionary or thesaurus. There are several ways to approach this task, but it is not at all clear whether the result would be better than human-generated context vectors.

• Can we modify context vector representations to handle more general knowledge representation? A very important question.

8 Concluding Remarks
Much previous work in natural language processing has focused on describing and solving difficult problems in limited environments. Often it has not been clear how to extend such work to a full-scale system. Recently, however, the AI engineers, motivated by market demand, have changed the game. They have gone ahead and constructed full-scale systems and made them interactive so that humans could handle the difficult choices. This makes possible a new type of natural language research, namely how to improve such full-scale systems (even if the systems cannot be made to function perfectly). Our proposal is in line with this latter approach to natural language research. Implementing a full-scale context vector system should help natural language research by allowing work of the following type:
1. Here’s a problem, X.
2. This is a solution to problem X, implemented as a modification of some standard context algorithm (and/or parsing algorithm).
3. Here’s a comparison on a body of (unseen) text to show that the modification reduced mistakes by Y%.
After such work is published, everyone would be free to include the suggested modification into their own context algorithm for their own full-scale system. To emphasize: any progress on context algorithms would automatically translate into improved full-scale systems. This seems an inviting path for natural language processing research.
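As one example of a project that could be tried and compared in this way, the article search by overall context suggested in Section 7 can be sketched as follows. The sketch takes the overall context vector of a text to be the normalized sum of its word context vectors and ranks articles by inner product with the query. The function names, the tiny two-feature dictionary, and the article contents are hypothetical.

```python
import math


def overall_context(words, vectors, n_features):
    """Normalized sum of the context vectors of all known words in a text."""
    total = [0.0] * n_features
    for word in words:
        # Words without a dictionary entry are skipped (assumption).
        for j, value in enumerate(vectors.get(word, ())):
            total[j] += value
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total


def rank_articles(query_words, articles, vectors, n_features):
    """Article titles sorted by decreasing inner product with the query."""
    query = overall_context(query_words, vectors, n_features)
    scored = []
    for title, words in articles.items():
        article = overall_context(words, vectors, n_features)
        score = sum(q * a for q, a in zip(query, article))
        scored.append((score, title))
    return [title for _, title in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # Hypothetical features: (science, entertainment).
    vectors = {"telescope": [2, 0], "galaxy": [2, -1],
               "movie": [0, 2], "popcorn": [-1, 2]}
    articles = {"astro": ["telescope", "galaxy"],
                "film": ["movie", "popcorn"]}
    print(rank_articles(["telescope"], articles, vectors, 2))
    print(rank_articles(["popcorn"], articles, vectors, 2))
```

In a real library the article vectors would be computed once and stored, so that each query costs one inner product per article.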
Acknowledgments

Part of this work was performed while visiting NTT Information Processing Labs, Yokosuka, Japan. Thanks to the referees for helpful suggestions.
References

Amari, S. 1972. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comput. C-21(11), 1197-1206.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. 1977. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol. Rev. 84, 413-451.
Bookman, L. A. 1989. A connectionist scheme for modelling context. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky and G. Hinton, eds., pp. 281-290. Morgan Kaufmann, San Mateo, CA.
Cottrell, G. W. 1985. Connectionist parsing. Seventh Annual Conference of the Cognitive Science Society, Irvine, CA.
Cottrell, G. W., and Small, S. 1983. A connectionist scheme for modeling word sense disambiguation. Cognition Brain Theory 1, 89-120.
Dyer, M. G. 1990. Distributed symbol formation and processing in connectionist networks. J. Exp. Theoret. AI 2, 215-239.
Fahlman, S. E. 1979. NETL: A System for Representing and Using Real World Knowledge. MIT Press, Cambridge, MA.
Gallant, S. I. 1988. Bayesian assessment of a connectionist model for fault detection. AAAI Fourth Workshop on Uncertainty in Artificial Intelligence, St. Paul, MN, August 19-21, 127-135.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Transact. Neural Networks 1(2), 179-192.
Gigley, H. 1988. Process synchronization, lexical ambiguity resolution and aphasia. In Lexical Ambiguity Resolution, S. L. Small, G. W. Cottrell, and M. K. Tanenhaus, eds. Morgan Kaufmann, Los Altos, CA.
Hendler, J. 1989. The design and implementation of marker-passing systems. Connection Sci. 1(1), 17-40.
Hinton, G. E. 1986. Distributed Representations. Tech. Rep. CMU-CS-84-157, Carnegie Mellon University, Department of Computer Science. Revised version in Rumelhart, D. E., and McClelland, J. L. (eds.) 1986. Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Hirst, G. 1988. Semantic interpretation and ambiguity. Artificial Intelligence 34, 131-177.
Kanwisher, N. G. 1987. Repetition blindness: Type recognition without token individuation. Cognition 27, 117-143.
Kanwisher, N. G. 1990. Binding and type-token problems in human vision. Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, July 25-28, Cambridge, MA, 606-613.
Kawamoto, A. H. 1985. Dynamic processes in the (re)solution of lexical ambiguity. Ph.D. dissertation, Brown University.
Kohonen, T. 1972. Correlation matrix memories. IEEE Trans. C-21, 353-359.
McClelland, J. L., and Rumelhart, D. E. (eds.) 1986. Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 2. MIT Press, Cambridge, MA.
Miikkulainen, R. 1990. DISCERN: A Distributed Artificial Neural Network Model of Script Processing and Memory. TR UCLA-AI-90-05, UCLA Computer Science Department.
Miikkulainen, R., and Dyer, M. G. 1988. Encoding input/output representations in connectionist cognitive systems. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 347-356. Morgan Kaufmann, San Mateo, CA.
Minsky, M. 1961. Steps toward artificial intelligence. Proc. IRE 49, 8-30. Reprinted in Feigenbaum, E. A., and Feldman, J. (eds.), 1963. Computers and Thought. McGraw-Hill, New York.
Mozer, M. 1989. Types and tokens in visual letter perception. J. Exp. Psychol. Human Percept. Perform. 15, 287-303.
Nakano, K. 1972. Associatron-A model of associative memory. IEEE Transact. Syst. Man, Cybernet. SMC-2(3), 380-388.
Nilsson, N. J. 1965. Learning Machines. McGraw-Hill, New York.
Quillian, M. R. 1968. Semantic memory. In Semantic Information Processing, M. Minsky, ed. MIT Press, Cambridge, MA.
Rosenblatt, F. 1961. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Press, Washington, DC.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Smith, E. E., and Medin, D. L. 1976. Categories and Concepts. Harvard University Press, Cambridge, MA.
St. John, M., and McClelland, J. L. 1989. Applying contextual constraints in sentence comprehension. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky and G. Hinton, eds., pp. 338-346. Morgan Kaufmann, San Mateo, CA.
Treisman, A., and Schmidt, H. 1982. Illusory conjunctions and the perception of objects. Cog. Psychol. 14, 107-141.
Waltz, D. L., and Pollack, J. B. 1985. Massively parallel parsing: A strongly interactive model of natural language interpretation. Cog. Sci. 9, 51-74.
Waltz, D. L., and Pollack, J. B. 1984. Phenomenologically Plausible Parsing. AAAI-84, pp. 335-339. William Kaufmann, Los Altos, CA.
Werbos, P. J. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. Thesis, Harvard University.
Wilks, Y. 1975. A preferential, pattern-seeking, semantics for natural language inference. Artificial Intelligence 6, 53-74.
Wilks, Y. 1978. Making preferences more active. Artificial Intelligence 11, 197-223.

Received 30 April 1990; accepted 29 March 1991.
Communicated by Scott Fahlman
NOTE
A Modified Quickprop Algorithm

Alistair Craig Veitch
Geoffrey Holmes
Department of Computer Science, University of Waikato, New Zealand
1 Introduction
Scott Fahlman’s benchmarking studies (Fahlman 1988) indicate that quickprop is a fast and reasonably reliable learning algorithm. One weakness of the algorithm, however, is that it is sometimes necessary to restart learning, after a (problem dependent) number of epochs has elapsed, usually because a local minimum has been encountered. In this note we suggest a modification to the original algorithm which uses a problem independent means of restarting, requires only one major parameter (the learning rate), and which works as well as the original.

2 Modified Quickprop

Our interpretation of the original quickprop algorithm, which closely reproduced Fahlman’s benchmark results, is given in Veitch and Holmes (1991). The modification we suggest focuses on the case where weights grow too large. Fahlman uses a weight decay term added to the gradient for each weight, and if the weights do not converge after a certain limiting number of epochs the system is restarted. This weight limit is generally different for each problem. When testing our implementation of quickprop we observed that weight growth and local minima seemed linked. The weights grew extremely large very quickly only when the system was stuck. Experiments in which weight decay was eliminated and the system reset only if a weight grew beyond a preset limit performed as well as the original.

3 Discussion
We tested the modified algorithm on four classes of problem: encoders, XOR, random, and dtoa. The latter two were new problems developed for benchmarking purposes (Veitch 1991). Of these, only the XOR and random (mapping randomly generated inputs to outputs) problems needed restarting: approximately 10% of the time for XOR (Fahlman quotes a similar figure) and 2% of the time for random. A general weight limit of 10,000 was used for all problems. A large limit is necessary to allow for legitimate large-weight solutions. When a local minimum was encountered the limit was reached extremely quickly, typically in 2-5 epochs. We feel confident that limits of this scale are detecting local minima. It may be a reflection on the scale of the problems studied to date that very few need to be restarted, although the random class of problems is potentially very large. It is possible that, as more complex and less well-defined problems are tackled in applications, automatic restarting will become more of an issue.
Neural Computation 3, 310-311 (1991) © 1991 Massachusetts Institute of Technology
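The restart criterion described in this note can be sketched independently of the learning rule itself. The following is a minimal illustration, not the authors' implementation: `train_with_restarts`, `toy_update`, and the starting values are hypothetical names and numbers introduced here, and the toy update merely stands in for one epoch of quickprop. Only the weight limit of 10,000 is taken from the text.

```python
WEIGHT_LIMIT = 10_000.0   # general weight limit quoted in the note


def train_with_restarts(init_weights, update, converged,
                        weight_limit=WEIGHT_LIMIT, max_epochs=100_000):
    """Train until `converged`; restart whenever any |weight| exceeds the limit."""
    weights = init_weights()
    restarts = 0
    for _ in range(max_epochs):
        if converged(weights):
            return weights, restarts
        weights = update(weights)               # one epoch of any learning rule
        if any(abs(w) > weight_limit for w in weights):
            weights = init_weights()            # treated as stuck: reset and try again
            restarts += 1
    raise RuntimeError("did not converge within max_epochs")


# Toy stand-in for a learning rule (hypothetical, not quickprop itself):
# a positive weight relaxes toward 1, a non-positive one grows without bound.
def toy_update(weights):
    return [1.0 + 0.5 * (w - 1.0) if w > 0 else 2.0 * w - 1.0 for w in weights]


starts = iter([[-0.5], [0.3]])                  # first start diverges, second converges
final_weights, n_restarts = train_with_restarts(
    init_weights=lambda: next(starts),
    update=toy_update,
    converged=lambda ws: abs(ws[0] - 1.0) < 1e-6,
)
print(final_weights, n_restarts)
```

Because the test is on weight magnitude rather than on an epoch count, the same limit can be reused across problems, which is the problem-independence the note argues for.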
References
Fahlman, S. E. 1988. Faster learning variations on back-propagation: An empirical study. In Proceedings of the 1988 Connectionist Models Summer School, pp. 38-51.
Veitch, A. C. 1991. Benchmarking and fast learning in neural networks: Results for back-propagation and real-time recurrent learning. M.Sc. thesis, University of Waikato, New Zealand.
Veitch, A. C., and Holmes, G. 1991. Benchmarking and fast learning in neural networks: Results for back-propagation. Proceedings of the Second Australian Conference on Neural Networks, pp. 167-171.
Received 11 February 1991; accepted 4 March 1991.
Communicated by Geoffrey Hinton
Removing Time Variation with the Anti-Hebbian Differential Synapse Graeme Mitchison Physiological Laboratory, Downing Street, Cambridge CB2 3EG, England
I describe a local synaptic learning rule that can be used to remove the effects of certain types of systematic temporal variation in the inputs to a unit. According to this rule, changes in synaptic weight result from a conjunction of short-term temporal changes in the inputs and the output. Formally, $\Delta(\text{weight}_i) \propto -\Delta(\text{input}_i) \times \Delta(\text{output})$. This is like the differential rule proposed by Klopf (1986) and Kosko (1986), except for a change of sign, which gives it an anti-Hebbian character. By itself this rule is insufficient. A weight conservation condition is needed to prevent the weights from collapsing to zero, and some further constraint (implemented here by a biasing term) is needed to select particular sets of weights from the subspace of those that give minimal variation. As an example, I show that this rule will generate center-surround receptive fields that remove temporally varying linear gradients from the inputs.
1 Introduction
Forming stable representations of a changing world is a basic goal of sensory processing. In visual perception, for example, it is desirable to eliminate from the representation of the visual field changes due to varying illumination, observer motion, and the predictable motions of objects. More precisely, an economical encoding should not represent this information at every point in space, as this would lead to a large measure of redundancy (Barlow 1989). Achieving invariance under such classes of transformation requires fine-tuning of receptive field parameters, and is probably at least partly learned. I suggest here that a certain type of synapse, the anti-Hebbian differential synapse, may play an important part in such learning. The basic idea is that neurons in sensory pathways adjust their synaptic weights so as to minimize changes in their firing rate due to alterations of the incoming sensory pattern. Some of these changes will be relatively complex and unpredictable, whereas variations due to, say, motion of the observer will be more systematic and frequently encountered. A suitable synaptic learning rule can cause the latter type of variation to be nulled out, while leaving a neuron still responsive to the more unpredictable part of its input.
Neural Computation 3, 312-320 (1991) © 1991 Massachusetts Institute of Technology
2 The Biased Differential Synapse
Suppose we have a unit with weights $w_i$ and a time-varying pattern of inputs $x_i(t)$ (we usually omit the $t$). How should one choose the weights so as to obtain an output $y(t)$ that varies as little as possible? One strategy is to minimize $V = \langle (\partial y/\partial t)^2 \rangle$ by gradient descent in the weights (here the brackets $\langle\ \rangle$ denote the average over the distribution of inputs), which means that the weights are adjusted by the rule $dw_i/dt = -\alpha\,\partial V/\partial w_i$, where $\alpha > 0$. The analogue of this for multilayer nets, namely backpropagation (Rumelhart et al. 1986) using the difference between $y(t)$ and $y(t + \Delta t)$ as output error, is discussed by Hinton (1989). If the unit is linear, with $y = \sum_i w_i x_i$, the rule becomes $dw_i/dt = -\alpha \langle (\partial x_i/\partial t)(\partial y/\partial t) \rangle$. The averaging $\langle\ \rangle$ can be attained by changing weights while the inputs vary randomly. The rule can then be expressed in discrete time steps as
$$\Delta w_i = -\alpha\,\Delta x_i\,\Delta y \tag{2.1}$$
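As a numerical illustration of equation 2.1, the sketch below trains a single linear unit on input changes of the form $\Delta x_i = a + b \cdot i$, the kind of drifting linear gradient considered in Section 3. The array size, learning rate, number of steps, and the uniform distributions for $a$ and $b$ are assumptions chosen here for stability and speed, not values from the paper. No noise and no weight conservation are included, so this only shows the rule nulling out the systematic variation.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                                    # number of inputs (illustrative)
alpha = 0.02                             # learning rate in eq. 2.1 (assumed value)
idx = np.arange(1, m + 1, dtype=float)   # positions i = 1..m

w = rng.normal(size=m)                   # random initial weights

for _ in range(20_000):
    # One time step of a drifting linear gradient: dx_i = a + b*i,
    # with a and b redrawn every step (an assumption of this sketch).
    a, b = rng.uniform(-1.0, 1.0, size=2)
    dx = a + b * idx
    dy = w @ dx                          # output change of the linear unit y = sum_i w_i x_i
    w -= alpha * dx * dy                 # eq. 2.1: dw_i = -alpha * dx_i * dy

sum_w = w.sum()                          # tends to 0: invariance to a constant offset
sum_iw = idx @ w                         # tends to 0: invariance to a linear gradient
print(sum_w, sum_iw, np.linalg.norm(w))
```

Each update changes the weights only along the current $\Delta x$, so the component of the weight vector orthogonal to both the constant and the linear gradient is left as it started; only the two "gradient-sensitive" components decay.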
Equation 2.1 is the anti-Hebbian form of the differential rule $\Delta w_i = \alpha\,\Delta x_i\,\Delta y$ proposed by Kosko (1986) and Klopf (1986), and discussed by Barto and Sutton (1981). The change of sign leads, of course, to radically different behavior. Under our rule the weights $w_i$ will generally be driven to zero if there is noise in the inputs, since there is no nonzero assignment of weights that gives invariance under random variations in the input. To avoid this, one can add a conservation rule, such as $\sum_i w_i^2 = K$ or $\sum_i |w_i| = K$, where $K$ is a constant. The effect of this is to ensure that the weights converge to a "subset of minimal variation" on which $\langle \Delta y^2 \rangle$ is minimized (see Fig. 1). This is still not an adequate learning rule, because one usually wishes to define not a single unit but a set of units that spans the subspace efficiently. This can be achieved by biasing units toward particular desired responses, even if these responses are not themselves invariant. As we shall see, this strategy leads to invariant or minimally varying units whose outputs approximate the bias. A simple biasing rule can be derived from gradient descent on $\langle (\partial y/\partial t)^2 \rangle + \lambda \langle (\zeta - y)^2 \rangle$, where $\zeta$ is the bias and $\lambda > 0$. This gives
$$\Delta w_i = -\alpha\,\Delta x_i\,\Delta y + \beta x_i(\zeta - y) \tag{2.2}$$
where $\beta = \alpha\lambda$. We also require a weight conservation rule such as $\sum_i w_i^2 = K$ or $\sum_i |w_i| = K$. The way equation 2.2 works is illustrated in Figure 1. Let us first neglect the effect of the biasing term, $\beta x_i(\zeta - y)$, in equation 2.2 and suppose that the weight conservation rule is $\sum_i w_i^2 = K$. Then $\Delta w_i = 0$ when the mean flow from the differential term is balanced by projection onto the sphere $\sum_i w_i^2 = K$, that is, when $\langle \Delta x_i \Delta y \rangle = \sum_j w_j \langle \Delta x_i \Delta x_j \rangle = \mu w_i$ for all $i$ and some fixed $\mu$. Thus $\mu$ is an eigenvalue of the linear transformation $A$ with $A_{ij} = \langle \Delta x_i \Delta x_j \rangle$. Let $\mu_1, \ldots, \mu_k$ be the complete set of eigenvalues of $A$. It is easy to check that the flow induced by $A$ on the sphere in the neighborhood of the $j$th eigenspace has eigenvalues $(\mu_j - \mu_i)$, $i = 1, \ldots, k$, $i \neq j$, so only the smallest eigenvalue gives a stable attractor of the flow. Note that this is the eigenvalue that gives the smallest value of $\langle \Delta y^2 \rangle$, since $\langle \Delta y^2 \rangle = \sum_{i,j} w_i w_j \langle \Delta x_i \Delta x_j \rangle = \mu \sum_i w_i^2 = \mu K$. The subset of minimal variation is therefore the intersection of this eigenspace with the sphere $\sum_i w_i^2 = K$.
Consider now the effect of the second term in equation 2.2. Suppose the bias $\zeta$ is generated from the inputs by a set of weights $z_i$ (shown as the point Z in Fig. 1). Then the second term induces a flow toward Z. If $\beta$ is small compared to $\alpha$ (i.e., $\lambda$ small), this flow will never pull the state point far from the subset of minimal variation. Thus if the minimal eigenvalue is nondegenerate, so the subset of minimal variation consists of two points (the intersection of a line with the sphere), the second term will have little effect. But in the case where the minimal eigenvalue is degenerate, the state point will slide until it reaches a point of equilibrium (shown as Z′ in Fig. 1) under the flow toward Z. Although the learning rule with bias, equation 2.2, is quite simple mathematically, it is not obvious how to invent a neuronal justification for it. The error-correcting term $\beta x_i(\zeta - y)$ requires rather elaborate circuitry for its realization (Mitchison 1989), and the interaction with the differential term makes matters worse.
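The stability argument earlier in this section (only the eigenspace of the smallest eigenvalue of $A$ attracts the differential flow on the sphere) can be checked numerically. The sketch below iterates the mean flow $\Delta w = -\alpha A w$ followed by renormalization onto $\sum_i w_i^2 = K$. The matrix $A$, its spectrum, the step size, and the iteration count are hypothetical choices made here for illustration; the bias term is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 1.0                                           # weight conservation constant
alpha = 0.2                                       # step size (assumed value)
eigenvalues = np.array([0.1, 1.0, 2.0, 3.0])      # hypothetical spectrum of A
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))      # random orthonormal eigenvectors
A = Q @ np.diag(eigenvalues) @ Q.T                # plays the role of A_ij = <dx_i dx_j>

w = rng.normal(size=4)
w *= np.sqrt(K) / np.linalg.norm(w)               # start on the sphere
for _ in range(500):
    w = w - alpha * (A @ w)                       # mean flow of the differential term
    w *= np.sqrt(K) / np.linalg.norm(w)           # project back onto sum w_i^2 = K

v_min = Q[:, 0]                                   # eigenvector of the smallest eigenvalue
alignment = abs(w @ v_min) / np.sqrt(K)           # 1 when w lies along v_min
variation = w @ A @ w                             # <dy^2> for these weights
print(alignment, variation)
```

At convergence the weight vector lies along the eigenvector of the smallest eigenvalue and $\langle \Delta y^2 \rangle$ equals that eigenvalue times $K$, in line with the relation $\langle \Delta y^2 \rangle = \mu K$.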
It seems more natural, neuronally, to replace the error-correcting term with a Hebbian term $\beta x_i y$ and to supply the bias by adding an input proportional to $\zeta$, so the response of the unit is $y = u\zeta + \sum_i w_i x_i$ ($u$ a suitably chosen constant). A suitable set of equations is
$$\Delta w_i = -\alpha\,\Delta x_i\,\Delta y + \beta x_i y \tag{2.3}$$
where $y = u\zeta + \sum_i w_i x_i$ and $\sum_i w_i^2 = K$ or $\sum_i |w_i| = K$. Note that, in the case where the bias is one of the inputs, the biasing weight $u$ must not be included in the normalization of weights; it is the exemption from normalization of part of the weight of this input that allows it to dominate the others.
3 An Example: Invariance under Added Gradients
Figure 1: A unit receiving inputs from both differential synapses and biasing inputs, and (below) the flows induced by these inputs in weight space (i.e., the n-dimensional space of weights $w_i$, $i = 1, \ldots, n$). The weight conservation condition restricts the flow to a surface, here shown as the sphere $\sum_i w_i^2 = K$. The attractor of the differential flow (solid arrows) is the eigenspace of the smallest eigenvalue of the linear transformation $A$, where $A_{ij} = \langle \Delta x_i \Delta x_j \rangle$. The plane shown here represents the eigenspace of a degenerate eigenvalue. The subset of minimal variation is the intersection of this plane with the sphere. The bias induces a flow (dotted lines) that will usually be much weaker than that due to the differential component, so the state point stays close to the subset of minimal variation.

Early vision provides several examples where invariance to added linear gradients is desirable. Changes in illumination produce linear changes in log intensity (Hurlbert and Poggio 1988), and changes in viewing geometry add gradients of disparity (Mitchison and Westheimer 1990). In the case of lightness, the problem is usually seen as that of estimating the reflectance function given that the brightness incorporates an unknown linear illumination gradient (Land 1964); the variation of illumination with time has not generally been stressed, and is indeed likely to be a less significant effect than the variation of disparity with changing viewing parameters. However, the problems have a similar formal structure, and we can model them by the same simplified scheme. Let us therefore represent the variables of interest (patterns of log-reflectance, disparity, etc.) by a random one-dimensional array $r_i$, to which is added a time-varying linear gradient. Formally, the inputs are $x_i(t) = r_i + (a + b \cdot i)(t - nT)$ during the time interval $[nT, (n+1)T)$, where $i = 1, 2, \ldots, m$ labels the points in the array, and the $r_i$ and the constants $a$, $b$ are chosen randomly from suitable distributions at the beginning of each time interval (Fig. 2a).

Consider first what happens if a single unit is trained by the differential rule, equation 2.1, without conservation of weights or biasing. First, we remove the discontinuities between the time intervals $[nT, (n+1)T)$, so the unit is trained only on the input patterns within each interval. The weights then rapidly evolve from initial randomly chosen values and converge to the linear subspace defined by $\sum_i w_i = 0$ and $\sum_i i\,w_i = 0$, so the unit is invariant under all linear perturbations $(a + b \cdot i)$ of the inputs. Thus the smallest eigenvalue of the linear transformation $A$ defined in the preceding section is zero and degenerate, with multiplicity $m - 2$. If the discontinuities between intervals are now included, all the weights tend to zero, since there is no nontrivial subspace that is invariant under the random perturbations corresponding to choices of the $r_i$ at the start of each time interval.

Imposing a conservation condition on the weights avoids this collapse to zero, and one again obtains weights that (up to small variations due to noise) satisfy the conditions $\sum_i w_i = 0$ and $\sum_i i\,w_i = 0$. However, there is considerable arbitrariness in such sets of weights (e.g., Fig. 2b). Biasing allows one to select favorable sets in an orderly manner. As an illustration, we use equation 2.2 and take the inputs $x_k$ as biases. We then get a set of weights, one for each $k$, which has approximately a center-surround structure, with a pronounced peak at $w_k$ (Fig. 2c). This resembles the solution found by Hurlbert and Poggio’s algorithm (1988), and indeed the algorithms are formally related, though the computational goals are somewhat different (see Appendix). One can easily extend this analysis to the case of a two-dimensional array of inputs, biased by one of the inputs, as shown in Figure 2d. Essentially identical receptive fields are obtained if the Hebbian model, equation 2.3, is used instead of equation 2.2.

Figure 2: (a) The pattern of inputs assumed in the example discussed in the text. The input in one time interval is an array of random values, chosen at the beginning of each interval, with an added (randomly chosen) linear gradient varying linearly in time. The inputs extending over two time intervals are shown. Note the discontinuity at the boundary of the intervals; this constitutes a source of noise in the inputs. (b) A solution obtained (after 10,000 cycles) using equation 2.1 with $\alpha = 0.01$ and the weight conservation rule $\sum_i |w_i| = 1$. (c) The effect of adding a bias. Equation 2.2 was used, with the third input serving as bias and $\sum_i |w_i| = 1$ as conservation rule. To overcome the effects of the noise, small values of the rate constants were used ($\alpha = 0.01$, $\beta = 0.0001$) with a large number of cycles (200,000 for the pattern shown here), since this allows averaging over many inputs. The ideal (noiseless) solution is also shown. This was calculated by minimizing $\langle (\zeta - y)^2 \rangle$ subject to $\sum_i w_i = 0$ and $\sum_i i\,w_i = 0$, and then normalizing to give $\sum_i |w_i| = 1$. The minimization is equivalent to putting $w_i = \lambda + \mu i$, $i \neq k$, $w_k = 1 + \lambda + \mu k$, and choosing $\lambda$ and $\mu$ so that $\sum_i w_i = 0$ and $\sum_i i\,w_i = 0$. A simple calculation shows that $\lambda = (6k - 4n - 2)/n(n - 1)$, $\mu = 6\{1 - 2k/(n + 1)\}/n(n - 1)$. (d) In two dimensions the same procedure gives $w_{ij} = \lambda + \mu i + \nu j$, $w_{kl} = 1 + \lambda + \mu k + \nu l$, with $\lambda$, $\mu$, $\nu$ chosen so that $\sum w_{ij} = 0$, $\sum i\,w_{ij} = 0$, and $\sum j\,w_{ij} = 0$. This gives $\lambda = -1/n^2 - 6(n + 1 - k - l)/n^2(n - 1)$, $\mu = 6(n + 1 - 2k)/n^2(n^2 - 1)$, $\nu = 6(n + 1 - 2l)/n^2(n^2 - 1)$.
4 Discussion
Various other learning rules have something in common with the differential synapse. Hinton and Becker (1990) investigated a rule whereby units with nonoverlapping receptive fields try to maximize their mutual information. This could be cast in a temporal form by requiring that units maximize the mutual information between successive time steps. This leads to a conflict between the goals of maintaining constant firing rate and signaling something informative, which can best be resolved if a unit is invariant to fast-changing components (like the linear gradients in the example above), but responds to some slow-changing pattern feature. The unit will therefore pick up an invariant feature without the need for the explicit biasing in our scheme. Something like this can be achieved with the differential synapse by replacing biasing by decorrelation (Barlow and Földiák 1989). Földiák (1991) has proposed an algorithm that learns a position-invariant receptive field resembling that of a complex cell by combining the inputs from lower level "simple" units. This learning rule is less powerful than the differential synapse because it requires as inputs a set of units that are activated sequentially. For example, it will not be able to learn the receptive fields shown in Figure 2 from the given inputs. The differential synapse, on the other hand, can learn complex-type receptive fields. Finally, one can ask whether differential synapses exist in real neural systems. Although our requirement is for an anti-Hebbian differential synapse, a Hebbian version would also be a useful component of an invariance-seeking system, since it would generate units that respond primarily to the fast-changing components in the inputs. It would therefore be of interest to find either type of differential synapse.
Hippocampal LTP does not seem to exhibit the right properties, since evidence suggests that the strength of the potentiation depends upon the amount of postsynaptic depolarization (Wigstrom et al. 1986; Kelso et al. 1986) rather than its rate of change. However, it may be necessary to design an appropriate experiment to demonstrate differential synaptic effects, especially since the rate of change of synaptic weight would be expected to be small, according to the scheme proposed here, to allow averaging over an ensemble of inputs. One approach would be to present a prolonged, temporally modulated input, and to vary its temporal frequency. A differential effect should increase with the frequency up to, say, 10-100 Hz, whereas no such increase would be expected from conventional Hebbian synaptic changes.
Appendix
Hurlbert and Poggio (1988) studied the problem of subtracting an illumination gradient from an image, and obtained a set of weights somewhat like those shown here in Figure 2c. In our terms, they sought a linear unit whose inputs take the form $x_i + (a + bi)$, where the $x_i$, $a$, and $b$ are randomly chosen from suitable distributions, and whose output gave the best approximation in a least squares sense to a chosen $x_k$. The resemblance to our problem is closer if we regard the functions $x_i$ and $x_i + (a + bi)$ as the inputs at times $t$ and $t + \Delta t$, and ask that the output $y(t + \Delta t) = y + \Delta y$ should approximate $x_k$, that is, that weights should be chosen to minimize $W = \langle (y + \Delta y - x_k)^2 \rangle$. We can write $W = \langle \Delta y^2 \rangle + \langle (y - x_k)^2 \rangle + 2\langle \Delta y\,(y - x_k) \rangle$, and the last term can be expressed as a weighted sum of the terms $\langle \Delta x_i\, x_j \rangle$, which vanish if each $\Delta x_i$ is symmetrically distributed around zero and independent of each $x_j$ (a condition which confines us to two-step sequences rather than the continuous or multistep sequences used to generate the weights shown in Fig. 2c). In this case $W = \langle \Delta y^2 \rangle + \langle (y - x_k)^2 \rangle$, which is the same as the potential function underlying equation 2.2, with $\lambda = 1$. With this value of $\lambda$, the biasing term has the same weight as the temporal invariance term $\langle \Delta y^2 \rangle$, so the solution will generally not be close to the minimum of $\langle \Delta y^2 \rangle$. This reflects the difference between our goal of attaining invariance, and Hurlbert and Poggio’s goal of predicting (or retrodicting, the direction of time being irrelevant here) the $k$th input $x_k$.
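The expansion of $W$ used in this appendix can be written out term by term. The following is a sketch under the assumptions stated in the text (a linear unit, with each $\Delta x_i$ symmetric about zero and independent of each $x_j$); the Kronecker delta $\delta_{jk}$ is notation introduced here for convenience and does not appear in the original.

```latex
\begin{align*}
W &= \langle (y + \Delta y - x_k)^2 \rangle
   = \langle \Delta y^2 \rangle + \langle (y - x_k)^2 \rangle
     + 2\,\langle \Delta y\,(y - x_k) \rangle, \\
\Delta y &= \sum_i w_i\,\Delta x_i, \qquad
y - x_k = \sum_j (w_j - \delta_{jk})\,x_j, \\
\langle \Delta y\,(y - x_k) \rangle
  &= \sum_{i,j} w_i\,(w_j - \delta_{jk})\,\langle \Delta x_i\,x_j \rangle
   = \sum_{i,j} w_i\,(w_j - \delta_{jk})\,\langle \Delta x_i \rangle \langle x_j \rangle
   = 0 .
\end{align*}
```

Independence factorizes $\langle \Delta x_i\,x_j \rangle$, and symmetry about zero gives $\langle \Delta x_i \rangle = 0$, so only the first two terms of $W$ survive.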
Acknowledgments I thank Richard Durbin for helpful discussions.
References
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B., and Földiák, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. M. Durbin, C. Miall, and G. J. Mitchison, eds., Ch. 4, pp. 54-72. Addison-Wesley, Wokingham.
Barto, A. G., and Sutton, R. S. 1981. Goal seeking components for adaptive intelligence: An initial assessment. Tech. Rep. No. 81-1070. Wright-Patterson Air Force Base, OH: Air Force Wright Aeronautical Laboratories. (DTIC Report AD 101476 available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145.)
Földiák, P. 1991. Learning invariance from transformation sequences. Neural Comp. 3, 194-200.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Hinton, G. E., and Becker, S. 1990. An unsupervised learning procedure that discovers surfaces in random-dot stereograms. Proc. Int. Joint Conf. Neural Networks, Washington, DC, January.
Hurlbert, A. C., and Poggio, T. A. 1988. Synthesizing a color algorithm from examples. Science 239, 482-485.
Kelso, S. R., Ganong, A. H., and Brown, T. H. 1986. Hebbian synapses in hippocampus. Proc. Natl. Acad. Sci. U.S.A. 83, 5326-5330.
Klopf, A. H. 1986. A drive-reinforcement model of single neuron function: An alternative to the Hebbian neuronal model. In AIP Conference Proceedings 152: Neural Networks for Computing, J. S. Denker, ed., pp. 265-270. American Institute of Physics, New York.
Klopf, A. H. 1988. A neuronal model of classical conditioning. Psychobiology 16, 85-125.
Kosko, B. 1986. Differential Hebbian learning. In AIP Conference Proceedings 152: Neural Networks for Computing, J. S. Denker, ed., pp. 277-288. American Institute of Physics, New York.
Land, E. H. 1964. The retinex. Am. Sci. 52, 247-264.
Mitchison, G. J., and Westheimer, G. 1990. Viewing geometry and gradients of horizontal disparity. In Vision: Coding and Efficiency, C. Blakemore, ed., pp. 302-309. Cambridge University Press, Cambridge.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by back-propagating errors. Nature (London) 323, 533-536.
Wigstrom, H., Gustafsson, B., Huang, Y.-Y., and Abraham, W. C. 1986. Hippocampal long-term potentiation is induced by pairing single afferent volleys with intracellularly injected depolarizing current pulses. Acta Physiol. Scand. 126, 317-319.
Received 18 October 1990; accepted 15 February 1991.
Communicated by Idan Segev
Simulations of a Reconstructed Cerebellar Purkinje Cell Based on Simplified Channel Kinetics Paul C. Bush Terrence J. Sejnowski Howard Hughes Medical Institute and Computational Neurobiology Laboratory, The Salk Institute, La Jolla, CA 92037, USA and University of California at San Diego, La Jolla, CA 92037, USA
When cerebellar Purkinje cells are depolarized with a constant current pulse injected at the soma, complex spike discharge patterns are observed (Llinas and Sugimori 1980b). A computer model has been constructed to analyze how the Purkinje cell ionic conductances identified to date interact to produce the observed firing behavior. The kinetics of voltage-dependent conductances used in the model were significantly simpler than Hodgkin-Huxley kinetics, which have many parameters that must be experimentally determined. Our simplified scheme was able to reproduce the complex nonlinear responses found in real Purkinje cells. A similar approach could be used to study the wide variety of neurons found in different brain regions.
1 Introduction
Neurons have a wide range of shapes, sizes, and intrinsic properties, and have correspondingly specialized functions. In particular, dozens of nonlinear membrane conductances have been characterized that are found in different combinations at different spatial locations. These segregated nonlinear mechanisms, coupled with complex dendritic morphologies, make it possible in principle for single neurons to compute spatiotemporal correlations of very high order. A large network of semilinear processing units, familiar in connectionist models, would be needed to provide equivalent computational power. Cerebellar Purkinje cells have large, complex dendritic trees with a variety of active membrane conductances inhomogeneously distributed over the dendrites and soma. In vitro intrasomatic and intradendritic recordings have been used to characterize these conductances (Llinas et al. 1980b; Llinas and Sugimori 1980a; Hounsgaard and Midtgaard 1988). The question of whether these conductances are sufficient to account for the observed responses of Purkinje cells can be addressed by incorporating them into compartmental models of reconstructed neurons.
Neural Computation 3, 321-332 (1991) © 1991 Massachusetts Institute of Technology
Table 1: Rate Constants.

$C_m$ = 1 μF/cm²
$r_i$ = 225 Ω-cm
Soma $g_{leak}$ = 1.32 mS/cm²
Dendritic $g_{leak}$ = 0.0219 mS/cm²
$g_{Na}$ = 40 mS/cm²
$g_{Nap}$ = 0.25 mS/cm²
$g_K$ = 2 mS/cm²
$g_{KCa}$ = 0.1 mS/cm²
$g_{Kd}$ = 2 mS/cm²
$g_{Ca}$ = 2 mS/cm²
$g_{Cas}$ = 0.03 mS/cm²
$[Ca]_i$ decay rate = 5 msec⁻¹
Resting $[Ca]_i$ = 50 nM

$V_0$ (mV)    $R_\alpha$ (msec⁻¹ mV⁻¹)    $R_\beta$ (msec⁻¹)    $R_\delta$ (msec⁻¹)    $\gamma$ (msec⁻¹)
-50         0.13                       0                   0.1                  7
-50         0.08                       0.05                0.1                  0.1
-57         0.0009                     0.005               0                    0
-30         0.06                       0.05                0.1                  5
-30         0.04                       0.02                0                    0
-55         0.0006                     0.01                0                    0
-55         $\alpha = 100 \times [Ca]_i$    0.07                0                    0
Unfortunately, the parameters that characterize Hodgkin-Huxley channel kinetics are often incomplete or inadequate. In this paper we adopt a simpler kinetic scheme that is much easier to fit to existing data and accurately captures the essential intrinsic properties of the channels. The simplified kinetic scheme introduced here has the additional advantage that it allows accurate simulations of realistic neurons to be run much faster than with Hodgkin-Huxley kinetics that have multiple closed states. This speedup is important when many neurons must be simulated simultaneously in model neural networks.
2 Methods
The compartmental modeling technique has been well studied and can be used to explore the electrotonic properties of morphologically accurate neuron models (Rall 1964; Jack et al. 1975; Segev et al. 1989). Our simulations were based on a single cerebellar Purkinje cell reconstructed by Shelton (1985) (Fig. 1). The model consisted of 1089 compartments that contained active conductances consistent with data from the literature. The passive membrane parameters used in the model were those used by Shelton (Table 1). Note that the membrane resistance of the soma was 60 times lower than that of the dendrites.
Figure 1: Morphology of rat cerebellar Purkinje cell (reprinted with permission from Shelton 1985). (A) Soma and proximal (smooth) dendrites are stippled. Spiny dendrites are drawn as lines. (B) Smooth dendritic tree only. Spiny dendritic tree attachments are numbered counterclockwise. Soma is hatched. The soma contains potassium and sodium conductances, smooth dendrites contain fast calcium and potassium conductances, and spiny dendrites contain slow calcium and potassium conductances (see text).
The soma contains a fast, inactivating sodium conductance, $g_{Na}$, responsible for the upstroke of the action potential, a fast potassium conductance, $g_{Kd}$, responsible for AP repolarization, and a low-threshold, slow, plateau sodium conductance, $g_{Nap}$. There are large, fast calcium conductances, $g_{Ca}$, on the proximal (smooth) dendrites that cause discrete dendritic calcium spikes. These calcium spikes are repolarized by large potassium conductances, $g_K$. The spiny dendrites contain smaller, slower calcium conductances, $g_{Cas}$, and a slow calcium-dependent potassium conductance, $g_{KCa}$. See Table 1 for conductance values, $g_x$.
The Hodgkin-Huxley model of the squid axon is the starting point for most biophysical models of neurons (Hodgkin and Huxley 1952). The time- and voltage-dependent kinetics of the ionic conductances can be depicted as a Markov process (Hille 1984) (Fig. 2a). The following equations describe the transitions between the open/closed states, $m$, and the active/inactive states, $h$.
$$\frac{dm}{dt} = \alpha(1 - m) - \beta m$$
$$\frac{dh}{dt} = \gamma(1 - h) - \delta h$$
In the Hodgkin-Huxley system the inactivation of the channel is independent of its activation. It is generally assumed that the rate constant of channel activation, $\alpha$, is much larger than that of inactivation, $\gamma$. That is, channel opening is fast and channel closing (inactivation) is slow, so the decay of the macroscopic current is governed by $\gamma$. However, single channel patch clamp data from mammalian sodium channels (Aldrich et al. 1983) indicate that inactivation follows activation after a very short latency in a voltage-independent step. Thus, inactivation is coupled to activation: $\gamma$ is not dependent on voltage and is much larger than the activation rate constant. The decay of the macroscopic current is governed by $\alpha$ (slow activation). These changes are incorporated into the model kinetics (Fig. 2b). Note that the same macroscopic current can be produced by different microscopic kinetics (Hille 1984; Kienker 1989). In our model it is assumed that there are no transitions from the inactivated to the open state or from the closed to the inactivated state. For conductances that do not inactivate, the kinetics are reduced to a two-state system with two voltage-dependent rate constants. The following equations describe the state transitions of our model kinetics.
$$\frac{dC}{dt} = \beta O - \alpha C + \delta X$$
$$\frac{dO}{dt} = \alpha C - \beta O - \gamma O$$
$$X = 1 - (C + O)$$
as shown in Figure 2b.
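The three-state scheme above (closed $C$, open $O$, inactivated $X$) is simple enough to integrate directly. The sketch below uses forward Euler with rate constants held fixed, as they would be at a constant clamp voltage. The function name, the time step, and the particular rate values are assumptions made here for illustration; in particular the value of $\alpha$ is arbitrary and is not derived from the model's voltage dependence.

```python
def simulate_three_state(alpha, beta, gamma, delta, t_end=20.0, dt=0.001):
    """Forward-Euler integration of the closed/open/inactivated scheme.

    Rates are in msec^-1 and held constant (i.e., a fixed clamp voltage).
    Returns the final C, O, X fractions and the peak open fraction.
    """
    C, O = 1.0, 0.0                      # start with every channel closed
    peak_open = 0.0
    for _ in range(round(t_end / dt)):
        X = 1.0 - (C + O)                # inactivated fraction from conservation
        dC = beta * O - alpha * C + delta * X
        dO = alpha * C - beta * O - gamma * O
        C += dt * dC
        O += dt * dO
        peak_open = max(peak_open, O)
    return C, O, 1.0 - (C + O), peak_open


# Illustrative (assumed) rates: fast voltage-independent inactivation,
# slower activation, slow recovery from inactivation.
C_end, O_end, X_end, O_peak = simulate_three_state(
    alpha=2.0, beta=0.0, gamma=10.0, delta=0.05)

# Steady state from setting both derivatives to zero:
# O = 1 / (1 + (beta + gamma)/alpha + gamma/delta)
O_steady = 1.0 / (1.0 + (0.0 + 10.0) / 2.0 + 10.0 / 0.05)
print(O_peak, O_end, O_steady)
```

With inactivation much faster than activation, the open fraction rises to a transient peak and then settles to a small steady-state value, which is the qualitative shape of an inactivating current under a voltage step.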
Figure 2: (A) Hodgkin-Huxley kinetics for the spike sodium conductance represented as a Markov process. Activation and inactivation are independent processes. (B) The simplified kinetics used in our model (derived from Aldrich et al. 1983). Inactivation is coupled to activation with a voltage-independent step. There are no transitions from the inactivated state to the open state nor from the closed to the inactivated state. See text for the equations describing the voltage-dependent transitions between the states. These kinetics were used for all conductances in the model. (C) The voltage dependence of the Hodgkin-Huxley activation rate constant is shown above, and the linear simplification used in our model is shown below.
In the traditional Hodgkin-Huxley model of the sodium channel the rate constants $\alpha$, $\beta$, $\gamma$, and $\delta$ are determined empirically by fitting voltage-clamp data with equations of the form
where the constants A, B, C, D, and F are different for each rate constant. A very accurate fit to current clamp data can be obtained provided complete voltage clamp data are available for the conductance in question. If complete data are not available then it is difficult to modify an existing set of Hodgkin-Huxley conductance parameters to obtain even slightly different behavior. In our model the rate constants are directly or inversely proportional to voltage, $V_m$ (Fig. 2c).
where voltages are in millivolts. The rate constants $\beta$ and $\delta$ were constrained to be less than or equal to $R_{\beta,\delta}$ at all times. Since the threshold $V_0$ for each conductance is fairly easy to establish from current clamp data, adjusting the rate constants is just a matter of varying the slope of their voltage dependence, $R$ (Table 1). Thus it is easy to fit the behavior of the conductance to any desired form. $\alpha$ is the most important parameter as it is the primary determinant of the activation and decay rates of the conductance transient. A large $\beta$ reduces the activation rate and steady-state conductance, while a large $\delta$ prolongs the duration of an inactivating conductance transient. An inactivating conductance with a small $\delta$ takes a long time to recover after activation, and so does not function well at high frequencies. The linear simplification of the rate constants inevitably reduces the accuracy of the model kinetics. However, the original Hodgkin-Huxley model was formulated to describe the conductances underlying the action potential, an invariable event produced by conductances unchanged from cell to cell. Conductances with longer time constants, responsible for determining the excitability and interspike intervals of the cell, are much more variable between cells and even within the same cell over time. In practice, the general behavior of each conductance in our model was well captured by a system with linear rate constants, a system that is computationally both simple and fast. The calcium-dependent potassium conductance is not voltage dependent. The activation rate constant for this conductance in our model was proportional to intracellular calcium concentration rather than voltage. Calcium entry into a compartment was calculated from the calcium current. Intracellular calcium decayed exponentially to the resting value (Table 1). Under voltage clamp the activation of many ionic currents is sigmoidal with respect to time. To model this in the Hodgkin-Huxley system, the activation state variable ($m$) is raised to a power when calculating the ionic current ($I$), simulating closed-state transitions.
$$I = g_x m^3 h \, (E_x - V_m)$$
where $E_x$ is the reversal potential and $g_x$ is the maximum conductance of the ionic current. This could be accomplished in our model by adding extra closed states, but this would add an extra step to the calculations and extra rate constants that would be hard to constrain with existing physiological data. Consequently, in our model the ionic current is calculated as follows.
We simulated this compartmental model using CABLE, written by Michael Hines (Hines 1989) and further modified by Jack Wathey and William Lytton. The simulations were run on a MIPS RC3240. Simulation of 100 msec of model time required about 5 min of computation.
3 Results and Discussion
As a preliminary test of our simplified channel kinetics, we compared data from mammalian sodium current transients in response to various voltage steps (Aldrich et al. 1983) with the simulated sodium current transients of our model ($V_0 = -50$ mV, $R_\alpha = 0.04$ msec$^{-1}$ mV$^{-1}$, $R_\beta = 0$ msec$^{-1}$, $R_\delta = 0.05$ msec$^{-1}$, $\gamma = 10$ msec$^{-1}$). The model reproduced the physiological data except for the slow rise phase of the current at low depolarizations (Fig. 3). This slow (sigmoidal) rise is due to transitions between closed states not included in the model, as discussed in the methods. However, leaving multiple closed states out of the model did not prevent us from achieving a close fit between the model Purkinje cell responses and the in vitro data (see below).
Figure 4A shows an intracellular recording from the soma of a turtle cerebellar Purkinje cell in response to a somatically injected depolarizing current (Hounsgaard and Midtgaard 1988). Although the dendritic morphology of the turtle Purkinje cell is not as complex as that of mammalian Purkinje cells, the firing pattern is not significantly different (Llinas and Sugimori 1980b). This indicates that the firing pattern is not dependent on the exact morphology of the neuron.
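The sigmoidal rise mentioned above can be illustrated with a minimal sketch of the Hodgkin-Huxley power-law device: a first-order gate rises with a finite slope at the step onset, whereas the same gate raised to the third power starts with zero slope, which gives the delayed, sigmoidal onset. The function names (`gate`, `normalized_current`, `initial_slope`) and the unit time constant are illustrative assumptions, not part of the model described here, and the sketch ignores inactivation.

```python
import math

def gate(t, tau=1.0):
    """First-order activation variable relaxing from 0 toward 1 (assumed form)."""
    return 1.0 - math.exp(-t / tau)

def normalized_current(t, power, tau=1.0):
    """Current time course, normalized to its final value, for a gate raised to `power`."""
    return gate(t, tau) ** power

def initial_slope(power, tau=1.0, h=1e-4):
    """Finite-difference estimate of the slope of the normalized current at t = 0."""
    return (normalized_current(h, power, tau) - normalized_current(0.0, power, tau)) / h

if __name__ == "__main__":
    # Compare a single gate (exponential rise) with a cubed gate (sigmoidal rise).
    for t in (0.1, 0.5, 1.0, 3.0):
        m1 = normalized_current(t, 1)
        m3 = normalized_current(t, 3)
        print(f"t={t:3.1f}  m={m1:.4f}  m^3={m3:.4f}")
```

Under these assumptions the cubed gate lags well behind the single gate shortly after the step and then catches up, which is the qualitative feature missing from a scheme with a single closed state.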
Figure 3: Voltage clamp responses of mammalian and model sodium spike conductances. (A) Averaged single channel currents for steps to different command voltages (shown as millivolts above threshold on the right) (reprinted with permission from Aldrich et al. 1983). (B) Responses of model sodium conductance to same voltage steps as in (A). The increased peak height and faster decay of the current transient with depolarization are common to both the model and the physiology, as is the decrease of the peak height as the command voltage approaches the sodium reversal potential (last two traces). The model, however, does not show the slow rise of the current transient seen at low depolarizations by the real cell. The slow rise is due to voltage-dependent transitions between closed states, which are not included in the model. Time scale bar is 7.5 msec for all traces except top trace, for which it is 15 msec.
Figure 4B shows the response of the model to a simulated somatic injection of depolarizing current, recorded at the soma. The pattern of spikes is similar to that displayed by real Purkinje cells in response to depolarizing current (Fig. 4A): A slow depolarization due to sodium and calcium plateau currents causes an accelerating train of sodium-dependent action potentials at the soma. A high-threshold calcium-dependent spike is triggered in the dendrites just at the point of inactivation of the sodium spike train. Voltage- and calcium-dependent potassium currents then produce a large hyperpolarization, which deinactivates the sodium spikes, allowing the cycle to begin again. After the depolarizing current is turned off there is an "afterdischarge" of spikes due to residual activation of the plateau currents.
Figure 4: Intracellular recordings from a turtle cerebellar Purkinje cell in response to a somatically injected constant depolarizing current pulse are shown on the left (reprinted with permission from Hounsgaard and Midtgaard 1988). (A) Recording from the soma. Somatic sodium spikes ride on a slow depolarization due to plateau currents. When sodium spikes are voltage inactivated the membrane potential reaches the threshold of the dendritic calcium spike conductance. The resulting calcium spike is repolarized by potassium conductances, which resets the sodium spiking. (C,E) Recordings from proximal and distal dendrites, respectively. The small size of the sodium spikes and relatively large calcium spikes reflect their respective sites of generation. (B,D,F) Model responses from soma and proximal and distal dendrites, respectively, are shown on the right. The model replicates all the essential features of Purkinje cell behavior. The duration of the stimulus is shown beneath each trace. Firing continues after the stimulus is turned off due to the continued activation of plateau currents, as seen in real cells (Llinas et al. 1980b).
Figure 4C, D and E, F show comparisons between intracellular recordings and model recordings in a proximal dendrite and a distal dendrite, respectively. The sodium spikes become smaller as they passively propagate into the dendrites, confirming their somatic origin. The calcium spikes are much larger in the dendrites than the soma, reflecting their site of generation. The calcium spikes of the model are not the “doublets” seen in real cells. It is possible that these multiple spikes are the result of inhomogeneities in the density of calcium channels over the proximal dendrites (Llinas and Sugimori 1980b). The calcium channels responsible for the spiking in the model were homogeneously distributed over the proximal dendrites.
Our model was not designed to address questions concerning the detailed spatial localization of channels or the subtleties of intradendritic calcium dynamics. Instead it is aimed at developing “unit cells” for use in physiologically realistic network models. If some of these properties are later found to be important for information processing (as they are likely to be) then the model can be modified appropriately. We expect that the spiking pattern displayed by our model could be reproduced by a geometrically simplified neuron (Bush and Douglas 1991), though not if the model were reduced to just a single compartment. Such a simplification would significantly increase the speed of the model, which would be important for network simulations incorporating many such neurons.
Although no “A”-like potassium conductance was explicitly included in the model, we found that to allow the depolarization to continue to calcium spike threshold after inactivation of the sodium spikes, it was necessary to make the delayed rectifier potassium conductance inactivate slowly with depolarization. This is the essential characteristic of the “A” conductance (Connor and Stevens 1971).
Thus the model predicts that the Purkinje cell delayed rectifier responsible for repolarization after a sodium spike has “A”-like properties.
4 Conclusion
Our model builds on previous models of Purkinje cells (Shelton 1985; Segev et al. 1991). The model demonstrates that the conductances characterized to date are sufficient to produce the cyclical firing pattern generated by Purkinje cells in response to constant depolarizing current input. The channel kinetics used in the model, differing significantly from Hodgkin-Huxley kinetics, are useful in fitting complex response functions due to many incompletely characterized conductances. These simplified kinetics can be rapidly and easily tuned to simulate the intrinsic behavior of any neuron. Such single neuron models can then be incorporated into model networks where speed and simplicity of operation are more important characteristics of the component neurons than the detailed performances of their ionic conductances. These models
would also be easier to analyze with phase planes than would full-scale Hodgkin-Huxley models.
Acknowledgments
This work was started at the “Methods in Computational Neuroscience” MBL short course taught by James Bower and Christof Koch. We thank Tony Bell, Paul Rhodes, and Shawn Lockery for helpful discussions and technical assistance.
References
Aldrich, R. W., Corey, D. P., and Stevens, C. F. 1983. A reinterpretation of mammalian sodium channel gating based on single channel recording. Nature (London) 306, 436-441.
Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3(1), 19-30.
Connor, J. A., and Stevens, C. F. 1971. Voltage clamp studies of a transient outward membrane current in gastropod neural somata. J. Physiol. 213, 21-30.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer Associates, Sunderland, MA.
Hines, M. L. 1989. A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comp. 24, 55-68.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Hounsgaard, J., and Midtgaard, J. 1988. Intrinsic determinants of firing pattern in Purkinje cells of the turtle cerebellum in vitro. J. Physiol. 402, 731-749.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electric Current Flow in Excitable Cells. Oxford University Press, Oxford.
Kienker, P. 1989. Equivalence of aggregated Markov models of ion-channel gating. Proc. R. Soc. Lond. B 236, 269-309.
Llinas, R., and Sugimori, M. 1980a. Electrophysiological properties of in vitro Purkinje cell dendrites in mammalian cerebellar slices. J. Physiol. 305, 197-213.
Llinas, R., and Sugimori, M. 1980b. Electrophysiological properties of in vitro Purkinje cell somata in mammalian cerebellar slices. J. Physiol. 305, 171-195.
Rall, W. 1964. Theoretical significance of dendritic tree for input-output relation. In Neural Theory and Modeling, R. Reiss, ed., pp. 73-97. Stanford University Press, Stanford.
Segev, I., Fleshman, J. W., and Burke, R. E. 1989. Compartmental models of complex neurons. In Methods in Neuronal Modeling: From Synapse to Networks, C. Koch and I. Segev, eds., pp. 63-96. MIT Press, Cambridge, MA.
Segev, I., Rapp, M., Manor, Y., and Yarom, Y. 1991. Analog and digital processing in single nerve cells: dendritic integration and axonal propagation. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, New York.
Shelton, D. P. 1985. Membrane resistivity estimated for the Purkinje neuron by means of a passive computer model. Neurosci. 14, 111-131.
Received 17 January 1991; accepted 22 April 1991.
Communicated by Christof Koch
On the Mechanisms Underlying Directional Selectivity
H. Ogmen
Department of Electrical Engineering, University of Houston, Houston, TX 77204-4793 USA
Recent efforts in the understanding of motion detection and directional selectivity include electrophysiological studies using single photoreceptor stimulations and a combination of electrophysiology and neuropharmacology. Results of the former have been interpreted in favor of facilitatory motion detection models while results of the latter have been interpreted in favor of inhibitory models. In this paper, this conflicting data interpretation problem is addressed by mathematically modeling some effects of neuropharmacological substances and by applying this formalism to a neural network model of directionally selective motion perception. The study offers a possible resolution to the paradox.
1 Introduction
Early models of motion detection and directional selectivity (DS)¹ emphasized functional aspects of neural computations rather than their neural correlates (e.g., Reichardt 1961). On successful verification of predictions of these models in many experimental paradigms, experimental work has been directed toward the understanding of neural mechanisms underlying functional properties of these phenomenological models. In the spirit of the minimal formulation of Reichardt, many authors considered “minimal neural realizations” and classified neural models into two categories: facilitatory and inhibitory models (Fig. 1). In both cases one needs a single (nonlinear) cell and two spatially distinct input channels to achieve directionally selective motion detection. In the facilitatory model (Fig. 1A), both channels make excitatory connections with the elementary motion detection cell (EMD). The output of one channel is delayed by $\Delta t$. If an input signal perturbs this channel first and then moves to the next one and if the time lapse between the excitation of channels (which is determined by the spatial separation of sampling channels and the velocity of the stimulus) is close to
¹Abbreviations: DS, directional selectivity; EMD, elementary motion detector; EPSP, excitatory postsynaptic potential; ERG, electroretinogram; GABA, γ-aminobutyric acid.
Neural Computation 3, 333-349 (1991) © 1991 Massachusetts Institute of Technology
Figure 1: Excitatory (A) and inhibitory (B) motion detection models. The arrows on top show the preferred and null directions for the H1 cell. Inputs from two neighboring spatial locations are combined at subunits (SU). The outputs of SUs are connected to the H1 cell (bottom). Adapted from Schmid and Bülthoff (1988).
$\Delta t$, then the EMD will register temporally overlapping excitatory postsynaptic potentials (EPSP). The threshold of the EMD is chosen such that only overlapping EPSPs can yield an output signal. Then, a motion in this direction (preferred direction) within a given range of velocities will be detected by the EMD (facilitation in the preferred direction). If the stimulus moves in the opposite direction, EPSPs will not overlap and the cell will stay silent. The EMD of an inhibitory model (Fig. 1B) receives an inhibitory signal from one channel and an excitatory signal from the
second. The inhibitory signal is assumed to last longer than the excitatory one. If a stimulus excites the excitatory channel first and then the inhibitory one, since the inhibition arrives after the excitation, the cell will respond (preferred direction). In the opposite direction (null direction), the stimulus excites the inhibitory channel first and then the excitatory channel. The inhibitory signal arrives before and lasts longer than the excitatory signal. The cell will remain silent because the effects of the excitatory input are cancelled by those of the inhibitory input (vetoing in the null direction). Finally, the outputs of two mirror image like EMDs are combined at another cell in an opponent fashion.
Riehle and Franceschini used a special apparatus to stimulate single photoreceptors on the compound eye of the fly (Riehle and Franceschini 1984). By using this technique, they were able to activate single cartridges (columns of the first optic ganglion, the lamina) and monitor properties of DS by recording the activity of a wide-field motion sensitive neuron (H1) by extracellular electrodes. This neuron is located in the lobula plate (posterior part of the third optic ganglion) and is sensitive to regressive motion (horizontal motion from the back to the front of the animal). By using sequential stimuli emulating apparent motion they found that:
1. Stimulation of one cartridge alone did not elicit any response unless the stimulus intensity was very high.
2. Synchronous stimulation of cartridges did not elicit any response.
3. Stimulation of two photoreceptors projecting to two neighboring cartridges elicited a strong response in the H1 cell when the temporal ordering of stimuli corresponded to regressive motion (preferred direction). The spike discharge was an on-off response locked into the second stimulus, and this response was released only when facilitated by the first stimulus. The cell responded with or without temporal overlap of stimuli. The facilitatory effect of the first spot was graded and lasted about 150 msec (Franceschini 1985).
4. Stimulation in the opposite sequence, that is, corresponding to the progressive (null) direction, yielded a similar response with an opponent effect: a transient on-off reduction of the spike discharge was observed when the cell had a spontaneous activity.
These results favor a facilitatory model such as the model of Figure 1A over the inhibitory model of Figure 1B.
Another line of experiments combined neuropharmacology and electrophysiology to study the role of inhibition in motion detection in the insect visual system (Schmid and Bülthoff 1988). An antagonist of the inhibitory neurotransmitter GABA, picrotoxinin, was used to block the functioning of the inhibitory interactions between neurons and the activity of the H1 neuron was recorded under this condition. Injection of GABA (200 nl) caused a decrease in the spike rate for both resting activity and responses to motion in the preferred direction. The
response for the null direction stayed at zero. The cell recovered slowly and complete recovery was reached in about 35 min after injection. Injection of picrotoxinin at small amounts (0.5-1 nl) to lobula or medulla caused a general increase in spike activity. The cell began to respond to motion in the null direction but the amplitude of the responses was higher for the preferred direction than the null direction; thus DS was preserved although the cell responded to motion in both directions. For medium amounts (1-3 nl) basically similar effects were observed. However, in the case where picrotoxinin was injected to lobula the null response became equal to the preferred one shortly after injection (1-2 min). Thus, H1 lost its directional selectivity. The abolishment of DS lasted about 1 to 2 min. At high amounts (3-8 nl) the detected spike activity increased for 2 min and then completely disappeared for the following 5 min. Then, spike activity started to increase again. The null response became higher than the preferred response at 8 min after injection for the following 4-5 min. Thus, directional selectivity of H1 was inverted during this period. Complete recovery was obtained 60 min after injection.
Based on the simple models of Figure 1 these authors concluded that their results favored an inhibitory scheme rather than a facilitatory one. Furthermore, they modified the inhibitory model by adding extra EMDs so that stimulation of a single channel activates two EMDs with opposite preferred directions, thereby eliminating the sensitivity of the model to single channel inputs to make it compatible with Riehle-Franceschini data. However, such a modification does not make the inhibitory model fully compatible with Riehle-Franceschini data because its predictions are inconsistent with responses obtained for overlapping stimuli. The inversion of DS poses a serious problem to both of these simplified schemes.
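The two simplified schemes of Figure 1 can be sketched as a toy discrete-time simulation. This is only an illustration under stated assumptions: postsynaptic potentials are rectangular pulses of unit height, and the function names (`pulse`, `facilitatory_emd`, `inhibitory_emd`) as well as all delays, durations, and thresholds are hypothetical values chosen for the example, not parameters taken from any of the cited studies.

```python
def pulse(onset, duration, n_steps):
    """Rectangular unit-height postsynaptic potential (toy assumption)."""
    return [1.0 if onset <= t < onset + duration else 0.0 for t in range(n_steps)]

def facilitatory_emd(t_a, t_b, delay=5, duration=4, threshold=1.5, n_steps=60):
    """Fig. 1A-style scheme: channel A is delayed; only overlapping EPSPs cross threshold."""
    a = pulse(t_a + delay, duration, n_steps)
    b = pulse(t_b, duration, n_steps)
    return any(x + y > threshold for x, y in zip(a, b))

def inhibitory_emd(t_exc, t_inh, exc_duration=4, inh_duration=12,
                   threshold=0.5, n_steps=60):
    """Fig. 1B-style scheme: longer-lasting inhibition vetoes excitation that arrives later."""
    exc = pulse(t_exc, exc_duration, n_steps)
    inh = pulse(t_inh, inh_duration, n_steps)
    return any(e - i > threshold for e, i in zip(exc, inh))

if __name__ == "__main__":
    # The stimulus reaches one channel at t=10 and the other at t=15;
    # swapping the two times reverses the direction of motion.
    print("facilitatory preferred:", facilitatory_emd(t_a=10, t_b=15))
    print("facilitatory null:     ", facilitatory_emd(t_a=15, t_b=10))
    print("inhibitory preferred:  ", inhibitory_emd(t_exc=10, t_inh=15))
    print("inhibitory null:       ", inhibitory_emd(t_exc=15, t_inh=10))
```

In this toy version both schemes respond in one direction only, and the facilitatory detector stays silent for synchronous stimulation. Neither contains a mechanism by which weakening inhibition could make the null response exceed the preferred one, which is the difficulty raised by the inversion of DS.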
Although it is not perfectly clear whether the inversion of directional selectivity is due to the properties of the motion detection network or to the action of the picrotoxinin on the integrating cell H1, this point deserves attention because none of the simplified models of Figure 1 can explain this result. Moreover, the inversion of directional selectivity was also reported in behavioral experiments (Bülthoff and Bülthoff 1987) and does not seem to be a peculiarity limited to the H1 cell. Thus, if this result is really due to the properties of the neural network, rather than being an experimental artifact, both of the simplified models can be rejected by the pharmacological experiments alone.
In this paper, this apparent conflict between recent experimental data is addressed. Although there are many experimental studies on motion perception in the fly, these two experiments are particularly important because they offer critical tests to two classes of models. We model effects of neuropharmacological agents in neural networks and apply this formalism to a directionally selective motion detection model proposed for the fly visual system (Ogmen and Gagné 1988, 1990b). Since a good agreement has already been found between predictions of this model and Riehle-Franceschini data (Ogmen and Gagné 1990b), we focus here
on the pharmacological properties of the model and compare its predictions with Schmid-Bülthoff data as an attempt to resolve the "conflict." For a review of invertebrate neuropharmacology the reader is referred to Leake and Walker (1980) and Hardie (1989). The neural network model for directionally selective units of the fly will be briefly described in the following section. For a detailed analysis of the model as well as comparisons with other experimental data and models the reader is referred to Ogmen and Gagné (1990b).
2 A Directionally Selective Motion Detection Model
Figure 2 illustrates the neural model for directionally selective motion detection (Ogmen and Gagné 1988, 1990b). The input signals are first processed by sustained (boxes marked S) and transient (boxes marked 0) channels. The output of the sustained unit is passed through a cell with slow dynamics (box marked D). This first stage of processing realizes spatial and temporal adaptations. Unlike many versions of the Reichardt model, these "filters" are nonlinear and time-variant. Corresponding neural correlates for S and 0 units are the spiking neurons recorded by Arnett (1972) and conjectured to be L4 and L5 neurons of lamina (Laughlin 1984; Shaw 1981, 1984). Analysis of models for these units and comparisons with electrophysiological data can be found in Ogmen and Gagné (1990a). The outputs of neighboring sustained and transient units are combined nonlinearly at the elementary motion detector cell (EMD). The mechanism yielding directional selectivity is similar to the excitatory model of Figure 1A: Temporally overlapping EPSPs exceed the threshold while nonoverlapping EPSPs remain subthreshold. However, the early processing differs in a crucial way from the simple models of Figure 1 and their elaborated versions (Schmid and Bülthoff 1988; Franceschini et al. 1989) due to nonlinear intensity dependence, nonlinear signal rectification, sustained/transient channel interactions (dual channel model), and adaptivity. These properties are achieved by use of extensive inhibition. Thus, although inhibition is absent in the elementary motion detection stage of the facilitatory model, it plays an important role in our model, and the purpose of this study is to establish whether the pharmacological findings can be explained based on pharmacological alterations of these inhibitory interactions.
Figure 2: Neural model for directionally selective motion detection in the visual system of the fly. Sustained units (boxes marked S) and transient units (boxes marked 0) make excitatory connections with the elementary motion detectors (boxes marked EMD). EMDs responding to regressive (progressive) motion make excitatory (inhibitory) connections to the H1 cell. Sustained units are laterally connected through inhibitory synapses. Box D represents a cell with slow dynamics. From Ogmen and Gagné (1990b).
Finally, EMDs responding to regressive (progressive) motion make excitatory (inhibitory) connections to the wide-field cell H1. For a detailed exposition of this model and comparisons with Riehle-Franceschini data the reader is referred to Ogmen and Gagné (1990b). The mathematical description of the model is provided in Appendix A, and to illustrate model equations and motivate the pharmacological modifications we will briefly introduce the equation describing the dynamics of the sustained cell. (The activity of a cell in a Box S in Figure 2 is denoted by $x_i$. Symbols S and D in the equation below have different meanings than those in Figure 2.)
$$\frac{dx_i}{dt} = -A x_i + (B - x_i) S_i^+ - (D + x_i) S_i^- \tag{2.1}$$
This shunting equation (Grossberg 1973) is a membrane equation describing the time domain behavior of the membrane potential $x_i$ of the $i$th sustained unit. Its rate of change is controlled by three terms. The term $(B - x_i) S_i^+$ represents the effect of depolarizing inputs: $S_i^+$ is the total transmitter-gated excitatory input to the $i$th cell (conductance modulated by the input), and $B$ ($B > 0$) is the upper saturation level (Nernst potential) of $x_i$. The term $-(D + x_i) S_i^-$ accounts for the effect of hyperpolarizing inputs, with $S_i^-$ representing the total transmitter-gated inhibitory input to the $i$th cell, and $-D$ ($D > 0$) is the lower saturation level of $x_i$. Finally, the term $-A x_i$ represents the effect of passive channels ($A > 0$).
3 Mathematical Characterization of Pharmacological Effects
Two major theories have been proposed for the kinetics of physiological responses induced by drugs. Both postulate that there exist reversible reactions between drug and receptor molecules and that the law of mass action applies. However, the occupation theory (Stephenson 1956; Ariens 1966) links the physiological response to the proportion of receptors occupied by the drug while the rate theory (Paton 1961) attributes the response to the process of occupation. Thus, according to the rate theory, the response to the drug is proportional to the rate of drug-receptor association. However, both theories predict equivalent equilibrium response-dose relationships and the response is a strictly increasing function of the dose. In addition to these local drug-receptor interactions, the diffusion of the drug in the tissue should be characterized by taking into account diffusion barriers and tissue properties. However, given the difficulties associated with the characterization of these local properties and the lack of precise anatomical corroboration of subelements in the model, a macroscopic, lumped description of drug action becomes desirable. We will use a simple formalism that can be reduced to microscopic interactions by using the Michaelis-Menten kinetic model and mass action laws. We introduce a coefficient $\Phi$ that depends both on the amount of substance injected into the brain and on time. This coefficient gates the interactions carried out by the pharmacologically targeted neurotransmitter. Accordingly, equation 2.1 becomes
$$\frac{dx_i}{dt} = -A x_i + (B - x_i) S_i^+ - \Phi \, (D + x_i) S_i^- \tag{3.1}$$
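The effect of the gating coefficient on a single shunting unit can be sketched numerically. The sketch below assumes that the coefficient multiplies the inhibitory term of equation 2.1, that is, dx/dt = -A x + (B - x) S_exc - phi (D + x) S_inh with constant inputs; the function names, the forward Euler scheme, and all numerical values are illustrative assumptions rather than parameters of the model.

```python
def shunting_steady_state(A, B, D, s_exc, s_inh, phi=1.0):
    """Equilibrium of the gated shunting equation for constant inputs."""
    return (B * s_exc - D * phi * s_inh) / (A + s_exc + phi * s_inh)

def simulate_shunting(A, B, D, s_exc, s_inh, phi=1.0, x0=0.0, dt=1e-3, t_end=20.0):
    """Forward Euler integration of dx/dt = -A x + (B - x) s_exc - phi (D + x) s_inh."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        dx = -A * x + (B - x) * s_exc - phi * (D + x) * s_inh
        x += dt * dx
    return x

if __name__ == "__main__":
    params = dict(A=1.0, B=1.0, D=0.5, s_exc=2.0, s_inh=1.0)  # arbitrary toy values
    for phi in (0.5, 1.0, 2.0):  # antagonist-like, normal, agonist-like
        x_sim = simulate_shunting(phi=phi, **params)
        x_eq = shunting_steady_state(phi=phi, **params)
        print(f"phi={phi:3.1f}  simulated={x_sim:.4f}  equilibrium={x_eq:.4f}")
```

With these toy values the activity stays between -D and B and falls as the coefficient grows, so weakening the inhibitory transmitter raises the activity and strengthening it lowers the activity.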
Similarly, equations in Appendix A are modified to include the coefficient $\Phi$ gating the inhibitory interactions using the targeted transmitter. We will simply assume that the neuropharmacological effect is a strictly increasing function of the injected dose and decays slowly with time (possibly due to a combination of diffusion and inactivation processes). We will neglect the delay between drug injection and the first noticeable physiological response. Thus, the following differential equation can give a simple description for the pharmacological manipulation:
$$\frac{d\Phi}{dt} = \varepsilon (1 - \Phi) + G^+ - \Phi G^- \tag{3.2}$$
where $\Phi$ is the pharmacological coefficient, $G^+$ and $G^-$ represent, respectively, the dose of injected agonist and antagonist of the neurotransmitter, and $\varepsilon$ is the macroscopic passive decay constant. Hence, $\Phi$ has the following interpretation: normal conditions, $\Phi = 1$; injection of the antagonist, $0 \le \Phi < 1$; and injection of agonist, $\Phi > 1$. Injection of GABA is characterized by putting $G^- = 0$ and $G^+(t) = G_0^+$ for $0 \le t \le t_0$ and 0 otherwise. Then, solving equation 3.2 and approximating $x_i$ at its equilibrium value one finds
$$x_i \approx \frac{B S_i^+ - D \, \Phi(t) \, S_i^-}{A + S_i^+ + \Phi(t) \, S_i^-} \tag{3.3}$$
$$\Phi(t) = \begin{cases} 1 & \text{for } t < 0 \\ \Phi_1(t) = \dfrac{\mu}{\lambda}\left(1 - e^{-\lambda t}\right) + e^{-\lambda t} & \text{for } 0 \le t \le t_0 \\ 1 - \left(1 - \Phi_1(t_0)\right) e^{-\varepsilon (t - t_0)} & \text{for } t > t_0 \end{cases} \tag{3.4}$$
where $\mu = \varepsilon + G_0^+$ and $\lambda = \varepsilon$. According to equation 3.4, during the injection of GABA $\Phi$ increases monotonically, and once the injection is stopped it decays monotonically with a time constant $\varepsilon$. This is illustrated in the simulations of Figure 3. The agonist is injected at time $t = 5$ min. The variations of the pharmacological coefficient are plotted against time for injection of three different doses (solid curves; the peak increases as the dose increases). Injection of picrotoxinin, which is an antagonist of GABA, is modeled similarly: let $G^+ = 0$ and $G^-(t) = G_0^-$ for $0 \le t \le t_0$ and 0 otherwise. Then equation 3.4 will hold with $\mu = \varepsilon$ and $\lambda = \varepsilon + G_0^-$. After the injection of picrotoxinin, $\Phi$ decreases toward its asymptotic value $\mu / \lambda = \varepsilon / (\varepsilon + G_0^-) < 1$.
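The time course of the pharmacological coefficient can be checked with a short sketch that evaluates the piecewise closed form and compares it with a direct forward Euler integration of d(phi)/dt = eps (1 - phi) + G_plus - phi G_minus, with the dose applied for 0 <= t <= t0. The function names, the step size, and the parameter values are illustrative assumptions; they are not the values used for Figure 3.

```python
import math

def phi_closed_form(t, eps, g_plus=0.0, g_minus=0.0, t0=1.0):
    """Piecewise solution for a constant dose applied during 0 <= t <= t0."""
    if t < 0:
        return 1.0
    mu = eps + g_plus    # constant drive during the injection
    lam = eps + g_minus  # relaxation rate during the injection

    def during(s):
        return (mu / lam) * (1.0 - math.exp(-lam * s)) + math.exp(-lam * s)

    if t <= t0:
        return during(t)
    # After the injection the coefficient relaxes back to 1 at rate eps.
    return 1.0 - (1.0 - during(t0)) * math.exp(-eps * (t - t0))

def phi_euler(t_end, eps, g_plus=0.0, g_minus=0.0, t0=1.0, dt=1e-4):
    """Forward Euler integration of the same equation, starting from phi = 1 at t = 0."""
    phi = 1.0
    for k in range(int(round(t_end / dt))):
        injecting = (k * dt) < t0
        gp = g_plus if injecting else 0.0
        gm = g_minus if injecting else 0.0
        phi += dt * (eps * (1.0 - phi) + gp - phi * gm)
    return phi

if __name__ == "__main__":
    eps = 0.1  # arbitrary decay constant
    for label, kwargs in (("agonist", dict(g_plus=2.0)), ("antagonist", dict(g_minus=2.0))):
        for t in (0.5, 1.0, 3.0, 30.0):
            exact = phi_closed_form(t, eps, **kwargs)
            approx = phi_euler(t, eps, **kwargs)
            print(f"{label:10s} t={t:5.1f}  closed form={exact:.4f}  Euler={approx:.4f}")
```

Under these assumptions the agonist drives the coefficient above 1, the antagonist drives it toward eps / (eps + G_minus), and in both cases it relaxes back to 1 once the injection stops.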
4 Some Predictions of the Model
In this section we study some predictions of the motion detection model coupled with pharmacological equations. Formal results and intuitive explanations are presented and the reader is referred to Appendices A and B, and Ogmen and Gagné (1990b) for mathematical details and proofs. For $g(a)$, the input-output nonlinearity of neurons, a positive semidefinite sigmoidal function admitting the upper bound $U$ is used: $g(a) = 0$ for $a \le 0$; $g$ is faster-than-linear in the interval $[0, A_1]$, linear in $[A_1, A_2]$, and slower-than-linear in $[A_2, \infty[$. We will call the interval $R_{sub} = ]-\infty, 0]$ the subthreshold region, and $R_{sup} = ]0, \infty[$ the suprathreshold region. The suprathreshold region comprises an amplification region $R_a = ]0, A_2]$ and a saturation region $R_s = ]A_2, \infty[$. $g$ is assumed to be strictly increasing in the
Figure 3: Simulations showing the pharmacological coefficient versus time (equation 3.4). Different doses of the agonist and the antagonist are injected at time = 5 min. Once the injection stops, the pharmacological coefficient decays to its normal value 1. There is a strict (nonlinear) relationship between the pharmacological coefficient and the spike frequency (Lemma 1). The inset shows changes in the spike frequency of the H1 cell after the injection of picrotoxinin for stimulation in the preferred (upper trace) and null (middle trace) directions as well as the resting activity (lower trace). The data in the inset are from Schmid and Bülthoff (1988).
suprathreshold region. But since the intracellular activities are bounded by $B$, in order to make all portions of the input-output curve effective, $B$ is chosen such that when the membrane potential tends to its upper limit $B$, the output tends to its upper limit $U$. To relate the time behavior of the pharmacological coefficient $\Phi$ to the DS property of the motion detecting network, one needs to relate variations of $\Phi$ to variations of cell variables, and variations of cell variables to the DS property of the network. Let us denote the activity of sustained, transient, and lobula plate cells by $x$, $y$, and $x_{lob}$, respectively. To obtain nonzero thresholds, we will use a threshold parameter $\Gamma$, which shifts
the function g. The following lemmas provide these relationships:
Lemma 1. x, y, and $x_{lob}$ are decreasing functions of $\Phi$.
Lemma 2. Let $h(a) = g(a - \Gamma_{emd})$. DS is preserved iff
$$D_h[g(x - \Gamma_x),\ g(y - \Gamma_y)] \qquad (4.1)$$
is positive; DS is abolished iff it is zero; and DS is inverted iff it is negative. $D_h$ is the deficiency of h, i.e., $D_h(a, b) = h(a + b) - [h(a) + h(b)]$.
Lemma 3 (Necessary conditions). If DS is inverted then $g(x - \Gamma_x) > \Gamma_{emd}$ and $g(y - \Gamma_y) > \Gamma_{emd}$.
Lemma 1 shows how to relate unambiguously the time behavior of the pharmacological coefficient to that of the activities of sustained, on-off, and H1 neurons while Lemma 2 gives the conditions that relate variations in x and y to the DS property of the network. By Lemma 1 and equation 3.4, injection of GABA suppresses the activities of early units and recovery occurs slowly. The suppression of early units leads to a suppression of activities of EMDs since even in the case of overlapping EPSPs the threshold cannot be exceeded. Therefore, a general decrease in the spike activity of H1 and a recovery with time constant E is predicted. By Lemma 1 and equation 3.4 with the substitutions corresponding to the injection of picrotoxinin, there will be a general increase in the activities of early units as well as lobula units receiving opponent inputs. The implications of these increases are highly dependent on the nonlinear properties of g. Administration of sufficient amounts of antagonistic drugs drives intracellular potentials of these cells into the suprathreshold region of g, thus resulting in spontaneous activity (or an increase in the spontaneous activity if it already exists). Accordingly, EMDs start to respond to motion in both preferred and null directions, since the pooled EPSPs from early units can reach the suprathreshold region even without temporal overlap. Changes in the DS property of the wide-field cells (e.g., H1) are established by the following propositions:
Proposition 1 (Small amounts-preservation). If $[g(x - \Gamma_x) + g(y - \Gamma_y) - \Gamma_{emd}] \in R_a$ then DS is preserved.
Proposition 2 (Large amounts-inversion). Assume that conditions of Lemma 3 hold, and that $g(x - \Gamma_x) \to U$, $g(y - \Gamma_y) \to U$ (large amounts). If $U > A_2 + \Gamma_{emd}$ then DS is inverted.
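The sign test of Lemma 2 can be illustrated numerically. The following is only a sketch under stated assumptions: the sigmoid `g` is a hypothetical Hill-type function (zero below threshold, faster-than-linear near zero, saturating at U, with no strictly linear segment), and the values of `U`, `c`, and `gamma_emd` are arbitrary illustrative choices, not parameters taken from the model. The arguments `a` and `b` stand for the early-unit outputs $g(x - \Gamma_x)$ and $g(y - \Gamma_y)$.

```python
def g(a, U=1.0, c=0.2):
    """Hypothetical sigmoid (an assumption): 0 for a <= 0, bounded above by U."""
    if a <= 0.0:
        return 0.0
    return U * a * a / (c * c + a * a)


def h(a, gamma_emd=0.3):
    """h(a) = g(a - Gamma_emd), as in Lemma 2."""
    return g(a - gamma_emd)


def deficiency(a, b, gamma_emd=0.3):
    """D_h(a, b) = h(a + b) - [h(a) + h(b)]."""
    return h(a + b, gamma_emd) - (h(a, gamma_emd) + h(b, gamma_emd))


def ds_status(a, b, gamma_emd=0.3, tol=1e-12):
    """Classify DS from the sign of the deficiency (Lemma 2)."""
    d = deficiency(a, b, gamma_emd)
    if d > tol:
        return "preserved"
    if d < -tol:
        return "inverted"
    return "abolished"


def abolishment_input(lo=0.25, hi=0.9, iters=60):
    """Bisection for an input level a = b at which the deficiency vanishes.

    Assumes deficiency(lo, lo) > 0 > deficiency(hi, hi), which holds for the
    illustrative parameters above.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if deficiency(mid, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With these illustrative parameters, small inputs (a = b = 0.25) are individually subthreshold and give a positive deficiency, while inputs close to U (a = b = 0.9) fall on the saturating part of the curve and give a negative one; the bisection locates a level in between where the two balance.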
Corollary 1 (Intermediate amounts-abolishment). There exists an intermediate amount of drug for which DS is abolished.
These results are based on the fact that for motion in the preferred (null) direction overlapping (nonoverlapping) EPSPs are transformed by g and signaled to H1. Thus, as stated by Lemma 2, one needs to compare the transformation of overlapping and nonoverlapping signals. If the overlapped activities are in the amplification region, then the same is true for nonoverlapping activities (or they are subthreshold). And since in that region g is faster-than-linear or linear, the overlapped activity, thus the preferred direction, prevails upon nonoverlapped activities (null direction), that is, DS is preserved. When the amount of drug is increased such that activities are in the saturation region (g slower-than-linear) then the preferred direction starts to lose its advantage coming from the temporal overlap. When these signals are balanced on the saturation curve, DS is abolished and a further increase in the amount of drug inverts DS. The time course of these changes as well as recovery is dictated by equation 3.4. Figure 4 illustrates these ideas. In normal conditions single inputs to EMDs are below threshold and produce no output (null direction) while the overlapped activity exceeds the threshold (preferred direction) (Fig. 4a). When picrotoxinin is injected, potentials increase due to the decrease in inhibition (the dashed portion of the inputs in Fig. 4b). At this point nonoverlapping inputs can exceed the threshold. If the drug-related potential shift drives the activities into the saturation region of g, where the sum of the outputs is larger than the output for the sum of the inputs, inversion of DS occurs (Fig. 4b).
5 Concluding Remarks
Although many experimental data indicate that lamina units are good candidates for our sustained and on-off units (Laughlin 1984; Shaw 1981, 1984), there is no direct demonstration of this anatomical mapping. Similarly, the anatomical location of EMDs is unknown. To answer questions of a similar nature, Schmid and Bülthoff (1988) conducted experiments where the activity of H1 and the electroretinogram (ERG) potentials in lamina were recorded simultaneously. At small amounts of drug injection no drug-correlated change in the ERG was observed. This seems to indicate that, since even at small amounts important changes in the DS occur, lamina units do not play an important role in the DS property. However, these findings are, in principle, insufficient to reach that conclusion because the ERG in the lamina is principally caused by large monopolar cells, L1 and L2, and not the spiking cells recorded by Arnett (1972), which are the candidates for our early units (Coombe 1986). Moreover, at high amounts of drug injection, the ERG was modified and this can suggest that similar effects may exist for small amounts but are too
Figure 4: The reversal phenomenon: In normal conditions (a) for a motion in the null direction, sustained and transient inputs to the EMD do not overlap in time and remain subthreshold (a, left). For a motion in the preferred direction one of the EMDs receives overlapping inputs that go above threshold (a, right). Injection of picrotoxinin causes a shift in the activities (dashed portion). The cell starts to respond to motion in the null direction as well (b, left). In the slower-than-linear portion of the sigmoidal nonlinearity the sum of the outputs becomes larger than the output of the sum (compare b, left with b, right).
weak to be detected as a prominent ERG component yet strong enough to alter DS properties. Another recent experimental study combined neuropharmacology and electrophysiology in the fly visual system (Egelhaaf et al. 1990). These authors rejected Schmid and Bülthoff's interpretation based on the Fourier domain analysis of their results. The interpretation of the results depends on the assumption that the major nonlinearity in the system is that of
the "multiplicative stage" and that the effect of other nonlinearities is negligible. The data used in their study are based on spike histograms and thus necessarily include a nonlinearity other than multiplication (spike frequency versus membrane potential, which is basically our function g). Although in normal conditions, the stimulus can be constrained in an attempt to keep activities in the linear range of this function, in pharmacological experiments the situation becomes more problematic. Reduction in the inhibition can drive the activities of various neurons to the nonlinear range of the function. Then, the interpretation of data becomes ambiguous because it is very difficult to assess the contributions of different nonlinearities. These experiments do not provide us with sufficient constraints to relate the results precisely to the components of the model. Our model contains a large number of nonlinearities and a wide range of possibilities can produce the second harmonic observed in the data of Egelhaaf et al. (1990). Despite these difficulties, neuropharmacology provides a powerful tool for probing various properties of neural networks. At the network level, one can use a macroscopic characterization of neuropharmacological effects and study network equations coupled with a pharmacological model. Our main motivation in this study was to formally analyze pharmacological properties of a neural network model proposed for motion detection in the visual system of the fly to seek a resolution to paradoxical data. From the theoretical point of view, such a resolution is offered. Experimentally, local circuit analysis can be performed by more precise techniques and these results can be compared with model predictions by reducing the neuropharmacological equations to microlevel interactions through the application of Michaelis-Menten equations and the law of mass action. 
Combined interpretation of pharmacological and focused stimulation experiments rejects both simplified models and suggests that motion detection should not be confined to a single synaptic interaction but should be analyzed at the network level. A very interesting experimental approach would be to combine these two techniques for the same preparation.
Appendix A: Mathematical Description of the Model
The reader is referred to Ogmen and Gagné (1990b) for a full description of the model. We will summarize below the equations used in the proofs. The pharmacological coefficient is introduced. We first assume that all inhibitory interactions are modified by the drug and then determine those that are critical. Since we do not know the exact anatomical location of units (except for the H1 cell) and since we are seeking a macroscopic description, we do not account for differences of modification strengths according to the anatomical location and the site of drug injection.
Sustained Units. The dynamics of the ith sustained unit is characterized by the following equations:
$$\frac{dx_i}{dt} = -A x_i + (B - x_i)[I + J_i] z_i - \Phi (D + x_i) \sum_{k \neq i} [I + J_k] z_k \qquad (A.1)$$
$$\frac{dz_i}{dt} = \alpha(\beta - z_i) - (I + J_i) z_i \qquad (A.2)$$
where $J_i$ and I are, respectively, specific and arousal inputs. $z_i$ denotes the amount of available transmitter and A, B, D, $\alpha$, and $\beta$ are positive constants.
$$\frac{dy_i}{dt} = -A y_i + (B - y_i)[F g(x_{on,i} - \Gamma_{on,i}) + G g(x_{off,i} - \Gamma_{off,i})] \qquad (A.7)$$
where g is a nonlinear function, F, G are constants, and $\Gamma$ is the signal threshold. We call this structure an augmented gated dipole.
EMDs. Let us denote by $x_{emd,i,j}(t)$ the activity of the EMD at position i receiving inputs from the sustained unit at i and on-off unit at i + j, that is, with a sampling base equal to j. The dynamics of this EMD is described by
$$\ldots + g(y_{i+j} - \Gamma_{y,i+j})] \qquad (A.9)$$
where $f(\cdot)$ represents the transformation by slow interneurons.
Wide-Field Cells. The following equation is proposed for the activity of the lobula cells:
(A.10)
where the coefficients $H_{i,j}$ and $J_{i,j}$ account for the connection strengths. These connection strengths determine the receptive field profiles of lobula plate cells. Consequently, the general expression (A.10) can be specialized to various classes of cells found in the lobula plate including the H1 cell, by a proper choice of $H_{i,j}$ and $J_{i,j}$ [see, for example, Schuling (1988) for an elaborate study]. For a spatially homogeneous stimulation and for recordings under steady-state conditions (conditions of Schmid-Bülthoff experiments) the lobula plate equation is calculated at its steady state: (A.11)
where $K_E = \sum_{i,j} H_{i,j}\, g(x_{emd,i,j} - \Gamma_{emd,i,j})$ and $K_I = \sum_{i,j} J_{i,j}\, g(\bar{x}_{emd,i,j} - \Gamma_{emd,i,j})$. For motion in the preferred direction $g(x_{emd,i,j} - \Gamma_{emd,i,j})$ and $g(\bar{x}_{emd,i,j} - \Gamma_{emd,i,j})$ will be approximated by $g[g(x_i - \Gamma_{x,i}) + g(y_{i+j} - \Gamma_{y,i+j}) - \Gamma_{emd,i,j}]$ (overlapping EPSPs) and by $g[g(x_i - \Gamma_{x,i}) - \Gamma_{emd,i,j}] + g[g(y_{i+j} - \Gamma_{y,i+j}) - \Gamma_{emd,i,j}]$ (nonoverlapping EPSPs signaled by two different EMDs), respectively. For motion in the null direction they will be approximated by $g[g(x_i - \Gamma_{x,i}) - \Gamma_{emd,i,j}] + g[g(y_{i+j} - \Gamma_{y,i+j}) - \Gamma_{emd,i,j}]$ and by $g[g(x_i - \Gamma_{x,i}) + g(y_{i+j} - \Gamma_{y,i+j}) - \Gamma_{emd,i,j}]$, respectively (mirror image symmetry). The temporal overlap of inputs is critical and it is captured by these approximations. We will assume for simplicity that $H_{i,j} = J_{i,j}$.
In the formulation of Ogmen and Gagné (1990b) the only constraint imposed on g was a threshold nonlinearity and a sigmoidal function was suggested (Ogmen and Gagné 1990b, footnote 6). This paper makes full use of the sigmoidal characteristics of g.
Appendix B
Proof of Lemma 2. In order to establish the DS property one needs to compare $x_{lob}$ for preferred and null direction stimulations. Given the mirror image symmetry of EMDs, the excitatory (inhibitory) input for the preferred direction will be equal to the inhibitory (excitatory) input for the null direction. Therefore, from equation A.11 the DS property is determined by the sign of
which in turn is determined by the sign of $K_E - K_I$. Moreover, for a spatially homogeneous stimulus pattern (i.e., a stimulus pattern like the one used in Schmid-Bülthoff experiments) the summations can be replaced by a single typical EMD output. It then suffices to study the sign of one summand in the numerator. Since we need to study a single term, we drop the indices and represent the sustained and transient cell activities by x and y, respectively. Then expression 4.1 follows. Note that inhibition at the H1 level is not critical for the DS property.
Proof of Lemma 3. Since g is monotone increasing, by applying Lemma 2, it is necessary to have both $g(x - \Gamma_x)$ and $g(y - \Gamma_y)$ in the suprathreshold region.
Proof of Proposition 1. If $[g(x - \Gamma_x) + g(y - \Gamma_y)] \in R_a$ then $[g(x - \Gamma_x) - \Gamma_{emd}] \in R_a \cup R_{sub}$ and $[g(y - \Gamma_y) - \Gamma_{emd}] \in R_a \cup R_{sub}$. Since g is strictly increasing, faster-than-linear or linear in $R_a$, preservation of DS follows.
Proof of Proposition 2. $\lim_{g(x - \Gamma_x) \to U,\ g(y - \Gamma_y) \to U} D_h[g(x - \Gamma_x), g(y - \Gamma_y)] = h(2U) - 2h(U)$. But if $U > A_2 + \Gamma_{emd}$ then h(a) is slower-than-linear for a > U, thus the deficiency is negative and by Lemma 2 DS is inverted. The corollary follows from continuity of $D_h$.
Acknowledgments
I would like to thank Dr. Michael E. Rudd for comments on an earlier version of this work.
References
Ariens, E. J. 1966. Receptor theory and structure-action relationships. In Advances in Drug Research, N. J. Harper and A. B. Simmonds, eds. Academic Press, London.
Arnett, D. W. 1972. Spatial and temporal integration properties of units in first optic ganglion of dipterans. J. Neurophysiol. 35, 429-444.
Bülthoff, H., and Bülthoff, I. 1987. Combining neuropharmacology and behavior to study motion detection in flies. Biol. Cybernet. 55, 313-320.
Coombe, P. E. 1986. The large monopolar cells L1 and L2 are responsible for ERG transients in Drosophila. J. Comp. Physiol. A 159, 655-665.
Egelhaaf, M., Borst, A., and Pilz, B. 1990. The role of GABA in detecting visual motion. Brain Res. 509, 156-160.
Franceschini, N. 1985. Early processing of colour and motion in a mosaic visual system. Neurosci. Res. Suppl. 2, 517-549.
Franceschini, N., Riehle, A., and Le Nestour, A. 1989. Directionally selective motion detection by insect neurons. In Facets of Vision, D. Stavenga and R. Hardie, eds. Springer-Verlag, Berlin.
Grossberg, S. 1973. Contour enhancement, short term memory, and constancies in reverberating neural networks. Stud. Appl. Math. 52, 217-257.
Hardie, R. C. 1989. Neurotransmitters in compound eyes. In Facets of Vision, D. G. Stavenga and R. C. Hardie, eds. Springer, Berlin.
Laughlin, S. 1984. The roles of parallel channels in early visual processing by the arthropod compound eye. In Photoreception and Vision in Invertebrates, M. A. Ali, ed., pp. 457-481. Plenum Press, New York.
Leake, L. D., and Walker, R. J. 1980. Invertebrate Neuropharmacology. Wiley, New York.
Ogmen, H., and Gagné, S. 1988. Short-range motion detection in the insect visual system. Neural Networks 1, supplement 1, p. 519. Proceedings of the First International Neural Network Society Conference.
Ogmen, H., and Gagné, S. 1990a. Neural models for sustained and ON-OFF units of insect lamina. Biol. Cybernet. 63, 51-60.
Ogmen, H., and Gagné, S. 1990b. Neural network architectures for motion perception and elementary motion detection in the fly visual system. Neural Networks 3, 487-505.
Paton, W. D. M. 1961. A theory of drug action based on the rate of drug-receptor combination. Proc. R. Soc. London Ser. B 154, 21-69.
Reichardt, W. 1961. Autocorrelation, a principle for evaluation of sensory information by the central nervous system. In Principles of Sensory Communications, W. A. Rosenblith, ed., pp. 303-317. Wiley, New York.
Riehle, A., and Franceschini, N. 1984. Motion detection in flies: Parametric control over on-off pathways. Exp. Brain Res. 54, 390-394.
Schmid, A., and Bülthoff, H. 1988. Using neuropharmacology to distinguish between excitatory and inhibitory movement detection mechanisms in the fly Calliphora erythrocephala. Biol. Cybernet. 59, 71-80.
Schuling, F. H. 1988. Processing of moving images in natural and artificial visual systems. Doctoral dissertation, Rijksuniversiteit Groningen, The Netherlands.
Shaw, S. R. 1981. Anatomy and physiology of identified non-spiking cells in the photoreceptor-lamina complex of the compound eye of insects, especially Diptera. In Neurones without Impulses, A. Roberts and B. M. H. Bush, eds., pp. 61-116. Cambridge University Press, Cambridge.
Shaw, S. R. 1984. Early visual processing in insects. J. Exp. Biol. 112, 225-251.
Stephenson, R. P. 1956. A modification of the receptor theory. Br. J. Pharmacol. Chemother. 11, 379-386.
Received 15 March 1990; accepted 18 March 1991.
Communicated by Michael Jordan
2-Degree-of-freedom Robot Path Planning using Cooperative Neural Fields
Michael Lemmon
Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA
This paper proposes a neural network solution to path planning by two degree-of-freedom (DOF) robots. The proposed network is a two-dimensional sheet of neurons forming a distributed representation of the robot's workspace. Lateral interconnections between neurons are "cooperative," so that the field exhibits oscillatory behavior. This paper shows how that oscillatory behavior can be used to solve the path planning problem. The results reported show that the proposed neural network finds the variational solution of Bellman's dynamic programming equation.
1 Introduction
Autonomous robotic systems are often required to move among obstacles toward a desired location. Since there may be many paths to choose from, a truly autonomous robot must be able to locate an "optimal" path in a reasonable length of time. In our context, "reasonable" means that planning is fast enough to allow real-time interaction between the robot and its environment. Most recent work in path planning can be classified as "global" or "local" in nature. Global algorithms such as the "cell-decomposition" approach (Schwartz and Sharir 1983) generate a directed graph of the robot's admissible positions and then use a heuristic search algorithm such as the A*-algorithm (Nilsson 1971) to locate the globally optimal path. The generation of the directed graph and its subsequent search, however, have a complexity that is polynomial in the number of constraints (i.e., obstacles) (Schwartz and Sharir 1983) and therefore global techniques are difficult to realize in real time. Local approaches using artificial potential fields (Khatib 1986) are more amenable to real-time implementation but are easily fooled into following suboptimal paths that may never reach the intended destination.
As a result, all existing methods for path planning fail to simultaneously achieve the goals of global optimality and real-time implementation. More recently, an artificial neural network (ANN) solution has been proposed (Barraquand and Latombe 1989) for the path planning problem.
Neural Computation 3, 350-362 (1991)
© 1991 Massachusetts Institute of Technology
The ANN approach uses the network to form a distributed representation (Hinton et al. 1986) of the robot's workspace. A stochastic search technique such as simulated annealing is then used to generate the globally optimal path. The algorithm's reliance on stochastic search strategies renders the approach unsuitable for real-time implementation. This is because stochastic searches can exhibit notoriously slow convergence rates. In spite of this drawback the ANN approach has two things recommending it. In the first place, the approach's complexity scales with workspace size, rather than the number of obstacles. In the second place, the approach is highly parallel so it can be efficiently implemented on parallel computers. Therefore if a faster search technique can be found then the approach should be a likely solution to the real-time path planning problem. This paper describes a neural network solution to the 2 degree-of-freedom (DOF) path planning problem which invokes a radically different search strategy. The proposed network is a two-dimensional (2-D) sheet of neurons forming a distributed representation of the robot's workspace. Instead of a stochastic search, the network implements a highly "parallel" search of the free workspace. The search technique relies strongly on the oscillatory behavior of cooperative neural fields. This paper shows that the proposed neural network computes the dynamic programming solution (Bryson and Ho 1975) of the path planning problem. The proposed structure therefore locates globally optimal paths with relatively modest computational requirements. The remainder of this paper is organized as follows. Section 2 discusses path planning in the context of dynamic programming solutions. Section 3 describes a cooperative neural field. Section 4 discusses the dynamic behavior of the proposed neural field. Section 5 shows how the proposed network computes the dynamic programming solution of the path planning problem.
Simulation results are also presented in this section. Section 6 summarizes the conclusions.
2 Dynamic Programming Solutions
Consider a 2-DOF robot moving about in a two-dimensional world. A robot's location is denoted by the real vector, p. The collection of all locations forms a set called the workspace. An admissible point in the workspace is any location that the robot may occupy. The set of all admissible points is called the free workspace. The free workspace's complement represents a collection of obstacles. The robot moves through the workspace along a path that is denoted by the parameterized curve, p(t). An admissible path is one that lies wholly in the robot's free workspace. Assume that there is an initial robot position, $p_0$, and a desired final position, $p_f$. The robot path planning problem is to find an admissible path with $p_0$ and $p_f$ as endpoints such that some "optimality" criterion is satisfied. The path planning problem may be stated more precisely from an optimal control theorist's viewpoint. Treat the robot as a dynamic system that is characterized by a state vector, p, and a control vector, u. For the highest levels in a control hierarchy, it can be assumed that the robot's dynamics are modeled by the following differential equation, $\dot{p} = u$. This equation says that the state velocity equals the applied control. To define what is meant by "optimal," a performance functional is introduced.
where $\|x\|$ is the norm of vector x and where the functional c(p) is unity if p lies in the free workspace and is infinite otherwise. This weighting functional is used to ensure that the control does not take the robot into obstacles. Equation 2.1's optimality criterion minimizes the robot's control effort while penalizing controls that do not satisfy the terminal constraints. With the preceding definitions, the optimal path planning problem states that for some final time, $t_f$, find the control u(t) that minimizes the performance functional J(u). One very powerful method for tackling this minimization problem is to use dynamic programming (Bryson and Ho 1975). According to dynamic programming, the optimal control, $u_{opt}$, is obtained from the gradient of an "optimal return function," $J^0(p)$. In other words, $u_{opt} = \nabla J^0$. The optimal return functional satisfies the Hamilton-Jacobi-Bellman (HJB) equation. For the dynamic optimization problem given above, the HJB equation is easily shown to be
$$-\frac{\partial J^0}{\partial t} = \begin{cases} -\frac{1}{4}(\nabla J^0)^t(\nabla J^0), & c(p) = 1 \\ 0, & c(p) = \infty \end{cases} \qquad (2.2)$$
This is a first-order nonlinear partial differential equation (PDE) with terminal (boundary) condition, $J^0(t_f) = \|p(t_f) - p_f\|^2$. Once equation 2.2 has been solved for $J^0$, the optimal "path" is determined by following the gradient of $J^0$. Solutions to equation 2.2 must generally be obtained numerically. One solution approach numerically integrates a full discretization of equation 2.2 backward in time using the terminal condition, $J^0(t_f)$, as the starting point. The proposed numerical solution is attempting to find characteristic trajectories (Carrier and Pearson 1976) of the nonlinear first-order PDE. The PDE nonlinearities, however, only ensure that these characteristics exist locally (i.e., in an open neighborhood about the terminal condition). The resulting numerical solutions are, therefore, valid only in a "local" sense. This is reflected in the fact that truncation errors introduced by the discretization process will eventually result in numerical solutions violating the underlying principle of optimality (Bryson and Ho 1975) embodied by the HJB equation.
In solving path planning problems, "global" solutions (i.e., solutions that satisfy the HJB equation over the entire workspace) are required. These global solutions may be obtained by solving an associated variational problem (Benton 1977). Assume that the optimal return function at time $t_f$ is known on a closed set B. The variational solution (Benton 1977) for equation 2.2 states that the optimal return at time $t < t_f$ at a point p in the neighborhood of the boundary set B will be given by (2.3)
Equation 2.3 applies only in regions where c(p) = 1 (i.e., the robot's free workspace). For obstacles, $J^0(p, t) = J^0(p, t_f)$ for all $t < t_f$. In other words, the optimal return is unchanged in obstacles. Equation 2.3 provides an iterative procedure for computing the solution to the path planning problem. The solution procedure is particularly straightforward when the workspace consists of a collection of discrete positions. In this case, the proposed path planning algorithm is identical to the Bellman-Ford algorithm (Bertsekas and Tsitsiklis 1989) used in finding the shortest path through a directed graph. Formal statement of this algorithm in the context of path planning problems begins with a discretization of the workspace into a finite set of MN position vectors, $P = \{p_{ij};\ i = 1, \ldots, N;\ j = 1, \ldots, M\}$. These vectors represent positions in the workspace. A neighborhood structure is introduced over P by defining a collection of MN subsets of P, $\{N_{ij};\ i = 1, \ldots, N;\ j = 1, \ldots, M\}$, where the set $N_{ij}$ consists of all position vectors that are "neighbors" of the position vector, $p_{ij}$. The iterative application of equation 2.3 will compute the optimal return function on the discrete set of points, P. The resulting algorithm is summarized below.
1. Let $J^0(p_f) = 0$ and let $J^0(p) = K$ for all $p \neq p_f$. K is a positive constant.
2. For each position vector, $p_{ij}$, in the workspace where $c(p_{ij}) = 1$, compute the optimal return as (2.4)
For each position vector where $c(p_{ij}) = \infty$, leave the optimal return unchanged.
3. Repeat step 2 until the position vector representing the robot's current position, $p_0$, has its optimal return function changed.
4. The optimal path is obtained by using the control that takes the robot from its current position, $p_{ij}$, to the neighboring position with smallest optimal return. Repeat this step until the robot is at the desired terminal position.
The algorithm is a straightforward modification of the Bellman-Ford algorithm. For this particular approach it can be easily shown that the dynamic programming iteration of equation 2.4 generates an optimal return function where $J^0(p)$ equals the length (with respect to the vector norm $\|x\|$) of the shortest path from point p to the terminal point $p_f$. The above algorithm could be easily implemented on a synchronous array of SIMD machines (Lemmon 1991). In the following section we propose a cooperative neural field that will also be shown to compute the optimal return function.
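The four steps above can be sketched on a small grid. This is a minimal illustration under stated assumptions, not the paper's implementation: the update `J(p) = min over neighbors q of [J(q) + ||p - q||]` is an assumed form chosen to agree with the statement that the optimal return equals the shortest path length, the neighborhood is taken to be 4-connected with unit step length, `math.inf` stands in for the positive constant K, and the sweep runs to convergence rather than stopping as soon as the start cell changes. The names `neighbors`, `optimal_return`, and `greedy_path` are hypothetical.

```python
import math


def neighbors(i, j, n, m):
    """4-connected neighbors of cell (i, j) on an n-by-m grid (an assumption)."""
    for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
        if 0 <= k < n and 0 <= l < m:
            yield k, l


def optimal_return(free, goal):
    """free[i][j] is True where c(p_ij) = 1; goal is the (free) terminal cell."""
    n, m = len(free), len(free[0])
    J = [[math.inf] * m for _ in range(n)]      # step 1: J = K everywhere ...
    J[goal[0]][goal[1]] = 0.0                   # ... except J(p_f) = 0
    changed = True
    while changed:                              # steps 2-3, run to convergence
        changed = False
        for i in range(n):
            for j in range(m):
                if not free[i][j] or (i, j) == goal:
                    continue                    # obstacles keep their return
                best = min(
                    (J[k][l] + 1.0 for k, l in neighbors(i, j, n, m) if free[k][l]),
                    default=math.inf,
                )
                if best < J[i][j]:
                    J[i][j] = best
                    changed = True
    return J


def greedy_path(J, free, start):
    """Step 4: repeatedly move to the free neighbor with the smallest return."""
    if math.isinf(J[start[0]][start[1]]):
        return []                               # goal unreachable from start
    n, m = len(J), len(J[0])
    path = [start]
    i, j = start
    while J[i][j] > 0.0:
        i, j = min(
            (q for q in neighbors(i, j, n, m) if free[q[0]][q[1]]),
            key=lambda q: J[q[0]][q[1]],
        )
        path.append((i, j))
    return path
```

For example, on a 3-by-3 grid whose middle column is blocked in the two lower rows, the return at the lower-left cell equals the length of the detour over the top row to the lower-right goal, and `greedy_path` retraces that detour.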
3 Cooperative Neural Fields
The proposed neural network consists of MN neurons arranged as a 2-D sheet called a "neural field." The neurons are put in a one-to-one correspondence with the ordered pairs, $\{(i, j);\ i = 1, \ldots, N;\ j = 1, \ldots, M\}$. The ordered pair (i, j) will sometimes be called the (i, j)th neuron's "label." The (i, j)th neuron is characterized by two states. The short-term activity (STA) state, $x_{i,j}$, is a scalar representing the neuron's activity in response to the currently applied stimulus. The long-term activity (LTA) state, $w_{i,j}$, is a scalar representing the neuron's "average" activity in response to recently applied stimuli. Each neuron produces an output, $f(x_{i,j})$, which is a unit step function of the STA state (i.e., f(x) = 1 if x > 0 and f(x) = 0 if $x \le 0$). A neuron will be called "active" or "inactive" if its output is unity or zero, respectively. Each neuron is also characterized by a set of constants. These constants are either externally applied inputs or internal parameters. They are the disturbance $y_{i,j}$, the rate constant $\lambda_{i,j}$, and the position vector $p_{i,j}$. The position vector is a 2-D vector mapping the neuron onto the robot's workspace. The rate constant models the STA state's underlying dynamic time constant. The rate constant is used to encode whether or not a neuron maps onto an obstacle in the robot's workspace. The external disturbance is used to initiate the network's search for the optimal path. The evolution of the STA and LTA states is controlled by the state equations. These equations are assumed to change in a synchronous fashion. The STA state equation is
(3.1)
where the summation is over all neurons contained within the neighborhood, $N_{i,j}$, of the (i, j)th neuron. The function G(x) is zero if x < 0 and is x if $x \ge 0$. This function is used to prevent the neuron's activity level from falling below zero. $D_{kl}$ are network parameters controlling the strength of lateral interactions between neurons. The LTA state equation is
(3.2)
Equation 3.2 means that the LTA state is incremented by one every time the (i, j)th neuron's output changes. The values chosen for the network parameter, $D_{kl}$, determine whether equation 3.1 describes a competitive or cooperative field. If $D_{kl} > 0$ when $(k, l) \neq (i, j)$ and $D_{kl} < 0$ when $(k, l) = (i, j)$ then the network becomes "cooperative." Cooperation means that a given neuron turning active increases the STA states of its neighbors. Reversing the inequalities results in a competitive system.
4 Cooperative Field Dynamics
Specific choices of interconnection weights result in oscillatory behavior. The specific field under consideration was chosen so that neurons pass their activity levels to their neighbors while suppressing their own activity. In other words, if a given neuron has a positive output at time n, then at time n + 1 that neuron switches off and its nearest neighbors switch on. The oscillatory behavior drives the LTA state dynamics, which counts the number of times a neuron changed state. In this section, it is shown that a particular choice of interconnection parameters will cause network STA states to oscillate. The specific network under consideration is a cooperative field where Dkl = 1 if @ , I ) # ( i , j ) and Dkl = - A < 0 if ( k , l ) = (i,j). Without loss of generality it will also be assumed that the external disturbances are bounded between zero and one. It is also assumed that the rate constants, A,, are either zero or unity. In the path planning application, rate constants will be used to encode whether or not a given neuron represents an obstacle or a point in the free-workspace. Consequently, any neuron where A,,, = 0 will be called an “obstacle” neuron and any neuron where A,,, = 1 will be called a ”free-space” neuron. Because A,, = 0 for an obstacle neuron, equation 3.1 becomes x: = G(x;). Therefore an obstacle neuron never changes its STA state regardless of external disturbances or neighboring neuronal activity. Since obstacle neurons never change their initial activity levels, the following discussion concerning STA oscillations pertains only to free-space neurons. STA oscillations can be proven in the following manner. Assume that the (i,j)th neuron is a free-space neuron. If xi; = 0 then the STA state turns nonzero if and only if yl,, n > 0, where n is the number of active free-space neurons that are neighbors to the (i,j)th neuron. 
This inequality means that the STA state switches into activity if and only if the neuron has active neighbors ($n > 0$) or it has an applied disturbance ($y_{i,j} = 1$). The only condition under which the inactive neuron remains inactive is
if it has no applied stimulus and it has no active neighbors. If $x_{i,j} > 0$, then the STA state becomes zero if and only if $y_{i,j} + n - A < 0$. A sufficient condition for this to occur is that $A > |N_{i,j}|$, where $|N_{i,j}|$ is the number of neurons in the neighborhood set $N_{i,j}$. Under this condition, the inequality holds regardless of external disturbances and the number of active neighbors. Therefore any active neuron always turns inactive on the next iteration. The preceding results imply that once a free-space neuron turns active it will oscillate with a period of 2, provided it has at least one free-space neuron as a neighbor. This fact can be seen from the following argument. Assume the $(i,j)$th neuron turns active at time $n$. At iteration $n + 1$, it will turn inactive and it will activate all of its free-space neighbors. Since, by assumption, there is at least one such neighbor and since neighborhood relations are symmetric, it is concluded that at iteration $n + 2$ the $(i,j)$th neuron will be reactivated by its neighbor. Consequently, once the $(i,j)$th neuron has been activated, it will continue switching back and forth between inactivity and activity. The preceding discussion establishes the oscillatory behavior of free-space neurons once the neuron has been activated. To determine when the neuron first becomes active we need to be more specific about the network's initial conditions. The following section considers this question with a set of initial conditions motivated by the robotic path planning problem.
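The oscillation argument above can be illustrated with a minimal simulation sketch. It assumes binary STA states and replaces equation 3.1 by the two switching conditions derived in the text: an inactive free-space neuron turns on when $y_{i,j} + n > 0$, and an active one always turns off because $A > |N_{i,j}|$. The names `active_neighbors` and `sta_step`, the 5 by 5 grid, and the obstacle placement are illustrative choices, not taken from the paper.

```python
import numpy as np

def active_neighbors(x):
    """Number of active cells among each cell's eight nearest neighbors."""
    rows, cols = x.shape
    padded = np.pad(x, 1)  # zero border: cells outside the grid are inactive
    total = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0):
                total += padded[1 + di:1 + di + rows, 1 + dj:1 + dj + cols]
    return total

def sta_step(x, y, free):
    """One synchronous update of binary STA states (simplified stand-in for eq. 3.1)."""
    n = active_neighbors(x)
    turns_on = (x == 0) & (y + n > 0)   # active cells fail x == 0, so they switch off
    return np.where(free, turns_on, x).astype(int)  # obstacle cells never change

free = np.ones((5, 5), dtype=bool)
free[1, 3] = False                      # one hypothetical obstacle neuron
y = np.zeros((5, 5), dtype=int)
y[2, 2] = 1                             # single external disturbance
x = np.zeros((5, 5), dtype=int)

history = []
for _ in range(6):
    x = sta_step(x, y, free)
    history.append(x.copy())

source_trace = [int(h[2, 2]) for h in history]     # disturbed neuron
neighbor_trace = [int(h[2, 3]) for h in history]   # one of its free-space neighbors
obstacle_trace = [int(h[1, 3]) for h in history]   # the obstacle neuron
print(source_trace, neighbor_trace, obstacle_trace)
```

In this sketch the disturbed neuron and its neighbor alternate in antiphase with period 2, while the obstacle neuron stays at its initial state.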
5 Robot Path Planning
To determine when a neuron first becomes active, a specific initial condition and neighborhood structure will be assumed. These assumed conditions are motivated by the path planning problem under consideration. Assume that all neuron STA and LTA states are zero at time 0. Assume that the position vectors form a regular grid of points, $p_{i,j} = (i\Delta, j\Delta)^T$, where $\Delta$ is a constant controlling the grid's size. Assume that all external disturbances but one are zero. In other words, for a specific neuron with label $(i,j)$, $y_{k,l} = 1$ if $(k,l) = (i,j)$ and is zero otherwise. We also define the sup (supremum) norm of the $(k,l)$th neuron as

$$\rho_{kl} = \frac{\|p_{i,j} - p_{k,l}\|_\infty}{\Delta} \qquad (5.1)$$

where for a vector $x = (x_1, x_2)^T$, the sup norm is $\|x\|_\infty = \sup\{|x_1|, |x_2|\}$. Also assume a neighborhood structure where $N_{i,j}$ consists of the $(i,j)$th neuron and its eight nearest neighbors, $N_{i,j} = \{(i+k, j+l);\ k = -1, 0, 1;\ l = -1, 0, 1\}$. With the preceding assumptions it is apparent that a neuron's norm will always be a nonnegative integer, $\rho_{kl} = \sup\{|i-k|, |j-l|\}$. Consider the
proposition that a neuron with norm $\rho_{kl}$ will first become active on the $(\rho_{kl} + 1)$th iteration. This proposition, over the set of nonnegative integers, can be easily proven using induction. If $\rho_{kl} = 0$ then this is the $(i,j)$th neuron and it will turn active on the next iteration due to the nonzero applied disturbance. If $\rho_{kl} = 1$, then all neurons with this norm are neighbors of the $(i,j)$th neuron and hence will turn active on the second iteration. Now assume that the proposition holds for all neurons with norm $\rho_{kl}$. Then because all neurons with norm $\rho_{kl} + 1$ are neighbors to at least one neuron with norm $\rho_{kl}$, these neurons turn active in the next iteration, $\rho_{kl} + 2$. On the basis of the preceding inductive argument it is seen that a neuron first turns active in the $(\rho_{kl} + 1)$th iteration. Since we know such neurons continue oscillating after that time and since LTA states count up the number of STA state changes, it is concluded that the LTA state for the $(k,l)$th neuron in the $n$th iteration must equal $w_{k,l}(n) = G(n - \rho_{kl})$, where $G(x) = x$ if $x > 0$ and is zero otherwise. Note that if we define an associated functional, $J_{kl}(n) = n - G(n - \rho_{kl})$, then for neurons where $\rho_{kl} < n$, the functional becomes $J_{kl}(n) = \rho_{kl}$. In other words, the associated functional is simply the minimum distance (with respect to the sup norm) from the $(k,l)$th neuron to the initial disturbance [the $(i,j)$th neuron]. Comparing this result back to the dynamic programming solution obtained at the end of section 2, it is apparent that the solutions are identical with respect to the sup norm. In other words, if the DP iteration of equation 2.4 is defined using the sup norm, then the resulting optimal return function is identical to the associated functional, $J_{kl}(n)$, computed from the cooperative neural field's LTA states. In light of the preceding discussion, the use of cooperative neural fields for path planning is straightforward.
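The relation $w_{k,l}(n) = G(n - \rho_{kl})$ can be checked numerically. The sketch below assumes a binary simplification of the STA dynamics (inactive free-space neurons switch on when they are disturbed or have an active neighbor; active neurons always switch off) rather than the full equation 3.1. It counts state changes on an obstacle-free grid and compares the LTA states with the sup-norm distance. The function name `run_field` and the grid size are hypothetical.

```python
import numpy as np

def run_field(free, source, iterations):
    """Iterate the simplified field; return LTA counts of STA state changes."""
    rows, cols = free.shape
    x = np.zeros((rows, cols), dtype=int)      # STA states, zero at time 0
    w = np.zeros((rows, cols), dtype=int)      # LTA states, zero at time 0
    y = np.zeros((rows, cols), dtype=int)
    y[source] = 1                              # single external disturbance
    for _ in range(iterations):
        padded = np.pad(x, 1)                  # zero border around the grid
        n = np.zeros_like(x)                   # active neighbors of each cell
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if (di, dj) != (0, 0):
                    n += padded[1 + di:1 + di + rows, 1 + dj:1 + dj + cols]
        new_x = np.where(free, (x == 0) & (y + n > 0), x).astype(int)
        w += (new_x != x)                      # LTA counts every output change
        x = new_x
    return w

rows, cols, source, n_iter = 7, 9, (3, 2), 5
w = run_field(np.ones((rows, cols), dtype=bool), source, n_iter)

i, j = np.indices((rows, cols))
rho = np.maximum(abs(i - source[0]), abs(j - source[1]))   # sup-norm distance
expected_w = np.maximum(n_iter - rho, 0)                   # G(n - rho)
J = n_iter - w                                             # associated functional
print(np.array_equal(w, expected_w))
```

For neurons with $\rho_{kl} < n$ the array `J` reproduces the sup-norm distance to the disturbed neuron.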
First apply a disturbance at the neuron mapping onto the desired terminal position, $p_f$, and allow the field to generate STA oscillations. When the neuron mapping onto the robot's current position is activated, stop the oscillatory behavior. The resulting LTA state distribution for the $(i,j)$th neuron equals, up to an additive constant, the negative of the minimum distance (with respect to the sup norm) from that neuron to the initial disturbance. The optimal path is then generated by a sequence of controls that ascends the gradient of the LTA state distribution. Several simulations of the cooperative neural path planner have been implemented. The most complex case studied by these simulations assumed an array of 100 by 100 neurons. Several obstacles of irregular shape and size were randomly distributed over the workspace. An initial disturbance was introduced at the desired terminal location and STA oscillations were observed. A snapshot of the neuronal outputs is shown in Figure 1. This figure clearly shows wavefronts of neuronal activity propagating away from the initial disturbance [neuron (70,10) in the upper right hand corner of Figure 1]. The "activity" waves propagate around obstacles without any reflections. When the activity waves reached the neuron mapping onto the robot's current position, the STA oscillations were turned off. The LTA distribution resulting from this particular
simulation run is shown in Figure 2. In this figure, light regions denote areas of large LTA state and dark regions denote areas of small LTA state. The generation of the optimal path can be computed as the robot is moving towards its goal. Let the robot's current position be the $(i,j)$th neuron's position vector. The robot will then generate a control which takes it to the position associated with one of the $(i,j)$th neuron's neighbors. In particular, the control is chosen so that the robot moves to the neuron whose LTA state is largest in the neighborhood set, $N_{i,j}$. In other words, the next position vector to be chosen is $p_{k,l}$ such that its LTA state is

$$w_{k,l} = \max_{(r,s) \in N_{i,j}} w_{r,s} \qquad (5.2)$$

Figure 1: STA activity waves.
Because of the LTA distribution's optimality property, this local control strategy is guaranteed to generate the optimal path (with respect to the sup norm) connecting the robot to its desired terminal position. It should be noted that the selection of the control can also be done with an analog neural network. In this case, the LTA states of neurons in the neighborhood set, $N_{i,j}$, are used as inputs to a competitively inhibited neural net (Lemmon 1990). The competitive interactions in this network will always select the direction with the largest LTA state.
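The local control rule above can be sketched in a few lines. The example assumes an LTA distribution of the form $w = G(n - \rho)$, built here from a breadth-first search over eight-neighborhoods as a stand-in for the field's wave propagation. The names `chessboard_distance` and `greedy_path` and the small workspace with a wall are hypothetical.

```python
from collections import deque
import numpy as np

OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]

def chessboard_distance(free, goal):
    """Obstacle-aware sup-norm distance to goal (-1 where unreachable)."""
    dist = np.full(free.shape, -1)
    dist[goal] = 0
    queue = deque([goal])
    while queue:
        i, j = queue.popleft()
        for di, dj in OFFSETS:
            k, l = i + di, j + dj
            if (0 <= k < free.shape[0] and 0 <= l < free.shape[1]
                    and free[k, l] and dist[k, l] < 0):
                dist[k, l] = dist[i, j] + 1
                queue.append((k, l))
    return dist

def greedy_path(lta, start):
    """Repeatedly move to the neighbor with the largest LTA state."""
    path = [start]
    while True:
        best = path[-1]
        for di, dj in OFFSETS:
            k, l = path[-1][0] + di, path[-1][1] + dj
            if (0 <= k < lta.shape[0] and 0 <= l < lta.shape[1]
                    and lta[k, l] > lta[best]):
                best = (k, l)
        if best == path[-1]:          # no neighbor is larger: local maximum reached
            return path
        path.append(best)

free = np.ones((6, 8), dtype=bool)
free[1:5, 4] = False                  # hypothetical wall between start and goal
goal, start = (2, 6), (3, 1)
dist = chessboard_distance(free, goal)
n = dist[start] + 1                   # oscillations stop when the start is first activated
lta = np.where(dist >= 0, np.maximum(n - dist, 0), 0)   # obstacles keep LTA state 0
path = greedy_path(lta, start)
print(path)
```

Because each move raises the LTA state by exactly one in this setting, the number of moves equals the sup-norm distance from the start to the goal.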
Figure 2: LTA distribution.
Since neuronal dynamics are analog in nature, it is important to consider the impact of noise on the implementation. Analog systems will generally exhibit noise levels with effective dynamic ranges being at most 6 to 8 bits. Noise can enter the network in several ways. The LTA state equation can have a noise term (LTA noise), so that the LTA distribution may deviate from the optimal distribution. The introduction of noise into the LTA state equation can be done in the following way:

$$w_{i,j}^{+} = w_{i,j}^{-} + |f'(x_{i,j})| + v_{i,j} \qquad (5.3)$$

where $v_{i,j}$ is an array of positive i.i.d. stochastic processes. Noise may also enter in the selection of the robot's controls (selection noise). In this case, the robot's next position is the position vector, $p_{k,l}$, such that
(5.4)
where $v_{i,j}$ is an i.i.d. array of stochastic processes. Simulation results reported below assume that the noise processes, $v_{i,j}$, are positive and uniformly distributed i.i.d. processes. The noise introduced by equations 5.3 and 5.4 places constraints on the "quality" of individual neurons, where quality is measured by the neuron's effective dynamic range. Two sets of simulation experiments have been conducted to assess the neural field's dynamic range requirements. In the following simulations, dynamic range is defined by the equation
$-\log_2 |v_m|$, where $|v_m|$ is the maximum value the noise process can take. The unit for this measure of dynamic range is "bits." The first set of simulation experiments selected robotic controls in a noisy fashion. Figure 3a and b shows the paths generated by two simulation runs where signal-to-noise ratios were 1 (0 bits) and 0.1 (-3 bits), respectively. These simulation results indicate that the impact of "selection" noise is to "confuse" the robot so it takes longer to find the desired terminal point. The paths shown in Figure 3a and b were measured with respect to the sup norm and standard Euclidean norm. Figure 3a's path had a sup norm length of 70 and a Euclidean norm length of 96.25. Figure 3b's path had a sup norm length of 109 and a Euclidean norm length of 123.18. For both norms, the introduction of noise increased the length of the generated path. The important thing to note about this example, however, is that the system is capable of tolerating extremely large amounts of "selection" noise. In spite of the fact that SNRs were less than unity, the robot was still capable of finding the terminal position with relatively little performance degradation. This is because the underlying optimality of the LTA distribution is not disturbed by the noise process. As a result, increases in generated path length are due to random fluctuations about the optimal path.
Figure 3: (a) Selected path (0 bit); (b) selected path (-3 bit).
The second set of simulation experiments introduced LTA noise through equation 5.3. These noise experiments had a detrimental effect on the robot's path planning abilities in that several spurious extremals were generated in the LTA distribution. The result of the spurious extremals is to fool the robot into thinking it has reached its terminal destination when in fact it has not. As noise levels increase, the number of
spurious states increases. Figure 4 shows how this increase varies with the neuron's effective dynamic range. The surprising thing about this result is that for neurons with as little as 3 bits of effective dynamic range the LTA distribution is free of spurious maxima. These results hold for problems of the size illustrated in Figure 1. Furthermore, even with less than 3 bits of dynamic range, the performance degradation is not catastrophic. LTA noise may cause the robot to stop early, but on stopping the robot is closer to the desired terminal state. Therefore, the path planning module can be easily run again, and because the robot is closer to its goal there will be a greater probability of success in the second trial.
Figure 4: Dynamic range versus number of spurious states. Horizontal axis: neuron's dynamic range (bits), 0 to 5.
6 Conclusions
This paper reports on the novel use of cooperative dynamics to implement a highly parallel search of a mobile robot’s workspace. It was shown that the proposed neural field can compute a dynamic programming solution to the path planning problem with respect to the supremum norm. The proposed strategy is highly parallel so it can be simulated on digital parallel computers such as the Connection Machine (Hillis 1985). Furthermore, preliminary simulation results suggest that
analog implementations of the approach are also feasible and that the effective dynamic range of the individual processing elements need not be more than 3 bits to solve nontrivial path planning problems. The globally optimal character of the approach's generated paths, coupled with its potential for parallel implementation on either digital or analog hardware, renders it a highly attractive solution to the real-time path planning problem. A great deal of work remains, however, to fully realize this idea's potential. Future work will include extensions of the approach to n-DOF robots, implementation issues, extensions to more general optimality criteria than the sup norm, and more careful analysis of the model's underlying dynamics and sensitivities.
References
Barraquand, J., and Latombe, J. C. 1989. Robot Motion Planning: A Distributed Representation Approach. Tech. Rep. Stanford University Computer Science Dept., STAN-CS-89-1257, May.
Benton, S. H., Jr. 1977. The Hamilton-Jacobi Equation: A Global Approach. Academic Press, New York.
Bertsekas, D. P., and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Bryson, A. E., and Ho, Y. C. 1975. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere Publishing, Washington, D.C.
Carrier, G. F., and Pearson, C. E. 1976. Partial Differential Equations: Theory and Technique. Academic Press, New York.
Hillis, D. 1985. The Connection Machine. MIT Press, Cambridge, MA.
Hinton, G., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Khatib, O. 1986. Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robotics Res. 5(1), 90-98.
Lemmon, M. D. 1990. Parameter estimation using competitively inhibited neural networks. Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, May.
Lemmon, M. D. 1991. Real time optimal path planning using a distributed computing paradigm. Proc. Am. Control Conf., Boston, MA, June.
Nilsson, N. 1971. Problem Solving Methods in Artificial Intelligence. McGraw-Hill, New York.
Schwartz, J. T., and Sharir, M. 1983. On the "piano movers" problem I. The case of a two-dimensional rigid polygonal body moving amidst polygonal barriers. Comm. Pure Appl. Math. XXXVI, 345-398.
Received 7 July 1990; accepted 20 April 1991.
Communicated by Richard Durbin and David Willshaw
Parameter Sensitivity of the Elastic Net Approach to the Traveling Salesman Problem
Martin W. Simmen
Department of Physics, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K.
Durbin and Willshaw's elastic net algorithm can find good solutions to the TSP. The purpose of this paper is to point out that for certain ranges of parameter values, the algorithm converges into local minima that do not correspond to valid tours. The key parameter is the ratio governing the relative strengths of the two competing terms in the elastic net energy function. Based on recent work by Durbin, Szeliski and Yuille, the parameter regime in which the net may visit some cities twice is examined. Further analysis predicts the regime in which the net may fail to visit some cities at all. Understanding these limitations allows one to select the parameter value most likely to avoid either type of problem. Simulation data support the theoretical work.
1 Introduction
Neural Computation 3, 363-374 (1991) © 1991 Massachusetts Institute of Technology
The traveling salesman problem (TSP) is a widely studied problem in the field of combinatorial optimization. Given a set of $N$ locations ("cities"), the problem is to find the shortest closed path ("tour") around them, visiting each city exactly once. Like many other combinatorial optimization problems, the TSP is known to be NP-hard (Garey and Johnson 1987). Thus it is widely believed that any algorithm guaranteeing the optimal tour will require an amount of computation that grows faster than any polynomial function of $N$. Research is therefore directed instead at developing algorithms to find near-optimal tours, running in low order polynomial time. Recently, several novel approaches to the TSP (and optimization problems in general) have been developed, such as simulated annealing (van Laarhoven and Aarts 1987), genetic algorithms (Brady 1985; Muhlenbein et al. 1988), and connectionist algorithms (Hopfield and Tank 1985; Wilson and Pawley 1988; Durbin and Willshaw 1987; Angéniol et al. 1988; Peterson and Soderberg 1989). The connectionist approaches essentially explore a continuous search space en route to finding a solution, whereas the other approaches are restricted to searching within the finite set of possible tours. Another feature of the connectionist algorithms is that, unlike the traditional approaches, most of them are
inherently parallel. Therefore they can run efficiently on parallel computers and some, perhaps, can be implemented in hardware. This paper deals with the elastic net (Durbin and Willshaw 1987), a connectionist algorithm for the Euclidean TSP. It can be visualized as a rule for deforming an imaginary elastic band placed in the city plane by attractive, distance-dependent forces from the cities and by elastic forces within the band itself. A scale parameter controls the effective range of the city forces. It is initially set high, then gradually reduced; thus it plays a role comparable in some respects to "temperature" in simulated annealing. In practice, the net is modeled by a finite number of points ("beads") and the algorithm reduces to an iterative procedure for updating the bead positions. Let $x_i$ denote the fixed position of the $i$th city, $1 \le i \le N$, and $y_j$ the variable position of the $j$th bead, $1 \le j \le M$, $M \ge N$. At each iteration all of the beads are updated in parallel by

$$\Delta y_j = \alpha \sum_i w_{ij}(x_i - y_j) + \beta K (y_{j+1} - 2y_j + y_{j-1}) \qquad (1.1)$$

where $\alpha$ and $\beta$ are the constants governing the strengths of the city and tension-like forces respectively and $K$ is the scale parameter. $w_{ij}$, the normalized "weight" of the connection between the $i$th city and $j$th bead, is defined by

$$w_{ij} = \frac{\phi(|x_i - y_j|, K)}{\sum_k \phi(|x_i - y_k|, K)} \qquad (1.2)$$

where $\phi(d, K) = e^{-d^2/2K^2}$. This update rule performs gradient descent on the energy function $E$, defined as¹

$$E = -\alpha K \sum_i \ln \sum_j \phi(|x_i - y_j|, K) + \frac{\beta}{2} \sum_j |y_{j+1} - y_j|^2 \qquad (1.3)$$
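Equations 1.1 to 1.3 translate directly into array code. The following NumPy sketch is one possible implementation, not the original authors' program; the function names, the random test data, and the log-sum-exp stabilization are choices made here. It also checks numerically that the update 1.1 equals $-K\,\partial E/\partial y_j$ when the tension term of the energy carries the coefficient $\beta/2$.

```python
import numpy as np

def sq_dists(cities, beads):
    """Squared city-bead distances, shape (N, M)."""
    diff = cities[:, None, :] - beads[None, :, :]
    return (diff ** 2).sum(axis=2)

def weights(cities, beads, K):
    """Normalized weights w_ij of equation 1.2 (each row sums to one)."""
    d2 = sq_dists(cities, beads)
    d2 = d2 - d2.min(axis=1, keepdims=True)   # shift cancels in the ratio
    phi = np.exp(-d2 / (2 * K**2))
    return phi / phi.sum(axis=1, keepdims=True)

def energy(cities, beads, K, alpha, beta):
    """Energy of equation 1.3 for a closed ring of beads."""
    d2 = sq_dists(cities, beads)
    m = d2.min(axis=1, keepdims=True)
    log_sum = -m[:, 0] / (2 * K**2) + np.log(np.exp(-(d2 - m) / (2 * K**2)).sum(axis=1))
    tension = ((np.roll(beads, -1, axis=0) - beads) ** 2).sum()
    return -alpha * K * log_sum.sum() + 0.5 * beta * tension

def delta_y(cities, beads, K, alpha, beta):
    """Bead displacements of equation 1.1, shape (M, 2)."""
    w = weights(cities, beads, K)
    city_pull = w.T @ cities - w.sum(axis=0)[:, None] * beads   # sum_i w_ij (x_i - y_j)
    curvature = np.roll(beads, -1, axis=0) - 2 * beads + np.roll(beads, 1, axis=0)
    return alpha * city_pull + beta * K * curvature

rng = np.random.default_rng(0)
cities, beads = rng.random((5, 2)), rng.random((8, 2))
K, alpha, beta, eps = 0.3, 0.2, 2.0, 1e-6

grad = np.zeros_like(beads)               # central-difference gradient of E
for j in range(beads.shape[0]):
    for c in range(2):
        step = np.zeros_like(beads)
        step[j, c] = eps
        grad[j, c] = (energy(cities, beads + step, K, alpha, beta)
                      - energy(cities, beads - step, K, alpha, beta)) / (2 * eps)

max_mismatch = np.abs(delta_y(cities, beads, K, alpha, beta) + K * grad).max()
print(max_mismatch)
```

A full run would repeat `beads += delta_y(...)` while slowly lowering `K`; the annealing schedule is left out of this sketch.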
From 1.3 it is clear that, as $K \to 0$, net configurations corresponding to tours lie at local minima in the energy landscape, and that the shortest tour should correspond to the global minimum. The algorithm attempts to find one of these minima by first finding a minimum at high $K$, where the energy landscape is smoother, then trying to track it as $K$ is reduced.
¹Note that the second term has coefficient $\beta/2$, and not $\beta$, as is stated in the original and several subsequent papers.
2 A Brief Review of Previous Elastic Net Work
Simic (1990) and Yuille (1990) independently demonstrated that the elastic net and the method due to Hopfield and Tank (1985) are related through a common underlying framework; Simic through statistical mechanics and Yuille from work on stereo vision models. Simic also pointed out
that the second term in the energy function measures the sum over the inter-bead distance-squares rather than the desired sum over distances. Minimization of the distance-squares only corresponds to minimization of the tour length in the $M/N \to \infty$ limit. This drawback was alluded to in the original paper but was not stressed therein. Durbin et al. (1989) (referred to as DSY hereafter) investigated how the energy landscape changes as $K$ is reduced, and deduced several results. First, for $E$ to remain bounded, every city requires at least one bead within a distance of $O(K^{1/2})$ of it. Second, a condition on $\beta/\alpha$ is needed to prevent two neighboring beads converging on the same city. Third, they derived an implicit expression for the critical value of $K$, $K_c$, above which the energy function has a minimum corresponding to all the beads lying at the center of the city distribution, and proposed using $K_c$ and this configuration as the initial state. Finally, DSY discussed how the system's dynamics are influenced by the Hessian of $E$, and concluded that the algorithm cannot guarantee to find the global minimum, even when the initial state is chosen in this way. The elastic net algorithm contains few parameters and so it might be possible to understand through analysis what the good parameter settings are, as, for example, DSY did for the initial value of $K$. One of the most important parameters is the ratio $\beta/\alpha$ (which I call $\gamma$), which controls the relative strength of the tension-like forces to the city forces. This paper builds on the work of Durbin et al. on selecting the value of $\gamma$. I prove that, as $K \to 0$, there exist local energy minima in which some cities remain unvisited, that is, minima corresponding to configurations that are not tours.
The range of $\gamma$ values for which the algorithm is liable to find one of these minima (and therefore fail) is then derived; this information, combined with the earlier DSY condition, gives a good prescription for choosing $\gamma$ as a function of the typical separation (denoted by $\mu$) between neighboring beads in a tour configuration.
3 Sensitivity to the Value of $\beta/\alpha$
Durbin et al. investigated the parameter conditions needed to guarantee that, as $K \to 0$, there would be only a single bead at each city. They analyzed the stability of equilibrium configurations of two beads close to a single city; $K$ was considered small enough such that only these beads interact significantly with the city. From their work, the condition for instability (hence for only one bead at the city) can be expressed as
(3.1)
where $\Delta_j = (y_{j+1} + y_{j-1} - 2y_j)$ and $w_j$ is the weight between the city and $j$th bead. They then considered only the case in which the two beads are
immediate neighbors in the net. This allows $|\Delta_j|$ to be interpreted approximately as the distance separating a bead converging on the city from its neighbor which is not attracted to the city. Thus the $|\Delta_j|$ terms can be approximated by $\mu$. Noting that the minimum of $[(w_2/w_1^2) + (w_1/w_2^2)]$, subject to $(w_1 + w_2) = 1$, is 4, it can be inferred from their analysis that to prevent two neighboring beads converging onto a single city, $\alpha$ and $\beta$ should be chosen such that

$$\frac{\beta}{\alpha} > \frac{1}{2\mu} \qquad (3.2)$$

Figure 1: Example of spiking. Squares denote cities, dots beads. (a) Net configuration showing a spike caused by two non-neighboring beads converging onto one city. (b) and (c) show the two possible city orderings obtainable by effectively "deleting" one of these beads.
However, this requirement is not strictly necessary, since a configuration in which neighboring beads have converged on the same city still defines a perfectly valid tour. Suppose, however, that the beads are not neighbors: equation 3.1 still holds but the $|\Delta_j|$ terms can now become arbitrarily small. Hence $\beta/\alpha$ may need to be arbitrarily large to prevent the convergence of both beads and subsequent formation of a "spike" in the net (see Fig. 1). A tour configuration containing a spike is, strictly speaking, an illegal tour, since the city at the spike's base is visited twice, but a simple postprocessing operation can recover a legal tour (see Fig. 1). In summary, satisfying $\gamma > 1/2\mu$ should ensure that no city will have two neighboring beads close to it as $K \to 0$, but is no guarantee against spikes in the net. All that can be predicted from this analysis about the spike problem is that their frequency should be a decreasing function of $\gamma$. The issue of how to estimate $\mu$ for any particular problem will be deferred until the end of Section 4.1.
4 Stable Non-tour Configurations in the $K \to 0$ Limit
The previous section suggested that to avoid spikes $\gamma$ should be chosen "large." This section will demonstrate that such a policy can cause
other problems. To motivate what follows, observe that even in simple situations the algorithm can fail to find a net configuration that visits every city (see Fig. 2). Such failures can occur in situations where two (or more) cities lie close together. An insight into why this may happen can be gained using the result from DSY that, for $E$ to remain bounded, every city requires at least one bead within a distance $O(K^{1/2})$ of it. During the early stages of the algorithm a close pair of cities may not be resolvable on a length scale of $O(K^{1/2})$. Thus the system may commit only one bead to the region yet still be able to keep the energy contribution of both cities bounded. Later, as $K \to 0$ and the cities become resolved, the bead converges to the point midway between the cities. Figure 3 is a rough sketch of this. To prove the stability of the midpoint configuration, consider the situation of Figure 3(c) in detail. Let the cities lie at $(\pm\Delta, 0)$ and consider the component $E'$ of $E$ due to these cities as $K \to 0$. The contribution of the second term in the energy function 1.3 can be ignored since, being only of $O(\beta)$, it will be seen below to be negligible compared to the other term. Let the closest bead lie at $(x, y)$; the other more distant beads can be ignored since these have negligible weights with the two cities in the $K \to 0$ limit, a fact easily established from 1.2. Thus
(4.1)
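For reference, the form of $E'$ can be recovered from equation 1.3 under the stated assumptions. The following is a sketch of that calculation rather than a transcription of equation 4.1; it writes the city positions as $(\pm\Delta, 0)$ and keeps only the closest bead, at $(x, y)$, in each inner sum.

```latex
\begin{align*}
E' &\approx -\alpha K \left[ \ln e^{-\frac{(x-\Delta)^2 + y^2}{2K^2}}
                           + \ln e^{-\frac{(x+\Delta)^2 + y^2}{2K^2}} \right] \\
   &= \frac{\alpha}{2K}\left[ (x-\Delta)^2 + (x+\Delta)^2 + 2y^2 \right] \\
   &= \frac{\alpha}{K}\left( x^2 + y^2 + \Delta^2 \right).
\end{align*}
```

Under these assumptions the minimum value is $\alpha\Delta^2/K$, attained at $x = y = 0$.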
Figure 2: Failure of the algorithm on $N = 4$. Squares represent the cities [with coordinates of (.3,.7), (.7,.6), (.47,.3) and (.53,.3)], dots represent the beads. (a) Final configuration found using 10 beads, $\alpha = .05$, $\beta = 1.0$, with an initial state having $K > K_c$ and the beads configured in a small ring around the center of the cities; $K$ was reduced by 1% every 20 updates. A slower $K$ reduction schedule (1% every 100 updates) also found the same configuration. (b) Final configuration found using the same parameters as (a), except that here $\alpha = 0.1$.
Figure 3: Resolution argument to account for failure on close pairs of cities. Squares denote cities, dots beads, and the shaded disks are schematic representations of the $O(K^{1/2})$ zones of each city (see main text for details). $K_1 > K_2 > K_3$. (a) Situation at $K_1$, cities not resolvable. (b) Situation at $K_2$. (c) $K_3$, cities are resolved. The neighboring beads cannot move in because as $K \to 0$ their weights with the cities become negligible.
This shows that the bead lies in a radially symmetric quadratic well, the minimum of which is midway between the cities, and that the energy of the (stable) equilibrium configuration rises without bound as $K \to 0$. This disproves previous claims (Durbin et al. 1989) that in the limit of small $K$ all minima correspond to valid tours, since it shows the existence of energy minima corresponding to configurations in which some cities remain unvisited (a city is considered "visited" if, for any small distance $\epsilon$, some bead(s) can be found within $\epsilon$ of it in the $K \to 0$ limit).
4.1 Avoiding Non-tour Minima - $\beta/\alpha$ Revisited. There is a straightforward way to avoid such minima: simply use so many beads that the typical spacing between neighboring beads is much less than the minimum inter-city spacing. This strategy is, however, computationally inefficient, since the complexity per iteration is $O(NM)$. Instead, the analysis presented below will show that non-tour minima can be avoided by choosing $\gamma$ such that $\gamma < 1/2\mu$. Consider the stability of an equilibrium configuration in which two beads (labeled 1 and 2) lie near a close pair of cities, for small $K$: if the configuration remains stable as $K \to 0$ then each city can attract a bead to it; instability, however, leaves just one bead with the cities and subsequently the system becomes trapped in a non-tour energy minimum. Several simplifying assumptions shall be made here. First, these beads interact significantly only with these two cities. Similarly, these two cities interact significantly only with these beads. Second, the two cities are assumed to be coincident; this simplifies the analysis and also represents
the "hardest-case" local scenario for the algorithm in its attempts to have every city visited by a unique bead. Finally, there is the issue of the equilibrium distances, called $s_1$ and $s_2$, from the beads to the cities. The analysis below considers only the special case of $s_1 = s_2$; the general case result will be discussed later. We seek the conditions for which this equilibrium system is a local minimum, by examining the change in energy induced by local perturbations of the beads. Without loss of generality, let $s - \delta_1$ and $s + \delta_2$ denote the distances between the cities and beads 1 and 2 respectively, in the perturbed state. It will be helpful here to write $E = E_1 + E_2$ with $E_1 = -\alpha K \sum_i \ln \sum_j e^{-|x_i - y_j|^2/2K^2}$ and $E_2 = (\beta/2) \sum_j |y_{j+1} - y_j|^2$. To derive $\Delta E_1$, the change in $E_1$, let $C = \ln(e^{k(s-\delta_1)^2} + e^{k(s+\delta_2)^2})$, and observe that $C$ can be expressed as
(4.2)
where $\delta_m = \max(|\delta_1|, |\delta_2|)$. Setting $k = -1/2K^2$ and noting that at equilibrium 1.1 implies $s = \gamma\Delta K$ (where $\Delta = |\Delta_1| = |\Delta_2|$), therefore gives $\Delta E_1$ as
(4.3)
Writing $\delta_1 = \delta$, $\delta_2 = r\delta$ gives

$$\Delta E_1 = \alpha\gamma\Delta\delta(r - 1) - \frac{\alpha\delta^2}{4K}\left[(1 + r)^2\gamma^2\Delta^2 - 2(1 + r^2)\right] + O(\delta^3) \qquad (4.4)$$
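Equation 4.4 can be sanity-checked numerically. The sketch below assumes two coincident cities that interact only with the two beads, so that $E_1 = -2\alpha K C$, and uses the equilibrium relation $s = \gamma\Delta K$ from the text. The parameter values and variable names are arbitrary illustrative choices.

```python
import math

alpha, gamma, Delta, K = 0.2, 2.0, 0.4, 0.1   # arbitrary illustrative values
s = gamma * Delta * K                          # equilibrium bead-city distance
k = -1.0 / (2.0 * K**2)
delta, r = 1e-4, 0.5                           # delta_1 = delta, delta_2 = r * delta

def C(d1, d2):
    """ln of the summed Gaussian terms for bead distances s - d1 and s + d2."""
    return math.log(math.exp(k * (s - d1) ** 2) + math.exp(k * (s + d2) ** 2))

# Exact change in E_1: two coincident cities each contribute -alpha * K * C.
exact = -2.0 * alpha * K * (C(delta, r * delta) - C(0.0, 0.0))

first_order = alpha * gamma * Delta * delta * (r - 1.0)
second_order = -(alpha * delta**2 / (4.0 * K)) * (
    (1.0 + r) ** 2 * gamma**2 * Delta**2 - 2.0 * (1.0 + r**2))
approx = first_order + second_order            # right-hand side of 4.4 without O(delta^3)

print(exact, approx)
```

With these values the second-order term is needed to match the exact energy change; the first-order term alone leaves a visibly larger residual.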
Noting that, by definition, the first-order component of $\Delta E$ vanishes at an equilibrium state, and, by inspection of the $E_2$ definition, that the second-order component of $\Delta E_2$ will contain no $K$ dependence, we find
(4.5)
as $K \to 0$. As the minimum value of $2(1 + r^2)(1 + r)^{-2}$ is unity, one can therefore guarantee that $\Delta E$ is always positive, and hence that the configuration is stable, in the low $K$ limit, by choosing $\gamma$ such that $\gamma\Delta < 1$.
The corresponding condition for the general case (Simmen 1990), in which the beads have weights of $w$ and $(1 - w)$ with both cities, is
(4.6)
The right-hand term is a single-humped function, symmetrical about $w = 1/2$. As $w \to 0$ or 1 this function goes to zero, implying that $\gamma$ may unfortunately need to be chosen arbitrarily small to prevent instability. This is just a formal expression of the idea in Figure 3, that, once a single bead begins to dominate the interaction with the pair of cities this dominance tends to grow, so that as $K \to 0$ this bead is the only one close to the cities. The crucial point therefore is to prevent the emergence of a single dominant bead in the first place, by ensuring that configurations having two beads with comparable weights remain stable down to the $K \to 0$ limit.² Thus the special case of $s_1 = s_2$ (i.e., $w = 1/2$) is the most relevant one for getting a constraint on $\gamma$. Recall that in this case stability was guaranteed, provided $\gamma < 1/\Delta$. When the two beads are immediate neighbors in the net $|\Delta_j|$ is approximately $\mu$, whereas for cases in which the beads are not neighbors $|\Delta_j|$ can clearly range from approximately $2\mu$ down to zero, where $\mu$ is as defined in Section 2. Thus, the prediction of this analysis is that all non-tour minima can be avoided by selecting $\gamma$ such that $\gamma < 1/2\mu$. Two remarks should be made about the above energy analysis and its result. First, equation 4.6 can also be derived by modifying DSY's eigenvalue analysis of the two beads/one city configuration to the current two beads/two cities case. Similarly, an energy analysis of the two beads/one city case yields the instability condition found by the earlier DSY eigenvalue analysis. This correspondence arises since, whereas the energy analysis determines directly whether the equilibrium state is a local minimum, DSY's analysis does this indirectly, essentially by investigating the eigenvalues of the Hessian.
Second, the fact that $1/2\mu$ is an upper bound on $\gamma$ for the stability of the two beads/two cities case as well as the lower bound for the instability of the one city/two (neighboring) beads case arises because these two cases are clearly mathematically related.

In summary, $1/2\mu$ emerges as an important value for the parameter $\gamma$ (or $\beta/\alpha$). Choosing $\gamma$ below $1/2\mu$ risks creating spikes in the net as well as the lesser problem of neighboring beads converging on the same city, whilst setting $\gamma$ above $1/2\mu$, though it decreases the likelihood of spikes, risks the system finding a non-tour minimum. Since $\mu$ is the average separation between neighboring beads, it can be estimated given some prior estimate of the tour length. For instance, one can use the result of Beardwood et al. (1959), that for $N$ cities drawn from a uniform random distribution in some bounded 2D region of unit area the optimal tour has length $c\sqrt{N}$, with $c \approx 0.75$, in the $N \to \infty$ limit, to give a crude tour length estimate even for non-asymptotic $N$.

*Or, in the case of a close but noncoincident pair of cities, stable down to the $K$ value at which each bead converges to a specific city.
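The estimate described above can be turned into a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the region is assumed to have unit area, and $1/2\mu$ is read as $1/(2\mu)$ with $\mu$ taken to be the estimated tour length divided by the number of beads $M$.

```python
import math


def beardwood_tour_length(n_cities, c=0.75):
    """Crude tour length estimate c * sqrt(N) for N cities in a unit-area region."""
    return c * math.sqrt(n_cities)


def gamma_threshold(n_cities, n_beads, c=0.75):
    """Return 1/(2*mu), where mu is the estimated mean spacing between beads.

    Assumption: mu = (estimated tour length) / (number of beads).
    """
    mu = beardwood_tour_length(n_cities, c) / n_beads
    return 1.0 / (2.0 * mu)


if __name__ == "__main__":
    # Example sizes only; any N and M can be substituted.
    print(gamma_threshold(50, 125))
```

With $c = 0.75$ the threshold reduces to $2M/(3\sqrt{N})$, so doubling the number of beads doubles the estimated boundary value of $\gamma$.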
5 Simulations and Discussion
Simulations were performed to test whether the algorithm's behavior varied with $\gamma$ in the predicted manner. Five TSPs with $N = 50$ and five with $N = 200$ were studied, with all the coordinates drawn randomly from within the unit square. Every city set was run with a range of $\gamma$ values, and the number of spike defects and "frozen bead" defects (i.e., single beads trapped in high energy, non-tour minima) present at the end of each run was recorded; the results are presented in Figure 4. The $\gamma$ values were chosen relative to $\gamma'$, where $\gamma'$ denotes the value of $1/2\mu$ using the Beardwood estimate (i.e., $\gamma' = 2MN^{-1/2}/3$). Based on the analysis of DSY, the initial value of $K$ was chosen to be $K_c$, where $K_c$ is the positive root of

$$4K^3\gamma \sin^2(\pi/M) + K^2 N/M - \lambda/M = 0 \qquad (5.1)$$

and $\lambda$ is the principal eigenvalue of the city distribution's matrix of second-order moments. Note that this differs from the original $K_c$ prescription of DSY due to several algebraic errors in DSY. $K$ was reduced by 1% every 10 updates. Further technical details are given in the legend to Figure 4.

The Figure 4 plots give consistent support to the analytical predictions. Spike and frozen bead defects dominate the low $\gamma$ and high $\gamma$ regimes respectively, with $1/2\mu$ marking the approximate boundary between the two regimes; note that, as expected, some spikes still occur above $1/2\mu$ (Fig. 4d). Qualitatively, the division into two regimes can be understood from the roles of $\alpha$ and $\beta$ as the coefficients of competing terms in the energy function. A low value of $\beta/\alpha$ emphasizes moving the beads closer to the cities rather than minimizing the net length, hence it may lead to cities being visited by more than one bead; a high value of $\beta/\alpha$ does the opposite, so may lead to some cities remaining unvisited.

Figure 4a-c shows that increasing $M$, for fixed $N$, substantially reduces the number of frozen beads. This trend is understandable since, as mentioned in Section 4.1, what influences whether a close pair of cities develops a frozen bead defect is not the inter-city distance itself but rather this distance relative to the typical spacing between beads. Increasing $M/N$ appears to have little effect on spiking, except at very low $\gamma$ where it helps slightly.

The analyses given here and in DSY plus the empirical evidence of Figure 2 strongly suggest that many defects develop because of the intrinsic structure of the energy landscape, and therefore will not just disappear by annealing more slowly. This was confirmed by runs reducing $K$ ten times more slowly than in the Figure 4 simulations, which showed no significant change in the number of defects produced (data not shown).
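The starting value of $K$ described above is the positive root of a cubic, which is easy to obtain numerically. The following is a minimal sketch rather than the author's code: the function name is hypothetical, the cubic is read as $4K^3\gamma\sin^2(\pi/M) + K^2N/M - \lambda/M = 0$, and the "matrix of second-order moments" is assumed to be the unnormalized scatter matrix of the cities about their centroid.

```python
import numpy as np


def critical_k(cities, n_beads, gamma):
    """Positive root of 4 K^3 gamma sin^2(pi/M) + K^2 N/M - lambda/M = 0."""
    cities = np.asarray(cities, dtype=float)
    n_cities = len(cities)
    # Assumption: second-order moments taken about the centroid, not divided by N.
    centered = cities - cities.mean(axis=0)
    lam = np.linalg.eigvalsh(centered.T @ centered)[-1]  # principal eigenvalue
    coeffs = [
        4.0 * gamma * np.sin(np.pi / n_beads) ** 2,  # K^3 coefficient
        n_cities / n_beads,                          # K^2 coefficient
        0.0,                                         # no linear term
        -lam / n_beads,                              # constant term
    ]
    roots = np.roots(coeffs)
    real_roots = roots[np.abs(roots.imag) < 1e-9].real
    # The coefficient signs (+, +, -) change once, so exactly one root is positive.
    return float(real_roots[real_roots > 0][0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cities = rng.random((50, 2))
    print(critical_k(cities, n_beads=125, gamma=1.0))
```

Because the cubic and quadratic coefficients are positive and the constant term is negative, the positive root is unique, so no further selection among roots is needed.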
Martin W. Simmen
[Figure 4, panels (a) N=50, M=75; (b) N=50, M=125; (c) N=50, M=250; (d) N=200, M=500. Each panel plots the number of spike defects and frozen bead defects against $\gamma/\gamma'$ over the range 0.0 to 3.0.]
Figure 4: Frequency of tour defects as a function of $\gamma/\gamma'$. Each data point represents the total number of defects (of one type), summed over five TSP instances, found in the final net configurations. In all the simulations, $\beta$ was fixed at 1.0 and the beads were initially placed in a ring of radius 0.05 around the center of the cities (starting with all the beads exactly at the center causes problems, because when $K$ is slightly below $K_c$ the gradients there are very small and so the system requires a large number of iterations before settling into an energy minimum). Simulations were terminated when either of two criteria was satisfied: (1) if, $\forall i$, $\max_j(w_{ij}) > 0.95$, followed by a further reduction of $K$ by a decade to allow final settling, or (2) when $K < 0.01\mu$, with $\mu$ calculated using the Beardwood estimate discussed in the main text. A spike occurs where a city has significant interactions (here taken to mean $w_{ij} > 0.3$) with two or three noncontiguous beads. A bead $k$ is frozen if it is the bead nearest to two or more cities, that is, if there are two or more cities $i$ for which $\max_j(w_{ij}) = w_{ik}$. The simulations were performed on an AMT DAP parallel computer.
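The two defect definitions in the legend can be expressed as a short counting routine. This is a sketch under stated assumptions, not the code used for the figure: the function name is hypothetical, `w` is assumed to be an $N \times M$ array of normalized city-to-bead weights, and "noncontiguous" is interpreted as "the strongly interacting beads do not form one unbroken run on the ring".

```python
import numpy as np


def count_defects(w, threshold=0.3):
    """Count spike and frozen-bead defects from an (N cities, M beads) weight array."""
    n_cities, n_beads = w.shape
    spikes = 0
    for i in range(n_cities):
        strong = np.flatnonzero(w[i] > threshold)  # beads with significant interaction
        if strong.size < 2:
            continue
        # Cyclic gaps between successive strong beads around the ring.
        gaps = np.diff(np.append(strong, strong[0] + n_beads))
        # k beads form one contiguous run only if at least k-1 gaps equal 1.
        if np.count_nonzero(gaps == 1) < strong.size - 1:
            spikes += 1
    # A bead is frozen if it is the nearest (largest-weight) bead for >= 2 cities.
    nearest = np.argmax(w, axis=1)
    cities_per_bead = np.bincount(nearest, minlength=n_beads)
    frozen = int(np.count_nonzero(cities_per_bead >= 2))
    return spikes, frozen
```

Since each row of weights sums to one, at most three beads can exceed the 0.3 threshold for any city, which matches the "two or three" wording of the legend.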
Elastic Net Approach to TSP
Of course, if $K$ is reduced so rapidly that the network has insufficient time ever to relax into local minima (the physical analogy here is of a system cooled too rapidly to allow equilibration at any temperature) then naturally many more defects develop, including frozen beads for $\gamma < 1/2\mu$.

In summary, to avoid defects $\gamma$ is best chosen to be approximately $1/2\mu$, or perhaps slightly above this if $M/N$ is large. If legal tours can be successfully recovered from net configurations with defects using postprocessing, then other properties, most obviously the tour length, may well be optimized by some other choice of $\gamma$. However, this issue has not been addressed in the current study. (Within the range of $\gamma$ values that gave nets with no defects, inspection of the tour lengths showed them to be fairly insensitive to the precise value of $\gamma$. Nets postprocessed to remove spikes tended to give slightly longer tours. Thus the empirical evidence suggests that selecting $\gamma$ to avoid defects is also a sensible policy with regard to finding short tours.)

This work also ties in with Simic's (1990) observation that the elastic net only solves the "correct" problem when $M \gg N$. We see here that the consequences of not having $M \gg N$ can include not just sub-optimal tours, but also the possibility (in a particular region of parameter space) of finding net configurations that do not correspond to valid tours at all. The elastic net algorithm is not unique in regard to the possibility of convergence to non-tour configurations - this also happens in the original Hopfield and Tank (1985) algorithm as well as in Peterson and Soderberg's (1989) improved version of it. It should be noted though that the problem is far less serious (and can be controlled by analytically guided parameter choice) in the elastic net and Peterson and Soderberg algorithms than in Hopfield and Tank's. This is because the energy function used by Hopfield and Tank has fewer constraints built into it than do the energy functions used by the other two algorithms (Peterson and Soderberg 1989; Simic 1990; Yuille 1990).

6 Conclusions
Three particular issues regarding the performance of the elastic net algorithm on the TSP have been addressed here. First, by extending the analysis of Durbin et al. (1989), the problem of cities being visited twice by non-neighboring beads was examined. Second, it was proved that, in the $K \to 0$ limit, there exist high energy local minima in which some cities remain unvisited by the net. Finally, the parameter regime in which the algorithm might find one of these non-tour minima was derived. This allowed a decent prescription to be given for the $\beta/\alpha$ parameter value most likely to produce valid tours. Simulations were found to support the details of the analysis in all of these areas. In practical terms, in using the elastic net on the TSP one should therefore choose $K_c$ as the initial value of $K$, set $M$ as large as is feasible given the available computing resources, and finally select the $\beta/\alpha$ value in the way described here to avoid defects.
Acknowledgments
I wish to thank Peter Dayan for valuable criticisms of an early draft of this paper, plus David Wallace and David Willshaw for helpful comments on a later draft. Thanks are also due to the referee for useful remarks. I acknowledge the support of the SERC through a Studentship and the Edinburgh Parallel Computing Centre for time on the AMT DAP facility.
References
Angeniol, B., de la Croix Vaubois, G., and le Texier, J-Y. 1988. Self-organizing feature maps and the travelling salesman problem. Neural Networks 1, 289-293.
Beardwood, J., Halton, J. H., and Hammersley, J. M. 1959. The shortest path through many points. Proc. Cambridge Phil. Soc. 55, 299-327.
Brady, R. M. 1985. Optimization strategies gleaned from biological evolution. Nature (London) 317, 804-806.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the traveling salesman problem. Neural Comp. 1, 348-58.
Garey, M. R., and Johnson, D. S. 1987. Computers and Intractability. Freeman, San Francisco.
Hopfield, J. J., and Tank, D. W. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Mühlenbein, H., Gorges-Schleuter, M., and Kramer, O. 1988. Evolution algorithms in combinatorial optimization. Parallel Comput. 7, 65-85.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3-22.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimisations. Network 1, 89-103.
Simmen, M. W. 1990. Unpublished.
van Laarhoven, P. J. M., and Aarts, E. H. L. 1987. Simulated Annealing: Theory and Applications. D. Reidel, Dordrecht.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybern. 58, 63-70.
Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Received 30 November 1990; accepted 23 May 1991
Communicated by Michael Jordan
FIR and IIR Synapses, a New Neural Network Architecture for Time Series Modeling
A. D. Back
A. C. Tsoi
Department of Electrical Engineering, University of Queensland, Queensland 4072, Australia
A new neural network architecture with either a local-feedforward global-feedforward structure, a local-recurrent global-feedforward structure, or both is proposed. A learning rule minimizing a mean square error criterion is derived. The performance of this algorithm (local-recurrent global-feedforward architecture) is compared with a local-feedforward global-feedforward architecture. It is shown that the local-recurrent global-feedforward model performs better than the local-feedforward global-feedforward model.

1 Introduction
A popular class of neural network architecture, in particular the multilayer perceptron (MLP), may be considered as providing a nonlinear mapping between an input vector and a corresponding output vector (Lippmann 1987). From a set of input and output vectors, an MLP with a given number of hidden layer neurons may be trained by minimizing a least mean square (LMS) cost criterion. Most work in this area has been devoted to obtaining this nonlinear mapping in a static setting, that is, the input-output pairs are independent of one another. Many practical problems may be modeled by such static models, for example, the XOR problem and handwritten character recognition. On the other hand, many practical problems such as time series forecasting and control plant modeling require a dynamic setting, that is, the current output depends on previous inputs and outputs.

There have been a number of attempts to extend the MLP architecture to encompass this class of problems. For example, Lapedes and Farber (1987) used an MLP architecture with linear output units, rather than nonlinear output units. The linear output units allow the output values to be real rather than discrete as in classification problems. Waibel et al. (1989) used a time delay neural network architecture that involves successive delayed inputs to each neuron. All these attempts use only a feedforward architecture, that is, no feedback from later layers to previous layers. There are other approaches that involve feedback from either the hidden layer or from the output layer to the input layer (Jordan 1988). This class of network is known broadly as recurrent networks. In one way or another, all these approaches attempt to incorporate some kind of contextual information (in our case, the dynamic nature of the problem is the context required) in a neural network structure.

However, these are not the only neural network architectures that can incorporate contextual information. In this paper we will consider a class of network that may be considered as intermediate between a (global) feedforward architecture and a (global) recurrent architecture. We introduce architectures that may have local recurrent nature, but have an overall global feedforward construction. Our contribution is the derivation of a training algorithm that is based on linear adaptive filtering theory. The work presented here is similar to Robinson's (1989), except that in his network the feedback occurs globally, whereas in ours the feedback is local to each synapse. It is shown by simulation that networks employing this local-feedback architecture perform better than those with only local feedforward characteristics.

The structure of the paper is as follows: in Section 2, a network architecture is introduced. In Sections 3 and 4 training algorithms for the FIR synapse case and IIR synapse case, respectively, are derived (the nomenclature will be clarified in Section 2). In Section 5, the performance of an IIR synapse case is compared against an FIR synapse.

Neural Computation 3, 375-385 (1991) © 1991 Massachusetts Institute of Technology

2 A Network Architecture
In a traditional MLP architecture, each synapse is considered as having a constant weight. Using the same methods as introduced by Lapedes and Farber (1987), the dependency of current outputs on previous inputs may be modeled using the following synaptic model.

$$y(t) = \sum_{j=0}^{M} b_j z(t-j) \qquad (2.1)$$

where $y(t)$ is the synapse output at time $t$, $z(t - j)$ is a delayed input to the synapse, and $b_j$, $j = 0, 1, 2, \ldots, M$ are constants. This synaptic weight is the same as a finite impulse response (FIR) filter in digital filter theory. As a result, we will call this synapse an FIR synapse (Fig. 1). On the other hand, the output may be dependent on both the previous inputs and outputs. In this case, we have the following model. Let $q^{-j} x(t) = x(t - j)$. Then
(2.2)
Figure 1: An FIR synapse.
Figure 2: An IIR synapse.
This is called an infinite impulse response (IIR) synapse (Fig. 2). An MLP may use FIR synapses, IIR synapses, or both. Note that this type of network is still globally feedforward in nature, in that it has a global feedforward structure, with possibly local recurrent features (for IIR synapses). Thus, in the FIR synapse case, we will have a local-feedforward global-feedforward architecture, while in the IIR synapse case, we will have a local-recurrent global-feedforward architecture. A more complicated structure would be one involving both FIR synapses and IIR synapses. Figure 3 shows the neuron structure. Consider an $L + 1$ layer network. Each layer consists of $N_l$ neurons. Each neuron $i$ has an output at time $t$ denoted $z_i^l(t)$, where $l$ is the index for the layer, $l = 0$ denotes the input layer, and $l = L$ denotes the output layer.
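The difference between the two synapse types can be illustrated with a pair of short filters. This is a sketch only: the function names are hypothetical, inputs before $t = 0$ are assumed to be zero, and the sign convention for the feedback coefficients (added rather than subtracted) is an assumption, since it cannot be confirmed from the text here.

```python
import numpy as np


def fir_synapse(z, b):
    """FIR synapse: y(t) = sum_j b[j] * z(t - j), with z(t) = 0 for t < 0."""
    z = np.asarray(z, dtype=float)
    y = np.zeros_like(z)
    for t in range(len(z)):
        for j, bj in enumerate(b):
            if t - j >= 0:
                y[t] += bj * z[t - j]
    return y


def iir_synapse(z, b, a):
    """IIR synapse: FIR part plus local feedback of past outputs.

    Assumed convention: y(t) = sum_j b[j] z(t-j) + sum_{j>=1} a[j-1] y(t-j).
    """
    z = np.asarray(z, dtype=float)
    y = np.zeros_like(z)
    for t in range(len(z)):
        feedforward = sum(bj * z[t - j] for j, bj in enumerate(b) if t - j >= 0)
        feedback = sum(aj * y[t - j] for j, aj in enumerate(a, start=1) if t - j >= 0)
        y[t] = feedforward + feedback
    return y


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    print(fir_synapse(impulse, b=[0.5, 0.25]))     # response dies after two samples
    print(iir_synapse(impulse, b=[1.0], a=[0.5]))  # response decays geometrically
```

Feeding an impulse through each filter shows the naming: the FIR response is zero after as many samples as there are coefficients, whereas the IIR response never becomes exactly zero.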
Figure 3: A neuron showing IIR synapses.
An MLP with FIR synapses can be modeled as follows:
$$z_k^{l+1}(t) = f[x_k^{l+1}(t)] \qquad (2.3)$$
(2.4)
where
(2.5) (2.6)
$k = 1, 2, \ldots, N_{l+1}$ (output layer index) (2.7)
$l = 0, 1, 2, \ldots, L$ (2.8)
$M$ = number of delayed inputs to a neuron (2.9)
and a bias term is included (2.10)
Note that we have made the simplifying assumption that each neuron receives the same number, $M$, of delayed inputs from the previous layer. This number can be made to vary for each neuron, but that generality is not used here since it would add an unnecessary burden to the notation. An MLP with IIR synapses can be modeled as follows:
(2.11)
(2.12)
where
(2.13)
denotes the local feedback in each synapse. All the other notation is the same as for the FIR case.

3 Derivation of a Training Algorithm for an FIR MLP
Let the instantaneous error be
(3.1)
where $p(t)$ is the desired output at time $t$. The weight changes can be adjusted using a simple gradient method

$$b_{ikj}^l(t+1) = b_{ikj}^l(t) + \Delta b_{ikj}^l(t)$$

The learning rule for the output layer weights is
(3.4)
$$= \eta f'[x_k^L(t)]\, z_i^{L-1}(t-j)\, e_k(t), \qquad 1 \le j \le M \qquad (3.5)$$

The learning rule for the hidden layers can be obtained using a chain rule as
(3.6)
$$= \eta\, \delta_k^{l+1}(t)\, z_i^l(t-j) \qquad (3.7)$$
(3.9)
These equations define an LMS weight adjusting algorithm. It is quite easy to modify the gradient learning rule to incorporate a momentum term.
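To make the LMS idea concrete, the following sketch trains the coefficients of a single FIR synapse by gradient descent on the instantaneous squared error, with an optional momentum term. It is a simplified illustration, not the multilayer algorithm derived above: there is no sigmoid, only one synapse, and the function and parameter names (`lms_fir`, `eta`, `momentum`) are hypothetical.

```python
import numpy as np


def lms_fir(z, d, order, eta=0.05, momentum=0.0):
    """Fit FIR taps b[0..order] so that sum_j b[j] z(t-j) tracks d(t)."""
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)
    b = np.zeros(order + 1)
    step = np.zeros_like(b)
    for t in range(order, len(z)):
        window = z[t - order:t + 1][::-1]  # z(t), z(t-1), ..., z(t-order)
        err = d[t] - b @ window            # instantaneous error
        # Gradient step on 0.5 * err**2, plus a fraction of the previous step.
        step = eta * err * window + momentum * step
        b += step
    return b


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.standard_normal(5000)
    true_taps = np.array([0.5, -0.3, 0.2])
    d = np.convolve(z, true_taps)[:len(z)]  # noise-free target from a known filter
    print(lms_fir(z, d, order=2))
```

In this noise-free setting the estimated taps approach the taps that generated the target; with a step size that is too large the same loop diverges, which is the usual trade-off when choosing the gain.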
Notice that while the FIR MLP model is nonlinear, the weight updating rules are linear in the unknown parameters. This property implies that the weight updating rules will converge to a minimum, not necessarily the global minimum, in the mean square error surface of the weight space. The derived updating rules for the FIR case are not new, but are given for completeness, and serve as a background for the derivation of the IIR synapse to be considered in the next section.

4 Derivation of a Training Algorithm for an IIR MLP
A training algorithm for an MLP consisting of IIR synapses can be obtained by minimizing the cost criterion 3.1. For the output layer, we have
(4.1)
For the $\Delta a_{ikj}^l(t)$ parameter,
(4.3) (4.4)
where
(4.6)
(4.7)
For the hidden layers we have
(4.8) (4.9)
where
(4.10)
The updating rule for $a_{ikj}^l(t)$ is given by
(4.11) (4.12)
where $\delta_k^l$ is defined in 4.10. Equations 4.2-4.12 form the complete set of updating rules for the IIR MLP. It is quite simple to incorporate a momentum term in the gradient update rules. Note that, in contradistinction with the FIR MLP case, the IIR updating rules are nonlinear in the parameters. Hence, there is no guarantee that the training algorithm will converge. Indeed, from our own experience, for an unsuitably large gain $\eta$ the algorithm may explode. The problem of instability that is normally present with linear IIR filters does not arise in the same way with the model presented here. The maximum output from each neuron is limited by the sigmoidal function, thus giving a bounded output (the weights should also be bounded). The usual stability monitoring devices such as pole reflection and weight freezing used in the linear case are therefore not necessary for this model.

5 Simulations
We tested the performance of the FIR MLP and IIR MLP on the following plant
where $x(t)$ is a zero mean white noise source, low-pass filtered with a cut-off frequency of 7 rad/sec, with $a_1 = 0.8227$, $a_2 = -0.9025$, and $D_1 = 0.99$. These parameters are chosen to highlight the dynamics of the system and its nonlinearity. For the FIR MLP, we have chosen $L = 2$, $N_l = 6$ ($l = 1$), $N_L = 1$. At the hidden layer we selected $M = 11$ and $\eta = 0.0001$, and for the output layer $M = 1$ and $\eta = 0.005$. Zero bias was used throughout. The simulation was run for $5 \times 10^6$ data points. After training we tested the learned weights on a new data set of 1000 points. The mean square prediction error for the test set was 0.0664 and the variance was 0.0082. The results of the simulation are shown in Figures 4 and 5. It is observed that the plant and model outputs, while following one another, differ significantly at some points; this is indicated more clearly in Figure 5.

Figure 4: The plant output, and the response from an FIR MLP with architecture described in the text.

We have also used the IIR network to model the plant given by 5.1. The architecture of the model is the same as for the FIR case, except that $M = 5$ and $N = 6$ in the hidden layer, and in the output layer $N = 1$ with $a_{ikj}^l = 0\ \forall\, i, k, j$ ($l = 2$). In this case the mean square error over the test set was 0.0038 and the variance was 0.000012. The results of the simulation are shown in Figures 6 and 7. In Figure 6, it is observed that the response of the IIR MLP is much closer to the plant. This is revealed in the error plot of Figure 7.
6 Conclusions
We have investigated a class of neural networks that has a globally feedforward architecture with locally recurrent nature. A training algorithm has been derived that can be seen as an extension of the FIR MLP and the more widely used static MLP. It is shown, by simulation, that the IIR MLP is a better model than the FIR MLP for modeling a nonlinear plant. It is almost trivial to modify the algorithm into a recursive second-order gradient algorithm (Kalman type filter) of the kind used in the traditional adaptive identification or control literature. As indicated, our algorithm is a recursive first-order gradient algorithm. While there is a certain advantage in using a Kalman filter type second-order gradient algorithm, the added computational complexity slows the computation considerably. Hence, in the work reported here we only show the performance of the first-order gradient method. It would be interesting to compare the performance of the IIR MLP model with a fully recurrent model. This will be presented in future work.

Figure 5: The error between the plant output, and the response of the FIR MLP.

Figure 6: The plant output, and the response from an IIR MLP with architecture described in the text.

Figure 7: The error between the plant output, and the response of the IIR MLP.

Acknowledgments
The first author is supported by a Research Fellowship with the Electronics Research Laboratory, DSTO, Australia.

References
Jordan, M. I. 1988. Supervised learning and systems with excess degrees of freedom. COINS Tech. Rep. 88-27, MIT.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modelling. Tech. Rep. LA-UR87-2662, Los Alamos National Laboratory.
Lippmann, R. P. 1987. An introduction to computing with neural networks. IEEE ASSP Mag. 4, 4-22.
Robinson, A. J. 1989. Dynamic error propagation networks. Ph.D. dissertation, Cambridge University Engineering Department.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. 1989. Phoneme recognition using time-delay neural networks. IEEE ASSP 37, 328-339.
Received 3 December 1990; accepted 28 January 1991.
Transition to Perfect Generalization
seems that the error on future examples is linearly proportional to the inverse sample size (and in fact, for the uniform distribution, with proportionality constant equal to one). Recent papers of Gardner and Derrida (1989), Gyorgyi (1990), and Sompolinsky et al. (1990) have, however, suggested a novel phenomenon when the target function and the hypothesis function are both simple perceptrons (with no hidden units) whose weight values are $\pm 1$, and the distribution $D$ is uniform over the vectors $\{+1,-1\}^n$. Using methods of statistical physics, in the $n \to \infty$ limit, they argue that for $m > 1.24n$, perfect generalization is achieved; that is, there is no choice of hypothesis function other than the target function itself that is consistent with the data set. The purpose of this note is to show rigorously the existence of a constant $\alpha_c$ such that for $m > \alpha_c n$ this phenomenon of perfect generalization is achieved.

Thus we consider the following problem. Let $w_t \in \{+1,-1\}^n$ be the target perceptron. We see a set of $m$ examples. Here each example is a vector $x$ drawn uniformly from $\{+1,-1\}^n$ and classified as positive or negative according to whether $w_t \cdot x \ge 0$ or not. We show that if $m \ge 2.0821n$ the probability is less than $2^{-(\sqrt{n})}$ that there is any perceptron $y \in \{+1,-1\}^n$ other than $w_t$ that consistently classifies the examples, i.e., such that $y \cdot x \ge 0$ for positive examples and $y \cdot x < 0$ for negative examples in our set.

2 Definitions and Terminologies
The number of examples will be denoted by $m = \alpha n$, and vectors $x \in \{+1,-1\}^n$ by $(x_1, \ldots, x_n)$. Assume $n > 0$. Let the target perceptron be, without loss of generality,

$$w_t = (\underbrace{+1, +1, \ldots, +1}_{n})$$

and

$$y_i = (\underbrace{-1, -1, \ldots, -1}_{i}, +1, \ldots, +1)$$

for $n \ge i \ge 1$. Due to symmetry, we will without loss of generality use $y_i$ for potential target perceptrons with Hamming distance $i$ from $w_t$ in the calculation of probabilities. We adopt the convention that

$$\binom{r}{k} = 0$$

for $k < 0$ and $k > r$.
Definition 2.1. For $1 \le i$ and $i + j \ge 0$, define

$$P_i(j) = 2^{-i}\binom{i}{(i+j)/2}$$
E. B. Baum and Y-D. Lyuu
$P_i(j)$ is exactly the probability that a random $x \in \{+1,-1\}^i$ satisfies $\sum_{k=1}^{i} x_k = j$. Note that $P_i(j) = P_i(-j)$.
Definition 2.2. Let $P(i)$ denote the probability that a $y \in \{+1,-1\}^n$ with exactly $i$ $(-1)$-components misclassifies a random $x \in \{+1,-1\}^n$.
Remark 2.3. Clearly, $0 = P(0) \le P(1) \le P(2) \le \cdots \le P(n)$. Note that $P(n) < 1$ when $n$ is even.
Definition 2.4. Define

$A$ upper-bounds the probability that there exists a non-$w_t$ vector $y \in \{+1,-1\}^n$ that consistently classifies $m$ random examples. We will show that $A = 2^{-(\sqrt{n})}$ for $m > 2.0821n$.
Definition 2.5. $h(n) \sim g(n)$ means $h(n)/g(n) \to 1$ as $n \to \infty$; $h(n) \approx g(n)$ means $h(n) - g(n) \to 0$ as $n \to \infty$.
3 Preliminaries
Lemma 3.1. $y_i$ misclassifies an x if and only if
$$0 \le \sum_{j=1}^{n} x_j < 2\sum_{j=1}^{i} x_j \quad\text{or}\quad 2\sum_{j=1}^{i} x_j \le \sum_{j=1}^{n} x_j < 0.$$
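Proof. By definition, $y_i$ misclassifies x iff (1) $\sum_{j=1}^{n} x_j \ge 0$ and $y_i \cdot x < 0$, or (2) $\sum_{j=1}^{n} x_j < 0$ and $y_i \cdot x \ge 0$. This is equivalent to (1)' $\sum_{j=1}^{n} x_j \ge 0$ and $-\sum_{j=1}^{i} x_j + \sum_{j=i+1}^{n} x_j < 0$, or (2)' $\sum_{j=1}^{n} x_j < 0$ and $-\sum_{j=1}^{i} x_j + \sum_{j=i+1}^{n} x_j \ge 0$. Simple manipulations then yield the lemma. $\Box$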
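The condition of Lemma 3.1 can be checked exhaustively for small n. The sketch below is only an illustration: the function names are hypothetical, and it assumes the conventions stated above (target $w_t = (+1,\ldots,+1)$, $y_i$ with its first i components equal to $-1$, and an example labeled positive when the dot product is nonnegative).

```python
from itertools import product

def misclassifies(y, x):
    """True if y labels x differently from the target w_t = (+1, ..., +1)."""
    target_positive = sum(x) >= 0                       # w_t . x >= 0
    y_positive = sum(a * b for a, b in zip(y, x)) >= 0  # y . x >= 0
    return target_positive != y_positive

def lemma_condition(i, x):
    """Condition of Lemma 3.1 for y_i, whose first i components are -1."""
    total = sum(x)      # sum over all n components
    head = sum(x[:i])   # sum over the first i components
    return (0 <= total < 2 * head) or (2 * head <= total < 0)

def check_lemma(n):
    """Compare the definition with the lemma for every i and every x."""
    for i in range(1, n + 1):
        y = (-1,) * i + (+1,) * (n - i)
        for x in product((+1, -1), repeat=n):
            if misclassifies(y, x) != lemma_condition(i, x):
                return False
    return True
```

The enumeration costs $n 2^n$ evaluations, so it is meant for sanity checks with n of at most about 15.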
The following two facts will be useful tools.
Fact 3.2. [(p. 4, Bollobás 1985)] If $m \to \infty$ and $n - m \to \infty$, then
$$\binom{n}{m} \sim \frac{1}{\sqrt{2\pi p q n}}\, p^{-pn} q^{-qn},$$
where $p = m/n$ and $q = 1 - p$.
Fact 3.3. [Vandermonde's convolution (Table 169, Graham et al. 1989)] For integers m and n,
$$\sum_k \binom{r}{m+k}\binom{s}{n-k} = \binom{r+s}{m+n}.$$
Remark 3.4. Observe that when r = s in the above equation, the first term will be equal to the last, the second to the second to last, etc. Lemma 3.5. For an even integer a,
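Proof. By Fact 3.2. $\Box$
We will assume n is even for the rest of this paper. Let $1 \le i < n$. With help from Lemma 3.1, $P(i)$ can be calculated:
$$P(i) = 2\sum_{k=1}^{i} P_i(k) \sum_{l=-k}^{k-1} P_{n-i}(l).$$
Hence, for i even, we have, by Definition 2.1,
$$P(i) = \frac{1}{2^{n-1}} \sum_{k=2,4,\ldots,i} \binom{i}{(i+k)/2} \sum_{l=-k,-k+2,\ldots,k-2} \binom{n-i}{(n-i+l)/2} \qquad (3.1)$$
and, for i odd, we have
$$P(i) = \frac{1}{2^{n-1}} \sum_{k=1,3,\ldots,i} \binom{i}{(i+k)/2} \sum_{l=-k,-k+2,\ldots,k-2} \binom{n-i}{(n-i+l)/2}. \qquad (3.2)$$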
In general, the above complicated formulae have no closed-form solutions. Good estimates or exact solutions of $P(i)$ can be obtained for specific values of i, however, as the next two corollaries demonstrate.
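Although no closed form is available, $P(i)$ is easy to evaluate numerically. The following sketch is a hedged illustration, not the authors' program. It assumes that $P(i)$ equals twice the sum over $k \ge 1$ of $\Pr[\text{first } i \text{ bits sum to } k]$ times $\Pr[-k \le \text{sum of the remaining } n-i \text{ bits} \le k-1]$, which is what Lemma 3.1 gives after using the symmetry $P_i(j) = P_i(-j)$. The function names are hypothetical, and the result is compared with a brute-force count.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binom(r, k):
    """Binomial coefficient with the convention that it is 0 for k < 0 or k > r."""
    return comb(r, k) if 0 <= k <= r else 0

def p_formula(n, i):
    """P(i) from the double sum over k (first i bits) and l (remaining n - i bits)."""
    total = 0
    for k in range(1, i + 1):
        if (i + k) % 2:            # a sum of i bits has the parity of i
            continue
        for l in range(-k, k):     # -k <= l <= k - 1
            if (n - i + l) % 2:    # a sum of n - i bits has the parity of n - i
                continue
            total += binom(i, (i + k) // 2) * binom(n - i, (n - i + l) // 2)
    return Fraction(2 * total, 2 ** n)

def p_brute(n, i):
    """P(i) by enumerating all 2^n examples."""
    y = (-1,) * i + (+1,) * (n - i)
    bad = sum(
        (sum(x) >= 0) != (sum(a * b for a, b in zip(y, x)) >= 0)
        for x in product((+1, -1), repeat=n)
    )
    return Fraction(bad, 2 ** n)
```

Exact rational arithmetic is used so that the two computations can be compared with `==` rather than with a floating-point tolerance.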
Corollary 3.6. $P(1) \approx \sqrt{2/(\pi n)}$.
Proof. Apply equation 3.2 with $i = 1$ to get
$$P(1) = \frac{1}{2^n}\binom{n}{n/2},$$
then use Lemma 3.5 to show $P(1) \sim \sqrt{2/(\pi n)}$. Now, since $P(1) \to 0$, $P(1) \approx \sqrt{2/(\pi n)}$. $\Box$
Corollary 3.7. For n / 2 even,
for 11/2 odd, P(n/2)=
1 2 ~
*(:)I
Proof. First, assume n / 2 is even. From (l),P(n/2) is equal to
(;$) (.i) +(jI+($
I(;)]
Use column-wise summation ke., the ith term, if any, of each row above is added together) and then Fact 3.3 to get
/=1
+q)( =2'-'l
g -1+1 ;
{;[(;)-()I /=1
g-1-1
Y
)
1 I= 1
)+;(;+
Continuing, we have
The same steps can be applied to the case where n / 2 is odd. So, from 3.2, P ( n / 2 ) is equal to
2'-"
{ (4)[($!)I +($+)[(&) +(&) +( ) [(,; _ -J +(&) +(&) +(&) +($+3]
I)&(+
q2 r
+()
2
[() +(;)
Now, do column-wise summation to get
+
+(:yJ
+(;-)I}
Finally, use Fact 3.3 to obtain
0
2
Apparently, it is difficult to have a closed-form solution or even a good estimate for general $P(i)$. Fortunately, it turns out that the relationship between successive $P(i)$ is very simple. Below is the main result of this section.
Theorem 3.8. For i even, we have
$$P(i+1) - P(i) = \frac{1}{2^n}\binom{i}{i/2}\binom{n-i}{(n-i)/2}$$
and
$$P(i) - P(i-1) = 0.$$
Proof.
'El
-A.--A+2.
.AL2
2" 1
Now, the other case:
= o We will use the following lemma in some of the estimates later.
-
Lemma 3.9. I f b ( n ) / a ( n ) x ( n ) = O(1) and a ( n ) + 00, then
Proof. Take natural logarithm of z ( n ) = (1 + [l/u(n)])"'"'h'"'and use Taylor expansion for In (1 + [l/a(n)]).Then show that lnz(n) = b(n)-[x(n)/2]. 0
4 The Main Result
The main result of this paper, to be proved in this section, is:
Theorem 4.1. $A = 2^{-\Omega(\sqrt{n})}$ for $m \ge 2.0821n$.
Our proof has three steps. (1) First we show that $z_k = 2^{-\Omega(n)}$ for $k \ge n/2$ if $\alpha > 1$. Hence, we only have to concern ourselves with $z_k$ for $1 \le k \le n/2$. (2) We then show that, for
Hence,
for n large. Since $[1 + (1/x)]^x \to e$ and
it is clear that $z_k = 2^{-\Omega(n)}$ for $\alpha > 1$. We proceed to the second step. To show that the odd-numbered terms of $\{z_k\}_{k=1,2,\ldots,n/2}$ form a nonincreasing sequence for a suitable $\alpha$, we show that $z_{i-1}/z_{i+1} \ge 1$ for $i = 2, 4, \ldots, (n/2) - 1$. By the definition of the $z_k$'s and Theorem 3.8,
Since ( i + 1)i / ( n - i + 1) ( n - i) 2 ( i / n - i)' for i equation 4.1 is at least n-i-1
I n / 2 and P(i + 1) 2 0,
an
which is at least
( n-1 i [1-+(;)*] ) 2
as n
00
due to
n-i-1
an
(4.2)
and Lemma 3.5. It is enough to find an $\alpha$ such that 4.2 is at least one. To proceed, we consider two cases: $i < \log_2 \log_2 n$ and $i \ge \log_2 \log_2 n$. For $i < \log_2 \log_2 n$, 4.2 is greater than
due to
(f )>
2i/2
2
and Lemma 3.9. Clearly, any $\alpha > 0$ will make the above formula at least one for n large enough. Now consider $i \ge \log_2 \log_2 n$. Using Lemma 3.5 applied to
equation 4.2 becomes
(4.3)
(Note that Lemma 3.5 is applicable in this case since $i \to \infty$ as $n \to \infty$.) Let $c = n/i$. Clearly, $c \ge 2$. Equation 4.3 is now
by Lemma 3.9. Simple manipulations show that, to make the above formula at least one for $n \to \infty$,
Numerical calculations show that $\max_c f(c) = \alpha^* \approx 2.0821$ (see Fig. 1).
Figure 1: (a) $f(c)$ as a function of c; (b) a close-up look at $f(c)$ near $c = 12$, from 11.8 to 12.2.
We are now at the final step. By Corollary 3.6 and Lemma 3.9,
$$z_1 = n[1 - P(1)]^{\alpha n} \sim n\, e^{-\alpha\sqrt{2n/\pi} + O(1)}. \qquad (4.4)$$
Furthermore,
$$A \le n(z_1 + z_3 + \cdots + z_{n/2}) + 2^{-\Omega(n)} < n^2\, 2^{-\Omega(\sqrt{n})} + 2^{-\Omega(n)} = 2^{-\Omega(\sqrt{n})}. \qquad (4.5)$$
5 Conclusion and Numerical Results
We have given a rigorous proof of the phenomenon of perfect generalization first proposed by Gardner and Derrida (1989). Their derivation is very appealing, but there are a few points that cannot be trivially made rigorous. They approximate, as is standard in physics calculations, a sum of random $\{\pm 1\}$ as a gaussian. This is of course usually all right; however, since this sum is raised to an infinite power in the limit, the validity of this substitution is not obvious in this case. Also, they approximate each of two factors separately, where the limit of one factor is zero and the limit of the other is infinite. Also, their method is not sensitive to the tails, where x, the fraction of bits on which the target and learning perceptrons agree or disagree, goes to 1 or 0. Finally, they are unable to address the issue of rate of convergence. Gardner and Derrida calculate that for $\alpha_c = 1.448\ldots$, the expected value of the number of consistent perceptrons is less than or equal to one
    A[n_, alpha1_, alpha2_, step_] :=
      Block[{CC, B, DD, P, Z, S, AA = {}, m, i, alpha, a1 = alpha1, a2 = alpha2},
        (* check the input *)
        If[OddQ[n] || EvenQ[n/2],
          Print["1st argument must be even and its half odd"]; Return[]
        ];
        If[!Positive[step],
          Print["4th argument must be positive"]; Return[]
        ];
        (* precompute some useful constants *)
        B = 2^n;
        (* create i choose i/2; so CC[[i]] = 2i choose i *)
        CC = Table[Binomial[2 i, i], {i, n/2}];
        (* create n choose i; so DD[[i]] = n choose i *)
        DD = Table[Binomial[n, i], {i, n/2}];
        (* compute P[i]'s *)
        P = {CC[[n/2]]/B.0};
        Do[
          P = Append[Append[P, P[[i-1]] + CC[[i/2]] CC[[(n
          {i, 2, n/2 - 1, 2}
        ];
        (* compute z-i's for various alpha's *)
        While[True,
          Do[
            S = 0;
            m = alpha n;
            Do[
              S += DD[[i]] (1 - P[[i]])^m,
              {i, 1, n/2, 2}
            ];
            AA = Append[AA, S],
            {alpha, a1, a2, step}
          ];
          (* ask for more alpha's *)
          If[SameQ[InputString["more alpha's? (y/n)"], "n"], Break[]];
          a1 = a2 + step; a2 += step
        ];
        Return[AA]
      ]
Figure 2: A Mathematica program.
as n goes to infinity, which gives $\alpha_c$ as an upper bound on the critical point for the transition to perfect generalization. Using our Theorem 3.8 for calculation (see Fig. 2 for the program), we find their claim not to be confirmed for n up to 19002 (at which point the expected number is 1.86516). But we caution that this calculation is not conclusive since the
Figure 3: (a) Typical behavior of A in $\alpha$. Taking min(1, A), we obtain (b), the steep descent of which near 1.5 suggests a "sudden" transition to perfect generalization, as reported in Sompolinsky et al. (1990) and Gyorgyi (1990). Again, Z is used as an approximation for A, which is justified by 4.5.
Figure 4: min(1, A) for $1.3 \le \alpha \le 1.7$ and $n = 10, 14, \ldots, 162$. Again, we use Z for A in the computation.
expected number, though slowly decreasing, might still fall below one for sufficiently large n. Sompolinsky et al. (1990) and Gyorgyi (1990) use methods of replica symmetry breaking to obtain a value of $\alpha = 1.24$ and find agreement in
simulations. If their arguments are correct, our calculation has not found the transition point tightly. One reason why our calculation is not sharp is that we have demanded $\alpha$ be large enough that the $z_k$'s are monotonic. If we drop this condition, we may use equations 3.1 and 3.2 to evaluate $Z = \sum_{k=1,3,\ldots,n/2} z_k$ numerically. Note that 4.5 demonstrates that A is not much bigger than Z. Calculating with n = 3002, we found that Z = 1.93496 x 10'' at $\alpha$ = 1.4 but Z = 1.29923x at $\alpha$ = 1.5. Figure 2 contains our Mathematica program used for the above calculation, Figure 3 shows a typical behavior for A plotted as a function of $\alpha$ and support for the sudden transition to perfect generalization, and Figure 4 is a plot of min(1, A) as a function of $\alpha$ and n. This calculation provides evidence for three interesting conclusions. (1) Perfect generalization holds already for $\alpha = 1.5$. (2) A critical transition occurs, as was proposed by Sompolinsky et al. (1990) and Gyorgyi (1990). (3) Arguments based on A will not be able to yield a critical $\alpha$ below 1.4. Note that this is not necessarily in contradiction with the critical value of 1.24 given by Sompolinsky et al. (1990) and Gyorgyi (1990), since A is only an upper bound on the desired probability. Thus the true critical point for perfect generalization might plausibly occur at a slightly lower value than the critical point of A, that is, between 1 and 1.5.
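The numerical evaluation of Z described above can be sketched in a few lines. This is not a transcription of the program in Figure 2. It is a minimal sketch under two labeled assumptions: that $z_k = \binom{n}{k}[1 - P(k)]^{\alpha n}$, and that $P$ obeys the two-step recurrence $P(i+1) = P(i) + 2^{-n}\binom{i}{i/2}\binom{n-i}{(n-i)/2}$, $P(i+2) = P(i+1)$ for even i, starting from $P(0) = 0$. The function names are hypothetical. Logarithms are used because $\binom{n}{k}$ overflows floating point for n in the thousands.

```python
from fractions import Fraction
from math import comb, exp, lgamma, log

def p_table(n):
    """Exact P(k) for 0 <= k <= n/2 (even n) from the assumed two-step recurrence."""
    assert n > 0 and n % 2 == 0
    P = {0: Fraction(0)}
    for i in range(0, n // 2, 2):  # i even
        step = Fraction(comb(i, i // 2) * comb(n - i, (n - i) // 2), 2 ** n)
        P[i + 1] = P[i] + step     # odd index: P increases
        P[i + 2] = P[i + 1]        # even index: P repeats the previous value
    return P

def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_Z(n, alpha):
    """Natural log of Z = sum over odd k <= n/2 of C(n, k) (1 - P(k))^(alpha n)."""
    P = p_table(n)
    m = alpha * n
    logs = [log_comb(n, k) + m * log(float(1 - P[k]))
            for k in range(1, n // 2 + 1, 2)]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + log(sum(exp(v - top) for v in logs))
```

Scanning `log_Z(n, alpha)` over a grid of `alpha` values and locating where it crosses zero gives the kind of curve shown in Figures 3 and 4, under the assumptions above.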
Acknowledgments
The first author thanks H. Sompolinsky and N. Tishby for their preprints and discussions regarding their work, and the Aspen Center for Physics for providing an environment conducive to such discussions. The second author thanks Sunita Hingorani for providing initial help on the use of Mathematica and David Jagerman, Leonid Kruglyak, Fred Rieke, and Satish Rao for discussions.
References
Baum, E. B. 1990. What can back propagation and k-nearest neighbor learn with feasible sized sets of examples? In Neural Networks, EURASIP Workshop 1990 Proceedings, L. B. Almeida and C. J. Wellekens, eds. Lecture Notes in Computer Science Series, pp. 2-25. Springer-Verlag, New York.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Bollobás, B. 1985. Random Graphs. Academic Press, New York.
Gardner, E., and Derrida, B. 1989. Three unfinished works on the optimal storage of networks. J. Phys. A: Math. Gen. 22, 1983-1994.
Graham, R. L., Knuth, D. E., and Patashnik, O. 1989. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA.
Gyorgyi, G. 1990. First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097-7100.
Sompolinsky, H., Tishby, N., and Seung, H. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683-1686.
Received 18 October 1990; accepted 4 March 1991.
Communicated by Geoffrey Hinton
Learning by Asymmetric Parallel Boltzmann Machines
Bruno Apolloni
Dipartimento di Scienze dell'Informazione, Università di Milano, I-20133 Milano, Italy
Diego de Falco
Dipartimento di Matematica, Politecnico di Milano, I-20133 Milano, Italy
We consider the Little, Shaw, Vasudevan model as a parallel asymmetric Boltzmann machine, in the sense that we extend to this model the entropic learning rule first studied by Ackley, Hinton, and Sejnowski in the case of a sequentially activated network with symmetric synaptic matrix. The resulting Hebbian learning rule for the parallel asymmetric model draws the signal for the updating of synaptic weights from time averages of the discrepancy between expected and actual transitions along the past history of the network. As we work without the hypothesis of symmetry of the weights, we can include in our analysis also feedforward networks, for which the entropic learning rule turns out to be complementary to the error backpropagation rule, in that it "rewards the correct behavior" instead of "penalizing the wrong answers."
A set of n neurons is coupled by a real $n \times n$ matrix $W = \|w_{ij}\|$ of synaptic weights. If the machine is in the configuration $s = (s_1, \ldots, s_n) \in \{0,1\}^n$, the signal $X_i$ present on the ith neuron, and on which it will base its future action, is determined by W and by the real threshold vector $\theta = (\theta_1, \ldots, \theta_n)$ as
$$X_i(s) = \sum_{j=1}^{n} w_{ij} s_j - \theta_i.$$
The configuration $S(t)$ of the machine performs a random walk on $\{0,1\}^n$, with discrete time parameter $t = 0, 1, 2, \ldots$, and with conditional expectation of $S_i(t+1)$ given $S(t)$, and given that neuron i is called to update, of the form
Neural Computation 3, 402-408 (1991) © 1991 Massachusetts Institute of Technology
Ackley et al. (1985) studied the above dynamic system under the assumption of a random sequential activation mechanism (at each clock one randomly chosen neuron is called to update) and under the hypothesis of a symmetric synaptic matrix W with vanishing diagonal elements. The symmetry of W plays, via detailed balance, a crucial role in providing the explicit form of the stationary distribution of the process studied by Ackley et al., as a Boltzmann distribution, at inverse temperature $\beta$, for a system with energy
$$E(s) = -\sum_{i<j} s_i s_j w_{ij} + \sum_{j=1}^{n} s_j \theta_j.$$
On this explicitly given stationary distribution
the above authors studied the learning problem of the maximum likelihood estimation of W and $\theta$ based on a sample drawn from a given environmental distribution. Here we wish to study this same problem releasing the hypothesis of symmetry of W and considering, instead of the random sequential activation mechanism, the synchronous parallel one, in which, given the current configuration of the network, each neuron is activated simultaneously to and independently of all the others. Namely, we pose the Ackley, Hinton, Sejnowski (AHS) learning problem for the full Little, Shaw, Vasudevan model (Shaw and Vasudevan 1974; Little 1974; Little and Shaw 1978), viewed as a parallel asymmetric Boltzmann machine. We study, to this purpose, the random walk $S(t)$ on $\{0,1\}^n$ ruled by the Markov transition matrix
$$p_{s,s'} = P[S(t+1) = s' \mid S(t) = s] = \prod_{i=1}^{n} \left[1 + e^{\beta(1 - 2s'_i)X_i(s)}\right]^{-1}$$
which, after some algebra, can be rewritten (Apolloni and de Falco 1991a) as
$$p_{s,s'} = \frac{e^{\beta s' \cdot X(s)}}{\sum_{r} e^{\beta r \cdot X(s)}}.$$
The effect of parallelism on the symmetric model has been studied in Bertoni et al. (1989) from the point of view of combinatorial optimization, and in Apolloni and de Falco (1991a) from the point of view of learning theory. In this note we deal with the specific difficulty emerging from the asymmetry of W and from the consequent lack of a detailed balance condition giving an explicit expression of the stationary distribution $\pi_0(s; W, \theta)$ for the transition matrix $p_{s,s'}(W, \theta)$.
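The equivalence of the two forms of the parallel transition matrix can be checked numerically on a small asymmetric network. The sketch below assumes the product form $\prod_i [1 + e^{\beta(1-2s'_i)X_i(s)}]^{-1}$ and the normalized form $e^{\beta s' \cdot X(s)} / \sum_r e^{\beta r \cdot X(s)}$ with $X_i(s) = \sum_j w_{ij}s_j - \theta_i$. The function names and the test weights are hypothetical.

```python
from itertools import product
from math import exp

def signal(W, theta, s):
    """X_i(s) = sum_j w_ij s_j - theta_i for every neuron i."""
    return [sum(wij * sj for wij, sj in zip(row, s)) - th
            for row, th in zip(W, theta)]

def p_product(W, theta, beta, s, s_next):
    """Product form: each neuron updates independently given s."""
    prob = 1.0
    for xi, si in zip(signal(W, theta, s), s_next):
        prob /= 1.0 + exp(beta * (1 - 2 * si) * xi)
    return prob

def p_normalized(W, theta, beta, s, s_next):
    """Normalized form: exp(beta s'.X(s)) divided by the sum over all r."""
    X = signal(W, theta, s)

    def weight(r):
        return exp(beta * sum(ri * xi for ri, xi in zip(r, X)))

    return weight(s_next) / sum(weight(r) for r in product((0, 1), repeat=len(s)))
```

The product form costs O(n) per entry, whereas the normalized form as written sums over all $2^n$ configurations, so the former is the practical choice for simulation; the latter is convenient for the algebra of the learning rule.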
We are forced to work in the time dependent formalism, considering
$$\pi^{(t)}(s) = \pi^{(t)}(s; W, \theta, \psi) = \sum_{s(0),s(1),\ldots,s(t-1)} \psi[s(0)]\, p_{s(0),s(1)} \cdots p_{s(t-1),s}$$
namely the distribution reached after t evolution steps starting from a given distribution $\psi$ in which the network was initialized at time 0. Coherently with the maximum likelihood principle, we look for the choice of parameters that minimizes the relative entropy
between $\pi^{(t)}$ and the given environmental law we want to simulate. We are not considering here, for notational simplicity only, the presence of hidden nodes: it can be checked that all our conclusions hold, with the obvious modifications, in the presence of hidden nodes (Pisano 1991). It is easy to compute that
where
This ratio can be easily expressed in terms of the backward transition probabilities:
h = 0,1,., . , f
-
1
where the subscripts $\psi$, p refer to the process having $\psi$ as initial distribution at time 0 and p as transition matrix. The above expression for $\partial G^{(t)}/\partial w_{ij}$ can therefore be resummed as
An analogous expression can be obtained for dG(t)/a&, namely, 80,
=
{ [&E ( S , ( k ) l S ( k
-5'x@(s). EzL,p S
-
=I
1
l ) ) l S ( t )= s
Were we considering hidden nodes, the conditional expectation at time t should be taken with respect to the configuration of the visible nodes. The resulting Hebbian rule is: for each sample path ending in a configuration s exhibited by the environment with significant probability @ ( T ) , the "innovation" signal to be collected and on which to base the updating of wlI is the time average along the past history of the difference between the actual "consensus" T,(k - l ) T , ( k ) of the receiver neuron i with the transmitter neuron j and the conditional expectation of this consensus given the configuration of the transmitters. The novel feature introduced by asymmetry of synaptic weights is the need to condition with respect to the final time t. As a first step toward an understanding of the above general prescription we observe that the familiar AHS rule follows from it as a particular case (modulo the changes required by the parallel activation considered here as opposite to the sequential activation of Ackley et al. 1985) under two additional assumptions: W is symmetric and 1c, = T O ( . ; W, 0). In the AHS equilibrium framework one is in fact studying the effect on G of letting a state that has already reached dynamical equilibrium under p ( W) evolve for t more steps according to p(W d W ) . The assumption II, = TO(.; W, 0) is crucial in deriving the equality P:,b = pa,& holding because of the detailed balance condition satisfied by the stationary distribution T O in the case of symmetric weights. The fact that the time reversed process has the same transition matrix as the process actually followed by the machine is, in turn, crucial in disposing of the conditional expectation with respect to time t according to the following steps:
+
pk c
k=l s ( 0 ) ..., , s(k),s(k+l)
Symmetry plays two distinct roles in deriving the above result: 1. Detailed balance has the effect of substituting the conditional expectations with respect to the final configuration S ( n ) with expectations Eq,,,(w)for the process that evolves according to ps,sf(W) from the i n i t i d state o at t = 0. 2. The i)G(')/Ozci,iterm provides intermediate cancellations, so that ony the final and initial steps of the trajectory become relevant. In the limit t + 00, in which the first expectation of the above derivative tends to an expectation with respect to the stationary distribution x,,(s; W . ej, one has
aG 374,
, ~ E , , , ~ , , w , [ S , ( O )+ S ~Si(O)Sj(lj] (~) -,~E,.,,(w,[s,(O)S,(l) + Si(O)S,(l)l
This is the expression for grad G studied in Ackley et al. (1985), with the modification studied in Apolloni and de Falco (1991a) for the case of parallel evolution, mirroring the "duplication trick," which relates parallel and sequential evolution according to Bruck and Goodman (1988).
Returning to asymmetric networks, we wish to discuss, next, the case of a feedforward layered architecture, with layer sequential activation, in which, because of layer-time association, the prescription of conditioning with respect to the final time is operationally very transparent: it amounts, for each input clamped on the input layer, to waiting for the machine to exhibit, because of its intrinsic fluctuations, the correct output vector on the output layer. Call $s^\ell = (s^\ell_1, \ldots, s^\ell_{n_\ell})$ a configuration of the $\ell$th layer ($\ell = 0, \ldots, L$). Think of layer 0 as input layer and of layer L as output layer. Input-output pairs $(s^0, s^L)$ are exhibited to the machine by the training agent according to a given probability law $\varphi(s^0, s^L)$. Call $\theta^\ell$ the threshold vector for layer $\ell$ and $w^\ell_{ij}$ ($i = 1, \ldots, n_\ell$; $j = 1, \ldots, n_{\ell-1}$) the synaptic weight with which the jth node of layer $\ell - 1$ sends signals to the ith node of layer $\ell$. With layer 0 clamped at $s^0$, layer $\ell$ gets updated only at time $\ell$ according to
P- l , * . . ? L P(S"
= s")
=
$(SO)
= C($(S",SL) SL
After all the layers have been updated, the current set of weights and thresholds determines the following joint law of the input-output pair: q ( s W=
$(SO)
c'
s',
,sL
psII,sI,ps1,s2".psf-I,sI
We look for the choice of parameters that minimizes G ( ~= )
C ~(s~,s~)ln;-- o(s" ss''))
sl',sL
./,(SO,
It is easy to compute that
The emerging Hebbian learning rule is particularly transparent if we think of a joint law $\varphi(s^0, s^L)$ sharply concentrated on the graph of a function $g: s^0 \to s^L$: those input-output pairs presented as final state by the machine that are significantly represented in the training set (namely, which are in the graph of the function g to be learned), for short "the correct associations," determine, by gradient descent, the direction of motion in parameter space. This is to be compared with the error backpropagation rule (Rumelhart et al. 1986), which draws the error signal from those input-output pairs that are "wrong" compared to what is exhibited in the training set. The learning algorithm for feedforward networks based on the above considerations has been successfully tested on simple benchmarks. Preliminary results are reported in Apolloni and de Falco (1991b). As the signal that triggers the updating of the weights is "the output is correctly associated to the input," this algorithm can be classified as a "reinforcement learning procedure" as reviewed in section 11 of Hinton (1989). As an outlook for future research we wish to stress the fact that the learning procedure we propose admits a natural hybridization with backpropagation: one simply takes the innovation signal from our prescription on those sample paths that realize the correct association, and from backpropagation of the error, measured as the distance between actual and wanted output, on those sample paths that do not.
Acknowledgments
This research was supported in part by Consiglio Nazionale delle Ricerche under Grants 88.03556.12 and 89.05261.CT12.
References
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cognitive Sci. 9, 147.
Apolloni, B., and de Falco, D. 1991a. Learning by parallel Boltzmann machines. IEEE Trans. Inform. Theory, July 1991.
Apolloni, B., and de Falco, D. 1991b. Learning by feed-forward Boltzmann machines. Proceedings Neuronet 90, World Scientific, in press.
Bertoni, A., Campadelli, P., and Grassani, F. 1989. Full parallelism in Boltzmann machines. Proc. Neuro-Nîmes '89, 361.
Bruck, J., and Goodman, J. W. 1988. A generalized convergence theorem for neural networks. IEEE Trans. Inform. Theory 34, 1089.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185.
Little, W. A. 1974. The existence of persistent states in the brain. Math. Biosci. 19, 101.
Little, W. A., and Shaw, G. L. 1978. Analytic study of the memory storage capacity of a neural network. Math. Biosci. 39, 281.
Pisano, R. 1991. Macchine di Boltzmann asimmetriche, University of Milano, thesis.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error backpropagation. In Parallel Distributed Processing: Exploration in Microstructures of Cognition, Vol. I, p. 318. MIT Press, Cambridge, MA.
Shaw, G. L., and Vasudevan, R. 1974. Persistent states of neural networks and the random nature of synaptic transmission. Math. Biosci. 21, 207.
Received 25 July 1990; accepted 8 May 1991.
Communicated by Richard Lippmann
Generalization Effects of k-Neighbor Interpolation Training
Takeshi Kawabata
NTT Basic Research Laboratories, 3-9-22 Midori-cho Musashino-shi, Tokyo 180, Japan
This paper describes a new training method for a continuous mapping and/or pattern classification neural network that performs local sample-density smoothing. A conventional training method uses point-to-point mapping from an input space to an output space. Even though the mapping may be precise at two given training sample points, there are no guarantees of mapping accuracy at points on a line segment connecting the sample points. This paper first discusses a theory for formulating line-to-line mapping. The theory is called interpolation training. This paper then expands the theory to k-nearest neighbor interpolation. The k-neighbor interpolation training (KNIT) method connects an input sample training point to its k-neighbor points via k line segments. Then, the method maps these k line segments in the input space for each training sample to linear line segments in the output space that interpolate between training output values. Thus, a web structure made by connecting input samples is mapped into the same structure in an output space. The KNIT method reduces the overlearning problem caused by point-to-point training by smoothing input/output functions. Simulation tasks show that KNIT improves vowel recognition on a small speech database.
1 Introduction
Backpropagation training (Rumelhart et al. 1986) has been used extensively for research on continuous mapping and pattern classification. This research has encountered a common serious problem called overlearning. Even though a neural network accomplishes low distortion for training samples, it may not work well for other unknown samples. Overlearning is related to the complexity of a problem, the amount of training data, and the complexity of the neural network (Baum 1989). A conventional training method uses point-to-point mapping from an input space to an output space (Fig. 1a). In this case, even though the mapping is precise at the two given training sample points, there is no guarantee of mapping accuracy at the points on a line segment in the input space that connects the training samples.
Neural Computation 3, 409-417 (1991) © 1991 Massachusetts Institute of Technology
Figure 1: (a) Point-to-point training, (b) interpolation training, and (c) k-neighbor interpolation training (KNIT, k = 3).
Therefore, it may be more desirable to map a point on a line in an input space into a point on a line in the output space that interpolates between desired output points. Wolpert showed that such an interpolation condition is satisfied by assuming transformation invariances in a generalizer (Wolpert 1990a,b). Without assuming transformation invariances, interpolation smoothing and surface fitting approaches have been proposed (Stanfill and Waltz 1986; Farmer and Sidorowich 1989). The main goal of this paper is to show an efficient algorithm that uses backpropagation (Rumelhart et al. 1986) to train a neural network to form a local linear fit.
First, this paper discusses a theory for mapping a line segment in the input space into a line segment in the output space (Fig. 1b). The line-to-line mapping is achieved by differential vector mapping. This theory is called interpolation training.
In addition, this paper expands interpolation training to k-nearest neighbor interpolation (Fig. 1c). By connecting sample points to their k-neighbors, an input space can be covered by a web structure. This k-neighbor interpolation training (KNIT) method maps each line segment from an input space to an output space by using interpolation training. In some simulation and classification experiments, KNIT effectively reduces overlearning caused by the point-to-point training.
2 Formulation of Interpolation Training
An interpolation vector on the line segment connecting points $a_0$ and $a_1$ in an input space is defined as a vector:
$$a_0 + \lambda(a_1 - a_0) \quad (0 \le \lambda \le 1) \qquad (2.1)$$
Hereafter, this paper represents the vector as $a_0 + \Delta a$. Let $s_0$ and $s_1$ be the corresponding supervisory or desired output points for the input vectors $a_0$ and $a_1$. A supervisory vector on the line segment connecting points $s_0$ and $s_1$ in an output space is defined as a vector
$$s_0 + \lambda(s_1 - s_0) \quad (0 \le \lambda \le 1) \qquad (2.2)$$
Hereafter, this paper represents the vector as $s_0 + \Delta s$. In the neighbor area around point $a_0$, an output vector b is approximated as
$$b = f(a_0 + \Delta a) \approx f(a_0) + \frac{\partial f}{\partial a}(a_0)\,\Delta a \qquad (2.3)$$
where f is an input/output mapping function and $\partial f/\partial a(a_0)$ is the Jacobi matrix of mapping f at point $a_0$. This paper represents vector b as $b_0 + \Delta b$. Now let
$$E = \frac{1}{2}\|(s_0 + \Delta s) - (b_0 + \Delta b)\|^2 \qquad (2.4)$$
be the error measurement to be minimized, where $\|x\|$ is the Euclidean norm of x. The backpropagation training procedure is applicable to minimize this modified mean square error (MSE) measurement. A neural unit consists of an adder for weighted summation and a sigmoid function for nonlinear transformation. Let $u_j$ be the total input of the jth neural unit, and let $o_j$ be the output of the unit. The change in the weighting coefficient from unit i to unit j is given by (2.5) where $\delta_j$ is defined as $\delta_j = -\partial E/\partial u_j$. At the output layer, $\delta_j$ is calculated as follows:
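$$\delta_j = -\frac{\partial E}{\partial u_j} = -\frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial u_j} = -\frac{\partial E}{\partial o_j}\, o_j(1 - o_j) \qquad (2.6)$$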
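From equation 2.4,
$$\frac{\partial E}{\partial o_j} = \frac{1}{2}\frac{\partial}{\partial o_j}\|(s_0 + \Delta s) - (b_0 + \Delta b)\|^2 = \frac{1}{2}\frac{\partial}{\partial o_j}\{(s_{0j} + \Delta s_j) - (b_{0j} + \Delta b_j)\}^2 = \{(s_{0j} + \Delta s_j) - (b_{0j} + \Delta b_j)\} \times \frac{\partial}{\partial o_j}\{(s_{0j} + \Delta s_j) - (b_{0j} + \Delta b_j)\} \qquad (2.7)$$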
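where $s_{0j}$, $\Delta s_j$, $b_{0j}$, and $\Delta b_j$ are the jth elements of $s_0$, $\Delta s$, $b_0$, and $\Delta b$. Thus, $s_{0j}$ is constant, and $\Delta s_j$ and $\Delta b_j$ are constant for a fixed $\Delta a$. At the output layer, clearly $b_{0j} = o_j$. Consequently,
$$\frac{\partial E}{\partial o_j} = -\{(s_{0j} + \Delta s_j) - (b_{0j} + \Delta b_j)\}. \qquad (2.8)$$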
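Vector formulation is
$$\frac{\partial E}{\partial o} = -\left\{(s_0 + \Delta s) - \left[b_0 + \frac{\partial f}{\partial a}(a_0)\,\Delta a\right]\right\} = -\left\{(s_0 - b_0) + \left[\Delta s - \frac{\partial f}{\partial a}(a_0)\,\Delta a\right]\right\} \qquad (2.9)$$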
The first term $(s_0 - b_0)$ also appears in the formulation of point-to-point methods. The second term is derived from the differential parts of our error measure (equation 2.4). Equation 2.9 only gives a solution for single-point mapping from $a_0 + \Delta a$ to $s_0 + \Delta s$. To expand the theory for line segment mapping, an integral operation is necessary. Remember that $\Delta a$ and $\Delta s$ are functions of $\lambda$.
The solution for line segment mapping is calculated as an integral of equation 2.9 over $\lambda$. Since equation 2.3 assumes a first-order approximation for the input/output function, the Jacobi matrix is constant in the neighbor area around the base point $(a_0)$. This paper approximates the Jacobi matrix as that of its nearest training sample point. On the line segment from $a_0$ to $a_0 + \frac{1}{2}(a_1 - a_0)$, $\partial f/\partial a(a_0)$ is used:
$$\frac{\partial f}{\partial a}[a_0 + \lambda(a_1 - a_0)] \approx \frac{\partial f}{\partial a}(a_0) = \text{const} \quad (0 \le \lambda \le 1/2) \qquad (2.11)$$
On the line segment from $a_0 + \frac{1}{2}(a_1 - a_0)$ to $a_1$, $\partial f/\partial a(a_1)$ is used:
$$\frac{\partial f}{\partial a}[a_0 + \lambda(a_1 - a_0)] \approx \frac{\partial f}{\partial a}(a_1) = \text{const} \quad (1/2 \le \lambda \le 1) \qquad (2.12)$$
The KNIT method shown in the next section uses each training sample as a base point. Equation 2.12 is treated as equation 2.11 when the base point is $a_1$. For this reason, this paper uses the integral limits $\lambda = 0$ to $1/2$. From equations 2.9 and 2.11,
Equation 2.13 gives a solution for the line segment mapping. Consequently, the $\delta_j$ at the output layer is calculated from equations 2.6 and 2.13.
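As a worked step, the integral of equation 2.9 over $0 \le \lambda \le 1/2$ can be carried out explicitly. The calculation below is a sketch under the assumptions stated in the text: $\Delta a = \lambda(a_1 - a_0)$ and $\Delta s = \lambda(s_1 - s_0)$ from equations 2.1 and 2.2, and a Jacobi matrix held constant at $a_0$ as in equation 2.11. It is not necessarily written in the same form or normalization as the original equation 2.13.

```latex
\begin{align*}
\int_0^{1/2} \frac{\partial E}{\partial o}\, d\lambda
 &= -\int_0^{1/2} \Bigl\{ (s_0 - b_0)
    + \lambda \Bigl[ (s_1 - s_0) - \frac{\partial f}{\partial a}(a_0)\,(a_1 - a_0) \Bigr] \Bigr\}\, d\lambda \\
 &= -\Bigl\{ \tfrac{1}{2}\,(s_0 - b_0)
    + \tfrac{1}{8} \Bigl[ (s_1 - s_0) - \frac{\partial f}{\partial a}(a_0)\,(a_1 - a_0) \Bigr] \Bigr\}.
\end{align*}
```

The factor $1/8$ comes from $\int_0^{1/2} \lambda\, d\lambda$. The point-to-point term $(s_0 - b_0)$ is therefore weighted four times more heavily than the interpolation term in this form.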
At each hidden layer, the $\delta_j$ values are calculated by the backpropagation process: (2.14) where k is the index to neural units that have a connection from neural unit j. In equation 2.13, the product of the Jacobi matrix $\partial f/\partial a(a_0)$ and the difference vector $(a_1 - a_0)$ is calculated by forward propagation. Let $o_j$ be the output of the jth neural unit. Define $g_j$ as a product of the Jacobi matrix of the jth unit and the vector $(a_1 - a_0)$. (2.15) Initialize $o_j$ and $g_j$ at input units as
$$o_j = a_{0j}, \qquad g_j = (a_{1j} - a_{0j}) \qquad (2.16)$$
where $a_{0j}$ and $a_{1j}$ are the jth elements of input sample vectors $a_0$ and $a_1$. Forward propagation is carried out by applying the following formula at each hidden and output unit: (2.17) (2.18)
In the above formula, $N_j$ is the number of connections to the jth unit, and $\sigma$ and $\sigma'$ are the sigmoid function and its derivative. (2.19) After the forward propagation, the elements of the vector $\partial f/\partial a(a_0)(a_1 - a_0)$ are given as $g_j$ values at output units.
3 k-Neighbor Interpolation Training
The KNIT method is an expansion of the interpolation training theory. The theory is most effective for interpolating nearest-neighbor samples, because the theory assumes that the Jacobi matrix is constant in the neighbor area around a sample point. First, KNIT connects an input sample point to its k-neighbor points via k line segments. This process is performed for all training samples and makes a web structure. KNIT maps these k line segments for each
sample using interpolation training. Consequently, a web structure is mapped into a similar structure in the output space. This paper uses a combined distance measure to select nearest neighbors:
In this equation, $d^{(in)}$ is a distance in the input space and $d^{(out)}$ is a distance in the output space for a training point. Because these distance measurements are used only for selecting nearest neighbors, arbitrary metrics can be used in equation 3.1. This paper uses a distance measure defined as follows:
The weighting coefficient w is called a categorical factor. In a pattern classification task, $d^{(out)}$ indicates whether the two samples belong to the same class. Therefore, a nonzero w inhibits selecting a different-class sample as a nearest neighbor, and performs within-class interpolation. For this reason, we use a nonzero categorical factor only for classification tasks. Simulation experiments were conducted to compare point-to-point training and KNIT.

Task         Network     Samples           k   w
(1) xor      2 x 2 x 1   2 x 2             2   0.0
(2) comb-1   2 x 8 x 1   sparse (5 x 5)    2   0.5
(3) comb-2   2 x 8 x 1   dense (17 x 17)   2   0.5
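The neighbor selection step can be sketched as follows. This is an illustration only: it assumes the combined measure is the input-space distance plus w times the output-space distance, with Euclidean distance in both spaces, since the paper's exact formulas did not survive in this copy. The function name `knit_neighbors` is hypothetical.

```python
import numpy as np

def knit_neighbors(inputs, targets, k=2, w=0.5):
    """Return, for each sample, the indices of its k nearest neighbors.

    Assumed combined distance: d = d_in + w * d_out (Euclidean in both
    spaces). A nonzero w penalizes neighbors whose targets differ.
    """
    X = np.asarray(inputs, dtype=float)
    X = X.reshape(len(X), -1)
    S = np.asarray(targets, dtype=float).reshape(len(X), -1)
    d_in = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_out = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    d = d_in + w * d_out
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbor
    return np.argsort(d, axis=1, kind="stable")[:, :k]

# Usage: the 2 x 2 xor mesh with k = 2 and w = 0.0, as in task (1)
mesh = [[0, 0], [1, 0], [0, 1], [1, 1]]
xor_targets = [0, 1, 1, 0]
neighbors = knit_neighbors(mesh, xor_targets, k=2, w=0.0)
```

Each row of `neighbors` defines the k line segments that interpolation training would then map for that sample.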
Figure 2(1) shows an xor task. The 2 x 2 x 1 network has 2 inputs (x, y), 2 hidden units, and 1 output (z). The 2 x 2 mesh samples [f(0,0) = 0, f(1,0) = 1, f(0,1) = 1, and f(1,1) = 0] were used for training. The upper figure shows the function z = f(x, y) trained by the point-to-point method. Because the point-to-point method minimizes errors only at sample points, slopes are steep. The lower figure is a result of KNIT. Dotted lines in this figure are the line segments that connect neighbor sample points. KNIT organizes a mapping function that approximates the dotted lines.
Figure 2(2) shows a comb task with a few samples. This situation often occurs at the marginal intersection of two category distributions. We choose a nonzero categorical factor (w = 0.5). In this simulation, 5 x 5 = 25 mesh samples were used for training. Even though the sample density is low, the point-to-point method organizes local structures according to the training samples. In the upper figure, local steep boundaries were organized. When an input sample changes along the y-axis, the output alternates between 0 and 1. The KNIT method does not form such steep local structures with few samples. Instead, a smooth mapping function is formed (lower figure).
Figure 2(3) shows a comb task with many samples. In this case, 17 x 17 = 289 mesh samples were used for training. The sample density is high enough to determine reliable class boundaries. Both the point-to-point and KNIT methods organized the given comb structure with many samples. The smoothness of the input/output function with KNIT thus depends on the input point sample density.
Figure 2: Comparison of point-to-point training and KNIT. Panels: (1) XOR, (2) Comb-1 (low sample density), (3) Comb-2 (high sample density); upper row point-to-point, lower row KNIT; axes x and y.
4 Vowel Recognition Experiments
KNIT was also tested using a vowel classification task. Two vowel sample sets uttered by a male speaker were collected. One was used for training and the other for testing. The training set contains 250 samples (50 samples x 5 vowels) and the testing set contains 1000 samples (200 samples x 5 vowels). These vowel samples were sampled at 12 kHz, spectrum analyzed by a 256-point FFT with a Hamming window, and transformed into 16 spectral coefficients by computing log energies in each mel-scale energy band (Waibel et al. 1989). With the addition of a power coefficient, a vowel sample vector consists of 17 elements. The network for vowel recognition had 17 input units, 18 units in the first hidden layer, 18 units in the second hidden layer, and 5 output units. The 5 output units correspond to the five Japanese vowels. A test sample was recognized as the vowel whose output unit gave the highest intensity.
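The front end described above can be sketched roughly as below. The paper gives only the sampling rate, FFT size, window, and band count; the triangular filter shape, the particular mel formula, the band edges from 0 Hz to the Nyquist frequency, and the function name `vowel_features` are all assumptions made for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # a common mel formula (assumed)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def vowel_features(frame, sample_rate=12000, n_bands=16):
    """Map one 256-sample frame to 16 log band energies plus 1 log power."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Band edges equally spaced on the mel scale (assumed layout)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2))
    feats = np.empty(n_bands + 1)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        tri = np.clip(np.minimum(rising, falling), 0.0, None)  # triangular weights
        feats[b] = np.log(tri @ spectrum + 1e-12)
    feats[n_bands] = np.log(spectrum.sum() + 1e-12)            # power coefficient
    return feats

# Usage: a synthetic 500 Hz tone stands in for a vowel frame
t = np.arange(256) / 12000.0
features = vowel_features(np.sin(2 * np.pi * 500.0 * t))
```

The resulting 17-element vector matches the input width of the 17-18-18-5 network; classification then amounts to taking the index of the largest of the five outputs.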
Table 1: Vowel recognition results.

Method           Samples   MSE        Rate (%)
Point-to-point   Closed    0.000001   100.0
Point-to-point   Open      0.008450   97.4
KNIT             Closed    0.001708   99.6
KNIT             Open      0.004466   98.9
The network was trained by the following two methods:
1. Point-to-point method ("minimize the MSE at each training sample point").
2. KNIT method with k (neighbors) = 2 and w (categorical factor) = 0.5.
and tested with two sample conditions:
1. Closed: Test the network with the training sample set.
2. Open: Test the network with the testing sample set.
Table 1 shows the summary of these recognition experiments. The point-to-point method performs best in the closed condition. Its MSE is nearly zero, and its recognition rate is 100%. In the open condition, however, the recognition rate drops to 97.4%. This shows that the point-to-point method excessively tunes the network to the training samples. In contrast, KNIT performs worse in the closed condition, but better than the point-to-point method in the open condition. The degradation of performance was effectively reduced. To check the significance of the improvement, a paired t test was performed. KNIT and point-to-point error rates were different at a significance level of 0.01.
5 Conclusion
This paper proposed a new approach to train neural networks which have a capability of local sample-density sensitive smoothing. The basic idea is to map continuous manifolds from an input space to an output space. The interpolation training theory gives a formulation of line-to-line mapping. The KNIT (k-neighbor interpolation training) method maps all line segments connecting neighbor samples. Simulation tasks show that KNIT organizes reliable discriminant boundaries according to the local sample density. Vowel recognition experiments show that the method improves the recognition performance in open test condition.
Acknowledgments
The theoretical formulation of interpolation training and the first implementation of the KNIT method were performed while the author was at the ATR Interpreting Telephony Research Laboratories.
References
Baum, E. B. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Farmer, J. D., and Sidorowich, J. J. 1989. Exploiting chaos to predict the future and reduce noise. Los Alamos National Laboratory, LA-UR-88-901.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, Chap. 8. The MIT Press, Cambridge, MA.
Stanfill, C., and Waltz, D. 1986. Toward memory-based reasoning. Commun. ACM 29(12), 1213-1228.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37(3), 328-339.
Wolpert, D. H. 1990. A mathematical theory of generalization: Part I. Complex Syst. 4, 151-200.
Wolpert, D. H. 1990. A mathematical theory of generalization: Part II. Complex Syst. 4, 201-249.
Received 10 July 1990; accepted 27 March 1991.
Communicated by Yaser Abu-Mostafa
Including Hints in Training Neural Nets
Khalid A. Al-Mashouq
Irving S. Reed
Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90007 USA
The aim of a neural net is to partition the data space into near optimal decision regions. Learning such a partitioning solely from examples has proven to be a very hard problem (Blum and Rivest 1988; Judd 1988). To remedy this, we use the idea of supplying hints to the network, as discussed by Abu-Mostafa (1990). Hints reduce the solution space, and as a consequence speed up the learning process. The minimum Hamming distance between the patterns serves as the hint. Next, it is shown how to learn such a hint and how to incorporate it into the learning algorithm. Modifications in the net structure and its operation are suggested, which allow for a better generalization. The sensitivity to errors in such a hint is studied through some simulations.
Neural Computation 3, 418-427 (1991) © 1991 Massachusetts Institute of Technology
1 Introduction
Learning from examples is a unique feature of neural nets. However, robustness of a net comes from its response to novel examples, which is called generalization. To have a good generalization, one needs a very large number of training data, which in turn leads to a very long training time. Abu-Mostafa (1990) suggested the use of known "hints" to accelerate the learning process (see Fig. 1). Hints are information about the function to be learned that is additional to the training examples. For instance, in the problem of character recognition, it is known a priori that any character is invariant to shifts, scale change, and small rotations. Another example of hints is found in decoding error correcting codes in the form of the minimum Hamming distance between the codewords. Presumably, this extra knowledge can be exploited to achieve a more efficient training algorithm. In some cases the hint is known precisely, or may be approximated by employing previous experience from similar problems. It is shown here that hints may also be learned from the training set.
After acquiring the hint an important remaining question is as follows: How does one incorporate this hint into the training algorithm? Inappropriate inclusion of hints might slow or even cripple the learning process. To show such an effect, assume that it is given that a network with one hidden layer is sufficient to learn a certain problem. Also assume that the bias of the hidden nodes is known eventually to be -K, where K is a large positive number. If one starts the training by setting the bias of the hidden nodes to that value, the surprising result is that no learning can take place because the net is caught in a well-known local minimum where the outputs of the hidden nodes are always zero.
This paper attempts to demonstrate the above concepts using the example of the minimum Hamming distance as a hint. It is organized as follows: In Section 2, the structure of a general neural net is described, which has modified threshold elements in the hidden nodes. Feedback from the output to the input is allowed. The advantages of such a structure and its operation are discussed in the subsequent sections. As an alternative to exact knowledge of the minimum Hamming distance, d, a simple algorithm to estimate d is proposed in Section 3. Section 4 outlines an algorithm that utilizes an estimate of d for training the net to have a good generalization in the sense of Hamming distance. A simulation example is presented in Section 5 to examine the network's sensitivity to errors in estimating the minimum Hamming distance d, and to show how the threshold of the hidden nodes can be modified to mitigate the effect of estimation error.
Figure 1: Training a neural net using hints.
2 The Structure and Operation of the Network
The network used here consists of an input layer at the input end, followed by hidden layers and finally by an output layer at the output end (see Fig. 2.) The input layer is called the x-layer and the output layer is called the y-layer. Patterns that are presented to the x-layer are called x-patterns, and patterns that are at the y-layer are called y-patterns. To simplify the analysis, only one hidden layer is used. This layer has a number of nodes equal to or greater than the number of valid patterns.
Figure 2: The network structure. (Labels: input pattern x_i at the x-layer; connection matrix W_x; hidden layer; connection matrix W_y; associated output y_i at the y-layer.)
Each node receives a weighted sum from the preceding layer. This sum passes through a threshold function, f(.), given by
$$f(x) = \begin{cases} (x - T)^n & \text{if } x > T \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$
where T is a threshold, and n is a parameter called the threshold power. For the output layer, n is chosen to be equal to 0, which is the case of the usual hard limiter in which the output is binary. However, the hidden nodes are allowed to have different values of n. In error-correction applications it is useful to have neurons with n > 0 (or a "semisoft" threshold), because a hard limiter discards most of the information about the match between the input to the neuron and its weight. For any value of n, one can construct a network that associates any number of x-patterns to corresponding y-patterns while simultaneously correcting all errors that lie within a Hamming distance less than $d_x/2$ (Al-Mashouq et al. 1990), where $d_x$ is the minimum Hamming distance between the x-patterns. To show that this is always possible, consider the net shown in Figure 2. It has one hidden layer with p nodes, where p is equal to the number of patterns. Let $W_x$ be a matrix of the synaptic weight vectors. The ith column of $W_x$ corresponds to the synaptic weight vector connecting the x-layer to the ith hidden node. Similarly, $W_y$ is the matrix of the synaptic weight vectors that connects the y-layer to the hidden layer, where the ith row in $W_y$ is the weight vector that connects the y-layer to the ith hidden node. Let X and Y be the matrices of the valid patterns.
Both matrices are assumed to have entries of only ±1. The ith row $x_i$ in X is to be associated with the ith row $y_i$ in Y. Let $W_x = X^T$ and $W_y = Y$, where the superscript T denotes a matrix transpose, and let the threshold of the hidden nodes from the x-layer side be $T_x = N_x - d_x$, where $N_x$ is the dimension of the $x_i$ vectors, and $d_x$ is the minimum Hamming distance between them. The net in Figure 2 makes the association between the rows of X and the corresponding rows of Y, even if the input vector has errors, provided that the number of errors is less than $d_x/2$. To explain how this setup works, consider a received (row) vector $a \in \{-1, +1\}^{N_x}$, which is at a distance $e_i < d_x/2$ from $x_i$, and at a distance $e_j > d_x/2$ from any other pattern $x_j$. The input to the hidden layer is the vector
$$a W_x = (N_x - 2e_1, N_x - 2e_2, \ldots, N_x - 2e_i, \ldots, N_x - 2e_p) \qquad (2.2)$$
The jth component of this vector is the input to the jth node. This node is "on" if its input exceeds the threshold, that is, if
$$N_x - 2e_j > N_x - d_x \qquad (2.3)$$
which means that
$$e_j < d_x/2 \qquad (2.4)$$
Since $e_j > d_x/2$ for all $j \ne i$, the ith hidden node is the only active node in response to the stimulus a. From 2.1, the ith node output is
$$h_i = (d_x - 2e_i)^n > 0 \qquad (2.5)$$
Also the output at the y-layer is
$$\mathrm{sign}[(d_x - 2e_i)^n w_{yi}] = y_i \qquad (2.6)$$
where $w_{yi}$ is the ith row in $W_y$. Thus, the correct pattern associated with a is obtained. Similarly, if one allows an input to be applied to the y-layer, and the threshold is set to $N_y - d_y$, where $N_y$ is the dimension of the y-patterns and $d_y$ is their minimum Hamming distance, then a correct association between the $y_i$s and the $x_i$s is maintained. As a special case of this construction, one can associate patterns with themselves. This results in an autoassociative memory, which receives a corrupted pattern and produces its corrected version. This memory can store and retrieve a number of patterns equal to the number of hidden nodes. It also has error correction capability such that it corrects any number of errors less than $d_x/2$. For the sake of a simple comparison, it is shown in McEliece et al. (1987) that the Hopfield net can, at most, store
$$\frac{(1 - 2\rho)^2 N}{4 \log N} \text{ patterns}$$
where N is the dimensionality of the input vector and $\rho N$ is the number of allowable errors in the decoded vector. Thus, if the pattern length is $N = 2^8$, one can store only 8 patterns without any error correction capability.
Another interesting feature of this net is that it can operate bidirectionally and it stabilizes to the correct patterns provided that the following is true: (1) the number of errors in the input pattern is less than $d_x/2$, and (2) the minimum Hamming distance is exactly known. Here by stability it is meant that the input and the output do not change with time, only one pattern x appears at the x-layer, and only one pattern appears at the y-layer. However, if any of these conditions is violated, the output pattern may have errors. If the number of errors is less than that of the input pattern, then the output can be fed back in the hope of reducing or eliminating errors. The same operation is performed back and forth repeatedly to obtain ultimately the correct pattern. A simulation example is given in Section 5 to illustrate the network sensitivity to those two conditions.
3 Estimating the Minimum Hamming Distance
The minimum Hamming distance, d, is an important ingredient for the algorithm given below. In some cases d is known a priori, such as in the case of error-correcting codes. In other situations, former experience might be deployed to make an "educated guess" about d. A third possibility is to learn d from the training examples themselves. This third method is considered in the following. By definition, d can be computed by finding the Hamming distance $d_{ij}$ between each pair of patterns i and j, and then selecting the minimum distance among them. That is,
$$d = \min_{i \ne j} d_{ij} \qquad (3.1)$$
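The pairwise definition can be written down directly, alongside the cheaper consecutive-pair approximation that this section goes on to discuss. This is a minimal sketch; the function names are illustrative and the three short patterns are made up for the example.

```python
import numpy as np
from itertools import combinations

def hamming(u, v):
    """Number of positions in which two equal-length patterns differ."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

def min_hamming_exact(patterns):
    """Minimum over all pairs: a quadratic number of comparisons."""
    return min(hamming(u, v) for u, v in combinations(patterns, 2))

def min_hamming_consecutive(patterns):
    """Minimum over consecutive pairs only: a linear number of comparisons."""
    return min(hamming(u, v) for u, v in zip(patterns[:-1], patterns[1:]))

# Hypothetical patterns with entries of +1 and -1
patterns = [[1, 1, 1, 1], [-1, -1, 1, 1], [1, 1, 1, -1]]
d_exact = min_hamming_exact(patterns)         # pairs (0,1)=2, (0,2)=1, (1,2)=3
d_approx = min_hamming_consecutive(patterns)  # pairs (0,1)=2, (1,2)=3
```

Because the consecutive-pair search looks at a subset of the pairs, its result can never be smaller than the exact minimum, so any error it introduces is an overestimate of d.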
This process is prohibitive in complexity, especially when a large number of patterns are to be considered. A much simpler approximation is to find the Hamming distance between each consecutive pair of patterns and retain the minimum one. To justify the use of this method one needs to assume a high signal-to-noise ratio such that the received patterns have almost no errors. If that is not the case, it is more appropriate to average the Hamming distances instead of choosing the minimum among them. In either case one gets an estimate of d, say d'. As would be expected, the use of d' to adjust the threshold may cause decoding errors. To show the effect of an error in $d'_x$, the estimated minimum Hamming distance between the x-patterns, let
$$d'_x = d_x + e_d \qquad (3.2)$$
where $e_d > 0$ represents the estimation error. If one uses the value of $d'_x$ to adjust the threshold of the hidden units, then in 2.3 $d_x$ is replaced by its estimate $d'_x$, which yields that the jth node is activated whenever
$$2e_j - e_d < d_x \qquad (3.3)$$
Therefore, the effect of the estimation error, $e_d$, is to make errors seem smaller. This in turn may result in activating nodes that are not necessarily the correct nodes. To mitigate this problem, one uses a threshold of power greater than zero. This is useful since the next layer then has more information about the input pattern. For example, assume two hidden nodes, i and j, have been activated in response to an input pattern that is closer to $x_i$. If $d_x$ in 2.5 is replaced by its estimate $d'_x$ given by 3.2, then the outputs of the two hidden nodes i and j are
$$h_i = (d_x - 2e_i + e_d)^n \qquad (3.4)$$
$$h_j = (d_x - 2e_j + e_d)^n \qquad (3.5)$$
respectively. Notice that $e_i < e_j$, hence $h_i > h_j$ for n > 0. Thus, the output at the y-layer is
$$\mathrm{sign}(h_i y_i + h_j y_j) = y_i \qquad (3.6)$$
which is the desired output pattern. This can be done only if n > 0. To combat the effect of more falsely activated nodes, n has to increase. As n increases, the complexity of the thresholds gets prohibitively large. A practical choice would be n = 1 or n = 2.
4 Training Using the Hint
In training a multilayer neural net the backpropagation (BP) algorithm (Rumelhart and McClelland 1986) is often used. Formally, BP is used in only one direction. Given an input pattern, the weights are adapted to best match the desired output. This implies that there is a "teacher" who assigns an abstract label to each input pattern. A more appealing method is to assume that the data are coming in pairs from two (or more) sensors, and to make associations among them. This seems to be a more natural way to retrieve information. To train the net to make such an association, the dual back-propagation (DBP) algorithm is proposed (Al-Mashouq et al. 1990). This algorithm employs the BP algorithm in both directions. If both the x and y training vectors are available, then x is considered the input, y is the desired output, and vice versa. The BP is used to adapt the weights in the "x-to-y" direction. After a few iterations, the operation is reversed to make y be the input (to the y-layer), and x becomes the desired output (at the x-layer). During training, one needs to supply the threshold with the hint. To correct errors, ideally one would like the threshold from the x-side
and the y-side to be $N_x - d_x$ and $N_y - d_y$, respectively. However, this is valid only under the assumption that the synaptic weights are unity. If the weights are small, which is usually the case during the initial time, then a large threshold forces the output of the hidden layer to be all zeros, which is a trivial local minimum. To skip this minimum point, a useful method is to scale the threshold down by a factor that is proportional to the average norm of the synaptic weight.
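The scaling idea can be illustrated with a short sketch. The paper states only that the factor is proportional to the average weight norm; the normalization used here, dividing by the square root of the input dimension so that unit-magnitude weights give a factor of exactly 1, is an assumption chosen for the example, as is the function name `scaled_threshold`.

```python
import numpy as np

def scaled_threshold(W, d):
    """Threshold (N - d) scaled by the average column norm of W.

    W has shape (N, p): column i feeds hidden node i. The division by
    sqrt(N) is an assumed normalization: a column of +1/-1 entries has
    norm sqrt(N), so the ideal threshold N - d is recovered unchanged.
    """
    W = np.asarray(W, dtype=float)
    n_inputs = W.shape[0]
    avg_norm = np.mean(np.linalg.norm(W, axis=0))
    return (n_inputs - d) * avg_norm / np.sqrt(n_inputs)

# Two hypothetical patterns of length 8 at Hamming distance 4
X = np.array([[1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, -1, -1, -1, -1]], dtype=float)
W_small = 0.01 * X.T                      # small weights, as early in training
T_small = scaled_threshold(W_small, d=4)  # scaled well below N - d = 4
hidden_input = X[0] @ W_small             # input to the two hidden nodes
active = hidden_input > T_small           # the matching node can now fire
```

With the unscaled threshold of 4, neither hidden node could be activated by inputs of this size, which is the trivial all-zeros state the text warns about.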
5 A Simulation Example
To illustrate the effect of an erroneous estimation of the minimum Hamming distance, the following simple experiment is conducted. Training data of four patterns representing the numbers 0, 1, 2, and 3 as patterns, are chosen. Each pattern occupies a rectangle of 5 x 4 pixels. The same patterns are used as the desired outputs. This results in an autoassociative neural network. Four hidden nodes are used, that is, four neurons are used to connect the x-layer to the hidden layer, and 20 neurons connect the hidden layer to the y-layer. In this experiment it is assumed that the training algorithm results in copying the training data into the weights. That is, each input neuron represents an input pattern. Since one wants the net to map the input patterns onto themselves, the weights of the y-layer are mirror images of the x-layer's weights. This in effect reduces the network's complexity by using only one layer in two directions to replace the cascade of two similar layers. The minimum Hamming distance of this training set is 7, which means that one can always correct three errors. Given the above patterns and configuration, the network is tested by introducing three then four random errors into the four patterns. In both cases d* is varied from 7 to 19 (which corresponds to the variation of ed from 0 to 12.) Another parameter is the threshold power, n, which is varied to take the values 0 to 2. The network is operated bidirectionally until it reaches a stable point. Then the average symbol error is obtained. Figure 3 shows two examples for correcting errors sequentially, and Table 1 displays the average symbol error as a function of the estimated d and the threshold power n. It is clear that the proposed network is not very sensitive to error in d * . Moreover, to compensate for the estimation error in d', one can raise the power of the threshold in the hidden nodes. 
However, the price of raising the power is to spend more processing time and/or to have more hardware complexity. It is interesting to note that the network may correct more than three errors when the threshold is relaxed, that is, when d' > d. On the other hand, relaxing the threshold can result in decoding errors even if the error is less than d / 2 .
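The kind of experiment described in this section can be sketched as follows. The paper's 5 x 4 digit patterns are not reproduced here, so four made-up patterns of length 8 with minimum Hamming distance 4 stand in for them. The function names are illustrative, and the hidden-node function follows the semisoft threshold of Section 2: (x - T)^n above the threshold and 0 otherwise.

```python
import numpy as np

def semisoft_threshold(x, T, n):
    """(x - T)^n where x exceeds T, and 0 elsewhere; n = 0 gives a hard limiter."""
    x = np.asarray(x, dtype=float)
    return np.where(x > T, np.maximum(x - T, 0.0) ** n, 0.0)

def recall_once(a, X, Y, d_est, n=1):
    """One pass x-layer -> hidden -> y-layer with W_x = X^T and W_y = Y."""
    N = X.shape[1]
    h = semisoft_threshold(a @ X.T, N - d_est, n)
    return np.sign(h @ Y)               # 0 marks an undecided ("gray") element

def recall_until_stable(a, X, d_est, n=1, max_iters=10):
    """Autoassociative use: feed the output back until it stops changing."""
    a = np.asarray(a, dtype=float)
    for _ in range(max_iters):
        out = recall_once(a, X, X, d_est, n)
        if np.array_equal(out, a):
            break
        a = out
    return a

# Hypothetical stored patterns (pairwise Hamming distance 4)
X = np.array([[1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, -1, -1, -1, -1],
              [1, 1, -1, -1, 1, 1, -1, -1],
              [1, -1, 1, -1, 1, -1, 1, -1]], dtype=float)
noisy = X[2].copy()
noisy[0] *= -1                          # one error, fewer than d/2 = 2
recovered = recall_until_stable(noisy, X, d_est=4, n=1)
```

Sweeping `d_est` above the true minimum distance and varying `n` reproduces the two experimental knobs of this section, although the error rates obtained depend on the patterns chosen.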
Figure 3: Sequences of patterns produced by the network when (a) the input is the number 3 corrupted with three errors and (b) the input is the number 1 corrupted with four errors. Different shading corresponds to different gray levels: black = 1, white = -1, and gray = 0.
Table 1: Average decoding error due to error in d'.

        3 errors                     4 errors
e_d     n=0      n=1      n=2       n=0      n=1      n=2
0       0        0        0         1.0000   1.0000   1.0000
1       0.0250   0        0         0        1.0000   1.0000
2       0.0250   0        0         0        0        0
3       0.0250   0        0         0.1500   0        0
4       0.0250   0        0         0.1500   0        0
5       0.2250   0        0         0.2000   0        0
6       0.2250   0        0         0.2000   0        0
7       0.7500   0        0         0.7750   0        0
8       0.7500   0        0         0.7750   0.0250   0
9       0.8000   0        0         0.7750   0.0250   0
10      0.8000   0.0250   0         0.7750   0.0500   0
11      1.0000   0.0250   0         1.0000   0.1250   0
12      1.0000   0.0500   0         1.0000   0.1500   0.0250

6 Conclusions
Learning solely from examples is a hard problem that requires an extremely large number of examples and an exorbitant amount of time.
To simplify the learning process, one can use previous experience or the training data itself to supply hints to the training algorithm. In this paper a simple, but important, example is presented that learns the Hamming distance and uses it as a hint. Due to imperfections in learning the hint, one might expect severe degradation in the performance. However, this is not the case. A mild error in estimating the hint results in a small degradation. This is due primarily to the improvements made in the network structure, and to the "constructive" feedback allowed in its operation. More research is required to understand how to derive hints, and how to feed them to the training mechanism in different applications.
References
Blum, A., and Rivest, R. 1988. Training a 3-node neural network is NP-complete. Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Morgan-Kaufmann, San Mateo, CA.
Judd, S. 1988. On the complexity of loading shallow neural networks. J. Complex. 4, 177-192.
Abu-Mostafa, Y. 1990. Learning from hints in neural networks. J. Complex. 6, 192-198.
Al-Mashouq, K., Reed, I., and Patapoutian, A. 1990. Complexity and learning in error-tolerant neural nets. The 7th International Conference on Systems Engineering, Las Vegas, Nevada, July.
McEliece, R., Posner, E., Rodemich, E., and Venkatesh, S. 1987. The capacity of the Hopfield associative memory. IEEE Trans. Inform. Theory IT-33(4), 461-482.
Rumelhart, D., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I. MIT Press, Cambridge, MA.
Received 25 June 1990; accepted 12 February 1991.
Communicated by James Anderson
On the Characteristics of the Autoassociative Memory with Nonzero-Diagonal Terms in the Memory Matrix
Jung-Hua Wang
Thomas E Krile
John F. Walkup
Department of Electrical Engineering, Texas Tech University, Lubbock, TX 79409-3102 USA
Tai-Lang Jong
Department of Electrical Engineering, National Tsing-Hua University, Taiwan
A statistical method is applied to explore the unique characteristics of a certain class of neural network autoassociative memory with N neurons and first-order synaptic interconnections. The memory matrix is constructed to store M = αN vectors based on the outer-product learning algorithm. We theoretically prove that, by setting all the diagonal terms of the memory matrix to be M and letting the input error ratio ρ = 0, the probability of successful recall $P_r$ steadily decreases as α increases, but as α increases past 1.0, $P_r$ begins to increase slowly. When 0 < ρ ≤ 0.5, the network exhibits strong error-correction capability if α ≤ 0.15 and this capability is shown to rapidly decrease as α increases. The network essentially loses all its error-correction capability at α = 2, regardless of the value of ρ. When 0 < ρ ≤ 0.5, and under the constraint of $P_r$ > 0.99, the tradeoff between the number of stable states and their attraction force is analyzed and the memory capacity is shown to be 0.15N at best.
Neural Computation 3, 428-439 (1991) © 1991 Massachusetts Institute of Technology
1 Introduction
An important characteristic of memory in a biological system is its associative nature, that is, the ability to recall complete information, given only partial information. The neural network associative memory model proposed by Hopfield (1982) prescribes the interconnection weights to be the sum over the outer products of the stored vectors with the diagonal (or self-connected) terms equal to zero. With M randomly coded stored vectors each having N binary-valued components, the dynamics of the Hopfield model are described as a minimization of a Liapunov energy function. For the network to work well as an associative memory, it is required that the M stored patterns are themselves stable states and that the network has error-correcting capability. In that case, Hopfield (1982) determined through simulations that the ratio M/N = α ≈ 0.15. Gindi et al. (1988) showed that for the memory to evolve so as to minimize the energy function, the diagonal terms need not be zero. In addition, the diagonal term serves as an inertia that allows the network to move down relatively large gradients in the energy landscape. Stiles et al. (1987) conducted a quantitative comparison of the performances of various associative memory models that work with discrete-valued data. They concluded that, in general, the recall accuracy $P_r$ (i.e., the probability of successful recall) declines as α increases. Among the models studied, one that drew much attention was the Hopfield autoassociative memory with nonzero diagonal terms in the memory matrix (NZAM). Stiles et al. observed that, when the input contains no error bits, the recall accuracy of the NZAM does not continuously decrease as α increases. In particular, the accuracy begins to increase as α becomes greater than 1.0. If $P_r$ is expressed as a function of α, there exist double roots $\alpha_1$ and $\alpha_2$ such that $\alpha_1 \alpha_2 = 1$ and $P_r(\alpha_1) = P_r(\alpha_2)$. This is unique in the sense that it occurs only in the NZAM case. For the ZAM (zero-diagonal associative memory) case or other higher order models, the recall accuracy always decreases as α increases. Even when 0 < ρ ≤ 0.5, where ρ = b/N (b is the number of input error bits), the NZAM is unique in its own way and results in special network behavior due to the nonzero diagonal terms.
In this paper we present a theoretical formulation capable of analyzing several characteristics of the NZAM including (1) the unique behavior of the network when ρ = 0, (2) the degeneration of the network performances such as the recall accuracy $P_r$ and error-correcting capability as α increases when 0 < ρ ≤ 0.5, and (3) the tradeoff between the number of attractors and their attraction radii under the requirement of a large $P_r$, for example, $P_r$ > 0.99. We also relate results by Keeler (1986) in his study of attraction basins with the attraction radius results in this paper, and use the comparison to justify the validity of our results.
2 Unique NZAM Behavior When ρ = 0
For illustration purposes we start with a computer simulation result where the input vector contains no erroneous bits. We ran simulations on an NZAM and a ZAM network, both with N = 20 neurons. Each bit of the pattern vectors is considered as an independent identically distributed (i.i.d.) random variable with equal probabilities of being +1 or -1. We also assume synchronous update, that is, all neurons update simultaneously. Thus the updating processes of all neurons can be viewed as independent. We denote by P_r the recall accuracy, by which we mean the probability that a probe vector will lead to its corresponding stored vector after one iteration. The use of one-step convergence is well justified by the fact
that in the recall process of Hopfield memories with "suitably" chosen values of N and M, every neuron in the network will change in the correct direction right at the first update; if not, rarely will it change in the successive updates (McEliece et al. 1987). Figure 1 shows the results for P_r as α increases from 0.15 to 5. For α ≤ 1, P_r decreases as M increases in both networks, although it is easily seen that the NZAM outperforms the ZAM. But beyond the point of α = 1, these two networks behave quite differently. The P_r for the ZAM continues to decrease as expected, whereas in the NZAM, P_r begins to rise steadily for α > 1 (until it saturates at 1.0). In the following section, we give a rigorous analysis of this unique behavior along with other important characteristics of the NZAM.
Figure 1: Simulation results of the NZAM and ZAM with N = 20 and p = 0.0. (Curves for the NZAM and the ZAM; horizontal axis α, 0 to 6.)
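The simulation described above can be reproduced in outline with a short script. What follows is a minimal sketch rather than the authors' program: it assumes the outer-product memory matrix T_ij = Σ_κ V_i^κ V_j^κ, breaks ties in the sign function toward +1, and uses a hypothetical function name, trial count, and random seed.

```python
import random


def one_step_recall_rate(n, m, zero_diagonal, trials=200, seed=0):
    """Estimate P_r for p = 0: the fraction of stored vectors that are
    reproduced exactly after one synchronous update."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # M random bipolar patterns of length N (i.i.d. +1/-1 bits).
        patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
        probe = patterns[0]  # error-free probe: one of the stored vectors
        # sum_j T_ij * probe_j = sum_k V_i^k * (V^k . probe)
        overlaps = [sum(v[j] * probe[j] for j in range(n)) for v in patterns]
        recalled = True
        for i in range(n):
            h = sum(patterns[k][i] * overlaps[k] for k in range(m))
            if zero_diagonal:
                h -= m * probe[i]  # ZAM: remove the self-coupling T_ii = M
            new_bit = 1 if h >= 0 else -1  # tie-break toward +1 (assumption)
            if new_bit != probe[i]:
                recalled = False
                break
        successes += recalled
    return successes / trials


if __name__ == "__main__":
    n = 20
    for alpha in (0.5, 1.0, 2.0, 5.0):
        m = int(round(alpha * n))
        nzam = one_step_recall_rate(n, m, zero_diagonal=False)
        zam = one_step_recall_rate(n, m, zero_diagonal=True)
        print(f"alpha={alpha:3.1f}  NZAM P_r={nzam:.2f}  ZAM P_r={zam:.2f}")
```

Computing the update through the overlaps V^κ · probe avoids building the N × N matrix explicitly, which keeps the sketch fast even for M much larger than N.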
3 The Theory
3.1 The C Parameter and the Double-Root Characteristic. Consider as an input a state vector V^{q'} that differs from a specific stored vector V^q by a small number of bits, b, 0 ≤ b < 0.5N (if b = 0, V^{q'} = V^q; otherwise, V^{q'} ≠ V^q). The next state of neuron i will be
$$V_i^{\text{new}} = F_h\left[\sum_{j=1}^{N} T_{ij} V_j^{q'}\right] \qquad (3.1)$$
where F_h is the sign function. Using $T_{ij} = \sum_{\kappa=1}^{M} V_i^{\kappa} V_j^{\kappa}$ to expand equation 3.1 yields
$$V_i^{\text{new}} = F_h\left[(N - 2b)\,V_i^{q} + (M - 1)\,V_i^{q'} + \sum_{\kappa \neq q}\sum_{j \neq i} V_i^{\kappa} V_j^{\kappa} V_j^{q'}\right] \qquad (3.2)$$
The first term in equation 3.2 can be viewed as the signal, and the second term modifies the signal by adding or subtracting M − 1 depending on whether V_i^{q'} is a correct bit. Thus the signal is s = V_i^q(N − 2b) + V_i^q(M − 1) if V_i^{q'} is a correct bit, that is, V_i^{q'} = V_i^q, and s = V_i^q(N − 2b) − V_i^q(M − 1) if V_i^{q'} is an incorrect bit. The third term in equation 3.2 can be viewed as noise originating from the interference of input vector V^{q'} with the rest of the stored vectors other than the target vector V^q. To calculate the noise variance, we let n_a(q') be the noise associated with the ath neuron when the input vector is V^{q'}. Thus, the covariance of the noise of the neuron pair a and b, that is, E[n_a(q')n_b(q')], is given by
$$E[n_a(q')\,n_b(q')] = E\left[\sum_{\kappa_1 \neq q}\sum_{\kappa_2 \neq q}\sum_{j_1 \neq a}\sum_{j_2 \neq b} V_a^{\kappa_1} V_{j_1}^{\kappa_1} V_{j_1}^{q'}\, V_b^{\kappa_2} V_{j_2}^{\kappa_2} V_{j_2}^{q'}\right] \qquad (3.3)$$
It is easily seen that equation 3.3 = 0 if a ≠ b. Consider, however, the special case where a = b. Then terms with j_1 = j_2, κ_1 = κ_2 are not zero mean and E[n_a(q')n_b(q')] reduces to
$$\sigma_n^2 = (N - 1)(M - 1) \qquad (3.4)$$
Thus, the signal-to-noise ratio (i.e., s/σ_n) for neurons with the correct state or incorrect state is given by
$$C_c = \frac{(N - 2b) + (M - 1)}{\sqrt{(N - 1)(M - 1)}} \quad \text{for a correct bit} \qquad (3.5)$$
$$C_i = \frac{(N - 2b) - (M - 1)}{\sqrt{(N - 1)(M - 1)}} \quad \text{for an incorrect bit} \qquad (3.6)$$
Without loss of generality, we define an averaged C parameter as follows:
$$C = \left(\frac{N - b}{N}\right) C_c + \frac{b}{N}\, C_i \qquad (3.7)$$
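As a quick numerical check of these signal-to-noise ratios, the finite-N expressions can be evaluated directly. The sketch below assumes the readings C_c = [(N − 2b) + (M − 1)]/√((N − 1)(M − 1)), C_i = [(N − 2b) − (M − 1)]/√((N − 1)(M − 1)), and C = ((N − b)/N)C_c + (b/N)C_i for equations 3.5 to 3.7; the function names are hypothetical.

```python
import math


def c_correct(n, m, b):
    """Equation 3.5 (as read here): SNR of a neuron whose input bit is correct.
    Requires m >= 2 so that the noise variance (n-1)(m-1) is positive."""
    return ((n - 2 * b) + (m - 1)) / math.sqrt((n - 1) * (m - 1))


def c_incorrect(n, m, b):
    """Equation 3.6 (as read here): SNR of a neuron whose input bit is incorrect."""
    return ((n - 2 * b) - (m - 1)) / math.sqrt((n - 1) * (m - 1))


def c_average(n, m, b):
    """Equation 3.7 (as read here): average weighted by the fractions of
    correct and incorrect input bits."""
    return (n - b) / n * c_correct(n, m, b) + b / n * c_incorrect(n, m, b)


if __name__ == "__main__":
    # N = 20 neurons, M = 20 stored vectors, b = 2 erroneous input bits.
    print(c_correct(20, 20, 2), c_incorrect(20, 20, 2), c_average(20, 20, 2))
```

With b = 0 the average reduces to C_c alone, since every input bit is correct.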
Letting b/N = p (the input error ratio), M/N = α, and N ≫ 1, equation 3.7 becomes
$$C = (1 - p)\,C_c + p\,C_i = (1 - p)\,\frac{1 - 2p + \alpha}{\sqrt{\alpha}} + p\,\frac{1 - 2p - \alpha}{\sqrt{\alpha}} \qquad (3.8)$$
As we will see, C_c and C_i can serve as a measure of the "attachment" to its previous state of any arbitrary neuron in the network after one update. Thus the larger the C_c value, the more likely that a correct input bit will remain correct, and similarly the smaller the C_i value, the more likely that an incorrect input bit will remain incorrect after one update. Later we will show that the range 0 < α ≤ 0.15 is where the memory proves itself most useful. Thus when α is fairly small, the C parameter in equation 3.8 can be viewed as an overall measure of the tendency of an arbitrary neuron to store a correct bit after one update. Furthermore, solving equation 3.8 for α, we obtain two roots α_1, α_2 as
$$\alpha_1, \alpha_2 = \frac{C^2 - 2(1 - 2p)^2 \pm C\sqrt{C^2 - 4(1 - 2p)^2}}{2(1 - 2p)^2}$$
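The pair of roots can be checked numerically. The sketch below assumes the large-N form of equation 3.8, which simplifies algebraically to C = (1 − 2p)(1 + α)/√α, and the quadratic-formula roots that follow from it; the function names are hypothetical.

```python
import math


def c_param(alpha, p):
    """Large-N C parameter, simplified from equation 3.8 (assumed reading)."""
    return (1 - 2 * p) * (1 + alpha) / math.sqrt(alpha)


def alpha_roots(c, p):
    """Return the two values of alpha that give the same C for a given p < 0.5."""
    g2 = (1 - 2 * p) ** 2
    disc = c * c - 4 * g2
    if disc < 0:
        # (1 + alpha)/sqrt(alpha) has minimum 2 at alpha = 1.
        raise ValueError("C is below its minimum 2(1 - 2p); no real root")
    spread = c * math.sqrt(disc)
    a1 = (c * c - 2 * g2 - spread) / (2 * g2)
    a2 = (c * c - 2 * g2 + spread) / (2 * g2)
    return a1, a2


if __name__ == "__main__":
    a1, a2 = alpha_roots(3.0, 0.0)
    print(a1, a2, a1 * a2)                    # the product is 1
    print(c_param(a1, 0.0), c_param(a2, 0.0))  # both evaluate to C = 3
```

Because the two roots are reciprocals, C takes the same value at α and 1/α for any fixed p.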
Thus there exist α_1, α_2 that result in the same value of C. In addition, α_1α_2 = 1, regardless of the value of p. This is the double-root characteristic of the NZAM. If plotted, it can be seen that the value of C decreases rapidly when 0 < α ≤ 1, but as α exceeds 1 the value of C starts to increase, though gradually. Asymptotically, when α ≪ 1 we have C ≈ (1 − 2p)/√α, but when α ≫ 1 we have C ≈ (1 − 2p)√α.
3.2 p = 0 Case. For the special case of p = 0 presented in Figure 1, equation 3.7 reduces to C_c only, but it still retains the double-root characteristic of α_1α_2 = 1. If plotted (C versus α), beyond the point of α = 1 the curve of C will begin to rise, consistent with the trend of P_r in Figure 1. Because the value of C directly determines P[correct] (i.e., the probability for a neuron to be in a correct state after one update) by
$$P[\text{correct}] = 1 - \frac{1}{\sqrt{2\pi}}\int_{C}^{\infty} \exp\left[-\frac{z^2}{2}\right] dz = 1 - Q(C) \qquad (3.9)$$
and the probability of successful recall P_r is approximately given by (Gindi et al. 1988; Wang et al. 1990)
$$P_r \approx \{P[\text{correct}]\}^{N} \approx e^{-N\eta} \qquad (3.10)$$
where η = 1 − P[correct], P_r is seen to be a function of C and N. Thus the trend of the curve for P_r in Figure 1 must follow the trend of the C curve as α increases, and the double-root characteristic described earlier will result in the special property of P_r(α_1) = P_r(α_2), where α_1α_2 = 1. Furthermore, the inflection point of the P_r curve (i.e., α = 1) in Figure 1 does not vary with N, and it can be predicted by the double-root characteristic of α_1α_2 = 1. Note that by invoking definitions of the C parameter
(Wang et al. 1990) for other classes of memory [e.g., (0, 1) binary-valued network, ZAM, higher order nets, etc.], we cannot find double roots of α for the values of C as in the case of the NZAM. Thus, this special property of first-decrease-then-increase in P_r is true for the NZAM case only.
3.3 p ≠ 0 Case. We now turn to the more general case when the input error ratio p ≠ 0 and analyze how increasing M affects the error correcting capability and the recall accuracy of the network. We start with the probability that a correct input bit becomes incorrect, and the probability that an incorrect input bit remains incorrect, after one update. These probabilities can be written as
$$P[\text{incorrect} \mid \text{correct}] = Q(C_c) = \frac{1}{\sqrt{2\pi}}\int_{C_c}^{\infty} \exp\left[-\frac{z^2}{2}\right] dz \qquad (3.11)$$
and
$$P[\text{incorrect} \mid \text{incorrect}] = Q(C_i) = \frac{1}{\sqrt{2\pi}}\int_{C_i}^{\infty} \exp\left[-\frac{z^2}{2}\right] dz \qquad (3.12)$$
respectively. Figure 2 illustrates C_c and C_i curves for various values of p. As can be seen at the far right end of the figure (i.e., α ≫ 1), all three pairs of (C_c, C_i) curves go to their limits of (+√α, −√α) and merge together. Thus from the above two equations, in the case of very large α, every neuron in the network will simply remain unchanged regardless of the values of p. Thus the network acts like an all-pass filter to any input vector when α ≫ 1. This behavior is another unique property of the NZAM, since it is not found in any other variants of Hopfield associative memories. At this point, it is interesting to compare the NZAM with the BSB model (i.e., brain-state-in-a-box; Anderson and Mozer 1981). In the BSB each initial input state (which may have analog-valued elements) within the state space will eventually evolve into one of the corner states. But not all corners are stable states in the BSB. Similarly, only a few of the corners can be stable states (i.e., attractors) in the NZAM when α is not too large. Thus error correction can be viewed as the attracting force by which the input state is "pulled" to a stable corner state, although in the NZAM, due to its discrete structure, the state transition is on a corner-to-corner basis. From the previous discussion, it is noted that increasing the diagonal terms (i.e., the value of α) in the memory matrix will inevitably make it harder for neurons to change state and thus shrink the attraction radius. In the extreme case of α ≫ 1, the NZAM will completely degenerate into a system in which all corners are stable states with zero attraction radii. In the BSB model, increasing the diagonal terms causes a relative enhancement of the positive feedback and thus results in speeding up the continuous transition to one of the corner states. Computationally,
it is interesting to note that the diagonal terms can be powerful too, as in the network introduced by Pentland (1989) to solve the problem of parts segmentation for object recognition. In that network the diagonal terms are used as time-decaying feedback, that is, initially the diagonal terms are quite large and their weights are reduced after each time-step until they reach their final value, when the desired network outputs are obtained.
Figure 2: The curves of C_c, C_i versus α for p = 0.0, 0.1, and 0.2. When α ≫ 1 all C_c curves tend to merge together; so do all C_i curves.
To examine how the error correcting capability is affected by increasing α, we define p_o as the output error ratio, that is, the fraction of N bits that is in error after one update. Then p_o can be approximated by averaging equations 3.11 and 3.12 to get
$$p_o = (1 - p)\,Q(C_c) + p\,Q(C_i) \qquad (3.13)$$
Clearly the output error ratio p_o is a function of both the input error ratio p and α. It is interesting to note that numerical computations of equation 3.13 give p_o ≈ p (where p > 0) at α = 2, and p_o saturates at p as α continues to increase. This means the network will essentially lose all its error-correcting capability if α ≥ 2. This is illustrated in Figure 3 by plotting p_o versus α using p = 0.0, 0.1, 0.2, 0.3, and 0.4. With no erroneous input bits (i.e., p = 0.0), the network will have certain erroneous bits in the output until α ≥ 8. In particular, the trend starts decreasing at α = 1, in agreement with the results shown in Figures 1 and 2. Note that the results of equation 3.13 and Figure 3 are invariant as N changes. Also, the threshold capacity M = 2N (from α = 2) is independent of p, as illustrated in Figure 3c.
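The saturation of the output error ratio near α = 2 can be checked with a few lines of code. This is a sketch under stated assumptions: p_o is taken as the average of Q(C_c) and Q(C_i) weighted by the fractions 1 − p and p of correct and incorrect input bits, and C_c, C_i are used in their large-N forms (1 − 2p ± α)/√α from equation 3.8. The function names are hypothetical.

```python
import math


def q(x):
    """Gaussian tail probability Q(x) = P[Z > x] for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def output_error_ratio(p, alpha):
    """Assumed form of the output error ratio after one update."""
    c_c = (1 - 2 * p + alpha) / math.sqrt(alpha)  # correct input bit
    c_i = (1 - 2 * p - alpha) / math.sqrt(alpha)  # incorrect input bit
    return (1 - p) * q(c_c) + p * q(c_i)


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.3, 0.4):
        row = [round(output_error_ratio(p, a), 4) for a in (0.1, 0.5, 1.0, 2.0, 4.0)]
        print(p, row)
```

For small α the computed p_o is far below p (errors are corrected), while at α = 2 it is already within about 0.01 of p for the input error ratios tried here.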
Figure 3: (a) The output error ratio p_o versus α from equation 3.13 for input error ratios p = 0.0, 0.1, 0.2, 0.3, and 0.4. At α = 2, nearly all curves saturate at p_o = p. Thus, the NZAM with p ≠ 0 will essentially lose its error-correcting capability if α ≥ 2. This relationship between p_o and α is invariant as N changes. (b) Simulation results for (a) when N = 20. (c) Illustration of increasing the number of stable states by increasing α when the input error ratio = p. The attraction force for each stable state is proportional to p − p_o(p, α). For α ≥ 2, p_o → p, thus the attraction force is ≈ 0 (i.e., the saturation region in α). The probability of these stable states being the stored vector is very small for α ≥ 0.2.
Although it is very difficult to calculate the total number of stable states in the network, by examining Figure 3a for 0 < α < 2 it is safe to say that, as α increases, the number of stable states also increases and with high probability (≈ 1) their attraction force (i.e., the error-correction capability of the network) is proportional to p − p_o(p, α), that is, of all the pN incorrect input bits, [p − p_o(p, α)]N bits will be corrected. Thus any initial input states located [p − p_o(p, α)]N Hamming distance away in the state space from these stable states (which are not necessarily the stored states) will very likely be attracted to them (see Fig. 3c, where α = 0.5, 1.0, and 2.0 are explicitly shown). For example, at α ≈ 0 (but not = 0), p_o ≈ 0 and the stored vector is capable of attracting any initial input state located pN Hamming distance away. Therefore, the attraction radius is the maximum value of the input error ratio p such that p_o(p, α) = 0. In general, any initial input states with pN incorrect bits will converge to
the stable states located [p_o(p, α)]N Hamming distance in the state space away from the stored states. Since p_o(p, α) increases as α increases and will not saturate at p until α = 2, the number of stable states does increase as α increases. This is also true when α > 2 because the network tends to become an all-pass filter with too many stored states. Note that these stable states are not necessarily the stored vectors. In fact, the probability that these stable states are themselves stored vectors can be estimated by using equation 3.10 and the relationship P[correct] ≈ 1 − p_o if N is given. From the above discussion, it is clear that to ensure that the stored vectors do not degenerate into spurious states or unstable states, it is necessary to keep α fairly small. As can be seen from Figure 3, the network exhibits no error-correcting capability at all for large α. Even when the network still exhibits a certain error-correction capability, as in the range of 0.2 < α ≤ 2, most likely it is too weak to pull the initial erroneous input state into its nearest stored vector, and the net result is poor recall accuracy P_r. Therefore, it is important to determine the relationship between α and p for small α. From equation 3.8 we obtain
$$C = \frac{(1 - 2p)(1 + \alpha)}{\sqrt{\alpha}} \qquad (3.14)$$
Now consider two different NZAMs A and B. Both NZAMs have the same number of neurons N. We assume that their values of α and input error ratios p are (α_a, p_a) and (α_b, p_b), respectively. We also assume that both networks under this arrangement are capable of error-correcting the input errors, that is, p_o = 0, and with a high probability of P_r > 0.99 they converge to the corresponding stored vector. From equations 3.9 and 3.10, for these two networks to have the same recall probability P_r, their values of the C parameter must be identical.
Based on this consideration and equation 3.14, the tradeoff between the allowable memory capacity M and the attraction radius p for the NZAM can be determined. This tradeoff is plotted in Figure 4 for C = 3.0, 3.5, and 4.0. Note that the shaded area where p < 0 is not allowable. It is important to note that in evaluating the attraction radius given a specific value of α, the choice of C curve in Figure 4 depends on the value of N used (equation 3.10). For example, the following combinations will all result in P_r > 0.99: (N, C) = (5, 3), (20, 3.5), and (200, 4). By examining Figure 4, it is interesting to note that the largest possible capacity M (e.g., for the smallest value of N = 5, its corresponding required C ≈ 3) for the NZAM is 0.15N at best. This is because larger N requires higher C (> 3) in order to achieve P_r > 0.99, and at p = 0.0 and C = 3, we obtain α = 0.15. Also as p increases, the necessary decrease in α in order to retain p_o = 0 and P_r > 0.99 can be estimated from Figure 4. For comparison, a curve for the ZAM case when C = 3 is also plotted and the largest possible capacity M is approximately 0.12N, which is quite close to the result of M ≈ αN, α < α_c, with α_c ≈ 0.14 obtained by Amit et al. (1985). As can be seen, the
NZAM outperforms the ZAM in terms of both the attraction radius size (generalization capability) and the memory capacity.
Figure 4: The tradeoff between the attraction radii and the number of attractors in the NZAM under the constraint of high P_r > 0.99. Depending on the number of neurons N, a curve of C is chosen to determine the corresponding attraction radius p for a given value of α. To show that the NZAM outperforms the ZAM, the curves of C = 3 for these two networks are plotted. (Curves: C = 3, C = 3 (ZAM), C = 3.5, C = 4; horizontal axis α, 0.02 to 0.18.)
The above discussion of attraction radius can be viewed from another slightly different perspective. We want to know how the attraction radii change as the number of stored states increases. Again we consider two NZAMs A and B and assume their values of M and the input error ratios p are (M, 0) and (M', p), respectively. Since both NZAMs are required to have high probability of P_r > 0.99, they must have identical values of C. Using these assumptions and equation 3.14, we obtain
$$(1 - 2p) = \sqrt{\frac{\alpha'}{\alpha}}\;\frac{1 + \alpha}{1 + \alpha'}$$
where α = M/N and α' = M'/N. From the previous discussion α and α' ≪ 1, so
$$M' = (1 - 2p)^2 M. \qquad (3.15)$$
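The tradeoff between capacity and attraction radius can be tabulated numerically. The sketch below is illustrative only: it assumes the large-N relation C = (1 − 2p)(1 + α)/√α obtained from equation 3.8, solved for p, together with the small-α relation M' = (1 − 2p)²M. The function names and the sample values are hypothetical.

```python
import math


def attraction_radius(alpha, c):
    """Input error ratio p at which the C parameter equals c for a given alpha,
    from C = (1 - 2p)(1 + alpha)/sqrt(alpha). A negative result means the
    required C cannot be reached at this alpha."""
    return 0.5 * (1 - c * math.sqrt(alpha) / (1 + alpha))


def reduced_capacity(m, p):
    """Number of vectors storable with attraction radius p, relative to M
    vectors stored with radius 0 (valid for alpha, alpha' << 1)."""
    return (1 - 2 * p) ** 2 * m


if __name__ == "__main__":
    for alpha in (0.02, 0.06, 0.10, 0.14):
        print(alpha, round(attraction_radius(alpha, 3.0), 3))
    print(reduced_capacity(20, 0.1))  # about 12.8 vectors
```

At C = 3 the radius falls to zero near α ≈ 0.146, which is the capacity limit of roughly 0.15N.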
Thus we obtain the same relationship that has been derived for the ZAM previously (McEliece et al. 1987; Wang et al. 1990). It is interesting to compare the above results with the result for attraction basins using Keeler's graphing method (Keeler 1986), which determines the structure of attraction basins of the Hopfield net using a random two-dimensional slice of state space. The existence of a well-defined "ball" of radius p contained entirely within the basin of attraction ensures that the memory will recall any state within Hamming distance pN of the stored state. For N = 200 and 28 stored vectors, Keeler's method gives a ball of radius p = 0.05. If we use this value of radius and M = 0.15N in equation 3.15, we obtain the predicted capacity M' = 25. Thus, considering the similarity of the NZAM to the Hopfield net, it is not surprising to see that Keeler's results for attraction basins and the attraction radius results presented in this paper compare favorably.
4 Summary
With the help of the C parameter, this paper has offered theoretical explanations for several unique characteristics associated with the NZAM, that is, the Hopfield network with nonzero diagonal terms in the memory matrix. We have proved that there exists a double-root characteristic that results in the unique behavior of recall probability P_r when the input error ratio p = 0, that is, the probability of a successful one-step recall decreases rapidly for 0 < α ≤ 1, but once α exceeds 1 it begins to increase slowly (up to P_r = 1.0) and finally saturates at 1 when α ≫ 1. The result of the double-root characteristic is also extended to examine the case of p > 0. The network is shown to essentially perform no error correction if α ≥ 2. Finally, we showed that to ensure that the stored vectors are the attractors and that every erroneous input vector will, with high probability, converge to its corresponding stored vector, α must not exceed 0.15 for all N. The tradeoff between the size of the attraction radii and the number of stored patterns M under the requirement of high P_r, that is, P_r > 0.99, was also analyzed.
Acknowledgments
This research was supported by the Air Force Office of Scientific Research (AFOSR Grant 88-0064).
References
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett. 55, 1530-1533.
Anderson, J. A., and Mozer, M. 1981. Categorization and selective neurons. In Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, eds. Erlbaum, Hillsdale, NJ.
Gindi, G. R., Gmitro, A. F., and Parthasarathy, K. 1988. Hopfield model associative memory with nonzero-diagonal terms in the memory matrix. Appl. Opt. 27(1), 129-134.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Keeler, J. D. 1986. Basins of attraction of neural network models. In Neural Networks for Computing, Vol. 151, pp. 259-264, J. Denker, ed. AIP Conference Proceedings.
McEliece, R. J., Posner, E. C., Rodemich, E. R., and Venkatesh, S. S. 1987. The capacity of the Hopfield associative memory. IEEE Trans. Inform. Theory IT-33, 461-482.
Pentland, A. 1989. Part segmentation for object recognition. Neural Comp. 1, 82-91.
Stiles, G. S., and Denq, D. L. 1987. A quantitative comparison of the performance of three discrete distributed associative memory models. IEEE Trans. Comput. C-36(3), 257-263.
Wang, J. H., Krile, T. F., and Walkup, J. F. 1990. Determination of Hopfield associative memory characteristics using a single parameter. Neural Networks 3(3), 319-331.
Received 8 December 1989; accepted 7 February 1991.
Communicated by Richard Lippmann
Handwritten Digit Recognition Using K Nearest-Neighbor, Radial-Basis Function, and Backpropagation Neural Networks Yuchun Lee Digital Equipment Corp., 40 Old Bolton Road OG01-2/U11, Stow, MA 02775 USA
Results of recent research suggest that carefully designed multilayer neural networks with local "receptive fields" and shared weights may be unique in providing low error rates on handwritten digit recognition tasks. This study, however, demonstrates that these networks, radial basis function (RBF) networks, and k nearest-neighbor (kNN) classifiers all provide similar low error rates on a large handwritten digit database. The backpropagation network is overall superior in memory usage and classification time but can provide "false positive" classifications when the input is not a digit. The backpropagation network also has the longest training time. The RBF classifier requires more memory and more classification time, but less training time. When high accuracy is warranted, the RBF classifier can generate a more effective confidence judgment for rejecting ambiguous inputs. The simple kNN classifier can also perform handwritten digit recognition, but requires a prohibitively large amount of memory and is much slower at classification. Nevertheless, the simplicity of the algorithm and fast training characteristics make the kNN classifier an attractive candidate in hardware-assisted classification tasks. These results on a large, high input dimensional problem demonstrate that practical constraints including training time, memory usage, and classification time often constrain classifier selection more strongly than small differences in overall error rate.
1 Introduction
Several successes have recently been reported in applying neural networks to handwritten digit recognition (LeCun et al. 1989; Martin and Pittman 1990). Near human performance seems to be within reach, at least in cases where digits are accurately segmented. The most noticeable achievement in neural network-based algorithms has been in recognizing handwritten digits, although preliminary results have shown that alphacharacter recognition is also quite promising (Martin and Pittman 1990).
Neural Computation 3, 440-449 (1991) © 1991 Massachusetts Institute of Technology
Unlike previous research in character recognition, neural network-based recognition algorithms require little preprocessing of the character images. The input to the network is typically a fixed-size, gray-scale, pixel image of the character. No other feature information is necessary. Backpropagation in conjunction with a multilayered feedforward structure and sigmoidal nonlinearity is most commonly used. Why did such a simple solution take so long to develop? First, there was speculation that local receptive fields and weight sharing were the keys to high recognition performance (LeCun et al. 1989). Martin and Pittman (1990), however, showed that even a standard fully connected feedforward neural network trained with backpropagation achieved excellent recognition performance. As pointed out in Martin and Pittman (1990), this result suggests that with 5,000 to 30,000 training samples, minimizing the number of free parameters does not contribute significantly to high performance in character recognition. Factors such as the quality and quantity of the training set appear to be more critical. Fully connected feedforward neural networks, trained with backpropagation and tested on small speech problems, have been shown to be equivalent in classification performance to other classifiers such as k nearest-neighbor (kNN), decision trees, and radial basis function (RBF) classifiers (Lee and Lippmann 1990; Lippmann 1989; Ng and Lippmann 1991). When this occurs, a classifier should be chosen to satisfy practical constraints such as memory requirements, training time, and classification time. However, there are sufficient reasons to believe that this result may not generalize to larger problems. Algorithms based on Euclidean distances, such as the kNN algorithm, presumably suffer from the "curse of dimensionality" and require a number of training examples that grows exponentially with the dimensionality of the input.
There are, however, speculations based on theoretical and empirical results that many "real-world" problems are much "simpler" than those problems that have been proven to be problematic in high dimensions. One empirical result reported in Martin and Pittman (1990) showed that fewer examples are needed for a digit recognition problem than predicted by a theoretical analysis from Baum and Haussler (1989). Based on these past results, a comparison was made between backpropagation, kNN, and RBF classifiers on a large handwritten digit recognition task. The purpose of this study was (1) to extend previous results on classifier comparisons to a much higher input dimensional problem, and (2) to explore the practicality of all three algorithms on handwritten digit recognition tasks.
1.1 Data and Hardware Platform. The digit database contains 30,600 training and 5,060 testing patterns. These patterns are presegmented handwritten digits from the total amount section of real-world financial receipts written by different people. Even though segmentation is an important part of the overall problem, this paper concentrates on the task
of classifying segmented digits. Fewer than one percent of the training patterns are mislabeled or incorrectly segmented. Some examples from the database are shown in Figure 1. Each original 300 dots-per-inch binary image of a presegmented digit in the database was scaled linearly to fit within a 15 by 24 gray-scale pixel map with pixel values normalized between 0 and 9. The 360 pixel gray-scale images were used directly for classifier training and testing. All experiments were performed on a RISC-based DECstation 3100 with a rating of 14 million instructions per second (MIPS) and roughly 3.7 million floating-point operations per second (MFLOPS).
Figure 1: Examples of digits from the test set.
1.2 Backpropagation Network. The backpropagation neural network has a feedforward structure in which nodes in the hidden layer have local "receptive fields," which receive inputs from a limited number of nodes in the layer below. Within a hidden layer, nodes are grouped to form various "feature maps." Nodes of the same feature map share the same set of weights but cover different spatial locations. Each node in the output layer represents one class. Classification is determined by the node with the highest activation. The learning algorithm is the standard backpropagation algorithm that minimizes the mean-square error between the desired output and the actual output of the network (the desired values for each training example are "1" for the correct class and "0" for the incorrect classes). For this study, a network with 360 input nodes and two hidden layers with 540 and 102 nodes was used. Nodes in the hidden layers have limited receptive fields. Ten output nodes representing the 10 classes were fully connected to the second hidden layer. Other networks used in digit recognition that have similar architectures can be found in LeCun et al. (1989) and Martin and Pittman (1990).
1.3 Radial Basis Function Classifier. A radial basis function network (Broomhead and Lowe 1988) performs classification using kernel functions, typically gaussian. These kernel functions are placed at key locations in the input space. A map based on a weighted-sum interpolation of the kernel functions is then formed. The RBF classifier has one hidden layer of gaussian nodes and an output layer of linear nodes, with full inter-layer connections. Weights to hidden nodes encode basis function centers and standard deviations. Each hidden node computes a gaussian kernel function based on the euclidean distance between the input and the basis function center. Each output node’s activation is then a weighted sum of the hidden node outputs. The RBF classifier used in this study had 1,000 basis function nodes, which was determined experimentally to be adequate. To shorten training time, the locations of all centers were determined by randomly selecting samples from the training set. Previous experiments have shown that with enough basis functions and appropriate spreads, this method of selecting the basis center is adequate (Ng and Lippmann 1991). Each basis function is a gaussian with diagonal covariance. The standard deviation of a basis function is the same in all dimensions and is set to be the euclidean distance between the basis function and the nearest other basis function, multiplied by a global scaling factor. The proper scaling factor was determined experimentally to be 1. Weights to output nodes were determined by a pseudo matrix inversion performed using singular value decomposition (Ng and Lippmann 1991).
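The center selection and spread rule described above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the study: it covers only the hidden layer (the output weights, which the text obtains by a pseudo-inverse, are omitted), assumes the gaussian form exp(−d²/(2σ²)), and uses hypothetical function names and toy data.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_rbf_layer(training_set, n_centers, scale=1.0, seed=0):
    """Pick centers at random from the training set; each spread is the distance
    to the nearest other center times a global scaling factor."""
    rng = random.Random(seed)
    centers = rng.sample(training_set, n_centers)
    spreads = []
    for i, c in enumerate(centers):
        nearest = min(euclidean(c, o) for j, o in enumerate(centers) if j != i)
        # Guard against duplicate centers, which would give a zero spread.
        spreads.append(max(scale * nearest, 1e-12))
    return centers, spreads


def hidden_activations(x, centers, spreads):
    """Gaussian kernel outputs for input x (assumed form exp(-d^2 / (2 s^2)))."""
    return [math.exp(-euclidean(x, c) ** 2 / (2 * s * s))
            for c, s in zip(centers, spreads)]


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # toy 2-D patterns
    centers, spreads = build_rbf_layer(data, n_centers=4)
    print(spreads)
    print(hidden_activations([0.5, 0.5], centers, spreads))
```

Each output node would then form a weighted sum of these activations, with the weights fitted by least squares on the training set.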
1.4 K Nearest-Neighbor Classifier. The kNN classifier is a very simple algorithm in which each input pattern to be classified is compared to a set of stored patterns. Each stored pattern has a class label from the digit set "0" to "9." The k nearest stored patterns to the input pattern are retrieved. The classification is the class with most representatives in the k retrieved patterns. The stored set of patterns in this study was the whole training set. Euclidean distances were used, and the best value of k was determined experimentally to be nine.
1.5 Classification Confidence. In real-world character recognition, the cost of an incorrect classification is often much greater than that of "rejecting" a doubtful classification. To effectively apply a neural net or any other classifier, the classification needs to include a "confidence" measure. For this study, the classification confidence of the neural network and the radial basis function classifier were determined by the activation difference between the two output nodes with the highest outputs. The confidence of the kNN classifier was determined by the difference between the number of representatives of the top two classes in the set of k nearest neighbors, divided by the radius of the hypersphere enclosing all k nearest neighbors (Duda and Hart 1973). A scalar threshold value
was applied to these confidence values to reject classifications with low confidence. This increases classification accuracy by rejecting inputs that are similar to more than one class.
Table 1: Performance for the Three Classifiers without Rejection on the Handwritten Digit Problem.
                                  Backpropagation    RBF        kNN
Error rate (0% rejected)          5.15%              4.77%      5.14%
Number of free parameters         5,472              371,000    11,016,000
Training time (hours)             67.68              16.54      0.00
Classification time (sec/char)    0.14               0.24       6.22
2 Experimental Results
All classifiers were trained on the training set and good parameter configurations were found through trial-and-error using the test set. That is, the results reported represent the best performance of each classifier on the test set. Only the backpropagation network encoded spatial information of the digits in local receptive fields. RBF and k" classifiers both treated the input as an unordered one-dimensional vector with 360 elements. Classification error on the test set, number of free parameters (or memory usage), training time, and classification time of the three classifiers are shown in Table 1. Error rates of all classifiers were within one binomial standard deviation (CT= f0.31%)of the mean error rate (5.02%)with 0% rejection. Similar small differences in error rates were also evident in previous smaller lower dimensional problems (Lee and Lippmann 1990; Ng and Lippmann 1991). Practical characteristics, however, differ dramatically between the three algorithms. The kNN classifier requires a prohibitively large memory to store the entire training set, but no training time is required. Since each k" classification requires more than 30,000 distance calculations, classification is also extremely slow. The RBF classifier requires an intermediate amount of memory. Training time is 25% that of the backpropagation network, but classification time is twice as long as the backpropagation network. The backpropagation classifier performed very well in terms of memory usage and classification time, but required a long training time. It required only 5,472 free parameters. In comparison, the RBF classifier had 371,000 and the k" required over 11 million free parameters. The
backpropagation classifier, however, required close to 3 days of training to obtain good classification results.

Figure 2: Test set classification error versus percent patterns rejected for kNN, RBF, and backpropagation classifiers. (Horizontal axis: percentage of input patterns rejected.)

Figure 2 shows the test-set classification error as a function of the fraction of input digits rejected. Overall, the kNN classifier was the least effective at confidence level generation. The RBF and backpropagation classifiers did not differ significantly in generating confidence values when the required error was above 1%. However, most applications usually need to accurately recognize a field with multiple digits. This requires higher per-digit recognition accuracy (e.g., a lower than 1.5% error on five digit fields requires a per-digit error of no more than 0.3% if the errors are not correlated). For the low per-digit error rate requirements, the RBF classifier was better at providing low error rates with low rejection. For example, if the maximum error rate allowed per character is 0.3%, the RBF classifier rejects only 19.3% of the digits. To achieve this accuracy rate, a backpropagation classifier must reject 30.9% and a kNN classifier 66.4% of the test set digits.

2.1 "False-Positive" Responses. The test set used for the comparisons consisted of totally segmented hand-written digits. Although digit patterns can be classified quite effectively, some patterns that were not digits were also classified with high confidence by the backpropagation network as valid digits. This phenomenon can be explored by
“inverting” the feedforward network as described in Williams (1986) to iteratively modify the input pattern. Figure 3 shows some input patterns created from network inversion for the 10 digits. Various input pattern initialization methods were used, including varying the density of “on” pixels and combining parts of real numbers. Different sets of “false-positive” patterns can result from different initializations. The confidence levels for these “false-positive” patterns are high. A backpropagation network with threshold set to achieve 99.98% accuracy with 50% rejection rate on the test set will not be able to reject these patterns. As expected, both kNN and RBF classifiers were able to assign low confidence to these patterns and rejected them even at low rejection rates (less than 6 and 4%, respectively).

Figure 3: Examples of false digit patterns classified with high confidence by the backpropagation network but rejected by kNN and RBF classifiers. The patterns are, from upper-left to lower-right, “zero” to “nine.”
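The inversion idea can be illustrated with a minimal sketch. It assumes a tiny, randomly weighted one-hidden-layer sigmoid network (a hypothetical stand-in, not the trained digit network of this study) and performs gradient ascent on the input pattern, with the weights held fixed, to raise the activation of one output unit. The names `forward` and `invert_input`, the network sizes, and the step-halving rule are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden activations and the scalar output of a small sigmoid network."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return h, y

def invert_input(x0, W1, b1, W2, b2, steps=200, lr=1.0):
    """Gradient ascent on the input (weights fixed) to raise the output.

    Pixels are clipped to [0, 1]. A step is kept only if it raises the
    output; otherwise the step size is halved (an illustrative safeguard).
    """
    x = x0.copy()
    for _ in range(steps):
        h, y = forward(x, W1, b1, W2, b2)
        # Chain rule: dy/dx = y(1 - y) * W1^T [W2 * h(1 - h)]
        grad = y * (1.0 - y) * (W1.T @ (W2 * h * (1.0 - h)))
        x_new = np.clip(x + lr * grad, 0.0, 1.0)
        _, y_new = forward(x_new, W1, b1, W2, b2)
        if y_new > y:
            x = x_new
        else:
            lr *= 0.5
    return x

rng = np.random.default_rng(0)
n_in, n_hid = 16, 6                       # hypothetical sizes
W1 = rng.normal(size=(n_hid, n_in))
b1 = rng.normal(size=n_hid)
W2 = rng.normal(size=n_hid)
b2 = 0.0

x_start = np.full(n_in, 0.5)              # uniform gray starting pattern
_, y_before = forward(x_start, W1, b1, W2, b2)
x_final = invert_input(x_start, W1, b1, W2, b2)
_, y_after = forward(x_final, W1, b1, W2, b2)
print(f"output before: {y_before:.3f}, after: {y_after:.3f}")
```

Different starting patterns generally lead to different final patterns, which is consistent with the observation above that different initializations yield different sets of false-positive patterns.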
3.1 Handwritten Digit Recognition Is “Simple”. Surprisingly, a brute-force kNN algorithm provided excellent performance on the handwritten digit recognition task. Other classifiers such as the learning vector quantizer and the feature map classifier (Lee and Lippmann 1990) are also candidates for the problem since they are similar to the kNN classifier. Analytical results in Baum (1990) which show that a kNN classifier may perform poorly in high-dimensional problems do
not contradict this result. Real-world problems, in this case handwritten digits, may be far more structured than the artificial uniform distribution problems used in the analysis in Baum (1990). Results in this study emphasize the importance of applying learning theory to more realistic distributions. The kNN algorithm's unacceptable memory usage and classification time need to be addressed before this algorithm can become useful. As memory and CPU resources are becoming less costly, custom hardware implementations of kNN algorithms are becoming feasible. Improved variations of kNN classifiers that reduce memory and computation requirements (Chang and Lippmann 1991; Lee and Lippmann 1990; Lippmann 1989; Ng and Lippmann 1991) can also improve the practicality of these classifiers. The unavailability of adequate computing resources and training databases, not the lack of better classification algorithms, appears to be the main factor in preventing earlier success in handwritten digit recognition. The digit classification problem is as simple as many other classification tasks. However, a sufficiently large training set (tens of thousands of training samples) is required along with long training times and/or large memory resources. These practical constraints were much more severe even a few years ago.

3.2 Impacts of "False-Positive" Patterns. One factor that differentiates the backpropagation network from the two euclidean distance-based classifiers (kNN and RBF) is the type of mapping generated by each processing element. The sigmoid operator in the backpropagation network can have "high" output even in regions far away from the area occupied by patterns used to train the weights. In contrast, the RBF and kNN classifiers tend to map input regions far from the training patterns to low values. This difference may be the main reason for the "false-positive" response found in backpropagation networks and not in RBF and kNN classifiers.
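A small numerical sketch can make this contrast concrete. It assumes a single hypothetical sigmoid unit and a single Gaussian radial basis unit, both with made-up parameters (not values from this study), and evaluates them at one point near the region they respond to and at one point far from it.

```python
import numpy as np

def sigmoid_unit(x, w, b):
    """Sigmoid of a weighted sum: saturates toward 1 as w.x + b grows."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def gaussian_rbf_unit(x, center, width):
    """Gaussian of the euclidean distance to a center: decays toward 0 with distance."""
    d2 = np.sum((x - center) ** 2)
    return np.exp(-d2 / (2.0 * width ** 2))

# Hypothetical parameters, for illustration only.
w = np.array([1.0, 1.0])
b = -1.0
center = np.array([1.0, 1.0])
width = 1.0

near = np.array([1.0, 1.0])       # inside the region covered by "training" data
far = np.array([50.0, 50.0])      # far outside that region

sig_near, sig_far = sigmoid_unit(near, w, b), sigmoid_unit(far, w, b)
rbf_near, rbf_far = gaussian_rbf_unit(near, center, width), gaussian_rbf_unit(far, center, width)

print(f"sigmoid: near={sig_near:.3f}, far={sig_far:.3f}")
print(f"RBF:     near={rbf_near:.3f}, far={rbf_far:.3f}")
```

In this toy setting the sigmoid unit responds more strongly to the remote point than to the nearby one, while the radial basis unit's response vanishes, which is the behavior the paragraph above attributes to the two classifier families.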
These "false-positive" responses are a problem only if they occur too frequently in real applications. This potential problem must, however, be recognized. Applications using neural network classifiers such as character recognition may need to be designed to minimize this problem. Additional work is needed to gain a better understanding of how remote regions not specified by the training set are specified by a backpropagation network. A better understanding of this problem is essential to improve the overall generalization performance of neural networks.
3.3 Local versus Nonlocal Operators. A "nonlocal operator," as in a backpropagation network, is trained on regions specified by the training set. Regions remote from the training data are mapped arbitrarily because they pose no constraint on the error minimization criteria. The
advantage of algorithms that use “local operators,” such as kNN and RBF classifiers, is that regions not specified by the training data will not be mapped to high activation values. However, a major problem in local operator-based classification is that more memory usage and computation requirements are typically necessary. These resource usage problems can worsen drastically as the classification task becomes “harder.” Therefore, these algorithms can be feasible only in a range where the difficulty of the problem can be properly counterbalanced by the flexibility in implementation constraints. The surprisingly good results from the kNN classifier in digit recognition suggest that the feasibility of such algorithms on real-world problems should be determined through actual testing before ruling them out as implausible.

4 Conclusion
This study extends and complements previous results in classifier comparison and handwritten digit recognition using neural networks. RBF, kNN, and a backpropagation network were shown to have equivalent classification accuracy in a large presegmented handwritten digit recognition problem. These classifiers differ drastically in memory usage, training time, and classification time. A backpropagation network with local receptive fields and shared weights was shown to be highly effective in compressing the information needed in classification to a low number of free parameters. Memory requirements can be reduced with fewer free parameters; however, generalization did not improve. With enough training examples, even a simple classifier such as a kNN classifier was able to solve a seemingly complex high-dimensional problem. The locality and the smooth interpolation in the RBF classifier enabled it to produce better confidence judgments for rejecting patterns than kNN and backpropagation classifiers. Orders of magnitude more memory were required for kNN and RBF classifiers than for the backpropagation network. Classification time was also longer for kNN and RBF classifiers than it was for the backpropagation network. However, the backpropagation network required more training time. These implementational characteristics, not error rate, dictate the feasibility of these classifiers. The fast rate of decline in memory cost and increase in CPU speed constantly alter the criteria for classifier selection. Some algorithms such as RBF and kNN classifiers that are memory and computationally intensive are now feasible alternatives for difficult real-world problems.
Acknowledgments
I would like to thank Dr. Richard Lippmann of MIT Lincoln Laboratory and Roger Marian, Ruby Li, and John Canfield of Digital Equipment Corporation for reviewing this paper and providing valuable suggestions.
I would also like to thank Bill Muth of Digital Equipment Corporation for his help in image processing.

References

Baum, E. B. 1990. When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples? In EURASIP Workshop on Neural Networks, European Association for Signal Processing, Sesimbra, Portugal.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Broomhead, D. S., and Lowe, D. 1988. Radial Basis Functions, Multivariable Functional Interpolation and Adaptive Networks. Tech. Rep. RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, Worcs., Great Britain.
Chang, E. I., and Lippmann, R. P. 1991. Using genetic algorithms to improve pattern classification performance. In Neural Information Processing Systems 3, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
Lee, Y., and Lippmann, R. P. 1990. Practical characteristics of neural network and conventional pattern classifiers on artificial and speech problems. In Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 168-177. Morgan Kaufmann, San Mateo, CA.
Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Martin, G. L., and Pittman, J. A. 1990. Recognizing hand-printed letters and digits. In Neural Information Processing Systems 2, D. Touretzky, ed., pp. 405-414. Morgan Kaufmann, San Mateo, CA.
Ng, K., and Lippmann, R. P. 1991. A comparative study of the practical characteristics of neural network and conventional pattern classifiers. In Neural Information Processing Systems 3, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Williams, R. J. 1986. Inverting a connectionist network mapping by backpropagation of error. In Proceedings 8th Annual Conference of the Cognitive Science Society, Lawrence Erlbaum, Hillsdale, NJ.
Received 26 October 1990; accepted 25 January 1991.
Communicated by Halbert White
A Matrix Method for Optimizing a Neural Network

Simon A. Barton
Defence Research Establishment Suffield, Box 4000, Medicine Hat, Alberta, T1A 8K6, Canada
A matrix method is described that optimizes the set of weights and biases for the output side of a network with a single hidden layer of neurons, given any set of weights and biases for the input side of the hidden layer. All the input patterns are included in a single optimization cycle. A simple iterative minimization procedure is used to optimize the weights and biases on the input side of the hidden layer. Many test problems have been solved, confirming the validity of the method. The results suggest that for a network with a single layer of hidden sigmoidal nodes, the accuracy of a functional representation is reduced as the nonlinearity of the function increases.

1 Introduction
For a feedforward network with a single hidden layer and a simple linear combination output layer (i.e., no squashing), the hidden to output weights can be fully optimized by the least-squares method, given a fixed set of input to hidden weights. By varying the input weights, and always finding the optimum output weights for each variation, optimization of the complete network may be approached iteratively. This can be viewed as an implementation of the EM algorithm (Dempster et al. 1977). The work of Golub and Pereyra (1973) shows that this procedure will lead to a local optimum jointly in the output and input weights, even though they are not jointly optimized (cf. their Theorem 2.1). As in any general multiparameter optimization scheme, local minima may be reached, and initialization of the weights is important. In Section 2.3 we suggest a simple initialization scheme that leads to good convergence in all the cases tested here.

Neural Computation 3, 450-459 (1991) © 1991 Massachusetts Institute of Technology

A feedforward neural network passes one or more input signals through one or more layers of processing units (nodes). Every input channel is connected to every node, and the output of each node then acts as an input channel to the next layer. Every node first multiplies each input signal by a constant weight, sums these values, and adds a further constant bias. The output of the node is some function, the transfer function, of this sum. For J discrete input signals on I channels, $\{x_{ij};\ i = 1, I;\ j = 1, J\}$, the jth sum, $\sigma_{lnj}$, formed by the nth node in the lth layer, is

$$\sigma_{lnj} = b_{ln} + \sum_{i=1}^{I} w_{lni}\, x_{ij} \qquad (1.1)$$

where $b_{ln}$ is the bias and $\{w_{lni}\}$ are the weights connecting the ith input channel to the nth node in the lth layer. The output, $F_{lnj}$, of the node is generally chosen to be sigmoidal. Thus

$$F_{lnj} = \frac{1}{1 + \exp(-\sigma_{lnj})} \qquad (1.2)$$
In conventional networks, the input channels do not process the signals, whereas the output nodes normally do. The layers of processing nodes between the input and output nodes are called hidden layers. The weights $\{w_{lni}\}$ and biases $\{b_{ln}\}$ are normally optimized by iterative adjustment. A set of inputs and required outputs must be provided for the optimization, which normally minimizes the least-squares function

$$S = \sum_{m=1}^{M} \sum_{j=1}^{J} \left(y_{mj} - \hat{y}_{mj}\right)^2 \qquad (1.3)$$

where $y_{mj}$ is the mth output signal for the jth input vector, and $\hat{y}_{mj}$ is the corresponding required value. M is the number of output channels and J is the number of points (I/O vectors) in the fit data set. The backpropagation algorithm (Rumelhart et al. 1986) is most commonly used to adjust the weights and biases. This is essentially a linear gradient descent technique, with optional empirical parameters, called the learning rate and momentum, that are used to adjust the magnitude and rate of change of the weights and biases. The backpropagation method has been described in detail (Rumelhart et al. 1986; Wasserman 1989). The weights and biases are initially chosen to be a set of random numbers on some interval, usually between -1 and +1.

2 A Matrix Optimization Technique
Even for relatively simple problems, the backpropagation algorithm may fail to find a global minimum in the error function S (equation 1.3). Convergence is slow, there may be oscillations, local minima may be reached, and high precision in the network output is difficult to achieve. An alternative to the backpropagation method is therefore desirable.

2.1 A Linear Combination of Node Outputs. Consider the representation of a function of a single variable that can be obtained as a linear combination of the outputs from a set of N sigmoidal nodes in a single
hidden layer. In this case the subscripts i and l can be eliminated. The network output for input point $x_j$ is

$$y(x_j) = c_0 + c_1 F_{1j} + \cdots + c_N F_{Nj} \qquad (2.1)$$

where, for n = 1, N, the transfer function value is

$$F_{nj} = \frac{1}{1 + \exp\left[-(w_n x_j + b_n)\right]} \qquad (2.2)$$
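In this network representation, the output node simply performs a summation that is not passed through the transfer function. For the output node, $c_0$ corresponds to a bias, and $\{c_1, \ldots, c_N\}$ correspond to a set of weights. From recent theoretical work (Hartman et al. 1990), we expect that most real functions will be well approximated by such a network of sigmoidal units. Let the network output for the jth input be $y_j$ [i.e., $y(x_j)$], and let the required value be $\hat{y}_j$. For any given set $\{w_n, b_n\}$, we wish to minimize the sum

$$S = \sum_{j=1}^{J} \left(y_j - \hat{y}_j\right)^2 \qquad (2.3)$$

$$\phantom{S} = \sum_{j=1}^{J} \left(\sum_{n=0}^{N} c_n F_{nj} - \hat{y}_j\right)^2 \qquad (2.4)$$

where

$$F_{0j} \equiv 1 \qquad (2.5)$$

To minimize S, we require, for k = 0, 1, ..., N,

$$\frac{\partial S}{\partial c_k} = 0 \qquad (2.6)$$

that is,

$$\sum_{j=1}^{J} \sum_{n=0}^{N} c_n F_{nj} F_{kj} = \sum_{j=1}^{J} \hat{y}_j F_{kj} \qquad (2.7)$$

Reversing the order of the summations gives

$$\sum_{n=0}^{N} c_n \sum_{j=1}^{J} F_{kj} F_{nj} = \sum_{j=1}^{J} \hat{y}_j F_{kj} \qquad (2.8)$$

Equation 2.8 is equivalent to a linear system of (N + 1) simultaneous equations in the (N + 1) unknowns $\{c_n\}$. For k = 0 this becomes

$$\sum_{n=0}^{N} c_n \sum_{j=1}^{J} F_{nj} = \sum_{j=1}^{J} \hat{y}_j \qquad (2.9)$$

and for k > 0

$$\sum_{n=0}^{N} c_n \sum_{j=1}^{J} F_{kj} F_{nj} = \sum_{j=1}^{J} \hat{y}_j F_{kj} \qquad (2.10)$$

In matrix notation, the linear system is

$$AC = Y \qquad (2.11)$$

where the elements of the vector C are $\{c_0, c_1, \ldots, c_N\}$, and the elements of the symmetric matrix A and the vector Y are given by the following (recall that $F_{0j} = 1$):

$$A_{0n} = \sum_{j=1}^{J} F_{nj} \qquad (2.12)$$

$$A_{kn} = \sum_{j=1}^{J} F_{kj} F_{nj} \qquad (2.13)$$

$$Y_0 = \sum_{j=1}^{J} \hat{y}_j \qquad (2.14)$$

$$Y_k = \sum_{j=1}^{J} \hat{y}_j F_{kj} \qquad (2.15)$$

The solution for C is

$$C = A^{-1} Y \qquad (2.16)$$

$$|A| \neq 0 \qquad (2.17)$$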
This system may be solved by gaussian elimination provided that the determinant of A does not approach zero. To summarize, for a given set of values $\{w_n, b_n\}$ that connect a single input to N nodes in a single hidden layer, the best values for the output node expansion coefficients are given by equations 2.12 to 2.16.

2.2 Functions of Multiple Inputs. The matrix method can also be applied to a single output that is a function of I inputs. In this case, the weights $\{w_{ni}\}$ connect each input channel (i = 1, I) to each hidden layer node (n = 1, N). For the jth input vector, the network output is

$$y_j = c_0 + \sum_{n=1}^{N} c_n F_{nj} \qquad (2.18)$$

where now

$$F_{nj} = \frac{1}{1 + \exp\left[-\left(\sum_{i} w_{ni} x_{ij} + b_n\right)\right]} \qquad (2.19)$$
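The linear solve of equations 2.11 to 2.16, with hidden-node outputs of the form in equation 2.19, can be sketched as follows. This is a minimal NumPy illustration rather than the implementation used in this work; the function names, the example target function, the weight and bias values, and the use of `np.linalg.solve` are assumptions made for the sketch.

```python
import numpy as np

def hidden_outputs(X, W, b):
    """Return F with F[j, 0] = 1 and F[j, n] for n = 1..N as in equation 2.19.

    X: (J, I) input vectors, W: (N, I) input-side weights, b: (N,) biases.
    """
    F = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return np.hstack([np.ones((X.shape[0], 1)), F])

def optimal_output_coefficients(X, y_required, W, b):
    """Solve A C = Y for the expansion coefficients C = (c_0, ..., c_N)."""
    F = hidden_outputs(X, W, b)
    A = F.T @ F               # A[k, n] = sum_j F[j, k] * F[j, n]
    Y = F.T @ y_required      # Y[k]    = sum_j y_required[j] * F[j, k]
    return np.linalg.solve(A, Y)

# Hypothetical example: one input channel on [-1, 1], three hidden nodes.
X = np.linspace(-1.0, 1.0, 41).reshape(-1, 1)
y_required = X[:, 0] ** 2
W = np.full((3, 1), 4.0)                  # equal input weights (assumed value)
b = np.array([-4.0, 0.0, 4.0])            # unique biases (assumed values)

C = optimal_output_coefficients(X, y_required, W, b)
F = hidden_outputs(X, W, b)
S = np.sum((F @ C - y_required) ** 2)     # least-squares sum for this fit
print("C =", C)
print("S =", S)
```

For badly conditioned A, a least-squares routine that works on F directly (for example `np.linalg.lstsq`) may be numerically safer than forming A explicitly; the normal-equation form is shown here only because it mirrors the derivation.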
The equations for the optimum C then follow as before. Multiple output nodes can be treated as independent functions, each with its set of weights and biases $\{w_{ni}, b_n\}$, and optimized expansion coefficients C. Alternatively, one could attempt to use the same set $\{w_{ni}, b_n\}$, and find M optimizing vectors $\{C_m\}$, where M is the number of output channels. The former approach leads to an independent network for each output channel, whereas the latter would construct a single network for all output channels.

2.3 Optimization of $\{w_{ni}, b_n\}$. In general, for each output channel the surface given by S (equation 2.3) must be minimized. This is a function of N(I + 1) variables, that is, the set $\{w_{ni}, b_n\}$. By varying the values of these parameters, and always finding the optimum vector C for each choice, a minimum in S can be approached numerically. The choice of the best minimization procedure for the weights and biases on the input side of the hidden layer is not the goal of this work. Whatever the procedure for choosing the $\{w_{ni}, b_n\}$, the coefficients connecting the hidden layer to an output node are globally optimized by the vector C. To demonstrate that this procedure can lead to a rapid network optimization we made a simple initial choice and subsequent variation of $\{w_{ni}, b_n\}$ to minimize S. First, note that the matrix A will become singular if the output from any node in the hidden layer is equal to, or is a constant multiple of, the output of any other hidden layer node, for every point in the input vector space. This can be prevented by choosing unique initial values for $\{b_n\}$. The values for $\{w_{ni}\}$ can then be initialized either as random numbers on some interval, or they can all be chosen to be equal. To prevent singularity during the variation of $\{w_{ni}, b_n\}$, it is sufficient to maintain unique values for the set $\{b_n\}$, that is, $b_k$ is not allowed to approach $b_{k+1}$.
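The singularity condition can be checked numerically with a short sketch. It assumes a single input channel, equal input weights with an arbitrarily chosen value of 5, and compares a set of unique biases with a set in which two nodes share a bias. The rank of the matrix of node outputs F is reported; since A is formed from products of the columns of F, A is singular whenever F is rank deficient. All names and values are illustrative.

```python
import numpy as np

def design_matrix(x, w, b):
    """Columns: a constant 1 followed by one sigmoid output per hidden node."""
    F = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))
    return np.hstack([np.ones((x.size, 1)), F])

x = np.linspace(-1.0, 1.0, 50)                  # inputs scaled to [-1, +1]
w = np.full(4, 5.0)                              # equal input weights (assumed value)

b_unique = np.linspace(-5.0, 5.0, 4)             # evenly spread, all different
b_repeated = np.array([-5.0, 0.0, 0.0, 5.0])     # two nodes share a bias

rank_unique = np.linalg.matrix_rank(design_matrix(x, w, b_unique))
rank_repeated = np.linalg.matrix_rank(design_matrix(x, w, b_repeated))
print("rank with unique biases:  ", rank_unique)
print("rank with repeated biases:", rank_repeated)
```

With five columns (the constant plus four nodes), full rank is 5; the repeated bias makes two columns identical and removes one from the rank.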
To determine reasonable unique initial values for $\{b_n\}$, consider first the representation of a function of a single variable as a linear combination of sigmoid functions (equations 2.1, 2.2). Each function $F_k(x)$ can be written

$$F_k(x) = \frac{1}{1 + \exp\left[-w_k (x + x_{0k})\right]}$$

that is, a sigmoid whose slope at $x = -x_{0k}$ is $w_k/4$, and whose offset is $x_{0k} = b_k/w_k$. The offset is the position of the center of the sigmoid curve with respect to x = 0. Thus, if $b_k = 0$ the sigmoid is centered at x = 0. For $b_k > 0$ the curve is centered in x < 0, and for $b_k < 0$ it is centered in x > 0. To represent a function that is defined on some interval of x-space, it is reasonable to choose a set of sigmoids whose centers are initially distributed on and around that interval. We found that it is numerically efficient to scale the x-input range on each channel to lie on [-1, +1], and to distribute the $\{b_n\}$ evenly over the interval [-5, +5]. Functions of many variables (I input channels) may also be viewed as linear combinations
of a set of multidimensional sigmoids with unique offsets in the input space, and the same initial distribution of b values may be chosen. To optimize $\{w_{ni}, b_n\}$, we used a simple second-order minimization scheme based on the Newton-Raphson method. Keeping all other parameters fixed, we make small positive and negative changes in one parameter and calculate the step required to move to the quadratic minimum obtained by expanding S as a truncated Taylor series. The presence of maxima, points of inflection, and oscillations must be tested for and avoided. Each of the $\{w_{ni}, b_n\}$ is varied in this way to generate a set of indicated changes $\{\Delta w_{ni}, \Delta b_n\}$, which are then applied simultaneously to define a new parameter set. If the new set does not lower S, the changes are all reduced until S is lowered. If S is changing very slowly, we change only one parameter, in the direction of maximum change in S. This simple minimization procedure has sufficed to demonstrate that our method for generating the optimum coefficient vector C is valid and leads to accurate, optimized networks.
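The single-parameter step described above can be sketched as follows. This shows only the quadratic step for one parameter, using central differences; the simultaneous update, the step reduction, and the slow-convergence rule are not shown. The function name `quadratic_step`, the probe size `h`, and the choice to return a zero step when the curvature is not positive are assumptions of this sketch, not details taken from the original program.

```python
def quadratic_step(S, p, h=1e-3):
    """Step toward the minimum of the parabola through S(p-h), S(p), S(p+h).

    Returns 0.0 when the local curvature is not positive (a maximum or a
    point of inflection), since the parabola then has no minimum to move to.
    """
    s_minus, s_zero, s_plus = S(p - h), S(p), S(p + h)
    slope = (s_plus - s_minus) / (2.0 * h)                    # central first difference
    curvature = (s_plus - 2.0 * s_zero + s_minus) / (h * h)   # central second difference
    if curvature <= 0.0:
        return 0.0
    return -slope / curvature

# Usage on a hypothetical one-parameter error surface with its minimum at p = 3.
def example_surface(p):
    return (p - 3.0) ** 2 + 1.0

step = quadratic_step(example_surface, 0.0, h=0.1)
print("indicated change:", step)

# A concave surface has no quadratic minimum, so no step is indicated.
concave_step = quadratic_step(lambda p: -p * p, 1.0, h=0.1)
print("indicated change on a concave surface:", concave_step)
```

For an exactly quadratic surface the indicated change lands on the minimum in one step, which is why problems with a nearly quadratic S can converge in very few iterations.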
Many functions can be accurately represented after relatively few iterations of the Newton-Raphson optimization procedure. Indeed, problems are often solved in one calculation cycle if sufficient nodes are used. A double precision FORTRAN program, running on a DECstation 3100 computer, performed the optimization of the network parameters. The symmetric linear system (equation 2.16) was solved by the IMSL routine DLSASF, which uses gaussian elimination with iterative refinement of the solutions. In all of the applications tested here, the matrix technique optimized networks much more rapidly than has been reported by others using the backpropagation method.

3.1 Polynomial Fits. We found that a polynomial of degree D can be accurately represented by a network containing D nodes in one hidden layer. However, the accuracy of the fit decreases as the polynomial degree increases. We fitted up to 8th degree polynomials. Figures 1 and 2 show fits through data points for the 7th and 8th degree polynomials. Figures 3 and 4 show the network output for these fits extrapolated far from the fit region. The extrapolated curves for all the even and odd polynomials have the same forms as those in Figures 3 and 4, respectively, but they are not necessarily symmetric. From the magnitudes of the asymptotic values of the network outputs, it is clear that the details of the fits are small ripples on curves of large magnitude. As the polynomial degree increases, the asymptotic magnitude of the network output increases and the detailed structure of
the functions eventually gets lost due to limited machine precision. Thus, while 2 nodes can give better than 6-figure accuracy for a quadratic, 8 nodes give only about 4-figure accuracy for an 8th degree polynomial.

Figure 1: Fit to a seventh degree polynomial using 7 nodes.

Figure 2: Fit to an eighth degree polynomial using 8 nodes.
Figure 3: Extrapolation of the fit for a seventh degree polynomial.

Figure 4: Extrapolation of the fit for an eighth degree polynomial.
3.2 Continuous Functions of Several Variables. We have fit many functions of several variables, that is, using many input channels. The accuracy of the representation again depends on the degree of nonlinearity. For example, any linear combination of I input channels can be achieved with virtually zero error using only one node, whereas our solution of the
robot-arm coordinate transformations described by Josin (1988) required 12 nodes to give a maximum error of 0.02%.

3.3 Parity Problems. These functions have multichannel binary input patterns, for which the required output is 0 if an even number of input values are one, and 1 otherwise. The parity problem of dimension D has D input channels. The number of input patterns is $2^D$. This has been described as a difficult problem for a neural network to solve (Minsky and Papert 1969; Rumelhart et al. 1986). The parity problem of dimension 2 is often called the exclusive-or (XOR) problem. Neural network solutions of parity problems up to a dimension of 10 have been reported (Rumelhart et al. 1986; Frean 1990). Using the matrix optimization technique, we solved parity problems up to dimension 12 (4096 input patterns). In every case, an exact solution is obtained in one calculation cycle, with the initial weights all chosen to be equal, and the biases evenly distributed on some (any) interval, as described in Section 2.3. With a minimum of D nodes, any parity problem of dimension D is solved, in principle, in one calculation cycle by this method.
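The one-cycle parity solution can be illustrated for small dimensions. The sketch below assumes equal input weights of 1, inputs rescaled from {0, 1} to {-1, +1}, and biases spread evenly over [-5, +5]; it solves for the output coefficients with NumPy's least-squares routine instead of gaussian elimination on the normal equations. With equal weights, each hidden node sees only the number of active inputs, so D nodes plus the constant term give D + 1 basis functions for the D + 1 possible counts.

```python
import itertools
import numpy as np

def parity_fit_error(D):
    """Fit D-bit parity with D hidden sigmoid nodes in one least-squares solve.

    Returns the largest absolute difference between network output and target.
    """
    bits = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
    target = bits.sum(axis=1) % 2             # 0 for an even number of ones, else 1
    X = 2.0 * bits - 1.0                      # scale each channel to [-1, +1]

    W = np.ones((D, D))                       # all input weights equal (assumed value 1)
    b = np.linspace(-5.0, 5.0, D)             # biases evenly spread over [-5, +5]

    F = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    F = np.hstack([np.ones((len(X), 1)), F])  # constant column supplies c_0

    C, *_ = np.linalg.lstsq(F, target, rcond=None)
    return np.max(np.abs(F @ C - target))

errors = {D: parity_fit_error(D) for D in (2, 3, 4)}
for D, err in errors.items():
    print(f"dimension {D}: maximum output error {err:.2e}")
```

Larger dimensions make the linear system progressively worse conditioned, so the attainable precision in this sketch should not be taken as representative of the dimension-12 results reported above.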
4 Conclusions
The matrix technique described here finds the optimum weights and biases on the output side of a network with a single hidden layer, given any set of weights and biases for the input side of the hidden layer. A simple minimization scheme for the parameters on the input side of the hidden layer leads to a very rapid network optimization in all the test cases. Our results imply that for a network with a single hidden layer, the maximum accuracy of a functional representation is reduced as the nonlinearity of the function increases. It is possible that better representations of highly nonlinear functions may be obtained by using networks that have more than one hidden layer. For some applications, nodes that use a gaussian transfer function may also be more appropriate. These possibilities will be investigated in future work.
References

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1-38.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.
Golub, G. H., and Pereyra, V. 1973. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J. Numer. Anal. 10, 413-432.
Hartman, E. J., Keeler, J. D., and Kowalski, J. M. 1990. Layered neural networks with gaussian units as universal approximations. Neural Comp. 2, 210-215.
Josin, G. 1988. Neural-space generalization of a topological transformation. Biol. Cybernet. 59, 283-290.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge, MA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., pp. 310-362. MIT Press, Cambridge, MA.
Wasserman, P. D. 1989. Neural Computing Theory and Practice, pp. 43-59. Van Nostrand Reinhold, New York.
Received 14 November 1990; accepted 10 May 1991.
VIEW
Communicated by Richard Durbin
Neural Network Classifiers Estimate Bayesian a posteriori Probabilities

Michael D. Richard
Richard P. Lippmann
Room B-349, Lincoln Laboratory, MIT, Lexington, MA 02173-9108 USA

Many neural network classifiers provide outputs which estimate Bayesian a posteriori probabilities. When the estimation is accurate, network outputs can be treated as probabilities and sum to one. Simple proofs show that Bayesian probabilities are estimated when desired network outputs are 1 of M (one output unity, all others zero) and a squared-error or cross-entropy cost function is used. Results of Monte Carlo simulations performed using multilayer perceptron (MLP) networks trained with backpropagation, radial basis function (RBF) networks, and high-order polynomial networks graphically demonstrate that network outputs provide good estimates of Bayesian probabilities. Estimation accuracy depends on network complexity, the amount of training data, and the degree to which training data reflect true likelihood distributions and a priori class probabilities. Interpretation of network outputs as Bayesian probabilities allows outputs from multiple networks to be combined for higher level decision making, simplifies creation of rejection thresholds, makes it possible to compensate for differences between pattern class probabilities in training and test data, allows outputs to be used to minimize alternative risk functions, and suggests alternative measures of network performance.

1 Introduction

A strong, poorly understood, relationship exists between many neural networks and minimum-error Bayesian pattern classifiers. The outputs of many networks are not likelihoods or binary logical values near zero or one. Instead, they are estimates of Bayesian a posteriori probabilities, hereafter referred to as Bayesian probabilities.
Neural Computation 3, 461-483 (1991) © 1991 Massachusetts Institute of Technology
For an M class problem, Bayesian probabilities are estimated in a minimum mean-squared error sense when the network has one output for each pattern class, desired outputs are 1 of M (one output unity corresponding to the correct class, all others zero), and a squared-error cost function is used. These conditions often hold for networks with sigmoidal nonlinearities trained using backpropagation, for radial basis function networks, and for networks with high-order polynomials trained using a squared-error cost
function. When Bayesian probabilities are estimated accurately, classification error rate will be minimized, outputs sum to one, and outputs can be treated as probabilities. In addition, interpretation of network outputs as Bayesian probabilities makes it possible to compensate for differences in pattern class probabilities between test and training data, to combine outputs of multiple classifiers for higher level decision making, to use alternative risk functions different from minimum-error risk, to implement conventional optimal rules for pattern rejection, and to compute alternative measures of network performance. A review of papers and recent discussions with other researchers suggest that the relationship between neural networks and optimal Bayesian classifiers is poorly understood. For example, network outputs are frequently treated as likelihoods or as binary values that should always be near zero or one. In addition, classification decisions are often considered incorrect unless the "correct" network output is greater than 0.5. Although the desired output values used in a squared-error cost function are zero and one, actual output values, which are estimates of Bayesian probabilities, are not binary valued and may be near zero or one for only a small range of inputs. Common rules of thumb that output values different from zero and one are indications that more training is required or that no classification decision should be made are not necessarily true. Such values may actually indicate that classes have overlapping distributions. In addition, the common practice of selecting patterns during training that are frequently confused may lead to poor estimates of Bayesian probabilities and may not necessarily reduce classification error rate. Bayesian probabilities are estimated accurately only when training data reflects the actual distribution of input features within each class. 
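The point that a correct decision need not have an output above 0.5 can be illustrated with Bayes rule directly. The following is a minimal sketch under stated assumptions, not an experiment from this paper: it assumes three equally probable classes with unit-variance gaussian likelihoods and hypothetical means of -1, 0, and 1, and the function name `bayesian_probabilities` is illustrative only.

```python
import numpy as np

def bayesian_probabilities(x, means, priors, variance=1.0):
    """Bayesian probabilities p(C_i | x) for univariate gaussian likelihoods."""
    means = np.asarray(means, dtype=float)
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.exp(-(x - means) ** 2 / (2.0 * variance))
    likelihoods /= np.sqrt(2.0 * np.pi * variance)
    joint = likelihoods * priors      # p(x | C_i) p(C_i)
    return joint / joint.sum()        # divide by p(x)

# ASSUMPTION: hypothetical, heavily overlapping three-class problem.
means = (-1.0, 0.0, 1.0)
priors = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)

probs = bayesian_probabilities(0.0, means, priors)
best = int(np.argmax(probs))
print("Bayesian probabilities at x = 0:", np.round(probs, 3))
print("minimum-error decision: class", best + 1)
```

In this sketch the largest Bayesian probability at x = 0 belongs to the middle class and is below 0.5, yet choosing that class is still the minimum-error decision, because the classes overlap.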
This paper first summarizes recent theoretical analyses and presents short proofs that network outputs estimate Bayesian probabilities when squared-error or cross-entropy cost functions are used. Results of simulation studies are then presented which demonstrate that network outputs closely estimate Bayesian probabilities. These simulations use squared-error, cross-entropy, and normalized-likelihood cost functions and three different types of neural network classifiers. Simulation results are also presented which suggest that different cost functions yield comparable estimation accuracy, and that illustrate how estimation accuracy degrades with inadequate network size or insufficient training data. Finally, important practical implications of interpreting network outputs as Bayesian probabilities are discussed.
2 Theory
After describing the general pattern classification problem and defining Bayesian probabilities, this section provides two short proofs which demonstrate that when desired outputs are binary valued, squared-error
and cross-entropy cost functions are minimized when network outputs are Bayesian probabilities. A third cost function called normalized-likelihood is also briefly reviewed.
2.1 Pattern Classification and Bayesian Probabilities. The task in many pattern classification problems is to assign an input vector, $X$, with elements $\{x_i : i = 1, \ldots, D\}$ to one of $M$ classes $\{C_i : i = 1, \ldots, M\}$. Classes might represent different phonemes for speech recognition or different letters for hand-printed character recognition. Input values might be continuous or binary. Minimum-error Bayesian classifiers perform this task by calculating the Bayesian probability, $p(C_i \mid X)$, for each class, and assigning the input to the class with the highest Bayesian probability. The Bayesian probability $p(C_i \mid X)$ represents the conditional probability of class $C_i$ given the input $X$. Use of Bayes rule allows it to be expressed as follows:
$$p(C_i \mid X) = \frac{p(X \mid C_i)\, p(C_i)}{p(X)} \tag{2.1}$$
In this equation, $p(X \mid C_i)$ is the likelihood or conditional probability of producing the input if the class is $C_i$, $p(C_i)$ is the a priori probability of class $C_i$, and $p(X)$ is the unconditional probability of the input. Conventional Bayesian classifiers estimate the Bayesian probability for each class by separately estimating the factors in the above equation. Since $p(X)$ is common to all classes, it is usually omitted and instead $p(X \mid C_i)\, p(C_i)$ is used for classification. In addition, conventional classifiers estimate the likelihoods, $p(X \mid C_i)$, by assuming they can be well-modeled by specific parametric distributions, such as gaussian or gaussian mixture distributions. Training involves estimating the parameters of the assumed likelihood distributions and estimating the a priori class probabilities from training data. In contrast, neural networks do not estimate Bayesian probabilities in this indirect way. Instead, when the desired outputs are 1 of M and an appropriate cost function is used, Bayesian probabilities are estimated directly. The implication and practical benefit for pattern classification is that network outputs can be used as Bayesian probabilities for simple classification tasks and can be treated as probabilities when making higher level decisions. However, as illustrated in Section 3, network outputs provide good Bayesian probability estimates only if sufficient training data are available, if the network is complex enough, and if classes are sampled with the correct a priori class probabilities in the training data.
2.2 Squared-Error Cost Function. The squared-error cost function has been used more frequently than any alternative. Its use yields good performance with large data bases on real-world problems; and it can be used for prediction or input/output mapping problems as well as
for classification problems. In addition, its use leads to a simple, noniterative, matrix-inversion based algorithm to determine the network parameters for single-layer networks with linear output nodes. The relationship between minimizing a squared-error cost function and estimating Bayesian probabilities was established for the two-class case as early as 1973 by Duda and Hart (1973). Many recent papers have provided new derivations for the two-class and multiclass case (Bourlard and Wellekens 1989; Gish 1990; Hampshire and Perlmutter 1990; Ruck et al. 1990; Shoemaker 1991; Wan 1990; White 1989). The following simple derivation proves this relationship for the general multiclass case. As above, consider the problem of assigning an input vector $X$ with elements $\{x_i : i = 1, \ldots, D\}$ to one of $M$ classes $\{C_i : i = 1, \ldots, M\}$. Let $C_j$ denote the corresponding class of $X$, $\{y_i(X) : i = 1, \ldots, M\}$ the outputs of the network, and $\{d_i : i = 1, \ldots, M\}$ the desired outputs for all output nodes. Note that the actual network output is a function of the input $X$, whereas the desired output is a function of the class $C_j$ to which $X$ belongs. For a 1 of M classification problem, $d_i = 1$ if $i = j$ ($X$ belongs to $C_i$) and 0 otherwise. With a squared-error cost function, the network parameters are chosen to minimize the following:
$$\Delta = E\left\{ \sum_{i=1}^{M} [y_i(X) - d_i]^2 \right\} \tag{2.2}$$
where $E\{\cdot\}$ is the expectation operator. Denoting the joint probability of the input and the $j$th class by $p(X, C_j)$ and using the definition of expectation allows 2.2 to be expressed as follows:
$$\Delta = \int \sum_{j=1}^{M} \left\{ \sum_{i=1}^{M} [y_i(X) - d_i]^2 \right\} p(X, C_j)\, dX \tag{2.3}$$
The above equation represents a sum of squared, weighted errors, with $M$ errors appearing for each input-class pair. For a particular pair of input $X$ and class $C_j$, each error, $y_i(X) - d_i$, is simply the difference of the actual network output $y_i(X)$ and the corresponding desired output $d_i$. The $M$ errors are squared, summed, and weighted by the joint probability $p(X, C_j)$ of the particular input-class pair. Substituting $p(X, C_j) = p(C_j \mid X)\, p(X)$ in 2.3 yields
$$\Delta = \int \sum_{j=1}^{M} \left\{ \sum_{i=1}^{M} [y_i(X) - d_i]^2 \right\} p(C_j \mid X)\, p(X)\, dX \tag{2.4}$$
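or equivalently
$$\Delta = \int \left[ \sum_{i=1}^{M} \sum_{j=1}^{M} [y_i(X) - d_i]^2\, p(C_j \mid X) \right] \left[ \sum_{k=1}^{M} p(X, C_k) \right] dX \tag{2.5}$$
$$\Delta = E\left\{ \sum_{i=1}^{M} \sum_{j=1}^{M} [y_i(X) - d_i]^2\, p(C_j \mid X) \right\} \tag{2.6}$$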
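The advantage of expressing $\Delta$ as in 2.6 is the simplification it facilitates. Expanding the bracketed expression in 2.6 yields
$$\Delta = E\left\{ \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ y_i^2(X)\, p(C_j \mid X) - 2 y_i(X)\, d_i\, p(C_j \mid X) + d_i^2\, p(C_j \mid X) \right] \right\} \tag{2.7}$$
Exploiting the fact that $y_i^2(X)$ is a function only of $X$ and $\sum_{j=1}^{M} p(C_j \mid X) = 1$ allows 2.7 to be expressed
$$\Delta = E\left\{ \sum_{i=1}^{M} \left[ y_i^2(X) - 2 y_i(X) \sum_{j=1}^{M} d_i\, p(C_j \mid X) + \sum_{j=1}^{M} d_i^2\, p(C_j \mid X) \right] \right\} \tag{2.8}$$
$$\Delta = E\left\{ \sum_{i=1}^{M} \left[ y_i^2(X) - 2 y_i(X)\, E\{d_i \mid X\} + E\{d_i^2 \mid X\} \right] \right\} \tag{2.9}$$
where $E\{d_i \mid X\}$ and $E\{d_i^2 \mid X\}$ are the conditional expectations of $d_i$ and $d_i^2$, respectively. Adding and subtracting $\sum_{i=1}^{M} E^2\{d_i \mid X\}$ in 2.9 allows it to be cast in a form commonly used in statistics that provides much insight as to the minimizing values for $y_i(X)$:
$$\Delta = E\left\{ \sum_{i=1}^{M} \left[ y_i^2(X) - 2 y_i(X)\, E\{d_i \mid X\} + E^2\{d_i \mid X\} + E\{d_i^2 \mid X\} - E^2\{d_i \mid X\} \right] \right\} \tag{2.10}$$
$$\Delta = E\left\{ \sum_{i=1}^{M} \left[ y_i(X) - E\{d_i \mid X\} \right]^2 \right\} + E\left\{ \sum_{i=1}^{M} \operatorname{var}\{d_i \mid X\} \right\} \tag{2.11}$$
where $\operatorname{var}\{d_i \mid X\}$ is the conditional variance of $d_i$, and the identity $\operatorname{var}\{d_i \mid X\} = E\{d_i^2 \mid X\} - E^2\{d_i \mid X\}$ has been used. Since the second expectation term in 2.11 is independent of the network outputs, minimization of $\Delta$ or equivalently the squared-error cost function is achieved by choosing network parameters to minimize the first expectation term. But the first expectation term is simply the mean-squared error between the network outputs $y_i(X)$ and the conditional expectation of the desired outputs. Thus, when network parameters are chosen to minimize a squared-error cost function, outputs estimate the conditional expectations of the desired outputs so as to minimize the mean-squared estimation error. For a 1 of M problem, $d_i$ equals one if the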
input $X$ belongs to class $C_i$ and zero otherwise. Therefore, the conditional expectations are the following:
$$E\{d_i \mid X\} = \sum_{j=1}^{M} d_i\, p(C_j \mid X) \tag{2.12}$$
$$E\{d_i \mid X\} = p(C_i \mid X) \tag{2.13}$$
which are the Bayesian probabilities. Therefore, for a 1 of M problem, when network parameters are chosen to minimize a squared-error cost function, the outputs estimate the Bayesian probabilities so as to minimize the mean-squared estimation error. In the more general case when desired outputs are not necessarily 1 of M but are binary, the outputs still have a probabilistic interpretation. Specifically, the conditional expectations of the desired outputs now become
$$E\{d_i \mid X\} = \sum_{j=1}^{M} d_i\, p(C_j \mid X) \tag{2.14}$$
$$E\{d_i \mid X\} = p[(d_i = 1) \mid X] \tag{2.15}$$
where $p[(d_i = 1) \mid X]$ is the probability that the desired output is one given the input $X$. Therefore when the desired outputs are binary but not necessarily 1 of M and network parameters are chosen to minimize a squared-error cost function, the outputs estimate the conditional probabilities that the desired outputs are one given the input.
2.3 Cross-Entropy Cost Function. Many cost functions besides squared-error have been proposed that can be used to estimate Bayesian probabilities (Hampshire and Perlmutter 1990). These have been derived using cross-entropy (Baum and Wilczek 1988; Hinton 1990; Solla et al. 1988), Kullback-Leibler information (El-Jaroudi and Makhoul 1990; Gish 1990), maximum mutual information (Bridle 1990; Gish 1990), and Minkowski-r (Hanson and Burr 1988) criteria. The most popular alternative cost function measures the cross-entropy between actual outputs and desired outputs, which are treated as probabilities (Baum and Wilczek 1988; Hinton 1990; Hopfield 1987; Solla et al. 1988). It is normally motivated by the assumption that desired outputs are independent, binary, random variables, and that the actual, continuous, network outputs represent the conditional probabilities that these binary, random variables are one (Hinton 1990). It can also be interpreted as minimizing the Kullback-Leibler probability distance measure, maximizing mutual information, or as maximum likelihood parameter estimation (Baum and Wilczek 1988; Bridle 1990; Gish 1990; Hinton 1990). When desired outputs are zero and one, the cross-entropy cost function is the following:
$$\Delta = -E\left\{ \sum_{i=1}^{M} \left[ d_i \log y_i(X) + (1 - d_i) \log(1 - y_i(X)) \right] \right\} \tag{2.16}$$
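The statement that both cost functions are minimized when an output equals the conditional expectation of its desired output can be checked numerically for a single input value. The following is a minimal sketch, not part of the original simulations. It assumes one output node and a fixed input for which the desired output is one with probability `p` (illustrative values only), evaluates the expected squared-error and cross-entropy costs over a grid of candidate output values, and locates the minimizing value. All function names are hypothetical.

```python
import numpy as np

def squared_error_cost(y, p):
    """Expected squared error E{(y - d)^2 | X} when d = 1 with probability p."""
    return p * (y - 1.0) ** 2 + (1.0 - p) * y ** 2

def cross_entropy_cost(y, p):
    """Expected cross-entropy -E{d log y + (1 - d) log(1 - y) | X}."""
    return -(p * np.log(y) + (1.0 - p) * np.log(1.0 - y))

def grid_minimizer(cost, p, grid):
    """Return the candidate output value with the smallest expected cost."""
    return grid[np.argmin(cost(grid, p))]

# Candidate outputs strictly inside (0, 1) so that the logarithms are finite.
grid = np.linspace(0.001, 0.999, 999)

for p in (0.1, 0.5, 0.8):  # illustrative values of p[(d = 1) | X]
    y_se = grid_minimizer(squared_error_cost, p, grid)
    y_ce = grid_minimizer(cross_entropy_cost, p, grid)
    print(f"p = {p:.1f}: squared-error argmin = {y_se:.3f}, "
          f"cross-entropy argmin = {y_ce:.3f}")
```

Up to the grid spacing, both minimizers should coincide with `p`, even though the two cost functions penalize deviations differently near zero and one.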
The cross-entropy cost function has a different theoretical justification than the squared-error cost function and weights errors more heavily when actual outputs are near zero and one. However, use of both cost functions has yielded similar error rates in experiments with real-world data, including a phoneme classification experiment that used a large speech data base (Hampshire and Waibel 1990). Experiments on artificial problems have, however, demonstrated reduced training times with the cross-entropy cost function (Holt and Semnani 1990; Solla et al. 1988). In addition, experiments on an artificial medical diagnosis problem have demonstrated improved performance with the cross-entropy cost function when desired network outputs were known Bayesian probabilities instead of binary values (Hopfield 1987). A recent paper by Hampshire and Perlmutter (1990) proves that when desired outputs are binary, a cross-entropy cost function is minimized when network outputs estimate Bayesian probabilities. The following simple proof assumes desired network outputs are binary and is similar to the proof presented above for the squared-error cost function. This proof begins after the cross-entropy cost function in equation 2.16 has been expanded and simplified into 2.17 as was done in equations 2.3 to 2.9 for the squared-error cost function. Equation 2.17 is then expanded and simplified as was done in equations 2.10 and 2.11 for the squared-error cost function.
$$\Delta = -E\left\{ \sum_{i=1}^{M} \left[ E\{d_i \mid X\} \log y_i(X) + (1 - E\{d_i \mid X\}) \log(1 - y_i(X)) \right] \right\} \tag{2.17}$$
$$\Delta = -E\left\{ \sum_{i=1}^{M} \left[ E\{d_i \mid X\} \log y_i(X) + (1 - E\{d_i \mid X\}) \log(1 - y_i(X)) - E\{d_i \mid X\} \log E\{d_i \mid X\} - (1 - E\{d_i \mid X\}) \log(1 - E\{d_i \mid X\}) + E\{d_i \mid X\} \log E\{d_i \mid X\} + (1 - E\{d_i \mid X\}) \log(1 - E\{d_i \mid X\}) \right] \right\} \tag{2.18}$$
$$\Delta = -E\left\{ \sum_{i=1}^{M} \left[ E\{d_i \mid X\} \log \frac{y_i(X)}{E\{d_i \mid X\}} + (1 - E\{d_i \mid X\}) \log \frac{1 - y_i(X)}{1 - E\{d_i \mid X\}} \right] \right\} - E\left\{ \sum_{i=1}^{M} \left[ E\{d_i \mid X\} \log E\{d_i \mid X\} + (1 - E\{d_i \mid X\}) \log(1 - E\{d_i \mid X\}) \right] \right\} \tag{2.19}$$
Analogous to 2.11, the second major expectation term in 2.19 is independent of the outputs $y_i(X)$. Taking first and second derivatives shows that
the first expectation term in 2.19 is minimized when $y_i(X) = E\{d_i \mid X\}$ for $i = 1, \ldots, M$. Therefore, when network parameters are chosen to minimize a cross-entropy cost function, the outputs estimate the conditional expectations of the desired outputs. As noted earlier, when the desired outputs are binary, the conditional expectations are the conditional probabilities of the desired outputs being one; and for the special case of 1 of M problems, the conditional expectations are the Bayesian probabilities.
2.4 Normalized-Likelihood Cost Function. A popular approach to parameter estimation with desirable asymptotic properties finds network parameters that maximize the likelihood of the training data (Duda and Hart 1973). The normalized-likelihood cost function described in this section is explicitly motivated by this approach. [With certain assumptions, the squared-error and cross-entropy cost functions have implicit maximum likelihood interpretations as well (Baum and Wilczek 1988; Bridle 1990; Gish 1990; Hinton 1990).] If training patterns are independent, then the log likelihood of $N$ training patterns is
$$\log \prod_{p=1}^{N} p[C_j(p), X_p] = \sum_{p=1}^{N} \log p[C_j(p), X_p] \tag{2.20}$$
$$= \sum_{p=1}^{N} \log p[C_j(p) \mid X_p] + \sum_{p=1}^{N} \log p(X_p) \tag{2.21}$$
In this equation, $\{X_p : p = 1, \ldots, N\}$ represents the training data (in this case $N$ samples), $C_j(p)$ is the class of the $p$th sample, $p[C_j(p), X_p]$ is the joint probability of input pattern $X_p$ and class $C_j(p)$ occurring together, and $p(X_p)$ is the unconditional probability of $X_p$. Since $p(X_p)$ is independent of the network parameters, maximizing 2.21 is equivalent to maximizing
$$\sum_{p=1}^{N} \log p[C_j(p) \mid X_p] \tag{2.22}$$
Each term in the above sum is the logarithm of the Bayesian probability of the class $C_j(p)$ corresponding to the pattern $X_p$. If network outputs are assumed to be accurate estimates of these Bayesian probabilities, then maximizing the likelihood of the training data corresponds to minimizing the following cost function:
$$\Delta = -\sum_{p=1}^{N} \log y_j(X_p) \tag{2.23}$$
For the $p$th training pattern, this cost function includes only the network output $y_j$ corresponding to the class $C_j(p)$ of that training pattern. Also, its use requires that network outputs can be interpreted as probabilities (i.e., outputs are nonnegative and sum to one). This probabilistic
interpretation can be guaranteed by normalizing network outputs as suggested in Bridle (1990), El-Jaroudi and Makhoul (1990), and Gish (1990). With the softmax normalization approach described in Bridle (1990), the usual, sigmoidal functions in the output layer of the network:
$$y_i = \frac{1}{1 + \exp(-\mathrm{net}_i)} \tag{2.24}$$
where $\mathrm{net}_i$ is the weighted sum of inputs to output node $i$, are replaced by the following normalizing functions:
$$y_i = \frac{\exp(\mathrm{net}_i)}{\sum_{j=1}^{M} \exp(\mathrm{net}_j)} \tag{2.25}$$
where $\mathrm{net}_j$ is the weighted sum of inputs to output node $j$. For a typical multilayer perceptron network, with $H$ inputs $\{x_i : i = 1, \ldots, H\}$ to the output layer (each input corresponding to the output of a node in the preceding layer of the network), $\mathrm{net}_j$ has the following form:
$$\mathrm{net}_j = \sum_{i=1}^{H} w_{ij}\, x_i \tag{2.26}$$
where $\{w_{ij} : i = 1, \ldots, H\}$ are the weights associated with output node $j$. The advantage of the softmax form of normalization is that with these functions, update equations used during backpropagation are almost identical to those used for the cross-entropy cost function. Although maximum likelihood estimation has desirable asymptotic properties, the normalized-likelihood approach has not led to large reductions in classification error rate with finite amounts of real-world training data. For example, little difference in error rates was found when squared-error and normalized-likelihood cost functions were compared on a vowel classification problem (Nowlan 1990).
3 Simulation Studies
Many neural network and conventional classifiers use squared-error cost functions (Lippmann 1989). Although the above proofs demonstrate that this cost function is minimized when network outputs estimate true Bayesian probabilities, estimation accuracy may be poor with limited training data, incorrect network size, and the nonoptimal heuristic search procedures typically used for training networks (White 1989). This section describes simulation studies that explore estimation accuracy with three different neural network classifiers. Results demonstrate that these classifiers provide outputs which accurately estimate known Bayesian probabilities, that network outputs sum to one even though they are not explicitly constrained during training, that estimation accuracy degrades when training data or the network size is reduced, and that the use of alternative cost functions has little effect on estimation accuracy.
3.1 Estimation Accuracy with Squared-Error Cost Function. The accuracy with which neural network classifier outputs estimate Bayesian probabilities was explored using a squared-error cost function and three neural networks: multilayer perceptron (MLP) networks trained with backpropagation, radial basis function (RBF) networks trained with a matrix pseudoinverse technique, and high-order polynomial networks designed with the Group Method of Data Handling (GMDH). All experiments used one continuous-valued input, and the actual Bayesian probabilities were known and used to generate training and test data. For the MLP network, various topologies, with both one and two hidden layers, were tested. Of the topologies tested, one with a single hidden layer of 24 nodes offered the best training-time/estimation-accuracy tradeoff. Unless indicated otherwise, the results shown in this section were obtained with this topology, a step size of 0.01, and momentum of 0.6. The RBF network contained one hidden layer with gaussian basis function nodes. Gaussian means and variances were obtained using a clustering technique based on the Estimate-Maximize (EM) algorithm as in Ng and Lippmann (1991a,b). Weights between basis function and output nodes were determined using a matrix pseudoinverse approach and outputs of basis function nodes were not normalized to sum to one. Twenty-four hidden nodes were used to facilitate comparison with the MLP network. High-order polynomial networks, hereafter referred to as GMDH networks, were created with the Group Method of Data Handling (Barron 1984). In contrast to MLP and RBF networks in which the network topologies were fixed and only the weights changed during training, both the topology and weights of the high-order polynomial networks changed during training as in Ng and Lippmann (1991a,b). Thus, the topologies of the high-order polynomial networks used for the two problems differed. 
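The matrix pseudoinverse step for an RBF network can be sketched as follows. This is a rough illustration under stated assumptions, not a reproduction of the reported experiments: it uses the two-class gaussian mixture problem of this section, taking the class 1 likelihood as an equal mixture of components with means -4 and 2, the class 2 likelihood as an equal mixture with means -2 and 4, variance 2 for every component, and equal a priori class probabilities. Evenly spaced centers and a common width are assumptions that stand in for the EM-based clustering, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: N(m, 2) is read as mean m and variance 2.
VARIANCE = 2.0
COMPONENT_MEANS = [(-4.0, 2.0), (-2.0, 4.0)]  # class 1, class 2

def gaussian_pdf(x, mean, variance):
    return np.exp(-(x - mean) ** 2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)

def likelihood(x, cls):
    """Equal-weight two-component mixture likelihood for class index cls."""
    m1, m2 = COMPONENT_MEANS[cls]
    return 0.5 * (gaussian_pdf(x, m1, VARIANCE) + gaussian_pdf(x, m2, VARIANCE))

def bayes_probability_class1(x):
    """Exact Bayesian probability of class 1 for equal a priori probabilities."""
    l1, l2 = likelihood(x, 0), likelihood(x, 1)
    return l1 / (l1 + l2)

def draw_samples(cls, n):
    """Draw n inputs from the mixture likelihood of the given class."""
    means = rng.choice(np.array(COMPONENT_MEANS[cls]), size=n)
    return means + np.sqrt(VARIANCE) * rng.standard_normal(n)

def basis_outputs(x, centers, width):
    """Unnormalized gaussian basis function outputs, one column per hidden node."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

# Training data: 4000 samples per class with 1 of M desired outputs.
n_per_class = 4000
x_train = np.concatenate([draw_samples(0, n_per_class), draw_samples(1, n_per_class)])
desired = np.zeros((2 * n_per_class, 2))
desired[:n_per_class, 0] = 1.0
desired[n_per_class:, 1] = 1.0

# ASSUMPTION: fixed, evenly spaced centers and a common width replace the
# EM-based clustering described in the text.
centers = np.linspace(-8.0, 8.0, 24)
width = 1.0

# Squared-error-optimal output weights from the matrix pseudoinverse.
weights = np.linalg.pinv(basis_outputs(x_train, centers, width)) @ desired

# Compare the class 1 output with the exact Bayesian probability.
x_test = np.linspace(-5.0, 5.0, 201)
estimate = basis_outputs(x_test, centers, width) @ weights
mean_abs_error = np.mean(np.abs(estimate[:, 0] - bayes_probability_class1(x_test)))
print(f"mean |y_1(X) - p(C_1 | X)| on [-5, 5]: {mean_abs_error:.3f}")
```

The comparison is restricted to inputs between -5 and 5, where training data are plentiful. Outside that range the basis function outputs decay toward zero, so the estimate is not expected to follow the Bayesian probability.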
Two classification problems used for experiments are depicted in Figures 1A and 2A. Figure 1A shows the likelihoods $p(X \mid C_i)$ for a three-class, univariate problem. All three likelihood distributions are unit variance, gaussian distributions and differ only in their means. Figure 2A depicts the likelihoods for a two-class problem. Likelihood distributions have two-component gaussian mixture distributions:
$$p(X \mid C_1) = \tfrac{1}{2}\left[ N(-4, 2) + N(2, 2) \right] \tag{3.1}$$
$$p(X \mid C_2) = \tfrac{1}{2}\left[ N(-2, 2) + N(4, 2) \right] \tag{3.2}$$
where $N(m, \sigma^2)$ is a univariate, gaussian distribution with mean $m$ and variance $\sigma^2$. In all examples, the a priori class probabilities are equal. Figures 1B and 2B show the Bayesian probabilities for the corresponding likelihood distributions of Figures 1A and 2A. Note that for each input value, the Bayesian probabilities sum to one. Also, since the a priori class probabilities are equal, for each input value the Bayesian probability is largest for that class $C_i$ for which the corresponding likelihood $p(X \mid C_i)$ is largest, and smallest for that class for which the corresponding likelihood is smallest.
Figure 1: (A) Likelihoods, $p(X \mid C_i)$, and (B) Bayesian probabilities, $p(C_i \mid X)$, for the three-class problem.
Figures 3A and B depict the actual Bayesian probabilities for Class 1 and the corresponding network outputs for the two problems. Four thousand training samples were used for each class. Twelve thousand training samples were thus used for the three-class problem and eight thousand samples were used for the two-class problem. For the MLP network, each training sample was used only once for training because of the good convergence that resulted without repeating samples. The network outputs estimated Bayesian probabilities best in regions where the input $X$ had high probability for at least one class and worst in regions where the input had low probability for all classes. This was a consequence of the squared-error cost function used for training the networks. Much
training data existed (on average) in regions of high probability and little training data existed in regions of low probability. Because of this, deviations of the network outputs from the Bayesian probabilities in regions of high probability strongly impacted the squared-error cost function. Similarly, deviations of the network outputs from the actual Bayesian probabilities in regions of low probability only weakly influenced the squared-error cost function.
Figure 2: (A) Likelihoods, $p(X \mid C_i)$, and (B) Bayesian probabilities, $p(C_i \mid X)$, for the two-class problem.
MLP network outputs provided the best estimates in regions of low probability. The GMDH network outputs behaved erratically in these regions, and the RBF network outputs quickly approached zero independent of the actual Bayesian probabilities. This behavior of the RBF network was due to the fact that the node centers, $\{m_i\}$, calculated using the EM algorithm lay in or near regions of high probability (equivalently regions where most of the training data lie), and the outputs of the RBF network approached zero for input samples far from the node centers. Addition of extra nodes with centers in regions of low probability or
nodes with constant outputs did not improve the accuracy of the estimation. In fact, simulations revealed that overall estimation accuracy often deteriorated with the addition of extra nodes.
Figure 3: Actual Bayesian probabilities and corresponding network outputs for (A) the three-class problem and (B) the two-class problem.
3.2 Network Outputs Sum to One. Network outputs should sum to one for each input value if outputs accurately estimate Bayesian probabilities. For the MLP network, the value of each output necessarily remains between zero and one because of the sigmoidal functions used. However, the criterion used for training did not require the outputs to sum to one. In contrast, there were no constraints on the outputs of the RBF and GMDH networks. Nevertheless, as shown in Figures 4A and B, the summed outputs of the MLP network are always close to one and the summed outputs of the RBF and GMDH networks are close to one in regions where the input has high probability for at least one class. As such, normalization techniques proposed to ensure that the outputs of an MLP network are true probabilities (Bridle 1990; El-Jaroudi and Makhoul
1990; Gish 1990) may be unnecessary. This is further supported by results of experiments performed in Bourlard and Morgan (1989), which demonstrated that the sum of the outputs of MLP networks is near one for large phoneme-classification speech-recognition problems.
Figure 4: Summed outputs of networks for (A) the three-class problem and (B) the two-class problem.
3.3 Effects of Reducing Training Data and Network Size. The derivation in the preceding section, in particular the expression for $\Delta$ in 2.2, implicitly assumed availability of infinite training data. In practice, training data is finite. Instead of minimizing 2.2, the following is minimized:
$$\hat{\Delta} = \frac{1}{N} \sum_{p=1}^{N} \sum_{i=1}^{M} [y_i(X_p) - d_i(p)]^2 \tag{3.3}$$
In this equation $\{X_p : p = 1, \ldots, N\}$ represents the training data (in this case $N$ samples), $d_i(p)$ represents the desired output for the $p$th sample, and $y_i(X_p)$ represents the actual network output for the $p$th sample. The
accuracy of the error criterion given by 3.3 in estimating 2.2 influences the accuracy of the network outputs in estimating Bayesian probabilities. The result is that the accuracy of the Bayesian estimation deteriorates with a decreasing training set size. Figures 5A and B illustrate this by showing actual Bayesian probabilities and the corresponding outputs of the three networks for Class 1 of the two-class problem when fewer than the original four thousand training samples per class are used. Figure 5A depicts the results of using three thousand training samples per class; and Figure 5B depicts the results of using one thousand training samples per class.
Figure 5: Accuracy of Bayesian probability estimation for Class 1 of the two-class problem with (A) 3000 samples/class and (B) 1000 samples/class.
The derivation in the preceding section also implicitly assumed that the network is sufficiently "complex" to enable the outputs to accurately estimate the Bayesian probability functions. If the network is not sufficiently complex, however, estimation accuracy degrades. Figures 6A and B confirm this for the MLP and RBF networks by depicting the actual Bayesian probabilities for Class 1 of the two-class problem and the
corresponding outputs of networks with 12 and 4 hidden nodes, respectively, down from the 24 hidden nodes in the networks used for the preceding examples.
Figure 6: Accuracy of Bayesian probability estimation for Class 1 of the two-class problem with (A) 12 hidden nodes and (B) 4 hidden nodes.
3.4 Comparison of Cost Functions. A final set of simulations was performed to compare the estimation accuracy provided by the three cost functions. Comparisons used MLP classifier networks trained with squared-error, cross-entropy, and normalized-likelihood cost functions. Figure 7 shows results for the three-class and two-class problems, using a network with a single hidden layer of 24 nodes and using four thousand training samples per class. Estimation accuracy is comparable with all three cost functions. This result agrees with previous experiments (Hampshire and Waibel 1990; Nowlan 1990) which demonstrated little difference in error rates when comparing squared-error to cross-entropy or normalized-likelihood cost functions on vowel and phoneme classification tasks. Although the cross-entropy cost function applies more weight to errors when network outputs are near zero and one, Figure 7 demonstrates that estimation accuracy is no better in those regions than that obtained using a squared-error cost function. However, for the two-class problem, use of the normalized-likelihood cost function offers a slight increase in estimation accuracy over use of both the cross-entropy and squared-error cost functions, in the region of low probability for the class shown.
Figure 7: Network outputs and Bayesian probabilities with squared-error, cross-entropy, and normalized-likelihood cost functions for (A) the three-class problem and (B) the two-class problem.
The high estimation accuracy achieved with each cost function required careful selection of step size and momentum. Experiments were conducted with step size values ranging from 0.001 to 0.5, and momentum values ranging from 0.05 to 1. Estimation accuracy for all cost functions was fairly sensitive to small variations in step size but less sensitive to variations in momentum. Results shown in Figure 7 were obtained with step-size/momentum values of 0.01/0.6 for the squared-error, 0.005/0.5 for the cross-entropy, and 0.005/0.1 for the normalized-likelihood cost function.
4 Practical Implications
The above results demonstrate that many common neural network classifiers have outputs which estimate Bayesian probabilities. An understanding of this relationship offers practical guidance for training and using these classifiers. Interpretation of network outputs as Bayesian probabilities allows outputs from multiple networks to be combined for higher level decision making, simplifies creation of rejection thresholds, makes it possible to compensate for differences between pattern class probabilities in training and test data, allows outputs to be used to minimize alternative risk functions, and suggests alternative measures of network performance.
4.1 Compensating for Varying a priori Class Probabilities. Networks with outputs that estimate Bayesian probabilities do not explicitly estimate the three terms on the right of equation 2.1 separately. However, the output $y_i(X)$ is implicitly the corresponding a priori class probability $p(C_i)$ times the class likelihood $p(X \mid C_i)$ divided by the unconditional input probability $p(X)$. It is possible to vary a priori class probabilities during classification without retraining, since these probabilities occur only as multiplicative terms in producing the network outputs. As a result, class probabilities can be adjusted during use of a classifier to compensate for training data with class probabilities that are not representative of actual use or test conditions. Correct class probabilities can be used during classification by first dividing network outputs by training-data class probabilities and then multiplying by the correct class probabilities. Training-data class probabilities can be estimated as the frequency of occurrence of patterns from different classes in the training data. Correct class probabilities required for testing can be obtained from an independent set of training data that needs to contain only class labels and not input patterns. Such data are often readily available.
For example, word frequency counts useful for spoken word recognition can be obtained from computer-based text data bases, and the frequency of occurrence of various diseases for medical diagnosis can be obtained from health statistics.
4.2 Minimum-Risk Classification. Minimum-risk classifiers differentially weight the various types of classification errors (e.g., false positives and false negatives on a medical screening test) and require class likelihoods and likelihood ratios to make classification decisions (Duda and Hart 1973; Fukunaga 1972). As indicated by equation 2.1, ratios of network outputs will be likelihood ratios if each output is first divided
by the corresponding training-data class probability, P(C_i). Minimum-risk classifiers can thus be designed using normalized or scaled ratios of network outputs.
4.3 Combining Outputs of Multiple Networks. Class likelihoods are often multiplied together during higher level decision making to combine information from multiple classifiers with independent inputs. Equation 2.1 demonstrates that network outputs can be divided by training-data class probabilities to produce scaled likelihoods, where the scaling factor is the reciprocal of the unconditional input probability. Corresponding scaled likelihoods (i.e., normalized outputs) from several classifiers can be multiplied together to determine overall class likelihoods if inputs to different classifiers are independent. Since all scaled likelihoods for any one classifier have the same scaling factor (the reciprocal of the unconditional input probability), classification decisions based on the product of scaled likelihoods will be the same as those based on actual likelihoods. For example, we have used this approach to obtain scaled word likelihoods by multiplying scaled likelihoods (normalized network outputs from RBF networks) from classifiers that model subword speech units (Singer and Lippmann 1992). In this application, the outputs of networks that model subword units are normalized by the training-data subword-unit class probabilities, and the resulting normalized outputs are multiplied together to determine scaled word likelihoods. Normalizing outputs by training-data subword-unit class probabilities in our experiments and in speech recognition experiments by others (Bourlard and Morgan 1990) has resulted in a large reduction in word error rate over unnormalized outputs. Similar techniques could be used for handwritten word recognition if individual classifiers recognize letters, and for other applications that integrate scores from many classifiers.
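The manipulations described in Sections 4.1 through 4.3 all start from the same step: dividing each output by its training-data class probability. The following is a minimal sketch of that step and its three uses, not the software used in the experiments. All function names are hypothetical, outputs are assumed to be strictly positive estimates of P(C_i | X), and the final renormalization in `adjust_priors` is an added assumption so that the adjusted values again sum to one (the text only specifies the divide and multiply steps).

```python
from math import prod


def scaled_likelihoods(outputs, train_priors):
    """Divide each output y_i(X) by the training-data prior P(C_i).

    By equation 2.1 the result is p(X | C_i) / p(X): a class likelihood
    scaled by the reciprocal of the unconditional input probability.
    """
    return [y / p for y, p in zip(outputs, train_priors)]


def adjust_priors(outputs, train_priors, new_priors):
    """Section 4.1: replace training-data priors with corrected priors.

    Assumption: the products are renormalized so they sum to one.
    """
    scaled = scaled_likelihoods(outputs, train_priors)
    unnormalized = [s * q for s, q in zip(scaled, new_priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def likelihood_ratio(outputs, train_priors, i, j):
    """Section 4.2: p(X | C_i) / p(X | C_j); the 1/p(X) factor cancels."""
    scaled = scaled_likelihoods(outputs, train_priors)
    return scaled[i] / scaled[j]


def combine(outputs_per_classifier, priors_per_classifier):
    """Section 4.3: multiply corresponding scaled likelihoods from
    classifiers whose inputs are assumed independent."""
    scaled = [
        scaled_likelihoods(y, p)
        for y, p in zip(outputs_per_classifier, priors_per_classifier)
    ]
    return [prod(per_class) for per_class in zip(*scaled)]


if __name__ == "__main__":
    y = [0.8, 0.2]  # hypothetical two-class network outputs
    print(adjust_priors(y, [0.5, 0.5], [0.1, 0.9]))
    print(likelihood_ratio(y, [0.5, 0.5], 0, 1))
    print(combine([y, [0.6, 0.4]], [[0.5, 0.5], [0.75, 0.25]]))
```

Because every scaled likelihood from one classifier shares the same factor 1/p(X), the class with the largest product in `combine` is the class that the true likelihoods would select, which is the property the text relies on.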
4.4 Setting Rejection Thresholds. In many classification problems, it is more costly to misclassify an input pattern than to reject an input. For example, in digit recognition of dollar amounts on checks, it may be less costly to have a human read and verify a check than to accept an incorrectly recognized dollar amount. In these situations statistical theory suggests rejecting an input if all Bayesian probabilities for that input are less than a threshold (Fukunaga 1972). Such a rejection rule can be directly implemented by using network outputs as Bayesian probabilities and rejecting an input if all outputs are below a threshold.
4.5 Alternative Performance Measures. The performance of a classifier that uses a squared-error cost function is normally assessed by measuring the classification error rate and the squared error between desired and actual network outputs. The above theoretical analysis, however, suggests two other useful figures of merit. First, if network
outputs estimate Bayesian probabilities accurately, then all network outputs should be nonnegative and sum to unity. This was demonstrated in the above simulations and in studies using speech data (Bourlard and Morgan 1989). Second, as noted in Wan (1990), the expected value of each network output y_i should be the a priori class probability P(C_i) for the corresponding class C_i. These expected values can be estimated by averaging the network outputs over all training data. The difference between averaged network outputs and estimated a priori class probabilities can be measured using a relative entropy distance or any other distance measure suitable for use with probabilities. For example, if Ave{y_i} represents network outputs averaged over all training data and Freq{C_i} represents the frequency of occurrence of class C_i in the training data (number of times class C_i occurred in the training data divided by total number of training patterns), an appropriate relative entropy distance measure is the following:
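A relative entropy between the two sets of quantities defined above can be written as shown here, with M the number of classes. Taking the training-data class frequencies as the reference distribution, and using the natural logarithm, are assumptions in this sketch rather than details fixed by the surrounding text.

```latex
d = \sum_{i=1}^{M} \mathrm{Freq}\{C_i\} \,
    \log \frac{\mathrm{Freq}\{C_i\}}{\mathrm{Ave}\{y_i\}}
```

Under this form d is zero when every averaged output equals the corresponding class frequency, and it is positive otherwise whenever both sets of values sum to one.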
Significant differences either between averaged network outputs and estimated a priori class probabilities or between the sum of network outputs and unity indicate inaccurate estimation of Bayesian probabilities.
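The rejection rule of Section 4.4 and the two figures of merit of Section 4.5 can be computed directly from stored network outputs. The sketch below is illustrative only: the function names are hypothetical, class labels are assumed to be integers 0..M-1, averaged outputs are assumed strictly positive, and the relative entropy is taken with the class frequencies as the reference distribution, which is an assumption about the direction of the measure.

```python
from math import log


def classify_or_reject(outputs, threshold):
    """Return the index of the largest output, or None when every
    output falls below the threshold (the input is rejected)."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return best if outputs[best] >= threshold else None


def mean_sum_deviation(output_vectors):
    """First figure of merit: average distance of the output sum from unity."""
    deviations = [abs(sum(y) - 1.0) for y in output_vectors]
    return sum(deviations) / len(deviations)


def average_outputs(output_vectors):
    """Ave{y_i}: each network output averaged over all training patterns."""
    n = len(output_vectors)
    return [sum(per_class) / n for per_class in zip(*output_vectors)]


def class_frequencies(labels, num_classes):
    """Freq{C_i}: fraction of training patterns that belong to each class."""
    return [labels.count(c) / len(labels) for c in range(num_classes)]


def relative_entropy(freq, ave):
    """Second figure of merit (assumed direction): sum of f * log(f / a).

    Classes with zero frequency contribute nothing to the sum.
    """
    return sum(f * log(f / a) for f, a in zip(freq, ave) if f > 0)


if __name__ == "__main__":
    # Hypothetical outputs for four training patterns of a two-class problem.
    outputs = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.55, 0.45]]
    labels = [0, 1, 1, 0]
    print([classify_or_reject(y, 0.7) for y in outputs])
    print(mean_sum_deviation(outputs))
    print(relative_entropy(class_frequencies(labels, 2), average_outputs(outputs)))
```

Values of `mean_sum_deviation` or `relative_entropy` that are far from zero would be the "significant differences" that the text treats as a sign of inaccurate probability estimates; what counts as significant is left to the application.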
5 Summary
This paper has shown that there is a strong relationship between the outputs of neural networks and Bayesian probabilities. Theoretical analyses demonstrated that a squared-error cost function is minimized for an M-class problem when network outputs are minimum mean-squared-error estimates of Bayesian probabilities. Similar theoretical results demonstrated that Bayesian probabilities are also estimated with other cost functions, such as cross-entropy. Simulations demonstrated that network outputs estimate Bayesian probabilities when using a squared-error cost function with radial basis function networks, high-order polynomial networks, or multilayer perceptron networks with sigmoidal nonlinearities. Estimation accuracy is high only if the network is sufficiently complex, adequate training data are available, and training data accurately reflect the actual likelihood distributions and the a priori class probabilities. Researchers should be aware of the connection between neural network outputs and Bayesian probabilities and should not treat network outputs as binary logical values or as likelihoods. They should also understand the practical implications of this relationship between network outputs and Bayesian probabilities, as discussed in the previous section of this paper.
Acknowledgments
The authors would like to thank Dave Nation for writing or revising much of the software used for the simulations, in particular the software for the multilayer perceptron network simulations, and Kenney Ng for writing the software for the GMDH and RBF simulations. The authors would also like to thank William Huang for writing an earlier version of the multilayer perceptron network simulation software and Linda Kukolich for additional software work that facilitated the simulations. This work was sponsored by the Defense Advanced Research Projects Agency. Michael D. Richard was supported by a fellowship from the Air Force Office of Scientific Research under the Laboratory Graduate Fellowship Program and in part by the Office of Naval Research under Grant N00014-89-J-1489 at M.I.T.
References
Barron, A. R. 1984. Adaptive learning networks: Development and application in the United States of algorithms related to GMDH. In Self-organizing Methods in Modeling, S. J. Farlow, ed., pp. 25-65. Marcel Dekker, New York.
Baum, E. B., and Wilczek, F. 1988. Supervised learning of probability distributions by neural networks. In Neural Information Processing Systems, D. Anderson, ed., pp. 52-61. American Institute of Physics, New York.
Bourlard, H., and Morgan, N. 1989. Merging multilayer perceptrons and hidden Markov models: Some experiments in continuous speech recognition. Tech. Rep. 89-033, International Computer Science Institute, Berkeley, CA, July.
Bourlard, H., and Wellekens, C. J. 1989. Links between Markov models and multilayer perceptrons. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 502-510. Morgan Kaufmann, San Mateo, CA.
Bourlard, H., and Morgan, N. 1990. A continuous speech recognition system embedding MLP into HMM. In Advances in Neural Information Processing 2, D. Touretzky, ed., pp. 186-193. Morgan Kaufmann, San Mateo, CA.
Bridle, J. S. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Neural Information Processing Systems 2, David S. Touretzky, ed., pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
El-Jaroudi, A., and Makhoul, J. 1990. A new error criterion for posterior probability estimation with neural nets. In Proceedings International Joint Conference on Neural Networks, pp. III:185-192. IEEE, San Diego, CA, June.
Fukunaga, K. 1972. Introduction to Statistical Pattern Recognition. Academic Press, New York.
Gish, H. 1990. A probabilistic approach to the understanding and training of neural network classifiers. In Proceedings of IEEE Conference on Acoustics Speech and Signal Processing, pp. 1361-1364, April.
Hanson, S. J., and Burr, D. J. 1988. Minkowski-r back-propagation: Learning in connectionist models with non-Euclidean error signals. In Neural Information Processing Systems, D. Anderson, ed., pp. 348-357. American Institute of Physics, New York.
Hinton, G. E. 1990. Connectionist learning procedures. In Machine Learning: Paradigms and Methods, J. G. Carbonell, ed., pp. 185-234. MIT Press, Cambridge, MA.
Holt, M. J., and Semnani, S. 1990. Convergence of back propagation in neural networks using a log-likelihood cost function. Electron. Lett. 26(23), 1964-1965.
Hopfield, J. J. 1987. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. U.S.A. 84, 8429-8433.
Hampshire, J. B. II, and Waibel, A. H. 1990. A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Transact. Neural Networks 1(2), 216-228.
Hampshire, J. B. II, and Perlmutter, B. A. 1990. Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function. In Proceedings of the 1990 Connectionist Models Summer School, D. Touretzky, J. Elman, T. Sejnowski, and G. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Lippmann, R. P. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 27(11), 47-54.
Ng, K., and Lippmann, R. P. 1991a. A comparative study of the practical characteristics of neural network and conventional pattern classifiers. Tech. Rep. 894, MIT Lincoln Laboratory, Lexington, MA, March.
Ng, K., and Lippmann, R. P. 1991b. A comparative study of the practical characteristics of neural network and conventional pattern classifiers. In Advances in Neural Information Processing 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1990. Competing experts: An experimental investigation of associative mixture models. Tech. Rep. CRG-TR-90-5, University of Toronto, September.
Ruck, D. W., Rogers, S. K., Kabrisky, M., Oxley, M. E., and Suter, B. W. 1990. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transact. Neural Networks 1(4), 296-298.
Shoemaker, P. A. 1991. A note on least-squares learning procedures and classification by neural network models. IEEE Transact. Neural Networks 2(1), 158-160.
Singer, E., and Lippmann, R. P. 1992. Improved hidden Markov model speech recognition using radial basis function networks. In Neural Information Processing Systems 4, John Moody, Steve Hanson, and Richard Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Solla, S. A., Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks. Complex Syst. 2, 625-640.
Wan, E. A. 1990. Neural network classification: A Bayesian interpretation. IEEE Transact. Neural Networks 1(4), 303-305.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Received 15 August 1990; accepted 14 June 1991.
Physical Review E 59:5, 6161-6174. [CrossRef] 51. J. Cid-Sueiro, J.I. Arribas, S. Urban-Munoz, A.R. Figueiras-Vidal. 1999. Cost functions to estimate a posteriori probabilities in multiclass problems. IEEE Transactions on Neural Networks 10:3, 645-656. [CrossRef] 52. J.T.-Y. Kwok. 1999. Moderating the outputs of support vector machine classifiers. IEEE Transactions on Neural Networks 10:5, 1018. [CrossRef] 53. Chuan Wang, J.C. Principe. 1999. Training neural networks with additive noise in the desired signal. IEEE Transactions on Neural Networks 10:6, 1511. [CrossRef] 54. Gokaraju K. Raju, Charles L. Cooney. 1998. Active learning from process data. AIChE Journal 44:10, 2199-2211. [CrossRef]
55. Sun-Yuan Kung, Jenq-Neng Hwang. 1998. Neural networks for intelligent multimedia processing. Proceedings of the IEEE 86:6, 1244-1272. [CrossRef] 56. Xiaofan Lin, Xiaoqing Ding, Youshou Wu. 1998. Theoretical analysis of the confidence metrics for nearest neighbor classifier. Chinese Science Bulletin 43:6, 464-467. [CrossRef] 57. C. Santa Cruz, J.R. Dorronsoro. 1998. A nonlinear discriminant algorithm for feature extraction and data classification. IEEE Transactions on Neural Networks 9:6, 1370. [CrossRef] 58. W. J. Staszewski, K. Worden. 1997. Classification of faults in gearboxes ? pre-processing algorithms and neural networks. Neural Computing & Applications 5:3, 160-183. [CrossRef] 59. L. Bruzzone, S.B. Serpico. 1997. An iterative technique for the detection of land-cover transitions in multitemporal remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing 35:4, 858-867. [CrossRef] 60. M. Compiani, P. Fariselli, R. Casadio. 1997. Noise and randomlike behavior of perceptrons: Theory and applicationto protein structure prediction. Physical Review E 55:6, 7334-7343. [CrossRef] 61. H. Osman, M.M. Fahmy. 1997. Neural classifiers and statistical pattern recognition: applications for currently established links. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 27:3, 488-497. [CrossRef] 62. Sang-Hoon Oh. 1997. Improving the error backpropagation algorithm with a modified error function. IEEE Transactions on Neural Networks 8:3, 799-803. [CrossRef] 63. C. Chatterjee, V.P. Roychowdhury. 1997. On self-organizing algorithms and networks for class-separability features. IEEE Transactions on Neural Networks 8:3, 663-678. [CrossRef] 64. Avi Naim, Kavan U. Ratnatunga, Richard E. Griffiths. 1997. Quantitative Morphology of Moderate‐Redshift Galaxies: How Many Peculiar Galaxies Are There?. The Astrophysical Journal 476:2, 510-520. [CrossRef] 65. S.J. Roberts, W. Penny. 1997. Maximum certainty approach to feedforward neural networks. 
Electronics Letters 33:4, 306. [CrossRef] 66. Sung-Bae Cho. 1997. Neural-network classifiers for recognizing totally unconstrained handwritten numerals. IEEE Transactions on Neural Networks 8:1, 43. [CrossRef] 67. Shang-Hung Lin, Sun-Yuan Kung, Long-Ji Lin. 1997. Face recognition/detection by probabilistic decision-based neural network. IEEE Transactions on Neural Networks 8:1, 114. [CrossRef] 68. L. Holmstrom, P. Koistinen, J. Laaksonen, E. Oja. 1997. Neural and statistical classifiers-taxonomy and two case studies. IEEE Transactions on Neural Networks 8:1, 5. [CrossRef]
69. K. Popat, R.W. Picard. 1997. Cluster-based probability model and its application to image and texture processing. IEEE Transactions on Image Processing 6:2, 268. [CrossRef] 70. JAMES PARDEY, STEPHEN ROBERTS, LIONEL TARASSENKO, JOHN STRADLING. 1997. A new approach to the analysis of the human sleep/wakefulness continuum. Journal of Sleep Research 5:4, 201-210. [CrossRef] 71. M. Schuster, K.K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45:11, 2673. [CrossRef] 72. Partha Niyogi, Federico Girosi. 1996. On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis FunctionsOn the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions. Neural Computation 8:4, 819-842. [Abstract] [PDF] [PDF Plus] 73. W.C. Chen, N.A. Thacker, P.I. Rockett. 1996. Adaptive step edge model for self-consistent training of neural network for probabilistic edge labelling. IEE Proceedings - Vision, Image, and Signal Processing 143:1, 41. [CrossRef] 74. M. Ostendorf, V.V. Digalakis, O.A. Kimball. 1996. From HMM's to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing 4:5, 360. [CrossRef] 75. P. Comon, G. Bienvenu. 1996. Ultimate performance of QEM classifiers. IEEE Transactions on Neural Networks 7:6, 1535. [CrossRef] 76. D. Miller, A.V. Rao, K. Rose, A. Gersho. 1996. A global optimization technique for statistical classifier design. IEEE Transactions on Signal Processing 44:12, 3108. [CrossRef] 77. Z. Roth, Y. Baram. 1996. Multidimensional density shaping by sigmoids. IEEE Transactions on Neural Networks 7:5, 1291. [CrossRef] 78. Thomas H. Fischer, Wesley P. Petersen, Hans Peter L�thi. 1995. A new optimization technique for artificial neural networks applied to prediction of force constants of large molecules. Journal of Computational Chemistry 16:8, 923-936. 
[CrossRef] 79. C. Jacobsson, L. Jönsson, G. Lindgren, M. Nyberg-Werther. 1995. Jet identification based on probability calculations using Bayes’ theorem. Physical Review D 52:1, 162-174. [CrossRef] 80. R. Rovatti, R. Ragazzoni, Zs. M. Kovàcs, R. Guerrieri. 1995. Adaptive Voting Rules for k-Nearest Neighbors ClassifiersAdaptive Voting Rules for k-Nearest Neighbors Classifiers. Neural Computation 7:3, 594-605. [Abstract] [PDF] [PDF Plus] 81. P. O. G. Hagman, S. A. Grundberg. 1995. Classification of scots pine (Pinus sylvestris) knots in density images from CT scanned logs. Holz als Roh- und Werkstoff 53:1, 75-81. [CrossRef]
82. H. Ney. 1995. On the probabilistic interpretation of neural network classifiers and discriminative training criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence 17:2, 107. [CrossRef] 83. Takio Kurita. 1994. Iterative weighted least squares algorithms for neural networks classifiers. New Generation Computing 12:4, 375-394. [CrossRef] 84. Lynne Boddy, C. W. Morris, M. F. Wilkins, G. A. Tarran, P. H. Burkill. 1994. Neural network analysis of flow cytometric data for 40 marine phytoplankton species. Cytometry 15:4, 283-293. [CrossRef] 85. Chris M. Bishop. 1994. Neural networks and their applications. Review of Scientific Instruments 65:6, 1803. [CrossRef] 86. Thorsteinn Rögnvaldsson . 1993. Pattern Discrimination Using Feedforward Networks: A Benchmark Study of Scaling BehaviorPattern Discrimination Using Feedforward Networks: A Benchmark Study of Scaling Behavior. Neural Computation 5:3, 483-491. [Abstract] [PDF] [PDF Plus] 87. Ronald C. Beavis, Steven M. Colby, Royston Goodacre, Peter de B. Harrington, James P. Reilly, Stephen Sokolow, Charles W. WilkersonArtificial Intelligence and Expert Systems in Mass Spectrometry . [CrossRef]
Communicated by Hal White
NOTE
Lowering Variance of Decisions by Using Artificial Neural Network Portfolios Ganesh Mani Computer Sciences Department, University of Wisconsin-Madison, 1210 W. Dayton Street, Madison, WI 53706 USA
Artificial neural networks (ANNs) usually take a long time to train. Experimenters often find that a number of parameters have to be tuned manually before a network that manifests reasonable performance is obtained. For example, in the case of backpropagation (Rumelhart et al. 1986), these parameters include the learning rate, number of units at each layer, the connectivity among the nodes, the function to be executed at each node, offline vs. online error propagation, and the momentum term. The choice of the learning algorithm (Hebbian learning, standard backpropagation, etc.) adds another degree of freedom in this model-building exercise.

Instead of hand-picking (by trial and error) a good point in the multidimensional parameter space described above, it is desirable to automate the entire training process, including the choice of values for parameters such as topology and learning rate. However, automating the entire process can be a very difficult task. An attractive, albeit indirect, way of dealing with this problem is to train a number of ANNs (on the same data, using different learning algorithms, momentum rates, learning rates, topologies, etc.) and to use this portfolio of ANNs to make the final decision. Portfolio theory (starting with Markowitz 1952) provides the rationale for such an approach and the purpose of this note is to draw attention to this fact.

Consider a portfolio of ANNs with the individual network decisions denoted by $d_i$ and the variance of the decisions denoted by $\sigma^2(d_i)$. The expected decision of a portfolio of $N$ nets is given by
$$E(d_p) = \sum_{j=1}^{N} x_j E(d_j)$$
where $x_j$ is the influence or weight (not to be confused with the individual link weights in each ANN) of each constituent net in the total decision and $E(\cdot)$ denotes an expected value. Usually, $\sum_{j=1}^{N} x_j = 1$.

Neural Computation 3, 484-486 (1991) © 1991 Massachusetts Institute of Technology
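The variance of the portfolio decision is given by

$$\sigma^2(d_p) = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j \,\mathrm{Cov}(d_i, d_j)$$

where $\mathrm{Cov}(d_i, d_j)$ is the covariance between the decisions of net $i$ and net $j$. For an equally weighted portfolio of nets, the above expression simplifies to

$$\sigma^2(d_p) = \frac{1}{N}\,\overline{\sigma^2(d_i)} + \frac{N-1}{N}\,\overline{\mathrm{Cov}(d_i, d_j)} \qquad (\text{for } i \neq j)$$

with overbars denoting averages over the nets and over pairs of distinct nets, respectively.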
Thus, we see that the variance of the decision of a portfolio of ANNs is dominated by the average covariance between the decisions of pairs of distinct nets; the first term on the right-hand side above has a lower and lower effect as more and more nets are added to the portfolio. The average covariance is usually much smaller than the variance of the individual ANN decisions.

The equivalent problem of combining forecasts has been addressed in the econometric literature. Although the desirability of linear combination of forecasts over individual forecasts is well established, there is controversy on the combination method to be used in different situations (for an overview of work in the area, see Granger 1989). Results from the econometric literature indicate that combining the results of individual models using least-squares regression, with the dependent variable as the target of interest and the individual model decisions along with a constant (unity) as explanatory variables, generates better performance (than the usual method, which ignores the bias term and imposes the constraint that the portfolio weights sum to unity). It has also been reported that a simple averaging of the results of the constituent models generates a very good composite model.

An important question that arises is the determination of the number of models to combine; or in our framework, how many individual nets to train. From a theoretical perspective, results of Barron (1991) suggest that an additional network can be added to the portfolio if the resulting improvement in fit (on training examples) outweighs the complexity cost of adding the network. From a practical standpoint (where good performance on the test or out-of-sample examples is the desired end), using a portfolio of a small number of nets would be a reasonable strategy.
For example, in the motivating domain of portfolio management, it is found that even a set of between 8 and 12 stocks chosen at random can constitute a well-diversified portfolio. However, unlike stock returns, the decisions of individual ANN models are rarely negatively correlated and hence the reduction in variance may not be as high.
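The effect of adding nets to an equally weighted portfolio can be checked numerically. The sketch below is only an illustration under a simplifying assumption that is not in the note: every net has the same decision variance and every pair of distinct nets has the same positive correlation. The function names and the values 1.0 and 0.3 are hypothetical.

```python
def portfolio_variance(weights, cov):
    """Variance of the weighted decision: sum_i sum_j x_i x_j Cov(d_i, d_j)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))


def equicorrelated_cov(n, variance, correlation):
    """Covariance matrix with equal variances and equal pairwise correlation
    (an illustrative assumption, not a property of real ANN portfolios)."""
    return [[variance if i == j else correlation * variance for j in range(n)]
            for i in range(n)]


def equal_weight_variance(n, variance, correlation):
    """Closed form for equal weights: (1/N) * variance + ((N - 1)/N) * covariance."""
    return variance / n + (n - 1) / n * correlation * variance


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 12, 100):
        cov = equicorrelated_cov(n, variance=1.0, correlation=0.3)
        weights = [1.0 / n] * n
        print(n, round(portfolio_variance(weights, cov), 4))
```

Under these assumptions the portfolio variance falls quickly over the first few nets and then levels off near the common covariance, which is why positively correlated nets limit the achievable reduction.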
Note that the networks to be included in a portfolio can be trained in parallel, thus reducing the total model-building time drastically. Thus, a portfolio approach can facilitate rapid generation of composite models that are superior to their constituent models.
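The two combination rules mentioned in the note, least-squares regression on the individual decisions plus a constant and simple averaging, can be compared on synthetic data. This is a minimal sketch, not the procedure of any cited study: the three "nets" are hypothetical noisy, biased copies of the target, all names are made up for illustration, and the comparison is in-sample only, so it says nothing about out-of-sample performance.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_combination(decisions, targets):
    """Least-squares coefficients for [constant, net_1, ..., net_K] -> target."""
    rows = [[1.0] + list(d) for d in decisions]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def mse(predictions, targets):
    """Mean squared error."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def make_synthetic_data(n_samples=200, seed=0):
    """Targets plus the decisions of three hypothetical nets."""
    rng = random.Random(seed)
    targets = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    biases, noises = (0.3, -0.1, 0.2), (0.2, 0.4, 0.3)
    decisions = [[y + b + rng.gauss(0.0, s) for b, s in zip(biases, noises)]
                 for y in targets]
    return decisions, targets


decisions, targets = make_synthetic_data()
beta = fit_combination(decisions, targets)
regressed = [beta[0] + sum(w * d for w, d in zip(beta[1:], row)) for row in decisions]
averaged = [sum(row) / len(row) for row in decisions]
print("least squares with constant:", round(mse(regressed, targets), 4))
print("simple average:             ", round(mse(averaged, targets), 4))
```

On the fitting data the regression can never do worse than the simple average, because the average is one particular choice of coefficients; whether the advantage survives on held-out examples depends on the data.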
Acknowledgments

I am grateful to the reviewer for pointing me to the econometric literature on combining forecasts.

References

Barron, A. 1991. Complexity regularization with application to ANNs. Proceedings NATO ASI Nonparametric Functional Estimation, Kluwer.
Granger, C. W. J. 1989. Combining forecasts-twenty years later (invited review). J. Forecast. 8(3).
Markowitz, H. M. 1952. Portfolio selection. J. Finance 7(1).
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Vol. 1, Chapter 8. MIT Press/Bradford Books, Cambridge, MA.
Received 6 February 1991; accepted 3 May 1991.
Communicated by Allen Selverston
Oscillating Networks: Control of Burst Duration by Electrically Coupled Neurons L. F. Abbott Department of Physics and Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA
E. Marder Department of Biology and Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA S. L. Hooper Department of Physiology and Biophysics, Box 1218, Mt. Sinai Hospital, 1 Gustav Levy Place, New York, NY 10029 USA and Center for Neurobiology and Behavior, College of Physicians and Surgeons, Columbia University, New York, NY 10032 USA
The pyloric network of the stomatogastric ganglion in crustacea is a central pattern generator that can produce the same basic rhythm over a wide frequency range. Three electrically coupled neurons, the anterior burster (AB) neuron and two pyloric dilator (PD) neurons, act as a pacemaker unit for the pyloric network. The functional characteristics of the pacemaker network are the result of electrical coupling between neurons with quite different intrinsic properties, each contributing a basic feature to the complete circuit. The AB neuron, a conditional oscillator, plays a dominant role in rhythm generation. In the work described here, we manipulate the frequency of the AB neuron both isolated and electrically coupled to the PD neurons. Physiological and modeling studies indicate that the PD neurons play an important role in regulating the duration of the bursts produced by the pacemaker unit.

Neural Computation 3, 487-497 (1991) © 1991 Massachusetts Institute of Technology

The functional characteristics of a neural network arise both from the intrinsic properties of individual component neurons and from emergent, collective effects. Central pattern generators are relatively simple networks with well-defined outputs that can be used to study the interplay of intrinsic and emergent phenomena. The pyloric network of the stomatogastric ganglion is a particularly well-studied central pattern generator that produces a three-phase rhythm driving muscles in the
stomach of lobsters and crabs (Selverston and Moulins 1987). This central pattern generator produces the same basic rhythmic pattern over a frequency range from about 0.3 to 3 Hz. A pacemaker unit consisting of three electrically coupled cells, the anterior burster (AB) neuron and two pyloric dilator (PD) neurons, plays an important role in regulating network frequency. Experimental studies of isolated AB and PD neurons reveal that they have intrinsic properties quite different from each other (Marder and Eisen 1984; Flamm and Harris-Warrick 1986; Bal et al. 1988). Modeling studies reported in this paper suggest how the different intrinsic properties of the AB and PD neurons combine to produce the characteristics of the full pacemaker network. The AB neuron plays a dominant role in rhythm generation whereas the PD neurons regulate the duration of the pacemaker bursts.

A burst from the pacemaker unit forms one element of the three-phase pyloric rhythm. As shown in Figure 1A, the AB and PD neurons depolarize synchronously in periodic bursts. The frequency of the rhythm can be controlled in the laboratory by injecting current into the AB neuron. When the period is modified in this way, the duration of the AB/PD bursts varies in direct proportion to the period. In Figure 1B, AB/PD burst duration is plotted against network period. The duration of the pacemaker burst increases linearly with the period over a wide range of frequencies. Defining the duty cycle as the ratio of the burst duration to the period, we see that in the full network the pacemaker unit acts as a constant duty cycle oscillator.

The pacemaker unit consisting of the AB and PD neurons¹ (Fig. 2A) can be isolated from the rest of the pyloric network by blocking glutamatergic synaptic transmission with picrotoxin (Eisen and Marder 1982). The frequency of the AB/PD pacemaker unit can again be modified by injecting current into the AB neuron.
In the isolated pacemaker network, as in the full network, the duration of the pacemaker burst increases as the cycle period is increased. This is shown in representative intracellular recordings (Fig. 2A) and in a plot of burst duration versus period (Fig. 2C). Individual AB or PD neurons can be isolated by killing the cells to which they are coupled (Miller and Selverston 1979). An isolated PD neuron may oscillate but it does so irregularly with a period much longer than that of the normal pyloric rhythm (Bal et al. 1988). An isolated AB neuron oscillates at a frequency in the normal pyloric range (Miller and Selverston 1982; Hooper and Marder 1987; Bal et al. 1988; Marder and Meyrand 1989). When the frequency of an isolated AB neuron is modified
by current injection, it behaves quite differently than when it is part of the full pyloric or AB/PD pacemaker networks. As shown in Figure 2B and C, the burst duration of an isolated AB neuron remains constant as its period changes. Thus, in isolation the AB neuron acts as a constant burst duration oscillator. When coupled to the PD neurons, however, it behaves as a constant duty cycle oscillator.

¹The AB and PD neurons are also coupled by both electrical and chemical synapses to the ventricular dilator (VD) neuron. The electrical coupling to the VD neuron is weaker than that between the AB and PD neurons and our experiments indicate that the VD neuron does not play an important role in the effects we discuss. For simplicity we have left the VD neuron out of our discussion and diagrams.

Figure 1: The full pyloric network. (A) Simultaneous intracellular recordings from the somata of the AB and PD neurons, with the period, burst duration, and interburst interval marked. Standard electrophysiological methods (Hooper and Marder 1987) were used. The AB and PD neurons fire during synchronous periodic bursts of depolarization. (B) Burst duration versus period (sec). When the period of AB/PD bursting in the full pyloric network is modified by injecting current into the AB neuron, the AB/PD burst duration increases linearly with the oscillation period.
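The distinction between a constant duty cycle oscillator and a constant burst duration oscillator can be made concrete with a few lines of arithmetic. The sketch below uses idealized oscillators; the duty cycle of 1/3, the burst duration of 0.2 s, and the list of periods are illustrative assumptions rather than measured values.

```python
def duty_cycle(burst_duration, period):
    """Ratio of burst duration to period."""
    return burst_duration / period


def burst_constant_duty(period, duty=1.0 / 3.0):
    """Burst duration of an idealized constant-duty-cycle oscillator."""
    return duty * period


def burst_constant_duration(period, burst=0.2):
    """Burst duration of an idealized constant-burst-duration oscillator."""
    return burst


periods = [0.5, 1.0, 2.0, 3.0]  # seconds, illustrative values only
for p in periods:
    coupled = duty_cycle(burst_constant_duty(p), p)
    isolated = duty_cycle(burst_constant_duration(p), p)
    print(f"period={p:.1f}s  constant-duty={coupled:.3f}  constant-burst={isolated:.3f}")
```

In the first case the burst grows in proportion to the period and the duty cycle stays fixed; in the second the burst is fixed and the duty cycle shrinks as the period lengthens.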
Figure 2: The effect of current injection on the pacemaker network and isolated AB neuron. (A) Intracellular recordings from the AB and PD neurons electrically coupled in the pacemaker network. The first pair of recordings were produced by injection of depolarizing current into the AB neuron. The second pair were at zero injected current and the third pair were with hyperpolarizing current injected into the AB neuron. The increase in period from top to bottom results from increases of both the interburst intervals and the duration of the bursts. (B) Intracellular recording from an isolated AB neuron. The AB neuron was "isolated" by photoinactivating the PD and VD neurons and placing the preparation in picrotoxin (Hooper and Marder 1987). The increase of the period in this case is solely due to an increase in the interburst interval; the burst duration does not change. (C) Burst duration versus period (sec) for the pacemaker network and for the isolated AB neuron. Burst duration increases with increased period for the pacemaker but is independent of period for the isolated AB neuron. Isolated AB neuron results include our data and those of Bal et al. (1988).
To study how the PD neurons modify the nature of the pacemaker oscillations, we have constructed relatively simple models of the isolated AB and PD neurons and examined the activity the models produce when electrically coupled. We model only a spike-averaged "slow-wave" membrane potential, ignoring individual action potentials. [The pyloric network can continue to function when action potentials are blocked by tetrodotoxin (Raper 1979).] To simplify the equations of the model we use units of time in which the cell membrane time constant is one, and choose arbitrary units and an arbitrary zero for the membrane potential $v$. When results of the model are compared with experimental data we scale and zero the voltage and time variables appropriately.

The models we use to describe the AB and PD neurons are by no means accurate in every detail. For example, the shape of the slow wave for the AB neuron in the model is different from that of the real neuron. However, we have taken care to model accurately those characteristics of the AB and PD neurons which are likely to be important for control and regulation of burst period and duration. Since the AB and PD are resistively coupled, it is important to model the amplitude of their respective bursts so that the amount of current flowing through the gap junction is correctly predicted. Likewise, the dependence of the input impedance of each neuron on its membrane potential should be included in the model so that the effect of the current entering through the gap junction can be evaluated.

The AB neuron is represented by a simple oscillator model which mimics the behavior of an isolated AB neuron. The AB membrane potential is governed by the equation of current conservation
[equation (1.1) missing]

The term $v_{AB}(v_{AB} - 1)(v_{AB} + 1)$ represents the rapid I-V characteristics of the neuron as in the familiar FitzHugh-Nagumo equations (FitzHugh 1961; Nagumo et al. 1962). In addition, we include the term $H_{AB}$ to account for the large difference in membrane conductance in the hyperpolarized and depolarized regions. $H_{AB}(v_{AB})$ is given by

[equation (1.2) missing]

The voltage-dependent $H_{AB}$ factor has the important effect of causing the amplitude of the burst oscillations in the model to increase as the model neuron is hyperpolarized and decrease as it is depolarized. With a constant $H_{AB}$ the burst amplitude is independent of injected current. In equation 1.1, $I_{ext}$ represents the external current injected into the cell and $I_R$ is the current entering the AB neuron from the PD neuron,

$$I_R = G(v_{PD} - v_{AB}) \tag{1.3}$$
where $G$ is the coupling conductance of the electrical synapse between the AB and PD neurons. Of course, when we discuss the AB neuron in isolation this resistive coupling is set to zero. For simplicity we model the pacemaker network with an AB neuron and a single PD neuron.

Oscillations of the model AB neuron are produced by the variable $u$, which represents the slow voltage-dependent conductances of the AB neuron responsible for rhythmic bursting. It obeys the equation

$$20\,\frac{du}{dt} = [1 - \tanh(5v_{AB})](v_{AB} - u - 0.1)^3 + [1 + \tanh(5v_{AB})](1 - u) \tag{1.4}$$
The form of this equation (which is more complex than the familiar FitzHugh-Nagumo model) was chosen on the basis of reductions of a more complex model of the AB neuron based on realistic ionic conductances (Epstein and Marder 1990). Because of the factors $1 \pm \tanh(5v_{AB})$, the first term on the right side of equation 1.4 governs the behavior of $u$ when the AB neuron is hyperpolarized, while the second term determines the behavior when the neuron is depolarized. The power of three in the first term has been included so that the effect of hyperpolarizing current on the neuron is more correctly modeled. If this power is one as in the FitzHugh-Nagumo model, the frequency of the oscillations is relatively insensitive to hyperpolarizing current until a critical value is reached and the neuron suddenly stops bursting. With the power of three in this term, the oscillation frequency decreases more smoothly as a function of hyperpolarizing current. The constant 0.1 in equation 1.4 was adjusted to make the ratio of the burst duration to the period match that of the real neuron. The second term on the right side of equation 1.4, which governs the behavior of $u$ when the AB neuron is depolarized, contains the factor $(1 - u)$, which is independent of voltage. Besides being more realistic than the usual FitzHugh-Nagumo form, this makes it easier for the PD neuron (when it is coupled to the AB) to sustain the duration of the combined AB/PD bursts, as it must if it is to act as a burst duration regulator.

The model of equations 1.1, 1.2, and 1.4 duplicates quite well the effect of injected current on both the frequency and amplitude of oscillations for an isolated AB neuron. The oscillating waveform of a model AB neuron and its burst duration at different frequencies are shown in Figure 3B and C. Like the real AB neuron, the model maintains a constant burst duration as its frequency is varied.
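The way the two tanh factors switch between the terms of equation 1.4 can be seen by evaluating its right-hand side directly. The sketch below assumes equation 1.4 has the form $20\,du/dt = [1 - \tanh(5v_{AB})](v_{AB} - u - 0.1)^3 + [1 + \tanh(5v_{AB})](1 - u)$, as read from the text; the function names and the sample values of $v_{AB}$ and $u$ are arbitrary choices for illustration.

```python
import math


def slow_variable_terms(v_ab, u):
    """Return the (hyperpolarized, depolarized) terms of the assumed equation 1.4."""
    hyper = (1.0 - math.tanh(5.0 * v_ab)) * (v_ab - u - 0.1) ** 3
    depol = (1.0 + math.tanh(5.0 * v_ab)) * (1.0 - u)
    return hyper, depol


def du_dt(v_ab, u):
    """Rate of change of u, in time units where the membrane time constant is one."""
    hyper, depol = slow_variable_terms(v_ab, u)
    return (hyper + depol) / 20.0


for v_ab, u in [(-1.0, 0.0), (1.0, 0.5)]:
    hyper, depol = slow_variable_terms(v_ab, u)
    print(f"v_AB={v_ab:+.1f} u={u:.1f}  hyper term={hyper:+.5f}  "
          f"depol term={depol:+.5f}  du/dt={du_dt(v_ab, u):+.5f}")
```

At a hyperpolarized potential the cubic term carries essentially all of the rate, and at a depolarized potential the voltage-independent $(1 - u)$ term does.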
The model PD neuron is in some respects similar to the AB model but it lacks the slow current variable u that causes oscillation in the model AB neuron. Instead the PD neuron model includes a slowly varying, voltage-dependent current that allows the PD neuron to oscillate very slowly or to generate plateau potentials. The PD membrane potential
obeys the equation

[equation (1.5) missing]

where the overall voltage-dependent conductance factor in this case is

$$H_{PD} = 0.2 + 0.06\tanh(5v_{PD}) \tag{1.6}$$

Figure 3: Model pacemaker and AB neuron. (A) Voltage traces (AB and PD) versus time (sec) for the model pacemaker unit. Only the slow-wave part of the membrane potential is modeled; individual action potentials are not included. (B) Voltage traces versus time (sec) for the isolated AB. (C) Burst duration versus period for the AB-PD pacemaker and the isolated AB. In the model AB, burst duration is constant as the period is changed by current injection. In the model pacemaker network, burst duration increases linearly with period.
The factor of 0.5 in equation 1.5 plays a role similar to the factor 0.1 in equation 1.4. The particular value chosen allows the AB neuron to drive the PD when the two are coupled with the coupling strength we use. The term $I_g$ is a slowly varying current representing the summed effect of one or more different conductances. This current is predominantly active in the depolarized voltage range. It consists of an outward component that increases slowly in strength when the neuron is depolarized and slowly becomes weaker when the cell is hyperpolarized. This component has characteristics similar to those of a calcium-activated potassium current. In addition, $I_g$ may have an inward component that is activated by hyperpolarization and deactivated by depolarization. Specifically, $I_g$ is determined by
$$I_g = \frac{g\,(v_{PD} + 1)}{1 + \exp(-v_{PD})} \tag{1.7}$$
The voltage dependence we have chosen has the convenient feature of making the model easier to analyze because the current plays a predominant role only when the PD neuron is bursting. The time-dependent conductance strength is given by

$$300\,\frac{dg}{dt} = \frac{1}{4}\left[1 + 3\tanh(5v_{PD})\right] \tag{1.8}$$
The factor of 300 in this equation causes the variation of $g$ to be much slower than the AB neuron oscillation frequency.² Note that a term like equation 1.7 with constant $g$ can always be absorbed into the fast part of the PD membrane current. As a result the zero of $g$ can be shifted, making it impossible in this model to establish unambiguously the ratio of inward and outward components in the composite current $I_g$. The factor $1 + 3\tanh(5v_{PD})$ in equation 1.8 is approximately equal to $-2$ when the PD is hyperpolarized and $+4$ when it is depolarized. This means that $g$ will increase when the neuron is depolarized at twice the rate that it decreases when the neuron is hyperpolarized. As we will see, the ratio of increase to decrease rates will set the ratio of burst duration to period when the PD is coupled to the AB neuron.

²In principle, this equation allows $g$ to increase or decrease indefinitely, which is clearly unrealistic. In actual simulations $g$ stays within reasonable bounds, but to be safe we sometimes restrict it to a predefined range.
When the model AB neuron is electrically coupled to a model PD, the PD neuron is driven by the AB into rhythmic oscillation (as in the real preparation). Coupling to the PD neuron has the effect of reducing the frequency of the AB oscillations (Kepler et al. 1990; Meunier 1990). During each burst the slow current conductance factor $g$ increases and during each interburst it decreases. The integrated effect of the small cycle-by-cycle fluctuations of $g$ depends on the ratio of the burst duration to the length of the interburst interval. If depolarized bursts dominate, $g$ will increase during the bursts more than it decreases during the interbursts and $g$ will become more positive. If the interburst intervals dominate, $g$ will on average decrease. Because $I_g$ is most active when the PD neuron is depolarized, a change in its strength and polarity, determined by $g$, will affect the length of the AB/PD bursts. A large negative $I_g$ will tend to depolarize the PD neuron, causing current to flow through the electrical coupling from the PD to the AB neuron. The resulting injection of current into the AB neuron will prolong the depolarized bursts. A large positive $I_g$ will have the opposite effect, shortening the duration of the bursts. Because the variation of $g$ is slower than the oscillation frequency, the value of $g$ will drift either up or down over several oscillation cycles until an equilibrium condition is reached. The equilibrium is attained when the ratio of the burst duration to the interburst interval matches the ratio of the rate of decrease of the current strength factor $g$ to the rate of its increase. As can be seen from equation 1.8, this ratio has been set equal to 1/2. (For large positive $x$, $[1 + 3\tanh(x)]/4 \approx 1$ and $[1 + 3\tanh(-x)]/4 \approx -1/2$.) This assures that the correct duty cycle ratio of 1/3 (burst/interburst = 1/2, burst/period = 1/3) will be achieved through a dynamic adjustment of the slowly varying PD current, independent of oscillation frequency.
Other duty cycle ratios can be obtained by varying the ratio of decrease to increase rates in equation 1.8. The burst duration and voltage waveforms of the model pacemaker network are shown in Figure 3A and C. Because the regulating effect of the PD neuron is dynamic, it is quite robust and not overly sensitive to any particular choice of model parameters. For example, we can change the strength of the electrical coupling G by a factor of three without destroying or drastically modifying the constant duty cycle behavior. Likewise the model is not overly sensitive to the values of other parameters such as the time constant (300) in equation 1.8 or the form of the voltage dependence for Ig. The AB/PD pacemaker unit offers an interesting example of neurons with different characteristics complementing each other to form a network with desirable features not expressed by any single neuron in isolation. The AB neuron by itself can oscillate over a wide frequency range but does so with fixed burst duration. The PD neuron by itself has membrane characteristics much too slow to drive the network at the correct frequency. However, in combination with the more rapid oscillations of the AB neuron, the slow current characteristics of the PD neuron
act to regulate the pacemaker burst duration resulting in a pacemaker network with the desired characteristics needed to produce the pyloric rhythm.
Acknowledgments
Research supported by National Institute of Mental Health Grant MH46742, Department of Energy Contract DE-AC0276-ER03230, and National Institutes of Health postdoctoral fellowship 1F32MH09830.
References
Bal, T., Nagy, F., and Moulins, M. 1988. The pyloric central pattern generator in Crustacea: A set of conditional neuronal oscillators. J. Comp. Physiol. A163, 715-727.
Eisen, J. S., and Marder, E. 1982. Mechanisms underlying pattern generation in lobster stomatogastric ganglion as determined by selective inactivation of identified neurons. III. Synaptic connection of electrically coupled pyloric neurons. J. Neurophysiol. 48, 1392-1415.
Epstein, I., and Marder, E. 1990. Multiple modes of a conditional neural oscillator. Biol. Cyber. 63, 25-34.
FitzHugh, R. 1961. Impulses and physiological state in theoretical models of nerve membrane. Biophys. J. 1, 445-466.
Flamm, R. E., and Harris-Warrick, R. M. 1986. Aminergic modulation in the lobster stomatogastric ganglion I. and II. J. Neurophysiol. 55, 847-881.
Hooper, S., and Marder, E. 1987. Modulation of the lobster pyloric rhythm by the peptide proctolin. J. Neurosci. 7, 2097-2112.
Kepler, T., Marder, E., and Abbott, L. F. 1990. The effect of electrical coupling on the frequency of a model neuronal oscillator. Science 6, 83-85.
Marder, E., and Eisen, J. S. 1984. Electrically coupled pacemaker neurons respond differently to same physiological inputs and neurotransmitters. J. Neurophysiol. 51, 1362-1374.
Marder, E., and Meyrand, P. 1989. Chemical modulation of an oscillatory neural circuit. In Neuronal and Cellular Oscillators, J. W. Jacklet, ed. Marcel Dekker, New York.
Meunier, C. 1990. Electrical coupling of two simple oscillators. Biol. Cyber. (submitted).
Miller, J., and Selverston, A. I. 1979. Rapid killing of single neurons by irradiation of intracellular injected dye. Science 206, 702-704.
Miller, J., and Selverston, A. I. 1982. Mechanism underlying pattern generation in lobster stomatogastric ganglion as determined by selective inactivation of identified neurons. II. Oscillatory properties of pyloric neurons. J. Neurophysiol. 48, 1378-1391.
Raper, J. A. 1979. Non-impulse mediated synaptic transmission during the generation of a cyclic motor program. Science 205, 304-306.
Nagumo, J. S., Arimoto, S., and Yoshizawa, S. 1962. An active pulse transmission line simulating nerve axon. Proc. IRE 50, 2061-2070.
Selverston, A. I., and Moulins, M., eds. 1987. The Crustacean Stomatogastric System. Springer-Verlag, Berlin.
Received 30 January 1991; accepted 12 April 1991.
Communicated by Gordon Shepherd
A Computer Simulation of Oscillatory Behavior in Primary Visual Cortex
Matthew A. Wilson
James M. Bower
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
Periodic variations in correlated cellular activity have been observed in many regions of the cerebral cortex. The recent discovery of stimulus-dependent, spatially coherent oscillations in primary visual cortex of the cat has led to suggestions of neural information encoding schemes based on phase and/or frequency variation. To explore the mechanisms underlying this behavior and their possible functional consequences, we have developed a realistic neural model, based on structural features of visual cortex, which replicates observed oscillatory phenomena. In the model, this oscillatory behavior emerges directly from the structure of the cortical network and the properties of its intrinsic neurons; however, phase coherence is shown to be an average phenomenon seen only when measurements are made over multiple trials. Because average coherence does not ensure synchrony of firing over the course of single stimuli, oscillatory phase may not be a robust strategy for directly encoding stimulus-specific information. Instead, the phase and frequency of cortical oscillations may reflect the coordination of general computational processes within and between cortical areas. Under this interpretation, coherence emerges as a result of horizontal interactions that could be involved in the formation of receptive field properties.
Neural Computation 3, 498-509 (1991) © 1991 Massachusetts Institute of Technology
1 Introduction
An obvious characteristic of the general behavior of cerebral cortex, as evident in EEG recordings, is its tendency to oscillate (Bressler and Freeman 1980). Cortical oscillations have been observed both in the electric fields generated by populations of cells (Bressler and Freeman 1980) and in the activity of single cells (Llinas 1988). Recent observations of oscillations within visual cortex that are dependent on the nature of the visual stimulus (Gray and Singer 1987; Eckhorn et al. 1988; Gray et al. 1989; Gray and Singer 1989) have generated increased interest in the role of periodic behavior in cerebral cortical processing in general. These studies have shown that populations of visual cortical neurons at considerable cortical distances exhibit increased coherence in neuronal activity when the visual stimulus is a single continuous object as compared to a discontinuous object. This work represents an extension of earlier work showing that the responses of cells can be influenced by stimuli that are located beyond the boundaries of the classical receptive field (Allman et al. 1985), with horizontal interactions implicated in shaping these more complex receptive field properties (Tso et al. 1986). These recent results have led to suggestions that differences in oscillatory phase and/or frequency between cell populations in primary visual cortex could be used to label different objects in the visual scene for subsequent processing in higher visual areas (Eckhorn et al. 1988; Gray et al. 1989; Gray and Singer 1989; Sporns et al. 1989; Kammen et al. 1989). It has further been suggested that these oscillatory patterns may rely on central, extracortical control to assure temporal coherence (Kammen et al. 1989). In this paper we describe the results of simulations of a biologically realistic model of neocortical networks designed to explore the possible mechanisms underlying oscillations in visual cortex, as well as the functional significance of this oscillatory behavior. In particular we analyze the role of horizontal interactions in the establishment of coherent oscillatory behavior.
2 Cortical Model
The model consists of a network of three basic cell types found throughout cerebral cortex (Fig. 1). The principal excitatory neuron, the pyramidal cell, is modeled here as five coupled cylindrical membrane compartments (soma l = 20 μm, d = 20 μm; dendrites l = 100 μm, d = 1.5 μm). In addition there are two inhibitory neurons, one principally mediating a slow K+ inhibition (soma l = 10 μm, d = 10 μm) and one mediating a fast Cl- inhibition (soma l = 15 μm, d = 15 μm). Both are modeled as a single compartment. Connections between modeled cells are made by axons with finite conduction velocities, but no explicit axonal membrane properties other than delay are included. Synaptic activity is produced by simulating the action-potential triggered release of presynaptic transmitter followed by the activation of the postsynaptic conductance (0.8 msec delay) and the resulting flow of transmembrane current through membrane channels. Each of these channels is described with parameters governing the time course and amplitude of synaptically activated conductance changes. Conductances have single open and closed states with transitions between these states governed by independent time constants. The open time constant for each conductance is 1 msec. The closing time constant for the excitatory conductance is 3 msec, for the Cl- inhibitory conductance 7 msec, and for the K+ inhibitory conductance 100 msec. Each conductance has a driving potential associated with it.
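The time course implied by these opening and closing time constants can be sketched as follows. The text gives the time constants and the 0.8 msec delay but not the functional form, so the difference-of-exponentials waveform, the peak normalization, and all names below are assumptions made only for illustration.

```python
import math

OPEN_TAU_MS = 1.0                      # opening time constant (from the text)
CLOSE_TAU_MS = {                       # closing time constants (from the text)
    "excitatory": 3.0,
    "Cl_inhibitory": 7.0,
    "K_inhibitory": 100.0,
}
SYNAPTIC_DELAY_MS = 0.8                # delay before the conductance activates


def _shape(t_ms, tau_open, tau_close):
    """Assumed waveform: difference of two exponentials (not stated in the text)."""
    return math.exp(-t_ms / tau_close) - math.exp(-t_ms / tau_open)


def peak_time_ms(tau_open, tau_close):
    """Time after activation at which the assumed waveform is maximal."""
    return tau_open * tau_close / (tau_close - tau_open) * math.log(tau_close / tau_open)


def conductance(t_ms, kind, g_peak=1.0):
    """Conductance t_ms after the presynaptic spike, normalized to peak at g_peak."""
    t = t_ms - SYNAPTIC_DELAY_MS
    if t <= 0.0:
        return 0.0
    tau_open, tau_close = OPEN_TAU_MS, CLOSE_TAU_MS[kind]
    peak = _shape(peak_time_ms(tau_open, tau_close), tau_open, tau_close)
    return g_peak * _shape(t, tau_open, tau_close) / peak
```

Under this assumed form the fast Cl- conductance has largely decayed a few tens of milliseconds after activation, whereas the K+ conductance is still substantial, which is the separation of time scales the parameters are meant to convey.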
Figure 1: Schematic representation of the local circuit used in the simulations of visual cortex consisting of an excitatory pyramidal cell (P), a feedback inhibitory interneuron (FB), and a feedforward inhibitory interneuron (FF). Darkened half-circles indicate inhibitory synapses; lightened half-circles excitatory synapses. [Labels in the figure: afferent input; rostrally directed assoc.; caudally directed assoc.; fb inhibitory.]
These potentials are 0 mV for the excitatory conductance, -65 mV for the Cl- inhibitory conductance, and -90 mV for the K+ inhibitory conductance. The compartmental models of the cells integrate the transmembrane and axial currents to produce transmembrane voltages. Basic membrane properties include membrane (r_m = 2000 Ω-cm²) and axial resistivity (r_a = 50 Ω-cm) and membrane capacitance (c_m = 1 μF/cm²) with a resting potential of -55 mV (assumed depolarized due to bias of spontaneous input). Excursions of the cell body membrane voltage above a specified threshold (normally distributed: μ = -40 mV, σ = 2 mV) trigger action potentials. Following an action potential, there is a 10-msec absolute refractory period during which the cell cannot fire another spike regardless of membrane potential. Additional details of these features of the model are described in Wilson and Bower (1990).
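The threshold and refractory rules can be illustrated with a short sketch. It is not the simulator's code: the level-crossing test on a sampled voltage trace, the absence of any reset, and the function names are assumptions; only the numerical values come from the text.

```python
import random

REFRACTORY_MS = 10.0          # absolute refractory period (from the text)
R_M_OHM_CM2 = 2000.0          # membrane resistivity (from the text)
C_M_UF_PER_CM2 = 1.0          # membrane capacitance (from the text)

# ohm-cm^2 times uF/cm^2 gives microseconds; divide by 1000 for milliseconds.
MEMBRANE_TAU_MS = R_M_OHM_CM2 * C_M_UF_PER_CM2 / 1000.0


def draw_threshold_mv(rng, mean_mv=-40.0, sd_mv=2.0):
    """Draw one cell's spike threshold from the stated normal distribution."""
    return rng.gauss(mean_mv, sd_mv)


def detect_spikes(voltages_mv, dt_ms, threshold_mv):
    """Return spike times (msec) for a sampled somatic voltage trace.

    A spike is emitted whenever the voltage exceeds threshold and at least
    REFRACTORY_MS has elapsed since the previous spike.
    """
    refractory_steps = int(round(REFRACTORY_MS / dt_ms))
    spike_times = []
    last_index = None
    for i, v in enumerate(voltages_mv):
        ready = last_index is None or i - last_index >= refractory_steps
        if v > threshold_mv and ready:
            spike_times.append(i * dt_ms)
            last_index = i
    return spike_times
```

Two consequences follow directly from the stated numbers: the passive membrane time constant is 2 msec, and the refractory period caps the firing rate of any single cell at 100 spikes/sec.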
This model is intended to represent a 10 × 6 mm region of visual cortex. The many millions of actual neurons in this area are represented here by 375 cells (25 × 15) of each of the three types for a total of 1125 cells. Input to the model is provided by 100 independent fibers, each making contact within a local cortical region (1 mm²), and each reflecting the retinotopic organization of many structures in the visual system (Van Essen 1979). The model also includes excitatory horizontal fiber connections between pyramidal cells (Gilbert 1983) (Fig. 1) that extend over a radius of 3 mm from each pyramidal cell (conduction velocity = 0.85 ± 0.13 m/sec; lower bound = 0.45, upper bound = 1.25). Inhibitory cells receive input from pyramidal cells within a 2 mm radius and make connections with pyramidal cells over a radius of 1 mm (Fig. 1) (inhibitory conduction velocity = 1.0 ± 0.06 m/sec; lower bound = 0.8, upper bound = 1.2). The influence of each of these fiber systems falls off exponentially with a space constant of 5 mm. No effort was made to reproduce the periodic structure of actual connections or many other known features of visual cortex. Instead, our intention was to reproduce oscillations characteristic of visual cortex using a small but sufficient set of physiological and anatomical features.
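The geometry just described can be sketched in a few lines. The grid placement (cells at the centers of 0.4-mm squares), the hard cutoff at the connection radius, and the resampling of out-of-bounds velocities are assumptions introduced for illustration; the text specifies only the grid size, radii, space constant, and velocity statistics.

```python
import math
import random

GRID_X, GRID_Y = 25, 15                  # cells per type along each axis
EXTENT_X_MM, EXTENT_Y_MM = 10.0, 6.0     # modeled cortical region
SPACE_CONSTANT_MM = 5.0                  # exponential falloff of fiber influence
HORIZONTAL_RADIUS_MM = 3.0               # reach of pyramidal horizontal fibers


def cell_position_mm(ix, iy):
    """Center of grid square (ix, iy); assumed placement, 0.4 mm spacing."""
    return ((ix + 0.5) * EXTENT_X_MM / GRID_X, (iy + 0.5) * EXTENT_Y_MM / GRID_Y)


def horizontal_weight(distance_mm, radius_mm=HORIZONTAL_RADIUS_MM):
    """Relative connection strength: exponential falloff inside the radius."""
    if distance_mm <= 0.0 or distance_mm > radius_mm:
        return 0.0
    return math.exp(-distance_mm / SPACE_CONSTANT_MM)


def sample_velocity(rng, mean=0.85, sd=0.13, low=0.45, high=1.25):
    """Conduction velocity (m/sec) drawn from a bounded normal distribution."""
    while True:  # assumed handling of the bounds: redraw until inside them
        v = rng.gauss(mean, sd)
        if low <= v <= high:
            return v


def conduction_delay_ms(distance_mm, velocity_m_per_s):
    """Axonal delay; mm divided by m/sec is numerically msec."""
    return distance_mm / velocity_m_per_s


def horizontal_targets(ix, iy, rng):
    """List (target, weight, delay_ms) for one pyramidal cell's horizontal fibers."""
    x0, y0 = cell_position_mm(ix, iy)
    targets = []
    for jx in range(GRID_X):
        for jy in range(GRID_Y):
            x1, y1 = cell_position_mm(jx, jy)
            d = math.hypot(x1 - x0, y1 - y0)
            w = horizontal_weight(d)
            if w > 0.0:
                targets.append(((jx, jy), w, conduction_delay_ms(d, sample_velocity(rng))))
    return targets
```

For example, `horizontal_targets(12, 7, random.Random(0))` lists every pyramidal cell within 3 mm of the central cell together with a relative weight and an axonal delay.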
3 Coherent Oscillations
Figure 2 shows auto- and cross-correlations of simulated pyramidal cell spike activity recorded from two sites in visual cortex separated by 6 mm. Total cross-correlations in the modeled data were computed by averaging correlations from 50 individual 500 msec trials. Within each trial, simulated activity was generated by providing input representing bars of light at different locations in the visual field. In these cases, the model produced oscillatory auto- and cross-correlations with peak energy in the 30-60 Hz range, consistent with experimental data (Gray et al. 1989). As in the experimental data, the model also produced nearly synchronous oscillatory activity in groups of neurons separated by 6 mm when presented with a continuous bar (Fig. 2A). A broken bar that did not stimulate the region between the recording sites produced a weaker response (Fig. 2B), again consistent with experimental evidence (Gray et al. 1989). Shuffling trials with respect to each other prior to calculating cross-correlation functions greatly diminished or completely eliminated oscillations. The same technique applied to actual physiological data yields similar results (Gray and Singer 1989), indicating that while the oscillations are stimulus dependent, they are not stimulus locked. Simulations run in the absence of stimuli produced low baseline activity with no oscillations. Further analysis of the model's behavior revealed that the 30-60 Hz oscillations are primarily determined by the amplitude and time course of the fast feedback inhibitory input. Increasing the amplitude of the inhibitory input to pyramidal cells reduced oscillatory frequency, while
reducing inhibition produced an increase in frequency. Allowing inhibitory cells to inhibit each other within a local region improved frequency locking and produced auto- and cross-correlations with more pronounced oscillatory characteristics.
Figure 2: Simulated auto- and cross-correlations generated by presentation of a broken bar (A) and a continuous bar (B) over 500 msec. Upper diagrams show the model with the stimulus region shaded. Grid squares correspond to the location of modeled cells. The numbers indicate the location of the recording sites referred to in the auto- (1-1, 2-2) and cross- (1-2) correlations. Methods: Multineuronal activity used to produce the correlations was summed from the 9 neurons nearest each recording site. The stimulus was generated using independent Poisson processes on individual input fibers. The Poisson rate parameter was increased from a baseline of 20 to 500 spikes/sec over the onset period from 20 to 100 msec. The difference in phase between the firing of cells in these locations was estimated by measuring the offset of the dominant peak in the cross-correlation function. These values were consistent with measurements obtained both through chi-square fitting of a modified cosine function and measurement of the phase of the peak frequency component in the correlation function power spectra.
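The trial-averaging and trial-shuffling procedure used for the correlations in this section can be sketched as follows. This is a minimal illustration, not the analysis code used for the figures: the mean subtraction, the pairing of each trial with the preceding one as the shuffle, and the synthetic data generator are assumptions, and all names are hypothetical.

```python
import math
import random


def cross_correlogram(a, b, max_lag):
    """Mean-subtracted cross-correlation of two equal-length binned trains.

    Returns values for lags -max_lag..max_lag (in bins); a peak at a positive
    lag means that train b tends to follow train a.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    a0 = [x - mean_a for x in a]
    b0 = [y - mean_b for y in b]
    values = []
    for lag in range(-max_lag, max_lag + 1):
        start, stop = max(0, -lag), n - max(0, lag)
        values.append(sum(a0[t] * b0[t + lag] for t in range(start, stop)))
    return values


def trial_averaged_correlogram(site1, site2, max_lag, shuffle=False):
    """Average the correlogram over trials (lists of per-trial binned trains).

    With shuffle=True each site-1 trial is paired with a different site-2
    trial, which removes correlations that are not locked to stimulus onset.
    """
    if shuffle:
        site2 = site2[-1:] + site2[:-1]
    total = [0.0] * (2 * max_lag + 1)
    for a, b in zip(site1, site2):
        for i, value in enumerate(cross_correlogram(a, b, max_lag)):
            total[i] += value
    return [value / len(site1) for value in total]


def peak_lag(correlogram, max_lag):
    """Lag (in bins) of the largest correlogram value, a simple phase estimate."""
    return max(range(len(correlogram)), key=lambda i: correlogram[i]) - max_lag


def synthetic_trials(n_trials, n_bins, rng, freq_hz=40.0, bin_ms=1.0, p_mean=0.2):
    """Hypothetical data: both sites share a rhythm whose phase varies by trial."""
    site1, site2 = [], []
    for _ in range(n_trials):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        probs = [p_mean * (1.0 + math.cos(2.0 * math.pi * freq_hz * t * bin_ms / 1000.0 + phase))
                 for t in range(n_bins)]
        site1.append([1 if rng.random() < p else 0 for p in probs])
        site2.append([1 if rng.random() < p else 0 for p in probs])
    return site1, site2
```

Calling `trial_averaged_correlogram` on the output of `synthetic_trials` with and without `shuffle=True` shows the effect described in the text: a rhythm that is shared within a trial but not locked to stimulus onset survives trial averaging and is largely removed by shuffling.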
4 Dependence on Horizontal Interconnections
While the frequency of oscillations was primarily due to local inhibitory circuitry, the coherence in correlated cell firing appears to be related primarily to activity in the horizontal interconnections between pyramidal cells. When all long-range (> 1 mm) horizontal fibers were eliminated, the autocorrelations at each recording site continued to show strong oscillatory behavior, but oscillations in the cross-correlation function vanished (Fig. 3A). Increasing the range of horizontal fibers to 2 mm restored coherent oscillatory behavior (Fig. 3A). The dependence of phase coherence on horizontal connections immediately raises a number of interesting questions. First, because horizontal fibers have finite conduction velocities, it was surprising that they would produce coherence with zero phase over relatively long distances. If phase coherence was strictly a consequence of horizontal fiber coupling between the recorded cell groups, it seems reasonable to expect a phase difference related to the propagation delay. To explore this further, we reduced the propagation velocity of horizontal fibers from 0.86 ± 0.13 m/sec to 0.43 ± 0.13 m/sec and examined the response to a continuous bar. No effect on phase was found in the cross-correlation function. If, however, the degree of horizontal fiber coupling was enhanced by increasing synaptic weights along horizontal pathways, the cortex displayed a transition from near-zero phase coherence to a phase shift consistent with the delay along the shortest horizontal interconnection path (Fig. 3B). To examine this result more closely, we analyzed the time course of phase coherence at successive time periods following stimulus onset in both the strong and weakly coupled cases. Initially, in both conditions, the synchronizing effect of the stimulus onset itself produces a tendency for zero-phase correlations during the period from 0 to 125 msec (Fig. 3B).
However, in the periods following the onset of the stimulus, when activity is dominated by horizontal fiber effects (125-500 msec), the response differs in the two cases. With enhanced horizontal fiber coupling, nonzero phase shifts emerge that reflect the propagation delays along horizontal fibers (Fig. 3B). However, in the weak coupling case, zero-phase correlations persist, decaying over the entire trial interval (0-500 msec).
5 Mechanisms Governing Coherence
Analysis of the activity patterns generated in the weak coupling condition indicates that the mechanism that sustains the zero-phase bias between distant cell groups after stimulus onset depends on the activation of spatially intermediate cells via horizontal fibers. When this intermediate population of cells is activated by the single stimulus bar, they can activate adjacent cells through their own horizontal fibers in a phase-symmetric fashion. When these intermediate cells are not activated
directly by the stimulus, as in the case of the discontinuous bar, their ability to coactivate adjacent cell populations is diminished, resulting in a reduction in observed long-range phase coherence. Increasing the strength of horizontal connections establishes a path of direct polysynaptic coupling between distant sites, which gives rise to systematic phase shifts related to propagation delay.
Figure 3: (A) Cross-correlations between sites 1 and 2 (see Fig. 2) for a continuous bar stimulus with radius of horizontal fiber coupling of 2 mm (left) and 1 mm (right). (B) Time course of cross-correlation functions taken at successive 125-msec intervals over the 500-msec period for relative horizontal fiber coupling strengths of 1 (left) and 1.5 (right). The bottom-most correlation function covers the entire 500-msec interval.
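The size of the phase shift expected from propagation delay alone can be estimated with simple arithmetic. The sketch below assumes a straight path between the two recording sites, ignores the synaptic delays of the intermediate relays, and uses 40 Hz as a representative frequency within the 30-60 Hz band; the function names are hypothetical.

```python
def propagation_delay_ms(distance_mm, velocity_m_per_s):
    """Axonal conduction time; mm divided by m/sec is numerically msec."""
    return distance_mm / velocity_m_per_s


def phase_shift_deg(delay_ms, frequency_hz):
    """Phase lag, in degrees, that a pure delay produces at a given frequency."""
    period_ms = 1000.0 / frequency_hz
    return 360.0 * delay_ms / period_ms


if __name__ == "__main__":
    for velocity in (0.86, 0.43):   # normal and halved mean velocities from the text
        delay = propagation_delay_ms(6.0, velocity)
        print(f"{velocity} m/sec: {delay:.1f} msec, "
              f"{phase_shift_deg(delay, 40.0):.0f} deg at 40 Hz")
```

For the 6-mm separation between sites, the delay is about 7 msec at 0.86 m/sec and about 14 msec at 0.43 m/sec. Both are a sizable fraction of a 25-msec cycle, so a shift of this size would be easy to see in the cross-correlation if direct coupling dominated.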
Figure 4: Cross-correlations between sites along a 12-mm stimulus bar.
The model's dependence on horizontal connections for phase coherence leads directly to the prediction that the areal extent of strongest correlations should be related to the spatial spread of the horizontal fibers. This effect was demonstrated in the model by increasing the size of the stimulus bar from 6 to 12 mm in an enlarged cortical simulation in which the horizontal fibers remained at a length of 3 mm. Under these conditions, oscillatory correlations were not found between distant recording sites (1,3 in Fig. 4). Interestingly, correlations were still found between recording points separated by no more than 6 mm (pairs 1,2 and 2,3 in Fig. 4). This absence of transitivity demonstrates the presence of within- and between-trial variations in phase relationships and suggests that the
observed zero-phase phenomena may be present only in the average of multiple trials. Overall, our simulation results suggest that the oscillatory patterns so far reported to exist in visual cortex can be explained by mechanisms that are entirely intrinsic to the cortical region and do not require an extrinsic driving mechanism (cf. Kammen et al. 1989). In the current simulations of visual cortex, we have used long bar stimuli to make the additional prediction that the more restricted extent of horizontal connections should limit coherent correlated activity to an area twice the radius of the horizontal fibers [4-12 mm in cats and monkeys (Gilbert 1983)]. More extensive correlations within primary visual cortex would imply either an additional intrinsic mechanism (e.g., long distance inhibitory coupling) or a more global synchronizing mechanism (Kammen et al. 1989). Even if such mechanisms exist, it is likely that they will be coordinated with intrinsic cortical mechanisms.
6 Local Field Potentials
In addition to the observation of synchronized unit activity at spatially separated sites, experimental results also indicate zero-phase synchronization of the oscillatory local field potential (LFP) (Eckhorn et al. 1988; Engel et al. 1990). Because these potentials are generated principally by dendritic currents summed over a large number of cells, LFPs can be evaluated on a trial-by-trial basis. The observation that these potentials are also at zero phase has been taken as evidence for zero-phase relationships between neuronal spiking on individual trials. In interpreting this observation it must be noted that synchronized LFP is often observed in the absence of unit activity, and that the stimulus specificity of synchronized LFP responses differs from that of unit responses, indicating a dissociation of the mechanisms giving rise to the two phenomena (Gray and Singer 1989; Engel et al. 1990). To understand how synchronized LFP responses could be observed with underlying phase-variable unit responses, it is important to note that LFPs reflect the average input and activity of large numbers of cells. Thus, the presence of adjacent, relatively independent, oscillatory cell groups is sufficient to explain the LFP synchronization in the presence of nonzero instantaneous unit phase relationships, although this has not been directly tested within the model. For example, two groups of units A and B could show correlated oscillatory behavior with a variable or even consistently nonzero instantaneous phase relationship. If the LFP reflected only the behavior of this population of cells, then the LFP would be expected to reflect the nonzero phase coherence of unit responses. But the presence of additional cell groups, adjacent to but independent of the first group, with
different instantaneous phase relationships would produce a summed contribution to the LFP that would show zero-phase coherence. Thus, zero-phase as observed in the correlated unit activity may be dependent on trial averaging while zero-phase in the coherence of LFPs may result from spatial averaging of adjacent cell populations with different response properties.
7 Significance of Phase Relationships
Beyond providing a structural explanation for the properties of visual cortical oscillations, our results also have implications for several recently proposed functional interpretations of the observed stimulus-dependent zero-phase coherence. Several researchers have proposed the use of these phase relationships as a means of cortically segmenting, or labeling, different objects in a visual scene (Eckhorn et al. 1988; Gray et al. 1989; Gray and Singer 1989; Sporns et al. 1989; Kammen et al. 1989). Associated with this idea, models have been generated that produce the instantaneous phase effects presumably necessary for the visual system to make use of such a coding mechanism on single stimulus trials (Kammen et al. 1989). If our results are correct, however, zero-phase relationships between particular neurons should exist, on average, only over multiple trials. The absence of consistent within-trial coherence over long distances would be expected to seriously confound the interpretation of fine phase differences in higher visual processing areas (Wilson and Bower 1990). Our simulations suggest that the oscillatory behavior seen in visual cortex may be dependent on horizontal interactions that are capable of modulating the responses of widely separated neurons. While the computational function of these types of interactions within the actual cortex is not yet understood, the lateral spread of information could be involved in reinforcing the continuity of visual objects, in modulating classical receptive field properties (Tso et al. 1986; Mitchison and Crick 1982), or in establishing nonclassical receptive field structure (Allman et al. 1985). The stimulus dependence of coherence in the model is observed to result from the modulation of the magnitude of these interactions as a function of stimulus structure.
Under this interpretation, phase coherence does not in itself encode information necessary for subsequent processing, but rather, phase relationships emerge as a result of the horizontal integration of information involved in the shaping of receptive field properties.
8 General Cerebral Cortical Processing
For the last several years we have been using biologically realistic computer simulations to study the oscillatory behavior of another primary sensory region of cerebral cortex, the olfactory or piriform cortex (Wilson and Bower 1988, 1989, 1990, 1992). This structure is also known to
generate oscillatory activity in the 40 Hz range under a variety of experimental conditions (Adrian 1942; Freeman 1968, 1978). It is interesting to note that the neural mechanisms that generate the oscillatory behavior described here in the visual cortex model are also capable of reproducing the basic frequency and phase relationships of olfactory cortex. In each case inhibitory neurons govern the frequency of the oscillations while the long-range horizontal connections are involved in establishing specific phase relationships. Our work in piriform cortex suggests that the 40 Hz cycle reflects a fundamental cortical processing interval while phase relationships, as in the model of visual cortex, reflect the structure of intercellular communication within the network (Wilson and Bower 1992). If true, then this 40 Hz oscillatory structure may reflect very general properties of cerebral cortical function.
Acknowledgments
This research was supported by the NSF (EET-8700064), the ONR (N00014-88-K-0513), and the Lockheed Corporation. We wish to thank Christof Koch and Dan Kammen for valuable discussions.
References
Adrian, E. D. 1942. Olfactory reactions in the brain of the hedgehog. J. Physiol. (London) 100, 459-472.
Allman, J., Miezin, F., and McGuinness, E. 1985. Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Ann. Rev. Neurosci. 8, 407-430.
Bressler, S. L., and Freeman, W. J. 1980. Frequency analysis of olfactory system EEG in cat, rabbit and rat. Electroenceph. Clin. Neurophysiol. 50, 19-24.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Freeman, W. J. 1968. Relations between unit activity and evoked potentials in prepyriform cortex of cats. J. Neurophysiol. 31, 337-348.
Freeman, W. J. 1978. Spatial properties of an EEG event in the olfactory bulb and cortex. Electroenceph. Clin. Neurophysiol. 44, 586-605.
Gilbert, C. D. 1983. Microcircuitry of the visual cortex. Ann. Rev. Neurosci. 6, 217-247.
Gray, C. M., and Singer, W. 1987. Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Soc. Neurosci. Abstr. 404, 3.
Gray, C. M., and Singer, W. 1989. Stimulus specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Kammen, D. M., Holmes, P. J., and Koch, C. 1989. Cortical architecture and oscillations in neuronal networks: Feedback versus local coupling. In Models of Brain Function, R. M. J. Cotterill, ed. Cambridge Univ. Press, Cambridge.
Llinas, R. 1988. The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science 242, 1654-1664.
Mitchison, G., and Crick, F. 1982. Long axons within the striate cortex: Their distribution, orientation and patterns of connection. Proc. Natl. Acad. Sci. U.S.A. 79, 3661-3665.
Sporns, O., Gally, J. A., Reeke, G. N., Jr., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Tso, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Van Essen, D. C. 1979. Visual areas of the mammalian cerebral cortex. Ann. Rev. Neurosci. 2, 227-263.
Wilson, M. A., and Bower, J. M. 1988. A computer simulation of olfactory cortex with functional implications for storage and retrieval of olfactory information. In Neural Information Processing Systems, D. Z. Anderson, ed. AIP Press, New York.
Wilson, M. A., and Bower, J. M. 1989. The simulation of large scale neuronal networks. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 291-334. MIT Press, Cambridge, MA.
Wilson, M. A., and Bower, J. M. 1990. Computer simulation of oscillatory behavior in cerebral cortical networks. In Advances in Neural Information Processing Systems, Vol. 2, D. Touretzky, ed., pp. 84-91. Morgan Kaufmann, San Mateo, CA.
Wilson, M. A., and Bower, J. M. 1992. Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. J. Neurophysiol., in press.
Received 23 July 1990; accepted 5 June 1991.
Communicated by Christof Koch
Segmentation, Binding, and Illusory Conjunctions
D. Horn
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
D. Sagi
M. Usher
Department of Applied Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 76700, Israel
Neural Computation 3, 510-525 (1991) © 1991 Massachusetts Institute of Technology

We investigate binding within the framework of a model of excitatory and inhibitory cell assemblies that form an oscillating neural network. Our model is composed of two such networks that are connected through their inhibitory neurons. The excitatory cell assemblies represent memory patterns. The latter have different meanings in the two networks, representing two different attributes of an object, such as shape and color. The networks segment an input that contains mixtures of such pairs into staggered oscillations of the relevant activities. Moreover, the phases of the oscillating activities representing the two attributes in each pair lock with each other to demonstrate binding. The system works very well for two inputs, but displays faulty correlations when the number of objects is larger than two. In other words, the network conjoins attributes of different objects, thus showing the phenomenon of "illusory conjunctions," as in human vision.

1 Introduction

Recent observations of synchronous oscillatory behavior of neural firings (Eckhorn et al. 1988; Gray et al. 1989) have strengthened the idea that temporal correlations are the means by which binding is achieved (von der Malsburg and Schneider 1986). The binding problem may be viewed as the quest for a mechanism uniting parts of incoming sensory information into coherent activation patterns representing objects or situations in the external world. In the case when the assembled parts are essential for the object's identity (as when the object is defined by specific relations between its parts), this mechanism could be provided by an underlying synaptic connectivity reflecting prior knowledge. A theoretical attempt in this direction was made by several researchers (Sompolinsky et al. 1989; Kammen et al.
1989) who showed that synchronized oscillations between distant neural populations may be obtained, once an explicit connectivity among the neural populations is assumed. The binding problem is more acute in the case when the relation among the parts of the integrated objects is of contingent nature, that is, when the parts do not bear any relations essential to the identity of the objects. In this case a mechanism that does not rely on a priori connectivity should be provided. We will limit our discussion to the binding of intramodality information. To illustrate the problem consider the case studied in psychophysical experiments (Treisman and Schmidt 1982), in which an observer is presented with a display consisting of three colored shapes, for example, a red diamond, a blue square, and a green circle. If we suppose that shapes and colors are stored in different cortical modules (networks), we are faced with the double problem of segmentation and binding. That is, the “shape” module should recognize and segment the shapes, while the ”color” module should recognize and segment the colors. The binding problem, then, is to provide the correct matching between the shapes and their corresponding colors. Treisman and Schmidt suggested that correct matching can be obtained by the human visual system only when focusing attention on each of the objects separately, otherwise illusory conjunctions may occur. We wish to study this problem within a model of coupled oscillatory networks that receive a mixed input, as illustrated schematically in Figure 1. Such a system is then required to perform simultaneously both segmentation and binding. Segmentation is the task of parallel retrieval of the individual memorized patterns composing the input. This can be achieved in oscillating networks as was demonstrated by Wang et al. (1990) and by Horn and Usher (1991). 
What happens is that the activities of the different memory patterns that are turned on by the input oscillate in a staggered fashion. Binding is modeled by assuming that patterns corresponding to the related attributes oscillate in phase (e.g., the activity of the pattern representing the shape ”diamond” should oscillate in phase with the activity of the color ”red”). Modeling the binding process is especially challenging since no a priori stored synaptic structure relating the corresponding patterns is allowed, due to the fact that their relation is contingent. We will show how a solution to the binding problem is achieved by using a mechanism based on enhancement of noise correlations. Moreover we will show that for more than two input patterns synchronization faults occur. These faults may provide a natural explanation for perceptual errors of the illusory conjunction type. The neural networks that we study are based on coupled formal neurons that possess dynamic thresholds that exhibit adaptation: they vary as a function of the activity of the neurons to which they are attached. As such they introduce time dependence, which can turn a neural network from a dissipating system that converges onto fixed points into one that
moves from one center of attraction to another (Horn and Usher 1989). Here we will use a variant of these models (Horn and Usher 1990) that is based on a model of excitatory and inhibitory neurons. This model is explained briefly in the next section. It can be expressed in terms of a set of differential equations that involves the activities of the excitatory cell assemblies that represent the memories of this model. In the following section we describe how two such networks can be coupled through their inhibitory neurons in a nonsemantic fashion, that is, no explicit connections exist between the patterns that are to be bound. We show that this coupling leads to matching their periods and phases. The next section is devoted to an explanation of how binding is achieved. Afterward we turn to an analysis of the performance of our model and dwell on the illusory conjunctions that it exhibits when the number of inputs is larger than two.

Figure 1: Schematic drawing of a problem of joint segmentation and binding. The example of objects of different shapes and colors will be discussed throughout this paper. The inhibitory connection between the two networks is an important element of our solution.

2 The E-I Model
The system that we will study is based on a model of excitatory and inhibitory neurons with dynamic thresholds. These two kinds of neurons are assumed to have excitatory and inhibitory synapses exclusively. Memory patterns are carried by the excitatory neurons only. Furthermore, we make the simplifying assumption that the patterns do not overlap with one another, that is, the model is composed of disjoint Hebbian
cell assemblies of excitatory neurons that affect one another only through their interaction with a group of inhibitory neurons common to all of them. We refer to a previous paper (Horn and Usher 1990) for details of the microstructure of this model. Here we will limit ourselves to its description in terms of differential equations for the activities of the cell assemblies. To start with let us consider the case of static thresholds. We denote by $m^\mu(t)$ the fraction of cell assembly number $\mu$ that fires at time $t$, and by $m^I(t)$ the fraction of active inhibitory neurons. We will refer to $m^\mu$ as the activity of the $\mu$th memory pattern. There are $p$ different memories in the model, and their activities obey the following differential equations

$$
\begin{aligned}
dm^\mu/dt &= -m^\mu + F_T(A m^\mu - B m^I - \theta^E) \\
dm^I/dt &= -m^I + F_T(C M - D m^I - \theta^I)
\end{aligned}
\tag{2.1}
$$

where $\theta^E$ and $\theta^I$ are the (constant) thresholds of all excitatory and inhibitory neurons, respectively. The four parameters $A$, $B$, $C$, and $D$ are all positive and represent the different couplings between the neurons. This system is an attractor neural network. It is a dissipative system that flows into fixed points determined by the memories. This system is a generalization of the E-I model of Wilson and Cowan (1972) in which we have introduced competing memory patterns. The latter make it into an attractor neural network. Wilson and Cowan have shown that a pair of excitatory and inhibitory assemblies, when properly connected, will form an oscillator. We induce oscillations in a different way, keeping the option of having the network behave either as an attractor neural network or as an oscillating one: we turn the thresholds of the excitatory neurons into dynamic variables. For this purpose we introduce new variables $r^\mu$ that represent the average alternating behavior of the thresholds of the excitatory neurons in cell assembly $\mu$, and change the $p$ equations of the excitatory neurons to the following $2p$ equations:
$$
\begin{aligned}
dm^\mu/dt &= -m^\mu + F_T(A m^\mu - B m^I - \theta^E - b r^\mu) \\
dr^\mu/dt &= (1/c - 1)\, r^\mu + m^\mu
\end{aligned}
\tag{2.3}
$$
For $c > 1$ and appropriate values of $g = bc/(c-1)$, this system exhibits local fatigue effects. The reason is simple. Imagine a situation in which the system would move into a fixed point $m^\mu = 1$. $r^\mu$ will then increase until it reaches the value $c/(c-1)$. This means that the argument of the $F_T$ function in the equation for $m^\mu$ decreases by $g$. If this overcomes the effect of the other terms the amplitude $m^\mu$ decreases and the system moves out of the attractor and falls into the basin of a different center of attraction. This process can continue indefinitely.
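The fatigue mechanism can be illustrated with a minimal numerical sketch of a single excitatory assembly obeying equations 2.3, together with the inhibitory equation of 2.1. Several ingredients here are assumptions rather than statements of the text: $F_T$ is taken to be a logistic sigmoid with temperature $T$, $M$ is taken to be the summed excitatory activity (which reduces to $m$ for one assembly), the initial state is arbitrary, and the function names are hypothetical. The parameter values are those quoted for Figure 3.

```python
import math

def F(x, T=0.1):
    """Logistic sigmoid with temperature T (an assumed form of F_T)."""
    return 1.0 / (1.0 + math.exp(-x / T))

def simulate_single(b=0.15, A=1.0, B=0.7, C=1.0, D=1.0, c=1.2,
                    theta_E=0.1, theta_I=0.55, dt=0.1, steps=1200):
    """Euler integration of one excitatory assembly with a dynamic threshold."""
    m, r, mI = 0.5, 0.0, 0.0          # arbitrary initial state (assumption)
    trace = []
    for _ in range(steps):
        dm = -m + F(A * m - B * mI - theta_E - b * r)
        dr = (1.0 / c - 1.0) * r + m
        dmI = -mI + F(C * m - D * mI - theta_I)   # M = m for a single assembly
        m, r, mI = m + dt * dm, r + dt * dr, mI + dt * dmI
        trace.append(m)
    return trace

trace = simulate_single()
late = trace[200:]                    # discard the initial transient
# Count upward crossings of m = 0.5 as a crude cycle counter.
cycles = sum(1 for prev, cur in zip(late, late[1:]) if prev < 0.5 <= cur)
print(f"g = bc/(c-1) = {0.15 * 1.2 / (1.2 - 1.0):.2f}")
print(f"min m = {min(late):.2f}, max m = {max(late):.2f}, cycles = {cycles}")
```

With these values the fatigue term $b r^\mu$ grows while the assembly is active until the activity collapses, and the assembly recovers once $r^\mu$ has decayed, so the activity keeps alternating between high and low values instead of settling into a fixed point.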
3 The Model of Coupled Networks
The system we study is presented diagrammatically in Figure 1. We realize it by using two E-I networks that are coupled through their I neurons. This type of coupling is chosen to avoid any a priori specific connection between memory patterns of the two different networks. They may still affect each other through their couplings to the connected sets of I neurons, but there is no explicit relation between the two sets of memories. Let us introduce also external inputs to the E neurons designated by $i^\mu$. The system of the two coupled networks then takes the form

$$
\begin{aligned}
dm^\mu_{1,2}/dt &= -m^\mu_{1,2} + F_T(A m^\mu_{1,2} - B m^I_{1,2} - \theta^E - b r^\mu_{1,2} + i^\mu_{1,2}) \\
dr^\mu_{1,2}/dt &= (1/c - 1)\, r^\mu_{1,2} + m^\mu_{1,2} \\
dm^I_{1,2}/dt &= -m^I_{1,2} + F_T(C M_{1,2} - D m^I_{1,2} - \theta^I - \lambda m^I_{2,1})
\end{aligned}
\tag{3.1}
$$
The subindices refer to the two different networks, whose only connection is through the term $\lambda m^I$ representing the coupling between the two sets of inhibitory neurons. We present in Figure 2 a schematic drawing of the relation between the variables that appear in this set of differential equations. Drawn here is one of the two networks with three memories and an input that feeds into two of them.

Figure 2: Schematic drawing of the relations between the variables of one of the two networks described by equations 3.1. Shown here is the case of three memories and two inputs.

Let us start our discussion of this system of differential equations by limiting ourselves to the case of a single excitatory cell assembly in each network. Assuming first no input and no coupling between the two networks we obtain the results shown in Figure 3. We have chosen the $b$ parameters to be different in the two networks, therefore we obtain oscillations with different frequencies. Figure 4 shows how the situation changes when the coupling between the two I assemblies is turned on, $\lambda = 1$. It is quite evident that this coupling forces the two networks to move in tandem. The common frequency is lower than the two frequencies of the free networks. We observe a difference in the shape and phase of the activities of the two networks, which is the remnant effect of the two different frequencies. The phase shift is particularly strong between the two I activities because they inhibit each other. The regular shape of the average I activity in the coupled case justifies a posteriori approximating its equation of motion by

$$
dm^I/dt = -m^I + F_T(C M - D m^I - \theta^I - \lambda m^I)
$$

meaning that its effective autoinhibition increased from $D$ to $D + \lambda$. This seems to be the reason for the lower overall frequency. Let us turn now to the general case of $p$ excitatory cell assemblies, $n$ of which receive a common input:
For $\lambda = 0$ we find the phenomenon of temporal segmentation discussed by Wang et al. (1990) and by Horn and Usher (1991). This means that different memories oscillate in a staggered manner, each one peaking at different times, thus leading to segmentation of the mixed input. This scenario works as long as $n$ is small. Once we couple the two inhibitory assemblies we may expect the oscillations of the two networks to match one another in period and phase. However, this matching will be random, since there is no reason for a particular cell assembly of one network to oscillate in phase with a particular one of the other network. How to achieve such binding will be discussed in the next section.

4 Binding by Correlated Fluctuations
Figure 3: Activities of two networks with one E assembly each, no input, and no coupling. The activities of the first and the second network are represented by full and dashed curves, respectively. The results are numerical solutions of the differential equations 3.1 using time steps of $dt = 0.1$ and parameters $A = 1$, $B = 0.7$, $C = 1$, $D = 1$, $T = 0.1$, $c = 1.2$, $\theta^E = 0.1$, $\theta^I = 0.55$. The parameters $b$ are chosen differently for the two networks, $b_1 = 0.15$ and $b_2 = 0.2$, hence the different frequencies of oscillation.

Our problem, which is symbolically presented in Figure 1, assumes that the two networks describe two attributes of objects that appear in a mixed form in the input. We expect our combined network to be able to segment this information and, moreover, to order the staggered oscillations in such a form that the activities of the two attributes of the same object have the same phase. To achieve the latter we make use of noisy inputs. For two attributes of the same object we assume that both are affected by some common random activity fluctuation. However, the noises affecting two different objects are assumed to be uncorrelated. The noise is transmitted together with the constant input to the relevant cell assemblies of the two networks. The inputs we use take the form

$$
i^\mu(t) = 0.1 + 0.1\,[\rho^\mu(t) - 0.5]
\tag{4.1}
$$
where $\rho^\mu$ is a random variable distributed between 0 and 1. The same input is used for both attributes that refer to the same object, $i^\mu_{1,2} = i^\mu$, yet different pairs of attributes are driven by different and uncorrelated random noises $\rho^\mu$. We solve numerically the differential equations 3.1 using small time steps of $dt = 0.1$ for each iteration. We assume that the inputs are updated on a time scale $\tau$ that is an order of magnitude larger, either $\tau = 1$ or 2. Correspondingly we represent the time scale in the following figures by integers.
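The numerical procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: $F_T$ is taken to be a logistic sigmoid, $M$ the summed excitatory activity of a network, $\rho^\mu$ uniformly distributed, the thresholds $r^\mu$ and the inhibitory activities start at zero, and the first $n$ assemblies of each network are the ones that receive input. The name `simulate_coupled` and its signature are hypothetical; the default parameters are those quoted for Figure 5.

```python
import math
import random

def F(x, T=0.1):
    """Logistic sigmoid with temperature T (an assumed form of F_T)."""
    return 1.0 / (1.0 + math.exp(-x / T))

def simulate_coupled(n=2, p=(5, 3), b=(0.1, 0.12), lam=1.2, tau=1.0,
                     t_end=100.0, dt=0.1, seed=0,
                     A=1.0, B=1.1, C=1.2, D=1.0, c=1.2,
                     theta_E=0.1, theta_I=0.55):
    """Euler-integrate equations 3.1 for two networks driven by inputs 4.1.

    Assembly mu of network 1 and assembly mu of network 2 (mu < n) share the
    same noise sample. Returns one list of activity snapshots per network.
    """
    rng = random.Random(seed)
    m = [[rng.random() for _ in range(p[k])] for k in (0, 1)]  # random start
    r = [[0.0] * p[k] for k in (0, 1)]
    mI = [0.0, 0.0]
    hold = max(1, round(tau / dt))        # Euler steps per noise sample
    history = ([], [])
    rho = [0.5] * n
    for step in range(round(t_end / dt)):
        if step % hold == 0:              # refresh the noise every tau
            rho = [rng.random() for _ in range(n)]
        inp = [0.1 + 0.1 * (x - 0.5) for x in rho]             # equation 4.1
        new_m, new_r, new_mI = [], [], []
        for k in (0, 1):
            M = sum(m[k])
            mk, rk = [], []
            for mu in range(p[k]):
                i_mu = inp[mu] if mu < n else 0.0
                drive = (A * m[k][mu] - B * mI[k] - theta_E
                         - b[k] * r[k][mu] + i_mu)
                mk.append(m[k][mu] + dt * (-m[k][mu] + F(drive)))
                rk.append(r[k][mu] + dt * ((1.0 / c - 1.0) * r[k][mu] + m[k][mu]))
            inh = C * M - D * mI[k] - theta_I - lam * mI[1 - k]
            new_m.append(mk)
            new_r.append(rk)
            new_mI.append(mI[k] + dt * (-mI[k] + F(inh)))
        m, r, mI = new_m, new_r, new_mI
        history[0].append(m[0][:])
        history[1].append(m[1][:])
    return history

h1, h2 = simulate_coupled()
print(len(h1), len(h1[0]), len(h2[0]))    # time steps, assemblies per network
```

Setting `lam=0.0` decouples the two networks, so the trajectory of the first network no longer depends on the parameters of the second. That gives a simple sanity check on the coupling term before looking at phase relations.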
Figure 4: The result of turning on the coupling $\lambda = 1$ between the two networks of the previous figure.
Figure 5 describes results when two pairs of inputs of the type of equation 4.1 were used. Starting from random initial conditions we observe correct phase correlations after 10 time units, turning into almost perfect binding after 30 time units. Binding occurs almost instantly if one starts from zero (instead of random) initial conditions for the activities. In this figure we show, in addition to the activities of the two different cell assemblies in the two networks, the random noises used for the two pairs of inputs. Note that the time scale of phase-locked oscillations is much larger than that of the autocorrelations of the fluctuating noise (which is $\tau = 1$). In the case of three objects, shown in Figure 6, it takes a longer time to achieve correct binding. Moreover, we have noticed that the system can move out of correct binding into erroneous phase correlations, of the type shown here from $t = 30$ to 90. In order to quantify the binding quality we measure the fraction of correct activity correlations: (4.2)
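The expression for equation 4.2 is not legible in this copy, so the following is only one plausible reading of "fraction of correct activity correlations", offered as an assumption: the time-summed co-activity of matched pairs divided by the time-summed co-activity of all pairs across the two networks. The function name `binding_fraction` and the snapshot format are hypothetical.

```python
def binding_fraction(h1, h2, n):
    """Share of the summed cross-network co-activity carried by matched pairs.

    h1, h2: lists of activity snapshots (one list of floats per time step).
    Only the first n assemblies of each network (those receiving input) count.
    This is an assumed measure, not the paper's equation 4.2.
    """
    matched = total = 0.0
    for a, b in zip(h1, h2):
        for mu in range(n):
            for nu in range(n):
                co = a[mu] * b[nu]
                total += co
                if mu == nu:
                    matched += co
    return matched / total if total > 0 else float("nan")

# Two objects whose attributes alternate perfectly in phase ...
bound = binding_fraction([[1, 0], [0, 1]] * 5, [[1, 0], [0, 1]] * 5, n=2)
# ... versus the same activities with the second network's phases swapped.
swapped = binding_fraction([[1, 0], [0, 1]] * 5, [[0, 1], [1, 0]] * 5, n=2)
print(bound, swapped)   # 1.0 0.0
```

Under this reading, activities with no phase preference give a value of $1/n$, which matches the chance level used later when the significance of binding is defined.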
Figure 5: The first two frames exhibit activities of memory patterns (excitatory cell assemblies) of two coupled networks. The parameters $b$ are 0.1 and 0.12. The other parameters used here and in all following figures are $A = 1$, $B = 1.1$, $C = 1.2$, $D = 1$, $T = 0.1$, $c = 1.2$, $\theta^E = 0.1$, $\theta^I = 0.55$, $\lambda = 1.2$. The first network has five memory patterns and the second has three. Two cell assemblies receive inputs of the type of equation 4.1. The activities of these two memories are shown by the full line and the dashed line. The dot-dashed curve represents an activated memory that does not receive an input. We observe both segmentation and binding. Segmentation means that the two different patterns in the two networks oscillate in a staggered fashion, and binding means that the patterns that are associated with one another oscillate in phase. The association is brought about by the common noise, which is shown in the third frame.
For the case of $n = 2$ in networks with different parameters ($b_1 = 0.1$, $b_2 = 0.15$) we find high correct correlations, $B = 0.83$. In general binding is best when the frequencies of the two coupled networks are identical. Nonetheless, when we turn to $n = 3$ in networks with identical parameters as shown in Figure 6, we find that $B$ reduces to $0.41 \pm 0.02$. Better performance is obtained if we allow the noise correlation time to be longer, for example, if we change $\rho$ only every two time units ($\tau = 2$). This leads to $B = 0.51 \pm 0.02$.
Figure 6: The behavior of the joint networks in the case of three inputs. Using equal frequencies, $b_1 = b_2 = 0.1$, we find that temporal segmentation works very well but binding is less successful than in the case of two patterns. Associated patterns are represented by the same type of lines in the two different networks.
5 Binding Errors and Illusory Conjunctions

We saw that the binding obtained by our model is not perfect and that some degree of erroneous matching of oscillations occurs. The frequency of these matching errors increases when the number of displayed objects increases from two to three. We propose that this could be the mechanism responsible for the phenomenon of illusory conjunctions. Let us briefly describe the outcome of a typical experiment in the illusory conjunction paradigm. When an observer is presented with a display containing several visual shape-color patterns for a short exposure time, and when the experimental setup spreads the observer's attention over the whole display, perceptual errors occur (such as reporting a green diamond when presented with a red diamond and a green circle). As we previously mentioned, Treisman and Schmidt (1982) suggest that integrating shape and color information related to one object requires focusing attention on the object. Thus when attention is distributed across several
objects, incorrect matching occurs. However, in all illusory conjunction experiments a considerable number of correct responses is obtained, even when the experiment is designed to maximize the size of the attention window. As we will show, our model can provide the explanation. In the previous section we have defined the fraction of correct binding $B$ and have given several numerical examples. Thinking of our model as a binding predictor we should, however, take into account that an observer randomly conjoining the shapes and colors of $n$ objects will obtain $B = 1/n$ purely by guessing. To correct for this trivial baseline we define the significance of the binding probability by

$$
S = \frac{B - 1/n}{1 - 1/n}
\tag{5.1}
$$
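As a quick numerical check of equation 5.1, the short sketch below applies it to the binding fractions quoted in the previous section. The function name `significance` is hypothetical; the $(n, B)$ pairs are the ones reported in the text.

```python
def significance(B, n):
    """Equation 5.1: rescale the binding fraction so that chance (1/n) maps to 0."""
    chance = 1.0 / n
    return (B - chance) / (1.0 - chance)

# (n, B) pairs quoted in the text: n = 2; n = 3 with tau = 1; n = 3 with tau = 2.
for n, B in [(2, 0.83), (3, 0.41), (3, 0.51)]:
    print(f"n = {n}, B = {B:.2f} -> S = {significance(B, n):.3f}")
```

For these inputs the arithmetic gives $S \approx 0.66$, $0.115$, and $0.265$, so the drop from two to three objects is considerably sharper in $S$ than in $B$ itself.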
The denominator serves as a normalization factor, allowing $S$ to vary between 0 and 1. To demonstrate the systematic trend of our model we display in Figure 7 both $B$ and $S$ as a function of $n$, the number of objects, for two coupled networks of the same frequency. Clearly $n = 3$ is the worst case. Higher $n$ seem to lie on a plateau of $S = 0.3$. Explaining illusory conjunctions by our model, we expect their number to increase strongly when the number of objects increases from 2 to 3, but to level off after that at a rate that is significantly different from pure chance.

Until now we have discussed the case in which all color and shape patterns were different. It is interesting to examine the model in the case in which one of the attributes is shared by several objects. Consider the case in which the display consists of three objects, two of which share the same color, for example, using a green square instead of a blue one in the example of Figure 1. The repeated color (green) is represented just once in the color network, but it will receive a large input, which is the linear sum of the inputs of the two objects that share the same color. The result of such a simulation is illustrated in Figure 8. We observe that the repeated color is indeed strongly enhanced. The amount of conjunctions between the repeated color (green, represented by the dashed curve) and the unrelated shape (diamond, represented by a full line) is higher than that for the unrepeated color (red) and an unrelated shape (square or circle). In this asymmetric case it is advantageous to consider a correlation matrix (5.2) which describes the probability of binding shape $\alpha$ with color $\beta$. Running the system of Figure 8 with $n_1 = 3$, $n_2 = 2$ for a long time we obtain

$$
C = \begin{pmatrix} .148 & .183 \\ .088 & .248 \\ .087 & .244 \end{pmatrix}
$$
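The entries of this matrix can be tabulated against the objects of the example. The row and column labels below are an assumption inferred from the surrounding discussion (rows: diamond, square, circle; columns: red, green, with the diamond red and the other two shapes green); the numbers themselves are those reported above.

```python
# Correlation matrix reported in the text.
# Row/column labels are assumed from the surrounding discussion.
shapes = ["diamond", "square", "circle"]
colors = ["red", "green"]
C = [[0.148, 0.183],
     [0.088, 0.248],
     [0.087, 0.244]]
true_color = {"diamond": "red", "square": "green", "circle": "green"}

total = sum(sum(row) for row in C)
correct = sum(C[i][colors.index(true_color[s])] for i, s in enumerate(shapes))
illusory = {(s, c): C[i][j]
            for i, s in enumerate(shapes)
            for j, c in enumerate(colors)
            if true_color[s] != c}

print(f"total = {total:.3f}, correct = {correct:.3f}")
for (s, c), value in sorted(illusory.items(), key=lambda kv: -kv[1]):
    print(f"illusory {c} {s}: {value:.3f}")
```

Under this labeling the entries sum to 0.998, about 0.64 of the weight lies on correct pairings, and the largest single illusory entry is the green diamond (0.183), roughly twice either illusory entry involving red.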
This corroborates the statement we made before that there are more illusory conjunctions with the repeating attribute. Note, however, that for a given color the strongest correlations are with the correct shapes. The fact that the relative duration of green is roughly twice that of red in Figure 8 is a consequence of using a linear sum of the two inputs that contribute to green. Changing the parameters of our model we can change the relative strengths of these signals, but usually in the direction of further amplifying the effect of the stronger amplitude. One can conceive of a different situation, in which the two colors appear with about the same strength. This calls, however, for a modification of our model: it necessitates a nonlinear interaction at the sensory input stage.

Figure 7: Typical variation of the binding and significance parameters as a function of the number of objects. The two networks have $b_1 = b_2 = 0.1$. The noisy input has correlations of $\tau = 2$.

6 Discussion
The model that we propose is based on two oscillatory networks, in which excitatory neural assemblies represent attributes of objects, such as shape and color. The networks are coupled through their inhibitory neurons, whose role is to mediate competition among the excitatory assemblies in each network, and also to phase lock the oscillations of the
shape and color networks. The problem of binding the correct assemblies in the shape and color networks (i.e., assemblies representing attributes of the same object) is solved by introducing correlated noise fluctuations into the corresponding assemblies. Thus our model provides the means by which binding via amplification of noise correlations can be obtained.

This model can serve as an example of intramodality binding, in which we can assume that the input carries some information regarding the connection between the two attributes that are to be bound. This cannot be applied in the same form to the interesting question of cross-modality binding (e.g., connecting visual and acoustical attributes) where the common input layer does not exist. A model for cross-modality binding needs a different approach, which may have to rely on prior knowledge that introduces explicit synaptic connectivity between memory patterns, an element that we have successfully avoided in our model.

Figure 8: An example of three objects, two of which share one attribute (green color). Most conjunctions are correct. The shape and the color networks have the parameters $b_1 = b_2 = 0.1$, $p_1 = 5$, $p_2 = 3$.

Two characteristics of noisy inputs are worth stressing in the context of our model. First, we wish to point out that noisy input increases the segmentation power of the network. Running the network with a constant input we find that it cannot segment successfully more
than about five objects. This is in complete agreement with other oscillatory networks performing segmentation (Wang et al. 1990; Horn and Usher 1991). If more than five excitatory assemblies receive a constant input many activities try to rise simultaneously leading to the collapse of all of them. When noise is added to the constant inputs the network can continue its staggered oscillations for very large numbers of objects. The reason for this is that noise fluctuations will always enhance momentarily the input of one of the assemblies, enabling it to overtake the other ones. It seems that this has to do with the fact that we run the network in a chaotic phase, which is the second point we wish to stress. When we use n = 3 input patterns the networks segment the input into a well ordered sequence of staggered oscillations. This is no longer true for n = 4 or 5. The order of the staggered oscillations is quite random, indicating chaotic behavior. Therefore, if synchronicity between the activities of the two connected networks fails, it is easier to amend it when n > 3. In other words, sensitivity to noise correlations is enhanced when the network is in its chaotic phase, leading to an increase in the value of S beyond three displayed objects. The importance of chaos in the sensory processing of information by the brain was discussed by Skarda and Freeman (1987), who found that neural activity in the olfactory bulb shows chaotic characteristics when the animal is engaged in odor recognition. They suggested that the advantage of chaos for the processing of sensory information is that a chaotic state is more sensitive to changes in the incoming input. It seems that this characteristic is also demonstrated by our model. Although we have not attempted to model the physiological observations in the visual cortex (Eckhorn et al. 1988; Gray et al. 
1989), we should be aware of an interesting qualitative difference: binding in our model takes some time to develop, as seen in Figures 5 and 6, whereas in the experimental results phase locking develops rapidly. The delay in our model comes about because we start from random initial conditions that the input has to overcome. It is quite possible that the physiological process is also assisted by auxiliary mechanisms. For example, "spotlight" attention (Koch and Ullman 1985) can eliminate all but one object and, therefore, lead to fast binding. Our model shows that even when the attention spotlight is spread, as in illusory conjunction experiments, a significant amount of binding can be obtained by making use of noise effects.

Within the context of the psychological phenomenon of illusory conjunctions, our model differs from the Feature Integration Theory (Treisman and Schmidt 1982). According to that approach, feature representations (e.g., shape and color) are completely separate, whereas according to our model some early mixed representation of shape and color information exists in the input layer. Only at a higher order memory level are shape and color information separated. The main prediction of our model is that binding performance depends on the number of displayed objects (Fig. 7). In particular, we find
a strong increase in the rate of illusory conjunctions when the number of objects is increased from two to three. We predict, however, quite uniform behavior when the number of objects is larger than three. Due to the linear dependence on the input we expect predominance of repeated attributes both in correct and illusory conjunctions. In this context we wish to stress that we represented the different objects with the same weights, that is, by equal numbers of neurons. If this is modified, we expect oscillations of the excitatory cell assemblies to be ordered according to input strength, thus producing biased errors (e.g., if "red" is stronger than "green" and "circle" is stronger than "square," then circles may be always red regardless of spatial coincidence). An experimental examination of this issue is needed in spite of its difficulty. It may call for additional mechanisms to rescale the representation on the input level. Finally we wish to address the issue of temporal versus spatial coincidence. We assume that a shape and a color appearing in synchrony will be matched by some higher level process. To test this assumption is rather difficult since it involves rapid presentation (at the oscillation rate, probably higher than 40 Hz, which is probably higher than sensory integration rate) of isolated object attributes at different locations. Thus it is not surprising that Keele et al. (1988) failed to find direct support for temporal binding, and concluded that spatial coincidence is the preferred mechanism for binding that is revealed by psychophysical experiments. Note, however, that our model is making use of spatial coincidence as a binding clue (by local noise) and thus is not in disagreement with these results. 
It is also possible to introduce spatial location explicitly into the model by adding a network encoding relative or absolute location as an attribute that can oscillate synchronously with all other attributes and enhance the role of spatial coincidence. In conclusion, we have shown that noise correlations in the input layer can provide the mechanism by which binding via matching of oscillations is achieved. It remains to be seen whether this mechanism is used by the brain to conjoin sensory attributes when attention is distributed across several objects.
Acknowledgment
M. Usher is a recipient of a Dov Biegun postdoctoral fellowship.
References
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: a mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus attributes. Nature (London) 338, 334-337.
Horn, D., and Usher, M. 1989. Neural networks with dynamical thresholds. Phys. Rev. A 40, 1036-1044.
Horn, D., and Usher, M. 1990. Excitatory-inhibitory networks with dynamical thresholds. Int. J. Neural Syst. 1, 249-257.
Horn, D., and Usher, M. 1991. Parallel activation of memories in an oscillatory neural network. Neural Comp. 3, 31-43.
Kammen, D., Koch, C., and Holmes, P. J. 1989. Collective oscillations in the visual cortex. Proceedings of the NIPS Conference, pp. 76-83.
Koch, C., and Ullman, S. 1985. Shifts in selective attention: Towards the underlying neural circuitry. Human Neurobiol. 4, 219-227.
Keele, S. W., Cohen, A., Ivry, R., Liotti, M., and Yee, P. 1988. Tests of a temporal theory of attentional binding. J. Exp. Psychol.: Human Percept. Perform. 14, 444-452.
Skarda, C. A., and Freeman, W. J. 1987. How brains make chaos in order to make sense of the world. Behav. Brain Sci. 10, 161-195.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1989. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Treisman, A., and Schmidt, H. 1982. Illusory conjunctions in the perception of objects. Cognit. Psychol. 14, 107-141.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail party processor. Biol. Cybern. 54, 29-40.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
Received 27 March 1991; accepted 28 June 1991.
Communicated by Geoffrey Hinton
Contrastive Learning and Neural Oscillations
Pierre Baldi
Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA
Fernando Pineda
Applied Physics Laboratory and Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA
The concept of Contrastive Learning (CL) is developed as a family of possible learning algorithms for neural networks. CL is an extension of Deterministic Boltzmann Machines to more general dynamical systems. During learning, the network oscillates between two phases. One phase has a teacher signal and one phase has no teacher signal. The weights are updated using a learning rule that corresponds to gradient descent on a contrast function that measures the discrepancy between the free network and the network with a teacher signal. The CL approach provides a general unified framework for developing new learning algorithms. It also shows that many different types of clamping and teacher signals are possible. Several examples are given and an analysis of the landscape of the contrast function is proposed with some relevant predictions for the CL curves. An approach that may be suitable for collective analog implementations is described. Simulation results and possible extensions are briefly discussed together with a new conjecture regarding the function of certain oscillations in the brain. In the appendix, we also examine two extensions of contrastive learning to time-dependent trajectories.
1 Introduction
Neural Computation 3, 526-545 (1991) © 1991 Massachusetts Institute of Technology
In this paper, we would like to develop the concept of Contrastive Learning (CL) as a family of possible learning algorithms for arbitrary convergent dynamical systems. CL is an extension of Deterministic Boltzmann Machines (Peterson and Anderson 1987) and Contrastive Hebbian Learning (Movellan 1990). Deterministic Boltzmann Machines are mean field approximations to Boltzmann Machines (Ackley et al. 1985). Contrastive Hebbian Learning is essentially a different method for deriving a Hebbian learning rule for Deterministic Boltzmann Machines. It is equivalent to the observation in Hinton (1989), based on a geometric argument, that Deterministic Boltzmann Machines perform gradient descent on a suitably defined cross-entropy function. The mathematical approach given here makes very few assumptions concerning the details of the activation dynamics beyond the requirement that it be a gradient dynamics. It is, therefore, a general approach that illuminates the common structure of Deterministic Boltzmann Machines and their variants. It also allows one to derive, almost by inspection, new learning algorithms for particular gradient systems. These algorithms are not necessarily Hebbian.
Consider the problem of training a neural network to associate a given set of input-output pairs. In the course of CL training, the network is run in alternation with and without a proper teacher signal applied to some of its units. The weights in the network are updated so as to reduce the discrepancy between the steady-state behavior of the free network (without teacher) and the forced network (with teacher). As we shall see, this waxing and waning of the teacher can also be approximated with continuous oscillations.
In Section 2, we describe CL for a general class of convergent dynamical systems. For clarity, the reader may want to particularize some of the statements to the usual additive neural network model with symmetric zero-diagonal interactions (see, for instance, Hopfield 1984) with
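$$\frac{du_i}{dt} = -\frac{u_i}{\tau_i} + \sum_j w_{ij} V_j + I_i \tag{1.1}$$
and energy function
$$E^f(V, I, W) = -\frac{1}{2} \sum_{i,j} w_{ij} V_i V_j + \sum_i \frac{1}{\tau_i} \int_{\text{rest}}^{V_i} g^{-1}(v)\, dv - \sum_i I_i V_i \tag{1.2}$$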
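As a concrete illustration of the additive model, the following minimal sketch relaxes a two-unit network from the origin and evaluates its energy. It assumes the usual additive dynamics $du_i/dt = -u_i/\tau + \sum_j w_{ij} V_j + I_i$ with $V_i = \tanh(u_i)$, a common time constant, a resting value of 0 for the lower integration limit, and simple Euler integration. The function names `relax` and `energy` and all numerical values are hypothetical choices made for this example only.

```python
import math

def relax(W, I, tau=1.0, dt=0.05, steps=4000):
    """Euler-integrate du_i/dt = -u_i/tau + sum_j w_ij V_j + I_i from u = 0."""
    n = len(I)
    u = [0.0] * n
    for _ in range(steps):
        V = [math.tanh(x) for x in u]  # V_i = g(u_i), with g = tanh assumed
        u = [u[i] + dt * (-u[i] / tau
                          + sum(W[i][j] * V[j] for j in range(n)) + I[i])
             for i in range(n)]
    return u, [math.tanh(x) for x in u]

def energy(W, I, V, tau=1.0):
    """Additive-model energy with g = tanh and rest = 0 (both assumptions)."""
    n = len(V)
    quad = -0.5 * sum(W[i][j] * V[i] * V[j] for i in range(n) for j in range(n))
    # Closed form of the integral of atanh(v) from 0 to V_i.
    leak = sum((v * math.atanh(v) + 0.5 * math.log(1.0 - v * v)) / tau for v in V)
    drive = -sum(I[i] * V[i] for i in range(n))
    return quad + leak + drive

# Symmetric, zero-diagonal interactions and a constant external input.
W = [[0.0, 0.5], [0.5, 0.0]]
I = [0.3, -0.2]
u, V = relax(W, I)
```

Because the dynamics is a gradient dynamics, the energy at the relaxed state should lie below its value at the starting point, which is 0 at the origin for this choice of $g$.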
Throughout the paper, the superscripts f ("free") and t ("teacher") are used to distinguish quantities in the free system and in the system with teacher. In Section 3, we give four different examples of CL that go beyond the special case of Deterministic Boltzmann Machines and the additive neural network model. In Section 4, we analyze the landscape of the contrast function, argue that it is characterized by three types of regions, and make some predictions about CL curves. In Section 5, we present an approach that may be suitable for collective analog implementations and that uses oscillations to continuously approximate CL. We conclude in Section 6 with the results of some preliminary simulations and a few possible extensions and open questions. Finally, in the appendix we sketch how CL may be extended from learning fixed points to learning trajectories.
2 Contrastive Learning
To be more precise, consider an arbitrary convergent n-dimensional dynamical system, where the states are described by the vector $u = (u_1, \ldots, u_n)$.
The parameters of the dynamical system can be classified into two classes: the external parameters $I$ and the internal parameters $W$. Only the internal parameters $W$ are subject to modification in the course of learning. In the case of neural networks, the states $u_i$ can be thought of as representing the membrane potentials of the units, while the vector $I = (I_1, \ldots, I_n)$ and the array $W$ represent the external inputs and the connection weights, respectively (in addition, $W$ may also include various gains and time constants). We assume that the convergent dynamical system is a gradient system governed by an energy function¹ $E^f(V, I, W)$ so that
$$\frac{du_i}{dt} = -\frac{\partial E^f}{\partial V_i} \tag{2.1}$$
with $V = (V_1, \ldots, V_n)$ and $V_i = g(\lambda u_i)$, where $g$ is a monotonically increasing function such as the identity or one of the usual sigmoid transfer functions used in neural models ($V_i$ can be interpreted in terms of firing rates). $\lambda$ is a parameter corresponding to the gains or the temperature in the system. Here, and in the rest of the paper, all functions are assumed to be continuously differentiable, unless otherwise specified. Out of the $n$ variables $V_1, \ldots, V_n$, the first $l$, $V_1, \ldots, V_l$, are considered output variables. The goal of the learning algorithm is to adjust the internal parameters so that for a given fixed initial state $u(0)$ and external input $I$, the free system converges to a stable state where the output variables have the target values $T = (T_1, \ldots, T_l)$. For simplicity, we are dealing here only with one association $I \to T$ (the generalization to many input-output pairs by averaging in the usual way is straightforward). To train the free system, we introduce a second "forced" dynamical system. In the forced system the energy $E^t(V, I, W, T)$ has the form
$$E^t = E^f + F(V, I, W, T) \tag{2.2}$$
The function $F$ denotes a generalized teacher forcing term. The function $F(V, I, W, T)$ is not completely arbitrary and is chosen such that $F(V, I, W, T) = 0$ if $T_i = V_i$ over all visible units. In addition, $F$ must be continuously differentiable and bounded with respect to the $V$s so that the function $E^t$ governs the gradient dynamics of a convergent system defined by
$$\frac{du_i}{dt} = -\frac{\partial E^t}{\partial V_i} \tag{2.3}$$
The equilibrium points of the free and forced systems are denoted by $V^f$ and $V^t$, respectively. At a fixed point, $V$ becomes an implicit function of $I$, $W$, and $T$.
¹Even more generality could be achieved by using an equation of the form $du/dt = -A(u)\nabla E$, where the matrix $A$ satisfies some simple properties (see, for instance, Cohen and Grossberg 1983). The dynamics could be "downhill" without following the gradient.
If $H^t$ and $H^f$ are two similar functions defined on the $V$s in the forced and free systems, then the difference in behavior between the two systems can be measured by taking the difference $H^t(V^t) - H^f(V^f)$ [or alternatively, as in the usual LMS approach, by using a function $H(V^t - V^f)$]. In contrastive learning, the internal parameters are adjusted by gradient descent so as to minimize this difference. A particularly interesting choice of $H$, which will be used in the rest of the paper, is when $H$ is the energy function of the corresponding system. The main reason for that, in addition to being a natural extension of the Deterministic Boltzmann Machine algorithm, is that it leads to very simple learning rules. Indeed, we can define the contrast function $C$ by
$$C(I, W, T) = E^t(V^t, I, W, T) - E^f(V^f, I, W) \tag{2.4}$$
The contrastive learning rule modifies an adjustable internal parameter $w$ by
$$\Delta w = -\eta \frac{\partial C}{\partial w} = -\eta \left[ \frac{\partial E^t}{\partial w} - \frac{\partial E^f}{\partial w} \right] \tag{2.5}$$
where $\eta$ is the learning rate and the derivatives inside the bracket act only on the explicit dependence of the energies on $w$. To see the second equality in equation 2.5, notice that
$$\frac{dE^\gamma}{dw} = \frac{\partial E^\gamma}{\partial w} + \sum_i \frac{\partial E^\gamma}{\partial V_i^\gamma} \frac{\partial V_i^\gamma}{\partial w} \tag{2.6}$$
where $\gamma$ is either $f$ or $t$. Now at equilibrium $\partial E^\gamma / \partial V_i = -du_i^\gamma/dt = 0$. Therefore the implicit terms do not contribute to equation 2.5. The fact that only the explicit dependence enters into the gradient means that the learning rule is simple and can often be written down from the contrast function by inspection. It is essential to notice that the derivation of equation 2.5 is purely heuristic. In general, the contrast function $C$ is not necessarily bounded in $W$. If, over the range of operation, both $E^f(V^f, I, W)$ and $E^t(V^t, I, W, T)$ are convex in $W$, then equation 2.5 achieves the desired goal, but this is certainly a very unrealistic assumption. A more detailed treatment of the properties of the contrast function will be given in Section 4.
CL is based on successive alternating relaxations of a free and a forced system. The initial state of the activation dynamics must be set before each relaxation. In general, the state to which the activation dynamics is reset depends on the task being performed. There are two classes of tasks. A parametric input task is one where the initial state of the network is always the same (usually the origin) and the input information is fed in through the vector $I$. An initial state task, on the other hand, has the input information fed in as part of the initial state and $I$ is always the same. Here, we concern ourselves with parametric input tasks only. Accordingly, the activation of the network is reset to zero before each relaxation. The CL approach, however, can also immediately be extended
to the case of initial state tasks by initializing the free and forced systems in similar pattern-dependent ways and computing the contrast function and its gradient in the way described above.
3 Examples
3.1 Competitive Systems. A useful class of competitive dynamical systems, discussed in Cohen and Grossberg (1983), can be described by
where $W = (w_{ij})$ is symmetric. Equation 3.1 includes as special cases the additive model and also a number of models from population biology. If we define the matrix (3.2) then equation 3.1 is gradient dynamics of the form
$$\frac{du}{dt} = -A(u)\nabla E(u) \tag{3.3}$$
with energy function
The corresponding forced energy function is given by equation 2.2 where $F(V, I, W, T)$ is any convenient forcing term. If $F$ has no explicit $W$ dependence, the CL rule is simply
$$\Delta w_{ij} = \eta \left( V_i^t V_j^t - V_i^f V_j^f \right) \tag{3.5}$$
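The alternation between free and forced relaxations, followed by the Hebbian-looking update above, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' simulation code: it uses the additive model with $g = \tanh$ and unit time constants, Euler integration from the origin, a single association, and a hard teacher that simply holds the activity of the output unit at its target. The names `relax`, `cl_step`, and `train_single_association`, the three-unit network, and all numerical values are hypothetical.

```python
import math

def relax(W, I, clamp=None, dt=0.2, steps=200):
    """Relax the additive model from the origin and return the activities V.

    `clamp` maps a unit index to an activity that is held fixed (hard teacher).
    """
    clamp = clamp or {}
    n = len(I)
    u = [0.0] * n
    for _ in range(steps):
        V = [clamp.get(i, math.tanh(u[i])) for i in range(n)]
        u = [u[i] + dt * (-u[i] + sum(W[i][j] * V[j] for j in range(n)) + I[i])
             for i in range(n)]
    return [clamp.get(i, math.tanh(u[i])) for i in range(n)]

def cl_step(W, I, targets, eta=0.2):
    """One CL update: dw_ij = eta * (Vt_i Vt_j - Vf_i Vf_j), kept symmetric."""
    Vf = relax(W, I)                  # free phase, no teacher
    Vt = relax(W, I, clamp=targets)   # forced phase, outputs held at targets
    n = len(I)
    for i in range(n):
        for j in range(i + 1, n):     # the diagonal stays at zero
            dw = eta * (Vt[i] * Vt[j] - Vf[i] * Vf[j])
            W[i][j] += dw
            W[j][i] += dw
    return Vf

def train_single_association(epochs=200):
    """Unit 0 is the output; units 1 and 2 receive the external input."""
    W = [[0.0] * 3 for _ in range(3)]
    I = [0.0, 0.5, -0.5]
    targets = {0: 0.5}
    for _ in range(epochs):
        cl_step(W, I, targets)
    return W, I, targets
```

Calling `train_single_association()` and then `relax(W, I)` on the returned weights gives the free output after training, which can be compared with the target. Since the forcing here has no explicit weight dependence, the same update would apply to softer teachers; only the forced relaxation would change.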
The form of this learning rule is identical to that used in Deterministic Boltzmann Machines (DBM) (Peterson and Anderson 1987) and in Contrastive Hebbian Learning (CHL) (Movellan 1990). Both DBM and CHL are based on the additive model [$a_i(u_i) = 1$ and $b_i(u_i) = u_i$] and the learning rule $\Delta w_{ij} = \eta(V_i^c V_j^c - V_i^f V_j^f)$, where $V_i^c$ denotes the equilibrium activities in the network where the output (and input) units are clamped to their target values. To show that DBM are a special case of the general framework, we need to interpret clamping in terms of a teacher forcing term. This is easily done by writing
$$\frac{du_i}{dt} = -u_i + g^{-1}(T_i) \quad \text{for } i = 1, \ldots, l \tag{3.6}$$
which relaxes to $u_i = g^{-1}(T_i)$.² It is simple to check that the network with such corresponding teacher has the energy function
$$E^t(V, I, W, T) = E^f(T_1, \ldots, T_l, V_{l+1}, \ldots, V_n, I, W) \tag{3.7}$$
By applying equation 2.5 to the corresponding contrast function, one immediately gets equation 3.5 with in fact $V_i^t = V_i^c$ ($= T_i$ for $i = 1, \ldots, l$). The previous considerations extend immediately to neural networks consisting of higher order (or $\Sigma\Pi$) units with the proper symmetric interconnections. For instance, in the case of third-order interactions one can replace equation 3.5 by $\Delta w_{ijk} = \eta(V_i^t V_j^t V_k^t - V_i^f V_j^f V_k^f)$. These examples can also be extended to networks of simple or higher order threshold gates with the proper modifications.
3.2 Simple Error Teacher Signal. In this example, we consider a particularly simple form of teacher term in the additive model. The free network satisfies equations 1.1 and 1.2. In the forced network, the activation equation contains a teacher signal which is a simple measure of the error and is given by
$$\frac{du_i}{dt} = -\frac{u_i}{\tau_i} + \sum_j w_{ij} V_j + I_i + \gamma_i (T_i - V_i)^{\alpha} \tag{3.8}$$
where $\gamma_i$ may be nonzero only for the output units and $\alpha$ is a positive parameter. The associated energy function is
$$E^t(V, I, W, T) = E^f(V, I, W) + \sum_i \frac{\gamma_i}{\alpha + 1} (T_i - V_i)^{\alpha + 1} \tag{3.9}$$
For simplicity, we shall assume that on the output units $\gamma_i = \gamma$. Different values of the parameter $\alpha$ yield different models. For instance, when $\alpha = 1$ we obtain the usual LMS error term in the energy function. A value of $\alpha = 1/3$ has been used in relation to the terminal attractors approach (Zak 1989). By applying equation 2.5, the corresponding CL rule is $\Delta w_{ij} = \eta(V_i^t V_j^t - V_i^f V_j^f)$. Notice that in contrast to DBM, where the output units are clamped to their target values, the output units here may relax to fixed points of equation 3.8 satisfying $V_i^t \neq T_i$. Of course, as $\gamma \to \infty$, $V_i^t \to T_i$ ($i = 1, \ldots, l$). It is worth noticing that in this example (as in the previous ones), the teacher forcing term $F$ does not depend explicitly on $W$. Thus the expression of the resulting CL rule is identical for an entire family of systems corresponding to different values of the parameters $\alpha$ and $\gamma$. However, these values affect the learning implicitly because the corresponding systems have different dynamics and relax to different fixed points while being trained.
²Alternatively, to prevent the initial relaxation of the clamped units, one could write $du_i/dt = 0$ and require the initial conditions to be $u_i(0) = g^{-1}(T_i)$, $i = 1, \ldots, l$.
3.3 Coupled Oscillators Models. This fourth example is mainly meant to illustrate the CL formalism on a different class of models for networks of coupled oscillators (see, for instance, Baldi and Meir 1990 and references therein) used recently in conjunction with the study of oscillatory brain activity. Networks of coupled oscillators can be studied by representing each oscillator by a single variable, its phase $u_i$. The oscillators are associated with the vertices of a graph of interactions. Each edge in the graph corresponds to a symmetric coupling strength $w_{ij}$. One possibility is to model the evolution of the phases by the system of equations
$$\frac{du_i}{dt} = \sum_j w_{ij} \sin(u_j - u_i) \tag{3.10}$$
The corresponding energy function is
$$E^f(u, W) = -\frac{1}{2} \sum_{i,j} w_{ij} \cos(u_i - u_j) \tag{3.11}$$
If we let $T_i$ denote a set of target angles, then a possible associated forced system can be described by (3.12) with (3.13)
By inspection, this results in the CL learning rule
$$\Delta w_{ij} = \eta \left[ \cos(u_i^t - u_j^t) - \cos(u_i^f - u_j^f) \right] \tag{3.14}$$
In some sense, this learning rule is still Hebbian since if we use the complex exponential $z_k = e^{\mathrm{i} u_k}$, then $\Delta w_{ij} = \eta\, \mathrm{Re}\,[z_i^t \bar{z}_j^t - z_i^f \bar{z}_j^f]$. In many examples, the CL rule is a very simple Hebbian one and this is attractive from a hardware perspective. It should be noticed, however, that this is a consequence of the fact that the explicit dependence of the energy function on the weights rests on a quadratic form. As can be seen from the examples of higher order DBM and coupled oscillator models, more complicated learning rules can be derived by introducing different terms in the energy function with an explicit dependence on $w_{ij}$.
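The phase model and the complex-exponential form of its learning rule can be checked numerically with a short sketch. It assumes Euler integration of the free phase dynamics $du_i/dt = \sum_j w_{ij}\sin(u_j - u_i)$ and the energy $-\frac{1}{2}\sum_{i,j} w_{ij}\cos(u_i - u_j)$; the forced system is not simulated here, so `cl_update` simply takes the forced and free equilibrium phases as arguments. The names `phase_step`, `phase_energy`, and `cl_update` and the numerical values are hypothetical.

```python
import cmath
import math

def phase_step(u, W, dt=0.05):
    """One Euler step of du_i/dt = sum_j w_ij * sin(u_j - u_i)."""
    n = len(u)
    return [u[i] + dt * sum(W[i][j] * math.sin(u[j] - u[i]) for j in range(n))
            for i in range(n)]

def phase_energy(u, W):
    """E(u, W) = -1/2 * sum_{i,j} w_ij * cos(u_i - u_j)."""
    n = len(u)
    return -0.5 * sum(W[i][j] * math.cos(u[i] - u[j])
                      for i in range(n) for j in range(n))

def cl_update(ut, uf, eta=0.1):
    """dw_ij = eta * Re[zt_i * conj(zt_j) - zf_i * conj(zf_j)], z = exp(i*u)."""
    zt = [cmath.exp(1j * x) for x in ut]
    zf = [cmath.exp(1j * x) for x in uf]
    n = len(ut)
    return [[eta * (zt[i] * zt[j].conjugate() - zf[i] * zf[j].conjugate()).real
             for j in range(n)] for i in range(n)]

# Three oscillators with equal positive symmetric couplings.
W = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
u = [0.0, 1.0, 2.0]
e_start = phase_energy(u, W)
for _ in range(2000):
    u = phase_step(u, W)
e_end = phase_energy(u, W)
```

With these positive couplings and initial phases spread over less than half a circle, the phases pull together and the energy decreases toward its synchronized value. The real part of $z_i \bar{z}_j$ equals $\cos(u_i - u_j)$, which is why the update can be written either way.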
The previous examples also show that the notion of clamping a unit is not a precise one and that different algorithms can be defined depending on the degree of softness contained in the clamping. In hard clamping, all the variables pertaining to one output unit are kept fixed at their target values and only these target values appear in the contribution of the unit to the learning rule. In soft clamping, some of the variables may evolve. In fact, by varying the parameter $\gamma$ in Section 3.2, one can easily envision a continuum of possible clampings. The description we have given of a DBM is based on hard clamping. However, one can conceive a softer DBM, for instance, where the value $V_i$ of any output unit is held constant at a value $V_i^c = T_i$ while the network relaxes. The internal states $u_i$ of the output unit may then evolve according to equation 1.1 and equilibrate to a final value $u_i^t$ (such a network always reaches a fixed point although, strictly speaking, it has no energy function). $V_i^t = g(u_i^t)$ may then be used to adjust the weights rather than the clamped value $V_i^c$.
4 The Landscape of the Contrast Function
We shall now analyze the typical landscape of the contrast error function $C(W)$ as a function of the internal parameters $W$. For simplicity, we shall deal with the case of Deterministic Boltzmann Machines with hard clamping or with their version with a softer teacher term given by equation 3.8. We shall argue that, in general, the landscape of $C$ contains three interesting and characteristic regions: a region corresponding to the initial stage of learning, characterized by rapid progress and smooth descent; a region corresponding to an intermediary stage, possibly characterized by abrupt discontinuities due to basin hopping phenomena; and a third region associated with a final stage, found in the neighborhood of an optimal set of weights $W^*$, characterized by local convexity and smooth descent.
To begin with, it should be noticed that the contrast function is an average of several contrast functions, one for each pattern. To emphasize this important point and only in this section, we shall use a $p$ superscript to denote the pattern dependence. Thus, for instance, $C = \sum_p C^p = \sum_p [E^{tp}(V^{tp}) - E^{fp}(V^{fp})]$. Furthermore, even for one pattern, the contrast function $C^p(W)$ is not bounded and is not continuous everywhere because there are values of $W$ for which $V^{tp}$ or $V^{fp}$ vary abruptly. If one tries to learn a unique association pair by CL, the descent, however, is in general smooth. It can easily be seen that the contrast function is continuous over the restricted region covered by the corresponding descent procedure. Yet, it is when we try to learn several associations simultaneously and satisfy somewhat conflicting constraints that gradient descent leads us to regions of the parameter space where the contrast functions corresponding to the individual patterns may be discontinuous. We shall call a fracture any point or connected set of points where
$C$ is discontinuous (that is, where at least one of the $C^p$ is discontinuous and the corresponding $V^{tp}$ or $V^{fp}$ varies abruptly). Thus a fracture is a point or a set of connected points in weight space associated with a basin boundary going through the initial state (usually the origin) in the activation space of either the free or forced system for at least one of the patterns (see also Pineda 1988). In general, basin hopping can result from abrupt disruption in the flow of the system at bifurcation points or from the smooth evolution of basin boundaries in the course of learning. Notice that when a bifurcation occurs and a new fixed point is created, the newly created basin boundaries may cross the origin only some time after.
In the initial stage, when training is started with very small initial weights, the units in the network operate near their linear regime and, for each pattern, both the free and forced networks have a unique fixed point. Thus, for each pattern $p$ and as the weights $W$ begin to evolve, $V^{tp}$ and $V^{fp}$ vary continuously. For small weights, each $C^p(W)$ is therefore continuous and differentiable, so the total contrast function $C(W)$ is also continuous and differentiable and the learning curve decreases smoothly. This smooth descent lasts at least until the first bifurcation occurs in one of the networks corresponding to one of the patterns. The first bifurcation creates the first additional fixed points and therefore the first basin boundary capable of causing a discontinuity.
In the intermediary stage, which is probably the most crucial for successful learning, the conflict introduced by the different patterns can become apparent. The learning trajectory may run into and cross fractures. Every time this happens, the contrast function jumps abruptly up or down.
We believe that such a stage is inevitable in any reasonably challenging problem (for instance, we have seen no discontinuities in the case of one, two, or even three pattern learning from the XOR table; on the other hand, these tasks are easy and linearly separable) and the phenomenon is accentuated by large learning rates, noise, or increasing approximations to the gradient.
It remains to analyze what happens in the proximity of an optimal set of weights $W^*$ that achieves perfect learning, that is, such that, for each pattern, the equilibrium values of the output units in the free and forced network are identical and identical to the targets. In addition, at an optimum $W = W^*$ we can also assume that the hidden units in the free and forced systems equilibrate for each pattern to the same values. Configurations of optimal weights without this property could conceivably exist. These would likely be unstable when gradient descent updates are performed after each pattern [this is because $\partial C^p/\partial w_{ij} = -(V_i^{tp} V_j^{tp} - V_i^{fp} V_j^{fp}) \neq 0$ for many patterns although $\partial C/\partial w_{ij}$ may be 0 on the average]. So, with these assumptions, for an optimal set of parameters $W^*$, $C^p(W^*) = 0$ for every $p$ and therefore $C(W^*) = 0$. Yet, the converse is not necessarily true. Now, it is reasonable to consider that no fracture passes through $W^*$. Otherwise, the problem would be inherently intractable. It
would mean that the set of training patterns cannot be loaded on the chosen architecture in any stable way or, equivalently, that there is no region in weight space with a proper interior that contains an optimal solution. Thus, if we assume that the network is capable of implementing the given function, each $C^p$ and $C$ must be continuous in a neighborhood of $W^*$. It is easy to see that this implies that each $C^p$ and $C$ is also differentiable at $W^*$. From the form of the CL rule, it is obvious that if $W^*$ is optimal, then $\partial C^p/\partial W = 0$ for each $p$ (and $\partial C/\partial W = 0$) at $W^*$. In the case of Deterministic Boltzmann Machines with hard clamping the converse is also true. Indeed, let us assume that $V_i^{tp} V_j^{tp} = V_i^{fp} V_j^{fp}$ for every $p$ and every $i$ and $j$. Since the inputs are clamped to the same values in the free and clamped network, by simple propagation through the network this implies that $V^{tp} = V^{fp}$ everywhere and therefore we are at an optimum. In the case of softer forms of Deterministic Boltzmann Machines (as in equation 3.8), it is not difficult to show that if every connected component of the graph of connections contains at least one cycle of odd length, then $\partial C^p/\partial W = 0$ everywhere implies that $V^{fp} = V^{tp}$ everywhere and therefore $W$ must be optimal. This sufficient condition for equivalence on the structure of the graph of connections is usually easily realized (for instance, in the case of fully interconnected or of typical random graphs). So, without great loss of generality, we can assume that
$$\frac{\partial C^p}{\partial W} = 0 \ \text{for every } p \iff W \text{ is optimal} \tag{4.1}$$
Thus, the only critical points of C satisfying BCp/BW = 0 everywhere are the points W that lead to a perfect implementation of the inputloutput function. We are going to show that, in general, these critical points W are local minima of C and therefore C is convex (not necessarily strictly convex if W' is not isolated) in a neighborhood of any such W . Since C = &CP, it is sufficient to show that each CP is convex in a neighborhood of W'. Consequently, in the rest of the discussion we shall assume now that the pattern p is fixed. When both the free and forced systems are started at the origin with a critical configuration of weights W " , they both converge to the same vector of activities V'P. In neural networks, we can safely assume that V'P is an attractor (thus W' is asymptotically stable). Let Bw(Vfp)denote the domain of attraction of Vfp in the free system associated with W. Bw.(V'p) is an open connected set that contains V'P and the origin (see Fig. 1). 0 and V'fp are in the interior of BW.(V*~P). For sufficiently small perturbations A W* of W * ,the fixed points Vfpand VtP vary continuously as a function of W = w" -k AW*. Let d ( W ) denote the smallest distance from V f p to the boundary of its basin Bw(VfPP)(when these quantities are defined, which is the case in a small neighborhood of W " ) . Clearly, d ( W * ) > 0. We can find do > 0, €1 > 0, and €2 > 0 such that for any
Pierre HaIdi and Fernando Pineda
536
oulpul
1I itlclcn
Inpu1
Figure 1: When the free (respectively forced) system is started at the origin 0, it converges to V" (respectively V*) when the internal parameters are W', and to V' (respectively V ' ) when the internal parameters are W' + AW*. The contours are meant to represent basin boundaries. A representation of the perturbation in the space of internal parameters is shown in the inset. perturbation A W' of the parameters:
$$\|\Delta W^*\| < \epsilon_1 \implies d_0 < d(W) \qquad (4.2)$$
and

$$\|\Delta W^*\| < \epsilon_2 \implies \|V^{fp}(W) - V^{*p}\| < d_0/2 \qquad (4.3)$$

Thus, for any perturbation satisfying $\|\Delta W^*\| < \epsilon_3 = \inf(\epsilon_1, \epsilon_2)$ we simultaneously have $d(W) > d_0$ and $\|V^{fp}(W) - V^{*p}\| < d_0/2$, which ensures that the open ball with radius $d_0/2$ centered at $V^{*p}$ is entirely contained in $B_W(V^{fp})$. Now, by continuity in the forced system, we can find $\epsilon_4 > 0$ such that
$$\|\Delta W^*\| < \epsilon_4 \implies \|V^{tp}(W) - V^{*p}\| < d_0/2 \qquad (4.4)$$

Finally, for any perturbation satisfying $\|\Delta W^*\| < \epsilon = \inf(\epsilon_3, \epsilon_4)$ we have that $V^{tp}(W)$ is contained in the ball of radius $d_0/2$ centered at $V^{*p}$ and therefore also in $B_{W^* + \Delta W^*}(V^{fp})$. But since the activation dynamics is a gradient dynamics, within the basin of attraction of $V^{fp}$ we must have
$$E^{fp}(V^{fp}) \le E^{fp}(V^{tp}) = E^{tp}(V^{tp}) \qquad (4.5)$$
and therefore $C^p(W^* + \Delta W^*) \ge 0$. Hence, for any $\|\Delta W^*\| < \epsilon$, $C^p(W^* + \Delta W^*) \ge C^p(W^*) = 0$. Thus the critical point $W^*$ is a local minimum of each $C^p$. Each $C^p$ is convex around $W^*$ and the same holds for $C$. It is important to observe that, in a given problem, the previous analysis does not say anything about the size of the neighborhood of $W^*$ over which $C$ is convex (nor do we know with certainty that such a neighborhood is always entered). This neighborhood can conceivably be small. Furthermore, nothing can be inferred at this stage on the duration of each one of the stages in the course of contrastive training. If at some point, however, the contrast function becomes negative, then one knows that the optimum has been missed and training should be stopped and restarted from a point where $C$ is positive. Finally, the previous analysis assumes the existence of a set of weights that can do the task perfectly, without any error. Additional work is required for the case of approximate learning.

5 Oscillations and Collective Implementations

In this section, we consider some issues concerning the implementation of CL algorithms in collective physical analog systems. We cast the algorithm in the form of a set of nonautonomous coupled ordinary differential equations with fast and slow time scales, governing the simultaneous evolution of both the activation and the weight dynamics. The formal description we have given thus far of CL relies on two different dynamical systems, a free system and a forced system. For implementations, however, it is desirable to have a collective dynamical system that alternates between forced and free phases. This can be achieved in several ways by making use of some basic oscillatory mechanism to alternate between the phases and leads to algorithms that are local in space but not necessarily in time.
For instance, in Example 3.2 we can introduce an oscillatory mechanism in the activation dynamics by considering the parameter $\gamma$ to be a periodic function $\gamma(t)$, resulting in

$$\frac{du_i}{dt} = \cdots \qquad (5.1)$$

where $\gamma_i(t) = 0$ for the internal units and $\gamma_i(t) = \gamma(t)$ for the output units. If $\gamma(t)$ changes slowly enough and if the activation dynamics is fast enough, the network is almost always at steady state. For example, $\gamma(t)$ could be a square wave oscillation with magnitude $\gamma$ and frequency $\omega$ [i.e., $\gamma(t) = \gamma$ or 0, depending on whether the teacher is on or off]. The network departs from steady state only during the transient after $\gamma(t)$ suddenly changes its value. We now consider several possibilities for using oscillations in the weight updates. The most obvious approach is to perform one update per relaxation. The learning rate must be small and alternate in sign
(positive when the system is forced and negative when the system is free). In this simple approach, the weight dynamics hemstitches its way downhill. It follows the true gradient of the contrast function only in the $\eta \to 0$ limit. We have found that this method is particularly prone to discontinuities associated with basin hopping phenomena. This leads to learning curves that exhibit extreme sawtooth and/or spiking behavior and often do not converge. A better approximation to equation 2.5 is to continuously integrate the weights based on the difference of some running average of $V_i V_j$ while the teacher signal is on and off. That is, something like
$$\Delta w_{ij} = \eta\, \frac{1}{t_1 - t_0} \int_{t_0}^{t_1} \beta(t)\, V_i(t)\, V_j(t)\, dt \qquad (5.2)$$
where $\beta(t)$ can be a bipolar square wave or any other kind of bipolar oscillation and $\beta(t)$ is phase locked with $\gamma(t)$ [such that $\beta(t) < 0$ if $\gamma(t) = 0$]. In the implementation of equation 5.2, one needs to choose an integration time interval $t_1 - t_0$ of the order of several teacher signal periods and to be able to update the weights at regular multiples of this time interval. In an analog system, a suitable alternative may be to approximate equation 2.5 using the differential equations

$$\frac{dw_{ij}}{dt} = \frac{1}{\tau_s}\, s_{ij} \qquad (5.3)$$

with

$$\frac{ds_{ij}}{dt} = -\frac{1}{\tau_s}\, s_{ij} + \beta(t)\, V_i(t)\, V_j(t) \qquad (5.4)$$

Equation 5.4 is a leaky integrator that averages over the forced and free gradients with opposite signs and thus calculates an approximation to the gradient of the contrast function. If $\tau \ll \tau_s \ll 2\pi/\omega$, then $s_{ij}$ remains close at any time to its instantaneous equilibrium value $\tau_s \beta(t) V_i(t) V_j(t)$, in which case the right-hand side of equation 5.3 becomes just equal to $\beta(t) V_i(t) V_j(t)$. Thus $w_{ij}$ integrates the instantaneous correlation between the activities $V_i$ and $V_j$. Yet in practice, the filter should rather be operated with $\tau_s \gg \tau$ and $\tau_s \ge 2\pi/\omega$ in order to get good averaging over several oscillations of the teacher signal. In summary, for this scheme to work, it is essential to keep in mind that there are at least four time scales involved: the relaxation time of the activities $\tau$, the period of the 0-1 teacher signal $2\pi/\omega$ (and of the phase-locked $\pm 1$ weight modulation $\beta$), the time between pattern changes $\tau_p$, and the relaxation time of the weights $\tau_w$. These should satisfy $\tau < 2\pi/\omega < \tau_p < \tau_w$. In principle, even with $\tau$ small, it cannot be guaranteed that the activations always converge faster than the pattern changes or faster than the oscillations. This is because there can be large variations in the convergence time of the activation dynamics as the weights evolve. In practice, it seems to be sufficient to have the activation dynamics very fast compared to other time scales
so that the transients in the relaxation remain short. A similar caveat applies to Deterministic Boltzmann Machines. It should be noticed that in CL, the waxing and waning of the clamping or of the teacher signal provide a basic underlying rhythm. In general, the time scale of the synaptic weight updates can be much larger than the period of this rhythm. However, one can envision the case where the time scale of the updates is close to the period of the rhythm. In this range of parameters, CL requires some mechanism for fast Hebbian synaptic plasticity consistent with the ideas developed by von der Malsburg (see von der Malsburg 1981).

6 Simulations and Conclusion
We have tested the CL algorithm corresponding to equations 3.8 and 3.9 on a small but nonlinearly separable problem: XOR. The network architecture consists of three input lines (one for the bias), three hidden units, and one output unit (see Fig. 2). All the units receive an input bias, but only the hidden units are connected to the inputs. In addition, the hidden units and the output are fully interconnected in a symmetric way. We use equation 2.5 to descend along the gradient, as described in the previous section. All the gains of the units are kept constant ($\lambda = 1$) and no annealing is used.³ The learning rate is $\eta = 0.01$. In these simulations, patterns are presented in cyclic fashion although no major differences seem to arise when presentations are done randomly. The network is relaxed using a fourth order Runge-Kutta method but the results are essentially identical when Euler's integration method is used with a small step size. Figure 3 shows the results of the simulations for different values of $\gamma$. As the strength of the teacher signal $\gamma$ is reduced, learning becomes slower. As can be seen, there are spikes that occur simultaneously in the contrast function and in the LMS error. These discontinuities are a consequence of basin hopping. It is easy to test this hypothesis numerically by freezing the weights just before a given spike and probing the structure of the energy surface by starting the activation dynamics at random points in a neighborhood of the origin. As expected, we observe that the system converges to one of several final states, depending sensitively on the particular initial state. To demonstrate that this algorithm actually trains the input-to-hidden weights, it is necessary to perform an additional control to eliminate the possibility that the hidden units start with an internal representation that is linearly separable by the output unit. To perform this control, we trained the network in two ways.
In one case, we froze the input-to-hidden weights and in the second case we let them adapt. We found a set of parameters where networks were untrainable with frozen input-to-hidden weights, but trainable with adaptable input-to-hidden weights. This proves that the CL algorithm is training the input-to-hidden weights.

³It is worthwhile to stress that our purpose here is only to demonstrate the basic principles and not to develop optimal algorithms for digital computers. Indeed, we can achieve significant speed-ups by using the usual tricks such as annealing, momentum terms, and higher learning rates (up to $\eta = 1$).

Figure 2: The network architecture used in all simulations.

Figure 3: Time evolution of the contrast function and the LMS error function associated with equations 3.8 and 3.9 for different values of the parameter $\gamma$, with $\alpha = 1$, $\lambda = 1$, $\eta = 0.01$ and the target values $T_i$ are $\pm 0.99$. Relaxations are performed using a fourth order Runge-Kutta method with a time step of 0.1, starting from activations initialized to 0. Initial weights are normally distributed with a mean of 0 and a variance of 0.5. Patterns are presented cyclically, once per weight update. Each iteration on the time axis corresponds to one pattern presentation, thus to the relaxation of a free and a forced network.

The present work needs to be extended toward problems and networks of larger size. The ratio of the number of visible units to the number of hidden units may then become an important parameter. Finally, we would like to conclude with the conjecture that oscillations of the sort discussed in this paper may possibly play a role in biological neural systems and perhaps help in the interpretation of some of the rhythms found in nervous tissues, especially in circuits that are believed to participate in associative memory functions (for instance in hippocampus or piriform cortex). Functionally, these oscillations would correspond to some form of rehearsal whereby a waxing and waning teacher signal generated by reverberating short-term memory circuits would induce stable synaptic modifications and storage in long-term memory circuits. Physically, they would correspond to oscillations in the degree of clamping of various neuronal assemblies. These oscillations would act as local clocks in analog hardware, ensuring that communications and computations occur in the proper sequence. Intuitively, it would seem more likely for this rehearsing process to be associated with a relatively slow rhythm, perhaps in the $\theta$ range (3-10 Hz) rather than with the faster $\gamma$ oscillations
(40-80 Hz) that have recently been linked to functions such as phase labeling and binding (see, for instance, Gray et al. 1989 and Atiya and Baldi 1989). However, a form of CL based on $\gamma$ oscillations originating in sensory cortices may be able to account for the sort of memory traces revealed by experiments on priming effects (Tulving and Schacter 1990). Obviously, additional work is required to explore such hypotheses.

7 Appendix: Contrastive Learning of Trajectories
The same minimization principle that leads to simple learning algorithms for convergent dynamical systems can be extended to time-dependent recurrent networks where the visible units are to learn a time-dependent trajectory in response to a time-dependent input. In this appendix, we include for completeness two possible directions for this extension. In the first case, we assume that there are delays in the interactions and we exploit the fact that the limit trajectories are minima of suitably defined Lyapunov functions. In the second case, we assume no delays and exploit the fact that trajectories can be expressed as extrema of functionals or actions.

7.1 Time-Dependent Contrastive Learning with Delayed Interactions. In Herz et al. (1991), it is shown that if multiple pathways with different time delays are introduced in the usual binary Hopfield model, then cycles of period $P$ can be stored as local minima of a time-dependent Lyapunov function that decreases as the network is relaxed. Here, we apply the contrastive learning approach to this setting with discrete time and continuously valued model neurons (Herz 1991). To be specific, consider a network of $n$ units with activations defined by

$$V_i(t + 1) = g_i[u_i(t)] \qquad (a.1)$$
where $g_i$ is a differentiable, bounded, and monotonically increasing input/output function. $u_i(t)$ is the usual local field or internal membrane potential and is defined by

$$u_i(t) = \sum_j \sum_{\tau} w_{ij}(\tau)\, V_j(t - \tau) + I_i(t) \qquad (a.2)$$
$w_{ij}(\tau)$ is the weight of the connection from $j$ to $i$ with delay $\tau$. Since the time is discretized, all the delays are integers. In addition, for simplicity, we assume that $w_{ij}(\tau) = 0$ for $\tau \ge P - 1$. The task of the network is to learn to associate a periodic output cycle of length $P$ to a periodic input cycle also of length $P$. The output can be read on all the units in the network or alternatively from a subset of "visible" units. Provided the couplings satisfy an extended synaptic symmetry

$$w_{ij}(\tau) = w_{ji}[(P - (2 + \tau))(\mathrm{mod}\ P)] \qquad (a.3)$$
it is not difficult to see that the network has a time-dependent energy function given by

$$E(V, I, W, t) = \cdots \qquad (a.4)$$

with

$$G_i(V_i) = \int_0^{V_i} g_i^{-1}(x)\, dx \qquad (a.5)$$
At each synchronous time step of the dynamics defined by equations a.1 and a.2, $\Delta E \le 0$ and since $E$, for fixed weights, is a bounded function of $V$, the function $E$ must converge to a stable value. Once $E$ has stabilized, the network is either at a fixed point or on a cycle of period $P$ (or, possibly, a divisor of $P$). Learning can be attempted by using, for instance, a generalization in time of Hebb's rule. In the CL approach, as for the case of Boltzmann Machines, one can define a forced network that relaxes with its inputs and outputs clamped to the respective target limit cycles. A time-dependent contrast function in this case is defined by
$$C[W, I(t), T(t), t] = E^t[V^t, I(t), W, T(t), t] - E^f[V^f, I(t), W, t] \qquad (a.6)$$
where $V^t$ and $V^f$ represent the instantaneous activities in the forced and free networks, respectively. By differentiating equation a.6 with respect to $w$, one immediately obtains the CL rule

$$\Delta w_{ij}(\tau) = \eta \sum_{\alpha=0}^{P-1} \Big\{ V_i^t(t - \alpha)\, V_j^t[t - (\alpha + \tau + 1)(\mathrm{mod}\ P)] - V_i^f(t - \alpha)\, V_j^f[t - (\alpha + \tau + 1)(\mathrm{mod}\ P)] \Big\} \qquad (a.7)$$
which is a generalized version of the usual Hebbian CL rule of Deterministic Boltzmann Machines.

7.2 Lagrange Machines. In this section, we briefly extend CL to time-dependent problems using a variational formalism. Consider a free system with an action along the trajectory $\Gamma^f$ in the space of the activations $V^f$, during the time interval $[t_0, t_1]$, given by

$$S^f(\Gamma^f) = \int_{t_0}^{t_1} L^f[V^f, \dot V^f, W, I(t), t]\, dt \qquad (a.8)$$
The dynamics of the system extremizes this action and corresponds to the Lagrange equations

$$\frac{d}{dt} \frac{\partial L^f}{\partial \dot V^f} = \frac{\partial L^f}{\partial V^f} \qquad (a.9)$$
Similar relations hold in the forced system with

$$S^t(\Gamma^t) = \int_{t_0}^{t_1} L^t[V^t, \dot V^t, W, I(t), T(t), t]\, dt \qquad (a.10)$$
We can define the contrast function $C(W) = S^t(\Gamma^t) - S^f(\Gamma^f)$ so that the internal parameters are updated according to $\Delta w = -\eta(\partial C/\partial w)$, calculated along the two actual trajectories followed by the free and forced systems. As in the fixed point algorithms, the differentiation of the actions contains both implicit and explicit terms. The implicit terms can be simplified by integrating by parts and using the Lagrange equations. Finally, we get
Now there are several possible choices for $L^f$ and $L^t$ and two observations can be made. First, if the dependence of $L^f$ (and $L^t$) on $w$ is based on a quadratic form such as $L^f = -\frac{1}{2}\sum_{i,j} w_{ij} V_i^f V_j^f + \sum_i V_i^f I_i(t) + \sum_i f_i(V_i^f, \dot V_i^f)$, then the integral in equation a.11 is Hebbian, similar to the Deterministic Boltzmann Machine learning rule, and equal to the average of the difference between the local covariance of activities. Second, by properly choosing the initial conditions in the free and forced systems, it is easy to have the corresponding boundary term in equation a.11 vanish. Finally, it should be noticed that with a nondissipative dynamics, the trajectories that are learned can vary with the input but cannot be stored as attractors.
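The integration-by-parts step invoked in this section can be sketched for the free action as follows. This is only a sketch under stated assumptions: $L^f$ is taken to depend on a weight $w$ both explicitly and implicitly through the trajectory $V^f(t)$, the dot product runs over units, and the index and sign conventions of equation a.11 may differ from the ones used here.

```latex
\begin{align*}
\frac{\partial S^f}{\partial w}
 &= \int_{t_0}^{t_1} \left[
      \frac{\partial L^f}{\partial w}
    + \frac{\partial L^f}{\partial V^f} \cdot \frac{\partial V^f}{\partial w}
    + \frac{\partial L^f}{\partial \dot V^f} \cdot \frac{d}{dt}\frac{\partial V^f}{\partial w}
    \right] dt \\
 &= \int_{t_0}^{t_1} \frac{\partial L^f}{\partial w}\, dt
    + \left[ \frac{\partial L^f}{\partial \dot V^f} \cdot \frac{\partial V^f}{\partial w} \right]_{t_0}^{t_1}
    + \int_{t_0}^{t_1} \left( \frac{\partial L^f}{\partial V^f}
        - \frac{d}{dt}\frac{\partial L^f}{\partial \dot V^f} \right)
      \cdot \frac{\partial V^f}{\partial w}\, dt
\end{align*}
```

The last integral vanishes along the actual trajectory because of the Lagrange equations a.9, so only the explicit term and a boundary term remain. Applying the same argument to $S^t$ and using $\Delta w = -\eta\,\partial C/\partial w$ gives an update made of the time integral of $\partial L^t/\partial w - \partial L^f/\partial w$ plus the difference of the two boundary terms, which is the structure the two observations above refer to.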
Acknowledgments

We would like to thank the referee for several useful comments. This work is supported by ONR Contract NAS7-100/918 and a McDonnell-Pew grant to P. B., and AFOSR Grant ISSA-91-0017 to F. P.
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann Machines. Cog. Sci. 9, 147-169.
Atiya, A., and Baldi, P. 1989. Oscillations and synchronizations in neural networks: An exploration of the labeling hypothesis. Int. J. Neural Syst. 1(2), 103-124.
Baldi, P., and Meir, R. 1990. Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Comp. 2(4), 458-471.
Cohen, M. A., and Grossberg, S. 1983. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Syst. Man Cybern. SMC 13(5), 815-826.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Herz, A. V. M. 1991. Global analysis of parallel analog networks with retarded feedback. Phys. Rev. A 44(2), 1415-1418.
Herz, A. V. M., Li, Z., and van Hemmen, J. L. 1991. Statistical mechanics of temporal association in neural networks with transmission delays. Phys. Rev. Lett. 66(10), 1370-1373.
Hinton, G. E. 1989. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp. 1, 143-150.
Hopfield, J. J. 1984. Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Movellan, J. 1990. Contrastive Hebbian learning in the continuous Hopfield model. Proceedings of the 1990 Carnegie Mellon Summer School. Morgan Kaufmann, San Mateo, CA.
Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1, 995-1019.
Pineda, F. 1988. Dynamics and architecture for neural computation. J. Complex. 4, 216-245.
Tulving, E., and Schacter, D. L. 1990. Priming and human memory systems. Science 247, 301-306.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Department of Neurobiology, Max Planck Institute for Biophysical Chemistry.
Zak, M. 1989. Terminal attractors in neural networks. Neural Networks 2, 259-274.
Received 6 February 1991; accepted 14 June 1991.
Communicated by Joshua Alspector
Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks

Marwan Jabri
Barry Flower
Systems Engineering and Design Automation Laboratory, School of Electrical Engineering, University of Sydney, Sydney, Australia

Previous work on analog VLSI implementation of multilayer perceptrons with on-chip learning has mainly targeted the implementation of algorithms like backpropagation. Although backpropagation is efficient, its implementation in analog VLSI requires excessive computational hardware. In this paper we show that, for analog parallel implementations, the use of gradient descent with direct approximation of the gradient using "weight perturbation" instead of backpropagation significantly reduces hardware complexity. Gradient descent by weight perturbation eliminates the need for derivative and bidirectional circuits for on-chip learning, and access to the output states of neurons in hidden layers for off-chip learning. We also show that weight perturbation can be used to implement recurrent networks. A discrete level analog implementation showing the training of an XOR network as an example is described.

1 Introduction
Many researchers have recently proposed architectures for very large scale integration (VLSI) implementations of the multilayer perceptron (MLP). Most of the reported work has addressed digital implementation (Huang and Kung 1989). Furman and associates (1988) have reported an analog implementation of backpropagation. In both digital and analog reported work, backpropagation was selected because of its efficiency and popularity. The common update rule for backpropagation (excluding momentum) is

$$\Delta w_{ij} = \eta\, x_j\, \delta_i$$

with

$$\delta_i = \begin{cases} f'(net_i)\,(T_i - x_i) & \text{if } i \text{ is an output neuron} \\ f'(net_i) \sum_k \delta_k w_{ki} & \text{if not} \end{cases}$$

Neural Computation 3, 546-565 (1991) © 1991 Massachusetts Institute of Technology
Figure 1: Feedforward architecture without provisions for training.
where $net_i$ is the net input to neuron $i$, $x_i$ is the output of neuron $i$, $T_i$ is the training value for output neuron $i$, and $\eta$ is the learning rate. A schematic showing information flow during normal operation (from $j$ to $i$) is shown in Figure 1. For fully parallel analog implementations of backpropagation (BP), the requirements for on-chip and in-loop training are different.
Figure 2: Feedforward architecture with backpropagation training.
For on-chip training the actual update rule needs to be implemented on chip. Many approaches are possible according to tradeoffs between speed and hardware cost.¹ If area is not an issue, on-chip training using backpropagation can be achieved in constant time. Figure 2 depicts the schematic of information flow for backpropagation training when fully bidirectional information paths are available. The constant update speed in this case corresponds to the roundtrip (sum of feedforward and backward passes) propagation delay.

¹By hardware cost we mean the chip area, power, and design time and complexity.
The hardware cost in this case is

$$H(N_T, N_I, N_O) = O(N_T^2) + O(N_T) + O(N_T - N_O) + O(N_T - N_I) \qquad (1.1)$$
where $N_T$ is the total number of neurons, $N_I$ is the number of neurons in the input layer, and $N_O$ is the number of neurons in the output layer, for a fully interconnected network. On the other hand, if area is to be reduced then the multiplication hardware of the update rule can be shared across the weights and the update speed (in terms of clock cycles required to update all weights) is proportional to the number of weights. The hardware cost in this case is

$$H(N_T, N_I, N_O) = O(N_T) + O(N_T - N_O) + O(N_T - N_I) \qquad (1.2)$$
For in-loop training using backpropagation many options are also possible, with the bottom line being that neuron output states are required to be communicated through $\log N$ pads in addition to the off-chip access to the weights. The most likely of the options to be considered are related to the way the derivatives $f'(net_i)$ are evaluated: either on- or off-chip. If they are evaluated on-chip, then evaluation circuits and $\log N$ pads are required for an $N$-neuron network. If the derivatives are evaluated off-chip then either $\log N$ pads are required to communicate the $net_i$ or they are evaluated using the already communicated neuron output states and the weights. The latter evaluation involves a number of multiply/accumulate cycles that is proportional to the number of weights. Clearly from the above, the analog VLSI implementation of learning based on backpropagation incurs massive hardware costs (bidirectional synapses, bidirectional neurons, and multiplication circuits) to accommodate forward and backward passes in the case of on-chip learning, and a massive number of pads and derivative circuitry in the case of in-loop training. Recently, the Madaline Rule III (MR III) was suggested as an alternative to backpropagation for analog implementation (Widrow and Lehr 1990) with a cheaper hardware cost. This rule can be considered as implementing gradient evaluation using "node perturbation" according to

$$\Delta w_{ij} = -\eta\, \frac{\Delta E}{\Delta net_i}\, x_j \qquad (1.3)$$
where $net_i = \sum_j w_{ij} x_j$ and $x_i = f(net_i)$, with $f$ being the nonlinear squashing function. Figure 3 illustrates the flow of information for "node perturbation." Therefore, in addition to the actual hardware needed for the normal operation of the network, the implementation of the MR III learning rule for an $N$-neuron network in analog VLSI requires

• An addressing module with wires routed to select and deliver the perturbation to each neuron.
• Either one or $N$ multiplication modules to compute the term $(\Delta E/\Delta net_i)\,x_j$ in addition to the multiplication by the learning rate. If one multiplier is used then additional multiplexing hardware is required.
• An addressing module with wires routed to select and read the $x_j$ terms.

Figure 3: Feedforward architecture with "node perturbation" training. $\Delta E$ is generated by measuring the difference between $E$ before and after a node perturbation is applied.
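The node perturbation idea can be sketched numerically as follows. This is a minimal illustration, not the MR III hardware: it assumes a single layer of tanh neurons with a squared-error measure, and the function names `forward`, `error`, and `node_perturbation_update` are hypothetical. Each neuron's net input is perturbed in turn, the resulting change in error is measured, and the ratio $\Delta E/\Delta net_i$ is multiplied by the inputs $x_j$ and the learning rate.

```python
import numpy as np

def forward(W, x, net_offset=None):
    """Outputs of a single layer of tanh neurons; net_offset perturbs the net inputs."""
    net = W @ x
    if net_offset is not None:
        net = net + net_offset
    return np.tanh(net)

def error(y, t):
    """Total squared error at the output."""
    return 0.5 * np.sum((t - y) ** 2)

def node_perturbation_update(W, x, t, pert=1e-4, eta=0.1):
    """Weight changes -eta * (dE / dnet_i) * x_j, one perturbed neuron at a time."""
    base = error(forward(W, x), t)          # unperturbed error
    dW = np.zeros_like(W)
    for i in range(W.shape[0]):
        offset = np.zeros(W.shape[0])
        offset[i] = pert                    # perturb the net input of neuron i only
        delta_e = error(forward(W, x, offset), t) - base
        dW[i, :] = -eta * (delta_e / pert) * x
    return dW

W = np.array([[0.2, -0.1], [0.4, 0.3]])
x = np.array([1.0, -0.5])
t = np.array([0.5, -0.5])
print(node_perturbation_update(W, x, t))
```

Note that one error measurement per neuron is needed, but the inputs $x_j$ must still be read out and multiplied, which is the hardware requirement listed in the bullets above.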
Note that if greater training flexibility is required in the sense of off-chip access to the gradient values, then the states of the neurons ($x_i$) would need to be made available off-chip as well, which will require a multiplexing scheme and $\log N$ chip pads. The complexity for node perturbation is then

$$H(N_T, N_I, N_O) = O(N_T) + O(N_T - N_O) \qquad (1.4)$$

where $N_T$ is the total number of neurons, $N_I$ is the number of neurons in the input layer, and $N_O$ is the number of neurons in the output layer, for a fully interconnected network. In this paper we propose "weight perturbation" (WP) as an alternative approach to the MR III and backpropagation for analog VLSI implementations. With WP the gradient is approximated by a finite difference. We show in this paper that gradient evaluation using WP greatly reduces hardware implementation cost and complexity, for both on-chip and in-loop training, and can equally be used to train recurrent networks.

2 Gradient Evaluation Using Weight Perturbation
The gradient with respect to the weight can simply be evaluated by using the gradient approximation method of finite difference:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\Delta E}{\Delta_{pert} w_{ij}} + O(\Delta_{pert} w_{ij}) \qquad (2.1)$$

$$= \frac{E(w_{ij} + pert) - E(w_{ij})}{pert} + O(pert) \qquad (2.2)$$

where the step size $\Delta_{pert} w_{ij}$ is equal to the perturbation signal $pert$, and the particular finite difference method used is the forward difference method. The weight update rule then becomes
$$\Delta w_{ij} = -\eta\, \frac{E(w_{ij} + pert) - E(w_{ij})}{pert} \qquad (2.3)$$

where $E(\cdot)$ is the total square error produced at the output of the network for a given pair of input and training patterns and a given value of the weights. The order of the error of the finite difference approximation can be improved from $O(pert)$ to $O(pert^2)$ by using the central difference method, so that

$$\frac{\partial E}{\partial w_{ij}} = \frac{\Delta E}{\Delta_{pert} w_{ij}} + O(\Delta_{pert} w_{ij}^2) = \frac{E[w_{ij} + (pert/2)] - E[w_{ij} - (pert/2)]}{pert} + O(pert^2) \qquad (2.4)$$

where again the step size $\Delta_{pert} w_{ij}$ is equal to the perturbation signal $pert$, and the weight update rule becomes

$$\Delta w_{ij} = -\eta\, \frac{E[w_{ij} + (pert/2)] - E[w_{ij} - (pert/2)]}{pert} \qquad (2.5)$$
Figure 4: Feedforward architecture with "weight perturbation" training. $\Delta E$ is generated by measuring the difference between $E$ before and after a weight perturbation is applied.
however, the central difference method requires two forward relaxations of the network per weight (one for each sign of the perturbation), whereas the forward difference method requires one per weight plus a single unperturbed relaxation. Thus either method can be selected on the basis of a speed/accuracy tradeoff. Figure 4 illustrates the concept of information flow for "weight perturbation" training. Note that, as $\eta$ and $pert$ are both constants, the analog implementation version can simply be written as

$$\Delta w_{ij} = G(pert)\, \Delta E(w_{ij}, pert) \qquad (2.6)$$
with

$$G(pert) = -\frac{\eta}{pert}$$

and

$$\Delta E(w_{ij}, pert) = E(w_{ij} + pert) - E(w_{ij})$$

The weight update hardware involves the evaluation of the error with perturbed and unperturbed weight and then the multiplication by a constant. It should now be clear that there are significant variations in arithmetic complexity for the training techniques described here. The order of arithmetic operations required for the three techniques, in both on-chip and in-loop training modes, and where $N$ is the total number of neurons in the network and $K$ is some constant, is shown in Table 1.

Table 1: Arithmetic Order of Complexity for Training Techniques.

Technique             Arithmetic order
                      On-chip    In-loop
Backpropagation       O(1)       O(N^2)
Node perturbation     O(N)       O(N^2)
Weight perturbation   O(W)       O(2W) = O(W)

There is a decrease in arithmetic efficiency as the hardware architecture optimality increases, such that speed can be addressed from a complexity point of view rather than from a technological point of view. Note, however, that in-loop backpropagation has the same arithmetic order of complexity as on-chip weight perturbation but, as is shown below, has a higher hardware complexity. The complexity of the hardware for on-chip and in-loop WP is

$$H(N_T, N_I, N_O) = O(K) \qquad (2.7)$$
where $K$, $N_T$, $N_I$, and $N_O$ are as defined previously. The comparison of equations 1.1, 1.2, 1.4, and 2.7 shows qualitatively that WP requires the least hardware complexity of the three training methods described and is in fact optimal, as will be shown in Section 4.1. WP is then equivalent to any other gradient descent method with the modification that the search direction is generated on a per weight basis using a finite difference method, and should not be confused with a "blind search" method. As with all gradient descent methods, WP is susceptible to being trapped in local minima; however, the various techniques used for escaping or avoiding these in other gradient descent optimization techniques are also applicable.
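The difference in accuracy between the forward difference of equation 2.2 and the central difference of equation 2.4 can be checked with a small numerical sketch. The single tanh neuron, the particular weight, input, and target values, and the function names below are assumptions chosen for illustration only; the analytic gradient is included purely as a reference against which both estimates are compared.

```python
import numpy as np

def error(w, x, t):
    """Squared error of a single tanh neuron (illustrative stand-in for E)."""
    return 0.5 * (t - np.tanh(w @ x)) ** 2

def analytic_gradient(w, x, t):
    """Exact dE/dw, used only as a reference."""
    y = np.tanh(w @ x)
    return -(t - y) * (1.0 - y ** 2) * x

def forward_difference(w, x, t, pert):
    """Equation 2.2: one perturbed evaluation per weight plus one unperturbed."""
    base = error(w, x, t)
    grad = np.zeros_like(w)
    for k in range(w.size):
        w_pert = w.copy()
        w_pert[k] += pert
        grad[k] = (error(w_pert, x, t) - base) / pert
    return grad

def central_difference(w, x, t, pert):
    """Equation 2.4: two perturbed evaluations per weight."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        w_plus = w.copy()
        w_minus = w.copy()
        w_plus[k] += pert / 2
        w_minus[k] -= pert / 2
        grad[k] = (error(w_plus, x, t) - error(w_minus, x, t)) / pert
    return grad

w = np.array([0.3, -0.2, 0.1])
x = np.array([1.0, 0.5, -1.0])
t = 0.9
exact = analytic_gradient(w, x, t)
err_forward = np.max(np.abs(forward_difference(w, x, t, 1e-2) - exact))
err_central = np.max(np.abs(central_difference(w, x, t, 1e-2) - exact))
print("forward difference error:", err_forward)
print("central difference error:", err_central)
```

For the same perturbation size the central difference estimate is closer to the reference gradient, at the price of twice as many perturbed error evaluations, which is the speed/accuracy tradeoff discussed in this section.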
Figure 5: Architecture for the implementation of "weight perturbation" training.
3 An Architecture for WP Training
Figure 5 shows an architecture for implementing WP. The modules shown in dashed lines are those required for on-chip learning. If they are omitted, in-loop learning can be performed with no hardware beyond that needed for the normal operation of the network. The hardware cost of the learning modules (in dashed lines) is independent of the size of the network. The row and column shift registers are used for weight addressing. As there is no need for random access to the weights, address generation for weight update/access can be done using simple shift registers. This saves valuable chip pads (normally used for weight addressing) and eliminates the need for multiplexing/decoding hardware.
The error module has the following functions:
• It stores the feedforward mean square error.
• It evaluates the error difference [error(perturbed weight) - error(unperturbed weight)].
The weight update module produces the new value of a weight given its old value, the error difference, the perturbation strength (which is a constant), and the learning rate. WP learning is implemented according to the following procedure:
A. Apply a pattern to the input of the network.
B. Reset column and row weight decoding shift registers. Clear total error.
C. Measure (calculate) and save error. Add error to total error.
D. Apply perturbation to current weight.
E. Measure (calculate) and save error difference.
F. Remove perturbation and update current weight.
G. Shift weight row decoding shift register.
H. If end of row, then shift column register.
I. If not end of column, go to K.
J. If total error < criteria then stop; else go to B.
K. Go to C (or D if error saved in C is safe).
Note that row shifting corresponds to selecting a weight of an adjacent neuron. This is in contrast to selecting the next weight of the same neuron, which results in slower learning.
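The addressing order implied by steps G and H can be sketched in a few lines of Python. This is only an illustration under the assumption that rows index neurons and columns index the inputs of each neuron; the function name `shift_register_order` and the 2 x 3 example are hypothetical and are not part of the hardware described here.

```python
from typing import Iterator, Tuple


def shift_register_order(n_neurons: int, n_inputs: int) -> Iterator[Tuple[int, int]]:
    """Yield (neuron, input) weight addresses in the order of steps G and H.

    Assumption: the row register selects the neuron and the column register
    selects which of that neuron's weights is addressed.
    """
    for input_index in range(n_inputs):        # column register: shifted at end of row
        for neuron_index in range(n_neurons):  # row register: shifted after every update
            yield neuron_index, input_index


# Hypothetical layer with 2 neurons and 3 weights per neuron.
order = list(shift_register_order(n_neurons=2, n_inputs=3))
```

In this ordering, consecutive updates always touch weights of adjacent neurons, and two weights of the same neuron are separated by a full pass over the row.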
4 Hardware Cost Comparison with BP and MR III
WP is ideal for analog VLSI implementation for the following reasons:
1. As the gradient $\partial E/\partial w_{ij}$ is approximated by $(E_{pert} - E)/\Delta_{pert} w_{ij}$ (where $\Delta_{pert} w_{ij}$ is the perturbation applied at weight $w_{ij}$), no backpropagation pass is needed and only the forward path is required. This means, in terms of analog VLSI implementations, that no bidirectional circuits and hardware are needed. The hardware used for the operation of the network is used for the training. Only single simple circuits to implement the weight update are required. This considerably simplifies the implementation.
2. Compared to node perturbation, our technique does not require the two neuron addressing modules, routing, and extra multiplication listed above. WP does not require any overheads (in routing and addressing connections to every neuron) to deliver the perturbations, as the same wires used to access the weights are used to deliver weight perturbations. Furthermore, node perturbation requires extra routing to access the output state of each neuron and extra multiplication hardware for the $(\Delta E/\Delta \mathrm{net}_i)\,x_j$ terms, which is not the case with weight perturbation. Finally, with weight perturbation, the approximated gradient values can be made available if needed at a rather low cost.²
4.1 Hardware Implementation Optimality. The optimality of WP-based learning for analog implementation is considered in terms of the dependence of the hardware cost (area, design time, complexity) on network size (number of neurons, number of synapses). This can be seen from the architecture sketched in Figure 5: the hardware required for learning, shown in dashed lines, is constant in size. We do not account here for the hardware required for error generation, which is at most linear in the number of output neurons, because this hardware is needed for any error (gradient) directed learning. Let us define
$C_{wo}$: cost of hardware implementation of a network with no training support (only normal operation).
$C_{wt}$: cost of hardware implementation with training support.
We consider optimal an implementation of a neural network with a training algorithm where
$$C_{wt} = C_{wo} + \mathrm{constant}$$
That is, the hardware implementation cost of on-chip training support is not dependent on network size. To prove the optimality of analog implementation of WP-based training, we will consider the two cases: in-loop and on-chip training.
WP with In-Loop Training. In this case, WP does not require any additional hardware or access to neuron states. This assumes that access to the weights is already in place, which is the case for multilayered networks. For example, if the weights are stored on capacitors and refreshed from digital RAM, then the perturbation can be applied to the weights in RAM according to the WP algorithm and downloaded to the capacitive storage at the next refresh. In contrast, BP and MR III require access to neuron states. As with WP the actual hardware used for the normal operation of the network is sufficient for in-loop training, this proves the optimality of WP in this case.
² If the mean square error is required off-chip then only one single extra pad is required. Otherwise, if approximated gradient values are to be calculated off-chip, then no extra chip area or pads are required as the output of the network would be accessible anyway.
WP with On-Chip Training. The extra hardware required to add an on-chip learning capability to an analog neural network in the case of BP and MR III has been outlined in the sections above. To prove the optimality of on-chip WP-based training, it is sufficient to note that only the following hardware is needed for its implementation:
• Perturbation Generation: This is a single module that drives the weight connection lines and delivers the constant strength perturbation.
• Error Difference Evaluation: This is a single module attached to the output neurons and evaluates the TMSE.
• Weight Update: This is a single module that evaluates the weight update.
and that all of this additional hardware cost is independent of network size.
5 Search Efficiency and Complexity
WP follows the same search procedure as backpropagation (i.e., gradient descent) if the perturbation applied to the weights is small. The ability to select any weight for perturbation provides a means of examining restricted regions of the error surface, which in turn allows the development of training heuristics that can make use of second order information at minimal computational cost. An example of such a heuristic is that the learning rate is increased when the interval between perturbing weights of adjacent neurons is less than the interval between perturbing adjacent weights of the same neuron,³ and is best when the former interval is minimized and the latter interval is maximized.
³ This phenomenon was realized by Yun Xie, visiting scholar at SEDAL from Tsinghua University.
6 Simulations
The "weight perturbation" technique was used on two test cases: XOR (feedforward and recurrent) and intracardiac electrograms (ICEG). The learning procedure was implemented as shown in Figure 6.
for each pattern p {
    E = ForwardPass()
    ClearDeltaWeights()
    for each weight w_ij do {
        Epert = ApplyPerturbate(w_ij)
        DeltaError = Epert - E
        DeltaW[i][j] = -eta * DeltaError / Perturbation
        RemovePerturbation(w_ij)
    }
}
Figure 6: Weight perturbation algorithm in its simplest form. This procedure can be used either for on-line or batch training.
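A minimal runnable sketch of the loop in Figure 6 is given below in Python, assuming a 2-2-1 sigmoid network on XOR. The helper names (`xor_error`, `wp_deltas`), the flat weight layout, the application of all weight changes after the inner loop, and the reading of "initial weight range 0.3" as uniform values in [-0.3, 0.3] are assumptions made for illustration; they are not taken from the authors' simulator.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def xor_error(w, pattern):
    """Squared error of a 2-2-1 sigmoid network on one (x1, x2, target) pattern."""
    x1, x2, target = pattern
    h1 = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
    h2 = sigmoid(w[3] * x1 + w[4] * x2 + w[5])
    y = sigmoid(w[6] * h1 + w[7] * h2 + w[8])
    return (y - target) ** 2


def wp_deltas(error_fn, w, pert=1e-5, eta=0.3):
    """Inner loop of Figure 6: one forward pass per perturbed weight, no backward pass."""
    e = error_fn(w)                      # E = ForwardPass()
    deltas = [0.0] * len(w)              # ClearDeltaWeights()
    for k in range(len(w)):
        w[k] += pert                     # apply the perturbation
        e_pert = error_fn(w)
        w[k] -= pert                     # remove the perturbation
        deltas[k] = -eta * (e_pert - e) / pert
    return deltas


PATTERNS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
rng = random.Random(0)
weights = [rng.uniform(-0.3, 0.3) for _ in range(9)]  # assumed initial range
for epoch in range(200):
    for p in PATTERNS:                   # on-line mode: update after every pattern
        d = wp_deltas(lambda w: xor_error(w, p), weights)
        weights = [wk + dk for wk, dk in zip(weights, d)]
```

Because `wp_deltas` only calls the error function, the same routine could be pointed at a hardware measurement instead of a simulated forward pass, which is the property the in-loop mode relies on.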
Table 2: Configuration Parameters for Training XOR.

Parameter               Value
Perturbation strength   0.00001
Learning rate           0.3
Convergence criteria    0.001
Initial weight range    0.3
Sensitivity criteria    0.3
6.1 XOR.
6.1.1 Feedforward XOR. The XOR network used has one hidden layer with two hidden units, two input units acting as pins, and one output unit. The training was done in the on-line mode. The network parameters are shown in Table 2. The total mean squared error is shown in Figure 7 for training with both backpropagation and weight perturbation. As Figure 7 shows, and as one may expect, the overall shape of the error as a function of training iteration is very similar. All four XOR patterns are trained in 145 iterations with both techniques (the final average mean square errors are, however, not equal). A study of the weights produced by both techniques shows that they are extremely similar (the differences were not visible from weight density plots).
Figure 7: Mean square error for XOR training using BP (xor-bp.err) and WP (xor-wp.err).

Table 3: Parameters of the XOR Recurrent Network.

Parameter                     RBP        RWP
Perturbation strength         NA         0.001
Neuron relaxation constant    0.01       0.01
Weight relaxation constant    0.1        NA
Network stability constant    0.000001   0.000001
Learning rate                 0.3        0.3
Convergence criteria          0.1        0.1
Initial weight range          0.7        0.7
Sensitivity criteria          0.3        0.3
6.1.2 Recurrent XOR. A multilayer recurrent network was trained using weight perturbation. The architecture of the network is shown in Figure 8 and the training parameters are shown in Table 3. The same architecture was trained using recurrent backpropagation based on the algorithm of Pineda (1989). The training error curves are shown in Figure 9. Although the two training techniques started from identical initial conditions, the convergence speed and the final weight solution differed between the two techniques. This may be attributed to different learning steps (perturbation strength) in the case of the weight perturbation technique.
Figure 8: Architecture of the XOR recurrent network.
6.2 ICEG Classification. Another of our tests is the training of a three-layer perceptron to classify ICEG. The size of the training set is 120 patterns, and the network has 21 input units, 10 hidden units, and 5 output units. Figure 10 shows the mean square error for weight perturbation and backpropagation training. Following the training, we tested the trained networks on a set of 2600 patterns. Training with backpropagation and with weight perturbation led to an identical performance of 91% correct classification.
Figure 9: Mean square error for WP and recurrent BP training.
Figure 10: Mean square error for intracardiac electrogram training using BP (iceg-bp.err) and WP (iceg-wp.err).
7 Implementation
To show the feasibility of learning with analog implementation of weight perturbation, we have constructed a discrete component implementation of an XOR network. Figure 12 shows a block diagram of the network used and Figure 11 shows a picture of the hardware implementation (synapse and neuron boards). In addition, a PC was provided as a controller to orchestrate the presentation of training vectors and weight updates.
Figure 11: Picture of the XOR hardware (synapse and neuron boards).
The weights are stored as voltages on capacitors, using sample-and-hold circuits that are periodically updated. The voltage range for a signal is ±10.0 V and the weight values also have a range of ±10.0 V. This means that a mean square error of 10.0 V² indicates that the network output signal and the training signal vary by 32%. Figure 13 shows a training session of the XOR network reaching a mean square error of 8.0 V² using the weight perturbation algorithm. The apparent noise is due to A/D sampling errors and to noise on the network caused by weight and training vector refresh. We note that convergence occurs in spite of this noise level, demonstrating the robustness of the weight perturbation optimizing technique to noise.
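The 32% figure quoted for a mean square error of 10.0 V² can be reproduced with a short calculation. The sketch below assumes that the percentage is the root-mean-square deviation divided by the 10 V full-scale amplitude; that reading is an interpretation of the text, and the function name `rms_fraction` is hypothetical.

```python
import math

FULL_SCALE_V = 10.0  # signals span -10 V to +10 V (from the text)


def rms_fraction(mse_v2: float, full_scale_v: float = FULL_SCALE_V) -> float:
    """RMS deviation as a fraction of the full-scale amplitude (assumed definition)."""
    return math.sqrt(mse_v2) / full_scale_v


fraction_at_10 = rms_fraction(10.0)  # sqrt(10)/10, about 0.316
fraction_at_8 = rms_fraction(8.0)    # sqrt(8)/10, about 0.283
```

Under the same assumption, the 8.0 V² level reached in Figure 13 corresponds to a deviation of roughly 28% of full scale.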
Figure 12: XOR hardware implementation circuit block diagram.
8 Conclusions
In this paper we show that weight perturbation is a very cheap and flexible learning technique for analog implementations of neural networks. We also show that it is more flexible than backpropagation and node perturbation (MR III). We demonstrate using simulations that weight perturbation produces the same performance as backpropagation and recurrent backpropagation. A discrete analog implementation was used to demonstrate the feasibility of multilayer feedforward training using weight perturbation. The same technique can be used to train simple recurrent networks (like Elman networks) and continuously running recurrent networks for temporal sequence recognition (like Williams and Zipser networks). For all these networks it is easy to see that, as far as training is concerned, the hardware implementation using a weight perturbation architecture is very similar to that required for the normal operation of the networks.
Figure 13: Mean square error of training hardware XOR network.
Acknowledgments
This research is supported by the Australian Research Council and a Sydney University Special Project grant.
Furman, B., White, J., and Abidi, A. 1988. CMOS analog implementation of back propagation algorithm. In Abstracts of the First Annual INNS Meeting, Boston, p. 381.
Hwang, J. N., and Kung, S. Y. 1989. Parallel algorithms/architectures for neural networks. VLSI Signal Process., 221-251.
Pineda, F. J. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comp. 1(2), 161-172.
Widrow, B., and Lehr, M. A. 1990. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. IEEE 78(9), 1415-1442.
Received 22 February 1991; accepted 3 May 1991.
Communicated by John Moody
Predicting the Future: Advantages of Semilocal Units
Eric Hartman
James D. Keeler
Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
In investigating gaussian radial basis function (RBF) networks for their ability to model nonlinear time series, we have found that while RBF networks are much faster than standard sigmoid unit backpropagation for low-dimensional problems, their advantages diminish in high-dimensional input spaces. This is particularly troublesome if the input space contains irrelevant variables. We suggest that this limitation is due to the localized nature of RBFs. To gain the advantages of the highly nonlocal sigmoids and the speed advantages of RBFs, we propose a particular class of semilocal activation functions that is a natural interpolation between these two families. We present evidence that networks using these gaussian bar units avoid the slow learning problem of sigmoid unit networks, and, very importantly, are more accurate than RBF networks in the presence of irrelevant inputs. On the Mackey-Glass and Coupled Lattice Map problems, the speedup over sigmoid networks is so dramatic that the difference in training time between RBF and gaussian bar networks is minor. Gaussian bar architectures that superpose composed gaussians (gaussians-of-gaussians) to approximate the unknown function have the best performance. We postulate that an interesting behavior displayed by gaussian bar functions under gradient descent dynamics, which we call automatic connection pruning, is an important factor in the success of this representation.
1 Introduction
Modeling nonlinear time series directly from data is a classic and very difficult problem to which the application of dynamic systems theory and connectionist methods may be of great benefit. Several powerful new algorithms have been presented recently that significantly improve our ability to automatically model nonlinear time series. Building on the ideas of Packard et al. (1980) for reconstructing the attractor from a time series in a time-delayed embedding space, Farmer and Sidorowich (1987, 1988) and Crutchfield and McNamara (1987) proposed local prediction methods that approximate the future trajectory of the current state with a simple function of the trajectories of the current state's nearest neighbors on the attractor. For the same Mackey-Glass time series benchmark problem, Lapedes and Farber (1987) showed that backpropagation neural networks with sigmoid activation functions (Rumelhart et al. 1986) can also perform quite well with a minimal amount of data. Subsequently, Casdagli (1989) and Moody and Darken (1989) showed that gaussian radial basis function (RBF) neural networks predict nonlinear time series with high accuracy. The latter reported that the RBF networks learned the Mackey-Glass problem on the order of 1000 times faster than the Lapedes and Farber network.
In this article we consider a certain class of semilocal activation functions, which respond to more localized regions of input space than sigmoid functions but less localized regions than RBFs. In particular, we examine "gaussian bar" functions, which sum the gaussian responses from each input dimension. We present evidence that gaussian bar networks avoid the slow learning problems of sigmoid networks, deal more robustly with irrelevant inputs, and perform better on a variety of prediction problems than RBF networks.
Neural Computation 3, 566-578 (1991) © 1991 Massachusetts Institute of Technology
2 Representation, Scaling, and Irrelevant Inputs
The simplest version of the Moody and Darken (1989) training algorithm for RBF networks consists of three sequential steps: (1) adjust the RBF centers $\mu_\alpha$ in input space according to k-means clustering (competitive learning), (2) set each RBF width $\sigma_\alpha$ equal to the distance from center $\mu_\alpha$ to its nearest neighbor $\mu_\alpha^{\mathrm{nearest}}$, and (3) set the RBF-to-output weights $w_\alpha$ according to standard LMS (gradient descent in the network error). A primary disadvantage of RBF networks is that a high dimensional attractor may require a very large number of RBFs for accurate approximation.¹ The scaling is essentially exponential: to uniformly cover a d-dimensional cube with subcubes of side length $1/N$ requires $N^d$ subcubes. Note that the scaling exponent d is the dimension of the attractor submanifold, not of the embedding space. However, if some inputs are irrelevant variables with substantial variance, which is a very common situation in certain real-world problems, the scaling exponent is the dimension of the product space of the attractor times the irrelevant inputs. This is a critical problem for the RBF representation in cases where it is impossible a priori to distinguish relevant from irrelevant variables. Figure 1a illustrates the irrelevant variable problem. Input variable $x_1$ is related to the output y as shown, and the input domain is covered by RBFs. In Figure 1b the irrelevant input variable $x_2$ has been added: y is independent of $x_2$, yet the entire $x_1 \otimes x_2$ input space must be covered with RBFs; the dimensionality of the attractor is artificially inflated.
¹ This has been pointed out by several researchers, including Weigend et al. (1990), which appeared near the end of our study and to which our title alludes.
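The three sequential steps of the simplest Moody and Darken procedure can be sketched as follows. This is an illustrative pure-Python version, not the authors' code: the gaussian normalization `exp(-d^2 / (2 sigma^2))`, the added bias term, the learning rate, the number of units, and the toy one-dimensional target are all assumptions chosen to keep the example small.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(points, k, iters=20, seed=0):
    """Step 1: place the RBF centers by k-means clustering."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        for c, g in enumerate(groups):
            if g:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def nearest_neighbor_widths(centers):
    """Step 2: width of each RBF = distance to its nearest neighboring center."""
    return [max(min(dist(c, o) for j, o in enumerate(centers) if j != i), 1e-6)
            for i, c in enumerate(centers)]


def rbf_activations(x, centers, widths):
    return [math.exp(-dist(x, c) ** 2 / (2.0 * s ** 2)) for c, s in zip(centers, widths)]


def train_lms(inputs, targets, centers, widths, lr=0.1, epochs=200):
    """Step 3: fit the RBF-to-output weights (plus an assumed bias) by LMS."""
    w, bias = [0.0] * len(centers), 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            phi = rbf_activations(x, centers, widths)
            err = t - (bias + sum(wi * pi for wi, pi in zip(w, phi)))
            w = [wi + lr * err * pi for wi, pi in zip(w, phi)]
            bias += lr * err
    return w, bias


# Toy one-dimensional problem (an assumption for illustration only).
inputs = [(i / 49.0,) for i in range(50)]
targets = [4.0 * x[0] * (1.0 - x[0]) for x in inputs]
centers = kmeans(inputs, k=8)
widths = nearest_neighbor_widths(centers)
weights, bias = train_lms(inputs, targets, centers, widths)
predictions = [bias + sum(wi * pi for wi, pi in zip(weights, rbf_activations(x, centers, widths)))
               for x in inputs]
mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
mean_t = sum(targets) / len(targets)
variance = sum((t - mean_t) ** 2 for t in targets) / len(targets)
```

Only step 3 uses the output error; the centers and widths are fixed from the input distribution alone, which is why irrelevant input dimensions cannot be discounted by this procedure.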
Figure 1: (a) RBFs covering the $x_1$ input space. (b) Attractor dimensionality artificially inflated by irrelevant input $x_2$.
Techniques to factor out irrelevant input dimensions and leave only the core attractor are clearly desirable. The problem cannot be solved by simply eliminating inputs that are weakly correlated with the output, because correlations capture only linear relations and inputs strongly but nonlinearly related to the output might inadvertently be eliminated. In principle the backpropagation algorithm (Rumelhart et al. 1986) offers a solution. Backpropagation networks can learn to ignore irrelevant inputs because the output error is minimized in setting the parameters in every layer, not just the hidden-to-output layer as in the algorithms of Moody and Darken (1989) or even Saha and Keeler (1990). If we allow the centers, widths, and weights of an RBF network to all vary according to gradient descent, overcoming the problems of dimensionality and irrelevant inputs requires increasing the initial RBF widths to semilocal sizes. Also, relaxing the radial constraint and allowing a different width for each dimension can improve performance. In our experience, however, the gradient descent behavior of these "elliptical basis functions" is inferior to that of a different semilocal function that we now describe.
3 Semilocal Functions: Gaussian Bars
An RBF unit responds to a small localized region of the input space (Fig. 2d). At the other extreme, a sigmoid unit responds to a highly nonlocalized region by partitioning the input space with a (smeared) hyperplane (Fig. 2a). RBFs have a greater ability to discriminate regions of the input space, but this enhanced discrimination can come at the expense of a great increase in resource requirements.
Figure 2: Spectrum of localized activation functions. Activation locates the input in increasingly confined regions of the input space: (a) sigmoid, (b) gaussian bar, (c) 1-D bar, (d) radial basis function (RBF). Since in a network each 1-D bar would connect to a single input, these units would not be useful unless combined. Their product (f) is equivalent to a single RBF unit, while their weighted sum (e) is equivalent to a single gaussian bar unit. Rotation, skewing, etc., of a gaussian bar unit with respect to the input axes can be accomplished by an intervening layer of units.
To overcome this tradeoff we propose the "gaussian bar" activation function (Fig. 2b), which sums the gaussian responses of each input dimension:
(3.1)
where i indexes the gaussian bar units and j the input dimensions. For comparison, we write the gaussian RBF as a product
(3.2)
If we think of each gaussian as representing a condition of locality in each dimension, the gaussian bars respond if any of these conditions are satisfied (OR) while RBFs require all conditions to be satisfied (AND). Note that gaussian bars have three times as many network parameters as RBFs ($\sigma_{ij}$ and $w_{ij}$ in addition to $\mu_{ij}$), and hence more algorithm parameters to adjust (initial values and a learning rate for each type of network parameter). In simulations, we use a separate learning rate for each of these parameter types. Note that the range of a gaussian bar is unbounded and of either sign. Adding an adaptive bias term to equation 3.1 has sometimes improved performance but we have not methodically pursued this issue. The gradient descent equations for the weights $w_{ij}$, centers $\mu_{ij}$, and widths $\sigma_{ij}$ are
(3.3) (3.4) (3.5)
where $\delta_i = -\partial E/\partial o_i$, E is the network error, and $o_i$ is the output of unit i.
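The OR versus AND distinction between the two unit types can be made concrete with a small Python sketch. It assumes a gaussian bar of the form sum over j of w_ij exp(-(x_j - mu_ij)^2 / sigma_ij^2) and an RBF formed as the product of the same one-dimensional gaussians with a single width; the exact normalization of the width used in the paper's equations is not recoverable from this text, so the `sigma ** 2` denominator is an assumption, as are the function names and test points.

```python
import math


def gaussian_bar(x, w, mu, sigma):
    """Weighted SUM of one 1-D gaussian per input dimension (OR-like response)."""
    return sum(wj * math.exp(-((xj - mj) ** 2) / sj ** 2)
               for xj, wj, mj, sj in zip(x, w, mu, sigma))


def gaussian_rbf(x, mu, sigma):
    """PRODUCT of one 1-D gaussian per input dimension (AND-like response)."""
    return math.prod(math.exp(-((xj - mj) ** 2) / sigma ** 2)
                     for xj, mj in zip(x, mu))


mu = [0.5, 0.5]
x_both = [0.5, 0.5]   # locality condition met in both dimensions
x_one = [0.5, 5.0]    # locality condition met in the first dimension only

bar_both = gaussian_bar(x_both, w=[1.0, 1.0], mu=mu, sigma=[0.2, 0.2])
bar_one = gaussian_bar(x_one, w=[1.0, 1.0], mu=mu, sigma=[0.2, 0.2])
rbf_both = gaussian_rbf(x_both, mu=mu, sigma=0.2)
rbf_one = gaussian_rbf(x_one, mu=mu, sigma=0.2)
```

With one condition satisfied, the bar unit still responds at about half strength while the RBF response is essentially zero, which is the semilocal behavior described above.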
3.1 Automatic Connection Pruning. For a gaussian bar unit i to effectively "prune" (disconnect from) input dimension j:
1. $w_{ij}$ can become zero.
2. $\mu_{ij}$ can move away from the data (Fig. 3a).
3. $\sigma_{ij}$ can shrink to zero.²
² Numerically it is necessary to impose a nonzero limit on the widths $\sigma_{ij}$.
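A short Python sketch of the three mechanisms, using the same assumed per-dimension term w exp(-(x - mu)^2 / sigma^2) as a stand-in for one connection of a gaussian bar unit. The width floor `MIN_WIDTH`, the data grid, and the parameter values are hypothetical; in the width case the center is deliberately placed between data points so that the narrow spike is never sampled.

```python
import math

MIN_WIDTH = 1e-3  # hypothetical floor; the footnote only requires some nonzero limit


def bar_term(x, w, mu, sigma):
    """Contribution of one input dimension to a gaussian bar unit (assumed form)."""
    sigma = max(sigma, MIN_WIDTH)
    return w * math.exp(-((x - mu) ** 2) / sigma ** 2)


data = [i / 10 for i in range(11)]  # inputs observed on [0, 1]

# Largest contribution seen over the data for an active and three pruned connections.
active = max(abs(bar_term(x, 1.0, 0.5, 0.2)) for x in data)
pruned_by_weight = max(abs(bar_term(x, 0.0, 0.5, 0.2)) for x in data)   # option 1
pruned_by_center = max(abs(bar_term(x, 1.0, 5.0, 0.2)) for x in data)   # option 2
pruned_by_width = max(abs(bar_term(x, 1.0, 0.55, 0.0)) for x in data)   # option 3
```

In each pruned case the term contributes nothing over the observed data, while the terms for the unit's other input dimensions are unaffected.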
Note that these adjustments can occur completely independently for each dimension. In contrast, moving any one $\mu_{ij}$ away from the data (or shrinking $\sigma_i$ to zero) inactivates an RBF unit completely. Gaussian bar networks have greater pruning flexibility than sigmoid networks, since sigmoid units are limited to option (1) above. One expects case (1) to occur only if the input contains only noise. Cases (2) and (3), however, can occur even if the input contains relevant information; see Figure 3b for an example. Dynamic, automatic reduction of the network architecture by these mechanisms occurred in many of our simulations with gaussian bars. Since a very small width creates a spike response at $\mu_{ij}$, in such a case $\mu_{ij}$ must move to a safe location. The danger of such spikes occurring during generalization could be avoided, and storage could be reduced, by postprocessing trained networks to remove the pruned parts of the network. Also, training time could be reduced by keeping track of pruned units and excluding them from the calculations. Another commonly stated reason for pruning networks is to improve generalization. We have not carefully examined these pruning mechanisms with respect to this issue, and feel the topic deserves further investigation.
4 Test Problems
We have consistently found gaussian bar networks to perform at least as well as and frequently much better than sigmoid or RBF networks for a variety of problems. Here we compare performance on three different prediction problems. Significant effort was made to optimize the algorithm parameter settings in each simulation (learning rates, etc.). As a measure of performance we use the relative error: the root-mean-square error of the prediction divided by the standard deviation of the data. A relative error less than 1.0 indicates a prediction more accurate than "guessing" (simply predicting the mean of the data at every step).
4.1 Logistic Map with Irrelevant Inputs. To test the ability of the networks to learn to ignore irrelevant inputs, we constructed patterns consisting of one relevant and nineteen irrelevant inputs. The relevant input was the iterated logistic map
$$x(t+1) = \lambda\, x(t)\,[1 - x(t)] \qquad (4.1)$$
and the irrelevant inputs were random numbers uniform on [0,1]. With bifurcation parameter $\lambda = 3.97$ the logistic map is strongly chaotic and has a fairly low correlation with the output of -0.24; the time series appears quite random and is difficult to distinguish from the irrelevant inputs. The task was to learn to predict $x(t+1)$ given $x(t)$ and 19 random numbers as input; 20,000 patterns were generated for training and 10,000 for testing.
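The pattern construction and the relative error measure can be sketched in Python as follows. The initial condition `x0`, the random seed, the pattern count in the example, and the function names are assumptions for illustration; only the map, the value of lambda, the 19 uniform irrelevant inputs, and the definition of relative error come from the text.

```python
import math
import random

LAMBDA = 3.97  # bifurcation parameter from the text


def make_patterns(n, n_irrelevant=19, x0=0.3, seed=0):
    """Build n patterns: [x(t), 19 uniform random numbers] -> x(t+1)."""
    rng = random.Random(seed)
    x = x0
    inputs, targets = [], []
    for _ in range(n):
        x_next = LAMBDA * x * (1.0 - x)  # equation 4.1
        inputs.append([x] + [rng.random() for _ in range(n_irrelevant)])
        targets.append(x_next)
        x = x_next
    return inputs, targets


def relative_error(predictions, targets):
    """Root-mean-square prediction error divided by the standard deviation of the data."""
    n = len(targets)
    mean = sum(targets) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)
    std = math.sqrt(sum((t - mean) ** 2 for t in targets) / n)
    return rmse / std


inputs, targets = make_patterns(1000)
mean_target = sum(targets) / len(targets)
guess_error = relative_error([mean_target] * len(targets), targets)
perfect = [LAMBDA * row[0] * (1.0 - row[0]) for row in inputs]
perfect_error = relative_error(perfect, targets)
```

Predicting the mean at every step gives a relative error of 1.0, which is the "guessing" baseline mentioned above, while a predictor that has isolated the single relevant input can reach a relative error of 0.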
Table 1 shows that, as expected, the RBF networks were unable to perform well even using as many as one RBF per 10 training patterns. The small performance gain in going from 100 to 2000 RBFs is consistent with the arguments of Section 2 that the inflated input product space of the single relevant and 19 irrelevant dimensions requires $O[(1/\epsilon)^{20}]$ RBFs for a prediction accuracy of $O(\epsilon)$. Unlike the RBF networks, the backpropagation networks were able to learn to ignore the irrelevant inputs and devote their resources to the single relevant input. As in all our simulations, the gaussian bar networks performed better with a gaussian bar output unit than with a sigmoid or linear output unit. With a gaussian bar output unit, the network superposes compositions of gaussian functions, as shown in Figure 3.

Table 1: Performance Summary.ᵃ

Network architecture                     Total        Relative error       Epochs
                                         parameters   Training   Testing

Logistic Map with Irrelevant Inputs: 1 time step
Gaussian bars   5                           306        0.05       0.05     50
Gaussian bars   5, bar output               315        0.01       0.01     50
Sigmoids        5                           111        0.01       0.01     50
RBFs            100                        2201        0.98       0.91     50/67
RBFs            2000                      44001        0.65       0.76     37/45

Mackey-Glass: 85 time steps
LMS             0                             5        0.54       0.59     400
Gaussian bars   50                          651        0.41       0.40     400
Gaussian bars   50, bar output              750        0.22       0.28     400
Gaussian bars   20-20                      1461        0.19       0.23     400
Gaussian bars   300, bar output            4500        0.11       0.11     400
Gaussian bars   300, bar output, lrate     4500        0.06       0.08     400 (4)
Sigmoids        10-10, lrate                171        0.54       0.55     400
Sigmoids        10-10, lrate                171        0.06       0.08     200,000 (160)
Sigmoids        20-20, lrate                541        0.54       0.55     400
Sigmoids        100, lrate                  601        0.53       0.56     400
Sigmoids        300, lrate                 1801        0.69       0.84     400
RBFs            100                         601        0.30       0.29     100/300
RBFs            300                        1801        0.03       0.11     100/300
RBFs            300, wide                  1801        0.06       0.08     100/300 (2)
RBFs            500                        3001        0.02       0.18     1/300

Coupled Lattice Map: 3 time steps
Gaussian bars   10                         1511        0.02                400
Gaussian bars   10, bar output             1530        0.02                100
Sigmoids        25                         1301        0.04                24000
RBFs            100                        5201        0.35                100/500
RBFs            1000                      52001        0.09                100/2000

ᵃ Network architecture indicates the type and number of hidden units. Except for networks labeled "bar output," output units were linear for the Mackey-Glass problem and sigmoidal otherwise. LMS with 0 hidden units corresponds to the simple linear predictive method. "lrate" indicates that independent learning rates were used for network parameters in the upper and lower layers of the network. Clustering/LMS epochs are shown for RBF networks. "wide" indicates that the RBF widths were larger (by 60%) than the distance to the nearest RBF; in this case the response becomes semilocal (many RBF units respond to each input pattern). Training hours on a Sun Sparcstation-1 are shown in parentheses for the best Mackey-Glass runs.

4.2 Mackey-Glass Equation. The Mackey-Glass time series, generated by integrating the delay differential equation (Mackey and Glass 1977)
$$\frac{d}{dt}x(t) = \frac{a\,x(t-\tau)}{1 + x^{10}(t-\tau)} - b\,x(t) \qquad (4.2)$$
has become something of a standard benchmark for prediction algorithms (Farmer and Sidorowich 1987, 1988; Lapedes and Farber 1987; Casdagli 1989; Moody and Darken 1989). With $a = 0.2$, $b = 0.1$, and $\tau = 17$ the trajectory is chaotic and lies on an approximately 2.1-dimensional strange attractor. Following the previous references, we trained networks to predict $x(t+85)$ given $x(t)$, $x(t-6)$, $x(t-12)$, and $x(t-18)$ as inputs, and used 500 training and 500 testing patterns. Table 1 shows that even with the relatively small training set the asymptotic accuracies of the three kinds of networks are comparable. Learning times in Sun Sparcstation-1 hours for this problem are shown in parentheses in the epochs column. The increase in convergence speed of the gaussian bar networks compared to the sigmoid networks is dramatic. Moody and Darken (1989) conjectured that RBF networks learn much faster than sigmoid backpropagation networks at least partially because backpropagating through hidden units is avoided in RBF learning. Our results indicate instead that the slow performance of sigmoid backpropagation on this problem is an artifact of sigmoid activation functions rather than a property of backpropagation itself. For the Mackey-Glass problem, RBF networks offer no apparent dramatic speed or accuracy advantages over gaussian bar networks. The sigmoid networks seem to quickly learn the linear portion of the problem (compare LMS in the table), but the nonlinear gradient information is evidently very weak and learning is slow. Since the sigmoid net converges as quickly as other nets on other problems such as the Logistic Map with Irrelevant Inputs problem, slow learning is evidently problem dependent rather than an inherent trait of sigmoid backpropagation. Our experiments lead us to expect backpropagation using gaussian bar units to be more robust with respect to fast learning.
4.3 Coupled Lattice Map.
Coupled lattice maps were studied in Keeler and Farmer (1986) as discretized reaction-diffusion equations and as models for spatiotemporal intermittency.³ The dynamic system consists of a circular chain of lattice sites, with each site obeying a spatially averaged logistic map dynamics:

³Smooth laminar flow is randomly interspersed with chaos. Such behavior has been viewed as an extremely difficult prediction problem (Casdagli 1989). To obtain this behavior the lattice must be initialized with a "kink", for example, all sites initialized to 0.2 except sites 15-35 which are initialized to 0.8.

Figure 3: (A) By moving the center out of range of the data, the gaussian bar unit has effectively pruned, or disconnected from, its first input dimension; its output depends only on the value of $x_2$. (B) Solution found by the gaussian bar network with gaussian bar output unit for the Logistic Map with Irrelevant Inputs problem (Table 1). The output unit disconnected from hidden units 1 and 2 by moving its $\mu_1$ and $\mu_2$ out of the range of activation values exhibited by those units. Since both units had large connection weights from the relevant input, their activations contained relevant information; nevertheless, their contribution to the output unit was evidently counterproductive and they were pruned. Hidden unit 4 developed a constant activation and served as a "bias" term for the output unit. Hiddens 3 and 5 disconnected from the irrelevant inputs. (a) Response of hidden unit 3 to the relevant input: $h_3 = -0.85\,e^{-(x-0.38)^2/0.19}$. (b) Response of hidden unit 5 to the relevant input: $h_5 = -0.90\,e^{-(x-0.63)^2/0.21}$. (c) Gaussian component of the network output due to hidden unit 3 (plus half of the constant component due to hidden unit 4): $1.13\,e^{-(h_3+0.71)^2/0.31} - 0.54$ (gaussian-of-a-gaussian). (d) Gaussian component of the network output due to hidden unit 5 (plus half of the constant component due to hidden unit 4): $1.00\,e^{-(h_5+0.76)^2/0.28} - 0.54$ (gaussian-of-a-gaussian). (e) The network output approximating the logistic map, the sum of (c) and (d). Deviation of the approximation from the actual logistic map is not visually discernible.
For our experiments we used 50 lattice sites, p = 113 and X = 3.659 (2-band chaos region of the logistic map). The problem is to predict the value at a single fixed site T time steps into the future. The networks were given only the current state of the lattice as input. Due to the coupled dynamics, some information about past values is embedded in the current state. Since the relative error can differ significantly in the chaotic and laminar phases, we generated patterns continuously, alternately training for 500 patterns and testing for 500 patterns, and measured long-term averages of the relative error.⁴ Table 1 summarizes performance for predicting 3 time steps in the future. The gaussian bar networks converged 1-2 orders of magnitude faster (less data) or to a significantly more accurate solution than the other networks. Since the prediction is only 3 time steps into the future, the attractor dimension is fairly low, and the larger RBF network performed fairly well. Predicting farther into the future of course decreases accuracy. However, the gaussian bar networks were able to qualitatively predict very well the onset of chaos as far as 25 time steps in the future [10-12 dimensional attractor (Keeler and Farmer 1986)].

5 Discussion

The experiments in the previous section support our hypothesis that semilocal activation functions are more accurate than the standard RBF network methods. This is especially true if the data set contains irrelevant inputs, but is also evident in the Mackey-Glass results: increasing the widths of the RBF units to make them semilocal improved the accuracy in this test case as well. Compared to sigmoid networks, we find that gaussian bar networks learn dramatically faster on difficult problems. Since the performance of algorithms and architectures is clearly problem dependent, they should be compared on a variety of benchmark problems.
⁴Clustering in RBF networks used 10 patterns per RBF after which LMS used continuously generated data.

We have compared the semilocal gaussian bar functions to sigmoids and RBFs on three quite different function approximation problems. In these and other problems we have examined, while sigmoids showed weaknesses in some cases and RBFs in others, the performance of gaussian bar networks equaled or surpassed the other networks. While we have here restricted ourselves to time series prediction problems, preliminary investigations indicate that semilocal networks will be very useful for other pattern recognition applications as well. It remains for us to understand in detail (1) why gaussian bar units learn significantly faster than sigmoids on certain problems, (2) the robustness of approximation via composition of gaussians (hidden-output or hidden-hidden gaussian bar architectures performed better than other architectures), and (3) the importance of the automatic connection pruning mechanism inherent in the gaussian bar functions under gradient descent dynamics. Unlike the case of sigmoid or RBF networks (Hornik et al. 1989; Hartman et al. 1990; Girosi and Poggio 1990), it has not been proven that gaussian bar networks are universal approximators; because gaussian bars do not form an algebra, the proof of Hartman et al. (1990) and Girosi and Poggio (1990) does not carry over. Gaussian bar networks with linear output units have similarities to the class of networks described as "basis function trees" by Sanger (1991). In a network with a linear output unit and a single layer of gaussian bar hidden units, the summing operations of the output unit and of the gaussian bar units could be combined and the network viewed as one without hidden units but with multiple connections from each input unit to the output unit. Such a structure is equivalent to a single layer Sanger tree using gaussian basis functions and all of the input dimensions. With multiple layers, however, the networks differ in the way that the gaussians are combined: a network with multiple layers of gaussian bar units composes the gaussians, whereas in a multilayer Sanger tree their combination is multiplicative.
Also, in a multilayer Sanger tree, leaf-splitting heuristics play a crucial (and problematic) role, while in a multilayer gaussian bar network, the appropriate combinations of inputs can develop through the action of the backpropagation learning algorithm, and no such heuristics are required. Finally, it is interesting to point out that the activation function in these networks was biologically as well as numerically inspired. Moody and Darken (1989) point out the similarity of RBF networks to the Albus (1971) Cerebellar model, where the RBF units are likened to the granule cells of the cerebellum. However, granule cells are more faithfully modeled by a linear sum of their inputs than by a product (Fujita 1982). If granule cells do in fact respond locally in each input dimension, then one would expect behavior more like sparsely connected gaussian bars than radial basis functions.
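The contrast drawn above between a sum and a product of one-dimensional gaussians can be made concrete with a small sketch. The functional forms below are assumptions chosen for illustration (a gaussian bar unit taken as a weighted sum of one-dimensional gaussians, an RBF unit as their product); the function names, parameter values, and test inputs are hypothetical and are not taken from the experiments reported here.

```python
import numpy as np

def rbf_unit(x, mu, sigma):
    """RBF unit: product of 1-D gaussians, i.e. local in all dimensions at once."""
    return float(np.exp(-np.sum((x - mu) ** 2 / sigma ** 2)))

def gaussian_bar_unit(x, mu, sigma, w):
    """Assumed gaussian bar form: weighted sum of 1-D gaussians (semilocal)."""
    return float(np.sum(w * np.exp(-(x - mu) ** 2 / sigma ** 2)))

mu = np.array([0.5, 0.5])
sigma = np.array([0.2, 0.2])
w = np.array([1.0, 1.0])

on_center = np.array([0.5, 0.5])
off_in_x1 = np.array([5.0, 0.5])  # far from the center in dimension 1 only

rbf_on = rbf_unit(on_center, mu, sigma)
rbf_off = rbf_unit(off_in_x1, mu, sigma)               # product collapses to ~0
bar_on = gaussian_bar_unit(on_center, mu, sigma, w)
bar_off = gaussian_bar_unit(off_in_x1, mu, sigma, w)   # dimension 2 still responds

# Pruning as in Figure 3(A): with the first center moved outside the data
# range [0, 1], the output no longer depends on x1.
mu_pruned = np.array([10.0, 0.5])
pruned_a = gaussian_bar_unit(np.array([0.1, 0.3]), mu_pruned, sigma, w)
pruned_b = gaussian_bar_unit(np.array([0.9, 0.3]), mu_pruned, sigma, w)

print(rbf_on, rbf_off, bar_on, bar_off, pruned_a, pruned_b)
```

Under these assumptions the RBF unit is silenced by a mismatch in a single dimension, whereas the gaussian bar unit keeps responding through the remaining dimension, which is the sense in which it is semilocal.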
Acknowledgments

We thank David Rumelhart and Carsten Peterson for valuable discussions. The reviewer emphasized the connections with Sanger (1991) and provided several comments that helped clarify the manuscript. We also thank Mark Derthick for useful comments on the manuscript and Mark Ring for generating some of the figures. Simulations were carried out using a modified version of the Rumelhart-McClelland simulator (Volume 3 of Rumelhart et al. 1986).
References

Albus, J. S. 1971. A theory of cerebellar functions. Math. Bio. 10, 25-61.
Casdagli, M. 1989. Nonlinear prediction of chaotic time series. Physica D 35, 335-356.
Crutchfield, J. P., and McNamara, B. S. 1987. Equations of motion from a data series. Complex Syst. 1, 417.
Farmer, J. D., and Sidorowich, J. J. 1987. Predicting chaotic time series. Phys. Rev. Lett. 59, 845-848.
Farmer, J. D., and Sidorowich, J. J. 1988. Exploiting chaos to predict the future and reduce noise. Los Alamos Preprint 88-901.
Fujita, M. 1982. Adaptive filter model of the cerebellum. Biol. Cybernet. 45, 207-214.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biol. Cybernet. 63, 169-176.
Hartman, E., Keeler, J. D., and Kowalski, J. 1990. Layered neural networks with gaussian hidden units as universal approximators. Neural Comp. 2, 210-215.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Keeler, J. D., and Farmer, J. D. 1986. Robust space-time intermittency and 1/f noise. Physica D 23, 413-435.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modeling. Los Alamos Technical Report LA-UR-87.
Mackey, M. C., and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science 197, 287.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Packard, N. H., Crutchfield, J. P., Farmer, J. D., and Shaw, R. S. 1980. Geometry from a time series. Phys. Rev. Lett. 45, 712-716.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds. The MIT Press/Bradford Books, Cambridge, MA.
Saha, A., and Keeler, J. D. 1990. Algorithms for better representation and faster learning in radial basis function networks. In Neural Information Processing Systems, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA, pp. 482-489.
Sanger, T. D. 1991. A tree-structured algorithm for reducing computation in networks with separable basis functions. Neural Comp. 3, 67-78.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193.
Received 15 January 1991; accepted 10 June 1991.
Communicated by David Lowe
Improving the Generalization Properties of Radial Basis Function Neural Networks

Chris Bishop
Neural Networks Group, AEA Technology, Harwell Laboratory, Oxfordshire OX11 0RA, United Kingdom
An important feature of radial basis function neural networks is the existence of a fast, linear learning algorithm in a network capable of representing complex nonlinear mappings. Satisfactory generalization in these networks requires that the network mapping be sufficiently smooth. We show that a modification to the error functional allows smoothing to be introduced explicitly without significantly affecting the speed of training. A simple example is used to demonstrate the resulting improvement in the generalization properties of the network.

1 Introduction
Radial basis function (RBF) neural networks (Broomhead and Lowe 1988) provide a powerful technique for generating multivariate, nonlinear mappings. Unlike the widely used technique of error backpropagation (Rumelhart and McClelland 1986) the learning algorithm for RBF networks corresponds to the solution of a linear problem. The training of the network is therefore a fast procedure. An important consideration in setting up an RBF network is the choice of the number and centers of the radial basis functions (i.e., the hidden units). The most natural choice is to let each data point in the training set correspond to a basis function center. In this case the number of degrees of freedom in the network equals the number of items of data, and the network function fits exactly through each data point. If the data have a regular behavior, but are contaminated by noise, the network will learn all the details of the individual data points, rather than representing the underlying trends in the data. This phenomenon is sometimes called overfitting. The resulting network function often has poor generalization properties as a result of the rapid oscillations that usually characterize an overfitted function. One procedure for damping out these oscillations, referred to as curvature-driven smoothing, has been developed earlier in the context of networks trained by error backpropagation (Bishop 1990). Here we show that an analogous technique can be applied in the case of RBF networks, and that the resulting trained networks do indeed exhibit improved generalization.

Neural Computation 3, 579-588 (1991) © 1991 Massachusetts Institute of Technology
An introduction to RBF networks is given in Section 2. In Section 3 the technique of curvature-driven smoothing is developed in the context of RBF networks, and results from the application to a simple problem are presented in Section 4. A brief summary is given in Section 5.

Figure 1: Architecture of a radial basis function network.

2 Radial Basis Function Networks
Here we review briefly the central features of radial basis function networks. For a more extensive discussion see Broomhead and Lowe (1988). The network has a three layer feedforward architecture as shown in Figure 1. Input vectors $\mathbf{x}$ are propagated to the hidden units (hidden neurons) each of which computes a hyperspherical function of $\mathbf{x}$, so that the output of the ith hidden unit is given by

$$\phi_i = \phi(\|\mathbf{x} - \mathbf{y}_i\|) \tag{2.1}$$

where $\mathbf{y}_i$ is the center of the radial basis function for unit i, and $\|\cdot\|$ denotes a distance measure that is generally taken to be the Euclidean norm. The nonlinear function $\phi$ can be chosen in a variety of ways and can in principle vary from one hidden unit to the next. For the examples shown later we have taken a gaussian nonlinearity:

$$\phi(x) = \exp\{-x^2/\sigma^2\} \tag{2.2}$$
The outputs of the network are formed from the weighted sum of the outputs from the hidden units:

$$z_i = \sum_j w_{ij}\phi_j + \theta_i \tag{2.3}$$

where the synaptic weights $w_{ij}$ and the biases $\theta_i$ are adaptive variables that are set during the learning phase. The bias terms can be absorbed into the weight matrix by introducing an extra hidden unit whose output $\phi_k = 1$. Training data are supplied to the network in the form of pairs $\mathbf{x}_p, \mathbf{t}_p$ of input and target vectors, where $p = 1, \ldots, P$ labels the individual training pairs. The learning algorithm aims to minimize the sum-of-squares error defined by

$$E^S = \frac{1}{2}\sum_p\sum_i \{t_{ip} - z_{ip}\}^2 \tag{2.4}$$

where $z_{ip} = z_i(\mathbf{x}_p)$ denotes the output of unit i when the network is presented with input vector $\mathbf{x}_p$. At a minimum of $E^S$ we have

$$\frac{\partial E^S}{\partial w_{ij}} = 0 \tag{2.5}$$

Together with equation 2.3 (and omitting the explicit bias terms) this gives

$$\sum_k w_{ik}\sum_p \phi_{kp}\phi_{jp} = \sum_p t_{ip}\phi_{jp} \tag{2.6}$$

where $\phi_{jp} = \phi_j(\mathbf{x}_p)$. This can be written in the form

$$\sum_k w_{ik}M_{kj} = \sum_p t_{ip}\phi_{jp} \tag{2.7}$$

where the square matrix M is defined by

$$M_{kj} = \sum_p \phi_{kp}\phi_{jp} \tag{2.8}$$

Note that M is the covariance matrix of the transformed data (for data with zero mean). Provided M is not singular, we can compute $M^{-1}$ (in practice using singular value decomposition), and hence solve equation 2.7 to give

$$w_{ij} = \sum_k\sum_p t_{ip}\phi_{kp}(M^{-1})_{kj} \tag{2.9}$$
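The linear training step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used for the results in this paper: it takes one-dimensional inputs, a single output, no bias term, and one gaussian basis function per training point, and the function names and toy data are hypothetical.

```python
import numpy as np

def design_matrix(x, centers, sigma):
    """Phi[p, j] = exp(-(x_p - y_j)^2 / sigma^2), the gaussian of equation 2.2."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / sigma ** 2)

def train_rbf(x, t, centers, sigma):
    """Least-squares weights, i.e. the pseudoinverse of Phi applied to the targets."""
    phi = design_matrix(x, centers, sigma)
    # lstsq is SVD-based, in the spirit of the singular value decomposition
    # mentioned in the text; it avoids forming M^-1 explicitly.
    w, *_ = np.linalg.lstsq(phi, t, rcond=None)
    return w

# Hypothetical toy data: 10 noisy samples of a sine wave.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

w = train_rbf(x_train, t_train, centers=x_train, sigma=0.1)
fit = design_matrix(x_train, x_train, 0.1) @ w
max_residual = float(np.max(np.abs(fit - t_train)))
print(max_residual)
```

With as many basis functions as data points the design matrix is square, so the fitted function passes through every training point and the residual is at the level of rounding error.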
Micchelli (1986) has shown that for a large class of functions $\phi$ the matrix M is nonsingular provided the data points are all distinct. For nonsingular M, the quantity $(\Phi^T\Phi)^{-1}\Phi^T$, which appears implicitly in equation 2.9, is the Moore-Penrose pseudoinverse of the matrix $\Phi$ with elements $\phi_{jp}$ (Golub and Kahan 1965). In the case where the number of basis functions equals the number of training data points, the matrix $\Phi$ is square, and the pseudoinverse of $\Phi$ reduces to the usual inverse. The minimum of $E^S$ then occurs at $E^S = 0$, and the function generated by the trained network passes exactly through every data point. One of the great advantages of RBF networks is that the learning algorithm involves the solution of a linear problem, and is therefore fast. Due to the nonlinearity of the basis functions, however, the network can generate complex nonlinear mappings. In principle learning strategies could be devised that involve changes also in the location and form of the radial basis functions. The advantages of a linear learning algorithm would then be lost, however. The centers $\mathbf{y}_i$ of the basis functions can be chosen in a variety of ways. A natural choice would be to take the $\mathbf{y}_i$ to be the input vectors $\mathbf{x}_p$ from the training data set, or a subset of these in the case where the number of hidden units is less than the number of training data points. If the network is to be used as a pattern classifier the number of basis functions is generally taken to be significantly larger than the number of input units. The hidden units then nonlinearly map input vectors into a space of higher dimension. The problem may be linearly separable in this higher space even when it is not linearly separable in the original space. In this case the single layer of modifiable weights between hidden and output units is sufficient to give correct classification. In this paper we are interested primarily in continuous mappings between input and output variables.

3 Curvature-Driven Smoothing in RBF Networks
The situation in which the network mapping passes exactly through each training data point is generally not desirable, even though this gives $E^S = 0$. In many practical applications of neural networks the available set of training data will be noisy. If the network mapping fits the data exactly, the capability of the network to generalize, that is to produce an acceptable output when a novel input is applied, will often be poor. This arises from the rapid oscillations in the network function that generally are needed for the function to fit the noisy data. The situation is analogous to the problem of overfitting which can occur when curve fitting using high order polynomials. To improve the generalization capability of the network it is necessary for the network mapping to represent the underlying trends in the data, rather than fitting all of the fine details of the data set. One way in which this can be achieved is to reduce the number of degrees of freedom by using fewer hidden units. Although this leads to a smaller network, it is not clear how the basis function centers should be chosen. One possibility is to take a subset of the input vectors from the training data. The subset may be chosen randomly, or by a more systematic elimination procedure starting with a full-sized network (Adomaitis et al. 1990). Another procedure for choosing the basis function centers is to use a self-organizing neural algorithm such as the "topology preserving feature map" (Kohonen 1988). If the quantity of training data available is at all limited, however, it may be undesirable to eliminate potential basis function centers, particularly if there are regions of the input space where the data are relatively sparse. We consider here an alternative procedure for avoiding the overfitting problem in RBF networks. The full set of radial basis functions, whose centers correspond to the input vectors from the training data, is retained. An additional term is added to the error measure whose role is to damp out the rapid oscillations in the network function that are characteristic of overfitting, while retaining the longer wavelength variations describing the underlying nonlinear trends in the data. The total error function then becomes

$$E = E^S + \lambda E^C \tag{3.1}$$
where $E^S$ is the standard sum-of-squares error given by equation 2.4 and $E^C$ is arranged to be large for functions with rapid oscillations. The parameter $\lambda$ in equation 3.1 controls the degree to which the network function is smoothed. This approach, known as regularization, is commonly used in a number of other fields for tackling "ill-posed" problems (Tikhonov and Arsenin 1977). Poggio and Girosi (1990), starting with the concept of regularization, have derived an approximation scheme that includes radial basis function networks as a special case, thus demonstrating a close relation between these two techniques. Regularization terms also arise when considering the effects of noise on the input data in least squares functional approximation, as discussed in Webb (1991). A regularization technique, referred to as curvature-driven smoothing, has also been applied to neural networks trained by error backpropagation (Bishop 1990). The functional $E^C$ in equation 3.1 will be chosen to have the following form:
$$E^C = \frac{1}{2}\sum_p\sum_i\sum_n \left(\frac{\partial^2 z_{ip}}{\partial x_n^2}\right)^2 \tag{3.2}$$
where n labels the input unit. This choice for EC has the required property of penalizing functions with large second derivatives and, most importantly, is bilinear in the synaptic weights. Thus the great advantage of RBF networks, namely the linear learning algorithm and consequent speed of training, will be retained.
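As a concrete illustration of why this penalty keeps the problem linear, consider a single input $x$, a single output $z$, and the gaussian basis functions of equation 2.2. The following worked example is added here as a sketch in the notation of equations 2.2 and 2.3, with the bias omitted:

```latex
\begin{align*}
\phi_j(x) &= \exp\{-(x - y_j)^2/\sigma^2\}, \\
\frac{d\phi_j}{dx} &= -\frac{2(x - y_j)}{\sigma^2}\,\phi_j(x), \\
\frac{d^2\phi_j}{dx^2} &= \left[\frac{4(x - y_j)^2}{\sigma^4} - \frac{2}{\sigma^2}\right]\phi_j(x), \\
\frac{d^2 z}{dx^2} &= \sum_j w_j\,\frac{d^2\phi_j}{dx^2}.
\end{align*}
```

The second derivative of the output is linear in the weights, so its square is quadratic in them, and setting the gradient of the total error to zero still yields a set of linear equations for the weights.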
If we now minimize E with respect to the weights $\{w_{ij}\}$ we obtain

$$0 = -\sum_p \{t_{ip} - z_{ip}\}\phi_{jp} + \lambda\sum_p\sum_n \left(\frac{\partial^2 z_{ip}}{\partial x_n^2}\right)\left(\frac{\partial^2 \phi_{jp}}{\partial x_n^2}\right) \tag{3.3}$$

Rearranging terms gives

$$\sum_k w_{ik}\tilde{M}_{kj} = \sum_p t_{ip}\phi_{jp} \tag{3.4}$$

which is analogous to equation 2.7, with $\tilde{M}$ defined by

$$\tilde{M}_{kj} = \sum_p \left\{\phi_{kp}\phi_{jp} + \lambda\sum_n \left(\frac{\partial^2 \phi_{kp}}{\partial x_n^2}\right)\left(\frac{\partial^2 \phi_{jp}}{\partial x_n^2}\right)\right\} \tag{3.5}$$

Equation 3.4 can now be solved using the same techniques as for equation 2.7. The appropriate value for $\lambda$ will be problem dependent. It should not be chosen too large since this will smooth the network function too much and lead to a deterioration in the ability of the network to generalize. Results presented in the next section suggest, however, that the performance of the network may be fairly insensitive to the precise value of $\lambda$. The form of $E^C$ given by equation 3.2 treats each input-output unit pair on an equal footing. It thus presupposes that the input (and output) variables have been rescaled to span a similar range of values. As an alternative, suitable scaling factors $c_{in}$ for each input-output unit pair can be included in equation 3.2.

4 Simulation Results
We now illustrate the ideas introduced in the previous section with a simple example. Consider a network with a single input unit and a single output unit. Data are generated from the function
$$z = 0.8\sin(2\pi x) \tag{4.1}$$

sampled at 25 equally spaced values of x in the range (0, 1), and perturbed with ±20% random noise. A similar set of test data was generated by sampling equation 4.1 at intermediate values of x, and again perturbing with ±20% noise. The number of basis functions is chosen to equal the number of training data points. Gaussian basis functions of the form of equation 2.2 are used, and the basis function centers are taken to coincide with the training data input vectors. Figure 2 shows the training data together with the network function that results from training the network without any smoothing. The function fits each data point exactly, and the rapid oscillations (with corresponding high curvature) that are characteristic of overfitting are clearly seen.
Figure 2: Training data generated from the function $z = 0.8\sin(2\pi x)$ and perturbed with ±20% noise, together with a plot of the network function obtained without smoothing.

The effect of introducing a smoothing term, with a small value of $\lambda$, is to increase the error with respect to the training data, while reducing the test data error. This is illustrated in Figure 3 in which the mean square error $E^S$ is plotted as a function of $\ln\lambda$ for both training and test data. The fall in the test data error indicates that the network is better able to generalize. Larger values of $\lambda$ result in oversmoothing, and the error with respect to the test data increases again. For this example the optimum value of $\lambda$, corresponding to the minimum of the test data error, is given by $\lambda = 8.3 \times$ The corresponding network function is plotted in Figure 4. At this value of $\lambda$ the short scale oscillations are completely suppressed. The minimum value of the test error is close to the value 0.004 obtained by comparing the test data with the original function in equation 4.1, showing that the network has good generalization properties when $\lambda$ is set to the optimum value.
Figure 3: Mean square error for training data (lower curve) and test data (upper curve) versus $\ln\lambda$.
Although the appropriate value for $\lambda$ must be determined by experiment, Figure 3 indicates that variations in $\lambda$ of about an order of magnitude (note the logarithmic scale of the abscissa) have little effect on the test data error for this problem. The parameter $\sigma$ in equation 2.2, which governs the width of the gaussian functions, also has to be chosen appropriately. Too small a value leads to a hidden unit response which is highly localized, making it difficult to generate smooth network functions. At too large a value, the matrix M becomes ill-conditioned. A suitable choice would allow the gaussians to span a number of data points, and a value of $\sigma = 0.1$ was used in the above example.
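The experiment of this section can be sketched as follows. This is an illustrative reconstruction under several assumptions, not the code behind Figures 2-4: the ±20% noise is taken to be additive and uniform on [-0.2, 0.2], the 25 training points include the interval endpoints, the $\lambda$ values are arbitrary, and the regularized problem is solved as an equivalent stacked least-squares system rather than by forming the matrix of equation 3.5 explicitly (a numerical choice of this sketch). All function names are hypothetical.

```python
import numpy as np

def phi_and_curvature(x, centers, sigma):
    """Gaussian basis functions (equation 2.2) and their second derivatives in x."""
    diff = x[:, None] - centers[None, :]
    phi = np.exp(-diff ** 2 / sigma ** 2)
    d2phi = (4.0 * diff ** 2 / sigma ** 4 - 2.0 / sigma ** 2) * phi
    return phi, d2phi

def train(x, t, sigma, lam):
    """Minimize E^S + lam * E^C with one basis function per training point."""
    phi, d2phi = phi_and_curvature(x, x, sigma)
    # Stacking [Phi; sqrt(lam) * Phi''] has the same minimizer as the
    # regularized normal equations, but is better conditioned numerically.
    a = np.vstack([phi, np.sqrt(lam) * d2phi])
    b = np.concatenate([t, np.zeros(x.size)])
    w, *_ = np.linalg.lstsq(a, b, rcond=None)
    return w

def mse(x, t, centers, w, sigma):
    phi, _ = phi_and_curvature(x, centers, sigma)
    return float(np.mean((phi @ w - t) ** 2))

def curvature_penalty(x, w, sigma):
    _, d2phi = phi_and_curvature(x, x, sigma)
    return float(0.5 * np.sum((d2phi @ w) ** 2))

rng = np.random.default_rng(0)
sigma = 0.1
x_train = np.linspace(0.0, 1.0, 25)
x_test = 0.5 * (x_train[:-1] + x_train[1:])  # intermediate values of x
t_train = 0.8 * np.sin(2 * np.pi * x_train) + rng.uniform(-0.2, 0.2, x_train.size)
t_test = 0.8 * np.sin(2 * np.pi * x_test) + rng.uniform(-0.2, 0.2, x_test.size)

results = {}
for lam in (0.0, 1e-6, 1e-4, 1e-2):  # arbitrary illustrative values
    w = train(x_train, t_train, sigma, lam)
    results[lam] = (
        mse(x_train, t_train, x_train, w, sigma),
        mse(x_test, t_test, x_train, w, sigma),
        curvature_penalty(x_train, w, sigma),
    )
    print(lam, results[lam])
```

Each entry of `results` holds the training error, the test error, and the curvature penalty for one value of $\lambda$; sweeping a finer grid of values and plotting the first two against $\ln\lambda$ gives curves of the kind described for Figure 3, although the numbers depend on the noise assumptions made here.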
Figure 4: Plot of the network function obtained with a smoothing term and with $\lambda = 8.3 \times$
5 Summary

In this paper we have described a practical procedure for improving the generalization properties of radial basis function neural networks. The performance of the network for new data (i.e., data not used during training) can be controlled by varying a single parameter $\lambda$. The optimum value for $\lambda$ must be found by experiment, although simulations suggest that the results are not strongly dependent on the precise value chosen. This technique can prevent overfitting without needing to limit the number of radial basis functions, and therefore allows all training data points to act as basis function centers. This may be particularly useful when the amount of training data is limited, or when the data are sparsely distributed in important regions of the input space. Furthermore, for many problems it is known that the desired mapping should have certain smoothness properties, and this technique allows this to be imposed explicitly. Where appropriate, curvature-driven smoothing can easily be combined with techniques for restricting the number of basis functions. Finally, the network can generate a large class of nonlinear multivariate mappings, while the learning algorithm corresponds to the solution of a linear problem and is therefore a fast one-step procedure.
References

Adomaitis, R. A. et al. 1990. Application of neural nets to systems identification and bifurcation analysis of real world experimental data. Proceedings of International Conference on Neural Networks, Lyons, France (in press).
Bishop, C. M. 1990. Curvature-driven smoothing in backpropagation neural networks. Proceedings of the International Neural Network Conference, Paris, Vol. 2, p. 749. Submitted to Neural Networks.
Broomhead, D. S., and Lowe, D. 1988. Multi-variable functional interpolation and adaptive networks. Complex Syst. 2, 321.
Golub, G., and Kahan, W. 1965. Calculating the singular values and pseudoinverse of a matrix. J. SIAM Numerical Anal. Ser. B 2, 205.
Kohonen, T. 1988. Self Organisation and Associative Memory. Springer-Verlag, New York.
Micchelli, C. A. 1986. Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Construct. Approx. 2, 11.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9), 1481.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. The MIT Press, Cambridge, MA.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. Wiley, New York.
Webb, A. R. 1991. Functional approximation by feed-forward networks: A least-squares approach to generalisation. RSRE Memorandum 4453, R.S.R.E., St Andrews Road, Malvern, Worcs., WR14 3PS, U.K.
Received 6 March 1991; accepted 18 April 1991.
Communicated by Erkki Oja
Temporal Evolution of Generalization during Learning in Linear Networks
Pierre Baldi
Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA
Yves Chauvin
Department of Psychology, Stanford University, Stanford, CA 94305 USA and NET-ID, Inc., Menlo Park, CA 94025 USA
We study generalization in a simple framework of feedforward linear networks with n inputs and n outputs, trained from examples by gradient descent on the usual quadratic error function. We derive analytical results on the behavior of the validation function corresponding to the LMS error function calculated on a set of validation patterns. We show that the behavior of the validation function depends critically on the initial conditions and on the characteristics of the noise. Under certain simple assumptions, if the initial weights are sufficiently small, the validation function has a unique minimum corresponding to an optimal stopping time for training, for which simple bounds can be calculated. There also exist situations where the validation function can have more complicated and somewhat unexpected behavior, such as multiple local minima (at most n) of variable depth and long but finite plateau effects. Additional results and possible extensions are briefly discussed.
1 Introduction
Neural Computation 3, 589-603 (1991) © 1991 Massachusetts Institute of Technology
Generalization properties of neural networks trained from examples seem fundamental to connectionist theories but are also poorly understood. In practice, the question to be answered is: how should one allocate limited resources and parameters, such as network size and architecture, initial conditions, training time, and available examples, to optimize generalization performance? One conventional approach is to consider the problem of learning as a surface fitting problem. Accordingly, neural networks should be very constrained, with a minimal number of parameters, to avoid the classical "overfitting" problem. In practice, however, not much is known about overfitting, its nature, and its onset, both as a function of network parameters and training time. Furthermore, the conventional view has sometimes been challenged in light of simulation results and may need to be revised to some extent. It may be the case, for instance, that a suitable strategy consists rather in using networks with a few more parameters than the most constrained ones and training these slightly larger networks for shorter times, based on a careful monitoring of the evolution of the validation error during training and its minimization.
Interesting partial results on generalization have been obtained in recent years in terms of VC dimension and statistical mechanics (see, for instance, Baum and Haussler 1989; Tishby et al. 1989; and Sompolinsky et al. 1990). Most of these results, however, are static in the sense that they study generalization as a function of network architecture and number of examples. Here, we propose a different and complementary approach consisting in a detailed analysis of the temporal evolution of generalization in simple feedforward linear networks. This setting is not as restricted as it may seem because parametrically linear networks have been gaining popularity recently (e.g., radial basis functions or polynomial networks). Additional motivation for investigating these architectures can be found in Baldi and Hornik (1989, 1991). Even in this simple framework, the question is far from trivial. Thus we have restricted the problem even further: learning the identity map in a single layer feedforward linear network. With suitable assumptions on the noise, this problem turns out to be insightful and to yield analytical results that are relevant to what one observes in more complicated situations. With hindsight, it is rather remarkable that the complex phenomena related to generalization that are observed in simulations of nonlinear networks are already present in the linear case.
In Section 2, we define the framework and derive the basic equations first in the noiseless case and then in the case of noisy data. The basic point is to derive an expression for the validation function in terms of the statistical properties of the population and the training and validation samples. Section 3 contains the main results, which consist of an analysis of the landscape of the validation error as a function of training time. Simple simulation results are also presented and several interesting phenomena are described. The results are discussed and some possible extensions are briefly mentioned in the conclusion. Mathematical proofs are deferred to the Appendix.
2 Formal Setting
2.1 Noiseless Data. We consider a simple feedforward network with n input units connected by a weight matrix W to n output linear units. The network is trained to learn the identity function (autoassociation) from a set of centered training patterns $x_1, \ldots, x_T$. The connection weights are adjusted by gradient descent on the usual LMS error function

$$E(W) = \frac{1}{T} \sum_t \| x_t - W x_t \|^2 \tag{2.1}$$
The gradient of E with respect to the weights W is given by

$$\nabla E = (W - I) C \tag{2.2}$$
where $C = C_{XX}$ is the covariance matrix of the training set. Thus, the gradient descent learning rule can be expressed as

$$W^{k+1} = W^k - \eta (W^k - I) C \tag{2.3}$$

where $W^k$ is the weight matrix after the kth iteration of the algorithm and $\eta$ is the constant learning rate ($\eta > 0$). If $e_i$ and $\lambda_i$ ($\lambda_1 \ge \cdots \ge \lambda_n > 0$) denote the eigenvectors and eigenvalues of C, then

$$W^{k+1} e_i = \eta \lambda_i e_i + (1 - \eta \lambda_i) W^k e_i \tag{2.4}$$
A simple induction shows that

$$W^k = W^0 (I - \eta C)^k - [(I - \eta C)^k - I] \tag{2.5}$$

and therefore

$$W^k e_i = [1 - (1 - \eta \lambda_i)^k] e_i + (1 - \eta \lambda_i)^k W^0 e_i \tag{2.6}$$
The behavior of equation 2.6 is clear: provided the learning rate is less than twice the inverse of the largest eigenvalue ($\eta < 2/\lambda_1$), then $W^k$ approaches the identity exponentially fast. This holds for any starting matrix $W^0$. The eigenvectors of C tend to become eigenvectors of $W^k$ and the corresponding eigenvalues approach 1 at different rates depending on $\lambda_i$ (larger eigenvalues are learned much faster). As a result, it is not very restrictive to assume, for ease of exposition, that the starting matrix $W^0$ is diagonal in the $e_i$ basis, i.e., $W^0 = \mathrm{diag}(\alpha_i^{(0)})$¹ (in addition, learning is often started with the zero matrix). In this case, equation 2.5 becomes

$$W^k e_i = [1 - (1 - \eta \lambda_i)^k (1 - \alpha_i^{(0)})] e_i = \alpha_i^{(k)} e_i \tag{2.7}$$

A simple calculation shows that the corresponding error can be written as

$$E(W^k) = \sum_{i=1}^{n} \lambda_i (\alpha_i^{(k)} - 1)^2 = \sum_{i=1}^{n} \lambda_i (1 - \alpha_i^{(0)})^2 (1 - \eta \lambda_i)^{2k} \tag{2.8}$$
¹Superscripts on the sequence $\alpha$ are in parentheses to avoid possible confusion with exponentiation.
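The noiseless dynamics above decouple along the eigenvectors of C, so they can be checked one scalar at a time. The following is a minimal sketch, assuming a diagonal starting matrix; the eigenvalues, learning rate, and step counts are illustrative choices and do not come from the paper. It iterates the per-mode form of equation 2.4 and compares it with the closed form of equation 2.7.

```python
# Sketch of the noiseless dynamics in the eigenbasis of C (equations 2.4, 2.7, 2.8).
# The eigenvalues, learning rate, and step counts are illustrative assumptions.

def iterate_alpha(alpha0, lam, eta, steps):
    """Apply alpha <- eta*lam + (1 - eta*lam)*alpha, the per-mode form of equation 2.4."""
    alpha = alpha0
    for _ in range(steps):
        alpha = eta * lam + (1.0 - eta * lam) * alpha
    return alpha

def closed_form_alpha(alpha0, lam, eta, k):
    """Equation 2.7: alpha^(k) = 1 - (1 - eta*lam)^k * (1 - alpha^(0))."""
    return 1.0 - (1.0 - eta * lam) ** k * (1.0 - alpha0)

def lms_error(alphas, lams):
    """Equation 2.8: E = sum_i lam_i * (alpha_i - 1)^2."""
    return sum(lam * (alpha - 1.0) ** 2 for alpha, lam in zip(alphas, lams))

lams = [4.0, 1.0, 0.25]      # assumed eigenvalues of C, largest first
eta = 0.2                    # satisfies eta < 2 / lams[0]
alpha0 = [0.0, 0.0, 0.0]     # training started from the zero matrix

for k in (0, 5, 20, 80):
    alphas = [iterate_alpha(a0, lam, eta, k) for a0, lam in zip(alpha0, lams)]
    print(k, [round(a, 4) for a in alphas], round(lms_error(alphas, lams), 6))
```

With these values the mode with the largest eigenvalue reaches 1 first, which is the "larger eigenvalues are learned much faster" behavior described above.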
2.2 Noisy Data. We now modify the setting to introduce noise effects. To fix ideas, the reader may think, for instance, of handwritten realizations of single-digit numbers. In this case, there are 10 possible patterns but numerous possible noisy realizations. In general, we assume that there is a population of patterns of the form $x_p + n_p$, where $x_p$ denotes the signal and $n_p$ denotes the noise, characterized by the covariance matrices $\bar{C}_{XX}$, $\bar{C}_{NN}$, and $\bar{C}_{XN}$. Here, as everywhere else, we assume that the signal and the noise are centered. A sample $x_t + n_t$ ($1 \le t \le T$) from this population is used as a training set. The training sample is characterized by the covariance matrices $C = C_{XX}$, $C_{NN}$, and $C_{XN}$ calculated over the sample. Similarly, a different sample $x_v + n_v$ from the population is used as a validation set. The validation sample is characterized by the covariance matrices $C' = C'_{XX}$, $C'_{NN}$, and $C'_{XN}$.
To make the calculations tractable, we shall make, when necessary, several assumptions. First, $\bar{C}_{XX} = C = C'$; thus there is a common basis of eigenvectors $e_i$ and corresponding eigenvalues $\lambda_i$ for the signal in the population and in the training and validation samples. Then, with respect to this basis of eigenvectors, the noise covariance matrices are diagonal: $C_{NN} = \mathrm{diag}(\nu_i)$ and $C'_{NN} = \mathrm{diag}(\nu'_i)$. Finally, the signal and the noise are always uncorrelated: $C_{XN} = C'_{XN} = 0$. Obviously, it also makes sense to assume that $\bar{C}_{NN} = \mathrm{diag}(\bar{\nu}_i)$ and $\bar{C}_{XN} = 0$, although these assumptions are not needed in the main calculation. Thus we make the simplifying assumptions that, on both the training and validation patterns, the covariance matrix of the signal is identical to the covariance of the signal over the entire population, the components of the noise are uncorrelated, and the signal and the noise are uncorrelated. Yet we allow the estimates $\nu_i$ and $\nu'_i$ of the variance of the components of the noise to be different in the training and validation sets.
For a given W, the LMS error function over the training patterns is now

$$E(W) = \frac{1}{T} \sum_t \| x_t - W (x_t + n_t) \|^2 \tag{2.9}$$
By differentiating

$$\nabla E = W (C + C_{NX} + C_{XN} + C_{NN}) - C - C_{XN} \tag{2.10}$$

and since $C_{XN} = C_{NX} = 0$, the gradient is given by

$$\nabla E = (W - I) C + W C_{NN} \tag{2.11}$$
To compute the image of any eigenvector $e_i$ during training, we have

$$W^{k+1} e_i = \eta \lambda_i e_i + (1 - \eta \lambda_i - \eta \nu_i) W^k e_i \tag{2.12}$$

Thus by induction

$$W^k = W^0 M^k - C (C + C_{NN})^{-1} (M^k - I) \tag{2.13}$$
where $M = I - \eta (C + C_{NN})$, and

$$W^k e_i = \frac{\lambda_i}{\lambda_i + \nu_i} [1 - (1 - \eta \lambda_i - \eta \nu_i)^k] e_i + (1 - \eta \lambda_i - \eta \nu_i)^k W^0 e_i \tag{2.14}$$
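Again if we assume here, as in the rest of the paper, that the learning rate satisfies $\eta < \min_i [1/(\lambda_i + \nu_i)]$, then the eigenvectors of C tend to become eigenvectors of $W^k$ and $W^k$ approaches exponentially fast the diagonal matrix $\mathrm{diag}[\lambda_i/(\lambda_i + \nu_i)]$.² Assuming that $W^0 = \mathrm{diag}(\alpha_i^{(0)})$ in the $e_i$ basis, we get

$$W^k e_i = \frac{\lambda_i}{\lambda_i + \nu_i} (1 - b_i a_i^k) e_i = \alpha_i^{(k)} e_i \tag{2.15}$$

where $b_i = 1 - \alpha_i^{(0)} (\lambda_i + \nu_i)/\lambda_i$ and $a_i = 1 - \eta \lambda_i - \eta \nu_i$. Notice that $0 < a_i < 1$. Since the signal and the noise are uncorrelated, the error in general can be written in the form

$$E(W) = \mathrm{trace}[(W - I) C (W - I)^T + W C_{NN} W^T] \tag{2.16}$$

Using the fact that $C_{NN} = \mathrm{diag}(\nu_i)$ and $W^k = \mathrm{diag}(\alpha_i^{(k)})$, we have

$$E(W^k) = \sum_{i=1}^{n} [\lambda_i - 2 \lambda_i \alpha_i^{(k)} + \lambda_i (\alpha_i^{(k)})^2 + \nu_i (\alpha_i^{(k)})^2] \tag{2.17}$$

and finally

$$E(W^k) = \sum_{i=1}^{n} [\lambda_i (1 - \alpha_i^{(k)})^2 + \nu_i (\alpha_i^{(k)})^2] \tag{2.18}$$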
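As a numerical check of the noisy-data dynamics, the sketch below iterates the per-mode update of equation 2.12, compares it with the closed form described around equation 2.15 (using the definitions of $a_i$ and $b_i$ given in the text), and evaluates the training error of equation 2.18. The eigenvalues and training noise levels are those listed in the caption of Figure 1; the learning rate 0.01 is an assumption, chosen to satisfy $\eta < \min_i [1/(\lambda_i + \nu_i)]$. The residual is obtained by substituting the limit $\alpha_i = \lambda_i/(\lambda_i + \nu_i)$ into equation 2.18.

```python
# Sketch of training with noisy data in the eigenbasis of C.
# eta is an assumed value; lams and nus follow the Figure 1 caption.

def step(alpha, lam, nu, eta):
    """One gradient step for a single mode (equation 2.12)."""
    return eta * lam + (1.0 - eta * lam - eta * nu) * alpha

def alpha_k(alpha0, lam, nu, eta, k):
    """Closed form: alpha^(k) = lam / (lam + nu) * (1 - b * a^k)."""
    a = 1.0 - eta * (lam + nu)
    b = 1.0 - alpha0 * (lam + nu) / lam
    return lam / (lam + nu) * (1.0 - b * a ** k)

def training_error(alphas, lams, nus):
    """Equation 2.18: sum_i lam_i (1 - alpha_i)^2 + nu_i alpha_i^2."""
    return sum(l * (1.0 - a) ** 2 + n * a ** 2 for a, l, n in zip(alphas, lams, nus))

lams = [22.0, 0.7, 2.5]
nus = [4.0, 1.0, 3.0]
eta = 0.01

alphas = [0.0, 0.0, 0.0]     # zero initial matrix
errors = []
for k in range(2001):
    errors.append(training_error(alphas, lams, nus))
    alphas = [step(a, l, n, eta) for a, l, n in zip(alphas, lams, nus)]

# Residual error reached when every alpha_i equals lam_i / (lam_i + nu_i).
residual = sum(l * n / (l + n) for l, n in zip(lams, nus))
print(errors[0], errors[-1], residual)
```

The training error computed this way never increases, in contrast with the validation error studied next.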
It is easy to see that $E(W^k)$ is a monotonically decreasing function of k that approaches an asymptotic residual error value given by

$$E(W^\infty) = \sum_{i=1}^{n} \frac{\lambda_i \nu_i}{\lambda_i + \nu_i} \tag{2.19}$$

For any matrix W, we can define the validation error to be

$$E^V(W) = \frac{1}{V} \sum_v \| x_v - W (x_v + n_v) \|^2 \tag{2.20}$$

Using the fact that $C'_{XN} = 0$ and $C'_{NN} = \mathrm{diag}(\nu'_i)$, a derivation similar to equation 2.18 shows that the validation error $E^V(W^k)$ is given by

$$E^V(W^k) = \sum_{i=1}^{n} [\lambda_i (1 - \alpha_i^{(k)})^2 + \nu'_i (\alpha_i^{(k)})^2] \tag{2.21}$$

²As in equation 2.6, the convergence in fact holds for $\eta < 2 \min_i [1/(\lambda_i + \nu_i)]$. The slightly more restrictive assumption has been chosen to ensure that the numbers $a_i$ are positive.

Clearly, as $k \to \infty$, $E^V(W^k)$ approaches its horizontal asymptote, which is independent of $\alpha_i^{(0)}$ and given by

$$E^V(W^\infty) = \sum_{i=1}^{n} \frac{\lambda_i \nu_i^2 + \nu'_i \lambda_i^2}{(\lambda_i + \nu_i)^2} \tag{2.22}$$

However, it is the behavior of $E^V$ before it reaches its asymptotic value that is of most interest to us. This behavior, as we shall see, can be fairly complicated.
3 Validation Analysis
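Obviously,³ from equation 2.15, $d\alpha_i^{(k)}/dk = -(\lambda_i b_i a_i^k \log a_i)/(\lambda_i + \nu_i)$. Thus using equation 2.21 and collecting terms yields

$$\frac{dE^V}{dk} = \sum_{i=1}^{n} \frac{2 \lambda_i^2 b_i a_i^k \log a_i}{(\lambda_i + \nu_i)^2} [(\nu_i - \nu'_i) + (\lambda_i + \nu'_i) b_i a_i^k] \tag{3.1}$$

or, in more compact form,

$$\frac{dE^V}{dk} = \sum_{i=1}^{n} (A_i a_i^k + B_i a_i^{2k}) \tag{3.2}$$

with

$$A_i = \frac{2 \lambda_i^2 b_i (\nu_i - \nu'_i) \log a_i}{(\lambda_i + \nu_i)^2} \tag{3.3}$$

and

$$B_i = \frac{2 \lambda_i^2 b_i^2 (\lambda_i + \nu'_i) \log a_i}{(\lambda_i + \nu_i)^2} \tag{3.4}$$

The behavior of $E^V$ depends on the relative size of $\nu_i$ and $\nu'_i$ and the initial conditions $\alpha_i^{(0)}$, which together determine the signs of $b_i$, $A_i$, and $B_i$. The main result we can prove is as follows.
Assume that learning is started with the zero matrix or with a matrix having sufficiently small weights satisfying, for every i,

$$\alpha_i^{(0)} \le \min\left( \frac{\lambda_i}{\lambda_i + \nu_i}, \frac{\lambda_i}{\lambda_i + \nu'_i} \right) \tag{3.5}$$

³Here and in what follows we take time derivatives with respect to k. Although k was originally introduced as an integer, we can easily consider that $\alpha_i^{(k)}$ and $E^V(W^k)$ are continuous functions of k, defined by equations 2.15 and 2.21, and study them everywhere.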
1. If for every i, $\nu'_i \le \nu_i$, then the validation function $E^V$ decreases monotonically to its asymptotic value and training should be continued as long as possible.
2. If for every i, $\nu'_i > \nu_i$, then the validation function $E^V$ decreases monotonically to a unique minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ also have a unique zero crossing and a unique extremum. For optimal generalization, $E^V$ should be monitored and training stopped as soon as $E^V$ begins to increase. A simple bound on the optimal training time $k^{opt}$ is given by

$$\min_i \frac{1}{\log a_i} \log \frac{-A_i}{B_i} \le k^{opt} \le \max_i \frac{1}{\log a_i} \log \frac{-A_i}{B_i} \tag{3.6}$$
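The stopping bound of equation 3.6 can be illustrated numerically. The sketch below is hedged in several ways: the validation error is taken to have the form of equation 2.18 with $\nu_i$ replaced by $\nu'_i$, as the text indicates for equation 2.21; the ratio $-A_i/B_i$ is evaluated as $(\nu'_i - \nu_i)/[b_i(\lambda_i + \nu'_i)]$, the expression appearing in equation 4.5 of the Appendix; the learning rate 0.01 is an assumption; and training starts from the zero matrix, so $b_i = 1$. The eigenvalues and noise levels are those of the Figure 1 caption.

```python
import math

# Parameters from the Figure 1 caption; eta is an assumed learning rate.
lams = [22.0, 0.7, 2.5]
nus = [4.0, 1.0, 3.0]
nus_val = [20.0, 20.0, 20.0]
eta = 0.01

def alpha_k(lam, nu, k):
    """alpha^(k) for a zero initial matrix (b_i = 1)."""
    a = 1.0 - eta * (lam + nu)
    return lam / (lam + nu) * (1.0 - a ** k)

def validation_error(k):
    """Assumed form: sum_i lam_i (1 - alpha_i)^2 + nu'_i alpha_i^2."""
    total = 0.0
    for lam, nu, nu_val in zip(lams, nus, nus_val):
        alpha = alpha_k(lam, nu, k)
        total += lam * (1.0 - alpha) ** 2 + nu_val * alpha ** 2
    return total

def stopping_bounds():
    """Per-mode zero crossings log(-A_i/B_i) / log(a_i), with b_i = 1."""
    zeros = []
    for lam, nu, nu_val in zip(lams, nus, nus_val):
        a = 1.0 - eta * (lam + nu)
        ratio = (nu_val - nu) / (lam + nu_val)   # equals -A_i / B_i when b_i = 1
        zeros.append(math.log(ratio) / math.log(a))
    return min(zeros), max(zeros)

k_best = min(range(200), key=validation_error)   # best integer stopping time
lower, upper = stopping_bounds()
asymptote = sum(l * (n / (l + n)) ** 2 + nv * (l / (l + n)) ** 2
                for l, n, nv in zip(lams, nus, nus_val))
print(k_best, lower, upper, validation_error(k_best), asymptote)
```

The asymptote computed in the last lines can be compared with the value 23.34 quoted in the Figure 1 caption; the location of the minimum depends on the assumed learning rate.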
In the most general case of arbitrary initial conditions and noise, the validation function $E^V$ can have several local minima of variable depth before converging to its asymptotic value. The number of local minima is always at most n. The main result is a consequence of the following statements, which are proved in the Appendix.
First case: For every i, $\nu'_i \ge \nu_i$, i.e., the validation noise is bigger than the training noise. Then
a. If for every i, $\alpha_i^{(0)} \ge \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ decreases monotonically to its asymptotic value.
b. If for every i, $\lambda_i/(\lambda_i + \nu'_i) \le \alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ increases monotonically to its asymptotic value.
c. If for every i, $\alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu'_i)$ and $\nu_i \ne \nu'_i$, then $E^V$ decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ have a unique zero crossing and a unique extremum.
Second case: For every i, $\nu'_i \le \nu_i$, i.e., the validation noise is smaller than the training noise. Then
a. If for every i, $\alpha_i^{(0)} \ge \lambda_i/(\lambda_i + \nu'_i)$ and $\nu_i \ne \nu'_i$, then $E^V$ decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ have a unique zero crossing and a unique extremum.
b. If for every i, $\lambda_i/(\lambda_i + \nu_i) \le \alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu'_i)$, then $E^V$ increases monotonically to its asymptotic value.
c. If for every i, $\alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ decreases monotonically to its asymptotic value.
Several remarks can be made on the previous statements. First, notice that in both (b) cases, $E^V$ increases because the initial $W^0$ is already too good for the given noise levels. The monotone properties of the validation function are not always strict in the sense that, for instance, at the common boundary of some of the cases $E^V$ can be flat. These degenerate cases can be easily checked directly. The statement of the main result assumes that the initial matrix is the zero matrix or a matrix with a diagonal form in the basis of the eigenvectors $e_i$. A random initial nonzero matrix will not satisfy these conditions. However, $E^V$ is continuous and even infinitely differentiable in all of its parameters. Therefore the results are also true for sufficiently small random matrices. If we use, for instance, an $L^2$ norm for the matrices, then the norm of a starting matrix is the same in the original or in the orthonormal $e_i$ basis. Equation 3.5 yields a trivial upper bound of $n^{1/2}$ for the norm of the initial diagonal matrix, which roughly corresponds to having random initial weights of order at most $n^{-1/2}$ in the original basis. Thus, heuristically, the variance of the initial random weights should be a decreasing function of the size of the network. This condition is not satisfied in many of the usual simulations found in the literature where initial weights are generated randomly and independently using, for instance, a centered gaussian distribution with fixed standard deviation. In nonlinear networks, small initial weights are also important for not getting stuck in high local minima during training.
When more arbitrary conditions are considered, in the initial weights or in the noise, multiple local minima can appear in the validation function. As can be seen in one of the curves of the example given in Figure 1, there exist even cases where the first minimum is not the deepest one, although these may be rare in some sense, which is not completely understood at this time.
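The scaling heuristic for the initial weights can be made concrete with a small sketch. It assumes that the matrix norm in question is the entrywise $L^2$ (Frobenius) norm, which is unchanged by an orthonormal change of basis; the function names and the network size are illustrative. Drawing independent gaussian weights with standard deviation $n^{-1/2}$ gives a norm close to $n^{1/2}$, whereas a fixed standard deviation gives a norm that grows like n.

```python
import math
import random

def random_init(n, sigma, seed=0):
    """n x n matrix of independent centered gaussian weights with std sigma."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]

def frobenius_norm(w):
    """Entrywise L2 norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in w for x in row))

n = 50
scaled = random_init(n, 1.0 / math.sqrt(n))   # std shrinks with network size
fixed = random_init(n, 1.0)                   # fixed std, as in many simulations

print(frobenius_norm(scaled) / math.sqrt(n))  # close to 1
print(frobenius_norm(fixed) / math.sqrt(n))   # close to sqrt(n)
```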
In addition, in this particular case, an indication that training should not be stopped at the first minimum comes from the fact that at that point the LMS curve is still decreasing significantly. Also in this figure, better validation results seem to be obtained with smaller initial conditions. This can easily be understood, in this small dimensional example, from some of the arguments given in the Appendix. Another potentially interesting and relevant phenomenon is illustrated in Figure 2. It is possible to have a situation where, after a certain number of training cycles, both the LMS and the validation functions appear to be flat and to have converged to their asymptotic values. However, if training is continued, one observes that these plateaux can end and the validation function comes back to life, starting to decrease again. In the example, the first minimum is still optimal. However, it is possible to construct examples of validation functions, in higher dimensions, where long plateaux are followed by a phase of significant improvements (see Chauvin 1991).
Finally, we have made an implicit distinction between validation and generalization throughout most of the previous sections. If generalization performance is measured by the LMS error calculated over the entire population, it is clear that our main result can be applied to the generalization error by assuming that $\bar{C}_{NN} = \mathrm{diag}(\bar{\nu}_i)$, and $\nu'_i = \bar{\nu}_i$ for every i. In particular, in the second statement of the main result, if for every i $\bar{\nu}_i > \nu_i$, then the generalization curve has a unique minimum. Now, if a validation sample is used as a predictor of generalization performance and the $\nu'_i$'s are close to the $\bar{\nu}_i$'s, then by continuity the validation and the generalization curves are close to each other. Thus, in this case, the strategy of stopping in a neighborhood of the minimum of the validation function should also lead to near optimal generalization performance.

Figure 1: LMS error functions (lower curves) and corresponding validation error functions (upper curves), plotted against the number of cycles (0 to 250). The parameters are n = 3, $\lambda_i$ = 22, 0.7, 2.5, $\nu_i$ = 4, 1, 3, $\nu'_i$ = 20, 20, 20, $\alpha_1^{(0)} = \alpha_2^{(0)} = 0$. From top to bottom, the third initial weight corresponding to $\alpha_3^{(0)}$ takes the values 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5. The horizontal asymptote of the validation curves is at 23.34. Notice, in particular, the fourth validation curve ($\alpha_3^{(0)} = 0.9$), which has two local minima, the second one being deeper than the first one. At the first minimum, the LMS function is still far from its horizontal asymptote. Also in this case, the validation improves as the initial conditions become closer to 0.
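The family of curves described in the Figure 1 caption can be explored with a short sketch that counts the local minima of each validation curve. This is only an illustration under stated assumptions: the learning rate 0.01 and the horizon of 400 cycles are choices made here (the caption does not give them), the validation error is assumed to be equation 2.18 with $\nu'_i$ in place of $\nu_i$, and sampling at integer k with a small tolerance can miss shallow minima, so the counts are lower bounds on what the continuous curve contains.

```python
# Parameters from the Figure 1 caption; eta and the horizon are assumptions.
lams = [22.0, 0.7, 2.5]
nus = [4.0, 1.0, 3.0]
nus_val = [20.0, 20.0, 20.0]
eta = 0.01

def validation_curve(alpha0s, steps):
    """Validation error at k = 0 .. steps-1, iterating the per-mode update."""
    alphas = list(alpha0s)
    curve = []
    for _ in range(steps):
        curve.append(sum(l * (1.0 - a) ** 2 + nv * a ** 2
                         for a, l, nv in zip(alphas, lams, nus_val)))
        alphas = [eta * l + (1.0 - eta * (l + n)) * a
                  for a, l, n in zip(alphas, lams, nus)]
    return curve

def count_local_minima(values, eps=1e-12):
    """Number of interior points strictly below both neighbors (up to eps)."""
    return sum(1 for k in range(1, len(values) - 1)
               if values[k] < values[k - 1] - eps and values[k] < values[k + 1] - eps)

counts = {}
for a3 in (0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5):
    counts[a3] = count_local_minima(validation_curve([0.0, 0.0, a3], 400))
    print(a3, counts[a3])
```

Whatever the initial condition, the count should never exceed n = 3, the upper bound stated in the main result.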
Figure 2: LMS error function (lower curves) and corresponding validation error functions (upper curves), plotted against the number of cycles. The parameters are n = 3, $\lambda_i$ = 22, 0.7, 2.5, $\nu_i$ = 4, 1, 4, $\nu'_i$ = 20, 20, 20, $\alpha_1^{(0)} = \alpha_2^{(0)} = 0$ and $\alpha_3^{(0)} = 0.7$. Notice, on the first two curves, that after 40 cycles both the LMS and the validation function appear to be flat and would suggest that one stop the training. The second set of curves corresponds to 500 training cycles. Notice the existence of a second (although shallow) minimum, undetectable after 40 cycles.
4 Conclusion
In the framework constructed above, based on linear single layer feedforward networks, it has been possible to analytically derive interesting results on generalization. In particular, under simple noise assumptions, we have given a complete description of the validation error $E^V$ as a function of training time. Although the framework is simplistic, we believe it leads to many nontrivial and perhaps mathematically tractable questions related to generalization. This analysis is only a first step in this direction and many questions remain unanswered. More work is required to test the statistical significance of some of the observations (multiple local minima, plateau effects) and their relevance for practical simulations. For instance, it seems to us that in the case of general noise and arbitrary initial conditions, the upper bound on the number of local minima is rather weak in the sense that, at least on average, there are many fewer. It also seems that in general the first local minimum of $E^V$ is also the deepest. Thus, "pathological" cases may be somewhat rare. In the analysis conducted here, we have used uniform assumptions on the noise. In general, we can expect this not to be the case and properties of the noise cannot be fixed a priori. Therefore one needs to develop a theory of $E^V$ over different possible noise and/or sample realizations, that is, to find the average curve $E^V$ (one could also consider averages with respect to initial weights). It would also be of interest to study whether some of the assumptions made on the noise in the training and validation samples can be relaxed and how noise effects can be related to the finite size of the samples. Finally, other possible directions of investigation include the extension to multilayer networks and to general input/output associations.
Appendix: Mathematical Proofs
Let us study $E^V$ under uniform conditions. We shall deal only with the case $\nu'_i \ge \nu_i$ for every i (the case $\nu'_i \le \nu_i$ is similar).
a. If for every i, $\alpha_i^{(0)} \ge \lambda_i/(\lambda_i + \nu_i)$, then $b_i \le 0$, $A_i \le 0$, and $B_i \le 0$. Therefore, $dE^V/dk \le 0$ and $E^V$ decreases to its asymptotic value.
b. If for every i, $\lambda_i/(\lambda_i + \nu'_i) \le \alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu_i)$, then $0 \le b_i \le (\nu'_i - \nu_i)/(\lambda_i + \nu'_i)$, $A_i \ge 0$, $B_i \le 0$, and $A_i + B_i \ge 0$. Since $a_i^{2k}$ decays to 0 faster than $a_i^k$, $dE^V/dk \ge 0$ and $E^V$ increases to its asymptotic value.
c. The most interesting case is when, for every i, $\alpha_i^{(0)} \le \lambda_i/(\lambda_i + \nu'_i)$, i.e., when $b_i \ge (\nu'_i - \nu_i)/(\lambda_i + \nu'_i)$. Then $A_i \ge 0$, $B_i \le 0$, and $A_i + B_i \le 0$ so that $dE^V/dk$ is negative at the beginning and approaches zero from the positive side as $k \to \infty$. Strictly speaking, this is not satisfied if $A_i = 0$. This can occur only if $b_i = 0$ or $\lambda_i = 0$ (but then $B_i = 0$ also) or if $\nu_i = \nu'_i$. For simplicity, let us add the assumption that $\nu_i \ne \nu'_i$.
A function that first increases (respectively decreases) and then decreases (respectively increases) with a unique maximum (respectively minimum) is called unimodal. We need to show that $E^V$ is unimodal. For this, we shall use induction on n combined with an analysis of the unimodality properties of the derivatives of any order of $E^V$. In fact we will prove the stronger result that the derivatives of all orders of $E^V$ are unimodal and have a unique zero crossing. For $p = 1, 2, \ldots$, define

$$F^p(k) = \frac{d^p E^V}{dk^p} \tag{4.1}$$

Then

$$F^p(k) = \sum_i f_i^p(k) = \sum_i (A_i^p a_i^k + B_i^p a_i^{2k}) \tag{4.2}$$

with $A_i^1 = A_i$, $B_i^1 = B_i$, $A_i^p = A_i (\log a_i)^{p-1}$ and $B_i^p = B_i (2 \log a_i)^{p-1}$. Clearly, for any $p \ge 1$, $\mathrm{sign}(A_i^p) = (-1)^{p+1}$, $\mathrm{sign}(B_i^p) = (-1)^p$, and $\mathrm{sign}(f_i^p(0)) = \mathrm{sign}(A_i^p + B_i^p) = (-1)^p$. Therefore $\mathrm{sign}[F^p(0)] = (-1)^p$ and, as $k \to \infty$, $F^p(k)$ approaches zero as $\sum_i A_i^p a_i^k$, thus with the sign of $A_i^p$, which is $(-1)^{p+1}$. As a result, all the continuous functions $F^p$ must have at least one zero crossing. If $F^p$ is unimodal, then $F^p$ has a unique zero crossing. If $F^{p+1}$ has a unique zero crossing, then $F^p$ is unimodal. Thus if for some $p_0$, $F^{p_0}$ has a unique zero crossing, then all the functions $F^p$ ($1 \le p < p_0$) are unimodal and have a unique zero crossing. Therefore, $E^V$ has a unique minimum if and only if there exists an index p such that $F^p$ has a unique zero crossing. By using induction on n, we are going to see that for p large enough this is always the case. Before we start the induction, for any continuously differentiable function f defined over $[0, \infty)$, let

$$\mathrm{zero}(f) = \inf\{x : f(x) = 0\} \tag{4.3}$$

and

$$\mathrm{ext}(f) = \inf\left\{x : \frac{df}{dx}(x) = 0\right\} \tag{4.4}$$

Most of the time, zero and ext will be applied to functions that in fact have a unique zero or extremum. In particular, for any i and p, it is trivial to see that the functions $f_i^p$ are unimodal and with a unique zero crossing. A simple calculation gives

$$\mathrm{zero}(f_i^p) = \frac{1}{\log a_i} \log \frac{-A_i}{2^{p-1} B_i} = \frac{1}{\log a_i} \log \frac{\nu'_i - \nu_i}{2^{p-1} b_i (\lambda_i + \nu'_i)} \tag{4.5}$$
and

$$\mathrm{ext}(f_i^p) = \frac{1}{\log a_i} \log \frac{-A_i}{2^{p} B_i} = \frac{1}{\log a_i} \log \frac{\nu'_i - \nu_i}{2^{p} b_i (\lambda_i + \nu'_i)} \tag{4.6}$$

Also notice that for any $p \ge 1$

$$\min_i \mathrm{zero}(f_i^p) \le \mathrm{zero}(F^p) \le \max_i \mathrm{zero}(f_i^p) \tag{4.7}$$

and

$$\min_i \mathrm{ext}(f_i^p) \le \mathrm{ext}(F^p) \le \max_i \mathrm{ext}(f_i^p) \tag{4.8}$$

(equations 4.7 and 4.8 are in fact true for any zero crossing or extremum of $F^p$). We can now begin the induction. For n = 1, $E^V$ has trivially a unique minimum and all its derivatives are unimodal with a unique zero crossing. Let us suppose that this is also true of any validation error function of n - 1 variables. Let $\lambda_1 \ge \cdots \ge \lambda_n > 0$ and consider the corresponding ordering induced on the variables $a_i = 1 - \eta \lambda_i - \eta \nu_i$, $1 > a_{i_1} \ge \cdots \ge a_{i_n} \ge 0$. Let $i_j$ be a fixed index such that $a_{i_1} \ge a_{i_j} \ge a_{i_n}$ and write, for any $p \ge 1$, $F^p(k) = G^p(k) + f_{i_j}^p(k)$ with $G^p(k) = \sum_{i \ne i_j} f_i^p(k)$. The function $f_{i_j}^p$ is unimodal with a unique zero crossing and so is $G^p$ by the induction hypothesis. Now it is easy to see that $F^p$ will have a unique zero crossing if

$$\mathrm{zero}(G^p) \le \mathrm{zero}(f_{i_j}^p) \le \mathrm{ext}(G^p) \tag{4.9}$$

By applying equations 4.7 and 4.8 to $G^p$, we see that $F^p$ will have a unique zero crossing if

$$\max_{i \ne i_j} \mathrm{zero}(f_i^p) \le \mathrm{zero}(f_{i_j}^p) \le \min_{i \ne i_j} \mathrm{ext}(f_i^p) \tag{4.10}$$

Substituting the values given by equations 4.5 and 4.6, we can see that for large p, equation 4.10 is equivalent to

$$\max_{i \ne i_j} \left( -p \frac{\log 2}{\log a_i} \right) \le -p \frac{\log 2}{\log a_{i_j}} \le \min_{i \ne i_j} \left( -p \frac{\log 2}{\log a_i} \right) \tag{4.11}$$

and this is satisfied since $a_{i_1} \ge \cdots \ge a_{i_n}$. Therefore, using the induction hypothesis, we see that there exists an integer $p_0$ such that, for any $p > p_0$, $F^p$ has a unique zero crossing. But, as we have seen, this implies that $F^p$ has a unique zero crossing also for $1 \le p \le p_0$. Therefore $E^V$ is unimodal with a unique minimum and its derivatives of all orders are unimodal with a unique zero crossing. Notice that $F^1(k)$ cannot be zero if all the functions $f_i^1(k)$ are simultaneously negative or positive. Therefore, a simple bound on the position of the unique minimum $k^{opt}$ is given by

$$\min_i \mathrm{zero}(f_i^1) \le \mathrm{zero}(F^1) \le \max_i \mathrm{zero}(f_i^1) \tag{4.12}$$
or

$$\min_i \frac{1}{\log a_i} \log \frac{-A_i}{B_i} \le k^{opt} \le \max_i \frac{1}{\log a_i} \log \frac{-A_i}{B_i} \tag{4.13}$$

[It is also possible, for instance, to study the effect of the initial $\alpha_i^{(0)}$ on the position or the value of the local minima. By differentiating the relation $F^1(k) = 0$ one gets immediately (4.14) (see Fig. 2)].
To find an upper bound on the number of local minima of $E^V$ in the general case of arbitrary noise and initial conditions, we first order the 2n numbers $a_i$ and $a_i^2$ into an increasing sequence $c_i$, $i = 1, \ldots, 2n$. This induces a corresponding ordering on the 2n numbers $A_i$ and $B_i$, yielding a second sequence $C_i$, $i = 1, \ldots, 2n$. Now the derivative of $E^V$ can be written in the form

$$\frac{dE^V}{dk} = F^1(k) = \int C(a) a^k \, d\mu(a) \tag{4.15}$$

where $\mu$ is the finite positive measure concentrated at the points $a_i$ and $a_i^2$. The kernel $a^k$ in the integral is totally positive. Thus (see, for instance, Karlin 1968, theorem 3.1, p. 233) the number of sign changes of $F^1(k)$ is bounded by the number of sign changes in the sequence C. Therefore the number of sign changes in $F^1$ is at most 2n - 1 and the number of zeros of $F^1$ is at most 2n - 1. So the number of local minima of $E^V$ is at most n.
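The counting argument above can be tried on a concrete case. The sketch below builds the 2n pairs (base, coefficient) for the Figure 1 parameters with $\alpha_3^{(0)} = 0.9$, sorts them by base, and counts sign changes in the resulting coefficient sequence. The expressions used for $A_i$ and $B_i$ are the ones obtained here by differentiating the validation error of equation 2.21 with the help of equation 2.15; they, the learning rate 0.01, and the function names are assumptions of this sketch rather than quotations from the paper.

```python
import math

def derivative_terms(lams, nus, nus_val, alpha0s, eta):
    """Pairs (base, coefficient) with dE^V/dk = sum coefficient * base**k, sorted by base."""
    terms = []
    for lam, nu, nu_val, alpha0 in zip(lams, nus, nus_val, alpha0s):
        a = 1.0 - eta * (lam + nu)
        b = 1.0 - alpha0 * (lam + nu) / lam
        scale = 2.0 * lam ** 2 * math.log(a) / (lam + nu) ** 2
        terms.append((a, scale * b * (nu - nu_val)))            # A_i, attached to a_i
        terms.append((a * a, scale * b * b * (lam + nu_val)))   # B_i, attached to a_i^2
    return sorted(terms)

def sign_changes(values):
    """Number of sign changes in a sequence, ignoring exact zeros."""
    signs = [v > 0 for v in values if v != 0.0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def derivative(terms, k):
    """Evaluate dE^V/dk at training time k."""
    return sum(coef * base ** k for base, coef in terms)

terms = derivative_terms([22.0, 0.7, 2.5], [4.0, 1.0, 3.0], [20.0, 20.0, 20.0],
                         [0.0, 0.0, 0.9], 0.01)
bound = sign_changes([coef for _, coef in terms])
observed = sign_changes([derivative(terms, k) for k in range(400)])
print(bound, observed)
```

For this parameter set the ordered coefficient sequence changes sign 3 times, so the derivative has at most 3 zeros and the validation curve at most 2 local minima, consistent with the two minima reported for the fourth curve of Figure 1.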
Acknowledgments
This work is in part supported by grants from the Office of Naval Research and the McDonnell-Pew Foundation to P. B. We would like to thank Yosi Rinott for useful discussions.
References
Baldi, P., and Hornik, K. 1989. Neural network and principal component analysis: Learning from examples without local minima. Neural Networks 2, 53-58.
Baldi, P., and Hornik, K. 1991. Back-propagation and unsupervised learning in linear networks. In Back-propagation: Theory, Architectures and Applications, Y. Chauvin and D. E. Rumelhart, eds. Lawrence Erlbaum, NJ. In press.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Chauvin, Y. 1991. Generalization dynamics in LMS trained linear networks. Neural Information Processing Systems 3 (Proceedings of the 1990 NIPS Conference). Morgan Kaufmann, San Mateo, CA.
Karlin, S. 1968. Total Positivity. Stanford University Press, Stanford, CA.
Sompolinsky, H., Tishby, N., and Seung, H. S. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 65(13), 1683-1686.
Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the IJCNN, pp. 403-409. IEEE, New York.
Received 1 February 1991; accepted 13 April 1991.
Communicated by Eric Baum
Learning the Unlearnable

Dan Nabutovsky
Department of Electronics, Weizmann Institute of Science, Rehovot 76200, Israel

Eytan Domany¹
Department of Theoretical Physics, Oxford University, Oxford, OX1 3NP, United Kingdom

We present a local perceptron learning rule that either converges to a solution, or establishes linear nonseparability. We prove that when no solution exists, the algorithm detects this in a finite time (number of learning steps). This time is polynomial in typical cases and exponential in the worst case, when the set of patterns is nonstrictly linearly separable. The algorithm is local and has no arbitrary parameters.

1 Introduction

The simplest neural network, the perceptron, was introduced by Rosenblatt (1962). Its goal is to find a vector of weights $w$, such that the condition²

$$w \cdot \xi_\mu > 0 \tag{1.1}$$

is satisfied for each of $P$ patterns $\xi_\mu \in B^N$, $\mu = 1 \ldots P$, $B = \{-1, 1\}$. Such a vector of weights is found in the course of a training session; the $P$ patterns are presented cyclically and after each pattern the weights are modified according to the perceptron learning rule (1.2), where

$$h_\mu = w \cdot \xi_\mu$$

and

$$H(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$

¹On leave from the Weizmann Institute of Science, Rehovot, Israel.
²This is, strictly speaking, a homogeneously linearly separable model. Formally, linear separability means that for a set of answers $\zeta_\mu$ there are $w \in R^N$ and $t \in R$ such that $\forall \mu: \zeta_\mu (w \cdot \xi_\mu - t) > 0$. This can, however, be transformed to a homogeneous model by the redefinitions $\xi \to \zeta\,(-1, \xi)$, $w \to (t, w)$.

Neural Computation 3, 604-616 (1991) © 1991 Massachusetts Institute of Technology
A convergence theorem can be proved for this learning rule; it means that after a finite number of training steps the perceptron will find a correct weight vector $w$ if one exists. There are many extended and improved versions of the original perceptron learning rule (Abbott and Kepler 1989; Krauth and Mezard 1987; Anlauf and Biehl 1989; Kinzel and Rujan 1990). All these methods converge to a solution if one exists, but none of them detects the absence of a solution; in this case they run endlessly (see, however, Anlauf and Biehl 1990).

On the other hand, there are algorithms that do detect the absence of a solution. For example, searching for the weights can be formulated as a linear programming problem and as such can be solved using, say, the simplex algorithm (see Taha 1982). This algorithm needs polynomial time in the average case; one may use more sophisticated linear programming methods (Khachian 1979; Karmarkar 1984), which need polynomial time in any case. Another procedure (Ho and Kashyap 1965) treats the embedding fields $h_\mu$ as the independent variables and minimizes $\sum_\mu (\xi_\mu \cdot w - h_\mu)^2$ in the region $h_\mu > 0$. Initially it sets $h = (1 \ldots 1)$ and then repeats the replacements

$$w^{\mathrm{new}} = A h$$

$$h_\mu^{\mathrm{new}} = h_\mu + \rho \max(0,\ \xi_\mu \cdot w - h_\mu)$$

until $h^{\mathrm{new}} = h$, or $\xi_\mu \cdot w > 0$ for all patterns. Here $A$ is the pseudoinverse of the matrix of patterns and $\rho$ is an arbitrary parameter. Related procedures that eliminate the need to evaluate $A$ also exist.

When comparing perceptron-type learning algorithms with the latter, an important point must be taken into account. All algorithms of the type introduced by Ho and Kashyap need to remember at all times all the $P$ values of the embedding fields $h_\mu$. As to linear programming methods, they require storage of the entire table of patterns as well as some additional information. In other words, these algorithms are not local and as such cannot be realized by neurons.

A number of recently introduced learning algorithms use the perceptron learning rule for networks with one or more layers of hidden units, with fixed (Grossman et al. 1989; Nabutovsky et al. 1990) as well as with varying (Mezard and Nadal 1989; Frean 1990) architectures. Finding a local algorithm that indicates when no solution exists, and does so in the course of a regular perceptron learning process, is of practical as well as theoretical interest. In this paper we introduce such an algorithm and prove that it either solves a perceptron learning problem or states the absence of a solution. It has no arbitrary constants.

The main idea of the algorithm, presented in Section 2, is to store a lower bound for $\cos\gamma = w \cdot w^*$, where $w$ is the current vector of weights, and $w^*$ is a solution whose existence is assumed
(both vectors are normalized). For reasons of convenience we store and update a parameter $d$, which is a lower bound for $(w \cdot w^*)/(\min_\mu w^* \cdot \xi_\mu)$, and which we call despair. In Section 3 we show that when $d$ exceeds a critical value, this implies that $\cos\gamma > 1$, which means that our assumption (that a solution exists) was wrong. Despair increases with every learning step that modifies $w$. The size of the learning steps is chosen to maximize the increase of despair; this results in a convergence theorem, which is proved in Section 4. In Section 5 we prove that in general the algorithm identifies the absence of a solution in a polynomial number of steps, since despair increases exponentially. Numerical results for random patterns and comparisons with other methods are presented in Section 6. Finally, our work is summarized and discussed in Section 7.

2 The Learning Algorithm
Our aim is to construct a learning algorithm that either finds a solution $w^* \in R^N$ to 1.1, or terminates, thereby proving that no such solution exists. The algorithm is initialized by setting $w = \xi_1/\sqrt{N}$, $d = 1/\sqrt{N}$. This new quantity $d$, which we call despair, is stored and modified in the course of the learning process. Next we present patterns cyclically, and whenever an error is encountered, that is,

$$h_\mu = w \cdot \xi_\mu \le 0 \tag{2.1}$$

for some pattern $\mu$, a learning step is taken, which consists of changing $w$ and $d$ according to

$$w^{\mathrm{new}} = \frac{w + \eta\,\xi_\mu}{\sqrt{1 + 2\eta h_\mu + \eta^2 N}} \tag{2.2}$$

$$d^{\mathrm{new}} = \frac{d + \eta}{\sqrt{1 + 2\eta h_\mu + \eta^2 N}} \tag{2.3}$$

where

$$\eta = \frac{-h_\mu + 1/d}{N - h_\mu/d} \tag{2.4}$$

The learning process stops when one of two conditions becomes satisfied. We stop if error-free performance is achieved, and a solution has been found. Alternatively, the procedure is aborted when the despair $d$ exceeds a critical value:

$$d > d_c \equiv \frac{N^{(N+1)/2}}{2^{N-1}} \tag{2.5}$$

In this case there is no solution.³

³Note that in practice we do not need to wait for so long. In most problems one replaces the value $d_c$ by some (much smaller) $d_c'$, which then becomes a parameter of the algorithm. $d_c'$ has the same meaning as $1/\delta$ in Abbott and Kepler (1989).
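The learning loop described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes the update $w \to (w + \eta\xi_\mu)/\sqrt{1 + 2\eta h_\mu + \eta^2 N}$, $d \to (d + \eta)/\sqrt{1 + 2\eta h_\mu + \eta^2 N}$ with $\eta$ from 2.4, and the threshold $d_c = N^{(N+1)/2}/2^{N-1}$; these forms are inferred from the normalization and optimality statements in Sections 3 and 4. The function name, the `max_sweeps` cap, the `d_stop` override, and the guard against a vanishing norm are assumptions added for the example.

```python
import math
import numpy as np

def despair_perceptron(patterns, d_stop=None, max_sweeps=100_000):
    """Run the despair rule on rows xi_mu in {-1,+1}^N.

    Returns (status, w) with status "solved", "unsolvable" or "undecided".
    d_stop is an optional smaller threshold (hypothetical parameter name).
    """
    xi = np.asarray(patterns, dtype=float)
    P, N = xi.shape
    if d_stop is None:
        # log of the assumed d_c = N^((N+1)/2) / 2^(N-1)
        log_d_stop = 0.5 * (N + 1) * math.log(N) - (N - 1) * math.log(2)
    else:
        log_d_stop = math.log(d_stop)

    w = xi[0] / math.sqrt(N)      # initial weights, |w| = 1
    d = 1.0 / math.sqrt(N)        # initial despair

    for _ in range(max_sweeps):
        errors = 0
        for mu in range(P):
            h = float(w @ xi[mu])
            if h > 0:
                continue          # pattern already satisfied
            errors += 1
            eta = (-h + 1.0 / d) / (N - h / d)             # step size, eq. 2.4
            norm_sq = 1.0 + 2.0 * eta * h + eta * eta * N  # |w + eta*xi|^2
            if norm_sq < 1e-24:
                # w was exactly opposite to xi_mu: despair diverges
                return "unsolvable", w
            norm = math.sqrt(norm_sq)
            w = (w + eta * xi[mu]) / norm                  # keeps |w| = 1
            d = (d + eta) / norm
            if math.log(d) > log_d_stop:
                return "unsolvable", w
        if errors == 0:
            return "solved", w
    return "undecided", w
```

For instance, the four patterns $(\pm 1, \pm 1)$ in $N = 2$ contain both a vector and its negative, so no $w$ satisfies 1.1; under the assumptions above the sketch reports this after two learning steps, when $d$ reaches $\sqrt{5} > d_c = \sqrt{2}$.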
3 Analysis
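In this section we prove that when the condition 2.5 is satisfied, indeed no solution exists. First note normalization: the vector $w$ was initialized with $|w| = 1$, and its normalization is preserved by the learning step equation 2.2. Assuming that a (normalized) solution $w^*$ exists, we must have

$$w \cdot w^* \le 1 \tag{3.1}$$

The next step establishes that throughout the learning process the condition

$$\frac{w \cdot w^*}{h^*_{\min}} \ge d \tag{3.2}$$

is satisfied, where

$$h^*_{\min} = \min_\mu\, w^* \cdot \xi_\mu$$

and $w^*$ is a solution of 1.1. The third step is to prove that if inequalities 1.1 have a solution, they also have a solution with

$$h^*_{\min} \ge \frac{1}{d_c} \tag{3.3}$$

But now it is easy to see that

$$d \le \frac{w \cdot w^*}{h^*_{\min}} \le \frac{1}{h^*_{\min}} \le d_c \tag{3.4}$$

where we used 3.2 for the first inequality, 3.1 for the second, and 3.3 for the third. Hence when a solution exists, $d$ cannot exceed $d_c$, QED. We now proceed to prove statements 3.2 and 3.3.

Proof of (3.2). Initially $d = 1/\sqrt{N}$, and

$$\frac{w \cdot w^*}{h^*_{\min}} = \frac{\xi_1 \cdot w^*}{\sqrt{N}\, h^*_{\min}} \ge \frac{1}{\sqrt{N}}$$

so that 3.2 holds. We now assume that it holds at some intermediate stage, for the current $w$ and $d$, and show that then it holds also after a learning step given by 2.2 and 2.3:

$$\frac{w^{\mathrm{new}} \cdot w^*}{h^*_{\min}} = \frac{w \cdot w^* + \eta\, \xi_\mu \cdot w^*}{h^*_{\min} \sqrt{1 + 2\eta h_\mu + \eta^2 N}} \ge \frac{d + \eta}{\sqrt{1 + 2\eta h_\mu + \eta^2 N}} = d^{\mathrm{new}}$$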
This now proves 3.2 for any w and d obtained in the course of our learning procedure.
Proof of (3.3).⁴ If the system 1.1 has a solution, then the system

$$u \cdot \xi_\mu \ge 1 \tag{3.5}$$

also has a solution; note that $u$ is not normalized. Define its solution with minimal absolute value as $u^*$. By definition, at least $N$ linearly independent inequalities 3.5 are satisfied at $u^*$ as equalities:

$$\xi_{\mu_i} \cdot u^* = 1, \qquad i = 1 \ldots N \tag{3.6}$$

Then, as a solution, we have

$$u^*_i = \frac{\Delta_i}{\Delta}, \qquad i = 1 \ldots N \tag{3.7}$$

Here $\Delta$ is the determinant of the coefficient matrix in 3.6; $\Delta_i$ is the determinant of the matrix obtained by replacing all elements in the $i$th column of the same matrix by unity. Application of Hadamard's evaluation of determinants leads to

$$|\Delta_i| \le N^{N/2} \tag{3.8}$$

Adding the first column of $\Delta$ to all others and dividing them by 2, we can see that $\Delta/2^{N-1}$ is an integer; therefore

$$|\Delta| \ge 2^{N-1} \tag{3.9}$$

Combining 3.7, 3.8, and 3.9, we obtain

$$|u^*_i| \le \frac{N^{N/2}}{2^{N-1}}$$

and hence

$$|u^*| \le \frac{N^{(N+1)/2}}{2^{N-1}} \tag{3.10}$$

Finally, denoting $w^* = u^*/|u^*|$, using 3.5 with 3.10 together with the definition 2.5, we can see that

$$w^* \cdot \xi_\mu \ge \frac{2^{N-1}}{N^{(N+1)/2}} = \frac{1}{d_c}$$

QED.

⁴The equivalent result for a perceptron with threshold and inputs from $\{0, 1\}$ rather than $\{-1, 1\}$ was obtained by Muroga et al. (1961). Our proof is a version of theirs.
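The two determinant facts used in this proof are easy to check numerically. The sketch below draws random $\pm 1$ matrices and verifies, with exact integer arithmetic, that the determinant is a multiple of $2^{N-1}$ and that neither it nor the determinant with one column replaced by ones exceeds the Hadamard bound $N^{N/2}$. It also evaluates $d_c = N^{(N+1)/2}/2^{N-1}$, the value implied by 3.10. The helper names are hypothetical, and the fraction-free (Bareiss) elimination is one convenient way to get an exact determinant, not something the paper prescribes.

```python
import random

def exact_det(rows):
    """Exact integer determinant via fraction-free (Bareiss) elimination."""
    a = [list(r) for r in rows]
    n = len(a)
    sign, prev = 1, 1
    for k in range(n - 1):
        if a[k][k] == 0:
            # find a row below with a nonzero pivot and swap it in
            swap = next((r for r in range(k + 1, n) if a[r][k] != 0), None)
            if swap is None:
                return 0
            a[k], a[swap] = a[swap], a[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # the division by the previous pivot is always exact
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
        prev = a[k][k]
    return sign * a[n - 1][n - 1]

def critical_despair(n):
    """d_c = N^((N+1)/2) / 2^(N-1), as implied by inequality 3.10."""
    return n ** ((n + 1) / 2) / 2 ** (n - 1)

def check_determinant_bounds(n, trials=200, seed=0):
    """Return True if every sampled +-1 matrix obeys both bounds."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
        det = exact_det(m)
        if det % 2 ** (n - 1) != 0:        # divisibility behind 3.9
            return False
        if det * det > n ** n:             # Hadamard bound on Delta
            return False
        ones = [[1] + row[1:] for row in m]  # first column replaced by unity
        if exact_det(ones) ** 2 > n ** n:  # Hadamard bound 3.8 on Delta_i
            return False
    return True
```

For $N = 25$, `critical_despair(25)` is of order $10^{11}$, which gives a feel for why a much smaller practical threshold is attractive.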
4 Convergence Theorem
We now prove that our algorithm terminates after a finite number of learning steps. Define

$$d_{\mathrm{new}}(\eta', h') = \frac{d + \eta'}{\sqrt{1 + 2\eta' h' + \eta'^2 N}} \tag{4.1}$$

$$\eta_{\mathrm{opt}}(h') = \frac{-h' + 1/d}{N - h'/d} \tag{4.2}$$

In this notation equation 2.3 takes the form

$$d^{\mathrm{new}} = d_{\mathrm{new}}[\eta_{\mathrm{opt}}(h_\mu), h_\mu]$$

For $h'^2 \le N$ (which is always obeyed), the function $d_{\mathrm{new}}(\eta', h')$ is maximal at $\eta' = \eta_{\mathrm{opt}}(h')$, and therefore we get, for $h_\mu < 0$,

$$d^{\mathrm{new}} = d_{\mathrm{new}}[\eta_{\mathrm{opt}}(h_\mu), h_\mu] \ge d_{\mathrm{new}}\!\left(\frac{1}{Nd}, h_\mu\right)$$

Since the right-hand side is at least $d_{\mathrm{new}}(1/Nd,\, 0) = \sqrt{d^2 + 1/N}$ for $h_\mu < 0$, this immediately yields

$$(d^{\mathrm{new}})^2 \ge d^2 + \frac{1}{N} \tag{4.3}$$

so that after at most $N d_c^2$ learning steps 2.3, $d$ must exceed $d_c$, which means that in (at most) $N d_c^2$ sweeps either an error-free performance will be achieved or the condition 2.5 will be satisfied. In both cases we stop the process. If the problem is linearly separable and can be solved with minimal field $h^*_{\min}$, then $d$ cannot exceed $1/h^*_{\min}$, because of 3.1 and 3.2, and we find the solution after at most $N/{h^*_{\min}}^2$ sweeps.

A case of some theoretical interest is that of a random set of $P$ input patterns, in the limit of large $N$ and fixed $P/N$. There are three distinct situations. When $P/N < 2$, $h^*_{\min}$ depends only on $P/N$ (Gardner 1988), and at most $O(N)$ sweeps are needed to find a solution. In the case $P/N > 2$, there is no solution for an infinite random set of patterns (Cover 1965). At the limiting value $P/N = 2$ a solution exists with probability 1/2 (Cover 1965), and when it does exist, our procedure will find it in at most $O(N^5)$ sweeps.

5 The Exponential Increase of d
We now show that in the most prevalent cases of nonseparability despair grows exponentially fast. The worst case of our algorithm is that of nonstrict linear separability, that is, when 1.1 does not have a solution,
but the system

$$w^* \cdot \xi_\mu \ge 0 \tag{5.1}$$

does. In such a case $d^2$ increases linearly and an exponentially large number of learning steps is needed to reach 2.5. The reason is that in this case the magnitude of the negative fields $h_\mu < 0$ decreases as we learn, and the size $\eta$ of the learning steps decreases also. Luckily, when the system 5.1 has no solution, $d$ increases exponentially. The proof of this is the subject of this section. Define

$$k_0 = -\max_{|u| = 1}\, \min_\mu\, u \cdot \xi_\mu$$

If the system 5.1 has no solution, $k_0 > 0$. By definition of $k_0$, $\min_\mu w \cdot \xi_\mu \le -k_0$. So, at the beginning of any learning sweep there is (at least one) pattern $\nu$ for which $h_\nu \le -k_0$. Our learning step 2.4 is such that after processing pattern $\nu$ we have $h_\nu \ge 0$. Let us denote by $i$ the last instance when $h_\nu = h_i \le -k_0$, and by $f$ the first instance when $h_\nu \ge 0$. Note that the field $h_\nu$ changes also when other patterns are learned; we know for certain, however, that during the learning sweep instances $i$ and $f$ will occur.

Now we establish a bound on the increase of $d$ during the learning steps taken between instances $i$ and $f$. To do this note first that after any application of 2.3,

$$d^{\mathrm{new}} = d_{\mathrm{new}}[\eta_{\mathrm{opt}}(h_\mu), h_\mu] \tag{5.2}$$

where the functions $d_{\mathrm{new}}$ and $\eta_{\mathrm{opt}}$ are defined by 4.1 and 4.2. Substituting 4.1 into 5.2 and using $h_\mu < 0$, we obtain $d^{\mathrm{new}} \ge d/\sqrt{1 - h_\mu^2/N}$. Since $h_\mu^2 \le N$ this inequality yields

$$\ln d^{\mathrm{new}} \ge \ln d + \frac{h_\mu^2}{2N} \tag{5.3}$$

Summing this inequality over the learning steps from $i$ to $f$ we obtain

$$\ln d_f - \ln d_i \ge \frac{1}{2N} \sum_\mu h_\mu^2 \tag{5.4}$$
This sum runs over all patterns $\mu$ for which nontrivial learning steps were taken between instances $i$ and $f$ (including $i$, but not $f$); each had $h_\mu \le 0$ prior to its learning step. We now concentrate on the value of $h_\nu$ during these learning steps. By 2.2 and 2.3, the updated field $h_\nu^{\mathrm{new}}$ obeys a relation involving the term $\eta N$, and hence the bound 5.5a. But note that between $i$ and $f$ we have $h_i \le h_\nu \le 0$, and that $d \ge d_i$; using this in 5.5a we get 5.5b. But, by definition of $f$ and $i$, relation 5.5c holds. We now sum 5.5b over all (nontrivial) steps between $i$ and $f$, and use 5.5c to bound the sum of the step sizes $\sum_\mu \eta_\mu$. Using $d_i \ge 1/\sqrt{N}$ and $0 \le -h_i \le \sqrt{N}$ this immediately yields 5.6a. On the other hand, by its definition 2.4 this sum is given by

$$\sum_\mu \eta_\mu = \sum_\mu \frac{-h_\mu + 1/d_\mu}{N - h_\mu/d_\mu} \le \sum_\mu \frac{-h_\mu + 1/d_i}{N} \tag{5.6b}$$

Combining 5.6a-b we obtain 5.6, where $P_l \le P$ is the number of terms in the sum. We now consider two cases. First, when $d_i \ge 4P/k_0$, we use 5.6 to establish the lower bound $-\sum_\mu h_\mu \ge k_0/4$. This yields a lower bound on

$$\sum_\mu h_\mu^2 \ge P_l \left(\frac{k_0}{4P_l}\right)^2 \ge \frac{k_0^2}{16P}$$

Using this in 5.4 we get

$$\ln d_f - \ln d_i \ge \frac{k_0^2}{32PN} \tag{5.7}$$
Therefore, in any sweep $\ln d$ increases by at least $k_0^2/32NP$, and hence, bearing in mind that initially $d \ge 1/\sqrt{N}$, we conclude that after at most

$$N_s = \frac{32NP}{k_0^2} \ln\!\left(d_c \sqrt{N}\right) \tag{5.8}$$

sweeps the condition 2.5 must be obeyed. In the second case, that is, $1/\sqrt{N} \le d_i < 4P/k_0$, it is easy to see, using 4.3, that after at most $N(4P/k_0)^2$ sweeps $d$ increases to the point where the conditions of the previous case apply, and hence this number of sweeps has to be added to the number calculated above.

One should note that the result 5.8 contains a factor of $1/k_0^2$; the hard case of a "barely unseparable" problem gives rise to very small $k_0$ and hence to large $N_s$. This situation is, however, not the common one, as we now demonstrate by a numerical implementation of our algorithm for random patterns.

6 Simulations
The algorithm was tested for random training sets, that is, $\xi_\mu^i = \pm 1$, with $N = 25$ inputs and $P = 2N = 50$ patterns. This value of $\alpha = P/N = 2$ is the most difficult to learn. In the large $N$ limit one always has a solution (Cover 1965) for $\alpha < 2$ and no solution for $\alpha > 2$. For $P = 2N$ a solution exists in half the cases. Results obtained by our algorithm, equations 2.2-2.4, are presented in Figure 1. The cases in which a solution is found are presented in Figure 1a, those with no solution in Figure 1b. When a solution does exist, $d^2$ increases linearly according to 4.3. Each line corresponds to a particular set of random training patterns, and terminates when a solution is found. We find that the number of training sweeps needed to learn is not exponentially large. As shown in Figure 1b, when the problem is not linearly separable, despair increases exponentially according to 5.10. Here we iterated 2.2-2.4 until despair reached the critical value $d_c$.

Next we compared the performance of our algorithm with that of other local and nonlocal algorithms. In Figure 2 we present results for the median number of sweeps needed to find a solution to the problem of $P$ random patterns, with $\alpha = 3/2$, considering only linearly separable cases. All algorithms need $O(PN)$ operations per sweep. To compare efficiency one also needs a measure of CPU time per sweep. In our simulations our algorithm was slower by (approximately) a factor of two than the simple perceptron learning rule 1.2, and about three times faster than the revised simplex algorithm. When no solution exists, as is the case for $P = 3N$ random patterns for large $N$, we can compare the time it takes to establish this for our algorithm and simplex. These results are presented in Figure 3.
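The statement that a solution exists in half the cases at $P = 2N$ follows from Cover's counting function for patterns in general position: of the $2^P$ possible labelings, $2\sum_{k=0}^{N-1}\binom{P-1}{k}$ are homogeneously linearly separable. The short sketch below evaluates that fraction. Random $\pm 1$ patterns at finite $N$ are not guaranteed to be in general position, so the numbers should be read as a guide rather than as exact probabilities for the simulations; the function names are hypothetical.

```python
from math import comb

def separable_dichotomies(P, N):
    """Cover's count of homogeneously separable labelings of P points
    in general position in N dimensions."""
    return 2 * sum(comb(P - 1, k) for k in range(N))

def prob_separable(P, N):
    """Fraction of the 2^P labelings that are linearly separable."""
    return separable_dichotomies(P, N) / 2 ** P

if __name__ == "__main__":
    for alpha in (1.0, 1.5, 2.0, 3.0):
        P = int(alpha * 25)
        print(f"alpha = {alpha}: P = {P}, fraction separable = {prob_separable(P, 25):.4f}")
```

At $N = 25$ the fraction is exactly 1/2 for $P = 50$, close to 1 for $P = 3N/2$, and close to 0 for $P = 3N$, which matches the regimes chosen for Figures 2 and 3.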
Figure 1: Despair $d$ versus number of learning sweeps of the training set for $N = 25$, $P = 50$ (random training patterns). (a) Linearly separable cases. (b) Linearly nonseparable cases. Each line corresponds to one experiment; lines terminate when a solution is found (a) or when $d > d_c$ (b).
We can see that our algorithm is faster than the simplex method when a solution exists, and slower than simplex when there is no solution. It is also a faster solver than the simple perceptron rule and somewhat slower than the Abbott-Kepler algorithm. Finally, we tested the generalization ability of our algorithm, that is, the probability of correctly classifying a new random example (provided by a "teacher" perceptron). Results for $N = 20$ are shown in Figure 4, together with the analytic results of Opper et al. (1990); the latter were derived for the perceptron of optimal stability.
7 Discussion

An immediately obvious shortcoming of existing sequential implementations of the perceptron learning rule is that they keep searching for a solution even though one does not exist. We presented an algorithm that identifies this situation and terminates itself.
Figure 2: The median number of sweeps needed to learn a set of $P = 3N/2$ random patterns. Only linearly separable cases were treated. Every point uses 100 experiments.

Figure 3: The median number of sweeps needed to prove nonseparability of a training set of $P = 3N$ random patterns, as a function of $N$. Only unseparable cases were treated. Every point uses 20 experiments.
Figure 4: The generalization ability versus $P/N$. Points are obtained numerically, each averaged over 1600 experiments. The solid line is a theoretical result for an optimal perceptron that learns examples presented by a "teacher" perceptron (Opper et al. 1990).

The algorithm can be easily modified for a system $\xi_\mu \cdot w > \kappa$, which is important for Boltzmann networks. We also plan to incorporate this algorithm in our previously developed (Grossman et al. 1989; Nabutovsky et al. 1990) multilayer learning scheme, which utilizes perceptron learning. There, however, we plan to use a much smaller critical despair than the one given in 2.5. By doing this one may make the mistake of missing a solution, that is, abort learning even though a "hard" solution does exist (i.e., there are only solutions with very low values of $h^*$). On the other hand we do hope to gain by identifying, for every learning task, the most suitable values of the critical despair, which up to now were set by trial and error.

Acknowledgment

This research was supported by Minerva and the SERC.

References

Abbott, L. F., and Kepler, T. B. 1989. Optimal learning in neural network memories. J. Phys. A 22, L711.
Anlauf, J. K., and Biehl, M. 1989. The AdaTron: An adaptive perceptron algorithm. Europhys. Lett. 10, 687.
Anlauf, J. K., and Biehl, M. 1990. Properties of an adaptive perceptron learning algorithm. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds. Elsevier. This work presents a parallel algorithm, which converges to a local minimum of an energy function when no solution exists.
Cover, T. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elect. Comput. 14, 326.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198.
Gardner, E. 1988. The space of interactions in neural network models. J. Phys. A 21, 257.
Grossman, T., Meir, R., and Domany, E. 1989. Learning by choice of internal representations. Complex Syst. 2, 555.
Ho, E., and Kashyap, R. L. 1965. An algorithm for linear inequalities and its applications. IEEE Trans. Elect. Comp. EC-14, 683.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Khachian, L. G. 1979. A polynomial time algorithm for linear programming. Dokl. Akad. Nauk USSR 244(5), 1093-1096. Translated in Soviet Math. Dokl. 20, 191-194.
Kinzel, W., and Rujan, P. 1990. Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
Krauth, W., and Mezard, M. 1987. Learning algorithms with optimal stability in neural networks. J. Phys. A 20, L745.
Mezard, M., and Nadal, J. 1989. Learning in feedforward neural networks: The tiling algorithm. J. Phys. A 22, 2191.
Muroga, S., Toda, I., and Takasu, S. 1961. Theory of majority decision elements. J. Franklin Inst. 271, 376.
Nabutovsky, D., Grossman, T., and Domany, E. 1990. Learning by CHIR without storing internal representations. Complex Syst. 4, 519.
Opper, M., Kinzel, W., Kleinz, J., and Niehl, R. 1990. On the ability of the optimal perceptron to generalise. Preprint.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Received 3 December 1990; accepted 1 July 1991.
Communicated by Halbert White
Kolmogorov's Theorem Is Relevant

Věra Kůrková
Institute of Computer Science, Czechoslovak Academy of Sciences, P. O. Box 5, 182 07 Prague 8, Czechoslovakia

We show that Kolmogorov's theorem on representations of continuous functions of n variables by sums and superpositions of continuous functions of one variable is relevant in the context of neural networks. We give a version of this theorem with all of the one-variable functions approximated arbitrarily well by linear combinations of compositions of affine functions with some given sigmoidal function. We derive an upper estimate of the number of hidden units.
Hecht-Nielsen (1987) suggested that a remarkable mathematical result of Kolmogorov (1957) could provide new insights and tools for understanding multilayer neural networks. There are several theorems in different branches of mathematics named after this great Russian mathematician. The one mentioned by Hecht-Nielsen was a theorem disproving Hilbert's conjecture formulated as the thirteenth of the famous list of 23 open problems that Hilbert supposed to be of the greatest importance for the development of mathematics in this century. The thirteenth problem, although formulated as a concrete minor hypothesis, is connected with the basic problem of algebra: the solution of polynomial equations. Could roots of a general algebraic equation of higher degree be expressed, analogously to the solution by radicals, by sums and compositions of one-variable functions of some suitable type? Hilbert conjectured that some continuous functions of three variables are not representable by sums and superpositions even of functions of two variables. This was refuted by Arnold (1957). Kolmogorov (1957) even proved a general representation theorem stating that any continuous function $f$ defined on an n-dimensional cube is representable by sums and superpositions of continuous functions of only one variable. Kolmogorov's formula

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \varphi_q\!\left(\sum_{p=1}^{n} \psi_{pq}(x_p)\right) \tag{1.1}$$

readily brings to mind perceptron type networks with the qualification that the one-variable functions $\varphi_q$ ($q = 1, \ldots, 2n+1$) and $\psi_{pq}$ ($p = 1, \ldots, n$, $q = 1, \ldots, 2n+1$) are far from being any of the type of functions currently
Neural Computation 3, 617-622 (1991) © 1991 Massachusetts Institute of Technology
used in neurocomputing. In fact, having even fractal graphs, they are highly nonsmooth.

This was the reason for Girosi and Poggio's (1989) criticism of Hecht-Nielsen's proposal. They formulated two main reservations:

1. The functions $\psi_{pq}$ are highly nonsmooth.
2. The functions $\varphi_q$ depend on the specific function $f$ and hence are not representable in a parameterized form.

We shall show that by replacing the equality in equation 1.1 by only an approximation, we can eliminate both of these difficulties.

Highly nonsmooth functions encountered in mathematics are mostly constructed as limits or sums of infinite series of smooth functions. This is the case, e.g., with the classical Weierstrass function with no derivative at any point and many other famous examples of functions with fractal graphs. Since in the context of neural networks we are interested only in approximations of functions, the only problem concerning the possible relevance of Kolmogorov's theorem for neurocomputing is whether Kolmogorov's construction can be modified in such a way that all of the one-variable functions are limits of sequences of smooth functions used in perceptron type networks.

By a perceptron type network we mean a multilayer network where units in each hidden layer sum up weighted inputs from the preceding layer, add to this sum a constant (bias), and then apply a sigmoidal nonlinearity, while units in the output layer sum only weighted inputs. So functions used in perceptron type networks are finite linear combinations of compositions of affine transformations of the real line $\mathcal{R}$ with some given sigmoidal function [a function $\sigma : \mathcal{R} \to [0, 1]$ with $\lim_{t \to -\infty} \sigma(t) = 0$ and $\lim_{t \to \infty} \sigma(t) = 1$]. We call them staircase-like functions of a sigmoidal type (or of a type $\sigma$).

Kolmogorov's construction of the functions $\varphi_q$ and $\psi_{pq}$ and their later improvements by Lorentz (1962) and Sprecher (1965) are, in fact, perfectly suited for staircase-like functions of any sigmoidal type.
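The definition of a staircase-like function can be made concrete with a small Python sketch: a finite sum $\sum_i a_i\,\sigma(b_i t + c_i)$, here with the logistic function as one admissible choice of $\sigma$. The example builds two steep steps so that the function is close to prescribed values on three closed intervals, which is the kind of approximation the argument relies on. The function names, the steepness value, and the chosen intervals are illustrative assumptions.

```python
import math

def logistic(t):
    """A sigmoidal function: tends to 0 at -infinity and to 1 at +infinity."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)               # avoids overflow for very negative t
    return e / (1.0 + e)

def staircase(t, terms, sigma=logistic):
    """Evaluate sum_i a_i * sigma(b_i * t + c_i) for terms = [(a, b, c), ...]."""
    return sum(a * sigma(b * t + c) for a, b, c in terms)

def steps(jumps, steepness=200.0):
    """Build terms for a function that rises by `height` near each `location`.

    jumps: list of (location, height) pairs.
    """
    return [(height, steepness, -steepness * location)
            for location, height in jumps]

# Prescribed values: 0 on [0, 0.3], 0.5 on [0.4, 0.6], 1 on [0.7, 1];
# the two jumps sit in the gaps between those intervals.
terms = steps([(0.35, 0.5), (0.65, 0.5)])

if __name__ == "__main__":
    for t in (0.0, 0.3, 0.5, 0.7, 1.0):
        print(t, round(staircase(t, terms), 4))
```

Raising `steepness` tightens the fit on the closed intervals while the behaviour inside the gaps stays unconstrained.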
Being very complex, all of these arguments contain a lot of unnecessary assumptions. But the only really relevant property of the functions used in the inductive construction of the one-variable functions $\varphi_q$ and $\psi_{pq}$ is that they have prescribed values on finitely many closed intervals; elsewhere they can be arbitrary, provided they are sufficiently bounded. However, such functions can be approximated arbitrarily well by staircase-like functions of any sigmoidal type (Kůrková 1991).

To illustrate the idea of Kolmogorov's construction of the functions $\psi_{pq}$, recall the classical Devil's staircase (Fig. 1). Kolmogorov, probably inspired by this nineteenth-century construction, developed "the second generation Devil's staircase," something Mandelbrot (1982) would appreciate, by replacing in each induction step the already constructed Devil's staircase's steps (within a very small neighborhood of each) by smaller steps.
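For readers who want to experiment with the classical Devil's staircase (the Cantor function), here is a minimal Python sketch. It reads off the ternary digits of $x$: digits 0 and 2 contribute binary digits 0 and 1, and the first digit 1 places $x$ on a flat step. The function name and the `depth` cutoff are choices made for this example, and floating-point rounding limits accuracy at points very close to the ends of a step.

```python
def cantor_function(x, depth=40):
    """Approximate the classical Devil's staircase on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)            # next ternary digit of the argument
        x -= digit
        if digit == 1:
            return value + scale  # x lies on a flat step
        if digit == 2:
            value += scale        # ternary 2 becomes binary 1
        scale /= 2
    return value

if __name__ == "__main__":
    for x in (0.0, 0.2, 0.4, 0.5, 0.8, 1.0):
        print(x, cantor_function(x))
```

The flat steps are visible directly: every argument in the middle third returns 0.5, every argument between 1/9 and 2/9 returns 0.25, and so on.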
Figure 1: Devil's staircase.

The result was a strictly increasing function with, in contrast to the rectifiable classical Devil's staircase, a fractal graph. Nevertheless, both first and second generation Devil's staircases are limits of uniformly converging series of staircase-like functions of any sigmoidal type.

In contrast to the functions $\psi_{pq}$, which are universal for the given dimension $n$, the functions $\varphi_q$ depend on $f$. However, they can also be constructed as limits of staircase-like functions of any sigmoidal type. Consider, for staircase-like functions $\psi_p$ of any sigmoidal type, the function $\Psi$ defined on the n-dimensional cube by $\Psi(x_1, \ldots, x_n) = \sum_{p=1}^{n} \psi_p(x_p)$. $\Psi$ defines on the cube a Rubik's cube-like structure with small boxes having edges corresponding to the steps of $\psi_p$ and gaps corresponding to the slopes of $\psi_p$. Suppose that the small boxes are mapped by $\Psi$ into closed mutually disjoint subintervals of the real line. Ascribing to these intervals values of $f$ at chosen points in the small boxes that $\Psi$ maps into these intervals, we define a finite family of steps that can be approximated arbitrarily well by a staircase-like function $\varphi$ of a given sigmoidal type. This function $\varphi$ is representable in a parameterized form with the
values of parameters depending on $f$. The function $\varphi \circ \Psi$ approximates $f$ on the subset of the cube formed by the union of all small boxes. The smaller the steps of $\psi_p$, the better the approximation. However, $f$ is not approximated on the gaps.

Now, we come to the reason why there are $2n + 1$ terms under the summation in (1.1). By suitable shifts of the slopes of the staircase we can gain $2n + 1$ Rubik's cube-like structures on the unit cube covering the n-dimensional cube sufficiently well in such a way that for each point there are more structures containing it in a box than structures containing it in a gap. We need $2n + 1$ such structures, since at some point of the cube it may happen that each of its $n$ coordinates is contained in the gaps of a different structure (at most $n$).

These are, roughly speaking, the ideas behind the proofs of the following theorems.

Theorem 1 (Kůrková 1991). Let $n$, $m$ be natural numbers with $n \ge 2$, $m \ge 2n + 1$, and let $\sigma : \mathcal{R} \to [0, 1]$ be any sigmoidal function. Then there exist real numbers $w_{pq}$ ($p = 1, \ldots, n$, $q = 1, \ldots, m$) and functions $\psi_q$ ($q = 1, \ldots, m$), being limits of uniformly converging sequences of staircase-like functions of a type $\sigma$, such that for every continuous function $f : [0, 1]^n \to \mathcal{R}$ there exists a continuous function $\varphi : \mathcal{R} \to \mathcal{R}$, being a limit of a uniformly converging sequence of staircase-like functions of a type $\sigma$, such that for every $(x_1, \ldots, x_n) \in [0, 1]^n$

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{m} \varphi\!\left(\sum_{p=1}^{n} w_{pq}\, \psi_q(x_p)\right)$$

Theorem 2 (Kůrková 1991). Let $n \ge 2$ be a natural number, $\sigma : \mathcal{R} \to [0, 1]$ be a sigmoidal function, $f : [0, 1]^n \to \mathcal{R}$ be a continuous function, and $\varepsilon$ a positive real number. Then there exist a natural number $k$ and staircase-like functions of a type $\sigma$, $\psi_{pi}$, $\varphi_i$ ($i = 1, \ldots, k$, $p = 1, \ldots, n$), such that for every $(x_1, \ldots, x_n) \in [0, 1]^n$

$$\left| f(x_1, \ldots, x_n) - \sum_{i=1}^{k} \varphi_i\!\left(\sum_{p=1}^{n} \psi_{pi}(x_p)\right) \right| < \varepsilon$$
Theorem 2 implies that any continuous function can be approximated arbitrarily well by a four-layer perceptron type network. However, several recent results (Funahashi 1989; Hecht-Nielsen 1989; Hornik et al. 1989; Cybenko 1989; Carroll and Dickinson 1989; Stinchcombe and White 1989, 1990; Hornik 1991) established that three layers are sufficient for approximations of general continuous functions. Nevertheless, the approach based on the technique developed by Kolmogorov is not without value. The above mentioned theorems are proved very elegantly using advanced theorems from functional analysis. However, nondirect proofs do not provide clear insight into constructions of approximating functions. The directness of our proofs can
be exploited for estimating the number of hidden units and for exploring which properties of a function being approximated are relevant for the growth of this number. The first step in this direction was taken in Kůrková (1991), where the numbers of units in the second and the third layer are estimated by nm(m + 1) and m²(m + 1)ⁿ, respectively, where n is the dimension of the unit cube Iⁿ and m depends on ε/‖f‖ as well as on the rate with which f increases distances. Hopefully, further analysis could bring finer estimates and more insight to the questions of what properties of the function being implemented play a role in determining the number of hidden units, and whether this number can be sufficiently reduced by using two instead of only one hidden layer.
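To get a feeling for how quickly these estimates grow, the short sketch below evaluates them for a few small cases. It assumes the two estimates read nm(m + 1) and m²(m + 1)ⁿ; the function name `hidden_unit_estimates` and the sample values of n and m are illustrative choices, not taken from the paper. In the paper m also depends on ε/‖f‖, so the values of m used here (the smallest allowed by m ≥ 2n + 1) are only a lower end of the range.

```python
def hidden_unit_estimates(n: int, m: int) -> tuple[int, int]:
    """Return the two layer-size estimates (n*m*(m+1), m^2*(m+1)^n).

    n is the input dimension and m the number of outer terms;
    both formulas are assumptions read from the text above.
    """
    if n < 2 or m < 2 * n + 1:
        raise ValueError("expected n >= 2 and m >= 2n + 1")
    second_layer = n * m * (m + 1)
    third_layer = m**2 * (m + 1) ** n
    return second_layer, third_layer


# Smallest admissible m for each n (m = 2n + 1).
for n in (2, 3, 4):
    m = 2 * n + 1
    print(n, m, hidden_unit_estimates(n, m))
```

The second estimate grows exponentially with the input dimension n, which is why finer estimates are of interest.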
References

Alexandrov, P. S. (ed.) 1983. Die Hilbertschen Probleme. Akademische Verlagsgesellschaft, Leipzig.
Arnold, V. I. 1957. On functions of three variables. Dokl. Akad. Nauk USSR 114, 679-681.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pp. I, 607-611. IEEE, New York.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals Syst. 2, 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Girosi, F., and Poggio, T. 1989. Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Comp. 1, 465-469.
Hecht-Nielsen, R. 1987. Kolmogorov's mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, pp. III, 11-14. IEEE, New York.
Hecht-Nielsen, R. 1989. Theory of the back-propagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pp. I, 593-606. IEEE, New York.
Hecht-Nielsen, R. 1990. Neurocomputing. Addison-Wesley, New York.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Kolmogorov, A. N. 1957. On the representations of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Dokl. Akad. Nauk USSR 114 (5), 953-956.
Kůrková, V. 1991. Kolmogorov's theorem and multilayer neural networks. Neural Networks (in press).
Lorentz, G. G. 1962. Metric entropy, widths, and superpositions of functions. Am. Math. Monthly 69, 469-485.
Mandelbrot, B. B. 1982. The Fractal Geometry of Nature. Freeman, San Francisco.
Sprecher, D. A. 1965. On the structure of continuous functions of several variables. Trans. Am. Math. Soc. 115, 340-355.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, pp. I, 613-617. IEEE, New York.
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, pp. III, 7-16. IEEE, New York.
Received 20 December 1991; accepted 6 June 1991.
Communicated by Rodney Goodman
An Exponential Response Neural Net

Shlomo Geva
Joaquin Sitte
Faculty of Information Technology, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia
By using artificial neurons with exponential transfer functions one can design perfect autoassociative and heteroassociative memory networks, with virtually unlimited storage capacity, for real or binary valued input and output. The autoassociative network has two layers: input and memory, with feedback between the two. The exponential response neurons are in the memory layer. By adding an encoding layer of conventional neurons the network becomes a heteroassociator and classifier. Because for real valued input vectors the dot-product with the weight vector is no longer a measure for similarity, we also consider a euclidean distance based neuron excitation and present Lyapunov functions for both cases. The network has energy minima corresponding only to stored prototype vectors. The exponential neurons make it simpler to build fast adaptive learning directly into classification networks that map real valued input to any class structure at its output.

1 Introduction
Associative memories implemented on neural networks have been the subject of intensive research for some years now. The associative neural network memories tend to have low storage capacity (Gardner 1987), generate spurious states, or use global learning rules. For polar, binary inputs many of these problems can be overcome with the use of exponential response neurons (Goodman and Chiueh 1988) to control the competition and cooperation between the prototype vectors (Geva and Sitte 1990a). Chiueh and Goodman (1988) first proposed a binary associative memory with exponential neurons and showed that the storage capacity for random memory vectors grows exponentially with the dimension of the vectors. They also described a VLSI implementation of their network (Goodman and Chiueh 1990; Chiueh and Goodman 1991). Exponential response neurons have also been used by Specht (1990) in his probabilistic neural network (PNN). The purpose of this paper is to describe an exponential response network for real valued inputs, to analyze its dynamics through a Lyapunov function, and to show how it can be used for classification, with fast learning built into the network.

Neural Computation 3, 623-632 (1991) © 1991 Massachusetts Institute of Technology

Most of the artificial neural networks proposed in the past are based on neurons that evaluate the dot product of an input vector with their own synaptic weight vector. This dot product is the total input excitation of the neuron, which is then subjected to a nonlinear transfer function to produce the neuron output. When all the input vectors have the same length, as happens when inputs are restricted to binary vectors from $\{-1, 1\}^N$, the dot product is a measure of similarity between the input vector and the weight vector. For the usual $\{0, 1\}^N$ binary vectors and for real valued inputs the dot product is no longer a good measure of similarity. A suitable alternative is the euclidean distance, widely used in traditional pattern recognition as a straightforward measure of similarity. The euclidean distance excitation requires a more complicated neuron, but it has definite advantages, particularly for classifiers.

Some of the more interesting artificial neural networks require that only one of all the neurons that sense the common inputs produces a strong output. Usually it is the neuron with the synaptic weight vector most similar to the input. Different methods have been used to achieve this. The most common one is the winner-take-all arrangement used in Grossberg and Carpenter's ART network, the Hamming network, and Kohonen's self-organizing feature maps. The winner-take-all arrangement relies on competitive, mutually inhibitory interaction between the neurons. The memory layer neurons of a two layer network, with exponential response neurons, exhibit an equivalent behavior.

In what follows we first describe the networks and the action of the different layers. Then we analyze their dynamics with the help of Lyapunov functions and briefly discuss the associative memory networks. Finally we examine the adaptive training of the classifier networks and show how it can be built into the network for on line, parallel training.

2 The Network
There are three layers in the network (Fig. 1), which is an extension of the model we described previously (Geva and Sitte 1990). The neurons in the input layer correspond to the N components of the input vector x. Each neuron in the memory layer corresponds to one of the memorized prototype vectors, denoted $M^\mu$. Each neuron at the input layer is connected to every neuron in the memory layer, and the memory layer feeds back to the input layer. The memory layer neurons also feed forward to each of the neurons in the output layer. The output layer neurons correspond to the n components of the output vector y. No connections exist between neurons in the same layer.

Figure 1: Network structure. With feedback, the output from the memory layer is scaled by the input layer. The output from the memory layer can be encoded by a subsequent output layer.

2.1 The Memory Layer. The output x of the input layer produces an excitation of the m memory neurons. The memory neuron excitation $e_\mu(M^\mu, x)$ measures the similarity between the input vector x and the prototype vector $M^\mu$, and is not necessarily restricted to the traditional scalar product. The output $v_\mu$ of each memory neuron is defined as an exponential gain function of the excitation.
2.2 The Input Layer. The memory neurons feed back to the input layer producing excitation of the input neurons:

$$h_i(x) = \sum_{\mu=1}^{m} v_\mu(x)\, M_i^\mu, \qquad i = 1, \ldots, N \tag{2.2}$$

The excitation vector of the input layer is a linear combination of the memorized prototype vectors:

$$h(x) = \sum_{\mu} M^\mu v_\mu = M v(x) \tag{2.3}$$

where M is the matrix whose columns are the prototype vectors.

The output of the input layer neurons is the scaled excitation:

$$x' = \frac{h(x)}{\sum_{\mu} v_\mu} \tag{2.4}$$

The scaling can be implemented in the form of gain control. The output from an extra linear neuron in the input layer, which sums all the outputs of the memory layer, controls the gain of all the input neurons. The change of the input vector in one iteration is

$$\Delta x = x' - x \tag{2.5}$$

$$\Delta x = \frac{1}{\sum_{\mu} v_\mu} \sum_{\mu} v_\mu \,(M^\mu - x) \tag{2.6}$$
2.3 The Output Layer. The memory layer also feeds forward to the output layer through the synapses. The output layer is nothing more than an encoder for the memory layer output. The new state of neurons at the output layer is given by

$$y_i = \mathrm{sgn}\Big(\sum_{\mu} C_i^\mu v_\mu\Big), \qquad i = 1, \ldots, n \tag{2.7}$$

where $C^\mu$ is the binary class vector associated with the prototype $M^\mu$, and sgn(·) is the sign function. The process performed by the feedforward part of the network can be summarized in vector notation:

$$y(x) = \mathrm{sgn}[C\, v(x)] \tag{2.8}$$

where C is the matrix whose columns are the class vectors the system was trained to recognize. The excitation of the output layer is the squashed linear combination of the memorized class vectors. But because, as we shall prove below, one of the $v_\mu$ tends to become much larger than the rest, y is just one of the binary class vectors.

3 Network Dynamics
The network has three distinct modes of operation: associative memory mode, classification mode, and training mode. When operating as an associative memory, the feedback loop is enabled, and the network performs autoassociation from an input state to a new state at the input layer, and heteroassociation, from an input state to a new state at the output layer. In classification mode the network is similar to a feedforward perceptron, having neurons with an exponential transfer function at the hidden layer. Input states produce a binary coded class vector at the output layer. In training mode feedback loops are enabled, and the network can implement several supervised learning algorithms.
3.1 The Stable Attractor Dynamics. When feedback is enabled the network evolves in parallel from an initial input state, through the interaction between the input layer and the memory layer. We now look at two distinct excitation functions, e(x, M), that may be used, provide the Lyapunov functions for the network, and show that it will progress along a gradient descent trajectory toward a stable state.
3.1.1 Euclidean Excitation. We define the excitation of each memory layer neuron as the squared euclidean distance between $M^\mu$ and the input vector x:

$$e_\mu = \| M^\mu - x \|^2 \tag{3.1}$$

The output from the neurons in the memory layer takes the form of a Boltzmann distribution function:

$$v_\mu = e^{-e_\mu / 2k} \tag{3.2}$$

A distribution function

$$w_n = A\, e^{-E_n / kT} \tag{3.3}$$

leads to a free energy

$$F = -kT \ln \sum_{n} w_n = -kT \ln Z \tag{3.4}$$

where Z is the partition function. In analogy we introduce the energy function for the system:

$$E = -k \ln \sum_{\mu} v_\mu \tag{3.5}$$

The excitation $e_\mu(M^\mu, x)$ takes the role of the discrete energy levels $E_n$. Each of the memory neurons represents a quantum state associated with a prototype vector $M^\mu$. The input vector x acts as a continuous parameter for the system. The partition function for the quantum system is

$$Z = \sum_{\mu} v_\mu = \sum_{\mu} e^{-e_\mu / 2k} \tag{3.6}$$

E has the property that

$$\frac{\partial E}{\partial x_i} = -\frac{1}{Z} \sum_{\mu} v_\mu \,(M_i^\mu - x_i) \tag{3.7}$$

and

$$\nabla E = -\frac{1}{Z} \sum_{\mu} v_\mu \,(M^\mu - x) \tag{3.8}$$

Comparison with the change of the input state 2.6 shows that the network's evolution dynamics is that of gradient descent on the energy surface:

$$\Delta x = -\nabla E \tag{3.9}$$

and therefore the energy defined in 3.5 is a Lyapunov function for the network.
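The relation between the update and the energy gradient can be checked numerically. The sketch below is a sanity check under stated assumptions rather than part of the paper: it takes the squared euclidean excitation with the exponential output of equations 3.1 and 3.2, compares the one-step change of equation 2.5 with a finite-difference estimate of the negative gradient of the energy 3.5, and records the energy along a short trajectory. The random prototypes, the seed, the value of k, and all function names are arbitrary illustrative choices.

```python
import numpy as np


def memory_outputs(M, x, k):
    """Outputs v_mu = exp(-||M^mu - x||^2 / 2k); prototypes are columns of M."""
    e = np.sum((M - x[:, None]) ** 2, axis=0)
    return np.exp(-e / (2.0 * k))


def energy(M, x, k):
    """Energy E = -k ln(sum_mu v_mu), as in equation 3.5."""
    return -k * np.log(memory_outputs(M, x, k).sum())


def delta_x(M, x, k):
    """Change of the input state in one feedback iteration (equation 2.5)."""
    v = memory_outputs(M, x, k)
    return (M @ v) / v.sum() - x


def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad


rng = np.random.default_rng(0)      # arbitrary illustrative data
M = rng.normal(size=(4, 5))         # five prototypes in four dimensions
x = rng.normal(size=4)
k = 0.5

grad = numerical_gradient(lambda z: energy(M, z, k), x)
gradient_matches = np.allclose(delta_x(M, x, k), -grad, atol=1e-6)

# Follow the iteration and record the energy at every step.
energies = [energy(M, x, k)]
for _ in range(50):
    x = x + delta_x(M, x, k)
    energies.append(energy(M, x, k))
energy_never_increases = all(b <= a + 1e-12 for a, b in zip(energies, energies[1:]))
print(gradient_matches, energy_never_increases)
```

The second check reflects the Lyapunov property: the recorded energy should not increase from one iteration to the next.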
3.1.2 Dot-Product Excitation. The dot-product may replace the euclidean distance as the memory layer neuron excitation:

$$e_\mu = 2\,(M^\mu)^T x \tag{3.10}$$

resulting in an output from the memory neurons:

$$v_\mu = e^{e_\mu / 2k} \tag{3.11}$$

The dynamics is still gradient descent, but for a slightly modified energy function:

$$E = \tfrac{1}{2}\, x^T x - k \ln \sum_{\mu} v_\mu \tag{3.12}$$

When the input vectors are normalized then this is essentially the same energy function (up to a constant) as for the euclidean metric.

3.2 The Energy Surface. The number of minima in the energy surface depends on the value of the scale parameter k in 3.2 and 3.11. For large values of k, all the memory layer neurons will produce comparable outputs (close to a value of 1.0). The state of the input layer, as calculated in 2.4, will not be weighted toward any of the prototype vectors, regardless of proximity to the input state. The system will have a single energy minimum, corresponding to the average of all the prototypes. As k is reduced, the resolution of the energy function will increase. The individual excitations of the neurons will start to diverge considerably, depending on the proximity of the corresponding prototypes to the input state. As k is gradually reduced, local minima will appear that correspond to different clusters of prototypes, having an intracluster distance significantly smaller than the intercluster distance. Eventually, at a sufficiently small value of k, separate energy minima will appear for each of the stored prototypes (Geva et al. 1991). The parameter k is a resolution gain factor that determines the amount of generalization performed by the network. It may be useful in cases where one wishes to obtain different classifications at various levels of resolution. If one wishes to guarantee the stability of all the prototypes, k has to be chosen small enough with respect to the smallest allowable distance between individual prototypes. By tuning the resolution one can control the amount of cooperation and competition between the prototypes, to guarantee the stability of any number of prototypes as individual attractors. Our computer simulations confirm this.
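The role of k as a resolution parameter can be illustrated with a toy experiment. The sketch below is not the authors' simulation: it assumes the euclidean excitation of Section 3.1.1, uses four hypothetical one-dimensional prototypes arranged in two tight pairs, relaxes the network from each prototype, and counts how many distinct attractors are reached. The names `relax` and `count_attractors`, the tolerances, and the three values of k are illustrative assumptions.

```python
import numpy as np


def relax(M, x, k, tol=1e-12, max_iter=10_000):
    """Iterate the feedback loop from x until the state stops changing."""
    for _ in range(max_iter):
        e = np.sum((M - x[:, None]) ** 2, axis=0)
        v = np.exp(-(e - e.min()) / (2.0 * k))   # common factor cancels below
        x_new = (M @ v) / v.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def count_attractors(M, k, distinct_tol=1e-6):
    """Relax from every prototype and count the distinct end states."""
    attractors = []
    for mu in range(M.shape[1]):
        x_star = relax(M, M[:, mu].copy(), k)
        if all(np.linalg.norm(x_star - a) > distinct_tol for a in attractors):
            attractors.append(x_star)
    return len(attractors)


# Two tight pairs of prototypes on a line: {0, 0.2} and {3, 3.2}.
M = np.array([[0.0, 0.2, 3.0, 3.2]])
counts = {k: count_attractors(M, k) for k in (100.0, 0.5, 0.001)}
print(counts)
```

For this arrangement a large k merges everything into one attractor near the average, an intermediate k resolves the two clusters, and a small k makes each prototype its own attractor.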
Preliminary results have been reported in Geva and Sitte (1990a).

4 Associative Recall
In associative recall one presents the network with an incomplete or noisy input pattern, and the network responds with the best match, according to some criterion, to one of the prototypes stored in the network. To achieve this our network requires only two layers: the input layer and the memory layer. One neuron from the memory layer is dedicated to each of the prototype vectors. Prototypes can be reconstructed through the relaxation process described earlier. The network has no spurious states, and only prescribed prototype vectors are stable attractors. For heteroassociative recall, and for pattern recognition, the output layer is required. The output layer serves as a decoder of the class associated with the input vector. A binary coded class vector is produced, and may be used as a key in an addressable memory to establish arbitrary associations. The network will do pattern classification and associative recall very well, provided it has been adequately trained.

5 Implementing Learning Algorithms
Learning can be incorporated into the network quite naturally. We will illustrate this with a classifier network using Learning Vector Quantization (LVQ) type algorithms (Kohonen 1990). In particular we use the Decision Surface Mapping algorithm (DSM) (Geva and Sitte 1991). The DSM algorithm is a new member of the LVQ family of algorithms.

5.1 The Adaptive Decision Surface Mapping Algorithm (DSM). The algorithm starts by selecting a small, random subset of prototypes from the training set. The training set is then used to modify the prototypes to reduce gradually the classification error rate over the entire training set. Samples from the training set are cyclically, or randomly, presented for classification. When a training sample is correctly classified, that is, the training sample is of the same class as the nearest prototype, no modifications are applied. When misclassification occurs, modifications take place to apply both punishment and reward. The punishment step takes the nearest neighbor prototype, which is now of the wrong class, and moves it away from the training sample, along the line connecting the two vectors (equation 5.1). The reward step searches for the nearest correct prototype and moves it toward the training sample, along the line connecting the two vectors (equation 5.2). In both steps α is a monotonically decreasing scalar gain factor. The algorithm modifies prototypes only on misclassification, and since errors are more likely to occur with samples near class boundaries, it rearranges prototypes, in pairs, on each side of a class boundary, to correct or at least reduce the magnitude of these errors. We have found that DSM always outperforms LVQ1 when the training set is error free. When a probabilistic error is present in the training set, DSM does not stabilize, and may not converge to an optimal solution. When this is the case, LVQ1, which performs well in this situation, provides a better solution. DSM is then used, with the LVQ1 classifier as a trainer, to produce the same decision surface with fewer prototypes.

5.2 Parallel Learning. The LVQ and DSM algorithms require the presentation of an input pattern to the network, obtaining a classification, and then applying the desired modifications. The problem is that only one prototype, for LVQ, and two prototypes, for DSM, need to be modified at each presentation. Furthermore, the network must decide for itself which are the prototypes that require modification, and it must do so in parallel mode. The process must be simple, in that little external intervention is required. It is desirable to have all the memory layer neurons do exactly the same operations, without having to pinpoint explicitly particular neurons at any stage. This is achieved naturally by the network dynamics that we have already described. The feedback loop allows the network to pick out one neuron in the memory layer, usually the nearest neighbor. This neuron has the highest excitation, and in fact is the only neuron having an excitation that is close to unity, while all other neurons have excitation values that are close to zero. It is now possible to make the modification of the prototypes in the learning equations 5.1 and 5.2 proportional to the excitation of the memory layer neurons, with the direction of the modification being determined by the correctness of the classification:
(5.3)

In this equation, the normalized excitation of memory neuron $\mu$ has a value between 0 and 1. It controls the amount of modification applied to the weights of neuron $\mu$. The function f(c, y) takes on the value +1 if c, the input class, is the same as y, the output class, or the value -1 otherwise. This term determines whether punishment or reward takes place. By making the modification proportional to the excitation, we can allow all weights to be modified in parallel, without having to trigger any set of weights connecting a particular neuron. The term $v_\mu$ has a value close to zero for all but the nearest neighbor, for which it has a value close to 1. This approach is different from the ART networks, where winner-take-all dynamics is utilized to pick one neuron for modification. In summary, the steps for selecting and modifying the weights of the winner are

1. A sample vector is presented at the input layer.
2. The network is allowed to relax to one of its stable attractor states.
3. The weights connecting each neuron at the memory layer to the input layer are modified according to 5.3.
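The excitation-weighted update can be sketched as follows. Equation 5.3 is not legible in this copy, so the update used here, a change of ±α v̂ (x − M) for every prototype with v̂ the normalized memory output, is an assumption consistent with the surrounding description rather than the authors' exact formula. As a further simplification, the sketch evaluates the memory outputs directly at the sample instead of relaxing the network first, relying on a small k to single out the nearest prototype. All names (`normalized_outputs`, `dsm_step`) and the example data are hypothetical.

```python
import numpy as np


def normalized_outputs(M, x, k, active=None):
    """Memory outputs divided by their sum; inactive neurons are inhibited."""
    e = np.sum((M - x[:, None]) ** 2, axis=0)
    if active is None:
        active = np.ones(e.shape, dtype=bool)
    # Measure excitations relative to the best active neuron, so that its
    # output is exactly 1 and the exponential cannot overflow.
    d = np.maximum(e - e[active].min(), 0.0)
    v = np.where(active, np.exp(-d / (2.0 * k)), 0.0)
    return v / v.sum()


def dsm_step(M, labels, x, c, k, alpha):
    """One DSM presentation of sample x with class c (assumed update form)."""
    v_hat = normalized_outputs(M, x, k)
    if labels[np.argmax(v_hat)] == c:
        return M                                   # correct: no modification
    # Punishment (f = -1): the dominant, wrong-class prototype moves away.
    M = M - alpha * v_hat * (x[:, None] - M)
    # Reward (f = +1): inhibit the other classes and present x again.
    v_hat = normalized_outputs(M, x, k, active=(labels == c))
    return M + alpha * v_hat * (x[:, None] - M)


# Three prototypes (columns) with class labels, and a sample of class 1
# that lies closest to the class-0 prototype.
M = np.array([[0.0, 1.0, 4.0],
              [0.0, 0.0, 0.0]])
labels = np.array([0, 1, 1])
x = np.array([0.4, 0.0])
M_new = dsm_step(M, labels, x, c=1, k=0.01, alpha=0.5)
print(M_new)
```

In this example the wrong-class prototype at the origin is pushed away from the sample, the nearest class-1 prototype is pulled toward it, and the distant third prototype is left essentially unchanged, all without explicitly selecting a winner.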
DSM makes no modification when classification is correct. When misclassification occurs, the above steps are taken. Because then the winning prototype is of the wrong class, 5.3 produces punishment. However, a reward step still has to be applied to the nearest prototype of the correct class. This is achieved by the following process: All memory layer neurons not belonging to the class of the sample vector are inhibited, and the original sample vector is presented once again for classification. This time, the neuron that dominates the memory layer excitations is indeed the neuron representing the nearest neighbor of the correct class, and obviously, the output class is correct. Now the application of 5.3 results in the reward of this neuron.

6 Conclusion
Exponential response neurons allow us to construct a network where we can control the cooperation and competition between prototypes. By tuning the resolution parameter k, it is possible to guarantee the stability of all the prototypes, no matter how similar or correlated. By using a decoding layer in the network (the output layer), it is also possible to do pattern classification. By appropriate selection of k we can ensure that the network performs as a nearest neighbor classifier, implementing a variety of adaptive pattern classification algorithms, in parallel mode. The energy function we introduced allows for analytical treatment of the network, and analysis of its dynamics.
References

Chiueh, T. D., and Goodman, R. M. 1988. High-capacity exponential associative memories. Proceedings of the IEEE International Conference on Neural Networks.
Chiueh, T. D., and Goodman, R. M. 1991. Recurrent correlation associative memories. IEEE Transact. Neural Networks 2, 275-284.
Gardner, E. 1987. Maximum storage capacity in neural networks. Europhys. Lett. 4, 481-485.
Geva, S., and Sitte, J. 1990a. Increasing the storage capacity and shaping the basins of attraction of neural nets. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 257-260. Elsevier Science Publishers, Amsterdam.
Geva, S., and Sitte, J. 1990b. A pseudo-inverse neural net with storage capacity exceeding N. Proceedings of the International Joint Conference on Neural Networks, San Diego, June, Vol. I, pp. 783-786.
Geva, S., and Sitte, J. 1991. Adaptive nearest neighbour pattern classification. IEEE Transact. Neural Networks 2, 318-322.
Geva, S., Sitte, J., and Finn, G. 1991. Pattern classification by an exponential response neural net. Hawaii International Conference of System Sciences, Hawaii.
Goodman, R. M., and Chiueh, T. D. 1990. VLSI implementation of a neural associative memory and its application to vector quantization. Proceedings of the International Neural Network Conference, Paris, pp. 635-638.
Kohonen, T. 1990. Statistical pattern recognition revisited. In Advanced Neural Computers, R. Eckmiller, ed., pp. 137-144. North-Holland, Amsterdam.
Specht, D. F. 1990. Probabilistic neural networks. Neural Networks 3, 109-118.
Received 6 February 1991; accepted 3 May 1991.
Errata
Due to a software bug in the final printing of Volume 3, Number 3, the letter Ω was omitted in two papers:

In "The Transition to Perfect Generalization in Perceptrons," by E. Baum and Y.-D. Lyuu, pages 386 to 401, wherever $-(\sqrt{n})$ appears in an exponent, it should read $-\Omega(\sqrt{n})$. For example, Theorem 4.1 should read: $A = e^{-\Omega(\sqrt{n})}$ for $m \ge 2.0821n$. An Ω was also omitted from the exponents in the third to last line of the abstract, the fourth to last line of Section 1 on page 387, the line above definition 2.5 on page 388, the line below the statement of Theorem 4.1 on page 394, line five above equation 4.1 on page 395, the fifth line on page 396, and in equations 4.4 and 4.5.

In "Simulations of a Reconstructed Cerebellar Purkinje Cell Based on Simplified Channel Kinetics," by P. Bush and T. J. Sejnowski, the top line of Table 1 on page 322 should appear as follows: $C_m = 1\ \mu\mathrm{F/cm^2}$, $r_a = 225\ \Omega\text{-cm}$
Index

Volume 3 By Author

Abbott, L. F., Marder, E., and Hooper, S. L. Oscillating Networks: Control of Burst Duration by Electrically Coupled Neurons (Letter) 3(4):487-497
Ajjanagadde, V. and Shastri, L. Rules and Variables in Neural Nets (Letter) 3(1):121-134
Al-Mashouq, K. A. and Reed, I. S. Including Hints in Training Neural Nets (Letter) 3(3):418-427
Apolloni, B. and de Falco, D. Learning by Asymmetric Parallel Boltzmann Machines (Letter) 3(3):402-408
Babloyantz, A. - See Destexhe, A.
Back, A. D. and Tsoi, A. C. FIR and IIR Synapses, a New Neural Network Architecture for Time Series Modeling (Letter) 3(3):375-385
Baldi, P. and Chauvin, Y. Temporal Evolution of Generalization during Learning in Linear Networks (Letter) 3(4):589-603
Baldi, P. and Pineda, F. Contrastive Learning and Neural Oscillations (Letter) 3(4):526-545
Barton, S. A. A Matrix Method for Optimizing a Neural Network (Letter) 3(3):450-459
Baum, E. B. and Lyuu, Y.-D. The Transition to Perfect Generalization in Perceptrons (Letter) 3(3):386-401
Bishop, C. Improving the Generalization Properties of Radial Basis Function Neural Networks (Letter) 3(4):579-588
Blake, R. - See Lehky, S. R.
Bower, J. M. - See Wilson, M. A.
Bush, P. C. and Douglas, R. J. Synchronization of Bursting Action Potential Discharge in a Model Network of Neocortical Neurons (Letter) 3(1):19-30
Bush, P. C. and Sejnowski, T. J. Simulations of a Reconstructed Cerebellar Purkinje Cell Based on Simplified Channel Kinetics (Letter) 3(3):321-332
Chauvin, Y. - See Baldi, P.
de Falco, D. - See Apolloni, B.
Destexhe, A. and Babloyantz, A. Pacemaker-Induced Coherence in Cortical Networks (Letter) 3(2):145-154
Domany, E. - See Nabutovsky, D.
Douglas, R. J. - See Bush, P. C.
Flower, B. - See Jabri, M.
Földiák, P. Learning Invariance from Transformation Sequences (Letter) 3(2):194-200
Gallant, S. I. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks (View) 3(3):293-309
Geva, S. and Sitte, J. An Exponential Response Neural Net (Letter) 3(4):623-632
Hancock, P. J. B., Smith, L. S., and Phillips, W. A. A Biologically Supported Error-Correcting Learning Rule (Letter) 3(2):201-212
Hartman, E. and Keeler, J. D. Predicting the Future: Advantages of Semilocal Units (Letter) 3(4):566-578
Hinton, G. E. - See Jacobs, R. A.
Holmes, G. - See Veitch, A. C.
Hooper, S. L. - See Abbott, L. F.
Horn, D., Sagi, D. and Usher, M. Segmentation, Binding, and Illusory Conjunctions (Letter) 3(4):510-525
Horn, D. and Usher, M. Parallel Activation of Memories in an Oscillatory Neural Network (Letter) 3(1):31-43
Jabri, M. and Flower, B. Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks (Letter) 3(4):546-565
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive Mixtures of Local Experts (Letter) 3(1):79-87
Jain, A. N. Parsing Complex Sentences with Structured Connectionist Networks (Letter) 3(1):110-120
Jong, T.-L. - See Wang, J.-H.
Jordan, M. I. - See Jacobs, R. A.
Kawabata, T. Generalization Effects of k-Neighbor Interpolation Training (Letter) 3(3):409-417
Keeler, J. D. - See Hartman, E.
Koh, B.-Y. - See Lee, H.-J.
König, P. and Schillen, T. B. Stimulus-Dependent Assembly Formation of Oscillatory Responses: I. Synchronization (Letter) 3(2):155-166
König, P. - See Schillen, T. B.
Konishi, M. Deciphering the Brain's Codes (Review) 3(1):1-18
Krile, T. F. - See Wang, J.-H.
Kůrková, V. Kolmogorov's Theorem Is Relevant (Letter) 3(4):617-622
Lee, H.-J., Lee, S.-Y., Shin, S.-Y., and Koh, B.-Y. TAG: A Neural Network Model for Large-Scale Optical Implementation (Letter) 3(1):135-143
Lee, S.-Y. - See Lee, H.-J.
Lee, Y. Handwritten Digit Recognition Using K Nearest-Neighbor, Radial-Basis Function, and Backpropagation Neural Networks (Letter) 3(3):440-449
Lehky, S. R. and Blake, R. Organization of Binocular Pathways: Modeling and Data Related to Rivalry (Letter) 3(1):44-53
Lemmon, M. 2-Degree-of-freedom Robot Path Planning using Cooperative Neural Fields (Letter) 3(3):350-362
Lippmann, R. P. - See Richard, M. D.
Luo, Z.-Q. On the Convergence of the LMS Algorithm with Adaptive Learning Rate for Linear Feedforward Networks (Letter) 3(2):226-245
Lyuu, Y.-D. - See Baum, E. B.
Mani, G. Lowering Variance of Decisions by Using Artificial Neural Network Portfolios (Note) 3(4):483-486
Marder, E. - See Abbott, L. F.
Martin, G. L. and Pittman, J. A. Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning (Letter) 3(2):258-267
Mitchison, G. Removing Time Variation with the Anti-Hebbian Differential Synapse (Letter) 3(3):312-320
Nabutovsky, D. and Domany, E. Learning the Unlearnable (Letter) 3(4):604-616
Nowlan, S. J. - See Jacobs, R. A.
Ogmen, H. On the Mechanisms Underlying Directional Selectivity (Letter) 3(3):333-349
Park, J. and Sandberg, I. W. Universal Approximation Using Radial-Basis-Function Networks (Letter) 3(2):246-257
Phillips, W. A. - See Hancock, P. J. B.
Pineda, F. - See Baldi, P.
Pinkas, G. Symmetric Neural Networks and Propositional Logic Satisfiability (Letter) 3(2):282-291
Pittman, J. A. - See Martin, G. L.
Platt, J. A Resource-Allocating Network for Function Interpolation (Letter) 3(2):213-225
Pomerleau, D. A. Efficient Training of Artificial Neural Networks for Autonomous Navigation (Letter) 3(1):88-97
Richard, M. D. and Lippmann, R. P. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities (Review) 3(4):461-483
Reed, I. S. - See Al-Mashouq, K. A.
Sagi, D. - See Horn, D.
Sandberg, I. W. - See Park, J.
Sanger, T. D. A Tree-Structured Algorithm for Reducing Computation in Networks with Separable Basis Functions (Letter) 3(1):67-78
Schillen, T. B. - See König, P.
Schillen, T. B. and König, P. Stimulus-Dependent Assembly Formation of Oscillatory Responses: II. Desynchronization (Letter) 3(2):167-178
Sejnowski, T. J. - See Bush, P. C.
Shastri, L. - See Ajjanagadde, V.
Shin, S.-Y. - See Lee, H.-J.
Simić, P. D. Constrained Nets for Graph Matching and Other Quadratic Assignment Problems (Letter) 3(2):268-281
Simmen, M. W. Parameter Sensitivity of the Elastic Net Approach to the Traveling Salesman Problem (Letter) 3(3):363-374
Sitte, J. - See Geva, S.
Smith, L. S. - See Hancock, P. J. B.
Touretzky, D. S. and Wheeler, D. W. Sequence Manipulation Using Parallel Mapping Networks (Letter) 3(1):98-109
Tsoi, A. C. - See Back, A. D.
Usher, M. - See Horn, D.
Veitch, A. C. and Holmes, G. A Modified Quickprop Algorithm (Note) 3(3):310-311
Walkup, J. F. - See Wang, J.-H.
Wang, J.-H., Krile, T. F., Walkup, J. F., and Jong, T.-L. On the Characteristics of the Autoassociative Memory with Nonzero-Diagonal Terms in the Memory Matrix (Letter) 3(3):428-439
Wheeler, D. W. - See Touretzky, D. S.
Wilson, M. A. and Bower, J. M. A Computer Simulation of Oscillatory Behavior in Primary Visual Cortex (Letter) 3(4):498-509
Zhang, J. Dynamics and Formulation of Self-organizing Maps (Letter) 3(1):54-66
Zipser, D. Recurrent Network Model of the Neural Mechanism of Short-Term Active Memory (Letter) 3(2):179-193