Review of Neural Networks for Speech Recognition

Richard P. Lippmann*
MIT Lincoln Laboratory, Lexington, MA 02173, USA
The performance of current speech recognition systems is far below that of humans. Neural nets offer the potential of providing massive parallelism, adaptation, and new algorithmic approaches to problems in speech recognition. Initial studies have demonstrated that multilayer networks with time delays can provide excellent discrimination between small sets of pre-segmented, difficult-to-discriminate words, consonants, and vowels. Performance for these small vocabularies has often exceeded that of more conventional approaches. Physiological front ends have provided improved recognition accuracy in noise, and a cochlea filter-bank that could be used in these front ends has been implemented using micro-power analog VLSI techniques. Techniques have been developed to scale networks up in size to handle larger vocabularies, to reduce training time, and to train nets with recurrent connections. Multilayer perceptron classifiers are being integrated into conventional continuous-speech recognizers. Neural net architectures have been developed to perform the computations required by vector quantizers, static pattern classifiers, and the Viterbi decoding algorithm. Further work is necessary for large-vocabulary continuous-speech problems, to develop training algorithms that progressively build internal word models, and to develop compact VLSI neural net hardware.

1 State of the Art for Speech Recognition
Speech is the most natural form of human communication. Compact implementations of accurate, real-time speech recognizers would find widespread use in many applications including automatic transcription, simplified man-machine communication, and aids for the hearing impaired and physically disabled. Unfortunately, current speech recognizers perform poorly on talker-independent continuous-speech recognition tasks that people perform without apparent difficulty. Although children learn to understand speech with little explicit supervision and adults take speech recognition ability for granted, it has proved to be a difficult task to duplicate with machines.

*This work was sponsored by the Department of the Air Force. The views expressed are those of the author and do not reflect the official policy or position of the U.S. Government.
Neural Computation 1, 1-38 (1989)
© 1989 Massachusetts Institute of Technology
As noted by Klatt (1986), this is due to variability and overlap of information in the acoustic signal, to the need for high computation rates (a human-like system must match inputs to 50,000 words in real time), to the multiplicity of analyses that must be performed (phonetic, phonemic, syntactic, semantic, and pragmatic), and to the lack of any comprehensive theory of speech recognition.

The best existing speech recognizers perform well only in artificially constrained tasks. Performance is generally better when training data is provided for each talker, when words are spoken in isolation, when the vocabulary size is small, and when restrictive language models are used to constrain allowable word sequences. For example, talker-dependent isolated-word recognizers can be trained to recognize 105 words with 99% accuracy (Paul 1987). Large-vocabulary talker-dependent word recognition accuracy with sentence context can be as high as 95% for 20,000 words from sentences in office memos spoken with pauses between words (Averbuch et al. 1987). Accuracy for a difficult 997-word talker-independent continuous-speech task using a strong language model (an average of only 20 different words possible after any other word) can be as high as 96% (Lee and Hon 1988). This word accuracy score translates to an unacceptable sentence accuracy of roughly 50%. In addition, the word accuracy of this high-performance recognizer when tested with no grammar model is typically below 70% correct. Results such as these illustrate the poor low-level acoustic-phonetic matching provided by current recognizers. These recognizers depend heavily on constraining grammars to achieve good performance. Humans do not suffer from this problem. We can recognize clearly spoken but contextually inappropriate words in anomalous sentences such as "John drank the guitar" almost perfectly (Marslen-Wilson 1987).

The current best performing speech recognition algorithms use Hidden Markov Model (HMM) techniques. Good introductions to these techniques and to digital signal processing of speech are available in (Lee and Hon 1988; Parsons 1986; Rabiner and Juang 1986; Rabiner and Schafer 1978). The HMM approach provides a framework which includes an efficient decoding algorithm for use in recognition (the Viterbi algorithm) and an automatic supervised training algorithm (the forward-backward algorithm). New neural-net approaches to speech recognition must have the potential to overcome the limitations of current HMM systems. These limitations include poor low-level and poor high-level modeling. Poor low-level acoustic-phonetic modeling leads to confusions between acoustically similar words, while poor high-level speech understanding or semantic modeling restricts applications to simple situations where finite state or probabilistic grammars are acceptable. In addition, the first-order Markov assumption makes it difficult to model coarticulation directly, and HMM training algorithms cannot currently learn the topological structure of word and sub-word models. Finally, HMM theory does not specify the structure of implementation hardware. It is likely that the high computation and memory requirements of current algorithms will require new approaches to parallel hardware design to produce compact, large-vocabulary, continuous-speech recognizers.
Figure 1: Block diagram of an isolated word recognizer. (Speech input is converted by signal processing into spectral patterns, roughly 100 per second; these are matched against stored word models, which contain patterns and sequences, and pattern sequence classification produces word scores and the selected word.)

2 The Potential of Neural Nets
Neural nets for speech recognition have been explored as part of the recent resurgence of interest in this area. Research has focused on evaluating new neural net pattern classification and training algorithms using real speech data and on determining whether parallel neural net architectures can be designed which perform the computations required by important speech recognition algorithms. Most work has focused on isolated-word recognition. A block diagram of a simple isolated word recognizer is shown in figure 1. Speech is input to this recognizer and a word classification decision is output on the right. Three major operations are required. First, a preprocessor must extract important information from the speech waveform. In most recognizers, an input pattern containing spectral information from a frame of speech is extracted every 10 msec using Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC) techniques (Parsons 1986; Rabiner and Schafer 1978). Second, input patterns from the preprocessor must be compared to stored exemplar patterns in word models to compute local frame-to-frame distances. Local distances are used in a third step to time align input pattern sequences to stored exemplar pattern sequences that form word models and arrive at whole-word matching scores. Time alignment compensates for variations in talking rate and pronunciation. Once these operations have been performed, the selected word to output is that word with the highest whole-word matching score.

This paper reviews research on complete neural net recognizers and on neural nets that perform the above three operations. Auditory preprocessors that attempt to mimic cochlea and auditory nerve processing are first reviewed. Neural net structures that can compute local distance scores are then described. Classification results obtained using static speech patterns as inputs are then followed by results obtained with dynamic nets that allow continuous-time inputs. Techniques to integrate neural net and conventional approaches are then described, followed by a brief review of psychological and physiological models of temporal pattern sequence recognition. The paper ends with a summary and suggestions for future research. Emphasis throughout is placed on studies that used large public-domain speech data bases or that first presented new approaches.
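As a concrete illustration of the second and third operations, the sketch below (not from the original paper) time aligns an input frame sequence to stored exemplar sequences using dynamic time warping. Euclidean local distances and random "word models" are illustrative assumptions; because the local measure here is a distance rather than a score, the best match is the word with the lowest cumulative value.

```python
import numpy as np

def dtw_score(frames, template):
    """Align an input frame sequence to a stored exemplar sequence and
    return the cumulative whole-word matching distance."""
    T, R = len(frames), len(template)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for r in range(1, R + 1):
            local = np.linalg.norm(frames[t - 1] - template[r - 1])
            # Best of the three allowed predecessor paths; this is what
            # compensates for variations in talking rate.
            D[t, r] = local + min(D[t - 1, r], D[t, r - 1], D[t - 1, r - 1])
    return D[T, R]

rng = np.random.default_rng(0)
word_models = {w: rng.normal(size=(20, 16)) for w in ("yes", "no")}  # toy models
frames = rng.normal(size=(25, 16))      # one 16-channel pattern per 10 msec
print(min(word_models, key=lambda w: dtw_score(frames, word_models[w])))
```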
3 Auditory Preprocessors
A preprocessor extracts important parameters from the speech waveform to compress the amount of data that must be processed at higher levels and to provide some invariance to changes in noise, talkers, and the acoustic environment. Most conventional preprocessors are only loosely modeled on the cochlea and perform simple types of filtering and data compression motivated by Fourier analysis and information theory. Recent physiological studies of cochlea and auditory nerve responses to complex stimuli have led to more complex physiological preprocessors designed to closely mimic many aspects of auditory nerve response characteristics. Five of these preprocessors and the VLSI cochlea filter listed in table 1 are reviewed in this section. Good reviews of many of these preprocessors and of response properties of the cochlea and auditory nerve can be found in (Greenberg 1988a; 1988b).

The five preprocessors in table 1 rely on periodicity or synchrony information in filter-bank outputs. Synchrony information is related to the short-term phase of a speech signal and can be obtained from the arrival times of nerve spikes on the auditory nerve. It could increase recognition performance by supplementing the spectral magnitude information used in current recognizers. Synchrony information is typically obtained by filtering the speech input using sharp bandpass filters with characteristics similar to those of the mechanical filters in the cochlea. The resulting filtered waveforms are then processed using various types of time domain analyses that could be performed using analog neural net circuitry.
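A minimal sketch of one such time-domain analysis, loosely following the threshold-crossing interval histograms of Ghitza (1988): one bandpass channel is thresholded and the reciprocals of the inter-crossing intervals are histogrammed, so dominant frequencies stand out. The sampling rate, filter order, threshold, and test signal are all illustrative assumptions, not details from the cited work.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 10000                                  # sampling rate (assumed)
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 500 * t) + 0.3 * rng.normal(size=fs)  # toy input

def crossing_intervals(x, lo, hi, level=0.1):
    """Intervals between upward threshold crossings of one bandpass channel."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = lfilter(b, a, x)
    up = np.flatnonzero((y[:-1] < level) & (y[1:] >= level))
    return np.diff(up) / fs                 # seconds between crossings

intervals = crossing_intervals(speech, 300, 700)
hist, edges = np.histogram(1.0 / intervals, bins=20, range=(100, 1000))
print("histogram peak near", edges[np.argmax(hist)], "Hz")  # close to 500 Hz
```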
Study | Processing | Comments
Deng and Geisler (1987) | Cross-channel correlation of neural outputs | Physiologically plausible (untested for speech recognition)
Ghitza (1988) | Histogram of time intervals between threshold crossings of filter outputs | Improved speech recognition in noise
Hunt and Lefebvre (1988) | Periodicity and onset detection | Improved speech recognition in noise and with spectral tilt
Lyon and Mead (1988) | Tapped transmission line filter with 49 outputs | Implemented using micropower VLSI techniques
Seneff (1988) | Periodicity and spectral magnitude outputs | Synchrony spectrograms provide enhanced spectral resolution (untested for speech recognition)
Shamma (1988) | Lateral inhibition across cochlea filter outputs | Physiologically plausible (untested for speech recognition)

Table 1: Recent Physiological Preprocessors.

Spectrograms created using physiological preprocessors for steady-state vowels and other speech sounds illustrate an improvement in the ability to visually identify vowel formants (resonant frequencies of the vocal tract) in noise (Deng and Geisler 1987; Ghitza 1988; Seneff 1988; Shamma 1988). Comparisons to more conventional front ends using existing speech recognizers have been performed by Beet (Beet et al. 1988), by Ghitza (1988), and by Hunt and Lefebvre (1988). These comparisons demonstrated significant performance improvements in noise (Ghitza 1988; Hunt and Lefebvre 1988) and with filtering that tilts the input spectrum up at high frequencies (Hunt and Lefebvre 1988).
Extensive comparisons have not, however, been made between physiological preprocessors and conventional preprocessors when the conventional preprocessors incorporate current noise and stress compensation techniques. Positive results from such comparisons and more detailed theoretical analyses would do much to foster the acceptance of these new and computationally intensive front ends.

Lyon and Mead (1988) describe a filter bank that could be used in a physiological preprocessor. This filter bank was carefully modeled after the cochlea, provides 49 analog outputs, and has been implemented using micropower analog VLSI CMOS processing. Extra circuitry would be required to provide synchrony or spectral magnitude information for a speech recognizer. This recent work demonstrates how preprocessors can be miniaturized using analog VLSI techniques. The success of this approach is beginning to demonstrate that ease of implementation using VLSI techniques may be more important when comparing alternative neural net approaches than computational requirements on serial von Neumann machines.

4 Computing Local Distance Scores
Conventional speech recognizers compute local frame-to-frame distances by comparing each new input pattern (vector of parameters) provided by a preprocessor to stored reference patterns. Neural net architectures can compute local frame-to-frame distances using fine-grain parallelism for both continuous-observation and discrete-observation recognizers. New neural net algorithms can also perform vector quantization and reduce the dimensionality of input patterns.

Local distances for continuous-observation recognizers are functions related to log likelihoods of probability distributions. Simple log likelihood functions such as those required for independent Gaussian or binomial distributions can be calculated directly, without training, using single-layer nets with threshold-logic nonlinearities (Lippmann 1987; Lippmann et al. 1987). More complex likelihood functions can be computed using multilayer perceptrons (Cybenko 1988; Lapedes and Farber 1988; Lippmann et al. 1987), hierarchical nets that compute kernel functions (Albus 1981; Broomhead and Lowe 1988; Hanson and Burr 1987; Huang and Lippmann 1988; Moody 1988; Moody and Darken 1988), or high-order nets (Lee et al. 1986; Rumelhart et al. 1986a). Training to produce these complex functions is typically longest with multilayer perceptrons. These nets, however, often provide architectures with fewer nodes, simpler nodal processing elements, and fewer weights. They also may develop internal abstractions in hidden layers that can be related to meaningful acoustic-phonetic speech characteristics such as formant transitions and that could also be applied to many different speech recognition tasks.
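A minimal sketch of the first point above, assuming class-conditional Gaussians with a shared variance (the assumption is mine, made for brevity): per-class log likelihoods then reduce to a single layer of linear nodes whose weights and biases are fixed functions of the class means, so no training is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
means = rng.normal(size=(3, 12))    # 3 reference classes, 12-dim frames
sigma2 = 0.5                        # shared variance (assumed)

# log N(x; mu, sigma2*I) = (mu/sigma2).x - |mu|^2/(2*sigma2) + c(x);
# c(x) is class-independent, so linear node outputs rank classes correctly.
W = means / sigma2                  # one weight vector per class
b = -np.sum(means**2, axis=1) / (2 * sigma2)

x = rng.normal(size=12)             # one input frame
linear = W @ x + b
direct = -np.sum((x - means)**2, axis=1) / (2 * sigma2)
print(np.argmax(linear) == np.argmax(direct))   # True: same decision
```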
Discrete-observation recognizers first perform vector quantization and label each input with one particular symbol. Symbols are used to calculate local distances via look-up tables that contain symbol probabilities for each reference pattern. The look-up table calculation can be performed by simple single-layer perceptrons. The perceptron for any reference pattern must have as many inputs as there are symbols. Weights must equal symbol probabilities, and all inputs must be equal to zero except for that corresponding to the current input symbol. Alternatively, a multilayer perceptron could be used to store probabilities for symbols that have been seen and interpolate between these probabilities for unseen symbols.

The vector quantization operation can be performed using an architecture similar to that used by Kohonen's feature-map net (Kohonen 1984). Inputs to the feature-map net feed an array of codebook nodes containing one node for each symbol. Components of the Euclidean distance between the input and the reference pattern represented by weights to each node are computed in each node. The codebook node with the smallest Euclidean distance to the input is selected using lateral inhibition or other maximum-picking techniques (Lippmann et al. 1987). This process guarantees that only the one node with the minimum Euclidean distance to the input has a unity output, as required. Weights used in this architecture can be calculated using the feature-map algorithm or any other standard vector quantization algorithm based on Euclidean distances, such as k-means clustering (Duda and Hart 1973).

Kohonen's feature-map vector quantizer is an alternative, sequentially trained neural net algorithm. It has been tested successfully in an experimental speech recognizer (Kohonen 1988; Kohonen et al. 1984) but not evaluated with a large public speech data base. A version with a small number of nodes but including training logic has been implemented in VLSI (Mann et al. 1988). Experiments with a discrete-observation HMM recognizer (Mann et al. 1988) and with a template-based recognizer (Naylor and Li 1988) demonstrated that this algorithm provides performance similar to that provided by conventional clustering procedures such as k-means clustering (Duda and Hart 1973). The feature-map algorithm incrementally trains weights to a two-dimensional grid of nodes such that after training, nodes that are physically close in the grid correspond to input patterns that are close in Euclidean distance. One advantage of this topological organization is that averaging outputs of nodes that are physically close using nodes at higher levels corresponds to a probability smoothing technique often used in speech recognizers called Parzen smoothing (Duda and Hart 1973). This averaging can be performed by nodes with limited fan-in and short connections.
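The two net computations described above, the winner-take-all vector quantizer and the look-up-table perceptron, fit in a few lines. In this sketch the codebook and the probability table are random illustrative stand-ins rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 12))        # 64 codebook nodes, 12-dim frames

def quantize(frame):
    """Index of the nearest codebook node; in the net this winner is
    selected by lateral inhibition or another maximum-picking circuit."""
    return int(np.argmin(np.sum((codebook - frame) ** 2, axis=1)))

symbol = quantize(rng.normal(size=12))
one_hot = np.eye(len(codebook))[symbol]     # the single unity output

# Look-up-table step: the weights hold symbol probabilities for one
# reference pattern, so the perceptron output is just that probability.
symbol_probs = rng.dirichlet(np.ones(64))   # hypothetical probability table
print(symbol, one_hot @ symbol_probs)
```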
The auto-associative multilayer perceptron (Elman and Zipser 1987; Hinton 1987) is a neural net algorithm that reduces the dimensionality of continuous-valued inputs. It is a multilayer perceptron with the same number of input and output nodes and one or more layers of hidden nodes. This net is trained to reproduce the input at the output nodes through a small layer of hidden nodes. Outputs of hidden nodes after training can be used as reduced-dimensional inputs for speech processing, as described in (Elman and Zipser 1987; Fallside et al. 1988). Recent theoretical analyses have demonstrated that auto-associative networks are closely related to a standard statistical technique called principal components analysis (Baldi and Hornik 1989; Bourlard and Kamp 1988). Auto-associative nets are thus not a new analytical tool but instead a technique to perform the processing required by principal components analysis.
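A minimal numerical sketch of that relationship, assuming a linear auto-associative net and synthetic data: rather than training with back-propagation, the sketch constructs the known optimum of the linear bottleneck net from the principal directions and checks that it reproduces the rank-k principal components reconstruction exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 16))  # correlated data
X -= X.mean(axis=0)
k = 2                                  # bottleneck (hidden layer) size

# Optimal linear auto-associator: encode with the top-k principal
# directions and decode with their transpose (Bourlard and Kamp 1988).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W1, W2 = Vt[:k].T, Vt[:k]              # encoder and decoder weights
hidden = X @ W1                        # reduced-dimensional representation
recon = hidden @ W2
print(np.allclose(recon, (U[:, :k] * S[:k]) @ Vt[:k]))  # True: equals PCA
```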
5 Static Classification of Speech Segments

Many neural net classifiers have been applied to the problem of classifying static input patterns formed from a spectral analysis of pre-segmented words, phonemes, and vowels. Table 2 summarizes results of some representative studies. Introductions to many of the classifiers listed in this table and to neural net training algorithms are available in (Cowan and Sharp 1988; Hinton 1987; Lippmann et al. 1987). Unless otherwise noted, error rates in this and other tables refer to talker-dependent training and testing, multilayer perceptrons were trained using back-propagation (Rumelhart et al. 1986a), and systems were trained and tested on different data sets. The number of tokens in this and other tables refers to the total number of speech samples available for both training and testing. The label "multi-talker" refers to results obtained by testing and training using data from the same group of talkers, and the label "talker-independent" refers to results obtained by training using one group of talkers and testing using a separate group with no common members.

Input patterns for the studies in table 2 were applied all at once as one whole static spectrographic (frequency versus time) pattern. The neural nets were static and did not include internal delays or recurrent connections that could take advantage of the temporal nature of the input for real-time processing. This approach might be difficult to incorporate in real-time speech recognizers because it would require long delays to perform segmentation and form the input patterns in an input storage buffer. It would also require accurate pre-segmentation of both testing and training data for good performance. This pre-segmentation was performed by hand in many studies.

Multilayer perceptrons and hierarchical nets such as the feature-map classifier and Kohonen's learning vector quantizer (LVQ) have been used to classify static patterns. Excellent talker-dependent recognition accuracy, near that of experimental HMM and commercial recognizers, has been provided by multilayer perceptrons using small sets of words and digits. Hierarchical nets have provided performance similar to that of multilayer perceptrons but with greatly reduced training times and typically more connection weights and nodes.
Study | Network | Speech Materials | Error Rate
Elman and Zipser (1987) | Multilayer Perceptron (MLP), 16 x 20 inputs | 1 talker, CV's /b,d,g/ x /i,a,u/, 505 tokens | Cons. - 5%, Vowels - 0.5%
Huang and Lippmann (1988) | MLP, Feature Map Classifier (FMC), 2 inputs | 67 talkers, 10 vowels, 671 tokens | Gaussian, FMC, MLP - 20%, FMC trains fastest
Kammerer and Kupper (1988) | MLP, 16 x 16 inputs | 11 talkers, 20 words, 5720 tokens | Talker dep. - 0.4%, Talker indep. - 2.7%
Kohonen (1988) | Learning Vector Quantizer (LVQ), 15 inputs | Labeled Finnish speech, 3010 tokens | Gaussian - 12.9%, kNN - 12.0%, LVQ - 10.9%
Lippmann and Gold (1987) | MLP, 11 x 2 inputs | 16 talkers, 7 digits, 2,912 tokens | Gaussian - 8.7%, kNN - 6%, MLP - 7.6%
Peeling and Moore (1987) | MLP, 19 x 60 inputs | 40 talkers, 10 digits, 16,000 tokens | Talker dep. - 0.3%, Multi-talker - 1.9%

Table 2: Recognition of Speech Patterns Using Static Neural Nets.

5.1 Multilayer Perceptrons. Multilayer perceptron classifiers have been applied to speech problems more often than any other neural net classifier. A simple example from Huang and Lippmann (1988), presented in figure 2, illustrates how these nets can form complex decision regions with speech data. Input data obtained by Peterson and Barney (1952) consisted of the first two formants from vowels spoken by men, women, and children.
Decision regions shown in the right side of figure 2 were formed by the two-layer perceptron with 50 hidden nodes, trained using back-propagation, shown on the left. Training required more than 50,000 trials. Decision region boundaries are near those that are typically drawn by hand to separate vowel regions, and the performance of this net is near that provided by commonly used conventional k-nearest neighbor (kNN) and Gaussian classifiers (Duda and Hart 1973).

A more complex experiment was performed by Elman and Zipser (1987) using spectrographic-like inputs. Input patterns formed from 16 filter-bank outputs sampled 20 times over a time window of 64 msec were fed to nets with one hidden layer and 2 to 6 hidden nodes. The analysis time window was centered by hand on the consonant voicing onset. Networks were trained to recognize consonants or vowels in consonant-vowel (CV) syllables composed of the consonants /b,d,g/ and the vowels /i,a,u/. Error rates were roughly 5% for consonant recognition and 0.5% for vowel recognition. An analysis indicated that hidden nodes often become feature detectors and differentiate between important subsets of sound types such as consonants versus vowels. This study demonstrated the importance of choosing a good data representation for speech and of normalizing speech inputs. It also raised the important question of training time, because many experiments on this small data base required more than 100,000 training trials.
Figure 2: Decision regions formed by a 2-layer perceptron using back-propagation training and vowel formant data. (Inputs are the first and second formants, F1 and F2 in Hz; there is one output node for each of the ten vowels, in the words heed, hid, head, had, hod, hawed, hood, who'd, hud, and heard.)
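The following sketch reproduces the flavor of this experiment with synthetic two-dimensional "formant" clusters standing in for the Peterson-Barney data (the data, sizes, and learning rate are all illustrative assumptions): a two-layer perceptron with 50 hidden nodes is trained by batch back-propagation on a squared-error cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the vowel data: 2-D inputs, 10 classes of Gaussian clusters.
n_class, n_per = 10, 60
centers = rng.uniform(-2, 2, size=(n_class, 2))
X = np.vstack([c + 0.3 * rng.normal(size=(n_per, 2)) for c in centers])
y = np.repeat(np.arange(n_class), n_per)
T = np.eye(n_class)[y]                      # one output node per vowel

n_hidden = 50
W1 = 0.5 * rng.normal(size=(2, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.5 * rng.normal(size=(n_hidden, n_class)); b2 = np.zeros(n_class)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

lr = 0.5 / len(X)
for _ in range(5000):                       # batch back-propagation
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    dO = (O - T) * O * (1 - O)              # squared-error output delta
    dH = (dO @ W2.T) * H * (1 - H)          # delta propagated to hidden layer
    W2 -= lr * H.T @ dO; b2 -= lr * dO.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

pred = np.argmax(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), axis=1)
print("training error rate:", np.mean(pred != y))
```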
Lippmann and Gold (1987) performed another early study to compare multilayer perceptrons and conventional classifiers on a digit classification task. This study was motivated by single-talker results obtained by Burr (1988a). Inputs were 22 cepstral parameters from two speech frames located automatically by finding the maximum-energy frame for each digit. One- to three-layer nets with from 16 to 256 nodes in each hidden layer were evaluated using digits from the Texas Instruments (TI) 20-Word Speech Data Base (Doddington and Schalk 1981). Multilayer perceptron classifiers outperformed a Gaussian but not a kNN classifier. Hidden layers were required for good performance. A single-layer perceptron provided poor performance, much longer training times, and sometimes never converged during training. Most rapid training (less than 1000 trials) was provided by three-layer perceptrons. These results demonstrate that the simple hyperplane decision regions provided by single-layer perceptrons are sometimes not sufficient and that rapid training and good performance can be obtained by tailoring the size of a net for a specific problem. The digit data used in these experiments was also used to test a multilayer perceptron chip implemented in VLSI (Raffel et al. 1987). This chip performed as well as computer simulations when down-loaded with weights from those simulations.

Kammerer and Kupper obtained surprisingly good recognition results for words from the TI 20-word data base (Kammerer and Kupper 1988). A single-layer perceptron with spectrogram-like input patterns performed slightly better than a DTW template-based recognizer. Words were first time normalized to provide 16 input frames with 16 2-bit spectral coefficients per frame. Expanding the training corpus by temporally distorting training tokens reduced the error slightly, and best performance was provided by single- and not multilayer perceptrons. Talker-dependent error rates were 0.4% (14/3520) for the single-layer perceptron and 0.7% (25/3520) for the DTW recognizer. These error rates are better than all but one of the commercial recognizers evaluated in (Doddington and Schalk 1981) and demonstrate good performance for a single-layer perceptron without hidden nodes. Talker-independent performance was evaluated by leaving out the training data for each talker, one at a time, and testing using that talker's test data. Average talker-independent error rates were 2.7% (155/5720) for the single-layer perceptron and 2.5% (145/5720) for the DTW recognizer. Training time was 6 to 25 minutes per talker on an array processor for the talker-dependent studies and 5 to 9 hours for the talker-independent studies.

Peeling and Moore (1987) obtained extremely good recognition results for digit classification. A multilayer perceptron with one hidden layer and 50 hidden nodes provided best performance. Its talker-dependent error rate was low and near that provided by an advanced HMM recognizer. Spectrogram-like input patterns were generated using a 19-channel filter-bank analyzer with 20 msec frames. Nets could accommodate 60 input frames (1.2 seconds), which was enough for the longest duration word. Shorter words were padded with zeros and positioned randomly in the 60-frame input buffer. Nets were trained using different numbers of layers and hidden units and speech data from the RSRE 40-speaker digit data base.
Multi-talker experiments explored performance when recognizers were tested and trained using data from all talkers. Error rates were near zero for talker-dependent experiments (0.25%, 5/2000) and low for multi-talker experiments (1.9%, 78/4000). Error rates for an advanced HMM recognizer under the same conditions were 0.2% (4/2000) and 0.6% (25/4000) respectively. The computation required for recognition using multilayer perceptrons was typically less than one-fifth of that required for the HMM recognizer.

The good small-vocabulary word recognition results obtained by both Kammerer and Kupper (1988) and Peeling and Moore (1987) suggest that back-propagation can develop internal feature detectors to extract important invariant acoustic events. These results must be compared to those of other experiments which attempted to classify digits without time alignment. Burton, Shore, and Buck (Burton et al. 1985; Shore and Burton 1983) demonstrated that talker-dependent error rates using the TI 20-Word Data Base can be as low as 0.3% (8/2560) for digits and 0.8% (40/5120) for all words using simple vector-quantization recognizers that do not perform time alignment. These results suggest that digit recognition is a relatively simple task where dynamic time alignment is not necessary and talker-dependent accuracy remains high even when temporal information is discarded. The good performance of multilayer perceptrons is thus not surprising. These studies and the multilayer perceptron studies do, however, suggest designs for implementing computationally efficient real-time digit and small-vocabulary recognizers using analog neural-net VLSI processing.

5.2 Hierarchical Neural Nets that Compute Kernel Functions. Hierarchical neural net classifiers which use hidden nodes that compute kernel functions have also been used to classify speech patterns. These nets have the advantage of rapid training and the ability to use combined supervised/unsupervised training data. Huang and Lippmann (1988) described a net called a feature-map classifier and evaluated the performance of this net on the vowel data plotted in figure 2 and on difficult artificial problems. A block diagram of the feature-map classifier is shown in figure 3. Intermediate codebook nodes in this net compute kernel functions related to the Euclidean distance between the input and the cluster centers represented by these nodes. The lower feature-map net is first trained without supervision to form a vector quantizer, and the upper perceptron-like layer is then trained with supervision using a modified version of the LMS algorithm. This classifier was compared to the multilayer perceptron shown in figure 2 and to a kNN classifier. All classifiers provided an error rate of roughly 20%. The 2-layer perceptron, however, required more than 50,000 supervised training trials for convergence. The feature-map classifier reduced the amount of supervised training required by three orders of magnitude, to fewer than 50 trials. Similar results were obtained with artificial problems.
Figure 3: Block diagram of the hierarchical feature-map classifier. (Inputs feed a lower feature-map layer formed by unsupervised training; an upper layer trained with supervision produces the output.)
Kohonen and co-workers (Kohonen et al. 1988) compared a neural-net classifier called a learning vector quantizer (LVQ) to Bayesian and kNN classifiers. The structure of the learning vector quantizer is similar to that of the feature-map classifier shown in figure 3. Training differs from that used with the feature-map classifier in that a third stage of supervised training is added which adjusts weights to intermediate codebook nodes when a classification error occurs. Adjustments alter decision region boundaries slightly but maintain the same number of codebook nodes. Bayesian, kNN, and LVQ classifiers were used to classify 15-channel speech spectra manually extracted from stationary regions of Finnish speech waveforms. All classifiers were tested and trained with separate sets of 1550 single-frame patterns that were divided into 18 phoneme classes (Kohonen et al. 1988). A version of the LVQ classifier with 117 codebook nodes provided the lowest error rate of 10.9%, averaging over results where training and testing data sets are interchanged. The Bayesian and kNN classifiers had slightly higher error rates of 12.9% and 12.0% respectively. Training time for the LVQ classifier was roughly 10 minutes on an IBM PC/AT. These results and those of Huang and Lippmann (1988) demonstrate that neural nets that use kernel functions can provide excellent performance on speech tasks using practical amounts of training time.
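A minimal sketch of that third, supervised stage (the basic LVQ1-style rule; the data, codebook size, learning rate, and epoch count are illustrative assumptions): the codebook vector nearest each training pattern is pulled toward the pattern when their labels agree and pushed away when they disagree.

```python
import numpy as np

rng = np.random.default_rng(0)

def lvq_train(X, y, codebook, labels, lr=0.05, epochs=20):
    """Supervised adjustment of codebook vectors (LVQ1-style rule)."""
    for _ in range(epochs):
        for x, t in zip(X, y):
            i = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
            step = lr if labels[i] == t else -lr    # attract or repel winner
            codebook[i] += step * (x - codebook[i])
    return codebook

# Toy problem: 15-channel "spectra" in 3 phoneme classes.
centers = rng.normal(size=(3, 15))
X = np.vstack([c + 0.2 * rng.normal(size=(100, 15)) for c in centers])
y = np.repeat(np.arange(3), 100)
idx = rng.choice(len(X), size=12, replace=False)    # initial codebook nodes
codebook, labels = X[idx].copy(), y[idx]
codebook = lvq_train(X, y, codebook, labels)
pred = labels[np.argmin(((X[:, None] - codebook) ** 2).sum(-1), axis=1)]
print("training error rate:", np.mean(pred != y))
```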
Other experiments on artificial problems described in (Kohonen et al. 1988) illustrate trade-offs in training time. Boltzmann machines provided near-optimal performance on these problems, followed by the LVQ classifier and multilayer perceptrons. Training times were 5 hours on an array processor for the Boltzmann machine, 1 hour on a Masscomp MC 5600 for the multilayer perceptron, and roughly 20 minutes on the Masscomp for the LVQ classifier.

Two recent studies (Niranjan and Fallside 1988; Bridle 1988) have begun to explore a hierarchical net where nodes in a hidden layer compute kernel functions called radial basis functions (Broomhead and Lowe 1988). These nets are similar to previous classifiers that use the method of potential functions (Duda and Hart 1973). They have an advantage over multilayer perceptrons in that once the locations of the kernel functions are established, weights to the output nodes are determined uniquely by solving a least squares problem using matrix-based approaches. Initial results with small amounts of speech data consisting of vowels (Niranjan and Fallside 1988) and words (Bridle 1988) have been encouraging. Further work must explore techniques to assign the locations of kernel functions and adjust the scale factors that determine the range of influence of each kernel function.
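A minimal sketch of that unique-solution property, under illustrative assumptions (synthetic data, kernel centers taken from the training set, a single shared scale factor): once the Gaussian kernels are fixed, the output weights come from one linear least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
protos = rng.normal(size=(3, 15))                        # 3 phoneme classes
X = np.vstack([c + 0.3 * rng.normal(size=(80, 15)) for c in protos])
y = np.repeat(np.arange(3), 80)
T = np.eye(3)[y]                                         # one-of-n targets

kernels = X[rng.choice(len(X), size=24, replace=False)]  # kernel locations
width = 2.0                                              # shared scale factor

def design(A):
    d2 = ((A[:, None, :] - kernels[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))                # radial basis outputs

W, *_ = np.linalg.lstsq(design(X), T, rcond=None)        # unique output weights
pred = (design(X) @ W).argmax(axis=1)
print("training error rate:", np.mean(pred != y))
```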
6 Dynamic Classification of Speech Segments

New dynamic neural net classifiers that incorporate short delays, temporal integration, or recurrent connections have been developed specifically for speech recognition. Spectral inputs for these classifiers are applied to input nodes sequentially, one frame at a time. These classifiers could thus be integrated into real-time speech recognizers more easily than static nets, because accurate pre-segmentation is typically not required for good performance and only short delays are used. Both multilayer nets with delays and nets with recurrent connections have been used to classify acoustically similar words, consonants, and vowels. Excellent performance has been obtained using time-delay nets in many studies, including those by Lang and Hinton (1988) and by Waibel et al. (1987; 1988). Performance for small vocabularies often slightly exceeded that provided by high-performance experimental HMM recognizers. Techniques have also been developed to scale nets up for larger vocabularies and to speed up training times for both feed-forward and recurrent nets. Rapid training has been demonstrated using a hierarchical learning vector quantizer with delays, and good performance, but with extremely long training times, has been provided by Boltzmann machines.
6.1 Time-Delay Multilayer Perceptrons. Some of the most promising neural-net recognition results have been obtained using multilayer perceptrons with delays and some form of temporal integration in output nodes (Lang and Hinton 1988; Waibel et al. 1987; Waibel et al. 1988). Table 3 summarizes results of six representative studies. Early results on consonant and vowel recognition were obtained by Waibel and co-workers (Waibel et al. 1987) using the multilayer perceptron with time delays shown in figure 4.
Study | Network | Speech Materials | Error Rate
Lang and Hinton (1988) | Time Delay MLP, 16 inputs | 100 talkers, "B,D,E,V", 768 tokens | Multi-talker - 7.8%
Unnikrishnan, Hopfield, and Tank (1988) | Time Concentration Net, 32 inputs | 1 talker, digits, 432 tokens | 0.7%
Waibel et al. (1987) | Time Delay MLP, 16 inputs | 3 Japanese talkers, /b,d,g/, many contexts, > 4,000 tokens | /b,d,g/ - 1.5%
Waibel, Sawai, and Shikano (1988) | Time Delay MLP, 16 inputs | 1 Japanese talker, 18 cons., 5 vowels, > 10,000 tokens | /b,d,g,p,t,k/ - 1.4%, 18 cons. - 4.1%, 5 vowels - 1.4%
Watrous (1988) | Temporal Flow Structured MLP, 16 inputs | 1 talker, phonemes, words, > 2,000 tokens | /b,d,g/ - 0.8%, rapid/rabid - 0.8%, /i,a,u/ - 0.0%
McDermott and Katagiri (1988) | Time Delay LVQ, 16 inputs | 3 Japanese talkers, /b,d,g/, > 4,000 tokens | /b,d,g/ - 1.7%

Table 3: Recognition of Speech Using Time-Delay Neural Nets.
The boxes labeled τ in this figure represent fixed delays. Spectral coefficients from 10 msec speech frames (16 per frame) are input on the lower left. The three boxes on the bottom thus represent an input buffer containing a context of three frames. Outputs of the nodes in these boxes (16 x 3 spectral coefficients) feed 8 hidden nodes in the first layer. Outputs from these nodes are buffered across the five boxes in the first hidden layer to form a context of five frames. Outputs from these boxes (8 x 5 node outputs) feed three hidden nodes in the second hidden layer. Outputs from these three nodes are integrated over time in a final output node.

In initial experiments (Waibel et al. 1987), the time-delay net from figure 4 was trained using back-propagation to recognize the voiced stops /b,d,g/. Separate testing and training sets of 2000 voiced stops spoken by three talkers were excised manually from a corpus of 5260 Japanese words. Excised portions sampled the consonants in varying phonetic contexts and contained 15 frames (150 msec) centered by hand around the vowel onset. The neural net classifier provided an error rate of 1.5%, compared to an error rate of 6.5% provided by a simple discrete-observation HMM recognizer. Training the time-delay net took several days on a four-processor Alliant computer.
Figure 4: A time-delay multilayer perceptron. (Spectral coefficients enter at the lower left; a final node integrates evidence over time to produce an output such as "G".)
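The structure just described maps onto a few lines of code. The following sketch (forward pass only, with random untrained weights; the layer sizes follow the description above) shows how each layer pools a short context of delayed outputs from the layer below and how the final node integrates over time.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def tdnn_forward(frames, W1, W2):
    """frames: (T, 16) spectral coefficients, one row per 10-msec frame."""
    T = len(frames)
    # First layer: each unit sees a 3-frame context (16 x 3 inputs -> 8 units).
    h1 = np.array([sigmoid(frames[t:t + 3].ravel() @ W1) for t in range(T - 2)])
    # Second layer: each unit sees 5 buffered first-layer outputs (8 x 5 -> 3).
    h2 = np.array([sigmoid(h1[t:t + 5].ravel() @ W2) for t in range(len(h1) - 4)])
    # Output: integrate the three second-layer nodes over time (one per class).
    return h2.sum(axis=0)

W1 = 0.1 * rng.normal(size=(16 * 3, 8))     # untrained, illustrative weights
W2 = 0.1 * rng.normal(size=(8 * 5, 3))
scores = tdnn_forward(rng.normal(size=(15, 16)), W1, W2)  # one 15-frame token
print("integrated class scores (e.g. /b,d,g/):", scores)
```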
More recent work (Waibel et al. 1988) has led to techniques that merge smaller nets designed to recognize small sets of consonants and vowels into large nets which can recognize all consonants at once. These techniques greatly reduce training time, improve performance, and are a practical approach to the scaling problem. Experiments resulted in low error rates of 1.4% for the consonants /b,d,g,p,t,k/ and 1.4% for the vowels /i,a,u,e,o/. The largest net designed from smaller subnets provided a talker-dependent error rate for one talker of 4.1% for 18 consonants. An advanced discrete-observation HMM recognizer provided an error rate of 7.3% on this task. These two studies demonstrate that good performance can be provided by time-delay nets when the network structure is tailored to a specific problem. They also demonstrate how small nets can be scaled up to solve large classification problems without scaling up training times substantially.

Lang and Hinton (1988) describe an extensive series of experiments that led to a similar high-performance time-delay net. This net was designed to classify the four acoustically similar isolated words "B", "D", "E", and "V" that form the most confusable subset of the spoken alphabet. A multi-talker recognizer for 100 male talkers was first trained and tested using pre-segmented 144 msec speech samples taken from around the vowel onset in these words. A technique called multi-resolution training was developed to shorten training time. This involved training nets with smaller numbers of hidden nodes, splitting weight values to hidden nodes to create larger desired nets, and then re-training the larger nets. A multi-resolution trained net provided an error rate of 8.6%. This result, however, required careful pre-segmentation of each word. Pre-segmentation was not required by another net which allowed continuous speech input and classified the input as that word corresponding to the output node whose output value reached the highest level. Training used simple automatic energy-based segmentation techniques to extract 216 msec of speech from around the vowel onset in each word. This resulted in an error rate of 9.5%. Outputs were then trained to be high and correct for the 216 msec speech segments as before, but also low for counter-example inputs selected randomly from the left-over background noise and vowel segments. Inclusion of counter-examples reduced the error rate to 7.8%. This performance compares favorably with the 11% error rate estimated for an enhanced HMM recognizer on this data base, based on performance with the complete E-set (Bahl et al. 1988; Lang and Hinton 1988).

Watrous (1988) also explored multilayer perceptron classifiers with time delays that extended earlier exploratory work on nets with recurrent connections (Watrous and Shastri 1987). These multilayer nets differed from those described above in that recurrent connections were provided on output nodes, target outputs were Gaussian-shaped pulses, and the delays and network structure were carefully adjusted by hand to extract important speech features for each classification task. Networks were tested using hand-segmented speech and isolated words from one talker. Good discrimination was obtained for many different recognition tasks.
For example, the error rate was 0.8% for the consonants /b,d,g/, 0.8% for the word pair "rapid/rabid," and 0.0% for the vowels /i,a,u/. Watrous has also explored the use of gradient methods of nonlinear optimization to decrease training time (Watrous 1986).

Rossen et al. (1988) recently described another time-delay classifier. It uses more complex input data representations than the time-delay nets described above and a brain-state-in-a-box neural net classifier to integrate information over time from lower-level networks. Good classification performance was obtained for six stop consonants and three vowels. Notable features of this work are training to reject noise inputs, as in (Lang and Hinton 1988), and the use of modular techniques to build large nets from smaller trained modules, as in (Waibel et al. 1988). Other recent work demonstrating good phoneme and syllable classification using structured multilayer perceptron nets with delays is described in (Harrison and Fallside 1988; Homma et al. 1988; Irino and Kawahara 1988; Kamm et al. 1988; Leung and Zue 1988).

Unnikrishnan, Hopfield, and Tank (1988) obtained low error rates on digit classification using a time-concentration neural net that does not use only simple delays. This net, described in (Tank and Hopfield 1987), uses variable-length delay lines designed to disperse impulsive inputs such that longer delays result in more dispersion. Impulsive inputs to these delay lines are formed by enhancing spectral peaks in the outputs of 32 bandpass filters. Outputs of delay lines are multiplied by weights and summed to form separate matched filters for each word. These matched filters concentrate energy in time and produce a large output pulse at the end of the correct word. Limited evaluations reported in (Unnikrishnan et al. 1988) for digit strings from one talker demonstrated good performance using a modified form of back-propagation training. A prototype version of this recognizer using discrete analog electronic devices was also constructed (Tank and Hopfield 1987). Tests performed by Gold with a large speech data base and a hierarchical version of the time-concentration net that included both allophone and word models yielded performance that was no better than that of an existing HMM recognizer (Gold 1988).

6.2 Hierarchical Nets that Compute Kernel Functions. McDermott and Katagiri (1988) used Kohonen's LVQ classifier on the same /b,d,g/ speech data base used by Waibel et al. (1987). They were able to obtain an error rate of 1.7%, which is not statistically different from the 1.5% error rate obtained by Waibel et al. using the time-delay net shown in figure 4 (Waibel et al. 1987). Inputs for the LVQ classifier consisted of a 7-frame window of 16 filter-bank outputs. The nearest of 150 codebook nodes were determined as the 15-frame speech samples were passed through this 7-frame window. The normalized distances between nearest nodes and 112-element input patterns were integrated over time and used to classify speech inputs. The error rate without the final stage of LVQ training was high (7.3%).
It dropped to 1.7% after LVQ training was complete. This result demonstrates that nets with kernel functions and delays can perform as well as multilayer perceptrons with delays. These nets train faster but require more computation and memory during use. In this application, for example, the LVQ classifier required 17,000 weights, more than 30 times as many as required for the time-delay net used in (Waibel et al. 1987). If memory is not an important limitation, rapid search techniques such as hashing and k-d trees, described in (Omohundro 1987), can be applied to the LVQ classifier to greatly reduce the time required to find nearest neighbors. This would make the differences in computation time between these alternative approaches small on existing serial von Neumann computers.

6.3 Nets with Recurrent Connections. Nets with recurrent connections have not been used as extensively for speech recognition problems as feed-forward nets because they are more difficult to train, analyze, and design. Table 4 summarizes results of three representative studies. Initial work explored the use of recurrent Boltzmann machines. These nets typically provided good performance on small problems but required extremely long training times. More recent studies have focused on modified back-propagation training algorithms described in (Almeida 1987; Jordan 1986; Pineda 1987; Rohwer and Forrest 1987; Rumelhart et al. 1986a; Watrous 1988) that can be used with recurrent nets and time-varying inputs.
Study | Network | Speech Materials | Error Rate
Anderson, Merrill, and Port (1988) | Recurrent Net, 36 inputs | 20 talkers, CV's /b,d,g,p,t,k/ + /a/, 561 tokens | Talker indep. - 13.1%
Prager, Harrison, and Fallside (1986) | Boltzmann Machine, 2048 inputs | 6 talkers, 11 vowels, 264 tokens | Multi-talker - 15%
Robinson and Fallside (1988b) | Recurrent Net, 20 inputs | 7 talkers, 27 phonemes, 558 sentences | Talker dep. - 22.7%, Multi-talker - 30.8%

Table 4: Recognition of Speech Using Recurrent Neural Nets.
Figure 5: A recurrent neural net classifier. (Inputs and delayed output "state" nodes y(t-1) feed hidden nodes that produce the outputs.)
Prager, Harrison, and Fallside (Prager et al. 1986) performed one of the first experiments to evaluate the use of Boltzmann machines for speech recognition. At the time this study was performed, the Boltzmann machine training algorithm described in (Ackley et al. 1985) was the only well-known technique that could be used to train nets with recurrent connections. This training algorithm is computationally intensive because simulated annealing procedures (Kirkpatrick et al. 1983) are used to perform a probabilistic search of connection weights. Binary input and output data representations were developed to apply Boltzmann machines to an 11-vowel recognition task. One successful net used 2048 input bits to represent 128 spectral values and 8 output bits to specify the vowel. Nets typically contained 40 hidden nodes and 7320 links. Training used 264 tokens from 6 talkers and required 6 to 15 hours of processing on a high-speed array processor. The resulting multi-talker error rate was 15%.

Prager, Harrison, and Fallside (Prager et al. 1986) also explored the use of a Boltzmann machine recognizer inspired by single-order Markov Model approaches to speech recognition. A block diagram of this recurrent net is presented in figure 5. The output of this net is delayed and fed back to the input to "carry" nodes that provide information about the prior state. This net was trained to identify words in two sentences spoken by one talker. Training required 4 to 5 days of processing on a VAX 11/750 computer, and performance was nearly perfect on the training sentences. Other recent work on Boltzmann machines (Bengio and De Mori 1988; Kohonen et al. 1988; Prager and Fallside 1987) demonstrates that good performance can be provided at the expense of excessive training time.
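A minimal sketch of the recurrent structure in figure 5 (forward pass only, with random untrained weights and illustrative sizes): the previous output vector is delayed one time step and fed back as extra "state" inputs alongside each new spectral frame.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_hid, n_out = 20, 30, 27             # illustrative sizes
Wx = 0.1 * rng.normal(size=(n_in, n_hid))   # input -> hidden
Ws = 0.1 * rng.normal(size=(n_out, n_hid))  # fed-back state -> hidden
Wo = 0.1 * rng.normal(size=(n_hid, n_out))  # hidden -> output

def run(frames):
    """Label a frame sequence; y(t-1) acts as the 'carry' state inputs."""
    y = np.zeros(n_out)
    outputs = []
    for x in frames:                        # one spectral frame per time step
        h = sigmoid(x @ Wx + y @ Ws)
        y = sigmoid(h @ Wo)
        outputs.append(y)
    return np.array(outputs)

print(run(rng.normal(size=(50, n_in))).shape)   # (50, 27) frame-by-frame labels
```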
Preliminary work on analog VLSI implementations of the training algorithm required by Boltzmann machines has demonstrated practical learning times for small hardware networks (Alspector and Allen 1987).

Many types of recurrent nets have been proposed that can be trained with modified forms of back-propagation. Jordan (1986) appears to have been the first to study nets with recurrent connections from output to input nodes, as in figure 5. He used these nets to produce pattern sequences. Bourlard and Wellekens (1988) recently proved that such nets could be used to calculate the local probabilities required in HMM recognizers, and Robinson and Fallside (1988a) pointed out the relationship between these nets and the state space equations used in classical control theory. Nets with recurrent self-looping connections on hidden and output nodes were studied by Watrous and Shastri (1987) for a speech recognition application. Nets with recurrent connections from hidden nodes to input nodes were studied by Elman (1988) and by Servan-Schreiber, Cleeremans, and McClelland (1988) for natural language applications.

Two recent studies have explored recurrent nets similar to the net shown in figure 5 when trained with modified forms of back-propagation. Robinson and Fallside (1988b) used such a net to label speech frames with one of 27 phoneme labels using hand-marked testing and training data. Training used an algorithm suggested by Rumelhart et al. (1986a) that, in effect, replicates the net at every time step during training. Talker-dependent error rates were 22.7% for the recurrent net and 26.0% for a simple feed-forward net with delays between input nodes to provide input context. Multi-talker error rates were 30.8% for the recurrent net and 40.8% for the feed-forward net. A 64-processor array of transputers provided practical training times in these experiments. Anderson, Merrill, and Port (1988) also explored recurrent nets similar to the net in figure 5. Stimuli were CV syllables formed from six stop consonants and the vowel /a/ that were hand-segmented to contain 120 msec of speech around the vowel onset. Nets were trained on data from 10 talkers, tested on data from 10 other talkers, and contained from one to two hidden layers with different numbers of hidden nodes. Best performance (an error rate of 13.1%) was provided by a net with two hidden layers.

7 Integrating Neural Net and Conventional Approaches
Researchers are beginning to combine conventional HMM and DTW speech recognition algorithms with neural net classification algorithms and also to design neural net architectures that perform the computations required by important speech recognition algorithms. This may lead to improved recognition accuracy and also to new designs for compact real-time hardware. Combining the good discrimination of neural net classifiers with the automatic scoring and training algorithms used in HMM recognizers could lead to rapid advances by building on existing high-performance recognizers. Studies that have combined neural net and conventional approaches to speech recognition are listed in table 5.
Study | Approach | Comments
Bourlard and Wellekens (1987) | MLP provides allophone distance scores for DTW recognizer | Good performance on 918-word, talker-dependent, continuous-speech task
Burr (1988a) | MLP classifier after energy-based DTW | Tested on single-talker E-set
Huang and Lippmann (1988) | Second-stage MLP discrimination after HMM recognizer | Improved performance for "B,D,G" from TI alpha-digit data base
Lippmann and Gold (1987) | "Viterbi net" neural net architecture for HMM Viterbi decoder | Same good performance on large data base as robust HMM recognizer
Sakoe and Iso (1987) | MLP provides distance scores for DTW recognizer | No hand labeling required, untested

Table 5: Studies Combining Neural Net and Conventional Approaches.

Many of these studies (Bourlard and Wellekens 1987; Burr 1988b; Huang et al. 1988; Sakoe and Iso 1987) integrate multilayer perceptron classifiers with conventional DTW and HMM recognizers, and one (Lippmann and Gold 1987) provides a neural-net architecture that could be used to implement an HMM Viterbi decoder. One study (Bourlard and Wellekens 1987) demonstrated how a multilayer perceptron could be integrated into a DTW continuous-speech recognizer to improve recognition performance.
7.1 Integrating Multilayer Perceptron Classifiers with DTW and HMM Recognizers. At least three groups have proposed recognizers where multilayer perceptrons compute distance scores used in DTW or HMM recognizers (Bourlard and Wellekens 1987; Burr 1988a; Sakoe and Iso 1987). Bourlard and Wellekens (1987) demonstrated how the multilayer perceptron shown in figure 6 could be used to calculate the allophone distance scores required for phoneme and word recognition in a DTW discrete-observation recognizer. One net had inputs from 15 frames of speech centered on the current frame, 50 hidden nodes, and 26 output nodes. Outputs corresponded to allophones in a 10-digit German vocabulary. Inputs were from 60 binary variables per frame. One input bit was on in each frame to specify the codebook entry that represented that frame. The multilayer perceptron was trained using hand-labeled training data to provide a high output only for that output node corresponding to the current input allophone. Recognition then used dynamic time warping with local distances equal to values from output nodes. This provides good discrimination from the neural net and integration over time from the DTW algorithm. Perfect performance was obtained for recognition of 100 tokens from one talker.

Bourlard and Wellekens (1987) also used a multilayer perceptron with contextual input and DTW to recognize words from a more difficult 919-word talker-dependent continuous-speech task. The net covered an input context of 9 frames, used one of 132 vectors to quantize each frame, had 50 or 200 hidden nodes, and had 50 output nodes corresponding to 50 German phonemes. This net was trained using 100 hand-segmented sentences and tested on 188 other sentences containing roughly 7300 phonemes. The phoneme error rate was 41.6% with 50 hidden nodes and 37% with 200 hidden nodes. These error rates were both lower than the 47.5% error rate provided by a simple discrete-observation HMM recognizer with duration modeling and one probability histogram per phoneme. Bourlard and Wellekens suggested that performance could be improved, and the need for hand-segmented training data eliminated, by embedding multilayer perceptron back-propagation training in an iterative Viterbi-like training loop. This loop could progressively improve segmentation for DTW or HMM recognizers. Iterative Viterbi training was not performed because the simpler single-pass training already required roughly 200 hours on a SUN-3 workstation. As noted above, Bourlard and Wellekens (1988) also recently proved that recurrent neural nets could calculate the local probabilities required in HMM recognizers.

Sakoe and Iso (1987) suggested a recognition structure similar to that of Bourlard and Wellekens (1987) where a multilayer perceptron with delays between input nodes computes local distance scores. They, however, do not require output nodes of the multilayer perceptron to represent sub-word units such as phonemes. Instead, a training algorithm is described that is similar to the iterative Viterbi-like training loop suggested by Bourlard and Wellekens (1987) but for continuous input parameters. No results were presented for this approach.
Burr (1988a) gave results for a recognizer where words were first aligned based on energy information to provide a fixed 20 input frames of spectral information. These inputs were fed to nine outputs representing members of the E-set ("B,C,D,E,G,P,T,V,Z"). This recognizer was trained and tested using 180 tokens from one talker. Results were nearly perfect when the initial parts of these words were oversampled.

Huang and Lippmann demonstrated how a second stage of analysis using a multilayer perceptron could decrease the error rate of an HMM recognizer (Huang and Lippmann 1988). The Viterbi backtraces from an HMM recognizer were used to segment input speech frames, and average HMM log probability scores for segments were provided as inputs to single- and multilayer perceptrons. Performance was evaluated using the letters "B,D,G" spoken by the 16 talkers in the TI alpha-digit data base. Ten training tokens per letter were used to train the HMM and neural net recognizer for each talker, and the 16 other tokens were used for testing. Best performance was provided by a single-layer perceptron, which almost halved the error rate: it dropped from 7.2% with the HMM recognizer alone to 3.8% with the neural net postprocessor.
Figure 6: A feed-forward multilayer perceptron that was used to compute allophone distance scores for a DTW recognizer. (Figure labels: local allophone distance scores at the output; hidden nodes; inputs from the current frame flanked by left and right context frames.)
Figure 7: A recurrent neural net called a Viterbi net that performs the calculations required in an HMM Viterbi decoder.
7.2 A Neural Net Architecture to Implement a Viterbi Decoder. Lippmann and Gold (1987) described a neural-net architecture called a Viterbi net that could be used to implement, with analog VLSI techniques, the Viterbi decoder used in many continuous-observation HMM recognizers. This net is shown in figure 7. Nodes represented by open triangles correspond to nodes in a left-to-right HMM word model. Each of these triangles represents a threshold-logic node followed by a fixed delay. Small subnets in the upper part of the figure select the maximum of two inputs as described in (Lippmann et al. 1987), and subnets in the lower part sum all inputs. A temporal sequence of input vectors is presented at the input, and the output is proportional to the log probability calculated by a Viterbi decoder. The structure of the Viterbi net illustrates how neural net components can be integrated to design a complex net that performs the calculations required by an important conventional algorithm.

The Viterbi net differs from the Viterbi decoding algorithm normally implemented in software and was thus evaluated using 4000 word tokens from the 9-talker 35-word Lincoln Stress-Style speech data base. Connection strengths in Viterbi nets with 15 internal nodes (one node per HMM model state) were adjusted based on parameter estimates obtained from the forward-backward algorithm. Inputs consisted of 12 mel cepstra and 13 differential mel cepstra that were updated every 10 msec. Performance was good and almost identical to that of current Robust HMM isolated-word recognizers (Lippmann and Gold 1987).
The error rate was 0.56%, with only 23 of 4095 tokens wrong. One advantage an analog implementation of this net would have over digital approaches is that the frame rate could be increased to provide improved temporal resolution without requiring higher clock rates.
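In software terms, the computation the Viterbi net performs is the familiar log-domain Viterbi recursion for a left-to-right model, with the analog max and sum subnets replaced by explicit arithmetic. The sketch below is a minimal version of that recursion; the function name, the toy model, and the restriction to self-loops and single-state advances are assumptions made for illustration.

```python
import numpy as np

def viterbi_log_prob(log_b, log_a):
    """Log probability computed by a Viterbi decoder for a left-to-right
    HMM.  log_b[t, j] is the log observation likelihood of frame t in
    state j; log_a[j, k] is the log transition probability.  Only
    self-loops and single-state advances are allowed, matching a
    left-to-right word model."""
    T, N = log_b.shape
    phi = np.full(N, -np.inf)
    phi[0] = log_b[0, 0]                  # decoding must start in state 0
    for t in range(1, T):
        new_phi = np.full(N, -np.inf)
        for j in range(N):
            stay = phi[j] + log_a[j, j]
            advance = phi[j - 1] + log_a[j - 1, j] if j > 0 else -np.inf
            # this two-input max is what the small subnets in the upper
            # part of the Viterbi net compute; the additions are the sums
            new_phi[j] = max(stay, advance) + log_b[t, j]
        phi = new_phi
    return phi[-1]                        # score of ending in the last state

# toy example: 3 states, 5 frames, uniform observation likelihoods
log_a = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.6, 0.4],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_b = np.log(np.full((5, 3), 1.0 / 3.0))
print(viterbi_log_prob(log_b, log_a))
```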
8 Other Nets for Pattern Sequence Recognition

In addition to the neural net models described above, other nets motivated primarily by psychological and physiological findings and by past work on associative memories have been proposed for speech recognition and pattern sequence recognition. Although some of these nets represent new approaches to the problem of pattern sequence recognition, few have been integrated into speech recognizers and none have been evaluated using large speech data bases.

8.1 Psychological Neural Net Models of Speech Perception. Three neural net models have been proposed which are primarily psychological models of speech perception (Elman and McClelland 1986; MacKay 1987; Marslen-Wilson 1987; Rumelhart et al. 1986b). The COHORT model developed by Marslen-Wilson (1987) assumes a left-to-right real-time acoustic-phonetic analysis of speech, as in current recognizers. It accounts for many psychophysical results in speech recognition, such as the existence of a time when a word becomes unambiguously recognized (the recognition point), the word frequency effect, and recognition of contextually inappropriate words. This model, however, is descriptive and is not expressed as a computational model.

Hand-crafted versions of the TRACE and Interactive Activation models developed by Elman, McClelland, Rumelhart, and co-workers were tested with small speech data bases (Elman and McClelland 1986; Rumelhart et al. 1986b). These models are based on neuron-like nodes, include both feed-forward and feed-back connections, use nodes with multiplicative operations, and emphasize the benefits that can be obtained by using co-articulation information to aid in word recognition. These models are impractical because the problems of time alignment and training are not addressed and the entire network must be copied on every new time step. The Node Structure Theory developed by MacKay (1987) is a qualitative neural theory of speech recognition and production. It is similar in many ways to the above models, but considers problems related to talking rate, stuttering, internal speech, and rhythm.
8.2 Physiological Models for Temporal Pattern Recognition. Neural net approaches motivated primarily by physiological and behavioral results have also been proposed to perform some component of the time alignment task (Cohen et al. 1987; Dehaene et al. 1987; Wong and Chen 1986). Wong and Chen (1986) and Dehaene et al. (1987) describe similar
models that have been tested with a small amount of speech data. These models include neurons with shunting or multiplicative nodes similar to those that have been proposed in the retina to compute direction of motion (Poggio and Koch 1987). Three neurons can be grouped to form a "synaptic triad" that can be used to recognize two-component pattern sequences. This triad will have a strong output only if the modulator input goes "high" and then, a short time later, the primary input goes "high." Synaptic triads can be arranged in sequences and in hierarchies to recognize features, allophones, and words (Wong and Chen 1986). In limited tests, hand-crafted networks could recognize a small set of words spoken by one talker (Wong and Chen 1986). More interesting is a proposed technique for training such networks without supervision (Dehaene et al. 1987). If effective, this could make use of the large amount of unlabeled speech data that is available and lead to automatic creation of sub-word models. Further elaboration is necessary to describe how networks with synaptic triads could be trained and used in a recognizer.
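To make the triad's gating behavior concrete, here is a minimal discrete-time sketch; the decay constant, the threshold, and the function name are illustrative assumptions, and the models cited above use continuous shunting dynamics rather than this caricature.

```python
def synaptic_triad(modulator, primary, decay=0.8, threshold=0.5):
    """Discrete-time sketch of a synaptic triad: the modulator input
    leaves a decaying trace, and the output is strong only when the
    primary input arrives while that trace is still high -- i.e. when
    the two-component sequence (modulator, then primary) occurs."""
    trace, outputs = 0.0, []
    for m, p in zip(modulator, primary):
        trace = max(m, trace * decay)   # modulator trace decays over time
        outputs.append(1.0 if (p > threshold and trace > threshold) else 0.0)
    return outputs

# modulator fires at t=0, primary at t=2: sequence recognized
print(synaptic_triad([1, 0, 0, 0], [0, 0, 1, 0]))  # [0.0, 0.0, 1.0, 0.0]
# primary fires before the modulator: not recognized
print(synaptic_triad([0, 0, 1, 0], [1, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
```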
Cohen and Grossberg proposed a network called a masking field that has not yet been tested with speech input (Cohen and Grossberg 1987).

Figure 8: A model called a masking field that can be used to detect pattern sequences. (Figure labels: input; a subnet in which only one node is "high" at a time; short-term storage; masking field. The example inputs are the sequences CAT and TAC.)
This network is shown in figure 8. Inputs are applied to the bottom subnet, which is similar to a feature map net (Kohonen et al. 1984). Typically, only one node in this subnet has a "high" output at any time. Subnet node outputs feed short-term storage nodes whose outputs decay slowly over time. Different input pattern sequences thus lead to different amplitude patterns in short-term storage. For example, the input C-A-T sampled at the end of the word will yield an intensity pattern in short-term storage with node C low, node A intermediate, and node T high. The input T-A-C will yield a pattern with node C high, node A intermediate, and node T low. These intensity patterns are weighted and fed to nodes in a masking field with weights adjusted to detect different patterns. The masking field is designed such that all nodes compete to be active and nodes representing longer patterns inhibit nodes representing shorter patterns. This approach can recognize short isolated pattern sequences but has difficulty recognizing patterns with repeated sub-sequences, because nodes in short-term storage corresponding to those sub-sequences can become saturated. Further elaboration is necessary to describe how masking fields should be integrated into a full recognizer. Other recent studies (Jordan 1986; Stornetta et al. 1988; Tattersall et al. 1988) have also proposed using slowly-decaying nodes as short-term storage to provide history useful for pattern recognition and pattern sequence generation.

8.3 Sequential Associative Memories. A final approach to pattern sequence recognition is to build a sequential associative memory for pattern sequences as described in (Amit 1988; Buhmann and Schulten 1988; Hecht-Nielsen 1987; Kleinfield 1986; Sompolinsky and Kanter 1986). These nets extend past work on associative memories by Hopfield and Little (Hopfield 1982; Little 1974) to the case where pattern sequences instead of static patterns can be restored. Recognition in this approach corresponds to the net settling into a desired sequence of stable states, one after the other, when driven by an input temporal pattern sequence. Dynamic associative memory models developed by Amit, Kleinfield, Sompolinsky, and Kanter (Amit 1988; Kleinfield 1986; Sompolinsky and Kanter 1986) use long and short delays on links to generate and recognize pattern sequences. Links with short delays mutually excite a small set of nodes to produce stable states. Links with long delays excite nodes in the next expected stable state. Transitions between states thus occur at predetermined times that depend on the delays in the links. A net developed by Buhmann and Schulten (1988) uses probabilistic nodes to produce sequencing behavior similar to that produced by a Markov chain. Transitions in this net occur stochastically but at some average rate. A final net described by Hecht-Nielsen (1987) is a modified version of Grossberg's avalanche net (Grossberg 1988). The input to this net is similar in structure to Kohonen's feature map. It differs in that nodes have different rise and fall time constants and overall network activity is
controlled such that only the outputs of a few nodes are "high" at any time.

A few relatively small simulations have been performed to explore the behavior of sequential associative memories. Simulations have demonstrated that these nets can complete pattern sequences given the first element of a sequence (Buhmann and Schulten 1988) and also perform such functions as counting the number of input patterns presented to a net (Amit 1988). Although this approach is theoretically very interesting and may be a good model of some neural processing, no tests have been performed with speech data. In addition, further work is necessary to develop training procedures and useful decoding strategies that could be applied in a complete speech recognizer.
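A caricature of the short-delay/long-delay mechanism can be written in a few lines. In the sketch below, a no-delay auto-associative weight matrix stabilizes the current pattern while a hetero-associative matrix, driven by the state several steps in the past, pushes the net toward the next stored pattern. The Hebbian weight rules, the delay length, and the gain are illustrative assumptions rather than details of any of the cited models.

```python
import numpy as np

def sequence_memory(patterns, steps, delay=3, lam=1.5):
    """Sequential associative memory sketch: W_short (no delay) stabilizes
    the current pattern; W_long, driven by the state `delay` steps in the
    past, excites the next stored pattern, producing transitions at times
    set by the delay.  Patterns are bipolar (+1/-1) row vectors."""
    P = np.asarray(patterns, dtype=float)
    n = P.shape[1]
    W_short = P.T @ P / n            # auto-associative (short-delay) links
    W_long = P[1:].T @ P[:-1] / n    # hetero-associative (long-delay) links
    history = [P[0].copy() for _ in range(delay + 1)]   # start in pattern 0
    for _ in range(steps):
        drive = W_short @ history[-1] + lam * (W_long @ history[-1 - delay])
        history.append(np.sign(drive))
    return history

# three mutually orthogonal patterns: the state holds each one for a few
# steps and then advances, pattern 0 -> pattern 1 -> pattern 2
pats = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
for state in sequence_memory(pats, steps=8)[3:]:
    print(state)
```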
9 Summary of Past Research

The performance of current speech recognizers is far below that of humans. Neural nets offer the potential of providing massive parallelism, adaptation, and new algorithmic approaches to speech recognition problems. Researchers are investigating:

1. New physiologically-based front ends,

2. Neural net classifiers for static speech input patterns,

3. Neural nets designed specifically to classify temporal pattern sequences,

4. Combined recognizers that integrate neural net and conventional recognition approaches,

5. Neural net architectures that implement conventional algorithms, and

6. VLSI hardware neural nets that implement both neural net and conventional algorithms.
Physiological front ends have provided improved recognition accuracy in noise (Ghitza 1988; Hunt and Lefebvre 1988), and a cochlea filter-bank that could be used in these front ends has been implemented using micro-power VLSI techniques (Lyon and Mead 1988). Many nets can compute the complex likelihood functions required by continuous-distribution recognizers and perform the vector quantization required by discrete-observation recognizers. Kohonen's feature map algorithm (Kohonen et al. 1984) has been used successfully to vector quantize speech, and preliminary VLSI hardware versions of this net have been built (Mann et al. 1988).

Multilayer perceptron networks with delays have provided excellent discrimination between small sets of difficult-to-discriminate speech inputs (Kammerer and Kupper 1988; Lang and Hinton 1988; Peeling and Moore 1987; Waibel et al. 1987; Waibel et al. 1988; Watrous 1988).
Good discrimination was provided for a set of 18 consonants in varying phonetic contexts (Waibel et al. 1988), similar E-set words such as "B,D,E,V" (Lang and Hinton 1988), and digits and words from small vocabularies (Kammerer and Kupper 1988; Peeling and Moore 1987; Watrous 1988). In some cases performance was similar to or slightly better than that provided by a more conventional HMM or DTW recognizer (Kammerer and Kupper 1988; Lang and Hinton 1988; Peeling and Moore 1987; Waibel et al. 1987; 1988). In almost all cases, a neural net approach performed as well as or slightly better than conventional approaches but provided a parallel architecture that could be used for implementation and a computationally simple and incremental training algorithm.

Approaches to the problem of scaling a network up in size to discriminate between members of a large set have been proposed and demonstrated (Waibel et al. 1988). For example, a net that classifies 18 consonants accurately was constructed from subnets trained to discriminate between smaller subsets of these consonants. Algorithms that use combined unsupervised/supervised training and provide high performance and extremely rapid training have also been demonstrated (Huang and Lippmann 1988; Kohonen et al. 1988). New training algorithms that can be used with recurrent networks are under development (Almeida 1987; Jordan 1986; Pineda 1987; Rohwer and Forrest 1987; Watrous 1988).

Preliminary studies have explored recognizers that combine conventional and neural net approaches. Promising continuous-speech recognition results have been obtained by integrating multilayer perceptrons into a DTW recognizer (Bourlard and Wellekens 1987), and a multilayer perceptron postprocessor has improved the performance of an isolated-word HMM recognizer (Huang et al. 1988). Neural net architectures have also been designed for important conventional algorithms. For example, recurrent neural net architectures have been developed to implement the Viterbi decoding algorithm used in many HMM speech recognizers (Lippmann and Gold 1987) and also to compute local probabilities required in discrete-observation HMM recognizers (Bourlard and Wellekens 1988).

Many new neural net models have been proposed for recognizing temporal pattern sequences. Some are based on physiological data and attempt to model the behavior of biological nets (Dehaene et al. 1987; Cohen et al. 1987; Wong and Chen 1986), while others attempt to extend existing auto-associative networks to temporal problems (Amit 1988; Buhmann and Schulten 1988; Kleinfield 1986; Sompolinsky and Kanter 1986). New learning algorithms and net architectures will, however, be required to provide the real-time response and automatic learning of internal word and phrase models required for high-performance continuous-speech recognition. This remains a major unsolved problem in the field of neural nets.
10 Suggestions for Future Work
Further work should emphasize networks that provide rapid response and could be used with real-time speech input. They must include internal mechanisms to distinguish speech from background noise and to determine when a word has been presented. They also must operate with continuous acoustic input and not require hand marking of test speech data, long internal delays, or duplication of the network for new inputs.

Short-term research should focus on a task that current recognizers perform poorly on, such as accurate recognition of difficult sets of isolated words. Such a task would not require excessive computation resources or extremely large data bases. A potential initial problem is talker-independent recognition of difficult E-set words or phonemes as in (Lang and Hinton 1988; Waibel et al. 1988). Techniques developed using small difficult vocabularies should be extended to larger vocabularies and continuous speech as soon as feasible. Efforts should focus on: developing training algorithms to construct sub-word and word models automatically without excessive supervision, developing better front-end acoustic-phonetic feature extraction, improving low-level acoustic-phonetic discrimination, integrating temporal sequence information over time, and developing more rapid training techniques. Researchers should continue integrating neural net approaches to classification with conventional approaches to training and scoring. Longer-term research on continuous-speech recognition must address the problems of developing high-level speech-understanding systems that can learn and use internal models of the world. These systems must be able to learn and use syntactic, semantic, and pragmatic constraints.

Efforts on building neural net VLSI hardware for speech recognition should also continue. The development of compact real-time speech recognizers is a major goal of neural net research. Parallel neural-net architectures should be designed to perform the computations required by successful algorithms, and then these architectures should be implemented and tested. Recent developments in analog VLSI neural nets suggest that this approach has the potential to provide the high computation rates required for both front-end acoustic analysis and high-level pattern matching.

All future work should take advantage of the many speech data bases that currently exist and use results obtained with experimental HMM and DTW recognizers on these data bases as benchmarks. Descriptions of some common data bases and comments on their availability are in (Pallett 1986; Price et al. 1988). Detailed evaluations using large speech data bases are necessary to guide research and permit comparisons between alternative approaches. Results obtained on a few locally-recorded speech samples are often misleading and are not informative to other researchers. Research should also build on the current state of knowledge in neural networks, pattern classification theory, statistics, and conventional HMM
and DTW approaches to speech recognition. Researchers should become familiar with these areas and not duplicate existing work. Introductions to current HMM and DTW approaches are available in (Dixon and Martin 1979; Lee and Hon 1988; Parsons 1986; Rabiner and Juang 1986; Rabiner and Schafer 1978), and introductions to statistics and pattern classification are available in many books including (Duda and Hart 1973; Fukunaga 1972; Nilsson 1965).
Acknowledgments
I would like to thank members of the Royal Signals and Radar Establishment, including John Bridle and Roger Moore, for discussions regarding the material in this paper. I would also like to thank Bill Huang and Ben Gold for interesting discussions and Carolyn for her patience.

References

Ackley, D.H., G.E. Hinton, and T.J. Sejnowski. 1985. A Learning Algorithm for Boltzmann Machines. Cognitive Science 9, 147-169.

Albus, J.S. 1981. Brain, Behavior, and Robotics. BYTE Books.

Almeida, L.B. 1987. A Learning Rule for Asynchronous Perceptrons with Feedback in a Combinatorial Environment. In: 1st International Conference on Neural Networks. IEEE, II-609.

Alspector, J. and R.B. Allen. 1987. A Neuromorphic VLSI Learning System. In: Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, ed. P. Losleben, 313-349. Cambridge: MIT Press.

Amit, D.J. 1988. Neural Networks for Counting Chimes. Proceedings of the National Academy of Sciences, USA 85, 2141-2145.

Anderson, S., J. Merrill, and R. Port. 1988. Dynamic Speech Categorization with Recurrent Networks. Technical Report 258, Department of Linguistics and Department of Computer Science, Indiana University.

Averbuch, A., L. Bahl, and R. Bakis. 1987. Experiments with the Tangora 20,000 Word Speech Recognizer. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, 701-704.

Bahl, L.R., P.F. Brown, P.V. De Souza, and R.L. Mercer. 1988. Modeling Acoustic Sequences of Continuous Parameters. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 40-43.

Baldi, P. and K. Hornik. 1988. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks 2, 53-58.

Beet, S.W., H.E.G. Powrie, R.K. Moore, and M.J. Tomlinson. 1988. Improved Speech Recognition Using a Reduced Auditory Representation. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 75-78.
Bengio, Y. and R. De Mori. 1988. Use of Neural Networks for the Recognition of Place of Articulation. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 103-106.

Bourlard, H. and Y. Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics 59, 291-294.

Bourlard, H. and C.J. Wellekens. 1988. Links Between Markov Models and Multilayer Perceptrons. Technical Report Manuscript M-263, Philips Research Laboratory, Brussels, Belgium.

Bourlard, H. and C.J. Wellekens. 1987. Speech Pattern Discrimination and Multilayer Perceptrons. Technical Report Manuscript M-211, Philips Research Laboratory, Brussels, Belgium. Scheduled to appear in the December issue of Computer, Speech and Language.

Bridle, J. 1988. Neural Network Experience at the RSRE Speech Research Unit. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan.

Broomhead, D.S. and D. Lowe. 1988. Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Technical Report RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, Worcester, Great Britain.

Buhmann, J. and K. Schulten. 1988. Noise-Driven Temporal Association in Neural Networks. Europhysics Letters 4, 1205-1209.

Burr, D.J. 1988a. Experiments on Neural Net Recognition of Spoken and Written Text. IEEE Transactions on Acoustics, Speech and Signal Processing 36, 1162-1168.

Burr, D.J. 1988b. Speech Recognition Experiments with Perceptrons. In: Neural Information Processing Systems, ed. D. Anderson, 144-153. New York: American Institute of Physics.

Burton, D.K., J.E. Shore, and J.T. Buck. 1985. Isolated-Word Speech Recognition Using Multisection Vector Quantization Codebooks. IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-33, 837-849.

Cohen, M. and S. Grossberg. 1987. Masking Fields: A Massively Parallel Neural Architecture for Learning, Recognizing, and Predicting Multiple Groupings of Patterned Data. Applied Optics 26, 1866-1891.

Cohen, M.A., S. Grossberg, and D. Stork. 1987. Recent Developments in a Neural Model of Real-Time Speech Analysis and Synthesis. In: 1st International Conference on Neural Networks, IEEE.

Cowan, J.D. and D.H. Sharp. 1988. Neural Nets and Artificial Intelligence. Daedalus 117, 85-121.

Cybenko, G. 1988. Continuous Valued Neural Networks with Two Hidden Layers are Sufficient. Technical Report, Department of Computer Science, Tufts University.

Dehaene, S., J. Changeux, and J. Nadal. 1987. Neural Networks that Learn Temporal Sequences by Selection. Proceedings of the National Academy of Sciences, USA 84, 2727-2731.

Deng, Li and C. Daniel Geisler. 1987. A Composite Auditory Model for Processing Speech Sounds. Journal of the Acoustical Society of America 82:6, 2001-2012.

Dixon, N.R. and T.B. Martin. 1979. Automatic Speech and Speaker Recognition. New York: IEEE Press.
Doddington, G.R. and T.B. Schalk. 1981. Speech Recognition: Turning Theory into Practice. IEEE Spectrum, 26-32.

Duda, R.O. and P.E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley & Sons.

Elman, J.L. 1988. Finding Structure in Time. CRL Technical Report 8801, University of California, San Diego, CA.

Elman, J.L. and J.L. McClelland. 1986. Exploiting Lawful Variability in the Speech Wave. In: Invariance and Variability in Speech Processes, eds. J.S. Perkell and D.H. Klatt. New Jersey: Lawrence Erlbaum.

Elman, J.L. and D. Zipser. 1987. Learning the Hidden Structure of Speech. ICS Report 8701, Institute for Cognitive Science, University of California, San Diego, La Jolla, CA.

Fallside, F., T.D. Harrison, R.W. Prager, and A.J.R. Robinson. 1988. A Comparison of Three Connectionist Models for Phoneme Recognition in Continuous Speech. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan.

Fukunaga, K. 1972. Introduction to Statistical Pattern Recognition. New York: Academic Press.

Ghitza, O. 1988. Auditory Neural Feedback as a Basis for Speech Processing. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 91-94.

Gold, B. 1988. A Neural Network for Isolated Word Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 44-47.

Greenberg, S. 1988a. The Ear as a Speech Analyzer. Journal of Phonetics 16, 139-149.

Greenberg, S. 1988b. Special Issue on "Representation of Speech in the Auditory Periphery." Journal of Phonetics 16.

Grossberg, S. 1988. Nonlinear Neural Networks: Principles, Mechanisms, and Architectures. Neural Networks 1, 17-61.

Hanson, S.J. and D.J. Burr. 1987. Knowledge Representation in Connectionist Networks. Technical Report, Bell Communications Research, Morristown, New Jersey.

Harrison, T.D. and F. Fallside. 1988. A Connectionist Structure for Phoneme Recognition. Technical Report CUED/F-INFENG/TR.15, Cambridge University Engineering Department.

Hecht-Nielsen, R. 1987. Nearest Matched Filter Classification of Spatiotemporal Patterns. Applied Optics 26, 1892-1899.

Hinton, G.E. 1987. Connectionist Learning Procedures. Technical Report CMU-CS-87-115, Carnegie Mellon University, Computer Science Department.

Homma, T., L.E. Atlas, and R.J. Marks. 1988. An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification. In: Neural Information Processing Systems, ed. D. Anderson, 31-40. New York: American Institute of Physics.

Hopfield, J.J. 1982. Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences, USA 79, 2554-2558.
Huang, W.M. and R.P. Lippmann. 1988. Neural Net and Traditional Classifiers. In: Neural Information Processing Systems, ed. D. Anderson, 387-396. New York: American Institute of Physics.

Huang, W.M., R.P. Lippmann, and T. Nguyen. 1988. Neural Nets for Speech Recognition. In: Conference of the Acoustical Society of America, Seattle, WA.

Hunt, M.J. and C. Lefebvre. 1988. Speaker Dependent and Independent Speech Recognition Experiments with an Auditory Model. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing 1, New York, 215-218.

Irino, T. and H. Kawahara. 1988. A Study on the Speaker Independent Feature Extraction of Japanese Vowels by Neural Networks. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan.

Jordan, M.I. 1986. Serial Order: A Parallel Distributed Processing Approach. Institute for Cognitive Science Report 8604, University of California, San Diego.

Kamm, C., T. Landauer, and S. Singhal. 1988. Training an Adaptive Network to Spot Demisyllables in Continuous Speech. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan.

Kammerer, B. and W. Kupper. 1988. Experiments for Isolated-Word Recognition with Single and Multi-Layer Perceptrons. Abstracts of 1st Annual INNS Meeting, Boston. Neural Networks 1, 302.

Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi. 1983. Optimization by Simulated Annealing. Science 220, 671-680.

Klatt, D.H. 1986. The Problem of Variability in Speech Recognition and Models of Speech Perception. In: Invariance and Variability in Speech Processes, eds. J.S. Perkell and D.H. Klatt, 300-324. New Jersey: Lawrence Erlbaum.

Kleinfield, D. 1986. Sequential State Generation by Model Neural Networks. Proceedings of the National Academy of Sciences, USA 83, 9469-9473.

Kohonen, T. 1988. An Introduction to Neural Computing. Neural Networks 1, 3-16.

Kohonen, T. 1984. Self-Organization and Associative Memory. Berlin: Springer-Verlag.

Kohonen, T., G. Barna, and R. Chrisley. 1988. Statistical Pattern Recognition with Neural Networks: Benchmarking Studies. In: IEEE Annual International Conference on Neural Networks, San Diego, July.

Kohonen, T., K. Makisara, and T. Saramaki. 1984. Phonotopic Maps - Insightful Representation of Phonological Features for Speech Recognition. In: IEEE Proceedings of the 7th International Conference on Pattern Recognition.

Lang, K.J. and G.E. Hinton. 1988. The Development of the Time-Delay Neural Network Architecture for Speech Recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.

Lapedes, A. and R. Farber. 1988. How Neural Nets Work. In: Neural Information Processing Systems, ed. D. Anderson, 442-456. New York: American Institute of Physics.

Lee, Kai-Fu and Hsiao-Wuen Hon. 1988. Large-Vocabulary Speaker-Independent Continuous Speech Recognition Using HMM. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing 1, 123-126.
Lee, Y.C., G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee, and C.L. Giles. 1986. Machine Learning Using a Higher Order Correlation Network. Physica D, 276-306.

Leung, H.C. and V.W. Zue. 1988. Some Phonetic Recognition Experiments Using Artificial Neural Nets. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing 1.

Lippmann, R.P., B. Gold, and M.L. Malpass. 1987. A Comparison of Hamming and Hopfield Neural Nets for Pattern Classification. Technical Report TR-769, MIT Lincoln Laboratory.

Lippmann, R.P. 1987. An Introduction to Computing with Neural Nets. IEEE ASSP Magazine 4:2, 4-22.

Lippmann, R.P. and Ben Gold. 1987. Neural Classifiers Useful for Speech Recognition. In: 1st International Conference on Neural Networks, IEEE, IV-417.

Little, W.A. 1974. The Existence of Persistent States in the Brain. Mathematical Biosciences 19, 101-120.

Lyon, R.F. and C. Mead. 1988. An Analog Electronic Cochlea. IEEE Transactions on Acoustics, Speech and Signal Processing 36, 1119-1134.

MacKay, D.G. 1987. The Organization of Perception and Action. New York: Springer-Verlag.

Mann, J., J. Raffel, R. Lippmann, and B. Berger. 1988. A Self-Organizing Neural Net Chip. Neural Networks for Computing Conference, Snowbird, Utah.

Marslen-Wilson, W.D. 1987. Functional Parallelism in Spoken Word-Recognition. In: Spoken Word Recognition, eds. U.H. Frauenfelder and L.K. Tyler. Cambridge, MA: MIT Press.

McDermott, E. and S. Katagiri. 1988. Phoneme Recognition Using Kohonen's Learning Vector Quantization. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan.

Moody, J. 1988. Speedy Alternatives to Back Propagation. Neural Networks for Computing Conference, Snowbird, Utah.

Moody, J. and C. Darken. 1988. Learning with Localized Receptive Fields. Technical Report YALEU/DCS/RR-649, Yale Computer Science Department, New Haven, CT.

Naylor, J. and K.P. Li. 1988. Analysis of a Neural Network Algorithm for Vector Quantization of Speech Parameters. Abstracts of 1st Annual INNS Meeting, Boston. Neural Networks 1, 310.

Nilsson, Nils J. 1965. Learning Machines. New York: McGraw-Hill.

Niranjan, M. and F. Fallside. 1988. Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns. Technical Report CUED/F-INFENG/TR 22, Cambridge University Engineering Department.

Omohundro, S.M. 1987. Efficient Algorithms with Neural Network Behavior. Complex Systems 1, 273-347.

Pallett, D.S. 1986. A PCM/VCR Speech Database Exchange Format. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Tokyo, Japan, 317-320.
Parsons, T. 1986. Voice and Speech Processing. New York: McGraw-Hill.

Paul, D.B. 1987. A Speaker-Stress Resistant HMM Isolated Word Recognizer. ICASSP 87, 713-716.

Peeling, S.M. and R.K. Moore. 1987. Experiments in Isolated Digit Recognition Using the Multi-Layer Perceptron. Technical Report 4073, Royal Signals and Radar Establishment, Malvern, Worcester, Great Britain.

Peterson, Gordon E. and Harold L. Barney. 1952. Control Methods Used in a Study of Vowels. The Journal of the Acoustical Society of America 24, 175-184.

Pineda, F.J. 1987. Generalization of Back-Propagation to Recurrent Neural Networks. Physical Review Letters 59, 2229-2232.

Poggio, T. and C. Koch. 1987. Synapses that Compute Motion. Scientific American 256, 46-52.

Prager, R.W. and F. Fallside. 1987. A Comparison of the Boltzmann Machine and the Back Propagation Network as Recognizers of Static Speech Patterns. Computer Speech and Language 2, 179-183.

Prager, R.W., T.D. Harrison, and F. Fallside. 1986. Boltzmann Machines for Speech Recognition. Computer Speech and Language 1, 2-27.

Price, P., W.M. Fisher, J. Bernstein, and D.S. Pallett. 1988. The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York 1, 651-654.

Rabiner, L.R. and B.H. Juang. 1986. An Introduction to Hidden Markov Models. IEEE ASSP Magazine 3:1, 4-16.

Rabiner, Lawrence R. and Ronald W. Schafer. 1978. Digital Processing of Speech. New Jersey: Prentice-Hall.

Raffel, J., J. Mann, R. Berger, A. Soares, and S. Gilbert. 1987. A Generic Architecture for Wafer-Scale Neuromorphic Systems. In: 1st International Conference on Neural Networks, IEEE.

Robinson, A.J. and F. Fallside. 1988a. A Dynamic Connectionist Model for Phoneme Recognition. nEuro '88, Paris, France.

Robinson, A.J. and F. Fallside. 1988b. Static and Dynamic Error Propagation Networks with Application to Speech Coding. In: Neural Information Processing Systems, ed. D. Anderson, 632-641. New York: American Institute of Physics.

Rohwer, R. and B. Forrest. 1987. Training Time-Dependencies in Neural Networks. In: 1st International Conference on Neural Networks, IEEE, II-701.

Rossen, M.L., L.T. Niles, G.N. Tajchman, M.A. Bush, J.A. Anderson, and S.E. Blumstein. 1988. A Connectionist Model for Consonant-Vowel Syllable Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 59-66.

Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986a. Interactive Processes in Speech Perception: The TRACE Model. In: Parallel Distributed Processing: Vol. 2, Psychological and Biological Models, eds. D.E. Rumelhart and J.L. McClelland. Cambridge, MA: MIT Press.

Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986b. Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Vol. 1, Foundations. Cambridge, MA: MIT Press.

Sakoe, H. and K. Iso. 1987. Dynamic Neural Network - A New Speech Recognition Model Based on Dynamic Programming and Neural Network. IEICE Technical Report 87, NEC Corporation.
Seneff, S. 1988. A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing. Journal of Phonetics 16, 55-76.

Servan-Schreiber, D., A. Cleeremans, and J.L. McClelland. 1988. Encoding Sequential Structure in Simple Recurrent Networks. Technical Report CMU-CS-88-183, Carnegie Mellon University.

Shamma, S. 1988. The Acoustic Features of Speech Sounds in a Model of Auditory Processing: Vowels and Voiceless Fricatives. Journal of Phonetics 16, 77-91.

Shore, J.E. and D.K. Burton. 1983. Discrete Utterance Speech Recognition Without Time Alignment. IEEE Transactions on Information Theory IT-29, 473-491.

Sompolinsky, H. and I. Kanter. 1986. Temporal Association in Asymmetrical Neural Networks. Physical Review Letters 57, 2861-2864.

Stornetta, W.S., T. Hogg, and B.A. Huberman. 1988. A Dynamical Approach to Temporal Pattern Processing. In: Neural Information Processing Systems, ed. D. Anderson, 750-759. New York: American Institute of Physics.

Tank, D. and J.J. Hopfield. 1987. Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition Problems. In: 1st International Conference on Neural Networks, IEEE.

Tattersall, G.D., P.W. Linford, and R. Linggard. 1988. Neural Arrays for Speech Recognition. British Telecommunications Technology Journal 6, 140-163.

Unnikrishnan, K.P., J.J. Hopfield, and D.W. Tank. 1988. Learning Time-Delayed Connections in a Speech Recognition Circuit. Neural Networks for Computing Conference, Snowbird, Utah.

Waibel, Alex, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1987. Phoneme Recognition Using Time-Delay Neural Networks. Technical Report TR-1-0006, ATR Interpreting Telephony Research Laboratories, Japan. Scheduled to appear in the March 1989 issue of IEEE Transactions on Acoustics, Speech and Signal Processing.

Waibel, Alex, H. Sawai, and K. Shikano. 1988. Modularity and Scaling in Large Phonemic Neural Nets. Technical Report TR-1-0034, ATR Interpreting Telephony Research Laboratories, Japan.

Watrous, R.L. 1988. Speech Recognition Using Connectionist Networks. Ph.D. thesis, University of Pennsylvania.

Watrous, R.L. 1986. Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization. Technical Report MS-CIS-87-51, Linc Lab 72, University of Pennsylvania.

Watrous, R.L. and Lokendra Shastri. 1987. Learning Phonetic Features Using Connectionist Networks: An Experiment in Speech Recognition. In: 1st International Conference on Neural Networks, IEEE, IV-381.

Wong, M.K. and H.W. Chen. 1986. Toward a Massively Parallel System for Word Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, 37.4.1-37.4.4.
Received 10 November; accepted 14 November 1988
Communicated by Richard Lippmann
Modular Construction of Time-Delay Neural Networks for Speech Recognition

Alex Waibel

Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA, and ATR Interpreting Telephony Research Laboratories, Twin 21 MID Tower, Osaka 540, Japan
Several strategies are described that overcome limitations of basic network models as steps towards the design of large connectionist speech recognition systems. The two major areas of concern are the problem of time and the problem of scaling. Speech signals continuously vary over time and encode and transmit enormous amounts of human knowledge. To decode these signals, neural networks must be able to use appropriate representations of time, and it must be possible to extend these nets to almost arbitrary sizes and complexity within finite resources. The problem of time is addressed by the development of a Time-Delay Neural Network; the problem of scaling, by Modularity and Incremental Design of large nets based on smaller subcomponent nets. It is shown that small networks trained to perform limited tasks develop time-invariant, hidden abstractions that can subsequently be exploited to train larger, more complex nets efficiently. Using these techniques, phoneme recognition networks of increasing complexity can be constructed that all achieve superior recognition performance.

1 Introduction

Numerous studies have recently demonstrated powerful pattern recognition capabilities emerging from connectionist models or "artificial neural networks" (Rumelhart and McClelland 1986; Lippmann 1987). Most are trained on mere presentations of suitable sets of input/output training data pairs. Most commonly these networks learn to perform tasks by effective use of hidden units as intermediate abstractions or decisions in an attempt to create complex, non-linear decision functions. While these properties are indeed elegant and useful, they are, in their most simple form, not easily applicable to decoding human speech.
Neural Computation 1, 39-46 (1989)
@ 1989 Massachusetts Institute of Technology
2 Temporal Processing
One problem in speech recognition is the problem of time. A human speech signal is produced by moving the articulators towards target positions that characterize a particular sound. Since these articulatory motions are subject to physical constraints, they commonly don't reach clean, identifiable phonetic targets and hence describe trajectories or signatures rather than a sequence of well defined phonetic units. Properly representing and capturing the dynamic motion of such signatures, rather than trying to classify momentary snapshots of sounds, must therefore be a goal for suitable models of speech. Another consequence of the dynamic nature of speech is the general absence of any unambiguous acoustic cue that indicates when a particular sound occurs. As a solution to this problem, segmentation algorithms have been proposed that presegment the signal before classification is carried out. Segmentation, however, is an errorful classification problem in itself and, when in error, sets up subsequent recognition procedures for recognition failure. To overcome this problem, a suitable model of speech should instead simply scan the input for useful acoustic clues and base its overall decision on the sequence and co-occurrence of a sufficient set of detected lower level clues. This presumes the existence of translation-invariant feature detectors, i.e., detectors that recognize an acoustic event independent of its precise location in time.

A "Time Delay Neural Network" (TDNN) (Lang 1987; Waibel et al. 1987) possesses both of these properties. It consists of TDNN-units that, in addition to computing the weighted sum of their current input features, also consider the history of these features. This is done by introducing varying delays on each of the inputs and processing (weighting) each of these delayed versions of a feature with a separate weight. In this fashion each unit can learn the dynamic properties of a set of moving inputs. The second property, "translation invariance," is implemented by TDNN-units that scan an input token over time, in search of important local acoustic clues, instead of applying one large network to the entire input pattern. Translation-invariant learning in these units is achieved by forcing the network to develop useful hidden units regardless of position in the utterance. In our implementation this was done by linking the weights of time-shifted instantiations of the net during a scan through the input token (thus removing relative timing information). Figure 1 illustrates a TDNN trained to perform the discrimination task between the voiced stop consonants /b,d,g/ (see Waibel et al. 1987 for a more detailed description of its operation).
Figure 1: The TDNN architecture (input: "BA"). Eight hidden units in hidden layer 1 are fully interconnected with a set of 16 spectral coefficients and two delayed versions, illustrated by the window over 48 input units. Each of these eight units in hidden layer 1 produces patterns of activation as the window moves through the input speech (15 frames at a 10-msec frame rate). A five-frame window scanning these activation patterns over time then activates each of three units in hidden layer 2. These activations over time are in turn integrated into one single output decision. Note that the final decision is based on the combined acoustic evidence, independent of where in the given input interval (15 frames or 150 msecs) the /b/, /d/, or /g/ actually occurred.
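Because the same weights are applied to every time-shifted position, one TDNN layer amounts to a one-dimensional convolution over time followed by a nonlinearity. The sketch below mirrors the layer sizes in figure 1 (16 coefficients, a 3-frame window into 8 units, a 5-frame window into 3 units, and integration over time), but the random weights, the tanh nonlinearity, and all function names are illustrative assumptions rather than details of the trained nets.

```python
import numpy as np

def tdnn_layer(x, weights, bias):
    """One TDNN layer with time-tied weights: x is (T, F) -- T frames of
    F features -- and weights is (H, D, F) for H units, each looking at
    the current frame plus D-1 delayed frames.  The same weights apply
    at every shift, which gives the layer its time-shift invariance."""
    T, F = x.shape
    H, D, _ = weights.shape
    out = np.empty((T - D + 1, H))
    for t in range(T - D + 1):
        window = x[t:t + D]                   # current plus delayed frames
        out[t] = np.tanh(np.tensordot(weights, window,
                                      axes=([1, 2], [0, 1])) + bias)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((15, 16))             # 15 frames of 16 coefficients
h1 = tdnn_layer(x, 0.1 * rng.standard_normal((8, 3, 16)), np.zeros(8))
h2 = tdnn_layer(h1, 0.1 * rng.standard_normal((3, 5, 8)), np.zeros(3))
scores = h2.sum(axis=0)                       # integrate evidence over time
print(scores)                                 # one score per category
```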
The three-category TDNN shown here (Fig. 1) has been evaluated over a large number of phonetic tokens (/b,d,g/). These tokens were generated by extracting the 150-msec intervals around pertinent phonemes from a phonetically hand-labeled, large-vocabulary database of isolated Japanese utterances (Waibel et al. 1987). While isolated pronunciation provided relatively well articulated tokens, the data nevertheless included significant variability due to different phonetic contexts (e.g., "DO" vs. "DI") and position in the utterance. Recognition experiments with three different male speakers showed that discrimination scores between 97.5% and 99.1% could be obtained (all recognition scores in this paper were obtained from evaluation on test data that was not included in training). These scores compare favorably with those obtained using several standard implementations of Hidden Markov Model speech recognition algorithms (Waibel et al. 1987). To understand the operation of the TDNNs, the weights and activation patterns of trained /b,d,g/-nets have been extensively evaluated (Waibel et al. 1987). Several interesting properties were observed:

1. The TDNNs developed linguistically plausible features in the hidden units, such as movement detectors for first and second formants, segment boundary detectors, etc.

2. The TDNN developed alternate internal representations that can link quite different acoustic realizations to the same higher-level concept (here: a phoneme). This is possible due to the multilayer arrangement used.

3. The hidden units fire in synchrony with detected lower-layer events. These units therefore operate independently of precise time alignment or segmentation and could lead to translation-invariant phoneme recognition.
Our results suggest that the TDNN has most of the desirable properties needed for robust speech recognition performance.

3 The Problem of Scaling
Encouraged by the good performance and the desirable properties of the model, we wanted to extend TDNNs to the design of large-scale connectionist speech recognition systems. Some simple preliminary considerations, however, raise serious questions about the extendibility of connectionist design: Is it feasible, within limited resources and time, to build and train ever larger neural networks? Is it possible to add new knowledge to existing networks? With speech being one of the most complex and all-encompassing human cognitive abilities, this question of scaling must be addressed.

As a first step, let us consider the problem of extending the scope of our networks from tackling the three-category task of all voiced stops (/b,d,g/) to the task of dealing with all stop consonants (/b,d,g,p,t,k/). The first row in table 1 shows the recognition scores of two individually
trained three-category nets, one trained on the voiced stop consonant discrimination task (/b,d,g/) and the other on the voiceless stop consonant discrimination task (/p,t,k/).

Table 1: From /b,d,g/ to /b,d,g,p,t,k/; Modular Scaling Methods.

    Method                           bdg     ptk     bdgptk
    Individual TDNNs                 98.3%   98.7%
    TDNN: Max. Activation                            60.5%
    Retrain BDGPTK                                   98.3%
    Retrain Combined Higher Layers                   98.1%
    Retrain with V/UV-units                          98.4%
    Retrain with Glue                                98.4%
    All-net Fine Tuning                              98.6%

A naive attempt at combining these two nets by simply choosing the maximally activated output unit from the two separately trained nets resulted in failure, as seen by the low recognition score (60.5%) in the second row. This is to be expected, since neither network was trained using other phonetic categories, and independent output decisions minimize the error for only small subsets of the task. A larger network (the /b,d,g,p,t,k/-net) with six output units was therefore trained. Twenty hidden units (instead of eight) were used in hidden layer 1, and six in hidden layer 2. Good performance could now be achieved (98.3%), but significantly more processing had to be expended to train this larger net. While task size was only doubled, the number of connections to be trained actually tripled. To make matters worse, more training data is generally needed to achieve good generalization in larger networks, and the search complexity in a higher dimensional weight space increases dramatically as well. Even without increasing the number of training tokens in proportion to the number of connections, the complete /b,d,g,p,t,k/-net training run still required 18 days on a 4-processor Alliant supermini and had to be restarted several times before an acceptable solution had been found. The original /b,d,g/-net, by comparison, took only three days. It is clear that learning time increases more than linearly with task size, not to mention practical limitations such as available training data and computational capabilities. In summary, the dilemma between performance and resource limitations must
be addressed if neural networks are to be applied to large real-world tasks. Our proposed solutions are based on three observations:

1. Networks trained to perform a smaller task may not produce outputs that are useful for solving a more complex task, but the knowledge and internal abstractions developed in the process may indeed be valuable.

2. Learning complex concepts in (developmental) stages based on previously learned knowledge is a plausible model of human learning and should be applied in connectionist systems.

3. To increase competence, connectionist learning strategies should build on existing distributed knowledge rather than trying to undo, ignore, or relearn such knowledge.
Four experiments have been performed:

1. The previously learned hidden abstractions from the first layer of a /b,d,g/-net and a /p,t,k/-net were frozen by keeping their connections to the input fixed. Only connections from these two hidden layers 1 to a combined hidden layer 2 and to the output layer were retrained. While only modest (a few hours of) additional training was necessary at the higher layers, the recognition performance (98.1%) was found to be almost as good as for the monolithically trained /b,d,g,p,t,k/-net (see table 1). The small difference in performance might have been caused by the absence of features needed to merge the two subnets (here, for example, the voicing feature distinguishing voiced from voiceless stops).

2. Hidden features from hidden layer 1 are fixed as in the previous experiment, but four additional class-distinctive features are incorporated at the first hidden layer. These four units were excised from a net that was exclusively trained to perform voiced/unvoiced discrimination. The voiced/unvoiced net could be trained in little more than one day, and combination training at the higher layers was accomplished in a few hours. A high recognition rate of 98.4% was achieved.

3. The hidden units from hidden layer 1 are fixed as before, and four additional free units are incorporated. These free units are called connectionist glue, since they are intended to fit or glue together two distinct, previously trained nets. This network is shown in figure 2. The four glue units can be seen to have free connections to the input that are trained along with the higher layer combinations. In this fashion they can discover additional features that are needed to accurately perform the larger task. In addition to training the original /b,d,g/- and /p,t,k/-nets, combination training using glue
units was accomplished in two days. The resulting net achieved a recognition rate of 98.4%.
4. All-net fine tuning was performed on the previous network. Here, all connections of the entire net were freed once again for several hours of learning, to perform small additional weight adjustments. While each of these learning iterations was indeed very slow, only a few iterations were needed to fine-tune the entire network for a best performance of 98.6%.

Only modest additional training beyond that required to train the subcomponent nets was necessary in these experiments. Performance, however, was as good as or better than that provided by a monolithically trained net and as high as the performance of the original smaller subcomponent nets.
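In software terms, the staged training used in these experiments amounts to masking weight updates: first-layer rows copied from the pretrained subnets are frozen while the glue rows and all higher layers stay free, and the mask is cleared for the final all-net fine tuning. The sketch below shows only that masking idea; the shapes, learning rate, and random stand-ins for pretrained weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_glue = 48, 4

# First-layer weight rows: 8 units each taken from pretrained /b,d,g/ and
# /p,t,k/ subnets (random stand-ins here), plus 4 free "glue" units.
W1 = np.vstack([rng.standard_normal((8, n_in)),  # from /b,d,g/-net: frozen
                rng.standard_normal((8, n_in)),  # from /p,t,k/-net: frozen
                np.zeros((n_glue, n_in))])       # connectionist glue: free
frozen = np.zeros(W1.shape[0], dtype=bool)
frozen[:16] = True                               # freeze copied abstractions

def apply_gradient(W1, grad_W1, lr=0.01):
    """Update only the rows that are not frozen.  For the final all-net
    fine tuning stage, the mask would simply be cleared so that every
    connection receives small additional adjustments."""
    W1[~frozen] -= lr * grad_W1[~frozen]
    return W1

W1 = apply_gradient(W1, rng.standard_normal(W1.shape))  # one sample step
```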
Figure 2: Combination of a /b,d,g/-net and a /p,t,k/-net using 4 additional units in hidden layer 1 as free "Connectionist Glue." (Figure labels: input layer; hidden layer 1 containing the BDG units, the PTK units, and the free glue units; hidden layer 2; output layer.)
4 Conclusion
We have described connectionist networks with delays that can represent the dynamic nature of speech, and demonstrated techniques to scale these networks up in size for increasingly large recognition tasks. Our results suggest that it is possible to train larger neural nets in a modular, incremental fashion from smaller subcomponent nets without loss in recognition performance. These techniques have been applied successfully to the design of neural networks capable of discriminating all consonants in spoken isolated utterances (Waibel et al. 1988). With recognition rates of 96%, these nets were found to compare very favorably (Waibel et al. 1988) with competing recognition techniques in use today.

Acknowledgments

The author would like to acknowledge his collaborators: Toshiyuki Hanazawa, Geoffrey Hinton, Kevin Lang, Hidefumi Sawai, and Kiyohiro Shikano, for their contributions to the work described in this paper. This research would also not have been possible without the enthusiastic support of Akira Kurematsu, President of ATR Interpreting Telephony Research Laboratories.

References

Lang, K. 1987. Connectionist Speech Recognition. Ph.D. thesis proposal, Carnegie Mellon University.

Lippmann, R.P. 1987. An Introduction to Computing with Neural Nets. IEEE ASSP Magazine 4:2, 4-22.

Rumelhart, D.E. and J.L. McClelland. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.

Waibel, A., T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1987. Phoneme Recognition Using Time-Delay Neural Networks. Technical Report TR-1-0006, ATR Interpreting Telephony Research Laboratories. Also scheduled to appear in IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989.

Waibel, A., H. Sawai, and K. Shikano. 1988. Modularity and Scaling in Large Phonemic Neural Networks. Technical Report TR-1-0034, ATR Interpreting Telephony Research Laboratories; IEEE Transactions on Acoustics, Speech and Signal Processing, to appear.
Received 6 November; accepted 8 December 1988
Communicated by John Wyatt
A Silicon Model of Auditory Localization

John Lazzaro and Carver A. Mead

Department of Computer Science, California Institute of Technology, MS 256-80, Pasadena, CA 91125, USA
The barn owl accurately localizes sounds in the azimuthal plane, using interaural time difference as a cue. The time-coding pathway in the owl's brainstem encodes a neural map of azimuth, by processing interaural timing information. We have built a silicon model of the time-coding pathway of the owl. The integrated circuit models the structure as well as the function of the pathway; most subcircuits in the chip have an anatomical correlate. The chip computes all outputs in real time, using analog, continuous-time processing.
1 Introduction

The principles of organization of neural systems arose from the combination of the performance requirements for survival and the physics of neural elements. From this perspective, the extraction of time-domain information from auditory data is a challenging computation; the system must detect changes in the data which occur in tens of microseconds, using neurons which can fire only once per several milliseconds. Neural approaches to this problem succeed by closely coupling algorithms and implementation. We are building silicon models of the auditory localization system of the barn owl, to explore the general computational principles of time-domain processing in neural systems.

The barn owl (Tyto alba) uses hearing to locate and catch small rodents in total darkness. The owl localizes the rustles of the prey to within one to two degrees in azimuth and elevation (Knudsen et al. 1979). The owl uses different binaural cues to determine azimuth and elevation. The elevational cue for the owl is interaural intensity difference (IID). This cue is a result of a vertical asymmetry in the placement of the owl's ear openings, as well as a slight asymmetry in the left and right halves of the owl's facial ruff (Knudsen and Konishi 1979). The azimuthal cue is interaural time difference (ITD). The ITDs are in the microsecond range, and vary as a function of azimuthal angle of the sound source (Moiseff and Konishi 1981). The external nucleus of the owl's inferior colliculus (ICx) contains the neural substrate of sound localization, a map of auditory space (Knudsen and Konishi 1978).
Neural Computation 1, 47-57 (1989)
@ 1989 Massachusetts Institute of Technology
Neurons in the ICx respond maximally to stimuli located in a small area in space, corresponding to a specific combination of IID and ITD. There are several stages of neural processing between the cochlea and the computed map of space in the ICx. Each primary auditory fiber initially divides into two distinct pathways. One pathway processes intensity information, encoding elevation cues, whereas the other pathway processes timing information, encoding azimuthal cues. The time-coding and intensity-coding pathways recombine in the ICx, producing a complete map of space (Takahashi and Konishi 1988).

2 A Silicon Model of the Time-Coding Pathway
We have built an integrated circuit that models the time-coding pathway of the barn owl, using analog, continuous-time processing. Figure 1 shows the floorplan of the chip. The chip receives two inputs, corresponding to the sound pressure at each ear of the owl. Each input connects to a silicon model of the cochlea, the organ that converts the sound energy present at the eardrum into the first neural representation of the auditory system. In the cochlea, sound is coupled into a traveling-wave structure, the basilar membrane, which converts time-domain information into spatially-encoded information, by spreading out signals in space according to their time scale (or frequency). The cochlea circuit is a one-dimensional physical model of this traveling-wave structure; in engineering terms, the model is a cascade of second-order sections, with exponentially scaled time constants (Lyon and Mead 1988).

In the owl, inner hair cells contact the basilar membrane at discrete intervals, converting basilar-membrane movement into a graded, half-wave rectified electrical signal. Spiral ganglion neurons connect to each inner hair cell, producing action potentials in response to inner-hair-cell electrical activity. The temporal pattern of action potentials encodes the shape of the sound waveform at each basilar-membrane position. Spiral ganglion neurons also reflect the properties of the cochlea; a spiral ganglion neuron is most sensitive to tones of a specific frequency, the neuron's characteristic frequency. In our chip, inner hair cell circuits connect to taps at discrete intervals along the basilar-membrane model. These circuits compute signal processing operations (half-wave rectification and nonlinear amplitude compression) that occur during inner hair cell transduction. Each inner hair cell circuit connects to a spiral ganglion neuron circuit. This integrate-to-threshold neuron circuit converts the analog output of the inner-hair-cell model into fixed-width, fixed-height pulses. Timing information is preserved by greatly increasing the probability of firing near the zero crossings of the derivative of the neuron's input.

In the owl, the spiral ganglion neurons project to the nucleus magnocellularis (NM), the first nucleus of the time-coding pathway.
Figure 1: Floorplan of the silicon model of the time-coding pathway of the owl. Sounds for the left ear and right ear enter the respective silicon cochleas at the lower left and lower right of the figure. Inner hair cell circuits tap each silicon cochlea at 62 equally-spaced locations; each inner hair cell circuit connects directly to a spiral ganglion neuron circuit. The square box marked with a pulse represents both the inner hair cell circuit and spiral ganglion neuron circuit. Each spiral ganglion neuron circuit generates action potentials; these signals travel down silicon axons, which propagate from left to right for spiral ganglion neuron circuits from the left cochlea, and from right to left for spiral ganglion circuits from the right cochlea. The rows of small rectangular boxes, marked with the symbol Δt, represent the silicon axons. 170 NL neuron circuits, represented by small circles, lie between each pair of antiparallel silicon axons. Each NL neuron circuit connects directly to both axons, and responds maximally when action potentials present in both axons reach that particular neuron at the same time. In this way, ITDs map into a neural place code. Each vertical wire which spans the array combines the response of all NL neuron circuits which correspond to a specific ITD. These 170 vertical wires form a temporally smoothed map of ITD, which responds to a wide range of input sound frequencies. The nonlinear inhibition circuit near the bottom of the figure increases the selectivity of this map. The time-multiplexing scanner transforms this map into a signal suitable for display on an oscilloscope.
In the owl, the spiral ganglion neurons project to the nucleus magnocellularis (NM), the first nucleus of the time-coding pathway. The NM acts as a specialized relay station; neurons in the NM preserve timing information, and project bilaterally to the nucleus laminaris (NL), the first nucleus in the time-coding pathway that receives inputs from both ears. For simplicity, our chip does not model the NM; each spiral ganglion neuron circuit directly connects to a silicon NL.

Neurons in the NL are most sensitive to binaural sounds with a specific ITD. In 1948, Jeffress proposed a model to explain the encoding of ITD in neural circuits (Jeffress 1948). In the Jeffress model applied to the owl, axons from the ipsilateral and contralateral NM, with similar characteristic frequencies, enter the NL from opposite surfaces. The axons travel antiparallel, and action potentials counterpropagate across the NL; the axons act as neural delay lines. NL neurons are adjacent to both axons. Each NL neuron receives synaptic connections from both axons, and fires maximally when action potentials present in both axons reach that particular neuron at the same time. In this way, ITD is mapped into a neural place coding; the ITD that maximally excites an NL neuron depends on the position of the neuron in the NL. Anatomical and physiological evidence in the barn owl supports this theory (Carr and Konishi 1988).

The chip models the anatomy of the NL directly (Fig. 1). Two silicon cochleas lie at opposite ends of the chip; spiral ganglion neuron circuits from each cochlea, with similar characteristic frequencies, project to separate axon circuits, which travel antiparallel across the chip. The axon circuit is a discrete neural delay line; for each action potential at the axon's input, a fixed-width, fixed-height pulse travels through the axon, section by section, at a controllable velocity (Mead 1989). NL neuron circuits lie between each pair of antiparallel axons at every discrete section, and connect directly to both axons. Simultaneous action potentials at both inputs excite the NL neuron circuit; if only one input is active, the neuron generates no output. For each pair of antiparallel axons, there is a row of 170 NL neuron circuits across the chip. These neurons form a place encoding of ITD.

Our silicon NL differs from the owl's NL in several ways. The silicon NL neurons are perfect coincidence detectors; in the owl, NL neurons also respond, with reduced intensity, to monaural input. In the owl, many axons from each side converge on an NL neuron; in the chip, only two silicon axons converge on each silicon NL neuron. Finally, the brainstem of the owl contains two NLs, symmetric about the midline; each NL primarily encodes one half of the azimuthal plane. For simplicity, our integrated circuit has only one copy of the NL, which encodes all azimuthal angles.

In the owl, the NL projects to a subdivision of the central nucleus of the inferior colliculus (ICc), which in turn projects to the ICx. The ICx integrates information from the time-coding pathway and from the amplitude-coding pathway to produce a complete map of auditory space. The final output of our integrated circuit models the responses of ICx neurons to ITDs.
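The Jeffress place code can be illustrated with a short simulation. This sketch is our own idealization rather than the chip: it quantizes time to one step per axon section and uses perfect multiplicative coincidence detection; the function name and all parameter values are hypothetical.

```python
import numpy as np

def jeffress_place_code(left_spikes, right_spikes, n_sections=170):
    """Place-code ITD by counter-propagating two spike trains along
    antiparallel delay lines and detecting coincidences at each section.
    left_spikes, right_spikes: binary arrays (1 = action potential);
    one sample corresponds to one axon-section delay."""
    T = len(left_spikes)
    coincidences = np.zeros(n_sections)
    for k in range(n_sections):
        # Left signal delayed by k sections, right by (n_sections - 1 - k):
        # the two counter-propagating waves meet at section k.
        dl, dr = k, n_sections - 1 - k
        l = np.concatenate([np.zeros(dl), left_spikes])[:T]
        r = np.concatenate([np.zeros(dr), right_spikes])[:T]
        # Ideal coincidence detector: respond only when both inputs fire.
        coincidences[k] = np.sum(l * r)
    return coincidences   # the peak position encodes the ITD

# Example: periodic clicks at 475 Hz with a 3-sample interaural delay.
fs_sections = 10000                       # one sample per section delay
t = np.arange(2000)
left = (t % (fs_sections // 475) == 0).astype(float)
right = np.roll(left, 3)                  # right ear lags by 3 sections
itd_map = jeffress_place_code(left, right)
print(int(np.argmax(itd_map)))            # peak section encodes the 3-sample ITD
```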
In response to ITDs, ICx neurons act differently from NL neurons. Experiments suggest mechanisms for these differences; our integrated circuit implements several of these mechanisms to produce a neural map of ITD. Neurons in the NL and ICc respond to all ITDs that result in the same interaural phase difference (IPD) of the neuron's characteristic frequency; neurons in the ICx respond to only the one true ITD. This behavior suggests that ICx neurons combine information from many frequency channels in the ICc, to disambiguate ITDs from IPDs; indeed, neurons in the NL and ICc reflect the frequency characteristics of spiral ganglion neurons, whereas ICx neurons respond equally to a wide range of frequencies. In our chip, all NL neuron outputs corresponding to a particular ITD are summed to produce a single output value. NL neuron outputs are current pulses; a single wire acts as a dendritic tree to perform the summation. In this way, a two-dimensional matrix of NL neurons reduces to a single vector; this vector is a map of ITD, for all frequencies. In the owl, inhibitory circuits between neurons tuned to the same ITD may also be present, before summation across frequency channels. Our model does not include these circuits.

Neurons in the ICc are more selective to ITDs than are neurons in the NL; in turn, ICx neurons are more selective to ITDs than are ICc neurons, for low frequency sounds. At least two separate mechanisms join to increase selectivity. The selectivity of ICc and ICx neurons increases with the duration of a sound, for sounds lasting less than 5 milliseconds, implying that the ICc and perhaps the ICx may use temporal integration to increase selectivity (Wagner and Konishi, in preparation). Our chip temporally integrates the vector that represents ITD; the time constant of integration is adjustable. Nonlinear inhibitory connections between neurons tuned to different ITDs in the ICc and ICx also increase sensitivity to ITDs; application of an inhibitory blocker to either the ICc or ICx decreases sensitivity to ITD (Fujita and Konishi, in preparation). In our chip, a global shunting inhibition circuit (Lazzaro et al. 1988) processes the temporally integrated vector that represents ITD. This nonlinear circuit performs a winner-take-all function, producing a more selective map of ITD. The chip time-multiplexes this output map on a single wire for display on an oscilloscope.
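A numerical stand-in for these three operations, applied to simulated NL outputs, might look like the following. The array shapes, the time constants, and the softmax used in place of the analog shunting winner-take-all circuit of Lazzaro et al. (1988) are all our assumptions for illustration.

```python
import numpy as np

def icx_postprocess(nl_outputs, dt=1e-4, tau=5e-3, wta_gain=50.0):
    """nl_outputs: array (n_freq_rows, n_itd_positions, n_timesteps) of
    NL coincidence activity.  Returns a sharpened map of ITD."""
    # 1. Integrate across frequency channels: each vertical wire sums one
    #    ITD position over all rows (62 in the chip).
    itd_map = nl_outputs.sum(axis=0)              # (n_positions, n_t)

    # 2. Temporal integration with an adjustable time constant.
    alpha = dt / tau
    acc = np.zeros(itd_map.shape[0])
    for t in range(itd_map.shape[1]):
        acc += alpha * (itd_map[:, t] - acc)      # leaky integrator

    # 3. Winner-take-all style sharpening: a softmax stand-in for the
    #    global shunting-inhibition circuit.
    w = np.exp(wta_gain * (acc - acc.max()))
    return w / w.sum()
```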
3 Comparison of Responses

We presented periodic click stimuli to the chip (Fig. 2), and recorded the final output of the chip, a map of ITD. Three signal-processing operations, computed in the ICx and ICc of the owl, improve the original encoding of ITDs in the NL: temporal integration, integration of information over many frequency channels, and inhibition among neurons tuned to different ITDs.
Figure 2: Input stimulus for the chip. Both left and right ears receive a periodic click waveform, at a frequency of 475 Hz. The time delay between the two signals, notated as δt, is variable.

In our chip, we can disable the inhibition and temporal-integration operations, and observe the unprocessed map of ITD (Fig. 3). By combining the outputs of 62 rows of NL neurons, each tuned to a separate frequency region, the maps in figure 3 correctly encode ITD, despite random variations in axonal velocity and cochlear delay. Figure 5 shows this variation in velocity of axonal propagation, due to circuit element imperfections. Figure 4 shows maps of ITD taken with the inhibition and temporal integration operations enabled. Most maps show a single peak, with little activity at other positions. Figure 6a is an alternative representation of the map of ITD computed by the chip. We recorded the map position of the neuron with maximum signal energy, for different ITDs. Carr and Konishi (1988) performed a similar experiment in the owl's NL (Fig. 6b), mapping the time delay of an axon innervating the NL, as a function of position in the NL. The linear properties of our chip map are the same as those of the owl map.
4 Conclusions

Traditionally, scientists have considered analog integrated circuits and neural systems to be two disjoint disciplines. The two media are different in detail, but the physics of computation in silicon technology and in neural technology are remarkably similar. Both media offer a rich palette of primitives in which to build a structure; both pack a large number of imperfect computational elements into a small space; both are ultimately limited not by the density of devices, but by the density of interconnect. Modeling neural systems directly in a physical medium subjects the researcher to many of the same pressures faced by the nervous system over the course of evolutionary time.

We have built a 220,000-transistor chip that models, to a first approximation, a small but significant part of a spectacular neural system. In doing so we have faced many design problems solved by the nervous system. This experience has forced us to a high level of concreteness in specifying this demanding computation. This chip represents only the first few stages of auditory processing, and thus is only a first step in auditory modeling. Each individual circuit in the chip is only a first approximation to its physiological counterpart. In addition, there are other auditory pathways to explore: the intensity-coding localization pathway, the elevation localization pathway in mammals, and, most formidably, the sound-understanding structures that receive input from these pathways.
Figure 3: Map of ITD, taken from the chip. The nonlinear inhibition and temporal smoothing operations were turned off, showing the unprocessed map of ITD. The vertical axis of each map corresponds to neural activity level, whereas the horizontal axis of each map corresponds to linear position within the map. The stimulus for each plot is the periodic click waveform of Figure 2; δt is shown in the upper left corner of each plot, measured in milliseconds. Each map is an average of several maps recorded at 100 millisecond intervals; averaging is necessary to capture a representation of the quickly changing, temporally unsmoothed response. The encoding of ITD is present in the maps, but false correlations add unwanted noise to the desired signal. Since we are using a periodic stimulus, large time delays are interpreted as negative delays, and the map response wraps from one side to the other at an ITD of 1.2 milliseconds.
Figure 4: Map of ITD, taken from the chip. The nonlinear inhibition and temporal smoothing operations were turned on, showing the final output map of ITD. Format is identical to Figure 3. Off-chip averaging was not used, since the chip temporally smooths the data. Most maps show a single peak, with little activity at other positions, due to nonlinear inhibition. The maps do not reflect the periodicity of the individual frequency components of the sound stimulus; additional experiments with a noise stimulus confirm the phase-disambiguation property of the chip.
Figure 5: Variation in the pulse width of a silicon axon, over about 100 axonal sections. Axons were set to fire at a slower velocity than in the owl model, for more accurate measurement. In this circuit, a variation in axon pulse width indicates a variation in the velocity of axonal propagation; this variation is a potential source of localization error.
[Figure 6 plots: (a) position of maximum neural activity versus interaural time difference (µs); (b) depth in NL (µm) versus axonal time delay (µs).]
Figure 6: (a) Chip data showing the linear relationship of silicon NL neuron position and ITD. For each ITD presented to the chip, the output map position with the maximal response is plotted. The linearity shows that silicon axons have a uniform mean time delay per section. (b) Recordings of the NM axons innervating the NL in the barn owl (Carr and Konishi 1988). The figure shows the mean time delays of contralateral fibers recorded at different depths during one penetration through the 7 kHz region.
We thank M. Konishi and his entire research group, in particular S. Volman, I. Fujita, and L. Proctor, as well as D. Lyon, M. Mahowald, T. Delbruck, L. Dupré, J. Tanaka, and D. Gillespie, for critically reading and correcting the manuscript, and for consultation throughout the project.
We thank Hewlett-Packard for computing support, and DARPA and MOSIS for chip fabrication. This work was sponsored by the Office of Naval Research and the System Development Foundation.
References

Carr, C.E. and M. Konishi. 1988. Axonal Delay Lines for Time Measurement in the Owl's Brainstem. Proc. Nat. Acad. Sci. 85, 8311-8315.

Fujita, I. and M. Konishi. In preparation.

Jeffress, L.A. 1948. A Place Theory of Sound Localization. J. Comp. Physiol. Psychol. 41, 35-39.

Knudsen, E.I., G.G. Blasdel, and M. Konishi. 1979. Sound Localization by the Barn Owl Measured with the Search Coil Technique. J. Comp. Physiol. 133, 1-11.

Knudsen, E.I. and M. Konishi. 1979. Mechanisms of Sound Localization in the Barn Owl (Tyto alba). J. Comp. Physiol. 133, 13-21.

Knudsen, E.I. and M. Konishi. 1978. A Neural Map of Auditory Space in the Owl. Science 200, 795-797.

Lazzaro, J.P., S. Ryckebusch, M.A. Mahowald, and C.A. Mead. 1988. Winner-Take-All Networks of O(n) Complexity. Proc. IEEE Conf. Neural Information Processing Systems, Denver, CO.

Lyon, R.F. and C. Mead. 1988. An Analog Electronic Cochlea. IEEE Trans. Acoust., Speech, Signal Processing 36, 1119-1134.

Mead, C.A. 1989. Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.

Moiseff, A. and M. Konishi. 1981. Neuronal and behavioral sensitivity to binaural time differences in the owl. J. Neurosci. 1, 40-48.

Takahashi, T.T. and M. Konishi. 1988. Projections of the Nucleus Angularis and Nucleus Laminaris to the Lateral Lemniscal Nuclear Complex of the Barn Owl. J. Comp. Neurol. 274, 221-238.

Wagner, H. and M. Konishi. In preparation.
Received 26 October; accepted 9 November 1988.
Communicated by Carver Mead
Criteria for Robust Stability In A Class Of Lateral Inhibition Networks Coupled Through Resistive Grids

John L. Wyatt, Jr. and David L. Standley
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
In the analog VLSI implementation of neural systems, it is sometimes convenient to build lateral inhibition networks by using a locally connected on-chip resistive grid to interconnect active elements. A serious problem of unwanted spontaneous oscillation often arises with these circuits and renders them unusable in practice. This paper reports on criteria that guarantee these and certain other systems will be stable, even though the values of designed elements in the resistive grid may be imprecise and the location and values of parasitic elements may be unknown. The method is based on a rigorous, somewhat novel mathematical analysis using Tellegen's theorem (Penfield et al. 1970) from electrical circuits and the idea of a Popov multiplier (Vidyasagar 1978; Desoer and Vidyasagar 1975) from control theory. The criteria are local in that no overall analysis of the interconnected system is required for their use, empirical in that they involve only measurable frequency response data on the individual cells, and robust in that they are insensitive to network topology and to unmodelled parasitic resistances and capacitances in the interconnect network. Certain results are robust in the additional sense that specified nonlinear elements in the grid do not affect the stability criteria. The results are designed to be applicable, with further development, to complex and incompletely modelled living neural systems.

1 Introduction

In the VLSI implementation of lateral inhibition and certain other types of networks, active cells are locally interconnected through an on-chip resistive grid. Linear resistors fabricated in, e.g., polysilicon, could yield a very compact realization, and nonlinear resistive grids, made from MOS transistors, have been found useful for image segmentation (Hutchinson et al. 1988). Networks of this type can be divided into two classes: feedback systems and feedforward-only systems. In the feedforward case
one set of amplifiers imposes signal voltages or currents on the grid and another set reads out the resulting response for subsequent processing, while the same amplifiers both "write to" the grid and "read from" it in a feedback arrangement. Feedforward networks of this type are inherently stable, but feedback networks need not be. A practical example is one of Mahowald and Mead's retina chips (Mead and Mahowald 1988; Mead 1988) that achieve edge enhancement by means of lateral inhibition through a resistive grid. Figure 1a shows a single cell in an earlier version of this chip, and figure 1b illustrates the network of interconnected cells. Experiment has shown that the individual cells in this system are open-circuit stable and remain stable when the output of amplifier #2 is connected to a voltage source through a resistor, but the interconnected system oscillates so badly that the earlier design is scarcely usable (Mahowald and Mead 1988). (The later design reported in Mead and Mahowald 1988 avoids stability problems altogether, at a small cost in performance, by redesigning the circuits to passively sense the grid voltage in a "feedforward" style as described above.) Such oscillations can readily occur in most resistive grid circuits with active elements and feedback, even when each individual cell is quite stable. Analysis of the conditions of instability by conventional methods appears hopeless, since the number of simultaneously active feedback loops is enormous.

This paper reports a practical design approach that rigorously guarantees such a system will be stable if the active cells meet certain criteria. The work begins with the naïve observation that the system would be stable if we could design each individual cell so that, although internally active, it acts like a passive system as seen from the resistive grid. The design goal in that case would be that each cell's output impedance should be a positive-real (Vidyasagar 1978; Desoer and Vidyasagar 1975; Anderson and Vongpanitlerd 1973) function. This is sometimes possible in practice; we will show that the original network in figure 1a would satisfy this condition in the absence of certain parasitic elements. Furthermore, it is a condition one can verify experimentally by frequency-response measurements. It is obvious that a collection of cells that appear passive at their terminals will form a stable system when interconnected through a passive medium such as a resistive grid, and that the stability of such a system is robust to perturbations by passive parasitic elements in the network. The work reported here goes beyond that observation to provide (i) a demonstration that the passivity or positive-real condition is much stronger than we actually need and that weaker conditions, more easily achieved in practice, suffice to guarantee robust stability of the linear active network model, and (ii) an extension of the analysis to the nonlinear domain that furthermore rules out sustained large-signal oscillations under certain conditions. A key feature of the integrated circuit environment that makes these results applicable is the almost total absence of on-chip inductance. While the cells can appear inductive, as in figure 3c, the absence of inductance in our grid models makes these theorems possible.
Figure 1: (a) This photoreceptor and signal processor circuit, using two MOS amplifiers, realizes spatial lateral inhibition and temporal sharpening by communicating with similar cells through a resistive grid. The resistors will often be nonlinear by design. (b) Interconnection of cells through a hexagonal resistive grid. Cells are drawn as 2-terminal elements with the power supply and signal output lines suppressed. The voltage on the capacitor in any given cell is affected both by the local light intensity incident on that cell and by the capacitor voltages on neighboring cells of identical design. The necessary ingredients for instability - active elements and signal feedback - are both present in this system. (c) Grid resistors with a nonlinear characteristic of the form i = tanh(v) can be useful in image segmentation (Hutchinson et al. 1988).
Note that these results do not apply directly to networks created by interconnecting neuron-like elements, as conventionally described in the literature on artificial neural systems. The "neurons" in, e.g., a Hopfield network (Hopfield 1984) are unilateral 2-port elements in which the input and output are both voltage signals. The input voltage uniquely and instantaneously determines the output voltage of such a neuron model, but the output can only affect the input via the resistive grid. In contrast, the cells in our system are 1-port electrical elements (temporarily ignoring the optical input channel) in which the port voltage and port current are the two relevant signals, and each signal affects the other through the cell's internal dynamics (modeled as a Thévenin equivalent impedance) as well as through the grid's response.

It is apparent that uncontrolled spontaneous oscillation is a potential problem in living neural systems, which typically also consist of active elements arranged in feedback loops. Biological systems have surely solved the same problem we attack in this paper. It is reasonable to believe that stability has strongly constrained the set of network configurations nature has produced. Whatever Nature's solutions may be, we suspect they have at least three features in common with the ones proposed here: (1) robustness in the face of wide component variation and the presence of parasitic network elements, (2) reliance on empirical data rather than anything we would recognize as a theory or analytic method, (3) stability strategies based on predominantly local information available to each network element. Several reports on this work have appeared and will appear (Wyatt and Standley 1988; Standley 1989; Standley and Wyatt 1989; 1988a; 1988b) during its development; a longer tutorial exposition will be given in the second printing of (Mead 1988).

2 The Linear Theory

2.1 Terminology. The output impedance of a linear system is a measure of the voltage response due to a change in output current while the input (light intensity in this case) is held constant. This standard electrical engineering concept will play a key role here. Figure 2a illustrates one experimental method for measuring the output impedance, and figure 2b is a standard graphical representation of an impedance, known as a Nyquist diagram. Similar plots have been used in experimental physiology (Cole 1932). In the context of this work, an impedance is said to be positive-real (Vidyasagar 1978; Desoer and Vidyasagar 1975; Anderson and Vongpanitlerd 1973) if it is stable (i.e., has no poles or zeroes in the right-half plane) and its Nyquist diagram lies entirely in the right-half plane (i.e., in the language of complex numbers, Re{Z(iω)} ≥ 0 for all purely sinusoidal frequencies ω). Figure 2b is an example, while the system represented in figure 4 is stable but not positive-real.

A deep link between positive-real functions, physical networks, and passivity is established by the classical result in linear circuit theory which states that H(s) is positive-real if and only if it is possible to synthesize a 2-terminal network of positive linear resistors, capacitors, inductors, and ideal transformers that has H(s) as its driving-point impedance (Anderson and Vongpanitlerd 1973).
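Given sampled frequency-response data, the Nyquist half of the positive-real condition reduces to a sign check. The snippet below is a sketch under our own conventions (the function name, tolerance, and the illustrative series-R, parallel-RL impedance are not from the paper); stability of Z itself must be established separately.

```python
import numpy as np

def is_positive_real_samples(Z, tol=0.0):
    """Check the measurable half of the positive-real condition on sampled
    data Z[k] = Z(i * w[k]): the Nyquist diagram must lie in the closed
    right-half plane.  (Pole/zero locations must be verified separately.)"""
    return bool(np.all(np.real(Z) >= -tol))

# Example: a resistor in series with a parallel resistor-inductor branch
# is passive, hence positive-real; values are illustrative only.
w = np.logspace(1, 7, 500)                  # rad/s
R1, R2, L = 1e5, 5e5, 1e-2
Z = R1 + (1j * w * L * R2) / (R2 + 1j * w * L)
print(is_positive_real_samples(Z))          # True
```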
Figure 2: (a) Simplified experimental measurement of the output impedance of a cell. A sinusoidal current i = A cos(ωt) is injected into the output and the voltage response v = B cos(ωt + φ) is measured. The impedance, which has magnitude B/A and phase φ, is typically treated as a complex number Z(iω) that depends on the frequency ω. (b) Example of the Nyquist diagram of an impedance. This is a plot in the complex plane of the value of the impedance, measured or calculated at purely sinusoidal frequencies, ranging from zero upward toward infinity. It is not essential to think of Nyquist diagrams as representing complex numbers: they are simply polar plots in which radius represents impedance magnitude and angle to the horizontal axis represents phase. The diagram shown here is the Nyquist plot of the positive-real impedance in equation (2.1).
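The measurement in figure 2a can be emulated numerically: correlate the recorded waveforms against a complex reference at ω to extract B/A and φ. The function below is our sketch; the names and the synthetic test values are illustrative.

```python
import numpy as np

def measure_impedance(i_t, v_t, w, t):
    """Estimate Z(iw) from one sinusoidal measurement: inject
    i(t) = A cos(wt), record v(t) = B cos(wt + phi), and extract the
    complex amplitudes by lock-in style correlation against exp(-iwt)."""
    ref = np.exp(-1j * w * t)
    I = 2 * np.mean(i_t * ref)    # complex amplitude A
    V = 2 * np.mean(v_t * ref)    # complex amplitude B * exp(i * phi)
    return V / I                  # magnitude B/A, phase phi

# Synthetic check: magnitude 2, phase 30 degrees.
t = np.linspace(0, 1, 100000)
w = 2 * np.pi * 50
i_t = 0.5 * np.cos(w * t)
v_t = 1.0 * np.cos(w * t + np.pi / 6)
Z = measure_impedance(i_t, v_t, w, t)
print(abs(Z), np.degrees(np.angle(Z)))      # approx. 2.0 and 30 degrees
```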
Figure 3: (a) Elementary model for an MOS amplifier. These amplifiers have a relatively high output resistance, which is determined by a bias setting (not shown). (b) Linearity allows this simplification of the network topology for the circuit in figure 1a without loss of information relevant to stability. The capacitor in figure 1a has been absorbed into the output capacitance of amp #2. (c) Passive network realization of the output impedance given in equation (2.1) for the network in (b).
Figure 4: Nyquist diagram of an impedance that satisfies the Popov criterion, defined as follows. A linear impedance Z(s) satisfies the Popov criterion if (1 + τs)Z(s) is positive-real for some τ > 0. The "Popov multiplier" (1 + τs) modifies the Nyquist diagram by stretching and rotating it counterclockwise for ω > 0. The impedance plotted here is active and thus is not positive-real, but the rotation due to the (1 + τs) term can make it positive-real for an appropriate value of τ. The Popov criterion is a condition on the linear elements that is weaker than passivity: active elements satisfying this criterion are shown to pose no danger of instability even when nonlinear resistors and capacitors are present in the grid.

This work was originally motivated by the following linear analysis of a model for the circuit in figure 1a. For an initial approximation to the output impedance of the cell we use the elementary model shown in figure 3a for the amplifiers and simplify the circuit topology within a single cell as shown in figure 3b.
Straightforward calculations show that the output impedance is given by
This is a positive-real impedance that could be realized by a passive network of the form shown in figure 3c, where
Of course this model is oversimplified, since the circuit does oscillate. Transistor parasitics and layout parasitics cause the output impedance of the individual active cells to deviate from the form given in equations (2.1) and (2.2), and any very accurate model will necessarily be quite high order. The following theorem shows how far one can relax the positive-real condition and still guarantee that the entire network is robustly stable. It obviously applies to a much wider range of linear networks than has been discussed here. A linear network is said to be stable if for any initial condition the transient response converges asymptotically to a constant.

Theorem 1. Consider the class of linear networks of arbitrary topology, consisting of any number of positive 2-terminal resistors and capacitors and of N lumped linear impedances Z_n(s), n = 1, 2, ..., N, that are open- and short-circuit stable in isolation, i.e., that have no poles or zeroes in the closed right-half plane. Every such network is stable if at each frequency ω ≥ 0 there exists a phase angle θ(ω) such that 0 ≥ θ(ω) ≥ -90° and |∠Z_n(iω) - θ(ω)| < 90°, n = 1, 2, ..., N.

An equivalent statement of this last condition is that the Nyquist plot of each cell's output impedance for ω ≥ 0 never intersects the second quadrant of the complex plane (figure 4 is an example), and that no two cells' output impedance phase angles can ever differ by as much as 180°. If all the active cells are designed identically and fabricated on the same chip, their phase angles should track fairly closely in practice, and thus this second condition is a natural one. The theorem is intuitively reasonable and serves as a practical design goal. The assumptions guarantee that the cells cannot resonate with one another at any purely sinusoidal frequency s = iω since their phase angles can never differ by as much as 180°, and they can never resonate with the resistors and capacitors since they can never appear simultaneously active and inductive at any sinusoidal frequency. A more advanced argument (Standley and Wyatt 1989) shows that exponentially growing instabilities are also ruled out.
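On sampled data, the condition of Theorem 1 can be checked directly: the existence of a θ(ω) in [-90°, 0] within 90° of every cell's phase is equivalent to requiring that all phases stay out of the second quadrant (and above -180°) and that the phase spread at each frequency is below 180°. The helper below is our sketch; it assumes phases already unwrapped into (-180°, 180°].

```python
import numpy as np

def satisfies_theorem_1(Z_cells):
    """Sampled-data check of the phase-sector condition of Theorem 1.
    Z_cells: complex array (N, K); cell n measured at frequency w[k] >= 0.
    Feasibility of theta(w) in [-90, 0] within 90 degrees of every phase
    reduces to: max phase < 90, min phase > -180, and spread < 180."""
    phases = np.degrees(np.angle(Z_cells))            # (N, K)
    lo, hi = phases.min(axis=0), phases.max(axis=0)   # per frequency
    no_second_quadrant = np.all(hi < 90.0) and np.all(lo > -180.0)
    spread_ok = np.all(hi - lo < 180.0)
    return bool(no_second_quadrant and spread_ok)
```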
3 Stability Result for Networks with Nonlinear Resistors and Capacitors

The previous results for linear networks can afford some limited insight into the behavior of nonlinear networks. If a linearized model is stable, then the equilibrium point of the original nonlinear network must be locally stable. But the result in this section, in contrast, applies to the full nonlinear circuit model and allows one to conclude that in certain circumstances the network cannot oscillate or otherwise fail to converge even if the initial state is arbitrarily far from the equilibrium point. Figure 4 introduces the Popov criterion, which is the basis of the following theorem. This is the first nonlinear result of its type that requires no assumptions on the network topology.
Theorem 2. Consider any network consisting of nonlinear resistors and capacitors and linear active cells with output impedances Z_n(s), n = 1, 2, ..., N. Suppose

(a) the nonlinear resistor and capacitor characteristics, i_j = g_j(v_j) and q_k = h_k(v_k), respectively, are monotone increasing continuously differentiable functions, and
(b) the impedances Z_n(s) all satisfy the Popov criterion for some common value of τ > 0.
Then the network is stable in the sense that, for any initial condition at t = 0, (3.1)
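Condition (b) is again checkable from frequency-response samples: search for a single τ > 0 that rotates every cell's Nyquist plot into the closed right-half plane. The search range and function name below are our assumptions.

```python
import numpy as np

def popov_tau(Z_cells, w, taus=np.logspace(-9, -3, 200)):
    """Search for a common tau > 0 such that (1 + i*w*tau) * Z_n(i*w) has
    nonnegative real part at every sampled frequency, for all cells -- the
    sampled-data version of condition (b) of Theorem 2.
    Z_cells: complex array (N, K); w: frequencies (K,) in rad/s."""
    for tau in taus:
        shifted = (1 + 1j * w * tau) * Z_cells   # broadcasts over (N, K)
        if np.all(np.real(shifted) >= 0.0):
            return tau                            # a common Popov multiplier
    return None                                   # no tau found in the range
```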
Acknowledgments We sincerely thank Professor Carver Mead of Cal Tech for encouraging this work, which was supported by Defense Advanced Research Projects Agency (DARPA) Contract No. N00014-87-K-0825 and National Science Foundation (NSF) Contract No. MIP-8814612.
References

Anderson, B.D.O. and S. Vongpanitlerd. 1973. Network Analysis and Synthesis - A Modern Systems Theory Approach. Englewood Cliffs, NJ: Prentice-Hall.

Cole, K.S. 1932. Electric Phase Angle of Cell Membranes. J. General Physiology 15, 641-649.
Desoer, C. and M. Vidyasagar. 1975. Feedback Systems: Input-Output Properties. New York: Academic Press.
Hopfield, J.J. 1984. Neurons with Graded Response have Collective Computational Properties Like Those of Two-state Neurons. Proc. Nat. Acad. Sci. 81, 3088-3092.

Hutchinson, J., C. Koch, J. Luo, and C. Mead. 1988. Computing Motion Using Analog and Binary Resistive Networks. Computer 21:3, 52-64.

Mahowald, M.A. and C.A. Mead. 1988. Personal communication.

Mead, C.A. 1988. Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.

Mead, C.A. and M.A. Mahowald. 1988. A Silicon Model of Early Visual Processing. Neural Networks 1:1, 91-97.

Penfield, P., Jr., R. Spence, and S. Duinker. 1970. Tellegen's Theorem and Electrical Networks. Cambridge, MA: MIT Press.

Standley, D.L. 1989. Design Criteria Extensions for Stable Lateral Inhibition Networks in the Presence of Circuit Parasitics. Proc. 1989 IEEE Int. Symp. on Circuits and Systems, Portland, OR.

Standley, D.L. and J.L. Wyatt, Jr. 1989. Stability Criterion for Lateral Inhibition Networks that is Robust in the Presence of Integrated Circuit Parasitics. IEEE Trans. Circuits and Systems, to appear.

Standley, D.L. and J.L. Wyatt, Jr. 1988a. Stability Theorem for Lateral Inhibition Networks that is Robust in the Presence of Circuit Parasitics. Proc. IEEE Int. Conf. on Neural Networks I, San Diego, CA, 27-36.

Standley, D.L. and J.L. Wyatt, Jr. 1988b. Circuit Design Criteria for Stability in a Class of Lateral Inhibition Neural Networks. Proc. IEEE Conf. on Decision and Control, Austin, TX, 328-333.

Vidyasagar, M. 1978. Nonlinear Systems Analysis. Englewood Cliffs, NJ: Prentice-Hall.

Wyatt, J.L., Jr. and D.L. Standley. 1988. A Method for the Design of Stable Lateral Inhibition Networks that is Robust in the Presence of Circuit Parasitics. In: Neural Information Processing Systems, ed. D. Anderson, American Institute of Physics, New York, 860-867.
Received 30 September; accepted 13 October 1988.
Communicated by David Mumford
Two Stages of Curve Detection Suggest Two Styles of Visual Computation

Steven W. Zucker† Allan Dobbins Lee Iverson
Computer Vision and Robotics Laboratory, McGill Research Centre for Intelligent Machines, McGill University, Montréal, Québec, Canada

†Senior Fellow, Canadian Institute for Advanced Research.
The problem of detecting curves in visual images arises in both computer vision and biological visual systems. Our approach integrates constraints from these two sources and suggests that there are two different stages to curve detection, the first resulting in a local description, and the second in a global one. Each stage involves a different style of computation: in the first stage, hypotheses are represented explicitly and coarsely in a fixed, preconfigured architecture; in the second stage, hypotheses are represented implicitly and more finely in a dynamically-constructed architecture. We also show how these stages could be related to physiology, specifying the earlier parts in a relatively fine-grained fashion and the later ones more coarsely.

1 Introduction
An extensive mythology has developed around curve detection. In extrapolating from orientation-selective neurons in the visual cortex (Hubel and Wiesel 1962), it is now widely held that curve detection is simply a matter of "integrating" the responses of these cells. More specifically, the mythology holds that this integration process is global, that the initial estimates are local, and that the relationship between them will become clear as a more detailed understanding of cortical circuitry is uncovered. However, this mythical process of "integration" has turned out to be elusive, the search for it has led, instead, to a series of dilemmas, and the quantity of physiological data is exploding. It is rarely clear how new details of cortical circuitry relate to different components of the curve detection problem. We believe that this situation is typical of vision in general, and amounts to ascribing too little function to the earlier stages, and too
much to the later ones. For curve detection, virtually all of the complexity is delegated to the process of "integration," so it is not surprising that successful approaches have remained elusive. Part of the problem is that models of integrative processes have been rich in selected detail, but poor in abstract function. In the sense that it is often useful to see the forest before the trees, we submit that solutions will likely be found by considering both coarse-grained and fine-grained models, and that such models will suggest a partitioning of function whose abstraction varies with granularity. To make this point concretely, we here outline a coarse-grained solution to the curve detection problem from a computational perspective, and sketch how it could map onto physiology. The sketch is coarse enough to serve as an organizational framework, but fine enough to suggest particular physiological constraints. One of these comprises our first, coarse-grain prediction: curve detection naturally decomposes into two stages, the first in which a local description is computed, and the second in which a global description is computed. These computations are sufficiently different that we are led to hypothesize two different styles of visual computation.

2 The Dilemma of Curve Detection
The initial measurement of orientation information is broadly tuned, which suggests the averaging necessary to counteract retinal (sensor) sampling, quantization, and noise. However, the end result of curve detection is unexpectedly precise: corners can be distinguished from arcs of high curvature, and nearby curves can be distinguished from one another to a hyperaccurate level, even though they might pass through the same receptive field. An analogous dilemma exists for computer vision systems, even with the spectacular numerical precision of which computers are capable: quantization and noise imply smoothing, but smoothing blurs corners, endpoints, and nearby curves into confusion (Zucker 1986). At the foundation is a chicken-and-egg problem: if the points through which the curve passed, together with the locations of discontinuities, were known, then the actual properties of the curve could be inferred. But initially they are not known, so any smoothing inherent in the inference process is potentially dangerous.

3 Two Stages of Curve Detection
We have discovered a computational solution to this dilemma, which involves decomposing the full problem into two stages, each of which has a rather different character. In the first stage, the local properties of the curve are computed: its trace (the set of retinotopic points through which the curve passes), its tangent (or orientation at those points), and its curvature.
In the second stage, these properties are refined to create a global model of the curve. This much - proceeding from local to global - is standard; the style of the computations is not.

The key to the first stage is to infer the local properties coarsely - not in fine detail - but without sacrificing reliability or robustness. Coarseness is here related to quantization, which must limit error propagation without blurring over corners. Observe that this is precisely what is lacking in the standard myth, where errors (e.g., about placing discontinuities) can have far reaching consequences. The result is a style of computation in which the different (quantized) possibilities are made explicit, and arranged in a fixed, preconfigured computational architecture that imposes no a priori ordering over them. Each distinct hypothesis, say rough orientation and curvature at every position, forms a unit in a fixed network that strongly resembles neural-network-style models. Reliability and robustness are then maintained by the network; hence the local description is not computed locally! A mapping onto orientation hypercolumns will be discussed shortly.

The second stage embodies a rather different style of computation. Now the possibilities no longer need be general, but are constrained to be in the range dictated by the first stage. Thus the architecture can be tailored to each problem - that is, constructed adaptively rather than preconfigured - and variables can be represented implicitly. With these highly focused resources, the key limitation on precision is implementation, and it need not be hampered by uncontrolled error propagation. From the outside, this constructive style of computation holds certain key properties in common with later visual areas, such as V4 and IT, where receptive field structure has been shown to vary with problem constraints (e.g., Maunsell and Newsome 1987; Moran and Desimone 1985).
4 The Model of Curve Detection
In physiological terms, neurons are said to be orientation selective if they respond differentially to stimulus (edge or line) orientation. We take this operational statement one step further by defining orientation selection to be the inference of a local description of the curve everywhere along it, and postulate orientation selection as the goal of our first stage. In the second stage, global curves are inferred through this local description. The various stages of our process are shown in figure 1, and expanded below.

4.1 Stage 1: Inferring the Tangent Field. Formally, orientation selection amounts to inferring the trace of the curve, or the set of points (in the image) through which the curve passes, its (approximate) tangent and curvature at those points, and their discontinuities (Zucker 1986). We refer to such information as the tangent field, and note that, since the initial
measurements are discrete, this will impose constraints on the (inferred) tangents, curvatures, and discontinuities (Parent and Zucker 1985). This first stage of orientation selection is in turn modeled as a two step process:

Step 1.1. Initial Measurement of the local fit at each point to estimate orientation and curvature. These estimates derive from a model of simple cell receptive fields instantiated at multiple scales and orientations at each image position. However, these local measurements are inherently inaccurate (e.g., broadly tuned), so we require:

Step 1.2. Interpretation into an explicit distributed representation of tangent and curvature by establishing consistency between the local measurements. This is accomplished by modifying them according to their geometric relationships with nearby estimates.

4.2 Stage 2: Inferring a Covering of the Curve. Since the tangent is the first derivative of a curve (with respect to arc length), the global curve can be recovered as an integral through the tangent field. Such a view typically leads to sequential recovery algorithms (e.g., Kass and Witkin 1987). But these algorithms require global parameters, starting points, and some amount of topological structure (i.e., which tangent point follows which); in short, they are biologically implausible. In contrast, we propose a novel approach in which a collection of short, dynamically modifiable curves ("snakes" in computer vision; see Montanari 1971; Kass et al. 1988) move in parallel. The key idea behind our approach is to recover the global curve by computing a covering of it; i.e., a set of objects whose union is equivalent to the original curve. The elements of the covering are unit-length dynamic splines, initially equivalent to the elements of the tangent field, but which then evolve according to a potential distribution constructed from the tangent field. The evolution takes two forms: (i) a migration in position to achieve smooth coverings; and (ii) a "growth" to triple their initial length. Furthermore, since the splines are initially independent, it is not known which should be grouped into the covering of each distinct global curve. For graphical purposes we represent this by creating each one with a different "color," and include a second process which converts overlapping splines to the same color. In the end, then, the cover is given by a collection of overlapping splines, or short "snakes," each of which is the same color. Again, there are two conceptually distinct steps to Stage 2 of the algorithm (David and Zucker 1989), detailed below; a computational sketch follows the two steps.
Step 2.1. Constructing the Potential Distribution from the discrete tangent field. Each entry in the tangent field actually represents a discretization of the many possible curves in the world that could project onto that particular (tangent, curvature) hypothesis. Now these pieces
must be put together, so consider a measure (or envelope) over all of these possible curves. Assuming the curves are continuous but not necessarily differentiable everywhere, each contribution to the potential can be modeled as a Gaussian (the Wiener measure) oriented in the direction of the tangent field entry. The full potential distribution is their pointwise sum; see figure 3.
Step 2.2. Spline Dynamics. The discrete entities in the tangent field are converted into unit splines initialized in the valleys of the potential distribution. They evolve according to a variational scheme that depends on spline properties (tension and rigidity) as well as the global potential.
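As a concrete (and simplified) reading of Step 2.1, the sketch below sums one oriented Gaussian per tangent-field entry; the anisotropic Gaussian stands in for the Wiener-measure envelope, and all names and widths are our assumptions.

```python
import numpy as np

def potential_from_tangent_field(entries, shape,
                                 sigma_along=6.0, sigma_across=1.5):
    """Build the Stage-2 potential by summing one oriented Gaussian per
    tangent-field entry (x, y, theta in radians).  Splines later settle
    into the valleys of the negated potential; sign conventions and
    widths here are illustrative."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    potential = np.zeros(shape)
    for (x0, y0, theta) in entries:
        dx, dy = xs - x0, ys - y0
        # Rotate offsets into the tangent's frame: u along, v across.
        u = np.cos(theta) * dx + np.sin(theta) * dy
        v = -np.sin(theta) * dx + np.cos(theta) * dy
        potential += np.exp(-0.5 * (u / sigma_along) ** 2
                            - 0.5 * (v / sigma_across) ** 2)
    return potential
```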
5 Implementing the Model
Each stage of the model has different implementation requirements. To differentiate between smooth curves, curves with corners, crossing curves and branching curves, it is necessary to represent each possible tangent (orientation) and curvature value at every possible position. Smooth curves are then represented as a single (tangent, curvature) hypothesis at each (retinotopic) trace point, corners as multiple tangents at a single point, and bifurcations as a single tangent but multiple curvatures at a single point. Orientation hypercolumns in the visual cortex are thus a natural representational substrate, with explicit representation of each possible orientation and curvature at each position. This leads to a new observation regarding discontinuities - explicit neurons to represent them are unnecessary - and to our first physiological prediction:

Prediction 1. Crossings, corners, and bifurcations are represented at the early processing stages by multiple neurons firing within a "hypercolumn."

5.1 Stage 1, Step 1: Intra-Columnar Initial Measurements. We first seek a physiologically plausible mechanism for measuring orientation and curvature. Observe that an orientation-selective cortical neuron carries information about the tangent to curves as they pass through its receptive field, and an ensemble of such cells of different size carries information about how orientation is changing over it. Such differences are related to curvature (or deviation from straightness), and adding appropriate rectification leads to a model of endstopped neurons (Dobbins et al. 1987; cf. Hubel and Wiesel 1965). This model exhibits curvature-selective response at the preferred orientation, as do endstopped neurons. Thus
Prediction 2. Endstopped neurons carry the quantized representation of orientation and (non-zero) curvature at each position.
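The endstopping idea - curvature from the rectified difference of two receptive-field sizes - can be sketched as image-processing code. The oriented operator below (a directional second derivative at two smoothing scales) is our crude stand-in for the receptive-field model of Dobbins et al. (1987); every name and parameter is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_response(img, theta, scale):
    """Crude oriented 'simple cell': directional second derivative of the
    Gaussian-smoothed image along the normal to orientation theta."""
    g = gaussian_filter(img, scale)
    gy, gx = np.gradient(g)              # derivatives along rows, columns
    gyy, gyx = np.gradient(gy)
    gxy, gxx = np.gradient(gx)
    ny, nx = np.cos(theta), -np.sin(theta)   # unit normal to the contour
    return -(ny * ny * gyy + 2 * ny * nx * gyx + nx * nx * gxx)

def endstopped_response(img, theta, s_small=1.0, s_large=3.0, k=2.0):
    """Rectified difference of small- and large-field oriented responses;
    responses survive mainly where orientation changes across the larger
    field, i.e., on curved (or terminating) contours."""
    small = np.maximum(oriented_response(img, theta, s_small), 0.0)
    large = np.maximum(oriented_response(img, theta, s_large), 0.0)
    return np.maximum(small - k * large, 0.0)
```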
Figure 1: An illustration of the different stages of curve detection. In (a) we show a section of a fingerprint image; note the smooth curves and discontinuities around the "Y" in the center. (b) Graphical illustration of the initial information, or those orientation/curvature hypotheses resulting from convolutions above the noise level. (c) The discrete tangent field resulting from the relaxation process after 2 iterations; note that most of the spurious initial responses have been eliminated. (d) Final snake positions, or coverings of the global curves. (e) The potential distribution constructed from the entries in the tangent field.
By varying the components one obtains cells selective for different ranges and signs of curvature. Thus the initial measurements can be built up by intra-columnar local circuits, with the match to each (quantized) orientation and curvature represented explicitly as, say, firing rate in endstopped neurons. However, these measurements of orientation and curvature are broadly tuned; nearby curves are blurred together and multiple possibilities arise at many positions. Introducing further non-linearities into the initial measurements eliminates some spurious responses (Zucker et al. 1988), but the broadly-tuned smearing remains. We thus seek an abstract principle by which these broadly tuned responses can be refined into crisper distributions.

5.2 Stage 1, Step 2: Inter-Columnar Iterative Refinement. Again curvature enters the model, but now as a way of expressing the relationship between nearby tangent (orientation) hypotheses. Consider an arc of a curve, and observe that tangents to this arc must conform to certain position and orientation constraints for a given amount of curvature; we refer to such constraints geometrically as co-circularity (Fig. 2a). Discretizing all continuous curves in the world that project into the columnar space of coarse (orientation, curvature) hypotheses partitions these curves into equivalence classes, examples of which are shown in figure 2b (Parent and Zucker 1985; Zucker et al. 1988). Interpreting the (orientation, curvature) hypotheses as endstopped neurons, such co-circularly consistent relationships are what is to be expected of the firing pattern between endstopped neurons in nearby orientation hypercolumns given such a curve as stimulus. Turning this around, when such inter-columnar patterns arise from the initial measurements, a curve from one of the equivalence classes is to be expected. Such inter-columnar interactions can be viewed physiologically as excitatory and inhibitory projections between endstopped cells at nearby positions (adjacent hypercolumns), and can be used as follows. Since curvature is a relationship between tangents at nearby positions, two tangents should support one another if and only if they agree under a curvature hypothesis, and co-circularity provides the measure of such support. In addition, two tangents that disagree with the curvature estimate should detract support from one another. Relaxation labeling provides a formal mechanism for defining such support, and for specifying how to use it (Hummel and Zucker 1983). Mathematically it amounts to gradient descent; physiologically it can be viewed as a mechanism for specifying how the response of neighboring neurons will interact. In summary:
Prediction 3. Inter-columnar interactions exist between curvature-consistent (co-circular) tangent hypotheses.
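For illustration, a co-circularity compatibility between two tangent hypotheses can be written in closed form: the tangent at j is co-circular with the tangent at i exactly when it is the reflection of the tangent at i about the chord joining the two points. The Gaussian falloff and the treatment of orientations as directed are our simplifications of the 6-dimensional closest-point computation of Zucker et al. (1988).

```python
import numpy as np

def cocircularity_support(p_i, theta_i, p_j, theta_j, sigma=0.4):
    """Support that tangent hypothesis (p_j, theta_j) lends to
    (p_i, theta_i): 1 when the two tangents are exactly co-circular,
    falling off as a Gaussian in the orientation error.  Orientations
    are in radians and treated as directed for simplicity."""
    d = np.asarray(p_j, float) - np.asarray(p_i, float)
    phi = np.arctan2(d[1], d[0])               # direction of the chord i -> j
    expected = 2.0 * phi - theta_i             # co-circular tangent at j
    err = np.angle(np.exp(1j * (theta_j - expected)))   # wrap to (-pi, pi]
    return float(np.exp(-0.5 * (err / sigma) ** 2))

# Two tangents on a common circle support each other strongly:
print(cocircularity_support((0, 0), 0.0, (1, 1), np.pi / 2))   # 1.0
```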
Figure 2: (a) The geometric relationships necessary for defining the compatibilities between two label pairs at points i and j. (b) Compatibilities between coarse (orientation, curvature) hypotheses at nearby positions. 8 distinct orientations and 7 curvatures were represented, and 3 examples are shown. (top) The labels which give positive (left) and negative (right) support for a diagonal orientation that curves slightly left; (middle) positive and negative support for a straight curvature class; (bottom) positive and negative support for the maximum curvature class. The magnitude of the interactions varies as well, roughly as a Gaussian superimposed on these diagrams. The values were obtained by numerically solving a 6-dimensional closest point problem (Zucker et al. 1988). Physiologically these projective fields represent inter-columnar interactions. Multiplied by the original tangent receptive fields, they represent the units for building the potential distribution that guides Stage 2.
Given interaction, the next question relates to precision. Earlier we hypothesized that this first stage was coarse. Computational experiments (Zucker et al. 1988), psychophysics (Link and Zucker 1988), and the range of receptive field sizes in striate cortex (Dobbins et al. 1988) provide independent evidence about the quantization of curvature:

Prediction 4. The initial representation of curvature in the visual cortex is quantized into 5 ± 2 distinct classes; namely, straight, curved to the left a small amount, curved to the left a large amount, and similarly to the right.

Relaxation processes can be realized iteratively, and computational experiments suggest that about 3 iterations suffice (Zucker et al. 1988). At this time we can only speculate how these iterations relate to physiology, but perhaps the first iteration is carried out by a recurrent network within V1, and the subsequent iterations through the feed-forward and -back projections to extrastriate cortex (e.g., V2 or V4 in monkey). There is no doubt, however, that interactions beyond the classical receptive field abound (Allman et al. 1985). The advantage of this style of "coarse modeling" is that a number of testable physiological hypotheses do emerge, and we are now beginning to explore them. The requirement of initial curvature estimates led to the connection with endstopping, and the current model suggests roles for inter-columnar interactions. In particular, we predict that they should be a function of position and orientation, a prediction for which some support exists (e.g., Nelson and Frost 1985) in the zero-curvature case; experiments with curved stimuli remain to be done.

5.3 Stage 2: Potential Distributions and Evolving Spline Covers. The tangent field serves as a coarse model for the curve, represented locally. The next task is to infer a smooth, global curve running through it. We perform this inference in a rather different kind of architecture, one that involves potential distributions constructed specifically for each instance. It proceeds as follows. The potential distribution is created by adding together contributions from each element in the tangent field; see figure 3. Changing the representation from the tangent field to the potential distribution changes what is explicit and what is implicit in the representation. In Stage 1 there were discrete coarse entities; now there are smooth valleys that surround each of the global curves, with a separation between them. The "jaggies" imposed by the initial image sampling have been eliminated, and interpolation to sub-pixel resolution is viable. To recover the curves through the valleys, imagine creating, at each tangent field entry, a small spline of unit length oriented according to the tangent and curvature estimates (Fig. 4). By construction, we know that this spline will be born in a valley of the tangent field potential distribution, so the splines are then permitted to migrate to both smooth out the curve and to find the true local minima. But the inference of a cover for the global curves requires that the splines overlap, so that each point on every curve is covered by at least one spline. We therefore let the splines extend in length while they migrate in position, until they each reach a prescribed length. The covering is thus composed of these extensible splines which have grown in the valleys of the tangent field potential. Their specific dynamics and properties are described more fully in (David and Zucker 1989).
Figure 3: Illustration of how a potential distribution is constructed from tangent field entries. (a) A small number of tangents, showing the individual contributions from each one. (b) As more tangents are included, long "valleys" begin to form when the individual entries are added together. (c) The complete tangent field and potential distribution as shown in figure 1. Physiologically one might think of such potentials as being mapped onto neuronal membranes. Not shown is the possible effect of attention in gating the tangent field contributions, the smallest unit for which could correspond to a tangent field entry.

It is difficult to interpret these ideas physiologically within the classical view of neurons, in which inputs are summed and transformed into
It is difficult to interpret these ideas physiologically within the classical view of neurons, in which inputs are summed and transformed into an output train of action potentials, and dendrites simply support passive diffusion of depolarization. Recently, however, a richer view of neuronal processing has emerged, with a variety of evidence pointing to active dendritic computation and dendro-dendritic interaction (Schmitt and Worden 1979). Active conductances in dendrites functionally modify the geometry, and dendro-dendritic interactions suggest that the output transformation is not uniquely mediated by the axon. Taken together, these facts imply that patterns of activity can be sustained in the dendritic arbor, and that this membrane could be the substrate of the above potential distribution computations. For this to be feasible, however, we require

Prediction 5. The mapping of the potential distribution onto the neuronal membrane implies that the retinotopic coordinates are similarly mapped (at least in open neighborhoods) onto the membrane.

The large constructed potential distributions may bear some resemblance to the large receptive fields observed in areas V4 and IT (Maunsell and Newsome 1987). While any such relationship is clearly speculative at this time, it should be noted that they have two key similarities: (i) extremely large receptive fields (potential distributions) have been created, but they maintain about the same orientation selectivity as in V1 (Desimone et al. 1985); (ii) their structure can change. We have stressed how structure is controlled by upward flowing information, but it should be modifiable by "top-down" attentional influences as well (Maunsell and Newsome 1987; Moran and Desimone 1985). Attention could easily "gate" the tangent field entries at the creation of the potential, which leads to:

Prediction 6. There exists a smallest scale of attentional control, and it corresponds (in size) to the scale of the unit potential contributions.

6 Conclusions
This paper is both constructive and speculative. On the constructive side, we have outlined a computational solution to the curve detection problem that fills the wide gulf between initial broad measurements of orientation and precise final descriptions of global curve structure. Much of the mythology that has developed around curve detection is due, we believe, to ascribing too little function to the first (measurement) stage, and too much function to the second (integration) stage. Our solution was to interpose a stable description, the tangent field, between the stages, to represent the local properties of curves (and their discontinuities). Three points emerged: (i) represent the local structure coarsely,
not in fine detail, so that (ii) the different possibilities can be represented explicitly; and (iii) do not assume that local properties must be computed purely locally. Once the tangent field was in place, the task for the second, global stage could then be posed; it led to the introduction of the mathematical notion of a cover to suggest parallel (and hence at least not biologically implausible) mechanisms for recovering global information.

Figure 4: Illustration of the splines in motion. Initially, each spline is born at a tangent field location, with unit length. Then, according to the potential distribution shown in figure 1e, the splines migrate in position (to find minima in the distribution) and in length, so that they overlap and fill in short gaps. At convergence, the length of each spline has tripled. Not shown is the fact that each spline is born with a different "color," and that, as they overlap, the colors equilibrate to a unique value for the entire covering of each global curve. Also, those splines that migrate to positions unsupported by the potential distribution are eliminated at convergence. (a) Initial distribution; (b) and (c) intermediate iterations; (d) final convergence. Physiologically one might think of the spline computations as being supported by localized dendritic or dendro-dendritic interactions.
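The spline dynamics are specified in David and Zucker (1989); the toy sketch below reproduces only the two qualitative behaviors described in the caption of figure 4, migration downhill in the potential and growth in length up to a tripling. Step size, growth rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def migrate_splines(potential, seeds, steps=30, rate=0.5, growth=1.07):
    """Toy extensible splines: each spline (center x, y; angle; length) is
    born with unit length at a tangent field entry, slides downhill in the
    potential, and grows until its length has tripled. Step sizes, growth
    rate, and iteration count are illustrative assumptions."""
    gy, gx = np.gradient(potential)
    splines = [dict(x=x, y=y, theta=t, length=1.0) for x, y, t in seeds]
    for _ in range(steps):
        for s in splines:
            i = int(np.clip(round(s["y"]), 0, potential.shape[0] - 1))
            j = int(np.clip(round(s["x"]), 0, potential.shape[1] - 1))
            s["x"] -= rate * gx[i, j]                 # descend the potential
            s["y"] -= rate * gy[i, j]
            s["length"] = min(3.0, s["length"] * growth)
    return splines

# seeds slightly off a horizontal valley centered on y = 16
rows = np.mgrid[0:32, 0:32][0]
potential = -np.exp(-((rows - 16) / 2.0) ** 2)
out = migrate_splines(potential, [(10, 13.0, 0.0), (20, 19.0, 0.0)])
print([round(s["y"], 1) for s in out], out[0]["length"])  # centers pulled toward 16
```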
Finally, we introduced the notion of a potential distribution as the representation for mediating the local to global transition between the two stages.

The paper has also been speculative. Problems in vision are complex, and computational modeling can certainly help in understanding them. But in our view computational modeling cannot proceed without direct constraints from the biology, and modeling, like curve detection, should involve both coarse-grained and finer-grained theories. We attempted to illustrate how such constraints could be abstracted by speculating how our model could map onto physiology. While much clearly remains to be done, the role for curvature at several levels now seems evident. That such roles for curvature would have emerged from more traditional neural network modeling seems doubtful.

Two different styles of computation emerged in the two stages of curve detection. Although we stressed their differences in the paper, in closing we should like to stress their similarities. Both stages enjoy formulations as variational problems, and recognizing the hierarchy of visual processing, we cannot help but postulate that the second, fine stage of curve detection may well be the first, coarse stage of shape description. The fine splines then become the coarse units of shape.

Acknowledgments

This research was sponsored by NSERC grant A4470. We thank R. Milson and especially C. David for their contributions to the second stage of this project.

References

Allman, J., F. Miezin, and E. McGuinness. 1985. Stimulus Specific Responses from Beyond the Classical Receptive Field: Neurophysiological Mechanisms for Local-global Comparisons in Visual Neurons. Annual Rev. Neurosci. 8, 407-430.

David, C. and S.W. Zucker. 1989. Potentials, Valleys, and Dynamic Global Coverings. Technical Report 89-1, McGill Research Center for Intelligent Machines, McGill University, Montreal.

Desimone, R., S. Schein, J. Moran, and L. Ungerleider. 1985. Contour, Color, and Shape Analysis Beyond the Striate Cortex. Vision Research 25, 441-452.

Dobbins, A., S.W. Zucker, and M.S. Cynader. 1987. Endstopping in the Visual Cortex as a Substrate for Calculating Curvature. Nature 329, 438-441.

Dobbins, A., S.W. Zucker, and M.S. Cynader. 1988. Endstopping and Curvature. Technical Report 88-3, McGill Research Center for Intelligent Machines, McGill University, Montreal.

Hubel, D.H. and T.N. Wiesel. 1962. Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex. J. Physiol. (London) 160, 106-154.
Hubel, D.H. and T.N. Wiesel. 1965. Receptive Fields and Functional Architecture in Two Non-striate Visual Areas (18 and 19) of the Cat. J. Neurophysiol. 28, 229-289.

Hummel, R. and S.W. Zucker. 1983. On the Foundations of Relaxation Labeling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 267-287.

Kass, M., A. Witkin, and D. Terzopoulos. 1988. Snakes: Active Contour Models. Int. J. Computer Vision 1, 321-332.

Kass, M. and A. Witkin. 1987. Analyzing Oriented Patterns. Computer Vision, Graphics, and Image Processing 37, 362-385.

Link, N. and S.W. Zucker. 1988. Corner Detection in Curvilinear Dot Grouping. Biological Cybernetics 59, 247-256.

Maunsell, J. and W. Newsome. 1987. Visual Processing in Monkey Extrastriate Cortex. Ann. Rev. Neuroscience 10, 363-401.

Montanari, U. 1971. On the Optimum Detection of Curves in Noisy Pictures. CACM 14, 335-345.

Moran, J. and R. Desimone. 1985. Selective Attention Gates Visual Processing in the Extrastriate Cortex. Science 229, 782-784.

Nelson, J.J. and B.J. Frost. 1985. Intracortical Facilitation among Co-oriented, Co-axially Aligned Simple Cells in Cat Striate Cortex. Exp. Brain Res. 61, 54-61.

Parent, P. and S.W. Zucker. 1985. Trace Inference, Curvature Consistency, and Curve Detection. CVaRL Technical Report CIM-86-3, McGill University. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press.

Schmitt, F. and F. Worden. 1979. The Neurosciences: Fourth Study Program. Cambridge, MA: MIT Press.

Zucker, S.W. 1986. The Computational Connection in Vision: Early Orientation Selection. Behaviour Research Methods, Instruments, and Computers 18, 608-617.

Zucker, S.W., C. David, A. Dobbins, and L. Iverson. 1988. The Organization of Curve Detection: Coarse Tangent Fields and Fine Spline Coverings. Proc. 2nd Int. Conf. on Computer Vision, Tarpon Springs, Florida.
Received 14 October; accepted 23 October 1988.
Communicated by Dana Ballard
Part Segmentation for Object Recognition

Alex Pentland
Vision Sciences Group, The Media Lab, Massachusetts Institute of Technology, Room E15-410, 20 Ames Street, Cambridge, MA 02139, USA
Visual object recognition is a difficult problem that has been solved by biological visual systems. An approach to object recognition is described in which the image is segmented into parts using two simple, biologically-plausible mechanisms: a filtering operation to produce a large set of potential object "parts," followed by a new type of network that searches among these part hypotheses to produce the simplest, most likely description of the image's part structure.

1 Introduction
In order to recognize objects one must be able to compute a stable, canonical representation that can be used to index into memory (Binford 1971; Marr and Nishihara 1978; Hoffman and Richards 1985). The most widely accepted theory of how people recognize objects seems to be that they first segment the object into its component parts and then recognition occurs by using this part description to classify the object, perhaps by use of an associative network. Despite the importance of object recognition, most vision research, and especially neural network research, has been aimed at understanding early visual processing. In part this focus on early vision is because the uniform, parallel operations typical of early vision are easily mapped onto neural networks, and are more easily understood than the nonhomogeneous, nonlinear processing required to segment an object into parts and then recognize it. As a consequence, the process of object recognition is little understood.

The goal of this research is to automatically recover accurate part descriptions for object recognition. I have approached this objective by developing a system that segments an imaged object into convex parts using a neural network that is similar to that described by Hopfield and Tank (Hopfield and Tank 1985), but which uses a temporally-decaying feedback loop to achieve considerably better performance. For the sake of efficiency and simplicity I have used silhouettes, obtained from grey-scale images by intensity, motion, and texture thresholding, rather than operating on the grey-scale images directly.
2 A Computational Theory of Segmentation

Many machine vision systems employ matched filters to find particular 2-D shapes in an image, typically using a multiresolution approach that allows efficient search over a wide range of scales. Thus, in machine vision, a natural way to locate the parts of a silhouetted object is to make filter patterns that cover the spectrum of possible 2-D part-shapes (as is shown in figure 1(a)), match these 2-D patterns against the silhouette, and then pick the best-matching filter. If the match is sufficiently good, then we register the detection of a part whose shape is roughly that of the best-matching filter. A biological version of this approach might use many hypercolumns, each containing receptive fields with excitatory regions shaped as in figure 1. The cell with the best-matching excitatory field would be selected by introducing strong lateral inhibition within the hypercolumn in order to suppress all but the best-responding cells. This arrangement of receptive fields and within-hypercolumn inhibition produces receptive fields with oriented, center-surround spatial structure, such as is shown in figure 1(b).

The major problem with such a filtering/receptive field approach is that all such techniques incorporate a noise threshold that balances the number of false detections against the number of missed targets. Thus we will either miss many of the object's parts because they don't quite fit any of our 2-D patterns, or we will have a large number of false detections. This false-alarm versus miss problem occurs in almost every image processing domain, and there are only two general approaches to overcoming it. The first is to improve the discriminating power of the filter so as to improve the false-alarm/miss tradeoff. The success of this approach depends upon precise characterization of the target and so is not applicable to this problem. In the second approach, each non-zero response of a filter/receptive field is considered as a hypothesis about the object's part structure rather than as a detection. One therefore uses a very low threshold to obtain a large number of hypotheses, and then searches through them to find the "real" detections. This approach depends upon having some method of measuring the likelihood of a set of hypotheses, i.e., of measuring how good a particular segmentation into parts is as an explanation of the image data. It is this second, "best explanation" approach that I have adopted in this paper.
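A minimal sketch of this hypothesis-generation stage: binary part patterns are slid over a binary silhouette, and every placement whose pixel agreement clears a deliberately low threshold is kept as a part hypothesis. The pattern set, the threshold, and the brute-force scan are illustrative assumptions, not the implementation used in the experiments below.

```python
import numpy as np

def part_hypotheses(silhouette, patterns, min_agreement=0.6):
    """Slide each binary 2-D pattern over a binary silhouette and emit a
    part hypothesis (pattern id, row, col, agreement) wherever the
    fraction of agreeing pixels clears a deliberately low threshold:
    many false alarms are expected, and the later optimization stage
    sorts them out. The brute-force scan is an illustrative shortcut."""
    H, W = silhouette.shape
    hyps = []
    for p_id, pat in enumerate(patterns):
        h, w = pat.shape
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                agreement = np.mean(silhouette[i:i + h, j:j + w] == pat)
                if agreement > min_agreement:
                    hyps.append((p_id, i, j, float(agreement)))
    return hyps

silhouette = np.zeros((16, 16), dtype=int)
silhouette[4:12, 5:11] = 1                     # one rectangular "part"
patterns = [np.ones((8, 6), dtype=int), np.ones((4, 4), dtype=int)]
print(len(part_hypotheses(silhouette, patterns)))  # many overlapping hypotheses
```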
Figure 1: (a) Two-dimensional binary patterns used to segment silhouettes into parts. (b) Spatial structure of a receptive field corresponding to one of these binary patterns.

2.1 Global Optimization: The Likelihood Principle and Occam's Razor. The notion that vision problems can be solved by optimizing some "goodness of fit" measure is perhaps the most powerful paradigm found in current computational research (Hopfield and Tank 1985; Ballard et al. 1983; Hummel and Zucker 1983; Poggio et al. 1985). Although heuristic measures are sometimes employed, the most attractive schemes have been based on the likelihood principle (the scientific principle that the most likely hypothesis is the best one), i.e., they have posed the problem in terms of an a priori model with unknown parameter values, and then searched for the parameter settings that maximize the likelihood of the model given the image data.

Recently it has been proven (Rissanen 1983) that one method of finding this maximum likelihood estimate is by use of the formal, information-theoretic version of Occam's Razor: the scientific principle that the simplest hypothesis is the best one. In information theory the simplicity or complexity of a description is measured by the number of bits (binary digits) needed to encode both the description and remaining residual noise. This new result tells us that both the likelihood principle and Occam's Razor agree that the best description of image data is the one that provides the bitwise shortest encoding. This method of finding the maximum likelihood estimate is
particularly useful in vision problems because it gives us a simple way to produce maximum likelihood estimates using image models that are too complex for direct optimization (Leclerc 1988). In particular, to find the maximum likelihood estimate of an object's part structure one needs only to find the shortest description of the image data in terms of parts.

2.2 A Computational Procedure. How can the shortest/most likely image description be computed? Let {H} be a set of n part hypotheses h_i produced by our filters/receptive fields, and let {H*} be a subset of {H} containing m hypotheses. The particular elements which comprise {H*} can be indicated by a vector x consisting of n - m zeros and m ones, with a one in slot i indicating that hypothesis h_i is an element of
{H*}. The presence of part hypothesis h_i in the set {H} indicates that a particular pattern from among those illustrated in figure 1(a) has at least a minimal correspondence to the image data at some particular image location. Let us designate the number of image pixels at which h_i and the image agree (have the same value) by a_ii, and the number of image pixels at which h_i and the image disagree (have different values) by e_ii. Then h_i provides an encoding of the image which saves S(h_i) bits as compared to a simple pixel-by-pixel description of the image pixel values. The amount of this savings, in bits, is:
S(h_i) = k_1 a_ii - k_2 e_ii - k_3     (2.1)
where k_1 is the average number of bits needed to specify a single image pixel value, k_2 is the average number of bits needed to specify that a particular pixel is erroneously encoded by h_i, and k_3 is the cost of specifying h_i itself. The ratio between k_1 and k_2 is our a priori estimate of the signal to noise ratio, including both image noise and noise from quantization of the set of 2-D shape patterns. The parameter k_3 is equal to the minus log of the probability of a particular part hypothesis. By default we make k_3 equal for all h_i; however, we can easily incorporate a priori knowledge about the likelihood of each h_i by setting k_3 to the minus log probability associated with each h_i.

Equation 2.1 allows us to find the single hypothesis which provides the best image description by simply maximizing S(h_i) over all the hypotheses h_i. To find the overall maximum-likelihood/simplest description, however, we must search the power set of {H} to find that subset {H*} which maximizes S(x). Thus we must be able to account for interactions between the various h_i in {H*}. Let a_ij be the number of image pixels at which h_i, h_j, and the image all agree, and e_ij be the number of image pixels at which both h_i and h_j disagree with the image. We then define a matrix A with values a_ii on the diagonal, and values -(1/2)a_ij for i ≠ j, and similarly a matrix E with values e_ii on the diagonal, and values -(1/2)e_ij for i ≠ j. Ignoring points
where three or more h_i overlap, the savings generated by encoding the image data using {H*} (as specified by the vector x) is simply

S(x) = k_1 x A x^T - k_2 x E x^T - k_3 x x^T     (2.2)
Equation 2.2 can easily be extended to include overlaps between three or more parts by adding in additional terms that express these higher-order overlaps. However, these higher-order overlaps are expensive to calculate. Moreover, such high-order overlaps seem to be infrequent in real imagery. I have chosen, therefore, to assume that in the final solution there are a negligible number of image points covered by three or more h_i. Note that we are not assuming that this is true of the entire set {H}, where such high-order overlaps will be common. The important consequence of this assumption is that the maximum of the savings function S(x) over all x is also the maximum of equation 2.2.
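The sketch below evaluates equations 2.1 and 2.2 directly for hypotheses represented as explicit pixel predictions: per-hypothesis agreement and error counts form the diagonals of A and E, pairwise overlap counts enter off-diagonal with the -1/2 factor (each overlap is visited twice in xAx^T), and S(x) is then a single quadratic form. The k values and the (support, values) encoding are illustrative assumptions.

```python
import numpy as np

def savings_terms(image, hyps, k1=1.0, k2=2.0, k3=5.0):
    """Build A and E of equation 2.2. Each hypothesis is a pair
    (support, values): a boolean mask of the pixels it claims and the
    pixel values it predicts there. Returns A, E, and S(x) for a 0/1
    indicator vector x selecting a subset {H*} of the hypotheses."""
    agree = [sup & (vals == image) for sup, vals in hyps]
    err = [sup & (vals != image) for sup, vals in hyps]
    n = len(hyps)
    A, E = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        A[i, i], E[i, i] = agree[i].sum(), err[i].sum()
        for j in range(n):
            if i != j:  # overlaps appear twice in x A x^T, hence the -1/2
                A[i, j] = -0.5 * (agree[i] & agree[j]).sum()
                E[i, j] = -0.5 * (err[i] & err[j]).sum()
    def S(x):
        x = np.asarray(x, float)
        return k1 * x @ A @ x - k2 * x @ E @ x - k3 * x @ x
    return A, E, S

image = np.zeros((8, 8), int); image[2:6, 2:6] = 1
sup = np.zeros((8, 8), bool); sup[2:6, 2:6] = True
A, E, S = savings_terms(image, [(sup, np.ones((8, 8), int))])
print(S([1]), S([0]))   # including the hypothesis saves bits; omitting it saves none
```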
The solution to equation 2.2 is straightforward when the matrix

Q = k_1 A - k_2 E - k_3 I     (2.3)
is positive (or negative) definite. Unfortunately, this is not the case in this problem. As a consequence, relaxation techniques (Hummel and Zucker 1983) such as the Hopfield-Tank network (Hopfield and Tank 1985) give a very poor solution. I have therefore devised a new method of solution (and corresponding network) which can provide a good solution to equation 2.2.

This new technique is a type of continuation method: one first picks a problem related to the original problem that can be solved, and then iteratively solves a series of problems that are progressively closer to the original problem, each time using the last solution as the starting point for the next iteration. In the problem at hand, the system is easily solved when k_3 is large enough, as then Q is diagonally dominant and thus negative definite. Therefore, I can obtain a globally good solution by first solving using a large k_3, and then, using that answer as a starting point, progressively re-solving using smaller and smaller values of k_3 until the desired solution is obtained. Because k_3 is the cost of adding a model to our description, the effect of this continuation technique is to solve for the largest, most prominent parts first, and then to progressively add in smaller and smaller parts until the entire figure is accounted for.

The neural network interpretation of this solution method is a Hopfield-Tank network placed in a feedback loop where the diagonal weights are initially quite large and decay over time until they finally reach the desired values. In each "time step" the Hopfield-Tank network stabilizes, the diagonal weights are reduced, and the network outputs are fed back into the inputs. When the diagonal weights reach their final values, the desired outputs are obtained. A sketch of this continuation loop appears below.
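The following sketch is a discrete stand-in for that loop. For binary x the diagonal of Q acts as a per-hypothesis bias, so a large k_3 switches on only hypotheses with large individual savings; k_3 then decays and the subset is re-solved from the previous answer. Greedy asynchronous flips replace the Hopfield-Tank dynamics here, and the decay schedule and example matrices are arbitrary.

```python
import numpy as np

def continuation_solve(A, E, k1=1.0, k2=2.0, k3_final=1.0, shrink=0.7, sweeps=20):
    """Maximize S(x) = k1*xAx^T - k2*xEx^T - k3*xx^T over binary x by a
    continuation on k3: for binary x the diagonal of Q = k1*A - k2*E - k3*I
    acts as a per-hypothesis bias, so a large k3 turns on only hypotheses
    with large individual savings; k3 then decays and the subset is
    re-solved from the previous answer (greedy asynchronous flips stand in
    for the Hopfield-Tank dynamics; the schedule is illustrative)."""
    W = k1 * A - k2 * E                       # symmetric interaction terms
    n = W.shape[0]
    x = np.zeros(n)
    k3 = 2.0 * np.abs(np.diag(W)).max()       # large enough to dominate
    while k3 >= k3_final:
        for _ in range(sweeps):
            for i in range(n):                # flip unit i if it raises S(x)
                gain = (W[i, i] - k3) + 2.0 * (W[i] @ x - W[i, i] * x[i])
                x[i] = 1.0 if gain > 0 else 0.0
        k3 *= shrink
    return x

# toy example: two large, mostly-correct parts plus a small, error-prone one
A = np.array([[20., -2., -8.], [-2., 15., -1.], [-8., -1., 9.]])
E = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 6.]])
print(continuation_solve(A, E))               # the two large parts switch on first
```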
It can be shown that for many well-behaved problems (for example, when the largest eigenvalues are all of one sign, with opposite-signed eigenvalues of much smaller magnitude) this feedback technique will produce an answer that is on average substantially better than that obtained by Hopfield-Tank or relaxation methods. As with relaxation techniques (Hummel and Zucker 1983), this feedback method can be applied to problems with asymmetric weights.

A biological equivalent of our solution method is to use a set of hypercolumns (each containing cells with the excitatory subfields illustrated in figure 1) that are tied together by a Hopfield-Tank network augmented by a time-decaying feedback loop. The action of this network is to suppress activity in all but a small subset of the hypercolumns. After this network has stabilized, each of the remaining active cells corresponds exactly to one part of the imaged object. The characteristics of that cell's excitatory subfield correspond to the shape of the imaged part.

3 Segmentation Examples
This technique has been tested on over two hundred synthetic images, with widely varying noise levels (Pentland 1988). In these tests the number of visible parts was correctly determined 85-95% of the time (depending on noise level), with largely obscured or very small parts accounting for almost all of the errors. Estimates of part shape were similarly accurate. The following three examples illustrate this segmentation performance.

The first example uses synthetic range data with a dynamic range of 4 bits. In this example, only 72 2-D shape patterns were employed in order to illustrate the effects of coarse quantization in both orientation and size. The intent of this example is to demonstrate that a high-quality segmentation into parts can be achieved despite coarse quantization in orientation, size, and range values, and despite wide variation in the weights. In the remaining examples, the 2-D shape patterns shown in figure 1(a) were employed.

Figure 2(a) shows an intensity image of a CAD model; synthetic range data from this model is shown in figure 2(b). These range data were histogrammed and automatically thresholded, producing the silhouette shown in figure 2(c). Figure 2(d) shows the operation of our new solution method. The parameter k_3 is initially set to a large value, thus making equation 2.2 diagonally dominant. In this first step only the very largest parts are recovered, as is shown in the first frame of figure 2(d). The parameter k_3 is then progressively reduced and the equation re-solved, allowing smaller and smaller parts to be recovered. This is shown in the remaining frames of figure 2(d). This solution method therefore constructs a scale hierarchy of object parts, with the largest and most visible at the top of the hierarchy and the smallest parts on the bottom. This scale hierarchy can be useful in matching and recognition processes.
Figure 2: (a) Intensity image of a CAD model. (b) Range image of this model. (c) Silhouette of the range data. (d) This sequence of images illustrates how our continuation method constructs a scale-space description of part structure, first recovering only large, important parts and then recovering progressively smaller part structure. (e) Final segmentation into parts obtained using only very coarsely quantized 2-D patterns; 3-D models corresponding to recovered parts are used to illustrate the recovered structure. (f) Segmentations for a 5:1 ratio of the parameters k_i, showing that the segmentation is stable.

The final segmentation for this figure is shown in figure 2(e); here
3-D volumetric models have been substituted for their corresponding 2-D shapes in order to better illustrate how the silhouette was segmented into parts (that is, for each 2-D pattern we substituted a 3-D CAD model whose outline corresponds exactly to the 2-D shape pattern). The z dimension of these 3-D models is arbitrarily set equal to the smaller of the x and y dimensions. It can be seen that, apart from coarse quantization in orientation and size, the part segmentation is a good one.
One important question is the stability of segmentation with respect to the parameters k_i. Figure 2(f) shows the results of varying the ratio of parameters k_1, k_2, and k_3 over a range of 5:1. It can be seen that the part segmentation is stable, although as the relative cost of each model increases (the final value of k_3 becomes large) small details (such as the feet) disappear.

The second example of segmenting a silhouette into parts uses a real image of a person, shown in figure 3(a). A silhouette was produced by automatic thresholding of a fractal measure of texture smoothness; this silhouette is shown in figure 3(b). The resulting segmentation into parts is shown in figure 3(c).

An example of segmenting a more complex silhouette into parts uses the Rites of Spring, a drawing by Picasso, shown in figure 3(d). The area within the box was digitized and the intensity thresholded to produce a
Figure 3: (a) Image of a person. (b) Silhouette produced by thresholding a fractal texture measure. (c) Automatic segmentation into parts. (d) The Rites of Spring, by Picasso. (e) Digitized version. (f) The automatic segmentation into parts.
coarse silhouette, as shown in figure 3(e). The automatic segmentation is shown in figure 3(f). It is surprising that such a good segmentation can be produced from this hand-drawn, coarsely digitized image (note that very small details, e.g., the goat's horns, were missed because they were smaller than any of the 2-D patterns).

4 Summary
I have described a method for segmenting 2-D images into their component parts, a critical stage of processing in many theories of object recognition. This method uses two stages: a detection stage which uses matched filters to extract hypotheses about part structure, and an optimization stage, where all hypotheses about the object's part structure are combined into a globally optimal (i.e., simplest, most likely) explanation of the image data. The first stage is implemented by local competition among the filters illustrated in figure 1(a), and the second stage is implemented by a new type of neural network that gives substantially better answers than previously suggested optimization networks. This new network may be described as a relaxation or Hopfield-Tank network augmented by time-decaying feedback. For additional details the reader is referred to (Pentland 1988).

Acknowledgments

This research was made possible in part by National Science Foundation Grant No. IRI-8719920. I wish to especially thank Yvan Leclerc for his comments, and for reviving my interest in minimal length encoding.
References

Ballard, D.H., G.E. Hinton, and T.J. Sejnowski. 1983. Parallel Visual Computation. Nature 306, 21-26.

Binford, T.O. 1971. Visual Perception by Computer. Proceedings of the IEEE Conference on Systems and Control, Miami.

Hoffman, D. and W. Richards. 1985. Parts of Recognition. In: From Pixels to Predicates, ed. A. Pentland. New Jersey: Ablex Publishing Co.

Hopfield, J.J. and D.W. Tank. 1985. Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52, 141-152.

Hummel, R.A. and S.W. Zucker. 1983. On the Foundations of Relaxation Labeling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 267-287.
Leclerc, Y. 1988. Constructing Simple Stable Descriptions for Image Partitioning. Proc. DARPA Image Understanding Workshop, April 6-8, Boston, MA, 365-382.
Marr, D. and K. Nishihara. 1978. Representation and Recognition of the Spatial Organization of Three-dimensional Shapes. Proceedings of the Royal Society London B 200, 269-294.

Pentland, A. 1988. Automatic Recovery of Deformable Part Models. Massachusetts Institute of Technology Media Lab Vision Sciences Technical Report 104.

Poggio, T., V. Torre, and C. Koch. 1985. Computational Vision and Regularization Theory. Nature 317, 314-319.

Rissanen, J. 1983. Minimum-length Description Principle. Encyclopedia of Statistical Sciences 5, 523-527. New York: Wiley.
Received 23 September; accepted 8 November 1988.
Communicated by Richard Andersen
Computing Optical Flow in the Primate Visual System

H. Taichi Wang
Bimal Mathur
Science Center, Rockwell International, Thousand Oaks, CA 91360, USA
Christof Koch
Computation and Neural Systems Program, Divisions of Biology and Engineering and Applied Sciences, 216-76, California Institute of Technology, Pasadena, CA 91125, USA
Computing motion on the basis of the time-varying image intensity is a difficult problem for both artificial and biological vision systems. We show how gradient models, a well known class of motion algorithms, can be implemented within the magnocellular pathway of the primate's visual system. Our cooperative algorithm computes optical flow in two steps. In the first stage, assumed to be located in primary visual cortex, local motion is measured, while spatial integration occurs in the second stage, assumed to be located in the middle temporal area (MT). The final optical flow is extracted in this second stage using population coding, such that the velocity is represented by the vector sum of neurons coding for motion in different directions. Our theory, relating the single-cell to the perceptual level, accounts for a number of psychophysical and electrophysiological observations and illusions.

1 Introduction
In recent years, a number of theories have been advanced at both the computational and the psychophysical level, explaining aspects of biological motion perception (for a review see Ullman 1981; Nakayama 1985; Hildreth and Koch 1987). One class of motion algorithms exploits the relation between the spatial and the temporal intensity change at a particular point (Fennema and Thompson 1979; Horn and Schunck 1981; Marr and Ullman 1981; Hildreth 1984). In this article we address in detail how these algorithms can be mapped onto neurons in striate and extrastriate primate cortex (Ballard et al. 1983). Our neuronal implementation is derived from the most common version of the gradient algorithm, proposed within the framework of machine vision (Horn and Schunck 1981). Due to the "aperture" problem inherent in their definition of optical flow, only the component of
motion along the local spatial brightness gradient can be recovered. In their formulation, optical flow is then computed by minimizing a two-part quadratic variational functional. The first term forces the final optical flow to be compatible with the locally measured motion component ("constraint line term"), while the second term imposes the constraint that the final flow field should be as smooth as possible. Such a "smoothness" or "continuity" constraint is common to most early vision algorithms.

2 A Neural Network Implementation
Commensurate with this method, and in agreement with psychophysical results (e.g. Welch 1989), our network extracts the optical flow in two stages (Fig. 1). In a preliminary stage, the time-varying image I(i,j) is projected onto the retina and relayed to cortex via two neuronal pathways providing information as to the spatial location of image features (S neurons) and temporal changes in these features (T neurons): S(i,j) = ∇²G * I(i,j) and T(i,j) = ∂(∇²G * I(i,j))/∂t, where * is the convolution operator and G the 2-D Gaussian filter (Enroth-Cugell and Robson 1966; Marr and Hildreth 1980; Marr and Ullman 1981). In the first processing stage, the local motion information is represented using a set of n ON-OFF orientation- and direction-selective cells U, each with preferred direction indicated by the unit vector Θ_k:

U(i,j,k) = -T(i,j) ∇_k S(i,j) / (ε + |∇S(i,j)|²)     (2.1)
where ε is a constant and ∇_k is the spatial derivative along the direction Θ_k. This derivative is approximated by projecting the convolved image S(i,j) onto a "simple" type receptive field, consisting of a 1 by 7 pixel positive (ON) subfield adjacent to a 1 by 7 pixel negative (OFF) subfield. The cell U responds optimally if a bar or grating oriented at right angles to Θ_k moves in direction Θ_k. Note that U is proportional to the product of a transient cell (T) with a sustained simple cell with an odd-symmetric receptive field, with an output proportional to the magnitude of the local component velocity (as long as |∇S(i,j)|² > ε). At each location i,j, n such neurons code for motion in n directions. Equation (2.1) differs from the standard gradient model, in which U = -T/∇_k S, by including a gain control term, ε, such that U does not diverge if the stimulus contrast decreases to zero. ε is set to a fixed fraction of the square of the maximal magnitude of the gradient ∇S over all values of i,j. Our gradient-like scheme can be approximated for small enough values of the local contrast (i.e. if |∇S(i,j)|² < ε) by -T(i,j)∇_k S(i,j). Under this condition, our model can be considered a second-order model, similar to the correlation or spatio-temporal energy models (Hassenstein and Reichardt 1956; Poggio and Reichardt 1973; Adelson and Bergen 1985; Watson and Ahumada 1985).
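A sketch of this first stage under stated assumptions: SciPy's Laplacian-of-Gaussian filter supplies S, a two-frame difference supplies T, and plain finite differences stand in for the 1 by 7 ON/OFF simple-cell subfields when forming the gain-controlled responses of equation 2.1.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def motion_stage1(frame0, frame1, n_dirs=16, eps_frac=0.05, sigma=2.0):
    """First stage: S = Laplacian-of-Gaussian filtered image, T = temporal
    difference of S, then the gain-controlled direction-selective responses
    U(i,j,k) of equation 2.1 and the orientation-selective E(i,j,k).
    Finite differences replace the paper's 1-by-7 ON/OFF subfields."""
    S0 = gaussian_laplace(frame0.astype(float), sigma)
    S1 = gaussian_laplace(frame1.astype(float), sigma)
    S, T = S1, S1 - S0                          # spatial and temporal channels
    gy, gx = np.gradient(S)
    eps = eps_frac * (gx ** 2 + gy ** 2).max()  # gain-control constant
    thetas = 2 * np.pi * np.arange(n_dirs) / n_dirs
    U = np.empty(S.shape + (n_dirs,))
    E = np.empty_like(U)
    for k, th in enumerate(thetas):
        grad_k = gx * np.cos(th) + gy * np.sin(th)  # derivative along Theta_k
        U[..., k] = np.maximum(-T * grad_k / (eps + gx ** 2 + gy ** 2), 0.0)
        E[..., k] = np.abs(grad_k)
    return U, E

# a bright vertical bar stepping one pixel to the right between frames
f0 = np.zeros((32, 32)); f0[10:22, 10:13] = 1.0
U, E = motion_stage1(f0, np.roll(f0, 1, axis=1))
print(U.shape, float(U.max()) > 0.0)
```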
We also require a set of n ON-OFF orientation- but not direction-selective neurons E, with E(i,j,k) = |∇_k S(i,j)|, where the absolute value ensures that these neurons only respond to the magnitude of the spatial gradient, but not to its sign. We assume that the final optical flow field is computed in a second stage, using a population coding scheme such that the velocity is represented within a set of n′ neurons V at location i,j, each with preferred direction Θ_k, with V(i,j) = Σ_{k=1}^{n′} V(i,j,k) Θ_k. Note that the response of any individual cell V(i,j,k) is not the projection of the velocity field V(i,j) onto Θ_k. For any given visual stimulus, the state of the V neurons is determined by minimizing the neuronal equivalent of the above mentioned variational functional. The first term, enforcing the constraint that the final optical flow should be compatible with the measured data, has the form:

L_0 = Σ_{i,j} Σ_k E²(i,j,k) [ Σ_{k′} V(i,j,k′) cos(k′ - k) - U(i,j,k) ]²     (2.2)
where cos(k’- k ) is a shorthand for the cos of the angle between @k’ and @k. The term E2(i,j , k ) ensures that the local motion components U ( i ,j , k ) only have an influence when there is an appropriate oriented local pattern; thus E2 prevents velocity terms incompatible with the measured data from contributing significantly to Lo. In order to sharpen the orientation tuning of E , we square the output of E (the orientation tuning curve of E has a half-width of about 60”). The smoothness term, minimizing the square of the first derivative of the optical flow (Horn and Schunck 1981) takes the following form:
Our algorithm computes the state of the V neurons that minimizes L_0 + λL_1 (λ is a free parameter, usually set at 10). We can always find this state by evolving V(i,j,k) on the basis of the steepest descent rule: ∂V/∂t = -∂(L_0 + λL_1)/∂V. Since the variational functional is quadratic in V, the right hand side of the above differential equation is linear in V. Conceptually, we can think of the coefficients of this linear equation as synaptic weights, while the left hand side can be interpreted as a capacitive term, determining the dynamics of our model neurons. In other words, the state of the V neurons evolves by summing the synaptic contributions from E, U, and neighboring V neurons and updating accordingly. Thus, the system "relaxes" into its final and unique state. To mimic neuronal responses more accurately, the output of our model neurons S, T, E, U, and V is set to zero if the net input is negative.
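The sketch below carries out this relaxation on a discrete grid: V evolves by steepest descent on L_0 + λL_1, with the cos(k′ - k) projection written as a matrix product, nearest-neighbor coupling implementing the smoothness term, and rectification applied at every step. The wrap-around borders, step size, iteration count, and the uniform-motion demonstration are our own illustrative choices.

```python
import numpy as np

def relax_V(U, E, lam=10.0, iters=200, lr=1e-3):
    """Second (MT) stage: evolve V(i,j,k) by steepest descent on
    L0 + lam*L1. L0 penalizes mismatch between the projection of V onto
    each direction Theta_k and the measured U (weighted by E^2); L1
    penalizes nearest-neighbor differences of V (wrap-around borders and
    the step size are implementation shortcuts)."""
    H, W, n = U.shape
    thetas = 2 * np.pi * np.arange(n) / n
    C = np.cos(thetas[:, None] - thetas[None, :])   # cos(k' - k)
    E2 = E ** 2
    V = np.zeros_like(U)
    for _ in range(iters):
        proj = V @ C                                 # sum_k' V(k') cos(k'-k)
        gL0 = 2.0 * ((E2 * (proj - U)) @ C)          # dL0/dV
        lap = (np.roll(V, 1, 0) + np.roll(V, -1, 0) +
               np.roll(V, 1, 1) + np.roll(V, -1, 1) - 4.0 * V)
        V = np.maximum(V - lr * (gL0 - 2.0 * lam * lap), 0.0)  # rectified
    return V

def flow_field(V):
    """Population decoding: optical flow as a vector sum over directions."""
    thetas = 2 * np.pi * np.arange(V.shape[-1]) / V.shape[-1]
    return V @ np.cos(thetas), V @ np.sin(thetas)

# uniform rightward component motion measured only along Theta_0 = 0
U = np.zeros((8, 8, 16)); E = np.zeros((8, 8, 16))
U[:, :, 0] = 1.0; E[:, :, 0] = 1.0
vx, vy = flow_field(relax_V(U, E))
print(float(vx.mean()) > 0.0)                        # decoded flow points right
```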
Figure 1: Computing motion in neuronal networks. (a) Schematic representation of our model. The image I is projected onto the rectangular 64 by 64 retina and sent to the first processing stage via the S and T channels. A set of n = 16 ON-OFF orientation- (E) and direction-selective (U) cells code local motion in 16 different directions Θ_k. These cells are most likely located in layers 4Cα and 4B of V1. Neurons with overlapping receptive field positions i,j but different preferred directions Θ_k are arranged here in 16 parallel planes. The ON subfield of one such U cell is shown in Fig. 4a. The output of both E and U cells is relayed to a second set of 64 by 64 by 16 V cells where the final optical flow is represented via population coding, V(i,j) = Σ_k V(i,j,k) Θ_k, with n′ = 16. Each cell V(i,j,k) in this second stage receives input from cells E and U at location i,j as well as from neighboring neurons at different spatial locations. We assume that the V units correspond to a subpopulation of MT cells. (b) Polar plot of the median neuron (solid line) in MT of the owl monkey in response to a field of random dots moving in different directions (Baker et al. 1981). The tuning curve of one of our model V cells in response to a moving bar is superimposed (dashed line). Figure courtesy of J. Allman and S. Petersen.
3 Physiology and Psychophysics
Since the magnocellular pathway in primates is the one processing motion information (Livingstone and Hubel 1988; DeYoe and Van Essen 1988), we assume that the U and E neurons would be located in layers 4Cα and 4B of V1 (see also Hawken et al. 1988) and the V neurons in area MT, which contains a very high fraction of direction- and speed-tuned neurons (Allman and Kass 1971; Maunsell and Van Essen 1983). All 2n neurons U and E with receptive field centers at location i,j then project to the n′ MT cells V in an excitatory or inhibitory (via interneurons) manner, depending on whether the angle between the preferred direction of motion of the pre- and post-synaptic neuron is smaller or larger than 90°. (The appropriate weight of the synaptic connection between U(k′) and V(k) is cos(k - k′) U(k′) E²(k′); various biophysical mechanisms can implement the required multiplicative interaction as well as the synaptic power law; Koch and Poggio 1987.) Anatomically, we then predict that each MT cell receives input from V1 (or V2) cells located in all different orientation-columns. The smoothness constraint of equation (2.3) results in massive interconnections among neighboring V cells (Fig. 1a).

Our model can explain a number of perceptual phenomena. When two identical square gratings, oriented at a fixed angle to each other, are moved perpendicular to their orientation (Fig. 2a), human observers see the resulting plaid pattern move coherently in the direction given by the intersection of their local constraint lines ("velocity space combination rule"; in the case of two gratings moving at right angles at the same velocity, the resultant is the diagonal; Adelson and Movshon 1982). The response of our network to such an experiment is illustrated in figure 2: the U cells only respond to the local component of motion (component selectivity; Fig. 2b), while the V cells respond to the global motion (Fig. 2c), as can be seen by computing the vector sum over all V cells at every location (pattern selectivity; Fig. 2d). About 30% of all MT cells do respond in this manner, signaling the motion of the coherently moving plaid pattern (Movshon et al. 1985). In fact, under the conditions of rigid motion in the plane observed in Adelson and Movshon's experiments, both their "velocity space combination rule" and the "smoothness" constraint converge to the solution perceived by human observers (for more results see Wang et al. 1989). Given the way the response of the U neurons varies with visual contrast (equation 2.1), our model predicts that if the two gratings making up the plaid pattern differ in contrast, the final pattern velocity will be biased in the direction of the component velocity of the grating with the larger contrast. Recent psychophysical experiments support this conjecture (Stone et al. 1988).

It should be noted that the optical flow field is not represented explicitly within neurons in the second stage, but only implicitly, via vector addition. Our algorithm reproduces both "motion capture" (Ramachandran and
Figure 2: (a) Two gratings moving towards the lower right (one at -26° and one at -64°), the first moving at twice the speed of the latter. The amplitude of the composite is the sum of the amplitudes of the individual bars. The neuronal responses of a 12 by 12 pixel patch (outlined in (a)) are shown in the next three sub-panels. (b) Response of the U neurons to this stimulus. The half-wave rectified output of all 16 cells at each location is plotted in a radial coordinate system at each location, as long as the response is significantly different from zero. (c) The output of the V cells using the same needle diagram representation after the network converged. (d) The resulting optical flow field, extracted from (c) via population coding, corresponding to a coherent plaid moving towards the right, similar to the perception of human observers (Adelson and Movshon 1982) as well as to the response of a subset of MT neurons in the macaque (Movshon et al. 1985). The final optical flow is within 5% of the correct flow field.
Anstis 1983; see Wang et al. 1989) and "motion coherence" (Williams and Sekuler 1984), as illustrated in figure 3. As demonstrated previously, these phenomena can be explained, at least qualitatively, by a smoothness or local rigidity constraint (Yuille and Grzywacz 1988; Bülthoff et al. 1989). Finally, γ motion, a visual illusion first reported by the Gestalt psychologists (Lindemann 1922; Kofka 1931; for a related illusion in man and fly see Bülthoff and Götz 1979), is also mimicked by our algorithm. This illusion arises from the initial velocity measurement stage and does not rely on the smoothness constraint.

Cells in area MT respond well not only to motion of a bar or grating but also to a moving random dot pattern (Albright 1984; Allman et al. 1985). Similarly, our algorithm detects a random-dot figure moving over a stationary random-dot background, as long as the spatial displacement between two consecutive frames is not too large (Fig. 4). An interesting distinction arises between direction-selective cells in V1 and MT. While the optimal orientation in V1 cells is always perpendicular to their optimal direction, this is only true for about 60% of MT cells (type I cells; Albright 1984; Rodman and Albright 1989). 30% of MT cells respond strongly to flashed bars oriented parallel to the cells' preferred direction of motion (type II cells). If we identify our V cells with this MT subpopulation, we predict that type II cells should respond to an extended bar (or grating) moving parallel to its edge (Fig. 4).

4 Discontinuities in the Optical Flow
The major drawback of all motion algorithms is the degree of smoothness required, smearing out any discontinuities in the optical flow field, such as those arising along occluding objects or along a figure-ground boundary. It has been shown previously how this can be dealt with by introducing the concept of line processes which explicitly code for the presence of discontinuities in the motion field (Hutchinson et al. 1988; see also Poggio et al. 1988). If the spatial gradient of the optical flow between two neighboring points is larger than some threshold, the flow field "is broken"; that is, the process or "neuron" coding for a motion discontinuity at that location is switched on and no smoothing occurs. If little spatial variation exists, the discontinuity remains off. The performance of the original version of the Horn and Schunck algorithm is greatly improved using this idea (Hutchinson et al. 1988); a sketch of such gating appears below.
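A sketch of that gating, under stated assumptions: each smoothing step averages a flow component toward its neighbors except where the neighbor difference exceeds a threshold, i.e., where the binary line process has switched on. The threshold and schedule are illustrative; Hutchinson et al. (1988) realize this idea in analog resistive networks rather than in an explicit loop of this kind.

```python
import numpy as np

def smooth_with_line_processes(flow, threshold=0.5, iters=50, rate=0.25):
    """Iteratively smooth one optical-flow component, except across pixels
    where the neighbor difference exceeds a threshold: there the binary
    line process switches on and no smoothing occurs. Threshold, schedule,
    and wrap-around borders are illustrative choices."""
    f = flow.astype(float).copy()
    for _ in range(iters):
        update = np.zeros_like(f)
        for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
            diff = np.roll(f, shift, axis) - f
            update += np.where(np.abs(diff) < threshold, diff, 0.0)
        f += rate * update / 4.0
    return f

# two regions moving differently: smoothing halts at the motion boundary
rng = np.random.default_rng(0)
flow = np.zeros((16, 16)); flow[:, 8:] = 2.0
flow += 0.1 * rng.standard_normal(flow.shape)
out = smooth_with_line_processes(flow)
print(round(float(out[:, :8].mean()), 2), round(float(out[:, 8:].mean()), 2))
```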
Figure 3: (a) In "motion coherence," a cloud of dots is perceived to move in the direction of their common motion component. In this sequence, all dots have an upward velocity component, while their horizontal velocity component is random. (b) The final velocity field only shows the motion component common to all dots. Humans observe the same phenomena (Williams and Sekuler 1984). (c) In 7-motion, a dark stimulus flashed onto a homogenously lit background appears to expand (Lindemann 1922; Kofka, 1931). It will disappear with a motion of contraction. (d) Our algorithm perceives a similar expansion when a disk appears. The contour of the stimulus is projected onto the final optical flow. surround, such that the response of the cell to motion of a random dot display or an edge within the center of the receptive field can be modified by moving a stimulus within a very large surrounding region. The response depends on the difference in speed and direction of motion between the center and the surround, and is maximal if the surround
100
H. Taichi Wang, Bimal Mathur, Christof Koch
........, ....
t
,
.
.
.
.
.
.
.
.
. . . .
..... ....
.....
.. .. . . .
. .
.....
'
.... .....
....... . . . . . . . . . . . . . . . . . . .
. . .
.
.
.
.
'
6
,
*
&
. . . . . . . . . . . . . . . . . . . . .
Figure 4: (a) A dark bar (outlined) is moved parallel to its orientation towards the right. Due to the aperture problem, those U neurons whose receptive fields only "see" the straight elongated edges of the bar (and not the leading or trailing edges) will fail to respond to this moving stimulus, since motion remains invisible on the basis of purely local information. The ON subfield of the receptive field of a vertically oriented U cell is superimposed onto the image. (b) It is only after information has been integrated, following the smoothing process inherent in the second stage of our algorithm, that the V neurons respond to this motion. The type II cells of Albright (1984) in MT should respond to this stimulus. (c) Response of our algorithm to a random-dot figure-ground stimulus. The central 10 by 10 pixel square moved by 1 pixel toward the left. (d) Final optical flow after smoothing. The V cells detect the figure, similar to cells in MT. The contour of the translated square is projected onto the final optical flow.
The response depends on the difference in speed and direction of motion between the center and the surround, and is maximal if the surround moves at the same speed as the stimulus in the center but in the opposite direction. Thus, tantalizing hints exist as to the possible neuronal basis of motion discontinuities.

Recently, two variations of Horn and Schunck's (1981) original algorithm have been proposed, based on computational considerations (Uras et al. 1988; Yuille and Grzywacz 1988). Both algorithms can be mapped onto the neuronal network we have proposed, with minimal changes either in the U stage (Uras et al. 1988) or by increasing the connectivity among more distant V cells (Yuille and Grzywacz 1988). It remains an open challenge to provide both psychophysical and electrophysiological evidence to evaluate the validity of these and similar schemes (e.g. Nakayama and Silverman 1988). We are currently trying to extend our model to account for the intriguing phenomenon of "motion transparency," such as when two fields of random dots moving in opposite directions are perceived to form the 3-D motion field associated with a transparent and rotating cylinder (Siegel and Andersen 1988).

Acknowledgments

We thank John Allman, David Van Essen, and Alan Yuille for many fruitful discussions and Andrew Hsu for computing the figure-ground response. Support for this research came from NSF grant EET-8700064, an ONR Young Investigator Award, and an NSF Presidential Young Investigator Award to C.K.

References

Adelson, E.H. and J.R. Bergen. 1985. Spatio-temporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A 2, 284-299.

Adelson, E.H. and J.A. Movshon. 1982. Phenomenal Coherence of Moving Visual Patterns. Nature 300, 523-525.

Albright, T.L. 1984. Direction and Orientation Selectivity of Neurons in Visual Area MT of the Macaque. J. Neurophysiol. 52, 1106-1130.

Allman, J.M. and J.H. Kass. 1971. Representation of the Visual Field in the Caudal Third of the Middle Temporal Gyrus of the Owl Monkey (Aotus trivirgatus). Brain Res. 31, 85-105.

Allman, J., F. Miezin, and E. McGuinness. 1985. Direction- and Velocity-specific Responses from Beyond the Classical Receptive Field in the Middle Temporal Area (MT). Perception 14, 105-126.

Ballard, D.H., G.E. Hinton, and T.J. Sejnowski. 1983. Parallel Visual Computation. Nature 306, 21-26.

Baker, C.L. and O.J. Braddick. 1982. Does Segregation of Differently Moving Areas Depend on Relative or Absolute Displacement? Vision Res. 22, 851-856.

Baker, J.F., S.E. Petersen, W.T. Newsome, and J.M. Allman. 1981. Visual Response Properties of Neurons in Four Extrastriate Visual Areas of the Owl
Monkey (Aotus trivirgatus): A Quantitative Comparison of Medial, Dorsomedial, Dorsolateral, and Middle Temporal Areas. J. Neurophysiol. 45, 397-416.

Bülthoff, H.H., J.J. Little, and T. Poggio. 1989. Parallel Computation of Motion: Computation, Psychophysics and Physiology. Nature, in press.

Bülthoff, H.H. and K.G. Götz. 1979. Analogous Motion Illusion in Man and Fly. Nature 278, 636-638.

DeYoe, E.A. and D.C. Van Essen. 1988. Concurrent Processing Streams in Monkey Visual Cortex. Trends Neurosci. 11, 219-226.

Enroth-Cugell, C. and J.G. Robson. 1966. The Contrast Sensitivity of Retinal Ganglion Cells of the Cat. J. Physiol. (Lond.) 187, 517-552.

Fennema, C.L. and W.B. Thompson. 1979. Velocity Determination in Scenes Containing Several Moving Objects. Comput. Graph. Image Proc. 9, 301-315.

Hassenstein, B. and W. Reichardt. 1956. Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z. Naturforschung 11b, 513-524.

Hildreth, E.C. 1984. The Measurement of Visual Motion. Cambridge, MA: MIT Press.

Hildreth, E.C. and C. Koch. 1987. The Analysis of Visual Motion. Ann. Rev. Neurosci. 10, 477-533.

Horn, B.K.P. and B.G. Schunck. 1981. Determining Optical Flow. Artif. Intell. 17, 185-203.

Hutchinson, J., C. Koch, J. Luo, and C. Mead. 1988. Computing Motion using Analog and Binary Resistive Networks. IEEE Computer 21, 52-61.

Koch, C. and T. Poggio. 1987. Biophysics of Computation. In: Synaptic Function, eds. G.M. Edelman, W.E. Gall, and W.M. Cowan, 637-698. New York: John Wiley.

Kofka, K. 1931. In: Handbuch der normalen und pathologischen Physiologie 12, eds. A. Bethe et al. Berlin: Springer.

Lindemann, E. 1922. Experimentelle Untersuchungen über das Entstehen und Vergehen von Gestalten. Psych. Forsch. 2, 5-60.

Livingstone, M. and D. Hubel. 1988. Segregation of Form, Color, Movement, and Depth: Anatomy, Physiology and Perception. Science 240, 740-749.

Marr, D. and E.C. Hildreth. 1980. Theory of Edge Detection. Proc. R. Soc. Lond. B 207, 187-217.

Marr, D. and S. Ullman. 1981. Directional Selectivity and its Use in Early Visual Processing. Proc. R. Soc. Lond. B 211, 151-180.

Maunsell, J.H.R. and D. Van Essen. 1983. Functional Properties of Neurons in Middle Temporal Visual Area of the Macaque Monkey. II. Binocular Interactions and Sensitivity to Binocular Disparity. J. Neurophysiol. 49, 1148-1167.

Movshon, J.A., E.H. Adelson, M.S. Gizzi, and W.T. Newsome. 1985. The Analysis of Moving Visual Patterns. In: Exp. Brain Res. Suppl. 11: Pattern Recognition Mechanisms, eds. C. Chagas, R. Gattass, and C. Gross, 117-151. Heidelberg: Springer.

Nakayama, K. 1985. Biological Motion Processing: A Review. Vision Res. 25, 625-660.
Nakayama, K. and G.H. Silverman. 1988. The Aperture Problem. II. Spatial Integration of Velocity Information along Contours. Vision Res. 28, 747-753.

Poggio, T., E.B. Gamble, and J.J. Little. 1988. Parallel Integration of Visual Modules. Science 242, 337-340.

Poggio, T. and W. Reichardt. 1973. Considerations on Models of Movement Detection. Kybernetik 13, 223-227.

Ramachandran, V.S. and S.M. Anstis. 1983. Displacement Threshold for Coherent Apparent Motion in Random-dot Patterns. Vision Res. 23, 1719-1724.

Rodman, H. and T. Albright. 1989. Single-unit Analysis of Pattern-motion Selective Properties in the Middle Temporal Area (MT). Exp. Brain Res., in press.

Siegel, R.M. and R.A. Andersen. 1988. Perception of Three Dimensional Structure from Motion in Monkey and Man. Nature 331, 259-261.

Stone, L.S., J.B. Mulligan, and A.B. Watson. 1988. Neural Determination of the Direction of Motion: Contrast Affects the Perceived Direction of Motion. Neurosci. Abstr. 14, 502.5.

Ullman, S. 1981. Analysis of Visual Motion by Biological and Computer Systems. IEEE Computer 14, 57-69.

Uras, S., F. Girosi, A. Verri, and V. Torre. 1988. A Computational Approach to Motion Perception. Biol. Cybern. 60, 79-87.

van Doorn, A.J. and J.J. Koenderink. 1983. Detectability of Velocity Gradients in Moving Random-dot Patterns. Vision Res. 23, 799-804.

Wang, H.T., B. Mathur, A. Hsu, and C. Koch. 1989. Computing Optical Flow in the Primate Visual System: Linking Computational Theory with Perception and Physiology. In: The Computing Neurone, eds. R. Durbin, C. Miall, and G. Mitchison. Reading: Addison-Wesley. In press.

Watson, A.B. and A.J. Ahumada. 1985. Model of Human Visual-motion Sensing. J. Opt. Soc. Am. A 2, 322-341.

Welch, L. 1989. The Perception of Moving Plaids Reveals Two Motion Processing Stages. Nature, in press.

Williams, D. and R. Sekuler. 1984. Coherent Global Motion Percepts from Stochastic Local Motions. Vision Res. 24, 55-62.

Yuille, A.L. and N.M. Grzywacz. 1988. A Computational Theory for the Perception of Coherent Visual Motion. Nature 333, 71-73.
Received 28 October; accepted 6 December 1988.
Communicated by Richard Lippmann
A Multiple-Map Model for Pattern Classification

Alan Rojer
Eric Schwartz
Computational Neuroscience Laboratory, New York University Medical Center, Courant Institute of Mathematical Sciences, New York University, New York, NY 10016, USA
A characteristic feature of vertebrate sensory cortex (and midbrain) is the existence of multiple two-dimensional map representations. Some workers have considered single-map classification (e.g. Kohonen 1984), but little work has focused on the use of multiple maps. We have constructed a multiple-map classifier, which permits abstraction of the computational properties of a multiple-map architecture. We identify three problems which characterize a multiple-map classifier: classification in two dimensions, mapping from high dimensions to two dimensions, and combination of multiple maps. We demonstrate component solutions to each of the problems, using Parzen-window density estimation in two dimensions, a generalized Fisher discriminant function for dimensionality reduction, and split/merge methods to construct a "tree of maps" for the multiple-map representation. The combination of components is modular, and each component could be improved or replaced without affecting the other components. The classifier training procedure requires time linear in the number of training examples; classification time is independent of the number of training examples and requires constant space. Performance of this classifier on Fisher's iris data, Gaussian clusters on a five-dimensional simplex, and digitized speech data is comparable to competing algorithms, such as nearest-neighbor, back-propagation, and Gaussian classifiers. This work provides an example of the computational utility of multiple-map representations for classification. It is one step towards the goal of understanding why brain areas such as visual cortex utilize multiple map-like representations of the world.

1 Introduction
One of the most prominent features of the vertebrate sensory system is the use of multiple two-dimensional maps to represent the world. The observational data base for cortical maps is excellent, and this area represents one of the better-understood aspects of large-scale brain architecture. Recently, through the use of a system for computer-aided neuroanatomy, we have been able to obtain high-precision reconstructions of primary visual cortex map and column architectures, have constructed accurate models of both columnar and topographic architecture of primary visual cortex, and have suggested several computational algorithms which are contingent on the specific forms of column and map architecture which occur in this first visual area of monkey cortex (Schwartz et al. 1988). We expect to be able to extend these methods and ideas to other cortical areas. There is thus good progress in the areas of measuring, modeling, and computing with single-map representations. However, the problem of how to make use of multiple maps has been little explored.

Other workers have considered the application of single-map representations to classification. Kohonen (1984) has developed an algorithm for representing a feature space in a map; this algorithm constructs a space-variant representation, in rough analogy to the space-variant nature of primate visual cortex. However, this work does not provide a computational model for computing with multiple maps.

We believe that a classifier utilizing a multiple-map architecture must incorporate the following modules:

- An efficient algorithm for classification in two dimensions.
- A projection of high-dimensional data into a two-dimensional representation.
- An algorithm for combining multiple two-dimensional representations.

Our strategy in this work has been to use simple components to construct our multiple-map classifier. In particular, we were seeking algorithms which require one pass through the data and which are not sensitive to convergence issues (e.g. local minima in an energy function). We are interested in the overall properties of the classifier, and we are trying to deemphasize the role of the individual components, which are modular and hence subject to improvement or replacement.

2 Classification in Two Dimensions
We assume that the items, or instances, we wish to classify are represented as vectors $x \in R^d$, where each component of $x$ is a feature measurement. Each instance belongs to a class $k$. We also have a training set, a set of instances of known class (training examples). We refer to the set of training examples in class $k$ as $X_k$. Our problem is to construct a set of discriminant functions $f_k : R^d \to R$, $k = 1, \ldots, c$. An arbitrary instance $x$ is assigned to the class $k$ for which $f_k(x)$ is maximal. Because the instances are represented as vectors, we can refer to the distance between an instance and a training example as $\|x - x'\|$. We compute discriminant functions

$$ f_k(x) = \sum_{x' \in X_k} g(\|x - x'\|), $$

where $g(r)$ is some function which decreases as $r$ increases. If we let $g$ be a probability density (i.e. nonnegative and integrating to one over its support) this is the Parzen-window estimate (Parzen 1962) for the a posteriori density; i.e. $f_k(x) \approx p(k|x)$. Since this is the same term which is maximized in the Bayes classifier, our classifier performance approaches the Bayesian limit as the approximation above approaches the actual probability density. This algorithm is related to the nearest-neighbor classifier. Its principal novelty is to use maps to store $f_k(x)$. Then, given the training examples, we can compute $f_k(x)$ by convolution in one pass.

For illustration, see figure 1. We depict a two-class, one-dimensional classifier. The "map" is simply a segment of the real line. The training examples are shown on the $x$-axis as boxes. The weighting function $g(x)$ is a Gaussian function. The individual convolutions $g(x) * \delta(x - x')$ are shown as dotted lines. The class-specific density estimates, which are also the discriminant functions, are shown as a solid and broken line, respectively.

We consider a two-dimensional, three-class problem in figure 2 and figure 3. The weighting function is a circular two-dimensional Gaussian function. The instances have been drawn from prespecified two-dimensional multivariate normal distributions; this permits construction of a Bayes classifier to determine minimal error rate. In figure 2, we show a comparison between the Parzen-window density estimates $f_k(x)$ and the actual probability density functions for each class. In figure 3, the classifier is compared to a Bayesian classifier. The visual comparison indicates that the classifier is capturing much of the character of the Bayesian classifier. When the classifier was trained on 400 samples from each class, and tested on 300 (different) instances, its error rate was 16.0%, which may be compared to 14.4% for the Bayesian classifier.

One important issue in the application of this method is the choice of the weighting function (or kernel). We have typically used Gaussian kernels, in which case we need to choose the kernel variance $\sigma^2$ (or covariance matrix $\Sigma_m$ in higher dimensions). This is a difficult problem in general; we have used heuristic algorithms. For example, if we desire an isotropic kernel, we might use $\sigma = N^{-1/m}\sqrt{\lambda_1}$, where $\lambda_1$ is the largest eigenvalue of the covariance matrix resulting from the projection of the data into an $m$-dimensional map. The factor $N^{-1/m}$ arises from the heuristic decision to give each training instance an equal amount of map volume; since the kernel is $m$-dimensional, the volume scales as $\sigma^m$. More generally, we could use $\Sigma_m = h\,P\,\Sigma\,P^{\top}$, where $P$ is the projection into the map (see below). In experimental studies, we have found that the performance of the classifier is insensitive to small changes in the kernel size or shape.
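As a worked instance of the isotropic heuristic (with illustrative numbers of our own choosing, not taken from the paper): for $N = 400$ training examples projected into an $m = 2$ dimensional map whose largest eigenvalue is $\lambda_1 = 0.04$,

$$ \sigma = N^{-1/m}\sqrt{\lambda_1} = 400^{-1/2} \times 0.2 = 0.01, $$

so the kernel narrows as more training data fills the same map area.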
Figure 1: A one-dimensional two-class classifier. Class 0 instances are shown as solid boxes; class 1 instances are shown as open boxes. The estimated a posteriori density $p(k|x)$ (here, for one-dimensional $x$) is shown as a solid line for class 0, and a broken line for class 1. The Parzen-window function $g(\|x - x'\|)$ for each training example $x'$ is shown as a dotted line. The classifier operates by choosing the class for which the estimated a posteriori density is maximized. Thus, samples drawn with feature measurements below 0.7 would be assigned to class 0. Samples drawn with $x > 0.7$ would be assigned to class 1.
Figure 2: Comparison of actual binormal density against estimated density $f_k(x)$ computed by our classifier. Here a higher density corresponds to a darker region of the plot. The top row shows density plots for three binormal distributions in $R^2$. The bottom row shows the estimates $f_k(x)$ computed by the classifier for 400 samples drawn from each of these three distributions. The weighting function used is a circular Gaussian with a variance approximately 1/60 the width of the figure.
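To make the map mechanics concrete, here is a minimal sketch, not the authors' code, of the train-by-convolution scheme just described, assuming NumPy and SciPy; the map resolution (bins) and the kernel width in pixels (sigma_px) are illustrative stand-ins for the kernel-size heuristics discussed above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def train_maps(X2, labels, n_classes, bins=64, sigma_px=2.0):
    # Bin each projected training example into its class map (a delta
    # function per example), then blur each map once with a Gaussian:
    # the discriminants f_k(x) are computed by convolution in one pass.
    lo, hi = X2.min(axis=0), X2.max(axis=0)
    maps = np.zeros((n_classes, bins, bins))
    ij = ((X2 - lo) / (hi - lo) * (bins - 1)).astype(int)
    for (i, j), k in zip(ij, labels):
        maps[k, i, j] += 1.0
    for k in range(n_classes):
        maps[k] = gaussian_filter(maps[k], sigma_px)
    return maps, (lo, hi)

def classify(x2, maps, extent):
    # Look up the map position of a projected instance and pick the
    # class whose stored discriminant f_k(x) is maximal.
    lo, hi = extent
    bins = maps.shape[1]
    i, j = np.clip(((x2 - lo) / (hi - lo) * (bins - 1)).astype(int),
                   0, bins - 1)
    return int(np.argmax(maps[:, i, j]))
```

Training touches each example once and blurs each class map once, and classification is a single array lookup, consistent with the linear training time and constant classification time claimed for the classifier.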
3 From d Dimensions to Two Dimensions
In the previous section, we showed that the performance (as measured by error rate) approaches that of the Bayesian classifier for two-dimensional data drawn from Gaussian distributions.

Figure 3: Comparison of decision regions computed by our classifier ("Map classifier") against decision regions of a hypothetical Bayes classifier ("Bayesian classifier") which had complete knowledge of the underlying class distributions. Regions in the Bayes classifier are clipped due to round-off.

Actually there is nothing in our derivation which restricts us to two dimensions; a $d$-dimensional classifier is defined as above, except that the density estimate $p_k(x)$ will require a $d$-dimensional map. In practice we limit ourselves to two dimensions for three reasons. First, our original motivation is to understand the functional utility of laminar structures such as neocortex for pattern classification in the brain. Second, two-dimensional maps can be processed
by conventional image processing software and hardware, and the user of the classifier gets the benefit of visual displays of the intermediate structures in the classifier (e.g. $p_k(x)$ and the computed partition). Finally, the number of bins required to store the maps $p_k(x)$ grows exponentially with the dimensionality $d$, and thus favors small $d$.

Restriction to two dimensions introduces an interesting aspect to the classification problem. Although our original data is in $d$ dimensions, i.e. the data is composed of $d$ measurements, we must somehow extract only two measurements or combinations of measurements with which to construct our classifier. The classification problem then spawns a problem of feature derivation. We can formulate the dimensionality reduction problem as construction of a function $P : R^d \to R^2$ which maps $d$-dimensional instances to two-dimensional map positions. The two dimensions of the map constitute the two derived features. We need to specify what kind of function $P$ we will allow. To date, we have only considered linear projections, but nonlinear functions could also be used (e.g. Kohonen's self-organizing feature map; Kohonen 1984).

We apply the generalized Fisher discriminant, which was first introduced for a projection to $R$ (Fisher 1936) and later generalized to a domain of arbitrary dimensionality (Bryan 1951). A discussion of the technique may be found in (Duda and Hart 1973). The two vectors which comprise $P$ turn out to be the eigenvectors associated with the two largest eigenvalues in the generalized eigenvalue system

$$ S_b u = \lambda S_w u, \qquad (3.1) $$

where $S_w$ is the "within-class" scatter matrix, given by

$$ S_w = \sum_{k=1}^{c} \sum_{x \in X_k} (x - c_k)(x - c_k)^{\top}, \qquad (3.2) $$

with $c_k$ the class mean and $N_k$ the number of training examples representing class $k$, and $S_b$ is the "between-class" scatter matrix, given by

$$ S_b = \sum_{k=1}^{c} N_k (c_k - \bar{x})(c_k - \bar{x})^{\top}. \qquad (3.3) $$

Here, $\bar{x}$ is the mean over all the training examples. In practice, we have found that $S_w$ is nonsingular, so the system can be reduced to a standard eigenvalue problem

$$ S_w^{-1} S_b u = \lambda u. \qquad (3.4) $$
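One direct way to realize equations (3.1)-(3.4) numerically is sketched below, assuming NumPy; this is our own illustration, not the authors' implementation.

```python
import numpy as np

def fisher_projection(X, labels, out_dim=2):
    # Scatter matrices as in equations (3.2) and (3.3), followed by the
    # reduced eigenproblem (3.4); rows of the result form the projection P.
    classes = np.unique(labels)
    xbar = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in classes:
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)
        Sw += (Xk - ck).T @ (Xk - ck)
        Sb += len(Xk) * np.outer(ck - xbar, ck - xbar)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:out_dim]].real.T  # P : R^d -> R^out_dim

# Usage: P = fisher_projection(X, y); X2 = X @ P.T projects instances
# into the two-dimensional map.
```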
The extraction of the principal eigenvalue and its eigenvector is realizable using a typical Hebb synapse model with a fixed-length weight vector (Oja 1982).

The performance of the discriminant can be observed with Fisher's classical iris data. This data describes a four-dimensional, three-class problem. Figure 4 depicts the classifier constructed from the projection of the iris data into the two-dimensional subspace which maximizes the ratio described above. The classes can be seen to be fairly well separated. The classifier was tested by splitting the 150 instances into a 100-instance training set and a 50-instance test set. After training on the 100 instances, the classifier achieved 98% correct classification on the 50 instances in the test set. By comparison, the nearest-neighbor classifier operating on the same training and test sets achieved 98% correct classification, the Gaussian classifier achieved 94% correct classification, and a multilayer perceptron trained using back-propagation¹ achieved 96% correct classification.

4 Using Multiple Maps: A Tree of Maps
The previous example showed that for a relatively easy four-dimensional three-class problem, the generalized Fisher discriminant analysis was adequate to obtain a map which permitted good classifier performance. But in general, the discriminant analysis does not yield enough separation. For example, consider a regular five-dimensional simplex; this is a set of six equidistant points on the unit sphere in $R^5$. Locate a spherical multivariate normal distribution at each vertex of the simplex. This is a point swarm whose density declines as $\exp(-r^2)$, where $r$ is the distance from the vertex. We construct the classifier by utilizing discriminant analysis to find a projection $P : R^5 \to R^2$. With 600 training points and 300 test points, the error rate is 36%.

Fortunately, we are not confined to one map. One method of using multiple maps utilizes a split/merge technique to reduce one many-class problem to several problems, each with fewer classes. We merge the original base classes into superclasses, each of which is represented by the union of training examples from its underlying base classes. We then apply discriminant analysis to the newly formed superclasses. If we can achieve an adequate separation, we proceed as above with a separate map for each superclass. If necessary, we can again perform merges among the elements of a superclass, until we have divided each superclass into component base classes. The consequence of this approach is to create a tree of maps. From the root of the tree, we project instances into a superclass. If the superclass is a base class, we assign that class to the instance. Otherwise, the instance is assigned to a superclass, which has its own map. We project the instance into that map, and continue as above, until the instance lands in a region assigned to a base class. In the training phase, we construct a classifier in $R^2$ for each internal node in the tree to classify instances into one of the superclasses for that node.

¹A multilayer perceptron with 4 hidden units was trained using back-propagation (Rumelhart et al. 1986) with 5000 iterations through the training set, with ε = 0.02 and α = 0.

Figure 4: Iris classifier constructed from 100 training points drawn from the iris data (shaded background). Projection of four-dimensional iris data points into the two-dimensional subspace which maximizes the ratio of between-class variance to within-class variance (foreground data points).

We can illustrate this algorithm with the simplex data. We partition the classes so that three maps are used to classify. In the first map, we merge classes 2-5 into a superclass, letting classes 0 and 1 remain as base classes. In the second map, we will resolve the superclass composed of classes 2-5 into classes 2 and 3 and a superclass formed of 4 and 5. Finally, in the third map, we will resolve the superclass formed from 4 and 5 above into component base classes. The error rate of the three-map classifier is found to be 4.7%, a dramatic improvement over the single-map classifier (36% error rate). This may be compared to error rates of 2%, 6% and 2.7% respectively for the Gaussian, nearest-neighbor and
multilayer perceptron classifiers.² We have also applied our classifier to real-world data which consisted of 22 cepstral parameters from digitized speech.³ Each of 16 data sets represented one speaker; seven classes (monosyllabic words) were present in each data set. Each set consisted of 70 training instances and 112 test instances. The results are summarized:

Classifier:                Multiple-map   Gaussian⁴   Nearest-neighbor   Multilayer perceptron⁵
Average error rate (%):    6.5            6.0         5.9                6.3
Range (%):                 1.8-12.5       3.6-10.7    1.8-15.2           1.8-11.6

from which it may be seen that all four classifiers under consideration had closely comparable performance.

²A multilayer perceptron with 7 hidden units was trained using back-propagation (Rumelhart et al. 1986) with 5000 iterations through the training set, with ε = 0.02 and α = 0.
³We are grateful to R. Lippmann of MIT Lincoln Laboratory for providing this data.
⁴Covariance matrix estimates were obtained by pooling data from all 16 speakers for each class.
⁵Multilayer perceptrons for each speaker used 15 hidden units.

5 Automatic Generation of the Map Tree
In the preceding examples of multiple map usage, we interactively chose a map tree. In this section we explore a simple approach to automatic generation of the map tree. This is a clustering problem; we want to group classes into superclasses which in some way reflect the natural similarity between classes. We introduce the distance matrix $A$ for the classes. For any interclass distance measure $\mathrm{dist}(i, j)$, $A_{ij} = A_{ji} = \mathrm{dist}(i, j)$. We use a very simple tree generation algorithm: we treat $A$ as a graph with each class represented by a node, and each edge weighted according to interclass distance. We then compute the minimal spanning tree. We form superclasses by recursively removing the largest edge in the tree, yielding two subtrees, each of which forms a superclass. We can use a variety of interclass distance measures; we have experimented with distances between class means, overlap of the one-dimensional Fisher discriminant projections, and overlap of the two-dimensional Fisher discriminant projections.
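A sketch of a single split step of this heuristic, assuming SciPy's graph routines and taking dist(i, j) to be the distance between class means (one of the measures mentioned above); applying it recursively to each returned side until every superclass is a single base class yields the map tree. Function and variable names are ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def split_superclass(class_means, members):
    # Distance matrix A over the classes in `members`, its minimal
    # spanning tree, and a split into two superclasses obtained by
    # deleting the largest edge of the tree.
    mu = class_means[members]
    A = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
    mst = minimum_spanning_tree(A).toarray()
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0  # remove the largest edge
    _, comp = connected_components((mst + mst.T) > 0, directed=False)
    return [members[comp == 0], members[comp == 1]]

# Usage: split_superclass(class_means, np.arange(n_classes)) returns the
# two superclasses at the root of the map tree.
```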
6 Discussion

Our classifier is inspired by the prevalence of maps in the vertebrate brain. Its components are two-dimensional map units which implement Parzen-window density estimation, a dimensionality reduction methodology, and a scheme for decomposing a problem so that it can be solved by a system of maps. The training and running costs are favorably low. It admits an easy formal description. The intermediate results of classifier construction, e.g. the density estimates $p_k(x)$ and the partition computed by the classifier, are easily observed by a human user. This allows insight into the structure of the data that is hard to gain from other algorithms. Many of the operations can be implemented with conventional image processing operations (and thus can take advantage of special-purpose image processing hardware). The error rate is comparable to popular parametric, nonparametric, and neural network classifiers.

Only a few other workers have considered the role of maps in pattern classification. In particular, Kohonen (1984) has considered iterative algorithms for "self-organizing feature maps." We wish to distinguish his work from ours. In our classifier, the map function comprises a linear projection of a data instance to determine a position in the map followed by a reference to that position. Kohonen maps an instance via a distance computation at each node of his map, followed by a winner-take-all cycle to obtain the nearest-neighbor to the instance among all the map nodes. The projection we use is computed with one pass through the training set to compute second-order statistics, which are diagonalized in a step which has a cost related only to the dimensionality of the data, but not the number of samples. Kohonen uses a very large number of iterations through the training set. Most importantly, we emphasize the use of multiple maps, which is not considered in (Kohonen 1984). We could use a Kohonen-type feature map as a module in our classifier (replacing the Fisher discriminant analysis) although we would then sacrifice these advantages.

Tree classifiers have been considered at length in Breiman et al. (1984). There are similarities between their classifiers and ours at classification time, although the training algorithms are quite distinct. The principal difference in classifier operation is that we use two-dimensional density estimation at each node, while they use one-dimensional linear discriminants. Their discriminant is typically a threshold comparison of one feature value, although they also describe an iterative technique for obtaining a discriminant from a linear combination of a subset of the feature variables. There are much larger differences in the classifier training algorithms; we present an example of a simple heuristic for generating map trees (based on minimal spanning trees) whereas they examine a large set of possible splits in the data to generate trees. We wish to emphasize that our tree classifier is one possible technique for utilizing multiple maps; examination of alternative approaches is an important research problem.
Perceptual (and probably cognitive) functions of the brain are mediated by laminar cortical systems. Three carefully investigated systems (monkey vision, bat echolocation, and auditory localization in the owl) are committed to multiple two-dimensional spatial maps. The present paper describes the first attempt to construct a pattern classification system which has high performance and which is based on a multiple parallel map-like representation of feature vectors. The algorithms described in this paper allow us to begin to investigate the pattern classification and perceptual performance of such map-based architectures.

Acknowledgments

Supported by AFOSR-88-0275.

References

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.
Bryan, J.G. 1951. The Generalized Discriminant Function: Mathematical Foundation and Computation Routine. Harvard Educ. Rev. 21, 90-95.
Duda, R.O. and P.E. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Fisher, R.A. 1936. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugenics 7, 179-188.
Kohonen, T. 1984. Self-organization and Associative Memory. New York: Springer-Verlag.
Oja, E. 1982. A Simplified Neuron Model as a Principal Component Analyzer. J. Math. Biol. 15, 267-273.
Parzen, E. 1962. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 33, 1065-1076.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning Representations by Back-propagating Errors. Nature 323, 533-536.
Schwartz, E.L., B. Merker, E. Wolfson, and A. Shaw. 1988. Computational Neuroscience: Applications of Computer Graphics and Image Processing to Two and Three Dimensional Modeling of the Functional Architecture of Visual Cortex. IEEE Computer Graphics and Applications, July 1988, 13-28.
Received 1 October; accepted 18 November 1988.
Communicated by David A. Robinson
A Control Systems Model of Smooth Pursuit Eye Movements with Realistic Emergent Properties
R. J. Krauzlis and S. G. Lisberger
Department of Physiology and Neuroscience Graduate Program, University of California, San Francisco, CA 94143, USA
Visual tracking of objects in a noisy environment is a difficult problem that has been solved by the primate oculomotor system, but remains unsolved in robotics. In primates, smooth pursuit eye movements match eye motion to target motion to keep the eye pointed at smoothly moving targets. We have used computer models as a tool to investigate possible computational strategies underlying this behavior. Here, we present a model based upon behavioral data from monkeys. The model emphasizes the variety of visual signals available for pursuit and, in particular, includes a sensitivity to the acceleration of retinal images. The model was designed to replicate the initial eye velocity response observed during pursuit of different target motions. The strength of the model is that it also exhibits a number of emergent properties that are seen in the behavior of both humans and monkeys. This suggests that the elements in the model capture important aspects of the mechanism of visual tracking by the primate smooth pursuit system.

1 Introduction

Computer models have advanced our understanding of eye movements by providing a framework in which to test ideas suggested by behavioral and physiological studies. Our knowledge of the smooth pursuit system is at a stage where such models are especially useful. We know that pursuit eye movements are a response to visual motion and that they are used by primates to stabilize the retinal image of small moving targets. Lesion and electrophysiological studies have identified several cortical and subcortical sites that are involved in pursuit (see Lisberger et al. 1987 for review), but the precise relationship between the visual motion signals recorded at these sites and those used by pursuit remains unclear. We present a model that was designed to replicate the monkey's initial eye velocity response as a function of time for different pursuit target motions. The model emphasizes the variety of visual signals available for pursuit and minimizes the computations done by motor pathways. The structure of the model is based upon behavioral experiments which have characterized how different aspects of visual motion act to initiate pursuit. In monkeys, the pursuit system responds not only to the velocity of retinal images, but also to smooth accelerations and to the abrupt accelerations that accompany the onset of target motion (Krauzlis and Lisberger 1987; Lisberger and Westbrook 1985). Therefore, the model includes three parallel pathways that are sensitive to these three aspects of visual motion.

2 Structure of the Model
Our model is drawn in figure 1. Within each pathway, a time-delayed signal related to the motion of retinal images ($T - E$, called retinal "slip") is processed by a nonlinear gain element derived from our behavioral experiments and a filter. The first pathway is sensitive to slip velocity ($\dot{e}$) and its gain is linear. The second and third pathways are both sensitive to slip acceleration ($\ddot{e}$), but in different ways. The impulse acceleration pathway is sensitive to the large accelerations that accompany step changes in target velocity, but has a dead zone in the gain element that renders it insensitive to smaller accelerations. The smooth acceleration pathway is sensitive to gradual changes in image velocity. The outputs of the gain elements in each pathway are low-pass filtered to produce three signals with different dynamics ($\dot{E}'_v$, $\dot{E}'_i$, $\dot{E}'_a$) that are then summed and integrated to give a command for eye velocity ($E'$). The integrator makes our model reproduce the fact that the pursuit system interprets visual inputs as commands for eye accelerations (Lisberger et al. 1981). Eye velocity ($E$) is obtained by passing the eye velocity command through a low-pass filter that represents the eye muscles and orbital tissues. The behavior of the model was refined by matching its performance under open-loop conditions to that of the monkey. We removed visual feedback by setting the value of feedback gain to zero, and compared the model's output to averages of the monkey's eye velocity in behavioral trials where visual feedback was electronically eliminated for 200 ms (methods in Morris and Lisberger 1987). First, we stimulated the model with steps in target velocity, which activate the slip velocity and impulse acceleration pathways. We adjusted the filters in these pathways to obtain the fit in figure 2A, where the rising phase of the model's output (dotted lines) matches the rising phase of the monkey's eye velocity (solid lines) during the open-loop portion of the monkey's response. Then, we stimulated the model with steps in target acceleration, which activate the slip velocity and smooth acceleration pathways, and adjusted the filter in the smooth acceleration pathway (Fig. 2B).
3 Emergent Properties of the Model
We tested the model by restoring visual feedback and providing steps of target velocity in closed-loop conditions. Figure 2C shows that the model matches the rising phase of the monkey's response and makes a realistic transition into steady-state tracking. Like the monkey, it reaches steady-state velocity at later times for higher target speeds. The contribution of each pathway to the initiation of pursuit can be assessed by setting its gain to zero, effectively "lesioning" that limb of the model.

Figure 1: Model of smooth pursuit eye movement system. Boxes contain transfer functions expressed in Laplace notation. Abbreviations: $T$, target velocity; $\dot{e}$, slip velocity; $\ddot{e}$, slip acceleration; $\dot{E}'_v$, output of slip velocity pathway; $\dot{E}'_i$, output of slip impulse acceleration pathway; $\dot{E}'_a$, output of slip smooth acceleration pathway; $\dot{E}'$, eye acceleration command; $E'$, eye velocity command; $E$, eye velocity. Parameters used: $t = 0.065$; $T_v = 0.030$; $T_i = 0.020$; $T_a = 0.010$; $T_p = 0.015$. Functions for gain elements: slip velocity pathway: $y = ax$; $a = 8.3$. Slip impulse acceleration pathway: for $x > c$, $y = a\log(bx + 1)$; $a = 17500$, $b = 0.00015$, $c = 3000$. Slip smooth acceleration pathway: for $d > x > e$, $y = a\log(bx + 1)$; for $x < e$, $y = (cx^2)\,a\log(bx + 1)$; $a = 28$, $b = 0.1$, $c = 0.0016$, $d = 500$, $e = 18.5$. Equations given for impulse and smooth acceleration pathways apply for $x > 0$. For $x < 0$, equivalent odd functions are used.
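For readers who want to experiment, the following is a rough discrete-time sketch of our reading of Figure 1, not the authors' implementation: the Euler step size, the finite-difference estimate of slip acceleration, and the clipping of the smooth-acceleration gain function above d are our own assumptions.

```python
import numpy as np

# Parameters transcribed from the Figure 1 caption.
T_V, T_I, T_A, T_P = 0.030, 0.020, 0.010, 0.015  # filter time constants (s)
DELAY = 0.065                                    # visual delay t (s)

def g_vel(x):                      # slip velocity pathway (linear gain)
    return 8.3 * x

def g_imp(x):                      # impulse acceleration pathway (dead zone)
    a, b, c = 17500.0, 0.00015, 3000.0
    s, x = np.sign(x), abs(x)      # odd extension for x < 0
    return s * a * np.log(b * x + 1.0) if x > c else 0.0

def g_smooth(x):                   # smooth acceleration pathway
    a, b, c, e = 28.0, 0.1, 0.0016, 18.5
    s, x = np.sign(x), min(abs(x), 500.0)   # clip above d = 500 (assumption)
    y = a * np.log(b*x + 1.0) if x > e else (c * x**2) * a * np.log(b*x + 1.0)
    return s * y

def simulate(target_vel, t_end=1.0, dt=0.001, feedback=1.0):
    n, lag = int(t_end / dt), int(DELAY / dt)
    slip = np.zeros(n)             # history buffer for the delayed slip signal
    yv = yi = ya = 0.0             # low-pass filter states (pathway outputs)
    E_cmd = E = 0.0                # eye velocity command; eye velocity
    out, prev = np.zeros(n), 0.0
    for k in range(n):
        slip[k] = target_vel(k * dt) - feedback * E   # retinal slip
        s = slip[k - lag] if k >= lag else 0.0        # delayed slip
        s_dot, prev = (s - prev) / dt, s              # slip acceleration
        yv += dt / T_V * (g_vel(s) - yv)              # first-order filters
        yi += dt / T_I * (g_imp(s_dot) - yi)
        ya += dt / T_A * (g_smooth(s_dot) - ya)
        E_cmd += dt * (yv + yi + ya)                  # integrate to E'
        E += dt / T_P * (E_cmd - E)                   # plant filter -> E
        out[k] = E
    return out

# Example: step in target velocity of 20 d/s. Setting feedback=0.0
# reproduces the open-loop condition; replacing one gain function with
# a zero function "lesions" that pathway.
eye = simulate(lambda t: 20.0 if t >= 0.1 else 0.0)
```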
When the slip impulse acceleration pathway is lesioned, the rising phase is delayed and more sluggish, but the transition to steady-state tracking is unchanged (Fig. 2D, open arrow). This pathway contributes exclusively to the initial 25-50 ms of the response to steps in target velocity and allows the model to reproduce the observation that the earliest component of the pursuit response is sensitive to target direction, but relatively insensitive to target speed (Lisberger and Westbrook 1985). When the smooth acceleration pathway is eliminated, the rising phase is unchanged, but there is a large overshoot in the transition to steady-state tracking (Fig. 2D, filled arrow). Thus, the smooth acceleration pathway normally decelerates the eye as eye velocity approaches target velocity. This is an emergent property of the model, since this pathway was tuned by adjusting its contribution to the acceleration of the eye as shown in figure 2B.

The model rings at a relatively high frequency, 5 Hz, in response to steps in target velocity (solid lines in figure 3A). Similar oscillations are seen in the behavior of both humans and monkeys (Goldreich and Lisberger 1987; Robinson et al. 1986). If the model is driven at this resonant frequency, the output lags target velocity by 180 degrees (Fig. 3B, filled arrow), an effect that is also seen in the monkey's behavior (Goldreich and Lisberger 1987). The high frequency properties of the model depend upon the presence of the smooth acceleration pathway. If this pathway is eliminated, the spontaneous oscillations still occur, but now at only 1.6 Hz (Fig. 3A, open arrow), and the phase lag in the response to sinusoidal inputs increases (Fig. 3B, open arrow).

4 Discussion

An important property of our model is that it allows independent control over the initiation and maintenance phases of pursuit. The rising phase is determined mainly by the slip velocity and impulse acceleration pathways. The steady-state behavior is determined primarily by the smooth acceleration pathway. Since the differentiator in the smooth acceleration pathway introduces a phase lead, the steady-state behavior of the model has a higher frequency response than the rising phase. The exact frequency of ringing depends upon the total delay around the smooth acceleration pathway. For example, if the delay in the visual input is increased by 30 ms, the model will ring at 3.8 Hz, similar to what is normally observed in humans (Robinson et al. 1986) or in monkeys when the delay in visual feedback is increased (Goldreich and Lisberger 1987). The amount of ringing depends upon the gain element in the smooth acceleration pathway. Lowering the gain dampens or eliminates the ringing; increasing the gain produces persistent ringing. Such variations are also seen on individual trials in the monkey's behavior.
Figure 2: Comparison of the eye velocity output from the model and the monkey. In all panels, dotted lines show the model's output and solid lines show the monkey's eye velocity. A: Steps in target velocity of 5, 15, and 25 d/s under open-loop conditions. The monkey's open-loop response lasts for the first 200 ms of his response. B: Steps in target acceleration of 45, 64, and 120 d/s² under open-loop conditions. C: Steps in target velocity of 5, 15, and 25 d/s under closed-loop conditions. D: Effect of lesioning either acceleration pathway on the model's response to a step in target velocity of 20 d/s. Open arrow, model's response when gain in slip impulse acceleration pathway is set to zero. Closed arrow, model's response when gain in slip smooth acceleration pathway is set to zero.

The strength of our model is that it uses open-loop behavioral data to embody the pursuit system's sensitivity to different aspects of visual motion. Although it was designed to replicate the dynamics of the initiation of pursuit, the model also serendipitously solves a problem noted by Robinson (Robinson et al. 1986), namely, that the rising phase of pursuit is sluggish compared to the frequency of its ringing. The emergence of realistic steady-state properties in the model indicates that the visual elements in its pathways capture important aspects of signal processing within the smooth pursuit system. Our model does not include a sensitivity to position errors, which can affect steady-state tracking in monkeys (Morris and Lisberger 1987). It also does not include the topographic organization of the visual system and therefore cannot reproduce the retinotopic deficits in the initiation of pursuit or the directional deficits
in the maintenance of pursuit which are seen after lesions of cortical areas MT and MST (Dursteler et al. 1987; Newsome et al. 1985). However, it should be possible to embed the dynamics of our model within a topographic structure that would account for these effects.
Figure 3: Ringing and frequency response of the model. A: Solid lines, model's closed-loop response to steps in target velocity of 10 and 20 d/s, viewed on a larger time scale than in figure 2C. Dotted line, model's response to a step in target velocity of 10 d/s when gain in smooth acceleration pathway is set to zero. B: Driving the model with sine wave target velocity at 5.0 Hz under closed-loop conditions. Filled arrow, the intact model's response lags target motion by 180 degrees. Open arrow, model's response when gain in smooth acceleration pathway is set to zero. Phase lag is now 325 degrees.
Acknowledgments

This research was supported by NIH Grants EY03878 and EY07058.
References

Dursteler, M.R., R.H. Wurtz, and W.T. Newsome. 1987. Directional Pursuit Deficits Following Lesions of the Foveal Representation within the Superior Temporal Sulcus of the Macaque Monkey. J. Neurophysiol. 57, 1262-1287.
Goldreich, D. and S.G. Lisberger. 1987. Evidence that Visual Inputs Drive Oscillations in Eye Velocity during Smooth Pursuit Eye Movements in the Monkey. Soc. Neurosci. Abstr. 13, 170.
Krauzlis, R.J. and S.G. Lisberger. 1987. Smooth Pursuit Eye Movements are Not Driven Simply by Target Velocity. Soc. Neurosci. Abstr. 13, 170.
Lisberger, S.G., C. Evinger, G.W. Johanson, and A.F. Fuchs. 1981. Relationship between Eye Acceleration and Retinal Image Velocity during Foveal Smooth Pursuit Eye Movements in Man and Monkey. J. Neurophysiol. 46, 229-249.
Lisberger, S.G., E.J. Morris, and L. Tychsen. 1987. Visual Motion Processing and Sensory-motor Integration for Smooth Pursuit Eye Movements. Ann. Rev. Neurosci. 10, 97-129.
Lisberger, S.G. and L.E. Westbrook. 1985. Properties of Visual Inputs that Initiate Horizontal Smooth Pursuit Eye Movements in Monkeys. J. Neurosci. 5, 1662-1673.
Morris, E.J. and S.G. Lisberger. 1987. Different Responses to Small Visual Errors during Initiation and Maintenance of Smooth-pursuit Eye Movements in Monkeys. J. Neurophysiol. 58, 1351-1369.
Newsome, W.T., R.H. Wurtz, M.R. Dursteler, and A. Mikami. 1985. Deficits in Visual Motion Processing Following Ibotenic Acid Lesions of the Middle Temporal Area of the Macaque Monkey. J. Neurosci. 5, 825-840.
Robinson, D.A., J.L. Gordon, and S.E. Gordon. 1986. A Model of the Smooth Pursuit Eye Movement System. Biol. Cybern. 55, 43-57.
Received 15 August; accepted 1 October 1988.
Communicated by Patricia Churchland
The Brain Binds Entities and Events by Multiregional Activation from Convergence Zones
Antonio R. Damasio
Department of Neurology, Division of Behavioral Neurology and Cognitive Neuroscience, University of Iowa College of Medicine, Iowa City, IA, USA
The experience of reality, in both perception and recall, is spatially and temporally coherent and "in-register." Features are bound in entities, and entities are bound in events. The properties of these entities and events, however, are represented in many different regions of the brain that are widely separated. The degree of neural parcellation is even greater when we consider that the perception of most entities and events also requires a motor interaction on the part of the perceiver (such as eye movements and hand movements) and often includes a recordable modification of the perceiver's somatic state. The question of how the brain achieves integration starting with the bits and pieces it has to work with is the binding problem. Here we propose a new solution for this problem, at the level of neural systems that integrate functional regions of the telencephalon.

1 Introduction
Data from cognitive psychology, neurophysiology, and neuroanatomy indicate unequivocally that the properties of objects and events that we perceive through various sensory channels engage geographically separate sensory regions of the brain (Posner 1980; Van Essen and Maunsell 1983; Damasio 1985; Livingstone and Hubel 1988). The need to "bind" together the fragmentary representations of visual information has been noted by Treisman and Gelade (1980), Crick (1984), and others, but clearly the problem is a much broader one and includes the need to integrate both the sensory and motor components in both perception and recall, at all scales and at all levels. This broader concept of binding is closer to that of Sejnowski (1986).

The traditional and by now untenable solution to the binding problem has been that the components provided by different sensory portals end up being displayed together in so-called multimodal cortices, where the most detailed and integrated representations of reality are achieved. This intuitively reasonable view suggests that perception depends on a unidirectional process which provides a gradual refinement of signal extraction along a cascade aimed towards integrative cortices in anterior temporal and anterior frontal regions. Some of the most influential accounts for the neural basis of cognition in the post-war period, as well as major discoveries of neurophysiology and neuroanatomy over the past two decades, have seemed compatible with this view. After all, anatomical projections do radiate from primary sensory cortices toward structures in the hippocampus and prefrontal cortices via a multi-stage sequence (Pandya and Kuypers 1969; Jones and Powell 1970; Nauta 1971; Van Hoesen 1982), and the farther away neurons are from primary sensory cortices, the larger their receptive fields become, and the less unimodal their responses are (Desimone and Ungerleider 1989). However, there are several lines of evidence on the basis of which this traditional solution can be rejected.

2 Experimental Evidence
Evidence from Experimental Neuroanatomy: The notion that integration of perceptual or recalled components depends on a single neural meeting ground calls for the identification of a neuroanatomical site that would receive projections from all neural regions involved in the processing of entities and events as they occur in experience. Despite considerable exploration, no such region has yet been found. The anterior temporal cortices and the hippocampus do receive projections from multiple sensory areas, but not from motor regions (Van Hoesen 1982). The anterior frontal cortices, the most frequently mentioned candidates for ultimate integration, are even less suited for that role. The sensory and motor streams that reach them remain segregated in different regions (Goldman-Rakic 1988). In other words, there seems to be no structural foundation to support the intuition that temporal and spatial integration occur at a single site.

Advances in experimental neuroanatomy have added a new element to neuroanatomical reasoning about this problem: at every stage of the chain of forward cortical projections, there exist prominent projections back to the originating sites. Moreover, the systems are just as rich in multi-stage, reciprocating feedback projections as they are in feedforward projections (Van Hoesen 1982; Van Essen 1985; Livingstone and Hubel 1987). The neuroanatomical networks revealed by these studies allow for both forward convergence of some parallel processing streams, and for the flow of signaling back to points of origin. In the proposal we will describe below, such networks operate as coherent phase-locked loops in which patterns of neural activity in "higher" areas can trigger, enhance, or suppress patterns of activity in "lower" areas.

Evidence from Experimental Neuropsychology in Humans with Focal Cerebral Lesions: If temporal and frontal integrative cortices were the substrate for the integration of neural activity on which binding depends, the bilateral destruction of those cortices in humans should: (a) preclude the perception of reality as a coherent multimodal experience and reduce experience to disjointed, modal tracks of sensory or motor processing; (b) reduce the integration of even such modal track processing; and (c) disable memory for any form of past integrated experience and interfere with all levels and types of memory. However, the results of bilateral destruction of the anterior temporal lobes, as well as bilateral destruction of prefrontal cortices, falsify these predictions (see Fig. 1).

Figure 1: Fundamental divisions of human cerebral cortex depicted in a simplified diagram of the external and internal views of the left hemisphere. The motor and premotor cortices include cytoarchitectonic fields 4, 6, and 8. The early and intermediate sensory cortices include the primary visual, auditory, and somatosensory regions (respectively fields 17, 41/42, and 3/1/2), and the surrounding association cortices (fields 18/19, 7, 39, 22, 40, 5). The temporal "integrative" cortices include fields 37, 36, 35, 20, 21, 38, and 28, i.e., neocortical as well as limbic and paralimbic areas. The frontal "integrative" cortices include fields 44, 45, 46, 9, 10, 11, 12, 13, and 25, i.e., prefrontal neocortices as well as limbic areas.

Coherent perceptual experience is not altered by bilateral damage to the anterior temporal regions, nor does such damage disturb perceptual quality (see Corkin 1984; Damasio et al. 1985; Damasio et al. 1987). Our patient Boswell is a case in point. His extensive, bilateral damage in
anterior temporal cortices and hippocampus disables his memory for unique autobiographical events, but not his ability to perceive the world around him in fully integrated fashion and to recall and recognize the entities and events that he encounters or participates in, at a non-unique level. His binding ability breaks down at the level of unique events, when the integration of extremely complex combinatorial arrangements of entities is required. Bilateral lesions in prefrontal cortices, especially when restricted to the orbitofrontal sector, are also compatible with normal perception and even with normal memory for most entities and events except for those that pertain to the domain of social knowledge (Eslinger and Damasio 1985; Damasio and Tranel 1988).

Finally, it is damage to certain sectors of sensory association cortices that can affect both the quality of some aspects of perception within the modality served by those cortices, and recognition and recall. Depending on precisely which region of visual cortex is affected, lesions in early visual association cortices can disrupt perception of shape, or color, or texture, or stereopsis, or spatial placement of the physical components of a stimulus (Damasio 1985; Damasio et al. 1989). A patient may lose the ability to perceive color and yet perceive shape, depth and motion normally. More importantly, damage within some sectors of modal association cortices can disturb recall and recognition of stimuli presented through that modality, even when basic perceptual processing is not compromised. For instance, patients may become unable to recognize familiar faces that they perceive flawlessly (although, intriguingly, they can discriminate familiar from unfamiliar faces at covert level; Tranel and Damasio 1985; 1988). The key point is that damage in a posterior and unimodal association cortex can disrupt recall and recognition at virtually every level of the binding chain, from the entity-categorical level to the event-unique level. It can preclude the kind of integrated experience usually attributed to the anterior cortices.

3 A New View on the Binding Problem
The evidence then indicates: (a) that substantial binding, relative to entities or parts thereof, occurs in unimodal cortices and can support recall and recognition at the level of categories; (b) that recall and recognition at category level are generally not impaired by damage confined to anterior integrative cortices, i.e., knowledge recalled at categoric levels depends largely on posterior sensory cortices and interconnected motor cortices; (c) that recall and recognition of knowledge at the level of unique entities or events requires both anterior and posterior sensory cortices, i.e., a more complex network is needed to map uniqueness; anterior integrative structures alone are not sufficient to record and reconstruct unique knowledge.

The implication is that the early and intermediate posterior sensory cortices contain fragmentary records of featural components which can be reactivated, on the basis of appropriate combinatorial arrangements (by fragmentary featural components we mean "parts of entities," at a multiplicity of scales, most notably at feature level, e.g., color, movement, texture, shape and parts thereof). They also contain records of the combinatorial arrangement of features that defined entities ("local" or "entity" binding), but do not contain records of the spatial and temporal relationships assumed by varied entities within an event ("non-local" or "event" binding). The latter records, the complex combinatorial codes needed for event recall, are inscribed in anterior cortices. In this perspective the posterior cortices contain the fragments with which any experience of entities or events can potentially be re-enacted, but only contain the binding mechanism to re-enact knowledge relative to entities. Posterior cortices require binding mechanisms in anterior structures in order to guide the pattern of multiregional activations necessary to reconstitute an event. Thus posterior cortices contain both basic fragments and local binding records and are essential for recreating any past experience. Anterior cortices contain non-local or event-binding records and are only crucial for reconstitution of contextually more complex events. Perhaps the most important distinction between this perspective and the traditional view is that higher-order anterior cortices are seen as repositories of combinatorial codes for inscriptions that lie elsewhere and can be reconstructed elsewhere, rather than being the storage site for the more refined "multimodal" representations of experiences. Although anterior cortices receive multimodal projections, we conceptualize the records they harbor as amodal.

If parts of the representation of an entity are distributed over distant regions of the brain, then mechanisms must be available to bind together the fragments. A proposal for a new solution to the binding problem (Damasio 1989) is illustrated in figure 2 and presented in outline as follows:

1. The neural activity prompted by perceiving the various physical properties of any entity occurs in fragmented fashion and in geographically separate regions located in early sensory cortices and in motor cortices. So-called "integrative" cortices do not contain such fragmentary inscriptions.
2. The integration of multiple aspects of external and internal reality in perceptual or recalled experiences depends on the phase-locked coactivation of geographically separate sites of neural activity within the above mentioned sensory and motor cortices, rather than on a transfer and spatial integration of different representations towards anterior higher-order cortices. Consciousness of those coactivations depends on their being attended to, i.e., on simultaneous enhancement of a pertinent set of activity against background activity.
3. The patterns of neural activity that correspond to distinct physical properties of entities are recorded in the same neural ensembles in which they occur during perception, but the combinatorial arrangements (binding codes) that describe their pertinent linkages in entities and in events (their spatial and temporal coincidences) are stored in separate neural ensembles called convergence zones.

Figure 2: Simplified diagram of some aspects of the proposed neural architecture. V, SS, and A depict early and intermediate sensory cortices in visual, somatosensory, and auditory modalities. In each of those sensory sectors, separate functional regions are represented by open and filled dots. Note feedforward projections (black lines) from those regions toward several orders of convergence zones (CZ1, CZ2, CZn), and note also feedback projections from each CZ level toward originating regions (red lines). H depicts the hippocampal system, one of the structures where signals related to a large number of activity sites can converge. Note outputs of H toward the last station of feedforward convergence zones (CZn) and toward noncortical neural stations (NC) in basal forebrain, brain stem, and neurotransmitter nuclei. Feedforward and feedback pathways should not be seen as rigid channels. They are conceived as facilitated lines which become active when concurrent firing in early cortices or CZs takes place. Furthermore, those pathways terminate over neuron ensembles, in distributed fashion, rather than on specific single neurons.

4. Convergence zones trigger and synchronize neural activity patterns
corresponding to topographically organized fragment representations of physical structure that were pertinently associated in experience, on the basis of similarity, spatial placement, temporal sequence, or temporal coincidence, or combinations thereof. The triggering and synchronization depend on feedback projections from the convergence zone to multiple cortical regions where fragment records can be activated.
5. Convergence zones are located throughout the telencephalon, at multiple neural levels, in association cortices of different orders, limbic cortices and subcortical limbic nuclei, and non-limbic subcortical nuclei such as the basal ganglia. 6. The geographic location of convergence zones for different entities varies among individuals but is not random. It is constrained by
the subject matter of the recorded material (its domain), and by contextual complexity of events (the number of component entities that interact in an event and the relations they adopt), and by the anatomical design of the system. Convergence zones that bind features into entities are located earlier in the processing streams, and convergence zones that bind entities into progressively more complex events are gradually placed more anteriorly in the processing streams.
7. The representations inscribed in the above architecture, both those that preserve topographic/topologic relationships and those that code for combinatorial arrangements, are committed to populations of neuron ensembles and their synapses, in distributed form.

8. The co-occurrence of activities in multiple sites that is necessary for binding conjunctions is achieved by recurrent feedback interactions.
Thus, we propose that the processing does not proceed in a single direction but rather through temporally coherent phase-locking amongst multiple regions. Although the convergence zones that realize the more encompassing integration are placed more anteriorly, it is activity in the more posterior cortical regions that is more directly related to conscious experience. By means of feedback, convergence zones repeatedly return processing to earlier cortices where activity can proceed again towards the same or other convergence zones. Integration takes place when activations occur within the same time window, in earlier cortices. There is no need to postulate a "final" and single integration area. This model accommodates the segregation of neural processing streams that neuroanatomical and neurophysiological data continue to reveal so consistently, and is compatible with the increase in receptive fields of neurons that occurs in cerebral cortex, in the posterior-anterior direction. It accords with the proposal that fewer and fewer neurons placed anteriorly in the system are projected on by structures upstream and thus subtend a broader compass of feed-forwarding regions. Broad receptive field neurons serve as pivots for reciprocating feedback projections rather than as accumulators of the knowledge inscribed at earlier levels. They are intermediaries in a continuous process that centers on early cortices.

4 Conclusions
The problem of how the brain copes with the fragmentary representations of information is central to our understanding of brain function. It is not enough for the brain to analyze the world into its component parts: the brain must bind together those parts that make whole entities and events, both for recognition and recall. Consciousness must necessarily be based on the mechanisms that perform the binding. The hypothesis suggested here is that the binding occurs in multiple regions that are linked together through activation zones; that these regions communicate through feedback pathways to earlier stages of cortical processing where the parts are represented; and that the neural correlates of consciousness should be sought in the phase-locked signals that are used to communicate between these activation zones.

Several questions are raised by this new view. For instance, what is the precise nature of the feedback signals that provide temporally coherent phase-locking among multiple regions? How large are the convergence zones in different parts of the brain? How are the decisions made to store an aspect of experience in a particular zone? There are several possible approaches to test the hypothesis proposed here. One approach is to develop new techniques for recording from many neurons simultaneously in communicating brain regions. Another relies on neuropsychological experiments in neurological patients with small focal lesions in key areas of putative networks dedicated to specific cognitive processes. Finally, modeling studies should illuminate the collective properties of convergence zones and provide us with the intuition we need to sharpen our questions.

Acknowledgments

Supported by NINCDS Grant PO1 NS19632.
References

Corkin, S. 1984. Lasting Consequences of Bilateral Medial Temporal Lobectomy: Clinical Course and Experimental Findings in HM. Seminars in Neurology 4, 249-259.
Crick, F. 1984. Function of the Thalamic Reticular Complex: The Searchlight Hypothesis. Proc. Natl. Acad. Sci. USA 81, 4586-4590.
Damasio, A. 1989. Multiregional Retroactivation: A Systems Level Model for Some Neural Substrates of Cognition. Cognition, in press.
Damasio, A. 1985. Disorders of Complex Visual Processing. In: Principles of Behavioral Neurology, ed. M.M. Mesulam, Contemporary Neurology Series, 259-288. Philadelphia: F.A. Davis.
Damasio, A., P. Eslinger, H. Damasio, G.W. Van Hoesen, and S. Cornell. 1985. Multimodal Amnesic Syndrome Following Bilateral Temporal and Basal Forebrain Damage. Archives of Neurology 42, 252-259.
Damasio, A., H. Damasio, D. Tranel, K. Welsh, and J. Brandt. 1987. Additional Neural and Cognitive Evidence in Patient DRB. Society for Neuroscience 13, 1452.
Damasio, A. and D. Tranel. 1988. Domain-specific Amnesia for Social Knowledge. Society for Neuroscience 14, 1289.
Damasio, A.R., H. Damasio, and D. Tranel. 1989. Impairments of Visual Recognition as Clues to the Processes of Memory. In: Signal and Sense: Local and Global Order in Perceptual Maps, eds. G. Edelman, E. Gall, and M. Cowan, Neuroscience Institute Monograph. Wiley and Sons.
Desimone, R. and L. Ungerleider. 1989. Neural Mechanisms of Visual Processing in Monkeys. In: Handbook of Neuropsychology, Disorders of Visual Processing, ed. A. Damasio, in press.
Eslinger, P. and A. Damasio. 1985. Severe Disturbance of Higher Cognition after Bilateral Frontal Lobe Ablation. Neurology 35, 1731-1741.
Goldman-Rakic, P.S. 1988. Topography of Cognition: Parallel Distributed Networks in Primate Association Cortex. In: Annual Review of Neuroscience 11, Annual Reviews Inc., Palo Alto, CA, 137-156.
Jones, E.G. and T.P.S. Powell. 1970. An Anatomical Study of Converging Sensory Pathways within the Cerebral Cortex of the Monkey. Brain 93, 793-820.
Livingstone, M. and D. Hubel. 1988. Segregation of Form, Color, Movement, and Depth: Anatomy, Physiology, and Perception. Science 240, 740-749.
Livingstone, M. and D. Hubel. 1987. Connections between Layer 4B of Area 17 and Thick Cytochrome Oxidase Stripes of Area 18 in the Squirrel Monkey. Journal of Neuroscience 7, 3371-3377.
Nauta, W.J.H. 1971. The Problem of the Frontal Lobe: A Reinterpretation. J. Psychiat. Res. 8, 167-187.
Pandya, D.N. and H.G.J.M. Kuypers. 1969. Cortico-cortical Connections in the Rhesus Monkey. Brain Res. 13, 13-36.
Posner, M.I. 1980. Orienting of Attention. Quarterly Journal of Experimental Psychology 32, 3-25.
Sejnowski, T.J. 1986. Open Questions about Computation in Cerebral Cortex. In: Parallel Distributed Processing, eds. J.L. McClelland and D.E. Rumelhart, 372-389. Cambridge: MIT Press.
Tranel, D. and A. Damasio. 1988. Nonconscious Face Recognition in Patients with Face Agnosia. Behavioral Brain Research 30, 235-249.
Tranel, D. and A. Damasio. 1985. Knowledge without Awareness: An Autonomic Index of Facial Recognition by Prosopagnosics. Science 228, 1453-1454.
Treisman, A. and G. Gelade. 1980. A Feature-integration Theory of Attention. Cognitive Psychology 12, 97-136.
Van Essen, D.C. 1985. Functional Organization of Primate Visual Cortex. In: Cerebral Cortex, eds. A. Peters and E.G. Jones, 259-329. Plenum Publishing.
Van Essen, D.C. and J.H.R. Maunsell. 1983. Hierarchical Organization and Functional Streams in the Visual Cortex. Trends in Neuroscience 6, 370-375.
Van Hoesen, G.W. 1982. The Primate Parahippocampal Gyrus: New Insights Regarding its Cortical Connections. Trends in Neurosciences 5, 345-350.
Received 18 November; accepted 25 November 1988.
Communicated by David Touretzky
Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks Richard Durbin David E. Rumelhart Department of Psychology, Stanford University, Stanford, CA 94305, USA
We introduce a new form of computational unit for feedforward learning networks of the backpropagation type. Instead of calculating a weighted sum, this unit calculates a weighted product, where each input is raised to a power determined by a variable weight. Such a unit can learn an arbitrary polynomial term, which would then feed into higher level standard summing units. We show how learning operates with product units, provide examples to show their efficiency for various types of problems, and argue that they naturally extend the family of theoretical feedforward net structures. There is a plausible neurobiological interpretation for one interesting configuration of product and summing units.

1 Introduction
The success of multilayer networks based on generalized linear threshold units depends on the fact that many real-world problems can be well modeled by discriminations based on linear combinations of the input variables. What about problems for which this is not so? It is clear that for some tasks higher order combinations of some of the inputs, or ratios of inputs, may be appropriate to help form a good representation for solving the problem (for example, cross-correlation terms can give translational invariance). This observation led to the proposal of "sigma-pi units," which apply a weight not only to each input, but also to all second and possibly higher order products of inputs (Rumelhart, Hinton, and McClelland 1986; Maxwell et al. 1987). The weighted sum of all these terms is then passed through a non-linear thresholding function. The problem with sigma-pi units is that the number of terms, and therefore weights, increases very rapidly with the number of inputs, and becomes unacceptably large for use in many situations. Normally only one or a few of the non-linear terms are relevant. We therefore propose a different type of unit, which represents a single higher order term, but learns which one to represent. The output of this unit, which we will call a product unit, is

$$ y = \prod_{i=1}^{N} x_i^{p_i} \tag{1.1} $$
Figure 1: Two suggested forms of possible network incorporating product units. Product units are shown with a Π and summing units with a Σ. (a) Each summing unit gets direct connections from the input units, and also from a group of dedicated product units. (b) There are alternating layers of product and summing units, finishing with a summing unit. The output of all our summing units was squashed using the standard logistic function, $1/(1 + e^{-x})$; no non-linear function was applied to the output from product units.
We will treat the $p_i$ in the same way as variable weights, training them by gradient descent on the output sum square error. In fact such units provide much more generality than just allowing polynomial terms, since the $p_i$ can take fractional and negative values, permitting ratios. However, simple products can still be represented by setting the $p_i$ to zero or one. Related types of units were previously considered by Hanson and Burr (1987). There are various ways in which product units could be used in a network. One way is for a few of them to be made available as inputs to a standard thresholded summing unit in addition to the original raw inputs, so that the output can now consider some polynomial terms (Fig. 1a). This approach has a direct neurobiological interpretation (see the discussion). Alternatively there could be a whole hidden layer of product units feeding into a subsequent layer of summing units (Fig. 1b). We do not envision product units replacing summing units altogether; the attractions are rather in mixing them, particularly in alternating layers so that we can form weighted sums of arbitrary products. This is analogous to alternating disjunctive and conjunctive layers in general forms for logical functions.

2 Theory
In order to discuss the equations governing learning in product units it is convenient to rewrite equation (1.1) in terms of exponentials and logarithms.
$$ y = \exp\Bigl(\sum_{i=1}^{N} p_i \log x_i\Bigr) \tag{2.1} $$

In this form we can see that a product unit acts like a summing unit whose inputs are preprocessed by taking logarithms, and whose output is passed through an exponential, rather than a squashing function. If $x_i$ is negative then $\log x_i = \log|x_i| + i\pi$, which is complex, and so equation (2.1) becomes
$$ y = \exp\Bigl(\sum_{i=1}^{N} p_i \log|x_i|\Bigr)\Bigl[\cos\Bigl(\pi\sum_{i=1}^{N} p_i I_i\Bigr) + i\,\sin\Bigl(\pi\sum_{i=1}^{N} p_i I_i\Bigr)\Bigr] \tag{2.2} $$

where $I_i = 1$ if $x_i$ is negative and $I_i = 0$ otherwise. We want to be able to consider negative inputs because the non-linear characteristics of product units, which we want to use computationally, are centered on the origin. There are two main alternatives for dealing with the resulting complex-valued expressions. One is to handle the whole network in the complex domain, and at the end fit the real component to the data (either ignoring the complex component or fitting it to 0). The other is to keep the system in the real domain by ignoring the imaginary component of the output from each product unit, restricting us to real-valued weights. For most problems the latter seems preferable. In the case where all the exponents $p_i$ are integral, as with a true polynomial term, the approximation of ignoring the imaginary component is exact. Given this, we can view ourselves as extending the space of polynomial terms to fractional exponents in a well-behaved fashion, so as to permit smooth learning of the exponents. Additionally, in simulations we seem to gain nothing for the added complexity of working in the complex domain (it doubles the number of equations and weight variables). On the other hand, for some physical problems it may be appropriate to consider complex-valued networks.

In order to train the weights by gradient descent we need to be able to calculate two sets of derivatives for each unit. First we need the derivative of the output $y$ with respect to each weight $p_i$ so as to be able to update the weights. Second we need the derivative with respect to each input $x_i$ so as to be able to propagate the error back to previous layers using the chain rule. Defining

$$ U = \sum_{i=1}^{N} p_i \log|x_i|, \qquad V = \sum_{i=1}^{N} p_i I_i, $$

the equations we need for the real-valued version are

$$ y = e^{U}\cos\pi V, \qquad \frac{\partial y}{\partial p_i} = \log|x_i|\,e^{U}\cos\pi V - \pi I_i\,e^{U}\sin\pi V, \qquad \frac{\partial y}{\partial x_i} = \frac{p_i}{x_i}\,e^{U}\cos\pi V. \tag{2.3} $$
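To make equation (2.3) concrete, the following is a minimal sketch (ours, in NumPy; the function name, test sizes, and tolerances are not from the paper) of a real-valued product unit together with its two sets of derivatives, checked against finite differences:

```python
import numpy as np

def product_unit(x, p):
    """Real-valued product unit of equation (2.3): y = e^U cos(pi V)."""
    I = (x < 0).astype(float)            # I_i = 1 where x_i is negative
    U = np.sum(p * np.log(np.abs(x)))    # U = sum_i p_i log|x_i|
    V = np.sum(p * I)                    # V = sum_i p_i I_i
    y = np.exp(U) * np.cos(np.pi * V)
    dy_dp = (np.log(np.abs(x)) * np.exp(U) * np.cos(np.pi * V)
             - np.pi * I * np.exp(U) * np.sin(np.pi * V))
    dy_dx = (p / x) * np.exp(U) * np.cos(np.pi * V)
    return y, dy_dp, dy_dx

# finite-difference check of dy/dp_i
rng = np.random.default_rng(0)
x, p = rng.normal(size=4), rng.normal(size=4)
_, dy_dp, _ = product_unit(x, p)
for i in range(4):
    d = np.zeros(4); d[i] = 1e-6
    num = (product_unit(x, p + d)[0] - product_unit(x, p - d)[0]) / 2e-6
    assert abs(num - dy_dp[i]) < 1e-4 * max(1.0, abs(dy_dp[i]))
```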
It is possible to add an extra constant input to a product unit, corresponding to the bias for a summing unit. In this case the appropriate constant is $-1$, since a positive value would simply multiply the output by a scalar, which is irrelevant when there is a variable multiplicative weight from the output to a higher level summing unit. Although this multiplicative bias is often eventually redundant, we have found it to be important during the learning process for some tasks, such as the symmetry task (see below and Fig. 2). One property that we have lost with product units is that they are vulnerable to translation and rotation of the input space, in the sense that a learnable problem may no longer be learnable after translation. Summing units with a threshold are not vulnerable to such transformations. If desired, we can regain translational invariance by introducing new parameters $\rho_i$ to allow an explicit change of origin. This would replace $x_i$ by $(x_i - \rho_i)$ in all the above equations. We can once again learn the $\rho_i$ by gradient descent. With the $\rho_i$ present a product unit can approximate a linear threshold unit arbitrarily closely, by working on only a small region of the exponential function. Alternatively, we can notice that the rotational and translational vulnerability of single product units is in part compensated for if a number of them are being used in parallel, which will often be the case. This is because a single product transforms to a set of products in a rotated and translated space. In any case, there may be some benefit to the asymmetry of a product unit's capabilities under affine transformation of the input space. For non-geometric sets of input variables this type of extra computational power may well be useful.

3 Results
Many of the problems that are studied using networks use Boolean input. For product units it is best to use Boolean values $-1$ and $1$, in which case the exponential terms in equations (2.3) disappear, and the units behave like cosine summing units with 1 and 0 inputs. Examples of the use of product units for learning Boolean functions are provided by networks that learn the parity and symmetry functions. These functions are hard to learn using summing units: the parity function requires as many hidden units as inputs, while symmetry requires 2 hidden units, but often gets stuck in a local minimum unless more are given. Both functions are learned rapidly using a single product hidden unit (Fig. 2a,b). A good example of a problem that multilayer nets with product units find good solutions for is the multiplexing task shown in figure 2c. Here two of the inputs code which of the four remaining inputs to output. This task has a biological interpretation as an attentional mechanism, and is therefore relevant for computational models of sensory information processing. Indeed, the neurobiological interpretation of just the type of hybrid net
Figure 2: Examples of product unit networks that solve "hard" binary problems. In each case there is a standard thresholded summing output unit (Σ) and one or more "hidden" product units (Π). The weight values are shown by each arrow, and there is also a constant bias value shown inside each unit's circle. Product unit biases can be considered to have constant $-1$ input (see text). In each case the network was found by training from data. (a) Parity. The output is 1 if an even number of inputs is on, 0 if an odd number is on. (b) Symmetry. The output is 1 if the input pattern is mirror symmetric (as shown here), 0 otherwise. For summing unit network solutions to the symmetry and parity problems see (Rumelhart, Hinton, and Williams 1986). (c) Multiplexer. Here the values of the two left-hand input units encode in binary fashion which of the four right-hand inputs is transmitted to the output. Examples are shown. Where there is a dot the value of the input unit (1 or $-1$) is irrelevant. An "x" stands for either 1 or $-1$.
used here (see below) suggests a substrate and mechanism for attentional processes in the brain. We can measure the informational capacity of a unit by the number of random Boolean patterns that it can learn (more precisely, the number at which the probability of storing them all perfectly drops to a half;
Structure               M     Product unit %    Summing unit %
"6 1"                   12          92                18
                        18          49                 3
                        20          29                 1
"12 1"                  24         100                 2
                        36          66                 0
                        40          20                 0
"6 2 1"                 24         100                28
                        36          82                 0
                        40          58                 0
"6 2 1" fixed output    24         100                 0
                        36          45                 0
                        40          14                 0
Table 1: Results on storage of random data. The number of successful storage attempts in 100 trials is shown in the last two columns for various net structures and numbers of vectors, $M$. Storage is termed successful if all input vectors produce output on the correct side of 0.5. Input vectors were random, $x_i = -1$ or 1, and output values for each vector were random 0 or 1. The "6 1" and "12 1" nets had a single learning unit with 6 or 12 inputs. For these comparisons the output of a product unit was passed through the standard summing unit squashing function, $e^x/(1 + e^x)$. The single summing units do not attain the $M = 2N$ theoretical limit (Cover 1965), presumably because the squashing function output creates local minima not present for a simple perceptron. The "6 2 1" nets had 2 hidden units (either product or summing) and one summing output unit, which was trainable for the first set of results, and fixed with all weights equal for the second set. These results indicate that storage capacity for product units is at least 3 bits per weight, as opposed to no more than 2 bits per weight for summing units, and that fixed output units do not drastically reduce computational power in multilayer networks.

Mitchison and Durbin 1989). For a single summing unit with $N$ inputs the capacity can be shown theoretically to be $2N$ (Cover 1965). The empirical capacity of a single product unit is significantly higher than this at around $3N$ (Table 1). The relative improvement is maintained in a comparison of multilayer networks with product hidden units compared with ones consisting purely of summing units (Table 1), indicating that product units cooperate well with summing units.
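The storage experiment summarized in Table 1 can be sketched as follows. This is our reconstruction in NumPy, and the training details (learning rate, epoch count, and the omission of the multiplicative bias) are assumptions, so the exact success counts need not match the table:

```python
import numpy as np

def try_store(n, M, lr=0.5, epochs=3000, rng=None):
    """Train one product unit (logistic output, as in Table 1) to store
    M random +-1 input vectors with random 0/1 targets."""
    X = rng.choice([-1.0, 1.0], size=(M, n))
    t = rng.integers(0, 2, size=M).astype(float)
    I = (X < 0).astype(float)       # with +-1 inputs U = 0, so y = cos(pi V)
    p = rng.normal(scale=0.1, size=n)
    for _ in range(epochs):
        V = I @ p
        out = 1.0 / (1.0 + np.exp(-np.cos(np.pi * V)))
        dV = (out - t) * out * (1.0 - out) * (-np.pi * np.sin(np.pi * V))
        p -= lr * (dV @ I) / M      # gradient of the output sum square error
    ok = (1.0 / (1.0 + np.exp(-np.cos(np.pi * (I @ p)))) > 0.5) == (t > 0.5)
    return ok.all()

rng = np.random.default_rng(1)
for M in (12, 18, 20):
    print(M, sum(try_store(6, M, rng=rng) for _ in range(20)), "/ 20 stored")
```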
Figure 3: Performance on a task with real-valued input: learning a circular domain. In each case the network and a plot of its response function over the range $-2.0$ to $2.0$ are shown. A variety of local minima were found using two product unit networks (which in theory could solve the problem exactly), whereas there was only one solution found using two summing units. Although non-optimal, the product unit solutions were always better than the summing unit solutions. (a) The ideal product unit network, used to generate the data. (b) An example of a good empirical solution with product units (2% misclassified, MSE 0.03). (c) An example of a poor product unit local minimum (13% misclassified, MSE 0.10). (d) The solution essentially always obtained with two summing hidden units (38% misclassified, MSE 0.24).

An example is the ability of a network with two product hidden units to learn to respond to a circular region around the origin. In fact it appears that there are many local minima for this problem, and although the network occasionally finds the "correct" solution (Fig. 3a), it more often finds other solutions such as those shown in figure 3b,c. However these solutions are not bad: the average mean square error (MSE) for product unit networks is 0.09 (average of 10) with 88% correct data classification (whether the output is the correct side of 0.5), whereas the best corresponding summing unit network gives an MSE of 0.24, and only 62% correct classification.
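As a worked illustration of the Boolean results above, the following sketch (ours) verifies that a single product unit with all exponents set to 1 computes parity on $\pm 1$ inputs: with $x_i = \pm 1$ we have $U = 0$, so $y = \cos\pi V = \prod_i x_i$.

```python
import numpy as np
from itertools import product

n = 6
p = np.ones(n)                           # unit exponents: the unit computes prod(x_i)
for x in product([-1.0, 1.0], repeat=n):
    I = (np.asarray(x) < 0).astype(float)
    y = np.cos(np.pi * np.dot(p, I))     # U = 0 for +-1 inputs
    assert round(y) == round(np.prod(x)) # +1 for an even number of -1s, else -1
print("a single product unit computes 6-bit parity")
```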
4 Discussion
We have proposed a new type of computational unit to be used in layered networks along with standard thresholded summing units. The underlying idea behind this unit is that it can learn to represent any generalized polynomial term in the inputs. It can therefore help to form a better representation of the data in cases where higher order combinations of the inputs are significant. Unlike sigma-pi units, which to some extent perform the same task, product units do not increase the number of free parameters, since there is only one weight per input, as with summing units. Although we have been unable to prove that product units are guaranteed to learn a learnable task, as can be shown for a single simplified summing unit (Rosenblatt 1962), we have shown that product units can be trained efficiently using gradient descent, and allow much simpler solutions of various standard learning problems. In addition, as isolated units they have a higher empirical learning capacity than summing units, and they act efficiently to create a hidden layer representation for an output summing unit (Table 1). There is a natural neurobiological interpretation for this type of combination of product and summing units in terms of a single neuron. Local regions of dendritic arbor could act as product units whose outputs are summed at the soma. Equation (2.1) shows that a product unit acts like a summing unit with an exponential output function, whose inputs are preprocessed by passing them through a log function. Both these transfer functions are realistic. When there are voltage sensitive dendritic channels, such as NMDA receptors, the post-synaptic voltage response is qualitatively exponential around a critical voltage level (Collingridge and Bliss 1987), an effect that will be influenced by other local input apart from the specific input at the synapse. Presynaptically, there are saturation effects giving an approximately logarithmic form to the voltage dependency of transmitter release. In fact just these features have been presented as problems with the standard thresholded summing model of neurons. Standard summing inputs could still be made using neurotransmitters that do not stimulate voltage sensitive channels. As far as learning in the biological model is concerned, it is acceptable that the second layer summing weights, corresponding to the degree of influence of dendritic regions at the soma, are not very variable. Systems with fixed summing output layers are nearly as computationally powerful as fully variable ones, both in theory (Mitchison and Durbin 1989) and in simulations (Table 1). Learning at the input synapses is still essentially Hebbian (the $\partial y/\partial p_i$ term in equation (2.3)), with an additional term when the input $x_i$ is negative. Although the periodic form of this term appears unbiological, some type of additional term is not unreasonable for inhibitory input, which may well have different learning characteristics. Alternatively, it might be that the learning model only applies to excitatory input. Further consideration of this neurobiological model is required, but it seems likely that this
approach will lead to a plausible new computational model of a neuron that is potentially much more powerful than the standard McCulloch-Pitts model. One possible criticism of introducing a new type of unit is that it is trivially going to improve the representational capabilities of the networks: one can always improve a fit to data by making a model more complex, and this is rarely worth the price of throwing away elegance. The defence to this must be that the extension is in some sense natural, which we believe that it is. Product units provide the continuous analogy to general Boolean conjunctions in the same way that summing units are continuous analogs of Boolean disjunctions (although both continuous forms are much more powerful, sufficiently so that either can represent any arbitrary disjunction or conjunction on Boolean input). In fact many of the proofs of capabilities of networks to perform general tasks rely on the "abuse" of thresholded summing units to perform multiplicative or conjunctive tasks, often in alternating layers with units being used in an additive or disjunctive fashion. Such proofs will be much simpler for networks with both product and summing units, indicating that such networks are more appropriate for finding simple models of general data. It might be argued that in opening up such generality the special properties of learning networks will be lost, because they no longer provide strong constraints on the type of model that is created. We feel that this misses the point. The real justification for layered network models appears when a number of different output functions are fit to some set of data. By using a layered model the fit of each function influences and constrains the fit of all the others. If there is some underlying natural representation this will be modeled by the intermediate layers, since it will be appropriate for all the output functions. This cross-constraining of learning is not easily available in many other systems, which therefore miss out on a vast amount of data that is relevant, although indirectly so. Product units provide a natural extension of the use of summing units in this framework.
Acknowledgments
R.M.D. is a Lucille P. Markey Visiting Fellow at Stanford University. We thank T.J. Sejnowski for pointing out the neurobiological interpretation.

References

Collingridge, G.L. and T.V.P. Bliss. 1987. NMDA receptors - their role in long-term potentiation. Trends Neurosci. 10, 288-293. Cover, T. 1965. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Elect. Comp. 14, 326-334.
Hanson, S.J. and D.J. Burr. 1987. Knowledge Representation in Connectionist Networks. Technical Report, Bell Communications Research, Morristown, NJ. Maxwell, T., C.G. Giles, and Y.C. Lee. 1987. Generalization in Neural Networks, the Contiguity Problem. In: Proceedings IEEE First International Conference on Neural Networks 2, 41-45. Mitchison, G.J. and R.M. Durbin. 1989. Bounds on the Learning Capacity of Some Multilayer Networks. Biological Cybernetics, in press. Rosenblatt, F. 1962. Principles of Neurodynamics. New York: Spartan. Rumelhart, D.E., G.E. Hinton, and J.L. McClelland. 1986. A General Framework for Parallel Distributed Processing. In: Parallel Distributed Processing 1, 45-76. Cambridge, MA, and London: MIT Press. Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing 1, 318-362. Cambridge, MA, and London: MIT Press.
Received 11 November; accepted 17 December 1988.
Communicated by Scott Kirkpatrick
Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space Geoffrey E. Hinton Department of Computer Science, University of Toronto, 10 King's College Road, Toronto M5S 1A4, Canada
The Boltzmann machine learning procedure has been successfully applied in deterministic networks of analog units that use a mean field approximation to efficiently simulate a truly stochastic system (Peterson and Anderson 1987). This type of "deterministic Boltzmann machine" (DBM) learns much faster than the equivalent "stochastic Boltzmann machine" (SBM), but since the learning procedure for DBM's is only based on an analogy with SBM's, there is no existing proof that it performs gradient descent in any function, and it has only been justified by simulations. By using the appropriate interpretation for the way in which a DBM represents the probability of an output vector given an input vector, it is shown that the DBM performs steepest descent in the same function as the original SBM, except at rare discontinuities. A very simple way of forcing the weights to become symmetrical is also described, and this makes the DBM more biologically plausible than back-propagation (Werbos 1974; Parker 1985; Rumelhart et al. 1986).

1 Introduction
The promising results obtained by Peterson and Anderson (1987) using a DBM are hard to assess because they present no mathematical guarantee that the learning does gradient descent in any error function (except in the limiting case of a very large net with small random weights). It is quite conceivable that in a DBM the computed gradient might have a small systematic difference from the true gradient of the normal performance measure for each training case, and when these slightly incorrect gradients are added together over many cases their resultant might bear little relation to the resultant of the true casewise gradients (see Fig. 1).

2 The Learning Procedure for Stochastic Boltzmann Machines
A Boltzmann machine (Hinton and Sejnowski 1986) is a network of symmetrically connected binary units that asynchronously update their states
according to a stochastic decision rule. The units have states of 1 or 0 and the probability that unit $i$ adopts the state 1 is given by

$$ p_i = \sigma\Bigl(\frac{1}{T}\sum_j s_j w_{ij}\Bigr) \tag{2.1} $$

where $s_j$ is the state of the $j$th unit, $w_{ij}$ is the weight on the connection between the $j$th and the $i$th unit, $T$ is the "temperature," and $\sigma$ is a smooth non-linear function defined as

$$ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.2} $$
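A minimal sketch (ours, in NumPy; the network size and sweep order are illustrative choices) of this stochastic decision rule:

```python
import numpy as np

def gibbs_sweep(s, w, T=1.0, rng=None):
    """One asynchronous sweep of equation 2.1: each unit i adopts
    state 1 with probability sigma((1/T) * sum_j s_j w_ij)."""
    for i in rng.permutation(len(s)):
        prob = 1.0 / (1.0 + np.exp(-np.dot(w[i], s) / T))
        s[i] = 1.0 if rng.random() < prob else 0.0
    return s

rng = np.random.default_rng(0)
w = rng.normal(size=(5, 5)); w = (w + w.T) / 2.0
np.fill_diagonal(w, 0.0)          # symmetric weights, no self-connections
s = rng.integers(0, 2, size=5).astype(float)
for _ in range(100):              # repeated updates approach thermal equilibrium
    s = gibbs_sweep(s, w, rng=rng)
```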
If the binary states of units are updated asynchronously and repeatedly using equation 2.1, the network will reach "thermal equilibrium" so that the relative probabilities of global configurations are determined by their energies according to the Boltzmann distribution:

$$ \frac{P_\alpha}{P_\beta} = e^{-(E_\alpha - E_\beta)/T} \tag{2.3} $$

where $P_\alpha$ is the probability of a global configuration $\alpha$ and $E_\alpha$ is its energy, defined by

$$ E_\alpha = -\sum_{i<j} s_i^\alpha s_j^\alpha w_{ij} \tag{2.4} $$

where $s_i^\alpha$ is the binary state of unit $i$ in the $\alpha$th global configuration, and bias terms are ignored because they can always be treated as weights on connections from a permanently active unit. At any given temperature, $T$, the Boltzmann distribution is the one that minimizes the Helmholtz free energy, $F$, of the distribution. $F$ is defined by the equation
$$ F = \langle E \rangle - TH \tag{2.5} $$

where $\langle E \rangle$ is the expected value of the energy given the probability distribution over configurations and $H$ is the entropy of the distribution. It can be shown that minima of $F$ (which will be denoted by $F^*$) satisfy the equation

$$ F^* = -T\log\sum_\alpha e^{-E_\alpha/T} \tag{2.6} $$

Figure 1: The true gradients of the performance measure are $a$ and $b$ for two training cases. Even fairly accurate estimates, $\hat{a}$ and $\hat{b}$, can have a resultant that points in a very different direction.
In a stochastic Boltzmann machine, the probability of an output vector, $O_\beta$, given an input vector, $I_\alpha$, is represented by

$$ P^-(O_\beta \mid I_\alpha) = \frac{e^{-F^*_{\alpha\beta}/T}}{e^{-F^*_\alpha/T}} \tag{2.7} $$

where $F^*_{\alpha\beta}$ is the minimum free energy with $I_\alpha$ and $O_\beta$ clamped, and $F^*_\alpha$ is the minimum free energy with just $I_\alpha$ clamped. A very natural way to observe $P^-(O_\beta \mid I_\alpha)$ is to allow the network to reach thermal equilibrium with $I_\alpha$ clamped, and to observe the probability of $O_\beta$. The key to Boltzmann machine learning is the simple way in which a small change to a weight, $w_{ij}$, affects the free energy and hence the log probability of an output vector in a network at thermal equilibrium:

$$ \frac{\partial F^*}{\partial w_{ij}} = -\langle s_i s_j \rangle \tag{2.8} $$

where $\langle s_i s_j \rangle$ is the expected value of $s_i s_j$ in the minimum free energy distribution. The simple relationship between weight changes and log probabilities of output vectors makes it easy to teach the network an input-output mapping. The network is "shown" the mapping that it is required to perform by clamping an input vector on the input units and clamping the required output vector on the output units (with the appropriate conditional probability). It is then allowed to reach thermal equilibrium at $T = 1$, and at equilibrium each connection measures how often the units it connects are simultaneously active. This is repeated for all input-output pairs so that each connection can measure $\langle s_i s_j \rangle^+$, the expected probability, averaged over all cases, that unit $i$ and unit $j$ are simultaneously active at thermal equilibrium when the input and output vectors are both clamped. The network must also be run in just the same way but without clamping the output units to measure $\langle s_i s_j \rangle^-$, the expected probability that both units are active at thermal equilibrium when the output vector is determined by the network. Each weight is then updated by

$$ \Delta w_{ij} = \epsilon\bigl(\langle s_i s_j \rangle^+ - \langle s_i s_j \rangle^-\bigr) \tag{2.9} $$
It follows from equation 2.7 and equation 2.8 that if $\epsilon$ is sufficiently small this performs steepest descent in an information theoretic measure, $G$, of the difference between the behavior of the output units when they are clamped and their behavior when they are not clamped:

$$ G = \sum_{\alpha,\beta} P^+(I_\alpha, O_\beta)\log\frac{P^+(O_\beta \mid I_\alpha)}{P^-(O_\beta \mid I_\alpha)} \tag{2.10} $$
where $I_\alpha$ is a state vector over the input units, $O_\beta$ is a state vector over the output units, $P^+$ is a probability measured at thermal equilibrium when both the input and output units are clamped, and $P^-$ is a probability measured when only the input units are clamped. Stochastic Boltzmann machines learn slowly, partly because of the time required to reach thermal equilibrium and partly because the learning is driven by the difference between two noisy variables, so these variables must be sampled for a long time at thermal equilibrium to reduce the noise. If we could achieve the same simple relationships between log probabilities and weights in a deterministic system, learning would be much faster.

3 Mean field theory
Under certain conditions, a stochastic system can be approximated by a deterministic one by replacing the stochastic binary variables of equation 2.1 by deterministic real-valued variables that represent their mean values:

$$ p_i = \sigma\Bigl(\frac{1}{T}\sum_j p_j w_{ij}\Bigr) \tag{3.1} $$

We could now perform discrete, asynchronous updates of the $p_i$ using equation 3.1, or we could use a synchronous, discrete time approximation of the set of differential equations

$$ \frac{dp_i}{dt} = -p_i + \sigma\Bigl(\frac{1}{T}\sum_j p_j w_{ij}\Bigr) \tag{3.2} $$
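As an illustration, a synchronous discrete-time settling procedure based on equation (3.2) might look like the following sketch (ours; the step size, iteration count, and clamping interface are assumptions):

```python
import numpy as np

def settle(w, p, clamp=None, T=1.0, steps=200, dt=0.1):
    """Discrete-time approximation of equation (3.2):
    dp_i/dt = -p_i + sigma((1/T) * sum_j p_j w_ij)."""
    for _ in range(steps):
        p = p + dt * (1.0 / (1.0 + np.exp(-(w @ p) / T)) - p)
        if clamp:
            for i, v in clamp.items():   # clamped units keep their fixed values
                p[i] = v
    return p

# e.g. settle a 6-unit net with units 0 and 1 clamped to an input vector:
# p = settle(w, np.full(6, 0.5), clamp={0: 1.0, 1: 0.0})
```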
We shall view the $p_i$ as a representation of a probability distribution over all binary global configurations. Since many different distributions can give rise to the same mean values for the $p_i$, we shall assume that the distribution being represented is the one that maximizes the entropy, subject to the constraints imposed on the mean values by the $p_i$. Equivalently, it is the distribution in which the $p_i$ are treated as the mean values of independent stochastic binary variables. Using equation 2.5 we can calculate the free energy of the distribution represented by the state of a DBM (at $T = 1$):
$$ F = -\sum_{i<j} p_i p_j w_{ij} + \sum_i \bigl[p_i \log p_i + (1 - p_i)\log(1 - p_i)\bigr] \tag{3.3} $$
Although the dynamics of the system defined by equation 3.2 do not consist in following the gradient of $F$, it can be shown that the system always moves in a direction that has a positive cosine with the gradient of $-F$, so it settles to one of the minima of $F$ (Hopfield 1984). Mean field systems are normally viewed as approximations to systems that really contain higher order statistics, but they can also be viewed as exact systems that are strongly limited in the probability distributions that they can represent, because they use only $N$ real values to represent distributions over $2^N$ binary states. Within the limits of their representational powers, they are an efficient way of manipulating these large but constrained probability distributions.

4 Deterministic Boltzmann machine learning
In a DBM, we shall define the representation of $P^-(O_\beta \mid I_\alpha)$ exactly as in equation 2.7, but now $F^*_{\alpha\beta}$ and $F^*_\alpha$ will refer to the free energies of the particular minima that the network actually settles into. Unfortunately, in a DBM this representation is no longer equivalent to the obvious way of defining $P^-(O_\beta \mid I_\alpha)$, which is to clamp $I_\alpha$ on the input units, settle to a minimum of $F_\alpha$, and interpret the values of the output units as a representation of a probability distribution over output vectors, using the maximum entropy assumption. The reason for choosing the first definition rather than the second is this: provided the stable states that the network settles to do not change radically when the weights are changed slightly, it can now be shown that the mean field version of the Boltzmann machine learning procedure changes each weight in proportion to the gradient of $\log P^-(O_\beta \mid I_\alpha)$, which is exactly what is required to perform steepest descent in the performance measure $G$ defined in equation 2.10. When $w_{ij}$ is incremented by an infinitesimal amount $\epsilon p_i p_j$, two things happen to $F^*$ (see Fig. 2). First, the mean energy of the probability distribution represented by the state of the DBM is decreased by $\epsilon p_i^2 p_j^2$, and, to first order, the mean energy of all nearby states of the DBM is decreased by the same amount. Second, the values of the $p_i$ at which $F$ is minimized change slightly, so the stable state moves slightly. But, to first order, this movement of the minimum has no effect on the value of $F$ because we are at a stable state in which $\partial F/\partial p_i = 0$ for all $i$. Hence the effect of incrementing $w_{ij}$ by $\epsilon p_i p_j$ is simply to create a new, nearby stable state which, to first order, has a free energy that is $\epsilon p_i^2 p_j^2$ lower than the old stable state. So, assuming $T = 1$, if all weights are incremented by $\epsilon p_i^+ p_j^+$ in the stable state that has $I_\alpha$ and $O_\beta$ clamped and are decremented by $\epsilon p_i^- p_j^-$ in the stable state that has only $I_\alpha$ clamped, we have, from equation 2.7,
$$ \Delta \log P^-(O_\beta \mid I_\alpha) = \epsilon \sum_{i<j}\bigl(p_i^+ p_j^+ - p_i^- p_j^-\bigr)^2 $$
This ensures that by making $\epsilon$ sufficiently small the learning procedure can be made to approximate steepest descent in $G$ arbitrarily closely. The derivation above is invalid if, with the same boundary conditions, a small change in the weights causes the network to settle to a stable state with a very different free energy. This can happen with energy landscapes like the one shown in figure 3. A small weight change caused by some other training case can create a free energy barrier that prevents the network finding the deeper minimum. In simulations that repeatedly sweep through a fixed set of training cases, it is easy to avoid this phenomenon by always starting the network at the stable state that was found using the same boundary conditions on the previous sweep. This has the added advantage of eliminating almost all the computation required to settle on a stable state, thus making a settling almost as fast as a forward pass of the back-propagation procedure. Unfortunately, starting from the previous best state does not eliminate the possibility that a small free-energy barrier will disappear and a much better state will then be found when the network is running with the output units unclamped. This can greatly increase the denominator in equation 2.7 and thus greatly decrease the network's representation of the probability of a correct output vector. It should also be noted that it is conceivable that, due to local minima in the free energy landscape, $F^*_\alpha$ may actually be higher than $F^*_{\alpha\beta}$, in which case the network's representation of $P^-(O_\beta \mid I_\alpha)$ will exceed 1. In practice this does not seem to be a problem, and DBMs compare very favorably with back-propagation in learning speed.
Figure 2: The effect of a small weight increment on a free energy minimum. To first order, the difference in free energy between A and C is equal to the difference between A and B. At a minimum, small changes in the distribution (sideways movements) have negligible effects on free energy, even though they may have significant (and opposite) effects on the energy and the entropy terms.
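Putting the pieces together, one learning step of the procedure described in this section might be sketched as follows (ours); it reuses the hypothetical `settle` function from the previous sketch and assumes input units occupy the first indices and output units the last:

```python
import numpy as np

def dbm_step(w, x, t, prev, lr=0.05):
    """One deterministic Boltzmann learning step at T = 1 for one case:
    settle with input and output clamped, settle with only input clamped,
    then w_ij += lr * (p_i+ p_j+ - p_i- p_j-).  `prev` holds the stable
    states found on the previous sweep, reused as starting points (see text).
    Initialize with prev = {"plus": np.full(n, 0.5), "minus": np.full(n, 0.5)}."""
    n, n_out = len(w), len(t)
    in_clamp = {i: x[i] for i in range(len(x))}
    out_clamp = {n - n_out + k: t[k] for k in range(n_out)}
    plus = settle(w, prev["plus"].copy(), clamp={**in_clamp, **out_clamp})
    minus = settle(w, prev["minus"].copy(), clamp=in_clamp)
    w += lr * (np.outer(plus, plus) - np.outer(minus, minus))  # symmetric update
    np.fill_diagonal(w, 0.0)
    prev["plus"], prev["minus"] = plus, minus
    return w
```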
Figure 3: A small increase in the free energy of B can prevent a network from settling to the free energy minimum at C. So small changes in weights occasionally cause large changes in the final free energy.

5 Symmetry of the Weights
We have assumed that the weight of the connection from $i$ to $j$ is the same as the weight from $j$ to $i$. If these weights are asymmetric, the learning procedure will automatically symmetrize them, provided that, after each weight update, each weight is decayed slightly towards zero by an amount proportional to its magnitude. This favors "simple" networks that have small weights, and it also reduces the energy barriers that create local minima. Weight-decay always reduces the difference between $w_{ij}$ and $w_{ji}$, and since the learning rule specifies weight changes that are exactly symmetrical in $i$ and $j$, the two weights will always approach one another. Williams (1985) makes a similar argument about a different learning procedure. Thus the symmetry that is required to allow the network to compute its own error derivatives is easily achieved, whereas achieving symmetry between forward and backward weights in back-propagation networks requires much more complex schemes (Parker 1985).
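A tiny numerical illustration of this symmetrization argument (ours; the random symmetric updates merely stand in for the learning rule):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # asymmetric initial weights
decay = 0.01
for _ in range(500):
    g = rng.normal(size=(4, 4))
    g = (g + g.T) / 2.0              # any weight change symmetric in i and j
    w += 0.1 * g
    w *= 1.0 - decay                 # decay each weight toward zero after the update
print("max |w_ij - w_ji|:", np.abs(w - w.T).max())   # shrinks as (1 - decay)**500
```

Since the updates leave $w_{ij} - w_{ji}$ unchanged, the decay multiplies the asymmetry by $(1 - \mathrm{decay})$ at every step, so it vanishes geometrically.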
I thank Christopher Longuet-Higgins, Conrad Galland, Scott Kirkpatrick, and Yann Le Cun for helpful comments. This research was supported by grants from the Ontario Information Technology Research Center, the National Science and Engineering Research Council of Canada, and DuPont. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research.
References

Hinton, G.E. and T.J. Sejnowski. 1986. Learning and relearning in Boltzmann machines. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, eds. D.E. Rumelhart, J.L. McClelland, and the PDP group. Cambridge, MA: MIT Press. Hopfield, J.J. 1984. Neurons with Graded Response Have Collective Computational Properties like Those of Two-state Neurons. Proceedings of the National Academy of Sciences U.S.A. 81, 3088-3092. Parker, D.B. 1985. Learning-logic. Technical Report TR-47, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA. Peterson, C. and J.R. Anderson. 1987. A Mean Field Theory Learning Algorithm for Neural Networks. Complex Systems 1, 995-1019. Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning Representations by Back-propagating Errors. Nature 323, 533-536. Werbos, P.J. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD Thesis, Harvard University, Cambridge, MA. Williams, R.J. 1985. Feature Discovery through Error-correction Learning. Technical Report ICS-8501, Institute for Cognitive Science, University of California, San Diego, La Jolla, CA.
Received 5 December; accepted 15 December 1988.
Communicated by Les Valiant
What Size Net Gives Valid Generalization? Eric B. Baum* Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
David Haussler Department of Computer and Information Sciences, University of California, Santa Cruz, CA 95064, USA
We address the question of when a network can be expected to generalize from $m$ random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size. Assume $0 < \epsilon \le 1/8$. We show that if $m \ge O\bigl(\frac{W}{\epsilon}\log\frac{N}{\epsilon}\bigr)$ random examples can be loaded on a feedforward network of linear threshold functions with $N$ nodes and $W$ weights, so that at least a fraction $1 - \epsilon/2$ of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction $1 - \epsilon$ of future test examples drawn from the same distribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than $\Omega\bigl(\frac{W}{\epsilon}\bigr)$ random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a $1 - \epsilon$ fraction of the future test examples.
1 Introduction

In the last few years, many diverse real-world problems have been attacked by back propagation. For example, "expert systems" have been produced for mapping text to phonemes (Sejnowski and Rosenberg 1987), for determining the secondary structure of proteins (Qian and Sejnowski 1988), and for playing backgammon (Tesauro and Sejnowski 1988). In such problems, one starts with a training database, chooses (by making an educated guess) a network, and then uses back propagation to load as many of the training examples as possible onto the network. The hope is that the network so designed will generalize to predict correctly on future examples of the same problem. This hope is not always realized. *Current address: Department of Physics, Princeton University, Princeton, NJ 08540.
Neural Computation 1, 151-160 (1989) @ 1989 Massachusetts Institute of Technology
We address the question of when valid generalization can be expected. Given a training database of $m$ examples, what size net should we attempt to load these on? We will assume that the examples are drawn from some fixed but arbitrary probability distribution, that the learner is given some accuracy parameter $\epsilon$, and that his goal is to produce with high probability a feedforward neural network that predicts correctly at least a fraction $1 - \epsilon$ of future examples drawn from the same distribution. These reasonable assumptions are suggested by the protocol proposed by Valiant for learning from examples (Valiant 1984). However, here we do not assume the existence of any "target function"; indeed the underlying process generating the examples may classify them in a stochastic manner, as in e.g. (Duda and Hart 1973). Our treatment of the problem of valid generalization will be quite general in that the results we give will hold for arbitrary learning algorithms and not just for backpropagation. The results are based on the notion of capacity introduced by Cover (Cover 1965) and developed by Vapnik and Chervonenkis (Vapnik and Chervonenkis 1971; Vapnik 1982). Recent overviews of this theory are given in (Devroye 1988; Blumer et al. 1987b; Pollard 1984), from the various perspectives of pattern recognition, Valiant's computational learning theory, and pure probability theory, respectively. This theory generalizes the simpler counting arguments based on cardinality and entropy used in (Blumer et al. 1987a; Denker et al. 1987), in the latter case specifically to study the question of generalization in feedforward nets (see Vapnik 1982 or Blumer et al. 1987b). The particular measures of capacity we use here are the maximum number of dichotomies that can be induced on $m$ inputs, and the Vapnik-Chervonenkis (VC) dimension, defined below. We give upper and lower bounds on these measures for classes of networks obtained by varying the weights in a fixed feedforward architecture. These results show that the VC dimension is closely related to the number of weights in the architecture, in analogy with the number of coefficients or "degrees of freedom" in regression models. One particular result, of some interest independent of its implications for learning, is a construction of a near minimal size net architecture capable of implementing all dichotomies on a randomly chosen set of points on the $n$-hypercube with high probability. Applying these results, we address the question of when a network can be expected to generalize from $m$ random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Assume $0 < \epsilon \le 1/8$. We show that if $m \ge O\bigl(\frac{W}{\epsilon}\log\frac{N}{\epsilon}\bigr)$ random examples can be loaded on a feedforward network of linear threshold functions with $N$ nodes and $W$ weights, so that at least a fraction $1 - \epsilon/2$ of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction $1 - \epsilon$ of future test examples drawn from the same distribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than $\Omega\bigl(\frac{W}{\epsilon}\bigr)$
random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a $1 - \epsilon$ fraction of the future test examples. Ignoring the constant and logarithmic factors, these results suggest that the appropriate number of training examples is approximately the number of weights times the inverse¹ of the accuracy parameter, $\epsilon$. Thus, for example, if we desire an accuracy level of 90%, corresponding to $\epsilon = 0.1$, we might guess that we would need about 10 times as many training examples as we have weights in the network. This is in fact the rule of thumb suggested by Widrow (1987), and appears to work fairly well in practice. At the end of Section 3, we briefly discuss why learning algorithms that try to minimize the number of non-zero weights in the network (Rumelhart 1987; Hinton 1987) may need fewer training examples.

2 Definitions
We use ln to denote the natural logarithm and log to denote the logarithm base 2. We define an example as a pair $(\bar{x}, a)$, $\bar{x} \in \mathbb{R}^n$, $a \in \{-1,+1\}$. We define a random sample as a sequence of examples drawn independently at random from some distribution $D$ on $\mathbb{R}^n \times \{-1,+1\}$. Let $f$ be a function from $\mathbb{R}^n$ into $\{-1,+1\}$. We define the error of $f$, with respect to $D$, as the probability that $a \ne f(\bar{x})$ for $(\bar{x}, a)$ a random example. Let $F$ be a class of $\{-1,+1\}$-valued functions on $\mathbb{R}^n$ and let $S$ be a set of $m$ points in $\mathbb{R}^n$. A dichotomy of $S$ induced by $f \in F$ is a partition of $S$ into two disjoint subsets $S^+$ and $S^-$ such that $f(\bar{x}) = +1$ for $\bar{x} \in S^+$ and $f(\bar{x}) = -1$ for $\bar{x} \in S^-$. By $\Delta_F(S)$ we denote the number of distinct dichotomies of $S$ induced by functions $f \in F$, and by $\Delta_F(m)$ we denote the maximum of $\Delta_F(S)$ over all $S \subset \mathbb{R}^n$ of cardinality $m$. We say $S$ is shattered by $F$ if $\Delta_F(S) = 2^{|S|}$, i.e. all dichotomies of $S$ can be induced by functions in $F$. The Vapnik-Chervonenkis (VC) dimension of $F$, denoted $\mathrm{VCdim}(F)$, is the cardinality of the largest $S \subset \mathbb{R}^n$ that is shattered by $F$, i.e. the largest $m$ such that $\Delta_F(m) = 2^m$. A feedforward net with input from $\mathbb{R}^n$ is a directed acyclic graph $G$ with an ordered sequence of $n$ source nodes (called inputs) and one sink (called the output). Nodes of $G$ that are not source nodes are called computation nodes; nodes that are neither source nor sink nodes are called hidden nodes. With each computation node $n_i$ there is associated a function $f_i : \mathbb{R}^{\mathrm{indegree}(n_i)} \to \{-1,+1\}$, where $\mathrm{indegree}(n_i)$ is the number of incoming edges for node $n_i$. With the net itself there is associated a func-
¹It should be noted that our bounds differ significantly from those given in (Devroye 1988) in that the latter exhibit a dependence on the inverse of $\epsilon^2$. This is because we derive our results from Vapnik's theorem on the uniform relative deviation of frequencies from their probabilities (Vapnik 1982; see Appendix A3 of Blumer et al. 1987b), giving sharper bounds as $\epsilon$ approaches 0.
tion $f : \mathbb{R}^n \to \{-1,+1\}$ defined by composing the $f_i$'s in the obvious way, assuming that component $i$ of the input $\bar{x}$ is placed at the $i$th input node. A feedforward architecture is a class of feedforward nets all of which share the same underlying graph. Given a graph $G$ we define a feedforward architecture by associating to each computation node $n_i$ a class of functions $F_i$ from $\mathbb{R}^{\mathrm{indegree}(n_i)}$ to $\{-1,+1\}$. The resulting architecture consists of all feedforward nets obtained by choosing a particular function $f_i$ from $F_i$ for each computation node $n_i$. We will identify an architecture with the class of functions computed by the individual nets within the architecture when no confusion will arise.

3 Conditions Sufficient for Valid Generalization
Theorem 1. Let $F$ be a feedforward architecture generated by an underlying graph $G$ with $N \ge 2$ computation nodes and $F_i$ be the class of functions associated with computation node $n_i$ of $G$, $1 \le i \le N$. Let $d = \sum_{i=1}^{N}\mathrm{VCdim}(F_i)$. Then $\Delta_F(m) \le \prod_{i=1}^{N}\Delta_{F_i}(m)$, and $\Delta_F(m) \le (Nem/d)^d$ for $m \ge d$, where $e$ is the base of the natural logarithm.
Proof. Assume $G$ has $n$ input nodes and that the computation nodes of $G$ are ordered so that node $n_i$ receives inputs only from input nodes and from computation nodes $n_j$, $1 \le j \le i - 1$. Let $S$ be a set of $m$ points in $\mathbb{R}^n$. The dichotomy induced on $S$ by the function in node $n_1$ can be chosen in at most $\Delta_{F_1}(m)$ ways. This choice determines the input to node $n_2$ for each of the $m$ points in $S$. The dichotomy induced on these $m$ inputs by the function in node $n_2$ can be chosen in at most $\Delta_{F_2}(m)$ ways, etc. Any dichotomy of $S$ induced by the whole network can be obtained by choosing dichotomies for each of the $n_i$'s in this manner, hence $\Delta_F(m) \le \prod_{i=1}^{N}\Delta_{F_i}(m)$. By a theorem of Sauer (1972), whenever $\mathrm{VCdim}(F) = k < \infty$, $\Delta_F(m) \le (em/k)^k$ for all $m \ge k$ (see also Blumer et al. 1987b). Let $d_i = \mathrm{VCdim}(F_i)$, $1 \le i \le N$; thus $d = \sum_{i=1}^{N} d_i$. Then $\Delta_F(m) \le \prod_{i=1}^{N}(em/d_i)^{d_i}$ for $m \ge d$. Using the fact that $\sum_{i=1}^{N} -\alpha_i \log \alpha_i \le \log N$ whenever $\alpha_i > 0$, $1 \le i \le N$, and $\sum_{i=1}^{N} \alpha_i = 1$, and setting $\alpha_i = d_i/d$, it is easily verified that $\prod_{i=1}^{N} d_i^{d_i} \ge (d/N)^d$. Hence $\prod_{i=1}^{N}(em/d_i)^{d_i} \le (Nem/d)^d$. ∎
Corollary 2. Let $F$ be the class of all functions computed by feedforward nets defined on a fixed underlying graph $G$ with $E$ edges and $N \ge 2$ computation nodes, each of which computes a linear threshold function. Let $W = E + N$ (the total number of weights in the network, including one weight per edge and one threshold per computation node). Then $\Delta_F(m) \le (Nem/W)^W$ for all $m \ge W$ and $\mathrm{VCdim}(F) \le 2W\log(eN)$.

Proof. The first inequality follows directly from Theorem 1 using the fact that $\mathrm{VCdim}(F) = k + 1$ when $F$ is the class of all linear threshold functions on $\mathbb{R}^k$ (see e.g. Wenocur and Dudley 1981). For the second inequality, it
is easily verified that for $N \ge 2$ and $m = 2W\log(eN)$, $(Nem/W)^W < 2^m$. Hence this is an upper bound on $\mathrm{VCdim}(F)$. ∎

Using VC dimension bounds given in (Wenocur and Dudley 1981), related corollaries can be obtained for nets that use spherical and other types of polynomial threshold functions. These bounds can be used in the following.

Theorem 3. (Vapnik 1982) (see Blumer et al. 1987b, Theorem A3.3): Let $F$ be a class of functions² on $\mathbb{R}^n$ and let $0 < \gamma \le 1$ and $0 < \epsilon \le 1$. Then the probability that there exists a function in $F$ that has error greater than $\epsilon$ with respect to $D$, yet disagrees with at most a fraction $(1-\gamma)\epsilon$ of $m$ random examples drawn from $D$, is at most $8\Delta_F(2m)e^{-\gamma^2\epsilon m/4}$.

Corollary 4. Let $F$ be the class of functions computed by a feedforward net of linear threshold functions as in Corollary 2, with $N \ge 2$ computation nodes and $W$ weights, and let $0 < \epsilon \le 1/2$. Then for

$$ m \ge \frac{32W}{\epsilon}\ln\frac{32N}{\epsilon}, $$
if one can find a choice of weights so that at least a fraction $1 - \epsilon/2$ of the $m$ training examples are correctly loaded, then one has confidence at least $1 - 8e^{-1.5W}$ that the net will correctly classify all but a fraction $\epsilon$ of future examples drawn from the same distribution. For

$$ m \ge \frac{64W}{\epsilon}\ln\frac{64N}{\epsilon}, $$

the confidence is at least $1 - 8e^{-\epsilon m/32}$.
Proof. Let $\gamma = 1/2$ and apply Theorem 3, using the bound on $\Delta_F(m)$ given in Corollary 2. This shows that the probability that there exists a choice of the weights that defines a function with error greater than $\epsilon$ that is consistent with at least a fraction $1 - \epsilon/2$ of the training examples is at most $8(2Nem/W)^W e^{-\epsilon m/16}$. When $m = \frac{32W}{\epsilon}\ln\frac{32N}{\epsilon}$, this is $8\bigl(\frac{2e\epsilon}{32N}\ln\frac{32N}{\epsilon}\bigr)^W$, which is less than $8e^{-1.5W}$ for $N \ge 2$ and $\epsilon \le 1/2$. When $m \ge \frac{64W}{\epsilon}\ln\frac{64N}{\epsilon}$, $(2Nem/W)^W \le e^{\epsilon m/32}$, so $8(2Nem/W)^W e^{-\epsilon m/16} \le 8e^{-\epsilon m/32}$. ∎

The constant 32 is likely an overestimate. No serious attempt has been made to minimize it. Further, we do not know if the log term is unavoidable. Nevertheless, even without these terms, for nets with many weights this may represent a considerable number of examples.

²We assume some measurability conditions on the class $F$. See (Pollard 1984; Blumer et al. 1987b) for details.
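For a sense of scale, the Corollary 4 bound is easy to evaluate numerically; the sketch below (ours, for an arbitrarily chosen fully-connected 10-20-1 threshold net) compares it with the $W/\epsilon$ rule of thumb discussed in the introduction:

```python
import math

def corollary4_m(W, N, eps):
    """Sufficient sample size from Corollary 4: m >= (32W/eps) ln(32N/eps)."""
    return math.ceil(32 * W / eps * math.log(32 * N / eps))

# 10-20-1 net: 21 computation nodes, 220 edges + 21 thresholds = 241 weights
W, N, eps = 241, 21, 0.1
print(corollary4_m(W, N, eps))   # roughly 680,000 examples
print(math.ceil(W / eps))        # rule of thumb: 2,410 examples
```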
Such nets are common in cases where the complexity of the rule being learned is not known in advance, so a large architecture is chosen in order to increase the chances that the rule can be represented. To counteract the concomitant increase in the size of the training sample needed, one method that has been explored is the use of learning algorithms that try to use as little of the architecture as possible to load the examples, e.g. by setting as many weights to zero as possible, and by removing as many nodes as possible (a node can be removed if all its incoming weights are zero) (Rumelhart 1987; Hinton 1987). The following shows that the VC dimension of such a "reduced" architecture is not much larger than what one would get if one knew a priori what nodes and edges could be deleted.

Corollary 5. Let $F$ be the class of all functions computed by linear threshold feedforward nets defined on a fixed underlying graph $G$ with $N' \ge 2$ computation nodes and $E' \ge N'$ edges, such that at most $E \ge 2$ edges have non-zero weights and at most $N \ge 2$ nodes have at least one incoming edge with a non-zero weight. Let $W = E + N$. Then the conclusion of Corollary 4 holds for sample size
$$ m \ge \frac{32W}{\epsilon}\ln\frac{32NE'}{\epsilon}. $$
Proof sketch. We can bound $\Delta_F(m)$ by considering the number of ways the $N$ nodes and $E$ edges that remain can be chosen from among those in the initial network. A crude upper bound is $(N')^N (E')^E$. Applying Corollary 2 to the remaining network gives $\Delta_F(m) \le (N')^N (E')^E (Nem/W)^W$. This is at most $(NE'em/W)^W$. The rest of the analysis is similar to that in Corollary 4. ∎

This indicates that minimizing non-zero weights may be a fruitful approach. Similar approaches in other learning contexts are discussed in (Haussler 1988) and (Littlestone 1988).

4 Conditions Necessary for Valid Generalization
The following general theorem gives a lower bound on the number of examples needed for distribution-free learning, regardless of the algorithm used.

Theorem 6. (Ehrenfeucht et al. 1987; see also Blumer et al. 1987b): Let $F$ be a class of $\{-1,+1\}$-valued functions on $\mathbb{R}^n$ with $\mathrm{VCdim}(F) \ge 2$. Let $A$ be any learning algorithm that takes as input a sequence of $\{-1,+1\}$-labeled examples over $\mathbb{R}^n$ and produces as output a function from $\mathbb{R}^n$ into $\{-1,+1\}$. Then for any $0 < \epsilon \le 1/8$, $0 < \delta \le 1/100$, and

$$ m < \max\Bigl[\frac{1-\epsilon}{\epsilon}\ln\frac{1}{\delta},\; \frac{\mathrm{VCdim}(F)-1}{32\epsilon}\Bigr], $$
there exists (1) a function $f \in F$, and (2) a distribution $D$ on $\mathbb{R}^n \times \{-1,+1\}$ for which $\mathrm{Prob}\{(\bar{x}, a) : a \ne f(\bar{x})\} = 0$, such that given a random sample of size $m$ chosen according to $D$, with probability at least $\delta$, $A$ produces a function with error greater than $\epsilon$.
This theorem can be used to obtain a lower bound on the number of examples needed to train a net, assuming that the examples are drawn from the worst-case distribution that is consistent with some function realizable on that net. We need only obtain lower bounds on the VC dimension of the associated architecture. In this section we will specialize by considering only fully-connected networks of linear threshold units that have only one hidden layer. Thus each hidden node will have an incoming edge from each input node and an outgoing edge to the output node, and no other edges will be present. In (Baum 1988) a slicing construction is given that shows that a one hidden layer net of threshold units with $n$ inputs and $2j$ hidden units can shatter an arbitrary set of $2jn$ vectors in general position in $\mathbb{R}^n$. A corollary of this result is:

Theorem 7. The class of one hidden layer linear threshold nets taking input from $\mathbb{R}^n$ with $k$ hidden units has VC dimension at least $2\lfloor k/2\rfloor n$.

Note that for large $k$ and $n$, $2\lfloor k/2\rfloor n$ is approximately equal to the total number $W$ of weights in the network. A special case of considerable interest occurs when the domain is restricted to the hypercube $\{+1,-1\}^n$. Lemma 6 of (Littlestone 1988) shows that the class of Boolean functions on $\{+1,-1\}^n$ represented by disjunctive normal form expressions with $k$ terms, $k \le O(2^{n/2}/\sqrt{n})$, where each term is the conjunction of $n/2$ literals, has VC dimension at least $kn/4$. Since these functions can be represented on a linear threshold net with one hidden layer of $k$ units, this provides a lower bound on the VC dimension of this architecture. We also can use the slicing construction of (Baum 1988) to give a lower bound approaching $kn/2$. The actual result is somewhat stronger in that it shows that for large $n$ a randomly chosen set of approximately $kn/2$ vectors is shattered with high probability.

Theorem 8. With probability approaching 1 exponentially in $n$, a set $S$ of $m \le 2^{n/3}$ vectors chosen randomly and uniformly from $\{+1,-1\}^n$ can be shattered by the one hidden layer architecture with $2\lceil m/\lfloor n(1 - 10/\ln n)\rfloor\rceil$ linear threshold units in its hidden layer.
Proof sketch. With probability approaching 1 exponentially in $n$, no pair of vectors in $S$ are negations of each other. Assume $n \ge e^{20}$. Let $r = \lfloor n(1 - 10/\ln n)\rfloor$. Divide $S$ at random into $\lceil m/r\rceil$ disjoint subsets $S_1, \ldots, S_{\lceil m/r\rceil}$, each containing no more than $r$ vectors. We will describe a set $T$ of at most $r$ $\pm 1$ vectors as sliceable if the vectors in $T$ are linearly independent and the subspace they span over the reals does not contain any $\pm 1$ vector other than the vectors in $T$ and their negations. In (Odlyzko 1988) it
is shown that, for large n, a random set of r vectors has probability at most O(n²(3/4)^n) of not being sliceable. Thus the probability that some S_i is not sliceable is O(mn²(3/4)^n), which is exponentially small for m ≤ 2^{n/3}. Hence with probability approaching 1 exponentially in n, each S_i is sliceable, 1 ≤ i ≤ ⌈m/r⌉. Consider any Boolean function f on S and let S_i⁺ = {x ∈ S_i : f(x) = +1}, 1 ≤ i ≤ ⌈m/r⌉. If S_i is sliceable and no pair of vectors in S are negations of each other, then we may pass a plane through the points in S_i⁺ that doesn't contain any other points in S. Shifting this plane parallel to itself slightly, we can construct two half spaces whose intersection forms a slice of ℝ^n containing S_i⁺ and no other points in S. Using threshold units at the hidden layer recognizing these two half spaces, with weights to the output unit +1 and −1 appropriately, the output unit receives input +2 for any point in the slice and 0 for any point not in the slice. Doing this for each S_i⁺ and thresholding at 1 implements the function f. □

We can now apply Theorem 6 to show that any neural net learning algorithm using too few examples will be fooled by some reasonable distributions.

Corollary 6. For any learning algorithm training a net with k linear threshold functions in its hidden layer, and 0 < ε ≤ 1/8, if the algorithm uses (a) fewer than (2⌊k/2⌋n − 1)/(32ε) examples to learn a function from ℝ^n to {−1,+1}, or (b) fewer than (⌊n⌊k/2⌋ max(1/2, 1 − 10/(ln n))⌋ − 1)/(32ε) examples to learn a function from {−1,+1}^n to {−1,+1}, for k ≤ O(2^{n/3}), then there exist distributions D for which (i) there exists a choice of weights such that the network exactly classifies its inputs according to D, but (ii) the learning algorithm will have probability at least 0.01 of finding a choice of weights which in fact has error greater than ε.
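As a rough numeric illustration - a sketch, not a calculation from the paper - one can plug sample values into the sufficient sample size O((W/ε)log(N/ε)) from the earlier corollaries and the necessary sample size implied by Theorem 4 with Corollary 6(a), using the constants as reconstructed above (which, as the Conclusion notes, are in any case crude):

from math import log

def sufficient(W, N, eps):
    # (32W/eps) ln(32N/eps): the O((W/eps) log(N/eps)) sufficient condition;
    # the constant 32 is an assumption of this sketch
    return (32 * W / eps) * log(32 * N / eps)

def necessary(n, k, eps):
    # (2*floor(k/2)*n - 1)/(32*eps): Theorem 4's VC bound fed into Corollary 6(a)
    return (2 * (k // 2) * n - 1) / (32 * eps)

n, k, eps = 100, 20, 0.1      # inputs, hidden units, target error rate
W = n * k + 2 * k + 1         # weights plus thresholds of a one hidden layer net
N = k + 1                     # computation nodes
print("sufficient ~ %.0f examples" % sufficient(W, N, eps))
print("necessary  > %.0f examples" % necessary(n, k, eps))

The two numbers differ by roughly the log factor and the gap between the constants, which is exactly the gap the Conclusion below identifies as an open problem.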
5 Conclusion
We have given theoretical lower and upper bounds on the sample size vs. net size needed such that valid generalization can be expected. The exact constants we have given in these formulae are still quite crude; it may be expected that the actual values are closer to 1. The logarithmic factor in Corollary 4 may also not be needed, at least for the types of distributions and architectures seen in practice. Widrow's experience supports this conjecture (Widrow 1987). However, closing the theoretical gap between the O((W/ε)log(N/ε)) upper bound and the Ω(W/ε) lower bound on the worst case sample size for architectures with one hidden layer of threshold units remains an interesting open problem. Also, apart from our upper bound, the case of multiple hidden layers is largely open. Finally, our bounds are obtained under the assumption that the node functions are linear threshold functions (or at least Boolean valued). We conjecture
that similar bounds also hold for classes of real valued functions such as sigmoid functions, and hope shortly to establish this.
Acknowledgments

We would like to thank Ron Rivest for suggestions on improving the bounds given in Corollaries 4 and 5 in an earlier draft of this paper, and Nick Littlestone for many helpful comments. The research of E. Baum was performed by the Jet Propulsion Laboratory, California Institute of Technology, as part of its Innovative Space Technology Center, which is sponsored by the Strategic Defense Initiative Organization/Innovative Science and Technology through an agreement with the National Aeronautics and Space Administration (NASA). D. Haussler gratefully acknowledges the support of ONR grant N00014-86-K-0454. Part of this work was done while E. Baum was visiting UC Santa Cruz.
References

Baum, E.B. 1988. On the Capabilities of Multilayer Perceptrons. J. of Complexity, in press.
Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth. 1987a. Occam's Razor. Inf. Proc. Lett. 24, 377-380.
. 1987b. Learnability and the Vapnik-Chervonenkis Dimension. University of California-Santa Cruz Technical Report UCSC-CRL-87-20 and J. ACM, to appear.
Cover, T. 1965. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications to Pattern Recognition. IEEE Trans. Elect. Comp. V14, 326-334.
Denker, J., D. Schwartz, B. Wittner, S. Solla, J. Hopfield, R. Howard, and L. Jackel. 1987. Automatic Learning, Rule Extraction, and Generalization. Complex Systems 1, 877-922.
Devroye, L. 1988. Automatic Pattern Recognition, a Study of the Probability of Error. IEEE Trans. PAMI V10:N4, 530-543.
Duda, R. and P. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Ehrenfeucht, A., D. Haussler, M. Kearns, and L. Valiant. 1987. A General Lower Bound on the Number of Examples Needed for Learning. University of California-Santa Cruz Technical Report UCSC-CRL-87-26 and Information and Computation, to appear.
Haussler, D. 1988. Quantifying Inductive Bias: AI Learning Algorithms and Valiant's Learning Framework. Artificial Intelligence 36, 177-221.
Hinton, G. 1987. Connectionist Learning Procedures. Artificial Intelligence, to appear.
Littlestone, N. 1988. Learning Quickly when Irrelevant Attributes Abound: A New Linear Threshold Algorithm. Machine Learning V2, 285-318.
Odlyzko, A. 1988. On Subspaces Spanned by Random Selections of ±1 Vectors. J. Comb. Th. A V47:N1, 124-133.
Pollard, D. 1984. Convergence of Stochastic Processes. New York: Springer-Verlag.
Qian, N. and T.J. Sejnowski. 1988. Predicting the Secondary Structure of Globular Proteins using Neural Networks. J. Molec. Biol. 202, 865-884.
Rumelhart, D. 1987. Personal communication.
Sauer, N. 1972. On the Density of Families of Sets. J. Comb. Th. A V13, 145-147.
Sejnowski, T.J. and C.R. Rosenberg. 1987. NETtalk: A Parallel Network that Learns to Read Aloud. Complex Systems 1, 145-168.
Tesauro, G. and T.J. Sejnowski. 1988. A 'Neural' Network that Learns to Play Backgammon. In: Neural Information Processing Systems, ed. D.Z. Anderson, AIP, NY, 794-803.
Valiant, L.G. 1984. A Theory of the Learnable. Comm. ACM V27:N11, 1134-1142.
Vapnik, V.N. and A.Ya. Chervonenkis. 1971. On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Th. Prob. and its Applications V16:N2, 264-280.
Vapnik, V.N. 1982. Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.
Wenocur, R.S. and R.M. Dudley. 1981. Some Special Vapnik-Chervonenkis Classes. Discrete Math. V33, 313-318.
Widrow, B. 1987. ADALINE and MADALINE - 1963. Plenary Speech, Vol. I, Proc. IEEE 1st Int. Conf. on Neural Networks, San Diego, CA, 143-158.
Received 5 September; accepted 26 September 1988.
VIEW
Recurrent Backpropagation and the Dynamical Approach to Adaptive Neural Computation Fernando J. Pineda Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
Error backpropagation in feedforward neural network models is a popular learning algorithm that has its roots in nonlinear estimation and optimization. It is being used routinely to calculate error gradients in nonlinear systems with hundreds of thousands of parameters. However, the classical architecture for backpropagation has severe restrictions. The extension of backpropagation to networks with recurrent connections will be reviewed. It is now possible to efficiently compute the error gradients for networks that have temporal dynamics, which opens applications to a host of problems in system identification and control.

Neural Computation 1, 161-172 (1989)
© 1989 Massachusetts Institute of Technology

1 Introduction
The problem of loading a neural network model with a nonlinear mapping is like the problem of finding the parameters of a multidimensional nonlinear curve fit. The traditional way of estimating the parameters is to minimize a measure of the error between the actual output and the "target" output. Many useful optimization techniques exist, but the most common methods make use of gradient information. In general, if there are N free parameters in the objective function, the number of operations required to calculate the gradient numerically is at best proportional to N². Neural networks are special because their mathematical form permits two tricks that reduce the complexity of the gradient calculation, as discussed below. When these two tricks are implemented, the gradient calculation scales linearly with the number of parameters (weights), rather than quadratically. The resulting algorithm is known as a backpropagation algorithm. Classical backpropagation was introduced to the neural network community by Rumelhart, Hinton and Williams (1986). Essentially the same algorithm was developed independently by Werbos (1974) and Parker (1982) in different contexts. Le Cun (1988) has provided a brief overview of backpropagation pre-history and stresses that the independent discovery of the technique and its interpretation in the context of connectionist
systems is a recent and important development. He points out that within the framework of optimal control the essential features of the algorithm were known even earlier (Bryson and Ho 1969). In this paper, the term "backpropagation" will be used generically to refer to any technique that calculates the gradient by exploiting the two tricks. Furthermore, since one can write a backpropagation routine for evaluating the gradient and then use this routine in any prepackaged numerical optimization package, it is reasonable to take the position that the term "backpropagation" should be attached to the way the gradient is calculated rather than to the particular algorithm for using the gradient (conjugate gradient, line search, etc.). Recurrent backpropagation is a non-algorithmic continuous-time formalism for adaptive recurrent and nonrecurrent networks in which the dynamical aspects of the computation are stressed (Pineda 1987a; 1987b; 1988). The formalism is expressed in the language of differential equations so that the connection to collective physical systems is more natural. Recurrent backpropagation can be put into an algorithmic form to optimize the performance of the network on digital machines; nevertheless, the intent of the formalism is to stay as close to collective dynamics as possible. Recurrent backpropagation has proven to be a rich and useful computational tool. Qian and Sejnowski (1988) have demonstrated that a recurrent backpropagation network can be trained to calculate stereo disparity in random-dot stereograms. For dense disparity maps the network converges to the algorithm introduced by Marr and Poggio (1976) and for sparse disparity maps it converges to a new algorithm for transparent surfaces. Barhen et al. (1989) have developed a new methodology for constrained supervised learning and have extended RBP to handle constraints and to include terminal attractors (Zak 1988). They have applied their algorithms to inverse kinematics in robotic applications. The formalism has also been fertile soil for theoretical developments. Pearlmutter (1989) has extended the technique to time-dependent trajectories while Simard et al. (1988) have investigated its convergence properties.

2 Overview of a Dynamical Model
The class of neural network models which can be trained by recurrent backpropagation is very general, but it is useful to pick a definite system as an example; therefore consider a neural network model based on differential equations of the form

τ_x dx_i/dt = −x_i + Σ_j w_ij f(x_j) + I_i    (2.1)
The vector x represents the state vector of the network, I represents an external input vector and w represents a matrix of coupling constants
(weights) which represent the strengths of the interactions between the various neurons. The relaxation time scale is τ_x. By hypothesis, the vector-valued function f(x_i) is differentiable and chosen so as to give the system appropriate dynamical properties. For example, biologically motivated choices are the logistic or hyperbolic tangent functions (Cowan 1968). When the matrix w is symmetric with zero diagonals, this system corresponds to the Hopfield model with graded neurons (1984). In general, the solutions of equation (2.1) exhibit oscillations, convergence onto isolated fixed points, or chaos. For our purposes, convergence onto isolated fixed points is the desired behavior, because we use the value of the fixed point as the output of the system. When the network is loaded, the weights are adjusted so that the output of the network is the desired output. There are several ways to guarantee convergence. One way is to impose structure on the connectivity of the network, such as requiring the weight matrix to be lower triangular or symmetric. Symmetry, although mathematically elegant, is quite stringent because it constrains microscopic connectivity by requiring pairs of neurons to be symmetrically connected. A less stringent constraint is to require that the Jacobian matrix be diagonally dominant. For equation (2.1), the Jacobian matrix has the form

L_ij = δ_ij − w_ij f′(x_j)    (2.2)

where δ_ij are the elements of the identity matrix and f′(x_j) is the derivative of f(x_j). This condition has been used by Guez et al. (1988). If the feedforward, symmetry, or diagonal dominance stability conditions are imposed as initial conditions on a network, gradient descent dynamics will typically evolve a network which violates the conditions. Nevertheless, this author has never observed an initially stable network becoming unstable while undergoing simple gradient descent dynamics. This fact points out that the above stability conditions are merely sufficient conditions - they are not necessary. This fact also motivates the stability assumption upon which recurrent backpropagation on equation (2.1) is based: that if the initial network is stable, then the gradient descent dynamics will not change the stability of the network. The need for this assumption can be eliminated by choosing a dynamical system which admits only stable behavior, even under learning, as was done by Barhen et al. (1989). In gradient descent learning, the computational problem is to optimize an objective function whose free parameters are the weights. Let the number of weights be denoted by "N" and let the number of processing units be denoted by "n". Then N is proportional to n², provided the fan-in/fan-out of the units is proportional to n. For the neural network given by equation (2.1) it requires O(mN) or O(mn²) operations to relax the network and to calculate a separable objective function based on the steady state x⁰. (In this discussion, the precision of the calculation is fixed
and m is the number of time steps required to relax equation (2.1) to a given tolerance.) Accordingly, to calculate the gradient of the objective function by numerical differentiation requires O(mN²) or O(mn⁴) calculations. For problems with lots of connections this becomes intractable very rapidly. The scaling referred to here should not be confused with the number of gradient evaluations required for convergence to a solution. Indeed, for some problems, such as parity, the required number of gradient evaluations may diverge at critical training set sizes (Tesauro 1987). Now, as already mentioned, backpropagation adaptive dynamics is based on gradient descent and exploits two tricks to reduce the amount of computation. The first trick uses the fact that, for equations of the form (2.1), the gradient of an objective function E(x⁰) can be written as an outer-product, that is,
∇_w E = y⁰ f(x⁰)ᵀ    (2.3)
where x⁰ is the fixed point of equation (2.1) and where the "error vector" y⁰ is given by
y⁰ = (Lᵀ)⁻¹ J    (2.4)
where Lᵀ is the transpose of the n × n matrix defined in equation (2.2) and J is an external error signal which depends on the objective function and on x⁰. This trick reduces the computational complexity of the gradient calculation by a factor of n, because L⁻¹ can be calculated from L by direct matrix inversion in O(n³) operations and because x⁰ can be calculated in only O(mn²) calculations. Thus the entire calculation scales like O(mn³) or O(mN^{3/2}). The second trick exploits the fact that y⁰ can be calculated by relaxation; equivalently, it is the (stable) fixed point of the linear differential equation

τ_y dy_i/dt = −y_i + f′(x_i⁰) Σ_j w_ji y_j + J_i    (2.5)

A form of this equation was derived by Pineda (1987b). A discrete-time version was derived independently by Almeida (1987). To relax y (that is, to integrate equation (2.5) until y reaches steady state) requires O(n²) operations per time step. Therefore, if the system does not wander chaotically, the required amount of computation scales like O(mn²) or O(mN). The method is computationally efficient provided the network is sufficiently large and sparse and provided that the fixed points are not marginally stable. These results are summarized in Table 1. Note that the two backpropagation algorithms have reduced the amount of computation by a factor of N. The classical feedforward algorithm is more efficient because it does not have to relax to a steady state. For all its faults, backpropagation has permitted optimization techniques to be applied to many problems which were previously considered
numerically intractable. The N-fold reduction of the amount of calculation is perhaps the single most important reason that backpropagation algorithms have made such an impact in neural computation. The idea of using gradient descent is certainly not new, but whereas it was previously only tractable on problems with few parameters, it is now also tractable on problems with many parameters. It is interesting to observe that a similar situation arose after the development of the FFT algorithm. The idea of numerical Fourier transforms had been around for a long time before the FFT, but the FFT caused a computational revolution by reducing the complexity of an n-point Fourier transform from O(n²) to O(n log(n)).

Numerical algorithm                                         Complexity
Worst case (e.g. numerical differentiation)                 O(mN²)
Matrix inversion (e.g. Gaussian elimination)                O(mN^{3/2})
Matrix inversion by relaxation
  (e.g. recurrent backpropagation)                          O(mN)
Recursion (e.g. classical feedforward backpropagation)      O(N)

Table 1: Scaling of various algorithms with the number of connections. m is the number of time steps required to relax equations (2.1) or (2.5). It is assumed that the number of time steps required to relax them is the same.

3 Dynamics vs. Algorithms
Backpropagation algorithms are usually viewed from an algorithmic viewpoint. For example, the gradient descent version of the algorithm is expressed in the following pseudo-code:

while (E > ε) {
    initialize weight change: Δw = 0
    repeat for each pattern {
        relax eqn. (2.1) to obtain x⁰
        relax eqn. (2.5) to obtain y⁰
        calculate gradient: ∇E = y⁰ f(x⁰)ᵀ
        accumulate gradients: Δw = Δw − η ∇E
    }
    update weights: w ← w + Δw
}
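To make the loop concrete, here is a minimal NumPy sketch of the same procedure for a single pattern, using Euler relaxation for equations (2.1) and (2.5) and the outer-product gradient (2.3). The network size, the logistic choice of f, the error-signal convention (J = x⁰ − target on designated output units, which makes y⁰f(x⁰)ᵀ the gradient of a quadratic objective), the step sizes, and the learning rate are all illustrative assumptions, not values from the paper; a finite-difference check on one weight is included to confirm the outer product against numerical differentiation.

import numpy as np

rng = np.random.default_rng(0)
n = 8                                     # number of units (assumption)
out = [6, 7]                              # two units designated as outputs (assumption)
w = 0.1 * rng.standard_normal((n, n))     # coupling matrix, initially small
I = rng.uniform(0.1, 0.9, n)              # a constant external input pattern
T = np.array([0.2, 0.8])                  # targets for the output units

f = lambda u: 1.0 / (1.0 + np.exp(-u))    # logistic nonlinearity
fp = lambda u: f(u) * (1.0 - f(u))        # its derivative

def relax_x(w, I, dt=0.1, steps=1500):
    # Euler relaxation of eq. (2.1) with tau_x = 1: dx/dt = -x + w f(x) + I
    x = np.zeros(n)
    for _ in range(steps):
        x += dt * (-x + w @ f(x) + I)
    return x

def relax_y(w, x0, J, dt=0.1, steps=1500):
    # Euler relaxation of eq. (2.5) with tau_y = 1; its fixed point solves L^T y = J
    y = np.zeros(n)
    for _ in range(steps):
        y += dt * (-y + fp(x0) * (w.T @ y) + J)
    return y

def error_signal(x0):
    J = np.zeros(n)
    J[out] = x0[out] - T                  # dE/dx on outputs for E = 0.5*sum((x - T)^2)
    return J

def E_of(wp):                             # objective evaluated at the fixed point
    xp = relax_x(wp, I)
    return 0.5 * np.sum((xp[out] - T) ** 2)

# finite-difference check of eq. (2.3) on a single weight
x0 = relax_x(w, I)
y0 = relax_y(w, x0, error_signal(x0))
h = 1e-5
wp, wm = w.copy(), w.copy()
wp[2, 3] += h
wm[2, 3] -= h
print("backprop:", np.outer(y0, f(x0))[2, 3], "numerical:", (E_of(wp) - E_of(wm)) / (2 * h))

# the gradient descent loop of the pseudo-code, for one pattern (eta assumed)
eta = 0.5
for _ in range(200):
    x0 = relax_x(w, I)
    y0 = relax_y(w, x0, error_signal(x0))
    w -= eta * np.outer(y0, f(x0))        # eq. (2.3): grad_w E = y0 f(x0)^T
print("final error:", E_of(w))

The check also illustrates the point of Table 1: each finite-difference probe costs a full relaxation of equation (2.1) per weight, while a single relaxation of equation (2.5) yields every component of the gradient at once.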
Note that all the patterns are presented before a weight update. On the other hand, a "dynamical algorithm" can be obtained by replacing the weight update step with a differential equation, that is,

τ_w dw_ij/dt = y_i f(x_j)    (3.1)

and integrating it simultaneously with the forward-propagation and backward-propagation equations. A constant pattern is presented through the input pattern vector, I, and the error signal is presented through the error vector, J. The dynamics of this system is capable of learning a single pattern so long as the relaxation times of the forward and backward propagations (τ_x and τ_y) are much faster than the relaxation time of the weights, τ_w. Since the forward and backward equations settle rapidly after a presentation, the outer product y f(x)ᵀ is a very good approximation for the gradient during most of the integration. To learn multiple patterns, the patterns must be switched slowly compared to the relaxation time of the forward and backward equations, but rapidly compared to τ_w, the time scale over which the weights change. The conceptual advantage of this approach is that one now has a dynamical system which can be studied and perhaps used as a basis for models of actual physical or biological systems. This is not to say that merely converting an algorithm into a dynamical form makes it biologically or physically plausible. It simply provides a starting point for further development and investigation. Intuition and formal results concerning algorithmic models do not necessarily apply to the corresponding dynamical models. For example, consider the well-known "fact" that gradient descent is a poor algorithm compared to conjugate gradient. In fact this conventional wisdom is incorrect when it comes to physical dynamical systems. The reason is that the disease which makes gradient descent inefficient is a consequence of discretization. For example, a difficulty occurs when descending down a long narrow valley: gradient descent can wind up taking many tiny steps, crossing and re-crossing the actual gradient direction. This is inefficient because the gradient must be recomputed for each step and because the amount of computation required to recalculate the gradient from one step to the next is approximately constant. Conjugate gradient is a technique which assures that the new direction is conjugate to the previous direction and therefore avoids the problem. Accordingly, larger steps may be taken and fewer gradient evaluations are required. On the other hand, gradient descent is quite satisfactory in physical dynamical systems simply because time is continuous. The "steps" are by definition infinitely small and the gradient is evaluated continuously. No repeated crossing of the gradient direction occurs. For the same reason, the ultimate performance of physical neural networks cannot be determined from how quickly or how slowly a "neural" simulation runs on a
digital machine. Instead one must integrate the simultaneous equations and measure how long it takes to learn, in multiples of the fundamental time scales of the equations. As an example, consider the following illustrative problem. Choose input and output vectors to be randomly selected 5-digit binary vectors scaled between 0.1 and 0.9. Use a network with two layers of five units each, with connections going in both directions (50 weights). For dimensionless time scales choose τ_x = τ_y = 1.0 and τ_w = 32τ_x, and select a new pattern at random every 4τ_x. The equations may be integrated crudely; for example, use the Euler method with Δt = 0.02τ_x. One finds that the error reaches E = 0.1 in approximately 4×10³ τ_x, or after 10³ presentations. Figure 1 shows the error as a function of time.

Figure 1: Mean squared error as a function of time (horizontal axis: time in units of τ_x × 1000).

To estimate the performance of an electronic physical system we can replace these time scales with electronic time scales. Therefore, suppose patterns are presented every 10⁻⁵ sec (100 kHz). This is the performance bottleneck of the system, since the relaxation time of the circuit, τ_x, is then approximately 2.5 × 10⁻⁶ sec, which is slow compared with what can be achieved in analog VLSI. Hence in this case the patterns would be learned in approximately 10 milliseconds.
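A sketch of this dynamical version, wired to the illustrative problem above (two layers of five units connected in both directions, τ_x = τ_y = 1, τ_w = 32τ_x, a new pattern every 4τ_x, Euler integration with Δt = 0.02τ_x), might look as follows; the number of pattern pairs, the initialization, the logistic f, and the descent sign convention for the error signal are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(1)
n = 10                                       # two layers of five units each
mask = np.zeros((n, n))
mask[:5, 5:] = mask[5:, :5] = 1.0            # connections in both directions (50 weights)
w = 0.1 * rng.standard_normal((n, n)) * mask

f = lambda u: 1.0 / (1.0 + np.exp(-u))
fp = lambda u: f(u) * (1.0 - f(u))

tau_x = tau_y = 1.0
tau_w = 32.0 * tau_x
dt = 0.02 * tau_x                            # crude Euler step, as in the text
P = 4                                        # number of pattern pairs (assumption)
inputs = rng.choice([0.1, 0.9], (P, 5))      # 5-digit binary vectors scaled to 0.1/0.9
targets = rng.choice([0.1, 0.9], (P, 5))

x = np.zeros(n)
y = np.zeros(n)
I = np.zeros(n)
per = int(round(4.0 * tau_x / dt))           # a new random pattern every 4 tau_x
p = 0
for k in range(int(round(4000.0 * tau_x / dt))):
    if k % per == 0:
        p = rng.integers(P)
        I = np.concatenate([inputs[p], np.zeros(5)])
    J = np.concatenate([np.zeros(5), x[5:] - targets[p]])   # error on second layer
    x += (dt / tau_x) * (-x + w @ f(x) + I)                 # eq. (2.1)
    y += (dt / tau_y) * (-y + fp(x) * (w.T @ y) + J)        # eq. (2.5)
    w -= (dt / tau_w) * np.outer(y, f(x)) * mask            # eq. (3.1); the minus
                                                            # sign pairs with J = x - T
print("squared error on last pattern:", 0.5 * np.sum((x[5:] - targets[p]) ** 2))

All three equations are stepped together, so the weights see a continuously evaluated (approximate) gradient rather than discrete batch updates.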
Unlike simple feedforward networks, recurrent networks exhibit dynamical phenomena. For example, a peculiar phenomenon can occur if a recurrent network is trained as an associative memory to store multiple memories: it is found that the objective function can be reduced to some very small value, yet when the network is tested for recall, the supposedly stored memory is missing! This is due to a fundamental limitation of gradient descent. Gradient descent is capable of moving existing fixed points only. It cannot create new fixed points. To create new fixed points requires a technique whereby some degrees of freedom in the network are clamped during the loading phase and released during the recall phase. The analogous technique in feedforward networks is called "teacher forcing." It can be shown that this technique causes the creation of new fixed points. Unfortunately, after the suppressed degrees of freedom are released, there is no guarantee that the system is stable with respect to the suppressed degrees of freedom. Therefore the fixed points sometimes turn out to be repellers instead of attractors. In feedforward nets teacher forcing causes no such difficulties because there is no dynamics in feedforward networks and hence no attractors or repellers.
4 Recent Developments

Zak (1988) has suggested the use of fixed points with infinite stability in recurrent networks. These fixed points, denoted "terminal attractors," have two properties which follow from their infinite stability. First, their stability is always guaranteed, hence the repeller problem never occurs, and second, trajectories converge onto them in a finite amount of time, rather than an infinite amount of time. In particular, if a terminal attractor is used in the weight update equation, a remarkable speedup in learning time occurs (see for example Barhen et al. 1989). These interesting properties are a consequence of the fact that the attractors violate the Lipschitz condition. Pearlmutter (1989) has extended the recurrent formalism to include time-dependent trajectories (time-dependent recurrent backpropagation). In this approach the objective function of the fixed point is replaced with an objective functional of the trajectory. The technique is the continuous-time generalization of the sequence-generating network discussed by Rumelhart et al. (1986). Like all backpropagation algorithms it reduces the amount of calculation by a factor of N for each gradient evaluation. However, like the Rumelhart network, it requires that the network be unfolded in time during training. Hence the storage and computation during training scale like O(mN), where m is the number of unfolded time steps. Furthermore, the technique is acausal in that the backpropagation equation must be integrated backward in time: one is solving a two-point boundary problem of the kind familiar from control theory. For problems where the target trajectories are known a priori and on-line learning is not required, this is the technique of choice. On the other hand, a causal algorithm has been suggested by Williams and Zipser (1989). This algorithm does not take advantage of the backpropagation tricks and therefore the complexity scales like O(mN²) for each gradient evaluation while the storage scales like O(n³). Nevertheless, for small problems where on-line learning is required it is the technique of choice. Both techniques seek to minimize a measure of the error between a target trajectory and an actual trajectory by performing gradient descent. Only the method used for the gradient evaluation differs. Therefore one expects that, to the extent that on-line training is not an issue and to the extent that complexity is not an issue, one could use the two techniques interchangeably to create networks. Both techniques can suffer from the repeller problem if an attempt is made to introduce multiple attractors. As before, this problem could be solved by introducing a time-dependent terminal attractor.

5 Constraints
Biologically and physically plausible adaptive systems which are massively parallel should satisfy certain constraints: 1) they should scale well with connectivity, 2) they should require little or no global synchronization, 3) they should use low precision components, and 4) they should not impose unreasonable structural constraints, such as symmetric weights or bi-directional signal propagation. Backpropagation algorithms in general, and recurrent backpropagation and time-dependent recurrent backpropagation in particular, can be viewed in light of each of these constraints. Linear scaling of the gradient calculation in backpropagation algorithms is a consequence of the local nature of the computation; that is, each unit only requires information from the units to which it is connected. This notion of locality, which arises from the analysis of the numerical algorithm, is distinct from the notion of spatial locality, which is a constraint imposed by physical space on physical networks. Spatial locality is how one avoids the O(n²) growth of wires in networks. Both locality constraints could be satisfied by physical backpropagation networks. Global synchronization requires global connections; therefore it is undesirable if the network is to scale up. In one sense, the problem of synchronization has been eliminated in recurrent backpropagation because there is no longer any need for separate forward, backward, and update steps; indeed equations (2.1), (2.5), and (3.1) are "integrated" simultaneously by the dynamical system as it evolves. There is another sense in which synchronization causes difficulties. In physical systems and in
massively parallel digital simulations, time delays and asynchronous updates can give rise to chaotic or exponential stochastic behavior (Barhen and Gulati 1989). Barhen et al. have shown that this "emergent chaos" can be suppressed easily by the appropriate choice of dynamical parameters. It is still an open question as to whether backpropagation algorithms require low precision or high precision components. Formal results suggest that some problems, like parity in single layer nets (Minsky and Papert 1988), may lead to exponential growth of weights. In practice it appears that 16 bits of precision for the weights and 8 bits of precision for the activations and error signals are sufficient for many useful problems (Durbin 1987). Structurally, recurrent backpropagation and time-dependent recurrent backpropagation impose no constraints on the weight matrix. This would help the biological plausibility of the model were it not for the requirement that the connections be bi-directional. Bi-directionality is perhaps the biggest plausibility problem with the algorithms based on backpropagation. Biologically, this requires bi-directional synapses or separate, but equal and opposite, paths for error and activation signals. There is no evidence for either structure in biological systems. The same difficulties arise in electronic implementations where engineering solutions to this problem have been developed (Furman and Abidi 1988), but one would hope that a better adaptive dynamics would eliminate the problem altogether.
6 Discussion

If neural networks were merely clever numerical algorithms it would be difficult to completely account for the recent excitement in the field. To my mind, much of the excitement started with the work of Hopfield (1982), who made explicit the profound relationship between information storage and dynamically stable configurations of collective physical systems. Hopfield nets are based on the physics of interacting spins which together form a system known as a spin glass. The relevant physical property of spin glasses which makes them useful for computation is that the collective interactions between all the spins can result in stable patterns which can be identified with stored memories. Hopfield nets serve as an explicit example of the principle of collective computation even though they may not be the best networks for practical computing. Digital computers, on the other hand, can compute because they are physical realizations of finite state machines. In digital computers collective dynamics does not play a role at the algorithm level, although it certainly plays a role at the implementation level since the physics of transistors is collective physics. Collective dynamics can play a role at both the algorithmic and the implementation levels if the physical
dynamics of the machine is reflected in the computation directly. Rather than search for machine-independent algorithms, one should search for just the opposite - dynamical algorithms that can fully exploit the collective behavior of physical hardware.

Acknowledgments

The author wishes to acknowledge very helpful discussions with Pierre Baldi, Richard Durbin, and Terrence Sejnowski. The work described in this paper was performed at the Applied Physics Laboratory, The Johns Hopkins University, sponsored by the Air Force Office of Scientific Research (AFOSR-87-354). The writing and publication of this paper was supported by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply any endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

References

Almeida, L.B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In: Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, eds. M. Caudill and C. Butler, 2, 609-618.
Barhen, J., S. Gulati, and M. Zak. 1989. Neural learning of inverse kinematics for redundant manipulators in unstructured environments. To appear in: IEEE Computer, June 1989, special issue on Autonomous Intelligent Machines.
Barhen, J. and S. Gulati. 1989. 'Chaotic relaxation' in concurrently asynchronous neurodynamics. Submitted to: 1989 International Joint Conference on Neural Networks, June 18-19, Washington, D.C.
Bryson, A.E. Jr. and Y-C. Ho. 1969. Applied Optimal Control. Blaisdell Publishing Co.
Cowan, J.D. 1968. Statistical mechanics of nervous nets. In: Neural Networks, ed. E.R. Caianiello. Berlin: Springer-Verlag, 181-188.
Durbin, R. 1987. Backpropagation with integers. Abstracts of the meeting Neural Networks for Computing, Snowbird, UT.
Furman, B. and A. Abidi. 1988. A CMOS backward error propagation LSI. Proceedings of the Twenty-second Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA.
Guez, A., V. Protopopescu, and J. Barhen. 1988. On the stability, storage capacity, and design of nonlinear continuous neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 18, 80-87.
Hopfield, J.J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554-2558.
. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81, 3088-3092.
le Cun, Y. 1988. A theoretical framework for backpropagation. Proceedings of the 1988 Connectionist Models Summer School, Carnegie-Mellon University, eds. D. Touretzky, G. Hinton, and T. Sejnowski, 21-29. San Mateo, CA: Morgan Kaufmann.
Marr, D. and T. Poggio. 1976. Cooperative computation of stereo disparity. Science, 194, 283-287.
Minsky, M. and S. Papert. 1988. Perceptrons, 2nd edition. Cambridge, MA: MIT Press.
Parker, David B. 1982. Learning-Logic. Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University.
Pearlmutter, Barak A. 1989. Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263-269.
Pineda, F.J. 1987a. Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 59, 2229-2232.
. 1987b. Generalization of backpropagation to recurrent and higher order networks. In: Proceedings of IEEE Conference on Neural Information Processing Systems, Denver, Colorado, Nov. 8-12, ed. D.Z. Anderson, 602-611.
. 1988. Dynamics and architecture for neural computation. Journal of Complexity, 4, 216-245.
Qian, N. and T.J. Sejnowski. 1988. Learning to solve random-dot stereograms of dense transparent surfaces with recurrent backpropagation. In: Proceedings of the 1988 Connectionist Models Summer School, eds. D. Touretzky, G. Hinton, and T. Sejnowski, 435-443. San Mateo, CA: Morgan Kaufmann.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning internal representations by error propagation. In: Parallel Distributed Processing, 1, eds. D.E. Rumelhart and J.L. McClelland, 318-362.
Simard, P.Y., M.B. Ottaway, and D.H. Ballard. 1988. Analysis of recurrent backpropagation. Proceedings of the 1988 Connectionist Models Summer School, June 17-26, 1988, Carnegie-Mellon University. San Mateo, CA: Morgan Kaufmann.
Tesauro, G. 1987. Scaling relationships in backpropagation learning: dependence on training set size. Complex Systems, 1, 367-372.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Williams, R.J. and D. Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.
Zak, M. 1988. Terminal attractors for addressable memory in neural networks. Physics Letters A, 133, 18-22.

Received 14 March 1989; accepted 17 April 1989.
New Models for Motor Control Jennifer S. Altman and Jenny Kien Institut für Zoologie, Universität Regensburg, D-8400 Regensburg, West Germany
Neural Computation 1, 173-183 (1989)
© 1989 Massachusetts Institute of Technology
What do a prototype robot (Brooks 1989) and a model for the control of behavioral choice in insects (Altman and Kien 1987a) have in common? And what do they share with a scratching cat (Shadmehr 1989)? The answer is distributed control systems that do not depend on a central command center for the execution of behavioral outputs. The first two in particular are examples of a growing trend to replace the long-held concept of linear hierarchical control of motor output with one of decentralized, distributed control, with inputs at many levels and the output a consensus of the activity in several centers. Brooks (1989) describes a six-legged machine that, in its most advanced form, can walk over rough terrain and prowl around following a source of warmth, such as a person. The six legs, chosen as a compromise between stability and ease of coordination, give the robot a superficial resemblance to an insect - but the similarity goes deeper. The modular control system, designed strictly on engineering principles for maximum efficiency and economy, bears a striking similarity to the model we have proposed elsewhere (Altman and Kien 1987a) to describe the organization of the motor system in insects such as the locust. In both systems, the same set of components can generate different behaviors, depending on the context, and similar principles govern the generation of different levels of behavior, from movements of a single leg to coordinated responses of the whole beast. Neither requires a single center for integrating all sensory information and conflicts tend to be resolved by consensus at the motor level.

1 The Conventional Hierarchical Model
Such distributed systems contrast sharply with the hierarchical view of motor control found in many textbooks: sensory information converges on command neurons in higher motor centers; when excited by a particular set of inputs, one or more command neurons initiate a behavior by activating the relevant central pattern generator in the lower motor
centers, the output of which ultimately drives the muscles appropriate to the behavior. Much of the support for this theory has come from work in invertebrates. The command neuron concept (for review see Kupferman and Weiss 1978) stemmed from demonstrations that electrical stimulation of single interneurons in the crayfish could elicit complex patterns of movements (Wiersma and Ikeda 1964). As well as assuming linear organization and the hierarchical activation of behaviors, the concept implies that function is statically localized to individual nerve cells (Eaton and DiDomenico 1985). Although this idea is attractively simple, it is uneconomical and lacks the flexibility seen in neuronal responses and everyday behavior (Kien 1983; 1986). Behaviors less stereotyped than rapid escape responses are probably regulated by large numbers of higher-order interneurons that are involved in both maintaining and timing the output of the local pattern generators (see Kien 1986). Central pattern generators, on the other hand, undoubtedly exist, as Shadmehr demonstrates (1989): in both vertebrates and invertebrates, there are circuits of neurons that can generate the correctly patterned outputs characteristic of a range of behaviors. The debate here has been how far sensory inputs modify the basic pattern (Altman 1982). Wilson (1961) first demonstrated that deafferented thoracic ganglia in locusts (see Fig. 1) could produce a correctly patterned flight motor output, thus establishing that such rhythmic movements were not driven by a chain of reflexes; this, however, has often been taken to mean that sensory inputs play no part in pattern generation. More recently it has been appreciated that sensory inflow is necessary for maintaining excitation and for tailoring the output to the irregularities of the environment (Altman 1982; Wolf and Pearson 1987), and may even be an essential part of the pattern generation in intact animals (Cohen et al. 1988). Command fibers and pattern generators have mostly been studied in fixed, dissected preparations and isolated ganglia, where the variability of normal life is absent. This has resulted in the misleading emphasis on the role of central neurons and the serious neglect of the importance of sensory feedback at every level in the motor system. Now that almost intact, though still restrained, preparations are being used, a fairer assessment of the relative contributions of center and periphery is likely, and distributed models such as those discussed here should help in the design of more realistic experiments.

2 The Robot
The basic module of the robot is the augmented finite state machine (AFSM), whose output depends on its inputs and the state of its registers and timers. The output drives a motor moving a leg or provides the input to a register in another AFSM; other inputs come from sensors that
Figure 1: A locust dissected from the dorsal side to show the organization of the nervous system. The brain, lying above the gut, receives inputs from all the main sense organs as well as ascending pathways from the rest of the nerve cord. These inputs converge on the descending interneurons that run in the interganglionic connectives to the other ganglia. The suboesophageal ganglion (SOG), which is similar in function to the vertebrate midbrain and brain stem, contains a second population of descending interneurons. These terminate in the segmental ganglia of the thorax and abdomen, which innervate the body musculature and contain the circuits that generate movements. Both the brain and the SOG seem to be involved in initiating and maintaining movements but the brain may be more concerned with what the animal does and the SOG with how it does it (Altman and Kien 1987a; 1987b).
monitor the environment (feelers and infrared detectors) and the forces on the leg motors. The AFSMs are connected together by three simple rules: an AFSM can suppress and gate an input to another; one can inhibit an output from another; and the output from one can override an input from another, the default condition. Inhibition and suppression are effective only for a short time unless the controlling AFSM provides a continuous stream of signals. The robot is built by adding layers of modules, each of which allows it to perform more complex actions, from simply standing up to wandering round a room looking rather like a dog following its owner. The output of the system at any of these levels depends on a consensus between the states of the various AFSMs, which in turn is dependent on the consensus between inputs and internal states for each AFSM. The simpler activities are regulated by inputs from sensors that monitor the leg motors; novel inputs from the environment are needed solely for higher-level behaviors. The dog-like robot has 57 AFSMs, 32 of which form the basic layers that enable it to stand up and walk on an even surface. Each leg has five AFSMs to control its stepping movements and there are two that coordinate all the legs: the alpha-balance AFSM, which initiates compensatory responses in all the other legs when one moves, and the walk AFSM, which times the stepping sequence of the legs so that a regular gait is achieved. The stability required for progress over rough ground requires another three AFSMs for each leg, which compensate for the extra forces generated when a leg is placed on an obstacle and regulate the height to which the leg is lifted. Adding feelers to give advance warning of obstacles and detectors for high forward and backward pitch improves performance under more difficult conditions. Finally, the walk AFSM can itself be regulated by the prowl and steer AFSMs that get inputs from the infrared detectors and so control the overall performance according to external conditions. Two aspects of this design are hallmarks of distributed control systems. First, each layer is capable of generating a behavior - adding layers simply improves performance and extends the range of possible activities. Second, information about leg movements feeds back to several layers and is used for generating the functions specific to each layer.
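A toy sketch of the AFSM abstraction as described here may help fix ideas. It mimics only the module interface - registers and timers, brief suppression of an input, brief inhibition of an output, and wired inputs as the default - rather than Brooks's actual machines (which also allow one output to override another's input), and every name, threshold, and duration below is invented.

from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

@dataclass
class AFSM:
    # A toy augmented finite state machine: its output depends on its inputs
    # and on the state of its registers and timers.
    name: str
    rule: Callable[[Dict[str, float]], Optional[float]]
    registers: Dict[str, float] = field(default_factory=dict)
    inhibited_until: int = -1
    suppressions: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def inhibit(self, now: int, ticks: int = 3) -> None:
        # one AFSM can inhibit another's output; effective only briefly unless
        # the controlling AFSM keeps up a continuous stream of signals
        self.inhibited_until = max(self.inhibited_until, now + ticks)

    def suppress(self, key: str, value: float, now: int, ticks: int = 3) -> None:
        # one AFSM can suppress and gate an input to another, again only briefly
        self.suppressions[key] = (value, now + ticks)

    def step(self, inputs: Dict[str, float], now: int) -> Optional[float]:
        if now < self.inhibited_until:
            return None                        # output currently inhibited
        merged = dict(inputs)                  # default: wired inputs get through
        for key, (value, expiry) in self.suppressions.items():
            if now < expiry:
                merged[key] = value            # suppression overrides the wire
        return self.rule(merged)

# usage: a leg-lifting AFSM whose force input is briefly taken over by a
# (hypothetical) balance AFSM
lift = AFSM("lift-leg", rule=lambda inp: 1.0 if inp.get("force", 0.0) > 0.5 else 0.0)
print(lift.step({"force": 0.8}, now=0))        # 1.0: the leg lifts
lift.suppress("force", 0.0, now=1)
print(lift.step({"force": 0.8}, now=1))        # 0.0: input suppressed
print(lift.step({"force": 0.8}, now=5))        # 1.0: suppression has expired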
3 Decision-making Model for Insects

The insect model (Altman and Kien 1987a) shown in figure 2 has similar features. It is intended to provide a framework for thinking about decision making in the nervous system, that is, the choice of behavior appropriate to the circumstances of the moment. Based on anatomical, physiological, and behavioral evidence accumulated over the past 15 years on the neural organization of motor outputs in arthropods, such as flight and walking in locusts, it shows how a large range of behaviors could be produced by sharing an economical number of components and demonstrates how the overall coordination of all parts of the body, which is essential for the coherent performance of any behavior, could be achieved. The model (Fig. 2) consists of several stations, each of which contains local networks that generate an output. A station receives inputs from all other stations as well as novel inputs signalling changes in the environment, environmental and internal changes generated by the animal's own action, inputs from memory and information about its physiological state. This network works on two principles. First, the output from a station is the result of a consensus between the total pattern of inputs to the station, which we term the across-fiber pattern (borrowed from taste physiology, see Erickson 1963), and the state of the local networks. Because the outputs from every station form part of the inputs to every other station, the system as a whole consists of a number of loops acting in parallel. Second, it is the activity in these loops rather than in a linear sequence of stations that regulates behavior. Decisions are the outcome of a consensus between the simultaneous activities in the loops, each of which controls a different aspect of the motor behavior, such as form, direction, overall coordination and local coordination between limbs. What stops this network becoming amorphous, with everything connected to everything else? The local networks in each station are unique, so each station has a unique output. Furthermore, the same elements in a station are used in different ways and different combinations in several behaviors - for example, the motor neurons that drive the muscles in the large hind legs of the locust are used in walking, jumping, kicking, grooming, and sound production. A station thus contains several local networks that share at least some elements and whose functional connectivity will change according to the strength and nature of the inputs. The best demonstration of this is in the stomatogastric ganglion of the lobster (Heinzel and Selverston 1988), where neuromodulatory peptides such as proctolin, produced by neurons elsewhere in the system, can alter the coupling between the neurons of the pattern generator to produce different forms of coordination between the teeth of the gastric mill. Inputs do not necessarily all have the same weight, so the consensus that determines the output also includes the strength of coupling between the inputs and the local circuits. Some inputs, like the lateral giant neurons that elicit the escape tailflip in the crayfish, have strong synaptic connections to the motor neurons that usually override all other inputs. Even so, the strength of these synapses can be altered by neuromodulatory inputs in appropriate physiological states such as feeding (see Bicker and Menzel 1989). Similarly, some local circuits have higher thresholds than others and are activated less frequently. For example, the swimmerets of the crayfish beat back to front most of the time and reverse beating is seen only on occasion.
Figure 2: A model for decision making in the insect nervous system. In the CNS, stations 1, 2, 3 contain local networks, N1-3. These stations approximate to the brain, suboesophageal ganglion (SOG) and segmental ganglia of the locust (see Fig. 1). The size and composition of the local networks in these ganglia are unknown, but there must be several hundred neurons in each, some of which belong to more than one circuit. Each station receives inputs, composed of novel stimuli from the environment (A); changes generated by the animal's movements (B); and internal state (C), which includes memory, arousal, and hormonal levels. There are also signals reporting the state of the networks in the other stations, which are similar to the outputs of the stations (X, Y, Z ) . The segmental ganglia produce a motor output (M) to the muscles, which is limited by the mechanical constraints of the body (D). The output of each station results from a consensus between the activity of the inputs and the local networks in that station, so the output of each station is different. The stations are thus linked in several parallel loops and the output of the whole system is the consensus of the activity in all the loops. (Modified from Altman and Kien 1987a.)
4 What is a Station?

The stations in our model have appropriate equivalents in the real insect nervous system (Fig. 1): the brain, the suboesophageal ganglion (SOG) and the segmental ganglia. Lesion experiments, microstimulation and recording in the nerve connectives joining the brain and SOG to the segmental ganglia, recording from single identified neurons, and neuroanatomy (reviewed in Altman and Kien 1987b) all indicate that the outputs of the brain and the SOG have different functions. Neurons in the brain that run directly to the segmental ganglia carry spatial information and regulate what sort of response is required, for example, the direction in which to turn; in contrast, the neurons in the SOG that project to the segmental ganglia seem to regulate how the operation should be performed. Together these outputs provide the local circuits of the segmental ganglia with information about the 'form' of the response required, similar to the walk, prowl, and steer AFSMs in Brooks' robot. Just as the modules of the robot all share the same basic design, the stations in our model all operate on the principle of consensus. This has the advantage that studying the behavior of one station can provide clues to the operation of all - the motor-neuron output from the segmental ganglia to the muscles can serve as a model for the output of the descending interneurons from the brain or SOG to the segmental ganglia. Each station is, however, a good deal more complex than the individual AFSMs in the robot, which have a very limited operating range. A better analogy to a station is a group of AFSMs with equivalent functions, such as the five that drive the movements of a single leg. Looked at in this way, the block diagram in figure 3 of Brooks (1989) can be divided into three control levels that approximate to the brain, SOG and segmental ganglia in our model: the general control modules that determine the form of the output - walk, prowl, steer - have functions similar to those of the neurons from the brain and SOG; the modules like the alpha balance and for/bak pitch that have inputs to more than one leg are similar in function to the neurons of the SOG that regulate the way a movement is performed; and the modules that control the actions of a single leg are approximately analogous to the segmental ganglia.

5 Pattern Generators
In our model (Altman and Kien 1987a), we do not consider the details of the interface between the inputs to stations and the local networks, or the way in which the local networks generate patterned neural activity. One such network is discussed by Shadmehr in this issue: the pattern generator for rhythmic scratching in the cat. A cat responds to an irritating itch by positioning a foot, which rhythmically scratches the spot by alternately aiming and wiping. Positioning
and aiming involve group A muscles, driven by the BI group of interneurons; wiping uses group B muscles and BIII interneurons. The change from position/aim to wipe is a simple example of consensus; to wipe or not to wipe depends on the balance of excitatory and inhibitory inputs to BI and BIII. Shadmehr's model is based on work with spinal cats that have the scratching leg immobilized. He proposes that the alternation between aim and wipe is coordinated by a third group of interneurons, BII, which controls the inhibition of group BIII by BI, as well as directly exciting BIII. Once BIII reaches threshold and a wipe starts, it in turn inhibits BII, so that BI inhibition of BIII is again established. This alternating disinhibition of BI and BIII is similar in principle to the use of suppression to generate stepping in the robot - in both, the next action cannot happen unless existing signals are blocked, although in the robot the suppressed signal is excitatory, whereas in the cat it is inhibitory. The circuitry for hindlimb scratching seems to be confined to three segments of the lumbar spinal cord, which may be considered a station in the terms of our model. Although the basic scratching behavior is organized and executed competently by this station, Shadmehr (1989) observes that the activity in all three groups of interneurons can also be recorded in the ascending tracts from spinal cord to cerebellum and reticular system; the descending messages that are presumably generated here would help to modify the form of the activity and overcome the local reflex at times when scratching would be inconvenient. Like the basic walking module of the robot, or the pattern generators in insect segmental ganglia, the spinal-cord pattern generator can produce an organized output but cannot adapt it to meet varying needs.
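A toy, event-level rendering of this aim/wipe alternation may make the disinhibition logic concrete. It is a sketch only: the thresholds and durations are invented, and the real interneurons are continuous dynamical elements rather than discrete states.

# BII slowly builds excitation while BI suppresses BIII (aim); once BIII crosses
# threshold it bursts (wipe) and inhibits BII; when the burst ends, BI inhibition
# of BIII is re-established and the cycle repeats.
phase, bII, burst = "aim", 0, 0
trace = []
for t in range(60):
    if phase == "aim":
        bII += 1                     # BII integrates toward BIII's threshold
        if bII >= 8:                 # disinhibition: BIII escapes BI's inhibition
            phase, burst = "wipe", 0
    else:
        burst += 1                   # the BIII burst drives the wipe...
        bII = 0                      # ...and inhibits BII
        if burst >= 3:               # burst over: BI dominance returns
            phase = "aim"
    trace.append("A" if phase == "aim" else "W")
print("".join(trace))                # alternating runs of A's (aim) and W's (wipe)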
6 Sensory Inputs

The sensory inputs that shape the basic pattern to the needs of the animal come both from novel inputs that signal change in the environment and from inputs that monitor changes in the relationship between the animal and its environment. Shadmehr (1989) shows how adding an input that simply indicates the starting position of the limb alters the length of the positioning phase; presumably the form of the scratching movement will also vary with the site of the itch. An important feature of both the robot and our model is that the inputs can enter the system at any level. In insects, the same sensory input may reach the motor neurons over several pathways: extension-sensitive afferents in the locust hind leg have four parallel inputs to motor neurons - direct, through non-spiking or spiking interneurons, or through both (Burrows et al. 1988). Similar information can also reach more than one station, either directly or indirectly. The wing-beat frequency in flying locusts, for instance, is monitored both by wing sense organs with
inputs directly to the local segmental ganglia, and by hairs that detect modulations in the air flow over the head, which have their inputs to neurons in the brain. The latter help to mold the output of the brain to the segmental ganglia (Altman 1982). The distributed sensory inputs are thus a component of the loops that connect the stations and an essential part of the consensus by which outputs are chosen.

Figure 3: Figure 2 redrawn in Hopfield notation. Symbols as in figure 2. (Thanks to Peter Getting and Terry Sejnowski for effecting the transformation.)
7 Stability and Change
The model in figure 2 shows what is happening at a single instant. This network can be redrawn, without changing the connections, in Hopfield notation (Fig. 3), which emphasizes its tendency to adopt a stable state. But in a behaving animal the inputs (A, B, and C) are rarely constant: the animal's movements generate instabilities, the terrain is not even, the unexpected happens in the environment, and internal state changes as energy stores are depleted, hunger satisfied, or copulation achieved. In both the insect model and the robot, small perturbations in input lead to small adjustments in output, such as lifting a leg higher on encountering an obstacle; large changes alter the type of behavior produced - the robot starts to walk only when it detects an infrared source in the vicinity. The magnitude of the change in output depends on the strength and priorities of the various inputs. Brooks points out that his robot is an existence proof for a decentralized control of motor outputs similar to what we propose for insects. Evidence from scratching cats (Shadmehr 1989) and from frogs (Grobstein 1989) makes it seem likely that vertebrates do it this way, too.

Acknowledgments

We acknowledge support from the Deutsche Forschungsgemeinschaft through SFB4, project H-2.

References

Altman, J.S. 1982. In: BIONA Report 2, ed. W. Nachtigall, 127-136. Stuttgart: Gustav Fischer.
Altman, J.S. and J. Kien. 1987a. In: Nervous Systems in Invertebrates, ed. M.A. Ali, 621-643. New York: Plenum.
. 1987b. In: Arthropod Brain: Its Evolution, Development, Structure and Function, ed. A.P. Gupta, 265-301. New York: John Wiley.
Bicker, G. and R. Menzel. 1989. Nature, 337, 33-39.
Brooks, R.A. 1989. A robot that walks; emergent behaviors from a carefully evolved network. Neural Computation, 1, 253-262.
Burrows, M., G.J. Laurent, and L.H. Field. 1988. Journal of Neuroscience, 8, 3085-3093.
Cohen, A.H., S. Rossignol, and S. Grillner. 1988. Neural Control of Rhythmic Movements in Vertebrates. New York: John Wiley.
Eaton, R.C. and R. DiDomenico. 1985. Brain, Behavior and Evolution, 27, 132-164.
Erickson, R.P. 1963. In: Olfaction and Taste, ed. Y. Zotterman, 205-213. Oxford: Pergamon.
Grobstein, P. 1989. In: Visuomotor Coordination: Amphibians, Comparisons, Models, and Robots, eds. J.-P. Ewert and M.A. Arbib. New York: Plenum, in press.
Heinzel, H.-G. and A.I. Selverston. 1988. Journal of Neurophysiology, 59, 566-585.
Kien, J. 1983. Proceedings of the Royal Society of London B, 219, 137-174.
. 1986. Behavioral and Brain Sciences, 9, 732-733.
Kupferman, I. and K.R. Weiss. 1978. Behavioral and Brain Sciences, 1, 3-39.
Shadmehr, R. 1989. A neural model for generation of some behaviors in the fictive scratch reflex. Neural Computation, 1, 242-252.
Wiersma, C.A.G. and K. Ikeda. 1964. Comparative Biochemistry and Physiology, 12, 509-525.
Wilson, D.M. 1961. Journal of Experimental Biology, 38, 471-490.
Wolf, H. and K.G. Pearson. 1987. Journal of Comparative Physiology A, 160, 269-279.
Received 29 December 1988; accepted 3 January 1989.
VIEW
Seeing Chips: Analog VLSI Circuits for Computer Vision Christof Koch Computation and Neural Systems Program, Divisions of Biology and Engineering and Applied Science, 216-76, California Institute of Technology, Pasadena, CA 91125, USA

Neural Computation 1, 184-200 (1989) © 1989 Massachusetts Institute of Technology
Vision is simple. We open our eyes and, instantly, the world surrounding us is perceived in all its splendor. Yet Artificial Intelligence has been trying, with very limited success, for over 20 years to endow machines with similar abilities. A large van, filled with computers, driving unguided at a mile per hour across gently sloping hills in Colorado and using a laser-range system to "see," is the most we have accomplished so far. On the other hand, computers can play a decent game of chess or prove simple mathematical theorems. It is ironic that we are unable to reproduce perceptual abilities which we share with most animals, while some of the features distinguishing us from even our closest cousins, chimpanzees, can be carried out by machines. Vision is difficult.

1 Introduction
In the last ten years, significant progress has been made in understanding the first steps in visual processing. Thus, a large number of well-studied algorithms exist that locate edges, compute disparities along these edges or over areas, estimate motion fields, and find discontinuities in depth, motion, color, and texture (for an overview see Horn 1986 or the last reference at the end of this article). At least two major problems remain. One is the integration of information from different modalities. Fusion of information is expected to greatly increase the robustness and fault tolerance of current vision systems, and it is most likely the key toward fully understanding vision in biological systems (Barrow and Tenenbaum 1981; Marr 1982; Poggio et al. 1988). The second, more immediate, problem is the fact that vision is very expensive in terms of computer cycles. Thus, one second's worth of black-and-white TV adds up to approximately 64 million bits which need to be transmitted and processed further. And since early vision algorithms are usually formulated as relaxation algorithms which need to be executed many hundreds of times before convergence, even supercomputers take their time.
For instance, the 65,536-processor Connection Machine at Thinking Machines Corporation (Hillis 1985), with a machine architecture optimal from the point of view of processing two-dimensional images, still requires several seconds per image to compute depth from two displaced images (Little 1987). Performance on microprocessor-based workstations is hundreds of times slower. Animals, of course, devote a large fraction of their nervous system to vision. Thus, about 270,000 out of 340,000 neurons in the house fly Musca domestica are considered to be "visual" neurons (Strausfeld 1975), while a third of the human cerebral cortex is given over to the computations underlying the perception of depth, color, motion, recognition, etc. One way for technology to bypass the computational bottleneck is to likewise construct special-purpose vision hardware. Today, commercial vendors offer powerful and programmable digital systems on the open market for a few thousand dollars. Why, however, execute vision algorithms on digital machines when the signals themselves are analog? Why not exploit the physics of circuits to build very compact, analog special-purpose vision systems? Such a smart sensor paradigm, in which as much as possible of the signal processing is incorporated into the sensor and its associated circuitry in order to reduce transmission bandwidth and subsequent stages of computation, is starting to emerge as a possible competitor to more general-purpose digital vision machines.

2 Analog circuits for vision: the early years
This idea was explicitly raised by Horn at MIT (1974), where he proposed the use of a 2-D hexagonal grid of resistances to find the inverse of the discrete approximation to the Laplacian. This is the crucial operation in an algorithm for determining the lightness of objects from their image. An attempt to build an analog network for vision was undertaken by Knight (1983) for the problem of convolving images with the Difference-of-two-Gaussians (DOG), a good approximation of the Laplacian-of-a-Gaussian filter of Marr and Hildreth (1980). The principal idea is to exploit the dynamic behavior of a resistor/capacitor transmission line, illustrated in figure 1. In the limit that the grid becomes infinitely fine, the behavior of the system is governed by the diffusion equation:

$$\frac{\partial V(x,t)}{\partial t} = \frac{1}{RC}\,\frac{\partial^2 V(x,t)}{\partial x^2} \qquad (2.1)$$
If the initial voltage distribution is V(x, 0) and if the boundaries are infinitely far away, the solution voltage is given by the convolution of V(x, 0) with a progressively broader Gaussian distribution (Knight 1983). Thus, a difference of two Gaussians can be computed by converting the incoming image into an initial voltage distribution, storing the resulting voltage distribution after a short time, and subtracting it from the voltage distribution at a later time. A resistor/capacitor plane yields the same result in two dimensions.

Figure 1: One-dimensional lumped-element resistor/capacitor transmission line. The incoming light intensity is converted into the initial voltage distribution V(x, 0). The final voltage V(x, t) along the line is given by the convolution of V(x, 0) with a Gaussian of variance σ² = 2t/RC and is read off after a certain time t related to the width of the Gaussian filter. From Koch (1989).

Practical difficulties prevented the successful implementation of this idea. A team from Rockwell Science Center (Mathur et al. 1988) is reevaluating it, using a continuous 2-D undoped polysilicon plane deposited on a thick oxidized silicon sheet (to implement the distributed capacitor). The result of the convolution is read out via a 64 by 64 array of vertical solder columns. A different approach to convolving images - exploiting CCD technology - was successfully tried by Sage at MIT's Lincoln Laboratory (Sage 1984), based on an earlier idea of Knight (1983). In this technology, incoming light intensity is converted into a variable amount of charge trapped in potential "wells" at each pixel. By using appropriate clocking signals, the original charge can be divided by two and shifted into adjacent wells. A second step further divides and shifts the charges, and so on (Fig. 2). This causes the charge in each pixel to spread out in a diffusive manner described accurately by a binomial convolution, which after a few iterations is a good approximation to a Gaussian convolution. Sage extended this work to the 2-D domain (Sage and Lattes 1987) by first effecting the convolution in the x and then in the y direction. Their 288 by 384 pixel CCD imager convolves images at up to 60 frames per second. Since CCD devices can be packed extremely densely - commercial million-pixel CCD image sensors are available - such convolvers promise to be remarkably fast and area-efficient.
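The clocking scheme is easy to check numerically. The sketch below (a minimal Python illustration with an arbitrary array size and cycle count, not a model of Sage's actual device) alternately halves and shifts a single charge packet and compares the result with the Gaussian predicted by the binomial limit.

    import numpy as np

    def ccd_step(c, direction):
        # One clock cycle: each packet is halved, one half staying in
        # place and the other shifting into an adjacent well.
        return 0.5 * c + 0.5 * np.roll(c, direction)

    def ccd_convolve(c, cycles):
        # Alternating right and left cycles; each pair of cycles is a
        # convolution with the kernel [1/4, 1/2, 1/4].
        for k in range(cycles):
            c = ccd_step(c, +1 if k % 2 == 0 else -1)
        return c

    n = 101
    charge = np.zeros(n)
    charge[n // 2] = 1.0
    out = ccd_convolve(charge, cycles=80)

    # After 2k cycles the packet is binomially distributed with
    # variance k/2; compare with the corresponding Gaussian.
    var = 40 * 0.5
    x = np.arange(n) - n // 2
    gauss = np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print(np.abs(out - gauss).max())  # small relative to out.max()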
3 Analog VLSI and Neural Systems

The current leader in the field of analog sensory devices that include significant signal processing is undoubtedly Carver Mead at Caltech
(Mead 1989). During the last few years he has developed a set of subcircuit types and design practices for implementing a variety of vision circuits using subthreshold analog complementary Metal-Oxide-Semiconductor (CMOS) VLSI technology. His best-known design is the "silicon retina" (Sivilotti et al. 1987; Mead and Mahowald 1988), a device which computes the spatial and temporal derivative of an image projected onto its phototransistor array. The version illustrated schematically in figure 3a has two major components. The photoreceptor consists of a phototransistor feeding current into a circuit element with an exponential current-voltage characteristic. The output voltage V̄ of the receptor is logarithmic over four to five orders of magnitude of incoming light intensity, thus performing automatic gain control, analogous to the cone photoreceptors of the vertebrate retina. This voltage is then fed into a 48 by 48 element hexagonal resistive layer with uniform resistance values R. The photoreceptor is linked to the grid by a conductance of value G, implemented by a transconductance amplifier. An amplifier senses the voltage difference across this conductance and thereby generates an output at each pixel proportional to the difference between the receptor output and the network potential. Formally, if the voltage at pixel i, j is V_ij and the current being fed into the network at that location is I_ij = G(V̄_ij - V_ij), the steady state is characterized by:
$$G(\bar{V}_{ij} - V_{ij}) = \frac{1}{R} \sum_{(k,l)\in N(i,j)} (V_{ij} - V_{kl}), \qquad (3.1)$$

where N(i, j) denotes the set of nearest neighbors of node i, j on the hexagonal grid.
Figure 2: Schematic of a potential "well" CCD structure evolving over time. The initial charge across the 1-D array is proportional to the incoming light intensity. The charge packet shown in (A) is then shifted into the two adjacent wells by an appropriate clocking method. Since the total charge is conserved, the charge per well is halved (B). In subsequent cycles (C, D, and E) the charge is further divided and shifted, resulting in a binomial charge distribution. After several steps, this distribution is very similar to a Gaussian distribution. From Koch (1989).
On inspection, this turns out to be one of the simplest possible discrete analogs of the Laplacian differential operator ∇². In other words, given an infinitely fine grid and the voltage distribution V(x, y), this circuit computes the current I(x, y) via

$$\nabla^2 V = RG(\bar{V} - V) = RI. \qquad (3.2)$$
The current I at each grid point - proportional to V̄ - V and sensed by the amplifier - then corresponds to a spatially high-pass filtered version of the logarithmically compressed image intensity. Operations akin to temporal differentiation can be achieved by adding capacitive elements (Sivilotti et al. 1987). The required resistive elements of this circuit are designed by exploiting the current-voltage relationship (Fig. 3b) of a small transistor circuit, instead of using the resistance of a special metallic process. As long as the voltage across the device is within its linear range (on the order of 100 mV), it behaves like a constant resistance whose value can be controlled over five orders of magnitude. The current saturates for larger voltage values, a nonlinearity with very desirable effects (see below). This, then, is the basic circuit element used for most vision chips coming out of Caltech. The response of the silicon retina to a 1-D edge projected onto the phototransistors is shown in figure 3c. The voltage trajectory can be well approximated by the second spatial derivative of the smoothed brightness intensity. In 2-D the response is similar to that obtained by convolving the image with the DOG edge detection operator (Marr and Hildreth 1980). A different circuit (Tanner 1986) computes the optical flow field induced by a spatially homogeneous motion, such as moving a pointing device over a fixed surface (for example, an optical mouse). A serious practical problem in designing the type of networks discussed here is that unwanted oscillations can arise spontaneously when large populations of active elements are interconnected through a resistive grid. These oscillations can occur even when the individual elements are quite stable. Using methods from nonlinear circuit theory, Wyatt and Standley (1989) at MIT have shown how this flaw can be circumvented. They have proven that if each linear active element in isolation is designed to satisfy the experimentally testable Popov criterion from control theory (which guarantees that a related operator is positive real), then stability of the overall interconnected nonlinear system is guaranteed.
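A few lines of code suffice to reproduce this behavior in simulation. The following sketch is a hypothetical Python model: it uses a square grid with periodic boundaries and illustrative parameter values, rather than the chip's hexagonal array, and relaxes equation (3.1) to its steady state to print the biphasic edge response of figure 3c.

    import numpy as np

    def retina_response(image, R=1.0, G=0.2, iters=3000):
        vbar = np.log(image)  # logarithmic photoreceptor stage
        v = vbar.copy()
        for _ in range(iters):
            nb = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                  np.roll(v, 1, 1) + np.roll(v, -1, 1))
            # Kirchhoff's current law, eq. (3.1), solved for V at each
            # node with its four neighbors held fixed (Jacobi step):
            # G*(vbar - v) = (4*v - nb)/R
            v = (G * vbar + nb / R) / (G + 4.0 / R)
        return G * (vbar - v)  # output current I = G*(Vbar - V)

    # A step edge in intensity yields the response of figure 3c:
    # zero in uniform regions, biphasic around the edge.
    img = np.ones((48, 48))
    img[:, 24:] = 10.0
    print(retina_response(img)[24, 20:28].round(4))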
Figure 3: The "silicon retina." (a) Diagram of the hexagonal resistive network with an enlarged single element. A photoreceptor, whose output voltage is proportional to the logarithm of the image intensity, is coupled - via the conductance G - to the resistive grid. The output of the chip is proportional to the current across the conductance G, or in other words, to the voltage difference between the photoreceptor and the grid. (b) The current-voltage relationship for Mead's resistive element. As long as the voltage gradient is less than about 100 mV, the circuit acts like a linear resistive element. The output current saturates for larger gradients. (c) The experimentally measured voltage response of a 48 by 48 pixel version of the retina when a step intensity edge is moved past one pixel. This response is similar to the one expected by taking the second spatial derivative of the smoothed incoming light intensity. Adapted from Mead and Mahowald (1988). From Koch (1989).
Mead's principal motivation for this work comes from his desire to understand and emulate neurobiological circuits (as expressed in his new textbook, Mead 1989). He argues that the physical restrictions on the density of wires, the low power consumption of the CMOS process in the subthreshold domain, the limited precision, and the cost of communication imposed by the spatial layout of the electronic circuits are similar to the constraints imposed on biological circuits. Furthermore, the silicon medium provides both the computational neuroscience and the engineering communities with tools to test theories under realistic, real-time conditions. To further the spread of this technology into the general academic community, all circuits are fabricated via the silicon foundry MOSIS.

4 Regularization theory and analog networks
Problems in vision are usually inverse problems; the two-dimensional intensity distribution on the retina or camera must be inverted to recover physical properties of the visible three-dimensional surfaces surrounding the viewer. More precisely, these problems are ill-posed in that they either admit no solution, admit infinitely many solutions, or have a solution that does not depend continuously on the data. In general, additional constraints must be applied to arrive at a stable and unique solution. One common technique to achieve this, termed "standard regularization" (Poggio et al. 1985), is via minimization of a given "cost" functional (for earlier examples of this see Grimson 1981; Horn and Schunck 1981; Ikeuchi and Horn 1981; Terzopoulos 1983; Hildreth 1984). One term in these functionals assesses by how much the solution diverges from the measured data. The other measures how closely the solution conforms to certain a priori expectations, for instance that the final surface should be as smooth as possible. Let us briefly consider the problem of fitting a 2-D surface through a set of noisy and sparse depth measurements, a well-explored problem in computer vision (Grimson 1981). Specifically, a set of sparse depth measurements d_ij is given on a 2-D lattice, corrupted by some noise process. It is obvious that infinitely many surfaces f_ij can be fitted through the sparse data set. One way to regularize this problem is to find the surface f that minimizes

$$E(f) = \sum_{i,j}\left[(f_{i+1,j}-f_{i,j})^2 + (f_{i,j+1}-f_{i,j})^2\right] + \alpha \sum_{i,j\in\mathcal{D}} (f_{i,j}-d_{i,j})^2 \qquad (4.1)$$
in which α depends on the signal-to-noise ratio and the second sum only contains contributions from those locations i, j where data exist. Equation (4.1) represents the simplest possible functional, even though many alternatives exist (Grimson 1981; Terzopoulos 1983; Harris 1987). This and all other quadratic regularized variational functionals of early vision can be solved with simple linear resistive networks by virtue of the fact that the electrical power dissipated in linear networks is quadratic in the current or voltage (Poggio et al. 1985; Poggio and Koch 1985). The resistive network will then converge to its unique equilibrium state in which the dissipated power is at a minimum (subject to the source constraints). The static version of this statement is known as Maxwell's Minimum Heat Theorem. The steady-state of the resistive network in figure 4a minimizes
Figure 4: Surface interpolating network. (a) At those locations where depth data are available, the values of the battery E_ij and of the conductance G are set to their appropriate values via additional sample-and-hold circuitry. The output is the voltage V_ij at each location. For small enough voltage gradients, this circuit solves a modified form of Poisson's equation (4.2) by minimizing expression (4.1). Experimental results from a 48 by 48 subthreshold, analog CMOS VLSI circuit are shown next (Luo et al. 1988). (b) The input voltage, corresponding to a flat, 2 pixel wide strip around the periphery and a central 4 pixel wide tower (solid coloring). At these locations, the conductance G is set to a constant, fixed value, while G is zero everywhere else. Thus, no data are present in the area between the bottom of the tower and the outside strip. (c), (d), and (e) show the output voltage for a high, medium, and low value of the transversal resistance R. If R is small enough, the resulting smoothing will flatten out the central tower.
expression (4.1) if the voltage V_ij is identified with the discretized solution surface f_ij, the battery E_ij with the data d_ij, and the product of the variable conductance G_ij connecting the node to the battery and the constant horizontal resistance R with α. The power minimized by this circuit is then formally equivalent to the functional of equation (4.1). The performance of an experimental 48 by 48 subthreshold, analog CMOS VLSI circuit is illustrated in figure 4 (Luo et al. 1988). For an infinitely fine grid and a voltage source E(x, y), the surface interpolation chip computes the voltage distribution V(x, y) according to the modified Poisson equation

$$\nabla^2 V + RGV = RGE, \qquad (4.2)$$
with either an arbitrary Dirichlet boundary condition (such as zero voltage along the boundary) or a zero voltage slope (that is, no current across the boundary) Neumann boundary condition. If RG is constant across the grid, this equation is sometimes known as the Helmholtz equation. Note that the difference from equation (3.2) lies in the choice of observable, current I(x, y) versus voltage V(x, y). A large number of problems in early vision, such as detecting edges, computing motion, or estimating disparity from two images, have a similar architecture, with resistive connections among neighboring nodes implementing the constraint that objects in the real world tend to be smooth and continuous.
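The following sketch makes this correspondence concrete. It minimizes the functional of equation (4.1) by Jacobi relaxation, the software analog of letting the network of figure 4 settle; the grid size, the value of α, the test surface, and the use of periodic boundaries are all illustrative choices.

    import numpy as np

    def interpolate(depth, mask, alpha=2.0, iters=5000):
        # mask is True where a depth measurement d_ij exists; alpha
        # plays the role of the product R*G in equation (4.2).
        f = np.full(depth.shape, depth[mask].mean())
        for _ in range(iters):
            nb = (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                  np.roll(f, 1, 1) + np.roll(f, -1, 1))
            # Setting dE/df_ij = 0 in eq. (4.1), neighbors held fixed:
            f = (nb + alpha * mask * depth) / (4.0 + alpha * mask)
        return f

    # Sparse, noisy samples of a smooth bump are filled in by the
    # smoothness term of the functional.
    rng = np.random.default_rng(0)
    xx, yy = np.meshgrid(np.linspace(-1, 1, 48), np.linspace(-1, 1, 48))
    true = np.exp(-(xx**2 + yy**2) / 0.5)
    mask = rng.random((48, 48)) < 0.05
    data = np.where(mask, true + 0.01 * rng.standard_normal((48, 48)), 0.0)
    print(np.abs(interpolate(data, mask) - true).mean())  # small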
5 Discontinuities

However, the most interesting locations in any scene are arguably those at which some feature changes abruptly, for instance the 2-D optical flow at the boundary between a moving figure and the stationary background, or the color across the sharp boundaries in a painting. Geman and Geman (1984) (see also Blake and Zisserman 1987) introduced the powerful concept of a binary line process l_ij, which explicitly codes for the absence (l_ij = 0) or presence (l_ij = 1) of a discontinuity at location i, j in the 2-D image. Further constraints, such as that discontinuities should occur along continuous contours (as they do, in general, in the real world) or that they rarely intersect, can be incorporated into their theory, which is based on a statistical estimation technique (see also Marroquin et al. 1987). In the case of surface interpolation and smoothing, maximizing the a posteriori estimate of the solution can be shown to be equivalent to minimizing:

$$E(f,l) = \sum_{i,j}\left[(f_{i+1,j}-f_{i,j})^2(1-l^h_{ij}) + (f_{i,j+1}-f_{i,j})^2(1-l^v_{ij})\right] + \alpha \sum_{i,j\in\mathcal{D}} (f_{i,j}-d_{i,j})^2 + \sum_{i,j} V(l^h_{ij}, l^v_{ij}) \qquad (5.1)$$
where l^h_ij and l^v_ij are the horizontal and vertical depth discontinuities, β a fixed parameter, and V a potential function containing a number of terms penalizing or encouraging specific configurations of line processes. In the case of surface interpolation, a simple example is V(l^h_ij) = β l^h_ij; that is, the line process l^h_ij between i, j and i+1, j will be set to 1 if the "cost" for smoothing, that is (f_{i+1,j} - f_{i,j})², is larger than the parameter β. Otherwise, l^h_ij = 0. Discontinuities greatly improve the performance of early vision processes, since they allow algorithms to smooth over unreliable or sparse data as well as account for boundaries between figures and ground. In fact, it can be argued that the introduction of discontinuities represents the single biggest advance in machine vision in the last 5 years. They have been used to demarcate boundaries in the intensity, color, depth, motion, and texture domains (Geman and Geman 1984; Terzopoulos 1986; Blake and Zisserman 1987; Marroquin et al. 1987; Gamble and Poggio 1987; Hutchinson et al. 1988; Poggio et al. 1988; Chhabra and Grogan 1988). Line discontinuities can be implemented in various ways. In a hybrid implementation, each line process is represented by a simple binary switch. When the switch is open, no current flows across the connection between the two adjacent nodes i, j and i+1, j. The network operates by switching between distinct modes. In the analog cycle the network settles into the state of least power dissipation, given a fixed distribution of switches. In the digital phase, the line processes are evaluated using expression (5.1); that is, the switches are set to the state minimizing this expression. Such a hybrid implementation is illustrated in figure 5 for the case of computing the optical flow in the presence of motion discontinuities. The flow field - induced by the time-varying image intensity I(x, y, t) - is regularized using a smoothness constraint (Horn and Schunck 1981). The amount of smoothing is governed by the constant resistance value R of the upper and lower horizontal grids. In a complete analog implementation, each line process is represented by a "neuron" whose output varies continuously between 0 and 1 (Koch et al. 1986), similar to Hopfield and Tank's (1985) use of such continuous variables to solve the "traveling salesman problem." Another possibility exploits the saturation inherent in Mead's design for resistances (Mead 1989). As illustrated in figure 3b, the current-voltage relation of the resistive element is linear for voltage differences on the order of 100 mV, while the element saturates for larger voltage differences. In other words, the peak current is independent of the size of the voltage gradient (as long as it is larger than some threshold), implementing a first approximation of a binary line process. This occurs in figure 4c, where segmentation starts to occur due to the large voltage gradient between the base and the top of the "tower."
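In software, the hybrid scheme amounts to alternating a relaxation of f with the line processes held fixed and a thresholding update of the line processes. The 1-D sketch below is an illustration with made-up parameters; it uses the simple potential V(l) = βl discussed above rather than Geman and Geman's full machinery, and recovers a noisy step without blurring it.

    import numpy as np

    def hybrid_smooth(d, alpha=1.0, beta=0.05, sweeps=10, iters=500):
        f = d.copy()
        l = np.zeros(len(d) - 1)  # l[i] = 1: open switch between i, i+1
        for _ in range(sweeps):
            w_l = np.concatenate(([0.0], 1 - l))  # coupling to left
            w_r = np.concatenate((1 - l, [0.0]))  # coupling to right
            for _ in range(iters):
                # "Analog" cycle: settle f at fixed switch settings.
                nb = w_l * np.roll(f, 1) + w_r * np.roll(f, -1)
                f = (nb + alpha * d) / (w_l + w_r + alpha)
            # "Digital" cycle: open a switch wherever the smoothing
            # cost (f[i+1] - f[i])**2 exceeds beta, as described above.
            l = (np.diff(f) ** 2 > beta).astype(float)
        return f, l

    # A noisy step: the noise is smoothed away while a line process
    # forms at the discontinuity instead of blurring it.
    rng = np.random.default_rng(1)
    d = np.where(np.arange(100) < 50, 0.0, 1.0)
    d += 0.05 * rng.standard_normal(100)
    f, l = hybrid_smooth(d)
    print(np.flatnonzero(l))  # typically a single break near index 49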
A very promising implementation is via "resistive fuses" (Harris et al. 1989): in such a two-terminal nonlinear resistor, the current flowing through is proportional to the voltage difference across, as long as that difference is less than a threshold. If the voltage gradient exceeds the threshold, the current decreases and, for large enough voltage gradients, is set to zero. The experimentally determined voltage-current relationship of this device (Harris et al. 1989) is closely related to the cost function used with the "analog" line discontinuities (Koch et al. 1986). It can also be derived from the cost function used in the "graduated nonconvexity" method of Blake and Zisserman (1987). The notion of minimizing power in linear networks implementing quadratic regularization algorithms must be replaced by the more general notion of minimizing the total co-content J for networks with "resistive fuses" (where J = ∫₀^V f(V′) dV′ for a resistor defined by I = f(V)). Although the method proposed by Geman and Geman (1984) requires stochastic optimization techniques and complicated potential functions (V in expression (5.1)) to implement the various constraints under which line discontinuities operate, computer simulations have shown that various deterministic approximations as well as much simplified potential functions can be used (see also Blake 1989).
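A qualitative model of such a fuse takes only a few lines; the functional form below is an illustrative guess with the right ohmic-then-collapsing shape, not the measured characteristic of Harris et al. (1989).

    import numpy as np

    def fuse_current(v, lam=1.0, vt=0.1):
        # Ohmic (I = lam*v) for |v| below the threshold vt; the current
        # then collapses smoothly toward zero for larger gradients.
        excess = np.maximum(np.abs(v) - vt, 0.0)
        return lam * v * np.exp(-(excess / vt) ** 2)

    v = np.linspace(-0.5, 0.5, 11)
    print(np.round(fuse_current(v), 4))  # linear center, ~0 at the ends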
6 Analog chips versus digital computers

As we have seen, all of the above circuits exploit the physics of the system to perform operations useful from a computational point of view. Thus, the transient voltage or charge distribution at some time in the networks of figures 1 and 2 corresponds to the solution, in this case convolution of the image intensity with a Gaussian. In the networks derived from the appropriate variational functionals, the stationary voltage distribution corresponds to the interpolated surface (Fig. 4) or to the optical flow (Fig. 5). These quantities are governed by Kirchhoff's and Ohm's laws, instead of being computed symbolically via execution of software on a digital computer. Furthermore, the architecture of the analog resistive circuits reflects the nature of the underlying computational task, for instance smoothing, while digital computers - being Turing universal - do not. One of the advantages of these non-clocked analog circuits is that their operating mode is optimally suited to analog sensory data, since they avoid the temporal aliasing problems caused by discrete temporal sampling. Furthermore, their robustness to imprecisions or errors in the hardware, their processing speed and low power consumption (Mead's retina requires less than a mW, most of which is used in the photoconversion stage), and their small size make analog smart sensors very attractive for tele-robotic applications, remote exploration of planetary surfaces, and a host of industrial applications where their power-hungry, heat-producing, bulky, and slow digital cousins are unable to compete.
Figure 5: Hybrid resistive network to compute the optical flow in the presence of discontinuities. The algorithm computes the smoothest flow field compatible with the measured motion data (Horn and Schunck 1981; Hutchinson et al. 1988). The steady-state voltage distribution in the upper grid is equivalent to the x component, and the stationary voltage distribution in the lower grid is equivalent to the y component, of the optical flow. A high voltage at location i, j will spread to its four neighboring nodes. The degree to which voltage spreads, and thus the degree of smoothness, is governed by the value of the constant horizontal resistance R. The values of the batteries E_x and E_y and the conductances G_x, G_y (for clarity, only two such elements are drawn) and G depend on the measured spatial and temporal intensity gradients ∇I and I_t and will be set by on-chip photoreceptors. Binary switches l implement motion discontinuities, since an arbitrarily high voltage, that is, velocity, will not affect the neighboring site across the discontinuity. Adapted from Hutchinson et al. (1988).
The two principal drawbacks of analog VLSI circuits are their lack of flexibility and their imprecision. The above circuits are all hardwired to perform very specific tasks, unlike digital computers, which can be programmed to approximate any logical or numerical operation. Only certain parameters associated with the algorithm, for instance the smoothness in the case of figures 4 and 5, can be varied. Thus, digital computers appear vastly preferable for developing and evaluating new algorithms; analog implementations should only be attempted after such initial exploration of algorithms. Furthermore, although 12 and even 16 bit analog-to-digital converters are commercially available, it seems unlikely that the precision of analog vision circuits will exceed 7 to 8 bits of resolution in the next few years. However, for a number of important tasks, such as navigation or tracking, the incoming intensity data are rarely more accurate than 1% in any case.

7 The Future
Within the last year, a number of potentially very exciting developments have occurred which bode well for the future of analog vision circuits. Mahowald and Delbrück (1989) from Mead's laboratory have built and tested an analog CMOS VLSI circuit implementing a version of Marr and Poggio's (1976) cooperative stereo algorithm. Two 1-D phototransistor arrays, with 40 elements each, located next to each other on the chip, provide the input to the circuit. A winner-take-all circuit selects the most active node among the seven possible disparity values at each pixel, replacing the inhibitory interaction in the original algorithm. A problem plaguing analog subthreshold circuits is random offsets, which vary from location to location and are caused by fluctuations in the process accuracy as well as dark currents. Such offsets, while usually not problematic for digital circuits, can be very disruptive when operating in the analog domain, in particular when spatial or temporal derivatives are required. Mead (1988) has recently developed a variant of the "floating gate" technology long used for resetting erasable programmable read-only-memory (EPROM) cells by means of ultraviolet light. While previously the chips were bombarded with UV radiation to erase memory, Glasser (1985) of MIT demonstrated how this technology could be used to selectively write a "0" or a "1" into the cell. Mead is the first to have applied this technique to the analog domain, by building a local feedback circuit at every node of the retina (Fig. 3) which senses the local current and attempts to keep it at or near zero by charging up a capacitor located between two layers of polysilicon positioned above each node. Exposure to UV light excites electrons sufficiently to enable them to surmount the potential barrier at the silicon/silicon dioxide interface. In order to adapt the retina, a blank, homogeneous image is projected for a fraction of a minute onto the chip, in the presence of the UV light.
This effectively creates a "floating" battery at each location, which induces a current exactly counteracting the effect of the offset current at that pixel. Mead (1988) has even been able to demonstrate after-image-like phenomena. Another problem with most resistive networks for early vision problems is that the values of the individual circuit elements, such as conductances or voltage sources, depend on the measured data or can even be negative in value (the associated operator is not, in other words, of the convolution type), raising problems with network stability. Harris at Caltech has shown how this problem can be circumvented via the use of so-called "constraint boxes," which impose a generalized constraint equation (Harris 1987; 1989). For the case of reconstructing surfaces using a smoother functional than that of equation (4.1) (so-called cubic spline or thin-plate interpolation), his circuit implements an equation of the form V_a - V_b - V_c = 0. The VLSI circuit has been tested successfully (Harris 1989) and is unusual in that all of its terminals can act as input or output nodes. Thus, if nodes a and b are held constant, then node c is fixed to V_a - V_b. Using these constraint boxes in the case of computing smooth optical flow (Horn and Schunck 1981; Hutchinson et al. 1988), all resistance values are positive and data independent, a considerable advantage when building these circuits. A team at MIT headed by J. Wyatt, and including B. Horn, H.-S. Lee, T. Poggio, and C. Sodini, is initiating an ambitious effort to fabricate analog early vision chips exploiting different circuit technologies, such as CCD or mixed bipolar and CMOS devices. They plan to build various 2-D spatial correlator and convolver circuits, analog image memories, and single-chip moment calculators and motion sensors. As part of this effort, new methods for estimating first and second image moments and for computing optical flow under various constraints (for example, a rigid environment) are being developed purely for such analog implementations (Horn 1989). Fusion of information on-chip is being attempted in Koch's laboratory at Caltech, by integrating a set of simple resistive networks computing depth and depth discontinuities, as well as edges and optical flow, onto a small, autonomous moving vehicle. A number of other laboratories are also engaged in efforts to build vision sensors. In particular, a group at UCLA and Rockwell International (White et al. 1988) is designing a 2-D network for edge detection on the basis of Poggio, Voorhees, and Yuille's (1986) proposal, via a set of four 1-D resistive lines. Thus, it appears that the analog computers of the 1940s and 1950s (Karplus 1958), until recently considered extinct, are making a sort of comeback in the form of highly dedicated smart vision chips.

Acknowledgments

The author thanks John Harris, Berthold Horn, Andy Lumsdaine, and in particular John Wyatt for a careful reading of the manuscript. Research
on analog circuits for vision is supported by a Young Investigator Award and grant IST-8700064 from the Office of Naval Research, a Presidential Young Investigator Award from the National Science Foundation, as well as by DDF-II funds from the Jet Propulsion Laboratory and by Rockwell International.

References

Barrow, H.G. and J.M. Tenenbaum. 1981. Computational vision. Proceedings of the IEEE, 69, 572-595.
Blake, A. 1989. Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 2-12.
Blake, A. and A. Zisserman. 1987. Visual Reconstruction. Cambridge: MIT Press.
Chhabra, A.K. and T.A. Grogan. 1988. Estimating depth from stereo: Variational methods and network implementation. IEEE International Conference on Neural Networks, San Diego.
Gamble, E. and T. Poggio. 1987. Integration of intensity edges with stereo and motion. Artificial Intelligence Laboratory Memo, 970. Cambridge: MIT.
Geman, S. and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Glasser, L.A. 1985. A UV write-enabled PROM. In: 1985 Chapel Hill Conference on VLSI, ed. W. Fuchs, 61-65. Rockville, MD: Computer Science Press.
Grimson, W.E.L. 1981. From Images to Surfaces. Cambridge: MIT Press.
Harris, J.G. 1987. A new approach to surface reconstruction: The coupled depth/slope model. Proceedings of the IEEE First International Conference on Computer Vision, 277-283, London.
. 1989. An analog VLSI chip for thin plate surface interpolation. In: Neural Information Processing Systems, ed. D. Touretzky, 687-694. San Mateo, CA: Morgan Kaufmann.
Harris, J., C. Koch, J. Luo, and J. Wyatt. 1989. Resistive fuses: Analog hardware for detecting discontinuities in early vision. In: Analog VLSI Implementations of Neural Systems, eds. C. Mead and M. Ismail. Norwell, MA: Kluwer.
Hildreth, E.C. 1984. The Measurement of Visual Motion. Cambridge: MIT Press.
Hillis, W.D. 1985. The Connection Machine. Cambridge: MIT Press.
Hopfield, J.J. and D.W. Tank. 1985. "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152.
Horn, B.K.P. 1974. Determining lightness from an image. Computer Graphics and Image Processing, 3, 277-299.
. 1986. Robot Vision. Cambridge: MIT Press.
. 1989. Parallel networks for machine vision. Artificial Intelligence Laboratory Memo, 1071. Cambridge: MIT.
Horn, B.K.P. and B.G. Schunck. 1981. Determining optical flow. Artificial Intelligence, 17, 185-203.
Hutchinson, J., C. Koch, J. Luo, and C. Mead. 1988. Computing motion using analog and binary resistive networks. IEEE Computer, 21, 52-63.
Ikeuchi, K. and B.K.P. Horn. 1981. Numerical shape from shading and occluding boundaries. Artificial Intelligence, 17, 141-184.
Karplus, W.J. 1958. Analog Simulation: Solution of Field Problems. New York: McGraw-Hill.
Knight, T. 1983. Design of an integrated optical sensor with on-chip preprocessing. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.
Koch, C. 1989. Resistive networks for computer vision: A tutorial. In: An Introduction to Neural and Electronic Networks, eds. S.F. Zornetzer, J.C. Davis, and C. Lau. Academic Press, in press.
Koch, C., J. Marroquin, and A. Yuille. 1986. Analog "neuronal" networks in early vision. Proceedings of the National Academy of Sciences USA, 83, 4263-4267.
Little, J. 1987. Parallel algorithms for computer vision on the Connection Machine. Artificial Intelligence Laboratory Memo, 928. Cambridge: MIT.
Luo, J., C. Koch, and C. Mead. 1988. An experimental subthreshold, analog CMOS two-dimensional surface interpolation circuit. Oral presentation at the Neural Information Processing Systems Conference, Denver.
Mahowald, M. and T. Delbrück. 1989. Cooperative stereo matching using static and dynamic features. In: Analog VLSI Implementations of Neural Systems, eds. C. Mead and M. Ismail. Norwell, MA: Kluwer.
Marr, D. 1982. Vision. New York: Freeman.
Marr, D. and E.C. Hildreth. 1980. Theory of edge detection. Proceedings of the Royal Society of London B, 207, 187-217.
Marr, D. and T. Poggio. 1976. Cooperative computation of stereo disparity. Science, 194, 283-287.
Marroquin, J., S. Mitter, and T. Poggio. 1987. Probabilistic solution of ill-posed problems in computational vision. Journal of the American Statistical Association, 82, 76-89.
Mathur, B.P., H.T. Wang, C.T. Tsen, and E. Walton. 1988. Variable scale edge detection of images. IEEE Conference on Circuit Design, Rochester.
Mead, C.A. 1988. Plenary talk. Proceedings of the International Neural Network Society, Boston.
. 1989. Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.
Mead, C.A. and M.A. Mahowald. 1988. A silicon model of early visual processing. Neural Networks, 1, 91-97.
Poggio, T., E.B. Gamble, and J.J. Little. 1988. Parallel integration of vision modules. Science, 242, 436-440.
Poggio, T. and C. Koch. 1985. Ill-posed problems in early vision: From computational theory to analogue networks. Proceedings of the Royal Society of London B, 226, 303-323.
Poggio, T., V. Torre, and C. Koch. 1985. Computational vision and regularization theory. Nature, 317, 314-319.
Poggio, T., H. Voorhees, and A. Yuille. 1986. A regularized solution to edge detection. Artificial Intelligence Laboratory Memo, 833. Cambridge: MIT.
Proceedings of the First International Conference on Computer Vision, London, 1987. Washington, D.C.: IEEE Computer Society Press.
Sage, J.P. 1984. Gaussian convolution of images stored in a charge-coupled device. Quarterly Technical Report, August 1 - October 31, 1983, 53-59. Lexington: MIT Lincoln Laboratory.
Sage, J.P. and A.L. Lattes. 1987. A high-speed two-dimensional CCD Gaussian image convolver. Quarterly Technical Report, August 1 - October 31, 1986, 49-52. Lexington: MIT Lincoln Laboratory.
Sivilotti, M.A., M.A. Mahowald, and C.A. Mead. 1987. Real-time visual computation using analog CMOS processing arrays. In: 1987 Stanford Conference on VLSI, 295-312. Cambridge: MIT Press.
Strausfeld, N. 1975. Atlas of an Insect Brain. Heidelberg: Springer.
Tanner, J.E. 1986. Integrated optical motion detection. Ph.D. thesis, Department of Computer Science, 5223:TR86, Caltech.
Terzopoulos, D. 1983. Multilevel computational processes for visual surface reconstruction. Computer Vision, Graphics, and Image Processing, 24, 52-96.
. 1986. Regularization of inverse problems involving discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, 413-424.
White, J., B. Furman, A.A. Abidi, R.L. Baker, B. Mathur, and H.T. Wang. 1988. Parallel analog architecture for 2D Gaussian convolution of images. Proceedings of the International Neural Network Society, Boston.
Wyatt, J.L. and D.L. Standley. 1989. Criteria for robust stability in a class of lateral inhibition networks coupled through resistive grids. Neural Computation, 1, 58-67.

Received 30 September 1988; accepted 14 October 1988.
A Proposal for More Powerful Learning Algorithms Eric B. Baum* Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
Judd (1988) and Blum and Rivest (1988) have recently proved that the loading problem for neural networks is NP-complete. This makes it very unlikely that any algorithm like backpropagation, which varies weights on a network of fixed size and topology, will be able to learn in polynomial time. However, Valiant has recently proposed a learning protocol (Valiant 1984) which allows one to sensibly consider generalization by learning algorithms with the freedom to add neurons and synapses, as well as simply to adjust weights. Within this context, standard circuit complexity arguments show that learning algorithms with such freedom can solve in polynomial time any learning problem that can be solved in polynomial time by any algorithm whatever. In this sense, neural nets are universal learners, capable of learning any learnable class of concepts.
What are the ultimate limits on neural net learning? Backpropagation (Rumelhart et al. 1986) has proved itself able to learn, from databases of real-world examples, state-of-the-art expert systems for tasks such as phoneme recognition (Waibel et al. 1987). Most researchers, however, feel that backpropagation is too slow, and that the time it takes to converge grows too rapidly as the size of the problem is scaled up, for it to be practical in most applications. The phoneme recognition example cited above involves a very limited vocabulary for exactly this reason. Strong evidence for this position has been given by Judd (1988), who proved NP-complete the problem of loading: given a set of examples and a net, is there any choice of weights for which that net implements those examples? More recently, Blum and Rivest (1988) have proved that the loading problem remains NP-complete even for nets containing only three neurons. These results show it is unlikely that any algorithm which simply varies weights on a net of fixed size and topology can learn in polynomial time. This paper remarks that these obstructions to rapid learning can be avoided if one considers algorithms with the power to add neurons and

*Current address: Department of Physics, Princeton University, Princeton, NJ 08540, USA.
Neural Computation 1, 201-207 (1989) © 1989 Massachusetts Institute of Technology
synapses, as well as simply varying synaptic weights.¹ Indeed, algorithms with such power can simply construct a neural random access memory (Baum 1988a) which solves any given loading problem. The standard neural net algorithms work by attempting to load a set of examples onto a fixed-size net. The hope is that if the examples can be loaded, and if there are more examples than weights, then the net produced will correctly generalize to future examples. Obviously, this approach to generalization is not available if learning algorithms are considered which can add neurons. For this reason, it only makes sense to consider learning algorithms with this freedom in a context which meaningfully includes generalization. This is an important reason why such algorithms have not been extensively studied. Fortunately, Valiant has recently suggested such a learning protocol (Valiant 1984). Informally, Valiant assumes that we are supplied with examples chosen according to some arbitrary but fixed probability distribution, and classified as positive or negative according to some target function. So, for a practical example, we might have a database of symptoms of medical patients, supplied according to the naturally occurring distribution of illness in the population, and we might be told in each instance whether or not the patient had appendicitis. We desire a learning algorithm which will rapidly produce a representation function that correctly classifies, as having or not having appendicitis, a large fraction of future patients drawn from the same distribution. The representation function might be drawn from a particular class of functions - for example, the learning algorithm might simply adjust the weights on a fixed neural net. The results of Judd (1988) and Blum and Rivest (1988) show it is unlikely that any such algorithm, which varies weights on a fixed net, will converge in polynomial time to learn all functions consistent with the net. I will argue, however, that widely applicable, rapidly convergent algorithms may possibly be constructed if we consider algorithms which build a representation net as they learn, adding neurons and synapses as necessary. More formally, Valiant assumed that one wished to learn from examples a Boolean function f in a class F. Here examples are pairs {x, f(x)}, where x is chosen according to some arbitrary, unknown, but fixed probability distribution D over the n-dimensional domain of f, typically either R^n or {1, -1}^n. We ask for what classes F there exists an algorithm A which can learn every f ∈ F for every distribution D, in time polynomial in n and parameters 1/ε, 1/δ > 0. A is said to learn a function f if, supplied with examples of f drawn from distribution D, A produces with probability 1 - δ a representation g in a representation class G, such that g(x) = f(x) with probability at least 1 - ε for examples drawn from D.

¹Although the brain evidently has a fixed number of neurons, it seems extremely likely that in learning the brain recruits previously unused or underused neurons to build new subcircuits representing learned concepts.
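As a concrete toy instance of the protocol just defined, Valiant's (1984) algorithm for monotone conjunctions can be written in a few lines: start from the conjunction of all n variables and delete any variable that is 0 in some positive example. The target function, distribution, and sample sizes below are arbitrary illustrative choices.

    import numpy as np

    def learn_conjunction(examples):
        # Keep exactly those variables equal to 1 in every positive
        # example; runs in time polynomial in n and the sample size.
        n = len(examples[0][0])
        keep = np.ones(n, dtype=bool)
        for x, label in examples:
            if label:
                keep &= np.asarray(x, dtype=bool)
        return keep  # hypothesis: the AND of the kept variables

    def predict(keep, x):
        return bool(np.all(np.asarray(x, dtype=bool)[keep]))

    # Target: the conjunction of variables 1 and 3 (0-based indices);
    # D = uniform over {0, 1}^10.
    rng = np.random.default_rng(2)
    target = lambda x: bool(x[1] and x[3])
    train = [(x, target(x)) for x in rng.integers(0, 2, size=(500, 10))]
    keep = learn_conjunction(train)
    test = rng.integers(0, 2, size=(2000, 10))
    err = np.mean([predict(keep, x) != target(x) for x in test])
    print(np.flatnonzero(keep), err)  # typically [1 3] and err = 0.0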
Valiant's framework is very well designed for studying the learning problem backpropagation addresses, that of producing "expert systems" from databases.² It includes arguably the correct definition of generalization, that is, prediction of future examples drawn from the same distribution as the database. It demands polynomial scaling, and it allows a controlled probability (δ) of failure of the algorithm and fraction (ε) of incorrect predictions, both of which are provably necessary if examples are to be supplied probabilistically. The extension where the learning algorithm A is allowed to supply vectors x and receive their classification f(x) (Valiant 1984), and the restriction to particular fixed distributions of example vectors, such as the uniform distribution over {1, -1}^n (Kearns et al. 1987), have both been studied. The result I will describe holds in both these cases. Within Valiant's protocol, a number of interesting classes of functions have been proved to be learnable (Blumer et al. 1987; Valiant 1984). It has also become clear that the choice G of representation class is crucial to the learnability of F.³ To a pragmatist, a proof that F is not learnable by G, for some specific F and G, only demonstrates that we may have chosen the wrong representation class. A pragmatic learner should be willing to use any class of representations necessary to solve his problem. He should not be limited by a priori prejudices. For concreteness, I define a feedforward neural network as a directed acyclic graph G, with n source nodes called inputs and a distinguished sink node corresponding to the output. A weight is associated with every edge and a threshold with every node except the input nodes. When a feature vector x is presented at the input nodes (component i to the ith input), each node computes its value as soon as all the neighbors pointing to it have computed theirs, by a simple linear threshold function, i.e., u_i = sgn(Σ_j w_ij u_j - t_i), where w_ij is the weight of edge (i, j), t_i is the threshold at node i, and the sum is over all edges pointing towards i. The class of feedforward neural nets is thus the transitive closure of linear threshold functions.

²Indeed, within the context of Valiant's framework, we have given tight bounds on the size of the network which should be employed to achieve accurate generalization to future examples (Baum and Haussler 1988) using any algorithm which (like backpropagation) varies weights on a fixed network. In addition, sharp bounds have been given on the number of test examples necessary to verify the reliability of the expert system generated (Baum 1988b).

³For example, it may easily be shown that the class F_thr of linear threshold functions with real-valued weights is learnable (Blumer et al. 1987) by an algorithm (based on a polynomial-time solution to the linear programming problem (Karmarkar 1984)) which uses F_thr itself as the representation class. The subclass F_Bthr of Boolean threshold functions with weights valued either 0 or 1 is, however, not self-learnable, since a self-learning algorithm for this class must represent the examples by Boolean threshold functions, which can be proved equivalent to solving integer programming (Kearns et al. 1987; Pitt and Valiant 1986). F_Bthr is of course learnable by an algorithm using as representation G = F_thr.
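The definition is easily made concrete. The sketch below evaluates a layered threshold net, u = sgn(Wu - t), a special case of the DAG definition above (a general DAG would be evaluated in topological order); the particular two-layer weights shown are an arbitrary example computing XOR on inputs in {-1, +1}.

    import numpy as np

    def eval_threshold_net(x, layers):
        # layers: list of (W, t) pairs; each node computes a linear
        # threshold function u_i = sgn(sum_j w_ij * u_j - t_i).
        u = np.asarray(x, dtype=float)
        for W, t in layers:
            u = np.sign(W @ u - t)
        return u

    # Hidden unit 1 fires unless both inputs are +1; hidden unit 2
    # fires unless both are -1; their conjunction is XOR.
    layers = [(np.array([[-1.0, -1.0], [1.0, 1.0]]),
               np.array([-1.5, -1.5])),
              (np.array([[1.0, 1.0]]), np.array([1.5]))]
    for a in (-1, 1):
        for b in (-1, 1):
            print(a, b, eval_threshold_net([a, b], layers))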
I define a class G of representations as a p-time representation if for all x and for all g ∈ G, g(x) may be computed in time polynomial in n and the size of g. It is difficult to see how a learning algorithm A could effectively utilize, in polynomial time, expressions g not evaluatable in polynomial time. Also, any learning algorithm which produced expressions not feasibly evaluatable would be of little practical import. Thus the restriction to p-time representations seems benign. We now observe the following proposition: for any class of concepts F and any p-time representation G, if F is learnable by G, then F is learnable by feedforward neural nets. Actually, this result holds as well for the transitive closure of any class of representations sufficiently general that the transitive closure contains and, or, and not gates. I call any representation class H of this type complete. The proof is simple. Note that any representation is some encoding of an algorithm for deciding whether any given vector x corresponds to a positive or a negative example. I assume that any algorithm is expressed, using any reasonable encoding, as a deterministic Turing machine (DTM), and that the sizes of the DTMs so generated under two reasonable encodings will be at least polynomially related. Thus any recognition algorithm for classifying positive and negative examples, specified in any given representation class, may be concisely written as a DTM. Now circuit complexity arguments (Baum 1988b; Pippenger and Fischer 1979) show that any DTM of size N which converges to yield an answer of yes or no in time T may be rapidly and explicitly mapped to an equivalent feedforward neural network of size polynomial in T and N. The p-time representation g produced by any learning algorithm A can be regarded as such a DTM. Since A produces g in polynomial time, g must be of polynomial size. Thus, given a learning algorithm A which produces in time p(n) a program g in the class G, we may explicitly modify A so that A instead produces in time p'(n) a feedforward neural network. This mapping is described in detail in Baum (1988b). Note that this result is much stronger than simply a statement that any representation produced by a learning algorithm could be expressed as a neural net. The point is that any representation produced by a learning algorithm can be expressed as a neural net in polynomial time, and thus that a composite learning algorithm can be formed which uses neural nets as its representation. An example which illustrates the difference is the class of conjunctive normal forms. While any Boolean function can be expressed in conjunctive normal form, it is by no means evident that the class of conjunctive normal forms is a complete learning representation, since expressing a representation in conjunctive normal form might take exponential time.
On the other hand, no claim whatever is asserted about the biological plausibility or locality of the learning algorithm, or even about whether the learning algorithm itself can be expressed as a neural net. Backpropagation was developed as an algorithm which, when supplied with a training set of examples, would choose weights in a neural net that hopefully would correctly classify a testing set of examples. When backpropagation was first proposed, it was not promoted as being biologically plausible or implementable by a neural net (although these questions have been studied since). In the same spirit, I assert that under very general conditions, there exist algorithms which, when supplied with appropriate training examples, will create a neural net that correctly classifies test examples. We cannot say whether such algorithms are local or biologically plausible, but only that they converge rapidly. The particular construction used in the proof will convert good algorithms in some other representation into relatively inefficient algorithms creating neural net representations. However, although we have only an existence proof, I believe it is interesting to note the limitations of NP-completeness results like those of Judd (1988) and Blum and Rivest (1988), and to suggest that it may be fruitful to look for practical algorithms which build nets as they go along. In a very general context, Blumer et al. (1987) have given conditions which suffice to demonstrate that a learning algorithm which can grow representations in a larger class than the class of target functions will converge to yield good generalization. These conditions involve the "Vapnik-Chervonenkis dimension." Such results could be applied to neural net representations using results on the VC-dimension of neural networks (Baum and Haussler 1988). Blumer et al. also give some examples of such learning algorithms. Under plausible but strong complexity-theoretic hypotheses, Goldreich et al. (1984) have constructed classes of poly-random functions not learnable by any representation and hence, in particular, not learnable by feedforward nets. More recently, Kearns and Valiant (1988) have shown, under a cryptographic hypothesis, that the class of feedforward nets, even when restricted to be logarithmically deep with each node connected to a constant number of others, is still not learnable by any p-time representation. These results, however, are proved by constructions of complicated, cryptographic functions. Thus the class of feedforward nets is proved unlearnable by construction of a small, cryptographically secure subclass. It is evident that human learning, which takes place in the natural world, is not required to solve the general decryption problem. It would thus be interesting to ask whether there is some "natural" class F of functions for which there is no learning algorithm A having the property that A will likely learn f if f is selected at random from F. In Valiant's protocol, one assumes that the function to be learned is drawn from some particular class of target functions. This assumption is vital, as one can demonstrate very generally by counting arguments that
there are simply too many possible functions for one to be able to learn without such a restriction (Baum 1988b; Blumer et al. 1987). Indeed, if one wishes to learn from n-dimensional examples in time polynomial in n, one must restrict the class of concepts to an exponentially small subset of the possible concepts. In the real world, of course, we are not supplied with a specific target class. However, we know that people are capable of learning in the real world, and thus that there must exist a small set of concepts which is simultaneously rapidly learnable and adequate for accurately describing the world. Theory cannot guide us in finding such a set of concepts; rather, we must be guided by heuristic experience. The successes of backpropagation strongly indicate that the world may be well described by relatively small feedforward neural networks. It is thus of considerable interest to construct learning algorithms for this class of functions.
Acknowledgments I thank L. Valiant for a critical reading. The research reported in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by the Strategic Defense Initiative Organization, Innovative Science and Technology Office and the National Aeronautics and Space Administration. This work was performed as part of JPL's Center for Space Microelectronics Technology.
References

Baum, E.B. 1988a. On the capabilities of multilayer perceptrons. Journal of Complexity, 4, 193-215.
Baum, E.B. 1988b. Complete representations for learning from examples. In: Complexity in Information Theory, ed. Y. Abu-Mostafa. Springer-Verlag, to appear.
Baum, E.B. and D. Haussler. 1988. What size net gives valid generalization? Neural Computation, 1, 151-160.
Blum, A. and R.L. Rivest. 1988. Training a 3-node neural network is NP-complete. Proceedings of the 1988 Workshop on Computational Learning Theory, 9-18. San Mateo, CA: Morgan Kaufmann.
Blumer, A., A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. 1987. Learnability and the Vapnik-Chervonenkis dimension. UCSC Technical Report UCSC-CRL-87-20, and Journal of the Association for Computing Machinery, in press.
Goldreich, O., S. Goldwasser, and S. Micali. 1984. How to construct random functions. Journal of the Association for Computing Machinery, 33:4, 792-807.
Judd, S. 1988. On the complexity of loading shallow neural networks. Journal of Complexity, 4.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica, 4, 373-395.
Kearns, M., M. Li, L. Pitt, and L.G. Valiant. 1987. On the learnability of Boolean formulae. Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, 285-295. New York: Association for Computing Machinery.
Kearns, M. and L.G. Valiant. 1988. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report 14-88, Aiken Laboratory, Harvard University.
Pippenger, N. and M.J. Fischer. 1979. Relations among complexity measures. Journal of the Association for Computing Machinery, 26:2, 361-381.
Pitt, L. and L.G. Valiant. 1986. Computational limitations on learning from examples. Harvard University Technical Report TR-05-86.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning internal representations by error propagation. In: Parallel Distributed Processing, eds. D.E. Rumelhart and J.L. McClelland, 318-362. Cambridge: MIT Press.
Valiant, L.G. 1984. A theory of the learnable. Communications of the ACM, 27:11, 1134-1142.
Waibel, A., T. Hanazawa, G. Hinton, and K. Shikano. 1987. Phoneme recognition using time-delay neural networks. ATR Laboratories Technical Report TR-I-0006.
Received 29 August 1988; accepted 25 September 1988.
Communicated by Steven Zucker
A Possible Neural Mechanism for Computing Shape From Shading Alex Pentland Vision Sciences Group, E15-410, The Media Laboratory, Massachusetts Institute of Technology, 20 Ames Street, Cambridge, MA 02138, USA
A simple neural mechanism that recovers surface shape from image shading is derived from a simplified model of the physics of image formation. The mechanism's performance is surprisingly good even when applied to complex natural images, and it is even able to extract significant shape information from some line drawings.

1 Introduction

Shading is the variation in image intensity due to changes in surface shape, and has long been recognized as one of the most important visual cues to surface shape. Leonardo da Vinci, for instance, wrote in his notebooks: "Shading appears to me to be of supreme importance in perspective, because, without it, opaque and solid bodies will be ill-defined." Despite its importance, however, relatively little is known about how people extract shape from shading. Perhaps the major obstacle to understanding is the lack of a good theoretical model. For although the physics is well understood, the mathematical problem is so underconstrained that no solution is possible without the use of simplifying assumptions (Horn 1975; Pentland 1984). Examination of the physics shows that there are three types of simplifications that might be useful. These are assumptions about the surface shape (e.g., smoothness), about the distribution of illumination (e.g., a single light source direction), or about the reflectance function (e.g., Lambertian reflectance). Of these three categories, assumptions about illumination are the least controversial: almost all research has accepted the hypothesis that people assume a single, distant illuminant within "fairly large" image regions. Assumptions about surface shape have received the most attention in recent research. Two types of simplifying assumptions have been found useful: (1) smoothness assumptions, first employed by Horn (1975), and (2) assumptions about local surface curvature, first employed by Pentland (1984). The resulting shape-from-shading techniques have estimated surface orientation, so that integration is required
to recover depth. Techniques employing smoothness have the additional disadvantage that they require dozens of iterations to converge to an answer. Despite considerable effort, neither type of surface assumption has resulted in the sort of fast, locally accurate recovery of shape typical of human vision. Research concerning the final type of simplifying assumption - assumptions about the reflectance function - has concentrated on the role of highlights and specularities rather than on the more diffuse type of reflection which dominates in most image regions. In this paper, I develop a new type of shape-from-shading theory based on a simple characterization of this diffuse type of reflection, describe a simple neural mechanism that computes shape from shading, and then evaluate this mechanism on natural imagery.

2 The Imaging of Surfaces
The first step is to recount the physics of how image shading is related to surface shape. Let z = z(x, y) be a surface, and let us assume that the surface is Lambertian and everywhere illuminated by (possibly several) distant point sources. I will also assume orthographic projection onto the x, y plane. I will let L = (x_L, y_L, z_L) = (cos τ sin σ, sin τ sin σ, cos σ) be the unit vector in the mean illuminant direction, where τ is the tilt of the illuminant (the angle the image-plane component of the illuminant vector makes with the x-axis) and σ is its slant (the angle the illuminant vector makes with the z-axis). Under these assumptions the normalized image intensity I(x, y) will be

\[ I(x, y) = \frac{p \cos\tau \sin\sigma + q \sin\tau \sin\sigma + \cos\sigma}{(p^2 + q^2 + 1)^{1/2}} \tag{2.1} \]

where p and q are the slopes of the surface along the x and y image directions respectively, e.g.,

\[ p = \frac{\partial z(x, y)}{\partial x}, \qquad q = \frac{\partial z(x, y)}{\partial y}. \tag{2.2} \]

Equation (2.1) can now be converted to a form which will allow us to relate image and 3-D surface in terms of their Fourier transforms, or in terms of any other convenient set of linear basis functions. This is accomplished by taking the Taylor series expansion of I(x, y) about p, q = 0 up through the quadratic terms, to obtain

\[ I(x, y) \approx \cos\sigma + p \cos\tau \sin\sigma + q \sin\tau \sin\sigma - \frac{\cos\sigma}{2}(p^2 + q^2). \tag{2.3} \]

Numerical simulation has shown that this expression yields a good approximation of the image intensities except for large values of p or q.
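The accuracy of the expansion can be checked numerically. Below is a minimal sketch comparing the exact intensity (2.1) with the quadratic approximation (2.3); the sample slopes and illuminant angles are illustrative choices, not values from the paper.

```python
import numpy as np

def intensity_exact(p, q, tau, sigma):
    # Equation (2.1): normalized Lambertian image intensity.
    num = (p * np.cos(tau) * np.sin(sigma)
           + q * np.sin(tau) * np.sin(sigma) + np.cos(sigma))
    return num / np.sqrt(p ** 2 + q ** 2 + 1.0)

def intensity_taylor(p, q, tau, sigma):
    # Equation (2.3): Taylor expansion about p = q = 0, through quadratic terms.
    return (np.cos(sigma) + p * np.cos(tau) * np.sin(sigma)
            + q * np.sin(tau) * np.sin(sigma)
            - 0.5 * np.cos(sigma) * (p ** 2 + q ** 2))

tau, sigma = np.deg2rad(30.0), np.deg2rad(60.0)   # an oblique illuminant
for p, q in [(0.1, -0.05), (0.3, 0.2), (0.8, 0.6)]:
    e = intensity_exact(p, q, tau, sigma)
    t = intensity_taylor(p, q, tau, sigma)
    print(f"p={p:4.2f} q={q:5.2f}  exact={e:.4f}  taylor={t:.4f}  err={abs(e - t):.4f}")
# The error stays small for modest slopes and grows for large p, q, as stated.
```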
Note that because the Taylor expansion is applied at each image pixel individually, no smoothing or approximation of the surface shape is involved; we are simplifying only the surface's reflectance function. Now let the complex Fourier spectrum F_z(f, θ) of z(x, y) be

\[ F_z(f, \theta) = m_z(f, \theta)\, e^{i\phi_z(f, \theta)} \tag{2.4} \]

where m_z(f, θ) is the magnitude at position (f, θ) on the Fourier plane, and φ_z is the phase. Since p and q are partial derivatives of z(x, y), their transforms F_p and F_q are simply related to F_z, for example,

\[ F_p(f, \theta) = 2\pi f \cos\theta\, m_z(f, \theta)\, e^{i(\phi_z(f,\theta) + \pi/2)} \tag{2.5} \]

\[ F_q(f, \theta) = 2\pi f \sin\theta\, m_z(f, \theta)\, e^{i(\phi_z(f,\theta) + \pi/2)} \tag{2.6} \]

Under the condition |p|, |q| < 1 the linear terms of equation (2.3) will dominate the image intensity function except when the average illuminant is within roughly ±30° of the viewer's position. Thus when either p, q << 1 or the illumination direction is oblique, so that the quadratic terms are negligible, the Fourier transform of the image I(x, y) is (ignoring the DC term)

\[ F_I(f, \theta) = 2\pi f \sin\sigma\, m_z(f, \theta)\, e^{i(\phi_z(f,\theta) + \pi/2)}\, [\cos\theta \cos\tau + \sin\theta \sin\tau]. \tag{2.7} \]

That is, the image intensity surface I(x, y) is a linear function of the height surface z(x, y).¹ When the quadratic terms of equation (2.3) dominate (e.g., when the illumination and viewing directions are similar) then the relationship between surface and image is substantially more complex. The quadratic terms, p² and q², are everywhere positive whereas p and q are both positive and negative. Thus the transforms of p² and q² will show a frequency doubling effect; that is, the image I(x, y) of a surface z(x, y) = sin(x) with L = (0, 0, 1) will be (roughly) I(x, y) = sin(2x). Although the actual situation is considerably more complex than this, the notion of frequency doubling provides a good qualitative description of the contribution of the quadratic terms to the overall image intensity function.

Human Psychophysics: The above mathematics makes it clear that a useful simplifying assumption is that the surface reflectance function is a linear function of surface orientation, for in this case it is possible to obtain a simple linear relationship between surface shape and image intensity. Such an assumption has the additional virtue of being an accurate description of the Lambertian reflectance function under a wide variety of common situations. This assumption has been tested in psychophysical experiments (Pentland 1988), with clear results: the hypothesis of a linear reflectance function correctly predicts people's judgment of shape, even when people's judgment is incorrect.

¹It is interesting to note that the moon's surface has exactly this reflectance function (Horn 1975).
In order to model human performance our theory will therefore assume that the linear model given by equation (2.7) correctly describes how surfaces are imaged.

3 Recovery of Shape
Examining equation (2.7) shows that if given the illuminant direction then the Fourier transform of the surface can be recovered directly, except for an overall scale factor and the DC term. That is, letting the Fourier transform of the image be

\[ F_I(f, \theta) = m_I(f, \theta)\, e^{i\phi_I(f, \theta)}, \tag{3.1} \]

then the Fourier transform of the z surface is simply

\[ F_z(f, \theta) = \frac{m_I(f, \theta)\, e^{i(\phi_I(f,\theta) - \pi/2)}}{2\pi f \sin\sigma\, [\cos\theta \cos\tau + \sin\theta \sin\tau]}. \tag{3.2} \]

To achieve reliable shape recovery using equation (3.2) it is necessary to add a stabilizing term to the denominator, so that its magnitude remains greater than approximately 0.5 f sin σ. Edge information often affects the perception of shape from shading; similarly, there are interactions between shading and other shape cues such as stereo. Boundary information derived from, for example, edges can be inserted by setting the surface shape components F_z(f, θ) along orientations approximately perpendicular to the illuminant direction, to produce the correct surface shape. Surface shape information, e.g., coarse stereo estimates, can be optimally combined with shading information as follows:

\[ F_z(f, \theta) = \frac{\sigma^2_{shading}\, F_{stereo}(f, \theta) + \sigma^2_{stereo}\, F_{shading}(f, \theta)}{\sigma^2_{stereo} + \sigma^2_{shading}} \tag{3.3} \]

where σ²_stereo and σ²_shading are the assumed variance of the two estimates of F_z(f, θ).
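Below is a minimal sketch of this recovery procedure, assuming the linear image model of equation (2.7) and using an FFT in place of the filter bank discussed next; the test surface, the clamping constant, and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def shape_from_shading(img, tau, sigma, c=0.5):
    """Recover a height surface from shading via equation (3.2)."""
    n, m = img.shape
    F_I = np.fft.fft2(img - img.mean())          # drop the DC term
    fy = np.fft.fftfreq(n)[:, None]
    fx = np.fft.fftfreq(m)[None, :]
    f = np.hypot(fx, fy)                          # radial spatial frequency
    theta = np.arctan2(fy, fx)                    # orientation on the Fourier plane
    direction = np.cos(theta) * np.cos(tau) + np.sin(theta) * np.sin(tau)
    # Stabilizing term: clamp the directional factor away from zero so the
    # denominator's magnitude stays above a fraction of 2*pi*f*sin(sigma).
    direction = np.sign(direction + 1e-12) * np.maximum(np.abs(direction), c)
    denom = 2 * np.pi * f * np.sin(sigma) * direction
    denom[0, 0] = 1.0                             # avoid 0/0 at DC
    F_z = F_I * np.exp(-1j * np.pi / 2) / denom   # invert (2.7): -pi/2 phase shift
    F_z[0, 0] = 0.0                               # height known only up to a constant
    return np.real(np.fft.ifft2(F_z))

# Synthetic check: render a bump with the linear model, then recover it.
n = 64
x, y = np.meshgrid(np.arange(n), np.arange(n))
z = np.exp(-((x - n / 2) ** 2 + (y - n / 2) ** 2) / 60.0)
tau, sigma = np.deg2rad(45.0), np.deg2rad(60.0)
gy, gx = np.gradient(z)                           # q = dz/dy, p = dz/dx
img = (np.cos(sigma) + gx * np.cos(tau) * np.sin(sigma)
       + gy * np.sin(tau) * np.sin(sigma))
z_hat = shape_from_shading(img, tau, sigma)
print("correlation:", np.corrcoef(z.ravel(), z_hat.ravel())[0, 1])
```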
3.1 A Neural Mechanism. The ability to recover surface shape by use of equation (3.2) suggests a neural mechanism for the perception of shape from shading. It is widely accepted that the visual system's initial cortical processing areas contain many cells that are tuned to orientation, spatial frequency, and phase. Although the tuning of these cells is relatively broad, one can still produce an estimate of shape by summing the output of these cells in a selective manner. Figure 1 illustrates this mechanism. I assume that some transformation T′ of the image is produced by a set of filters that are localized in both space and spatial frequency.
[Figure 1 diagram: the transformation T′ feeds stages labeled "Normalize power within each orientation" and "Scale by inverse of filter center frequency," followed by the inverse transformation T′⁻¹.]
Figure 1: A shape-from-shading mechanism. A transformation T′ produces localized measurements of sine and cosine phase frequency content, and then the inverse transformation is applied, switching sine and cosine phase amplitudes and scaling the filter amplitude in proportion to the central frequency. The output of this process is the recovered surface shape.

Such a transformation is widely believed to occur between the retina and striate cortex. Note that these filters are all centered on the same image location, and have the same overall response envelope. I will further assume that these filters exist in quadrature pairs, so that the local phase information is available. Exactly this sort of filter mechanism is central to many recent psychological theories (Adelson and Bergen 1985; Daugman 1980; Adelson et al. 1987). I will for purposes of discursive clarity assume that these filters are sine and cosine phase Gabor filters, because the output of such a set of Gabor filters is exactly the Fourier transform of the image as seen through a Gaussian-shaped windowing function, and thus the above equations are directly applicable. In order to recover surface shape from this filter set, the transformations indicated in equation (3.2) must be performed. These
transformations are (1) phase-shift the filter responses by π/2, (2) scale the filter amplitude by 1/f, where f is the filter's central spatial frequency, (3) bias the filters to remove variation due to illumination direction, and (4) reconstruct a depth surface from the scaled amplitudes of the filter set. The final step, reconstruction, can be accomplished by a process nearly identical to that by which one would reconstruct the original signal, that is, by summing Gabor filter basis functions with an amplitude proportional to the Gabor filter's activity. This reconstruction process is indicated by the transformation T′⁻¹ in figure 1; it is the inverse of what is believed to happen between the retina and striate cortex. The only differences between the shape recovery process and simple reconstruction are: (1) The roles of the sine and cosine phase filters are switched, so that cosine phase functions are added together with amplitude proportional to the response of the sine phase filters, and vice versa; this accomplishes a π/2 phase shift. (2) Each filter's amplitude is reduced in proportion to its central frequency, thus accomplishing the 1/f frequency scaling. (3) The average filter amplitude is normalized within each orientation; this normalization removes the directional biasing effects of the illuminant. The result of this scaled summation will be the estimated surface shape within the windowed area of the image, that is, within the "receptive field" covered by the filters. If this theory were construed as applying to near parafoveal receptive fields of the Macaque monkey, then the patch over which shape is recovered would cover an area roughly the size of the monkey's fully extended hand. It is possible to smoothly link adjacent patches together to produce an extended surface; however, a better alternative is to combine this small-scale shading information with large-scale structure that can come from other, more suitable sources such as stereo or motion.

Relation to Biology: In an actual biological implementation, use of a Gabor filter transform T′ would be a poor choice because T′ ≠ T′⁻¹, so that two separate sets of filters are required, and because for Gabor filters calculation of T′⁻¹ requires a level of numerical accuracy inappropriate for biological mechanisms. In a biological mechanism one would expect an orthonormal filter set (which can be visually similar to Gabor filters) so that T′ = T′⁻¹ and so that precise calculations are not required (Adelson et al. 1987). Further, the use of exactly sine and cosine phase filter pairs is unnecessary, as all the phase information is available from any two filters of different phase. When using filters of arbitrary phase one feeds a weighted average of the two input filters to the reconstruction filters rather than feeding the sine/cosine phase input filters exclusively to the cosine/sine phase reconstruction filters.
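The quadrature-swap step can be illustrated in one dimension, where orientation plays no role (so the normalization step is omitted). The sketch below analyzes a signal with cosine/sine-phase filter pairs, then reconstructs with the pair roles exchanged and amplitudes scaled by 1/(2πf); the filter center frequencies are illustrative choices.

```python
import numpy as np

n = 512
t = np.arange(n)
signal = np.cos(2 * np.pi * 8 * t / n) + 0.5 * np.cos(2 * np.pi * 20 * t / n)

recon = np.zeros(n)
for k in (8, 20):                        # filter center frequencies (cycles/window)
    f = k / n
    cos_k = np.cos(2 * np.pi * f * t)    # cosine-phase filter
    sin_k = np.sin(2 * np.pi * f * t)    # sine-phase (quadrature) filter
    a_c = 2.0 / n * signal @ cos_k       # cosine-phase response
    a_s = 2.0 / n * signal @ sin_k       # sine-phase response
    # Swap roles at reconstruction (the sine response drives the cosine basis
    # and vice versa, with a sign): this is the pi/2 phase shift. Dividing by
    # 2*pi*f is the frequency scaling.
    recon += (a_s * cos_k - a_c * sin_k) / (2 * np.pi * f)

# Compare against the analytically phase-shifted, 1/(2*pi*f)-scaled signal.
expected = -(np.sin(2 * np.pi * 8 * t / n) / (2 * np.pi * 8 / n)
             + 0.5 * np.sin(2 * np.pi * 20 * t / n) / (2 * np.pi * 20 / n))
print("max error:", np.max(np.abs(recon - expected)))   # ~1e-13
```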
3.2 Surface Recovery Results. Having developed a theory and mechanism, it is time to evaluate the performance of that mechanism on real-world problems. This section and the following section present several examples of using this approach to recover shape from shading.
Figure 2: (a) An image of a mountainous region outside of Phoenix, Arizona, (b) a perspective view of a stereo-derived surface shape, and (c) a perspective view of the surface shape derived from image shading, (d) an image widely used in image compression research, (e) a perspective view of the recovered surface in the neighborhood of the woman's face, (f) two shaded views of the recovered surface; note the accurate recovery of eye, cheek, lip, nostril, and nose arch shape.
In these examples the illuminant direction was estimated from the Fourier transform of the image, as described in (Pentland 1988). Figure 2a shows a high-altitude image of a mountainous region outside of Phoenix, Arizona. The Defense Mapping Agency has created an elevation map of this region using their interactive stereo system; figure 2b shows a perspective view of this stereo-recovered surface. Figure 2c shows a perspective view of the surface shape derived from the shading information in figure 2a. Comparison of the stereo-derived surface (Fig. 2b) to the shading-derived surface (Fig. 2c) demonstrates that the recovery of shape from shading in this example is quite accurate. The major defect of the recovered surface shape is a low-frequency bowing of the entire surface, which appears to stem from slow variations in average surface reflectance and illumination direction. A second example of shape recovery is shown in figures 2d and 2e. Figure 2d shows an image widely used in image compression research. Figure 2e shows a perspective view of the recovered surface in the neighborhood of the face. Figure 2f shows shaded versions of the recovered surface from viewpoints 30 and 45 degrees from the original image's viewpoint. The shape of the eyes, cheeks, lips, the nose arch, and nostrils are all recovered accurately.

3.2.1 Line Drawings. The role shading information has in interpreting line drawings has long been debated. To investigate this issue we applied our shape-from-shading technique to several line drawings. One example, shown in figure 3a, is a line drawing of a famous underground cartoon character named Zippy. Surface shape for a digitized version of this drawing was estimated using our shape-from-shading mechanism. A perspective view of the face region of the recovered surface is shown in figure 3b. Figure 3c shows shaded versions of the recovered surface from several points of view. As can be seen, the shape of the forehead, eyes, cheeks, shirt collar, and mouth are all recovered in a way that agrees closely with our perceptions.

4 Summary
I have shown that a simple neural mechanism based on the assumption of a linear model of the surface reflectance function allows shape information to be extracted from image shading in a robust and accurate manner. The filters used in this neural mechanism are similar to those that are believed to occur in biological visual systems, and the mechanism itself is a simple modification of decomposition-reconstruction networks proposed for other types of biological visual processing (Adelson and Bergen 1985; Daugman 1980; Adelson et al. 1987).
Figure 3: (a) A line drawing of a famous cartoon character, (b) a perspective view of the surface shape recovered by a shading analysis covering the face region, (c) several shaded views of the recovered surface.

This shape-from-shading mechanism fails to give a good answer when the illumination is from behind the viewer, and when either the surface reflectance or the illumination changes sharply. However, when the
illumination is from behind the viewer (so that the mechanism proposed here fails) then image brightness is closely correlated with surface orientation, so that very simple rules can be used to interpret the image shading (Pentland 1984). In either case, practical extraction of shape information requires preprocessing to segment the image into regions of constant reflectance and illumination. Perhaps the most surprising aspect of this shape-estimation mechanism is its performance when applied to line drawings. In the cases examined so far, it appears that a substantial amount - and in some cases most - of the 3-D information may be recovered by this simple shading analysis.
Acknowledgments This research was made possible by National Science Foundation, Grant No. DAAL 03-87-K-0005. I wish to thank Berthold Horn and Ted Adelson for their comments and insights.
References

Adelson, E. and J. Bergen. 1985. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284-299.
Adelson, E.H., E. Simoncelli, and R. Hingorani. 1987. Orthogonal pyramid transforms for image coding. SPIE Proceedings on Visual Communications and Image Processing II, SPIE 845, 50-58, Oct. 27-29, 1987, Boston, MA.
Daugman, J. 1980. Two-dimensional analysis of cortical receptive field profiles. Vision Research, 20, 846-856.
Horn, B.K.P. 1975. Obtaining shape from shading information. In: The Psychology of Computer Vision, ed. P.H. Winston. McGraw-Hill.
Pentland, A. 1988. Shape from shading: A theory of human perception. Proceedings of the International Conference on Computer Vision, IEEE Society, Dec. 5-8, Tarpon Springs, FL.
Pentland, A. 1984. Local analysis of the image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 170-187.
Received 10 November; accepted 14 November 1988.
Communicated by Geoffrey Hinton
Optimization in Model Matching and Perceptual Organization Eric Mjolsness Department of Computer Science, Yale University, New Haven, CT 06520, USA
Gene Gindi Department of Electrical Engineering, Yale University, New Haven, CT 06520, USA
P. Anandan Department of Computer Science, Yale University, New Haven, CT 06520, USA
We introduce an optimization approach for solving problems in computer vision that involve multiple levels of abstraction. Our objective functions include compositional and specialization hierarchies. We cast vision problems as inexact graph matching problems, formulate graph matching in terms of constrained optimization, and use analog neural networks to perform the optimization. The method is applicable to perceptual grouping and model matching. Preliminary experimental results are shown.

1 Introduction

The minimization of objective functions is an attractive way to formulate and solve visual recognition problems. Such formulations are parsimonious, being expressible in several lines of algebra, and may be converted into artificial neural networks which perform the optimization. Advantages of such networks, including speed, parallelism, cheap analog computing, and biological plausibility, have been noted (Hopfield and Tank 1985). According to a common view of computational vision, recognition involves the construction of abstract descriptions of data governed by a database of models. Abstractions serve as reduced descriptions of complex data useful for reasoning about the objects and events in the scene. The models indicate what objects and properties may be expected in the scene. The complexity of visual recognition demands that the models be organized into compositional hierarchies which express object-part
relationships and specialization hierarchies which express object-class relationships. In this paper, we describe a methodology for expressing model-based visual recognition as the constrained minimization of an objective function. Model-specific objective functions are used to govern the dynamic grouping of image elements into recognizable wholes. Neural networks are used to carry out the minimization. Previous work on optimization in vision (Barrow and Popplestone 1971; Burr 1983; Hummel and Zucker 1983; Terzopoulos 1986) has typically been restricted to computations occurring at a single level of abstraction and/or involving a single model. For example, surface interpolation schemes, even when they include discontinuities (Terzopoulos 1986), do not include explicit models for physical objects whose surface characteristics determine the expected degree of smoothness. By contrast, heterogeneous and hierarchical model-bases often occur in nonoptimization approaches to visual recognition (Hanson and Riseman 1986), including some which use neural networks (Ballard 1986). We attempt to obtain greater expressibility and efficiency by incorporating hierarchies of abstraction into the optimization paradigm.

2 Casting Model Matching as Optimization
We consider a type of objective function which, when minimized by a neural network, is capable of expressing many of the ideas found in frame systems in Artificial Intelligence (Minsky 1975). These "Frameville" objective functions (Mjolsness et al. 1988) are particularly well suited to applications in model-based vision, with frames acting as few-parameter abstractions of visual objects or perceptual groupings thereof. Each frame contains real-valued parameters, pointers to other frames, and pointers to predefined models (for example, models of objects in the world) which determine what portion of the objective function acts upon a given frame.

2.1 Model Matching as Graph Matching. Model matching involves finding a match between a set of frames, ultimately derived from visual data, and the predefined static models. A set of pointers represents object-part relationships between frames, and is encoded as a graph or sparse matrix called ina. That is, ina_{ij} = 0 unless frame j is "in" frame i as one of its parts, in which case ina_{ij} = 1 is a "pointer" from j to i. The expected object-part relationships between the corresponding models are encoded as a fixed graph or sparse matrix INA. A form of inexact graph matching is required: ina should follow INA as much as is consistent with the data. A sparse match matrix M (0 ≤ M_{αi} ≤ 1) of dynamic variables represents the correspondence between model α and frame i. To find the best match between the two graphs one can minimize a simple objective function for this match matrix, due to Hopfield (1984) (also Feldman et
al. 1988; von der Malsburg and Bienenstock 1986), which just counts the number of consistent rectangles (see Fig. 1a):

\[ E(M) = -\sum_{\alpha\beta ij} INA_{\alpha\beta}\, M_{\alpha i}\, ina_{ij}\, M_{\beta j} \tag{2.1} \]

This expression may be understood as follows: for model α and frame i, the match value M_{αi} is to be increased if the neighbors of α (in the INA graph) match to the neighbors of i (in the ina graph). Note that E(M) as defined above can be trivially minimized by setting all the elements of the match matrix to unity. However, to do so will violate additional syntactic constraints of the form h(M) = 0 which are imposed on the optimization, either exactly (Platt and Barr 1988) or as penalty terms ½h²(M) (Hopfield and Tank 1985) added to the objective function. Originally the syntactic constraints simply meant that each frame should match one model and vice versa, as in (Hopfield and Tank 1985). But in Frameville, a frame can match both a model and one of its specializations (described later), and a single model can match any number of instances or frames. In addition one can usually formulate constraints stating that if a model matches a frame then two distinct parts of the same model must match two distinct part frames and vice versa. We have found the formulation of equations 2.2 and 2.3 to be useful,
where the first sum in each equation is necessary when several high-level models (or frames) share a part. (It turns out that the first sums can be forced to zero or one by other constraints.) The resulting competition is illustrated in figure 1b. Another constraint is that M should be binary-valued, i.e.,

\[ M_{\alpha i}(1 - M_{\alpha i}) = 0, \tag{2.4} \]
but this constraint can also be handled by a special "analog gain" term in the objective function (Hopfield and Tank 1985) together with a penalty term c Σ_{αi} M_{αi}(1 − M_{αi}). In Frameville, the ina graph actually becomes variable, and is determined by a dynamic grouping or "perceptual organization" process. These new variables require new constraints, starting with ina_{ij}(1 − ina_{ij}) = 0, and including many high-level constraints which we now formulate.

2.2 Frames and Objective Functions. Frames can be considered as bundles of real-valued parameters F_{ip}, where p indexes the different parameters of a frame. For efficiency in computing complex arithmetic
Figure 1: (a) Examples of the Frameville rectangle rule. Shows the rectangle relationship between frames (triangles) representing a wing of a plane and the plane itself. Circles denote dynamic variables, ovals denote models, and triangles denote frames. For the plane and wing models, the first few parameters of a frame are interpreted as position, length, and orientation. (b) Frameville sibling competition among parts. The match variables along the shaded lines (M_{3,9} and M_{2,7}) are suppressed in favor of those along the solid lines (M_{2,9} and M_{3,7}).
relationships, such as those involved in coordinate transformations, an analog representation of these parameters is used. A frame contains no information concerning its match criteria or control flow; instead, the match criteria are expressed as objective functions and the control flow is determined by the particular choice of a minimization technique. In figure 1a, in order for the rectangle (1, 4, 9, 2) to be consistent, the parameters F_{4p} and F_{9p} should satisfy a criterion dictated by models 1 and 2, such as a restriction on the difference in angles appropriate for a mildly swept-back wing. Such a constraint results in the addition of the following term to the objective function:

\[ \sum_{\alpha\beta ij} INA_{\alpha\beta}\, M_{\alpha i}\, ina_{ij}\, M_{\beta j}\, H^{\alpha\beta}(F_i, F_j) \tag{2.5} \]

where H^{αβ}(F_i, F_j) measures the deviation of the parameters of the data frames from that demanded by the models. The term H can express coordinate transformation arithmetic (for example, H^{αβ}(x_i, x_j) = ½[x_i − x_j − Δx_{αβ}]²), and its action on a frame is selectively controlled or "gated" by M and ina variables. This is a fundamental extension of the distance metric paradigm in pattern recognition; because of the complexity of the visual world, we use an entire database of distance metrics H^{αβ}. We index the models (and, indirectly, the database of H metrics) by introducing a static graph of pointers ISA_{αβ} to act as both a specialization hierarchy and a discrimination network for visual recognition. A frame may simultaneously match to a model and just one of its specializations: (2.6) As a result, ISA siblings compete for matches to a given frame (see Figure 2); this competition allows the network to act as a discrimination tree. Frameville networks have great expressive power, but have a potentially serious problem with cost: for n data frames and m models there may be O(nm + n²) neurons, widely interconnected but sparsely activated. The number of connections is at most the number of monomials in the polynomial objective function, namely n²mf, where f is the fan-out of the INA graph. One solution to the cost problem, used in the line grouping experiments reported in section 3.2, is to restrict the flexibility of the frame system by setting most M and ina neurons to zero permanently. The few remaining variables can form an efficient data structure such as a pyramid in vision. A more flexible solution might enforce the sparseness constraints on the M and ina neurons during minimization, as well as at the fixed point. Then large savings could result from using "virtual" neurons (and connections) which are created and destroyed dynamically. This and other cost-cutting methods are a subject of continuing research.
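A minimal sketch of the two gated energy terms, combining the rectangle rule (2.1) with a parameter-check term in the form of (2.5); the array shapes and the toy offset metric are illustrative, since the paper's H^{αβ} metrics are model-specific.

```python
import numpy as np

def frameville_energy(M, ina, INA, F, delta):
    """M[a, i]: model-to-frame match variables; ina[i, j]: frame part-of graph;
    INA[a, b]: model part-of graph; F[i]: frame parameter vector;
    delta[a][b]: expected parameter offset between model a and its part b."""
    n_models, n_frames = M.shape
    E = 0.0
    for a in range(n_models):
        for b in range(n_models):
            if not INA[a, b]:
                continue
            for i in range(n_frames):
                for j in range(n_frames):
                    gate = M[a, i] * ina[i, j] * M[b, j]   # consistency rectangle
                    E -= gate                              # reward each rectangle (2.1)
                    # Toy H^{ab}: squared deviation from the expected offset (2.5).
                    H = 0.5 * np.sum((F[i] - F[j] - delta[a][b]) ** 2)
                    E += gate * H
    return E
```

Gradient descent on such an energy in the analog variables M, ina, and F, subject to the syntactic constraints above, is the computation the Frameville networks perform.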
Figure 2: Frameville specialization hierarchy. The plane model specializes along ISA links to a propeller plane or a jet plane, and correspondingly the wing model specializes to prop-wing or jet-wing. Sibling match variables M_{6,4} and M_{4,4} compete, as do M_{7,9} and M_{5,9}. The winner in these competitions is determined by the consistency of the appropriate rectangles; for example, if the 4-4-9-5 rectangle is more consistent than the 6-4-9-7 rectangle, then the jet model is favored over the prop model.
3 Experimental Results

3.1 Recognizing Simple Shapes. Frameville experiments were conducted in a domain consisting of a two-level compositional hierarchy. As
seen in Figure 3a, the input data at the lowest level are unit-length line segments parameterized by location x, y and orientation θ, corresponding to frame parameters F_{ip} (p = 1, 2, 3). We allow only horizontal (θ = 0) and vertical (θ = π/2) orientations. There are two high-level models, "T" and "L" junctions, each composed of three low-level segments. The task is to recognize instances of "T", "L", and their parts, in a translation-invariant manner. The high-level models are abstracted by the parameters of a designated main part, in this case, the upper vertical segment of each model. On the model side, there are seven low-level models indexed by β, as shown in Figure 3b. These correspond to seven positional roles that a segment may assume in the context of a composite figure. These positions are illustrated iconically inside the model nodes in Figure 3b and correspond to the positions of segments in the familiar seven-segment LED display. The high-level models "T" and "L", indexed by α, are then specified by the appropriate set of INA links. We distinguish between high-level frames, indexed by i, that may match only high-level junction models, and low-level frames, indexed by j, that may match only low-level segment models. For this domain, the parameter check term H^{αβ} of Equation 2.5 checks the location and orientation of a given part relative to the main part. For example, in recognizing a "T", if low-level frame 3 is matched to model 5, a "middle horizontal" segment, then its parameters (F_{3,1}, F_{3,2}, F_{3,3} = x_3, y_3, θ_3) must differ from those of an "upper vertical" mainpart by quantities +1, -1, and π/2, respectively. In our design the parameters of a high-level frame represent a best fit to the parameters of its mainpart. So if high-level frame 7 is matched to model 9, a "T", then an appropriate parameter check term is:
\[ H^{9,5}(F_7, F_3) = (F_{7,1} - F_{3,1} - 1)^2 + (F_{7,2} - F_{3,2} + 1)^2 + (F_{7,3} - F_{3,3} - \pi/2)^2. \tag{3.1} \]
The quantities +1, -1, and π/2 are thus model information stored in the objective function. This kind of objective function also determines the best-fit high-level parameters F_{ip}, even if the low-level mainpart frame itself is missing. Note here that a limited form of invariance is achieved by analog computation of relative coordinates; instances of "T" and "L" are recognized in a manner invariant to translation. (Rotation invariance can also be formulated if a different parameterization is used, but no experiments have been done.) We used the unconstrained optimization technique in (Hopfield and Tank 1985) to minimize the objective function. We achieved improved results by including terms demanding that at most one model match a given frame, and that at most one high-level frame include a given low-level frame as its part. These are expressed as additive penalty terms.
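A small sketch of the check (3.1) and its translation invariance; the helper name and the sample coordinates are hypothetical.

```python
import numpy as np

def h_9_5(F7, F3):
    # Equation (3.1): frame parameters are (x, y, theta); the offsets
    # +1, -1, pi/2 are the model information quoted in the text.
    return ((F7[0] - F3[0] - 1.0) ** 2
            + (F7[1] - F3[1] + 1.0) ** 2
            + (F7[2] - F3[2] - np.pi / 2) ** 2)

F7 = np.array([4.0, 5.0, np.pi / 2])             # "T" mainpart: upper vertical
F3 = np.array([F7[0] - 1.0, F7[1] + 1.0, 0.0])   # middle horizontal, correctly placed
shift = np.array([2.0, 3.0, 0.0])                # translate the whole figure
print(h_9_5(F7, F3))                   # 0.0: perfect fit
print(h_9_5(F7 + shift, F3 + shift))   # 0.0: the check is translation invariant
```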
Figure 3: (a) Input data consists of unit-length segments oriented horizontally or vertically. The task is translation-invariant recognition of three segments forming a "T" junction (for example, sticks 1, 2, 3) or an "L" (for example, sticks 5, 6, 7) amid extraneous noise sticks. (b) Structure of network. Models occur at two levels. INA links are shown for a "T". Each frame has three parameters: position x, y and orientation θ. Also shown are some match and ina links. The bold lines highlight a possible consistency rectangle. (c) Experimental result. The value of each dynamical variable is displayed as the relative area of the shaded portion of a circle. Matrix M_{jβ} indicates low-level matches and M_{iα} indicates high-level matches. Grouping of low-level to high-level frames is indicated by the ina matrix. The parameters of the high-level frames are displayed in the matrix F_{ip} of linear analog neurons. (The parameters of the low-level frames, held fixed, are not displayed.) The few neurons circumscribed by a square, corresponding to correct matches for the main parts of each model, are clamped to a value near unity. Shaded circles indicate the final correct state.
In addition, we did not include the binary-value constraint (Equation 2.4). The linear analog neurons representing parameters in frames were not sigmoidally mapped as in (Hopfield and Tank 1985). Figure 3c shows results of attempts to recognize the two junctions in Figure 3a. When initialized to small random values, the network becomes trapped in unfavorable local minima of the fifth-order objective function. (With only a single high-level model in the database, the system recognizes a shape amid noise given a random start.) If, however, the network is given a "hint" in the form of an initial state with mainparts and high-level matches set correctly, the network converges to the correct stable state. In particular, the linear parameter neurons settle to correct analog values corresponding to position and orientation of the mainparts of the junctions. Also, the proper dynamic grouping is accomplished as the ina neurons achieve the correct values, and the segment frames match the proper low-level models. Extraneous "noise" sticks remain unmatched. There is a great deal of unexploited freedom in the design of the model base and its objective functions; there may be good design disciplines which avoid introducing spurious local minima. For example, it may be possible to use ISA and INA hierarchies to guide a network to the desired local minimum.

3.2 Line Grouping. Frameville is also being applied to the problem of extracting long straight lines from an image by recursively grouping smaller line segments into longer lines. The model base for the initial experiments shown in Figure 4a consists of lines at two levels, denoted 0 and 1. The level-1 line is composed of a left-line and a right-line at level 0, which are specializations of the level-0 line. We have conducted experiments on problems involving a 3 x 3 grid of level-0 frames and a 2 x 2 grid of level-1 frames. Each level-1 frame is connected to four level-0 frames that are near its spatial location, thus forming an overlapped pyramid. The end points of level-0 lines are specified as input data. Each level-1 line is denoted by four points, which correspond to the projections of the end points of its two component level-0 lines. The points on the level-1 line are determined by minimizing the energies of the springs shown in Figure 4b. The level-1 line frames contain additional slots that are used in the verification of colinearity of the four points on the line. The results in Figure 4 were obtained by using the syntactic constraints of Equations 2.2 and 2.3 as penalty terms, while exactly maintaining the binary-value constraint of Equation 2.4 (both on the M and the ina variables). The constrained optimization method described in (Platt and Barr 1988) was used. The distance metric H in Equation 2.5 measures the energies of the springs shown in Figure 4b. To achieve stability of the network, Equation 2.5 was modified by replacing the M and ina variables by their squares. The details of our model-base and the constraints as
Figure 4: (a) The line model base. The thin lines connecting models represent ISA links and the thick lines represent the INA links. (b) The spring-stick model of fit. The springs between the level-0 lines and the level-1 line favor colinearity, while the spring between the two intermediate points on the longer line favors spatial proximity of the component lines. (c) and (d) Experimental results. The input data consists of the four vertical line segments, and the approximating level-1 lines are displayed as three segments connecting the four points Z_1, ..., Z_4 as seen in (b). In (c) the network is at an intermediate stage, where two of the segments have been correctly grouped, while the other two line segments appear to become parts of different high-level frames. In (d) the network has moved away from the incorrect solution and is close to the correct solution. The small extra line segment eventually vanishes.
well as planned extensions of this work are described in (Anandan et al. 1989).

4 Conclusion

Frameville provides opportunities for integrating all levels of vision in a uniform notation which yields analog neural networks. Low-level models such as fixed convolution filters just require analog arithmetic for frame parameters, which is provided. High-level vision typically requires structural matching, also provided. Qualitatively different models may be integrated by specifying their interactions, H^{αβ}.
Acknowledgments We wish to thank Joachim Utans, John Ockerbloom, and Charles Garrett for the Frameville simulations. This work was supported in part by AFOSR grant F49620-88-C-0025, by DARPA grant DAAA15-87-K-0001, and by ONR grant N00014-86-0310.
References

Anandan, P., E. Mjolsness, and G. Gindi. 1989. Low-level visual grouping via optimization in neural networks. Technical Report, Yale University Computer Science Department. Manuscript in preparation.
Ballard, D. 1986. Cortical connections and parallel processing: Structure and function. Behavioral and Brain Sciences, 9, 67-120.
Barrow, H.G. and R.J. Popplestone. 1971. Relational descriptions in picture processing. In: Machine Intelligence, 6, ed. D. Mitchie. Edinburgh University Press.
Burr, D.J. 1983. Matching elastic templates. In: Proceedings of the International Symposium on Physical and Biological Processing of Images, eds. O.J. Braddick and A.C. Sleigh. Springer-Verlag.
Feldman, J.A., M.A. Fanty, and N.H. Goddard. 1988. Computing with structured neural networks. IEEE Computer, 21:3, 91-103.
Hopfield, J.J. 1984. Personal communication.
Hopfield, J.J. and D.W. Tank. 1985. Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152.
Hummel, R.A. and S.W. Zucker. 1983. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 267-287.
Minsky, M.L. 1975. A framework for representing knowledge. In: The Psychology of Computer Vision, ed. P.H. Winston, 211-277. McGraw-Hill.
Mjolsness, E., G. Gindi, and P. Anandan. 1988. Optimization in model matching and perceptual organization: A first look. Technical Report YALEU/DCS/RR-634, Yale University.
Platt, J.C. and A.H. Barr. 1988. Constraint methods for flexible models. Computer Graphics, 22:4, 279-288. Proceedings of SIGGRAPH '88.
Riseman, E.M. and A.R. Hanson. 1986. A methodology for the development of general knowledge-based vision systems. In: Vision, Brain, and Cooperative Computation, eds. M.A. Arbib and A.R. Hanson, 285-328. MIT Press.
Terzopoulos, D. 1986. Regularization of inverse problems involving discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8, 413-424.
von der Malsburg, C. and E. Bienenstock. 1986. Statistical coding and short-term synaptic plasticity: A scheme for knowledge representation in the brain. In: Disordered Systems and Biological Organization, 247-252. Springer-Verlag.
Received 1 October 1988; accepted 17 October 1988.
Communicated by Richard Andersen
Distributed Parallel Processing in the Vestibulo-Oculomotor System Thomas J. Anastasio Vestibular Laboratory, University of Southern California, Los Angeles, CA 90033, USA
David A. Robinson Departments of Ophthalmology and Biomedical Engineering, The Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
The mechanisms of eye-movement control are among the best understood in motor neurophysiology. Detailed anatomical and physiological data have paved the way for theoretical models that have unified existing knowledge and suggested further experiments. These models have generally taken the form of black-box diagrams (for example, Robinson 1981) representing the flow of hypothetical signals between idealized signal-processing blocks. They approximate overall oculomotor behavior but indicate little about how real eye-movement signals would be carried and processed by real neural networks. Neurons that combine and transmit oculomotor signals, such as those in the vestibular nucleus (VN), actually do so in a diverse, seemingly random way that would be impossible to predict from a block diagram. The purpose of this study is to use a neural-network learning scheme (Rumelhart et al. 1986) to construct parallel, distributed models of the vestibulo-oculomotor system that simulate the diversity of responses recorded experimentally from VN neurons.

1 Introduction
The primary function of the VN is to relay head-velocity signals from the semicircular canals to extraocular-muscle motoneurons (Fig. 1). The purpose of this relay - the three-neuron-arc of the vestibulo-ocular reflex (VOR) - is to stabilize images on the retina by producing eye movements that compensate for head movements (Wilson and Melvill Jones 1979). The VOR operates in all three rotational dimensions. Since canal and muscle coordinate frames are non-orthogonal and rotated from one another, a sensory/motor transformation must occur (Pellionisz and Llinás 1980) by combining inputs from the various canal pairs. The three-dimensional behavior of any VOR neuron can be specified by its
activation-vector, defined as the axis of head (or eye) rotation for which its change of activity is a maximum. The activation-vectors of the canal inputs are specified by canal geometry, as are those of the motoneurons by the pulling directions of the muscles (Robinson 1982). In contrast, single-unit recordings have shown that the activation-vectors of VN neurons are dispersed in various directions and do not align with either the canal or muscle vectors (Baker et al. 1984). Thus, the transformation is distributed over the VN. In addition to vestibular signals, the VN also relays eye-velocity command signals from the smooth pursuit and saccadic oculomotor subsystems (Fig. 1). The pursuit system enables certain foveate animals, such as primates, to visually follow smoothly moving targets, while the saccadic system is used to rapidly change the direction of gaze from one target to another (Carpenter 1978). The pursuit and saccadic systems probably evolved on top of the basic, ubiquitous vestibulo-ocular reflex as the eyes developed foveas and moved to a forward-looking position. Saccades may have utilized the pre-existing quick-phase system, while pursuit possibly originated as an effort to suppress the basic vestibular reflex when not wanted. These later systems do, however, have some direct projections to motoneurons (not shown) as well as the VN-mediated connections. Thus, most vestibulo-oculomotor neurons (this includes cells in the nucleus prepositus hypoglossi and other areas in the caudal pons - but we will use VN for short) carry parts of all three commands. The change in discharge rate divided by the change in eye-velocity command is called the gain of each cell for each function: vestibular (V), pursuit (P), and saccadic (S). The three systems intermix on the cells of the VN, on which can be found seemingly any possible combination of V, P, and S (Chubb et al. 1984; Fuchs and Kimm 1975; Miles 1974; Tomlinson and Robinson 1984). Thus, the three systems are diffusely represented and, again, their signals are distributed over this neuronal pool. One job of the VN neural network is to ensure that eye velocity, represented by the activity of motoneurons, is actually that called for by the three input velocity commands. The network can do this job by adjusting synaptic weights, according to some optimization procedure such as error-driven learning. It will be appreciated at once that the vestibulo-oculomotor tasks to be learned are simple. Consequently, the purpose of this study is not to illustrate the sophistication of learning network models where the neurophysiological basis for the input, hidden, or output layers, and often all three, is incomplete or nonexistent. We have the opposite problem: the task to be performed is simple, but twenty years of research has provided a solid basis for the behavior of cells in all three layers. Thus, this study presents learning network models of simple functions that simulate known, but heretofore unexplained, neurophysiological properties of real VN neurons.
Figure 1: Schematic skeleton of the vestibulo-oculomotor system. Inputs to the vestibular nucleus (VN) include afferents from the six canals and projections from the pursuit (lp, rp) and saccadic (ls, rs) oculomotor subsystems. Outputs from the VN project to the extraocular muscle motoneurons, of which are shown the six to the left eye. The initially random, divergent input projections onto VN neurons are depicted. The subsequent convergence of VN projections onto motoneurons is not shown, for clarity. It is proposed that the cells in the VN act as a hidden layer in which the three eye-velocity command signals are distributed and, via modifiable synapses, maintain correct eye-velocity responses. Some of the major connections of the horizontal VOR are shown by heavy lines. The VN neuron that receives an excitatory connection from the rhc makes an excitatory connection to the motoneuron of the lr. The excitatory relay from the lhc to the VN and on to the motoneuron of the mr occurs through an internuclear interneuron connecting the abducens and oculomotor nuclei. The principal connections of the vertical VOR are also shown. A feedforward, inhibitory commissural system (cs) and some inhibitory VN neurons are shown by filled cells. Canal and muscle geometry for humans is taken from Robinson (1982). III, IV, VI: oculomotor, trochlear, and abducens nuclei; sr, ir, lr, mr, so, io: left superior, inferior, lateral, and medial recti, and superior and inferior oblique muscles or motoneurons; lac, lpc, lhc, rac, rpc, rhc: left or right anterior, posterior, and horizontal semicircular canal or primary afferent.
2 Neural Network Models
The vestibulo-oculomotor system was modelled as a network with three layers. The input layer carries the three eye-velocity commands into the network, the output layer represents the motoneurons of the muscles of the left eye, while units in the middle (hidden) layer represent VN neurons. The activity of each model unit, interpreted as its firing rate, is a sigmoidal function of the weighted sum of its inputs. Units respond almost linearly to midrange inputs but can be driven to cut-off or saturation with large negative or positive inputs, respectively. All input units projected to all hidden units. All hidden units projected to all output units and there were no direct input-to-output connections unless otherwise stated. There were also no intra-layer or feedback connections. Each input and output is represented as a single pair of units; only in the hidden layer, which can contain up to 40 units, is information represented in a distributed manner. The activities of vestibular and pursuit input pairs are modulated in push-pull within the linear range of the sigmoidal transfer function (0.4 - 0.6) about a spontaneous (firing) rate (SR) of 0.5, while the burst (saccadic) input pair is modulated over the entire range (0.0 - 1.0) and is spontaneously silent (SR = 0.0). All modifiable connections are randomized prior to training (range: -1.0 to 1.0, uniform distribution) and the networks are trained using backpropagation learning (Rumelhart et al. 1986). Briefly, the pattern of current outputs is compared to the desired outputs to form an error. This error is effectively propagated backward through the network to generate error patterns at each layer which are then used to modify the weights of the interlayer projections. The process is repeated for all inputs until, after many iterations, the errors for all outputs are less than a tolerance, in our case, 0.01. 3 The Simplest Arrangement
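A minimal sketch of this training setup: a 2-2-2 sigmoid network (the simplest arrangement of the next section) trained by batch backpropagation on the vestibular input/output pairs of Table 1 until every output error is below the 0.01 tolerance. The learning rate is an illustrative choice, and bias terms are omitted, neither being specified in detail here.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

# Rows: head still, head left, head right (Table 1).
X = np.array([[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])   # inputs: lhc, rhc
Y = np.array([[0.5, 0.5], [0.4, 0.6], [0.6, 0.4]])   # targets: lr, mr

W1 = rng.uniform(-1.0, 1.0, (2, 2))   # input-to-hidden weights
W2 = rng.uniform(-1.0, 1.0, (2, 2))   # hidden-to-output weights
rate = 2.0                            # illustrative learning rate

for cycle in range(100000):
    h = sig(X @ W1)                   # hidden (VN) unit activities
    y = sig(h @ W2)                   # motoneuron activities
    err = Y - y
    if np.max(np.abs(err)) < 0.01:    # the paper's tolerance
        print(f"converged after {cycle} cycles")
        break
    d_out = err * y * (1.0 - y)               # sigmoid derivative at outputs
    d_hid = (d_out @ W2.T) * h * (1.0 - h)    # backpropagated hidden error
    W2 += rate * h.T @ d_out
    W1 += rate * X.T @ d_hid
```

The number of cycles to convergence depends on the initialization and learning rate; the paper reports roughly 200 cycles for this task.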
The three-neuron-arc of the VOR (heavy lines, Fig. 1) serves as a backbone for the networks to be described. To understand the basic manner in which the network organizes itself, we began with the simplest arrangement; the horizontal VOR was modelled with only two neurons in each layer. The input units represent primary afferents from the left and right horizontal canals (lhc and rhc, respectively), and the output units represent the motoneurons of the left lateral and medial rectus muscles (lr and mr, respectively, Fig. 1). The network was trained on the input/output table given in table 1. This table specifies inputs and outputs consistent with a compensatory VOR. For example, head rotation to the left causes the firing-rate (representing head velocity) of the lhc to increase (to 0.6) and that of the rhc to decrease (to 0.4). The compensatory eye movement to the right would then be produced by an equal increase in the
firing-rate (representing eye velocity) of the mr (to 0.6) and a decrease in that of the lr (to 0.4). The opposite pattern applies for head rotation to the right. The network finished learning this simple task after about 200 cycles through the input/output table, and the values of the final weights are shown in table 2, where h1 and h2 are the hidden units. This arrangement, which is representative of every other run of this simulation, is similar in many ways to the actual circuitry of the VOR (Fig. 1). Hidden unit h1, which receives excitation from the lhc (1.63) and inhibition from the rhc (-1.21), is excitatory to the mr (1.84) and inhibitory to the lr (-1.71), while the opposite pattern is observed (with different values of weights) for hidden unit h2 (Table 2). These features loosely resemble the reciprocal innervation of motoneurons by VN neurons and the inhibitory commissural connections between the two vestibular nuclei (Wilson and Melvill Jones 1979). (Excitation and inhibition by the same model neurons simply implies the existence of an interneuron.) Thus, the learning algorithm clearly produces a set of connections consistent with the known organization of the VOR. The hidden and output unit responses for this run are given in table 3. The activities of the input units (lhc and rhc) are simply copied from table 1. The responses of the output units (lr and mr) match those desired (Table 1) to within the tolerance of 0.01, while the responses of the hidden units are more variable. For example, the SR of h1 (0.55) is its activity due to the head-still inputs (lhc = 0.5 and rhc = 0.5) via the weights in table 2. Similarly, the response of h1 to leftward head rotation (0.62) is its activity following forward propagation of the head-left input (lhc = 0.6 and rhc = 0.4). The SRs and vestibular gains, V, are presented in table 4. V is the change in activity from SR of a hidden or output unit divided by the change from 0.5 in the left input unit. For example, V for h1 would be: (0.62 - 0.55)/(0.6 - 0.5) = 0.70. (Gains P and S for units during pursuit and saccades, see below, are computed in an analogous way.) Although the learning algorithm produces realistic connections in networks having only two hidden units, the rules of reciprocal innervation are sometimes violated in networks having more. In larger networks, such nonreciprocally-connected units always have very small gains. In large networks having all connections modifiable, the learning algorithm appears to converge to a configuration wherein the VOR transformation is supported by a few reciprocally connected, high-gain units, after which the others, many nonreciprocally-connected, remain essentially unused. To prevent this development, the hidden-to-output connections were fixed in all subsequent examples. These fixed connections are arranged to provide reciprocal innervation of the motoneurons and all had the same absolute value. Its choice depended on the simulation being run.
              Input             Output
            lhc    rhc        lr     mr
head still  0.50   0.50       0.50   0.50
head left   0.60   0.40       0.40   0.60
head right  0.40   0.60       0.60   0.40

Table 1: Vestibular input/output table.
to        h1      h2            to        lr      mr
from                            from
lhc       1.63   -2.64          h1       -1.71    1.84
rhc      -1.21    2.25          h2        2.08   -2.23

Table 2: Final weights of simple VOR model.
    Input            Hidden           Output
  lhc    rhc       h1     h2        lr     mr
  0.50   0.50      0.55   0.45      0.50   0.50
  0.60   0.40      0.62   0.33      0.41   0.60
  0.40   0.60      0.48   0.57      0.59   0.40

Table 3: Hidden and output unit responses for simple VOR model.
4 Distributed Representations
To explore the representation of multiple signals in the VN, we began by considering the interaction of vestibular and pursuit signals only. This representation is simulated on networks having 40 hidden units. The fixed hidden-to-output weights have the value 0.35 and the following reciprocal pattern: hidden units 1-20 inhibit the lr and excite the mr, while the reverse holds for hidden units 21-40. The two input unit pairs represent vestibular inputs (from the lhc and rhc, as before) and left and right pursuit inputs (lp and rp, respectively, Fig. 1). The pursuit pair operates in the sense opposite to that of the vestibular pair, in that a left pursuit command (lp activated, rp inhibited) produces a left eye movement (lr activated, mr inhibited), and vice versa.
        Hidden            Output
        h1      h2        lr      mr
SR      0.55    0.45      0.50    0.50
V       0.70   -1.20     -0.90    1.00

Table 4: Gain, V, and SRs of hidden and output units in simple VOR model.
The vestibular-pursuit input/output table is given in table 5.

               Input                            Output
              lhc    rhc    lp     rp         lr     mr
head still    0.50   0.50   0.50   0.50       0.50   0.50
pursue left   0.50   0.50   0.60   0.40       0.60   0.40
pursue right  0.50   0.50   0.40   0.60       0.40   0.60
head left     0.60   0.40   0.50   0.50       0.40   0.60
head right    0.40   0.60   0.50   0.50       0.60   0.40

Table 5: Vestibular-pursuit input/output table.

The network learned this transformation after about 200 cycles. The vestibular and pursuit gains, V and P, of the hidden and output units were evaluated, and V versus P is plotted for each unit in a typical run in figure 2. The most striking feature of this scatter-plot is the variety of combinations with which hidden units can carry the pursuit and vestibular signals, just as observed experimentally (Chubb et al. 1984; Fuchs and Kimm 1975; Miles 1974; Tomlinson and Robinson 1984). To explore the interaction of vestibular, pursuit and saccadic eye-velocity commands, a third input pair was added to the 40 hidden unit network, representing left and right saccadic inputs (ls and rs, respectively, Fig. 1). The saccadic pair operates in the same sense as the pursuit pair. The hidden-to-output connections are the same as for the vestibular-pursuit network. In addition, to reflect known anatomy, there are also direct connections from the saccadic inputs to the outputs; these are also fixed in a reciprocal pattern and have the value of 2.5. The vestibular-pursuit-saccadic input/output table is given in table 6.

               Input                                          Output
              lhc    rhc    lp     rp     ls     rs         lr     mr
head still    0.50   0.50   0.50   0.50   0.00   0.00       0.50   0.50
pursue left   0.50   0.50   0.60   0.40   0.00   0.00       0.60   0.40
pursue right  0.50   0.50   0.40   0.60   0.00   0.00       0.40   0.60
head left     0.60   0.40   0.50   0.50   0.00   0.00       0.40   0.60
head right    0.40   0.60   0.50   0.50   0.00   0.00       0.60   0.40
saccade left  0.50   0.50   0.50   0.50   1.00   0.00       1.00   0.00
saccade right 0.50   0.50   0.50   0.50   0.00   1.00       0.00   1.00

Table 6: Vestibular-pursuit-saccadic input/output table.

The network learned this transformation after about 1,000 cycles. The V, P, and S gains of the hidden and output units were calculated. The V and P gains were distributed much as is shown in figure 2. Saccadic behavior consisted of bursts and pauses of activity. Many hidden units burst for saccades in one direction (almost always the same as that for which pursuit activity increased) and paused in the other direction.
Some units would burst or pause for saccades in one direction and do nothing in the other direction. Several burst for saccades in both directions and others paused for saccades in both directions. This diversity of behavior is not unlike that seen experimentally (Chubb et al. 1984; Fuchs and Kimm 1975; Miles 1974; Tomlinson and Robinson 1984). To explore the distributed representation of spatial information, the sensory/motor transformation between the vertical semicircular canals and the cyclovertical muscles of the left eye was modelled using a network having four input, four output and 40 hidden units. One pair of input units represented afferents from the right anterior (rac)-left posterior (lpc) canal pair and the other from the right posterior (rpc)-left anterior (lac) canal pair (Fig. 1). One pair of output units represented motoneurons of the left superior oblique (so)-inferior oblique (io) muscle pair and the other of the left superior rectus (sr)-inferior rectus (ir) muscle pair (Fig. 1). The hidden-to-output connections were fixed so that hidden units 1-10 and 21-30 reciprocally innervated so and io, and 11-20 and 31-40 reciprocally innervated sr and ir; their weights had the value 0.25. This vertical VOR network model was trained to produce input/output patterns that were consistent with compensatory eye movements for head rotations about eight axes spaced 45 degrees apart in the horizontal plane. The activity of each input pair was modulated according to the projection of each head-rotation vector onto the activation-vectors of the canal pairs, while that of each output pair corresponded to the muscle pair components calculated to produce the compensatory eye movement, using the angles for humans in figure 1. The network learned the transformation after about 500 cycles. Activation-vectors for the hidden and output units are plotted in figure 3. Those of the hidden units are dispersed in many directions and are qualitatively similar to those of cat VN neurons (Baker et al. 1984).
Figure 2: Plot of vestibular (V) versus pursuit (P) gains of hidden and output units in a 40 hidden unit, vestibular-pursuit neural network model. Because correctly wired units would have vestibular and pursuit gains of opposite signs, the polarity of the P axis is reversed for illustrative purposes. Hidden units (filled circles) falling on or near the horizontal or vertical axis can be considered as pure-pursuit or pure-vestibular units, respectively. The output units (+), having equal and opposite V and P values (by design), lie along the diagonal line. Hidden units falling on or near this line also have V and P values that are approximately equal and opposite. Most units are scattered throughout quadrants 1 and 3 and, having V and P of opposite sign, encode eye movements in the same direction. However, a small percentage of hidden units fall in quadrants 2 and 4 and have V and P of the same sign. Such seemingly anomalous units have been occasionally observed experimentally (Chubb et al. 1984; Fuchs and Kimm 1975; Miles 1974; Tomlinson and Robinson 1984).

5 Conclusions
This study indicates that a learning-network model provides a good explanation for the variability found in the neural organization of the vestibulo-oculomotor system in monkey and cat, and shows how neurons in the caudal pons combine and carry vestibulo-oculomotor signals in a distributed manner. Note that these models, however simple, are circuits equivalent to other multisynaptic, more realistic versions of the vestibulo-oculomotor system. No external teacher is needed, since the error is retinal image motion, which is well represented by signals in the brain stem and cerebellum. Further, the ability of retinal slip to produce plastic changes in the operation of the vestibulo-oculomotor system is well documented (Wilson and Melvill Jones 1979).
Figure 3: Activation-vectors of hidden and output units in a 40 hidden unit neural network model of the vertical VOR. The activation-vectors of the hidden and output units were determined by testing the mature network with the eight vectors of head-rotation used in training. A cosine function was fit to the eight responses of each unit; the magnitude and direction of the activation-vector were taken as the amplitude and phase of this cosine function. The canal activation vectors are lac, lpc, rac and rpc; those of the muscles (of the left eye) are o1, o2, o3 and o4. The latter do not coincide with the rotation axes of these four muscles, sr, so, ir and io, respectively, because the muscles form a skewed coordinate system (Robinson 1982). Note that the activation-vectors of the hidden units vary in magnitude and are dispersed in many directions. Activation-vector divergence is also characteristic of cat VN neurons (Baker et al. 1984).

To date, vestibulo-oculomotor modelling has been confined to the lumped, black-box approach. We hope that the results summarized here and presented in detail in two forthcoming articles (Anastasio and Robinson 1989; Anastasio and Robinson, in review) provide an alternative analysis technique that recognizes the distributed nature of vestibulo-oculomotor signals. Additionally, it is interesting to note that when every model had its weights rerandomized and was retrained, it reached a different solution. In such networks there is no unique solution. Thus, each run of the 40 hidden unit signal combination model ended with a different distribution of V, P, and S gains. Rather than forming discrete subpopulations, hidden units seemed to fall along a continuum with regard to the relative strengths of these gains. In the two-dimensional spatial
transformation model (Fig. 3), a 40-dimensional vector on the hidden units is collapsed to a two-dimensional motoneuron vector (sr - ir, so - io). This is sometimes referred to as the overcomplete problem since, in a closed form, there would be too few equations for too many variables to allow a unique solution. Our learning network is not bothered by this problem precisely because it does not seek a unique solution. Thus, the learning network approach suggests that, as far as the nervous system is concerned, the overcomplete problem may not be a problem at all.
Acknowledgments

This study was made possible by grants EY00598, EY01765, and EY05901, all from the National Eye Institute of the National Institutes of Health, Bethesda, Maryland, USA.
References

Anastasio, T.J. and D.A. Robinson. 1989. The distributed representation of vestibulo-oculomotor signals by brain-stem neurons. Biological Cybernetics, in press.
Anastasio, T.J. and D.A. Robinson. Sensorimotor coordinate transformations: Learning networks and tensor theory. In review.
Baker, J., J. Goldberg, G. Hermann, and B. Peterson. 1984. Optimal response planes and canal convergence in secondary neurons in vestibular nuclei of alert cats. Brain Research, 294, 133-137.
Carpenter, R.H.S. 1978. Movements of the Eyes. London: Pion Press.
Chubb, M.C., A.F. Fuchs, and C.A. Scudder. 1984. Neuron activity in monkey vestibular nuclei during vertical vestibular stimulation and eye movements. Journal of Neurophysiology, 52, 724-742.
Fuchs, A.F. and J. Kimm. 1975. Unit activity in the vestibular nucleus of the alert monkey during horizontal angular acceleration and eye movement. Journal of Neurophysiology, 38, 1140-1161.
Miles, F.A. 1974. Single unit firing patterns in the vestibular nuclei related to voluntary eye movements. Brain Research, 71, 215-224.
Pellionisz, A. and R. Llinas. 1980. Tensorial approach to the geometry of brain function: Cerebellar coordination via a metric tensor. Neuroscience, 5, 1125-1136.
Robinson, D.A. 1981. The use of control systems analysis in the neurophysiology of eye movements. Annual Review of Neuroscience, 4, 463-503.
Robinson, D.A. 1982. The use of matrices in analyzing the three-dimensional behavior of the vestibulo-ocular reflex. Biological Cybernetics, 46, 53-66.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, Foundations, eds. D.E. Rumelhart, J.L. McClelland, and The PDP Research Group, 318-362. Cambridge: MIT Press.
Tomlinson, R.D. and D.A. Robinson. 1984. Signals in vestibular nucleus mediating vertical eye movements in the monkey. Journal of Neurophysiology, 51, 1121-1136.
Wilson, V.J. and G. Melvill Jones. 1979. Mammalian Vestibular Physiology. New York: Plenum Press.
Received 3 October 1988; accepted 3 January 1989.
Communicated by Allen Selverston
A Neural Model for Generation of Some Behaviors in the Fictive Scratch Reflex

Reza Shadmehr
Brain Simulation Laboratory, Department of Computer Science, University of Southern California, Los Angeles, CA 90089-0782, USA
We have studied the scratch reflex in order to better understand the strategy that the spinal neural structures use to program limb movements. A network is proposed for the positioning and rhythmic portions of the scratch reflex in the deafferented preparation. This model is based on Berkinblit's hypotheses regarding organization of neurons constituting the spinal central pattern generator (CPG) for this behavior. We offer a mechanism by which the initial position of the hindlimb can influence the output of the CPG in the immobilized animal.

1 Introduction
The supraspinal control of a set of reflexes whose circuitries lie in the spinal cord may be the basic mechanism by which voluntary movements are generated (Berkinblit, Feldman and Fuckson 1986). The idea that the central control signals are expressible in terms of reflex parameters was the basis for Feldman's work on the "equilibrium point hypothesis" (for a review, see Feldman 1986). The underlying assumption is that afferent signals and central commands are interchangeable and that the nervous system accomplishes point-to-point movements by imitating inputs from the afferents to the stretch reflex circuitry (Berkinblit, Feldman and Fuckson 1986). This hypothesis leads to an algorithm for the control of redundant multijoint limbs without explicitly computing the inverse kinematics, the trajectory of movement, or the forces necessary for its execution (Berkinblit, Gelfand, and Feldman 1986). The scratch reflex is an example of a movement that involves several joints and muscle groups and can be programmed by the spinal motor centers. In the vertebrate scratch reflex, an animal responds to irritation of its skin by reaching towards and scratching the irritation site rhythmically with its foot. Since the movement can be evoked in animals with the spinal cord transected at upper cervical segments, networks of neurons residing entirely in the spinal cord probably contain the essential mechanisms for control of this movement (Berkinblit et al. 1978). Understanding the mechanisms by which an animal maps the sensory field (in this case, the site of irritation on the skin) onto the neuronal circuitry
that is capable of generating a temporal pattern of muscular activation may lead to an understanding of voluntary limb movements which rely on the manipulation of the reflex circuitry. The aim of this paper is to pursue a set of hypotheses on the organization of the neural controller for this reflex (Berkinblit et al. 1978; Deliagina et al. 1975; Deliagina and Orlovsky 1980; Deliagina et al. 1987; Shimanskii and Baev 1987), and to offer a neural model for the generation of the temporal pattern of scratching in the reduced preparation. Our approach is initially to consider the action of the neuronal controller of this reflex in the deafferented preparation: Although this oscillatory output is dependent on sensory feedback from the moving limb (Shimanskii and Baev 1987), when the corresponding limb is deafferented, essential characteristics of the movement remain intact (Berkinblit et al. 1978). We will then introduce limited afferent information to the model and compare the results with the immobilized, but afferent-intact, preparation.

2 The Experimental Data
In the fictitious scratch reflex (fictitious since the responding limb is immobilized), efferent neuromuscular contact is abolished and, in response to skin irritation, the following activity in motoneurons is observed: In the initial positioning phase, which lasts about 2 seconds, motoneurons belonging to tibialis anterior (TA, ankle flexor), extensor digitorum longus (EDL, digit flexor), and quadriceps (Q, knee extensor) are gradually depolarized and begin firing, while no change in membrane potential is observed among motoneurons belonging to soleus and gastrocnemius (GS, ankle extensors), and plantaris (P, ankle extensor) (Baev 1981; Berkinblit et al. 1980; Deliagina et al. 1987). By the end of this phase the ankle joint is within reach of the scratch site (Berkinblit et al. 1978). Once it has reached the irritation site, the leg begins an oscillatory pattern of scratching motion (period of 300 msec) which is sustained until the irritation is removed. This second phase is termed the rhythmic phase and is further divided into an aiming phase (activation of TA, EDL and Q) and a wiping phase (activation of GS and P). Almost all muscles active in the positioning phase (for example, TA, EDL and Q) are also active in the aiming phase, and are silent in the wiping phase. Muscles active in the wiping phase (for example, GS and P) are silent during the positioning and the aiming phases. Due to their similar time course of activation, we lump TA, EDL, and Q into group A, and GS and P into group B muscles. In summary then, the positioning phase involves only activity of motoneurons belonging to group A muscles, while the rhythmic phase involves alternation of group A and group B activity. Figure 1a shows the average change in membrane potential (MP) for a set of neurons belonging to TA (group A) and GS (group B) motor pools during one scratch cycle. MP is drawn with respect to the resting potential of the same motoneuron before the irritation was applied.
Note that the potential reached by the TA motoneurons by the end of the positioning phase (MPp) is essentially sustained during the rhythmic phase, except for periodic arrival of inhibition that drops the MP back to the resting level, but not below it. This suggests that during the rhythmic phase, the source of excitation of TA motoneurons is inhibited and not the motoneuron itself. GS motoneurons, however, do not undergo any change in their MP during the positioning phase and appear to receive only excitatory input during the rhythmic phase. Berkinblit hypothesized (Berkinblit et al. 1978) that tonic excitatory inflow from the propriospinal neurons relaying information from the sensors at the irritation site activates the CPG. The neurons causing the rhythmic activity are distinguished from the motoneurons and their associated local reflex pathways (that is, Ia inhibitory interneurons and Renshaw cells). By contrast, Miller and Scott's (Miller and Scott 1977) model of spinal pattern generators uses reciprocal inhibition among motoneurons to generate oscillatory behavior. However, there is some evidence (Pratt and Jordan 1987) that the reflex pathways of Ia inhibitory interneurons and Renshaw cells are not critical for generation of rhythmic activity in the motoneurons, and therefore are not part of the CPG for scratching. Berkinblit's theory is further supported by the apparent localization of the neurons belonging to this CPG in the third to fifth segments of the lumbar (L3-L5) spinal cord (most hindlimb motoneurons are situated below L5). Many neurons in the L3, L4 and L5 segments of the immobilized cat fire in bursts that match the rhythmic behavior of the scratch cycle (Berkinblit et al. 1978). Figure 1b shows the averaged activity of these neurons. Berkinblit et al. (1978) named these neurons groups I, II, and III. To avoid confusion with muscle afferents, however, we will refer to these neurons as groups BI, BII and BIII. Activity of neurons in group BI appears to be phase locked with the group A motoneurons active during the aiming phase, while group BIII's activity is similar to that of group B motoneurons, which fired in the wiping phase. During fictive scratching, activity in the ventral spino-cerebellar tract (VSCT) from the L4 and L5 segments is strikingly similar to that in neurons of groups BI and BII, while activity of spino-reticulo-cerebellar pathway (SRCP) neurons resembles that of group BIII (compare Fig. 202 of Ito 1984). Selective destruction and isolation of various segments of the spinal cord suggests that the main input to VSCT and SRCP neurons is the neuronal network generating the rhythmic oscillations in the L3 to L5 sections (Arshavsky et al. 1984). The importance of cell groups BI, BII and BIII for generation of rhythmic hindlimb activity is not known. However, neurons in an isolated L5 segment can generate a seemingly unchanged rhythmic behavior when a scratch reflex is evoked (Deliagina et al. 1983). No rhythmic activity is generally observed more caudally (L6 and lower) when neuronal somas in L4 and L5 have been destroyed. Anatomical evidence that a fair number of neurons located in lamina VII of L4 and L5 segments have axons that terminate on motoneurons of the seventh lumbar (L7) and first sacral (S1) segments (Jankowska and Skoog 1986) further supports the hypothesis of an oscillatory neuronal network residing in the L4 and L5 segments.
Figure 1: Output characteristics of the central pattern generator during the fictive scratch reflex: (a) Average change in membrane potential of motoneurons located in TA and GS motor pools by the end of the positioning phase (denoted by MPp), and during the scratch cycle. Change is with respect to resting potential before the reflex was initiated. (b) Activity in three groups of neurons located in L4 and L5 segments of the spinal cord during one cycle of the reflex. Figures constructed from data in figures 2 and 4 of (Berkinblit et al. 1980), and 11a, b and c of (Berkinblit et al. 1978).
3 The Model
Based on this evidence, we propose a neuronal model for pattern generation in the fictive scratch reflex, including a mechanism by which afferent information may be processed in the immobilized preparation. The model (Fig. 2) comprises three groups of neurons (Groups BI, BII and BIII) that constitute the CPG, a set of Ia inhibitory neurons, two motor pools (A and B), and an afferent feedback loop. Each group of neurons in the CPG is modeled by one inhibitory (filled units) and one excitatory (open units) neuron. The three groups of neurons are assumed to reside mainly in lamina VII of L4 and L5 segments. Activity in excitatory neurons of groups BI and BIII activates muscles in the aiming (group A) and wiping (group B) phases, respectively. Interaction between groups BI, BII and BIII neurons leads to the positioning and rhythmic behaviors observed during the fictive scratch. The role of Ia inhibitory interneurons can be described as follows: When dorsal roots are severed and no length information is being used to cause inhibition of the antagonist motor pool, a rhythmic pattern of activity persists in the Ia interneurons during the scratch cycle. The Ia interneurons leading to motoneurons of group A muscles are depolarized in phase with the motoneurons of group B muscles, while the Ia interneurons leading to the motoneurons of group B muscles are depolarized in phase with the later portion of the aiming phase (Deliagina and Orlovsky 1980). The pattern of depolarization in Ia interneurons leading to motoneurons belonging to group A muscles and group B muscles is similar to the pattern of activity seen in groups BIII and BII, respectively. In figure 2, we therefore connect groups BII and BIII to these Ia interneurons. Initially, we consider the operation of the network in the deafferented animal, that is, the afferent neurons are silent. As the irritation is detected, cutaneous input is translated into a tonic activation of group BI neurons, which yields sustained contraction of the group A muscles (positioning phase). This afferent tonic excitation eventually activates group BII neurons, which are postulated to have a longer time constant so that it takes them longer to reach firing threshold. Activity of the inhibitory neuron in group BII inhibits group BI neurons and serves to end the positioning phase. Activity in group BII's excitatory neuron excites group BIII neurons, but until group BII is successful in ending activity in group BI neurons, group BIII cannot overcome group BI's inhibition. Once group BI is shut down, activation of the excitatory neuron in group BIII leads to activation of the motoneurons in the group B motor pool (wiping phase), while the inhibitory neuron in the same group inhibits group BII and ends its own source of excitation. Inhibition of group BII leads to disinhibition of group BI neurons, which leads to the start of another aiming-wiping cycle. The activation history of neurons belonging to groups BI, BII, and BIII is plotted in figure 3. Mathematics used in simulation of this network have been described elsewhere (Shadmehr and Lindquist 1988).
Figure 2: Schematic of the neuronal system postulated to control the positioning and rhythmic portions of the fictive scratch reflex. Groups BI, BII, and BIII are neurons located in lamina VII of the L4 and L5 spinal segments, while the Ia inhibitory interneurons are near the motor pools in the more caudal segments. Group A motor pool is a lumped representation for the TA, EDL and Q motor pools. Group B represents the GS and P pools. Open neurons are excitatory, filled are inhibitory.
Input arrives at the time marked with an arrow, and is simulated by excitatory input to groups BI and BII. Parameter values for this particular run are given in the appendix of (Shadmehr and Lindquist 1988). The initial period seen in group BI lasts on the order of 1.2 seconds. In this period, group A muscles are activated such that they deflect the hindlimb forward, presumably so that the paw is within reach of the scratch site. Once this positioning period ends, activity in group BI is limited to an oscillatory pattern which lasts 300 msec, followed by an inactivation period that lasts 100 msec. The network offers a mechanism by which tonic input can be translated into an oscillatory output in a way that matches the data of (Berkinblit et al. 1978; 1980). Further data exists describing the effects of initial hindlimb position on the output of the CPG in an immobilized animal whose afferents are intact (Baev 1981). Passive pulling of the limb to a more caudal position before eliciting the scratch reflex lengthens the initial positioning phase, while pulling to a more rostral position shortens it, suggesting that afferent information on the length of hindlimb muscles in an immobilized animal affects the CPG.
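A minimal rate-model sketch of this behavior is given below. The ring of connections follows the wiring described in the model section (BII inhibits BI, BII excites BIII, BIII inhibits BII, BI inhibits BIII), with BII given the longer time constant; all numerical values are illustrative guesses rather than the parameters of Shadmehr and Lindquist (1988), so the simulated phase durations will only qualitatively resemble the 1.2 second positioning phase and 300 msec rhythmic period quoted above.

```python
import numpy as np

# Leaky-integrator sketch of the BI/BII/BIII central pattern generator.
# Firing rates are sigmoids bounded by 0 and 1, as in the paper; all
# parameter values below are illustrative assumptions.

def f(u):                                  # firing rate of a unit
    return 1.0 / (1.0 + np.exp(-10.0 * (u - 0.5)))

dt = 0.001
tau = np.array([0.05, 0.60, 0.05])         # BII has the longer time constant
# W[i, j]: influence of group j's firing rate on group i (order: BI, BII, BIII)
W = np.array([[ 0.0, -3.0,  0.0],          # BII inhibits BI
              [ 0.0,  0.0, -3.0],          # BIII inhibits BII
              [-3.0,  2.0,  0.0]])         # BI inhibits BIII, BII excites BIII
drive = np.array([1.0, 0.55, 0.0])         # tonic cutaneous input to BI and BII

u = np.zeros(3)
trace = []
for step in range(int(5.0 / dt)):          # 5 s of simulated scratching
    r = f(u)
    u = u + dt / tau * (-u + drive + W @ r)
    trace.append(r)

trace = np.array(trace)                    # columns: BI, BII, BIII firing rates
```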
Figure 3: Activation history (firing rates) of the neurons of groups BI, BII and BIII. Stimulus is tonic excitation of groups BI and BII neurons. The long initial activation in the group BI neuron corresponds to the positioning phase; the periodic activation in group BI is the aiming phase. The short periodic activation in group BIII corresponds to the wiping phase. Firing rate of each neuron is a sigmoid function bounded by 0 and 1 (Shadmehr and Lindquist 1988).
Feedback from the hindlimb muscles to the L4 and L5 segments of the spinal cord appears to be mainly via group II afferents (Edgley and Jankowska 1987). These afferents make monosynaptic connections to neurons residing in lamina VII of L4 and L5 (where the CPG is postulated to be located) and can be activated by muscle stretches of less than 100 μm, though little or no innervation of this area of spinal cord from group I afferents has been found. It has also been reported that the intensity of afferent flow to the L5 region of the spinal cord from a hindlimb muscle that is being passively stretched increases (Shimanskii and Baev 1987). To account for the variations in the network's output when the hindlimb is passively moved before the start of the behavior, we postulate that the new hindlimb position is reported to the neurons that make up the CPG by an increase in the firing rate of group II afferents of ankle and thigh extensors. Since the effect of this initial limb position must be a decrease in the length of the positioning phase, it is postulated that these afferents must excite group BII neurons of figure 2. Using the same approach, it can be supposed that afferents of thigh and ankle flexors increasingly inhibit group BII neurons as the limb is pulled further to a more caudal position. The net effect of synaptic input from these afferents on the group BII neurons must be of a tonic nature, since the limb never actually moves after the fictive scratch has been initiated, and group II afferents, as opposed to group Ia afferents, are mainly sensitive to the absolute length of the muscle (Carew and Ghez 1985), and not its rate of shortening. We assume that the tonic source of excitation from the irritation that initiated the reflex has remained the same. Figure 4 shows that the effect of tonic inhibition on group BII neurons is a lengthening of the initial positioning phase with an increase in the period of the rhythmic phase. The effect of excitation is the reverse. Changing the afferent tonic inflow has a profound effect on the initial positioning phase, while the period of the rhythmic phase quickly reaches asymptotic values, in good agreement with experimental data showing that the location of the hindlimb mainly affects the initial positioning phase (Shimanskii and Baev 1987). The magnitude of change in the period of the rhythmic phase is a readily testable result of this model. This model has attempted to take into account the role of afferent information only when the hindlimb has been immobilized. It was suggested that the dependence of the motor output on the initial position of the limb can be explained by considering a mapping of afferents belonging to extensor or flexor muscles such that they provide tonic excitation or inhibition to group BII neurons of the CPG, respectively. The role of afferent information in modification of the CPG's output in the intact animal is not known. However, it is generally believed that afferent information is processed at the input into the spinal cord by a mechanism of presynaptic inhibition (Baev et al. 1978).
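The postulated afferent effect can be illustrated by adding a tonic bias term to group BII's input in the sketch above and measuring how the positioning phase changes (compare figure 4). The parameter values, bias magnitudes, and the 0.5 firing-rate criterion below are arbitrary choices of ours:

```python
import numpy as np

# Hypothetical experiment with the CPG sketch above (same f, tau, W, drive):
# sweep a tonic afferent bias on group BII's input and measure the length of
# the initial positioning phase, taken as the time until BI's firing rate
# first drops below 0.5. All values remain illustrative.

f = lambda u: 1.0 / (1.0 + np.exp(-10.0 * (u - 0.5)))
tau = np.array([0.05, 0.60, 0.05])
W = np.array([[0.0, -3.0, 0.0], [0.0, 0.0, -3.0], [-3.0, 2.0, 0.0]])
drive = np.array([1.0, 0.55, 0.0])

def positioning_length(bias, dt=0.001, t_max=5.0):
    u = np.zeros(3)
    d = drive.copy()
    d[1] += bias                 # extensors excite BII (+), flexors inhibit (-)
    for step in range(int(t_max / dt)):
        u = u + dt / tau * (-u + d + W @ f(u))
        if step * dt > 0.1 and f(u)[0] < 0.5:
            return step * dt
    return t_max

for bias in (-0.2, 0.0, 0.2):
    print(bias, positioning_length(bias))   # inhibition lengthens the phase
```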
The function of this inhibition at the presynaptic junction is to modulate the sensitivity of that terminal to incoming action potentials from the afferent neuron, or possibly to generate antidromic action potentials that collide with and cancel action potentials generated by the afferent neuron (Bayev and Kostyuk 1981). Recently, Shimanskii and Baev (1988) have suggested that a model of afferent inflow from the muscles is located in the scratch generator of the intact animal, and serves to correct the trajectory of the limb. In the one-dimensional model presented here, the mapping of the irritation site to positioning of the limb is coded in the time length of the positioning phase. From figure 4, it is possible to arrive at some quantity M(d), which is the sum of tonic afferent inputs required to produce a positioning phase of length d. The goal of our future research is to postulate a network that converts the site of irritation and initial position of the hindlimb first to a positional error e (distance from paw to irritation) and then to M(e), which is then supplied as an extra input in parallel to the afferents we have postulated in figure 4.

Figure 4: Length of the positioning phase and the period of the rhythmic phase as functions of the magnitude of input on group BII neurons. Input is defined as the weighted algebraic sum of the inhibitory and excitatory firing rates. Inhibitory input from muscle afferents leads to smaller values on the abscissa, while larger values on the abscissa result from excitatory afferent input on group BII neurons.
Acknowledgments
The author wishes to thank guidance a n d review of Prof. Michael Arbib in preparation of this manuscript and collaboration of Mr. Gary D. Lindquist in development of some of the earlier models. Preparation of this paper was supported in part by grant no. lROl NS24926 from the National Institutes of Health (Michael Arbib, Principal Investigator). Portions of this paper have appeared in abstract form in Neural Networks, 1,(S1):359, 1988. References Arshavsky, Y.I., I.M. Gelfand, G.N. Orlovsky, G.A. Pavlova, L.B. Popova. 1984. Origin of signals conveyed by the ventral spino-cerebellar tract and spinoreticulo-cerebellar pathway. Experimental Brain Research, 54, 426-431. Baev, K.V., Y.V. Panchin, and R.N. Skryma. 1978. Depolarization of primary afferents during fictitous scratching in thalamic cats. Neurophysiology, 10(2), 120-1 22. Baev, K.V. 1981. The central program of activation of hindlimb muscles during scratching in cats. Neurophysiology, 13(1), 3844. Bayev, K.V. and P.G. Kostyuk. 1981. Primary afferent depolarization evoked by the activity of spinal scratching generator. Neuroscience, 6,205-215. Berkinblit, M.B., T.G. Deliagina, A.G. Feldman, I.M. Gelfand, and G.N. Orlovsky. 1978. Generation of scratching. I. Activity of spinal interneurons during scratching. Journal of Neurophysiology, 41(4), 104C1057. Berkinblit, M.B., T.G. Deliagina, G.N. Orlovsky, A.G. Feldman. 1980. Activity of motoneurons during fictitious scratch reflex in the cat. Brain Research, 193,427438. Berkinblit, M.B., A.G. Feldman, and 0.1. Fuckson. 1986. Adaptability of innate motor patterns and motor control mechanisms. Behavior and Brain Sciences, 9,585-638. Berkinblit, M.B., I.M. Gelfand, and A.G. Feldrnan. 1986. Model of the control of the movements of a multijoint limb. Biophysics, 31(1), 142-153. Carew, T.J. and C. Ghez. 1985. In: Principles of neural science, eds. E.R. Kandel and J.H. Schwartz, second edition. New York Elsevier. Deliagina, T.G., A.G. Feldman, I.M. Gelfand., and G.N. Orlovsky. 1975. On the role of central program and afferent inflow in the control of scratching movements in the cat. Brain Research, 100, 297-213. Deliagina, T.G. and G.N. Orlovsky. 1980. Activity of Ia inhibitory interneurons during fictitious scratch reflex in the cat. Brain Research, 193,439-447. Deliagina, T.G., G.N. Orlovsky, G.A. Pavlova. 1983. The capacity for generation of rhythmic oscillations is distributed in the lumbosacral spinal cord of the cat. Experimental Brain Research, 53, 81-90.
252
Reza Shadmehr
Edgley, S.A. and E. Jankowska. 1987. Field potentials generated by group I1 muscle afferents in the middle lumbar sements of the cat spinal cord. Journal of Physiology (London), 385, 393413. Feldman, A.G. 1986. Once more on the equilibrium-point hypothesis (A model) for motor control. Journal of Motor Behavior, 18(1), 17-54. Ito, M. 1984. The Cerebellum and Neural Control. Raven Press. Jankowska, E.and B. Skoog. 1986. Labeling of midlumbar neurones projecting to cat hindlimb motoneurones by transneuronal transport of WGA-HRP. Neuroscience Letters, 71, 163-168. Miller, S. and P.D. Scott. 1977. The spinal locomotor generator. Experimental Brain Research, 30, 387-403. Pratt, C.A. and L.M. Jordan. 1987. Ia inhibitory interneurons and Renshaw cells as contributors to the spinal mechanisms of fictive locomotion. Journal of Neurophysiology, 57, 56-71. Shadmehr, R. and G.D. Lindquist. 1988. A neural network for pattern generation in the scratch reflex. IEEE International Conference on Neural Networks [ICNN 881, 11, 25-32. Shimanskii, Y.P. and K.V. Baev. 1987. Dependence of efferent activity parameters on limb position during fictitious scratching in decerebrate cats. Neurophysiology, 15(5), 451-458. . 1988. Reordering of scratch generator efferent activity produced by stimulating hindlimb muscle afferents in decerebrate immobilized cats. Neurophysiology, 19(3), 279-287.
Received 11 October; accepted 24 October 1988.
Communicated by Dana Ballard
A Robot that Walks; Emergent Behaviors from a Carefully Evolved Network

Rodney A. Brooks
MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge, MA 02139, USA
Most animals have significant behavioral expertise built in without having to explicitly learn it all from scratch. This expertise is a product of evolution of the organism; it can be viewed as a very long-term form of learning which provides a structured system within which individuals might learn more specialized skills or abilities. This paper suggests one possible mechanism for analogous robot evolution by describing a carefully designed series of networks, each one being a strict augmentation of the previous one, which control a six-legged walking machine capable of walking over rough terrain and following a person passively sensed in the infrared spectrum. As the completely decentralized networks are augmented, the robot's performance and behavior repertoire demonstrably improve. The rationale for such demonstrations is that they may provide a hint as to the requirements for automatically building massive networks to carry out complex sensory-motor tasks. The experiments with an actual robot ensure that an essence of reality is maintained and that no critical disabling problems have been ignored.

1 Introduction
In earlier work (Brooks 1986; Brooks and Connell 1986), we have demonstrated complex control systems for mobile robots built from completely distributed networks of augmented finite state machines. In this paper we demonstrate that these techniques can be used to incrementally build complex systems integrating relatively large numbers of sensory inputs and large numbers of actuator outputs. Each step in the construction is purely incremental, but nevertheless along the way viable control systems are left at each step, before the next little piece of network is added. Additionally we demonstrate how complex behaviors, such as walking, can emerge from a network of rather simple reflexes with little central control. This contradicts vague hypotheses made to the contrary during the study of insect walking (for example, Bassler 1983, page 112).
2 The Subsumption Architecture
The subsumption architecture (Brooks 1986) provides an incremental method for building robot control systems linking perception to action. A properly designed network of finite state machines, augmented with internal timers, provides a robot with a certain level of performance, and a repertoire of behaviors. The architecture provides mechanisms to augment such networks in a purely incremental way to improve the robot's performance on tasks and to increase the range of tasks it can perform. At an architectural level, the robot's control system is expressed as a series of layers, each specifying a behavior pattern for the robot, and each implemented as a network of message passing augmented finite state machines. The network can be thought of as an explicit wiring diagram connecting outputs of some machines to inputs of others with wires that can transmit messages. In the implementation of the architecture on the walking robot the messages are limited to 8 bits. Each augmented finite state machine (AFSM), figure 1, has a set of registers and a set of timers, or alarm clocks, connected to a conventional finite state machine which can control a combinatorial network fed by the registers. Registers can be written by attaching input wires to them and sending messages from other machines. The messages get written into them, replacing any existing contents. The arrival of a message, or the expiration of a timer, can trigger a change of state in the interior finite state machine. Finite state machine states can either wait on some event, conditionally dispatch to one of two other states based on some combinatorial predicate on the registers, or compute a combinatorial function of the registers, directing the result either back to one of the registers or to an output of the augmented finite state machine. Some AFSMs connect directly to robot hardware. Sensors deposit their values to certain registers, and certain outputs direct commands to actuators. A series of layers of such machines can be augmented by adding new machines and connecting them into the existing network in the ways shown in figure 1. New inputs can be connected to existing registers, which might previously have contained a constant. New machines can inhibit existing outputs or suppress existing inputs, by being attached as side-taps to existing wires (Fig. 1, circled 'i'). When a message arrives on an inhibitory side-tap no messages can travel along the existing wire for some short time period. To maintain inhibition there must be a continuous flow of messages along the new wire. (In previous versions of the subsumption architecture (Brooks 1986) explicit, long, time periods had to be specified for inhibition or suppression with single shot messages. Recent work has suggested this better approach (Connell 1988).) When a message arrives on a suppressing side-tap (Fig. 1, circled 's'), again no messages are allowed to flow from the original source for some small time period, but now the suppressing message is gated through and it masquerades as having come from the original source. Again, a continuous supply of suppressing messages is required to maintain control of a side-tapped wire.
One last mechanism for merging two wires is called defaulting (indicated in wiring diagrams by a circled 'd'). This is just like the suppression case, except that the original wire, rather than the new side-tapping wire, is able to wrest control of messages sent to the destination.
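The timing semantics of these side-taps can be made concrete with a toy sketch; this is our illustration, not Brooks's implementation. A suppressing message blocks the host wire for the suppression period (two clock ticks on the walking robot, as noted below) and is gated through in its place, so a side-tap sending at the maximum rate of one message per tick keeps control indefinitely.

```python
# Toy sketch of a suppressing side-tap on a message-passing wire. The
# two-tick suppression period is the one quoted below for the walking robot;
# the class interface is our own invention for illustration.

SUPPRESS_TICKS = 2

class Wire:
    def __init__(self):
        self.suppress_until = -1       # tick until which the host is blocked

    def host_send(self, msg, tick):
        if tick < self.suppress_until:
            return None                # host message discarded while suppressed
        return msg

    def side_tap_send(self, msg, tick):
        self.suppress_until = tick + SUPPRESS_TICKS
        return msg                     # masquerades as the original source

wire = Wire()
for tick in range(6):
    if tick < 3:
        out = wire.side_tap_send("lift", tick)   # side-tap active, ticks 0-2
    else:
        out = wire.host_send("down", tick)       # host regains control at tick 4
    print(tick, out)
```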
Figure 1: An augmented finite state machine consists of registers, alarm clocks, a combinatorial network and a regular finite state machine. Input messages are delivered to registers, and messages can be generated on output wires. AFSMs are wired together in networks of message passing wires. As new wires are added to a network, they can be connected to existing registers, they can inhibit outputs and they can suppress inputs.
All clocks in a subsumption system have approximately the same tick period (0.04 seconds on the walking robot), but neither they nor messages are synchronous. The fastest possible rate of sending messages along a wire is one per clock tick. The time periods used for both inhibition and suppression are two clock ticks. Thus, a side-tapping wire with messages being sent at the maximum rate can maintain control of its host wire.

3 The Networks and Emergent Behaviors
The six-legged robot is shown in figure 2. We refer to the motors on each leg as an α motor (for advance), which swings the leg back and forth, and a β motor (for balance), which lifts the leg up and down. Figure 3 shows a network of 57 augmented finite state machines which was built incrementally and can be run incrementally by selectively deactivating later AFSMs. The AFSMs without bands on top are repeated six times, once for each leg. The AFSMs with solid bands are unique and comprise the only central control in making the robot walk, steer and follow targets. The AFSMs with striped bands are duplicated twice each and are specific to particular legs. The complete network can be built incrementally by adding AFSMs to an existing network, producing a number of viable robot control systems itemized below. All additions are strictly additive with no need to change any existing structure. Figure 4 shows a partially constructed version of the network.

1. Standup. The simplest level of competence for the robot is achieved with just two AFSMs per leg, alpha pos and beta pos. These two machines use a register to hold a set position for the α and β motors respectively and ensure that the motors are sent those positions. The initial values for the registers are such that on power up the robot assumes a stance position. The AFSMs also provide an output that reports the most recent commanded position for their motor.

2. Simple walk. A number of simple increments to this network result in one which lets the robot walk. First, a leg down machine for each leg is added which notices whenever the leg is not in the down position and writes the appropriate beta pos register in order to set the leg down. Then, a single alpha balance machine is added which monitors the α position, or forward swing, of all six legs, treating straight out as zero, forward as positive and backward as negative. It sums these six values and sends out a single identical message to all six alpha pos machines, which, depending on the sign of the sum, is either null, or an increment or decrement to the current α position of each leg. The alpha balance machine samples the leg positions at a relatively high rate.
Figure 2: The six-legged robot is about 35cm long, has a leg span of 25cm, and weighs approximately 1Kg. Each leg is rigid and is attached at a shoulder joint with two degrees of rotational freedom, driven by two orthogonally mounted model airplane position controllable servo motors. An error signal has been tapped from the internal servo circuitry to provide crude force measurement (5 bits, including sign) on each axis, when the leg is not in motion around that axis. Other sensors are two front whiskers, two four-bit inclinometers (pitch and roll), and six forward looking passive pyroelectric infrared sensors. The sensors have approximately 6 degrees angular resolution and are arranged over a 45 degree span. There are four onboard 8 bit microprocessors linked by a 62.5Kbaud token ring. The total memory usage of the robot is about 1Kbytes of RAM and 10Kbytes of EPROM. Three silver-zinc batteries fit between the legs to make the robot totally self contained.

Thus if one leg happens to move forward for some reason, all legs will receive a series of messages to move backward slightly. Next, the alpha advance AFSM is added for each leg. Whenever it notices that the leg is raised (by monitoring the output of the beta pos machine) it forces the leg forward by suppressing the signal coming from the global alpha balance machine. Thus, if a leg is raised for some reason it reflexively swings forward, and all other legs swing backward slightly to compensate (notice that the forward swinging leg does not even receive the backward message due to the suppression of that signal).
Now a fifth AFSM, up-leg trigger, is added for each leg which can issue a command to lift a leg by suppressing the commands from the leg down machine. It has one register which monitors the current β position of the leg. When it is down, and a trigger message is received in a second register, it ensures that the contents of an initially constant third register are sent to the beta pos machine to lift the leg. With this combination of local leg specific machines and a single machine trying to globally coordinate the sum of the α position of all legs, the robot can very nearly walk. If an up-leg trigger machine receives a trigger message it lifts its associated leg, which triggers a reflex to swing it forward, and then the appropriate leg down machine will pull the leg down. At the same time all the other legs still on the ground (those not busy moving forward) will swing backward, moving the robot forward.
Figure 3: The final network consists of 57 augmented finite state machines. The AFSMs without bands on top are repeated six times, once for each leg. The AFSMs with solid bands are unique and comprise the only central control in making the robot walk, steer and follow targets. The AFSMs with striped bands are duplicated twice each and are specific to particular legs. The AFSMs with a filled triangle in their bottom right corner control actuators. Those with a filled triangle in their upper left corner receive inputs from sensors.
Figure 4: A strict subset of the full network enables the robot to walk without any feedback. It pitches and rolls significantly as it walks over rough terrain. This version of the network contains 32 AFSMs. 30 of these comprise six identical copies, one for each leg, of a network of five AFSMs which are purely local in their interactions with a leg. The last two machines provide all the global coordination necessary to make the machine walk; one tries to drive the sum of leg swing angles (α angles) to zero, and the other sequences lifting of individual legs.

The final piece of the puzzle is to add a single AFSM which sequences walking by sending trigger messages in some appropriate pattern to each of the six up-leg trigger machines. We have used two versions of this machine, both of which complete a gait cycle once every 2.4 seconds (a sketch of these two sequencers appears after this list). One machine produces the well-known alternating tripod (Wilson 1980), by sending simultaneous lift triggers to triples of legs every 1.2 seconds. The other produces the standard back to front ripple gait by sending a trigger message to a different leg every 0.4 seconds. Other gaits are possible by simple substitution of this machine. The machine walks with this network, but is insensitive to the terrain over which it is walking and tends to roll and pitch excessively as it walks over obstacles. The complete network for this simple type of walking is shown in figure 4.

3. Force balancing. A simple minded way to compensate for rough terrain is to monitor the force on each leg as it is placed on the ground and back off if it rises beyond some threshold. The rationale is that if a leg is being placed down on an obstacle it will have to roll (or pitch) the body of the robot in order for the leg β angle to reach its preset value, increasing the load on the motor. For each leg a beta force machine is added which monitors the β motor forces, discarding high readings coming from servo errors during free space swinging, and a beta balance machine which sends out lift up messages whenever the force is too high.
It includes a small deadband where it sends out zero move messages which trickle down through a defaulting switch on the up-leg trigger to eventually suppress the leg down reflex. This is a form of active compliance which has a number of known problems on walking machines (Klein et al. 1983). On a standard obstacle course (a single 5 centimeter high obstacle on a plane) this new machine reduced the standard deviation, over a 12 second period, of the readings from onboard 4 bit pitch and roll inclinometers (with approximately a 35 degree range), from 3.592 and 0.624 respectively to 2.325 and 0.451 respectively.
4. Leg lifting. There is a tradeoff between how high each leg is lifted and overall walking speed. But low leg lifts limit the height of obstacles which can be easily scaled. An eighth AFSM for each leg compensates for this by measuring the force on the forward swing (α) motor as it swings forward and writing the height register in the up-leg trigger at a higher value, setting up for a higher lift of the leg on the next step cycle of that leg. The up-leg trigger resets this value after the next step.
5. Whiskers. In order to anticipate obstacles better, rather than waiting until the front legs are rammed against them, each of two whiskers is monitored by a feeler machine and the lift of the left and right front legs is appropriately upped for the next step cycle.

6. Pitch stabilization. The simple force balancing strategy above is by no means perfect. In particular, in high pitch situations the rear or front legs (depending on the direction of pitch) are heavily loaded and so tend to be lifted slightly, causing the robot to sag and increase the pitch even more. Therefore one forward pitch and one backward pitch AFSM are added to monitor high pitch conditions on the pitch inclinometer and to inhibit the local beta balance machine output in the appropriate circumstances. The pitch standard deviation over the 12 second test reduces to 1.921 with this improvement while the roll standard deviation stays around the same at 0.458.

7. Prowling. Two additional AFSMs can be added so that the robot only bothers to walk when there is something moving nearby. The IR sensors machine monitors an array of six forward-looking pyroelectric infrared sensors and sends an activity message to the prowl machine when it detects motion. The prowl machine usually inhibits the leg lifting trigger messages from the walk machine, except for a little while after infrared activity is noticed. Thus the robot sits still until a person, say, walks by, and then it moves forward a little.
8. Steered prowling. The single steer AFSM takes note of the predominant direction, if any, of the infrared activity and writes into a register in each alpha pos machine, for legs on that side of the robot, specifying the rear swinging stop position of the leg. This gets reset on every stepping cycle of the leg, so the steer machine must constantly refresh it in order to reduce the leg's backswing and force the robot to turn in the direction of the activity. With this single additional machine the robot is able to follow moving objects such as a slow-walking person.
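As a concrete illustration of the walk-sequencing machines of item 2 (referenced there), the sketch below generates the trigger pattern of both gaits: each completes a 2.4 second cycle, the tripod machine triggering alternating leg triples every 1.2 seconds and the ripple machine a single leg every 0.4 seconds. The leg numbering and the set-valued trigger interface are our assumptions, not details from the paper.

```python
# Sketch of the two walk-sequencing machines: alternating tripod and
# back-to-front ripple. Both complete a gait cycle every 2.4 s. Leg indices
# 0-5 and their assignment to triples are an assumed convention.

TRIPOD = [{0, 3, 4}, {1, 2, 5}]   # alternating leg triples
RIPPLE = [5, 2, 4, 1, 3, 0]       # one leg at a time, back to front

def tripod_triggers(tick):
    """Triples get simultaneous lift triggers every 1.2 s (3 ticks of 0.4 s)."""
    return TRIPOD[(tick // 3) % 2] if tick % 3 == 0 else set()

def ripple_triggers(tick):
    """A different single leg gets a trigger every 0.4 s tick."""
    return {RIPPLE[tick % 6]}

for tick in range(6):             # one full 2.4 s gait cycle in 0.4 s steps
    print(tick * 0.4, tripod_triggers(tick), ripple_triggers(tick))
```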
4 Conclusion

This exercise in synthetic neuro-ethology has successfully demonstrated a number of things, at least in the robot domain. All these demonstrations depend on the manner in which the networks were built incrementally from augmented finite state machines. Robust walking behaviors can be produced by a distributed system with very limited central coordination. In particular, much of the sensory-motor integration which goes on can happen within local asynchronous units. This has relevance, in the form of an existence proof, to the debate on the central versus peripheral control of motion (Bizzi 1980) and in particular in the domain of insect walking (Bassler 1983). Higher-level behaviors (such as following people) can be integrated into a system which controls lower level behaviors, such as leg lifting and force balancing, in a completely seamless way. There is no need to postulate qualitatively different sorts of structures for different levels of behaviors and no need to postulate unique forms of network interconnect to integrate higher level behaviors. Coherent macro behaviors can arise from many independent micro behaviors. For instance, the robot following people works even though most of the effort is being done by independent circuits driving legs, and these circuits are getting only very indirect pieces of information from the higher levels, and none of this communication refers at all to the task in hand (or foot). There is no need to postulate a central repository for sensor fusion to feed into. Conflict resolution tends to happen more at the motor command level, rather than the sensor or perception level.

Acknowledgments

Grinnell More did most of the mechanical design and fabrication of the robot. Colin Angle did much of the processor design and most of the electrical fabrication of the robot.
Mike Ciholas, Jon Connell, Anita Flynn, Chris Foley, and Peter Ning provided valuable design and fabrication advice and help. This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the research is provided in part by the University Research Initiative under Office of Naval Research contract N00014-86-K-0685 and in part by the Advanced Research Projects Agency under Office of Naval Research contract N00014-85-K-0124.
References

Bassler, U. 1983. Neural Basis of Elementary Behavior in Stick Insects. Springer-Verlag.
Bizzi, E. 1980. Central and peripheral mechanisms in motor control. In: Tutorials in Motor Behavior, eds. G.E. Stelmach and J. Requin. North-Holland.
Brooks, R.A. 1986. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2, 14-23.
Brooks, R.A. and J.H. Connell. 1986. Asynchronous distributed control system for a mobile robot. Proceedings SPIE, Cambridge, MA, 77-84.
Connell, J.H. 1988. A behavior-based arm controller. MIT AI Memo 1025.
Klein, C.A., K.W. Olson, and D.R. Pugh. 1983. Use of force and attitude sensors for locomotion of a legged vehicle. International Journal of Robotics Research, 2:2, 3-17.
Wilson, D.M. 1966. Insect walking. Annual Review of Entomology, 11; reprinted in The Organization of Action: A New Synthesis, ed. C.R. Gallistel, Lawrence Erlbaum, 1980, 115-142.
Received 30 September 1988; accepted 20 October 1988.
Communicated by Fernando Pineda
Learning State Space Trajectories in Recurrent Neural Networks

Barak A. Pearlmutter
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Many neural network learning procedures compute gradients of the errors on the output layer of units after they have settled to their final values. We describe a procedure for finding $\partial E/\partial w_{ij}$, where E is an error functional of the temporal trajectory of the states of a continuous recurrent network and $w_{ij}$ are the weights of that network. Computing these quantities allows one to perform gradient descent in the weights to minimize E. Simulations in which networks are taught to move through limit cycles are shown. This type of recurrent network seems particularly suited for temporally continuous domains, such as signal processing, control, and speech.
1 Introduction

Pineda (1987) has shown how to train the fixpoints of a recurrent temporally continuous generalization of backpropagation networks (Rumelhart et al. 1986). Such networks are governed by the coupled differential equations

$$\frac{dy_i}{dt} = \frac{1}{T_i}\left( -y_i + \sigma(x_i) + I_i \right) \qquad (1.1)$$

where

$$x_i = \sum_j w_{ij}\, y_j \qquad (1.2)$$

is the total input to unit i, $y_i$ is the state of unit i, $T_i$ is the time constant of unit i, $\sigma$ is an arbitrary differentiable function (typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$), $w_{ij}$ are the weights, and the initial conditions $y_i(t_0)$ and driving functions $I_i(t)$ are the inputs to the system.

Consider minimizing $E(y)$, some functional of the trajectory taken by y between $t_0$ and $t_1$. For instance, $E = \int_{t_0}^{t_1} (y_0(t) - f(t))^2\, dt$ measures the deviation of $y_0$ from the function f, and minimizing this E would teach the network to have $y_0$ imitate f. Below, we develop a technique for computing $\partial E(y)/\partial w_{ij}$ and $\partial E(y)/\partial T_i$, thus allowing us to do gradient descent in the weights and time constants so as to minimize E.
Neural Computation 1, 263-269 (1989) © 1989 Massachusetts Institute of Technology
2 A Forward/Backward Technique
Let us define

$$e_i(t) = \frac{\partial E}{\partial y_i(t)}. \qquad (2.1)$$

In the usual case E is of the form $\int_{t_0}^{t_1} f(y(t), t)\,dt$, so $e_i(t) = \partial f(y(t), t)/\partial y_i(t)$. Intuitively, $e_i(t)$ measures how much a small change to $y_i$ at time t affects E if everything else is left unchanged. If we define $z_i$ by the differential equation

$$\frac{dz_i}{dt} = \frac{1}{T_i}\, z_i - e_i - \sum_j \frac{1}{T_j}\, w_{ji}\, \sigma'(x_j)\, z_j \qquad (2.2)$$

with boundary conditions $z_i(t_1) = 0$, then

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{T_i} \int_{t_0}^{t_1} z_i\, \sigma'(x_i)\, y_j\, dt \qquad (2.3)$$

and

$$\frac{\partial E}{\partial T_i} = \frac{1}{T_i} \int_{t_0}^{t_1} z_i\, \frac{dy_i}{dt}\, dt. \qquad (2.4)$$

These results are derived using a finite difference approximation in (Pearlmutter 1988), and can also be derived using the calculus of variations and Lagrange multipliers (William Skaggs, personal communication) or from the continuous form of dynamic programming (Bryson 1962).

3 Simulation Results
Using first-order finite difference approximations, we integrated the system $y$ forward from $t_0$ to $t_1$, set the boundary conditions $z_i(t_1) = 0$, and integrated the system $z$ backwards from $t_1$ to $t_0$ while numerically integrating $z_i\, \sigma'(x_i)\, y_j$ and $z_i\, dy_i/dt$, thus computing $\partial E/\partial w_{ij}$ and $\partial E/\partial T_i$.
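The procedure can be summarized in a short sketch. The following Python fragment is a minimal reading of the forward/backward computation under the equations as reconstructed above; it is not the author's original code. The error functional is hard-wired to the tracking example $E = \int (y_0 - f)^2 dt$, and w, T, I, y0 are assumed to be NumPy arrays.

```python
import numpy as np

def sigma(x):                       # logistic squashing function
    return 1.0 / (1.0 + np.exp(-x))

def gradients(w, T, I, y0, f, t0=0.0, t1=5.0, dt=0.1):
    """Return dE/dw and dE/dT for E = integral of (y_0(t) - f(t))^2 dt."""
    n = len(T)
    steps = int(round((t1 - t0) / dt))
    ys = np.zeros((steps + 1, n))   # stored trajectory, replayed backwards
    xs = np.zeros((steps, n))       # stored net inputs, replayed backwards
    ys[0] = y0
    # Forward pass: first-order finite-difference integration of (1.1).
    for k in range(steps):
        xs[k] = w @ ys[k]
        ys[k + 1] = ys[k] + dt * (-ys[k] + sigma(xs[k]) + I) / T
    # Backward pass: integrate z from t1 to t0 with z(t1) = 0 (eq. 2.2),
    # accumulating the integrands of equations (2.3) and (2.4).
    z = np.zeros(n)
    dEdw = np.zeros_like(w)
    dEdT = np.zeros(n)
    for k in range(steps - 1, -1, -1):
        e = np.zeros(n)
        e[0] = 2.0 * (ys[k + 1][0] - f(t0 + (k + 1) * dt))  # e_i = dE/dy_i
        sp = sigma(xs[k]) * (1.0 - sigma(xs[k]))            # sigma'(x)
        z = z - dt * (z / T - e - w.T @ (sp * z / T))       # step backwards
        dydt = (-ys[k] + sigma(xs[k]) + I) / T
        dEdw += dt * np.outer(z * sp / T, ys[k])            # eq. 2.3 integrand
        dEdT += dt * z * dydt / T                           # eq. 2.4 integrand
    return dEdw, dEdT
```

For the experiments below, one would wrap calls to a routine of this kind in a gradient descent loop with step size and momentum.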
Figure 1: The XOR network. (Layers labeled in the figure: input, hidden, output.)
Figure 2: The states of the output unit in the four cases plotted from t = 0 to t = 5 after 200 epochs of learning. The error was computed only between t = 2 and t = 3.
Since computing $dz_i/dt$ requires knowing $\sigma'(x_i)$, we stored it and replayed it backwards as well. We also stored and replayed $y_i$, as it is used in expressions being numerically integrated. We used the error functional

$$E = \frac{1}{2} \int_{t_0}^{t_1} \sum_i s_i(t)\, (y_i(t) - d_i(t))^2\, dt$$

where $d_i(t)$ is the desired state of unit i at time t and $s_i(t)$ is the importance of unit i achieving that state at that time. Throughout, we used $\sigma(\xi) = (1 + e^{-\xi})^{-1}$. Time constants were initialized to 1, weights were initialized to uniform random values between $-1$ and $1$, and the initial values $y_i(t_0)$ were set to $I_i(t_0) + \sigma(0)$. For these simulations we used $\Delta t = 0.1$.

All of these networks have an extra unit which has no incoming connections, an external input of 0.5, and outgoing connections to all other units. This unit provides a bias, which is equivalent to the negative of a threshold. This detail is suppressed below.

3.1 Exclusive Or. The network of figure 1 was trained to solve the XOR problem. Aside from the addition of time constants, the network topology was that used by Pineda (1987). We defined $E = \sum_k \int_2^3 (y_0^{(k)} - d^{(k)})^2\, dt$, where k ranges over the four cases, $d^{(k)}$ is the correct output, and $y_0$ is the state of the output unit. The inputs to the net, $I_1^{(k)}$ and $I_2^{(k)}$, range over the four possible boolean combinations in the four different cases.
Figure 3: Desired states $d_1$ and $d_2$ plotted against each other (left); actual states $y_1$ and $y_2$ plotted against each other at epoch 1,500 (center) and 12,000 (right).

With suitable choice of step size and momentum, training time was comparable to standard backpropagation, averaging about one hundred epochs. It is interesting that even for this binary task, the network made use of dynamical behavior. After extensive training the network behaved as expected, saturating the output unit to the correct value. Earlier in training, however, we occasionally (about one out of every ten training sessions) observed the output unit at nearly the correct value between t = 2 and t = 3, but then saw it move in the wrong direction at t = 3 and end up stabilizing at a wildly incorrect value. Another dynamic effect, which was present in almost every run, is shown in figure 2. Here, the output unit heads in the wrong direction initially and then corrects itself before the error window. A very minor case of diving towards the correct value and then moving away is seen in the lower left-hand corner of figure 2.

3.2 A Circular Trajectory. We trained a network with no input units, four hidden units, and two output units, all fully connected, to follow the circular trajectory of figure 3. It was required to be at the leftmost point on the circle at t = 5 and to go around the circle twice, with each circuit taking 16 units of time. While unconstrained by the environment, the network moves from its initial position at (0.5, 0.5) to the correct location at the leftmost point on the circular trajectory. Although the network was run for ten circuits of its cycle, these overlap so closely that the separate circuits are not visible. Upon examining the network's internals, we found that it devoted three of its hidden units to maintaining and shaping a limit cycle, while the fourth hidden unit decayed away quickly. Before it decayed, it pulled the other units to the appropriate starting point of the limit cycle, and
after it decayed it ceased to affect the rest of the network. The network used different units for the limit behavior and the initial behavior, an appropriate modularization.

3.3 A Figure Eight. We were unable to train a network with four hidden units to follow the figure eight shape shown in figure 4, so we used a network with ten hidden units. Since the trajectory of the output units crosses itself, and the units are governed by first order differential equations, hidden units are necessary for this task regardless of the σ function. Training was more difficult than for the circular trajectory, and shaping the network's behavior by gradually extending the length of time of the simulation proved useful. Before t = 5, while unconstrained by the environment, the network moves in a short loop from the initial position at (0.5, 0.5) to where it should sit on the limit cycle at t = 5, namely (0.5, 0.5). Although the network was run for ten circuits of its cycle to produce this graph, these overlap so closely that the separate circuits are not visible.

4 Embellishments
Adding time delays to the links simply adds analogous time delays to the differential equation for z. This approach can be used to learn modifiable time delays. We can avoid the backwards pass by using a shooting method to update guesses for the correct values of $z_i(t_0)$ such that $z_i(t_1) = 0$ and integrating everything in the forward direction. Regrettably, the computation required to compute the derivatives required by the shooting method seems excessive, and numeric stability is poor. We can derive a "teacher forced" variant of our learning algorithm, presumably obtaining speedups similar to those reported by Williams and Zipser (1989). It would be useful to have some characterization of the class of trajectories that a network can learn as a function of the number of hidden units. These networks have at least the representational power of Fourier decompositions, as one can use a pair of nodes to build an oscillator of arbitrary frequency by making use of the local linearity of the σ function (Furst 1988). We have also found simple bounds on $d^2y/dt^2$ based on the number of units, the largest weight, and the largest reciprocal time constant. Experiments with perturbing the cyclic networks of section 3 show that they have developed true limit cycles which attract neighboring states and pull them into the cycle. The oscillatory behavior of the two output units was not independent, but was coupled by the hidden units, which keep them phase locked even in the face of massive disruptions.
Figure 4: Desired states $d_1$ and $d_2$ plotted against each other (left); actual states $y_1$ and $y_2$ plotted against each other at epoch 3,182 (center) and 20,000 (right).
Further details on these and other related topics can be found in (Pearlmutter 1988).

5 Related Network Models
We use the same class of networks used by Pineda (1987), but he is concerned only with the limit behavior of these networks, and completely suppresses all other temporal behavior. His learning technique is applicable only when the network has a simple fixpoint; limit cycles or other non-point attractors violate a mathematical assumption upon which his technique is based. We can derive Pineda's equations from ours. Let $I_i$ be held constant, assume that the network settles to a fixpoint, let the initial conditions be this fixpoint, that is, $y_i(t_0) = y_i(\infty)$, and let E measure Pineda's error integrated over a short interval after $t_0$, with an appropriate normalization constant. As $t_1$ tends to infinity, (2.2) and (2.3) reduce to Pineda's equations, so in a sense our equations are a generalization of Pineda's; but these assumptions strain the analogy.

Jordan (1986) uses a conventional backpropagation network with the outputs clocked back to the inputs to generate temporal sequences. The treatment of time is the major difference between Jordan's networks and those in this work. The heart of Jordan's network is atemporal, taking inputs to outputs without reference to time, while an external mechanism is used to clock the network through a sequence of states in much the same way that hardware designers use a clock to drive a piece of combinatorial logic through a sequence of states. In our work, the network is not externally clocked; instead, it evolves continuously through time according to a set of coupled differential equations.
Most recently, Williams and Zipser (1989) have discovered an on-line learning procedure for networks of this sort. The tradeoffs between their procedure and ours are explored in some detail in (Pearlmutter 1988).
6 Acknowledgments

We thank Richard Szeliski for helpful comments and David Touretzky for unflagging support. This research was sponsored in part by National Science Foundation grant EET-8716324, and by the Office of Naval Research under contract number N00014-86-K-0678. Barak Pearlmutter is a Fannie and John Hertz Foundation fellow.
References

Bryson, A.E., Jr. 1962. A steepest ascent method for solving optimum programming problems. Journal of Applied Mechanics, 29(2), 247.
Furst, M. 1988. Personal communication.
Jordan, M.I. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In: Proceedings of the 1986 Cognitive Science Conference, 531-546. Lawrence Erlbaum.
Pearlmutter, B. 1988. Learning state space trajectories in recurrent neural networks. Technical Report CMU-CS-88-191, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Pineda, F. 1987. Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 59(19), 2229-2232.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: Bradford Books.
Werbos, P.J. 1988. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.
Williams, R.J. and D. Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.
Received 17 October 1988; accepted 14 March 1989.
Communicated by Fernando Pineda
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks

Ronald J. Williams
College of Computer Science, Northeastern University, Boston, MA 02115, USA

David Zipser
Institute for Cognitive Science, University of California, La Jolla, CA 92093, USA
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.

Neural Computation 1, 270-280 (1989) © 1989 Massachusetts Institute of Technology

1 Introduction
A major problem in connectionist theory is to develop learning algorithms that can tap the full computational power of neural networks. Much progress has been made with feedforward networks, and attention has recently turned to developing algorithms for networks with recurrent connections, which have important capabilities not found in feedforward networks, including attractor dynamics and the ability to store information for later use. Of particular interest is their ability to deal with time-varying input or output through their own natural temporal operation. A variety of approaches to learning in networks with recurrent connections have been proposed. Algorithms for the special case of networks that settle to stable states, often regarded as associative memory networks, have been proposed by Hopfield (1982), Lapedes and Farber (1986), Almeida (1987), Pineda (1988), and Rohwer and Forrest (1987). Other researchers have focused on learning algorithms for more general networks that use recurrent connections to deal with time-varying input and/or output in nontrivial ways. A general framework for such
problems was laid out by Rumelhart, Hinton, and Williams (1986), who unfolded the recurrent network into a multilayer feedforward network that grows by one layer on each time step. We will call this approach "backpropagation through time." One of its primary strengths is its generality, but a corresponding weakness is its growing memory requirement when given an arbitrarily long training sequence. Other approaches to training recurrent nets to handle time-varying input or output have been suggested or investigated by Jordan (1986), Bachrach (1988), Mozer (1988), Elman (1988), Servan-Schreiber, Cleeremans, and McClelland (1988), Robinson and Fallside (1987), Stornetta, Hogg, and Huberman (1987), Gallant and King (1988), and Pearlmutter (1988; 1989). Many of these approaches use restricted architectures or are based on more computationally limited approximations to the full backpropagation-through-time computation. The approach we propose here enjoys the generality of the backpropagation-through-time approach while not suffering from its growing memory requirement in arbitrarily long training sequences. It coincides with an approach suggested in the system identification literature (McBride and Narendra 1965) for tuning the parameters of general dynamical systems. The work of Bachrach (1988) and Mozer (1988) represents special cases of the algorithm presented here, and Robinson and Fallside (1987) have given an alternative description of the full algorithm as well. However, to the best of our knowledge, none of these investigators has published an account of the behavior of this algorithm in unrestricted architectures.

2 The Learning Algorithm and Variations

2.1 The Basic Algorithm. Let the network have n units, with m external input lines. Let $y(t)$ denote the n-tuple of outputs of the units in the network at time t, and let $x(t)$ denote the m-tuple of external input signals to the network at time t. We concatenate $y(t)$ and $x(t)$ to form the $(m + n)$-tuple $z(t)$, with U denoting the set of indices k such that $z_k$ is the output of a unit in the network and I the set of indices k for which $z_k$ is an external input. The indices on y and x are chosen to correspond to those of z, so that

$$z_k(t) = \begin{cases} x_k(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U. \end{cases} \qquad (2.1)$$
Let W denote the weight matrix for the network, with a unique weight between every pair of units and also from each input line to each unit. By adopting the indexing convention just described, we can incorporate all the weights into this single $n \times (m + n)$ matrix. To allow each unit to have a bias weight we simply include among the m input lines one input whose value is always 1.
In what follows we use a discrete time formulation and we assume that the network consists entirely of semilinear units; it is straightforward to extend the approach to continuous time and other forms of differentiable unit computation. We let

$$s_k(t) = \sum_{l \in U \cup I} w_{kl}\, z_l(t) \qquad (2.2)$$

denote the net input to the kth unit at time t, for $k \in U$, with its output at the next time step being

$$y_k(t+1) = f_k(s_k(t)), \qquad (2.3)$$

where $f_k$ is the unit's squashing function. Thus the system of equations (2.2) and (2.3), where k ranges over U, constitutes the entire dynamics of the network, where the $z_k$ values are defined by equation (2.1). Note that the external input at time t does not influence the output of any unit until time t + 1.

We now derive an algorithm for training this network in what we will call a "temporal supervised learning" task, meaning that certain of the units' output values are to match specified target values at specified times. Let T(t) denote the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ that the output of the kth unit should match at time t. Then define a time-varying n-tuple e by

$$e_k(t) = \begin{cases} d_k(t) - y_k(t) & \text{if } k \in T(t) \\ 0 & \text{otherwise.} \end{cases} \qquad (2.4)$$
Note that this formulation allows for the possibility that target values are specified for different units at different times. The set of units considered to be "visible" can thus be time-varying. Now let

$$J(t) = \frac{1}{2} \sum_{k \in U} \left[ e_k(t) \right]^2 \qquad (2.5)$$

denote the overall network error at time t. For the moment, assume that the network is run starting at time $t_0$ up to some final time $t_1$. We take as the objective the minimization of the total error

$$J_{\mathrm{total}}(t_0, t_1) = \sum_{t = t_0 + 1}^{t_1} J(t) \qquad (2.6)$$

over this trajectory. We do this by a gradient descent procedure, adjusting W along the negative of $\nabla_W J_{\mathrm{total}}(t_0, t_1)$. Since the total error is just the sum of the errors at the individual time steps, one way to compute this gradient is by accumulating the values of $\nabla_W J(t)$ for each time step along the trajectory. The overall weight change for any particular weight $w_{ij}$ in the network can thus be written as

$$\Delta w_{ij} = \sum_{t = t_0 + 1}^{t_1} \Delta w_{ij}(t), \qquad (2.7)$$
where

$$\Delta w_{ij}(t) = -\eta\, \frac{\partial J(t)}{\partial w_{ij}} \qquad (2.8)$$

and $\eta$ is some fixed positive learning rate. Now

$$-\frac{\partial J(t)}{\partial w_{ij}} = \sum_{k \in U} e_k(t)\, \frac{\partial y_k(t)}{\partial w_{ij}}, \qquad (2.9)$$

where $\partial y_k(t)/\partial w_{ij}$ is easily computed by differentiating the network dynamics (equations (2.2) and (2.3)), yielding

$$\frac{\partial y_k(t+1)}{\partial w_{ij}} = f_k'(s_k(t)) \left[ \sum_{l \in U} w_{kl}\, \frac{\partial y_l(t)}{\partial w_{ij}} + \delta_{ik}\, z_j(t) \right], \qquad (2.10)$$

where $\delta_{ik}$ denotes the Kronecker delta. Because we assume that the initial state of the network has no functional dependence on the weights, we also have

$$\frac{\partial y_k(t_0)}{\partial w_{ij}} = 0. \qquad (2.11)$$

These equations hold for all $k \in U$, $i \in U$, and $j \in U \cup I$. We thus create a dynamical system with variables $\{p^k_{ij}\}$ for all $k \in U$, $i \in U$, and $j \in U \cup I$, and dynamics given by

$$p^k_{ij}(t+1) = f_k'(s_k(t)) \left[ \sum_{l \in U} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, z_j(t) \right], \qquad (2.12)$$

with initial conditions

$$p^k_{ij}(t_0) = 0, \qquad (2.13)$$

and it follows that

$$p^k_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}} \qquad (2.14)$$

for every time step t and all appropriate i, j, and k. The precise algorithm then consists of computing, at each time step t from $t_0$ to $t_1$, the quantities $p^k_{ij}(t)$, using equations (2.12) and (2.13), and then using the discrepancies $e_k(t)$ between the desired and actual outputs to compute the weight changes

$$\Delta w_{ij}(t) = \eta \sum_{k \in U} e_k(t)\, p^k_{ij}(t). \qquad (2.15)$$
The overall correction to be applied to each weight $w_{ij}$ in the net is then simply the sum of these individual $\Delta w_{ij}(t)$ values for each time step t along the trajectory. In the case when each unit in the network uses the logistic squashing function, we use

$$f_k'(s_k(t)) = y_k(t+1)\,[1 - y_k(t+1)] \qquad (2.16)$$

in equation (2.12).

2.2 Real-Time Recurrent Learning. The above algorithm was derived on the assumption that the weights remained fixed throughout the trajectory. In order to allow real-time training of behaviors of indefinite duration, however, it is useful to relax this assumption and actually make the weight changes while the network is running. This has the important advantage that no epoch boundaries need to be defined for training the network, leading to both a conceptual and an implementational simplification of the procedure. For this algorithm, we simply increment each weight $w_{ij}$ by the amount $\Delta w_{ij}(t)$ given by equation (2.15) at time step t, without accumulating the values elsewhere and making the weight changes at some later time.

A potential disadvantage of this real-time procedure is that it no longer follows the precise negative gradient of the total error along a trajectory. However, this is exactly analogous to the commonly used method of training a feedforward net by making weight changes after each pattern presentation rather than accumulating them elsewhere and then making the net change after the end of each complete cycle of pattern presentation. While the resulting algorithm is no longer guaranteed to follow the gradient of total error, the practical differences are often slight, with the two versions becoming more nearly identical as the learning rate is made smaller. The most severe potential consequence of this departure from true gradient-following behavior is that the observed trajectory may itself depend on the variation in the weights caused by the learning algorithm, which can be viewed as providing another source of negative feedback in the system. To avoid this, one wants the time scale of the weight changes to be much slower than the time scale of the network operation, meaning that the learning rate must be sufficiently small.

2.3 Teacher-Forced Real-Time Recurrent Learning. An interesting technique that is frequently used in temporal supervised learning tasks (Jordan 1986; Pineda 1988) is to replace the actual output $y_k(t)$ of a unit by the teacher signal $d_k(t)$ in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique "teacher forcing." The dynamics of a teacher-forced network during training are given by equations (2.2) and (2.3), as before, but where $z(t)$ is now defined by
$$z_k(t) = \begin{cases} x_k(t) & \text{if } k \in I \\ d_k(t) & \text{if } k \in T(t) \\ y_k(t) & \text{if } k \in U - T(t) \end{cases} \qquad (2.17)$$

rather than by equation (2.1). To derive a learning algorithm for this situation, we once again differentiate the dynamical equations with respect to $w_{ij}$. This time, however, we find that

$$\frac{\partial z_l(t)}{\partial w_{ij}} = \begin{cases} p^l_{ij}(t) & \text{if } l \in U - T(t) \\ 0 & \text{otherwise,} \end{cases} \qquad (2.18)$$

since $\partial d_l(t)/\partial w_{ij} = 0$ for all $l \in T(t)$ and for all t. For the teacher-forced version we thus alter our learning algorithm so that the dynamics of the $p^k_{ij}$ values are given by

$$p^k_{ij}(t+1) = f_k'(s_k(t)) \left[ \sum_{l \in U - T(t)} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, z_j(t) \right], \qquad (2.19)$$
rather than equation (2.12), with the same initial conditions as before. Note that equation (2.19) is the same as equation (2.12) if we treat the values of $p^l_{ij}(t)$ as zero for all $l \in T(t)$ when computing $p^k_{ij}(t+1)$. The teacher-forced version of the algorithm is thus essentially the same as the earlier one, with two simple alterations: (1) where specified, desired values are used in place of actual values to compute future activity in the network; and (2) the corresponding $p^k_{ij}$ values are set to zero after they have been used to compute the $\Delta w_{ij}$ values.
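As a concrete summary, here is a minimal Python sketch of one step of the real-time algorithm with optional teacher forcing, following equations (2.12) through (2.19) as reconstructed above. It is an illustration rather than the authors' code; the indexing convention (inputs first in z) and all names are assumptions.

```python
import numpy as np

def rtrl_step(W, p, y, x, targets, eta=0.1, teacher_force=False):
    """One step of real-time recurrent learning.
    W: n x (m+n) weights; p: n x n x (m+n) sensitivities p[k,i,j] = dy_k/dw_ij;
    y: n unit outputs; x: m external inputs (include a constant-1 bias line);
    targets: dict mapping each unit index k in T(t) to its target d_k(t)."""
    n, m = len(y), len(x)
    z = np.concatenate([x, y])                  # z(t), inputs first (eq. 2.1)
    if teacher_force:
        for k, d in targets.items():            # use teacher signals (eq. 2.17)
            z[m + k] = d
    s = W @ z                                   # net inputs (eq. 2.2)
    y_new = 1.0 / (1.0 + np.exp(-s))            # outputs at t+1 (eq. 2.3)
    fprime = y_new * (1.0 - y_new)              # logistic derivative (eq. 2.16)
    e = np.zeros(n)                             # errors e_k(t) (eq. 2.4)
    for k, d in targets.items():
        e[k] = d - y[k]
    dW = eta * np.einsum('k,kij->ij', e, p)     # weight change (eq. 2.15)
    W_rec = W[:, m:]                            # unit-to-unit weights w_kl
    p_used = p.copy()
    if teacher_force:
        for k in targets:                       # zero p for forced units (2.19)
            p_used[k] = 0.0
    p_new = fprime[:, None, None] * (
        np.einsum('kl,lij->kij', W_rec, p_used)
        + np.eye(n)[:, :, None] * z[None, None, :])   # eq. 2.12 / 2.19
    return W + dW, p_new, y_new
```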
2.4 Computational Features of the Real-Time Recurrent Learning Algorithms. It is useful to view the triply indexed set of quantities $p^k_{ij}$ as a matrix, each of whose rows corresponds to a weight in the network and each of whose columns corresponds to a unit in the network. Looking at the update equations it is not hard to see that, in general, we must keep track of the values $p^k_{ij}$ even for those k corresponding to units that never receive a teacher signal. Thus we must always have n columns in this matrix. However, if the weight $w_{ij}$ is not to be trained (as would happen, for example, if we constrain the network topology so that there is no connection from unit j to unit i), then it is not necessary to compute the value $p^k_{ij}$ for any $k \in U$. This means that this matrix need only have a row for each adaptable weight in the network, while having a column for each unit. Thus the minimal number of $p^k_{ij}$ values needed to store and update for a general network having n units and r adjustable weights is nr. For a fully interconnected network of n units and m external input lines in which each connection has one adaptable weight, there are $n^3 + mn^2$ such $p^k_{ij}$ values.
3 Simulation Experiments
We have tested these algorithms on several tasks, most of which can be characterized as requiring the network to learn to configure itself so that it stores important information computed from the input stream at earlier times to help determine the output at later times. In other words, the network is required to learn to represent useful internal state to accomplish these tasks. For all the tasks described here, the experiments were run with the networks initially configured with full interconnections among the units, with every input line connected to every unit, and with all weights having small randomly chosen values. The units to be trained were selected arbitrarily. More details on these simulations can be found in Williams and Zipser (1988; 1989).

3.1 Pipelined XOR. For this task, two nonbias input lines are used, each carrying a randomly selected bit on each time step. One unit in the network is trained to match a teacher signal at time t consisting of the XOR of the input values given to the network at time t - T, where the computation delay T is chosen in various experiments to be 2, 3, or 4 time steps. With 3 units and a delay of 2 time steps, the network learns to configure itself to be a standard 2-hidden-unit multilayer network for computing this function. For longer delays, more units are required, and the network generally configures itself to have more layers in order to match the required delay. Teacher forcing was not used for this task.

3.2 Simple Sequence Recognition. For this task, there are two units and m nonbias input lines, where m ≥ 2. Two of the input lines, called the a and b lines, serve a special purpose, with all others serving as distractors. At each time step exactly one input line carries a 1, with all others carrying a 0. The object is for a selected unit in the network to output a 1 immediately following the first occurrence of activity on the b line following activity on the a line, regardless of the intervening time span. At all other times, this unit should output a 0. Once such a b occurs, its corresponding a is considered to be "used up," so that the next time the unit should output a 1 is when a new a has been followed by its first "matching" b. Unlike the previous task, this cannot be performed by any feedforward network whose input comes from tapped delay lines on the input stream. A solution consisting essentially of a flip-flop and an AND gate is readily found by the unforced version of the algorithm.

3.3 Delayed Nonmatch to Sample. In this task, the network must remember a cued input pattern and then compare it to subsequent input patterns, outputting a 0 if they match and a 1 if they don't. We have
investigated a simple version of this task using a network with two input lines. One line represents the pattern and is set to 0 or 1 at random on each cycle. The other line is the cue that, when set to 1, indicates that the corresponding bit on the pattern line must be remembered and used for matching until the next occurrence of the cue. The cue bit is set randomly as well. This task has some elements in common with both of the previous tasks in that it involves an internal computation of the XOR of appropriate bits (requiring a computation delay) as well as having the requirement that the network retain indefinitely the value of the cued pattern. One of the interesting features of the solutions found by the unforced version of the algorithm is the nature of the internal representation of the cued pattern. Sometimes a single unit is recruited to act as an appropriate flip-flop, with the other units performing the required logic; at other times a dynamic distributed representation is developed in which no static pattern indicates the stored bit.

3.4 Learning to Be a Turing Machine. The most elaborate of the tasks we have studied is that of learning to mimic the finite state controller of a Turing machine deciding whether a tape marked with an arbitrary-length string of left and right parentheses consists entirely of sets of balanced parentheses. The network observes the actions of the finite state controller but is not allowed to observe its states. Networks with 15 units always learned the task. The minimum-size network to learn the task had 12 units.

3.5 Learning to Oscillate. Three simple network oscillation tasks that we have studied are (1) training a single unit to produce 010101...; (2) training a 2-unit net so that one of the units produces 00110011...; and (3) training a 2-unit net so that one of the units produces approximately sinusoidal oscillation of period on the order of 25 time steps in spite of the nonlinearity of the units involved. We have used both versions of the algorithm on these oscillation tasks, with and without teacher forcing, and we have found that only the version with teacher forcing is capable of solving these problems in general. The reason for this appears to be that in order to produce oscillation in a net that initially manifests settling behavior (because of the initial small weight values), the weights must be adjusted across a bifurcation boundary, but the gradient itself cannot yield the necessary information because it is zero or very close to zero. However, if one is free to adjust both weights and initial conditions, at least in some cases this problem disappears. Something like this appears to be at the heart of the success of the use of teacher forcing: By using desired values in the net, one is helping to control the initial conditions for the subsequent dynamics. Pineda (1988) has observed a similar need for teacher forcing when
attempting to add new stable points in an associative memory rather than just moving existing ones around.
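As a usage illustration, the rtrl_step sketch given after section 2 can be driven through the simplest oscillation task, training one unit to produce 010101... with teacher forcing. This is a hypothetical setup, with parameters chosen for illustration only.

```python
import numpy as np

n, m = 1, 1                                   # one unit, one constant bias line
W = np.random.uniform(-0.1, 0.1, (n, m + n))  # small random initial weights
p = np.zeros((n, n, m + n))                   # sensitivities start at zero (2.13)
y = np.array([0.5])
x = np.array([1.0])                           # bias input line, always 1
for t in range(5000):
    target = {0: float(t % 2)}                # teacher signal 0, 1, 0, 1, ...
    W, p, y = rtrl_step(W, p, y, x, target, eta=0.5, teacher_force=True)
```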
4 Discussion

Our primary goal here has been to derive a learning algorithm to train completely recurrent, continually updated networks to learn temporal tasks. Our emphasis in simulation studies has been on using uniform starting configurations that contain no a priori information about the temporal nature of the task. In most cases we have used statistically derived training sets that have not been extensively optimized to promote learning. The results of the simulation experiments described here demonstrate that the algorithm has sufficient generality and power to work under these conditions.

The algorithm we have described here is nonlocal in the sense that, for learning, each weight must have access to both the complete recurrent weight matrix W and the whole error vector e. This makes it unlikely that this algorithm, in its current form, can serve as the basis for learning in actual neurophysiological networks. The algorithm is, however, inherently quite parallel, so that computation speed would benefit greatly from parallel hardware.

The solutions found by the algorithm are often dauntingly obscure, particularly for complex tasks involving internal state. This observation is already familiar in work with feedforward networks. This obscurity has often limited our ability to analyze the solutions in sufficient detail. In the simpler cases, where we can discern what is going on, an interesting kind of distributed representation can be observed. Rather than only remembering a pattern in a static local or distributed group of units, the networks sometimes incorporate the data that must be remembered into their functioning in such a way that there is no static pattern that represents it. This gives rise to dynamic internal representations that are, in a sense, distributed in both space and time.

Acknowledgments

We wish to thank Jonathan Bachrach for sharing with us his insights into the issue of training recurrent networks. It was his work that first brought to our attention the possibility of on-line computation of the error gradient, and we hereby acknowledge his important contribution to our own development of these ideas. This research was supported by grant IRI-8703564 from the National Science Foundation to the first author and by contract N00014-87-K-0671 from the Office of Naval Research, grant 86-0062 from the Air Force Office of Scientific Research, and grants from the System Development Foundation to the second author.
References

Almeida, L.B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. Proceedings of the IEEE First International Conference on Neural Networks, II, 609-618.
Bachrach, J. 1988. Learning to represent state. Unpublished master's thesis, University of Massachusetts, Amherst.
Elman, J.L. 1988. Finding structure in time. CRL Technical Report 8801. La Jolla: University of California, San Diego, Center for Research in Language.
Gallant, S.I. and D. King. 1988. Experiments with sequential associative memories. Proceedings of the Tenth Annual Conference of the Cognitive Science Society, 40-47.
Hopfield, J.J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Jordan, M.I. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 531-546.
Lapedes, A. and R. Farber. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica D, 22, 247-259.
McBride, L.E., Jr. and K.S. Narendra. 1965. Optimization of time-varying systems. IEEE Transactions on Automatic Control, 10, 289-294.
Mozer, M.C. 1988. A focused backpropagation algorithm for temporal pattern recognition. Technical Report, University of Toronto, Departments of Psychology and Computer Science.
Pearlmutter, B.A. 1988. Learning state space trajectories in recurrent neural networks: A preliminary report. Technical Report AIP-54. Pittsburgh: Carnegie Mellon University, Department of Computer Science.
Pearlmutter, B.A. 1989. Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263-269.
Pineda, F.J. 1988. Dynamics and architecture for neural computation. Journal of Complexity, 4, 216-245.
Robinson, A.J. and F. Fallside. 1987. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1. Cambridge, England: Cambridge University Engineering Department.
Rohwer, R. and B. Forrest. 1987. Training time-dependence in neural networks. Proceedings of the IEEE First International Conference on Neural Networks, II, 701-708.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, eds. D.E. Rumelhart, J.L. McClelland, and the PDP Research Group. Cambridge: MIT Press/Bradford Books.
Servan-Schreiber, D., A. Cleeremans, and J.L. McClelland. 1988. Encoding sequential structure in simple recurrent networks. Technical Report CMU-CS-
88-183. Pittsburgh: Carnegie Mellon University, Department of Computer Science.
Stornetta, W.S., T. Hogg, and B.A. Huberman. 1987. A dynamical approach to temporal pattern processing. Proceedings of the IEEE Conference on Neural Information Processing Systems, 750-759.
Williams, R.J. and D. Zipser. 1988. A learning algorithm for continually running fully recurrent neural networks. ICS Technical Report 8805. La Jolla: University of California, San Diego, Institute for Cognitive Science.
Williams, R.J. and D. Zipser. 1989. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, to appear.
Received 14 October 1988; accepted 14 March 1989.
Communicated by Demetri Psaltis
Fast Learning in Networks of Locally-Tuned Processing Units

John Moody
Christian J. Darken
Yale Computer Science, P.O. Box 2158, New Haven, CT 06520, USA
We propose a network architecture which uses a single internal layer of locally-tuned processing units to learn both classification tasks and real-valued function approximations (Moody and Darken 1988). We consider training such networks in a completely supervised manner, but abandon this approach in favor of a more computationally efficient hybrid learning method which combines self-organized and supervised learning. Our networks learn faster than backpropagation for two reasons: the local representations ensure that only a few units respond to any given input, thus reducing computational overhead, and the hybrid learning rules are linear rather than nonlinear, thus leading to faster convergence. Unlike many existing methods for data analysis, our network architecture and learning rules are truly adaptive and are thus appropriate for real-time use.

Neurons with response characteristics which are "locally-tuned" or "selective" for some range of the input variables are found in many parts of nervous systems. For example, the cochlear stereocilia cells have locally-tuned responses to frequency, while cells in somatosensory cortex respond selectively to stimulation from localized regions of the body surface. The orientation-selective cells in visual cortex respond selectively to stimulation which is both local in retinal position and local in angle of object orientation, while cells in the nucleus laminaris of the barn owl are tuned to specific interaural time delays. Populations of locally-tuned cells are typically arranged in cortical maps in which the values of the variables to which the cells respond vary with position in the map.

The locally-tuned response of the stereocilia cells is a consequence of their biophysical properties. In the three other examples, however, the locality of response is a consequence of the network architecture of each of the various systems. In these cases, the response is local not in specific pre-synaptic activities, but rather in variables which have meaning only at a systems level. This locality is a computational property of the system and should not be confused with the biophysical response properties of
Neural Computation 1, 281-294 (1989) © 1989 Massachusetts Institute of Technology
cells, which are usually treated in the abstract as a thresholding of a weighted sum of inputs.

In this letter, we present a simple, idealized network model based upon abstract processing units with locally-tuned response functions. We consider training such a network using purely supervised learning, but conclude that this offers no significant advantages over backpropagation. We then show that such a network can be efficiently trained to perform computationally interesting tasks via a combination of linear supervised and linear self-organizing techniques. The combination of locality of representation and linearity of learning offers tremendous speed advantages relative to backpropagation. As will become apparent, our model is best suited for learning to approximate continuous or piecewise continuous real-valued mappings $f: \mathbb{R}^n \to \mathbb{R}^m$ where n is sufficiently small. This class of functions includes classification problems as a special case. The model is not well suited to learning logical mappings such as parity, because such mappings are not usually piecewise continuous.

It should be noted at the outset that local methods for density estimation (for example, Parzen windows), classification, interpolation, and approximation have been widely used for many years and are known to have attractive computational properties, including in some cases straightforward parallelizability and rapid speed of solution. However, most of the local methods which have been studied are intrinsically "non-neural," meaning that they cannot be easily implemented in an adaptive network of fixed architecture and size.

A network of M locally-tuned units (Moody and Darken 1988) (see figure 1a) has an overall response function:

$$f(\vec{x}) = \sum_{\alpha=1}^{M} A^\alpha R^\alpha(\vec{x}) \qquad (1.1)$$

$$R^\alpha(\vec{x}) = R(\|\vec{x} - \vec{z}^\alpha\| / \sigma^\alpha). \qquad (1.2)$$

Here, $\vec{x}$ is a real-valued vector in the input space, $R^\alpha$ is the response function of the α-th locally-tuned unit, R is a radially-symmetric function with a single maximum at the origin which drops off rapidly to zero at large radii, $\vec{z}^\alpha$ and $\sigma^\alpha$ are the center and width in the input space of the α-th unit, and $A^\alpha$ is the weight or amplitude associated with each unit. For this work, we have chosen gaussian response functions with unit normalization:

$$R^\alpha(\vec{x}) = e^{-\|\vec{x} - \vec{z}^\alpha\|^2 / (\sigma^\alpha)^2}. \qquad (1.3)$$
One can think of the functional representation in equation (1.1) as a decomposition into basis functions which are not orthonormal, not uniformly distributed over the input space, and do not have uniform width.
Figure 1: A network of locally-tuned processing units (a) and a three layer perceptron (b). Unit response functions are depicted graphically. Both networks have linear inputs and outputs. The network of locally-tuned units has a single internal layer, while the perceptron has two internal layers of sigmoidal units.

With the addition of lateral connections between the processing units, the network can dynamically produce the normalized response function (Moody 1989b):

$$f(\vec{x}) = \frac{\sum_{\alpha=1}^{M} A^\alpha R^\alpha(\vec{x})}{\sum_{\alpha=1}^{M} R^\alpha(\vec{x})}. \qquad (1.4)$$
This is essentially a weighted average or interpolation between weights or learned function values $A^\alpha$ of nearby processing units. The locality of the unit response functions is important for attaining fast simulation speeds on serial machines. For any given input, only the small fraction of processing units with centers very close (in the input space) to the input will respond with activations which differ significantly from zero. Thus, only those with centers close enough to the input need to be evaluated and trained. We efficiently identify these units by partitioning the input space with an adaptive grid (Omohundro 1987) and doing a short search. The functional forms given by equations (1.1) to (1.4) are special cases of the kernel representations developed by statisticians. A number of
powerful estimation and regression techniques exist for finding the best values of the parameters $\{\vec{z}^\alpha, \sigma^\alpha, A^\alpha\}$. However, many of these methods are off-line, global, and are not easily adapted to real-time use. We consider two styles of learning in our networks, a fully supervised method and a hybrid method which combines supervised with self-organizing methods. We consider the fully supervised method first, but abandon it in favor of the hybrid method. Both the supervised and hybrid methods are easily implemented adaptively. The fully supervised method uses an error measure E defined at the output to vary all parameters $\{\vec{z}^\alpha, \sigma^\alpha, A^\alpha\}$ in a network of form (1.1). The total squared error for an entire training set is:
$$E = \frac{1}{2} \sum_{i=1}^{N} \left( f^*(\vec{x}_i) - f(\vec{x}_i) \right)^2, \qquad (1.5)$$
where $\vec{x}_i$ is the i-th training pattern, f is the total network output, and $f^*$ is the desired result. The supervised method yields high precision results, but places no architectural restrictions on the network parameters. In particular, the widths $\sigma^\alpha$ are not restricted to remain small, so a locally-tuned representation is not guaranteed. Thus, the supervised learning method does not obtain the computational advantages of locality. Furthermore, the supervised method casts learning as a non-linear optimization problem; this results in both slow convergence and unpredictable solutions (locally-tuned units are sometimes "squeezed out" of the region of the input space which contains data). It is interesting, however, to compare the learning performance of a supervised network of gaussian units with a backpropagation network of comparable size. Following Lapedes and Farber (1987), we consider a simple quadratic approximation problem, predicting the evolution of the logistic map: $x[t+1] = 4x[t](1 - x[t])$. The networks of type (1.1) have 4, 5, or 6 internal gaussian units. The backpropagation network has a single internal layer with 5 sigmoidal units, a linear output unit, and an additional linear connection between the input and the output. All networks were trained using the conjugate gradient optimization procedure for 200 iterations (line minimizations) on a data set of 1000 points and were tested on a data set of 500 points. The number of modifiable network parameters, training set error, test set error, and total computation time in Sun 3/50 CPU seconds is shown in the following table:

Network                            #Par   Time       Train Error   Test Error
5 Sigmoidal Units w/ Linear Term   17     3802 sec   0.58%         0.59%
4 Gaussian Units                   12     3137 sec   0.62%         0.64%
5 Gaussian Units                   15     4278 sec   0.38%         0.41%
6 Gaussian Units                   18     4945 sec   0.26%         0.27%
Figure 2: Supervised learning solution found by a network of 5 gaussian units (dashed lines) for the quadratic logistic mapping. The envelope of the gaussians (solid line) for the region in which training data is present (the interval [0, 1]) is bracketed by the rectangle. Notice that two of the gaussians have inverted and flattened out to provide a near-constant offset.

Note that the networks with 5 and 6 gaussian units achieve better performance than the more traditional sigmoidal network and do so with a comparable number of network parameters. This is probably because the gaussian shape is better suited to approximating the logistic map. Unfortunately, the gaussian units do not learn appreciably faster than the backpropagation network, and there is no reason why they should have. Furthermore, the gaussians assumed large widths, thus losing the locality intended in equation (1.1). A sample solution is shown in figure 2; see (Lapedes and Farber 1987) for a sample solution found by a sigmoidal net.
Since the supervised learning method with gaussian units seems to offer no dramatic advantages over a standard backpropagation network, we now consider a hybrid learning method which yields linear learning rules and truly local representations. The hybrid learning process works as follows. First, the processing unit centers and widths are determined in a bottom-up or self-organizing manner. Second, the amplitudes are found in a top-down manner using the supervised LMS rule. The bottom-up component serves to allocate network resources in a meaningful way by placing the unit centers in only those regions of the input space where data is present. It also dramatically reduces the amount of computational time required by supervised learning, since only the output weights (unit amplitudes) must be calculated using an error signal. Although our overall network response functions in equations (1.1) and (1.4) are nonlinear, each of our learning rules is based on minimizing a purely quadratic objective function. This reduces the learning procedure to a set of simple linear update rules.

The standard k-means clustering algorithm (Lloyd 1957; MacQueen 1967) is used to find a set of k processing unit centers which represent a local minimum of the total squared euclidean distances E between the N exemplars (training vectors) $\vec{x}_i$ and the nearest of the k centers $\vec{z}^\alpha$:

$$E = \sum_{\alpha=1}^{k} \sum_{i=1}^{N} M_{\alpha i}\, \|\vec{x}_i - \vec{z}^\alpha\|^2 \qquad (1.6)$$
Here, $M_{\alpha i}$ is the cluster membership function, which is a $k \times N$ matrix of 0's and 1's with exactly one 1 per column which identifies the processing unit to which a given exemplar belongs. The (local) minimization of E can be formulated as an iterative process on a complete training set (Lloyd 1957) or as a real-time, adaptive process (MacQueen 1967). A variant of adaptive k-means is presented below in equation (1.7).

The processing unit widths are determined using various "P nearest-neighbor" heuristics. These heuristics vary the widths in order to achieve a certain amount of response overlap between each unit and its neighbors so that they form a smooth and contiguous interpolation over those regions of the input space which they represent. The heuristics can be formulated in an adaptive fashion and can be shown to minimize various objective functions with respect to the $\sigma^\alpha$'s or $(\sigma^\alpha)^2$'s and are thereby stable. Some of the possible objective functions are quadratic, leading to simple linear update rules. A particularly simple example is the "global first nearest-neighbor" heuristic. It uses a uniform average width $\sigma = \langle \Delta x_{\alpha\beta} \rangle$ for all units, where $\Delta x_{\alpha\beta}$ is the euclidean distance in the input space between each unit α and its nearest-neighbor β, and $\langle \rangle$ indicates a global average over all such pairs. Other heuristics based on purely local computations yield individually-tuned widths $\sigma^\alpha$. The response function amplitudes (output weights $A^\alpha$ in equations (1.1) or (1.4)) are varied to minimize either the total error in equation
(1.5) or, in the case of real-time implementations, an instantaneous estimate of the error. We have found that convergence occurs very quickly. This is possible because the self-organized learning which precedes the supervised learning has already done most of the work.
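The following Python fragment is a minimal sketch of the hybrid procedure as we read it from the description above: batch k-means for the centers, the global first nearest-neighbor heuristic for a single uniform width, and LMS for the amplitudes of equation (1.1). It is illustrative rather than the authors' implementation, and all names and parameter values are assumptions.

```python
import numpy as np

def train_hybrid(X, F, M, lms_rate=0.05, passes=20, seed=0):
    """X: N x n inputs; F: length-N real targets; M: number of gaussian units."""
    rng = np.random.default_rng(seed)
    # 1. Centers: batch k-means (Lloyd 1957), initialized on random exemplars.
    Z = X[rng.choice(len(X), size=M, replace=False)].copy()
    for _ in range(passes):
        nearest = np.argmin(((X[:, None, :] - Z[None]) ** 2).sum(-1), axis=1)
        for a in range(M):
            if np.any(nearest == a):
                Z[a] = X[nearest == a].mean(axis=0)
    # 2. Width: "global first nearest-neighbor" heuristic, one uniform sigma.
    d = np.sqrt(((Z[:, None, :] - Z[None]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    sigma = d.min(axis=1).mean()
    # 3. Amplitudes: LMS on the now-linear output weights (equation 1.1).
    A = np.zeros(M)
    for _ in range(passes):
        for xi, fi in zip(X, F):
            R = np.exp(-((xi - Z) ** 2).sum(-1) / sigma ** 2)   # eq. 1.3
            A += lms_rate * (fi - A @ R) * R                    # LMS update
    return Z, sigma, A
```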
To demonstrate the practical use of the hybrid learning method, we consider two representative test problems: the classification of phonemes and the prediction of a chaotic time series. The phoneme classification problem is to classify 10 distinct vowel sounds on the basis of their first and second formant frequencies. The data (provided by Huang and Lippmann (1988)) consists of a training set with 338 phoneme exemplars and a test set with 333 exemplars. Using standard (off-line) k-means, an overlap parameter P = 2, and off-line LMS, the classification results for the hybrid learning system are:

# Gaussian Units        20      40      60      80      100
% Error on Test Set     26.7%   24.9%   21.3%   19.5%   18.0%
These results are comparable to those found by Huang and Lippmann using a variety of techniques. For convenience, we reproduce their results here:

Classification Method                % Error on Test Set   Number of Training Tokens
K Nearest Neighbors (Non-Adaptive)   18.0%                 338
Gaussian Classifier (Non-Adaptive)   20.4%                 338
2-Layer Back Prop (Adaptive)         19.8%                 50,000
Feature Map, 100 Nodes (Adaptive)    22.8%                 10,000 (Feature Map Nodes), 50 (Output Nodes)
The most similar model to ours is the feature map classifier (Kohonen 1988). Our model achieves better classification results given the same number of nodes (in this case 100), because the gaussian response functions yield smooth interpolations of the classification regions, rather than sharp discontinuities from one cluster region to the next. (Furthermore, the feature map classifier makes an intrinsic assumption about the underlying dimensionality of the problem, typically producing either a one- or two-dimensional map. A two-dimensional map is appropriate for this phoneme classification problem, but is unlikely to be the optimal choice for arbitrary domains. In contrast, the k-means algorithm discovers the intrinsic dimensionality of the input data.) Both the feature map and the
locally-tuned networks learn substantially faster than backpropagation nets (typically factors of hundreds in CPU time).

Although our results above are for off-line learning, the adaptive k-means clustering algorithm is actually simpler than the off-line version and yields solutions of similar quality. Furthermore, adaptive clustering methods are being implemented in special-purpose hardware, and are likely to be very important for real-time learning systems. Figure 3 shows an example of real-time clustering using the phoneme training exemplars. The initial cluster centers are randomly chosen exemplars. At each time step, a random exemplar $\vec{x}$ is chosen and the nearest cluster center $\vec{z}^\alpha$ is moved by an amount:

$$\Delta \vec{z}^\alpha = \eta\, (\vec{x} - \vec{z}^\alpha), \qquad (1.7)$$
where $\eta = 0.03$ is the learning rate. The resulting trajectories for 20 cluster centers are shown for a run of 3000 exemplar samples. The final positions are indicated by triangles, and the circles represent the widths of the gaussians as determined by the P = 2 nearest-neighbors heuristic.

As a second representative test problem, we follow Farmer and Sidorowich (1987) and consider the prediction of a chaotic time series. As it is usually formulated, this problem requires finding a real-valued mapping $f: \mathbb{R}^n \to \mathbb{R}$ which takes a sequence of n recent samples of a time series and predicts the value of the time series at a future moment. It is assumed that the underlying process which generates the time series is unknown. We shall compare our network's learning and generalizing capabilities to a three-layer perceptron studied by Lapedes and Farber (1987) (see figure 1b). The particular time series we use for comparison results from integrating the Mackey-Glass differential delay equation:
$$\frac{dx(t)}{dt} = \frac{a\, x(t - \tau)}{1 + x(t - \tau)^{10}} - b\, x(t).$$

Figure 4 shows the resulting time series for $\tau = 17$, $a = 0.2$, and $b = 0.1$; note that it is quasi-periodic since no two cycles are the same. The characteristic time of the series, given by the inverse of the mean of the power spectrum, is $t_{char} \approx 50$. Note that classical techniques like global linear autoregression or Gabor-Volterra-Wiener polynomial expansions typically do no better than chance at predicting such a time series beyond $t_{char}$.
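For reference, a series of this kind can be generated with a simple first-order integration of the delay equation. The following sketch assumes Euler steps of size 0.1 and an arbitrary constant initial history; these are our assumptions, not details taken from the paper.

```python
import numpy as np

def mackey_glass(n_points, tau=17.0, a=0.2, b=0.1, dt=0.1, x0=1.2):
    """Euler integration; returns n_points samples at integer timesteps."""
    lag = int(round(tau / dt))                  # delay in integration steps
    per = int(round(1.0 / dt))                  # integration steps per timestep
    x = np.full(lag + n_points * per + 1, x0)   # constant initial history
    for k in range(lag, len(x) - 1):
        x_tau = x[k - lag]
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[k])
    return x[lag::per][:n_points]

series = mackey_glass(1000)                     # cf. figure 4
```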
For our numerical comparison, both networks (Fig. 1) have four real-valued inputs $(x[t], x[t-\Delta], x[t-2\Delta], x[t-3\Delta])$ and one real-valued output $x[t+T]$, with $\Delta = 6$ and $T = 85 > t_{char}$. The network of locally-tuned units (Fig. 1a) has between 100 and 10,000 internal units arranged in a single layer, while the backpropagation network (Fig. 1b) has two internal layers each containing 20 sigmoidal units. The backpropagation network thus has 541 adjustable parameters (weights and thresholds) total.
Figure 5 contrasts the prediction accuracy E (Normalized Prediction Error) versus number of internal units for three versions of our algorithm to the backpropagation benchmark (A) of Lapedes and Farber (1987). The three versions of the learning algorithm are: 1. Nearest neighbor prediction. Here, the nearest data point in the training set is used as a predictor. This behavior is actually a special case for the network of equation 1.4 where each input/output training pair (?%, fi} defines a processing unit {.", f"} of infinitely narrow width (cr" + 0). 2. Adaptive processing units with one unit per data point. Here, the amplitudes are determined by LMS, the widths by the global first
Figure 3: Adaptive k-means clustering applied to phoneme data. (Abscissa: frequency of first formant.)
Figure 4: One thousand successive integer timesteps for the Mackey-Glass chaotic time series with delay parameter τ = 17. (Abscissa: time, 0 to 1000.)
3. Self-organizing, adaptive processing units. Similar to 2, but the training set has ten times more exemplars than the network has processing units and the processing unit centers are found using k-means clustering.

The backpropagation benchmark used a training set with 500 exemplars. For all methods, prediction accuracies were measured on a 500-member test set. Note that versions 1 and 2 require storage of past time series data; version 2 assigns a processing unit to each data point. Hence, neither of these methods is appropriate for real-time signal processing with fixed memory. However, version 3 is fully adaptive in that a fixed set of network parameters can be varied in response to new data in real time and does not, in principle, require storage of previous data.
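The following sketch shows one way versions 2 and 3 could be realized, assuming the network response of equation 1.4 is an amplitude-weighted sum of gaussian units; the LMS step size, epoch count, and the absence of response normalization are illustrative assumptions, not details from the paper.

```python
import numpy as np

class LocallyTunedNet:
    """Single layer of gaussian units; only the output amplitudes are trained."""
    def __init__(self, centers, sigmas):
        self.centers = centers   # training vectors (version 2) or k-means
                                 # centers (version 3)
        self.sigmas = sigmas     # widths from the nearest-neighbor heuristic
        self.amps = np.zeros(len(centers))

    def responses(self, x):
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / self.sigmas ** 2)

    def predict(self, x):
        return self.amps @ self.responses(x)

    def train_lms(self, X, y, lr=0.05, epochs=20):
        """LMS: gradient steps on squared prediction error, amplitudes only."""
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                r = self.responses(xi)
                err = yi - self.amps @ r
                self.amps += lr * err * r
        return self

# Version 2 usage: the centers are the training vectors themselves.
# net = LocallyTunedNet(centers=X, sigmas=widths).train_lms(X, y)
```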
In order to make a fair comparison to the backpropagation benchmark, however, we optimized our networks on a fixed training set, rather than measure time-averaged, real-time performance. As is apparent from figure 5 (note the horizontal reference line), method 2 achieves a prediction accuracy equivalent to backpropagation with about 7 times as much training data, while method 3 requires about 27 times as much data to reach equivalent accuracy.
Figure 5: Comparison of prediction accuracy vs. number of internal units for four methods: (1) first nearest neighbor, (2) adaptive units (one unit per training vector), (3) self-organizing units (ten training vectors per processing unit), and (A) backpropagation. The methods are described in the text. For backpropagation, the abscissa indicates the number of training vectors. The horizontal line associated with (A) is provided for visual reference and is not intended to suggest a scaling law. In fact, the scaling law is not known. (Ordinate: log10 normalized prediction error, −1.6 to −0.6; abscissa: log10 number of units, 2.0 to 4.0; curves labeled "Gaussian Units" and "Back Prop Benchmark.")
These differences in training data requirements stem from the fact that a backpropagation network is fully supervised, learns a global fit to the function, and is thus a more powerful generalizer, while the network of locally-tuned units learns only local representations. However, the larger data requirements of the networks of locally-tuned processing units are outweighed by dramatic improvements in computational efficiency. The following table shows E versus computational time, measured in Sun 3/60 CPU seconds, for each of the three methods described above in the 1000-internal-units case.

Locally-Tuned     Normalized          Computation Time
Network Model     Prediction Error    (Sun 3/60 CPU secs)
Version 1         17.1%               67 secs (0.019 hours)
Version 2         9.9%                229 secs (0.064 hours)
Version 3         6.1%                1858 secs (0.51 hours)
The Lapedes and Farber backpropagation results required a few minutes to a fraction of an hour of Cray X/MP time running at an impressive 90 MFlops (Lapedes 1988). Their implementation used the conjugate gradient minimization technique and achieved approximately 5% Normalized Error. Our simulations on the Sun 3/60 probably achieved no more than 90 KFlops (the LINPACK benchmark). Taking these differences into account, the networks of locally-tuned processing units learned hundreds to thousands of times faster than the backpropagation network. Our own experiences with backpropagation applied to this problem are consistent with this difference. Using a variety of methods including on-line learning, off-line learning, gradient descent, and conjugate gradient, we have been unable to get a backpropagation network to come close to matching the prediction accuracy of locally-tuned networks in a few hours of Sun 3 time.

Although our discussion has focused on algorithms which can be implemented as real-time adaptive systems, such as backpropagation and the networks of locally-tuned units we have presented, a number of off-line algorithms for multidimensional function modeling achieve excellent performance both in terms of efficiency and precision. These include methods of exact interpolation based on rational radially-symmetric basis functions (Powell 1985). In spite of the apparent locality of their functional representations, these methods are actually global, since the basis functions do not drop off exponentially fast. Furthermore, they require a separate basis function for each data point and require computing time which scales as N^3 in the size of the data set. Finally, these algorithms are not adaptive.
Approximation methods based on local linear and local quadratic fitting have been championed recently by Farmer and Sidorowich (1987). These algorithms utilize local representations in the input space, but are not appropriate for real-time use since they require multi-dimensional tree data structures which are cumbersome to modify on the fly and would be extremely difficult to implement in special purpose hardware. The off-line methods require explicit storage of past data and assume that all such data is retained. In contrast, the neural net approach requires storage of only a fixed set of tunable network parameters; this number is independent of the total amount of data observed over time. An alternative adaptive approach based upon hashing and a hierarchy of locally-tuned representations has been explored by Moody (1989a) with very favorable results.

Note Added in Proof

After this manuscript was accepted for publication, we learned that Hanson and Burr (1987) had suggested using a single layer of locally-tuned units in place of two layers of sigmoidal or threshold units. This was also suggested independently by Lapedes and Farber (1988). These authors, along with Lippmann (1987), describe constructions whereby localized "bumps" or convex regions can be built from two layers of sigmoidal or threshold units, respectively.

Acknowledgments

We gratefully acknowledge helpful comments from and discussions with Andrew Barron, Doyne Farmer, Walter Heiligenberg, Alan Lapedes, Y.C. Lee, Richard Lippmann, Bartlett Mel, Demetri Psaltis, Terry Sejnowski, John Shewchuk, John Sidorowich, and Tony Zador. We furthermore wish to thank Bill Huang and Richard Lippmann for kindly providing us with the phoneme data. This research was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C-0025, and a Purdue Army subcontract.

References

Farmer, J.D. and J.J. Sidorowich. 1987. Predicting chaotic time series. Physical Review Letters, 59, 845.
Hanson, S.J. and D.J. Burr. 1987. Knowledge representation in connectionist networks. Bellcore Technical Report.
Huang, W.Y. and R.P. Lippmann. 1988. Neural net and traditional classifiers. In: Neural Information Processing Systems, ed. D.Z. Anderson, 387-396. American Institute of Physics.
Kohonen, T. 1988. Self-Organization and Associative Memory. Berlin: Springer-Verlag.
Lapedes, A.S. 1988. Personal communication.
Lapedes, A.S. and R. Farber. 1987. Nonlinear signal processing using neural networks: Prediction and system modeling. Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico.
Lapedes, A.S. and R. Farber. 1988. How neural nets work. In: Neural Information Processing Systems, ed. D.Z. Anderson, 442-456. American Institute of Physics.
Lippmann, R.P. 1987. An introduction to computing with neural nets. IEEE ASSP Magazine, 4:4.
Lloyd, S.P. 1957. Least squares quantization in PCM. Bell Laboratories Internal Technical Report. Reprinted in IEEE Transactions on Information Theory, IT-28:2, 1982.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics, and Probability, eds. L.M. LeCam and J. Neyman. Berkeley: U. California Press, 281.
Moody, J. 1989a. Fast learning in multi-resolution hierarchies. In: Advances in Neural Information Processing Systems, ed. D. Touretzky. Morgan-Kaufmann Publishers.
Moody, J. 1989b. In preparation.
Moody, J. and C. Darken. 1988. Learning with localized receptive fields. In: Proceedings of the 1988 Connectionist Models Summer School, eds. Touretzky, Hinton, and Sejnowski. Morgan-Kaufmann Publishers.
Omohundro, S. 1987. Efficient algorithms with neural network behavior. Complex Systems, 1, 273.
Powell, M.J.D. 1985. Radial basis functions for multivariate interpolation: A review. Technical Report DAMTP 1985/NA12, Department of Applied Mathematics and Theoretical Physics, Cambridge University, Silver Street, Cambridge CB3 9EW, England.
Received 10 October 1988; accepted 27 October 1988.
REVIEW
Unsupervised Learning
What use can the brain make of the massive flow of sensory information that occurs without any associated rewards or punishments? This question is reviewed in the light of connectionist models of unsupervised learning and some older ideas, namely the cognitive maps and working models of Tolman and Craik, and the idea that redundancy is important for understanding perception (Attneave 1954), the physiology of sensory pathways (Barlow 1959), and pattern recognition (Watanabe 1960). It is argued that (1) the redundancy of sensory messages provides the knowledge incorporated in the maps or models; (2) some of this knowledge can be obtained by observations of mean, variance, and covariance of sensory messages, and perhaps also by a method called "minimum entropy coding"; (3) such knowledge may be incorporated in a model of "what usually happens" with which incoming messages are automatically compared, enabling unexpected discrepancies to be immediately identified; (4) knowledge of the sort incorporated into such a filter is a necessary prerequisite of ordinary learning, and a representation whose elements are independent makes it possible to form associations with logical functions of the elements, not just with the elements themselves.

1 Introduction

Much of the information that pours into our brains throughout the waking day arrives without any obvious relationship to reinforcement, and is unaccompanied by any other form of deliberate instruction. What use can be made of this impressive flow of information? In this article I hope, first, to show that it is the redundancy contained in these messages that enables the brain to build up its "cognitive maps" or "working models" of the world around it; second, to suggest initial steps by which these might be formed; and third, to propose a structure for the maps or models that automatically ensures their access and use in everyday perception, and represents percepts in a form suitable for detecting the new associations involved in ordinary learning and conditioning. Self-organization has been a major preoccupation of those interested in neural networks since the early days, and the volume edited by Yovits
et al. (1962) gives an overview of some of this work; it is interesting to compare this with the systematic and much more developed treatment in the book on the subject by Kohonen (1984). One goal has been to explain topographic projections of sensory pathways and the occurrence of feature-selective neurons without depending completely on genetic specification (see especially von der Malsburg 1973; Nass and Cooper 1975; Cooper et al. 1979; Perez et al. 1975; Fukushima 1975, 1980; Swindale 1980, 1982; Barrow 1987). Another goal has been to explain the automatic separation and classification of clustered sensory stimuli (Rosenblatt 1959, 1962; Uttley 1958, 1979). The informon (Uttley 1970), for example, separated frequently occurring patterns from among a background of randomly associated elements, and it mimicked many aspects of the model of Rescorla and Wagner (1972) for conditioning and learning (Uttley 1975). Grossberg (1980) mainly emphasized the interactions between supervised and unsupervised learning. The adaptive critic in the pole-balancing scheme described by Barto et al. (1983) improved learning performance by observing the pattern of recurring correction-movements and their outcomes. Self-organization may be mediated by the competitive learning analyzed by Rumelhart and Zipser (1985), which has been applied to the generation of feature specificity by Barrow (1987) and to a hippocampal model by Rolls (1989). The hierarchical mapping scheme of Linsker (1986, 1988) shows spontaneous self-organization, and his infomax principle develops further some ideas related to those Uttley (1979) proposed. Linsker's networks can produce an organization reminiscent of the cortex both spontaneously and in response to regularities of the incoming signals. From an informational viewpoint the recent exploration by Pearlmutter and Hinton (1986) of unsupervised procedures for discovering regularities in the input is especially relevant. Much of this paper has antecedents in the above work as well as in theories about the importance of redundancy in perception (Attneave 1954; Barlow 1959) and pattern recognition (Watanabe 1960, 1985). However, I have also tried to relate unsupervised learning to ideas about cognitive processes developed by Tolman (1932) and Craik (1943). Since these ideas provide a link with traditional psychology they will be briefly described.
1.1 Cognitive Maps and Working Models. Tolman (1932) worked within the behaviorists' tradition, but he disagreed with the rigidity of their explanations, feeling that these did not adequately convey the richness of the knowledge about their environment that maze-running rats clearly possessed and freely utilized. As he said, "behavior reeks of purpose and of cognition," and the structured knowledge of the environment that he argued for was subsequently called a cognitive map. Craik (1943), in his shorter, more philosophically oriented book, proposed that "thought models, or parallels, reality." These working models embodied the essential features and interactions in the world that fed the senses,
so that the outcomes of various possible actions could be correctly predicted; this is very similar to the way Tolman thought of cognitive maps being used by his rats. What is the source of the extensive and well-organized knowledge of the environment implied by the possession of a cognitive map or working model? Though their structure may be genetically determined, the specific evidence they incorporate can be obtained only from the sensory messages received by the brain, and it is argued below that it is the statistical regularities in these messages that must be used for this purpose. This is an extraordinarily complex and difficult task, for it requires something like a major program of scientific research to be conducted at a precognitive level. There is plenty of room for genetic help in doing this, but once the nature of the task has been defined the statistical aspects can be approached systematically. In the next sections this is attempted for the first few steps, and a new method of finding these regularities - minimum entropy coding - is proposed.

2 Redundancy Provides Knowledge
There are genuine conceptual difficulties in applying information theory to the nervous system. These start with the paradox that although redundancy is claimed to be terribly important, sensory pathways are said to eliminate or reduce it rather than preserve it. Some of these difficulties (such as that one) disappear upon better understanding of information theory, but others do not: it is, for instance, difficult to apply the concepts when one is uncertain about the information-bearing features of the messages in nerve fibres, and about the overall plan used to represent information in the brain. In the next section these difficulties are avoided by talking about the sensory stimuli applied to the animal rather than the messages these arouse, and by doing this the definitions can be made precise. In principle, the maximum rate of presentation of usable information to the senses can be specified if one knows the psychophysical facts about their discriminatory capacities; call this C bits/sec. Now look at the actual rate at which information is delivered, and call this H bits/sec; then the redundancy is simply C − H bits/sec, or 100 × (C − H)/C %. There remains a problem about measuring H, for the lower limit to its value can be calculated only if one knows all there is to know about the constraints operating in the world that gives rise to our sensations, and this point can obviously never be reached. Fortunately the concept of redundancy remains useful even if H is calculated using incomplete knowledge of the constraints, for this defines an upper limit to H and a lower limit to the redundancy.
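A one-line illustration of these definitions, with made-up numbers chosen purely for the arithmetic:

```python
def redundancy(C, H):
    """Redundancy in bits/sec and as a percentage of the capacity C."""
    return C - H, 100.0 * (C - H) / C

# Illustrative only: capacity 1000 bits/sec, measured inflow 100 bits/sec.
bits, percent = redundancy(C=1000.0, H=100.0)
print(bits, percent)   # -> 900.0 bits/sec, 90.0%
```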
It is confusing to refer to these C − H bits/sec as information, but the technically correct term, redundancy, is almost equally misleading, for it suggests that this part of the sensory inflow is useless or irrelevant, whereas it is the potential source of all the available knowledge about the constant or semiconstant patterns and regularities in an animal's environment. Knowledge is perhaps the best term for it, though it may seem paradoxical that this knowledge of the world around us can be derived only from the redundancy of the messages. The point can be illustrated by briefly considering what nonredundant sensory stimuli would be like. Completely nonredundant stimuli are indistinguishable from random noise. Thus, such a visual stimulus would look like a television set tuned between stations, and an auditory stimulus would sound like the hiss on an unconnected telephone line. Though meaningless to the recipient, technically such signals convey information at the maximum rate because they cannot be predicted at all from other parts of the message; H = C and there is no redundancy. Thus, redundancy is the part of our sensory experience that distinguishes it from noise; the knowledge it gives us about the patterns and regularities in sensory stimuli must be what drives unsupervised learning. With this in mind one can begin to classify the forms the redundancy takes and the methods of handling it.

3 Finding and Using Sensory Redundancy
Some features of sensory stimuli are almost universal. For instance, the upper part of the visual field is imaged on the lower part of the retina in an erect animal, and it is almost always more brightly illuminated. In animals such as cats that have a reflecting tapetum one usually finds that it is confined to the part receiving the image of the lower, dimmer, part of the visual field while the reflecting tapetum is replaced by a densely absorbing pigment in the part receiving the bright image; the result is to greatly reduce the amount of scattered light obscuring the image in its dimmer parts. The many ways that redundant aspects of sensory stimuli are reflected in permanent features of the sensory system are themselves interesting, but here we are concerned with learning-like responses. To exploit redundant features the brain must determine characteristics of the stimuli that behave in a nonrandom manner, so one can consider methodically the various measures that could be made on the messages in order to characterize these regularities statistically. 3.1 Mean. One starts with the mean, taken over the recent past. In vision, this can assume any value from a few thousandths up to many thousands of cd/m2, but it behaves in a very nonrandom manner because it tends to stay rather constant for quite long periods. I have been sitting at my desk for the past hour, and during this time the mean luminance has always been close to 10 cd/m2; the constancy of this mean is a highly nonrandom feature and the visual system takes advantage of it to adjust
the sensitivity of the pathways to suit the limited range of retinal illuminations it will receive. Much is understood about these adaptational mechanisms, and the principles are well understood by communication engineers, so I shall go ahead to consider more interesting types of redundancy. However, the way in which coding by the retina changes with the mean luminance of the images is a simple paradigm of unsupervised learning, and the one we are closest to understanding physiologically.
3.2 Variance. The variance of sensory signals probably does not show the constancy over short periods combined with very large changes over long periods that is characteristic of the mean, though a walker in mist or a fish in murky water would certainly be exposed to signals with an exceptionally low range of image contrasts and hence low variance. After the transformations in the retina, taking account of changes in the variance of the input signals is actually very nearly equivalent to adjusting for the mean values of the signals in the "on" and "off" systems, and it has been suggested that such contrast gain control occurs in primary visual cortex (Ohzawa et al. 1982, 1985). One might perhaps consider next the higher moments of the distributions of input stimuli on the many input channels, but it is hard to imagine that adapting to these would have any great advantages and I know of no evidence that natural systems respond in any way to them. Hence the next step is the large one of considering the patterns of correlation between the inputs on different channels.
3.3 Covariance. The simplest measure of the correlated activity of sensory pathways would be the covariance between pairs of them. Just as adaptational mechanisms take advantage of the mean by using it as an expected value and expressing values relative to it, so one might take advantage of covariance by devising a code in which the measured correlations are "expected" in the input, but removed from the output by forming a suitable set of linear combinations of the input signals. It is possible to form an uncorrelated set of signals in a neural network with a rather simple scheme of connection and rule of synaptic modification (Barlow 1989; Barlow and Foldiak 1989; see also Kohonen 1984). The essential idea is that each neuron's output feeds back to the other inputs through anti-Hebbian synapses, so that correlated activity among the outputs is discouraged. Such a network would account for many perceptual phenomena hitherto explained in terms of fatigue of pattern-selective elements in sensory pathways, and it also offers a mechanism for some forms of the "unconscious inference" described by von Helmholtz (1925) and modern psychologists of perception (Rock 1983). These aspects are discussed in the references cited above, and here some of the possible extensions of the principle will be mentioned.
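A minimal sketch of a decorrelating network of this kind, assuming linear units with symmetric anti-Hebbian feedback that settles to y = x − Wy; the learning rate, epoch count, and simple batch loop are illustrative choices, not the specific circuit of Barlow and Foldiak (1989).

```python
import numpy as np

def decorrelate(signals, lr=0.01, epochs=5):
    """Anti-Hebbian feedback: inhibitory weights grow between correlated
    outputs, so correlated activity among the outputs is discouraged."""
    n = signals.shape[1]
    W = np.zeros((n, n))                            # lateral inhibitory weights
    for _ in range(epochs):
        for x in signals:
            y = np.linalg.solve(np.eye(n) + W, x)   # settle: y = x - W y
            dW = lr * np.outer(y, y)                # anti-Hebbian update
            np.fill_diagonal(dW, 0.0)               # no self-inhibition
            W += dW
    return W

# Strongly correlated two-channel input; after learning, outputs decorrelate.
rng = np.random.default_rng(0)
s = rng.normal(size=(2000, 1)) @ np.array([[1.0, 0.9]]) \
    + 0.1 * rng.normal(size=(2000, 2))
W = decorrelate(s)
outputs = np.array([np.linalg.solve(np.eye(2) + W, x) for x in s])
print(np.corrcoef(outputs.T))   # off-diagonal terms shrink toward zero
```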
So far it has been supposed that the covariance is worked out from paired values occurring at the same moment, but this need not be the case. Sutton and Barto (1981) have discussed temporal relationships in conditioning, and there are several synaptic mechanisms that might depend on the correlation between synaptic input at one moment and postsynaptic depolarization at a later moment; a transmitter might cause lingering "eligibility" for subsequent reinforcement, or a synaptic rewarding factor or reverse transmitter released by a depolarized neuron might be optimally picked up by presynaptic terminals some moments after they had themselves been active. Decorrelating networks based on such principles would "expect" events that occurred in often-repeated sequences, and would tend to respond less strongly to frequently occurring sequences and more strongly to abnormal ones. It is easy to see how such a mechanism might explain aftereffects of motion. A consequence of using covariances is that, since the inputs are taken in pairs, the number of computations increases in proportion to the square of the number of inputs. This means that it would be impossible to decorrelate the whole sensory input; the best that could be done would be to decorrelate local sets of sensory fibers. However, the process could then be repeated, possibly organizing the decorrelated outputs of the first stage according to principles other than their topographical proximity, such as proximity in color space or similarity of direction of motion (Barlow 1981; Ballard 1984). Such hierarchical decorrelation processes may have considerable potential, but there is no denying that the methods so far considered only begin the task of finding regularities in the sensory input. 3.4 Rules for Combination or Agglomeration. Decorrelation separates variables that are correlated, but if the correlation between two variables is very strong they might be conveying the same message, and then one should combine them. For instance, taste information is carried by a large number of nerve fibers each of which has its characteristic mixture of sensitivities to the four primary qualities, salt, sweet, sour, and bitter. We have shown (Barlow and Foldiak 1989) how these can be decorrelated in groups of four to yield the four primary qualities, but one might expect all the outputs for one quality then to be combined on to a much smaller number of elements, for without doing this they just seem to replicate the information needlessly. There is need for an operation of this sort in many situations: for instance, to exploit the fact that there are only two dimensions of color (in addition to luminance), to exploit the prevalence of edges in ordinary images, to combine in one entity the host of sensory experiences for which we use a single word or name, and to do the same for a commonly repeated phrase or cliche. Pearlmutter and Hinton (1986) consider a related problem, that of finding input patterns that occur more often than would be expected if the constituent features occurred independently.
Finding that some combinations occur more often than expected is the converse of finding that some combinations do not occur at all, as is the case when the number of degrees of freedom or dimensions in a set of messages is less than would appear from the form of the messages. The set of N features spans less than an N-dimensional space because certain combinations do not occur, and exploiting this is just the kind of simplification that would enable one to make useful cognitive maps and working models. Principal component analysis will do what is required, and it is believed that the method described in the next section will also, but it is natural to look for network methods, especially as these have already achieved some success (for example, Oja 1982; Rumelhart and Zipser 1985; Pearlmutter and Hinton 1986; Foldiak 1989).

3.5 Minimum Entropy Coding. As with decorrelation the idea is to find a set of symbols to represent sensory messages such that, in the normal environment, each symbol is as nearly as possible independent of the others, but there are two differences: first, it is applicable to discretely coded, logically distinct variables rather than continuous ones, and second, it takes into account all possible nonrandom relations between the outputs, not just the pairwise relationships of the covariance matrix. To make the principle clear, the simple example of coding keyboard characters in 7 binary digits, to find alternatives to the familiar 7-bit ASCII code, will be considered. The advantages of examining this are its familiarity, its simplicity, and the fact that samples of normal English text are readily available from which the nonrandom character frequencies can be determined. If a sample of ordinary text is regarded simply as a string of independent characters randomly selected from the alphabet with the probabilities given by their frequency of occurrence in ordinary text, the average entropy of the characters H_c is given by the familiar expression:

H_c = − Σ_i p_i log p_i,   (3.1)

where p_i are the probabilities of the mutually exclusive set of characters. Each of the characters is represented by a 7-bit word, and the entropies for each bit can be obtained by measuring their frequencies in a sample of text. The entropy expression for the bits takes the form:

H_b = − (P_b log P_b + Q_b log Q_b),   (3.2)

where H_b is the average entropy of the bth bit, P_b is its probability, and Q_b is 1 − P_b. An estimate of the average character entropy can be obtained by adding the 7 bit entropies, but it is important to realize that this can never be less, and will usually be greater, than the character entropy given by the original expression (3.1). The reason for this is the lack of independence between the values of the bits; if it were true for all the 7
bits that their values were completely independent of the other bits occurring in any combination, then the two estimates would be equal. The object is to find a code for which this is true, or as nearly true as possible, and the method of doing this is to find a code that minimizes the sum of the bit entropies - hence the name. If the minimum is reached and the bits are truly independent we call it a factorial code, since each bit probability or its complement is then a factor of the probability of each of the input states.

The maneuver can be looked at another way. The seven binary digits of the ASCII code can carry a maximum of 7 bits, but actually carry less when used to transmit normal text, for two reasons. First, the bit probabilities are a long way from 1/2, which would yield the maximum bit entropy; this form of redundancy is explicit and causes no trouble, for the probability of each of the 7 bits is available wherever they are transmitted and easily measured. Second, there are complicated interdependencies among the bits, so the conditional bit probabilities are not the same as the unconditional ones; this form of redundancy is troublesome, for it is not available wherever the bits are transmitted, and to describe it completely one needs to know the conditional probabilities of each bit for all combinations of other bits. If both of these forms of redundancy were taken into account, the information conveyed per ASCII word would of course be the same as H_c of expression (3.1), i.e., about 4.3 bits, and no change of the code would alter this. However, changing the code does change the relative amounts of the two forms of redundancy, and by finding one that minimizes the sum of the bit entropies one maximizes the redundancy that results from bit probabilities deviating from 1/2. This leaves less room for redundancy from interdependencies between the bits; the troublesome form of redundancy is squeezed out by maximizing the other, less troublesome, form.

The minimum entropy principle should be generally applicable and clearly goes further than decorrelation, which considers only the outputs in pairs. It can also be used to compare and select from codes that change the number of channels or dimensionality of the messages. The entropy is a locally computable quantity, and by minimizing it one can increase the independence of the outputs without actually measuring the frequencies of all the possible output states, which would often be an impossible task. An accompanying article (Barlow et al. 1989) goes into some of the practical and theoretical problems in finding minimum entropy codes.

In this section it has been suggested that the statistical regularities of the incoming sensory messages might be measured and used to change the way they are coded or represented. It is easy to see that this would have advantages, analogous to those conferred by automatic gain control, in ensuring a compact representation within the dynamic range of the representative elements, but there may be more profound benefits attached to a representation in which the variables are independent in the environment to which the representation has been adapted.
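The two estimates of expressions (3.1) and (3.2) can be computed directly for the standard 7-bit ASCII code; the short repeated sentence standing in for a sample of English text below is of course illustrative.

```python
import math
from collections import Counter

def char_entropy(text):
    """Expression (3.1): H_c = -sum_i p_i log2 p_i over character frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def sum_of_bit_entropies(text):
    """Expression (3.2) summed over the 7 bits of each character's ASCII code."""
    n = len(text)
    total = 0.0
    for b in range(7):
        p = sum((ord(ch) >> b) & 1 for ch in text) / n
        if 0 < p < 1:
            total += -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total

sample = "the quick brown fox jumps over the lazy dog " * 40
print(char_entropy(sample))          # character entropy H_c
print(sum_of_bit_entropies(sample))  # never less than H_c; equal only if the
                                     # bit values are mutually independent
```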
To understand these one must consider the main task for which our perceptions are used, namely the detection of new associations and their utilization in ordinary learning and conditioning.

4 Ordinary Learning Requires Previous Knowledge
Over the past 20 years the work of Kamin (1969), Rescorla and Wagner (1972), Mackintosh (1974, 1983), Dickinson (1980), and others has brought about a very big change in the way theorists approach the learning problem. Whereas previously they tended to think in terms of mechanistic links whose strengths were increased or decreased according to definable laws, attention has now shifted to the computational problem that an animal solves when it learns. This started with the realization and experimental demonstration of the fact that the detection of new associations is strongly dependent on other previously and concurrently learned associations, many of which may be "silent" in that they do not themselves produce overt and obvious effects on outward behavior. As a result of this change it is at last appreciated that the brain studied in the learning laboratory is doing a profoundly difficult job: it is deducing causal links from which it can benefit in the world around it, and it does this by detecting suspicious coincidences; that is, it picks up associations that are surprising, new, or different among those that the experimenter offers it. To detect new associations one must detect changes in the probabilities of certain events, and once this is realized an important role for unreinforced experience becomes clear: it is to find out and record the a priori probabilities, that is, the normal state of affairs, or what usually happens. Though this elementary fact does not seem to have been much emphasized by learning theorists it is obviously crucial, for how can something be recognized as new and surprising if there is no preexisting knowledge about what is old and expected?

4.1 Detecting New Associations. The basic step in learning is to detect that event C predicts U; C might be the conditional, U the unconditional stimulus of Pavlovian conditioning, or C might be a motor action and U a reinforcement in operant conditioning, or they might be successive elements in a learned sequence. Unsupervised learning can help with at least two aspects of this process: first, the separate representation of a wide range of alternative Cs, and second, the acquisition of knowledge of the normal probabilities of occurrence of these possible conditional stimuli. It is often tacitly assumed that all alternative conditioning stimuli can be separated by the brain and their occurrences independently registered in some way, but one should not blandly ignore the whole problem of pattern recognition, and the massive interconnections we know exist
between the neurons of the brain mean that the host of alternative Cs are unlikely to be completely separable unless there are specific mechanisms for ensuring that they are. The tacit assumption that the probabilities of occurrence of these stimuli, or of their cooccurrence with U, are known is equally unjustified, though it is evident that if they were not there would be no sound basis for judging that a particular C had become a good predictor of U. The logical steps necessary to detect an association between C and U will be considered in more detail to show the importance both of knowledge of their normal probabilities and of the separability of alternative conditional stimuli. The only way to establish that C usefully predicts U is to disprove the null hypothesis that the number of occasions U follows C is no more than would be expected from chance coincidences of the two events; it is easy to see that if this null hypothesis is correct, no benefit can possibly result from using C as a predictor of U. To know the expected rate of chance coincidences one must either have measured the normal rate of the compound event (U following C) directly, or have knowledge of the normal probabilities of occurrence, P(C) and P(U); further, if these probabilities are to be used it must be reasonable to assume they are independent. This prior knowledge is clearly necessary before new predictive associations can be detected reliably. Now consider the difficulties that arise if a particular C cannot be fully resolved or separated from the alternative Cs.
Failure of resolution or separation means that the registration of the occurrence of an event is contaminated by occurrences of other events. Estimates of the probabilities of occurrence of C both with and without U would be misleading if based on these contaminated counts, and their use would cause failures to detect associations that were present and the detection of spurious associations that did not exist. Thus, if counts of alternative events like C are to be used to detect causal factors, they must be adequately resolved or separated if learning is to be efficient and reliable.

4.2 Independence Is Needed for Versatile Learning. Now reconsider the two ways, measurement and calculation, of estimating the compound event probability P(U following C). Directly measuring it is adequate and plausible when one has prior expectations about the possible conditional stimuli C, especially as in either scheme one must somehow be able to detect the occurrence of this sequence when it occurs. But calculating P(U following C) from P(C) and P(U) is much more versatile, for the following reason. Measuring the rates of N coincidences such as "U following C" just gives these rates and no more, whereas knowledge of the probabilities of N independent events enables one to calculate the probability of all possible logical functions of those events, at least in principle. This gigantic increase in the number of null hypotheses whose predictions can be specified and tested gives an enormous
advantage to the method of calculating, rather than measuring, the expected coincidence rates. However, calculating P(U following C) from the probabilities of its constituents depends on the formation of a representation in which the constituent events can be relied on to be independent until the association that is to be detected occurs. To summarize: to detect a suspicious coincidence that signals a new causal factor in the environment one should have access to prior knowledge of the probabilities of simpler constituent events, and these simpler events should be separately registered and independent on the null hypothesis from which one wishes to advance. It is obviously an enormously difficult and complicated task to generate such a representation, and the types of coding discussed above are only first steps; however, the versatility of subsequent learning must depend critically on how well the task has been done.
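As a sketch of this null-hypothesis test, the following compares the observed count of "U following C" with the count expected if C and U occurred independently, expressed as a z-score; the synthetic event streams and their rates are illustrative.

```python
import numpy as np
from math import sqrt

def association_surprise(c_events, u_events):
    """Z-score of observed C->U coincidences against the chance rate P(C)P(U)."""
    c, u = np.asarray(c_events[:-1]), np.asarray(u_events[1:])  # U follows C
    n = len(c)
    p_chance = c.mean() * u.mean()      # null hypothesis: independent events
    observed = np.sum(c & u)
    expected = n * p_chance
    return (observed - expected) / sqrt(n * p_chance * (1 - p_chance))

rng = np.random.default_rng(0)
c = rng.random(5000) < 0.1
u = np.roll(c, 1) & (rng.random(5000) < 0.5)   # U genuinely tends to follow C
u |= rng.random(5000) < 0.02                   # plus background U events
print(association_surprise(c, u))  # large z-score flags a suspicious coincidence
```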
4.3 Some Other Issues. The approach taken here might be criticized on the grounds that the problem facing the brain in learning is considered in too abstract a manner, the actual mechanisms being ignored. For example, the logic of the situation requires that the numbers of occurrences and joint occurrences be somehow stored, and one might point to this as the major problem, rather than the way the numbers are used. It certainly is a major problem, but the attitude adopted here is that one is not going to get far in understanding learning without recognizing the logic of inductive inference, since this dictates what quantities actually need to be stored; it seems obvious that this problem should be looked at first. There must be many ways in which the brain fails to perform the idealized operations required to detect new causal factors. It performs approximations and estimates, not exact calculations, but one cannot appreciate the mistakes an approximation will lead to without knowing what the exact calculation is. It is likely that many of the features of learning stem from the nature of the problem being tackled, not from the specific details of the mechanisms, and it is foolish to confuse the one with the other through failing to attend to the complexity of the task the brain appears to perform so effectively. There is another, somewhat irrelevant, issue. If it was known with certainty that a predictive relation between C and U existed, it would still have to be decided whether it should be acted on. This theoretically depends on whether P(U following C) is high enough for the reward obtained when U does follow C to outweigh the penalty attached to the behavior needed to reap the reward when U fails to materialize; that is a different matter from deciding whether the relation exists, and for the present it can be ignored.
4.4 Storing and Accessing the Model. So far no means has been proposed for performing the computations suggested above, nor for storing and accessing the knowledge of the environment that the model contains. One possibility is to form a massive memory for the usual rates of occurrence of various combinations of sensory inputs. Something like this may underlie our ability to say "the almond is blossoming unusually early this year" and to make similar cognitive judgments, but the comparative judgments of everyday perception are certainly made in quite a different way. When we see white walls in a dimly lit room we do not observe their luminance, then refer to a memorized look-up table that tells us what luminances are to be called white when the mean luminance is such and such; instead we have mechanisms (admittedly not yet fully understood) that automatically compare the signals generated by the image of the wall with signals from other regions, and then attach the label "white" wherever this comparison yields a high value. This automatic comparison was regarded above as a way of eliminating the redundancy involved in signaling the mean luminance on every channel, and it should now be clear how the various other suggested forms of recoding do much the same operation for other "expected" statistical regularities in the sensory messages. One can regard the model or map as something automatically held up for comparison with the current input; it is like a negative filter through which incoming messages are automatically passed, so that what emerges is the difference between what is actually happening and what one would expect to happen, based on past experience. In this way past experience can be made continuously and automatically available.

5 Discussion
Since the early days of information theory it has been suggested that the redundancy of sensory stimuli is particularly important for understanding perception. Attneave (1954) was the first to point this out, and I have periodically argued for its importance in understanding both the physiological mechanisms of sensory coding and higher level functions including intelligence (Barlow 1959, 1961, 1987). One can actually trace the line of thought back to von Helmholtz (1877, 1925), and particularly to the writings of Ernst Mach (1886) and Karl Pearson (1892) about "The Economy of Thought." To what extent is this line of thought the same as that of Tolman and Craik on cognitive maps and working models? They are certainly closely related, for they both say that the regularities in the sensory messages must be recorded by the brain for it to know what usually happens. However, redundancy reduction is the more specific form of the hypothesis, for I think it also implies that the knowledge contained in the map or model is stored in such a form that the current sensory scene is automatically compared with it and the discrepancies
passed on for further consideration - the idea of the model as a negative filter. There is perhaps something contradictory and intuitively hard to accept in this notion, especially when applied to the cognitive knowledge of our environments to which we have conscious access. When we become aware that the almond is blossoming unusually early, we think this is an absolute judgment based on comparisons with past, positive experiences, and not the result of a discrepancy between present experience and unconscious rememberings of past blossomings. Perhaps the negative filter idea applies only to the unconscious knowledge that our perceptions use so effectively, with quite different mechanisms employed at the higher levels to which we have conscious access. On the other hand, redundancy reduction may be the deeper view of how our brains handle sensory input, for it may describe the goal of the whole process, dictating the form of representation as well as what is represented; we should not be too surprised if our introspections turn out to be misleading on such a matter, for they may be concerned with guiding us how to tell others about our experiences, not with informing us how the brain goes about its business. The discussion should have demonstrated that there is a close relationship between the properties of the elements that represent the current scene, the model that tells one "what usually happens," and the ease with which new associations can be detected and learned. But the recoding methods suggested above are unlikely to be complete, and it is worth listing other factors that must be important in determining the utility of representations.
5.1 Other Factors Affecting the Utility of Representations.

1. The best method of detecting a target in a noisy background is to derive a signal that picks up all the energy available from the signal with the minimum contamination by energy from its background - the principle of the matched detector. This principle must be very important when detecting events in the environment that are associated with rewards or punishments, but there is no guarantee that the code elements of a minimum entropy code (or any other code that is unguided by reinforcement) will be well matched to these classes of events. Though a priori probabilities can be calculated for any logical function of the inputs if the representative elements are independent, this calculation is not necessarily as accurate as that obtained from a matched filter.
2. It is also important that a coding scheme should lead to appropriate generalization. Probably representative elements should start by responding to a wider class of events than that to which, under the influence of ”shaping,” they ultimately respond. To meet this
requirement, mechanisms additional to minimum entropy coding are required.
3. Items such as the markings of prey, predators, or mates may have a biological significance that is arbitrary from an informational viewpoint.

4. Sensory scenes and stimuli that have been reinforced obviously have special importance, and they should therefore have a key role in classifying sensory stimuli.

It is clear that the minimum entropy principle is not the only one on which the representation of sensory information should be based. Nonetheless a code selected on this principle stores a wealth of knowledge about the statistical structure of the normal environment, and the independence of the representative elements gives such a representation enormous versatility. It is relatively easy to devise learning schemes capable of detecting specific associations, but higher mammals appear to be able to make associations with entities of the order of complexity that we would use a word to describe. As George Boole (1854) pointed out, words are to the elements of our sensations like logical functions to the variables that compose them. We cannot of course suppose that an animal can form an association with any arbitrary logical function of its sensory messages, but animals have capacities that tend in that direction, and it is these capacities that the kind of representative schemes considered here might be able to mimic.
References

Attneave, F. 1954. Informational aspects of visual perception. Psychol. Rev. 61, 183-193.
Ballard, D.H. 1984. Parameter networks. Artificial Intell. 22, 235-267.
Barlow, H.B. 1959. Sensory mechanisms, the reduction of redundancy, and intelligence. In National Physical Laboratory Symposium No. 10, The Mechanisation of Thought Processes. Her Majesty's Stationery Office, London.
Barlow, H.B. 1961. Possible principles underlying the transformations of sensory messages. In Sensory Communication, W. Rosenblith, ed., pp. 217-234. MIT Press, Cambridge, MA.
Barlow, H.B. 1981. Critical limiting factors in the design of the eye and visual cortex. The Ferrier Lecture, 1980. Proc. Roy. Soc. London B 212, 1-34.
Barlow, H.B. 1987. Intelligence: The art of good guesswork. In The Oxford Companion to the Mind, R.L. Gregory, ed., pp. 381-383. Oxford University Press, Oxford.
Barlow, H.B. 1989. A theory about the functional role and synaptic mechanism of visual after-effects. In Vision: Coding and Efficiency, C. Blakemore, ed. Cambridge University Press, Cambridge.
Barlow, H.B., and Foldiak, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, New York.
Barlow, H.B., Kaushal, T.P., and Mitchison, G.J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Barrow, H.G. 1987. Learning receptive fields. Proc. IEEE First Int. Conf. Neural Networks, Cat. #87TH0191-7, pp. IV-115-IV-121.
Barto, A.G., Sutton, R.S., and Anderson, C.W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transact. Systems, Man, Cybernet. SMC-13(5), 835-846.
Boole, G. 1854. An Investigation of the Laws of Thought. Dover Publications reprint, New York.
Cooper, L.N., Liberman, F., and Oja, E. 1979. A theory for the acquisition and loss of neuron specificity in visual cortex. Biol. Cybernet. 33, 9-28.
Craik, K.J.W. 1943. The Nature of Explanation. Cambridge University Press, Cambridge.
Dickinson, A. 1980. Contemporary Animal Learning Theory. Cambridge University Press, Cambridge.
Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. In Proc. IEEE/INNS Int. Joint Conf. Neural Networks, Washington, DC.
Fukushima, K. 1975. Cognitron: A self-organising multi-layered neural network. Biol. Cybernet. 20, 121-136.
Fukushima, K. 1980. Neocognitron: A self-organising neural network model for a mechanism of pattern recognition unaffected by shift of position. Biol. Cybernet. 36, 193-202.
Grossberg, S. 1980. How does a brain build a cognitive code? Psychol. Rev. 87, 1-51.
Kamin, L.J. 1969. Predictability, surprise, attention and conditioning. In Punishment and Aversive Behavior, B.A. Campbell and R.M. Church, eds., pp. 279-296. Appleton-Century-Crofts, New York.
Kohonen, T. 1984. Self-Organisation and Associative Memory. Springer-Verlag, Berlin.
Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Linsker, R. 1988. Self-organisation in a perceptual network. Computer (March 1988), 105-117.
Mach, E. 1886. The Analysis of Sensations, and the Relation of the Physical to the Psychical. Translation of the 1st, revised from the 5th, German edition by S. Waterlow. Open Court, Chicago and London. (Also Dover reprint, New York, 1959.)
Mackintosh, N.J. 1974. The Psychology of Animal Learning. Academic Press, London.
Mackintosh, N.J. 1983. Conditioning and Associative Learning. Oxford University Press, Oxford.
Nass, M.M., and Cooper, L.N. 1975. A theory for the development of feature detecting cells in visual cortex. Biol. Cybernet. 19, 1-18.
Ohzawa, I., Sclar, G., and Freeman, R.D. 1982. Contrast gain control in the cat visual cortex. Nature (London) 298, 266-278.
Ohzawa, I., Sclar, G., and Freeman, R.D. 1985. Contrast gain control in the cat's visual system. J. Neurophysiol. 54, 651-667.
Oja, E. 1982. A simplified neuron as a principal component analyser. J. Math. Biol. 15, 267-273.
Pearlmutter, B.A., and Hinton, G.E. 1986. G-maximization: An unsupervised learning procedure for discovering regularities. In Proc. Conf. Neural Networks for Computing. American Institute of Physics.
Pearson, K. 1892. The Grammar of Science. Walter Scott, London.
Perez, R., Glass, L., and Shlaer, R. 1975. Development of specificities in the cat visual cortex. J. Math. Biol. 1, 275-288.
Rescorla, R.A., and Wagner, A.R. 1972. A theory of Pavlovian conditioning: Variations in effectiveness of reinforcement and non-reinforcement. In Classical Conditioning II: Current Research and Theory, A.H. Black and W.F. Prokasy, eds., pp. 64-99. Appleton-Century-Crofts, New York.
Rock, I. 1983. The Logic of Perception. MIT Press, Cambridge, MA.
Rolls, E.T. 1989. The representation and storage of information in neuronal networks in the primate cerebral cortex and hippocampus. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, New York.
Rosenblatt, F. 1959. Two theorems of statistical separability in the Perceptron. In National Physical Laboratory Symposium No. 10, Mechanisation of Thought Processes, 419-456. Her Majesty's Stationery Office, London.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, Washington, DC.
Rumelhart, D.E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112. (Also in Parallel Distributed Processing, Vol. 1, pp. 151-193. MIT Press, Cambridge, MA, 1986.)
Sutton, R.S., and Barto, A.G. 1981. Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88, 135-170.
Swindale, N.V. 1980. A model for the formation of ocular dominance stripes. Proc. Roy. Soc. London B 208, 243-264.
Swindale, N.V. 1982. A model for the formation of orientation columns. Proc. Roy. Soc. London B 215, 211-230.
Tolman, E.C. 1932. Purposive Behavior in Animals and Men. The Century Company, New York.
Uttley, A.M. 1958. Conditional probability as a principle in learning. In Actes 1er Congrès Cybernétique, Namur, 1956, J. Lemaire, ed. Gauthier-Villars, Paris.
Uttley, A.M. 1970. The Informon: A network for adaptive pattern recognition. J. Theoret. Biol. 27, 31-47.
Uttley, A.M. 1975. The Informon in classical conditioning. J. Theoret. Biol. 49, 355-376.
Uttley, A.M. 1979. Information Transmission in the Nervous System. Academic Press, London.
von der Malsburg, C. 1973. Self-organisation of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
von Helmholtz, H. 1877. The Sensations of Tone. Translated by A.J. Ellis, 1885. Reprinted Dover, New York, 1954.
von Helmholtz, H. 1925. Physiological Optics, Volume 3: The Theory of the Perceptions of Vision. Translated from the 3rd German edition (1910). Optical Society of America, Washington, DC.
Watanabe, S. 1960. Information-theoretical aspects of inductive and deductive inference. IBM J. Res. Dev. 4, 208-231.
Watanabe, S. 1985. Pattern Recognition: Human and Mechanical. Wiley, New York.
Yovits, M.C., Jacobi, G.T., and Goldstein, G.D. 1962. Self-Organizing Systems. Spartan Books, Washington, DC.
Received 29 December 1988; accepted 15 February 1989
REVIEW
The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning

Yaser S. Abu-Mostafa
California Institute of Technology, Pasadena, CA 91125, USA
When feasible, learning is a very attractive alternative to explicit programming. This is particularly true in areas where the problems do not lend themselves to systematic programming, such as pattern recognition in natural environments. The feasibility of learning an unknown function from examples depends on two questions:

1. Do the examples convey enough information to determine the function?

2. Is there a speedy way of constructing the function from the examples?
These questions contrast the roles of information and complexity in learning. While the two roles share some ground, they are conceptually and technically different. In the common language of learning, the information question is that of generalization and the complexity question is that of scaling. The work of Vapnik and Chervonenkis (1971) provides the key tools for dealing with the information issue. In this review, we develop the main ideas of this framework and discuss how complexity fits in.

1 Introduction

We start by formalizing a simple setup for learning from examples. We have an environment such as the set of visual images, and we call the set X. In this environment we have a concept defined as a function f : X → {0, 1}, such as the presence or absence of a tree in the image. The goal of learning is to produce a hypothesis, also defined as a function g : X → {0, 1}, that approximates the concept f, such as a pattern recognition system that recognizes trees. To do this, we are given a number of examples (x_1, f(x_1)), ..., (x_N, f(x_N)) from the concept, such as images with trees and images without trees. In generating the examples, we assume that there is an unknown probability distribution P on the environment X. We pick each example independently according to this probability distribution. The statements in the paper hold true for any probability distribution P, which sounds
very strong indeed. The catch is that the same P that generated the examples is the one that is used to test the system, which is a plausible assumption. Thus, we learn the tree concept by being exposed to "typical" images. While X can be finite or infinite (countable or uncountable), we shall use a simple language that assumes no measure-theoretic complications. The hypothesis g that we produce approximates f in the sense that g would rarely be significantly different from f (Valiant 1984). This definition allows for two tolerance parameters ε and δ. With probability ≥ 1 − δ, g will differ from f at most ε of the time. The δ parameter protects against the small, but nonzero, chance that the examples happen to be very atypical. A learning algorithm is one that takes the examples and produces the hypothesis. The performance is measured by the number of examples needed to produce a good hypothesis as well as the running time of the algorithm.

2 Generalization
We start with a simple case that may look at first as having little to do with what we think of as generalization. Suppose we make a blind guess of a hypothesis g, without even looking at any examples of the concept f. Now we take some examples of f and test g to find out how well it approximates f. Under what conditions does the behavior of g on the examples reflect its behavior in general? This turns out to be a very simple question. On any point in X, f and g either agree or disagree. Define the agreement set

A = {x ∈ X : f(x) = g(x)}.
The question now becomes: How does the frequency of the examples in A relate to the probability of A? Let π be the probability of A, i.e., the probability that f(x) = g(x) on a point x picked from X according to the probability distribution P. We can consider each example as a Bernoulli trial (coin flip) with probability π of success (f = g) and probability 1 − π of failure (f ≠ g). With N examples, we have N independent, identically distributed Bernoulli trials. Let n be the number of successes (n is a random variable), and let ν = n/N be the frequency of success. Bernoulli's theorem states that, by taking N sufficiently large, ν can be made arbitrarily close to π with very high probability. In other words, if you take enough examples, the frequency of success will be a good estimate of the probability of success. Notice that this does not say anything about the probability of success itself, but rather about how the probability of success can be estimated from the frequency of success. If on the examples we get 90% right, we
should get about 90% right overall. If we get only 10% right, we should continue to get about the same. We are only predicting that the results of the experiment with the examples will persist, provided there are enough examples.
How does this case relate to learning and generalization? After all, we do not make a blind guess when we learn, but rather construct a hypothesis from the examples. However, at a closer look, we find that we make a guess, not of a hypothesis but of a set of hypotheses. For example, when the backpropagation algorithm (Rumelhart et al. 1986) is used in a feedforward network, we are implicitly guessing that there is a good hypothesis among those that are obtained by setting the weights of the given network in some fashion. The set of hypotheses G would then be the set of all functions g that are obtained by setting the weights of the network in any fashion. Therefore, when learning deals with a limited domain of representation, such as a given network with free weights, we in effect make a guess of a set G of hypotheses. The learning algorithm then picks a hypothesis g ∈ G that mostly agrees with f on the examples. The question of generalization now becomes: Does this choice, which is based on the behavior on the examples, hold in general? We can approach this question in a way similar to the previous case. We define, for every g ∈ G, the agreement set

A_g = {x ∈ X : f(x) = g(x)}.
These sets are different for different g's. Let π_g be the probability of A_g, i.e., the probability that f(x) = g(x) on a point x picked from X according to the probability distribution P, for the particular g ∈ G in question. We can again define random variables n_g (the number of successes with respect to different g's) and the frequencies of success ν_g = n_g/N. At this point the problem looks exactly the same as the previous one and one may expect the same answer. There is one important difference. In the simple Bernoulli case, the issue was whether ν converged to π. In the new case, the issue is whether the ν_g's converge to the π_g's in a uniform manner as N becomes large. In the learning process, we decide on one g but not the other based on the values of ν_g. If we had the ν_g's converge to the π_g's, but not in a uniform manner, we could be fooled by one erratic g. For example, we may be picking the hypothesis g with the maximum ν_g. With nonuniform convergence, the g we pick can have a poor π_g. We want the probability that there is some g ∈ G such that ν_g differs significantly from π_g to be very small. This can be expressed formally as

Pr[ sup_{g∈G} |ν_g − π_g| > ε ] ≈ 0

where sup denotes the supremum.
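To see the difference numerically, here is a small Monte Carlo sketch (our illustration, not from the paper; it assumes a uniform P on [0, 1], a ray target concept, and a finite grid of ray hypotheses standing in for G). The frequency ν tracks π well for one hypothesis fixed in advance, but the hypothesis selected for having the largest ν is systematically optimistic about its π:

import random

def f(x):                      # target concept: a ray with threshold 0.5
    return 1 if x <= 0.5 else 0

def g(a, x):                   # hypothesis family: rays with threshold a
    return 1 if x <= a else 0

def pi(a):                     # true agreement probability of g_a with f
    return 1.0 - abs(a - 0.5)  # (under the assumed uniform P on [0, 1])

random.seed(0)
N = 20                                       # examples per trial
grid = [i / 100 for i in range(101)]         # a finite grid standing in for G
trials = 2000
gap_fixed, gap_chosen = 0.0, 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(N)]
    nu = {a: sum(g(a, x) == f(x) for x in xs) / N for a in grid}
    gap_fixed += abs(nu[0.7] - pi(0.7))      # one hypothesis fixed in advance
    a_best = max(nu, key=nu.get)             # hypothesis selected by its nu
    gap_chosen += abs(nu[a_best] - pi(a_best))
print("mean |nu - pi|, fixed g:    %.3f" % (gap_fixed / trials))
print("mean |nu - pi|, selected g: %.3f" % (gap_chosen / trials))

The selected hypothesis overstates its true agreement; bounding this effect over all of G at once is precisely the uniform convergence requirement.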
3 The V-C Dimension
A condition for uniform convergence, hence generalization, was found by Vapnik and Chervonenkis (1971). The key is the inequality

Pr[ sup_{g∈G} |ν_g − π_g| > ε ] ≤ 4 m(2N) e^{−ε²N/8}

where m is a function that depends on G. We want the right-hand side of the inequality to be small for large N, in order to achieve uniform convergence. The factor e^{−ε²N/8} is very helpful, since it is exponentially decaying in N. Unless the factor m(2N) grows too fast, we should be OK. For example, if m(2N) is polynomial in N, the right-hand side will go to zero as N goes to infinity. What is the function m? It depends on the set of hypotheses G. Intuitively, m(N) measures the flexibility of G in expressing an arbitrary concept on N examples. For instance, if G contains enough hypotheses to be able to express any concept on 100 examples, one should not really expect any generalization with only 100 examples, but rather a memorization of the concept on the examples. On the other hand, if gradually more and more concepts cannot be expressed by any hypothesis in G as N grows, then the agreement on the examples means something, and generalization is probable. Formally, m(N) measures the maximum number of different binary functions on the examples x_1, …, x_N induced by the hypotheses g_1, g_2, … ∈ G. For example, if X is the real line and G is the set of rays of the form x ≤ a, i.e., functions of the form

g(x) = 1 if x ≤ a, and g(x) = 0 otherwise,
then m(N) = N + 1. The reason is that on N points one can define only N + 1 different functions of the above form by sliding the value of a from the left of the leftmost point all the way to the right of the rightmost point. There are two simple facts about the function m. First, m(N) ≤ |G| (where |·| denotes the cardinality), since G cannot induce more functions than it has. This fact is useful only when G is a finite set of hypotheses. The second fact is that m(N) ≤ 2^N, since G cannot induce more binary functions on N points than there are binary functions on N points. Indeed, there are choices of G (trivially the set of all hypotheses on X) for which m(N) = 2^N. For those cases, the V-C inequality does not guarantee uniform convergence. The main fact about m(N) that helps the characterization of G as far as generalization is concerned is that m(N) is either identically equal to 2^N for all N, or else is bounded above by N^d + 1 for a constant d. This striking fact can be proved in a simple manner (Cover 1965; Vapnik and Chervonenkis 1971). The latter case implies a polynomial m(N)
and guarantees generalization. The value of d matters only in how fast convergence is achieved. This is of practical importance because it determines the number of examples needed to guarantee generalization within given tolerance parameters. The value of d turns out to be the smallest N at which G starts failing to induce all possible 2^N binary functions on any N examples. Thus, the former case can be considered the case d = ∞. d is called the V-C dimension (Baum and Haussler 1989; Blumer et al. 1986).
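As a concrete check on these counts, the following sketch (our illustration; the point sets and hypothesis families are assumptions chosen for simplicity) enumerates the dichotomies induced on N points by rays (d = 1) and by intervals (d = 2):

from itertools import combinations

def dichotomies_rays(points):
    # rays g_a(x) = 1 iff x <= a: slide the threshold across the points
    pts = sorted(points)
    cuts = [pts[0] - 1.0] + [p + 1e-9 for p in pts]
    return len({tuple(1 if x <= a else 0 for x in pts) for a in cuts})

def dichotomies_intervals(points):
    # intervals [a, b]: the points labeled 1 form a contiguous run
    n = len(points)
    labelings = {tuple(0 for _ in range(n))}            # the empty interval
    for i, j in combinations(range(n), 2):
        labelings.add(tuple(1 if i <= k <= j else 0 for k in range(n)))
    for i in range(n):                                  # single-point runs
        labelings.add(tuple(1 if k == i else 0 for k in range(n)))
    return len(labelings)

for N in range(1, 8):
    pts = [float(k) for k in range(N)]
    print(N, dichotomies_rays(pts), dichotomies_intervals(pts))
# rays give N + 1 dichotomies (d = 1); intervals give N(N+1)/2 + 1 (d = 2),
# both polynomial rather than 2^N, in line with the bound m(N) <= N^d + 1.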
4 Interpretation
Training a network with a set of examples can be thought of as a process for selecting a hypothesis g with a favorable performance on the examples (large ν_g) from the set G. Depending on the characteristics of G, one can predict how this performance will generalize. This aspect of the characteristics of G is captured by the parameter d, the V-C dimension. If the number of examples N is large enough with respect to d, generalization is expected. This means that maximizing ν_g will approximately maximize π_g, the real indicator of how well the hypothesis approximates the concept. In general, the more flexible (expressive, large) G is, the larger its V-C dimension d. For example, the V-C dimension of feedforward networks grows with the network size (Baum and Haussler 1989); the total number of weights in a one-hidden-layer network is an approximate lower bound for the V-C dimension of the network. While a bigger network stands a better chance of being able to implement a given function, its demands on the number of examples needed for generalization are bigger. These are often conflicting criteria. The V-C dimension indicates only the likelihood of generalization. This means, for better or for worse, whether the behavior on the examples is going to persist. The ability of the network to approximate a given function in principle is a separate issue. The running time of the learning algorithm is a key concern (Judd 1988; Valiant 1984). As the number of examples increases, the running time generally increases. However, this dependency is a minor one. Even with few examples, an algorithm may need an excessive amount of time to manipulate the examples into a hypothesis. The independence of this complexity issue from the above discussion regarding information is apparent. Without a sufficient number of examples, no algorithm, slow or fast, can produce a good hypothesis. Yet a sufficient number of examples is of little use if the computational task of digesting the examples into a hypothesis proves intractable.
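For a rough numerical feel for this trade-off, one can ask for the smallest N that drives the right-hand side of the V-C inequality below δ, replacing m(2N) by its polynomial bound (2N)^d + 1 (a sketch under these assumptions, not a tight sample-complexity estimate):

import math

def examples_needed(d, eps, delta):
    # smallest N with 4 * ((2N)^d + 1) * exp(-eps^2 * N / 8) <= delta
    N = 1
    while 4 * ((2.0 * N) ** d + 1) * math.exp(-eps ** 2 * N / 8) > delta:
        N += 1
    return N

for d in (1, 5, 20):
    print("d = %2d  ->  N = %d" % (d, examples_needed(d, eps=0.1, delta=0.05)))

The required N climbs steeply with d: more expressive hypothesis sets need many more examples before ν_g can be trusted.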
Acknowledgments
The support of the Air Force Office of Scientific Research under Grant AFOSR-88-0213 is gratefully acknowledged.

References

Baum, E.B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1986. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. Proc. ACM Symp. Theory Computing 18, 273-282.
Cover, T.M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput. EC-14, 326-334.
Judd, J.S. 1988. On the complexity of loading shallow neural networks. J. Complex. 4, 177-192.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Valiant, L.G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142.
Vapnik, V.N., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Received 6 July 1989; accepted 25 July 1989
NOTE
Communicated by Ellen Hildreth
Linking Linear Threshold Units with Quadratic Models of Motion Perception
Humbert Suarez, Christof Koch
Computation and Neural Systems Program, 216-76, California Institute of Technology, Pasadena, CA 91125 USA
Behavioral experiments on insects (Hassenstein and Reichardt 1956; Poggio and Reichardt 1976) as well as psychophysical evidence from human studies (Van Santen and Sperling 1985; Adelson and Bergen 1985; Watson and Ahumada 1985) support the notion that short-range motion perception is mediated by a system with a quadratic type of nonlinearity, as in correlation (Hassenstein and Reichardt 1956), multiplication (Torre and Poggio 1978), or squaring (Adelson and Bergen 1985). However, there is little physiological evidence for quadratic nonlinearities in directionally selective cells. For instance, the response of cortical simple cells to a moving sine grating is half-wave instead of full-wave rectified as it should be for a quadratic nonlinearity (Movshon et al. 1978; Holub and Morton-Gibson 1981) and is linear for low contrast (Holub and Morton-Gibson 1981). Complex cells have full-wave rectified responses, but are also linear in contrast. Moreover, a detailed theoretical analysis of possible biophysical mechanisms underlying direction selectivity concludes that most do not have quadratic properties except under very limited conditions (Grzywacz and Koch 1987). Thus, it is presently mysterious how a system can show quadratic properties while its individual components do not. We briefly discuss here a simple population encoding scheme offering a possible solution to this problem. We assume a population of n directionally selective cells whose output is zero if the "somatic potential" x is below a certain threshold x_T and whose output is linear for small x above this value, f(x) = α⌊x − x_T⌋, where ⌊x⌋ = x if x > 0 and 0 otherwise (Fig. 1a). We do not consider further the mechanism generating direction selectivity but will assume that the perceptual response to motion R is given by the sum of the responses of a large number of these neurons. Thus, if the moving stimulus induces the intracellular response x in all n cells, we have

R = n f(x) = nα⌊x − x_T⌋    (0.1)
Figure 1: (a) Schematic input-output relationship of a highly idealized directionally selective cell. If the "somatic potential" x is below a threshold x_T (here 2), the cell remains silent; above this threshold the output of the cell is α(x − x_T), and saturates for x = x_T + x_m. For the simulations described here, we use x_m = 4 and α = 1. (b) The sum R of the responses for a group of 50 such units with x_T uniformly distributed between x = 1 and x = 3 (see arrows). R is quadratic for small values of x and saturates for large values. The dashed curve is 12.5(x − 1)² and corresponds to the expected mean of R for uniformly distributed values of x_T between 1 and 3.
If x_T is the same for all cells, R = nα(x − x_T) as long as x > x_T. However, we will now assume that the threshold varies randomly from cell to cell, let us say distributed uniformly between x_{T1} and x_{T2}. If x falls within these values, the function α(x − x_T) is randomly sampled across this interval prior to summation, and the system then simply computes the area below f(x), similar to Monte Carlo integration methods. For a sufficiently large value of n, we then have that R is proportional to (x − x_{T1})² [in general, if f(x) is an mth-order polynomial, R will be proportional to (x − x_{T1})^{m+1}]. Alternatively, if x_T were constant for all cells while the "somatic potential" of the neurons was uniformly randomly distributed between x_{T1} and x for a given input, the same quadratic behavior in x would be obtained. In particular, this random variation could be obtained by summing over a population of cells that is broadly tuned for the direction of motion with a certain distribution of preferred directions.
In all cases, for values of x much higher than x_{T2}, the output R will grow linearly since the system will integrate only over a narrow range around x. Finally, more realistic neurons saturate at some output value, that is, f(x) = αx_m for x > x_T + x_m; R will then saturate also. The output of this simple system thus approximates a square function for motion in the preferred direction over a range of positive input values (Fig. 1b). By using this averaging technique as well as ON- and OFF-rectified "neurons," systems that show quadratic behavior, including full-wave rectification, could in principle be built out of linear threshold units, thereby linking the properties of single cells with the observed behavioral responses. It is rather elegant that this can be accomplished solely by taking into account the random variations in neuronal properties. Note that detailed simulations of more realistic neuronal models are needed to verify the applicability of this mechanism to biological visual systems.
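A minimal simulation of this argument (our sketch; the parameters α = 1, x_m = 4, and thresholds uniform in [1, 3] are taken from the Figure 1 caption, the rest is assumed) shows the summed response tracking the quadratic 12.5(x − 1)²:

import random

random.seed(1)
n, alpha, x_m = 50, 1.0, 4.0
thresholds = [random.uniform(1.0, 3.0) for _ in range(n)]  # x_T in [1, 3]

def unit(x, x_t):
    # threshold-linear unit: silent below x_t, slope alpha above,
    # saturating at alpha * x_m once x exceeds x_t + x_m
    return alpha * max(0.0, min(x - x_t, x_m))

for x in (1.0, 1.5, 2.0, 2.5, 3.0):
    R = sum(unit(x, x_t) for x_t in thresholds)
    print("x = %.1f   R = %6.2f   12.5*(x-1)^2 = %6.2f"
          % (x, R, 12.5 * (x - 1.0) ** 2))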
References

Adelson, E.H., and Bergen, J.R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A2, 284-299.
Grzywacz, N.M., and Koch, C. 1987. Functional properties of models for direction selectivity in retina. Synapse 1, 417-434.
Hassenstein, B., and Reichardt, W. 1956. Systemtheoretische analyse der zeit-, reihenfolgen- und vorzeichenauswertung bei der bewegungsperzeption des rüsselkäfers chlorophanus. Z. Naturforschung 11b, 513-524.
Holub, R.A., and Morton-Gibson, M. 1981. Response of visual cortical neurons of the cat to moving sinusoidal gratings: Response-contrast functions and spatiotemporal interactions. J. Neurophysiol. 46, 1244-1259.
Movshon, J.A., Thompson, I.D., and Tolhurst, D.J. 1978. Spatial summation in the receptive fields of simple cells in the cat's striate cortex. J. Physiol. 283, 53-88.
Poggio, T., and Reichardt, W.E. 1976. Visual control of orientation behavior in the fly: Part II: Towards the underlying neural interactions. Q. Rev. Biophys. 9, 377-438.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. London Ser. B 202, 409-416.
Van Santen, J.P.H., and Sperling, G. 1985. Elaborated Reichardt detectors. J. Opt. Soc. Am. A2, 300-320.
Watson, A.B., and Ahumada, A.J. 1985. Model of human visual-motion sensing. J. Opt. Soc. Am. A2, 322-341.
Received 6 April 1989; accepted 4 July 1989.
NOTE
Communicated by Terrence Sejnowski
Neurogammon Wins Computer Olympiad
Gerald Tesauro
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Neurogammon 1.0 is a backgammon program which uses multilayer neural networks to make move decisions and doubling decisions. The networks learned to play backgammon by backpropagation training on expert data sets. At the recently held First Computer Olympiad in London, Neurogammon won the backgammon competition with a perfect record of five wins and no losses, thereby becoming the first learning program ever to win a tournament.
Neural network learning procedures are being widely investigated for many classes of practical applications. Board games such as chess, go, and backgammon provide a fertile testing ground because performance measures are clear and well defined. Furthermore, expert-level play can be of tremendous complexity. Learning programs have been studied in games environments for many years, but heretofore have not reached significant levels of performance. Neurogammon 1.0 represents the culmination of previous research in backgammon learning networks (Tesauro and Sejnowski 1989; Tesauro 1988; Tesauro 1989) in the form of a fully functioning program. Neurogammon contains one network which makes doubling cube decisions and a set of six networks which make move decisions in different phases of the game. Each network has a standard fully connected feedforward architecture with a single hidden layer, and was trained by the well-known backpropagation algorithm (Rumelhart et al. 1986). The move-making networks were trained on a set of positions from 400 games in which the author played both sides. A "comparison paradigm," described in (Tesauro 1989), was used to teach the networks that the move selected by the expert should score higher than each of the other possible legal moves. The doubling network was trained on a separate set of about 3000 positions which were classified according to a crude nine-point ranking scale of doubling strength. The training of each network proceeded until maximum generalization performance was obtained, as measured by performance on a set of test positions not used in training. The resulting program appears to play at a substantially higher level than conventional backgammon programs. At the Computer Olympiad in London, held on August 9-15, 1989, and organized by David Levy,
Neurogammon competed against five other opponents: three commercial programs (Video Gammon/USA, Mephisto Backgammon/W. Germany, and Saitek Backgammon/Netherlands) and two non-commercial programs (Backbrain/Sweden and AI Backgammon/USA). Hans Berliner's BKG program was not entered in the competition. In matches to 11 points, Neurogammon defeated Video Gammon by 12-7, Mephisto by 12-5, Saitek by 12-9, Backbrain by 11-4, and AI Backgammon by 16-1, to take the gold medal in the backgammon competition. Also, in unofficial matches to 15 points against two other commercial programs, Fidelity Backgammon Challenger and Sun Microsystems' Gammontool, Neurogammon won by scores of 16-3 and 15-8 respectively. There were also a number of unofficial matches against intermediate-level humans at the Olympiad. Neurogammon won three of these and lost one. Finally, in an official challenge match on the last day of the Olympiad, Neurogammon put up a good fight but lost to a human expert, Ossi Weiner of West Germany, by a score of 2-7. Weiner said that he was surprised at how much the program plays like a human, how rarely it makes mistakes, and that he had to play extremely carefully in order to beat it. In summary, Neurogammon's victory at the Computer Olympiad demonstrates, along with similar recent advances in fields such as speech recognition (Lippmann 1989) and optical character recognition (Le Cun et al. in press), that neural networks can be practical learning devices for tackling hard computational tasks. It also suggests that machine learning procedures of this type might be useful in other games. However, there is still much work to be done both in extracting additional information from the data sets within the existing approach, as well as in developing new approaches such as unsupervised learning based on outcome, which would supplement what can be achieved with supervised learning from expert data.
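The "comparison paradigm" mentioned above can be pictured with a toy sketch (our illustration only; Neurogammon's actual network, input encoding, and training details are given in Tesauro 1989 and are not reproduced here). A scoring function is adjusted whenever the expert's move fails to outscore a legal alternative:

import random

random.seed(0)
D = 8                                   # toy feature dimension (assumed)
w = [0.0] * D                           # a linear scorer stands in for the net

def score(move):
    return sum(wi * fi for wi, fi in zip(w, move))

def comparison_step(expert, alternatives, lr=0.1):
    # the expert's move must outscore every legal alternative;
    # when it does not, nudge the scorer in the separating direction
    for alt in alternatives:
        if score(expert) <= score(alt):
            for i in range(D):
                w[i] += lr * (expert[i] - alt[i])

pref = [random.uniform(-1, 1) for _ in range(D)]   # hidden "expert taste"
def expert_choice(moves):
    return max(moves, key=lambda m: sum(p * f for p, f in zip(pref, m)))

for _ in range(500):
    moves = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]
    best = expert_choice(moves)
    comparison_step(best, [m for m in moves if m is not best])

moves = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]
print(max(moves, key=score) is expert_choice(moves))   # should print True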
References

Tesauro, G., and Sejnowski, T.J. 1989. A parallel network that learns to play backgammon. Artificial Intelligence 39, 357-390.
Tesauro, G. 1988. Neural network defeats creator in backgammon match. Tech. Rep. no. CCSR-88-6, Center for Complex Systems Research, University of Illinois at Urbana-Champaign.
Tesauro, G. 1989. Connectionist learning of expert preferences by comparison training. In D. Touretzky, ed., Advances in Neural Information Processing Systems, 99-106. Morgan Kaufmann Publishers.
Rumelhart, D.E., et al. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Lippmann, R.P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (in press). Backpropagation applied to handwritten zip code recognition. Neural Computation.
Received 30 August 1989; accepted 30 August 1989.
Communicated by Christof Koch
Surface Interpolation in Three-Dimensional Structure-from-Motion Perception
Masud Husain, Stefan Treue, and Richard A. Andersen
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Although it is appreciated that humans can use a number of visual cues to perceive the three-dimensional (3-D) shape of an object, for example, luminance, orientation, binocular disparity, and motion, the exact mechanisms employed are not known (De Yoe and Van Essen 1988). An important approach to understanding the computations performed by the visual system is to develop algorithms (Marr 1982) or neural network models (Lehky and Sejnowski 1988; Siegel 1987) that are capable of computing shape from specific cues in the visual image. In this study we investigated the ability of observers to see the 3-D shape of an object using motion cues, so called structure-from-motion (SFM). We measured human performance in a two-alternative forced choice task using novel dynamic random-dot stimuli with limited point lifetimes. We show that the human visual system integrates motion information spatially and temporally (across several point lifetimes) as part of the process for computing SFM. We conclude that SFM algorithms must include surface interpolation to account for human performance. Our experiments also provide evidence that local velocity information, and not position information derived from discrete views of the image (as proposed by some algorithms), is used to solve the SFM problem by the human visual system.

1 Introduction
Recovering the three-dimensional (3-D) structure of a moving object from its two-dimensional (2-D) projection is considered an "ill-posed" problem (Poggio and Koch 1985) since there are an infinite number of interpretations of a given 2-D pattern of motion. Several elegant algorithms have been formulated for computing SFM, each using a number of constraints to restrict the number of possible solutions (Ullman 1979, 1984; Longuet-Higgins and Prazdny 1980; Hoffman 1982; Grzywacz and Hildreth 1987). None of them use surface interpolation. Rather these algorithms compute the relative position of isolated points. Existing schemes therefore
require that the tracked points on an object must be present continuously over the entire duration of the SFM computation. This leads to the powerful prediction that if the visual system is forced to sample a new set of points on the same object, the old set of points should not improve the perception of SFM since a new model of the object would have to be computed with each new set of sample points. An alternative approach to solving the SFM problem is to compute a 3-D surface representation by interpolating a surface between the sample points (Andersen and Siegel 1988). Such a scheme would sample the movement of as many points as possible across the surfaces of the object, and interpolate locally across these measurements to compute a continuous surface. New sets of points can easily be integrated into the representation and thereby improve its accuracy while the information of disappearing points is preserved in the interpolated surface. (We apply the term “surface interpolation” in a general way since the surface could be generated in physical as well as in velocity space.)
2 Experiments
In our experiments we examined how the human visual system performs the SFM computation when confronted with continuously changing sets of sample points. We used novel "structured" and "unstructured" dynamic random-dot stimuli with limited point lifetimes (Morgan and Ward 1980; Zucker 1984; Siegel and Andersen 1988). The structured stimulus was computed from the parallel projection of points covering the surface of a transparent rotating cylinder (Fig. 1). All subjects, whether naive or experienced, have reported the perception of a revolving hollow cylinder when viewing the structured display. The unstructured stimulus was generated by randomly displacing the velocity vectors present in the structured display within the boundaries of the stimulus, thereby conserving the population of vectors but destroying the spatial relationship between them (see Siegel and Andersen 1988). Each point was displayed for a "lifetime" of only 100 msec (7 frames), after which it was replotted randomly at another position on the surface of the cylinder. In the first frame, points were randomly assigned positions in their life cycle. Thus, between two given frames of the stimulus only about 15% of the points "died" and were randomly replotted ("desynchronized case"). Under these conditions, using a reaction time task, we have found that subjects detect the change from an unstructured to a structured display reliably (> 80% correct) but take as much as 900 msec to react, as shown in figure 2. This observation would suggest that the computation of SFM builds up over time and that new points can be integrated into the representation, which is partially computed by the old points. Unfortunately, it is not possible to determine from these data how much of this reaction time is needed as visual input and how much of it is
Figure 1: Schematic description of stimulus creation. All movies were created off-line on a PDP 11/73 computer that was also used to run the experiments. For the structured stimulus (E) 126 or 12 points were first randomly plotted on a two-dimensional surface (A). They were then parallel projected onto the surface of a transparent cylinder that was rotated at an angular rotation rate of 35°·sec⁻¹ (B). Each point existed for a predetermined point lifetime after which it was randomly repositioned. The moving points were then parallel projected onto the two-dimensional CRT screen (HP 1311B; P31 phosphor) (C) that was viewed by two highly trained observers (D) (ST and MH). The resulting velocity distribution in the structured stimulus is sinusoidal along any horizontal line across the stimulus, with the fastest speeds in the center of the display. The unstructured stimulus (F) was created by randomly shuffling the paths of the points in the structured display. Observers viewed the display binocularly from a distance of 57 cm; the stimuli subtended 6° of visual angle at the eye. The display rate was 70 Hz and mean luminance was 1 cd m⁻².
computation time in the brain or motor reaction time. This is an important question since performance should improve when the stimulus is seen for longer than the point lifetime if surface interpolation is used.
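The geometry in Figure 1 is easy to reproduce; the following sketch (our reconstruction from the caption, with the dot bookkeeping details assumed) generates structured-stimulus frames with limited, desynchronized point lifetimes:

import math, random

random.seed(2)
N_DOTS, LIFETIME = 126, 7            # 7 frames at 70 Hz, about 100 msec
OMEGA = math.radians(35.0)           # rotation rate, 35 degrees per second
DT = 1.0 / 70.0                      # display rate of 70 Hz

dots = [{"theta": random.uniform(0.0, 2 * math.pi),   # angle on the cylinder
         "y": random.uniform(-1.0, 1.0),              # height on the cylinder
         "age": random.randrange(LIFETIME)}           # desynchronized ages
        for _ in range(N_DOTS)]

def next_frame(dots):
    frame = []
    for d in dots:
        d["age"] += 1
        if d["age"] >= LIFETIME:     # the dot "dies" and is replotted
            d["theta"] = random.uniform(0.0, 2 * math.pi)
            d["y"] = random.uniform(-1.0, 1.0)
            d["age"] = 0
        d["theta"] += OMEGA * DT     # rigid rotation of the cylinder
        frame.append((math.sin(d["theta"]), d["y"]))  # parallel projection
    return frame

frames = [next_frame(dots) for _ in range(70)]        # one second of stimulus
print(len(frames), "frames of", len(frames[0]), "dots")

Since x = sin θ under parallel projection, horizontal dot speed is proportional to cos θ: sinusoidal across the display and fastest at the center, as the caption states.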
2.1 Perceptual Buildup and Surface Interpolation. In order to address this issue we presented equal numbers of structured and unstructured stimuli of 40 to 1700 msec duration in random order and asked subjects to indicate in a two-alternative forced-choice paradigm whether they saw a rotating cylinder or an unstructured noise pattern. Figure 3 (filled symbols) shows that performance peaked only after viewing stimuli longer than 5 times the point lifetime (> 500 msec), being hardly above chance after one point lifetime. Current algorithms (which do not use surface interpolation) would not have predicted improved performance when viewing stimuli of more than one point lifetime. It could be argued, however, that the visual system selects a number of points from the display and needs to track their relative positions as a group for the duration of their lifetime. Since in our stimulus the points are not synchronized it is very unlikely that all the points in such a group are at the same point in their life cycle, that is, their onsets and offsets do not occur at the same time.
[Figure 2; in-figure annotation: arrow at the point lifetime used for the 2AFC task (98 msec).]
Figure 2: Reaction time for detecting the transition from an "unstructured" to a "structured" cylinder. Observers were shown movies that started with the unstructured version of the cylinder, which after an unpredictable time changed into the structured display. The task was to press a key as soon as the structured cylinder was detected. (For further details see Siegel and Andersen 1988.) The arrow and dotted line indicate the point lifetime used for the two-alternative forced-choice experiments described in the text. The regression line is a best-fit third-order polynomial. Each data point represents the mean of about 100 trials.
[Figure 3; ordinate: % correct responses; abscissa: multiples of point lifetime; legend: 126 points, point lifetimes desynchronized / point lifetimes synchronized.]
Figure 3: Percentage accuracy in a two-alternative forced-choice paradigm plotted as a function of duration of display in multiples of point lifetime (point lifetime was kept at 100 msec). Observers were shown movies of different duration containing either the cylinder or the unstructured stimulus and were asked to distinguish between them. The dots in the display were either desynchronized (open symbols), or the onsets and offsets of all the dots were synchronized (filled symbols). Note that in both cases peak performance is not reached until over 5 times the lifetime of each point, that is, > 490 msec. The regression lines are best-fit fourth-order polynomials (r > 0.97 for both). Each data point represents the mean of 200 to 600 trials.

So, because groups of dots constantly form and dissolve, it might be argued that it simply takes a long time before one finds a group in which all the dots are "in phase." Therefore, we asked our subjects to view displays in which all the points appeared and disappeared together, that is, they were synchronized. Figure 3 (open symbols) shows that performance was indistinguishable from the desynchronized case. Another important consideration is that a surface interpolation may be used only when the high density of dots in the stimulus already perceptually constitutes a surface. Under different conditions, when the points are not dense enough to constitute an apparent surface by themselves, an alternative algorithm might be used. To investigate if the perceptual buildup we observed occurs only with a high density of dots, we decreased the number of points to less than a tenth of the original 126 points. Figure 4 (open symbols) shows that the time course using 12
points is even longer, with performance peaking only after more than 10 point lifetimes. To control for the possibility that the buildup in performance is not due to the presentation of new points but to some other effect we performed another experiment. We showed stimuli of the same duration in which the 12 points, after living through their first lifetime, were not randomly replotted but repositioned to the location they originally occupied at the beginning of the movie. They then moved through the same path as before and at the end of their lifetime were again replotted at their original starting position, thus beginning the cycle again. These “oscillating” stimuli therefore contained the same number of points with the same point lifetime as used in the previous experiment but after the passage of the first point lifetime they contained no new information. The results are plotted in figure 4 (filled symbols). It is evident that subjects did not perform above chance under these conditions. Thus, the visual system can improve its performance dramatically when presented with
[Figure 4; ordinate: % correct responses; abscissa: multiples of point lifetime; legend: 12 points, desynchronized lifetimes / "oscillating" stimulus.]
Figure 4: Percentage accuracy plotted as a function of display duration when 12 desynchronized points were used (point lifetime again 100 msec). Open symbols show the results from the experiment comparable to Figure 3. In this case perceptual buildup is more gradual and long-lasting. Peak performance is not reached until a stimulus length of more than 10 point lifetimes, that is, over 1 sec. The regression line is a best-fit third-order polynomial (r = 0.96). Filled symbols show the results from the experiment in which points were replotted to their original position at the end of their point lifetimes (for details see text).
new sets of points and this is not due to a requirement to view stimuli for an extended period of time. These results suggest that the brain uses surface interpolation in computing the shape of 3-D surfaces from motion. As predicted, the accuracy of the object representation rises to some maximum value with the presentation of new data points, and the performance of the system is not influenced by whether the points are synchronized or not (cf. Fig. 3). Moreover, given fewer points, it predictably takes longer to compute an accurate surface representation (cf. Fig. 4). As expected, the surface representation integrates information over space, since performance was better with larger numbers of points, and over time, since several point lifetimes were required for the computation.
2.2 Position- versus Velocity-Based Computation. A second issue is whether the visual system measures position or local velocities in computing SFM. Position-based algorithms sample position information derived from a few discrete image views of a moving object and attempt to reach a rigid 3-D interpretation from the 2-D sample frames (Ullman 1984; Grzywacz and Hildreth 1987; Grzywacz et al. 1988). Velocity-based algorithms measure the local velocities of points on an image and use the global velocity field to compute 3-D SFM (Longuet-Higgins and Prazdny 1980; Hoffman 1982; Grzywacz and Hildreth 1987). To date, neither position- nor velocity-based algorithms have used surface interpolation, and all velocity-based algorithms have used instantaneous velocity whereas the nervous system requires 50-80 msec to measure velocity accurately (McKee and Welch 1985; Nakayama 1985). A modified position-based scheme could incorporate measurements from new sets of points to improve performance by smoothing over the computed 3-D locations of points to interpolate a surface (E.C. Hildreth and S. Ullman, personal communications). However, there are several reasons to believe the nervous system uses a velocity-based algorithm with surface interpolation. In our displays the angular extent of the individual movements is quite small, approximately 3.5°, since they are of finite point lifetime. Position-based algorithms require large displacements of 30-50° (Grzywacz and Hildreth 1987). Other experimental support from our laboratory for the velocity-based surface scheme comes from the finding that the minimum point lifetime required for perceiving SFM (Treue et al. 1988) corresponds to the minimum viewing time required to measure accurately the velocity of a moving stimulus (McKee and Welch 1985). This correspondence is preserved with changes in stimulus velocity: the point lifetime threshold falls in parallel for both tasks as velocity is increased. This correlation is further strengthened by the fact that subjects can detect motion in our displays with point lifetimes lower than the ones required for comparative performance in detecting SFM, suggesting that the perception of motion per se is not sufficient but that an accurate velocity field has to be measured. Finally,
our laboratory as well as other investigators have shown that lesions of area MT, a region in primate visual cortex that contains neurons tuned to global stimulus direction and velocity (Movshon et al. 1985; Allman et al. 1985), impair perception of both coherent motion (Newsome and Paré 1988) and SFM (Siegel and Andersen 1986, 1988).

3 General Discussion
There are two possible levels at which a surface interpolation of the velocity field might occur. One is at a 2-D level in which the velocities of points moving on the 2-D retinal image are computed and an interpolation process fills in to form a dense 2-D velocity field from which a 3-D interpretation will be computed by a later process. In the second possibility the 3-D surface is immediately computed from the local 2-D velocities and the interpolation process operates on the 3-D image representation. At present we do not have evidence to distinguish between these two possibilities. A large number of algorithms for 2-D velocity measurement have been proposed that perform some velocity integration, averaging, or smoothing (Hildreth and Koch 1987; Horn and Schunck 1981; Zucker and Iverson 1986; Yuille and Grzywacz 1988; Bülthoff et al. 1989). Some of these algorithms have also been implemented in neural networks (Wang et al. 1989). Since all these algorithms integrate motion over local spatial neighborhoods they can account for a number of perceptual phenomena. Unfortunately, they cannot deal with transparent objects such as our rotating cylinder since vectors (with opposing direction) from the front and rear surface would be assigned to one surface, and the averaging of velocities over a patch would yield zero velocity. Evidently, an additional requirement for the successful application of these algorithms to transparent objects is the segregation of surfaces prior to the smoothing operation. For our stimulus, a simple solution is to assign motion in one direction to one surface. To investigate this issue we are presently recording from visual cortex in awake macaque monkeys. Preliminary results indicate that transparent motions in different directions are already separated at the level of V1 (Erickson et al. 1989).
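The failure mode is easy to see in one dimension (a toy sketch of our own, not one of the cited algorithms): averaging over a patch containing both surfaces cancels the motion, while averaging within direction-segregated populations recovers both velocities:

front = [+1.0, +0.9, +1.1, +1.0]     # rightward vectors (front surface)
rear = [-1.0, -1.1, -0.9, -1.0]      # leftward vectors (rear surface)
patch = front + rear

naive = sum(patch) / len(patch)      # smoothing over the whole patch
print("patch average: %.2f" % naive)              # near 0: motion cancels

right = [v for v in patch if v > 0]  # segregate by direction first,
left = [v for v in patch if v < 0]   # then average within each population
print("per-surface averages: %.2f and %.2f"
      % (sum(right) / len(right), sum(left) / len(left)))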
Acknowledgments

We are grateful to Shabtai Barash, Martyn Bracewell, Roger Erickson, Norberto Grzywacz, Ellen Hildreth, and Shimon Ullman for their comments on earlier drafts of this manuscript. This work was supported by grants from the NIH, the Sloan Foundation, and the Whitaker Health Sciences Foundation. M.H. is a Harkness Fellow and S.T. is a Fellow of the Evangelisches Studienwerk Villigst, F.R.G. and is supported by the Educational Foundation of America.
References
Allman, J., Miezin, F., and McGuinness, E. 1985. Stimulus specific responses from beyond the classical receptive field. Ann. Rev. Neurosci. 8, 407-430.
Andersen, R.A., and Siegel, R.M. 1989. Local and global order in perceptual maps. In Signal and Sense, G.M. Edelman, W.E. Gall, and W.M. Cowan, eds., in press. Wiley, New York.
Bülthoff, H., Little, J., and Poggio, T. 1989. A parallel algorithm for real-time computation of optical flow. Nature (London) 337, 549-553.
De Yoe, E.A., and Van Essen, D.C. 1988. Concurrent processing streams in monkey visual cortex. Trends Neurosci. 11, 219-226.
Erickson, R.G., Snowden, R.J., Andersen, R.A., and Treue, S. (in press). Directional neurons in awake rhesus monkeys: Implications for motion transparency. Soc. Neurosci. Abstr.
Grzywacz, N.M., and Hildreth, E.C. 1987. Incremental rigidity scheme for recovering structure from motion: Position-based versus velocity-based formulations. J. Opt. Soc. Am. A4, 503-518.
Grzywacz, N.M., Hildreth, E.C., Inada, V.K., and Adelson, E.H. 1988. The temporal integration of 3-D structure from motion: A computational and psychophysical study. In Organization of Neural Networks, W. von Seelen, G. Shaw, and U.M. Leinhos, eds., pp. 239-259. VCH, Weinheim.
Hildreth, E.C., and Koch, C. 1987. The analysis of visual motion: From computational theory to neuronal mechanisms. Ann. Rev. Neurosci. 10, 477-533.
Hoffman, D.D. 1982. Inferring local surface orientation from motion fields. J. Opt. Soc. Am. 72, 888-892.
Horn, B.K.P., and Schunck, B.G. 1981. Determining optical flow. Artificial Intelligence 17, 185-203.
Lehky, S.R., and Sejnowski, T.J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Longuet-Higgins, H.C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London Ser. B 208, 385-397.
Marr, D. 1982. Vision. Freeman, San Francisco.
McKee, S.P., and Welch, L. 1985. Sequential recruitment in the discrimination of velocity. J. Opt. Soc. Am. A2, 243-251.
Morgan, M.J., and Ward, R. 1980. Interocular delay produces depth in subjectively moving noise patterns. Q. J. Exp. Psychol. 32, 387-395.
Movshon, J.A., Adelson, E.H., Gizzi, M.S., and Newsome, W.T. 1985. The analysis of moving visual patterns. In Pattern Recognition Mechanisms (Exp. Br. Res. Suppl. 11), C. Chagass, R. Gattas, and C. Gross, eds., pp. 117-151. Springer-Verlag, Heidelberg.
Nakayama, K. 1985. Biological image motion processing: A review. Vision Res. 25, 625-660.
Newsome, W.T., and Paré, E.B. 1988. A selective impairment of motion perception following lesions of the middle temporal visual area (MT). J. Neurosci. 8, 2201-2211.
Poggio, T., and Koch, C. 1985. Ill-posed problems in early vision: From computational theory to analog networks. Proc. R. Soc. London Ser. B 226, 303-323.
Siegel, R.M. 1987. A parallel distributed processing model for the ability to obtain three-dimensional structure from visual motion in monkey and man. Soc. Neurosci. Abstr. 13, 630.
Siegel, R.M., and Andersen, R.A. 1986. Motion perceptual deficits following ibotenic acid lesions of the middle temporal area in the behaving rhesus monkey. Soc. Neurosci. Abstr. 12, 1183.
Siegel, R.M., and Andersen, R.A. 1988. Perception of three-dimensional structure from motion in monkey and man. Nature (London) 331, 259-261.
Treue, S., Husain, M., and Andersen, R.A. 1988. Human perception of 3-D structure from motion: Spatial and temporal characteristics. Soc. Neurosci. Abstr. 14, 1251.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Ullman, S. 1984. Maximizing rigidity: The incremental recovery of 3-D structure from rigid and nonrigid motion. Perception 13, 255-274.
Wang, H.T., Mathur, B., and Koch, C. 1989. Computing optical flow in the primate visual system. Neural Comp. 1, 92-103.
Yuille, A.L., and Grzywacz, N.M. 1988. A computational theory for the perception of coherent visual motion. Nature (London) 333, 71-74.
Zucker, S.W. 1984. Type I and Type II processes in early orientation selection. In Figural Synthesis, P.C. Dodwell and T. Caelli, eds., 283-300. Lawrence Erlbaum, London.
Zucker, S.W., and Iverson, L. 1986. From orientation selection to optical flow. Memo CIM-86-2, Computer Vision and Robotics Laboratory, McGill Research Center for Intelligent Machines.
Received 17 April 1989; accepted 16 June 1989.
Communicated by Gordon Shepherd
A Winner-Take-All Mechanism Based on Presynaptic Inhibition Feedback
Alan L. Yuille
Harvard University, Division of Applied Sciences, G12c Pierce Hall, Cambridge, MA 02138 USA
Norberto M. Grzywacz
Center for Biological Information Processing, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, E25-201, Cambridge, MA 02139 USA
A winner-take-all mechanism is a device that determines the identity and amplitude of its largest input (Feldman and Ballard 1982). Such mechanisms have been proposed for various brain functions. For example, a theory for visual velocity estimation (Grzywacz and Yuille 1989) postulates that a winner-take-all selects the strongest responding cell in the cortex's middle temporal area (MT). This theory proposes a circuitry that links the directionally selective cells in the primary visual cortex to MT cells, making them velocity selective. Generally, several velocity cells would respond, but only the winner would determine the perception. In another theory, a winner-take-all guides the spotlight of attention to the most salient image part (Koch and Ullman 1985). Also, such mechanisms improve the signal-to-noise ratios of VLSI emulations of brain functions (Lazzaro and Mead 1989). Although computer algorithms for winner-take-all mechanisms exist (Feldman and Ballard 1982; Koch and Ullman 1985), good biologically motivated models do not. A candidate for a biological mechanism is lateral (mutual) inhibition (Hartline and Ratliff 1957). In some theoretical mutual-inhibition networks, the inhibition sums linearly to the excitatory inputs and the result is passed through a threshold nonlinearity (Hadeler 1974). However, these networks work only if the difference between winner and losers is large (Koch and Ullman 1985). We propose an alternative network, in which the output of each element feeds back to inhibit the inputs to other elements. The action of this presynaptic inhibition is nonlinear with a possible biophysical substrate. This paper shows that the new network converges stably to a solution that both relays the winner's identity and amplitude and suppresses information on the losers with arbitrary precision. We prove these results mathematically and illustrate the effectiveness of the network and some of its variants by computer simulations.
Figure 1: The layout of the presynaptic inhibitory network. The dashed lines represent the inputs to the individual network elements. Each element excites inhibitory elements (similar to interneurons), which act on the presynaptic inputs of the other elements (no self-inhibition). Excitatory and inhibitory synapses are labeled with + and − signs, respectively.

Figure 1 illustrates our winner-take-all network, which consists of a number of elements that feed back to inhibit the presynaptic inputs of each other. Each element would receive a positive input, I_i, from previous neurons if the inhibition was not present. [The network does not have self-inhibition, which if present would lead to a gain control mechanism (Reichardt et al. 1983).] The network is updated by the following equation

τ dx_i/dt = −x_i + I_i K(x_1, x_2, …, x_{i−1}, x_{i+1}, …, x_n)    (1.1)

where x_i(t) is the state of the ith network element, τ is constant, and the function K is symmetric in all its variables, decreases (or remains
Alan L. Yuille and Norbert0 M. Grzywacz
336
constant) as they increase, and tends to zero when any of them goes to infinity. The initial values of the x_i are set to I_i. The first term on the right-hand side of equation 1.1 corresponds to a time decay and the second contains the inhibition, which is implemented by the function K. It is not necessary that every function K meeting the above criteria implements a winner-take-all network. Later, we discuss the effectiveness of some biologically motivated examples. Now, the following case is studied:

K(x_1, x_2, …, x_{n−1}) = e^{−λ(x_1 + x_2 + ⋯ + x_{n−1})}    (1.2)

where λ is constant. It is now shown that the network described by equation 1.1, using the function K given in equation 1.2, implements a winner-take-all operation. More precisely, we prove the following winner-take-all theorem:
Theorem 1. Given a system described by equations 1.1 and 1.2 with initial conditions x_i = I_i (we discuss alternative initial conditions later), if I_w = max_i I_i, then for sufficiently large λ (strictly speaking as λ → ∞), it follows that x_w → I_w and x_i → 0 (for i ≠ w) as the system tends to equilibrium.
Proof. The proof proceeds in three stages. We first show that if I_i > I_j, then x_i(t) > x_j(t). Next, we use this result to prove that if the system converges to an equilibrium state, then the theorem holds. Finally, we show that there is a Lyapunov function associated with equation 1.1, and hence the system must converge to an equilibrium state.
First, observe that the update equation preserves the ordering of the elements, that is, x_i(t) > x_j(t) if and only if I_i > I_j. To see this we notice that the ordering can be violated only if at some time x_i(t) = x_j(t). However, since I_i > I_j, this implies dx_i/dt > dx_j/dt from equation 1.1. Thus, it is impossible for the ordering to change. [Strictly speaking, we need only x_w(t) ≥ x_i(t) for all i ≠ w, for the rest of the proof to be correct.]
Second, we show that under the assumption that the system reaches equilibrium, the theorem holds. In equilibrium, dx_i/dt = 0 and the solutions obey

u_i e^{−u_i} = λ I_i e^{−Σ_j u_j}    (1.3)

where u_i = λx_i. The function u e^{−u} is shown in figure 2; it has a single maximum at u = 1. From equation 1.3 it follows that

(u_i e^{−u_i}) / (u_j e^{−u_j}) = I_i / I_j    (1.4)

Combining equation 1.4 with the ordering constraint shows that it is impossible to have more than one network element such that u_i > 1. In
fact, if u_i > u_j > 1, then u_i e^{−u_i} < u_j e^{−u_j} (Fig. 2) and I_i > I_j (by the ordering constraint), which is inconsistent with equation 1.4. Also, if λ is sufficiently large, then for at least one network element, u_i > 1 in equilibrium. In fact, taking the product of equation 1.3 for all i gives

e^{(N−1) Σ_i u_i} Π_i u_i = Π_i λ I_i    (1.5)
If u_i were less than 1 for all i, then the left-hand side of equation 1.5 would be bounded from above. The right-hand side, however, can be
Figure 2: The graph of the function u e^{−u} and its relation to the equilibrium state of the network. In the final state, the ratio of the values of this function for two network elements is independent of λ. If the final values u_i and u_j (with u_i > u_j) of the elements are less than 1 (the small λ case, labeled by S, s), then little suppression occurs and increasing the value of u_i will increase the value of u_j. If, however, the final value of u_i > 1 (for large λ, labeled by L, l), then suppression occurs, since increasing the value u_i causes the value of u_j to decrease and u_j → 0 as u_i → ∞.
made arbitrarily large by increasing λ. Thus, if λ is sufficiently large, then at least one u_i > 1. From this result, and the one after equation 1.4, it follows that for large λ there is only one network element such that u_i > 1: the one corresponding to the largest I_i. Thus, from equation 1.5, the winner's output is such that u_w → ∞ as λ → ∞. This implies that the losers' outputs go to zero, that is, u_i → 0, to maintain the validity of equation 1.4 (Fig. 2). This proves the theorem at equilibrium, because from equation 1.3 and the losers' condition (u_i → 0), the winner and the losers are such that x_w → I_w and x_i → 0, respectively.
Third, we demonstrate that a Lyapunov function exists for equation 1.1, and hence the system always settles into a final state solution of equation 1.1. It is convenient to perform a change of variables to z_i = e^{−λx_i}. Equation 1.1 transforms into
τ dz_i/dt = −z_i log z_i − λ I_i z_i Π_{j≠i} z_j    (1.6)

This can be rewritten as

τ dz_i/dt = −w_i(z_i) ∂E/∂z_i    (1.7)

where w_i(z_i) = λ I_i z_i > 0 and

E = Σ_j (z_j log z_j − z_j)/(λ I_j) + Π_j z_j    (1.8)
satisfies the properties of a Lyapunov function; it is bounded from below and always decreases with time. (The system can be thought of as performing a form of steepest descent in an energy landscape given by E.) This completes the proof of the winner-take-all theorem.
How large should λ be for the theorem to be practical? We derive two lower bounds for λ (necessary conditions). (It will be argued, supported by computer simulations, that if λ is slightly larger than these bounds, then winner-take-all occurs, so the conditions may be sufficient.) The first bound comes from the application of perturbation theory to equation 1.1 to find when a winner solution exists (without proving that the system finds it). The analysis gives necessary and sufficient conditions, λI_w e^{−λI_w} ≪ 1 and λ²I_w e^{−λI_w} (Σ_{i≠w} I_i) ≪ 1, for such a solution to occur. From these conditions a lower bound for λ can be computed. The solutions are of the form x_w = I_w − λI_w e^{−λI_w} Σ_{i≠w} I_i, and x_i = I_i e^{−λI_w} for i ≠ w.
To obtain the second bound, we investigate the relationship between u_w and λ at equilibrium. Assuming the ordering constraint, u_w(t) > u_i(t) (i ≠ w), equation 1.4 determines uniquely, and independently of initial conditions, u_i (i ≠ w) as a function of u_w (remember that u_i < 1).
Figure 3: The graph of λ as a function of u_w at equilibrium, assuming the ordering constraint, for three sets of inputs. These inputs were typical cases of a larger family of computer simulations. In these simulations, the inputs received values from a homogeneous random distribution ranging from zero to one. The graph points corresponding to when x_w = u_w/λ reaches 99% of I_w are labeled: The symbols 0, *, and % correspond to the 3, 6, and 10 elements curves, respectively. For the 3 elements curve, there is no local maximum and the network always has a unique solution. In this case, the network performs well if λ is sufficiently high to achieve, for example, a 99% precision. On the other hand, the 6 and 10 elements curves have local maxima at u_w ≈ 1. In the 10 elements case, setting a 99% precision λ is not good enough, because then three possible solutions for u_w exist. Computer simulations show that if initial conditions x_i = I_i or x_i = 0 are used, then the network converges to the lowest of the three values of u_w. It follows that for the network to perform well, λ must be chosen larger than the local maximum. For the 6 elements case, the local maximum is not so important. The λ for this case's 99% precision point is larger than the λ at the local maximum.
Substituting this into equation 1.3 yields

$$\lambda = \frac{u_w}{I_w}\, e^{\sum_{i \neq w} u_i(u_w)} \tag{1.9}$$
By differentiating equation 1.9 (and using equation 1.4), one sees that the sign of dλ/du_w is the sign of a quantity a(u_w) given by equation 1.10.
For u_w ≤ 1, a(u_w) is positive and λ(u_w) increases with u_w. For large u_w, the u_i become so small [u_i ≈ (I_i/I_w) u_w e^{-u_w}] that the third term of equation 1.10 dominates, making a(u_w) positive. However, for intermediate values of u_w, a(u_w) can change sign and thus λ(u_w) can have a local maximum. Inspection of λ(u_w), supported by computer simulations, shows that the chance of such a local maximum increases with the number of inputs (Fig. 3). (These simulations contained from three to a hundred input elements with random values ranging from zero to one.) Multiple local maxima seem unlikely from the form of a(u_w) and they have not been observed in computer simulations. A local maximum implies that, at equilibrium, u_w(λ) could "jump" at a critical value of λ.

Figure 4 (opposite): Simulation results for variants of the network. In all cases, the ordering constraint holds, so the winner is always correctly selected, though its amplitude may be reduced (Fig. 3). (a) The results of an exponential mechanism (equation 1.2) with λ = 100 and 7 input elements. The open and solid rectangles label the inputs I_i and the equilibrium states of the network elements, respectively. For this example, the amplitude of the winner is similar to the maximal input. (b) The same network as (a) but with 13 inputs. The amplitude of the winner is greatly reduced. (c) For the shunting network (equation 1.11) with (n, λ) = (4, 100), we plotted the precision of the winner-take-all solution (given by an index quantifying how well the winner gets to the maximal input and the losers to zero) as a function of the number of inputs (averaged over a representative sample of inputs randomly generated between zero and one). (d) A similar plot for the exponential network with λ = 100 (by the winner-take-all theorem, this network will get the correct result for sufficiently large λ). (e) A plot of the number (range) of inputs that different networks can cope with before their precision drops to 0.5. The labels A, B, C, D, E, and F correspond to shunting inhibition with (n, λ) = (1, 100), shunting inhibition with (n, λ) = (4, 100), exponential inhibition with λ = 10, exponential inhibition with λ = 100, inhibition with hyperbolic threshold with (λ, T₀, μ) = (100, 10, 1000), and inhibition with sharp threshold with (λ, T₀) = (100, 10), respectively. The full range of the sharp-threshold-inhibition network lies off this graph; theoretically, for the chosen parameters, this network is precise independently of the number of input elements. (f) The spatial falloff of the K function for some of the networks considered above. The threshold function (the dashed line) with λ = T₀ = 1 falls off quickest, the shunting function (dotted line) with n = λ = 1 falls off slowest, and the exponential with λ = 1 lies between them.
The computer simulations revealed that such "jumps" occur if the initial conditions are x_i = I_i or x_i = 0. The value of λ at the local maximum is our second lower bound. A close lower estimate for this bound, which is hard to compute directly, can be obtained as follows. Inspection of a(u_w) or λ(u_w), supported by computer simulations, shows that the local maximum occurs for u_w ≈ 1 (Fig. 3). Thus, calculating λ such that u_w = 1 gives a good estimate for the second bound.

The relative importance of each of these bounds depends on the input sets. When the sets are such that, at equilibrium, λ(u_w) has a single local maximum (Fig. 3), both bounds together represent a sufficient condition for a winner-take-all operation. If the minimum following the maximum is deep (the 10-element example in Fig. 3), then the second bound is the important one. Otherwise (six elements in Fig. 3), only the first bound matters. When there is no local maximum (three elements in Fig. 3), the first bound is used alone. If the maximal number of inputs and their range of values are known in advance, λ can be set to satisfy these bounds always. Otherwise, λ must increase with the number of inputs, or the network will break down (Fig. 4). From equation 1.9, one can show that the necessary λ depends exponentially on the number of inputs. Later, we show that inhibitory mechanisms other than that in equation 1.2 lead to a milder dependence.

The theorem holds for any initial conditions for which the ordering constraint is not violated, but some conditions are better than others. To see this, first note that the Lyapunov function may have several minima. Thus, a violation of the ordering constraint may prevent the system from finding the correct solution, sometimes even making it behave like a loser-take-all network! If, for example, the initial conditions of the system are set at x_i = 0, small random fluctuations in the system's elements are often sufficient to cause such violations. This noise sensitivity is reduced for the initial conditions, x_i = I_i, used in the theorem.

The initial conditions x_i = I_i may lead to solutions that are noise sensitive, a problem that can be partially solved by introducing time variation in λ. For large λ and these initial conditions, the x_i will initially become small, and thus sensitive to small random fluctuations. This occurs because, initially, the second term on the right-hand side of equation 1.1 is small. Allowing λ to vary with time gives a promising method to prevent this problem. The scenario is to start at t = 0 with λ = 0, which causes the system to converge to x_i = I_i independently of noise. We then increase λ slowly to a maximum value. The proof of the theorem shows that varying λ with time does not violate the ordering constraint. When λ attains its maximum value we can apply the theorem to prove convergence to the winner.

Time-varying inputs may pose a problem to our network, which a regular resetting of the states of the neurons might help to solve. The problem is due to violations of the ordering constraint as the inputs, and hence the winners, change. A regular resetting of the network's
initial conditions would solve this problem. This could be achieved by varying λ regularly from 0 to its maximum value. Each time λ = 0 the x_i will converge to the current inputs I_i and the correct ordering will be achieved.

We now ask what biophysical inhibitory mechanisms may implement a winner-take-all network as prescribed by equation 1.1. (Note that the I_i K term in equation 1.1 does not saturate with I_i, being analogous to a postsynaptic potential that is far from its reversal potential.) A candidate is shunting inhibition, which results from the reduction of the voltage generated by a unit of excitatory current due to a reduction in membrane conductance (Coombs et al. 1955). This mechanism is a form of inhibition based on division operations, which scales the input down (Torre and Poggio 1978; Grzywacz and Koch 1987; Amthor and Grzywacz 1989). Shunting inhibition may be important for biological networks, as, for example, in the retina (Torre and Poggio 1978; Grzywacz and Koch 1987; Amthor and Grzywacz 1989), hippocampus (Alger and Nicoll 1982), and visual cortex (Kemp 1984). For simplicity, assume that in the presynaptic input the excitatory conductance is small and the capacitance is negligible. Also, assume that n inhibitory transmitter molecules cooperate to induce inhibition. Then, based on the scheme of Figure 1, one may write

$$K(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N) = \frac{1}{1 + \lambda \sum_{j \neq i} x_j^n} \tag{1.11}$$
where λ Σ_{j≠i} x_j^n is the relative change in membrane conductance. This equation approximates equation 1.2 well for small λ Σ x_j and n = 1. However, theoretically, shunting inhibition works best for large conductance changes (Grzywacz and Koch 1987), which are consistent with experiments on retinal directional selectivity (Amthor and Grzywacz 1989). Furthermore, we showed in the theorem that the network's performance improves as λ increases. Thus, because the approximation of equation 1.11 by equation 1.2 may not be valid, we studied the shunting inhibition's effectiveness with computer simulations. The simulations show that shunting inhibition implements a winner-take-all mechanism if there are not many inputs, and the effectiveness increases as the synapse's input-output relationship power, n, increases (Fig. 4c and e). [Other authors have used shunting inhibition networks with postsynaptic feedback, which perform winner-take-all (Lazzaro and Mead 1989) or related operations (Elias and Grossberg 1975).] Strong inhibition for large inputs to the inhibitory synapse helps winner-take-all operations, as suggested by the improvement for large n and for the exponential inhibition (equation 1.2; Fig. 4c-e). Here strength is defined as the speed with which K decreases as a function of the x_j (Fig. 4f). A limiting case that makes this point more rigorous is an infinitely powerful inhibition with a sharp threshold:
$$K(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N) = \begin{cases} 1, & \text{if } \lambda \sum_{j \neq i} x_j < T_0 \\ 0, & \text{otherwise} \end{cases} \tag{1.12}$$
where T₀ > 0 is a constant. [A more continuous version of an inhibition with threshold is a hyperbolic threshold, which together with shunting inhibition may give K = [1 + e^{-(λ Σ_{j≠i} x_j - T₀)}]/[1 + μ + e^{-(λ Σ_{j≠i} x_j - T₀)}].] We now prove that if λ > T₀/I_w, then the inhibition expressed in equation 1.12, coupled with equation 1.1 with initial conditions x_i = I_i, generates a winner-take-all network. Similar to the winner-take-all theorem, there is an ordering constraint, that is, if I_i > I_j then x_i > x_j. It follows that the function K eventually becomes 1 for the winner. Otherwise, there is a contradiction, because if K = 0 always, then the winner and losers fall to 0, implying λ Σ_{j≠w} x_j → 0, eventually making λ Σ_{j≠w} x_j < T₀, and K becomes 1. Also, it follows from the ordering constraint that if K becomes 1 for the winner, then K never becomes 0 again. To see this, note that if x_i > x_j, then λ Σ_{k≠j} x_k > λ Σ_{k≠i} x_k. Thus, if K is about to become zero for the winner, then K is zero for the losers. At that moment, dx_i/dt < 0 for the losers, and so λ Σ_{i≠w} dx_i/dt < 0, implying λ Σ_{i≠w} x_i < T₀, and the winner keeps rising. Thus, from equation 1.1, for t → ∞, x_w → I_w, and the winner "kills" the losers by making K = 0 for them. The losers' elimination is guaranteed if λ > T₀/I_w, because in this case λ x_w → λ I_w > T₀. Computer simulations show that the inhibition in equation 1.12 (Fig. 4e) and a version that uses inhibition with hyperbolic threshold (Fig. 4e) give rise to good winner-take-all networks.

In conclusion, we propose a network that relays information on the identity and amplitude of its largest input with arbitrary precision. This winner-take-all network is based on presynaptic inhibition feedback where the allowable inhibitory mechanisms are biophysically plausible. Besides the biological motivation, another advantage of this network is spatial homogeneity, which does not hold for some other networks (Koch and Ullman 1985).

For the network to have arbitrary precision, its parameter λ must be sufficiently large; must it be so large that it is biologically unrealistic? We now argue that this is not the case. To know when λ is "too" large, one must consider its physiological meaning. The paper demonstrates that there are several neural mechanisms that may underlie the winner-take-all network. Consider, for example, the threshold mechanism of equation 1.12. The proof following this equation shows that "sufficiently large λ" means that the winner's activity alone is enough to make the excitatory inputs to the losers subthreshold. The neurobiological literature suggests that such inhibition is not exaggeratedly large. For example, in the motor control of swimming in leeches, inhibitory synapses can kill other neurons' activity single-handedly (Stent and Kristan 1981). For mechanisms different from that in equation 1.12, "sufficiently large λ" implies an inhibitory synapse that is stronger than the minimal
necessary to kill other neurons' activity. Figure 4 shows that the necessary strength depends on the specific mechanism, the number of inputs to the winner-take-all network, and its precision.

A complete experimental verification of the existence of such a network is very difficult with present techniques, but three specific predictions may already be tested. First, a pair of this network's cells would inhibit each other. [Mutual inhibition occurs in the interaction between synergist and antagonist groups of neurons in the motor system of mammals (Sherrington 1947) and invertebrates (Stent and Kristan 1981). Sherrington (1947) suggested that this mutual inhibition leads to "singleness of action," which strongly resembles the action of a winner-take-all mechanism.] Second, this mutual inhibition would have to be presynaptic. [Physiologists have found that presynaptic inhibition is common in the nervous system, as, for example, in the spinal cord (Eccles et al. 1961), retina (Masland et al. 1984), and hippocampus (Colmers et al. 1987). Also, this type of inhibition implies the existence of axoaxonic synapses, which appear in electron micrographs at numerous locations of the mammalian central nervous system (Schmidt 1971; Somogyi 1977; Saint Marie and Peters 1984).] Third, the inhibition in the network would be highly nonlinear, possibly with a threshold or of a shunting-inhibition type. [A new technique to investigate the linearity of inhibition in the visual system without intracellular recordings has been recently described (Amthor and Grzywacz 1989).]

Issues of the time dynamics of the network are interesting, but beyond the scope of this paper. In particular, we are investigating the stability of the network including time delays in the connections. Some results on the stability of inhibitory networks (Wyatt and Standley 1989) and analog neural networks with delay (Marcus and Westervelt 1989) have been developed. Preliminary computer simulations suggest that our network is stable, at least for small time delays (C.M. Marcus, personal communication).
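As a concrete illustration of these dynamics, the following minimal sketch (not the authors' code; the forward-Euler integration, step size, and input values are our own assumptions) integrates equation 1.1 with the exponential inhibition of equation 1.2, starting from the initial conditions x_i = I_i used in the theorem.

```python
import numpy as np

# A minimal sketch (not the authors' code): forward-Euler integration of
# equation 1.1 with the exponential inhibition of equation 1.2,
#     dx_i/dt = -x_i + I_i * exp(-lambda * sum_{j != i} x_j),
# from the initial conditions x_i = I_i assumed by the theorem.
# The step size, duration, and inputs are illustrative assumptions.
rng = np.random.default_rng(0)
I = rng.uniform(0.0, 1.0, size=10)  # inputs from [0, 1), as in the simulations
lam = 100.0                         # a "sufficiently large" lambda
x = I.copy()                        # initial conditions x_i = I_i
dt = 0.001
for _ in range(20000):              # integrate to t = 20
    inhibition = np.exp(-lam * (x.sum() - x))  # K_i = exp(-lambda * sum_{j != i} x_j)
    x += dt * (-x + I * inhibition)

print("winner:", int(I.argmax()), "x_w:", round(x[I.argmax()], 3))
print("largest loser output:", round(np.sort(x)[-2], 6))  # should be near zero
```

At equilibrium the winner's output approaches I_w while the losers' outputs approach zero, in line with the theorem.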
Acknowledgments

We thank Tomaso Poggio for critically reading the manuscript. Also, we are grateful to the action editor for emphasizing the importance of the initial conditions for the success of the network. A.L.Y. was supported by the Brown-Harvard-MIT Center for Intelligent Control Systems under United States Army Research Office Grant DAAL03-86-K-0171. N.M.G. was supported by Grant BNS-8809528 from the National Science Foundation, by the Sloan Foundation, by a grant to Tomaso Poggio and Ellen Hildreth from the Office of Naval Research, Cognitive and Neural Systems Division, and by Grant IRI-8719394 to Tomaso Poggio, Ellen Hildreth, and Edward Adelson from the National Science Foundation.
References
Alger, B.E., and Nicoll, R.A. 1982. Feed-forward dendritic inhibition in rat hippocampal pyramidal cells studied in vitro. J. Physiol. 328, 105-123.
Amthor, F.R., and Grzywacz, N.M. 1989. Retinal directional selectivity is accounted for by shunting inhibition (submitted for publication).
Colmers, W.F., Lukowiak, K., and Pittman, Q.J. 1987. Presynaptic action of neuropeptide Y in area CA1 of the rat hippocampal slice. J. Physiol. 383, 285-299.
Coombs, J.S., Eccles, J.C., and Fatt, P. 1955. The inhibitory suppression of reflex discharges from motoneurones. J. Physiol. 130, 396-413.
Eccles, J.C., Eccles, R.M., and Magni, F. 1961. Central inhibitory action attributable to presynaptic depolarization produced by muscle afferent volleys. J. Physiol. 159, 147-166.
Elias, S.A., and Grossberg, S. 1975. Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biol. Cybern. 20, 69-98.
Feldman, J.A., and Ballard, D.H. 1982. Connectionist models and their properties. Cog. Sci. 6, 205-254.
Grzywacz, N.M., and Koch, C. 1987. Functional properties of models for direction selectivity in the retina. Synapse 1, 417-434.
Grzywacz, N.M., and Yuille, A.L. 1989. A model for the estimate of local image velocity by cells in the visual cortex (submitted for publication).
Hadeler, K.P. 1974. On the theory of lateral inhibition. Kybernetik 14, 161-165.
Hartline, H.K., and Ratliff, F. 1957. Inhibitory interaction of receptor units in the eye of Limulus. J. Gen. Physiol. 40, 357-376.
Kemp, J.A. 1984. Intracellular recordings from rat visual cortex cells in vitro and the action of GABA. J. Physiol. 349, 13P.
Koch, C., and Ullman, S. 1985. Selecting one among the many: A simple network implementing shifts in selective visual attention. Human Neurobiol. 4, 219-227.
Lazzaro, J., and Mead, C.A. 1989. A silicon model of auditory localization. Neural Comp. 1, 47-57.
Marcus, C.M., and Westervelt, R.M. 1989. Dynamics of analog neural networks with time delay. In Advances in Neural Information Processing Systems 1, D.S. Touretzky, ed., pp. 568-576. Morgan Kaufmann, San Mateo, CA.
Masland, R.H., Mills, J.W., and Cassidy, C. 1984. The functions of acetylcholine in the rabbit retina. Proc. R. Soc. London Ser. B 223, 121-139.
Reichardt, W.E., Poggio, T., and Hausen, K. 1983. Figure-ground discrimination by relative movement in the visual system of the fly. Part II: Towards the neural circuitry. Biol. Cybern. 46, 1-30.
Saint Marie, R.L., and Peters, A. 1984. The morphology and synaptic connections of spiny stellate neurons in monkey visual cortex (area 17): A Golgi-electron microscopic study. J. Comp. Neurol. 233, 213-235.
Schmidt, R.F. 1971. Presynaptic inhibition in the vertebrate central nervous system. Ergeb. Physiol. 63, 20-101.
Sherrington, C.S. 1947. The Integrative Action of the Nervous System. Yale University Press, New Haven.
Somogyi, P. 1977. A specific "axo-axonal" interneuron in the visual cortex of the rat. Brain Res. 136, 345-350.
Stent, G.S., and Kristan, W.B. 1981. In Neurobiology of the Leech, K.J. Muller, J.G. Nicholls, and G.S. Stent, eds., pp. 113-146. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. London Ser. B 202, 409-416.
Wyatt, J.L., Jr., and Standley, D. 1989. Criteria for robust stability in a class of lateral inhibition networks coupled through resistive grids. Neural Comp. 1, 58-67.
Received 14 March 1989; accepted 16 June 1989.
Communicated by Andrew Barto
An Analysis of the Elastic Net Approach to the Traveling Salesman Problem
Richard Durbin* King's College Research Centre, Cambridge CB2 1ST, England
Richard Szeliski Artificial Intelligence Center, SRI International, Menlo Park, CA 94025 USA
Alan Yuille Division of Applied Sciences, Harvard University, Cambridge, MA 02138 USA
This paper analyzes the elastic net approach (Durbin and Willshaw 1987) to the traveling salesman problem of finding the shortest path through a set of cities. The elastic net approach jointly minimizes the length of an arbitrary path in the plane and the distance between the path points and the cities. The tradeoff between these two requirements is controlled by a scale parameter K. A global minimum is found for large K, and is then tracked to a small value. In this paper, we show that (1) in the small K limit the elastic path passes arbitrarily close to all the cities, but that only one path point is attracted to each city, (2) in the large K limit the net lies at the center of the set of cities, and (3) at a critical value of K the energy function bifurcates. We also show that this method can be interpreted in terms of extremizing a probability distribution controlled by K. The minimum at a given K corresponds to the maximum a posteriori (MAP) Bayesian estimate of the tour under a natural statistical interpretation. The analysis presented in this paper gives us a better understanding of the behavior of the elastic net, allows us to better choose the parameters for the optimization, and suggests how to extend the underlying ideas to other domains. 1 Introduction The traveling salesman problem (Lawler et al. 1985) is a classical problem in combinatorial optimization. The task is to find the shortest possible tour through a set of N cities that passes through each city exactly once. This problem is known to be NP-complete, and it is generally believed *Current address: Department of Psychology, Stanford University, Stanford, CA 94305 USA.
Neural Computation 1, 348-358 (1989) © 1989 Massachusetts Institute of Technology
that the computational power needed to solve it grows exponentially with the number of cities. In this paper we analyze a recent parallel analog algorithm based on an elastic net approach (Durbin and Willshaw 1987) that generates good solutions in much less time. This approach uses a fast heuristic method with a strong geometrical flavor that is based on the tea trade model of neural development (Willshaw and von der Malsburg 1979). It will work in a space of any dimension, but for simplicity we will assume the two-dimensional plane in this paper. Below we briefly review the algorithm. Let {X_i}, i = 1 to N, represent the positions of the N cities. The algorithm manipulates a path of points in the plane, specified by {Y_j}, j = 1 to M (M larger than N), so that they eventually define a tour (that is, eventually each city X_i has some path point Y_j converge to it). The path is updated each time step according to
$$\Delta Y_j = \alpha \sum_i w_{ij}\,(X_i - Y_j) + \beta K\,(Y_{j+1} + Y_{j-1} - 2Y_j)$$
where e-lx,-Y,12/2K2 lPl,
=
-yLp - l x , - Y L 1 2 / 2 k ~ ~
j j are constants, and K is the scale parameter. Informally, the n term pulls the path toward the cities, so that for each X I there is at least one YJ within distance approximately h-.The /7 term pulls neighboring path points toward each other, and hence tries to make the path short. The update equations are integrable, so that AY, = -KdE/dYJ for an “energy“ function, E , given by
o and
$$E(\{Y_j\}, K) = -\alpha K \sum_i \log \sum_j e^{-|X_i - Y_j|^2/2K^2} + \beta \sum_j |Y_j - Y_{j+1}|^2 \tag{1.1}$$
For fixed K the path will converge to a (possibly local) minimum of E. At large values of K the energy function is smoothed and there is only one minimum. At small values of K, the energy function contains many local minima, all of which correspond to possible tours of the cities (we prove this later in the paper), and the deepest minimum is the shortest possible tour. The algorithm proceeds by starting at large K, and gradually reducing K, keeping to a local minimum of E (see Fig. 1). We would like this minimum that is tracked to remain the global minimum as K becomes small. Unfortunately, this cannot be guaranteed (see Section 3). The elastic net approach is similar to a number of previously developed algorithms that use elastic matching (Burr 1981), energy-based matching (Terzopoulos et al. 1987), or topographic mapping (Kohonen 1988) to solve vision, speech, and neural development problems. Alternative parallel analog algorithms have also recently been proposed for solving the traveling salesman problem (Hopfield and Tank 1985; Angéniol et al. 1988). The method of Angéniol et al. is closely related
to that discussed here, but is based on Kohonen's self-organization algorithm (Kohonen 1988). It is faster, but for large problems it is marginally less accurate than the elastic method. An important contribution of this paper is to analyze the behavior of the energy function as the constant K changes and to describe the energy landscape. In particular, we prove results about the behavior of the function for large and small K, confirming the assertions made above about how the algorithm works.
Figure 1: The convergence of the network as a function of K. The white and black squares represent the data (10 cities) and the network (25 path points), respectively. The six figures (a-f) show the configuration found by the network at values K = 0.261, 0.26, 0.21, 0.20, 0.12, 0.04. The data set is centered on (0.49, 0.46) and has second-order moments (K_xx, K_xy, K_yy) = (0.75, -0.23, 0.70). We use α = 0.2 and β = 1.0. The first bifurcation, when the origin becomes unstable, can be calculated to occur at K = 0.2606 (see Section 5), in agreement with the simulation. The second break, when the line spreads into a loop, occurs between K = 0.21 and K = 0.20 for the simulation. This corresponds well to the point at which the second eigenvalue becomes negative (K = 0.196). The correspondence is not exact because nonlinear terms become significant after the first break.
First, however, we show that minimizing this cost function can be interpreted in terms of maximizing a probability distribution.

2 The Probabilistic Interpretation
The energy function (1.1) can be related to a probability distribution by exponentiation; this is analogous to the use of the Gibbs distribution in statistical mechanics:

$$L(\{Y_j\}) = e^{-E(\{Y_j\},K)/\alpha K} = \prod_i \left(\sum_j e^{-|X_i - Y_j|^2/2K^2}\right) e^{-(\beta/\alpha K)\sum_j |Y_j - Y_{j+1}|^2}.$$
Observe that minimizing E with respect to {Y_j} corresponds to maximizing L with respect to {Y_j}. We can interpret L in terms of Bayes' theorem, which states that

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)} \tag{2.1}$$
where P(Y|X) is the probability of a tour (Y) given a set of cities (X). Our algorithm maximizes P(Y|X) over all possible tours (Y), so the value of P(X) is irrelevant. The distribution

$$P(Y) \propto e^{-(\beta/\alpha K)\sum_j |Y_j - Y_{j+1}|^2} \tag{2.2}$$
is the a priori probability of a given tour. This distribution is a correlated gaussian that assigns greater prior probability to shorter tours. The distribution
$$P(X|Y) \propto \prod_{i=1}^{N} \sum_j e^{-|X_i - Y_j|^2/2K^2} \tag{2.3}$$

is the probability of the cities being at (X) given that the tour points are at (Y). P(X|Y) is the product of N independent probability distributions,

$$P(X_i|Y) = \frac{1}{M} \sum_j \frac{1}{2\pi K^2}\, e^{-|X_i - Y_j|^2/2K^2}.$$
This equation is equivalent to assuming that the measured position of city X_i was actually derived from one of the tour points in {Y_j} with a two-dimensional gaussian error of variance K², but without knowing which tour point X_i corresponds to. Thus equation 2.1 shows that the elastic net algorithm is computing the
Bayesian MAP estimate) given a prior model (2.2) that favors short tours and a sensor model (2.3) with two-dimensional position uncertainty. Our method thus has an obvious extension to surface interpolation and three-dimensional surface modeling where the correspondence between surface points and measured data points is unknown.

3 Tracking a Minimum
The algorithm devised by Durbin and Willshaw minimizes E at large K and then tracks the minimum energy solution down to small K. At a local minimum (or any extremum), we have
$$\frac{\partial E}{\partial Y_j^\mu}(\{Y_j\}, K) = 0,$$
where μ is 1 or 2, and Y_j^1 and Y_j^2 are the x and y components of the position vector Y_j. As we follow the extrema as K changes we get the equation

$$\frac{d}{dK}\left(\frac{\partial E}{\partial Y_j^\mu}\right) = 0,$$
which becomes

$$\sum_{l,\nu} \frac{\partial^2 E}{\partial Y_j^\mu \partial Y_l^\nu}\,\frac{dY_l^\nu}{dK} + \frac{\partial^2 E}{\partial Y_j^\mu \partial K} = 0.$$
To obtain the trajectory we must solve this equation for dY_j^μ/dK. When we are at a true minimum, the Hessian ∂²E/∂Y_j^μ∂Y_l^ν is a positive definite matrix and can be inverted, enabling us to compute dY_j^μ/dK. Bifurcations occur when the Hessian has zero eigenvalues. In this case dY_j^μ/dK is underdetermined and there are several possible solutions. Computer simulations and our calculations in the large K limit (see below) show that such a bifurcation occurs as the tour initially spreads out from a point. After this, our simulations suggest that the minimum tracks smoothly with K. Other minima of the energy function also appear as K is reduced. For the configuration shown in Figure 1, the number of minima increases rapidly from 1 at K = 0.12 to 3 at K = 0.10, 9 at K = 0.08 (shown in Fig. 2), and very many at K < 0.05. The minimum found by tracking K from large to small values is not necessarily the optimal (global) minimum (Fig. 2). Nevertheless, empirically the minima found are within a few percent of optimal (Durbin and Willshaw 1987). One possible improvement would be to pick up and track nearby minima by local random perturbation as in simulated annealing (Kirkpatrick et al. 1983).
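The tracking procedure can be made concrete with a short sketch (not the authors' implementation; the annealing rate, relaxation count, and random city set are our own assumptions) that iterates the update equation for ΔY_j while K is slowly reduced:

```python
import numpy as np

# Minimal sketch (not the authors' code) of the elastic net update
#   dY_j = alpha * sum_i w_ij (X_i - Y_j) + beta * K * (Y_{j+1} + Y_{j-1} - 2 Y_j),
# with K reduced geometrically, tracking a minimum as described above.
rng = np.random.default_rng(1)
N, M = 10, 25                            # cities and path points, as in Figure 1
X = rng.uniform(0.0, 1.0, (N, 2))
X -= X.mean(axis=0)                      # center the cities so the origin is an extremum
alpha, beta = 0.2, 1.0                   # parameter values quoted in the Figure 1 caption
Y = 0.01 * rng.standard_normal((M, 2))   # start the ring as a small blob near the center

K = 0.3                                  # start above the first bifurcation
while K > 0.01:
    for _ in range(25):                  # relax toward the tracked minimum at this K
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # |X_i - Y_j|^2
        w = np.exp(-d2 / (2 * K * K))
        w /= w.sum(axis=1, keepdims=True)                     # w_ij, normalized over j
        tension = np.roll(Y, -1, 0) + np.roll(Y, 1, 0) - 2 * Y
        Y += alpha * (w[:, :, None] * (X[:, None, :] - Y[None, :, :])).sum(0) \
             + beta * K * tension
    K *= 0.95                            # anneal the scale parameter

tour = np.argmin(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
print("city -> nearest path point:", tour)
```

Because the ring is closed, np.roll implements the Y_{j±1} neighbor terms; the final tour is read off by assigning each city to its nearest path point, in the spirit of the small-K analysis of Section 4.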
Figure 2: Possible minima of the energy function for the data in Figure 1. To investigate the energy minima we started 1000 simulations at K = 0.08 with random initial configurations, hence without using the solutions for larger K as initial conditions. In cases that lead to a sensible tour on subsequent slow reduction of K, the lines joining the white boxes (data points) show the tour found by the network. We found nine distinct groups of minima with the following frequencies: (a) 260, (b) 183 (tour length 3.356), (c) 140 (tour length 3.288), (d) 130 (tour length 3.350), (e) 95, (f) 72, (g) 48, (h) 36 (tour length 3.420), and (i) 36. We have probably found all the major forms of minima, since the least frequent happened 36 times. Note that the most frequent pattern (a) is neither the one obtained from tracking K (b) nor the optimal one (c).
4 The Small K Limit
We now consider the behavior of the extrema, and of the Hessian at these extrema, as K → 0. We will show that the only stable extrema occur
when each {X_i} has at least one {Y_j} arbitrarily close to it. Thus, the extrema all correspond to possible tours. For any given K let

$$B(K) = \max_i \min_j |X_i - Y_j|.$$

Then, for the city achieving this maximum,

$$\sum_j e^{-|X_i - Y_j|^2/2K^2} \le M\, e^{-B^2(K)/2K^2},$$

and hence

$$E \ge \frac{\alpha B^2(K)}{2K} - \alpha N K \log M.$$

Thus, for the energy to be bounded we must have

$$\lim_{K \to 0} \frac{\alpha B^2(K)}{2K} = C,$$
where C is a constant, and hence B(K) = O(K^{1/2}). Thus, in the limit as K → 0, configurations with unmatched Xs will have arbitrarily high energy, and so will not be found by the algorithm. This means that the minima will all correspond to possible tours. Although all the Xs are matched, there is no requirement that all the Ys are matched. Indeed, with correct choice of parameters α and β it can be shown that there will be only one tour point at each city. The remaining tour points space themselves evenly in the intercity intervals. A sufficient requirement on the parameters for this to happen is that

$$\Delta > \alpha/\beta,$$
where Δ is the shortest distance from a tour point being attracted to some city to any tour point not being attracted to that city. To derive this condition on the parameters, consider a single city situated at the origin. Define w_j by

$$w_j = \frac{e^{-|Y_j|^2/2K^2}}{\sum_k e^{-|Y_k|^2/2K^2}}.$$
Assume the Y_j are at equilibrium. We wish to consider the stability of an equilibrium in which there are two w_j that stay significantly nonzero as K → 0 (to have a single tour point converge to each city we require instability). We can choose K sufficiently small that there is no significant interaction between these two tour points and any other cities. Consider a small perturbation Y_j → Y_j + s_j,
where ε = max_j(|s_j|). In the limit of small K we can ignore the β term as well as the higher order ε terms. Clearly then for each j, s_j = f_j Y_j for some scalar f_j. Let μ = -λK²/α, and consider the case where only w₁ and w₂ are significant. This leads to the eigenvalue equation

$$\mu \begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} w_1 (K^2 - w_2 Y_1^2) & -w_1 w_2 Y_1 Y_2 \\ -w_1 w_2 Y_1 Y_2 & w_2 (K^2 - w_1 Y_2^2) \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \end{pmatrix}.$$
The criterion for instability is that at least one λ is positive, and hence that the corresponding μ is negative. For large K both eigenvalues μ₁ and μ₂ are clearly positive. Hence for instability we require that K should be below the value at which one eigenvalue (and hence their product) goes negative. The product is
=
UJ11U2
[(K’- U12q2)(K2- 201Y;)
= WlUI2K2 [K’
- (I1“y:
- Ui11I)2Y,Y2]
+ w,Y;)]
Therefore we require that

$$K^2 < w_2 Y_1^2 + w_1 Y_2^2.$$
Since w1+ 711’ = 1 we will be safe if min lY, I > K . At equilibrium at small h’ we have Y, = (i?K/aur,)A, where A, = Y,+, + Y,-l. When, as will be usual, Y1 and Y2 are neiGhbors, then /All is just the distance to the next path point not converging on the city, and a sufficient requirement is that this distance must be greater than o / f f . Since an average tour on Ar cities has length (N/2)’i2 a safe estimate for A,,, would be 0.2(N/2)1’2/Af. Alternatively one could choose (I to be a decreasing function of K,such as K1’ where p is a fixed exponent between zero and one. 5 At Large
K
For large Ii, the energy function (1.1) has a minimum corresponding to the net lying at the center of the cities. At a critical value of K this mini-
minimum becomes unstable and the system bifurcates. The initial movement of the net depends only on the second-order moments of the cities. To show this, we first calculate the first and second derivatives of E with respect to Y:
$$\frac{\partial E}{\partial Y_k^\mu} = -\frac{\alpha}{K} \sum_{i=1}^{N} w_{ik}\,(X_i^\mu - Y_k^\mu) + 2\beta\,(2Y_k^\mu - Y_{k+1}^\mu - Y_{k-1}^\mu) \tag{5.1}$$

$$\begin{aligned}
\frac{\partial^2 E}{\partial Y_k^\mu \partial Y_l^\nu} ={}& \frac{\alpha}{K}\,\delta^{\mu\nu}\delta_{kl}\sum_{i=1}^{N} w_{ik}
+ \frac{\alpha}{K^3}\sum_{i=1}^{N} \frac{(Y_k^\mu - X_i^\mu)(Y_l^\nu - X_i^\nu)\, e^{-|X_i-Y_k|^2/2K^2}\, e^{-|X_i-Y_l|^2/2K^2}}{\left(\sum_j e^{-|X_i-Y_j|^2/2K^2}\right)^2} \\
&- \frac{\alpha}{K^3}\sum_{i=1}^{N} \frac{(Y_k^\mu - X_i^\mu)(Y_k^\nu - X_i^\nu)\, e^{-|X_i-Y_k|^2/2K^2}\,\delta_{kl}}{\sum_j e^{-|X_i-Y_j|^2/2K^2}}
+ 2\beta\,\delta^{\mu\nu}(2\delta_{kl} - \delta_{k,l+1} - \delta_{k,l-1})
\end{aligned} \tag{5.2}$$
By substituting Y_j = 0 into (5.1) we find that

$$\frac{\partial E}{\partial Y_k^\mu} = \frac{\alpha}{K} \sum_{i=1}^{N} \frac{(-X_i^\mu)\, e^{-|X_i|^2/2K^2}}{M\, e^{-|X_i|^2/2K^2}} = -\frac{\alpha}{MK} \sum_{i=1}^{N} X_i^\mu.$$
The origin is thus always an extremum, provided that it is chosen at the center of the Xs (i.e., Σ_{i=1}^N X_i = 0). To show that the center is a minimum for very large K, we must calculate the eigenvalues of the Hessian. As K decreases, this minimum becomes unstable and a bifurcation occurs. Knowing the value of K at which this occurs will give us a useful starting value for K when we are running the elastic net algorithm. At the origin the Hessian can be written as

$$\frac{\partial^2 E}{\partial Y_k^\mu \partial Y_l^\nu} = \frac{\alpha N}{MK}\,\delta^{\mu\nu}\delta_{kl} + \frac{\alpha}{M^2 K^3} \sum_i X_i^\mu X_i^\nu - \frac{\alpha}{MK^3}\,\delta_{kl} \sum_i X_i^\mu X_i^\nu + 2\beta\,\delta^{\mu\nu}(2\delta_{kl} - \delta_{k,l+1} - \delta_{k,l-1}). \tag{5.3}$$
For large K, the eigenvalues of the Hessian are clearly all positive. By inspection of equation 5.2, we see that throughout the region |Y_j| ≪ K the Hessian is positive definite and so the origin is a unique minimum. For small K, the dominant terms (the second and third terms on the right-hand side of 5.2) have negative trace, so there are some negative eigenvalues. Thus, the origin is a stable state for large K but then becomes unstable as K decreases. To see how this occurs we must explicitly calculate the eigenvectors.
If we compute the eigenvalues of the Hessian (Durbin et al. 1989), we find that the smallest eigenvalue is

$$\lambda_{\min} = \frac{\alpha}{MK}\left(N - \frac{\lambda}{K^2}\right) + 8\beta \sin^2(\pi/M), \tag{5.4}$$

where λ is the principal eigenvalue of the city covariance matrix. The center then becomes unstable and breaks at the K such that λ_min = 0. This can be calculated from equation 5.4. Since the eigenvectors depend only on the second-order moments of the distribution of the cities, the global minimum for K just below the critical value will also depend chiefly on the second-order moments. As K decreases further the higher order moments will become important. These theoretical results are confirmed by computer simulations (see Fig. 1). The net stays at the origin until the critical value of K and then forms a line along the principal axis of the city distribution. Near the second critical value of K (when the eigenvalue determined by substituting the minor eigenvalue of the city covariance matrix into equation 5.4 becomes negative) the line spreads into a loop.

6 Conclusion
In summary, we have obtained several theoretical results concerning the elastic net method. First, we have shown how the elastic net solution can be interpreted as a maximum a posteriori estimate of an unknown tour (circular curve), where some points along the tour have been measured with gaussian uncertainty in position. Second, we have proved that for small K every point X_i is matched, and that each point must be within O(K^{1/2}) of a tour point. Third, we have found a condition on the parameters α, β under which each city becomes matched by only one tour point. Fourth, we have shown that at large K, a single minimum exists for the energy function, with all of the tour points lying at the center of gravity of the cities. Fifth, we have shown how to calculate the bifurcation points for the elastic net as K is reduced. The first result is particularly interesting since it suggests that this approach can be applied to other interpolation, approximation, and matching problems (such as surface interpolation in computer vision). The important feature here is that we do not need to prespecify which model point matches a data point, allowing "slippery" matching. The second result proves the "correctness" of the elastic net method, in that any final solution must be a valid tour. The third, fourth, and fifth results can be used in selecting parameter values, a starting configuration for the net, and a starting value for K. The elastic net method that we have analyzed provides a simple, effective, and intuitively satisfying algorithm for generating good traveling salesman tours. We believe that similar continuation-based algorithms can be applied to a wide range of optimization and approximation problems.
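As a check on the fifth result, the following sketch (our own, taking the parameter values quoted in the Figure 1 caption as assumptions) solves λ_min = 0 in equation 5.4 for the first bifurcation point:

```python
import numpy as np
from scipy.optimize import brentq

# Our own sketch: find the K at which the origin becomes unstable by solving
# lambda_min(K) = 0 from equation 5.4,
#   lambda_min = (alpha / (M K)) (N - lam / K^2) + 8 beta sin^2(pi / M),
# where lam is the principal eigenvalue of the city covariance matrix.
# Parameter values are those quoted in the Figure 1 caption.
alpha, beta, N, M = 0.2, 1.0, 10, 25
C = np.array([[0.75, -0.23], [-0.23, 0.70]])   # second-order moments of the cities
lam = np.linalg.eigvalsh(C).max()              # principal eigenvalue

def lambda_min(K):
    return (alpha / (M * K)) * (N - lam / K**2) + 8 * beta * np.sin(np.pi / M) ** 2

K_c = brentq(lambda_min, 0.1, 1.0)             # root-find the critical K
print(round(K_c, 4))                           # approximately 0.2606
```

The root is K ≈ 0.2606, matching the critical value quoted in the Figure 1 caption; substituting the minor eigenvalue of the covariance matrix instead locates the second break.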
References

Angéniol, B., de La Croix Vaubois, G., and Le Texier, J.-Y. 1988. Self-organizing feature maps and the travelling salesman problem. Neural Networks 1, 289-293.
Burr, D.J. 1981. Elastic matching of line drawings. IEEE Trans. Pattern Anal. Machine Intelligence PAMI-3 6, 708-713.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An Analysis of the Elastic Net Approach to the Travelling Salesman Problem. Tech. Rep., Harvard University.
Durbin, R., and Willshaw, D.J. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Hopfield, J.J., and Tank, D.W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G., and Shmoys, D.B. (eds). 1985. The Travelling Salesman Problem. Wiley, New York.
Kirkpatrick, S., Gelatt, C.D., Jr., and Vecchi, M.P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kohonen, T. 1988. Self-Organization and Associative Memory, 2nd Ed. Springer-Verlag, Berlin.
Terzopoulos, D., Witkin, A., and Kass, M. 1987. Symmetry-seeking models for 3D object recognition. Proc. First Int. Conf. Computer Vision, London, June 1987.
Willshaw, D.J., and von der Malsburg, C. 1979. A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Phil. Trans. R. Soc. Ser. B 287, 203-243.
Received 15 March 1989; accepted 5 May 1989.
Communicated by Christof Koch
The Storage of Time Intervals Using Oscillating Neurons
Christopher Miall King's College Research Centre, King's College, Cambridge, CB2 1ST, England
A mechanism to store and recall time intervals ranging from hundreds of milliseconds to tens of seconds is described. The principle is based on beat frequencies between oscillating elements; any small group of oscillators codes specifically for an interval equal to the lowest common multiple of their oscillation periods. This mechanism could be realized in the nervous system by an output neuron, excited by a group of pacemaker neurons, and able to select via a Hebbian rule a subgroup of pacemaker cells to encode any given interval, or a small number of intervals (for example, a pattern of pulses). Recall could be achieved by resetting the pacemaker cells and setting a threshold for activation of the output unit. A simulation is described and the main features of such an encoding scheme are discussed.

1 Introduction
How do we perceive and recall intervals ranging from tenths of seconds to seconds? It is these sorts of time intervals that underlie much of our behavior, as well as our appreciation of rhythms and music, and yet the way in which our nervous system encodes and stores time is unknown. There must be ways that time can be mapped onto neurons that allow them to be activated at specific intervals, or for specific durations, in response to internal or external time cues. Several neuronal mechanisms have been suggested over the years but they seem most suitable for intervals an order of magnitude or more shorter than those of interest here. For example, Licklider (1951) suggested that a chain of neurons could form a simple delay line, with each synapse adding a discrete delay to a transmitted signal. Each neuron in the chain might contribute only a few milliseconds delay, however, so intervals of the order of seconds would necessitate unwieldy chains hundreds of cells long. Another suggestion was to make use of the conduction delays in axons, so that time is mapped as distance along an axon (Jeffress 1948). Such a mechanism is indeed found to underlie detection of interaural time differences (Carr and Konishi 1988), but the delays achieved are in the range

Neural Computation 1, 359-371 (1989) © 1989 Massachusetts Institute of Technology
of 200-400 μsec. Braitenberg (1961) suggested that conduction delays in the parallel fibers of the cerebellum could be used to cross-correlate their activity patterns, and Longuet-Higgins (1989) has proposed that a similar scheme could be used to store temporal correlations between pulse trains. Again, it is unlikely that axons will be found of sufficient length and low conduction velocities to encode intervals of more than a few tens of milliseconds. A third possibility is that oscillatory pacemaker neurons could be used directly to encode time, either by choosing pacemakers with suitable oscillation periods from a population of cells, or through plastic changes to the period of oscillation of each cell (Torras 1986). An interesting development of this idea is that a group of cells could combine to encode a temporal waveform, by forming its Fourier series. However, this scheme may have limits to the intervals (or frequencies) that it could store. Choosing among a population of pacemakers would require a very wide distribution of oscillation frequencies to be selected from, while changing the oscillation period would require an individual cell to adapt its frequency over a wide range. The idea of using oscillators to store an arbitrary temporal sequence was also fundamental to Longuet-Higgins' holophone (Longuet-Higgins 1968). However, it required a bank of oscillators with a wide range of frequencies represented, and also required these frequencies to be more-or-less equally spaced. Another difficulty with the scheme was that in order to store long intervals, an oscillator was required with an oscillation period equal to the interval to be stored. [Longuet-Higgins (1968) suggested a possible solution here, in that a long temporal pattern could be stored by a mechanism of reentrant pathways, effectively setting up a chain of shorter output sequences.] Schemes based on oscillators might also suffer a trade-off between accuracy and period, as cells with very long oscillation periods would have slowly changing membrane potentials, and hence one would expect some scatter of the moment at which spiking initiates in each cycle. In this paper a related mechanism that seems to avoid these pitfalls is presented. One can make use of "beating" between pairs or groups of oscillators to store time intervals. Thus, a group of pacemaker cells, even if they have quite similar oscillation frequencies, can be used to encode a wide range of time intervals, and to recall the interval at a later time. Some of the possibilities of this scheme are described here, and the relationships between numbers of cells, their inherent accuracy, and the accuracy with which they can encode time are discussed.
Consider a group of oscillators (pacemaker neurons), each with a slightly different frequency of oscillation, and each spiking for a brief part of each
cycle (Fig. 1A). The beat frequency of any pair of these oscillators is then the frequency at which they spike simultaneously: looking at oscillators 5 and 6 in Figure 1B, it can be seen that they spike together only three times, as indicated by the stars. Thus, their beat frequency is much lower than their intrinsic oscillation frequency; it is given by the difference between the frequencies of the two cells. So if this difference is small, the beat frequency is low and the beat period long. Considering a group of cells, rather than just a pair, the beat period is given by the lowest common multiple of the periods of their oscillations. Although this statement is strictly true for oscillators active for only one instant in each cycle, partial coincidence of activity may occur at shorter intervals if the pacemakers are active for some significant portion of each cycle, as the group then drifts in and out of synchrony with a time course given by the lowest common multiple. Now, to encode any particular time interval, for example, the interval between time t₀ and time t₁ indicated by the open bars in Figure 1B, one could select the subgroup of cells that is active at both times, in this case cells 1, 3, and 6. The chosen set of oscillators fits the time interval specifically: they spike together only twice after t₀, at t₁ and t₂. So to recover this interval at some later point, one need only reset all the oscillators, and wait for the selected group to spike simultaneously. One way this could be achieved would be to activate "Hebb" synapses between the selected pacemakers and a postsynaptic unit, and then set a threshold for activation of that unit equal to the number of cells selected. As will be shown later, with larger populations of oscillatory cells, the numbers of cells selected to encode any given interval need not be known accurately to set a suitable threshold. In the example shown, the three selected cells are again almost synchronous at time t₂, which is exactly double the original period t₀-t₁. Thus, the same mechanism can be used to oscillate with a given period. The three are not exactly synchronous at time t₂ because they were selected imprecisely: the diagram has a limited resolution and is not accurate enough to distinguish the marginal difference in phase between cell 6 and cells 1 and 3 at the selection time t₁. A final point to make about this example is that the pacemakers were all started simultaneously at time t₀. This is only a convenience, and the same procedure can be used in a larger group of pacemakers that is not synchronized. For instance, if we were to select cells at times t₁ and t₂ without resetting the population, we would end up with a similar result, having in this case chosen only units 1 and 3 (again given the limits of resolution imposed in the diagram). However, resetting the whole population before the encoding process significantly reduces the numbers of cells needed to accurately code any interval, and is vital if more than one interval is to be stored (see later).
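A hypothetical worked example of the beat-period arithmetic (the 10 Hz and 12 Hz frequencies are our own, chosen as round numbers) follows:

```python
from fractions import Fraction
from math import gcd

# Two pacemakers at 10 Hz and 12 Hz (hypothetical values) have periods of
# 1/10 s and 1/12 s. They spike together with the beat period, the lowest
# common multiple of the two periods: lcm(1/10, 1/12) = 1/gcd(10, 12) = 0.5 s,
# i.e., a 2 Hz beat -- far slower than either oscillator.
f1, f2 = 10, 12
beat_period = Fraction(1, gcd(f1, f2))
print(float(beat_period), "s between coincident spikes")
print(abs(f1 - f2), "Hz beat frequency for the pair")
```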
Figure 1: (A) A schematic diagram of the behavior of a single pacemaker. The membrane potential oscillates sinusoidally (upper record), and the pacemaker is considered active ("spiking") if a threshold level is exceeded. In the simulations presented in this paper, the output of such a pacemaker was taken to be +1 or zero, as shown in the lower record. (B) Beating among a group of six oscillators. Activity in each oscillator is indicated by a short black bar. The stars on the right indicate times when units 5 and 6 are synchronous. The open bars highlight those units active at selection times t₀ and t₁; t₂ is twice the interval t₀-t₁.
3 A Simulation
To demonstrate the potential and the limits of this idea, the basic mechanism indicated in Figure 1 was simulated on a computer. A population of up to 500 pacemaking units was defined, with oscillation frequencies between 5 and 15 Hz chosen using a random number generator to give an average frequency of 10 Hz (Llinas 1988) and a standard deviation about the mean of 1.6 Hz. The iteration rate for the model was 60 Hz, allowing good resolution of even the highest frequency pacemakers. Each oscillator's membrane potential was considered to be a pure cosine function ranging in amplitude ±1, and an activity threshold was set to delimit the "active" (spiking) and "inactive" parts of its cycle (Fig. 1A). Thus, if the threshold was set at 0.0, the unit was considered to be active for half its cycle, and if the threshold was set at +0.9, the unit was active for 15% of its cycle. The output of each pacemaker was zero if the membrane potential was below the activity threshold, and +1 if above. This represents continuous "spiking" activity above threshold, which is obviously somewhat unrealistic. A more realistic simulation in which spike rate is proportional to membrane potential above the activity threshold could be envisaged. However, the scheme is not dependent on this property, and can be realized even if each pacemaker emits a single spike at each cycle (see below). All pacemaking units synapsed onto a single output unit via Hebbian synapses taking values between +1 and -1; hence the output unit received input of zero or between -1 and +1 from each pacemaker at each time step. The total input from the pacemakers was summed, and displayed as a time histogram (see Figs. 2-5). Now, to store any specific interval, t₀-t₁, first all pacemaker units were synchronized at t₀ (as in Fig. 1B). Those above threshold at time t₁ were noted, and the strength of their synapses onto the output unit set accordingly. In most of the simulations reported here, the selected synapses were set to a strength of +1 and all others set to zero; exceptions to this general rule will be pointed out later. To test the specificity with which the selected units stored each interval, all the pacemakers were again resynchronized, and the activity of the output unit monitored (Fig. 2). The model was tested with a population of between 10 and 500 pacemakers, with intervals ranging from 200 msec to 10 sec, and with the activity threshold of the pacemakers ranging from 0 to +0.999.
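A minimal sketch of this procedure (not the author's code; the clipped-normal frequency distribution and the 2-sec test interval are our own assumptions) is:

```python
import numpy as np

# Minimal sketch (not the author's code) of the storage/recall scheme:
# cosine pacemakers reset to phase zero at t0 = 0, Hebbian selection of the
# units above threshold at t1, and recall by resetting and summing their output.
rng = np.random.default_rng(2)
m = 250                                  # population size
# roughly 5-15 Hz with mean 10 Hz and SD 1.6 Hz (the clipping is an assumption)
freqs = rng.normal(10.0, 1.6, m).clip(5.0, 15.0)
alpha = 0.9                              # activity threshold on the membrane potential
dt = 1.0 / 60.0                          # 60 Hz iteration rate, as in the text

def active(t):
    return np.cos(2.0 * np.pi * freqs * t) > alpha   # +1/0 output of each unit

t1 = 2.0                                 # interval to store (seconds)
weights = active(t1).astype(float)       # selected Hebbian synapses set to +1

# Recall: reset the population again and watch the output unit's summed input.
ts = np.arange(0.25, 2.5, dt)            # skip t near 0, where all units coincide
drive = np.array([weights @ active(t) for t in ts])
beta = 0.75 * m * np.arccos(alpha) / np.pi           # suggested output threshold
print("peak at t =", round(float(ts[drive.argmax()]), 2),
      "s; threshold beta =", round(beta, 1))
```

Raising the activity threshold narrows each unit's active window and sharpens the recalled peak, at the cost of selecting fewer units, as quantified in the sections that follow.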
3.1 Timing Specificity. Figure 2 shows the total pacemaker output activity received by the output unit when 250 pacemakers were used at three activity thresholds. In each graph the x axis has been scaled to the stored interval and the y axis to the number of pacemakers selected to encode that interval. Hence the output unit is maximally activated at the extreme right of each graph.
Figure 2: Records of the output unit's activity in recall of intervals. Each graph shows the total synaptic input received by the output unit at each iteration of the model. In each, the x axis has been scaled to the stored interval and the y axis to the number of pacemakers selected to encode that interval. Hence the output unit is maximally activated at the extreme right of each graph. The exact numbers of cells selected are not important; only the relative height of the peak at the extreme right of each graph to the next highest peak is critical for discrimination of the correct interval. Hence, a threshold for the output unit would be needed to discriminate activity at this time (t₁) from all other times, but is not indicated here. Recall is shown of 5 intervals from 0.6 sec (row a) to 9 sec (row e); the figures to the right indicate the interval stored (in seconds). The simulation included 250 pacemakers, with activity threshold α = 0.0, 0.5, or 0.9 (columns A-C; see figures across top).
Comparing the maximum output (that is, the number of pacemakers active at time t₁) to the next highest peak of activity gave a measure of the reliability of the output, its specificity for the stored interval. The next highest peak was generally found at half of the stored interval, as in the bottom right of figure 2, or was separated from the main peak by 100 msec, as in the left column of figure 2. The latter is because the population had an average frequency of 10 Hz; thus,
while all the selected cells are active at time t₁, many will also be active 100 msec earlier. Three things should be pointed out in figure 2. First, specificity increased as the activity threshold increased (in columns from left to right). Second, specificity increased as the time interval increased (in rows from top to bottom), although in general the specificity was good for all intervals above 1 sec. Third, the output was elevated at some fractions of the specified interval; for example, peaks can be seen at one-third, half, and two-thirds the stored interval. As discussed in relation to figure 1, the accuracy with which units are selected (given by the activity threshold level) also determines the reliability of further output activity at multiples of the stored interval (that is, t₂, t₃, etc.). In these simulations the activity threshold needed to be at least 0.5 to allow reliable discrimination of a second peak, and even higher if multiple outputs were to be reliable (Fig. 3).
Figure 3: The specificity for output at multiples of the original interval. A population of 250 pacemakers was used to store an interval of 2 sec, with activity thresholds varied from α = 0.999 to 0.0. The ratio of output activity at multiples of 2 sec (2, 4, 6 sec, etc.) to the next highest output level (that is, the ratio of peaks, or "output specificity") was then calculated and is displayed; only values above 1 allow discrimination of the desired interval. When α = 0.999, too few units were selected and discrimination of even the original interval was poor. At α = 0.95 (stippled bars), even activity at cycle #4 (8 sec) could be accurately detected. As α dropped further, discrimination of repeated cycles fell off.
If the threshold was too high, however (0.99 or more), very few units were selected for any one interval. Discrimination of later peaks was best if the threshold was set as high as possible while still allowing about 20 units to be selected.

3.2 Timing Resolution. The average frequency of the pacemakers was 10 Hz, and the activity threshold directly determined the resolution of the model. For example, if the threshold was set at zero, the basic resolution was ≈50 msec (50% of the cycle; Fig. 2, left column). In other words, at this threshold the model could not be used to distinguish intervals that differed by less than 50 msec. If the activity threshold was high (0.9 or above; Fig. 2, right column) resolution was limited only by the iteration rate of the model (16 msec). The shortest interval that could be stored was of course equal to the shortest period of any pacemaker. However, unless the activity threshold was high (above 0.5), the output signaling intervals under 1 sec could not be reliably distinguished from peaks 100 msec before or afterward (Fig. 2, top row). Thus, the temporal resolution at low activity thresholds was ±100 msec, but was arbitrarily high for high thresholds. The longest interval that can be stored is difficult to specify. It depends on the number of pacemakers, their activity thresholds, and the precise distribution of oscillation frequencies within the group. For any given group, there must be some interval that cannot be distinguished from another shorter interval that is a fraction of the first. For example, if we consider just units 5 and 6 in figure 1B, it would be impossible to code for the interval indicated by the lowest star, since these two units also code for half that interval, indicated by the middle star. In simulations of 250 or 500 pacemakers, the upper time limit seems to be at least 20 sec, although it has not been possible to test every interval below that value.
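The resolution figures quoted above follow directly from the active fraction of each cycle; a small check (our own arithmetic, anticipating the expression p = cos⁻¹(α)/π given in Section 3.3, and assuming the 10 Hz mean frequency) is:

```python
import numpy as np

# A unit with sinusoidal membrane potential is active for the fraction
# p = arccos(alpha)/pi of each cycle, so at the 10 Hz mean frequency the
# active window -- the basic resolution -- is p * 100 ms.
for alpha in (0.0, 0.5, 0.9, 0.999):
    p = np.arccos(alpha) / np.pi
    print(f"alpha={alpha}: active fraction {p:.3f}, window {100 * p:.1f} ms")
```

This reproduces the ≈50 msec resolution at zero threshold and the less-than-1.5% active fraction quoted for α = 0.999.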
3.3 Number of Pacemakers. The accuracy and specificity with which each interval can be stored are also related to the number of units used to encode it. The proportion of the whole population of pacemakers expected to be selected at any time t₁ is equal to the proportion of time each unit is active. Thus, if the units are active for 15% of each cycle (an activity threshold of 0.9), then about one-sixth of the population will be selected to store any one interval. The simulations indicate that the best results are achieved when at least 15-20 units are selected to encode an interval. If this number drops below 5 or 6, the specificity becomes less predictable: some intervals are resolved clearly, others are coded ambiguously, and some intervals cannot be encoded at all, because no units are active at time t₁. Thus, the minimum population size should be approximately 20/p, where p is the proportion of time each pacemaker is active. For these units, with sinusoidal membrane potential, p = cos⁻¹(α)/π, where α is the activity threshold. At the highest threshold used (α = 0.999), each pacemaker was
At the highest threshold used (α = 0.999), each pacemaker was active for less than 1.5% of the cycle. Hence the output of each could be thought of as a single spike, and although the scheme was still valid (not shown), the population of pacemakers needed to reliably encode any arbitrary interval is high (about 1400, by the equations given above).

3.4 Setting a Threshold for the Output Unit. As indicated in figure 2, the output unit is continuously barraged by inputs, and a threshold must be defined to distinguish the input at t1 from all others. An appropriate threshold can be set without needing to know exactly how many units are selected at any one time. In general, about twice as many units are active at t1 as at the next highest peak, and as long as the population of pacemakers is large (for example, 100-200), the number selected at any time t1 will be roughly constant (as discussed above). A suitable value for the output threshold would be β = 0.75m·cos⁻¹(α)/π, where m is the total population size and α the activity threshold.

3.5 Inhibitory Pacemakers. Another way to increase the specificity for each interval is to include a number of pacemakers that make inhibitory synapses onto the output unit. If inhibitory units can be selected that are inactive at times t0 and t1, then as a group they will tend to be active at all other times, partially inhibiting the output unit and increasing its specificity. Because of the asymmetry between the percentage of time each pacemaker is active or inactive, this mechanism will select a set of inhibitory units with a different distribution of oscillation frequencies from the excitatory set. It is therefore not equivalent to merely changing the synaptic strengths of the excitatory units, or to altering the output unit's threshold. Figure 4 shows a simulation in which half the pacemakers were inhibitory. The selection rule was modified so that all excitatory units active at times t0 and t1 had, as before, synapse strengths of +1; all inhibitory units inactive at time t1 had synapse strengths of -1, and all others had synapse strengths of zero. It can be seen that the inhibitory units had two effects. First, the output unit was inhibited at most times between t0 and t1; this was most pronounced when the activity threshold was high (Fig. 4d and e) because then the inhibitory units inactive at t1 greatly outnumbered the excitatory units active at that time. Thus, the threshold for the output unit could be set at zero without regard for the numbers of cells in the population. Second, activity at higher harmonics (at half or one-third of the original interval) was diminished because of the different distribution of frequencies in the excitatory and inhibitory groups.

3.6 Storing Multiple Intervals. The basic mechanism can be used to store more than one time interval simply by selecting additional pacemakers that encode each additional time (Fig. 5). The threshold for the output unit, β, need not be altered since approximately the same number of units will be chosen for each interval.
Figure 5 has been scaled according to the total number of cells selected; to make the output threshold clear, a line has been drawn at β = 70 in each graph, which is the average number of units chosen for each interval. However, since the background activity is proportional to the total number of units, and not to the number active at each selected time, only three or four intervals can be stored before the peaks can no longer be distinguished. Furthermore, recovery of multiple intervals can be achieved only by resetting all pacemakers, and so a pattern of pulses cannot easily be repeated several times.
Figure 4: Records of the output unit's input activity in recall of a 3-sec interval when inhibitory units were included in the simulation. The format used is the same as in figure 2. Half of the 250 pacemakers were inhibitory; the numbers of excitatory (nE) and inhibitory (nI) units selected are given on the right of each graph. The activity threshold, α, was varied from 0.0 to 0.9, as indicated below each graph.
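The modified selection rule of section 3.5 is easy to state in code. The sketch below follows the rule just described; the frequency spread and random seed are again illustrative assumptions.

```python
import numpy as np

# Inhibitory selection rule: excitatory units active at t1 get weight +1,
# inhibitory units inactive at t1 get weight -1 (Fig. 4 setup: 250 units,
# half inhibitory, 3-sec interval; frequency spread assumed as before).
rng = np.random.default_rng(1)
n, alpha, dt, t1 = 250, 0.9, 0.016, 3.0
freqs = rng.uniform(5.0, 15.0, n)
inhibitory = np.arange(n) >= n // 2

t = np.arange(0.0, 7.0, dt)
active = np.cos(2 * np.pi * freqs[None, :] * t[:, None]) > alpha

at_t1 = active[int(round(t1 / dt))]
w = np.zeros(n)
w[~inhibitory & at_t1] = 1.0
w[inhibitory & ~at_t1] = -1.0

drive = active.astype(float) @ w            # net input to the output unit
print(f"nE:nI = {(w > 0).sum()}:{(w < 0).sum()}")
# With inhibition present the output threshold can simply be set to zero;
# the net input should exceed zero only near t1 at high activity thresholds.
```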
Figure 5: Records of the output unit's input activity in recall of multiple intervals. The format used is the same as in figure 2. In graph a, an interval t1 of 3.5 sec was stored; in each subsequent graph one extra interval was added (at 2.0, 2.7, and 2.2 sec respectively, as indicated by the arrows). Five hundred excitatory pacemakers were simulated with activity threshold α = 0.9; the numbers selected at each interval are given above each graph. A threshold for the output unit at β = 70 has been indicated in each graph.

3.7 Synapse Strengths. The simulations discussed so far have used integer synapses taking values 0 and 1, or 0, +1, and -1. In a biological framework it might be more realistic to assume that synapses could take values between 0 and 1, and would move only gradually between these two limits on repeated presentations of a particular interval. In other
words, a learning rule might increase the strength of selected synapses relative to the others only gradually. Temporal specificity is rather poor in such circumstances. If the number of selected pacemakers is small relative to the whole population (that is, the activity threshold is high), then the output unit receives a barrage of input from the unwanted units. It is therefore difficult to distinguish the peak at time t1 unless the selected units have synapse strengths considerably greater than the others. Alternatively, the activity threshold must be low, so that the number of selected units is of the same order as the number of unwanted units, but that introduces limits to performance which were discussed previously. However, it might be possible to stop the unwanted pacemakers oscillating at all, perhaps by temporarily altering their membrane properties, leaving only the selected group active. This is functionally equivalent to setting the synapse strengths to zero as in these simulations.

4 Conclusions
The scheme presented here allows neurons with relatively high oscillation frequencies to encode intervals much greater than their own oscillation period. By selecting groups of pacemakers from a population, intervals between pairs of pulses, or short patterns of pulses, can be stored and recalled. The scheme is surprisingly robust, and even when individual pacemakers are active for 50% of each cycle, temporal resolution is good. Three weaknesses of the scheme are apparent. First, the synapses onto the output unit need to be considerably stronger from the selected group than from other pacemakers. Second, the selected units need to be reset synchronously to most accurately recall the stored interval. Third, the pacemakers need to maintain a stable oscillation frequency for the duration of the stored interval. Actually, this is not strictly so, as the scheme would still work if the pacemakers reliably drifted or swept through a frequency range after being reset; the critical point is that their behavior must be repeatable in storage and in recall. None of these problems seems insurmountable, but since we are ignorant of the site of timing operations in the brain, it remains to be seen whether suitable populations of pacemakers will be found to realize this scheme. Pacemaking units are found in a number of brain sites, and the figure of 10 Hz chosen for these simulations is biologically reasonable (Llinás 1988). It is therefore interesting to see that the temporal resolution of the simulations is of the same order as that found in psychophysical tests (that is, about 50 msec; see review by Poppel 1978). Poppel also reviews evidence for a qualitative difference between the perception of short intervals and those greater than about 2-3 sec. It would be interesting to find the longest interval that could be unambiguously stored by the proposed scheme. However, with the numbers and threshold levels chosen here, the limiting interval seems to be at least 10 sec.
Finally, it should be mentioned that the fundamental principle here is one of beating between oscillators, and is not dependent on the underlying properties of the pacemaker units. The oscillatory components could therefore be realized as individual pacemaker neurons, as simulated, or by entrained groups of pacemakers, or by reverberatory circuits containing several neurons. In the latter cases, any one cell in the group or circuit could make a synaptic connection to the output neuron.

Acknowledgments

This work was supported by King's College Research Centre. I thank Graeme Mitchison for interesting discussions about these ideas, and the reviewers for their useful suggestions about the manuscript.

References

Braitenberg, V. 1961. Functional interpretation of cerebellar histology. Nature (London) 190, 539.
Carr, C.E., and Konishi, M. 1988. Axonal delay lines for time measurement in the owl's brainstem. Proc. Natl. Acad. Sci. U.S.A. 85, 8311-8316.
Jeffress, L.A. 1948. A place theory of sound localization. J. Comp. Physiol. Psychol. 41, 35-39.
Licklider, J.C.R. 1951. A duplex theory of pitch perception. Experientia 7, 128-134.
Llinás, R.R. 1988. The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science 242, 1654-1664.
Longuet-Higgins, H.C. 1968. Holographic model of temporal recall. Nature (London) 217, 104.
Longuet-Higgins, H.C. 1989. A mechanism for the storage of temporal correlations. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, Wokingham, England, pp. 99-104.
Poppel, E. 1978. Time perception. In The Handbook of Sensory Physiology, Vol. VIII, pp. 713-729. Springer-Verlag: Berlin, Heidelberg, New York.
Torras, C.I.G. 1986. Neural network model with rhythm-assimilation capacity. IEEE Syst. Man Cybernet. 16, 680-693.
Received 1 March 1989; accepted 10 April 1989.
Communicated by Jeffrey Elman
Finite State Automata and Simple Recurrent Networks Axel Cleeremans Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA 15213 USA
David Servan-Schreiber Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213 USA
James L. McClelland Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA 15213 USA
We explore a network architecture introduced by Elman (1988) for predicting successive elements of a sequence. The network uses the pattern of activation over a set of hidden units from time step t-1, together with element t, to predict element t+1. When the network is trained with strings from a particular finite-state grammar, it can learn to be a perfect finite-state recognizer for the grammar. When the network has a minimal number of hidden units, patterns on the hidden units come to correspond to the nodes of the grammar, although this correspondence is not necessary for the network to act as a perfect finite-state recognizer. We explore the conditions under which the network can carry information about distant sequential contingencies across intervening elements. Such information is maintained with relative ease if it is relevant at each intermediate step; it tends to be lost when intervening elements do not depend on it. At first glance this may suggest that such networks are not relevant to natural language, in which dependencies may span indefinite distances. However, embeddings in natural language are not completely independent of earlier information. The final simulation shows that long distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.

Neural Computation 1, 372-381 (1989) @ 1989 Massachusetts Institute of Technology

1 Introduction

Several connectionist architectures that are explicitly constrained to capture sequential information have been proposed. Examples are Time Delay Networks (for example, Sejnowski and Rosenberg 1986) - also called "moving window" paradigms - or algorithms such as backpropagation in time (Rumelhart et al. 1986; Williams and Zipser 1988).
Such architectures use explicit representations of several consecutive events, if not of the entire history of past inputs. Recently, Elman (1988) has introduced a simple recurrent network (SRN) that has the potential to master an infinite corpus of sequences with the limited means of a learning procedure that is completely local in time (Fig. 1). In this paper, we show that the SRN can learn to mimic closely a finite-state automaton (FSA), both in its behavior and in its state representations. In particular, we show that it can learn to process an infinite corpus of strings based on experience with a finite set of training exemplars. We then explore the capacity of this architecture to recognize and use nonlocal contingencies between elements of a sequence.

Figure 1: The simple recurrent network (Elman 1988). In the SRN, the pattern of activation on the hidden units at time step t-1, together with the new input pattern, is allowed to influence the pattern of activation at time step t. This is achieved by copying the pattern of activation on the hidden layer at time step t-1 to a set of input units - called the "context units" - at time step t. All the forward connections in the network are subject to training via backpropagation.

2 Discovering a Finite-State Grammar
In our first experiment, we asked whether the network could learn the contingencies implied by a small finite-state grammar (Fig. 2). The network was presented with strings derived from this grammar, and required to try to predict the next letter at every step. These predictions
are context dependent since each letter appears twice in the grammar and is followed in each case by different successors. A single unit on the input layer represented a given letter (six input units in total; five for the letters and one for a begin symbol "B"). Similar local representations were used on the output layer (with the "begin" symbol being replaced by an end symbol "E"). There were three hidden units.

2.1 Training. On each of 60,000 training trials, a string was generated from the grammar, starting with the "B." Successive arcs were selected randomly from the two possible continuations with a probability of 0.5. Each letter was then presented sequentially to the network. The activations of the context units were reset to 0.5 at the beginning of each string. After each letter, the error between the network's prediction and the actual successor specified by the string was computed and backpropagated. The 60,000 randomly generated strings ranged from 3 to 30 letters (mean, 7; SD, 3.3).

2.2 Performance. Three tests were conducted. First, we examined the network's predictions on a set of 70,000 random strings. During this test, the network is first presented with the "B," and one of the five letters or "E" is then selected at random as a successor.
Figure 2: The small finite-state grammar (Reber 1967).
If that letter is predicted by the network as a legal successor (that is, activation is above 0.3 for the corresponding unit), it is then presented to the input layer on the next time step, and another letter is drawn at random as its successor. This procedure is repeated as long as each letter is predicted as a legal successor, until "E" is selected as the next letter. The procedure is interrupted as soon as the actual successor generated by the random procedure is not predicted by the network, and the string of letters is then considered "rejected." A string is considered "accepted" if all its letters have been predicted as possible continuations up to "E." Of the 70,000 random strings, 0.3% happened to be grammatical and 99.7% were ungrammatical. The network performed flawlessly, accepting all the grammatical strings and rejecting all the others. In a second test, we presented the network with 20,000 strings generated at random from the grammar, that is, all these strings were grammatical. Using the same criterion as above, all of these strings were correctly "accepted." Finally, we constructed a set of very long grammatical strings - more than 100 letters long - and verified that at each step the network correctly predicted all the possible successors (activations above 0.3) and none of the other letters in the grammar.

2.3 Analysis of Internal Representations. What kinds of internal representations have developed over the set of hidden units that allow the network to associate the proper predictions to intrinsically ambiguous letters? We recorded the hidden units' activation patterns generated in response to the presentation of individual letters in different contexts. These activation vectors were then used as input to a cluster analysis program that groups them according to their similarity. Figure 3 shows the results of such an analysis conducted on a small random set of grammatical strings. The patterns of activation are grouped according to the nodes of the grammar: all the patterns that are used to predict the successors of a given node are grouped together independently of the current letter. This observation sheds some light on the behavior of the network: at each point in a sequence, the pattern of activation stored over the context units provides information about the current node in the grammar. Together with information about the current letter (represented on the input layer), this contextual information is used to produce a new pattern of activation over the hidden layer that uniquely specifies the next node. In that sense, the network closely approximates the FSA that would encode the grammar from which the training exemplars were derived. However, a closer look at the cluster analysis reveals that within a cluster corresponding to a particular node, patterns are further divided according to the path traversed before the node is reached. For example, looking at the bottom cluster - node #5 - patterns produced by a "VV," "PS," "XS," or "SXS" ending are grouped separately by the analysis: they are more similar to each other than to other examples of paths leading
to node #5. This tendency to preserve information about the path is not a characteristic of traditional finite-state automata. It must be noted that an SRN can perform the string acceptance tests described above and still fail to produce representations that clearly delineate the nodes of the grammar as in the case shown in Figure 3. This tendency to approximate the behavior but not the representation of the grammar is exhibited when there are more hidden units than are absolutely necessary to perform the task. Thus, using representations that correspond to the nodes in the FSA shown in figure 2 is only one way to act in accordance with the grammar encoded by that machine: representations correspond to the nodes in the FSA only when resources are severely constrained. In other cases, the network tends to preserve the separate identities of the arcs, and may preserve other information about the path leading to a node, even when this is not task relevant.
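To make the training setup concrete, the following numpy sketch implements an SRN of this kind on strings from the grammar. The arc labels in the transition table are the standard published ones for the Reber (1967) machine and are assumed to match Figure 2; the learning rate and weight initialization are illustrative assumptions not reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
letters = ['B', 'T', 'P', 'S', 'X', 'V', 'E']   # 'B' is input-only, 'E' output-only
# Assumed transition table: node -> [(top-arc letter, next), (bottom-arc letter, next)]
arcs = {1: [('T', 2), ('P', 3)], 2: [('S', 2), ('X', 4)],
        3: [('T', 3), ('V', 5)], 4: [('X', 3), ('S', 6)], 5: [('P', 4), ('V', 6)]}

def reber_string():
    node, s = 1, ['B']
    while node != 6:
        letter, node = arcs[node][rng.integers(2)]   # each arc taken with p = 0.5
        s.append(letter)
    return s + ['E']

def onehot(sym, table):                              # local letter representations
    v = np.zeros(len(table)); v[table.index(sym)] = 1.0; return v

in_syms, out_syms = letters[:-1], letters[1:]
H, eta = 3, 0.5                                      # three hidden units; eta assumed
W_in = rng.normal(0, 0.5, (H, 6)); W_ctx = rng.normal(0, 0.5, (H, H))
W_out = rng.normal(0, 0.5, (6, H))
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(60000):                               # one string per training trial
    s = reber_string()
    ctx = np.full(H, 0.5)                            # context reset at string onset
    for cur, nxt in zip(s[:-1], s[1:]):
        x, d = onehot(cur, in_syms), onehot(nxt, out_syms)
        h = sig(W_in @ x + W_ctx @ ctx)
        y = sig(W_out @ h)
        dy = (y - d) * y * (1 - y)                   # quadratic-error backpropagation
        dh = (W_out.T @ dy) * h * (1 - h)
        W_out -= eta * np.outer(dy, h)
        W_in -= eta * np.outer(dh, x); W_ctx -= eta * np.outer(dh, ctx)
        ctx = h                                      # copy hidden pattern to context
```

After training, a letter can be counted as a predicted successor whenever its output activation exceeds the 0.3 criterion used in the tests above.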
3 Encoding Path Information
In a different set of experiments, we asked whether the network could learn to use the information about the path that is encoded in the hidden units' patterns of activation. In one of these experiments, we tested whether the network could master length constraints. When strings generated from the small finite-state grammar may have a maximum of only eight letters, the prediction following the presentation of the same letter in position number six or seven may be different. For example, following the sequence "TSSSXXV," "V" is the seventh letter and only another "V" would be a legal successor. In contrast, following the sequence "TSSXXV," both "V" and "P" are legal successors. A network with 15 hidden units was trained on 21 of the 43 legal strings of length 3 to 8. It was able to use the small activation differences present over the context units - due to the slightly different sequences presented - to master contingencies such as those illustrated above. Thus, after "TSSXXV," the network predicted "V" or "P"; but after "TSSSXXV," it predicted only "V" as a legal successor. How can information about the path be encoded in the hidden layer patterns of activation? As the initial papers about backpropagation pointed out, the hidden unit patterns of activation represent an "encoding" of the features of the input patterns that are relevant to the task. In the recurrent network, the hidden layer is presented with information about the current letter, but also - on the context layer - with an encoding of the relevant features of the previous letter. Thus, a given hidden layer pattern can come to encode information about the relevant features of two consecutive letters. When this pattern is fed back on the context layer, the new pattern of activation over the hidden units can come to encode information about three consecutive letters, and so on. In this
Figure 3: Hierarchical cluster analysis of the hidden unit activation patterns after 60,000 presentations of strings generated at random from the finite-state grammar. A small set of strings was used to test the network. The single uppercase letter in each string shown in the figure corresponds to the letter actually presented to the network on that trial.
manner, the context layer patterns can allow the network to maintain prediction-relevant features of an entire sequence. Learning progresses through three phases. During the first phase, the context information tends to be ignored because the patterns of activation on the hidden layer - of which the former are a copy - are changing continually as a result of the learning algorithm. In contrast, the network is able to pick up the stable association between each letter and all its possible successors. At the end of this phase, the network thus predicts all the successors of each letter in the grammar, independently of the arc to which each letter corresponds. In the second phase, patterns copied on the context layer are now represented by a unique code designating which letter preceded the current letter, and the network can exploit this stability of the context information to start distinguishing between different occurrences of the same letter - different arcs in the grammar. Finally, in a third phase, small differences in the context information that reflect the occurrence of previous elements can be used to differentiate position-dependent predictions resulting from length constraints (see Servan-Schreiber et al. 1988, for more details). It is important to note that information about the path that is not locally relevant tends not to be encoded in the next hidden layer pattern. It may then be lost for subsequent processing. This tendency is decreased when the network has extra degrees of freedom (that is, more hidden units) so as to allow small and locally useless differences to survive for several processing steps.

4 Processing Embedded Sequences
This observation raises an important question about the relevance of this simple architecture to natural language processing. Any natural language processing system must have the ability to preserve information about long-distance contingencies in order to correctly process sentences containing embeddings. For example, in the following problem of number agreement, information about the head of the sentence has to be preserved in order to make a correct grammatical judgment about the verb following the embedding:

The dog [that chased the cat] is playful
The dogs [that chased the cat] are playful

At first glance, however, information about the head of the sentence does not seem to be relevant for processing of the embedding itself. Yet, from the perspective of a system that is continuously generating expectations about possible succeeding events, information about the head is relevant within the embedding. For example, "itself" may follow the action verb only in the first sentence, whereas "each other" is acceptable only in the second. There is ample empirical evidence to support
the claim that human subjects do generate expectations continuously in the course of natural language processing (see McClelland 1988, for a review). To show that such expectations may contribute to maintaining information about a nonlocal context, we devised a new finite-state grammar in which the identity of the last letter depends on the identity of the first one (see Fig. 4). In a first experiment, contingencies within the embedded subgrammars were identical (0.5 on all arcs). Therefore, information about the initial letter is not locally relevant to predicting the successor of any letter in the embedding.
Figure 4: A complex finite-state grammar involving embedded sequences. The last letter is contingent on the first, and the intermediate structure is shared by the two branches of the grammar. In the first experiment, transition probabilities of all arcs were equal and set to 0.5. In the second experiment, the probabilities were biased toward the top arcs for the top embedding, and toward the bottom arcs for the bottom embedding. The numbers above each arc in the figure indicate the transition probabilities in the biased version of the grammar.
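A generator for strings of this kind is sketched below. The embedded subgrammar is taken to be the Reber machine from the earlier sketch - an assumption consistent with the shared intermediate structure described in the caption - and the bias parameter skews arc choices toward the top arcs on one branch and toward the bottom arcs on the other.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed transition table for the shared embedded subgrammar (Reber machine).
arcs = {1: [('T', 2), ('P', 3)], 2: [('S', 2), ('X', 4)],
        3: [('T', 3), ('V', 5)], 4: [('X', 3), ('S', 6)], 5: [('P', 4), ('V', 6)]}

def embedded_string(bias=0.5):
    head = 'T' if rng.random() < 0.5 else 'P'      # the last letter must match the head
    p_top = bias if head == 'T' else 1.0 - bias    # the two branches are skewed oppositely
    node, body = 1, []
    while node != 6:
        letter, node = arcs[node][0 if rng.random() < p_top else 1]
        body.append(letter)
    return 'B' + head + ''.join(body) + head + 'E'

print(embedded_string(bias=0.5))   # unbiased version: embedding independent of the head
print(embedded_string(bias=0.7))   # biased version, as in the second experiment
```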
An SRN with 15 hidden units was trained on exemplars (mean length, 7; SD, 2.6) generated from this grammar for 2.4 million letter presentations. Even after such extensive training, the network indifferently predicted¹ both possible final letters, independently of the initial letter of the string. However, when the contingencies within the embedded grammars depend, even very slightly, on the first letter of the string, the performance of the network can improve dramatically. For example, we adjusted the transition probabilities in the subgrammars such that the top subgrammar was biased toward the top arcs and the bottom subgrammar was biased toward the bottom arcs (see Fig. 4). An SRN trained on exemplars (mean length, 7; SD, 2.6) derived from this biased grammar performed considerably better. After 2.4 million presentations, the network correctly predicted the last letter of the string in 66.9% of the trials. There were 11.3% errors - that is, cases where the incorrect alternative was predicted - and 21.8% cases in which both alternatives were predicted about equally. Accuracy was very high for strings with a short embedding, and fell gradually to chance level at seven embedded elements. Note that we tested the network with strings generated from the unbiased version of the grammar. The network must therefore have learned to preserve information about the first letter throughout the embedding independently of whether the particular embedding is probable or not given the first letter. Moreover, with the biased grammar, some embeddings have exactly the same probability of occurring during training independently of whether the first letter is a "T" or a "P" (e.g., "PVPXVV"). As we have seen, training with such exemplars makes it very difficult² for the network to maintain information about the first letter. Yet, when such embeddings are part of an ensemble of embeddings that as a whole has a probability structure that depends on the first letter, information about this first letter can be preserved much more easily.
5 Conclusion

We have presented a network architecture first explored by Elman (1988) that is capable of mastering an infinite corpus of strings generated from a finite-state grammar after training on a finite set of exemplars with a learning algorithm that is local in time. When it contains only a minimal set of hidden units, the network can develop internal representations
that correspond to the nodes of the grammar, and closely approximates the corresponding minimal finite-state recognizer. We have also shown that the simple recurrent network is able to encode information about long-distance contingencies as long as information about critical past events is relevant at each time step for generating predictions about potential alternatives. Finally, we demonstrated that this architecture can be used to maintain information across embedded sequences, even when the probability structure of ensembles of embedded sequences depends only subtly on the head of the sequence. It appears that the SRN possesses characteristics that recommend it for further study as one useful class of idealizations of the mechanism of natural language processing.

¹We considered that the final letter was correctly or mistakenly predicted if the Luce ratio of the activation of the corresponding unit to the sum of all activations on the output layer was above 0.6. Any Luce ratio below 0.6 was considered a failure to respond (that is, a miss).
²Other experiments showed that when the embedding is totally independent of the head, the network's ability to maintain information about a nonlocal context depends on the number of hidden units. For a given number of hidden units, learning time increases exponentially with the length of the embedding.
Acknowledgments

Axel Cleeremans was supported in part by a fellowship from the Belgian American Educational Foundation, and in part by a grant from the National Fund for Scientific Research (Belgium). David Servan-Schreiber was supported by an NIMH Individual Fellow Award MH-09696-01. James L. McClelland was supported by an NIMH Research Scientist Career Development Award MH-00385. Support for computational resources was provided by NSF (BNS-86-09729) and ONR (N00014-86-G-0146).
References

Elman, J.L. 1988. Finding Structure in Time. CRL Tech. Rep. 8801. Center for Research in Language, University of California, San Diego, CA.
McClelland, J.L. 1988. The case for interactionism in language processing. In Attention and Performance XII, M. Coltheart, ed. Erlbaum, London.
Reber, A.S. 1967. Implicit learning of artificial grammars. J. Verbal Learning Verbal Behav. 6, 855-863.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Sejnowski, T.J., and Rosenberg, C. 1986. NETtalk: A Parallel Network That Learns to Read Aloud. Tech. Rep. JHU-EECS-86-01, Johns Hopkins University.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J.L. 1988. Learning Sequential Structure in Simple Recurrent Networks. Tech. Rep. CMU-CS-183, Carnegie-Mellon University.
Williams, R.J., and Zipser, D. 1988. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. ICS Tech. Rep. 8805. Institute for Cognitive Science, University of California, San Diego, CA.
Received 6 March 1989; accepted 20 April 1989.
Communicated by Geoffrey Hinton
Asymptotic Convergence of Backpropagation Gerald Tesauro IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Yu He Subutai Ahmad Center for Complex Systems Research, Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Matthews, Urbana, IL 61801 USA
We calculate analytically the rate of convergence at long times in the backpropagation learning algorithm for networks with and without hidden units. For networks without hidden units using the standard quadratic error function and a sigmoidal transfer function, we find that the error decreases as 1/t for large t, and the output states approach their target values as 1/√t. It is possible to obtain a different convergence rate for certain error and transfer functions, but the convergence can never be faster than 1/t. These results are unaffected by a momentum term in the learning algorithm, but convergence can be substantially improved by an adaptive learning rate scheme. For networks with hidden units, we generally expect the same rate of convergence to be obtained as in the single-layer case; however, under certain circumstances one can obtain a polynomial speed-up for nonsigmoidal units, or a logarithmic speed-up for sigmoidal units. Our analytic results are confirmed by empirical measurements of the convergence rate in numerical simulations.

1 Introduction
Recently developed learning algorithms for multilayer networks, such as the so-called "backpropagation" algorithm (LeCun 1985; Parker 1985; Rumelhart et al. 1986; Werbos 1974) for gradient-descent minimization of a global error function, have inspired a great deal of interest in such networks for both cognitive modeling and practical problem solving. Most studies of backpropagation have so far been empirical, since the equations governing a multilayer network of units with nonlinear transfer functions are extremely difficult to solve analytically.

Neural Computation 1, 382-391 (1989) @ 1989 Massachusetts Institute of Technology
Empirical studies have produced some useful heuristics and rules of thumb for understanding the learning behavior, but there is widespread agreement on the need for a more principled theoretical understanding. In this paper, as a first step toward the development of a full theoretical understanding of general gradient-descent learning in multilayer networks, we examine the rate of convergence late in learning when all of the errors are small. In this limit, the learning equations become much more amenable to analytic study. By expanding in the small differences between the desired and actual output states, and retaining only the dominant terms, one can explicitly solve for the leading-order behavior of the weights as a function of time. This is true both for single-layer networks and for multilayer networks containing hidden units. In gradient-descent learning, one minimizes an error function E that is based on a comparison of the actual network output for each input pattern with a desired output, or "teacher" signal. The weights are updated as follows:

Δw = -ε ∂E/∂w    (1.1)
where Δw is the change in the weight vector at each time step, and the learning rate ε is a small numerical constant. The basic calculation of the convergence rate of equation 1.1 for single-layer networks with general error functions and transfer functions is presented in Section 2, along with specific results for certain standard functions. In Section 3, we examine three standard modifications of the basic gradient-descent procedure: the use of a "margin" variable for turning off the error backpropagation, the use of adaptive learning rate schemes, and the inclusion of a "momentum" term in the learning equation. In Section 4 we extend the analysis to networks with hidden units, and in the final section we summarize our results and discuss possible extensions in future work.

2 Convergence in Single-Layer Networks
Learning schemes for single-layer networks have been known for many years (Minsky and Papert 1969; Widrow and Hoff 1960). The input-output relationship for single-layer networks takes a particularly simple form:

y_p = g(w · x_p)    (2.1)

where x_p is a binary vector of length n + 1 which represents the state of the n input units for pattern p (plus one "true" unit which is always on), w is the real-valued weight vector of the network, g is the input-output transfer function (for the moment unspecified), and y_p is the output state for pattern p. (We assume for simplicity that the network has only one output unit; the extension to multiple output units does not affect the basic results.) We further assume that the transfer function approaches 0 for large negative inputs and 1 for large positive inputs.
For convenience of analysis, we shall rewrite equation 1.1 for continuous time. Assuming that the total error E is the sum of individual errors E_p for each pattern, the learning equation becomes

dw/dt = -ε Σ_p (∂E_p/∂y_p) g′(h_p) x_p    (2.2)

where h_p = w · x_p is the total input activation of the output unit for pattern p, and the summation over p is for an arbitrary subset of the 2ⁿ possible training patterns. The individual error E_p is a function, for the moment unspecified, of the difference between the actual output y_p and the desired output d_p for pattern p. The mean-square error function E_p = (y_p - d_p)² is most often employed, but other error functions can be useful in certain situations, such as the "cross-entropy" error function (Hinton 1987) E_p = d_p log y_p + (1 - d_p) log(1 - y_p), which is useful when the output states are interpreted as probability values. Instead of trying to solve the evolution of the weights directly, it is more convenient to solve for the output values y_p as a function of time. To do this we need the derivative of equation 2.1:

dy_p/dt = g′(h_p) (dw/dt) · x_p    (2.3)

Substituting equation 2.2 into equation 2.3 yields

dy_p/dt = -ε g′(h_p) Σ_q (∂E_q/∂y_q) g′(h_q) (x_q · x_p)    (2.4)
Let us now consider the situation late in learning when the output states are approaching the desired values. We define new variables η_p = y_p - d_p, and assume that η_p is small for all p. For reasonable error functions, such as the ones mentioned above, the individual errors E_p will go to zero as some power of η_p, i.e., E_p ~ η_p^γ. (For the mean-square error, γ = 2, and for the cross-entropy error, γ = 1.) Similarly, the slope of the transfer function should approach zero as the output state approaches 1 or 0, and for reasonable transfer functions, such as the well-known sigmoid, this will again follow a power law, that is, g′(h_p) ~ η_p^β. Using the definitions of η, γ, and β, equation 2.4 becomes

dη_p/dt ~ -ε |η_p|^β Σ_q η_q^(γ-1) |η_q|^β (x_q · x_p)    (2.5)
The absolute value appears because g is a nondecreasing function. Let η̄ be the slowest to approach zero among all the η_p. We assume that the terms on the right-hand side of equation 2.5 of order η̄^(γ+β-1) do not cancel. Since the η's can be both positive and negative, such cancellation could occur in principle, but only if the matrix of coefficients for the dominant terms has zero determinant. (This will not be the case for randomly chosen training patterns.) Furthermore, even if the matrix does have zero determinant, the initial condition must be precisely chosen to achieve cancellation, and this will not be the case for random initial weights.
Retaining only the dominant term, equation 2.5 then reduces to

dη̄/dt ~ -ε η̄^(γ+2β-1)    (2.6)

which integrates to give

η̄ ~ t^(-1/(γ+2β-2)),   E ~ η̄^γ ~ t^(-γ/(γ+2β-2))    (2.7)
So the output y approaches the desired output and the error function goes to zero with a power-law dependence on the learning time. When β = 1, i.e., g′ ~ η, the error function approaches zero like 1/t, independent of the exponent γ. In fact, this is the fastest possible way for the error function to approach zero. To see this, we first show that exponential convergence to zero is not possible for the networks considered here. If the exponent on the right-hand side of equation 2.6 is 1, that is, β = 1 - γ/2, the error function will go to zero exponentially fast. This requires that β < 1. However, since g′ ~ (y - d)^β = (g - d)^β, we have g(x) ~ (x - c)^(1/(1-β)) (c is a constant). Note that this is not bounded for β < 1. Therefore, the smallest value β can assume is 1, and this yields the fastest possible convergence for the error function, viz. E ~ 1/t. For the usual sigmoid transfer function g(x) = (1 + e^(-x))^(-1), the exponent β = 1. Therefore one should see 1/t behavior in the error function in this case. Such behavior has actually been observed in the numerical experiments of Ahmad (1988) and Ahmad and Tesauro (1988). The asymptotic 1/t behavior was observed at relatively small t, about 20 cycles through the training set. Further numerical evidence is presented in Section 4 of this paper. When β increases, E will go to zero more slowly. For example, if we take the transfer function to be

g(x) = (1/2)[1 + (2/π) tan⁻¹(x^m)]    (2.8)

(where m > 0 is an odd integer), then β = 1 + 1/m. In this case, the error function will go to zero as E ~ t^(-γ/(γ+2/m)). In particular, when m = 1 and γ = 2, E ~ 1/√t.
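The 1/t prediction is easy to check numerically. The sketch below trains a single sigmoid unit on a small majority-function training set (the task used in Figure 1 below); the network size, learning rate, and training set are illustrative assumptions rather than the exact settings used here. If E ~ 1/t, the product t·E should level off at large t.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_pat, eta = 10, 20, 0.5                      # assumed sizes and learning rate
X = rng.integers(0, 2, (n_pat, n_in)).astype(float)
X = np.hstack([X, np.ones((n_pat, 1))])             # extra always-on "true" unit
d = (X[:, :n_in].sum(axis=1) > n_in / 2).astype(float)   # majority-function targets
w = rng.normal(0, 0.1, n_in + 1)

sig = lambda h: 1.0 / (1.0 + np.exp(-h))
for t in range(1, 100001):
    y = sig(X @ w)
    w -= eta * X.T @ ((y - d) * y * (1 - y))        # gradient of the quadratic error
    if t in (1000, 10000, 100000):
        E = np.sum((y - d) ** 2)
        print(f"t = {t:6d}   E = {E:.3e}   t*E = {t * E:.3f}")
```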
3 Modifications of Gradient Descent
In the standard practice of backpropagation, several modifications to strict gradient descent are often used. One common modification is the use of a "margin" variable μ such that, if the difference between network output and teacher signal for pattern p is smaller than μ, no error is backpropagated. This is meant to prevent the network from devoting resources to making its output arbitrarily close to the teacher signal, which is usually unnecessary. It is clear from the structure of equations 2.5 and 2.6 that the margin will not affect the basic 1/t error convergence, except in a rather trivial way. When a margin is employed, certain driving terms
on the right-hand side of equation 2.5 will be set to zero as soon as they become small enough. However, as long as some nonzero driving terms are present, the basic polynomial solution of equation 2.7 will be unaltered. Of course, when all the driving terms disappear because they are all smaller than the margin, the network will stop learning, and the error will remain constant at some positive value. Thus, the predicted behavior is a 1/t decrease in the error followed eventually by a rapid transition to constant nonzero error. We have verified this behavior in numerical simulations. Another common modification to gradient descent is the use of heuristic schemes for adaptively altering the learning rate constant ε. For example, Jacobs (1988) proposes a scheme in which if the current gradient is in the same direction as the previous gradient, the learning rate increases linearly with time, while if the gradient has changed directions, the learning rate decays exponentially. Let us consider the maximum possible benefit of such a scheme by assuming that sufficiently late in learning, the computed gradient is always in the same direction. In that case, the learning rate will increase linearly with time, that is, ε(t) = ε₀ + at. This will lead to an extra factor of t on the right-hand side of equation 2.6, which yields on integration

η̄ ~ t^(-2/(γ+2β-2)),   E ~ t^(-2γ/(γ+2β-2))    (3.1)
Thus, for sigmoid units, one could potentially obtain 1/t² error convergence with an adaptive learning rate scheme. Of course, if the learning rate increase is slower than linear, the error convergence will be correspondingly slower. Finally, we consider a popular generalization of equation 1.1 that includes a "momentum" term as follows:

Δw(t) = -ε ∂E/∂w(t) + α Δw(t - 1)    (3.2)

In continuous time, this takes the form

α d²w/dt² + (1 - α) dw/dt = -ε ∂E/∂w    (3.3)
As in the previous section, to analyze equation 3.3 we substitute expressions for the state of the output and its derivative as a function of the weights and the input pattern. We also need the second derivative equation, which comes from differentiating equation 2.3:

d²y_p/dt² = (d²w/dt² · x_p) g′(h_p) + g″(h_p) (dw/dt · x_p)²    (3.4)
Substituting equations 2.1, 2.3, and 3.4 into 3.3, and computing the dot product with x_p, we obtain

α [d²y_p/dt² - g″(h_p)(dy_p/dt)²/g′(h_p)²] / g′(h_p) + (1 - α)(dy_p/dt)/g′(h_p) = -ε Σ_q (∂E_q/∂y_q) g′(h_q) (x_q · x_p)    (3.5)
Once again, we substitute η_p = y_p - d_p and expand in small η_p. Again we assume that g′ ~ η^β, and given this relation, it follows by the chain rule that g″ ~ η^(2β-1). Substituting these, along with E_p ~ η_p^γ, yields a second-order differential equation for η_p in terms of a sum over other η_q. As in equation 2.6, the sum will be controlled by some dominant term η̄, and the equation for this term is

r₁ d²η̄/dt² + r₂ (dη̄/dt)²/η̄ + r₃ dη̄/dt ~ -ε η̄^(γ+2β-1)    (3.6)

where r₁, r₂, and r₃ are numerical constants. It is not possible to exhaustively determine all possible classes of solutions to equation 3.6, although it is easy to rule out simple exponential solutions. One can, however, look for certain specific classes of solutions, such as the polynomial-time solutions found in the previous section. If we assume a solution of the form η̄ ~ t^z for some exponent z, then the second derivative term is of order t^(z-2), and can be neglected relative to the first derivative term, which is of order t^(z-1). Similarly, the (dη̄/dt)²/η̄ term is also of order t^(z-2) and can also be neglected. The resulting equation for the exponent z thus has exactly the same form as in the zero momentum case of Section 2, and therefore the rate of convergence is the same as in equation 2.7. We have also verified this in numerical simulations.
4 Convergence in Networks with Hidden Units
We now consider networks with a single hidden layer. Let w_ij represent the weights from the input layer to the hidden layer, and Ω_j represent the weights from the hidden layer to the output unit. The total input activation of the hidden units is given by u_j = Σ_i w_ij x_i, and the corresponding hidden unit outputs are o_j = g(u_j). The total input activation of the output unit is now given by h = Σ_j Ω_j o_j. For this network, the resulting equation analogous to equation 2.4 for the rate of change of the output for pattern p is now more complicated:

dy_p/dt = -ε g′(h_p) Σ_q (∂E_q/∂y_q) g′(h_q) [o_p · o_q + Σ_j Ω_j² g′(u_pj) g′(u_qj) (x_p · x_q)]    (4.1)
Since this equation explicitly depends on the second-layer weights Ω_j, we also need an equation governing how these weights are changing with time. This equation comes directly from the gradient-descent learning rule:

dΩ_j/dt = -ε Σ_q (∂E_q/∂y_q) g′(h_q) o_qj    (4.2)
The right-hand side of equation 4.1 has two terms. The first term, depending on the hidden unit output states, is analogous to equation 2.4 for single-layer convergence, and will in fact give the same rate of convergence as in the single-layer case, because the hidden unit outputs are
always of order 1. The second term, which depends on the slope of the hidden unit transfer function times the second-layer weights, could give a different convergence rate if it dominates the first term, which is of order 1. In general, we do not expect this to happen, because the saturation of the hidden unit states generally causes the g′ terms to vanish faster than the growth of the second-layer weights. Certainly this is true for a sigmoid transfer function: the weights can grow only logarithmically (as shown below), and any polynomial decrease of g′ will kill the second term. However, for purposes of argument, let us assume that the hidden unit states do not saturate, and that the g′ terms remain of order 1. This will give us the maximum possible effect of the second term. Expanding equations 4.1 and 4.2 in a small η expansion as before, and suppressing indices for convenience, we obtain the following coupled system of equations:

dη̄/dt ~ -ε η̄^(γ+2β-1) Ω²    (4.3)

dΩ/dt ~ ε η̄^(γ+β-1)    (4.4)

First we look for polynomial-time solutions to this system of equations of the form η̄ ~ t^z, Ω ~ t^λ, with λ > 0 and z < 0. Equation 4.4 yields the following expression for λ:

λ = z(γ + β - 1) + 1 > 0    (4.5)
Equation 4.3, together with the above expression for λ, then gives for the exponent z:

z = -3/(3γ + 4β - 4) < 0    (4.6)
The constraint z < 0 will be satisfied provided that γ > (4/3)(1 - β). The constraint λ > 0 will be satisfied, assuming γ is always positive, whenever β > 1. This will also guarantee that z < 0. To summarize, we have shown that polynomial solutions for both the weights and the errors are possible when the transfer function exponent β > 1. It is interesting to note that these solutions converge slightly faster than in the single-layer case. For example, with γ = 2 and β = 2, η ~ t^(-3/10) in the multilayer case, but as shown previously, η goes to zero only as t^(-1/4) in the single-layer case. We emphasize that this slight speed-up will be obtained only when the hidden unit states do not saturate. To the extent that the hidden units saturate and their slopes become small, the convergence rate will return to the single-layer rate. When β = 1, as with the sigmoid transfer function, the above polynomial solution is not possible. Instead, one must look for solutions in which the weights increase only logarithmically. It is easy to verify that the following is a self-consistent leading order solution to equations 4.3 and 4.4 when β = 1 (for the quadratic error, γ = 2):

η ~ t^(-1/2) (ln t)^(-1/3),   Ω ~ (ln t)^(1/3)    (4.7)
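The exponents in equations 2.7, 4.5, and 4.6 are easy to tabulate; the following snippet checks the γ = 2, β = 2 example above.

```python
from fractions import Fraction

# Exponents for the multilayer polynomial solution (4.5)-(4.6) versus the
# single-layer rate of equation 2.7.
def exponents(gamma, beta):
    z = Fraction(-3, 3 * gamma + 4 * beta - 4)     # multilayer: eta ~ t**z
    lam = z * (gamma + beta - 1) + 1               # second-layer weights: Omega ~ t**lam
    z_single = Fraction(-1, gamma + 2 * beta - 2)  # single-layer exponent
    return z, lam, z_single

print(exponents(2, 2))   # (Fraction(-3, 10), Fraction(1, 10), Fraction(-1, 4))
```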
Figure 1: Plot of total training set error versus epochs of training time on a log-log scale for networks learning the majority function using backpropagation without momentum. The networks had 23 input units and varying numbers of hidden units (0, 3, 10, and 50) in a single layer, as indicated. The training set contained 200 patterns generated with a uniform random distribution. The straight-line behavior at long times indicates power-law decay of the error function. In each case, the slope is approximately -1, indicating E ~ 1/t.
Recall that in the single-layer case, η ~ t^(-1/2). Therefore, the effect of multiple layers is to provide only a logarithmic speed-up of convergence, and only when the hidden units do not saturate. For practical purposes, then, we expect the convergence of networks with hidden units to be no different empirically from networks without hidden units. This is in fact what we have found in our numerical simulations, as illustrated in Figure 1.

5 Discussion

We have obtained results for the asymptotic convergence of gradient-descent learning that are valid for a wide variety of error functions and
transfer functions. In typical situations, we expect the same rate of convergence to be obtained regardless of whether or not the network has hidden units. However, in some cases it may be possible to obtain a slight polynomial speed-up with nonsigmoidal units, or a logarithmic speed-up with sigmoidal units. We point out that in all cases, the sigmoid provides the maximum possible convergence rate, and is therefore a "good" transfer function to use in that sense. We have not attempted analysis of networks with multiple layers of hidden units; however, the structure of equation 4.1 suggests a recursive structure in which one accumulates additional factors of g′ as one adds more layers. To the extent that the hidden unit states saturate and the g′ factors vanish, we conjecture that the rate of convergence would be no different even in networks with arbitrary numbers of hidden layers. We have also examined some modifications to strict gradient-descent learning, and have found that, while momentum terms and margins do not affect the rate of convergence, adaptive learning rate schemes can have a big effect. Another important finding is that the expected rate of convergence does not depend on the use of all 2ⁿ input patterns in the training set. The same behavior should be seen for general subsets of training data. This is also in agreement with our numerical results, and with the results of Ahmad (1988) and Ahmad and Tesauro (1988). In conclusion, our analysis is only the first step toward a more complete theoretical understanding of gradient-descent learning in feed-forward networks. It would be of great interest to extend our analysis to times earlier in the learning process, when not all of the errors are small. The formalism developed in this paper might provide some of the ingredients of such an analysis. It might also provide a framework for the analysis of the numbers, sizes, and shapes of the basins of attraction for gradient-descent learning in feed-forward networks. Another topic of great importance is the behavior of the generalization performance, i.e., the error on a set of test patterns not used in training, which was not addressed in this paper. Finally, the type of analysis used in this paper might provide insight into the development and selection of new learning algorithms that might scale more favorably than backpropagation.
References

Ahmad, S. 1988. A study of scaling and generalization in neural networks. Master's Thesis, University of Illinois at Urbana-Champaign.
Ahmad, S., and Tesauro, G. 1988. Scaling and generalization in neural networks: A case study. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky et al., eds. Morgan Kaufmann, San Mateo, CA.
Hinton, G.E. 1987. Connectionist learning procedures. Tech. Rep. CMU-CS-87-115, Department of Computer Science, Carnegie-Mellon University.
Jacobs, R.A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Le Cun, Y. 1985. A learning procedure for asymmetric network. Proc. Cognitiva 85 (Paris), 599-604.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge, MA.
Parker, D.B. 1985. Learning-logic. Tech. Rep. TR-47, MIT Center for Computational Research in Economics and Management Science.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Werbos, P. 1974. Ph.D. Thesis, Harvard University.
Widrow, B., and Hoff, M.E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Received 12 September 1988; accepted 5 June 1989.
Communicated by Steven Zucker
Learning by Assertion: Two Methods for Calibrating a Linear Visual System Laurence T. Maloney Center for Neural Science, Department of Psychology, New York University, New York, NY 10003 USA
Albert J. Ahumada NASA Ames Research Center, Moffett Field, CA 94035 USA
A visual system is geometrically calibrated if its estimates of the spatial properties of a scene are accurate: straight lines are judged straight, angles are correctly estimated, and collinear line segments are perceived to fall on a common line. This paper describes two new calibration methods for a model visual system whose photoreceptors are initially at unknown locations. The methods can also compensate for optical distortions that are equivalent to remappings of receptor locations (e.g., spherical aberration). The methods work by comparing visual input across eye/head movements; they require no explicit feedback and no knowledge about the particular contents of a scene. This work has implications for development and calibration in biological visual systems.

1 Introduction
It’s likely that no biological visual system is ever perfectly calibrated, but considerable evidence exists that biological visual systems do compensate for optical distortions and initial uncertainty about the position of photoreceptors in the retinal photoreceptor lattice (Banks 1976; Hirsch and Miller 1987). Recent anatomical work, for example, demonstrates apparent disorder in the retinal lattice outside the central fovea, increasing with eccentricity (Hirsch and Miller 1987). Further, the optics of the eye change throughout the life span (Banks 1976; Weale 19821, suggesting that calibration may continue in the adult. Previous work in visual neural development suggests a variety of sources of information that drive calibration (Meyer 1988; Purves and Lichtman 1985; Shatz 1988), and there are computationaI models of visual neuraI development based on these cues (Sejnowski 1987). Yet, although biological visual systems are known to require patterned visual stimulation to achieve normal organization (Movshon and Van Sluyters Neural Computntion 1,392-401 (1989) @ 1989 Massachusetts Institute of Technology
Yet, although biological visual systems are known to require patterned visual stimulation to achieve normal organization (Movshon and Van Sluyters 1981), few models require such stimulation to function. Exceptions include Kohonen (1982) and Toet, Blom, and Koenderink (1987). Further, while all these models could in principle compensate for disorder in the retinal lattice, none of them addresses the problem of compensation for optical distortion. We describe two methods for calibrating a simple linear visual system that work by comparing visual input across eye/head movements. These methods can organize the receptive fields of a simple visual system so as to compensate for irregularities in the retinal photoreceptor lattice and optical irregularities equivalent to distortions in the lattice. They require no explicit feedback and no knowledge about the particular contents of a scene, but instead work by "asserting" that the internal representation of the scene behave in a prespecified way under eye and head movements. We demonstrate that these methods can be used to calibrate the simple, linear visual system described next. In the final section, we discuss the implications of this work for other models of visual processing.

2 A Model Linear Visual System
The model visual system has $N$ photoreceptors arranged in a receptor array. The locations of these receptors are initially unknown. The light image is the mean intensity of light at each location in the receptor array. The output of a receptor is the value of the light image at the location assigned to the receptor. The mapping from light image to receptor array is assumed to be linear. The output (measured intensity) from the $i$th receptor is denoted $\rho_i$. In vector notation, the instantaneous input from the receptor array is $\rho = [\rho_1, \ldots, \rho_N]^T$. Figure 1 also shows an ideal receptor array of $N$ receptors at specified, known locations. This receptor array may be a square or hexagonal grid of receptors, but it need not be. The input from the $i$th receptor in this ideal array is denoted $p_i$, and the input from the ideal array is denoted $p = [p_1, \ldots, p_N]^T$. The real array is connected to the ideal by linear receptive fields (one is shown in the figure). The visual system is calibrated when the receptive fields translate the input of the irregular, real array to what the ideal array would have sampled. Without some restriction on the light images sampled, there need be, of course, no connection between samples taken by the real array and those taken by the ideal array. For the remainder of this paper, the set of light images $L$ is assumed to be a space of two-dimensional finite Fourier series of dimension $N$, where $N$ is the number of receptors in the ideal and in the real arrays. When $N$ is 49, as in the simulations below, the 49-dimensional space of two-dimensional finite Fourier series consists of weighted sums of products of one of $1, \sin 2\pi x, \cos 2\pi x, \sin 2\pi 2x, \cos 2\pi 2x, \sin 2\pi 3x, \cos 2\pi 3x$ and one of $1, \sin 2\pi y, \cos 2\pi y, \sin 2\pi 2y, \cos 2\pi 2y, \sin 2\pi 3y, \cos 2\pi 3y$.
This lowpass assumption reflects the blurring induced by the optics of the eye; it is commonly made in modeling spatial vision (Maloney 1990). With this assumption, we can show that if there is a solution to the calibration problem for a particular real array, ideal array, and linear subspace of lights, then it must be a linear transformation (Maloney 1990). This linear transformation, $W$, will depend on the unknown positions of receptors in the real array and, consequently, on the optics of the visual system.
Figure 1: An irregular photoreceptor array can be translated to an ideal regular array by a proper choice of receptive field weights.
Figure 2: Equivalent receptive fields. The black dots represent the locations of photoreceptors in the disordered real array. The squares correspond to the receptive fields of two receptors in the ideal array. See text.
Each row of the linear transformation $W$, written as a matrix, is the weights of the receptive field of one ideal receptor. Because of the irregular distribution of receptors in the real array, the receptive fields for different receptors in the ideal array may be very different. Figure 2 shows two correctly calibrated receptive fields corresponding to two locations in the ideal array pointed to by dashed lines. The receptors in the real array are shown as black dots. The weights assigned to the nearby receptors in the real array are shown in the "exploded" squares at the ends of the dashed lines. These two receptive fields both extract the equivalent information that would have been sampled by an ideal receptor at the location indicated. The problem of calibration is now reduced to learning the unknown linear transformation $W$, given input only from the real sampling array.
In computer vision this problem is commonly solved by the use of "test patterns." If the contents of the scene are known, then the value of the ideal array, $p$, can be compared with the value that is correct for the known test pattern, and $W$ can be adjusted to eliminate any discrepancies (Rosenfeld and Kak 1976). We describe algorithms that do not require knowledge of the specific contents of the scene.
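For concreteness, here is a minimal sketch (ours, not the authors' code) of the test-pattern approach just described, in a one-dimensional analogue of the model: when the light images are known, the receptive field matrix $W$ can be recovered by ordinary least squares. The array sizes, jitter level, and number of test patterns are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 7, 3                                   # 7 receptors; DC plus 3 sin/cos pairs

ideal_x = np.arange(N) / N                    # regular ideal array on [0, 1)
real_x = (ideal_x + 0.05 * rng.standard_normal(N)) % 1.0   # jittered real array

def sample(positions, c):
    """Sample a lowpass light image (finite Fourier series) at given positions."""
    out = np.full(len(positions), c[0])
    for k in range(1, K + 1):
        out += c[2*k-1] * np.sin(2*np.pi*k*positions) + c[2*k] * np.cos(2*np.pi*k*positions)
    return out

C = rng.standard_normal((200, 2*K + 1))            # 200 known test patterns
Rho = np.stack([sample(real_x, c) for c in C])     # real-array responses rho
P = np.stack([sample(ideal_x, c) for c in C])      # correct ideal-array values p
X, *_ = np.linalg.lstsq(Rho, P, rcond=None)        # adjust W to kill discrepancies
W = X.T                                            # rows = ideal receptive fields
print(abs(W @ Rho[0] - P[0]).max())                # ~0 once calibrated
```

The methods of the next section dispense with the known targets $p$ that this regression requires.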
3 Eye/Head Movements
Consider the consequences of moving the eye and/or head, while the scene remains unchanged. The eye and head may translate to a new position, change the angle of gaze, rotate, "zoom" in or out on the scene, and so on. A particular eye and head movement serves to transform the value of the real (retinal) array (see Fig. 3). The transform, denoted $T$, maps the initial value of the real array $\rho$ to the value after an eye/head movement, $\rho'$. Different eye/head movements, of course, correspond to different transformations, $T$. If the visual system is properly calibrated, then $p = W\rho$ and $p' = W\rho'$ will be related by an equivalent transformation, denoted $t$. Intuitively, if the retinal image moves rigidly to the left on the real array, then, in a calibrated visual system, it would move rigidly to the left on the ideal array as well. $T$ is a physical transformation induced by actual eye/head movements. $t$ is an internal transformation that simulates the external transformation. The last assumption we make concerning the model visual system is that it can perform transformations, denoted $t$, on the ideal array that mimic all possible eye and head movements.
Figure 3: Schematic diagram of an assertion. The consequences of an eye/head movement can be computed in two ways. See text for an explanation.
The set of transformations $t$ is easily computed; it is precisely the transformations needed, for example, to compensate computationally for eye movements. The visual system can now compute the outcome of eye/head movements in two ways. It can look at the scene, take the resulting value $p = W\rho$, and apply $t$ to get $p'$. Alternatively, it can perform the physical transformation $T$ by actually moving, and then compute $p' = WT\rho$. If the two methods of computing $p' = tW\rho = WT\rho$ (the two paths sketched by arrows in Fig. 3) produce different answers, then the visual system is not calibrated. Conversely, a specific transformation $T$ constrains the choice of $W$ so that $tW = WT$. This constraint we term an assertion. We assert, for example, that in a calibrated visual system, moving closer to an object should simply result in scaling the object in size. Any other changes (rippling, flickering, distortion) are indications of failures of calibration.
4 Mathematical Results
To what extent is W constrained by all of the transformations T taken together? For the simple visual system considered here, we have the following mathematical results (Maloney 1989):
Result 1. If $W$ is nonsingular, it is completely determined up to a scaling factor by the assertions generated by all eye and head movements in the scene.
Satisfying the assertions is almost equivalent to calibrating the visual system. The requirement that $W$ be nonsingular avoids pathological solutions where the visual system disconnects itself from the environment. If, for example, all weights in $W$ were set to 0, the visual system, missing all visual input, could never see any failure of rigid transformation under eye/head movements. The second result concerns equivalent receptive fields. Suppose we consider not all eye and head movements, but just (small) eye movements: translations of the retina perpendicular to the line of sight.
The importance of the second result is that it would permit anatomically more regular portions of the retina to serve as a template for organizing equivalent receptive fields elsewhere in the retina. The two results, taken together, suggest that assertions can be used to guide calibration. In the next sections, we develop and illustrate an algorithm for calibration based on assertions.
5 Learning by Assertion
The requirement that $WT = tW$ for all transformations $T$ and all light images determines $W$ as stated above. The penalty term
$$\sum_{T} \sum_{\rho} \left\| tW\rho - WT\rho \right\|^2 \tag{5.1}$$
is minimized precisely when this condition holds. We can therefore develop a learning algorithm for calibrating the simple visual system outlined above by minimizing the quadratic penalty in equation 5.1. The algorithm repeats the following steps until the error term (equation 5.1) is sufficiently small:
1. Generate a light image drawn at random from the lowpass subspace of finite Fourier series.
2. Sample the light image to obtain $\rho$.
3. Simulate a randomly chosen eye/head movement $T$.
4. Resample the light image to obtain $\rho'$.
5. Compute $WT(\rho) - tW(\rho)$, the Euclidean vector difference between the two ways to compute $p'$.
6. Update $W$ using a modified Widrow-Hoff algorithm.
The Widrow-Hoff algorithm (Widrow and Hoff 1960) compares the correct and actual outputs of a linear transformation and alters the transformation to make them coincide. Our algorithm is computationally identical to the Widrow-Hoff algorithm except that it compares $tW\rho$ and $WT\rho$, two possibly erroneous outputs, and attempts to minimize the discrepancy between them. We do not know the conditions under which the modified algorithm is guaranteed to converge. In Step 5, we assume that the transformation $t$, corresponding to $T$, is known to the visual system. We are investigating whether $t$ can be estimated from the visual input with inexact knowledge of $T$ (Maloney 1989).
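The following sketch (our own illustration, not the authors' program) runs steps 1-6 in a reduced one-dimensional analogue: a 7-receptor periodic array, eye movements restricted to shifts by whole grid steps, and $t$ implemented as the matching circular shift on the ideal array. The learning rate, iteration count, and jitter level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 7, 3
ideal_x = np.arange(N) / N
real_x = (ideal_x + 0.04 * rng.standard_normal(N)) % 1.0   # unknown positions

def sample(positions, c):
    """Sample a lowpass light image (finite Fourier series) at given positions."""
    out = np.full(len(positions), c[0])
    for k in range(1, K + 1):
        out += c[2*k-1] * np.sin(2*np.pi*k*positions) + c[2*k] * np.cos(2*np.pi*k*positions)
    return out

W = np.eye(N) + 0.1 * rng.standard_normal((N, N))
W[0] = np.eye(N)[0]          # one receptive field held fixed (cf. Section 6)
eta = 0.02
for _ in range(50000):
    c = rng.standard_normal(2*K + 1)               # 1. random lowpass light image
    rho = sample(real_x, c)                        # 2. sample to obtain rho
    m = int(rng.integers(1, N))                    # 3. eye movement T: shift by m/N
    rho_shift = sample((real_x + m/N) % 1.0, c)    # 4. resample to obtain rho'
    err = W @ rho_shift - np.roll(W @ rho, -m)     # 5. W T rho  minus  t W rho
    W -= eta * np.outer(err, rho_shift)            # 6. modified Widrow-Hoff step
    W[0] = np.eye(N)[0]                            #    re-pin the fixed field
print(np.linalg.norm(err))   # assertion discrepancy; small if calibration worked
```

Note that, as in the text, neither $W\rho'$ nor the shifted $W\rho$ is a known-correct target; the update simply pushes the two estimates of $p'$ together.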
6 Simulation Results
We have implemented the algorithm corresponding to Result 2 above: eye movements only. For the results reported here we assumed a 7 × 7 square grid ideal array with a 49-receptor irregular real array. The locations of the 49 receptors in the real array were chosen by randomly perturbing about half of the receptors in the regular array by about 0.25 of the spacing between receptors in the regular array. The lowpass space was described above. One receptive field in the ideal array was fixed.
Figure 4: (a) The correct receptive field for one of the receptors in a 7 × 7 square grid ideal array. (b) The receptive field "learned" by the algorithm described in the text.
The remaining 48 equivalent receptive fields were learned. The fixed receptive field could be set to arbitrary (nonzero) values. Figure 4a shows one of the 7 × 7 learned receptive fields after 150,000 iterations of the Widrow-Hoff algorithm; Figure 4b shows the correct receptive field for this element. A linear interpolation algorithm was used to render each 7 × 7 grid as a perspective plot. The receptive field has converged to its desired shape.
7 Generalizations and Predictions
We are currently implementing the algorithm corresponding to Result 1 (eye and head movements). For a nonlinear visual system, equation 5.1 still serves as a constraint guiding calibration that, in combination with other constraints, may be sufficient to guarantee proper calibration. Since equation 5.1 is quadratic, it is plausible that any candidate neural learning algorithm is capable of minimizing it, if the penalty term can be
computed. Since the penalty term represents nonrigid motion induced in the representation by eye or head movements, it is plausible that the penalty term is available to the nervous system. The methods described here use a novel cue, derived from comparison of visual input across successive glances at a scene, to calibrate a simple linear visual system. Previous models of visual neural development, reviewed above, use different cues and methods to calibrate the visual system. The methods outlined here differ from these methods in that (1) they directly optimize a visual capability (stability and rigidity under change of direction of gaze), (2) they can compensate for small optical distortions and remappings, (3) they require structured visual input (actual scenes), and (4) they require successive fixations on a single unchanging scene. Taken as a claim about visual development, the procedures developed here are readily testable empirically. (1) Animals reared in environments lacking structured visual input, or in environments where visual input is rendered perpetually nonrigid, or where it is never possible to fixate the same scene twice, should be perfectly calibrated according to previous theories, but not according to the work developed here. (2) Animals with small optical distortions induced early in development should not be perfectly calibrated according to previous theories, but may be so according to the work described here. (3) The visual system will compensate for retinally stabilized optical distortions in adults; these distortions may include small induced scotomas. Prediction (2) may hold while (3) fails if recalibration in the adult is limited by connectivity restrictions on receptive fields (Purves and Lichtman 1985).
Acknowledgments
We thank Marty S. Banks, Jeffrey B. Mulligan, Michael Pavel, Brian A. Wandell, and John I. Yellott, Jr. for useful comments. Send reprint requests to Laurence T. Maloney, Department of Psychology, Center for Neural Science, 6 Washington Place, 8th Floor, New York, NY 10003.
References
Banks, M.S. 1976. Visual recalibration and the development of contrast and optical flow perception. In Perceptual Development in Infancy: The Minnesota Symposia on Child Psychology, Volume 20, A. Yonas, ed., Erlbaum, Hillsdale, NJ.
Hirsch, J., and Miller, W.H. 1987. Does cone positional disorder limit resolution? J. Opt. Soc. Am. A 4, 1481-1492.
Kohonen, T. 1982. Analysis of a simple self-organizing process. Biol. Cybern. 44, 135-140.
Maloney, L.T. 1989. Geometric calibration of a linear visual system. In preparation.
Maloney, L.T. 1990. The consequences of discrete retinal sampling for vision. In Computational Models of Visual Processing, M.S. Landy and J.A. Movshon, eds., MIT Press, Cambridge, MA.
Meyer, R.L. 1988. Activity, chemoaffinity, and competition: Factors in the formation of the retinotectal map. In Cell Interactions in Visual Development, S.R. Hilfer and J.B. Sheffield, eds., Springer-Verlag, New York.
Movshon, J.A., and Van Sluyters, R.C. 1981. Visual neural development. Ann. Rev. Neurosci. 7.
Purves, D., and Lichtman, J.W. 1985. Principles of Neural Development, pp. 251-270. Sinauer, Sunderland, MA.
Rosenfeld, A., and Kak, A.C. 1976. Digital Picture Processing. Academic Press, New York.
Sejnowski, T. 1987. Computational models and the development of topographic projections. Trends Neurosci. 10, 304-305.
Shatz, C.J. 1988. The role of function in the prenatal development of retinogeniculate connections. In Cellular Thalamic Mechanisms, M. Bentivoglio and R. Spreafico, eds., Elsevier Science, Amsterdam.
Toet, A., Blom, J., and Koenderink, J.J. 1987. The construction of a simultaneous functional order in nervous systems. III. The influence of environmental constraints on the resulting functional order. Biol. Cybern. 57, 331-340.
Weale, R.A. 1982. A Biography of the Eye: Development, Growth, Age. H.K. Lewis, London.
Widrow, B., and Hoff, M.E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Received 23 May 1989; accepted 13 July 1989.
Communicated by Christoph von der Malsburg
How to Generate Ordered Maps by Maximizing the Mutual Information between Input and Output Signals
Ralph Linsker, IBM Research Division, T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598 USA
A learning rule that performs gradient ascent in the average mutual information between input and output signals is derived for a system having feedforward and lateral interactions. Several processes emerge as components of this learning rule: Hebb-like modification, and cooperation and competition among processing nodes. Topographic map formation is demonstrated using the learning rule. An analytic expression relating the average mutual information to the response properties of nodes and their geometric arrangement is derived in certain cases. This yields a relation between the local map magnification factor and the probability distribution in the input space. The results provide new links between unsupervised learning and information-theoretic optimization in a system whose properties are biologically motivated. 1 Introduction
A great deal is known experimentally about the complex organization of certain biological perceptual systems such as the visual system in cat and monkey. One way to study these systems theoretically is to explore whether there are optimization principles that can correctly predict what signal transformations are carried out at various stages of a perceptual pathway. I have proposed a principle of "maximum information preservation" (Linsker 1988a,b) according to which a processing stage has the property that the values of the output signals from that stage optimally discriminate, in an information-theoretic sense, among the possible sets of input signals to that stage. [See (Linsker 1990) for a review of earlier ideas relating information theory and sensory processing.] The principle in its basic form states: Given a statistically stationary ensemble of input patterns $L$ (to a processing stage) having probability density function (pdf) $P_L(L)$, and a set $S$ of allowed input-output mappings $S = \{f: L \to M\}$, where each $f$ is characterized by a conditional pdf $P(M \mid L)$, choose an $f \in S$ that maximizes the Shannon information
rate or average mutual information (Shannon 1949)
$$R = \int dL\, P_L(L) \int dM\, P(M \mid L)\, \log\!\left[ P(M \mid L) / P_M(M) \right] \tag{1.1}$$
where $P_M(M) \equiv \int dL\, P_L(L)\, P(M \mid L)$. In this paper we study some consequences of R-maximization for a processing stage in which the choice of set $S$ is relatively simple yet biologically motivated.
2 Information Rate and Gradient Ascent
The type of processing stage we shall consider has the following properties. Each input pattern is denoted by a vector in a space $L$. There is a set of output "nodes" $M$, each characterized by a vector $x(M)$ in $L$ space. The response to an input $L$ occurs in three steps: (1) Feedforward activation: Each node $M'$ receives activation $A(L, M') \ge 0$; this quantity depends on $L$ and the position $x(M')$ of node $M'$. (2) Lateral interaction: The activity of each node $M$ at this step is given by $B(L, M) = \sum_{M'} g(M', M)\, A(L, M')$, where $g(M', M) \ge 0$ and $\sum_M g(M', M) = 1$ for all $M'$. (3) Selection of a single output node to be fired: Node $M$ is selected with probability $P(M \mid L) = B(L, M) / \sum_{M'} B(L, M')$. (If we view the system as a network, $A$ corresponds to the feedforward connection strengths, and $g$ to lateral excitatory strengths. Lateral inhibitory connections are implicit in the selection of a single firing node at step 3. The requirement that a single node fire makes it easier to compute the information rate, but deprives the system of much of the richness of biological network response.) Thus
$$P(M \mid L) = \Big[ \sum_{M'} g(M', M)\, A(L, M') \Big] \Big/ \sum_{M'} A(L, M') \tag{2.1}$$
The functions $A$ and $g$ are specified. We wish to maximize $R$ over all choices of the set of vectors $\{x(M)\}$. We derive a learning rule that, when averaged over the input ensemble, performs gradient ascent on $R$. The derivative of $R$ with respect to the $j$th coordinate of $x(M_0)$ is $\partial R / \partial x_j(M_0) = \int dL\, P_L(L)\, Z_j(L, M_0)$, where
$$Z_j(L, M_0) = \Big[ \partial A(L, M_0) / \partial x_j(M_0) \Big] \Big[ \sum_{M'} A(L, M') \Big]^{-1} \times \sum_M \big[ \log P(M \mid L) - \log P_M(M) \big] \big[ g(M_0, M) - P(M \mid L) \big] \tag{2.2}$$
The learning rule is: Select input presentations $L$ according to the pdf $P_L(L)$. For each $L$ in turn, move each node position $x(M_0)$ by a small amount $kZ(L, M_0)$, where $k > 0$. The rule makes use of one item of "historical" information at each node: $P_M(M)$. If the $\{x(M)\}$ change slowly over many presentations, an average of the firing incidence of $M$ over an appropriate number of recent presentations can provide a suitable approximation to $P_M(M)$. We can interpret equation 2.2 as follows. Consider each node $M$ in turn. Suppose that (1) $P(M \mid L) > P_M(M)$; that is, the occurrence of
pattern $L$ conduces to the firing of $M$. Suppose also that (2) $g(M_0, M) > P(M \mid L)$. In network terms, $g(M_0, M)$ is the strength of the lateral connection from $M_0$ to $M$. By equation 2.1, $P(M \mid L)$ equals the average strength of the active connections to node $M$, defined by weighting each $g(M', M)$ by the activity $A(L, M')$ of node $M'$. If both inequalities hold, then the effect of that $M$ term in the right-hand side of equation 2.2 is to tend to move $x(M_0)$ in the direction of increase of $A(L, M_0)$. Reversing the second inequality tends to move $x(M_0)$ so as to decrease $A(L, M_0)$. Stated informally, the derived learning rule has the effect that each node $M_0$ develops to become more (respectively less) responsive to $L$ if it is relatively strongly (respectively weakly) connected to nodes $M$ that are themselves relatively strongly responsive to $L$. Three elements, namely Hebb-like modification and cooperation and competition among output nodes for "territory" in the input signal space, are apparent in this description. I emphasize that while we specified the activity dynamics of the processing stage, that is, the relationship between input and output (equation 2.1) as a function of the parameters $\{x(M)\}$, we made no assumptions concerning the form of the learning rule by which the $\{x(M)\}$ are to be adjusted. It is striking that a learning rule combining Hebb-like modification and cooperative and competitive learning in a specific way emerges as the gradient of an important information-theoretic function, the average mutual information between input and output. Our model and result can be mapped directly onto a simple ecological problem in which, for example, different organisms $M$ are differently suited to obtain various types of food $L$, or food at different locations $L$. Rather than denoting a rate of transfer of signaling information, $R$ in this case provides a measure of the extent to which the statistical structure of $M$ space reflects, or is matched to, the structure of $L$ space. The $R$ function is thus a candidate for a function that may be locally optimized (subject to developmental constraints) by familiar mechanisms such as adaptation and competition, at least in sufficiently simple ecological models.
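As an illustration, the following sketch (our reconstruction, not the author's code) applies the learning rule of equation 2.2 in the setting of Figure 1 below: $L$ is a periodic unit square, $M$ is a 10 × 10 array of nodes on a torus, and $A$ and $g$ are gaussians with $\alpha = 20$, $\beta = 4/9$, $\gamma = 1$, and a fixed 30 × 30 ensemble of inputs. The batch update and the ensemble-average estimate of $P_M(M)$ are our reading of the figure caption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, gamma = 10, 20.0, 4/9, 1.0

def torus_d2(a, b, period):
    """Squared Euclidean distance taken the 'short way around' the torus."""
    d = np.abs(a[:, None, :] - b[None, :, :])
    d = np.minimum(d, period - d)
    return (d ** 2).sum(-1)

# Nodes M(i,j) and coarsely ordered initial positions x(i,j) in L space.
ij = np.stack(np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1),
                          indexing="ij"), -1).reshape(-1, 2).astype(float)
x = (0.1 * ij + 0.7 * (rng.random((n * n, 2)) - 0.5)) % 1.0

G = np.exp(-beta * torus_d2(ij, ij, n))    # lateral kernel g(M',M)
G /= G.sum(1, keepdims=True)               # sum over M of g(M',M) = 1

u = (np.arange(30) + 0.5) / 30
Ls = np.stack(np.meshgrid(u, u), -1).reshape(-1, 2)    # K = 900 inputs L_k

for it in range(40):
    diff = (x[None] - Ls[:, None] + 0.5) % 1.0 - 0.5   # x(M') - L on the torus
    A = np.exp(-alpha * (diff ** 2).sum(-1))           # feedforward activation
    P = A @ G / A.sum(1, keepdims=True)                # P(M|L), equation 2.1
    PM = P.mean(0)                                     # estimate of P_M(M)
    lt = np.log(P) - np.log(PM)
    s = lt @ G.T - (P * lt).sum(1, keepdims=True)      # sum_M [logP - logP_M][g - P]
    Z = -2 * alpha * diff * A[..., None] / A.sum(1)[:, None, None] * s[..., None]
    x = (x + (gamma / len(Ls)) * Z.sum(0)) % 1.0       # Delta x = (gamma/K) sum_k Z
```

Plotting the links between neighboring $x(i, j)$ after successive iterations reproduces the kind of disentangling shown in Figure 1.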
3 Neighbor-Preserving Map Formation
Figure 1 shows the emergence of a neighbor-preserving or "topographic" map as the result of performing gradient ascent in $R$, for a case in which the $L$ and $M$ spaces are both two-dimensional. (To reduce boundary effects, periodic boundary conditions are imposed; each space can thus be regarded as the surface of a torus.) For this computation $P_L(L)$ is uniform, $A(L, M') \propto \exp(-\alpha |L - x(M')|^2)$, $g(M', M) \propto \exp(-\beta |M' - M|^2)$, and $|\cdots|$ denotes Euclidean distance going the "short way around" the periodic $L$ or $M$ space. The generation of a neighbor-preserving map by a connection modification rule is of course not new (e.g., von der Malsburg and Willshaw 1977; Kohonen 1982). The point of interest
Figure 1: Gradient ascent in the information rate $R$ induces a neighbor-preserving map. The input space $L$ is a unit square. The $M$ space consists of a 10 × 10 square array of nodes. Periodic boundary conditions are imposed (see text). Each node $(i, j)$ $(i, j = 1, \ldots, 10)$ is initially mapped onto a point $x(i, j)$ in $L$ space, which is randomly chosen from a uniform distribution on a square of side $s = 0.7$ centered at $x = (0.1i, 0.1j)$. Thus, a very coarse topographic ordering is initially present. [If the initial $\{x(i, j)\}$ are entirely random $(s = 1)$, a map having partially disrupted topographic order and a lower lying local maximum of $R$ is obtained.] At each iteration, $x(i, j)$ is changed by $\Delta x(i, j) = (\gamma/K) \sum_{k=1}^{K} Z[L_k, M(i, j)]$ (see equation 2.2), where $\{L_k\}$ is an ensemble of input vectors. Parameter values are $\alpha = 20$, $\beta = 4/9$, $\gamma = 1$, and $K = 900$ ($\{L_k\}$ is a 30 × 30 array of points). Plots show the links connecting each $x(i, j)$ with $x(i+1, j)$ and $x(i, j+1)$, after 0 (upper left), 10 (upper right), 15 (lower left), and 40 (lower right) iterations.
is rather that an optimization principle and learning rule yielding this result have emerged from information-theoretic considerations. Figure 1 shows that a square grid in $M$ space is optimally mapped onto a square grid in $L$ space. That is, the "magnification factors" of the mapping $M \to x(M)$ are the same in both coordinate directions, and orthogonality of the coordinate axes is preserved. In the next section we prove that this is a consequence of the principle of maximum information preservation under conditions that are more general than those of Figure 1.
4 Coarse-Grained Information Rate
We can derive a useful "coarse-grained" version of equations 1.1 and 2.1 under certain conditions. Suppose that $A(L, M')$ is negligible for $|L - x(M')| > a_0$ and that $g(M', M)$ is negligible for $|M' - M| > g_0$. Suppose also that the following approximations can be made: (1) The mapping $M \to x(M)$, which we will call the embedding of the $M$ space in $L$ space, is linear over a local region that is large compared with the length scales $a_0$ and $g_0$. (2) For each $L$, firing is confined to a single such local region in $M$ space. (3) The firing rate is uniform over such a local region. (We will consider a two-dimensional $M$ space, but the generalization to other dimensionalities is straightforward.) Figure 2a shows an orthogonal coordinate grid and unit vectors $u, v$ in $M$ space, and Figure 2b shows a disk cut from the (arbitrary) linear mapping of this grid onto a two-dimensional subspace of $L$. The mapping is characterized by the lengths $f$ and $g$ of the images of $u$ and $v$ under the mapping, and by the angle $\theta$ between them. An area element $dM$ in $M$ space is thus mapped onto an area $dL = c\, dM$ in $L$ space, where $c = fg \sin\theta$. For definiteness we choose $A(L, M') \propto \exp(-\alpha |L - x(M')|^2)$ and $g(M', M) = (\beta/\pi) \exp(-\beta |M' - M|^2)$. The density of nodes in $M$ space is uniform, and we pass to the continuum limit, so that sums over $M$ become integrals over area elements in $M$ space.
4.1 Derivation - Qualitative Aspects. We can now express R-maximization as a geometric optimization problem. We first describe qualitatively the main geometric effects that arise. (1) By equation 1.1 we have $R = R_1 + R_2$, where $R_1 = -\int dM\, P_M(M) \log P_M(M)$ and $R_2 = \int dL\, P_L(L) \int dM\, P(M \mid L) \log P(M \mid L)$. (2) The quantity $R_1$ is the entropy of the pdf $P_M(M)$, and is a maximum when $P_M(M)$ is uniform over $M$. An example of an embedding that achieves this maximum is one in which the density of nodes $M$ mapped to each region of $L$ space is proportional to $P_L(L)$. (3) The quantity $R_2$ is the average over input vectors $L$ of the negative of the entropy of the pdf $P(M \mid L)$. Its value is greater when the embedding is chosen so that the $P(M \mid L)$ distribution for each $L$ is more sharply localized to a small region of $M$ space. (The intuitive
Figure 2: Locally linear mapping of the M-space coordinate system onto a region of L space. (a) Orthonormal vectors $u, v$ and coordinate grid in M space. (b) Arbitrary linear mapping of this grid onto a region of L space. (c) A mapping that maximizes information rate under conditions stated in the text.
idea is that if each input vector activates fewer nodes, then one's ability to discriminate among the possible input vectors, given knowledge of which node fired, is improved.) This sharpened localization of $P(M \mid L)$ is achieved in two ways: (a) Since the spread of activation due to the feedforward process $A(L, M')$ has fixed extent in $L$ space, lowering the density of nodes $M'$ in the vicinity of $L$ tends to localize $A(L, M')$, and thereby $P(M \mid L)$, to a smaller region of $M$ space. This effect favors spreading out the embedding over a larger region of $L$ space. The balance between this effect and the tendency to cluster the nodes in regions of high $P_L(L)$ (item 2 above) determines the "magnification factor" of the mapping (next section). (b) When viewed in $L$ space (Fig. 2b), the contour lines of $A(L, M')$ are circular, but those of $g(M', M)$ are in general elliptical. When $f = g$ and $\sin\theta = 1$ (Fig. 2c) the contour lines of $g(M', M)$ become circular in $L$ space, and $P(M \mid L)$, which is proportional to the convolution of $A(L, M')$ and $g(M', M)$, becomes more sharply localized, as shown at the end of this section.
4.2 Mathematical Details. To derive the coarse-grained information rate $R$ we need to express $R_2$ in terms of the geometric properties of the embedding, such as the values of $(f, g, \theta)$ at each $M$. The qualitative statements of the previous paragraph apply to a variety of functional forms for $A(L, M')$ and $g(M', M)$, and one can, in general, compute the entropy of the $P(M \mid L)$ distribution numerically. However, when $A(L, M')$ and $g(M', M)$ have the gaussian forms assumed above, we can proceed analytically. The derivation is outlined in the remainder of this paragraph, which may be skipped by the reader interested only in the result and its consequences. (1) Rewriting $P(M \mid L)$ of
equation 2.1 as a ratio of integrals in $L$ space (using $dL' = c\, dM'$), we find that $P(M \mid L)$ is a two-dimensional gaussian function of distance: $P(M \mid L) = (c/\pi)(a_+ a_-)^{1/2} \exp(-a_+ \xi^2 - a_- \eta^2)$, where $a_\pm^{-1} = \alpha^{-1} + \beta^{-1}[h \pm (h^2 - c^2)^{1/2}]$ and $h = (f^2 + g^2)/2$. We define $L_0$ as the point in the embedded $M$ sheet that lies closest to the input vector $L$, and $(\xi, \eta)$ as the components of the vector $[x(M) - L_0]$ along the major and minor axes of the elliptical contour lines of $g(M', M)$ in $L$ space. The values of $(f, g, \theta)$ at $L_0$ are used, since the activation for given $L$ is confined to a local region of the embedding centered at $L_0$. (2) The negative of the entropy of the $P(M \mid L)$ distribution for given $L$ is $u(L) = \int dM\, P(M \mid L) \log P(M \mid L) = \log[(a_+ a_-)^{1/2} c / (\pi e)]$. (3) Note that $u(L)$ depends on $L$ only through $L_0$, and that the integral of $P_L(L)$ over all $L$ sharing the same $L_0$ is $P_M(M_0)/c(M_0)$, where $M_0$ is a node in the vicinity of $L_0$. Therefore $R_2 = \int dL\, P_L(L)\, u(L) = \int dM\, P_M(M)\, u$. Some algebraic manipulation then yields the result stated below.
4.3 Results. The coarse-grained information rate we derive is $R = \int dM\, r(M)$ with
$$r(M) = p_M(M) \log\!\left[ \frac{\alpha\beta}{\pi e\, p_M(M)} \cdot \frac{fg \sin\theta}{\left[ \beta^2 + \alpha\beta(f^2 + g^2) + \alpha^2 (fg \sin\theta)^2 \right]^{1/2}} \right] \tag{4.1}$$
where $(f, g, \theta)$ are functions of $M$, and $p_M(M)$ is the firing probability per unit area in $M$ space. The firing probability per unit area of the embedding in $L$ space is $q(M) \equiv p_M(M)/c$, which depends on the $P_L(L)$ distribution and the shape of the embedded surface in $L$ space but is independent of $(f, g, \theta)$. When the stated approximations are valid, we see that rate maximization has become a geometric problem: that of embedding an $M$ sheet in $L$ space subject to constraints (such as boundary conditions) so as to maximize the integral over $M$ of equation 4.1. (Only a portion of the mapping from $M$ to $L$ might satisfy the stated approximations. In that case only the contribution of that portion to $R$ is being considered here.) The bracket in equation 4.1 can be written as $c^2 X$, where $X = (\alpha + \beta\mu)^2 + 2\alpha\beta\mu^2\varepsilon$, with $\mu \equiv 1/c$ and $\varepsilon = -c + (f^2 + g^2)/2$. When $L$ and $M$ have the same dimension, $\mu$ is called the "magnification factor." Note that $\varepsilon \ge 0$, with $\varepsilon = 0$ only when $f = g = \mu^{-1/2}$ and $\sin\theta = 1$; that is, when a square coordinate grid in $M$ maps onto a square grid in $L$ (Fig. 2c). Any deviation from this square mapping makes a negative contribution to $R$, akin to a "surface distortion energy" cost term.
5 Magnification Factor
Now consider a number, $N$, of local regions of the embedding in $L$ space (they need not be near each other). The $k$th such region has area $\Delta L_k$ in $L$
space, $\mu$-value $\mu_k$, firing rate $q_k$ per unit area in $L$ space, and information rate $r_k = r(M)$ per unit area in $M$ space, where $M$ is a node in region $k$. Assume that $f = g = \mu^{-1/2}$ and $\sin\theta = 1$ for each region, so that $\varepsilon = 0$ as derived above. How should a fixed total area of $M$ space (this area is $\sum_k \mu_k \Delta L_k$) be allocated among the $N$ regions so as to maximize the total contribution, $\sum_k r_k \mu_k \Delta L_k$, to $R$ from these regions? The result (obtained using equation 4.1 for each $r_k$, and the Lagrange multiplier method) is that the $\mu_k$ should be chosen such that the value of $(\mu_k + \mu_k^2 \beta/\alpha)/q_k$ is the same for all $k$. We thus obtain an "equation of state" relating $\mu$ (the area of the $M$ sheet that maps onto a unit area in $L$ space) to the firing probability $q$ of that region of the $M$ sheet (measured per unit area in $L$ space):
$$\mu = -t/2 + \left[ (t^2/4) + \lambda t q \right]^{1/2} \tag{5.1}$$
where $t = \alpha/\beta$, and $\lambda$ is chosen so that the total area of $M$ space being allocated has the desired value. Note that if the $L$ space is two-dimensional and the mapping is bijective, $q \propto P_L(L)$. Our "equation of state" has two limiting regimes. (1) If $\mu\beta \ll \alpha$, then $\mu \propto q$ and $P_M(M)$ is constant. In this regime the lateral interaction $g$ accounts for most of the spatial extent of the spread of activation within the $M$ sheet (for given input $L$). (2) If $\mu\beta \gg \alpha$, then $\mu \propto q^{1/2}$. In this regime the feedforward function $A$ accounts for most of the activation spread. If $M$ is a one-dimensional space, the corresponding limiting forms of the "equation of state" are, respectively, $\sigma \propto q$ and $\sigma \propto q^{1/3}$, where $\sigma$ is the linear magnification factor. If one is not attempting to maximize the information rate, different mapping algorithms can be devised, which in general give rise to different magnification factors. For example, an earlier "feature map" algorithm (Kohonen 1982) tends to assign input vectors $L$ to output nodes $M$ so as to minimize the variance among the inputs assigned to the same node. Analysis of the magnification factor for that algorithm in the one-dimensional case (Ritter and Schulten 1986; Kohonen 1988) shows that, contrary to earlier supposition (Kohonen 1982), $\mu$ is not proportional to $q$. By way of contrast, R-maximization yields a distribution of nodes for which (in the first limiting regime above) $\mu \propto q$ and the firing probability per node is uniform. This result is sometimes desired in practical classification applications, even apart from the issue of maximizing the information rate.
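A quick numeric check of these two limits (our own illustration; the parameter values are arbitrary):

```python
import numpy as np

def mu_of_q(q, t, lam):
    """Equation 5.1: mu = -t/2 + sqrt(t^2/4 + lam*t*q), with t = alpha/beta."""
    return -t / 2 + np.sqrt(t ** 2 / 4 + lam * t * q)

q = np.linspace(0.1, 1.0, 4)     # firing probabilities per unit L area
t = 1.0
print(mu_of_q(q, t, 1e-3) / q)           # ~constant: mu proportional to q
print(mu_of_q(q, t, 1e3) / np.sqrt(q))   # ~constant: mu proportional to sqrt(q)
```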
6 Information Content and Information Value
The leading bit and a lower order bit of a signal value have equal information content in the sense of Shannon. Yet the leading bit is often more important (to an animal's survival, for example). Since the proposed
principle maximizes the information rate, how can it take any account of the relative importance of information? It does so in the present model by means of the function $A(L, M')$. If two inputs $L_1, L_2$ differ only in their low-order bits, then the overlap between $A(L_1, M')$ and $A(L_2, M')$ is large for any $x(M')$. Because of this, a given number of $M$ nodes can more reliably (hence with greater information rate) discriminate the value of a leading bit than that of a low-order bit. The function $A$ can, but need not, represent a physical process (such as noise or spread of activation) that confounds sufficiently similar input signals. Even without regard to such a physical process, however, it may be desirable for a processing stage to ignore small differences in input signal values. How might one control the resolution below which such small differences are ignored? This type of control is useful in facilitating the formation of generalizations, and may be relevant to attentional mechanisms. Within the present information-theoretic framework one can control the desired resolution by maximizing, not the information that $M$ conveys about $L$, but the information that $M$ conveys about the coarse-grained value $\tilde{L}$ of $L$. (The introduction of $\tilde{L}$ is a device for removing the informational value of discriminations among inputs that differ by less than a desired amount. It is not related to the derivation of the "coarse-grained" information rate in an earlier section.) To outline how this quantity may be maximized, note that the inverse of the mapping $L \to \tilde{L}$ is a mapping from each coarse-grained value $\tilde{L}$ to a neighborhood of similar $L$s. Define $A(\tilde{L}, M')$ as the probability $\Pi(L \mid \tilde{L})$ of $\tilde{L}$ generating $L$ via this inverse mapping, multiplied by the activation $a(L, M')$ of $M'$ due to $L$, and integrated over $L$. [Given $\tilde{L}$, assume $\sum_{M'} a(L, M')$ is constant over all $L$ for which $\Pi(L \mid \tilde{L})$ is nonnegligible.] Then equations 1.1, 2.1, and 2.2 all remain valid (with $L$ replaced by $\tilde{L}$), and we can use the methods described to maximize the average mutual information between $\tilde{L}$ and $M$.
7 Conclusion
We started with a design principle: choose the parameters of a signal processing system so that the system’s outputs optimally discriminate among an ensemble of inputs. We derived a learning rule that generates such a system, and found, moreover, that the rule combines elements of Hebb-like modification and cooperative and competitive learning to accomplish this task. Topographic mapping, map magnification factors, and other geometric properties emerge from an analysis of the optimization process. The approach described holds promise for the analysis of adaptation rules, information-theoretic optimization, and emergent structure in more complex and biologically realistic systems.
References
Kohonen, T. 1982. Analysis of a simple self-organizing process. Biol. Cybern. 44, 135-140.
Kohonen, T. 1988. Self-Organization and Associative Memory, 2nd ed. Springer-Verlag, New York.
Linsker, R. 1988a. Self-organization in a perceptual network. Computer 21 (March), 105-117.
Linsker, R. 1988b. Towards an organizing principle for a layered perceptual network. In Neural Information Processing Systems (Denver, CO, 1987), D.Z. Anderson, ed., pp. 485-494. American Institute of Physics, New York.
Linsker, R. 1990. Perceptual neural organization: Some approaches based on network models and information theory. Ann. Rev. Neurosci. 13, in press.
Ritter, H., and Schulten, K. 1986. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern. 54, 99-106.
Shannon, C.E. 1949. In The Mathematical Theory of Communication, C.E. Shannon and W. Weaver, eds. University of Illinois Press, Urbana.
von der Malsburg, C., and Willshaw, D.J. 1977. How to label nerve cells so that they can interconnect in an ordered fashion. Proc. Natl. Acad. Sci. U.S.A. 74, 5176-5178.
Received 22 May 1989; accepted 13 July 1989.
Communicated by Harry Barrow
Finding Minimum Entropy Codes
H.B. Barlow, T.P. Kaushal, G.J. Mitchison, Physiological Laboratory, Downing Street, Cambridge, CB2 3EG, England
To determine whether a particular sensory event is a reliable predictor of reward or punishment it is necessary to know the prior probability of that event. If the variables of a sensory representation normally occur independently of each other, then it is possible to derive the prior probability of any logical function of the variables from the prior probabilities of the individual variables, without any additional knowledge; hence such a representation enormously enlarges the scope of definable events that can be searched for reliable predictors. Finding a Minimum Entropy Code is a possible method of forming such a representation, and methods for doing this are explored in this paper. The main results are (1) to show how to find such a code when the probabilities of the input states form a geometric progression, as is shown to be nearly true for keyboard characters in normal text; (2) to show how a Minimum Entropy Code can be approximated by repeatedly recoding pairs, triples, etc. of an original 7-bit code for keyboard characters; (3) to prove that in some cases enlarging the capacity of the output channel can lower the entropy.
1 Reasons for Minimum Entropy Coding
When any combination of sensory stimuli - a sensory event - occurs the brain needs to know whether it is an expected, common, usual event, or an unexpected, rare, unusual event. It is evident that the brain works this out for itself because animals make startle or alerting responses to unusual stimuli, and the following explanation of the need to attend to the unusual is also fairly obvious. One of the brain's most important jobs is to find predictive or causal relationships between the sensory events that impinge on it, the motor actions it takes, and the rewards and punishments these lead to. Now unexpected rewards and punishments are somewhat unusual, and for an animal with good knowledge of its environment they are presumably very unusual, so it follows that the sensory events it is seeking as new but reliable predictors are themselves somewhat or very unusual. For the purpose of learning something new the vast majority of sensory events can be ignored, but it is necessary to
know the prior probability of what is happening in order to select the small minority of events that might form reliable new predictors. Another way of seeing the importance of prior probabilities is to consider what is needed to detect "suspicious coincidences," which are the clues for detecting causal relations not only in detective stories but also in real life. Obviously the co-occurrence of two events is not in the least suspicious if they both occur frequently, so it is essential to know that the constituent events are rare before one can reach any conclusion at all about the significance of a coincidence that might point to a previously unsuspected causal factor. A sensory event is signaled by the joint activity of a good many nerve cells, and it is therefore necessary to know the prior probability, not just of the signals from individual cells, but of combinations and perhaps other logical functions of such signals. There is only one condition under which this can be done, short of having available a past record of the occurrence of every combination that occurs, and that is for the activities of each of the nerve cells to be independent of all others in the sensory environment to which the animal is accustomed. Minimum Entropy Coding aims to achieve this by measuring the entropies of individual representative variables and choosing the reversible code for which the sum is a minimum. The summed entropy seems an easily computed measure for assessing putative codes, and the aim of this article is to explore methods of putting the idea into practice. The general idea of Minimum Entropy Coding has been discussed by Watanabe (1981, 1985) and the principle is described in a previous article (Barlow 1989); the aim is to find a set of symbols to represent the messages such that, in the normal environment, the occurrence of each symbol is independent of the occurrence of any of the others. If such a set can be found it is called a factorial code, since the probabilities of the 1s and 0s in the code word are factors of the probability of the input state the word represents. In such a coded representation any nonindependence between two symbols signals the occurrence of a new, hitherto undiscovered, association, whereas with nonindependent codes one must always take into account the associative structure of the normal messages from the environment. The idea of minimum entropy coding comes from thoughts about redundancy reducing codes (Attneave 1954; Barlow 1960; Watanabe 1981, 1985) and is related to decorrelation (Barlow and Földiák 1989); such a code separates knowledge of the environment derived from the redundancy of past messages from the information in the current inputs, and its specification stores the knowledge that is required for the detection of new associations. In the previous article the principle was explained using for an example the coding of keyboard characters in 7 binary digits, as in the familiar 7-bit ASCII code. The advantages of this choice are its familiarity and simplicity, the fact that the characters occur nonrandomly with known frequencies (Kučera and Francis 1967; Zettersten 1978), and the ready
availability of samples of English text with normal statistical structure. An appendix shows some of the more general mathematical properties of minimum entropy codes, while below we discuss how the 7-bit coding can actually be done, and what its results are.
2 Finding the Codes
The requirement is to code keyboard characters that occur with probabilities $A_j$ onto 7 binary outputs that occur as nearly as possible independently of each other when transmitting normal messages. The entropy calculated from the probabilities of the input states is $E(A) = -\sum A_j \log A_j$, and for a code $b$ this is normally less than the sum of the bit entropies $e(A, b)$ given by
$$e(A, b) = -\sum_i p_i \log p_i - \sum_i q_i \log q_i$$
where $p_i = 1 - q_i$ is the probability of the $i$th bit taking the value 1 in the encoded form of ordinary text. The aim is to find a reversible code that minimizes $e(A, b)$. We assume that $p_i < q_i$, which is a trivial restriction since substituting its logical complement for an output simply interchanges $p_i$ and $q_i$.
2.1 The Number of Possible Codes. Consider a list of the input states in any order, and number them with a seven digit binary number 0000000, 0000001, 0000010, 0000011, ..., 1111111. Changing the order of input states in the list changes the code, and any reversible code can be produced by a list in appropriate order; hence the number of reversible codes is the number of permutations of a list of $2^7$, that is $(2^7)!$. But many of the above codes are equivalent from the point of view of redundancy. One can substitute the complement for each output, interchanging the values of $p$ and $q$ and obviously leaving the entropy unchanged. Since this can be done independently for each of the outputs one must divide by $2^7$. In addition, one can permute all the outputs without changing the entropy sum, so one must also divide by $7!$. Thus the total number of nonequivalent codes is $(2^7)!/(2^7 \cdot 7!)$, still a very large number.
2.2 Finding the Code with the Minimum Sum of Bit Entropies. We have failed to find a general method, other than searching through all codes and calculating the summed entropy for each, which the number of codes makes impractical. We have so far tried three methods that are described briefly below. Minimizing the sum of 7 quantities that obey the constraints of the bit probabilities of a reversible code has the flavor of a problem suitable for a network solution, but this has not yet been tried. One approach is to minimize the average probability of a bit being active, since given that $p_i < 1/2$, the lower it is the less its individual contribution is to the summed entropy.
Code                      Sum of bit entropies e(A,b)   Redundancy R (%)
Minimum bit probability   4.50                          3.7
Binary sequence           4.37                          0.7
ASCII                     5.42                          24.9
Random-1                  6.89                          58.8
Random-2                  6.77                          56.0
Random-3                  6.74                          55.3

Table 1: Entropies and Redundancies of Different Codes. Character entropy E(A) = 4.34 bits. Redundancy R = [e(A,b) - E(A)]/E(A).
This can be done as follows. First measure the probabilities of each of the input states and rank them in order of decreasing probability. The most probable is assigned the output with all 0s, the next 7 are assigned to the outputs with just one bit active, the next 21 to those with 2 active, the next 35 to those with 3 active, and so on. We call these Minimum Bit Probability codes. Although this gives the lowest average number of active outputs, it does not necessarily minimize the sum of bit entropies (see Appendix). Table 1 shows the values of the average entropy E(A) obtained from the character frequencies in a portion of a scientific text, and the sum of bit entropies for the same text coded in this way. For a factorial code they would be equal, and it will be seen that the sum of the bit entropies comes quite close to the character entropy.
Another code can be obtained by listing input state probabilities as before and then numbering them in an ordinary binary sequence. This Binary Sequence Coding gives a factorial code provided that the probabilities of the input states form a geometric progression (see Appendix). It will be seen from Table 1 that this code reduces the summed bit entropies to very nearly the same as the character entropy, the residual redundancy being only 0.7%. Thus, it is very nearly a factorial code. The reason for this success is that the character probabilities form a geometric progression, as shown in Figure 1, for which binary sequence coding yields a factorial code.
We can compare these codes with the regular 7-bit ASCII code, the entropies for which are also shown in Table 1. Though this was not defined with any idea of minimizing entropy in mind, the summed bit entropy is in fact a great deal less than 7, because the more frequently used letters have fewer active bits.
2.3 Subsection Coding. In order to test how successfully the next set of recoding procedures minimized the entropy we needed to start
Figure 1: The ranked character probabilities plot close to a straight line on these log-linear axes, showing that the ranked probabilities form a geometric progression. The text sample analyzed was a portion of a scientific text similar to this paper, and was composed of approximately 42,000 printable ASCII characters. Similar distributions are commonly found (Kučera and Francis 1967; Zettersten 1978).
with a 7-bit binary code with maximum entropy, so we have generated random codes in which the binary numbers up to 1111111 were assigned at random to the input states; three examples of such codes are included in Table 1.
Minimum bit probability and binary sequence codes are based on measurements of the frequencies of the 128 different input states of a 7-bit code. This is not difficult to do, but it would become impossible if one were dealing with the very large input channels that feed the brain, so we have been very interested in how well one can achieve a minimum entropy code by repeated recoding of small subsections. The codes listed in Table 2 as pairs were formed by taking 2 bits at a time, recoding on the principle of minimum bit probability or binary sequence coding, and doing this repeatedly until no further improvement occurred. Similar codes were formed taking 3, 4, or 5 at a time for those marked triples, quadruples, and quintuples. For the minimum bit probability code, Figure 2a graphs the summed bit entropies as a function of the number of recoding attempts, taking different numbers of bits at a time; Figure 2b does likewise for the binary sequence code. It will be seen that recoding small subsets of elements is surprisingly ineffective in producing low entropy codes, and many repetitions are needed to approach low values.
Subsection     Minimum bit probability     Binary sequence
               e(A,b)      R (%)           e(A,b)      R (%)
Pairs          5.76        32.7            5.76        32.7
Triples        4.61        6.2             4.75        9.4
Quadruples     4.55        4.8             4.44        2.3
Quintuples     4.51        3.9             4.37        0.7
All 7 bits     4.50        3.7             4.37        0.7
Table 2: Entropies and Redundancies of Subsection Codes: Recoding of Random-2 by Subsection Method. Entropy of Random-2 code = 6.77 bits; character entropy E(A) = 4.34 bits; redundancy R = [e(A,b) - E(A)]/E(A).
2.4 Expansive Coding. It is shown in the appendix that coding onto an expanded channel can in some cases decrease the entropy. This is unexpected, because in such cases some of the output states do not occur and this represents additional redundancy. Furthermore, the recoding tends to introduce negative dependencies between some of the output variables, and when this happens the code cannot be a true minimum entropy code. Nonetheless, when we compare the efficiency of association formation with various coded representations we plan to include such expanded codes, since they have some plausibility as the type of coding that the cerebral cortex does.
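To make the two ranking-based schemes of Section 2.2 concrete, here is a small sketch (ours, not the authors' program; the sample text and the 7-bit codeword space are illustrative assumptions) that builds both codes from measured character frequencies and compares their summed bit entropies with the state entropy E(A):

```python
import math
from collections import Counter

def bit_entropy_sum(probs, codes, nbits=7):
    """e(A,b) = -sum_i (p_i log2 p_i + q_i log2 q_i) over the nbits outputs."""
    total = 0.0
    for i in range(nbits):
        p = sum(a for a, w in zip(probs, codes) if (w >> i) & 1)
        for v in (p, 1.0 - p):
            if v > 0:
                total -= v * math.log2(v)
    return total

def make_codes(probs, nbits=7, scheme="binary"):
    """Assign 7-bit words to states ranked by decreasing probability."""
    order = sorted(range(len(probs)), key=lambda j: -probs[j])
    words = list(range(2 ** nbits))
    if scheme == "minbit":                           # minimum bit probability:
        words.sort(key=lambda w: bin(w).count("1"))  # fewest 1s to commonest states
    codes = [0] * len(probs)
    for w, j in zip(words, order):
        codes[j] = w
    return codes

text = "the quick brown fox jumps over the lazy dog " * 50   # stand-in sample
counts = Counter(text)
probs = [c / len(text) for c in counts.values()]
E = -sum(a * math.log2(a) for a in probs)                    # state entropy E(A)
for scheme in ("minbit", "binary"):
    print(scheme, bit_entropy_sum(probs, make_codes(probs, scheme=scheme)), "E(A) =", E)
```

On text whose ranked probabilities form a near-geometric progression, the binary sequence variant should approach E(A), mirroring Table 1.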
3 Conclusions
The 7-bit coding of keyboard characters has served to illustrate the problem of finding minimum entropy codes. In this example there happens to be a good approximation to a factorial code because the probabilities of the input states form an approximate geometric progression, but it does not lead to a code in which the individual bits have any readily interpretable significance: for instance the bits in the code do not correspond to significant features of the characters, such as being vowels or numbers (see Table 3). A minimum entropy code would come into its own when the states being coded have a natural combinatorial character, but it is unknown how far this is true of the world of sensory data.
[Figure 2 appears here: panel (a), "Subsection coding using minimum bit probabilities," and panel (b), "Subsection coding using binary sequences," each plotting summed bit entropy against the number of recodings applied, with the character entropy marked as a dashed baseline.]
Figure 2: Coding subsections of the input. These graphs show the gradual reduction in entropy that is typically achieved by repeatedly recoding subsections of the 7-bit input. In (a) a Minimum Bit Probability code was applied to pairs, triples, etc. of bits; the group of bits $(b_i, b_j, b_k, \ldots)$ chosen for recoding was the one that maximized $P(b_i \,\&\, b_j \,\&\, b_k \ldots) - P(b_i) P(b_j) P(b_k) \ldots$. A $\chi^2$ criterion was also tried, without improvement. Coding was repeatedly applied to pairs, triples, etc. until there was no further reduction of the summed bit entropy. The procedure in (b) was the same, except that a binary sequence code was applied to the pair, triple, etc. selected for recoding. The character entropy, marked on the graphs at 4.34 bits/character, is the theoretical minimum that can be achieved by any recoding, and the better the coding, the closer and faster the curves will fall to this baseline. As can be seen in (b), the full width (7-bit) binary sequence code achieves this goal in a single step.
Rank   Character   Code 6543210   Probability
0      'space'     0000000        0.17595
1      e           0000001        0.09843
2      t           0000010        0.07817
3      i           0000011        0.06272
4      o           0000100        0.06214
5      a           0000101        0.05791
6      n           0000110        0.05773
7      s           0000111        0.05620
8      r           0001000        0.04375
9      h           0001001        0.03604
10     l           0001010        0.03050
...    ...         ...            ...
81     [           1010001        0.00002

Bit    Probability
6      0.001196
5      0.018945
4      0.110085
3      0.247532
2      0.368637
1      0.401721
0      0.439799

Table 3: Actual Binary Sequence Code
Although in this case the code does not impose a particularly informative classification or categorization on the characters, it nevertheless has the advantages promised in the introduction, namely that the probability of the characters can be predicted from the probabilities of the bits that represent them. Translating this to the field of sensory events, if these were represented by variables that occurred independently under normal circumstances, one's ability to detect unusual or suspicious coincidences would be enormously enhanced.
4 Mathematical Appendix (G.J. Mitchison)
Suppose we are given a set of states whose probabilities are A_j (with Σ_j A_j = 1), and a code b that assigns to each character a binary string. In the coding problem considered in the main text the states are keyboard characters. Let the ith bit of the code for the jth state be denoted by b_ij. The probability of the ith bit taking the value 1, summed over all the states, is denoted by p_i. Formally,

p_i = Σ_{j : b_ij = 1} A_j,   (4.1)

and we define q_i = 1 − p_i. We say that b is a factorial code if the probability of each state A_j is the product of the probabilities that the bits have the values in the code for A_j. Formally,

A_j = π_j,   (4.2)

where

π_j = Π_{i : b_ij = 1} p_i · Π_{i : b_ij = 0} q_i.

It is assumed here that there is a state for each possible polynomial π_j, i.e., that there are 2^N states, where N is the number of bits. The state entropy E(A) is defined by E(A) = −Σ_j A_j log A_j. The sum of the bit entropies, e(A, b), is defined by e(A, b) = −Σ_i p_i log p_i − Σ_i q_i log q_i.
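To make the definitions concrete, here is a minimal numerical sketch (ours, not from the paper; the function and variable names are our own). It computes the bit probabilities p_i of equation 4.1, the state entropy E(A), and the summed bit entropy e(A, b) for a given code, using base-2 logarithms (the choice of base does not affect the comparison):

import numpy as np

def entropies(A, codes):
    """A: state probabilities A_j (length M, summing to 1).
    codes: M x N array of bits; codes[j] is the codeword for state j.
    Returns (state entropy E(A), summed bit entropy e(A, b))."""
    A = np.asarray(A, dtype=float)
    codes = np.asarray(codes, dtype=float)
    p = codes.T @ A        # p_i = sum of A_j over states with bit i set (eq. 4.1)
    q = 1.0 - p
    def h(x):              # elementwise -x log2 x, with 0 log 0 = 0
        x = np.asarray(x, dtype=float)
        return np.where(x > 0, -x * np.log2(np.where(x > 0, x, 1.0)), 0.0)
    E = h(A).sum()              # state entropy E(A)
    e = (h(p) + h(q)).sum()     # summed bit entropy e(A, b)
    return E, e

# Four equiprobable states on 2 bits: a factorial code, so e(A, b) = E(A) = 2 bits.
codes = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(entropies([0.25, 0.25, 0.25, 0.25], codes))

# A non-factorial set of state probabilities: e(A, b) exceeds E(A)
# (Proposition 1 below).
print(entropies([0.50, 0.25, 0.15, 0.10], codes))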
Proposition 1. e(A, b) ≥ E(A), with e(A, b) = E(A) if and only if b is factorial. This proposition is proved in Watanabe (1985). Intuitively, it means that the capacity of the bits of the code is greater than that of the states they represent, except in the case of a factorial code, where the states are represented by a product of independent bits. The following short proof uses an adaptation, suggested by Dr. C.J. St C. Webber (private communication), of an argument in Jones (1979):
Proof. Using (4.1) and the definition of π_j following (4.2), we can write:

e(A, b) − E(A) = −Σ_i p_i log p_i − Σ_i q_i log q_i + Σ_j A_j log A_j
             = −Σ_i (Σ_{j : b_ij = 1} A_j) log p_i − Σ_i (Σ_{j : b_ij = 0} A_j) log q_i + Σ_j A_j log A_j
             = −Σ_j A_j log(π_j / A_j).

It is easy to check that log x ≤ x − 1, with equality only for x = 1. Thus:

−Σ_j A_j log(π_j / A_j) ≥ −Σ_j A_j (π_j / A_j − 1) = Σ_j A_j − Σ_j π_j = 1 − 1 = 0,
using the fact that Σ_j π_j = Π_i (p_i + q_i) = 1. Equality occurs if and only if π_j = A_j for all j, which means that b is factorial.

This result implies that, for any set {A_j} that has a factorial code, any code b which minimizes the entropy e(A, b) will also be factorial. Minimizing e(A, b) can therefore be regarded as a strategy for finding a code that is "as factorial as possible." Of course, not every set of 2^N state probabilities has a factorial code. In fact, the factorial codes have dimension N (one for each choice of the N p_i's) and the sets of A_j with Σ_j A_j = 1 have dimension 2^N − 1, so it is highly unlikely that an arbitrary set {A_j} will have a factorial code. However, one case of interest is the following:

Proposition 2. If A_j = K x^j, for j = 1 ... 2^N, for some x, with K chosen so that Σ_j A_j = 1, then A has a factorial code.

Proof. Define p_i by p_i / q_i = x^(2^(i−1)), for i = 1 to N. Then, for the binary sequence code, π_j = Q x^j, where Σ_j Q x^j = 1. It follows that Q = K and A_j = π_j, so A has a factorial code.

This result illustrates the difference between minimizing the sum of bit entropies e(A, b) and minimizing Σ_i p_i, the total probability of the bits. To achieve the latter one should assign the binary string consisting entirely of zeros to the largest A_j, the strings with just one "1" to the next largest A_j's, and so on. The factorial code for input states with probabilities in a geometric progression does not follow this sequence.

Until now we have assumed that there are N bits and 2^N states. More generally, one could try to code M states on N bits, where N ≥ log₂ M, with the convention that one assigns a probability of zero to the codewords that are not used. A simple example shows that a lower entropy can sometimes be obtained by coding onto more than the minimum possible number of bits. Let the state probabilities be a, a, Ka², 1 − 2a − Ka², where 0 < a < 1 and 1 − 2a − Ka² > 0. One can consider two codings:
Code  Probability
00    1 − 2a − Ka²
01    a
10    a
11    Ka²
and
Code  Probability
000   1 − 2a − Ka²
001   a
010   a
100   Ka²
It is straightforward to check that when a is small enough for terms in a³ and higher powers to be neglected, the second (expansive) coding gives a lower entropy when log₂ K > 3.

One can prove other results of this kind. For instance, if there are M states, M − 1 of which have the same probability a, then the expansive coding on M − 1 bits that assigns a codeword with a single 1 to each state of probability a and the all-zero codeword to the state with probability 1 − (M − 1)a gives minimum entropy, provided a is small enough. For suppose there were a coding on L bits with L < M − 1, and let the bit probabilities p_i be a·k_i, where the k_i are integers. Since there must be at least one bit which is set to 1 in two or more codewords, it follows that Σ_i k_i ≥ M. For small bit probabilities we can approximate the bit entropy by −Σ_i p_i log p_i. Then the difference Δe between the bit entropies for the coding on L bits and that on M − 1 bits is approximately

Δe = −Σ_i a k_i log(a k_i) + (M − 1) a log a ≥ −a log a − a Σ_i k_i log k_i.

Since each k_i < M, the maximum value of Σ_i k_i log k_i is M² log M, so Δe > 0 provided that −log a > M² log M, or a < (1/M)^(M²).
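As a numerical check of the four-state example, the following sketch (ours) compares the summed bit entropies of the compact and expansive codings; it reuses the entropies() helper defined in the earlier sketch, so run it after that one:

# Compare the compact (2-bit) and expansive (3-bit) codings of the
# four-state example; reuses entropies() from the earlier sketch.
a, K = 0.01, 20.0                # illustrative values with 1 - 2a - K*a**2 > 0
A = [1 - 2*a - K*a**2, a, a, K*a**2]

compact   = [[0, 0], [0, 1], [1, 0], [1, 1]]
expansive = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

E, e_compact = entropies(A, compact)
_, e_expansive = entropies(A, expansive)
print(f"state entropy E(A)     = {E:.4f} bits")
print(f"compact 2-bit coding   : e(A, b) = {e_compact:.4f}")
print(f"expansive 3-bit coding : e(A, b) = {e_expansive:.4f}")
# Since log2(K) ~ 4.32 > 3 and a is small, the expansive coding wins here.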
References

Attneave, F. 1954. Informational aspects of visual perception. Psychol. Rev. 61, 183-193.

Barlow, H.B. 1960. The coding of sensory messages. In Current Problems in Animal Behaviour, W.H. Thorpe and O.L. Zangwill, eds., pp. 331-360. Cambridge University Press, Cambridge.

Barlow, H.B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.

Barlow, H.B., and Földiák, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin, C. Miall, and G.J. Mitchison, eds. Addison-Wesley, New York.

Jones, D.S. 1979. Elementary Information Theory. Oxford University Press, London.

Kučera, H., and Francis, W.N. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.

Watanabe, S. 1981. Pattern recognition as a quest for minimum entropy. Pattern Recognit. 13, 381-387.

Watanabe, S. 1985. Pattern Recognition: Human and Mechanical. Wiley, New York.
Zettersten, A. 1978. A Word Frequency List Based on American English Press Reportage. Universitetsforlaget i København, Akademisk Forlag, Copenhagen.
Received 30 March 1989; accepted 22 May 1989
Learning in Artificial Neural Networks: A Statistical Perspective Halbert White Department of Economics, University of California, San Diego, La Jolla, CA 92093 USA
The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods. We review concepts and analytical results from the literatures of mathematical statistics, econometrics, systems identification, and optimization theory relevant to the analysis of learning in artificial neural networks. Because of the considerable variety of available learning procedures and necessary limitations of space, we cannot provide a comprehensive treatment. Our focus is primarily on learning procedures for feedforward networks. However, many of the concepts and issues arising in this framework are also quite broadly relevant to other network learning paradigms. In addition to providing useful insights, the material reviewed here suggests some potentially useful new training methods for artificial neural networks.

1 Introduction

Readers of this journal are by and large well aware of the widespread and often dramatic successes recently achieved through the application of connectionist modeling and learning techniques to an impressive variety of pattern recognition, classification, control, and forecasting problems. In many of these cases, success has been achieved by the now rather simple expedient of appropriately training a hidden layer feedforward network using some variant of the method of backpropagation (Werbos 1974; Parker 1982; Rumelhart et al. 1986). These successes have stimulated an entire industry devoted to devising ever new and better variants of backpropagation. Typically, papers representative of this industry contain some clever heuristics and some more or less limited Monte Carlo experiments demonstrating the advantages of the new and improved methods. These successes have also encouraged consideration of some important and difficult questions, such as, "Under what conditions will a given network generalize well?" "What is meant by generalization?" "How can one determine an appropriate level of complexity for a given
network?" "How can one tell when to stop training if the targets are affected by unmeasurable noise?"

The purpose of this article is to review concepts and analytical results from the literatures of mathematical statistics, econometrics, systems identification, and optimization theory relevant to the analysis of learning in artificial neural networks, with particular attention paid to material bearing on the answers to the questions just raised. Because of the considerable variety of available learning procedures and necessary limitations of space, it will not be possible to provide a comprehensive treatment. Our focus here will be primarily on learning procedures for feedforward networks. However, many of the concepts and issues arising in this framework are also quite broadly relevant to other network learning paradigms. We comment on some of these as we proceed.

Because we are concerned with artificial neural networks, we are not limited to learning methods that have biological or cognitive plausibility. Thus, we are free to consider "artificial" learning methods. To the extent that biological or cognitive processes or constraints suggest useful approaches to learning, we are free to adopt them. To the extent that such processes or constraints get in the way of using an artificial network to encode empirical knowledge, we are free to dispense with them. As we shall see, basing our approach to learning on the principles of analytical and computational expediency nevertheless leads us to an appreciation of the usefulness of methods such as backpropagation, simulated annealing, and the genetic algorithm.

The plan of this paper is as follows. In Section 2, we show why mathematical statistics has something to say about network learning. Section 3 discusses relevant concepts of probability fundamental to the analysis of network learning. In Section 4, we consider some alternative approaches to network learning and describe the statistical properties of these methods. Section 5 provides a review of some recently obtained results establishing that multilayer feedforward networks are capable of learning an arbitrary mapping; these results apply recent developments in the nonparametric statistics literature. Section 6 contains a brief summary and some concluding remarks. A mathematical appendix contains formal statements of some of the convergence theorems described in Sections 4 and 5.

2 The Relevance of Statistics
Suppose we are interested in learning about the relationship between two variables X and Y, numerical representations of some underlying phenomenon of interest. For example, X could be measurements on geological attributes of a site, and Y could be a variable assuming the value one if oil is present and zero otherwise. Alternatively, X could be measurements of various economic variables at a particular point in
time and Y could be the closing value for the Dow Jones index on the next day. As another example, X could be the treatment level for an experimental drug in a controlled experiment using laboratory animals, and Y could be the percentage of the group treated at that level that benefit from the drug.

Often, a theory exists or can be constructed that describes a hypothesized relation between X and Y, but the ultimate success or failure of any theory must be determined by an examination of how well its predictions accord with repeated measurements on the phenomenon of interest. In other cases, no satisfactory theory exists or can be constructed because of the complexity of the phenomenon and the difficulties of controlling for difficult-to-measure influences that are correlated with measurable influences. Nevertheless, repeated measurements of X and Y can be obtained. In either case, the possibility of making repeated measurements allows us to build up a form of empirical knowledge about the phenomenon of interest. A neural network is one form in which this empirical knowledge can be encoded.

The relevance of statistical analysis arises as soon as repeated measurements are made. Suppose we have n measurements or "observations" on X, denoted x_1, ..., x_n, and n corresponding observations on Y, denoted y_1, ..., y_n. Both X and Y may be vector quantities, and therefore so will be x_t and y_t, t = 1, ..., n. We suppose that X is of dimension r × 1 and Y is of dimension p × 1 for integers r and p. For notational convenience, we shall write Z = (X′, Y′)′ and z_t = (x_t′, y_t′)′, where a prime denotes vector (or matrix) transposition. Thus, we have n observations, denoted z^n = (z_1, ..., z_n). We refer to z^n as a "sample," or, in connectionist jargon, a "training set." It is convenient to suppose that the measurement process could be continued indefinitely, in which case we would obtain a sequence of observations {z_t} = (z_t, t = 1, 2, ...). By definition, a statistic is any function of the sequence {z_t}. A familiar example of a statistic is the sample average of observations on Y, n⁻¹ Σ_{t=1}^n y_t. A less familiar example of a statistic, but one that provides a complete representation of our empirical knowledge, is the sample itself, z^n (a matrix-valued statistic).

Because the entire sample is an unwieldy way of representing our empirical knowledge, we can attempt to summarize it in some convenient way, which is why things such as averages and correlations ("summary statistics") are useful. However, a potentially much more powerful way of compressing our empirical knowledge is to convert it into the weights of a suitable neural network. Because this conversion can be accomplished only as some function of the sequence {z_t}, the resulting network weights are a (vector-valued) statistic. Thus, the process of network learning on a given training set is in fact the process of computing a particular statistic. It follows that the analytical tools that describe the behavior of statistics generally can be used to describe the behavior of statistics specifically
obtained by some neural network learning procedure. These behavioral properties all have a fundamental bearing on the answers to the questions posed in Section 1. They also permit us to formulate and answer additional important questions, as we see below. These considerations are quite general. They apply regardless of whether we consider artificial neural networks and learning algorithms such as backpropagation or Kohonen self-organization, or whether we consider biological neural networks and whatever actual learning mechanisms occur there. Because the latter are largely unknown, our subsequent focus will be on learning in artificial neural systems. However, the concepts relevant for examining artificial systems are also relevant for the study of natural systems.
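A tiny illustration of this point (our sketch; the linear "network" and all names are merely illustrative): the weights produced by a learning procedure are a function of the training sample, so independent samples from the same joint law yield different realizations of the weight statistic.

import numpy as np

rng = np.random.default_rng(0)

def learn_weights(x, y):
    """A simple 'learning procedure': least-squares fit of y on (1, x).
    The returned weights are a vector-valued statistic of the sample."""
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def draw_sample(n):
    """One sample z^n = ((x_1, y_1), ..., (x_n, y_n)) from a fixed joint law."""
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

# Two independent samples from the same environment give two different
# realizations of the weight statistic.
for trial in range(2):
    x, y = draw_sample(100)
    print(f"sample {trial}: learned weights = {learn_weights(x, y)}")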
3 Probabilistic Background
3.1 Measurements, Probability Laws, and Conditional Probability. Consideration of the method by which our measurements are obtained is fundamental to the analysis of any resulting statistics. It is helpful to distinguish initially between cases in which we have complete control over the values x_t and those cases in which we do not, and between cases in which y_t is determined completely by the values x_t and those cases in which other influences affect the measurement y_t. Situations in which we have complete control over the values taken by x_t occur in the laboratory when it is possible to set experimental conditions with absolutely perfect precision, or occur in computer experiments. Situations in which control is not complete are common; these occur when nature has a hand in generating measurements x_t. Nature's role may be complete, as in the social sciences or meteorology, or it may be partial, as when our measurements are gathered by stratified sampling of some population or in an experiment in which, although x_t can be measured with absolute precision, its precise value is determined to some extent by chance. In either of these situations, it is possible and quite useful to define an "environmental" probability law μ (or simply an "environment") that provides a complete description of the manner in which the measurements are generated. When nature's role in determining x_t is complete, we can regard x_t as a realization of the random variable X_t having probability law μ. When the experimenter's control is complete, we can regard μ as describing the relative frequencies with which different values for x_t are set. When the researcher has partial experimental control (but still perfect measurement capability) we can regard μ as embodying the combined influences of both nature and the experimenter, again determining the relative frequencies with which different values for x_t are observed. Formally, μ assigns to every relevant subset of R^r a number between zero
and one representing the relative frequency with which x_t is observed to belong to that subset.

Now consider the determination of y_t. Cases in which y_t is determined completely by x_t occur in computer experiments or in physical systems in which every single influence in the determination of y_t can be measured with perfect precision and there is no inherent uncertainty attaching to y_t. For these cases, we can express an exact functional relationship between x_t and y_t as

y_t = g(x_t)

for some mapping g : R^r → R^p. The function g embodies everything there is to know about the relationship between y_t and x_t. It is the mapping g that is the natural object of interest in this case.

In any situation in which it is not possible to obtain absolutely precise measurements on every single influence affecting the measurement y_t, or in which y_t is subject to inherent uncertainty, it is no longer possible to express an exact functional relationship between x_t and y_t. Instead it is possible to express a probabilistic relationship between x_t and y_t. For this, it is appropriate to view x_t and y_t as a realization of the jointly distributed random variables X_t and Y_t. For notational convenience, we write Z_t = (X_t′, Y_t′)′. Hence Z_t is a random vector with r + p components. Just as with X_t, we can define a joint probability law ν that describes the relative frequency of occurrence of vectors Z_t. The law ν embodies the environment μ as well as the probabilistic relationship between X_t and Y_t. Because we shall assume that X_t and Y_t have the same joint probability law as X and Y, we drop the subscript t whenever convenient.

The probabilistic relationship between X and Y is completely summarized by the conditional probability law of Y given X, which we denote as γ(· | x), that is, γ(A | x) = P[Y ∈ A | X = x], for any set A in R^p. The notation P[Y ∈ A | X = x] is read as "the probability (P) that Y belongs to (∈) the set A given that (|) the random variable X takes on (=) the value x." In the case where Y is completely determined by X, for example, as Y = g(X), we have P[Y = g(x) | X = x] = γ({g(x)} | x) = 1 for all x. (We denote the set consisting of the single element g(x) as {g(x)}.) Otherwise, there is generally no function g such that γ({g(x)} | x) = 1 for all x.

Because a proper understanding of the notions just introduced is important for following the discussion ahead, we shall briefly summarize before proceeding. The foregoing discussion establishes that a single framework applies to all the different situations initially distinguished at the beginning of this section. This framework is that the joint behavior of X and Y is described by a joint probability law ν. (True randomness is allowed, but not required.) This joint behavior can be decomposed into a probability law μ (the "environment") that describes the behavior (relative frequency of occurrence) of X, and a conditional probability law γ that describes the behavior (relative frequency of occurrence) of Y
given X. In this "probabilistic" context, it is the knowledge of γ that is the natural object of interest, because this function embodies all there is to know about the relationship of interest, that between X and Y. The case in which there is an exact functional relationship g is a special case; in this case, knowledge of γ and knowledge of g are equivalent. The relevance of the probabilistic context is that it applies to a much wider class of phenomena, as the discussion at the beginning of this section should suggest. Accordingly, from now on we take γ to be the fundamental object of interest in our study of the relationship between X and Y.

Certain aspects of the conditional probability law γ play an important role in interpreting what it is that is learned by artificial neural networks using standard techniques. Primary among these is the conditional expectation of Y given X, denoted E(Y|X). This conditional expectation gives the value of Y that will be realized "on average," given a particular realization for X. Whenever E(Y|X) exists, it can be represented solely as a function of X, that is, g(X) = E(Y|X) for some mapping g : R^r → R^p. The expected value for Y given that we observe a realization x of X is then g(x). Of course, this value will be correct only "on average." The actual realization of Y will almost always differ from g(x). We can define a random "expectational error" ε = Y − E(Y|X). Because g(X) = E(Y|X) we can also write

Y = g(X) + ε.
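A quick simulation (ours; the particular g and noise law are arbitrary choices) illustrates this decomposition by estimating g(x) = E(Y|X = x) with local averages and checking that the expectational error has conditional mean near zero:

import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.uniform(-2, 2, size=n)
g = lambda x: np.sin(x) + 0.5 * x          # an arbitrary "true" g(x) = E(Y|X=x)
y = g(x) + rng.normal(scale=0.3, size=n)   # Y = g(X) + eps, with E(eps|X) = 0

# Estimate E(Y|X) by averaging y within narrow bins of x,
# then check that the expectational error averages to ~0 in each bin.
bins = np.linspace(-2, 2, 21)
idx = np.digitize(x, bins)
for b in range(1, 6):                      # first few bins for brevity
    sel = idx == b
    g_hat = y[sel].mean()                  # local estimate of g on this bin
    eps_mean = (y[sel] - g(x[sel])).mean() # conditional mean of eps
    print(f"bin {b}: g_hat = {g_hat:+.3f}, mean(eps | bin) = {eps_mean:+.4f}")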
By definition of ε and by the properties of conditional expectation, it follows that E(ε|X) = 0. That is, the average expectational error given any realization of X is zero. This contains the previous case of an exact relationship as the special case in which ε = 0 for all realizations of X. With a probabilistic relationship, ε is nonzero with positive probability.

An important special case occurs when Y can take on only the values 0 and 1, as is appropriate for any two-way classification problem. For this case, the conditional expectation function g also provides the conditional probability that Y = 1 given any realization of X, that is, γ({1} | x) = g(x). Because the conditional probability embodies all information available about the relationship between X and Y, a knowledge of g provides complete knowledge about the phenomenon of interest for the classification problem, just as it does in the case of exact determination of Y by X. This discussion highlights some of the reasons for the theoretical importance of the conditional expectation function. The reason for its important role in network learning will become apparent when we subsequently examine specific learning methods.

3.2 Objectives of Learning. Although the conditional probability law γ is a natural object of interest in the abstract, learning in neural networks,
whether natural or artificial, is not necessarily directed in an explicit manner at the discovery of γ. Instead, a common goal of network learning is that the network perform acceptably well (or even optimally) at some specific task in a specific environment. However, because of the fundamental role played by γ, such performance-based objectives for network learning typically are equivalent or closely related to methods explicitly directed at the discovery of particular specific aspects of γ. We examine this linkage in this section and the next section.

When the relationship between X and Y is of interest, it is often because X is to be used to predict or explain Y. In such cases, network performance can be measured using a performance function π : R^p × R^p → R. Given a target value y and network output o, the performance function gives a numerical (real-valued) measure of how well the network performs as π(y, o). It is convenient to normalize π so that as π(y, o) becomes bigger, the performance becomes worse. A frequently encountered performance measure for artificial neural networks is squared error, π(y, o) = |y − o|²/2.
Many other choices are possible: with p = 1 (y and o now scalars) we could also take π(y, o) = |y − o|, π(y, o) = |y − o|^q / q, or π(y, o) = −[y log o + (1 − y) log(1 − o)] (for 0 < o < 1). Note that in each of these cases π(y, o) ≥ 0 and π(y, o) is minimized if and only if y = o. Such behavior for π is often convenient, but is not an absolute requirement; we might wish to make π measure the profit in dollars made by action o of the network when the environment produces a realization y.

Network output can generally be expressed in terms of an output function mapping inputs and network weights into network output. Formally, f : R^r × W → R^p, where W is a weight space appropriate to the network architecture embodied in f. We take W to be a subset of R^s, where s is some integer. The precise form of f is not of particular importance. Given weights w and inputs x, output is given as o = f(x, w). Given targets y, network performance is then π(y, f(x, w)).

For any combination of y and x, and for any choice of weights w, we can now measure network performance (as π[y, f(x, w)]); however, it is generally required that a network perform well in a range of situations, that is, for a range of different values for y and x. One way of making this requirement precise is to require the network to perform well "on average." Average performance is given mathematically by the (unconditional) expectation of the random quantity π(Y, f(X, w)), expressed formally as

λ(w) = ∫ π(y, f(x, w)) ν(dy, dx) = E(π(Y, f(X, w))).

We call λ the "expected performance function." Note that it depends only on the weights w, and not on particular realizations y and x. These
have been "averaged out." This averaging is explicit in the integral representation defining λ. The integral is a Lebesgue integral taken over R^{p+r}. The measure ν permits integrating either continuous or discrete measures (or a mixture of the two) over R^{p+r}. The second expression reflects the fact that averaging π(Y, f(X, w)) over the joint distribution of Y and X, that is, ν, gives the mathematical expectation (E(·)) of the random performance π(Y, f(X, w)).

Because we are concerned with artificial networks, we have the potential capability of selecting weights w that deliver the best possible average performance. In the context of artificial networks, then, it is sensible to specify that the goal of network learning is to find a solution to the problem

min_{w ∈ W} λ(w).
We denote a solution to this problem w*, and refer to w* as "optimal weights." The solution w* is not necessarily unique; because our concern is only with performance λ, any one solution is as good as any other. The requirement that λ represent average performance is imposed above for concreteness, not out of necessity. We shall continue to use this interpretation, but it should be realized that λ may more generally represent any criterion (e.g., median performance) relevant in a given context.

To illustrate our earlier remark that choosing a performance measure π is intimately related to which aspect of the probabilistic relationship between X and Y is implicitly of concern, consider the case in which π(y, o) = (y − o)². This is the case relevant to learning by standard backpropagation. Then

λ(w) = E([Y − f(X, w)]²).
Taking g(X) = E(Y|X), we have

λ(w) = E([Y − g(X) + g(X) − f(X, w)]²)
     = E([Y − g(X)]²) + 2E([g(X) − f(X, w)][Y − g(X)]) + E([g(X) − f(X, w)]²)
     = E([Y − g(X)]²) + E([g(X) − f(X, w)]²).

The final equality holds because E([g(X) − f(X, w)][Y − g(X)]) = E([g(X) − f(X, w)]ε) = E[E([g(X) − f(X, w)]ε | X)] = E[(g(X) − f(X, w)) E(ε|X)] = 0 by the law of iterated expectations and the properties of ε noted earlier. It follows that w* not only minimizes λ(w), but also minimizes

E([g(X) − f(X, w)]²) = ∫ [g(x) − f(x, w)]² μ(dx).

In other words, w* is a weight vector having the property that f(·, w*) is a mean-squared-error minimizing approximation to the conditional
expectation function g. It is this aspect of the probabilistic relationship that becomes the focus of interest under the squared error performance measure.

Note that the environment measure μ plays a crucial role here in the determination of optimal approximating weights w*. These weights give small errors (on average) for values of X that are very likely to occur at the cost of larger errors (on average) for values of X that are unlikely to occur. It follows that weights w* will not give optimal performance in an operating environment other than μ. This crucial role holds generally, not just for the case of squared error. A similarly crucial role in the determination of w* is played by the performance function π. Weights w* optimal under the choice π need not be optimal for some other performance measure π̃ ≠ π. If performance in the operating environment is to be evaluated using π̃ ≠ π (e.g., maximum absolute error instead of mean squared error), then weights w* will not give optimal operating performance. Consequently, it is of great importance that π and μ be selected so as to reflect accurately the conditions under which operating performance is to be evaluated. Suboptimal network performance will generally result otherwise.

By taking weights w* to be the object of network learning, we automatically provide a solution to the question of what is meant by "generalization." Weights w* generalize optimally by construction in the sense that given a random drawing from the probability law ν governing X and Y, network output f(X, w*) has the best average performance, λ(w*). As long as ν governs the observed realization, a given random drawing need not have been "seen" by the network before. On the other hand, if the realization is drawn from an environment different than that from which optimal w* is obtained, then the network will not generalize as well as it could have in this precise sense, even if it has "seen" the particular realization during training.
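A small simulation (ours; the model and input distributions are illustrative stand-ins) makes the role of μ visible: under squared error, the same misspecified model acquires different optimal weights when the environment generating X changes.

import numpy as np

rng = np.random.default_rng(2)
g = lambda x: np.sin(2 * x)               # true conditional expectation (illustrative)

def best_linear_weights(x):
    """Least-squares weights for the (misspecified) model f(x, w) = w0 + w1*x,
    fit under the environment that generated x."""
    y = g(x) + rng.normal(scale=0.1, size=x.size)
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

n = 200_000
w_narrow = best_linear_weights(rng.uniform(-0.5, 0.5, n))  # environment mu_1
w_wide   = best_linear_weights(rng.uniform(-2.0, 2.0, n))  # environment mu_2
print("w* under narrow mu:", w_narrow)   # slope near 2, since sin(2x) ~ 2x near 0
print("w* under wide mu:  ", w_wide)     # a noticeably different slope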
3.3 Learning and Information. The relation between choice of performance measure π and approximation of the probabilistic relation between X and Y is established at a deep level by considerations related to the theory of information initiated by Shannon (1948) and Wiener (1948). When π is chosen to satisfy a certain mild condition, it can be shown that optimal weights w* have a fundamental information theoretic interpretation, related to the statistical mechanical interpretation discussed by Tishby et al. (1989). To obtain this interpretation, define

h(y|x, w) = k₀(x, w)⁻¹ exp[−π(y, f(x, w))],

where we assume that k₀(x, w) ≡ ∫ exp[−π(y, f(x, w))] γ(dy|x) is finite. It follows that for each x and w, h(·|x, w) is a conditional probability density function on R^p. This can be viewed as an approximation to the
true conditional probability density dγ(·|x) of the conditional probability law γ(·|x). Taking (natural) logarithms gives

log h(y|x, w) = −log k₀(x, w) − π(y, f(x, w)),

so that averaging −log h(y|x, w) over ν gives

λ(w) = ∫ [∫ log(dγ(y|x) / h(y|x, w)) γ(dy|x)] μ(dx) − ∫ log k₀(x, w) μ(dx) − ∫ log dγ(y|x) ν(dy, dx).

The term in brackets in the first integral, which we define as

I(dγ : h; x, w) = ∫ log[dγ(y|x) / h(y|x, w)] γ(dy|x),

is the Kullback-Leibler Information of h(·|x, w) relative to dγ(·|x) (Kullback and Leibler 1951). This is a fundamental information theoretic measure of how accurate the conditional density h(·|x, w) is as an approximation to the true conditional density dγ(·|x) (see, e.g., Renyi 1961). Heuristically, I(dγ : h; x, w) measures the information theoretic surprise we experience when for given x and w we believe the conditional density of Y given X = x is h(·|x, w) and we are then informed that the conditional density is in fact dγ(·|x). A fundamental theorem (the "information inequality") is that I(dγ : h; x, w) ≥ 0 for all x and w, and that I(dγ : h; x, w) = 0 if and only if dγ(y|x) = h(y|x, w) for almost all y [under γ(·|x)]. In other words, this information measure is never negative, and is zero when (and only when) h(·|x, w) is in fact the true conditional density. Substituting, we have
λ(w) = E(I(dγ : h; X, w)) + k₁(w),

where k₁(w) = −∫ log k₀(x, w) μ(dx) − ∫ log dγ(y|x) ν(dy, dx). Note that the second term in the definition of k₁ is entirely determined by the data distribution, and is independent of w. If π is chosen in such a way that k₀(x, w) does not depend on w, a common and important situation (this is the mild condition earlier mentioned), then the first term also is determined entirely by the data distribution, and k₁(w) is simply a constant, say k₁(w) = k̄₁ for all w. In this case, the average performance function can be interpreted as differing by a constant from the expected Kullback-Leibler Information of the conditional density h(·|x, w) = k₀(x, w)⁻¹ exp[−π(·, f(x, w))] relative to the true conditional density dγ(·|x). It follows that optimal weights w* have the fundamental information theoretic interpretation that they minimize expected Kullback-Leibler Information given the chosen architecture (embodied by f) and performance measure π. Indeed, if f and π are chosen in such a way that
h(y|x, w°) = dγ(y|x) almost everywhere for a unique weight vector w° in W, it follows from the information inequality that w* = w°, so that w* provides complete information on the probabilistic relation between X and Y. Further general discussion of the meaning of Kullback-Leibler Information in a related context is given in White (1989a, chaps. 2-5). Viewing learning as related to Kullback-Leibler Information in this way implies that learning is a quasi-maximum likelihood statistical estimation procedure. White (1989a) contains an extensive discussion of this subject. Relevant discussion of these and related issues in a connectionist context is provided in a paper by Golden (1988).
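For squared-error performance the mild condition holds automatically: k₀(x, w) = ∫ exp[−(y − o)²/2] dy = √(2π) for every output o = f(x, w), so h(·|x, w) is a unit-variance Gaussian centered at the network output. A quick numerical check (our sketch):

import numpy as np

# k0(x, w) for pi(y, o) = (y - o)**2 / 2 is the Gaussian normalizer sqrt(2*pi),
# independent of the network output o (and hence of the weights w).
y = np.linspace(-30, 30, 200_001)
dy = y[1] - y[0]
for o in (-3.0, 0.0, 7.5):                        # different network outputs
    k0 = np.sum(np.exp(-(y - o) ** 2 / 2)) * dy   # numerical integral over y
    print(f"o = {o:+5.1f}:  k0 = {k0:.6f}   sqrt(2*pi) = {np.sqrt(2*np.pi):.6f}")
# Thus -log h(y|x, w) = pi(y, f(x, w)) + constant, so minimizing expected
# squared error also minimizes expected Kullback-Leibler Information.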
3.4 Network Learning and Parametric Statistics. In many network paradigms, there may be no particular target, or the target and input may be the same. Nevertheless, it is often still possible to define a learning objective function as

λ(w) = ∫ ℓ(z, w) ν(dz) = E(ℓ(Z, w)),

where ℓ : R^{p+r} × W → R is a given loss function measuring network performance given weights w (the state of the network) and observables z (the state of the world). The interpretation of learning now is that the goal of the network is to adjust its state (w) in such a way as to minimize the expected loss suffered over the different possible states of the world (z). In the special case in which targets and inputs are distinguished and π is given as above, we have ℓ(z, w) = π(y, f(x, w)). We shall make use of the general formulation in terms of loss functions in what follows, although our examples will typically assume distinct targets and inputs.

The use of loss functions is common throughout the field of parametric statistics. Indeed, nothing in any of the foregoing subsections takes us outside of concepts and analysis common to parametric statistics, nor does any of our discussion depend crucially on the network architecture under consideration or even on the use of a neural network model at all. The role of neural network modeling is to provide a specific form of the function f. The advantages of neural network modeling have to do with the virtues associated with such specific forms.

4 Statistical Properties of Learning Methods
We saw in the previous section that the goal of network learning can be viewed as finding a solution w* to the optimization problem

min_{w ∈ W} λ(w).   (4.1)

If the joint probability law ν were known, w* could be solved for directly. It is our ignorance of ν that makes learning necessary. It is the nature of
the response to this ignorance that leads to specific learning algorithms. The details of these algorithms interact with the probability law (call it P) governing the entire sequence {Z_t} to determine the properties of the learning algorithms. The role of statistical analysis is to describe these properties. In this section, we describe several possible responses and the statistical properties of the resulting learning algorithms.

4.1 Learning by Optimizing Performance over the Sample. Despite our fundamental ignorance of ν, the ability to make repeated measurements on Z = (X′, Y′)′ permits us to obtain empirical knowledge about ν. Given a sample z^n [recall z^n = (z_1, ..., z_n)], a direct sample analog of ν, denoted ν_n, can be calculated as

ν_n(C) = (number of times z_t belongs to C) / n,

where C is any subset of R^{p+r}. When n is large, the law of large numbers ensures that this will be a good approximation to ν(C) for any set C. Using this approximation to ν, it is possible to compute an approximation to λ as

λ_n(w) = n⁻¹ Σ_{t=1}^n ℓ(z_t, w),   w ∈ W.
This is easily recognized as average performance of the network over the sample (training set). Because this number is readily computed, we can attempt to solve the problem

min_{w ∈ W} λ_n(w).

We denote a solution to this problem as w_n. With this approach, the study of network learning now reduces to the study of the relationship between w_n and w*. In general, we can say nothing about the precise relation between w_n and w*. The problem is that w_n is a realization of a random variable. The best that we can do is to make probability statements about the random variable giving rise to w_n. To obtain this random variable, we make use of the random counterpart of ν_n,
ν̂_n(C) = (number of times Z_t belongs to C) / n,   C ⊂ R^{p+r},

to define

λ̂_n(w) = ∫ ℓ(z, w) ν̂_n(dz).
We then define ŵ_n as a random variable that solves the problem

min_{w ∈ W} λ̂_n(w) = n⁻¹ Σ_{t=1}^n ℓ(Z_t, w).   (4.2)

Thus, the solution w_n defined earlier is simply a realization of the random variable ŵ_n. Consequently, we focus attention on the relationship between the random variable ŵ_n and optimal weights w*. In the special case where π(y, o) = (y − o)²/2, we get ℓ(z, w) = (y − f(x, w))²/2 and

λ̂_n(w) = n⁻¹ Σ_{t=1}^n (Y_t − f(X_t, w))²/2.   (4.3)
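A minimal sketch of computing a realization of ŵ_n by minimizing (4.3) for a toy one-hidden-unit model (ours; plain finite-difference gradient descent stands in for whatever optimizer one prefers):

import numpy as np

rng = np.random.default_rng(3)

def f(x, w):
    """Toy network: one tanh hidden unit, f(x, w) = w2 * tanh(w0 + w1 * x)."""
    return w[2] * np.tanh(w[0] + w[1] * x)

# Training set z^n generated from the model plus noise.
n = 500
x = rng.uniform(-3, 3, n)
w_true = np.array([0.3, 1.2, 2.0])
y = f(x, w_true) + rng.normal(scale=0.2, size=n)

def loss(w):                                   # sample objective, eq. (4.3)
    return np.mean((y - f(x, w)) ** 2) / 2

# Crude finite-difference gradient descent on the sample objective.
w, lr, eps = np.array([0.1, 0.5, 0.5]), 0.2, 1e-6
for _ in range(5000):
    grad = np.array([(loss(w + eps * np.eye(3)[i]) - loss(w)) / eps
                     for i in range(3)])
    w = w - lr * grad
print("w_n =", w)        # realization of the nonlinear least-squares estimator
print("w*  =", w_true)   # weights that generated the data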
This is precisely the problem of nonlinear least-squares regression, so that ŵ_n is a nonlinear least-squares estimator. Nonlinear regression has been extensively analyzed in the econometrics, statistics, and systems identification literatures. We give some relevant references below.

4.1.1 Large Sample Behavior of ŵ_n. As with any random variable, the behavior of ŵ_n is completely described by its probability law. In general, this probability law is prohibitively difficult to obtain for a training set of given size n. However, approximations to this probability law for large n can be obtained by making use of standard statistical tools, including the law of large numbers and the central limit theorem. These approximations reveal that the probability law of ŵ_n collapses onto a particular well-defined set, becoming more and more concentrated around this set as n increases. The properties of the set on which this concentration occurs are therefore of fundamental importance. The increasing concentration property is referred to as the property of "consistency." We describe this in more detail below.

The collapse of the probability law of ŵ_n can be shown to occur at a certain specific rate. It is possible to offset this collapse by a simple standardization. The approximate probability law of the standardized random variable is thus stabilized; this probability law is known as the "limiting distribution" or "asymptotic distribution" of ŵ_n. The precise form of the limiting distribution depends on the nature of the set onto which ŵ_n collapses. When ŵ_n collapses onto an isolated point, the central limit theorem can be applied to show that the appropriate standardization of ŵ_n has approximately a multivariate normal distribution when n is large. The approximation is better the larger is n. We describe this also in somewhat more detail below.

Formal statistical inference regarding w* is possible whenever the limiting distribution of ŵ_n is known. Because many questions of interest regarding the precise form of the optimal network architecture can be formulated as formal hypotheses regarding w*, these questions can be resolved to the extent permitted by the available data by calculating some
standard and relatively straightforward statistics. To date, the significance of this fact has not been widely appreciated or exploited in the neural network literature. Below, we discuss some of the possible applications of these methods. In the statistics literature, an examination of the consistency and limiting distribution properties of any proposed new statistic is standard. Such analyses reveal general useful properties and difficulties that cannot be inferred from or substantiated with Monte Carlo simulations. As the field of neural computation matures, rigorous analysis of the consistency and limiting distribution properties of any proposed new learning technique should become as standard as the Monte Carlo studies now prevalent. The discussion to follow will indicate some of the typical issues involved in such analyses.

4.1.2 Consistency. Three concepts are directly relevant to the issue of consistency. The first is the standard concept of deterministic convergence. Let {a_n} ≡ (a_1, a_2, ...) be a sequence of (nonrandom) real variables. We say that a_n converges to a, written a_n → a (as n → ∞), if there exists a real number a such that for any (small) ε > 0, there exists an integer N_ε sufficiently large that |a_n − a| < ε for all n ≥ N_ε. We call a the "limit" of {a_n}.

For the second concept, let {â_n} be a sequence of real-valued random variables. We say that â_n converges to a almost surely −P, written â_n → a (as n → ∞) a.s.-P, if P[â_n → a] = 1 for some real number a. That is, the set of realizations of {â_n} for which (deterministic) convergence to a occurs has probability 1. Heuristically, it is possible for a realization of {â_n} to fail to converge, but it is more likely that all the ink on this page will quantum mechanically tunnel to the other side of the page sometime in the next 5 fsec. This form of stochastic convergence is known as "strong consistency" or "convergence with probability one" (w.p.1). It is also written as â_n →_{a.s.} a, and we say that â_n is "strongly consistent" for a.

The third concept is that of convergence "in probability." Again, let {â_n} be a sequence of random variables. We say that â_n converges to a in probability (−P), written â_n → a prob-P, if there exists a real number a such that for any (small) ε > 0, P[|â_n − a| < ε] → 1 as n → ∞. Heuristically, the probability that â_n will be found within ε of a tends to one as n becomes arbitrarily large. This form of stochastic convergence is known as "weak consistency." It is implied by strong consistency. Convergence in probability is also written â_n →_p a, and we say that â_n is "weakly consistent" for a.

These notions of consistency are used extensively in what follows. We shall frequently drop the reference to "strong" or "weak" and refer simply to "consistency" when either or both of these concepts is relevant.
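A quick Monte Carlo illustration of weak consistency (our sketch, using the sample mean of i.i.d. coin flips as â_n): the probability of being within ε of the limit climbs toward one as n grows.

import numpy as np

rng = np.random.default_rng(4)
a, eps, reps = 0.5, 0.02, 1000

# Estimate P[|a_hat_n - a| < eps] by Monte Carlo for increasing n.
for n in (100, 1000, 10_000):
    flips = rng.random((reps, n)) < a      # reps independent samples of size n
    a_hat = flips.mean(axis=1)             # a_hat_n = sample mean, a statistic
    p_close = np.mean(np.abs(a_hat - a) < eps)
    print(f"n = {n:6d}:  P[|a_hat - {a}| < {eps}] ~ {p_close:.3f}")
# The printed probabilities climb toward 1, as weak consistency requires.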
Applying these concepts to ŵ_n, it would be satisfying to establish that ŵ_n → w* a.s.-P, that is, that ŵ_n is strongly consistent for optimal weights w*. This is in fact true under appropriate conditions on ℓ, W, and {Z_t} discussed in the econometrics literature by White (1981, 1982, 1984a, 1989a), Domowitz and White (1982), and Gallant and White (1988). It is useful to give a brief description of the underlying heuristics.

Suppose for the moment that there is a unique solution w* to equation 4.1. The basic idea is that because λ̂_n(w) is an average of random variables (for each fixed w), the law of large numbers applies [under conditions placed on the probability law P governing {Z_t}, and on ℓ — see White (1984b, chap. 2)] to ensure that λ̂_n(w) → λ(w) a.s.-P. Because ŵ_n minimizes λ̂_n and w* minimizes λ, and because λ̂_n and λ are close a.s.-P for n large, then ŵ_n should be close to w*. This heuristic argument is not complete, but it can be made complete by ensuring that the convergence of λ̂_n to λ is uniform over W (i.e., sup_{w∈W} |λ̂_n(w) − λ(w)| → 0 a.s.-P). For this, it helps to assume that W is a compact set. These issues have been thoroughly studied in the econometrics literature under general conditions. It follows under general conditions on P, ℓ, and W that any learning procedure capable of successfully solving the problem 4.2 delivers learned weights ŵ_n that are arbitrarily close to the (for the moment unique) optimal weights w* for all n sufficiently large, with probability one. The set on which ŵ_n becomes concentrated thus has a single point, w*. For some readers, the assurance that this convergence occurs "under general conditions" may be unsatisfyingly vague. For completeness, a formal statement of conditions sufficient to guarantee convergence is provided in Theorem 1 of the mathematical appendix. We omit this from the text to spare those less interested in technical details.

The assumption that w* be unique is introduced above for simplicity. In the context of network learning, this assumption can be easily violated. For example, when f is the output function of a single hidden layer feedforward network (see equation 5.1) and π(y, o) = (y − o)², the interchangeability of the hidden units guarantees the existence of multiple solutions w* whenever W is sufficiently unrestricted. However, if W is restricted to be a single cone of the type described by Hecht-Nielsen (1989) (this eliminates interchangeability), then multiple minima are no longer guaranteed. In some cases, restriction of W to such a cone may be enough to ensure a unique minimum, and the argument above again applies.

Even when W is restricted to be a Hecht-Nielsen cone, it can still happen that there is no unique solution. In the case of standard single hidden layer feedforward networks, there are two main reasons for this possibility. The first we refer to as the case of "redundant inputs"; the second we refer to as the case of "irrelevant hidden units." The case of redundant inputs occurs when one or more of the network inputs is an exact linear combination of the other inputs, including the "bias
input," assumed to always take the value unity. The case of irrelevant hidden units occurs when identical optimal network performance can be achieved with fewer hidden units. Both redundant inputs and irrelevant hidden units generate entire manifolds on which λ(w) is flat and minimal. Because finding and eliminating redundant inputs and irrelevant hidden units both involve some computational effort, networks are usually trained without regard to their possible presence. Restricting training to Hecht-Nielsen cones is also generally neglected for similar reasons of computational cost.

Fortunately, the possible presence of multiple minima has no essential effect asymptotically for solutions to equation 4.2. Instead of ŵ_n → w* a.s.-P, we have ŵ_n → W* a.s.-P, where W* is the set of all minimizers of λ,

W* = {w* ∈ W : λ(w*) ≤ λ(w) for all w ∈ W},

and the convergence ŵ_n → W* a.s.-P means that inf_{w*∈W*} ||ŵ_n − w*|| → 0 a.s.-P, where ||·|| is Euclidean (or equivalent) norm on W ⊂ R^s. Note that ŵ_n need not converge to any particular value; it simply gets closer and closer to being optimal. Thus, we have a definitive answer to the question of what is learned when equation 4.2 is solved: learned network weights collapse onto the set of weights W* that deliver optimal performance.

4.1.3 Limiting Distribution. The appropriate formal concept for studying the limiting distribution of ŵ_n is that of convergence in distribution. Let {â_n} be a sequence of random variables having distribution functions {F_n} (recall F_n(a) = P[â_n ≤ a]). We say that â_n converges to F in distribution, written â_n →_d F, if |F_n(a) − F(a)| → 0 for every continuity point a of F. This is a very weak convergence condition indeed. However, it permits approximately accurate probability statements to be made about â_n using the limiting distribution F in place of the exact distribution F_n. The ability to make such probability statements is quite useful.

The limiting distribution of ŵ_n depends on the nature of W*. In general, W* may consist of isolated points and/or isolated "flats." If convergence to a flat occurs, then the estimated weights ŵ_n have a limiting distribution that can be analyzed using the theory of Phillips (1989) for "partially identified" models. These distributions belong to the "limiting mixed Gaussian" (LMG) family introduced by Phillips. When W* is locally unique, the model is said to be "locally identified," and estimated weights ŵ_n converging to w* have a limiting multivariate normal distribution. This result is also a consequence of Phillips' results, but can be derived as well from less powerful arguments using the central limit theorem.
Because flats (e.g., those caused by redundant inputs or irrelevant hidden units in hidden layer models) can usually be removed by the investigator, control can be exercised over the limiting distribution of ŵ_n. The properties of the multivariate normal distribution make it much more desirable to work with than the LMG family, so our subsequent remarks will focus on the limiting normal distribution; we suppose that the investigator has expended the effort necessary to arrive at a locally identified model. (Note that W need not be restricted to be a Hecht-Nielsen cone, because we require only local but not global uniqueness of a solution to equation 4.1.)

Although the discussion so far has been in terms of global solutions to equations 4.1 and 4.2, the local nature of the analysis of limiting distributions and the freedom allowed by the ability to choose W in any convenient way (and in particular as a small neighborhood of the set to which the learning method ultimately converges) means that the foregoing remarks apply not just to learning methods that globally solve equation 4.2, but also to methods that find a local solution to 4.2. Accordingly, we can think of w* as being a unique local solution to 4.2 in the discussion to follow.

In Theorem 2 of the appendix, we give conditions ensuring that the limiting distribution of √n(ŵ_n − w*) is the multivariate normal distribution with mean vector zero and an s × s covariance matrix (say C*) that can be given a precise analytic expression. We refer to C* as the "asymptotic covariance matrix" of ŵ_n. The smaller this covariance matrix is (as measured by, say, tr C* or det C*), the more tightly the distribution of ŵ_n is concentrated around w*, with less consequent uncertainty about the value of w*. It is therefore desirable that C* be small, but there are fundamental limits on how small C* can be.

When two learning methods yield weights ŵ_{1n} and ŵ_{2n}, respectively, that are both consistent for w*, one with asymptotic covariance matrix C*_1 and the other with asymptotic covariance matrix C*_2, the method yielding the smaller asymptotic covariance matrix is preferable, because that method makes relatively more "efficient" use of the same sample information. In certain cases, it can be shown that C*_1 − C*_2 is a positive semidefinite matrix, in which case it is said that the second method is "asymptotically efficient" relative to the first method. Thus, study of the limiting distribution of alternative learning methods can yield insight into the relative desirability of different learning methods. As a specific example, White (1989b) proves that learning methods that solve equation 4.2 (locally) for squared error performance are asymptotically efficient relative to the method of backpropagation, regardless of the local minimizer to which convergence occurs (see Theorems 5 and 6 of the appendix). In this sense the method of backpropagation is statistically inefficient. Kuan and White (1989) discuss a modification of backpropagation that has asymptotic efficiency equivalent to the (local) solution of equation 4.2.
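The asymptotic normality underlying these comparisons is easy to visualize by simulation (our sketch, for the simplest case in which ŵ_n is a sample mean): the standardized quantity √n(ŵ_n − w*) settles into a fixed normal shape as n grows.

import numpy as np

rng = np.random.default_rng(5)

# Simplest instance of eq. 4.2: l(z, w) = (z - w)**2 / 2, so w* = E(Z)
# and the minimizer of the sample objective is the sample mean.
w_star, reps = 1.0, 2000
for n in (50, 500, 5000):
    z = rng.exponential(scale=w_star, size=(reps, n))  # skewed law, mean w*
    w_hat = z.mean(axis=1)                             # one w_hat_n per sample
    stat = np.sqrt(n) * (w_hat - w_star)               # standardized estimator
    print(f"n = {n:5d}: mean {stat.mean():+.3f}, std {stat.std():.3f}, "
          f"skew {np.mean(((stat - stat.mean()) / stat.std())**3):+.3f}")
# The std stabilizes near 1 (here C* = var(Z) = 1) and the skewness shrinks
# toward 0, approaching the normal limit in this one-dimensional case.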
4.1.4 Statistical Inference and Network Architecture. Of significant consequence is the fact that the limiting distribution of ŵ_n can be used to test hypotheses about w*. Two hypotheses of particular interest for hidden layer feedforward networks are the "irrelevant input hypothesis" and the "irrelevant hidden unit hypothesis." The irrelevant input hypothesis states that a given input or group of inputs is of no value (as measured by λ) in predicting or explaining the target. (Note that redundant inputs are irrelevant, but irrelevant inputs are not necessarily redundant. We continue to assume the absence of redundant inputs.) The alternative to the irrelevant input hypothesis is that the given input or some member of the given group of inputs is indeed of value in predicting or explaining the target. Similarly, the irrelevant hidden unit hypothesis states that a given hidden unit or group of hidden units is of no value in predicting or explaining the target. The alternative hypothesis is that the given hidden unit or some member of the given group of hidden units is indeed of value in predicting or explaining the target.

Because these hypotheses can generally be expressed as the restriction that particular elements of w* are zero (those corresponding to the specified units) and because the learned weights ŵ_n are close to w* for large n, the learned weights can be used to provide empirical evidence in favor of or in refutation of the hypothesis under consideration. Under the irrelevant input hypothesis, the corresponding learned weights ŵ_n should be close to zero. The question of how far from zero is too far from zero to be consistent with the irrelevant input hypothesis can be answered approximately for large n by making use of the known limiting distribution of ŵ_n.

Specifically, the irrelevant input hypothesis can be expressed as H₀ : Sw* = 0, where S is a q × s selection matrix picking out the q elements of w* hypothesized to be zero under the irrelevant input hypothesis. The fact that √n(ŵ_n − w*) has a limiting multivariate normal distribution with mean zero and covariance matrix C* implies that √n S(ŵ_n − w*) has a limiting multivariate normal distribution with mean zero and covariance matrix SC*S′. Because the irrelevant input hypothesis implies Sw* = 0, it follows that √n Sŵ_n has a limiting multivariate normal distribution with mean zero and covariance matrix SC*S′ under the irrelevant input hypothesis H₀. From this it follows that under H₀ the random scalar n ŵ_n′S′(SC*S′)⁻¹Sŵ_n has a limiting χ² distribution with q degrees of freedom (χ²_q).

A realization of this random variable cannot be computed, because although an analytical expression for C* is available, a knowledge of the probability law P is required for its numerical evaluation. Fortunately, an estimator of C* can be constructed that is weakly consistent, that is, there exists Ĉ_n such that Ĉ_n → C* prob-P. (An explicit formula is given in Theorem 2 of the appendix.) Replacing C* with its weakly consistent estimator Ĉ_n has no effect on the limiting distribution of the statistic just given, although larger n (and in some cases much larger n) will be needed
to obtain an approximation using χ²_q as good as that obtained if C* itself were available. It follows that

n ŵ_n′S′(SĈ_nS′)⁻¹Sŵ_n →_d χ²_q

under the irrelevant input hypothesis H₀. The probability distribution of the irrelevant input test statistic n ŵ_n′S′(SĈ_nS′)⁻¹Sŵ_n is therefore approximated for large n by the χ²_q distribution when the irrelevant input hypothesis is true. Under the alternative hypothesis Hₐ : Sw* ≠ 0, the irrelevant input test statistic tends to infinity with probability one. It follows that the procedure of rejecting H₀ whenever n ŵ_n′S′(SĈ_nS′)⁻¹Sŵ_n exceeds the 1 − α percentile of the χ²_q distribution (for some typically small value of α, say α = 0.05 or α = 0.01) leads to incorrect rejection of the irrelevant input hypothesis with (small) probability approximately equal to or less than α. As n becomes large, the probability of correctly rejecting a false irrelevant input hypothesis with this procedure tends to one. This procedure is an application of standard techniques of statistical inference. It allows us to determine whether specific input(s) are irrelevant, to the extent permitted by the sample evidence, by controlling the probability of incorrectly rejecting H₀. This approach has obvious applications in investigating the appropriateness of given network architectures.

The irrelevant hidden unit hypothesis is of exactly the same form, that is, H₀ : Sw* = 0, except that now the q × s selection matrix S picks out weights associated with q hidden units hypothesized to be irrelevant. As before, the alternative is Hₐ : Sw* ≠ 0. Similar reasoning can be used to develop an irrelevant hidden unit test statistic. However, there are some rather interesting difficulties in the development of the limiting distribution of ŵ_n under H₀. Problems arise because when H₀ is true, the optimal weights from input units to the irrelevant hidden unit(s) are not locally unique; they have no effect on network output. This problem is known in the statistics literature as that in which "nuisance parameters are identified only under the alternative hypothesis." The LMG family now plays an essential and unavoidable role in the analysis. Hypothesis testing in the present context has been studied by Davies (1977, 1987); the analysis is complicated. As one should expect from the LMG family, the resulting test statistic distributions are no longer χ². However, certain techniques can be adopted to avoid these difficulties, yielding a χ²_q statistic for testing the irrelevant hidden unit hypothesis. One such test is described by White (1989c), and its properties are investigated by Lee et al. (1989).

Statistical inference plays a fundamental role in modern scientific research. The techniques just described permit application of the methods of statistical inference to questions regarding the precise form of optimal artificial neural network architectures.
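A sketch of the resulting Wald-type test (ours; the linear model, S, and the covariance estimator are illustrative stand-ins for a trained network and its estimated asymptotic covariance matrix Ĉ_n):

import numpy as np

rng = np.random.default_rng(6)
n, s, q = 2000, 3, 1                      # sample size, weights, restrictions

# Illustrative data: the third input is truly irrelevant (its weight is 0).
X = rng.normal(size=(n, s))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)

# Least-squares "learned weights" and a consistent covariance estimate C_hat
# (classical formula for this linear, homoskedastic illustration).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w_hat
C_hat = resid.var() * np.linalg.inv(X.T @ X / n)

S = np.array([[0.0, 0.0, 1.0]])           # selects the weight under test
stat = n * (S @ w_hat).T @ np.linalg.inv(S @ C_hat @ S.T) @ (S @ w_hat)
print("Wald statistic:", float(stat))     # compare with chi2(q=1) 95% point 3.84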
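To make the mechanics of this test concrete, the following sketch (not from the original text; all function and variable names are hypothetical) computes the irrelevant input test statistic and compares it to the appropriate chi-squared critical value, given a learned weight vector and a weakly consistent covariance estimate.

```python
import numpy as np
from scipy.stats import chi2

def irrelevant_input_test(w_hat, C_hat, idx, n, alpha=0.05):
    """Wald test of H0: the weights indexed by idx are zero.

    w_hat : (s,) learned weight vector
    C_hat : (s, s) weakly consistent estimate of the limiting covariance C*
    idx   : indices of the q weights hypothesized to be zero
    n     : sample size
    """
    q = len(idx)
    S = np.zeros((q, len(w_hat)))          # q x s selection matrix
    S[np.arange(q), idx] = 1.0
    Sw = S @ w_hat
    V = S @ C_hat @ S.T                    # S C S', covariance of sqrt(n) S w_hat
    stat = n * Sw @ np.linalg.solve(V, Sw) # n w'S'(S C S')^-1 S w
    crit = chi2.ppf(1 - alpha, df=q)       # 1 - alpha percentile of chi-squared(q)
    return stat, crit, stat > crit         # reject H0 if stat exceeds crit
```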
4.1.5 The Role of Sample Size. In the foregoing sections we have described approximations that are valid for n "sufficiently large." A natural
and important question is "How large an n is large enough for the approximations to be good?" Unfortunately, the answer is highly context dependent; there is no simple or general answer. We can say that generally, the larger s (the dimension of W) is, the larger n must be to obtain a given degree of approximation; also, the greater the dependence among the elements of $\{Z_t\}$ (i.e., across t) is, the larger n must be to obtain a given degree of approximation. However, a precise answer can be obtained analytically only by finding the exact distribution of $\hat{w}_n$ for given n; as previously mentioned, this is typically an intractable problem. Analysis can provide some relevant information. In particular, large deviation inequalities and Berry-Esseen inequalities (see, e.g., Serfling 1980) can provide information on the rate at which convergence to the limiting set or limiting distribution occurs under specific conditions. Such results reveal that convergence is often reassuringly exponential in n, but the results are again asymptotic and do not provide exact answers for given n. Monte Carlo experiments can provide certain limited information on the behavior of $\hat{w}_n$ for fixed n. The limitation of Monte Carlo experiments is that any results obtained pertain only to the environment in which the experiments are carried out. In particular, the data-generating mechanism must be specified by the researcher, and it is often difficult to know whether any given data-generating mechanism is to any degree representative of realistic empirical settings. Motivated by the desire to obtain distributional results for $\hat{w}_n$ that rely neither on large-n approximations nor on artificial data-generating assumptions, statisticians have developed so-called "resampling techniques" that permit rather accurate estimation of finite sample distributions for $\hat{w}_n$ when $\{Z_t\}$ is a sequence of independent identically distributed (i.i.d.) random variables (see, e.g., Efron 1982). The basic idea is to draw a large number N of random samples of size n with replacement from $Z^n = (Z_1, \ldots, Z_n)$, calculate $\hat{w}_n$ for each of the N samples, say $\hat{w}_n^i$, $i = 1, \ldots, N$, and use the resulting empirical distribution of the estimates $\hat{w}_n^i$ as an estimate of the sampling distribution of $\hat{w}_n$. The computational cost of such a procedure is impressive even with today's computers, and reflects the fundamental difficulty of the problem. At present, there is no easy answer to the question of how large n must be for the approximations described earlier to be "good." Practical experience in econometrics suggests that it is probably safe to recommend taking n to be on the order of magnitude of $s \times 100$, but this is no more than an educated guess. Rules suggesting that $s \times 10$ may be adequate (e.g., Baum and Haussler 1989) are relevant to classification problems in which the relation between X and Y is nonprobabilistic. Results for some probabilistic cases are discussed in Haussler (1989).
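The resampling idea just described is simple to express in code. The sketch below is illustrative only and assumes a placeholder routine `train` that solves equation 4.2 on any given sample; it is not from the original text.

```python
import numpy as np

def bootstrap_distribution(Z, train, N=200, rng=None):
    """Estimate the sampling distribution of w_hat_n by resampling.

    Z     : (n, v) array of i.i.d. observations
    train : placeholder for any procedure solving eq. 4.2 on a sample
    N     : number of bootstrap replications
    """
    rng = np.random.default_rng(rng)
    n = len(Z)
    estimates = []
    for _ in range(N):
        sample = Z[rng.integers(0, n, size=n)]  # draw n rows with replacement
        estimates.append(train(sample))          # w_hat_n^i for this resample
    return np.asarray(estimates)  # empirical distribution of the estimates
```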
4.1.6 Methods for Global Optimization over the Sample. Now consider how equation 4.2 might be solved in practical situations. In general,
we seek a global solution to what is typically a highly nonlinear optimization problem. Such problems are the general concern of an entire subarea of mathematics, optimization theory. Rinnooy Kan et al. (1985) (RBT) give a survey of results from this literature that is directly relevant to finding the solution to equation 4.2, as well as describing a new procedure, "multilevel single linkage," that appears to provide performance superior to a variety of now standard methods. Before describing this technique, however, we first consider two methods for solving 4.2 that are relatively familiar to the neural computation community: the method of simulated annealing and the genetic algorithm. Both of these methods for function optimization have been applied in the present context or in related contexts. Because of their relative familiarity, we shall not go into great detail regarding the specifics of implementation, but indicate general features of these methods. The method of simulated annealing (Kirkpatrick et al. 1983; Cerny 1985) proceeds by placing an "energy landscape" over the state space W. It is desired to settle into a low energy state, the lowest being $\hat{w}_n$. Different annealing strategies arise depending on whether W is a finite set or is a continuum, but the basic idea is to start at some initial weight vector and compute the "energy" (value of $\bar{\lambda}_n$) for a nearby weight vector. If the energy is lower, move to the new vector. If the energy is higher, move to the new vector with a probability controlled by the annealing "temperature" schedule. By setting the temperature high initially, one may escape from local minima. The "temperature" is lowered at an appropriate rate so as to control the probability of jumping away from relatively good minima. Hajek (1985, 1988) gives a useful survey and some theorems establishing conditions under which simulated annealing ultimately delivers the solution $\hat{w}_n$ to equation 4.2. See also Davis (1987) and van Laarhoven (1988). It is useful to recognize that such procedures leave us twice removed from the optimal weights $w^*$, in a certain sense. If we could find $\hat{w}_n$, we would be once removed, effectively by sampling variation, although the results described above show that this sampling variation gets averaged out as n becomes large. However, finding $\hat{w}_n$ is only guaranteed in the limit of the annealing process. Because the annealing process must be terminated at some finite time, we are once removed from $\hat{w}_n$, and therefore twice removed from $w^*$. The most we can hope for is that the weights, say $\tilde{w}_n$, delivered by annealing after some finite time will be close to $\hat{w}_n$ in the sense that $\bar{\lambda}_n(\tilde{w}_n)$ is close to $\bar{\lambda}_n(\hat{w}_n)$. These weights could be far apart in standard metrics on W, but this is not of major concern: our primary concern is with measured average performance. The statistical properties of $\tilde{w}_n$ are not necessarily identical to those of $\hat{w}_n$. However, if $\tilde{w}_n$ delivers a local minimum of $\bar{\lambda}_n$, we can view $\tilde{w}_n$ as minimizing $\bar{\lambda}_n$ over some restriction of W, and regain similar statistical properties with respect to this restriction. We discuss finding a local minimum in more detail below.
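A minimal simulated-annealing sketch for this setting follows, assuming `loss` evaluates the sample average loss $\bar{\lambda}_n$; the logarithmic temperature schedule is one common choice, not a prescription from the text.

```python
import numpy as np

def anneal(loss, w0, n_steps=10_000, step=0.1, T0=1.0, rng=None):
    """Minimal simulated-annealing sketch for minimizing lambda_bar_n.

    loss : w -> scalar, the sample average loss lambda_bar_n(w)
    w0   : initial weight vector
    """
    rng = np.random.default_rng(rng)
    w, E = w0.copy(), loss(w0)
    for k in range(1, n_steps + 1):
        T = T0 / np.log(k + 1)                           # slowly cooling temperature
        w_new = w + step * rng.standard_normal(w.shape)  # nearby candidate vector
        E_new = loss(w_new)
        # always accept downhill moves; accept uphill moves with prob exp(-dE/T)
        if E_new < E or rng.random() < np.exp(-(E_new - E) / T):
            w, E = w_new, E_new
    return w, E
```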
The genetic algorithm (Holland 1975) proceeds by viewing the opposite of $\bar{\lambda}_n$, that is, $-\bar{\lambda}_n$, as a fitness function and w as a "DNA vector." Use of the genetic algorithm for function optimization is treated by Goldberg (1989) (see also Davis 1987). The basic idea is to begin with a population of N "individuals" with "DNA" $w^i$, $i = 1, \ldots, N$. The fitness of each individual is evaluated as $-\bar{\lambda}_n(w^i)$. Individuals mate with other individuals, exchanging "genetic material" in a manner bearing certain analogies to the exchange of DNA in biological organisms. More fit individuals are more likely to mate; further, the exchange of "DNA" is governed by "genetic operators" such as "crossover" and "mutation" that allow for local search (mutation) and distant search (crossover) in W. The result is a new generation of individuals with new "DNA." The process continues for many generations. Heuristically, one might expect an optimal individual to emerge from this process as the number of generations becomes large. To date, there does not appear to be a general theoretical result guaranteeing that $\hat{w}_n$ is indeed produced in the limit; such a result would be highly desirable. Nevertheless, the method does seem to perform reasonably well in applications. Typically, the method delivers weights in the neighborhood of an optimum relatively quickly, but can be very slow to find the optimum itself, owing to its simple method of local search. To aid the production of a highly fit individual in the present context, it is desirable to "clone" the most fit individual from one generation to the next. Also, it appears desirable to treat the elements of w as distinct entities in the crossover process, with attention also paid to keeping together "clumps" of hidden units, as these typically collectively encode information leading to good fitness. Again, weights produced by the genetic algorithm are twice removed from $w^*$ for reasons that are the same as with the method of simulated annealing. The identical comments apply, especially as regards obtaining a local minimum of $\bar{\lambda}_n$ for the purpose of exploiting the statistical properties of the resulting weights. This discussion of particular learning methods clarifies the relationship between two separate areas relevant for the analytic investigation of network learning. The first is the area of statistical analysis, which allows us to study the properties of any procedure that delivers a solution to equation 4.2. These properties are fairly well established. The second is the area of optimization theory, which delivers methods leading to the solution of 4.2. Such methods present a current challenge; the vast literature of optimization theory can be expected to yield a variety of useful methods for attempting to solve 4.2 in specific applications. As an example, we describe the multilevel single linkage algorithm of Rinnooy Kan et al. (1985). This method is a variant of the "multistart" method, which has three steps:
1. Draw a weight vector w at random from the uniform distribution over W.
2. Carry out a local search starting from w (see below for methods of local search) to obtain a local minimizer, say $\tilde{w}_n$.
3. If $\bar{\lambda}_n(\tilde{w}_n)$ is the smallest value obtained so far, put $\hat{w}_n = \tilde{w}_n$. Return to step 1.

The procedure continues until a stopping criterion is met. The multilevel single linkage technique is designed to improve the efficiency of the multistart procedure by performing local search for a minimum, starting not from every point drawn in step 1, but only from points satisfying certain criteria. Specifically, draw a sample of weight vectors $w^i$, $i = 1, \ldots, N$, from the uniform distribution over W and initiate local search from each weight, unless (1) $w^i$ is too close to the boundary of W (within a distance $\gamma > 0$, in RBT's notation); (2) $w^i$ is too close to a previously identified local minimizer (within a distance $\omega > 0$ in RBT's notation); or (3) there is a weight vector $w^j$, $j \neq i$, such that $\bar{\lambda}_n(w^j) < \bar{\lambda}_n(w^i)$ and $w^j$ is close to $w^i$ (within a distance $r_N > 0$ in RBT's notation). Timmer (1984) proves that if there is a finite number of minima and if $r_N$ is chosen appropriately and tends to zero as $N \to \infty$, then any local minimum $\tilde{w}_n$ (and consequently the global minimum $\hat{w}_n$) will be found within a finite number of iterations, with probability 1; the reader is referred to Timmer (1984) for further discussion. For network learning this method is extremely computation intensive but presents an interesting alternative to simulated annealing or the genetic algorithm.
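The following sketch illustrates the multistart skeleton with one of the rejection criteria used by multilevel single linkage (skipping draws near known minimizers); it is a simplified illustration under assumed interfaces, not RBT's algorithm in full.

```python
import numpy as np
from scipy.optimize import minimize

def multistart(loss, bounds, N=50, w_tol=0.5, rng=None):
    """Multistart sketch: local searches from uniform draws over W.

    Skips draws within w_tol of an already-identified local minimizer,
    one of the rejection criteria used by multilevel single linkage.
    """
    rng = np.random.default_rng(rng)
    lo, hi = bounds
    minima, best = [], (np.inf, None)
    for _ in range(N):
        w = rng.uniform(lo, hi)                       # step 1: draw w uniformly on W
        if any(np.linalg.norm(w - m) < w_tol for m in minima):
            continue                                  # near a known minimizer: skip
        res = minimize(loss, w)                       # step 2: local search
        minima.append(res.x)
        if res.fun < best[0]:                         # step 3: keep the best value
            best = (res.fun, res.x)
    return best
```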
4.1.7 Local Optimization Methods. Methods for solving equation 4.2 locally are themselves the subject of a voluminous literature. We merely sketch the outline of some gradient descent techniques that are straightforward to implement. Specifically, if $\bar{\lambda}_n$ is differentiable in w, an iteration can be constructed as

$\hat{w}_n^{(k+1)} = \hat{w}_n^{(k)} - \eta_k H_n^{(k)} \nabla\bar{\lambda}_n(\hat{w}_n^{(k)}),$

where $\hat{w}_n^{(k)}$ is the estimate at the kth iteration, $\hat{w}_n^{(0)}$ is any starting value, $\eta_k$ is a positive step-size parameter, $H_n^{(k)}$ is an $s \times s$ positive definite matrix, and $\nabla$ is the gradient operator with respect to w, so that $\nabla\bar{\lambda}_n$ is an $s \times 1$ vector. Different choices for $\eta_k$ and $H_n^{(k)}$ implement different specific gradient descent methods. A discussion of these and much additional relevant material can be found in Ortega and Rheinboldt (1970) and Rheinboldt (1974). Note that the sometimes extreme local irregularity ("roughness," "ruggedness") of the function $\bar{\lambda}_n$ over W arising in network learning applications may require development and use of appropriate modifications to the standard methods. Under appropriate regularity conditions, $\hat{w}_n^{(k)}$ converges as $k \to \infty$ to $\hat{w}_n$, a vector such that $\nabla\bar{\lambda}_n(\hat{w}_n) = 0$. These equations are the necessary first-order conditions for a local minimum of $\bar{\lambda}_n$ interior to W.
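In code, the simplest member of this family (taking $H_n^{(k)} = I$ and a constant step size) looks as follows; `grad_loss` is a hypothetical routine returning $\nabla\bar{\lambda}_n(w)$. A routine like this could also serve as the local search step in the multistart sketch above.

```python
import numpy as np

def gradient_descent(grad_loss, w0, eta=0.01, n_iter=1000):
    """Sketch of w(k+1) = w(k) - eta_k H(k) grad lambda_bar_n(w(k)),
    with H(k) = I and a constant step size, the simplest special case."""
    w = w0.copy()
    for _ in range(n_iter):
        w = w - eta * grad_loss(w)  # grad_loss returns the s x 1 sample gradient
    return w
```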
Under appropriate conditions, it can be further shown that $\hat{w}_n$ tends to $w^\dagger$, a parameter vector solving the problem $\nabla\lambda(w) = 0$. When interchange of derivative and integral is possible, we have

$\nabla\lambda(w) = \int \nabla l(z, w)\,\nu(dz) = E(\nabla l(Z, w)).$

Thus, seeking a solution to $\nabla\lambda(w) = 0$ is the same as seeking a solution to the problem

$E(\nabla l(Z, w)) = 0. \quad (4.4)$
Because $\nu$ is unknown, we cannot solve this problem directly. Nevertheless, the gradient descent methods just discussed provide one approach to attempting to find such a solution using available sample information, because $\nabla\bar{\lambda}_n(w) = n^{-1}\sum_{t=1}^n \nabla l(Z_t, w)$.

4.2 Learning by Recursive Methods. In 1951, Robbins and Monro considered the problem of finding a solution to the problem

$E(m(Z, w)) = \int m(z, w)\,\nu(dz) = 0, \quad (4.5)$

when the expectation cannot be computed because $\nu$ is unknown. Instead, error-laden observations on $E(m(Z, w))$ are given by realizations of the random variables $m(Z_t, w)$. Robbins and Monro (1951) proposed the method of "stochastic approximation" for finding an approximate solution to equation 4.5 using the recursion

$\hat{w}_n = \hat{w}_{n-1} + \eta_n m(Z_n, \hat{w}_{n-1}), \quad n = 1, 2, \ldots, \quad (4.6)$

where $\hat{w}_0$ is arbitrary and $\eta_n$ is a learning rate. Robbins and Monro (1951) studied their procedure for the case in which w is a scalar; Blum (1954) extended their analysis to the vector case. By setting $m(z, w) = -\nabla l(z, w)$, we can apply the Robbins-Monro procedure to obtain an approximate solution to the problem 4.4. The recursion is simply

$\hat{w}_n = \hat{w}_{n-1} - \eta_n \nabla l(Z_n, \hat{w}_{n-1}), \quad n = 1, 2, \ldots. \quad (4.7)$

When $\pi(y, o) = (y - o)^2/2$, we have $\nabla l(z, w) = -\nabla f(x, w)(y - f(x, w))$, so that

$\hat{w}_n = \hat{w}_{n-1} + \eta_n \nabla f(X_n, \hat{w}_{n-1})(Y_n - f(X_n, \hat{w}_{n-1})), \quad n = 1, 2, \ldots.$
This is easily recognized as the method of backpropagation (Werbos 1974; Parker 1982; Rumelhart et al. 1986). Thus, the method of backpropagation can be viewed as an application of the Robbins-Monro (1951) stochastic approximation procedure to solving the first-order conditions for a nonlinear least-squares regression problem. Results in the statistics and systems identification literature can be applied directly to investigate the properties of $\{\hat{w}_n\}$. White (1989b) applies results of Ljung (1977) and Walk (1977) to obtain consistency and limiting distribution results for equations 4.6 and 4.7 and the method of backpropagation. The results are similar to those for the estimator $\hat{w}_n$ discussed earlier. If the method does not diverge, then $\hat{w}_n \to W^\dagger$ a.s.$-P$, where $W^\dagger$ is the set of solutions to equation 4.4, $W^\dagger = \{w : E(\nabla l(Z, w)) = 0\}$. Note that this set includes global minima, local minima, and inflection points. If convergence is to a unique local minimum, then $\hat{w}_n$ has a limiting normal distribution. A statement of these results is given in Theorems 3 and 4 of the appendix. In those results, the random variables $Z_1, Z_2, \ldots$ are assumed to be statistically independent. Such an assumption is implausible for the analysis of time series data, so Kuan and White (1989) apply results of Kushner and Clark (1978) and Kushner and Huang (1979) to establish consistency and limiting distribution results for dependent sequences of random variables. An interesting feature of these results is that they specify conditions on the learning rate $\eta_n$ that are necessary to ensure the desired convergence results. In particular, the most rapid convergence occurs when $\sum_{n=1}^\infty \eta_n = \infty$ and $\eta_n \le \Delta n^{-1}$ for some $\Delta < \infty$. The results of White (1989b) and of Kuan and White (1989) can be used to construct tests of the irrelevant input hypothesis and the irrelevant hidden unit hypothesis in ways directly analogous to those discussed earlier. Also of interest is the fact that the recursion 4.6 can be used to generate modifications of the method of backpropagation that have improved convergence and statistical efficiency properties. Several of these are described by Kuan and White (1989).
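The squared-error recursion above is easy to state as code. The sketch below makes a single pass through the data with $\eta_n \propto 1/n$, consistent with the learning-rate conditions just described; `f` and `grad_f` are hypothetical routines for the network output and its gradient in the weights.

```python
import numpy as np

def online_backprop(f, grad_f, data, w0, eta0=1.0):
    """Recursion 4.7 for squared-error loss: one pass of online backpropagation.

    f, grad_f : network output and its gradient in w (grad_f returns an s-vector)
    data      : iterable of (x_t, y_t) pairs presented once each
    """
    w = w0.copy()
    for n, (x, y) in enumerate(data, start=1):
        eta = eta0 / n                               # eta_n proportional to 1/n
        w = w + eta * grad_f(x, w) * (y - f(x, w))   # squared-error update rule
    return w
```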
Kushner (1987) has studied a modification of equation 4.7 guaranteed to converge w.p.1 to a global solution to equation 4.1 as $n \to \infty$. His method embodies a form of annealing. The recursion is

$\hat{w}_n = \hat{w}_{n-1} - \eta_n(\nabla l(Z_n, \hat{w}_{n-1}) + \zeta_n),$

where $\{\eta_n\}$ is a decreasing positive sequence and $\{\zeta_n\}$ is a sequence of independent identically distributed Gaussian random variables. This approach amounts to adding decreasing amounts of random noise to the weights at each step. Convergence to a global optimum as $n \to \infty$ occurs almost surely, provided that $\eta_n$ is proportional to $1/\log n$. This gives very slow convergence. The discussion of this section has so far related a variety of well-known learning methods for artificial neural networks to existing methods of statistical estimation. The statistics literature suggests a variety of additional relevant procedures that to my knowledge have not yet been
proposed as network learning methods. One such procedure is that of Kiefer and Wolfowitz (1952). The Kiefer-Wolfowitz procedure and its variants are useful for situations in which computing $\nabla l$ is difficult or impossible. Instead of using $\nabla l$, use is made of an estimate of $\nabla l$ based on observations on $l$. A particularly convenient variant of the Kiefer-Wolfowitz procedure known as the "method of random directions" has been analyzed by Kushner and Clark (1978). To implement this method, one chooses a sequence of real constants $\{c_n\}$ and a sequence of direction vectors $\{d_n\}$ uniformly distributed over the unit sphere in $R^s$. Weights are generated by the recursion

$\hat{w}_n = \hat{w}_{n-1} - \eta_n d_n (l(Z_n, \hat{w}_{n-1} + c_n d_n) - l(Z_n, \hat{w}_{n-1} - c_n d_n))/2c_n, \quad n = 1, 2, \ldots. \quad (4.8)$
The term $d_n(l(Z_n, \hat{w}_{n-1} + c_n d_n) - l(Z_n, \hat{w}_{n-1} - c_n d_n))/2c_n$ plays the role of $\nabla l(Z_n, \hat{w}_{n-1})$ in the Robbins-Monro procedure (equation 4.7), implementing a random local exploration of the loss function. Kushner and Clark (1978) give conditions under which $\hat{w}_n \to W^\dagger$ a.s.$-P$. As an example, setting $l(z, w) = (y - f(x, w))^2/2$ in equation 4.8 yields a version of backpropagation that requires no computation of the gradient of the network output function, $\nabla f$. Because this gradient is not particularly burdensome to compute even in feedforward networks with many hidden layers, the Kiefer-Wolfowitz procedure may or may not offer much advantage in this context. Further, as Richard Durbin has pointed out, using local random search to approximate the gradient may prove difficult in contexts common to network applications in which the energy landscape is characterized by long narrow ravines. For these reasons, the potential usefulness of the Kiefer-Wolfowitz procedure is likely to lie in network applications where, for whatever reasons, gradients cannot be computed. One possibility is a situation in which the network is given to the researcher as a sealed black box, with only the weights subject to external control. A useful review of recursive estimation methods such as the Robbins-Monro and Kiefer-Wolfowitz procedures has recently been given by Ruppert (1989). Much of the material contained there has direct relevance for learning in artificial neural networks.
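A sketch of the random-directions recursion 4.8 follows; the gain sequences chosen here are illustrative assumptions, not values from the text.

```python
import numpy as np

def kiefer_wolfowitz(loss_at, data, w0, eta0=1.0, c0=1.0, rng=None):
    """Random-directions Kiefer-Wolfowitz sketch (recursion 4.8).

    loss_at : (z, w) -> l(z, w); no gradient of the network is ever computed
    data    : iterable of observations z_n
    """
    rng = np.random.default_rng(rng)
    w = w0.copy()
    for n, z in enumerate(data, start=1):
        d = rng.standard_normal(w.shape)
        d /= np.linalg.norm(d)                 # direction uniform on the unit sphere
        eta, c = eta0 / n, c0 / n ** 0.25      # decreasing gains (one common choice)
        # two-sided difference quotient estimates the directional derivative
        g = (loss_at(z, w + c * d) - loss_at(z, w - c * d)) / (2 * c)
        w = w - eta * g * d
    return w
```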
4.3 Summary. To summarize this section, a large class of learning methods for artificial neural networks can be viewed as statistical procedures for solving the problems 4.1 or 4.4. Concepts of stochastic convergence provide an appropriate framework in which to analyze the properties of these procedures. Existing results in the statistics, econometrics, systems identification, and optimization theory literatures can be applied directly to describe the properties of network learning methods. These properties can be exploited to answer questions about optimal
network architectures using the tools of statistical inference. Further, existing methods can suggest novel and potentially useful network learning procedures, such as the multilevel single linkage algorithm or the Kiefer-Wolfowitz method.

5 Nonparametric Estimation with Feedforward Networks
In all of the foregoing discussion, we have considered learning methods for networks of fixed complexity. Despite the great flexibility that such networks can afford in their input-output response (e.g., in their ability to approximate arbitrary mappings), they are nevertheless fundamentally limited. In particular, feedforward networks of fixed complexity will be able to provide only partial approximations to arbitrary mappings; their performance for especially complicated mappings can be quite poor. However, it is now well established that hidden layer feedforward networks with as few as a single hidden layer are capable of arbitrarily accurate approximation to an arbitrary mapping provided that sufficiently many hidden units are available [see Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hecht-Nielsen 1989; Hornik et al. 1989a,b (HSW); Stinchcombe and White 1989]. It is natural to ask whether it is possible to devise a learning procedure that can learn an arbitrarily accurate approximation to an arbitrary mapping. In this section, we review some recent results showing that this is indeed possible. These results are obtained by permitting the complexity of the network to grow at an appropriate rate relative to the size of the available training set. For concreteness, we assume that our interest centers on learning the conditional expectation function, which we now denote $\theta_0$, so that $\theta_0(X) = E(Y|X)$. (We have $\theta_0$ corresponding to g in our previous notation.) Other aspects of the conditional distribution of Y given X can be given a similar treatment. Because in practice we always have a training set of finite size n and because $\theta_0$ is an element of a space of functions (say $\Theta$) and is generally not an element of a finite dimensional space, we have essentially no hope of learning $\theta_0$ in any complete sense from a sample of fixed finite size. Nevertheless, it is possible to approximate or estimate $\theta_0$ to some degree of accuracy using a sample of size n, and to construct increasingly accurate approximations with increasing n. Let a learned approximation to $\theta_0$ based on a training set of size n be denoted $\hat{\theta}_n$. Just as in our discussion of the convergence properties of learned network weights $\hat{w}_n$, we can define appropriate notions of stochastic convergence for learned approximations, and it is in terms of this stochastic convergence that the approximations may become increasingly accurate. To define the appropriate notions of stochastic convergence, we need a way to measure distances between different functions belonging to $\Theta$.
A formal way to do this is to introduce a "metric" $\rho$, that is, a real-valued function on $\Theta \times \Theta$ which has the properties that $\rho(\theta_1, \theta_2) \ge 0$ (nonnegativity), $\rho(\theta_1, \theta_2) = \rho(\theta_2, \theta_1)$ (symmetry), and $\rho(\theta_1, \theta_2) \le \rho(\theta_1, \theta_3) + \rho(\theta_3, \theta_2)$ (triangle inequality) for all $\theta_1, \theta_2, \theta_3$ in $\Theta$. When $\rho(\theta_1, \theta_2) = 0$, we view $\theta_1$ and $\theta_2$ as identical. The pair $(\Theta, \rho)$ is known as a "metric space." For any function space $\Theta$ there are usually many different possible choices for $\rho$. However, once a suitable metric is specified, we can define stochastic convergence in terms of the chosen metric. The property of strong ($\rho$-) consistency of $\hat{\theta}_n$ for $\theta_0$ holds when $\rho(\hat{\theta}_n, \theta_0) \to 0$ (as $n \to \infty$) a.s.$-P$. The property of weak ($\rho$-) consistency of $\hat{\theta}_n$ for $\theta_0$ holds when $\rho(\hat{\theta}_n, \theta_0) \to 0$ prob$-P$. Because weak consistency is often easier to establish, we focus only on weak consistency, and drop the explicit use of the word "weak." In a very precise sense, then, a "consistent" learning procedure for $\theta_0$ is one for which the probability that $\hat{\theta}_n$ exceeds any specified level of approximation error relative to $\theta_0$, as measured by the metric $\rho$, tends to zero as the sample size n tends to infinity. Procedures that are not consistent will always make errors in classification, recognition, forecasting, or pattern completion (forms of generalization) that are eventually avoided by a consistent procedure. The only errors ultimately made by a consistent procedure are the inherently unavoidable errors ($\varepsilon = Y - \theta_0(X)$) arising from any fundamental randomness or fuzziness in the true relation between X and Y. White (1988) uses statistical theory for the "method of sieves" (Grenander 1981; Geman and Hwang 1982; White and Wooldridge 1989) to establish that multilayer feedforward networks can be used to obtain a consistent learning procedure for $\theta_0$ under fairly general conditions. The method of sieves is a general approach to nonparametric estimation in which an object of interest $\theta_0$ lying in a general (i.e., not necessarily finite dimensional) space $\Theta$ is approximated using a sequence of parametric models in which the dimensionality of the parameter space grows along with the sample size. The success of this approach requires the approximating parametric models to be capable of arbitrarily accurate approximation to elements of $\Theta$ as the underlying parameter space grows. For this reason, Fourier series (e.g., Gallant and Nychka 1987) and spline functions (e.g., Wahba 1984; Cox 1984) are commonly used in this context. Because multilayer feedforward networks have universal approximation properties, they are also suitable for such use. Without the universal approximation property, attempts at nonparametric estimation using feedforward networks would be doomed from the outset. White (1988) considers approximations obtained using single hidden layer feedforward networks with output functions
$f^q(\tilde{x}, w^q) = w_{10} + \sum_{j=1}^{q} w_{1j}\,\psi(\tilde{x}'w_{0j}), \quad (5.1)$

where $w^q = (w_0', w_1')'$ is the $s \times 1$ ($s = q(r+2)+1$) vector of network weights. There are q hidden units. The vector $w_0$ contains the input to
hidden unit weights, $w_0 = (w_{01}', \ldots, w_{0q}')'$, $w_{0j} = (w_{0j0}, w_{0j1}, \ldots, w_{0jr})'$, and the vector $w_1$ contains the hidden to output weights, $w_1 = (w_{10}, \ldots, w_{1q})'$. $\psi$ is the hidden unit activation function, and $\tilde{x} = (1, x')'$. Note that the network output function and the weight vector are explicitly indexed by the number of hidden units, q.
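For concreteness, equation 5.1 can be evaluated as follows; the sigmoid default is one admissible choice of $\psi$, and the array layout is an assumption of this sketch.

```python
import numpy as np

def f_q(x, w0, w1, psi=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """Output of the single hidden layer network of equation 5.1.

    x  : (r,) input vector
    w0 : (q, r+1) input-to-hidden weights, one row w_0j per hidden unit
    w1 : (q+1,) hidden-to-output weights (w_10, ..., w_1q), w1[0] the bias
    psi: hidden unit activation function (sigmoid here)
    """
    x_tilde = np.concatenate(([1.0], x))   # x_tilde = (1, x')'
    hidden = psi(w0 @ x_tilde)             # psi(x_tilde' w_0j), j = 1..q
    return w1[0] + w1[1:] @ hidden         # w_10 + sum_j w_1j psi(.)
```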
Because the complexity of such networks is indexed solely by q, we construct a sequence of approximations to $\theta_0$ by letting q grow with n at an appropriate rate, and for given n (hence given q) selecting connection strengths $\hat{w}_n$ so that $\hat{\theta}_n = f^{q_n}(\cdot, \hat{w}_n)$ provides an approximation to the unknown regression function $\theta_0$ that is the best possible in an appropriate sense, given the sample information. To formulate precisely a solution to the problem of finding $\hat{\theta}_n$, White defines the set

$T(\psi, q, \Delta) = \{\theta \in \Theta : \theta(\cdot) = f^q(\cdot, w^q), \ \sum_{j=0}^{q} |w_{1j}| \le \Delta, \ \sum_{j=1}^{q}\sum_{i=0}^{r} |w_{0ji}| \le q\Delta\}.$
This is the set of all single hidden layer feedforward networks with q hidden units having activation functions $\psi$, and with connection strengths satisfying a particular restriction on their sum norm, indexed by $\Delta$. This last restriction arises from certain technical aspects of the analysis. A sequence of "sieves" $\{\Theta_n(\psi)\}$ is constructed by specifying sequences $\{q_n\}$ and $\{\Delta_n\}$ and setting $\Theta_n(\psi) = T(\psi, q_n, \Delta_n)$, $n = 1, 2, \ldots$. The sieve $\Theta_n(\psi)$ becomes finer (less escapes) as $q_n \to \infty$ and $\Delta_n \to \infty$. For given sequences $\{q_n\}$ and $\{\Delta_n\}$, the "connectionist sieve estimator" $\hat{\theta}_n$ is defined as a solution to the least-squares problem [appropriate for learning $E(Y|X)$]

$\min_{\theta \in \Theta_n(\psi)} n^{-1} \sum_{t=1}^{n} (Y_t - \theta(X_t))^2. \quad (5.2)$

Associated with $\hat{\theta}_n$ is an estimator $\hat{w}_n$ of dimension $s_n \times 1$ ($s_n = q_n(r+2)+1$) such that $\hat{\theta}_n(\cdot) = f^{q_n}(\cdot, \hat{w}_n)$. The estimator $\hat{w}_n$ is defined as a solution to the problem

$\min_{w^{q_n} \in W_n} n^{-1} \sum_{t=1}^{n} (Y_t - f^{q_n}(X_t, w^{q_n}))^2, \quad (5.3)$

where $W_n = \{w^{q_n} : \sum_{j=0}^{q_n} |w_{1j}| \le \Delta_n, \ \sum_{j=1}^{q_n}\sum_{i=0}^{r} |w_{0ji}| \le q_n\Delta_n\}$. Comparing this to the problem of equation 4.3, we see that the only difference is that equation 5.3 explicitly references network complexity, which is now a function of the size n of the available body of empirical evidence. Theorem 7 of the appendix (White 1988) gives precise conditions on $\Theta$, $\{Z_t\}$ (equivalently, P), $\psi$, $\{q_n\}$, and $\{\Delta_n\}$ that ensure the consistency
of $\hat{\theta}_n$ for $\theta_0$ in $\Theta$ in the sense of the root mean square metric $\rho_2$,

$\rho_2(\theta_1, \theta_2) = \left[\int_K (\theta_1(x) - \theta_2(x))^2\,\mu(dx)\right]^{1/2}.$
Note that the integral is taken with respect to the environment measure $\mu$. The conditions are straightforward to describe. $\Theta$ is taken to be the space of square integrable functions on a given compact subset of $R^r$ ($\Theta = \{\theta : K \to R : \int_K \theta^2 d\mu < \infty$, K a compact subset of $R^r\}$). The probability measure P is assumed to generate identically distributed random variables $Z_t$ having joint probability measure $\nu$, with subvector $X_t$ having probability measure $\mu$ such that $\mu(K) = 1$. For simplicity, $Y_t$ is also assumed bounded, although this condition can be relaxed. The probability measure P also governs the interdependence of $Z_t$ and $Z_\tau$, $t \neq \tau$. White considers the case of independent random variables (appropriate for cross-section samples) and the case of "mixing" random variables (appropriate for time-series samples). The activation function $\psi$ can be chosen to be any activation function that permits single hidden layer feedforward networks to possess the universal approximation property in $(\Theta, \rho_2)$. For example, $\psi$ can be sigmoid, as shown by HSW. With these conditions specified, it is possible to derive growth rates for network complexity ensuring that $\rho_2(\hat{\theta}_n, \theta_0) \to 0$ prob$-P$, that is, that single hidden layer feedforward networks are capable of learning an arbitrarily accurate approximation to an unknown function, provided that they increase in complexity at an appropriate rate. In fact, the theoretical results require proper control of both $\Delta_n$ and $q_n$. The appropriate choice for $\Delta_n$ is proven to be such that $\Delta_n \to \infty$ as $n \to \infty$ and $\Delta_n = o(n^{1/4})$, that is, $n^{-1/4}\Delta_n \to 0$. A standard choice in the sieve estimation literature is $\Delta_n \propto \log n$. The appropriate choice for $q_n$ depends on $\Delta_n$ and on the assumed dependence properties of $\{Z_t\}$. When $\{Z_t\}$ is an independent sequence, it suffices that $q_n \to \infty$ and $q_n\Delta_n^2\log(q_n\Delta_n) = o(n)$; when $\{Z_t\}$ is a mixing sequence, it suffices that $q_n\Delta_n^2\log(q_n\Delta_n) = o(n^{1/2})$. For the choice $\Delta_n \propto \log n$, these conditions permit $q_n \propto n^{1-\delta}$, $0 < \delta < 1$, for the independent case and $q_n \propto n^{(1-\delta)/2}$ for the mixing case. The underlying justification for these growth rates is quite technical and cannot be given a meaningful simple explanation; most of the theoretical analysis is devoted to obtaining these rates. Nevertheless, their purpose is clear: they serve to prevent network complexity from growing so fast that overfitting results in the limit. These analytical results show only that network learning of an arbitrarily accurate approximation of an arbitrary mapping is possible. They do not provide more than very general guidance on how this can be done, and what guidance they do provide suggests that such learning will be hard. In particular, the learning method requires solution of equation 5.3. Global optimization methods, such as those discussed in Section 4.1.6, are appropriate.
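As an illustration of these rates (and only an illustration; the constants and the choice of $\delta$ are arbitrary), one admissible schedule for $(q_n, \Delta_n)$ can be computed as follows.

```python
import numpy as np

def sieve_schedule(n, delta=0.5, dependent=False):
    """One admissible choice of (q_n, Delta_n) under the stated growth rates.

    Delta_n proportional to log n; q_n proportional to n^(1-delta) in the
    i.i.d. case and n^((1-delta)/2) in the mixing case, 0 < delta < 1.
    """
    Delta_n = np.log(max(n, 2))
    power = (1 - delta) / 2 if dependent else (1 - delta)
    q_n = max(1, int(n ** power))
    return q_n, Delta_n
```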
Furthermore, although these results do provide asymptotic guidelines on growth of network complexity, they say nothing about how to determine adequate network complexity in any specific application with a given training set of size n. The search for an appropriate technique for determining network complexity has been the focus of considerable effort to date (e.g., Rumelhart 1988; Ash 1989; Hirose et al. 1989). It is apparent that methods developed by statisticians will prove helpful in this search. In particular, White (1988) discusses use of the method of cross-validation (e.g., Stone 1974) to determine network complexity appropriate for a training set of given size; a sketch is given below. An alternative approach is given by Barron (1989). White's analysis does not treat the limiting distribution of $\hat{\theta}_n$. This analysis is more difficult than that associated with $\hat{w}_n$ because $\hat{\theta}_n$ has a probability distribution over a function space. A study of this distribution is of theoretical interest, and may also be of some practical use. Results of Andrews (1988) may be applicable to obtain the limiting distributions of linear and nonlinear functionals of $\hat{\theta}_n$. These would allow construction of asymptotic confidence intervals for the value of $\theta_0$ at a given point $x_0$, for example. Also of interest are hypothesis tests that will permit inference about the nature of a given mapping of interest. Specifically, some theory of the phenomenon of interest might suggest that a particular unknown mapping $\theta_0$ has a specific linear or nonlinear form, so that one might formulate the null hypothesis $H_0 : \theta_0 \in \Theta_0$, where $\Theta_0$ is some specific class of functions having the specified property (e.g., affine functions or some specified parametric family). The alternative is that $\theta_0$ does not belong to $\Theta_0$, that is, $H_a : \theta_0 \notin \Theta_0$. Tests of this $H_0$ against $H_a$ have been extensively studied in the econometrics literature, where they are known as specification tests. A specification test using single hidden layer feedforward networks has been proposed by White (1989c) and investigated by Lee et al. (1989). Most such specification tests are "blind" to certain alternatives, however, in that they will fail to detect certain departures from $H_0$ no matter how large n is. Recent work of Wooldridge (1989) has exploited the nonparametric estimation capabilities of series estimators, a special class of sieve estimators, to obtain specification tests that can detect any departure from $H_0$ with probability approaching 1 as n becomes large. It is plausible that Wooldridge's approach can be applied to the connectionist sieve estimator as well, so that "failsafe" tests of $H_0$ can be obtained using feedforward networks.
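A generic K-fold cross-validation sketch for choosing the number of hidden units follows; `fit` and `mse` are placeholder routines for training a network of given size and scoring held-out error, and nothing here is specific to White's (1988) procedure.

```python
import numpy as np

def cross_validate_q(X, Y, fit, mse, q_grid, n_folds=5, rng=None):
    """Choose the number of hidden units by K-fold cross-validation.

    fit : (X, Y, q) -> fitted predictor theta_hat (placeholder training routine)
    mse : (theta_hat, X, Y) -> held-out mean squared error
    """
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for q in q_grid:
        errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            theta = fit(X[train], Y[train], q)
            errs.append(mse(theta, X[test], Y[test]))
        scores.append(np.mean(errs))
    return q_grid[int(np.argmin(scores))]  # q with the smallest CV error
```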
6 Summary and Concluding Remarks

It is the premise of this review that learning methods in artificial neural networks are sophisticated statistical procedures and that tools developed for the study of statistical procedures generally not only yield useful
insights into the properties of specific learning procedures, but also suggest valuable improvements in, alternatives to, and generalizations of existing learning procedures. Particularly applicable are asymptotic analytical methods that describe the behavior of statistics when the size n of the training set is large. The study of the stochastic convergence properties (consistency, limiting distribution) of any proposed new learning procedure is strongly recommended, to determine what it is that the network eventually learns and under what specific conditions. Derivation of the limiting distribution will generally reveal the statistical efficiency of the new procedure relative to existing procedures and may suggest modifications capable of improving statistical efficiency. Furthermore, the availability of the limiting distribution makes possible valid statistical inferences. Such inferences can be of great value in the investigation of optimal network architectures in particular applications. A wealth of applicable theory is already available in the statistics, econometrics, systems identification, and optimization theory literatures. Among the applications of results already available in these literatures are some potentially useful learning methods for artificial neural networks based on the multilevel single linkage and the Kiefer-Wolfowitz procedures, as well as a demonstration of the usefulness of multilayer feedforward networks for nonparametric estimation of an unknown mapping. We have described recent work of White (1988) along these lines, establishing that arbitrary mappings can indeed be learned using multilayer feedforward networks. It is evident that the field of statistics has much to gain from the connectionist literature. Analyzing neural network learning procedures poses a host of interesting theoretical and practical challenges for statistical method; all is not cut and dried. Most importantly, however, neural network models provide a novel, elegant, and extremely valuable class of mathematical tools for data analysis. Application of neural network models to new and existing data sets holds the potential for fundamental advances in empirical understanding across a broad spectrum of the sciences. To realize these advances, statistics and neural network modeling must work together, hand in hand.

Mathematical Appendix
Here we state some theorems providing conditions under which the convergence properties of various learning methods discussed in the text can be precisely specified. The conditions given are not the most general possible, but are instead chosen to provide the reader with an introduction to the sorts of conditions and results encountered in the study of convergence properties. The conditions stated rely on concepts of probability, measure, and integration that are well presented in, for example, Bartle (1966) and Billingsley (1979).
Theorem 1. Let $(\Omega, F, P)$ be a complete probability space on which is defined the sequence of independent identically distributed (i.i.d.) random variables $\{Z_t\} = (Z_t : \Omega \to R^v, t = 1, 2, \ldots)$, $v \in N = \{1, 2, \ldots\}$. Let $l : R^v \times W \to R$ be a function such that for each w in W, a compact subset of $R^s$, $s \in N$, $l(\cdot, w)$ is measurable-$B^v$ (where $B^v$ is the Borel $\sigma$-field generated by the open sets of $R^v$), and for each z in $R^v$, $l(z, \cdot)$ is continuous on W. Suppose further that there exists $d : R^v \to R^+$ such that for all w in W, $|l(z, w)| \le d(z)$ and $E(d(Z_t)) < \infty$ (i.e., l is dominated on W by an integrable function). Then for each $n = 1, 2, \ldots$ there exists a solution $\hat{w}_n$ to the problem $\min_{w \in W} \bar{\lambda}_n(w) = n^{-1}\sum_{t=1}^n l(Z_t, w)$, and $\hat{w}_n \to W^*$ a.s.$-P$, where $W^* = \{w^* \in W : \lambda(w^*) \le \lambda(w)$ for all $w \in W\}$, $\lambda(w) = E(l(Z_t, w))$. □

We give a proof for this result, as a concise proof is not readily available in the literature. The proof also makes clear points at which assumptions may be relaxed.

Proof. The existence of $\hat{w}_n$ follows because for each realization of $\{Z_t\}$, $\bar{\lambda}_n$ is a continuous function on a compact set, $n = 1, 2, \ldots$. Given domination of l and compactness of W, it follows from Theorem 16.8(i) of Billingsley (1979) that $\lambda$ is continuous on W. Given domination of l, compactness of W, and the assumption that $\{Z_t\}$ is i.i.d., it follows from the uniform law of large numbers (e.g., Jennrich 1969, Theorem 2) that $\sup_{w \in W} |\bar{\lambda}_n(w) - \lambda(w)| \to 0$ a.s.$-P$. Pick a realization of $\{Z_t\}$ for which this convergence occurs. For this realization, let $\{\hat{w}_n\}$ be a sequence of minimizers of $\bar{\lambda}_n$, $n = 1, 2, \ldots$. Because W is compact, there exists a limit point $w^o \in W$ and a subsequence $\{n'\}$ such that $\hat{w}_{n'} \to w^o$. It follows from the triangle inequality that $|\bar{\lambda}_{n'}(\hat{w}_{n'}) - \lambda(w^o)| \le |\bar{\lambda}_{n'}(\hat{w}_{n'}) - \lambda(\hat{w}_{n'})| + |\lambda(\hat{w}_{n'}) - \lambda(w^o)| < 2\varepsilon$ for any $\varepsilon > 0$ and all n' sufficiently large, given the uniform convergence and continuity just established. Now $\lambda(w^o) - \lambda(w) = [\lambda(w^o) - \bar{\lambda}_{n'}(\hat{w}_{n'})] + [\bar{\lambda}_{n'}(\hat{w}_{n'}) - \bar{\lambda}_{n'}(w)] + [\bar{\lambda}_{n'}(w) - \lambda(w)] \le 3\varepsilon$ for any $\varepsilon$ and all n' sufficiently large, because $\lambda(w^o) - \bar{\lambda}_{n'}(\hat{w}_{n'}) \le 2\varepsilon$ as just established, $\bar{\lambda}_{n'}(\hat{w}_{n'}) - \bar{\lambda}_{n'}(w) \le 0$ by the optimality of $\hat{w}_{n'}$, and $\bar{\lambda}_{n'}(w) - \lambda(w) < \varepsilon$ by uniform convergence. Because $\varepsilon$ is arbitrary, $\lambda(w^o) \le \lambda(w)$, and because w is arbitrary, $w^o \in W^*$. Because $\{\hat{w}_n\}$ is arbitrary, every limit point $w^o$ of such a sequence belongs to $W^*$. Now suppose that $\inf_{w^* \in W^*} \|\hat{w}_n - w^*\|$ does not tend to 0. Then there exist $\varepsilon > 0$ and a subsequence $\{n'\}$ such that $\|\hat{w}_{n'} - w^*\| \ge \varepsilon$ for all n' and $w^* \in W^*$. But $\{\hat{w}_{n'}\}$ has a limit point that, by the preceding argument, must belong to $W^*$. This is a contradiction to $\|\hat{w}_{n'} - w^*\| \ge \varepsilon$ for all n', so $\inf_{w^* \in W^*} \|\hat{w}_n - w^*\| \to 0$. Because the realization of $\{Z_t\}$ is chosen from a set with probability 1, the conclusion follows. □
Theorem 2. Let $(\Omega, F, P)$, $\{Z_t\}$, W, and l be as in Theorem 1, and suppose that $\hat{w}_n \to w^*$ a.s.$-P$, where $w^*$ is an isolated element of $W^*$ interior to W. Suppose in addition that for each z in $R^v$, $l(z, \cdot)$ is continuously differentiable of order 2 on W; that $E(\nabla l(Z_t, w^*)'\nabla l(Z_t, w^*)) < \infty$; that each
element of $\nabla^2 l$ is dominated on W by an integrable function; and that $A^* = E(\nabla^2 l(Z_t, w^*))$ and $B^* = E(\nabla l(Z_t, w^*)\nabla l(Z_t, w^*)')$ are nonsingular $(s \times s)$ matrices, where $\nabla$ and $\nabla^2$ denote the $(s \times 1)$ gradient and $(s \times s)$ Hessian operators with respect to w. Then $\sqrt{n}(\hat{w}_n - w^*) \xrightarrow{d} N(0, C^*)$, where $C^* = A^{*-1}B^*A^{*-1}$. If in addition each element of $\nabla l\,\nabla l'$ is dominated on W by an integrable function, then $\hat{C}_n \to C^*$ a.s.$-P$, where $\hat{C}_n = \hat{A}_n^{-1}\hat{B}_n\hat{A}_n^{-1}$, $\hat{A}_n = n^{-1}\sum_{t=1}^n \nabla^2 l(Z_t, \hat{w}_n)$, $\hat{B}_n = n^{-1}\sum_{t=1}^n \nabla l(Z_t, \hat{w}_n)\nabla l(Z_t, \hat{w}_n)'$. □
See Gallant and White (1988, chap. 5) for a proof of a more general result. The steps of the proof of the present result are completely analogous to those of Gallant and White. The following four results are proven in White (1989b). The notation has been adapted to correspond to that used here. The function m appearing in these results corresponds to $-\nabla l$ in the results above; use of m permits treatment of situations in which learning need not optimize a performance measure.
Theorem 3. (White 1989b, Proposition 3.1). Let $\{Z_t\}$ be a sequence of i.i.d. random $v \times 1$ vectors such that $|Z_t| < \Delta < \infty$. Let $m : R^v \times R^s \to R^s$ be continuously differentiable on $R^v \times R^s$ and suppose that for each w in $R^s$, $M(w) \equiv E(m(Z_t, w)) < \infty$. Let $\{\eta_n \in R^+\}$ be a decreasing sequence such that $\sum_{n=1}^\infty \eta_n = \infty$, $\limsup_{n\to\infty}(\eta_n^{-1} - \eta_{n-1}^{-1}) < \infty$, and $\sum_{n=1}^\infty \eta_n^d < \infty$ for some $d > 1$. Define the recursive m-estimator $\hat{w}_n = \hat{w}_{n-1} + \eta_n m(Z_n, \hat{w}_{n-1})$, $n = 1, 2, \ldots$, where $\hat{w}_0 \in R^s$ is arbitrary. (a) Suppose there exists $\lambda : R^s \to R$ twice continuously differentiable such that $\nabla\lambda(w)'M(w) \le 0$ for all w in $R^s$. Then either $\hat{w}_n \to W^\dagger \equiv \{w : \nabla\lambda(w)'M(w) = 0\}$ or $\hat{w}_n \to \infty$ with probability one (w.p.1). (b) Suppose $w^* \in R^s$ is such that $P[\hat{w}_n \to S_\varepsilon^*] > 0$ for all $\varepsilon > 0$, where $S_\varepsilon^* \equiv \{w \in R^s : \|w - w^*\| < \varepsilon\}$. Then $M(w^*) = 0$. If in addition M is continuously differentiable in a neighborhood of $w^*$ with $\nabla M^* \equiv \nabla M(w^*)$ finite and if $J^* \equiv E(m(Z_t, w^*)m(Z_t, w^*)')$ is finite and positive definite, then $\nabla M^*$ has all eigenvalues in the left half-plane. (c) Suppose the conditions of (a) hold, that $M(w) = -\nabla\lambda(w)$, that $\lambda(w)$ has isolated stationary points, and that the conditions of (b) hold for each $w^* \in W^\dagger = \{w : \nabla\lambda(w) = 0\}$. Then as $n \to \infty$ either $\hat{w}_n$ tends to a local minimum of $\lambda(w)$ w.p.1 or $\hat{w}_n \to \infty$ w.p.1. □
Theorem 4. (White 1989b, Proposition 4.1). Let the conditions of Theorem 3(a,b) hold, and suppose also that $|m(Z_n, w)| < \Delta < \infty$ a.s. for all w in $R^s$. Let $\zeta^*$ be the maximum value of the real part of the eigenvalues of $\nabla M^*$ and suppose $\zeta^* < -1/2$. Define $J(w) \equiv \mathrm{var}[m(Z_n, w)]$ and suppose J is continuous on a neighborhood of $w^*$. Set $J^* = J(w^*)$ and $\eta_n = n^{-1}$.
Then the sequence of random elements $T_n(a)$ of $C^s[0,1]$ with sup norm, $a \in [0,1]$, defined by $T_n(a) = n^{-1/2}S_{[na]} + (na - [na])n^{-1/2}(S_{[na]+1} - S_{[na]})$, with $S_n = n(\hat{w}_n - w^*)$, converges in distribution to a Gaussian Markov process G with $G(a) = \exp[(\ln a)(I + \nabla M^*)]\int_0^a \exp[-(\ln t)(\nabla M^* + I)]\,dW(t)$, $a \in (0,1]$, where W is a Brownian motion in $R^s$ with $W(0) = 0$, $E(W(1)) = 0$, and $E(W(1)W(1)') = J^*$. In particular, $n^{1/2}(\hat{w}_n - w^*) \xrightarrow{d} N(0, F^*)$, where $F^* = \int_0^1 \exp(-(\ln t)[\nabla M^* + I])\,J^*\exp(-(\ln t)[\nabla M^{*\prime} + I])\,dt$ is the unique solution to the equation $(\nabla M^* + I/2)F^* + F^*(\nabla M^{*\prime} + I/2) = -J^*$. When $\nabla M^*$ is symmetric, $F^* = PHP^{-1}$, where P is the orthogonal matrix such that $P\Xi P^{-1} = -\nabla M^*$ with $\Xi$ the diagonal matrix containing the (real) eigenvalues $(\xi_1, \ldots, \xi_s)$ of $-\nabla M^*$ in decreasing order, and H the $s \times s$ matrix with elements $H_{ij} = (\xi_i + \xi_j - 1)^{-1}K^*_{ij}$, $i, j = 1, \ldots, s$, where $[K^*_{ij}] = K^* = P^{-1}J^*P$. □
Theorem 5. (White 1989b, Proposition 5.1). Let $M : R^s \to R^s$ have unique zero $w^*$ interior to a convex compact set $W \subset R^s$ and suppose M is continuously differentiable on W with $\nabla M^*$ finite and nonsingular. Let $(\Omega, F, P)$ be a probability space, and suppose there exists a sequence $\{M_n : \Omega \times W \to R^s\}$ such that for each w in W, $M_n(\cdot, w)$ is measurable-F and for each $\omega$ in $\Omega$, $M_n(\omega, \cdot)$ is continuously differentiable on W, with Jacobian $\nabla M_n(\omega, \cdot)$. Suppose that for some positive definite matrix $B^*$, $n^{1/2}M_n(\cdot, w^*) \xrightarrow{d} N(0, B^*)$ and that $M_n(\cdot, w) - M(w) \to 0$, $\nabla M_n(\cdot, w) - \nabla M(w) \to 0$ a.s.$(-P)$ uniformly on W. Let $\{\bar{w}_n : \Omega \to R^s\}$ be a measurable sequence such that $\bar{w}_n \to w^*$ a.s. and $n^{1/2}(\bar{w}_n - w^*)$ is $O_p(1)$. Then with $\bar{M}_n = M_n(\cdot, \bar{w}_n)$ and $\nabla\bar{M}_n = \nabla M_n(\cdot, \bar{w}_n)$, $\hat{w}_n \equiv \bar{w}_n - \nabla\bar{M}_n^{-1}\bar{M}_n$ is such that $\hat{w}_n \to w^*$ a.s. and $n^{1/2}(\hat{w}_n - w^*) \xrightarrow{d} N(0, C^*)$, where $C^* = A^{*-1}B^*A^{*-1\prime}$, $A^* \equiv \nabla M^*$. If there exists $\{\hat{B}_n\}$ such that $\hat{B}_n \to B^*$ a.s., then with $\hat{A}_n = \nabla\bar{M}_n$ we have $\hat{C}_n = \hat{A}_n^{-1}\hat{B}_n\hat{A}_n^{-1\prime} \to C^*$ a.s. □
Theorems 3 and 4 establish that the recursive m-estimator satisfies the conditions required for $\bar{w}_n$ here. The utility of Theorem 5 is that $\hat{w}_n$ can afford an improvement over $\bar{w}_n$ in the sense of having a smaller asymptotic covariance matrix.

Theorem 6. (White 1989b, Proposition 5.2). Let the conditions of Theorem 5 hold with $w^*$ an isolated zero of $M(w) \equiv E(m(Z_t, w))$, and let W be a convex compact neighborhood of $w^*$. Put $M_n(\cdot, w) = n^{-1}\sum_{t=1}^n m(Z_t, w)$ so that $\nabla M_n(\cdot, w) = n^{-1}\sum_{t=1}^n \nabla m(Z_t, w)$, and suppose that $\nabla m$ is dominated on W by an integrable function. Let $\bar{w}_n$ be the recursive m-estimator and define $\hat{w}_n = \bar{w}_n - \nabla\bar{M}_n^{-1}\bar{M}_n$, $n = 1, 2, \ldots$. Then the conclusions of Theorem 5 hold and $F^* - C^*$ is positive semidefinite. □

In stating the final result, we make use of mixing measures of stochastic dependence, in particular, uniform ($\phi$-) and strong ($\alpha$-) mixing.
These are defined as

$\phi(k) = \sup_t \sup\{|P(B|A) - P(B)| : A \in F_1^t, P(A) > 0, B \in F_{t+k}^\infty\},$
$\alpha(k) = \sup_t \sup\{|P(A \cap B) - P(A)P(B)| : A \in F_1^t, B \in F_{t+k}^\infty\},$

where $F_1^t \equiv \sigma(Z_1, \ldots, Z_t)$ is the $\sigma$-field generated by $\{Z_1, \ldots, Z_t\}$, and $F_t^\infty \equiv \sigma(Z_t, Z_{t+1}, \ldots)$ is the $\sigma$-field generated by $\{Z_t, Z_{t+1}, \ldots\}$. For a discussion of $\phi(k)$ and $\alpha(k)$ and the properties of mixing processes $\{Z_t\}$ [i.e., processes for which $\phi(k) \to 0$ or $\alpha(k) \to 0$ as $k \to \infty$], we refer to White (1984b). The sets $\Theta_n(\psi)$ and $T(\psi, q, \Delta)$ are as defined in Section 5.
Theorem 7. (White 1988, Theorem 4.5). Suppose that the observed data are the realization of a stochastic process $\{Z_t : \Omega \to I^v, t = 1, 2, \ldots\}$, $I = [0,1]$, on the complete probability space $(\Omega, F, P)$, and that P is such that either (1) $\{Z_t\}$ is i.i.d. or (2) $\{Z_t\}$ is a stationary mixing process with either $\phi(k) = \phi_0\rho_0^k$ or $\alpha(k) = \alpha_0\rho_0^k$, $k \ge 1$, for some constants $\phi_0, \alpha_0 > 0$, $0 < \rho_0 < 1$. Suppose that $\theta_0$ is the unique element of $\Theta = L_2(I^r, \mu)$ such that $E(Y_t|X_t) = \theta_0(X_t)$. Put $\Theta_n(\psi) = T(\psi, q_n, \Delta_n)$ where $\psi$ is a cumulative distribution function satisfying a Lipschitz condition, and $\{q_n\}$ and $\{\Delta_n\}$ are such that $q_n$ and $\Delta_n$ are increasing with n, $q_n \to \infty$ and $\Delta_n \to \infty$ as $n \to \infty$, $\Delta_n = o(n^{1/4})$, and either (1) $q_n\Delta_n^2\log(q_n\Delta_n) = o(n)$ (for i.i.d. $\{Z_t\}$) or (2) $q_n\Delta_n^2\log(q_n\Delta_n) = o(n^{1/2})$ (for mixing $\{Z_t\}$). Then there exists a measurable connectionist sieve estimator $\hat{\theta}_n : \Omega \to \Theta$ such that

$n^{-1}\sum_{t=1}^n (Y_t - \hat{\theta}_n(X_t))^2 = \min_{\theta \in \Theta_n(\psi)} n^{-1}\sum_{t=1}^n (Y_t - \theta(X_t))^2.$

Further, $\rho_2(\hat{\theta}_n, \theta_0) \to 0$ prob$-P$. □
Acknowledgments

The author is indebted to Richard Durbin, Mark Salmon, and the editor for helpful comments and references. This work was supported by National Science Foundation Grant SES-8806990.
References

Andrews, D. W. K. 1988. Asymptotic normality of series estimators for various nonparametric and semi-parametric estimators. Yale University, Cowles Foundation Discussion Paper 874.
Ash, T. 1989. Dynamic node creation in backpropagation networks. Poster presentation, International Joint Conference on Neural Networks, Washington, D.C.
Barron, A. 1989. Statistical properties of artificial neural networks. Paper presented to the IEEE Conference on Decision and Control.
Bartle, R. 1966. The Elements of Integration. Wiley, New York.
Baum, E., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Billingsley, P. 1979. Probability and Measure. Wiley, New York.
Blum, J. 1954. Multivariate stochastic approximation methods. Ann. Math. Stat. 25, 737-744.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:607-611. IEEE, New York.
Cerny, V. 1985. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Opt. Theory Appl. 45, 41-51.
Cox, D. 1984. Multivariate smoothing splines. SIAM J. Numerical Anal. 21, 789-813.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals Sys. 2, 303-314.
Davies, R. B. 1977. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64, 247-254.
Davies, R. B. 1987. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74, 33-43.
Davis, L. (ed.) 1987. Genetic Algorithms and Simulated Annealing. Pitman, London.
Domowitz, I., and White, H. 1982. Misspecified models with dependent observations. J. Economet. 20, 35-50.
Efron, B. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gallant, A. R., and Nychka, D. 1987. Semi-nonparametric maximum likelihood estimation. Econometrica 55, 363-390.
Gallant, A. R., and White, H. 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Geman, S., and Hwang, C. 1982. Nonparametric maximum likelihood estimation by the method of sieves. Ann. Stat. 10, 401-414.
Goldberg, D. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
Golden, R. 1988. A unified framework for connectionist systems. Biological Cybernetics 59, 109-120.
Grenander, U. 1981. Abstract Inference. Wiley, New York.
Hajek, B. 1985. A tutorial survey of theory and applications of simulated annealing. In Proceedings of the 24th IEEE Conference on Decision and Control, pp. 755-760.
Hajek, B. 1988. Cooling schedules for optimal annealing. Math. Operations Res. 13, 311-329.
Haussler, D. 1989. Generalizing the PAC model for neural net and other learning applications. UCSC Computer Research Laboratory Tech. Rep. UCSC-CRL-89-30.
Hecht-Nielsen, R. 1989. Theory of the back-propagation neural network. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:593-606. IEEE, New York.
Hirose, Y., Yamashita, K., and Hijiya, S. 1989. Back-propagation algorithm which varies the number of hidden units. Poster presentation, International Joint Conference on Neural Networks, Washington, D.C.
Holland, J. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
Hornik, K., Stinchcombe, M., and White, H. 1989a. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-368.
Hornik, K., Stinchcombe, M., and White, H. 1989b. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. UCSD Department of Economics Discussion Paper.
Jennrich, R. 1969. Asymptotic properties of nonlinear least squares estimators. Ann. Math. Stat. 40, 633-643.
Kiefer, J., and Wolfowitz, J. 1952. Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462-466.
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kuan, C.-M., and White, H. 1989. Recursive M-estimation, nonlinear regression and neural network learning with dependent observations. UCSD Department of Economics Discussion Paper.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Stat. 22, 79-86.
Kushner, H. 1987. Asymptotic global behavior for stochastic approximations and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185.
Kushner, H., and Clark, D. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Kushner, H., and Huang, H. 1979. Rates of convergence for stochastic approximation type algorithms. SIAM J. Control Optim. 17, 607-617.
Lee, T.-H., White, H., and Granger, C. W. J. 1989. Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests. UCSD Department of Economics Discussion Paper.
Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Trans. Automatic Control AC-22, 551-575.
Ortega, J., and Rheinboldt, W. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
Parker, D. B. 1982. Learning logic. Invention Report 581-64, File 1, Office of Technology Licensing, Stanford University.
Phillips, P. C. B. 1989. Partially identified econometric models. Econometric Theory 5, 181-240.
Renyi, A. 1961. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium in Mathematical Statistics, Vol. 1, pp. 547-561. University of California Press, Berkeley.
Rheinboldt, W. 1974. Methods for Solving Systems of Nonlinear Equations. SIAM, Philadelphia.
Rinnooy Kan, A. H. G., Boender, C. G. E., and Timmer, G. Th. 1985. A stochastic approach to global optimization. In Computational Mathematical Programming, K. Schittkowski, ed., NATO ASI Series, Vol. F15, pp. 281-308. Springer-Verlag, Berlin.
Robbins, H., and Monro, S. 1951. A stochastic approximation method. Ann. Math. Stat. 22, 400-407.
Rumelhart, D. 1988. Parallel distributed processing. Plenary lecture, IEEE International Conference on Neural Networks, San Diego.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge.
Ruppert, D. 1989. Stochastic approximation. In Handbook of Sequential Analysis, B. Ghosh and P. Sen, eds. Marcel Dekker, New York, forthcoming.
Serfling, R. 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:613-617. IEEE, New York.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 36, 111-133.
Timmer, G. Th. 1984. Global optimization: A stochastic approach. Unpublished Ph.D. Dissertation, Erasmus Universiteit Rotterdam, Centrum voor Wiskunde en Informatica.
Tishby, N., Levin, E., and Solla, S. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. II:403-409. IEEE, New York.
van Laarhoven, P. J. M. 1988. Theoretical and Computational Aspects of Simulated Annealing. Centrum voor Wiskunde en Informatica, Amsterdam.
Wahba, G. 1984. Cross-validated spline methods for the estimation of multivariate functions from data on functionals. In Statistics: An Appraisal, H. A. David and H. T. David, eds., pp. 205-235. Iowa State University Press, Ames.
Walk, H. 1977. An invariance principle for the Robbins-Monro process in a Hilbert space. Z. Wahrscheinlichkeitstheor. Verwand. Geb. 39, 135-150.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.
White, H. 1981. Consequences and detection of misspecified nonlinear regression models. J. Am. Stat. Assoc. 76, 419-433.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.
White, H. 1984a. Maximum likelihood estimation of misspecified dynamic models. In Misspecification Analysis, T. Dijkstra, ed., pp. 1-19. Springer-Verlag, New York.
White, H. 1984b. Asymptotic Theory for Econometricians. Academic Press, New York.
White, H. 1988. Multilayer feedforward networks can learn arbitrary mappings: Connectionist nonparametric regression with automatic and semi-automatic determination of network complexity. UCSD Department of Economics Discussion Paper.
White, H. 1989a. Estimation, Inference and Specification Analysis. Cambridge University Press, New York, forthcoming.
White, H. 1989b. Some asymptotic results for learning in single hidden layer feedforward networks. J. Am. Stat. Assoc., forthcoming.
White, H. 1989c. An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. II:451-455. IEEE, New York.
White, H., and Wooldridge, J. 1989. Some results on sieve estimation with dependent observations. In Nonparametric and Semiparametric Methods in Econometrics and Statistics, W. Barnett, J. Powell, and G. Tauchen, eds. Cambridge University Press, New York, forthcoming.
Wiener, N. 1948. Cybernetics. Wiley, New York.
Wooldridge, J. 1989. Some results on specification testing against nonparametric alternatives. MIT Department of Economics Working Paper.
Received 10 August 1989; accepted 26 September 1989.
NOTE
Communicated by Halbert White
Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant Federico Girosi Tomaso Poggio Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA 02142 USA and Center for Biological Information Processing, Whitaker College, Cambridge, MA 02142 USA
Many neural networks can be regarded as attempting to approximate a multivariate function in terms of one-input one-output units. This note considers the problem of an exact representation of nonlinear mappings in terms of simpler functions of fewer variables. We review Kolmogorov's theorem on the representation of functions of several variables in terms of functions of one variable and show that it is irrelevant in the context of networks for learning.

1 Kolmogorov's Theorem: An Exact Representation Is Hopeless

A crucial point in approximation theory is the choice of the representation of the approximant function. Since each representation can be mapped onto an appropriate network, choosing the representation is equivalent to choosing a particular network architecture. In recent years it has been suggested that a result of Kolmogorov (1957) could be used to justify the use of multilayer networks composed of simple one-input one-output units. This theorem and a previous result of Arnol'd (1957) can be considered as the definitive disproof of Hilbert's conjecture (his thirteenth problem, Hilbert 1900): that there are continuous functions of three variables not representable as superpositions of continuous functions of two variables. The original statement of Kolmogorov's theorem is the following (Lorentz 1976):
Theorem 1.1. (Kolmogorov 1957). There exist fixed increasing continuous functions $h_{pq}(x)$ on $I = [0,1]$ so that each continuous function $f$ on $I^n$ can be written in the form

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g_q\left(\sum_{p=1}^{n} h_{pq}(x_p)\right),$$

where the $g_q$ are properly chosen continuous functions of one variable.

Neural Computation 1, 465-469 (1989) © 1989 Massachusetts Institute of Technology
Figure 1: The network representation of an improved version of Kolmogorov's theorem, due to Kahane (1975). The figure shows the case of a bivariate function. Kahane's representation formula is $f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g\left[\sum_{p=1}^{n} \lambda_p h_q(x_p)\right]$, where the $h_q$ are strictly monotonic functions and the $\lambda_p$ are strictly positive constants smaller than 1.
This result asserts that every multivariate continuous function can be represented by the superposition of a small number of univariate continuous functions. In terms of networks this means that every continuous function of many variables can be computed by a network with two hidden layers (see Figure 1) whose hidden units compute continuous functions (the functions $g_q$ and $h_{pq}$). Does Kolmogorov's theorem, in its present form, prove that a network with two hidden layers is a good and usable representation? The answer is definitely no. There are at least two reasons for this:

1. In a network implementation that has to be used for learning and generalization, some degree of smoothness is required for the functions corresponding to the units in the network. Smoothness of the $h_{pq}$ and of the $g_q$ is important because the representation must be smooth in order to generalize and be stable against noise. A number of results of Vitushkin (1954, 1977) and Henkin (1964) show, however, that the inner functions $h_{pq}$ of Kolmogorov's theorem are highly nonsmooth (they can be regarded as "hashing" functions). Because of this "wild" behavior of the inner functions $h_{pq}$, the functions $g_q$ need not be smooth, even for differentiable functions $f$ (de Boor 1987).

2. Useful representations for approximation and learning are parametrized representations that correspond to networks with fixed units and modifiable parameters. Kolmogorov's network is not of this type: the form of the $g_q$ (corresponding to units in the second "hidden" layer) depends on the specific function $f$ to be represented (only the $h_{pq}$ are independent of it). Each $g_q$ is at least as complex, for instance in terms of the bits needed to represent it, as $f$ itself.
A stable and usable exact representation of a function in terms of a network with two or more layers therefore seems hopeless. In fact, the result obtained by Kolmogorov can be considered a "pathology" of the continuous functions: it fails to be true if the inner functions $h_{pq}$ are required to be smooth, as shown by Vitushkin (1954). The theorem, though mathematically surprising and beautiful, cannot be used by itself in any constructive way in the context of networks for learning. This conclusion seems to echo what Lorentz (1962) wrote, more than 20 years ago: "Will it [Kolmogorov's theorem] have useful applications? . . . One wonders whether Kolmogorov's theorem can be used to obtain positive results of greater [than trivial] depth." Notice that this leaves open the possibility of finding good and well-founded approximate representations. This argument is discussed at some length in Poggio and Girosi (1989), and a number of results have recently been obtained by several authors (Hornik et al. 1989; Stinchcombe and White 1989; Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hecht-Nielsen 1989). The next section reviews Vitushkin's main results.

2 The Theorems of Vitushkin
The interpretation of Kolmogorov's theorem in terms of networks is very appealing: the representation of a function requires a fixed number of nodes, increasing polynomially with the dimension of the input space. Unfortunately, these results are somewhat pathological and their practical implications very limited. The problem lies in the inner functions of Kolmogorov's formula: although they are continuous, theorems of Vitushkin and Henkin (Vitushkin 1964, 1977; Henkin 1964; Vitushkin and Henkin 1967) prove that they must be highly nonsmooth. One could ask if it is
possible to find a superposition scheme in which the functions involved are smooth. The answer is negative, even for functions of two variables, and was given by Vitushkin with the following theorem (1954):
Theorem 2.1. (Vitushkin 1954). There are $r$ ($r = 1, 2, \ldots$) times continuously differentiable functions of $n \geq 2$ variables not representable by superpositions of $r$ times continuously differentiable functions of fewer than $n$ variables; there are $r$ times continuously differentiable functions of two variables that are not representable by sums and superpositions of continuously differentiable functions of one variable.
We notice that the intuition underlying Hilbert's conjecture and Theorem 2.1 is the same: not all functions with a given degree of complexity can be represented in a simple way by means of functions with a lower degree of complexity. The reason for the failure of Hilbert's conjecture is a "wrong" definition of complexity: Kolmogorov's theorem shows that the number of variables alone is not sufficient to characterize the complexity of a function. Vitushkin showed that such a characterization is possible and gave an explicit formula. Let $f$ be an $r$ times continuously differentiable function defined on $I^n$ with all its partial derivatives of order $r$ belonging to the Lipschitz class $\mathrm{Lip}\,\alpha\,[0,1]^n$. Vitushkin puts $\chi = (r + \alpha)/n$ and shows that $\chi$ can be used to measure the inverse of the complexity of a class of functions. In fact, he succeeded in proving the following:
Theorem 2.2. (Vitushkin 1954). Not all functions of a given characteristic $\chi_0 = q_0/k_0 > 0$ can be represented by superpositions of functions of characteristic $\chi = q/k > \chi_0$, $q \geq 1$.

Theorem 2.1 is easily derived from this result.
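To see informally how Theorem 2.1 follows (this one-line derivation sketch is ours, not the authors'), note that an $r$ times continuously differentiable function of $n$ variables has characteristic $\chi_0 = (r+\alpha)/n$, while functions of $k < n$ variables with the same smoothness have characteristic

$$\chi = \frac{r+\alpha}{k} \geq \frac{r+\alpha}{n-1} > \frac{r+\alpha}{n} = \chi_0,$$

so by Theorem 2.2 not every such function of $n$ variables can be represented by superpositions of them.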
Acknowledgments We acknowledge support from the Defense Advanced Research Projects Agency under contract number N00014-89-J-3139. Tomaso Poggio is supported by the Uncas & Helen Whitaker Chair at MIT.
References

Arnol'd, V. I. 1957. On functions of three variables. Dokl. Akad. Nauk SSSR 114, 679-681.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pp. I-607-I-611, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals, in press.
de Boor, C. 1987. Multivariate approximation. In The State of the Art in Numerical Analysis, A. Iserles and M. J. D. Powell, eds., pp. 87-109. Clarendon Press, Oxford.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Hecht-Nielsen, R. 1989. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pp. I-593-I-605, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Henkin, G. M. 1964. Linear superpositions of continuously differentiable functions. Dokl. Akad. Nauk SSSR 157, 288-290.
Hilbert, D. 1900. Mathematische Probleme. Nachr. Akad. Wiss. Göttingen, 290-329.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Kahane, J. P. 1975. Sur le théorème de superposition de Kolmogorov. J. Approx. Theory 13, 229-234.
Kolmogorov, A. N. 1957. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953-956.
Lorentz, G. G. 1976. On the 13-th problem of Hilbert. In Proceedings of Symposia in Pure Mathematics, pp. 419-429, Providence, RI, 1976. American Mathematical Society.
Lorentz, G. G. 1962. Metric entropy, widths, and superposition of functions. Am. Math. Monthly 69, 469-485.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, pp. I-607-I-611, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Vitushkin, A. G. 1954. On Hilbert's thirteenth problem. Dokl. Akad. Nauk SSSR 95, 701-704.
Vitushkin, A. G. 1964. Some properties of linear superposition of smooth functions. Dokl. Akad. Nauk SSSR 156, 1003-1006.
Vitushkin, A. G. 1977. On Representation of Functions by Means of Superpositions and Related Topics. L'Enseignement Mathématique.
Vitushkin, A. G., and Henkin, G. M. 1967. Linear superposition of functions. Russian Math. Surveys 22, 77-125.
Received 17 July 1989; accepted 30 August 1989.
NOTE

Communicated by Eric Baum
Sigmoids Distinguish More Efficiently Than Heavisides Eduardo D. Sontag SYCON-Rutgers Center for Systems and Control, Department of Mathematics, Rutgers University, New Brunswick, NJ 08903 USA
Every dichotomy on a 2k-point set in $\mathbb{R}^N$ can be implemented by a neural net with a single hidden layer containing $k$ sigmoidal neurons. If the neurons were of a hardlimiter (Heaviside) type, $2k - 1$ would in general be needed.

1 Introduction and Definitions
The main point of this note is to draw attention to the fact mentioned in the title, that sigmoids have different recognition capabilities than hardlimiting nonlinearities. One way to exhibit this difference is through a worst-case analysis in the context of binary classification, and this is done here. Results can also be obtained in terms of VC dimension, and work is in progress in that regard. For technical details and proofs, the reader is referred to Sontag (1989).

Let $N$ be a positive integer. A dichotomy $(S_-, S_+)$ on a set $S \subseteq \mathbb{R}^N$ is a partition $S = S_- \cup S_+$ of $S$ into two disjoint subsets. A function $f: \mathbb{R}^N \to \mathbb{R}$ will be said to implement this dichotomy if it holds that $f(u) > 0$ for $u \in S_+$ and $f(u) < 0$ for $u \in S_-$. Let $\theta: \mathbb{R} \to \mathbb{R}$ be any function. We shall say that $f$ is a single hidden layer neural net with $k$ hidden neurons of type $\theta$ [or just that $f$ is a "$(k, \theta)$-net"] if there are real numbers $w_0, w_1, \ldots, w_k$, $\tau_1, \ldots, \tau_k$, and vectors $v_1, \ldots, v_k \in \mathbb{R}^N$ such that, for all $u \in \mathbb{R}^N$,

$$f(u) = w_0 + \sum_{i=1}^{k} w_i\, \theta(v_i \cdot u + \tau_i)$$
where the dot indicates inner product. For fixed $\theta$, and under mild assumptions on $\theta$, such neural nets can be used to approximate uniformly arbitrary continuous functions on compacts. See, for instance, Cybenko (1989) and Hornik et al. (1989). In particular, they can be used to implement arbitrary dichotomies. Neural Computation 1, 470-472 (1989)
© 1989 Massachusetts Institute of Technology
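To make the definition concrete, here is a minimal sketch of a $(k, \theta)$-net evaluator in Python (ours, not from the note; the helper names are invented for illustration):

    import numpy as np

    def k_theta_net(u, w0, w, V, tau, theta):
        # f(u) = w0 + sum_i w_i * theta(v_i . u + tau_i), with the rows of V
        # holding the vectors v_i.
        return w0 + np.dot(w, theta(V @ u + tau))

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    heaviside = lambda z: (z >= 0).astype(float)

    # A (1, sigmoid)-net implementing the dichotomy {-1} vs. {+1} on the line:
    f = lambda x: k_theta_net(np.array([x]), -0.5, np.array([1.0]),
                              np.array([[1.0]]), np.array([0.0]), sigmoid)
    # f(1.0) > 0 and f(-1.0) < 0, as required.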
In neural net practice, one often takes $\theta$ to be the sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

or equivalently, up to translations and change of coordinates, the hyperbolic tangent $\theta(z) = \tanh(z)$. Another usual choice is the hardlimiter or Heaviside function

$$\mathcal{H}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$$

which can be approximated well by $\tanh(\gamma z)$ when the "gain" $\gamma$ is large. Most analysis has been done for $\mathcal{H}$, but backpropagation techniques typically use the sigmoid (or equivalently tanh). It is easy to see that arbitrary dichotomies on an $l$-element set can be implemented by $(l-1, \mathcal{H})$-nets, but that some dichotomies on sets of $l$ elements cannot be implemented by nets with fewer than $l - 1$ Heaviside hidden neurons. We consider functions $\theta: \mathbb{R} \to \mathbb{R}$ that satisfy the following two properties:

(S1) $t_+ := \lim_{z \to +\infty} \theta(z)$ and $t_- := \lim_{z \to -\infty} \theta(z)$ exist, and $t_+ \neq t_-$.

(S2) There is some point $c$ such that $\theta$ is differentiable at $c$ and $\theta'(c) = \mu \neq 0$.

Note that the function $\mathcal{H}$ does not satisfy (S2), but the sigmoid of course does. The main result will be stated for these.

2 Main Result and Remarks
Theorem 1. Let $\theta$ satisfy (S1) and (S2), and let $S$ be any set of cardinality $l = 2k$. Then any dichotomy on $S$ can be implemented by some $(k, \theta)$-net.

Thus, using sigmoids we can reduce the number of neurons from $2k - 1$ to $k$, a factor of 2 improvement. Of course this result should not really be surprising, since for Heaviside functions there are fewer free degrees of freedom [because $\mathcal{H}(\gamma z) = \mathcal{H}(z)$ for any $\gamma > 0$], and in fact its proof is very simple. The idea is to first classify using a net with $k - 1$ Heaviside hidden neurons plus a direct connection from the inputs to the output, and then to replace these direct connections by just one nonlinear hidden neuron. The differentiability assumption allows this replacement, since it means that at low gains any linear map can be approximated. To conclude this note, we wish to remark that there are "universal" functions $\theta$ satisfying (S1)-(S2) and as differentiable as wanted, even real-analytic, such that, for each $N$ and each dichotomy on any finite set $S \subset \mathbb{R}^N$, this dichotomy can be implemented by a $(1, \theta)$-net. Of course, the function $\theta$ is so complicated as to be purely of theoretical interest, but it serves to indicate that, unless further restrictions are made on (S1)-(S2), much better bounds can be obtained.
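The replacement step in this proof idea can be illustrated numerically. The sketch below (ours, not Sontag's) shows that a single sigmoidal unit driven at low gain reproduces a direct linear connection up to an offset and a scale factor:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # theta(c + eps*z) ~ theta(c) + theta'(c)*eps*z for small eps; here
    # c = 0 and theta'(0) = 1/4 for the sigmoid.
    eps = 1e-3
    z = np.linspace(-1.0, 1.0, 5)
    recovered = (sigmoid(eps * z) - sigmoid(0.0)) / (0.25 * eps)
    print(np.max(np.abs(recovered - z)))   # ~1e-7: essentially linear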
Acknowledgments Supported in part by U.S. Air Force Grant AFOSR-88-0235.
References

Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals, Syst. 2, 303-314.
Hornik, K. M., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359-366.
Sontag, E. D. 1989. Sigmoids distinguish more efficiently than Heavisides. Report 89-12, SYCON-Rutgers Center for Systems and Control, August 1989. (Electronic versions available from [email protected].)
Received 28 July 1989; accepted 15 September 1989.
Communicated by John Allman
How Cortical Interconnectedness Varies with Network Size Charles F. Stevens* Section of Molecular Neurobiology, Yale University School of Medicine, New Haven, CT 06510 USA
When model neural networks are used to gain insight into how the brain might carry out its computations, comparisons between features of the network and those of the brain form an important basis for drawing conclusions about the network's relevance to brain function. The most significant features to be compared, of course, relate to behavior of the units. Another network property that would be useful to consider, however, is the extent to which units are interconnected and the law by which unit-unit connections scale as the network is made larger. The goal of this paper is to consider these questions for neocortex. The conclusion will be that neocortical neurons are rather sparsely interconnected - each neuron receives direct synaptic input from fewer than 3% of its neighbors underlying the surrounding square millimeter of cortex - and the extent of connectedness hardly changes for brains that range in size over about four orders of magnitude. These conclusions support the currently popular notion that the brain's circuits are highly modular and suggest that increased cortex size is mainly achieved by adding more modules.

1 Introduction

Different mammalian species have brains of very different sizes - man's brain is more than a thousand times larger than that of a mouse - but homologous areas are thought to have the same circuits and to operate in the same way. Thus, evolution has scaled what we believe to be the same basic network up many fold and this presents the opportunity to examine how connectedness varies with the size of the network. Since neuronal intercommunications occur at synaptic contacts, estimates of connectedness can be made by simply counting the number of synapses on a neuron. Direct counting is impracticable, however, because a cubic millimeter of cortex contains on the order of a billion synapses. This paper presents a simple theory that provides a way of characterizing connectedness from measurements on cortical thickness and surface area. *Address correspondence to The Salk Institute, P.O. Box 85800, San Diego, CA 92138-9216.
Neural Computation 1, 473-479 (1989) © 1989 Massachusetts Institute of Technology
Such data are available for many species so that the way connectedness is scaled can be assessed over a very wide range, about four orders of magnitude, of cortex sizes. 2 A Scaling Law
Brain size will be specified by $N$, the total number of neurons in neocortex, or in some defined subsystem in neocortex. The goal, then, is to find an expression for the average number of synapses a cortical neuron receives (and gives) as a function of brain size. I begin with a consideration of how the average number of synapses per cortical neuron, $q(N)$, scales with brain size $N$. Consider some reference brain, a mouse or cat brain, for example, with size $n$ and $q(n)$ synapses per neuron. If evolution scales this brain by a factor $s = N/n$ to a new brain of size $N$, $q$ should change as

$$q(N) = q(n)\, f(s)$$
where $f(s)$ is some function that gives the increase in $q$ for the larger brain over the reference one. Because we believe all mammalian brains, no matter how large or small, conform to the same general design and operate in the same general way, $f(s)$ should vary continuously with $s$, and the particular brain we select as a reference should not alter the scaling function $f(s)$. A standard result for homogeneous functions (Aczel 1969) implies, then, that $f(s)$ is a power function

$$f(s) = s^b$$

for some constant $b$. Rearrangement of the preceding equations gives the scaling law

$$q(N) = q(n)\, (N/n)^b$$
If every neuron were connected to a constant fraction of the neurons in every sized cortex, then $b$ would be 1, whereas if each neuron were connected to the same number of others independent of brain size, then $b$ would equal zero; if the degree of interconnectedness decreases as brains become larger (perhaps larger, more powerful brains operate more efficiently by sharing information over a smaller number of units), $b$ would be negative.

3 Relating Measurable Quantities
The next goal in our development is to recast the preceding equation into a form that relates quantities for which data are available so that the accuracy of the scaling law can be evaluated, and so that q(n) and b can be determined.
Two experimental observations provide the key for relating the quantities in the scaling law to measured features of cortical structure. The first is that the density of synapses in cortex, $r$, is constant (within measurement error) across cortical layers, regions, and species (Aghajanian and Bloom 1967; Armstrong-James and Johnson 1970; Cragg 1967, 1975a,b; Schüz and Palm 1989; Jones and Cullen 1979; O'Kusky and Colonnier 1982; Rakic et al. 1986; Vrensen et al. 1977) and has an average value of $r = 0.6 \times 10^9$ synapses/mm³.
The second experimental observation is that the number of neurons that underlies a square millimeter of cortical surface, $p$, is also about constant (Powell 1981; Rockel et al. 1980) across cortical regions and species (with the exception of primate area 17, where it is also constant across species but differs in magnitude from other cortical areas): $p = 1.48 \times 10^5$ neurons/mm² (for primate area 17, $p = 3.57 \times 10^5$ neurons/mm²).
This quantity $p$ is constant in spite of variations in cortical thickness. The average number of synapses per cortical neuron is, by definition, the total number of cortical synapses $Q$ divided by the number of neurons $N$. From the experimental observations presented above,

$$Q = rV$$

where $V$ is the total cortical volume and is given by the product of the average cortical thickness $T$ and the cortical surface area $A$:

$$V = AT$$

$N$ is, according to the preceding, given by

$$N = pA$$

This means that the average number of synapses per cortical neuron $q(N)$ is

$$q(N) = Q/N = rAT/pA = rT/p$$
From the scaling law, however,

$$q(N) = q(n)\,(N/n)^b = (rt/p)\,(A/a)^b$$

where $a$ is the surface area and $t$ the cortical thickness of the reference brain. If $q(N)$ is eliminated between these last two equations, a relationship between cortical surface area and thickness results:

$$T/t = (A/a)^b$$
Cortical thickness and surface area data available in the literature thus provide a way of testing the scaling law and of evaluating the constant $b$. Because the scaling law is a power relation, the most convenient form for comparisons with experimental data is obtained by taking logarithms of the preceding equation:

$$\log(T) = b \log(A) + [\log(t) - b \log(a)]$$

A double logarithmic plot of cortical thickness $T$ versus surface area $A$ should be linear, then, with a slope that gives $b$, the parameter that characterizes neuronal interconnectedness as a function of brain size.

4 Conclusions
Data are available in the literature for evaluating $b$ in two contexts: a particular cortical subsystem, primate (and one tree shrew) area 17, and the entire cortex from a variety of mammalian species. The advantage of using data from primate primary visual cortex is that this cortical region is comparable from one species to another, but the disadvantage is that a relatively narrow range of cortical sizes is available. Using data for the entire mammalian cortex means that different functional regions might be compared from species to species, but a four order of magnitude range of brain sizes is available. Insofar as cortex is uniform, as many believe (Creutzfeldt 1977; Eccles 1984; Lorente de No 1949; Powell 1981), the comparison of mammalian cortex across species is appropriate. If the cross-species comparison is invalid, then my conclusions are limited to the primate visual area. Figure 1 presents log(T) versus log(A) for primate area 17. The data for cortical sizes in this figure vary over a 50-fold range and conform to the expectations of the scaling law with $b = 0.07$ (least-squares fit). Prothero and Sundsten (1984) have gathered data for thickness and surface area of total mammalian neocortex from 15 species (7 animal orders) and find that the regression line for a double logarithmic plot like that in Figure 1 has a slope of 0.09. Their data (in their Figures 1 and 3) range in cortical size over about four orders of magnitude. The scaling law for interconnectedness does indeed seem to provide an adequate fit to the experimental data, and the value of $b$ is small, but not zero. This implies that each neuron is connected to an almost-constant number of other neurons irrespective of brain size. The quantity $q(n)$ for a 1-mm-thick reference cortex is $4.12 \times 10^3$ synapses per neuron ($1.71 \times 10^3$ for primate area 17). This means that a particular neuron could receive synaptic connections from less than 3% of the neurons underlying the surrounding square millimeter of cortex, so that brain cells are rather sparsely interconnected. Other, but less complete, data for hippocampus (Stevens, unpublished) suggest that interconnectedness grows slowly (less than linearly) with hippocampal size. The sparseness in interconnections places limits on models for neuronal circuits, and suggests, for example, that any large content-addressable memories present in cortex seem not to be based on rich connectedness of large populations of neurons.
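The regression and the synapses-per-neuron figure quoted above are easy to reproduce. The sketch below (ours; the area values are made-up placeholders, not the Frahm et al. data) fits $b$ by least squares in log-log coordinates and evaluates $q(n) = rT/p$ for a 1-mm-thick reference cortex:

    import numpy as np

    # Hypothetical (area, thickness) pairs lying on the reported regression
    # line log(T) = 0.07*log(A) - 0.047; the paper's actual values come from
    # Frahm et al. (1984) and Rockel et al. (1980).
    A = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])  # mm^2
    T = 10 ** (-0.047) * A ** 0.07                         # mm
    b, intercept = np.polyfit(np.log10(A), np.log10(T), 1)
    print(b)  # recovers 0.07 for points lying on the line

    # Synapses per neuron, q(n) = r*T/p, for a 1-mm-thick reference cortex:
    r = 0.6e9    # synapses/mm^3
    p = 1.48e5   # neurons/mm^2
    print(r * 1.0 / p)  # ~4.1e3, close to the 4.12e3 quoted in the text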
Figure 1: A double logarithmic plot of cortical thickness (in mm) as a function of cortical surface area (in mm²) for area 17. Data derived from cortical volumes given by Frahm et al. (1984) and cortical thickness given by Rockel et al. (1980). Animals represented, in order of increasing cortical size, are tree shrew (Scandentia), galago (Prosimian), marmoset, squirrel monkey, macaque, baboon, chimpanzee, and man (Simians). The regression line is log(T) = 0.07 log(A) - 0.047, where T is cortical thickness (mm) and A the surface area (mm²).
Further, network models that provide realistic representations of cortical circuits should also embody a scaling law like that used by the brain.
Acknowledgments Supported by NIH No. 12961-14 and the Howard Hughes Medical Institute.
References
Aczel, J. 1969. Applications and Theory of Functional Equations. Academic Press, New York.
Aghajanian, G. K., and Bloom, F. E. 1967. The formation of synaptic junctions in developing rat brain: A quantitative electron microscopic study. Brain Res. 6, 716-727.
Armstrong-James, M., and Johnson, R. 1970. Quantitative studies of postnatal changes in synapses in rat superficial motor cerebral cortex. Z. Zellforsch. 110, 559-568.
Cragg, B. G. 1967. The density of synapses and neurones in the motor and visual areas of the cerebral cortex. J. Anat. 101, 639-654.
Cragg, B. G. 1975a. The density of synapses and neurons in normal, mentally defective and ageing human brains. Brain 98, 81-90.
Cragg, B. G. 1975b. The development of synapses in the visual system of the cat. J. Comp. Neurol. 160, 147-168.
Creutzfeldt, O. D. 1977. Generality of the functional structure of the neocortex. Naturwissenschaften 64, 507-517.
Eccles, J. C. 1984. The cerebral neocortex: A theory of its operation. In Cerebral Cortex, Vol. 2, Functional Properties of Cortical Cells, E. G. Jones and A. Peters, eds., pp. 1-36. Plenum Press, New York.
Frahm, H. D., Heinz, S., and Baron, G. 1984. Comparison of brain structure volumes in insectivora and primates. V. Area striata (AS). J. Hirnforsch. 25, 537-557.
Jones, D. G., and Cullen, A. M. 1979. A quantitative investigation of some presynaptic terminal parameters during synaptogenesis. Exp. Neurobiol. 64, 245-259.
Lorente de No 1949. Cerebral cortex: Architecture, intracortical connections, motor projections. In Physiology of the Nervous System, J. Farquhar Fulton, ed., pp. 288-315. Oxford University Press, London.
O'Kusky, J., and Colonnier, M. 1982. A laminar analysis of the number of neurons, glia, and synapses in the visual cortex (area 17) of adult macaque monkeys. J. Comp. Neurol. 210, 278-290.
Powell, T. P. S. 1981. Certain aspects of the intrinsic organisation of the cerebral cortex. In Brain Mechanisms and Perceptual Awareness, O. Pompeiano and C. Ajmone Marsan, eds., pp. 1-9. Raven, New York.
Prothero, J. W., and Sundsten, J. W. 1984. Folding of the cerebral cortex in mammals: A scaling model. Brain Behav. Evol. 24, 152-167.
Rakic, P., Bourgeois, J.-P., Eckenhoff, M. F., Zecevic, N., and Goldman-Rakic, P. S. 1986. Concurrent overproduction of synapses in diverse regions of the primate cerebral cortex. Science 232, 232-235.
Rockel, A. J., Hiorns, R. W., and Powell, T. P. S. 1980. The basic uniformity in structure of the neocortex. Brain 103, 221-244.
Schüz, A., and Palm, G. 1989. Density of neurons and synapses in the cerebral cortex of the mouse. J. Comp. Neurol. 286, 442-455.
Vrensen, G., De Groot, D., and Nunes-Cardozo, J. 1977. Postnatal development of neurons and synapses in the visual and motor cortex of rabbits: A quantitative light and electron microscopic study. Brain Res. Bull. 2, 405-416.
Received 29 June 1989; accepted 11 October 1989.
Communicated by Gordon M. Shepherd
A Canonical Microcircuit for Neocortex Rodney J. Douglas Kevan A.C. Martin David Whitteridge MRC Anatomical Neuropharmacology Unit, Department of Pharmacology, South Parks Road, Oxford OX1 3QT, England
We have used microanatomy derived from single neurons, and in vivo intracellular recordings to develop a simplified circuit of the visual cortex. The circuit explains the intracellular responses to pulse stimulation in terms of the interactions between three basic populations of neurons, and reveals the following features of cortical processing that are important to computational theories of neocortex. First, inhibition and excitation are not separable events. Activation of the cortex inevitably sets in motion a sequence of excitation and inhibition in every neuron. Second, the thalamic input does not provide the major excitation arriving at any neuron. Instead the intracortical excitatory connections provide most of the excitation. Third, the time evolution of excitation and inhibition is far longer than the synaptic delays of the circuits involved. This means that cortical processing cannot rely on precise timing between individual synaptic inputs. 1 Introduction
The uniformity of the mammalian neocortex (Hubel and Wiesel 1974; Rockel et al. 1980) has given rise to the proposition that there is a fundamental neuronal circuit (Creutzfeldt 1977; Szentágothai 1978) repeated many times in each cortical area. Here we provide evidence for such a canonical circuit in cat striate cortex, and model its form and functional attributes. The microcircuitry of the striate cortex of the cat is by far the best understood of all cortical areas. The anatomical organization that has emerged from studies (Gilbert and Wiesel 1979; Martin 1988) of neuronal morphology and immunochemistry is one of stereotyped connections between different cell types: pyramidal cells connect principally to other pyramidal cells, and the smooth cells connect principally to pyramidal cells. Pyramidal cells are excitatory; smooth cells are GABAergic and thought to be inhibitory. Some neurons of both types are driven directly by thalamic input and others indirectly. We used these findings and those described below to develop the simplest neuronal circuit that showed analogous functional behavior to that which we observed in our intracellular recordings (Fig. 1). Neural Computation 1, 480-488 (1989) © 1989 Massachusetts Institute of Technology
Figure 1: Model of cerebral cortex that successfully predicts the intracellular responses of cortical neurons to stimulation of thalamic afferents. Three populations of neurons interact with one another: one population is inhibitory (GABA cells, solid synapses), and two are excitatory (open synapses), representing superficial (P2+3) and deep (P5+6) layer pyramidal neurons. The layer 4 spiny stellate cells are incorporated with the superficial group of pyramids. Each population receives excitatory input from the thalamus, which is weaker (dashed line) to deep pyramids. The inhibitory inputs activate both GABA_A and GABA_B receptors on pyramidal cells. The thick line connecting GABA to P5+6 indicates that the inhibitory input to the deep pyramidal population is relatively greater than that to the superficial population. However, the increased inhibition is due to enhanced GABA_A drive only. The GABA_B input to P5+6 is similar to that applied to P2+3.
2 Cortical Model

The model circuit consisted of populations of neurons that interacted with one another. The behavior of each population was modeled by a single "cell" that represented the average response of the neurons
belonging to that population. The action potential discharge was treated as a rate-encoded output rather than discrete spike events. The populations excited or inhibited one another by inducing changes in the average membrane potential of their target populations, after a transmission delay. The relaxation of the membrane potential was governed by a membrane time constant. The magnitude of excitation or inhibition was determined by the product of the input population's discharge rate, a synaptic coupling coefficient, and a synaptic driving potential. The discharge rate was a thresholded hyperbolic function of the average membrane potential. The synaptic coupling coefficient incorporated the fraction of all synaptic input that was derived from a particular source population, the average efficacy of a synapse from that source, and the sign of its effect (either positive or negative). The synaptic driving potential was the difference between the average membrane potential and the appropriate synaptic reversal potential. The number and characteristics of the populations, and the functional weighting of their interconnections, were optimized by comparing the performance of the model with that of the cortex itself, as described below. The model was programmed in TurboPascal and run on an 8-MHz 80286/287 AT-type computer, which computed a typical model response of 400 msec in 30 sec.
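A minimal sketch of this kind of population rate model is given below (our illustration in Python; all numerical values are invented placeholders, not the authors' fitted parameters, and the transmission delays and the separate GABA_A/GABA_B kinetics of the full model are omitted for brevity):

    import numpy as np

    dt, steps = 0.001, 400                       # 1-msec Euler steps, 400-msec run
    tau = 0.020                                  # membrane time constant (s), assumed
    V_rest, E_exc, E_inh = -60.0, 0.0, -75.0     # mV, assumed reversal potentials

    # Populations: 0 = P2+3, 1 = P5+6, 2 = GABA. C[i, j] couples source j
    # (column 3 is the thalamus) onto target i; the wiring follows Figure 1,
    # with stronger inhibition and weaker thalamic drive onto P5+6.
    C = np.array([[0.5, 0.0, 0.5, 0.3],
                  [0.0, 0.5, 2.0, 0.1],
                  [0.5, 0.5, 0.0, 0.3]])

    def rate(V, V_th=-50.0, k=0.5):
        # Thresholded, saturating ("hyperbolic") discharge function of mean V.
        x = np.maximum(V - V_th, 0.0)
        return x / (1.0 + k * x)

    V = np.full(3, V_rest)
    for step in range(steps):
        thal = 50.0 if step < 2 else 0.0         # brief thalamic pulse
        f = np.append(rate(V), thal)             # presynaptic rates + thalamus
        exc = (C[:, [0, 1, 3]] @ f[[0, 1, 3]]) * (E_exc - V)
        inh = C[:, 2] * f[2] * (E_inh - V)
        V += dt * ((V_rest - V) / tau + exc + inh)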
3 Intracellular Recordings
Neurons were recorded from the postlateral gyrus of the striate visual cortex (area 17) of anesthetized, paralyzed cats (Martin and Whitteridge 1984; Douglas et al. 1988), while continuously monitoring vital signs. Glass micropipettes were filled with 2 M K citrate, or a 4% buffered solution of horseradish peroxidase (HRP) in 0.2 M KCl. GABA agonists and antagonists were applied ionophoretically via a multibarrel pipette using a Neurophore (Medical Systems Inc.). The intracellular electrode was mounted in a "piggy-back" configuration on a multibarrel ionophoretic pipette. The tip of the recording electrode was separated from the tips of the ionophoretic barrels by 10-20 μm. Receptive fields of the cortical neurons were first plotted in detail by hand, and then intracellular recordings were made while stimulating the optic radiation (OR) above the lateral geniculate nucleus via bipolar electrodes (0.2-0.4 msec, 200-400 μA). A control period of 100 msec, followed by 300 msec of the intracellular response to OR stimulation, was averaged over up to 32 trials. We used electrical pulse stimulation as a test signal both because it simplifies the analysis of systems, and because it permits the canonical microcircuit hypothesis to be tested in the many cortical areas whose natural stimulus requirements are not yet known. Where possible, HRP was injected intracellularly following data collection to enable morphological identification. Twenty HRP-labeled neurons were recovered (Fig. 2).
4 Results and Discussion
In all 53 cells examined, the stimulus pulse induced a sequence of excitation followed by a lengthy (100-200 msec) hyperpolarizing inhibitory postsynaptic potential (IPSP) that completely inhibited any spontaneous action potential discharge. This general pattern has been reported in visual and other cortical areas, and it is currently supposed that the early excitation is due to activation of thalamic afferents, and that the inhibition arises from feed-forward and feedback excitation of cortical smooth cells. There is strong evidence that inhibition in the cortex involves GABA_A receptors, which can be selectively blocked by bicuculline (Sillito 1975). However, we found that there was also a second GABAergic inhibitory mechanism present in vivo, which was insensitive to bicuculline and could be activated by the specific GABA_B agonist, baclofen. Baclofen-mediated inhibition has also been observed in in vitro cortical preparations (Connors et al. 1988). Both GABAergic mechanisms were incorporated into the model so that the simulated GABA_A mechanism produced an early inhibition of short duration, while the GABA_B response evolved more slowly and had a longer duration. For simplicity, both inhibitory processes behaved linearly. This approximation is reasonable since nonlinear inhibition is not prominent in cat visual cortex (Douglas et al. 1988). Using these principles, we were able to model the neuron's response to electrical stimulation by a circuit that consisted simply of two interacting populations: one population of excitatory pyramidal cells and another of inhibitory smooth cells, with thalamic input applied to both populations. However, the temporal forms of the responses obtained from in vivo cortical cells were not all similar. For pyramidal cells, which formed the bulk of our HRP-labeled sample, we could discriminate two different temporal patterns of poststimulus response on the basis of the latency to maximum hyperpolarization (Fig. 2). These patterns were not correlated with functional properties such as receptive field type or ordinal position, but were strongly correlated with cortical layer (Fig. 2). Hyperpolarization evolved more slowly in morphologically identified pyramidal neurons of layers 2 and 3 than in those located in layers 5 and 6. These data suggested that the pyramidal cells might be involved in two different circuits, one for the superficial layers (2 and 3), and another for the deep layers (5 and 6). Consequently, the model was expanded to incorporate these two populations. Unfortunately, we did not label any spiny stellate cells, which are found only in layer 4. However, the output of spiny stellate cells is directed to the superficial layers as well as layer 4 (Martin and Whitteridge 1984), and so we assumed that they should be incorporated with the population of superficial pyramids. We also assumed that the superficial and deep populations should have similar rules of interconnection. By exploring the properties of the expanded model, we were able to show that the two layer-correlated response patterns could be elicited from the same basic circuit by modifying the relative intensities of GABA_A inhibition applied to the two pyramidal populations. The observed differences in the pattern of response of superficial and deep neurons could be most simply achieved by making the GABA_A inhibition of the deep pyramidal cells four times stronger than that of the superficial cells.
Figure 2: Correlation of cortical layer with the pattern of the intracellular responses to stimulation of the optic radiation. (a) Hyperpolarization evolved more slowly in morphologically identified pyramidal neurons of layers 2 and 3 than in those located in layers 5 and 6. (b,d) (modeled in c, e) Latencies to maximum hyperpolarization (filled arrows) were measured with respect to the stimulus at t = 0. Mean latencies for the two populations were 108.9 ± 7.3 SEM msec (N = 9) and 27.5 ± 1.7 msec (N = 11). Superficial pyramids [e.g., b, position circled in a] always exhibited marked excitation (open arrow), which was less prevalent in deep layers [e.g., d, circled in a]. In this figure, stimulus artifacts have been removed for clarity. Depths of identified pyramidal cell somata and layer boundaries were measured with respect to the cortical surface and then normalized against the layer 5/6 boundary. The model (Fig. 1) predicted qualitatively similar responses to those observed in vivo (compare b to c and d to e) if and only if GABA_A inhibition of deep layers was greater than that of superficial layers.
The strength of GABA_B inhibition was the same in both layers. This configuration (Fig. 1) of the model provided the best fit to the biological data. Alternative combinations of populations and coupling coefficients were markedly less successful in simulating the biological data. Having established the basic configuration, we then modeled the effect of altering the weightings of the inhibitory (GABAergic) connections. The predictions were tested experimentally by recording the intracellular pulse response during ionophoretic application of various GABA agonists and antagonists. These drugs were applied directly to the recorded cell via a multibarrel ionophoretic micropipette that was mounted on the shank of the intracellular pipette. The close agreement between the predicted and the experimental results is shown in Figure 3. The performance of the model depends on the coupling between the pyramidal and smooth cells, and this suggests that excitation and inhibition are not separable events. Activation of the cortex inevitably sets in motion a sequence of excitation and inhibition in every neuron. Moreover, the time evolution of excitation and inhibition is far longer than the synaptic delays of the circuits involved. In particular, the large component of the inhibition derives from the GABA_B-like process that extends over some 200 msec. This means that cortical processing cannot rely on precise timing between individual synaptic inputs. The model also predicted that the excitation due to intracortical connections would greatly exceed that of the thalamic afferents, which provide the initial excitation. This amplification of excitation is a consequence of the intracortical divergence (Gilbert and Wiesel 1979; Martin 1988) of pyramidal cell projections. Thus, thalamic input does not provide the major excitation arriving at any neuron. Instead the intracortical excitatory connections provide most of the excitation. This excitation would grow explosively, but it is gated by inhibition of the pyramids. It is this intracortical excitatory component that is more strongly inhibited in the deep layers, so the onset of maximum hyperpolarization occurs more rapidly in these cells (compare Fig. 2b and d). The degree to which the intracortical component is normally inhibited can be demonstrated by comparing the form of the excitatory depolarization before and after blockade of GABA_A-mediated inhibition (Fig. 4). Bicuculline enhanced predominantly the intracortical excitatory component. This suggests that the GABA_A mechanism is activated by the arrival of the thalamic volley, and that the role of tonic cortical inhibition is small.

5 Conclusion
Taken together, these data show that this simple model can provide a remarkably rich description of the average temporal behavior of populations of cortical neurons when they are activated by pulse stimuli.
Figure 3: Comparison of predictions of the model with experimental results following modification of synaptic weights. To simulate the localized effect of ionophoresis, manipulations of the model affected only a small subset of P2+3 or P5+6 cells (Fig. 1). (a) Observed and predicted control response of a deep cell. (b) Sustained GABA ejection (Sigma, 0.5 M, 70 nA) hyperpolarized the membrane of this cell, so reducing the stimulus-induced hyperpolarization. (c) Additional application of the GABA_A antagonist bicuculline (Sigma, 100 mM, 200 nA) did not reverse the GABA-induced hyperpolarization, but enhanced excitation (arrow). (d) When GABA was removed, bicuculline further enhanced excitation (arrowed), but did not block the late hyperpolarization. (e) Control response of a superficial cell. (f) The GABA_B agonist baclofen (Ciba Geigy, 10 mM, 60 nA) hyperpolarized the membrane of this cell, accentuating the early excitation (arrowed).
Figure 4: Blocking GABA_A inhibition unmasks intracortical excitation. (a) Observed and predicted excitatory response of a superficial cell during the first 30 msec following the stimulus. The excitatory response was followed by an IPSP similar to that seen in Figure 3e; only the initial phase of the IPSP is seen in this short time window. Thalamic and intracortical components of excitation are indicated in the model response; the stimulus artifact is marked with an open arrow. (b) Sustained application of bicuculline (0.1 M, 100 nA) enhanced the intracortical component, but left the thalamic component largely unaffected. (c) Histogram of experimental results showing that the latency to peak of the bicuculline-affected depolarization (filled bars) corresponds with that of the intracortical component (hatched bars). Both were significantly later than the earliest (thalamic) depolarization (open bins).

Because this stimulus is not area specific, the same experimental methods and tests could, in principle at least, be applied to any cortical area, even those whose function is unknown. Similar responses obtained in another cortical area would suggest a basic circuitry similar to that of visual cortex. Furthermore, models of cortical processing that are based
on analogues of neurons (Sejnowski et al. 1988) should exhibit responses to pulse activation of their inputs that are qualitatively similar to those reported here.
Acknowledgments We thank John Anderson for technical assistance, and the E.P. Abrahams Trust for support. R.J.D. acknowledges the support of the Guarantors of Brain, and the SA MRC.
References

Connors, B. W., Malenka, R. C., and Silva, L. R. 1988. Two inhibitory postsynaptic potentials, and GABA_A and GABA_B receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406, 443-468.
Creutzfeldt, O. D. 1977. Generality of the functional structure of the neocortex. Naturwissenschaften 64, 507-517.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1988. Selective responses of visual cortical cells do not depend on shunting inhibition. Nature (London) 332, 642-644.
Gilbert, C. D., and Wiesel, T. N. 1979. Morphology and intracortical projections of functionally characterised neurons in the cat visual cortex. Nature (London) 280, 120-125.
Hubel, D. H., and Wiesel, T. N. 1974. Uniformity of monkey striate cortex: A parallel between field size, scatter, and magnification factor. J. Comp. Neurol. 158, 295-305.
Martin, K. A. C. 1988. From single cells to simple circuits in the cerebral cortex. Q. J. Exp. Physiol. 73, 637-702.
Martin, K. A. C., and Whitteridge, D. 1984. Form, function, and intracortical projections of spiny neurones in the striate visual cortex. J. Physiol. 353, 463-504.
Rockel, A. J., Hiorns, R. W., and Powell, T. P. S. 1980. The basic uniformity in structure of the neocortex. Brain 103, 221-244.
Sejnowski, T. J., Koch, C., and Churchland, P. S. 1988. Computational neuroscience. Science 241, 1299-1306.
Sillito, A. M. 1975. The contribution of inhibitory mechanisms to the receptive field properties of neurones in the striate cortex of the cat. J. Physiol. 250, 305-329.
Szentágothai, J. 1978. The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. London Ser. B 201, 219-248.
Received 13 July 1989; accepted 2 October 1989.
Communicated by John Wyatt
Synthetic Neural Circuits Using Current-Domain Signal Representations Andreas G. Andreou Kwabena A. Boahen Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA
We present a new approach to the engineering of collective analog computing systems that emphasizes the role of currents as an appropriate signal representation and the need for low-power dissipation and simplicity in the basic functional circuits. The design methodology and implementation style that we describe are inspired by the functional and organizational principles of neuronal circuits in living systems. We have implemented synthetic neurons and synapses in analog CMOS VLSI that are suitable for building associative memories and self-organizing feature maps. 1 Introduction
Connectionist architectures, neural networks, and cellular automata (Rumelhart and McClelland 1986; Kohonen 1987; Grossberg 1988; Toffoli 1988) have large numbers of simple and highly connected processing elements and employ massively parallel computing paradigms, features inspired by those found in the nervous system. In a hardware implementation, the physical laws that govern the cooperative behavior of these elements are exploited to process information. This is true both at the system level, where global properties such as energy are used, and at the circuit level, where the device physics are exploited. For example, Hopfield's network (1982) uses the stable states of a dynamic system to represent information; associative recall occurs as the system converges to its local energy minima. On the other hand, Mead's retina (1989) uses the native properties of silicon transistors to perform local automatic gain control. In this paper we discuss the importance of signal representations in the implementation of such systems, emphasizing the role of currents. The paper is organized into six sections: Section 2 describes the roles played by current as well as voltage signals. The metal-oxide-semiconductor (MOS) transistor, the basic element of complementary-MOS (CMOS) very large scale integration (VLSI) technology, is introduced in Section 3. In the subthreshold region, the MOS transistor's behavior strongly resembles that of the ionic channels in excitable cell membranes. Neural Computation 1, 489-501 (1989) © 1989 Massachusetts Institute of Technology
Translinear circuits, a computationally rich class of circuits with current inputs and outputs, are reviewed in Section 4. These circuits are based on the exponential transfer characteristics of the transistors, a property that also holds true for certain ionic channels. Simple and useful circuits for neurons and synapses are described in Section 5. Proper choice of signal representations leads to very efficient realizations; a single line provides two-way communication between neurons. Finally, a brief discussion of the philosophy behind the adopted design methodology and implementation style is presented in Section 6.
2 Signals
In an electronic circuit, signals are represented by either voltages or currents.¹ A digital CMOS circuit depends on two well defined voltage levels for reliable computation. Currents play only an incidental role of establishing the desired voltage levels (through charging or discharging capacitive nodes). Since the abstract Turing model of computation does not specify the actual circuit implementation, two distinct current levels will work as well. In contrast, the circuits described here use analog signals and rely heavily on currents, with both currents and voltages taking continuous values. At the circuit level, Kirchhoff's current law (KCL) and Kirchhoff's voltage law (KVL) are exploited to implement computational primitives. KCL states that the sum of the currents entering a node equals the sum of the currents leaving it (conservation of charge), so current signals may be summed simply by bringing them to the same node. KVL states that the sum of voltages around a closed loop is zero (conservation of energy); therefore, voltage signals may be summed as well. Actually, the translinear circuits described in Section 4 rely on KVL while avoiding the use of differential voltage signals (not referenced to ground). Voltages are used for communicating results to different parts of the system or for storing information locally. Accumulation of charge on a capacitor (driven by a current source) results in a voltage that represents local memory in the system; this also implements the useful function of temporal integration. Distributed memory can be realized using spatiotemporal patterns of charge, following the biological model (Freeman et al. 1988; Eisenberg et al. 1989). In this type of memory, stored information is represented by limit cycles in the phase space of a dynamic system. However, in current VLSI implementations, memory is represented as point attractors (i.e., stable equilibria) in the spatial distributions of charge, as, for example, in our bidirectional associative memory chips (Boahen et al. 1989a,b). ¹This may be an area in which biological systems have a distinct advantage by employing both chemical and electrical signals in the computation.
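The temporal integration mentioned above is just the capacitor law; as a reminder (standard circuit theory, not a formula from the paper), a capacitance $C$ driven by a current $I(t)$ develops the voltage

$$V(t) = V(0) + \frac{1}{C} \int_0^t I(\tau)\, d\tau,$$

so a current-mode signal is accumulated into a voltage-mode memory simply by routing it onto a capacitive node.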
Figure 1: The MOS transistor. (a) Structure.

3 Devices
The MOS transistor, shown in Figure 1a, has four terminals: the gate (G), the source (S), the drain (D), and the substrate (B, for bulk). The gate and source potentials control the charge density in the channel between the source and the drain, and hence the current passed by the device. The MOS transistor is analogous to an ensemble of ionic channels in the lipid membrane of a cell controlled by the transmembrane potential. We operate the MOS transistor in the so-called "off" region, characterized by gate-source voltages that are below the threshold voltage. In this region charge transport is by diffusion from areas of high carrier concentration to energetically preferred areas of lower carrier concentration. This is referred to as weak-inversion (Vittoz and Fellrath 1977) or subthreshold conduction (Mead 1989; Maher et al. 1989). The transfer characteristics are shown in Figure 1b. These curves are very similar to those for the calcium-controlled sodium channel, Hille (1984, p. 317). In both cases the exponential relationships arise from the Boltzmann distribution. The subthreshold current is given by²

$$I_{ds} = I_0\, e^{\kappa V_{gs}/V_T}\, e^{(1-\kappa) V_{bs}/V_T} \left(1 - e^{-V_{ds}/V_T}\right) \left(1 + \frac{V_{ds}}{V_0}\right) \qquad (3.1)$$
²For the sake of brevity, we discuss only the n-type device, whose operation depends on the transport of negative charges. The operation of a p-type device is analogous.
Figure 1: Cont'd. (b) Transfer characteristics, (c) output characteristics. To first order, the current is exponentially dependent on both the substrate and the gate voltages. In (b) the dots show measured data from an n-type transistor of size 4 × 4 μm, with $V_{ds} = 1.0$ V. The solid lines are obtained using equation 3.1 with $I_0 = 0.72 \times 10^{-18}$ A and $\kappa = 0.75$. The data in (c) are for a similar device with $V_{bs} = 0$; it is fitted with $V_0 = 15.0$ V.
where $I_0$ is the zero-bias current and $\kappa$ measures the effectiveness of the gate potential in controlling the channel current. To first order, the effectiveness of the substrate potential is given by $(1 - \kappa)$; $V_T = kT/q$, the thermal voltage, equals 26 mV at room temperature; and $V_0$ is the Early voltage, which can be determined from the slope of the $I_{ds}$ versus $V_{ds}$ curves. Notice that $I_{ds}$ changes by a factor of $e$ for a $V_T/\kappa = 33.0$ mV change in $V_{gs}$. This drain current equation is equivalent to that in Maher
et al. (1989); however, in this form the dependence on the substrate voltage is explicit. This three-parameter model is adequate for rough design calculations but not for accurate simulation of device operation. Refer to Mead (1989, Appendix B) for a more elaborate model. Subthreshold currents are comparable to currents in cell membranes; they range from a few picoamps to a few microamps. For a given gate-source voltage $V_{gs}$, the MOS transistor has two distinct modes of operation, determined by the drain-source voltage $V_{ds}$, as shown by the output characteristics in Figure 1c. The behavior is roughly linear if $V_{ds}$ is less than $V_{dsat} \approx 100$ mV; small changes in $V_{ds}$ cause proportional changes in the drain current. For voltages above $V_{dsat}$, the current saturates. In this region the MOS transistor is a current source with output conductance

$$g_{dsat} = \frac{\partial I_{ds}}{\partial V_{ds}} \approx \frac{I_{ds}}{V_0} \qquad (3.2)$$

The change in drain current for a small change in gate voltage is given by

$$g_m = \frac{\partial I_{ds}}{\partial V_{gs}} = \frac{\kappa I_{ds}}{V_T} \qquad (3.3)$$

$g_m$ is called the transconductance because it relates a current between two nodes to a voltage at a third node. As we shall see, the subthreshold MOS transistor is a very versatile circuit element because $g_m \gg g_{dsat}$.

4 Circuits
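Equations 3.1-3.3 are easy to explore numerically. The sketch below (ours; it uses the fitted parameter values quoted in the Figure 1 caption and is only a behavioral model, not a substitute for a device simulator) evaluates the drain current and the two small-signal conductances:

    import numpy as np

    I0 = 0.72e-18    # zero-bias current (A), from the Figure 1 fit
    kappa = 0.75     # gate coupling coefficient, from the Figure 1 fit
    VT = 0.026       # thermal voltage kT/q at room temperature (V)
    V0 = 15.0        # Early voltage (V), from the Figure 1 fit

    def ids(Vgs, Vds, Vbs=0.0):
        # Subthreshold drain current, equation 3.1.
        return (I0 * np.exp((kappa * Vgs + (1 - kappa) * Vbs) / VT)
                   * (1.0 - np.exp(-Vds / VT)) * (1.0 + Vds / V0))

    I = ids(Vgs=0.7, Vds=1.0)
    g_dsat = I / V0           # equation 3.2
    g_m = kappa * I / VT      # equation 3.3
    print(g_m / g_dsat)       # kappa*V0/VT ~ 433, so g_m >> g_dsat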
4 Circuits

Area-efficient (compact) functional blocks can be obtained by using the MOS transistor itself to perform as many circuit functions as possible. The three possible circuit configurations for the transistor are shown in Figure 2: In the common-source mode, it is an inverting amplifier with high voltage gain: g_m/g_dsat. In the common-drain mode, it is a voltage follower with low output resistance: 1/g_m. In the common-gate mode, it is a current buffer with low output conductance: g_dsat. In the synthetic neuronal circuits described in the next section the inverting amplifier is used as a feedback element to obtain more ideal circuit operation, while the voltage follower and the current buffer are used to transfer signals effectively between different circuits.
Figure 2: MOS transistor circuit configurations. (a) Common-source, (b) common-drain, and (c) common-gate modes of operation. In (a) voltage gain is obtained by converting the current produced by the device's transconductance to a voltage across its drain conductance. In (b) a voltage follower/buffer is realized; the gate-source drop is kept constant by using a fixed bias current and setting V_bs = 0. In (c) the device serves as a current buffer by transferring the signal from its high-conductance source terminal to the low-conductance drain node.

The actual computations are performed by current-domain (or current-mode) circuits. A Current-Domain (CD) circuit is one whose input signals and output signals are currents. The simplest CD circuit is shown in Figure 3. This circuit copies the input current to the output and reverses its direction. It is appropriately named a current mirror. The circuit has just two transistors: an input transistor and an output transistor. The input current I_in is converted to a voltage V_b by the input transistor. This voltage sets the gate voltage of the output transistor. Thus, both devices have the same gate-source voltages and will pass the same current if they are identical and have the same drain and substrate voltages. In practice, device mismatch produces random variations in the output current, while the nonzero drain conductance results in systematic variations. More complicated mirror circuits, for example, the Wilson mirror or the Complex mirror (Pavasović et al. 1988), may be used to obtain lower output conductance. By using more output devices, several copies of the input current can be obtained. The current mirror is analogous to a basic synapse structure in biological systems: it is simple in form, it enforces unidirectional information flow, and it can function over a large range of input and output signal levels.

Figure 3: Current mirror circuits using (a) n-type and (b) p-type transistors. These circuits provide an output current that equals the input current if the devices are perfectly matched. For subthreshold operation, we observe variations of about 10%, on average, using 4 × 4 μm devices.

Translinear circuits³ (Gilbert 1975) are a computationally powerful subclass of CD circuits. A translinear circuit is defined as one whose operation depends on the linear relationship between the transconductance and the channel current of the active devices (equation 3.3). The current mirror in subthreshold operation is an example of a translinear circuit. The Translinear Principle (Gilbert 1975) can be used to synthesize a wide variety of circuits to perform both linear and nonlinear operations on the current inputs, including products, quotients, and power terms with fixed exponents. The Gilbert current multiplier is one of the better known translinear circuits. Gilbert's elegant analog array normalizer (1984) is an example of a more powerful translinear circuit. One fascinating aspect of translinear circuits is that although the currents in their constitutive elements (the transistors) are exponentially dependent on temperature, the overall input/output relationship is insensitive to isothermal temperature variations. The effect of small local variations in fabrication parameters can also be shown to be temperature independent. Finally, translinear circuits are simple, because an analog representation is used and the native device properties provide the computational primitives.

³Translinear circuits have traditionally been built using bipolar transistors.
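A small sketch of the current mirror's behavior under the model above (our own construction, reusing the constants and the `ids` convention from the previous fragment; both devices are assumed saturated and the input device ideally diode-connected) separates the systematic error due to the nonzero drain conductance from the random error due to mismatch in I_0:

```python
def mirror(i_in, i0_in=I0, i0_out=I0, vds_out=1.0):
    """Two-transistor subthreshold current mirror (idealized)."""
    # The input transistor converts I_in to a gate voltage V_b ...
    vb = (VT / kappa) * np.log(i_in / i0_in)
    # ... which the output transistor converts back into a current.
    return i0_out * np.exp(kappa * vb / VT) * (1 + vds_out / V0)

print(mirror(1e-9))                    # ~1.07 nA: systematic error from vds/V0
print(mirror(1e-9, i0_out=1.1 * I0))   # random ~10% device mismatch on top
```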
5 Synapses and Neurons

In a neuronal circuit, the interaction between neurons is mediated by a large variety of synapses (Shepherd 1979). A neuron receives its inputs from other neurons through synaptic junctions that may have different efficacies. In a VLSI system, the synapses are implemented as a two-dimensional array with the neurons on the periphery. This is because O(N²) synapses are required in a network with N neurons. Generally, two sets of lines (buses) are run between the neurons and the synaptic array; one carries neuronal output to the synapses and the other feeds input to the neurons. However, in networks with reciprocal connections, such as the bidirectional associative memory (Boahen et al. 1989a,b), proper choice of signal representations leads to a more efficient implementation. Our circuit implementations for neurons and synapses are shown in Figure 4. These circuits use voltage to represent a neuron's output (presynaptic signal) and current to represent its inputs (postsynaptic signals). Since currents and voltages may be independently transmitted along the same line, these signal representations allow a neuron's output and inputs to be communicated using just one line. Voltage output facilitates fan-out while current input provides summation. Thus, in close analogy to actual neuronal microcircuits, the output signal is generated at the same node at which inputs are integrated.

Figure 4: Circuits for synapses and neurons. (a) Reciprocal synapse and (b) neuron. These circuits demonstrate efficient signal representations that use a single line to provide two-way communication. A voltage is used to represent information going one way while a current is used to send information the other way. The synapse circuit in (a) provides bidirectional interaction between two neurons connected to nodes n1 and n2. The neuron circuit in (b) sends out a voltage that mirrors its output current I_out in the synapses while receiving the total current I_a from these synapses.

The two-transistor synapse circuit (Figure 4a) provides bidirectional interaction between neurons connected to nodes n1 and n2; each transistor serves as a synaptic junction. When s is at ground, voltages applied at nodes n1 and n2 are transformed into currents by the transconductances of M2 and M1, respectively. If these voltages exceed V_dsat, the transistors are in saturation and act as current sources. Thus, changes in the voltage at n1 (n2) do not affect the current in M1 (M2). Actually, for a small change in the voltage at n1, the changes in I1 and I2 are related by

$$\frac{\Delta I_1}{\Delta I_2} = \frac{g_{dsat1}}{g_{m2}} = \frac{V_T}{\kappa V_0} \frac{I_1}{I_2}$$
This gives

$$\frac{\Delta I_1}{I_1} = \frac{V_T}{\kappa V_0} \frac{\Delta I_2}{I_2} \approx 0.002 \, \frac{\Delta I_2}{I_2}$$
Hence, we can double I2 (using the voltage at n1) while disturbing I1 by only 0.2%. The interaction is turned off by setting s to a high voltage, or modulated by applying an analog signal to the substrate. The circuit for the neuron also uses just two transistors (Figure 4b). The net input current I_a (for activation), formed by summing the inputs at node n, is available at the drain of M1. This device buffers the input current and controls the output voltage. I_a is fed through a nonlinearity, for example, thresholding (not shown), to obtain I_out, which sets the output voltage V_out. This is accomplished by using M1 as a voltage follower and providing feedback through M2, which functions as an inverting amplifier; M1 adjusts V_out so that the current in M2 equals I_out. Hence, V_out will mirror I_out in the synapses. The feedback makes the output voltage insensitive to changes in the input current, I_a. Actually, the output conductance is approximately g_m1 g_m2/g_dsat2; it is increased by a factor equal to the gain provided by M2. In this case, a small change in V_out produces changes in I_a and I_s (the postsynaptic copy of I_out) given by

$$\frac{\Delta I_a}{I_a} \approx \frac{V_T}{\kappa V_0} \frac{\Delta I_s}{I_s}$$
Hence, if I_s doubles, the resulting change in V_out decreases I_a by only 0.2%, just as in the previous case. Note that I_out must always exceed a few picoamps to keep V_out above V_dsat. The characteristics of these circuits, designed using 4 μm × 4 μm devices and fabricated through MOSIS, are shown in Figure 5a-c.

Figure 5: Characteristics of a synthetic neuronal circuit. (a) A simple circuit consisting of two neurons (n1 and n2) and a synapse (s) was built and tested to demonstrate the proposed communication scheme. The currents sent by n1 (n2) and that received by n2 (n1) are denoted by I12 (I21) and Î12 (Î21), respectively.

6 Discussion
The adopted design methodology is governed by three simple principles: First, the computation is carried out in the analog domain; this gives simple functional blocks and makes efficient use of interconnect lines. Second, the physical properties of silicon-based devices and circuits are used synergetically to obtain the desired result. Third, circuits are designed with power dissipation and area efficiency as prime engineering constraints, not accuracy or speed. We believe power dissipation will be a serious limitation in large-scale analog computing hardware. Unlike the situation in digital integrated circuits, the massive parallelism and concurrency attainable with analog computation impose serious limits on the amount of power that each circuit can dissipate. This is why we operate the devices with currents in the nanoamp range and, if possible, picoamps, about the same current levels found in biological systems. This approach is similar to, and strongly influenced by, that of Mead's group at Caltech. Our approach is more minimalistic: we view the transistor itself, not the transconductance amplifier, as the basic building block. Thus, currents, rather than differential voltages, are the primary signal representation.
Figure 5: Cont'd. Plots (b) and (c) show how Î12 and Î21 vary as I12 is stepped from 2.0 nA to 100 nA while I21 is held at 50 nA, for various substrate bias voltages. The values V_bs = 0, −50, and −100 mV correspond to weights of 0.93, 0.57, and 0.33, respectively. Notice that these weights modulate signals going both ways symmetrically.
We are not concerned about accuracy or matching in the basic elements because biological systems perform well despite the limited precision of their neurons and synaptic connections. The emerging view is that this is a result of the collective nature of the computation performed, whereby large numbers of elements contribute to the final result. From a system designer's point of view, this means that random variations in transistor characteristics are not deleterious to the system's performance, whereas systematic variations are and must therefore be kept to a minimum. Indeed, we have observed this in silicon chips. The translinear property of the subthreshold MOS transistor provides a very powerful computational primitive. This property arises from the highly nonlinear relationship between the gate potential and the channel current. In fact, the exponential is the strongest nonlinearity relating a voltage and a current in solid-state devices (Shockley 1963; Gunn 1968). It is interesting to note that the same property holds for voltage-activated ionic channels; however, the conductance dependence is steeper due to correlated charge control of the current (Hille 1984, p. 55). In translinear (current-domain) circuits we have seen a classical example of how a rich form for circuit design emerges from the properties of the basic units (the MOS transistor in subthreshold).

To summarize, we have addressed some issues related to the engineering of collective analog computing systems. In particular, we have demonstrated that currents are an appropriate analog signal representation. Current levels comparable to those in excitable membranes are achieved by operating the devices in the subthreshold region, resulting in manageable power dissipation levels. This design methodology and implementation style have been used to build associative memories (Boahen et al. 1989a,b) and self-organizing feature maps in analog VLSI.

Acknowledgments

This research was funded by the Independent Research and Development program of the Applied Physics Laboratory; we thank Robert Jenkins for his personal interest and support. The authors would like to thank Professor Carver Mead of Caltech for encouraging this work. Philippe Pouliquen and Marc Cohen made excellent comments on the paper, and Sasa Pavasović helped with acquiring the experimental data. We are indebted to Terry Sejnowski, who provided a discussion forum and important insights in the field of neural computation at Johns Hopkins University. We thank the action editor, Professor John Wyatt, for his critical review and insightful comments.

References

Boahen, K. A., Pouliquen, P. O., Andreou, A. G., and Jenkins, R. E. 1989a. A heteroassociative memory using current-mode MOS analog VLSI circuits. IEEE Trans. Circ. Sys. 36 (5), 643-652.
Boahen, K. A., Andreou, A. G., Pavasović, A., and Pouliquen, P. O. 1989b. Architectures for associative memories using current-mode analog MOS circuits. Proceedings of the Decennial Caltech Conference on VLSI, C. Seitz, ed. MIT Press, Cambridge, MA.
Eisenberg, J., Freeman, W. J., and Burke, B. 1989. Hardware architecture of a neural network model simulating pattern recognition by the olfactory bulb. Neural Networks 2, 315-325.
Freeman, W. J., Yao, Y., and Burke, B. 1988. Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks 1, 277-288.
Gilbert, B. 1975. Translinear circuits: A proposed classification. Electron. Lett. 11 (1), 14-16.
Gilbert, B. 1984. A monolithic 16-channel analog array normalizer. IEEE J. Solid-State Circuits SC-19, 956-963.
Grossberg, S. 1988. Nonlinear neural networks: Principles, mathematics, and architectures. Neural Networks 1, 17-61.
Gunn, J. B. 1968. Thermodynamics of nonlinearity and noise in diodes. J. Appl. Phys. 39 (12), 5357-5361.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer, Sunderland, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kohonen, T. 1987. Self-Organization and Associative Memory. Springer-Verlag, New York.
Maher, M. A. C., DeWeerth, S. P., Mahowald, M. A., and Mead, C. A. 1989. Implementing neural architectures using analog VLSI circuits. IEEE Trans. Circ. Sys. 36 (5), 643-652.
Mead, C. A. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Pavasović, A., Andreou, A. G., and Westgate, C. R. 1988. An investigation of minimum-size, nano-power MOS current mirrors for analog VLSI systems. JHU Elect. Computer Eng. Tech. Rep. JHU/ECE 88-10.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
Shepherd, G. M. 1979. The Synaptic Organization of the Brain. Oxford University Press, New York.
Shockley, W. 1963. Electrons and Holes in Semiconductors, p. 90. D. van Nostrand, Princeton, NJ.
Toffoli, T. 1988. Information transport obeying the continuity equation. IBM J. Res. Dev. 32, 29-35.
Vittoz, E. A., and Fellrath, J. 1977. CMOS analog integrated circuits based on weak inversion operation. IEEE J. Solid-State Circuits SC-12, 224-231.
Received 30 March 1989; accepted 13 October 1989.
Communicated by Jack Cowan
Random Neural Networks with Negative and Positive Signals and Product Form Solution Erol Gelenbe Ecole des Hautes Etudes en Informatique (EHEI), Université Paris V, 45 rue des Saints-Pères, 75006 Paris, France
We introduce a new class of random "neural" networks in which signals are either negative or positive. A positive signal arriving at a neuron increases its total signal count or potential by one; a negative signal reduces it by one if the potential is positive, and has no effect if it is zero. When its potential is positive, a neuron "fires," sending positive or negative signals at random intervals to neurons or to the outside. Positive signals represent excitatory signals and negative signals represent inhibition. We show that this model, with exponential signal emission intervals, Poisson external signal arrivals, and Markovian signal movements between neurons, has a product form leading to simple analytical expressions for the system state.

1 Introduction
Consider an open random network of n neurons in which "positive" and "negative" signals circulate. External arrivals of signals to the network can either be positive, arriving at the ith neuron according to a Poisson process of rate Λ(i), or negative, according to a Poisson process of rate λ(i). Positive and negative signals have opposite roles. A negative signal reduces by 1 the potential of the neuron at which it arrives (i.e., it "cancels" an existing signal) or has no effect if the potential is zero. A positive signal adds 1 to the neuron potential. Negative potentials are not allowed at neurons. If the potential at a neuron is positive, it may "fire," sending signals out toward other neurons or to the outside of the network. As signals are sent, they deplete the neuron's potential by the same number. The times between successive signal emissions when neuron i fires are exponentially distributed random variables of average value 1/r(i); hence r(i) is the rate at which neuron i fires. A signal leaving neuron i when it "fires" heads for neuron j with probability p⁺(i,j) as a positive signal, or as a negative signal with probability p⁻(i,j), or departs from the network with probability d(i). Let p(i,j) = p⁺(i,j) + p⁻(i,j); it is the transition probability of a Markov chain representing the movement of signals between neurons.

Neural Computation 1, 502-510 (1989) © 1989 Massachusetts Institute of Technology
We shall not allow the signals leaving a neuron to return directly back to the same neuron: p(i,i) = 0 for all i. We have

$$\sum_{j=1}^{n} p(i,j) + d(i) = 1 \quad \text{for } 1 \le i \le n \tag{1.1}$$
Positive signals represent excitation and negative signals represent inhibition. Positive external arrivals represent input information and negative external arrivals can be used to represent the thresholds at each neuron. A simple example of the computational use of this model is presented in Section 3. We show that this new model has a "product form" solution (Gelenbe and Mitrani 1980; Gelenbe and Pujolle 1986). That is, the stationary probability distribution of its state can be written as the product of the marginal probabilities of the state (or potential) of each neuron. This leads to simple expressions for the network state. Previously, product form solutions were known to exist for certain networks with only "positive signals," which are queueing networks used in computer and communication system modeling and in operations research (Gelenbe and Mitrani 1980; Gelenbe and Pujolle 1986).

2 The Main Properties of the Model
The main properties of our model are presented in the following theorems.

Theorem 1. Let q_i denote the quantity

$$q_i = \lambda^+(i)/[r(i) + \lambda^-(i)] \tag{2.1}$$

where the λ⁺(i), λ⁻(i) for i = 1, ..., n satisfy the system of nonlinear simultaneous equations:

$$\lambda^+(i) = \sum_j q_j r(j) p^+(j,i) + \Lambda(i), \qquad \lambda^-(i) = \sum_j q_j r(j) p^-(j,i) + \lambda(i) \tag{2.2}$$

Let k(t) be the vector of neuron potentials at time t, and k = (k₁, ..., k_n) be a particular value of the vector; let p(k) denote the stationary probability distribution

$$p(k) = \lim_{t \to \infty} \text{Prob}[k(t) = k]$$

If a nonnegative solution {λ⁺(i), λ⁻(i)} exists to equations 2.1 and 2.2 such that each q_i < 1, then

$$p(k) = \prod_{i=1}^{n} (1 - q_i) \, q_i^{k_i} \tag{2.3}$$

The proof is given in Appendix A. A direct consequence is:

Corollary 1.1. The stationary probability that a neuron i fires is given by

$$\lim_{t \to \infty} \text{Prob}[k_i(t) > 0] = p_i = \lambda^+(i)/[r(i) + \lambda^-(i)] \quad \text{if } q_i < 1$$
2.1 Networks with Some Saturated Neurons. We say that neuron i is saturated if λ⁺(i)/[r(i) + λ⁻(i)] ≥ 1; i.e., in steady state it continuously fires. In many applications, one is interested in working with networks containing some saturated neurons. We have the following extension of Theorem 1. Let NS be the (largest) subset of neurons such that no neuron in NS is saturated, and S be its complement. Consider the solutions λ⁺(i), λ⁻(i) of the flow equations:

$$\lambda^+(i) = \sum_j q_j r(j) p^+(j,i) + \Lambda(i), \qquad \lambda^-(i) = \sum_j q_j r(j) p^-(j,i) + \lambda(i)$$

where q_i = λ⁺(i)/[r(i) + λ⁻(i)] if i ∈ NS, and q_i = 1 if i ∈ S.
Theorem 2. Let k(t)^NS denote the restriction of the vector k(t) to the neurons in NS. If a positive solution to the flow equations exists then

$$\lim_{t \to \infty} P[k_i(t) > 0] = \begin{cases} \lambda^+(i)/[r(i) + \lambda^-(i)], & \text{if } i \in NS \\ 1, & \text{if } i \in S \end{cases}$$

and

$$\lim_{t \to \infty} P[k^{NS}(t) = k^{NS}] = p(k^{NS}) = \prod_{i \in NS} (1 - q_i) \, q_i^{k_i}$$
We omit the proof of this result.

2.2 Equations 2.1 and 2.2 Describing Signal Flow in Feedforward Networks. Let us now turn to the existence and uniqueness of the solutions λ⁺(i), λ⁻(i), 1 ≤ i ≤ n to equations 2.1 and 2.2, which represent the average arrival rates of positive and negative signals to each neuron. We are unable to guarantee the existence and uniqueness of these quantities for arbitrary networks, except for feedforward networks. A network is said to be feedforward if for any sequence i₁, ..., i_s, ..., i_r, ..., i_m of neurons, i_s = i_r for r > s implies that

$$\prod_{u=1}^{m-1} p(i_u, i_{u+1}) = 0$$

Theorem 3. If the network is feedforward, then the solutions λ⁺(i), λ⁻(i) to equations 2.1 and 2.2 exist and are unique.

Proof. For any feedforward network, we may construct an isomorphic network by renumbering the neurons so that neuron 1 has no predecessors [i.e., p(i,1) = 0 for any i], neuron n has no successors [i.e., p(n,i) = 0 for any i], and p(i,j) = 0 if j < i. Thus in the isomorphic network, a signal can possibly (but not necessarily) go directly from neuron i to neuron j only if j is larger than i.
For such a network, the λ⁺(i) and λ⁻(i) can be calculated recursively as follows: First compute λ⁺(1) = Λ(1), λ⁻(1) = λ(1), and calculate q₁ from equation 2.1; if you obtain q₁ ≥ 1, set q₁ = 1 (neuron 1 is saturated), otherwise leave it unchanged. For each successive i such that λ⁺(i), λ⁻(i) have not yet been calculated, proceed as follows; since the q_j for each j < i are known, compute

$$\lambda^+(i) = \sum_{j < i} q_j r(j) p^+(j,i) + \Lambda(i), \qquad \lambda^-(i) = \sum_{j < i} q_j r(j) p^-(j,i) + \lambda(i)$$

and then compute q_i; if q_i ≥ 1 replace it by q_i = 1; otherwise leave it unchanged. The procedure will end because at each step the computations are carried out for an increased value of i. This completes the proof since we have proved existence and uniqueness by calculating in a unique manner the solution to equations 2.1 and 2.2 for a feedforward network. QED
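The recursive procedure in this proof translates directly into code. The sketch below is ours (the name `feedforward_q` and the data layout are assumptions, not the paper's); it computes the q_i of equations 2.1 and 2.2 for a feedforward network whose neurons are already numbered in the topological order constructed in the proof:

```python
def feedforward_q(Lam, lam, r, p_plus, p_minus):
    """q_i = lambda_plus(i)/(r(i) + lambda_minus(i)) for a feedforward
    network, computed in topological order; q_i is clipped at 1 for
    saturated neurons, as in the proof of Theorem 3."""
    n = len(r)
    q = [0.0] * n
    for i in range(n):
        lp = Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(i))
        lm = lam[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(i))
        q[i] = min(1.0, lp / (r[i] + lm))
    return q
```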
3 Analogy with Formal Neural Networks

A formal neural network (Rumelhart et al. 1986) is a set of n neurons each of which computes its state y(i) using a sigmoid function y(i) = f(x(i)), where x(i) = Σ_j w_ij y(j) − θ_i is the input signal composed of the weighted sums of the states y(j) of the other neurons of the network; the w_ij are the weights and θ_i is the threshold. In its simplest form f(·) is the unit step function. The set of weights and the set of thresholds completely characterize the network. An analogy between the usual model of neural networks and the model introduced in this paper, which we henceforth call the random network, can be constructed. Let us remain within the framework of feedforward networks, of which multilayer formal neural networks (Rumelhart et al. 1986) are a subclass, for which Theorem 3 has been proved. Each neuron is represented by a neuron of the random network. The threshold of neuron i is represented by a flow of negative signals to the neuron so that λ(i) = θ_i. Consider the nonoutput neuron i; it is represented by the random neuron i whose parameters are chosen as follows: d(i) = 0, r(i)p⁺(i,j) = w_ij if w_ij > 0, and r(i)p⁻(i,j) = |w_ij| if w_ij < 0. Summing over all j, the firing rate r(i) of the nonoutput "random" neuron i is chosen as

$$r(i) = \sum_j |w_{ij}|$$

Finally, for each output random neuron i set d(i) = 1, and assign some appropriate value to r(i).
To introduce into the random network the external parameters or inputs to the neural network we use the arrival rates of positive signals Λ(i). Assume that the external signals to the formal neural network are binary. If the input signal to neuron i is 0 or if neuron i is not connected to an external signal we set Λ(i) = 0. If the external signal to neuron i has the value 1, we set Λ(i) = Λ. Here Λ can be chosen so as to obtain the desired effect at the output neurons. All the parameters of the random network are chosen from those of the formal network, except for the firing rates of the output neurons, and the input rate of positive signals. The state Y = (y₁, ..., y_n) of the formal neural network, where y_i is 0 or 1, is simulated by the probabilities of the random neurons. Let z_i = lim_{t→∞} P[k_i(t) > 0]; we associate y_i to z_i. Thus, we have for some "cut-point" 1 − a,

[y_i = 0] ⟺ z_i < 1 − a;  [y_i = 1] ⟺ z_i ≥ 1 − a
In an arbitrary neural network, that is, one that does not have the feedforward structure, this procedure could also be used for establishing the random network. We could also use the d(i) at each neuron i to represent the loss currents that are known to exist in biological neural systems (Kandel and Schwartz 1985), and take r(i)d(i) to be the rate of loss of electric potential at a neuron if we wish to include this effect in the model.

3.1 An Example: The Neural Network for the XOR Function. A simple example often given to illustrate the behavior of formal neural networks is the network for the XOR (exclusive-OR) function (Rumelhart et al. 1986). We present the equivalent random model representation for this network. In Figure 1 we show a formal neural network that receives two binary inputs x₁, x₂ and that produces y(x₁, x₂), the boolean function XOR. It is composed of four neurons; each neuron is numbered (1 to 4) and the synaptic weights are indicated on arcs between neurons. The threshold of each neuron is assumed to be 0. In Figure 2 we show the random network analog corresponding to Figure 1. According to the rules we have given for constructing the random network analog, we have λ(i) = 0 for i = 1, ..., 4 because all thresholds are 0;
r(1) = r(2) = 2, r(3) = 1.1, r(4) = r, as yet undetermined; p⁺(1,3) = p⁺(2,3) = 0.5, p⁻(1,4) = p⁻(2,4) = 0.5, p⁺(3,4) = 1; d(1) = d(2) = d(3) = 0, d(4) = 1.
Recall that according to the rules proposed in Section 3, we choose a value Λ(i) = Λ to represent x_i = 1, and Λ(i) = 0 to represent x_i = 0. Set Λ large enough to saturate neurons 1 and 2, that is, any Λ > 2. z₄ is the analog of the output y of the neural network of Figure 1. Notice that

$$z_4 = \begin{cases} 0, & \text{if } \Lambda(1) = \Lambda(2) = 0 \\ 1.1/[r+2], & \text{if } \Lambda(1) = \Lambda(2) = \Lambda > 2 \\ 1/[r+1], & \text{if } \Lambda(1) = \Lambda,\ \Lambda(2) = 0 \text{ or vice versa} \end{cases}$$

Figure 1: A formal neural network for the boolean XOR function.

Setting 1 − a = 0.6 and r = 0.1 we see that we obtain the XOR function with this network. In fact we may choose any 1 − a such that 1.1/[r + 2] < 1 − a < 1/[r + 1].
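As a check (again a sketch of ours, reusing the hypothetical `feedforward_q` routine given after Theorem 3), the fragment below runs the four input patterns through the random network of Figure 2 and applies the cut-point 1 − a = 0.6; the output neuron fires exactly for the XOR patterns:

```python
r = [2.0, 2.0, 1.1, 0.1]                        # r(4) = r = 0.1
pp = [[0, 0, 0.5, 0], [0, 0, 0.5, 0], [0, 0, 0, 1.0], [0, 0, 0, 0]]
pm = [[0, 0, 0, 0.5], [0, 0, 0, 0.5], [0, 0, 0, 0], [0, 0, 0, 0]]
lam = [0.0, 0.0, 0.0, 0.0]                      # all thresholds are zero
A = 3.0                                         # any Lambda > 2 saturates an input
for x1 in (0, 1):
    for x2 in (0, 1):
        z4 = feedforward_q([A * x1, A * x2, 0.0, 0.0], lam, r, pp, pm)[3]
        print((x1, x2), round(z4, 3), z4 >= 0.6)   # fires iff x1 XOR x2
```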
Figure 2: The random network analog of the formal neural network shown in Figure 1; here p⁺(1,3) = p⁺(2,3) = 0.5, p⁺(3,4) = 1, and p⁻(1,4) = p⁻(2,4) = 0.5.

4 Conclusions

We have introduced a new random neural network in which "negative" and "positive" signals circulate among "neurons." Positive signals accumulate at neurons, and are then sent off to other neurons according to a probabilistic transition rule as the neuron fires. Neuron firing times are exponentially distributed random variables that may differ from one neuron to another. Negative signals cancel an existing positive signal at a neuron, if one exists; otherwise they have no effect. Positive signals leaving a neuron can enter another neuron as a positive or negative signal, or simply leave the system. System departure can therefore be used to simulate output signals, as well as to represent a "loss" of neuron potential. Poisson external arrivals of positive and negative signals occur. We show that this network has a product form that leads to a compact and elegant representation for its steady-state behavior. We prove that the nonlinear signal flow equations of the model have a unique solution when the network has a feedforward structure. An analogy with the usual formal neural network model is shown.
Appendix A: Proof of Theorem 1

Since {k(t) : t > 0} is a continuous time Markov chain, it satisfies the usual Chapman-Kolmogorov equations; thus, in steady state it can be seen that p(k) satisfies the following global balance equations:

$$p(k) \sum_i \{\Lambda(i) + (\lambda(i) + r(i)) \, 1[k_i > 0]\}$$
$$= \sum_i \Big\{ p(k_i^+) r(i) d(i) + p(k_i^-) \Lambda(i) 1[k_i > 0] + p(k_i^+) \lambda(i)$$
$$+ \sum_j \big( p(k_{ij}^{+-}) r(i) p^+(i,j) 1[k_j > 0] + p(k_i^+) r(i) p^-(i,j) 1[k_j = 0] + p(k_{ij}^{++}) r(i) p^-(i,j) \big) \Big\} \tag{A.1}$$

where the vectors used are defined by

k_i⁺ = (k₁, ..., k_i + 1, ..., k_n)
k_i⁻ = (k₁, ..., k_i − 1, ..., k_n)
k_ij⁺⁻ = (k₁, ..., k_i + 1, ..., k_j − 1, ..., k_n)
k_ij⁺⁺ = (k₁, ..., k_i + 1, ..., k_j + 1, ..., k_n)
and 1[X] is the characteristic function that takes the value 1 if X is true and 0 otherwise. These vectors have no meaning whenever any of their elements are negative; in such cases, the corresponding probabilities in the global balance equation are understood to be zero. We now verify that the product form equation 2.3 satisfies the equation. Substituting 2.3 in equation A.1, the left-hand side becomes

$$\sum_i \{\Lambda(i) + (\lambda(i) + r(i)) \, 1[k_i > 0]\} \tag{A.2}$$

times p(k), while the right-hand side reduces to an expression in the ratios q_i (equations A.3 and A.4), which, after using equation 2.1 becomes, for each i,

$$0 = -\lambda^+(i) + (r(i) + \lambda^-(i)) \, \lambda^+(i)/[r(i) + \lambda^-(i)] = 0 \tag{A.5}$$

Thus, the product form is verified since equation 2.3 satisfies the global balance equations A.1. QED
References

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Vols. I and II. Bradford Books and MIT Press, Cambridge, MA.
Gelenbe, E., and Mitrani, I. 1980. Analysis and Synthesis of Computer Systems. Academic Press, London.
Gelenbe, E., and Pujolle, G. 1986. Introduction to Networks of Queues. Wiley, New York.
Kandel, E. R., and Schwartz, J. H. 1985. Principles of Neural Science. Elsevier, Amsterdam.
Received 17 July 1989; accepted 17 August 1989.
Communicated by Scott Kirkpatrick
Nonlinear Optimization Using Generalized Hopfield Networks Athanasios G. Tsirukis Gintaras V. Reklaitis School of Chemical Engineering, Purdue University, West Lafayette, IN 47907 USA
Manoel F. Tenorio School of Electrical Engineering, Purdue University, West Lafayette, IN 47907 USA
A nonlinear neural framework, called the generalized Hopfield network (GHN), is proposed, which is able to solve in a parallel distributed manner systems of nonlinear equations. The method is applied to the general nonlinear optimization problem. We demonstrate GHNs implementing the three most important optimization algorithms, namely the augmented Lagrangian, generalized reduced gradient, and successive quadratic programming methods. The study results in a dynamic view of the optimization problem and offers a straightforward model for the parallelization of the optimization computations, thus significantly extending the practical limits of problems that can be formulated as an optimization problem and that can gain from the introduction of nonlinearities in their structure (e.g., pattern recognition, supervised learning, and design of content-addressable memories).
1 Introduction

The ability of networks of simple nonlinear analog processors (neurons) to solve complicated optimization problems was demonstrated in a series of papers by Hopfield (1984) and Tank and Hopfield (1986). The dynamics of such networks, generated by the analog response, high interconnectivity, and the existence of feedback connections, produce a path through the space of independent variables that tends to minimize the objective function value. Eventually, a stable steady-state configuration is reached, which corresponds to a local minimum of the objective function. Since each optimization or, in general, nonlinear equation solving problem can be considered as a transition from an initial state to an

Neural Computation 1, 511-521 (1989)
© 1989 Massachusetts Institute of Technology
optimal one, we will extend the original model so as to handle nonlinear optimization problems. Specifically we will

1. propose a systematic procedure to transform a general nonlinear optimization problem into a dynamic model,

2. investigate the necessary structure of a network of nonlinear analog processors implementing the dynamic model, and

3. propose a highly distributed algorithm for solving nonlinear optimization problems in a series of parallel digital processors, able to implement all the important optimization methods.

2 Literature Review
The computation in a linear Hopfield network (LHN) is done by a collection of linearly interconnected neurons, dynamically relaxing to a locally optimal state. The linear connectivity among processors limits the method's applicability to problems with a quadratic objective and linear constraints. The method was primarily applied to the solution of combinatorially complex decision problems. It was soon realized (Bruck and Goodman 1988) that the quality of the neural network solution depends on the quality of the underlying algorithm and that combinatorial problems (e.g., the traveling salesman problem) cannot be solved with guaranteed quality, getting trapped in locally optimal solutions. Jeffrey and Rosner (1986) extended Hopfield's technique to the nonlinear unconstrained optimization problem. They claim that their method generates good directions for global optimization. Under the general framework presented in this paper, their method appears to be equivalent to the steepest descent method, with clearly local behavior. Kennedy and Chua (1988) presented an analog implementation of a network solving a nonlinear optimization problem. The underlying optimization algorithm is a simple transformation method (Reklaitis et al. 1983) that is known to be relatively inefficient for large nonlinear optimization problems.

3 The Nonlinear Optimization Problem
The general nonlinear optimization problem is given by

minimize f(x₁, x₂, ..., x_N)
subject to h_i(x₁, x₂, ..., x_N) = 0, i = 1, 2, ..., K
a_j ≤ g_j(x₁, x₂, ..., x_N) ≤ b_j, j = 1, 2, ..., M
x_k^L ≤ x_k ≤ x_k^U, k = 1, 2, ..., N   (3.1)
where f is the objective function, h_i are the equality constraints, g_j are the inequality constraints, and x_i, i = 1, 2, ..., N are the independent variables. x_k^L and x_k^U are respectively the lower and upper bounds of the variable x_k. Evidently, the nonlinear optimization problem requires more expressive power than that offered by the LHN architecture:

1. The objective function f can take any nonlinear form, not just quadratic.

2. The feasible region, implicitly defined by the constraints, can have any shape, as opposed to Hopfield's hypercube geometry.
3. The optimum can lie anywhere in the feasible region. Clearly, the neurons must be allowed to interact in a general nonlinear fashion, contrary to the linear summation of inputs used in an LHN.

4 Unconstrained Optimization

The unconstrained optimization problem can be expressed as follows:

minimize f(x)   (4.1)
Two methods have been widely used for the solution of this problem:

1. Cauchy's method: The famous steepest descent algorithm, which tracks the direction of the largest change in the objective function value:

$$x^{(k+1)} = x^{(k)} + \epsilon \nabla f, \qquad \epsilon = \pm 1$$

The optimization problem can be viewed as a dynamically changing system that progresses from an initial state to a final one, following the "equation of motion":

$$\frac{dx}{dt} = \epsilon \nabla f; \quad x(0) = x_0 \tag{4.2.1}$$

The steady states of the initial value problem correspond to the extremums of the original function f:

$$\frac{df}{dt} = \sum_i \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \nabla f^T \frac{dx}{dt} = \epsilon \, \| \nabla f \|^2 \tag{4.2.2}$$

Therefore, if ε = −1 the value of f monotonically decreases with time and the steady state corresponds to a local minimum. Steepest descent dynamics were adopted in the original Hopfield model.

2. Newton's method: If second-derivative information about the objective function is available, rapid convergence can be achieved using Newton's approximation:

$$x^{(k+1)} = x^{(k)} + \epsilon (\nabla^2 f)^{-1} \nabla f, \qquad \epsilon = \pm 1$$
with corresponding "equation of motion":

$$\frac{dx}{dt} = \epsilon (\nabla^2 f)^{-1} \nabla f \tag{4.3}$$

Newton's method is applicable only if (∇²f) exists and is nonsingular. The time behavior of the algorithm is

$$\frac{df}{dt} = \nabla f^T \frac{dx}{dt} = \epsilon \, \nabla f^T (\nabla^2 f)^{-1} \nabla f = \epsilon Q \tag{4.4}$$

The quadratic form Q determines the steady-state behavior of system 4.3. If Q is guaranteed positive or negative definite, the convergence to a local maximum or minimum is controlled through ε. If Q is indefinite, the Levenberg-Marquardt approach (Reklaitis et al. 1983) can be adopted. Applying the dynamic variation of Cauchy's method to 4.1:

$$\frac{dx}{dt} = -\nabla f; \quad x(0) = x_0 \tag{4.5}$$
The corresponding nonlinear network for solving the unconstrained minimization problem consists of N processors, each one representing an independent variable. Equation 4.5 describes the time behavior of each processor. The inputs to each processor must be arbitrarily nonlinear, as dictated by 4.5. We will refer to such networks as nonlinear or generalized Hopfield networks (GHNs). The linear summation of inputs in Hopfield's model cannot reproduce the nonlinear interactions that the problem demands.

Example. In order to demonstrate the power of GHNs in unconstrained optimization we simulated in a digital computer a nonlinear network solving the following problem (from Reklaitis et al. 1983):
$$\text{minimize } f(x) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$$
As shown in Figure 1, the above function, also known as the Himmelblau function, possesses four local minima. Two neurons are needed, representing the problem's independent variables. The nonlinear input summation to each neuron is dictated by equation 4.5. Steepest descent is a local optimization algorithm. The system's steady state depends on the initial state. This is shown clearly in Figure 1, where two different initial points produce convergence to different steady states. The integration step seems to influence the system's convergence: very large integration steps can lead to numerical instabilities.
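A minimal sketch of this experiment (ours; the step size and iteration count are arbitrary choices, not the paper's) integrates the steepest descent equation of motion 4.5 with explicit Euler steps and shows two initial points converging to different local minima:

```python
import numpy as np

def grad_f(x):
    """Gradient of the Himmelblau function."""
    u = x[0] ** 2 + x[1] - 11.0
    v = x[0] + x[1] ** 2 - 7.0
    return np.array([4 * u * x[0] + 2 * v, 2 * u + 4 * v * x[1]])

for x0 in ([0.0, 0.0], [-2.0, 2.0]):
    x = np.array(x0)
    for _ in range(2000):
        x -= 0.01 * grad_f(x)      # dx/dt = -grad f, explicit Euler step
    print(x0, "->", x.round(3))    # different starts, different local minima
```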
Figure 1: Convergence to local optima.

5 Augmented Lagrangian Method

One of the first attempts to solve the general nonlinear optimization problem involved the incorporation of the constraints into the objective function. Such strategies are called transformation methods. The most successful of these is the augmented Lagrangian method, which has the following penalty function structure (Reklaitis et al. 1983):

$$P(x, \sigma, \tau) = f(x) + R \sum_j \left\{ \langle g_j(x) + \sigma_j \rangle^2 - \sigma_j^2 \right\} + R \sum_i \left\{ [h_i(x) + \tau_i]^2 - \tau_i^2 \right\} \tag{5.1}$$
where R is a predetermined constant. Function 5.1 can be shown to be an approximation of the Lagrangian function (Reklaitis et al. 1983). Variables σ, τ are iteratively updated via the equations

$$\sigma_j \leftarrow \langle g_j(x) + \sigma_j \rangle, \qquad \tau_i \leftarrow h_i(x) + \tau_i \tag{5.2}$$
The bracket operator ⟨·⟩ is defined as

$$\langle a \rangle = \begin{cases} a & \text{if } a \le 0 \\ 0 & \text{if } a > 0 \end{cases}$$

When the process converges, x is the solution of the original problem and σ_j, τ_i the corresponding inequality–equality Lagrange multipliers (Reklaitis et al. 1983). Equation 4.2.1 suggests that the dynamic behavior of a network implementing the augmented Lagrangian optimization method can be given by
$$\frac{dx}{dt} = -\nabla_x P = -\nabla f - 2R \langle g + \sigma \rangle^T \nabla g - 2R [h + \tau]^T \nabla h$$
$$\frac{d\sigma}{dt} = +\nabla_\sigma P = 2R \langle g + \sigma \rangle - 2R\sigma \tag{5.3}$$
$$\frac{d\tau}{dt} = +\nabla_\tau P = 2Rh$$
where ∇g and ∇h are matrices, for example, ∇h = [∇h₁, ..., ∇h_K]. In the corresponding GHN, (N + K + M) neurons, representing the independent variables and the equality–inequality Lagrange multipliers, are necessary. The connectivity among the neurons is dictated by equation 5.3. A simpler penalty form was adopted by Hopfield, without the provision of updating the Lagrange multipliers (Reklaitis et al. 1983).
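To illustrate equation 5.3, the following sketch (our own toy problem, step size, and value of R; not the authors' test case) relaxes the augmented Lagrangian network for minimize f(x) = x₁² + x₂² subject to h(x) = x₁ + x₂ − 2 = 0, whose solution is x = (1, 1):

```python
import numpy as np

R, dt = 1.0, 0.01
x = np.array([0.0, 0.0])
tau = 0.0                               # equality-constraint multiplier estimate
for _ in range(5000):
    h = x[0] + x[1] - 2.0
    grad_f = 2.0 * x                    # gradient of f = x1^2 + x2^2
    grad_h = np.array([1.0, 1.0])
    x = x - dt * (grad_f + 2 * R * (h + tau) * grad_h)   # dx/dt = -grad_x P
    tau = tau + dt * 2 * R * h                           # dtau/dt = 2R h
print(x, tau)                           # -> approximately [1, 1] and tau = -1
```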
6 Generalized Reduced Gradient (GRG) Method

Without loss of generality, we will concentrate on the equality-constrained optimization problem. The inequalities can be transformed to equalities by introducing slack variables (Reklaitis et al. 1983). In the GRG method the independent variables x are separated into two subsets: the set of K basic variables, x̂_i, and the set of N − K nonbasic variables, x̃_j. If ∇̃f is an (N − K)-dimensional vector containing the partial derivatives of f with respect to the nonbasic variables (with ∇̂f, ∇̂h, and ∇̃h the corresponding derivatives with respect to the basic and nonbasic variables), and if the basic variables are chosen so that ∇̂h is nonsingular, then the necessary conditions for a local minimum can be written as

$$\bar{\nabla} f = \tilde{\nabla} f - \hat{\nabla} f \, (\hat{\nabla} h)^{-1} \tilde{\nabla} h = 0 \tag{6.1}$$
$$h(x) = 0 \tag{6.2}$$

∇̄f is the modified reduced gradient for the nonbasic variables, x̃_j, influenced by the shape of the constraints. The values of the K basic variables, x̂_i, are computed from the system of K nonlinear equations 6.2. In GRG, equations 6.1 and 6.2 are solved using either Cauchy's or Newton's
method. If we adopt Cauchy's method for 6.1 and Newton's method for 6.2, the state equations of a dynamic GRG optimizer are

$$\frac{d\tilde{x}}{dt} = -\bar{\nabla} f \tag{6.3}$$
$$\frac{d\hat{x}}{dt} = -(\hat{\nabla} h)^{-1} h \tag{6.4}$$
System (6.3)–(6.4) is a differential algebraic system, with an inherent sequential character: for each small step toward lower objective values, produced by (6.3), the system of nonlinear constraints should be solved, by relaxing equation (6.4) to a steady state. The procedure is repeated until both equations (6.3) and (6.4) reach a steady state. The above problem can be solved using a GHN of N − K + K = N nonlinear processors, the connectivity of which is dictated by (6.3)–(6.4). GRG uses K fewer processors than the augmented Lagrangian method, but spends more effort in the computation of the reduced gradient.
7 Successive Quadratic Programming (SQP)

In the SQP strategy, Newton's method is employed in the computation of both the independent variables, x, and the Lagrange multipliers, v. The state equations of a dynamic SQP optimizer are

$$\frac{dz}{dt} = \epsilon [\nabla^2 L]^{-1} (\nabla L) = \epsilon H (\nabla L); \quad z(0) = z_0 \tag{7.1}$$

where z is the augmented set of independent variables, and L is the Lagrangian function, defined as follows:

z = [x; v],  L = f − vᵀh

If matrix H is nonsingular, then the steady states of equation 7.1 are identical to the stationary points of L. In order to attain convergence to a local optimum of the optimization problem, which is a saddle point of L, we must guarantee, through continuous manipulation of ε, that the x state equations produce a descent in L-space and the v equations produce an ascent in it.

Example. The dynamics of the three algorithms were investigated with the following nonlinear optimization problem:

minimize f(x) = −x₁x₂²x₃³
subject to h₁(x) = x₁² + x₂² + x₃ − 13 = 0
h₂(x) = x₂x₃^(−1/2) − 1 = 0
The SQP network was an adaptation of equation 7.1 that used the Levenberg-Marquardt method. Figure 2 shows the transient behavior of the SQP and the augmented Lagrangian (AL) networks, starting from a feasible initial state. The behavior of the GRG algorithm is almost identical to that of the AL. Since initially the objective-function gradients are very small, the second-order Newton dynamics of the SQP network prevail over the first-order steepest descent dynamics of the GRG and AL networks. A major disadvantage of the original GRG algorithm is the requirement of feasibility for both the initial and intermediately generated points. Figure 3 shows the transient behavior of the networks starting from an infeasible initial state. Again, the GRG and AL dynamics are almost identical. All three networks converged to a local optimum. Additional experiments showed that starting from an infeasible initial point, x₀ = [x̂₀, x̃₀], the GRG network always converges to a local optimum, as long as there exists a solution to the system of nonlinear equations h(x̂, x̃₀) = 0.

Figure 2: Network dynamics and feasible initial state.
Figure 3: Network dynamics and infeasible initial state.
A study of the stability for all the dynamic optimization methods presented here can be found in Tsirukis et al. (1989).

8 Optimization and Parallel Computation
A most important application of the proposed model lies in its direct translation to a parallel algorithm, which can distribute the computational burden of optimization to a large (at most N + K) number of simultaneously computing digital processors. Each one of them simulates the nonlinear analog processors of a GHN, which represents either a variable or a Lagrange multiplier and is continuously updated through an explicit integration of the state equations:

$$x_j' = x_j + \phi_j(x, v)$$

where x, v are the most recent updates of the independent variables and the Lagrange multipliers, and φ depends on both the optimization algorithm and the integration method. Here are two unique features of the algorithm:
1. An integral of the state equations is available, namely the Lagrangian function, which was differentiated in the first place. Thus, since only a steady-state solution is desired, it is not necessary to use a method for stiff ODEs. Any explicit integration method will suffice.

2. Because of the above and the fact that the state equations are autonomous, it is possible to update each variable in each processor completely asynchronously: the (k + 1)th update of x_j does not require the kth update of all the other variables. Thus, the need to synchronize the computations done by the various processors is avoided. Consequently, the algorithm is robust with respect to intercommunication and execution delays, as the sketch below illustrates.

In contrast, the conventional forms of the optimization algorithms require complete synchronization in the underlying unconstrained optimizations and the solution of linear (SQP) and nonlinear (GRG) algebraic equations. As a result their potential for parallelization is significantly reduced. By contrast, the proposed algorithm efficiently distributes the computational burden among the parallel processors.
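A minimal rendering of the asynchronous scheme (ours; `phi` is simply the steepest descent right-hand side for a separable test function, not one of the paper's networks) shows that no synchronization barrier is needed: each update uses whatever values of the other variables were last written:

```python
import random

def phi(j, x):
    """Steepest descent right-hand side for f = sum_i (x_i - i)^2."""
    return -2.0 * (x[j] - j)

x = [0.0, 0.0, 0.0]
for _ in range(10000):
    j = random.randrange(len(x))    # each processor fires at its own pace
    x[j] += 0.01 * phi(j, x)        # x_j' = x_j + dt * phi_j(x, v)
print(x)                            # -> approximately [0, 1, 2]
```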
References

Bruck, J., and Goodman, J. 1988. On the power of neural networks for solving hard problems. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 137-143. American Institute of Physics, New York, NY.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Jeffrey, W., and Rosner, R. 1986. Neural network processing as a tool for function optimization. In Neural Networks for Computing, J. S. Denker, ed., pp. 241-246. American Institute of Physics, New York, NY.
Kennedy, M. P., and Chua, L. O. 1988. Neural networks for nonlinear programming. IEEE Trans. Circuits and Systems 35 (5), 554-562.
Reklaitis, G. V., Ravindran, A., and Ragsdell, K. M. 1983. Engineering Optimization: Methods and Applications. Wiley-Interscience, New York.
Tank, D. W., and Hopfield, J. J. 1986. Simple "neural" optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits and Systems CAS-33 (5).
Tsirukis, A. G., Reklaitis, G. V., and Tenorio, M. F. 1989. Computational Properties of Generalized Hopfield Networks Applied to Nonlinear Optimization. Tech. Rep. TREE 89-69, School of Electrical Engineering, Purdue University, December 1989.
Received 20 July 1989; accepted 14 September 1989
Communicated by Fernando J. Pineda
Discrete Synchronous Neural Algorithm for Minimization Hyuk Lee Department of Electrical Engineering, Polytechnic Institute of New York, Brooklyn, NY 11201 USA
A general discrete minimization algorithm that can be implemented by highly parallel neural networks is developed. It can be applied to the energy functions that can be expressed as arbitrary types of polynomial functions of the state variables. The algorithm can be operated in a synchronous way.
1 Introduction
Highly parallel neural computing algorithms have been investigated extensively. The Hopfield model has been successfully applied to solving combinatorial minimization problems (Hopfield 1982). However, the energy function in the Hopfield model is restricted to a symmetric quadratic form having all the diagonal elements zero. A higher order Hopfield model (Maxwell et al. 1986; Psaltis and Park 1986; Paek et al. 1988) has also been considered. In this case, the energy function is a polynomial of the state variables and it is assumed to have special symmetry properties. Furthermore, the updating rules of such algorithms are based on serial operation. At each step, a state variable is selected randomly and minimization is carried out by updating only the selected state variable and leaving all the other state variables unchanged. Therefore, the algorithm can be operated in an asynchronous way, but it cannot be operated in a synchronous way. Synchronous algorithms for continuous state variables have been investigated (Marcus and Westervelt 1989). However, synchronous discrete neural algorithms have not been studied. In this paper, a partially synchronous discrete neural algorithm that can be applied to arbitrary types of polynomial energy functions is developed. In Section 2 a general algorithm is developed. In Section 3 the Hopfield and quadratic neural models are considered as specific examples of the general algorithm.

Neural Computation 1, 522-531 (1989) © 1989 Massachusetts Institute of Technology
2 Partially Synchronous Algorithm
The energy function is assumed to be an arbitrary type of polynomial function of the state variables. Real binary variables having values −1 and 1 are considered as state variables. The energy function can be described as

$$E = F(\{B_1, B_2, \ldots, B_N\}) \tag{2.1}$$
where the B's are binary state variables and the total number of state variables is N. Partially synchronous minimization is considered for the most general case. Totally synchronous or totally asynchronous minimizations are specific examples of the general case. Assume that, at each step, M state variables are selected randomly, and minimization is carried out by updating the M state variables simultaneously and leaving all the other state variables unchanged. M can be any integer from 1 to N, and the minimization algorithm becomes totally asynchronous or totally synchronous if M is equal to 1 or N. At each step of updating, the number M can be changed and chosen arbitrarily. Even if the same number M is chosen, a different set of M neurons out of N neurons can be used to minimize the neural network. A set P is defined to consist of the indices of the selected state variables. Another set P' = {1, 2, ..., N} − P is defined to represent the indices of the state variables that are unchanged. As an example, consider a neural system consisting of five neurons numbered 1, 2, 3, 4, 5. If the system is updated by changing the states of three neurons 1, 2, and 5, then the set P is {1,2,5} and the set P' is {3,4}. In the case of the totally synchronous updating algorithm, the set P is given by {1,2,3,4,5} and P' is an empty set. On the other hand, the set P for the totally asynchronous updating rule changing the state of neuron 3 becomes {3}, and the set P' for this case is given by {1,2,4,5}. The updated state variables B_i' and the changes of the state variables ΔB_i satisfy the relation

$$B_i' = B_i + \Delta B_i \tag{2.2}$$

where i ∈ P. The updated state variables are also binary variables having values −1 and 1. Therefore, the possible values of ΔB_i are

$$\Delta B_i = -2, 0, 2 \tag{2.3}$$
The incremental change in energy ΔE due to the updated state variables given by equation 2.2 is considered in order to develop an algorithm that minimizes the energy function described by equation 2.1. ΔE is defined as

$$\Delta E = E[\{B_i', B_j\}] - E[\{B_i, B_j\}] \tag{2.4}$$

where i ∈ P and j ∈ P', and utilizing equation 2.2, it becomes

$$\Delta E = E[\{B_i + \Delta B_i, B_j\}] - E[\{B_i, B_j\}] \tag{2.5}$$
The first term on the right-hand side of equation 2.5 can be expanded as a Taylor series in several variables because E is a polynomial of the state variables. Therefore, the incremental energy change ΔE can be written as (Toralballa 1967)

$$\Delta E = \sum_{m=1}^{\infty} \frac{1}{m!} \sum_{i_1 \in P} \cdots \sum_{i_m \in P} \frac{\partial^m E}{\partial B_{i_1} \cdots \partial B_{i_m}} \, \Delta B_{i_1} \cdots \Delta B_{i_m} \tag{2.6}$$
The total number of terms in the summation of equation 2.6 is finite because E is a polynomial. To reduce the products of changes in equation 2.6 to linear forms, the following relations are derived. For an arbitrary state variable B and a positive integer n, (ΔB)ⁿ satisfies

$$(\Delta B)^n = (-2B)^{n-1} \Delta B \tag{2.7}$$
= 2" = (AB)"
(2.8)
On the other hand, if A B is -2, B and B' should be 1 and -1 according to equation 2.2. Therefore, the right-hand side of equation 2.7 becomes [(-2)(1)1"-"-21
=
1-21"
(2.9)
which is the same as the left-hand side of equation 2.7 in the case of AB = -2. Consider next a product of changes given by AAB,, , . . . , A s z m with an arbitrary coefficient A. Assume that m > 1. The product can be written as AAB,, . . . AB,,
=
IA1(1/2)[-{S(A)ABt, - ABZ2... AB,m}2 + (AB,, 1' + (AB,, . . . AB,_ )'I, (2.10) I ~A\(l/2)[(ABal)2 + (AB,, . . .~ l B , , ) ~ l (2.11)
where S(A) is the sign of A . The first term in equation 2.10 is always negative or zero and equation 2.11 follows. The second term in equation 2.11 is a product of smaller number of changes than the left term in equation 2.10. Therefore, this technique can be applied iteratively to reduce the product of changes in equation 2.10 to a sum of individual changes, and it is given by
Equation 2.12 can be symmetrized by considering the m cyclic orderings of the indices {i₁, ..., i_m} (equation 2.13). If equation 2.7 is utilized, equation 2.13 can be written as

$$A \Delta B_{i_1} \cdots \Delta B_{i_m} \le I(m) |A| \sum_{k=1}^{m} B_{i_k} \Delta B_{i_k} \tag{2.14}$$

where the constant I(m) (equation 2.15) is negative and is obtained from the iterative reduction averaged over the m cyclic orderings; in particular, I(2) = −1 and I(3) = −(5/3).
(2.16) where the index m is rearranged in the second term. From equation 2.14, we have the following relation: m
AABiAB,, . . . ABi, 5 (A(Z(m+ l){CB,,ABi,
+ B,AB,}
(2.17)
k=l
and utilizing the above equation, the second term in equation 2.16 satisfies
(2.18)
Hyuk Lee
526
Equation 2.16 now satisfies the following relation if equation 2.18 is used:
aE
= C-AB~+CB,AB~{Ci E P aBi
c ' dB,dBiZ
i,EP
iEP
F+'E . . .aB1,
m=I
I)
I ( m + 1)
m!
c ... ilEP
(2.19)
where the dummy indices i and ii, j = 1,. . . ,m, have been interchanged and (2.20) has been used to obtain the equality in the above equation. The incremental energy change is given by
$$\Delta E \le -\sum_{i \in P} G_i \Delta B_i \tag{2.21}$$

where

$$G_i = -\frac{\partial E}{\partial B_i} - B_i \sum_{m=1}^{\infty} \frac{I(m+1)}{m!} \sum_{i_1 \in P} \cdots \sum_{i_m \in P} \left| \frac{\partial^{m+1} E}{\partial B_i \partial B_{i_1} \cdots \partial B_{i_m}} \right| \tag{2.22}$$

From equation 2.21 it is clear that if

$$G_i \Delta B_i \ge 0 \quad \text{for all } i \in P \tag{2.23}$$
the incremental energy change is always negative or zero. Therefore, equation 2.23 updates the state variables in such a way that it minimizes the energy function in the limit of iteration. For positive G_i, equation 2.23 is satisfied if ΔB_i is positive or zero. If B_i is equal to 1, B_i' should be equal to 1 because ΔB_i = 0 is the only solution. However, if B_i is equal to −1, B_i' should be 1 because this makes the incremental energy change more negative than using the condition ΔB_i = 0. Therefore, if G_i is positive, the updated value becomes 1, which is the same as the sign of the value G_i. If G_i is negative, the above argument can be applied to show that the updated value for B_i' becomes −1, which is again the same as the sign of G_i. If G_i is zero, B_i' can have any value. In this case, there are three possibilities. The first one is to choose 1 for B_i'. Using the value −1 for B_i' is the second possibility. The third one is to use the relation B_i' = −B_i. As an example, the first possibility, that is, using the value 1
Discrete Synchronous Neural Algorithm for Minimization
527
for B,' when Gi is zero, is used in the following. Summarizing the above result, the updating rule can be written as B,' = T(Gi)for all i E P
(2.24)
where T is a unit step function defined by T ( z )= 1 if z 2 0 and T ( z )= -1 if z < 0. Equation 2.24 is the general discrete neural algorithm that minimizes the energy functions consisting of arbitrary types of polynomials of state variables in a partially synchronous way. To illustrate the general algorithms obtained in the above, a simple example consisting of three neurons is considered and the results of the simulation are described. The energy function for this example is chosen as E = BIB^
+ 2B2B3 + B1
(2.25)
A totally synchronous algorithm is considered in the following. Therefore, the set P introduced in the beginning of this section is given by { 1 , 2 , 3 } and the set P' is empty. The intermediate variables Gi are obtained by using equation 2.22 and they are given by
GI = B1
-1
(2.26)
+ 3B2 - 2B3
(2.27)
-
GZ= -B1
G3 = -2B2
B2
+ 2B3
(2.28)
where I(2) = -1 was used. As an example, select B1 = 1, B2 = 1, B3 = 1 as an initial state. Then the values Gi become -1, 0, 0, respectively, using the same set of values for Bi, that is, in a totally synchronous way. The updated states are given by BI = -1, Bb = 1, B&= 1 and the energy
E 4 0 -2 2 0 -4 -2 2
Table 1: Values of Gi
AE -4 -4 0 -4 0 0 0 -2
Hyuk Lee
528
change A E becomes -4. In Table 1, the values Gi corresponding to the eight initial states are shown. The updated values Bi, i = 1,2,3, and the energy change AE for each case are also shown. It is clearly seen that the energy always decreases or remaines the same, which demonstrates the validity of the algorithm. 3 Partially Synchronous Hopfield and Quadratic Neural Models
~
As an example, an energy function consisting of cubic, quadratic, and linear terms is considered. In general, a partially synchronous algorithm is considered. The most general form of such energy function is given by
where Q i j k , Tij, and I, are constant coefficients. The intermediate variable Gk in equation 2.22 is obtained by calculating the derivatives
+
(3.4)
QilkiZ)
If equations 3.2-3.4 are substituted in equation 2.22,
Gk
is given by
where 1(2) = -1 and 1(3)= -(5/3) are used, and the indices k , il, and i 2 represent the neurons used to update the system, that is, they belong to the set P defined in Section 2.
Discrete Synchronous Neural Algorithm for Minimization
529
A generalized form of the Hopfield model in the case of partially synchronous algorithm is obtained if Q i j k and 1,are all zero. The memory matrix M[ in the Hopfield model is defined as
j
and from equation 3.5 the memory matrix is given by
where Sij is the Kronecker symbol. The memory vectors BLm' are contained in the T matrix as
where m represents the number of memory vectors, and Ti, are all zero. It is clear that the memory matrix is symmetric. However, the diagonal components of the memory matrix M H are not zero in general. As an example, we consider a neural network consisting of 32 neurons. Three memory vectors a, b, c shown in Table 2 were stored using equation 3.8. Four partially synchronous as well as totally synchronous and totally asynchronous algorithims were used. In Table 2, the degree Memory a
b C
Memory Vectors -1 1 1 1 1 -1
-1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 -1 1 -1 1
-1 1 1 -1 1 -1 -1 1 -1 -1 1 -1 -1 -1 -1 1 1 1 -1 1 1 1 -1 -1 1 1 1 1 1 -1 -1 -1 1 1 -1 1 -1 -1 1 -1 -1 -1 -1 1 1 - 1 1-1 1-1 1-1 1 1 - 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1
Degree
Attraction radii
of synchronism a b 1 11 16 2 16 16 4 10 15 9 12 8 16 3 5 32 0 0
Table 2: Memory Vectors.
c 12 11 10 9 3 0
Hyuk Lee
530
of synchronism was defined as the number of neurons that was used to update the neural network simultaneously. For example, the number 4 for the degree of synchronism implies that the neural network was updated by selecting four neurons and changing the states of the four neurons simultaneously. As special cases, degree of synchronism 1and 32 represents the totally asynchronous (conventional Hopfield model) and totally synchronous algorithm, respectively. Attraction radii for the three memory vectors were obtained by computer simulation using different degrees of synchronism. The results are shown in Table 2. It shows that as the degree of synchronism increases, the attraction radii, that is, the effectiveness of the algorithm, decreases. Specially, the attraction radii for the totally synchronous algorithm are zero. This phenomenon can be explained if equation 2.22 is investigated in detail. In equation 2.22, the effect of an increased degree of synchronism is to have a larger set P, and the coefficient of the term B, in the first part of the right-hand side of equation 2.22 is always negative. This implies that as the set becomes larger, the first part of the right-hand side of equation 2.22 becomes the dominant one. Therefore, the sign of G, becomes the same as that of B,, and the updated states are the same as the initial states. A higher order Hopfield model is obtained by assuming Tz3= 0 and I, = 0. The memory vectors are contained in the Q matrix and it is given by Qkz3= B~m)BB,'m'B~m' (3.9) m
The updating rule for the synchronous algorithm can be read from equation 3.5
c
Gk =
2
c(Qkz3
+
x(Qk2,3 21
+
+ Qzk3
-t Qz,k)B&
3
+ Q k p l + Q p l k + Q3kzl + Qt13k + Qtlk~)B~Il)Bk
3
C
(5/6)Bk{c t1
+ Q+zI)
IQkzlzz
+ Qk2221 + Qz221k + QqkZ1 + Qztzzk
22
(3.10)
Considering the second term of equation 3.10, it is clear that the state variables are inside the absolute value. This makes it impossible to define a memory matrix M Q in the form of Gk =
ccM$BzB3 z
(3.11)
3
which is the same form as that used in the asynchronous quadratic model. 4 Conclusion
In conclusion, a general discrete neural algorithm that minimizes arbitrary types of polynomial energy functions has been developed for the
Discrete Synchronous Neural Algorithm for Minimization
531
first time within my knowledge. The updating rules have been obtained for a general partially synchronous algorithm. The algorithms obtained in this paper do not need any special conditions, such as no starvation condition, bounded delays, etc., which were assumed in the past partially synchronous algorithms. As shown in the simulation results on the Hopfield type associative memory in Section 3, there is a trade-off between the degree of synchronism, that is, degree of parallelism, and the effectiveness of minimization process. This is understood by the fact that the algorithms for different degrees of synchronism obtained in this paper differ only in the higher order derivative terms as shown in equation 2.22. Therefore, as the total number of neurons in a neural network and the degree of synchronism increase, the linear derivative term in equation 2.22 becomes less important and the minimization process becomes less effective. Detailed analysis on this trade-off will be the subject of future publications.
Acknowledgments This work was supported by the National Science Foundation Grant No. EET-8810288 and the Center for Advanced Technology in Telecommunications. The author appreciates the reviewer’s insightful comments, which made the paper clearer.
References Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554. Marcus, C. M., and Westervelt, R. M. 1989. Dynamics of analog neural networks with time delays. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., pp. 568-576. Morgan Kauffman, San Mateo, CA. Maxwell, T., Giles, C. L., Lee, Y. C., and Chen, H. H. 1986. Nonlinear dynamics of nonlinear systems. In Neural Networks for Computing, J. S. Denker, ed., pp. 299-304. American Institute of Physics, New York, NY. Paek, E. G., Liao, P. F., and Gharavi, H. 1988. A Derivation of Hopfield Neural Network Models. Snowbird, Utah, April 6-10, Psaltis, D., and Park, C. 1986. Nonlinear discriminant functions and associative memories. In Neural Networks for Computing, J. S. Denker, ed., pp. 370-375. American Institute of Physics, New York, NY. Toralballa, L. V. 1967. Calculus. Academic Press, New York. Received 15 December 1988; accepted 15 October 1989.
Communicated by Halbert White
Approximat ion of Boolean Functions by Sigmoidal Networks: Part I: XOR and Other Two-Variable Functions E. K.Blum Mathematics Department, University of Southern California, Los Angeles, CA 90089 USA
We prove the existence of a manifold of exact solutions (mean-square error E = 0) of weights and thresholds for sigmoidal networks for XOR and other 2-variable Boolean functions. We also prove the existence of a manifold of local minima of E where E $0.
1 Introduction It is well known that any Boolean function of n variables can be approximated arbitrarily closely by a suitable two-layer feed-forward network of sigmoidal ”neurons.” For example, for n = 2 the network shown in Figure 1 can be used (Rumelhart et al. 1986; Chauvin 1989; Soulie et al. 1987). z1 and 2 2 are the input nodes. The “hidden” layer outputs are y1 = u(wIz1 + ~ 2 x 2- L1) and y2 = O ( W ~ X ~+ ~ 4 x 2 Lz), where o(a)= 1/[1+exp(-a)]. The network output function is z = o(UIy1 + UZYZ- LO). By appropriate choice of the weights w,, U, and thresholds L, the function z can be made to approximate any of the 16 Boolean functions of 51 and 5 2 . For the eight symmetric functions, it is sufficient to impose the following symmetry constraints: w 3 = w2, w4 = w1,L2 = L1, Ul = U2. (To simplify notation we let w = i71 = UZ and L = LO.) For example, applying these constraints with w1 = 4, w2 = -4, LI = 3, w = 10, and L = 3, we obtain the following values of z(zl, 2 2 ) : z(0,O) = z(1,l) = 0.1139, z ( 0 , l ) = z(1,O) = 0.99869, which yields a good approximation to XOR. To approximate OR we take w1 = 4, w2 = -1, L1 = 3, w = 10, L = 3 to get z(0,O) = 0.1139, z(0,l) = z(l,O) = 0.9889 and z(l,1) = 0.9991. Let [1 = (0,O). €2 = (0,1), [3 = (l,O), and (4 = (1,l). In a typical application of feed-forward networks of this kind, one specifies “target” values t, in the real interval (0,l) and seeks values of the weights and thresholds such that for all i, z, = z([J approximates t, sufficiently closely. The usual measure of approximation is the mean square error, E = C , E,, where E, = (z, - t,)’/2. In this example, letting u = (w, wl, w2,L , LI), we see that E is a function of 21 for given t,. To find an acceptable value of 21 one usually applies a procedure to minimze E(v). If an exact Neural Computation 1, 532-540 (1989) @ 1989 Massachusetts Institute of Technology
Approximation of Boolean Functions by Sigmoidal Networks
533
Figure 1: Network for Boolean functions of two variables.
solution, v*, exists, then min E ( v ) = E(v*)= 0 is the absolute minimum. Otherwise, min E ( v ) > 0. The necessary condition for a minimum is that V E ( v * )= 0. This is not sufficient, since there may exist stationary points that are maxima or saddle points. More important, there may be stationary points that are local minima but not absolute minima.
E. K. Blum
534
Therefore, in the general feed-forward network problem, when a v is found for which VE(v) = 0 (approximately) but for which E(v) is too large (i.e., some zi failing to be acceptably close to its ti), one does not know whether to ascribe the failure to v being an unacceptable local minimum or to the nonexistence of an acceptable absolute minimum for the particular network. One does not know whether to continue to search for other minima or to change the structure of the network. In this paper, we analyze these questions for the symmetric n = 2 case, in particular, for the network in Figure 1. We show that a manifold of exact solutions exist, but also a manifold of local minima exists. In Part I1 (with Leong Li), we shall analyze the behavior of various gradient methods with respect to these manifolds and extend the results to TZ > 2. 2 Existence of Exact Solutions for Symmetric Functions
Let 0 < ti < 1, 1 5 i 5 4, be target values for the respective inputs ti as above. If t 2 = t 3 , we call {ti} a symmetric target set (STS). Case 1: An STS with t 2 > tl and t 2 > t 4 is said to approximate XOR. Case 2 An STS with t 2 = t 4 and t2 > tl is said to approximate OR. Case 3: An STS with tl = t 4 = t 2 is said to approximate a constant functiom (0 or 1). Case 4: An STS with t 2 = tl and t 4 > tl is said to approximate AND. (The other four symmetric functions are complements of these.) Note that a-'(ti) = -ln(l/ti - 1) is the inverse function of 0. Theorem 1. Let {ti} be an STS. There exists a vector of parameter values, v = (w,wllw2,L,L1), such that E(v) = 0 for the symmetric version of the network of Figure I, that is, zi = ti. Proof. We give an explicit construction for computing v. Until further notice let LI be arbitrary but fixed. yl(&) = u ( - L J . To satisfy z1 = tl we must choose w and L such that
w = [ L + 0-1(t1)]/20(-L1)
(2.1)
We shall need w # 0. Hence, we require (i) L
# -g-l(tl)
To satisfy z4 = t 4 we must choose L and w so that also yl(t4) = [ L -t(T-'(t4)1/2W
(2.2)
Since y1 > 0, we require that (ii) L > - 0 - ' ( t 4 )
if w > O or L < -0-I(t4)
if w < O
Further, since y1 < 1, equations 2.2 and 2.1 require that
> [0-1(t&-L1) - u-I(t1)]/[1 - u(-L1)] if w > 0 and 1 > a(-L1) or w < 0 and 1 < o(-Ll), with a reversed (iii) L
inequality in (iii) otherwise.
Approximation of Boolean Functions by Sigmoidal Networks
535
Next, having found L and w satisfying the above conditions and with given by equation 2.2, we solve for K = w1 + w2 in
y1([4)
(2.3)
fl(K - L1) = yI((44) It is useful to combine 2.1 and 2.2 to get y1([4) = Sa(-Ll), where
6 = 6(L) = [ L + a-'(t4)1/[L + fl-l(tl)l
(2.4)
Letting a = exp(L1),we have u(-L,) = l / ( l + a ) .Hence, from equation 2.3,
a exp(-K) = (1 + a ) / b - 1
(2.5)
With L1 given and L and w chosen as above, equation 2.5 is an equation in w1 and w2. To obtain a second equation we set 22 = t 2 by solving for A in a ( w A - L ) = t2, where A = y1(c2) + y2((2). We get A = [ L + a - ' ( t ~ ) I / w = 2yc7-L) = 2y/(l + a), where
y = y(L) = [ L+ a-'(t2)1/[L + o-Yt1)I
(2.6)
Since O(W~ L1) + O ( W ~ L1) = A
(2.7)
we must require that 0 < A < 2. A > 0 requires that (iv) L > -a-'(t2) if w > o or L < -a-'(t2) if w < O A < 2 requires that L satisfy
(v) y(L) < 1 + Q Conditions (i)-(v) can be satisfied by taking L sufficiently large, since this makes w > 0. We shall also require that A # 1. So (vi) L
# 2[(1 + a)a-'(t1)/2
- ~ - ~ ( t z ) l /( Ia )
Returning to equation 2.7, letting x = exp(w1) and y = exp(w2),we get x / ( x + a ) + y / ( 1 / + a )= A s(y + a ) + y(z + a ) = A(x + a ) ( y + a ) (1 - A)ax + (1 - A)ay = Aa2 + (A - 2)zy
(2.8)
From equation 2.5 we get exp(K) = C, where C = ab/(a + 1 - 6). Since K = w1 + w2, this yields xy = C. Substituting C for xy in 2.8, we get x + y = -B, where B = [Aa2+ C ( A - 2)]/(A - 1)a. Hence, z + C / x = -B and x2 + Bx + C = 0
(2.9)
E. K. Blum
536
Equation 2.9 has a positive real root if B < 0 and B2 2 4C. Now, + C) - 2C]/(A - 1)a = 2[7(a2 + C ) - C(1 -I-~r)]/0(2y- 1 - 0). AS B= L + 00, y + 1, 6 4 1, and C -+ 1. Hence, B --t -2, that is, B < 0. Now, substituting the expression defining C, we get B = 2{y[a2+ cu6/(a+ 1 ,511 - a6(a + i ~ /+ 1(-~s))/0(2y - 1- a ) = 2{y[a3 + (1- sia2 + 6 4 - as(a+ I)}/cu(a+ 1 - 6)(-a + 27 - 1). Hence, B2 2 4C if and only if {y[a3+ (1- @a2+ 601 - a6(a+ 1)}2 2 Sa3(a+ 1- 6)(a+ 1 - 2612 (2.10) On the left is a polynomial in a with leading term y2a6 and on the right a polynomial with leading term 6a6. For a large the sign of B2 - 4C is determined by (y2 - 6)a6.Hence, B2 > 4C if y2 > 6. Thus, by equations 2.4 and 2.6, for 2.9 to have a positive real root it suffices to take (Y large and IL + o-l(t2)l2 > [ L + a-'(t4)IIL + ~ ' ( t 1 1 1 ,that is, L[2C1(t2) - a-'(tl) - o-'(t4)] > K ' ( t & - ' ( t l ) - [g-'(t2>I2
(2.11)
> a-'(t4) Case 1. Approximate XOR ( t 2 > t 4 and tz > t l ) . Thus, and a-I(t2) > o-'(tI). So equation 2.9 has a positive real root in this case if L1 (until now arbitrary) is sufficiently large and L satisfies (i)-(v) and 2.11, which in this case becomes (vii) L > { ~ - ' ( t 4 ) 0 - ' ( t 1 ) - [(~-'(t2)]~}/[2a-'(t2)- ~ - ' ( t l-) o-'(t4)1.
Case 2. Approximate OR ( t 2 = t 4 and t 2 > t l ) . Thus, a-'(tz) > g - ' ( t ~ ) . Again, equation 2.11 becomes (vii), which now is just L > -a-'(t2). Case 3. Approximate constant (tl = t 2 = t 4 ) . This makes y = S = 1 and both sides of equation 2.10 reduce to a4(0- l)', that is, B2= 4C and B = -2, C = 1. So x = 1 and w1 = 0 = w2. Case 4.Approximate AND (tl = t Z ,t 4 > t l ) . In this case, equation 2.11 becomes (viii) L < -a-'(t2) It is easily verified that (viii) is compatible with (i)-(v) in this case. We have proved that an exact solution is obtained by first choosing an Ll large enough, then choosing L to satisfy (i)-(v) and (vii) or (viii) as the case may be, computing w by equation 2.1 and w1 = en z where x is a positive root of equation 2.9 and w2 = K - wl,where K satisfies equation 2.5. Example 1. Let tl = t4 = 0.01 and t 2 = t 3 = 0.99 approximate XOR (case 1). Since (T(-5.5) = 0.00407 < t l , a reasonable guess is L1 = 5.5. This should produce acceptable values for w and L in equation 2.1. L must ) -4.595, and a-'(t2) = ~ ~ ' ( 0 . 9 = 9) satisfy (vii). Since a-l(tl) = ~ ' ( 0 . 0 1 = -a-'(0.01) = 4.595, we see that (vii) requires only that L > 0. To satisfy (v), assume L > - n - ' ( t ~ ) . Then (v) becomes L + < [ L+ o-'(t~)l(l+ a), whence L > [a-'(t2) - (1+ a)a-'(tl)l/a. Here, a = exp(L1) = 244.69. So L > [4.595 - 245.69(-4.595)]/244.69 = 4.632. We take L = 4.7. By equation 2.1, w = (4.7-4.595)/0.00814 = 12.884. Next, by equation 2.6 y =
Approximation of Boolean Functions by Sigmoidal Networks
537
(4.7+4.595)/(4.7-4.595) = 88.625. By equation 2.7, A = 2(88.625)/245.69= 0.7214. By equation 2.4, S = 1. Hence, C = 1. The formula for B gives B = -[0.7214(244.69)2 - 1.2786]/0.2786 x 244.69 = -633.576. Solving equation 2.10, we find x = (633.576 + 633.573112 = 633.574. Thus, w1 = en(633.574) = 6.451. By equation 2.5, with S = 1, we get K = 0. So u12 = -201. With nine-digit precision we calculate w1 = 6.45158607 and w = 12.8840961. With these values, the network yields 24 - t 4 = 21 - tl = -5.8 x lo-'' and 22 - t2 = -1.1 x an exact solution up to rounding errors. ExurnpZe 2. Let tl = 0.01 and t? = t 4 = 0.99 approximate OR (case 2). As before, take L1 = 5.5. Hence, a is as in example 1. One readily verifies that L = 4.7 also satisfies (i)-(vii) in this case. Thus, w = 12.8840961 again. The value of y is as in example 1 but now 6 = y. Hence, we get C = 138.07 and A = 0.72144061. This yields B = -631.13743. Solving equation 2.9, we get x = 630.91859 and wl = en x = 6.4471768. By equation 2.5, K = 4.9277608 and w2 = -1.5194160. With these weights the network yields 21 - t l = -5 x lo-", 22 - t 2 = -1 x and 24 - t 4 = -1 x lop7, an exact solution up to rounding errors. From the proof of Theorem 1 and from the two examples it is clear that there is a manifold of exact solutions for any symmetric target set approximating the four Boolean functions XOR, OR, AND, and 1. (Their complements are approximated by reversing the signs of weights and thresholds.) The manifold of exact solutions can be regarded as parametrized by L and L1 in the region of the ( L , L , ) plane defined by inequalities (i)-(viii). As L and L1 vary over this feasible region, the corresponding values of w, w1,and w2 are determined by equations 2.1, 2.5, and 2.9. Thus, the exact solutions (the absolute minima of E ) lie on a surface in (m,w1, w2) space parametrized by L and L1 restricted to the feasible region. Since equation 2.9 has two solutions, the surface may have two disconnected pieces. Assuming that ( L ,L1) are already in the feasible region and held fixed, a gradient descent or other iterative minimization procedure to find the weights should generate a path in weight space approaching this surface. However, in the next section, we shall show that there is another manifold, consisting of stationary points and local minima, which may attract such iterative trajectories. It has been frequently observed in numerical experiments that such points exist in problems of this kind (for example, see McInerney et al. 1988). Here, we are able to prove their existence and characterize the manifold of such points in relation to the manifold of absolute minima. In Part 11, we shall consider their respective regions of attraction for gradient methods. 3 Stationary Points and Local Minima
For a given STS { t z } appproximating one of the Boolean functions we consider the mean square error E as a function of v = (w,zul,w2, L,L1).
E. K. Blum
538
2 1711 712 +722
0 0
1 714
7 2 2
714
712
1714
-
Theorem 2. For any STS that approximates XOR the corresponding error function E(v) has a manifold of stationary points that are relative minima but not absolute minima. Proof. By the preceding discussion, it suffices to prove that there exist w such that A(w)r’ = 0 has a nontrivial solution. This is true if and only if rank (A) < 3. A necessary condition on A is that v12 = 7 ~ Otherwise . rows 4 and 5 are independent and, therefore, also rows 1,4, and 5 . Now, 1712 = 1722 if and only if (a)
912
+ 922 = 1
Approximation of Boolean Functions by Sigmoidal Networks that is, a(wZ (b) ~1
1
-
L1) + u(w1 - L1) = 1. This is equivalent to
+ ~2
1
2 711 2712 yll 1
0 0
539
= 2L1
1 711 Y14
1712
7714
7712
714
-
and to have rank (A) < 3 it is further necessary that the determinant of rows 1, 2, and 4 be 0, which implies 712 = 114. But then y12 = 112 and since L1 = 0, we must have w2 = 0 and therefore w1 = 0. Now, consider (c). This implies y11+y12 = 1. Thus, O ( - L ~ ) + O ( W ~ - L =~) 1, which implies L1 = wp - L1, that is, w2 = 2L1 and so WI = 0, by (b). However, (a) and (c) imply yll + y22 = 1 also. Thus, w1 = 2L1, which implies L1 = w 2 = 0 as well. Hence, we arrive at the necessary and sufficient condition w1 = wp = L1 = 0. With these values, y j , = 112 and z, = a(w - L ) for all i . All (, are equal and there is a unique nontrivial solution of Ar = 0, namely, T = ( T I , - T , , T I ) . [To satisfy T , = z, - t, we need to take T I = ( t 2 - t1)/2.1 This implies that t4 = tl and t 2 = 2a(w - L ) - t l . Thus, the {t,} must approximate XOR, and then there are many stationary points that are not absolute minima. They correspond to w and L satisfying the equation a(w - L ) = (tl + t2)/2. Hence, they lie on a line w = L + const. in the ( w , L ) plane. Actually, these points are local minima of E , the value of E being ( t 2 - t1)2/2. To prove these are local minima we show that d E l d s = dE(v*+ s A v ) / d s 2 0 for any v f = (w*,0, 0, L*,01, where ( w * ,L*) is on the stationary line, Av = ( A w ,A w l , Aw2, A L , A&) has norm (( Av I(= E sufficiently small and 0 < s < 1.
E. K. Blum
540
Since zi(v*) and yji(v*) are all equal, all Vzi(v*) are equal. Let w =
+ SAW.Then zi(w)= zi(v*)+ sVzi(w*). Aw + O ( E ~ )Letting . z = zi(v*)+ sVi(v") . Aw, all zi(w) = z + O(&. Similarly, all ci(v) = c, yji(w) = y, and
'u*
qjZ(w) = q up to O(E'). Henceforth, we ignore terms of size O(E'). Thus, w e calculate BEIBL dEldL1 dEldw1 dE/ds
=
-C(rl
+ 27-2 + 7-4) =
-
t), where t = tl
+ 2t2 + t4
= - 2 c 7 ~ ( 4-~t), d E / d w = 2[y(4t - t) = dE/dw2=2Cqw(2z - t 2 - t 4 ) = c ~ p ~ ( 4 -zt), whentl = t4
aE/awAw + aE/dwlAWl+ ~ E / ~ ' W Z + A aE/aLAL W~ + dE/dLlAL1 = C(4z - t ) [ 2 y A ~ - 2wqAL1 - AL + qw(AWl + Aw2)l
=
Since z = dw(y14 + y24) - Ll + O(&, we have u p to Ok2) Az = c[2yAw 2wqAL1 - AL + qw(Aw1 + A w ~ )and ] dE/ds = 474.2 - t)Az. Since 4z - t = (4z"- t) + 4Az = 4Az, it follows that for E small enough, dE/ds 2 0 for all Az with 11 Az I[= E and 0 < s < 1. This completes the proof. These analytic results suggest further research, both analytical and computational (McInerney et al. 19881, on the nature of the error surfaces in back propagation and on the regions of attraction for the local and global minima manifolds.
Acknowledgments This research was partially supported by AFOSR Grant 88-0245 and NSF Grant CCR-8712192.
References Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems I, D. Touretzky, ed. Morgan Kauffmann, San Mateo, CA. McInerney, J., Haines, K., Biafore, S., and Hecht-Nielsen, R. 1988. Can backpropagation error surfaces have non-global minima? Dept. of Electrical Engineering and Computer Engineering, UCSD, August 1988. Abstract in IJCNN 89 Proc. 11, 627. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA. Soulie, F. F., Gallinari, P., Le Cun, Y., and Thiria, S. 1987. Automata networks and artificial intelligence. In Automata Networks in Computer Science, F. F. Soulie, Y. Robert, and M. Tchuente, eds. Princeton University Press, Princeton, NJ. Received 27 June 1989; accepted 15 September 1989.
Communicated by Dana Ballard
Backpropagation Applied t o Handwritten Zip Code Recognition Y. LeCun B. Boser J. S. Denker D. Henderson R. E. Howard W.Hubbard L. D. Jackel AT&T 3ell Laboratories, Holmdei, NJ 07733 USA
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S.Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification. 1 Introduction
Previous work performed on recognizing simple digit images (LeCun 1989) showed that good generalization on complex tasks can be obtained by designing a network architecture that contains a certain amount of a priori knowledge about the task. The basic design principle is to reduce the number of free parameters in the network as much as possible without overly reducing its computational power. Application of this principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy (Denker ef al. 1987; Patarnello and Carnevali 1987; Tishby et al. 1989; LeCun 1989), and a reduced Vapnik-Chervonenkis dimensionality (Baum and Haussler 1989). In this paper, we apply the backpropagation algorithm (Rumelhart et al. 1986) to a real-world problem in recognizing handwritten digits taken from the U.S. Mail. Unlike previous results reported by our group on this problem (Denker et al. 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of backpropagation networks to deal with large amounts of low-level information. Neural Computation 1, 541-551 (1989) @ 1989 Massachusetts Institute of Technology
542
LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel
2 Zip Codes 2.1 Data Base. The data base used to train and test the network consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office. Examples of such images are shown in Figure 1. The digits were written by many different people, using a great variety of sizes, writing styles, and instruments, with widely varying amounts of care; 7291 examples are used for training the network and 2007 are used for testing the generalization performance. One important feature of this data base is that both the training set and the testing set contain numerous examples that are ambiguous, unclassifiable, or even misclassified. 2.2 Preprocessing. Locating the zip code on the envelope and separating each digit from its neighbors, a very hard task in itself, was performed by Postal Service contractors (Wang and Srihari 1988). At this point, the size of a digit image varies but is typically around 40 by 60 pixels. A linear transformation is then applied to make the image fit in a 16 by 16 pixel image. This transformation preserves the aspect ratio of the character, and is performed after extraneous marks in the image have been removed. Because of the linear transformation, the resulting image is not binary but has multiple gray levels, since a variable number of pixels in the original image can fall into a given pixel in the target image. The gray levels of each image are scaled and translated to fall within the range -1 to 1.
3 Network Design
3.1 Input and Output. The remainder of the recognition is entirely performed by a multilayer network. All of the connections in the network are adaptive, although heavily constrained, and are trained using backpropagation. This is in contrast with earlier work (Denker et ai. 1989) where the first few layers of connections were hand-chosen constants implemented on a neural-network chip. The input of the network is a 16 by 16 normalized image. The output is composed of 10 units (one per class) and uses place coding.
3.2 Feature Maps and Weight Sharing. Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher order features. Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information. Distinctive features of an object can appear at various locations on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular
Backpropagation Applied to Handwritten Zip Code Recognition
543
Figure 1: Examples of original zip codes (top) and normalized digits from the testing set (bottom).
544
LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel
instance of a feature anywhere on the input plane. Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process. Nevertheless, approximate position information must be preserved, to allow the next levels to detect higher order, more complex features (Fukushima 1980; Mozer 1987). The detection of a particular feature at any location on the input can be easily done using the "weight sharing" technique. Weight sharing was described in Rumelhart et al. (1986) for the so-called T-C problem and consists in having several connections (links) controlled by a single parameter (weight). It can be interpreted as imposing equality constraints among the connection strengths. This technique can be implemented with very little computational overhead. Weight sharing not only greatly reduces the number of free parameters in the network but also can express information about the geometry and topology of the task. In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations. Since the exact position of the feature is not important, the feature maps need not have as many units as the input. 3.3 Network Architecture. The network is represented in Figure 2. Its architecture is a direct extension of the one proposed in LeCun (1989). The network has three hidden layers named H1, H2, and H3, respectively. Connections entering H1 and H2 are local and are heavily constrained. H1 is composed of 12 groups of 64 units arranged as 12 independent 8 by 8 feature maps. These 12 feature maps will be designated by H1.l, H1.2, . . . , H1.12. Each unit in a feature map takes input on a 5 by 5 neighborhood on the input plane. For units in layer H1 that are one unit apart, their receptive fields (in the input layer) are two pixels apart. Thus, the input image is undersampled and some position information is eliminated. A similar two-to-one undersampling occurs going from layer H1 to H2. The motivation is that high resolution may be needed to detect the presence of a feature, while its exact position need not be determined with equally high precision. It is also known that the kinds of features that are important at one place in the image are likely to be important in other places. Therefore, corresponding connections on each unit in a given feature map are constrained to have the same weights. In other words, each of the 64 units in H1.l uses the same set of 25 weights. Each unit performs the same operation on corresponding parts of the image. The function performed by a feature map can thus be interpreted as a nonlinear subsampled convolution with a 5 by 5 kernel. Of course, units in another map (say H1.4) share another set of 25 weights. Units do not share their biases (thresholds). Each unit thus has 25 input lines plus a bias. Connections extending past the boundaries of the input plane take their input from a virtual background plane whose
Backpropagation Applied to Handwritten Zip Code Recognition
545
15
n
@ v
10
5
0
training passes
Figure 2: Network architecture.
state is equal to a constant, predetermined background level, in our case -1. Thus, layer H1 comprises 768 units (8 by 8 times 121, 19,968 connections (768 times 26), but only 1068 free parameters (768 biases plus 25 times 12 feature kernels) since many connections share the same weight. Layer H2 is also composed of 12 features maps. Each feature map contains 16 units arranged in a 4 by 4 plane. As before, these feature maps will be designated as H2.1, H2.2, . . . ,H2.12. The connection scheme
546
LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel
between H1 and H2 is quite similar to the one between the input and H1, but slightly more complicated because H1 has multiple two-dimensional maps. Each unit in H2 combines local information coming from 8 of the 12 different feature maps in H1. Its receptive field is composed of eight 5 by 5 neighborhoods centered around units that are at identical positions within each of the eight maps. Thus, a unit in H2 has 200 inputs, 200 weights, and a bias. Once again, all units in a given map are constrained to have identical weight vectors. The eight maps in H1 on which a map in H2 takes its inputs are chosen according a scheme that will not be described here. Connections falling off the boundaries are treated like as in H1. To summarize, layer H2 contains 192 units (12 times 4 by 4) and there is a total of 38,592 connections between layers H1 and H2 (192 units times 201 input lines). All these connections are controlled by only 2592 free parameters (12 feature maps times 200 weights plus 192 biases). Layer H3 has 30 units, and is fully connected to H2. The number of connections between H2 and H3 is thus 5790 (30 times 192 plus 30 biases). The output layer has 10 units and is also fully connected to H3, adding another 310 weights. In summary, the network has 1256 units, 64,660 connections, and 9760 independent parameters.
4 Experimental Environment
All simulations were performed using the backpropagation simulator SN (Bottou and LeCun 1988) running on a SUN-4/260. The nonlinear function used at each node was a scaled hyperbolic tangent. Symmetric functions of that kind are believed to yield faster convergence, although the learning can be extremely slow if some weights are too small (LeCun 1987). The target values for the output units were chosen within the quasilinear range of the sigmoid. This prevents the weights from growing indefinitely and prevents the output units from operating in the flat spot of the sigmoid. The output cost function was the mean squared error. Before training, the weights were initialized with random values using a uniform distribution between -2.4/Fi and 2.4/Fi where F, is the number of inputs (fan-in) of the unit to which the connection belongs. This technique tends to keep the total inputs within the operating range of the sigmoid. During each learning experiment, the patterns were repeatedly presented in a constant order. The weights were updated according to the so-called stochastic gradient or "on-line" procedure (updating after each presentation of a single pattern) as opposed to the "true" gradient procedure (averaging over the whole training set before updating the weights). From empirical study (supported by theoretical arguments), the stochastic gradient was found to converge much faster than the true gradient,
Backpropagation Applied to Handwritten Zip Code Recognition
547
especially on large, redundant data bases. It also finds solutions that are more robust. All experiments were done using a special version of Newton’s algorithm that uses a positive, diagonal approximation of the Hessian matrix (LeCun 1987; Becker and LeCun 1988). This algorithm is not believed to bring a tremendous increase in learning speed but it converges reliably without requiring extensive adjustments of the parameters. 5 Results
After each pass through the training set, the performance was measured both on the training and on the test set. The network was trained for 23 passes through the training set (167,693 pattern presentations). After these 23 passes, the MSE averaged over the patterns and over the output units was 2.5 x on the training set and 1.8 x lop2 on the test set. The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes). As can be seen in Figure 3, the convergence is extremely quick, and shows that backpropagation can be used on fairly large tasks with reasonable training times. This is due in part to the high redundancy of real data. In a realistic application, the user usually is interested in the number of rejections necessary to reach a given level of accuracy rather than in the raw error rate. We measured the percentage of test patterns that must be rejected in order to get 1%error rate on the remaining test patterns. Our main rejection criterion was that the difference between the activity levels of the two most active units should exceed a given threshold. The percentage of rejections was then 12.1%for 1%classification error on the remaining (nonrejected) test patterns. It should be emphasized that the rejection thresholds were obtained using performance measures on the test set. Some kernels synthesized by the network can be interpreted as feature detectors remarkably similar to those found to exist in biological vision systems (Hubel and Wiesel 1962) and/or designed into previous artificial character recognizers, such as spatial derivative estimators or off-center/on-surround type feature detectors. Most misclassifications are due to erroneous segmentation of the image into individual characters. Segmentation is a very difficult problem, especially when the characters overlap extensively. Other mistakes are due to ambiguous patterns, low-resolution effects, or writing styles not present in the training set. Other networks with fewer feature maps were tried, but produced worse results. Various fully connected, unconstrained networks were also tried, but generalization performances were quite bad. For example, a fully connected network with one hidden layer of 40 units (10,690 connections total) gave the following results: 1.6% misclassification on the
548
LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel
layer H3 30 hidden units layer H2 12 x 16=192
e
_---------
10 output units
,*
hidden units
fully connected 300 links
-
00000
fully connected 6000 links
-
- 40,000
links kernels from 12 5 x 5 ~ 8
layer H1 12 x 64 = 768 hidden units H1.l -20,OO 0 links from 12 kernels 5x5 256 input units
Figure 3: Log mean squared error (MSE) (top) and raw error rate (bottom) versus number of training passes. training set, 8.1% misclassifications on the test set, and 19.4% rejections for 1%error rate on the remaining test patterns. A full comparative study will be described in another paper. 5.1 Comparison with Other Work. The first several stages of processing in our previous system (described in Denker et al. 1989) involved convolutions in which the coefficients had been laboriously hand designed. In the present system, the first two layers of the network are constrained to be convolutional, but the system automatically learns the coefficients that make up the kernels. This "constrained backpropagation" is the key to success of the present system: it not only builds in shiftinvariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters, thereby proportionately reducing the amount of training data required to achieve a given level
Backpropagation Applied to Handwritten Zip Code Recognition
549
of generalization performance (Denker et al. 1987; Baum and Haussler 1989). The present system performs slightly better than the previous system. This is remarkable considering that much less specific information about the problem was built into the network. Furthermore, the new approach seems to have more potential for improvement by designing more specialized architectures with more connections and fewer free parameters. l Waibel (1989) describes a large network (but still small compared to ours) with about 18,000 connections and 1800 free parameters, trained on a speech recognition task. Because training time was prohibitive (18 days on an Alliant mini-supercomputer), he suggested building the network from smaller, separately trained networks. We did not need such a modular construction procedure since our training times were "only" 3 days on a Sun workstation, and in any case it is not clear how to partition our problem into separately trainable subproblems. 5.2 DSP Implementation. During the recognition process, almost all the computation time is spent performing multiply accumulate operations, a task that digital signal processors (DSP) are specifically designed for. We used an off-the-shelf board that contains 256 kbytes of local memory and an AT&T DSP-32C general purpose DSP with a peak performance of 12.5 million multiply add operations per second on 32 bit floating point numbers (25 MFLOPS). The DSP operates as a coprocessor; the host is a personal computer (PC), which also contains a video acquisition board connected to a camera. The personal computer digitizes an image and binarizes it using an adaptive thresholding technique. The thresholded image is then scanned and each connected component (or segment) is isolated. Components that are too small or too large are discarded; remaining components are sent to the DSP for normalization and recognition. The PC gives a variable sized pixel map representation of a single digit to the DSP, which performs the normalization and the classification. The overall throughput of the digit recognizer including image acquisition is 10 to 12 classifications per second and is limited mainly by the normalization step. On normalized digits, the DST performs more than 30 classifications per second.
6 Conclusion
We have successfully applied backpropagation learning to a large, realworld task. Our results appear to be at the state of the art in digit recognition. Our network was trained on a low-level representation of 'A network similar to the one described here with 100,000 connections and 2600 free parameters recently achieved 9% rejection for 1% error rate. That is about 30% better than the best of the hand-coded-kernel networks.
550
LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel
data that had minimal preprocessing (as opposed to elaborate feature extraction). The network had many connections but relatively few free parameters. The network architecture and the constraints on the weights were designed to incorporate geometric knowledge about the task into the system. Because of the redundant nature of the data and because of the constraints imposed on the network, the learning time was relatively short considering the size of the training set. Scaling properties were far better than one would expect just from extrapolating results of backpropagation on smaller, artificial problems. The final network of connections and weights obtained by backpropagation learning was readily implementable on commercial digital signal processing hardware. Throughput rates, from camera to classified image, of more than 10 digits per second were obtained. This work points out the necessity of having flexible "network design'' software tools that ease the design of complex, specialized network architectures. Acknowledgments We thank the U.S. Postal Service and its contractors for providing us with the data base. The Neural Network simulator SN is the result of a collaboration between LCon-Yves Bottou and Yann LeCun. References
Baum, E. B., and Haussler, D. 1989. What size net gives valid generaliztion? Neural Comp. 1,151-160. Becker, S., and LeCun, Y. 1988. Improving the Convergence of Back-Propagation Learning With Second-Order Methods. Tech. Rep. CRG-TR-88-5, University of Toronto Connectionist Research Group. Bottou, L.-Y., and LeCun, Y. 1988. Sn: A simulator for connectionist models. In Proceedings of NeuroNimes 88, Nimes, France. Denker, J., Schwartz, D., Wittner, B., Solla, S. A., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction and generalization. Complex Syst. 1, 877-922. Denker, J. S., Gardner, W. R., Graf, H. P., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., Baird, H. s., and Guyon, I. 1989. Neural network recognizer for hand-written zip code digits. In D. Touretzky, ed., Advances in Neural Information Processing Systems, pp. 323-331. Morgan Kaufmann, San Mateo, CA. Fukushima, K. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Bid. Cybemet. 36,193-202.
Backpropagation Applied to Handwritten Zip Code Recognition
551
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. of Physiol. 160, 106-154. LeCun, Y. 1987. ModPles connexionnistes de l'apprentissage. Ph.D. thesis, Universit6 Pierre et Marie Curie, Paris, France. LeCun, Y. 1989. Generalization and network design strategies. In Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, eds. NorthHolland, Amsterdam. Mozer, M. C. 1987. Early parallel processing in reading: A connectionist approach. In Attention and Performance, XII: The Psychology of Reading, M. Coltheart, ed., Vol. XII, pp. 83-104. Erlbaum, Hillsdale, NY. Patarnello, S., and Carnevali, P. 1987. Learning networks of neurons with boolean logic. Europhys. Lett. 4(4), 503-508. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., Vol. I, pp. 318-362. Bradford Books, Cambridge, MA. Tishby, N., Levin, E., and Solla, S. A. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC. Waibel, A. 1989. Consonant recognition by modular construction of large phonemic time-delay neural networks. In Advances in Neural Information Processing Systems, D. Touretzky, ed., pp. 215-223. Morgan Kauhann, San Mateo, CA. Wang, C. H., and Srihari, S. N. 1988. A framework for object recognition in a visually complex environment and its application to locating address blocks on mail pieces. Int. I. Computer Vision 2, 125.
Received 7 July 1989; accepted 12 September 1989.
Communicated by Fernando J. Pineda
A Subgrouping Strategy that Reduces Complexity and Speeds Up Learning in Recurrent Networks David Zipser Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093 USA
An algorithm, called RTRL, for training fully recurrent neural networks has recently been studied by Williams and Zipser (1989a, b). Whereas RTRL has been shown to have great power and generality, it has the disadvantage of requiring a great deal of computation time. A technique is described here for reducing the amount of computation required by RTRL without changing the connectivity of the networks. This is accomplished by dividing the original network into subnets for the purpose of error propagation while leaving them undivided for activity propagation. An example is given of a 12-unit network that learns to be the finite-state part of a Turing machine and runs 10 times faster using the subgrouping strategy than the original algorithm.
1 Introduction
Williams and Zipser (1989a, b) have developed a gradient-following learning algorithm for fully recurrent neural networks called “real-time recurrent learning” (RTRL).RTRL is used for training networks with unrestricted connectivity that sample their inputs on every cycle and train any unit on any cycle. RTRL is particularly useful for training networks to do tasks with complex temporal properties such as finite-state machines. For example, it can train a network to be the finite-state part of a Turing machine that balances parentheses. This is a fairly complex task requiring at least four internal states and a network with 12 or more units. The algorithm can also train recurrent networks to recognize finitestate grammars that are too complex for some simpler recurrent learning procedures (Servan-Schreiber et al. 1988; Smith and Zipser 1989). A drawback to RTRL is that it requires a great deal of computation. The dominating source of this computational load is the need to update a large array of values that encode information about the past. The size of this array does not change with time but is determined only by the number of units and weights in the network. A network with n fully recurrent units and w weights requires O(wn2)computations on each learning cycle Neural Computation 1, 552-558 (1989) @ 1989 Massachusetts Institute of Technology
A Subgrouping Strategy
553
to update this large array, whereas the other computations require only O(w) operations. Note that for fully interconnected networks, ur scales as n2, so in this general case the computational requirements are O(n4). This large computational requirement does not preclude applying the algorithm to problems of considerable interest using networks of moderate size, but the factor n4 becomes overwhelming as the networks are scaled UP. Here I describe a way to use RTRL that reduces its complexity and increases its speed without changing network connectivity. This technique involves dividing the network into subnetworks for the purpose of error propagation while keeping it undivided for activity propagation. Simulation studies show that difficult tasks can still be learned while computation time is greatly reduced.
2 The Subgroup Learning Technique
To see how subgrouping speeds up computation it is useful to review what slows down the RTRL learning algorithm analyzed by Williams and Zipser (1989a). The value of n, in the factor wn’,which causes all the trouble for RTRL, refers to the number of recurrently connected units. It is possible to effectively reduce the value of n by viewing a recurrent network, for the purpose of learning, as consisting of a set of smaller recurrent networks all connected together. When a network is viewed in this way the connections within subnets are the recurrent connections for learning, and the connections between subnets are treated as inputs. Because the subnets are smaller than the original net, the value of n is reduced. The overall physical connectivity of the network remains the same, but now error propagation is confined to the subnets. This means that each subnet must have at least one unit for which a target exists. It turns out (see detailed analysis below) that subgrouping a network in this way leads to a speed-up in the computation of the large array of values representing the past by a factor of g2, where g is the number of equal size subnets into which the network is divided. If the number of subnets is increased in proportion to the total number of units, which keeps the subnet size constant, the complexity of the whole RTRL algorithm is reduced from O(uin2)to O(w). Although the original and subgrouped networks are the same with respect to their total number of units and connections, they differ in the way the learning algorithm is applied. In one case the algorithm is applied to the network as a whole; in the other it is applied to each subnetwork individually. There is no guarantee that this subgrouping procedure will solve the same problems as the original form of RTRL; however, initial empirical tests indicate that subgrouped networks can solve the same problems as the undivided networks but require much less computation time.
554
David Zipser
3 T h e Complexity of RTRL To see how subgrouping speeds up computation we must examine the computations involved in the learning algorithm, but we need not be concerned with their derivation. RTRL is applied to networks of n semilinear units with m external inputs. Let z&) represent the value of a signal, either an input or recurrent unit activation, at time t. The sets U and I of subscripts are defined so that if Z k is to be treated as an input then k E 1 and if Z ~ Cis a signal from a recurrently connected unit then k E u. A single cycle of the algorithm consists of three steps. First, the unit activations are updated. The net input to a unit is given by equation 3.1, and the output activations by equation 3.2 where f is the squashing function: (3.1) (3.2)
The second step on each cycle is to update the complete set of dz,(t)/ k E U that are the variables that encode the past. For brevity we define
dwij,
Note that there are n pfj associated with each weight. It is the requirement for large numbers of these pfj values that distinguishes RTRL and accounts for both its power and its large computational requirement. Equation 3.3 shows how the pfj are updated:
(3.3)
where Sik is the Kronecker delta and &(O) = 0. The third step is to update the weights using the p& values and the error. This step has to be done only on cycles where there is a training set. For our complexity analysis the worst case of a target on each cycle is assumed. Equation 3.4 shows how the pFj values are used to update the weights: (3.4)
A Subgrouping Strategy
555
where T is the set of subscripts of units with targets, given by ek(t)
= turget,(t)
- .zk(t),
ek
is the "error"
k ET
and cy is the learning rate. The amount of computation needed to carry out each of the three steps for a fully recurrent undivided network is
1. Activities: klw 2. Weights: k3rw
+ k2n operations, + k4w operations,
3. p values: k5wn2 + kbwn operations, where the ki are implementation-dependent constants, w is the number of weights, n is the number of units, and r is the number of units with targets. Since w = n2 + nm for a completely recurrent network with m inputs going to all the units, updating the p values is O(n4), which dominates the computation time for networks with more than just a few units. 4 Subgrouping Networks
A fully recurrent network with n units and m inputs can be divided into g fully recurrent subnetworks each with n / g units (throughout this discussion we will assume that g is a factor of n). Each of these subnetworks needs to have at least one target, but the way the targets are distributed among the subnetworks is not germane at this point. If none of the connections in the original undivided network is changed, then each of the units in a subnetwork will receive as input, in addition to the original m inputs, the activities of the n - n/g units in the other subgroups. The effect of subgrouping is to reduce the size of the set U of recurrently connected units from n to n/g. This means that the number of pfJ values per weight will be reduced from n to n/g and the number of operations to update all the p:, will be reduced from k5wn2 + kbwn to ksw(n2/g2) + k6w(n/g). The total number of weights in the complete network has not changed, so the computation time for the pfJ is now O(wn2/g2).For example, dividing a network into two subnets gives a g of 2 and a 4-fold speed-up in computing the p:. In the particular case where g is increased in proportion to n, which keeps the sue of the subnets constant, n2/g2is a constant and the complexity is reduced to O(w), the same as for the other steps in the RTRL algorithm. 5 The Performance of Subgrouped Networks The speed gained from subgrouping is of little interest unless the subgrouped networks can learn the same tasks as the undivided networks.
556
David Zipser
To do this, the subnets will often have to cooperate to do computations that cannot be done by the individual subnets alone. To demonstrate cooperation, a fully connected subdivided network was taught XOR. We have previously shown that a fully recurrent network of three units can be taught by RTRL to organize itself into a feedforward network to do XOR continually. In the test for cooperation used here, a fully connected network with four units, each receiving the two inputs, was treated as two subnets of two units each. The single target needed for XOR was replicated so that one unit in each subnet received the XOR target. This left only one unit in each subnet to serve as a hidden unit. The subnets must cooperate because two hidden units are required to do XOR continually on every cycle. The divided network learned XOR in about the same number of learning trials as an undivided network. The weights on the two output units were nearly identical, whereas the hidden unit in one subnetwork became an OR and the hidden unit in the other became an AND. This cooperative solution was possible, even though error propagation was confined to the subnets, because all units had access to the activities of the others through recurrent connections. All the weights on recurrent connections not required for the feedforward computation of XOR eventually went to very low values. Although this demonstrates the ability of subnets to cooperate, XOR is too simple a problem to provide a meaningful test for the power of subgrouping. As a more challenging test, a subgrouped network was trained to emulate the finite-state machine part of a Turing machine for balancing parentheses, a task that had previously been shown to be learnable by RTRL. For this task the network receives as input the same tape mark that the Turing machine "sees," and is trained to produce the same outputs as the Turing machine for each cell of the tape that it visits. There are four output lines in the version of the problem used (Williams and Zipser 1989b). They code for the direction of movement, the character to be written on the tape, and whether a balanced or unbalanced final state has been reached. It had previously been shown that a fully recurrent network with 12 units was the smallest network that learned the Turing machine task. A 12-unit fully connected network divided for learning into four subnets of three units each, with one of the four outputs learned by each subnet, was used to test subgrouping on this task. The RTRL algorithm can be run either with or without teacher forcing. In the teacher-forcing mode, the outputs of all units with targets are set to the previous target value before the start of each cycle. The unsubdivided network learned the task with or without teacher forcing about 50% of the time after seeing fewer than 100,000 cells of the Turing machine tape. The subdivided network also learned the task about 50% of the time in fewer than 100,000 Turing machine cycles, but only in the teacher-forcing mode. It never learned the task without teacher forcing. The subdivided network ran 9.8 times
faster than the undivided network. A 16-unit network divided as the 12-unit net described above, but into subnets of four units each, was able to learn the task with or without teacher forcing on about 50% of the trials. This network ran 11 times faster than an undivided 16-unit net. However, the undivided 16-unit net learned the task on every trial.
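The teacher-forcing mode used in these experiments can be sketched as follows; this is a hypothetical fragment of mine, not the author's code, and the names (`step`, `prev_targets`) are illustrative only:

```python
import numpy as np

# One forward step of a fully recurrent net with optional teacher forcing,
# as described above: before each cycle, the outputs of all units that have
# targets are set to the previous target value.
def step(y, x, W, prev_targets=None, teacher_forcing=False):
    """y: unit activations, x: external inputs, W: weights over [y; x].
    prev_targets: dict mapping unit index -> previous target value."""
    if teacher_forcing and prev_targets:
        y = y.copy()
        for k, t in prev_targets.items():
            y[k] = t                       # clamp target units
    z = np.concatenate([y, x])             # every unit sees all activities
    return 1.0 / (1.0 + np.exp(-(W @ z)))  # logistic units, as in RTRL
```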
6 Discussion

The subgrouping technique greatly speeds up computation with RTRL without destroying its ability to solve difficult problems. There is some loss of power: the 12-unit subgrouped network could only learn the Turing machine problem in the teacher-forcing mode, and the subgrouped 16-unit network learned without teacher forcing, but only on half the trials. The performance on this task has not been optimized, so we cannot yet tell if the subdivided nets need more learning cycles. However, it is clear from preliminary studies that any increase in the number of learning cycles is small compared to the 10-fold increase in computation speed.

It is important to know when and under what conditions the subgrouping strategy will actually converge to a solution. Unfortunately there is no convergence proof, or even a good mathematical understanding of why the subgrouping strategy works at all. Some heuristic insight into this issue can be gained by considering the XOR problem. When a feedforward net learns XOR, it very quickly reduces error by finding the correct outputs for three of the four inputs. Getting three correct is a linear problem that can be solved with a single unit, so the subgroups with one hidden unit can also get three correct outputs. There are many different ways this can be done. If we assume it is the hidden units that find the solutions, then in one quarter of these cases the whole XOR can be computed by the output units as a linear combination of the information they get from both hidden units. This is a pathway to a solution that requires no significant interaction between the subnets. Because this pathway is rapid, interaction would not be expected to cause much trouble. The actual behavior is better than this, however, because a solution is found much more often than one quarter of the time. It is interesting that sometimes solutions are found rather fast, but often they take quite a while. This suggests that interaction can help but is time consuming.

For problems that depend heavily on recurrent connections, the learning trajectories of the subgroup and normal versions are about the same in the cases for which the subgrouping algorithm finds a solution. In the cases where no solution is found, the learning is similar initially as the nets find the correct outputs for frequent events. However, the networks that fail never learn the correct responses for rare events. This suggests that the perturbations introduced by subgrouping make the ability of the nets to deal with information from the distant past more sensitive to the initial weight values. Ultimately a much more detailed analysis of the
learning trajectories will be required to get a full understanding of how the subgroups interact to learn solutions that require cooperation.

Every subnet needs at least one unit with a training target. There are generally many ways to distribute the targets among the subnets. In the XOR problem there is only one target value, which is replicated and used to train all the subnets. There are four target values in the Turing machine task; in the example described here, one of these was used to train each subnet, but there are many other ways to distribute the targets. This raises the interesting question of whether there are "good" ways to divide targets among subnets that would lead to particularly efficient learning.
References

Servan-Schreiber, D., Cleeremans, A., and McClelland, J. L. 1988. Encoding Sequential Structure in Simple Recurrent Networks. Tech. Rep. CMU-CS-88-183, Carnegie-Mellon University, Pittsburgh, PA.

Smith, A. W., and Zipser, D. 1989. Encoding sequential structure: Experience with the real-time recurrent learning algorithm. Proc. IJCNN 1, 645-648.

Williams, R. J., and Zipser, D. 1989a. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.

Williams, R. J., and Zipser, D. 1989b. Experimental analysis of the real-time recurrent learning algorithm. Connection Sci. 1, 87-111.
Received 6 July 1989; accepted 15 September 1989.
Communicated by David Touretzky
Unification as Constraint Satisfaction in Structured Connectionist Networks Andreas Stolcke Computer Science Division, University of California, Berkeley, CA 94720 USA, and International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704 USA
Unification is a basic concept in several traditional symbolic formalisms that should be well suited for a connectionist implementation due to the intuitive nature of the notions it formalizes. It is shown that by approaching unification from a graph matching and constraint satisfaction perspective a natural and efficient realization in a structured connectionist network can be found.

1 Introduction

Unification is a special matching operation on recursive symbolic structures widely used - in a number of variants - in fields related to symbolic artificial intelligence, most prominently theorem proving (Martelli and Montanari 1982) and computational linguistics (Shieber 1986). In the connectionist literature unification is addressed in the context of resolution theorem proving (Ballard 1986), although considering only a simple special case. Investigating the possibilities for a connectionist approach to unification seems worthwhile for at least two reasons: the importance and ubiquity of the concept in traditional formalisms, and the fact that unification incorporates notions that, at an informal level, seem to be essential to human cognition. These include integrating and merging of information into a consistent whole, checking for compatibility, and pattern matching. Since connectionist models are generally assumed to be well suited for problems involving these tasks, unification is a potentially useful concept in the realm of neurally-inspired processing.

2 Feature Structures and Unification
Unification is usually defined in a strictly technical sense, namely referring to a specific operation in certain types of algebras. The variant
considered here operates on recursive sets of attribute-value pairs known as feature structures (f-structures). An f-structure is either an atomic label like, for example, square, or a set of features with f-structures as their values. Complex f-structures are conventionally represented as feature matrices such as
\[
\begin{bmatrix}
\text{shape} : \text{square} \\
\text{length} : \begin{bmatrix} \text{value} : 5 \\ \text{unit} : \text{inch} \end{bmatrix}{*} \\
\text{width} : *
\end{bmatrix}
\tag{2.1}
\]
The asterisk marks a feature value that is shared among more than one feature. These possible reentrancies in f-structures suggest an alternative representation that maps every structure into a rooted directed labeled graph. The graph corresponding to structure 2.1 is depicted in Figure 1a.
Figure 1: (a) Labeled graph representing structure 2.1. Features map into directed edges, atomic values translate into terminal nodes. Internal nodes correspond to substructures and are numbered for reference. (b) Three f-structures whose unification results in the structure given in (a).
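For concreteness, structure 2.1 and its reentrancy can be mimicked with nested Python dicts, sharing one object between two features (an illustrative encoding of mine, not part of the model):

```python
# Structure 2.1 as nested dicts; reentrancy = aliasing of one shared object,
# playing the role of the asterisk in the feature-matrix notation.
shared = {"value": 5, "unit": "inch"}
s1 = {"shape": "square", "length": shared, "width": shared}
assert s1["length"] is s1["width"]   # one node, reached by two edges
```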
Unifying a set of f-structures intuitively means merging all the features into a single structure, preserving reentrancies. Using $\sqcup$ as the unification operator, structure 2.1 may be obtained as
\[
\begin{bmatrix} \text{length} : [\, \text{value} : 5 \,] \end{bmatrix}
\;\sqcup\;
\begin{bmatrix} \text{width} : [\, \text{unit} : \text{inch} \,] \end{bmatrix}
\;\sqcup\;
\begin{bmatrix} \text{shape} : \text{square} \\ \text{length} : [\;]{*} \\ \text{width} : * \end{bmatrix}
=
\begin{bmatrix} \text{shape} : \text{square} \\ \text{length} : \begin{bmatrix} \text{value} : 5 \\ \text{unit} : \text{inch} \end{bmatrix}{*} \\ \text{width} : * \end{bmatrix}
\tag{2.2}
\]
Hence unification can be thought of as taking the union of features at each level and unifying values of identical features recursively. This process bottoms out at atomic values, where feature values have to match exactly. Consequently unification is said to fail (or result in $\emptyset$) if atomic values mismatch, as in

\[
\begin{bmatrix} \text{shape} : \text{square} \\ \text{color} : \text{red} \end{bmatrix}
\;\sqcup\;
[\, \text{color} : \text{blue} \,]
= \emptyset
\tag{2.3}
\]
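This recursive reading of unification can be sketched directly on the dict encoding above. The toy version below is my own and ignores reentrancy bookkeeping, which the graph-based treatment in the next section is precisely designed to handle:

```python
FAIL = object()   # stands in for the failure value in (2.3)

def unify(a, b):
    """Toy recursive unification of two f-structures (dicts or atoms)."""
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else FAIL          # atoms must match exactly
    result = dict(a)                          # union of features at this level
    for f, v in b.items():
        if f in result:
            result[f] = unify(result[f], v)   # shared features unify recursively
            if result[f] is FAIL:
                return FAIL
        else:
            result[f] = v
    return result

# unify({"shape": "square", "color": "red"}, {"color": "blue"}) is FAIL, as in (2.3)
```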
It should be evident from this short discussion that unification incorporates the three intuitive concepts alluded to above: data from several sources are merged into a single structure, checking for compatibility in the process. Alternatively, each of the structures being unified can be viewed as a recursive pattern against which all the other structures are matched.
3 Unification as Graph Matching
Consider the graphs involved in unification 2.2, shown in Figure 1b. There is a mapping from nodes in the operand structures onto nodes in the unified structure such that edges are consistent among both. This suggests that unification can be viewed as a specialized form of graph matching and given a connectionist treatment as a discrete constraint satisfaction problem (the same general approach has recently been used by Mjolsness et al. 1989). This idea can be made precise by observing that the mapping from operand nodes to result nodes defines an equivalence relation (partitioning). Thus, in Figure 1, the partitions are $\{s_3, s_5, s_7\}$, $\{s_4, s_6, s_8\}$, $\{\text{square}\}$, $\{5\}$, and $\{\text{inch}\}$, corresponding to nodes $s_1$, $s_2$, square, 5, and inch, respectively, in the unified structure.

[The more familiar variant of term unification used in logic-based formalisms can be shown to be a special case of the version presented here (for details see, e.g., Stolcke 1989a). Also, the model described here translates to term unification in a straightforward way.]
Equivalence relations thus induced by unifications satisfy certain conditions:

(V1) For any pair of edges with identical labels (features) $x.f = x'$, $y.f = y'$: $x \sim y$ only if $x' \sim y'$.

(V2) For any pair of atomic nodes (values) $u$, $v$: $u \sim v$ only if $u = v$.

(V3) For any atomic node $u$ and nonatomic node $y$: $u \sim y$ only if $y$ has no outgoing edges $y.f$.

[Here $\sim$ denotes the equivalence relation and $x.f$ refers to the value of feature $f$ on node (structure) $x$.]

We will call equivalence relations satisfying these constraints valid, in accordance with Paterson and Wegman (1976), who used the concept to devise a fast sequential algorithm for unification. It can be shown that the unification of a set of f-structures is isomorphic to the partitions obtained from the finest valid equivalence that makes the operand structures' root nodes equivalent. Such an equivalence is guaranteed to exist if the structures are unifiable. Hence we can redefine unification entirely in terms of a set of constraints on binary relations between graph nodes.
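A direct transcription of (V1)-(V3) into a checking procedure might look like the following sketch. The data layout and names are my assumptions: the graph is a map from (node, feature) to value node, atoms have no outgoing edges, and `cls` assigns each node its equivalence class.

```python
def is_valid(edges, atoms, cls):
    """edges: dict (node, feature) -> value node; atoms: set of atomic nodes;
    cls: dict node -> equivalence-class id. True iff (V1)-(V3) all hold."""
    # (V1): if x ~ y and both carry feature f, their f-values must be equivalent
    for (x, f), xv in edges.items():
        for (y, g), yv in edges.items():
            if f == g and cls[x] == cls[y] and cls[xv] != cls[yv]:
                return False
    # (V2): distinct atomic values must not be equivalent
    for u in atoms:
        for v in atoms:
            if cls[u] == cls[v] and u != v:
                return False
    # (V3): an atom must not be equivalent to any node with outgoing edges
    for u in atoms:
        for (x, _f) in edges:
            if cls[u] == cls[x]:
                return False
    return True
```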
4 The Connectionist Implementation

4.1 Representation. F-structures can be seen as an abstraction of several frame-like data structures commonly used for knowledge representation, some of which have been addressed in the connectionist literature using various distributed and local representations (Touretzky 1987; Shastri 1988). Our implementation uses a localist scheme for representing both f-structures and node equivalences. All units in the model are binary threshold units choosing their activations asynchronously between 0 and 1.

F-structures are represented simply as their constituting set of edges. Assuming we draw nodes from some pool $N$ and given a feature set $F$, we obtain a pool $E \subseteq N \times F \times N$ of possible edges, each of which is assigned an e-unit. The e-unit $(x.f = y)$ is active precisely when the corresponding edge is present. In many cases the semantics of a specific application rule out all but some small subset of $N \times F \times N$, or predetermine the presence or absence of many of the edges, thus avoiding the full combinatorics of the representation. This is illustrated in Stolcke (1989b), where the model is applied to the processing of unification-based grammars.

Node equivalences - or, loosely speaking, node unifications - in $N \times N$ are represented by u-units. The u-unit $(x \sim y)$ is active iff $x$ and $y$ are considered equivalent (unified). The encoding of constraints requires that nonequivalence be explicitly represented as well, hence the nu-unit $(x \not\sim y)$ is active whenever $x$ and $y$ are nonunifiable.
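In code, the localist pools could be laid out as below; this is a sketch under the assumption of a small closed node pool, and real applications would prune most of $N \times F \times N$, as the text notes:

```python
from itertools import product

# Illustrative unit pools for a tiny node/feature inventory (names are mine).
N = ["s1", "s2", "s3", "square", "5", "inch"]      # node pool
F = ["shape", "length", "width", "value", "unit"]  # feature set

e_units  = {(x, f, y): 0 for x, f, y in product(N, F, N)}  # edge present?
u_units  = {(x, y): 0 for x in N for y in N if x < y}      # x ~ y attempted?
nu_units = {(x, y): 0 for x in N for y in N if x < y}      # x, y nonunifiable?
```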
4.2 Enforcing Validity. Figure 2a shows how the first validity constraint (V1) is enforced by interactions between e-units, u-units, nu-units, and auxiliary units implementing conjunctions. The global result is that node unifications propagate top-down from the roots of the structures
Figure 2: Link structure enforcing validity of node equivalences. (a) Auxiliary units A and B implement constraint (V1). They operate conjunctively, that is, become active only when receiving activation from all three of their inputs. E-units with matching features $f$ effectively enable paths allowing equivalences and nonequivalences to propagate top-down and bottom-up, respectively. Unification may fail, however; therefore nu-units are allowed to suppress corresponding u-units via strong inhibitory nu-links. (b) Constraint (V3) requires nonequivalence of a node $x$ with any atomic nodes $u, v, \ldots$ if $x$ has any outgoing edges. This is accomplished by auxiliary unit C, which behaves disjunctively.
toward the leaves, while nonunifiability is determined bottom-up, starting at incompatible atomic values. Initial activation for nu-units is provided as a consequence of constraint (V2), which requires nu-units of the form $(u \not\sim v)$, where $u$ and $v$ are nonequal atomic values, to be clamped on. The network structure in Figure 2b implements constraint (V3) and causes additional nu-units to become active.

To attempt unification of f-structures rooted in $x$ and $y$, the u-unit $(x \sim y)$ is externally activated. This is the only source of initial activation for u-units and also allows the network to return its result: if a unification exists the network will settle into a state where $(x \sim y)$ and all other relevant u-units are turned on; otherwise $(x \sim y)$ will be deactivated by $(x \not\sim y)$. This happens either as a result of some stable state or as part of an oscillation in the network.
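The settling process just described amounts to asynchronous relaxation of binary threshold units. A generic sketch of mine (the exact network parameters are specified in Stolcke 1989a, not here):

```python
import random

def settle(units, max_sweeps=100):
    """units: objects with .net_input(), .threshold, .active (0/1).
    Update asynchronously until no unit changes, or give up."""
    for _ in range(max_sweeps):
        changed = False
        for unit in random.sample(units, len(units)):
            new = 1 if unit.net_input() > unit.threshold else 0
            if new != unit.active:
                unit.active, changed = new, True
        if not changed:
            return True   # stable state: read the result off the u-units
    return False          # no stable state found; possible oscillation
```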
4.3 Enforcing Consistency. The correctness of the solution found by the network relies on the fact that all its stable states represent valid equivalence relations. Hence consistency of the representation with regard to equivalence needs to be enforced. Reflexivity and symmetry can be made intrinsic to the representation by simply omitting reflexive u/nu-units and merging corresponding symmetric units. Transitivity is enforced by a dedicated link structure shown in Figure 3. Consistency between u-units and nu-units is guaranteed by inhibitory links (shown in Figures 2a and 3) that allow nu-units to suppress their counterparts unconditionally. This is justified by the observation that nu-units denote nonunifiability whereas u-units merely represent attempts at unification. An exact specification of all network parameters, as well as a number of case studies of the network's dynamics, can be found in Stolcke (1989a).
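The effect of the transitivity link structure of Figure 3 can be approximated procedurally in a sketch like this (mine, not the paper's; the network realizes these implications with dedicated conjunctive units rather than a loop, and this shows only one sweep rather than a full fixpoint):

```python
def key(x, y):
    return (x, y) if x <= y else (y, x)   # symmetric units merged

def propagate_transitivity(u, nu, nodes):
    """u, nu: dicts over canonical node pairs -> 0/1. Apply, for all x, y, z:
    x ~ y and y ~ z => x ~ z, and x !~ y and y ~ z => x !~ z;
    then let nu-units suppress their u-unit counterparts unconditionally."""
    for x in nodes:
        for y in nodes:
            for z in nodes:
                if x == z:
                    continue              # reflexive units are omitted
                if u.get(key(x, y)) and u.get(key(y, z)):
                    u[key(x, z)] = 1
                if nu.get(key(x, y)) and u.get(key(y, z)):
                    nu[key(x, z)] = 1
    for p, active in list(nu.items()):
        if active:
            u[p] = 0                      # inhibitory nu-links win
    return u, nu
```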
5 Discussion

The foremost characteristic of the connectionist implementation is that it naturally exploits opportunities for parallel processing of substructures. Therefore the network will arrive at a solution (or negative result) in time proportional to the depth of the f-structures processed if the structures are essentially tree-like, giving time complexity of the order of the log of the structure sizes. [Known serial algorithms are linear in the structure size (Paterson and Wegman 1976). Note, however, that in the worst case reentrancies can prevent effective parallelization, causing the network to degenerate into sequential behavior (cf. Dwork et al. 1984).]

The number of units and links required for this speedup is quadratic in the number of edges and cubic in the number of nodes. Typical applications, however, often include a fair number of fixed edges. These, together with nu-units that have to be clamped on due to constraints (V2)
Figure 3: Network structure guaranteeing transitivity of node equivalences. Units A through E implement the implications $x \sim y \wedge y \sim z \Rightarrow x \sim z$ and $x \not\sim y \wedge y \sim z \Rightarrow x \not\sim z$. Omitted here are additional inhibitory links that render A, B, C and D, E mutually exclusive. These links are necessary to prevent stable coalitions of u-units and nu-units.
and (V3) imply that a certain portion of units has constant activations. Such units can be eliminated by a straightforward optimization.

The network places no limit on the number of separate f-structures being unified at once (other than by the total number of nodes and edges). When trying to unify sets of more than two f-structures, cases arise where the complete set has no unifier, but overlapping subsets can still take part in partial unifications. The asynchronous operation of the network would randomly choose between alternative pairings of f-structures in this case. By adding controlled noise, the network might be used to search stochastically through a space of mutually exclusive unifications.

Conventional algorithms usually do not allow f-structures to be cyclic, although unification is well defined on such structures (Colmerauer 1982). Incidentally, our model naturally represents and processes cyclic structures.

It is encouraging (and maybe surprising) that a formalism prototypical for highly structured, symbolic processing lends itself to a relatively straightforward connectionist implementation, namely when approached from the point of view of constraint satisfaction. One basic deficiency of the model in its present form, however, is inherited from the underlying formalism it implements.
Unification as traditionally defined does not distinguish between different degrees of unifiability (or matching), although this might seem natural given its intuitive interpretation. Therefore it remains to be seen if unification can be usefully generalized into a graded notion of structured matching. Related to this issue, and a subject of ongoing research, is the question of how unification can deal with distributed and/or nondiscrete forms of connectionist encoding of compositional structures, such as coarse coding (Touretzky 1986) and recursive autoassociation (Pollack 1988).
Acknowledgments

I wish to thank Steffen Hölldobler and Jerome Feldman for their helpful comments and the International Computer Science Institute for general support. The author is currently an IBM Graduate Fellow.
References

Ballard, D. H. 1986. Parallel logical inference and energy minimization. In Proceedings of the 5th National Conference on Artificial Intelligence, pp. 203-208. Philadelphia, PA.

Colmerauer, A. 1982. Prolog and infinite trees. In Logic Programming, K. L. Clark and S.-A. Tärnlund, eds., pp. 231-251. Academic Press, New York.

Dwork, C., Kanellakis, P. C., and Mitchell, J. C. 1984. On the sequential nature of unification. Journal of Logic Programming 1, 35-50.

Martelli, A., and Montanari, U. 1982. An efficient unification algorithm. ACM Trans. Program. Lang. Syst. 4, 258-282.

Mjolsness, E., Gindi, G., and Anandan, P. 1989. Optimization in model matching and perceptual organization. Neural Comp. 1(2), 218-229.

Paterson, M. S., and Wegman, M. N. 1976. Linear Unification. Report RC 5904 (#25518), IBM Thomas J. Watson Research Center, Yorktown Heights, NY.

Pollack, J. 1988. Recursive auto-associative memory: Devising compositional distributed representations. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pp. 33-39. Montreal, Quebec, Canada.

Shastri, L. 1988. A connectionist approach to knowledge representation and limited inference. Cog. Sci. 12, 331-392.

Shieber, S. M. 1986. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Note Series. Center for the Study of Language and Information, Stanford, CA.

Stolcke, A. 1989a. A Connectionist Model of Unification. Tech. Rep. TR-89-432, International Computer Science Institute, Berkeley, CA.

Stolcke, A. 1989b. Processing unification-based grammars in a connectionist framework. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pp. 908-915. University of Michigan, Ann Arbor, MI.
Touretzky, D. S. 1986. BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. In Proceedings of the 8th Annual Conference of the Cognitive Science Society, pp. 522-530. Amherst, MA.

Touretzky, D. S. 1987. Representing conceptual structures in a neural network. In Proceedings of the IEEE 1st International Conference on Neural Networks, M. Caudill and C. Butler, eds., pp. II-279-II-286. San Diego, CA.
Received 6 July 1989; accepted 11 October 1989.
Index
Volume 1, By Author

Abu-Mostafa, Y. S. The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning (Review) 1(3):312-317
Ahmad, S. - See Tesauro, G.
Ahumada, A. J. - See Maloney, L. T.
Altman, J. S. and Kien, J. New Models for Motor Control (View) 1(2):173-183
Anandan, P. - See Mjolsness, E.
Anastasio, T. J. and Robinson, D. A. Distributed Parallel Processing in the Vestibulo-Oculomotor System (Letter) 1(2):230-241
Andersen, R. A. - See Husain, M.
Andreou, A. G. and Boahen, K. A. Synthetic Neural Circuits Using Current-Domain Signal Representations (Letter) 1(4):489-501
Barlow, H. B. Unsupervised Learning (Review) 1(3):295-311
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. Finding Minimum Entropy Codes (Letter) 1(3):412-423
Baum, E. B. A Proposal for More Powerful Learning Algorithms (View) 1(2):201-207
Baum, E. B. and Haussler, D. What Size Net Gives Valid Generalization? (Letter) 1(1):151-160
Blum, E. K. Approximation of Boolean Functions by Sigmoidal Networks: Part I: XOR and Other Two-Variable Functions (Letter) 1(4):532-540
Boahen, K. A. - See Andreou, A. G.
Boser, B. - See LeCun, Y.
Brooks, R. A. A Robot that Walks: Emergent Behaviors from a Carefully Evolved Network (Letter) 1(2):253-262
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. Finite State Automata and Simple Recurrent Networks (Letter) 1(3):372-381
Darken, C. J. - See Moody, J.
Denker, J. S. - See LeCun, Y.
Dobbins, A. - See Zucker, S. W.
Damasio, A. R. The Brain Binds Entities and Events by Multiregional Activation from Convergence Zones (Letter) 1(1):123-132
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. A Canonical Microcircuit for Neocortex (Letter) 1(4):480-488
Durbin, R. and Rumelhart, D. E. Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks (Letter) 1(1):133-142
Durbin, R., Szeliski, R., and Yuille, A. An Analysis of the Elastic Net Approach to the Traveling Salesman Problem (Letter) 1(3):348-358
Gelenbe, E. Random Neural Networks with Negative and Positive Signals and Product Form Solution (Letter) 1(4):502-510
Gindi, G. - See Mjolsness, E.
Girosi, F. and Poggio, T. Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant (Note) 1(4):465-469
Grzywacz, N. M. - See Yuille, A. L.
Haussler, D. - See Baum, E. B.
He, Y. - See Tesauro, G.
Henderson, D. - See LeCun, Y.
Hinton, G. E. Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space (Letter) 1(1):143-150
Howard, R. E. - See LeCun, Y.
Husain, M., Treue, S., and Andersen, R. A. Surface Interpolation in Three-Dimensional Structure-from-Motion Perception (Letter) 1(3):324-333
Hubbard, W. - See LeCun, Y.
Iverson, L. - See Zucker, S. W.
Jackel, L. D. - See LeCun, Y.
Kaushal, T. P. - See Barlow, H. B.
Kien, J. - See Altman, J. S.
Koch, C. Seeing Chips: Analog VLSI Circuits for Computer Vision (View) 1(2):184-200
Koch, C. and Suarez, H. - See Suarez, H.
Koch, C., Wang, H. T., and Mathur, B. - See Wang, H. T.
Krauzlis, R. J. and Lisberger, S. G. A Control Systems Model of Smooth Pursuit Eye Movements with Realistic Emergent Properties (Letter) 1(1):116-122
Lazzaro, J. and Mead, C. A. A Silicon Model of Auditory Localization (Letter) 1(1):47-57
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation Applied to Handwritten Zip Code Recognition (Letter) 1(4):541-551
Lee, H. Discrete Synchronous Neural Algorithm for Minimization (Letter) 1(4):522-531
Linsker, R. How to Generate Ordered Maps by Maximizing the Mutual Information between Input and Output Signals (Letter) 1(3):402-411
Lippmann, R. P. Review of Neural Networks for Speech Recognition (Review) 1(1):1-38
Lisberger, S. G. - See Krauzlis, R. J.
Maloney, L. T. and Ahumada, A. J. Learning by Assertion: Two Methods for Calibrating a Linear Visual System (Letter) 1(3):392-401
Martin, K. A. C. - See Douglas, R. J.
Mathur, B. - See Wang, H. T.
McClelland, J. L. - See Cleeremans, A.
Mead, C. A. - See Lazzaro, J.
Miall, C. The Storage of Time Intervals Using Oscillating Neurons (Letter) 1(3):359-371
Mitchison, G. J. - See Barlow, H. B.
Mjolsness, E., Gindi, G., and Anandan, P. Optimization in Model Matching and Perceptual Organization (Letter) 1(2):218-229
Moody, J. and Darken, C. J. Fast Learning in Networks of Locally-Tuned Processing Units (Letter) 1(2):281-294
Pearlmutter, B. A. Learning State Space Trajectories in Recurrent Neural Networks (Letter) 1(2):263-269
Pentland, A. Part Segmentation for Object Recognition (Letter) 1(1):82-91
Pentland, A. A Possible Neural Mechanism for Computing Shape From Shading (Letter) 1(2):208-217
Pineda, F. J. Recurrent Backpropagation and the Dynamical Approach to Adaptive Neural Computation (View) 1(2):161-172
Poggio, T. - See Girosi, F.
Reklaitis, G. V. - See Tsirukis, A. G.
Robinson, D. A. - See Anastasio, T. J.
Rojer, A. and Schwartz, E. A Multiple-Map Model for Pattern Classification (Letter) 1(1):104-115
Rumelhart, D. E. - See Durbin, R.
Schwartz, E. - See Rojer, A.
Servan-Schreiber, D. - See Cleeremans, A.
Shadmehr, R. A Neural Model for Generation of Some Behaviors in the Fictive Scratch Reflex (Letter) 1(2):242-252
Sontag, E. D. Sigmoids Distinguish More Efficiently Than Heavisides (Note) 1(4):470-472
Standley, D. L. - See Wyatt, J. L., Jr.
Stevens, C. F. How Cortical Interconnectedness Varies with Network Size (Letter) 1(4):473-479
Stolcke, A. Unification as Constraint Satisfaction in Structured Connectionist Networks (Letter) 1(4):559-567
Suarez, H. and Koch, C. Linking Linear Threshold Units with Quadratic Models of Motion Perception (Note) 1(3):318-320
Szeliski, R. - See Durbin, R.
Tenorio, M. F. - See Tsirukis, A. G.
Tesauro, G. Neurogammon Wins Computer Olympiad (Note) 1(3):321-323
Tesauro, G., He, Y., and Ahmad, S. Asymptotic Convergence of Backpropagation (Letter) 1(3):382-391
Treue, S. - See Husain, M.
Tsirukis, A. G., Reklaitis, G. V., and Tenorio, M. F. Nonlinear Optimization Using Generalized Hopfield Networks (Letter) 1(4):511-521
Wang, H. T., Mathur, B., and Koch, C. Computing Optical Flow in the Primate Visual System (Letter) 1(1):92-103
Waibel, A. Modular Construction of Time-Delay Neural Networks for Speech Recognition (Letter) 1(1):39-46
White, H. Learning in Artificial Neural Networks: A Statistical Perspective (Review) 1(4):425-464
Whitteridge, D. - See Douglas, R. J.
Williams, R. J. and Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks (Letter) 1(2):270-280
Wyatt, J. L., Jr. and Standley, D. L. Criteria for Robust Stability in a Class of Lateral Inhibition Networks Coupled Through Resistive Grids (Letter) 1(1):58-67
Yuille, A. L. and Grzywacz, N. M. A Winner-Take-All Mechanism Based on Presynaptic Inhibition Feedback (Letter) 1(3):334-347
Yuille, A. L., Durbin, R., and Szeliski, R. - See Durbin, R.
Zipser, D. A Subgrouping Strategy that Reduces Complexity and Speeds Up Learning in Recurrent Networks (Letter) 1(4):552-558
Zipser, D. and Williams, R. J. - See Williams, R. J.
Zucker, S. W., Dobbins, A., and Iverson, L. Two Stages of Curve Detection Suggest Two Styles of Visual Computation (Letter) 1(1):68-81